Companion to a Stage-1 registered report · a living benchmark

Companies are replacing human survey panels with language models. Do those models choose like people?

People do not choose products in a vacuum: the same option looks better or worse depending on what sits next to it on the shelf. These pulls are called context effects, and they are the most thoroughly documented way human choice departs from simple rationality. A synthetic respondent (a language model prompted to answer a survey as if it were a person) has to reproduce them to stand in for a person. This study measures seven context effects in a large pre-registered human sample and in a battery of language models, on identical consumer choices, and maps where each model's behavior tracks the human benchmark.

Shown, not told

What a context effect is

Take two cameras that people like about equally. Add a third that is plainly worse than one of them: slightly pricier, with a slightly worse image. Almost nobody buys the decoy, but choices move anyway: the option that beats it outright starts to look like the smart pick. Adding an option that should be irrelevant changes which original option wins.

Two options: a calibrated, evenly split baseline

Camera A (mid-price · very good image): 50%
Camera B (premium · best image): 50%

Add a decoy that Camera A beats outright: slightly pricier than A, slightly worse image

Camera A (the target; it beats the decoy): 61%
Camera B: 33%
Decoy (worse than A on both): 6%

Illustrative shares, not data — this is the attraction effect with a frequency decoy, one of the seven effects the study tests. The study's real baselines are calibrated so the two starting options are close to evenly preferred, because an effect has no room to appear when one option already dominates.
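To make the arithmetic of the illustration concrete, here is a minimal sketch in Python. It uses only the illustrative shares above, not data. The two summary measures it computes (the raw shift in the target's share, and the shift in the target's share among the two original options) are assumptions chosen for illustration; the study's registered effect-size definition may differ.

```python
def attraction_shift(baseline, with_decoy, target="A", competitor="B"):
    """Return two illustrative effect measures, both in percentage points.

    raw_shift: change in the target's overall share once the decoy is added.
    relative_shift: change in the target's share counted among target and
    competitor only, which sets aside the choices the decoy itself absorbs.
    Both definitions are assumptions for illustration, not the protocol's.
    """
    def relative(shares):
        return 100.0 * shares[target] / (shares[target] + shares[competitor])

    raw_shift = with_decoy[target] - baseline[target]
    relative_shift = relative(with_decoy) - relative(baseline)
    return raw_shift, relative_shift


# Illustrative shares from the camera example (not data).
baseline = {"A": 50.0, "B": 50.0}
with_decoy = {"A": 61.0, "B": 33.0, "Decoy": 6.0}

raw, rel = attraction_shift(baseline, with_decoy)
print(f"raw shift: {raw:.1f} points, relative shift: {rel:.1f} points")
# prints "raw shift: 11.0 points, relative shift: 14.9 points"
```

The relative measure is larger here because the decoy takes 6% of choices out of the pool, so the target's 61% is compared against 94% rather than 100%.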

The study's main result

The fidelity map

Each dot is one effect, in one model, against the calibrated human benchmark: the human effect size runs along the bottom, the model's effect size up the side, in percentage points. On the dashed diagonal, the model simulates people — same direction, same size. Below the diagonal it under-expresses the effect: right direction, wrong size. Near zero it shows no human-like context sensitivity at all. And below the zero line it reverses the human pattern — the dangerous case, because the model's answers still look fluent and plausible while pointing the opposite way.


HIGH

Matches people: same direction, sizes close to the benchmark. Synthetic respondents are defensible for these phenomena, within the tested scope.

PARTIAL

Direction transfers, size does not — usable for detecting that an effect exists, not for estimating how large it is.

LOW

Does not track human context sensitivity; synthetic respondents drawn from it would mislead a study of context effects.

ANTI

Reversed. The model inverts the human pattern while its answers keep looking sensible — invisible to anyone not holding the human benchmark. A published digital-twin follow-up has already sighted this case.
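The four classes can be read as regions of the map. The sketch below, in Python, assigns one cell to a class from its two effect sizes. The function name and both thresholds (`size_tolerance` and `null_band`) are hypothetical placeholders, not the registered criteria, and the sketch ignores sampling uncertainty, which any real classification would need to handle.

```python
def fidelity_class(human_pp, model_pp, size_tolerance=0.5, null_band=2.0):
    """Classify one map cell (one effect in one model).

    human_pp, model_pp: signed effect sizes in percentage points.
    size_tolerance: hypothetical; the model's size counts as "close" when it
        is within this fraction of the human size.
    null_band: hypothetical; model effects within +/- this many points count
        as "near zero".
    """
    if human_pp == 0:
        raise ValueError("needs a nonzero human benchmark effect")
    # Re-sign the model effect so that the human direction is positive.
    aligned = model_pp if human_pp > 0 else -model_pp
    size = abs(human_pp)
    if abs(aligned) <= null_band:
        return "LOW"       # no human-like context sensitivity
    if aligned < 0:
        return "ANTI"      # reverses the human pattern
    if abs(aligned - size) <= size_tolerance * size:
        return "HIGH"      # same direction, size close to the benchmark
    return "PARTIAL"       # direction transfers, size does not


print(fidelity_class(12.0, 11.0))   # HIGH
print(fidelity_class(12.0, 4.0))    # PARTIAL
print(fidelity_class(12.0, 0.5))    # LOW
print(fidelity_class(12.0, -7.0))   # ANTI
```

The near-zero check runs before the direction check so that a tiny negative effect is read as LOW rather than ANTI; that ordering is a design choice of this sketch.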

The contest the paper stages

Four rival accounts of how a model chooses

The most likely result is that models reproduce some effects and not others, so the paper is organized as a contest among rival accounts of how a language model arrives at a choice — each predicting a different pattern of which effects transfer, with pre-registered tests adjudicating between them. Each account has a signature the map would show.

Each account points to a different place where fidelity would crack, and the tests below are built to expose those cracks. The accounts are attributed per effect, not per model — a single model can fall under different accounts for different effects.

Pre-registered, stated in the protocol's frozen wording

Three tests of why fidelity breaks

The memorization test

Every effect is tested on the classic published stimuli and on newly constructed stimuli that have never appeared in print. Effects that appear only on the published stimuli indicate recitation of training data; effects that survive on the new stimuli indicate behavior.

The knows-it-versus-shows-it test

Each model is shown the actual choice sets and probed for recognition: is anything strategically constructed about this set, and which option is the decoy? The informative question is whether detecting the manipulation in a given stimulus predicts responding to it.
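One simple way to ask whether detection predicts responding is to compare how often the model picks the target on stimuli where it detected the manipulation with how often it does so where it did not. The Python sketch below does this on made-up toy trials. The function name, the boolean coding of each trial, and the use of a plain rate difference are all assumptions for illustration; the registered analysis may use a different statistic.

```python
def detection_response_gap(trials):
    """trials: iterable of (detected, responded) booleans, one per stimulus.

    detected: the model flagged the choice set as constructed (assumed coding).
    responded: the model chose the option the human effect favors (assumed).
    Returns (rate when detected, rate when not detected, difference).
    """
    counts = {True: [0, 0], False: [0, 0]}  # detected -> [responded, total]
    for detected, responded in trials:
        counts[bool(detected)][0] += int(bool(responded))
        counts[bool(detected)][1] += 1
    if counts[True][1] == 0 or counts[False][1] == 0:
        raise ValueError("need both detected and undetected stimuli")
    rate_detected = counts[True][0] / counts[True][1]
    rate_missed = counts[False][0] / counts[False][1]
    return rate_detected, rate_missed, rate_detected - rate_missed


# Toy trials, invented for the example: 8 detected, 8 not detected.
toy_trials = (
    [(True, True)] * 6 + [(True, False)] * 2
    + [(False, True)] * 3 + [(False, False)] * 5
)
rate_detected, rate_missed, gap = detection_response_gap(toy_trials)
print(rate_detected, rate_missed, gap)  # prints "0.75 0.375 0.375"
```

A gap near zero would mean that noticing the manipulation tells you nothing about whether the model responds to it.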

The argument-strength test

The effects differ in how much statable support the choice set gives the predicted choice, and the protocol orders all seven on this dimension — attraction at the top ("B beats C outright"), compromise in the middle, the extremity effect at the bottom. If models choose by generating reasons, fidelity should decline down this ordering.
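The prediction "fidelity should decline down this ordering" can be summarized with a pairwise monotonicity score. The Python sketch below counts, over all pairs of effects, how often the effect higher in the ordering has the higher fidelity. The fidelity values are invented toy numbers, the function name is hypothetical, and the choice of a Kendall-style pair count is an assumption, not the protocol's test.

```python
from itertools import combinations


def decline_score(fidelity_by_rank):
    """fidelity_by_rank: one fidelity value per effect, listed from the
    strongest statable support (top of the ordering) to the weakest.

    Returns a value in [-1, 1]: +1 if fidelity strictly declines down the
    ordering, -1 if it strictly rises, near 0 if there is no trend.
    """
    # combinations() keeps list order, so `upper` always ranks above `lower`.
    pairs = list(combinations(fidelity_by_rank, 2))
    if not pairs:
        raise ValueError("need at least two effects")
    declining = sum(1 for upper, lower in pairs if upper > lower)
    rising = sum(1 for upper, lower in pairs if upper < lower)
    return (declining - rising) / len(pairs)


# Seven toy fidelity values, top of the ordering first (not data).
toy_fidelity = [0.9, 0.8, 0.85, 0.6, 0.5, 0.3, 0.1]
score = decline_score(toy_fidelity)
print(round(score, 3))  # prints 0.905
```

In the toy list only one of the 21 pairs runs against the ordering, so the score is 19/21.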

A fourth comparison sharpens the argument-strength test by holding everything else fixed: the same consumer scenarios are shown as numbers, as words, and as pictures, so only the format changes. A model that shows the attraction effect when the decoy's inferiority can be read off the numbers, but not when the same inferiority must be seen in a picture, is reading arguments where a person would perceive options. Because the ranking of different effects is itself contestable, this formats test carries the confirmatory weight for the reason-construction account.

What we learn regardless of outcome

Mixed results are the point

Mixed results are the most likely outcome, and an unplanned search through them would open a nearly unlimited number of forking analytic paths. So every pattern the data can produce was assigned a reading before any data exist: two validity gates remove cells that would be misread, four fidelity classes describe how faithfully each model simulates, five pattern tests attribute why fidelity breaks, a precedence rule resolves overlaps, and seven named scenarios give the composite verdicts. The space of conclusions is finite and composable by stated rules, and that is what lets a design with this many cells be a registered report.

The seven scenarios are not exhaustive: a model matching none of them is routed by its individual test readings under the precedence rule, and the routing is recorded for every model in the tree-routing file.
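A precedence rule of this kind can be pictured as first-match routing: when more than one test reading fires, the one listed earliest wins. The Python sketch below shows only that mechanism. The function name, the placeholder test names, and their order are hypothetical; the registered tests and their actual precedence are defined in the protocol, not here.

```python
def route(readings, precedence):
    """Pick the reading that governs when several tests fire at once.

    readings: dict mapping a test name to True if its pattern fired.
    precedence: list of test names, highest priority first (assumed input).
    Returns the first firing test in precedence order, or "unmatched".
    """
    for name in precedence:
        if readings.get(name, False):
            return name
    return "unmatched"


# Placeholder names and order, for illustration only.
precedence = ["test_a", "test_b", "test_c"]

overlap = {"test_a": False, "test_b": True, "test_c": True}
print(route(overlap, precedence))  # prints "test_b"

nothing_fired = {"test_a": False, "test_b": False, "test_c": False}
print(route(nothing_fired, precedence))  # prints "unmatched"
```

Because the order is fixed before any data exist, two analysts given the same readings reach the same route.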

Walk the registered decision tree

The whole plan — gates, benchmarks, the map, the five tests, the precedence rule, and the seven scenarios — as an interactive flow. Highlight an account to light up its route through the tree.


Why this page exists

A benchmark that stays alive

Model lineups expire within months, so the study is built to be re-run rather than to freeze a snapshot. Three engines in the battery are open-weight models that can be re-run indefinitely; every engine is pinned to a dated snapshot; and one results file feeds both the paper and this page, so the two cannot disagree. When new model generations arrive, the same pipeline scores them against the same fixed human benchmark, and this map simply gains dots.

Right now the file behind this page is a synthetic dress rehearsal: twelve planted engines, each built to exercise one branch of the registered decision tree, proving the pipeline routes every pattern to its pre-written reading. Nothing here is a finding.