Companion to a Stage-1 registered report · a living benchmark
People do not choose products in a vacuum: the same option looks better or worse depending on what sits next to it on the shelf. These pulls are called context effects, and they are the most thoroughly documented way human choice departs from simple rationality. A synthetic respondent (a language model prompted to answer a survey as if it were a person) has to reproduce them to stand in for a person. This study measures seven context effects in a large pre-registered human sample and in a battery of language models, on identical consumer choices, and maps where each model's behavior tracks the human benchmark.
Shown, not told
Take two cameras that people like about equally. Add a third that is plainly worse than one of them — slightly pricier, slightly worse image. Nobody buys the decoy, but choices move anyway: the option that beats it outright starts to look like the smart pick. Adding an option that should be irrelevant changes which original option wins.
Figure: choice shares at a calibrated, evenly split baseline and after adding a decoy that is slightly pricier than A with a slightly worse image.
Illustrative shares, not data — this is the attraction effect with a frequency decoy, one of the seven effects the study tests. The study's real baselines are calibrated so the two starting options are close to evenly preferred, because an effect has no room to appear when one option already dominates.
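One way to express a shift like this in percentage points is to compare the target's share among the two original options with and without the decoy. The sketch below does that under stated assumptions: the function names, the made-up counts, and the choice of relative share as the measure are all illustrative, not the study's registered estimator.

```python
def target_share(target: int, competitor: int) -> float:
    """Share of the target among the two original options, in percent."""
    total = target + competitor
    if total == 0:
        raise ValueError("no choices recorded for either original option")
    return 100.0 * target / total


def attraction_effect_pp(baseline: tuple[int, int],
                         with_decoy: tuple[int, int]) -> float:
    """Shift in the target's relative share, in percentage points.

    Each tuple is (target_count, competitor_count). Choices of the decoy
    itself are left out, which is an assumption of this sketch.
    """
    return target_share(*with_decoy) - target_share(*baseline)


# Made-up counts for illustration only, not data from the study.
effect = attraction_effect_pp(baseline=(50, 50), with_decoy=(60, 40))
print(f"{effect:+.1f} pp")
```

A positive value means the option that dominates the decoy gained share once the decoy was added.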
The study's main result
Each dot is one effect, in one model, against the calibrated human benchmark: the human effect size runs along the bottom, the model's effect size up the side, both in percentage points. On the dashed diagonal, the model simulates people: same direction, same size. Between the diagonal and the zero line it under-expresses the effect: right direction, wrong size. Near zero it shows no human-like context sensitivity at all. Below the zero line it reverses the human pattern. This is the dangerous case, because the model's answers still look fluent and plausible while pointing the opposite way.
Matches people: same direction, sizes close to the benchmark. Synthetic respondents are defensible for these phenomena, within the tested scope.
Direction transfers, size does not — usable for detecting that an effect exists, not for estimating how large it is.
Does not track human context sensitivity; synthetic respondents drawn from it would mislead a study of context effects.
Reversed. The model inverts the human pattern while its answers keep looking sensible — invisible to anyone not holding the human benchmark. A published digital-twin follow-up has already sighted this case.
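The four regions of the map can be sketched as a small classifier over one pair of effect sizes. This is a minimal illustration, not the registered procedure: the class labels, the `null_band` and `size_tol` thresholds, and the decision to treat the human direction as positive are assumptions made here for the example.

```python
def fidelity_class(human_pp: float, model_pp: float,
                   null_band: float = 2.0, size_tol: float = 3.0) -> str:
    """Place one (human, model) effect pair into one of four map regions.

    null_band and size_tol are hypothetical thresholds in percentage points.
    The human benchmark effect is assumed to lie outside the null band.
    """
    if abs(human_pp) <= null_band:
        raise ValueError("human benchmark effect is too small to classify against")
    # Orient both effects so that the human direction counts as positive.
    sign = 1.0 if human_pp > 0 else -1.0
    human, model = sign * human_pp, sign * model_pp
    if abs(model) <= null_band:
        return "no_sensitivity"   # near zero
    if model < 0:
        return "reversed"         # opposite to the human direction
    if abs(model - human) <= size_tol:
        return "matches"          # same direction, size close to benchmark
    return "direction_only"       # same direction, size off


print(fidelity_class(human_pp=10.0, model_pp=4.0))
```

Checking the null band before the sign keeps a tiny negative model effect from being read as a reversal.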
The contest the paper stages
The most likely result is that models reproduce some effects and not others, so the paper is organized as a contest among rival accounts of how a language model arrives at a choice — each predicting a different pattern of which effects transfer, with pre-registered tests adjudicating between them. Each account has a signature the map would show.
The model reproduces an effect because it has seen that effect described in its training text.
Predicted signature. Effects appear on the classic published stimuli but vanish on stimuli that have never appeared in print.
The model reproduces an effect only when the choice set hands it an argument it can state in words.
Predicted signature. Fidelity falls as the set offers less explicit support for the predicted choice — down the argument-strength ladder, and from numbers to words to pictures.
The model detects that the set has been arranged to push a choice and answers in the way it judges correct.
Predicted signature. It recognizes the manipulation without moving with it — effects null or reversed while recognition is high. This is the account with a published sighting: digital twins produced a significant reversed compromise effect where their own humans showed none.
The model reproduces human choice patterns whatever the novelty of the stimulus, the arguability of the set, or the format.
Predicted signature. High fidelity everywhere and no breakdown for the tests to find.
Each account points to a different place where fidelity would crack, and the tests below are built to expose those cracks. The accounts are attributed per effect, not per model — a single model can fall under different accounts for different effects.
Pre-registered, stated in the protocol's frozen wording
Every effect is tested on the classic published stimuli and on newly constructed stimuli that have never appeared in print. Effects that appear only on the published stimuli indicate recitation of training data; effects that survive on the new stimuli indicate behavior.
Each model is shown the actual choice sets and probed for recognition: is anything strategically constructed about this set, and which option is the decoy? The informative question is whether detecting the manipulation in a given stimulus predicts responding to it.
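Whether detection predicts responding can be sketched as a simple contrast: the rate at which the model picks the predicted option on stimuli where it flagged the manipulation, minus the same rate where it did not. The function name, the trial format, and the sample trials below are hypothetical, and the registered analysis may use a different statistic.

```python
def response_rate_gap(trials: list[tuple[bool, bool]]) -> float:
    """Response-rate difference (detected minus undetected), in percentage points.

    Each trial is (detected_manipulation, chose_predicted_option).
    """
    detected = [chose for det, chose in trials if det]
    undetected = [chose for det, chose in trials if not det]
    if not detected or not undetected:
        raise ValueError("need at least one trial in each group")

    def rate(choices: list[bool]) -> float:
        return 100.0 * sum(choices) / len(choices)

    return rate(detected) - rate(undetected)


# Made-up trials for illustration only.
trials = ([(True, False)] * 3 + [(True, True)]
          + [(False, True)] * 3 + [(False, False)])
print(response_rate_gap(trials))
```

A clearly negative gap would be consistent with a model that recognizes the arrangement and declines to move with it.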
The effects differ in how much statable support the choice set gives the predicted choice, and the protocol orders all seven on this dimension — attraction at the top ("B beats C outright"), compromise in the middle, the extremity effect at the bottom. If models choose by generating reasons, fidelity should decline down this ordering.
A fourth comparison sharpens the argument-strength test by holding everything else fixed: the same consumer scenarios are shown as numbers, as words, and as pictures, so only the format changes. A model that shows the attraction effect when the decoy's inferiority can be read off the numbers, but not when the same inferiority must be seen in a picture, is reading arguments where a person would perceive options. Because the ranking of different effects is itself contestable, this formats test carries the confirmatory weight for the reason-construction account.
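The formats comparison can be sketched as a check on how one effect changes across the three presentations of the same scenario. The format keys, the function name, and the example values are assumptions for illustration. The sketch only describes the gradient and does not implement any registered inferential test.

```python
# Ordered from most to least explicit support for the predicted choice.
FORMAT_ORDER = ("numbers", "words", "pictures")


def format_gradient(effect_pp: dict[str, float]) -> dict[str, object]:
    """Summarize how an effect (in percentage points) changes across formats."""
    values = [effect_pp[name] for name in FORMAT_ORDER]
    # Change from each format to the next one in the ordering.
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    return {
        "steps_pp": steps,
        "declines_monotonically": all(step < 0 for step in steps),
        "numbers_minus_pictures_pp": values[0] - values[-1],
    }


# Made-up effect sizes for illustration only.
print(format_gradient({"numbers": 12.0, "words": 7.0, "pictures": 1.0}))
```

A model that reads arguments rather than perceiving options would be expected to show a decline from numbers to pictures, while a flat gradient would point away from that account.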
What we learn regardless of outcome
Mixed results are the most likely outcome, and an unplanned search through them would face a nearly unlimited number of forking paths. So every pattern the data can produce has been assigned a reading before any data exist: two validity gates remove cells that would be misread, four fidelity classes describe how faithfully each model simulates, five pattern tests attribute why fidelity breaks, a precedence rule resolves overlaps, and seven named scenarios are the composite verdicts. The space of conclusions is finite and composable by stated rules, and that is what lets a design with this many cells be a registered report.
The seven are not exhaustive: a model matching none is routed by the individual test readings under the precedence rule, and the routing is recorded for every model in the tree-routing file.
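The idea of a precedence rule resolving overlapping test readings can be sketched as an ordered lookup. Everything specific here is hypothetical: the account labels, the order in `PRECEDENCE`, and the fallback label are placeholders, since the actual order and names are fixed by the registered protocol.

```python
# Hypothetical labels and order; the registered protocol defines the real ones.
PRECEDENCE = (
    "recitation",
    "manipulation_detection",
    "reason_construction",
    "simulation",
)


def attribute(fired: set[str]) -> str:
    """Return the highest-precedence account among those whose tests fired."""
    unknown = fired - set(PRECEDENCE)
    if unknown:
        raise ValueError(f"unknown account labels: {sorted(unknown)}")
    for account in PRECEDENCE:
        if account in fired:
            return account
    return "no_test_fired"


# Two tests fire for the same effect; the earlier entry in PRECEDENCE wins.
print(attribute({"reason_construction", "recitation"}))
```

Because the order is written down before any data exist, the same overlap always resolves the same way.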
The whole plan — gates, benchmarks, the map, the five tests, the precedence rule, and the seven scenarios — as an interactive flow. Highlight an account to light up its route through the tree.
Why this page exists
Model lineups expire within months, so the study is built to be re-run rather than to freeze a snapshot. Three engines in the battery are open-weight models that can be re-run indefinitely; every engine is pinned to a dated snapshot; and one results file feeds both the paper and this page, so they can never disagree. When new model generations arrive, the same pipeline scores them against the same fixed human benchmark, and this map simply gains dots.
Right now the file behind this page is a synthetic dress rehearsal: twelve planted engines, each built to exercise one branch of the registered decision tree, proving the pipeline routes every pattern to its pre-written reading. Nothing here is a finding.