Do language models choose like people? · Context-effect fidelity

A worked example

What a context effect is

Take two cameras that people like about equally. Add a third that is plainly worse than one of them: slightly pricier, with a slightly worse image. Nobody buys that decoy, but choices move anyway, because the option that beats it outright starts to look like the smart pick. Adding an option that should be irrelevant changes which original option wins.

Two options

a calibrated, evenly split baseline

Camera A (mid-price · very good image)

50%

Camera B (premium · best image)

50%

Add a decoy that Camera A beats outright

slightly pricier than A, slightly worse image

Camera A (the target: it beats the decoy)

61%

Camera B

33%

Decoy (worse than A on both)

6%

The shares above are illustrative. The structure is the attraction effect with a frequency decoy, one of the seven effects the study tests. The study's real baselines were calibrated in a completed three-wave pre-test so the two starting options are close to evenly preferred, because an effect has no room to appear when one option already dominates.
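The illustrative shares above can be turned into an effect size in percentage points. The sketch below shows two common ways to do that; this page does not say which measure the study registers, so both the function name `target_share` and the choice of measure are assumptions made for illustration.

```python
def target_share(target: float, competitor: float) -> float:
    """Target's share, in percent, among choices of the two original options.

    Decoy choices are left out, so the result is comparable with the
    two-option baseline. (Hypothetical helper, not the study's code.)
    """
    return 100.0 * target / (target + competitor)


# Illustrative shares from the worked example, in percent.
baseline = target_share(50, 50)      # two-option set: A versus B
with_decoy = target_share(61, 33)    # three-option set, decoy choices excluded

# Measure 1: shift in A's share among A-or-B choosers.
shift_pp = with_decoy - baseline

# Measure 2: raw shift in A's overall share, decoy choices included.
raw_shift_pp = 61 - 50

print(round(shift_pp, 1), raw_shift_pp)
```

With these illustrative numbers the decoy moves Camera A by roughly 15 points on the first measure and 11 points on the second, which is why a study has to fix its measure before comparing models with people.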

The study's main result

The fidelity map

Each dot is one effect, in one model, plotted against the calibrated human benchmark: the human effect size runs along the bottom and the model's effect size up the side, both in percentage points. A dot on the dashed diagonal means the model reproduces the human effect in the same direction and at the same size. A dot below the diagonal but above the zero line under-expresses the effect: it moves in the right direction but by too little. A dot near the zero line shows no human-like context sensitivity at all. A dot below the zero line reverses the human pattern, which is the case that matters most in practice, because the model's answers still look fluent and plausible while pointing the opposite way.

[Interactive chart: the fidelity map, with controls for model, effect, stimulus novelty, and fidelity class. The plotted cells are also available as a table.]

HIGH

Matches people: same direction, sizes close to the benchmark. Synthetic respondents are defensible for these phenomena, within the tested scope.

PARTIAL

Direction transfers while size does not, so the model can detect that an effect exists but cannot estimate how large it is.

LOW

Does not track human context sensitivity; synthetic respondents drawn from it would mislead a study of context effects.

ANTI

The model systematically inverts the human pattern while its answers keep looking sensible, so the failure is invisible to anyone who does not hold the human benchmark. One published study of digital twins has already reported this case.
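The four classes can be read as regions of the map. The sketch below is one way to express that reading in code, assuming a simple null band around zero and a relative size tolerance; both thresholds and the function name `fidelity_class` are hypothetical, and the study's registered criteria may rest on statistical tests instead of fixed cut-offs.

```python
def fidelity_class(
    human_pp: float,
    model_pp: float,
    null_band_pp: float = 2.0,     # assumed: effects this close to zero count as absent
    size_tolerance: float = 0.25,  # assumed: "close" means within 25% of the human size
) -> str:
    """Classify one map cell from the human and model effect sizes (in points)."""
    if human_pp == 0:
        raise ValueError("the human benchmark effect must be non-zero")

    # Re-orient so that the human direction is always positive.
    aligned = model_pp if human_pp > 0 else -model_pp
    human_size = abs(human_pp)

    if aligned < -null_band_pp:
        return "ANTI"      # below the zero line: the human pattern is reversed
    if aligned <= null_band_pp:
        return "LOW"       # near zero: no human-like context sensitivity
    if abs(aligned - human_size) <= size_tolerance * human_size:
        return "HIGH"      # same direction, size close to the benchmark
    return "PARTIAL"       # same direction, size off


print(fidelity_class(10.0, 9.0), fidelity_class(10.0, -6.0))
```

The re-orientation step matters because a benchmark effect can be negative by convention; the classes describe agreement with the human direction, not the sign itself.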

Read each model's scorecard →

Explore every condition →

The rival accounts

Four rival accounts of how a model chooses

The most likely result is that models reproduce some effects and not others. The paper is therefore organized as a contest among rival accounts of how a language model arrives at a choice: each account predicts a different pattern of which effects transfer, and pre-registered tests adjudicate between them. Each account has a signature the map would show.

Retrieval

The model reproduces an effect because it has seen that effect described in its training text.

Predicted signature. Effects appear on the classic published stimuli but vanish on stimuli that have never appeared in print.

Reason-construction

The model reproduces an effect only when the choice set hands it an argument it can state in words.

Predicted signature. Fidelity falls as the set offers less explicit support for the predicted choice: down the argument-strength ordering, and from numbers to words to pictures.

Rational-override

The model detects that the set has been arranged to push a choice and answers in the way it judges correct.

Predicted signature. It recognizes the manipulation without moving with it, leaving effects null or reversed while recognition stays high. This account has already been sighted once: digital twins (models prompted to answer as specific real survey takers) produced a significant reversed compromise effect where their own humans showed none.

Deep-mimicry

The model reproduces human choice patterns whatever the novelty of the stimulus, the arguability of the set, or the format.

Predicted signature. High fidelity everywhere and no breakdown for the tests to find.

Each account points to a different place where fidelity would break, and the tests below are built to expose those breaks. The accounts are attributed per effect: a single model can fall under different accounts for different effects.

The pre-registered tests

Three tests of why fidelity breaks

The memorization test

Every effect is tested on the classic published stimuli and on newly constructed stimuli that have never appeared in print. Effects that appear only on the published stimuli indicate recitation of training data; effects that survive on the new stimuli indicate behavior.
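The reading rule for this test can be sketched as a comparison of the same effect on the two stimulus sets. The null band and the function name `memorization_reading` are assumptions for illustration, and the sketch takes the predicted direction as positive; the page gives no reading for an effect that is absent on the classic stimuli, so that case is returned unlabelled.

```python
def memorization_reading(
    classic_pp: float,
    novel_pp: float,
    null_band_pp: float = 2.0,  # assumed threshold for "the effect appears"
) -> str:
    """Read one effect from its size on classic and on novel stimuli (in points)."""
    on_classic = classic_pp > null_band_pp
    on_novel = novel_pp > null_band_pp

    if on_classic and on_novel:
        return "behavior"       # the effect survives on never-published stimuli
    if on_classic:
        return "recitation"     # the effect appears only on published stimuli
    return "no reading"         # not covered by the rule stated on this page


print(memorization_reading(10.0, 0.5), memorization_reading(10.0, 8.0))
```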

The knows-it-versus-shows-it test

Each model is shown the actual choice sets and probed for recognition: is anything strategically constructed about this set, and which option is the decoy? The informative question is whether detecting the manipulation in a given stimulus predicts responding to it.
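One simple way to ask whether detection predicts responding is to compare how often the model chooses the target on stimuli where it spotted the manipulation with how often it does so where it did not. The sketch below assumes each trial is recorded as a pair of booleans; the data layout, the trial values, and the name `response_rate_gap` are hypothetical, and the registered analysis may use a different statistic.

```python
from typing import Iterable, Tuple


def response_rate_gap(trials: Iterable[Tuple[bool, bool]]) -> float:
    """Gap in target-choice rate, in points: detected minus not detected.

    Each trial is (detected_manipulation, chose_target).
    """
    detected = [chose for spotted, chose in trials if spotted]
    missed = [chose for spotted, chose in trials if not spotted]
    if not detected or not missed:
        raise ValueError("need trials both with and without detection")

    def rate(chosen: list) -> float:
        return 100.0 * sum(chosen) / len(chosen)

    return rate(detected) - rate(missed)


# Made-up trials, for illustration only.
example_trials = [
    (True, False), (True, False), (True, True),
    (False, True), (False, True), (False, False),
]
gap = response_rate_gap(example_trials)
print(round(gap, 1))
```

A clearly negative gap, as in these made-up trials, is the pattern the rational-override account predicts: the model responds less to the manipulation on exactly the stimuli where it recognizes it.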

The argument-strength test

The effects differ in how much statable support the choice set gives the predicted choice, and the protocol orders all seven on this dimension: attraction at the top ("B beats C outright"), compromise in the middle, and the extremity effect at the bottom. If models choose by generating reasons, fidelity should decline down this ordering.

A fourth comparison sharpens the argument-strength test by holding everything else fixed: the same consumer scenarios are shown as numbers, as words, and as pictures, so only the format changes. A model that shows the attraction effect when the decoy's inferiority can be read off the numbers, but not when the same inferiority must be seen in a picture, is reading arguments where a person would perceive options. Because the ranking of different effects is itself contestable, this formats test carries the confirmatory weight for the reason-construction account.

See the same stimulus morph between its three formats →
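The signature the formats comparison looks for is a decline in effect size from numbers to words to pictures. A minimal sketch of that check follows, assuming one effect size per format in percentage points; the dictionary layout, the example values, and the name `declines_across_formats` are hypothetical, and a registered test would also account for sampling noise, which this check does not.

```python
FORMATS = ("numbers", "words", "pictures")  # ordered from most to least statable


def declines_across_formats(effect_pp: dict) -> bool:
    """True if the effect shrinks at every step from numbers to words to pictures."""
    sizes = [effect_pp[name] for name in FORMATS]
    return all(earlier > later for earlier, later in zip(sizes, sizes[1:]))


# Made-up effect sizes, for illustration only.
reads_arguments = {"numbers": 11.0, "words": 6.0, "pictures": 0.5}
format_blind = {"numbers": 9.0, "words": 9.5, "pictures": 10.0}

print(declines_across_formats(reads_arguments), declines_across_formats(format_blind))
```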

The registered readings

How mixed results will be read

Mixed results are the most likely outcome, and the design treats them as informative. An unplanned search through a mixed pattern would become a nearly infinite forking-paths problem, so every pattern the data can produce was assigned a reading before any data exist: two validity criteria remove cells that would be misread, four fidelity classes describe how faithfully each model simulates, five pattern tests attribute why fidelity breaks, a precedence rule resolves overlaps, and seven named scenarios are the composite verdicts. The space of conclusions is finite and composed by stated rules, which is what lets a design with this many cells run as a registered report.

The seven scenarios are not exhaustive: a model that matches none of them is routed by the individual test readings under the precedence rule, and every model's route is recorded in the tree-routing file.

Walk the registered decision tree

The whole plan (validity criteria, benchmarks, the map, the five tests, the precedence rule, and the seven scenarios) as an interactive flow. Highlight an account to light up its route through the tree.

Open the decision tree →

The re-run design

A benchmark built to be re-run

Model lineups expire within months, so the study is built to be re-run. Three of the tested engines are permanently re-runnable open weights; every engine is pinned to a dated snapshot; and one results file feeds both the paper and this page, so they can never disagree. When new model generations arrive, the same pipeline scores them against the same fixed human benchmark, and this map simply gains dots.

Right now the file behind this page is a synthetic dress rehearsal: twelve planted engines, each built to exercise one branch of the registered decision tree and confirm that the pipeline routes every pattern to its pre-written reading. Because the engines are planted, the values on this site carry no evidence about any real model.

How re-running works →