Placeholder

These are illustrative placeholder numbers, not findings. This is a Stage-1 registered-report protocol; no data have been collected yet. The layout shows exactly what readers will see once the study runs. After acceptance and data collection, the figures are replaced automatically and this becomes a living benchmark, re-run on new models as they are released.

Companion to a registered report · living benchmark

Do large language models reproduce human context effects in consumer choice?

People do not choose products in a vacuum. The same option looks better or worse depending on what sits next to it on the shelf. These well-documented pulls are known as context effects. As researchers start using language models as stand-in survey respondents, one question decides whether that is safe: when a model picks among the same products a person sees, does it bend the same way a human does, bend differently, or even bend the opposite way?

What this dashboard shows

For each context effect and each model, the size of the effect in people versus the size of the same effect in the model, on identical choices. The closer a point sits to the diagonal, the more faithfully the model behaves like a human.

How to read a model

Each model gets a sign-agreement rate (how often it leans the same direction as people), a mean distance from the human numbers, a calibration slope (does it over- or under-shoot), and a fidelity class from HIGH to the reversed, dangerous ANTI.

Why it stays live

One results file feeds both the paper and this page, so they can never disagree. New model generations are scored against the same fixed human benchmark over time, turning a one-off study into a benchmark the field can track.

The centerpiece

The fidelity map

Model effect size (vertical) against the calibrated human effect size (horizontal), in percentage points, one point per effect. The dashed line is perfect fidelity: a model that matched people exactly would place every point on it. Distance below the line means the model under-expresses the effect; points that cross into the lower band reverse it. Hover or tap any point for detail; use the filters to isolate an effect, a model, or a fidelity class.
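As a rough illustration of how a single point on the map can be read, the sketch below places one (human, model) pair relative to the diagonal. The function name, the labels, and the `tol` tolerance are hypothetical choices made for this example. They are not the protocol's registered fidelity classes or thresholds. The sketch also assumes that the "lower band" means a model effect whose sign is opposite to the human one.

```python
def position_on_map(human_pp: float, model_pp: float, tol: float = 1.0) -> str:
    """Describe where one point sits relative to the perfect-fidelity diagonal.

    human_pp, model_pp: effect sizes in percentage points.
    tol: hypothetical tolerance (in points) for counting a point as "on" the line.
    """
    if human_pp == 0:
        raise ValueError("the human effect must be non-zero to define a direction")

    # Orient both values so the human effect is positive; a negative
    # oriented model value then means the model runs the opposite way.
    direction = 1.0 if human_pp > 0 else -1.0
    h = human_pp * direction
    m = model_pp * direction

    if m < 0:
        return "reversed"
    if abs(m - h) <= tol:
        return "on the diagonal"
    return "under-expressed" if m < h else "over-expressed"


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results.
    for human, model in [(12.0, 12.5), (12.0, 5.0), (12.0, 20.0), (12.0, -3.0)]:
        print(human, model, position_on_map(human, model))
```

Orienting by the human effect's sign keeps the reading the same whether the human effect is plotted as positive or negative.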

Per model

How faithfully each model simulates

Each model is summarized across all effects. Click any column heading to sort. Sign agreement is the share of effects that run in the same direction as in people; mean absolute deviation is the average distance from the human numbers in percentage points (smaller is better); the calibration slope is the slope of the line fitted through the model-versus-human points (1.0 is perfectly calibrated, above 1 over-shoots, below 1 under-shoots, and negative means systematically reversed).

Model | Fidelity class | Sign agreement | Mean abs. deviation | Calibration slope
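The three numeric columns can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the protocol's registered analysis code. The function name `summarize_model` is hypothetical. The slope is assumed to be an ordinary least-squares fit with an intercept. An effect of exactly zero on either side is counted as a sign disagreement. The fidelity class is left out because its cut-offs are not stated here.

```python
from statistics import fmean


def summarize_model(human: list[float], model: list[float]) -> dict[str, float]:
    """Summarize one model across effects; all inputs are in percentage points."""
    if len(human) != len(model) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least two effects")

    pairs = list(zip(human, model))

    # Sign agreement: share of effects where the model leans the same way as people.
    # (Assumption: a zero effect on either side counts as disagreement.)
    sign_agreement = sum(1 for h, m in pairs if h * m > 0) / len(pairs)

    # Mean absolute deviation: average distance from the human numbers.
    mean_abs_deviation = fmean([abs(m - h) for h, m in pairs])

    # Calibration slope: least-squares slope of model effects on human effects.
    # (Assumption: ordinary least squares with an intercept.)
    h_bar, m_bar = fmean(human), fmean(model)
    sxx = sum((h - h_bar) ** 2 for h in human)
    if sxx == 0:
        raise ValueError("human effects must vary to fit a slope")
    sxy = sum((h - h_bar) * (m - m_bar) for h, m in pairs)

    return {
        "sign_agreement": sign_agreement,
        "mean_abs_deviation": mean_abs_deviation,
        "calibration_slope": sxy / sxx,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results.
    print(summarize_model([10.0, 20.0, 30.0], [5.0, 10.0, 15.0]))
```

In the made-up example, a model that expresses every effect at half its human size agrees in sign on all effects but has a slope of 0.5, which is the under-shooting case described above.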

Seven pre-registered readings

The overall stories the data could tell

Beyond the cell-by-cell map, the protocol names seven overall patterns in advance, each with a single registered reading fixed before any data are seen. This commits the analysis to an interpretation for every realistic way the results can fall. At Stage 2 the observed pattern is routed to one of these readings, and the result is shown below the cards.

Stage-2 verdict

Resolved once the study runs