The pre-test, completed before any confirmatory choice

How the stimuli were calibrated

A context effect has no room to appear when one option already dominates, so before the study could run, every adjustable two-option starting pair had to sit close to an even split: the registered target puts the weaker option between 40 and 45 percent. Three waves on Prolific measured each pair, adjusted any pair outside the band by a rule written down in advance, and re-tested it in the next wave. The published-values pairs were checked against the same band and reported, never altered. This page shows every pair's path, wave by wave, into the band, and reports the pairs that never reached it.

Every pair, wave by wave

The calibration paths

Each small chart is one two-option pair: the dots are its measured share of the weaker option in waves 1, 2, and 3, with 95 percent intervals; the blue band is the 40–45 target. Click a pair for its full history — the values shown at each wave, what was adjusted and why, and how it ended.
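As a rough illustration of how a 95 percent interval around a measured share can be obtained, the sketch below uses the Wilson score interval. This is an assumption: the page does not say which interval method was used, and the function name `wilson_interval` and its parameters are hypothetical.

```python
import math


def wilson_interval(chose_weaker: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95 percent Wilson interval, in percent.

    chose_weaker: number of participants who picked the weaker option
    n:            number of participants who saw the pair
    z:            normal quantile (1.96 corresponds to roughly 95 percent)

    Assumption: the Wilson method is used here only as an example;
    the study's actual interval method is not stated in the source.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= chose_weaker <= n:
        raise ValueError("chose_weaker must lie between 0 and n")

    p = chose_weaker / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return 100 * (center - half_width), 100 * (center + half_width)


# Made-up counts for illustration only, not data from the study.
low, high = wilson_interval(chose_weaker=42, n=100)
print(f"share 42.0%, interval {low:.1f}% to {high:.1f}%")
```

The Wilson form is chosen here because it behaves sensibly for moderate sample sizes and shares away from 50 percent; any other standard binomial interval could be substituted.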

Published-values pairs (tier 1) are fixed by fidelity to their sources: they were measured once, reported, and never adjusted. Five of the seven fell outside the band as measured in 2026, one of them beyond the registered 25–75 limits, and every result on their cells is read against those measured baselines. The final rule for the adjustable pairs was fixed before the last wave's data arrived: a pair became final when its share fell inside 40–45, or when its 95 percent interval overlapped that band while the share stayed between 33 and 50.
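The final rule for the adjustable pairs can be written as a small predicate. The following is a minimal sketch, not the study's own code: the names `PairMeasurement` and `is_final` are hypothetical, and treating the band and limit endpoints as inclusive is an assumption, since the page does not say how exact boundary values were handled.

```python
from dataclasses import dataclass

# Values taken from the rule as stated on this page (all in percent).
BAND_LOW, BAND_HIGH = 40.0, 45.0      # registered target band
SHARE_FLOOR, SHARE_CEIL = 33.0, 50.0  # limits for the interval-overlap route


@dataclass
class PairMeasurement:
    share: float    # measured share of the weaker option
    ci_low: float   # lower end of the 95 percent interval
    ci_high: float  # upper end of the 95 percent interval


def is_final(m: PairMeasurement) -> bool:
    """Apply the stated rule; endpoints are inclusive (an assumption)."""
    # Route 1: the share itself lies inside the 40-45 band.
    in_band = BAND_LOW <= m.share <= BAND_HIGH
    # Route 2: the interval overlaps the band and the share is within 33-50.
    overlaps_band = m.ci_low <= BAND_HIGH and m.ci_high >= BAND_LOW
    within_limits = SHARE_FLOOR <= m.share <= SHARE_CEIL
    return in_band or (overlaps_band and within_limits)


# Made-up measurements for illustration only, not data from the study.
examples = [
    PairMeasurement(share=42.0, ci_low=36.0, ci_high=48.0),  # inside the band
    PairMeasurement(share=47.0, ci_low=41.0, ci_high=53.0),  # interval overlaps
    PairMeasurement(share=47.0, ci_low=46.0, ci_high=53.0),  # no overlap
    PairMeasurement(share=31.0, ci_low=25.0, ci_high=41.0),  # share below 33
]
for m in examples:
    print(m.share, is_final(m))
```

Under these assumptions the first two illustrative pairs would count as final and the last two would not: the third has an interval that sits wholly above the band, and the fourth overlaps the band but has a share below 33.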

Could people see the trick?

Structure checks

Alongside the shares, each wave asked a separate question: shown a choice set, could participants say which option was the planted one? The registered benchmark was 70 percent correct. Four of the seven structures passed; the three that missed are reported with their item corrections, and the confirmatory human replication — not this check — decides whether each effect exists.

Structure Wave 1 Wave 2 Wave 3 Benchmark

Do the invented categories read as real?

The anchored realism comparison

The three invented categories repeatedly missed the registered realism threshold: a mean of 5.0 on a 7-point “reads like a real product page” item. But that threshold had been set before any real category was ever rated on the item. So 112 fresh raters scored the seven published categories' pages, rendered identically, on the same question. Six of the seven published categories missed the threshold too, and the invented categories' scores sit entirely inside the published range. The misses therefore point to a threshold miscalibrated for the deliberately minimal page format every category shares, not to a defect of the invented products.