The living benchmark
What re-running means
The model lineup is a snapshot of the systems available in mid-2026. Models in this field turn over within months, so the specific fidelity numbers are a reading taken at one point in time. The design treats that churn as a quantity it tracks rather than a threat to what it concludes, in two ways: it compares across model generations, and it deposits the materials that let anyone re-run the map. Whatever the models do, the human benchmark stands: it is collected once, at high power, under a registered protocol, and every later model generation is scored against the same fixed human numbers.
A re-runnable open core
Three of the ten engines are open-weight models, which stay re-runnable permanently. Anyone can re-run the core comparison unchanged, years from now, without depending on any provider keeping an API alive.
Dated model snapshots
Every engine is pinned to a dated snapshot, with a registered substitution rule for the two snapshots most at risk of retirement during the study window. A fidelity number always names exactly which model, on exactly which date, produced it.
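As an illustration only, a pinned snapshot with a fallback could be recorded along these lines. This is a minimal sketch: the `Snapshot` fields, the `resolve` and `label` helpers, and the model identifiers are all hypothetical, and the registered substitution rule itself is not reproduced here. The sketch assumes the rule amounts to "use the named substitute if the pinned snapshot has been retired".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One engine pinned to one dated model snapshot (hypothetical schema)."""
    engine: str
    model_id: str
    snapshot_date: date
    # Assumed form of the substitution rule: a single pre-named fallback.
    substitute: Optional["Snapshot"] = None


def resolve(snapshot: Snapshot, retired_ids: set) -> Snapshot:
    """Return the snapshot to run, following substitutes past retired models."""
    if snapshot.model_id not in retired_ids:
        return snapshot
    if snapshot.substitute is None:
        raise LookupError(f"{snapshot.model_id} is retired and has no substitute")
    return resolve(snapshot.substitute, retired_ids)


def label(snapshot: Snapshot) -> str:
    """Name exactly which model, on which date, produced a fidelity number."""
    return f"{snapshot.model_id} ({snapshot.snapshot_date.isoformat()})"


# Hypothetical identifiers, for illustration only.
fallback = Snapshot("engine-a", "engine-a-2026-09-01", date(2026, 9, 1))
primary = Snapshot("engine-a", "engine-a-2026-05-01", date(2026, 5, 1),
                   substitute=fallback)
print(label(resolve(primary, retired_ids={"engine-a-2026-05-01"})))
```

Keeping the substitute inside the pinned record means a retirement changes which snapshot runs without changing what was declared in advance.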
One results file
The analysis emits a single results pack; the manuscript and this site both draw from it, so the registered tree and the reported numbers can never disagree. Running a new model appends its cells to the same file, and this map simply gains dots.
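The append step can be pictured with a short sketch. It assumes, purely for illustration, that the results pack is a JSON file holding a list of cells and that a cell is identified by its model, snapshot date, and scenario. The file layout, the field names, and the `append_cells` helper are hypothetical and not the study's actual format.

```python
import json
from pathlib import Path


def append_cells(path, new_cells):
    """Append new cells to the results file, refusing to overwrite old ones.

    Assumes each cell is a dict with "model_id", "snapshot_date" and
    "scenario" keys (a hypothetical schema). Returns the total cell count.
    """
    path = Path(path)
    cells = json.loads(path.read_text()) if path.exists() else []
    seen = {(c["model_id"], c["snapshot_date"], c["scenario"]) for c in cells}

    for cell in new_cells:
        key = (cell["model_id"], cell["snapshot_date"], cell["scenario"])
        if key in seen:
            # Earlier readings are dated and fixed; a re-run adds dots only.
            raise ValueError(f"cell already recorded: {key}")
        seen.add(key)
        cells.append(cell)

    path.write_text(json.dumps(cells, indent=2))
    return len(cells)
```

Rejecting a duplicate key, rather than replacing the old cell, is one way to keep every earlier reading intact while later generations are added.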
The trajectory itself is part of what the study measures: the battery includes three older generations of the same model families, so the map can read whether fidelity is rising, flat, or falling as capability grows — and the T5 trajectory test carries a registered reading for each answer. If fidelity is not a free byproduct of capability gains, that is the strongest argument for routine validation and for this benchmark.
Version log
Benchmark versions
The registered deliverable's shape, populated with the synthetic dress rehearsal: twelve planted engines, each built to show one pattern (a faithful simulator, a textbook parrot, a rational overrider, a ceiling case, a refuser, and so on), carried through the full pipeline to prove that every planted pattern reaches the fidelity class and named scenario it was designed to reach. No data collected; nothing here is a finding.
The real results: the calibrated human benchmark and the full ten-engine battery on the registered design. The results pack overwrites this site's single data file; the page code does not change.
Re-runs on new model generations against the same fixed human benchmark, each a dated snapshot appended to the map — the living answer to whether synthetic respondents are becoming more, or less, faithful to human choice.
Open materials
Everything needed to re-run it
All stimuli, prompts, code, and data will be public at registration. The deposit holds the single stimulus file every arm reads, the model-collection and analysis code, the survey instruments, the human benchmark data once collected, and the results pack this site renders. It is deposited so the field can re-run the comparison as models change.
During masked review this site stays anonymous: it carries no author name, institution, or repository handle.