The living benchmark

What re-running means

The model lineup is a snapshot of the systems available in mid-2026, in a field whose models turn over within months, so the specific fidelity numbers are a reading taken at one time. The comparison across model generations, and the deposited materials that let anyone re-run the map, are how the design makes that churn a quantity it tracks, not a threat to what it concludes. Whatever the models do, the human benchmark stands: it is collected once, at high power, under a registered protocol, and every later model generation is scored against the same fixed human numbers.

A re-runnable open core

Three of the ten engines are permanently re-runnable open weights. Anyone can re-run the core comparison unchanged, years from now, without depending on any provider keeping an API alive.

Dated model snapshots

Every engine is pinned to a dated snapshot, with a registered substitution rule for the two snapshots most at risk of retirement during the study window. A fidelity number always names exactly which model, on exactly which date, produced it.

One results file

The analysis emits a single results pack; the manuscript and this site both draw from it, so the registered tree and the reported numbers can never disagree. Re-running a new model appends its cells to the same file and this map simply gains dots.

The trajectory itself is part of what the study measures: the battery includes three older generations of the same model families, so the map can read whether fidelity is rising, flat, or falling as capability grows — and the T5 trajectory test carries a registered reading for each answer. If fidelity is not a free byproduct of capability gains, that is the strongest argument for routine validation and for this benchmark.

Version log

Benchmark versions

v0.12026-07-01 · Stage-1 placeholder

The registered deliverable's shape, populated with the synthetic dress rehearsal: twelve planted engines, each built to show one pattern (a faithful simulator, a textbook parrot, a rational overrider, a ceiling case, a refuser, and so on), carried through the full pipeline to prove that every planted pattern reaches the fidelity class and named scenario it was designed to reach. No data collected; nothing here is a finding.

v1.0at Stage 2 · after acceptance and data collection

The real results: the calibrated human benchmark and the full ten-engine battery on the registered design. The results pack overwrites this site's single data file; the page code does not change.

v1.xas new models arrive

Re-runs on new model generations against the same fixed human benchmark, each a dated snapshot appended to the map — the living answer to whether synthetic respondents are becoming more, or less, faithful to human choice.

Open materials

Everything needed to re-run it

All stimuli, prompts, code, and data will be public at registration. The deposit holds the single stimulus file every arm reads, the model-collection and analysis code, the survey instruments, the human benchmark data once collected, and the results pack this site renders — deposited so the field can re-run the comparison as models change.

During masked review this site stays anonymous: it carries no author name, institution, or repository handle.