Per-model scorecards

How faithfully each engine simulates

Each engine summarized across all scored cells. Sign agreement is the share of cells leaning the same direction as people; mean absolute deviation is the average distance from the human numbers in percentage points (smaller is better); the calibration slope says whether it over- or under-shoots overall (1.00 is perfectly calibrated; negative means systematically reversed). Click a column heading to sort, a model to open its scorecard.

The registered battery at Stage 2 holds ten engines: six current flagships, three older generations of the same families, and one vision-capable engine that runs the picture cells and doubles as its family's older generation — several flagships carry a reasoning toggle, and three engines are permanently re-runnable open weights, pinned to dated snapshots.

Engine Fidelity class Sign agreement Mean abs. deviation (pp) Calibration slope Cells scored

A model that loses more than one third of its cells to refusals or ceilings is classified as not assessable as a survey respondent — a practical result in its own right, reported apart from the map.