Every condition, one strip at a time
Condition explorer
Pick a stimulus-novelty tier and a format, and read each effect as one strip: the dark tick is the human benchmark with its shaded 95 percent interval, and each colored dot is one engine's effect size on identical scenarios. Flip to the by-engine view to see the same cells regrouped: every engine's distance from the human number, effect by effect. Flipping a facet moves the same dots to their new places.
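The regrouping described above can be sketched in a few lines. This is an illustration only, not the explorer's actual code: the cell records, the effect and engine names, and the `group_by` helper are all hypothetical, and the values are invented. The point it shows is that both views are built from the same list of cells, so flipping a facet changes only the grouping key.

```python
from collections import defaultdict

# Hypothetical cells: one engine's effect size for one effect.
# Names and values are made up for illustration.
cells = [
    {"effect": "effect_a", "engine": "engine_x", "value": 0.31},
    {"effect": "effect_a", "engine": "engine_y", "value": 0.44},
    {"effect": "effect_b", "engine": "engine_x", "value": 0.12},
    {"effect": "effect_b", "engine": "engine_y", "value": 0.09},
]


def group_by(records, key):
    """Group the same records under a different facet key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# One strip per effect (default view) or one panel per engine.
by_effect = group_by(cells, "effect")
by_engine = group_by(cells, "engine")

# Both views contain exactly the same cells, only arranged differently.
total_by_effect = sum(len(group) for group in by_effect.values())
total_by_engine = sum(len(group) for group in by_engine.values())
```

Because neither view copies or recomputes a value, any dot in one view can be traced to the same record in the other.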
Benchmarks are pooled across the 13 product categories in the rehearsal pack; the per-category split arrives with the real data. Engine dots carry no interval by design: a model effect is a population quantity computed over the fully enumerated grid of orders, paraphrases, and personas, so its distance from the human number is a descriptive gap, never a test statistic. Hollow dots are cells the registered gates held out of scoring (a ceiling baseline, or an effect that failed its human gate).
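As a minimal sketch of why an engine dot has no interval, the snippet below enumerates a small grid in full and summarizes it. Everything in it is assumed for illustration: the grid levels, the `toy_effect` stand-in for a measured engine response, the choice of a plain mean as the summary, the human benchmark value, and the `is_scored` helper that mimics the two hold-out reasons named above.

```python
from itertools import product
from statistics import mean

# Hypothetical grid levels; the real grid is not specified here.
ORDERS = ["forward", "reversed"]
PARAPHRASES = ["p0", "p1", "p2"]
PERSONAS = ["persona_0", "persona_1"]


def toy_effect(order, paraphrase, persona):
    """Stand-in for an engine's effect at one grid point (invented values)."""
    return (
        0.30
        + 0.05 * ORDERS.index(order)
        + 0.01 * PARAPHRASES.index(paraphrase)
        - 0.02 * PERSONAS.index(persona)
    )


# Every combination is visited, so nothing is sampled and no sampling
# error attaches to the summary. The mean is an assumed summary.
grid = list(product(ORDERS, PARAPHRASES, PERSONAS))
engine_effect = mean(toy_effect(*point) for point in grid)

# Hypothetical human benchmark; the gap is descriptive, not a test statistic.
human_benchmark = 0.40
gap = engine_effect - human_benchmark


def is_scored(ceiling_baseline, passed_human_gate):
    """A cell is drawn solid only if neither hold-out reason applies."""
    return (not ceiling_baseline) and passed_human_gate
```

With these invented numbers the grid has 12 points and the gap is negative, meaning the toy engine's effect falls below the human figure. The human interval plays no part in computing the gap; it only frames how the dot is read against the benchmark.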