Methods & decision tree — Context-effect fidelity

The registered pipeline, in plain English

How every pattern of results will be read

The analysis plan fixes, before any data are collected, how every pattern of results will be read. Two validity gates run before any agreement is scored. The per-effect human gate asks whether each effect replicated in our own calibrated human sample (Holm-corrected, Bayes factor above three); an effect that fails exits fidelity scoring and is reported as a replication result. The per-cell model gate requires a parse rate of at least 95 percent and a refusal rate of at most 5 percent, and its ceiling rule holds out any cell where a model's baseline share leaves no room for a decoy to move the choice — so a ceiling artifact is never read as immunity to the manipulation. What survives the gates feeds the fidelity map, the four fidelity classes, and five pattern tests whose overlaps a registered precedence rule resolves, ending in one of seven named scenarios.
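The per-cell model gate can be sketched in a few lines. This is a minimal illustration, not the registered implementation: the 95 percent parse and 5 percent refusal thresholds come from the text, while the names `ModelCell`, `model_gate`, and `CEILING_MARGIN` are hypothetical, and the numeric ceiling margin is an assumption because the text gives no value for it.

```python
from dataclasses import dataclass

MIN_PARSE_RATE = 0.95    # from the text: parse rate of at least 95 percent
MAX_REFUSAL_RATE = 0.05  # from the text: refusal rate of at most 5 percent
CEILING_MARGIN = 0.05    # ASSUMPTION: the text states no numeric ceiling threshold


@dataclass
class ModelCell:
    parse_rate: float      # share of responses that parse into a valid choice
    refusal_rate: float    # share of responses that are refusals
    baseline_share: float  # target share with no decoy present


def model_gate(cell: ModelCell) -> str:
    """Return 'score', 'excluded', or 'held_out' for one model cell."""
    if cell.parse_rate < MIN_PARSE_RATE or cell.refusal_rate > MAX_REFUSAL_RATE:
        return "excluded"
    # Ceiling rule: a baseline this close to 1 leaves a decoy no room to raise the share,
    # so the cell is held out rather than read as immunity to the manipulation.
    if cell.baseline_share > 1.0 - CEILING_MARGIN:
        return "held_out"
    return "score"


print(model_gate(ModelCell(parse_rate=0.97, refusal_rate=0.02, baseline_share=0.55)))
print(model_gate(ModelCell(parse_rate=0.99, refusal_rate=0.00, baseline_share=0.99)))
```

Returning a label instead of a boolean keeps a held-out ceiling cell distinct from a cell that failed on parse or refusal rate, which matches the distinction the plan draws between an artifact and a failed gate.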

Figure 1, made walkable

The pre-registered decision tree

The whole plan as one flow: gates on the left, the fidelity map at the center with its four classes above it, the five pattern tests, the precedence rule, and the seven scenarios on the right.

Every branch ends in a reading fixed before any data are collected. Accounts are attributed per effect, not per model; patterns consistent across model families license class-level statements, while family-specific patterns are reported per family and label the aggregate map heterogeneous.

Step 1 · the benchmark the models are judged against

Human benchmarks

For every effect, in every product category, memorization tier, and format, the analysis estimates the effect size with its uncertainty interval from the registered statistic for that effect. Alongside every full-sample estimate it computes a first-trial-only estimate — a clean one-observation-per-person experiment embedded in the within-person design — as a registered comparability check. The pooled rehearsal estimates below show the shape of that table; nothing in it is a finding.

Table columns: Effect; Tier; Effect (pp); 95% interval; First-trial (pp); Human gate.

Rehearsal values, pooled across categories, numbers format. Every estimate and test on the human side uses cluster-robust standard errors with each participant as one cluster, and all confirmatory tests are Holm-corrected within families of effects. Model effect sizes carry no interval by design: a model effect is a population quantity computed over the fully enumerated grid of position orders, paraphrases, and personas, so the model-versus-human deviation is a descriptive distance, never a test statistic.
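Two human-side steps described above, the first-trial-only comparability estimate and the Holm correction within a family of effects, can be sketched as follows. This is a hedged illustration under stated assumptions: the effect statistic used here (a simple difference in target-choice share between a decoy and a control condition) stands in for the registered statistic, which the text does not specify; the row fields, function names, toy data, and `alpha=0.05` are all hypothetical. Cluster-robust standard errors are not covered by this sketch.

```python
from collections import defaultdict


def share_difference_pp(rows):
    """Target share with the decoy minus target share without it, in percentage points."""
    chosen = defaultdict(list)
    for r in rows:
        chosen[r["condition"]].append(r["chose_target"])
    share = {cond: sum(v) / len(v) for cond, v in chosen.items()}
    return 100.0 * (share["decoy"] - share["control"])


def first_trials(rows):
    """Keep only each participant's earliest trial (one observation per person)."""
    first = {}
    for r in rows:
        pid = r["participant"]
        if pid not in first or r["trial"] < first[pid]["trial"]:
            first[pid] = r
    return list(first.values())


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


# Toy rows invented for illustration only: (participant, trial, condition, chose_target)
toy = [
    ("p1", 1, "decoy", 1), ("p1", 2, "control", 0),
    ("p2", 1, "control", 0), ("p2", 2, "decoy", 1),
    ("p3", 1, "decoy", 1), ("p3", 2, "control", 1),
    ("p4", 1, "control", 1), ("p4", 2, "decoy", 0),
]
fields = ("participant", "trial", "condition", "chose_target")
rows = [dict(zip(fields, t)) for t in toy]

full_pp = share_difference_pp(rows)                 # estimate from all trials
first_pp = share_difference_pp(first_trials(rows))  # estimate from first trials only
family_decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
print(full_pp, first_pp, family_decisions)
```

Running the same estimator on the full sample and on the first-trial subset is what makes the two numbers directly comparable: only the rows change, not the statistic.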

Table 3 · the composite verdicts

The seven pre-registered scenarios

The gates, classes, and pattern tests compose into a small number of overall stories the data are likely to tell, and the protocol names seven in advance, each with a one-line registered reading. They are not additional tests but the composite verdicts that the pieces assemble into, and they show that the analysis has a reading ready for each realistic way the results can fall.

A model matching none of the seven is routed by the individual test readings under the precedence rule, and the route taken is recorded for every model in the tree-routing file. Web Appendix C works two deliberately contradictory patterns end to end, to show that the tree decides each case and does not defer.

The design in one paragraph

What produces the numbers

Seven effects across 13 product categories and three memorization tiers — exact published values, new values in published categories, and invented categories that cannot sit in any training corpus — give 165 registered stimuli. On the human side, roughly 18,900 participants contribute about 100,200 responses (95,200 of them confirmatory choices), with every binary baseline calibrated to the registered 40–60 band before any confirmatory choice is collected. On the model side, a battery of ten engines answers the same scenarios under a designed grid of position orders, three frozen paraphrases, and twenty panel-matched personas — 917,212 calls in all — with a separate recognition-probe arm that never shares a session with a behavioral choice. A format module re-runs the top of the argument-strength ladder with attributes as numbers, words, and pictures, and a price-free arm re-runs the attraction family with no price attribute at all.
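The model-side grid for a single stimulus can be sketched as a Cartesian product. This is an illustration only and is not meant to reproduce the registered call count: the three paraphrases and twenty personas come from the text, but the three-option choice set, the use of every permutation as a position order, and all labels are assumptions.

```python
from itertools import permutations, product

OPTIONS = ("target", "competitor", "decoy")  # ASSUMPTION: a three-option choice set
PARAPHRASES = ("para_1", "para_2", "para_3")  # three frozen paraphrases (labels hypothetical)
PERSONAS = tuple(f"persona_{i:02d}" for i in range(1, 21))  # twenty personas (labels hypothetical)


def enumerate_grid(options=OPTIONS, paraphrases=PARAPHRASES, personas=PERSONAS):
    """List every (position order, paraphrase, persona) cell for one stimulus."""
    orders = list(permutations(options))  # ASSUMPTION: all position orders are enumerated
    return [
        {"order": order, "paraphrase": para, "persona": persona}
        for order, para, persona in product(orders, paraphrases, personas)
    ]


grid = enumerate_grid()
print(len(grid))  # 3! orders x 3 paraphrases x 20 personas = 360
```

Because every cell of such a grid is visited, an effect computed over it is a quantity of the whole enumerated population and not a sample estimate, which is why the plan attaches no interval to model effect sizes.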
