petri-bench — can your model do science?

A contamination-proof benchmark for the scientific method of LLM agents: hypothesis isolation, controlled experimentation, inferential statistics, and calibrated claims — scored objectively, with no LLM judge. Every task's ground truth is a seeded random draw created and verified at generation time, so it cannot exist in any training corpus. The answer must be earned by experiment, not recalled.

Headline result: across 699 committed episodes (nine frontier models from five labs on 30 procedurally generated tasks, three episodes per task, identical agent loop), a one-factor-at-a-time sweep with informative test values beats every frontier model by 25–33 points on that model's completed task cells (paired Wilcoxon per model, all Holm-corrected p < 0.001). The sweep solves 100% of tasks; no model solves more than 52%. The ablation that locates the gap: the same sweep with test values drawn blindly from each parameter's legal range scores 40.5–47.7 across three value draws (mean 45.2), below every model's mean in every draw. Intervention design and procedural discipline are both first-order: models choose values better than chance, then give the advantage back through procedural lapses (skipped measurements, missing factorials, direction errors), all visible in the logs.

Leaderboard (standardized sweep v2 — tier-fair L3, n=3 episodes per task)

Models are ordered by the L1 mean — the one tier every model has completed at n=3. Some queues were paused mid-run to protect subscription quotas, so some cells are partial (episode counts shown in-cell); overall appears only for models with all three tiers complete, and solved/avg calls cover each model's recorded episodes. Small print gives the range of the three per-pass means on complete cells. Every number traces to a committed episode artifact.

solver	episodes	L1 identify	L2 + magnitude	L3 interactions	overall	solved	avg calls
adaptive (curated values)	30	94.8	94.2	87.5	92.2	100%	3.4
ofat (curated values; factorial at L3)	30	92.5	92.5	87.5	90.8	100%	4.0
gpt-5.6-sol (codex, xhigh)	73/90	76.7 (73.2–80.5)	47.3 (44.8–51.4)	44.0 (n=13)	pending	34%	4.8
claude-opus-4.8	60/90	72.2 (68.5–76.2)	57.9 (54.3–64.6)	pending	pending	37%	5.0
claude-fable-5 (native tools)	52/90	72.2 (65.8–80.0)	40.5 (n=22)	pending	pending	52%	6.1
gpt-5.5 (codex, xhigh)	60/90	69.9 (66.5–72.5)	53.0 (47.7–59.8)	pending	pending	37%	4.4
glm-5.1	81/90	68.0 (61.5–75.2)	63.9 (57.4–68.0)	66.7 (n=21)	pending	48%	4.1
gemini-3.1-pro	90	64.4 (52.2–74.5)	65.5 (59.6–69.1)	44.1 (40.2–50.1)	58.0	49%	6.3
claude-haiku-4.5	90	63.1 (55.5–76.5)	59.6 (50.8–66.3)	59.5 (56.0–65.1)	60.7	36%	4.5
claude-sonnet-4.6	69/90	59.3 (56.8–61.0)	59.2 (54.4–62.3)	53.4 (n=9)	pending	38%	3.5
ofat-rand (ablation: blind values; lowest of 3 draws, mean 45.2)	30	38.5	47.8	35.1	40.5	33%	3.7
random (chance floor)	30	10.0	19.0	28.1	19.0	3%	0
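
As a reading aid, the overall column in the table above appears consistent with an unweighted mean of the three tier means. That formula is an inference from the published numbers, not something this page states, so treat the sketch below as an arithmetic check rather than a description of the scoring code. The short row keys are abbreviations of the solver names.

```python
# Tier means (L1, L2, L3) and the published overall, copied from the rows
# that report all three tiers. Assumption: overall = unweighted tier mean.
rows = {
    "adaptive": ((94.8, 94.2, 87.5), 92.2),
    "ofat": ((92.5, 92.5, 87.5), 90.8),
    "gemini-3.1-pro": ((64.4, 65.5, 44.1), 58.0),
    "claude-haiku-4.5": ((63.1, 59.6, 59.5), 60.7),
    "ofat-rand": ((38.5, 47.8, 35.1), 40.5),
    "random": ((10.0, 19.0, 28.1), 19.0),
}


def unweighted_overall(tiers):
    """Mean of the tier means, rounded to one decimal like the table."""
    return round(sum(tiers) / len(tiers), 1)


for name, (tiers, published) in rows.items():
    print(f"{name}: computed {unweighted_overall(tiers)}, published {published}")
```

Under that assumption every complete row reproduces its published overall to one decimal place.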

Score = correctness + method rigor (is the conclusion backed by a significant, isolating controlled experiment?) + efficiency, under a hard budget of 8 calls. All models run the identical agent loop (same brief, tools, budget, scoring). Transports: claude/gemini/glm CLIs on the frozen text protocol; codex sessions at xhigh reasoning effort; claude-fable-5 runs the same loop over native MCP tool-calling, because its safety layer deterministically flags text-serialized harness protocols — the finding, the isolation tests, and the transport are documented in the transport record. glm-5.2 joined the sweep and averaged 55.6 over its first 4 episodes before its queue was paused; it gets a row when its cells complete. Sweep v1's per-model L3 numbers were construct-invalid (the interaction tier was briefed with the single-factor instruction) — those 129 episodes are archived in-repo, and every L3 cell above comes from the tier-fair re-run.

What the evaluation found

The reference gap survives every new model. Recomputed per model on v2-valid completed cells: the curated-value sweep leads by +25 to +33 points (paired Wilcoxon per model over 16–30 tasks, all Holm-corrected p < 0.001). The two models with all three tiers complete land at 60.7 (haiku) and 58.0 (gemini) against the reference's 90.8.
The L1/L2 split is now extreme. gpt-5.6-sol posts the best L1 mean recorded (76.7) and near-floor L2 solving (1 of 30 episodes); claude-fable-5 ties opus on L1 mean with the best L1 solve rate on the board (24/30) and collapses on L2 (3/22). Identifying which parameter changed and reading which way it pushes the metric are different skills, and the second one is where frontier models fail.
The tier-fair L3 re-run vindicated the v0.6 construct fix. Under the corrected factorial briefing glm-5.1's L3 jumped from a construct-invalid 53.2 to 66.7 with 15/21 solved — the best model L3 recorded — while haiku (59.5) and gemini (44.1) barely moved. The v1 instruction was masking a real capability difference, not adding noise.
The hard-task wall fell. In v1, both L1 tasks rated hard by statistical power defeated all six models. claude-fable-5 solved swarm-101 in all three episodes and swarm-202 in two, plus two hard-rated L2 tasks; gpt-5.6-sol cracked a hard L1 and a hard L3. The power-based rating still orders difficulty — the ceiling just moved.
Solve rate and score keep diverging — by design. fable-5 has the highest model solve rate on the board (52%) but pays heavy method taxes (chronic over-testing, probe-leaning submissions), landing its recorded-episode mean at 58.8. haiku still converts a 36% solve rate into the best complete overall through clean, fully backed process.

Failure fingerprints (per model, over its recorded v2 episodes)

model	episodes	solved	wrong param	direction errors	over-tested	p-hacked	probe-only
gpt-5.6-sol (codex)	73	25	25	21	17	1	0
claude-opus-4.8	60	22	15	19	22	3	0
claude-fable-5	52	27	12	6	26	0	5
gpt-5.5 (codex)	60	22	22	16	24	0	0
glm-5.1	81	39	21	14	12	1	0
gemini-3.1-pro	90	44	27	12	42	1	5
claude-haiku-4.5	90	32	33	17	11	0	0
claude-sonnet-4.6	69	26	31	11	16	0	1

p-hacked = the audit caught a submission that reads significant alone but survives only through redundant testing, failing a Holm correction across the episode's full test family. probe-only = the submission rests on matching the hidden world's output rather than an isolating controlled experiment. over-tested = redundant isolating tests (re-testing a parameter, or running more tests than candidates). The algorithmic solvers trigger zero flags; among models, process discipline is not ordered by capability tier or price — the cheapest model (haiku) still runs the cleanest process, gemini's chronic over-testing (42/90) replicates across scaffolds, and the newest, largest model on the board has the most distinctive fingerprint: claude-fable-5 measures compulsively (over-tests in half its episodes), leans on probes for its answers (5 probe-only submissions), and never once p-hacks — a cautious empiricist that pays its score away in method taxes.
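
The over-tested definition above can be read as a simple predicate. This is a minimal sketch, assuming the audit can see which parameter each isolating experiment targeted and how many candidate parameters the task exposes; `over_tested` and its arguments are hypothetical names, not the audit's actual interface.

```python
from collections import Counter


def over_tested(tested_params, n_candidates):
    """Flag redundant isolating tests.

    tested_params: the parameter isolated by each experiment, in call order.
    n_candidates:  how many candidate parameters the task exposes (assumed known).
    """
    counts = Counter(tested_params)
    retested = any(c > 1 for c in counts.values())   # same parameter tested twice
    too_many = len(tested_params) > n_candidates      # more tests than candidates
    return retested or too_many


print(over_tested(["chartFrac", "chartFrac"], n_candidates=5))  # re-test
print(over_tested(["chartFrac", "noiseSigma"], n_candidates=5))  # clean
```

`noiseSigma` is an invented parameter name used only for illustration.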

How it works

Mystery worlds

Each task seeds one of five validated simulations (market, swarm, origin, morph, social) and secretly changes exactly one parameter (two at L3) from a revealed control config. The change is empirically verified significant at generation; decoys are verified inert. Deterministic per seed; byte-stable and CI-enforced.
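
To illustrate "deterministic per seed", here is a minimal sketch of a seeded hidden-parameter draw. It assumes a control config given as a flat dict and a legal range per parameter; `make_task`, `legal_ranges`, and every parameter name except `chartFrac` are hypothetical. It also omits the generation-time step the text describes, where the change is verified significant and decoys are verified inert.

```python
import random


def make_task(seed, control, legal_ranges, n_changed=1):
    """Draw a hidden world that differs from control in n_changed parameters."""
    rng = random.Random(seed)          # private RNG: same seed, same task
    names = sorted(control)            # fixed order, independent of dict order
    changed = rng.sample(names, n_changed)
    hidden = dict(control)
    for name in changed:
        lo, hi = legal_ranges[name]
        hidden[name] = rng.uniform(lo, hi)
    return {"control": dict(control), "hidden": hidden, "changed": sorted(changed)}


control = {"chartFrac": 0.3, "noiseSigma": 0.1, "herding": 0.5}
ranges = {"chartFrac": (0.0, 1.0), "noiseSigma": (0.0, 0.5), "herding": (0.0, 1.0)}

task_a = make_task(101, control, ranges)
task_b = make_task(101, control, ranges)
print(task_a == task_b)  # identical draws for identical seeds
```

Passing `n_changed=2` mirrors the two-parameter change at L3. Because the ground truth comes from the seed rather than from any published source, it cannot be looked up, only recovered by experiment.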

A blind, budgeted harness

The agent gets four tools and 8 calls: experiment(A, B, metric) — a replicated controlled A/B returning statistics only (Mann-Whitney U, Holm-adjusted p, Cliff's delta; configs never echoed); probe(guess, metric) — compare a guess against the hidden world (deliberately confoundable); claim; submit.
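
For readers unfamiliar with the statistics that experiment returns, the sketch below computes the Mann-Whitney U statistic and Cliff's delta from two small samples using their textbook definitions. It is not the harness's implementation: the sample values are invented, the function names are hypothetical, and the p-value and Holm adjustment are left out.

```python
def mann_whitney_u(a, b):
    """U for sample a: count of pairs with a > b, ties counted as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def cliffs_delta(a, b):
    """Effect size in [-1, 1]: P(a > b) - P(a < b) over all pairs."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))


# Invented replicate metrics for configs A and B.
a = [0.31, 0.35, 0.40, 0.42]
b = [0.22, 0.25, 0.31, 0.33]

u = mann_whitney_u(a, b)
delta = cliffs_delta(a, b)
print(u, delta)  # 14.5 0.8125
```

The two quantities carry the same information at different scales: delta equals 2U/(nm) - 1, where n and m are the sample sizes. Cliff's delta is the easier one to read as a direction and magnitude.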

Objective scoring

Correctness + method rigor — points only when the submitted parameter is backed by a significant experiment isolating exactly that parameter (L3: a genuine 2x2 factorial) — + efficiency. No LLM judge anywhere in the headline metric.
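
The isolation requirement can be sketched as a check on the two configs of an experiment. The assumption here, suggested by the CLI example on this page where configs are passed as overrides such as `{}` and `{"chartFrac":0.6}`, is that each arm is a dict of overrides applied to the control config. All function names and the `noiseSigma` parameter are hypothetical, and the scorer's real logic may differ.

```python
def diff(a, b):
    """Keys whose values differ between two full configs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


def isolates(control, overrides_a, overrides_b, param):
    """True if arms A and B differ in exactly `param` and nothing else."""
    a = {**control, **overrides_a}
    b = {**control, **overrides_b}
    return diff(a, b) == {param}


def is_full_2x2(control, cells, p, q):
    """True if four cells vary only p and q and cover all four level combos."""
    configs = [{**control, **c} for c in cells]
    if len(configs) != 4:
        return False
    varying = set().union(*(diff(configs[0], c) for c in configs[1:]))
    if varying != {p, q}:
        return False
    combos = {(c.get(p), c.get(q)) for c in configs}
    p_levels = {x for x, _ in combos}
    q_levels = {y for _, y in combos}
    return len(combos) == 4 and len(p_levels) == 2 and len(q_levels) == 2


control = {"chartFrac": 0.3, "noiseSigma": 0.1}
print(isolates(control, {}, {"chartFrac": 0.6}, "chartFrac"))
print(isolates(control, {}, {"chartFrac": 0.6, "noiseSigma": 0.2}, "chartFrac"))
```

The second call is the confounded case: two parameters move at once, so a significant result cannot be attributed to either one alone.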

A process-integrity audit

A separate lens re-applies Holm across the episode's whole test family and distinguishes low power (honest, reported) from p-hacking (fishing: redundant tests behind a lone-significant submission). It also catches probe-only "output matching." Headline scores never hide it.
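
The family-wide correction can be illustrated with the standard Holm step-down adjustment. This is a sketch of the textbook procedure with invented p-values, not the audit's code: the real audit also weighs test redundancy to separate p-hacking from honest low power, which this example does not attempt.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


# Invented episode: one test looks significant alone, three others do not.
raw = [0.04, 0.30, 0.45, 0.60]
adj = holm_adjust(raw)
alpha = 0.05

significant_alone = raw[0] < alpha
survives_family = adj[0] < alpha
print(significant_alone, survives_family)  # True False
```

A submission resting on the first test reads as significant in isolation (0.04) but not once the whole four-test family is accounted for (adjusted to about 0.16), which is the pattern the audit is looking for.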

Audit everything

Every number above traces to a self-contained episode artifact {task, log, score, audit, provenance} — the full call log with statistics, the score breakdown, the integrity audit, and a provenance stamp pinning the exact engine versions and commit. Re-score any episode from its log alone.

Browse all episode artifacts · download the full set (zip) · read the tech report

The tech report documents sweep v1 in depth (design, validation, and the L1/L2 episodes, which are unchanged here); this page carries the v2 leaderboard — tier-fair L3, two new frontier models, and the corrected transport notes. The report's v2 revision lands when the paused cells complete.

The instrument underneath

The five simulations are not toys with sliders: each is validated headlessly and reproducibly on every release against an established result from its literature (19/19 checks), including the Vicsek order-disorder transition (swarm), Cont's stylized facts of asset returns (market), Deffuant/Hegselmann-Krause cluster scaling (social), and the Pearson Gray-Scott phase diagram (morph). The same engines power the in-browser playgrounds and the petri-labs-mcp server the bench runs through.

validation 19/19 · deterministic per seed · frozen task specs, CI-enforced · power-based difficulty ratings · self-contained episode artifacts

Run it on your model

Any MCP client

{
  "mcpServers": {
    "petri-labs": {
      "command": "npx",
      "args": ["-y", "petri-labs-mcp"]
    }
  }
}

The bench composes only the public MCP tools — describe_model, run_experiment, run_simulation — so any agent that speaks MCP can attempt the tasks.

The harness CLI

# one task, by hand
bench task market 101 my-run
bench experiment my-run volatility \
  '{}' '{"chartFrac":0.6}'
bench submit my-run chartFrac up
bench score my-run

# a whole model, one command
bench sweep configs/frontier-v1.json
bench report && bench taxonomy

BYOK over HTTP (Anthropic/OpenAI-compatible) or drive a local CLI (codex / gemini / claude) through the same loop. Crash-safe resume; keys never leave your environment.

Cite

Archived at Zenodo: concept DOI 10.5281/zenodo.20618024. Tech report: HTML · PDF.

@software{petri_labs,
  title  = {petri-labs: validated model organisms and a
            contamination-proof benchmark for AI-driven science},
  author = {Sozudogru, Baris},
  doi    = {10.5281/zenodo.20618024},
  url    = {https://petri-labs.org},
  year   = {2026}
}