A contamination-proof benchmark for the scientific method of LLM agents: hypothesis isolation, controlled experimentation, inferential statistics, and calibrated claims — scored objectively, with no LLM judge. Every task's ground truth is a seeded random draw created and verified at generation time, so it cannot exist in any training corpus. The answer must be earned by experiment, not recalled.
Per-tier means over 30 episodes per cell; small print gives the range of the three per-pass means — an honest view of run-to-run variance. Every number traces to a committed episode artifact.
| solver | L1 identify | L2 + magnitude | L3 interactions | overall | solved | avg calls |
|---|---|---|---|---|---|---|
| adaptive (curated values) | 94.8 | 94.2 | 87.5 | 92.2 | 100% | 3.4 |
| ofat (curated values; factorial at L3) | 92.5 | 92.5 | 87.5 | 90.8 | 100% | 4.0 |
| glm-5.1 | 68.0 (61.5–75.3) | 63.9 (57.4–68.0) | 53.2 (47.2–56.6) | 61.7 | 46% | 4.3 |
| claude-opus-4.8 | 72.3 (68.5–76.3) | 57.9 (54.3–64.6) | 55.1 (45.3–65.7) | 61.7 | 43% | 5.3 |
| claude-haiku-4.5 | 63.1 (55.5–76.5) | 59.6 (50.8–66.3) | 60.8 (59.1–62.1) | 61.1 | 37% | 4.5 |
| gpt-5.5 (codex) | 69.9 (66.5–72.5) | 53.0 (47.7–59.8) | 53.9 (50.3–58.4) | 58.9 | 42% | 4.8 |
| claude-sonnet-4.6 | 59.3 (56.8–61.0) | 59.2 (54.4–62.3) | 56.3 (54.6–57.4) | 58.3 | 41% | 4.1 |
| gemini-3.1-pro | 64.4 (52.3–74.5) | 65.5 (59.6–69.1) | 42.0 (29.0–52.9) | 57.3 | 50% | 6.5 |
| ofat-rand (ablation: blind values; lowest of 3 draws, mean 45.2) | 38.5 | 47.8 | 35.2 | 40.5 | 33% | 3.7 |
| random (chance floor) | 10.0 | 19.0 | 28.1 | 19.0 | 3% | 0 |
Score = correctness + method rigor (is the conclusion backed by a significant, isolating controlled experiment?) + efficiency, under a hard budget of 8 calls. Subscription-CLI models ran at their CLIs' default settings; model identities as reported by the CLIs (June 2026).
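The cell layout used in the table (a mean over all episodes in a cell, plus the range of the per-pass means) can be sketched as follows. This is a minimal illustration, assuming each pass contributes an equal-sized list of episode scores; the function name `summarize_cell` and the sample numbers are hypothetical and are not taken from the bench's code or artifacts.

```python
from statistics import mean

def summarize_cell(passes):
    """Summarize one (solver, tier) cell from per-pass episode scores.

    `passes` is a list of lists: one list of episode scores per pass.
    Returns (mean over all episodes, lowest pass mean, highest pass mean).
    """
    all_scores = [s for p in passes for s in p]
    pass_means = [mean(p) for p in passes]
    return mean(all_scores), min(pass_means), max(pass_means)

# Made-up scores: three passes of four episodes each (the bench uses more).
example = [
    [60.0, 70.0, 80.0, 50.0],
    [55.0, 65.0, 75.0, 45.0],
    [70.0, 80.0, 90.0, 60.0],
]
cell_mean, low, high = summarize_cell(example)
print(f"{cell_mean:.1f} {low:.1f}-{high:.1f}")  # prints "66.7 60.0-75.0"
```

Reporting the spread of pass means next to the pooled mean shows how much a single pass could mislead on its own.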
| model | solved | wrong param | direction errors | over-tested | p-hacked | probe-only |
|---|---|---|---|---|---|---|
| glm-5.1 | 41 | 31 | 14 | 16 | 2 | 0 |
| claude-opus-4.8 | 39 | 26 | 19 | 29 | 3 | 1 |
| claude-haiku-4.5 | 33 | 31 | 17 | 11 | 0 | 0 |
| gpt-5.5 (codex) | 38 | 33 | 16 | 32 | 0 | 0 |
| claude-sonnet-4.6 | 37 | 36 | 11 | 17 | 0 | 1 |
| gemini-3.1-pro | 45 | 27 | 12 | 49 | 1 | 4 |
p-hacked = the audit caught a submission that reads significant alone but survives only through redundant testing, failing a Holm correction across the episode's full test family. probe-only = the submission rests on matching the hidden world's output rather than an isolating controlled experiment. over-tested = redundant isolating tests (re-testing a parameter, or running more tests than candidates).
The algorithmic solvers trigger zero flags across 90 episodes. Among models, process discipline is not ordered by capability tier or price: the cheapest model (haiku) runs the cleanest process, and gemini's chronic over-testing (49/90) replicates its signature from the earlier own-scaffold pilot, which suggests the trait belongs to the model, not the prompt.
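The Holm step of the p-hacked flag can be sketched as below. This is a minimal illustration of the standard Holm step-down adjustment, assuming the audit has the raw p-value of every test in the episode; the function names, the `alpha` threshold of 0.05, and the sample p-values are assumptions for illustration. The sketch covers only the "significant alone, not after Holm" condition, not the redundancy criterion or the low-power distinction.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), k = rank + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

def lone_significant(p_values, submitted, alpha=0.05):
    """True if the submitted test passes alone but fails after Holm."""
    adjusted = holm_adjust(p_values)
    return p_values[submitted] < alpha and adjusted[submitted] >= alpha

# Hypothetical episode: five tests, the submission rests on test 0.
family = [0.03, 0.40, 0.55, 0.62, 0.81]
print(holm_adjust(family))
print(lone_significant(family, submitted=0))
```

In this made-up family the submitted test has p = 0.03 on its own, but its Holm-adjusted value is about 0.15 once all five tests are counted, so the check returns `True`.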
Each task seeds one of five validated simulations (market, swarm, origin, morph, social) and secretly changes exactly one parameter (two at L3) from a revealed control config. The change is empirically verified significant at generation; decoys are verified inert. Deterministic per seed; byte-stable and CI-enforced.
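Determinism per seed can be illustrated with a seeded generator. This is a minimal sketch, assuming a hypothetical control config and a simple doubling/halving perturbation; `make_task`, the parameter names other than `chartFrac`, and the perturbation rule are illustrative only. The sketch does not perform the significance and decoy verification described above.

```python
import random

# Hypothetical control config; only `chartFrac` appears in the bench's docs.
CONTROL = {"chartFrac": 0.3, "noiseLevel": 0.1, "agentCount": 200}

def make_task(sim, seed, n_changes=1):
    """Draw a hidden change from a seeded RNG: same (sim, seed) gives same task."""
    rng = random.Random(f"{sim}:{seed}")
    # Sort the keys so the draw does not depend on dict ordering.
    changed = rng.sample(sorted(CONTROL), n_changes)
    hidden = dict(CONTROL)
    truth = {}
    for name in changed:
        direction = rng.choice(["up", "down"])
        factor = 2.0 if direction == "up" else 0.5
        hidden[name] = CONTROL[name] * factor
        truth[name] = direction
    return {"control": dict(CONTROL), "hidden": hidden, "truth": truth}

a = make_task("market", 101)
b = make_task("market", 101)
print(a["truth"], a == b)
```

Calling `make_task("market", 101, n_changes=2)` would mimic the two-parameter L3 setup under the same assumptions.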
The agent gets four tools and 8 calls: experiment(A, B, metric) — a replicated controlled A/B returning statistics only (Mann-Whitney U, Holm-adjusted p, Cliff's delta; configs never echoed); probe(guess, metric) — compare a guess against the hidden world (deliberately confoundable); claim; submit.
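For readers unfamiliar with the statistics returned by `experiment`, the sketch below computes the Mann-Whitney U statistic and Cliff's delta from two samples by direct pair counting. It is a plain-Python illustration with made-up metric values, not the bench's implementation, and it omits the p-value and the Holm adjustment.

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a`: pairs where a > b, ties counted as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def cliffs_delta(a, b):
    """Cliff's delta in [-1, 1]: P(a > b) - P(a < b) over all pairs."""
    gt = sum(1 for x in a for y in b if x > y)
    lt = sum(1 for x in a for y in b if x < y)
    return (gt - lt) / (len(a) * len(b))

# Made-up metric values from replicated runs of two configs.
control = [1.0, 1.2, 0.9, 1.1, 1.0]
variant = [1.4, 1.6, 1.3, 1.5, 1.1]
u = mann_whitney_u(variant, control)
d = cliffs_delta(variant, control)
print(u, d)  # prints "23.5 0.88"
```

The two quantities are linked by delta = 2U/(nm) - 1, where n and m are the sample sizes, so a delta near +1 or -1 means almost every run of one config beats every run of the other.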
The score is correctness + method rigor + efficiency. Rigor points are awarded only when the submitted parameter is backed by a significant experiment isolating exactly that parameter (at L3, a genuine 2x2 factorial). No LLM judge anywhere in the headline metric.
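The isolation requirement can be sketched as a check on the two configs passed to an experiment. This is a minimal illustration, assuming configs are flat override dictionaries compared literally (a missing key means "control value"); the function names and the `noiseLevel` parameter are hypothetical, and the significance test itself is not shown.

```python
def changed_keys(a, b):
    """Parameters whose values differ between two config overrides."""
    return {k for k in set(a) | set(b) if a.get(k) != b.get(k)}

def isolates(a, b, param):
    """True if configs A and B differ in `param` and nothing else."""
    return changed_keys(a, b) == {param}

def is_2x2_factorial(cells, p, q):
    """True if four configs cover both levels of p and q and vary nothing else."""
    if len(cells) != 4:
        return False
    others = [{k: v for k, v in c.items() if k not in (p, q)} for c in cells]
    if any(o != others[0] for o in others):
        return False
    p_levels = {c.get(p) for c in cells}
    q_levels = {c.get(q) for c in cells}
    combos = {(c.get(p), c.get(q)) for c in cells}
    return len(p_levels) == 2 and len(q_levels) == 2 and len(combos) == 4

print(isolates({}, {"chartFrac": 0.6}, "chartFrac"))                     # True
print(isolates({}, {"chartFrac": 0.6, "noiseLevel": 0.2}, "chartFrac"))  # False
```

A factorial check on `[{}, {"chartFrac": 0.6}, {"noiseLevel": 0.2}, {"chartFrac": 0.6, "noiseLevel": 0.2}]` passes because all four combinations of the two levels are present and no other parameter varies.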
A separate lens re-applies Holm across the episode's whole test family and distinguishes low power (honest, reported) from p-hacking (fishing: redundant tests behind a lone-significant submission). It also catches probe-only "output matching." Headline scores never hide it.
Every number above traces to a self-contained episode artifact {task, log, score, audit, provenance}: the full call log with statistics, the score breakdown, the integrity audit, and a provenance stamp pinning the exact engine versions and commit. Re-score any episode from its log alone.
Browse all episode artifacts · download the full set (zip) · read the tech report
The five simulations are not toys with sliders: each is validated, headlessly and reproducibly on every release, against an established result from its literature (19/19 checks). These include the Vicsek order-disorder transition (swarm), Cont's stylized facts of asset returns (market), Deffuant/Hegselmann-Krause cluster scaling (social), and the Pearson Gray-Scott phase diagram (morph). The same engines power the in-browser playgrounds and the petri-labs-mcp server the bench runs through.
validation 19/19 · deterministic per seed · frozen task specs, CI-enforced · power-based difficulty ratings · self-contained episode artifacts
{
  "mcpServers": {
    "petri-labs": {
      "command": "npx",
      "args": ["-y", "petri-labs-mcp"]
    }
  }
}
The bench composes only the public MCP tools — describe_model, run_experiment, run_simulation — so any agent that speaks MCP can attempt the tasks.
# one task, by hand
bench task market 101 my-run
bench experiment my-run volatility \
  '{}' '{"chartFrac":0.6}'
bench submit my-run chartFrac up
bench score my-run
# a whole model, one command
bench sweep configs/frontier-v1.json
bench report && bench taxonomy
BYOK over HTTP (Anthropic/OpenAI-compatible) or drive a local CLI (codex / gemini / claude) through the same loop. Crash-safe resume; keys never leave your environment.
Archived at Zenodo: concept DOI 10.5281/zenodo.20618024. Tech report: HTML · PDF.
@software{petri_labs,
  title  = {petri-labs: validated model organisms and a
            contamination-proof benchmark for AI-driven science},
  author = {Sozudogru, Baris},
  doi    = {10.5281/zenodo.20618024},
  url    = {https://petri-labs.org},
  year   = {2026}
}