
petri-bench — can your model do science?

A contamination-proof benchmark of how LLM agents apply the scientific method: hypothesis isolation, controlled experimentation, inferential statistics, and calibrated claims. Scoring is objective, with no LLM judge. Every task's ground truth is a seeded random draw created and verified at generation time, so it cannot exist in any training corpus. The answer must be earned by experiment, not recalled.

Headline result: across 540 standardized episodes (six frontier models, 30 tasks, three episodes each, identical agent loop), a one-factor-at-a-time sweep with informative test values beats every frontier model by 29+ points (Wilcoxon over paired tasks, all p < 2×10⁻⁵). The sweep solves 100% of tasks; no model solves more than 50%. The ablation that locates the gap: the same sweep with test values drawn blindly from each parameter's legal range scores 40.5–47.7 across three value draws (mean 45.2), below every model's mean in every draw. Intervention design and procedural discipline are both first-order: models choose values better than chance, then give the advantage back through procedure (skipped measurements, missing factorials, p-hacks), all visible in the logs.

Leaderboard (standardized sweep v1, n=3 episodes per task)

Per-tier means over 30 episodes per cell; the range in parentheses spans the three per-pass means, an honest view of run-to-run variance. Every number traces to a committed episode artifact.

| solver | L1 identify | L2 + magnitude | L3 interactions | overall | solved | avg calls |
| --- | --- | --- | --- | --- | --- | --- |
| adaptive (curated values) | 94.8 | 94.2 | 87.5 | 92.2 | 100% | 3.4 |
| ofat (curated values; factorial at L3) | 92.5 | 92.5 | 87.5 | 90.8 | 100% | 4.0 |
| glm-5.1 | 68.0 (61.5–75.3) | 63.9 (57.4–68.0) | 53.2 (47.2–56.6) | 61.7 | 46% | 4.3 |
| claude-opus-4.8 | 72.3 (68.5–76.3) | 57.9 (54.3–64.6) | 55.1 (45.3–65.7) | 61.7 | 43% | 5.3 |
| claude-haiku-4.5 | 63.1 (55.5–76.5) | 59.6 (50.8–66.3) | 60.8 (59.1–62.1) | 61.1 | 37% | 4.5 |
| gpt-5.5 (codex) | 69.9 (66.5–72.5) | 53.0 (47.7–59.8) | 53.9 (50.3–58.4) | 58.9 | 42% | 4.8 |
| claude-sonnet-4.6 | 59.3 (56.8–61.0) | 59.2 (54.4–62.3) | 56.3 (54.6–57.4) | 58.3 | 41% | 4.1 |
| gemini-3.1-pro | 64.4 (52.3–74.5) | 65.5 (59.6–69.1) | 42.0 (29.0–52.9) | 57.3 | 50% | 6.5 |
| ofat-rand (ablation: blind values; lowest of 3 draws, mean 45.2) | 38.5 | 47.8 | 35.2 | 40.5 | 33% | 3.7 |
| random (chance floor) | 10.0 | 19.0 | 28.1 | 19.0 | 3% | 0 |

Score = correctness + method rigor (is the conclusion backed by a significant, isolating controlled experiment?) + efficiency, under a hard budget of 8 calls. Subscription-CLI models ran at their CLIs' default settings; model identities as reported by the CLIs (June 2026).

What the evaluation found

Failure fingerprints (per model, 90 episodes each)

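| model | solved | wrong param | direction errors | over-tested | p-hacked | probe-only |
| --- | --- | --- | --- | --- | --- | --- |
| glm-5.1 | 41 | 31 | 14 | 16 | 2 | 0 |
| claude-opus-4.8 | 39 | 26 | 19 | 29 | 3 | 1 |
| claude-haiku-4.5 | 33 | 31 | 17 | 11 | 0 | 0 |
| gpt-5.5 (codex) | 38 | 33 | 16 | 32 | 0 | 0 |
| claude-sonnet-4.6 | 37 | 36 | 11 | 17 | 0 | 1 |
| gemini-3.1-pro | 45 | 27 | 12 | 49 | 1 | 4 |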

p-hacked = the audit caught a submission that reads significant alone but survives only through redundant testing, failing a Holm correction across the episode's full test family.

probe-only = the submission rests on matching the hidden world's output rather than an isolating controlled experiment.

over-tested = redundant isolating tests (re-testing a parameter, or running more tests than candidates).

The algorithmic solvers trigger zero flags across 90 episodes. Among models, process discipline is not ordered by capability tier or price: the cheapest model (haiku) runs the cleanest process, and gemini's chronic over-testing (49/90) replicates its signature from the earlier own-scaffold pilot, so the trait belongs to the model, not the prompt.
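
The over-tested definition can be read as a simple rule over the parameters an episode's isolating tests targeted. The sketch below is a minimal illustration under that reading; the function name, its inputs, and the `noise` parameter are hypothetical and are not taken from the bench's audit code.

```python
from collections import Counter

def over_tested(tested_params, n_candidates):
    """Flag redundant isolating tests, per the definition in the text:
    a parameter tested more than once, or more tests than candidates.
    (Hypothetical helper; the real audit may apply further conditions.)"""
    repeated = any(count > 1 for count in Counter(tested_params).values())
    return repeated or len(tested_params) > n_candidates

# One isolating test per parameter, within the candidate count: clean.
clean = over_tested(["chartFrac", "noise"], n_candidates=4)

# chartFrac is tested twice: redundant.
redundant = over_tested(["chartFrac", "noise", "chartFrac"], n_candidates=4)
```

Here `clean` is False and `redundant` is True.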

How it works

Mystery worlds

Each task seeds one of five validated simulations (market, swarm, origin, morph, social) and secretly changes exactly one parameter (two at L3) from a revealed control config. The change is empirically verified significant at generation; decoys are verified inert. Deterministic per seed; byte-stable and CI-enforced.
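
To make "deterministic per seed" concrete, here is a minimal sketch of a seeded draw that changes exactly one parameter of a control config. The control values, legal ranges, the `noise` and `herding` names, and `make_task` itself are assumptions for illustration; the sketch also omits the significance and decoy-inertness checks the bench runs at generation time.

```python
import random

# Hypothetical control config and legal ranges, for illustration only.
CONTROL = {"chartFrac": 0.3, "noise": 0.1, "herding": 0.5}
LEGAL = {"chartFrac": (0.0, 1.0), "noise": (0.0, 0.5), "herding": (0.0, 1.0)}

def make_task(seed):
    """Draw one hidden parameter change; the same seed gives the same task."""
    rng = random.Random(seed)             # private generator, no global state
    param = rng.choice(sorted(CONTROL))   # sorted() fixes the candidate order
    lo, hi = LEGAL[param]
    value = CONTROL[param]
    while value == CONTROL[param]:        # redraw until the value really changes
        value = round(rng.uniform(lo, hi), 3)
    direction = "up" if value > CONTROL[param] else "down"
    return {"param": param, "direction": direction,
            "hidden": {**CONTROL, param: value}}

task = make_task(101)
# Exactly one parameter differs from the revealed control config.
changed = [k for k in CONTROL if task["hidden"][k] != CONTROL[k]]
```

Because the generator is seeded and private, calling `make_task` twice with the same seed returns identical tasks, and `changed` always holds the single altered parameter.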

A blind, budgeted harness

The agent gets four tools and 8 calls: experiment(A, B, metric) — a replicated controlled A/B returning statistics only (Mann-Whitney U, Holm-adjusted p, Cliff's delta; configs never echoed); probe(guess, metric) — compare a guess against the hidden world (deliberately confoundable); claim; submit.
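
For readers unfamiliar with the statistics that `experiment` reports, the sketch below computes the Mann-Whitney U statistic and Cliff's delta from two small samples using their textbook pairwise definitions. It is not the harness's implementation: the sample values are made up, and the p-value and Holm adjustment are left out.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: each pair with a > b counts 1, each tie 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b), ranging from -1 to 1."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

# Made-up replicate measurements of one metric under configs A and B.
control = [1.0, 1.2, 0.9, 1.1, 1.0]
treated = [1.4, 1.6, 1.3, 1.5, 1.1]

u = mann_whitney_u(treated, control)    # 23 wins + 1 tie out of 25 pairs = 23.5
delta = cliffs_delta(treated, control)  # (23 - 1) / 25 = 0.88
```

The two quantities are linked by delta = 2U/(nm) - 1, so the effect size can be read straight off the U statistic and the sample sizes.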

Objective scoring

Score = correctness + method rigor + efficiency. Method rigor earns points only when the submitted parameter is backed by a significant experiment isolating exactly that parameter (at L3, a genuine 2×2 factorial). No LLM judge anywhere in the headline metric.
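
The rigor condition turns on what "isolating exactly that parameter" means. The sketch below is one way to check it, assuming configs are override dictionaries relative to the control (as in `'{}'` versus `'{"chartFrac":0.6}'`) and that each log entry carries an adjusted p-value. The log schema (`a`, `b`, `p_adj`), the 0.05 threshold, and the `noise` parameter are hypothetical, and the L3 factorial check is not covered.

```python
def differing_keys(a, b):
    """Parameters whose values differ between two override configs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

def isolates(a, b, param):
    """True when configs A and B differ in exactly the named parameter."""
    return differing_keys(a, b) == {param}

def is_backed(log, param, alpha=0.05):
    """True if some logged experiment isolates `param` and is significant."""
    return any(isolates(e["a"], e["b"], param) and e["p_adj"] < alpha
               for e in log)

log = [
    # Confounded: two parameters change at once, so nothing is isolated.
    {"a": {}, "b": {"chartFrac": 0.6, "noise": 0.2}, "p_adj": 0.001},
    # Isolating and significant.
    {"a": {}, "b": {"chartFrac": 0.6}, "p_adj": 0.004},
    # Isolating but not significant.
    {"a": {}, "b": {"noise": 0.2}, "p_adj": 0.40},
]

chart_backed = is_backed(log, "chartFrac")
noise_backed = is_backed(log, "noise")
```

In this example `chart_backed` is True and `noise_backed` is False: the confounded first test cannot back either parameter, however small its p-value.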

A process-integrity audit

A separate lens re-applies Holm across the episode's whole test family and distinguishes low power (honest, reported) from p-hacking (fishing: redundant tests behind a lone-significant submission). It also catches probe-only "output matching." Headline scores never hide these flags.
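
The Holm step of that audit can be sketched with the standard step-down procedure. `holm_adjust` below is the textbook adjustment; `lone_significant`, the 0.05 threshold, and the example p-values are assumptions for illustration, and the sketch leaves out the redundancy check that separates p-hacking from low power.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

def lone_significant(pvals, submitted, alpha=0.05):
    """True when the submitted test passes alone but fails Holm over the family."""
    return pvals[submitted] < alpha and holm_adjust(pvals)[submitted] >= alpha

# Four tests in one episode; the submission rests on the first (raw p = 0.03).
family = [0.03, 0.20, 0.45, 0.60]
flagged = lone_significant(family, submitted=0)   # 4 * 0.03 = 0.12 >= 0.05

# A strong result survives the correction and is not flagged.
survives = lone_significant([0.001, 0.30], submitted=0)
```

Here `flagged` is True and `survives` is False. A submission whose raw p-value is already above the threshold is never flagged by this check, which matches the distinction between low power and fishing.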

Audit everything

Every number above traces to a self-contained episode artifact {task, log, score, audit, provenance} — the full call log with statistics, the score breakdown, the integrity audit, and a provenance stamp pinning the exact engine versions and commit. Re-score any episode from its log alone.
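
Before re-scoring, it can help to confirm that an artifact carries all five sections named above. The sketch below checks only the top-level keys; the contents of each section and the helper name are hypothetical, since the artifact schema is not spelled out here.

```python
import json

# The five top-level sections listed in the text.
REQUIRED = {"task", "log", "score", "audit", "provenance"}

def missing_sections(artifact_text):
    """Return the required top-level sections absent from an episode artifact."""
    artifact = json.loads(artifact_text)
    return REQUIRED - set(artifact)

# A made-up, incomplete artifact for illustration.
partial = '{"task": {}, "log": []}'
missing = missing_sections(partial)
```

For the incomplete example, `missing` contains `score`, `audit`, and `provenance`; a complete artifact yields an empty set.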

Browse all episode artifacts · download the full set (zip) · read the tech report

The instrument underneath

The five simulations are not toys with sliders. Each is validated, headlessly and reproducibly on every release, against an established result from its literature (19/19 checks): the Vicsek order-disorder transition (swarm), Cont's stylized facts of asset returns (market), Deffuant/Hegselmann-Krause cluster scaling (social), and the Pearson Gray-Scott phase diagram (morph). The same engines power the in-browser playgrounds and the petri-labs-mcp server the bench runs through.

validation 19/19 · deterministic per seed · frozen task specs, CI-enforced · power-based difficulty ratings · self-contained episode artifacts

Run it on your model

Any MCP client

{
  "mcpServers": {
    "petri-labs": {
      "command": "npx",
      "args": ["-y", "petri-labs-mcp"]
    }
  }
}

The bench composes only the public MCP tools — describe_model, run_experiment, run_simulation — so any agent that speaks MCP can attempt the tasks.

The harness CLI

# one task, by hand
bench task market 101 my-run
bench experiment my-run volatility \
  '{}' '{"chartFrac":0.6}'
bench submit my-run chartFrac up
bench score my-run

# a whole model, one command
bench sweep configs/frontier-v1.json
bench report && bench taxonomy

BYOK over HTTP (Anthropic/OpenAI-compatible) or drive a local CLI (codex / gemini / claude) through the same loop. Crash-safe resume; keys never leave your environment.

Cite

Archived at Zenodo: concept DOI 10.5281/zenodo.20618024. Tech report: HTML · PDF.

@software{petri_labs,
  title  = {petri-labs: validated model organisms and a
            contamination-proof benchmark for AI-driven science},
  author = {Sozudogru, Baris},
  doi    = {10.5281/zenodo.20618024},
  url    = {https://petri-labs.org},
  year   = {2026}
}