Working draft (v0.5). Section 7 reports the complete standardized evaluation — six frontier models x 30 tasks x three episodes (540 episodes, all committed) — plus the privileged-value ablation and headline significance tests added in response to an adversarial cross-model review round (GPT-5.5 and Gemini 3.1 Pro as referees; their major findings drove the v0.5 revisions, including a retraction noted in Section 7.2). Published as a preprint via Zenodo and at petri-labs.org/bench; an arXiv (cs.AI) posting follows once endorsement is in place, and a data-and-benchmarks venue version is planned.
Benchmarks for AI “scientist” agents predominantly test the rediscovery of published results, which rewards memorization: a model can succeed by recalling the answer rather than earning it. We present petri-bench, a benchmark that scores the scientific method of an LLM agent — hypothesis isolation, controlled experimentation, inferential statistics, and calibrated claims — on procedurally generated causal-discovery tasks whose ground truth is created at generation time and therefore could not have appeared in any training corpus before evaluation. Each task is a seeded “mystery world” built on one of five deterministic multi-agent simulations: exactly one hidden parameter (two, at the hardest tier) is changed from a revealed control configuration, and the agent must identify the change through a budgeted, blind experiment harness that returns inferential statistics only. Scoring is objective and reproducible — no LLM judge in the headline metric — and decomposes into correctness, method rigor (whether the submitted conclusion is backed by a significant, isolating controlled experiment), and efficiency. A separate process-integrity audit flags p-hacking (submissions that survive only through uncorrected multiple comparisons after redundant testing) and probe-only “output matching.” Tasks carry statistically grounded difficulty ratings, and the simulation layer is validated against established results from four literatures (19/19 checks). In a standardized 540-episode evaluation — six frontier models (GPT, Gemini, GLM, and three Claude tiers), three episodes per task, all through an identical agent loop — a one-factor-at-a-time reference sweep with curated test values outscored every model by 29 points or more (Wilcoxon signed-rank over paired tasks, all p < 2x10^-5), solved 100% of tasks, while no model solved above 50% (best: 45/90 episodes). 
An ablation locates the gap: the same sweep with test values drawn blindly from each parameter’s legal range scores 40.5-47.7 across three independent value draws (mean 45.2) — below every model’s mean in every draw — so the reference’s advantage decomposes into informative intervention design plus disciplined procedure, and the models sit between the blind and informed variants: their intervention choices carry real information, but their procedure gives the advantage back. The procedure failures are direct, log-verified, and value-independent: in 89 episodes a model identified the changed parameter and submitted the wrong effect direction — 61% of those never directly measured the determining contrast, and the rest measured it and contradicted their own significant evidence. The benchmark’s two axes separate cleanly: the model with the highest solve rate posts the lowest score (chronic over-testing, probe-only shortcuts), while the smallest Claude tier converts the lowest solve rate into a near-top score through clean process. A process-integrity audit catches eleven violations (p-hacking, probe-only “output matching”) that headline scores alone would hide. All episode artifacts are self-contained and auditable; tasks, code, and results are archived (DOI: 10.5281/zenodo.20618024).
Two observations motivate this benchmark.
First, evaluation is now the bottleneck for AI-driven science. Agent systems write papers that pass workshop peer review (Yamada et al., 2025), and public platforms run research agents as a product category. The open question is no longer whether such agents can produce scientific artifacts but whether they can do science: choose the right experiment, control what must be controlled, reason correctly from noisy data, and claim only what their evidence supports.
Second, the dominant evaluation paradigm cannot answer that question. Benchmarks built from published studies — data-driven discovery suites and rediscovery tasks alike — are contaminated by construction: the target findings, and frequently the exact datasets, are in the training corpora of the models being evaluated (Reddy & Shojaee, 2024). The field’s own surveys warn that rediscovery benchmarks reward memorization over method. Purpose-built discovery environments avoid contamination but are text-game-like, single-domain, or fixed-set (Jansen et al., 2024; Koblischke et al., 2025), so they measure exploration in one world rather than statistical method across domains.
petri-bench takes a third path: procedural generation of causal-discovery tasks over deterministic simulations, with the ground truth created — and empirically verified — at generation time. Because the hidden change is a seeded random draw, the answer to any specific task instance cannot have existed in any training corpus before its generation (and fresh instances regenerate on demand once a set is published; Section 3.2). Because the simulations are deterministic given a seed, every result is exactly reproducible. And because the agent’s full call log is recorded, the process can be scored, not just the answer.
| Benchmark | Task substrate | Contamination resistance | Multi-domain | Statistical-method scoring | Process audit |
|---|---|---|---|---|---|
| ScienceWorld (Wang et al., 2022) | text game, elementary science | fixed public set | no (one world) | no | no |
| DiscoveryBench (Majumder et al., 2024) | published datasets/findings (+ synthetic split) | targets are published | yes | partial (workflow match) | no |
| ScienceAgentBench (Chen et al., 2024) | published data-driven tasks | targets are published | yes | output-based | no |
| DiscoveryWorld (Jansen et al., 2024) | virtual discovery game | novel worlds, fixed public set | themed worlds | partial | no |
| Gravity-Bench (Koblischke et al., 2025) | gravitational dynamics | simulated, fixed public set | no (one domain) | partial | no |
| petri-bench | 5 deterministic simulations | procedural; fresh instances on demand | yes (5 domains) | yes (isolation + inference) | yes (p-hacking, output-matching) |
Column definitions: contamination resistance describes whether task answers could, in principle, be recovered from training data at evaluation time — any fixed public set, including ours once frozen and released, is recoverable by later models; the operative distinction is whether fresh, verified instances can be generated on demand (Section 9). Statistical-method scoring asks whether the score depends on how conclusions were established (isolation, inference), not only on the final answer. Characterizations follow the cited versions of each benchmark; we welcome corrections from their authors.
Law-discovery suites in single physical domains (e.g., symbolic-regression-style benchmarks) share the contamination-resistance of simulation but score the recovered equation, not the experimental method. petri-bench’s contribution is the combination: procedurally generated ground truth across qualitatively different domains, scored on whether the agent practiced sound statistical method, with an explicit audit for the characteristic failure modes of LLM experimenters.
Tasks are built over five deterministic, headless simulations (“model
organisms”), each from a different complex-systems literature, exposed
to agents through the published petri-labs-mcp tool
server:
| Simulation | Domain | Target metric | Replicates |
|---|---|---|---|
| market | heterogeneous-agent asset market | volatility | 12 |
| swarm | flocking / collective motion | polarization order parameter | 12 |
| origin | evolutionary population dynamics | final population | 12 |
| morph | Gray-Scott reaction-diffusion | pattern blob count | 12 |
| social | bounded-confidence opinion dynamics | opinion cluster count (polarization at L3) | 12 |
Each simulation is deterministic per seed; paired per-replicate seeds make every reported statistic bit-reproducible. Section 6 validates that these engines reproduce established results from their respective literatures.
A task is a seeded mystery world. From a revealed control
configuration, the generator draws one hidden parameter change (the
driver) from a per-simulation pool, plus two decoy
candidates verified inert at generation time: the driver’s effect on the
target metric must test significant (Mann-Whitney U with Holm correction
across the metric vector) and each decoy’s must not. A task therefore
ships with an empirically verified ground truth —
{parameter, hidden value, direction of effect} — that did
not exist before generation. The contamination claim is time-scoped: the
evaluations in this report ran before the frozen sets and their answers
were published, so no evaluated model could have trained on them; once
released, the frozen sets are recoverable like any public benchmark, and
the durable defense is procedural — fresh, verified instances regenerate
from new seeds on demand, and a leaderboard evaluation can hold its
seeds private (Section 9). Tasks are frozen as JSON specs; regenerating
any frozen task from its seed reproduces it byte-for-byte (enforced in
CI).
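The seed-to-spec determinism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's generator: `POOL`, `generate_task`, `spec_bytes`, and `spec_digest` are hypothetical names, the parameter pool is invented, and the empirical verification step (driver significant, decoys inert) is omitted.

```python
import hashlib
import json
import random

# Hypothetical per-simulation pool: parameter name -> candidate hidden values.
POOL = {
    "alignment": [0.2, 0.8],
    "noise": [0.05, 0.4],
    "speed": [0.5, 2.0],
    "radius": [1.0, 4.0],
}


def generate_task(sim: str, seed: int) -> dict:
    """Draw one driver and two decoy candidates from a seeded RNG."""
    rng = random.Random(f"{sim}:{seed}")  # all randomness flows from the seed
    names = rng.sample(sorted(POOL), 3)   # sorted() fixes the iteration order
    driver, decoys = names[0], names[1:]
    return {
        "sim": sim,
        "seed": seed,
        "driver": driver,
        "hidden_value": rng.choice(POOL[driver]),
        "candidates": sorted(names),      # sorted so list order reveals nothing
        "decoys": sorted(decoys),
    }


def spec_bytes(task: dict) -> bytes:
    """Canonical JSON, so equal tasks serialize to equal bytes."""
    return json.dumps(task, sort_keys=True, separators=(",", ":")).encode()


def spec_digest(task: dict) -> str:
    """SHA-256 of the canonical spec, convenient for a CI comparison."""
    return hashlib.sha256(spec_bytes(task)).hexdigest()
```

Regenerating with the same `(sim, seed)` pair yields identical bytes, which is the property a CI check of this kind would compare against the frozen spec.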
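Three difficulty tiers:

- L1: one parameter is changed; the agent identifies which one and whether it pushes the target metric up or down.
- L2: as L1, plus the magnitude class of the relative change (small 10-35%, medium 35-75%, large >=75%).
- L3: two parameters are changed together; the agent identifies both and the sign of their interaction.
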
The frozen sets are core-v0 (10 L1 tasks: five
simulations x two seeds), l2-v0 (10), and
l3-v0 (10), plus the original six-task
pilot-v0.
Agents interact with a task only through four tools, under a hard budget of 8 calls:

- experiment(configA, configB, metric): a replicated controlled A/B between two configurations of the agent’s choosing. Returns inferential statistics only: group means, relative change, Mann-Whitney U with Holm-adjusted p, significance verdict, and Cliff’s delta effect size. The resolved configurations are never echoed back. (The unpaired Mann-Whitney U is the deliberately conservative default; the paired per-replicate seeds primarily serve bit-reproducibility, and a paired-test variant is future work.)
- probe(guess, metric): compares the agent’s guessed configuration against the hidden world, returning the same statistics. A non-significant result means the guess is metric-indistinguishable from the truth. Probes support hypothesis refinement but, deliberately, are confoundable: a decoy can reproduce the hidden world’s metric level without being the changed parameter (Section 7.3).
- claim(param, effect): an optional declared belief, graded for consistency with the agent’s own logged evidence.
- submit(...): the tier-appropriate final answer; submitting ends the episode.

Blindness (statistics only, no configuration echo) closes the leakage channel a naive harness would open; the budget makes experiment selection part of the measured skill. The budget is enforced by the harness itself: calls past the eighth return an error instead of running. The separate x0.6 scoring gate (Section 4) exists for episode logs produced outside the live harness, where over-budget calls could otherwise appear.
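A minimal sketch of a blind, budgeted `experiment` call, assuming a hypothetical `run_replicates` callback in place of the simulation engines. The class name, the result field names, and the sign convention of Cliff's delta (B relative to A) are assumptions; the Mann-Whitney/Holm p-value and significance verdict are omitted for brevity.

```python
from statistics import mean

BUDGET = 8  # hard cap on tool calls, as stated in the text


def cliffs_delta(a, b):
    """Cliff's delta: P(b > a) - P(b < a) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if y > x)
    less = sum(1 for x in a for y in b if y < x)
    return (greater - less) / (len(a) * len(b))


class BudgetedHarness:
    """Counts calls and refuses to run anything past the budget."""

    def __init__(self, run_replicates):
        # run_replicates(config, metric) -> list of per-replicate metric values.
        # Hypothetical callback standing in for the real engines.
        self._run = run_replicates
        self.calls = 0

    def experiment(self, config_a, config_b, metric):
        if self.calls >= BUDGET:
            return {"error": "budget exhausted"}  # the call does not run
        self.calls += 1
        a = self._run(config_a, metric)
        b = self._run(config_b, metric)
        mean_a, mean_b = mean(a), mean(b)
        return {
            # Statistics only: the resolved configs are never echoed back.
            "meanA": mean_a,
            "meanB": mean_b,
            "relativeChange": (mean_b - mean_a) / mean_a if mean_a else None,
            "cliffsDelta": cliffs_delta(a, b),
        }
```

Returning an error value rather than raising keeps the over-budget attempt visible in the episode log without executing it.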
Scoring is computed deterministically from the episode log; there is no LLM judge in the headline metric.
| Tier | Correctness | Method rigor | Efficiency |
|---|---|---|---|
| L1 | parameter 30 + direction 20 | 30 | 20 |
| L2 | parameter 25 + direction 15 + magnitude 20 (10 if adjacent class) | 25 | 15 |
| L3 | both parameters 30 (12 for one) + interaction sign 25 | factorial rigor 25 | 20 |
Method rigor is the axis this benchmark exists for: at L1/L2 it awards points only when the submitted parameter is backed by a significant experiment that isolates exactly that parameter on the target metric — a correct answer must be earned, not guessed or pattern-matched. At L3 the bar is a genuine 2x2 factorial: a significant isolating experiment on each parameter plus a combined-change experiment. Efficiency rewards solving within few calls (zero if no experiments were run, so an unbacked lucky guess scores at most the correctness points). Exceeding the budget multiplies the final score by 0.6. Declared claims inconsistent with the agent’s own logged statistics are tracked as claim validity.
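The L1 and L2 rows of the table can be read as the following sketch. It is an illustration under assumptions, not the scoring code: the efficiency rule is not specified in this section, so it enters as an opaque `efficiency_fraction` in [0, 1]; making direction and magnitude credit conditional on the correct parameter is also an assumption; and all function and argument names are hypothetical. The L2 class names are taken from the task brief in Appendix A.

```python
WEIGHTS = {
    "L1": {"param": 30, "direction": 20, "magnitude": 0, "rigor": 30, "efficiency": 20},
    "L2": {"param": 25, "direction": 15, "magnitude": 20, "rigor": 25, "efficiency": 15},
}
CLASSES = ["small", "medium", "large"]  # L2 magnitude classes, in order


def magnitude_points(submitted, truth, full=20):
    """Full credit for the exact class, half for an adjacent class, else zero."""
    gap = abs(CLASSES.index(submitted) - CLASSES.index(truth))
    if gap == 0:
        return full
    return full / 2 if gap == 1 else 0


def score(tier, param_ok, direction_ok, backed, experiments, total_calls,
          efficiency_fraction, magnitude=None, true_magnitude=None, budget=8):
    w = WEIGHTS[tier]
    total = 0.0
    if param_ok:
        total += w["param"]
        if direction_ok:  # assumption: needs the right parameter first
            total += w["direction"]
        if tier == "L2" and magnitude is not None:
            total += magnitude_points(magnitude, true_magnitude, w["magnitude"])
    if backed:  # a significant experiment isolating the submitted parameter
        total += w["rigor"]
    if experiments > 0:  # zero efficiency when no experiments were run
        total += w["efficiency"] * efficiency_fraction
    if total_calls > budget:  # over-budget gate
        total *= 0.6
    return total
```

Under these assumptions an unbacked lucky guess at L1 (`backed=False`, `experiments=0`) tops out at the 50 correctness points, matching the bound stated above.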
A second lens is computed over each episode and deliberately kept out of the headline score, so the published reproduction numbers remain frozen while method red flags still surface. The audit re-applies Holm-Bonferroni across the episode’s entire family of isolating tests on the target metric (the MCP server corrects within a single experiment, but not across an episode) and separates two things that are easy to conflate:

- backingSurvivesHolm: does the submission’s evidence still clear alpha after family-wide correction? A minimal one-test-per-candidate sweep whose driver is merely borderline fails this honestly; that is low statistical power, reported but not blamed. (The reference baseline’s swarm episode is exactly this case: raw p ~ 0.017, Holm-adjusted across three tests 0.0502.)
- pHacking: the behavioral red flag. It fires only when the agent fished (re-tested a parameter repeatedly, or ran more isolating tests than there are candidates) and then submitted a lone-significant hit that fails the family-wide correction. The minimal complete design is never flagged, however borderline its result.

The audit also classifies each submission’s support: backed by an isolating experiment, resting on probe output-matching only, or entirely unbacked.
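The two audit questions can be sketched as below. This is one reading of the rules above rather than the audit's actual code: `holm_adjust` is the standard Holm-Bonferroni step-down adjustment, while the `audit` function, its input shape, and the interpretation of "lone-significant hit" as exactly one raw-significant test, on the submitted parameter, are assumptions.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def audit(tests, submitted_param, n_candidates, alpha=0.05):
    """tests: (parameter, raw_p) for every isolating test on the target metric."""
    adjusted = holm_adjust([p for _, p in tests])
    backing_survives = any(
        adj < alpha
        for (param, _), adj in zip(tests, adjusted)
        if param == submitted_param
    )
    params = [param for param, _ in tests]
    # Fishing: a parameter was re-tested, or there are more tests than candidates.
    fished = len(set(params)) < len(params) or len(tests) > n_candidates
    raw_hits = [param for param, p in tests if p < alpha]
    lone_hit = raw_hits == [submitted_param]
    return {
        "backingSurvivesHolm": backing_survives,
        "pHacking": fished and lone_hit and not backing_survives,
    }
```

With a raw p near 0.0167 in a family of three, the adjusted value is about 0.0502, so a minimal sweep fails `backingSurvivesHolm` without tripping `pHacking`; the same borderline hit reached after repeated re-tests of one parameter does trip it.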
Every task ships with a statistically grounded difficulty rating: the
binding driver’s Cliff’s delta (effect size) and an oracle
power — the Monte-Carlo probability that an optimal isolating sweep
solves a random-seed instance of the task configuration. Concretely:
from the per-seed raw metric values collected at generation for each
cell (control and every candidate’s treatment), draw 2,000 seeded
subsamples without replacement at the bench’s replicate count (12),
re-run the Mann-Whitney/Holm decision on each, and count the fraction in
which the driver tests significant and every decoy does not. Because the
bench’s experiments are deterministic at a fixed replicate count, spare
budget buys no statistical power, so the one-factor-at-a-time baseline
and the oracle coincide; oracle power is the task’s solvability margin.
Ratings band into easy (power ≥ 0.85), moderate (≥ 0.55), and hard (<
0.55): across the 30 frozen tasks, 14 rate easy, 9 moderate, and 7 hard.
The swarm tasks dominate the hard band (weak drivers on a noisy order
parameter); morph’s regime-change drivers give near-perfect separation.
Methodology details: docs/petri-bench-difficulty.md.
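The oracle-power procedure can be sketched as a Monte-Carlo loop. This is a simplified illustration, not the rating code: the p-value uses a normal approximation with tie and continuity corrections (an assumption about the test variant), a single metric is tested so no Holm step appears, each pool is assumed to hold more than 12 per-seed values, and all function names are hypothetical.

```python
import math
import random


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (abs(u - n1 * n2 / 2) - 0.5) / math.sqrt(var)  # continuity correction
    return math.erfc(max(z, 0.0) / math.sqrt(2))


def oracle_power(control, driver, decoys, replicates=12, draws=2000,
                 alpha=0.05, seed=0):
    """Fraction of subsamples where the driver is significant and no decoy is."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(draws):
        c = rng.sample(control, replicates)  # sampling without replacement
        hit = mann_whitney_p(c, rng.sample(driver, replicates)) < alpha
        clean = all(
            mann_whitney_p(c, rng.sample(d, replicates)) >= alpha for d in decoys
        )
        solved += int(hit and clean)
    return solved / draws


def band(power):
    """Difficulty bands as given in the text."""
    if power >= 0.85:
        return "easy"
    return "moderate" if power >= 0.55 else "hard"
```

Reusing one control subsample across the driver and decoy comparisons within a draw is a design choice of this sketch; the source does not say whether the real procedure resamples the control per comparison.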
A benchmark over toy worlds is only as credible as the worlds. Four of the five simulations are checked against established results from their literatures, headlessly and reproducibly on every release (19/19 checks passing); origin’s check is outstanding (Section 9):
| Simulation | Reproduced result | Reference | Checks |
|---|---|---|---|
| swarm | order-to-disorder transition under alignment noise | Vicsek et al. (1995) | 5/5 |
| market | stylized facts: heavy tails, volatility clustering, no linear autocorrelation | Cont (2001) | 3/3 |
| social | bounded-confidence cluster scaling ~ 1/(2 eps); consensus transition | Deffuant et al. (2000); Hegselmann & Krause (2002) | 4/4 |
| morph | Gray-Scott phase structure: washout / spots / labyrinth across the feed-kill plane | Pearson (1993) | 7/7 |
Each validation records an explicit caveat where the implementation
differs from the cited model (e.g., the swarm boids are not the Vicsek
model, so the claim is the qualitative transition, not the critical
exponent; morph uses rescaled diffusion on a 96x96 torus, so the claim
is the phase structure, not Pearson’s exact coordinates). The origin
simulation is not yet covered by a literature check (Section 9). Reports
and scripted figures: docs/validation/.
Four algorithmic reference solvers calibrate the score scale:

- random: a seeded guess with no experiments; the chance floor.
- ofat: one isolating test per candidate using a curated, known-informative test value, then submit the significant driver. At L3 the same solver extends into the full 2x2 factorial the tier requires (the table’s L3 row reports this factorial extension, not bare OFAT).
- adaptive: the same sweep with early stopping.
- ofat-rand: the privileged-value ablation. The identical procedure with test values drawn uniformly at random from each parameter’s legal [min, max] (seeded, deterministic), isolating how much of the reference’s performance comes from knowing informative values rather than from the procedure.

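A minimal sketch of the one-factor-at-a-time sweep and of the blind value draw used by the ablation. The `run_experiment` callable, its result fields, the tie-break on smallest p, and both function names are assumptions made for illustration; direction inference and the L3 factorial extension are left out.

```python
import random


def ofat_sweep(control, test_values, run_experiment, alpha=0.05):
    """One isolating test per candidate; return the significant driver, if any.

    control:        dict of parameter -> control value
    test_values:    dict of candidate parameter -> one test value
    run_experiment: hypothetical callable (config_a, config_b) -> {"p": float, ...}
    """
    hits = []
    for param, value in test_values.items():
        treatment = {**control, param: value}  # change exactly one parameter
        result = run_experiment(control, treatment)
        if result["p"] < alpha:
            hits.append((param, result))
    if not hits:
        return None  # nothing significant, so no backed answer
    # Assumption: if several candidates test significant, the smallest p wins.
    param, result = min(hits, key=lambda hit: hit[1]["p"])
    return {"parameter": param, "evidence": result}


def random_test_values(ranges, seed):
    """Blind draw: one uniform value per parameter from its legal [min, max]."""
    rng = random.Random(seed)
    return {p: rng.uniform(lo, hi) for p, (lo, hi) in sorted(ranges.items())}
```

Swapping curated `test_values` for `random_test_values(...)` while leaving `ofat_sweep` untouched mirrors the ablation's intent: the procedure is held fixed and only the source of the test values changes.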
A generic llm solver runs any model in an identical
agent loop: the model receives the task brief and the four harness
tools, the harness enforces the budget, and the full transcript becomes
the episode log. Two transports plug into the same loop — direct HTTP
(Anthropic- and OpenAI-compatible endpoints) and locally installed CLIs
(driven headless over a strict one-JSON-tool-call protocol, executing in
an empty read-only scratch directory so the model cannot access task
files). All solvers are scored by the same code on the same contract.
The verbatim scaffold (system prompt, task brief template, tool
descriptions) is reproduced in Appendix A.
Six frontier models were evaluated on all 30 frozen tasks (five
simulations x two seeds x three tiers), three episodes per
task, through the identical agent loop: the same
system prompt, the same four tools, the same budget, the same scoring.
API-served and subscription-CLI models alike run behind the
LlmClient seam; CLI transports execute in an empty scratch
directory in read-only mode, so no model can access the task files. 540
episodes (90 per model), all committed as self-contained artifacts.
Per-tier means over 30 episodes per cell. An episode pass is one complete run over all 30 tasks (pass 1, 2, 3); parentheses give the range of the three per-pass means — an honest view of run-to-run variance:
| Model | L1 | L2 | L3 | Overall | Solve rate | Avg calls |
|---|---|---|---|---|---|---|
| adaptive (curated values) | 94.8 | 94.2 | 87.5 | 92.2 | 100% | 3.4 |
| ofat (curated values; factorial at L3) | 92.5 | 92.5 | 87.5 | 90.8 | 100% | 4.0 |
| glm-5.1 | 68.0 (61.5-75.3) | 63.9 (57.4-68.0) | 53.2 (47.2-56.6) | 61.7 | 46% | 4.3 |
| opus-4.8 | 72.3 (68.5-76.3) | 57.9 (54.3-64.6) | 55.1 (45.3-65.7) | 61.7 | 43% | 5.3 |
| haiku-4.5 | 63.1 (55.5-76.5) | 59.6 (50.8-66.3) | 60.8 (59.1-62.1) | 61.1 | 37% | 4.5 |
| codex-gpt-5.5 | 69.9 (66.5-72.5) | 53.0 (47.7-59.8) | 53.9 (50.3-58.4) | 58.9 | 42% | 4.8 |
| sonnet-4.6 | 59.3 (56.8-61.0) | 59.2 (54.4-62.3) | 56.3 (54.6-57.4) | 58.3 | 41% | 4.1 |
| gemini-3.1-pro | 64.4 (52.3-74.5) | 65.5 (59.6-69.1) | 42.0 (29.0-52.9) | 57.3 | 50% | 6.5 |
| ofat-rand (ablation: random values; lowest of 3 draws, mean 45.2) | 38.5 | 47.8 | 35.2 | 40.5 | 33% | 3.7 |
| random (chance floor) | 10.0 | 19.0 | 28.1 | 19.0 | 3% | 0 |
The headline at n=3: the reference sweep outscores every frontier model by 29 points or more (Wilcoxon signed-rank over the 30 paired tasks: the reference is better on 25-29 of 30 against every model, all p < 2x10^-5) and solves 100% of tasks, while no model solves more than 50% (best: gemini, 45/90 episodes).
Where the gap comes from — the privileged-value
ablation. The reference knows one informative test value per
candidate; models must choose their own. Re-running the
identical procedure with test values drawn blindly from each
parameter’s legal range (ofat-rand) collapses the score
from 90.8 to 40.5, below every model.

Statistics, stated with the same discipline the benchmark demands. The ablation arm is itself
sensitive to its value draws: three independent draw seeds score 40.5,
47.5, and 47.7 (mean 45.2; solve rate 33-37%), and the committed draw is
the lowest — a fact that would flatter any per-draw significance test of
the models’ margin. We therefore treat the model-vs-ablation separation
as descriptive: every model’s mean exceeds the ablation’s best
draw (57.3 vs 47.7, a 9.6-point floor), the direction is unanimous
across all six models and all three draws, and against the committed
draw each model wins 18-20 of 30 paired tasks (tie-corrected Wilcoxon p
= 0.007-0.041 uncorrected; three of six survive Holm across the family,
three land at p = 0.053). The inferential anchor of this report is the
reference-vs-model family, which is unambiguous: all six Holm-corrected
p < 10^-4. The decomposition is therefore explicit rather than
assumed: intervention design (choosing informative values) and
procedural discipline (isolation, inference, factorial structure) are
both first-order terms. Models occupy the middle — their intervention
choices carry real information, but their procedure gives back
the advantage: the failures Section 7.3 documents (direction submitted
without measuring it, missing factorials, unbacked and probe-only
submissions, p-hacks) are value-independent and visible directly in the
logs. We retract the earlier working-draft claim that value knowledge
accounts for “not most” of the gap; the ablation shows it is a dominant
term for the reference, while the models’ deficit against the
reference is jointly procedural and informational.
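For readers who want to re-run the paired comparisons from the committed per-task scores, a Wilcoxon signed-rank test can be sketched as follows. This uses the normal approximation with a tie correction and drops zero differences; whether the report's analysis uses this approximation or an exact test is not stated, so treat the variant as an assumption. The function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank p-value (normal approximation).

    Zero differences are dropped; tied |differences| share an average rank
    and reduce the variance by sum(t^3 - t) / 48.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    ordered = sorted(diffs, key=abs)
    w_plus, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        w_plus += avg_rank * sum(1 for d in ordered[i:j] if d > 0)
        i = j
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))
```

The inputs are two equal-length lists of per-task scores in the same task order, for example a reference solver's 30 task means against one model's 30 task means.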
Five structural observations:

- Tasks rated hard by oracle power (the swarm order-parameter tasks) were solved by zero of six models, with the easy band mostly at 5-6 of 6; one moderate task (market-202) also defeated all six. The power-based rating (Section 5) carries predictive signal about where models fail, which is independent evidence that it measures something real.

Caveats: what “identical loop” does and does not standardize. Identical were the system prompt, task briefs, tool surface and protocol, budget, and scoring; not standardized were decoding parameters (subscription CLIs run at their defaults), model snapshots (identities as the CLIs report them, June 2026), and the transport’s text-JSON tool protocol, which is uniform across models but may depress absolute scores relative to native function-calling APIs. The benchmark’s scoring also rewards exactly the isolate-and-infer design its generator verifies tasks against. This is by construction (the reference solvers are reference implementations of the demanded method, not arbitrary competitors), but it means scores measure fidelity to that method, not open-ended creativity; held-out task families that break the OFAT template, such as the planned L4 boundary-location tier, are the designed counterweight.
Because every episode log is committed, failure analysis is computable rather than anecdotal. Each episode is classified into objective, non-exclusive flags along two axes — what went wrong (wrong parameter, direction error, magnitude/interaction error, no answer) and how the method failed (unbacked submission, probe-only support, invalid claims, p-hacking, over-testing, budget misuse). Denominators: direction errors are counted among episodes that identified the correct parameter; flags are episode counts out of the stated set and may co-occur. On the 540 standardized episodes:
Each model failed differently in this evaluation — answer-finding
without discipline (gemini), skipped direction measurements atop clean
identification (opus), decoy susceptibility (sonnet), low-yield but
unflagged process (haiku) — and these profiles were consistent across
both scaffolds and all three passes. This per-model failure structure is
precisely the axis petri-bench is built to measure. The full table
regenerates with one command (taxonomy) and is committed at
results/taxonomy.md.
An earlier pilot ran six models on the six pilot-v0 L1
tasks under each model’s own CLI agent scaffold (heterogeneous prompts
and tool surfaces). It produced the same headline (baseline 92.5 vs best
model 74.2) and first exposed the over-exploration and
probe-output-matching failure modes, but its per-model ordering is
confounded by scaffold differences — gemini’s pilot 74.2 vs standardized
57.3 illustrates how large the scaffold-plus-task-set term can be (the
pilot used six easier L1 tasks). The pilot artifacts remain committed
under results/pilot-v0/ for comparison; the standardized
sweep supersedes them for all claims in this report.
The sweep is one command, config-driven, and crash-safe: it resumes past completed episodes, skips models whose credentials are absent, and an episode interrupted by an infrastructure failure is retried on the next pass rather than scored. API keys are read only from named environment variables (HTTP models) or stay inside the local CLIs (subscription models); nothing is stored in the repository. Marginal cost of sweep v1 was zero API spend (subscription CLIs; simulation calls are CPU-cheap), bounded mainly by wall-clock at roughly 5-7 minutes per model per tier.

- pilot-v0 set and on the L2 set; 87.5 on the L3 set).
- {task, log, score, audit, provenance}; the provenance stamp records the benchmark, MCP-server, and all five engine versions, the git commit, and the runtime, so any number in this report traces to the exact code that produced it.
- report and taxonomy.
- petri-labs-mcp tool server (npm); the benchmark adds no privileged engine access.
- pandoc docs/petri-bench-report.md -s -o petri-bench-report.tex exports cleanly (all tables as longtables); the published HTML and PDF render from the same source via packages/hub/scripts/render-report.sh.

probe is deliberately available because hypothesis refinement against the system under study is part of real method, but it permits the output-matching shortcut. The current design answers with the rigor axis (a probe-only submission forfeits all method-rigor points) and the audit (it is flagged), rather than by removing correctness credit; reasonable designers could choose more harshly, and the full-metric-vector identification planned below would close the confound at the source.

Every model received exactly this scaffold. System prompt:
You are a careful experimental scientist solving a hidden-parameter mystery. Exactly one (or, at L3, two) of the candidate parameters was changed from a revealed control to make a hidden treatment world. Use the tools to run controlled experiments — change ONE parameter at a time and read the Holm-adjusted p-value and effect size — and reach an evidence-backed answer. Only conclude an effect when a result is significant. You have a limited budget of experiments; do not waste calls. When confident, call
submit.
Task brief template (values filled per task):
model: <simulation id>
target metric: <metric>
control config: <JSON of the revealed control>
the changed parameter is one of: <candidate names>
experiment budget: 8 calls
<tier goal — L1: "ONE parameter was changed. Identify which, and whether it
pushes the target metric UP or DOWN." | L2: adds the magnitude classes by
relative change |deltaPct|: small 10-35%, medium 35-75%, large >=75% |
L3: "TWO parameters were changed together. Identify BOTH and the SIGN of
their interaction (positive if the combined effect exceeds the sum of the
individual effects, else negative).">
Run experiments (change ONE parameter vs the control), then submit.
Tool descriptions as presented: experiment — “Run a
replicated A/B between two configs; returns statistics only (no configs
echoed)”; probe — “Compare a guessed config against the
hidden world; a non-significant result means your guess matches”;
claim — “Record a belief about one parameter’s effect
(optional book-keeping)”; submit — the tier-appropriate
final answer (parameter + direction; + magnitude at L2; the parameter
pair + interaction sign at L3). Config arguments are described as
“parameter overrides on the control (omit a param to leave it at
control).” Note that the brief does not disclose parameter ranges;
models choose intervention values from their own priors.
The ablation’s additional value draws are reproducible:
PETRI_OFAT_RAND_SALT=1 (and =2) re-runs
ofat-rand with the alternative draw seeds cited in Section
7.2.
CLI transports deliver the scaffold and transcript over stdin each turn and require exactly one JSON tool call in reply (one in-protocol retry, then the episode ends as a logged no-submission); HTTP transports use the providers’ native tool-calling. The sweep in this report used the CLI transport for all six models, so the protocol term is uniform.
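The one-call-per-turn rule with a single in-protocol retry can be sketched as below. The JSON shape (`{"tool": ..., "args": {...}}`), the retry message, and the function names are assumptions made for illustration; only the four tool names and the retry-once-then-stop behavior come from the text.

```python
import json

TOOLS = {"experiment", "probe", "claim", "submit"}


def parse_tool_call(reply: str):
    """Return (tool, args) if the reply is exactly one JSON tool call, else None."""
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOLS:
        return None
    args = call.get("args", {})
    return (tool, args) if isinstance(args, dict) else None


def next_call(ask_model, transcript):
    """Ask once; on a malformed reply, retry once; then give up."""
    for _attempt in range(2):
        parsed = parse_tool_call(ask_model(transcript))
        if parsed is not None:
            return parsed
        # Hypothetical reminder appended before the single retry.
        transcript = transcript + ["Reply with exactly one JSON tool call."]
    return None  # the caller logs the episode as a no-submission
```

Here `ask_model` is any callable that takes the transcript and returns the model's raw text reply, so the same loop could wrap either transport.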