petri-bench episode artifacts

Every episode behind every published number, as self-contained JSON: {task, log, score, audit, provenance}. The log records each harness call with its statistics; the provenance stamp pins the exact code versions and commit. Re-score any episode from its log alone.
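As a minimal sketch of working with one of these files, the Python below loads an episode and checks that the five top-level fields named above are present. Only the field names come from this page; the helper names (`missing_episode_keys`, `load_episode`), the assumption that each file holds a single JSON object, and any file path you pass in are illustrative and not part of petri-bench. The structure inside each field is not described here, so the sketch does not inspect it.

```python
import json

# Top-level fields listed on this page for every episode artifact.
REQUIRED_KEYS = {"task", "log", "score", "audit", "provenance"}


def missing_episode_keys(episode: dict) -> set:
    """Return the required top-level keys that are absent from an episode."""
    return REQUIRED_KEYS - set(episode)


def load_episode(path: str) -> dict:
    """Load one episode file (hypothetical helper, assumes one JSON object per file)."""
    with open(path, encoding="utf-8") as f:
        episode = json.load(f)
    if not isinstance(episode, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    missing = missing_episode_keys(episode)
    if missing:
        raise ValueError(f"{path}: missing fields {sorted(missing)}")
    return episode
```

A check like this is useful before re-scoring, since a file without its log cannot be re-scored at all.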

baselines/

baselines/adaptive/

baselines/ofat/

baselines/ofat-rand/

baselines/random/

codex-gpt-5.5/

gemini-3.1-pro/

glm-5.1/

haiku-4.5/

opus-4.8/

pilot-v0/

pilot-v0/excluded/

sonnet-4.6/