Every episode behind every published number, as self-contained JSON: {task, log, score, audit, provenance}.
The log records each harness call with its statistics; the provenance stamp pins the exact code versions and commit.
Re-score any episode from its log alone.
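
As a rough illustration of what "self-contained" and "re-score from its log alone" could look like in practice, here is a minimal Python sketch. Only the five top-level keys come from the description above. The function names `load_episode` and `rescore_matches`, the `scorer` argument, and the shape of the log entries are all assumptions made for illustration, not the project's actual schema or scoring rule.

```python
import json

# The five top-level keys named in the description above.
EXPECTED_KEYS = {"task", "log", "score", "audit", "provenance"}


def load_episode(text):
    """Parse one episode file and check it carries the five documented keys."""
    episode = json.loads(text)
    if not isinstance(episode, dict):
        raise ValueError("episode must be a JSON object")
    missing = EXPECTED_KEYS - set(episode)
    if missing:
        raise ValueError(f"episode is missing keys: {sorted(missing)}")
    return episode


def rescore_matches(episode, scorer):
    """Recompute the score from the log alone and compare with the stored score.

    `scorer` is a hypothetical callable standing in for the real scoring rule,
    which this sketch does not know.
    """
    return scorer(episode["log"]) == episode["score"]
```

A caller would read one downloaded episode file, pass its text to `load_episode`, and then call `rescore_matches` with whatever scoring function the published code at the pinned commit provides. A mismatch would mean the stored score cannot be reproduced from the log.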