Transparency report
Protocol Fixes and Reruns
In late June 2026, while extending the evaluation protocol, we introduced a bug into our own harness — and our audit caught it distorting scores. This report documents the failure openly: the root cause, the four leaderboard rows we invalidated, the corrected protocol, and the tests that now pin it in place. We publish it because benchmark trust is earned by showing the failures, not by having none.
Context
The events below took place during the benchmark's earlier V2 scoring era, when submissions were scored by bits-per-byte (BPB, lower is better) on a hidden held-out text. The current leaderboard uses the successor protocol — a hidden exact-match exam — but the selection semantics this incident produced are exactly the ones in force today, and every current row was produced under them. Nothing in this report affects current scores; it explains why the protocol is shaped the way it is.
Root cause
A protocol revision intended to make candidate packaging robust also made the harness select the official graded artifact by lowest public-validation BPB. Public validation is visible to the agent — it can train against it — so using it as the official selector rewarded validation-overfit candidates and discarded the agent's own, better-generalizing final choices. The result was a leaderboard where scores clustered artificially and the strongest configuration regressed sharply.
- Four affected rows (run-0009 through run-0012) were removed from the official leaderboard.
- An earlier strong run predating this harness revision was reclassified as legacy/internal evidence rather than kept as an official row, because it lacked the full provenance the corrected harness records.
- The corrected principle: the harness preserves artifacts; the agent chooses what gets graded.
| Run (reasoning effort) | Held-out BPB (lower is better) | Reason removed |
|---|---|---|
| run-0009 (low) | 3.5670 | Public-val-selected artifact |
| run-0010 (medium) | 3.2662 | Public-val-selected artifact |
| run-0011 (high) | 3.4217 | Public-val-selected artifact |
| run-0012 (xhigh) | 3.4265 | Public-val-selected artifact |
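To make the root cause concrete, here is a minimal Python sketch of the faulty selection rule next to the corrected one. It is an illustration, not the harness's code: the `Candidate` type, both function names, and the BPB values are hypothetical and were made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    public_val_bpb: float  # visible to the agent, so it can be overfit
    agent_final: bool      # True if the agent explicitly chose this candidate


def select_buggy(candidates: List[Candidate]) -> Candidate:
    """Faulty rule: grade whichever candidate has the lowest public-validation BPB."""
    return min(candidates, key=lambda c: c.public_val_bpb)


def select_agent_choice(candidates: List[Candidate]) -> Optional[Candidate]:
    """Corrected rule: grade the agent's latest explicit choice, if there is one."""
    chosen = [c for c in candidates if c.agent_final]
    return chosen[-1] if chosen else None


# Made-up values for illustration only; not taken from any real run.
candidates = [
    Candidate("overfit-to-public-val", public_val_bpb=1.10, agent_final=False),
    Candidate("agent-final", public_val_bpb=1.35, agent_final=True),
]

print(select_buggy(candidates).name)         # the validation-overfit candidate
print(select_agent_choice(candidates).name)  # the agent's own final choice
```

The two rules disagree exactly when a candidate has been tuned against the visible validation set, which is the case the faulty harness rewarded.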
Corrected protocol
The runner keeps public-validation feedback — it is useful signal — but public-validation rank no longer controls anything official:
- `runner submit_candidate --model-dir X` packages a candidate atomically, validates it, and scores it on public validation for feedback and history only.
- `runner submit_candidate --model-dir X --promote` additionally declares: if time expires now, grade this model.
- `runner promote_candidate` promotes an existing candidate; `runner finalize_candidate` locks the final artifact and signals completion.
- Grading priority: finalized candidate → latest promoted candidate → a valid `final_model` directory → otherwise no official artifact.
- At the deadline, the harness drains already-written candidate-selection requests without granting any additional training compute — a run that promoted in its final seconds is honored; one that never promoted is not rescued.
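The grading priority above can be sketched as a small resolver. This is a simplified sketch under stated assumptions: `RunState`, its fields, and `resolve_official_artifact` are hypothetical names, not the harness's real data model. The point it illustrates is that the public-validation best is recorded but never consulted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunState:
    # All field names are hypothetical, chosen for this illustration.
    finalized: Optional[str] = None                     # id locked by finalize_candidate
    promoted: List[str] = field(default_factory=list)   # ids in promotion order
    final_model_valid: bool = False                     # a valid final_model directory exists
    public_val_best: Optional[str] = None               # feedback only, never used for selection


def resolve_official_artifact(state: RunState) -> Optional[str]:
    """Return the artifact to grade, following the stated priority order."""
    if state.finalized is not None:
        return state.finalized        # 1. finalized candidate
    if state.promoted:
        return state.promoted[-1]     # 2. latest promoted candidate
    if state.final_model_valid:
        return "final_model"          # 3. valid final_model directory
    return None                       # 4. no official artifact


state = RunState(promoted=["cand-2", "cand-5"], public_val_best="cand-3")
print(resolve_official_artifact(state))  # latest promotion wins over public-val best
```

Keeping `public_val_best` in the state while leaving it out of the resolver mirrors the protocol's split between feedback and official selection.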
What the audit surfaced along the way
Auditing the anomaly end-to-end surfaced two additional issues, both fixed and both now covered by regression tests:
- Explicit reasoning-effort passing. The agent scaffold exposes reasoning effort as an explicit run variant; the harness now passes it explicitly rather than relying only on config injection, and a no-spend sandbox test verifies the launched command includes it.
- Deadline drain. One rerun produced no official artifact despite building candidates: a promotion request written near the deadline sat behind a slow compute job and was never processed. The harness now drains pending selection requests at timeout (without running queued training jobs), preserving the contract that the agent's last explicit choice is honored.
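One way to picture the deadline drain is a queue that, at timeout, applies pending selection requests in the order they were written and drops any queued training work. The sketch below assumes a simple in-memory queue of dictionaries; the request shape, the `kind` values, and `drain_at_deadline` are hypothetical and stand in for whatever the harness actually uses.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical request kinds that only change which artifact gets graded.
SELECTION_KINDS = {"promote", "finalize"}


def drain_at_deadline(pending: Deque[Dict], state: Dict) -> List[Dict]:
    """Apply already-written selection requests; skip compute jobs entirely.

    Returns the requests that were skipped, so a caller can log them.
    """
    skipped = []
    while pending:
        request = pending.popleft()
        if request["kind"] == "promote":
            state["promoted"] = request["candidate"]
        elif request["kind"] == "finalize":
            state["finalized"] = request["candidate"]
        else:
            # Training or any other compute job: never run after the deadline.
            skipped.append(request)
    return skipped


# A promotion written near the deadline, stuck behind a slow compute job.
pending = deque([
    {"kind": "train", "job": "slow-job"},
    {"kind": "promote", "candidate": "cand-7"},
])
state = {"promoted": None, "finalized": None}

skipped = drain_at_deadline(pending, state)
print(state["promoted"])  # the late promotion is honored
print(len(skipped))       # the training job was skipped, not executed
```

Because the drain only touches selection state, honoring a late promotion grants no extra training time.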
The corrected semantics are locked by local and rig-side sandbox tests: the public-validation best is not automatically graded; the latest promoted candidate is; a finalized candidate takes priority over a promoted one; effort is passed explicitly; and deadline-drained selections are honored without extra compute.
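As an illustration of the explicit-effort check, a no-spend test can build the launch command without executing it and assert that the effort variant is present. The flag name `--reasoning-effort`, the base command, and `build_agent_command` are assumptions made for this sketch; the real scaffold's interface may differ.

```python
from typing import List

# Effort variants as they appear in the invalidated rows' labels.
ALLOWED_EFFORTS = {"low", "medium", "high", "xhigh"}


def build_agent_command(base_cmd: List[str], reasoning_effort: str) -> List[str]:
    """Return the launch command with the effort passed as an explicit flag."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort!r}")
    # Hypothetical flag name; the command is only built here, never launched.
    return [*base_cmd, "--reasoning-effort", reasoning_effort]


def test_effort_is_passed_explicitly() -> None:
    cmd = build_agent_command(["agent", "run"], "xhigh")
    flag_index = cmd.index("--reasoning-effort")
    assert cmd[flag_index + 1] == "xhigh"


test_effort_is_passed_explicitly()
```

Asserting on the constructed command rather than on a config file is what catches the case where config injection silently fails.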
Why we publish this
Two reasons. First, the incident is the clearest argument for the current design: selection on a visible validation set doesn't just permit overfitting, it manufactures it — we watched a harness do exactly that. The protocol that came out of this failure (agent-owned selection, hidden exam as sole judge) has already caught a frontier model scoring a perfect 100% on public validation while earning 16% on the hidden exam. Second, every benchmark has bugs eventually; the differentiator is whether they're found by the operators or by the audience. Ours are documented here, with the invalidations, before anyone had to ask. If you spot the next one, tell us: hello@aribench.com.