Transparency report

Protocol Fixes and Reruns

In late June 2026, while extending the evaluation protocol, we introduced a bug into our own harness — and our audit caught it distorting scores. This report documents the failure openly: the root cause, the four leaderboard rows we invalidated, the corrected protocol, and the tests that now pin it in place. We publish it because benchmark trust is earned by showing the failures, not by having none.

Invalidated: 4 rows
Root cause: Selector bug
Fix: Agent-owned selection
Status: Rerun & superseded

Context

The events below took place during the benchmark's earlier V2 scoring era, when submissions were scored by bits-per-byte (BPB, lower is better) on a hidden held-out text. The current leaderboard uses the successor protocol — a hidden exact-match exam — but the selection semantics this incident produced are exactly the ones in force today, and every current row was produced under them. Nothing in this report affects current scores; it explains why the protocol is shaped the way it is.
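For readers unfamiliar with the metric, the sketch below shows the standard conversion from a summed negative log-likelihood to bits-per-byte. It is a minimal illustration, not the harness's actual scoring code; the function name and the assumption that the model reports its loss in nats are ours.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper: assumes the loss is summed over the whole text
    and that n_bytes is the length of that text in bytes.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes normalizes.
    return total_nll_nats / (math.log(2) * n_bytes)
```

Lower values mean the model compresses the held-out text better, which is why the V2 leaderboard ranked lower BPB first.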

Root cause

A protocol revision intended to make candidate packaging robust also made the harness select the official graded artifact by lowest public-validation BPB. Public validation is visible to the agent — it can train against it — so using it as the official selector rewarded validation-overfit candidates and discarded the agent's own, better-generalizing final choices. The result was a leaderboard where scores clustered artificially and the strongest configuration regressed sharply.

Run                  Held-out BPB   Reason removed
run-0009 (low)       3.5670         Public-val-selected artifact
run-0010 (medium)    3.2662         Public-val-selected artifact
run-0011 (high)      3.4217         Public-val-selected artifact
run-0012 (xhigh)     3.4265         Public-val-selected artifact

How it was caught: the anomaly was visible in the data itself, with four runs collapsing into a narrow band while the previously strongest configuration dropped. The candidate-history audit then showed that the graded artifacts were the public-validation argmin, scoring markedly better on the visible validation set than on the hidden held-out (a gap of ~0.5–0.7 BPB). Runs whose selection was distorted remain archived as diagnostic artifacts, clearly excluded from official scores.
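The audit check described above can be sketched as follows. This is an illustration under stated assumptions, not the audit tooling itself: the `Candidate` record, its field names, and `audit_selection` are hypothetical, and the check assumes the operators can see both scores for every candidate in a run's history.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    public_val_bpb: float  # visible to the agent during the run
    heldout_bpb: float     # hidden; known only to the operators


def public_val_argmin(history: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the lowest public-validation BPB."""
    return min(history, key=lambda c: c.public_val_bpb)


def audit_selection(
    history: Sequence[Candidate], graded_id: str
) -> Dict[str, Union[bool, float]]:
    """Flag whether the graded artifact is simply the public-val argmin,
    and report how much worse it scores on the hidden held-out text."""
    graded = next(c for c in history if c.candidate_id == graded_id)
    return {
        "graded_is_public_val_argmin":
            public_val_argmin(history).candidate_id == graded_id,
        # Positive gap: better on visible validation than on hidden held-out.
        "generalization_gap": graded.heldout_bpb - graded.public_val_bpb,
    }
```

A run where the first flag is true and the gap is large is the pattern the audit found: the harness, not the agent, had picked the artifact, and it had picked the one most fitted to the visible set.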

Corrected protocol

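The runner keeps public-validation feedback, which is useful signal, but public-validation rank no longer controls anything official. The graded artifact is the one the agent itself selects: its finalized candidate, or, failing that, its latest promoted one.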

What the audit surfaced along the way

Auditing the anomaly end-to-end surfaced two additional issues, both fixed and both now covered by regression tests.

The corrected semantics are locked by local and rig-side sandbox tests: the public-validation best is not automatically graded; the latest promoted candidate is; a finalized candidate takes priority over a promoted one; effort is passed explicitly; and deadline-drained selections are honored without extra compute.
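The selection rules those tests pin down can be sketched in a few lines. This is a minimal illustration, not the harness's code: the `Candidate` record, its `promoted` and `finalized` flags, and `select_graded` are hypothetical names, and two behaviors are our assumptions rather than stated policy (the latest finalized candidate wins if there are several, and `None` is returned when the agent selected nothing).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    public_val_bpb: float    # feedback only; never consulted for selection
    promoted: bool = False   # the agent promoted this candidate
    finalized: bool = False  # the agent marked this candidate as final


def select_graded(history: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the graded artifact from a run's history, in submission order.

    A finalized candidate takes priority over a promoted one; otherwise the
    latest promoted candidate is graded. Public-validation rank plays no part.
    """
    finalized = [c for c in history if c.finalized]
    if finalized:
        return finalized[-1]  # assumption: latest finalized wins
    promoted = [c for c in history if c.promoted]
    if promoted:
        return promoted[-1]
    return None  # assumption: nothing agent-selected means nothing graded
```

Note that `public_val_bpb` is carried on the record but never read by `select_graded`, which mirrors the first regression test: the public-validation best is not automatically graded.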

Why we publish this

Two reasons. First, the incident is the clearest argument for the current design: selection on a visible validation set doesn't just permit overfitting, it manufactures it — we watched a harness do exactly that. The protocol that came out of this failure (agent-owned selection, hidden exam as sole judge) has already caught a frontier model scoring a perfect 100% on public validation while earning 16% on the hidden exam. Second, every benchmark has bugs eventually; the differentiator is whether they're found by the operators or by the audience. Ours are documented here, with the invalidations, before anyone had to ask. If you spot the next one, tell us: hello@aribench.com.

Full protocol details, scoring rules, and the seeds/variance policy live on the Methodology page.