Methodology

How ARI Bench Works

ARI Bench measures whether a frontier model can act as an ML engineer: given a fixed compute budget and a working starter system, can it autonomously build a better model? This page documents the complete protocol — the environment, the hidden exam, the anti-contamination design, the scoring rules, and the variance policy — so every number on the leaderboard can be independently understood and challenged.

Budget: 1 hour · 1 GPU

Exam: 100 hidden questions

Score: Exact-match %

Official rows: 3 seeds, mean ± SE

What is being measured

Each evaluated model is run as an autonomous agent. It receives a task brief, a working starter training pipeline (a small byte-level model), ordinary public training text, and a fixed wall-clock budget on a single fixed GPU. Its job is to produce the most capable small model it can — from scratch — under those constraints. The score belongs to the agent, not to the artifact it builds: ARI Bench is a measure of autonomous ML-engineering ability, the core skill underlying recursive AI improvement.

Fixed budget. Exactly one hour of wall-clock time on one fixed GPU. The deadline is enforced by the harness, not the agent.
From scratch. Random initialization only. No pretrained, distilled, or downloaded weights. The training sandbox has no network access.
Bounded artifact. The submitted bundle (weights + inference code) must be at most 16 MB gzipped. The grader recomputes the size; self-reported sizes are ignored.
Same harness for every model. Identical task brief, starter code, budget, sandbox images (pinned by digest), and grading path. The only variable is the model under test (and, where applicable, its reasoning-effort setting, which is reported as part of the row's identity).
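The size recomputation described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, the bundle is assumed to be a directory packed as a gzip-compressed tar, and "16 MB" is assumed to mean 16 × 1024 × 1024 bytes, since the page does not say whether the limit is binary or decimal.

```python
import io
import tarfile
from pathlib import Path

# Assumption: the limit is binary megabytes. The page only says "16 MB gzipped".
MAX_BUNDLE_BYTES = 16 * 1024 * 1024


def gzipped_bundle_size(bundle_dir: Path) -> int:
    """Pack every file under bundle_dir into an in-memory .tar.gz and return its size."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # Sorted order keeps the archive layout stable between calls.
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                archive.add(path, arcname=str(path.relative_to(bundle_dir)))
    return len(buffer.getvalue())


def bundle_within_limit(bundle_dir: Path, limit: int = MAX_BUNDLE_BYTES) -> bool:
    """Recompute the size from the files themselves; a self-reported size is never consulted."""
    return gzipped_bundle_size(bundle_dir) <= limit
```

The point of the design is that the check takes only the bundle as input, so there is no place for an agent-supplied size to enter.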

The environment

The agent works in a controlled three-container sandbox:

The controller gives the agent a shell in its workspace — with no GPU and no Python. All compute goes through a single audited command, runner, which executes Python in an isolated, network-disabled GPU container.
No single compute call may run longer than 20 minutes, and the final five minutes of the hour are reserved for submission: new compute is refused in that window. Every runner response is prefixed with the remaining time, and the agent can query runner time_left at any point — time management is part of the skill being measured, and the harness makes the clock unambiguous.
The grader runs in a separate container the agent can never reach. Provider credentials are never mounted into the training or grading containers.
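The clock rules above (one-hour budget, 20-minute cap per compute call, final five minutes reserved) can be expressed as a small admission check. The sketch below is illustrative only: `Admission` and `admit_compute` are hypothetical names, and the choice to cap a late call so that it ends when the reserved window begins is an assumption the page does not confirm.

```python
from dataclasses import dataclass

BUDGET_S = 60 * 60    # one-hour wall-clock budget
MAX_CALL_S = 20 * 60  # no single compute call may exceed 20 minutes
RESERVED_S = 5 * 60   # final five minutes are reserved for submission


@dataclass(frozen=True)
class Admission:
    allowed: bool       # whether a new compute call may start now
    max_runtime_s: int  # longest the call may run, 0 if refused
    time_left_s: int    # remaining budget, reported with every response


def admit_compute(elapsed_s: int) -> Admission:
    """Decide whether a compute call starting at elapsed_s seconds is admitted."""
    time_left = max(BUDGET_S - elapsed_s, 0)
    if time_left <= RESERVED_S:
        # Inside the reserved window: new compute is refused.
        return Admission(False, 0, time_left)
    # Assumption: a call admitted late is cut off when the reserved window starts.
    cap = min(MAX_CALL_S, time_left - RESERVED_S)
    return Admission(True, cap, time_left)
```

Under these assumptions a call started at minute 0 may run the full 20 minutes, one started at minute 50 gets at most 5 minutes, and one started at minute 55 or later is refused.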

The hidden exam

The official score comes from 100 hidden completion questions. Each question is a text with one answer span. The grader sends the submitted model the prefix ending where the answer begins, greedily generates exactly as many bytes as the answer contains using the model's own inference code, and marks the question pass/fail on an exact byte match. Two design details matter:

Prefix-only, target-withholding grading. The submission never receives the byte it is being scored on — it only ever sees prefixes. Earlier hidden answers are masked out of later prompts, so the exam is a set of independent questions, not an oracle transcript that could be replayed.
Locally learnable, globally unfamiliar. Questions are built from small passages, records, local vocabularies, schedules, lists, symbolic and numeric sequences, and extraction tasks. Many answers depend on definitions given inside the prompt rather than on global English statistics — so memorizing common text is not enough; the built model has to actually use its context. The grader provides up to 1,024 bytes of preceding context per prediction.
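The grading rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `generate(prefix, n)` interface is hypothetical and stands in for the submission's own greedy inference code, the 1,024-byte context is applied once per question rather than per generated byte, and the masking of earlier hidden answers is omitted.

```python
from typing import Callable, Sequence, Tuple

MAX_CONTEXT_BYTES = 1024  # from the text: up to 1,024 bytes of preceding context

# Hypothetical interface: returns n greedily decoded bytes given a prefix.
Generate = Callable[[bytes, int], bytes]
Question = Tuple[bytes, int, int]  # (full text, answer start, answer end)


def grade_question(generate: Generate, text: bytes, start: int, end: int) -> bool:
    """Pass only if the model reproduces text[start:end] byte for byte."""
    # The model sees only the prefix, never the bytes it is scored on.
    prefix = text[max(0, start - MAX_CONTEXT_BYTES):start]
    answer = text[start:end]
    prediction = generate(prefix, len(answer))
    return prediction == answer


def exact_match_percent(generate: Generate, questions: Sequence[Question]) -> float:
    """Percentage of questions answered exactly, in the range 0 to 100."""
    if not questions:
        raise ValueError("at least one question is required")
    passed = sum(grade_question(generate, t, s, e) for t, s, e in questions)
    return 100.0 * passed / len(questions)
```

A toy model that copies the start of its prompt, `lambda prefix, n: prefix[:n]`, passes the question `(b"abc:abc", 4, 7)` and fails `(b"abc:xyz", 4, 7)`, which shows why answers defined inside the prompt reward models that use their context.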

A small public validation set in the same format ships inside the sandbox. It is deliberately smaller and less diverse than the hidden exam, and it exists for calibration — checking answer formatting, runner compatibility, and basic behavior. It is not the official selector (see below).

Anti-contamination

The question bank never enters the repository or the sandbox. Hidden exam questions are maintained outside version control; only the SHA-256 of the frozen built exam is published (3f571c9c…), so results are pinned to an exact, verifiable evaluation set without exposing it.
The training environment is air-gapped. The GPU container has no network access; the agent cannot fetch weights, code, or the exam. Egress from the controller is limited to the model provider's API through an allow-listed proxy.
Everything is pinned. Container image digests, the task brief hash, the opencode version and config hash, and the exam hash are recorded in each run's effective config — any deviation is detectable after the fact.
Public validation cannot leak the exam. The public set is a different, smaller sample. Overfitting it is possible — and is caught, not rewarded (next section).
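Pinning by hash means that anyone holding a copy of the frozen exam file can confirm it is the published one without the questions ever being disclosed. The sketch below shows that check with Python's standard `hashlib`. The function names are hypothetical, and the full published digest must be supplied by the caller, since only its first characters appear on this page.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare a file's digest with a full published digest, ignoring case and whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

The same comparison applies to the other pinned items in a run's effective config, provided each is recorded as a digest of its exact bytes.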

Artifact selection: the agent decides, the hidden exam judges

During a run the agent can package and submit candidate models at any time. The harness scores candidates on the public validation set as feedback only, and keeps an append-only history of every submission. The official graded artifact is chosen by the agent: the candidate it explicitly promotes (or finalizes) is the one that faces the hidden exam. If the run hits the deadline, the most recently promoted candidate is graded. Public-validation rank is never used to select the official artifact.
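The selection semantics can be modelled with a small data structure. This is a hypothetical sketch, not the harness implementation: the class and method names are invented for illustration, and candidates are identified simply by their position in the history.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubmissionLog:
    """Append-only candidate history with agent-driven promotion (illustrative)."""

    _public_scores: List[float] = field(default_factory=list)
    _promoted: Optional[int] = None

    def submit(self, public_score: float) -> int:
        """Record a candidate and its public-validation score; return its id."""
        self._public_scores.append(public_score)  # entries are never edited or removed
        return len(self._public_scores) - 1

    def promote(self, candidate_id: int) -> None:
        """The agent explicitly picks which candidate should face the hidden exam."""
        if not 0 <= candidate_id < len(self._public_scores):
            raise IndexError("unknown candidate")
        self._promoted = candidate_id

    def official_candidate(self) -> Optional[int]:
        """Most recently promoted candidate; public scores play no part in the choice."""
        return self._promoted
```

If an agent submits a candidate scoring 100 on public validation, then one scoring 40, and promotes the second, `official_candidate()` returns the second. With no promotion at all it returns `None`, so the run has no promoted artifact.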

This rule exists because model-selection judgment is part of the ability being measured — and because selecting on a set the agent can see invites overfitting. The live leaderboard already contains a clean demonstration: one frontier model achieved a perfect 100% on the public validation set during its run, and scored 16% on the hidden exam. A benchmark that selected artifacts by public-validation score would have laundered that memorization into a headline number. ARI Bench's protocol surfaces it instead.

We arrived at this rule the honest way: an earlier protocol revision briefly selected artifacts by public-validation score, and our own audit caught it distorting results. The full account — what broke, how we detected it, which runs were invalidated, and the corrected semantics — is published in Protocol Fixes & Reruns. We keep that report public because a benchmark that audits itself openly is the only kind worth trusting.

Scoring, seeds, and variance

Official score = percent of the 100 hidden questions answered exactly (0–100%). Raw bits-per-byte on the hidden answer spans is also recorded as a diagnostic, but is not the leaderboard score.
Agent runs are stochastic. The same model with the same budget makes different engineering choices run to run. A single run is therefore a provisional result, and the leaderboard labels it as such.
Official rows require 3 independent seeds. The published score is the mean across valid seeded runs, reported with its standard error (± SE). Rows with fewer seeds are marked provisional until the protocol quota is met. We do not rank two rows apart when their mean ± SE intervals overlap: a tie within noise is reported as what it is.
Failures are recorded, not averaged away. A run that produces no valid promoted artifact scores what it earned (including zero); infrastructure failures are marked invalid and rerun, never silently dropped into a mean.
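The seed aggregation above can be illustrated with a short calculation. The sketch makes two assumptions the page does not state: the standard error uses the sample standard deviation (n − 1 denominator), and two rows "overlap" when their mean ± 1 SE intervals intersect. The function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_se(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of per-seed hidden-exam scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("a standard error needs at least two seeded runs")
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation (n - 1 denominator).
    se = statistics.stdev(scores) / math.sqrt(n)
    return mean, se


def intervals_overlap(row_a: Sequence[float], row_b: Sequence[float]) -> bool:
    """True when the two rows' mean ± 1 SE intervals intersect (assumed tie rule)."""
    mean_a, se_a = mean_and_se(row_a)
    mean_b, se_b = mean_and_se(row_b)
    return abs(mean_a - mean_b) <= se_a + se_b
```

For example, seeds scoring 10, 12 and 14 give a mean of 12 with an SE of about 1.15. Against a row scoring 11, 13 and 15 the intervals overlap, so under this rule the two rows would be reported as tied.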

Versioning

The benchmark version, exam hash, grading parameters, and protocol semantics are frozen together. Any change that could shift scores — a new exam, new budget, new selection semantics — becomes a new, separately-labeled track; existing rows are never silently re-scored under new rules. Runs graded under superseded protocol revisions are either rerun or explicitly marked legacy, as documented in Protocol Fixes & Reruns.

Limitations

We would rather list these ourselves than have someone else do it for us:

Seed coverage is in progress. Most current rows are single-seed and marked provisional; official 3-seed rows with error bars are being filled in. Treat close rankings as preliminary until then.
One scaffold. All models run through the same open-source agent scaffold for comparability. A model might perform differently under its native tooling; we hold the scaffold fixed and report what that harness measures.
Reasoning effort is part of a row's identity. Where a provider exposes an effort setting, each setting is benchmarked and labeled as its own row rather than blended.
A one-hour budget measures budgeted engineering, not peak capability. Higher-effort configurations can be iteration-starved inside a fixed hour; that trade-off is part of what the benchmark measures, and longer-budget tracks are run separately.
Scores are exam-relative. A percentage on this exam is comparable across models and across time on the same track — it is not a general claim about a model's overall intelligence.

Questions, challenges, and eval requests

If you believe a number on the board is wrong, we want to know: hello@aribench.com. Labs can request an evaluation (including pre-release, under NDA) via Request eval. The protocol described here is the one every row on the leaderboard was produced under; where a historical row predates a protocol fix, it is labeled and linked to the relevant report.