Methodology

How ARI Bench Works

ARI Bench measures whether a frontier model can act as an ML engineer: given a fixed compute budget and a working starter system, can it autonomously build a better model? This page documents the complete protocol — the environment, the hidden exam, the anti-contamination design, the scoring rules, and the variance policy — so every number on the leaderboard can be independently understood and challenged.

Budget: 1 hour · 1 GPU
Exam: 100 hidden questions
Score: Exact-match %
Official rows: 3 seeds, mean ± SE

What is being measured

Each evaluated model is run as an autonomous agent. It receives a task brief, a working starter training pipeline (a small byte-level model), ordinary public training text, and a fixed wall-clock budget on a single fixed GPU. Its job is to produce the most capable small model it can — from scratch — under those constraints. The score belongs to the agent, not to the artifact it builds: ARI Bench is a measure of autonomous ML-engineering ability, the core skill underlying recursive AI improvement.

The environment

The agent works in a controlled three-container sandbox:

The hidden exam

The official score comes from 100 hidden completion questions. Each question is a text with one answer span. The grader sends the submitted model the prefix ending where the answer begins, greedily generates exactly the answer's length using the model's own inference code, and marks the question pass/fail on an exact byte match. Two design details matter:

A small public validation set in the same format ships inside the sandbox. It is deliberately smaller and less diverse than the hidden exam, and it exists for calibration — checking answer formatting, runner compatibility, and basic behavior. It is not the official selector (see below).
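
The grading step described above can be sketched roughly as follows. This is an illustration, not the harness's actual code: the `(text, start, end)` question layout, the `generate(prefix, n_bytes)` interface, and the choice to measure answer length in bytes are all assumptions made for the example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: the submitted model's own inference code, wrapped as
# a function that greedily continues `prefix` by exactly `n_bytes` bytes.
GenerateFn = Callable[[bytes, int], bytes]

# Hypothetical question layout: a text plus the [start, end) answer span.
Question = Tuple[bytes, int, int]


def grade_question(generate: GenerateFn, text: bytes, start: int, end: int) -> bool:
    """Return pass/fail for one question whose answer span is text[start:end]."""
    prefix = text[:start]   # the prefix ends where the answer begins
    answer = text[start:end]
    output = generate(prefix, len(answer))
    return output == answer  # exact byte match, no partial credit


def exact_match_percent(generate: GenerateFn, questions: Iterable[Question]) -> float:
    """Percentage of questions answered with an exact byte match."""
    results = [grade_question(generate, t, s, e) for t, s, e in questions]
    return 100.0 * sum(results) / len(results)


def toy_generate(prefix: bytes, n_bytes: int) -> bytes:
    """Stand-in 'model' that always emits the byte '4' (illustration only)."""
    return b"4" * n_bytes


toy_questions = [
    (b"2+2=4.", 4, 5),                # answer span is b"4"
    (b"Paris is in France", 12, 18),  # answer span is b"France"
]
toy_score = exact_match_percent(toy_generate, toy_questions)
```

Because the comparison is on the whole span, an output that is correct except for one byte still fails the question.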

Anti-contamination

Artifact selection: the agent decides, the hidden exam judges

During a run the agent can package and submit candidate models at any time. The harness scores candidates on the public validation set as feedback only, and keeps an append-only history of every submission. The official graded artifact is chosen by the agent: the candidate it explicitly promotes (or finalizes) is what faces the hidden exam. If the run hits the deadline first, the most recently promoted candidate is graded. Public-validation rank is never used to select the official artifact.
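
A minimal sketch of these selection semantics, assuming a simple in-memory ledger. The `RunLedger` and `Submission` names and their methods are hypothetical, not the harness's API. This page does not say what happens when nothing was ever promoted, so the sketch simply returns `None` in that case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Submission:
    candidate_id: str
    public_val_score: float  # feedback only; never consulted for selection


@dataclass
class RunLedger:
    """Hypothetical record of one run: submissions plus the promoted candidate."""
    history: List[Submission] = field(default_factory=list)
    promoted_id: Optional[str] = None

    def submit(self, candidate_id: str, public_val_score: float) -> None:
        # Entries are only ever appended; nothing is edited or removed.
        self.history.append(Submission(candidate_id, public_val_score))

    def promote(self, candidate_id: str) -> None:
        if all(s.candidate_id != candidate_id for s in self.history):
            raise ValueError(f"unknown candidate: {candidate_id}")
        self.promoted_id = candidate_id  # a later promotion replaces an earlier one

    def official_artifact(self) -> Optional[str]:
        # At finalize or deadline: the most recently promoted candidate.
        # Public-validation scores play no role here.
        return self.promoted_id


ledger = RunLedger()
ledger.submit("cand-a", public_val_score=100.0)
ledger.submit("cand-b", public_val_score=40.0)
ledger.promote("cand-a")
ledger.promote("cand-b")  # the agent changes its mind before the deadline
```

In this example `cand-b` is the graded artifact even though `cand-a` has the higher public-validation score, which is exactly the behavior the rule is meant to guarantee.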

This rule exists because model-selection judgment is part of the ability being measured — and because selecting on a set the agent can see invites overfitting. The live leaderboard already contains a clean demonstration: one frontier model achieved a perfect 100% on the public validation set during its run, and scored 16% on the hidden exam. A benchmark that selected artifacts by public-validation score would have laundered that memorization into a headline number. ARI Bench's protocol surfaces it instead.

We arrived at this rule the honest way: an earlier protocol revision briefly selected artifacts by public-validation score, and our own audit caught it distorting results. The full account — what broke, how we detected it, which runs were invalidated, and the corrected semantics — is published in Protocol Fixes & Reruns. We keep that report public because a benchmark that audits itself openly is the only kind worth trusting.

Scoring, seeds, and variance
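
The summary at the top of this page lists official rows as 3 seeds, mean ± SE. A minimal sketch of that arithmetic, assuming SE means the sample standard deviation divided by the square root of the number of seeds (the usual standard error of the mean; the exact estimator is not spelled out here). The scores in the example are made up for illustration and are not leaderboard data.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_se(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across per-seed scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator), scaled by sqrt(n).
    se = statistics.stdev(scores) / math.sqrt(n)
    return mean, se


# Illustrative per-seed exact-match percentages (made-up values).
example_mean, example_se = mean_and_se([10.0, 12.0, 14.0])
print(f"{example_mean:.1f} ± {example_se:.1f}")
```

With only three seeds the standard error is itself a rough estimate, so small gaps between rows should be read with caution.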

Versioning

The benchmark version, exam hash, grading parameters, and protocol semantics are frozen together. Any change that could shift scores (a new exam, a new budget, new selection semantics) becomes a new, separately labeled track; existing rows are never silently re-scored under new rules. Runs graded under superseded protocol revisions are either rerun or explicitly marked legacy, as documented in Protocol Fixes & Reruns.
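
One way to picture "frozen together" is a single immutable record whose fingerprint changes whenever any score-relevant field changes. The sketch below is illustrative only: the `TrackSpec` fields, the placeholder values, and the hashing scheme are assumptions, not the benchmark's actual versioning mechanism.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrackSpec:
    """Hypothetical bundle of everything that could shift scores."""
    benchmark_version: str
    exam_hash: str
    budget_seconds: int
    grading: str
    selection_semantics: str


def track_id(spec: TrackSpec) -> str:
    """Short fingerprint over all fields; any change yields a different id."""
    canonical = json.dumps(asdict(spec), sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


# Placeholder values for illustration; only the 1-hour budget comes from this page.
current = TrackSpec(
    benchmark_version="v-example",
    exam_hash="exam-hash-placeholder",
    budget_seconds=3600,
    grading="greedy-exact-byte-match",
    selection_semantics="agent-promoted",
)
longer_budget = replace(current, budget_seconds=7200)  # a new, separate track
```

Rows would then be keyed by `track_id`, so results produced under `longer_budget` could never be mixed into the `current` track by accident.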

Limitations

We would rather list these ourselves than have someone else do it for us:

Questions, challenges, and eval requests

If you believe a number on the board is wrong, we want to know: hello@aribench.com. Labs can request an evaluation (including pre-release, under NDA) via Request eval. The protocol described here is the one every row on the leaderboard was produced under; where a historical row predates a protocol fix, it is labeled and linked to the relevant report.