Public methodology · v0.1 BETA

How we decide whether a paper reproduces.

Every verdict on this site is computed by automated agents running your code in a hermetic sandbox. This page documents how — verdict by verdict, gate by gate. It is the single source of truth that every badge, comment, RSS item, and email links back to.

Verdict labels at a glance

A verdict is a label attached to a specific numerical claim in a specific version of a paper, computed from a specific reproduction job. It is never attached to an author, a lab, or a paper as a whole. Each label below has its own anchored section further down this page with the full rule.

Verdict | Threshold | One-line meaning
REPRODUCED | all metrics within tolerance · ≥3 seeds · confidence ≥ 0.7 | Every numerical claim we tested matched the paper's reported value within the published tolerance window.
WRONG | all seeds outside tolerance, same direction · confidence ≥ 0.9 · cross-model agreement · 72h notice elapsed | Our reproduction did not match the reported numbers for the claim under test. Published only after every reliability gate in §13.7 passes.
PARTIAL | directional match outside tolerance · or proxy benchmark · or single-seed signal | Between REPRODUCED and WRONG. Reproduces directionally but outside the band, on a proxy split, or short of the confidence floor.
PENDING | job queued, in flight, or in 72h pre-publication hold | A reproduction job is in progress, or a high-confidence not-reproduced finding is in the 72-hour author-notice window.
DISPUTED | author has filed a dispute via /legal/disputes | An author dispute is open. The badge renders dimmed with a link to the dispute thread until the methodology audit completes.
NOT_ATTEMPTED | planner refused: no code, gated data, license restriction, etc. | The planner agent declined to run a reproduction. The page explains why in the rationale field; no claim about reproducibility is made.
OUT_OF_BUDGET | planner estimated cost > per-paper budget cap | The smallest credible experiment we could find still exceeded the per-paper compute budget. Recorded with rationale; eligible for re-attempt at a higher budget tier.

REPRODUCED

The label means: every numerical claim we tested matched the paper's reported value within the published tolerance window, across multiple random seeds, with a confidence score at or above 0.7.

For REPRODUCED to apply, all of the following must hold:

  • Every reported metric we attempted to reproduce landed inside its tolerance band (see defaults).
  • At least three random seeds were averaged; per-seed variance did not push any seed outside the band.
  • The sanity baseline for the benchmark passed (a known checkpoint on the same eval script lands where it should).
  • Confidence ≥ 0.7. Confidence aggregates seed agreement, metric coverage, extractor certainty on the reported number, and runtime stability.
  • The reproduction ran on the released code, on the released checkpoint where applicable, on the eval split the paper specifies.

REPRODUCED verdicts get a 5% audit pass: a random sample is hand-reviewed against the captured stdout and the paper text, and the audit log is appended to the methodology page. A failed audit retracts the verdict and lists it at /legal/retractions.

PARTIAL

PARTIAL covers the space between REPRODUCED and WRONG. It is the honest label when the reproduction signal is real but not decisive. A verdict lands at PARTIAL when any of the following is true:

  • Directional match outside the band. The reproduced number moves in the same direction as the reported one but lands outside the tolerance window — for example, the paper reports 78.4 top-1 and the reproduction averages 76.0 with seed variance ±0.4.
  • Proxy benchmark. We could not run the exact eval split the paper used (gated data, missing license, infrastructure mismatch) so we ran the closest public proxy and the result reproduced on the proxy. The proxy is named in the verdict drawer.
  • Mixed metric coverage. Some reported numbers landed inside the band, others outside. The verdict drawer lists every metric, its tolerance, and its outcome.
  • Single-seed signal short of confidence. One seed run produced a strong not-reproduced signal, but the multi-seed budget was exhausted before we could confirm. The label stays PARTIAL until the budget unlocks a re-run.

PARTIAL is never a softened WRONG. If a finding has the evidence to clear the WRONG gates, it ships as WRONG; if it does not, the public label is PARTIAL and the drawer says why.

WRONG

The label is precise, narrow, and engineered to mean exactly one thing: a specific headline numerical claim in a specific version of a paper did not reproduce, within published tolerance, on a specific automated reproduction job whose ID is attached to the verdict. It is never a statement about the author's intent, the lab's history, or the paper's overall worth.

A WRONG verdict is published only after every one of the reliability gates below clears:

  • ≥3 seeds agree the reproduced number is outside tolerance, in the same direction (multi-seed).
  • The sanity baseline for the benchmark passed on this image, this eval script, and this hardware.
  • The cross-model triple ensemble (multi-temperature, multi-model within OpenAI, multi-prompt) agreed on the WRONG verdict for this run.
  • Confidence ≥ 0.9.
  • The corresponding author was notified by email and 72 hours have elapsed since the notice (PRD §17.X.1).

The 72-hour pre-publication notice is the legal floor of the system: until the timer elapses, the WRONG badge is suppressed everywhere — paper page, Wall of Wrong, RSS, OG image, auto-tweet. During the hold the verdict renders as PENDING. The full WRONG definition page lives at /methodology/wrong.
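The interaction between the gates and the 72-hour hold can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `WrongCandidate` fields and the `public_label` function are hypothetical names, and returning `None` simply means "WRONG is not eligible, so another rule on this page decides the label".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

NOTICE_WINDOW = timedelta(hours=72)   # author pre-publication notice
WRONG_CONFIDENCE_FLOOR = 0.9


@dataclass
class WrongCandidate:
    """Hypothetical record of one not-reproduced finding."""
    seeds_outside_same_direction: int      # seeds outside tolerance, same side
    sanity_baseline_passed: bool
    ensemble_agreed: bool
    confidence: float
    author_notified_at: Optional[datetime]  # None if no notice was sent yet


def public_label(c: WrongCandidate, now: datetime) -> Optional[str]:
    """Return "WRONG", "PENDING", or None when WRONG is not eligible."""
    evidence_ok = (
        c.seeds_outside_same_direction >= 3
        and c.sanity_baseline_passed
        and c.ensemble_agreed
        and c.confidence >= WRONG_CONFIDENCE_FLOOR
    )
    if not evidence_ok:
        return None
    # Until the notice has been sent and 72 hours have elapsed,
    # the finding renders as PENDING on every surface.
    if c.author_notified_at is None or now - c.author_notified_at < NOTICE_WINDOW:
        return "PENDING"
    return "WRONG"


notified = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
candidate = WrongCandidate(3, True, True, 0.93, notified)
print(public_label(candidate, notified + timedelta(hours=24)))  # still in hold
print(public_label(candidate, notified + timedelta(hours=72)))  # hold elapsed
```

Keeping the timer as the last check means a finding that fails any evidence gate never enters the hold at all.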

Demo mode note: this platform is v0.1 BETA. Notifier emails are suppressed while DEMO_MODE is on, so the 72-hour timer is exercised end to end on staging fixtures rather than against live authors. The legal flow is implemented, but live sending stays disabled until DEMO_MODE is switched off.

PENDING

PENDING means a reproduction job is queued, running, or held in the 72-hour pre-publication notice window before a not-reproduced finding goes public. The badge is intentionally calm — it carries no claim about the paper, only about the status of our work.

  • Queued. The planner has accepted the paper and the job is in line behind the per-day compute budget.
  • Running. The container is live; logs are streaming. The job has a wall-clock cap and will be killed if it exceeds it.
  • Pre-publication notice. A high-confidence not-reproduced finding is waiting out the 72-hour author-notice window. The author may reply, dispute, or stay silent.
  • Cool-off. A second not-reproduced finding on the same paper is held for 7 days after the first to prevent an avalanche effect while methodology is audited.

PRE vs POST

Every paper on the site carries up to two distinct signals.

  • PRE is a Phase-1 heuristic: a no-compute, no-sandbox prediction of how likely the paper is to reproduce, based on the abstract, the repo layout, the presence or absence of a checkpoint, the dataset license, and the tightness of the reported numbers. PRE is cheap, fast, and explicitly probabilistic. It never appears with a WRONG badge.
  • POST is the sandbox reproduction: a Modal container actually runs the code, records stdout and a signed manifest, and the verdict is computed from the captured numbers. Every public REPRODUCED, PARTIAL, and WRONG verdict on the site is a POST verdict — backed by a job ID and an evidence drawer.

A paper can carry a PRE prediction with no POST verdict (we have not run the code yet) or a POST verdict that supersedes the PRE prediction. When both are present, the POST verdict is the badge on the page; the PRE prediction lives in the drawer next to its calibration curve.
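The precedence rule above can be written down in a few lines. This is only a sketch: `render_signals`, the dictionary keys, and the "PRE 62%" badge text are hypothetical, since this page does not specify how a PRE-only paper is displayed.

```python
from typing import Optional


def render_signals(pre_probability: Optional[float],
                   post_verdict: Optional[str]) -> dict:
    """Decide what the page badge and the drawer show for one paper."""
    if post_verdict is not None:
        # POST supersedes PRE: the verdict is the badge and the
        # PRE prediction (if any) moves into the drawer.
        return {"badge": post_verdict, "drawer_pre": pre_probability}
    if pre_probability is not None:
        # PRE alone is a probability and never carries a WRONG badge.
        return {"badge": f"PRE {pre_probability:.0%}", "drawer_pre": None}
    return {"badge": None, "drawer_pre": None}


print(render_signals(0.62, None))
print(render_signals(0.62, "PARTIAL"))
```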

The reproduction pipeline

Each reproduction job moves through nine stages. Every stage is logged; logs persist for 365 days; container hashes and commands persist indefinitely so that any third party can re-run the same job.

  1. Planner decides ATTEMPT, DEFER, or NOT_ATTEMPTED

    An agent reads the paper, the abstract, the linked repo, and the table of headline numbers and emits a structured decision. The planner refuses if there is no public code, the data is gated, the license forbids reproduction, or the required compute exceeds the per-paper cap.

  2. prepare_environment (Modal, hermetic, no network egress at run time)

    Clone the repo at the default commit. Parse pyproject / requirements / Dockerfile / env.yml. Build the Docker image, cached by repo + commit, with hash-pinned dependencies via pip-compile --generate-hashes.

  3. discover_experiments

    Find the README's reproduction section. Parse Makefiles and scripts/. An LLM extracts the experiment commands and the expected outputs, wrapped in an untrusted-content gate so prompt-injection payloads in the repo cannot redirect the agent.

  4. Planner picks the smallest credible experiment within budget

    Prefer running inference with a released checkpoint on a test split to retraining from scratch. Avoid 1,000-GPU training reruns. The selection is recorded with rationale.

  5. run_experiment in a sandboxed container with hard caps

    Capture stdout, stderr, and metrics.json. Enforce wall-clock, CPU, RAM, GPU-minutes, and ephemeral-disk caps at the runtime level. Kill the job as soon as any cap is exceeded.

  6. Extractor parses the paper for reported numbers and the run output for reproduced numbers

    Emit (metric, reported, reproduced) tuples with byte-offset citations into both the paper text and the captured run output.

  7. Verdict computation

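    Apply the per-metric tolerance. Status is REPRODUCED if all metrics pass, PARTIAL if only some pass, and NOT_REPRODUCED if all fail. NOT_REPRODUCED is a pipeline status; it becomes a public WRONG only after the reliability gates in stage 8 clear. Confidence is a function of seeds, metrics, and per-seed variance.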

  8. Reliability gates for any public WRONG (PRD §13.7)

    Multi-seed agreement (≥3 seeds), sanity-check baseline, cross-model verdict agreement (OpenAI triple ensemble — multi-temp, multi-model, multi-prompt), trace verification, confidence ≥ 0.9, and a 5% audit on REPRODUCED verdicts.

  9. Persistence and notification

    Insert the verdict, results, and annotations into the database via a trusted Verdict Validator service (agents have no direct DB write access). Send the 72-hour author pre-publication notice for high-confidence not-reproduced verdicts. Email subscribers and the author claim queue.

A WRONG verdict is published only after the pipeline output clears every reliability gate in PRD §13.7: multi-seed agreement, sanity baseline, cross-model agreement (OpenAI triple ensemble), confidence ≥ 0.9, and the 72-hour author pre-publication notice.
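Stage 7 of the pipeline reduces per-metric pass/fail outcomes to a single status. A minimal sketch of that reduction is below; the function name `pipeline_status` is hypothetical, and the confidence computation is deliberately left out because its formula is not given on this page.

```python
from typing import Sequence


def pipeline_status(metric_passed: Sequence[bool]) -> str:
    """Collapse per-metric tolerance outcomes into one pipeline status.

    Each element is True when the reproduced value for that metric
    landed inside its tolerance band.
    """
    if not metric_passed:
        raise ValueError("at least one metric outcome is required")
    if all(metric_passed):
        return "REPRODUCED"
    if any(metric_passed):
        return "PARTIAL"
    return "NOT_REPRODUCED"


print(pipeline_status([True, True, True]))
print(pipeline_status([True, False, True]))
print(pipeline_status([False, False]))
```

The result is a status, not yet a public label: the gates in stage 8 still decide whether a NOT_REPRODUCED status may be shown as WRONG.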

Multi-seed agreement

The first reliability gate. A reported number cannot be declared not-reproduced on a single run. We require at least three independent seeds — typically 0, 1, 2 — and we score the gate only when all three land on the same side of the tolerance band.

  • Three seeds agree, all inside band: REPRODUCED is on the table.
  • Three seeds agree, all outside band, same direction: WRONG is on the table — but the remaining gates still have to clear.
  • Seeds straddle the band: the verdict moves to PARTIAL with the per-seed numbers in the drawer.
  • Per-seed variance > tolerance: the run is treated as inconclusive; the page reports the spread and waits for a larger budget.
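The four outcomes above can be sketched as one classifier. Several details here are assumptions rather than documented behavior: the tolerance is treated as an absolute band around the reported value, "per-seed variance" is read as the sample standard deviation of the seed results, the spread check runs first, and the returned strings are hypothetical names.

```python
from statistics import stdev
from typing import Sequence


def multi_seed_gate(reported: float, seeds: Sequence[float], tol: float) -> str:
    """Classify seed results against an absolute tolerance band."""
    if len(seeds) < 3:
        return "INSUFFICIENT_SEEDS"
    # Assumption: spread is measured as the sample standard deviation.
    if stdev(seeds) > tol:
        return "INCONCLUSIVE"
    deltas = [s - reported for s in seeds]
    if all(abs(d) <= tol for d in deltas):
        return "REPRODUCED_ELIGIBLE"
    if all(d > tol for d in deltas) or all(d < -tol for d in deltas):
        # Every seed is outside the band on the same side.
        return "WRONG_ELIGIBLE"
    # Some seeds inside the band, some outside: the seeds straddle it.
    return "PARTIAL"


print(multi_seed_gate(90.0, [89.0, 90.5, 91.0], tol=2.0))
print(multi_seed_gate(90.0, [86.0, 86.5, 87.0], tol=2.0))
print(multi_seed_gate(90.0, [87.5, 89.0, 90.0], tol=2.0))
```

"Eligible" is deliberate: passing this gate only puts REPRODUCED or WRONG on the table, and the remaining gates still have to clear.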

Sanity baseline

The second reliability gate. Before any verdict is written, we run a known-good baseline on the same image, the same eval script, and the same hardware to verify the environment itself is not poisoning the numbers.

The baseline is a checkpoint with a published expected score on the benchmark in question — typically the previous-generation model the paper is comparing against. If the baseline lands inside its own tolerance band on our infra, the gate passes; if it does not, the job is marked sandbox-suspect and no verdict ships. The baseline number, its expected value, and the delta are written into the verdict manifest so anyone re-running the job can verify the gate themselves.
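A minimal sketch of the baseline check and the values it would write into the manifest follows. The class name, the dictionary keys, and the use of an absolute tolerance are assumptions for illustration; only the "sandbox-suspect" marker and the three recorded quantities (observed value, expected value, delta) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineCheck:
    """One known-good checkpoint evaluated on the job's image and hardware."""
    expected: float    # published score for the baseline checkpoint
    observed: float    # score measured inside the sandbox
    tolerance: float   # absolute band (an assumption for this sketch)

    @property
    def delta(self) -> float:
        return self.observed - self.expected

    @property
    def passed(self) -> bool:
        return abs(self.delta) <= self.tolerance


def manifest_entry(check: BaselineCheck) -> dict:
    """Fields a re-runner would need to verify the gate (hypothetical keys)."""
    return {
        "baseline_expected": check.expected,
        "baseline_observed": check.observed,
        "baseline_delta": check.delta,
        "gate": "passed" if check.passed else "sandbox-suspect",
    }


print(manifest_entry(BaselineCheck(expected=76.1, observed=75.5, tolerance=2.0)))
```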

Cross-model gate (OpenAI triple ensemble)

The third reliability gate, and the one most specific to a system where LLMs are scoring scientific output. Before a WRONG verdict is allowed to clear, the same captured stdout, the same extracted reported and reproduced numbers, and the same tolerance schema are fed to a triple ensemble of OpenAI models and the verdicts are required to agree.

  • Multi-temperature. The same prompt is run at three temperatures spanning low-determinism through standard sampling.
  • Multi-model within OpenAI. Two distinct OpenAI model tiers — a flagship reasoner and a mid-tier verifier — are queried in parallel and required to land on the same label.
  • Multi-prompt. Three prompt phrasings that each derive the verdict from the same evidence are required to agree (one neutral, one adversarial-against-WRONG, one adversarial-for-WRONG).
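The agreement requirement can be sketched without calling any model API. How the three axes are combined is not specified here, so this sketch assumes each variant on each axis casts one vote and that the gate needs all three axes to be present and every vote to match. `Vote`, the variant names, and `ensemble_agrees` are hypothetical.

```python
from typing import Iterable, NamedTuple

REQUIRED_AXES = {"temperature", "model", "prompt"}


class Vote(NamedTuple):
    axis: str      # "temperature", "model", or "prompt"
    variant: str   # hypothetical variant name, e.g. "adversarial-for-WRONG"
    label: str     # verdict label returned for this variant


def ensemble_agrees(votes: Iterable[Vote], target: str = "WRONG") -> bool:
    """True only if every axis voted and every vote equals the target label."""
    votes = list(votes)
    axes_seen = {v.axis for v in votes}
    if not REQUIRED_AXES <= axes_seen:
        return False
    return all(v.label == target for v in votes)


votes = [
    Vote("temperature", "low", "WRONG"),
    Vote("temperature", "mid", "WRONG"),
    Vote("temperature", "standard", "WRONG"),
    Vote("model", "flagship", "WRONG"),
    Vote("model", "mid-tier", "WRONG"),
    Vote("prompt", "neutral", "WRONG"),
    Vote("prompt", "adversarial-against-WRONG", "WRONG"),
    Vote("prompt", "adversarial-for-WRONG", "WRONG"),
]
print(ensemble_agrees(votes))
```

Requiring unanimity means a single dissenting variant is enough to keep a WRONG verdict from clearing this gate.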

Per PRD §13.7.3 Pass 4 (effective 2026-05-05), the ensemble is OpenAI-only. The earlier Anthropic + OpenAI cross-vendor design was retired because the dual-vendor reliability cost exceeded the marginal gain on the benchmark suite. The change is logged in the update log below.

The evidence chain

Every public verdict — REPRODUCED, PARTIAL, or WRONG — ships with an evidence chain that a third party can re-execute end to end. The chain is one click from the verdict badge on every surface.

  1. Reproduction job ID. A stable identifier for the exact Modal job that produced the verdict.
  2. Captured stdout and stderr. With byte-offset citations into the run output for the reproduced number.
  3. Paper text byte-offset. For the reported number, so anyone can verify we read the right cell of the right table.
  4. Container hash. The immutable image SHA the sandbox ran on, plus the exact shell commands used.
  5. Signed manifest. Binding the above into a single artifact, so a third party can re-run the same job and check the result against ours.
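To make the idea of a signed manifest concrete, here is a small sketch that binds the chain into one canonical JSON payload and signs it. The signing scheme is not documented on this page: HMAC-SHA256 with a shared key is used here only as a stand-in, and the field names are hypothetical. A scheme meant for third-party verification would more likely use a public-key signature so verifiers do not need the secret.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the manifest with HMAC-SHA256."""
    # sort_keys and fixed separators make the byte payload deterministic.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


manifest = {
    "job_id": "job-0001",                        # hypothetical identifier
    "stdout_sha256": hashlib.sha256(b"top1=76.0\n").hexdigest(),
    "paper_byte_offset": [10432, 10436],         # hypothetical span
    "container_hash": "sha256:<image digest>",   # placeholder, not a real digest
    "commands": ["python eval.py --split test"],
}
key = b"demo-only-key"
signature = sign_manifest(manifest, key)
print(verify_manifest(manifest, signature, key))
```

Any change to a bound field, such as a different container hash or command, produces a different signature, which is what lets a re-runner detect that they are not checking the same job.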

Agents have no direct database write access. Every verdict that appears on a public surface was inserted by the Verdict Validator service, which gates on the evidence chain above plus the four reliability gates plus the C1 (structural) + C2 (textual PDF-match) claim-citation gates added in the 2026-05-13 hardening. The gate logs are persisted with the manifest.

Agents and prompt versions

Five agents staff the pipeline. Each one has a public persona page, a versioned system prompt, and a model card. Every public artifact we publish records the prompt version and model used at the time.

Agent | Role | Model | Prompt
Auditor | Reproduction planner and executor | Claude Opus 4.7 | v0.1.0
Predictor | PRE replication-likelihood scoring | Claude Sonnet 4.6 | v0.1.0
Reviewer | PEER peer-review prediction | Claude Opus 4.7 | v0.1.0
Reader | Q&A on paper pages | Claude Haiku 4.5 | v0.1.0
Notifier | Author notice and email composition | Claude Haiku 4.5 | v0.1.0

Dispute process

Any verified author of a paper may file a dispute against any verdict on that paper. Filing a dispute flips the verdict status to disputed_pending_review within 60 seconds, and the public surfaces render the badge dimmed with a link to the dispute thread.

We commit to a first agent response within 48 hours of filing, and a first amendment-or-confirmation decision within 48 additional hours. If a methodology error is confirmed, the verdict is retracted and listed at /legal/retractions.
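The response commitments translate into simple deadline arithmetic. The sketch below assumes the second 48-hour window starts at the first-response deadline, so the decision is due at most 96 hours after filing; the text could also be read as counting from the actual response time. The function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STATUS_FLIP_SLA = timedelta(seconds=60)
FIRST_RESPONSE_SLA = timedelta(hours=48)
DECISION_SLA = timedelta(hours=48)   # counted after the first-response deadline


def dispute_deadlines(filed_at: datetime) -> dict:
    """Latest allowed times for each step after a dispute is filed."""
    first_response = filed_at + FIRST_RESPONSE_SLA
    return {
        "status_flipped_by": filed_at + STATUS_FLIP_SLA,
        "first_response_by": first_response,
        "decision_by": first_response + DECISION_SLA,
    }


filed = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
for step, due in dispute_deadlines(filed).items():
    print(step, due.isoformat())
```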

Full policy: /legal/disputes.

Compute environment

Every reproduction job runs in a fresh Modal container, pinned to an immutable image SHA. The image SHA is published with each verdict so that a third party can re-run the same job.

base image | nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 (pinned SHA)
isolation | gVisor or equivalent kernel-level runtime
wall-clock cap | 30 minutes per job
CPU / RAM cap | 8 vCPU · 32 GB RAM
disk cap | 50 GB ephemeral, no persistent volumes
GPU cap | 1× A10G or A100-40 · 30 GPU-minutes
per-paper $ cap | $20 default · admin override up to $200
per-day $ cap | $500 default; auto-pause at 1.5×
secrets in sandbox | none (no platform tokens injected)
network egress | kernel-level allow-list; HuggingFace Hub, PyPI, Conda, GitHub, declared dataset hosts only

The runtime kernel-isolation layer is gVisor (or Modal's equivalent). Network egress is enforced as a kernel-level NetworkPolicy with an allow-list (HuggingFace Hub, PyPI, Conda, GitHub, the paper's declared dataset hosts only). All other egress is dropped at the network namespace boundary.
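The dollar caps in the table lend themselves to a short worked example. This sketch assumes "auto-pause at 1.5×" means the queue pauses once the day's spend reaches 1.5 times the per-day default ($750), and that the planner returns OUT_OF_BUDGET when its estimate exceeds the per-paper cap. All function names are hypothetical.

```python
from typing import Optional

PER_PAPER_DEFAULT = 20.0        # dollars
PER_PAPER_MAX_OVERRIDE = 200.0  # admin override ceiling
PER_DAY_DEFAULT = 500.0
AUTO_PAUSE_MULTIPLIER = 1.5


def paper_cap(admin_override: Optional[float] = None) -> float:
    """Effective per-paper cap, honoring an admin override within its ceiling."""
    if admin_override is None:
        return PER_PAPER_DEFAULT
    if not 0 < admin_override <= PER_PAPER_MAX_OVERRIDE:
        raise ValueError("override must be in (0, 200] dollars")
    return admin_override


def planner_budget_decision(estimated_cost: float,
                            admin_override: Optional[float] = None) -> str:
    """OUT_OF_BUDGET when the estimate exceeds the effective per-paper cap."""
    if estimated_cost > paper_cap(admin_override):
        return "OUT_OF_BUDGET"
    return "ATTEMPT"


def should_auto_pause(spent_today: float) -> bool:
    """Assumed reading: pause once spend reaches 1.5x the per-day default."""
    return spent_today >= PER_DAY_DEFAULT * AUTO_PAUSE_MULTIPLIER


print(planner_budget_decision(25.0))
print(planner_budget_decision(25.0, admin_override=50.0))
print(should_auto_pause(760.0))
```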

Tolerance defaults

Tolerance is the per-metric window inside which a reproduced number counts as a match. Defaults are conservative, and overridable per paper when the paper itself specifies a stricter or looser bound. Every threshold actually used is recorded in the verdict drawer.

Metric kind | Default tolerance | Rationale
Top-1 accuracy / F1 / BLEU | ±2 absolute points | Seed variance
Speed (ms, throughput) | ±10% relative | Hardware variation
Loss values | ±5% relative | Convergence noise
Counts (num parameters) | exact | No reason to vary
Floating-point quality (FID, SSIM) | ±5% relative | Sampling noise
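The defaults in the table can be applied with a single comparison function. In this sketch the short metric-kind keys are hypothetical, relative tolerances are measured against the reported value (an assumption, since the reference point is not stated), and per-paper overrides are left out.

```python
# (mode, bound) per metric kind, taken from the defaults table.
DEFAULT_TOLERANCES = {
    "accuracy": ("absolute", 2.0),   # top-1 accuracy / F1 / BLEU, in points
    "speed": ("relative", 0.10),     # ms, throughput
    "loss": ("relative", 0.05),
    "count": ("exact", 0.0),         # e.g. number of parameters
    "quality": ("relative", 0.05),   # FID, SSIM
}


def within_tolerance(kind: str, reported: float, reproduced: float) -> bool:
    """True when the reproduced value counts as a match for this metric kind."""
    mode, bound = DEFAULT_TOLERANCES[kind]
    diff = abs(reproduced - reported)
    if mode == "exact":
        return reproduced == reported
    if mode == "absolute":
        return diff <= bound
    # Relative: the band scales with the size of the reported value.
    return diff <= bound * abs(reported)


print(within_tolerance("accuracy", 78.4, 77.0))  # 1.4 points off
print(within_tolerance("accuracy", 78.4, 76.0))  # 2.4 points off
print(within_tolerance("speed", 100.0, 109.0))   # 9% slower
```

With these defaults, a 2.4-point accuracy gap falls outside the band while a 9% speed difference falls inside it.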

Update log

Any methodology change is logged here with a timestamp. The log is append-only; superseded entries are not deleted.

  1. 2026-05-12: Methodology page expanded with per-verdict anchored sections (#reproduced, #partial, #wrong, #pending), an explicit PRE-vs-POST section, and the evidence-chain, multi-seed, sanity-baseline, and cross-model gates broken out as their own sections. Legacy /methodology/<status> routes now redirect to the corresponding anchor.
  2. 2026-05-05: Cross-model gate moved from a cross-vendor design to an OpenAI-only triple ensemble (multi-temperature, multi-model within OpenAI, multi-prompt). PRD §13.7.3 Pass 4. Earlier cross-vendor wording on the methodology page is superseded.
  3. 2026-04-24: Initial publication of the public methodology page.