paperiswrong

Verdict labels at a glance

A verdict is a label attached to a specific numerical claim in a specific version of a paper, computed from a specific reproduction job. It is never attached to an author, a lab, or a paper as a whole. Each label below has its own anchored section further down this page with the full rule.

Verdict	Threshold	One-line meaning
REPRODUCED	all metrics within tolerance · ≥3 seeds · confidence ≥ 0.7	Every numerical claim we tested matched the paper's reported value within the published tolerance window.
WRONG	all seeds outside tolerance, same direction · confidence ≥ 0.9 · cross-model agreement · 72h notice elapsed	Our reproduction did not match the reported numbers for the claim under test. Published only after every reliability gate in §13.7 passes.
PARTIAL	directional match outside tolerance · or proxy benchmark · or single-seed signal	Between REPRODUCED and WRONG. Reproduces directionally but outside the band, on a proxy split, or short of the confidence floor.
PENDING	job queued, in flight, or in 72h pre-publication hold	A reproduction job is in progress, or a high-confidence not-reproduced finding is in the 72-hour author-notice window.
DISPUTED	author has filed a dispute via /legal/disputes	An author dispute is open. The badge renders dimmed with a link to the dispute thread until the methodology audit completes.
NOT_ATTEMPTED	planner refused: no code, gated data, license restriction, etc.	The planner agent declined to run a reproduction. The page explains why in the rationale field; no claim about reproducibility is made.
OUT_OF_BUDGET	planner estimated cost > per-paper budget cap	The smallest credible experiment we could find still exceeded the per-paper compute budget. Recorded with rationale; eligible for re-attempt at a higher budget tier.

REPRODUCED

The label means: every numerical claim we tested matched the paper's reported value within the published tolerance window, across multiple random seeds, with a confidence score at or above 0.7.

For REPRODUCED to apply, all of the following must hold:

Every reported metric we attempted to reproduce landed inside its tolerance band (see defaults).
At least three random seeds were averaged; per-seed variance did not push any seed outside the band.
The sanity baseline for the benchmark passed (a known checkpoint on the same eval script lands where it should).
Confidence ≥ 0.7. Confidence aggregates seed agreement, metric coverage, extractor certainty on the reported number, and runtime stability.
The reproduction ran on the released code, on the released checkpoint where applicable, on the eval split the paper specifies.

REPRODUCED verdicts get a 5% audit pass: a random sample is hand-reviewed against the captured stdout and the paper text, and the audit log is appended to the methodology page. A failed audit retracts the verdict and lists it at /legal/retractions.

PARTIAL

PARTIAL covers the space between REPRODUCED and WRONG. It is the honest label when the reproduction signal is real but not decisive. A verdict lands at PARTIAL when any of the following is true:

Directional match outside the band. The reproduced number moves in the same direction as the reported one but lands outside the tolerance window — for example, the paper reports 78.4 top-1 and the reproduction averages 76.0 with seed variance ±0.4.
Proxy benchmark. We could not run the exact eval split the paper used (gated data, missing license, infrastructure mismatch) so we ran the closest public proxy and the result reproduced on the proxy. The proxy is named in the verdict drawer.
Mixed metric coverage. Some reported numbers landed inside the band, others outside. The verdict drawer lists every metric, its tolerance, and its outcome.
Single-seed signal short of confidence. One seed run produced a strong not-reproduced signal, but the multi-seed budget was exhausted before we could confirm. The label stays PARTIAL until the budget unlocks a re-run.

PARTIAL is never a softened WRONG. If a finding has the evidence to clear the WRONG gates, it ships as WRONG; if it does not, the public label is PARTIAL and the drawer says why.

WRONG

The label is precise, narrow, and engineered to mean exactly one thing: a specific headline numerical claim in a specific version of a paper did not reproduce, within published tolerance, on a specific automated reproduction job whose ID is attached to the verdict. It is never a statement about the author's intent, the lab's history, or the paper's overall worth.

A WRONG verdict is published only after every one of the reliability gates below clears.

≥3 seeds agree the reproduced number is outside tolerance, in the same direction (multi-seed).
The sanity baseline for the benchmark passed on this image, this eval script, and this hardware.
The cross-model triple ensemble (multi-temperature, multi-model within OpenAI, multi-prompt) agreed on the WRONG verdict for this run.
Confidence ≥ 0.9.
The corresponding author was notified by email and 72 hours have elapsed since the notice (PRD §17.X.1).
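
The gate list above reads as a plain conjunction. The following is a minimal Python sketch under stated assumptions, not the platform's implementation: `GateInputs`, its field names, and `failed_wrong_gates` are hypothetical; only the thresholds (three seeds, confidence 0.9, 72 hours) come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

NOTICE_WINDOW = timedelta(hours=72)  # from the text
MIN_SEEDS = 3                        # from the text
MIN_CONFIDENCE = 0.9                 # from the text


@dataclass(frozen=True)
class GateInputs:
    """Hypothetical container for the evidence each gate inspects."""
    seeds_outside_same_direction: int
    sanity_baseline_passed: bool
    ensemble_agreed: bool
    confidence: float
    author_notified_at: Optional[datetime]


def failed_wrong_gates(g: GateInputs, now: datetime) -> List[str]:
    """Return the gates that have not cleared; an empty list means all cleared."""
    failed = []
    if g.seeds_outside_same_direction < MIN_SEEDS:
        failed.append("multi_seed")
    if not g.sanity_baseline_passed:
        failed.append("sanity_baseline")
    if not g.ensemble_agreed:
        failed.append("cross_model")
    if g.confidence < MIN_CONFIDENCE:
        failed.append("confidence")
    # The notice gate fails if no notice was sent or the window is still open.
    if g.author_notified_at is None or now - g.author_notified_at < NOTICE_WINDOW:
        failed.append("author_notice")
    return failed
```

When `author_notice` is the only entry returned, the finding is in the hold described in the next paragraph and renders as PENDING.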

The 72-hour pre-publication notice is the legal floor of the system: until the timer elapses, the WRONG badge is suppressed everywhere — paper page, Wall of Wrong, RSS, OG image, auto-tweet. During the hold the verdict renders as PENDING. The full WRONG definition page lives at /methodology/wrong.

Demo mode note: this platform is v0.1 BETA. Notifier emails are suppressed while DEMO_MODE is on, so the 72-hour timer is exercised end to end on staging fixtures rather than against live authors. The legal flow is implemented; the wire to send is held until the demo gate flips.

PENDING

PENDING means a reproduction job is queued, running, or held in the 72-hour pre-publication notice window before a not-reproduced finding goes public. The badge is intentionally calm — it carries no claim about the paper, only about the status of our work.

Queued. The planner has accepted the paper and the job is in line behind the per-day compute budget.
Running. The container is live; logs are streaming. The job has a wall-clock cap and will be killed if it exceeds it.
Pre-publication notice. A high-confidence not-reproduced finding is waiting out the 72-hour author-notice window. The author may reply, dispute, or stay silent.
Cool-off. A second not-reproduced finding on the same paper is held for 7 days after the first to prevent an avalanche effect while methodology is audited.
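
The two hold windows above combine into a single earliest release time. This is a hedged sketch: the function name and arguments are hypothetical, and it assumes the 7-day cool-off is measured from the timestamp of the first finding, which the text does not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

NOTICE_WINDOW = timedelta(hours=72)  # author-notice window from the text
COOL_OFF = timedelta(days=7)         # cool-off after a first finding, from the text


def earliest_publication(
    notified_at: datetime,
    previous_finding_at: Optional[datetime] = None,
) -> datetime:
    """Earliest time a held not-reproduced finding may leave PENDING."""
    release = notified_at + NOTICE_WINDOW
    if previous_finding_at is not None:
        # Assumption: the cool-off clock starts at the first finding's timestamp.
        release = max(release, previous_finding_at + COOL_OFF)
    return release
```

Taking the later of the two deadlines means neither hold can shorten the other.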

PRE vs POST

Every paper on the site carries up to two distinct signals.

PRE is a Phase-1 heuristic: a no-compute, no-sandbox prediction of how likely the paper is to reproduce, based on the abstract, the repo layout, the presence or absence of a checkpoint, the dataset license, and the tightness of the reported numbers. PRE is cheap, fast, and explicitly probabilistic. It never appears with a WRONG badge.
POST is the sandbox reproduction: a Modal container actually runs the code, captures the measurements, and persists the evidence used to compute the verdict. Every public REPRODUCED, PARTIAL, and WRONG verdict on the site is a POST verdict — backed by a job ID and an evidence drawer.

A paper can carry a PRE prediction with no POST verdict (we have not run the code yet) or a POST verdict that supersedes the PRE prediction. When both are present, the POST verdict is the badge on the page; the PRE prediction lives in the drawer next to its calibration curve.

The reproduction pipeline

Each reproduction job moves through nine stages. Every stage is logged; logs persist for 365 days; container hashes and commands persist indefinitely so that any third party can re-run the same job.

1. Planner decides ATTEMPT, DEFER, or NOT_ATTEMPTED
An agent reads the paper, the abstract, the linked repo, and the table of headline numbers and emits a structured decision. The planner refuses if there is no public code, the data is gated, the license forbids reproduction, or the required compute exceeds the per-paper cap.
2. prepare_environment (Modal, hermetic, no network egress at run time)
Clone the repo at the default commit. Parse pyproject / requirements / Dockerfile / env.yml. Build the Docker image, cached by repo + commit, with hash-pinned dependencies via pip-compile --generate-hashes.
3. discover_experiments
Find the README's reproduction section. Parse Makefiles and scripts/. An LLM extracts the experiment commands and the expected outputs, wrapped in an untrusted-content gate so prompt-injection payloads in the repo cannot redirect the agent.
4. Planner picks the smallest credible experiment within budget
Prefer running inference with a released checkpoint on a test split over retraining from scratch. Avoid 1,000-GPU training reruns. The selection is recorded with rationale.
5. run_experiment in a sandboxed container with hard caps
Capture stdout, stderr, and metrics.json. Enforce wall-clock, CPU, RAM, GPU-minutes, and ephemeral-disk caps at the runtime level. Kill on budget exceed.
6. Extractor parses the paper for reported numbers and the run output for reproduced numbers
Emit a structured paper quote and locator plus the measured metric. Validated receipts retain the exact quote, PDF hash, trace offset, and execution-trace hash.
7. Verdict computation
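Apply the per-metric tolerance. Status is REPRODUCED if all metrics pass, PARTIAL if only some pass, and NOT_REPRODUCED if all fail. NOT_REPRODUCED is the pipeline status; it becomes a public WRONG only after the stage 8 gates clear. Confidence is a function of seeds, metrics, and per-seed variance.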
8. Reliability gates for any public WRONG (PRD §13.7)
Multi-seed agreement (≥3 seeds), sanity-check baseline, cross-model verdict agreement (OpenAI triple ensemble — multi-temp, multi-model, multi-prompt), trace verification, confidence ≥ 0.9, and a 5% audit on REPRODUCED verdicts.
9. Persistence and notification
Insert the verdict, results, and annotations into the database via a trusted Verdict Validator service (agents have no direct DB write access). Send the 72-hour author pre-publication notice for high-confidence not-reproduced verdicts. Email subscribers and the author claim queue.

A WRONG verdict is published only after the pipeline output clears every reliability gate in PRD §13.7: multi-seed agreement, sanity baseline, cross-model agreement (OpenAI triple ensemble), confidence ≥ 0.9, and the 72-hour author pre-publication notice.
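
The status rule in the verdict-computation stage is an all/some/none mapping over per-metric outcomes. A minimal sketch follows; the function name and the dictionary input shape are assumptions made for illustration.

```python
from typing import Mapping


def verdict_status(metric_passed: Mapping[str, bool]) -> str:
    """Map per-metric pass/fail outcomes to the pipeline status."""
    if not metric_passed:
        raise ValueError("at least one metric outcome is required")
    outcomes = list(metric_passed.values())
    if all(outcomes):
        return "REPRODUCED"
    if any(outcomes):
        return "PARTIAL"
    # Pipeline status only; a public WRONG still requires the reliability gates.
    return "NOT_REPRODUCED"
```

For example, `verdict_status({"top1": True, "f1": False})` returns `"PARTIAL"`, which corresponds to the mixed metric coverage case.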

Multi-seed agreement

The first reliability gate. A reported number cannot be declared not-reproduced on a single run. We require at least three independent seeds — typically 0, 1, 2 — and we score the gate only when all three land on the same side of the tolerance band.

Three seeds agree, all inside band: REPRODUCED is on the table.
Three seeds agree, all outside band, same direction: WRONG is on the table — but the remaining gates still have to clear.
Seeds straddle the band: the verdict moves to PARTIAL with the per-seed numbers in the drawer.
Per-seed variance > tolerance: the run is treated as inconclusive; the page reports the spread and waits for a larger budget.
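
The four outcomes above can be sketched as one classifier. This is an illustration under assumptions, not the production rule: it handles absolute tolerances only, it reads "per-seed variance > tolerance" as the population standard deviation exceeding the tolerance, and it checks that condition first. The function and label names are hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def classify_seeds(reported: float, seed_values: Sequence[float], tol: float) -> str:
    """Classify a multi-seed run against an absolute tolerance band."""
    if len(seed_values) < 3:
        return "INSUFFICIENT_SEEDS"
    # Assumption: spread is measured as population standard deviation.
    if pstdev(seed_values) > tol:
        return "INCONCLUSIVE"
    offsets = [v - reported for v in seed_values]
    inside = [abs(o) <= tol for o in offsets]
    if all(inside):
        return "REPRODUCED_CANDIDATE"
    same_direction = all(o > 0 for o in offsets) or all(o < 0 for o in offsets)
    if not any(inside) and same_direction:
        return "WRONG_CANDIDATE"  # remaining gates still have to clear
    return "PARTIAL"              # seeds straddle the band
```

The `_CANDIDATE` suffix reflects the wording "on the table": passing this gate alone never publishes a verdict.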

Sanity baseline

The second reliability gate. Before any verdict is written, we run a known-good baseline on the same image, the same eval script, and the same hardware to verify the environment itself is not poisoning the numbers.

The baseline is a checkpoint with a published expected score on the benchmark in question — typically the previous-generation model the paper is comparing against. If the baseline lands inside its own tolerance band on our infra, the gate passes; if it does not, the job is marked sandbox-suspect and no verdict ships. The public validation receipt records whether the sanity baseline passed, alongside the retained trace and source hashes used to audit the reproduction.

Cross-model gate (OpenAI triple ensemble)

The third reliability gate, and the one most specific to a system where LLMs are scoring scientific output. Before a WRONG verdict is allowed to clear, the same captured stdout, the same extracted reported and reproduced numbers, and the same tolerance schema are fed to a triple ensemble of OpenAI models and the verdicts are required to agree.

Multi-temperature. The same prompt is run at three temperatures spanning low-determinism through standard sampling.
Multi-model within OpenAI. Two distinct OpenAI model tiers — a flagship reasoner and a mid-tier verifier — are queried in parallel and required to land on the same label.
Multi-prompt. Three prompt phrasings that each derive the verdict from the same evidence are required to agree (one neutral, one adversarial-against-WRONG, one adversarial-for-WRONG).

Per PRD §13.7.3 Pass 4 (effective 2026-05-05), the ensemble is OpenAI-only. The earlier Anthropic + OpenAI cross-vendor design was retired because the dual-vendor reliability cost exceeded the marginal gain on the benchmark suite. The change is logged in the update log below.
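
The agreement requirement can be sketched as a unanimity check over the ensemble's votes. The sketch below assumes the three axes are combined as a full cross product (2 models × 3 temperatures × 3 prompts = 18 votes), which the text does not state. The temperature values, model names, and the `judge` callable are placeholders; no real API is called.

```python
from itertools import product
from typing import Callable

# Placeholder values: the text names the axes but not the concrete settings.
TEMPERATURES = (0.0, 0.5, 1.0)
MODELS = ("flagship-reasoner", "mid-tier-verifier")
PROMPTS = ("neutral", "adversarial-against-wrong", "adversarial-for-wrong")


def ensemble_agrees(
    judge: Callable[[str, float, str], str],
    target: str = "WRONG",
) -> bool:
    """Return True only if every (model, temperature, prompt) vote equals target.

    `judge` is a hypothetical callable that returns a verdict label for one
    configuration, given the same captured evidence each time.
    """
    votes = [judge(m, t, p) for m, t, p in product(MODELS, TEMPERATURES, PROMPTS)]
    return all(v == target for v in votes)
```

Requiring unanimity rather than a majority keeps the gate conservative: a single dissenting configuration blocks the WRONG label.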

The evidence chain

A decisive public claim is shown only when its persisted validation receipt still matches the joined paper, job, protocol, measurement, and outcome. Legacy or mismatched rows fail closed to PENDING.

Joined identity. Job ID, paper version, agent version, protocol tier, and outcome.
Paper evidence. Exact quoted span, page and reviewed table-cell or text-span locator, plus the SHA-256 of the versioned PDF bytes.
Measurement identity. Metric, model and dataset revisions, sample count, and sample-manifest hash.
Execution evidence. A bounded retained trace, metric offset, sanity-baseline state, and hashes of the driver, Modal source, and trace.
Integrity receipt. Policy, citation, and receipt hashes bind those fields together. This is a deterministic integrity check, not an independent signature or a promise that every external dependency remains re-executable forever.

Agents have no direct database write access. Every verdict that appears on a public surface was inserted by the Verdict Validator service, which gates on the receipt fields above plus the four reliability gates plus the C1 (structural) + C2 (textual PDF-match) claim-citation gates added in the 2026-05-13 hardening. The gate receipt checks are persisted with the validated citation.
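
A deterministic integrity check of this kind can be sketched as a hash over a canonical serialization of the joined fields, failing closed when the stored hash no longer matches. The canonical-JSON scheme, the use of SHA-256 for the receipt itself, and the function and field names are assumptions for illustration; the text only confirms SHA-256 for the PDF bytes.

```python
import hashlib
import json
from typing import Mapping, Optional


def receipt_hash(fields: Mapping[str, object]) -> str:
    """Hash a canonical (sorted-key, compact) JSON encoding of the receipt fields."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def public_label(
    stored_label: str,
    stored_receipt_hash: Optional[str],
    current_fields: Mapping[str, object],
) -> str:
    """Show the stored label only if the receipt still matches; otherwise fail closed."""
    if stored_receipt_hash is None:
        return "PENDING"  # legacy row without a receipt
    if receipt_hash(current_fields) != stored_receipt_hash:
        return "PENDING"  # joined data no longer matches the receipt
    return stored_label
```

Sorting keys before hashing makes the result independent of field order, so the check detects changed values and ignores incidental reordering.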

Agents and prompt versions

Five agents staff the pipeline. Each one has a public persona page, a versioned system prompt, and a model card. Every public artifact we publish records the prompt version and model used at the time.

Agent	Role	Model	Prompt
Auditor	Reproduction planner and executor	Claude Opus 4.7	v0.1.0
Predictor	PRE replication-likelihood scoring	Claude Sonnet 4.6	v0.1.0
Reviewer	PEER peer-review prediction	Claude Opus 4.7	v0.1.0
Reader	Q&A on paper pages	Claude Haiku 4.5	v0.1.0
Notifier	Author notice and email composition	Claude Haiku 4.5	v0.1.0

Dispute process

Any verified author of a paper may file a dispute against any verdict on that paper. Filing a dispute flips the verdict status to disputed_pending_review within 60 seconds, and the public surfaces render the badge dimmed with a link to the dispute thread.

We commit to a first agent response within 48 hours of filing, and a first amendment-or-confirmation decision within 48 additional hours. If a methodology error is confirmed, the verdict is retracted and listed at /legal/retractions.

Full policy: /legal/disputes.

Compute environment

Reproduction jobs run in isolated Modal containers defined by versioned source. Current claim receipts publish hashes of the driver and Modal source that produced the retained trace. They do not yet publish a container image SHA or exact shell-command manifest.

base image	nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 (pinned SHA)
isolation	gVisor or equivalent kernel-level runtime
wall-clock cap	30 minutes per job
CPU / RAM cap	8 vCPU · 32 GB RAM
disk cap	50 GB ephemeral, no persistent volumes
GPU cap	1× A10G or A100-40 · 30 GPU-minutes
per-paper $ cap	$20 default · admin override up to $200
per-day $ cap	$500 default; auto-pause at 1.5×
secrets in sandbox	none — no platform tokens injected
network egress	kernel-level allow-list; HuggingFace Hub, PyPI, Conda, GitHub, declared dataset hosts only

The runtime kernel-isolation layer is gVisor (or Modal's equivalent). Network egress is enforced as a kernel-level NetworkPolicy with an allow-list (HuggingFace Hub, PyPI, Conda, GitHub, the paper's declared dataset hosts only). All other egress is dropped at the network namespace boundary.
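
The caps in the table bound which experiments the planner can select. The following sketch picks the cheapest candidate that fits every cap and returns nothing when none fits, which corresponds to OUT_OF_BUDGET. The `Candidate` type, its estimate fields, and the choice of cost as the meaning of "smallest" are assumptions; credibility filtering is taken to have happened before this step.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

PER_PAPER_CAP_USD = 20.0       # default from the table
ADMIN_MAX_CAP_USD = 200.0      # admin override ceiling from the table
GPU_MINUTES_CAP = 30.0         # from the table
WALL_CLOCK_CAP_MINUTES = 30.0  # from the table


@dataclass(frozen=True)
class Candidate:
    """Hypothetical planner estimate for one candidate experiment."""
    name: str
    est_cost_usd: float
    est_gpu_minutes: float
    est_wall_minutes: float


def pick_experiment(
    candidates: Sequence[Candidate],
    cap_usd: float = PER_PAPER_CAP_USD,
) -> Optional[Candidate]:
    """Return the cheapest candidate within all caps, or None (OUT_OF_BUDGET)."""
    if cap_usd > ADMIN_MAX_CAP_USD:
        raise ValueError("cap exceeds the admin override ceiling")
    feasible = [
        c for c in candidates
        if c.est_cost_usd <= cap_usd
        and c.est_gpu_minutes <= GPU_MINUTES_CAP
        and c.est_wall_minutes <= WALL_CLOCK_CAP_MINUTES
    ]
    return min(feasible, key=lambda c: c.est_cost_usd) if feasible else None
```

A `None` result is recorded with rationale rather than treated as a finding about the paper.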

Tolerance defaults

Tolerance is the per-metric window inside which a reproduced number counts as a match. Defaults are conservative, and overridable per paper when the paper itself specifies a stricter or looser bound. Every threshold actually used is recorded in the verdict drawer.

Metric kind	Default tolerance	Rationale
Top-1 accuracy / F1 / BLEU	±2 absolute points	Seed variance
Speed (ms, throughput)	±10% relative	Hardware variation
Loss values	±5% relative	Convergence noise
Counts (num parameters)	exact	No reason to vary
Floating-point quality (FID, SSIM)	±5% relative	Sampling noise

Update log

Any methodology change is logged here with a timestamp. The log is append-only; superseded entries are not deleted.

2026-05-12 Methodology page expanded with per-verdict anchored sections (#reproduced, #partial, #wrong, #pending), an explicit PRE-vs-POST section, and the evidence-chain, multi-seed, sanity-baseline, and cross-model gates broken out as their own sections. Legacy /methodology/<status> routes now redirect to the corresponding anchor.
2026-05-05 Cross-model gate moved from a cross-vendor design to an OpenAI-only triple ensemble (multi-temperature, multi-model within OpenAI, multi-prompt). PRD §13.7.3 Pass 4. Earlier cross-vendor wording on the methodology page is superseded.
2026-04-24 Initial publication of the public methodology page.