The exact technical definition
WRONG means: a specific headline numerical claim in a specific version of a paper did not reproduce, within published tolerance, on a specific automated reproduction job whose ID is attached to the verdict.
A verdict carries the WRONG label only when every one of the following is true: ≥3 seeds agree the reproduced number is outside tolerance in the same direction; the sanity-check baseline passed; two independent models (Claude and GPT) reach the same verdict; the confidence score is ≥ 0.9; and 72 hours have elapsed since the corresponding author was notified by email.
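The checklist above can be read as a single boolean predicate. The following is a minimal sketch under stated assumptions, not the platform's code: the `VerdictInputs` shape, its field names, and the `"not_reproduced"` verdict string are all hypothetical, and "same direction" is read as "at least three seeds outside tolerance on one side and none on the other".

```typescript
// Hypothetical input shape; field names are illustrative, not the platform's schema.
interface VerdictInputs {
  reported: number;               // headline number from the paper
  reproducedBySeed: number[];     // one reproduced number per seed
  tolerance: number;              // published absolute tolerance
  baselinePassed: boolean;        // sanity-check baseline result
  modelVerdicts: string[];        // one verdict string per independent model
  confidence: number;             // confidence score in [0, 1]
  authorNotifiedAt: Date | null;  // when the notice email was sent, if at all
}

const NOTICE_WINDOW_MS = 72 * 60 * 60 * 1000;

function qualifiesForWrong(v: VerdictInputs, now: Date): boolean {
  const deltas = v.reproducedBySeed.map((x) => x - v.reported);
  const below = deltas.filter((d) => d < -v.tolerance).length;
  const above = deltas.filter((d) => d > v.tolerance).length;
  // Assumption: no out-of-tolerance seed may point the other way.
  const seedsAgree = (below >= 3 && above === 0) || (above >= 3 && below === 0);

  // Assumption: both models must return the same "not_reproduced" string.
  const modelsAgree =
    v.modelVerdicts.length >= 2 &&
    v.modelVerdicts.every((m) => m === "not_reproduced");

  const noticeElapsed =
    v.authorNotifiedAt !== null &&
    now.getTime() - v.authorNotifiedAt.getTime() >= NOTICE_WINDOW_MS;

  return (
    seedsAgree &&
    v.baselinePassed &&
    modelsAgree &&
    v.confidence >= 0.9 &&
    noticeElapsed
  );
}
```

Because every condition is joined with a logical AND, failing any single one keeps the WRONG label off the verdict.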
What it does NOT mean
We do not, and will not, use this label to make any claim about an author's intent, honesty, or character. The platform takes no position on why a number did not reproduce. Common, ordinary reasons include:
- seed sensitivity that exceeds what the paper reported
- environment drift between the paper's setup and the released repo
- checkpoint-vs-training mismatches between what the paper measured and what the repo runs
- parsing differences in metric extraction
- numerical precision differences between hardware or library versions
- genuine scientific disagreement about the right way to score the metric
The verdict surface is engineered to be opinion-defensible (the methodology is fully disclosed), substantially true (the underlying factual claim is just “our job did not match the paper's number”), and claim-scoped (never paper-scoped).
Gates that block unjustified WRONG
Before a verdict is even allowed to be a public WRONG, two gates run:
- C1 (structural): the driver MUST declare a typed CLAIM_CITATION with the table, row, column, reported value, quoted text, and PDF page, and its PROTOCOL_MATCH tier must be exact. Any proxy or unknown tier is structurally incapable of publishing WRONG; it auto-downgrades to partial or pending.
- C2 (textual): the validator fetches the arXiv PDF and confirms the quotedText appears within ±200 characters of the cited reportedValue on the cited page. If the citation says “Table 8 says 5.6” and the PDF doesn't contain “5.6” near Table 8, the verdict cannot publish.
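The two gates can be sketched as a pair of pure functions. This is an illustrative sketch only: the `ClaimCitation` field names other than `quotedText` and `reportedValue` are assumptions, the page text is passed in as a string rather than fetched from arXiv, and "within ±200 characters" is read as "the quoted text lies entirely inside a window extending 200 characters on either side of some occurrence of the reported value".

```typescript
type ProtocolTier = "exact" | "proxy" | "unknown";

// Hypothetical citation shape based on the fields listed for C1.
interface ClaimCitation {
  table: string;
  row: string;
  column: string;
  reportedValue: string;
  quotedText: string;
  pdfPage: number;
  protocolMatch: ProtocolTier;
}

// C1 (structural): every field is present and the protocol tier is exact.
function passesC1(c: ClaimCitation): boolean {
  const fieldsPresent =
    c.table !== "" &&
    c.row !== "" &&
    c.column !== "" &&
    c.reportedValue !== "" &&
    c.quotedText !== "" &&
    c.pdfPage >= 1;
  return fieldsPresent && c.protocolMatch === "exact";
}

// C2 (textual): quotedText appears near some occurrence of reportedValue.
function passesC2(c: ClaimCitation, pageText: string, windowChars = 200): boolean {
  if (c.reportedValue === "" || c.quotedText === "") return false;
  let from = 0;
  while (true) {
    const i = pageText.indexOf(c.reportedValue, from);
    if (i === -1) return false; // no further occurrences on this page
    const start = Math.max(0, i - windowChars);
    const end = Math.min(pageText.length, i + c.reportedValue.length + windowChars);
    if (pageText.slice(start, end).includes(c.quotedText)) return true;
    from = i + 1; // try the next occurrence of the reported value
  }
}
```

Checking every occurrence of the reported value, not just the first, avoids rejecting a correct citation when the same number also appears elsewhere on the page.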
Both gates were the 2026-05-13 red-team hardening response to seven false-positive WRONG verdicts that all turned out to be citation problems (made-up paper headlines, wrong table, wrong column, wrong metric variant). The validator itself, with live counts of how many verdicts have flowed through each gate, lives at /validator. The seven retracted false positives are at /legal/retractions.
The evidence chain
Every WRONG verdict is one click from its evidence drawer. The evidence chain consists of:
- Reproduction job ID — a stable identifier for the exact job that produced the verdict.
- Captured stdout and stderr with byte-offset citations into the run output for the reproduced number.
- Paper text byte-offset for the reported number, so anyone can verify we read the right cell of the right table.
- Container hash — the immutable image SHA the sandbox ran on, plus the exact shell commands used.
- Signed manifest binding the above into a single artifact, so a third party can re-run the same job and check the result against ours.
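To illustrate how a byte-offset citation lets a third party check the reproduced number, here is a small sketch. The `ByteCitation` shape and function names are hypothetical, the run output is assumed to be ASCII, and signing of the manifest is deliberately left out.

```typescript
// Hypothetical citation into captured run output.
interface ByteCitation {
  offset: number;   // byte offset where the cited number starts
  length: number;   // number of bytes cited
  expected: string; // the number as shown in the verdict
}

// Returns the cited bytes as text, or null if the range is out of bounds.
function citedText(output: Uint8Array, c: ByteCitation): string | null {
  if (c.offset < 0 || c.length <= 0 || c.offset + c.length > output.length) {
    return null;
  }
  let text = "";
  for (let i = c.offset; i < c.offset + c.length; i++) {
    text += String.fromCharCode(output[i]); // assumes ASCII output
  }
  return text;
}

// True when the bytes at the cited range equal the number in the verdict.
function citationMatches(output: Uint8Array, c: ByteCitation): boolean {
  return citedText(output, c) === c.expected;
}
```

The same check applies in principle to the paper-text offset for the reported number.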
The 72-hour pre-publication notice
For any not-reproduced finding with confidence ≥ 0.9 on a paper whose corresponding author is non-anonymous, the verdict is held in a 72-hour notice window before any public surface is allowed to render the WRONG badge.
- An agent drafts and sends a templated notice email to the corresponding author.
- The verdict enters pre_publication_notice state for 72 hours from email send.
- During the hold, the badge is suppressed everywhere — paper page, Wall of Wrong, RSS, OG image, auto-tweet.
- If the author replies before 72h, their reply auto-attaches as the first comment when the verdict goes public, and the verdict transitions to disputed_pending_review.
- If 72h elapse with no reply, the verdict transitions to published_wrong and all surfaces light up.
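The hold can be sketched as a small state function. This is a sketch under assumptions, not the platform's implementation: the function name and signature are hypothetical, and the transition is assumed to be evaluated at the end of the 72-hour hold, so the state stays at pre_publication_notice until then even if a reply has arrived.

```typescript
type NoticeState =
  | "pre_publication_notice"
  | "disputed_pending_review"
  | "published_wrong";

const HOLD_MS = 72 * 60 * 60 * 1000;

// sentAt: when the notice email was sent; replyAt: author's reply time, if any.
function noticeState(sentAt: Date, now: Date, replyAt: Date | null): NoticeState {
  const holdEnds = sentAt.getTime() + HOLD_MS;
  if (now.getTime() < holdEnds) {
    return "pre_publication_notice"; // badge suppressed on every surface
  }
  const repliedInWindow = replyAt !== null && replyAt.getTime() < holdEnds;
  return repliedInWindow ? "disputed_pending_review" : "published_wrong";
}
```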
The dispute process
A verified author may file a dispute at any time, before or after publication. Filing flips the verdict status to disputed_pending_review within 60 seconds; the badge dims and links to the dispute thread on every surface; the rendered label becomes WRONG (under dispute).
We commit to a first agent response within 48 hours and an amendment-or-confirmation decision within 48 additional hours. If the dispute is upheld, the verdict is amended or retracted; the retraction is logged at /legal/retractions.
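As a small illustration of the two commitments, the deadlines can be computed from the filing time. The function name is hypothetical, and the second 48-hour period is assumed to start at the first-response deadline, which gives the latest possible decision time (96 hours after filing).

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface DisputeDeadlines {
  firstResponseBy: Date; // 48 hours after filing
  decisionBy: Date;      // 48 further hours (worst case: 96 hours after filing)
}

function disputeDeadlines(filedAt: Date): DisputeDeadlines {
  const firstResponseBy = new Date(filedAt.getTime() + 48 * HOUR_MS);
  const decisionBy = new Date(firstResponseBy.getTime() + 48 * HOUR_MS);
  return { firstResponseBy, decisionBy };
}
```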
Full policy: /legal/disputes.
Cooling-off rules
No two not-reproduced verdicts on the same paper may be published within 7 days of each other. A second finding inside the window is held in pending_cooloff until the window expires; if the first verdict is under active dispute, the cooling-off extends until the dispute is resolved.
The rule prevents an avalanche effect on a single paper while a potentially-flawed methodology is still being audited.
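A minimal sketch of the cooling-off check follows. The `PriorVerdict` shape and the `"publishable"` result are hypothetical names, and the sketch assumes that an active dispute on the first verdict holds the second finding regardless of how long ago the first verdict was published.

```typescript
const COOLOFF_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical summary of the most recent published verdict on the same paper.
interface PriorVerdict {
  publishedAt: Date;
  underActiveDispute: boolean;
}

type CooloffResult = "publishable" | "pending_cooloff";

function cooloffStatus(prior: PriorVerdict | null, now: Date): CooloffResult {
  if (prior === null) return "publishable"; // no earlier verdict on this paper
  if (prior.underActiveDispute) return "pending_cooloff"; // extends until resolved
  const elapsed = now.getTime() - prior.publishedAt.getTime();
  return elapsed < COOLOFF_MS ? "pending_cooloff" : "publishable";
}
```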
Worked example
Consider a paper that reports top-1 accuracy of 78.4% on a public benchmark, with a released checkpoint and an evaluation script. We run the released checkpoint on the released eval split, across seeds 0, 1, and 2.
| Field | Value |
|---|---|
| Reported (paper, Table 2) | 78.4 |
| Reproduced (seed 0 / 1 / 2) | 73.1 · 73.0 · 73.3 |
| Tolerance (top-1 default) | ±2.0 absolute points |
| Δ (mean reproduced − reported) | −5.27 |
| Per-seed agreement | 3 of 3 outside tolerance, same direction |
| Computed verdict | WRONG: our reproduction did not match this claim's reported numbers. See evidence. |
The verdict here attaches to the specific top-1 accuracy claim in Table 2 of the paper's named version. The label is not a statement about other tables, other versions, the authors, or the lab.
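The arithmetic in the table can be checked with a few lines. This is only a sketch of the calculation; the function and field names are hypothetical.

```typescript
interface SeedSummary {
  meanDelta: number;       // mean reproduced minus reported
  outsideBelow: number;    // seeds more than `tolerance` below the reported value
  outsideAbove: number;    // seeds more than `tolerance` above the reported value
}

function summarizeSeeds(
  reported: number,
  reproduced: number[],
  tolerance: number
): SeedSummary {
  const mean = reproduced.reduce((a, b) => a + b, 0) / reproduced.length;
  const deltas = reproduced.map((x) => x - reported);
  return {
    meanDelta: mean - reported,
    outsideBelow: deltas.filter((d) => d < -tolerance).length,
    outsideAbove: deltas.filter((d) => d > tolerance).length,
  };
}

const summary = summarizeSeeds(78.4, [73.1, 73.0, 73.3], 2.0);
console.log(summary.meanDelta.toFixed(2)); // "-5.27"
console.log(summary.outsideBelow, summary.outsideAbove); // 3 0
```

All three seeds fall more than 2.0 points below the reported 78.4, which is the "3 of 3 outside tolerance, same direction" row.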
Words we never use
A list of intent-implying terms is absolutely prohibited in any agent-generated copy, OG image, tweet, email, push notification, or UI string anywhere on the site. The build is configured to fail if any of them appears in shipped code.
The authoritative, machine-readable list lives in src/lib/legal-guard.ts, and the corresponding test in tests/unit/legal.test.ts enforces it on every commit. The list contains thirteen intent-implying English terms, spanning allegations of dishonesty, data tampering, and academic misappropriation, plus their common inflected forms.
We use, instead, the language of reproducibility: “reproduction did not match,” “we were unable to reproduce,” “reported X; reproduced Y.”
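A guard of this kind can be sketched as a whole-word, case-insensitive scan. The sketch below is not the contents of src/lib/legal-guard.ts: the function names are hypothetical, the term list is passed in by the caller, and inflected forms are assumed to be listed explicitly rather than derived.

```typescript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the terms from `terms` that occur in `text` as whole words.
function findProhibitedTerms(text: string, terms: readonly string[]): string[] {
  return terms.filter((term) =>
    new RegExp(`\\b${escapeRegExp(term)}\\b`, "i").test(text)
  );
}

// Throws if any listed term appears, so a unit test or build step can fail.
function assertCleanCopy(text: string, terms: readonly string[]): void {
  const hits = findProhibitedTerms(text, terms);
  if (hits.length > 0) {
    throw new Error(`Prohibited terms found: ${hits.join(", ")}`);
  }
}
```

A test could call `assertCleanCopy` on every shipped UI string and template, so a match fails the build.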