paperiswrong

The exact technical definition

WRONG means: a specific headline numerical claim in a specific version of a paper did not reproduce, within published tolerance, on a specific automated reproduction job whose ID is attached to the verdict.

WRONG (our reproduction did not match this claim's reported numbers; see evidence) is a claim about a single number on a single run. It is never a claim about a paper as a whole, an author as a person, or a lab as an institution.

A verdict carries the WRONG label only when every one of the following is true: ≥3 seeds agree the reproduced number is outside tolerance in the same direction; the sanity-check baseline passed; the OpenAI multi-model, multi-temperature, multi-prompt ensemble reaches the same verdict; the confidence score is ≥ 0.9; and 72 hours have elapsed since the corresponding author was notified by email.
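The five conditions form a single conjunction: if any one fails, the label cannot be applied. A minimal sketch of that check follows. The `VerdictInputs` record, its field names, and the `wrongLabelBlockers` helper are hypothetical, chosen for illustration, and are not the platform's actual schema.

```typescript
// Hypothetical input record; every field name here is an illustrative assumption.
interface VerdictInputs {
  seedsOutsideSameDirection: number; // seeds outside tolerance, all in one direction
  baselinePassed: boolean;           // sanity-check baseline result
  ensembleVerdictIsWrong: boolean;   // ensemble reached the same verdict
  confidence: number;                // confidence score in [0, 1]
  authorNotifiedAt: Date | null;     // when the notice email was sent, if at all
}

const NOTICE_WINDOW_MS = 72 * 60 * 60 * 1000;

// Returns the unmet conditions; an empty list means the WRONG label may be applied.
function wrongLabelBlockers(v: VerdictInputs, now: Date): string[] {
  const blockers: string[] = [];
  if (v.seedsOutsideSameDirection < 3) blockers.push("fewer than 3 agreeing seeds");
  if (!v.baselinePassed) blockers.push("sanity-check baseline failed");
  if (!v.ensembleVerdictIsWrong) blockers.push("ensemble disagrees");
  if (v.confidence < 0.9) blockers.push("confidence below 0.9");
  if (
    v.authorNotifiedAt === null ||
    now.getTime() - v.authorNotifiedAt.getTime() < NOTICE_WINDOW_MS
  ) {
    blockers.push("72-hour notice window not elapsed");
  }
  return blockers;
}
```

Returning the list of blockers rather than a bare boolean keeps the reason for a withheld label inspectable.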

What it does NOT mean

We do not, and will not, use this label to make any claim about an author's intent, honesty, or character. The platform takes no position on why a number did not reproduce. Common, ordinary reasons include:

seed sensitivity that exceeds what the paper reported
environment drift between the paper's setup and the released repo
checkpoint-vs-training mismatches between what the paper measured and what the repo runs
parsing differences in metric extraction
numerical precision differences between hardware or library versions
genuine scientific disagreement about the right way to score the metric

The verdict surface is engineered to be opinion-defensible (the methodology is fully disclosed), substantially true (the underlying factual claim is just “our job did not match the paper's number”), and claim-scoped (never paper-scoped).

Gates that block unjustified WRONG

Before a verdict is even allowed to be a public WRONG, two gates run:

C1 (structural): the driver MUST declare a typed CLAIM_CITATION with the table, row, column, reported value, quoted text, and PDF page, and its PROTOCOL_MATCH tier must be exact. Any proxy or unknown tier is structurally incapable of publishing WRONG; it auto-downgrades to partial or pending.
C2 (textual): the validator fetches the versioned arXiv PDF and requires the quote itself to contain the reported value, appear on the cited page, match the declared section, and pass a reviewed table locator that binds the row, ordered columns, and numeric cell. Any mismatch blocks publication.
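The two gates can be sketched as a typed citation plus two small checks. This is an illustration under stated assumptions, not the validator's code: the `ClaimCitation` shape mirrors the fields listed for C1, the mapping of proxy to partial and unknown to pending is assumed (the text only says the verdict downgrades to one of the two), and the C2 sketch covers only the quote-contains-value and quote-on-page checks, taking already-extracted page text as input instead of fetching the arXiv PDF.

```typescript
type ProtocolMatch = "exact" | "proxy" | "unknown";

// Hypothetical shape mirroring the CLAIM_CITATION fields named in C1.
interface ClaimCitation {
  table: string;
  row: string;
  column: string;
  reportedValue: string; // kept as text so it can be matched verbatim in the quote
  quotedText: string;
  pdfPage: number;       // 1-based page number
  protocolMatch: ProtocolMatch;
}

// Collapse whitespace so line breaks in extracted PDF text do not defeat matching.
function normalizeSpaces(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// C1: only an exact protocol match can carry WRONG.
function c1Label(c: ClaimCitation): "wrong" | "partial" | "pending" {
  if (c.protocolMatch === "exact") return "wrong";
  // Assumed mapping; the policy text does not say which tier maps to which label.
  return c.protocolMatch === "proxy" ? "partial" : "pending";
}

// C2 (subset): the quote must contain the reported value and appear on the cited page.
function c2TextFailures(c: ClaimCitation, pageTexts: string[]): string[] {
  const failures: string[] = [];
  if (!c.quotedText.includes(c.reportedValue)) {
    failures.push("quote does not contain the reported value");
  }
  const inRange = c.pdfPage >= 1 && c.pdfPage <= pageTexts.length;
  const page = inRange ? pageTexts[c.pdfPage - 1] : "";
  if (!normalizeSpaces(page).includes(normalizeSpaces(c.quotedText))) {
    failures.push("quote not found on the cited page");
  }
  return failures;
}
```

Any non-empty failure list would block publication. The section match and the table locator described in C2 are omitted from this sketch.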

Both gates came out of the 2026-05-13 red-team hardening, which responded to seven false-positive WRONG verdicts that all turned out to be citation problems (made-up paper headlines, wrong table, wrong column, wrong metric variant). The validator itself, with live counts of how many verdicts have flowed through each gate, lives at /validator. The seven retracted false positives are at /legal/retractions.

The evidence chain

A WRONG verdict can publish only when an exact-protocol validation receipt still matches all of the following:

Reproduction job ID — a stable identifier for the exact job that produced the verdict.
Paper evidence — the exact quote, page and table locator, and SHA-256 of the versioned PDF.
Measurement identity — metric, pinned model and dataset revisions, sample count, and sample-manifest hash.
Execution evidence — retained trace and metric offset plus driver, Modal-source, and trace hashes.
Integrity receipt binding the policy, citation, job, paper version, protocol, measurement, and outcome. The receipt is deterministic but is not an independent signature.
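One way to obtain a receipt that is deterministic but not a signature is to hash a canonical serialization of the bound fields. The sketch below assumes SHA-256 over key-sorted JSON. The `ReceiptFields` shape, the helper names, and the hash choice are illustrative assumptions; the page does not specify the actual receipt format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical field set mirroring the items the receipt is said to bind.
interface ReceiptFields {
  policyVersion: string;
  citationHash: string;
  jobId: string;
  paperVersion: string;
  protocolMatch: string;
  measurementHash: string;
  outcome: string;
}

// Serialize with sorted keys so identical fields always yield identical bytes.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Anyone holding the same fields can recompute this digest; it proves consistency,
// not authorship, which is why it is not an independent signature.
function integrityReceipt(fields: ReceiptFields): string {
  return createHash("sha256").update(canonicalize(fields), "utf8").digest("hex");
}

// A verdict may publish only while the stored receipt still matches the current evidence.
function receiptStillMatches(stored: string, current: ReceiptFields): boolean {
  return stored === integrityReceipt(current);
}
```

If any bound field changes after validation (for example, a new paper version), the recomputed digest differs and `receiptStillMatches` returns false.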

The 72-hour pre-publication notice

For any not-reproduced finding with confidence ≥ 0.9 on a paper whose corresponding author is non-anonymous, the verdict is held in a 72-hour notice window before any public surface is allowed to render the WRONG badge.

An agent drafts and sends a templated notice email to the corresponding author.
The verdict enters pre_publication_notice state for 72 hours from email send.
During the hold, the badge is suppressed everywhere — paper page, Wall of Wrong, RSS, OG image, auto-tweet.
If the author replies before 72h, their reply auto-attaches as the first comment when the verdict goes public, and the verdict transitions to disputed_pending_review.
If 72h elapse with no reply, the verdict transitions to published_wrong and all surfaces light up.
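The hold can be read as a small state machine keyed on the email send time. The sketch below is one interpretation, not the platform's implementation: it assumes the verdict stays in pre_publication_notice for the full 72 hours even if the author replies early, and only then resolves to one of the two public states. The `HeldVerdict` record and function names are hypothetical.

```typescript
type NoticeState =
  | "pre_publication_notice"
  | "disputed_pending_review"
  | "published_wrong";

// Hypothetical record of a verdict held in the notice window.
interface HeldVerdict {
  emailSentAt: Date;
  authorRepliedAt: Date | null;
}

const HOLD_MS = 72 * 60 * 60 * 1000;

// Resolve the state at time `now`, measuring the window from email send.
function resolveNoticeState(v: HeldVerdict, now: Date): NoticeState {
  const deadline = v.emailSentAt.getTime() + HOLD_MS;
  if (now.getTime() < deadline) return "pre_publication_notice";
  const repliedInTime =
    v.authorRepliedAt !== null && v.authorRepliedAt.getTime() < deadline;
  return repliedInTime ? "disputed_pending_review" : "published_wrong";
}

// During the hold the badge is suppressed on every surface.
function badgeVisible(state: NoticeState): boolean {
  return state !== "pre_publication_notice";
}
```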

The dispute process

A verified author may file a dispute at any time, before or after publication. Filing flips the verdict status to disputed_pending_review within 60 seconds; the badge dims and links to the dispute thread on every surface; the rendered label becomes WRONG (under dispute).

We commit to a first agent response within 48 hours and an amendment-or-confirmation decision within 48 additional hours. If the dispute is upheld, the verdict is amended or retracted; the retraction is logged at /legal/retractions.
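The timing commitments reduce to three deadlines computed from the filing time. In this sketch the second 48 hours is assumed to run from the first-response deadline, so the decision is due at most 96 hours after filing; the text could also be read as 48 hours after the actual first response. The names are illustrative.

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface DisputeDeadlines {
  statusFlipBy: Date;    // status becomes disputed_pending_review within 60 seconds
  firstResponseBy: Date; // first agent response within 48 hours
  decisionBy: Date;      // amendment-or-confirmation decision, 48 hours later (assumed)
}

function disputeDeadlines(filedAt: Date): DisputeDeadlines {
  const t = filedAt.getTime();
  return {
    statusFlipBy: new Date(t + 60 * 1000),
    firstResponseBy: new Date(t + 48 * HOUR_MS),
    decisionBy: new Date(t + 96 * HOUR_MS),
  };
}

// The rendered label changes while a dispute is open.
function renderedLabel(status: "published_wrong" | "disputed_pending_review"): string {
  return status === "disputed_pending_review" ? "WRONG (under dispute)" : "WRONG";
}
```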

Full policy: /legal/disputes.

Cooling-off rules

No two not-reproduced verdicts on the same paper may be published within 7 days of each other. A second finding inside the window is held in pending_cooloff until the window expires; if the first verdict is under active dispute, the cooling-off extends until the dispute is resolved.

The rule prevents an avalanche effect on a single paper while a potentially-flawed methodology is still being audited.
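The cooling-off rule can be sketched as a check against the most recent not-reproduced verdict on the same paper. The types and names below are hypothetical, and the sketch assumes that any open dispute on the first verdict holds the second finding with no fixed release time.

```typescript
const COOLOFF_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical view of the earlier verdict on the same paper.
interface PriorVerdict {
  publishedAt: Date;
  disputeOpen: boolean;
}

type CooloffDecision =
  | { status: "publishable" }
  | { status: "pending_cooloff"; earliest: Date | null }; // null: waits on dispute resolution

function cooloffDecision(prior: PriorVerdict | null, now: Date): CooloffDecision {
  if (prior === null) return { status: "publishable" };
  if (prior.disputeOpen) {
    // The window extends until the dispute is resolved, so no date can be given yet.
    return { status: "pending_cooloff", earliest: null };
  }
  const windowEnd = prior.publishedAt.getTime() + COOLOFF_MS;
  if (now.getTime() >= windowEnd) return { status: "publishable" };
  return { status: "pending_cooloff", earliest: new Date(windowEnd) };
}
```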

Worked example

Consider a paper that reports top-1 accuracy of 78.4% on a public benchmark, with a released checkpoint and an evaluation script. We run the released checkpoint on the released eval split, across seeds 0, 1, and 2.

Field	Value
Reported (paper, Table 2)	78.4
Reproduced (seed 0 / 1 / 2)	73.1 · 73.0 · 73.3
Tolerance (top-1 default)	±2.0 absolute points
Δ (mean reproduced − reported)	−5.27
Per-seed agreement	3 of 3 outside tolerance, same direction
Computed verdict	WRONG

The verdict here attaches to the specific top-1 accuracy claim in Table 2 of the paper's named version. The label is not a statement about other tables, other versions, the authors, or the lab.
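The arithmetic in the table can be checked with a few lines. The function below is an illustrative helper, not platform code; it computes the mean delta and the per-seed agreement from the figures above.

```typescript
interface ClaimSummary {
  delta: number;          // mean reproduced minus reported
  outsideCount: number;   // seeds whose deviation exceeds the tolerance
  sameDirection: boolean; // all out-of-tolerance seeds deviate the same way
}

function summarizeClaim(
  reported: number,
  reproduced: number[],
  tolerance: number
): ClaimSummary {
  const mean = reproduced.reduce((a, b) => a + b, 0) / reproduced.length;
  const outside = reproduced
    .map((r) => r - reported)
    .filter((d) => Math.abs(d) > tolerance);
  const allBelow = outside.every((d) => d < 0);
  const allAbove = outside.every((d) => d > 0);
  return {
    delta: mean - reported,
    outsideCount: outside.length,
    sameDirection: outside.length > 0 && (allBelow || allAbove),
  };
}

const summary = summarizeClaim(78.4, [73.1, 73.0, 73.3], 2.0);
console.log(summary.delta.toFixed(2)); // prints "-5.27"
console.log(summary.outsideCount, summary.sameDirection); // prints 3 true
```

The mean of the three seeds is about 73.13, so the delta is about −5.27 points, well beyond the ±2.0 tolerance.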

Words we never use

A list of intent-implying terms is absolutely prohibited in any agent-generated copy, OG image, tweet, email, push notification, or UI string anywhere on the site. The build is configured to fail if any of them appears in shipped code.

The authoritative, machine-readable list lives in src/lib/legal-guard.ts and the corresponding test in tests/unit/legal.test.ts enforces it on every commit. The list covers intent-implying allegations, data tampering, and academic misappropriation — plus their common inflected forms.

We use, instead, the language of reproducibility: “reproduction did not match,” “we were unable to reproduce,” “reported X; reproduced Y.”
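A guard of this kind can be sketched as a word-boundary match over a term list. The terms below are placeholders only: the real list lives in src/lib/legal-guard.ts and is not reproduced here. The suffix handling (s, es, ed, ing) is an assumed, simplified stand-in for "common inflected forms", and the function names are hypothetical.

```typescript
// Placeholder entries for illustration; these are NOT the prohibited terms.
const PLACEHOLDER_TERMS = ["exampleterm", "sampleword"];

// Escape regex metacharacters so terms are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every occurrence of a listed term (or a simple inflection) in the text.
function findProhibited(text: string, terms: string[]): string[] {
  if (terms.length === 0) return [];
  const alternation = terms.map(escapeRegExp).join("|");
  const guard = new RegExp(`\\b(?:${alternation})(?:s|es|ed|ing)?\\b`, "gi");
  const matches = text.match(guard) || [];
  return matches.map((m) => m.toLowerCase());
}
```

A unit test could then assert that `findProhibited` returns an empty list for every shipped string, which would fail the build when a listed term appears.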