Verdict Validator

Every reproduction verdict passes through the Verdict Validator before it lands in the database. The validator is the answer to a real failure: on 2026-05-13 we retracted 7 WRONG verdicts that turned out to be false positives, caused by drivers that made up, mis-cited, or category-confused the paper headline they were comparing against. The validator was built so this cannot happen the same way again.

What the validator does

  1. C1 — Structural gate. Every driver must declare a typed claim citation: Table N, row X, column Y, reported value V plus the literal quoted text and PDF page number. Drivers that don't measure the paper's exact protocol must declare protocol_match = "proxy" (microslice eval, community fine-tune, etc.) or "unknown" (measuring a metric the paper doesn't report). Only "exact" drivers can publish a public WRONG.
  2. C2 — Textual gate. The validator fetches the cited paper's PDF, extracts text with pdfjs-dist, and verifies that the cited quoted text appears within ±200 characters of the cited reported value. Citations that don't match the PDF — made-up, mis-cited, or category-confused — are rejected at C2.
  3. Never upgrades. The validator can only downgrade. A driver that proposes not_reproduced with a non-exact protocol gets downgraded to partial (or pending for unknown protocols). It never goes the other way.
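The C1 gate and the downgrade-only rule can be sketched roughly as follows. This is a minimal illustration, not the code in src/lib/verdict-validator.ts: the type names, field names, and both function names are hypothetical, and the "reproduced" verdict label is an assumption (the text above names only not_reproduced, partial, pending, and not_attempted).

```typescript
// Hypothetical shapes; the real types in src/lib/claim-citation.ts may differ.
type ProtocolMatch = "exact" | "proxy" | "unknown";
type Verdict = "reproduced" | "partial" | "pending" | "not_reproduced" | "not_attempted";

interface ClaimCitation {
  table: number;         // Table N
  row: string;           // row X
  column: string;        // column Y
  reportedValue: string; // reported value V, kept as printed in the paper
  quotedText: string;    // literal quoted text
  pdfPage: number;       // PDF page number, assumed 1-based
}

const isPositiveInt = (n: unknown): boolean =>
  typeof n === "number" && n >= 1 && Math.floor(n) === n;
const isNonEmpty = (s: unknown): boolean =>
  typeof s === "string" && s.trim().length > 0;

// C1 sketch: purely structural. Returns the names of missing or malformed
// fields; an empty array means the citation is structurally complete.
function checkStructure(c: Partial<ClaimCitation>): string[] {
  const problems: string[] = [];
  if (!isPositiveInt(c.table)) problems.push("table");
  if (!isNonEmpty(c.row)) problems.push("row");
  if (!isNonEmpty(c.column)) problems.push("column");
  if (!isNonEmpty(c.reportedValue)) problems.push("reportedValue");
  if (!isNonEmpty(c.quotedText)) problems.push("quotedText");
  if (!isPositiveInt(c.pdfPage)) problems.push("pdfPage");
  return problems;
}

// Downgrade-only sketch: a proposed not_reproduced survives only for "exact"
// drivers (and would still have to pass C2). Nothing is ever upgraded.
function gateVerdict(proposed: Verdict, match: ProtocolMatch): Verdict {
  if (proposed !== "not_reproduced") return proposed;
  if (match === "exact") return proposed;
  return match === "proxy" ? "partial" : "pending";
}
```

Keeping the gate as a pure function of the proposed verdict and the declared tier makes the "never upgrades" property easy to check: every branch returns either the input or a weaker verdict.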

See src/lib/verdict-validator.ts and src/lib/claim-citation.ts for the implementation. The 7-WRONG retraction post-mortem lives at docs/red-team/2026-05-13-summary.md.
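For readers who want a feel for the C2 proximity check without opening the source, here is a rough sketch under stated assumptions. It takes page text that has already been extracted (the pdfjs-dist step is left out), treats "within ±200 characters" as "the whole quote falls inside a window of 200 characters on either side of an occurrence of the reported value", and collapses whitespace before matching. The function name and these interpretation choices are hypothetical; the real check may differ.

```typescript
// Collapse runs of whitespace so PDF line breaks do not defeat matching
// (an assumption of this sketch, not a documented behavior).
const normalize = (s: string): string => s.replace(/\s+/g, " ").trim();

// Returns true when quotedText appears entirely inside a window of
// `window` characters before and after some occurrence of reportedValue.
function quoteNearValue(
  pageText: string,
  quotedText: string,
  reportedValue: string,
  window: number = 200
): boolean {
  const text = normalize(pageText);
  const quote = normalize(quotedText);
  const value = normalize(reportedValue);
  if (quote === "" || value === "") return false;

  // A value such as "76.1" can occur several times, so try each occurrence.
  for (let at = text.indexOf(value); at !== -1; at = text.indexOf(value, at + 1)) {
    const start = Math.max(0, at - window);
    const end = Math.min(text.length, at + value.length + window);
    if (text.slice(start, end).indexOf(quote) !== -1) return true;
  }
  return false;
}
```

A made-up quote fails because it never appears in the text, and a real quote attached to the wrong number fails because no occurrence of that number sits close enough to it.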

Live numbers

Counts below are live queries against the production database — every verdict currently visible on the platform.

Current POST verdicts: 55 (all kinds, is_current=true)
With claim citation: 90.9% (50 of 55)
Public WRONG verdicts: 0 (zero since validator landed)
False positives retracted: 7 (all on 2026-05-13)

Protocol-match breakdown

A driver's protocol-match tier tells you how close the reproduction is to the paper's exact protocol. Only exact can publish a public WRONG.

Tier | Verdicts | Share | Can publish WRONG?
exact | 3 | 5.5% | Yes, gated by C2.
proxy | 34 | 61.8% | No, auto-downgraded to PARTIAL.
unknown | 13 | 23.6% | No, auto-downgraded to PENDING.
(none) | 5 | 9.1% | Pre-validator rows (legacy / not_attempted-only drivers).
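The shares in the breakdown follow directly from the counts. Reading the rows as 3 exact, 34 proxy, 13 unknown, and 5 rows without a tier, a quick check like the one below confirms that they sum to the 55 current verdicts and that the 50 tiered rows match the 90.9% citation figure. The `untiered` key and the helper name are labels invented for this sketch.

```typescript
// Counts as read from the breakdown above; "untiered" is this sketch's label
// for the pre-validator rows that carry no protocol-match tier.
const counts: Record<string, number> = { exact: 3, proxy: 34, unknown: 13, untiered: 5 };

// Percentage of `part` in `total`, formatted to one decimal place.
function sharePercent(part: number, total: number): string {
  return ((part / total) * 100).toFixed(1) + "%";
}

const total = Object.keys(counts).reduce((sum, k) => sum + counts[k], 0); // 55
const shares: Record<string, string> = {};
for (const tier of Object.keys(counts)) {
  shares[tier] = sharePercent(counts[tier], total);
}
// shares: exact 5.5%, proxy 61.8%, unknown 23.6%, untiered 9.1%

const withCitation = sharePercent(total - counts.untiered, total); // "90.9%"
```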

What the validator prevents

The 2026-05-13 red-team rollup identified 7 then-public WRONG verdicts as false positives. Every one was a citation problem: the paper headline the driver was comparing against was wrong (made-up, mis-cited table, mis-cited row, or a category that the paper doesn't even report). The full audit lives at /legal/retractions. The validator's C1 and C2 gates are how the platform self-corrected: a driver cannot publish a WRONG today without a structurally typed citation that has been verified against the actual PDF.

Three sibling surfaces complete the self-correction quadrant: /anchors (runtime drift probe across the execution stack), /legal/retractions (historical false positives, append-only log), and /skipped (refusal transparency — papers paperiswrong deliberately did not reproduce).

For developers

Every reproduction driver under scripts/run-reproduction-*.ts is required by a CI lint (tests/unit/scripts/validator-wiring-lint.test.ts) to wire the validator. Adding a new driver without a citation fails the build. Drivers that legitimately only ever publish not_attempted verdicts (closed-weights papers, retracted reproductions, image generation) are on an explicit allowlist with a written justification.
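The idea behind such a wiring lint can be sketched as a pure function over file contents. This is only an illustration of the approach, not the test in tests/unit/scripts/validator-wiring-lint.test.ts: the function name, the allowlist shape, and the choice to detect wiring by looking for the string "verdict-validator" in the driver source are all assumptions.

```typescript
// Hypothetical allowlist entry: a driver path plus its written justification.
interface AllowlistEntry {
  file: string;
  justification: string;
}

const DRIVER_NAME = /(^|\/)run-reproduction-[^/]*\.ts$/;
// Assumed wiring marker: any reference to the validator module.
const VALIDATOR_REFERENCE = /verdict-validator/;

// Given driver sources keyed by path, return the drivers that neither
// reference the validator nor have a justified allowlist entry.
function findUnwiredDrivers(
  sources: Record<string, string>,
  allowlist: AllowlistEntry[]
): string[] {
  const unwired: string[] = [];
  for (const file of Object.keys(sources).sort()) {
    if (!DRIVER_NAME.test(file)) continue; // not a reproduction driver
    const exempt = allowlist.some(
      (e) => e.file === file && e.justification.trim() !== ""
    );
    if (exempt) continue;
    if (!VALIDATOR_REFERENCE.test(sources[file] ?? "")) unwired.push(file);
  }
  return unwired;
}
```

A test built on this would read the files under scripts/, call the function, and fail when the returned list is non-empty. Requiring a non-empty justification keeps the allowlist from becoming a silent escape hatch.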

The public API exposes the citation and protocol-match columns on every verdict at /api/v1/verdicts and /api/v1/papers/:arxivId. The on-page ClaimCitationCard renders the same data inline with every verdict on /p/[arxivId].
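A client consuming these endpoints might look something like the sketch below. Only the two paths come from the text above; the base URL is supplied by the caller, and the response field names (`protocol_match`, `claim_citation`) and the idea that verdicts arrive as a flat array are assumptions made for illustration. Check the actual API response before relying on any of them.

```typescript
type ProtocolMatch = "exact" | "proxy" | "unknown";

// Assumed shape of one verdict in the API response (hypothetical fields).
interface ApiVerdict {
  protocol_match: ProtocolMatch | null; // null for pre-validator rows
  claim_citation: object | null;
}

// Build the per-paper URL. Note that encodeURIComponent would escape the
// slash in old-style arXiv ids, which may or may not suit the server.
function paperUrl(baseUrl: string, arxivId: string): string {
  const root = baseUrl.replace(/\/+$/, "");
  return `${root}/api/v1/papers/${encodeURIComponent(arxivId)}`;
}

// Count verdicts per protocol-match tier; rows without a tier go under "none".
function tallyProtocolMatch(verdicts: ApiVerdict[]): Record<string, number> {
  const tally: Record<string, number> = {};
  for (const v of verdicts) {
    const key = v.protocol_match ?? "none";
    tally[key] = (tally[key] ?? 0) + 1;
  }
  return tally;
}
```

Run over the full list from /api/v1/verdicts, a tally like this should reproduce the protocol-match breakdown shown on the page, provided the assumed field names hold.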