Public retraction log

Verdict diff — arXiv:2402.00838

The verdict on "OLMo: Accelerating the Science of Language Models" was retracted on 2026-05-13. Below is the structured before / after: what we said, what we say now, and why the original verdict was incorrect.

Before / after

Before (original verdict): WRONG
  agent_version: v0.1.0-olmo-lambada-microslice
  verdict_id: 8b876af6-f689-41cb-a8d8-cb1e79c62c92

After (current verdict): RETRACTED
The original row is marked is_current=false in the database; no current public verdict exists for this paper. The reproduction was disabled to prevent re-publication under the same flawed protocol.

Why the original verdict was incorrect

The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.
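The failure described above can be illustrated with a short check. This is a minimal sketch and not the project's actual validator: `count_term` is a hypothetical helper, and the `pages` strings are illustrative placeholders standing in for text extracted from the PDF, not quotations from it. Only the two accuracy figures come from this log.

```python
import re


def count_term(pages: list[str], term: str) -> int:
    """Count case-insensitive occurrences of `term` across extracted page texts."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(len(pattern.findall(page)) for page in pages)


# Illustrative placeholder text (assumption), not the real PDF contents.
pages = [
    "Table 3: zero-shot evaluation of OLMo-1B on eight tasks ...",
    "Table 6: carbon emissions ...",
]

measured = 0.5576  # accuracy measured by the driver
claimed = 0.617    # headline the driver attributed to Table 6

# Gap the original verdict was based on (about 0.059).
gap = claimed - measured

# A claim is only citable if the benchmark name appears in the paper at all.
citable = count_term(pages, "lambada") > 0
```

Under these assumptions `citable` is false, so the gap is meaningless as evidence: there is no published LAMBADA number to compare the measurement against.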

Evidence trail

  • Audit thread — long-form post-mortem covering all seven 2026-05-13 retractions, including this one.
  • Rollback PR — the GitHub pull request that landed the corrected driver and flipped the verdict row to is_current=false.
  • All public retractions — the append-only retraction log under PRD §17.X.8(d).
  • Verdict Validator — the C1/C2 gates that prevent this class of mistake from shipping again.

What changed structurally

The 2026-05-13 retraction rollup landed two structural fixes so the citation-side failure that caused the original incorrect verdict cannot ship the same way again:

  1. Typed claim citation per verdict. Every reproduction driver now declares a structured CLAIM_CITATION (table, row, column, reported value, quoted text, PDF page) before its Modal job runs. The original verdict on this paper was published against a non-citable headline; that path is now closed by the validator-wiring lint, which fails the build.
  2. PDF-verified textual gate. The Verdict Validator fetches the cited paper's PDF and checks that the quoted text appears within ±200 characters of the cited reported value. Made-up, mis-cited, or category-confused citations fail this gate and the verdict is auto-downgraded.
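
The two fixes above can be sketched together as follows. This is a hedged illustration, not the Verdict Validator's real code: the `ClaimCitation` class, its field names, and `passes_textual_gate` are hypothetical, and the sketch assumes "within ±200 characters" means the quoted text lies entirely inside a window extending 200 characters before and after an occurrence of the reported value. PDF fetching and text extraction are out of scope; the page text is passed in as a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimCitation:
    """Hypothetical typed citation mirroring the fields listed above."""
    table: str
    row: str
    column: str
    reported_value: str  # as printed in the paper, e.g. "0.617"
    quoted_text: str
    pdf_page: int        # 1-indexed page number


WINDOW = 200  # characters on either side of the reported value


def passes_textual_gate(citation: ClaimCitation, page_text: str,
                        window: int = WINDOW) -> bool:
    """Return True if quoted_text occurs near some occurrence of reported_value."""
    value, quote = citation.reported_value, citation.quoted_text
    if not value or not quote:
        return False
    start = page_text.find(value)
    while start != -1:
        lo = max(0, start - window)
        hi = start + len(value) + window
        if quote in page_text[lo:hi]:
            return True
        # Try the next occurrence of the value, if any.
        start = page_text.find(value, start + 1)
    return False


# Synthetic page text (assumption), not taken from any real paper.
page_text = "Table 1. Model X, Task Y accuracy: 0.617."

good = ClaimCitation("Table 1", "Model X", "Task Y", "0.617", "Task Y", 3)
bad = ClaimCitation("Table 1", "Model X", "Task Y", "0.617", "lambada", 3)

good_ok = passes_textual_gate(good, page_text)
bad_ok = passes_textual_gate(bad, page_text)
```

In this sketch a citation whose quoted text never appears near the value, such as `bad`, fails the gate, which is the category of error behind the retracted verdict.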