2026-05-13 audit thread

Headline: 7 of 7 then-public WRONG verdicts were retracted on the same day. No external party was harmed (zero traffic to date). Pipeline hardening landed under PR #112 and PR #116.

Root cause

Every reproduction driver does two separate things. First, it makes a claim about what the paper says (for example, “the paper's Table 3 BART R-L is 40.90 on CNN/DailyMail”). Second, it makes a measurement (for example, “our reproduction measured rougeL on the same split and got 30.7”). Until 2026-05-13 the validator only sanity-checked the measurement (multi-seed, sanity baseline, confidence). It never sanity-checked the claim.

That gap was the root cause. In all seven cases the measurement was approximately correct under some protocol; the claim it was being compared against was wrong in one of the following ways: category-confused (BART rougeL vs rougeLsum, BLIP-2 greedy vs beam-5), the wrong row of the right table (SBERT unsupervised-large vs supervised-base), the wrong checkpoint (XLM-R fine-tuned on XNLI-train plus MNLI vs zero-shot transfer from English MNLI), the wrong protocol (CodeBERT 200-way zero-shot vs 1000-way fine-tuned), invented outright (a Phi-1 PIQA target that does not exist in the paper), or attributed to the wrong table (OLMo Table 6 is carbon emissions, not LAMBADA).
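The distinction between a claim and a measurement can be sketched as two records that each carry the protocol they were obtained under. This is a minimal illustration, not the platform's code: the type names, field names, and the `protocolMismatches` helper are all assumptions introduced here. The values are the BART and CodeBERT figures quoted in this thread.

```typescript
// Hypothetical shapes for illustration; none of these names come from the platform's code.
type Protocol = Record<string, string>;

interface PaperClaim {
  source: string;      // where in the paper the number is said to live
  value: number;       // the number the driver attributes to the paper
  protocol: Protocol;  // how the paper obtained it
}

interface Measurement {
  value: number;       // what the reproduction measured
  protocol: Protocol;  // how the reproduction obtained it
}

// Return the protocol fields on which the two sides disagree, sorted by name.
function protocolMismatches(claimed: Protocol, measured: Protocol): string[] {
  const keys = Object.keys(claimed).concat(Object.keys(measured));
  const unique = keys.filter((k, i) => keys.indexOf(k) === i).sort();
  return unique.filter((k) => claimed[k] !== measured[k]);
}

// A gap between the two values only says something about the paper
// when the claim and the measurement were obtained the same way.
function isComparable(claim: PaperClaim, measured: Measurement): boolean {
  return protocolMismatches(claim.protocol, measured.protocol).length === 0;
}

// BART case: the paper's number is rougeLsum, the driver measured rougeL.
const bartClaim: PaperClaim = { source: "Table 3", value: 40.9, protocol: { metric: "rougeLsum" } };
const bartMeasured: Measurement = { value: 30.7, protocol: { metric: "rougeL" } };

// CodeBERT case: pool size and training regime both differ.
const codebertPaper: Protocol = { pool: "1000-way", training: "fine-tuned" };
const codebertDriver: Protocol = { pool: "200-way", training: "zero-shot" };

console.log(isComparable(bartClaim, bartMeasured));             // false
console.log(protocolMismatches(codebertPaper, codebertDriver)); // [ 'pool', 'training' ]
```

Under this framing, the pre-2026-05-13 validator inspected only the measurement side and never asked whether the two protocols matched.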

The seven cases

Each row below summarises one retraction. The retraction log at /legal/retractions carries the same data alongside future retractions.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (arXiv:1910.13461)
The driver scored rougeL (sentence-level longest common subsequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence per Lin 2004). The two metrics measure different things. Under the correct metric variant the model lands within paper bounds.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (arXiv:2301.12597)
The driver decoded with greedy search, n = 20 examples, and no prompt prefix. The paper uses beam search with width 5 and the prompt prefix "a photo of" on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arXiv:2002.08155)
The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot using a 200-way candidate pool and a mean-pooled cosine similarity score. The paper's Table 4 Python MRR of 0.840 comes from a CodeBERT model fine-tuned on the CodeSearchNet train split, evaluated against a 1000-way candidate pool, with a [CLS] classifier head. The retraction script disables the reproduction until the protocol matches the paper's.
OLMo: Accelerating the Science of Language Models (arXiv:2402.00838)
The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084)
The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.
Textbooks Are All You Need (arXiv:2306.11644)
The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the "midpoint of the expected band per the mandate"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.
Unsupervised Cross-lingual Representation Learning at Scale (arXiv:1911.02116)
The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.
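The BART case turns on how far apart the two ROUGE-L variants can be on identical text. The following is a simplified sketch, not the scoring library the driver used: it tokenises on whitespace, does no stemming, and omits the token-count clipping that reference implementations apply to the summary-level union. All function names are assumptions. The toy texts are invented so that the sentences appear in a different order.

```typescript
// Split on whitespace and lower-case; real scorers also normalise punctuation and may stem.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Indices in `ref` that take part in one longest common subsequence with `cand`.
function lcsRefIndices(ref: string[], cand: string[]): number[] {
  const m = ref.length;
  const n = cand.length;
  const dp: number[][] = Array.from({ length: m + 1 }, () => new Array<number>(n + 1).fill(0));
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i][j] = ref[i - 1] === cand[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  // Walk back through the table to recover which reference tokens matched.
  const hits: number[] = [];
  let i = m;
  let j = n;
  while (i > 0 && j > 0) {
    if (ref[i - 1] === cand[j - 1]) { hits.push(i - 1); i--; j--; }
    else if (dp[i - 1][j] >= dp[i][j - 1]) { i--; }
    else { j--; }
  }
  return hits.reverse();
}

function fMeasure(hits: number, refLen: number, candLen: number): number {
  if (hits === 0) return 0;
  const precision = hits / candLen;
  const recall = hits / refLen;
  return (2 * precision * recall) / (precision + recall);
}

// rougeL: one LCS over the whole text, newlines treated as ordinary whitespace.
function rougeL(reference: string, candidate: string): number {
  const ref = tokenize(reference);
  const cand = tokenize(candidate);
  return fMeasure(lcsRefIndices(ref, cand).length, ref.length, cand.length);
}

// rougeLsum: per reference sentence, take the union of LCS matches over all candidate sentences.
function rougeLsum(reference: string, candidate: string): number {
  const refSents = reference.split("\n").map(tokenize);
  const candSents = candidate.split("\n").map(tokenize);
  let hits = 0;
  for (const refSent of refSents) {
    const covered = new Array<boolean>(refSent.length).fill(false);
    for (const candSent of candSents) {
      for (const idx of lcsRefIndices(refSent, candSent)) covered[idx] = true;
    }
    hits += covered.filter((c) => c).length;
  }
  const refLen = refSents.reduce((sum, s) => sum + s.length, 0);
  const candLen = candSents.reduce((sum, s) => sum + s.length, 0);
  return fMeasure(hits, refLen, candLen);
}

// Same two sentences, opposite order (invented toy text, one sentence per line).
const reference = "the cat sat\nthe dog ran";
const candidate = "the dog ran\nthe cat sat";

console.log(rougeL(reference, candidate));    // 0.5
console.log(rougeLsum(reference, candidate)); // 1
```

On this toy pair the whole-text variant scores 0.5 while the summary-level variant scores 1, which is the kind of gap that makes a number computed under one variant unusable as a reference for the other.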

Hardening that landed

PR #112: verdict-validator-hardening. A new column pair on reproduction_jobs records the driver's structured citation of the paper claim and the protocol-fidelity tier it ran under. A new module src/lib/verdict-validator.ts runs two checks on every proposed not-reproduced verdict before the DB write: a structural check (citation is present and well-formed, protocol_match is “exact”) and a textual check (the cited quote appears verbatim in the paper PDF within 200 characters of the cited numeric value). Any failure downgrades the verdict to PARTIAL or PENDING. The validator never upgrades.
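The two checks can be sketched roughly as follows. This is not the contents of src/lib/verdict-validator.ts: the type names, field names, and function names are assumptions. Mapping a structural failure to PARTIAL and a textual failure to PENDING is also an assumption, since the thread only says a failure downgrades to one of the two. The 200-character window is measured here from either edge of the quote, which is one possible reading of the rule. The sample paper text is invented.

```typescript
// Hypothetical types; the real column and field names are not given in this thread.
type Verdict = "NOT_REPRODUCED" | "PARTIAL" | "PENDING";

interface Citation {
  quote: string; // text the driver says appears in the paper
  value: string; // the number as printed, e.g. "40.90"
}

interface Proposal {
  verdict: Verdict;
  citation?: Citation;
  protocolMatch?: string; // must be "exact" to pass the structural check
}

// Structural check: citation present and well-formed, protocol match is "exact".
function structuralOk(p: Proposal): boolean {
  const c = p.citation;
  return c !== undefined
    && c.quote.trim().length > 0
    && c.value.trim().length > 0
    && p.protocolMatch === "exact";
}

// Textual check: the quote appears verbatim with the value within `window` characters of it.
function textualOk(c: Citation, paperText: string, window: number = 200): boolean {
  if (c.quote.length === 0) return false;
  let from = 0;
  while (true) {
    const at = paperText.indexOf(c.quote, from);
    if (at === -1) return false;
    const start = Math.max(0, at - window);
    const end = Math.min(paperText.length, at + c.quote.length + window);
    if (paperText.slice(start, end).includes(c.value)) return true;
    from = at + 1; // try the next occurrence of the quote
  }
}

// Only proposed not-reproduced verdicts are inspected; nothing is ever upgraded.
function validate(p: Proposal, paperText: string): Verdict {
  if (p.verdict !== "NOT_REPRODUCED") return p.verdict;
  const c = p.citation;
  if (c === undefined || !structuralOk(p)) return "PARTIAL"; // assumed mapping
  if (!textualOk(c, paperText)) return "PENDING";            // assumed mapping
  return "NOT_REPRODUCED";
}

// Invented stand-in for extracted paper text.
const samplePaperText = "Table 3: results on CNN/DailyMail. BART R-L 40.90.";

const wellCited: Proposal = {
  verdict: "NOT_REPRODUCED",
  citation: { quote: "BART R-L", value: "40.90" },
  protocolMatch: "exact",
};
const quoteNotInPaper: Proposal = {
  verdict: "NOT_REPRODUCED",
  citation: { quote: "lambada", value: "0.617" },
  protocolMatch: "exact",
};
const uncited: Proposal = { verdict: "NOT_REPRODUCED" };

console.log(validate(wellCited, samplePaperText));       // NOT_REPRODUCED
console.log(validate(quoteNotInPaper, samplePaperText)); // PENDING
console.log(validate(uncited, samplePaperText));         // PARTIAL
```

A check of this shape would have stopped the OLMo case at the textual step, because the cited string never occurs in the paper text.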

PR #116: pdf-extractor-upgrade. The textual check above is only as useful as the PDF text extractor. The pre-2026-05-13 extractor was a home-grown parenthesised-string scanner that worked only on uncompressed PDF content streams; both DeBERTa-v2 and MiniCPM were rejected at C2 because their PDFs are FlateDecode-compressed. The upgrade replaces the home-grown extractor with pdfjs-dist, which handles FlateDecode and gives per-page text. Five real-paper regression tests now exercise the path.
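The failure mode of a parenthesised-string scanner on compressed streams can be illustrated with Node's built-in zlib module, which implements the same zlib/deflate format that FlateDecode uses. This sketch is neither the old extractor nor the pdfjs-dist path: the `scanParenStrings` function, its regular expression, and the one-line content stream are all assumptions made up for the demonstration.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

// Naive scanner: collect every "(...) Tj" string literal from a content stream.
function scanParenStrings(stream: Buffer): string[] {
  const text = stream.toString("latin1");
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    found.push(m[1]);
  }
  return found;
}

// An invented, uncompressed content stream with two text-showing operators.
const rawStream = Buffer.from("BT /F1 12 Tf (Table 3) Tj (40.90) Tj ET", "latin1");

// What a FlateDecode filter would store in the file instead of the raw bytes.
const flateStream = deflateSync(rawStream);

// The scanner sees the text only when the bytes are uncompressed.
console.log(scanParenStrings(rawStream));                // [ 'Table 3', '40.90' ]
console.log(scanParenStrings(flateStream));              // compressed bytes: the literals are not visible
console.log(scanParenStrings(inflateSync(flateStream))); // [ 'Table 3', '40.90' ]
```

Inflating a stream is only one of several steps a full extractor has to handle, which is a reasonable argument for delegating to a maintained library rather than extending the scanner.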

Driver-side discipline. Three exemplar drivers (BERT, Mamba, BART) demonstrate the new contract. The remaining drivers cannot emit a not-reproduced verdict without attaching a citation; the validator downgrades to PARTIAL until they are hardened.

What was not the problem

The Modal sandbox, the multi-seed gate, the sanity baseline, the triple-ensemble cross-model agreement check, the 75-character word boundary on the forbidden-words list, the render-time evidence-coercion guard, and the 72-hour pre-publication notice queue all worked exactly as designed. The flaw was upstream of any of them: the platform was comparing real measurements against wrong reference numbers, and nothing in the pipeline noticed.

A retraction log only matters if it is more transparent than the platform's subjects are expected to be. This thread will stay here going forward.
