paperiswrong

How a retraction lands here

A verdict is retracted when an audit confirms a methodology error — for example, a sandbox bug, a wrong commit, a wrong environment, a wrong numerical extraction, a category-confused paper claim, or a mis-pinned metric variant. The full process is documented under /legal/disputes.

Each row below carries the retraction date, the paper, the original verdict label and the label it was retracted to, a one-paragraph reason, the agent version that produced the now-wrong verdict, and links to the before/after comparison, the audit thread that produced the retraction, and the rollback PR.
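As a rough illustration, one row of the log could be modeled as the record below. This is a minimal sketch: the class name, the field names, and the use of None for "RETRACTED (no current verdict)" are assumptions made for the example, not the platform's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Verdict labels that appear in the log below.
LABELS = ("REPRODUCED", "PARTIAL", "WRONG")


@dataclass(frozen=True)
class RetractionRow:
    """Hypothetical shape of one retraction-log row."""
    retracted_on: date
    arxiv_id: str
    title: str
    original_label: str
    new_label: Optional[str]  # None stands for "RETRACTED (no current verdict)"
    reason: str
    agent_version: str
    audit_thread: str  # link target; left as a placeholder in this sketch

    def __post_init__(self) -> None:
        if self.original_label not in LABELS:
            raise ValueError(f"unknown original label: {self.original_label}")
        if self.new_label is not None and self.new_label not in LABELS:
            raise ValueError(f"unknown new label: {self.new_label}")
        if self.new_label == self.original_label:
            raise ValueError("a retraction must change the label")

    def headline(self) -> str:
        """One-line summary: date, paper, original label, current label."""
        current = self.new_label or "RETRACTED (no current verdict)"
        return (f"{self.retracted_on.isoformat()} arXiv:{self.arxiv_id} "
                f"{self.original_label} -> {current}")


if __name__ == "__main__":
    row = RetractionRow(
        retracted_on=date(2026, 5, 13),
        arxiv_id="1908.10084",
        title="Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
        original_label="WRONG",
        new_label="REPRODUCED",
        reason="Comparison repointed to the matching table row.",
        agent_version="v0.1.0-sbert-stsb-2slice",
        audit_thread="<audit thread link>",
    )
    print(row.headline())
```

Making the record frozen mirrors the policy that a published row is not edited in place; a correction would be a new row.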

Log (8)

2026-07-10 | arXiv:2005.14165 | REPRODUCED → RETRACTED (no current verdict)
Language Models are Few-Shot Learners
The reproduction measured the 124M-parameter GPT-2 checkpoint on WikiText-103 but attached that result to the GPT-3 paper. The GPT-3 paper does not report that checkpoint result as its claim, so the stored row is excluded from every public verdict surface.
agent_version: v0.1.0-gpt2-perplexity-microslice | before/after | audit thread | rollback PR
2026-05-13 | arXiv:1910.13461 | WRONG → PARTIAL
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
The driver scored rougeL (sentence-level longest common subsequence, computed over the whole text as a single sequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence per Lin 2004, computed sentence by sentence). The two variants can give different scores for the same pair of summaries, so the comparison was not like for like. Under the correct metric variant the model lands within paper bounds.
agent_version: v0.1.0-bart-cnndm-rougel | before/after | audit thread | rollback PR
2026-05-13 | arXiv:2301.12597 | WRONG → PARTIAL
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The driver decoded with greedy search on n = 20 examples with no prompt prefix. The paper uses beam search with width 5 and the prompt prefix "a photo of" on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.
agent_version: v0.1.0-blip2-flickr30k-n20-greedy | before/after | audit thread | rollback PR
2026-05-13 | arXiv:2002.08155 | WRONG → RETRACTED (no current verdict)
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot using a 200-way candidate pool and a mean-pooled cosine similarity score. The paper's Table 4 Python MRR of 0.840 comes from a CodeBERT model fine-tuned on the CodeSearchNet train split, evaluated against a 1000-way candidate pool, with a [CLS] classifier head. The retraction script disables the reproduction until the protocol matches the paper's.
agent_version: v0.1.0-codebert-codesearch-200way | before/after | audit thread | rollback PR
2026-05-13 | arXiv:2402.00838 | WRONG → RETRACTED (no current verdict)
OLMo: Accelerating the Science of Language Models
The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.
agent_version: v0.1.0-olmo-lambada-microslice | before/after | audit thread | rollback PR
2026-05-13 | arXiv:1908.10084 | WRONG → REPRODUCED
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.
agent_version: v0.1.0-sbert-stsb-2slice | before/after | audit thread | rollback PR
2026-05-13 | arXiv:2306.11644 | WRONG → PARTIAL
Textbooks Are All You Need
The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the "midpoint of the expected band per the mandate"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.
agent_version: v0.1.0-phi1-piqa-microslice | before/after | audit thread | rollback PR
2026-05-13 | arXiv:1911.02116 | WRONG → PARTIAL
Unsupervised Cross-lingual Representation Learning at Scale
The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.
agent_version: v0.1.0-xlm-r-xnli-zeroshot | before/after | audit thread | rollback PR
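The BART row in the log turns on the gap between rougeL and rougeLsum. The following is a simplified pure-Python sketch of the two variants, assuming whitespace tokenization, newline-separated sentences, and a plain F1 score. It is not the driver's code and not a replacement for a real ROUGE library (which also handles stemming and tokenization rules); all function names are illustrative.

```python
from collections import Counter


def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[:i], b[:j]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_ref_indices(ref, cand):
    """Indices into ref of one longest common subsequence with cand."""
    table = lcs_table(ref, cand)
    i, j = len(ref), len(cand)
    picked = []
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            picked.append(i - 1)
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return picked[::-1]


def f1(hits, ref_len, cand_len):
    if hits == 0:
        return 0.0
    precision = hits / cand_len
    recall = hits / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_l(reference, candidate):
    """rougeL-style score: newlines ignored, one LCS over the whole text."""
    ref, cand = reference.split(), candidate.split()
    hits = lcs_table(ref, cand)[len(ref)][len(cand)]
    return f1(hits, len(ref), len(cand))


def rouge_lsum(reference, candidate):
    """rougeLsum-style score: per reference sentence, union of LCS matches
    against every candidate sentence, with each token counted at most once."""
    ref_sents = [s.split() for s in reference.split("\n") if s.strip()]
    cand_sents = [s.split() for s in candidate.split("\n") if s.strip()]
    ref_len = sum(len(s) for s in ref_sents)
    cand_len = sum(len(s) for s in cand_sents)
    ref_left = Counter(tok for s in ref_sents for tok in s)
    cand_left = Counter(tok for s in cand_sents for tok in s)
    hits = 0
    for ref in ref_sents:
        union = set()
        for cand in cand_sents:
            union.update(lcs_ref_indices(ref, cand))
        for idx in sorted(union):
            tok = ref[idx]
            if ref_left[tok] > 0 and cand_left[tok] > 0:
                hits += 1
                ref_left[tok] -= 1
                cand_left[tok] -= 1
    return f1(hits, ref_len, cand_len)


if __name__ == "__main__":
    reference = "the cat sat\nthe dog ran"
    candidate = "the dog ran\nthe cat sat"  # same sentences, swapped order
    print(rouge_l(reference, candidate))     # 0.5
    print(rouge_lsum(reference, candidate))  # 1.0
```

On this toy pair, swapping the sentence order halves the rougeL-style score while the rougeLsum-style score stays at 1.0, which is why a score from one variant cannot be checked against a headline reported in the other.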

Why we keep this list public

A retraction log is the credibility test. A platform that publishes verdicts about other people's work has to be at least as transparent about its own errors as it expects its subjects to be. Every retraction here ages forward; we do not rewrite history.

The 2026-05-13 rollup — seven retractions on the same day — was the result of an internal red-team audit that found a common root cause across every then-public WRONG verdict. The pipeline hardening that followed is documented in the audit thread linked from each row above.