Retraction Log

When a methodology error is confirmed, we retract the verdict and list the retraction here. We do not delete retracted entries; they stay in place so the public record remains accurate. Subscribe via RSS or JSON at /api/v1/retractions.

How a retraction lands here

A verdict is retracted when an audit confirms a methodology error — for example, a sandbox bug, a wrong commit, a wrong environment, a wrong numerical extraction, a category-confused paper claim, or a mis-pinned metric variant. The full process is documented under /legal/disputes.

Each row below carries the retraction date, the paper, the original agent version that produced the now-wrong verdict, the label the verdict was retracted to, a one-paragraph reason, and a link to the audit thread that produced the retraction.
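As a rough illustration of the row layout described above, the fields could be modelled as a small record type. This is only a sketch: the class name, field names, and the `has_current_verdict` helper are hypothetical and are not taken from the /api/v1/retractions feed, whose schema is not documented here. The audit thread URL is left as a placeholder.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetractionRow:
    """One log row. All field names are illustrative assumptions."""
    retracted_on: date      # retraction date
    paper: str              # paper identifier, e.g. an arXiv id
    agent_version: str      # agent version that produced the now-wrong verdict
    retracted_to: str       # label the verdict was retracted to
    reason: str             # one-paragraph reason
    audit_thread_url: str   # link to the audit thread

    @property
    def has_current_verdict(self) -> bool:
        # Assumption: rows marked RETRACTED carry no replacement verdict.
        return not self.retracted_to.startswith("RETRACTED")


row = RetractionRow(
    retracted_on=date(2026, 5, 13),
    paper="arXiv:1910.13461",
    agent_version="v0.1.0-bart-cnndm-rougel",
    retracted_to="PARTIAL",
    reason="Driver scored rougeL against the paper's rougeLsum figure.",
    audit_thread_url="<audit-thread-url>",  # placeholder, not a real link
)
print(row.has_current_verdict)  # True
```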

Log (7)

  • 2026-05-13 · arXiv:1910.13461 · WRONG → PARTIAL

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    The driver scored rougeL (sentence-level longest common subsequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence per Lin 2004). The two metrics measure different things. Under the correct metric variant the model lands within paper bounds.

    agent_version: v0.1.0-bart-cnndm-rougel · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:2301.12597 · WRONG → PARTIAL

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    The driver decoded with greedy search on only n = 20 examples, with no prompt prefix. The paper uses beam search (width 5) with the prompt prefix "a photo of" on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.

    agent_version: v0.1.0-blip2-flickr30k-n20-greedy · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:2002.08155 · WRONG → RETRACTED (no current verdict)

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot, using a 200-way candidate pool and a mean-pooled cosine similarity score. The paper's Python MRR of 0.840 (Table 4) comes from a CodeBERT model fine-tuned on the CodeSearchNet train split, evaluated against a 1000-way candidate pool, with a [CLS] classifier head. The retraction script disables the reproduction until its protocol matches the paper's.

    agent_version: v0.1.0-codebert-codesearch-200way · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:2402.00838 · WRONG → RETRACTED (no current verdict)

    OLMo: Accelerating the Science of Language Models

    The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.

    agent_version: v0.1.0-olmo-lambada-microslice · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:1908.10084 · WRONG → REPRODUCED

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.

    agent_version: v0.1.0-sbert-stsb-2slice · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:2306.11644 · WRONG → PARTIAL

    Textbooks Are All You Need

    The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the "midpoint of the expected band per the mandate"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.

    agent_version: v0.1.0-phi1-piqa-microslice · before/after · audit thread · rollback PR
  • 2026-05-13 · arXiv:1911.02116 · WRONG → PARTIAL

    Unsupervised Cross-lingual Representation Learning at Scale

    The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.

    agent_version: v0.1.0-xlm-r-xnli-zeroshot · before/after · audit thread · rollback PR
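
The metric-variant mismatch behind the BART entry can be illustrated with a toy comparison. The sketch below is a simplified, assumption-laden version of the two scores: it uses whitespace tokenization, newline-separated sentences, and the union-LCS definition of summary-level ROUGE-L from Lin 2004, with none of the stemming or other preprocessing a production scorer applies. The function names are illustrative and this is not the driver's code.

```python
def lcs_ref_indices(ref, cand):
    """Return the set of positions in `ref` used by one longest common
    subsequence of the token lists `ref` and `cand`."""
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the matched reference positions.
    indices = set()
    i, j = m, n
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            indices.add(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return indices


def f1(hits, ref_len, cand_len):
    """F1 from LCS hits; 0.0 when nothing matches."""
    if hits == 0:
        return 0.0
    precision, recall = hits / cand_len, hits / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_l(reference, candidate):
    """Sentence-level variant: the whole text is one token sequence."""
    ref, cand = reference.split(), candidate.split()
    return f1(len(lcs_ref_indices(ref, cand)), len(ref), len(cand))


def rouge_lsum(reference, candidate):
    """Summary-level variant: union of LCS matches per reference sentence."""
    ref_sents = [s.split() for s in reference.splitlines() if s.strip()]
    cand_sents = [s.split() for s in candidate.splitlines() if s.strip()]
    hits = 0
    for ref_sent in ref_sents:
        union = set()
        for cand_sent in cand_sents:
            union |= lcs_ref_indices(ref_sent, cand_sent)
        hits += len(union)
    ref_len = sum(len(s) for s in ref_sents)
    cand_len = sum(len(s) for s in cand_sents)
    return f1(hits, ref_len, cand_len)


# Same two sentences, swapped order: the two variants disagree sharply.
reference = "the cat sat\non the mat"
candidate = "on the mat\nthe cat sat"
print(rouge_l(reference, candidate))     # 0.5
print(rouge_lsum(reference, candidate))  # 1.0
```

Because the flat variant penalizes sentence reordering and the summary-level variant does not, a score computed with one cannot be compared against a published number computed with the other.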

Why we keep this list public

A retraction log is the credibility test. A platform that publishes verdicts about other people's work has to be at least as transparent about its own errors as it expects its subjects to be. Every retraction here ages forward; we do not rewrite history.

The 2026-05-13 rollup (seven retractions on the same day) was the result of an internal red-team audit that found a common root cause across every then-public WRONG verdict. The pipeline hardening that followed is documented in the audit thread linked from each row above.