{"retractions":[{"arxiv_id":"1910.13461","paper_title":"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension","original_label":"WRONG","original_agent_version":"v0.1.0-bart-cnndm-rougel","original_verdict_id":null,"retracted_on":"2026-05-13","retracted_to":"PARTIAL","reason":"The driver scored rougeL (sentence-level longest common subsequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence per Lin 2004). The two metrics measure different things. Under the correct metric variant the model lands within paper bounds.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/104","diff_href":"/p/1910.13461/diff"},{"arxiv_id":"2301.12597","paper_title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","original_label":"WRONG","original_agent_version":"v0.1.0-blip2-flickr30k-n20-greedy","original_verdict_id":null,"retracted_on":"2026-05-13","retracted_to":"PARTIAL","reason":"The driver decoded with greedy search, n = 20 examples, and no prompt prefix. The paper uses beam search width 5 with a \"a photo of\" prompt prefix on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/106","diff_href":"/p/2301.12597/diff"},{"arxiv_id":"2002.08155","paper_title":"CodeBERT: A Pre-Trained Model for Programming and Natural Languages","original_label":"WRONG","original_agent_version":"v0.1.0-codebert-codesearch-200way","original_verdict_id":"95fb7f95-a23c-499d-a170-98b5b138a965","retracted_on":"2026-05-13","retracted_to":"RETRACTED (no current verdict)","reason":"The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot using a 200-way candidate pool and a mean-pooled cosine similarity score. The paper's Table 4 0.840 Python MRR comes from a CodeBERT fine-tuned on the CodeSearchNet train split, evaluated against a 1000-way candidate pool, with a [CLS] classifier head. The retraction script disables the reproduction until the protocol matches the paper's.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/108","diff_href":"/p/2002.08155/diff"},{"arxiv_id":"2402.00838","paper_title":"OLMo: Accelerating the Science of Language Models","original_label":"WRONG","original_agent_version":"v0.1.0-olmo-lambada-microslice","original_verdict_id":"8b876af6-f689-41cb-a8d8-cb1e79c62c92","retracted_on":"2026-05-13","retracted_to":"RETRACTED (no current verdict)","reason":"The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string \"lambada\" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/111","diff_href":"/p/2402.00838/diff"},{"arxiv_id":"1908.10084","paper_title":"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks","original_label":"WRONG","original_agent_version":"v0.1.0-sbert-stsb-2slice","original_verdict_id":null,"retracted_on":"2026-05-13","retracted_to":"REPRODUCED","reason":"The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/103","diff_href":"/p/1908.10084/diff"},{"arxiv_id":"2306.11644","paper_title":"Textbooks Are All You Need","original_label":"WRONG","original_agent_version":"v0.1.0-phi1-piqa-microslice","original_verdict_id":null,"retracted_on":"2026-05-13","retracted_to":"PARTIAL","reason":"The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the \"midpoint of the expected band per the mandate\"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/107","diff_href":"/p/2306.11644/diff"},{"arxiv_id":"1911.02116","paper_title":"Unsupervised Cross-lingual Representation Learning at Scale","original_label":"WRONG","original_agent_version":"v0.1.0-xlm-r-xnli-zeroshot","original_verdict_id":null,"retracted_on":"2026-05-13","retracted_to":"PARTIAL","reason":"The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.","audit_href":"/legal/retractions/2026-05-13","rollback_pr_url":"https://github.com/adrianaoun/YourPaperIsWrong.com/pull/110","diff_href":"/p/1911.02116/diff"}],"count":7,"timestamp":"2026-06-11T04:23:17.168Z"}