Timeline — paperiswrong

Recent activity

2026-07-10 00:00Z retracted
Language Models are Few-Shot Learners (arXiv:2005.14165)
The reproduction measured the 124M-parameter GPT-2 checkpoint on WikiText-103 but attached that result to the GPT-3 paper. The GPT-3 paper does not report that checkpoint result as its claim, so the stored row is excluded from every public verdict surface.
2026-05-15 19:56Z PENDING unknown
2 OLMo 2 Furious (arXiv:2501.00656)
v0.1.0-olmo2-winogrande-microslice
2026-05-15 19:19Z PENDING proxy
RoBERTa: A Robustly Optimized BERT Pretraining Approach (arXiv:1907.11692)
v0.1.0-roberta-mnli-microslice
2026-05-15 19:19Z PENDING proxy
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752)
v0.1.0-mamba-wikitext2-3slice8
2026-05-15 19:19Z PENDING proxy
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arXiv:1910.01108)
v0.1.0-distilbert-sst2-microslice
2026-05-15 19:19Z PENDING exact
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805)
v0.1.0-bert-sst2-3slice100
2026-05-15 18:26Z PENDING unknown
Stable LM 2 1.6B Technical Report (arXiv:2402.17834)
v0.1.0-stablelm2-winogrande-microslice
2026-05-15 17:51Z PENDING unknown
OLMoE: Open Mixture-of-Experts Language Models (arXiv:2409.02060)
v0.1.0-olmoe-winogrande-microslice
2026-05-15 17:15Z PENDING unknown
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948)
v0.1.0-deepseek-r1-winogrande-microslice
2026-05-15 16:33Z PENDING unknown
SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model (arXiv:2502.02737)
v0.1.0-smollm2-winogrande-microslice
2026-05-15 16:17Z PENDING unknown
Qwen2.5 Technical Report (arXiv:2412.15115)
v0.1.0-qwen25-winogrande-microslice
2026-05-15 16:10Z PENDING unknown
Yi: Open Foundation Models by 01.AI (arXiv:2403.04652)
v0.1.0-yi-lambada-microslice
2026-05-15 16:10Z PENDING unknown
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219)
v0.1.0-phi3-winogrande-microslice
2026-05-15 16:08Z PENDING proxy
Searching for MobileNetV3 (arXiv:1905.02244)
v0.1.0-mobilenet-v3-large-microslice
2026-05-15 03:04Z PENDING proxy
Mistral 7B (arXiv:2310.06825)
v0.1.0-mistral-hellaswag-microslice
2026-05-15 00:04Z PENDING proxy
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (arXiv:2404.06395)
v0.1.0-minicpm-mmlu5shot-microslice
2026-05-14 23:57Z PENDING proxy
Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295)
v0.1.0-gemma-hellaswag-microslice
2026-05-14 23:56Z PENDING proxy
TinyLlama: An Open-Source Small Language Model (arXiv:2401.02385)
v0.1.0-tinyllama-hellaswag-microslice
2026-05-14 23:56Z PENDING proxy
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing (arXiv:2111.09543)
v0.1.0-deberta-mnli-microslice
2026-05-14 23:54Z PENDING proxy
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv:2211.05100)
v0.1.0-bloom-lambada-microslice
2026-05-14 23:53Z PENDING unknown
Code Llama: Open Foundation Models for Code (arXiv:2308.12950)
v0.1.0-codellama-pythonppl-microslice
2026-05-14 23:52Z PENDING unknown
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (arXiv:2401.14196)
v0.1.0-deepseek-coder-pythonppl-microslice
2026-05-14 23:52Z PENDING proxy
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (arXiv:2304.01373)
v0.1.0-pythia14-lambada-microslice
2026-05-14 23:48Z PENDING proxy
OPT: Open Pre-trained Transformer Language Models (arXiv:2205.01068)
v0.1.0-opt-lambada-microslice
2026-05-14 23:48Z PENDING proxy
Swin Transformer V2: Scaling Up Capacity and Resolution (arXiv:2111.09883)
v0.1.0-swinv2-imagenet-microslice
2026-05-14 23:48Z PENDING proxy
XLNet: Generalized Autoregressive Pretraining for Language Understanding (arXiv:1906.08237)
v0.1.0-xlnet-mnli-microslice
2026-05-14 23:48Z PENDING proxy
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084)
v0.2.0-sbert-stsb-test-3slice-table2
2026-05-14 23:48Z PENDING unknown
StarCoder: may the source be with you! (arXiv:2305.06161)
v0.1.0-starcoder-pythonppl-microslice
2026-05-14 23:40Z PENDING proxy
LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
v0.1.0-lora-mrpc-microslice
2026-05-14 23:39Z PENDING proxy
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (arXiv:2102.05918)
v0.1.0b-align-imagenette-3slice100
2026-05-14 23:39Z PENDING unknown
Qwen2 Technical Report (arXiv:2407.10671)
v0.1.0-qwen2-lambada-microslice
2026-05-13 00:00Z retracted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (arXiv:1910.13461)
The driver scored rougeL (sentence-level longest common subsequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence, per Lin 2004). The two variants handle sentence boundaries differently, so their scores are not directly comparable. Under the correct metric variant the model lands within the paper's bounds.
2026-05-13 00:00Z retracted
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (arXiv:2301.12597)
The driver decoded with greedy search on n = 20 examples and used no prompt prefix. The paper uses beam search with width 5 and the prompt prefix "a photo of" on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.
2026-05-13 00:00Z retracted
CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arXiv:2002.08155)
The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot, using a 200-way candidate pool and a mean-pooled cosine similarity score. The Python MRR of 0.840 in the paper's Table 4 comes from a CodeBERT model fine-tuned on the CodeSearchNet train split and evaluated against a 1000-way candidate pool with a [CLS] classifier head. The retraction script disables the reproduction until the protocol matches the paper's.
2026-05-13 00:00Z retracted
OLMo: Accelerating the Science of Language Models (arXiv:2402.00838)
The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.
2026-05-13 00:00Z retracted
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084)
The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.
2026-05-13 00:00Z retracted
Textbooks Are All You Need (arXiv:2306.11644)
The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the "midpoint of the expected band per the mandate"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.
2026-05-13 00:00Z retracted
Unsupervised Cross-lingual Representation Learning at Scale (arXiv:1911.02116)
The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.
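
The CodeBERT retraction above turns partly on candidate-pool size. As a rough illustration of why MRR measured against a 200-way pool is not comparable to MRR against a 1000-way pool, the sketch below computes mean reciprocal rank and shows that adding distractors can only hold or lower the gold candidate's rank. The function names and the toy similarity scores are hypothetical, chosen for illustration; they are not taken from the driver or from the paper.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(scores: Sequence[float], gold_index: int) -> float:
    """Return 1/rank of the gold candidate, where rank 1 is the highest score."""
    gold = scores[gold_index]
    # Rank = 1 + number of candidates scoring strictly higher than the gold one.
    rank = 1 + sum(1 for s in scores if s > gold)
    return 1.0 / rank


def mean_reciprocal_rank(queries: List[Tuple[Sequence[float], int]]) -> float:
    """Average the reciprocal rank over (scores, gold_index) pairs."""
    return sum(reciprocal_rank(s, g) for s, g in queries) / len(queries)


# Toy scores (invented for illustration): two queries, gold candidate at index 0.
small_pool = [([0.9, 0.2, 0.1], 0), ([0.3, 0.8, 0.5], 0)]

# Same queries with extra distractors appended, mimicking a larger pool.
large_pool = [
    ([0.9, 0.2, 0.1, 0.95], 0),
    ([0.3, 0.8, 0.5, 0.1], 0),
]

mrr_small = mean_reciprocal_rank(small_pool)
mrr_large = mean_reciprocal_rank(large_pool)
print(f"small pool MRR = {mrr_small:.4f}, large pool MRR = {mrr_large:.4f}")
```

With the same scoring function, a larger pool never raises any query's reciprocal rank, so MRR values are only comparable when the pool size matches.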

Reading the feed

Each entry pairs an ISO-8601 timestamp with a status chip (REPRODUCED / PARTIAL / PENDING / NOT_ATTEMPTED / RETRACTED) and, for verdicts, the protocol-match tier the driver declared (exact / proxy / unknown).
Only PROTOCOL_MATCH=exact drivers can publish a public WRONG verdict. The validator C1 gate auto-downgrades every other tier — see /validator.
Each entry's clickable target is the paper page, where the structured claim citation and the full reproduction evidence (stdout offsets, signed manifest, agent version) live.
NOT_ATTEMPTED and OUT_OF_BUDGET rows also appear on the refusal-transparency surface at /skipped with the per-paper rationale (gated weights, methodological mismatch, or out-of-budget).
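
The fields described above can be sketched as a small parser plus a gate check. This is a minimal illustration, not the validator's implementation: the names `FeedHeader`, `parse_header`, and `c1_gate` are hypothetical, and the status that a non-exact WRONG verdict is downgraded to is an assumption, since the page does not state it.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Timestamp, status chip, optional protocol-match tier. Whitespace between
# the fields is optional, so fused and spaced renderings both parse.
HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2})Z\s*"
    r"(?P<status>[A-Za-z_]+?)\s*(?P<tier>exact|proxy|unknown)?$"
)


@dataclass
class FeedHeader:
    timestamp: datetime
    status: str            # normalised to upper case, e.g. "PENDING"
    tier: Optional[str]    # "exact", "proxy", "unknown", or None


def parse_header(line: str) -> FeedHeader:
    """Parse one feed header line such as '2026-05-15 19:56Z PENDING unknown'."""
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError(f"not a feed header: {line!r}")
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return FeedHeader(ts, m["status"].upper(), m["tier"])


def c1_gate(verdict: str, tier: Optional[str], downgraded: str = "PENDING") -> str:
    """Let a WRONG verdict through only when the declared tier is 'exact'.

    ASSUMPTION: the downgrade target ('PENDING' by default) is a placeholder;
    the real gate's output status is not documented on this page.
    """
    if verdict == "WRONG" and tier != "exact":
        return downgraded
    return verdict


header = parse_header("2026-05-15 19:56Z PENDING unknown")
print(header.status, header.tier, c1_gate("WRONG", header.tier))
```

Taking the downgrade target as a parameter keeps the sketch from asserting which status the real gate emits.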