Quiet day · no new POST verdicts in 24h

We tell you which ML papers are wrong.

By running the code. Publicly. On every paper on arXiv.

“Wrong” is a technical term: our reproduction did not match a claim's reported numbers. See methodology →
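As a rough illustration of what "did not match the reported numbers" could mean in practice, here is a minimal sketch. It assumes a relative tolerance of 2% and a simple aggregation rule (all claims match, some match, none match); both are placeholders chosen for the example, not the site's published methodology. The `Claim` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    name: str          # hypothetical label for a reported metric
    reported: float    # number stated in the paper
    reproduced: float  # number measured by the reproduction run


def within_tolerance(claim: Claim, rel_tol: float = 0.02) -> bool:
    """Return True when the reproduced value is within rel_tol of the reported one."""
    if claim.reported == 0:
        # Relative error is undefined at zero, so fall back to an absolute bound.
        return abs(claim.reproduced) <= rel_tol
    return abs(claim.reproduced - claim.reported) <= rel_tol * abs(claim.reported)


def verdict(claims: list[Claim], rel_tol: float = 0.02) -> str:
    """Combine per-claim checks into one label (assumed aggregation rule)."""
    if not claims:
        raise ValueError("at least one claim is required")
    matches = [within_tolerance(c, rel_tol) for c in claims]
    if all(matches):
        return "REPRODUCED"
    if any(matches):
        return "PARTIAL"
    return "WRONG"


if __name__ == "__main__":
    claims = [Claim("accuracy", 90.0, 89.5), Claim("f1", 80.0, 70.0)]
    print(verdict(claims))
```

Under these assumptions, the first claim matches and the second does not, so the pair is labeled as a partial reproduction rather than a failure.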

10k papers indexed
186 reproductions run
75% reproduce within tolerance
0 public WRONG since the validator landed · 50 of 55 verdicts PDF-cited · 8/10 anchors healthy

Today's verdicts

POST verdicts published in the last 24 hours.

See the Wall of Wrong →
PARTIAL · 2 OLMo 2 Furious · arXiv 2024 · cs.CL
REPRODUCED · RoBERTa: A Robustly Optimized BERT Pretraining Approach · arXiv preprint · cs.CL
REPRODUCED · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · COLM 2024 · cs.LG
REPRODUCED · DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter · NeurIPS 2019 EMC^2 Workshop · cs.CL
REPRODUCED · BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · NAACL 2019 · cs.CL
PARTIAL · Stable LM 2 1.6B Technical Report · arXiv 2024 · cs.CL
PARTIAL · OLMoE: Open Mixture-of-Experts Language Models · arXiv 2024 · cs.CL
PARTIAL · DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · arXiv 2025 · cs.CL
PARTIAL · SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model · arXiv 2025 · cs.CL
PARTIAL · Qwen2.5 Technical Report · arXiv 2024 · cs.CL

We run your code

Every reproduction starts from the official repo, runs in a clean Modal sandbox, and logs every command. No re-implementation.
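To make "logs every command" concrete, here is a small sketch of a command runner that records one JSON line per step. It uses Python's standard `subprocess` module on the local machine as a stand-in; it does not use or depict Modal's API, and the function name `run_logged` and the log layout are assumptions made for illustration.

```python
import json
import subprocess
import sys
import time


def run_logged(commands, log_path, cwd=None):
    """Run each command in order and append one JSON record per command to log_path."""
    records = []
    with open(log_path, "a", encoding="utf-8") as log:
        for cmd in commands:
            started = time.time()
            proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
            record = {
                "cmd": cmd,
                "returncode": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
                "seconds": round(time.time() - started, 3),
            }
            log.write(json.dumps(record) + "\n")
            records.append(record)
            if proc.returncode != 0:
                break  # stop at the first failing step; the log keeps the evidence
    return records


if __name__ == "__main__":
    # Placeholder steps; a real job would run the official repo's own scripts.
    steps = [[sys.executable, "-c", "print('step 1 ok')"]]
    for rec in run_logged(steps, "reproduction_log.jsonl"):
        print(rec["cmd"], "->", rec["returncode"])
```

Writing the record before deciding whether to continue means a failed step is always captured, which is what lets a verdict link back to the exact command that diverged.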

Authors get the last word

Right of reply isn't a footer — it's a sidebar. Every verdict links to the author's response in the same view.

Evidence in one click

Each WRONG label is one click from the diff, the logs, and the reproduction job. Always.

The Wall of Wrong

A chronological feed of every paper that failed to reproduce when we ran it, sortable by venue, lab, and confidence. The signature page of paperiswrong.

Open the Wall →
PARTIAL · 0.55 2 OLMo 2 Furious
PARTIAL · 0.55 Stable LM 2 1.6B Technical Report
PARTIAL · 0.55 OLMoE: Open Mixture-of-Experts Language Models
PARTIAL · 0.55 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs v…