Live counts
- Total claims: 50
- Verified: 2
- Partial: 33
- Did not reproduce: 0
- Not checkable: 15
- Ambiguous: 0
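The per-label counts above should add up to the total. A minimal sketch of that consistency check, assuming the counts are held in a plain dictionary; the names `live_counts` and `counts_are_consistent` are illustrative and not part of the site's actual data model:

```python
# Hypothetical snapshot of the live counts listed above.
live_counts = {
    "Verified": 2,
    "Partial": 33,
    "Did not reproduce": 0,
    "Not checkable": 15,
    "Ambiguous": 0,
}
total_claims = 50


def counts_are_consistent(counts: dict[str, int], total: int) -> bool:
    """Return True when the per-label counts add up to the reported total."""
    return sum(counts.values()) == total


print(counts_are_consistent(live_counts, total_claims))  # True
```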
What the labels mean
Verified is reserved for exact-protocol checks: the reproduction measured the cited claim under the same benchmark, metric, and scoring setup.
Partially supported means the result is directionally consistent, but the check used a proxy such as a micro-slice, community checkpoint, or nearby evaluation setup.
Not checkable means the job did not actually test that claim: for example, the cited table is MMLU but the run measured WinoGrande, or the model is gated.
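The three definitions can be read as a decision rule. The sketch below is a simplification for illustration, not the pipeline's actual logic: the function `assign_label` and its boolean inputs are assumptions, and it assumes a result must agree with the claim before it earns Verified or Partially supported. Outcomes that this section does not define (such as Did not reproduce or Ambiguous) are returned as `None`.

```python
from enum import Enum
from typing import Optional


class Label(Enum):
    VERIFIED = "Verified"
    PARTIAL = "Partially supported"
    NOT_CHECKABLE = "Not checkable"


def assign_label(
    tested_cited_claim: bool,   # did the job measure the cited claim at all?
    exact_protocol: bool,       # same benchmark, metric, and scoring setup?
    result_consistent: bool,    # does the measured result agree with the claim?
) -> Optional[Label]:
    """Map a check outcome to one of the three labels defined above."""
    if not tested_cited_claim:
        # Wrong benchmark, gated model, etc.: nothing was actually tested.
        return Label.NOT_CHECKABLE
    if not result_consistent:
        # Not covered by the three definitions in this section.
        return None
    return Label.VERIFIED if exact_protocol else Label.PARTIAL


print(assign_label(True, False, True).value)  # Partially supported
```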
Checked claims
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 61.2 on accuracy for OLMo-2-7B-Instruct.
Compared against Table 8 · row OLMo-2-7B-Instruct · column MMLU 5-shot · PDF page 16 · 61.2
- Partially supported · proxy · confidence 0.85 · 2026-05-15
The paper reports 90.2 on accuracy for RoBERTa.
Compared against Table 8 · row RoBERTa · column MNLI · PDF page 8 · RoBERTa 90.2
- Partially supported · proxy · confidence 0.65 · 2026-05-15
The paper reports 28.39 on perplexity for Mamba 130M.
Compared against Table 4 · row Mamba 130M · column WikiText (ppl) · PDF page 8 · Mamba 130M 28.39
Mamba: Linear-Time Sequence Modeling with Selective State Spaces · arXiv:2312.00752 · v0.1.0-mamba-wikitext2-3slice8
- Partially supported · proxy · confidence 0.80 · 2026-05-15
The paper reports 91.3 on accuracy for DistilBERT.
Compared against Table 2 · row DistilBERT · column SST-2 · PDF page 4 · DistilBERT 77.0 51.3 91.3 85.5 59.9 86.9 56.1 89.2
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter · arXiv:1910.01108 · v0.1.0-distilbert-sst2-microslice
- Verified · exact · confidence 0.80 · 2026-05-15
The paper reports 93.5 on accuracy for BERT-BASE.
Compared against Table 6 · row BERT-BASE · column SST-2 · PDF page 6 · BERT BASE 84.6 88.9 92.7 89.3 71.2 93.5
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805 · v0.1.0-bert-sst2-3slice100
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 38.8 on accuracy for StableLM-2-1.6B-Chat.
Compared against Table 3 · row StableLM-2-1.6B-Chat · column MMLU 5-shot · PDF page 9 · 38.8
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 54.1 on accuracy for OLMoE-1B-7B-Instruct.
Compared against Table 4 · row OLMoE-1B-7B-Instruct · column MMLU 5-shot · PDF page 8 · 54.1
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 43.9 on accuracy for DeepSeek-R1-Distill-Qwen-1.5B.
Compared against Table 5 · row DeepSeek-R1-Distill-Qwen-1.5B · column MMLU · PDF page 14 · 43.9
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · arXiv:2501.12948 · v0.1.0-deepseek-r1-winogrande-microslice
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 50.8 on accuracy for SmolLM2-1.7B-Instruct.
Compared against Table 4 · row SmolLM2-1.7B-Instruct · column MMLU 5-shot · PDF page 8 · 50.8
SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model · arXiv:2502.02737 · v0.1.0-smollm2-winogrande-microslice
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 47.5 on accuracy for Qwen2.5-0.5B-Instruct.
Compared against Results · row Qwen2.5-0.5B-Instruct · column MMLU 5-shot · PDF page 8 · 47.5
- Not checkable · unknown · confidence 0.80 · 2026-05-15
The paper reports 64.11 on accuracy for Yi-6B.
Compared against Table 9 · row Yi-6B · column MMLU 5-shot accuracy · PDF page 17 · 64.11
- Not checkable · unknown · confidence 0.55 · 2026-05-15
The paper reports 68.8 on accuracy for phi-3-mini.
Compared against Table 2 · row phi-3-mini · column MMLU 5-shot · PDF page 4 · 68.8
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219 · v0.1.0-phi3-winogrande-microslice
- Partially supported · proxy · confidence 0.50 · 2026-05-15
The paper reports 75.2 on ImageNet-1k top-1 accuracy for V3-Large 1.0.
Compared against Table 3 · row V3-Large 1.0 · column Top-1 · PDF page 6 · V3-Large 1.0 75.2 219 5.4M 51 61 44
- Partially supported · proxy · confidence 0.80 · 2026-05-15
The paper reports 81.3 on accuracy for Mistral-7B.
Compared against Table 2 · row Mistral-7B · column HellaSwag 0-shot · PDF page 4 · Mistral-7B 81.3
- Partially supported · proxy · confidence 0.55 · 2026-05-15
The paper reports 53.46 on accuracy for MiniCPM-2.4B.
Compared against Table 3 · row MiniCPM-2.4B · column MMLU · PDF page 14 · MiniCPM-2.4B 51.13 51.07 53.46 50.00 47.31 53.83 10.24
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies · arXiv:2404.06395 · v0.1.0-minicpm-mmlu5shot-microslice
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 71.4 on accuracy for Gemma-2B.
Compared against Table 7 · row Gemma-2B · column HellaSwag 0-shot · PDF page 7 · Gemma-2B 71.4
Gemma: Open Models Based on Gemini Research and Technology · arXiv:2403.08295 · v0.1.0-gemma-hellaswag-microslice
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 59.2 on accuracy for TinyLlama 1.1B (3T).
Compared against Table 2 · row TinyLlama 1.1B (3T) · column HellaSwag 0-shot · PDF page 4 · TinyLlama 59.20
- Partially supported · proxy · confidence 0.85 · 2026-05-14
The paper reports 91.8 on accuracy for DeBERTa-v3-large.
Compared against Table 1 · row DeBERTa-v3-large · column MNLI-m · PDF page 6 · DeBERTa-v3-large 91.8
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing · arXiv:2111.09543 · v0.1.0-deberta-mnli-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 35.7 on accuracy for BLOOM-560m.
Compared against Table 9 · row BLOOM-560m · column LAMBADA English · PDF page 14 · BLOOM-560m 35.7
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model · arXiv:2211.05100 · v0.1.0-bloom-lambada-microslice
- Not checkable · unknown · confidence 0.60 · 2026-05-14
The paper reports 38.4 on pass@1 for Code Llama - Python 7B.
Compared against Table 2 · row Code Llama - Python 7B · column HumanEval pass@1 · PDF page 7 · 38.4
- Not checkable · unknown · confidence 0.60 · 2026-05-14
The paper reports 34.8 on pass@1 for DeepSeek-Coder-1.3B-Base.
Compared against Table 2 · row DeepSeek-Coder-1.3B-Base · column HumanEval pass@1 · PDF page 8 · 34.8
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence · arXiv:2401.14196 · v0.1.0-deepseek-coder-pythonppl-microslice
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 57.2 on accuracy for Pythia-1.4B (standard).
Compared against Table 5 · row Pythia-1.4B (standard) · column LAMBADA accuracy · PDF page 19 · 57.2
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling · arXiv:2304.01373 · v0.1.0-pythia14-lambada-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 58 on accuracy for OPT-1.3B.
Compared against Table 3 · row OPT-1.3B · column LAMBADA zero-shot accuracy · PDF page 7 · 58.0
- Partially supported · proxy · confidence 0.55 · 2026-05-14
The paper reports 81.8 on accuracy for SwinV2-T (window 8, 256²).
Compared against Table 5 · row SwinV2-T (window 8, 256²) · column ImageNet-1k val top-1 · PDF page 8 · 81.8
- Partially supported · proxy · confidence 0.85 · 2026-05-14
The paper reports 86.8 on accuracy for XLNet-Base.
Compared against Table 6 · row XLNet-Base · column MNLI matched accuracy · PDF page 8 · 86.8
XLNet: Generalized Autoregressive Pretraining for Language Understanding · arXiv:1906.08237 · v0.1.0-xlnet-mnli-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 85.35 on spearman for SBERT-NLI-STSb-base (trained on NLI + STSb).
Compared against Table 2 · row SBERT-NLI-STSb-base (trained on NLI + STSb) · column STSb test Spearman correlation · PDF page 6 · 85.35
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · arXiv:1908.10084 · v0.2.0-sbert-stsb-test-3slice-table2
- Not checkable · unknown · confidence 0.60 · 2026-05-14
The paper reports 15.17 on pass@1 for StarCoderBase-1B.
Compared against Table 6 · row StarCoderBase-1B · column HumanEval pass@1 · PDF page 12 · 15.17
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 89.7 on accuracy for RoBERTa-base + LoRA (r=8).
Compared against Table 2 · row RoBERTa-base + LoRA (r=8) · column MRPC dev accuracy · PDF page 6 · RoBERTa base 89.7
- Partially supported · proxy · confidence 0.55 · 2026-05-14
The paper reports 76.4 on accuracy for ALIGN-base.
Compared against Table 4 · row ALIGN-base · column ImageNet zero-shot top-1 · PDF page 6 · ALIGN 76.4
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision · arXiv:2102.05918 · v0.1.0b-align-imagenette-3slice100
- Not checkable · unknown · confidence 0.80 · 2026-05-14
The paper reports 50 on accuracy for Qwen2-0.5B.
Compared against HF Open LLM Leaderboard (external) · row Qwen2-0.5B · column LAMBADA OpenAI acc. · Qwen2
- Partially supported · proxy · confidence 0.50 · 2026-05-14
The paper reports 77.1 on accuracy for EfficientNet-B0.
Compared against Table 2 · row EfficientNet-B0 · column ImageNet val top-1 · PDF page 6 · EfficientNet-B0 77.1
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks · arXiv:1905.11946 · v0.1.0-efficientnet-microslice
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 78.13 on accuracy for Falcon-7B.
Compared against Table 17 · row Falcon-7B · column HellaSwag 0-shot · PDF page 32 · Falcon-7B 78.13
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 73.4 on accuracy for phi-1.5.
Compared against Table 3 · row phi-1.5 · column WinoGrande 0-shot · PDF page 5 · phi-1.5 73.4
- Not checkable · unknown · confidence 0.50 · 2026-05-14
The paper reports 25 on perplexity for BigBird-RoBERTa-base.
Compared against Section 4 · row BigBird-RoBERTa-base · column WikiText-2 MLM perplexity (probe) · PDF page 6 · BigBird
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 45.1 on accuracy for FLAN-T5-Large.
Compared against Table 6 · row FLAN-T5-Large · column MMLU 0-shot direct · PDF page 14 · FLAN-T5-Large 45.1
- Partially supported · proxy · confidence 0.60 · 2026-05-14
The paper reports 43.7 on bleu4 for BLIP-2 OPT-2.7B.
Compared against Table 3 · row BLIP-2 OPT-2.7B · column Flickr30k zero-shot BLEU-4 · PDF page 7 · BLIP-2 OPT-2.7B 43.7
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · arXiv:2301.12597 · v0.2.0-blip2-flickr30k-beam5-n100
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 81.4 on accuracy for DINOv2-B/14.
Compared against Table 5 · row DINOv2-B/14 · column ImageNet-1k k-NN top-1 · PDF page 11 · ViT-B/14 81.4
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 76.1 on accuracy for DINO ViT-B/16.
Compared against Table 2 · row DINO ViT-B/16 · column k-NN ImageNet-1k top-1 · PDF page 5 · ViT-B/16 76.1
- Not checkable · proxy · confidence n/a · 2026-05-14
The paper reports 82.1 on accuracy for ConvNeXt-T.
Compared against Table 1 · row ConvNeXt-T · column ImageNet val top-1 · PDF page 5 · ConvNeXt-T 82.1
- Partially supported · exact · confidence 0.60 · 2026-05-14
The paper reports 40.9 on rougeLsum for BART.
Compared against Table 3 · row BART · column R-L · PDF page 7 · BART 44.16 21.28 40.90
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension · arXiv:1910.13461 · v0.1.0-bart-cnndm-200slice
- Verified · exact · confidence 0.85 · 2026-05-14
The paper reports 91.1 on accuracy for DeBERTalarge.
Compared against Table 2 · row DeBERTalarge · column MNLI-m/mm · PDF page 7 · DeBERTalarge 91.1/91.1 95.5/90.1 90.7/88.0 86.8 91.4/91.0 90.8 93.8
DeBERTa: Decoding-enhanced BERT with Disentangled Attention · arXiv:2006.03654 · v0.1.0-deberta-v2-mnli-microslice
- Partially supported · proxy · confidence 0.75 · 2026-05-14
The paper reports 91.6 on accuracy for ViT-B/16 CIFAR10.
Compared against Table 11 · row ViT-B/16 CIFAR10 · column zero-shot top-1 · PDF page 19 · ViT-B/16 91.6
Learning Transferable Visual Models From Natural Language Supervision · arXiv:2103.00020 · v0.1.0-clip-cifar10-3slice100
- Partially supported · proxy · confidence 0.75 · 2026-05-14
The paper reports 0.056 on wer for whisper-tiny.en.
Compared against Table 8 · row whisper-tiny.en · column LibriSpeech test-clean WER · PDF page 25 · tiny.en 5.6
Robust Speech Recognition via Large-Scale Weak Supervision · arXiv:2212.04356 · v0.1.0-whisper-librispeech-3slice16
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 99 on accuracy for ViT-B/16.
Compared against Table 5 · row ViT-B/16 · column CIFAR-10 transfer · PDF page 6 · ViT-B/16 99.0
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale · arXiv:2010.11929 · v0.1.0-vit-cifar10-3slice100
- Partially supported · proxy · confidence 0.45 · 2026-05-14
The paper reports 76 on accuracy for ResNet-50.
Compared against Table 4 · row ResNet-50 · column ImageNet val top-1 · PDF page 7 · ResNet-50 76.0
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 88.5 on accuracy for ELECTRA-Base.
Compared against Table 8 · row ELECTRA-Base · column MNLI · PDF page 8 · ELECTRA-Base 88.5
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators · arXiv:2003.10555 · v0.1.0-electra-mnli-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 82.4 on accuracy for T5-Small.
Compared against Table 14 · row T5-Small · column MNLI-m · PDF page 30 · T5-Small 82.4
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer · arXiv:1910.10683 · v0.1.0-t5-mnli-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 89.3 on f1 for ALBERT-base.
Compared against Table 2 · row ALBERT-base · column MRPC F1 · PDF page 7 · ALBERT-base 89.3
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations · arXiv:1909.11942 · v0.1.0-albert-mrpc-microslice
- Partially supported · proxy · confidence 0.80 · 2026-05-14
The paper reports 84.3 on accuracy for MobileBERT.
Compared against Table 4 · row MobileBERT · column MNLI-m · PDF page 6 · MobileBERT 84.3
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices · arXiv:2004.02984 · v0.1.0-mobilebert-mnli-microslice
- Not checkable · proxy · confidence n/a · 2026-05-14
The paper reports 77.2 on accuracy for Llama 2 7B.
Compared against Table 3 · row Llama 2 7B · column HellaSwag 0-shot · PDF page 9 · Llama 2 7B 77.2
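Some entries above record the claim on a different scale from the table cell, for example the whisper-tiny.en claim of 0.056 WER against a table value of 5.6. A minimal sketch of a comparison that tolerates a fraction-versus-percent mismatch follows; `values_match` and its tolerance are assumptions for illustration, not the checker's actual comparison:

```python
import math


def values_match(claimed: float, table_cell: float, rel_tol: float = 1e-6) -> bool:
    """True if the values agree directly or after a x100 rescale in either direction."""
    candidates = (claimed, claimed * 100.0, claimed / 100.0)
    return any(math.isclose(c, table_cell, rel_tol=rel_tol) for c in candidates)


print(values_match(0.056, 5.6))   # True: fraction vs. percent
print(values_match(59.2, 59.20))  # True: same value, different formatting
print(values_match(38.4, 34.8))   # False: genuinely different numbers
```

Accepting a rescale is a deliberate loosening: it would be safer to store the unit alongside each claim, but that field is not shown in the entries here.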
Scope
A claim check is smaller than a paper verdict. It says whether a specific cited value or span matched the reproduction job under the stated protocol, not whether the whole paper is good or bad.