Evidence

Claim checks

paperiswrong works best when it checks one concrete claim at a time: a table value, metric, sentence, or short quoted span. This page shows the exact claim citations attached to current reproduction verdicts.

Live counts

  • Total claims: 50
  • Verified: 2
  • Partial: 33
  • Did not reproduce: 0
  • Not checkable: 15
  • Ambiguous: 0
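As a quick consistency check, the per-label counts should add up to the total. The snippet below is a minimal sketch that copies the numbers from the list above into a dictionary; the variable names are illustrative and not part of any paperiswrong API.

```python
# Counts copied from the "Live counts" list above.
counts = {
    "Verified": 2,
    "Partial": 33,
    "Did not reproduce": 0,
    "Not checkable": 15,
    "Ambiguous": 0,
}
total_claims = 50

# The five labels should partition the full set of claims.
label_sum = sum(counts.values())
shares = {label: n / total_claims for label, n in counts.items()}

print(label_sum == total_claims)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

Read this way, about two thirds of the current checks are proxy-based partial results, and none is a failed reproduction.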

What the labels mean

Verified is reserved for exact-protocol checks: the reproduction measured the cited claim under the same benchmark, metric, and scoring setup.

Partially supported means the result is directionally consistent, but the check used a proxy such as a micro-slice, community checkpoint, or nearby evaluation setup.

Not checkable means the job did not actually test that claim: for example, the cited column is MMLU but the run measured WinoGrande, or the model is gated.
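One way to read the three definitions above is as a small decision rule. The sketch below is an illustration only: the function name and its three inputs are hypothetical, and the live verdicts may weigh other signals, so it is not expected to reproduce every label and protocol combination listed on this page. The definitions do not cover "Did not reproduce" or "Ambiguous", so the sketch returns None for any case they leave open.

```python
from typing import Optional


def label_for(claim_tested: bool, protocol: str, consistent: bool) -> Optional[str]:
    """Map one check to a label using only the three definitions above.

    All three parameters are hypothetical inputs for this sketch:
      claim_tested: the job actually measured the cited claim
      protocol:     "exact" (same benchmark, metric, scoring) or "proxy"
      consistent:   the measured result agrees with the cited value
                    (directionally, for a proxy check)
    """
    if not claim_tested:
        return "Not checkable"
    if protocol == "exact" and consistent:
        return "Verified"
    if protocol == "proxy" and consistent:
        return "Partially supported"
    # The definitions above do not say what happens in the remaining cases.
    return None


print(label_for(False, "proxy", True))  # Not checkable
print(label_for(True, "exact", True))   # Verified
print(label_for(True, "proxy", True))   # Partially supported
```

Checking `claim_tested` first mirrors the MMLU versus WinoGrande example: a run that measured a different benchmark cannot support or contradict the cited value, whatever its result.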

Checked claims

  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 61.2 on accuracy for OLMo-2-7B-Instruct.

    Compared against Table 8 · row OLMo-2-7B-Instruct · column MMLU 5-shot · PDF page 16.
    61.2
    2 OLMo 2 Furious · 2501.00656 · v0.1.0-olmo2-winogrande-microslice
  • Partially supported · proxy · confidence 0.85 · 2026-05-15

    The paper reports 90.2 on accuracy for RoBERTa.

    Compared against Table 8 · row RoBERTa · column MNLI · PDF page 8.
    RoBERTa 90.2
    RoBERTa: A Robustly Optimized BERT Pretraining Approach · 1907.11692 · v0.1.0-roberta-mnli-microslice
  • Partially supported · proxy · confidence 0.65 · 2026-05-15

    The paper reports 28.39 on perplexity for Mamba 130M.

    Compared against Table 4 · row Mamba 130M · column WikiText (ppl) · PDF page 8.
    Mamba 130M 28.39
  • Partially supported · proxy · confidence 0.80 · 2026-05-15

    The paper reports 91.3 on accuracy for DistilBERT.

    Compared against Table 2 · row DistilBERT · column SST-2 · PDF page 4.
    DistilBERT 77.0 51.3 91.3 85.5 59.9 86.9 56.1 89.2
  • Verified · exact · confidence 0.80 · 2026-05-15

    The paper reports 93.5 on accuracy for BERT-BASE.

    Compared against Table 6 · row BERT-BASE · column SST-2 · PDF page 6.
    BERT BASE 84.6 88.9 92.7 89.3 71.2 93.5
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 38.8 on accuracy for StableLM-2-1.6B-Chat.

    Compared against Table 3 · row StableLM-2-1.6B-Chat · column MMLU 5-shot · PDF page 9.
    38.8
    Stable LM 2 1.6B Technical Report · 2402.17834 · v0.1.0-stablelm2-winogrande-microslice
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 54.1 on accuracy for OLMoE-1B-7B-Instruct.

    Compared against Table 4 · row OLMoE-1B-7B-Instruct · column MMLU 5-shot · PDF page 8.
    54.1
    OLMoE: Open Mixture-of-Experts Language Models · 2409.02060 · v0.1.0-olmoe-winogrande-microslice
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 43.9 on accuracy for DeepSeek-R1-Distill-Qwen-1.5B.

    Compared against Table 5 · row DeepSeek-R1-Distill-Qwen-1.5B · column MMLU · PDF page 14.
    43.9
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 50.8 on accuracy for SmolLM2-1.7B-Instruct.

    Compared against Table 4 · row SmolLM2-1.7B-Instruct · column MMLU 5-shot · PDF page 8.
    50.8
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 47.5 on accuracy for Qwen2.5-0.5B-Instruct.

    Compared against Results · row Qwen2.5-0.5B-Instruct · column MMLU 5-shot · PDF page 8.
    47.5
    Qwen2.5 Technical Report · 2412.15115 · v0.1.0-qwen25-winogrande-microslice
  • Not checkable · unknown · confidence 0.80 · 2026-05-15

    The paper reports 64.11 on accuracy for Yi-6B.

    Compared against Table 9 · row Yi-6B · column MMLU 5-shot accuracy · PDF page 17.
    64.11
    Yi: Open Foundation Models by 01.AI · 2403.04652 · v0.1.0-yi-lambada-microslice
  • Not checkable · unknown · confidence 0.55 · 2026-05-15

    The paper reports 68.8 on accuracy for phi-3-mini.

    Compared against Table 2 · row phi-3-mini · column MMLU 5-shot · PDF page 4.
    68.8
  • Partially supported · proxy · confidence 0.50 · 2026-05-15

    The paper reports 75.2 on ImageNet-1k top-1 accuracy for V3-Large 1.0.

    Compared against Table 3 · row V3-Large 1.0 · column Top-1 · PDF page 6.
    V3-Large 1.0 75.2 219 5.4M 51 61 44
    Searching for MobileNetV3 · 1905.02244 · v0.1.0-mobilenet-v3-large-microslice
  • Partially supported · proxy · confidence 0.80 · 2026-05-15

    The paper reports 81.3 on accuracy for Mistral-7B.

    Compared against Table 2 · row Mistral-7B · column HellaSwag 0-shot · PDF page 4.
    Mistral-7B 81.3
    Mistral 7B · 2310.06825 · v0.1.0-mistral-hellaswag-microslice
  • Partially supported · proxy · confidence 0.55 · 2026-05-15

    The paper reports 53.46 on accuracy for MiniCPM-2.4B.

    Compared against Table 3 · row MiniCPM-2.4B · column MMLU · PDF page 14.
    MiniCPM-2.4B 51.13 51.07 53.46 50.00 47.31 53.83 10.24
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 71.4 on accuracy for Gemma-2B.

    Compared against Table 7 · row Gemma-2B · column HellaSwag 0-shot · PDF page 7.
    Gemma-2B 71.4
    Gemma: Open Models Based on Gemini Research and Technology · 2403.08295 · v0.1.0-gemma-hellaswag-microslice
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 59.2 on accuracy for TinyLlama 1.1B (3T).

    Compared against Table 2 · row TinyLlama 1.1B (3T) · column HellaSwag 0-shot · PDF page 4.
    TinyLlama 59.20
    TinyLlama: An Open-Source Small Language Model · 2401.02385 · v0.1.0-tinyllama-hellaswag-microslice
  • Partially supported · proxy · confidence 0.85 · 2026-05-14

    The paper reports 91.8 on accuracy for DeBERTa-v3-large.

    Compared against Table 1 · row DeBERTa-v3-large · column MNLI-m · PDF page 6.
    DeBERTa-v3-large 91.8
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 35.7 on accuracy for BLOOM-560m.

    Compared against Table 9 · row BLOOM-560m · column LAMBADA English · PDF page 14.
    BLOOM-560m 35.7
  • Not checkable · unknown · confidence 0.60 · 2026-05-14

    The paper reports 38.4 on pass@1 for Code Llama - Python 7B.

    Compared against Table 2 · row Code Llama - Python 7B · column HumanEval pass@1 · PDF page 7.
    38.4
    Code Llama: Open Foundation Models for Code · 2308.12950 · v0.1.0-codellama-pythonppl-microslice
  • Not checkable · unknown · confidence 0.60 · 2026-05-14

    The paper reports 34.8 on pass@1 for DeepSeek-Coder-1.3B-Base.

    Compared against Table 2 · row DeepSeek-Coder-1.3B-Base · column HumanEval pass@1 · PDF page 8.
    34.8
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 57.2 on accuracy for Pythia-1.4B (standard).

    Compared against Table 5 · row Pythia-1.4B (standard) · column LAMBADA accuracy · PDF page 19.
    57.2
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 58 on accuracy for OPT-1.3B.

    Compared against Table 3 · row OPT-1.3B · column LAMBADA zero-shot accuracy · PDF page 7.
    58.0
    OPT: Open Pre-trained Transformer Language Models · 2205.01068 · v0.1.0-opt-lambada-microslice
  • Partially supported · proxy · confidence 0.55 · 2026-05-14

    The paper reports 81.8 on accuracy for SwinV2-T (window 8, 256²).

    Compared against Table 5 · row SwinV2-T (window 8, 256²) · column ImageNet-1k val top-1 · PDF page 8.
    81.8
    Swin Transformer V2: Scaling Up Capacity and Resolution · 2111.09883 · v0.1.0-swinv2-imagenet-microslice
  • Partially supported · proxy · confidence 0.85 · 2026-05-14

    The paper reports 86.8 on accuracy for XLNet-Base.

    Compared against Table 6 · row XLNet-Base · column MNLI matched accuracy · PDF page 8.
    86.8
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 85.35 on spearman for SBERT-NLI-STSb-base (trained on NLI + STSb).

    Compared against Table 2 · row SBERT-NLI-STSb-base (trained on NLI + STSb) · column STSb test Spearman correlation · PDF page 6.
    85.35
    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · 1908.10084 · v0.2.0-sbert-stsb-test-3slice-table2
  • Not checkable · unknown · confidence 0.60 · 2026-05-14

    The paper reports 15.17 on pass@1 for StarCoderBase-1B.

    Compared against Table 6 · row StarCoderBase-1B · column HumanEval pass@1 · PDF page 12.
    15.17
    StarCoder: may the source be with you! · 2305.06161 · v0.1.0-starcoder-pythonppl-microslice
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 89.7 on accuracy for RoBERTa-base + LoRA (r=8).

    Compared against Table 2 · row RoBERTa-base + LoRA (r=8) · column MRPC dev accuracy · PDF page 6.
    RoBERTa base 89.7
    LoRA: Low-Rank Adaptation of Large Language Models · 2106.09685 · v0.1.0-lora-mrpc-microslice
  • Partially supported · proxy · confidence 0.55 · 2026-05-14

    The paper reports 76.4 on accuracy for ALIGN-base.

    Compared against Table 4 · row ALIGN-base · column ImageNet zero-shot top-1 · PDF page 6.
    ALIGN 76.4
  • Not checkable · unknown · confidence 0.80 · 2026-05-14

    The paper reports 50 on accuracy for Qwen2-0.5B.

    Compared against HF Open LLM Leaderboard (external) · row Qwen2-0.5B · column LAMBADA OpenAI acc.
    Qwen2
    Qwen2 Technical Report · 2407.10671 · v0.1.0-qwen2-lambada-microslice
  • Partially supported · proxy · confidence 0.50 · 2026-05-14

    The paper reports 77.1 on accuracy for EfficientNet-B0.

    Compared against Table 2 · row EfficientNet-B0 · column ImageNet val top-1 · PDF page 6.
    EfficientNet-B0 77.1
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 78.13 on accuracy for Falcon-7B.

    Compared against Table 17 · row Falcon-7B · column HellaSwag 0-shot · PDF page 32.
    Falcon-7B 78.13
    The Falcon Series of Open Language Models · 2311.16867 · v0.1.0-falcon-hellaswag-microslice
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 73.4 on accuracy for phi-1.5.

    Compared against Table 3 · row phi-1.5 · column WinoGrande 0-shot · PDF page 5.
    phi-1.5 73.4
    Textbooks Are All You Need II: phi-1.5 technical report · 2309.05463 · v0.1.0-phi-winogrande-microslice
  • Not checkable · unknown · confidence 0.50 · 2026-05-14

    The paper reports 25 on perplexity for BigBird-RoBERTa-base.

    Compared against Section 4 · row BigBird-RoBERTa-base · column WikiText-2 MLM perplexity (probe) · PDF page 6.
    BigBird
    Big Bird: Transformers for Longer Sequences · 2007.14062 · v0.1.0-bigbird-wikitext2-3slice6
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 45.1 on accuracy for FLAN-T5-Large.

    Compared against Table 6 · row FLAN-T5-Large · column MMLU 0-shot direct · PDF page 14.
    FLAN-T5-Large 45.1
    Scaling Instruction-Finetuned Language Models · 2210.11416 · v0.1.0-flan-t5-mmlu-microslice
  • Partially supported · proxy · confidence 0.60 · 2026-05-14

    The paper reports 43.7 on bleu4 for BLIP-2 OPT-2.7B.

    Compared against Table 3 · row BLIP-2 OPT-2.7B · column Flickr30k zero-shot BLEU-4 · PDF page 7.
    BLIP-2 OPT-2.7B 43.7
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 81.4 on accuracy for DINOv2-B/14.

    Compared against Table 5 · row DINOv2-B/14 · column ImageNet-1k k-NN top-1 · PDF page 11.
    ViT-B/14 81.4
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 76.1 on accuracy for DINO ViT-B/16.

    Compared against Table 2 · row DINO ViT-B/16 · column k-NN ImageNet-1k top-1 · PDF page 5.
    ViT-B/16 76.1
  • Not checkable · proxy · confidence n/a · 2026-05-14

    The paper reports 82.1 on accuracy for ConvNeXt-T.

    Compared against Table 1 · row ConvNeXt-T · column ImageNet val top-1 · PDF page 5.
    ConvNeXt-T 82.1
    A ConvNet for the 2020s · 2201.03545 · v0.1.0-convnext-imagenet-microslice
  • Partially supported · exact · confidence 0.60 · 2026-05-14

    The paper reports 40.9 on rougeLsum for BART.

    Compared against Table 3 · row BART · column R-L · PDF page 7.
    BART 44.16 21.28 40.90
  • Verified · exact · confidence 0.85 · 2026-05-14

    The paper reports 91.1 on accuracy for DeBERTalarge.

    Compared against Table 2 · row DeBERTalarge · column MNLI-m/mm · PDF page 7.
    DeBERTalarge 91.1/91.1 95.5/90.1 90.7/88.0 86.8 91.4/91.0 90.8 93.8
    DeBERTa: Decoding-enhanced BERT with Disentangled Attention · 2006.03654 · v0.1.0-deberta-v2-mnli-microslice
  • Partially supported · proxy · confidence 0.75 · 2026-05-14

    The paper reports 91.6 on accuracy for ViT-B/16 CIFAR10.

    Compared against Table 11 · row ViT-B/16 CIFAR10 · column zero-shot top-1 · PDF page 19.
    ViT-B/16 91.6
  • Partially supported · proxy · confidence 0.75 · 2026-05-14

    The paper reports 0.056 on wer for whisper-tiny.en.

    Compared against Table 8 · row whisper-tiny.en · column LibriSpeech test-clean WER · PDF page 25.
    tiny.en 5.6
    Robust Speech Recognition via Large-Scale Weak Supervision · 2212.04356 · v0.1.0-whisper-librispeech-3slice16
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 99 on accuracy for ViT-B/16.

    Compared against Table 5 · row ViT-B/16 · column CIFAR-10 transfer · PDF page 6.
    ViT-B/16 99.0
  • Partially supported · proxy · confidence 0.45 · 2026-05-14

    The paper reports 76 on accuracy for ResNet-50.

    Compared against Table 4 · row ResNet-50 · column ImageNet val top-1 · PDF page 7.
    ResNet-50 76.0
    Deep Residual Learning for Image Recognition · 1512.03385 · v0.1.0-resnet-microslice
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 88.5 on accuracy for ELECTRA-Base.

    Compared against Table 8 · row ELECTRA-Base · column MNLI · PDF page 8.
    ELECTRA-Base 88.5
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 82.4 on accuracy for T5-Small.

    Compared against Table 14 · row T5-Small · column MNLI-m · PDF page 30.
    T5-Small 82.4
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 89.3 on f1 for ALBERT-base.

    Compared against Table 2 · row ALBERT-base · column MRPC F1 · PDF page 7.
    ALBERT-base 89.3
  • Partially supported · proxy · confidence 0.80 · 2026-05-14

    The paper reports 84.3 on accuracy for MobileBERT.

    Compared against Table 4 · row MobileBERT · column MNLI-m · PDF page 6.
    MobileBERT 84.3
  • Not checkable · proxy · confidence n/a · 2026-05-14

    The paper reports 77.2 on accuracy for Llama 2 7B.

    Compared against Table 3 · row Llama 2 7B · column HellaSwag 0-shot · PDF page 9.
    Llama 2 7B 77.2
    Llama 2: Open Foundation and Fine-Tuned Chat Models · 2307.09288 · v0.1.0-llama2-hellaswag-microslice
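
Every "Compared against" line in the list above follows the same pattern: a source, a row, a column, and usually a PDF page, separated by " · ". For readers who want to work with these citations programmatically, the following is a minimal sketch of a parser. The pattern is inferred from the entries on this page, and the function and field names are hypothetical rather than part of any paperiswrong API.

```python
import re
from typing import Optional

# Pattern inferred from the entries above. The page part is optional
# because at least one citation (an external leaderboard) has no PDF page.
CITATION = re.compile(
    r"^Compared against (?P<source>.+?) · row (?P<row>.+?) · column (?P<column>.+?)"
    r"(?: · PDF page (?P<page>\d+))?\.$"
)


def parse_citation(line: str) -> Optional[dict]:
    """Split one 'Compared against' line into its fields, or return None."""
    match = CITATION.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the page number to an int when the citation has one.
    if fields["page"] is not None:
        fields["page"] = int(fields["page"])
    return fields


example = "Compared against Table 8 · row OLMo-2-7B-Instruct · column MMLU 5-shot · PDF page 16."
print(parse_citation(example))
```

Splitting the citation this way makes mismatches easy to spot, such as a cited MMLU column paired with a WinoGrande run, which is the kind of case the Not checkable label describes.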

Scope

A claim check is smaller than a paper verdict. It says whether a specific cited value or span matched the reproduction job under the stated protocol, not whether the whole paper is good or bad.