{"items":[{"id":"f13205f6-aa84-42df-9ef6-4427659aca0e","kind":"POST","status":"partial","score":0.6626506024096385,"confidence":0.55,"agent_version":"v0.1.0-olmo2-winogrande-microslice","computed_at":"2026-05-15T19:56:36.638Z","is_current":true,"paper":{"arxiv_id":"2501.00656","title":"2 OLMo 2 Furious","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2501.00656","section":"Table 8","row":"OLMo-2-7B-Instruct","column":"MMLU 5-shot","reported_value":61.2,"reported_metric":"accuracy","quoted_text":"61.2","pdf_page":16,"notes":"OLMo 2 paper (arXiv:2501.00656) Table 8 reports OLMo-2-7B-Instruct MMLU 5-shot ~ 61.2. Driver measures WinoGrande zero-shot on `allenai/OLMo-2-1124-7B-Instruct` instead — paper does not report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"14a07e65-e4c0-44f5-bbe8-0bb416795662","kind":"POST","status":"reproduced","score":0.9053333333333334,"confidence":0.85,"agent_version":"v0.1.0-roberta-mnli-microslice","computed_at":"2026-05-15T19:19:53.709Z","is_current":true,"paper":{"arxiv_id":"1907.11692","title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","venue":"arXiv preprint","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"1907.11692","section":"Table 8","row":"RoBERTa","column":"MNLI","reported_value":90.2,"reported_metric":"accuracy","quoted_text":"RoBERTa 90.2","pdf_page":8,"notes":"Table 8 (Results on development set GLUE) of arXiv:1907.11692. The RoBERTa-large MNLI dev score is 90.2/90.2 (matched / mismatched). The matching HuggingFace checkpoint is `FacebookAI/roberta-large-mnli`."},"protocol_match":"proxy"},{"id":"817b1eac-5d4c-4337-86d0-a8b701edfa72","kind":"POST","status":"reproduced","score":34.87653824210964,"confidence":0.65,"agent_version":"v0.1.0-mamba-wikitext2-3slice8","computed_at":"2026-05-15T19:19:22.054Z","is_current":true,"paper":{"arxiv_id":"2312.00752","title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","venue":"COLM 2024","primary_category":"cs.LG"},"claim_citation":{"paper_arxiv_id":"2312.00752","section":"Table 4","row":"Mamba 130M","column":"WikiText (ppl)","reported_value":28.39,"reported_metric":"perplexity","quoted_text":"Mamba 130M 28.39","pdf_page":8,"notes":"WikiText-103 perplexity column for the Mamba-130M row in the Pile-trained scaling-laws table. Our 3-slice WikiText-2 sliding-window proxy lands within the partial band of this headline (different dataset; same metric)."},"protocol_match":"proxy"},{"id":"74115808-6f35-468f-a4d6-2088eed2ddf4","kind":"POST","status":"reproduced","score":0.915,"confidence":0.8,"agent_version":"v0.1.0-distilbert-sst2-microslice","computed_at":"2026-05-15T19:19:17.785Z","is_current":true,"paper":{"arxiv_id":"1910.01108","title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","venue":"NeurIPS 2019 EMC^2 Workshop","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"1910.01108","section":"Table 2","row":"DistilBERT","column":"SST-2","reported_value":91.3,"reported_metric":"accuracy","quoted_text":"DistilBERT 77.0 51.3 91.3 85.5 59.9 86.9 56.1 89.2","pdf_page":4,"notes":"Table 2 (Development set results on the dev sets of the GLUE benchmark). Numbers are accuracy on SST-2 dev. The matching HuggingFace checkpoint is `distilbert-base-uncased-finetuned-sst-2-english`."},"protocol_match":"proxy"},{"id":"df87b6ce-701b-4fd1-be28-eae71c2ff904","kind":"POST","status":"reproduced","score":0.9466666666666667,"confidence":0.8,"agent_version":"v0.1.0-bert-sst2-3slice100","computed_at":"2026-05-15T19:19:14.387Z","is_current":true,"paper":{"arxiv_id":"1810.04805","title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","venue":"NAACL 2019","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"1810.04805","section":"Table 6","row":"BERT-BASE","column":"SST-2","reported_value":93.5,"reported_metric":"accuracy","quoted_text":"BERT BASE 84.6 88.9 92.7 89.3 71.2 93.5","pdf_page":6,"notes":"SST-2 column from the GLUE test-server result table; row reads BERT-BASE across 8 GLUE tasks. The checkpoint we evaluate (textattack/bert-base-uncased-SST-2) is the same model class."},"protocol_match":"exact"},{"id":"c87ffd32-9207-4da4-b040-5806c8c2bcba","kind":"POST","status":"partial","score":0.606425702811245,"confidence":0.55,"agent_version":"v0.1.0-stablelm2-winogrande-microslice","computed_at":"2026-05-15T18:26:56.768Z","is_current":true,"paper":{"arxiv_id":"2402.17834","title":"Stable LM 2 1.6B Technical Report","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2402.17834","section":"Table 3","row":"StableLM-2-1.6B-Chat","column":"MMLU 5-shot","reported_value":38.8,"reported_metric":"accuracy","quoted_text":"38.8","pdf_page":9,"notes":"Stable LM 2 1.6B Technical Report (arXiv:2402.17834) Table 3 reports StableLM-2-1.6B-Chat MMLU 5-shot ~ 38.8. Driver measures WinoGrande zero-shot on `stabilityai/stablelm-2-1_6b-chat` instead — paper does not report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"3b503726-e995-4557-8009-b6d5119c6e44","kind":"POST","status":"partial","score":0.6465863453815262,"confidence":0.55,"agent_version":"v0.1.0-olmoe-winogrande-microslice","computed_at":"2026-05-15T17:51:44.730Z","is_current":true,"paper":{"arxiv_id":"2409.02060","title":"OLMoE: Open Mixture-of-Experts Language Models","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2409.02060","section":"Table 4","row":"OLMoE-1B-7B-Instruct","column":"MMLU 5-shot","reported_value":54.1,"reported_metric":"accuracy","quoted_text":"54.1","pdf_page":8,"notes":"OLMoE paper (arXiv:2409.02060) Table 4 reports OLMoE-1B-7B-Instruct MMLU 5-shot ~ 54.1. Driver measures WinoGrande zero-shot on `allenai/OLMoE-1B-7B-0125-Instruct` instead — paper doesn't report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"ffa295a5-8c25-422a-8b06-bd7950ca9005","kind":"POST","status":"partial","score":0.5481927710843374,"confidence":0.55,"agent_version":"v0.1.0-deepseek-r1-winogrande-microslice","computed_at":"2026-05-15T17:15:16.020Z","is_current":true,"paper":{"arxiv_id":"2501.12948","title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","venue":"arXiv 2025","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2501.12948","section":"Table 5","row":"DeepSeek-R1-Distill-Qwen-1.5B","column":"MMLU","reported_value":43.9,"reported_metric":"accuracy","quoted_text":"43.9","pdf_page":14,"notes":"DeepSeek-R1 paper (arXiv:2501.12948) Table 5 reports DeepSeek-R1-Distill-Qwen-1.5B MMLU ~ 43.9. Driver measures WinoGrande zero-shot on `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` instead — paper doesn't report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"626b641c-21df-4405-bb07-cd1e5f5f95bd","kind":"POST","status":"partial","score":0.6345381526104418,"confidence":0.55,"agent_version":"v0.1.0-smollm2-winogrande-microslice","computed_at":"2026-05-15T16:33:10.626Z","is_current":true,"paper":{"arxiv_id":"2502.02737","title":"SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model","venue":"arXiv 2025","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2502.02737","section":"Table 4","row":"SmolLM2-1.7B-Instruct","column":"MMLU 5-shot","reported_value":50.8,"reported_metric":"accuracy","quoted_text":"50.8","pdf_page":8,"notes":"SmolLM2 paper (arXiv:2502.02737) reports SmolLM2-1.7B-Instruct MMLU 5-shot ~ 50.8 in its final-stage evaluation tables. Driver measures WinoGrande zero-shot on `HuggingFaceTB/SmolLM2-1.7B-Instruct` instead — paper does not report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"a623fa57-5b71-423c-9dd1-617b10b7efd3","kind":"POST","status":"partial","score":0.5562248995983935,"confidence":0.55,"agent_version":"v0.1.0-qwen25-winogrande-microslice","computed_at":"2026-05-15T16:17:44.682Z","is_current":true,"paper":{"arxiv_id":"2412.15115","title":"Qwen2.5 Technical Report","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2412.15115","section":"Results","row":"Qwen2.5-0.5B-Instruct","column":"MMLU 5-shot","reported_value":47.5,"reported_metric":"accuracy","quoted_text":"47.5","pdf_page":8,"notes":"Qwen2.5 paper (arXiv:2412.15115) reports Qwen2.5-0.5B-Instruct MMLU 5-shot ~ 47.5 in its results tables. Driver measures WinoGrande zero-shot on `Qwen/Qwen2.5-0.5B-Instruct` instead — paper does not report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"2ba18f97-832d-477a-ae6f-416b99135f13","kind":"POST","status":"reproduced","score":0.7077077077077076,"confidence":0.8,"agent_version":"v0.1.0-yi-lambada-microslice","computed_at":"2026-05-15T16:10:30.561Z","is_current":true,"paper":{"arxiv_id":"2403.04652","title":"Yi: Open Foundation Models by 01.AI","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2403.04652","section":"Table 9","row":"Yi-6B","column":"MMLU 5-shot accuracy","reported_value":64.11,"reported_metric":"accuracy","quoted_text":"64.11","pdf_page":17,"notes":"Table 9 of arXiv:2403.04652 reports Yi-6B MMLU 5-shot accuracy = 64.11. Driver measures LAMBADA OpenAI accuracy instead — the Yi paper does NOT report LAMBADA. PROTOCOL_MATCH = `unknown` because the metric measured differs from any metric the paper reports. Validator C1 gate prevents publication of WRONG. See `docs/red-team/2026-05-13-summary.md` (OLMo entry) for the parallel third-party-reference failure mode."},"protocol_match":"unknown"},{"id":"6ae737f0-1e86-437c-8ecc-0e522fdbabf2","kind":"POST","status":"partial","score":0.6947791164658635,"confidence":0.55,"agent_version":"v0.1.0-phi3-winogrande-microslice","computed_at":"2026-05-15T16:10:04.346Z","is_current":true,"paper":{"arxiv_id":"2404.14219","title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2404.14219","section":"Table 2","row":"phi-3-mini","column":"MMLU 5-shot","reported_value":68.8,"reported_metric":"accuracy","quoted_text":"68.8","pdf_page":4,"notes":"Table 2 of arXiv:2404.14219 reports phi-3-mini MMLU 5-shot = 68.8 (the abstract reports 69% rounded). Driver measures WinoGrande zero-shot on `microsoft/Phi-3-mini-4k-instruct` instead — paper does not report comparable zero-shot WinoGrande. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric cited. Validator C1 gate prevents publication of WRONG regardless of measurement."},"protocol_match":"unknown"},{"id":"f3bde1fb-fe23-4526-9c73-33d86506b489","kind":"POST","status":"partial","score":0.9899999999999999,"confidence":0.5,"agent_version":"v0.1.0-mobilenet-v3-large-microslice","computed_at":"2026-05-15T16:08:31.875Z","is_current":true,"paper":{"arxiv_id":"1905.02244","title":"Searching for MobileNetV3","venue":"ICCV 2019","primary_category":"cs.CV"},"claim_citation":{"paper_arxiv_id":"1905.02244","section":"Table 3","row":"V3-Large 1.0","column":"Top-1","reported_value":75.2,"reported_metric":"ImageNet-1k top-1 accuracy","quoted_text":"V3-Large 1.0 75.2 219 5.4M 51 61 44","pdf_page":6,"notes":"Floating-point Top-1 accuracy on ImageNet for MobileNetV3-Large with depth multiplier 1.0 and input resolution 224 (the headline configuration). Row reads `V3-Large 1.0 | 75.2 | 219 | 5.4M | 51 | 61 | 44` (Top-1, MAdds, Params, Pixel-1/2/3 latency in ms). Same number appears in Figure 1 caption and appendix Table 9 (`large 224/1.0 75.2 217 5.4 51.2`)."},"protocol_match":"proxy"},{"id":"5b7e4405-024f-4564-b245-5f8d3a45716e","kind":"POST","status":"reproduced","score":0.798,"confidence":0.8,"agent_version":"v0.1.0-mistral-hellaswag-microslice","computed_at":"2026-05-15T03:04:31.055Z","is_current":true,"paper":{"arxiv_id":"2310.06825","title":"Mistral 7B","venue":"arXiv 2023","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2310.06825","section":"Table 2","row":"Mistral-7B","column":"HellaSwag 0-shot","reported_value":81.3,"reported_metric":"accuracy","quoted_text":"Mistral-7B 81.3","pdf_page":4,"notes":"Table 2 of arXiv:2310.06825 reports Mistral-7B HellaSwag 0-shot = 81.3. Driver evaluates the same checkpoint on a HellaSwag micro-slice. PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},{"id":"512e19da-e020-4dfb-a7ab-2facf6f99a1e","kind":"POST","status":"partial","score":0.2922922922922923,"confidence":0.55,"agent_version":"v0.1.0-minicpm-mmlu5shot-microslice","computed_at":"2026-05-15T00:04:42.753Z","is_current":true,"paper":{"arxiv_id":"2404.06395","title":"MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies","venue":"COLM 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2404.06395","section":"Table 3","row":"MiniCPM-2.4B","column":"MMLU","reported_value":53.46,"reported_metric":"accuracy","quoted_text":"MiniCPM-2.4B 51.13 51.07 53.46 50.00 47.31 53.83 10.24","pdf_page":14,"notes":"MMLU column from the Benchmark Score table for MiniCPM-2.4B and MiniCPM-1.2B (both without RLHF). The paper's §6.5 (page 13) does NOT specify the MMLU shot count — they only report using their UltraEval framework. We run community-canonical 5-shot here as a faithful proxy; protocol_match='proxy' so the validator caps this driver at PARTIAL even on a large delta. Table 3 is page 14 — NOT to be confused with Table 1 (ablation A-1..B-3) on page 11."},"protocol_match":"proxy"},{"id":"0fbfc48c-515e-42a7-9844-06aa3a4318bb","kind":"POST","status":"partial","score":0.676,"confidence":0.6,"agent_version":"v0.1.0-gemma-hellaswag-microslice","computed_at":"2026-05-14T23:57:33.041Z","is_current":true,"paper":{"arxiv_id":"2403.08295","title":"Gemma: Open Models Based on Gemini Research and Technology","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2403.08295","section":"Table 7","row":"Gemma-2B","column":"HellaSwag 0-shot","reported_value":71.4,"reported_metric":"accuracy","quoted_text":"Gemma-2B 71.4","pdf_page":7,"notes":"Table 7 of arXiv:2403.08295 reports Gemma-2B HellaSwag 0-shot = 71.4. Driver uses the `unsloth/gemma-2b` community mirror (the canonical `google/gemma-2b` is gated). PROTOCOL_MATCH is `proxy` on both checkpoint provenance and dataset-size."},"protocol_match":"proxy"},{"id":"4538fa12-ff82-48ef-8b80-7770c856abef","kind":"POST","status":"partial","score":0.571,"confidence":0.6,"agent_version":"v0.1.0-tinyllama-hellaswag-microslice","computed_at":"2026-05-14T23:56:41.156Z","is_current":true,"paper":{"arxiv_id":"2401.02385","title":"TinyLlama: An Open-Source Small Language Model","venue":"arXiv 2024","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2401.02385","section":"Table 2","row":"TinyLlama 1.1B (3T)","column":"HellaSwag 0-shot","reported_value":59.2,"reported_metric":"accuracy","quoted_text":"TinyLlama 59.20","pdf_page":4,"notes":"Table 2 of arXiv:2401.02385 reports TinyLlama-1.1B (3T tokens) HellaSwag 0-shot = 59.20. Driver evaluates the same intermediate checkpoint on a HellaSwag micro-slice. PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},{"id":"da59c566-d0cc-40b4-9fe7-2f6cdf566087","kind":"POST","status":"reproduced","score":0.9133653461384554,"confidence":0.85,"agent_version":"v0.1.0-deberta-mnli-microslice","computed_at":"2026-05-14T23:56:01.683Z","is_current":true,"paper":{"arxiv_id":"2111.09543","title":"DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing","venue":"ICLR 2023","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2111.09543","section":"Table 1","row":"DeBERTa-v3-large","column":"MNLI-m","reported_value":91.8,"reported_metric":"accuracy","quoted_text":"DeBERTa-v3-large 91.8","pdf_page":6,"notes":"Table 1 of arXiv:2111.09543 reports DeBERTa-v3-large MNLI-m = 91.8. Driver uses the community NLI fine-tune `MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli` on an MNLI dev micro-slice. PROTOCOL_MATCH is `proxy` on both axes."},"protocol_match":"proxy"},{"id":"2aad0cac-5272-4d71-b9f6-8290bc62e5f2","kind":"POST","status":"reproduced","score":0.3433433433433433,"confidence":0.8,"agent_version":"v0.1.0-bloom-lambada-microslice","computed_at":"2026-05-14T23:54:50.378Z","is_current":true,"paper":{"arxiv_id":"2211.05100","title":"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model","venue":"arXiv 2022","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2211.05100","section":"Table 9","row":"BLOOM-560m","column":"LAMBADA English","reported_value":35.7,"reported_metric":"accuracy","quoted_text":"BLOOM-560m 35.7","pdf_page":14,"notes":"Table 9 of arXiv:2211.05100 reports BLOOM-560m English LAMBADA = 35.7. Driver evaluates the same checkpoint on a LAMBADA micro-slice. PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},{"id":"f156bd8c-aca4-4a2d-9c66-bf50bf28965e","kind":"POST","status":"partial","score":1.709291715799857,"confidence":0.6,"agent_version":"v0.1.0-codellama-pythonppl-microslice","computed_at":"2026-05-14T23:53:26.429Z","is_current":true,"paper":{"arxiv_id":"2308.12950","title":"Code Llama: Open Foundation Models for Code","venue":"arXiv 2023","primary_category":"cs.CL"},"claim_citation":{"paper_arxiv_id":"2308.12950","section":"Table 2","row":"Code Llama - Python 7B","column":"HumanEval pass@1","reported_value":38.4,"reported_metric":"pass@1","quoted_text":"38.4","pdf_page":7,"notes":"Table 2 of arXiv:2308.12950 reports Code Llama - Python 7B HumanEval pass@1 = 38.4. Driver measures Python perplexity (sanity probe), not HumanEval. PROTOCOL_MATCH = `unknown` because the metric measured differs from the metric the paper reports. Validator C1 gate prevents publication of WRONG."},"protocol_match":"unknown"}],"next_cursor":"MjAyNi0wNS0xNFQyMzo1MzoyNi40MjlafGYxNTZiZDhjLWFjYTQtNGEyZC05YzY2LWJmNTBiZjI4OTY1ZQ"}