Recent activity
- 2026-05-15 19:56Z PARTIAL unknown v0.1.0-olmo2-winogrande-microslice
- 2026-05-15 19:19Z REPRODUCED proxy v0.1.0-roberta-mnli-microslice
- 2026-05-15 19:19Z REPRODUCED proxy v0.1.0-mamba-wikitext2-3slice8
- 2026-05-15 19:19Z REPRODUCED proxy v0.1.0-distilbert-sst2-microslice
- 2026-05-15 19:19Z REPRODUCED exact v0.1.0-bert-sst2-3slice100
- 2026-05-15 18:26Z PARTIAL unknown v0.1.0-stablelm2-winogrande-microslice
- 2026-05-15 17:51Z PARTIAL unknown v0.1.0-olmoe-winogrande-microslice
- 2026-05-15 17:15Z PARTIAL unknown v0.1.0-deepseek-r1-winogrande-microslice
- 2026-05-15 16:33Z PARTIAL unknown v0.1.0-smollm2-winogrande-microslice
- 2026-05-15 16:17Z PARTIAL unknown v0.1.0-qwen25-winogrande-microslice
- 2026-05-15 16:10Z REPRODUCED unknown v0.1.0-yi-lambada-microslice
- 2026-05-15 16:10Z PARTIAL unknown v0.1.0-phi3-winogrande-microslice
- 2026-05-15 16:08Z PARTIAL proxy v0.1.0-mobilenet-v3-large-microslice
- 2026-05-15 03:04Z REPRODUCED proxy v0.1.0-mistral-hellaswag-microslice
- 2026-05-15 00:04Z PARTIAL proxy v0.1.0-minicpm-mmlu5shot-microslice
- 2026-05-14 23:57Z PARTIAL proxy v0.1.0-gemma-hellaswag-microslice
- 2026-05-14 23:56Z PARTIAL proxy v0.1.0-tinyllama-hellaswag-microslice
- 2026-05-14 23:56Z REPRODUCED proxy v0.1.0-deberta-mnli-microslice
- 2026-05-14 23:54Z REPRODUCED proxy v0.1.0-bloom-lambada-microslice
- 2026-05-14 23:53Z PARTIAL unknown v0.1.0-codellama-pythonppl-microslice
- 2026-05-14 23:52Z PARTIAL unknown v0.1.0-deepseek-coder-pythonppl-microslice
- 2026-05-14 23:52Z PARTIAL proxy v0.1.0-pythia14-lambada-microslice
- 2026-05-14 23:48Z REPRODUCED proxy v0.1.0-opt-lambada-microslice
- 2026-05-14 23:48Z PARTIAL proxy v0.1.0-swinv2-imagenet-microslice
- 2026-05-14 23:48Z REPRODUCED proxy v0.1.0-xlnet-mnli-microslice
- 2026-05-14 23:48Z REPRODUCED proxy v0.2.0-sbert-stsb-test-3slice-table2
- 2026-05-14 23:48Z PARTIAL unknown v0.1.0-starcoder-pythonppl-microslice
- 2026-05-14 23:40Z REPRODUCED proxy v0.1.0-lora-mrpc-microslice
- 2026-05-14 23:39Z PARTIAL proxy v0.1.0b-align-imagenette-3slice100
- 2026-05-14 23:39Z REPRODUCED unknown v0.1.0-qwen2-lambada-microslice
- 2026-05-13 00:00Z retracted
The driver scored rougeL (sentence-level longest common subsequence) and compared it against the paper's Table 3 R-L of 40.90, which is in fact rougeLsum (summary-level longest common subsequence per Lin 2004). The two metrics measure different things. Under the correct metric variant the model lands within paper bounds.
- 2026-05-13 00:00Z retracted
The driver decoded with greedy search on n = 20 examples with no prompt prefix. The paper uses beam search (width 5) with the prompt prefix "a photo of" on the full Flickr30k Karpathy split. The retraction replaces the WRONG row with a PARTIAL captured under the paper's published decoding protocol.
- 2026-05-13 00:00Z retracted
The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot using a 200-way candidate pool and a mean-pooled cosine similarity score. The paper's Table 4 Python MRR of 0.840 comes from a CodeBERT fine-tuned on the CodeSearchNet train split, evaluated against a 1000-way candidate pool, with a [CLS] classifier head. The retraction script disables the reproduction until the protocol matches the paper's.
- 2026-05-13 00:00Z retracted
The driver compared a measured LAMBADA OpenAI accuracy of about 0.5576 against a 0.617 headline it attributed to Table 6 of the OLMo paper. Table 6 is in fact a carbon-emissions table; the string "lambada" appears zero times in the published PDF, and the only OLMo-1B zero-shot table (Table 3) does not include LAMBADA among its eight tasks. The reproduction has been converted to a not-attempted stub so it cannot re-publish.
- 2026-05-13 00:00Z retracted
The driver evaluated the checkpoint sentence-transformers/bert-base-nli-stsb-mean-tokens against Table 1's unsupervised SBERT-NLI-large row (Spearman 79.23). The matching baseline is Table 2's supervised-base STS-b row (85.35). Once the comparison is repointed to the right row, the model lands within 0.5 points of the paper's number.
- 2026-05-13 00:00Z retracted
The driver invented a 0.730 PIQA target headline for Phi-1 (described internally as the "midpoint of the expected band per the mandate"). The Phi-1 paper reports HumanEval pass@1 = 50.6 and does not report a PIQA score at all. The retraction replaces the WRONG row with a PARTIAL on the actual HumanEval headline.
- 2026-05-13 00:00Z retracted
The driver loaded joeddav/xlm-roberta-large-xnli, a checkpoint that is fine-tuned on XNLI-train plus MNLI. The 89.1 English-XNLI headline in the paper is a zero-shot transfer number from a model fine-tuned on English MNLI only. The driver was measuring a different model; the rolled-back row records the correct zero-shot protocol as a PARTIAL.
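Several of the retractions above come down to a protocol detail that changes the number being compared. As one illustration, the CodeBERT entry notes a 200-way versus a 1000-way candidate pool. The sketch below uses made-up toy scores (not data from any paper or driver) and hypothetical helper names to show that the same scorer yields a higher reciprocal rank when fewer distractors are present, so MRR values from different pool sizes are not directly comparable.

```python
from typing import Sequence, Tuple


def reciprocal_rank(scores: Sequence[float], gold_index: int) -> float:
    """Return 1 / rank of the gold candidate (rank 1 = highest score).

    Ties are counted against the gold candidate, a conservative choice
    made for this sketch only.
    """
    gold = scores[gold_index]
    better_or_equal = sum(
        1 for i, s in enumerate(scores) if i != gold_index and s >= gold
    )
    return 1.0 / (1 + better_or_equal)


def mean_reciprocal_rank(queries: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Average the reciprocal rank over (scores, gold_index) pairs."""
    return sum(reciprocal_rank(s, g) for s, g in queries) / len(queries)


# Toy similarity scores for one query; the gold candidate sits at index 0.
full_pool = [0.70, 0.90, 0.40, 0.80, 0.75, 0.10]
small_pool = full_pool[:3]  # same scorer, fewer distractors

print(reciprocal_rank(small_pool, 0))  # gold is 2nd of 3 -> 0.5
print(reciprocal_rank(full_pool, 0))   # gold is 4th of 6 -> 0.25
```

Shrinking the pool can only remove distractors that outrank the gold candidate, so the small-pool value is never lower than the full-pool value for the same scores.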
Reading the feed
- Each entry pairs an ISO-8601 timestamp with a status chip (REPRODUCED / PARTIAL / PENDING / NOT_ATTEMPTED / RETRACTED) and, for verdicts, the protocol-match tier the driver declared (exact / proxy / unknown).
- Only PROTOCOL_MATCH=exact drivers can publish a public WRONG verdict. The validator C1 gate auto-downgrades every other tier; see /validator.
- Each entry's clickable target is the paper page, where the structured claim citation and the full reproduction evidence (stdout offsets, signed manifest, agent version) live.
- NOT_ATTEMPTED and OUT_OF_BUDGET rows also appear on the refusal-transparency surface at /skipped with the per-paper rationale (gated weights, methodological mismatch, or out-of-budget).
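The entry layout and the gating rule described above can be sketched as follows. This is a minimal illustration, not the site's actual parser or validator: the regular expression, the `FeedEntry` fields, and the function names are assumptions, and the downgrade target (PARTIAL) is inferred from the retraction notes, which describe WRONG rows being replaced with PARTIAL rows.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout: "- <timestamp> <STATUS> <tier> <release slug>".
# Whitespace between fields is optional in this sketch.
ENTRY_RE = re.compile(
    r"^-\s*(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}Z)\s*"
    r"(?P<status>[A-Z_]+)\s*"
    r"(?P<tier>exact|proxy|unknown)\s*"
    r"(?P<slug>v\S+)$"
)


@dataclass(frozen=True)
class FeedEntry:
    timestamp: str
    status: str
    tier: str
    slug: str


def parse_entry(line: str) -> Optional[FeedEntry]:
    """Parse one verdict line; return None for lines without a tier."""
    match = ENTRY_RE.match(line.strip())
    return FeedEntry(**match.groupdict()) if match else None


def gate_verdict(status: str, tier: str) -> str:
    """Let only 'exact' drivers keep a WRONG verdict (hypothetical C1 gate)."""
    if status == "WRONG" and tier != "exact":
        return "PARTIAL"  # assumed downgrade target, not confirmed by the source
    return status


entry = parse_entry("- 2026-05-15 19:19Z REPRODUCED exact v0.1.0-bert-sst2-3slice100")
print(entry)
print(gate_verdict("WRONG", "proxy"))  # downgraded
print(gate_verdict("WRONG", "exact"))  # kept
```

Retraction lines carry no tier or slug, so `parse_entry` returns `None` for them; a fuller parser would need a second pattern for that shape.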