Live status
10 anchors are marked healthy, but their last reproduction is more than 14 days old; re-running them would refresh the signal.
1 anchor sits in fallback: the driver couldn't access the paper's cited dataset (typically because ImageNet-1k is HuggingFace-gated and our Modal sandbox is hermetic per PRD §18.X.1, so no HF_TOKEN is available). The driver ran on a proxy dataset instead (imagenette / CIFAR-10) and correctly stamped protocol_match=proxy. This is not a platform regression; it is a structural limitation with an honest signal.
Anchor table
Each row is a single anchor. Click the paper link for the full verdict and evidence; click citation to see the driver's structurally typed claim citation that the validator gates against.
| Model | Section / Row | Paper | Measured | Δ | ± | State | Last run |
|---|---|---|---|---|---|---|---|
| BERT-base (arXiv:1810.04805) | Table 6·BERT-BASE | 93.5 | 94.7 | +1.2 | 3.0 | healthy (stale) | 2026-05-15 |
| RoBERTa-base (arXiv:1907.11692) | Table 8·RoBERTa | 90.2 | 90.5 | +0.3 | 3.0 | healthy (stale) | 2026-05-15 |
| DistilBERT (arXiv:1910.01108) | Table 2·DistilBERT | 91.3 | 91.5 | +0.2 | 3.0 | healthy (stale) | 2026-05-15 |
| ELECTRA-base (arXiv:2003.10555) | Table 1·ELECTRA-Base | 88.5 | 87.7 | -0.8 | 3.0 | healthy (stale) | 2026-05-14 |
| ALBERT-base (arXiv:1909.11942) | Table 2·ALBERT-base | 89.3 | 90.7 | +1.4 | 3.0 | healthy (stale) | 2026-05-14 |
| T5-small (arXiv:1910.10683) | Table 14·T5-Small | 82.4 | 81.5 | -0.9 | 3.0 | healthy (stale) | 2026-05-14 |
| ViT-B/16 (arXiv:2010.11929) | Table 5·ViT-B/16 | 99.0 | 97.7 | -1.3 | 1.5 | healthy (stale) | 2026-05-14 |
| CLIP ViT-B/16 (arXiv:2103.00020) | Table 11·ViT-B/16 | 91.6 | 88.7 | -2.9 | 3.0 | healthy (stale) | 2026-05-14 |
| ResNet-50 (arXiv:1512.03385) | Table 3·ResNet-50 | 76.0 | 42.0 | -34.0 | 3.0 | fallback | 2026-05-14 |
| ConvNeXt-tiny (arXiv:2201.03545) | Table 1·ConvNeXt-T | 82.1 | — | — | 3.0 | no data | 2026-05-14 |
The anchor contract
An anchor passes when the latest reproduction's measured value sits inside its tolerance band. The tolerance is set per-anchor to absorb known sources of variance (microslice vs full benchmark, community fine-tune vs paper checkpoint, prompt-template sensitivity for zero-shot tasks). A drift alarm is the platform telling itself: something has changed since this anchor was last calibrated.
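The pass rule above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the platform's implementation: the `AnchorReading` shape, the `checkAnchor` function, the `"exact"` value for the protocol match, and the choice of an inclusive band are all assumptions made for the example. Only the paper value, measured value, tolerance, and the `proxy` stamp come from the table and text above.

```typescript
// Hypothetical shape for illustration; the real types may differ.
interface AnchorReading {
  model: string;
  paperValue: number;               // headline number reported in the paper
  measured: number | null;          // latest reproduction; null when no run produced data
  tolerance: number;                // half-width of the band (the ± column)
  protocolMatch: "exact" | "proxy"; // "exact" is an assumed name; "proxy" is from the text
}

type AnchorCheck = "pass" | "drift" | "fallback" | "no-data";

function checkAnchor(r: AnchorReading): AnchorCheck {
  if (r.measured === null) return "no-data";
  // A proxy-dataset run is not comparable to the paper headline,
  // so it is reported as fallback rather than judged against the band.
  if (r.protocolMatch === "proxy") return "fallback";
  const delta = r.measured - r.paperValue;
  // Assumption: the band is inclusive at its edges.
  return Math.abs(delta) <= r.tolerance ? "pass" : "drift";
}

// Rows taken from the anchor table.
const vit: AnchorReading = {
  model: "ViT-B/16", paperValue: 99.0, measured: 97.7, tolerance: 1.5, protocolMatch: "exact",
};
const resnet: AnchorReading = {
  model: "ResNet-50", paperValue: 76.0, measured: 42.0, tolerance: 3.0, protocolMatch: "proxy",
};

console.log(checkAnchor(vit));    // "pass": |97.7 - 99.0| is inside 1.5
console.log(checkAnchor(resnet)); // "fallback": proxy run, band not applied
```

Checking the proxy stamp before the band is a design choice in this sketch: it keeps a dataset substitution from being reported as a drift alarm.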
Anchors are curated, not auto-generated. Adding one requires editing src/lib/anchors.ts with a written rationale, and the corresponding driver's CLAIM_CITATION must match the anchor's expected value exactly. The test suite in tests/unit/anchors.test.ts enforces structural soundness of the list.
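As a rough sketch of what such a structural check could look like: the `AnchorEntry` and `ClaimCitation` shapes and the `validateAnchor` helper below are hypothetical, invented for this example. The actual contents of src/lib/anchors.ts and tests/unit/anchors.test.ts are not shown here and may be organized quite differently.

```typescript
// Hypothetical shapes; field names are assumptions, not the real schema.
interface AnchorEntry {
  id: string;
  arxivId: string;
  expected: number;   // paper headline value the anchor is calibrated against
  tolerance: number;  // half-width of the tolerance band
  rationale: string;  // written justification required for every anchor
}

interface ClaimCitation {
  arxivId: string;
  value: number;
}

// Returns a list of problems; an empty list means the entry is structurally sound.
function validateAnchor(anchor: AnchorEntry, citation: ClaimCitation): string[] {
  const problems: string[] = [];
  if (anchor.rationale.trim().length === 0) {
    problems.push("missing rationale");
  }
  if (!(anchor.tolerance > 0)) {
    problems.push("tolerance must be positive");
  }
  if (citation.arxivId !== anchor.arxivId) {
    problems.push("citation points at a different paper");
  }
  // Exact match, as the contract requires: no rounding or tolerance here.
  if (citation.value !== anchor.expected) {
    problems.push("citation value does not match the anchor's expected value");
  }
  return problems;
}

const bertAnchor: AnchorEntry = {
  id: "bert-base",
  arxivId: "1810.04805",
  expected: 93.5,
  tolerance: 3.0,
  rationale: "Illustrative rationale text for this example.",
};

console.log(validateAnchor(bertAnchor, { arxivId: "1810.04805", value: 93.5 })); // []
```

Returning a list of problems rather than throwing on the first one lets a test report every defect in a new anchor entry at once.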
Why this exists
On 2026-05-13 we retracted 7 public WRONG verdicts that turned out to be false positives; every one was a citation problem. The Verdict Validator (C1 structural + C2 textual gates) addresses the citation-side failure mode: a driver can no longer publish a WRONG against a made-up paper headline. Anchor probes address the orthogonal failure mode: a driver that is measuring the right thing, against the right paper headline, but now measuring it wrong because some part of the execution stack has drifted. Both gates together are the platform's answer to PRD §3 #1 (“evidence first”). The full self-correction quadrant adds two more surfaces: /legal/retractions (historical false positives) and /skipped (papers we deliberately did not reproduce).