
Anchor probes

Anchor probes are the platform's own self-test. Each one pins a paper, a model, and a headline metric whose expected value is independently well-known. If a library version bump or Modal runtime regression breaks the platform, the anchors drift first — and the dashboard below stops being green.

Live status

Anchors: 10 (curated set)
Healthy: 0 (within tolerance)
Drift: 0 (alarm, investigate)
Fallback: 1 (dataset gated; expected)
No data: 1 (reproduction pending)

10 anchors are marked healthy, but their last reproduction is more than 14 days old; re-running them would refresh the signal.

1 anchor sits in fallback: the driver couldn't access the paper's cited dataset (typically because ImageNet-1k is HuggingFace-gated and our Modal sandbox is hermetic per PRD §18.X.1, so no HF_TOKEN is available). The driver ran on a proxy dataset instead (imagenette / CIFAR-10) and correctly stamped protocol_match=proxy. This is not a platform regression; it is a structural limitation with an honest signal.

Anchor table

Each row is a single anchor. Click the paper link for the full verdict and evidence; click citation to see the driver's structurally-typed claim citation that the validator gates against.

| Model | arXiv | Section / Row | Paper | Measured | Δ | ± | State | Last run |
|---|---|---|---|---|---|---|---|---|
| BERT-base | 1810.04805 | Table 6 · BERT-BASE | 93.5 | 94.7 | +1.2 | 3.0 | healthy (stale) | 2026-05-15 |
| RoBERTa-base | 1907.11692 | Table 8 · RoBERTa | 90.2 | 90.5 | +0.3 | 3.0 | healthy (stale) | 2026-05-15 |
| DistilBERT | 1910.01108 | Table 2 · DistilBERT | 91.3 | 91.5 | +0.2 | 3.0 | healthy (stale) | 2026-05-15 |
| ELECTRA-base | 2003.10555 | Table 1 · ELECTRA-Base | 88.5 | 87.7 | -0.8 | 3.0 | healthy (stale) | 2026-05-14 |
| ALBERT-base | 1909.11942 | Table 2 · ALBERT-base | 89.3 | 90.7 | +1.4 | 3.0 | healthy (stale) | 2026-05-14 |
| T5-small | 1910.10683 | Table 14 · T5-Small | 82.4 | 81.5 | -0.9 | 3.0 | healthy (stale) | 2026-05-14 |
| ViT-B/16 | 2010.11929 | Table 5 · ViT-B/16 | 99.0 | 97.7 | -1.3 | 1.5 | healthy (stale) | 2026-05-14 |
| CLIP ViT-B/16 | 2103.00020 | Table 11 · ViT-B/16 | 91.6 | 88.7 | -2.9 | 3.0 | healthy (stale) | 2026-05-14 |
| ResNet-50 | 1512.03385 | Table 3 · ResNet-50 | 76.0 | 42.0 | -34.0 | 3.0 | fallback | 2026-05-14 |
| ConvNeXt-tiny | 2201.03545 | Table 1 · ConvNeXt-T | 82.1 | | | 3.0 | no data | 2026-05-14 |

The anchor contract

An anchor passes when the latest reproduction's measured value sits inside its tolerance band. The tolerance is set per-anchor to absorb known sources of variance (microslice vs full benchmark, community fine-tune vs paper checkpoint, prompt-template sensitivity for zero-shot tasks). A drift alarm is the platform telling itself: something has changed since this anchor was last calibrated.
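The pass rule, together with the fallback and staleness states described under Live status, can be sketched as a small classifier. This is a minimal illustration, not the platform's code: the type names, field names, and the `classifyAnchor` function are hypothetical, and treating a delta exactly equal to the tolerance as passing is an assumption.

```typescript
// Hypothetical shapes: field names are illustrative, not the platform's actual schema.
type ProtocolMatch = "exact" | "proxy";
type AnchorState = "healthy" | "drift" | "fallback" | "no data";

interface Reproduction {
  measured: number | null;      // null when the run produced no value
  protocolMatch: ProtocolMatch; // "proxy" when the driver ran on a substitute dataset
  ranAt: Date;
}

interface AnchorStatus {
  state: AnchorState;
  delta: number | null; // measured minus the paper's value
  stale: boolean;       // true when the last run is older than the staleness window
}

const STALE_AFTER_DAYS = 14; // the 14-day window from the live-status note
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function classifyAnchor(
  paperValue: number,
  tolerance: number,
  run: Reproduction | null,
  now: Date,
): AnchorStatus {
  // No reproduction, or a reproduction without a value, means there is nothing to judge.
  if (run === null || run.measured === null) {
    return { state: "no data", delta: null, stale: false };
  }
  const delta = run.measured - paperValue;
  const stale = now.getTime() - run.ranAt.getTime() > STALE_AFTER_DAYS * MS_PER_DAY;
  // A proxy run is reported as fallback regardless of delta: it measured a
  // different dataset, so a large gap is not evidence of drift.
  if (run.protocolMatch === "proxy") {
    return { state: "fallback", delta, stale };
  }
  // Assumption: the band is inclusive, so |delta| == tolerance still passes.
  const state: AnchorState = Math.abs(delta) <= tolerance ? "healthy" : "drift";
  return { state, delta, stale };
}

// Usage with the BERT-base row from the table (paper 93.5, measured 94.7, ± 3.0).
const bertStatus = classifyAnchor(
  93.5,
  3.0,
  { measured: 94.7, protocolMatch: "exact", ranAt: new Date("2026-05-15T00:00:00Z") },
  new Date("2026-06-01T00:00:00Z"),
);
```

Checking `protocolMatch` before the tolerance band is what keeps the ResNet-50 row (Δ of -34.0) in fallback rather than raising a drift alarm.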

Anchors are curated, not auto-generated. Adding one requires editing src/lib/anchors.ts with a written rationale, and the corresponding driver's CLAIM_CITATION must match the anchor's expected value exactly. The test suite in tests/unit/anchors.test.ts enforces structural soundness of the list.
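As a rough sketch of what such an entry and its checks might look like: the `Anchor` and `ClaimCitation` shapes, their field names, and both helper functions below are hypothetical, since the actual contents of src/lib/anchors.ts and the drivers' CLAIM_CITATION objects are not shown here. The numeric values come from the BERT-base row of the anchor table; the rationale string is invented for illustration.

```typescript
// Hypothetical shapes: every field name here is an assumption.
interface Anchor {
  model: string;
  arxivId: string;
  sectionRow: string;    // e.g. "Table 6 · BERT-BASE"
  expectedValue: number; // the paper's headline metric
  tolerance: number;     // half-width of the accepted band
  rationale: string;     // written justification for adding the anchor
}

interface ClaimCitation {
  arxivId: string;
  sectionRow: string;
  expectedValue: number;
}

// Compare a driver's citation with its anchor; an empty result means they agree.
function citationMismatches(anchor: Anchor, citation: ClaimCitation): string[] {
  const problems: string[] = [];
  if (citation.arxivId !== anchor.arxivId) problems.push("arxivId");
  if (citation.sectionRow !== anchor.sectionRow) problems.push("sectionRow");
  // Strict equality on purpose: the contract asks for an exact match.
  if (citation.expectedValue !== anchor.expectedValue) problems.push("expectedValue");
  return problems;
}

// Structural checks on the curated list, in the spirit of a unit test.
function anchorListProblems(anchors: Anchor[]): string[] {
  const problems: string[] = [];
  const seen: string[] = [];
  for (const a of anchors) {
    if (seen.indexOf(a.model) !== -1) problems.push(`${a.model}: duplicate entry`);
    seen.push(a.model);
    if (!(a.tolerance > 0)) problems.push(`${a.model}: tolerance must be positive`);
    if (a.rationale.trim() === "") problems.push(`${a.model}: missing rationale`);
  }
  return problems;
}

// Example entry; the rationale text is made up for this sketch.
const bertAnchor: Anchor = {
  model: "BERT-base",
  arxivId: "1810.04805",
  sectionRow: "Table 6 · BERT-BASE",
  expectedValue: 93.5,
  tolerance: 3.0,
  rationale: "Illustrative rationale: widely reproduced headline number.",
};
```

Returning a list of problems rather than a boolean lets a test report every mismatch at once instead of stopping at the first.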

Why this exists

On 2026-05-13 we retracted 7 public WRONG verdicts that turned out to be false positives; every one was a citation problem. The Verdict Validator (C1 structural + C2 textual gates) addresses the citation-side failure mode: a driver can no longer publish a WRONG against a made-up paper headline. Anchor probes address the orthogonal failure mode: a driver that is measuring the right thing, against the right paper headline, but now measuring it wrong because some part of the execution stack has drifted. Both gates together are the platform's answer to PRD §3 #1 (“evidence first”). The full self-correction quadrant adds two more surfaces: /legal/retractions (historical false positives) and /skipped (papers we deliberately did not reproduce).