Anchors — paperiswrong

Live status

Anchors: curated set

Healthy: within tolerance

Drift: alarm; investigate

Fallback: dataset gated; expected

Unvalidated: run exists; score hidden

No result: pending or not attempted

No data: no current run

9 anchors have a current run but no valid claim receipt. Their run date and agent are shown for provenance; their unchecked measurement and delta remain hidden.

1 anchor has a current non-decisive run, such as pending or not attempted. This row did not fail receipt validation; it produced no public measurement to compare.
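The display rules in the two notes above can be sketched in TypeScript as follows. This is a minimal illustration, not the platform's code: the `AnchorRun` and `PublicRow` shapes, the `toPublicRow` function, and every field name are hypothetical. It assumes non-decisive runs are classified before the receipt check (since those rows "did not fail receipt validation"), and it leaves out the Fallback state.

```typescript
// Hypothetical shapes; none of these names come from the platform's source.
type RunOutcome = "measured" | "pending" | "not_attempted";

interface AnchorRun {
  date: string;             // run date, always shown for provenance
  agent: string;            // agent identifier, always shown for provenance
  outcome: RunOutcome;
  measured: number | null;  // raw measurement, if the run produced one
  hasValidReceipt: boolean; // did the claim receipt pass validation?
}

interface PublicRow {
  state: "validated" | "unvalidated" | "no result" | "no data";
  date: string | null;
  agent: string | null;
  measured: number | null;  // null means hidden or absent
  delta: number | null;     // measured minus paper value; null means hidden or absent
}

function toPublicRow(paperValue: number, run: AnchorRun | null): PublicRow {
  // No current run at all.
  if (run === null) {
    return { state: "no data", date: null, agent: null, measured: null, delta: null };
  }
  // Non-decisive run: nothing to compare, and not a receipt failure.
  if (run.outcome !== "measured" || run.measured === null) {
    return { state: "no result", date: run.date, agent: run.agent, measured: null, delta: null };
  }
  // A measurement exists but is unchecked: keep provenance, hide the numbers.
  if (!run.hasValidReceipt) {
    return { state: "unvalidated", date: run.date, agent: run.agent, measured: null, delta: null };
  }
  // Validated measurement: publish it together with its delta.
  return {
    state: "validated",
    date: run.date,
    agent: run.agent,
    measured: run.measured,
    delta: run.measured - paperValue,
  };
}
```

In this sketch a "validated" row would still need a tolerance comparison before it is labelled Healthy or Drift.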

Anchor table

Each row is a single anchor. Click the paper link for the full verdict and evidence; click citation to see the driver's structurally typed claim citation that the validator gates against.

Model	Section / Row	Paper	Measured	Δ	±	State	Last run
BERT-base (arXiv:1810.04805)	Table 6·BERT-BASE	93.5	—	—	3.0	unvalidated	2026-05-15 v0.1.0-bert-sst2-3slice100
RoBERTa-base (arXiv:1907.11692)	Table 8·RoBERTa	90.2	—	—	3.0	unvalidated	2026-05-15 v0.1.0-roberta-mnli-microslice
DistilBERT (arXiv:1910.01108)	Table 2·DistilBERT	91.3	—	—	3.0	unvalidated	2026-05-15 v0.1.0-distilbert-sst2-microslice
ELECTRA-base (arXiv:2003.10555)	Table 1·ELECTRA-Base	88.5	—	—	3.0	unvalidated	2026-05-14 v0.1.0-electra-mnli-microslice
ALBERT-base (arXiv:1909.11942)	Table 2·ALBERT-base	89.3	—	—	3.0	unvalidated	2026-05-14 v0.1.0-albert-mrpc-microslice
T5-small (arXiv:1910.10683)	Table 14·T5-Small	82.4	—	—	3.0	unvalidated	2026-05-14 v0.1.0-t5-mnli-microslice
ViT-B/16 (arXiv:2010.11929)	Table 5·ViT-B/16	99.0	—	—	1.5	unvalidated	2026-05-14 v0.1.0-vit-cifar10-3slice100
CLIP ViT-B/16 (arXiv:2103.00020)	Table 11·ViT-B/16	91.6	—	—	3.0	unvalidated	2026-05-14 v0.1.0-clip-cifar10-3slice100
ResNet-50 (arXiv:1512.03385)	Table 3·ResNet-50	76.0	—	—	3.0	unvalidated	2026-05-14 v0.1.0-resnet-microslice
ConvNeXt-tiny (arXiv:2201.03545)	Table 1·ConvNeXt-T	82.1	—	—	3.0	not attempted	2026-05-14 v0.1.0-convnext-imagenet-microslice

The anchor contract

An anchor passes when the latest reproduction's measured value sits inside its tolerance band. The tolerance is set per-anchor to absorb known sources of variance (microslice vs full benchmark, community fine-tune vs paper checkpoint, prompt-template sensitivity for zero-shot tasks). A drift alarm is the platform telling itself: something has changed since this anchor was last calibrated.
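As a rough sketch of the pass condition, the check below compares a measured value against the paper value and the per-anchor tolerance. It assumes the band is symmetric, expressed in the same units as the paper value, and inclusive at its edges; the page does not state any of these, and `checkAnchor` is a hypothetical name.

```typescript
type Verdict = "healthy" | "drift";

// Assumption: the band is [paperValue - tolerance, paperValue + tolerance], edges included.
function checkAnchor(
  paperValue: number,
  measured: number,
  tolerance: number,
): { delta: number; verdict: Verdict } {
  const delta = measured - paperValue; // signed, so the direction of drift is visible
  const verdict: Verdict = Math.abs(delta) <= tolerance ? "healthy" : "drift";
  return { delta, verdict };
}

// The measured values below are made up for illustration; only the paper values
// and tolerances (93.5 ± 3.0 and 99.0 ± 1.5) appear in the anchor table.
const insideBand = checkAnchor(93.5, 91.0, 3.0);  // delta -2.5, inside the band
const outsideBand = checkAnchor(99.0, 97.0, 1.5); // delta -2.0, outside the band
```

The tighter 1.5 band means a two-point shortfall that would pass under a 3.0 tolerance raises a drift alarm instead.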

Anchors are curated, not auto-generated. Adding one requires editing src/lib/anchors.ts with a written rationale, and the corresponding driver's CLAIM_CITATION must match the anchor's expected value exactly. The test suite in tests/unit/anchors.test.ts enforces structural soundness of the list.
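To make the curation rule concrete, here is a hedged sketch of what an anchor entry and its soundness check might look like. The real shapes in src/lib/anchors.ts and of CLAIM_CITATION are not shown on this page, so the `Anchor` and `ClaimCitation` interfaces, their field names, and `anchorProblems` are all hypothetical.

```typescript
// Hypothetical shapes, for illustration only.
interface Anchor {
  id: string;
  arxivId: string;
  expected: number;  // the paper's headline value
  tolerance: number; // half-width of the tolerance band
  rationale: string; // written justification required by curation
}

interface ClaimCitation {
  arxivId: string;
  value: number;     // the value the driver claims the paper reports
}

// Returns a list of problems; an empty list means the pair is structurally sound.
function anchorProblems(anchor: Anchor, citation: ClaimCitation): string[] {
  const problems: string[] = [];
  if (anchor.rationale.trim() === "") {
    problems.push("missing rationale");
  }
  if (!(anchor.tolerance > 0)) {
    problems.push("tolerance must be positive");
  }
  if (citation.arxivId !== anchor.arxivId) {
    problems.push("citation points at a different paper");
  }
  // "Exactly" is taken literally here: strict equality, no rounding or epsilon.
  if (citation.value !== anchor.expected) {
    problems.push("citation value does not match expected value");
  }
  return problems;
}

const exampleAnchor: Anchor = {
  id: "bert-base-sst2", // hypothetical identifier
  arxivId: "1810.04805",
  expected: 93.5,
  tolerance: 3.0,
  rationale: "placeholder rationale text",
};
```

A unit test could then assert that `anchorProblems` returns an empty list for every curated anchor paired with its driver's citation.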

Why this exists

On 2026-05-13 we retracted 7 public WRONG verdicts that turned out to be false positives; every one was a citation problem. The Verdict Validator (C1 structural + C2 textual gates) addresses the citation-side failure mode: a driver can no longer publish a WRONG against a made-up paper headline. Anchor probes address the orthogonal failure mode: a driver that is measuring the right thing, against the right paper headline, but now measuring it wrong because some part of the execution stack has drifted. Both gates together are the platform's answer to PRD §3 #1 (“evidence first”). The full self-correction quadrant adds two more surfaces: /legal/retractions (historical false positives) and /skipped (papers we deliberately did not reproduce).