Retraction diff: CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Before / after

Before — original verdict

WRONG

agent_version

v0.1.0-codebert-codesearch-200way

verdict_id

95fb7f95-a23c-499d-a170-98b5b138a965

After — current verdict

RETRACTED

The original row is marked is_current=false in the database; no current public verdict exists for this paper. The reproduction was disabled to prevent re-publication under the same flawed protocol.

Why the original verdict was incorrect

The driver evaluated the pre-trained microsoft/codebert-base checkpoint zero-shot, ranking a 200-way candidate pool by the cosine similarity of mean-pooled embeddings. The 0.840 Python MRR in the paper's Table 4 comes from a different protocol: CodeBERT fine-tuned on the CodeSearchNet train split, scored with a [CLS] classifier head, and evaluated against a 1000-way candidate pool. Because the two setups differ in training, scoring, and pool size, the reproduced number was never comparable to the paper's. The retraction script disables the reproduction until the protocol matches the paper's.
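Pool size alone is enough to make two MRR figures incomparable. The sketch below illustrates this with synthetic scores only: it does not load CodeBERT or CodeSearchNet, and the function names, the Gaussian score model, and the `gold_margin` parameter are assumptions made for illustration. Each query gets one gold candidate and 999 distractors. The same query is then ranked against the first 200 candidates and against all 1000.

```python
import random

def reciprocal_rank(scores, gold_index=0):
    """Reciprocal rank of the gold candidate, given one score per candidate."""
    gold = scores[gold_index]
    # Rank is 1 plus the number of candidates that score strictly higher.
    rank = 1 + sum(1 for s in scores if s > gold)
    return 1.0 / rank

def mean_reciprocal_rank(score_lists):
    """Average reciprocal rank over a list of per-query score lists."""
    return sum(reciprocal_rank(s) for s in score_lists) / len(score_lists)

def make_queries(n_queries=300, pool_size=1000, gold_margin=1.5, seed=0):
    """Synthetic similarity scores: index 0 is the gold candidate (assumed model)."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        gold = rng.gauss(gold_margin, 1.0)
        distractors = [rng.gauss(0.0, 1.0) for _ in range(pool_size - 1)]
        queries.append([gold] + distractors)
    return queries

queries = make_queries()
# Same queries and same scorer; only the candidate pool size changes.
mrr_200 = mean_reciprocal_rank([q[:200] for q in queries])
mrr_1000 = mean_reciprocal_rank(queries)
print(f"200-way MRR:  {mrr_200:.3f}")
print(f"1000-way MRR: {mrr_1000:.3f}")
```

Because the 200-way pool is a subset of the 1000-way pool, the gold candidate's rank can only stay the same or get worse as distractors are added, so the 1000-way MRR never exceeds the 200-way MRR for the same scorer.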

Evidence trail

Audit thread — long-form post-mortem covering all seven 2026-05-13 retractions, including this one.
Rollback PR — the GitHub pull request that landed the corrected driver and flipped the verdict row to is_current=false.
All public retractions — the append-only retraction log under PRD §17.X.8(d).
Verdict Validator — the C1/C2 gates that prevent this class of mistake from shipping again.

What changed structurally

The 2026-05-13 retraction rollup landed two structural fixes so the citation-side failure that caused the original incorrect verdict cannot ship the same way again:

Typed claim citation per verdict. Every reproduction driver now declares a structured CLAIM_CITATION (table, row, column, reported value, quoted text, PDF page) before its Modal job runs. The original verdict on this paper was published against a non-citable headline; the build-failing validator-wiring lint now closes that path.
PDF-verified textual gate. The Verdict Validator fetches the cited paper's PDF and checks that the quoted text appears within ±200 characters of the cited reported value. Made-up, mis-cited, or category-confused citations fail this gate and the verdict is auto-downgraded.
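The two fixes above can be sketched together as follows. This is a minimal illustration, not the Verdict Validator's actual code: the `ClaimCitation` class, its field names, and `passes_textual_gate` are hypothetical. The sketch also assumes the PDF text has already been extracted to a string, and it reads "within ±200 characters" as meaning that the whole quote falls inside a window extending 200 characters on either side of an occurrence of the reported value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClaimCitation:
    """Hypothetical shape of a typed claim citation (field names are assumptions)."""
    table: str
    row: str
    column: str
    reported_value: str  # kept as a string, exactly as printed in the paper
    quoted_text: str
    pdf_page: int

def passes_textual_gate(pdf_text, citation, window=200):
    """True if quoted_text lies within `window` chars of some occurrence of reported_value."""
    value = citation.reported_value
    start = pdf_text.find(value)
    while start != -1:
        lo = max(0, start - window)
        hi = start + len(value) + window
        if citation.quoted_text in pdf_text[lo:hi]:
            return True
        # Try the next occurrence of the value, if any.
        start = pdf_text.find(value, start + 1)
    return False

# Placeholder data: the row, quote, and page below are invented for the example,
# and the page text is filler rather than text from the paper.
citation = ClaimCitation(
    table="Table 4",
    row="(hypothetical row label)",
    column="Python",
    reported_value="0.840",
    quoted_text="placeholder quoted sentence",
    pdf_page=1,
)
near_text = "x" * 300 + " 0.840 placeholder quoted sentence " + "x" * 300
far_text = "0.840" + "x" * 500 + "placeholder quoted sentence"

print(passes_textual_gate(near_text, citation))  # quote sits next to the value
print(passes_textual_gate(far_text, citation))   # quote is over 200 chars away
```

A real gate would probably also normalize whitespace and hyphenation in the extracted PDF text before matching. That step is omitted here to keep the windowing logic visible.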