{"paper":{"arxiv_id":"2103.00020","title":"Learning Transferable Visual Models From Natural Language Supervision","abstract":"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.","primary_category":"cs.CV","venue":"ICML 2021","published_at":null,"latest_version":1,"withdrawn":false},"latest_version":{"id":"41b609bc-8853-44aa-9a53-c3fc8d7f3b5d","version":1,"source_url":"https://arxiv.org/abs/2103.00020","rendered_html_url":null,"rendering_engine":null},"verdict":{"id":"82c5868e-830a-4c2e-8a10-dd8aff011664","kind":"POST","status":"reproduced","score":0.8866666666666667,"confidence":0.75,"agent_version":"v0.1.0-clip-cifar10-3slice100","computed_at":"2026-05-14T23:20:20.263Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2103.00020","section":"Table 11","row":"ViT-B/16 CIFAR10","column":"zero-shot top-1","reported_value":91.6,"reported_metric":"accuracy","quoted_text":"ViT-B/16 91.6","pdf_page":19,"notes":"Table 11 of arXiv:2103.00020 reports CLIP ViT-B/16 zero-shot CIFAR-10 top-1 = 91.6 with an 80-prompt ensemble. Driver runs a single-prompt eval on a 900-sample micro-slice — `proxy` on both prompt-ensembling and dataset-size axes."},"protocol_match":"proxy"},"verdicts":{"post":{"id":"82c5868e-830a-4c2e-8a10-dd8aff011664","kind":"POST","status":"reproduced","score":0.8866666666666667,"confidence":0.75,"agent_version":"v0.1.0-clip-cifar10-3slice100","computed_at":"2026-05-14T23:20:20.263Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2103.00020","section":"Table 11","row":"ViT-B/16 CIFAR10","column":"zero-shot top-1","reported_value":91.6,"reported_metric":"accuracy","quoted_text":"ViT-B/16 91.6","pdf_page":19,"notes":"Table 11 of arXiv:2103.00020 reports CLIP ViT-B/16 zero-shot CIFAR-10 top-1 = 91.6 with an 80-prompt ensemble. Driver runs a single-prompt eval on a 900-sample micro-slice — `proxy` on both prompt-ensembling and dataset-size axes."},"protocol_match":"proxy"},"pre":{"id":"e1e36d1d-c9ad-4314-b36b-40de74c95abd","kind":"PRE","status":"pending","score":0.4134,"confidence":0.5,"agent_version":"pre-heuristic-v0.1+no-llm","computed_at":"2026-05-06T17:20:49.695Z","is_current":true,"claim_citation":null,"protocol_match":null}}}