{"paper":{"arxiv_id":"2102.05918","title":"Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision","abstract":"Pre-trained representations are becoming crucial for many NLP and perception tasks. We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB.","primary_category":"cs.CV","venue":"ICML 2021","published_at":null,"latest_version":1,"withdrawn":false},"latest_version":{"id":"05ad0c34-cabe-4803-a4c1-f00674e02210","version":1,"source_url":"https://arxiv.org/abs/2102.05918","rendered_html_url":null,"rendering_engine":null},"verdict":{"id":"07f860fb-86d0-40ea-828d-f7b2875579b2","kind":"POST","status":"partial","score":1,"confidence":0.55,"agent_version":"v0.1.0b-align-imagenette-3slice100","computed_at":"2026-05-14T23:39:29.603Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2102.05918","section":"Table 4","row":"ALIGN-base","column":"ImageNet zero-shot top-1","reported_value":76.4,"reported_metric":"accuracy","quoted_text":"ALIGN 76.4","pdf_page":6,"notes":"Table 4 of arXiv:2102.05918 reports ALIGN-base zero-shot ImageNet top-1 = 76.4. Driver evaluates the kakaobrain/align-base checkpoint on Imagenette (10-class ImageNet subset). PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},"verdicts":{"post":{"id":"07f860fb-86d0-40ea-828d-f7b2875579b2","kind":"POST","status":"partial","score":1,"confidence":0.55,"agent_version":"v0.1.0b-align-imagenette-3slice100","computed_at":"2026-05-14T23:39:29.603Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2102.05918","section":"Table 4","row":"ALIGN-base","column":"ImageNet zero-shot top-1","reported_value":76.4,"reported_metric":"accuracy","quoted_text":"ALIGN 76.4","pdf_page":6,"notes":"Table 4 of arXiv:2102.05918 reports ALIGN-base zero-shot ImageNet top-1 = 76.4. Driver evaluates the kakaobrain/align-base checkpoint on Imagenette (10-class ImageNet subset). PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},"pre":null}}