{"paper":{"arxiv_id":"2301.12597","title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","abstract":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.","primary_category":"cs.CV","venue":"ICML 2023","published_at":null,"latest_version":1,"withdrawn":false},"latest_version":{"id":"daa59079-1eab-4516-8656-1b3ced96c493","version":1,"source_url":"https://arxiv.org/abs/2301.12597","rendered_html_url":null,"rendering_engine":null},"verdict":{"id":"5a5c6faa-1b17-4dce-b5e2-8094b88679e2","kind":"POST","status":"partial","score":0.3039409735591193,"confidence":0.6,"agent_version":"v0.2.0-blip2-flickr30k-beam5-n100","computed_at":"2026-05-14T23:27:31.653Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2301.12597","section":"Table 3","row":"BLIP-2 OPT-2.7B","column":"Flickr30k zero-shot BLEU-4","reported_value":43.7,"reported_metric":"bleu4","quoted_text":"BLIP-2 OPT-2.7B 43.7","pdf_page":7,"notes":"Table 3 of arXiv:2301.12597 reports zero-shot Flickr30k BLEU-4 = 43.7 for BLIP-2 OPT-2.7B under beam=5 with the documented prompt prefix. Driver evaluates a 100-sample Flickr30k test slice — `proxy` on dataset-size axis only. Pre-fix WRONG (PR #106 retraction) used greedy decoding + n=20 + no prompt; that protocol drift is what produced the false positive the validator now refuses to publish."},"protocol_match":"proxy"},"verdicts":{"post":{"id":"5a5c6faa-1b17-4dce-b5e2-8094b88679e2","kind":"POST","status":"partial","score":0.3039409735591193,"confidence":0.6,"agent_version":"v0.2.0-blip2-flickr30k-beam5-n100","computed_at":"2026-05-14T23:27:31.653Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2301.12597","section":"Table 3","row":"BLIP-2 OPT-2.7B","column":"Flickr30k zero-shot BLEU-4","reported_value":43.7,"reported_metric":"bleu4","quoted_text":"BLIP-2 OPT-2.7B 43.7","pdf_page":7,"notes":"Table 3 of arXiv:2301.12597 reports zero-shot Flickr30k BLEU-4 = 43.7 for BLIP-2 OPT-2.7B under beam=5 with the documented prompt prefix. Driver evaluates a 100-sample Flickr30k test slice — `proxy` on dataset-size axis only. Pre-fix WRONG (PR #106 retraction) used greedy decoding + n=20 + no prompt; that protocol drift is what produced the false positive the validator now refuses to publish."},"protocol_match":"proxy"},"pre":null}}