{"paper":{"arxiv_id":"2210.11416","title":"Scaling Instruction-Finetuned Language Models","abstract":"Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.","primary_category":"cs.LG","venue":"arXiv 2022","published_at":null,"latest_version":1,"withdrawn":false},"latest_version":{"id":"7fa4a133-e4e6-4b3e-bd87-61e43ec3a818","version":1,"source_url":"https://arxiv.org/abs/2210.11416","rendered_html_url":null,"rendering_engine":null},"verdict":{"id":"35e43f3b-8eed-4284-8c9d-c0598d3e6bed","kind":"POST","status":"partial","score":0.40562248995983935,"confidence":0.6,"agent_version":"v0.1.0-flan-t5-mmlu-microslice","computed_at":"2026-05-14T23:28:30.248Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2210.11416","section":"Table 6","row":"FLAN-T5-Large","column":"MMLU 0-shot direct","reported_value":45.1,"reported_metric":"accuracy","quoted_text":"FLAN-T5-Large 45.1","pdf_page":14,"notes":"Table 6 of arXiv:2210.11416 reports FLAN-T5-Large MMLU 0-shot direct = 45.1. Driver evaluates the same `google/flan-t5-large` checkpoint on an MMLU micro-slice. PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},"verdicts":{"post":{"id":"35e43f3b-8eed-4284-8c9d-c0598d3e6bed","kind":"POST","status":"partial","score":0.40562248995983935,"confidence":0.6,"agent_version":"v0.1.0-flan-t5-mmlu-microslice","computed_at":"2026-05-14T23:28:30.248Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2210.11416","section":"Table 6","row":"FLAN-T5-Large","column":"MMLU 0-shot direct","reported_value":45.1,"reported_metric":"accuracy","quoted_text":"FLAN-T5-Large 45.1","pdf_page":14,"notes":"Table 6 of arXiv:2210.11416 reports FLAN-T5-Large MMLU 0-shot direct = 45.1. Driver evaluates the same `google/flan-t5-large` checkpoint on an MMLU micro-slice. PROTOCOL_MATCH is `proxy` (dataset-size)."},"protocol_match":"proxy"},"pre":null}}