{"paper":{"arxiv_id":"2404.06395","title":"MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies","abstract":"The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler, which is conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS.","primary_category":"cs.CL","venue":"COLM 2024","published_at":null,"latest_version":1,"withdrawn":false},"latest_version":{"id":"32975ba8-106a-48db-b6e2-00a4a981bac8","version":1,"source_url":"https://arxiv.org/abs/2404.06395","rendered_html_url":null,"rendering_engine":null},"verdict":{"id":"512e19da-e020-4dfb-a7ab-2facf6f99a1e","kind":"POST","status":"partial","score":0.2922922922922923,"confidence":0.55,"agent_version":"v0.1.0-minicpm-mmlu5shot-microslice","computed_at":"2026-05-15T00:04:42.753Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2404.06395","section":"Table 3","row":"MiniCPM-2.4B","column":"MMLU","reported_value":53.46,"reported_metric":"accuracy","quoted_text":"MiniCPM-2.4B 51.13 51.07 53.46 50.00 47.31 53.83 10.24","pdf_page":14,"notes":"MMLU column from the Benchmark Score table for MiniCPM-2.4B and MiniCPM-1.2B (both without RLHF). The paper's §6.5 (page 13) does NOT specify the MMLU shot count — they only report using their UltraEval framework. We run community-canonical 5-shot here as a faithful proxy; protocol_match='proxy' so the validator caps this driver at PARTIAL even on a large delta. Table 3 is page 14 — NOT to be confused with Table 1 (ablation A-1..B-3) on page 11."},"protocol_match":"proxy"},"verdicts":{"post":{"id":"512e19da-e020-4dfb-a7ab-2facf6f99a1e","kind":"POST","status":"partial","score":0.2922922922922923,"confidence":0.55,"agent_version":"v0.1.0-minicpm-mmlu5shot-microslice","computed_at":"2026-05-15T00:04:42.753Z","is_current":true,"claim_citation":{"paper_arxiv_id":"2404.06395","section":"Table 3","row":"MiniCPM-2.4B","column":"MMLU","reported_value":53.46,"reported_metric":"accuracy","quoted_text":"MiniCPM-2.4B 51.13 51.07 53.46 50.00 47.31 53.83 10.24","pdf_page":14,"notes":"MMLU column from the Benchmark Score table for MiniCPM-2.4B and MiniCPM-1.2B (both without RLHF). The paper's §6.5 (page 13) does NOT specify the MMLU shot count — they only report using their UltraEval framework. We run community-canonical 5-shot here as a faithful proxy; protocol_match='proxy' so the validator caps this driver at PARTIAL even on a large delta. Table 3 is page 14 — NOT to be confused with Table 1 (ablation A-1..B-3) on page 11."},"protocol_match":"proxy"},"pre":null}}