[2605.03677] Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Wenjin Hou, Shangpin Peng, Weinong Wang, Zheng Ruan, Yue Zhang, Zhenglin Zhou, Mingqi Gao, Yifei Chen, Kaiqi Wang, Hongming Yang, Chengquan Zhang, Zhuotao Tian, Han Hu, Yi Yang, Fei Wu, Hehe Fan

Affiliations: LLM Department, Tencent; Shenzhen Loop Area Institute; Zhejiang University

Contact: houwj17@gmail.com, weinong.wang@hotmail.com, hehefan@zju.edu.cn

Abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across LLMs and MLLMs, centered on a dual-perspective optimization strategy. Specifically, from the student’s perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher’s perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD. Code is available at https://github.com/WenjinHou/Uni-OPD.

* Equal contribution. ⋆ Work was done when Wenjin Hou and Shangpin Peng interned at Tencent. † Project leader. ‡ Project supervisor. 🖂 Corresponding author.

Figure 1: Overall performance comparisons and convergence behavior. Results are shown for settings including multi-teacher, strong-to-weak, and cross-modal distillation on math reasoning and code generation tasks. Uni-OPD consistently outperforms OPD and converges faster than RL, demonstrating its effectiveness across diverse settings.

1 Introduction

Injecting complex reasoning abilities, domain knowledge, and human preferences into LLMs and MLLMs remains a core challenge in the post-training stage. Conventional approaches typically follow a two-stage paradigm: supervised fine-tuning (SFT) first, followed by reinforcement learning (RL) (Guo et al., 2025a; Xu et al., 2025a; Zeng et al., 2026; Zhao et al., 2026a). While SFT leverages expert data for training, its inherently off-policy nature introduces substantial exposure bias (Qin et al., 2025; Song and Zheng, 2026). Entering rarely covered erroneous states during inference may lead to compounding errors. Alternatively, on-policy RL (e.g., GRPO (Shao et al., 2024b)) alleviates distribution shift through online sampling. However, it mainly relies on sequence-level or terminal rewards, making fine-grained credit assignment difficult and limiting the stability of long-term training (Team et al., 2026).

Recently, on-policy distillation (OPD) has emerged as a promising post-training paradigm for efficiently transferring the knowledge and capabilities of domain experts into a single, unified model. It combines the strengths of RL and SFT, namely on-policy sampling and token-level supervision. Concretely, OPD trains the student on its own sampled trajectories with teacher feedback under a reverse KL objective (Lu and Lab, 2025; DeepSeek-AI, 2026).

Despite its empirical success, current OPD research remains largely confined to LLM distillation (Zhou et al., 2025; Yang et al., 2026b; Xiao et al., 2026; Yang et al., 2026c; Wu et al., 2026). Although a few recent works extend OPD to MLLMs, they are restricted to limited subsets of tasks within a single modality, such as video (Li et al., 2026a) or speech (Cao et al., 2026). To this end, we first aim to develop a unified OPD framework for both LLMs and MLLMs, enabling effective knowledge distillation across tasks and modalities.

Key observations. Beyond unifying the framework, we raise a more fundamental question: what makes OPD a reliable optimization paradigm? We posit that effective OPD depends on two factors. First, the student must sufficiently explore informative states, i.e., diverse and appropriately difficult self-generated trajectories. Second, the teacher’s token-level supervision must remain reliable when applied to student rollouts. In particular, the reliability of token-level guidance is significantly enhanced when its trajectory-level aggregation remains order-consistent with outcome reward (i.e., correct trajectories receive higher aggregated scores than incorrect ones). The outcome reward thus provides a global anchor for calibrating unreliable teacher supervision. These observations motivate a dual-perspective optimization strategy that jointly improves student exploration and the reliability of teacher signals.

Our recipe. Building on these insights, we introduce Uni-OPD, a dual-perspective strategy for optimizing OPD from the fundamental roles of the student and the teacher. In this unified framework, we adopt two complementary data-balancing strategies, namely offline difficulty-aware and online correctness-aware balancing, to promote exploration of informative student-generated states. We further present a novel outcome-guided margin calibration mechanism to obtain reliable teacher supervision. Extensive experiments on LLMs and MLLMs verify our recipe.

To summarize, our contributions are threefold:

• Key bottlenecks of OPD. We identify two core bottlenecks in OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Our analysis reveals that reliable teacher supervision largely depends on whether token-level guidance remains order-consistent with the outcome reward.

• Dual-perspective optimization recipe. We present a dual-perspective optimization recipe for unified OPD that jointly improves student exploration and teacher supervision. Concretely, we combine offline and online data balancing with an outcome-guided margin calibration mechanism, leading to more effective optimization.

• Comprehensive experimental validation. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation (i.e., combining text-only and multimodal tasks). Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.

2 Related Work

Knowledge distillation for LLMs and MLLMs. Knowledge distillation (Hinton et al., 2015; Xu et al., 2024) aims to transfer knowledge from a larger teacher model to a smaller student model. Conventional approaches typically rely on off-policy forward Kullback–Leibler (KL) divergence on a static dataset to align the student’s generation distribution with that of the teacher (Liu et al., 2024d; Guo et al., 2025b; He et al., 2025a; Liu and Zhang, 2025; Ko et al., 2025). Another line of work treats supervised fine-tuning (SFT) on tokens generated by the teacher as an alternative off-policy distillation strategy for eliciting reasoning capabilities during LLM and MLLM post-training (Guo et al., 2025a; Zhang et al., 2025c; Bansal et al., 2025; Zhang et al., 2025b; Team et al., 2026; Xiao et al., 2026). Though effective, these off-policy methods essentially imitate the teacher’s behavior, limiting the student’s ability to surpass the teacher and making the student prone to exposure bias (Song and Zheng, 2026).

On-policy distillation. OPD (Agarwal et al., 2024; Lu and Lab, 2025) allows a superior teacher to provide feedback on the student’s on-policy trajectories. This paradigm effectively alleviates exposure bias and elevates the student’s upper performance bound. Owing to these merits, OPD has become an efficient way to merge capabilities from multiple experts into a single student model (Xiao et al., 2026; Yang et al., 2026c), as well as to support strong-to-weak distillation (Bai et al., 2025a; Zeng et al., 2026). Building on this paradigm, current studies on OPD have branched into several key directions. From the lens of the teacher, recent work explores teacher-free self-distillation paradigms (Kujanpää et al., 2024; Shenfeld et al., 2026; Zhao et al., 2026b; Hübotter et al., 2026; Ye et al., 2026; Zhang et al., 2026a; Stein et al., 2026), develops black-box OPD methods (Ye et al., 2025; Xiong et al., 2026), and facilitates distillation across different model families (Patiño et al., 2025). Complementary efforts focus on unified training frameworks (Zhang et al., 2026b) and stable optimization strategies (Jin et al., 2026; Kim and Baek, 2026; Li et al., 2026b; Xu et al., 2026) combined with RL (Yang et al., 2026a; Qu et al., 2026; Jang et al., 2026; Wang et al., 2026). A few works extend OPD to multimodal domains (Bousselham et al., 2025; Ko et al., 2026; Li et al., 2026a; Cao et al., 2026). In this work, we advance OPD with a dual-perspective recipe that promotes student exploration and teacher reliability, generalizing across LLMs and MLLMs. More detailed related work is provided in Appendix E.

3 Methodology

We propose Uni-OPD, a unified framework that advances OPD across LLMs and MLLMs, as shown in Fig. 2. Our design is driven by two fundamental bottlenecks in OPD: insufficient exploration of informative student-generated states and unreliable teacher supervision for student rollouts. Uni-OPD addresses them with a dual-perspective recipe that enhances student exploration and calibrates teacher supervision to align with the outcome reward. We first introduce the preliminaries in section 3.1, followed by an overview of Uni-OPD in section 3.2. We then detail the exploration strategy in section 3.3 and the supervision calibration mechanism in section 3.4.

3.1 Preliminaries

On-policy distillation. OPD retains the on-policy nature of optimization while providing token-level credit assignment, enabling effective post-training. During training, the student policy $\pi_{{\bm{\theta}}}$ samples its trajectories and is optimized by minimizing the reverse Kullback-Leibler (KL) divergence to the teacher policy $\pi_{\mathrm{T}}$ over these samples:

$$\mathcal{J}_{\text{OPD}}(\bm{\theta})=\min_{\bm{\theta}}\mathbb{E}_{\bm{q}\sim D,\,\bm{\tau}\sim{\pi}_{\bm{\theta}}(\cdot\mid\bm{q})}\!\Big[\mathcal{D}_{\mathrm{KL}}\!\Big({\pi}_{\bm{\theta}}(\bm{\tau}\!\mid\!\bm{q})\,\big\|\,\pi_{\text{T}}(\bm{\tau}\!\mid\!\bm{q})\Big)\Big], \tag{1}$$

where ${\bm{q}}$ is the input question, $\bm{\tau}=(o_{1},\dots,o_{|\bm{\tau}|})$ is a trajectory sampled by the student, $o_{t}$ is the token at step $t$, and $|\bm{\tau}|$ is the length of the trajectory. The gradient of OPD can be derived as:

$$\nabla_{\bm{\theta}}\mathcal{J}_{\text{OPD}}(\bm{\theta})=\mathbb{E}_{\bm{q}\sim D,\,\bm{\tau}\sim{\pi}_{\bm{\theta}}(\cdot\mid\bm{q})}\!\Big[\sum_{t=1}^{|\bm{\tau}|}\!\big(\log{\pi}_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})\big)\,\nabla_{\bm{\theta}}\log{\pi}_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})\Big], \tag{2}$$

where ${\bm{o}}_{<t}$ denotes the prefix before step $t$. The gradient naturally induces a token-level reward at step $t$, analogous to standard RL:

$$r^{\mathrm{OPD}}_{t}=\log\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log{\pi}_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})=\log\frac{\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})}{{\pi}_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})}. \tag{3}$$

This formulation provides fine-grained credit assignment signals at the token level.
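To make Eq. 3 concrete, the following is a minimal sketch that computes the token-level reward from per-token log-probabilities. The function name `opd_token_rewards` and the toy probabilities are illustrative assumptions, not part of the released code; in practice the log-probabilities would come from scoring the same student rollout with both models.

```python
import math


def opd_token_rewards(teacher_logprobs, student_logprobs):
    """Return r_t = log pi_T(o_t | q, o_<t) - log pi_theta(o_t | q, o_<t) per token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Toy rollout of three sampled tokens; the probabilities are made up.
teacher_lp = [math.log(0.50), math.log(0.10), math.log(0.80)]
student_lp = [math.log(0.25), math.log(0.40), math.log(0.80)]
rewards = opd_token_rewards(teacher_lp, student_lp)

for step, r in enumerate(rewards, start=1):
    print(f"t={step}: r_t = {r:+.3f}")
```

A positive reward marks a token the teacher finds more likely than the student does, a negative reward marks the opposite, and a zero reward means the two models agree on that token.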

Analyzing teacher supervision in OPD. As shown in Eq. 3, OPD relies on the teacher to provide fine-grained supervision for student-generated trajectories. For effective optimization, this signal should align with overall trajectory correctness. In practice, this alignment is not guaranteed and can fail in several typical ways: (a) OOD degradation: when student rollouts enter sparse or out-of-distribution regions relative to the teacher, $\log\pi_{\mathrm{T}}(o_{t}\mid\cdot)$ may become noisy, disrupting the ranking between correct and incorrect trajectories. (b) Overestimation of incorrect trajectories: incorrect trajectories may receive abnormally high scores when their local token patterns align with the teacher’s high-confidence regions. (c) Underestimation of correct trajectories: correct trajectories may receive abnormally low scores when their generation paths deviate from the teacher’s dominant regions, thereby suppressing useful reasoning paths. These phenomena suggest that teacher supervision is not always reliable, motivating us to introduce an outcome reward as a global anchor for calibrating trajectory-level supervision.

3.2 The Overview of Uni-OPD

In this work, we propose Uni-OPD, a unified OPD framework that generalizes across both LLMs and MLLMs, as illustrated in Fig. 2. Formally, given expert teachers $\{\pi_{\mathrm{T}_{1}},\pi_{\mathrm{T}_{2}},\dots,\pi_{\mathrm{T}_{N}}\}$ who specialize in different domains, and letting $w_{i}$ denote the weight assigned to teacher $\pi_{\mathrm{T}_{i}}$, we define the objective as:

$$\mathcal{J}_{\text{Uni-OPD}}(\bm{\theta})=\sum_{i=1}^{N}w_{i}\,\mathcal{D}_{\mathrm{KL}}\!\left({\pi}_{\bm{\theta}}\,\|\,\pi_{\mathrm{T}_{i}}\right). \tag{4}$$

This formulation provides a unified objective for both single-teacher and multi-teacher distillation by aggregating supervision from multiple experts. Building on this objective, we optimize OPD from the two fundamental roles. From the student’s perspective, we introduce a data-balancing strategy that promotes exploration via offline difficulty-aware and online correctness-aware selection. From the teacher’s perspective, we develop an outcome-guided margin calibration mechanism to correct unreliable token-level supervision by enforcing consistency with outcome rewards. These designs stabilize optimization and improve the reliability of OPD.
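As a rough illustration of Eq. 4, the sketch below forms a single-rollout estimate of the weighted reverse-KL objective from per-token log-probabilities. It assumes every teacher scores the same student rollout; the function name `uni_opd_loss_estimate`, the weights, and the numbers are hypothetical and chosen only for illustration.

```python
def uni_opd_loss_estimate(student_logprobs, teacher_logprobs_by_teacher, weights):
    """Single-rollout estimate of sum_i w_i * KL(pi_theta || pi_T_i).

    For one sampled trajectory, the sequence-level log-ratio is the sum of the
    per-token log-ratios log pi_theta(o_t | ...) - log pi_T_i(o_t | ...).
    """
    if len(weights) != len(teacher_logprobs_by_teacher):
        raise ValueError("need exactly one weight per teacher")
    total = 0.0
    for w, teacher_lp in zip(weights, teacher_logprobs_by_teacher):
        if len(teacher_lp) != len(student_logprobs):
            raise ValueError("teacher and student must score the same tokens")
        kl_hat = sum(s - t for s, t in zip(student_logprobs, teacher_lp))
        total += w * kl_hat
    return total


# Two hypothetical teachers scoring one two-token student rollout.
student_lp = [-1.0, -2.0]
teacher_a_lp = [-0.5, -1.0]
teacher_b_lp = [-2.0, -2.0]

mixed = uni_opd_loss_estimate(student_lp, [teacher_a_lp, teacher_b_lp], [0.5, 0.5])
single = uni_opd_loss_estimate(student_lp, [teacher_a_lp, teacher_b_lp], [1.0, 0.0])
print(mixed, single)
```

Putting all the weight on one teacher recovers the single-teacher case. A single-rollout estimate can be negative even though the KL divergence itself is non-negative, because it is only one sample of the expectation.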

3.3 Joint Offline and Online Data Balancing Strategy for Student Exploration

From the student’s perspective, sufficient diversity and an appropriate level of difficulty in the generated trajectories are essential for effective OPD. To this end, based on our empirical study, we propose complementary data-balancing strategies for both offline data construction and online sampling.

Offline difficulty-aware data balancing. A prevalent practice in RL is to estimate prompt difficulty via multiple rollouts and then filter out samples that are either overly easy (i.e., always correct) or overly hard (i.e., always incorrect) (An et al., 2025; Zhou et al., 2023a). However, for small-scale models, training data often exhibits a mirrored J-shaped or U-shaped distribution (see Fig. 3). Strictly removing these easy or hard samples can substantially reduce data diversity and limit exploration of informative student-generated states. Our empirical findings show that such filtering leads to substantial performance degradation in OPD.

Based on this observation, we adopt a difficulty-aware balancing strategy that selectively upsamples mid-difficulty samples (i.e., correct in only some of multiple rollouts). As shown in Fig. 3, this strategy reshapes the data distribution into a more uniform form while preserving both diversity and difficulty. In addition, it consistently improves performance on math reasoning and code generation. Overall, these results show that maintaining data diversity and a balanced difficulty spectrum enables the student to generate more informative trajectories, thereby exploring a broader solution space.
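The offline balancing idea can be sketched as follows. This is a simplified illustration under stated assumptions: per-prompt pass rates are assumed to have been estimated from multiple rollouts, and the repeat factor `mid_repeat` is a hypothetical knob rather than a value reported in the paper.

```python
def balance_by_difficulty(prompts, pass_rates, mid_repeat=3):
    """Keep every prompt and repeat mid-difficulty ones.

    A prompt counts as mid-difficulty when it is solved in only some rollouts,
    i.e. 0 < pass rate < 1. Easy (always correct) and hard (always incorrect)
    prompts are kept once instead of being filtered out.
    """
    if len(prompts) != len(pass_rates):
        raise ValueError("need one pass rate per prompt")
    balanced = []
    for prompt, rate in zip(prompts, pass_rates):
        copies = mid_repeat if 0.0 < rate < 1.0 else 1
        balanced.extend([prompt] * copies)
    return balanced


prompts = ["p_easy", "p_mid_a", "p_mid_b", "p_hard"]
pass_rates = [1.0, 0.5, 0.25, 0.0]  # made-up estimates from multiple rollouts
balanced = balance_by_difficulty(prompts, pass_rates)
print(len(balanced), balanced.count("p_mid_a"))
```

Unlike strict filtering, this keeps the always-correct and always-incorrect prompts in the pool, so diversity is preserved while the mid-difficulty region gains weight.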

Online correctness-aware data balancing. After applying offline difficulty-aware balancing, we further observe that insufficient exploration can cause the model to collapse to local optima during training, especially when rollout groups lack sufficient outcome diversity (e.g., only incorrect trajectories). To mitigate this issue, we explicitly enforce a balanced composition of correct and incorrect trajectories within each rollout group during training. This prevents degenerate cases in which all samples share the same outcome and thus yield uninformative gradients. By maintaining such a balance, we ensure that the student consistently receives meaningful contrastive signals for stable on-policy learning. As shown in Fig. 4, an appropriate outcome balance achieves better performance than using only correct samples or an excessively high correct/incorrect ratio.
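One possible way to enforce such a balance is sketched below. The selection rule, the group size, and the target fraction of correct rollouts are all assumptions made for illustration; the text above does not fix a specific ratio or a policy for degenerate groups.

```python
def balance_rollout_group(rollouts, rewards, group_size=4, correct_fraction=0.5):
    """Select a rollout group that contains both correct and incorrect samples.

    Returns None when all rollouts share the same outcome, so the caller can
    skip or resample the prompt (a hypothetical policy).
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    correct = [r for r, y in zip(rollouts, rewards) if y == 1]
    incorrect = [r for r, y in zip(rollouts, rewards) if y == 0]
    if not correct or not incorrect:
        return None  # degenerate group: no contrast between outcomes
    # Aim for the target share of correct rollouts, leaving room for one incorrect.
    target = max(1, min(group_size - 1, round(group_size * correct_fraction)))
    n_correct = min(len(correct), target)
    n_incorrect = min(len(incorrect), group_size - n_correct)
    return correct[:n_correct] + incorrect[:n_incorrect]


rollouts = ["a", "b", "c", "d", "e", "f"]
rewards = [1, 1, 1, 0, 1, 0]  # binary outcome rewards, made up
group = balance_rollout_group(rollouts, rewards)
degenerate = balance_rollout_group(["x", "y"], [0, 0])
print(group, degenerate)
```

Here four of the six rollouts are correct, yet the selected group contains two correct and two incorrect trajectories, and a prompt whose rollouts are all incorrect is flagged instead of being used for training.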

3.4 Outcome-guided Margin Calibration for Teacher Supervision

A basic premise of OPD is that the teacher exhibits a directional likelihood preference over positive and negative trajectories. In particular, relative to the student, the teacher should assign higher likelihood to correct trajectories and lower likelihood to incorrect ones. Under this premise, the resulting distillation signal should remain consistent with outcome-level correctness at the trajectory level. We next formalize this principle through a trajectory-level distillation return and develop an outcome-guided calibration strategy based on it.

Trajectory-level distillation return. To characterize the overall supervision signal along a rollout trajectory, we define the trajectory-level distillation return as the average log-probability gap between the teacher and the student:

$$G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\triangleq\frac{1}{|\bm{\tau}|}\sum_{t=1}^{|\bm{\tau}|}\log\frac{\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})}{\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})}=\frac{1}{|\bm{\tau}|}\sum_{t=1}^{|\bm{\tau}|}r^{\mathrm{OPD}}_{t}. \tag{5}$$

This quantity measures the teacher’s average log-likelihood preference over the student along trajectory $\bm{\tau}$. When $G_{\mathrm{OPD}}({\bm{q}},\bm{\tau})>0$, the teacher assigns higher confidence than the student on average, encouraging the student to move toward this trajectory. Conversely, when $G_{\mathrm{OPD}}({\bm{q}},\bm{\tau})<0$, the student is discouraged from moving toward this trajectory. The normalization by trajectory length ensures comparability across trajectories of different lengths.
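The trajectory-level return in Eq. 5 is simply the mean of the token-level log-ratios. A minimal sketch, with a hypothetical function name and made-up log-probabilities:

```python
def distillation_return(teacher_logprobs, student_logprobs):
    """Length-normalized return G_OPD: mean of log pi_T - log pi_theta over tokens."""
    n = len(student_logprobs)
    if n == 0 or n != len(teacher_logprobs):
        raise ValueError("need non-empty log-prob sequences of equal length")
    return sum(t - s for t, s in zip(teacher_logprobs, student_logprobs)) / n


# A short trajectory the teacher prefers and a longer one it disprefers.
g_short = distillation_return([-0.2, -0.4], [-0.5, -0.9])
g_long = distillation_return([-1.0, -2.0, -3.0], [-0.5, -1.0, -1.5])
print(f"{g_short:.2f} {g_long:.2f}")
```

Because of the division by the trajectory length, the two returns are directly comparable even though the rollouts contain different numbers of tokens.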

Order consistency as a trajectory-level criterion. For a given question ${\bm{q}}$, let $R({\bm{q}},\bm{\tau})\in\{0,1\}$ denote the outcome reward of a sampled trajectory $\bm{\tau}$, where $R({\bm{q}},\bm{\tau})=1$ indicates that the final answer in $\bm{\tau}$ is correct for question ${\bm{q}}$, and $R({\bm{q}},\bm{\tau})=0$ otherwise. We then define the positive and negative trajectory sets as:

$$S_{+}({\bm{q}})\triangleq\{\bm{\tau}\mid R({\bm{q}},\bm{\tau})=1\},\qquad S_{-}({\bm{q}})\triangleq\{\bm{\tau}\mid R({\bm{q}},\bm{\tau})=0\}. \tag{6}$$

Following the trajectory-level bandit formulation of Ouyang et al. (2022), we treat the prompt as the context and the entire generated trajectory as a macro-action. Under this view, the associated outcome reward naturally serves as a one-step trajectory-level return, denoted as $G_{\mathrm{RL}}({\bm{q}},\bm{\tau})=R({\bm{q}},\bm{\tau})$. Therefore, the outcome-level RL return induces the following oracle ordering:

$$G_{\mathrm{RL}}({\bm{q}},\bm{\tau}_{+})\geq G_{\mathrm{RL}}({\bm{q}},\bm{\tau}_{-})\,,\qquad\forall\bm{\tau}_{+}\in S_{+}({\bm{q}}),\;\forall\bm{\tau}_{-}\in S_{-}({\bm{q}})\,. \tag{7}$$

The derivation process is provided in section A.3. This motivates a trajectory-level reliability criterion for OPD. Under the distillation premise, the trajectory-level distillation return $G_{\mathrm{OPD}}({\bm{q}},\bm{\tau})$ should preserve the same outcome-induced ordering as $G_{\mathrm{RL}}({\bm{q}},\bm{\tau})$. Specifically, for any prompt ${\bm{q}}$, we expect:

$$G_{\mathrm{OPD}}({\bm{q}},\bm{\tau}_{+})\geq G_{\mathrm{OPD}}({\bm{q}},\bm{\tau}_{-})\,,\qquad\forall\bm{\tau}_{+}\in S_{+}({\bm{q}}),\;\forall\bm{\tau}_{-}\in S_{-}({\bm{q}})\,. \tag{8}$$

Teacher supervision may violate ordering. In practice, however, the teacher’s supervision is not always reliable. As discussed in section 3.1, teacher scoring may degrade in sparse out-of-distribution regions, overestimate incorrect trajectories, or underestimate correct ones due to spurious local patterns. Such failures may persist even after token-level supervision is aggregated to the trajectory level. A mean-based criterion is therefore insufficient, since the mismatch is often concentrated in a few extreme samples: a single overly confident negative trajectory or a severely underestimated positive trajectory can already distort the supervision signal for the entire prompt group.

Outcome-guided margin calibration. Based on the above analysis, during OPD training, the constraint in Eq. 8 should hold between positive and negative trajectories within each prompt. To this end, we consider the margin between the lowest-scoring correct trajectory and the highest-scoring incorrect trajectory, which directly characterizes whether the ordering is violated in the most adversarial case. We define the prompt-level margin as

$$m(\bm{q})\triangleq\min_{\bm{\tau}\in S_{+}(\bm{q})}\!G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}\!G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\,. \tag{9}$$

By construction, $m({\bm{q}})\geq 0$ indicates order consistency on prompt ${\bm{q}}$: even the lowest-scoring positive trajectory scores at least as high as the highest-scoring negative one (see Fig. 5), so every positive trajectory is ranked at or above every negative one. To improve robustness, we further require:

$$m(\bm{q})\geq\delta\,, \tag{10}$$

where $\delta>0$ defines a safety margin against estimation noise and finite-sample fluctuations. Since $S_{+}({\bm{q}})$ and $S_{-}({\bm{q}})$ are determined by outcome rewards, this criterion uses the outcome signal as a global anchor to calibrate the teacher’s trajectory-level scores. This formulation enables direct interventions on the margin, allowing us to suppress ordering violations or enlarge the separation between positive and negative trajectories.
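The margin in Eq. 9 and the reliability test in Eq. 10 can be sketched in a few lines. The helper names and the value `delta=0.05` are illustrative assumptions; returning `None` for groups that lack one of the two outcomes is also a choice made here, since the margin is undefined in that case.

```python
def prompt_margin(returns, rewards):
    """m(q) = min of G_OPD over correct rollouts minus max over incorrect rollouts."""
    pos = [g for g, y in zip(returns, rewards) if y == 1]
    neg = [g for g, y in zip(returns, rewards) if y == 0]
    if not pos or not neg:
        return None  # margin undefined without both outcomes (assumption)
    return min(pos) - max(neg)


def is_reliable(returns, rewards, delta=0.05):
    """True when the group satisfies m(q) >= delta."""
    m = prompt_margin(returns, rewards)
    return m is not None and m >= delta


# Four rollouts for one prompt: G_OPD values and binary outcome rewards (made up).
returns = [0.4, 0.1, -0.2, 0.3]
rewards = [1, 1, 0, 0]
margin = prompt_margin(returns, rewards)
print(f"{margin:.2f}", is_reliable(returns, rewards))
```

In this toy group the mean return of the correct rollouts (0.25) exceeds that of the incorrect ones (0.05), yet the margin is negative because one incorrect rollout scores 0.3 while one correct rollout scores only 0.1. This is the kind of violation that a mean-based criterion would miss.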

Margin calibration strategy. Based on Eq. 10, we present two calibration strategies: margin mask and margin shift. Specifically, the margin mask keeps only the prompt groups satisfying $m({\bm{q}})\geq\delta$ and discards the rest, so that training is performed only with reliable supervision. Margin shift instead repairs an unreliable group with the smallest additive correction. For groups with $m({\bm{q}})<\delta$, we define:

$$\lambda({\bm{q}})=\delta-m({\bm{q}}),\qquad\widetilde{G}_{\mathrm{OPD}}({\bm{q}},\bm{\tau})=G_{\mathrm{OPD}}({\bm{q}},\bm{\tau})+\lambda({\bm{q}})\,\bm{1}\{R({\bm{q}},\bm{\tau})=1\}. \tag{11}$$

This shift preserves the relative ordering within $S_{+}({\bm{q}})$ and guarantees

$$\min_{\bm{\tau}\in S_{+}({\bm{q}})}\widetilde{G}_{\mathrm{OPD}}({\bm{q}},\bm{\tau})-\max_{\bm{\tau}\in S_{-}({\bm{q}})}\widetilde{G}_{\mathrm{OPD}}({\bm{q}},\bm{\tau})=\delta\,. \tag{12}$$
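The guarantee in Eq. 12 can be checked directly from Eqs. 9 and 11. The indicator in Eq. 11 adds the same constant $\lambda({\bm{q}})$ to every trajectory in $S_{+}({\bm{q}})$ and leaves $S_{-}({\bm{q}})$ untouched, so for a group with $m({\bm{q}})<\delta$ (and hence $\lambda({\bm{q}})>0$):

```latex
\begin{aligned}
\min_{\bm{\tau}\in S_{+}(\bm{q})}\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})
&=\Big(\min_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})+\lambda(\bm{q})\Big)-\max_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\\
&=m(\bm{q})+\lambda(\bm{q})\\
&=m(\bm{q})+\delta-m(\bm{q})=\delta.
\end{aligned}
```

Any smaller uniform shift of the positive set would leave the calibrated margin below $\delta$, which is the sense in which the correction is minimal.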

In this way, margin shift restores outcome-consistent ordering with a minimal group-level correction, while margin mask provides a more conservative alternative when the supervision signal is too unreliable to calibrate.
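Both strategies can be sketched on a single rollout group as follows. This is an illustrative implementation of Eqs. 9 to 11 only: the function name `calibrate_group`, the `mode` argument, the value of `delta`, and the handling of groups without both outcomes are assumptions, and how the calibrated returns feed back into the token-level update is not covered here.

```python
def calibrate_group(returns, rewards, delta=0.05, mode="shift"):
    """Apply margin mask or margin shift to one prompt's rollout group.

    returns: G_OPD value per rollout; rewards: binary outcome reward per rollout.
    Returns the (possibly shifted) returns, or None when the group is masked out.
    """
    if mode not in ("shift", "mask"):
        raise ValueError("mode must be 'shift' or 'mask'")
    pos = [g for g, y in zip(returns, rewards) if y == 1]
    neg = [g for g, y in zip(returns, rewards) if y == 0]
    if not pos or not neg:
        return list(returns)  # margin undefined; left unchanged (assumption)
    margin = min(pos) - max(neg)
    if margin >= delta:
        return list(returns)  # already order-consistent with a safety margin
    if mode == "mask":
        return None  # discard the unreliable group
    lam = delta - margin  # smallest uniform shift of the correct rollouts
    return [g + lam if y == 1 else g for g, y in zip(returns, rewards)]


returns = [0.4, 0.1, -0.2, 0.3]  # made-up G_OPD values
rewards = [1, 1, 0, 0]
shifted = calibrate_group(returns, rewards, delta=0.05, mode="shift")
masked = calibrate_group(returns, rewards, delta=0.05, mode="mask")

new_margin = min(shifted[0], shifted[1]) - max(shifted[2], shifted[3])
print([round(g, 2) for g in shifted], round(new_margin, 2), masked)
```

In the toy group the original margin is negative, so margin shift raises both correct rollouts by the same amount until the calibrated margin equals `delta`, while margin mask drops the group entirely.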

Table 1: Performance of Qwen3-4B Student under math reasoning and code generation benchmarks. Teacher models (i.e., Qwen3-4B-Math-RL and Qwen3-4B-Code-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type. Bold values indicate the best score within each group. Avg. denotes the average score within each domain.

Method	Math Reasoning					Code Generation
	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.	HumanEval+	MBPP+	LCB	Avg.
Student (4B)	23.0	19.3	12.3	9.2	15.9	77.4	65.3	17.7	53.5
Teacher (RL)	60.1	55.1	32.5	38.5	46.6	85.2	69.8	26.6	60.5
Single-Teacher Distillation
ExPO	58.7	55.2	32.4	37.0	45.8	84.8	70.2	28.0	61.0
OPD	57.9	52.4	30.2	37.8	44.6	82.6	68.8	25.7	59.0
ExOPD	62.7	56.1	33.9	39.3	48.0	86.9	70.7	28.6	62.1
Uni-OPD	63.3	57.0	34.8	39.8	48.7	88.3	71.6	29.7	63.2
Multi-Teacher Distillation
SFT	58.5	53.3	30.7	34.8	44.3	86.4	69.6	26.4	60.8
ExPO	57.5	54.5	31.7	36.3	45.0	86.7	72.0	29.0	62.6
OPD	60.9	55.2	33.4	38.3	47.0	86.3	70.9	23.4	60.2
ExOPD	61.0	56.0	34.4	39.2	47.7	86.3	70.6	29.0	62.0
Uni-OPD	62.3	57.2	34.9	39.6	48.5	88.0	72.6	30.1	63.6

4 Experiments and Analysis

In this section, we conduct comprehensive experiments across both textual and multimodal domains to evaluate the effectiveness of Uni-OPD. We first detail the experimental configurations (section 4.1). Subsequently, we assess how the proposed recipe improves OPD performance across diverse distillation scenarios for LLMs and MLLMs, including single-teacher and multi-teacher distillation (section 4.2), strong-to-weak distillation (section 4.3), and cross-modal distillation (section 4.4). Finally, we provide a rigorous ablation study to further analyze the core strategies of our method (section 4.5).

4.1 Experimental Setup

Table 2: Performance of Qwen3-VL-4B-Instruct Student under math reasoning, logic reasoning, and document understanding benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Math-RL, Qwen3-VL-4B-Instruct-Logic-RL and Qwen3-VL-4B-Instruct-Document-RL) are developed through domain-specific RL. Bold values indicate the best score within each group. Avg. denotes the mean score within each category.

Method	Math Reasoning				Logic Reasoning				Document Understanding
	MathVision	DynaMath	WeMath	Avg.	LogicVista (Accuracy)	LogicVista (Format)	VisuLogic	Avg.	AI2D	ChartQA	DocVQA	InfoVQA	Avg.
Student (4B)	33.8	62.2	67.5	54.5	49.9	66.4	25.1	47.0	81.7	73.5	94.9	79.8	82.5
Teacher (RL)	47.2	65.3	79.5	64.0	52.5	73.8	27.4	51.2	82.5	76.4	95.1	81.6	83.9
Single-Teacher Distillation
OPD	47.5	64.8	77.5	63.3	49.8	73.0	26.1	49.6	82.4	75.4	95.2	81.4	83.6
Uni-OPD	47.8	65.4	78.3	63.9	53.1	73.8	28.2	51.7	82.6	75.8	95.2	81.2	83.7
Multi-Teacher Distillation
OPD	41.0	60.9	71.7	57.9	51.3	72.3	26.3	50.0	82.6	75.0	95.1	81.3	83.4
Uni-OPD	45.5	62.3	76.1	61.0	54.0	75.2	27.5	52.5	83.0	75.7	95.3	81.6	83.9

Models. We conduct experiments on the Qwen3 family (Yang et al., 2025; Bai et al., 2025a). For textual experiments, we use Qwen3-4B and Qwen3-1.7B as student models. In the same-sized setting, we apply domain-specific RL to Qwen3-4B to obtain specialized teachers. In the strong-to-weak setting, we use Qwen3-30B-A3B-Instruct-2507 as the strong teacher. For multimodal experiments, we use Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct as student models, and obtain multimodal teachers through domain-specific RL. Detailed training setups are in section B.1.

Training datasets. We use task-specific training data to construct and distill specialized teachers. For textual tasks, we use 57K math reasoning samples filtered from DeepMath (He et al., 2025b) (difficulty level $\geq 6$ ) and 25K code generation samples from the Code subset of Eurus-2-RL-Data (Cui et al., 2025). For multimodal tasks, we use math reasoning, logic reasoning, and document understanding data mainly from OpenMMReasoner-RL-74K (Zhang et al., 2025b). Detailed training data configurations are provided in section B.2.

Baselines. We compare Uni-OPD against several representative baselines for LLM distillation: (1) SFT, which performs supervised fine-tuning on teacher-generated trajectories via cross-entropy loss; (2) ExPO (Yang et al., 2026b), a weight-space extrapolation method that merges domain-specific teachers and extrapolates their weights relative to the student model; (3) ExOPD, a reward-level extrapolation approach that scales the reward factor ( $>1$ ) to enable the student to surpass the performance boundaries of its teachers. For MLLM experiments, since OPD remains largely underexplored in this setting, we use vanilla OPD as the primary baseline.

Evaluation benchmarks. We evaluate Uni-OPD on a comprehensive benchmark suite spanning textual and multimodal capabilities, organized along five capability axes: Textual Math Reasoning: AIME24 (AI-MO, 2024), AIME25 (OpenCompass, 2025), HMMT25 (February and November) (Balunović et al., 2025); Textual Code Generation: HumanEval+ (Liu et al., 2023b), MBPP+ (Liu et al., 2023b), and LiveCodeBench (v6 only, Feb. 25 $\sim$ May 25) (Jain et al., 2024); Multimodal Math Reasoning: MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024), and WeMath (Qiao et al., 2025); Multimodal Logic Reasoning: LogicVista (Xiao et al., 2024) and VisuLogic (Xu et al., 2025b); Document Understanding: AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), DocVQA (Mathew et al., 2021), and InfoVQA (Mathew et al., 2022). Detailed information is in section C.1.

4.2 Single-Teacher and Multi-Teacher Distillation on LLMs and MLLMs

OPD is an effective and flexible paradigm for consolidating capabilities from one or multiple teachers into a unified student model, so we first evaluate Uni-OPD on both LLMs and MLLMs across diverse domains. Specifically, for LLMs, following G-OPD (Yang et al., 2026b), we conduct experiments on math reasoning and code generation. For MLLMs, we further consider three domains: math reasoning, logic reasoning, and document understanding.

Main results. As shown in Table 1, Uni-OPD achieves the best overall performance on LLM distillation under both single-teacher and multi-teacher settings. In single-teacher distillation, Uni-OPD consistently outperforms OPD and ExOPD, obtaining the highest average scores of 48.7 on math reasoning and 63.2 on code generation. More importantly, under multi-teacher distillation, Uni-OPD effectively merges the distinct capabilities of multiple teachers into a single student model, yielding average gains of 1.5 and 3.4 points over OPD on math reasoning and code generation, respectively.

A similar trend is observed for MLLMs in Table 2. Under single-teacher distillation, Uni-OPD outperforms OPD on average in all three domains, reaching 63.9 on math reasoning, 51.7 on logic reasoning, and 83.7 on document understanding. For multi-teacher distillation, Uni-OPD consistently outperforms OPD, improving the average score from 57.9 to 61.0 on math reasoning, from 50.0 to 52.5 on logic reasoning, and from 83.4 to 83.9 on document understanding. The consistent gains across settings validate the robustness of Uni-OPD.

4.3 Strong-to-Weak Distillation

Table 3: Results for strong-to-weak distillation setting under math reasoning and code generation benchmarks. The teacher model is Qwen3-30B-A3B-Instruct-2507, and the student models are the smaller Qwen3-4B and Qwen3-1.7B. Bold values indicate the best score within each group. Avg. denotes the average score within each domain.

Method	Math Reasoning					Code Generation
Method	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.	HumanEval+	MBPP+	LCB	Avg.
Teacher	72.1	61.4	42.5	57.1	58.3	81.9	77.2	23.4	60.8
Qwen3-4B Student
Student	23.0	19.3	12.3	9.2	15.9	77.4	65.3	17.7	53.5
OPD	56.5	46.4	28.5	33.4	41.2	82.9	72.4	21.6	59.0
Uni-OPD	55.9	50.2	29.8	35.6	42.9	83.1	71.3	28.0	60.8
Qwen3-1.7B Student
Student	13.9	11.1	5.6	4.9	8.9	61.9	53.4	11.9	42.4
OPD	35.7	27.6	17.2	14.6	23.8	67.1	56.7	23.4	49.1
Uni-OPD	35.2	30.7	17.7	16.4	25.0	71.5	58.6	28.0	52.7

Strong-to-weak distillation is particularly important for the practical post-training of small models (Bai et al., 2025a). We further investigate whether Uni-OPD can better facilitate the transfer of reasoning capabilities from a larger, stronger teacher model (e.g., Qwen3-30B-A3B-Instruct-2507) to significantly smaller students (e.g., Qwen3-4B and Qwen3-1.7B). In this setting, the student is trained on both math and code data, with teacher feedback provided across both domains, which can be viewed as a multi-teacher scenario.

Main results. The results for the strong-to-weak distillation setting are presented in Table 3. When distilling from the 30B teacher, Uni-OPD improves the domain-average scores of both the 4B and 1.7B students over standard OPD. Specifically, for the 4B student, Uni-OPD achieves average scores of 42.9 on math reasoning and 60.8 on code generation, surpassing standard OPD by 1.7 and 1.8 points, respectively. This trend holds even for the highly constrained 1.7B student, where Uni-OPD lifts performance to 25.0 on math reasoning and 52.7 on code generation, gains of 1.2 and 3.6 points over OPD. The improvement is not uniform across individual benchmarks: OPD remains slightly ahead on AIME 2024 for both students and on MBPP+ for the 4B student. Overall, these results indicate that Uni-OPD helps bridge the capacity gap, enabling smaller students to better absorb complex reasoning behaviors from a stronger teacher.
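The averages and gaps quoted above follow directly from the per-benchmark scores in Table 3. As a quick arithmetic check, the minimal sketch below recomputes the domain averages and the Uni-OPD minus OPD gaps. The dictionary layout and the helper names (`TABLE3`, `domain_avg`, `gain`) are illustrative assumptions and are not part of the released code.

```python
# Per-benchmark scores copied from Table 3.
# Math order: AIME 2024, AIME 2025, HMMT 25 Feb., HMMT 25 Nov.
# Code order: HumanEval+, MBPP+, LCB.
TABLE3 = {
    "Qwen3-4B": {
        "math": {"OPD": [56.5, 46.4, 28.5, 33.4], "Uni-OPD": [55.9, 50.2, 29.8, 35.6]},
        "code": {"OPD": [82.9, 72.4, 21.6], "Uni-OPD": [83.1, 71.3, 28.0]},
    },
    "Qwen3-1.7B": {
        "math": {"OPD": [35.7, 27.6, 17.2, 14.6], "Uni-OPD": [35.2, 30.7, 17.7, 16.4]},
        "code": {"OPD": [67.1, 56.7, 23.4], "Uni-OPD": [71.5, 58.6, 28.0]},
    },
}


def domain_avg(scores):
    """Mean of the benchmark scores, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


def gain(student, domain):
    """Uni-OPD average minus OPD average for one student and one domain."""
    rows = TABLE3[student][domain]
    return round(domain_avg(rows["Uni-OPD"]) - domain_avg(rows["OPD"]), 1)


if __name__ == "__main__":
    for student in TABLE3:
        for domain in ("math", "code"):
            print(student, domain, gain(student, domain))
```

Working from the rounded averages mirrors how the gaps are reported in the text; differencing the unrounded means could change the last digit in borderline cases.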

4.4 Cross-Modal Distillation

Table 4: Results for cross-modal distillation under textual code generation and multimodal math reasoning benchmarks. The student model is Qwen3-VL-4B-Instruct. The teacher models are developed from the same MLLM backbone via domain-specific RL on textual code and multimodal math domains, i.e., Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Math-RL, respectively. Bold values indicate the best score within each group. Avg. denotes the average score within each domain.

Method	Code Generation (Textual)				Math Reasoning (Multimodal)
Method	HumanEval+	MBPP+	LCB	Avg.	MathVision	DynaMath	WeMath	Avg.
Student	76.8	70.0	37.0	61.3	33.8	62.2	67.5	54.5
Teacher	82.2	70.5	40.1	64.3	47.2	65.3	79.5	64.0
OPD	83.1	70.6	38.6	64.1	46.1	65.4	76.6	62.7
Uni-OPD	84.1	71.4	41.3	65.6	46.6	66.5	78.5	63.9

Cross-modal distillation is an important yet underexplored setting in OPD. Unlike conventional distillation settings, where capability transfer typically occurs within the same modality, here we investigate whether textual and multimodal capabilities can be unified into a single student policy. Specifically, we use Qwen3-VL-4B-Instruct as the student model, and construct domain-specific teachers from the same MLLM backbone via RL on textual code data and multimodal math data, respectively. As a result, although the student is multimodal, one of the transferred capabilities is learned from a teacher specialized in a purely textual domain, enabling capability transfer across modality boundaries. This setting is beneficial for integrating and transferring cross-modal capabilities.

Main results. As shown in Table 4, Uni-OPD achieves consistent gains over standard OPD across both textual code generation and multimodal math reasoning in this cross-modal setting. Specifically, it improves the average score from 64.1 to 65.6 on code generation and from 62.7 to 63.9 on math reasoning. On the textual side, the gains are consistent across all three code benchmarks, with the largest improvement on LCB (38.6 $\rightarrow$ 41.3). On the multimodal side, Uni-OPD improves MathVision (46.1 $\rightarrow$ 46.6), DynaMath (65.4 $\rightarrow$ 66.5), and WeMath (76.6 $\rightarrow$ 78.5). These results suggest that Uni-OPD can effectively absorb and coordinate capabilities originating from both textual and multimodal domains within a unified student model, rather than improving one domain at the expense of the other. For a broader view of cross-modal distillation, we further provide results on code and logic reasoning in appendix D.

4.5 Ablation Study

Table 5: Results of Uni-OPD variants with a Qwen3-4B Student on math reasoning and code generation. We ablate core strategies (i.e., offline data balancing, online data balancing, and margin calibration) to assess their effectiveness using the Qwen3-4B-RL and Qwen3-30B-A3B-Instruct teacher models.

Configuration	Math Reasoning					Code Generation
Configuration	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.	HumanEval+	MBPP+	LCB	Avg.
Qwen3-4B-RL Teacher
OPD	60.9	55.2	33.4	38.3	47.0	86.3	70.9	23.4	60.2
Uni-OPD	62.3	57.2	34.9	39.6	48.5	88.0	72.6	30.1	63.6
w/o offline data balancing	62.6	56.5	32.5	38.5	47.5	88.0	71.1	27.9	62.3
w/o online data balancing	62.5	56.7	33.2	38.9	47.8	88.0	71.8	28.0	62.6
w/o margin calibration	63.0	54.7	33.4	38.1	47.3	86.4	71.6	25.7	61.2
Qwen3-30B-A3B-Instruct Teacher
OPD	56.5	46.4	28.5	33.4	41.2	82.9	72.4	21.6	59.0
Uni-OPD	55.9	50.2	29.8	35.6	42.9	83.1	71.3	28.0	60.8
w/o offline data balancing	57.1	46.3	28.8	36.8	42.2	80.6	70.3	28.0	59.6
w/o online data balancing	57.0	47.6	26.8	37.0	42.1	81.6	71.4	28.0	60.3
w/o margin calibration	54.9	48.1	29.1	35.8	42.0	82.8	70.4	25.7	59.6

In Table 5, we conduct ablation studies to evaluate the contribution of each strategy in Uni-OPD. The full recipe improves over vanilla OPD: the average gains reach +1.5/+3.4 points on math/code with the Qwen3-4B-RL teacher, and +1.7/+1.8 points with the Qwen3-30B-A3B-Instruct teacher. Removing any single component lowers both domain averages under both teachers. Offline and online data balancing address insufficient exploration: when either one is removed, the student policy is exposed to fewer diverse and challenging trajectories. Margin calibration improves supervision reliability: without it, token-level feedback can become misaligned with outcome rewards, leading to less stable training and suboptimal performance.
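To make the contribution of each component easier to read off Table 5, the sketch below computes how far each domain average falls when one component is removed from the full recipe. The averages are copied from the table; the structure `TABLE5_AVG` and the helper `ablation_drops` are illustrative assumptions.

```python
# Domain averages (math, code) copied from the Avg. columns of Table 5.
TABLE5_AVG = {
    "Qwen3-4B-RL": {
        "Uni-OPD": (48.5, 63.6),
        "w/o offline data balancing": (47.5, 62.3),
        "w/o online data balancing": (47.8, 62.6),
        "w/o margin calibration": (47.3, 61.2),
    },
    "Qwen3-30B-A3B-Instruct": {
        "Uni-OPD": (42.9, 60.8),
        "w/o offline data balancing": (42.2, 59.6),
        "w/o online data balancing": (42.1, 60.3),
        "w/o margin calibration": (42.0, 59.6),
    },
}


def ablation_drops(teacher):
    """Return {variant: (math_drop, code_drop)} relative to full Uni-OPD."""
    rows = TABLE5_AVG[teacher]
    full_math, full_code = rows["Uni-OPD"]
    return {
        name: (round(full_math - math, 1), round(full_code - code, 1))
        for name, (math, code) in rows.items()
        if name != "Uni-OPD"
    }


if __name__ == "__main__":
    for teacher in TABLE5_AVG:
        for name, (math_drop, code_drop) in ablation_drops(teacher).items():
            print(f"{teacher:24s} {name:28s} math -{math_drop} code -{code_drop}")
```

With the Qwen3-4B-RL teacher, removing margin calibration gives the largest drop in both domains (1.2 points on math, 2.4 on code). With the Qwen3-30B-A3B-Instruct teacher it gives the largest math drop and ties with offline data balancing on code.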

Table 6: Comparison of margin calibration strategies on math reasoning. We incorporate each strategy directly into OPD to examine which one benefits OPD training more.

Method	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.
Student (4B)	23.0	19.3	12.3	9.2	15.9
OPD	57.9	52.4	30.2	37.8	44.6
+ margin mask	62.3	56.2	34.3	38.1	47.7
+ margin shift	62.7	56.3	34.4	39.2	48.1

Margin mask vs. margin shift. Several strategies can calibrate the return signal to improve teacher supervision. In this work, we explore two simple variants, namely margin mask and margin shift. As shown in Table 6, directly incorporating either mechanism into OPD yields consistent gains over the baseline (average 44.6 $\rightarrow$ 47.7 with margin mask and 44.6 $\rightarrow$ 48.1 with margin shift), underscoring the necessity of reliable teacher supervision. Margin shift achieves slightly better results and is therefore adopted in our main experiments. More ablations are in section D.3.
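The order-consistency requirement behind margin calibration can be made concrete with a small sketch. The code below is a hypothetical illustration, not the authors' implementation. It assumes each rollout carries a list of per-token rewards and a binary outcome label, treats the trajectory return as the plain sum of token rewards, and lifts only the correct rollouts by a uniform per-token offset. The names `is_order_consistent` and `margin_shift`, and the `margin` and `tol` parameters, are invented for this example.

```python
from typing import List, Tuple

# (per-token rewards, outcome is correct) -- an assumed data layout.
Rollout = Tuple[List[float], bool]


def is_order_consistent(rollouts: List[Rollout], margin: float = 0.0,
                        tol: float = 1e-9) -> bool:
    """True if every correct rollout's summed reward exceeds every
    incorrect rollout's summed reward by at least `margin`."""
    correct = [sum(r) for r, ok in rollouts if ok]
    wrong = [sum(r) for r, ok in rollouts if not ok]
    if not correct or not wrong:
        return True  # nothing to compare within this group
    return min(correct) >= max(wrong) + margin - tol


def margin_shift(rollouts: List[Rollout], margin: float = 0.0) -> List[Rollout]:
    """Uniformly shift the token rewards of under-valued correct rollouts so
    that their return reaches (best incorrect return + margin)."""
    wrong = [sum(r) for r, ok in rollouts if not ok]
    if not wrong:
        return [(list(r), ok) for r, ok in rollouts]
    target = max(wrong) + margin
    shifted = []
    for rewards, ok in rollouts:
        gap = target - sum(rewards)
        if ok and gap > 0 and rewards:
            delta = gap / len(rewards)  # same offset for every token
            rewards = [x + delta for x in rewards]
        shifted.append((list(rewards), ok))
    return shifted


if __name__ == "__main__":
    group = [([0.0, 0.5, -0.25, 0.0], True), ([1.0, 1.0, 0.5], False)]
    print(is_order_consistent(group, margin=0.25))                # False
    print(is_order_consistent(margin_shift(group, 0.25), 0.25))   # True
```

A uniform offset changes the trajectory-level return while leaving the relative ordering of tokens within a rollout untouched, which is one reason a shift is a gentle way to restore the ordering. Other designs, such as lowering over-valued incorrect rollouts instead, are equally compatible with this sketch's assumptions.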

4.6 Qualitative Evaluation

To intuitively illustrate the effectiveness of our outcome-guided margin calibration, we use a token-level reward heatmap for visualization. As shown in Fig. 6, we display the two failure modes under the same question: the overestimation of incorrect trajectories (top-left) and the underestimation of correct trajectories (bottom-left). Each token is colored by its reward value: blue tokens indicate student-preferred $(r^{\mathrm{OPD}}_{t}<0)$, and red tokens indicate teacher-preferred $(r^{\mathrm{OPD}}_{t}>0)$, with saturation proportional to magnitude. On the top-left, an incorrect rollout still accumulates a high distillation return: most of its tokens are saturated red, since they fall on regions where the teacher dominates the student. On the bottom-left, a correct rollout receives a low distillation return: its tokens are already well-covered by the student, so the teacher provides little additional return (predominantly faint colors with some blue). The right column shows the same two rollouts after our outcome-guided margin calibration. Concretely, the per-token rewards are uniformly shifted so that the trajectory-level aggregation aligns with the outcome reward.
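The color convention described above (blue for negative token rewards, red for positive ones, saturation proportional to magnitude) can be sketched as a simple mapping from a reward to an RGB triple. This is an assumed rendering scheme for illustration only; the function name `reward_to_rgb` and the normalization by a caller-supplied `max_abs` are hypothetical and may differ from how Fig. 6 was actually produced.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def reward_to_rgb(reward: float, max_abs: float) -> RGB:
    """White at zero; fades toward pure red for positive rewards and toward
    pure blue for negative rewards, saturating at |reward| >= max_abs."""
    if max_abs <= 0 or reward == 0:
        return (255, 255, 255)
    saturation = min(abs(reward) / max_abs, 1.0)
    fade = round(255 * (1.0 - saturation))
    if reward > 0:
        return (255, fade, fade)  # teacher-preferred token
    return (fade, fade, 255)      # student-preferred token


def color_tokens(rewards: List[float]) -> List[RGB]:
    """Color one rollout, normalizing by its largest absolute token reward."""
    max_abs = max((abs(r) for r in rewards), default=0.0)
    return [reward_to_rgb(r, max_abs) for r in rewards]


if __name__ == "__main__":
    print(color_tokens([2.0, -1.0, 0.0]))
```

Normalizing per rollout makes faint and saturated tokens comparable within a single trajectory. Comparing two rollouts side by side, as in the figure, would call for a shared `max_abs` across both.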

4.7 Analysis and Takeaways

Based on our comprehensive and systematic study on both LLMs and MLLMs across single-teacher, multi-teacher, strong-to-weak, and cross-modal distillation settings, we deliver four takeaways to further advance OPD.

$\bullet$ Balancing reasoning capability and efficiency. Uni-OPD achieves the best performance with substantially fewer optimization steps than RL (Fig. 1), and consistently delivers strong reasoning capability across diverse domains (Tables 1–4, and D.1–D.3 in the Appendix).
$\bullet$ Teacher value comes from the capability gap, not absolute strength alone. In OPD, even with the same 4B backbone, a domain-specific RL teacher injects new capabilities and knowledge that drive the student to improve and even surpass the teacher (Tables 1 and 2). Moreover, our dual-perspective recipe further translates this gap into student gains, consistently boosting performance across all model sizes.
$\bullet$ OPD distills reasoning as a modality-agnostic capability. Trained jointly on textual and multimodal data, the multimodal student under Uni-OPD improves textual code generation and multimodal math/logic reasoning (Tables 4 and D.3). The per-token signal carries reasoning patterns largely independent of modality, enabling a unified, single-stage path that enhances both textual and multimodal reasoning within one multimodal model.
$\bullet$ OPD cleanly merges specialized capabilities, with related ones reinforcing each other. Beyond two teachers, Uni-OPD extends to three, jointly improving all capabilities (Tables 2 and D.2). OPD thus offers a scalable path for merging many specialists into one reasoner, with related ones synergizing via shared reasoning structure.

Reproducibility statement. To facilitate a clear understanding of our contributions and support broader adoption of our work, we provide extensive materials. In the main text, we detail the key components of our method in section 3 and report the main experimental results in section 4. In the supplementary materials, we further elaborate on Method Details (appendix A), Training Details (appendix B), and Evaluation Details (appendix C), which together should be sufficient to reproduce our results. All code, training data, complete scripts, and model checkpoints will be open-sourced upon publication to accelerate future research.

5 Conclusion and Future Work

In this paper, we present Uni-OPD, a unified OPD framework that generalizes across LLMs and MLLMs. We identify two key bottlenecks for effective OPD: insufficient student exploration of informative states and unreliable teacher supervision for student rollouts. To address them, we propose a dual-perspective optimization strategy: (i) offline difficulty-aware and online correctness-aware data balancing for student exploration, and (ii) outcome-guided margin calibration for teacher supervision. Extensive experiments on 16 benchmarks covering multi-teacher, strong-to-weak, and cross-modal settings demonstrate the effectiveness and versatility of Uni-OPD. We hope this work can provide a practical foundation for future research on scalable and reliable distillation across models, teachers, and modalities.

For future work, our findings suggest several promising directions: (1) extending Uni-OPD to larger-scale teacher distillation settings; (2) applying Uni-OPD to broader capability merging scenarios, such as agentic planning, tool use, and long-horizon decision making; and (3) uncovering the mechanistic principles of OPD, particularly how it shapes training dynamics and parameter geometry.

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §E.1.
R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations, Cited by: §E.3, §2.
AI-MO (2024) AIME 2024. Note: https://huggingface.co/datasets/AI-MO/aimo-validation-aime Cited by: 1st item, §4.1.
AI@Meta (2024a) Introducing Llama 3.1: our most capable models to date. Note: https://ai.meta.com/blog/meta-llama-3-1 Cited by: §E.1.
AI@Meta (2024b) Llama 3 model card. Note: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md Cited by: §E.1.
C. An, Z. Xie, X. Li, L. Li, J. Zhang, S. Gong, M. Zhong, J. Xu, X. Qiu, M. Wang, and L. Kong (2025) POLARIS: a post-training recipe for scaling reinforcement learning on advanced reasoning models. External Links: Link Cited by: §A.1, §3.3.
Anthropic (2023a) Claude 2. External Links: Link Cited by: §E.1.
Anthropic (2023b) Introducing Claude. External Links: Link Cited by: §E.1.
Anthropic (2024) The Claude 3 model family: Opus, Sonnet, Haiku. External Links: Link Cited by: §E.1.
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou (2023) Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. Cited by: §E.1.
S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. (2025a) Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. Cited by: §2, §4.1, §4.3.
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025b) Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. Cited by: §E.1.
M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev (2025) MathArena: evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281. Cited by: 2nd item, §4.1.
H. Bansal, D. S. Sachan, K. Chang, A. Grover, G. Ghosh, W. Yih, and R. Pasunuru (2025) Honeybee: data recipes for vision-language reasoners. arXiv preprint arXiv:2510.12225. Cited by: §2.
E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf (2023) Open LLM leaderboard. Note: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard Cited by: §D.2.
W. Bousselham, H. Kuehne, and C. Schmid (2025) VOLD: reasoning transfer from LLMs to vision-language models via on-policy distillation. arXiv preprint arXiv:2510.23497. Cited by: §E.3, §2.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems. Cited by: §E.1.
D. Cao, D. Fu, H. Yu, S. Zheng, X. Tan, and T. Jin (2026) X-OPD: cross-modal on-policy distillation for capability alignment in speech LLMs. arXiv preprint arXiv:2603.24596. Cited by: §E.3, §1, §2.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv. Cited by: §D.2.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §D.2.
G. Cui, L. Yuan, Z. Wang, H. Wang, Y. Zhang, J. Chen, W. Li, B. He, Y. Fan, T. Yu, et al. (2025) Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. Cited by: §B.2, §4.1.
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi (2023) InstructBLIP: towards general-purpose vision-language models with instruction tuning. Cited by: §E.1.
DeepSeek-AI (2026) DeepSeek-V4: towards highly efficient million-token context intelligence. Cited by: §1.
Y. Gu, L. Dong, F. Wei, and M. Huang (2023) MiniLLM: on-policy distillation of large language models. arXiv preprint arXiv:2306.08543. Cited by: §E.3.
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025a) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §E.1, §1, §2.
Y. Guo, W. Yang, Z. Sun, N. Ding, Z. Liu, and Y. Lin (2025b) Learning to focus: causal attention distillation via gradient-guided token pruning. arXiv preprint arXiv:2506.07851. Cited by: §2.
C. He, Y. Ding, J. Guo, R. Gong, H. Qin, and X. Liu (2025a) DA-KD: difficulty-aware knowledge distillation for efficient large language models. In Forty-second International Conference on Machine Learning, Cited by: §2.
Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025b) DeepMath-103K: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456. Cited by: §B.2, §4.1.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. In International Conference on Learning Representations, Cited by: §D.2.
G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
W. Hou, W. Liu, H. Hu, X. Sun, S. Yeung-Levy, and H. Fan (2026) Seeing is believing? a benchmark for multimodal large language models on visual illusions and anomalies. arXiv preprint arXiv:2602.01816. Cited by: §E.1.
J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802. Cited by: §E.3, §2.
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) GPT-4o system card. arXiv preprint arXiv:2410.21276. Cited by: §E.1.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024) LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974. Cited by: 3rd item, §4.1.
I. Jang, J. Yeom, J. Yeo, H. Lim, and T. Kim (2026) Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155. Cited by: §2.
W. Jin, T. Min, Y. Yang, S. R. Kadhe, Y. Zhou, D. Wei, N. Baracaldo, and K. Lee (2026) Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079. Cited by: §2.
A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi (2016) A diagram is worth a dozen images. In European conference on computer vision, pp. 235–251. Cited by: 1st item, §4.1.
M. Kim and S. J. Baek (2026) Explain in your own words: improving reasoning via token-selective dual knowledge distillation. arXiv preprint arXiv:2603.13260. Cited by: §2.
J. Ko, S. Abdali, Y. J. Kim, T. Chen, and P. Cameron (2026) Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137. Cited by: §2.
J. Ko, T. Chen, S. Kim, T. Ding, L. Liang, I. Zharkov, and S. Yun (2025) DistiLLM-2: a contrastive approach boosts the distillation of LLMs. arXiv preprint arXiv:2503.07067. Cited by: §2.
K. Kujanpää, P. Marttinen, H. Valpola, and A. Ilin (2024) Efficient knowledge injection in LLMs via self-distillation. arXiv preprint arXiv:2412.14964. Cited by: §2.
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, Cited by: §A.1.
X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia (2024) LISA: reasoning segmentation via large language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
H. Levesque, E. Davis, and L. Morgenstern (2012) The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, Cited by: §D.2.
J. Li, H. Yin, H. Xu, B. Xu, W. Tan, Z. He, J. Ju, Z. Luo, and J. Luan (2026a) Video-OPD: efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation. arXiv preprint arXiv:2602.02994. Cited by: §E.3, §1, §2.
J. Li, S. Yang, S. Wu, H. Shi, C. Zheng, H. Xu, and J. Jia (2025) Logits-based finetuning. arXiv preprint arXiv:2505.24461. Cited by: §E.1.
Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, et al. (2026b) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Cited by: §E.3, §2.
S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In ACL, Cited by: §D.2.
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024a) DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. Cited by: §E.1.
H. Liu, C. Li, Y. Li, and Y. J. Lee (2024b) Improved baselines with visual instruction tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee (2024c) LLaVA-NeXT: improved reasoning, OCR, and world knowledge. External Links: Link Cited by: §E.1.
H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023a) Visual instruction tuning. Advances in neural information processing systems. Cited by: §E.1.
J. Liu, C. Zhang, J. Guo, Y. Zhang, H. Que, K. Deng, Z. Bai, J. Liu, G. Zhang, J. Wang, et al. (2024d) DDK: distilling domain knowledge for efficient large language models. Advances in Neural Information Processing Systems 37, pp. 98297–98319. Cited by: §2.
J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023b) Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems 36, pp. 21558–21572. Cited by: 1st item, 2nd item, §4.1.
L. Liu and M. Zhang (2025) Less is more: selective reflection for compatible and efficient knowledge distillation in large language models. arXiv preprint arXiv:2508.06135. Cited by: §2.
Y. Liu, J. Cui, Z. Tian, S. Yang, Q. He, X. Wang, and J. Su (2024e) Typicalness-aware learning for failure detection. arXiv preprint arXiv:2411.01981. Cited by: §E.1.
K. Lu and Thinking Machines Lab (2025) On-policy distillation. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/on-policy-distillation External Links: Document Cited by: §1, §2.
A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque (2022) ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pp. 2263–2279. Cited by: §B.2, 2nd item, §4.1.
M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar (2022) InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706. Cited by: §B.2, 4th item, §4.1.
M. Mathew, D. Karatzas, and C. Jawahar (2021) DocVQA: a dataset for VQA on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2200–2209. Cited by: 3rd item, §4.1.
Y. Meng, M. Xia, and D. Chen (2024) SimPO: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems. Cited by: §D.2.
OpenAI (2023) GPT-4V(ision) system card. Cited by: §E.1.
OpenCompass (2025) AIME 2025. Note: https://huggingface.co/datasets/opencompass/AIME2025 Cited by: §4.1.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35. Cited by: §3.4.
C. M. Patiño, K. Rasul, Q. Gallouédec, B. Burtenshaw, S. Paniego, V. Srivastav, T. Frere, E. Beeching, L. Tunstall, L. von Werra, and T. Wolf (2025) Unlocking on-policy distillation for any model family. Cited by: §2.
S. Peng, W. Wang, Z. Tian, S. Yang, X. W, H. Xu, C. Zhang, T. Isobe, B. Hu, and M. Zhang (2026) Uni-DPO: a unified paradigm for dynamic preference optimization of LLMs. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §D.2, §E.1.
S. Peng, S. Yang, L. Jiang, and Z. Tian (2025) Mitigating object hallucinations via sentence-level early intervention. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §E.1.
R. Qiao, Q. Tan, G. Dong, M. MinhuiWu, C. Sun, X. Song, J. Wang, Z. Gongque, S. Lei, Y. Zhang, et al. (2025) We-math: does your large multimodal model achieve human-like mathematical reasoning?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 20023–20070. Cited by: 3rd item, §4.1.
L. Qin, Q. Chen, Y. Zhou, Z. Chen, Y. Li, L. Liao, M. Li, W. Che, and P. S. Yu (2025) A survey of multilingual large language models. Patterns 6 (1). Cited by: §1.
T. Qu, L. Tang, B. Peng, S. Yang, B. Yu, and J. Jia (2025) Does your vision-language model get lost in the long video sampling dilemma?. arXiv preprint arXiv:2503.12496. Cited by: §E.1.
Y. Qu, A. Setlur, V. Smith, R. Salakhutdinov, and A. Kumar (2026) POPE: learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779. Cited by: §2.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, Cited by: §E.1.
T. Shao, Z. Tian, H. Zhao, and J. Su (2024a) Explore the potential of CLIP for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, Cited by: §E.1.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024b) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §E.2, §1.
I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026) Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897. Cited by: §E.3, §2.
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §B.1.
M. Song and M. Zheng (2026) A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626. Cited by: §1, §2.
A. Stein, F. Huang, and T. Goldstein (2026) GATES: self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574. Cited by: §2.
A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §D.2.
Gemini Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. (2024) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. Cited by: §E.1.
Hunyuan Vision Team, P. Lyu, X. Wan, G. Li, S. Peng, W. Wang, L. Wu, H. Shen, Y. Zhou, C. Tang, Q. Yang, Q. Peng, B. Luo, H. Yang, X. Zhang, J. Zhang, H. Peng, H. Yang, S. Xie, L. Zhou, G. Pei, B. Wu, K. Wu, J. Yang, B. Wang, K. Liu, J. Zhu, J. Jiang, Linus, H. Hu, and C. Zhang (2025) HunyuanOCR technical report. Cited by: §E.1.
Kimi Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026) Kimi k2.5: visual agentic intelligence. arXiv preprint arXiv:2602.02276. Cited by: §1, §2.
Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia (2019) Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §E.1.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §E.1.
J. Wang, B. Chen, Y. Li, B. Kang, Y. Chen, and Z. Tian (2025) DeCLIP: decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, Cited by: §E.1.
K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li (2024a) Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems 37, pp. 95095–95169. Cited by: 1st item, §4.1.
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024b) Qwen2-VL: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191. Cited by: §E.1.
Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang (2026) OpenClaw-RL: train any agent simply by talking. arXiv preprint arXiv:2603.10165. Cited by: §E.2, §2.
Y. Wu, S. Han, and H. Cai (2026) Lightning OPD: efficient post-training for large reasoning models with offline on-policy distillation. arXiv preprint arXiv:2604.13010. Cited by: §1.
B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780. Cited by: §E.3, §1, §2, §2.
Y. Xiao, E. Sun, T. Liu, and W. Wang (2024) LogicVista: multimodal LLM logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973. Cited by: 1st item, §4.1.
J. Xiong, H. Shen, S. Gong, Y. Cheng, J. Shen, C. Tao, H. Tan, H. Bai, L. Shang, and N. Wong (2026) OVD: on-policy verbal distillation. arXiv preprint arXiv:2601.21968. Cited by: §2.
H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu, S. Kang, J. Ji, Y. Zhang, et al. (2025a) RedStar: does scaling long-cot data unlock better slow-reasoning systems?. arXiv preprint arXiv:2501.11284. Cited by: §1.
W. Xu, J. Wang, W. Wang, Z. Chen, W. Zhou, A. Yang, L. Lu, H. Li, X. Wang, X. Zhu, et al. (2025b) VisuLogic: a benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279. Cited by: 2nd item, §4.1.
X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, and T. Zhou (2024) A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116. Cited by: §2.
Y. Xu, H. Sang, Z. Zhou, R. He, and Z. Wang (2026) PACED: distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178. Cited by: §2.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §E.1, §4.1.
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024a) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: §E.1.
C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan (2026a) Self-distilled RLVR. arXiv preprint arXiv:2604.03128. Cited by: §E.2, §2.
S. Yang, Y. Chen, Z. Tian, C. Wang, J. Li, B. Yu, and J. Jia (2024b) VisionZip: longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467. Cited by: §E.1.
S. Yang, J. Liu, R. Zhang, M. Pan, Z. Guo, X. Li, Z. Chen, P. Gao, Y. Guo, and S. Zhang (2023a) LiDAR-LLM: exploring the potential of large language models for 3d LiDAR understanding. arXiv preprint arXiv:2312.14074. Cited by: §E.1.
S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia (2023b) An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240. Cited by: §E.1.
S. Yang, Z. Tian, L. Jiang, and J. Jia (2024c) Unified language-driven zero-shot domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin (2026b) Learning beyond teacher: generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125. Cited by: §B.1, §E.3, §1, §4.1, §4.2.
Z. Yang, Z. Liu, Y. Chen, W. Dai, B. Wang, S. Lin, C. Lee, Y. Chen, D. Jiang, J. He, et al. (2026c) Nemotron-Cascade 2: post-training LLMs with cascade RL and multi-domain on-policy distillation. arXiv preprint arXiv:2603.19220. Cited by: §1, §2.
T. Ye, L. Dong, Z. Chi, X. Wu, S. Huang, and F. Wei (2025) Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643. Cited by: §E.3, §2.
T. Ye, L. Dong, X. Wu, S. Huang, and F. Wei (2026) On-policy context distillation for language models. arXiv preprint arXiv:2602.12275. Cited by: §E.3, §2.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Cited by: §D.2.
A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026) GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763. Cited by: §1, §2.
D. Zhang, Z. Yang, S. Janghorbani, J. Han, A. Ressler II, Q. Qian, G. D. Lyng, S. S. Batra, and R. E. Tillman (2026a) Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260. Cited by: §E.3, §2.
K. Zhang, B. Li, P. Zhang, F. Pu, J. A. Cahyono, K. Hu, S. Liu, Y. Zhang, J. Yang, C. Li, et al. (2025a) LMMs-Eval: reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, Cited by: §C.2.
K. Zhang, K. Wu, Z. Yang, B. Li, K. Hu, B. Wang, Z. Liu, X. Li, and L. Bing (2025b) OpenMMReasoner: pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334. Cited by: §2, §4.1.
S. Zhang, X. Zhang, T. Zhang, B. Hu, Y. Chen, and J. Xu (2026b) KDFlow: a user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint arXiv:2603.01875. Cited by: §E.3, §2.
Y. Zhang, B. Ni, X. Chen, H. Zhang, Y. Rao, H. Peng, Q. Lu, H. Hu, M. Guo, and S. Hu (2025c) Bee: a high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795. Cited by: §2.
S. Zhao, Z. Wang, X. Zhao, J. Zhou, C. Xu, C. Liu, L. Zhang, Y. Jia, Y. Zhang, H. Yu, et al. (2026a) Large language model post-training: a unified view of off-policy and on-policy learning. arXiv preprint arXiv:2604.07941. Cited by: §1.
S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026b) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Cited by: §E.3, §2.
C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Cited by: §E.2.
Z. Zhong, C. Wang, Y. Liu, S. Yang, L. Tang, Y. Zhang, J. Li, T. Qu, Y. Li, Y. Chen, et al. (2024) Lyra: an efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501. Cited by: §E.1.
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023a) LIMA: less is more for alignment. Advances in Neural Information Processing Systems 36, pp. 55006–55021. Cited by: §3.3.
G. Zhou, H. Bao, J. Huang, J. Deng, J. Zhang, J. She, K. Cai, L. Ren, L. Ren, Q. Luo, et al. (2025) OpenOneRec technical report. arXiv preprint arXiv:2512.24762. Cited by: §1.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023b) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. Cited by: §D.2.
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny (2023) MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592. Cited by: §E.1.
C. Zou, X. Guo, R. Yang, J. Zhang, B. Hu, and H. Zhang (2024) Dynamath: a dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836. Cited by: 2nd item, §4.1.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Supplementary Material

Appendix Outline

This material provides supplementary details to the main paper, including the following sections:

- (A) Method Details
  - (A.1) Offline Difficulty-Aware Data Balancing
  - (A.2) Online Correctness-Aware Data Balancing
  - (A.3) Order Consistency of Trajectory-level Returns
  - (A.4) Outcome-Guided Margin Calibration
- (B) Training Details
  - (B.1) Training Setup
  - (B.2) Training Data
  - (B.3) Training Reward Acquisition
  - (B.4) Training Pseudocode
  - (B.5) Training Dynamics
  - (B.6) Training Complexity
- (C) Evaluation Details
  - (C.1) Evaluation Benchmarks
  - (C.2) Evaluation Setup
- (D) Further Evaluations
  - (D.1) More Evaluation Results
  - (D.2) Downstream Task Evaluation
  - (D.3) Further Ablation
- (E) Related Work
  - (E.1) Multimodal Large Language Models
  - (E.2) Reinforcement Learning
  - (E.3) On-Policy Distillation
- (F) Case Studies

Appendix A Method Details

In this section, we provide a detailed exposition of the key components of our proposed Uni-OPD framework, including its formulations and implementations.

A.1 Offline Difficulty-Aware Data Balancing

In this section, we provide a detailed description of our offline difficulty-aware data balancing strategy.

Offline rollout sampling. Before training, we perform a one-time offline rollout pass over the entire training set using the student model (e.g., Qwen3-4B). For each training instance, the student is prompted to generate $N\!=\!8$ independent candidate responses, which serve as the basis for subsequent difficulty estimation.

The rollouts are produced with vLLM (Kwon et al., 2023) under the same prompt template that will later be used at training time, so that the estimated difficulty reflects the actual input format the student will see. The decoding configuration is kept fixed throughout this offline phase: we use temperature $=1.0$, top-$p=0.95$, top-$k=50$, and a maximum response length of $16{,}384$ tokens. For each instance, we then verify the correctness of its $N$ candidate responses with the task-specific verifier (section B.3) and record the number of correct ones, denoted $k$ (not to be confused with the top-$k$ decoding parameter). The resulting empirical pass rate $k/N$ serves as our proxy for the instance’s difficulty: a lower pass rate indicates a harder example, while a higher pass rate indicates an easier one.
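As a minimal sketch of the difficulty proxy described above, the snippet below turns per-instance verifier outcomes into empirical pass rates $k/N$. The function name `pass_rates`, the dictionary layout, and the toy data are illustrative assumptions; the actual rollout and verification pipeline is not shown.

```python
from typing import Dict, List


def pass_rates(correct_flags: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each instance id to its empirical pass rate k / N.

    `correct_flags[instance_id]` holds one verifier verdict per rollout
    (hypothetical layout, chosen for illustration only).
    """
    rates = {}
    for instance_id, flags in correct_flags.items():
        if not flags:
            raise ValueError(f"no rollouts recorded for {instance_id!r}")
        k = sum(flags)  # number of verified-correct responses
        rates[instance_id] = k / len(flags)
    return rates


# Toy verifier outcomes for N = 8 rollouts per instance (made-up data).
flags = {
    "easy": [True] * 8,
    "medium": [True, False, True, False, False, True, False, False],
    "hard": [False] * 8,
}
rates = pass_rates(flags)
```

A lower value in `rates` marks a harder instance, matching the convention in the text.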

Limitations of aggressive difficulty filtering. Prior work on online RL optimization, such as GRPO, often applies a heuristic filter before training that simply discards “trivial” samples such as all-correct cases, because these instances yield zero advantage and therefore provide essentially no learning signal. POLARIS (An et al., 2025), for example, reports that removing the easiest samples leads to consistent performance gains, and argues that keeping an unfiltered dataset can actively hinder training.

In the token-level reward OPD setting, however, we find that such aggressive filtering is, in fact, counterproductive. Empirically, removing any specific difficulty tier, whether the easiest or the hardest, consistently hurts final performance. A plausible explanation is that each tier contributes a distinct pattern of token-level credit: easy instances calibrate the student’s baseline behavior, intermediate instances provide the richest contrastive signals between correct and incorrect trajectories, and hard instances expose the student to diverse, non-trivial solution paths. Dropping any tier, therefore, both distorts the overall distribution of token-level credit and narrows the space of solution patterns to which the student is exposed.

Difficulty-aware data balancing. Motivated by this observation, we adopt a difficulty-aware balancing scheme that deliberately preserves the full spectrum of difficulty while reweighting its different regions, rather than truncating them. Concretely, after the offline rollout pass, we examine the empirical distribution over the number of correct responses out of $N$ . Across our training sources, we observe two recurring shapes: (i) a U-shaped distribution, where both very easy and very hard instances dominate while intermediate ones are sparse; and (ii) a mirrored-J-shaped distribution, where easy instances dominate and the mass decays toward the hard end.

We treat the two shapes slightly differently. For U-shaped distributions, we upsample instances of intermediate difficulty, namely those with $1$ – $7$ correct responses out of $N=8$ , so as to fill in the under-represented middle region. For mirrored-J-shaped distributions, we instead upsample all non-trivial instances, i.e., everything with $1$ – $8$ correct responses, to counteract the long tail of easy samples. In both cases, the effect of the reweighting is to flatten the overall difficulty distribution and to ensure that the token-level credit signals arriving during training are more evenly spread across difficulty levels. Empirically, we find that this simple rebalancing consistently leads to better final performance than either no filtering or the conventional drop-the-easy-cases strategy.
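The reweighting can be sketched as follows, under several assumptions that the text does not specify: the distribution shape is passed in by the caller rather than detected automatically, upsampling is done by repeating instances an integer number of times, and the factor `factor=2` is purely hypothetical. The upsampled ranges follow the ones stated above.

```python
import random
from typing import Dict, List


def rebalance(num_correct: Dict[str, int], shape: str,
              factor: int = 2, n: int = 8, seed: int = 0) -> List[str]:
    """Return a training list in which the targeted difficulty range is
    repeated `factor` times. No instance is ever dropped."""
    if shape == "U":
        lo, hi = 1, n - 1   # intermediate difficulty: 1..7 correct out of 8
    elif shape == "mirrored-J":
        lo, hi = 1, n       # range as stated in the text: 1..8 correct out of 8
    else:
        raise ValueError(f"unknown shape: {shape!r}")

    out: List[str] = []
    for instance_id, k in num_correct.items():
        copies = factor if lo <= k <= hi else 1
        out.extend([instance_id] * copies)
    random.Random(seed).shuffle(out)  # avoid long runs of duplicated prompts
    return out


# Toy counts of correct responses out of N = 8 (made-up data).
counts = {"a": 0, "b": 4, "c": 8}
u_list = rebalance(counts, "U", factor=3)
j_list = rebalance(counts, "mirrored-J", factor=3)
```

Because instances outside the targeted range keep a single copy, the full difficulty spectrum is preserved, which is the property the paragraph above emphasizes.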

A.2 Online Correctness-Aware Data Balancing

In this section, we detail the online correctness-aware data balancing strategy that operates during rollout. While the offline difficulty-aware balancing in section A.1 controls the prompt-level difficulty distribution before training, the composition of correct and incorrect trajectories within a rollout group still varies dramatically as the student evolves. This subsection describes how we regulate such intra-group composition online.

Motivation. In OPD, for each prompt $\bm{q}$ we sample $G$ on-policy trajectories $\{\bm{\tau}_{i}\}_{i=1}^{G}$ and split them into a positive set $S_{+}(\bm{q})$ and a negative set $S_{-}(\bm{q})$ based on the outcome reward $R_{i}$. As training proceeds, many prompts exhibit degenerate outcome distributions: either $|S_{-}(\bm{q})|\!\ll\!G$ (the student nearly masters $\bm{q}$) or $|S_{+}(\bm{q})|\!\ll\!G$ (the student often fails on $\bm{q}$). In both cases, the outcome-level contrast weakens, and in the limit where one set is empty it vanishes altogether: the prompt-level margin $m(\bm{q})$ is then undefined, so the outcome-guided margin calibration in section A.4 cannot provide any corrective signal. If left unregulated, such degenerate groups dominate the batch and drive the student into local optima with shrinking exploration.

Online correctness-aware balancing. To preserve sufficient outcome diversity throughout training, we maintain a target correct-to-total ratio $\gamma^{\star}\!\in\!(0,1)$ at the batch level (we use $\gamma^{\star}\!\approx\!0.5$ by default, so positive and negative trajectories are roughly balanced). At each training step, given a freshly rolled-out batch $\mathcal{B}$ , we let $\gamma(\mathcal{B})=\sum_{\bm{\tau}_{i}\!\in\!\mathcal{B}}\mathbf{1}\{R_{i}\!=\!1\}/|\mathcal{B}|$ denote the current correct-to-total ratio across the whole batch. Whenever $|\gamma(\mathcal{B})-\gamma^{\star}|\!>\!\epsilon$ for a tolerance $\epsilon$ , we downweight the over-represented side (correct or incorrect trajectories) by subsampling within each group, so that the overall batch ratio is pulled back to the $\gamma^{\star}\!\pm\!\epsilon$ interval. Subsampling is performed uniformly inside each group, which keeps the intra-group difficulty distribution intact and avoids biasing the prompt-level difficulty spectrum inherited from offline balancing.
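A minimal sketch of this batch-level balancing is given below. The data layout (a batch as a list of groups of `(trajectory_id, reward)` pairs), the tolerance `eps=0.05`, the per-group rounding, and the choice to leave a batch untouched when one side is empty are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Trajectory = Tuple[str, int]  # (trajectory id, binary outcome reward R_i)
Batch = List[List[Trajectory]]  # one inner list per prompt (rollout group)


def correct_ratio(batch: Batch) -> float:
    """gamma(B): fraction of trajectories in the batch with R_i = 1."""
    flat = [r for group in batch for _, r in group]
    return sum(flat) / len(flat)


def balance_batch(batch: Batch, target: float = 0.5, eps: float = 0.05,
                  seed: int = 0) -> Batch:
    """Subsample the over-represented side uniformly inside each group."""
    rng = random.Random(seed)
    n_all = sum(len(group) for group in batch)
    n_pos = sum(r for group in batch for _, r in group)
    n_neg = n_all - n_pos
    if n_all == 0 or n_pos == 0 or n_neg == 0:
        return batch  # assumption: nothing sensible to balance against
    if abs(n_pos / n_all - target) <= eps:
        return batch  # already inside the tolerance band

    over = 1 if n_pos / n_all > target else 0  # reward value of the majority side
    if over == 1:
        keep_frac = target * n_neg / ((1 - target) * n_pos)
    else:
        keep_frac = (1 - target) * n_pos / (target * n_neg)

    balanced: Batch = []
    for group in batch:
        majority = [t for t in group if t[1] == over]
        minority = [t for t in group if t[1] != over]
        n_keep = round(keep_frac * len(majority))  # per-group rounding (assumption)
        balanced.append(minority + rng.sample(majority, n_keep))
    return balanced


# Toy batch: 6 correct and 2 incorrect trajectories, so gamma = 0.75.
batch = [
    [("a1", 1), ("a2", 1), ("a3", 1), ("a4", 1)],
    [("b1", 1), ("b2", 1), ("b3", 0), ("b4", 0)],
]
balanced = balance_batch(batch)
```

Per-group rounding means the realized ratio can miss the target slightly on real batches; a production implementation would presumably correct for this, but the text does not say how.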

A.3 Order Consistency of Trajectory-level Returns

This section provides a brief explanation for the order-consistency conditions in Eqs. (7) and (8) of the main paper. The key observation is two-fold. First, treating the entire reasoning rollout as a single macro-action gives $G_{\mathrm{RL}}({\bm{q}},\bm{\tau})\!=\!R({\bm{q}},\bm{\tau})$ , so $G_{\mathrm{RL}}$ respects the outcome-induced ordering by construction. Second, under the distillation premise, the trajectory-level distillation return $G_{\mathrm{OPD}}({\bm{q}},\bm{\tau})$ is expected to preserve the same ordering, although this is a desideratum rather than a definitional consequence.

Trajectory-as-one-action view of outcome-based RL. In outcome-based RL for reasoning, supervision is provided only at the trajectory level: a rollout $\bm{\tau}$ receives a single scalar reward $R(\bm{q},\bm{\tau})$ determined by the final answer. Under this view, the trajectory-level return reduces to the outcome reward itself, i.e.,

$$G_{\mathrm{RL}}(\bm{q},\bm{\tau})=R(\bm{q},\bm{\tau})\,. \tag{13}$$

Order consistency under binary rewards. For the binary outcome reward adopted in this work, any $\bm{\tau}_{+}\!\in\!S_{+}(\bm{q})$ satisfies $R(\bm{q},\bm{\tau}_{+})\!=\!1$ , while any $\bm{\tau}_{-}\!\in\!S_{-}(\bm{q})$ satisfies $R(\bm{q},\bm{\tau}_{-})\!=\!0$ . Combined with Eq. 13, we have

$$G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{+})=1\;\geq\;0=G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{-})\,, \tag{14}$$

for all $\bm{\tau}_{+}\!\in\!S_{+}(\bm{q})$ and $\bm{\tau}_{-}\!\in\!S_{-}(\bm{q})$ , which recovers Eq. 7 directly.

Extension to soft outcome rewards. The same argument extends to soft outcome rewards, where $R(\bm{q},\bm{\tau})\!\in\![0,1]$ (or any bounded interval) measures a graded notion of correctness, e.g., partial credit or a verifier’s confidence score. As long as the trajectory partition is defined by thresholding the outcome reward, i.e., $S_{+}(\bm{q})\!=\!\{\bm{\tau}\mid R(\bm{q},\bm{\tau})\!\geq\!\eta\}$ and $S_{-}(\bm{q})\!=\!\{\bm{\tau}\mid R(\bm{q},\bm{\tau})\!<\!\eta\}$ for some threshold $\eta$ , then by Eq. 13 every positive trajectory attains a return no smaller than that of any negative trajectory, and Eq. 7 still holds. In particular, the binary case is recovered as the special instance $\eta\!=\!1$ , $R\!\in\!\{0,1\}$ .
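To spell out the one-line argument behind this extension (a worked restatement that uses only Eq. 13 and the thresholded partition above, not an additional result): for any $\bm{\tau}_{+}\!\in\!S_{+}(\bm{q})$ and $\bm{\tau}_{-}\!\in\!S_{-}(\bm{q})$,

$$G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{+})=R(\bm{q},\bm{\tau}_{+})\;\geq\;\eta\;>\;R(\bm{q},\bm{\tau}_{-})=G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{-})\,,$$

so the ordering is in fact strict whenever both sets are non-empty.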

From RL return to distillation return. The distillation return $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ defined in Eq. 5 plays the same role for OPD training as $G_{\mathrm{RL}}$ does for outcome-based RL: it is the trajectory-level supervision signal broadcast to all tokens in the rollout. The distillation premise in section 3.4 posits that, relative to the student, the teacher assigns a higher log-likelihood to correct trajectories than incorrect ones. In other words, the teacher’s trajectory-level preference is expected to be aligned with the outcome reward, so that $G_{\mathrm{OPD}}$ should inherit the same outcome-level ordering as $G_{\mathrm{RL}}$ , leading to Eq. 8. Unlike the RL return, however, $G_{\mathrm{OPD}}$ is derived from the teacher–student log-probability gap rather than the outcome reward itself, so the ordering is a desired property rather than a guaranteed one. The order-consistency condition in Eq. 8 provides a principled target, and subsequent margin mask and margin shift strategies (section A.4) are designed to enforce it whenever the teacher’s supervision violates this property in practice.

A.4 Outcome-Guided Margin Calibration

Algorithm 1 Greedy Margin Mask

Inputs: Prompt $\bm{q}$ with rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$, outcome rewards $\{R_{i}\}_{i=1}^{G}$ with $R_{i}\!\in\!\{0,1\}$, min retention ratio $\rho$, trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$, target margin $\delta$, mode $\in\{\mathrm{MinMax},\mathrm{Mean}\}$
Output: Keep-mask $\{k_{i}\}_{i=1}^{G}\in\{0,1\}^{G}$ $\triangleright$ $k_{i}\!=\!1$ means “keep trajectory $\bm{\tau}_{i}$” and $k_{i}\!=\!0$ means “drop it”.

Notation: For any two subsets $A\!\subseteq\!S_{+}(\bm{q})$ and $B\!\subseteq\!S_{-}(\bm{q})$, we define the prompt-level margin

$$\textsc{Margin}(A,B;\mathrm{MinMax})=\min_{\bm{\tau}\in A}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in B}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$$

$$\textsc{Margin}(A,B;\mathrm{Mean})=\operatorname*{mean}_{\bm{\tau}\in A}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\operatorname*{mean}_{\bm{\tau}\in B}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$$

function GreedyMarginMask($\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\rho,\text{mode}$)
    $\triangleright$ Step 1: split the group by outcome correctness.
    $S_{+}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=1\}$, $S_{-}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=0\}$
    $N_{+}\leftarrow|S_{+}(\bm{q})|$, $N_{-}\leftarrow|S_{-}(\bm{q})|$
    $k_{i}\leftarrow 1,\quad\forall i=1,\ldots,G$ $\triangleright$ initialize: keep all trajectories
    if $N_{+}=0$ or $N_{-}=0$ then
        return $\{k_{i}\}_{i=1}^{G}$ $\triangleright$ ordering is not defined; no masking
    end if

    $\triangleright$ Step 2: sort each side so that the most ordering-violating trajectory is at the front.
    $L_{+}(\bm{q})\leftarrow$ sort $S_{+}(\bm{q})$ by $G_{\mathrm{OPD}}(\bm{q},\cdot)$ ascending $\triangleright$ $L_{+}(\bm{q})[1]$ = correct trajectory with lowest return
    $L_{-}(\bm{q})\leftarrow$ sort $S_{-}(\bm{q})$ by $G_{\mathrm{OPD}}(\bm{q},\cdot)$ descending $\triangleright$ $L_{-}(\bm{q})[1]$ = incorrect trajectory with highest return

    $\triangleright$ Step 3: iteratively drop the trajectory whose removal increases the margin the most.
    while $\textsc{Margin}(L_{+}(\bm{q}),L_{-}(\bm{q});\text{mode})<\delta$ do
        if $|L_{+}(\bm{q})|\leq\lceil\rho N_{+}\rceil$ and $|L_{-}(\bm{q})|\leq\lceil\rho N_{-}\rceil$ then
            break $\triangleright$ minimum retention ratio reached on both sides
        end if

        $\triangleright$ Margin gain when the worst correct trajectory $L_{+}(\bm{q})[1]$ is dropped.
        $\Delta_{+}\leftarrow\textsc{Margin}(L_{+}(\bm{q})\!\setminus\!\{L_{+}(\bm{q})[1]\},L_{-}(\bm{q});\text{mode})-\textsc{Margin}(L_{+}(\bm{q}),L_{-}(\bm{q});\text{mode})$
        $\triangleright$ Margin gain when the best incorrect trajectory $L_{-}(\bm{q})[1]$ is dropped.
        $\Delta_{-}\leftarrow\textsc{Margin}(L_{+}(\bm{q}),L_{-}(\bm{q})\!\setminus\!\{L_{-}(\bm{q})[1]\};\text{mode})-\textsc{Margin}(L_{+}(\bm{q}),L_{-}(\bm{q});\text{mode})$
        if $\max(\Delta_{+},\Delta_{-})\leq 0$ then
            break $\triangleright$ no single removal can further improve the margin
        end if
        if $\Delta_{+}>\Delta_{-}$ and $|L_{+}(\bm{q})|>\lceil\rho N_{+}\rceil$ then
            $\bm{\tau}_{\mathrm{drop}}\leftarrow\textsc{PopFront}(L_{+}(\bm{q}))$ $\triangleright$ greedy drop on the positive side
        else
            $\bm{\tau}_{\mathrm{drop}}\leftarrow\textsc{PopFront}(L_{-}(\bm{q}))$ $\triangleright$ greedy drop on the negative side
        end if
        $k_{\,\mathrm{idx}(\bm{\tau}_{\mathrm{drop}})}\leftarrow 0$ $\triangleright$ exclude this trajectory from the subsequent gradient update
    end while
    return $\{k_{i}\}_{i=1}^{G}$
end function

In this section, we describe the details of the two outcome-guided margin calibration strategies introduced in section 3.4: Margin Mask and Margin Shift. Both strategies operate on the trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$ within a rollout group of a prompt $\bm{q}$ , with the common goal of enforcing the order-consistency condition $m(\bm{q})\!\geq\!\delta$ (Eq. 10). They differ in how they repair violations: Margin Mask removes the most adversarial trajectories until the condition holds, whereas Margin Shift applies a minimal additive correction to restore the margin in closed form.

Margin choices: MinMax vs. Mean. Following the prompt-level margin in Eq. 9, we define the margin between $S_{+}(\bm{q})$ and $S_{-}(\bm{q})$ in two modes: the MinMax mode uses $\min_{\bm{\tau}\in S_{+}}\!{G}_{\mathrm{OPD}}\!-\!\max_{\bm{\tau}\in S_{-}}\!{G}_{\mathrm{OPD}}$ and characterizes the worst-case ordering violation; the Mean mode uses $\mathrm{mean}_{\bm{\tau}\in S_{+}}\!{G}_{\mathrm{OPD}}\!-\!\mathrm{mean}_{\bm{\tau}\in S_{-}}\!{G}_{\mathrm{OPD}}$ and reflects the average-case ordering tendency. MinMax is more conservative (it forces every positive to outrank every negative), while Mean is more lenient and less sensitive to individual outliers.
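The difference between the two modes is easy to see on a small example. The sketch below computes both margins for a toy group in which a single correct trajectory receives a low return; the function name `margin` and the numbers are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def margin(pos_returns: Sequence[float], neg_returns: Sequence[float],
           mode: str = "MinMax") -> float:
    """Prompt-level margin between correct and incorrect trajectories."""
    if not pos_returns or not neg_returns:
        raise ValueError("margin is undefined when either side is empty")
    if mode == "MinMax":
        # Worst correct trajectory versus best incorrect trajectory.
        return min(pos_returns) - max(neg_returns)
    if mode == "Mean":
        # Average correct return versus average incorrect return.
        return mean(pos_returns) - mean(neg_returns)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy returns (made-up): one correct trajectory is an outlier at -0.25.
pos = [1.0, 0.75, -0.25]
neg = [0.25, 0.0, -0.25]
minmax_margin = margin(pos, neg, "MinMax")  # -0.25 - 0.25
mean_margin = margin(pos, neg, "Mean")      # 0.5 - 0.0
```

Here the MinMax margin is negative because of the single outlier, while the Mean margin stays positive, which illustrates why MinMax is the more conservative criterion.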

Detailed implementation of margin mask. The margin mask strategy discards unreliable trajectories until the prompt-level margin is restored. We implement its fine-grained, data-efficient variant as Greedy Margin Mask, which removes the single most adversarial trajectory in each iteration rather than discarding the entire group. Specifically, given the rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$ of prompt $\bm{q}$ with trajectory-level returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$, we sort the positives in ascending order of $G_{\mathrm{OPD}}$ (so the worst correct trajectory comes first) and the negatives in descending order (so the best incorrect trajectory comes first). At each iteration, we compute the margin improvement obtained by removing the front of each sorted list, and greedily drop the front of the list that yields the larger improvement. The iteration terminates once (i) the target margin $m(\bm{q})\!\geq\!\delta$ is satisfied, (ii) no further beneficial removal exists, or (iii) the minimum retention ratio $\rho\!\in\!(0,1)$ is reached, which prevents excessive data loss. The masked trajectories are excluded from the subsequent gradient update by setting their trajectory-level return to zero, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\!=\!k_{i}\!\cdot\!G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})$, where $k_{i}\!\in\!\{0,1\}$ is the keep mask. In distributed training, the trajectory-level statistics are aggregated across all ranks via AllReduce so that the masking is deterministic and consistent across devices. The full procedure is given in algorithm 1.
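The greedy procedure can be sketched in plain Python as below. This is a single-process illustration that follows the description above, not the distributed implementation. One detail is an assumption of this sketch rather than something stated in the text: a side that has already reached its retention floor $\lceil\rho N\rceil$ is never considered for a drop, so neither list can become empty.

```python
import math
from statistics import mean
from typing import List, Sequence


def greedy_margin_mask(rewards: Sequence[int], returns: Sequence[float],
                       delta: float, rho: float,
                       mode: str = "MinMax") -> List[int]:
    """Return a keep mask k_i in {0, 1} for one rollout group."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    if mode not in ("MinMax", "Mean"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Index lists sorted so the most ordering-violating trajectory is first.
    pos = sorted((i for i, r in enumerate(rewards) if r == 1),
                 key=lambda i: returns[i])    # ascending return
    neg = sorted((i for i, r in enumerate(rewards) if r == 0),
                 key=lambda i: -returns[i])   # descending return
    keep = [1] * len(rewards)
    if not pos or not neg:
        return keep                           # ordering undefined: no masking

    min_pos = math.ceil(rho * len(pos))       # retention floors
    min_neg = math.ceil(rho * len(neg))

    def m(p: List[int], n: List[int]) -> float:
        gp = [returns[i] for i in p]
        gn = [returns[i] for i in n]
        if mode == "MinMax":
            return min(gp) - max(gn)
        return mean(gp) - mean(gn)

    while m(pos, neg) < delta:
        can_pos = len(pos) > min_pos
        can_neg = len(neg) > min_neg
        if not (can_pos or can_neg):
            break                             # retention floor hit on both sides
        base = m(pos, neg)
        gain_pos = m(pos[1:], neg) - base if can_pos else -math.inf
        gain_neg = m(pos, neg[1:]) - base if can_neg else -math.inf
        if max(gain_pos, gain_neg) <= 0:
            break                             # no single removal helps
        dropped = pos.pop(0) if gain_pos > gain_neg else neg.pop(0)
        keep[dropped] = 0
    return keep


# Toy group (made-up): the first correct trajectory has a very low return.
rewards = [1, 1, 1, 1, 0, 0, 0, 0]
returns = [-0.5, 0.5, 0.75, 1.0, 0.25, 0.0, -0.25, -0.5]
mask = greedy_margin_mask(rewards, returns, delta=0.25, rho=0.5)
```

In this toy group a single drop of the low-return correct trajectory lifts the MinMax margin from $-0.75$ to $0.25$, so the loop stops after one iteration.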

Detailed implementation of margin shift. The margin shift strategy applies a minimal additive correction to the trajectory-level returns so that the margin exactly meets the target $\delta$, rather than discarding any sample. Given the rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$ of prompt $\bm{q}$, we first compute the current margin $m(\bm{q})$ with the chosen mode (Mean by default). If $m(\bm{q})\!<\!\delta$, we define the required shift as $\lambda(\bm{q})\!=\!\delta\!-\!m(\bm{q})\!>\!0$ and apply it in one of three directions: (i) Lift: add $\lambda(\bm{q})$ to every positive trajectory, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\!=\!G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\!+\!\lambda(\bm{q})\mathbf{1}\{R(\bm{q},\bm{\tau})\!=\!1\}$, which matches Eq. 11 in the main text; (ii) Suppress: subtract $\lambda(\bm{q})$ from every negative trajectory, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\!=\!G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\!-\!\lambda(\bm{q})\mathbf{1}\{R(\bm{q},\bm{\tau})\!=\!0\}$; and (iii) Spread: split the correction symmetrically, adding $\lambda(\bm{q})/2$ to positives and subtracting $\lambda(\bm{q})/2$ from negatives. All three variants (a) preserve the relative ordering within $S_{+}(\bm{q})$ and within $S_{-}(\bm{q})$, since every trajectory on a given side receives the same offset, and (b) guarantee that the calibrated margin, measured in the chosen mode, equals $\delta$; in MinMax mode, for example, $\min_{\bm{\tau}\in S_{+}}\!\widetilde{G}_{\mathrm{OPD}}\!-\!\max_{\bm{\tau}\in S_{-}}\!\widetilde{G}_{\mathrm{OPD}}\!=\!\delta$. In distributed training, the aggregation of trajectory-level statistics and the computation of $\lambda(\bm{q})$ are done via AllReduce to ensure consistency across devices. The full procedure is given in algorithm 2.

Algorithm 2 Margin Shift

Inputs: Prompt $\bm{q}$ with rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$, outcome rewards $\{R_{i}\}_{i=1}^{G}$ with $R_{i}\!\in\!\{0,1\}$, trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$, target margin $\delta$, mode $\in\{\mathrm{MinMax},\mathrm{Mean}\}$, direction $\in\{\mathrm{Lift},\mathrm{Suppress},\mathrm{Spread}\}$
Output: Calibrated trajectory-level returns $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$

function MarginShift($\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\text{mode},\text{direction}$)
    $\triangleright$ Step 1: split the group by outcome correctness.
    $S_{+}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=1\}$, $S_{-}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=0\}$
    if $S_{+}(\bm{q})=\emptyset$ or $S_{-}(\bm{q})=\emptyset$ then
        return $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$ $\triangleright$ ordering is not defined
    end if

    $\triangleright$ Step 2: summarize each side and compute the prompt-level margin $m(\bm{q})$
    if mode $=\mathrm{MinMax}$ then
        $G_{+}(\bm{q})\leftarrow\min_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ $\triangleright$ worst-scoring correct trajectory
        $G_{-}(\bm{q})\leftarrow\max_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ $\triangleright$ best-scoring incorrect trajectory
    else
        $G_{+}(\bm{q})\leftarrow\operatorname*{mean}_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ $\triangleright$ average correct score
        $G_{-}(\bm{q})\leftarrow\operatorname*{mean}_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ $\triangleright$ average incorrect score
    end if
    $m(\bm{q})\leftarrow G_{+}(\bm{q})-G_{-}(\bm{q})$

    $\triangleright$ Step 3: additive correction when the margin is below the target.
    $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}),\quad\forall i=1,\ldots,G$ $\triangleright$ start from the uncalibrated returns
    if $m(\bm{q})<\delta$ then
        $\lambda(\bm{q})\leftarrow\delta-m(\bm{q})$ $\triangleright$ amount by which the margin falls short of $\delta$
        if direction $=\mathrm{Lift}$ then
            $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\mathrel{+}=\lambda(\bm{q}),\quad\forall\bm{\tau}\in S_{+}(\bm{q})$ $\triangleright$ pull all correct trajectories up
        else if direction $=\mathrm{Suppress}$ then
            $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\mathrel{-}=\lambda(\bm{q}),\quad\forall\bm{\tau}\in S_{-}(\bm{q})$ $\triangleright$ push all incorrect trajectories down
        else
            $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\mathrel{+}=\lambda(\bm{q})/2,\quad\forall\bm{\tau}\in S_{+}(\bm{q})$ $\triangleright$ split: half up on the positive side, …
            $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})\mathrel{-}=\lambda(\bm{q})/2,\quad\forall\bm{\tau}\in S_{-}(\bm{q})$ $\triangleright$ …and half down on the negative side
        end if
    end if
    return $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$
end function
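For readers who prefer executable code, the margin shift procedure can be sketched in a few lines of Python. This is a single-process illustration with assumed names (`margin_shift`, plain lists for rewards and returns); it omits the cross-device aggregation mentioned in the text.

```python
from statistics import mean
from typing import List, Sequence


def margin_shift(rewards: Sequence[int], returns: Sequence[float],
                 delta: float, mode: str = "Mean",
                 direction: str = "Lift") -> List[float]:
    """Return calibrated trajectory-level returns for one rollout group."""
    if mode not in ("MinMax", "Mean"):
        raise ValueError(f"unknown mode: {mode!r}")
    if direction not in ("Lift", "Suppress", "Spread"):
        raise ValueError(f"unknown direction: {direction!r}")

    pos = [g for r, g in zip(rewards, returns) if r == 1]
    neg = [g for r, g in zip(rewards, returns) if r == 0]
    if not pos or not neg:
        return list(returns)          # ordering undefined: leave unchanged

    if mode == "MinMax":
        m = min(pos) - max(neg)       # worst correct vs. best incorrect
    else:
        m = mean(pos) - mean(neg)     # average correct vs. average incorrect
    if m >= delta:
        return list(returns)          # margin already meets the target

    lam = delta - m                   # shortfall to be corrected
    if direction == "Lift":
        up, down = lam, 0.0
    elif direction == "Suppress":
        up, down = 0.0, lam
    else:                             # Spread: split the shortfall evenly
        up, down = lam / 2, lam / 2
    return [g + up if r == 1 else g - down for r, g in zip(rewards, returns)]


# Toy group (made-up): Mean margin is 0.25 - 0.25 = 0, target is 0.5.
rewards = [1, 1, 0, 0]
returns = [0.5, 0.0, 0.25, 0.25]
lifted = margin_shift(rewards, returns, delta=0.5, direction="Lift")
suppressed = margin_shift(rewards, returns, delta=0.5, direction="Suppress")
spread = margin_shift(rewards, returns, delta=0.5, direction="Spread")
```

In all three outputs the Mean margin equals the target $0.5$, and the order of trajectories within each side is unchanged because each side receives a constant offset.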

Appendix B Training Details

In this section, we present details related to training, including the training setup (section B.1), the training datasets (section B.2), the training reward acquisition (section B.3), the training pseudocode (section B.4), the training dynamics (section B.5), and the training complexity analysis (section B.6). These details are provided to enhance the reproducibility of Uni-OPD.

B.1 Training Setup

To support multi-teacher OPD for both LLMs and MLLMs, we build Uni-OPD upon a widely used training framework, Miles (https://github.com/radixark/miles). Specifically, we use Megatron-LM (https://github.com/nvidia/megatron-lm; Shoeybi et al., 2019) as the training backend and SGLang (https://github.com/sgl-project/sglang) as the rollout inference engine. For teacher models, we deploy them as independent SGLang services that can be accessed via HTTP from arbitrary locations to obtain token-level rewards, enabling flexible teacher extensions and scalable multi-teacher integration.

Each teacher is served behind a pool of SGLang endpoints with client-side shuffled round-robin load balancing, and a lightweight task-to-teacher routing table dispatches every prompt to the teacher best matched to its domain (e.g., math reasoning or code generation), so that new teachers or new tasks can be plugged in by simply extending the registry without touching the training loop. Because each teacher only needs to expose its prefill-time input_token_logprobs, no gradient, KV cache, or parameter is shared with the student, which keeps teachers fully stateless and decouples their deployment from the trainer. As a result, teacher scoring overlaps with student generation and contributes negligible overhead to the overall training throughput.
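The routing layer described above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the class names `TeacherPool` and `TeacherRegistry`, the task keys, and the endpoint URLs are invented for illustration, and "shuffled round-robin" is interpreted as shuffling the endpoint order once and then cycling through it. The sketch only selects an endpoint; it does not issue any request to a teacher service.

```python
import itertools
import random
from typing import Dict, List, Optional


class TeacherPool:
    """Client-side pool: shuffle the endpoints once, then cycle through them."""

    def __init__(self, endpoints: List[str], seed: Optional[int] = None) -> None:
        if not endpoints:
            raise ValueError("a teacher pool needs at least one endpoint")
        order = list(endpoints)
        random.Random(seed).shuffle(order)  # de-synchronizes different clients
        self._cycle = itertools.cycle(order)

    def next_endpoint(self) -> str:
        return next(self._cycle)


class TeacherRegistry:
    """Task-to-teacher routing table (hypothetical structure)."""

    def __init__(self) -> None:
        self._pools: Dict[str, TeacherPool] = {}

    def register(self, task: str, endpoints: List[str],
                 seed: Optional[int] = None) -> None:
        self._pools[task] = TeacherPool(endpoints, seed)

    def route(self, task: str) -> str:
        if task not in self._pools:
            raise KeyError(f"no teacher registered for task {task!r}")
        return self._pools[task].next_endpoint()


# Placeholder endpoints; a real deployment would list its own teacher services.
registry = TeacherRegistry()
registry.register("math", ["http://math-teacher-0:30000",
                           "http://math-teacher-1:30000"], seed=0)
registry.register("code", ["http://code-teacher-0:30000"], seed=0)
picks = [registry.route("math") for _ in range(4)]
```

Adding a new teacher or task in this sketch only requires another `register` call, which mirrors the extensibility claim in the paragraph above.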

General training hyperparameters. All general training settings, including the batch size, rollout numbers, learning rate schedule, optimizer choice, and so on, are identical to those used in ExOPD (https://github.com/RUCBM/G-OPD; Yang et al., 2026b), ensuring a fair and controlled comparison. The prompts used for training are provided in section B.1.

RL training setup. Teacher models are trained using reinforcement learning (RL). Detailed training settings of the teacher models are provided in Table B.1.

Table B.1: Teacher model training configuration with GRPO.

Group	Setting	Value
Model	Base model	LLM (Math, Code): Qwen3-4B
	Base model	MLLM (Math, Logic, Document): Qwen3-VL-4B-Inst.
	Training steps	LLM (Math, Code): 500, 300
	Training steps	MLLM (Math, Logic, Document): 300, 300, 160
Optimization	Tensor Parallelism (TP)	2
	Micro batch size / GPU	1
	Training batch size	128
	Learning rate	$1\times 10^{-6}$
	Warm-up steps	0
	LR schedule	Constant
	ZeRO stage	3
	Optimizer	Adam
Sequence	Max prompt length	2048
Sequence	Max response length	16384
RL Algorithm	Advantage estimator	GRPO
	GRPO clip ratio	0.2
	Use KL in reward	False
	KL loss coefficient	0.0
	Entropy coefficient	0.0
Rollout	Samples per prompt ( $n$ )	8
	Temperature	1.0
	Top- $p$	0.95
	Top- $k$	50
Hardware	GPUs	$16\times$ NVIDIA H20

OPD training setup. For OPD, we inherit most hyperparameters (e.g., learning rate, optimizer, and sequence lengths) from the teacher RL setup in Table B.1, so that the student is trained under the same optimization regime as its teachers. The OPD-specific entries, including the training batch size, the number of on-policy samples per prompt, the online correctness-aware filter, and the margin calibration configuration, are summarized in Table B.2. Concretely, we use a training batch size of 64 and sample $n\!=\!16$ on-policy rollouts per prompt (the group size $G$ in section A.2), which we find provides a good trade-off between return estimation quality and computational efficiency (see the ablation in Table D.6). The online correctness-aware filter is applied in sample filter mode with a target correct-to-incorrect ratio of $1{:}1$ within each training batch (i.e., $\gamma^{\star}\!=\!0.5$), following section A.2. For margin calibration (section A.4), we compute the margin at the group level in Mean mode in both domains, while the shift direction and target margin are tuned per domain: for the textual domain, we use Spread with $\delta\!=\!0.4$, and for the multimodal domain, we use Lift with $\delta\!=\!0$.

Table B.2: OPD training configuration. Most hyperparameters inherit from the teacher RL setup in Table B.1; only the entries that differ between OPD and RL are listed here.

Group	Setting	Textual	Multimodal
Optimization	Training batch size	64	64
Optimization	Samples per prompt ( $n$ )	16	16
Online filter	Filter mode	Sample filter	Sample filter
Online filter	Correct/Incorrect ratio	$1{:}1$	$1{:}1$
Margin calibration	Scope	Group	Group
	Mode	Mean	Mean
	Direction	Spread	Lift
	Target margin $\delta$	$0.4$	$0$

B.2 Training Data

Textual math reasoning data. We use a subset of the DeepMath dataset (He et al., 2025b) with difficulty level $\!\geq\!6$ to train mathematical reasoning ability, comprising 57.0K samples.

Textual code generation data. We use the Code subset of the Eurus-2-RL-Data dataset (Cui et al., 2025) with 25.3K samples to train code generation ability.

Multimodal math reasoning data. For multimodal math reasoning tasks, we draw 14.8K samples from the OpenMMReasoner-RL dataset (https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-RL-74K), covering the MMK12, WeMath-Standard, and WeMath-Pro subsets.

Multimodal logic reasoning data. We collect 14.8K samples spanning AlgoPuzzle, PuzzleVQA, and ThinkLite-VL-Hard subsets from the OpenMMReasoner-RL-74K dataset.

Multimodal document understanding data. We include 14.6K document understanding samples, obtained by 15% sampling from the TQA subset of OpenMMReasoner with ChartQA (Masry et al., 2022) and InfographicsVQA (Mathew et al., 2022) training sets.

B.3 Training Reward Acquisition

In this section, we describe how training rewards are obtained for different data sources. For textual math reasoning tasks, we use the rule-based verifier provided by DeepMath (https://github.com/zwhe99/DeepMath) to determine whether generated answers are correct. For textual code generation tasks, we use the rule-based verifier provided by PRIME (https://github.com/PRIME-RL/PRIME) to evaluate the correctness of generated code. For multimodal tasks, we use the verifier released by OpenMMReasoner (https://github.com/EvolvingLMMs-Lab/OpenMMReasoner) to assess whether generated answers are correct.

B.4 Training Pseudocode

The full training procedure of Uni-OPD is summarized in algorithm 3. In brief, the procedure (1) samples a prompt batch with offline difficulty-aware balancing (section A.1); (2) rolls out $G$ trajectories per prompt and computes the trajectory-level distillation return ${G}_{\mathrm{OPD}}$ from teacher–student log-probability differences (Eq. 5); (3) applies online correctness-aware balancing across the batch (section A.2); (4) calibrates $G_{\mathrm{OPD}}$ via the prompt-level margin $m(\bm{q})$ (Eq. 9) using either Greedy Margin Mask (algorithm 1) or Margin Shift (algorithm 2); and (5) broadcasts the calibrated returns to token-level advantages and updates the student $\pi_{\bm{\theta}}$ .

Algorithm 3 Uni-OPD: Outcome-guided Policy Distillation with Margin Calibration

Input: Teacher $\pi_{\text{T}}$, student $\pi_{\bm{\theta}}$, dataset $\mathcal{D}$, group size $G$, target margin $\delta$, calibration mode $\!\in\!\{\textsc{Mask},\textsc{Shift}\}$, learning rate $\eta$
Output: Updated student parameters $\bm{\theta}$

function UniOPD($\pi_{\text{T}},\pi_{\bm{\theta}},\mathcal{D},G,\delta,\text{mode},\eta$)
    $\triangleright$ Offline difficulty-aware data balancing (once before training; see section A.1).
    Sample a prompt batch $\mathcal{B}\subset\mathcal{D}$ with rebalanced difficulty distribution
    while not converged do
        $\triangleright$ Rollout and token-level scoring (per prompt).
        for all prompt $\bm{q}\in\mathcal{B}$ do
            Rollout $G$ trajectories $\{\bm{\tau}_{i}\}_{i=1}^{G}\sim\pi_{\bm{\theta}}(\cdot\mid\bm{q})$
            for $i=1,\ldots,G$ do
                Obtain outcome reward $R_{i}=r(\bm{q},\bm{\tau}_{i})\in\{0,1\}$
                for all token $o_{t}\in\bm{\tau}_{i}$ do
                    $r^{\mathrm{OPD}}_{t}(\bm{\tau}_{i})\leftarrow\log\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})$    $\triangleright$ token-level OPD reward
                end for
                $\triangleright$ Trajectory-level distillation return (Eq. 5).
                $G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow\dfrac{1}{|\bm{\tau}_{i}|}\sum_{t=1}^{|\bm{\tau}_{i}|}r^{\mathrm{OPD}}_{t}(\bm{\tau}_{i})$
            end for
            Partition: $S_{+}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=1\}$, $S_{-}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=0\}$    $\triangleright$ correct / incorrect trajectory sets
        end for

        $\triangleright$ Online correctness-aware data balancing (across the batch; see section A.2).
        $\mathcal{B}\leftarrow\textsc{OnlineCorrectnessAwareDataBalancing}\bigl(\mathcal{B},\{R_{i}\}_{\bm{q},i}\bigr)$

        $\triangleright$ Outcome-guided margin calibration (per prompt; Eqs. 9 and 10).
        for all prompt $\bm{q}\in\mathcal{B}$ do
            Compute prompt-level margin $m(\bm{q})=\min_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$
            if mode $=\textsc{Mask}$ then
                $\{k_{\bm{q},i}\}_{i=1}^{G}\leftarrow\textsc{GreedyMarginMask}(\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\rho,\text{mode})$    $\triangleright$ algorithm 1
                $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow k_{\bm{q},i}\cdot G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}),\quad\forall i=1,\ldots,G$    $\triangleright$ zero out masked trajectories
            else
                $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}\leftarrow\textsc{MarginShift}(\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\text{mode},\text{direction})$    $\triangleright$ algorithm 2
            end if
        end for

        $\triangleright$ Token-level broadcasting and policy update.
        for all prompt $\bm{q}\in\mathcal{B}$, rollout $i=1,\ldots,G$, token $o_{t}\in\bm{\tau}_{i}$ do
            $\widetilde{A}_{t}(\bm{q},\bm{\tau}_{i})\leftarrow\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})$    $\triangleright$ broadcast calibrated trajectory return to all tokens
        end for
        $\mathcal{L}(\bm{\theta})=-\,\mathbb{E}_{\bm{q},\bm{\tau}_{i},t}\!\left[\widetilde{A}_{t}(\bm{q},\bm{\tau}_{i})\,\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})\right]$
        $\bm{\theta}\leftarrow\bm{\theta}-\eta\,\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})$    $\triangleright$ one gradient step on the student
    end while
    return $\bm{\theta}$
end function
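As a concrete illustration of the per-prompt quantities in algorithm 3, the following minimal Python sketch computes the trajectory-level return $G_{\mathrm{OPD}}$ from per-token log-probabilities and the prompt-level margin $m(\bm{q})$. The function names `opd_return` and `prompt_margin` are hypothetical, and returning `None` when a prompt has no correct or no incorrect trajectory is our assumption; algorithm 3 does not specify that case.

```python
from typing import Optional, Sequence


def opd_return(teacher_logps: Sequence[float], student_logps: Sequence[float]) -> float:
    """Length-normalized sum of token-level rewards log pi_T - log pi_theta."""
    if len(teacher_logps) != len(student_logps) or len(teacher_logps) == 0:
        raise ValueError("expected two non-empty sequences of equal length")
    token_rewards = [t - s for t, s in zip(teacher_logps, student_logps)]
    return sum(token_rewards) / len(token_rewards)


def prompt_margin(returns: Sequence[float], rewards: Sequence[int]) -> Optional[float]:
    """m(q) = min return over correct rollouts - max return over incorrect rollouts.

    Assumption (not from the paper): return None if either set is empty,
    because the margin is undefined in that case.
    """
    correct = [g for g, r in zip(returns, rewards) if r == 1]
    incorrect = [g for g, r in zip(returns, rewards) if r == 0]
    if not correct or not incorrect:
        return None
    return min(correct) - max(incorrect)


# Toy group of four rollouts: two correct (R=1) and two incorrect (R=0).
group_returns = [-0.25, -0.5, -0.125, -0.75]
group_rewards = [1, 1, 0, 0]
print(prompt_margin(group_returns, group_rewards))  # -0.375
```

A negative margin, as in this toy group, means at least one incorrect rollout receives a higher return than some correct rollout, which is the order inconsistency that margin calibration targets.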

B.5 Training Dynamics

Fig. B.1 demonstrates the effectiveness of Uni-OPD along three complementary axes. From a comparable starting point ( $\sim$ 35% correct, entropy $\sim$ 0.33, length $\sim$ 1.6k), Uni-OPD converges to a substantially higher response-correct ratio than OPD, peaking at $80.6\%$ versus $75.2\%$ and averaging $75.5\%$ over the final 10 steps versus OPD’s $69.1\%$ (+6.4 absolute points). Crucially, this accuracy gain is not obtained by sacrificing exploration: policy entropy rises mildly under both methods, with Uni-OPD maintaining a marginally higher steady-state value, ruling out the entropy-collapse failure mode that typically plagues teacher-driven training. Meanwhile, the average response length grows from $\sim$ 1.6k to $\sim$ 8k tokens, with Uni-OPD producing slightly longer outputs than OPD (7.8k vs. 7.1k), indicating that the model learns to perform more elaborate reasoning rather than collapsing to short, high-confidence shortcuts. Together, these trends suggest that Uni-OPD provides a consistent improvement over OPD without adverse effects on exploration or response length.

B.6 Training Complexity

Beyond vanilla OPD, Uni-OPD introduces lightweight components on top of the standard per-iteration cost during training: online correctness-aware data balancing (per batch; section A.2), and outcome-guided margin calibration via Margin Mask / Shift (per prompt; section A.4). Let $B$ be the training batch size (number of prompts) and $G$ be the rollout group size. The online balancing only resamples prompts based on their precomputed $\{R_{i}\}$ , costing $O(BG)$ per iteration. Margin Mask and Margin Shift both operate on the $G$ trajectory-level returns within each prompt group: Margin Shift is $O(G)$ per prompt, while the greedy variant of Margin Mask is at most $O(G^{2})$ per prompt in the worst case (typically $G\!\leq\!16$ in our setup).

In contrast, the dominant per-iteration cost of OPD comes from two stages whose complexity scales linearly with the total number of rollout tokens $T_{\text{tok}}\!=\!\sum_{i=1}^{BG}|\bm{\tau}_{i}|$ and quadratically with the hidden size $d$ : (i) sampling $BG$ on-policy rollouts from the student, and (ii) running a teacher prefill pass over these rollouts to obtain token-level log-probabilities, each of order $O(T_{\text{tok}}\,d^{2})$ for transformer forward passes. Typical numbers in our setup ( $B\!=\!64$ , $G\!=\!16$ , average length $\sim$ 8k) give $T_{\text{tok}}$ on the order of $8\!\times\!10^{6}$ tokens per iteration. All of Uni-OPD’s additional computation scales with the number of trajectories rather than the number of tokens, involves only scalar comparisons and additions, and is therefore several orders of magnitude cheaper than the rollout and teacher-scoring stages that OPD already pays. In practice, we observe that enabling all three components adds less than $1\%$ wall-clock overhead per iteration relative to vanilla OPD, while delivering the accuracy improvements reported in section B.5 and the main experiments. Thus Uni-OPD offers a favorable accuracy–compute trade-off: a negligible compute surcharge in exchange for consistently better final performance.
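The counts behind this comparison can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the average length is taken as exactly 8,000 tokens (the text says roughly 8k), and each count ignores constant factors.

```python
def rollout_tokens(batch_prompts: int, group_size: int, avg_len: int) -> int:
    """Total rollout tokens per iteration, T_tok = B * G * average length."""
    return batch_prompts * group_size * avg_len


def calibration_ops(batch_prompts: int, group_size: int) -> dict:
    """Scalar-operation counts (up to constants) for the added components."""
    return {
        "online_balancing": batch_prompts * group_size,             # O(BG)
        "margin_shift": batch_prompts * group_size,                 # O(G) per prompt
        "margin_mask_worst_case": batch_prompts * group_size ** 2,  # O(G^2) per prompt
    }


B, G, AVG_LEN = 64, 16, 8_000  # AVG_LEN is an assumed round value for "~8k"
print(rollout_tokens(B, G, AVG_LEN))                    # 8192000
print(calibration_ops(B, G)["margin_mask_worst_case"])  # 16384
```

Each of the roughly $8\times 10^{6}$ tokens additionally incurs an $O(d^{2})$ forward-pass cost in both the student and the teacher, whereas the calibration counts above are plain scalar operations.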

Appendix C Evaluation Details

C.1 Evaluation Benchmarks

We evaluate our Uni-OPD on a comprehensive benchmark suite spanning textual and multimodal capabilities, organized along five capability axes:

$\bullet$ Textual Math Reasoning:
  - AIME (2024/2025) (AI-MO, 2024): A prestigious high school mathematics competition featuring challenging problems that test deep mathematical reasoning.
  - HMMT25 (Feb & Nov) (Balunović et al., 2025): Contest-level benchmarks designed to rigorously evaluate advanced reasoning across algebra, geometry, combinatorics, and other domains.
$\bullet$ Textual Code Generation:
  - HumanEval+ (Liu et al., 2023b): A set of 164 hand-written programming problems evaluating functional correctness, covering language understanding, reasoning, algorithms, and basic mathematics.
  - MBPP+ (Liu et al., 2023b): A collection of $\sim$ 1,000 crowd-sourced Python tasks targeting entry-level programming skills, including fundamentals and standard library usage.
  - LiveCodeBench (v6) (Jain et al., 2024): A contamination-free and continuously updated benchmark assessing not only code generation but also execution, self-repair, and test prediction.
$\bullet$ Multimodal Math Reasoning:
  - MathVision (Wang et al., 2024a): A curated dataset of 3,040 visual problems sourced from real competitions, spanning 16 disciplines and multiple difficulty levels for evaluating multimodal mathematical reasoning.
  - DynaMath (Zou et al., 2024): A dynamically generated benchmark based on 501 seed question generators, enabling diverse and scalable evaluation through multiple sampled variants.
  - WeMath (Qiao et al., 2025): A large-scale benchmark with 6.5K visual math problems organized into 67 hierarchical knowledge concepts, designed to analyze problem-solving processes.
$\bullet$ Multimodal Logic Reasoning:
  - LogicVista (Xiao et al., 2024): A benchmark for evaluating multimodal logical reasoning across 5 task types and 9 capabilities using annotated multiple-choice questions with human reasoning.
  - VisuLogic (Xu et al., 2025b): A challenging visual reasoning benchmark focusing on reasoning directly from visual inputs, with tasks that are difficult to express textually and expose gaps in current MLLMs.
$\bullet$ Document Understanding:
  - AI2D (Kembhavi et al., 2016): A diagram understanding benchmark focusing on parsing diagram structure and reasoning over relationships between components via question answering.
  - ChartQA (Masry et al., 2022): A benchmark for question answering over charts, requiring complex visual and logical reasoning over both chart structure and underlying data.
  - DocVQA (Mathew et al., 2021): A large-scale document visual question answering dataset over document images, emphasizing structural and textual understanding.
  - InfoVQA (Mathew et al., 2022): A benchmark on infographic understanding that requires joint reasoning over layout, text, and visual elements with an emphasis on multi-step reasoning.

C.2 Evaluation Setup

Textual evaluations. For all textual evaluations, we use a sampling temperature of 1.0, top- $p$ of 1.0, a maximum generation length of 16,384 tokens, and a fixed random seed of 42. We use the vLLM inference engine to perform sampling. For math reasoning benchmarks, we sample $N=32$ solutions per problem, while for code generation benchmarks, we sample $N=4$ solutions per problem. For evaluation, we adopt Math-Verify (https://github.com/huggingface/Math-Verify) as a rule-based verifier for math reasoning tasks. For code generation, we use the EvalPlus (https://github.com/evalplus/evalplus) and LiveCodeBench (https://github.com/livecodebench/livecodebench) frameworks to assess functional correctness. For all main results, we report the average accuracy across sampled solutions (i.e., pass@1), and compute pass@k as:

$$\text{pass}@k=1-\frac{\binom{N-c}{k}}{\binom{N}{k}}\,,\tag{15}$$

where $N$ is the number of samples and $c$ is the number of correct solutions.
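For reference, Eq. 15 can be evaluated directly with exact integer binomials. The sketch below is a minimal per-problem implementation; the function name `pass_at_k` and the argument checks are our own additions and are not taken from the evaluation frameworks cited above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k) for one problem.

    n: number of sampled solutions, c: number of correct ones, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so pass@k is 1.0 whenever fewer
    # than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the fraction of correct samples c / n.
print(pass_at_k(32, 8, 1))  # 0.25
```

A benchmark-level score is then the mean of this per-problem value over all problems.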

Table C.1: Reported evaluation metrics for different benchmark datasets. We summarize the primary metrics used for performance reporting across math, logic, and document understanding tasks.

Category	Tasks	Filter	$N$ -Shot	Reported Metric
Multimodal Math Reasoning	MathVision Test	none	0	mathvision_standard_eval
	DynaMath Reasoning	none	0	dynamath_average
	WeMath TestMini Reasoning	none	0	acc_score
Multimodal Logic Reasoning	LogicVista Reasoning	none	0	acc_score
	LogicVista Reasoning	none	0	format_score
	VisuLogic	none	0	visulogic_acc
Document Understanding	AI2D	flexible-extract	0	exact_match
	ChartQA	none	0	relaxed_human_split
	DocVQA Val	none	0	anls
	InfoVQA Val	none	0	anls

Multimodal evaluations. For multimodal evaluations, we adopt the widely used LMMs-Eval (https://github.com/evolvinglmms-lab/lmms-eval) (Zhang et al., 2025a) framework and strictly follow its official evaluation protocols and configurations. The reported evaluation metrics are summarized in Table C.1.

Appendix D Further Evaluations

D.1 More Evaluation Results

Table D.1: Performance of Qwen3-1.7B Student under math reasoning and code generation benchmarks. Teacher models (i.e., Qwen3-4B-Math-RL and Qwen3-4B-Code-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type.

Method	Math Reasoning					Code Generation
Method	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.	HumanEval+	MBPP+	LCB	Avg.
Student	13.9	11.1	5.6	4.9	8.9	61.9	53.4	11.9	42.4
Teacher	60.1	55.1	32.5	38.5	46.6	85.2	69.8	26.6	60.5
Single–Teacher Distillation
OPD	42.3	35.4	18.4	19.1	28.8	71.8	58.2	26.7	52.5
Uni-OPD	42.6	35.1	20.8	20.9	29.9	73.0	60.0	28.1	53.7
Multi–Teacher Distillation
OPD	40.3	32.4	20.0	20.3	28.3	73.2	59.1	25.7	52.7
Uni-OPD	44.0	35.1	19.5	19.8	29.6	72.9	60.5	28.0	53.8

Table D.2: Performance of Qwen3-VL-2B-Instruct Student under math reasoning, logic reasoning, and document understanding benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Math-RL, Qwen3-VL-4B-Instruct-Logic-RL and Qwen3-VL-4B-Instruct-Document-RL) are developed through domain-specific RL. Avg. denotes the mean score within each category.

Method	Math Reasoning				Logic Reasoning				Document Understanding
	MathVision	DynaMath	WeMath	Avg.	LogicVista Accuracy	LogicVista Format	VisuLogic	Avg.	AI2D	ChartQA	DocVQA	InfoVQA	Avg.
Student	11.1	49.1	48.6	36.3	32.4	59.1	6.4	32.6	73.4	66.1	92.8	72.4	76.2
Teacher	47.2	65.3	79.5	64.0	52.5	73.8	27.4	51.2	82.5	76.4	95.1	81.6	83.9
Single–Teacher Distillation
OPD	24.4	54.5	64.8	47.9	35.3	61.6	26.8	41.2	76.1	66.0	93.0	72.2	76.8
Uni-OPD	25.5	55.2	65.0	48.6	36.8	65.2	27.6	43.2	76.7	66.6	92.9	72.6	77.2
Multi–Teacher Distillation
OPD	15.2	50.2	57.6	41.0	38.0	65.2	27.2	43.4	76.2	66.1	92.9	72.5	76.9
Uni-OPD	18.7	51.2	58.7	43.9	42.0	69.8	27.0	46.3	76.0	66.5	93.0	72.6	77.0

Single-teacher and multi-teacher distillation on LLMs and MLLMs. We further evaluate Uni-OPD under both single-teacher and multi-teacher distillation settings on LLMs and MLLMs. As shown in Tables D.1 and D.2, our method outperforms the standard OPD baseline on the average score of every domain in both settings. On the LLM student (i.e., Qwen3-1.7B), Uni-OPD improves the average scores on both math reasoning and code generation under single-teacher and multi-teacher distillation. On the MLLM student (i.e., Qwen3-VL-2B-Instruct), it delivers consistent gains in the average scores for math reasoning, logic reasoning, and document understanding. Further, it narrows the gap to the teacher ensemble under multi-teacher distillation. These consistent improvements on smaller students provide strong empirical evidence for our dual-perspective approach, supporting the view that student exploration and teacher reliability are fundamental drivers of successful and reliable distillation.

Table D.3: Performance of Qwen3-VL-4B-Instruct Student under code generation and logic reasoning benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Logic-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type.

Method	Code Generation				Logic Reasoning
	HumanEval+	MBPP+	LCB	Avg.	LogicVista Accuracy	LogicVista Format	VisuLogic	Avg.
Student	76.8	70.0	37.0	61.3	49.9	66.4	25.1	47.0
Teacher	82.2	70.5	40.1	64.3	52.5	73.8	27.4	51.2
Multi–Teacher Distillation
OPD	79.0	68.5	39.6	62.4	50.0	69.3	27.3	48.9
Uni-OPD	79.4	69.2	41.4	63.3	52.0	73.8	28.0	51.3

Cross-modal distillation on code generation and logic reasoning. Beyond the cross-modal distillation on math reasoning and code generation, we further conduct cross-modal distillation on code generation and logic reasoning. Specifically, we combine text-only code data with multimodal logic reasoning data, and jointly distill from two domain-specific teachers (Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Logic-RL) into a single Qwen3-VL-4B-Instruct student. As shown in Table D.3, Uni-OPD outperforms the standard OPD baseline on both the code generation and logic reasoning averages, with notable gains on LCB (39.6 $\rightarrow$ 41.4), LogicVista Accuracy (50.0 $\rightarrow$ 52.0), and LogicVista Format (69.3 $\rightarrow$ 73.8). These results indicate that Uni-OPD effectively integrates heterogeneous text-only and multimodal data in a single training run, further supporting its applicability to cross-modal distillation.

D.2 Downstream Task Evaluation

Table D.4: General downstream task performance. Evaluation on 8 general benchmarks to ensure general-purpose capabilities are maintained after OPD.

Model	MMLU	ARC	HellaSwag	TruthfulQA	Winogrande	GSM8K	CommonsenseQA	IFEval	Avg.
Qwen3-4B	68.3	80.7	68.4	54.8	66.6	84.2	75.8	88.9	73.5
Math Teacher	68.4	80.8	68.5	54.3	66.0	86.7	75.4	89.2	73.7
Code Teacher	68.3	80.2	68.3	54.8	65.7	85.8	75.7	89.7	73.6
OPD	68.3	80.3	68.4	54.6	66.5	88.6	75.5	89.2	73.9
Uni-OPD	68.3	80.3	68.3	54.6	66.0	88.6	75.7	89.2	73.9

Evaluation on general capabilities. To assess the impact of OPD on general downstream performance of the policy model, we evaluate the models on a diverse set of benchmarks from the Hugging Face Open LLM Leaderboard (Beeching et al., 2023) following recent studies (Peng et al., 2026; Meng et al., 2024). Specifically, we report results on MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), TruthfulQA (Lin et al., 2022), Winogrande (Levesque et al., 2012), GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), and IFEval (Zhou et al., 2023b). We strictly follow the standard evaluation protocols provided by the lm-evaluation-harness system (https://github.com/EleutherAI/lm-evaluation-harness). For IFEval, we report the inst_level_loose_acc.

The results are presented in Table D.4. Overall, Uni-OPD not only outperforms OPD and domain-specific teachers on the math reasoning and code generation benchmarks, as shown in the main text, but also retains strong performance across a wide range of downstream tasks. These results suggest that OPD serves as a general and effective framework for improving LLM performance beyond task-specific settings.

D.3 Further Ablation

Table D.5: Effectiveness validation of margin shift across different hyperparameters. We conduct single-teacher distillation experiments with a Qwen3-4B Student using individual math and code teachers.

Configuration	Math Reasoning					Code Generation
Configuration	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.	HumanEval+	MBPP+	LCB	Avg.
OPD (no shift)	57.9	52.4	30.2	37.8	44.6	82.6	68.8	25.7	59.0
Global + Mean + Lift	61.8	55.2	34.8	39.4	47.8	85.7	71.4	25.7	60.9
Global + MinMax + Lift	62.4	57.3	32.2	38.2	47.5	85.8	71.8	26.7	61.4
Group + MinMax + Spread	63.4	56.7	33.4	39.0	48.1	86.9	70.6	26.7	61.4
Group + Mean + Spread (ours)	62.7	56.3	34.4	39.2	48.2	88.3	72.3	26.7	62.4

Hyperparameter analysis for margin shift. As shown in Table D.5, we compare four variants of margin shift against the OPD baseline across math reasoning and code generation benchmarks. The shift scope (Global vs. Group), normalization mode (Mean vs. MinMax), and shift direction (Lift vs. Spread) are ablated systematically. All shift variants consistently outperform the vanilla OPD baseline, demonstrating the general effectiveness of margin shift. Among the variants, Group + Mean + Spread achieves the best average performance on both code generation (62.4) and math reasoning (48.2), indicating that group-level mean normalization with bidirectional shifting provides a more calibrated return signal. Applying the shift to both correct and incorrect responses (Spread) proves beneficial over unidirectional shifting (Lift), and group-level statistics generalize better than global ones when reward distributions vary across prompts. Furthermore, we observe that MinMax-based normalization and global-scope statistics are susceptible to outlier return values, as extreme return values within a batch can distort the shift magnitude and destabilize training. In contrast, group-level mean normalization produces more robust and consistent return estimates, contributing to stable optimization throughout training.
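To make the Lift versus Spread distinction concrete, the sketch below shows one possible reading of a margin shift within a single prompt group. It is not a reproduction of algorithm 2, which is not restated in this section. We assume that Lift raises only the correct returns, that Spread splits the required shift equally between raising correct and lowering incorrect returns, and that the shift is applied only when the min/max margin $m(\bm{q})$ falls below $\delta$. The normalization mode (Mean or MinMax) and the Global scope are deliberately left out, and the name `margin_shift_sketch` is hypothetical.

```python
from typing import List, Sequence


def margin_shift_sketch(
    returns: Sequence[float],
    rewards: Sequence[int],
    delta: float,
    direction: str,
) -> List[float]:
    """Shift returns within one prompt group so that the margin reaches delta.

    Assumed semantics (illustrative only, not taken from algorithm 2):
      "lift":   add the full gap to the correct returns.
      "spread": add half the gap to correct, subtract half from incorrect.
    """
    correct = [g for g, r in zip(returns, rewards) if r == 1]
    incorrect = [g for g, r in zip(returns, rewards) if r == 0]
    if not correct or not incorrect:
        return list(returns)  # margin undefined; leave the group untouched
    gap = delta - (min(correct) - max(incorrect))
    if gap <= 0:
        return list(returns)  # margin already at least delta
    if direction == "lift":
        up, down = gap, 0.0
    elif direction == "spread":
        up, down = gap / 2.0, gap / 2.0
    else:
        raise ValueError("direction must be 'lift' or 'spread'")
    return [g + up if r == 1 else g - down for g, r in zip(returns, rewards)]


group_returns = [-0.5, -0.25, -0.125, -0.75]
group_rewards = [1, 1, 0, 0]
print(margin_shift_sketch(group_returns, group_rewards, 0.125, "lift"))
# [0.0, 0.25, -0.125, -0.75]
print(margin_shift_sketch(group_returns, group_rewards, 0.125, "spread"))
# [-0.25, 0.0, -0.375, -1.0]
```

Under these assumptions both directions restore the same margin of $\delta$; they differ in whether the incorrect returns are also pushed down, which is the bidirectional behavior the ablation associates with Spread.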

Table D.6: The effects of rollout number. The global batch size is fixed at $n\times bs=1024$ throughout.

Method	AIME 2024	AIME 2025	HMMT 25 Feb.	HMMT 25 Nov.	Avg.
Student (4B)	23.0	19.3	12.3	9.2	15.9
OPD
$n=4$ , $bs=256$	60.1	55.1	32.5	29.6	44.3
$n=8$ , $bs=128$	59.8	52.9	29.6	35.8	44.5
$n=16$ , $bs=64$	57.9	52.4	30.2	37.8	44.6
$n=32$ , $bs=32$	58.3	51.2	30.6	36.9	44.3
OPD + Margin shift
$n=4$ , $bs=256$	57.9	52.4	33.2	37.8	45.3
$n=8$ , $bs=128$	62.5	55.4	31.9	39.2	47.3
$n=16$ , $bs=64$	62.7	56.3	34.4	39.2	48.2
$n=32$ , $bs=32$	63.1	55.4	34.2	39.6	48.1

Hyperparameter analysis for rollout number $n$ . As shown in Table D.6, we ablate the rollout number $n$ in OPD while keeping the global batch size fixed at 1024 (i.e., $n\times bs=1024$ ), so that increasing $n$ comes at the cost of a smaller per-step batch size $bs$ . For the OPD baseline, performance remains largely stable across all values of $n$ (44.3–44.6 avg.), suggesting that the base method is relatively insensitive to this trade-off. In contrast, OPD with margin shift benefits notably from larger rollout groups: average performance improves from 45.3 at $n{=}4$ to 48.2 at $n{=}16$ , as more responses per prompt yield more reliable relative return estimation for the margin-based calibration. We find that increasing $n$ from 16 to 32 yields comparable performance. Considering return estimation quality, training stability, and computational efficiency, we therefore set $n{=}16$ as our default.
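The fixed-budget constraint in this ablation can be written down explicitly. The short sketch below enumerates the per-step batch size implied by each rollout number; the helper name `rollout_configs` is hypothetical, and the divisibility check is our own addition.

```python
from typing import List, Sequence, Tuple


def rollout_configs(global_batch: int, rollout_numbers: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (n, bs) pairs satisfying n * bs == global_batch."""
    configs = []
    for n in rollout_numbers:
        if global_batch % n != 0:
            raise ValueError(f"n={n} does not divide the global batch size")
        configs.append((n, global_batch // n))
    return configs


print(rollout_configs(1024, [4, 8, 16, 32]))
# [(4, 256), (8, 128), (16, 64), (32, 32)]
```

Doubling $n$ therefore halves the number of distinct prompts per step, which is the trade-off behind the comparable results at $n{=}16$ and $n{=}32$.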

Appendix E Related Work

E.1 Multimodal Large Language Models

Large Language Models (LLMs) have undergone rapid development in recent years (Touvron et al., 2023; Achiam et al., 2023; AI@Meta, 2024b; Hurst et al., 2024; Yang et al., 2024a; AI@Meta, 2024a; Yang et al., 2025; Brown et al., 2020; Team et al., 2024; Anthropic, 2023b; a; 2024; Liu et al., 2024a; Guo et al., 2025a; Li et al., 2025), significantly improving reasoning capabilities. Meanwhile, MLLMs have also seen substantial progress (Radford et al., 2021; Shao et al., 2024a; Wang et al., 2025; Tian et al., 2019; Liu et al., 2024e; Yang et al., 2024c; Peng et al., 2026; Team et al., 2025). Leveraging advances in LLMs, multimodal large language models (MLLMs) further integrate visual and textual representations through cross-modal learning, achieving strong multimodal understanding and generation capabilities. A key driver of this success lies in the combination of large-scale self-supervised pre-training on diverse corpora and subsequent high-quality supervised fine-tuning (SFT), which enables LLMs and MLLMs to exhibit strong generalization and emergent capabilities in real-world tasks (Wang et al., 2024b; Bai et al., 2023; 2025b; Liu et al., 2023a; 2024b; 2024c; Dai et al., 2023; OpenAI, 2023; Zhu et al., 2023; Qu et al., 2025; Yang et al., 2023b; Zhong et al., 2024; Yang et al., 2023a; 2024b; Lai et al., 2024; Peng et al., 2025; Hou et al., 2026). Building upon these foundations, KD has emerged as an important paradigm for transferring sophisticated reasoning capabilities from teacher models to more efficient students. Among various distillation strategies, OPD has recently emerged as a mainstream post-training paradigm for both LLMs and MLLMs. In the on-policy setting, however, the effectiveness of distillation is tied to both the quality of student exploration and the reliability of teacher feedback. In this work, we present a dual-perspective optimization strategy from both the student and teacher sides to improve data suitability and training stability in OPD.

E.2 Reinforcement Learning

By optimizing trajectories sampled from the current policy, on-policy RL alleviates distribution mismatch and is often instantiated with verifiable or outcome-based rewards in reasoning tasks. Notable methods include GRPO (Shao et al., 2024b) for critic-free grouped optimization and GSPO (Zheng et al., 2025) for sequence-level stable optimization. Recently, some works have also combined RLVR with OPD, such as Self-Distilled RLVR (Yang et al., 2026a) and OpenClaw-RL (Wang et al., 2026). In our work, we use GRPO to obtain stronger domain-specific teachers and use the corresponding reward models as global guidance for return calibration in OPD.

E.3 On-Policy Distillation

Early OPD work, such as MiniLLM (Gu et al., 2023) and GKD (Agarwal et al., 2024), establishes the basic paradigm of using teacher feedback on student-generated trajectories under a reverse KL objective. Recent studies further broaden this paradigm from multiple perspectives. In self-distillation methods, OPSD (Zhao et al., 2026b) uses privileged information; SDFT (Shenfeld et al., 2026) allows the student to absorb knowledge from retrieved demonstrations while reducing forgetting. SDPO (Hübotter et al., 2026) treats the current model itself as a self-teacher; OPCD (Ye et al., 2026) internalizes context knowledge into model parameters by minimizing reverse KL between the student and a context-conditioned teacher on the student’s trajectories. Regarding teacher access, black-box OPD (Ye et al., 2025) introduces a discriminator-guided framework that does not require teacher logits. Several works also focus on improving optimization and efficiency. ExOPD (Yang et al., 2026b) reformulates OPD as weighted dense RL; Fast and Effective OPD (Zhang et al., 2026a) improves efficiency through prefix-only distillation; KDFlow (Zhang et al., 2026b) provides an extensible distillation framework supporting both off-policy and on-policy training; MiMo-V2-Flash (Xiao et al., 2026) introduces multi-teacher OPD, enabling effective capability merging across domains. Li et al. (2026b) rethink OPD in terms of its phenomenology, mechanisms, and training recipes.

Recently, OPD has also begun to extend beyond text-only settings. VOLD (Bousselham et al., 2025) transfers reasoning ability from text teachers to vision-language students through a two-stage pipeline that combines cold-start alignment with GRPO and OPD. Video-OPD (Li et al., 2026a) adapts OPD to long-video grounding and introduces a curriculum that filters unreliable teacher signals. X-OPD (Cao et al., 2026) further extends OPD to speech through cross-modal alignment. In contrast, our work focuses on developing a unified OPD framework with an open recipe for both LLMs and MLLMs.

Appendix F Case Studies

We provide qualitative case studies of Uni-OPD, standard OPD, and the Student model across both LLM and MLLM benchmarks, covering textual math reasoning, code generation, logical reasoning, multimodal math reasoning, and chart understanding.

We first revisit the math reasoning case in Fig. F.1, and provide a detailed output comparison of standard OPD and our Uni-OPD. In this example, standard OPD assigns high returns to incorrect trajectories and low returns to correct ones. Furthermore, the code generation case in Fig. F.2 highlights Uni-OPD’s ability to balance algorithmic efficiency and code readability. These case studies demonstrate how our dual-perspective optimization, specifically the restoration of order consistency through margin calibration, leads to more reliable and higher-quality model outputs.

Across the multimodal case studies in Figs. F.3–F.9, our observations reveal three consistent patterns: (a) Uni-OPD demonstrates superior efficiency on complex reasoning problems, producing more concise outputs while maintaining correctness, whereas the Student model and standard OPD frequently generate excessively long responses that are truncated before reaching a final answer; (b) Uni-OPD achieves higher correctness than the Student model, often succeeding on questions where the Student model fails; and (c) our data-balancing strategies encourage exploration of informative student-generated states during training, improving Uni-OPD’s ability to tackle challenging visual and mathematical reasoning problems that the Student model cannot solve on its own.