arXiv:2605.03677 · cs.LG · uncurated · rendered via ar5iv

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Wenjin Hou, Shangpin Peng (1), Weinong Wang (1), Zheng Ruan (1), Yue Zhang, Zhenglin Zhou, Mingqi Gao (1), Yifei Chen (1), Kaiqi Wang (1), Hongming Yang (1), Chengquan Zhang (1), Zhuotao Tian (2), Han Hu (1), Yi Yang, Fei Wu, Hehe Fan
Zhejiang University; (1) LLM Department, Tencent; (2) Shenzhen Loop Area Institute
houwj17@gmail.com, weinong.wang@hotmail.com, hehefan@zju.edu.cn
Abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across LLMs and MLLMs, centered on a dual-perspective optimization strategy. Specifically, from the student’s perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher’s perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD. Code is available at https://github.com/WenjinHou/Uni-OPD.

Author notes: * Equal contribution. Work was done when Wenjin Hou and Shangpin Peng interned at Tencent. Further markers denote the project leader, the project supervisor, and the corresponding author.
Figure 1: Overall performance comparisons and convergence behavior. Results are shown for settings including multi-teacher, strong-to-weak, and cross-modal distillation on math reasoning and code generation tasks. Uni-OPD consistently outperforms OPD and converges faster than RL, demonstrating its effectiveness across diverse settings.

1 Introduction

Injecting complex reasoning abilities, domain knowledge, and human preferences into LLMs and MLLMs remains a core challenge in the post-training stage. Conventional approaches typically follow a two-stage paradigm: supervised fine-tuning (SFT) first, followed by reinforcement learning (RL) (Guo et al., 2025a; Xu et al., 2025a; Zeng et al., 2026; Zhao et al., 2026a). While SFT leverages expert data for training, its inherently off-policy nature introduces substantial exposure bias (Qin et al., 2025; Song and Zheng, 2026). Entering rarely covered erroneous states during inference may lead to compounding errors. Alternatively, on-policy RL (e.g., GRPO (Shao et al., 2024b)) alleviates distribution shift through online sampling. However, it mainly relies on sequence-level or terminal rewards, making fine-grained credit assignment difficult and limiting the stability of long-term training (Team et al., 2026).

Recently, on-policy distillation (OPD) has emerged as a promising post-training paradigm for efficiently transferring the knowledge and capabilities of domain experts into a single, unified model. It combines the strengths of RL and SFT, namely on-policy sampling and token-level supervision. Concretely, OPD trains the student on its own sampled trajectories with teacher feedback under a reverse KL objective (Lu and Lab, 2025; DeepSeek-AI, 2026).

Despite its empirical success, current OPD research remains largely confined to LLM distillation (Zhou et al., 2025; Yang et al., 2026b; Xiao et al., 2026; Yang et al., 2026c; Wu et al., 2026). Although a few recent works extend OPD to MLLMs, they are restricted to limited subsets of tasks within a single modality, such as video (Li et al., 2026a) or speech (Cao et al., 2026). To this end, we first aim to develop a unified OPD framework for both LLMs and MLLMs, enabling effective knowledge distillation across tasks and modalities.

Key observations. Beyond unifying the framework, we raise a more fundamental question: what makes OPD a reliable optimization paradigm? We posit that effective OPD depends on two factors. First, the student must sufficiently explore informative states, i.e., diverse and appropriately difficult self-generated trajectories. Second, the teacher’s token-level supervision must remain reliable when applied to student rollouts. In particular, the reliability of token-level guidance is significantly enhanced when its trajectory-level aggregation remains order-consistent with outcome reward (i.e., correct trajectories receive higher aggregated scores than incorrect ones). The outcome reward thus provides a global anchor for calibrating unreliable teacher supervision. These observations motivate a dual-perspective optimization strategy that jointly improves student exploration and the reliability of teacher signals.

Our recipe. Building on these insights, we introduce Uni-OPD, a dual-perspective strategy for optimizing OPD from the fundamental roles of the student and the teacher. In this unified framework, we adopt two complementary data-balancing strategies, namely offline difficulty-aware and online correctness-aware balancing, to promote exploration of informative student-generated states. We further present a novel outcome-guided margin calibration mechanism to obtain reliable teacher supervision. Extensive experiments on LLMs and MLLMs verify our recipe.

To summarize, our contributions are threefold:

  • Key bottlenecks of OPD. We identify two core bottlenecks in OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Our analysis reveals that reliable teacher supervision largely depends on whether token-level guidance remains order-consistent with the outcome reward.

  • Dual-perspective optimization recipe. We present a dual-perspective optimization recipe for unified OPD that jointly improves student exploration and teacher supervision. Concretely, we combine offline and online data balancing with an outcome-guided margin calibration mechanism, leading to more effective optimization.

  • Comprehensive experimental validation. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation (i.e., combining text-only and multimodal tasks). Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.

2 Related Work

Knowledge distillation for LLMs and MLLMs. Knowledge distillation (Hinton et al., 2015; Xu et al., 2024) aims to transfer knowledge from a larger teacher model to a smaller student model. Conventional approaches typically rely on off-policy forward Kullback–Leibler (KL) divergence on a static dataset to align the student’s generation distribution with that of the teacher (Liu et al., 2024d; Guo et al., 2025b; He et al., 2025a; Liu and Zhang, 2025; Ko et al., 2025). Another line of work treats supervised fine-tuning (SFT) on tokens generated by the teacher as an alternative off-policy distillation strategy for eliciting reasoning capabilities during LLM and MLLM post-training (Guo et al., 2025a; Zhang et al., 2025c; Bansal et al., 2025; Zhang et al., 2025b; Team et al., 2026; Xiao et al., 2026). Though effective, these off-policy methods essentially imitate the teacher’s behavior, limiting the student’s ability to surpass the teacher and making the student prone to exposure bias (Song and Zheng, 2026).

On-policy distillation. OPD (Agarwal et al., 2024; Lu and Lab, 2025) allows a superior teacher to provide feedback on the student’s on-policy trajectories. This paradigm effectively alleviates exposure bias and elevates the student’s upper performance bound. Owing to these merits, OPD has become an efficient way to merge capabilities from multiple experts into a single student model (Xiao et al., 2026; Yang et al., 2026c), as well as to support strong-to-weak distillation (Bai et al., 2025a; Zeng et al., 2026). Building on this paradigm, current studies on OPD have branched into several key directions. From the teacher’s perspective, recent work explores teacher-free self-distillation paradigms (Kujanpää et al., 2024; Shenfeld et al., 2026; Zhao et al., 2026b; Hübotter et al., 2026; Ye et al., 2026; Zhang et al., 2026a; Stein et al., 2026), develops black-box OPD methods (Ye et al., 2025; Xiong et al., 2026), and facilitates distillation across different model families (Patiño et al., 2025). Complementary efforts focus on unified training frameworks (Zhang et al., 2026b) and stable optimization strategies (Jin et al., 2026; Kim and Baek, 2026; Li et al., 2026b; Xu et al., 2026) combined with RL (Yang et al., 2026a; Qu et al., 2026; Jang et al., 2026; Wang et al., 2026). Only a few works extend OPD to multimodal domains (Bousselham et al., 2025; Ko et al., 2026; Li et al., 2026a; Cao et al., 2026). In this work, we advance OPD with a dual-perspective recipe that promotes student exploration and teacher reliability, generalizing across LLMs and MLLMs. More detailed related work is provided in Appendix E.

3 Methodology

Figure 2: Overview of the Uni-OPD framework. (Left) Offline difficulty-aware and online correctness-aware data balancing promote student exploration. (Right) Outcome-guided margin calibration mechanism improves the reliability of teacher supervision. (Middle) The resulting student policy merges complementary capabilities from multiple domain-specific teachers more effectively than standard OPD, leading to stronger overall performance.

We propose Uni-OPD, a unified framework that advances OPD across LLMs and MLLMs, as shown in Fig. 2. Our design is driven by two fundamental bottlenecks in OPD: insufficient exploration of informative student-generated states and unreliable teacher supervision for student rollouts. Uni-OPD addresses them with a dual-perspective recipe that enhances student exploration and calibrates teacher supervision to align with the outcome reward. We first introduce the preliminaries in section 3.1, followed by an overview of Uni-OPD in section 3.2. We then detail the exploration strategy in section 3.3 and the supervision calibration mechanism in section 3.4.

3.1 Preliminaries

On-policy distillation. OPD retains the on-policy nature of optimization while providing token-level credit assignment, enabling effective post-training. During training, the student policy $\pi_{\bm{\theta}}$ samples its trajectories and is optimized by minimizing the reverse Kullback-Leibler (KL) divergence to the teacher policy $\pi_{\mathrm{T}}$ over these samples:

$$\mathcal{J}_{\text{OPD}}(\bm{\theta})=\min_{\bm{\theta}}\mathbb{E}_{\bm{q}\sim D,\,\bm{\tau}\sim\pi_{\bm{\theta}}(\cdot\mid\bm{q})}\Big[\mathcal{D}_{\mathrm{KL}}\Big(\pi_{\bm{\theta}}(\bm{\tau}\mid\bm{q})\,\big\|\,\pi_{\mathrm{T}}(\bm{\tau}\mid\bm{q})\Big)\Big], \tag{1}$$

where $\bm{q}$ is the input question, $\bm{\tau}=(o_{1},\dots,o_{|\bm{\tau}|})$ is a trajectory sampled by the student, $o_{t}$ is the token at step $t$, and $|\bm{\tau}|$ is the length of the trajectory. The gradient of OPD can be derived as:

$$\nabla_{\bm{\theta}}\mathcal{J}_{\text{OPD}}(\bm{\theta})=\mathbb{E}_{\bm{q}\sim D,\,\bm{\tau}\sim\pi_{\bm{\theta}}(\cdot\mid\bm{q})}\Big[\sum_{t=1}^{|\bm{\tau}|}\big(\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log\pi_{\mathrm{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})\big)\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})\Big], \tag{2}$$

where $\bm{o}_{<t}$ denotes the prefix before step $t$. The gradient naturally induces a token-level reward at step $t$, analogous to standard RL:

$$r^{\mathrm{OPD}}_{t}=\log\pi_{\mathrm{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})=\log\frac{\pi_{\mathrm{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})}{\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})}. \tag{3}$$

This formulation provides fine-grained credit assignment signals at the token level.
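As a minimal sketch of Eq. 3, the token-level reward can be computed directly from the log-probabilities that the teacher and the student assign to each sampled token. The function name and the example probabilities below are hypothetical and only illustrate the arithmetic; they are not taken from the released code.

```python
import math

def opd_token_rewards(teacher_logprobs, student_logprobs):
    """Token-level OPD reward of Eq. 3 for each sampled token o_t."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]

# Hypothetical probabilities each model assigns to the three sampled tokens.
teacher_probs = [0.5, 0.25, 0.8]
student_probs = [0.5, 0.5, 0.4]

rewards = opd_token_rewards(
    [math.log(p) for p in teacher_probs],
    [math.log(p) for p in student_probs],
)

# Single-sample estimate of the sequence-level reverse KL in Eq. 1:
# the negated sum of token-level rewards along the sampled trajectory.
kl_estimate = -sum(rewards)
```

A token where the teacher is less confident than the student receives a negative reward, and a token where the teacher is more confident receives a positive one, which is the fine-grained credit assignment the text refers to.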

Analyzing teacher supervision in OPD. As shown in Eq. 3, OPD relies on the teacher to provide fine-grained supervision for student-generated trajectories. For effective optimization, this signal should align with overall trajectory correctness. In practice, this alignment is not guaranteed and can fail in several typical ways: (a) OOD degradation: when student rollouts enter sparse or out-of-distribution regions relative to the teacher, $\log\pi_{\mathrm{T}}(o_{t}\mid\cdot)$ may become noisy, disrupting the ranking between correct and incorrect trajectories. (b) Overestimation of incorrect trajectories: incorrect trajectories may receive abnormally high scores when their local token patterns align with the teacher’s high-confidence regions. (c) Underestimation of correct trajectories: correct trajectories may receive abnormally low scores when their generation paths deviate from the teacher’s dominant regions, thereby suppressing useful reasoning paths. These phenomena suggest that teacher supervision is not always reliable, motivating us to introduce an outcome reward as a global anchor for calibrating trajectory-level supervision.

3.2 The Overview of Uni-OPD

In this work, we propose Uni-OPD, a unified OPD framework that generalizes across both LLMs and MLLMs, as illustrated in Fig. 2. Formally, given expert teachers $\{\pi_{\mathrm{T}_{1}},\pi_{\mathrm{T}_{2}},\dots,\pi_{\mathrm{T}_{N}}\}$ who specialize in different domains, and letting $w_{i}$ denote the weight assigned to teacher $\pi_{\mathrm{T}_{i}}$, we define the objective as:

$$\mathcal{J}_{\text{Uni-OPD}}(\bm{\theta})=\sum_{i=1}^{N}w_{i}\,\mathcal{D}_{\mathrm{KL}}\left(\pi_{\bm{\theta}}\,\|\,\pi_{\mathrm{T}_{i}}\right). \tag{4}$$

This formulation provides a unified objective for both single-teacher and multi-teacher distillation by aggregating supervision from multiple experts. Building on this objective, we optimize OPD from the two fundamental roles. From the student’s perspective, we introduce a data-balancing strategy that promotes exploration via offline difficulty-aware and online correctness-aware selection. From the teacher’s perspective, we develop an outcome-guided margin calibration mechanism to correct unreliable token-level supervision by enforcing consistency with outcome rewards. These designs stabilize optimization and improve the reliability of OPD.
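To make the weighted sum in Eq. 4 concrete, the sketch below evaluates it exactly for small discrete distributions over a shared vocabulary. This is an illustration under simplifying assumptions: the function names, the two-token vocabulary, and the equal weights are hypothetical, and an actual training run would estimate the KL terms from sampled rollouts rather than enumerate the vocabulary.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for two discrete distributions on the same support.

    Assumes teacher[i] > 0 wherever student[i] > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)

def uni_opd_objective(student, teachers, weights):
    """Weighted sum of reverse KL terms, one per teacher (Eq. 4)."""
    if len(teachers) != len(weights):
        raise ValueError("need exactly one weight per teacher")
    return sum(w * reverse_kl(student, t) for w, t in zip(weights, teachers))

# Hypothetical next-token distributions over a two-token vocabulary.
student = [0.5, 0.5]
teachers = [[0.5, 0.5], [0.25, 0.75]]
weights = [0.5, 0.5]

objective = uni_opd_objective(student, teachers, weights)
```

With a single teacher and weight 1, the same function reduces to the single-teacher reverse KL, which is why one objective covers both settings.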

3.3 Joint Offline and Online Data Balancing Strategy for Student Exploration

From the student’s perspective, sufficient diversity and an appropriate level of difficulty in the generated trajectories are essential for effective OPD. To this end, based on our empirical study, we propose complementary data-balancing strategies for both offline data construction and online sampling.

Figure 3: Data difficulty distribution and its impact on OPD performance. (Left) Training data often exhibits mirrored J-shaped or U-shaped difficulty distributions. (Right) A naive strategy is to filter out overly easy or overly hard samples (i.e., all-correct or all-wrong cases), but this reduces diversity. In contrast, our difficulty-balancing strategy upsamples mid-difficulty samples to preserve a balanced spectrum and empirically outperforms filtering.

Offline difficulty-aware data balancing. A prevalent practice in RL is to estimate prompt difficulty via multiple rollouts and then filter out samples that are either overly easy (i.e., always correct) or overly hard (i.e., always incorrect) (An et al., 2025; Zhou et al., 2023a). However, for small-scale models, training data often exhibits a mirrored J-shaped or U-shaped distribution (see Fig. 3). Strictly removing these easy or hard samples can substantially reduce data diversity and limit exploration of informative student-generated states. Our empirical findings show that such filtering leads to substantial performance degradation in OPD.

Based on this observation, we adopt a difficulty-aware balancing strategy that selectively upsamples mid-difficulty samples (i.e., correct in only some of multiple rollouts). As shown in Fig. 3, this strategy reshapes the data distribution into a more uniform form while preserving both diversity and difficulty. In addition, it consistently improves performance on math reasoning and code generation. Overall, these results show that maintaining data diversity and a balanced difficulty spectrum enables the student to generate more informative trajectories, thereby exploring a broader solution space.
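One way to picture difficulty-aware balancing is sketched below: prompts whose rollouts are neither all correct nor all incorrect are repeated in the training pool, while easy and hard prompts are kept once rather than filtered out. The fixed repetition factor, the function names, and the prompt identifiers are assumptions made for illustration; the text above does not specify the exact upsampling rule.

```python
import random

def pass_rate(outcomes):
    """Fraction of correct rollouts (outcomes are 0/1) for one prompt."""
    return sum(outcomes) / len(outcomes)

def balance_by_difficulty(prompt_outcomes, mid_upsample=2, seed=0):
    """Return a list of prompt ids in which mid-difficulty prompts are repeated.

    prompt_outcomes maps a prompt id to the 0/1 outcomes of its rollouts.
    mid_upsample is a hypothetical repetition factor, not a value from the paper.
    """
    balanced = []
    for prompt_id, outcomes in prompt_outcomes.items():
        rate = pass_rate(outcomes)
        is_mid = 0.0 < rate < 1.0  # correct in only some rollouts
        balanced.extend([prompt_id] * (mid_upsample if is_mid else 1))
    random.Random(seed).shuffle(balanced)
    return balanced

# Hypothetical rollout outcomes for three prompts.
prompt_outcomes = {
    "p_easy": [1, 1, 1, 1],
    "p_mid": [1, 0, 0, 1],
    "p_hard": [0, 0, 0, 0],
}
pool = balance_by_difficulty(prompt_outcomes, mid_upsample=3)
```

Unlike filtering, every prompt survives in the pool, so diversity is preserved while the mass of the difficulty distribution moves toward the middle.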

Figure 4: Impact of the online correct-to-incorrect trajectory ratio on the student’s final performance.

Online correctness-aware data balancing. After applying offline difficulty-aware balancing, we further observe that insufficient exploration can cause the model to collapse to local optima during training, especially when rollout groups lack sufficient outcome diversity (e.g., only incorrect trajectories). To mitigate this issue, we explicitly enforce a balanced composition of correct and incorrect trajectories within each rollout group during training. This prevents degenerate cases in which all samples share the same outcome and thus yield uninformative gradients. By maintaining such a balance, we ensure that the student consistently receives meaningful contrastive signals for stable on-policy learning. As shown in Fig. 4, an appropriate outcome balance achieves better performance than using only correct samples or an excessively high correct/incorrect ratio.
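The following sketch shows one possible way to enforce a mixed composition within a rollout group: oversample rollouts, then keep a fixed number of correct and incorrect ones. The selection rule, the default fraction, and all names are assumptions for illustration; the paper describes the goal (balanced outcomes per group) but not this specific procedure.

```python
def balance_rollout_group(rollouts, group_size, correct_fraction=0.5):
    """Select up to group_size rollouts containing both outcomes.

    rollouts is a list of (trajectory, outcome) pairs with outcome in {0, 1}.
    Returns None for a degenerate group in which all outcomes are identical.
    Assumes group_size >= 2; correct_fraction is a hypothetical target.
    """
    correct = [r for r in rollouts if r[1] == 1]
    incorrect = [r for r in rollouts if r[1] == 0]
    if not correct or not incorrect:
        return None  # no contrast between correct and incorrect rollouts
    target = round(group_size * correct_fraction)
    # Keep at least one rollout of each outcome in the selected group.
    n_correct = min(len(correct), max(1, min(group_size - 1, target)))
    n_incorrect = min(len(incorrect), group_size - n_correct)
    return correct[:n_correct] + incorrect[:n_incorrect]

# Hypothetical oversampled rollouts for one prompt.
oversampled = [("a", 1), ("b", 1), ("c", 1), ("d", 0), ("e", 0), ("f", 1)]
group = balance_rollout_group(oversampled, group_size=4)
```

Returning None for single-outcome groups leaves the caller free to resample or skip the prompt, which is a design choice of this sketch rather than something stated in the text.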

3.4 Outcome-guided Margin Calibration for Teacher Supervision

A basic premise of OPD is that the teacher exhibits a directional likelihood preference over positive and negative trajectories. In particular, relative to the student, the teacher should assign higher likelihood to correct trajectories and lower likelihood to incorrect ones. Under this premise, the resulting distillation signal should remain consistent with outcome-level correctness at the trajectory level. We next formalize this principle through a trajectory-level distillation return and develop an outcome-guided calibration strategy based on it.

Figure 5: Demonstration of unreliable teacher supervision and outcome-guided margin calibration mechanism. (Left) Standard teacher supervision in OPD suffers from misalignment between trajectory-level return and outcome rewards, yielding unreliable supervision signals. (Right) Our method uses outcome rewards as a global anchor to calibrate returns through margin-based adjustment, restoring order consistency and improving optimization stability.

Trajectory-level distillation return. To characterize the overall supervision signal along a rollout trajectory, we define the trajectory-level distillation return as the average log-probability gap between the teacher and the student:

$$G_{\mathrm{OPD}}(\bm{q},\bm{\tau})\triangleq\frac{1}{|\bm{\tau}|}\sum_{t=1}^{|\bm{\tau}|}\log\frac{\pi_{\mathrm{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})}{\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})}=\frac{1}{|\bm{\tau}|}\sum_{t=1}^{|\bm{\tau}|}r^{\mathrm{OPD}}_{t}. \tag{5}$$

This quantity measures the teacher’s average log-likelihood preference over the student along trajectory $\bm{\tau}$. When $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})>0$, the teacher assigns higher confidence than the student on average, encouraging the student to move toward this trajectory. Conversely, when $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})<0$, the student is discouraged from moving toward this trajectory. The normalization by trajectory length ensures comparability across trajectories of different lengths.
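A short sketch of Eq. 5 illustrates the effect of length normalization: two trajectories with the same per-token gap receive the same return regardless of length. The function name and the log-probability values are hypothetical.

```python
def distillation_return(teacher_logprobs, student_logprobs):
    """Length-normalized return of Eq. 5: mean of log pi_T - log pi_theta."""
    n = len(teacher_logprobs)
    if n == 0 or n != len(student_logprobs):
        raise ValueError("need two non-empty log-prob sequences of equal length")
    return sum(t - s for t, s in zip(teacher_logprobs, student_logprobs)) / n

# Hypothetical log-probs: the teacher is 0.5 nats more confident on every token.
short_traj = distillation_return([-1.0, -1.0], [-1.5, -1.5])
long_traj = distillation_return([-1.0] * 8, [-1.5] * 8)

# Here the teacher is less confident than the student, so the return is negative.
discouraged = distillation_return([-2.0, -3.0], [-1.0, -1.0])
```

Without the division by the trajectory length, the eight-token trajectory would score four times higher than the two-token one despite an identical per-token preference.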

Order consistency as a trajectory-level criterion. For a given question $\bm{q}$, let $R(\bm{q},\bm{\tau})\in\{0,1\}$ denote the outcome reward of a sampled trajectory $\bm{\tau}$, where $R(\bm{q},\bm{\tau})=1$ indicates that the final answer in $\bm{\tau}$ is correct for question $\bm{q}$, and $R(\bm{q},\bm{\tau})=0$ otherwise. We then define the positive and negative trajectory sets as:

$$S_{+}(\bm{q})\triangleq\{\bm{\tau}\mid R(\bm{q},\bm{\tau})=1\},\qquad S_{-}(\bm{q})\triangleq\{\bm{\tau}\mid R(\bm{q},\bm{\tau})=0\}. \tag{6}$$

Following the trajectory-level bandit formulation in (Ouyang et al., 2022), we treat the prompt as the context and the entire generated trajectory as a macro-action. Under this view, the associated outcome reward naturally serves as a one-step trajectory-level return, denoted as $G_{\mathrm{RL}}(\bm{q},\bm{\tau})=R(\bm{q},\bm{\tau})$. Therefore, the outcome-level RL return induces the following oracle ordering:

$$G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{+})\geq G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{-}),\qquad\forall\bm{\tau}_{+}\in S_{+}(\bm{q}),\;\forall\bm{\tau}_{-}\in S_{-}(\bm{q}). \tag{7}$$

The derivation process is provided in section A.3. This motivates a trajectory-level reliability criterion for OPD. Under the distillation premise, the trajectory-level distillation return $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ should preserve the same outcome-induced ordering as $G_{\mathrm{RL}}(\bm{q},\bm{\tau})$. Specifically, for any prompt $\bm{q}$, we expect:

$$G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{+})\geq G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{-}),\qquad\forall\bm{\tau}_{+}\in S_{+}(\bm{q}),\;\forall\bm{\tau}_{-}\in S_{-}(\bm{q}). \tag{8}$$

Teacher supervision may violate ordering. In practice, however, the teacher’s supervision is not always reliable. As discussed in section 3.1, teacher scoring may degrade in sparse out-of-distribution regions, overestimate incorrect trajectories, or underestimate correct ones due to spurious local patterns. Such failures may persist even after token-level supervision is aggregated to the trajectory level. A mean-based criterion is therefore insufficient, since the mismatch is often concentrated in a few extreme samples: a single overly confident negative trajectory or a severely underestimated positive trajectory can already distort the supervision signal for the entire prompt group.

Outcome-guided margin calibration. Based on the above analysis, during OPD training, the constraint in Eq. 8 should hold between positive and negative trajectories within each prompt. To this end, we consider the margin between the lowest-scoring correct trajectory and the highest-scoring incorrect trajectory, which directly characterizes whether the ordering is violated in the most adversarial case. We define the prompt-level margin as

$$m(\bm{q})\triangleq\min_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau}). \tag{9}$$

By construction, $m(\bm{q})\geq 0$ indicates order consistency on prompt $\bm{q}$ in the sense of Eq. 8: even the worst positive trajectory scores at least as high as the best negative one, so no negative trajectory is ranked above a positive one (see Fig. 5). To improve robustness, we further require:

$$m(\bm{q})\geq\delta, \tag{10}$$

where $\delta>0$ defines a safety margin against estimation noise and finite-sample fluctuations. Since $S_{+}(\bm{q})$ and $S_{-}(\bm{q})$ are determined by outcome rewards, this criterion uses the outcome signal as a global anchor to calibrate the teacher’s trajectory-level scores. This formulation enables direct interventions on the margin, allowing us to suppress ordering violations or enlarge the separation between positive and negative trajectories.
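As a small worked illustration of the prompt-level margin, consider a prompt with two correct and two incorrect rollouts. The return values below are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
S_{+}(\bm{q}) &:\; G_{\mathrm{OPD}} \in \{0.30,\ 0.05\}, \qquad
S_{-}(\bm{q}) :\; G_{\mathrm{OPD}} \in \{0.10,\ -0.20\},\\
m(\bm{q}) &= \min\{0.30,\ 0.05\} - \max\{0.10,\ -0.20\} = 0.05 - 0.10 = -0.05 .
\end{align*}
```

Here $m(\bm{q})<0$, so one incorrect rollout outranks a correct one and the group fails the margin requirement for any $\delta>0$. The mean return of the correct rollouts (0.175) still exceeds that of the incorrect ones (-0.05), which shows why a mean-based check would miss this violation while the min-max margin detects it.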

Margin calibration strategy. Based on Eq. 10, we present two calibration strategies: margin mask and margin shift. Specifically, the margin mask keeps only the prompt groups satisfying $m(\bm{q})\geq\delta$ and discards the rest, so that training is performed only with reliable supervision. Margin shift instead repairs an unreliable group with the smallest additive correction. For groups with $m(\bm{q})<\delta$, we define:

$$\lambda(\bm{q})=\delta-m(\bm{q}),\qquad\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})=G_{\mathrm{OPD}}(\bm{q},\bm{\tau})+\lambda(\bm{q})\,\bm{1}\{R(\bm{q},\bm{\tau})=1\}. \tag{11}$$

This shift preserves the relative ordering within $S_{+}(\bm{q})$ and guarantees

$$\min_{\bm{\tau}\in S_{+}(\bm{q})}\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau})=\delta. \tag{12}$$

In this way, margin shift restores outcome-consistent ordering with a minimal group-level correction, while margin mask provides a more conservative alternative when the supervision signal is too unreliable to calibrate.
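The two calibration strategies can be sketched on a single prompt group as follows. This is a minimal illustration of Eqs. 9 to 12 on trajectory-level returns, not the released implementation: the function names are hypothetical, and the handling of groups that contain only one outcome (margin undefined, group left unchanged) is an assumption of this sketch.

```python
def prompt_margin(returns, outcomes):
    """Eq. 9: lowest correct return minus highest incorrect return.

    Returns None when the group lacks correct or incorrect rollouts
    (an assumption of this sketch; the margin is undefined there).
    """
    pos = [g for g, r in zip(returns, outcomes) if r == 1]
    neg = [g for g, r in zip(returns, outcomes) if r == 0]
    if not pos or not neg:
        return None
    return min(pos) - max(neg)

def margin_mask(returns, outcomes, delta):
    """Margin mask: keep the group only if m(q) >= delta."""
    m = prompt_margin(returns, outcomes)
    return m is not None and m >= delta

def margin_shift(returns, outcomes, delta):
    """Margin shift (Eq. 11): add lambda = delta - m(q) to correct rollouts."""
    m = prompt_margin(returns, outcomes)
    if m is None or m >= delta:
        return list(returns)  # already reliable, or nothing to calibrate
    lam = delta - m
    return [g + lam if r == 1 else g for g, r in zip(returns, outcomes)]

# Hypothetical trajectory-level returns and outcome rewards for one prompt.
returns = [0.30, 0.05, 0.10, -0.20]
outcomes = [1, 1, 0, 0]
delta = 0.1

keep_group = margin_mask(returns, outcomes, delta)  # margin is -0.05, so dropped
shifted = margin_shift(returns, outcomes, delta)    # correct rollouts move up by 0.15
```

After the shift, the margin of the group equals delta, matching Eq. 12, and the two correct rollouts keep their relative order because both receive the same additive correction.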

Table 1: Performance of Qwen3-4B Student under math reasoning and code generation benchmarks. Teacher models (i.e., Qwen3-4B-Math-RL and Qwen3-4B-Code-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type. Bold values indicate the best score within each group. Avg. denotes the average score within each domain. Columns AIME 2024 through Math Avg. cover Math Reasoning; columns HumanEval+ through Code Avg. cover Code Generation.

| Method | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Math Avg. | HumanEval+ | MBPP+ | LCB | Code Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Student (4B) | 23.0 | 19.3 | 12.3 | 9.2 | 15.9 | 77.4 | 65.3 | 17.7 | 53.5 |
| Teacher (RL) | 60.1 | 55.1 | 32.5 | 38.5 | 46.6 | 85.2 | 69.8 | 26.6 | 60.5 |
| Single-Teacher Distillation | | | | | | | | | |
| ExPO | 58.7 | 55.2 | 32.4 | 37.0 | 45.8 | 84.8 | 70.2 | 28.0 | 61.0 |
| OPD | 57.9 | 52.4 | 30.2 | 37.8 | 44.6 | 82.6 | 68.8 | 25.7 | 59.0 |
| ExOPD | 62.7 | 56.1 | 33.9 | 39.3 | 48.0 | 86.9 | 70.7 | 28.6 | 62.1 |
| Uni-OPD | 63.3 | 57.0 | 34.8 | 39.8 | 48.7 | 88.3 | 71.6 | 29.7 | 63.2 |
| Multi-Teacher Distillation | | | | | | | | | |
| SFT | 58.5 | 53.3 | 30.7 | 34.8 | 44.3 | 86.4 | 69.6 | 26.4 | 60.8 |
| ExPO | 57.5 | 54.5 | 31.7 | 36.3 | 45.0 | 86.7 | 72.0 | 29.0 | 62.6 |
| OPD | 60.9 | 55.2 | 33.4 | 38.3 | 47.0 | 86.3 | 70.9 | 23.4 | 60.2 |
| ExOPD | 61.0 | 56.0 | 34.4 | 39.2 | 47.7 | 86.3 | 70.6 | 29.0 | 62.0 |
| Uni-OPD | 62.3 | 57.2 | 34.9 | 39.6 | 48.5 | 88.0 | 72.6 | 30.1 | 63.6 |

4 Experiments and Analysis

In this section, we conduct comprehensive experiments across both textual and multimodal domains to evaluate the effectiveness of Uni-OPD. We first detail the experimental configurations (section 4.1). Subsequently, we assess how the proposed recipe improves OPD performance across diverse distillation scenarios for LLMs and MLLMs, including single-teacher and multi-teacher distillation (section 4.2), strong-to-weak distillation (section 4.3), and cross-modal distillation (section 4.4). Finally, we provide a rigorous ablation study to further analyze the core strategies of our method (section 4.5).

4.1 Experimental Setup

Table 2: Performance of Qwen3-VL-4B-Instruct Student under math reasoning, logic reasoning, and document understanding benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Math-RL, Qwen3-VL-4B-Instruct-Logic-RL and Qwen3-VL-4B-Instruct-Document-RL) are developed through domain-specific RL. Bold values indicate the best score within each group. Avg. denotes the mean score within each category. Columns MathVision through Math Avg. cover Math Reasoning; LogicVista Accuracy through Logic Avg. cover Logic Reasoning; AI2D through Doc. Avg. cover Document Understanding.

| Method | MathVision | DynaMath | WeMath | Math Avg. | LogicVista Accuracy | LogicVista Format | VisuLogic | Logic Avg. | AI2D | ChartQA | DocVQA | InfoVQA | Doc. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student (4B) | 33.8 | 62.2 | 67.5 | 54.5 | 49.9 | 66.4 | 25.1 | 47.0 | 81.7 | 73.5 | 94.9 | 79.8 | 82.5 |
| Teacher (RL) | 47.2 | 65.3 | 79.5 | 64.0 | 52.5 | 73.8 | 27.4 | 51.2 | 82.5 | 76.4 | 95.1 | 81.6 | 83.9 |
| Single-Teacher Distillation | | | | | | | | | | | | | |
| OPD | 47.5 | 64.8 | 77.5 | 63.3 | 49.8 | 73.0 | 26.1 | 49.6 | 82.4 | 75.4 | 95.2 | 81.4 | 83.6 |
| Uni-OPD | 47.8 | 65.4 | 78.3 | 63.9 | 53.1 | 73.8 | 28.2 | 51.7 | 82.6 | 75.8 | 95.2 | 81.2 | 83.7 |
| Multi-Teacher Distillation | | | | | | | | | | | | | |
| OPD | 41.0 | 60.9 | 71.7 | 57.9 | 51.3 | 72.3 | 26.3 | 50.0 | 82.6 | 75.0 | 95.1 | 81.3 | 83.4 |
| Uni-OPD | 45.5 | 62.3 | 76.1 | 61.0 | 54.0 | 75.2 | 27.5 | 52.5 | 83.0 | 75.7 | 95.3 | 81.6 | 83.9 |

Models. We conduct experiments on the Qwen3 family (Yang et al., 2025; Bai et al., 2025a). For textual experiments, we use Qwen3-4B and Qwen3-1.7B as student models. In the same-sized setting, we apply domain-specific RL to Qwen3-4B to obtain specialized teachers. In the strong-to-weak setting, we use Qwen3-30B-A3B-Instruct-2507 as the strong teacher. For multimodal experiments, we use Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct as student models, and obtain multimodal teachers through domain-specific RL. Detailed training setups are in section B.1.

Training datasets. We use task-specific training data to construct and distill specialized teachers. For textual tasks, we use 57K math reasoning samples filtered from DeepMath (He et al., 2025b) (difficulty level $\geq 6$) and 25K code generation samples from the Code subset of Eurus-2-RL-Data (Cui et al., 2025). For multimodal tasks, we use math reasoning, logic reasoning, and document understanding data mainly from OpenMMReasoner-RL-74K (Zhang et al., 2025b). Detailed training data configurations are provided in section B.2.

Baselines. We compare Uni-OPD against several representative baselines for LLM distillation: (1) SFT, which performs supervised fine-tuning on teacher-generated trajectories via cross-entropy loss; (2) ExPO (Yang et al., 2026b), a weight-space extrapolation method that merges domain-specific teachers and extrapolates their weights relative to the student model; (3) ExOPD, a reward-level extrapolation approach that scales the reward factor ($>1$) to enable the student to surpass the performance boundaries of its teachers. For MLLM experiments, since OPD remains largely underexplored in this setting, we use vanilla OPD as the primary baseline.

Evaluation benchmarks. We evaluate Uni-OPD on a comprehensive benchmark suite spanning textual and multimodal capabilities, organized along five capability axes: Textual Math Reasoning: AIME24 (AI-MO, 2024), AIME25 (OpenCompass, 2025), HMMT25 (February and November) (Balunović et al., 2025); Textual Code Generation: HumanEval+ (Liu et al., 2023b), MBPP+ (Liu et al., 2023b), and LiveCodeBench (v6 only, Feb. 25 to May 25) (Jain et al., 2024); Multimodal Math Reasoning: MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024), and WeMath (Qiao et al., 2025); Multimodal Logic Reasoning: LogicVista (Xiao et al., 2024) and VisuLogic (Xu et al., 2025b); Document Understanding: AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), DocVQA (Mathew et al., 2021), and InfoVQA (Mathew et al., 2022). Detailed information is in section C.1.

4.2 Single-Teacher and Multi-Teacher Distillation on LLMs and MLLMs

OPD is an effective and flexible paradigm for consolidating capabilities from one or multiple teachers into a unified student model, so we first evaluate Uni-OPD on both LLMs and MLLMs across diverse domains. For LLMs, following G-OPD (Yang et al., 2026b), we conduct experiments on math reasoning and code generation. For MLLMs, we consider three domains: math reasoning, logic reasoning, and document understanding.

Main results. As shown in Table 1, Uni-OPD achieves the best overall performance on LLM distillation under both single-teacher and multi-teacher settings. In single-teacher distillation, Uni-OPD consistently outperforms OPD and ExOPD, obtaining the highest scores of 48.7 on math reasoning and 63.2 on code generation. More importantly, under multi-teacher distillation, Uni-OPD effectively merges the distinct capabilities of multiple teachers into a single student model, yielding gains of 1.5 and 3.4 points over OPD on math reasoning and code generation, respectively.

A similar trend is observed for MLLMs in Table 2. Under single-teacher distillation, Uni-OPD delivers the best average performance in all three domains, reaching 63.9 on math reasoning, 51.7 on logic reasoning, and 83.7 on document understanding. For multi-teacher distillation, Uni-OPD consistently outperforms OPD, improving the average score from 57.9 to 61.0 on math reasoning, from 50.0 to 52.5 on logic reasoning, and from 83.4 to 83.9 on document understanding. The consistent gains across settings validate the robustness of Uni-OPD.

4.3 Strong-to-Weak Distillation

Table 3: Results for strong-to-weak distillation setting under math reasoning and code generation benchmarks. The teacher model is Qwen3-30B-A3B-Instruct-2507, and the student models are the smaller Qwen3-4B and Qwen3-1.7B. Bold values indicate the best score within each group. Avg. denotes the average score within each domain.
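| Method | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Math Avg. | HumanEval+ | MBPP+ | LCB | Code Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher | 72.1 | 61.4 | 42.5 | 57.1 | 58.3 | 81.9 | 77.2 | 23.4 | 60.8 |
| Qwen3-4B Student | | | | | | | | | |
| Student | 23.0 | 19.3 | 12.3 | 9.2 | 15.9 | 77.4 | 65.3 | 17.7 | 53.5 |
| OPD | 56.5 | 46.4 | 28.5 | 33.4 | 41.2 | 82.9 | 72.4 | 21.6 | 59.0 |
| Uni-OPD | 55.9 | 50.2 | 29.8 | 35.6 | 42.9 | 83.1 | 71.3 | 28.0 | 60.8 |
| Qwen3-1.7B Student | | | | | | | | | |
| Student | 13.9 | 11.1 | 5.6 | 4.9 | 8.9 | 61.9 | 53.4 | 11.9 | 42.4 |
| OPD | 35.7 | 27.6 | 17.2 | 14.6 | 23.8 | 67.1 | 56.7 | 23.4 | 49.1 |
| Uni-OPD | 35.2 | 30.7 | 17.7 | 16.4 | 25.0 | 71.5 | 58.6 | 28.0 | 52.7 |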

Strong-to-weak distillation is particularly important for the practical post-training of small models (Bai et al., 2025a). We further investigate whether Uni-OPD can better facilitate the transfer of reasoning capabilities from a larger, stronger teacher model (e.g., Qwen3-30B-A3B-Instruct-2507) to significantly smaller students (e.g., Qwen3-4B and Qwen3-1.7B). In this setting, the student is trained on both math and code data, with teacher feedback provided across both domains, which can be viewed as a multi-teacher scenario.

Main results. Table 3 presents the results for the strong-to-weak distillation setting. When distilled from the 30B teacher, Uni-OPD outperforms standard OPD on the domain averages for both the 4B and 1.7B students. For the 4B student, Uni-OPD achieves average scores of 42.9 on math reasoning and 60.8 on code generation, surpassing standard OPD by 1.7 and 1.8 points, respectively. The trend holds for the more capacity-constrained 1.7B student, where Uni-OPD lifts performance to 25.0 on math reasoning and 52.7 on code generation (+1.2 and +3.6 points over OPD). The gains are not uniform across benchmarks: OPD remains slightly ahead on AIME 2024 for both students and on MBPP+ for the 4B student. Overall, these results indicate that Uni-OPD helps bridge the capacity gap, enabling smaller students to absorb complex reasoning behaviors from stronger teachers.
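The domain averages and gains quoted above can be re-derived from the per-benchmark scores in Table 3. The short script below is only a sanity check on the reported arithmetic; it assumes that Avg. is the unweighted mean over the benchmarks in a domain, rounded to one decimal, and that gains are differences of those rounded averages.

```python
# Per-benchmark scores copied from Table 3.
# Math order: AIME 2024, AIME 2025, HMMT 25 Feb., HMMT 25 Nov.
# Code order: HumanEval+, MBPP+, LCB.
TABLE3 = {
    ("Qwen3-4B", "OPD"): {"math": [56.5, 46.4, 28.5, 33.4], "code": [82.9, 72.4, 21.6]},
    ("Qwen3-4B", "Uni-OPD"): {"math": [55.9, 50.2, 29.8, 35.6], "code": [83.1, 71.3, 28.0]},
    ("Qwen3-1.7B", "OPD"): {"math": [35.7, 27.6, 17.2, 14.6], "code": [67.1, 56.7, 23.4]},
    ("Qwen3-1.7B", "Uni-OPD"): {"math": [35.2, 30.7, 17.7, 16.4], "code": [71.5, 58.6, 28.0]},
}


def domain_avg(scores):
    """Unweighted mean over benchmarks, rounded to one decimal (assumed convention)."""
    return round(sum(scores) / len(scores), 1)


avgs = {key: {d: domain_avg(s) for d, s in row.items()} for key, row in TABLE3.items()}

# Gain of Uni-OPD over OPD, computed from the rounded domain averages.
gains = {
    student: {
        d: round(avgs[(student, "Uni-OPD")][d] - avgs[(student, "OPD")][d], 1)
        for d in ("math", "code")
    }
    for student in ("Qwen3-4B", "Qwen3-1.7B")
}
print(gains)
```

Under these assumptions the script reproduces the averages in Table 3 and gives gains of 1.7/1.8 points (math/code) for the 4B student and 1.2/3.6 points for the 1.7B student.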

4.4 Cross-Modal Distillation

Table 4: Results for cross-modal distillation under textual code generation and multimodal math reasoning benchmarks. The student model is Qwen3-VL-4B-Instruct. The teacher models are developed from the same MLLM backbone via domain-specific RL on textual code and multimodal math domains, i.e., Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Math-RL, respectively. Bold values indicate the best score within each group. Avg. denotes the average score within each domain.
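| Method | HumanEval+ (Code, Textual) | MBPP+ (Code, Textual) | LCB (Code, Textual) | Code Avg. | MathVision (Math, Multimodal) | DynaMath (Math, Multimodal) | WeMath (Math, Multimodal) | Math Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Student | 76.8 | 70.0 | 37.0 | 61.3 | 33.8 | 62.2 | 67.5 | 54.5 |
| Teacher | 82.2 | 70.5 | 40.1 | 64.3 | 47.2 | 65.3 | 79.5 | 64.0 |
| OPD | 83.1 | 70.6 | 38.6 | 64.1 | 46.1 | 65.4 | 76.6 | 62.7 |
| Uni-OPD | 84.1 | 71.4 | 41.3 | 65.6 | 46.6 | 66.5 | 78.5 | 63.9 |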

Cross-modal distillation is an important yet underexplored setting in OPD. Unlike conventional distillation settings, where capability transfer typically occurs within the same modality, here we investigate whether textual and multimodal capabilities can be unified into a single student policy. Specifically, we use Qwen3-VL-4B-Instruct as the student model, and construct domain-specific teachers from the same MLLM backbone via RL on textual code data and multimodal math data, respectively. As a result, although the student is multimodal, one of the transferred capabilities is learned from a teacher specialized in a purely textual domain, enabling capability transfer across modality boundaries. This setting is beneficial for integrating and transferring cross-modal capabilities.

Main results. As shown in Table 4, Uni-OPD achieves consistent gains over standard OPD across both textual code generation and multimodal math reasoning in this cross-modal setting. Specifically, it improves the average score from 64.1 to 65.6 on code generation and from 62.7 to 63.9 on math reasoning. On the textual side, the gains are consistent across all three code benchmarks, with the largest improvement on LCB (38.6 $\rightarrow$ 41.3). On the multimodal side, Uni-OPD improves MathVision (46.1 $\rightarrow$ 46.6), DynaMath (65.4 $\rightarrow$ 66.5), and WeMath (76.6 $\rightarrow$ 78.5). These results suggest that Uni-OPD can effectively absorb and coordinate capabilities originating from both textual and multimodal domains within a unified student model, rather than improving one domain at the expense of the other. For a broader view of cross-modal distillation, we further provide results on code and logic reasoning in appendix D.
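The claim that neither domain improves at the expense of the other can be checked per benchmark from Table 4. The snippet below simply subtracts the OPD row from the Uni-OPD row; it is a check on the reported numbers, not part of the method.

```python
# OPD and Uni-OPD rows copied from Table 4 (cross-modal distillation).
opd = {"HumanEval+": 83.1, "MBPP+": 70.6, "LCB": 38.6,
       "MathVision": 46.1, "DynaMath": 65.4, "WeMath": 76.6}
uni_opd = {"HumanEval+": 84.1, "MBPP+": 71.4, "LCB": 41.3,
           "MathVision": 46.6, "DynaMath": 66.5, "WeMath": 78.5}

# Per-benchmark gain of Uni-OPD over OPD, in points.
delta = {name: round(uni_opd[name] - opd[name], 1) for name in opd}
no_regression = all(d > 0 for d in delta.values())

print(delta)
print("no benchmark regresses:", no_regression)
```

On these numbers every benchmark improves, with the largest gains on LCB (textual) and WeMath (multimodal).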

4.5 Ablation Study

Table 5: Results of Uni-OPD variants with a Qwen3-4B Student on math reasoning and code generation. We ablate core strategies (i.e., offline data balancing, online data balancing, and margin calibration) to assess their effectiveness using the Qwen3-4B-RL and Qwen3-30B-A3B-Instruct teacher models.
| Configuration | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Math Avg. | HumanEval+ | MBPP+ | LCB | Code Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B-RL Teacher | | | | | | | | | |
| OPD | 60.9 | 55.2 | 33.4 | 38.3 | 47.0 | 86.3 | 70.9 | 23.4 | 60.2 |
| Uni-OPD | 62.3 | 57.2 | 34.9 | 39.6 | 48.5 | 88.0 | 72.6 | 30.1 | 63.6 |
| w/o offline data balancing | 62.6 | 56.5 | 32.5 | 38.5 | 47.5 | 88.0 | 71.1 | 27.9 | 62.3 |
| w/o online data balancing | 62.5 | 56.7 | 33.2 | 38.9 | 47.8 | 88.0 | 71.8 | 28.0 | 62.6 |
| w/o margin calibration | 63.0 | 54.7 | 33.4 | 38.1 | 47.3 | 86.4 | 71.6 | 25.7 | 61.2 |
| Qwen3-30B-A3B-Instruct Teacher | | | | | | | | | |
| OPD | 56.5 | 46.4 | 28.5 | 33.4 | 41.2 | 82.9 | 72.4 | 21.6 | 59.0 |
| Uni-OPD | 55.9 | 50.2 | 29.8 | 35.6 | 42.9 | 83.1 | 71.3 | 28.0 | 60.8 |
| w/o offline data balancing | 57.1 | 46.3 | 28.8 | 36.8 | 42.2 | 80.6 | 70.3 | 28.0 | 59.6 |
| w/o online data balancing | 57.0 | 47.6 | 26.8 | 37.0 | 42.1 | 81.6 | 71.4 | 28.0 | 60.3 |
| w/o margin calibration | 54.9 | 48.1 | 29.1 | 35.8 | 42.0 | 82.8 | 70.4 | 25.7 | 59.6 |

In Table 5, we ablate each strategy in Uni-OPD to evaluate its individual contribution. The full recipe improves over vanilla OPD by +1.5/+3.4 points on the math/code averages with the Qwen3-4B-RL teacher, and by +1.7/+1.8 points with the Qwen3-30B-A3B-Instruct teacher. Removing any single component lowers both domain averages under both teachers. Offline and online data balancing address insufficient exploration: without either of them, the student policy is exposed to fewer diverse and challenging trajectories. Margin calibration improves supervision reliability: without it, token-level feedback can become misaligned with outcome rewards, leading to less stable training and suboptimal performance.
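To make the ablation easier to read, the sketch below computes how many points each domain average drops when one component is removed from the full recipe, using only the Avg. columns of Table 5. It is a reading aid under the assumption that a larger drop indicates a larger contribution in that setting; it does not account for run-to-run variance, which the table does not report.

```python
# (math avg, code avg) copied from Table 5.
TABLE5 = {
    "Qwen3-4B-RL": {
        "OPD": (47.0, 60.2),
        "Uni-OPD": (48.5, 63.6),
        "w/o offline data balancing": (47.5, 62.3),
        "w/o online data balancing": (47.8, 62.6),
        "w/o margin calibration": (47.3, 61.2),
    },
    "Qwen3-30B-A3B-Instruct": {
        "OPD": (41.2, 59.0),
        "Uni-OPD": (42.9, 60.8),
        "w/o offline data balancing": (42.2, 59.6),
        "w/o online data balancing": (42.1, 60.3),
        "w/o margin calibration": (42.0, 59.6),
    },
}


def drop_when_removed(rows):
    """Points lost on (math, code) relative to full Uni-OPD for each ablated variant."""
    full_math, full_code = rows["Uni-OPD"]
    return {
        name: (round(full_math - m, 1), round(full_code - c, 1))
        for name, (m, c) in rows.items()
        if name.startswith("w/o")
    }


drops = {teacher: drop_when_removed(rows) for teacher, rows in TABLE5.items()}
for teacher, d in drops.items():
    print(teacher, d)
```

On these numbers, removing margin calibration causes the largest drop in every column except code generation with the 30B teacher, where it ties with removing offline data balancing; every ablated variant still stays above vanilla OPD on both averages.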

Table 6: Comparison of margin calibration strategies. We incorporate each strategy directly into OPD to examine which one better benefits OPD training.
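| Method | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Avg. |
| --- | --- | --- | --- | --- | --- |
| Student (4B) | 23.0 | 19.3 | 12.3 | 9.2 | 15.9 |
| OPD | 57.9 | 52.4 | 30.2 | 37.8 | 44.6 |
| + margin mask | 62.3 | 56.2 | 34.3 | 38.1 | 47.7 |
| + margin shift | 62.7 | 56.3 | 34.4 | 39.2 | 48.1 |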

Margin mask vs. margin shift. Several strategies can calibrate the return signal to improve teacher supervision. In this work, we explore two simple variants, namely margin mask and margin shift. As shown in Table 6, directly incorporating either mechanism into OPD yields consistent performance gains over the baseline (average 44.6 $\rightarrow$ 47.7 with margin mask and 48.1 with margin shift), underscoring the necessity of reliable teacher supervision. Margin shift achieves slightly better results and is therefore adopted in our main experiments. More ablations are in section D.3.
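To give some intuition for the difference between the two variants, the sketch below shows one possible reading of each. It is a hypothetical illustration, not the formulation used in the paper, which is defined in the method section. The assumptions are: the trajectory-level return is the mean of per-token rewards; order consistency means every correct rollout should score at least `margin` above the best incorrect rollout for the same prompt; "shift" adds a uniform per-token offset to under-rewarded correct rollouts; and "mask" drops rollouts that violate the ordering. All function names and the `margin` value are invented for this example.

```python
from statistics import fmean


def traj_return(token_rewards):
    """Aggregate per-token rewards into a trajectory-level return (mean, by assumption)."""
    return fmean(token_rewards)


def margin_shift(rollouts, margin=0.1):
    """Assumed variant: uniformly shift tokens of correct rollouts that fall below the margin."""
    bad = [traj_return(r["rewards"]) for r in rollouts if not r["correct"]]
    if not bad:
        return [list(r["rewards"]) for r in rollouts]
    floor = max(bad) + margin
    shifted = []
    for r in rollouts:
        g = traj_return(r["rewards"])
        delta = max(0.0, floor - g) if r["correct"] else 0.0
        shifted.append([x + delta for x in r["rewards"]])
    return shifted


def margin_mask(rollouts, margin=0.1):
    """Assumed variant: weight 0 for rollouts that violate the ordering, 1 otherwise."""
    good = [traj_return(r["rewards"]) for r in rollouts if r["correct"]]
    bad = [traj_return(r["rewards"]) for r in rollouts if not r["correct"]]
    if not good or not bad:
        return [1] * len(rollouts)
    weights = []
    for r in rollouts:
        g = traj_return(r["rewards"])
        if r["correct"]:
            violates = g < max(bad) + margin
        else:
            violates = g > min(good) - margin
        weights.append(0 if violates else 1)
    return weights


# Toy group of rollouts for one prompt (values are made up).
rollouts = [
    {"correct": False, "rewards": [0.8, 0.6, 0.7]},   # incorrect but highly rewarded
    {"correct": True, "rewards": [0.1, -0.1, 0.0]},   # correct but weakly rewarded
    {"correct": True, "rewards": [1.0, 0.9, 1.1]},    # correct and already well above
]
shifted = margin_shift(rollouts)
weights = margin_mask(rollouts)
print([round(traj_return(s), 2) for s in shifted], weights)
```

In this toy reading, shifting keeps all three rollouts in the batch with corrected ordering, whereas masking discards the two rollouts involved in the violation. That difference in how much training signal is retained is one plausible reason a shift-style calibration could do slightly better, although the source does not state the cause.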

4.6 Qualitative Evaluation

To intuitively illustrate the effectiveness of our outcome-guided margin calibration, we use a token-level reward heatmap for visualization. As shown in Fig. 6, we display the two failure modes under the same question: the overestimation of incorrect trajectories (top-left) and the underestimation of correct trajectories (bottom-left). Each token is colored by its reward value: blue tokens indicate student-preferred ($r^{\mathrm{OPD}}_{t} < 0$), and red tokens indicate teacher-preferred ($r^{\mathrm{OPD}}_{t} > 0$), with saturation proportional to magnitude. On the top-left, an incorrect rollout still accumulates a high distillation return: most of its tokens are saturated red, since they fall on regions where the teacher dominates the student. On the bottom-left, a correct rollout receives a low distillation return: its tokens are already well-covered by the student, so the teacher provides little additional return (predominantly faint colors with some blue). The right column shows the same two rollouts after our outcome-guided margin calibration. Concretely, the per-token rewards are uniformly shifted so that the trajectory-level aggregation aligns with the outcome reward.

Figure 6: Heatmap visualization of failure modes in OPD and the effect of margin shift. Left: an incorrect rollout with a high distillation return (top) and a correct rollout with a low one (bottom). Right: the same two rollouts after our margin shift, with the outcome ordering restored.
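The coloring rule described for the heatmap can be sketched as follows. The sketch assumes that the per-token reward is the teacher log-probability minus the student log-probability of the sampled token, which matches the sign convention in the text (positive means teacher-preferred) but is not confirmed by this section. The function names and the normalization by the largest magnitude in the rollout are also assumptions.

```python
def opd_token_rewards(teacher_logprobs, student_logprobs):
    """Assumed per-token reward: teacher log-prob minus student log-prob of each sampled token."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def heat_color(reward, max_abs):
    """Map a reward to (hue, saturation): red if > 0, blue if < 0, saturation in [0, 1]."""
    if max_abs == 0 or reward == 0:
        return ("none", 0.0)
    hue = "red" if reward > 0 else "blue"
    return (hue, min(1.0, abs(reward) / max_abs))


# Toy log-probabilities for a three-token rollout (values are made up).
teacher_lp = [-0.5, -2.0, -0.25]
student_lp = [-1.5, -1.0, -0.25]

rewards = opd_token_rewards(teacher_lp, student_lp)
max_abs = max(abs(r) for r in rewards)
colors = [heat_color(r, max_abs) for r in rewards]
print(rewards, colors)
```

With these toy values the first token is fully saturated red (teacher-preferred), the second fully saturated blue (student-preferred), and the third is uncolored because both models assign it the same probability.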

4.7 Analysis and Takeaways

Based on our comprehensive and systematic study on both LLMs and MLLMs across single-teacher, multi-teacher, strong-to-weak, and cross-modal distillation settings, we deliver four takeaways to further advance OPD.

  • Balancing reasoning capability and efficiency. Uni-OPD achieves the best performance with substantially fewer optimization steps than RL (Fig. 1), and consistently delivers strong reasoning capability across diverse domains (Tables 1-4, and D.1-D.3 in the Appendix).

  • Teacher value comes from the capability gap, not absolute strength alone. In OPD, even with the same 4B backbone, a domain-specific RL teacher injects new capabilities and knowledge that drive the student to improve and even surpass the teacher (Tables 1 and 2). Moreover, our dual-perspective recipe further translates this gap into student gains, consistently boosting performance across all model sizes.

  • OPD distills reasoning as a modality-agnostic capability. Trained jointly on textual and multimodal data, the multimodal student under Uni-OPD improves textual code generation and multimodal math/logic reasoning (Tables 4 and D.3). The per-token signal carries reasoning patterns largely independent of modality, enabling a unified, single-stage path that enhances both textual and multimodal reasoning within one multimodal model.

  • OPD cleanly merges specialized capabilities, with related ones reinforcing each other. Beyond two teachers, Uni-OPD extends to three, jointly improving all capabilities (Tables 2 and D.2). OPD thus offers a scalable path for merging many specialists into one reasoner, with related ones synergizing via shared reasoning structure.

Reproducibility statement. To facilitate a clear understanding of our contributions and support broader adoption of our work, we provide extensive materials. In the main text, we detail the key components of our method in section 3 and report the main experimental results in section 4. In the supplementary materials, we further elaborate on Method Details (appendix A), Training Details (appendix B), and Evaluation Details (appendix C), which together should be sufficient to reproduce our results. All code, training data, complete scripts, and model checkpoints will be open-sourced upon publication to accelerate future research.

5 Conclusion and Future Work

In this paper, we present Uni-OPD, a unified OPD framework that generalizes across LLMs and MLLMs. We identify two key bottlenecks for effective OPD: insufficient student exploration of informative states and unreliable teacher supervision for student rollouts. To address them, we propose a dual-perspective optimization strategy: (i) offline difficulty-aware and online correctness-aware data balancing for student exploration, and (ii) outcome-guided margin calibration for teacher supervision. Extensive experiments on 16 benchmarks covering multi-teacher, strong-to-weak, and cross-modal settings demonstrate the effectiveness and versatility of Uni-OPD. We hope this work can provide a practical foundation for future research on scalable and reliable distillation across models, teachers, and modalities.

For future work, our findings suggest several promising directions: (1) extending Uni-OPD to larger-scale teacher distillation settings; (2) applying Uni-OPD to broader capability merging scenarios, such as agentic planning, tool use, and long-horizon decision making; and (3) uncovering the mechanistic principles of OPD, particularly how it shapes training dynamics and parameter geometry.

References

  • J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §E.1.
  • R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations, Cited by: §E.3, §2.
  • AI-MO (2024) AIME 2024. Note: https://huggingface.co/datasets/AI-MO/aimo-validation-aime Cited by: 1st item, §4.1.
  • AI@Meta (2024a) Introducing Llama 3.1: our most capable models to date. Note: https://ai.meta.com/blog/meta-llama-3-1 Cited by: §E.1.
  • AI@Meta (2024b) Llama 3 model card. Note: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md Cited by: §E.1.
  • C. An, Z. Xie, X. Li, L. Li, J. Zhang, S. Gong, M. Zhong, J. Xu, X. Qiu, M. Wang, and L. Kong (2025) POLARIS: a post-training recipe for scaling reinforcement learning on advanced reasoning models. Cited by: §A.1, §3.3.
  • Anthropic (2023a) Claude 2. Cited by: §E.1.
  • Anthropic (2023b) Introducing Claude. Cited by: §E.1.
  • Anthropic (2024) The Claude 3 model family: Opus, Sonnet, Haiku. Cited by: §E.1.
  • J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou (2023) Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. Cited by: §E.1.
  • S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. (2025a) Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. Cited by: §2, §4.1, §4.3.
  • S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025b) Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. Cited by: §E.1.
  • M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev (2025) MathArena: evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281. Cited by: 2nd item, §4.1.
  • H. Bansal, D. S. Sachan, K. Chang, A. Grover, G. Ghosh, W. Yih, and R. Pasunuru (2025) Honeybee: data recipes for vision-language reasoners. arXiv preprint arXiv:2510.12225. Cited by: §2.
  • E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf (2023) Open LLM leaderboard. Note: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard Cited by: §D.2.
  • W. Bousselham, H. Kuehne, and C. Schmid (2025) VOLD: reasoning transfer from LLMs to vision-language models via on-policy distillation. arXiv preprint arXiv:2510.23497. Cited by: §E.3, §2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems. Cited by: §E.1.
  • D. Cao, D. Fu, H. Yu, S. Zheng, X. Tan, and T. Jin (2026) X-OPD: cross-modal on-policy distillation for capability alignment in speech llms. arXiv preprint arXiv:2603.24596. Cited by: §E.3, §1, §2.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv. Cited by: §D.2.
  • K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §D.2.
  • G. Cui, L. Yuan, Z. Wang, H. Wang, Y. Zhang, J. Chen, W. Li, B. He, Y. Fan, T. Yu, et al. (2025) Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. Cited by: §B.2, §4.1.
  • W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi (2023) InstructBLIP: towards general-purpose vision-language models with instruction tuning. Cited by: §E.1.
  • DeepSeek-AI (2026) DeepSeek-V4: towards highly efficient million-token context intelligence. Cited by: §1.
  • Y. Gu, L. Dong, F. Wei, and M. Huang (2023) MiniLLM: on-policy distillation of large language models. arXiv preprint arXiv:2306.08543. Cited by: §E.3.
  • D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025a) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §E.1, §1, §2.
  • Y. Guo, W. Yang, Z. Sun, N. Ding, Z. Liu, and Y. Lin (2025b) Learning to focus: causal attention distillation via gradient-guided token pruning. arXiv preprint arXiv:2506.07851. Cited by: §2.
  • C. He, Y. Ding, J. Guo, R. Gong, H. Qin, and X. Liu (2025a) DA-KD: difficulty-aware knowledge distillation for efficient large language models. In Forty-second International Conference on Machine Learning, Cited by: §2.
  • Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025b) DeepMath-103K: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456. Cited by: §B.2, §4.1.
  • D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. In International Conference on Learning Representations, Cited by: §D.2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • W. Hou, W. Liu, H. Hu, X. Sun, S. Yeung-Levy, and H. Fan (2026) Seeing is believing? a benchmark for multimodal large language models on visual illusions and anomalies. arXiv preprint arXiv:2602.01816. Cited by: §E.1.
  • J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802. Cited by: §E.3, §2.
  • A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) GPT-4o system card. arXiv preprint arXiv:2410.21276. Cited by: §E.1.
  • N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024) Livecodebench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974. Cited by: 3rd item, §4.1.
  • I. Jang, J. Yeom, J. Yeo, H. Lim, and T. Kim (2026) Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155. Cited by: §2.
  • W. Jin, T. Min, Y. Yang, S. R. Kadhe, Y. Zhou, D. Wei, N. Baracaldo, and K. Lee (2026) Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079. Cited by: §2.
  • A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi (2016) A diagram is worth a dozen images. In European conference on computer vision, pp. 235–251. Cited by: 1st item, §4.1.
  • M. Kim and S. J. Baek (2026) Explain in your own words: improving reasoning via token-selective dual knowledge distillation. arXiv preprint arXiv:2603.13260. Cited by: §2.
  • J. Ko, S. Abdali, Y. J. Kim, T. Chen, and P. Cameron (2026) Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137. Cited by: §2.
  • J. Ko, T. Chen, S. Kim, T. Ding, L. Liang, I. Zharkov, and S. Yun (2025) DistiLLM-2: a contrastive approach boosts the distillation of LLMs. arXiv preprint arXiv:2503.07067. Cited by: §2.
  • K. Kujanpää, P. Marttinen, H. Valpola, and A. Ilin (2024) Efficient knowledge injection in LLMs via self-distillation. arXiv preprint arXiv:2412.14964. Cited by: §2.
  • W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, Cited by: §A.1.
  • X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia (2024) LISA: reasoning segmentation via large language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, Cited by: §D.2.
  • J. Li, H. Yin, H. Xu, B. Xu, W. Tan, Z. He, J. Ju, Z. Luo, and J. Luan (2026a) Video-OPD: efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation. arXiv preprint arXiv:2602.02994. Cited by: §E.3, §1, §2.
  • J. Li, S. Yang, S. Wu, H. Shi, C. Zheng, H. Xu, and J. Jia (2025) Logits-based finetuning. arXiv preprint arXiv:2505.24461. Cited by: §E.1.
  • Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, et al. (2026b) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Cited by: §E.3, §2.
  • S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In ACL, Cited by: §D.2.
  • A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024a) DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. Cited by: §E.1.
  • H. Liu, C. Li, Y. Li, and Y. J. Lee (2024b) Improved baselines with visual instruction tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
  • H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee (2024c) LLaVA-NeXT: improved reasoning, OCR, and world knowledge. Cited by: §E.1.
  • H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023a) Visual instruction tuning. Advances in neural information processing systems. Cited by: §E.1.
  • J. Liu, C. Zhang, J. Guo, Y. Zhang, H. Que, K. Deng, Z. Bai, J. Liu, G. Zhang, J. Wang, et al. (2024d) DDK: distilling domain knowledge for efficient large language models. Advances in Neural Information Processing Systems 37, pp. 98297–98319. Cited by: §2.
  • J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023b) Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems 36, pp. 21558–21572. Cited by: 1st item, 2nd item, §4.1.
  • L. Liu and M. Zhang (2025) Less is more: selective reflection for compatible and efficient knowledge distillation in large language models. arXiv preprint arXiv:2508.06135. Cited by: §2.
  • Y. Liu, J. Cui, Z. Tian, S. Yang, Q. He, X. Wang, and J. Su (2024e) Typicalness-aware learning for failure detection. arXiv preprint arXiv:2411.01981. Cited by: §E.1.
  • K. Lu and Thinking Machines Lab (2025) On-policy distillation. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/on-policy-distillation Cited by: §1, §2.
  • A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque (2022) ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pp. 2263–2279. Cited by: §B.2, 2nd item, §4.1.
  • M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar (2022) InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706. Cited by: §B.2, 4th item, §4.1.
  • M. Mathew, D. Karatzas, and C. Jawahar (2021) DocVQA: a dataset for VQA on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2200–2209. Cited by: 3rd item, §4.1.
  • Y. Meng, M. Xia, and D. Chen (2024) SimPO: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems. Cited by: §D.2.
  • OpenAI (2023) GPT-4V(ision) system card. Cited by: §E.1.
  • OpenCompass (2025) AIME 2025. Note: https://huggingface.co/datasets/opencompass/AIME2025 Cited by: §4.1.
  • L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35. Cited by: §3.4.
  • C. M. Patiño, K. Rasul, Q. Gallouédec, B. Burtenshaw, S. Paniego, V. Srivastav, T. Frere, E. Beeching, L. Tunstall, L. von Werra, and T. Wolf (2025) Unlocking on-policy distillation for any model family. Cited by: §2.
  • S. Peng, W. Wang, Z. Tian, S. Yang, X. W, H. Xu, C. Zhang, T. Isobe, B. Hu, and M. Zhang (2026) Uni-DPO: a unified paradigm for dynamic preference optimization of LLMs. In The Fourteenth International Conference on Learning Representations, Cited by: §D.2, §E.1.
  • S. Peng, S. Yang, L. Jiang, and Z. Tian (2025) Mitigating object hallucinations via sentence-level early intervention. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §E.1.
  • R. Qiao, Q. Tan, G. Dong, M. Wu, C. Sun, X. Song, J. Wang, Z. Gongque, S. Lei, Y. Zhang, et al. (2025) We-math: does your large multimodal model achieve human-like mathematical reasoning?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 20023–20070. Cited by: 3rd item, §4.1.
  • L. Qin, Q. Chen, Y. Zhou, Z. Chen, Y. Li, L. Liao, M. Li, W. Che, and P. S. Yu (2025) A survey of multilingual large language models. Patterns 6 (1). Cited by: §1.
  • T. Qu, L. Tang, B. Peng, S. Yang, B. Yu, and J. Jia (2025) Does your vision-language model get lost in the long video sampling dilemma?. arXiv preprint arXiv:2503.12496. Cited by: §E.1.
  • Y. Qu, A. Setlur, V. Smith, R. Salakhutdinov, and A. Kumar (2026) POPE: learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779. Cited by: §2.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, Cited by: §E.1.
  • T. Shao, Z. Tian, H. Zhao, and J. Su (2024a) Explore the potential of CLIP for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, Cited by: §E.1.
  • Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024b) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §E.2, §1.
  • I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026) Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897. Cited by: §E.3, §2.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §B.1.
  • M. Song and M. Zheng (2026) A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626. Cited by: §1, §2.
  • A. Stein, F. Huang, and T. Goldstein (2026) GATES: self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574. Cited by: §2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §D.2.
  • Gemini Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. (2024) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. Cited by: §E.1.
  • Hunyuan Vision Team, P. Lyu, X. Wan, G. Li, S. Peng, W. Wang, L. Wu, H. Shen, Y. Zhou, C. Tang, Q. Yang, Q. Peng, B. Luo, H. Yang, X. Zhang, J. Zhang, H. Peng, H. Yang, S. Xie, L. Zhou, G. Pei, B. Wu, K. Wu, J. Yang, B. Wang, K. Liu, J. Zhu, J. Jiang, Linus, H. Hu, and C. Zhang (2025) HunyuanOCR technical report. Cited by: §E.1.
  • Kimi Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026) Kimi k2.5: visual agentic intelligence. arXiv preprint arXiv:2602.02276. Cited by: §1, §2.
  • Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia (2019) Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §E.1.
  • H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §E.1.
  • J. Wang, B. Chen, Y. Li, B. Kang, Y. Chen, and Z. Tian (2025) DeCLIP: decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, Cited by: §E.1.
  • K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li (2024a) Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems 37, pp. 95095–95169. Cited by: 1st item, §4.1.
  • P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024b) Qwen2-VL: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191. Cited by: §E.1.
  • Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang (2026) OpenClaw-RL: train any agent simply by talking. arXiv preprint arXiv:2603.10165. Cited by: §E.2, §2.
  • Y. Wu, S. Han, and H. Cai (2026) Lightning opd: efficient post-training for large reasoning models with offline on-policy distillation. arXiv preprint arXiv:2604.13010. Cited by: §1.
  • B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780. Cited by: §E.3, §1, §2, §2.
  • Y. Xiao, E. Sun, T. Liu, and W. Wang (2024) Logicvista: multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973. Cited by: 1st item, §4.1.
  • J. Xiong, H. Shen, S. Gong, Y. Cheng, J. Shen, C. Tao, H. Tan, H. Bai, L. Shang, and N. Wong (2026) OVD: on-policy verbal distillation. arXiv preprint arXiv:2601.21968. Cited by: §2.
  • H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu, S. Kang, J. Ji, Y. Zhang, et al. (2025a) RedStar: does scaling long-cot data unlock better slow-reasoning systems?. arXiv preprint arXiv:2501.11284. Cited by: §1.
  • W. Xu, J. Wang, W. Wang, Z. Chen, W. Zhou, A. Yang, L. Lu, H. Li, X. Wang, X. Zhu, et al. (2025b) Visulogic: a benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279. Cited by: 2nd item, §4.1.
  • X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, and T. Zhou (2024) A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116. Cited by: §2.
  • Y. Xu, H. Sang, Z. Zhou, R. He, and Z. Wang (2026) PACED: distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178. Cited by: §2.
  • A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §E.1, §4.1.
  • A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024a) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: §E.1.
  • C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan (2026a) Self-distilled RLVR. arXiv preprint arXiv:2604.03128. Cited by: §E.2, §2.
  • S. Yang, Y. Chen, Z. Tian, C. Wang, J. Li, B. Yu, and J. Jia (2024b) VisionZip: longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467. Cited by: §E.1.
  • S. Yang, J. Liu, R. Zhang, M. Pan, Z. Guo, X. Li, Z. Chen, P. Gao, Y. Guo, and S. Zhang (2023a) LiDAR-LLM: exploring the potential of large language models for 3d LiDAR understanding. arXiv preprint arXiv:2312.14074. Cited by: §E.1.
  • S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia (2023b) An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240. Cited by: §E.1.
  • S. Yang, Z. Tian, L. Jiang, and J. Jia (2024c) Unified language-driven zero-shot domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §E.1.
  • W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin (2026b) Learning beyond teacher: generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125. Cited by: §B.1, §E.3, §1, §4.1, §4.2.
  • Z. Yang, Z. Liu, Y. Chen, W. Dai, B. Wang, S. Lin, C. Lee, Y. Chen, D. Jiang, J. He, et al. (2026c) Nemotron-Cascade 2: post-training LLMs with cascade RL and multi-domain on-policy distillation. arXiv preprint arXiv:2603.19220. Cited by: §1, §2.
  • T. Ye, L. Dong, Z. Chi, X. Wu, S. Huang, and F. Wei (2025) Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643. Cited by: §E.3, §2.
  • T. Ye, L. Dong, X. Wu, S. Huang, and F. Wei (2026) On-policy context distillation for language models. arXiv preprint arXiv:2602.12275. Cited by: §E.3, §2.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §D.2.
  • A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026) GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763. Cited by: §1, §2.
  • D. Zhang, Z. Yang, S. Janghorbani, J. Han, A. Ressler II, Q. Qian, G. D. Lyng, S. S. Batra, and R. E. Tillman (2026a) Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260. Cited by: §E.3, §2.
  • K. Zhang, B. Li, P. Zhang, F. Pu, J. A. Cahyono, K. Hu, S. Liu, Y. Zhang, J. Yang, C. Li, et al. (2025a) LMMs-Eval: reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, Cited by: §C.2.
  • K. Zhang, K. Wu, Z. Yang, B. Li, K. Hu, B. Wang, Z. Liu, X. Li, and L. Bing (2025b) OpenMMReasoner: pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334. Cited by: §2, §4.1.
  • S. Zhang, X. Zhang, T. Zhang, B. Hu, Y. Chen, and J. Xu (2026b) KDFlow: a user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint arXiv:2603.01875. Cited by: §E.3, §2.
  • Y. Zhang, B. Ni, X. Chen, H. Zhang, Y. Rao, H. Peng, Q. Lu, H. Hu, M. Guo, and S. Hu (2025c) Bee: a high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795. Cited by: §2.
  • S. Zhao, Z. Wang, X. Zhao, J. Zhou, C. Xu, C. Liu, L. Zhang, Y. Jia, Y. Zhang, H. Yu, et al. (2026a) Large language model post-training: a unified view of off-policy and on-policy learning. arXiv preprint arXiv:2604.07941. Cited by: §1.
  • S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026b) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Cited by: §E.3, §2.
  • C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Cited by: §E.2.
  • Z. Zhong, C. Wang, Y. Liu, S. Yang, L. Tang, Y. Zhang, J. Li, T. Qu, Y. Li, Y. Chen, et al. (2024) Lyra: an efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501. Cited by: §E.1.
  • C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023a) LIMA: less is more for alignment. Advances in Neural Information Processing Systems 36, pp. 55006–55021. Cited by: §3.3.
  • G. Zhou, H. Bao, J. Huang, J. Deng, J. Zhang, J. She, K. Cai, L. Ren, L. Ren, Q. Luo, et al. (2025) OpenOneRec technical report. arXiv preprint arXiv:2512.24762. Cited by: §1.
  • J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023b) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. Cited by: §D.2.
  • D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny (2023) MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592. Cited by: §E.1.
  • C. Zou, X. Guo, R. Yang, J. Zhang, B. Hu, and H. Zhang (2024) Dynamath: a dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836. Cited by: 2nd item, §4.1.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Supplementary Material

Appendix Outline

This material provides supplementary details to the main paper, including the following sections:

  • (A) Method Details
    - (A.1) Offline Difficulty-Aware Data Balancing
    - (A.2) Online Correctness-Aware Data Balancing
    - (A.3) Order Consistency of Trajectory-level Returns
    - (A.4) Outcome-Guided Margin Calibration

  • (B) Training Details
    - (B.1) Training Setup
    - (B.2) Training Data
    - (B.3) Training Reward Acquisition
    - (B.4) Training Pseudocode
    - (B.5) Training Dynamics
    - (B.6) Training Complexity

  • (C) Evaluation Details
    - (C.1) Evaluation Benchmarks
    - (C.2) Evaluation Setup

  • (D) Further Evaluations
    - (D.1) More Evaluation Results
    - (D.2) Downstream Task Evaluation
    - (D.3) Further Ablation

  • (E) Related Work
    - (E.1) Multimodal Large Language Models
    - (E.2) Reinforcement Learning
    - (E.3) On-Policy Distillation

  • (F) Case Studies

Appendix A Method Details

In this section, we provide a detailed exposition of the key components of our proposed Uni-OPD framework, including its formulations and implementations.

A.1 Offline Difficulty-Aware Data Balancing

In this section, we provide a detailed description of our offline difficulty-aware data balancing strategy.

Offline rollout sampling. Before training, we perform a one-time offline rollout pass over the entire training set using the student model (e.g., Qwen3-4B). For each training instance, the student is prompted to generate $N=8$ independent candidate responses, which serve as the basis for subsequent difficulty estimation.

The rollouts are produced with vLLM (Kwon et al., 2023) under the same prompt template that will later be used at training time, so that the estimated difficulty reflects the actual input format the student will see. The decoding configuration is kept fixed throughout this offline phase: we use temperature $=1.0$, top-$p=0.95$, top-$k=50$, and a maximum response length of $16{,}384$ tokens. For each instance, we then verify the correctness of its $N$ candidate responses with the task-specific verifier (section B.3) and record the number $k$ of correct ones. The resulting empirical pass rate $k/N$ serves as our proxy for the instance's difficulty: a lower pass rate indicates a harder example, while a higher pass rate indicates an easier one.
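The pass-rate proxy can be illustrated with a minimal sketch. The verifier below (`exact_match`) and the toy rollouts are hypothetical stand-ins for the task-specific verifier and real student responses; only the $k/N$ computation follows the text.

```python
from typing import Callable, Sequence


def pass_count(responses: Sequence[str], reference: str,
               verify: Callable[[str, str], bool]) -> int:
    """Number of responses accepted by the verifier (k in the text)."""
    return sum(1 for response in responses if verify(response, reference))


def pass_rate(responses: Sequence[str], reference: str,
              verify: Callable[[str, str], bool]) -> float:
    """Empirical pass rate k/N; lower means a harder instance."""
    if not responses:
        raise ValueError("need at least one rollout")
    return pass_count(responses, reference, verify) / len(responses)


def exact_match(response: str, reference: str) -> bool:
    """Hypothetical toy verifier: exact string match after stripping."""
    return response.strip() == reference.strip()


# N = 8 toy rollouts for a single instance whose reference answer is "42".
rollouts = ["42", "41", " 42 ", "7", "42", "0", "42", "43"]
k = pass_count(rollouts, "42", exact_match)
rate = pass_rate(rollouts, "42", exact_match)
```

Here four of the eight rollouts are accepted, so the instance would sit in the middle of the difficulty spectrum.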

Limitations of aggressive difficulty filtering. Prior work on online RL optimization, such as GRPO, often relies on a heuristic filter applied before training that simply discards "trivial" samples such as all-correct cases, because these instances yield zero advantage and therefore provide essentially no learning signal. POLARIS (An et al., 2025), for example, reports that removing the easiest samples leads to consistent performance gains, and argues that keeping an unfiltered dataset can actively hinder training.

In the token-level reward OPD setting, however, we find that such aggressive filtering is, in fact, counterproductive. Empirically, removing any specific difficulty tier, whether the easiest or the hardest, consistently hurts final performance. A plausible explanation is that each tier contributes a distinct pattern of token-level credit: easy instances calibrate the student’s baseline behavior, intermediate instances provide the richest contrastive signals between correct and incorrect trajectories, and hard instances expose the student to diverse, non-trivial solution paths. Dropping any tier, therefore, both distorts the overall distribution of token-level credit and narrows the space of solution patterns to which the student is exposed.

Difficulty-aware data balancing. Motivated by this observation, we adopt a difficulty-aware balancing scheme that deliberately preserves the full spectrum of difficulty while reweighting its different regions, rather than truncating them. Concretely, after the offline rollout pass, we examine the empirical distribution over the number of correct responses out of $N$. Across our training sources, we observe two recurring shapes: (i) a U-shaped distribution, where both very easy and very hard instances dominate while intermediate ones are sparse; and (ii) a mirrored-J-shaped distribution, where easy instances dominate and the mass decays toward the hard end.

We treat the two shapes slightly differently. For U-shaped distributions, we upsample instances of intermediate difficulty, namely those with 1–7 correct responses out of $N=8$, so as to fill in the under-represented middle region. For mirrored-J-shaped distributions, we instead upsample all non-trivial instances, i.e., everything with 1–8 correct responses, to counteract the long tail of easy samples. In both cases, the effect of the reweighting is to flatten the overall difficulty distribution and to ensure that the token-level credit signals arriving during training are more evenly spread across difficulty levels. Empirically, we find that this simple rebalancing consistently leads to better final performance than either no filtering or the conventional drop-the-easy-cases strategy.
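As a rough illustration of the rebalancing step, the sketch below repeats every instance whose correct count falls inside a chosen range. The function name, the integer `factor`, and the toy counts are assumptions made for the example; the text does not specify the upsampling factor. The range `(1, 7)` corresponds to the U-shaped case with $N=8$.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def rebalance_by_difficulty(pass_counts: Sequence[int],
                            upsample_range: Tuple[int, int],
                            factor: int = 2) -> List[int]:
    """Return dataset indices, repeating instances inside the range.

    pass_counts[i] is the number of correct rollouts for instance i.
    `factor` is a hypothetical knob: how often an upsampled instance
    appears in the rebalanced dataset. Nothing is ever dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    lo, hi = upsample_range
    indices: List[int] = []
    for idx, count in enumerate(pass_counts):
        repeats = factor if lo <= count <= hi else 1
        indices.extend([idx] * repeats)
    return indices


# Toy U-shaped source: many all-wrong and all-correct instances, one in between.
pass_counts = [0, 0, 0, 4, 8, 8, 8, 8]
indices = rebalance_by_difficulty(pass_counts, upsample_range=(1, 7), factor=3)
histogram = Counter(pass_counts[i] for i in indices)
```

All difficulty tiers remain present after rebalancing; only the relative mass of the intermediate tier grows.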

A.2 Online Correctness-Aware Data Balancing

In this section, we detail the online correctness-aware data balancing strategy that operates during rollout. While the offline difficulty-aware balancing in section A.1 controls the prompt-level difficulty distribution before training, the composition of correct and incorrect trajectories within a rollout group still varies dramatically as the student evolves. This subsection describes how we regulate such intra-group composition online.

Motivation. In OPD, for each prompt $\bm{q}$ we sample $G$ on-policy trajectories $\{\bm{\tau}_{i}\}_{i=1}^{G}$ and split them into a positive set $S_{+}(\bm{q})$ and a negative set $S_{-}(\bm{q})$ based on the outcome reward $R_{i}$. As training proceeds, many prompts exhibit degenerate outcome distributions: either $|S_{-}(\bm{q})| \ll G$ (the student nearly masters $\bm{q}$) or $|S_{+}(\bm{q})| \ll G$ (the student often fails on $\bm{q}$). In both cases, the outcome-level contrast vanishes and the outcome-guided margin calibration in section A.4 cannot provide a reliable corrective signal, since the prompt-level margin $m(\bm{q})$ rests on very few trajectories and is undefined once either set is empty. If left unregulated, such degenerate groups dominate the batch and drive the student into local optima with shrinking exploration.

Online correctness-aware balancing. To preserve sufficient outcome diversity throughout training, we maintain a target correct-to-total ratio $\gamma^{\star} \in (0,1)$ at the batch level (we use $\gamma^{\star} \approx 0.5$ by default, so positive and negative trajectories are roughly balanced). At each training step, given a freshly rolled-out batch $\mathcal{B}$, we let $\gamma(\mathcal{B}) = \sum_{\bm{\tau}_{i} \in \mathcal{B}} \mathbf{1}\{R_{i}=1\} / |\mathcal{B}|$ denote the current correct-to-total ratio across the whole batch. Whenever $|\gamma(\mathcal{B}) - \gamma^{\star}| > \epsilon$ for a tolerance $\epsilon$, we downweight the over-represented side (correct or incorrect trajectories) by subsampling within each group, so that the overall batch ratio is pulled back to the $\gamma^{\star} \pm \epsilon$ interval. Subsampling is performed uniformly inside each group, which keeps the intra-group difficulty distribution intact and avoids biasing the prompt-level difficulty spectrum inherited from offline balancing.
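A minimal sketch of this batch-level rule follows, under stated assumptions: each group is a list of `(trajectory_id, reward)` pairs with binary rewards, and the per-group keep fraction is chosen so that the over-represented side shrinks toward the target ratio. The exact subsampling schedule is not given in the text, so the `keep_frac` formula and the rounding are illustrative choices, and rounding may leave the ratio slightly off target on small batches.

```python
import random


def batch_ratio(groups):
    """Correct-to-total ratio gamma(B) over all trajectories in the batch."""
    rewards = [reward for group in groups for _, reward in group]
    return sum(rewards) / len(rewards)


def rebalance_batch(groups, target=0.5, eps=0.1, seed=0):
    """Uniformly subsample the over-represented side inside each group."""
    rng = random.Random(seed)
    n_pos = sum(reward for group in groups for _, reward in group)
    n_neg = sum(1 - reward for group in groups for _, reward in group)
    if n_pos == 0 or n_neg == 0:
        # Only one outcome is present: subsampling cannot restore balance.
        return [list(group) for group in groups]
    gamma = n_pos / (n_pos + n_neg)
    if abs(gamma - target) <= eps:
        return [list(group) for group in groups]
    over = 1 if gamma > target else 0  # reward value of the dominant side
    if over == 1:
        keep_frac = (target * n_neg / (1 - target)) / n_pos
    else:
        keep_frac = ((1 - target) * n_pos / target) / n_neg
    balanced = []
    for group in groups:
        side = [t for t in group if t[1] == over]
        rest = [t for t in group if t[1] != over]
        n_keep = min(len(side), round(keep_frac * len(side)))
        balanced.append(rest + rng.sample(side, n_keep))
    return balanced


# Toy batch of three groups with G = 4; 9 of 12 trajectories are correct.
groups = [
    [("a0", 1), ("a1", 1), ("a2", 1), ("a3", 1)],
    [("b0", 1), ("b1", 1), ("b2", 1), ("b3", 0)],
    [("c0", 1), ("c1", 1), ("c2", 0), ("c3", 0)],
]
before = batch_ratio(groups)
balanced = rebalance_batch(groups, target=0.5, eps=0.1)
after = batch_ratio(balanced)
```

In this toy batch, one correct trajectory is kept per group, so the ratio moves from 0.75 to 0.5 while every incorrect trajectory is retained.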

A.3 Order Consistency of Trajectory-level Returns

This section provides a brief explanation for the order-consistency conditions in Eqs. (7) and (8) of the main paper. The key observation is two-fold. First, treating the entire reasoning rollout as a single macro-action gives $G_{\mathrm{RL}}(\bm{q},\bm{\tau}) = R(\bm{q},\bm{\tau})$, so $G_{\mathrm{RL}}$ respects the outcome-induced ordering by construction. Second, under the distillation premise, the trajectory-level distillation return $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ is expected to preserve the same ordering, although this is a desideratum rather than a definitional consequence.

Trajectory-as-one-action view of outcome-based RL. In outcome-based RL for reasoning, supervision is provided only at the trajectory level: a rollout $\bm{\tau}$ receives a single scalar reward $R(\bm{q},\bm{\tau})$ determined by the final answer. Under this view, the trajectory-level return reduces to the outcome reward itself, i.e.,

\[ G_{\mathrm{RL}}(\bm{q},\bm{\tau}) = R(\bm{q},\bm{\tau})\,. \tag{13} \]

Order consistency under binary rewards. For the binary outcome reward adopted in this work, any $\bm{\tau}_{+} \in S_{+}(\bm{q})$ satisfies $R(\bm{q},\bm{\tau}_{+}) = 1$, while any $\bm{\tau}_{-} \in S_{-}(\bm{q})$ satisfies $R(\bm{q},\bm{\tau}_{-}) = 0$. Combined with Eq. 13, we have

\[ G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{+}) = 1 \;\geq\; 0 = G_{\mathrm{RL}}(\bm{q},\bm{\tau}_{-})\,, \tag{14} \]

for all $\bm{\tau}_{+} \in S_{+}(\bm{q})$ and $\bm{\tau}_{-} \in S_{-}(\bm{q})$, which recovers Eq. 7 directly.

Extension to soft outcome rewards. The same argument extends to soft outcome rewards, where $R(\bm{q},\bm{\tau}) \in [0,1]$ (or any bounded interval) measures a graded notion of correctness, e.g., partial credit or a verifier's confidence score. As long as the trajectory partition is defined by thresholding the outcome reward, i.e., $S_{+}(\bm{q}) = \{\bm{\tau} \mid R(\bm{q},\bm{\tau}) \geq \eta\}$ and $S_{-}(\bm{q}) = \{\bm{\tau} \mid R(\bm{q},\bm{\tau}) < \eta\}$ for some threshold $\eta$, then by Eq. 13 every positive trajectory attains a return no smaller than that of any negative trajectory, and Eq. 7 still holds. In particular, the binary case is recovered as the special instance $\eta = 1$, $R \in \{0,1\}$.
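The thresholding argument can be checked numerically with a small sketch. The reward values and the helper names are illustrative assumptions; the check only restates that a threshold partition puts every positive reward at or above every negative one.

```python
def partition_by_threshold(rewards, eta):
    """Split rewards into S_+ (reward >= eta) and S_- (reward < eta)."""
    positives = [r for r in rewards if r >= eta]
    negatives = [r for r in rewards if r < eta]
    return positives, negatives


def is_order_consistent(positives, negatives):
    """With G_RL = R, ordering holds iff min over S_+ >= max over S_-."""
    if not positives or not negatives:
        return True  # ordering is vacuous when one side is empty
    return min(positives) >= max(negatives)


# Soft rewards in [0, 1] with threshold eta = 0.5 (toy values).
soft_pos, soft_neg = partition_by_threshold([0.9, 0.2, 0.55, 0.5, 0.0, 1.0], eta=0.5)
soft_ok = is_order_consistent(soft_pos, soft_neg)

# Binary rewards are the special case eta = 1.
bin_pos, bin_neg = partition_by_threshold([1, 0, 1, 0], eta=1)
bin_ok = is_order_consistent(bin_pos, bin_neg)
```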

From RL return to distillation return. The distillation return $G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$ defined in Eq. 5 plays the same role for OPD training as $G_{\mathrm{RL}}$ does for outcome-based RL: it is the trajectory-level supervision signal broadcast to all tokens in the rollout. The distillation premise in section 3.4 posits that, relative to the student, the teacher assigns a higher log-likelihood to correct trajectories than incorrect ones. In other words, the teacher's trajectory-level preference is expected to be aligned with the outcome reward, so that $G_{\mathrm{OPD}}$ should inherit the same outcome-level ordering as $G_{\mathrm{RL}}$, leading to Eq. 8. Unlike the RL return, however, $G_{\mathrm{OPD}}$ is derived from the teacher–student log-probability gap rather than the outcome reward itself, so the ordering is a desired property rather than a guaranteed one. The order-consistency condition in Eq. 8 provides a principled target, and subsequent margin mask and margin shift strategies (section A.4) are designed to enforce it whenever the teacher's supervision violates this property in practice.

A.4 Outcome-Guided Margin Calibration

Algorithm 1 Greedy Margin Mask
Inputs:
  Prompt $\bm{q}$ with rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$, outcome rewards $\{R_{i}\}_{i=1}^{G}$ with $R_{i} \in \{0,1\}$, min retention ratio $\rho$,
  trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$, target margin $\delta$, mode $\in \{\mathrm{MinMax},\mathrm{Mean}\}$.
Output: Keep-mask $\{k_{i}\}_{i=1}^{G} \in \{0,1\}^{G}$  ▷ $k_{i}=1$ means "keep trajectory $\bm{\tau}_{i}$" and $k_{i}=0$ means "drop it".

Notation: For any two subsets $A \subseteq S_{+}(\bm{q})$ and $B \subseteq S_{-}(\bm{q})$, we define the prompt-level margin
  $\textsc{Margin}(A,B;\mathrm{MinMax}) = \min_{\bm{\tau}\in A} G_{\mathrm{OPD}}(\bm{q},\bm{\tau}) - \max_{\bm{\tau}\in B} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$,
  $\textsc{Margin}(A,B;\mathrm{Mean}) = \operatorname{mean}_{\bm{\tau}\in A} G_{\mathrm{OPD}}(\bm{q},\bm{\tau}) - \operatorname{mean}_{\bm{\tau}\in B} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$.

function GreedyMarginMask($\bm{q}, \{\bm{\tau}_{i}, R_{i}, G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}, \delta, \rho, \text{mode}$)
  ▷ Step 1: split the group by outcome correctness.
  $S_{+}(\bm{q}) \leftarrow \{\bm{\tau}_{i} \mid R_{i}=1\}$, $S_{-}(\bm{q}) \leftarrow \{\bm{\tau}_{i} \mid R_{i}=0\}$
  $N_{+} \leftarrow |S_{+}(\bm{q})|$, $N_{-} \leftarrow |S_{-}(\bm{q})|$
  $k_{i} \leftarrow 1, \quad \forall i=1,\ldots,G$  ▷ initialize: keep all trajectories
  if $N_{+}=0$ or $N_{-}=0$ then
    return $\{k_{i}\}_{i=1}^{G}$  ▷ ordering is not defined; no masking
  end if

  ▷ Step 2: sort each side so that the most ordering-violating trajectory is at the front.
  $L_{+}(\bm{q}) \leftarrow$ sort $S_{+}(\bm{q})$ by $G_{\mathrm{OPD}}(\bm{q},\cdot)$ ascending  ▷ $L_{+}(\bm{q})[1]$ = correct trajectory with lowest return
  $L_{-}(\bm{q}) \leftarrow$ sort $S_{-}(\bm{q})$ by $G_{\mathrm{OPD}}(\bm{q},\cdot)$ descending  ▷ $L_{-}(\bm{q})[1]$ = incorrect trajectory with highest return

  ▷ Step 3: iteratively drop the trajectory whose removal increases the margin the most.
  while $\textsc{Margin}(L_{+}(\bm{q}), L_{-}(\bm{q}); \text{mode}) < \delta$ do
    if $|L_{+}(\bm{q})| \leq \lceil \rho N_{+} \rceil$ and $|L_{-}(\bm{q})| \leq \lceil \rho N_{-} \rceil$ then
      break  ▷ minimum retention ratio reached on both sides
    end if

    ▷ Margin gain when the worst correct trajectory $L_{+}(\bm{q})[1]$ is dropped.
    $\Delta_{+} \leftarrow \textsc{Margin}(L_{+}(\bm{q}) \setminus \{L_{+}(\bm{q})[1]\}, L_{-}(\bm{q}); \text{mode}) - \textsc{Margin}(L_{+}(\bm{q}), L_{-}(\bm{q}); \text{mode})$
    ▷ Margin gain when the best incorrect trajectory $L_{-}(\bm{q})[1]$ is dropped.
    $\Delta_{-} \leftarrow \textsc{Margin}(L_{+}(\bm{q}), L_{-}(\bm{q}) \setminus \{L_{-}(\bm{q})[1]\}; \text{mode}) - \textsc{Margin}(L_{+}(\bm{q}), L_{-}(\bm{q}); \text{mode})$
    if $\max(\Delta_{+}, \Delta_{-}) \leq 0$ then
      break  ▷ no single removal can further improve the margin
    end if
    if $\Delta_{+} > \Delta_{-}$ and $|L_{+}(\bm{q})| > \lceil \rho N_{+} \rceil$ then
      $\bm{\tau}_{\mathrm{drop}} \leftarrow \textsc{PopFront}(L_{+}(\bm{q}))$  ▷ greedy drop on the positive side
    else
      $\bm{\tau}_{\mathrm{drop}} \leftarrow \textsc{PopFront}(L_{-}(\bm{q}))$  ▷ greedy drop on the negative side
    end if
    $k_{\mathrm{idx}(\bm{\tau}_{\mathrm{drop}})} \leftarrow 0$  ▷ exclude this trajectory from the subsequent gradient update
  end while
  return $\{k_{i}\}_{i=1}^{G}$
end function

In this section, we describe the details of the two outcome-guided margin calibration strategies introduced in section 3.4: Margin Mask and Margin Shift. Both strategies operate on the trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$ within a rollout group of a prompt $\bm{q}$, with the common goal of enforcing the order-consistency condition $m(\bm{q}) \geq \delta$ (Eq. 10). They differ in how they repair violations: Margin Mask removes the most adversarial trajectories until the condition holds, whereas Margin Shift applies a minimal additive correction to restore the margin in closed form.

Margin choices: MinMax vs. Mean. Following the prompt-level margin in Eq. 9, we define the margin between $S_{+}(\bm{q})$ and $S_{-}(\bm{q})$ in two modes: the MinMax mode uses $\min_{\bm{\tau}\in S_{+}} G_{\mathrm{OPD}} - \max_{\bm{\tau}\in S_{-}} G_{\mathrm{OPD}}$ and characterizes the worst-case ordering violation; the Mean mode uses $\mathrm{mean}_{\bm{\tau}\in S_{+}} G_{\mathrm{OPD}} - \mathrm{mean}_{\bm{\tau}\in S_{-}} G_{\mathrm{OPD}}$ and reflects the average-case ordering tendency. MinMax is more conservative (it forces every positive to outrank every negative), while Mean is more lenient and less sensitive to individual outliers.
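The two margin modes can be compared on a toy group. The return values below are made up for illustration, and the function name `margin` is an assumption; the two formulas follow the definitions above.

```python
from statistics import mean


def margin(pos_returns, neg_returns, mode="MinMax"):
    """Prompt-level margin between correct and incorrect trajectories."""
    if not pos_returns or not neg_returns:
        raise ValueError("margin needs at least one trajectory on each side")
    if mode == "MinMax":
        # Worst correct trajectory versus best incorrect trajectory.
        return min(pos_returns) - max(neg_returns)
    if mode == "Mean":
        # Average correct return versus average incorrect return.
        return mean(pos_returns) - mean(neg_returns)
    raise ValueError(f"unknown mode: {mode!r}")


# One correct trajectory (-1.0) scores below an incorrect one (0.0).
pos = [0.5, 2.0, -1.0]
neg = [0.0, -2.0]
minmax_margin = margin(pos, neg, "MinMax")
mean_margin = margin(pos, neg, "Mean")
```

On this group the single low-scoring correct trajectory makes the MinMax margin negative, while the Mean margin stays positive, which is the outlier sensitivity described above.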

Detailed implementation of margin mask. The margin mask strategy discards unreliable trajectories until the prompt-level margin is restored. We implement its fine-grained, data-efficient variant as Greedy Margin Mask, which removes the single most adversarial trajectory in each iteration rather than discarding the entire group. Specifically, given the rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$ of prompt $\bm{q}$ with trajectory-level returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$, we sort the positives in ascending order of $G_{\mathrm{OPD}}$ (so the worst correct trajectory comes first) and the negatives in descending order (so the best incorrect trajectory comes first). At each iteration, we compute the margin improvement obtained by removing the front of each sorted list and greedily drop from the side that yields the larger improvement. The iteration terminates once (i) the target margin $m(\bm{q}) \geq \delta$ is satisfied, (ii) no further beneficial removal exists, or (iii) a minimum retention ratio $\rho \in (0,1)$ is reached to prevent excessive data loss. The masked trajectories are excluded from the subsequent gradient update by setting their trajectory-level return to zero, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}) = k_{i} \cdot G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})$, where $k_{i} \in \{0,1\}$ is the keep mask. In distributed training, the trajectory-level statistics are aggregated across all ranks via AllReduce so that the masking is deterministic and consistent across devices. The procedure is given in Algorithm 1.
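The greedy procedure can be sketched in plain Python for a single rollout group. This is an illustrative single-process version, not the distributed implementation: the function name and the toy values are assumptions, and the sketch additionally refuses to drop from a side that has already reached its retention floor (or its last trajectory), a guard added here so that the margin always stays well defined.

```python
import math
from statistics import mean


def _margin(pos, neg, mode):
    if mode == "MinMax":
        return min(pos) - max(neg)
    return mean(pos) - mean(neg)


def greedy_margin_mask(rewards, returns, delta, rho, mode="MinMax"):
    """Return a 0/1 keep-mask over one rollout group."""
    keep = [1] * len(rewards)
    # Indices sorted so the most order-violating trajectory comes first.
    pos = sorted((i for i, r in enumerate(rewards) if r == 1),
                 key=lambda i: returns[i])
    neg = sorted((i for i, r in enumerate(rewards) if r == 0),
                 key=lambda i: -returns[i])
    if not pos or not neg:
        return keep  # ordering is not defined; no masking
    floor_pos = max(1, math.ceil(rho * len(pos)))
    floor_neg = max(1, math.ceil(rho * len(neg)))

    def current(p, n):
        return _margin([returns[i] for i in p], [returns[i] for i in n], mode)

    while current(pos, neg) < delta:
        can_pos = len(pos) > floor_pos
        can_neg = len(neg) > floor_neg
        if not can_pos and not can_neg:
            break  # minimum retention reached on both sides
        base = current(pos, neg)
        gain_pos = current(pos[1:], neg) - base if can_pos else -math.inf
        gain_neg = current(pos, neg[1:]) - base if can_neg else -math.inf
        if max(gain_pos, gain_neg) <= 0:
            break  # no single removal improves the margin
        dropped = pos.pop(0) if gain_pos > gain_neg else neg.pop(0)
        keep[dropped] = 0
    return keep


# Toy group with G = 6: the third correct trajectory scores below an incorrect one.
rewards = [1, 1, 1, 0, 0, 0]
returns = [2.0, 1.5, -1.0, 0.5, -0.5, -2.0]
mask = greedy_margin_mask(rewards, returns, delta=0.5, rho=0.5, mode="MinMax")
masked_returns = [k * g for k, g in zip(mask, returns)]
```

Here dropping the low-scoring correct trajectory raises the MinMax margin from -1.5 to 1.0, which already exceeds the target, so only one trajectory is masked.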

Detailed implementation of margin shift. The margin shift strategy applies a minimal additive correction to the trajectory-level returns so that the margin exactly meets the target $\delta$, rather than discarding any sample. Given the rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$ of prompt $\bm{q}$, we first compute the current margin $m(\bm{q})$ with the chosen mode (Mean by default). If $m(\bm{q}) < \delta$, we define the required shift as $\lambda(\bm{q}) = \delta - m(\bm{q}) > 0$ and distribute it across trajectories in one of three directions: (i) Lift: add $\lambda(\bm{q})$ to every positive trajectory, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) = G_{\mathrm{OPD}}(\bm{q},\bm{\tau}) + \lambda(\bm{q})\mathbf{1}\{r(\bm{q},\bm{\tau})=1\}$, which matches Eq. 11 in the main text; (ii) Suppress: subtract $\lambda(\bm{q})$ from every negative trajectory, i.e., $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) = G_{\mathrm{OPD}}(\bm{q},\bm{\tau}) - \lambda(\bm{q})\mathbf{1}\{r(\bm{q},\bm{\tau})=0\}$; and (iii) Spread: split the correction symmetrically, adding $\lambda(\bm{q})/2$ to positives and subtracting $\lambda(\bm{q})/2$ from negatives. Because each side is shifted by a constant, all three variants (a) preserve the relative ordering within $S_{+}(\bm{q})$ and within $S_{-}(\bm{q})$ respectively, and (b) guarantee that the calibrated margin, measured in the chosen mode, equals $\delta$; in MinMax mode, for instance, $\min_{\bm{\tau}\in S_{+}} \widetilde{G}_{\mathrm{OPD}} - \max_{\bm{\tau}\in S_{-}} \widetilde{G}_{\mathrm{OPD}} = \delta$. In distributed training, the aggregation of trajectory-level statistics and the computation of $\lambda(\bm{q})$ are done via AllReduce to ensure consistency across devices. The procedure is given in Algorithm 2.

Algorithm 2 Margin Shift
Inputs:
  Prompt $\bm{q}$ with rollout group $\{\bm{\tau}_{i}\}_{i=1}^{G}$, outcome rewards $\{R_{i}\}_{i=1}^{G}$ with $R_{i} \in \{0,1\}$,
  trajectory-level distillation returns $\{G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$,
  target margin $\delta$, mode $\in \{\mathrm{MinMax},\mathrm{Mean}\}$, direction $\in \{\mathrm{Lift},\mathrm{Suppress},\mathrm{Spread}\}$.
Output: Calibrated trajectory-level returns $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$

function MarginShift($\bm{q}, \{\bm{\tau}_{i}, R_{i}, G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}, \delta, \text{mode}, \text{direction}$)
  ▷ Step 1: split the group by outcome correctness.
  $S_{+}(\bm{q}) \leftarrow \{\bm{\tau}_{i} \mid R_{i}=1\}$, $S_{-}(\bm{q}) \leftarrow \{\bm{\tau}_{i} \mid R_{i}=0\}$
  if $S_{+}(\bm{q}) = \emptyset$ or $S_{-}(\bm{q}) = \emptyset$ then
    return $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}) \leftarrow G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$  ▷ ordering is not defined
  end if

  ▷ Step 2: summarize each side and compute the prompt-level margin $m(\bm{q})$.
  if mode == $\mathrm{MinMax}$ then
    $G_{+}(\bm{q}) \leftarrow \min_{\bm{\tau}\in S_{+}(\bm{q})} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$  ▷ worst-scoring correct trajectory
    $G_{-}(\bm{q}) \leftarrow \max_{\bm{\tau}\in S_{-}(\bm{q})} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$  ▷ best-scoring incorrect trajectory
  else
    $G_{+}(\bm{q}) \leftarrow \operatorname{mean}_{\bm{\tau}\in S_{+}(\bm{q})} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$  ▷ average correct score
    $G_{-}(\bm{q}) \leftarrow \operatorname{mean}_{\bm{\tau}\in S_{-}(\bm{q})} G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$  ▷ average incorrect score
  end if
  $m(\bm{q}) \leftarrow G_{+}(\bm{q}) - G_{-}(\bm{q})$

  ▷ Step 3: additive correction when the margin is below the target.
  $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}) \leftarrow G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}), \quad \forall i=1,\ldots,G$  ▷ start from the uncalibrated returns
  if $m(\bm{q}) < \delta$ then
    $\lambda(\bm{q}) \leftarrow \delta - m(\bm{q})$  ▷ amount by which the margin falls short of $\delta$
    if direction == $\mathrm{Lift}$ then
      $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) \mathrel{+}= \lambda(\bm{q}), \quad \forall \bm{\tau} \in S_{+}(\bm{q})$  ▷ pull all correct trajectories up
    else if direction == $\mathrm{Suppress}$ then
      $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) \mathrel{-}= \lambda(\bm{q}), \quad \forall \bm{\tau} \in S_{-}(\bm{q})$  ▷ push all incorrect trajectories down
    else
      $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) \mathrel{+}= \lambda(\bm{q})/2, \quad \forall \bm{\tau} \in S_{+}(\bm{q})$  ▷ split: half up on the positive side, …
      $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}) \mathrel{-}= \lambda(\bm{q})/2, \quad \forall \bm{\tau} \in S_{-}(\bm{q})$  ▷ …and half down on the negative side
    end if
  end if
  return $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}$
end function
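For concreteness, the margin shift procedure can be written as a short Python function operating on one rollout group. This is a single-process sketch with made-up return values; the function name and the dictionary-based dispatch are illustrative choices, and the cross-device aggregation is omitted.

```python
from statistics import mean


def margin_shift(rewards, returns, delta, mode="Mean", direction="Lift"):
    """Return calibrated trajectory-level returns for one rollout group."""
    pos = [g for r, g in zip(rewards, returns) if r == 1]
    neg = [g for r, g in zip(rewards, returns) if r == 0]
    if not pos or not neg:
        return list(returns)  # ordering is not defined; leave returns as-is
    if mode == "MinMax":
        m = min(pos) - max(neg)
    else:
        m = mean(pos) - mean(neg)
    if m >= delta:
        return list(returns)  # margin already meets the target
    lam = delta - m  # shortfall relative to the target margin
    up, down = {
        "Lift": (lam, 0.0),
        "Suppress": (0.0, lam),
        "Spread": (lam / 2, lam / 2),
    }[direction]
    return [g + up if r == 1 else g - down for r, g in zip(rewards, returns)]


# Toy group with G = 4; the Mean margin is 0.5 and the target is 1.0.
rewards = [1, 1, 0, 0]
returns = [1.0, 0.0, 0.5, -0.5]
lifted = margin_shift(rewards, returns, delta=1.0, mode="Mean", direction="Lift")
spread = margin_shift(rewards, returns, delta=1.0, mode="Mean", direction="Spread")
```

Both directions bring the Mean margin to exactly 1.0 here; they differ only in whether the correction is applied to the correct side alone or split across both sides.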

Appendix B Training Details

In this section, we present details related to training, including the training setup (section B.1), the training datasets (section B.2), the training reward acquisition (section B.3), the training pseudocode (section B.4), the training dynamics (section B.5), and the training complexity analysis (section B.6). These details are provided to enhance the reproducibility of Uni-OPD.

B.1 Training Setup

To support multi-teacher OPD for both LLMs and MLLMs, we build Uni-OPD upon a widely used training framework, Miles (https://github.com/radixark/miles). Specifically, we use Megatron-LM (https://github.com/nvidia/megatron-lm; Shoeybi et al., 2019) as the training backend and SGLang (https://github.com/sgl-project/sglang) as the rollout inference engine. For teacher models, we deploy them as independent SGLang services that can be accessed via HTTP from arbitrary locations to obtain token-level rewards, enabling flexible teacher extensions and scalable multi-teacher integration.

Each teacher is served behind a pool of SGLang endpoints with client-side shuffled round-robin load balancing, and a lightweight task-to-teacher routing table dispatches every prompt to the teacher best matched to its domain (e.g., math reasoning or code generation), so that new teachers or new tasks can be plugged in by simply extending the registry without touching the training loop. Because each teacher only needs to expose its prefill-time input_token_logprobs, no gradient, KV cache, or parameter is shared with the student, which keeps teachers fully stateless and decouples their deployment from the trainer. As a result, teacher scoring overlaps with student generation and contributes negligible overhead to the overall training throughput.
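
As a rough illustration of the routing described above, the following sketch pairs a task-to-teacher registry with a client-side shuffled round-robin over each teacher's endpoints. The class name, the registry keys, and the endpoint URLs are hypothetical placeholders; the HTTP request to the teacher service is deliberately left out, since its exact payload is not specified here.

```python
import random
from itertools import cycle

class TeacherPool:
    """Endpoints of one teacher, visited in a shuffled round-robin order."""

    def __init__(self, endpoints, seed=None):
        order = list(endpoints)
        # Shuffle once per client so different clients start at different endpoints.
        random.Random(seed).shuffle(order)
        self._cycle = cycle(order)

    def next_endpoint(self):
        return next(self._cycle)

# Hypothetical registry: task name -> teacher pool (placeholder URLs).
REGISTRY = {
    "math": TeacherPool(["http://math-teacher-0:30000", "http://math-teacher-1:30000"], seed=0),
    "code": TeacherPool(["http://code-teacher-0:30000"], seed=0),
}

def route(task):
    """Return the endpoint that should score a rollout belonging to `task`."""
    return REGISTRY[task].next_endpoint()
```

Under this design, adding a teacher or a task amounts to adding one registry entry, which matches the claim that the training loop itself does not need to change.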

General training hyperparameters. All general training settings, including the batch size, number of rollouts, learning rate schedule, and optimizer choice, are identical to those used in ExOPD (https://github.com/RUCBM/G-OPD; Yang et al., 2026b), ensuring a fair and controlled comparison. The prompts used for training are listed in the template below.

Training Prompt Template

Math Reasoning:
<|im_start|>user
{question}
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>assistant

Code Reasoning:
<|im_start|>user
{question}
Write Python code to solve the problem. Present the code in
```python
Your code
```
at the end.
You need to think first then write the Python code.<|im_end|>
<|im_start|>assistant

Multimodal Math Reasoning:
<|im_start|>user
<image>
{question}
Please solve the problem step by step and put your answer in one \boxed{}.<|im_end|>
<|im_start|>assistant
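
For readers who want to reproduce the prompts programmatically, the sketch below fills two of the templates above with a question. The dictionary keys, the helper name, and the exact newline placement (including the trailing newline after the assistant tag) are assumptions; the code-reasoning template can be added in the same way.

```python
# Literal braces in \boxed{} are doubled so that str.format leaves them intact.
MATH_TEMPLATE = (
    "<|im_start|>user\n"
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

MM_MATH_TEMPLATE = (
    "<|im_start|>user\n"
    "<image>\n"
    "{question}\n"
    "Please solve the problem step by step and put your answer in one \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Hypothetical task keys used only in this sketch.
TEMPLATES = {"math": MATH_TEMPLATE, "mm_math": MM_MATH_TEMPLATE}

def build_prompt(task, question):
    """Insert `question` into the template registered for `task`."""
    return TEMPLATES[task].format(question=question)
```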

RL training setup. Teacher models are trained using reinforcement learning (RL). Detailed training settings of the teacher models are provided in Table B.1.

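Table B.1: Teacher model training configuration with GRPO.
| Group | Setting | Value |
| --- | --- | --- |
| Model | Base model | LLM (Math, Code): Qwen3-4B; MLLM (Math, Logic, Document): Qwen3-VL-4B-Inst. |
| Model | Training steps | LLM (Math, Code): 500, 300; MLLM (Math, Logic, Document): 300, 300, 160 |
| Optimization | Tensor Parallelism (TP) | 2 |
| Optimization | Micro batch size / GPU | 1 |
| Optimization | Training batch size | 128 |
| Optimization | Learning rate | $1\times 10^{-6}$ |
| Optimization | Warm-up steps | 0 |
| Optimization | LR schedule | Constant |
| Optimization | ZeRO stage | 3 |
| Optimization | Optimizer | Adam |
| Sequence | Max prompt length | 2048 |
| Sequence | Max response length | 16384 |
| RL Algorithm | Advantage estimator | GRPO |
| RL Algorithm | GRPO clip ratio | 0.2 |
| RL Algorithm | Use KL in reward | False |
| RL Algorithm | KL loss coefficient | 0.0 |
| RL Algorithm | Entropy coefficient | 0.0 |
| Rollout | Samples per prompt ($n$) | 8 |
| Rollout | Temperature | 1.0 |
| Rollout | Top-$p$ | 0.95 |
| Rollout | Top-$k$ | 50 |
| Hardware | GPUs | 16 × NVIDIA H20 |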

OPD training setup. For OPD, we inherit most hyperparameters (e.g., learning rate, optimizer, and sequence lengths) from the teacher RL setup in Table B.1, so that the student is trained under the same optimization regime as its teachers. The OPD-specific entries, including the training batch size, the number of on-policy samples per prompt, the online correctness-aware filter, and the margin calibration configuration, are summarized in Table B.2. Concretely, we use a training batch size of 64 and sample $n=16$ on-policy rollouts per prompt, which we find provides a good trade-off between return estimation quality and computational efficiency (see the ablation in Table D.6). The online correctness-aware filter is applied in sample filter mode with a target correct-to-incorrect ratio of $1{:}1$ within each training batch, following section A.2. For margin calibration (section A.4), we adopt group-level mean normalization in both domains, while the shift direction and target margin are tuned per domain: for the textual domain, we use Spread with $\delta=0.4$, and for the multimodal domain, we use Lift with $\delta=0$.

Table B.2: OPD training configuration. Most hyperparameters inherit from the teacher RL setup in Table B.1; only the entries that differ between OPD and RL are listed here.
| Group | Setting | Textual | Multimodal |
| --- | --- | --- | --- |
| Optimization | Training batch size | 64 | 64 |
| Optimization | Samples per prompt ($n$) | 16 | 16 |
| Online filter | Filter mode | Sample filter | Sample filter |
| Online filter | Correct/Incorrect ratio | $1{:}1$ | $1{:}1$ |
| Margin calibration | Scope | Group | Group |
| Margin calibration | Mode | Mean | Mean |
| Margin calibration | Direction | Spread | Lift |
| Margin calibration | Target margin $\delta$ | 0.4 | 0 |

B.2 Training Data

Textual math reasoning data. We use a subset of the DeepMath dataset (He et al., 2025b) with difficulty level $\geq 6$ to train mathematical reasoning ability, comprising 57.0K samples.

Textual code generation data. We use the Code subset of the Eurus-2-RL-Data dataset (Cui et al., 2025) with 25.3K samples to train code generation ability.

Multimodal math reasoning data. For multimodal math reasoning tasks, we draw 14.8K samples from the OpenMMReasoner-RL dataset (https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-RL-74K), covering MMK12, WeMath-Standard, and WeMath-Pro subsets.

Multimodal logic reasoning data. We collect 14.8K samples spanning AlgoPuzzle, PuzzleVQA, and ThinkLite-VL-Hard subsets from the OpenMMReasoner-RL-74K dataset.

Multimodal document understanding data. We include 14.6K document understanding samples, obtained by 15% sampling from the TQA subset of OpenMMReasoner with ChartQA (Masry et al., 2022) and InfographicsVQA (Mathew et al., 2022) training sets.

B.3 Training Reward Acquisition

In this section, we describe how training rewards are obtained for different data sources. For textual math reasoning tasks, we use the rule-based verifier provided by DeepMath (https://github.com/zwhe99/DeepMath) to determine whether generated answers are correct. For textual code generation tasks, we use the rule-based verifier provided by PRIME (https://github.com/PRIME-RL/PRIME) to evaluate the correctness of generated code. For multimodal tasks, we use the verifier released by OpenMMReasoner (https://github.com/EvolvingLMMs-Lab/OpenMMReasoner) to assess whether generated answers are correct.

B.4 Training Pseudocode

The full training procedure of Uni-OPD is summarized in algorithm 3. In brief, the procedure (1) samples a prompt batch with offline difficulty-aware balancing (section A.1); (2) rolls out $G$ trajectories per prompt and computes the trajectory-level distillation return $G_{\mathrm{OPD}}$ from teacher–student log-probability differences (Eq. 5); (3) applies online correctness-aware balancing across the batch (section A.2); (4) calibrates $G_{\mathrm{OPD}}$ via the prompt-level margin $m(\bm{q})$ (Eq. 9) using either Greedy Margin Mask (algorithm 1) or Margin Shift (algorithm 2); and (5) broadcasts the calibrated returns to token-level advantages and updates the student $\pi_{\bm{\theta}}$.

Algorithm 3 Uni-OPD: Outcome-guided Policy Distillation with Margin Calibration
1:Input:
2: Teacher $\pi_{\text{T}}$, student $\pi_{\bm{\theta}}$, dataset $\mathcal{D}$, group size $G$, target margin $\delta$, calibration mode $\in\{\textsc{Mask},\textsc{Shift}\}$, learning rate $\eta$.
3:Output: Updated student parameters $\bm{\theta}$.
4:
5:function UniOPD($\pi_{\text{T}},\pi_{\bm{\theta}},\mathcal{D},G,\delta,\text{mode},\eta$)
6:  ▷ Offline difficulty-aware data balancing (once before training; see section A.1).
7:  Sample a prompt batch $\mathcal{B}\subset\mathcal{D}$ with rebalanced difficulty distribution
8:
9:  while not converged do
10:   ▷ Rollout and token-level scoring (per prompt).
11:   for all prompt $\bm{q}\in\mathcal{B}$ do
12:     Rollout $G$ trajectories $\{\bm{\tau}_{i}\}_{i=1}^{G}\sim\pi_{\bm{\theta}}(\cdot\mid\bm{q})$
13:     for $i=1,\ldots,G$ do
14:      Obtain outcome reward $R_{i}=r(\bm{q},\bm{\tau}_{i})\in\{0,1\}$
15:      for all token $o_{t}\in\bm{\tau}_{i}$ do
16:        $r^{\mathrm{OPD}}_{t}(\bm{\tau}_{i})\leftarrow\log\pi_{\text{T}}(o_{t}\mid\bm{q},\bm{o}_{<t})-\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})$ ▷ token-level OPD reward
17:      end for
18:      ▷ Trajectory-level distillation return (Eq. 5).
19:      $G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow\dfrac{1}{|\bm{\tau}_{i}|}\sum_{t=1}^{|\bm{\tau}_{i}|}r^{\mathrm{OPD}}_{t}(\bm{\tau}_{i})$
20:     end for
21:     Partition: $S_{+}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=1\}$, $S_{-}(\bm{q})\leftarrow\{\bm{\tau}_{i}\mid R_{i}=0\}$ ▷ correct / incorrect trajectory sets
22:   end for
23:
24:   ▷ Online correctness-aware data balancing (across the batch; see section A.2).
25:   $\mathcal{B}\leftarrow\textsc{OnlineCorrectnessAwareDataBalancing}\bigl(\mathcal{B},\{R_{i}\}_{\bm{q},i}\bigr)$
26:
27:   ▷ Outcome-guided margin calibration (per prompt; Eqs. 9 and 10).
28:   for all prompt $\bm{q}\in\mathcal{B}$ do
29:     Compute prompt-level margin $m(\bm{q})=\min_{\bm{\tau}\in S_{+}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})-\max_{\bm{\tau}\in S_{-}(\bm{q})}G_{\mathrm{OPD}}(\bm{q},\bm{\tau})$
30:     if mode $=\textsc{Mask}$ then
31:      $\{k_{\bm{q},i}\}_{i=1}^{G}\leftarrow\textsc{GreedyMarginMask}(\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\rho,\text{mode})$ ▷ algorithm 1
32:      $\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\leftarrow k_{\bm{q},i}\cdot G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i}),\quad\forall i=1,\ldots,G$ ▷ zero out masked trajectories
33:     else
34:      $\{\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G}\leftarrow\textsc{MarginShift}(\bm{q},\{\bm{\tau}_{i},R_{i},G_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})\}_{i=1}^{G},\delta,\text{mode},\text{direction})$ ▷ algorithm 2
35:     end if
36:   end for
37:
38:   ▷ Token-level broadcasting and policy update.
39:   for all prompt $\bm{q}\in\mathcal{B}$, rollout $i=1,\ldots,G$, token $o_{t}\in\bm{\tau}_{i}$ do
40:     $\widetilde{A}_{t}(\bm{q},\bm{\tau}_{i})\leftarrow\widetilde{G}_{\mathrm{OPD}}(\bm{q},\bm{\tau}_{i})$ ▷ broadcast calibrated trajectory return to all tokens
41:   end for
42:   $\mathcal{L}(\bm{\theta})=-\,\mathbb{E}_{\bm{q},\bm{\tau}_{i},t}\!\left[\widetilde{A}_{t}(\bm{q},\bm{\tau}_{i})\,\log\pi_{\bm{\theta}}(o_{t}\mid\bm{q},\bm{o}_{<t})\right]$
43:   $\bm{\theta}\leftarrow\bm{\theta}-\eta\,\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})$ ▷ one gradient step on the student
44:  end while
45:  return $\bm{\theta}$
46:end function
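
The scalar bookkeeping in algorithm 3 (per-token reward, length-normalized return, prompt-level margin, and token-level broadcasting) can be sketched in plain Python as follows. The function names are illustrative, the log-probabilities are assumed to be already gathered for the sampled tokens, and returning `None` for a group without both correct and incorrect rollouts is an assumption of this sketch rather than behavior stated in the paper.

```python
def trajectory_return(teacher_logps, student_logps):
    """Mean over tokens of (teacher log-prob - student log-prob) on one rollout."""
    gaps = [t - s for t, s in zip(teacher_logps, student_logps)]
    return sum(gaps) / len(gaps)

def prompt_margin(returns, rewards):
    """Lowest return among correct rollouts minus highest among incorrect ones."""
    pos = [g for g, r in zip(returns, rewards) if r == 1]
    neg = [g for g, r in zip(returns, rewards) if r == 0]
    if not pos or not neg:
        return None  # assumption: margin undefined when one side is empty
    return min(pos) - max(neg)

def broadcast(calibrated_return, num_tokens):
    """Copy one calibrated trajectory return to every token as its advantage."""
    return [calibrated_return] * num_tokens
```

A negative margin means at least one incorrect rollout is scored above a correct one by the teacher signal, which is the order inconsistency the calibration step is meant to remove.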

B.5 Training Dynamics

Fig. B.1 demonstrates the effectiveness of Uni-OPD along three complementary axes. From a comparable starting point (~35% correct, entropy ~0.33, length ~1.6k), Uni-OPD converges to a substantially higher response-correct ratio than OPD, peaking at 80.6% versus 75.2% and averaging 75.5% over the final 10 steps versus OPD’s 69.1% (+6.4 absolute points). Crucially, this accuracy gain is not obtained by sacrificing exploration: policy entropy rises mildly under both methods, with Uni-OPD maintaining a marginally higher steady-state value, ruling out the entropy-collapse failure mode that typically plagues teacher-driven training. Meanwhile, the average response length grows from ~1.6k to ~8k tokens, with Uni-OPD producing slightly longer outputs than OPD (7.8k vs. 7.1k), indicating that the model learns to perform more elaborate reasoning rather than collapsing to short, high-confidence shortcuts. Together, these trends suggest that Uni-OPD provides a consistent improvement over OPD without adverse effects on exploration or response length.

Figure B.1: Training dynamics of OPD and Uni-OPD for multi-teacher distillation. We track three indicators along the optimization trajectory: response correctness (%), policy entropy, and average response length.

B.6 Training Complexity

Beyond vanilla OPD, Uni-OPD introduces lightweight components on top of the standard per-iteration cost during training: online correctness-aware data balancing (per batch; section A.2), and outcome-guided margin calibration via Margin Mask / Shift (per prompt; section A.4). Let $B$ be the training batch size (number of prompts) and $G$ be the rollout group size. The online balancing only resamples prompts based on their precomputed $\{R_{i}\}$, costing $O(BG)$ per iteration. Margin Mask and Margin Shift both operate on the $G$ trajectory-level returns within each prompt group: Margin Shift is $O(G)$ per prompt, while the greedy variant of Margin Mask is at most $O(G^{2})$ per prompt in the worst case (typically $G\leq 16$ in our setup).

In contrast, the dominant per-iteration cost of OPD comes from two stages whose complexity scales linearly with the total number of rollout tokens $T_{\text{tok}}=\sum_{i=1}^{BG}|\bm{\tau}_{i}|$ and quadratically with the hidden size $d$: (i) sampling $BG$ on-policy rollouts from the student, and (ii) running a teacher prefill pass over these rollouts to obtain token-level log-probabilities, each of order $O(T_{\text{tok}}\,d^{2})$ for transformer forward passes. Typical numbers in our setup ($B=64$, $G=16$, average length ~8k) give $T_{\text{tok}}$ on the order of $8\times 10^{6}$ tokens per iteration. All of Uni-OPD’s additional computation scales with the number of trajectories rather than the number of tokens, involves only scalar comparisons and additions, and is therefore several orders of magnitude cheaper than the rollout and teacher-scoring stages that OPD already pays. In practice, we observe that enabling all three components adds less than 1% wall-clock overhead per iteration relative to vanilla OPD, while delivering the accuracy improvements reported in section B.5 and the main experiments. Thus Uni-OPD offers a favorable accuracy–compute trade-off: a negligible compute surcharge in exchange for consistently better final performance.
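
The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below plugs in the batch size and group size from the setup and takes exactly 8,000 tokens per rollout as a stand-in for the reported average of roughly 8k, which is an assumption; it counts scalar operations only and ignores the additional per-token factor of the transformer forward pass.

```python
B = 64            # prompts per training batch
G = 16            # rollouts per prompt
avg_len = 8_000   # assumed tokens per rollout (the text reports roughly 8k)

t_tok = B * G * avg_len        # tokens seen by rollout and teacher prefill
balancing_ops = B * G          # O(BG) online correctness-aware balancing
shift_ops = B * G              # O(G) per prompt for Margin Shift
mask_ops_worst = B * G * G     # O(G^2) per prompt for greedy Margin Mask

# Tokens processed per worst-case calibration operation.
print(t_tok, mask_ops_worst, t_tok // mask_ops_worst)
```

Even before accounting for the per-token cost of a forward pass, the token count exceeds the worst-case number of calibration operations by a factor of several hundred.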

Appendix C Evaluation Details

C.1 Evaluation Benchmarks

We evaluate our Uni-OPD on a comprehensive benchmark suite spanning textual and multimodal capabilities, organized along five capability axes:

  • Textual Math Reasoning:
    - AIME (2024/2025) (AI-MO, 2024): A prestigious high school mathematics competition featuring challenging problems that test deep mathematical reasoning.
    - HMMT25 (Feb & Nov) (Balunović et al., 2025): Contest-level benchmarks designed to rigorously evaluate advanced reasoning across algebra, geometry, combinatorics, and other domains.

  • Textual Code Generation:
    - HumanEval+ (Liu et al., 2023b): A set of 164 hand-written programming problems evaluating functional correctness, covering language understanding, reasoning, algorithms, and basic mathematics.
    - MBPP+ (Liu et al., 2023b): A collection of ~1,000 crowd-sourced Python tasks targeting entry-level programming skills, including fundamentals and standard library usage.
    - LiveCodeBench (v6) (Jain et al., 2024): A contamination-free and continuously updated benchmark assessing not only code generation but also execution, self-repair, and test prediction.

  • Multimodal Math Reasoning:
    - MathVision (Wang et al., 2024a): A curated dataset of 3,040 visual problems sourced from real competitions, spanning 16 disciplines and multiple difficulty levels for evaluating multimodal mathematical reasoning.
    - DynaMath (Zou et al., 2024): A dynamically generated benchmark based on 501 seed question generators, enabling diverse and scalable evaluation through multiple sampled variants.
    - WeMath (Qiao et al., 2025): A large-scale benchmark with 6.5K visual math problems organized into 67 hierarchical knowledge concepts, designed to analyze problem-solving processes.

  • Multimodal Logic Reasoning:
    - LogicVista (Xiao et al., 2024): A benchmark for evaluating multimodal logical reasoning across 5 task types and 9 capabilities using annotated multiple-choice questions with human reasoning.
    - VisuLogic (Xu et al., 2025b): A challenging visual reasoning benchmark focusing on reasoning directly from visual inputs, with tasks that are difficult to express textually and expose gaps in current MLLMs.

  • Document Understanding:
    - AI2D (Kembhavi et al., 2016): A diagram understanding benchmark focusing on parsing diagram structure and reasoning over relationships between components via question answering.
    - ChartQA (Masry et al., 2022): A benchmark for question answering over charts, requiring complex visual and logical reasoning over both chart structure and underlying data.
    - DocVQA (Mathew et al., 2021): A large-scale document visual question answering dataset over document images, emphasizing structural and textual understanding.
    - InfoVQA (Mathew et al., 2022): A benchmark on infographic understanding that requires joint reasoning over layout, text, and visual elements with an emphasis on multi-step reasoning.

C.2 Evaluation Setup

Textual evaluations. For all textual evaluations, we use a sampling temperature of 1.0, top-$p$ of 1.0, a maximum generation length of 16,384 tokens, and a fixed random seed of 42. We use the vLLM inference engine to perform sampling. For math reasoning benchmarks, we sample $N=32$ solutions per problem, while for code generation benchmarks, we sample $N=4$ solutions per problem. For evaluation, we adopt Math-Verify (https://github.com/huggingface/Math-Verify) as a rule-based verifier for math reasoning tasks. For code generation, we use the EvalPlus (https://github.com/evalplus/evalplus) and LiveCodeBench (https://github.com/livecodebench/livecodebench) frameworks to assess functional correctness. For all main results, we report the average accuracy across sampled solutions (i.e., pass@1), and compute pass@k as:

$$\text{pass}@k=1-\frac{\binom{N-c}{k}}{\binom{N}{k}}, \tag{15}$$

where $N$ is the number of samples and $c$ is the number of correct solutions.
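
A direct implementation of Eq. 15 is short enough to state in full. The sketch below uses exact integer binomials from the Python standard library; the function name and the input validation are illustrative choices, not part of the evaluation frameworks cited above.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct (Eq. 15)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct solution.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For $k=1$ the expression reduces to $c/N$, which is why the average accuracy across sampled solutions coincides with pass@1.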

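Table C.1: Reported evaluation metrics for different benchmark datasets. We summarize the primary metrics used for performance reporting across math, logic, and document understanding tasks.
| Category | Tasks | Filter | $N$-Shot | Reported Metric |
| --- | --- | --- | --- | --- |
| Multimodal Math Reasoning | MathVision Test | none | 0 | mathvision_standard_eval |
| Multimodal Math Reasoning | DynaMath Reasoning | none | 0 | dynamath_average |
| Multimodal Math Reasoning | WeMath TestMini Reasoning | none | 0 | acc_score |
| Multimodal Logic Reasoning | LogicVista Reasoning | none | 0 | acc_score |
| Multimodal Logic Reasoning | LogicVista Reasoning | none | 0 | format_score |
| Multimodal Logic Reasoning | VisuLogic | none | 0 | visulogic_acc |
| Document Understanding | AI2D | flexible-extract | 0 | exact_match |
| Document Understanding | ChartQA | none | 0 | relaxed_human_split |
| Document Understanding | DocVQA Val | none | 0 | anls |
| Document Understanding | InfoVQA Val | none | 0 | anls |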

Multimodal evaluations. For multimodal evaluations, we adopt the widely used LMMs-Eval (https://github.com/evolvinglmms-lab/lmms-eval; Zhang et al., 2025a) framework and strictly follow its official evaluation protocols and configurations. The reported evaluation metrics are summarized in Table C.1.

Appendix D Further Evaluations

D.1 More Evaluation Results

Table D.1: Performance of Qwen3-1.7B Student under math reasoning and code generation benchmarks. Teacher models (i.e., Qwen3-4B-Math-RL and Qwen3-4B-Code-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type.
| Setting | Method | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Math Avg. | HumanEval+ | MBPP+ | LCB | Code Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | Student | 13.9 | 11.1 | 5.6 | 4.9 | 8.9 | 61.9 | 53.4 | 11.9 | 42.4 |
| - | Teacher | 60.1 | 55.1 | 32.5 | 38.5 | 46.6 | 85.2 | 69.8 | 26.6 | 60.5 |
| Single-teacher distillation | OPD | 42.3 | 35.4 | 18.4 | 19.1 | 28.8 | 71.8 | 58.2 | 26.7 | 52.5 |
| Single-teacher distillation | Uni-OPD | 42.6 | 35.1 | 20.8 | 20.9 | 29.9 | 73.0 | 60.0 | 28.1 | 53.7 |
| Multi-teacher distillation | OPD | 40.3 | 32.4 | 20.0 | 20.3 | 28.3 | 73.2 | 59.1 | 25.7 | 52.7 |
| Multi-teacher distillation | Uni-OPD | 44.0 | 35.1 | 19.5 | 19.8 | 29.6 | 72.9 | 60.5 | 28.0 | 53.8 |

Table D.2: Performance of Qwen3-VL-2B-Instruct Student under math reasoning, logic reasoning, and document understanding benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Math-RL, Qwen3-VL-4B-Instruct-Logic-RL and Qwen3-VL-4B-Instruct-Document-RL) are developed through domain-specific RL. Avg. denotes the mean score within each category.
| Setting | Method | MathVision | DynaMath | WeMath | Math Avg. | LogicVista Accuracy | LogicVista Format | VisuLogic | Logic Avg. | AI2D | ChartQA | DocVQA | InfoVQA | Document Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | Student | 11.1 | 49.1 | 48.6 | 36.3 | 32.4 | 59.1 | 6.4 | 32.6 | 73.4 | 66.1 | 92.8 | 72.4 | 76.2 |
| - | Teacher | 47.2 | 65.3 | 79.5 | 64.0 | 52.5 | 73.8 | 27.4 | 51.2 | 82.5 | 76.4 | 95.1 | 81.6 | 83.9 |
| Single-teacher distillation | OPD | 24.4 | 54.5 | 64.8 | 47.9 | 35.3 | 61.6 | 26.8 | 41.2 | 76.1 | 66.0 | 93.0 | 72.2 | 76.8 |
| Single-teacher distillation | Uni-OPD | 25.5 | 55.2 | 65.0 | 48.6 | 36.8 | 65.2 | 27.6 | 43.2 | 76.7 | 66.6 | 92.9 | 72.6 | 77.2 |
| Multi-teacher distillation | OPD | 15.2 | 50.2 | 57.6 | 41.0 | 38.0 | 65.2 | 27.2 | 43.4 | 76.2 | 66.1 | 92.9 | 72.5 | 76.9 |
| Multi-teacher distillation | Uni-OPD | 18.7 | 51.2 | 58.7 | 43.9 | 42.0 | 69.8 | 27.0 | 46.3 | 76.0 | 66.5 | 93.0 | 72.6 | 77.0 |

Single-teacher and multi-teacher distillation on LLMs and MLLMs. We further evaluate Uni-OPD under both single-teacher and multi-teacher distillation settings on LLMs and MLLMs. As shown in Tables D.1 and D.2, our method outperforms the standard OPD baseline in category-average score across all domains and settings. On the LLM student (i.e., Qwen3-1.7B), Uni-OPD improves the average scores on both math reasoning and code generation under single-teacher and multi-teacher distillation. On the MLLM student (i.e., Qwen3-VL-2B-Instruct), it delivers consistent gains in the average scores for math reasoning, logic reasoning, and document understanding. Further, it narrows the gap to the teacher ensemble under multi-teacher distillation. Consistent improvements in smaller students provide strong empirical evidence for our dual-perspective approach, supporting the view that student exploration and teacher reliability are fundamental drivers of successful and reliable distillation.

Table D.3: Performance of Qwen3-VL-4B-Instruct Student under code generation and logic reasoning benchmarks. Teacher models (i.e., Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Logic-RL) are developed through domain-specific RL. The performance of teacher models is denoted by the “RL” type.
| Setting | Method | HumanEval+ | MBPP+ | LCB | Code Avg. | LogicVista Accuracy | LogicVista Format | VisuLogic | Logic Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | Student | 76.8 | 70.0 | 37.0 | 61.3 | 49.9 | 66.4 | 25.1 | 47.0 |
| - | Teacher | 82.2 | 70.5 | 40.1 | 64.3 | 52.5 | 73.8 | 27.4 | 51.2 |
| Multi-teacher distillation | OPD | 79.0 | 68.5 | 39.6 | 62.4 | 50.0 | 69.3 | 27.3 | 48.9 |
| Multi-teacher distillation | Uni-OPD | 79.4 | 69.2 | 41.4 | 63.3 | 52.0 | 73.8 | 28.0 | 51.3 |

Cross-modal distillation on code generation and logic reasoning. Beyond the cross-modal distillation on math reasoning and code generation, we further conduct cross-modal distillation on code generation and logic reasoning. Specifically, we combine text-only code data with multimodal logic reasoning data, and jointly distill from two domain-specific teachers (Qwen3-VL-4B-Instruct-Code-RL and Qwen3-VL-4B-Instruct-Logic-RL) into a single Qwen3-VL-4B-Instruct student. As shown in Table D.3, Uni-OPD outperforms the standard OPD baseline on both the code generation and logic reasoning averages, with the largest code gain on LCB (39.6 → 41.4) and clear logic gains on LogicVista Accuracy (50.0 → 52.0) and LogicVista Format (69.3 → 73.8). These results confirm that Uni-OPD effectively integrates heterogeneous text-only and multimodal data under a single training run, further supporting its applicability to cross-modal distillation.

D.2 Downstream Task Evaluation

Table D.4: General downstream task performance. Evaluation on 8 general benchmarks to ensure general-purpose capabilities are maintained after OPD.
| Model | MMLU | ARC | HellaSwag | TruthfulQA | Winogrande | GSM8K | CommonsenseQA | IFEval | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 68.3 | 80.7 | 68.4 | 54.8 | 66.6 | 84.2 | 75.8 | 88.9 | 73.5 |
| Math Teacher | 68.4 | 80.8 | 68.5 | 54.3 | 66.0 | 86.7 | 75.4 | 89.2 | 73.7 |
| Code Teacher | 68.3 | 80.2 | 68.3 | 54.8 | 65.7 | 85.8 | 75.7 | 89.7 | 73.6 |
| OPD | 68.3 | 80.3 | 68.4 | 54.6 | 66.5 | 88.6 | 75.5 | 89.2 | 73.9 |
| Uni-OPD | 68.3 | 80.3 | 68.3 | 54.6 | 66.0 | 88.6 | 75.7 | 89.2 | 73.9 |

Evaluation on general capabilities. To assess the impact of OPD on general downstream performance of the policy model, we evaluate the models on a diverse set of benchmarks from the Hugging Face Open LLM Leaderboard (Beeching et al., 2023) following recent studies (Peng et al., 2026; Meng et al., 2024). Specifically, we report results on MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), TruthfulQA (Lin et al., 2022), Winogrande (Levesque et al., 2012), GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), and IFEval (Zhou et al., 2023b). We strictly follow the standard evaluation protocols provided by the lm-evaluation-harness system (https://github.com/EleutherAI/lm-evaluation-harness). For IFEval, we report the inst_level_loose_acc.

The results are presented in Table D.4. Overall, Uni-OPD not only outperforms OPD and domain-specific teachers on math reasoning and code generation benchmarks, as demonstrated in the main text, but also retains strong performance across a wide range of downstream tasks. These results suggest that OPD serves as a general and effective framework for improving LLM performance beyond task-specific settings.

D.3 Further Ablation

Table D.5: Effectiveness validation of margin shift across different hyperparameters. We conduct single-teacher distillation experiments with a Qwen3-4B Student using individual math and code teachers.
| Configuration | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Math Avg. | HumanEval+ | MBPP+ | LCB | Code Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPD (no shift) | 57.9 | 52.4 | 30.2 | 37.8 | 44.6 | 82.6 | 68.8 | 25.7 | 59.0 |
| Global + Mean + Lift | 61.8 | 55.2 | 34.8 | 39.4 | 47.8 | 85.7 | 71.4 | 25.7 | 60.9 |
| Global + MinMax + Lift | 62.4 | 57.3 | 32.2 | 38.2 | 47.5 | 85.8 | 71.8 | 26.7 | 61.4 |
| Group + MinMax + Spread | 63.4 | 56.7 | 33.4 | 39.0 | 48.1 | 86.9 | 70.6 | 26.7 | 61.4 |
| Group + Mean + Spread (ours) | 62.7 | 56.3 | 34.4 | 39.2 | 48.2 | 88.3 | 72.3 | 26.7 | 62.4 |

Hyperparameter analysis for margin shift. As shown in Table D.5, we compare four variants of margin shift against the OPD baseline across math reasoning and code generation benchmarks. The shift scope (Global vs. Group), normalization mode (Mean vs. MinMax), and shift direction (Lift vs. Spread) are ablated systematically. All shift variants consistently outperform the vanilla OPD baseline, demonstrating the general effectiveness of margin shift. Among the variants, Group + Mean + Spread achieves the best average performance on both code generation (62.4) and math reasoning (48.2), indicating that group-level mean normalization with bidirectional shifting provides a more calibrated return signal. Applying the shift to both correct and incorrect responses (Spread) proves beneficial over unidirectional shifting (Lift), and group-level statistics generalize better than global ones when reward distributions vary across prompts. Furthermore, we observe that MinMax-based normalization and global-scope statistics are susceptible to outlier return values, as extreme return values within a batch can distort the shift magnitude and destabilize training. In contrast, group-level mean normalization produces more robust and consistent return estimates, contributing to stable optimization throughout training.

Table D.6: The effects of rollout number. The global batch size is fixed at $n\times bs=1024$ throughout.
| Method | Configuration | AIME 2024 | AIME 2025 | HMMT 25 Feb. | HMMT 25 Nov. | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Student (4B) | - | 23.0 | 19.3 | 12.3 | 9.2 | 15.9 |
| OPD | $n=4$, $bs=256$ | 60.1 | 55.1 | 32.5 | 29.6 | 44.3 |
| OPD | $n=8$, $bs=128$ | 59.8 | 52.9 | 29.6 | 35.8 | 44.5 |
| OPD | $n=16$, $bs=64$ | 57.9 | 52.4 | 30.2 | 37.8 | 44.6 |
| OPD | $n=32$, $bs=32$ | 58.3 | 51.2 | 30.6 | 36.9 | 44.3 |
| OPD + Margin shift | $n=4$, $bs=256$ | 57.9 | 52.4 | 33.2 | 37.8 | 45.3 |
| OPD + Margin shift | $n=8$, $bs=128$ | 62.5 | 55.4 | 31.9 | 39.2 | 47.3 |
| OPD + Margin shift | $n=16$, $bs=64$ | 62.7 | 56.3 | 34.4 | 39.2 | 48.2 |
| OPD + Margin shift | $n=32$, $bs=32$ | 63.1 | 55.4 | 34.2 | 39.6 | 48.1 |

Hyperparameter analysis for rollout number $n$. As shown in Table D.6, we ablate the rollout number $n$ in OPD while keeping the global batch size fixed at 1024 (i.e., $n\times bs=1024$), so that increasing $n$ comes at the cost of a smaller per-step batch size $bs$. For the OPD baseline, performance remains largely stable across all values of $n$ (44.3–44.6 avg.), suggesting that the base method is relatively insensitive to this trade-off. In contrast, OPD with margin shift benefits notably from larger rollout groups: average performance improves from 45.3 at $n=4$ to 48.2 at $n=16$, as more responses per prompt yield more reliable relative return estimation for the margin-based calibration. We find that increasing $n$ from 16 to 32 yields comparable performance. Considering return estimation quality, training stability, and computational efficiency, we therefore set $n=16$ as our default.

Appendix E Related Work

E.1 Multimodal Large Language Models

Large Language Models (LLMs) have undergone rapid development in recent years (Touvron et al., 2023; Achiam et al., 2023; AI@Meta, 2024b; Hurst et al., 2024; Yang et al., 2024a; AI@Meta, 2024a; Yang et al., 2025; Brown et al., 2020; Team et al., 2024; Anthropic, 2023b; a; 2024; Liu et al., 2024a; Guo et al., 2025a; Li et al., 2025), significantly improving reasoning capabilities. Meanwhile, multimodal large language models (MLLMs) have also seen substantial progress (Radford et al., 2021; Shao et al., 2024a; Wang et al., 2025; Tian et al., 2019; Liu et al., 2024e; Yang et al., 2024c; Peng et al., 2026; Team et al., 2025). Leveraging advances in LLMs, MLLMs further integrate visual and textual representations through cross-modal learning, achieving strong multimodal understanding and generation capabilities. A key driver of this success lies in the combination of large-scale self-supervised pre-training on diverse corpora and subsequent high-quality supervised fine-tuning (SFT), which enables LLMs and MLLMs to exhibit strong generalization and emergent capabilities in real-world tasks (Wang et al., 2024b; Bai et al., 2023; 2025b; Liu et al., 2023a; 2024b; 2024c; Dai et al., 2023; OpenAI, 2023; Zhu et al., 2023; Qu et al., 2025; Yang et al., 2023b; Zhong et al., 2024; Yang et al., 2023a; 2024b; Lai et al., 2024; Peng et al., 2025; Hou et al., 2026). Building upon these foundations, knowledge distillation (KD) has become an important paradigm for transferring sophisticated reasoning capabilities from teacher models to more efficient students. Among various distillation strategies, OPD has recently emerged as a mainstream post-training paradigm for both LLMs and MLLMs. In the on-policy setting, however, the effectiveness of distillation is tied to both the quality of student exploration and the reliability of teacher feedback. In this work, we present a dual-perspective optimization strategy from both the student and teacher sides to improve data suitability and training stability in OPD.

E.2 Reinforcement Learning

By optimizing trajectories sampled from the current policy, on-policy RL alleviates distribution mismatch and is often instantiated with verifiable or outcome-based rewards in reasoning tasks. Notable methods include GRPO (Shao et al., 2024b) for critic-free grouped optimization and GSPO (Zheng et al., 2025) for sequence-level stable optimization. Recently, some works have also combined reinforcement learning with verifiable rewards (RLVR) with OPD, such as Self-Distilled RLVR (Yang et al., 2026a) and OpenClaw-RL (Wang et al., 2026). In our work, we use GRPO to obtain stronger domain-specific teachers and use the corresponding reward models as global guidance for return calibration in OPD.

E.3 On-Policy Distillation

Early OPD work, such as MiniLLM (Gu et al., 2023) and GKD (Agarwal et al., 2024), establishes the basic paradigm of using teacher feedback on student-generated trajectories under a reverse KL objective. Recent studies further broaden this paradigm from multiple perspectives. In self-distillation methods, OPSD (Zhao et al., 2026b) uses privileged information; SDFT (Shenfeld et al., 2026) allows the student to absorb knowledge from retrieved demonstrations while reducing forgetting; SDPO (Hübotter et al., 2026) treats the current model itself as a self-teacher; and OPCD (Ye et al., 2026) internalizes context knowledge into model parameters by minimizing reverse KL between the student and a context-conditioned teacher on the student’s trajectories. Regarding teacher access, black-box OPD (Ye et al., 2025) introduces a discriminator-guided framework that does not require teacher logits. Several works also focus on improving optimization and efficiency. ExOPD (Yang et al., 2026b) reformulates OPD as weighted dense RL; Fast and Effective OPD (Zhang et al., 2026a) improves efficiency through prefix-only distillation; KDFlow (Zhang et al., 2026b) provides an extensible distillation framework supporting both off-policy and on-policy training; and MiMo-V2-Flash (Xiao et al., 2026) introduces multi-teacher OPD, enabling effective capability merging across domains. Li et al. (2026b) rethink OPD in terms of its phenomenology, mechanisms, and training recipes.
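To make the "reverse KL objective" mentioned above concrete, the following minimal sketch computes KL(student || teacher) for a single next-token distribution and contrasts it with the forward direction. The function names and the toy three-token distributions are illustrative assumptions; the exact losses used by the cited methods may differ in estimator and weighting.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution.

    The expectation is taken under the student, which is what makes the
    divergence "reverse" relative to the usual KL(teacher || student).
    """
    return sum(
        p_s * (math.log(p_s + eps) - math.log(p_t + eps))
        for p_s, p_t in zip(student_probs, teacher_probs)
        if p_s > 0.0
    )


def forward_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(teacher || student), shown only for contrast."""
    return reverse_kl(teacher_probs, student_probs, eps)


# Toy 3-token vocabulary (illustrative numbers, not from the paper).
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.4, 0.1]

rkl = reverse_kl(student, teacher)
fkl = forward_kl(student, teacher)
print(f"reverse KL = {rkl:.4f}, forward KL = {fkl:.4f}")
```

The two directions give different values on the same pair of distributions. Because the reverse direction weights each token by the student's own probability, it is the natural choice when the training states are sampled from the student.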

Recently, OPD has also begun to extend beyond text-only settings. VOLD (Bousselham et al., 2025) transfers reasoning ability from text teachers to vision-language students through a two-stage pipeline that combines cold-start alignment with GRPO and OPD. Video-OPD (Li et al., 2026a) adapts OPD to long-video grounding and introduces a curriculum that filters unreliable teacher signals. X-OPD (Cao et al., 2026) further extends OPD to speech through cross-modal alignment. In contrast, our work focuses on developing a unified OPD framework with an open recipe for both LLMs and MLLMs.

Appendix F Case Studies

We provide qualitative case studies of Uni-OPD, standard OPD, and the Student model across both LLM and MLLM benchmarks, covering textual math reasoning, code generation, logical reasoning, multimodal math reasoning, and chart understanding.

We first revisit the math reasoning case in Fig. F.1 and provide a detailed output comparison of standard OPD and our Uni-OPD. Standard OPD assigns high returns to incorrect trajectories and low returns to correct ones. Furthermore, the code generation case in Fig. F.2 highlights Uni-OPD’s ability to balance algorithmic efficiency and code readability. These case studies demonstrate how our dual-perspective optimization, specifically restoring order consistency through margin calibration, leads to more reliable and higher-quality model outputs.
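The failure mode described above, where incorrect trajectories receive higher returns than correct ones, can be detected with a simple diagnostic. The sketch below is not the paper's margin calibration mechanism; it only checks order consistency for one prompt, assuming each trajectory has a scalar aggregated return and a binary outcome label. The function names and the toy numbers are hypothetical.

```python
def violating_pairs(returns, correct):
    """List (i, j) pairs where correct trajectory i does not outscore
    incorrect trajectory j, i.e. the ordering disagrees with the outcome."""
    pos = [i for i, c in enumerate(correct) if c]
    neg = [j for j, c in enumerate(correct) if not c]
    return [(i, j) for i in pos for j in neg if returns[i] <= returns[j]]


def is_order_consistent(returns, correct):
    """True when every correct trajectory has a strictly higher return
    than every incorrect one (trivially true if one group is empty)."""
    return not violating_pairs(returns, correct)


# Toy rollouts for one prompt (illustrative values, not from the paper).
returns = [-1.2, -0.4, -2.0, -0.1]    # assumed aggregated token-level returns
correct = [True, False, True, True]   # assumed outcome-reward labels

pairs = violating_pairs(returns, correct)
print(is_order_consistent(returns, correct), pairs)
```

In this toy group the incorrect rollout at index 1 outscores the correct rollouts at indices 0 and 2, which is the kind of inversion the case study in Fig. F.1 illustrates.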

Across the multimodal case studies in Figs. F.3 to F.9, our observations reveal three consistent patterns: (a) Uni-OPD demonstrates superior efficiency on complex reasoning problems, producing more concise outputs while maintaining correctness, whereas the Student model and standard OPD frequently generate excessively long responses that are truncated before reaching a final answer; (b) Uni-OPD achieves higher correctness than the Student model, often succeeding on questions where the Student model fails; and (c) our data-balancing strategies encourage exploration of informative student-generated states during training, improving Uni-OPD’s ability to tackle challenging visual and mathematical reasoning problems that the Student model cannot solve on its own.

Figure F.1: Comparison of math reasoning outputs between OPD and Uni-OPD. In this case, standard OPD assigns high returns to incorrect reasoning trajectories and low returns to correct ones. In contrast, our Uni-OPD performs outcome-guided margin calibration to restore order consistency between correct and incorrect trajectories, yielding a reliable supervision signal that ultimately improves both efficiency and correctness of the generated solutions. On this question, we further measure pass@1 accuracy over 64 rollouts: standard OPD reaches 79.69%, while our Uni-OPD attains 82.81%, further validating the effectiveness of the proposed strategy.
Figure F.2: Comparison of code generation for the Find_Max task. While the Student model produces correct logic with limited readability, the OPD baseline introduces redundant computation (two passes) despite adding comments. Our Uni-OPD generates a superior solution that is both computationally efficient (single pass) and well-commented, demonstrating its effectiveness in aligning with complex task requirements.
Figure F.3: Example output of LogicVista. The Student model produces an incorrect reasoning trace and arrives at the wrong answer. Standard OPD overthinks the problem, generating an excessively long response that is truncated without producing a final answer. In contrast, Uni-OPD reasons concisely and correctly answers the question.
Figure F.4: Example output of LogicVista. All three models correctly answer this multi-step arithmetic reasoning question. OPD and Uni-OPD both reason concisely, with Uni-OPD being slightly more token-efficient.
Figure F.5: Example output of VisuLogic. Uni-OPD correctly answers both questions, demonstrating that our training recipe encourages student exploration and improves its ability to solve challenging visual reasoning problems.
Figure F.6: Example output of LogicVista. On this challenging visual pattern reasoning puzzle, both the Student model and OPD fail to produce a final answer due to overthinking. Uni-OPD, however, identifies the correct pattern and selects the right answer.
Figure F.7: Example output of ChartQA. All models answer the simpler chart question correctly, while only Uni-OPD answers the more complex one correctly.
Figure F.8: Example output of MathVision. All three models follow the required format, but only Uni-OPD produces correct reasoning and reaches the right answer.
Figure F.9: Example output of WeMath. This geometry problem requires correctly identifying which side accommodates two circle diameters. Both the Student model and OPD confuse the orientation of AB and AD, while Uni-OPD correctly answers the question.
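
As a small arithmetic check on the pass@1 figures quoted in the Figure F.1 caption, the sketch below recovers the number of correct rollouts implied by each percentage. It assumes pass@1 is computed as the fraction of correct rollouts out of 64; the counts are inferred from the reported percentages and are not stated in the paper, and the helper name is hypothetical.

```python
ROLLOUTS = 64  # number of rollouts stated in the Figure F.1 caption


def implied_correct(pass_at_1_percent, rollouts=ROLLOUTS):
    """Nearest integer count of correct rollouts for a reported pass@1 (%)."""
    return round(pass_at_1_percent / 100.0 * rollouts)


opd_correct = implied_correct(79.69)   # standard OPD
uni_correct = implied_correct(82.81)   # Uni-OPD

for name, k in [("OPD", opd_correct), ("Uni-OPD", uni_correct)]:
    print(f"{name}: {k}/{ROLLOUTS} = {100.0 * k / ROLLOUTS:.2f}%")
```

Under this assumption, the reported gap on this question corresponds to two additional correct rollouts out of 64.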
