[2605.05040] Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, Qinzhen Guo
Affiliations: TikTok; The Pennsylvania State University
Corresponding author: Qinzhen Guo (guoqinzhen@bytedance.com).
This work was completed while Xin Yu was an intern at TikTok.

Abstract

On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet, existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy self-distillation through a reward-regularized perspective. Instead of directly matching the teacher distribution, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective. Practically, PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy student sampling. We support this framework with a statistical analysis of the induced preference-learning problem, formally establishing when on-policy self-distillation is preferable to learning from an external teacher in our setting. Experiments on mathematical reasoning and tool-use benchmarks across multiple model scales demonstrate that PBSD consistently achieves the strongest average performance among comparable baselines, showing improved training stability over prior self-distillation baselines while preserving token efficiency.

1 Introduction

On-policy distillation (OPD) (Agarwal et al., 2024; Lu and Lab, 2025) leverages a stronger teacher model to provide dense token-level learning signals along a student’s sampled trajectories. It has emerged as an important paradigm for post-training large language models (LLMs) (Song and Zheng, 2026), offering an efficient, structured alternative to reinforcement learning (RL) based optimization (Lu and Lab, 2025; Li et al., 2026a). Compared to RL-based post-training methods such as GRPO (Shao et al., 2024), OPD is typically more token-efficient because it avoids repeated group rollouts and reward evaluations (Lu and Lab, 2025; Song and Zheng, 2026). However, standard OPD relies on a separate, typically larger teacher model and assumes a shared token vocabulary between teacher and student, which bottlenecks its computational efficiency and practical scalability (Agarwal et al., 2024; Fu et al., 2026; Li et al., 2026b). To overcome these limitations, on-policy self-distillation (Zhao et al., 2026a; Sang et al., 2026) has emerged as a compelling solution. Rather than querying a stronger external teacher model, on-policy self-distillation unifies the teacher and student models into a single architecture, with the teacher instantiated by conditioning the model on additional context $\bm{c}$ in the prompt (Zhao et al., 2026a). Similar to OPD, existing on-policy self-distillation methods optimize the divergence between teacher and student distributions, often via a forward-KL objective (Zhao et al., 2026a) or a reverse-KL objective (Hübotter et al., 2026; Yang et al., 2026). More broadly, scalable self-distillation may help democratize capable open-source agents by reducing the reliance on expensive proprietary teachers or dense human supervision, unlocking new applications in multi-turn agent training (Wang et al., 2026a), tool-use and conversational agents (Wang et al., 2026b), and autonomous decision-making (Afsharrad et al., 2026).

Despite its promise, this KL-based formulation of on-policy self-distillation suffers from two key limitations. First, directly optimizing a KL divergence toward the context-augmented model is unstable and can actively degrade reasoning performance over the course of training, as observed in recent analyses of both self-distillation and on-policy distillation (Kim et al., 2026; Fu et al., 2026; Li et al., 2026b). Specifically, KL matching tends to suppress epistemic verbalization (namely, the model’s explicit expression of uncertainty, hesitation, self-checking, and error correction), yielding reasoning traces that are shorter but undesirably overconfident. Second, as illustrated in Figure 1, treating a context-augmented model as a teacher is a fundamentally strong assumption. Unlike standard OPD, this teacher shares the exact parameters of the student and differs only in its prompt. Thus, the resulting supervision often lacks the diversity and exploratory value provided by a genuinely stronger external model (Li et al., 2026b). Together, these observations motivate us to move beyond fixed-teacher KL matching. Instead, we seek a more robust target distribution that better preserves exploratory reasoning and admits more stable optimization.

Figure 1: Comparison of three on-policy distillation paradigms. Left: Standard on-policy distillation relies on a stronger external teacher, training the student via direct KL-based distribution matching. Middle: Traditional self-distillation replaces the external teacher with the same base model under a privileged context and retains direct KL matching toward the induced teacher distribution. Right: Our proposed PBSD moves beyond direct KL matching by combining the teacher distribution with a reward function to construct a reward-reweighted target policy for the student to learn, which provides more stable and exploratory supervision for the student.

Table 1: Comparison of post-training paradigms. “Dense Supervision” denotes the use of response- or token-level learning signals as opposed to sparse outcome-level feedback, and “No External Teacher/Scorer” indicates that training operates entirely without an auxiliary teacher model or reward/scoring model. Under these criteria, PBSD is the only method that achieves on-policy optimization, dense supervision, token efficiency, and reward-aware optimization without relying on an external teacher or scorer.

Method	On-Policy	Dense Supervision	No External Teacher/Scorer	Token Efficiency	Reward-Aware
SFT	✗	✓	✓	✓	✗
GRPO	✓	✗	✗	✗	✓
OPD	✓	✓	✗	✓	✗
OPSD	✓	✓	✓	✓	✗
PBSD (Ours)	✓	✓	✓	✓	✓

In this work, we propose Preference-Based Self-Distillation (PBSD) by revisiting on-policy self-distillation through a reward-regularized lens. Rather than relying solely on KL divergence, we introduce a teacher-anchored objective (i.e., Eq. (3)) that augments KL matching with reward maximization. Under this formulation, the student is encouraged to stay close to the teacher while shifting probability mass toward responses with higher latent reward. Crucially, this objective admits an analytic optimum in which the target policy is a reward-reweighted version of the teacher distribution, rather than the teacher distribution itself. Motivated by this optimal target, PBSD optimizes the preference gaps between teacher and student samples while maintaining on-policy student sampling. We further provide a statistical analysis clarifying when contextual self-distillation, i.e., learning from a relevant teacher, can be theoretically preferable to learning from an external teacher; specifically, we show this by analyzing the statistical error of the maximum likelihood estimator (MLE) in the induced preference-learning problem. As summarized in Table 1, our proposed PBSD addresses the gap between classic post-training methods, as it retains the rich teacher-derived signal of self-distillation while aligning the optimization process with reward-aware policy improvement.

We summarize our main contributions as follows:

• We introduce a reward-regularized objective for self-distillation that balances keeping the student model close to the teacher model with maximizing the reward function. We show that its closed-form analytic optimum, obtained by reweighting the teacher distribution via latent rewards, is provably superior to the original teacher distribution under this objective.

• We propose PBSD, a novel preference-based on-policy self-distillation framework. Furthermore, we provide a statistical analysis of the induced preference-learning problem, establishing the conditions under which contextual self-distillation is theoretically preferable to learning from an external teacher.

• We empirically evaluate our proposed PBSD on mathematical reasoning and tool-use benchmarks across multiple model scales. In our comparisons, PBSD achieves the strongest overall performance, showing improved training stability over prior self-distillation baselines while retaining the token-efficiency benefits of self-distillation.

2 Methodology

In this section, we first derive a reward-aware objective for on-policy distillation by augmenting KL-based matching with reward maximization, and then show how to optimize the resulting objective through preference-based learning.

2.1 A Reward-Regularized Objective for On-Policy Distillation

Formally, let $x$ denote an input prompt and let the student model be parameterized by $\pi_{\theta}(y\mid x)$ , where $y=(y_{1},\dots,y_{T})$ is an output sequence of length $T$ . Under the standard autoregressive factorization,

\pi_{\theta}(y\mid x)=\prod_{t=1}^{T}\pi_{\theta}(y_{t}\mid x,y_{<t}).

(1)
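As a small illustration of Eq. (1), the sketch below computes a sequence log-probability as the sum of per-token log-probabilities. The token probabilities are hypothetical numbers chosen for illustration, not outputs of any model in this paper.

```python
import math

def sequence_log_prob(token_probs):
    """Sum per-token log-probabilities, i.e. the log of the product in Eq. (1)."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical conditionals pi_theta(y_t | x, y_<t) for a sequence of length T = 3.
token_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(token_probs)
seq_prob = math.exp(log_p)  # matches 0.5 * 0.25 * 0.8 up to floating-point error
```

Working in log space avoids the numerical underflow that a direct product would hit for long sequences.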

Let the teacher policy be denoted by $\pi^{\mathrm{teach}}(\cdot\mid x)$ , omitting parameters because the teacher is fixed during optimization. Classical on-policy distillation often trains the student to match the teacher distribution through KL divergence. A common formulation minimizes the reverse KL from the student to the teacher:

\min_{\pi_{\theta}}\;\mathbb{E}_{x\sim\mathcal{D}}\left[D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid x)\,\|\,\pi^{\mathrm{teach}}(\cdot\mid x)\right)\right],

(2)

which has been used in recent on-policy distillation methods (Agarwal et al., 2024; Lu and Lab, 2025; Zhao et al., 2026a; Hübotter et al., 2026). Under this objective, the teacher distribution itself is treated as the target to be learned. The limitation is that pure KL matching does not distinguish which teacher-supported responses are more useful for the downstream objective. To address this issue, we instead optimize KL matching together with reward maximization. For a fixed input $x$ , let $r(x,y)$ denote the latent target reward. We consider

\max_{\pi_{\theta}(\cdot\mid x)}\;\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}[r(x,y)]-\beta D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid x)\,\|\,\pi^{\mathrm{teach}}(\cdot\mid x)\right),

(3)

where $\beta>0$ controls the strength of the KL regularization. The KL term keeps the policy close to $\pi^{\mathrm{teach}}$ , while the reward term favors responses with larger latent reward. This is importantly different from classical RLHF: in standard RLHF, the KL regularizer is typically defined with respect to the initial base model as the reference policy, whereas in our setting the reference policy is the teacher policy $\pi^{\mathrm{teach}}$ , which may be either an external teacher or an internal teacher constructed from the same base model. When $\beta$ is large, the KL term dominates and the objective performs only a soft reward-weighted selection over teacher-supported responses; as $\beta\to\infty$ , it recovers pure KL matching. In this sense, pure KL matching is a limiting special case that ignores the reward-dependent importance of each response, whereas Eq. (3) redistributes probability mass within the teacher-supported region according to reward. This distinction is central to our view of on-policy distillation: the teacher should define useful support and inductive bias, but it should not be treated as a final target to be copied uniformly.

Proposition 1 (Optimal reward-tilted policy).

For each fixed input $x$ , the optimizer of Eq. (3) is the reward-tilted teacher distribution

\pi^{\star}(y\mid x)=\frac{\pi^{\mathrm{teach}}(y\mid x)\exp(r(x,y)/\beta)}{Z(x)},\qquad Z(x)=\sum_{y}\pi^{\mathrm{teach}}(y\mid x)\exp(r(x,y)/\beta).

(4)

Equivalently, the latent reward can be written in terms of the optimal policy and the teacher up to an $x$ -dependent additive constant:

r(x,y)=\beta\log\frac{\pi^{\star}(y\mid x)}{\pi^{\mathrm{teach}}(y\mid x)}+\beta\log Z(x).

(5)

The derivation is deferred to Appendix D.1. Thus, the desired policy does not merely copy the teacher. Instead, it takes the teacher distribution as a reference measure and reweights it by the reward-dependent factor $\exp(r(x,y)/\beta)$ . The following proposition formalizes that this reward-tilted solution is never worse than directly using the teacher under the same reward-regularized objective.

Proposition 2 (Reward-tilted policy improves over the teacher).

Let $F$ denote the population version of Eq. (3) aggregated over $x\sim\mathcal{D}$ . Then the optimizer $\pi^{\star}$ in Eq. (4) satisfies

F(\pi^{\star})\geq F(\pi^{\mathrm{teach}}),

(6)

with strict inequality whenever $r(x,y)$ is non-constant over teacher-supported responses on a set of inputs with positive measure.

The proof is deferred to Appendix D.1. This observation is important for on-policy distillation: the teacher is often not an oracle and therefore should not be treated as the final target. Instead, the teacher should serve as a support distribution, while latent reward information reweights probability mass toward more valuable behavior. This perspective matches the motivation in Section 1: on-policy distillation should not uniformly copy the teacher, but should selectively amplify higher-value teacher-induced responses. The remaining question is how to optimize this reward-aware target in practice when the reward itself is unobserved.
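Propositions 1 and 2 can be checked numerically on a toy problem. The sketch below assumes a hypothetical finite response set with three responses, a hand-picked teacher distribution, and hand-picked rewards; it builds the reward-tilted policy of Eq. (4) and evaluates the per-input objective of Eq. (3) for both the tilted policy and the teacher.

```python
import math

def tilt(teacher, rewards, beta):
    """Reward-tilted teacher of Eq. (4): pi*(y) is proportional to pi_teach(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(teacher, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z

def objective(policy, teacher, rewards, beta):
    """Per-input objective of Eq. (3): E_pi[r] - beta * KL(pi || pi_teach)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, teacher) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy values (assumptions for illustration only).
teacher = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

pi_star, z = tilt(teacher, rewards, beta)
f_star = objective(pi_star, teacher, rewards, beta)
f_teacher = objective(teacher, teacher, rewards, beta)  # KL term vanishes here
f_other = objective([0.4, 0.3, 0.3], teacher, rewards, beta)
```

In this example the optimal value equals $\beta\log Z(x)$, which follows from substituting Eq. (5) into Eq. (3), and it strictly exceeds the teacher's value because the rewards are non-constant on the teacher's support.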

2.2 Solving the Objective via Preference-Based Optimization

The previous subsection defines the desired target policy, but that policy depends on a latent reward that is not directly observed. To make the objective optimizable in practice, we instantiate it through preference-based learning. Specifically, we compare a response sampled from the better-conditioned teacher with an on-policy response sampled from the current student, and use the induced preference relation to optimize the student toward the reward-reweighted teacher policy.

Our teacher is a better-conditioned policy rather than a larger external model. For each input $x$ , we assume an additional context $\bm{c}$ that is available only when constructing the teacher signal. In prior self-distillation work, this context is typically instantiated as privileged per-instance information, such as expert demonstrations or reference reasoning traces in reasoning tasks (Zhao et al., 2026a; Sang et al., 2026), tool or API information in structured action-generation settings (Shenfeld et al., 2026), retrieved evidence or search traces in retrieval- or search-augmented settings (Chen et al., 2026), or more general teacher-only feedback signals (Ding, 2026; Hübotter et al., 2026). We denote the resulting teacher by

\pi^{\mathrm{teach}}(y\mid x):=\pi(y\mid x,\bm{c}),

(7)

while the student remains $\pi_{\theta}(y\mid x)$ . The teacher and student may share the same base model; their distinction comes from the conditioning information, not necessarily from model capacity.

Using the optimal-policy identity induced by the reward-regularized KL objective, the latent reward can be represented by the log-ratio between the optimized policy and a reference policy, up to an input-dependent constant. In our setting, we use the better-conditioned teacher as the reference policy and optimize the student. For a preferred response $y^{+}$ and a less preferred response $y^{-}$ , the resulting preference margin is

m_{\theta}(x,y^{+},y^{-})=\beta\left[\log\frac{\pi_{\theta}(y^{+}\mid x)}{\pi^{\mathrm{teach}}(y^{+}\mid x)}-\log\frac{\pi_{\theta}(y^{-}\mid x)}{\pi^{\mathrm{teach}}(y^{-}\mid x)}\right].

(8)

This margin induces, under a Bradley-Terry preference model, the probability that the teacher-generated response is preferred over the student-generated response:

P_{\theta}(y^{+}\succ y^{-}\mid x)=\sigma\!\left(m_{\theta}(x,y^{+},y^{-})\right).

(9)

Here $\sigma(z):=1/(1+\exp(-z))$ denotes the logistic sigmoid function. We maximize the corresponding pairwise log-likelihood,

\max_{\theta}\;\mathbb{E}_{(x,\bm{c})\sim\mathcal{D}}\mathbb{E}_{y^{+}\sim\pi^{\mathrm{teach}}(\cdot\mid x),y^{-}\sim\pi_{\theta}(\cdot\mid x)}\left[\log\sigma\!\left(m_{\theta}(x,y^{+},y^{-})\right)\right].

(10)

Equivalently, the implementation minimizes the negative log-likelihood in Eq. (10). Here $y^{+}$ is generated by the better-conditioned teacher and $y^{-}$ is generated by the current student. Maximizing Eq. (10) increases the relative probability that the student assigns to teacher-generated responses over its own current responses. This online construction makes the preference signal adaptive: the teacher rollout provides an attraction term toward better-supported behavior, while the student rollout exposes current failure modes that should be downweighted. PBSD therefore preserves the logistic structure of DPO, but uses it as a self-distillation mechanism induced by the reward-reweighted objective in Section 2.1.
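To make Eqs. (8) to (10) concrete, here is a minimal sketch of the per-pair loss computed from four sequence log-probabilities. The function names and the numeric inputs are assumptions for illustration; in practice the log-probabilities would come from scoring $y^{+}$ and $y^{-}$ under the student (prompt $x$ only) and under the teacher (prompt $x$ with context $\bm{c}$).

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pbsd_margin(logp_student_pos, logp_teacher_pos,
                logp_student_neg, logp_teacher_neg, beta):
    """Preference margin of Eq. (8) from sequence log-probabilities."""
    pos_ratio = logp_student_pos - logp_teacher_pos
    neg_ratio = logp_student_neg - logp_teacher_neg
    return beta * (pos_ratio - neg_ratio)

def pbsd_loss(margin):
    """Negative log-likelihood of the pair under the Bradley-Terry model, Eq. (9)."""
    return -log_sigmoid(margin)

# Hypothetical log-probabilities: the student currently under-weights the
# teacher sample y+ and over-weights its own sample y-.
margin = pbsd_margin(-12.0, -8.0, -6.0, -9.0, beta=0.1)
loss = pbsd_loss(margin)
```

A negative margin, as in this example, gives a loss above $\log 2$, so such pairs contribute the largest corrective signal.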

The token-level effect of this objective is explicit from its gradient. For a sampled triple $(x_{i},y_{i}^{+},y_{i}^{-})$ , define

\ell_{i}(\theta)=-\log\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right).

(11)

Since the teacher policy is fixed, differentiating only through the student policy gives

\begin{aligned}
\nabla_{\theta}\ell_{i}(\theta)&=-\beta\sigma\!\left(-m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left[\nabla_{\theta}\log\pi_{\theta}(y_{i}^{+}\mid x_{i})-\nabla_{\theta}\log\pi_{\theta}(y_{i}^{-}\mid x_{i})\right]\\
&=-\beta\sigma\!\left(-m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left[\sum_{k=1}^{\|y_{i}^{+}\|}\nabla_{\theta}\log\pi_{\theta}(y_{i,k}^{+}\mid x_{i},y_{i,<k}^{+})-\sum_{k=1}^{\|y_{i}^{-}\|}\nabla_{\theta}\log\pi_{\theta}(y_{i,k}^{-}\mid x_{i},y_{i,<k}^{-})\right].
\end{aligned}

(12)

Thus, gradient descent on $\ell_{i}$ increases the likelihood of teacher-generated tokens while decreasing the likelihood of the student-generated negative sample, with an adaptive weight $\sigma(-m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-}))$ that becomes larger when the current student assigns insufficient relative preference to $y_{i}^{+}$ . The full online training procedure is summarized in Algorithm 1.
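The gradient in Eq. (12) can be sanity-checked against finite differences. The sketch below assumes a toy tabular student, namely a softmax over logits for three hypothetical responses, instead of an autoregressive language model. For that parameterization $\nabla_{\theta}\log\pi_{\theta}(y)$ is the one-hot vector of $y$ minus $\pi_{\theta}$, so the $\pi_{\theta}$ terms cancel in the score gap.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(logits, teacher, pos, neg, beta):
    """Return the loss of Eq. (11) and the margin of Eq. (8) for one pair."""
    student = softmax(logits)
    margin = beta * (math.log(student[pos] / teacher[pos])
                     - math.log(student[neg] / teacher[neg]))
    return -math.log(sigmoid(margin)), margin

def analytic_grad(logits, teacher, pos, neg, beta):
    """Eq. (12) specialized to the tabular softmax student."""
    _, margin = pair_loss(logits, teacher, pos, neg, beta)
    weight = beta * sigmoid(-margin)
    grad = [0.0] * len(logits)
    grad[pos] -= weight  # descent raises the logit of the teacher sample
    grad[neg] += weight  # descent lowers the logit of the student sample
    return grad

def numeric_grad(logits, teacher, pos, neg, beta, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = pair_loss(up, teacher, pos, neg, beta)[0] - pair_loss(down, teacher, pos, neg, beta)[0]
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical toy values.
logits = [0.2, -0.1, 0.4]
teacher = [0.2, 0.5, 0.3]
g_exact = analytic_grad(logits, teacher, pos=1, neg=2, beta=0.5)
g_fd = numeric_grad(logits, teacher, pos=1, neg=2, beta=0.5)
max_gap = max(abs(a - b) for a, b in zip(g_exact, g_fd))
```

The two gradients agree up to finite-difference error, and only the logits of the two sampled responses receive a nonzero update in this tabular setting.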

Algorithm 1 PBSD: Preference-Based Online Self-Distillation

Input: Training set $\mathcal{D}$ of pairs $(x,\bm{c})$; student policy $\pi_{\theta}(y\mid x)$; teacher policy $\pi^{\mathrm{teach}}(y\mid x):=\pi(y\mid x,\bm{c})$; temperature $\beta$; learning rate $\eta$; total training steps $T$

Output: Updated student parameters $\theta$

for $t=1,\dots,T$ do
    Sample a mini-batch $\mathcal{B}=\{(x_{i},\bm{c}_{i})\}_{i=1}^{B}$ from $\mathcal{D}$;
    for each $(x_{i},\bm{c}_{i})\in\mathcal{B}$ do
        Generate a student response $y_{i}^{-}\sim\pi_{\theta}(\cdot\mid x_{i})$;
        Generate a teacher response $y_{i}^{+}\sim\pi^{\mathrm{teach}}(\cdot\mid x_{i})$;
        Compute the PBSD margin $m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})$ using Eq. (8);
        Compute the pairwise loss $\mathcal{L}_{i}=-\log\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)$;
    end for
    Average over the mini-batch: $\mathcal{L}_{\mathrm{PBSD}}=\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{i}$;
    Update the student policy: $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{\mathrm{PBSD}}$;
end for
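The loop in Algorithm 1 can be sketched on a toy problem. The code below is an illustrative assumption and not the authors' implementation: the student is a softmax over logits for three hypothetical responses, the teacher is a fixed categorical distribution standing in for the context-conditioned model, and the gradient uses the tabular form of Eq. (12). All hyperparameter values are hypothetical.

```python
import math
import random

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def train_pbsd(teacher, steps=3000, batch_size=8, beta=1.0, lr=0.05, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(teacher)))
    logits = [0.0] * len(teacher)  # the student starts uniform
    for _ in range(steps):
        student = softmax(logits)
        grad = [0.0] * len(teacher)
        for _ in range(batch_size):
            neg = rng.choices(indices, weights=student)[0]  # y^- from the student
            pos = rng.choices(indices, weights=teacher)[0]  # y^+ from the teacher
            margin = beta * (math.log(student[pos] / teacher[pos])
                             - math.log(student[neg] / teacher[neg]))
            weight = beta * sigmoid(-margin) / batch_size
            grad[pos] -= weight  # gradient of the mini-batch loss
            grad[neg] += weight
        logits = [theta - lr * g for theta, g in zip(logits, grad)]
    return softmax(logits)

teacher = [0.7, 0.2, 0.1]
uniform = [1.0 / 3.0] * 3
student = train_pbsd(teacher)
kl_before = kl(uniform, teacher)
kl_after = kl(student, teacher)
```

No reward signal is observed in this toy setting, so the student is simply pulled from the uniform initialization toward the teacher's support. It illustrates only the sampling and update mechanics, not the reasoning benchmarks studied in the paper.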

3 Theoretical Analysis of PBSD

We analyze online PBSD from a statistical perspective. The positive response is drawn from a fixed context-augmented teacher, while the negative response is drawn on-policy from the current student. Our goal is to understand what kind of teacher-generated positives lead to the most favorable sample complexity in preference alignment. This perspective is closely related to recent theoretical studies of RLHF and pairwise preference learning (Zhu et al., 2024, 2023), but here we specialize the analysis to the online PBSD objective induced by contextual self-distillation. The derivation of the optimization problem can be found in Appendix E.1; here we focus on the sample-complexity bound for our problem and its interpretation.

3.1 Pairwise MLE and Informative Comparisons

We now view the PBSD empirical objective as a pairwise MLE, where the local information in a logistic comparison objective is determined by the Hessian of the empirical negative log-likelihood. For the $i$ -th sampled pair $(x_{i},y_{i}^{+},y_{i}^{-})$ , let $\ell_{i}(\theta)$ denote the sample loss defined in Eq. (11). The empirical negative log-likelihood is

\widehat{\mathcal{L}}_{n}(\theta):=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}(\theta).

The empirical MLE is

\widehat{\theta}_{n}\in\arg\min_{\theta}\widehat{\mathcal{L}}_{n}(\theta).

(13)

To expose the local statistical structure, we only introduce sample-indexed notation for the score-gap direction,

d_{i}(\theta):=\nabla_{\theta}\log\pi_{\theta}(y_{i}^{+}\mid x_{i})-\nabla_{\theta}\log\pi_{\theta}(y_{i}^{-}\mid x_{i}).

Then

\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})=\beta d_{i}(\theta).

The sample loss $\ell_{i}(\theta)$ is generally nonconvex for a neural policy. We therefore focus on its Gauss–Newton component, obtained by differentiating the logistic loss with respect to the margin and keeping the resulting outer-product term. This yields

\nabla_{\theta}^{2}\ell_{i}(\theta)=\beta^{2}\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left(1-\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\right)d_{i}(\theta)d_{i}(\theta)^{\top}.

(14)

Averaging over the $n$ pairs yields the empirical Hessian, or local information matrix,

\widehat{H}_{n}(\theta)=\frac{\beta^{2}}{n}\sum_{i=1}^{n}\underbrace{\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left(1-\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\right)}_{\text{pairwise curvature weight}}d_{i}(\theta)d_{i}(\theta)^{\top}.

(15)

For the original nonlinear policy, Eq. (15) is the Gauss–Newton component of the Hessian. It is the local statistical object that controls the sample complexity of pairwise estimation. The detailed derivations for this subsection are deferred to Appendix E.1.
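As an illustration of Eq. (15), the sketch below assembles the empirical Gauss–Newton matrix for hypothetical two-dimensional score-gap directions and reads off its smallest eigenvalue in closed form. The margins and directions are made-up inputs; in a real model $d_{i}(\theta)$ has one entry per trainable parameter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gauss_newton_hessian(margins, directions, beta):
    """Eq. (15) for 2-d directions; returns a symmetric 2x2 matrix."""
    n = len(margins)
    h00 = h01 = h11 = 0.0
    for m, (d0, d1) in zip(margins, directions):
        s = sigmoid(m)
        w = beta ** 2 * s * (1.0 - s) / n  # pairwise curvature weight, averaged over n
        h00 += w * d0 * d0
        h01 += w * d0 * d1
        h11 += w * d1 * d1
    return [[h00, h01], [h01, h11]]

def min_eigenvalue_2x2(h):
    """Smallest eigenvalue of a symmetric 2x2 matrix."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    return (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)

# Diverse directions span both axes; collapsed directions share one axis.
diverse = gauss_newton_hessian([0.0, 0.0], [(1.0, 0.0), (0.0, 1.0)], beta=1.0)
collapsed = gauss_newton_hessian([0.0, 0.0], [(1.0, 0.0), (1.0, 0.0)], beta=1.0)
lam_diverse = min_eigenvalue_2x2(diverse)
lam_collapsed = min_eigenvalue_2x2(collapsed)
```

With zero margins, the diverse pairs give a smallest eigenvalue of 0.125, while the collapsed pairs give a singular matrix, which is the degenerate case the local analysis needs to exclude.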

Theorem 1 (Local MLE complexity for PBSD).

Suppose that, within a local neighborhood of $\theta^{\star}$ , the pairwise logistic model induced by the PBSD loss is well specified and the Gauss–Newton Hessian in Eq. (15) is locally stable, and define

\theta^{\star}\in\arg\min_{\theta}\,\mathcal{L}(\theta),\qquad\mathcal{L}(\theta):=\mathbb{E}[\ell_{i}(\theta)].

(16)

Assume also that the score-gap features are bounded and that $\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))>0$ . Then, with probability at least $1-\delta$ , the local MLE $\widehat{\theta}_{n}$ satisfies

\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{2}\leq C\sqrt{\frac{d+\log(1/\delta)}{n\,\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))}}.

(17)

Here $d$ is the local parameter dimension and $C>0$ is a constant depending only on the boundedness constants.

The proof is deferred to Appendix E.2, where we also justify why the required local conditions are mild in our setting. The theorem shows that the local sample complexity is governed by the smallest eigenvalue of the empirical information matrix: larger curvature yields a tighter estimation bound. For the Hessian estimate, each comparison pair contributes

\beta^{2}\sigma\!\left(m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left(1-\sigma\!\left(m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-})\right)\right)d_{i}(\theta^{\star})d_{i}(\theta^{\star})^{\top}.

(18)

Context-Augmented Teacher vs. External Teacher.

Thus a useful pair must satisfy two complementary conditions. First, from the perspective of self-distillation, the positive samples should come from a distribution that is more diverse than the current student distribution, so that the induced score-gap directions $d_{i}(\theta^{\star})$ span informative directions beyond those already covered by the student. If the teacher responses collapse onto a narrow region already represented by the student, then the outer-product terms $d_{i}(\theta^{\star})d_{i}(\theta^{\star})^{\top}$ contribute little new geometric information. Second, the logistic curvature weight $\sigma(m_{\theta^{\star}})(1-\sigma(m_{\theta^{\star}}))$ requires the teacher–student gap to remain moderate. It is maximized when $m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-})=0$ , where it equals $1/4$ , and it decays to zero as $|m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-})|$ grows. In contextual self-distillation, this condition is more plausible because the teacher and student are induced from the same base model and therefore remain relatively close in distribution. By contrast, in more general on-policy distillation with an external teacher model, the distribution shift can be substantially larger, which makes overly large margins and saturation more likely. Through Eq. (17), these two properties jointly improve statistical complexity: richer score-gap directions together with moderate margins strengthen the Hessian, increase $\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))$ , and tighten the estimation bound.
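The saturation argument can be illustrated numerically. The sketch below evaluates the curvature weight $\sigma(m)(1-\sigma(m))$ over a few margins and shows how the right-hand side of Eq. (17) scales with $\lambda_{\min}$. The constant `c`, the dimension, the sample size, and the eigenvalues are hypothetical placeholders, since the paper does not specify them.

```python
import math

def curvature_weight(margin):
    """Logistic curvature sigma(m) * (1 - sigma(m))."""
    s = 1.0 / (1.0 + math.exp(-margin))
    return s * (1.0 - s)

def estimation_bound(dim, n, delta, lambda_min, c=1.0):
    """Right-hand side of Eq. (17) with a placeholder constant c."""
    return c * math.sqrt((dim + math.log(1.0 / delta)) / (n * lambda_min))

margins = [0.0, 1.0, 2.0, 5.0, 10.0]
weights = [curvature_weight(m) for m in margins]  # peaks at 1/4 and decays with |m|

# Hypothetical eigenvalues: moderate margins versus saturated margins.
moderate = estimation_bound(dim=16, n=1000, delta=0.05, lambda_min=0.04)
saturated = estimation_bound(dim=16, n=1000, delta=0.05, lambda_min=0.01)
```

Because the bound scales as $1/\sqrt{\lambda_{\min}}$, a fourfold drop in the smallest eigenvalue doubles the bound in this example.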

4 Experiments

In this section, we empirically verify the effectiveness of PBSD across two different tasks. Following previous work, we conduct experiments on mathematical reasoning. In addition, because improving the agentic quality of open-source models is important, we also compare the tool-use capability of the fine-tuned models.

4.1 Experiment Setup

Models and Tasks.

We experiment with instruct-tuned Qwen models (Yang et al., 2025) at three scales: Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. We study two task domains: mathematical reasoning and tool use. Unless otherwise noted, these two domains use the same training pipeline, model initialization, LoRA configuration, optimizer, teacher–student construction, rollout configuration, and checkpoint-selection protocol; they differ only in their evaluation data and metrics. For mathematical reasoning, we use the math subset of OpenThoughts (Guha et al., 2025), sampling up to 30K problem–solution pairs with chain-of-thought reasoning, and report results on AIME 2024, AIME 2025, and HMMT 2025. For tool use, we follow the ToolAlpaca-based setup of Shenfeld et al. (2026), using a 4046-example training split and a 94-example test split. Here, “agentic quality” refers to the model’s ability to map user intent to correct multi-step or action-oriented behavior, a capability that is central to recent work on LLM agents and open-source tool-using systems (Wang et al., 2026b, a). Detailed dataset statistics are provided in Appendix F.1.

Baselines and Metrics.

We compare PBSD with six baselines: SFT, GRPO (Shao et al., 2024), DAPO (Yu et al., 2025a), OPSD (Zhao et al., 2026a), SDFT (Shenfeld et al., 2026), and SRPO (Li et al., 2026a). DAPO is a large-scale RLVR recipe that strengthens GRPO-style training through decoupled clipping and dynamic sampling. SDFT is a demonstration-conditioned self-distillation method that enables on-policy learning directly from demonstrations. SRPO is a hybrid on-policy objective that routes correct samples to GRPO-style reward optimization and failed samples to self-distillation-based correction. The training setup is shared across the two task domains, but the evaluation protocol is task-specific. For mathematical reasoning, we report Avg@12, namely the average accuracy over 12 sampled responses per question. For tool use, we report top-1 accuracy on the test split. Detailed evaluation protocols for both tasks are provided in Appendix F.4.
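For concreteness, an Avg@k metric of this kind can be computed as sketched below. The function name, the percentage scaling, and the example data are assumptions for illustration; the example uses k = 4 for brevity, whereas the paper uses k = 12.

```python
def avg_at_k(correct):
    """Average accuracy over k sampled responses per question, as a percentage.

    `correct` holds one list per question, with one boolean per sampled response.
    """
    per_question = [sum(flags) / len(flags) for flags in correct]
    return 100.0 * sum(per_question) / len(per_question)

# Two hypothetical questions with k = 4 samples each.
samples = [
    [True, True, False, False],  # 2 of 4 samples correct
    [True, True, True, True],    # 4 of 4 samples correct
]
score = avg_at_k(samples)
```

Averaging over samples, unlike a best-of-k metric, rewards models that answer correctly on most samples and not just occasionally.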

Implementation Details.

Across all methods and both task domains, we fine-tune Qwen3 instruct models with LoRA (rank 64 and $\alpha=128$ ) on 8 H100 GPUs using AdamW, bfloat16 precision, gradient checkpointing, and FlashAttention 2. We fix the teacher policy to the initial checkpoint during PBSD training for stability. All trainable methods are trained for 500 steps, evaluated every 50 steps, and we report the peak checkpoint within this fixed budget. Detailed training configurations, including the shared setup and method-specific hyperparameters for SFT, GRPO, DAPO, OPSD, SDFT, SRPO, and PBSD, are provided in Appendix F.3.

4.2 Main Results

Table 2: Performance comparison across mathematical reasoning and tool use for Qwen3 models. Under Math, we report Avg@12 using the sampling configuration recommended in the Qwen3 blog (temperature 1.0, maximum generation length 38k). Under Tool Use, we report top-1 accuracy on the tool-use test set. Numbers in parentheses denote standard deviation over three random seeds. All trainable methods are trained for 500 steps, evaluated every 50 steps, and we report the peak checkpoint within this fixed budget. Within each model scale, the best result in each column is shown in bold, and the second-best result is underlined.

Model	Method	Math AIME24	Math AIME25	Math HMMT25	Math Average	Tool Use Acc.
Qwen3-8B	Base (Instruct)	75.6 (0.5)	65.3 (0.4)	43.6 (0.2)	61.5	61.3 (0.5)
	+ SFT	71.9 (1.5)	63.9 (0.2)	42.4 (0.3)	59.4	61.7 (1.7)
	+ GRPO	76.1 (0.6)	69.0 (0.1)	46.3 (0.5)	63.8	68.8 (2.7)
	+ DAPO	76.0 (0.9)	68.7 (1.0)	46.0 (0.7)	63.6	67.7 (0.5)
	+ OPSD	77.3 (0.3)	70.1 (0.6)	45.1 (0.6)	64.2	65.6 (1.8)
	+ SDFT	75.9 (0.7)	69.1 (0.7)	44.2 (0.6)	63.1	62.8 (1.7)
	+ SRPO	77.1 (0.3)	69.6 (1.5)	43.9 (0.7)	63.5	62.1 (2.5)
	+ PBSD	78.4 (0.3)	71.0 (0.1)	46.1 (0.2)	65.2	72.0 (1.3)
Qwen3-4B	Base (Instruct)	74.5 (0.3)	66.1 (0.2)	42.0 (0.3)	60.9	45.7 (0.9)
	+ SFT	71.3 (0.6)	64.3 (0.8)	43.4 (0.1)	59.7	51.1 (1.7)
	+ GRPO	75.7 (0.1)	67.8 (0.6)	44.4 (0.8)	62.6	58.9 (1.3)
	+ DAPO	76.1 (0.6)	67.7 (0.3)	44.3 (0.7)	62.7	48.2 (2.2)
	+ OPSD	76.3 (0.8)	68.1 (0.3)	46.0 (0.6)	63.5	53.9 (1.0)
	+ SDFT	75.6 (0.3)	67.1 (0.3)	44.7 (0.2)	62.5	41.8 (1.3)
	+ SRPO	76.3 (0.3)	67.8 (0.4)	45.5 (0.3)	63.2	46.1 (2.0)
	+ PBSD	77.3 (0.1)	69.0 (0.1)	45.6 (0.3)	64.0	60.6 (1.7)
Qwen3-1.7B	Base (Instruct)	51.4 (0.2)	37.2 (0.5)	23.1 (0.1)	37.3	36.9 (1.0)
	+ SFT	49.9 (0.3)	36.6 (0.5)	22.8 (0.2)	36.4	39.7 (0.5)
	+ GRPO	50.6 (0.5)	38.4 (0.1)	23.5 (0.3)	37.5	42.9 (0.5)
	+ DAPO	51.5 (0.7)	38.6 (1.4)	23.5 (0.3)	37.9	43.6 (0.9)
	+ OPSD	57.1 (0.8)	43.3 (0.6)	29.3 (0.3)	43.2	37.2 (1.7)
	+ SDFT	57.0 (0.3)	43.0 (0.1)	28.7 (0.3)	42.9	38.3 (1.7)
	+ SRPO	58.1 (0.2)	44.1 (0.3)	29.4 (0.5)	43.9	36.9 (1.3)
	+ PBSD	58.5 (0.5)	44.4 (0.2)	30.0 (0.2)	44.3	44.7 (1.7)

Table 2 reports the main results across both mathematical reasoning and tool use for three Qwen3 model scales. PBSD consistently improves over the base instruct model and achieves the strongest average result at all three model scales. On Qwen3-8B, PBSD reaches a math average of $65.2$ , improving over the base instruct model by $3.7$ points and over the strongest baseline OPSD by $1.0$ point. On Qwen3-4B, PBSD attains a math average of $64.0$ , again outperforming both the base model ( $60.9$ ) and OPSD ( $63.5$ ). At the 1.7B scale, PBSD delivers the best result on all three math benchmarks and reaches the strongest overall average of $44.3$ , ahead of SRPO ( $43.9$ ) and OPSD ( $43.2$ ). On tool use, PBSD also achieves the strongest result at every model scale, reaching $72.0$ , $60.6$ , and $44.7$ for Qwen3-8B, Qwen3-4B, and Qwen3-1.7B, respectively. These results show that pairwise self-distillation with a contextual teacher provides a more effective optimization signal than directly matching the teacher distribution or relying on pure reward optimization.

Compared with prior baselines, the gains from PBSD are clearest in overall consistency across tasks and scales. On Qwen3-8B, PBSD achieves the best result on AIME24, AIME25, the overall math average, and tool use, while remaining second-best on HMMT25 behind GRPO. On Qwen3-4B, PBSD again gives the strongest average result, with consistent improvements on AIME24, AIME25, and tool use, although OPSD remains slightly better on HMMT25. At the 1.7B scale, PBSD surpasses both OPSD and SRPO across all three math benchmarks and also gives the best tool-use result. Taken together, these trends suggest that PBSD is particularly effective when the student can benefit from teacher-induced positive samples without sacrificing exploration, while still preserving the token-efficiency advantages of self-distillation.

4.3 Ablation Studies

Stable during training.

Figure 2A shows that PBSD does not degenerate as training proceeds. Unlike OPSD, which peaks early and then declines, PBSD continues to improve and achieves the strongest final AIME25 performance.

Token efficiency.

Figure 2B shows that PBSD is token-efficient throughout training. It reaches high final performance without requiring the very large token budget used by RL-based optimization.

We provide additional ablations in Appendix F to further compare prompt-level gains from expert demonstrations and training-time gains from OPSD and PBSD, summarized as follows:

Expert demonstrations correct base-model errors.

Table 3 shows a case-by-case study on 30 selected AIME25 problems. While directly injecting expert demonstrations yields near-perfect problem-level performance, both OPSD and PBSD also recover a substantial portion of the base model’s majority-vote errors, indicating that self-distillation can absorb useful supervision from the privileged teacher signal.

PBSD preserves exploration.

Table 4 compares student and teacher accuracy together with completion lengths from the base student, the demonstration-conditioned teacher, OPSD, and PBSD. In particular, it lets us check whether PBSD preserves a more moderate completion length than the teacher and OPSD while still improving final accuracy.

A fixed teacher is sufficient.

Table 5 shows that keeping the teacher fixed at initialization is already effective. Updating the teacher every 5 steps does not provide a clear benefit, suggesting that a stable teacher signal is more important than a rapidly refreshed reference in our setting.

5 Conclusion

In this work, we revisited on-policy self-distillation through a reward-regularized lens, establishing that direct KL matching to a context-augmented teacher fundamentally limits both stability and reward awareness. To overcome these bottlenecks, we proposed PBSD, a preference-based self-distillation framework that learns a reward-reweighted target policy rather than uniformly imitating the teacher. We also provided a statistical analysis showing the conditions under which contextual self-distillation outperforms distillation from an external teacher. Empirically, across multiple model scales on mathematical reasoning and tool-use benchmarks, PBSD consistently achieved the strongest average performance. It combines the token efficiency of standard self-distillation with significantly improved training stability. These results highlight reward-aware self-distillation as a promising paradigm for the scalable and stable post-training of reasoning-oriented LLMs.

References

A. Afsharrad, A. Abedsoltan, A. Moradipari, and S. Lall (2026) On-policy distillation of language models for autonomous vehicle motion planning. arXiv preprint arXiv:2604.07944. Cited by: Appendix C, §1.
R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, Cited by: Appendix C, §1, §2.1.
F. Bach (2010) Self-concordant analysis for logistic regression. Electronic Journal of Statistics 4, pp. 384–414. External Links: Document Cited by: §E.2, §E.2.
J. Bai, X. Yu, M. Xu, W. Lu, X. Pan, K. Maeng, D. Kifer, J. Wang, and Y. Wang (2025) Towards better optimization for listwise preference in diffusion models. arXiv preprint arXiv:2510.01540. Cited by: §F.3.
B. Chen, S. Wang, Y. Ma, Z. Liang, X. Zhang, Y. Lv, Y. Yang, H. Dai, L. Mao, T. Zhao, et al. (2026) OneSearch-v2: the latent reasoning enhanced self-distillation generative search framework. arXiv preprint arXiv:2603.24422. Cited by: Appendix C, §2.2.
K. Ding (2026) HDPO: hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871. Cited by: §2.2.
Y. Fu, H. Huang, K. Jiang, Y. Zhu, and Z. Liu (2026) Revisiting on-policy distillation: empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562. Cited by: Appendix C, §1, §1.
E. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal, M. Nezhurina, J. Mercat, T. Vu, Z. Sprague, et al. (2025) Openthoughts: data recipes for reasoning models. arXiv preprint arXiv:2506.04178. Cited by: §F.1, §4.1.
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models. ICLR 1 (2), pp. 3. Cited by: §F.3.
J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802. Cited by: Appendix C, §1, §2.1, §2.2.
J. Hübotter, F. Lübeck, L. D. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2025) Test-time self-distillation. arXiv preprint arXiv:2502.07750. Cited by: Appendix C.
J. Kim, X. Luo, M. Kim, S. Lee, D. Kim, J. Jeon, D. Li, and Y. Yang (2026) Why does self-distillation (sometimes) degrade the reasoning capability of llms?. arXiv preprint arXiv:2603.24472. Cited by: Appendix C, §1.
G. Li, T. Yang, J. Fang, M. Song, M. Zheng, H. Guo, D. Zhang, J. Wang, and T. Chua (2026a) Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288. Cited by: Appendix C, Appendix C, Appendix C, §1, §4.1.
Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, and N. Ding (2026b) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Cited by: Appendix C, §1, §1.
K. Lu and T. M. Lab (2025) On-policy distillation. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/on-policy-distillation External Links: Document Cited by: §1, §2.1.
S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012) A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science 27 (4), pp. 538–557. External Links: Document Cited by: §E.2, §E.2.
R. Y. Pang, W. Yuan, K. Cho, H. He, S. Sukhbaatar, and J. Weston (2024) Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733. Cited by: Appendix C.
B. Qi, P. Li, F. Li, J. Gao, K. Zhang, and B. Zhou (2024) Online dpo: online direct preference optimization with fast-slow chasing. arXiv preprint arXiv:2406.05534. Cited by: Appendix C.
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems, Cited by: Appendix C.
H. Sang, Y. Xu, Z. Zhou, R. He, Z. Wang, and J. Sun (2026) On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433. Cited by: Appendix C, §1, §2.2.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, and D. Guo (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §4.1.
I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026) Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897. Cited by: Appendix C, §F.1, §2.2, §4.1, §4.1.
R. Shi, R. Zhou, and S. S. Du (2025) The crucial role of samplers in online direct preference optimization. In International Conference on Learning Representations, Cited by: Appendix C.
M. Song and M. Zheng (2026) A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626. Cited by: Appendix C, §1.
S. Tu, J. Lin, X. Tian, Q. Zhang, L. Li, Y. Fu, N. Xu, W. He, X. Lan, D. Jiang, and D. Zhao (2025) Enhancing llm reasoning with iterative dpo: a comprehensive empirical investigation. In Conference on Language Modeling, Cited by: Appendix C.
A. W. van der Vaart (2000) Asymptotic statistics. Cambridge University Press. Cited by: §E.2, §E.2.
C. Wang, Y. Wang, W. Zheng, Y. Li, Y. Ye, X. Wang, and J. Yang (2026a) Skill-sd: skill-conditioned self-distillation for multi-turn llm agents. arXiv preprint arXiv:2604.10674. Cited by: §1, §4.1.
Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang (2026b) Openclaw-rl: train any agent simply by talking. arXiv preprint arXiv:2603.10165. Cited by: §1, §4.1.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §4.1.
C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan (2026) Self-distilled rlvr. arXiv preprint arXiv:2604.03128. Cited by: Appendix C, Appendix C, §1.
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, and J. Chen (2025a) DAPO: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. Cited by: Appendix C, §4.1.
X. Yu, Y. Wang, J. Chen, and L. Xue (2025b) Altlora: towards better gradient approximation in low-rank adaptation with alternating projections. arXiv preprint arXiv:2505.12455. Cited by: §F.3.
X. Yu, C. Xie, Z. Zhao, T. Fan, L. Xue, and Z. Zhang (2025c) PrunedLoRA: robust gradient-based structured pruning for low-rank adaptation in fine-tuning. arXiv preprint arXiv:2510.00192. Cited by: §F.3.
X. Yu, H. Xing, and L. Xue (2026) EXACT: explicit attribute-guided decoding-time personalization. arXiv preprint arXiv:2602.17695. Cited by: §F.3.
R. Zhang, R. H. Bai, H. Zheng, N. Jaitly, R. Collobert, and Y. Zhang (2026) Embarrassingly simple self-distillation improves code generation. arXiv preprint arXiv:2604.01193. Cited by: Appendix C.
S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026a) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Cited by: Appendix C, Appendix C, §F.3, §F.3, Appendix F, §1, §2.1, §2.2, §4.1.
Z. Zhao, Y. Zhou, X. Yu, Z. Zhang, D. Zhu, T. Shen, Z. Li, J. Yang, X. Wang, J. Su, et al. (2026b) Each rank could be an expert: single-ranked mixture of experts lora for multi-task learning. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pp. 1998–2009. Cited by: §F.3.
B. Zhu, M. I. Jordan, and J. Jiao (2023) Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the 40th International Conference on Machine Learning, pp. 43037–43067. Cited by: Appendix C, §3.
B. Zhu, M. I. Jordan, and J. Jiao (2024) Iterative data smoothing: mitigating reward overfitting and overoptimization in rlhf. arXiv preprint arXiv:2401.16335. Cited by: Appendix C, §3.

Appendix A Limitations

The main limitation of our setting is that PBSD still depends on the quality of the contextual teacher signal. When the additional privileged context is weak or only marginally informative, the induced teacher can also be weak, and the benefit of self-distillation is correspondingly reduced. At the same time, this is precisely the core challenge that PBSD is designed to address: in contextual self-distillation, the teacher is often not a strong oracle but only a better-conditioned version of the same base model, so the goal is not to copy it uniformly, but to extract the more valuable part of its signal through reward-aware reweighting. In this sense, the limitation is not separate from our problem formulation; rather, it defines the regime in which PBSD is intended to improve over pure KL-based teacher matching.

Appendix B Appendix Overview

This appendix is organized to mirror the logic of the main paper from background, to derivation, to theory, and finally to experimental detail. Appendix C expands the literature discussion that is abbreviated in the main text. It is divided into three thematic parts: on-policy distillation, self-distillation, and RLHF/DPO-style preference optimization. The goal of this section is to situate PBSD relative to dense teacher-based post-training, contextual self-distillation, and reward-aware preference optimization.

Appendix D provides the derivations that support the methodology section. Section D.1 contains two components: first, the derivation of the optimal reward-tilted teacher distribution used in Proposition 1; second, the proof that this reward-tilted solution is no worse than directly using the teacher under the same reward-regularized objective, corresponding to Proposition 2. This appendix therefore justifies the objective-level motivation for replacing pure KL matching with reward-aware reweighting.

Appendix E develops the technical details behind the statistical analysis in the main text. Section E.1 derives the sample-level gradient and the local Hessian form of the online PBSD objective, which are the ingredients needed to interpret PBSD as a pairwise logistic-style estimation problem. Section E.2 then states the local assumptions used in the analysis, explains why these assumptions are mild in our setting, discusses what happens when they are violated, and finally proves the local MLE-style error bound for the induced learning problem.

Appendix F collects the detailed experimental material that supports Section 4 of the main paper. Section F.1 describes the two task domains, namely mathematical reasoning and tool use, together with the corresponding training and evaluation splits. The data-visualization subsection then gives concrete examples of the processed mathematical reasoning and tool-use instances so that the structure of the training data is explicit. Section F.3 documents the shared training setup and the method-specific configurations for SFT, GRPO, OPSD, and PBSD. Section F.4 specifies the decoding and evaluation protocols for both mathematical reasoning and tool use. The case-by-case study subsection explains the problem-level analysis on 30 selected AIME25 questions. The capability-gain subsection studies how expert demonstrations affect correctness and completion length on hard AIME24 examples, with particular emphasis on whether PBSD preserves exploratory reasoning behavior. The teacher-update-frequency subsection reports the fixed-teacher versus refreshed-teacher ablation. Finally, the count-version subsection records the run-level counts underlying the main results table so that the reported averages and variances can be traced back to the underlying runs.

Appendix C Appendix for Related Work

On-policy distillation. OPD trains the student on its own sampled trajectories while querying a stronger teacher on those visited states, reducing the mismatch of off-policy distillation [Agarwal et al., 2024]. This paradigm has become an important post-training recipe for LLMs because it keeps the on-policy nature of RL-style optimization while replacing sparse outcome rewards with dense token-level supervision [Song and Zheng, 2026, Fu et al., 2026]. Recent extensions apply this idea to reasoning and RLVR settings, including contextual on-policy supervision [Zhao et al., 2026a], self-distilled RLVR [Yang et al., 2026], sample-routing formulations that connect distillation and group-relative optimization [Li et al., 2026a], and domain-specific adaptations such as autonomous driving [Afsharrad et al., 2026]. However, most OPD methods still depend on an additional large teacher model throughout training, which increases online inference cost and system complexity. They also suffer from teacher–student matching issues, including incompatible vocabularies, output spaces, and large distribution gaps that can make token-level imitation brittle [Li et al., 2026b, Song and Zheng, 2026]. These limitations motivate self-distillation methods that seek to preserve dense on-policy supervision without relying on an external teacher.

Self-distillation. Recent self-distillation methods remove the need for an external teacher by reusing the same base model as both teacher and student, typically constructing the teacher through additional context, search traces, or stronger decoding-time signals. Representative examples include OPSD for reasoning [Zhao et al., 2026a], self-distillation fine-tuning and continual-learning style methods such as SDFT [Shenfeld et al., 2026, Sang et al., 2026], reinforcement-oriented variants such as Reinforcement Learning via Self-Distillation [Hübotter et al., 2026], and application-focused extensions to code generation and search [Zhang et al., 2026, Chen et al., 2026]. Despite their differences, these approaches are still largely centered on KL-based distribution matching: some adopt a forward-KL style objective that fits the student to teacher-provided token distributions, while others use reverse-KL style matching that treats the teacher-induced policy as the target support. Recent work has also started to analyze the limitations of this recipe. Kim et al. [Kim et al., 2026] show that self-distillation can sometimes harm reasoning performance, and Hübotter et al. [Hübotter et al., 2025] explore self-distillation at test time rather than as a fully reward-aware training objective. More recent work moves this line closer to RLHF: SDFT emphasizes on-policy learning from demonstrations for continual learning [Shenfeld et al., 2026]; Li et al. [Li et al., 2026a] connect self-distillation with group-relative optimization through sample routing; and Yang et al. [Yang et al., 2026] modify the RLVR pipeline with self-distilled advantage-style signals. However, these extensions still largely treat self-distillation as a KL-anchored policy matching problem and do not directly formulate how latent reward should reweight the teacher distribution itself.
As a result, how to extend self-distillation from pure KL matching to a genuinely reward-aware objective, where reward optimization is incorporated explicitly rather than only through heuristic weighting or advantage shaping, remains an open question.

RLHF and DPO-style preference optimization. A parallel line of work studies preference optimization more broadly through RLHF and DPO-style objectives. Foundational analyses of RLHF and preference learning characterize the role of the reference policy and the induced reward-regularized target distribution [Zhu et al., 2023, Rafailov et al., 2023, Zhu et al., 2024]. On the algorithmic side, recent work has explored iterative and online variants of DPO as a lower-cost alternative to RL. Iterative Reasoning Preference Optimization [Pang et al., 2024] shows that repeatedly regenerating reasoning trajectories and re-optimizing preference pairs can substantially improve reasoning, especially when combined with an auxiliary NLL term. Tu et al. [Tu et al., 2025] further provide a comprehensive empirical study of iterative DPO for reasoning, arguing that multi-round DPO together with iterative reward-model refinement can approach RL-level performance at lower computational cost. In the RLVR setting, DAPO [Yu et al., 2025a] scales GRPO-style training with decoupled clipping and dynamic sampling, while SRPO [Li et al., 2026a] explicitly combines group-relative reward optimization with self-distillation through sample routing. In a more explicitly online setting, OFS-DPO and COFS-DPO [Qi et al., 2024] study online DPO under streaming or cross-domain preference updates, emphasizing continual adaptation and regret-based analysis. Complementarily, Shi et al. [Shi et al., 2025] analyze the optimization behavior of online DPO and show that sampler design has a decisive effect on convergence rates, with stronger online samplers yielding faster convergence both theoretically and empirically. 
Compared with these lines of work, our focus is different: rather than studying DPO primarily as a general preference-optimization alternative to RLHF, we use the RLHF perspective to revisit on-policy self-distillation and derive a reward-aware target policy tailored to the contextual teacher setting.

Appendix D Appendix for Methodology

This section collects the proofs and derivations corresponding to the methodology section in the main text. Section D.1 derives the reward-tilted optimal policy under the reward-regularized objective and proves that this optimum improves over the teacher under the same objective.

D.1 Details for the Reward-Regularized Distillation Objective

Derivation of Proposition 1.

For notational simplicity, fix $x$ and write $\pi_{y}=\pi(y\mid x)$ , $\pi^{\mathrm{teach}}_{y}=\pi^{\mathrm{teach}}(y\mid x)$ , and $r_{y}=r(x,y)$ . The single-input objective in Eq. (3) becomes

\max_{\pi\in\Delta}\left\{\sum_{y}\pi_{y}r_{y}-\beta\sum_{y}\pi_{y}\log\frac{\pi_{y}}{\pi_{y}^{\mathrm{teach}}}\right\},

(19)

where $\Delta$ denotes the probability simplex. Since the objective is strictly concave in $\pi$ whenever $\beta>0$ , the optimizer is unique and can be obtained from the first-order optimality conditions. Introducing a Lagrange multiplier $\lambda$ for the normalization constraint $\sum_{y}\pi_{y}=1$ , the Lagrangian is

\mathcal{J}(\pi,\lambda)=\sum_{y}\pi_{y}r_{y}-\beta\sum_{y}\pi_{y}\log\frac{\pi_{y}}{\pi^{\mathrm{teach}}_{y}}+\lambda\left(\sum_{y}\pi_{y}-1\right).

(20)

Differentiating with respect to each coordinate $\pi_{y}$ gives

\frac{\partial\mathcal{J}}{\partial\pi_{y}}=r_{y}-\beta\left(\log\frac{\pi_{y}}{\pi^{\mathrm{teach}}_{y}}+1\right)+\lambda=0.

(21)

Here we used the identity

\frac{\partial}{\partial\pi_{y}}\left(\pi_{y}\log\frac{\pi_{y}}{\pi_{y}^{\mathrm{teach}}}\right)=\log\frac{\pi_{y}}{\pi_{y}^{\mathrm{teach}}}+1.

Rearranging Eq. (21) yields

\log\frac{\pi_{y}}{\pi_{y}^{\mathrm{teach}}}=\frac{r_{y}+\lambda-\beta}{\beta},

(22)

and exponentiating both sides gives

\pi_{y}=\pi^{\mathrm{teach}}_{y}\exp(r_{y}/\beta)\exp((\lambda-\beta)/\beta).

(23)

The last exponential factor is independent of $y$ , so all coordinates share the same proportionality constant. To determine it, impose the normalization constraint:

1=\sum_{y}\pi_{y}=\exp((\lambda-\beta)/\beta)\sum_{y}\pi_{y}^{\mathrm{teach}}\exp(r_{y}/\beta).

Therefore,

\exp((\lambda-\beta)/\beta)=\left(\sum_{y}\pi_{y}^{\mathrm{teach}}\exp(r_{y}/\beta)\right)^{-1}=\frac{1}{Z(x)}.

Substituting this back gives

\pi_{y}=\frac{\pi_{y}^{\mathrm{teach}}\exp(r_{y}/\beta)}{Z(x)},

which is exactly the reward-tilted teacher distribution in Eq. (4). This derivation makes explicit that the teacher policy serves only as a reference measure, while the reward term changes the final target by exponentially reweighting the teacher support according to $r_{y}$ .
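As a numerical sanity check on this derivation, the following Python sketch evaluates the closed form on a three-response toy problem and compares its objective value against two other candidate policies. The teacher probabilities, rewards, and $\beta$ are made-up values, and the helper names `reward_tilted` and `objective` are hypothetical; nothing here is taken from the paper's implementation.

```python
import math

def reward_tilted(teacher_probs, rewards, beta):
    """Return pi*(y) = pi_teach(y) * exp(r_y / beta) / Z for one fixed input x."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the common shift cancels after normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta)
               for p, r in zip(teacher_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(pi, teacher_probs, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_teach), as in Eq. (19)."""
    value = 0.0
    for p, q, r in zip(pi, teacher_probs, rewards):
        if p > 0.0:  # terms with p = 0 contribute nothing
            value += p * r - beta * p * math.log(p / q)
    return value

# Toy values, assumed purely for illustration.
teacher = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 1.0]
beta = 0.5

pi_star = reward_tilted(teacher, rewards, beta)
uniform = [1.0 / 3.0] * 3

value_star = objective(pi_star, teacher, rewards, beta)
value_teacher = objective(teacher, teacher, rewards, beta)
value_uniform = objective(uniform, teacher, rewards, beta)

# Closed-form optimum value: beta * log Z with Z = sum_y pi_teach(y) exp(r_y / beta).
log_z = math.log(sum(q * math.exp(r / beta) for q, r in zip(teacher, rewards)))
```

In this toy setting the tilted policy moves probability mass away from the zero-reward response while staying on the teacher's support, and its objective value matches $\beta\log Z(x)$ up to floating-point error.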

Proof of Proposition 2.

We first compute the value of the reward-regularized objective at the optimizer $\pi^{\star}$ . For a fixed $x$ , substituting

\pi^{\star}(y\mid x)=\frac{\pi^{\mathrm{teach}}(y\mid x)\exp(r(x,y)/\beta)}{Z(x)}

into the KL term gives

\log\frac{\pi^{\star}(y\mid x)}{\pi^{\mathrm{teach}}(y\mid x)}=\frac{r(x,y)}{\beta}-\log Z(x).

Hence

	$\displaystyle\mathbb{E}_{y\sim\pi^{\star}(\cdot\mid x)}[r(x,y)]-\beta D_{\mathrm{KL}}\!\left(\pi^{\star}(\cdot\mid x)\,\|\,\pi^{\mathrm{teach}}(\cdot\mid x)\right)$
	$\displaystyle=\sum_{y}\pi^{\star}(y\mid x)r(x,y)-\beta\sum_{y}\pi^{\star}(y\mid x)\left(\frac{r(x,y)}{\beta}-\log Z(x)\right)$
	$\displaystyle=\sum_{y}\pi^{\star}(y\mid x)r(x,y)-\sum_{y}\pi^{\star}(y\mid x)r(x,y)+\beta\log Z(x)\sum_{y}\pi^{\star}(y\mid x)$
	$\displaystyle=\beta\log Z(x).$

Aggregating over $x\sim\mathcal{D}$ gives

F(\pi^{\star})=\mathbb{E}_{x\sim\mathcal{D}}\left[\beta\log Z(x)\right]=\mathbb{E}_{x\sim\mathcal{D}}\left[\beta\log\sum_{y}\pi^{\mathrm{teach}}(y\mid x)\exp(r(x,y)/\beta)\right].

(24)

In contrast, evaluating the teacher itself under the same objective removes the KL term because $D_{\mathrm{KL}}(\pi^{\mathrm{teach}}\|\pi^{\mathrm{teach}})=0$ . Therefore,

F(\pi^{\mathrm{teach}})=\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi^{\mathrm{teach}}(\cdot\mid x)}\left[r(x,y)\right].

(25)

To compare the two values, fix $x$ and write

Z(x)=\mathbb{E}_{y\sim\pi^{\mathrm{teach}}(\cdot\mid x)}\left[\exp(r(x,y)/\beta)\right].

Since $\log(\cdot)$ is concave, Jensen’s inequality implies

\beta\log\mathbb{E}_{y\sim\pi^{\mathrm{teach}}(\cdot\mid x)}\left[\exp(r(x,y)/\beta)\right]\geq\mathbb{E}_{y\sim\pi^{\mathrm{teach}}(\cdot\mid x)}\left[r(x,y)\right],

(26)

where equality holds only when $r(x,y)$ is constant over the support of $\pi^{\mathrm{teach}}(\cdot\mid x)$ . Combining Eq. (24), Eq. (25), and Eq. (26) yields

F(\pi^{\star})\geq F(\pi^{\mathrm{teach}}).

Thus, under the same reward-regularized objective, the reward-tilted policy is never worse than directly using the teacher. The inequality is strict whenever the reward varies over teacher-supported responses, which is precisely the regime in which uniform teacher matching is suboptimal.
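The gap $\beta\log Z(x)-\mathbb{E}_{y\sim\pi^{\mathrm{teach}}}[r(x,y)]$ can also be inspected numerically. The sketch below, written for a single fixed input with assumed toy values, computes this gap for a reward that varies over the teacher support and for a constant reward; the function names `log_partition` and `improvement_gap` are hypothetical.

```python
import math

def log_partition(teacher_probs, rewards, beta):
    """log Z(x) = log sum_y pi_teach(y) exp(r_y / beta), via log-sum-exp."""
    terms = [math.log(p) + r / beta
             for p, r in zip(teacher_probs, rewards) if p > 0.0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def improvement_gap(teacher_probs, rewards, beta):
    """beta * log Z(x) minus the teacher's expected reward (nonnegative by Jensen)."""
    teacher_value = sum(p * r for p, r in zip(teacher_probs, rewards))
    return beta * log_partition(teacher_probs, rewards, beta) - teacher_value

# Toy values, assumed purely for illustration.
teacher = [0.5, 0.3, 0.2]
varying_reward = [0.0, 1.0, 1.0]
constant_reward = [1.0, 1.0, 1.0]

# Gap for several regularization strengths when the reward varies.
gaps = {beta: improvement_gap(teacher, varying_reward, beta)
        for beta in (0.1, 0.5, 2.0)}

# Gap when the reward is constant on the teacher support (equality case).
flat_gap = improvement_gap(teacher, constant_reward, 0.5)
```

On this example the gap is strictly positive for every tested $\beta$ when the reward varies, and it vanishes up to floating-point error when the reward is constant, which is the equality case of Eq. (26).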

Appendix E Appendix for Theoretical Analysis

This section provides the technical details that support the theoretical analysis in the main text. Section E.1 derives the sample-level gradient and the local Hessian form of the online PBSD objective. Section E.2 proves the local MLE complexity result and formalizes the statistical interpretation based on informative comparison pairs.

At a high level, the question studied in this appendix is the following: in online PBSD, where positives are drawn from a fixed context-augmented teacher and negatives are drawn on-policy from the current student, what properties of the induced comparison pairs lead to the strongest statistical guarantee for estimating the target policy? Our analysis shows that the answer depends on both the diversity of the teacher-supported directions and the moderation of the induced preference margins.

E.1 Details for Online PBSD Objective

Gradient derivation for Eq. (12).

Since the teacher policy $\pi^{\mathrm{teach}}(\cdot\mid x)$ is fixed, differentiating the PBSD margin in Eq. (8) gives

\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})=\beta\left(\nabla_{\theta}\log\pi_{\theta}(y_{i}^{+}\mid x_{i})-\nabla_{\theta}\log\pi_{\theta}(y_{i}^{-}\mid x_{i})\right)=\beta d_{i}(\theta).

(27)

By the chain rule,

\nabla_{\theta}\ell_{i}(\theta)=\frac{d}{dm}\bigl[-\log\sigma(m)\bigr]\Big|_{m=m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})}\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-}).

(28)

Using $\sigma^{\prime}(m)=\sigma(m)(1-\sigma(m))$ , we obtain

\frac{d}{dm}\bigl[-\log\sigma(m)\bigr]=-\frac{\sigma^{\prime}(m)}{\sigma(m)}=-(1-\sigma(m))=-\sigma(-m).

(29)

Substituting Eq. (27) and Eq. (29) into Eq. (28) yields

\nabla_{\theta}\ell_{i}(\theta)=-\beta\sigma\!\left(-m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)d_{i}(\theta),

which is exactly the sample-level form underlying Eq. (12).
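The gradient formula above can be checked against finite differences on a small example. The sketch below assumes a tabular softmax policy with one logit per response, and assumes that the margin takes the form $\beta$ times the difference of student-to-teacher log-ratios at $y^{+}$ and $y^{-}$, which is consistent with Eq. (27) but is not copied from Eq. (8). All numerical values and function names are illustrative.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a tabular softmax policy (one logit per response)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def margin(theta, teacher_logp, pos, neg, beta):
    """Assumed margin: beta * (log-ratio at y+ minus log-ratio at y-)."""
    logp = log_softmax(theta)
    return beta * ((logp[pos] - teacher_logp[pos])
                   - (logp[neg] - teacher_logp[neg]))

def pair_loss(theta, teacher_logp, pos, neg, beta):
    """Pairwise logistic loss -log sigma(m)."""
    return -math.log(sigmoid(margin(theta, teacher_logp, pos, neg, beta)))

def analytic_grad(theta, teacher_logp, pos, neg, beta):
    """-beta * sigma(-m) * d; for tabular softmax, d = e_pos - e_neg."""
    m = margin(theta, teacher_logp, pos, neg, beta)
    d = [0.0] * len(theta)
    d[pos] += 1.0
    d[neg] -= 1.0
    return [-beta * sigmoid(-m) * di for di in d]

def numeric_grad(theta, teacher_logp, pos, neg, beta, eps=1e-6):
    """Central finite differences of the pairwise loss."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = (pair_loss(up, teacher_logp, pos, neg, beta)
                - pair_loss(down, teacher_logp, pos, neg, beta))
        grad.append(diff / (2.0 * eps))
    return grad

# Toy values, assumed purely for illustration.
theta = [0.2, -0.4, 1.0]
teacher_logp = log_softmax([1.5, 0.0, 0.3])
beta = 0.5

g_analytic = analytic_grad(theta, teacher_logp, pos=0, neg=2, beta=beta)
g_numeric = numeric_grad(theta, teacher_logp, pos=0, neg=2, beta=beta)
max_abs_diff = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
```

Because the teacher log-probabilities enter the margin only as constants, they shift the weight $\sigma(-m)$ but do not contribute their own gradient term, matching the role of the fixed teacher in Eq. (27).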

Local Hessian derivation for Eq. (15).

Write

\phi(m):=-\log\sigma(m)=\log(1+\exp(-m)).

Its derivatives are

\phi^{\prime}(m)=-\sigma(-m),\qquad\phi^{\prime\prime}(m)=\sigma(m)\sigma(-m)=\sigma(m)(1-\sigma(m)).

By the chain rule,

	$\displaystyle\nabla_{\theta}^{2}\ell_{i}(\theta)=\phi^{\prime\prime}\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})^{\top}$
	$\displaystyle\quad+\phi^{\prime}\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\nabla_{\theta}^{2}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-}).$

The first term is the Gauss–Newton component. Using

\nabla_{\theta}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})=\beta d_{i}(\theta),

and dropping the second-order term $\nabla_{\theta}^{2}m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})$ yields the approximation

\nabla_{\theta}^{2}\ell_{i}(\theta)\approx\beta^{2}\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\left(1-\sigma\!\left(m_{\theta}(x_{i},y_{i}^{+},y_{i}^{-})\right)\right)d_{i}(\theta)d_{i}(\theta)^{\top}.

Averaging over the $n$ pairs yields Eq. (15).
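To make the Gauss–Newton form concrete, the sketch below compares it with a finite-difference Hessian of the pairwise loss for a single pair. It assumes a tabular softmax policy, for which $\nabla_{\theta}^{2}\log\pi_{\theta}(y\mid x)$ does not depend on $y$, so the dropped term $\nabla_{\theta}^{2}m_{\theta}$ cancels and the Gauss–Newton form is exact; for a general network it remains only an approximation. The margin form, numerical values, and function names are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a tabular softmax policy (one logit per response)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def margin(theta, teacher_logp, pos, neg, beta):
    """Assumed margin: beta * (log-ratio at y+ minus log-ratio at y-)."""
    logp = log_softmax(theta)
    return beta * ((logp[pos] - teacher_logp[pos])
                   - (logp[neg] - teacher_logp[neg]))

def pair_loss(theta, teacher_logp, pos, neg, beta):
    """Pairwise logistic loss -log sigma(m)."""
    return -math.log(sigmoid(margin(theta, teacher_logp, pos, neg, beta)))

def gauss_newton(theta, teacher_logp, pos, neg, beta):
    """beta^2 * sigma(m) * (1 - sigma(m)) * d d^T with d = e_pos - e_neg."""
    m = margin(theta, teacher_logp, pos, neg, beta)
    d = [0.0] * len(theta)
    d[pos] += 1.0
    d[neg] -= 1.0
    w = beta ** 2 * sigmoid(m) * sigmoid(-m)
    return [[w * da * db for db in d] for da in d]

def numeric_hessian(theta, teacher_logp, pos, neg, beta, eps=1e-3):
    """Second-order central differences of the pairwise loss."""
    k = len(theta)

    def shifted(sa, a, sb, b):
        t = list(theta)
        t[a] += sa
        t[b] += sb
        return pair_loss(t, teacher_logp, pos, neg, beta)

    hess = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            hess[a][b] = (shifted(eps, a, eps, b) - shifted(eps, a, -eps, b)
                          - shifted(-eps, a, eps, b)
                          + shifted(-eps, a, -eps, b)) / (4.0 * eps * eps)
    return hess

# Toy values, assumed purely for illustration.
theta = [0.2, -0.4, 1.0]
teacher_logp = log_softmax([1.5, 0.0, 0.3])
beta = 0.5

h_gn = gauss_newton(theta, teacher_logp, pos=0, neg=2, beta=beta)
h_num = numeric_hessian(theta, teacher_logp, pos=0, neg=2, beta=beta)
max_abs_diff = max(abs(h_gn[a][b] - h_num[a][b])
                   for a in range(3) for b in range(3))
```

The single-pair matrix is rank one, which is why the averaged Hessian in Eq. (15) needs score-gap vectors $d_{i}(\theta)$ pointing in sufficiently many directions to be well conditioned.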

E.2 Details for Pairwise MLE and Informative Comparisons

Assumptions used in Theorem 1.

The proof relies only on the following standard local conditions for smooth $M$ -estimation and logistic-type models.

• A1 (Bounded score-gap features). There exists $G>0$ such that $\sup_{\theta\in\mathcal{U}}\|d_{i}(\theta)\|_{2}\leq G$ almost surely in a local neighborhood $\mathcal{U}$ of $\theta^{\star}$.

• A2 (Local curvature condition). There exists $c_{0}>0$ such that for all $\theta\in\mathcal{U}$,

$\widehat{H}_{n}(\theta)\succeq c_{0}\,\widehat{H}_{n}(\theta^{\star}),\qquad\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))>0.$

Assumption A1 ensures bounded local score fluctuations, while Assumption A2 combines the two irreducible ingredients needed for the local inverse-Hessian argument: stable curvature in a neighborhood and nondegenerate information at $\theta^{\star}$ . These are standard local regularity conditions in likelihood-based $M$ -estimation, logistic regression, and restricted-strong-convexity analyses [van der Vaart, 2000, Bach, 2010, Negahban et al., 2012].

Why these assumptions are mild in our setting.

These conditions are local rather than global, so they are only required in a neighborhood of the target parameter reached by training. In practice, A1 rules out pathological comparisons with numerically unstable or excessively extreme score-gap features; this is natural in our setting because the comparison model is built from bounded-probability logistic factors over finite sampled responses. Moreover, several algorithmic choices help enforce this locality in practice: KL regularization keeps the student close to the teacher, bounded rollout lengths prevent extreme token-level score accumulation, and conservative optimization choices such as small learning rates or gradient clipping reduce the chance of leaving the local regime. Assumption A2 then asks only that the resulting teacher–student comparisons remain informative but not fully saturated. If the teacher and student are too close, the outer products $d_{i}(\theta)d_{i}(\theta)^{\top}$ do not span enough directions; if they are too far apart, the logistic curvature weight $\sigma(m_{\theta})(1-\sigma(m_{\theta}))$ collapses toward zero. This is precisely why our contextual self-distillation setup is statistically favorable: it tends to generate richer directions than pure self-copying while keeping the teacher–student gap moderate enough to preserve curvature. Similar local-curvature interpretations are standard in the statistical literature on smooth $M$ -estimation and logistic models [van der Vaart, 2000, Bach, 2010, Negahban et al., 2012].
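The two failure modes behind Assumption A2 can be illustrated in two dimensions, where the smallest eigenvalue of the averaged Gauss–Newton matrix has a closed form. The sketch below uses hand-picked margins and score-gap vectors (all assumed for illustration, with hypothetical function names) to contrast informative pairs, pairs whose score gaps fail to span the space, and pairs with saturated margins.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def curvature_weight(m):
    """Logistic curvature sigma(m) * (1 - sigma(m)); at most 1/4, attained at m = 0."""
    return sigmoid(m) * sigmoid(-m)

def min_eig_2x2(margins, gaps, beta=1.0):
    """Smallest eigenvalue of (beta^2 / n) * sum_i w(m_i) d_i d_i^T for 2-D d_i."""
    n = len(margins)
    a = b = c = 0.0  # H = [[a, b], [b, c]]
    for m, (dx, dy) in zip(margins, gaps):
        w = beta ** 2 * curvature_weight(m) / n
        a += w * dx * dx
        b += w * dx * dy
        c += w * dy * dy
    mean = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return mean - radius

# Informative pairs: moderate margins, score gaps spanning both directions.
lam_informative = min_eig_2x2([0.5, 0.5], [(1.0, 0.0), (0.0, 1.0)])

# Degenerate directions: score gaps are collinear, so one direction is unseen.
lam_collinear = min_eig_2x2([0.5, 0.5], [(1.0, 0.0), (2.0, 0.0)])

# Saturated margins: directions are diverse but the curvature weight collapses.
lam_saturated = min_eig_2x2([12.0, 12.0], [(1.0, 0.0), (0.0, 1.0)])
```

In this toy setting only the first configuration keeps $\lambda_{\min}$ bounded away from zero; the other two correspond to the regimes in which the bound in Theorem 1 becomes uninformative.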

Effect of assumption violations on robustness.

These assumptions also clarify the robustness regime of PBSD. If A1 is violated because score-gap features become too extreme, then the empirical score can have much higher variance, making optimization noisier and increasing sensitivity to sampling fluctuations. If A2 is violated because the local curvature becomes nearly singular, then the same preference data can produce much larger parameter perturbations, so training becomes less stable and the estimation bound deteriorates through a smaller $\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))$ . In practical terms, this means PBSD is most robust when teacher–student comparisons are informative but not saturated: enough separation to generate useful preference directions, but not so much separation that the logistic curvature collapses. When this balance fails, our theorem should be interpreted as a local guarantee rather than a global robustness claim. This is also why the practical design of PBSD favors moderate teacher guidance, KL anchoring, and conservative optimization, all of which help keep training inside the regime where the statistical guarantee remains informative.

Proof of Theorem 1.

We give a direct local argument without appealing to external pairwise-MLE results. Let

\widehat{\mathcal{L}}_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}(\theta),\qquad\mathcal{L}(\theta):=\mathbb{E}[\ell_{i}(\theta)].

Since $\theta^{\star}$ minimizes the population risk, $\nabla\mathcal{L}(\theta^{\star})=0$ . By a first-order expansion of the empirical gradient around $\theta^{\star}$ ,

0=\nabla\widehat{\mathcal{L}}_{n}(\widehat{\theta}_{n})=\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})+\widehat{H}_{n}(\widetilde{\theta})(\widehat{\theta}_{n}-\theta^{\star})

for some $\widetilde{\theta}$ on the segment between $\widehat{\theta}_{n}$ and $\theta^{\star}$ .

Equivalently,

\widehat{\theta}_{n}-\theta^{\star}=-\widehat{H}_{n}(\widetilde{\theta})^{-1}\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star}).

Thus the estimation error is controlled by two quantities: the empirical score at $\theta^{\star}$ and the local conditioning of the Hessian.

We first bound the empirical score. For each pair $i$ , define

\psi_{i}:=\nabla\ell_{i}(\theta^{\star})=-\beta\sigma\!\left(-m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-})\right)d_{i}(\theta^{\star}).

Therefore,

\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})=\frac{1}{n}\sum_{i=1}^{n}\psi_{i}.

Because $\theta^{\star}$ minimizes the population risk, $\mathbb{E}[\psi_{i}]=0$ . Moreover, the bounded-feature assumption implies $\|d_{i}(\theta^{\star})\|_{2}\leq G$ , and $0\leq\sigma(-m_{\theta^{\star}}(x_{i},y_{i}^{+},y_{i}^{-}))\leq 1$ . Therefore

\|\psi_{i}\|_{2}\leq\beta G\qquad\text{almost surely.}
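As a quick numerical sanity check of this almost-sure bound, the sketch below draws random features with $\|d_{i}\|_{2}\leq G$ and random margins, forms $\psi_{i}=-\beta\sigma(-m)d_{i}$, and tracks the largest observed ratio $\|\psi_{i}\|_{2}/(\beta G)$. The values of `beta`, `G`, `dim`, and the sampling distributions are hypothetical and serve only as an illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def score(beta: float, margin: float, d):
    """Per-pair score psi = -beta * sigma(-margin) * d."""
    w = beta * sigmoid(-margin)
    return [-w * dj for dj in d]


random.seed(0)
beta, G, dim = 0.1, 5.0, 8  # hypothetical values

max_ratio = 0.0
for _ in range(1000):
    d = [random.gauss(0.0, 1.0) for _ in range(dim)]
    scale = random.uniform(0.0, G) / l2(d)  # rescale so that ||d||_2 <= G
    d = [scale * x for x in d]
    m = random.uniform(-10.0, 10.0)
    max_ratio = max(max_ratio, l2(score(beta, m, d)) / (beta * G))

print(f"largest observed ||psi|| / (beta * G): {max_ratio:.4f}")
```

The ratio never exceeds one because $\sigma(\cdot)\in[0,1]$ and $\|d_{i}\|_{2}\leq G$, exactly as used in the proof.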

Hence the score contributions are bounded mean-zero random vectors. A standard vector Hoeffding inequality gives that, with probability at least $1-\delta$ ,

\left\|\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})\right\|_{2}\leq C\beta G\sqrt{\frac{d+\log(1/\delta)}{n}}

for an absolute constant $C$ .

Next we use the local stability assumption on the Gauss–Newton Hessian. Specifically, assume that throughout the neighborhood containing $\widehat{\theta}_{n}$ ,

\widehat{H}_{n}(\widetilde{\theta})\succeq c_{0}\,\widehat{H}_{n}(\theta^{\star})

for some constant $c_{0}>0$ . Then

\widehat{H}_{n}(\widetilde{\theta})^{-1}\preceq c_{0}^{-1}\widehat{H}_{n}(\theta^{\star})^{-1}.

Using the first-order expansion above and the dual norm induced by $\widehat{H}_{n}(\theta^{\star})$ , we obtain

\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{\widehat{H}_{n}(\theta^{\star})}\leq c_{0}^{-1}\left\|\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})\right\|_{\widehat{H}_{n}(\theta^{\star})^{-1}}.
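For completeness, the step from the first-order expansion to this display can be spelled out as follows. The shorthand $g:=\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})$, $H:=\widehat{H}_{n}(\theta^{\star})$, and $\widetilde{H}:=\widehat{H}_{n}(\widetilde{\theta})$ is introduced only for this derivation, and we assume that $H$ and $\widetilde{H}$ are symmetric positive definite, as is implicit in the inversions above.

```latex
\begin{aligned}
\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{H}^{2}
&= g^{\top}\widetilde{H}^{-1}H\,\widetilde{H}^{-1}g
&& \text{since } \widehat{\theta}_{n}-\theta^{\star}=-\widetilde{H}^{-1}g \\
&\leq c_{0}^{-1}\,g^{\top}\widetilde{H}^{-1}g
&& \text{since } H\preceq c_{0}^{-1}\widetilde{H} \\
&\leq c_{0}^{-2}\,g^{\top}H^{-1}g
&& \text{since } \widetilde{H}^{-1}\preceq c_{0}^{-1}H^{-1} \\
&= c_{0}^{-2}\,\|g\|_{H^{-1}}^{2}.
\end{aligned}
```

Taking square roots on both sides gives the stated inequality.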

To relate this dual norm to the Euclidean norm of the score, use the spectral bound

\widehat{H}_{n}(\theta^{\star})^{-1}\preceq\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))^{-1}I.

Therefore,

\left\|\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})\right\|_{\widehat{H}_{n}(\theta^{\star})^{-1}}\leq\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))^{-1/2}\left\|\nabla\widehat{\mathcal{L}}_{n}(\theta^{\star})\right\|_{2}\leq C\sqrt{\frac{d+\log(1/\delta)}{n\,\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))}},

where the constant absorbs $\beta$ and $G$. Substituting this bound into the previous display, and absorbing $c_{0}^{-1}$ into $C$, gives

\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{\widehat{H}_{n}(\theta^{\star})}\leq C\sqrt{\frac{d+\log(1/\delta)}{n\,\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))}}.

Finally, convert the Hessian norm to the Euclidean norm:

\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{2}\leq\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))^{-1/2}\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{\widehat{H}_{n}(\theta^{\star})}

\left\|\widehat{\theta}_{n}-\theta^{\star}\right\|_{2}\leq C\sqrt{\frac{d+\log(1/\delta)}{n\,\lambda_{\min}(\widehat{H}_{n}(\theta^{\star}))}},

which proves Theorem 1.

Appendix F Appendix for Experiment

This section provides the detailed experimental design underlying the results in Section 4. Our implementation follows the OPSD protocol of Zhao et al. [2026a] whenever applicable so that the comparison against prior baselines isolates the effect of the proposed PBSD objective as cleanly as possible. The appendix is organized as follows. Appendix F.1 summarizes the datasets used in the mathematical reasoning and tool-use experiments. Appendix F.2 visualizes representative examples from the processed training data. Appendix F.3 collects the detailed training configuration and method notes for SFT, GRPO, DAPO, OPSD, SDFT, SRPO, and PBSD. Appendix F.4 describes the evaluation protocols for math and tool use. Appendix F.5 presents the case-by-case study design. Appendix F.6 studies capability gains from expert demonstrations, with particular attention to reasoning length and exploratory behavior. Appendix F.7 reports the teacher update frequency ablation.

F.1 Datasets

We experiment with Qwen3 instruct models at three scales: Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Across the main paper, we study two task domains: mathematical reasoning and tool use.

Mathematical reasoning data.

Following the setup described in Section 4.1, we train on the mathematical reasoning subset of OpenThoughts [Guha et al., 2025], from which we sample up to 30K problem–solution pairs that contain chain-of-thought reasoning traces. This domain shares the same training pipeline as the tool-use experiments; only the evaluation benchmarks and metrics differ. We evaluate on three competition-level mathematical benchmarks: AIME 2024, AIME 2025, and HMMT 2025. In the main paper, all reported mathematical reasoning numbers use the same Avg@12 evaluation protocol.

Tool-use data.

For the additional tool-use study, we follow the setup in Shenfeld et al. [2026], which uses ToolAlpaca as the underlying domain. Each example consists of a user query together with tool or API information, and the model must generate the correct tool call. As in mathematical reasoning, we keep the same training pipeline and change only the evaluation split and metric. In our processed split, the training set contains 4046 examples and the test set contains 94 examples. This benchmark is used to evaluate whether the benefit of demonstration-conditioned self-distillation extends beyond long-form reasoning to structured action generation.

F.2 Data Visualization

To make the data format concrete, we provide representative examples from the processed training data used in the two task domains. The examples below are written in the same key–value style as the data consumed by our training pipeline and are intended to illustrate the structure of each instance.

Mathematical reasoning examples.

Each mathematical reasoning example contains a problem statement together with a reference reasoning trace and the final answer. Two representative examples are shown below.

Tool-use examples.

Each tool-use example contains a user request, the available tool specification, and the target tool invocation. Two representative examples are shown below.

F.3 Detailed Training Configuration

LoRA [Hu et al., 2022] is a widely used method for parameter-efficient fine-tuning and has been studied extensively in recent work [Zhao et al., 2026b, Yu et al., 2025b, c]. All methods are fine-tuned with LoRA on 8 H100 GPUs. This training configuration is shared across both task domains; unless explicitly stated otherwise, the mathematical reasoning and tool-use experiments share the same model initialization, optimizer, LoRA setup, rollout configuration, and checkpoint-selection protocol. Across all runs, we use AdamW, bfloat16 precision, gradient checkpointing, and FlashAttention 2. The shared LoRA configuration is rank $r=64$, LoRA alpha $\alpha=128$, and target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. To accelerate rollout generation for on-policy methods, we use vLLM for inference. We keep the optimization hyperparameters of the baselines aligned with OPSD [Zhao et al., 2026a] as closely as possible so that the main difference lies in the training objective rather than in separate hyperparameter search. All trainable methods are trained for 500 steps and evaluated every 50 steps, and we report the peak checkpoint within this fixed budget.
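To give a rough sense of the adapter size implied by this configuration, the sketch below counts the LoRA parameters added to a single linear layer and computes the standard LoRA scaling factor $\alpha/r$. The layer dimensions are hypothetical placeholders and are not the actual Qwen3 shapes.

```python
RANK = 64
ALPHA = 128
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Standard LoRA scales the low-rank update by alpha / rank.
scaling = ALPHA / RANK

# Hypothetical square projection, for illustration only.
d_in, d_out = 4096, 4096
adapter = lora_param_count(d_in, d_out, RANK)
dense = d_in * d_out

print(f"scaling alpha/r        : {scaling}")
print(f"adapter params (1 layer): {adapter:,}")
print(f"fraction of dense layer : {adapter / dense:.4f}")
```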

SFT.

SFT is trained on the reasoning traces in the training set and can be viewed as an off-policy distillation baseline. We use learning rate $5\times 10^{-6}$ , effective batch size 32, LoRA rank 64, LoRA alpha 128, and maximum sequence length 16,000, following the OPSD paper.

GRPO.

For GRPO, we follow the OPSD implementation settings: learning rate $5\times 10^{-6}$ , effective batch size 32, maximum completion length 16,000, 8 generations per prompt, sampling temperature 1.2, and KL coefficient 0.0. Because GRPO uses group-based rollouts, its token cost scales with the number of sampled responses per prompt.

OPSD.

For OPSD, we use the configuration reported by Zhao et al. [2026a]: learning rate $5\times 10^{-6}$ , effective batch size 32, maximum completion length 1024, a single on-policy rollout per prompt, and sampling temperature 1.1. Unless otherwise stated, OPSD uses full-vocabulary logit distillation. As in the original paper, the teacher is the same base model instantiated with privileged reasoning context, and gradients are backpropagated only through the student branch.

PBSD.

PBSD shares the same base model, LoRA setup, optimizer, batch size, rollout count, maximum completion length, and sampling temperature as OPSD. Concretely, we use learning rate $5\times 10^{-6}$ , effective batch size 32, maximum completion length 1024, one sampled response per prompt, and sampling temperature 1.1. The only algorithmic change is the training objective: PBSD replaces token-level KL matching with the proposed pairwise preference-based self-distillation objective. In addition, we fix the teacher policy to the initial checkpoint throughout training to avoid teacher drift and to keep the teacher signal stationary while the student evolves on-policy.
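The rollout settings listed for GRPO, OPSD, and PBSD imply very different per-prompt generation budgets. The short sketch below computes the upper bound on student-generated tokens per prompt from the stated number of generations and maximum completion length. It counts student rollouts only, ignores any teacher-side generation, and overestimates actual usage whenever responses terminate early.

```python
def max_rollout_tokens(num_generations: int, max_completion_length: int) -> int:
    """Upper bound on generated tokens per prompt for student rollouts."""
    return num_generations * max_completion_length


# Settings as stated in this appendix.
budgets = {
    "GRPO": max_rollout_tokens(8, 16_000),
    "OPSD": max_rollout_tokens(1, 1_024),
    "PBSD": max_rollout_tokens(1, 1_024),
}

for name, tokens in budgets.items():
    print(f"{name}: at most {tokens:,} generated tokens per prompt")

ratio = budgets["GRPO"] / budgets["PBSD"]
print(f"GRPO / PBSD upper-bound ratio: {ratio:.0f}x")
```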

For OPSD and PBSD, the teacher and student are instantiated from the same underlying model but receive different contexts. The student conditions only on the original problem, while the teacher is augmented with privileged information derived from the training example. In all on-policy methods, the student generates the rollout that defines the training state distribution. For OPSD, this rollout is used for token-level KL matching against the teacher distribution. For PBSD, the same on-policy student rollout serves as the negative sample, while the better-conditioned teacher provides the positive sample used in the pairwise objective.
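The pair construction described above can be sketched in a few lines. The margin $m_{\theta}$ itself is defined in the main text; the code below assumes, purely for illustration, a DPO-style log-ratio margin against a reference model, with the teacher response as $y^{+}$ and the student rollout as $y^{-}$. The function names, the reference-model terms, and the value of `beta` are hypothetical and do not describe our training code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_loss(logp_pos: float, logp_neg: float,
                  ref_logp_pos: float, ref_logp_neg: float,
                  beta: float = 0.1) -> float:
    """Logistic loss -log sigma(margin) for one (teacher, student) pair.

    logp_pos / logp_neg: sequence log-probabilities under the current student
    of the teacher response (positive) and the student rollout (negative).
    ref_logp_*: the same quantities under an assumed reference model.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -log_sigmoid(margin)


# Hypothetical sequence log-probabilities for one prompt.
loss = pairwise_loss(logp_pos=-12.0, logp_neg=-20.0,
                     ref_logp_pos=-12.5, ref_logp_neg=-19.0)
print(f"pairwise loss: {loss:.4f}")
```

With a zero margin the loss equals $\log 2$, and it decreases as the student assigns relatively more probability to the teacher response than to its own rollout.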

This design choice is also related to a broader set of preference-based methods that use DPO-style objectives together with a shared base model in adjacent domains, such as listwise preference optimization for diffusion models [Bai et al., 2025] and attribute-guided decoding-time personalization [Yu et al., 2026].

Our goal is to compare objectives rather than search separately for method-specific hyperparameters. Therefore, PBSD is intentionally matched to OPSD in optimization and rollout configuration, with the objective function as the primary difference. SFT, GRPO, DAPO, OPSD, SDFT, and SRPO also follow aligned training settings wherever those settings apply. All trainable methods are trained for 500 steps, evaluated every 50 steps, and selected by the peak checkpoint within this fixed budget. This protocol keeps the comparison focused on the training objective rather than on unequal training budgets or checkpoint-selection rules.

F.4 Detailed Evaluation Configuration

Mathematical reasoning evaluation.

At evaluation time, we follow the Qwen3 sampling configuration used in OPSD and the Qwen3 technical report. We sample 12 responses per prompt with temperature 1.0 and report Avg@12. The maximum generation length is set to 38,912 new tokens (approximately 38k). We use top-$p=0.95$, top-$k=-1$, min-$p=0.0$, and zero presence penalty. These settings are shared across all methods so that the reported differences reflect training rather than decoding.
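For concreteness, the sketch below collects the decoding settings in a plain dictionary and computes Avg@k under its usual reading: the per-problem fraction of correct samples, averaged over problems and reported as a percentage. The dictionary key names and the toy correctness flags are illustrative assumptions, not an excerpt from our evaluation code.

```python
# Decoding settings as stated above; key names are illustrative.
SAMPLING = {
    "n": 12,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": -1,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "max_new_tokens": 38_912,
}


def avg_at_k(correct_flags):
    """Avg@k in percent.

    correct_flags: one list per problem, each holding k booleans that mark
    whether the corresponding sampled response was correct.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy example with two problems and k = 12 samples each (hypothetical flags).
toy = [
    [True] * 9 + [False] * 3,   # 9 of 12 correct
    [True] * 3 + [False] * 9,   # 3 of 12 correct
]
print(f"Avg@{SAMPLING['n']}: {avg_at_k(toy):.1f}")
```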

Tool-use evaluation.

For tool use, evaluation is performed on the 94-example test split described in Appendix F.1. We report top-1 accuracy on the generated tool call. This is the only task-specific difference in protocol relative to mathematical reasoning: the training configuration is unchanged, but evaluation uses the tool-use test split and its corresponding metric rather than Avg@12 on math benchmarks. The comparison protocol mirrors our main mathematical reasoning experiments: we evaluate the base model and compare SFT, GRPO, DAPO, OPSD, SDFT, SRPO, and PBSD under matched optimization settings whenever possible, with the aim of isolating the effect of the training objective. As in the mathematical reasoning experiments, PBSD uses the same base model as both teacher and student, with the teacher instantiated through additional contextual information and the student trained on-policy. This setup allows us to test whether the reward-aware pairwise objective remains beneficial when the desired output is a correct tool-use action sequence instead of a long-form chain-of-thought answer.

F.5 Case-by-Case Study

In addition to aggregate benchmark accuracy, we conduct a case-by-case study on AIME25. We select 30 representative AIME25 questions and, for each method, record whether the majority-vote prediction over sampled responses is correct or incorrect on each question.

The study is organized as a grid-style comparison. Each row corresponds to one evaluation setting and each column corresponds to one selected AIME25 problem. The rows include: (1) the base student model prompted only with the original question, (2) the same base model prompted with a reference solution as an expert demonstration, (3) OPSD, and (4) PBSD. Each cell is marked with a check if the majority-vote answer is correct and a cross otherwise.
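Each cell of the grid can be filled by a simple majority vote over the extracted final answers. The sketch below is a minimal version of that rule; the tie-breaking behavior (the first most frequent answer wins) and the string-equality comparison with the reference answer are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def cell_mark(answers, reference):
    """Check mark if the majority-vote answer matches the reference, else cross."""
    return "✓" if majority_vote(answers) == reference else "✗"


# Hypothetical extracted answers for one problem.
sampled = ["204", "204", "113", "204", "70", "204"]
print(cell_mark(sampled, reference="204"))
```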

This design lets us inspect problem-level changes induced by prompt-level guidance and by training-time self-distillation under a common evaluation protocol. Table 3 reports the full case-by-case study.

Table 3: Case-by-case analysis on 30 selected AIME25 problems. “Base (Student)” denotes the base model prompted only with the original question, while “Base (Teacher)” denotes the same base model prompted with the reference solution as an expert demonstration. Each cell is marked with ✓ if the majority-vote answer is correct and ✗ otherwise.

Method	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Q11	Q12	Q13	Q14	Q15	Q16	Q17	Q18	Q19	Q20	Q21	Q22	Q23	Q24	Q25	Q26	Q27	Q28	Q29	Q30
Base (Student)	✓	✗	✗	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✗	✗	✓	✓	✗	✗	✓	✓	✗	✓	✓	✓	✓	✓	✓	✗	✗
Base (Teacher)	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
OPSD	✓	✗	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✓	✓	✓	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✗
PBSD	✓	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✓	✓	✓	✓	✓	✓	✓	✗	✓	✓	✓	✓	✓	✓	✓	✗
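The rows of Table 3 can be tallied with a few lines of Python. The sets below list, for each row, the question indices marked ✗ in the table; the transcription into sets is ours and is intended only as a reading aid.

```python
NUM_QUESTIONS = 30

# Question indices marked with a cross in Table 3.
incorrect = {
    "Base (Student)": {2, 3, 4, 5, 14, 15, 18, 19, 22, 29, 30},
    "Base (Teacher)": set(),
    "OPSD": {2, 3, 4, 14, 18, 19, 29, 30},
    "PBSD": {2, 3, 14, 22, 30},
}

correct_counts = {name: NUM_QUESTIONS - len(qs) for name, qs in incorrect.items()}
for name, count in correct_counts.items():
    print(f"{name:15s} {count}/{NUM_QUESTIONS}")

# Problem-level differences between the two trained methods.
solved_by_pbsd_only = sorted(incorrect["OPSD"] - incorrect["PBSD"])
solved_by_opsd_only = sorted(incorrect["PBSD"] - incorrect["OPSD"])
print("PBSD correct, OPSD incorrect:", solved_by_pbsd_only)
print("OPSD correct, PBSD incorrect:", solved_by_opsd_only)
```

Under this transcription the tallies are 19, 30, 22, and 25 correct out of 30 for the base student, the demonstration-conditioned base model, OPSD, and PBSD, respectively. PBSD is correct on Q4, Q18, Q19, and Q29 where OPSD is not, and OPSD is correct on Q22 where PBSD is not.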

F.6 Capability Gain from Expert Demonstrations

To study how demonstration-conditioned supervision changes generation behavior, we compare reasoning traces on a subset of difficult AIME24 questions for which the base student fails under majority voting. Specifically, we use 11 AIME24 questions on which the 4B base student is incorrect under majority voting over its 12 sampled responses. For each question, we collect generations from four systems: the base student, the demonstration-conditioned teacher, OPSD, and PBSD.

For each question, we record both problem-level accuracy and average completion length for the base student and the demonstration-conditioned teacher, together with the average completion length of OPSD and PBSD. This setup is intended to test whether PBSD preserves longer completions than the teacher and OPSD while still improving answer accuracy, which would indicate that useful reasoning and exploration remain present after training. Table 4 reports the per-question comparison of the base student, teacher, OPSD, and PBSD under the same protocol.

Table 4: Question-level comparison on the hard AIME24 subset. For each problem, we report accuracy and mean completion length for the base student and the demonstration-conditioned teacher, together with the mean completion length of OPSD and PBSD. This table is designed to test whether PBSD preserves longer completions than the teacher and OPSD while still improving accuracy.

Problem ID	Student Acc.	Student Len.	Teacher Acc.	Teacher Len.	OPSD Len.	PBSD Len.
61	1/12	55,952	12/12	11,486	13,445	23,143
62	0/12	48,932	12/12	24,164	24,134	39,133
63	0/12	26,644	12/12	20,313	23,123	25,673
64	4/12	14,626	12/12	14,601	17,234	15,364
73	0/12	72,696	12/12	18,452	19,332	5,631
74	5/12	19,510	11/12	9,180	12,903	16,327
77	3/12	25,928	12/12	5,753	6,738	12,283
78	3/12	63,701	12/12	9,586	12,234	45,212
81	0/12	31,146	12/12	25,718	26,239	33,445
88	0/12	80,990	12/12	25,327	27,613	67,313
89	0/12	26,236	12/12	28,548	26,123	28,314
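The length columns of Table 4 can be summarized with the short script below. The rows are transcribed from the table; the choice of summary statistics (column means and per-problem comparisons) is ours and is meant only as a reading aid.

```python
from statistics import mean

# (problem_id, teacher_len, opsd_len, pbsd_len), transcribed from Table 4.
ROWS = [
    (61, 11486, 13445, 23143),
    (62, 24164, 24134, 39133),
    (63, 20313, 23123, 25673),
    (64, 14601, 17234, 15364),
    (73, 18452, 19332, 5631),
    (74, 9180, 12903, 16327),
    (77, 5753, 6738, 12283),
    (78, 9586, 12234, 45212),
    (81, 25718, 26239, 33445),
    (88, 25327, 27613, 67313),
    (89, 28548, 26123, 28314),
]

mean_teacher = mean(r[1] for r in ROWS)
mean_opsd = mean(r[2] for r in ROWS)
mean_pbsd = mean(r[3] for r in ROWS)

# Problems on which PBSD completions are longer than each comparison system.
pbsd_longer_than_opsd = [r[0] for r in ROWS if r[3] > r[2]]
pbsd_longer_than_teacher = [r[0] for r in ROWS if r[3] > r[1]]

print(f"mean length  teacher={mean_teacher:,.0f}  OPSD={mean_opsd:,.0f}  PBSD={mean_pbsd:,.0f}")
print(f"PBSD longer than OPSD on {len(pbsd_longer_than_opsd)} of {len(ROWS)} problems")
print(f"PBSD longer than teacher on {len(pbsd_longer_than_teacher)} of {len(ROWS)} problems")
```

In this transcription, PBSD completions are longer than those of OPSD on 9 of the 11 problems (all except 64 and 73) and longer than those of the teacher on 9 of the 11 problems (all except 73 and 89).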

F.7 Teacher Update Frequency Study

In the main experiments, we keep the teacher fixed to the initial checkpoint throughout PBSD training. To test whether a more adaptive teacher could further improve learning, we additionally consider a periodic-update variant in which the teacher is refreshed from the current student every 5 gradient-update steps.

This ablation is conducted under the same Qwen3-4B training and evaluation setup as the main mathematical reasoning experiments; the only change is the teacher update rule. The fixed-teacher variant uses the initial context-augmented model as the teacher for the entire run, while the refreshed-teacher variant periodically replaces the teacher with the latest student checkpoint and then re-instantiates the contextual teacher from that checkpoint. Table 5 reports the benchmark-wise comparison on AIME24, AIME25, and HMMT25, together with the average across the three tasks.
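The two teacher update rules differ only in when the teacher weights are replaced. The sketch below enumerates the refresh steps under each rule for a 500-step run; the function name, the 1-indexed step convention, and the exact alignment of refreshes within the run are assumptions for illustration.

```python
from typing import List, Optional


def teacher_refresh_steps(total_steps: int, every: Optional[int]) -> List[int]:
    """Steps (1-indexed) after which the teacher is refreshed from the student.

    every=None corresponds to the fixed-teacher setting (no refresh).
    """
    if every is None:
        return []
    return [step for step in range(1, total_steps + 1) if step % every == 0]


fixed = teacher_refresh_steps(total_steps=500, every=None)
periodic = teacher_refresh_steps(total_steps=500, every=5)

print(f"fixed teacher   : {len(fixed)} refreshes")
print(f"update every 5  : {len(periodic)} refreshes, first at steps {periodic[:3]}")
```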

Table 5: Teacher update frequency ablation on Qwen3-4B. “Fixed” denotes the default PBSD setting in which the teacher remains equal to the initial checkpoint throughout training. “Update every 5 steps” denotes a variant in which the teacher is refreshed from the student every 5 gradient-update steps.

Teacher Update Rule	AIME24	AIME25	HMMT25	Average
Fixed teacher	77.5	68.9	45.6	64.0
Update every 5 steps	77.2	68.6	44.4	63.4