arXiv:2605.04078 · cs.LG · uncurated · rendered via ar5iv

Validity-Calibrated Reasoning Distillation


Khouloud Saadi (corresponding author), Department of Computer Science, KAUST, Thuwal, Saudi Arabia, khouloud.saadi@kaust.edu.sa
Di Wang, Department of Computer Science, KAUST, Thuwal, Saudi Arabia, di.wang@kaust.edu.sa

Abstract

Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher–student hierarchies and frame distillation as trajectory imitation. This is misaligned with the structure of reasoning, where intermediate steps are often locally under-specified: global correctness constrains the final answer, but does not uniquely determine each intermediate move. We propose validity-calibrated reasoning distillation, a framework that treats reasoning distillation as a problem of local learning-signal allocation rather than path alignment. Instead of enforcing token-level imitation, we compare the student’s and teacher’s proposed next-step actions under the same prefix and use their relative local validity to modulate the strength of the distillation update. This yields a dynamic, context-dependent supervision mechanism that preserves the teacher’s structural guidance while adapting update strength to local reasoning quality. Across mathematical reasoning, code generation, and instruction-following benchmarks, our method consistently outperforms strong distillation baselines. These results indicate that effective LLM reasoning distillation is governed not by rigid trajectory imitation, but by locally calibrated allocation of learning signal.

1 Introduction

Reasoning has become a defining capability of modern language models (Ko et al., 2025; Yang et al., 2025b; Pang et al., 2024; Shao et al., 2024; Achiam et al., 2023), driving progress in mathematical problem solving (Cobbe et al., 2021; Hendrycks et al., 2021), code generation (Luo et al., 2024a; Xu et al., 2024), and complex instruction following. Frontier LLMs such as GPT-4 (Achiam et al., 2023), DeepSeek-R1 (Guo et al., 2025), and ReasonFlux (Yang et al., 2025a) exhibit strong multi-step reasoning abilities but remain prohibitively expensive for widespread deployment. In contrast, smaller language models are far more efficient and accessible, yet they still struggle to perform reliable, context-dependent reasoning (Yang et al., 2024; Grattafiori et al., 2024). Closing this gap is essential for building scalable, broadly usable AI systems.

Recent efforts to improve the reasoning performance of compact models have focused on distillation (Schmidhuber, 1992; Hinton, 2015), aiming to transfer reasoning behavior from powerful LLMs to smaller students (Liu et al., 2024; Yang et al., 2025b; Ko et al., 2025). Most existing reasoning distillation methods (Li et al., 2022; Fu et al., 2023) formulate supervision as a trajectory imitation problem (Magister et al., 2023; Liu et al., 2024; Ko et al., 2025; Yang et al., 2025b), enforcing token-level alignment with a teacher rollout. This is typically implemented through fixed teacher-centric objectives or schedules that interpolate between teacher and student prefixes (Liu et al., 2024; Ko et al., 2025), training the student to reproduce a specific realization of the teacher’s reasoning process.

At a deeper level, trajectory-based reasoning distillation relies on an implicit assumption that is rarely made explicit: a model that is globally superior also provides uniformly superior learning signal at each intermediate step (Li et al., 2024; Shridhar et al., 2023; Li et al., 2023a). In other words, global model quality is assumed to transfer monotonically to local decision quality under all prefixes. While this assumption is natural in standard prediction tasks (Hinton, 2015; Jiao et al., 2020), where supervision is aligned at each output position, it is fundamentally mismatched with multi-step reasoning. Reasoning is evaluated only at the level of the final answer, whereas intermediate steps are latent variables whose correctness is often under-specified (Liao et al., 2025; Achiam et al., 2023). Hence, global capability does not induce a total ordering over local reasoning moves, and the teacher-provided learning signal can vary substantially across decision points. When this monotonicity breaks, uniform distillation becomes structurally miscalibrated: strong updates may be enforced where local evidence is weak, while informative local signals remain underutilized.

Figure 1: Distribution of the reward ratio $r_s/r_t$ between the student Qwen2.5-Math-1.5B ($r_s$) and the teacher Qwen2.5-Math-7B-Instruct ($r_t$) policies. $r_s$ and $r_t$ are computed with Skywork-o1-OpenPRM-Qwen-2.5-1.5B. While the ratio is expected to concentrate below 1, a substantial fraction of probability mass (28.8%) lies in the region $\geq 1$, indicating frequent cases where the student attains a reward equal to or higher than the teacher’s.

To examine this mismatch directly, we compare the local validity of teacher and student next-token proposals under identical prefixes, using an auxiliary judge to score token-level transitions. Figure 1 shows the empirical distribution of the reward ratio $r_s/r_t$ between a Qwen2.5-Math-1.5B student and a Qwen2.5-Math-7B-Instruct teacher. If global teacher superiority consistently translated into local step superiority, the ratio would concentrate strictly below $1$. Instead, we observe that in 28.8% of decision points, the student’s proposed continuation attains equal or higher local reward than the teacher’s under the same context. Appendix 6 provides qualitative examples from the same setting, showing cases where the student proposes a better or locally valid alternative to the teacher’s next-token continuation. This phenomenon does not contradict the teacher’s overall strength; rather, it reveals that local step quality varies substantially even when one model is globally stronger.
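The statistic reported for Figure 1 can be sketched as follows. This is only an illustration: the function name and the five score pairs are hypothetical, not data from the paper, and the judge scores are assumed to be already paired by shared prefix.

```python
def share_student_at_least_teacher(student_scores, teacher_scores):
    """Fraction of decision points where r_s >= r_t (i.e. r_s / r_t >= 1 for r_t > 0)."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("scores must be paired one-to-one by prefix")
    pairs = list(zip(student_scores, teacher_scores))
    return sum(1 for rs, rt in pairs if rs >= rt) / len(pairs)


# Hypothetical judge scores for five shared prefixes (illustrative values only).
r_s = [0.40, 0.90, 0.75, 0.20, 0.95]
r_t = [0.80, 0.85, 0.90, 0.60, 0.70]

share = share_student_at_least_teacher(r_s, r_t)
print(f"{share:.1%} of decision points have r_s >= r_t")
```

Comparing the two scores directly, rather than thresholding a smoothed ratio, keeps exact ties on the "equal or higher" side.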

These observations expose a fundamental limitation of trajectory imitation. By applying uniform distillation pressure at every token, existing objectives conflate two distinct roles of the teacher: providing structural guidance about the solution manifold, and determining the strength of the learning signal at each decision point. When intermediate reasoning steps are locally ambiguous or weakly informative, strong imitation can be harmful; conversely, when local evidence is strong, uniform supervision fails to exploit it. Treating all teacher actions as equally authoritative therefore leads to miscalibrated learning signals across the reasoning chain.

This perspective reframes the core question of reasoning distillation. Rather than asking which trajectory should be imitated, the central challenge is to determine how strongly the student should update at each decision point. We propose Validity-Calibrated Reasoning Distillation (VCRD), a framework that allocates token-level supervision based on the relative local validity of teacher and student proposals. At each prefix, both models produce candidate next steps, which are evaluated under the same context by an auxiliary judge. Their relative validity then scales the distillation update, yielding three regimes: parity when both steps are similarly justified, attenuation when the teacher is locally stronger, and amplification when the student’s continuation is more locally valid. Importantly, VCRD does not alter the direction of supervision; it adaptively calibrates its strength, preserving teacher guidance while aligning learning pressure with local reasoning evidence. A detailed related work section is provided in Appendix 7. In summary, our contributions are as follows:

  • We identify a previously implicit monotonicity assumption in trajectory-based LLM reasoning distillation, namely, that global teacher superiority implies uniformly superior local learning signal, and show that this assumption is violated in multi-step reasoning.

  • We introduce the VCRD framework, which allocates token-level supervision using the relative local validity of teacher and student proposals.

  • We provide a theoretical justification via a teacher-anchored KL trust-region formulation, demonstrating that relative validity induces the correct first-order scaling of the distillation update.

  • We empirically validate the importance of the amplification regime, showing that VCRD yields consistent improvements across mathematical reasoning, code generation, and instruction-following tasks, with average gains of 2.77%, 1.15%, and 2.35%, respectively.

2 Proposed Approach: Validity-Calibrated Reasoning Distillation

Figure 2: Overview of VCRD. Rather than enforcing uniform trajectory imitation, VCRD allocates token-level learning signal based on the relative local validity of teacher and student proposals under the same prefix. By modulating update strength rather than direction, the method preserves teacher guidance while adapting supervision to locally under-specified reasoning steps.

Reasoning distillation differs from standard knowledge distillation in a fundamental way. In conventional prediction tasks, the teacher’s output defines a well-specified target, and deviations by the student can be unambiguously interpreted as errors (Sun et al., 2019; Jiao et al., 2020). In multi-step reasoning, however, correctness constrains only the final answer: intermediate steps are often locally under-specified, with multiple plausible continuations coexisting under the same prefix. As a result, teacher superiority does not imply reliable supervision at every decision point.

Despite this mismatch, reasoning KD methods (Ko et al., 2025; Wang et al., 2025) assign the teacher a uniformly privileged role, enforcing token-level alignment along an entire reasoning chain and implicitly treating each teacher step as the unique correct continuation. This conflates the direction of supervision with its strength, leading to miscalibrated updates when local reasoning varies across steps. The central question is therefore not which trajectory should be imitated, but:

Given this prefix and these candidate reasoning steps, how strongly should the student update be applied at token $t$?

To answer this question, we introduce the VCRD framework: at each prefix, an auxiliary judge evaluates the teacher’s and student’s candidate next tokens under the same context, and their relative local validity determines the magnitude of the distillation update. As illustrated in Figure 2, this yields prefix-conditioned, token-level supervision that reinforces informative teacher guidance, attenuates weak signals, and amplifies well-justified student local decisions, while preserving the teacher-anchored optimization geometry.

2.1 Problem Formulation

We consider a prompt $x$ paired with an output sequence $y=(y_1,\dots,y_T)$. A pretrained teacher model $p(y\mid x)$ and a student $q_\theta(y\mid x)$ both generate tokens autoregressively: at step $t$, they condition on the prefix $y_{<t}$, forming the context $c_{t-1}=(x,y_{<t})$ and defining next-token distributions $p(\cdot\mid c_{t-1})$ and $q_\theta(\cdot\mid c_{t-1})$. During distillation, we draw both a teacher rollout $y^T\sim p(\cdot\mid x)$ and a student rollout $y^S\sim q_\theta(\cdot\mid x)$. These induce two prefixes at position $t$: $c^T_{t-1}=(x,y^T_{<t})$ and $c^S_{t-1}=(x,y^S_{<t})$. For any prefix $c$ (in particular $c^T_{t-1}$ or $c^S_{t-1}$), the teacher and student define conditional distributions $p(\cdot\mid c)$ and $q_\theta(\cdot\mid c)$, from which they propose next tokens $a^T_t\sim p(\cdot\mid c)$ and $a^S_t\sim q_\theta(\cdot\mid c)$.

These two conditioning contexts arise naturally in reasoning distillation. Teacher prefixes $c^T_{t-1}$ anchor learning to high-quality partial traces, while student prefixes $c^S_{t-1}$ expose the model to its own induced state distribution, which is critical for robustness at inference time. Prior work typically mixes these contexts using fixed schedules or interpolation coefficients (Ko et al., 2025). In contrast, we treat them symmetrically and determine the supervision strength at each step based on the local validity of the proposed next-token decision under the current prefix.
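As a minimal sketch of the two conditioning contexts, the snippet below builds the prefixes $c_{t-1}=(x,y_{<t})$ from a teacher rollout and a student rollout. The helper name `prefixes`, the prompt, and both toy rollouts are assumptions made for illustration; real rollouts would be token sequences sampled from the two models.

```python
def prefixes(prompt, rollout):
    """Contexts c_{t-1} = (x, y_{<t}) for t = 1..T, as (prompt, tuple_of_tokens) pairs."""
    return [(prompt, tuple(rollout[:t])) for t in range(len(rollout))]


x = "2 + 3 * 4 = ?"
y_teacher = ["3*4", "=12", ";", "2+12", "=14"]  # hypothetical teacher rollout y^T
y_student = ["3*4", "=12", "so", "14"]          # hypothetical student rollout y^S

teacher_contexts = prefixes(x, y_teacher)  # c^T_0, ..., c^T_{T-1}
student_contexts = prefixes(x, y_student)  # c^S_0, ..., c^S_{T-1}

# Under every context, both models would propose a next token to be scored.
for prompt, prefix in teacher_contexts[:3]:
    print(prompt, prefix)
```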

2.2 Local Validity

To evaluate the quality of individual reasoning steps, we introduce an auxiliary validity judge $J$ that assigns a local validity score $r(c_{t-1},a_t)\in[0,1]$ to a proposed next token $a_t$ under prefix $c_{t-1}$. The judge operates purely at the token level, assessing the local coherence of the transition $(c_{t-1},a_t)$ without considering global solution correctness or full trajectories. At each step $t$, teacher and student produce candidate tokens $a_t^T$ and $a_t^S$. We compare their local validity under the same prefix. Under the teacher prefix $c^T_{t-1}$ and the student prefix $c^S_{t-1}$, the relative validity ratios are:

$$w_t^T=\frac{r(c^T_{t-1},a^S_t)}{r(c^T_{t-1},a^T_t)+\varepsilon},\qquad w_t^S=\frac{r(c^S_{t-1},a^S_t)}{r(c^S_{t-1},a^T_t)+\varepsilon}, \tag{1}$$

where $\varepsilon$ is a small positive constant used for numerical stability. These prefix-conditioned ratios act as local learning-signal scalers, determining the strength of the distillation update at each decision point. For clarity, we use $w_t$ below as a generic placeholder referring to either $w_t^T$ or $w_t^S$, depending on whether the update is applied under a teacher or student prefix. The ratios induce three regimes:

  • $w_t\approx 1$: parity. Teacher and student propose similarly justified moves; the update behaves like standard distillation.

  • $w_t<1$: attenuation. The teacher’s move is locally superior; reducing the update prevents the student from over-committing in regions where the local reasoning landscape is weakly informative or difficult to learn.

  • $w_t>1$: amplification. The student proposes a more locally coherent step. This reflects under-specification in the teacher distribution, not incorrectness, and amplifying the update sharpens the student’s trajectory within the teacher-supported solution manifold.
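The ratio in Eq. (1) and the three regimes can be sketched in a few lines. The function names, the default `eps`, and the tolerance `tol` used to decide when $w_t\approx 1$ are assumptions for illustration; the text does not fix a numeric width for the parity band.

```python
def validity_ratio(r_student, r_teacher, eps=1e-6):
    """Eq. (1): student validity over teacher validity under one shared prefix."""
    return r_student / (r_teacher + eps)


def regime(w, tol=0.05):
    """Name the regime for a ratio w; `tol` is an assumed width for 'w close to 1'."""
    if abs(w - 1.0) <= tol:
        return "parity"
    return "amplification" if w > 1.0 else "attenuation"


# Hypothetical judge scores (student, teacher) under the same prefix.
for r_s, r_t in [(0.9, 0.6), (0.3, 0.9), (0.8, 0.8)]:
    w = validity_ratio(r_s, r_t)
    print(f"r_s={r_s}, r_t={r_t}, w={w:.3f}, regime={regime(w)}")
```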

2.3 Distillation Objective

For any prefix $c$, the teacher and student define next-token distributions $p(\cdot\mid c)$ and $q_\theta(\cdot\mid c)$. Rather than relying on the standard forward or reverse KL divergences, we adopt the skew KL (SKL) and skew reverse KL (SRKL) objectives of Ko et al. (2025), which provide more stable behavior when teacher and student condition on different prefixes. We first introduce the mixture distributions:

$$m^{(\alpha)}_{p,q_\theta}(a\mid c)=\alpha\,p(a\mid c)+(1-\alpha)\,q_\theta(a\mid c),\qquad m^{(1-\alpha)}_{p,q_\theta}(a\mid c)=(1-\alpha)\,p(a\mid c)+\alpha\,q_\theta(a\mid c),$$

with $\alpha\in[0,1]$ controlling the skew between teacher and student updates. The skewed divergences are then defined as:

$$D^{(\alpha)}_{\mathrm{SKL}}(p\,\|\,q_\theta)=\mathrm{KL}\!\left(p(\cdot\mid c)\,\Big\|\,m^{(\alpha)}_{p,q_\theta}(\cdot\mid c)\right),\qquad D^{(\alpha)}_{\mathrm{SRKL}}(p\,\|\,q_\theta)=\mathrm{KL}\!\left(q_\theta(\cdot\mid c)\,\Big\|\,m^{(1-\alpha)}_{p,q_\theta}(\cdot\mid c)\right). \tag{2}$$
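For a single prefix, the two skewed divergences in Eq. (2) can be written directly from the mixture definitions. The sketch below uses plain Python lists as discrete distributions; the function names and the toy distributions are assumptions for illustration.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions on the same support (0 * log 0 = 0)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def mixture(p, q, alpha):
    """m^(alpha) = alpha * p + (1 - alpha) * q."""
    return [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]


def skl(p, q, alpha):
    """Skew KL: KL(p || alpha * p + (1 - alpha) * q)."""
    return kl(p, mixture(p, q, alpha))


def srkl(p, q, alpha):
    """Skew reverse KL: KL(q || (1 - alpha) * p + alpha * q)."""
    return kl(q, mixture(p, q, 1.0 - alpha))


p = [0.7, 0.2, 0.1]         # hypothetical teacher next-token distribution
q = [0.1, 0.3, 0.6]         # hypothetical student next-token distribution
q_sparse = [0.0, 0.5, 0.5]  # student puts zero mass on the teacher's top token

print(skl(p, q, 0.1), srkl(p, q, 0.1))
print(skl(p, q_sparse, 0.1))  # finite, whereas KL(p || q_sparse) would be infinite
```

For $\alpha>0$ the mixture in the second argument always has mass wherever the first argument does, which is one way to see why the skewed forms stay finite when teacher and student supports differ.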
Teacher-prefix supervision (LV–SKL) & Student-prefix supervision (LV–SRKL).

Under teacher prefixes $c^T_{t-1}$ and student prefixes $c^S_{t-1}$, we weight the SKL and the SRKL divergences using the local-validity ratios $w_t^T$ and $w_t^S$, respectively:

$$\mathcal{L}_{\mathrm{LV\!-\!SKL}}=\sum_{t=1}^{T}w_t^T\,D^{(\alpha)}_{\mathrm{SKL}}\!\big(p(\cdot\mid c^T_{t-1})\,\big\|\,q_\theta(\cdot\mid c^T_{t-1})\big),\qquad \mathcal{L}_{\mathrm{LV\!-\!SRKL}}=\sum_{t=1}^{T}w_t^S\,D^{(\alpha)}_{\mathrm{SRKL}}\!\big(p(\cdot\mid c^S_{t-1})\,\big\|\,q_\theta(\cdot\mid c^S_{t-1})\big). \tag{3}$$
Final objective.

The overall loss combines both supervision regimes:

$$\mathcal{L}=\lambda_T\,\mathcal{L}_{\mathrm{LV\!-\!SKL}}+\lambda_S\,\mathcal{L}_{\mathrm{LV\!-\!SRKL}},\qquad\lambda_T,\lambda_S\geq 0. \tag{4}$$

This objective preserves the geometric stability benefits of skewed KL while allocating token-level supervision according to the locally inferred validity of each teacher–student decision point. By dynamically attenuating or amplifying updates based on relative local justification, VCRD provides a principled, context-dependent alternative to trajectory-level distillation. For completeness, Algorithm 1 in Appendix 9 summarizes the full training procedure.
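A toy evaluation of Eqs. (3) and (4) is sketched below. It is not the training procedure of Algorithm 1: the function name `vcrd_loss`, the default `alpha`, and the per-step tuples are assumptions, and the validity ratios are passed in as precomputed constants.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions on the same support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def mix(p, q, alpha):
    """alpha * p + (1 - alpha) * q."""
    return [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]


def vcrd_loss(teacher_steps, student_steps, alpha=0.1, lam_t=1.0, lam_s=1.0):
    """Eq. (4) on toy inputs.

    Each step is (p, q, w): teacher and student next-token distributions under
    that step's prefix, plus the validity ratio for that prefix.
    `teacher_steps` use teacher prefixes, `student_steps` use student prefixes.
    """
    lv_skl = sum(w * kl(p, mix(p, q, alpha)) for p, q, w in teacher_steps)
    lv_srkl = sum(w * kl(q, mix(p, q, 1.0 - alpha)) for p, q, w in student_steps)
    return lam_t * lv_skl + lam_s * lv_srkl


# Hypothetical distributions and ratios for a two-step teacher prefix
# and a one-step student prefix.
teacher_steps = [
    ([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], 1.4),  # amplified step
    ([0.1, 0.8, 0.1], [0.3, 0.4, 0.3], 0.5),  # attenuated step
]
student_steps = [([0.6, 0.3, 0.1], [0.2, 0.5, 0.3], 1.0)]  # parity

loss = vcrd_loss(teacher_steps, student_steps)
print(loss)
```

Because each weight multiplies a non-negative divergence, rescaling a weight changes how much that decision point contributes without changing which distribution the term pulls the student toward.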

3 Theoretical Perspective

We provide a theoretical view that motivates validity-calibrated distillation. Rather than treating distillation as trajectory imitation, we view it as allocating learning signal across decision points in which the next step is often locally under-specified. Formal proofs appear in Appendix 8.

3.1 Reasoning as Sequential Local Decision Making

We model autoregressive reasoning as a sequential decision process. At step $t$, the model conditions on the prefix $c_t=(x,y_{\leq t})$ and selects an action $a_t\in\mathcal{V}$. The teacher and student induce next-token policies $\pi(a\mid c_t)=p(a\mid c_t)$ and $\pi_\theta(a\mid c_t)=q_\theta(a\mid c_t)$, respectively. A key property of reasoning tasks is local under-specification: correctness constrains the final answer but typically does not fix a unique continuation at each prefix. Disagreement between teacher and student at $c_t$ therefore need not indicate error, but may reflect multiple locally coherent moves. This invalidates uniform token-level imitation and raises the question:

Given a specific prefix $c_t$, how much learning signal should be allocated to this decision?

To formalize this idea, consider improving the student at a fixed prefix $c_t$. The teacher distribution $\pi(\cdot\mid c_t)$ provides a strong structural prior, while improvement should favor actions with high local validity $r(c_t,a)$. This naturally leads to the teacher-anchored trust-region objective:

$$\max_{\tilde{\pi}}\ \mathbb{E}_{a\sim\tilde{\pi}}\big[\,r(c_t,a)\,\big]\quad\text{s.t.}\quad\mathrm{KL}\!\big(\tilde{\pi}(\cdot\mid c_t)\,\big\|\,\pi(\cdot\mid c_t)\big)\leq\delta. \tag{5}$$

As shown in subsection 8.1 of Appendix 8, the optimal solution takes the form of an exponentially tilted distribution:

$$\tilde{\pi}^{\star}(a\mid c_t)\propto\pi(a\mid c_t)\,\exp\!\big(\eta\,r(c_t,a)\big), \tag{6}$$

for a multiplier $\eta\geq 0$ determined by the trust-region radius. Rather than imitating a full trajectory, this update redistributes probability mass within the teacher’s support toward actions that exhibit higher local validity. This cleanly separates structural guidance (the teacher manifold) from per-step learning-signal strength, providing the foundation for VCRD.
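The exponentially tilted solution in Eq. (6) is easy to inspect on a small vocabulary. In the sketch below, the four-token teacher distribution and the validity scores are hypothetical, and $\eta$ is set by hand rather than solved from the radius $\delta$.

```python
import math


def tilt(pi, r, eta):
    """Eq. (6): pi_tilde(a) proportional to pi(a) * exp(eta * r(a))."""
    unnorm = [p * math.exp(eta * ra) for p, ra in zip(pi, r)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected(dist, r):
    """Expected validity under `dist`."""
    return sum(d * ra for d, ra in zip(dist, r))


def kl(a, b):
    """KL(a || b); terms with a_i = 0 contribute zero."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


pi = [0.6, 0.3, 0.1, 0.0]  # hypothetical teacher distribution; last token unsupported
r = [0.2, 0.9, 0.5, 1.0]   # hypothetical local validity scores in [0, 1]

rows = []
for eta in (0.0, 0.5, 2.0):
    tilted = tilt(pi, r, eta)
    rows.append((eta, expected(tilted, r), kl(tilted, pi)))
    print(eta, [round(t, 3) for t in tilted])
```

The last token has the highest validity but zero teacher probability, so it stays at zero for every $\eta$: mass moves only within the teacher’s support, while both the expected validity and the KL distance from the teacher grow with $\eta$.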

3.2 From optimal improvement to first-order learning-signal allocation

The trust-region objective in Eq. (5) specifies an ideal update: reallocating probability mass within the teacher distribution according to local validity. Computing the exponentially tilted optimum in Eq. (6), however, is infeasible in large-vocabulary LLMs because the validity judge can evaluate only a small number of candidate actions. Rather than constructing the full distribution, we interpret the trust-region solution as defining a first-order improvement direction. Expanding $\log\tilde{\pi}^{\star}(a\mid c_t)$ around the teacher policy $\pi(a\mid c_t)$ gives, for sufficiently small $\eta$:

$$\log\tilde{\pi}^{\star}(a\mid c_t)=\log\pi(a\mid c_t)+\eta\left(r(c_t,a)-\mathbb{E}_{a'\sim\pi}\big[r(c_t,a')\big]\right)+\mathcal{O}(\eta^{2}). \tag{7}$$

Thus, to first order, the exponential-tilt update implies a log-probability change

$$\Delta\log\tilde{\pi}(a\mid c_t)\propto r(c_t,a)-\mathbb{E}_{a'\sim\pi(\cdot\mid c_t)}\big[r(c_t,a')\big], \tag{8}$$

showing that only relative validity under the same prefix matters. In practice, the expectation in Eq. (8) is unavailable: at each prefix $c_t$ we observe only two realized actions, the teacher token $a^T_t$ and the student token $a^S_t$, together with their validities. This constitutes a two-sample bandit-feedback setting. To obtain a stable, scale-invariant proxy for local improvement strength, we define $w(c_t)=\frac{r(c_t,a^S_t)}{r(c_t,a^T_t)+\varepsilon}$ as the validity ratio, with $\varepsilon>0$ for numerical stability. The ratio $w(c_t)$ does not alter the direction of a KL-anchored update; it rescales its magnitude. In Section 11, we conduct an ablation study comparing this relative-validity formulation with the direct difference $r(c_t,a^S_t)-r(c_t,a^T_t)$ (i.e., $r_s-r_t$) and other alternative weighting choices, and find that our formulation yields consistently stronger performance and more stable training. The resulting first-order update at prefix $c_t$ is:

$$\Delta\theta_t\propto w(c_t)\,\nabla_\theta\!\left[-\mathrm{KL}\big(\pi(\cdot\mid c_t)\,\|\,\pi_\theta(\cdot\mid c_t)\big)\right], \tag{9}$$

with derivation in Appendix 8.2. The KL anchoring terms fix the update directions and keep learning anchored to the teacher’s solution manifold, while the validity ratio controls the local step size: $w(c_t)<1$ attenuates updates when the student proposes a weak continuation, and $w(c_t)>1$ amplifies updates when the teacher’s sampled continuation is locally under-specified. This implements a principled first-order allocator of learning signal using only the two actions available at each prefix.

Viewed in this light, the distillation objective introduced in the previous section can be interpreted as a practical instantiation of this first-order update. Rather than explicitly constructing the optimal distribution $\tilde{\pi}^{\star}$, the method decomposes its effect across token-level KL gradients under both teacher- and student-conditioned prefixes. This realizes a distributed approximation of the trust-region improvement, in which learning signal is adaptively allocated across decision points.
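The claim that $w(c_t)$ rescales the update without rotating it can be illustrated in logit space. The sketch assumes, purely for illustration, that the student parameters are the logits $z$ of a softmax over three tokens, so that $\nabla_z\big[-\mathrm{KL}(\pi\,\|\,\mathrm{softmax}(z))\big]=\pi-\mathrm{softmax}(z)$; the distributions and the ratio 1.5 are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def kl_to_student(p, z):
    """KL(p || softmax(z)) as a function of the student logits z."""
    q = softmax(z)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def scaled_update(p, z, w):
    """Eq. (9) in logit space: w * grad_z[-KL(p || softmax(z))] = w * (p - q)."""
    q = softmax(z)
    return [w * (pi - qi) for pi, qi in zip(p, q)]


p = [0.7, 0.2, 0.1]   # hypothetical teacher distribution at one prefix
z = [0.0, 0.5, -0.3]  # hypothetical student logits at the same prefix

base = scaled_update(p, z, 1.0)       # parity
amplified = scaled_update(p, z, 1.5)  # student step judged more valid

norm = lambda v: math.sqrt(sum(c * c for c in v))
cosine = sum(a * b for a, b in zip(base, amplified)) / (norm(base) * norm(amplified))
print(cosine, norm(amplified) / norm(base))
```

The cosine between the two updates is 1 up to rounding and their norms differ by the ratio itself, which is the sense in which the weight sets the step size only.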

4 Experiments

We evaluate our VCRD framework across three complementary reasoning domains: (i) mathematical problem solving, (ii) code generation, and (iii) instruction following. These settings differ substantially in structure, difficulty, and the diversity of locally valid continuations, providing a comprehensive assessment of our proposed method. We follow a setup similar to that of Ko et al. (2025); full experimental details appear in Appendix 10. As the validity judge, we use a Process Reward Model (PRM), which provides dense, step-level feedback on the local correctness of intermediate reasoning steps. Specifically, we employ Skywork-o1-OpenPRM-Qwen-2.5-1.5B (o1 Team, 2024), one of the most advanced open-source PRMs available at the time of our experiments. This PRM is built on Qwen2.5-Math-1.5B-Instruct, an instruction-tuned backbone, and its released model card reports strong performance on both mathematical and coding evaluations (https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-1.5B).

Given a prefix $c$ and a candidate next token $a$, the PRM returns a scalar score $r(c,a)\in[0,1]$, with higher values indicating more coherent and contextually plausible reasoning. These prefix-conditioned signals serve as the backbone of VCRD, enabling us to compare the relative validity of the teacher’s and student’s next-step decisions under the same context and modulate the distillation strength accordingly. We further analyze sensitivity to the validity judge in Appendix 13, showing that VCRD achieves nearly identical performance with 1.5B and 7B PRMs, which indicates robustness to PRM size.

4.1 Mathematical Reasoning

Setup. We evaluate VCRD on various math‑reasoning benchmarks, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), Minerva (Lewkowycz et al., 2022), Gaokao (Zhang et al., 2023), Olympiad (He et al., 2024), SAT‑Math (Zhong et al., 2024), CMATH (Wei et al., 2023), AMC23 (AI-MO, 2024), and AIME24 (MAA, 2025). We use the same setup as Ko et al. (2025) with two teacher–student configurations: Qwen2-Math-7B-Instruct → Qwen2-Math-1.5B and Qwen2.5-Math-7B-Instruct → Qwen2.5-Math-1.5B.
Results. Tables 1 and 2 show that VCRD achieves the best average math-reasoning performance among the distillation methods in both the Qwen2-Math and Qwen2.5-Math settings. In the Qwen2-Math 7B→1.5B configuration, VCRD reaches an average Pass@1 of 56.06, outperforming the strongest baseline DistiLLM (54.47) and yielding gains on challenging datasets such as Olympiad. The improvements are even larger for Qwen2.5-Math 7B→1.5B, where VCRD attains 59.96 (+2.09 over DistiLLM-2), with substantial boosts on competition-level benchmarks, most notably AMC23 and Olympiad. Remarkably, the VCRD-distilled 1.5B student approaches the average performance of its 7B teacher in the Qwen2.5 setting (59.96 vs. 61.63) despite being nearly five times smaller. These results reflect VCRD’s core principle: relative-validity weighting provides a more reliable learning signal than uniform imitation, enabling the student to benefit from strong teacher steps while avoiding propagation of locally ambiguous reasoning.

Table 1: Pass@1 results for distilling Qwen2-Math-7B-Instruct ($\mathcal{M}_T$) into Qwen2-Math-1.5B ($\mathcal{M}_S$) on eight math-reasoning benchmarks. AVG is the average over all datasets.

| Method | GSM8K | MATH | SVAMP | Minerva | Gaokao | Olympiad | AMC23 | SAT-Math | AVG |
|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{M}_T$ | 88.40 | 74.90 | 94.30 | 30.90 | 64.70 | 37.60 | 60.00 | 100.0 | 68.85 |
| $\mathcal{M}_S$ | 26.20 | 20.50 | 21.70 | 6.60 | 16.10 | 7.40 | 10.00 | 81.20 | 23.71 |
| KD | 80.10 | 61.00 | 84.30 | 26.50 | 51.90 | 23.90 | 27.50 | 40.60 | 49.48 |
| GKD | 81.00 | 61.10 | 84.70 | 25.00 | 52.70 | 23.70 | 25.00 | 78.10 | 53.91 |
| DistiLLM | 80.90 | 61.00 | 85.80 | 27.20 | 50.90 | 23.70 | 37.50 | 68.80 | 54.47 |
| ABKD | 81.50 | 61.30 | 85.50 | 24.30 | 51.90 | 22.40 | 32.50 | 43.80 | 50.40 |
| DistiLLM-2 | 81.90 | 61.50 | 85.00 | 23.90 | 54.00 | 24.40 | 30.00 | 65.60 | 53.29 |
| VCRD | 81.00 | 61.80 | 85.90 | 25.70 | 53.20 | 24.70 | 35.00 | 81.20 | 56.06 |

Table 2: Pass@1 results for distilling Qwen2.5-Math-7B-Instruct ($\mathcal{M}_T$) into Qwen2.5-Math-1.5B ($\mathcal{M}_S$) across seven math-reasoning benchmarks. AVG denotes the average score across all datasets.

| Method | GSM8K | MATH | Gaokao | Olympiad | CMATH | AIME24 | AMC23 | AVG |
|---|---|---|---|---|---|---|---|---|
| $\mathcal{M}_T$ | 85.10 | 76.30 | 66.50 | 37.00 | 89.80 | 16.70 | 60.00 | 61.63 |
| $\mathcal{M}_S$ | 79.10 | 51.10 | 42.10 | 16.90 | 62.00 | 3.30 | 22.50 | 39.57 |
| KD | 85.50 | 73.70 | 62.90 | 32.40 | 88.20 | 10.00 | 47.50 | 57.17 |
| GKD | 85.50 | 74.10 | 60.00 | 33.30 | 88.80 | 6.70 | 50.00 | 56.91 |
| DistiLLM | 84.90 | 74.40 | 60.30 | 33.50 | 89.30 | 13.30 | 47.50 | 57.60 |
| ABKD | 84.20 | 74.30 | 61.80 | 33.20 | 88.50 | 6.70 | 47.50 | 56.60 |
| DistiLLM-2 | 85.70 | 73.20 | 61.60 | 34.10 | 89.70 | 13.30 | 47.50 | 57.87 |
| VCRD | 85.00 | 73.70 | 62.10 | 35.40 | 90.20 | 13.30 | 60.00 | 59.96 |
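The AVG column and the +2.09 gap quoted for the Qwen2.5-Math setting follow directly from the per-benchmark scores. The short check below copies two rows from Table 2 (VCRD and the strongest baseline by AVG); the dictionary keys are illustrative labels only.

```python
# Per-benchmark Pass@1 scores copied from Table 2, in column order:
# GSM8K, MATH, Gaokao, Olympiad, CMATH, AIME24, AMC23.
table2 = {
    "best_baseline": [85.70, 73.20, 61.60, 34.10, 89.70, 13.30, 47.50],
    "VCRD": [85.00, 73.70, 62.10, 35.40, 90.20, 13.30, 60.00],
}


def average(scores):
    """Unweighted mean over benchmarks."""
    return sum(scores) / len(scores)


avg = {name: round(average(scores), 2) for name, scores in table2.items()}
gap = round(avg["VCRD"] - avg["best_baseline"], 2)
print(avg, gap)
```

Most of the gap comes from a single column: AMC23 differs by 12.5 points between the two rows, while the other six benchmarks differ by at most 1.3.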

4.2 Code Generation

Setup. For code generation, we evaluate VCRD on HumanEval, HumanEval+ (Chen, 2021), MBPP, and MBPP+ (Austin et al., 2021). Distillation is conducted using WizardCoder prompts (Luo et al., 2024b) under two teacher–student configurations: Qwen2.5-Coder-7B-Instruct (Hui et al., 2024) → Qwen2.5-Coder-1.5B and Qwen2.5-Coder-14B-Instruct → Qwen2.5-Coder-7B-Instruct. This setup enables evaluation across both standard benchmarks and different-scale teacher–student pairs.
Results. Table 3 shows that VCRD attains the best average code-generation performance in both Qwen2.5-Coder distillation settings. In the 7B→1.5B configuration, VCRD attains an average Pass@1 of 67.72, outperforming all baselines. Even with a larger-scale teacher–student pair (14B→7B), VCRD reaches an average of 81.75 and achieves 83.5 on HumanEval+, compared with 82.3 for the best baseline. Although code generation is more deterministic than mathematical reasoning, VCRD still yields measurable gains, indicating that prefix-conditioned validity weighting improves student updates even in structured synthesis tasks.

Table 3: Pass@1 results on HumanEval and MBPP benchmarks for two Qwen2.5-Coder distillation settings. AVG denotes the average Pass@1 across the four tasks.

Qwen2.5-Coder-7B-Inst ($\mathcal{M}_T$) → Qwen2.5-Coder-1.5B ($\mathcal{M}_S$)

| Method | HEval | HEval+ | MBPP | MBPP+ | AVG |
|---|---|---|---|---|---|
| $\mathcal{M}_T$ | 91.50 | 86.00 | 82.00 | 70.40 | 82.47 |
| $\mathcal{M}_S$ | 69.50 | 64.00 | 73.00 | 61.40 | 66.97 |
| KD | 68.90 | 61.00 | 72.80 | 61.90 | 66.15 |
| GKD | 68.30 | 62.80 | 73.30 | 63.20 | 66.90 |
| DistiLLM | 68.30 | 62.80 | 73.30 | 62.70 | 66.77 |
| ABKD | 69.50 | 64.00 | 71.20 | 60.60 | 66.32 |
| DistiLLM-2 | 67.70 | 61.60 | 73.80 | 63.20 | 66.57 |
| VCRD | 69.50 | 64.60 | 73.80 | 63.00 | 67.72 |

Qwen2.5-Coder-14B-Inst ($\mathcal{M}_T$) → Qwen2.5-Coder-7B-Inst ($\mathcal{M}_S$)

| Method | HEval | HEval+ | MBPP | MBPP+ | AVG |
|---|---|---|---|---|---|
| $\mathcal{M}_T$ | 91.50 | 86.60 | 85.40 | 72.80 | 84.07 |
| DistiLLM | 87.80 | 82.30 | 82.50 | 70.40 | 80.75 |
| DistiLLM-2 | 88.40 | 82.30 | 83.90 | 70.90 | 81.37 |
| VCRD | 89.00 | 83.50 | 83.30 | 71.20 | 81.75 |

4.3 General Instruction-Following

Setup.

For instruction following, we evaluate VCRD under two teacher–student configurations (Qwen2.5-7B-Instruct → Qwen2.5-1.5B and Qwen2-7B-Instruct → Qwen2-1.5B) using distillation on UltraChat200k (Ding et al., 2023). Performance is measured via win rate on AlpacaEval (Li et al., 2023b), Evol-Instruct (Xu et al., 2024), and UltraFeedback (Li et al., 2023b), using GPT-4o or GPT-4o-mini as an LLM-as-a-Judge (Zheng et al., 2023) following prior work (Ko et al., 2025).

Results.

Table 4 shows that VCRD achieves the highest average win rate among the distillation approaches in both the Qwen2.5 and Qwen2 instruction-following settings. In the Qwen2.5-7B→1.5B configuration, VCRD achieves an average win rate of 66.50, surpassing the strongest baseline DistiLLM (65.47). It improves AlpacaEval and Evol-Instruct, for example raising Evol-Instruct from 55.4 to 57.8, while staying within 0.25 points of DistiLLM on UltraFeedback (65.67 vs. 65.92). Even larger gains are observed in the Qwen2-7B→1.5B setup, where VCRD attains an average win rate of 52.96, outperforming DistiLLM-2 (50.61) by +2.35, driven mainly by Evol-Instruct (44.5 vs. 37.3). These results indicate that local-validity calibration is particularly effective for instruction-following tasks, where prefixes are often under-specified and teacher trajectories exhibit substantial variability.

Table 4: Win rate (WR, %) on instruction-following datasets for Qwen2.5-7B-Instruct $\rightarrow$ Qwen2.5-1.5B and Qwen2-7B-Instruct $\rightarrow$ Qwen2-1.5B. AVG is the average WR across the three benchmarks.
Method Qwen2.5-7B-Inst ($\mathcal{M}_{T}$) $\rightarrow$ Qwen2.5-1.5B ($\mathcal{M}_{S}$) Qwen2-7B-Inst ($\mathcal{M}_{T}$) $\rightarrow$ Qwen2-1.5B ($\mathcal{M}_{S}$)
AlpacaEval Evol-Inst UltraFeed AVG AlpacaEval Evol-Inst UltraFeed AVG
$\mathcal{M}_{T}$ 91.43 69.67 77.16 79.42 88.54 59.75 64.86 71.05
$\mathcal{M}_{S}$ 60.09 30.73 42.94 44.59 56.11 19.49 38.20 37.93
KD 72.14 54.59 62.60 63.11 60.81 32.45 49.25 47.50
SeqKD 67.92 46.33 52.89 55.71 56.58 28.32 39.10 41.33
DistiLLM 75.11 55.38 65.92 65.47 64.13 35.66 51.26 50.35
DistiLLM-2 73.66 56.77 65.57 65.33 62.58 37.27 51.97 50.61
VCRD 76.02 57.80 65.67 66.50 62.83 44.49 51.56 52.96
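
The averages and gaps quoted for Table 4 can be re-derived directly from the per-benchmark win rates. The following is a small sketch of that arithmetic; the dictionary layout and the helper names `average` and `gap_to_best_baseline` are illustrative choices, not part of the paper's code.

```python
# Win rates (%) copied from Table 4, ordered as (AlpacaEval, Evol-Inst, UltraFeed).
TABLE4 = {
    "Qwen2.5-7B -> 1.5B": {
        "KD": (72.14, 54.59, 62.60),
        "SeqKD": (67.92, 46.33, 52.89),
        "DistiLLM": (75.11, 55.38, 65.92),
        "DistiLLM-2": (73.66, 56.77, 65.57),
        "VCRD": (76.02, 57.80, 65.67),
    },
    "Qwen2-7B -> 1.5B": {
        "KD": (60.81, 32.45, 49.25),
        "SeqKD": (56.58, 28.32, 39.10),
        "DistiLLM": (64.13, 35.66, 51.26),
        "DistiLLM-2": (62.58, 37.27, 51.97),
        "VCRD": (62.83, 44.49, 51.56),
    },
}


def average(scores):
    """Unweighted mean over the benchmarks."""
    return sum(scores) / len(scores)


def gap_to_best_baseline(rows, method="VCRD"):
    """Return (name of the best baseline by average WR, gap of `method` over it)."""
    baselines = {name: average(s) for name, s in rows.items() if name != method}
    best = max(baselines, key=baselines.get)
    return best, average(rows[method]) - baselines[best]


for setting, rows in TABLE4.items():
    best, gap = gap_to_best_baseline(rows)
    print(f"{setting}: VCRD AVG = {average(rows['VCRD']):.2f}, "
          f"best baseline = {best}, gap = {gap:+.2f}")
```

Comparing on the unweighted mean follows the AVG column of the table; per-benchmark comparisons can differ in sign from the average gap.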

4.4 Critical Role of Amplification

Figure 3: Qwen2.5-Math-7B-Inst $\rightarrow$ Qwen2.5-Math-1.5B (left) and Qwen2.5-Coder-7B-Inst $\rightarrow$ Qwen2.5-Coder-1.5B (right).

A key consequence of validity-calibrated supervision is the emergence of an amplification regime, in which the distillation update is strengthened when the student’s locally proposed step is judged more valid than the teacher’s under the same prefix. This behavior directly follows from the breakdown of the monotonicity assumption: global teacher superiority does not guarantee locally optimal reasoning decisions at every step. To isolate the role of amplification, Figure 3 compares full VCRD against a variant in which amplification is disabled by clamping all validity-based weights to 1, thereby enforcing uniform update strength. Across both mathematical reasoning (left) and code generation (right), the full VCRD consistently outperforms the clamped variant, demonstrating that suppressing locally superior student decisions leads to systematic degradation.

The performance gap is most pronounced on more challenging benchmarks such as AMC23, AIME24, and HumanEval+, where intermediate reasoning steps are noisier and locally under-specified. In these regimes, positive reinforcement of locally well-justified student moves plays a critical role in stabilizing learning. These results confirm that amplification is not a heuristic addition, but a necessary component of VCRD when local learning signal is not ordered by global model quality.
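
To make the amplification regime concrete, the sketch below computes ratio-style validity weights and contrasts them with the uniform-weight variant described above. It assumes the weight form $w_t = r_S/(r_T+\varepsilon)$ from Algorithm 1 and a simple mean over per-token KL terms; the validity scores, the KL values, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def validity_weights(r_student, r_teacher, eps=1e-6):
    """Per-token weights w_t = r_S / (r_T + eps)."""
    r_student = np.asarray(r_student, dtype=float)
    r_teacher = np.asarray(r_teacher, dtype=float)
    return r_student / (r_teacher + eps)


def weighted_loss(per_token_kl, weights):
    """Mean of weight-scaled per-token KL terms (weights treated as constants)."""
    per_token_kl = np.asarray(per_token_kl, dtype=float)
    return float(np.mean(np.asarray(weights, dtype=float) * per_token_kl))


# Hypothetical validity scores and per-token KL values for three positions.
r_student = [0.20, 0.80, 0.90]
r_teacher = [0.80, 0.80, 0.30]
per_token_kl = [0.50, 0.50, 0.50]

w_full = validity_weights(r_student, r_teacher)  # attenuates position 0, amplifies position 2
w_uniform = np.ones_like(w_full)                 # ablated variant: every weight fixed to 1

print("weights:", np.round(w_full, 2))
print("amplified positions:", np.flatnonzero(w_full > 1.0))
print("full loss:", weighted_loss(per_token_kl, w_full))
print("uniform loss:", weighted_loss(per_token_kl, w_uniform))
```

In this toy setting only the last position falls in the amplification regime ($w_t>1$); fixing all weights to 1 removes both the attenuation at the first position and the amplification at the last.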

4.5 Ablation Study

Table 5 compares VCRD with constant-weight counterparts that use the same loss components but remove validity modulation. Here, LV-SKL ($w^{T}=1$) corresponds to teacher-prefix distillation with uniform weights, LV-SRKL ($w^{S}=1$) corresponds to pure on-policy distillation with student rollouts and per-token skewed reverse-KL supervision, and LV-SKL+LV-SRKL ($w^{T}=w^{S}=1$) corresponds to mixed-prefix distillation with uniform weights. Across both mathematical reasoning and code generation, validity calibration improves over each corresponding constant-weight objective, by 0.43 to 3.05 points in average score. Compared with pure on-policy distillation (LV-SRKL, $w^{S}=1$), VCRD improves the average score by $+3.72$ points on math and by $+2.27$ points on code. These gains show that the improvement is not simply due to using student rollouts, teacher prefixes, or mixed-prefix training. Instead, the improvement comes from calibrating the strength of the distillation update according to the local validity of teacher and student continuations.

Table 5: Contribution of validity calibration across mathematical reasoning and code generation. LV-SRKL ($w^{S}=1$) denotes pure on-policy distillation with student rollouts, per-token reverse-KL supervision, and no validity modulation. LV-SKL+LV-SRKL ($w^{T}=w^{S}=1$) is the constant-weight mixed-prefix baseline.
Qwen2.5-Math-7B-Instruct ($\mathcal{M}_{T}$) $\rightarrow$ Qwen2.5-Math-1.5B ($\mathcal{M}_{S}$)
Method GSM8K MATH Gaokao Olympiad CMATH AIME24 AMC23 AVG
LV-SKL ($w^{T}=1$) 84.50 73.80 62.30 34.70 86.30 6.70 47.50 56.54
LV-SRKL ($w^{S}=1$) 85.70 72.90 59.00 33.90 88.00 6.70 47.50 56.24
LV-SKL+LV-SRKL ($w^{T}=w^{S}=1$) 85.50 73.20 61.30 34.10 90.20 13.30 47.50 57.87
LV-SKL 83.70 73.80 62.60 34.70 87.50 10.10 60.00 58.91
LV-SRKL 86.60 72.80 61.00 35.10 88.70 10.00 42.50 56.67
VCRD 85.00 73.70 62.10 35.40 90.20 13.30 60.00 59.96
Qwen2.5-Coder-7B-Instruct ($\mathcal{M}_{T}$) $\rightarrow$ Qwen2.5-Coder-1.5B ($\mathcal{M}_{S}$)
Method HumanEval HumanEval+ MBPP MBPP+ AVG
LV-SKL ($w^{T}=1$) 66.50 60.40 73.50 63.00 65.85
LV-SRKL ($w^{S}=1$) 65.90 60.40 73.30 62.20 65.45
LV-SKL+LV-SRKL ($w^{T}=w^{S}=1$) 64.60 59.10 72.80 62.20 64.67
LV-SKL 68.90 62.80 73.00 63.00 66.92
LV-SRKL 67.10 59.80 74.30 63.20 66.10
VCRD 69.50 64.60 73.80 63.00 67.72

4.6 PRM‑Free Validity Approximation

In many practical distillation scenarios, most notably for models such as DeepSeek, no pretrained PRM is available to provide token‑level validity scores. To extend VCRD to this setting while remaining aligned with our theoretical framework, we replace the external judge with a teacher‑likelihood proxy: at each prefix, we compare the probability that the teacher assigns to the student’s proposed next token against a temperature‑softened teacher baseline.

Table 6: Pass@1 on HumanEval and HumanEval+ for DS-Coder-6.9B $\rightarrow$ DS-Coder-1.3B.
Method HumanEval HumanEval+ AVG
$\mathcal{M}_{T}$ 77.40 70.70 74.05
$\mathcal{M}_{S}$ 35.40 29.30 32.35
KD 42.10 37.20 39.65
SeqKD 41.50 36.00 38.75
GKD 40.09 36.00 38.05
DistiLLM 42.10 38.40 40.25
ABKD 42.10 37.80 39.95
DistiLLM-2 42.70 38.40 40.55
VCRD-Prob 44.50 40.90 42.70

This probability-ratio weight mirrors the structure of the trust-region analysis in Section 3, where the ratio $r_{s}/(r_{t}+\epsilon)$ acts as a first-order estimator of the local improvement direction implied by the exponentially tilted solution. Because the teacher action is sampled rather than taken greedily, this proxy naturally preserves the amplification regime ($w_{t}>1$), capturing local under-specification: cases in which the student proposes a more locally coherent continuation than the teacher’s sampled token. Empirically, this PRM-free variant behaves consistently with the PRM-based method: on coding (Table 6), VCRD-Prob raises the average HumanEval/HumanEval+ Pass@1 to 42.70, surpassing all baselines by 2.15 to 4.65 points, and in math reasoning (Table 7) it outperforms DistiLLM-2 on average (64.13 vs. 63.65). These results indicate that even without an explicit reward model, VCRD’s core principle, allocating learning signal by relative local validity, remains intact: weak student moves are attenuated, strong ones are amplified when the teacher sample is locally suboptimal, and the KL anchoring preserves stable convergence within the teacher’s solution manifold.
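
The teacher-likelihood proxy can be sketched as follows. This is one possible reading, not the paper's implementation: it assumes that $r_s$ is the teacher probability of the student's token at temperature 1, that $r_t$ is the temperature-softened teacher probability of the teacher's sampled token, and that the weight is $r_s/(r_t+\epsilon)$. The function names, the temperature value, and the toy logits are all hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()


def prob_ratio_weight(teacher_logits, student_token, teacher_token,
                      temperature=2.0, eps=1e-6):
    """Weight r_s / (r_t + eps) built from teacher probabilities only.

    r_s: teacher probability of the student's proposed token (temperature 1).
    r_t: softened teacher probability of the teacher's sampled token
         (assumed form of the 'temperature-softened teacher baseline').
    """
    r_s = softmax(teacher_logits)[student_token]
    r_t = softmax(teacher_logits, temperature)[teacher_token]
    return float(r_s / (r_t + eps))


# Toy teacher logits over a three-token vocabulary.
teacher_logits = [2.0, 1.0, 0.0]

# Student proposes the teacher's most likely token; the teacher sampled a weaker one.
w_amplify = prob_ratio_weight(teacher_logits, student_token=0, teacher_token=1)

# Student proposes an unlikely token; the teacher sampled its most likely one.
w_attenuate = prob_ratio_weight(teacher_logits, student_token=2, teacher_token=0)

print(w_amplify, w_attenuate)
```

Under these assumptions, the weight exceeds 1 exactly when the teacher's own sample is less likely (after softening) than the student's proposal, which is the situation the amplification regime is meant to capture.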

Table 7: Pass@1 results on four math-reasoning benchmarks for Qwen2.5-Math-7B-Instruct $\rightarrow$ Qwen2.5-Math-1.5B. AVG is the mean across the four tasks. VCRD-Prob denotes the PRM-free approach.
Method GSM8K MATH Gaokao Olympiad AVG
DistiLLM-2 85.70 73.20 61.60 34.10 63.65
VCRD-Prob 84.80 73.90 64.20 33.60 64.13

5 Concluding Remarks

This work revisits LLM reasoning distillation from a foundational perspective. We show that prevailing trajectory-based approaches rely on an implicit monotonicity assumption: that global teacher superiority induces uniformly reliable local learning signals. This assumption is misaligned with multi-step reasoning, where intermediate steps are often under-specified and the usefulness of supervision can vary substantially across decision points. Motivated by this insight, we introduced Validity-Calibrated Reasoning Distillation (VCRD), which reframes reasoning distillation as token-level learning-signal calibration rather than trajectory imitation. By comparing the relative local validity of teacher and student proposals under shared prefixes, VCRD adaptively modulates the strength of KL-based updates while preserving teacher-anchored optimization geometry. This enables principled attenuation in weakly informative regions and amplification when strong local reasoning evidence is present, without changing the direction of supervision or departing from the teacher’s solution manifold.

Empirically, VCRD yields consistent improvements across mathematical reasoning, code generation, and instruction-following benchmarks, including settings with strong teachers and diverse solution spaces. Ablation studies further confirm the importance of validity-based calibration and prefix-level supervision, aligning empirical behavior with the theoretical motivation. More broadly, our results suggest that effective reasoning distillation requires calibrating how strongly models learn from intermediate steps, rather than enforcing uniform imitation of a single reasoning trajectory.

VCRD currently uses PRMs to estimate token-level local validity. Although PRMs provide a natural way to compare teacher and student continuations under a shared prefix, their availability and calibration remain limited across model families and reasoning domains. To reduce this dependence, we introduced a PRM-free variant that uses a teacher-likelihood proxy in place of the external judge, preserving the same principle of relative local calibration without requiring an explicit reward model. Preliminary results show that this variant continues to outperform strong distillation baselines, suggesting that VCRD’s core mechanism is not tied to a specific PRM. At the same time, teacher likelihood remains an imperfect surrogate for local validity and may overlook important aspects of step-level reasoning quality. Developing general, well-calibrated, and domain-agnostic validity estimators therefore remains an important direction for future work.

References

  • J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §1, §1.
  • R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations, Cited by: §7.
  • AI-MO (2024) AMC 2023 dataset. Note: https://huggingface.co/datasets/AI-MO/aimo-validation-amc. Accessed 2024. Cited by: §10.1, §4.1.
  • J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732. Cited by: §10.1, §4.2.
  • M. Chen (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: §10.1, §4.2.
  • L. Chenglin, Q. Chen, L. Li, C. Wang, F. Tao, Y. Li, Z. Chen, and Y. Zhang (2024) Mixed distillation helps smaller language models reason better. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1673–1690. Cited by: §7.
  • K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §1, §10.1, §4.1.
  • G. Cui, L. Yuan, N. Ding, G. Yao, B. He, W. Zhu, Y. Ni, G. Xie, R. Xie, Y. Lin, et al. (2023) ULTRAFEEDBACK: boosting language models with scaled ai feedback. In Forty-first International Conference on Machine Learning, Cited by: §10.2.
  • N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou (2023) Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3029–3051. Cited by: §10.2, §4.3.
  • Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot (2023) Specializing smaller language models towards multi-step reasoning. In International Conference on Machine Learning, pp. 10421–10430. Cited by: §1, §7.
  • A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §1.
  • D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §1.
  • C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. (2024) Olympiadbench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3828–3850. Cited by: §10.1, §4.1.
  • D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §1, §10.1, §4.1.
  • G. Hinton (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §1, §7, §7.
  • N. Ho, L. Schmid, and S. Yun (2023) Large language models are reasoning teachers. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pp. 14852–14882. Cited by: §7.
  • C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister (2023) Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017. Cited by: §7.
  • E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations, Cited by: §10.2.
  • B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. (2024) Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186. Cited by: §4.2.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) TinyBERT: distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 4163–4174. Cited by: §1, §2.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J. Su, K. Duh, and X. Carreras (Eds.), Austin, Texas, pp. 1317–1327. Cited by: §7.
  • J. Ko, T. Chen, S. Kim, T. Ding, L. Liang, I. Zharkov, and S. Yun (2025) DistiLLM-2: a contrastive approach boosts the distillation of llms. In Forty-second International Conference on Machine Learning, Cited by: §1, §1, §10.2, §10.2, §10.3, §2.1, §2.3, §2, §4.1, §4.3, §4, §7.
  • J. Ko, S. Kim, T. Chen, and S. Yun (2024) DistiLLM: towards streamlined distillation for large language models. In International Conference on Machine Learning, pp. 24872–24895. Cited by: §7.
  • A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. (2022) Solving quantitative reasoning problems with language models. Advances in neural information processing systems 35, pp. 3843–3857. Cited by: §10.1, §4.1.
  • L. H. Li, J. Hessel, Y. Yu, X. Ren, K. W. Chang, and Y. Choi (2023a) Symbolic chain-of-thought distillation: small models can also “think” step-by-step. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, pp. 2665–2679. Cited by: §1.
  • S. Li, J. Chen, Y. Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y. Mao, et al. (2022) Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726. Cited by: §1, §7.
  • X. Li, S. He, J. Wu, Z. Yang, Y. Xu, Y. jun Jun, H. Liu, K. Liu, and J. Zhao (2024) MoDE-cotd: chain-of-thought distillation for complex reasoning tasks with mixture of decoupled lora-experts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 11475–11485. Cited by: §1.
  • X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. Hashimoto (2023b) AlpacaEval: an automatic evaluator of instruction-following models. URL https://github.com/tatsu-lab/alpaca_eval. Cited by: §10.2, §4.3.
  • B. Liao, Y. Xu, H. Dong, J. Li, C. Monz, S. Savarese, D. Sahoo, and C. Xiong (2025) Reward-guided speculative decoding for efficient llm reasoning. In Forty-second International Conference on Machine Learning, Cited by: §1.
  • H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §7.
  • A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. Cited by: §1, §7, §7.
  • J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023) Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36, pp. 21558–21572. Cited by: §10.3.
  • Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang (2024a) WizardCoder: empowering code large language models with evol-instruct. In ICLR, Cited by: §1, §10.1.
  • Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang (2024b) WizardCoder: empowering code large language models with evol-instruct. In The Twelfth International Conference on Learning Representations, Cited by: §10.2, §4.2.
  • MAA (2025) American invitational mathematics examination (AIME). Note: Mathematical Association of America, AIME exam materials. Cited by: §10.1, §4.1.
  • A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594. Cited by: §7.
  • L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn (2023) Teaching small language models to reason. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 2: short papers), pp. 1773–1781. Cited by: §1, §7.
  • A. Mitra, H. Khanpour, C. Rosset, and A. Awadallah (2024) Orca-math: unlocking the potential of slms in grade school math. arXiv preprint arXiv:2402.14830. Cited by: §7.
  • S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah (2023) Orca: progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707. Cited by: §7.
  • o1 Team (2024) Skywork-o1 open series. Note: https://huggingface.co/Skywork. Accessed: November 2024. Cited by: §4.
  • R. Y. Pang, W. Yuan, H. He, K. Cho, S. Sukhbaatar, and J. Weston (2024) Iterative reasoning preference optimization. Advances in Neural Information Processing Systems 37, pp. 116617–116637. Cited by: §1.
  • A. Patel, S. Bhattamishra, and N. Goyal (2021) Are nlp models really able to solve simple math word problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094. Cited by: §10.1, §4.1.
  • J. Schmidhuber (1992) Learning complex, extended sequences using the principle of history compression. Neural computation 4 (2), pp. 234–242. Cited by: §1, §7.
  • Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1.
  • K. Shridhar, A. Stolfo, and M. Sachan (2023) Distilling reasoning capabilities into smaller language models. Findings of the Association for Computational Linguistics: ACL 2023, pp. 7059–7073. Cited by: §1, §7.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for bert model compression. In Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • G. Wang, Z. Yang, Z. Wang, S. Wang, Q. Xu, and Q. Huang (2025) ABKD: pursuing a proper allocation of the probability mass in knowledge distillation via $\alpha$-$\beta$-divergence. In Forty-second International Conference on Machine Learning, Cited by: §2.
  • X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, Cited by: §7.
  • J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: §7.
  • T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang (2023) Cmath: can your language model pass chinese elementary school math test?. arXiv preprint arXiv:2306.16636. Cited by: §10.1, §4.1.
  • C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang (2024) WizardLM: empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, Cited by: §1, §10.2, §10.2, §4.3.
  • A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, et al. (2024) Qwen2.5-Math technical report: toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122. Cited by: §1.
  • L. Yang, Z. Yu, B. Cui, and M. Wang (2025a) Reasonflux: hierarchical llm reasoning via scaling thought templates. arXiv preprint arXiv:2502.06772. Cited by: §1.
  • L. Yang, Z. Yu, T. Zhang, M. Xu, J. E. Gonzalez, B. CUI, and S. YAN (2025b) SuperCorrect: advancing small llm reasoning with thought template distillation and self-correction. In The Thirteenth International Conference on Learning Representations, Cited by: §1, §1, §7.
  • S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. Advances in neural information processing systems 36, pp. 11809–11822. Cited by: §7.
  • L. Yu, W. Jiang, H. Shi, J. YU, Z. Liu, Y. Zhang, J. Kwok, Z. Li, A. Weller, and W. Liu (2024) MetaMath: bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, Cited by: §10.1, §10.2.
  • E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) Star: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488. Cited by: §7.
  • X. Zhang, C. Li, Y. Zong, Z. Ying, L. He, and X. Qiu (2023) Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474. Cited by: §10.1, §4.1.
  • L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, pp. 46595–46623. Cited by: §10.2, §4.3.
  • W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan (2024) Agieval: a human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 2299–2314. Cited by: §10.1, §4.1.
  • X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang (2024) Distilling mathematical reasoning capabilities into small language models. Neural Networks 179, pp. 106594. Cited by: §7.

6 Qualitative Examples

Table 8: Qualitative examples where the student (Qwen2.5-Math-1.5B) continuation receives higher local validity than the teacher (Qwen2.5-Math-7B-Instruct) continuation under the same prefix. $r_{T}$ and $r_{S}$ denote the PRM (Skywork-o1-OpenPRM-Qwen-2.5-1.5B) rewards for the teacher and student first-token continuations, respectively, and the ratio is $r_{S}/r_{T}$. The “Better local continuation” column indicates which continuation is locally correct or a valid better alternative under the shared prefix between teacher and student models.
Shared prefix | Teacher continuation | Student continuation | $r_{T}$ | $r_{S}$ | Ratio | Better local continuation
The cost of printing is $7$ copies $\times$ $25$ pages $\times$ \$0.10/page $=\$17$ | $550$ | $.50$ | 0.085 | 0.589 | 6.97 | Student. The correct amount is \$17.50; the student completes the decimal.
Tim buys 3 goats for \$400 each. Tim also buys twice as many llamas as goats, so he buys | $2\times 2=6$ llamas | $3\times 2=6$ llamas | 0.152 | 0.847 | 5.57 | Student. The multiplier should be the number of goats, 3, so $3\times 2=6$.
Cost per liter is \$3 and the pool volume is $25\times 60x$. Thus, $3\times 25\times 60x=90000$. Simplify: $3\times$ | $15\times 60x$ | $25\times 60x$ | 0.575 | 0.801 | 1.39 | Student. The coefficient 25 should be preserved from $3\times 25\times 60x$.
Find $1^{234}+4^{6}\div 4^{4}$. First, we simplify | $1^{6}\div 4^{4}$ | $4^{6}\div 4^{4}$ | 0.601 | 0.877 | 1.46 | Student. The expression being simplified is $4^{6}\div 4^{4}$, not $1^{6}\div 4^{4}$.
James drives at 30 mph for half an hour. Then he drives at twice the speed for twice as long, which means he drives at a speed of | $3\times 30=6\cdots$ | $2\times 30=6\cdots$ | 0.360 | 0.827 | 2.30 | Student. “Twice the speed” requires multiplying by 2, giving $2\times 30=60$ mph.
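
As a quick consistency check, the ratios in Table 8 can be recomputed from the listed rewards. The short row labels below are ad hoc names for the five examples; small deviations from the reported ratios are expected because the rewards are shown to three decimals only.

```python
# (label, r_T, r_S, reported ratio) for the five rows of Table 8.
TABLE8 = [
    ("printing cost", 0.085, 0.589, 6.97),
    ("llamas", 0.152, 0.847, 5.57),
    ("pool volume", 0.575, 0.801, 1.39),
    ("exponent simplification", 0.601, 0.877, 1.46),
    ("driving speed", 0.360, 0.827, 2.30),
]

recomputed = {}
for label, r_t, r_s, reported in TABLE8:
    ratio = r_s / r_t  # r_S / r_T, as defined in the table caption
    recomputed[label] = ratio
    print(f"{label}: recomputed {ratio:.2f}, reported {reported:.2f}")
```

Every recomputed ratio is above 1, so all five examples fall in the regime where the student's local continuation is scored higher than the teacher's.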

7 Related Work

Reasoning. Recent advances in large language model reasoning have been driven by explicit multi-step supervision techniques, including chain-of-thought prompting [Wei et al., 2022], structured reasoning templates [Chenglin et al., 2024, Zhu et al., 2024], thought-expansion methods such as Tree-of-Thought and related variants [Yao et al., 2023], and other forms of guided step decomposition [Shridhar et al., 2023]. These approaches substantially improve the reasoning performance of large models but typically depend on scale and heavy computation, limiting their applicability in resource-constrained settings. This has motivated growing interest in enhancing the reasoning abilities of smaller models through two main directions: (i) richer training-time supervision schemes, such as self-correction [Madaan et al., 2023] and preference-based learning [Yang et al., 2025b], and (ii) knowledge distillation [Schmidhuber, 1992, Hinton, 2015], where a strong teacher transfers its reasoning behavior to a compact student via teacher-generated traces [Li et al., 2022, Magister et al., 2023, Liu et al., 2024].

Reasoning Distillation. Reasoning distillation has emerged as an effective approach for transferring chain-of-thought capabilities from larger to smaller language models [Mukherjee et al., 2023, Hsieh et al., 2023, Mitra et al., 2024, Fu et al., 2023], allowing students to inherit strong reasoning behavior without costly training from scratch. Early work framed this as rationale imitation [Ho et al., 2023, Liu et al., 2024], training students to reproduce teacher explanations using cross-entropy or sequence-level KD [Hinton, 2015, Kim and Rush, 2016]. Subsequent work extends this paradigm with trajectory selection and filtering [Zelikman et al., 2022, Lightman et al., 2023, Wang et al., 2023], keeping only high-quality or self-consistent traces, and with alternative objectives such as on-policy distillation using student rollouts [Agarwal et al., 2024, Ko et al., 2024] or contrastive losses that balance teacher-guided and student-generated supervision [Ko et al., 2025]. Despite this progress, existing techniques largely treat reasoning trajectories as monolithic sequences, implicitly assuming that all teacher steps are equally informative or equally authoritative. This creates two limitations: (i) local under-specification: many reasoning prefixes admit multiple valid continuations, yet sequence-level KD enforces a single teacher choice; and (ii) uniform optimization: the student receives identical learning signal on strong teacher steps, weak teacher steps, and student-favored steps. In contrast, our method adopts a token-level, prefix-conditioned view of reasoning distillation.

8 Theoretical Results and Proofs

This appendix provides formal statements and proofs supporting the theoretical claims in Section 3. Throughout, we consider a fixed prefix (context) $c$ and a finite vocabulary $\mathcal{V}$. The teacher policy is denoted by $\pi(\cdot\mid c)$, and $r(c,a)$ denotes a bounded local validity signal.

8.1 Teacher-anchored KL trust-region improvement (detailed proof)

Theorem 8.1 (Teacher-anchored reverse-KL trust-region solution).

Fix a prefix $c$ and a teacher policy $\pi(\cdot\mid c)$ over a finite set $\mathcal{V}$, with $\pi(a\mid c)>0$ for all $a\in\mathcal{V}$. Let $r(c,a)\in\mathbb{R}$ be bounded. Consider the problem

$$\max_{\tilde{\pi}(\cdot\mid c)}\;\sum_{a\in\mathcal{V}}\tilde{\pi}(a\mid c)\,r(c,a)\quad\text{s.t.}\quad\mathrm{KL}\!\left(\tilde{\pi}(\cdot\mid c)\,\|\,\pi(\cdot\mid c)\right)\leq\delta,\ \ \sum_{a}\tilde{\pi}(a\mid c)=1,\ \ \tilde{\pi}(a\mid c)>0. \tag{10}$$

Then any optimal solution $\tilde{\pi}^{\star}(\cdot\mid c)$ has the form

$$\tilde{\pi}^{\star}(a\mid c)=\frac{\pi(a\mid c)\exp\!\big(\eta\,r(c,a)\big)}{\sum_{a^{\prime}\in\mathcal{V}}\pi(a^{\prime}\mid c)\exp\!\big(\eta\,r(c,a^{\prime})\big)} \tag{11}$$

for some $\eta\geq 0$ chosen such that the KL constraint is satisfied (with equality unless it is inactive).

Proof.

Fix $c$ and abbreviate $\pi(a)=\pi(a\mid c)$, $\tilde{\pi}(a)=\tilde{\pi}(a\mid c)$, and $r(a)=r(c,a)$. The constraint is

$$\mathrm{KL}(\tilde{\pi}\|\pi)=\sum_{a\in\mathcal{V}}\tilde{\pi}(a)\log\frac{\tilde{\pi}(a)}{\pi(a)}\leq\delta.$$

Form the Lagrangian with multipliers $\lambda\geq 0$ (KL constraint) and $\nu\in\mathbb{R}$ (normalization):

$$\mathcal{L}(\tilde{\pi},\lambda,\nu)=\sum_{a}\tilde{\pi}(a)\,r(a)-\lambda\Big(\sum_{a}\tilde{\pi}(a)\log\frac{\tilde{\pi}(a)}{\pi(a)}-\delta\Big)+\nu\Big(\sum_{a}\tilde{\pi}(a)-1\Big). \tag{12}$$

At an interior optimum $\tilde{\pi}(a)>0$, stationarity gives for each $a$:

$$0=\frac{\partial\mathcal{L}}{\partial\tilde{\pi}(a)}=r(a)-\lambda\Big(\log\frac{\tilde{\pi}(a)}{\pi(a)}+1\Big)+\nu. \tag{13}$$

Rearrange:

$$\log\frac{\tilde{\pi}(a)}{\pi(a)}=\frac{1}{\lambda}\big(r(a)+\nu^{\prime}\big),\qquad\nu^{\prime}=\nu-\lambda.$$

Let $\eta=1/\lambda\geq 0$. Exponentiating yields

$$\tilde{\pi}(a)=\pi(a)\exp(\eta r(a))\exp(\eta\nu^{\prime}).$$

Enforcing $\sum_{a}\tilde{\pi}(a)=1$ determines the normalizer:

$$\exp(\eta\nu^{\prime})=\Big(\sum_{a^{\prime}}\pi(a^{\prime})\exp(\eta r(a^{\prime}))\Big)^{-1}.$$

Substituting back gives the exponential-tilt form in Eq. (11). Finally, $\eta=0$ (the limit $\lambda\to\infty$ under $\eta=1/\lambda$) recovers $\tilde{\pi}^{\star}=\pi$, for which the KL constraint is inactive whenever $\delta>0$; otherwise the constraint is active and $\eta>0$ is chosen so that $\mathrm{KL}(\tilde{\pi}^{\star}\|\pi)=\delta$. ∎
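
The exponential-tilt solution can be checked numerically on a small example. The sketch below is illustrative only: the teacher distribution, the validity scores, the budget $\delta$, and the bisection routine `solve_eta` are assumptions made for the demonstration. It relies on the fact that $\mathrm{KL}(\tilde{\pi}_{\eta}\|\pi)$ is non-decreasing in $\eta$, so the multiplier can be found by bisection whenever $\delta$ is below the largest achievable KL value.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def tilt(pi, r, eta):
    """Exponentially tilted distribution: proportional to pi(a) * exp(eta * r(a))."""
    logits = np.log(pi) + eta * r
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()


def solve_eta(pi, r, delta, eta_hi=1e3, iters=200):
    """Bisection for eta such that KL(tilt(pi, r, eta) || pi) = delta."""
    lo, hi = 0.0, eta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilt(pi, r, mid), pi) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy teacher distribution and bounded validity scores (illustrative values).
pi = np.array([0.5, 0.3, 0.2])
r = np.array([0.1, 0.9, 0.4])
delta = 0.05

eta = solve_eta(pi, r, delta)
pi_star = tilt(pi, r, eta)

print("eta:", eta)
print("tilted policy:", pi_star)
print("KL to teacher:", kl(pi_star, pi))
print("expected validity (teacher, tilted):", float(pi @ r), float(pi_star @ r))
```

The tilted policy shifts mass toward the high-validity action while spending the KL budget, and randomly drawn feasible distributions should not achieve a higher expected validity.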

8.2 From optimal improvement to first-order learning-signal allocation

Derivation of Eq. (9).

Fix a prefix $c_t$. Consider the locally weighted forward-KL loss

$$\mathcal{L}_{t}(\theta)\;=\;w(c_{t})\,\mathrm{KL}\!\big(\pi(\cdot\mid c_{t})\,\|\,\pi_{\theta}(\cdot\mid c_{t})\big), \tag{14}$$

where $\pi(\cdot\mid c_{t})$ (teacher) and $w(c_{t})$ are treated as fixed with respect to $\theta$ (i.e., we apply a stop-gradient to $w$). A gradient descent step with step size $\eta>0$ yields

$$\Delta\theta_{t}=-\eta\,\nabla_{\theta}\mathcal{L}_{t}(\theta)=-\eta\,w(c_{t})\,\nabla_{\theta}\mathrm{KL}\!\big(\pi(\cdot\mid c_{t})\,\|\,\pi_{\theta}(\cdot\mid c_{t})\big). \tag{15}$$

Absorbing the positive scalar $\eta$ into the proportionality constant gives

$$\Delta\theta_{t}\;\propto\;w(c_{t})\,\nabla_{\theta}\Big[-\,\mathrm{KL}\!\big(\pi(\cdot\mid c_{t})\,\|\,\pi_{\theta}(\cdot\mid c_{t})\big)\Big], \tag{16}$$

which is Eq. (9).
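The scaling behavior in Eqs. (15) and (16) can be checked on a toy case. The sketch below assumes, purely for illustration, that the student parameters $\theta$ are the logits of a softmax over three tokens; under that assumption the gradient of the forward KL with respect to the logits is $\pi_{\theta}-\pi$, so the weighted update is $w\,(\pi-\pi_{\theta})$ up to the step size. All names and values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_forward_kl(p, logits, w):
    """w * KL(p || softmax(logits)); w is a constant (stop-gradient)."""
    q = softmax(logits)
    return w * sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def grad_wrt_logits(p, logits, w):
    """Analytic gradient of the weighted loss: w * (q - p)."""
    q = softmax(logits)
    return [w * (qa - pa) for pa, qa in zip(p, q)]

def sgd_step(p, logits, w, lr):
    """One gradient-descent step on the logits."""
    g = grad_wrt_logits(p, logits, w)
    return [z - lr * gz for z, gz in zip(logits, g)]

# Toy teacher distribution and student logits (hypothetical values).
p = [0.6, 0.3, 0.1]
logits = [0.0, 0.5, -0.5]
new_logits = sgd_step(p, logits, w=1.5, lr=0.1)
```

Doubling $w$ doubles the displacement of the logits while leaving its direction unchanged, which is the sense in which $w(c_t)$ allocates learning signal without altering the teacher target.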

9 Algorithm

This section gives the full VCRD training procedure in Algorithm 1.

Algorithm 1 Validity-Calibrated Reasoning Distillation (VCRD)
Input: Dataset $\mathcal{D}=\{x_{i}\}$; teacher LM $p$; student LM $q_{\theta}$; auxiliary judge $J$; loss weights $(\lambda_{T},\lambda_{S})$; smoothing $\varepsilon>0$; rollout horizon $T$.
Output: Updated student parameters $\theta$.
Local validity: For any prefix $c$ and token $a$, define $r(c,a)=J(c,a)\in[0,1]$.
for each training step do
  Sample minibatch $\mathcal{B}\subset\mathcal{D}$.
  for each input $x\in\mathcal{B}$ do
    // Roll out teacher and student once
    $y^{T}=(a^{T}_{1},\dots,a^{T}_{T})\sim p(\cdot\mid x)$.
    $y^{S}=(a^{S}_{1},\dots,a^{S}_{T})\sim q_{\theta}(\cdot\mid x)$.
    // Prefix definitions
    $c^{T}_{t-1}=(x,a^{T}_{<t})$ and $c^{S}_{t-1}=(x,a^{S}_{<t})$ for $t=1,\dots,T$.
    // LV–SKL weights (teacher prefix)
    for $t=1$ to $T$ do
      $w^{T}_{t}\leftarrow\dfrac{r(c^{T}_{t-1},a^{S}_{t})}{r(c^{T}_{t-1},a^{T}_{t})+\varepsilon}$
    end for
    // LV–SRKL weights (student prefix)
    for $t=1$ to $T$ do
      $w^{S}_{t}\leftarrow\dfrac{r(c^{S}_{t-1},a^{S}_{t})}{r(c^{S}_{t-1},a^{T}_{t})+\varepsilon}$
    end for
    // Validity-weighted KL losses
    $\mathcal{L}_{\mathrm{LV\text{-}SKL}}(x)=\sum_{t=1}^{T}w^{T}_{t}\,D^{(\alpha)}_{\mathrm{SKL}}\!\left(p(\cdot\mid c^{T}_{t-1})\,\big\|\,q_{\theta}(\cdot\mid c^{T}_{t-1})\right)$
    $\mathcal{L}_{\mathrm{LV\text{-}SRKL}}(x)=\sum_{t=1}^{T}w^{S}_{t}\,D^{(\alpha)}_{\mathrm{SRKL}}\!\left(p(\cdot\mid c^{S}_{t-1})\,\big\|\,q_{\theta}(\cdot\mid c^{S}_{t-1})\right)$
  end for
  // Batch aggregation and update
  $\mathcal{L}=\frac{1}{|\mathcal{B}|}\sum_{x\in\mathcal{B}}\Big[\lambda_{T}\,\mathcal{L}_{\mathrm{LV\text{-}SKL}}(x)+\lambda_{S}\,\mathcal{L}_{\mathrm{LV\text{-}SRKL}}(x)\Big]$
  Update $\theta\leftarrow\theta-\eta\,\nabla_{\theta}\mathcal{L}$
end for
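The weighting and aggregation steps of Algorithm 1 can be sketched for a single input as follows. This is a minimal illustration under stated assumptions, not the released implementation: the judge is a hypothetical callable `judge(prefix, token)` returning a score in $[0,1]$, the per-position divergences $D^{(\alpha)}_{\mathrm{SKL}}$ and $D^{(\alpha)}_{\mathrm{SRKL}}$ are taken as precomputed numbers, and the weights are treated as constants.

```python
def validity_weights(judge, prefixes, student_tokens, teacher_tokens, eps=1e-8):
    """w_t = r(c_{t-1}, a_t^S) / (r(c_{t-1}, a_t^T) + eps) for each position t."""
    return [
        judge(c, a_s) / (judge(c, a_t) + eps)
        for c, a_s, a_t in zip(prefixes, student_tokens, teacher_tokens)
    ]

def weighted_loss(weights, divergences):
    """Sum over positions of w_t * D_t."""
    return sum(w * d for w, d in zip(weights, divergences))

def vcrd_objective(w_T, d_skl, w_S, d_srkl, lam_T=1.0, lam_S=1.0):
    """lam_T * L_LV-SKL + lam_S * L_LV-SRKL for one input x."""
    return lam_T * weighted_loss(w_T, d_skl) + lam_S * weighted_loss(w_S, d_srkl)

# Toy judge (hypothetical): scores depend only on the proposed token.
def toy_judge(prefix, token):
    return {"good": 0.9, "ok": 0.6, "bad": 0.1}[token]

teacher_tokens = ["good", "ok"]
student_tokens = ["ok", "good"]
teacher_prefixes = [("x",), ("x", "good")]  # c^T_{t-1}
student_prefixes = [("x",), ("x", "ok")]    # c^S_{t-1}

w_T = validity_weights(toy_judge, teacher_prefixes, student_tokens, teacher_tokens)
w_S = validity_weights(toy_judge, student_prefixes, student_tokens, teacher_tokens)
loss = vcrd_objective(w_T, [0.2, 0.4], w_S, [0.3, 0.1], lam_T=1.0, lam_S=1.0)
```

In this toy case the weight is below 1 where the student proposal scores lower than the teacher proposal and above 1 where it scores higher.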

10 Detailed Experimental Setup

10.1 Dataset Description

MetaMathQA (mathematical reasoning; Yu et al. [2024]¹): MetaMathQA is a large-scale dataset introduced to strengthen mathematical reasoning in language models. It is constructed through question bootstrapping, where each problem is rewritten from multiple perspectives, such as forward reasoning, backward reasoning, and alternative rephrasings, to expose models to diverse solution paths and intermediate reasoning structures.

GSM8K (mathematical reasoning; Cobbe et al. [2021]²): GSM8K consists of 8.5K high-quality grade-school math word problems that require multi-step numerical and logical reasoning. The dataset emphasizes clarity, linguistic diversity, and systematic solution steps, making it a standard benchmark for evaluating fundamental arithmetic reasoning in LLMs.

MATH (mathematical reasoning; Hendrycks et al. [2021]³): The MATH dataset contains thousands of competition-style mathematical questions spanning algebra, geometry, combinatorics, number theory, probability, and more. It includes detailed step-by-step solutions generated from a procedural codebase, enabling rigorous evaluation of a model’s ability to perform symbolic and multi-step mathematical reasoning at a variety of difficulty levels.

SVAMP (mathematical reasoning; Patel et al. [2021]⁴): SVAMP is a challenge set of arithmetic word problems designed to mitigate spurious pattern exploitation. It introduces controlled perturbations, such as swapping quantities or modifying irrelevant details, to evaluate whether models rely on genuine mathematical reasoning rather than superficial cues.

MinervaMath (mathematical reasoning; Lewkowycz et al. [2022]⁵): MinervaMath is a challenging benchmark derived from the Minerva project, focusing specifically on mathematical problem solving. It includes competition-style questions across algebra, calculus, geometry, number theory, and probability. Problems require multi-step symbolic manipulation and precise derivations, making MinervaMath a rigorous test of a model’s ability to handle advanced, long-chain mathematical reasoning.

Gaokao2023-EN (mathematical reasoning; Zhang et al. [2023]⁶): Gaokao2023-EN contains English translations of math questions from the 2023 Chinese National College Entrance Examination (Gaokao). The dataset features concise, exam-style problems across algebra, geometry, trigonometry, and applied math. Its formulation emphasizes careful reading, multi-step reasoning, and robustness to linguistically minimal prompts, providing a strong evaluation of structured mathematical reasoning under realistic exam conditions.

OlympiadBench (mathematical reasoning; He et al. [2024]⁷): OlympiadBench aggregates Olympiad-style mathematical problems requiring multi-hop symbolic reasoning, pattern discovery, and structured derivations. Questions are of high difficulty and typically require creative reasoning, making this benchmark sensitive to errors in intermediate steps.

AMC23 (mathematical reasoning; AI-MO [2024]⁸): AMC23 contains problems from the 2023 American Mathematics Competition, focusing on algebra, geometry, combinatorics, and number theory at the mid-competition level. The dataset evaluates a model’s ability to navigate moderately challenging problems that require structured reasoning rather than memorized patterns.

AIME24 (mathematical reasoning; MAA [2025]⁹): AIME24 consists of problems from the 2024 American Invitational Mathematics Examination. These questions demand multi-step derivations, precise algebraic manipulation, and careful numerical reasoning, providing a sensitive test of a model’s ability to avoid compounding local reasoning errors.

SAT-Math (mathematical reasoning; Zhong et al. [2024]¹⁰): SAT-Math evaluates models on algebra, arithmetic reasoning, function interpretation, and geometry tasks from the SAT exam. While less challenging than competition benchmarks, the dataset tests robustness under shorter, mixed-format reasoning questions.

CMATH (mathematical reasoning; Wei et al. [2023]): CMATH is a curated collection covering a broad set of competition-math problem types, with carefully structured reasoning paths and high-quality solutions. It includes tasks requiring symbolic manipulation, equation solving, and multi-step deductive reasoning.

WizardCoder (code generation; Luo et al. [2024a]¹¹): WizardCoder is constructed using the Evol-Instruct procedure, which automatically expands and refines existing code-instruction corpora. The process begins from CodeAlpaca (20K instructions) and iteratively applies instruction evolution techniques (adding constraints, increasing reasoning depth, introducing distractor code, and modifying specifications) to produce more challenging programming tasks. The resulting dataset contains roughly 78K evolved problems and serves as a large-scale code-instruction corpus used to finetune StarCoder and related models.

HumanEval (code generation; Chen [2021]¹²): HumanEval is a benchmark of 164 manually written Python programming problems, each comprising a function signature, natural-language description, and a set of unit tests. The tasks were explicitly created to avoid overlap with pretraining corpora, making HumanEval a standard benchmark for evaluating functional correctness of program synthesis.

HumanEval+: HumanEval+ extends HumanEval by providing perturbed, paraphrased, or structurally varied versions of the original problems. These variants preserve the underlying semantics while altering surface form, enabling a more robust evaluation of generalization and reasoning stability in code generation models.

MBPP (code generation; Austin et al. [2021]¹³): MBPP contains approximately 1,000 crowdsourced Python programming tasks aimed at entry-level programmers. Each problem includes a description and test cases covering basic programming constructs such as loops, list manipulation, strings, and simple algorithms. MBPP is widely used to measure fundamental code-generation abilities and correctness.

MBPP+: MBPP+ augments the original MBPP benchmark with rephrasings, structural variations, and more diverse test cases. It evaluates whether a model can maintain correctness when task formulations shift, providing a more rigorous assessment of generalization beyond the original surface templates.

¹ https://huggingface.co/datasets/meta-math/MetaMathQA
² https://huggingface.co/datasets/openai/gsm8k
³ https://huggingface.co/datasets/deepmind/math
⁴ https://huggingface.co/datasets/ChilleD/SVAMP
⁵ https://huggingface.co/datasets/math-ai/minervamath
⁶ https://huggingface.co/datasets/MARIO-Math-Reasoning/Gaokao2023-Math-En
⁷ https://huggingface.co/datasets/Hothan/OlympiadBench
⁸ https://huggingface.co/datasets/math-ai/amc23
⁹ https://huggingface.co/datasets/math-ai/aime24
¹⁰ https://huggingface.co/datasets/ndavidson/sat-math-chain-of-thought
¹¹ https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
¹² https://huggingface.co/datasets/openai/openai_humaneval
¹³ https://huggingface.co/datasets/google-research-datasets/mbpp

10.2 Training Details

Table 9: Hyperparameter values used in VCRD (distillation stage) experiments across all task families.

| Hyperparameter | Value |
| --- | --- |
| Finetuning method | LoRA (rank = 16, $\alpha$ = 128, dropout = 0.05) |
| LoRA target modules | all self-attention and MLP layers |
| Learning rate | $5\times 10^{-5}$ |
| Max sequence length | 1024 |
| Max prompt length | 512 |
| Effective batch size | 128 |
| LR scheduler | cosine |
| Warmup ratio | 0.1 |
| Validity smoothing $\varepsilon$ | $1\times 10^{-8}$ |
| # Epochs (Math Reasoning) | 2 |
| # Epochs (Code Generation) | 1 |
| # Epochs (Instruction Following) | 3 |

In this subsection, we describe the hyperparameters and implementation settings used for training VCRD. Table 9 summarizes all configuration values. All models are trained using LoRA-based adaptation [Hu et al., 2022] for parameter-efficient finetuning, and we adopt a unified optimization setup across mathematical reasoning, code generation, and instruction-following tasks. We use the maximum batch size that fits on our four NVIDIA A100 (80GB) GPUs, combined with gradient accumulation, to attain an effective batch size of 128. Student models are first initialized via supervised finetuning on task-specific datasets with ground-truth responses, after which validity-calibrated distillation is applied. Following the setup of Ko et al. [2025], we do not use any language-modeling loss on pretraining corpora. All experiments use FlashAttention and bf16 precision.
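To make the optimization settings concrete, the sketch below computes the number of gradient-accumulation steps needed to reach the effective batch size of 128 and evaluates a cosine learning-rate schedule with a warmup ratio of 0.1 and a peak of $5\times 10^{-5}$. The per-device batch size, the linear shape of the warmup, and the decay to zero are assumptions made for illustration; the text above specifies only the scheduler type, warmup ratio, peak learning rate, GPU count, and effective batch size.

```python
import math

def accumulation_steps(effective_batch, per_device_batch, num_gpus):
    """Gradient-accumulation steps so that per_device * gpus * steps = effective batch."""
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be divisible by per_device_batch * num_gpus")
    return effective_batch // per_step

def lr_at(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero (assumed schedule shape)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical per-device batch size of 4 on the four GPUs described above.
steps = accumulation_steps(effective_batch=128, per_device_batch=4, num_gpus=4)
peak = lr_at(step=100, total_steps=1000)
```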

For mathematical reasoning, we set the distillation weights to $(\lambda_{T},\lambda_{S})=(1,1)$ for the Qwen2.5-Math experiments and $(2,1.5)$ for the Qwen2-Math experiments. Students are first initialized by supervised finetuning on the full MetaMathQA [Yu et al., 2024] dataset for one epoch with learning rate $5\times 10^{-6}$, cosine learning-rate scheduling, maximum sequence length 2048, and warmup ratio 0.1. Then, for distillation, we construct a training set by randomly sampling 50k MetaMathQA examples and collecting both teacher and student responses for each prompt. Distillation is performed for two epochs.

For code generation, we evaluate VCRD on two teacher–student configurations: Qwen2.5-Coder-7B-Instruct → Qwen2.5-Coder-1.5B and Qwen2.5-Coder-14B-Instruct → Qwen2.5-Coder-7B-Instruct. In the 7B → 1.5B setting, we first finetune the 1.5B student on the ground-truth code dataset for 5 epochs using FlashAttention, bf16 precision, a learning rate of $5\times 10^{-6}$, a maximum sequence length of 2048, and a warmup ratio of 0.1. We then perform VCRD distillation for one epoch (600 training iterations). For the larger 14B → 7B configuration, we initialize the 7B student directly from the Qwen2.5-Coder-7B-Instruct checkpoint without any supervised finetuning, and distill for a single epoch of 400 iterations. For both configurations, we set the distillation weights to $(\lambda_{T},\lambda_{S})=(2,1)$. Distillation prompts are drawn from the WizardCoder dataset [Luo et al., 2024b], which is constructed by applying the Evol-Instruct method [Xu et al., 2024] to code-instruction datasets. For distillation, we construct a training set by collecting both teacher and student responses for each prompt from the WizardCoder dataset.

This setup, spanning two distinct model scales, enables us to assess whether VCRD’s prefix-conditioned validity calibration provides consistent improvements in execution-level robustness and functional correctness across code-generation regimes.

For instruction-following, the student model is first supervised-finetuned for one epoch on the full Metamath instruction-tuning dataset using ground-truth responses. During this stage, we set the maximum sequence length to 2048, the learning rate to $5\times 10^{-5}$, and the warmup ratio to 0.1. After finetuning, we perform VCRD distillation for three epochs over the sampled UltraChat prompts. The distillation weights are set to $(\lambda_{T},\lambda_{S})=(2,2)$ for the Qwen2.5 teacher-student configuration and $(2,1)$ for the Qwen2 configuration. For distillation, we construct a training set by randomly sampling 50k prompts from the UltraChat200k dataset [Ding et al., 2023] and collecting both teacher and student responses for each prompt. We evaluate VCRD under two teacher-student configurations: Qwen2.5-7B-Instruct → Qwen2.5-1.5B and Qwen2-7B-Instruct → Qwen2-1.5B. For evaluation, we report win-rate performance on three common instruction-following benchmarks: AlpacaEval [Li et al., 2023b], Evol-Instruct [Xu et al., 2024], and UltraFeedback [Cui et al., 2023]. Following prior work [Ko et al., 2025], we use GPT-4o or GPT-4o-mini as LLM-as-a-Judge [Zheng et al., 2023] to provide consistent preference-based scoring across models. This benchmark suite spans conversational, multi-step, and preference-heavy instructions, enabling a broad and challenging evaluation of VCRD’s ability to provide stable supervisory signals beyond purely reasoning-focused tasks.

10.3 Evaluation

We follow an evaluation protocol similar to that of Ko et al. [2025]. Below we describe the specific settings used for each task family.

Math Reasoning. For evaluating mathematical reasoning performance, we use a single A100 80GB GPU and follow the official Qwen2.5-Math evaluation protocol (https://github.com/QwenLM/Qwen2.5-Math), using the qwen25-math-cot prompt format. All evaluations are performed with greedy decoding (temperature 0, top-$p$ = 1) and a single sample per query under the vLLM backend.

Code Generation. For code evaluations, we again use a single A100 80GB GPU and employ greedy decoding with a maximum generation length of 1024. Code generation performance is measured using the EvalPlus framework [Liu et al., 2023], which executes predicted solutions to verify correctness. We use the HumanEval, HumanEval+, MBPP, and MBPP+ evaluation suites, all executed with their official test harnesses to ensure consistency and prevent overfitting to reference implementations.

Instruction following. For instruction-following evaluation, we generate model responses using a single NVIDIA A100 80GB GPU with temperature 0.8, top-$p$ = 0.95, and a maximum generation length of 512 tokens. Each comparison is performed using the pairwise system prompt described in Appendix 10.3.1, which presents the judge model with the user question and two candidate responses and asks for a preference decision. To mitigate position bias, we randomly swap the order of the two responses and average win rates across both permutations. For AlpacaEval, we use the officially released text-davinci-003 reference responses. For Evol-Instruct and UltraFeedback, we compare generated responses against gpt-3.5-turbo outputs that were produced internally, following the same protocol used in prior benchmark releases. All evaluations use GPT-4o or GPT-4o-mini as the judge to ensure consistent preference-based scoring across models.

10.3.1 Instruction-Following Judge Prompt

[System]
Act as an impartial judge and evaluate which of two assistant responses better answers the user question shown below. Your judgment should consider correctness, instruction compliance, coherence, clarity, and overall helpfulness. Ignore superficial stylistic differences unless they affect content quality. Read both responses carefully and provide a brief explanation of your choice. To avoid position bias, do not let the order of presentation influence your decision. After your explanation, output your verdict strictly in one of the following formats: [[A]], [[B]], or [[C]] (tie).

[User Question]
{question}

[Response A]
{answer_A}

[Response B]
{answer_B}
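The verdict format in the prompt above lends itself to simple post-processing. The following sketch extracts the final `[[A]]`, `[[B]]`, or `[[C]]` tag from a judge response and averages outcomes over both presentation orders. The function names are hypothetical, and counting a tie as 0.5 and skipping unparseable outputs are assumptions rather than details stated in the evaluation protocol.

```python
import re

VERDICT_PATTERN = re.compile(r"\[\[([ABC])\]\]")

def parse_verdict(judge_output):
    """Return 'A', 'B', 'C' (tie), or None if no verdict tag is found."""
    found = VERDICT_PATTERN.findall(judge_output)
    return found[-1] if found else None  # the verdict follows the explanation

def score_for_model(verdict, model_is_a):
    """1.0 if the evaluated model wins, 0.5 for a tie (assumption), 0.0 for a loss."""
    if verdict == "C":
        return 0.5
    return 1.0 if (verdict == "A") == model_is_a else 0.0

def win_rate(judgments):
    """judgments: list of (judge_output, model_is_a) pairs covering both orders."""
    scores = []
    for output, model_is_a in judgments:
        verdict = parse_verdict(output)
        if verdict is not None:  # unparseable outputs are skipped (assumption)
            scores.append(score_for_model(verdict, model_is_a))
    return sum(scores) / len(scores) if scores else float("nan")

# Same comparison judged in both orders: a judge that always prefers slot A
# yields 0.5, so a pure position preference does not favor either model.
rate = win_rate([("Response A is clearer. [[A]]", True),
                 ("Response A is clearer. [[A]]", False)])
```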

10.4 PRM‑Free Validity Approximation Setup

For teacher-student pairs without an available process reward model (e.g., DeepSeek-Coder), we approximate local validity using the teacher’s next-token likelihood under the shared prefix. The student proposes a greedy next token $a_{t}^{S}=\arg\max_{a}p_{s}(a\mid c_{t})$, and we use the teacher-assigned probability $p_{t}(a_{t}^{S}\mid c_{t})$ as the raw student reward. To obtain a stable teacher baseline, we take the top-$k$ teacher probabilities with $k=128$ and compute a collision-based score $\sum_{i=1}^{k}p_{t}(i\mid c_{t})^{2}$. The local validity weight is then defined as

$$w_{t}=\frac{p_{t}(a_{t}^{S}\mid c_{t})}{\sum_{i=1}^{k}p_{t}(i\mid c_{t})^{2}+\varepsilon},\qquad\varepsilon=10^{-8},$$

followed by log-space smoothing and symmetric clamping: we compute $\log w_{t}$, scale by a factor $\gamma=0.5$, and clip to $[\log 0.5,\log 2.0]$ before exponentiating back to obtain $w_{t}\in[0.5,2.0]$. This weight is applied multiplicatively to both LV–SKL and LV–SRKL losses at each token.
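The PRM-free weight described above can be sketched for a single position as follows. The function name and the toy distributions are hypothetical, and the small floor inside the logarithm is an added guard against a zero teacher probability that the text does not mention.

```python
import math

def prm_free_weight(teacher_probs, student_token, k=128, eps=1e-8,
                    gamma=0.5, w_min=0.5, w_max=2.0):
    """Teacher-likelihood validity weight with log-space smoothing and clamping."""
    # Collision-based teacher baseline over the top-k teacher probabilities.
    top_k = sorted(teacher_probs, reverse=True)[:k]
    baseline = sum(p * p for p in top_k)
    # Raw ratio: teacher probability of the student's greedy token over the baseline.
    raw = teacher_probs[student_token] / (baseline + eps)
    # Scale in log space, clip to [log w_min, log w_max], then exponentiate.
    log_w = gamma * math.log(max(raw, 1e-300))  # floor guards against log(0)
    log_w = min(max(log_w, math.log(w_min)), math.log(w_max))
    return math.exp(log_w)

# Toy teacher distribution over three tokens (hypothetical values).
probs = [0.7, 0.2, 0.1]
w_likely = prm_free_weight(probs, student_token=0)    # student picks the teacher's top token
w_unlikely = prm_free_weight(probs, student_token=2)  # student picks a low-probability token
```

With $\gamma=0.5$ the smoothing step is equivalent to taking the square root of the raw ratio before clamping, which compresses extreme weights toward 1.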

11 Ablation Study

Table 10 further examines how local validity scores should be converted into distillation weights. LV-Joint-$r_{s}$ weights the loss using only the student’s raw validity score $r_{s}$, ignoring the teacher score under the same prefix. Its lower performance indicates that absolute validity alone is not a reliable allocator of learning signal. LV-Joint-($r_{s}-r_{t}$) uses the difference between student and teacher validity scores, which better reflects the local improvement direction and improves over raw $r_{s}$ weighting. However, it still falls short of VCRD, which uses the relative validity ratio $r_{s}/(r_{t}+\varepsilon)$. This ratio compares the student continuation against the teacher continuation under the same context while normalizing for the local scale of the judge scores. The superior performance of VCRD supports the use of relative, scale-normalized validity calibration for reasoning distillation, consistent with the analysis in Section 3.

Table 10: Ablation of validity-weighting rules on Qwen2.5-Math-7B-Instruct ($\mathcal{M}_{T}$) → Qwen2.5-Math-1.5B ($\mathcal{M}_{S}$). AVG is the average over all datasets.

| Method | GSM8K | MATH | Gaokao | Olympiad | CMATH | AIME24 | AMC23 | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LV-Joint-$r_{s}$ | 85.6 | 73.8 | 60.5 | 32.7 | 90.0 | 6.7 | 47.5 | 56.68 |
| LV-Joint-($r_{s}-r_{t}$) | 85.2 | 73.6 | 61.8 | 34.2 | 89.2 | 10.0 | 52.5 | 58.03 |
| VCRD | 85.0 | 73.7 | 62.1 | 35.4 | 90.2 | 13.3 | 60.0 | 59.96 |
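The three weighting rules in this ablation differ in how they respond to the scale of the judge scores, which a short sketch can make concrete. The function names and score values below are hypothetical, and the sketch shows only the raw quantity each rule computes; any further post-processing applied to the raw-score and difference variants is not specified here.

```python
def weight_raw(r_s, r_t, eps=1e-8):
    """LV-Joint-r_s: student score only; the teacher score is ignored."""
    return r_s

def weight_difference(r_s, r_t, eps=1e-8):
    """LV-Joint-(r_s - r_t): signed gap between student and teacher scores."""
    return r_s - r_t

def weight_ratio(r_s, r_t, eps=1e-8):
    """VCRD: relative validity ratio r_s / (r_t + eps)."""
    return r_s / (r_t + eps)

# Same relative quality at two score scales, e.g. a judge that is
# uniformly less confident at a harder prefix (hypothetical values).
easy = (0.3, 0.6)
hard = (0.15, 0.3)

ratio_easy, ratio_hard = weight_ratio(*easy), weight_ratio(*hard)
diff_easy, diff_hard = weight_difference(*easy), weight_difference(*hard)
raw_easy, raw_hard = weight_raw(*easy), weight_raw(*hard)
```

Here the ratio stays near 0.5 at both scales, whereas the raw score and the difference both halve, illustrating the scale normalization that the ratio provides.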

12 Computational Cost

Table 11: Average training time per iteration on Qwen2.5-Math-7B-Instruct ($\mathcal{M}_{T}$) → Qwen2.5-Math-1.5B ($\mathcal{M}_{S}$).

| Method | Time (s/iteration) |
| --- | --- |
| KD | 30.19 |
| Distillm-2 | 32.84 |
| VCRD (PRM-based) | 33.84 |
| VCRD-Prob (PRM-free) | 32.31 |

We report the training-time cost of VCRD in Table 11. Following Distillm-2, all methods use a shared offline data-preparation stage in which teacher and/or student trajectories are generated once before training and then kept fixed. This isolates the comparison to the distillation objective itself, rather than differences in rollout generation. Therefore, the reported times exclude this shared offline stage and measure only the training iteration cost. VCRD introduces only a small additional overhead over existing distillation baselines. In the same Qwen2.5-Math setting, KD requires 30.19 s/iteration, Distillm-2 requires 32.84 s/iteration, and PRM-based VCRD requires 33.84 s/iteration. The extra PRM-based local-validity computation therefore adds about 1 second per iteration (roughly 3%) over Distillm-2. The PRM-free variant requires 32.31 s/iteration, slightly less than Distillm-2 and about 2 seconds per iteration more than KD.

13 PRM‑sensitivity

We further evaluate the sensitivity of VCRD to the PRM used as the validity judge. As shown in Table 12, replacing Skywork-o1-Open-PRM-1.5B with the larger Qwen2.5-Math-PRM-7B changes the average score only marginally, from 55.38 to 55.48, which suggests that VCRD is robust to PRM scale. Importantly, both PRM-based VCRD variants outperform the non-calibrated SKL+SRKL baseline (54.58) and the pure on-policy SRKL baseline (50.98) on average.

Table 12: PRM sensitivity results on Qwen2.5-Math-7B-Instruct ($\mathcal{M}_{T}$) → Qwen2.5-Math-1.5B ($\mathcal{M}_{S}$).

| Method (PRM) | GSM8K | MATH | SVAMP | Minerva | Gaokao | Olympiad | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VCRD (Skywork-o1-Open-PRM-1.5B) | 81.0 | 61.8 | 85.9 | 25.7 | 53.2 | 24.7 | 55.38 |
| VCRD (Qwen2.5-Math-PRM-7B) | 81.6 | 60.8 | 85.5 | 27.2 | 53.5 | 24.3 | 55.48 |
| SKL+SRKL | 81.6 | 61.0 | 85.9 | 25.7 | 51.4 | 21.9 | 54.58 |
| SRKL (on-policy KD) | 78.0 | 57.6 | 83.6 | 17.3 | 47.5 | 21.9 | 50.98 |
