Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
Abstract
Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model’s sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
1 Introduction
The strong reasoning ability exhibited by large language models (LLMs) has often been attributed to reinforcement learning (RL). However, empirical analyses question whether RL explains emergent reasoning: as the number of sampled generations grows, post-RL models often fail to outperform their pre-RL counterparts, suggesting that RL may not be what endows LLMs with reasoning ability (yue2025does). At the same time, distillation has become a standard way to transfer the capabilities of expensive or stronger models to smaller models (hinton2015distilling; guo2025deepseek; busbridge2025distillation), and inference-time compute allocated to sampling or search has improved LLM performance (snell2024scaling; welleck2024from).
However, the relationship between sampling, RL, and self-distillation remains unclear. In particular, karan2026reasoning show that a base model, without additional training or external reward, can match or exceed post-RL models using power sampling. This raises the question of whether the success of power sampling reflects a mechanism distinct from RL and distillation, or whether these methods can be connected through a common structure. Clarifying such a connection is important because it can reveal whether gains that appear to come from different procedures in fact arise from a common mechanism, and whether an expensive inference-time procedure can be converted into an offline training objective.
In this study, we show that sampling, RL, and self-distillation are naturally connected through the power distribution. As illustrated in Figure 1, this distribution is the target of power sampling, the closed-form optimum of a self-reward RL objective, and the teacher distribution amortized by self-distillation. From the sampling perspective, a natural question is whether the effect of power sampling can be reproduced by an inexpensive token-level approximation. We show that this is structurally difficult: per-token approximations cannot match the power distribution without sequence-level information. From the RL perspective, the power distribution is the closed-form optimum of KL-regularized RL (ouyang2022instructgpt) when the reward is the model’s sequence-level log-probabilities, i.e., the self-reward in the sense of huang2025selfimprovement. Finally, by rewriting this RL objective, we derive power self-distillation as an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into offline training. We further show that power self-distillation achieves sharpening, and that whether the resulting sharpening improves a true reward is determined by a reward covariance under the power distribution.
Our contributions are summarized as follows. Figure 1 illustrates the connection we study, and Table S.1 compares these axes with prior work.
• We show that approximating power sampling at inference time is structurally hard: per-token approximations cannot match the power distribution without sequence-level information (Propositions 1 and 2).
• We show that the power distribution is the closed-form optimum of KL-regularized RL with the model’s sequence-level log-probabilities as the reward (Corollary 1), and derive power self-distillation by rewriting this RL objective, thereby amortizing expensive power sampling into offline training (Algorithm 1).
• We provide a sharpening bound for power self-distillation (Proposition 3), and characterize when the induced self-distillation improves a true reward through a covariance condition under the power distribution (Proposition 4).
2 Related work
RL post-training as distribution sharpening. RL has become a central tool in LLM post-training, including RL from human feedback (RLHF) (ouyang2022instructgpt) and RL with verifiable rewards (RLVR) (shao2024deepseekmath; guo2025deepseek; lambert2024tulu). However, a growing line of work questions whether such RL induces genuinely new reasoning capabilities. yue2025does showed that under pass@$k$ evaluation, RLVR often improves sampling efficiency at small $k$ but can underperform the base model at large $k$, suggesting that RLVR concentrates probability mass on reasoning paths already present in the base model’s distribution. Complementing this view, he2025rewarding analyzed a degenerate rank bias in GRPO that preferentially reinforces high-probability trajectories, yielding a “distribution sharpening” regime where simply sampling more from the base model can be stronger under the same sample budget. Motivated by the perspective that many RL gains resemble distribution sharpening, karan2026reasoning proposed a training-free inference-time method that targets sharpened distributions of the base model. Their approach uses a Metropolis–Hastings sampler to approximate sequence-level power sampling and achieves reasoning improvements comparable to RL. azizi2026power and ji2026scalable developed lower-latency approximations to power sampling. We complement this line of work by showing that the power distribution targeted by these samplers is also the closed-form optimum of a self-reward KL-regularized RL objective.
Inference-time compute and distillation to amortize inference cost. Recent work argues that allocating additional computation at inference time can substantially improve LLM outputs (snell2024scaling; welleck2024from). When an external reward or verifier is available, a common method is Best-of-$N$, which generates $N$ candidates and selects the one with the highest reward; this simple strategy can yield strong empirical gains (stiennon2020learning; nakano2021webgpt; touvron2023llama; pmlr-v202-gao23h; eisenstein2023helping; mudgal2024controlled). To amortize the inference cost of Best-of-$N$, several works characterized the distribution induced by Best-of-$N$ selection and proposed to distill this distribution into a single policy (gui2024bonbon; amini2025variational; sessa2025bond; yang2025fasterwind). In contrast to these reward-based distillation methods, we derive a self-distillation objective that amortizes power sampling itself, using only samples from the base model’s power distribution.
Self-improvement without external rewards. A growing number of empirical studies suggest that language models can improve without relying on external rewards or human-provided labels, using self-generated data and intrinsic training signals. huang2023large; wang2023self curated model-generated solutions or instructions and then fine-tuned on them. Several works perform RL using internal feedback alone, such as entropy minimization objectives (prabhudesai2025maximizing) or confidence as the reward (zhao2026learning). Even randomly assigned rewards can improve performance (shao2025spurious). huang2025selfimprovement formalized LLM self-improvement as distribution sharpening and analyzed algorithms motivated by SFT and KL-regularized RL. Building on this sharpening view, we show that the model’s sequence-level log-probabilities induce the power distribution through KL-regularized RL, and that distilling this distribution can sharpen the model without external rewards.
3 Preliminaries
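Notation. Let $\mathcal{X}$ denote the space of prompts and let $\mu$ denote a distribution over prompts. We consider completions of length $T$ over a finite vocabulary $\mathcal{V}$, and write $\mathcal{Y} := \mathcal{V}^T$ for the completion space. The base model is a policy $\pi_{\mathrm{base}} : \mathcal{X} \to \Delta(\mathcal{Y})$ and we write $\pi_{\mathrm{base}}(\cdot \mid x)$ for the conditional distribution of $y$ given $x$. With $y = (y_1, \dots, y_T)$, we use the autoregressive factorization $\pi_{\mathrm{base}}(y \mid x) = \prod_{t=1}^{T} \pi_{\mathrm{base}}(y_t \mid x, y_{<t})$. We write $y_{<t}$ to mean $(y_1, \dots, y_{t-1})$ and $\pi(\mathcal{S} \mid x) := \sum_{y \in \mathcal{S}} \pi(y \mid x)$ for a set $\mathcal{S} \subseteq \mathcal{Y}$.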
Self-improvement. Language models have been shown to be capable of self-improvement, improving their own performance without external rewards (huang2023large; wang2023self; prabhudesai2025maximizing; zhao2026learning). This phenomenon is counterintuitive and appears to contradict the data-processing inequality, which states that mutual information is non-increasing under further processing of random variables (cover1999elements). huang2025selfimprovement reconcile these observations by interpreting improvements as computational, not statistical: self-improvement sharpens the distribution so that sampling a near-optimal solution becomes easier. This perspective connects to classical trade-offs between sampling and optimization in theoretical computer science (kirkpatrick1983optimization; lovasz2006fast).
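Formally, define the self-reward as the log-likelihood
$$r_{\mathrm{self}}(y \mid x) := \log \pi_{\mathrm{base}}(y \mid x), \tag{1}$$
and let the corresponding maximizer set be
$$\mathcal{Y}^\star(x) := \arg\max_{y \in \mathcal{Y}} \pi_{\mathrm{base}}(y \mid x). \tag{2}$$
Given $\epsilon, \delta \in (0, 1)$, a policy $\hat{\pi}$ is $(\epsilon, \delta)$-sharpened relative to $\pi_{\mathrm{base}}$ if the following holds:
$$\mathbb{P}_{x \sim \mu}\big[\hat{\pi}(\mathcal{Y}^\star(x) \mid x) \ge 1 - \delta\big] \ge 1 - \epsilon. \tag{3}$$
huang2025selfimprovement analyze the sample complexity of achieving $(\epsilon, \delta)$-sharpening when $\pi_{\mathrm{base}}$ is accessed only through conditional draws $y \sim \pi_{\mathrm{base}}(\cdot \mid x)$ and likelihood evaluations $\pi_{\mathrm{base}}(y \mid x)$, for supervised fine-tuning on Best-of-$N$ targets sampled from $\pi_{\mathrm{base}}$ and for KL-regularized RL objectives driven by $r_{\mathrm{self}}$.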
Power distribution. Recent analyses of RL suggest that empirical reasoning gains resemble distribution sharpening, where probability mass concentrates on trajectories already well supported under the base model (yue2025does; he2025rewarding). Motivated by this view, karan2026reasoning target inference-time sampling from the power distribution induced by the base model.
Definition 1 (Power distribution).
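With a policy $\pi_{\mathrm{base}}$ and an exponent $\alpha \ge 1$, we define the power distribution induced by $\pi_{\mathrm{base}}$ as
$$\pi_\alpha(y \mid x) := \frac{\pi_{\mathrm{base}}(y \mid x)^{\alpha}}{Z_\alpha(x)}, \qquad Z_\alpha(x) := \sum_{y' \in \mathcal{Y}} \pi_{\mathrm{base}}(y' \mid x)^{\alpha}. \tag{4}$$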
Exact sampling from Eq. (4) is intractable at scale. karan2026reasoning therefore propose a Metropolis–Hastings (MH) procedure that achieves reasoning accuracy competitive with strong RL post-training (shao2024deepseekmath; guo2025deepseek), without further training. Lower-latency approximations have subsequently been proposed (azizi2026power; ji2026scalable), but these methods still use substantially more inference-time compute than standard autoregressive sampling.
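To make Definition 1 concrete, the following minimal sketch enumerates a tiny completion space and normalizes the powered base probabilities. The two-token vocabulary, the horizon of three tokens, and the Markov-chain base model (`INIT`, `TRANS`) are hypothetical choices made only for illustration; exact enumeration of this kind is precisely what becomes intractable at scale.

```python
import itertools

# Hypothetical toy base model: a first-order Markov chain over two tokens.
VOCAB = (0, 1)
T = 3
INIT = {0: 0.6, 1: 0.4}                               # pi_base(y_1 | x)
TRANS = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}    # pi_base(y_t | y_{t-1})


def base_prob(y):
    """Sequence probability under the autoregressive factorization."""
    p = INIT[y[0]]
    for prev, cur in zip(y, y[1:]):
        p *= TRANS[prev][cur]
    return p


def power_distribution(alpha):
    """Enumerate all completions and normalize pi_base(y)^alpha."""
    seqs = list(itertools.product(VOCAB, repeat=T))
    mass = [base_prob(y) ** alpha for y in seqs]
    z = sum(mass)
    return {y: m / z for y, m in zip(seqs, mass)}


p1 = power_distribution(1.0)   # alpha = 1 recovers the base model
p4 = power_distribution(4.0)   # alpha = 4 concentrates mass on likely sequences
mode = max(p1, key=p1.get)
```

In this toy model the most likely completion under the base model gains probability as the exponent grows, which is the sharpening effect the power distribution is meant to capture.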
4 Approximating power sampling requires sequence-level information
In this section, we begin from the sampling perspective. We ask whether the power distribution $\pi_\alpha$ can be reproduced by inexpensive inference-time approximations, focusing on two natural local inference-time procedures: (i) a per-token tempered distribution (Section 4.1) and (ii) sequential importance sampling (SIS) with a one-step proposal (Section 4.2). In both cases, the gap to $\pi_\alpha$ is governed by sequence-level information that the local approximations do not access, showing why cheap inference-time approximations are structurally difficult and motivating the RL and self-distillation perspectives in Section 5.
4.1 Comparison to per-token temperature scaling
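A natural way to locally approximate $\pi_\alpha$ is to apply the same power transformation at the token level during decoding. For $\alpha \ge 1$, define the per-token tempered next-token distribution by
$$\pi^{\mathrm{tok}}_\alpha(y_t \mid x, y_{<t}) := \frac{\pi_{\mathrm{base}}(y_t \mid x, y_{<t})^{\alpha}}{\sum_{v \in \mathcal{V}} \pi_{\mathrm{base}}(v \mid x, y_{<t})^{\alpha}}. \tag{5}$$
In contrast, the power distribution in Eq. (4) is, more precisely, the sequence-level power distribution $\pi_\alpha(\cdot \mid x)$, whose next-token conditional we denote by $\pi^{\mathrm{seq}}_\alpha$:
$$\pi^{\mathrm{seq}}_\alpha(y_t \mid x, y_{<t}) := \frac{\sum_{y_{>t}} \pi_{\mathrm{base}}(y_{<t}, y_t, y_{>t} \mid x)^{\alpha}}{\sum_{v \in \mathcal{V}} \sum_{y_{>t}} \pi_{\mathrm{base}}(y_{<t}, v, y_{>t} \mid x)^{\alpha}}. \tag{6}$$
We show that for arbitrary suffix distributions, the entire odds-ratio gap between Eqs. (5) and (6) is controlled by the Rényi entropy of the suffix.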
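Proposition 1 (Power vs. temperature odds ratios via suffix Rényi entropies).
For $\alpha > 1$, a prompt $x$, a prefix $y_{<t}$, and $v \in \mathcal{V}$, let $p_v$ denote the conditional distribution of the suffix under the base model,
$$p_v(y_{>t}) := \pi_{\mathrm{base}}(y_{>t} \mid x, y_{<t}, y_t = v).$$
For a distribution $p$ on a finite set $\mathcal{S}$, define the Rényi entropy of order $\alpha$ as
$$H_\alpha(p) := \frac{1}{1 - \alpha} \log \sum_{s \in \mathcal{S}} p(s)^{\alpha}.$$
Then for any $v, v' \in \mathcal{V}$ such that $\pi_{\mathrm{base}}(v \mid x, y_{<t}) > 0$ and $\pi_{\mathrm{base}}(v' \mid x, y_{<t}) > 0$, the ratio of next-token odds under $\pi^{\mathrm{seq}}_\alpha$ versus $\pi^{\mathrm{tok}}_\alpha$ satisfies
$$\frac{\pi^{\mathrm{seq}}_\alpha(v \mid x, y_{<t}) \,/\, \pi^{\mathrm{seq}}_\alpha(v' \mid x, y_{<t})}{\pi^{\mathrm{tok}}_\alpha(v \mid x, y_{<t}) \,/\, \pi^{\mathrm{tok}}_\alpha(v' \mid x, y_{<t})} = \exp\!\Big((1 - \alpha)\big[H_\alpha(p_v) - H_\alpha(p_{v'})\big]\Big). \tag{7}$$
We have $1 - \alpha < 0$, so Eq. (7) implies that, among next-token candidates with comparable values of $\pi_{\mathrm{base}}(v \mid x, y_{<t})$, those for which $p_v$ has larger Rényi entropy are relatively downweighted under $\pi^{\mathrm{seq}}_\alpha$ compared to $\pi^{\mathrm{tok}}_\alpha$. Thus, compared with per-token temperature scaling, sequence-level power sharpening favors continuations whose suffix distributions under $\pi_{\mathrm{base}}$ are more peaked, i.e., have lower Rényi entropy.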
Comparison to karan2026reasoning. karan2026reasoning also studied the gap between per-token temperature and sequence-level power sampling, and formalized it in the special case of two extreme tokens (positive vs. negative pivotal tokens; their Example 1 and Proposition 3). Our result enables a quantitative comparison for any two next-token candidates.
Proposition 1 suggests that matching the next-token distribution induced by sequence-level power sampling at a step $t$ requires information about the suffix distributions following each candidate token.
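The role of the suffix in Proposition 1 can be checked numerically. The sketch below assumes the identity takes the form "odds-ratio gap $= \exp((1-\alpha)[H_\alpha(p_v) - H_\alpha(p_{v'})])$", as suggested by the proof in Section A.1. The two candidate tokens, their equal base probabilities, and the suffix distributions in `SUFFIX` are hypothetical.

```python
import math

ALPHA = 4.0
# Hypothetical toy: two next-token candidates with equal base probability but
# different suffix distributions (peaked after "a", uniform after "b").
NEXT = {"a": 0.5, "b": 0.5}
SUFFIX = {"a": [0.97, 0.01, 0.01, 0.01], "b": [0.25, 0.25, 0.25, 0.25]}


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1) of a finite distribution."""
    return math.log(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)


def normalize(weights):
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}


# Per-token tempering only sees the next-token probabilities.
tok = normalize({v: NEXT[v] ** ALPHA for v in NEXT})
# The sequence-level conditional sums pi_base^alpha over every suffix.
seq = normalize({v: sum((NEXT[v] * s) ** ALPHA for s in SUFFIX[v]) for v in NEXT})

odds_gap = (seq["a"] / seq["b"]) / (tok["a"] / tok["b"])
predicted = math.exp(
    (1 - ALPHA) * (renyi_entropy(SUFFIX["a"], ALPHA) - renyi_entropy(SUFFIX["b"], ALPHA))
)
```

Per-token tempering leaves the two candidates tied, whereas the sequence-level conditional strongly prefers the token followed by the peaked (low Rényi entropy) suffix, and the measured gap agrees with the entropy-based prediction.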
4.2 Variance-minimizing one-step proposals for sequential power sampling
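Beyond marginal token distributions, we turn to sequential importance sampling (SIS) targeting $\pi_\alpha$, where a basic design goal is to stabilize incremental importance weights. Proposition 3.3 of zhao2024probabilistic identifies the unique one-step variance-minimizing proposal in a general SIS setup, and we apply it to the power distribution $\pi_\alpha$.
Fix a prompt $x$. Define the unnormalized power mass $\gamma(y) := \pi_{\mathrm{base}}(y \mid x)^{\alpha}$ and, for $t \in \{0, 1, \dots, T\}$, the prefix totals
$$\Gamma_t(y_{1:t}) := \sum_{y_{t+1:T}} \gamma(y_{1:t}, y_{t+1:T}), \tag{8}$$
where for $t = 0$ the prefix is empty. Let $Z := \Gamma_0$ be the normalizing constant, so that $Z = \sum_{y \in \mathcal{Y}} \gamma(y)$, and let $\pi_\alpha = \gamma / Z$ be the normalized power distribution on $\mathcal{Y}$ from Eq. (4). For $t \in \{1, \dots, T\}$, write $\pi_{\alpha,t}$ for the prefix marginal obtained by summing over $y_{t+1:T}$; then $\pi_{\alpha,t}(y_{1:t}) = \Gamma_t(y_{1:t}) / Z$.
Consider extending a fixed prefix $y_{1:t-1}$ by one token $y_t \sim q(\cdot \mid y_{1:t-1})$ in one step of SIS (or SMC without resampling), while keeping the global target $\pi_\alpha$ on $\mathcal{Y}$. Define the incremental importance weight (chopin2020introduction)
$$w_t(y_{1:t}) := \frac{\Gamma_t(y_{1:t})}{\Gamma_{t-1}(y_{1:t-1}) \, q(y_t \mid y_{1:t-1})}, \tag{9}$$
where we condition on $y_{1:t-1}$ with $\Gamma_{t-1}(y_{1:t-1}) > 0$, and $\mathrm{Var}_q(w_t)$ denotes variance under $y_t \sim q(\cdot \mid y_{1:t-1})$. The next proposition shows the unique proposal that minimizes $\mathrm{Var}_q(w_t)$ at such a prefix.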
Proposition 2 (Variance-minimizing one-step proposal at prefix $y_{1:t-1}$).
In the setting above, fix $y_{1:t-1}$ with $\Gamma_{t-1}(y_{1:t-1}) > 0$. Among all proposals $q(\cdot \mid y_{1:t-1})$ on $\mathcal{V}$, the unique minimizer of $\mathrm{Var}_q(w_t)$ is
$$q^\star(y_t \mid y_{1:t-1}) = \frac{\Gamma_t(y_{1:t})}{\Gamma_{t-1}(y_{1:t-1})}, \tag{10}$$
where $y_{1:t} = (y_{1:t-1}, y_t)$; the right-hand side equals the next-token conditional under $\pi_\alpha$.
Proposition 2 implies that minimizing the local one-step variance of the incremental weight forces the proposal to coincide with the next-token conditional in Eq. (10), which itself depends on the prefix totals $\Gamma_t$ summed over all suffixes. In particular, proposals that modify only the base next-token conditional $\pi_{\mathrm{base}}(\cdot \mid x, y_{1:t-1})$ cannot in general equal the unique minimizer in Eq. (10). The proof and SIS background are in Section A.2.
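The contrast between the variance-minimizing proposal and a purely local one can be illustrated on a small enumerable model. The sketch below assumes prefix totals defined as sums of $\pi_{\mathrm{base}}(y \mid x)^{\alpha}$ over all completions of a prefix, and incremental weights given by the ratio of consecutive prefix totals divided by the proposal probability. The Markov-chain base model and the chosen prefix are hypothetical.

```python
import itertools

ALPHA = 4.0
VOCAB = (0, 1)
T = 3
# Hypothetical toy base model: a first-order Markov chain over two tokens.
INIT = {0: 0.6, 1: 0.4}
TRANS = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}


def base_prob(y):
    p = INIT[y[0]]
    for prev, cur in zip(y, y[1:]):
        p *= TRANS[prev][cur]
    return p


def prefix_total(prefix):
    """Sum of pi_base(y)^alpha over all completions of the given prefix."""
    rest = T - len(prefix)
    return sum(base_prob(prefix + s) ** ALPHA
               for s in itertools.product(VOCAB, repeat=rest))


def weight_variance(prefix, proposal):
    """Variance of the one-step incremental weight when y_t ~ proposal."""
    denom = prefix_total(prefix)
    w = {v: prefix_total(prefix + (v,)) / (denom * proposal[v]) for v in VOCAB}
    mean = sum(proposal[v] * w[v] for v in VOCAB)
    return sum(proposal[v] * (w[v] - mean) ** 2 for v in VOCAB)


prefix = (1,)
# Variance-minimizing proposal: requires prefix totals, i.e. sums over suffixes.
q_star = {v: prefix_total(prefix + (v,)) / prefix_total(prefix) for v in VOCAB}
# Local alternative: temper only the base next-token conditional.
raw = {v: TRANS[prefix[-1]][v] ** ALPHA for v in VOCAB}
q_tok = {v: raw[v] / sum(raw.values()) for v in VOCAB}

var_star = weight_variance(prefix, q_star)
var_tok = weight_variance(prefix, q_tok)
```

In this example the proposal built from prefix totals gives a constant weight (zero variance), while the tempered local proposal, which cannot see that the two next tokens lead to very different suffixes, leaves a clearly positive variance.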
Implication. Propositions 1 and 2 indicate that inexpensive one-step approximations cannot reproduce $\pi_\alpha$ without sequence-level information, leaving inference-time approximation of $\pi_\alpha$ structurally expensive. This aligns with prior work that expends additional inference-time compute (karan2026reasoning; azizi2026power; ji2026scalable) to approximate the power distribution.
5 From self-reward RL to power self-distillation
Section 4 shows that $\pi_\alpha$ is structurally expensive to approximate by sampling at inference time. In this section, we take the complementary view that $\pi_\alpha$ also connects RL and self-distillation, allowing us to shift the cost to offline training. Section 5.1 identifies $\pi_\alpha$ as the closed-form optimum of a KL-regularized RL objective with self-reward. Section 5.2 uses this identification to derive an offline self-distillation algorithm from that RL objective. Section 5.3 then analyzes what the resulting distilled model achieves: a sharpening guarantee on the self-reward, and a characterization of when sharpening also improves a true reward.
5.1 Power distribution as the optimum of self-reward RL
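Let $\pi$ be a candidate policy, and consider the KL-regularized RL objective with reward $r(x, y)$ (ouyang2022instructgpt; guo2025deepseek)
$$J_\beta(\pi) := \mathbb{E}_{x \sim \mu}\Big[\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x)\big)\Big], \tag{11}$$
with $\beta > 0$. By the standard closed-form solution of KL-regularized RL (levine2018reinforcement), the unique maximizer of Eq. (11) is the reward-tilted distribution
$$\pi^\star_\beta(y \mid x) = \frac{\pi_{\mathrm{base}}(y \mid x) \exp\big(r(x, y) / \beta\big)}{Z_\beta(x)}, \qquad Z_\beta(x) := \sum_{y' \in \mathcal{Y}} \pi_{\mathrm{base}}(y' \mid x) \exp\big(r(x, y') / \beta\big). \tag{12}$$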
We restate this as Proposition 5 in Section A.5 and include a proof for completeness. Specializing the reward in Eq. (12) to the self-reward in Eq. (1) yields the power distribution.
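Corollary 1 (Self-reward tilt equals the power distribution).
For the self-reward $r(x, y) = r_{\mathrm{self}}(y \mid x)$ of Eq. (1) and any $\beta > 0$, the maximizer in Eq. (12) is the power distribution of Eq. (4):
$$\pi^\star_\beta(y \mid x) = \frac{\pi_{\mathrm{base}}(y \mid x)^{1 + 1/\beta}}{\sum_{y' \in \mathcal{Y}} \pi_{\mathrm{base}}(y' \mid x)^{1 + 1/\beta}} = \pi_\alpha(y \mid x), \qquad \alpha = 1 + \frac{1}{\beta}. \tag{13}$$
This identification connects power sampling and self-improvement RL: the inference-time target of karan2026reasoning coincides with the closed-form optimum of the KL-regularized self-reward objective studied in huang2025selfimprovement, namely $\pi^\star_\beta = \pi_\alpha$ with $\alpha = 1 + 1/\beta$.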
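The identification can be verified numerically for a single prompt. The sketch below assumes the exponent relation $\alpha = 1 + 1/\beta$ that follows from substituting the log-likelihood reward into the reward-tilted solution. The four-completion base distribution `BASE` and the value of `BETA` are hypothetical.

```python
import math
import random

BETA = 0.5
ALPHA = 1.0 + 1.0 / BETA
# Hypothetical base distribution over four completions for one fixed prompt.
BASE = [0.5, 0.3, 0.15, 0.05]


def normalize(weights):
    z = sum(weights)
    return [w / z for w in weights]


def objective(pi):
    """Per-prompt KL-regularized objective with reward r = log pi_base."""
    reward = sum(p * math.log(b) for p, b in zip(pi, BASE))
    kl = sum(p * math.log(p / b) for p, b in zip(pi, BASE) if p > 0)
    return reward - BETA * kl


# Reward-tilted solution: pi_base * exp(r / beta) with r = log pi_base.
tilted = normalize([b * math.exp(math.log(b) / BETA) for b in BASE])
# Power distribution with exponent alpha = 1 + 1 / beta.
power = normalize([b ** ALPHA for b in BASE])

# Sanity check against random competitors: none should beat the power distribution.
rng = random.Random(0)
candidates = [normalize([rng.random() + 1e-9 for _ in BASE]) for _ in range(200)]
best_random = max(objective(pi) for pi in candidates)
```

The two distributions agree up to floating-point error, and none of the randomly drawn policies attains a higher objective value than the power distribution.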
5.2 Deriving power self-distillation
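We now derive a self-distillation procedure from the RL objective without requiring the deployed model to sample from $\pi_\alpha$ at inference time.
RL objective as reverse and then forward KL to $\pi_\alpha$. With $r = r_{\mathrm{self}}$ and $\alpha = 1 + 1/\beta$, the inner objective in Eq. (11) can be rewritten for each $x$ as
$$\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{self}}(y \mid x)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x)\big) = -\beta \, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_\alpha(\cdot \mid x)\big) + \beta \log Z_\beta(x), \tag{14}$$
with the same partition function $Z_\beta(x)$ as in Eq. (12), which does not depend on $\pi$. Thus, for each prompt $x$, maximizing Eq. (14) over unconstrained $\pi(\cdot \mid x)$ is equivalent to minimizing the reverse KL divergence $\mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_\alpha(\cdot \mid x)\big)$, with unique minimizer $\pi(\cdot \mid x) = \pi_\alpha(\cdot \mid x)$.
However, the reverse KL is an expectation under $\pi$, so optimizing it directly would require on-policy samples from the learner. We therefore convert the objective into an offline distillation surrogate that shares the same target distribution $\pi_\alpha$, by minimizing the forward KL from the teacher distribution to the student, $\mathbb{E}_{x \sim \mu}\big[\mathrm{KL}\big(\pi_\alpha(\cdot \mid x) \,\|\, \pi(\cdot \mid x)\big)\big]$. The population minimizer over $\pi$ is still $\pi_\alpha$, so this surrogate preserves the same target distribution while enabling offline maximum-likelihood training on teacher samples. This forward-KL surrogate mirrors the reward-augmented maximum-likelihood method of norouzi2016reward, who also exchanged the reverse KL appearing in entropy-regularized RL for a forward KL.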
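Forward KL yields MLE on teacher samples. Expanding the forward KL training objective gives
$$\mathbb{E}_{x \sim \mu}\big[\mathrm{KL}\big(\pi_\alpha(\cdot \mid x) \,\|\, \pi(\cdot \mid x)\big)\big] = \mathbb{E}_{\pi_\alpha}\big[\log \pi_\alpha(y \mid x)\big] - \mathbb{E}_{\pi_\alpha}\big[\log \pi(y \mid x)\big], \tag{15}$$
where $\mathbb{E}_{\pi_\alpha}$ abbreviates the expectation over $x \sim \mu$ and $y \sim \pi_\alpha(\cdot \mid x)$. The first term on the right-hand side of Eq. (15) does not depend on $\pi$, so minimizing the population forward KL is equivalent to maximizing the expected log-likelihood $\mathbb{E}_{\pi_\alpha}\big[\log \pi(y \mid x)\big]$. In practice we form an empirical objective by drawing $n$ i.i.d. pairs $(x_i, y_i)$ with $x_i \sim \mu$ and $y_i \sim \pi_\alpha(\cdot \mid x_i)$, and we solve the following maximum likelihood estimate (MLE) problem:
$$\hat{\pi} \in \arg\max_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^{n} \log \pi(y_i \mid x_i). \tag{16}$$
This procedure uses only offline completions sampled from the power distribution $\pi_\alpha$ derived from the base policy $\pi_{\mathrm{base}}$, and it does not rely on any external reward labels, so it is an instance of self-distillation. We refer to it as power self-distillation and summarize it in Algorithm 1. In practice we run teacher inference once, store $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, and then train the student with standard supervised fine-tuning on $\mathcal{D}$. Separating teacher generation from student training simplifies implementation and enables dataset reuse.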
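The two-phase procedure described above (store teacher samples once, then fit the student by maximum likelihood) can be sketched in a tabular setting. This is an illustrative stand-in and not the authors' implementation: the base policy `BASE`, the uniform prompt distribution, and the exact teacher `power_teacher` are hypothetical, and an unconstrained tabular student replaces LoRA fine-tuning of an LLM, so that the MLE reduces to empirical frequencies.

```python
import collections
import random

ALPHA = 4.0
# Hypothetical base policy: prompt -> {completion: probability}.
BASE = {
    "x1": {"a": 0.5, "b": 0.3, "c": 0.2},
    "x2": {"a": 0.2, "b": 0.45, "c": 0.35},
}


def power_teacher(x):
    """Exact power distribution for prompt x (stand-in for an approximate sampler)."""
    mass = {y: p ** ALPHA for y, p in BASE[x].items()}
    z = sum(mass.values())
    return {y: m / z for y, m in mass.items()}


def collect_teacher_data(n, rng):
    """Offline phase: draw prompts uniformly, sample completions from the teacher."""
    data = []
    for _ in range(n):
        x = rng.choice(sorted(BASE))
        teacher = power_teacher(x)
        y = rng.choices(list(teacher), weights=list(teacher.values()))[0]
        data.append((x, y))
    return data


def fit_tabular_mle(data):
    """For an unconstrained tabular student, the MLE is the empirical frequency."""
    counts = collections.defaultdict(collections.Counter)
    for x, y in data:
        counts[x][y] += 1
    return {x: {y: c / sum(ctr.values()) for y, c in ctr.items()}
            for x, ctr in counts.items()}


rng = random.Random(0)
dataset = collect_teacher_data(5000, rng)   # stored once, reusable across students
student = fit_tabular_mle(dataset)
```

After training, ordinary sampling from `student` already places more mass on each prompt's most likely completion than the base policy does, which is the amortization effect: the cost of sampling from the teacher is paid once, offline.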
5.3 Sharpening and true reward under power self-distillation
In this subsection, we analyze two complementary aspects of power self-distillation: (i) Proposition 3 bounds the extent to which $\hat{\pi}$ sharpens the self-reward, in the sense of huang2025selfimprovement; (ii) Proposition 4 shows that the local rate at which sharpening changes a true reward $r_{\mathrm{true}}$ is determined by the covariance $\mathrm{Cov}_{\pi_\alpha}\big(r_{\mathrm{true}}, \log \pi_{\mathrm{base}}\big)$.
(i) Self-reward sharpening of the distilled model. huang2025selfimprovement formalize self-improvement via $(\epsilon, \delta)$-sharpening relative to $\pi_{\mathrm{base}}$ as in Eq. (3). The next proposition bounds how well the MLE $\hat{\pi}$ in Eq. (16) concentrates on the self-reward maximizer set $\mathcal{Y}^\star(x)$ in Eq. (2).
Proposition 3 (Power self-distillation and sharpening).
Fix and . Suppose and there exists a constant , independent of , such that . Let be i.i.d. samples with , and let be an MLE. Then with probability at least over ,
| (17) |
In particular, the right-hand side of Eq. (17) converges to $0$ as $n \to \infty$ and $\alpha \to \infty$.
Thus, for sufficiently large $n$ and $\alpha$, power self-distillation can achieve $(\epsilon, \delta)$-sharpening in the sense of Eq. (3). The proof is in Section A.3.
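(ii) When does sharpening also improve a different true reward? Proposition 3 guarantees concentration on the self-reward maximizer set, but evaluation is typically governed by a different true reward (e.g., correctness). Let $r_{\mathrm{true}}(x, y)$ denote this true reward and define, for fixed $x$,
$$J_x(\alpha) := \mathbb{E}_{y \sim \pi_\alpha(\cdot \mid x)}\big[r_{\mathrm{true}}(x, y)\big].$$
The next proposition characterizes how $J_x(\alpha)$ changes with $\alpha$.
Proposition 4 (Covariance form of $\frac{d}{d\alpha} J_x(\alpha)$).
For any $\alpha \ge 1$ and any fixed $x$,
$$\frac{d}{d\alpha} J_x(\alpha) = \mathrm{Cov}_{y \sim \pi_\alpha(\cdot \mid x)}\big(r_{\mathrm{true}}(x, y), \, \log \pi_{\mathrm{base}}(y \mid x)\big), \tag{18}$$
where covariances are over the support of $\pi_{\mathrm{base}}(\cdot \mid x)$, on which $\log \pi_{\mathrm{base}}(y \mid x)$ is finite. In particular, if for some $a \in \mathbb{R}$ and $b \ge 0$,
$$r_{\mathrm{true}}(x, y) = a + b \, \log \pi_{\mathrm{base}}(y \mid x), \tag{19}$$
then $J_x(\alpha)$ is non-decreasing in $\alpha$:
$$\frac{d}{d\alpha} J_x(\alpha) = b \, \mathrm{Var}_{y \sim \pi_\alpha(\cdot \mid x)}\big(\log \pi_{\mathrm{base}}(y \mid x)\big) \ge 0.$$
The proof is in Section A.4. Proposition 4 states that $\frac{d}{d\alpha} J_x(\alpha)$ equals the covariance between $r_{\mathrm{true}}$ and $r_{\mathrm{self}}$ under $\pi_\alpha$, so whether increasing $\alpha$ improves the true reward is determined exactly by $\mathrm{Cov}_{\pi_\alpha}\big(r_{\mathrm{true}}, r_{\mathrm{self}}\big)$. In particular, when $r_{\mathrm{true}} = r_{\mathrm{self}}$, this covariance reduces to $\mathrm{Var}_{\pi_\alpha}\big(r_{\mathrm{self}}\big) \ge 0$, so $J_x(\alpha)$ is non-decreasing in $\alpha$.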
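The covariance form can be checked with a finite-difference approximation on a small example. The sketch below assumes the derivative of the expected true reward under the power distribution equals the covariance, under that distribution, between the true reward and the base log-probability. The base probabilities `BASE`, the binary reward `REWARD`, and the evaluation point are hypothetical.

```python
import math

# Hypothetical toy: base probabilities and a binary true reward for one prompt.
BASE = [0.5, 0.3, 0.15, 0.05]
REWARD = [1.0, 0.0, 1.0, 0.0]


def power(alpha):
    mass = [b ** alpha for b in BASE]
    z = sum(mass)
    return [m / z for m in mass]


def expected_reward(alpha):
    """Expected true reward under the power distribution with exponent alpha."""
    return sum(p * r for p, r in zip(power(alpha), REWARD))


def covariance(alpha):
    """Covariance between the true reward and log pi_base under the power distribution."""
    p = power(alpha)
    logs = [math.log(b) for b in BASE]
    mean_r = sum(pi * r for pi, r in zip(p, REWARD))
    mean_l = sum(pi * l for pi, l in zip(p, logs))
    return sum(pi * (r - mean_r) * (l - mean_l) for pi, r, l in zip(p, REWARD, logs))


alpha, h = 2.0, 1e-5
finite_diff = (expected_reward(alpha + h) - expected_reward(alpha - h)) / (2 * h)
cov = covariance(alpha)
```

Here the reward is positively correlated with the base log-probability, so the covariance is positive and increasing the exponent raises the expected reward; flipping the reward vector would reverse the sign.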
6 Numerical evaluation
This section experimentally validates the following points.
• (RQ1) Power sampling increases self-reward (Section 5.1).
• (RQ2) Sharpening can improve true reward when $r_{\mathrm{self}}$ aligns with $r_{\mathrm{true}}$ (Section 5.3).
• (RQ3) Power self-distillation achieves self-improvement (Section 5).
Detailed experimental setups are provided in Section B.1, and synthetic experiments validating Section 4 are shown in Sections B.2.4 and B.2.5.
6.1 Setup
We used the Qwen2.5-Math-7B (yang2024qwen2), Qwen2.5-7B (yang2024qwen), and Phi-3.5-mini-instruct (abdin2024phi) models on the MATH (lightman2024lets), HumanEval (chen2021evaluating), MBPP (austin2021program), and GPQA (rein2024gpqa) datasets. In the main text, we focus on the Qwen2.5-Math-7B model on the MATH dataset, which consists of 12,500 competition-style math problems spanning seven categories. For evaluation, we used MATH500, a selected subset of the MATH test set. For power self-distillation (Algorithm 1), we sampled 500 training problems from MATH, excluding those in MATH500. We fine-tuned with LoRA adapters (hu2022lora) using the AdamW optimizer (loshchilov2017decoupled).
For power sampling, we used the MH procedure of karan2026reasoning (Algorithm 2) with their default hyperparameters, including the power exponent $\alpha$. For additional baselines, we used standard autoregressive sampling (Standard) and token-wise temperature scaling (Temperature) with temperature $1/\alpha$, so that the token-level baseline uses the same local power exponent as power sampling. We studied three model variants: the base model (Base), the power-distilled model (Power-distilled, Algorithm 1), and a randomly initialized model (RandW). RandW is a negative control for cases where likelihood is not aligned with correctness.
To study the relationship between true reward $r_{\mathrm{true}}$ and self-reward $r_{\mathrm{self}}$, we additionally evaluated an approach we call self-reward Best-of-$N$: given $N$ sampled completions $y^{(1)}, \dots, y^{(N)}$, we selected the completion with the largest value of $r_{\mathrm{self}}$. In all experiments, $r_{\mathrm{self}}$ denotes the completion-token average log-likelihood under the evaluated model, with prompt tokens masked out; this length normalization makes values comparable across completions.
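The selection rule can be sketched as follows. This is a minimal illustration that assumes per-token log-probabilities of each completion are already available with prompt tokens excluded; the function names and the sample data are hypothetical.

```python
def mean_log_likelihood(token_logprobs):
    """Completion-token average log-likelihood (prompt tokens already excluded)."""
    return sum(token_logprobs) / len(token_logprobs)


def self_reward_best_of_n(candidates):
    """Return the (text, token_logprobs) pair with the largest length-normalized self-reward."""
    return max(candidates, key=lambda c: mean_log_likelihood(c[1]))


# Hypothetical completions with per-token log-probabilities under the evaluated model.
samples = [
    ("short but unsure", [-1.2, -0.9, -1.5]),
    ("long and confident", [-0.3] * 10),
    ("medium", [-0.5, -0.6, -0.7, -0.4]),
]
best_text, best_logprobs = self_reward_best_of_n(samples)

# Without length normalization, the summed log-likelihood favors a shorter completion.
unnormalized_text = max(samples, key=lambda c: sum(c[1]))[0]
```

The example shows why the normalization matters: the summed log-likelihood penalizes the longer completion simply for having more tokens, while the per-token average selects the completion the model is most confident about throughout.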
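| Model | Sampling | All completions: $r_{\mathrm{self}}$ | All completions: accuracy | Self-reward Best-of-$N$: $r_{\mathrm{self}}$ | Self-reward Best-of-$N$: accuracy |
|---|---|---|---|---|---|
| RandW | Standard | | | | |
| RandW | Power | | | | |
| Base | Standard | | | | |
| Base | Temperature | | | | |
| Base | Power | | | | |
| Power-distilled | Standard | | | | |
| Power-distilled | Temperature | | | | |
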
| Model | Sampling | Correctness | Summary |
|---|---|---|---|
| Base | Temperature | No | Uses irrelevant mathematical properties and generates an incorrect formula, resulting in a hallucinated final answer. |
| Base | Power | Yes | Maintains logical consistency and mathematical accuracy, but simulates a Python execution to present a non-executed solution. |
| Distilled | Standard | Yes | Shows robust reasoning and self-correction by re-evaluating the problem when constraints are not met. |
6.2 Results
Power sampling increases self-reward (RQ1). Table 1 shows mean self-reward ($r_{\mathrm{self}}$) and accuracy (the true reward $r_{\mathrm{true}}$) over sampled completions (left two columns). Power sampling raises $r_{\mathrm{self}}$ for both the base model and RandW. Decoding with token-wise temperature also raises $r_{\mathrm{self}}$ on the base model.
Sharpening can improve true reward when $r_{\mathrm{self}}$ aligns with $r_{\mathrm{true}}$ (RQ2). Table 1 also shows that higher $r_{\mathrm{self}}$ is typically accompanied by higher true reward $r_{\mathrm{true}}$, except on RandW, where $r_{\mathrm{self}}$ is not aligned with $r_{\mathrm{true}}$. Notably, self-reward Best-of-$N$ yields the largest gains in $r_{\mathrm{self}}$ across all models.
To make this point clearer, Figure 3 plots the decoding temperature $\tau$ against $r_{\mathrm{self}}$ and $r_{\mathrm{true}}$; both quantities decrease as $\tau$ increases (i.e., sharpening weakens). Figure 3 uses synthetic rewards (see Section B.1 for details), whose correlation with $r_{\mathrm{self}}$ ranges from positive to negative; the gain in $r_{\mathrm{true}}$ from power sampling grows roughly linearly with this correlation.
Power self-distillation achieves self-improvement (RQ3). Table 1 shows that after power self-distillation, the student with temperature decoding scores higher on both $r_{\mathrm{self}}$ and accuracy than the base model under standard sampling, temperature sampling, or power sampling. The strongest result is obtained by combining power self-distillation with Temperature decoding. At inference time, the student uses only autoregressive decoding (with temperature), thereby amortizing the inference cost of power sampling into offline training.
Qualitative example. Table 2 summarizes completions on one MATH500 problem. With token-wise temperature, the model cites irrelevant facts and concludes with a hallucinated formula, plausibly because token-wise tilting in Eq. (5) does not coincide with sequence-level tilting in Eq. (6). Power sampling instead tilts toward $\pi_\alpha$ and is graded correct, but the completion includes plausible Python code that is never executed, so the model only mimics a reasoning pattern. After power self-distillation, standard decoding yields the correct answer with more robust step-by-step reasoning. The full completions are shown in Section B.2.3.
Additional dataset–model combinations are reported in Section B.2.1; in each case, the distilled model outperforms the corresponding base model.
7 Conclusion
We showed that the power distribution bridges power sampling, self-reward KL-regularized RL, and self-distillation as the sampling target, closed-form RL optimum, and teacher distribution. From the sampling perspective, inexpensive local approximations are structurally limited: per-token temperature scaling and variance-minimizing one-step proposals both miss sequence-level information. From the RL perspective, the same sequence-level power distribution is the optimizer of KL-regularized RL when the reward is the model’s sequence-level log-probabilities. This identification yields power self-distillation, an offline surrogate that amortizes power sampling into supervised training on teacher samples. Power self-distillation can achieve self-reward sharpening, while true-reward improvement is governed by $\mathrm{Cov}_{\pi_\alpha}\big(r_{\mathrm{true}}, r_{\mathrm{self}}\big)$. Finally, we supported the analysis with experiments.
Limitations. Self-improvement through sharpening and distillation inherits the capabilities of the base model, so gains can be small when the base is weak; improving base-model quality (e.g., pretraining) is outside our scope. Our analysis and experiments focus on autoregressive language models over finite horizons.
References
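Notation.
Let the unnormalized power target be $\tilde{\pi}_\alpha(y \mid x) := \pi_{\mathrm{base}}(y \mid x)^{\alpha}$. Let $A(y, y')$ denote the Metropolis–Hastings acceptance ratio comparing completions $y, y'$ (with $x$ fixed), where $q(y' \mid y, x)$ denotes the autoregressive proposal density for resampling a suffix under $\pi_{\mathrm{base}}$:
$$A(y, y') := \min\left\{1, \; \frac{\tilde{\pi}_\alpha(y' \mid x) \, q(y \mid y', x)}{\tilde{\pi}_\alpha(y \mid x) \, q(y' \mid y, x)}\right\}. \tag{20}$$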
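In log space, an acceptance rule of this kind can be sketched as below. The sketch assumes the standard Metropolis–Hastings form, with the target proportional to $\pi_{\mathrm{base}}(y \mid x)^{\alpha}$ and a generic proposal. The function names and the dictionary layout (`logp`, `logq`) are hypothetical and are not taken from karan2026reasoning.

```python
import math
import random


def log_accept_ratio(logp_cur, logp_new, logq_new_given_cur, logq_cur_given_new, alpha):
    """log of min{1, [p(y')^alpha * q(y | y')] / [p(y)^alpha * q(y' | y)]}."""
    log_ratio = alpha * (logp_new - logp_cur) + logq_cur_given_new - logq_new_given_cur
    return min(0.0, log_ratio)


def mh_step(cur, new, rng, alpha):
    """One accept/reject step.

    `cur` and `new` are dicts holding the base-model sequence log-probability
    ("logp") and the log-density of proposing that state from the other one ("logq").
    """
    log_a = log_accept_ratio(cur["logp"], new["logp"], new["logq"], cur["logq"], alpha)
    return new if rng.random() < math.exp(log_a) else cur


rng = random.Random(0)
current = {"logp": -10.0, "logq": -3.0}
proposal = {"logp": -8.0, "logq": -3.0}
accepted = mh_step(current, proposal, rng, alpha=4.0)
```

With a symmetric proposal, a candidate that is more likely under the base model is always accepted, while a less likely one is accepted with probability equal to the likelihood ratio raised to the power `alpha`.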
| Paper | Power distribution | Sampling | RL | Distillation | No external reward |
|---|---|---|---|---|---|
| norouzi2016reward | – | – | ✓ | ✓ | – |
| rusu2016policy | – | – | ✓ | ✓ | – |
| teh2017distral | – | – | ✓ | ✓ | – |
| laskin2023incontext | – | – | ✓ | ✓ | – |
| huang2025selfimprovement | – | ✓ | ✓ | ✓ | ✓ |
| gui2024bonbon | – | ✓ | – | ✓ | – |
| amini2025variational | – | ✓ | – | ✓ | – |
| balashankar2025infalign | – | ✓ | ✓ | – | – |
| sessa2025bond | – | ✓ | ✓ | ✓ | – |
| yang2025fasterwind | – | ✓ | ✓ | ✓ | – |
| karan2026reasoning | ✓ | ✓ | – | – | ✓ |
| azizi2026power | ✓ | ✓ | – | – | ✓ |
| ji2026scalable | ✓ | ✓ | – | – | ✓ |
| Ours | ✓ | ✓ | ✓ | ✓ | ✓ |
Appendix A Proofs and background
A.1 Proof of Proposition 1
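Proof of Proposition 1.
Fix $x$ and a prefix $y_{<t}$ with $\pi_{\mathrm{base}}(y_{<t} \mid x) > 0$, and write $\pi$ for $\pi_{\mathrm{base}}$. For any suffix $y_{>t}$ and token $v$, autoregressive factorization gives
$$\pi(y_{<t}, v, y_{>t} \mid x) = \pi(y_{<t} \mid x) \, \pi(v \mid x, y_{<t}) \, p_v(y_{>t}).$$
Using Eq. (6), the numerator for token $v$ is therefore
$$\sum_{y_{>t}} \pi(y_{<t}, v, y_{>t} \mid x)^{\alpha} = \pi(y_{<t} \mid x)^{\alpha} \, \pi(v \mid x, y_{<t})^{\alpha} \sum_{y_{>t}} p_v(y_{>t})^{\alpha}.$$
Summing over $v \in \mathcal{V}$ yields the corresponding denominator in Eq. (6), so the prefix factor cancels and
$$\pi^{\mathrm{seq}}_\alpha(v \mid x, y_{<t}) = \frac{\pi(v \mid x, y_{<t})^{\alpha} \sum_{y_{>t}} p_v(y_{>t})^{\alpha}}{\sum_{u \in \mathcal{V}} \pi(u \mid x, y_{<t})^{\alpha} \sum_{y_{>t}} p_u(y_{>t})^{\alpha}}. \tag{21}$$
For temperature scaling, Eq. (5) gives $\pi^{\mathrm{tok}}_\alpha(v \mid x, y_{<t}) \propto \pi(v \mid x, y_{<t})^{\alpha}$. Hence for any $v, v'$ with $\pi(v \mid x, y_{<t}) > 0$ and $\pi(v' \mid x, y_{<t}) > 0$,
$$\frac{\pi^{\mathrm{seq}}_\alpha(v \mid x, y_{<t})}{\pi^{\mathrm{seq}}_\alpha(v' \mid x, y_{<t})} = \frac{\pi(v \mid x, y_{<t})^{\alpha} \sum_{y_{>t}} p_v(y_{>t})^{\alpha}}{\pi(v' \mid x, y_{<t})^{\alpha} \sum_{y_{>t}} p_{v'}(y_{>t})^{\alpha}}, \qquad \frac{\pi^{\mathrm{tok}}_\alpha(v \mid x, y_{<t})}{\pi^{\mathrm{tok}}_\alpha(v' \mid x, y_{<t})} = \frac{\pi(v \mid x, y_{<t})^{\alpha}}{\pi(v' \mid x, y_{<t})^{\alpha}},$$
and therefore
$$\frac{\pi^{\mathrm{seq}}_\alpha(v \mid x, y_{<t}) \,/\, \pi^{\mathrm{seq}}_\alpha(v' \mid x, y_{<t})}{\pi^{\mathrm{tok}}_\alpha(v \mid x, y_{<t}) \,/\, \pi^{\mathrm{tok}}_\alpha(v' \mid x, y_{<t})} = \frac{\sum_{y_{>t}} p_v(y_{>t})^{\alpha}}{\sum_{y_{>t}} p_{v'}(y_{>t})^{\alpha}}.$$
By the definition of $H_\alpha$, $\sum_{y_{>t}} p_v(y_{>t})^{\alpha} = \exp\big((1 - \alpha) H_\alpha(p_v)\big)$, which yields Eq. (7). ∎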
A.2 Background and proof of Proposition 2
This appendix is aligned with the sequential Monte Carlo presentation of zhao2024probabilistic, who derive a general twist-induced proposal (their Prop. 3.3) that minimizes the variance of the one-step incremental importance weight for a given tower of intermediate targets. We provide a proof of the same variance-minimization fact specialized to $\pi_\alpha$ using the Cauchy–Schwarz inequality (cf. zhao2024probabilistic, App. A.2).
A.2.1 From a sequence-level target to a sequential sampler
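Let $\mathcal{V}$ be a finite vocabulary and fix a prompt $x$ and completion length $T$. Let $\pi_\alpha$ denote the power distribution on $\mathcal{V}^T$ from Eq. (4), i.e.,
$$\pi_\alpha(y_{1:T} \mid x) = \frac{\pi_{\mathrm{base}}(y_{1:T} \mid x)^{\alpha}}{\sum_{y'_{1:T} \in \mathcal{V}^T} \pi_{\mathrm{base}}(y'_{1:T} \mid x)^{\alpha}}.$$
Exact sampling from $\pi_\alpha$ may be intractable because normalizing constants involve sums over exponentially many sequences. Many practical samplers therefore build $y_{1:T}$ sequentially: having generated a prefix $y_{1:t-1}$, they draw a next token $y_t$ from a proposal $q(\cdot \mid y_{1:t-1})$ and update importance weights so that, after $T$ steps, full-length draws can be reweighted to be (exactly or approximately) correct for $\pi_\alpha$.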
A.2.2 Incremental importance weights
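For $t \in \{0, 1, \dots, T\}$, let $\pi_{\alpha,t}$ denote the marginal of $\pi_\alpha$ on the length-$t$ prefix:
$$\pi_{\alpha,t}(y_{1:t}) := \sum_{y_{t+1:T}} \pi_\alpha(y_{1:t}, y_{t+1:T} \mid x).$$
One step of sequential importance sampling extends $y_{1:t-1}$ by sampling $y_t \sim q(\cdot \mid y_{1:t-1})$. The incremental multiplicative factor appended to the running weight is (chopin2020introduction)
$$w_t(y_{1:t}) := \frac{\pi_{\alpha,t}(y_{1:t})}{\pi_{\alpha,t-1}(y_{1:t-1}) \, q(y_t \mid y_{1:t-1})}, \tag{22}$$
defined on the event $\{\pi_{\alpha,t-1}(y_{1:t-1}) > 0, \; q(y_t \mid y_{1:t-1}) > 0\}$, where $y_{1:t}$ denotes the length-$t$ prefix ending in $y_t$. For the power distribution, $\pi_{\alpha,t} = \Gamma_t / Z$, so Eq. (22) agrees with $w_t$ in Eq. (9).
If one initializes weights at $W_0 = 1$ and updates $W_t = W_{t-1} \, w_t(y_{1:t})$, then for any completed trajectory $y_{1:T}$ with $\prod_{t=1}^{T} q(y_t \mid y_{1:t-1}) > 0$,
$$W_T = \prod_{t=1}^{T} w_t(y_{1:t}) = \frac{\pi_\alpha(y_{1:T} \mid x)}{\prod_{t=1}^{T} q(y_t \mid y_{1:t-1})}, \tag{23}$$
which is the usual full-sequence importance weight of $y_{1:T}$ for the target $\pi_\alpha$ against the autoregressive proposal $\prod_{t=1}^{T} q(y_t \mid y_{1:t-1})$. Thus each $w_t$ is the local factor that must be “well behaved” if the final weights are not to explode or collapse.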
A.2.3 Why minimize at one step?
Condition on a fixed feasible prefix $y_{1:t-1}$ with $\Gamma_{t-1}(y_{1:t-1}) > 0$. Write $c(v)$ for $\Gamma_t(y_{1:t-1}, v) / \Gamma_{t-1}(y_{1:t-1})$, i.e., the true conditional under $\pi_\alpha$. Then $w_t = c(y_t) / q(y_t)$ with $y_t \sim q$.
Whenever $q(v) > 0$ for all $v$ with $c(v) > 0$, the mean $\mathbb{E}_q[w_t]$ is always $1$. However, $\mathrm{Var}_q(w_t)$ depends strongly on $q$: if $q$ places too little mass where $c$ is large, occasional huge weights arise, which is the usual “weight degeneracy” pathology in importance sampling. Minimizing $\mathrm{Var}_q(w_t)$ therefore makes the single-step contribution to weight instability as small as possible (among independent proposals), holding the prefix fixed. This is the same local objective highlighted by zhao2024probabilistic for twist-induced proposals.
A.2.4 Proof of Proposition 2
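Proof of Proposition 2.
Fix $y_{1:t-1}$ with $\Gamma_{t-1}(y_{1:t-1}) > 0$ and write $c(v)$ for $\Gamma_t(y_{1:t-1}, v) / \Gamma_{t-1}(y_{1:t-1})$ and $q(v)$ for $q(v \mid y_{1:t-1})$. Then $w_t = c(y_t) / q(y_t)$ and $\mathbb{E}_q[w_t] = \sum_{v} c(v) = 1$ under $y_t \sim q$, assuming $q(v) > 0$ whenever $c(v) > 0$.
Since $\mathbb{E}_q[w_t] = 1$,
$$\mathrm{Var}_q(w_t) = \mathbb{E}_q\big[w_t^2\big] - 1 = \sum_{v} \frac{c(v)^2}{q(v)} - 1.$$
By Cauchy–Schwarz,
$$\Big(\sum_{v} c(v)\Big)^2 = \Big(\sum_{v} \frac{c(v)}{\sqrt{q(v)}} \sqrt{q(v)}\Big)^2 \le \Big(\sum_{v} \frac{c(v)^2}{q(v)}\Big) \Big(\sum_{v} q(v)\Big) = \sum_{v} \frac{c(v)^2}{q(v)}.$$
The left-hand side equals $1$, so $\mathrm{Var}_q(w_t) \ge 0$ with equality if and only if the Cauchy–Schwarz inequality is tight, i.e., $c(v) / \sqrt{q(v)} \propto \sqrt{q(v)}$, equivalently $q(v) \propto c(v)$. Because $\sum_{v} q(v) = \sum_{v} c(v) = 1$, the unique minimizer on $\mathcal{V}$ is $q^\star = c$, which is $\Gamma_t(y_{1:t}) / \Gamma_{t-1}(y_{1:t-1})$. ∎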
A.3 Proof of Proposition 3
We first provide the following lemma, which is used to bound the Hellinger distance between the MLE and the true conditional distribution for finite-class models.
Lemma 1 (Finite-class MLE Hellinger bound [wong1995probability, geer2000empirical, zhang2006f]).
Assume and . Let be i.i.d. with and , and let be an MLE. Then for any , with probability at least ,
Using this lemma, we can prove Proposition 3 as follows.
Proof of Proposition 3.
Define the failure event . By a simple inclusion,
Taking yields
| (24) |
where .
Let . For each , write and . For two distributions , define the squared Hellinger distance
By the reverse triangle inequality applied to the vectors and ,
| (25) |
On the event , we have and , so Equation (25) implies
Therefore , and hence
| (26) |
Convergence of the upper bound.
The MLE term satisfies as .
For the limit of , fix and write . By definition of , we have for all and for all . The normalizing constant of the power distribution satisfies
| (27) | ||||
| (28) | ||||
| (29) |
For each , the ratio lies in , hence as . Because is finite, , and therefore
| (30) |
The indicators converge to for -almost every as by Eq. (30). Since indicators are bounded by , dominated convergence yields
Thus, the second term in Eq. (17) converges to as . Together with the limit of the first term, the full upper bound converges to . ∎
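The limiting step in this proof rests on the power distribution concentrating on the most likely sequences as the exponent grows. The following minimal sketch illustrates that behavior on a hypothetical four-outcome base distribution with a tie at the maximum; the list `p` and the exponents are assumptions, not values from the experiments.

```python
import math

def power_distribution(p, alpha):
    """Return p ** alpha normalized, computed in log space (requires p > 0)."""
    logs = [alpha * math.log(x) for x in p]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]
    z = sum(w)
    return [x / z for x in w]

p = [0.4, 0.4, 0.15, 0.05]  # hypothetical base distribution; argmax set is {0, 1}

mass_on_argmax = {}
for alpha in (1, 4, 16, 64):
    q = power_distribution(p, alpha)
    mass_on_argmax[alpha] = q[0] + q[1]
    print(alpha, [round(x, 6) for x in q])
```

In the limit the mass is split uniformly over the maximizers, which matches the role of the normalizing-constant ratio argument above: every non-maximal ratio is strictly below one and its power vanishes.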
A.4 Proof of Proposition 4
Proof of Proposition 4.
Recall
Differentiating with respect to yields
Using
| (31) |
we obtain
| (32) | ||||
| (33) | ||||
| (34) | ||||
| (35) | ||||
| (36) |
∎
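Assuming Proposition 4 is the standard exponential-tilting identity (the derivative of the expected true reward under the power distribution with respect to the exponent equals the covariance between the true reward and the log-probability self-reward under that distribution), it can be sanity-checked with a finite difference. The base distribution `p`, the two reward vectors, and the step size `h` are hypothetical.

```python
import math

def power_distribution(p, alpha):
    """Return p ** alpha normalized, computed in log space (requires p > 0)."""
    logs = [alpha * math.log(x) for x in p]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]
    z = sum(w)
    return [x / z for x in w]

def expected_reward(p, r, alpha):
    q = power_distribution(p, alpha)
    return sum(qi * ri for qi, ri in zip(q, r))

def reward_selfreward_cov(p, r, alpha):
    """Covariance of r and log p under the power distribution."""
    q = power_distribution(p, alpha)
    s = [math.log(x) for x in p]
    e_r = sum(qi * ri for qi, ri in zip(q, r))
    e_s = sum(qi * si for qi, si in zip(q, s))
    e_rs = sum(qi * ri * si for qi, ri, si in zip(q, r, s))
    return e_rs - e_r * e_s

def finite_difference(p, r, alpha, h=1e-5):
    return (expected_reward(p, r, alpha + h) - expected_reward(p, r, alpha - h)) / (2 * h)

p = [0.5, 0.25, 0.15, 0.10]          # hypothetical base distribution
r_aligned = [1.0, 0.0, 0.0, 0.0]     # reward on the most likely outcome
r_misaligned = [0.0, 0.0, 0.0, 1.0]  # reward on the least likely outcome

for name, r in (("aligned", r_aligned), ("misaligned", r_misaligned)):
    print(name, finite_difference(p, r, 2.0), reward_selfreward_cov(p, r, 2.0))
```

The aligned reward has positive covariance with the self-reward, so sharpening raises its expectation; the misaligned reward has negative covariance, so sharpening lowers it.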
A.5 Closed-form optimizer for KL-regularized RL: restatement and proof
We restate the standard closed-form solution of KL-regularized RL used in Section 5.1.
Proposition 5 (Closed-form optimizer for KL-regularized RL [levine2018reinforcement]).
Proof of Proposition 5.
Fix and write and . For any , expanding the KL divergence against gives
where we used from Eq. (12). Since with equality if and only if , the inner objective is uniquely maximized at . Because is an expectation over of these decoupled per- objectives, the unique global maximizer is . ∎
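The closed form can be illustrated for a single prompt with a finite output space. The sketch assumes the usual objective (expected reward minus a coefficient `beta` times the KL divergence from the reference policy), whose maximizer is the reference policy tilted by the exponentiated reward; with the log-probability of the reference as reward, this tilt is a power distribution with exponent `1 + 1 / beta`. The symbols `ref`, `beta`, and the numbers are assumptions for illustration.

```python
import math
import random

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """KL divergence KL(a || b) for distributions on the same finite set."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def objective(pi, ref, r, beta):
    """Expected reward minus beta times KL(pi || ref)."""
    return sum(x * ri for x, ri in zip(pi, r)) - beta * kl(pi, ref)

def closed_form_optimizer(ref, r, beta):
    """Reference policy tilted by exp(r / beta), then normalized."""
    return normalize([p * math.exp(ri / beta) for p, ri in zip(ref, r)])

ref = normalize([8.0, 4.0, 2.0, 1.0, 1.0])  # hypothetical reference policy
beta = 0.5                                   # hypothetical KL coefficient
self_reward = [math.log(p) for p in ref]     # self-reward: log-probability

pi_star = closed_form_optimizer(ref, self_reward, beta)
power = normalize([p ** (1.0 + 1.0 / beta) for p in ref])

rng = random.Random(0)
best_random = max(
    objective(normalize([rng.random() + 1e-6 for _ in ref]), ref, self_reward, beta)
    for _ in range(2000)
)
print("objective at closed form:", objective(pi_star, ref, self_reward, beta))
print("best objective over random policies:", best_random)
```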
Appendix B Experimental details
B.1 Setup
Models and datasets.
We used Qwen2.5-Math-7B [yang2024qwen2], Qwen2.5-7B [yang2024qwen], and Phi-3.5-mini-instruct [abdin2024phi] models on the following datasets.
• Mathematics. We used the MATH dataset [lightman2024lets], which consists of 12,500 competition-style math problems spanning seven categories (e.g., geometry, number theory, and precalculus), with 7,500 training and 5,000 test problems. For evaluation, we used MATH500, a randomly selected subset of the MATH test set standardized by OpenAI (https://huggingface.co/datasets/HuggingFaceH4/MATH-500). For distillation, we sampled 500 examples from MATH with MATH500 removed (https://raw.githubusercontent.com/rasbt/math_full_minus_math500/main/math_full_minus_math500.json).
• Programming. For evaluation, we used HumanEval [chen2021evaluating], a set of handwritten programming problems covering algorithms, reasoning, mathematics, and language understanding; each problem includes unit tests, and a solution was correct if it passed all tests. For distillation, we used MBPP [austin2021program], a benchmark of crowd-sourced Python programming problems designed to be solvable by entry-level programmers. We used questions from the sanitized subset, excluding the prompt split.
• Multiple-choice science. We used GPQA [rein2024gpqa], a multiple-choice science benchmark (physics, chemistry, and biology) requiring advanced reasoning. For evaluation, we used GPQA-Diamond, a high-quality subset of questions. For distillation, we used the remaining GPQA questions after removing any overlap with GPQA-Diamond.
Power sampling.
We used the power sampling algorithm of karan2026reasoning, largely following their hyperparameters. Specifically, we used , maximum sampling token length , block size , , and the proposal LLM set to the base model with sampling temperature . The token-wise Temperature baseline uses the same , applying the corresponding local power transform independently at each decoding step. For the randomly initialized model (RandW; Section 6), we instead used maximum token length and , because under the default settings (maximum token length and ) EOS tokens rarely appeared for RandW and wall-clock sampling time became significantly longer.
Self-reward computation.
To report , we computed, under the evaluated model, the average log-likelihood over completion tokens, excluding prompt tokens. Our theoretical analysis assumes completions of a fixed length , but in our experiments completion lengths vary across prompts and sampling methods, so we normalize by the number of completion tokens to remove length bias in .
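The length-normalized self-reward described above amounts to a masked mean over per-token log-probabilities. A minimal sketch follows; the function name and the list-based inputs are hypothetical, and obtaining the per-token log-probabilities from the evaluated model is assumed to happen elsewhere.

```python
def mean_completion_logprob(token_logprobs, completion_mask):
    """Average log-likelihood over completion tokens only.

    token_logprobs: per-token log-probabilities under the evaluated model.
    completion_mask: 1 for completion tokens, 0 for prompt tokens.
    """
    selected = [lp for lp, m in zip(token_logprobs, completion_mask) if m]
    if not selected:
        raise ValueError("no completion tokens to average over")
    return sum(selected) / len(selected)

# Two prompt tokens followed by two completion tokens.
example = mean_completion_logprob([-0.1, -0.2, -1.0, -3.0], [0, 0, 1, 1])
print(example)  # -2.0
```

Dividing by the number of completion tokens rather than summing keeps long and short completions comparable, which is the length-bias concern stated in the paragraph.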
Synthetic random rewards.
For the synthetic-reward probe in Figure 3, each completion is mapped to a scalar in by applying SHA-256 to the UTF-8 encoding of and interpreting the leading 64 bits of the digest as an unsigned fraction. Let and denote the z-scores of the self-reward and of the hash reward above, each computed with the corresponding pooled global sample mean and sample standard deviation. We then define
where the are i.i.d. with . Figure 3 sweeps and plots the mean increase in under power versus standard sampling against the empirical covariance between and , using completions produced under standard sampling. The construction is designed to sweep in a controlled way; we plot empirical gain against this controlled covariance to visualize the qualitative rate prediction of Proposition 4.
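The hash-based reward and the z-scoring step can be sketched with the standard library. This is an illustration under assumptions: the leading 64 bits are read as a big-endian integer, the z-scores use the sample standard deviation (`statistics.stdev`), and the function names are hypothetical. The mixing of the two z-scores with noise is not reproduced here.

```python
import hashlib
import statistics

def hash_reward(completion: str) -> float:
    """Map a completion to a pseudo-random fraction via SHA-256.

    The first 8 bytes of the digest are read as a big-endian unsigned
    integer (an assumption about byte order) and divided by 2 ** 64.
    Float rounding can return exactly 1.0 for integers extremely close
    to 2 ** 64, which is vanishingly rare.
    """
    digest = hashlib.sha256(completion.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def z_scores(values):
    """Standardize with the pooled sample mean and sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

completions = ["x = 1", "x = 2", "the answer is 7", "abc"]
rewards = [hash_reward(c) for c in completions]
print(rewards)
print(z_scores(rewards))
```

Because the reward depends only on the bytes of the completion, it is deterministic across runs and carries no information about correctness or likelihood.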
Distillation.
We trained the student with supervised fine-tuning on the offline power-sampled dataset. Concretely, we minimized the standard token-level cross-entropy loss of a causal language model on the teacher-generated completion, masking the prompt tokens (i.e., the loss was computed only on the completion tokens). The student was initialized from the base model and was trained with LoRA adapters (, , dropout ) applied to q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj. We trained the models for 3 epochs using the AdamW optimizer with a weight decay of 0.01 and a linear warmup ratio of 0.03. The learning rate was tuned per dataset and model as summarized in Table S.2. We used per-device batch size 1 with 8 gradient accumulation steps, and enabled gradient checkpointing. We set the maximum sequence length to 1024 tokens to keep activation memory manageable on a single GPU. Teacher completions exceeding this cap were truncated, and the cross-entropy loss was computed on all in-window completion tokens. The truncation affected only a minority of completions (e.g., 83.6% of Qwen2.5-Math-7B completions on MATH fit fully within the cap), and each in-window token still provides a valid distillation signal toward .
| Dataset | Qwen2.5-7B | Qwen2.5-Math-7B | Phi-3.5-mini-instruct |
|---|---|---|---|
| MATH | |||
| HumanEval/MBPP | |||
| GPQA |
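The prompt masking and truncation used for distillation can be sketched without any training framework. The helper below is hypothetical: it assumes token ids are already available as lists and uses -100 as the ignore label, which is the default `ignore_index` of PyTorch's cross-entropy loss; whether the original training code used this exact convention is an assumption.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids, completion_ids, max_len=1024):
    """Concatenate prompt and completion, truncate, and mask prompt labels.

    Returns (input_ids, labels). Labels equal the input ids on in-window
    completion tokens and IGNORE_INDEX on prompt tokens, so the loss is
    computed only on the teacher-generated completion.
    """
    input_ids = (list(prompt_ids) + list(completion_ids))[:max_len]
    n_prompt = min(len(prompt_ids), max_len)
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return input_ids, labels

# A 3-token prompt and a 2-token completion, truncated to 4 tokens.
ids, labels = build_example([1, 2, 3], [4, 5], max_len=4)
print(ids)     # [1, 2, 3, 4]
print(labels)  # [-100, -100, -100, 4]
```

Truncated completions still contribute their in-window tokens to the loss, matching the description of how completions longer than the cap were handled.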
Hardware and execution time.
All experiments were conducted on GPU nodes equipped with two Intel Xeon Platinum 8360Y CPUs, 512 GiB of host memory, and eight NVIDIA A100 GPUs with 40 GiB of memory each. On a single GPU, supervised fine-tuning of one student per dataset and model finished in under one hour, while teacher generation via power sampling (Algorithm 2) took more than one day per dataset and model. The total compute is on the order of a few hundred A100-GPU-hours.
B.2 Additional results
B.2.1 Other datasets and models
This section reports results on additional dataset–model combinations that are not shown in the main text. In all cases, the distilled model has a higher than the base under standard autoregressive decoding. The distilled model often attains comparable to that of the corresponding base model with power sampling.
| All completions | Self-reward Best-of- | ||||
|---|---|---|---|---|---|
| Model | Sampling | ||||
| Qwen / Base | Standard | ||||
| Power | |||||
| Qwen / Distilled | Standard | ||||
| Temperature | |||||
| Phi / Base | Standard | ||||
| Power | |||||
| Phi / Distilled | Standard | ||||
| Temperature | |||||
| All completions | Self-reward Best-of- | ||||
|---|---|---|---|---|---|
| Model | Sampling | ||||
| Qwen-Math / Base | Standard | ||||
| Power | |||||
| Qwen-Math / Distilled | Standard | ||||
| Temperature | |||||
| Qwen / Base | Standard | ||||
| Power | |||||
| Qwen / Distilled | Standard | ||||
| Temperature | |||||
| Phi / Base | Standard | ||||
| Power | |||||
| Phi / Distilled | Standard | ||||
| Temperature | |||||
| All completions | Self-reward Best-of- | ||||
|---|---|---|---|---|---|
| Model | Sampling | ||||
| Qwen-Math / Base | Standard | ||||
| Power | |||||
| Qwen-Math / Distilled | Standard | ||||
| Temperature | |||||
| Qwen / Base | Standard | ||||
| Power | |||||
| Qwen / Distilled | Standard | ||||
| Temperature | |||||
| Phi / Base | Standard | ||||
| Power | |||||
| Phi / Distilled | Standard | ||||
| Temperature | |||||
B.2.2 Power
We also evaluated Power using Qwen2.5-Math-7B on MATH500. This variant runs the MH power-sampling loop and accepts a proposal if and only if (Algorithm 2), corresponding to the limit .
| All completions | Self-reward Best-of- | |||
|---|---|---|---|---|
| Sampling | ||||
| Power | ||||
B.2.3 Qualitative results
This section presents full completions for one MATH-style geometry problem summarized in Table 2 with the gold answer . The prompt is:
The coordinates of a parallelogram are , , , and with . What is the value of ?
B.2.4 Synthetic validation of suffix-Rényi odds corrections
To validate Proposition 1 in a setting that reflects the Zipf-like word-frequency structure of natural language, we construct a finite synthetic autoregressive distribution whose language-model next-token probabilities follow a Zipf-like law over many candidates. Unlike the extreme pivotal-token construction of karan2026reasoning, every next-token candidate is followed by a full-support suffix distribution. The construction is summarized in Figure S.5. The base next-token distribution has tokens with Zipf-like probabilities
For every token , the conditional suffix distribution has the same support size , no zero-probability suffixes, and a non-uniform power-law shape
The suffix exponent varies deterministically and non-monotonically with the next-token rank, using a sinusoidal component plus a small trend. Thus, all suffix distributions have identical support size and full support, but differ in sharpness. This deliberately avoids the singular-versus-uniform example in karan2026reasoning: the experiment isolates the more general quantity identified by Proposition 1, namely the suffix Rényi entropy. In Figure S.5, the left panel shows the Zipf-like next-token distribution, the middle panel shows the token-dependent suffix exponent , and the right panel shows representative full-support suffix distributions.
For each , we compute both the token-wise temperature next-token distribution and the sequence-level power next-token conditional exactly under this synthetic distribution. The temperature next-token distribution is
whereas the next-token conditional induced by the sequence-level power distribution is
Figure S.6 compares the two sides of Proposition 1 for every unordered token pair and every tested . The left panel plots the Rényi-predicted log odds correction against the directly computed power-versus-temperature log odds correction, while the right panel shows the distribution of these corrections at the main experimental exponent .
Figure S.7 illustrates the consequence of the correction at the level of next-token preferences: even when and temperature favors token , sequence-level power can favor token if has sufficiently lower suffix Rényi entropy.
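A small-scale version of this check can be written directly. The sketch assumes a two-stage distribution (one next token followed by one suffix draw), in which case the sequence-level power conditional is the token-wise temperature distribution reweighted by the suffix power mass, and the log odds correction between two tokens equals one minus the exponent times the difference of their suffix Rényi entropies of that order. The sizes, the Zipf exponent, the sinusoidal schedule `gammas`, and `ALPHA` are hypothetical stand-ins, not the values used for Figures S.5 to S.7.

```python
import itertools
import math

N_TOKENS, N_SUFFIX, ALPHA = 20, 50, 4.0  # hypothetical sizes and exponent

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def power_law(n, exponent):
    """Full-support distribution with probability proportional to rank ** -exponent."""
    return normalize([(k + 1) ** (-exponent) for k in range(n)])

base = power_law(N_TOKENS, 1.0)  # Zipf-like next-token distribution
# Suffix sharpness varies non-monotonically with token rank (sinusoid plus trend).
gammas = [1.0 + 0.8 * math.sin(0.9 * v) + 0.01 * v for v in range(N_TOKENS)]
suffix = [power_law(N_SUFFIX, g) for g in gammas]

# Token-wise temperature: power applied to the next-token probability only.
temp = normalize([p ** ALPHA for p in base])
# Sequence-level power: also weights each token by its suffix power mass.
suffix_mass = [sum(s ** ALPHA for s in suffix[v]) for v in range(N_TOKENS)]
power = normalize([base[v] ** ALPHA * suffix_mass[v] for v in range(N_TOKENS)])
# Renyi entropy of order ALPHA of each suffix distribution.
renyi = [math.log(m) / (1.0 - ALPHA) for m in suffix_mass]

def direct_correction(a, b):
    return math.log(power[a] / power[b]) - math.log(temp[a] / temp[b])

def renyi_correction(a, b):
    return (1.0 - ALPHA) * (renyi[a] - renyi[b])

pairs = list(itertools.combinations(range(N_TOKENS), 2))
max_gap = max(abs(direct_correction(a, b) - renyi_correction(a, b)) for a, b in pairs)
flips = [(a, b) for a, b in pairs if base[a] > base[b] and power[a] < power[b]]
print("largest gap between direct and Renyi-predicted corrections:", max_gap)
print("pairs where power reverses the base preference:", len(flips))
```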
B.2.5 Synthetic validation of optimal one-step proposals for sequential power sampling
We reuse the synthetic distribution of Section B.2.4 to validate Proposition 2. For a fixed prompt and an empty prefix, the unique variance-minimizing one-step proposal in Equation 10 reduces to
which equals the next-token conditional of the sequence-level power distribution and depends on the suffix power masses of every candidate token. We compare with three one-step proposals that do not use those suffix totals: the base proposal , the token-wise temperature proposal , and a uniform reference .
For each proposal , the first-step incremental importance weight in Equation 9 simplifies to
and we show its exact mean, the coefficient of variation , and the effective sample size fraction . By Proposition 2, only achieves and hence ; the closed-form values for the other proposals are computed exactly from the synthetic distribution.
Figure S.8 compares the four proposals at . The left panel shows the proposal probabilities; the oracle proposal equals the target next-token conditional by construction, and the temperature, base, and uniform proposals deviate from it, especially on next-token ranks where the suffix exponent is small and is large. The right panel plots : only the oracle proposal yields a constant log weight, while the other proposals produce token-dependent log weights.
Figure S.9 reports the exact and as a function of . The oracle proposal attains for every , whereas the gap between the temperature proposal and the oracle widens as grows, because larger amplifies the suffix power masses that the local temperature transform ignores.
Figure S.10 checks the same conclusion with Monte Carlo: for each proposal we draw tokens, compute the self-normalized , and average across replicates. The sampled concentrates around the exact values from Figure S.9 as grows, and the ordering of the proposals is preserved at every particle budget.
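The exact and sampled diagnostics can be reproduced in miniature. The sketch below assumes normalized one-step weights (target conditional over proposal), so the exact squared coefficient of variation is the second moment minus one and the effective sample size fraction is its reciprocal shifted by one; the sampled version uses the usual self-normalized estimate. The synthetic sizes, the schedule `gammas`, `ALPHA`, and the particle budget are hypothetical, not the settings behind Figures S.8 to S.10.

```python
import math
import random

N_TOKENS, N_SUFFIX, ALPHA = 20, 50, 4.0  # hypothetical sizes and exponent

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def power_law(n, exponent):
    return normalize([(k + 1) ** (-exponent) for k in range(n)])

base = power_law(N_TOKENS, 1.0)
gammas = [1.0 + 0.8 * math.sin(0.9 * v) + 0.01 * v for v in range(N_TOKENS)]
suffix_mass = [sum(s ** ALPHA for s in power_law(N_SUFFIX, g)) for g in gammas]
# Target: next-token conditional of the sequence-level power distribution.
target = normalize([b ** ALPHA * m for b, m in zip(base, suffix_mass)])

proposals = {
    "oracle": target,
    "temperature": normalize([b ** ALPHA for b in base]),
    "base": base,
    "uniform": [1.0 / N_TOKENS] * N_TOKENS,
}

def exact_cv2(q):
    """Squared coefficient of variation of w = target / q under v ~ q."""
    return sum(t * t / qv for t, qv in zip(target, q)) - 1.0

def exact_ess_fraction(q):
    return 1.0 / (1.0 + exact_cv2(q))

def sampled_ess_fraction(q, k, rng):
    """Self-normalized effective sample size divided by the number of draws."""
    draws = rng.choices(range(N_TOKENS), weights=q, k=k)
    w = [target[v] / q[v] for v in draws]
    return sum(w) ** 2 / (k * sum(x * x for x in w))

rng = random.Random(0)
for name, q in proposals.items():
    print(name, exact_cv2(q), exact_ess_fraction(q), sampled_ess_fraction(q, 2000, rng))
```

Only the oracle proposal needs the suffix power masses; the other three can be formed from the next-token distribution alone, which is why they pay a variance penalty.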