[2605.04542] Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation


Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

Akiyoshi Tomihari (tomihari@g.ecc.u-tokyo.ac.jp), Department of Computer Science, The University of Tokyo

Issei Sato (sato@g.ecc.u-tokyo.ac.jp), Department of Computer Science, The University of Tokyo

Abstract

Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model’s sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.

1 Introduction

The strong reasoning ability exhibited by large language models (LLMs) has often been attributed to reinforcement learning (RL). However, empirical analyses question whether RL explains emergent reasoning: as the number of sampled generations grows, post-RL models often fail to outperform their pre-RL counterparts, suggesting that RL may not be what endows LLMs with reasoning ability (yue2025does). At the same time, distillation has become a standard way to transfer the capabilities of expensive or stronger models to smaller models (hinton2015distilling; guo2025deepseek; busbridge2025distillation), and inference-time compute allocated to sampling or search has improved LLM performance (snell2024scaling; welleck2024from).

However, the relationship between sampling, RL, and self-distillation remains unclear. In particular, karan2026reasoning show that a base model, without additional training or external reward, can match or exceed post-RL models using power sampling. This raises the question of whether the success of power sampling reflects a mechanism distinct from RL and distillation, or whether these methods can be connected through a common structure. Clarifying such a connection is important because it can reveal whether gains that appear to come from different procedures in fact arise from a common mechanism, and whether an expensive inference-time procedure can be converted into an offline training objective.

In this study, we show that sampling, RL, and self-distillation are naturally connected through the power distribution. As illustrated in Figure 1, this distribution is the target of power sampling, the closed-form optimum of a self-reward RL objective, and the teacher distribution amortized by self-distillation. From the sampling perspective, a natural question is whether the effect of power sampling can be reproduced by an inexpensive token-level approximation. We show that this is structurally difficult: per-token approximations cannot match the power distribution without sequence-level information. From the RL perspective, the power distribution is the closed-form optimum of KL-regularized RL (ouyang2022instructgpt) when the reward is the model’s sequence-level log-probabilities, i.e., the self-reward in the sense of huang2025selfimprovement. Finally, by rewriting this RL objective, we derive power self-distillation as an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into offline training. We further show that power self-distillation achieves sharpening, and that whether the resulting sharpening improves a true reward is determined by a reward covariance under the power distribution.

Our contributions are summarized as follows. Figure 1 illustrates the connection we study, and Table S.1 compares these axes with prior work.

• We show that approximating power sampling at inference time is structurally hard: per-token approximations cannot match the power distribution without sequence-level information (Propositions 1 and 2).

• We show that the power distribution is the closed-form optimum of KL-regularized RL with the model’s sequence-level log-probabilities as the reward (Corollary 1), and derive power self-distillation by rewriting this RL objective, thereby amortizing expensive power sampling into offline training (Algorithm 1).

• We provide a sharpening bound for power self-distillation (Proposition 3), and characterize when the induced self-distillation improves a true reward through a covariance condition under the power distribution (Proposition 4).

Figure 1: Overview of our contribution. The power distribution connects sampling, KL-regularized RL, and self-distillation: it is the target of power sampling, the closed-form optimum of self-reward RL, and the teacher distribution amortized by power self-distillation.

2 Related work

RL post-training as distribution sharpening. RL has become a central tool in LLM post-training, including RL from human feedback (RLHF) (ouyang2022instructgpt) and RL with verifiable rewards (RLVR) (shao2024deepseekmath; guo2025deepseek; lambert2024tulu). However, a growing line of work questions whether such RL induces genuinely new reasoning capabilities. yue2025does showed that under pass@k evaluation, RLVR often improves sampling efficiency at small $k$ but can underperform the base model at large $k$ , suggesting that RLVR concentrates probability mass on reasoning paths already present in the base model’s distribution. Complementing this view, he2025rewarding analyzed a degenerate rank bias in GRPO that preferentially reinforces high-probability trajectories, yielding a “distribution sharpening” regime where simply sampling more from the base model can be stronger under the same sample budget. Motivated by the perspective that many RL gains resemble distribution sharpening, karan2026reasoning proposed a training-free inference-time method that targets sharpened distributions of the base model. Their approach uses a Metropolis–Hastings sampler to approximate sequence-level power sampling and achieves reasoning improvements comparable to RL. azizi2026power and ji2026scalable developed lower-latency approximations to power sampling. We complement this line of work by showing that the power distribution targeted by these samplers is also the closed-form optimum of a self-reward KL-regularized RL objective.

Inference-time compute and distillation to amortize inference cost. Recent work argues that allocating additional computation at inference time can substantially improve LLM outputs (snell2024scaling; welleck2024from). When an external reward or verifier is available, a common method is Best-of- $N$ , which generates $N$ candidates and selects the one with the highest reward; this simple strategy can yield strong empirical gains (stiennon2020learning; nakano2021webgpt; touvron2023llama; pmlr-v202-gao23h; eisenstein2023helping; mudgal2024controlled). To amortize the inference cost of Best-of- $N$ , several works characterized the distribution induced by Best-of- $N$ selection and proposed to distill this distribution into a single policy (gui2024bonbon; amini2025variational; sessa2025bond; yang2025fasterwind). In contrast to these reward-based distillation methods, we derive a self-distillation objective that amortizes power sampling itself, using only samples from the base model’s power distribution.

Self-improvement without external rewards. A growing number of empirical studies suggest that language models can improve without relying on external rewards or human-provided labels, using self-generated data and intrinsic training signals. huang2023large; wang2023self curated model-generated solutions or instructions and then fine-tuned on them. Several works perform RL using internal feedback alone, such as entropy minimization objectives (prabhudesai2025maximizing) or confidence as the reward (zhao2026learning). Even randomly assigned rewards can improve performance (shao2025spurious). huang2025selfimprovement formalized LLM self-improvement as distribution sharpening and analyzed algorithms motivated by SFT and KL-regularized RL. Building on this sharpening view, we show that the model’s sequence-level log-probabilities induce the power distribution through KL-regularized RL, and that distilling this distribution can sharpen the model without external rewards.

3 Preliminaries

Notation. Let $\mathcal{X}$ denote the space of prompts and let $\mu\in\Delta(\mathcal{X})$ denote a distribution over prompts. We consider completions of length $T\geq 1$ over a finite vocabulary $\mathcal{V}$ , and write $\mathcal{Y}:=\mathcal{V}^{T}$ for the completion space. The base model is a policy $\pi:\mathcal{X}\to\Delta(\mathcal{Y})$ and we write $\pi(\cdot\mid x)$ for the conditional distribution of $y$ given $x$ . With $y_{<t}:=(y_{1},\dots,y_{t-1})$ , we use the autoregressive factorization $\pi(y\mid x)=\prod_{t=1}^{T}\pi(y_{t}\mid x,y_{<t})$ . We write $a\lesssim b$ to mean $a=O(b)$ and $\pi(S\mid x):=\sum_{y\in S}\pi(y\mid x)$ for a set $S\subseteq\mathcal{Y}$ .

Self-improvement. Language models have been shown to be capable of self-improvement, improving their own performance without external rewards (huang2023large; wang2023self; prabhudesai2025maximizing; zhao2026learning). This phenomenon is counterintuitive and appears to contradict the data-processing inequality, which states that mutual information is non-increasing under further processing of random variables (cover1999elements). huang2025selfimprovement reconcile these observations by interpreting improvements as computational, not statistical: self-improvement sharpens the distribution so that sampling a near-optimal solution becomes easier. This perspective connects to classical trade-offs between sampling and optimization in theoretical computer science (kirkpatrick1983optimization; lovasz2006fast).

Formally, define the self-reward as the log-likelihood

r_{\mathrm{self}}(x,y;\pi)=\log\pi(y\mid x)

(1)

and let the corresponding maximizer set be

\bm{y}^{\star}(x):=\arg\max_{y\in\mathcal{Y}}r_{\mathrm{self}}(x,y;\pi).

(2)

Given $(\epsilon,\delta)\in(0,1)^{2}$, a policy $\widehat{\pi}$ is $(\epsilon,\delta)$-sharpened relative to $\pi$ if the following holds:

\mathbb{P}_{x\sim\mu}\Big[\widehat{\pi}\big(\bm{y}^{\star}(x)\mid x\big)\geq 1-\delta\Big]\geq 1-\epsilon.

(3)

huang2025selfimprovement analyze the sample complexity of achieving $(\epsilon,\delta)$-sharpening when $\pi$ is accessed only through conditional draws $y\sim\pi(\cdot\mid x)$ and likelihood evaluations $\pi(y\mid x)$, both for supervised fine-tuning on Best-of-$N$ targets sampled from $\pi$ and for KL-regularized RL objectives driven by $r_{\mathrm{self}}(x,y;\pi)$.
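As a concrete reading of Eq. (3), the following minimal sketch checks $(\epsilon,\delta)$-sharpening for small tabular policies. The function name `is_sharpened`, the dictionary layout `{prompt: {completion: probability}}`, and the toy numbers are illustrative assumptions and do not come from the paper.

```python
def is_sharpened(pi_hat, pi, mu, eps, delta):
    """Check Eq. (3) for tabular policies given as {x: {y: prob}}."""
    mass_ok = 0.0
    for x, mu_x in mu.items():
        # Maximizer set y*(x) of the self-reward log pi(y | x), as in Eq. (2).
        top = max(pi[x].values())
        y_star = [y for y, p in pi[x].items() if p == top]
        # Does pi_hat place at least 1 - delta mass on y*(x)?
        if sum(pi_hat[x][y] for y in y_star) >= 1.0 - delta:
            mass_ok += mu_x
    # The event must hold with probability at least 1 - eps over prompts.
    return mass_ok >= 1.0 - eps - 1e-12

# Hypothetical toy setting with two equally likely prompts.
mu = {"x1": 0.5, "x2": 0.5}
pi = {"x1": {"u": 0.6, "v": 0.4}, "x2": {"u": 0.3, "v": 0.7}}
pi_hat = {"x1": {"u": 0.95, "v": 0.05}, "x2": {"u": 0.5, "v": 0.5}}

print(is_sharpened(pi_hat, pi, mu, eps=0.5, delta=0.1))
print(is_sharpened(pi_hat, pi, mu, eps=0.1, delta=0.1))
```

In this toy case only prompt `x1` is concentrated on its maximizer, so the check passes for $\epsilon=0.5$ and fails for $\epsilon=0.1$.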

Power distribution. Recent analyses of RL suggest that empirical reasoning gains resemble distribution sharpening, where probability mass concentrates on trajectories already well supported under the base model (yue2025does; he2025rewarding). Motivated by this view, karan2026reasoning target inference-time sampling from the power distribution induced by the base model.

Definition 1 (Power distribution).

Given a policy $\pi:\mathcal{X}\to\Delta(\mathcal{Y})$ and an exponent $\alpha>1$, we define the power distribution induced by $\pi$ as

\pi_{\alpha}(y\mid x):=\frac{\pi(y\mid x)^{\alpha}}{\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)^{\alpha}}.

(4)

Exact sampling from Eq. (4) is intractable at scale. karan2026reasoning therefore propose a Metropolis–Hastings (MH) procedure that achieves reasoning accuracy competitive with strong RL post-training (shao2024deepseekmath; guo2025deepseek), without further training. Lower-latency approximations have subsequently been proposed (azizi2026power; ji2026scalable), but these methods still use substantially more inference-time compute than standard autoregressive sampling.
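To make Definition 1 concrete, the sketch below evaluates Eq. (4) by brute-force enumeration on a hypothetical two-token, length-two base model (the probabilities are made up for illustration). The normalizer has one term per sequence, i.e. $|\mathcal{V}|^{T}$ terms, which is why this direct approach only works at toy scale.

```python
import itertools

def power_distribution(probs, alpha):
    """Normalize probs[y] ** alpha over all sequences y, as in Eq. (4)."""
    weights = {y: p ** alpha for y, p in probs.items()}
    z = sum(weights.values())  # partition function: one term per sequence
    return {y: w / z for y, w in weights.items()}

# Hypothetical toy base model: vocabulary {"a", "b"}, length T = 2.
first = {"a": 0.6, "b": 0.4}                  # pi(y_1 | x)
second = {"a": {"a": 0.5, "b": 0.5},          # pi(y_2 | x, y_1 = "a")
          "b": {"a": 0.9, "b": 0.1}}          # pi(y_2 | x, y_1 = "b")
base = {s + u: first[s] * second[s][u]
        for s, u in itertools.product("ab", repeat=2)}

pi_alpha = power_distribution(base, alpha=4.0)
for y in sorted(base):
    print(y, round(base[y], 4), round(pi_alpha[y], 4))
```

The most likely sequence under the base model (`"ba"`) gains mass under $\pi_{\alpha}$, while the least likely one (`"bb"`) is suppressed.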

4 Approximating power sampling requires sequence-level information

In this section, we begin from the sampling perspective. We ask whether the power distribution $\pi_{\alpha}$ can be reproduced by inexpensive inference-time approximations, focusing on two natural local inference-time procedures: (i) a per-token tempered distribution (Section 4.1) and (ii) sequential importance sampling (SIS) with a one-step proposal (Section 4.2). In both cases, the gap to $\pi_{\alpha}$ is governed by sequence-level information that the local approximations do not access, showing why cheap inference-time approximations are structurally difficult and motivating the RL and self-distillation perspectives in Section 5.

4.1 Comparison to per-token temperature scaling

A natural way to locally approximate $\pi_{\alpha}$ is to apply the same power transformation at the token level during decoding. For $s\in\mathcal{V}$ , define the per-token tempered next-token distribution by

\pi_{\mathrm{temp},\alpha}(y_{t}=s\mid x,y_{<t}):=\frac{\pi(y_{t}=s\mid x,y_{<t})^{\alpha}}{\sum_{s^{\prime}\in\mathcal{V}}\pi(y_{t}=s^{\prime}\mid x,y_{<t})^{\alpha}}.

(5)

In contrast, the power distribution in Eq. (4) is, more precisely, the sequence-level power distribution $\pi_{\alpha}(\cdot\mid x)\propto\pi(\cdot\mid x)^{\alpha}$ , whose next-token conditional we denote by $\pi_{\mathrm{pow},\alpha}$ :

\pi_{\mathrm{pow},\alpha}(y_{t}=s\mid x,y_{<t}):=\frac{\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)^{\alpha}}{\sum_{y_{t:T}\in\mathcal{V}^{T-t+1}}\pi(y_{<t},\,y_{t:T}\mid x)^{\alpha}}.

(6)

We show that for arbitrary suffix distributions, the entire odds-ratio gap between Eqs. (5) and (6) is controlled by the Rényi entropy of the suffix.

Proposition 1 (Power vs. temperature odds ratios via suffix Rényi entropies).

For $\alpha>1$, a prompt $x$, a prefix $y_{<t}$, and $a\in\mathcal{V}$, let $q_{t,a}$ denote the conditional distribution of the suffix $Y_{t+1:T}$ under the base model,

q_{t,a}(y_{t+1:T}):=\pi(y_{t+1:T}\mid x,y_{<t},y_{t}=a).

For a distribution $p$ on a finite set, define the Rényi entropy of order $\alpha$ as $H_{\alpha}(p):=\frac{1}{1-\alpha}\log\sum_{z}p(z)^{\alpha}$. Then for any $a,b\in\mathcal{V}$ such that $\pi(y_{t}=a\mid x,y_{<t})>0$ and $\pi(y_{t}=b\mid x,y_{<t})>0$, the ratio of next-token odds under $\pi_{\mathrm{pow},\alpha}$ versus $\pi_{\mathrm{temp},\alpha}$ satisfies

\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}\bigg/\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\exp\!\bigl((1-\alpha)\,(H_{\alpha}(q_{t,a})-H_{\alpha}(q_{t,b}))\bigr).

(7)

We have $1-\alpha<0$ , so Eq. (7) implies that, among next-token candidates $a$ with comparable values of $\pi(y_{t}=a\mid x,y_{<t})$ , those for which $q_{t,a}$ has larger Rényi entropy are relatively downweighted under $\pi_{\mathrm{pow},\alpha}$ compared to $\pi_{\mathrm{temp},\alpha}$ . Thus, compared with per-token temperature scaling, sequence-level power sharpening favors continuations whose suffix distributions under $\pi$ are more peaked, i.e., have lower Rényi entropy.

Comparison to karan2026reasoning. karan2026reasoning also studied the gap between per-token temperature and sequence-level power sampling, and formalized it in the special case of two extreme tokens (positive vs. negative pivotal tokens; their Example 1 and Proposition 3). Our result enables a quantitative comparison for any two next-token candidates.

Proposition 1 suggests that matching the next-token distribution induced by sequence-level power sampling at a step requires information about the suffix distributions following each candidate token.
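The identity in Eq. (7) can be checked numerically on a small example. The sketch below assumes a hypothetical model with $T=2$ and $t=1$, so each suffix distribution $q_{1,a}$ is a single next-token distribution; all probabilities are invented for illustration.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha: log(sum_z p(z)^alpha) / (1 - alpha)."""
    return math.log(sum(v ** alpha for v in p)) / (1.0 - alpha)

alpha = 4.0
first = {"a": 0.6, "b": 0.4}                    # pi(y_1 | x)
suffix = {"a": [0.5, 0.5], "b": [0.9, 0.1]}     # q_{1,a} and q_{1,b}

# Odds under per-token temperature scaling, Eq. (5): normalizers cancel.
temp_odds = (first["a"] / first["b"]) ** alpha

# Odds under the sequence-level power conditional, Eq. (6):
# sum the alpha-th power of the joint probability over all suffixes.
mass = {s: sum((first[s] * q) ** alpha for q in suffix[s]) for s in first}
pow_odds = mass["a"] / mass["b"]

lhs = pow_odds / temp_odds
rhs = math.exp((1.0 - alpha) * (renyi_entropy(suffix["a"], alpha)
                                - renyi_entropy(suffix["b"], alpha)))
print(lhs, rhs)
```

Here token `a` has the higher-entropy suffix, so the ratio is below one: the sequence-level power conditional downweights `a` relative to temperature scaling, in line with the discussion after Eq. (7).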

4.2 Variance-minimizing one-step proposals for sequential power sampling

Beyond marginal token distributions, we turn to sequential importance sampling (SIS) targeting $\pi_{\alpha}$ , where a basic design goal is to stabilize incremental importance weights. Proposition 3.3 of zhao2024probabilistic identifies the unique one-step variance-minimizing proposal in a general SIS setup, and we apply it to the power distribution $\pi_{\alpha}$ .

Fix a prompt $x\in\mathcal{X}$ . Define the unnormalized power mass $\tilde{\pi}_{\alpha}(y_{1:T}):=\pi(y_{1:T}\mid x)^{\alpha}$ and, for $t=0,\dots,T$ , the prefix totals

\tilde{\pi}_{\alpha,t}(y_{1:t})\;:=\;\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\tilde{\pi}_{\alpha}(y_{1:T}),

(8)

where for $t=0$ the prefix is empty. Let $Z_{\alpha}:=\sum_{y_{1:T}\in\mathcal{V}^{T}}\tilde{\pi}_{\alpha}(y_{1:T})$ be the normalizing constant, so that $\tilde{\pi}_{\alpha,0}=Z_{\alpha}$ , and let $\pi_{\alpha}(\cdot\mid x)$ be the normalized power distribution on $\mathcal{V}^{T}$ from Eq. (4). For $t\geq 1$ , write $\pi_{\alpha}(y_{1:t}\mid x)$ for the prefix marginal obtained by summing $\pi_{\alpha}(y_{1:T}\mid x)$ over $y_{t+1:T}$ ; then $\pi_{\alpha}(y_{1:t}\mid x)=\tilde{\pi}_{\alpha,t}(y_{1:t})/Z_{\alpha}$ .

Consider extending a fixed prefix $y_{<t}$ by one token $Y_{t}\sim q(\cdot\mid x,y_{<t})$ in one step of SIS (or SMC without resampling), while keeping the global target $\pi_{\alpha}(\cdot\mid x)$ on $\mathcal{V}^{T}$ . Define the incremental importance weight (chopin2020introduction)

W_{t}\;:=\;\frac{\pi_{\alpha}(y_{<t},Y_{t}\mid x)}{\pi_{\alpha}(y_{<t}\mid x)\,q(Y_{t}\mid x,y_{<t})},

(9)

where we condition on $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$, and $\mathrm{Var}[W_{t}]$ denotes variance under $Y_{t}\sim q(\cdot\mid x,y_{<t})$. The next proposition identifies the unique proposal $q(\cdot\mid x,y_{<t})$ that minimizes $\mathrm{Var}[W_{t}]$ at such a prefix.

Proposition 2 (Variance-minimizing one-step proposal at prefix $y_{<t}$ ).

In the setting above, fix $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$ . Among all proposals $q(\cdot\mid x,y_{<t})$ on $\mathcal{V}$ , the unique minimizer of $\mathrm{Var}[W_{t}]$ is

q_{t}^{\star}(y_{t}\mid x,y_{<t})\;:=\;\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})}{\tilde{\pi}_{\alpha,t-1}(y_{<t})}\;=\;\pi_{\alpha}\bigl(y_{t}\mid x,y_{<t}\bigr),

(10)

where $y_{1:t}=(y_{<t},y_{t})$ ; the right-hand side equals the next-token conditional under $\pi_{\alpha}(\cdot\mid x)$ .

Proposition 2 implies that minimizing the local one-step variance of the incremental weight forces the proposal to coincide with the next-token conditional in Eq. (10), which itself depends on the prefix totals $\tilde{\pi}_{\alpha,t}$ summed over all suffixes. In particular, proposals that modify only the base next-token conditional cannot in general equal the unique minimizer in Eq. (10). The proof and SIS background are in Section A.2.
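A small numerical illustration of Proposition 2, under assumptions: the toy model below (vocabulary of two tokens, $T=2$, empty prefix at $t=1$) and the helper names are hypothetical. It compares $\mathrm{Var}[W_{t}]$ for three proposals: the power next-token conditional of Eq. (10), the base next-token conditional, and the per-token tempered conditional of Eq. (5).

```python
def power_next_token(first, suffix, alpha):
    """Next-token conditional of the sequence-level power distribution at t = 1."""
    mass = {s: sum((first[s] * q) ** alpha for q in suffix[s]) for s in first}
    z = sum(mass.values())
    return {s: m / z for s, m in mass.items()}

def weight_variance(target, proposal):
    """Var[W_t] under Y_t ~ proposal, where W_t = target(Y_t) / proposal(Y_t)."""
    mean = sum(proposal[s] * (target[s] / proposal[s]) for s in target)
    second = sum(proposal[s] * (target[s] / proposal[s]) ** 2 for s in target)
    return second - mean ** 2

alpha = 4.0
first = {"a": 0.6, "b": 0.4}                    # base conditional pi(y_1 | x)
suffix = {"a": [0.5, 0.5], "b": [0.9, 0.1]}     # pi(y_2 | x, y_1)

target = power_next_token(first, suffix, alpha)           # q* in Eq. (10)
norm = sum(p ** alpha for p in first.values())
tempered = {s: p ** alpha / norm for s, p in first.items()}  # Eq. (5)

var_star = weight_variance(target, target)
var_base = weight_variance(target, first)
var_temp = weight_variance(target, tempered)
print(var_star, var_base, var_temp)
```

Only the first proposal drives the variance to zero (up to floating-point error), and computing it required summing over suffixes; the two proposals built from the base next-token conditional alone leave a strictly positive variance in this example.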

Implication. Propositions 1 and 2 indicate that inexpensive one-step approximations cannot reproduce $\pi_{\alpha}$ without sequence-level information, leaving inference-time approximation of $\pi_{\alpha}$ structurally expensive. This aligns with prior work that expends additional inference-time compute (karan2026reasoning; azizi2026power; ji2026scalable) to approximate the power distribution.

5 From self-reward RL to power self-distillation

Section 4 shows that $\pi_{\alpha}$ is structurally expensive to approximate by sampling at inference time. In this section, we take the complementary view that $\pi_{\alpha}$ also connects RL and self-distillation, allowing us to shift the cost to offline training. Section 5.1 identifies $\pi_{\alpha}$ as the closed-form optimum of a KL-regularized RL objective with self-reward. Section 5.2 uses this identification to derive an offline self-distillation algorithm from that RL objective. Section 5.3 then analyzes what the resulting distilled model achieves: a sharpening guarantee on the self-reward, and a characterization of when sharpening also improves a true reward.

5.1 Power distribution as the optimum of self-reward RL

Let $q:\mathcal{X}\to\Delta(\mathcal{Y})$ be a candidate policy, and consider the KL-regularized RL objective with reward $r$ (ouyang2022instructgpt; guo2025deepseek)

J_{\beta}(q;\pi,r):=\mathbb{E}_{x\sim\mu}\Big[\mathbb{E}_{y\sim q(\cdot\mid x)}\big[r(x,y)\big]-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)\Big]

(11)

with $\beta>0$ . By the standard closed-form solution of KL-regularized RL (levine2018reinforcement), the unique maximizer of Eq. (11) is the reward-tilted distribution

\pi_{\beta}^{\star}(y\mid x):=\frac{\pi(y\mid x)\exp\!\big(\beta^{-1}r(x,y)\big)}{Z_{r}(x)},\qquad Z_{r}(x):=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)\exp\!\big(\beta^{-1}r(x,y^{\prime})\big).

(12)

We restate this as Proposition 5 in Section A.5 and include a proof for completeness. Specializing the reward in Eq. (12) to the self-reward $r_{\mathrm{self}}$ in Eq. (1) yields the power distribution.

Corollary 1 (Self-reward tilt equals the power distribution).

Suppose $r(x,y)=r_{\mathrm{self}}(x,y;\pi)=\log\pi(y\mid x)$ as in Eq. (1). Then the optimizer $\pi_{\beta}^{\star}$ in Eq. (12) equals the power distribution $\pi_{\alpha}$ in Eq. (4):

\pi_{\beta}^{\star}(\cdot\mid x)=\pi_{\alpha}(\cdot\mid x),\qquad\alpha:=1+\beta^{-1}>1.

(13)

This identification connects power sampling and self-improvement RL: the inference-time target of karan2026reasoning coincides with the closed-form optimum of the KL-regularized self-reward objective studied in huang2025selfimprovement, namely $\pi_{\alpha}$ .
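For reference, the substitution behind Corollary 1 can be written out in one line. Plugging $r=\log\pi$ into Eq. (12) gives the following, where the only step is $\exp(\beta^{-1}\log\pi)=\pi^{1/\beta}$:

```latex
\pi_{\beta}^{\star}(y\mid x)
  =\frac{\pi(y\mid x)\exp\!\big(\beta^{-1}\log\pi(y\mid x)\big)}{Z_{r}(x)}
  =\frac{\pi(y\mid x)^{1+\beta^{-1}}}{\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)^{1+\beta^{-1}}}
  =\pi_{\alpha}(y\mid x),
\qquad \alpha=1+\beta^{-1}.
```

As a worked instance, $\beta=1/3$ corresponds to $\alpha=4$, and a larger KL penalty $\beta\to\infty$ gives $\alpha\to 1$, recovering the base policy.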

5.2 Deriving power self-distillation

We now derive a self-distillation procedure from the RL objective without requiring the deployed model to sample from $\pi_{\alpha}$ at inference time.

RL objective as reverse and then forward KL to $\pi_{\alpha}$ . With $r=r_{\mathrm{self}}$ and $\alpha=1+\beta^{-1}$ , the inner objective in Eq. (11) can be rewritten for each $x$ as

\mathbb{E}_{y\sim q(\cdot\mid x)}[r_{\mathrm{self}}(x,y)]-\beta D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)=-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\alpha}(\cdot\mid x)\big)\;+\;\beta\log Z_{r}(x),

(14)

with the same partition function $Z_{r}(x)$ as in Eq. (12), which does not depend on $q$ . Thus, for each prompt $x$ , maximizing $J_{\beta}(q;\pi,r_{\mathrm{self}})$ over unconstrained $q$ is equivalent to minimizing the reverse KL divergence $D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\alpha}(\cdot\mid x)\big)$ , with unique minimizer $q(\cdot\mid x)=\pi_{\alpha}(\cdot\mid x)$ .
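The rewriting in Eq. (14) is easy to sanity-check numerically for a single prompt. The sketch below uses a hypothetical four-sequence base distribution and an arbitrary candidate policy `q`; both are assumptions chosen only to exercise the identity.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

beta = 0.5
alpha = 1.0 + 1.0 / beta                      # alpha = 3
pi = [0.3, 0.3, 0.36, 0.04]                   # hypothetical pi(. | x)
q = [0.1, 0.2, 0.6, 0.1]                      # arbitrary candidate policy

z = sum(p ** alpha for p in pi)               # Z_r(x) with r = log pi
pi_alpha = [p ** alpha / z for p in pi]

def objective(policy):
    """Inner objective of Eq. (11) with self-reward r = log pi."""
    reward = sum(w * math.log(p) for w, p in zip(policy, pi))
    return reward - beta * kl(policy, pi)

lhs = objective(q)
rhs = -beta * kl(q, pi_alpha) + beta * math.log(z)
print(lhs, rhs, objective(pi_alpha))
```

Since the reverse KL term vanishes at `q = pi_alpha`, the objective there equals $\beta\log Z_{r}(x)$, which is the maximum value.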

However, the reverse KL is an expectation under $q$ , so optimizing it directly would require on-policy samples from the learner. We therefore convert the objective into an offline distillation surrogate that shares the same target distribution $\pi_{\alpha}$ , by minimizing the forward KL from the teacher distribution to the student, $D_{\mathrm{KL}}\!\big(\pi_{\alpha}(\cdot\mid x)\,\|\,q(\cdot\mid x)\big)$ . The population minimizer over $q$ is still $\pi_{\alpha}$ , so this surrogate preserves the same target distribution while enabling offline maximum-likelihood training on teacher samples. This forward-KL surrogate mirrors the reward-augmented maximum-likelihood method of norouzi2016reward, who also exchanged the reverse KL appearing in entropy-regularized RL for a forward KL.

Forward KL yields MLE on teacher samples. Expanding the forward KL training objective gives

\mathbb{E}_{x\sim\mu}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\alpha}(\cdot\mid x)\,\|\,q(\cdot\mid x)\big)\right]=\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log\pi_{\alpha}(y\mid x)\big]-\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log q(y\mid x)\big],

(15)

where $x,y\sim\mu,\pi_{\alpha}$ abbreviates $x\sim\mu$ and $y\sim\pi_{\alpha}(\cdot\mid x)$ . The first term on the right-hand side of Eq. (15) does not depend on $q$ , so minimizing the population forward KL is equivalent to maximizing the expected log-likelihood $\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log q(y\mid x)\big]$ . In practice we form an empirical objective by drawing i.i.d. pairs $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ with $x_{i}\sim\mu$ and $y_{i}\sim\pi_{\alpha}(\cdot\mid x_{i})$ , and we solve the following maximum likelihood estimate (MLE) problem:

\widehat{\pi}\in\arg\max_{q\in\Pi_{\alpha}}\sum_{i=1}^{n}\log q(y_{i}\mid x_{i}).

(16)

This procedure uses only offline completions sampled from the power distribution $\pi_{\alpha}$ derived from the base policy $\pi$ , and it does not rely on any external reward labels, so it is an instance of self-distillation. We refer to it as power self-distillation and summarize it in Algorithm 1. In practice we run teacher inference once, store $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ , and then train the student with standard supervised fine-tuning on $D$ . Separating teacher generation from student training simplifies implementation and enables dataset reuse.
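The two-stage structure (teacher generation, then supervised training) can be illustrated with a deliberately simplified tabular sketch. It is not the paper's Algorithm 1: it assumes a single prompt, exact sampling from $\pi_{\alpha}$ instead of an MH sampler, and an unrestricted categorical student, for which the MLE in Eq. (16) is simply the empirical frequency of each completion. All names and numbers are hypothetical.

```python
import collections
import random

def power_distribution(probs, alpha):
    """Eq. (4) on an explicitly enumerated completion space."""
    weights = {y: p ** alpha for y, p in probs.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def distill(samples, support):
    """MLE for an unrestricted categorical student: empirical frequencies."""
    counts = collections.Counter(samples)
    n = len(samples)
    return {y: counts[y] / n for y in support}

rng = random.Random(0)
base = {"aa": 0.3, "ab": 0.3, "ba": 0.36, "bb": 0.04}   # toy pi(. | x)

# Stage 1: run the teacher once and store the dataset D.
teacher = power_distribution(base, alpha=4.0)
support = list(teacher)
dataset = rng.choices(support, weights=[teacher[y] for y in support], k=20000)

# Stage 2: fit the student on D by maximum likelihood.
student = distill(dataset, support)
for y in support:
    print(y, round(base[y], 3), round(teacher[y], 3), round(student[y], 3))
```

After training, plain sampling from `student` approximates sampling from $\pi_{\alpha}$, so the cost of the teacher is paid once at dataset-construction time. With a neural student the second stage would be standard supervised fine-tuning on the stored pairs instead of counting.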

5.3 Sharpening and true reward under power self-distillation

In this subsection, we analyze two complementary aspects of power self-distillation: (i) Proposition 3 bounds the extent to which $\widehat{\pi}$ sharpens the self-reward, in the sense of huang2025selfimprovement; (ii) Proposition 4 shows that the local rate at which sharpening changes a true reward is determined by the covariance $\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}})$ .

(i) Self-reward sharpening of the distilled model. huang2025selfimprovement formalize self-improvement via $(\epsilon,\delta)$ -sharpening relative to $\pi$ as in Eq. (3). The next proposition bounds how well the MLE in Eq. (16) concentrates on the self-reward maximizer set $\bm{y}^{\star}(x)$ in Eq. (2).

Proposition 3 (Power self-distillation and sharpening).

Fix $\alpha>1$ and $\rho,\delta\in(0,1)$ . Suppose $\pi_{\alpha}\in\Pi_{\alpha}$ and there exists a constant $M<\infty$ , independent of $\alpha$ , such that $|\Pi_{\alpha}|\leq M$ . Let $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ be i.i.d. samples with $x_{i}\sim\mu,y_{i}\sim\pi_{\alpha}(\cdot\mid x_{i})$ , and let $\widehat{\pi}\in\arg\max_{q\in\Pi_{\alpha}}\sum_{i=1}^{n}\log q(y_{i}\mid x_{i})$ be an MLE. Then with probability at least $1-\rho$ over $D$ ,

\mathbb{P}_{x\sim\mu}\big[\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\big]\ \lesssim\ \frac{\log(M\rho^{-1})}{\delta n}\;+\;\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big].

(17)

In particular, the right-hand side of Eq. (17) converges to $0$ as $n\to\infty$ and $\alpha\to\infty$.

Thus, for sufficiently large $n$ and $\alpha$, power self-distillation can achieve $(\epsilon,\delta)$-sharpening in the sense of Eq. (3). The proof is in Section A.3.

(ii) When does sharpening also improve a different true reward? Proposition 3 guarantees concentration on the self-reward maximizer set, but evaluation is typically governed by a different true reward (e.g., correctness). Let $r^{\star}:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ denote this true reward and define, for fixed $x\in\mathcal{X}$ ,

R(\alpha;x):=\mathbb{E}_{y\sim\pi_{\alpha}(\cdot\mid x)}\big[r^{\star}(x,y)\big].

The next proposition characterizes how $R(\alpha;x)$ changes with $\alpha$ .

Proposition 4 (Covariance form of $\partial_{\alpha}R(\alpha;x)$ ).

For any $\alpha>0$ and any fixed $x\in\mathcal{X}$ ,

\frac{\partial}{\partial\alpha}R(\alpha;x)=\mathrm{Cov}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r^{\star}(x,y),r_{\mathrm{self}}(x,y)\big),

(18)

where covariances are over the support of $\pi(\cdot\mid x)$ , on which $\log\pi(\cdot\mid x)$ is finite. In particular, if for some $b(x)\in\mathbb{R}$ and $c(x)>0$ ,

r^{\star}(x,y)=c(x)\,r_{\mathrm{self}}(x,y)+b(x)\qquad\forall y\in\mathcal{Y},

(19)

then $R(\alpha;x)$ is non-decreasing in $\alpha$ :

\frac{\partial}{\partial\alpha}R(\alpha;x)=c(x)\,\mathrm{Var}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r_{\mathrm{self}}(x,y)\big)\geq 0.

The proof is in Section A.4. Proposition 4 states that $\partial_{\alpha}R(\alpha;x)$ equals the covariance between $r^{\star}$ and $r_{\mathrm{self}}$ under $\pi_{\alpha}(\cdot\mid x)$ , so whether increasing $\alpha$ improves the true reward is determined exactly by $\mathrm{Cov}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r^{\star}(x,y),r_{\mathrm{self}}(x,y)\big)$ . In particular, when $r^{\star}=r_{\mathrm{self}}$ , this covariance reduces to $\mathrm{Var}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(\log\pi(y\mid x)\big)\geq 0$ , so $R(\alpha;x)$ is non-decreasing in $\alpha$ .
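Eq. (18) can be verified with a central finite difference on a toy example. In the sketch below, the base distribution and the binary true reward `r_true` are hypothetical; the check compares the numerical derivative of $R(\alpha;x)$ with the covariance of $r^{\star}$ and $r_{\mathrm{self}}=\log\pi$ under $\pi_{\alpha}$.

```python
import math

pi = [0.3, 0.3, 0.36, 0.04]        # hypothetical pi(. | x)
r_true = [1.0, 0.0, 1.0, 0.0]      # hypothetical correctness labels r*(x, y)
r_self = [math.log(p) for p in pi]

def power(alpha):
    """Power distribution pi_alpha(. | x) from Eq. (4)."""
    z = sum(p ** alpha for p in pi)
    return [p ** alpha / z for p in pi]

def expected(values, dist):
    return sum(v * d for v, d in zip(values, dist))

def R(alpha):
    """Expected true reward under pi_alpha."""
    return expected(r_true, power(alpha))

alpha, h = 2.0, 1e-5
finite_diff = (R(alpha + h) - R(alpha - h)) / (2.0 * h)

dist = power(alpha)
cov = (expected([a * b for a, b in zip(r_true, r_self)], dist)
       - expected(r_true, dist) * expected(r_self, dist))
print(finite_diff, cov)
```

In this example the covariance is positive, so increasing $\alpha$ raises the expected true reward; flipping the labels in `r_true` flips the sign of both quantities.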

Figure 2: Sharper power sampling raises both $r_{\mathrm{self}}$ and $r^{\star}$. A smaller temperature parameter $\tau=1/\alpha$ (sharper distribution) increases $r_{\mathrm{self}}$ and true reward $r^{\star}$ (accuracy) on MATH500.

6 Numerical evaluation

This section experimentally validates the following points.

• (RQ1) Power sampling increases self-reward (Section 5.1).

• (RQ2) Sharpening can improve true reward when $r^{\star}$ aligns with $r_{\mathrm{self}}$ (Section 5.3).

• (RQ3) Power self-distillation achieves self-improvement (Section 5).

Detailed experimental setups are provided in Section B.1, and synthetic experiments validating Section 4 are shown in Sections B.2.4 and B.2.5.

6.1 Setup

We used the Qwen2.5-Math-7B (yang2024qwen2), Qwen2.5-7B (yang2024qwen), and Phi-3.5-mini-instruct (abdin2024phi) models on the MATH (lightman2024lets), HumanEval (chen2021evaluating), MBPP (austin2021program), and GPQA (rein2024gpqa) datasets. In the main text, we focus on the Qwen2.5-Math-7B model on the MATH dataset, which consists of 12,500 competition-style math problems spanning seven categories. For evaluation, we used MATH500, a selected subset of the MATH test set. For power self-distillation (Algorithm 1), we sampled 500 training problems from MATH, excluding those in MATH500. We fine-tuned with LoRA adapters (hu2022lora) using the AdamW optimizer (loshchilov2017decoupled).

For power sampling, we used the MH procedure of karan2026reasoning (Algorithm 2) with their default hyperparameters, including $\alpha=4.0$ . For additional baselines, we used standard autoregressive sampling (Standard) and token-wise temperature scaling (Temperature) with $\tau=1/\alpha=0.25$ , so that the token-level baseline uses the same local power exponent as power sampling. We studied three model variants: the base model (Base), the power-distilled model (Power-distilled, Algorithm 1), and a randomly initialized model (RandW). RandW is a negative control for cases where likelihood is not aligned with correctness.

To study the relationship between true reward $r^{\star}$ and self-reward $r_{\mathrm{self}}$ , we additionally evaluated an approach we call self-reward Best-of- $N$ : given $N$ sampled completions $\{y_{i}\}_{i=1}^{N}$ , we selected the completion with the largest value of $r_{\mathrm{self}}(y_{i})$ . In all experiments, $r_{\mathrm{self}}$ denotes the completion-token average log-likelihood under the evaluated model, with prompt tokens masked out; this length normalization makes values comparable across completions.
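The selection rule can be sketched as follows, assuming per-token log-probabilities of each completion are already available with prompt tokens removed. The candidate records, field names, and numbers are hypothetical; the snippet only illustrates the length-normalized score and the arg-max selection described above.

```python
def self_reward(token_logprobs):
    """Average log-likelihood over completion tokens (prompt tokens excluded)."""
    return sum(token_logprobs) / len(token_logprobs)

def self_reward_best_of_n(candidates):
    """Return the candidate with the largest length-normalized self-reward."""
    return max(candidates, key=lambda c: self_reward(c["logprobs"]))

# Hypothetical candidates: a short uncertain answer and a long confident one.
candidates = [
    {"text": "short", "logprobs": [-0.5]},
    {"text": "long", "logprobs": [-0.1] * 8},
]

best = self_reward_best_of_n(candidates)
print(best["text"], self_reward(best["logprobs"]))
```

Without normalization, the summed log-likelihoods would be $-0.5$ and $-0.8$ and the short completion would win; averaging over tokens selects the long one, which is the comparability across completion lengths that the normalization is meant to provide.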

Table 1: True reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$ for Qwen2.5-Math-7B on MATH500. The left two columns report means over all sampled completions. The right two columns report self-reward Best-of-$N$: the completion with the largest $r_{\mathrm{self}}$ among samples generated with different seeds. Evaluated with four seeds.

		All completions		Self-reward Best-of- $N$
Model	Sampling	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$
RandW	Standard	$0.000\pm 0.000$	$-13.406\pm 0.008$	$0.000$	$-13.309$
RandW	Power	$0.000\pm 0.000$	$-12.370\pm 0.014$	$0.000$	$-12.133$
Base	Standard	$0.508\pm 0.016$	$-0.316\pm 0.043$	$0.680$	$-0.097$
Base	Temperature	$0.683\pm 0.014$	$-0.061\pm 0.015$	$0.756$	$-0.036$
Base	Power	$0.714\pm 0.006$	$-0.077\pm 0.001$	$0.742$	$-0.062$
Power-distilled	Standard	$0.643\pm 0.025$	$-0.089\pm 0.005$	$0.763$	$-0.042$
Power-distilled	Temperature	$\mathbf{0.722}\pm 0.009$	$\mathbf{-0.043}\pm 0.001$	$\mathbf{0.768}$	$\mathbf{-0.034}$

Table 2: Qualitative comparison for Qwen2.5-Math-7B on a MATH500 question: “The coordinates of a parallelogram are $(5,3)$ , $(6,8)$ , $(7,4)$ , and $(x,y)$ with $x>7$ . What is the value of $x+y$ ?”

Model	Sampling	Correctness	Summary
Base	Temperature	No	Uses irrelevant mathematical properties and generates an incorrect formula, resulting in a hallucinated final answer.
Base	Power	Yes	Maintains logical consistency and mathematical accuracy, but simulates a Python execution to present a non-executed solution.
Distilled	Standard	Yes	Shows robust reasoning and self-correction by re-evaluating the problem when constraints are not met.

6.2 Results

Power sampling increases self-reward (RQ1). Table 1 shows mean self-reward ( $r_{\mathrm{self}}$ ) and accuracy ( $r^{\star}$ ) over sampled completions (left two columns). Power sampling raises $r_{\mathrm{self}}$ for both the base model and RandW. Decoding with token-wise temperature also raises $r_{\mathrm{self}}$ on the base model.

Sharpening can improve true reward when $r^{\star}$ aligns with $r_{\mathrm{self}}$ (RQ2). Table 1 also shows that higher $r_{\mathrm{self}}$ is typically accompanied by higher true reward $r^{\star}$ , except on RandW, where $r_{\mathrm{self}}$ is not aligned with $r^{\star}$ . Notably, self-reward Best-of- $N$ raises $r^{\star}$ above the all-completions mean for every Base and Power-distilled configuration (e.g., from $0.508$ to $0.680$ for Base with standard sampling), whereas RandW stays at $0.000$ .

To make this point clearer, Figure 3 plots the decoding temperature $\tau=1/\alpha$ against $r_{\mathrm{self}}$ and $r^{\star}$ ; both quantities decrease as $\tau$ increases (i.e., as sharpening weakens). Figure 3 uses synthetic rewards (see Section B.1 for details), whose correlation with $r_{\mathrm{self}}$ ranges from positive to negative; the gain in the synthetic reward $r$ from power sampling grows roughly linearly with $\mathrm{Cov}(r,r_{\mathrm{self}})$ .

Power self-distillation achieves self-improvement (RQ3). Table 1 shows that after power self-distillation, the student with temperature decoding scores higher on both $r^{\star}$ and $r_{\mathrm{self}}$ than the base model under standard sampling, temperature sampling, or power sampling. The strongest result is obtained by combining power self-distillation with Temperature decoding. At inference time, the student uses only autoregressive decoding (with temperature), thereby amortizing the inference cost of power sampling into offline training.

Qualitative example. Table 2 summarizes completions on one MATH500 problem. With token-wise temperature, the model cites irrelevant facts and concludes with a hallucinated formula, plausibly because token-wise tilting in Eq. (5) does not coincide with sequence-level tilting in Eq. (6). Power sampling instead tilts toward $\pi_{\alpha}$ and is graded correct, but the completion includes plausible Python code that is never executed, and the model only mimics a reasoning pattern. After power self-distillation, standard decoding yields the correct answer with more robust step-by-step reasoning. The full completions are shown in Section B.2.3.

Additional dataset–model combinations are reported in Section B.2.1; in each case, the distilled model attains a higher $r^{\star}$ than the corresponding base model under standard autoregressive decoding.

7 Conclusion

We showed that the power distribution bridges power sampling, self-reward KL-regularized RL, and self-distillation as the sampling target, closed-form RL optimum, and teacher distribution. From the sampling perspective, inexpensive local approximations are structurally limited: per-token temperature scaling and variance-minimizing one-step proposals both miss sequence-level information. From the RL perspective, the same sequence-level power distribution is the optimizer of KL-regularized RL when the reward is the model’s sequence-level log-probabilities. This identification yields power self-distillation, an offline surrogate that amortizes power sampling into supervised training on teacher samples. Power self-distillation can achieve self-reward sharpening, while true-reward improvement is governed by $\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}})$ . Finally, we supported the analysis with experiments.

Limitations. Self-improvement through sharpening and distillation inherits the capabilities of the base model, so gains can be small when the base is weak; improving base-model quality (e.g., pretraining) is outside our scope. Our analysis and experiments focus on autoregressive language models over finite horizons.

References

Algorithm 1 Power self-distillation

1:Input: base model $\pi$ ; power exponent $\alpha>0$ ; teacher sampler $\textsc{TeacherSample}(x;\pi,\alpha)$ approximating $\pi_{\alpha}(\cdot\mid x)$ (e.g., Algorithm 2); prompt source $\mu$ ; dataset size $n$ ; student model $q_{\theta}$ (initialized from $\pi$ )

2:Output: trained student $q_{\theta}$

3:Offline teacher sampling (data collection):

4:for $i=1,\dots,n$ do

5: Sample $x_{i}\sim\mu$

6: Sample $y_{i}\sim\textsc{TeacherSample}(x_{i};\pi,\alpha)$

7:end for

8:Store $D\leftarrow\{(x_{i},y_{i})\}_{i=1}^{n}$

9:Student training (supervised fine-tuning):

10:Optimize completion-only NLL on $D$ :

\theta\leftarrow\arg\min_{\theta}\sum_{(x,y)\in D}-\log q_{\theta}(y\mid x),

11:where the loss is computed only on completion tokens by masking prompt tokens.
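The two phases of Algorithm 1 can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that are not part of the paper's setup: there is no prompt, the base model is a small table of completion probabilities, the teacher draws exactly from $\pi_{\alpha}$ instead of running Algorithm 2, and the student is a tabular distribution whose NLL minimizer is the empirical frequency. All function names are hypothetical.

```python
import random
from collections import Counter


def power_distribution(pi, alpha):
    """pi_alpha(y) = pi(y)^alpha / Z_alpha for a dict pi: completion -> probability."""
    weights = {y: p ** alpha for y, p in pi.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def teacher_sample(target, rng):
    """Exact draw from the power distribution (stand-in for an MCMC teacher)."""
    ys = list(target)
    return rng.choices(ys, weights=[target[y] for y in ys], k=1)[0]


def power_self_distill(pi, alpha, n, seed=0):
    """Collect n teacher samples (steps 4-8), then fit the student by MLE (steps 9-11)."""
    rng = random.Random(seed)
    target = power_distribution(pi, alpha)
    data = [teacher_sample(target, rng) for _ in range(n)]
    counts = Counter(data)
    # For a tabular student, the NLL minimizer is the empirical frequency.
    return {y: counts[y] / n for y in pi}


base = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical base model over three completions
target = power_distribution(base, 4.0)
student = power_self_distill(base, alpha=4.0, n=5000)
print({y: round(p, 3) for y, p in target.items()})
print({y: round(p, 3) for y, p in student.items()})
```

In this toy setting the student's probabilities approach those of $\pi_{\alpha}$ as $n$ grows, and sampling from the student afterwards needs no further teacher calls.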

Algorithm 2 Power sampling using Metropolis–Hastings [karan2026reasoning] (Power $(\infty)$ : deterministic acceptance).

Notation.

Let $\tilde{\pi}_{\alpha}(y\mid x)\propto\pi(y\mid x)^{\alpha}$ denote the unnormalized power target. Let $A(y^{\prime},y)$ denote the Metropolis–Hastings acceptance ratio for moving from the current completion $y$ to the proposed completion $y^{\prime}$ (with $x$ fixed), where $p_{\mathrm{prop}}(y^{\prime}\mid y,x)$ denotes the proposal density of obtaining $y^{\prime}$ from $y$ by resampling a suffix autoregressively under $p_{\mathrm{prop}}$ :

A(y^{\prime},y)\;:=\;\min\left\{1,\ \frac{\tilde{\pi}_{\alpha}(y^{\prime}\mid x)}{\tilde{\pi}_{\alpha}(y\mid x)}\cdot\frac{p_{\mathrm{prop}}(y\mid y^{\prime},x)}{p_{\mathrm{prop}}(y^{\prime}\mid y,x)}\right\}.

(20)

1:Input: base model $\pi$ ; proposal $p_{\mathrm{prop}}$ ; prompt $x$ ; completion length $T$ with $B\mid T$ ; block size $B$ ; inner iterations $N_{\mathrm{MCMC}}$ ; exponent $\alpha>0$

2:Output: completion $y_{1:T}$ ( $\mathrm{MH}$ : approximate sample from the powered conditional $\pi(\cdot\mid x)^{\alpha}$ up to MCMC error; Power $(\infty)$ : accept proposals only if $\pi(y^{\prime}\mid x)>\pi(y\mid x)$ , so $\pi(y\mid x)$ is monotone along accepted moves).

3:for $k=0,1,\dots,T/B-1$ do

4: Given the current state $y_{1:kB}$ , construct an initialization $y^{(0)}$ by extending autoregressively with $p_{\mathrm{prop}}$ to length $(k+1)B$ :

y^{(0)}_{t}\sim p_{\mathrm{prop}}(y_{t}\mid x,y_{<t}),\qquad kB+1\leq t\leq(k+1)B.

5: Set $y\leftarrow y^{(0)}$

6: for $n=1,\dots,N_{\mathrm{MCMC}}$ do

7: Sample $m$ uniformly from $\{1,\dots,(k+1)B\}$

8: Construct a proposal completion $y^{\prime}$ with prefix $y_{1:m-1}$ and resample the suffix:

y^{\prime}_{t}\sim p_{\mathrm{prop}}(y_{t}\mid x,y^{\prime}_{<t}),\qquad m\leq t\leq(k+1)B.

9: (MH) Compute $A(y^{\prime},y)$ from Eq. (20). Draw $u\sim\mathrm{Uniform}(0,1)$ . If $u\leq A(y^{\prime},y)$ , set $y\leftarrow y^{\prime}$

10: (Power $(\infty)$ ) If $\pi(y^{\prime}\mid x)>\pi(y\mid x)$ , set $y\leftarrow y^{\prime}$

11: end for

12: Set $y_{1:(k+1)B}\leftarrow y$ as the current state carried into the next block iteration.

13:end for

14:return $y_{1:T}$
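To make the control flow of the MH variant concrete, here is a minimal sketch on a toy model. It assumes a two-token vocabulary, a first-order Markov chain as the base model, and a proposal equal to the untempered base model (the experiments instead use a tempered proposal); under that assumption the acceptance ratio of Eq. (20) reduces to the suffix likelihood ratio raised to the power $\alpha-1$ . The function names and the `init`/`trans` tables are hypothetical.

```python
import math
import random


def suffix_log_prob(seq, start, init, trans):
    """log of prod_{t >= start} pi(y_t | y_<t) under a first-order Markov base model."""
    total = 0.0
    for t in range(start, len(seq)):
        p = init[seq[t]] if t == 0 else trans[seq[t - 1]][seq[t]]
        total += math.log(p)
    return total


def extend(prefix, length, init, trans, rng):
    """Extend `prefix` to `length` tokens by sampling from the proposal (= base model here)."""
    seq = list(prefix)
    while len(seq) < length:
        probs = init if not seq else trans[seq[-1]]
        seq.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return seq


def log_accept(y_new, y_old, m, alpha, init, trans):
    """log A(y', y) from Eq. (20) when the suffix from 0-based index m is resampled."""
    lp_new = suffix_log_prob(y_new, m, init, trans)
    lp_old = suffix_log_prob(y_old, m, init, trans)
    log_target_ratio = alpha * (lp_new - lp_old)  # the shared prefix cancels
    log_proposal_ratio = lp_old - lp_new          # p_prop(y | y') / p_prop(y' | y)
    return min(0.0, log_target_ratio + log_proposal_ratio)


def power_sample(init, trans, T, B, n_mcmc, alpha, rng):
    """Block-wise MH loop following steps 3-14 (MH acceptance only)."""
    if T % B != 0:
        raise ValueError("block size B must divide T")
    y = []
    for k in range(T // B):
        y = extend(y, (k + 1) * B, init, trans, rng)          # steps 4-5
        for _ in range(n_mcmc):                               # step 6
            m = rng.randrange(len(y))                         # step 7 (0-based index)
            y_new = extend(y[:m], len(y), init, trans, rng)   # step 8
            if rng.random() < math.exp(log_accept(y_new, y, m, alpha, init, trans)):
                y = y_new                                     # step 9
    return y


init = [0.7, 0.3]                  # hypothetical first-token probabilities
trans = [[0.8, 0.2], [0.4, 0.6]]   # hypothetical transition probabilities
sample = power_sample(init, trans, T=8, B=4, n_mcmc=10, alpha=4.0, rng=random.Random(0))
print(sample)
```

Working in log space avoids underflow for long sequences; the Power $(\infty)$ variant would replace the random acceptance test with a deterministic comparison of the two suffix log-probabilities.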

Table S.1: Comparison of prior work by its connection to power distributions, sampling, RL, distillation, and whether it avoids external rewards.

Paper	Power distribution	Sampling	RL	Distillation	No external reward
norouzi2016reward	–	–	✓	✓	–
rusu2016policy	–	–	✓	✓	–
teh2017distral	–	–	✓	✓	–
laskin2023incontext	–	–	✓	✓	–
huang2025selfimprovement	–	✓	✓	✓	✓
gui2024bonbon	–	✓	–	✓	–
amini2025variational	–	✓	–	✓	–
balashankar2025infalign	–	✓	✓	–	–
sessa2025bond	–	✓	✓	✓	–
yang2025fasterwind	–	✓	✓	✓	–
karan2026reasoning	✓	✓	–	–	✓
azizi2026power	✓	✓	–	–	✓
ji2026scalable	✓	✓	–	–	✓
Ours	✓	✓	✓	✓	✓

Appendix A Proofs and background

A.1 Proof of Proposition 1

Proof of Proposition 1.

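Fix $x$ and a prefix $y_{<t}$ with $\pi(y_{<t}\mid x)>0$ , and write $p(s):=\pi(y_{t}=s\mid x,y_{<t})$ for $s\in\mathcal{V}$ . With $q_{t,s}(\cdot):=\pi(y_{t+1:T}=\cdot\mid x,y_{<t},y_{t}=s)$ denoting the conditional distribution of the suffix after token $s$ , autoregressive factorization gives, for any suffix $y_{t+1:T}$ and token $s$ ,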

\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)=\pi(y_{<t}\mid x)\,p(s)\,q_{t,s}(y_{t+1:T}).

Using Eq. (6), the numerator for token $s$ is therefore

\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)^{\alpha}=\pi(y_{<t}\mid x)^{\alpha}\,p(s)^{\alpha}\sum_{y_{t+1:T}}q_{t,s}(y_{t+1:T})^{\alpha}.

Summing over $s\in\mathcal{V}$ yields the corresponding denominator in Eq. (6), so the prefix factor $\pi(y_{<t}\mid x)^{\alpha}$ cancels and

\pi_{\mathrm{pow},\alpha}(y_{t}=s\mid x,y_{<t})=\frac{p(s)^{\alpha}\sum_{z}q_{t,s}(z)^{\alpha}}{\sum_{s^{\prime}\in\mathcal{V}}p(s^{\prime})^{\alpha}\sum_{z}q_{t,s^{\prime}}(z)^{\alpha}}.

(21)

For temperature scaling, Eq. (5) gives $\pi_{\mathrm{temp},\alpha}(y_{t}=s\mid x,y_{<t})=p(s)^{\alpha}/\sum_{s^{\prime}}p(s^{\prime})^{\alpha}$ . Hence for any $a,b\in\mathcal{V}$ with $p(b)>0$ ,

\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}=\frac{p(a)^{\alpha}\sum_{z}q_{t,a}(z)^{\alpha}}{p(b)^{\alpha}\sum_{z}q_{t,b}(z)^{\alpha}},\qquad\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\left(\frac{p(a)}{p(b)}\right)^{\alpha},

and therefore

\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}\bigg/\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\frac{\sum_{z}q_{t,a}(z)^{\alpha}}{\sum_{z}q_{t,b}(z)^{\alpha}}.

By the definition of $H_{\alpha}$ , $\sum_{z}q(z)^{\alpha}=\exp\!\bigl((1-\alpha)H_{\alpha}(q)\bigr)$ , which yields Eq. (7). ∎
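The identity at the end of this proof can be checked numerically. The sketch below uses a hypothetical two-token, two-step model (the values of `p` and `q` are made up): it computes the sequence-level power conditional by brute force, the token-wise temperature conditional, and compares the ratio of their odds with the Rényi-entropy expression.

```python
import math

alpha = 4.0
p = [0.6, 0.4]                # p(s): next-token probabilities at step t
q = [[0.5, 0.5], [0.9, 0.1]]  # q[s][z]: suffix distribution after token s


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def renyi_entropy(dist, order):
    """H_order(dist) = log(sum_z dist(z)^order) / (1 - order)."""
    return math.log(sum(x ** order for x in dist)) / (1.0 - order)


# Sequence-level power conditional: sum the powered joint over all suffixes.
pow_cond = normalize(
    [sum((p[s] * q[s][z]) ** alpha for z in range(2)) for s in range(2)]
)
# Token-wise temperature conditional: power of the local probabilities only.
temp_cond = normalize([p[s] ** alpha for s in range(2)])

odds_ratio = (pow_cond[0] / pow_cond[1]) / (temp_cond[0] / temp_cond[1])
renyi_ratio = math.exp(
    (1.0 - alpha) * (renyi_entropy(q[0], alpha) - renyi_entropy(q[1], alpha))
)
print(pow_cond, temp_cond)
print(odds_ratio, renyi_ratio)
```

In this example temperature scaling prefers token 0, which has the larger local probability, while the power conditional slightly prefers token 1 because its suffix distribution is more concentrated.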

A.2 Background and proof of Proposition 2

This appendix is aligned with the sequential Monte Carlo presentation of zhao2024probabilistic, who derive a general twist-induced proposal (their Prop. 3.3) that minimizes the variance of the one-step incremental importance weight for a given tower of intermediate targets. We provide a proof of the same variance-minimization fact specialized to $\pi_{\alpha}$ using the Cauchy–Schwarz inequality (cf. zhao2024probabilistic, App. A.2).

A.2.1 From a sequence-level target to a sequential sampler

Let $\mathcal{V}$ be a finite vocabulary and fix a prompt $x$ and completion length $T\geq 1$ . Let $P(y_{1:T}):=\pi_{\alpha}(y_{1:T}\mid x)$ denote the power distribution on $\mathcal{V}^{T}$ from Eq. (4), i.e.,

P(y_{1:T})\propto\pi(y_{1:T}\mid x)^{\alpha},\qquad\pi(y_{1:T}\mid x)=\prod_{t=1}^{T}\pi(y_{t}\mid x,y_{<t}).

Exact sampling from $P$ may be intractable because normalizing constants involve sums over exponentially many sequences. Many practical samplers therefore build $y_{1:T}$ sequentially: having generated a prefix $y_{<t}$ , they draw a next token $y_{t}$ from a proposal $q(\cdot\mid x,y_{<t})$ and update importance weights so that, after $T$ steps, full-length draws can be reweighted to be (exactly or approximately) correct for $P$ .

A.2.2 Incremental importance weights

For $t=1,\dots,T$ , let $P_{t}$ denote the marginal of $P$ on the length- $t$ prefix:

P_{t}(y_{1:t}):=\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}P(y_{1:T}).

One step of sequential importance sampling extends $y_{<t}$ by sampling $Y_{t}\sim q(\cdot\mid x,y_{<t})$ . The incremental multiplicative factor appended to the running weight is [chopin2020introduction]

W_{t}:=\frac{P_{t}(y_{<t},Y_{t})}{P_{t-1}(y_{<t})\,q(Y_{t}\mid x,y_{<t})},

(22)

defined on the event $\{P_{t-1}(y_{<t})>0\}$ , where $(y_{<t},Y_{t})$ denotes the length- $t$ prefix ending in $Y_{t}$ . For the power distribution, $P_{t}(y_{1:t})=\pi_{\alpha}(y_{1:t}\mid x)$ , so Eq. (22) agrees with $W_{t}$ in Eq. (9).

If one initializes weights at $w_{0}:=1$ and updates $w_{t}:=w_{t-1}W_{t}$ , then for any completed trajectory $y_{1:T}$ with $\prod_{t=1}^{T}P_{t-1}(y_{<t})>0$ ,

w_{T}=\frac{P(y_{1:T})}{q(y_{1:T}\mid x)},\qquad q(y_{1:T}\mid x):=\prod_{t=1}^{T}q(y_{t}\mid x,y_{<t}),

(23)

which is the usual full-sequence importance weight of $y_{1:T}$ for the target $P$ against the autoregressive proposal $q$ . Thus each $W_{t}$ is the local factor that must be “well behaved” if the final weights are not to explode or collapse.
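The telescoping identity in Eq. (23) can be verified by enumeration. This sketch assumes a binary vocabulary with $T=3$ , an arbitrary positive target `weights`, and a made-up autoregressive proposal `q_step`; none of these come from the paper.

```python
import itertools

V, T = (0, 1), 3
# Hypothetical unnormalized target over all length-T sequences.
weights = {y: 1.0 + sum(y) + 2.0 * y[0] * y[2] for y in itertools.product(V, repeat=T)}
Z = sum(weights.values())
P = {y: w / Z for y, w in weights.items()}


def prefix_marginal(prefix):
    """P_t(y_{1:t}): total mass of all sequences starting with `prefix` (P_0 = 1)."""
    t = len(prefix)
    return sum(prob for y, prob in P.items() if y[:t] == tuple(prefix))


def q_step(token, prefix):
    """Hypothetical proposal q(y_t | y_<t) that depends on the previous token."""
    p_one = 0.3 if (prefix and prefix[-1] == 1) else 0.6
    return p_one if token == 1 else 1.0 - p_one


def q_full(y):
    """Full-sequence proposal probability, the product of the per-step factors."""
    prob = 1.0
    for t in range(T):
        prob *= q_step(y[t], y[:t])
    return prob


def final_weight(y):
    """w_T obtained by multiplying the incremental factors W_t of Eq. (22)."""
    w = 1.0
    for t in range(1, T + 1):
        w *= prefix_marginal(y[:t]) / (prefix_marginal(y[:t - 1]) * q_step(y[t - 1], y[:t - 1]))
    return w


max_gap = max(abs(final_weight(y) - P[y] / q_full(y)) for y in P)
mean_weight = sum(q_full(y) * final_weight(y) for y in P)
print(max_gap, mean_weight)
```

The second printed value is the expectation of $w_{T}$ under the proposal, which equals $1$ because the weights are importance ratios against a normalized target.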

A.2.3 Why minimize $\mathrm{Var}[W_{t}]$ at one step?

Condition on a fixed feasible prefix $y_{<t}$ with $P_{t-1}(y_{<t})>0$ . Write $f_{t}(v):=P_{t}(y_{<t},v)/P_{t-1}(y_{<t})$ for $v\in\mathcal{V}$ , i.e., the true conditional $P(y_{t}=v\mid y_{<t})$ under $P$ . Then $W_{t}=f_{t}(Y_{t})/q(Y_{t})$ with $Y_{t}\sim q$ .

Whenever $q(v)>0$ for all $v$ with $f_{t}(v)>0$ , the mean is always $\mathbb{E}[W_{t}\mid y_{<t}]=\sum_{v\in\mathcal{V}}q(v)\,f_{t}(v)/q(v)=1$ . However, $\mathrm{Var}[W_{t}\mid y_{<t}]$ depends strongly on $q$ : if $q$ places too little mass where $f_{t}$ is large, occasional huge weights arise, which is the usual “weight degeneracy” pathology in importance sampling. Minimizing $\mathrm{Var}[W_{t}\mid y_{<t}]$ therefore makes the single-step contribution to weight instability as small as possible (among independent proposals), holding the prefix fixed. This is the same local objective highlighted by zhao2024probabilistic for twist-induced proposals.

A.2.4 Proof of Proposition 2

Proof of Proposition 2.

Fix $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$ and write $f(v):=\pi_{\alpha}(y_{<t},v\mid x)/\pi_{\alpha}(y_{<t}\mid x)$ for $v\in\mathcal{V}$ . Then $\sum_{v\in\mathcal{V}}f(v)=1$ and $W_{t}=f(Y_{t})/q(Y_{t})$ under $Y_{t}\sim q$ , assuming $q(v)>0$ whenever $f(v)>0$ .

Since $\mathbb{E}[W_{t}]=\sum_{v}f(v)=1$ ,

\mathrm{Var}[W_{t}]=\mathbb{E}[W_{t}^{2}]-1=\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}-1.

By Cauchy–Schwarz,

\Big(\sum_{v\in\mathcal{V}}f(v)\Big)^{2}=\Big(\sum_{v\in\mathcal{V}}\sqrt{q(v)}\cdot\frac{f(v)}{\sqrt{q(v)}}\Big)^{2}\leq\Big(\sum_{v\in\mathcal{V}}q(v)\Big)\Big(\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}\Big)=\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}.

The left-hand side equals $1$ , so $\mathrm{Var}[W_{t}]\geq 0$ with equality if and only if the Cauchy–Schwarz inequality is tight, i.e., $\sqrt{q(v)}\propto f(v)/\sqrt{q(v)}$ , equivalently $q(v)\propto f(v)$ . Because $\sum_{v}f(v)=1$ , the unique minimizer on $\{v:f(v)>0\}$ is $q(v)=f(v)$ , which is $\pi_{\alpha}(v\mid x,y_{<t})$ .

Finally, with $\tilde{\pi}_{\alpha,t}$ as in Eq. (8) and $Z_{\alpha}$ as in the main text,

f(v)=\frac{\pi_{\alpha}(y_{<t},v\mid x)}{\pi_{\alpha}(y_{<t}\mid x)}=\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})/Z_{\alpha}}{\tilde{\pi}_{\alpha,t-1}(y_{<t})/Z_{\alpha}}=\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})}{\tilde{\pi}_{\alpha,t-1}(y_{<t})},

where $y_{1:t}=(y_{<t},v)$ , which is Eq. (10). ∎
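A small numerical check of the variance formula used in this proof: the sketch evaluates $\sum_{v}f(v)^{2}/q(v)-1$ for three candidate proposals against a hypothetical conditional `f`. The variance vanishes only for the proposal that matches `f`.

```python
def weight_variance(f, q):
    """Var[W_t] = sum_v f(v)^2 / q(v) - 1 for W_t = f(Y) / q(Y) with Y ~ q."""
    return sum(fv * fv / qv for fv, qv in zip(f, q) if fv > 0) - 1.0


f = [0.7, 0.2, 0.1]  # hypothetical true conditional of the target
var_optimal = weight_variance(f, f)
var_uniform = weight_variance(f, [1 / 3, 1 / 3, 1 / 3])
var_mismatched = weight_variance(f, [0.1, 0.2, 0.7])
print(var_optimal, var_uniform, var_mismatched)
```

The mismatched proposal places little mass on the most likely token, so the occasional draw of that token receives a very large weight, which is the weight-degeneracy effect described in Section A.2.3.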

A.3 Proof of Proposition 3

We first provide the following lemma, which is used to bound the Hellinger distance between the MLE and the true conditional distribution for finite-class models.

Lemma 1 (Finite-class MLE Hellinger bound [wong1995probability, geer2000empirical, zhang2006f]).

Assume $|\Pi|<\infty$ and $\pi^{\star}\in\Pi$ . Let $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ be i.i.d. with $x_{i}\sim\mu$ and $y_{i}\sim\pi^{\star}(\cdot\mid x_{i})$ , and let $\widehat{\pi}\in\arg\max_{\pi\in\Pi}\sum_{i=1}^{n}\log\pi(y_{i}\mid x_{i})$ be an MLE. Then for any $\rho\in(0,1)$ , with probability at least $1-\rho$ ,

\mathbb{E}_{x\sim\mu}\!\left[D_{H}^{2}\!\big(\widehat{\pi}(\cdot\mid x),\pi^{\star}(\cdot\mid x)\big)\right]\leq\frac{2\log(|\Pi|\rho^{-1})}{n}.

Using this lemma, we can prove Proposition 3 as follows.

Proof of Proposition 3.

Define the failure event $F(x):=\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\}$ . By a simple inclusion,

F(x)\subseteq\Big\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\Big\}\ \cup\ \Big\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta,\ \pi_{\alpha}(\bm{y}^{\star}(x)\mid x)>1-\tfrac{\delta}{2}\Big\}.

Taking $\mathbb{P}_{x\sim\mu}[\cdot]$ yields

\mathbb{P}_{x\sim\mu}[F(x)]\leq\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]\;+\;\mathbb{P}_{x\sim\mu}[E(x)],

(24)

where $E(x):=\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta,\ \pi_{\alpha}(\bm{y}^{\star}(x)\mid x)>1-\tfrac{\delta}{2}\}$ .

Let $B(x):=\mathcal{Y}\setminus\bm{y}^{\star}(x)$ . For each $x$ , write $\widehat{\pi}_{x}:=\widehat{\pi}(\cdot\mid x)$ and $p_{x}:=\pi_{\alpha}(\cdot\mid x)$ . For two distributions $p,q\in\Delta(\mathcal{Y})$ , define the squared Hellinger distance

D_{H}^{2}(p,q):=\sum_{y\in\mathcal{Y}}\big(\sqrt{p(y)}-\sqrt{q(y)}\big)^{2}.

By the reverse triangle inequality applied to the vectors $(\sqrt{\widehat{\pi}_{x}(y)})_{y\in B(x)}$ and $(\sqrt{p_{x}(y)})_{y\in B(x)}$ ,

D_{H}^{2}(\widehat{\pi}_{x},p_{x})\geq\sum_{y\in B(x)}\big(\sqrt{\widehat{\pi}_{x}(y)}-\sqrt{p_{x}(y)}\big)^{2}\geq\big(\sqrt{\widehat{\pi}(B(x)\mid x)}-\sqrt{\pi_{\alpha}(B(x)\mid x)}\big)^{2}.

(25)

On the event $E(x)$ , we have $\widehat{\pi}(B(x)\mid x)=1-\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\geq\delta$ and $\pi_{\alpha}(B(x)\mid x)=1-\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)<\delta/2$ , so Equation (25) implies

D_{H}^{2}(\widehat{\pi}_{x},p_{x})\geq\big(\sqrt{\delta}-\sqrt{\delta/2}\big)^{2}=\Big(1-\tfrac{1}{\sqrt{2}}\Big)^{2}\delta=:c_{0}\,\delta.

Therefore $\mathbf{1}\{E(x)\}\leq D_{H}^{2}(\widehat{\pi}_{x},p_{x})/(c_{0}\delta)$ , and hence

\mathbb{P}_{x\sim\mu}[E(x)]\leq\frac{1}{c_{0}\delta}\,\mathbb{E}_{x\sim\mu}\big[D_{H}^{2}(\widehat{\pi}_{x},p_{x})\big].

(26)

Finally, by the finite-class MLE Hellinger bound (Lemma 1) and $|\Pi_{\alpha}|\leq M$ , with probability at least $1-\rho$ ,

\mathbb{E}_{x\sim\mu}\big[D_{H}^{2}(\widehat{\pi}_{x},p_{x})\big]\leq\frac{2\log(M\rho^{-1})}{n}.

Combining Eqs. (24) and (26) with the bound above yields

\mathbb{P}_{x\sim\mu}\big[\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\big]\leq\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]\;+\;\frac{2}{c_{0}}\cdot\frac{\log(M\rho^{-1})}{\delta\,n}.

Absorbing constants proves Eq. (17).

Convergence of the upper bound.

The MLE term satisfies $\frac{\log(M\rho^{-1})}{\delta n}\to 0$ as $n\to\infty$ .

For the limit of $\alpha$ , fix $x\in\mathcal{X}$ and write $m(x):=\max_{y\in\mathcal{Y}}\pi(y\mid x)$ . By definition of $\bm{y}^{\star}(x)$ , we have $\pi(y\mid x)=m(x)$ for all $y\in\bm{y}^{\star}(x)$ and $\pi(y\mid x)<m(x)$ for all $y\in B(x)=\mathcal{Y}\setminus\bm{y}^{\star}(x)$ . The normalizing constant of the power distribution satisfies

$\displaystyle Z_{\alpha}(x)$	$\displaystyle:=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)^{\alpha}$	(27)
	$\displaystyle=\sum_{y^{\prime}\in\bm{y}^{\star}(x)}m(x)^{\alpha}\;+\;\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha}$	(28)
	$\displaystyle=\bigl|\bm{y}^{\star}(x)\bigr|\,m(x)^{\alpha}\;+\;\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha}.$	(29)

For each $y^{\prime}\in B(x)$ , the ratio $\pi(y^{\prime}\mid x)/m(x)$ lies in $[0,1)$ , hence $\bigl(\pi(y^{\prime}\mid x)/m(x)\bigr)^{\alpha}\to 0$ as $\alpha\to\infty$ . Because $B(x)$ is finite, $\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha}=o\bigl(m(x)^{\alpha}\bigr)$ , and therefore

\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)=\frac{\bigl|\bm{y}^{\star}(x)\bigr|\,m(x)^{\alpha}}{Z_{\alpha}(x)}\xrightarrow[\alpha\to\infty]{}1.

(30)

The indicators $\mathbf{1}\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\}$ converge to $0$ for $\mu$ -almost every $x$ as $\alpha\to\infty$ by Eq. (30). Since indicators are bounded by $1$ , dominated convergence yields

\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]=\mathbb{E}_{x\sim\mu}\big[\mathbf{1}\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\}\big]\xrightarrow[\alpha\to\infty]{}0.

Thus, the term $\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]$ in Eq. (17) converges to $0$ as $\alpha\to\infty$ . Together with the $n\to\infty$ limit of the MLE term, the full upper bound converges to $0$ . ∎
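The limit in Eq. (30) is easy to observe numerically. The sketch below assumes a hypothetical four-outcome distribution with two tied modes and reports the power-distribution mass on the mode set for increasing $\alpha$ ; dividing by $m(x)^{\alpha}$ first keeps the computation stable for large exponents.

```python
def mode_mass(pi, alpha):
    """pi_alpha mass on the set of maximizers of pi, computed via ratios to the max."""
    m = max(pi)
    n_modes = sum(1 for p in pi if p == m)
    z_scaled = sum((p / m) ** alpha for p in pi)  # Z_alpha / m^alpha
    return n_modes / z_scaled


pi = [0.4, 0.4, 0.15, 0.05]  # hypothetical distribution with two tied modes
alphas = [1, 2, 4, 16, 64]
masses = [mode_mass(pi, a) for a in alphas]
for a, mass in zip(alphas, masses):
    print(a, mass)
```

With ties, the mass concentrates on the whole mode set rather than on a single sequence, which is why the statement is phrased in terms of $\bm{y}^{\star}(x)$ as a set.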

A.4 Proof of Proposition 4

Proof of Proposition 4.

Recall

\pi_{\alpha}(y)=\frac{\pi(y)^{\alpha}}{Z_{\alpha}}=\frac{e^{\alpha\log\pi(y)}}{Z_{\alpha}},\qquad Z_{\alpha}:=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime})^{\alpha}=\sum_{y^{\prime}}e^{\alpha\log\pi(y^{\prime})}.

Differentiating $\pi_{\alpha}(y)$ with respect to $\alpha$ yields

	$\displaystyle\frac{\partial}{\partial\alpha}\pi_{\alpha}(y)$	$\displaystyle=\frac{\partial}{\partial\alpha}\left(\frac{e^{\alpha\log\pi(y)}}{Z_{\alpha}}\right)$
		$\displaystyle=\frac{\log\pi(y)e^{\alpha\log\pi(y)}Z_{\alpha}-e^{\alpha\log\pi(y)}\frac{\partial}{\partial\alpha}Z_{\alpha}}{Z_{\alpha}^{2}}$
		$\displaystyle=\pi_{\alpha}(y)\left(\log\pi(y)-\frac{\partial}{\partial\alpha}\log Z_{\alpha}\right).$

Using

\displaystyle\frac{\partial}{\partial\alpha}\log Z_{\alpha}=\frac{1}{Z_{\alpha}}\sum_{y^{\prime}}\log\pi(y^{\prime})e^{\alpha\log\pi(y^{\prime})}=\sum_{y^{\prime}}\pi_{\alpha}(y^{\prime})\log\pi(y^{\prime})=\mathbb{E}_{\pi_{\alpha}}[\log\pi],

(31)

we obtain

$\displaystyle\frac{\partial}{\partial\alpha}\mathbb{E}_{\pi_{\alpha}}[r^{\star}]$	$\displaystyle=\sum_{y}r^{\star}(y)\frac{\partial}{\partial\alpha}\pi_{\alpha}(y)$	(32)
	$\displaystyle=\sum_{y}r^{\star}(y)\pi_{\alpha}(y)\left(\log\pi(y)-\mathbb{E}_{\pi_{\alpha}}[\log\pi]\right)$	(33)
	$\displaystyle=\mathbb{E}_{\pi_{\alpha}}\!\big[r^{\star}(\log\pi-\mathbb{E}_{\pi_{\alpha}}[\log\pi])\big]$	(34)
	$\displaystyle=\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},\log\pi)$	(35)
	$\displaystyle=\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}}).$	(36)

∎
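The derivative identity can be checked with a central finite difference. This sketch assumes a hypothetical four-outcome base distribution and two binary rewards, one aligned with likelihood and one misaligned; the helper names are illustrative.

```python
import math


def power_dist(pi, alpha):
    weights = [p ** alpha for p in pi]
    z = sum(weights)
    return [w / z for w in weights]


def mean(values, dist):
    return sum(v * d for v, d in zip(values, dist))


def expected_reward(r, pi, alpha):
    return mean(r, power_dist(pi, alpha))


def covariance(r, pi, alpha):
    """Cov under pi_alpha between the reward r and the self-reward log pi."""
    dist = power_dist(pi, alpha)
    log_pi = [math.log(p) for p in pi]
    joint = mean([ri * li for ri, li in zip(r, log_pi)], dist)
    return joint - mean(r, dist) * mean(log_pi, dist)


pi = [0.5, 0.3, 0.15, 0.05]        # hypothetical base distribution
r_aligned = [1.0, 1.0, 0.0, 0.0]   # correct answers are the likely ones
r_misaligned = [0.0, 0.0, 1.0, 1.0]
alpha, h = 2.0, 1e-5

results = {}
for name, r in (("aligned", r_aligned), ("misaligned", r_misaligned)):
    slope = (expected_reward(r, pi, alpha + h) - expected_reward(r, pi, alpha - h)) / (2 * h)
    results[name] = (slope, covariance(r, pi, alpha))
    print(name, results[name])
```

The sign of the covariance determines whether further sharpening helps or hurts the expected true reward, matching the RandW negative control in spirit.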

A.5 Closed-form optimizer for KL-regularized RL: restatement and proof

We restate the standard closed-form solution of KL-regularized RL used in Section 5.1.

Proposition 5 (Closed-form optimizer for KL-regularized RL [levine2018reinforcement]).

For each $x\in\mathcal{X}$ , let $\pi_{\beta}^{\star}$ be the reward-tilted distribution defined in Eq. (12), and assume $Z_{r}(x)<\infty$ for every $x\in\mathcal{X}$ . Then $\pi_{\beta}^{\star}$ maximizes $J_{\beta}(q;\pi,r)$ in Eq. (11) over all $q:\mathcal{X}\to\Delta(\mathcal{Y})$ .

Proof of Proposition 5.

Fix $x\in\mathcal{X}$ and write $f(y):=\pi(y\mid x)\exp\!\big(\beta^{-1}r(x,y)\big)$ and $Z:=Z_{r}(x)=\sum_{y^{\prime}\in\mathcal{Y}}f(y^{\prime})$ . For any $q(\cdot\mid x)\in\Delta(\mathcal{Y})$ , expanding the KL divergence against $\pi_{\beta}^{\star}(\cdot\mid x)$ gives

	$\displaystyle\mathbb{E}_{y\sim q(\cdot\mid x)}[r(x,y)]-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)$	$\displaystyle=-\beta\sum_{y\in\mathcal{Y}}q(y\mid x)\log\frac{q(y\mid x)}{f(y)/Z}+\beta\log Z$
		$\displaystyle=-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\beta}^{\star}(\cdot\mid x)\big)+\beta\log Z,$

where we used $\pi_{\beta}^{\star}(y\mid x)=f(y)/Z$ from Eq. (12). Since $D_{\mathrm{KL}}(\cdot\,\|\,\pi_{\beta}^{\star}(\cdot\mid x))\geq 0$ with equality if and only if $q(\cdot\mid x)=\pi_{\beta}^{\star}(\cdot\mid x)$ , the inner objective is uniquely maximized at $q(\cdot\mid x)=\pi_{\beta}^{\star}(\cdot\mid x)$ . Because $J_{\beta}(q;\pi,r)$ is an expectation over $x\sim\mu$ of these decoupled per- $x$ objectives, the unique global maximizer is $q=\pi_{\beta}^{\star}$ . ∎
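As a sanity check, the sketch below evaluates the per-prompt objective on a hypothetical three-outcome distribution with the self-reward $r=\log\pi$ . Under the definition of $f$ in this proof, the tilted distribution is then proportional to $\pi^{1+1/\beta}$ , so it should coincide with the power distribution for $\alpha=1+1/\beta$ ; this correspondence is derived here from the formula for $f$ and is stated as an assumption about how the main text parameterizes it.

```python
import math


def tilted(pi, r, beta):
    """Reward-tilted distribution f / Z with f(y) = pi(y) * exp(r(y) / beta)."""
    f = [p * math.exp(ri / beta) for p, ri in zip(pi, r)]
    z = sum(f)
    return [x / z for x in f], z


def objective(q, pi, r, beta):
    """E_q[r] - beta * KL(q || pi) for a single prompt."""
    reward = sum(qi * ri for qi, ri in zip(q, r))
    kl = sum(qi * math.log(qi / pi_i) for qi, pi_i in zip(q, pi) if qi > 0)
    return reward - beta * kl


pi = [0.5, 0.3, 0.2]  # hypothetical base distribution
beta = 0.5
r_self = [math.log(p) for p in pi]

q_star, z = tilted(pi, r_self, beta)
alpha = 1.0 + 1.0 / beta
powered = [p ** alpha for p in pi]
power = [w / sum(powered) for w in powered]

best_value = objective(q_star, pi, r_self, beta)
print(q_star, power)
print(best_value, beta * math.log(z))
```

The optimal value equals $\beta\log Z$ , and any other distribution (for example the base model itself or the uniform distribution) scores strictly lower.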

Appendix B Experimental details

B.1 Setup

Models and datasets.

We used Qwen2.5-Math-7B [yang2024qwen2], Qwen2.5-7B [yang2024qwen], and Phi-3.5-mini-instruct [abdin2024phi] models on the following datasets.

• Mathematics. We used the MATH dataset [lightman2024lets], which consists of 12,500 competition-style math problems spanning seven categories (e.g., geometry, number theory, and precalculus), with 7,500 training and 5,000 test problems. For evaluation, we used MATH500, a randomly selected subset of the MATH test set standardized by OpenAI (https://huggingface.co/datasets/HuggingFaceH4/MATH-500). For distillation, we sampled 500 examples from MATH with MATH500 removed (https://raw.githubusercontent.com/rasbt/math_full_minus_math500/main/math_full_minus_math500.json).

• Programming. For evaluation, we used HumanEval [chen2021evaluating], a set of $164$ handwritten programming problems covering algorithms, reasoning, mathematics, and language understanding; each problem includes unit tests, and a solution was correct if it passed all tests. For distillation, we used MBPP [austin2021program], a benchmark of crowd-sourced Python programming problems designed to be solvable by entry-level programmers. We used $420$ questions from the sanitized subset, excluding the prompt split.

• Multiple-choice science. We used GPQA [rein2024gpqa], a multiple-choice science benchmark (physics, chemistry, and biology) requiring advanced reasoning. For evaluation, we used GPQA-Diamond, a high-quality subset of $198$ questions. For distillation, we used the remaining $250$ GPQA questions after removing any overlap with GPQA-Diamond.

Power sampling.

We used the power sampling algorithm of karan2026reasoning, largely following their hyperparameters. Specifically, we used $\alpha=4.0$ , maximum sampling token length $3072$ , block size $192$ , $N_{\mathrm{MCMC}}=10$ , and the proposal LLM $p_{\mathrm{prop}}$ set to the base model with sampling temperature $\tau=1/\alpha=0.25$ . The token-wise Temperature baseline uses the same $\tau$ , applying the corresponding local power transform independently at each decoding step. For the randomly initialized model (RandW; Section 6), we instead used maximum token length $1024$ and $N_{\mathrm{MCMC}}=2$ , because under the default settings (maximum token length $3072$ and $N_{\mathrm{MCMC}}=10$ ) EOS tokens rarely appeared for RandW and wall-clock sampling time became significantly longer.

Self-reward computation.

To report $r_{\mathrm{self}}$ , we computed, under the evaluated model, the average log-likelihood over completion tokens, excluding prompt tokens. Our theoretical analysis assumes completions of a fixed length $T$ , but in our experiments completion lengths vary across prompts and sampling methods, so we normalize by the number of completion tokens to remove length bias in $r_{\mathrm{self}}$ .

Synthetic random rewards.

For the synthetic-reward probe in Figure 3, each completion $y$ is mapped to a scalar in $[0,1)$ by applying SHA-256 to the UTF-8 encoding of $y$ and interpreting the leading 64 bits of the digest as an unsigned fraction. Let $z_{\mathrm{self}}(y)$ and $z_{r}(y)$ denote the z-scores of the self-reward $r_{\mathrm{self}}(x,y)$ and of the hash reward above, each computed with the corresponding pooled global sample mean and sample standard deviation. We then define

r_{\lambda}(y)\;:=\;\lambda\,z_{\mathrm{self}}(y)+\sqrt{1-\lambda^{2}}\,\bigl(z_{r}(y)+\varepsilon(y)\bigr),\qquad\lambda\in[-1,1],

where the $\varepsilon(y)$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$ with $\sigma=0.5$ . Figure 3 sweeps $\lambda$ and plots the mean increase in $r_{\lambda}$ under power versus standard sampling against the empirical covariance between $r_{\mathrm{self}}$ and $r_{\lambda}$ , using completions produced under standard sampling. The construction is designed to sweep $\mathrm{Cov}(r_{\lambda},r_{\mathrm{self}})$ in a controlled way; we plot empirical gain against this controlled covariance to visualize the qualitative rate prediction of Proposition 4.
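The construction of $r_{\lambda}$ can be sketched as follows. Two details are assumptions not specified above: the leading 64 bits are read in big-endian order, and the sample standard deviation uses the $n-1$ denominator (as `statistics.stdev` does). The completions and self-reward values are hypothetical.

```python
import hashlib
import math
import random
import statistics


def hash_reward(completion):
    """Map a completion to [0, 1) from the leading 64 bits of its SHA-256 digest."""
    digest = hashlib.sha256(completion.encode("utf-8")).digest()
    # Big-endian byte order is an assumption; the text only says "leading 64 bits".
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def z_scores(values):
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / std for v in values]


def synthetic_rewards(self_rewards, completions, lam, sigma=0.5, seed=0):
    """r_lambda = lam * z_self + sqrt(1 - lam^2) * (z_hash + Gaussian noise)."""
    rng = random.Random(seed)
    z_self = z_scores(self_rewards)
    z_hash = z_scores([hash_reward(c) for c in completions])
    scale = math.sqrt(1.0 - lam * lam)
    return [
        lam * zs + scale * (zh + rng.gauss(0.0, sigma))
        for zs, zh in zip(z_self, z_hash)
    ]


completions = ["x = 1", "x = 2", "x = 3", "x = 4"]  # hypothetical completions
self_rewards = [-0.10, -0.40, -0.25, -0.90]         # hypothetical r_self values
aligned = synthetic_rewards(self_rewards, completions, lam=1.0)
reversed_ = synthetic_rewards(self_rewards, completions, lam=-1.0)
print(aligned, reversed_)
```

At $\lambda=\pm 1$ the hash and noise terms vanish, so the synthetic reward is the standardized self-reward up to sign; intermediate values of $\lambda$ interpolate toward a reward that is unrelated to likelihood.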

Distillation.

We trained the student with supervised fine-tuning on the offline power-sampled dataset. Concretely, we minimized the standard token-level cross-entropy loss of a causal language model on the teacher-generated completion, masking the prompt tokens (i.e., the loss was computed only on the completion tokens). The student was initialized from the base model and was trained with LoRA adapters (rank $r{=}16$ , LoRA scaling $\alpha{=}32$ , dropout $0.05$ ; this $\alpha$ is unrelated to the power exponent) applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. We trained the models for 3 epochs using the AdamW optimizer with a weight decay of 0.01 and a linear warmup ratio of 0.03. The learning rate was tuned per dataset and model as summarized in Table S.2. We used per-device batch size 1 with 8 gradient accumulation steps, and enabled gradient checkpointing. We set the maximum sequence length to 1024 tokens to keep activation memory manageable on a single GPU. Teacher completions exceeding this cap were truncated, and the cross-entropy loss was computed on all in-window completion tokens. The truncation affected only a minority of completions (e.g., 83.6% of Qwen2.5-Math-7B completions on MATH fit fully within the cap), and each in-window token still provides a valid distillation signal toward $\pi_{\alpha}$ .
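The prompt masking and truncation described above can be sketched without any training framework. The helper below is hypothetical; it assumes the common convention of marking ignored positions with the label $-100$ , which is the default `ignore_index` of the PyTorch cross-entropy loss, and it is not the authors' training code.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids, completion_ids, max_len=1024):
    """Concatenate prompt and completion, truncate to max_len, and mask prompt labels."""
    input_ids = (list(prompt_ids) + list(completion_ids))[:max_len]
    n_prompt = min(len(prompt_ids), max_len)
    # Only in-window completion tokens keep their ids as labels.
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return input_ids, labels


# Hypothetical token ids; max_len=5 forces truncation of the completion.
input_ids, labels = build_example([1, 2, 3], [4, 5, 6, 7], max_len=5)
print(input_ids, labels)
```

A completion that exceeds the cap still contributes its in-window tokens to the loss, which corresponds to the truncation behavior described in the paragraph above.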

Table S.2: Learning rate used for SFT distillation, per (dataset, model) pair.

Dataset	Qwen2.5-7B	Qwen2.5-Math-7B	Phi-3.5-mini-instruct
MATH	$1\times 10^{-5}$	$1\times 10^{-5}$	$1\times 10^{-3}$
HumanEval/MBPP	$1\times 10^{-5}$	$1\times 10^{-5}$	$5\times 10^{-4}$
GPQA	$1\times 10^{-5}$	$1\times 10^{-4}$	$2\times 10^{-4}$

Hardware and execution time.

All experiments were conducted on GPU nodes equipped with two Intel Xeon Platinum 8360Y CPUs, 512 GiB of host memory, and eight NVIDIA A100 GPUs with 40 GiB of memory each. On a single GPU, supervised fine-tuning of one student per dataset and model finished in under one hour, while teacher generation via power sampling (Algorithm 2) took more than one day per dataset and model. The total compute is on the order of a few hundred A100-GPU-hours.

B.2 Additional results

B.2.1 Other datasets and models

This section reports results on additional dataset–model combinations that are not shown in the main text. In all cases, the distilled model has a higher $r^{\star}$ than the base under standard autoregressive decoding. The distilled model often attains $r^{\star}$ comparable to that of the corresponding base model with power sampling.

Table S.3: MATH: true reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$ . Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of- $N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

		All completions		Self-reward Best-of- $N$
Model	Sampling	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$
Qwen / Base	Standard	$0.410\pm 0.004$	$-0.405\pm 0.049$	$0.579$	$-0.185$
Qwen / Base	Power	$\mathbf{0.706}\pm 0.017$	$-0.093\pm 0.001$	$0.677$	$-0.080$
Qwen / Distilled	Standard	$0.631\pm 0.013$	$-0.094\pm 0.002$	$\mathbf{0.682}$	$-0.064$
Qwen / Distilled	Temperature	$0.661\pm 0.006$	$\mathbf{-0.073}\pm 0.001$	$0.676$	$\mathbf{-0.060}$
Phi / Base	Standard	$0.449\pm 0.014$	$-0.234\pm 0.001$	$0.476$	$-0.173$
Phi / Base	Power	$\mathbf{0.513}\pm 0.017$	$-0.175\pm 0.002$	$\mathbf{0.493}$	$-0.155$
Phi / Distilled	Standard	$0.470\pm 0.000$	$-0.118\pm 0.001$	$0.481$	$-0.088$
Phi / Distilled	Temperature	$0.457\pm 0.014$	$\mathbf{-0.106}\pm 0.001$	$0.461$	$\mathbf{-0.086}$

Table S.4: HumanEval: true reward $r^{\star}$ (HumanEval pass) and self-reward $r_{\mathrm{self}}$. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

		All completions		Self-reward Best-of- $N$
Model	Sampling	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$
Qwen-Math / Base	Standard	$0.320\pm 0.016$	$-0.741\pm 0.012$	$0.383$	$-0.427$
Qwen-Math / Base	Power	$0.538\pm 0.030$	$\mathbf{-0.144}\pm 0.003$	$0.562$	$\mathbf{-0.106}$
Qwen-Math / Distilled	Standard	$0.416\pm 0.023$	$-0.563\pm 0.040$	$0.452$	$-0.334$
Qwen-Math / Distilled	Temperature	$\mathbf{0.541}\pm 0.005$	$-0.304\pm 0.001$	$\mathbf{0.566}$	$-0.208$
Qwen / Base	Standard	$0.326\pm 0.025$	$-0.966\pm 0.046$	$0.376$	$-0.426$
Qwen / Base	Power	$\mathbf{0.573}\pm 0.020$	$\mathbf{-0.130}\pm 0.004$	$0.568$	$\mathbf{-0.096}$
Qwen / Distilled	Standard	$0.425\pm 0.029$	$-0.849\pm 0.017$	$0.470$	$-0.325$
Qwen / Distilled	Temperature	$0.541\pm 0.017$	$-0.479\pm 0.024$	$\mathbf{0.600}$	$-0.235$
Phi / Base	Standard	$0.549\pm 0.021$	$-0.913\pm 0.012$	$0.562$	$-0.589$
Phi / Base	Power	$0.712\pm 0.027$	$\mathbf{-0.330}\pm 0.004$	$\mathbf{0.734}$	$\mathbf{-0.294}$
Phi / Distilled	Standard	$0.634\pm 0.031$	$-0.730\pm 0.029$	$0.602$	$-0.473$
Phi / Distilled	Temperature	$\mathbf{0.715}\pm 0.020$	$-0.627\pm 0.028$	$0.675$	$-0.447$

Table S.5: GPQA: true reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

		All completions		Self-reward Best-of- $N$
Model	Sampling	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$
Qwen-Math / Base	Standard	$0.100\pm 0.025$	$-0.675\pm 0.076$	$0.103$	$-0.675$
Qwen-Math / Base	Power	$\mathbf{0.277}\pm 0.022$	$\mathbf{-0.088}\pm 0.002$	$0.279$	$\mathbf{-0.087}$
Qwen-Math / Distilled	Standard	$0.275\pm 0.004$	$-0.165\pm 0.001$	$\mathbf{0.281}$	$-0.113$
Qwen-Math / Distilled	Temperature	$\mathbf{0.277}\pm 0.001$	$-0.149\pm 0.001$	$0.277$	$-0.109$
Qwen / Base	Standard	$0.244\pm 0.017$	$-1.531\pm 0.185$	$0.245$	$-1.527$
Qwen / Base	Power	$0.283\pm 0.033$	$\mathbf{-0.118}\pm 0.001$	$0.287$	$\mathbf{-0.118}$
Qwen / Distilled	Standard	$0.280\pm 0.035$	$-0.437\pm 0.098$	$0.278$	$-0.426$
Qwen / Distilled	Temperature	$\mathbf{0.285}\pm 0.025$	$-0.210\pm 0.007$	$\mathbf{0.291}$	$-0.201$
Phi / Base	Standard	$0.223\pm 0.027$	$-0.802\pm 0.023$	$0.223$	$-0.800$
Phi / Base	Power	$\mathbf{0.309}\pm 0.019$	$\mathbf{-0.215}\pm 0.004$	$\mathbf{0.309}$	$\mathbf{-0.214}$
Phi / Distilled	Standard	$0.268\pm 0.004$	$-0.321\pm 0.010$	$0.267$	$-0.319$
Phi / Distilled	Temperature	$0.292\pm 0.012$	$-0.284\pm 0.059$	$0.298$	$-0.262$
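The self-reward Best-of-$N$ columns in Tables S.3 to S.5 can be illustrated with a minimal Python sketch. It assumes hypothetical arrays `r_self` and `r_star` of shape (seeds, items) and reads "same aggregation" as a mean over items; the array layout, the function name, and the toy numbers are assumptions made for illustration, not the evaluation code or data behind the tables.

```python
import numpy as np

def self_reward_best_of_n(r_self, r_star):
    """Select, per item, the completion with the highest self-reward.

    r_self, r_star: arrays of shape (n_seeds, n_items) (hypothetical layout).
    Returns (mean true reward, mean self-reward) of the selected completions.
    """
    r_self = np.asarray(r_self, dtype=float)
    r_star = np.asarray(r_star, dtype=float)
    best_seed = r_self.argmax(axis=0)        # seed index with max r_self per item
    items = np.arange(r_self.shape[1])
    return r_star[best_seed, items].mean(), r_self[best_seed, items].mean()

# Toy numbers for illustration only (not results from the paper).
r_self = [[-0.40, -0.10, -0.30],
          [-0.20, -0.50, -0.25]]
r_star = [[0, 1, 0],
          [1, 0, 1]]
bon_star, bon_self = self_reward_best_of_n(r_self, r_star)
```

By construction, the selected self-reward is at least the mean self-reward of any single seed, whereas the true reward of the selected completions can move in either direction.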

B.2.2 Power $(\infty)$

We also evaluated Power $(\infty)$ using Qwen2.5-Math-7B on MATH500. This variant runs the MH power-sampling loop and accepts a proposal $y^{\prime}$ if and only if $\pi(y^{\prime}\mid x)>\pi(y\mid x)$ (Algorithm 2), corresponding to the limit $\alpha\to\infty$ .

Table S.6: Power $(\infty)$ results for Qwen2.5-Math-7B on MATH500. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds.

	All completions		Self-reward Best-of- $N$
Sampling	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$	$r^{\star}(\uparrow)$	$r_{\mathrm{self}}$
Power $(\infty)$	$\mathbf{0.728\pm 0.012}$	$-0.075\pm 0.001$	$0.736$	$-0.061$
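The Power $(\infty)$ acceptance rule can be sketched on a toy finite set of "sequences". This is only an illustration of the accept-if-strictly-more-likely rule: the proposal mechanism of Algorithm 2 is not reproduced, and the sketch assumes (hypothetically) that proposals are drawn independently from the base distribution `pi`. The function name `power_inf_chain` is likewise an assumption.

```python
import numpy as np

def power_inf_chain(pi, n_steps, rng):
    """Toy greedy-acceptance chain over a finite set of outcomes.

    ASSUMPTION: proposals are independent draws from the base distribution pi;
    the actual proposal mechanism of Algorithm 2 is not reproduced here.
    """
    pi = np.asarray(pi, dtype=float)
    y = rng.choice(len(pi), p=pi)
    trace = [y]
    for _ in range(n_steps):
        y_new = rng.choice(len(pi), p=pi)
        if pi[y_new] > pi[y]:      # accept iff the proposal is strictly more likely
            y = y_new
        trace.append(y)
    return trace

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.3, 0.4])
trace = power_inf_chain(pi, n_steps=200, rng=rng)
```

Under this rule the probability of the current state never decreases along the chain, so in this toy setting the chain settles on the most likely outcome once it has been proposed.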

B.2.3 Qualitative results

This section presents full completions for one MATH-style geometry problem summarized in Table 2, with the gold answer $x+y=17$. The prompt is:

The coordinates of a parallelogram are $(5,3)$, $(6,8)$, $(7,4)$, and $(x,y)$ with $x>7$. What is the value of $x+y$?

Figure S.1: Full generations for an example in MATH500 (gold $x+y=17$) (Part 1/4). Panels: (a) Question; (b) Base sampling.

Figure S.2: Full generations for an example in MATH500 (gold $x+y=17$) (Part 2/4). Panel: (a) Power sampling.

Figure S.3: Full generations for an example in MATH500 (gold $x+y=17$) (Part 3/4). Panel: (a) Distilled model with base sampling (Part 1/2).

Figure S.4: Full generations for an example in MATH500 (gold $x+y=17$) (Part 4/4). Panel: (a) Distilled model with base sampling (Part 2/2).

B.2.4 Synthetic validation of suffix-Rényi odds corrections

To validate Proposition 1 in a setting that reflects the Zipf-like word-frequency structure of natural language, we construct a finite synthetic autoregressive distribution whose language-model next-token probabilities follow a Zipf-like law over many candidates. Unlike the extreme pivotal-token construction of karan2026reasoning, every next-token candidate is followed by a full-support suffix distribution. The construction is summarized in Figure S.5. The base next-token distribution has $V=64$ tokens with Zipf-like probabilities

p_{i}\propto(i+1)^{-1.05},\qquad i=0,\dots,V-1.

For every token $i$ , the conditional suffix distribution $q_{i}$ has the same support size $M=256$ , no zero-probability suffixes, and a non-uniform power-law shape

q_{i}(z)\propto z^{-s_{i}},\qquad z=1,\dots,M.

The suffix exponent $s_{i}\in[0.45,1.65]$ varies deterministically and non-monotonically with the next-token rank, using a sinusoidal component plus a small trend. Thus, all suffix distributions have identical support size and full support, but differ in sharpness. This deliberately avoids the singular-versus-uniform example in karan2026reasoning: the experiment isolates the more general quantity identified by Proposition 1, namely the suffix Rényi entropy. In Figure S.5, the left panel shows the Zipf-like next-token distribution, the middle panel shows the token-dependent suffix exponent $s_{i}$ , and the right panel shows representative full-support suffix distributions.
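A minimal NumPy sketch of this construction is given below. The values $V=64$, $M=256$, the Zipf exponent $1.05$, and the range $[0.45,1.65]$ come from the text; the exact schedule for $s_{i}$ is not specified, so the sinusoid-plus-trend formula used here is a hypothetical stand-in.

```python
import numpy as np

V, M = 64, 256  # next-token vocabulary size and suffix support size

# Zipf-like next-token probabilities p_i ∝ (i + 1)^{-1.05}, i = 0..V-1.
ranks = np.arange(V)
p = (ranks + 1.0) ** -1.05
p /= p.sum()

# ASSUMPTION: hypothetical schedule for the suffix exponent s_i. It combines a
# sinusoid with a small linear trend and is clipped to the stated range.
s = 1.05 + 0.5 * np.sin(0.7 * ranks) + 0.1 * (ranks / (V - 1) - 0.5)
s = np.clip(s, 0.45, 1.65)

# Full-support suffix distributions q_i(z) ∝ z^{-s_i}, z = 1..M (row i is q_i).
z = np.arange(1, M + 1, dtype=float)
q = z[None, :] ** (-s[:, None])
q /= q.sum(axis=1, keepdims=True)
```

Every row of `q` has the same support size and no zero entries; only the sharpness, controlled by `s`, changes from token to token.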

For each $\alpha\in\{1.1,1.5,2,3,4,8\}$ , we compute both the token-wise temperature next-token distribution and the sequence-level power next-token conditional exactly under this synthetic distribution. The temperature next-token distribution is

\pi_{\mathrm{temp},\alpha}(i)=\frac{p_{i}^{\alpha}}{\sum_{j}p_{j}^{\alpha}},

whereas the next-token conditional induced by the sequence-level power distribution is

\pi_{\mathrm{pow},\alpha}(i)=\frac{p_{i}^{\alpha}\sum_{z}q_{i}(z)^{\alpha}}{\sum_{j}p_{j}^{\alpha}\sum_{z}q_{j}(z)^{\alpha}}.
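The two formulas above translate directly into a few lines of NumPy. The sketch below uses a small random stand-in (5 tokens, 7 suffixes) rather than the $V=64$ construction, and the function names are illustrative. It also checks the power conditional against a brute-force computation that power-transforms the joint distribution over (token, suffix) pairs and then marginalizes.

```python
import numpy as np

def temp_next_token(p, alpha):
    """Token-wise temperature: pi_temp(i) ∝ p_i^alpha."""
    w = p ** alpha
    return w / w.sum()

def power_next_token(p, q, alpha):
    """Next-token conditional of the sequence-level power distribution:
    pi_pow(i) ∝ p_i^alpha * sum_z q_i(z)^alpha, where q[i] is the suffix row."""
    w = p ** alpha * (q ** alpha).sum(axis=1)
    return w / w.sum()

# Small random stand-in, not the synthetic distribution of the experiment.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(7), size=5)
alpha = 4.0

pi_temp = temp_next_token(p, alpha)
pi_pow = power_next_token(p, q, alpha)

# Brute force: raise the joint p_i * q_i(z) to the power alpha, normalize,
# and sum out the suffix z.
joint = (p[:, None] * q) ** alpha
pi_pow_brute = (joint / joint.sum()).sum(axis=1)
```

The only difference between the two distributions is the per-token factor $\sum_{z}q_{i}(z)^{\alpha}$, which the temperature transform cannot see.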

Figure S.6 compares the two sides of Proposition 1 for every unordered token pair and every tested $\alpha$ . The left panel plots the Rényi-predicted log odds correction against the directly computed power-versus-temperature log odds correction, while the right panel shows the distribution of these corrections at the main experimental exponent $\alpha=4$ .
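The comparison can be reproduced in miniature as follows. Proposition 1 is not restated in this section, so the sketch assumes the standard Rényi entropy $H_{\alpha}(q)=\frac{1}{1-\alpha}\log\sum_{z}q(z)^{\alpha}$ and the correction $(1-\alpha)\,(H_{\alpha}(q_{i})-H_{\alpha}(q_{j}))$ implied by the two next-token formulas above. It runs on a small random stand-in distribution, and the function names are illustrative.

```python
import itertools
import numpy as np

def renyi_entropy(q, alpha):
    """Rényi entropy H_alpha(q) = log(sum_z q(z)^alpha) / (1 - alpha), alpha != 1."""
    return np.log((q ** alpha).sum(axis=-1)) / (1.0 - alpha)

def log_odds_corrections(p, q, alpha):
    """Return (direct, predicted) log odds corrections for all unordered pairs."""
    w_temp = p ** alpha                          # unnormalized temperature weights
    w_pow = w_temp * (q ** alpha).sum(axis=1)    # unnormalized power weights
    H = renyi_entropy(q, alpha)
    direct, predicted = [], []
    for i, j in itertools.combinations(range(len(p)), 2):
        # log[(pi_pow(i)/pi_pow(j)) / (pi_temp(i)/pi_temp(j))]; normalizers cancel.
        direct.append((np.log(w_pow[i]) - np.log(w_pow[j]))
                      - (np.log(w_temp[i]) - np.log(w_temp[j])))
        predicted.append((1.0 - alpha) * (H[i] - H[j]))
    return np.array(direct), np.array(predicted)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(8), size=6)
max_gap = max(
    np.abs(np.subtract(*log_odds_corrections(p, q, a))).max()
    for a in (1.1, 1.5, 2, 3, 4, 8)
)
```

Here `max_gap` is the largest absolute difference between the directly computed and the Rényi-predicted corrections over all pairs and exponents; under the stated assumptions it should vanish up to floating-point error.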

Figure S.7 illustrates the consequence of the correction at the level of next-token preferences: even when $p_{i}>p_{j}$ and temperature favors token $i$ , sequence-level power can favor token $j$ if $q_{j}$ has sufficiently lower suffix Rényi entropy.
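A two-token toy example makes the preference flip concrete. The numbers below are chosen for illustration and are not taken from the synthetic experiment: token 0 plays the role of $i$ (higher $p$, flat suffixes) and token 1 the role of $j$ (lower $p$, sharp suffixes).

```python
import numpy as np

alpha = 4.0
p = np.array([0.6, 0.4])                     # token 0 is more likely a priori
q = np.array([[0.25, 0.25, 0.25, 0.25],      # token 0: flat suffix distribution
              [0.85, 0.05, 0.05, 0.05]])     # token 1: sharp suffix distribution

suffix_mass = (q ** alpha).sum(axis=1)       # sum_z q_i(z)^alpha
renyi = np.log(suffix_mass) / (1.0 - alpha)  # suffix Rényi entropies

pi_temp = p ** alpha / (p ** alpha).sum()
pi_pow = p ** alpha * suffix_mass / (p ** alpha * suffix_mass).sum()

print("temperature favors token", pi_temp.argmax())  # temperature favors token 0
print("power favors token", pi_pow.argmax())         # power favors token 1
```

Token 1 has the lower suffix Rényi entropy, so its suffix power mass is large enough to outweigh the gap in $p^{\alpha}$.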

B.2.5 Synthetic validation of optimal one-step proposals for sequential power sampling

We reuse the synthetic distribution of Section B.2.4 to validate Proposition 2. For a fixed prompt and an empty prefix, the unique variance-minimizing one-step proposal in Equation 10 reduces to

q^{\star}(i)\;\propto\;p_{i}^{\alpha}\,\sum_{z=1}^{M}q_{i}(z)^{\alpha},\qquad i=0,\dots,V-1,

which equals the next-token conditional of the sequence-level power distribution $\pi_{\mathrm{pow},\alpha}$ and depends on the suffix power masses $\sum_{z}q_{i}(z)^{\alpha}$ of every candidate token. We compare $q^{\star}$ with three one-step proposals that do not use those suffix totals: the base proposal $q^{\mathrm{base}}(i)=p_{i}$ , the token-wise temperature proposal $q^{\mathrm{temp}}(i)\propto p_{i}^{\alpha}$ , and a uniform reference $q^{\mathrm{unif}}(i)=1/V$ .

For each proposal $q$ , the first-step incremental importance weight in Equation 9 simplifies to

W_{1}(i)\;=\;\frac{q^{\star}(i)}{q(i)},

and we report its exact mean, the squared coefficient of variation $\mathrm{CV}^{2}(W_{1})=\mathrm{Var}[W_{1}]/\mathbb{E}[W_{1}]^{2}$, and the effective sample size fraction $\mathrm{ESS}/N=1/(1+\mathrm{CV}^{2}(W_{1}))$. By Proposition 2, only $q^{\star}$ achieves $\mathrm{Var}[W_{1}]=0$ and hence $\mathrm{ESS}/N=1$; the values for the other proposals are computed exactly in closed form from the synthetic distribution.
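These exact quantities follow from sums over the $V$ next tokens, as the sketch below shows for $\alpha=4$. The helper `make_synthetic` rebuilds the synthetic distribution with a hypothetical schedule for $s_{i}$ (the text does not give the exact formula), so the numbers it produces are illustrative rather than those reported in the figures.

```python
import numpy as np

def make_synthetic(V=64, M=256):
    """Zipf-like p and power-law suffix rows q. ASSUMPTION: the s_i schedule
    (sinusoid plus small trend) is a hypothetical stand-in."""
    i = np.arange(V)
    p = (i + 1.0) ** -1.05
    p /= p.sum()
    s = np.clip(1.05 + 0.5 * np.sin(0.7 * i) + 0.1 * (i / (V - 1) - 0.5), 0.45, 1.65)
    z = np.arange(1, M + 1, dtype=float)
    q = z[None, :] ** (-s[:, None])
    q /= q.sum(axis=1, keepdims=True)
    return p, q

def exact_weight_stats(target, proposal):
    """Exact mean, CV^2, and ESS/N of W_1(i) = target(i) / proposal(i), i ~ proposal."""
    w = target / proposal
    mean = (proposal * w).sum()
    var = (proposal * w ** 2).sum() - mean ** 2
    cv2 = var / mean ** 2
    return mean, cv2, 1.0 / (1.0 + cv2)

alpha = 4.0
p, q = make_synthetic()
q_star = p ** alpha * (q ** alpha).sum(axis=1)
q_star /= q_star.sum()

proposals = {
    "oracle": q_star,
    "base": p,
    "temp": p ** alpha / (p ** alpha).sum(),
    "unif": np.full(len(p), 1.0 / len(p)),
}
stats = {name: exact_weight_stats(q_star, prop) for name, prop in proposals.items()}
```

Each entry of `stats` is a tuple (mean, $\mathrm{CV}^{2}$, $\mathrm{ESS}/N$). The mean equals one for every proposal with full support, so the proposals differ only through the variance term.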

Figure S.8 compares the four proposals at $\alpha=4$. The left panel shows the proposal probabilities; the oracle proposal equals the target next-token conditional $\pi_{\mathrm{pow},\alpha}$ by construction, and the temperature, base, and uniform proposals deviate from it, especially on next-token ranks where the suffix exponent $s_{i}$ is large, so that the suffix distribution is sharp and $\sum_{z}q_{i}(z)^{\alpha}$ is large. The right panel plots $\log W_{1}(i)$: only the oracle proposal yields a constant log weight, while the other proposals produce token-dependent log weights.

Figure S.9 reports the exact $\mathrm{ESS}/N$ and $\mathrm{CV}^{2}(W_{1})$ as a function of $\alpha$ . The oracle proposal attains $\mathrm{ESS}/N=1$ for every $\alpha$ , whereas the gap between the temperature proposal and the oracle widens as $\alpha$ grows, because larger $\alpha$ amplifies the suffix power masses that the local temperature transform ignores.

Figure S.10 checks the same conclusion with Monte Carlo: for each proposal we draw $N$ tokens, compute the self-normalized $\mathrm{ESS}$ , and average across replicates. The sampled $\mathrm{ESS}/N$ concentrates around the exact values from Figure S.9 as $N$ grows, and the ordering of the proposals is preserved at every particle budget.
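A Monte Carlo check of this kind can be sketched as follows, using the usual self-normalized estimate $\mathrm{ESS}=(\sum_{n}w_{n})^{2}/\sum_{n}w_{n}^{2}$. The four-token target and the uniform proposal are toy values chosen for illustration, and the function names, particle budget, and number of replicates are assumptions rather than the settings used for the figure.

```python
import numpy as np

def sampled_ess_fraction(target, proposal, n, n_rep, rng):
    """Average self-normalized ESS/N over n_rep replicates of n draws each."""
    fracs = []
    for _ in range(n_rep):
        idx = rng.choice(len(proposal), size=n, p=proposal)
        w = target[idx] / proposal[idx]            # importance weights W_1
        fracs.append(w.sum() ** 2 / (n * (w ** 2).sum()))
    return float(np.mean(fracs))

def exact_ess_fraction(target, proposal):
    """Exact ESS/N = 1 / (1 + CV^2) = 1 / E[W_1^2], since E[W_1] = 1."""
    return 1.0 / (proposal * (target / proposal) ** 2).sum()

rng = np.random.default_rng(0)
target = np.array([0.5, 0.3, 0.15, 0.05])    # toy stand-in for q_star
uniform = np.full(4, 0.25)

exact = exact_ess_fraction(target, uniform)
sampled = sampled_ess_fraction(target, uniform, n=10_000, n_rep=20, rng=rng)
oracle = sampled_ess_fraction(target, target, n=10_000, n_rep=20, rng=rng)
```

With the target itself as the proposal, all weights equal one and the sampled $\mathrm{ESS}/N$ is exactly one for any $N$; for the uniform proposal the sampled value approaches the exact one as $N$ grows.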

