arXiv:2605.04542 · cs.LG · uncurated · rendered via ar5iv

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation


Akiyoshi Tomihari (tomihari@g.ecc.u-tokyo.ac.jp), Department of Computer Science, The University of Tokyo
Issei Sato (sato@g.ecc.u-tokyo.ac.jp), Department of Computer Science, The University of Tokyo
Abstract

Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model’s sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.

1 Introduction

The strong reasoning ability exhibited by large language models (LLMs) has often been attributed to reinforcement learning (RL). However, empirical analyses question whether RL explains emergent reasoning: as the number of sampled generations grows, post-RL models often fail to outperform their pre-RL counterparts, suggesting that RL may not be what endows LLMs with reasoning ability (yue2025does). At the same time, distillation has become a standard way to transfer the capabilities of expensive or stronger models to smaller models (hinton2015distilling; guo2025deepseek; busbridge2025distillation), and inference-time compute allocated to sampling or search has improved LLM performance (snell2024scaling; welleck2024from).

However, the relationship between sampling, RL, and self-distillation remains unclear. In particular, karan2026reasoning show that a base model, without additional training or external reward, can match or exceed post-RL models using power sampling. This raises the question of whether the success of power sampling reflects a mechanism distinct from RL and distillation, or whether these methods can be connected through a common structure. Clarifying such a connection is important because it can reveal whether gains that appear to come from different procedures in fact arise from a common mechanism, and whether an expensive inference-time procedure can be converted into an offline training objective.

In this study, we show that sampling, RL, and self-distillation are naturally connected through the power distribution. As illustrated in Figure 1, this distribution is the target of power sampling, the closed-form optimum of a self-reward RL objective, and the teacher distribution amortized by self-distillation. From the sampling perspective, a natural question is whether the effect of power sampling can be reproduced by an inexpensive token-level approximation. We show that this is structurally difficult: per-token approximations cannot match the power distribution without sequence-level information. From the RL perspective, the power distribution is the closed-form optimum of KL-regularized RL (ouyang2022instructgpt) when the reward is the model’s sequence-level log-probabilities, i.e., the self-reward in the sense of huang2025selfimprovement. Finally, by rewriting this RL objective, we derive power self-distillation as an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into offline training. We further show that power self-distillation achieves sharpening, and that whether the resulting sharpening improves a true reward is determined by a reward covariance under the power distribution.

Our contributions are summarized as follows. Figure 1 illustrates the connection we study, and Table S.1 compares these axes with prior work.

  • We show that approximating power sampling at inference time is structurally hard: per-token approximations cannot match the power distribution without sequence-level information (Propositions 1 and 2).

  • We show that the power distribution is the closed-form optimum of KL-regularized RL with the model’s sequence-level log-probabilities as the reward (Corollary 1), and derive power self-distillation by rewriting this RL objective, thereby amortizing expensive power sampling into offline training (Algorithm 1).

  • We provide a sharpening bound for power self-distillation (Proposition 3), and characterize when the induced self-distillation improves a true reward through a covariance condition under the power distribution (Proposition 4).

Figure 1: Overview of our contribution. The power distribution connects sampling, KL-regularized RL, and self-distillation: it is the target of power sampling, the closed-form optimum of self-reward RL, and the teacher distribution amortized by power self-distillation.

2 Related work

RL post-training as distribution sharpening. RL has become a central tool in LLM post-training, including RL from human feedback (RLHF) (ouyang2022instructgpt) and RL with verifiable rewards (RLVR) (shao2024deepseekmath; guo2025deepseek; lambert2024tulu). However, a growing line of work questions whether such RL induces genuinely new reasoning capabilities. yue2025does showed that under pass@k evaluation, RLVR often improves sampling efficiency at small $k$ but can underperform the base model at large $k$, suggesting that RLVR concentrates probability mass on reasoning paths already present in the base model’s distribution. Complementing this view, he2025rewarding analyzed a degenerate rank bias in GRPO that preferentially reinforces high-probability trajectories, yielding a “distribution sharpening” regime where simply sampling more from the base model can be stronger under the same sample budget. Motivated by the perspective that many RL gains resemble distribution sharpening, karan2026reasoning proposed a training-free inference-time method that targets sharpened distributions of the base model. Their approach uses a Metropolis–Hastings sampler to approximate sequence-level power sampling and achieves reasoning improvements comparable to RL. azizi2026power and ji2026scalable developed lower-latency approximations to power sampling. We complement this line of work by showing that the power distribution targeted by these samplers is also the closed-form optimum of a self-reward KL-regularized RL objective.

Inference-time compute and distillation to amortize inference cost. Recent work argues that allocating additional computation at inference time can substantially improve LLM outputs (snell2024scaling; welleck2024from). When an external reward or verifier is available, a common method is Best-of-$N$, which generates $N$ candidates and selects the one with the highest reward; this simple strategy can yield strong empirical gains (stiennon2020learning; nakano2021webgpt; touvron2023llama; pmlr-v202-gao23h; eisenstein2023helping; mudgal2024controlled). To amortize the inference cost of Best-of-$N$, several works characterized the distribution induced by Best-of-$N$ selection and proposed to distill this distribution into a single policy (gui2024bonbon; amini2025variational; sessa2025bond; yang2025fasterwind). In contrast to these reward-based distillation methods, we derive a self-distillation objective that amortizes power sampling itself, using only samples from the base model’s power distribution.

Self-improvement without external rewards. A growing number of empirical studies suggest that language models can improve without relying on external rewards or human-provided labels, using self-generated data and intrinsic training signals. huang2023large; wang2023self curated model-generated solutions or instructions and then fine-tuned on them. Several works perform RL using internal feedback alone, such as entropy minimization objectives (prabhudesai2025maximizing) or confidence as the reward (zhao2026learning). Even randomly assigned rewards can improve performance (shao2025spurious). huang2025selfimprovement formalized LLM self-improvement as distribution sharpening and analyzed algorithms motivated by SFT and KL-regularized RL. Building on this sharpening view, we show that the model’s sequence-level log-probabilities induce the power distribution through KL-regularized RL, and that distilling this distribution can sharpen the model without external rewards.

3 Preliminaries

Notation. Let $\mathcal{X}$ denote the space of prompts and let $\mu\in\Delta(\mathcal{X})$ denote a distribution over prompts. We consider completions of length $T\geq 1$ over a finite vocabulary $\mathcal{V}$, and write $\mathcal{Y}:=\mathcal{V}^{T}$ for the completion space. The base model is a policy $\pi:\mathcal{X}\to\Delta(\mathcal{Y})$, and we write $\pi(\cdot\mid x)$ for the conditional distribution of $y$ given $x$. With $y_{<t}:=(y_{1},\dots,y_{t-1})$, we use the autoregressive factorization $\pi(y\mid x)=\prod_{t=1}^{T}\pi(y_{t}\mid x,y_{<t})$. We write $a\lesssim b$ to mean $a=O(b)$, and $\pi(S\mid x):=\sum_{y\in S}\pi(y\mid x)$ for a set $S\subseteq\mathcal{Y}$.

Self-improvement. Language models have been shown to be capable of self-improvement, improving their own performance without external rewards (huang2023large; wang2023self; prabhudesai2025maximizing; zhao2026learning). This phenomenon is counterintuitive and appears to contradict the data-processing inequality, which states that mutual information is non-increasing under further processing of random variables (cover1999elements). huang2025selfimprovement reconcile these observations by interpreting improvements as computational, not statistical: self-improvement sharpens the distribution so that sampling a near-optimal solution becomes easier. This perspective connects to classical trade-offs between sampling and optimization in theoretical computer science (kirkpatrick1983optimization; lovasz2006fast).

Formally, define the self-reward as the log-likelihood

$$r_{\mathrm{self}}(x,y;\pi)=\log\pi(y\mid x) \tag{1}$$

and let the corresponding maximizer set be

$$\bm{y}^{\star}(x):=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\,r_{\mathrm{self}}(x,y;\pi). \tag{2}$$

Given $(\epsilon,\delta)\in(0,1)^{2}$, a policy $\widehat{\pi}$ is $(\epsilon,\delta)$-sharpened relative to $\pi$ if the following holds:

$$\mathbb{P}_{x\sim\mu}\Big[\widehat{\pi}\big(\bm{y}^{\star}(x)\mid x\big)\geq 1-\delta\Big]\geq 1-\epsilon. \tag{3}$$

huang2025selfimprovement analyze the sample complexity of achieving $(\epsilon,\delta)$-sharpening when $\pi$ is accessed only through conditional draws $y\sim\pi(\cdot\mid x)$ and likelihood evaluations $\pi(y\mid x)$, for supervised fine-tuning on Best-of-$N$ targets sampled from $\pi$ and for KL-regularized RL objectives driven by $r_{\mathrm{self}}(x,y;\pi)$.
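The sharpening condition in Eq. (3) can be checked mechanically on a toy problem. The sketch below is illustrative only: it assumes a finite set of prompts weighted uniformly (standing in for $\mu$) and policies stored as nested dictionaries. The names `sharpening_rate`, `base`, and `student` are hypothetical and do not come from the paper.

```python
def sharpening_rate(base, policy, delta):
    """Fraction of prompts x with policy(y*(x) | x) >= 1 - delta.

    base and policy map prompt -> {completion: probability}.
    Assumption: prompts are weighted uniformly, standing in for mu.
    """
    hits = 0
    for x, probs in base.items():
        top = max(probs.values())
        # y*(x): all completions attaining the maximal base log-likelihood
        y_star = [y for y, p in probs.items() if p == top]
        mass = sum(policy[x].get(y, 0.0) for y in y_star)
        hits += mass >= 1.0 - delta
    return hits / len(base)


# Toy data (hypothetical): two prompts, two completions each.
base = {"x1": {"u": 0.6, "v": 0.4}, "x2": {"u": 0.3, "v": 0.7}}
student = {"x1": {"u": 0.95, "v": 0.05}, "x2": {"u": 0.2, "v": 0.8}}

rate = sharpening_rate(base, student, delta=0.1)
# The student is (epsilon, delta)-sharpened iff rate >= 1 - epsilon.
```

In this toy case the student concentrates enough mass on the maximizer for the first prompt but not for the second, so it would be $(\epsilon,0.1)$-sharpened only for $\epsilon\geq 1/2$.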

Power distribution. Recent analyses of RL suggest that empirical reasoning gains resemble distribution sharpening, where probability mass concentrates on trajectories already well supported under the base model (yue2025does; he2025rewarding). Motivated by this view, karan2026reasoning target inference-time sampling from the power distribution induced by the base model.

Definition 1 (Power distribution).

With a policy $\pi:\mathcal{X}\to\Delta(\mathcal{Y})$ and an exponent $\alpha>1$, we define the power distribution induced by $\pi$ as

$$\pi_{\alpha}(y\mid x):=\frac{\pi(y\mid x)^{\alpha}}{\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)^{\alpha}}. \tag{4}$$

Exact sampling from Eq. (4) is intractable at scale. karan2026reasoning therefore propose a Metropolis–Hastings (MH) procedure that achieves reasoning accuracy competitive with strong RL post-training (shao2024deepseekmath; guo2025deepseek), without further training. Lower-latency approximations have subsequently been proposed (azizi2026power; ji2026scalable), but these methods still use substantially more inference-time compute than standard autoregressive sampling.
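When the completion space is small enough to enumerate, Eq. (4) can be evaluated directly, which makes the sharpening effect of the exponent concrete. The following is a minimal sketch under that assumption; the function name `power_distribution` and the toy probabilities are hypothetical.

```python
def power_distribution(seq_probs, alpha):
    """Return pi_alpha(y) = pi(y)^alpha / sum_y' pi(y')^alpha for a dict {y: pi(y)}."""
    weights = {y: p ** alpha for y, p in seq_probs.items()}
    z = sum(weights.values())  # normalizer; enumerable only for toy problems
    return {y: w / z for y, w in weights.items()}


# Hypothetical base model over completions of length T = 2 on vocabulary {a, b}.
base = {("a", "a"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.25, ("b", "b"): 0.25}
sharp = power_distribution(base, alpha=4.0)

for y in base:
    print(y, round(base[y], 3), round(sharp[y], 3))
```

The mode of the base model keeps its rank but gains mass, while low-probability completions are suppressed. For realistic $T$ and $|\mathcal{V}|$ the sum in the normalizer has $|\mathcal{V}|^{T}$ terms, which is why the enumeration above does not scale.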

4 Approximating power sampling requires sequence-level information

In this section, we begin from the sampling perspective. We ask whether the power distribution $\pi_{\alpha}$ can be reproduced by inexpensive inference-time approximations, focusing on two natural local inference-time procedures: (i) a per-token tempered distribution (Section 4.1) and (ii) sequential importance sampling (SIS) with a one-step proposal (Section 4.2). In both cases, the gap to $\pi_{\alpha}$ is governed by sequence-level information that the local approximations do not access, showing why cheap inference-time approximations are structurally difficult and motivating the RL and self-distillation perspectives in Section 5.

4.1 Comparison to per-token temperature scaling

A natural way to locally approximate $\pi_{\alpha}$ is to apply the same power transformation at the token level during decoding. For $s\in\mathcal{V}$, define the per-token tempered next-token distribution by

$$\pi_{\mathrm{temp},\alpha}(y_{t}=s\mid x,y_{<t}):=\frac{\pi(y_{t}=s\mid x,y_{<t})^{\alpha}}{\sum_{s^{\prime}\in\mathcal{V}}\pi(y_{t}=s^{\prime}\mid x,y_{<t})^{\alpha}}. \tag{5}$$

In contrast, the power distribution in Eq. (4) is, more precisely, the sequence-level power distribution $\pi_{\alpha}(\cdot\mid x)\propto\pi(\cdot\mid x)^{\alpha}$, whose next-token conditional we denote by $\pi_{\mathrm{pow},\alpha}$:

$$\pi_{\mathrm{pow},\alpha}(y_{t}=s\mid x,y_{<t}):=\frac{\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)^{\alpha}}{\sum_{y_{t:T}\in\mathcal{V}^{T-t+1}}\pi(y_{<t},\,y_{t:T}\mid x)^{\alpha}}. \tag{6}$$

We show that for arbitrary suffix distributions, the entire odds-ratio gap between Eqs. (5) and (6) is controlled by the Rényi entropy of the suffix.

Proposition 1 (Power vs. temperature odds ratios via suffix Rényi entropies).

For $\alpha>1$, a prompt $x$, a prefix $y_{<t}$, and $a\in\mathcal{V}$, let $q_{t,a}$ denote the conditional distribution of the suffix $Y_{t+1:T}$ under the base model,

$$q_{t,a}(y_{t+1:T}):=\pi(y_{t+1:T}\mid x,y_{<t},y_{t}=a).$$

For a distribution $p$ on a finite set, define the Rényi entropy of order $\alpha$ as $H_{\alpha}(p):=\frac{1}{1-\alpha}\log\sum_{z}p(z)^{\alpha}$. Then for any $a,b\in\mathcal{V}$ such that $\pi(y_{t}=a\mid x,y_{<t})>0$ and $\pi(y_{t}=b\mid x,y_{<t})>0$, the ratio of next-token odds under $\pi_{\mathrm{pow},\alpha}$ versus $\pi_{\mathrm{temp},\alpha}$ satisfies

$$\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}\bigg/\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\exp\!\bigl((1-\alpha)\,(H_{\alpha}(q_{t,a})-H_{\alpha}(q_{t,b}))\bigr). \tag{7}$$

We have $1-\alpha<0$, so Eq. (7) implies that, among next-token candidates $a$ with comparable values of $\pi(y_{t}=a\mid x,y_{<t})$, those for which $q_{t,a}$ has larger Rényi entropy are relatively downweighted under $\pi_{\mathrm{pow},\alpha}$ compared to $\pi_{\mathrm{temp},\alpha}$. Thus, compared with per-token temperature scaling, sequence-level power sharpening favors continuations whose suffix distributions under $\pi$ are more peaked, i.e., have lower Rényi entropy.

Comparison to karan2026reasoning. karan2026reasoning also studied the gap between per-token temperature and sequence-level power sampling, and formalized it in the special case of two extreme tokens (positive vs. negative pivotal tokens; their Example 1 and Proposition 3). Our result enables a quantitative comparison for any two next-token candidates.

Proposition 1 suggests that matching the next-token distribution induced by sequence-level power sampling at a step requires information about the suffix distributions following each candidate token.
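The identity in Eq. (7) is easy to verify numerically on a two-step toy model. The sketch below assumes a hypothetical base model with two first tokens that are equally likely but differ in how peaked their suffix distributions are; all names and numbers are illustrative.

```python
import math


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha: (1 / (1 - alpha)) * log sum_z p(z)^alpha."""
    return math.log(sum(pz ** alpha for pz in p)) / (1.0 - alpha)


alpha = 4.0
first = {"a": 0.5, "b": 0.5}                  # pi(y_1 = s | x), hypothetical
suffix = {"a": [0.9, 0.1], "b": [0.5, 0.5]}   # q_{1,s}: suffix distribution after s

# Unnormalized next-token masses; normalizers cancel in odds ratios.
pow_mass = {s: sum((first[s] * q) ** alpha for q in suffix[s]) for s in first}
temp_mass = {s: first[s] ** alpha for s in first}

lhs = (pow_mass["a"] / pow_mass["b"]) / (temp_mass["a"] / temp_mass["b"])
rhs = math.exp(
    (1 - alpha) * (renyi_entropy(suffix["a"], alpha) - renyi_entropy(suffix["b"], alpha))
)
print(lhs, rhs)
```

Per-token temperature scaling treats the two first tokens identically here, whereas the sequence-level power distribution favors token `a`, whose suffix distribution has lower Rényi entropy.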

4.2 Variance-minimizing one-step proposals for sequential power sampling

Beyond marginal token distributions, we turn to sequential importance sampling (SIS) targeting $\pi_{\alpha}$, where a basic design goal is to stabilize incremental importance weights. Proposition 3.3 of zhao2024probabilistic identifies the unique one-step variance-minimizing proposal in a general SIS setup, and we apply it to the power distribution $\pi_{\alpha}$.

Fix a prompt $x\in\mathcal{X}$. Define the unnormalized power mass $\tilde{\pi}_{\alpha}(y_{1:T}):=\pi(y_{1:T}\mid x)^{\alpha}$ and, for $t=0,\dots,T$, the prefix totals

$$\tilde{\pi}_{\alpha,t}(y_{1:t}):=\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\tilde{\pi}_{\alpha}(y_{1:T}), \tag{8}$$

where for $t=0$ the prefix is empty. Let $Z_{\alpha}:=\sum_{y_{1:T}\in\mathcal{V}^{T}}\tilde{\pi}_{\alpha}(y_{1:T})$ be the normalizing constant, so that $\tilde{\pi}_{\alpha,0}=Z_{\alpha}$, and let $\pi_{\alpha}(\cdot\mid x)$ be the normalized power distribution on $\mathcal{V}^{T}$ from Eq. (4). For $t\geq 1$, write $\pi_{\alpha}(y_{1:t}\mid x)$ for the prefix marginal obtained by summing $\pi_{\alpha}(y_{1:T}\mid x)$ over $y_{t+1:T}$; then $\pi_{\alpha}(y_{1:t}\mid x)=\tilde{\pi}_{\alpha,t}(y_{1:t})/Z_{\alpha}$.

Consider extending a fixed prefix $y_{<t}$ by one token $Y_{t}\sim q(\cdot\mid x,y_{<t})$ in one step of SIS (or SMC without resampling), while keeping the global target $\pi_{\alpha}(\cdot\mid x)$ on $\mathcal{V}^{T}$. Define the incremental importance weight (chopin2020introduction)

$$W_{t}:=\frac{\pi_{\alpha}(y_{<t},Y_{t}\mid x)}{\pi_{\alpha}(y_{<t}\mid x)\,q(Y_{t}\mid x,y_{<t})}, \tag{9}$$

where we condition on $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$, and $\mathrm{Var}[W_{t}]$ denotes variance under $Y_{t}\sim q(\cdot\mid x,y_{<t})$. The next proposition shows the unique proposal $q(\cdot\mid x,y_{<t})$ that minimizes $\mathrm{Var}[W_{t}]$ at such a prefix.

Proposition 2 (Variance-minimizing one-step proposal at prefix $y_{<t}$).

In the setting above, fix $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$. Among all proposals $q(\cdot\mid x,y_{<t})$ on $\mathcal{V}$, the unique minimizer of $\mathrm{Var}[W_{t}]$ is

$$q_{t}^{\star}(y_{t}\mid x,y_{<t}):=\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})}{\tilde{\pi}_{\alpha,t-1}(y_{<t})}=\pi_{\alpha}\bigl(y_{t}\mid x,y_{<t}\bigr), \tag{10}$$

where $y_{1:t}=(y_{<t},y_{t})$; the right-hand side equals the next-token conditional under $\pi_{\alpha}(\cdot\mid x)$.

Proposition 2 implies that minimizing the local one-step variance of the incremental weight forces the proposal to coincide with the next-token conditional in Eq. (10), which itself depends on the prefix totals $\tilde{\pi}_{\alpha,t}$ summed over all suffixes. In particular, proposals that modify only the base next-token conditional cannot in general equal the unique minimizer in Eq. (10). The proof and SIS background are in Section A.2.
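A small numerical illustration of Proposition 2 follows. It is a sketch on a hypothetical two-step model at $t=1$, where the prefix is empty so that $\pi_{\alpha}(y_{<t}\mid x)=1$ and Eq. (9) reduces to $W_{1}=\pi_{\alpha}(Y_{1}\mid x)/q(Y_{1}\mid x)$. The helper names and probabilities are assumptions for illustration.

```python
def normalize(masses):
    z = sum(masses.values())
    return {s: m / z for s, m in masses.items()}


def weight_variance(target_cond, proposal):
    """Variance of W = target_cond[Y] / proposal[Y] under Y ~ proposal."""
    weights = {s: target_cond[s] / proposal[s] for s in proposal}
    mean = sum(proposal[s] * weights[s] for s in proposal)
    return sum(proposal[s] * (weights[s] - mean) ** 2 for s in proposal)


alpha = 4.0
first = {"a": 0.6, "b": 0.4}                  # base next-token conditional (hypothetical)
suffix = {"a": [0.9, 0.1], "b": [0.5, 0.5]}   # suffix distributions after each token

# Next-token conditional of the sequence-level power distribution (uses suffix sums).
power_cond = normalize({s: sum((first[s] * q) ** alpha for q in suffix[s]) for s in first})
# Per-token tempered proposal (uses only the base next-token conditional).
temp_cond = normalize({s: first[s] ** alpha for s in first})

var_optimal = weight_variance(power_cond, power_cond)  # proposal q* of Eq. (10)
var_temp = weight_variance(power_cond, temp_cond)
var_base = weight_variance(power_cond, first)
print(var_optimal, var_temp, var_base)
```

Only the proposal built from the suffix sums drives the variance to zero; the two proposals that see just the base next-token conditional leave a strictly positive variance in this example.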

Implication. Propositions 1 and 2 indicate that inexpensive one-step approximations cannot reproduce $\pi_{\alpha}$ without sequence-level information, leaving inference-time approximation of $\pi_{\alpha}$ structurally expensive. This aligns with prior work that expends additional inference-time compute (karan2026reasoning; azizi2026power; ji2026scalable) to approximate the power distribution.

5 From self-reward RL to power self-distillation

Section 4 shows that $\pi_{\alpha}$ is structurally expensive to approximate by sampling at inference time. In this section, we take the complementary view that $\pi_{\alpha}$ also connects RL and self-distillation, allowing us to shift the cost to offline training. Section 5.1 identifies $\pi_{\alpha}$ as the closed-form optimum of a KL-regularized RL objective with self-reward. Section 5.2 uses this identification to derive an offline self-distillation algorithm from that RL objective. Section 5.3 then analyzes what the resulting distilled model achieves: a sharpening guarantee on the self-reward, and a characterization of when sharpening also improves a true reward.

5.1 Power distribution as the optimum of self-reward RL

Let $q:\mathcal{X}\to\Delta(\mathcal{Y})$ be a candidate policy, and consider the KL-regularized RL objective with reward $r$ (ouyang2022instructgpt; guo2025deepseek)

$$J_{\beta}(q;\pi,r):=\mathbb{E}_{x\sim\mu}\Big[\mathbb{E}_{y\sim q(\cdot\mid x)}\big[r(x,y)\big]-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)\Big] \tag{11}$$

with $\beta>0$. By the standard closed-form solution of KL-regularized RL (levine2018reinforcement), the unique maximizer of Eq. (11) is the reward-tilted distribution

$$\pi_{\beta}^{\star}(y\mid x):=\frac{\pi(y\mid x)\exp\!\big(\beta^{-1}r(x,y)\big)}{Z_{r}(x)},\qquad Z_{r}(x):=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)\exp\!\big(\beta^{-1}r(x,y^{\prime})\big). \tag{12}$$

We restate this as Proposition 5 in Section A.5 and include a proof for completeness. Specializing the reward in Eq. (12) to the self-reward $r_{\mathrm{self}}$ in Eq. (1) yields the power distribution.

Corollary 1 (Self-reward tilt equals the power distribution).

Suppose $r(x,y)=r_{\mathrm{self}}(x,y;\pi)=\log\pi(y\mid x)$ as in Eq. (1). Then the optimizer $\pi_{\beta}^{\star}$ in Eq. (12) equals the power distribution $\pi_{\alpha}$ in Eq. (4):

$$\pi_{\beta}^{\star}(\cdot\mid x)=\pi_{\alpha}(\cdot\mid x),\qquad\alpha:=1+\beta^{-1}>1. \tag{13}$$

This identification connects power sampling and self-improvement RL: the inference-time target of karan2026reasoning coincides with the closed-form optimum of the KL-regularized self-reward objective studied in huang2025selfimprovement, namely $\pi_{\alpha}$.
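For readers who want the intermediate step, the substitution behind Corollary 1 can be written out as follows. This is a short worked derivation from Eqs. (1), (4), and (12), not a reproduction of the paper's appendix proof.

```latex
\begin{align*}
\pi(y\mid x)\exp\!\big(\beta^{-1}\log\pi(y\mid x)\big)
  &= \pi(y\mid x)\,\pi(y\mid x)^{1/\beta}
   = \pi(y\mid x)^{1+\beta^{-1}}
   = \pi(y\mid x)^{\alpha},\\
Z_{r}(x)
  &= \sum_{y'\in\mathcal{Y}}\pi(y'\mid x)^{\alpha},\\
\pi_{\beta}^{\star}(y\mid x)
  &= \frac{\pi(y\mid x)^{\alpha}}{\sum_{y'\in\mathcal{Y}}\pi(y'\mid x)^{\alpha}}
   = \pi_{\alpha}(y\mid x).
\end{align*}
```

As a concrete instance of the mapping $\alpha=1+\beta^{-1}$, a regularization strength of $\beta=1/3$ corresponds to the exponent $\alpha=4$, and weaker regularization (smaller $\beta$) corresponds to a sharper target.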

5.2 Deriving power self-distillation

We now derive a self-distillation procedure from the RL objective without requiring the deployed model to sample from $\pi_{\alpha}$ at inference time.

RL objective as reverse and then forward KL to $\pi_{\alpha}$. With $r=r_{\mathrm{self}}$ and $\alpha=1+\beta^{-1}$, the inner objective in Eq. (11) can be rewritten for each $x$ as

$$\mathbb{E}_{y\sim q(\cdot\mid x)}[r_{\mathrm{self}}(x,y)]-\beta D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)=-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\alpha}(\cdot\mid x)\big)+\beta\log Z_{r}(x), \tag{14}$$

with the same partition function $Z_{r}(x)$ as in Eq. (12), which does not depend on $q$. Thus, for each prompt $x$, maximizing $J_{\beta}(q;\pi,r_{\mathrm{self}})$ over unconstrained $q$ is equivalent to minimizing the reverse KL divergence $D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\alpha}(\cdot\mid x)\big)$, with unique minimizer $q(\cdot\mid x)=\pi_{\alpha}(\cdot\mid x)$.
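Eq. (14) can be sanity-checked numerically for a single prompt. The sketch below uses a hypothetical three-completion base distribution and an arbitrary candidate `q`; the variable names are illustrative assumptions.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


beta = 0.5
alpha = 1.0 + 1.0 / beta                 # alpha = 3 under Eq. (13)
base = [0.5, 0.3, 0.2]                   # pi(. | x), hypothetical
z = sum(p ** alpha for p in base)        # Z_r(x) with r = r_self
pi_alpha = [p ** alpha / z for p in base]


def objective(q):
    """Inner objective of Eq. (11) with r = log pi."""
    return sum(qi * math.log(p) for qi, p in zip(q, base)) - beta * kl(q, base)


q = [0.2, 0.5, 0.3]                      # arbitrary candidate policy
lhs = objective(q)
rhs = -beta * kl(q, pi_alpha) + beta * math.log(z)
best = objective(pi_alpha)               # value at the claimed maximizer
print(lhs, rhs, best)
```

Because the KL term on the right-hand side is non-negative and vanishes only at $q=\pi_{\alpha}$, the objective evaluated at `pi_alpha` equals $\beta\log Z_{r}(x)$ and upper-bounds the value at any other `q`.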

However, the reverse KL is an expectation under $q$, so optimizing it directly would require on-policy samples from the learner. We therefore convert the objective into an offline distillation surrogate that shares the same target distribution $\pi_{\alpha}$, by minimizing the forward KL from the teacher distribution to the student, $D_{\mathrm{KL}}\!\big(\pi_{\alpha}(\cdot\mid x)\,\|\,q(\cdot\mid x)\big)$. The population minimizer over $q$ is still $\pi_{\alpha}$, so this surrogate preserves the same target distribution while enabling offline maximum-likelihood training on teacher samples. This forward-KL surrogate mirrors the reward-augmented maximum-likelihood method of norouzi2016reward, who also exchanged the reverse KL appearing in entropy-regularized RL for a forward KL.

Forward KL yields MLE on teacher samples. Expanding the forward KL training objective gives

$$\mathbb{E}_{x\sim\mu}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\alpha}(\cdot\mid x)\,\|\,q(\cdot\mid x)\big)\right]=\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log\pi_{\alpha}(y\mid x)\big]-\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log q(y\mid x)\big], \tag{15}$$

where $x,y\sim\mu,\pi_{\alpha}$ abbreviates $x\sim\mu$ and $y\sim\pi_{\alpha}(\cdot\mid x)$. The first term on the right-hand side of Eq. (15) does not depend on $q$, so minimizing the population forward KL is equivalent to maximizing the expected log-likelihood $\mathbb{E}_{x,y\sim\mu,\pi_{\alpha}}\big[\log q(y\mid x)\big]$. In practice we form an empirical objective by drawing i.i.d. pairs $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ with $x_{i}\sim\mu$ and $y_{i}\sim\pi_{\alpha}(\cdot\mid x_{i})$, and we solve the following maximum likelihood estimation (MLE) problem over a student policy class $\Pi_{\alpha}$:

$$\widehat{\pi}\in\arg\max_{q\in\Pi_{\alpha}}\sum_{i=1}^{n}\log q(y_{i}\mid x_{i}). \tag{16}$$

This procedure uses only offline completions sampled from the power distribution $\pi_{\alpha}$ derived from the base policy $\pi$, and it does not rely on any external reward labels, so it is an instance of self-distillation. We refer to it as power self-distillation and summarize it in Algorithm 1. In practice we run teacher inference once, store $D=\{(x_{i},y_{i})\}_{i=1}^{n}$, and then train the student with standard supervised fine-tuning on $D$. Separating teacher generation from student training simplifies implementation and enables dataset reuse.
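The two-stage recipe (sample once from the teacher, then fit the student by maximum likelihood) can be illustrated on a toy problem. The sketch below is not the paper's Algorithm 1: it assumes exact teacher sampling by enumeration instead of MH, a fixed prompt set instead of $x\sim\mu$, and an unrestricted tabular student, for which the MLE in Eq. (16) is simply the empirical conditional frequency. All names are hypothetical.

```python
import random
from collections import Counter


def power_distribution(seq_probs, alpha):
    """pi_alpha(y) proportional to pi(y)^alpha, for a dict {y: pi(y)}."""
    weights = {y: p ** alpha for y, p in seq_probs.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def power_self_distill(base, alpha, n_per_prompt, seed=0):
    """Toy power self-distillation with a tabular student (illustrative assumptions)."""
    rng = random.Random(seed)
    dataset = []
    # Step 1: run the teacher once and store D = {(x_i, y_i)}.
    for x, probs in base.items():
        teacher = power_distribution(probs, alpha)
        ys = rng.choices(list(teacher), weights=list(teacher.values()), k=n_per_prompt)
        dataset.extend((x, y) for y in ys)
    # Step 2: maximum likelihood on D (closed form for a tabular student).
    pair_counts = Counter(dataset)
    prompt_counts = Counter(x for x, _ in dataset)
    student = {
        x: {y: pair_counts[(x, y)] / prompt_counts[x] for y in probs}
        for x, probs in base.items()
    }
    return dataset, student


base = {"x1": {("a", "a"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.25, ("b", "b"): 0.25}}
dataset, student = power_self_distill(base, alpha=4.0, n_per_prompt=2000)
print(student["x1"])
```

After training, the student is sampled with ordinary autoregressive decoding, so the inference-time cost of the teacher is paid only once when building `dataset`. With a neural student the closed-form step would be replaced by supervised fine-tuning on the stored pairs.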

5.3 Sharpening and true reward under power self-distillation

In this subsection, we analyze two complementary aspects of power self-distillation: (i) Proposition 3 bounds the extent to which $\widehat{\pi}$ sharpens the self-reward, in the sense of huang2025selfimprovement; (ii) Proposition 4 shows that the local rate at which sharpening changes a true reward is determined by the covariance $\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}})$.

(i) Self-reward sharpening of the distilled model. huang2025selfimprovement formalize self-improvement via $(\epsilon,\delta)$-sharpening relative to $\pi$ as in Eq. (3). The next proposition bounds how well the MLE in Eq. (16) concentrates on the self-reward maximizer set $\bm{y}^{\star}(x)$ in Eq. (2).

Proposition 3 (Power self-distillation and sharpening).

Fix $\alpha>1$ and $\rho,\delta\in(0,1)$. Suppose $\pi_{\alpha}\in\Pi_{\alpha}$ and there exists a constant $M<\infty$, independent of $\alpha$, such that $|\Pi_{\alpha}|\leq M$. Let $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ be i.i.d. samples with $x_{i}\sim\mu$, $y_{i}\sim\pi_{\alpha}(\cdot\mid x_{i})$, and let $\widehat{\pi}\in\arg\max_{q\in\Pi_{\alpha}}\sum_{i=1}^{n}\log q(y_{i}\mid x_{i})$ be an MLE. Then with probability at least $1-\rho$ over $D$,

$$\mathbb{P}_{x\sim\mu}\big[\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\big]\ \lesssim\ \frac{\log(M\rho^{-1})}{\delta n}+\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]. \tag{17}$$

In particular, the right-hand side of Eq. (17) converges to $0$ as $n\to\infty$ and $\alpha\to\infty$.

Thus, for sufficiently large $n$ and $\alpha$, power self-distillation can achieve $(\epsilon,\delta)$-sharpening in the sense of Eq. (3). The proof is in Section A.3.

(ii) When does sharpening also improve a different true reward? Proposition 3 guarantees concentration on the self-reward maximizer set, but evaluation is typically governed by a different true reward (e.g., correctness). Let $r^{\star}:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ denote this true reward and define, for fixed $x\in\mathcal{X}$,

$$R(\alpha;x):=\mathbb{E}_{y\sim\pi_{\alpha}(\cdot\mid x)}\big[r^{\star}(x,y)\big].$$

The next proposition characterizes how $R(\alpha;x)$ changes with $\alpha$.

Proposition 4 (Covariance form of $\partial_{\alpha}R(\alpha;x)$).

For any $\alpha>0$ and any fixed $x\in\mathcal{X}$,

$$\frac{\partial}{\partial\alpha}R(\alpha;x)=\mathrm{Cov}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r^{\star}(x,y),r_{\mathrm{self}}(x,y)\big), \tag{18}$$

where covariances are over the support of $\pi(\cdot\mid x)$, on which $\log\pi(\cdot\mid x)$ is finite. In particular, if for some $b(x)\in\mathbb{R}$ and $c(x)>0$,

$$r^{\star}(x,y)=c(x)\,r_{\mathrm{self}}(x,y)+b(x)\qquad\forall y\in\mathcal{Y}, \tag{19}$$

then $R(\alpha;x)$ is non-decreasing in $\alpha$:

$$\frac{\partial}{\partial\alpha}R(\alpha;x)=c(x)\,\mathrm{Var}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r_{\mathrm{self}}(x,y)\big)\geq 0.$$

The proof is in Section A.4. Proposition 4 states that $\partial_{\alpha}R(\alpha;x)$ equals the covariance between $r^{\star}$ and $r_{\mathrm{self}}$ under $\pi_{\alpha}(\cdot\mid x)$, so whether increasing $\alpha$ improves the true reward is determined exactly by $\mathrm{Cov}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(r^{\star}(x,y),r_{\mathrm{self}}(x,y)\big)$. In particular, when $r^{\star}=r_{\mathrm{self}}$, this covariance reduces to $\mathrm{Var}_{y\sim\pi_{\alpha}(\cdot\mid x)}\!\big(\log\pi(y\mid x)\big)\geq 0$, so $R(\alpha;x)$ is non-decreasing in $\alpha$.
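The covariance identity in Eq. (18) can be checked against a finite-difference derivative. The sketch below uses a hypothetical three-completion prompt in which the correct completion is not the mode of the base model, so the covariance is negative and sharpening lowers the true reward; names and numbers are illustrative.

```python
import math


def power_dist(base, alpha):
    weights = [p ** alpha for p in base]
    z = sum(weights)
    return [w / z for w in weights]


def expected_true_reward(base, r_true, alpha):
    """R(alpha; x) = E_{y ~ pi_alpha}[r*(x, y)]."""
    return sum(p * r for p, r in zip(power_dist(base, alpha), r_true))


def reward_covariance(base, r_true, alpha):
    """Cov_{y ~ pi_alpha}(r*, r_self) with r_self = log pi(y | x)."""
    p = power_dist(base, alpha)
    r_self = [math.log(b) for b in base]
    mean_true = sum(pi * r for pi, r in zip(p, r_true))
    mean_self = sum(pi * s for pi, s in zip(p, r_self))
    return sum(pi * (r - mean_true) * (s - mean_self) for pi, r, s in zip(p, r_true, r_self))


base = [0.5, 0.3, 0.2]       # hypothetical pi(y | x) over three completions
r_true = [0.0, 1.0, 0.0]     # the correct completion is not the mode of pi
alpha, h = 2.0, 1e-5

finite_diff = (expected_true_reward(base, r_true, alpha + h)
               - expected_true_reward(base, r_true, alpha - h)) / (2 * h)
covariance = reward_covariance(base, r_true, alpha)
print(finite_diff, covariance)
```

Swapping `r_true` to reward the mode instead (for example `[1.0, 0.0, 0.0]`) flips the sign of the covariance, which is the alignment condition the proposition describes.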

Figure 2: Sharper power sampling raises both $r_{\mathrm{self}}$ and $r^{\star}$. A smaller temperature parameter $\tau=1/\alpha$ (sharper distribution) increases $r_{\mathrm{self}}$ and true reward $r^{\star}$ (accuracy) on MATH500.
Figure 3: Empirical illustration of Proposition 4. Synthetic rewards $r$ with varying correlation to $r_{\mathrm{self}}$ on MATH500, with error bars denoting standard deviation.

6 Numerical evaluation

This section experimentally validates the following points.

  • (RQ1) Power sampling increases self-reward (Section 5.1).

  • (RQ2) Sharpening can improve true reward when $r^{\star}$ aligns with $r_{\mathrm{self}}$ (Section 5.3).

  • (RQ3) Power self-distillation achieves self-improvement (Section 5).

Detailed experimental setups are provided in Section B.1, and synthetic experiments validating Section 4 are shown in Sections B.2.4 and B.2.5.

6.1 Setup

We used the Qwen2.5-Math-7B (yang2024qwen2), Qwen2.5-7B (yang2024qwen), and Phi-3.5-mini-instruct (abdin2024phi) models on the MATH (lightman2024lets), HumanEval (chen2021evaluating), MBPP (austin2021program), and GPQA (rein2024gpqa) datasets. In the main text, we focus on the Qwen2.5-Math-7B model on the MATH dataset, which consists of 12,500 competition-style math problems spanning seven categories. For evaluation, we used MATH500, a selected subset of the MATH test set. For power self-distillation (Algorithm 1), we sampled 500 training problems from MATH, excluding those in MATH500. We fine-tuned with LoRA adapters (hu2022lora) using the AdamW optimizer (loshchilov2017decoupled).

For power sampling, we used the MH procedure of karan2026reasoning (Algorithm 2) with their default hyperparameters, including $\alpha=4.0$. For additional baselines, we used standard autoregressive sampling (Standard) and token-wise temperature scaling (Temperature) with $\tau=1/\alpha=0.25$, so that the token-level baseline uses the same local power exponent as power sampling. We studied three model variants: the base model (Base), the power-distilled model (Power-distilled, Algorithm 1), and a randomly initialized model (RandW). RandW is a negative control for cases where likelihood is not aligned with correctness.

To study the relationship between true reward $r^{\star}$ and self-reward $r_{\mathrm{self}}$, we additionally evaluated an approach we call self-reward Best-of-$N$: given $N$ sampled completions $\{y_{i}\}_{i=1}^{N}$, we selected the completion with the largest value of $r_{\mathrm{self}}(y_{i})$. In all experiments, $r_{\mathrm{self}}$ denotes the completion-token average log-likelihood under the evaluated model, with prompt tokens masked out; this length normalization makes values comparable across completions.
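The selection rule can be sketched in a few lines. This is a minimal illustration, assuming each completion comes with its per-token log-probabilities already restricted to completion tokens; the function names and the sample values are hypothetical.

```python
def self_reward(token_logprobs):
    """Length-normalized self-reward: mean log-likelihood over completion tokens."""
    if not token_logprobs:
        raise ValueError("completion must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)

def best_of_n(completions):
    """Return the (text, token_logprobs) pair with the largest self-reward."""
    return max(completions, key=lambda c: self_reward(c[1]))

# Hypothetical completions with made-up per-token log-probabilities
# (prompt tokens are assumed to be excluded already).
samples = [
    ("answer A", [-0.10, -0.30, -0.20]),
    ("answer B", [-0.05, -0.02]),
    ("answer C", [-1.20, -0.40, -0.10, -0.30]),
]
chosen_text, chosen_logprobs = best_of_n(samples)
print(chosen_text, self_reward(chosen_logprobs))
```

Averaging rather than summing log-probabilities keeps short and long completions on the same scale, which is the reason given above for the length normalization.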

Table 1: True reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$ for Qwen2.5-Math-7B on MATH500. The "All completions" columns report means over all sampled completions. The "Best-of-$N$" columns report self-reward Best-of-$N$: the completion with the largest $r_{\mathrm{self}}$ among samples generated with different seeds. Evaluated with four seeds.

| Model | Sampling | All completions: $r^{\star}$ ($\uparrow$) | All completions: $r_{\mathrm{self}}$ | Best-of-$N$: $r^{\star}$ ($\uparrow$) | Best-of-$N$: $r_{\mathrm{self}}$ |
|---|---|---|---|---|---|
| RandW | Standard | $0.000\pm 0.000$ | $-13.406\pm 0.008$ | $0.000$ | $-13.309$ |
| RandW | Power | $0.000\pm 0.000$ | $-12.370\pm 0.014$ | $0.000$ | $-12.133$ |
| Base | Standard | $0.508\pm 0.016$ | $-0.316\pm 0.043$ | $0.680$ | $-0.097$ |
| Base | Temperature | $0.683\pm 0.014$ | $-0.061\pm 0.015$ | $0.756$ | $-0.036$ |
| Base | Power | $0.714\pm 0.006$ | $-0.077\pm 0.001$ | $0.742$ | $-0.062$ |
| Power-distilled | Standard | $0.643\pm 0.025$ | $-0.089\pm 0.005$ | $0.763$ | $-0.042$ |
| Power-distilled | Temperature | $\mathbf{0.722}\pm 0.009$ | $\mathbf{-0.043}\pm 0.001$ | $\mathbf{0.768}$ | $\mathbf{-0.034}$ |
Table 2: Qualitative comparison for Qwen2.5-Math-7B on a MATH500 question: “The coordinates of a parallelogram are $(5,3)$, $(6,8)$, $(7,4)$, and $(x,y)$ with $x>7$. What is the value of $x+y$?”

| Model | Sampling | Correctness | Summary |
|---|---|---|---|
| Base | Temperature | No | Uses irrelevant mathematical properties and generates an incorrect formula, resulting in a hallucinated final answer. |
| Base | Power | Yes | Maintains logical consistency and mathematical accuracy, but simulates a Python execution to present a non-executed solution. |
| Distilled | Standard | Yes | Shows robust reasoning and self-correction by re-evaluating the problem when constraints are not met. |

6.2 Results

Power sampling increases self-reward (RQ1). Table 1 shows mean self-reward ($r_{\mathrm{self}}$) and accuracy ($r^{\star}$) over sampled completions (left two columns). Power sampling raises $r_{\mathrm{self}}$ for both the base model and RandW. Decoding with token-wise temperature also raises $r_{\mathrm{self}}$ on the base model.

Sharpening can improve true reward when $r^{\star}$ aligns with $r_{\mathrm{self}}$ (RQ2). Table 1 also shows that higher $r_{\mathrm{self}}$ is typically accompanied by higher true reward $r^{\star}$, except on RandW, where $r_{\mathrm{self}}$ is not aligned with $r^{\star}$. Notably, self-reward Best-of-$N$ further improves $r^{\star}$ for every Base and Power-distilled configuration, whereas RandW remains at zero accuracy.

To make this point clearer, Figure 2 plots the decoding temperature $\tau=1/\alpha$ against $r_{\mathrm{self}}$ and $r^{\star}$; both quantities decrease as $\tau$ increases (i.e., sharpening weakens). Figure 3 uses synthetic rewards (see Section B.1 for details), whose correlation with $r_{\mathrm{self}}$ ranges from positive to negative; the gain in $r$ from power sampling grows roughly linearly with $\mathrm{Cov}(r,r_{\mathrm{self}})$.

Power self-distillation achieves self-improvement (RQ3). Table 1 shows that after power self-distillation, the student with temperature decoding scores higher on both $r^{\star}$ and $r_{\mathrm{self}}$ than the base model under standard sampling, temperature sampling, or power sampling. The strongest result is obtained by combining power self-distillation with Temperature decoding. At inference time, the student uses only autoregressive decoding (with temperature), thereby amortizing the inference cost of power sampling into offline training.

Qualitative example. Table 2 summarizes completions on one MATH500 problem. With token-wise temperature, the model cites irrelevant facts and concludes with a hallucinated formula, plausibly because token-wise tilting in Eq. (5) does not coincide with sequence-level tilting in Eq. (6). Power sampling instead tilts toward $\pi_{\alpha}$ and is graded correct, but the completion includes plausible Python code that is never executed, and the model only mimics a reasoning pattern. After power self-distillation, standard decoding yields the correct answer with more robust step-by-step reasoning. The full completions are shown in Section B.2.3.

Additional dataset–model combinations are reported in Section B.2.1; in each case, the distilled model outperforms the corresponding base model.

7 Conclusion

We showed that the power distribution bridges power sampling, self-reward KL-regularized RL, and self-distillation as the sampling target, closed-form RL optimum, and teacher distribution. From the sampling perspective, inexpensive local approximations are structurally limited: per-token temperature scaling and variance-minimizing one-step proposals both miss sequence-level information. From the RL perspective, the same sequence-level power distribution is the optimizer of KL-regularized RL when the reward is the model’s sequence-level log-probabilities. This identification yields power self-distillation, an offline surrogate that amortizes power sampling into supervised training on teacher samples. Power self-distillation can achieve self-reward sharpening, while true-reward improvement is governed by $\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}})$. Finally, we supported the analysis with experiments.

Limitations. Self-improvement through sharpening and distillation inherits the capabilities of the base model, so gains can be small when the base is weak; improving base-model quality (e.g., pretraining) is outside our scope. Our analysis and experiments focus on autoregressive language models over finite horizons.

References

Algorithm 1 Power self-distillation
1: Input: Base model $\pi$; power exponent $\alpha>0$; teacher sampler $\textsc{TeacherSample}(x;\pi,\alpha)$ approximating $\pi_{\alpha}(\cdot\mid x)$ (e.g., Algorithm 2); prompt source $\mu$; dataset size $n$; student model $q_{\theta}$ (initialized from $\pi$).
2: Output: Trained student $q_{\theta}$.
3: Offline teacher sampling (data collection):
4: for $i=1,\dots,n$ do
5:   Sample $x_{i}\sim\mu$.
6:   Sample $y_{i}\sim\textsc{TeacherSample}(x_{i};\pi,\alpha)$.
7: end for
8: Store $D\leftarrow\{(x_{i},y_{i})\}_{i=1}^{n}$.
9: Student training (supervised fine-tuning):
10: Optimize completion-only NLL on $D$:
$$\theta\leftarrow\arg\min_{\theta}\sum_{(x,y)\in D}-\log q_{\theta}(y\mid x),$$
11: where the loss is computed only on completion tokens by masking prompt tokens.
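The two phases of Algorithm 1 can be illustrated on a toy problem where everything is tabular. The sketch below makes several simplifying assumptions that are not part of the paper's setup: there is a single prompt, the teacher samples exactly from $\pi_{\alpha}$ (feasible only because the completion set has three elements), and the student is an unconstrained categorical distribution, whose NLL minimizer is the empirical frequency.

```python
import random
from collections import Counter

def power_dist(p, alpha):
    """Exact power distribution over a small completion set."""
    w = {y: py ** alpha for y, py in p.items()}
    z = sum(w.values())
    return {y: wy / z for y, wy in w.items()}

def teacher_sample(p, alpha, rng):
    """Stand-in for TeacherSample: one exact draw from pi_alpha."""
    target = power_dist(p, alpha)
    ys = list(target)
    return rng.choices(ys, weights=[target[y] for y in ys], k=1)[0]

def fit_student(dataset, support):
    """NLL minimizer for a tabular student: empirical frequencies."""
    counts = Counter(dataset)
    return {y: counts[y] / len(dataset) for y in support}

rng = random.Random(0)
base = {"a": 0.5, "b": 0.3, "c": 0.2}   # toy base model pi(. | x), assumed values
alpha, n = 4.0, 5000

data = [teacher_sample(base, alpha, rng) for _ in range(n)]  # offline teacher sampling
student = fit_student(data, base)                            # supervised fitting
target = power_dist(base, alpha)
print(student, target)
```

In this toy the student recovers $\pi_{\alpha}$ up to sampling noise, and afterwards a single draw from `student` replaces a call to the teacher sampler.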
Algorithm 2 Power sampling using Metropolis–Hastings [karan2026reasoning] (Power($\infty$): deterministic acceptance).
Notation.

Let the unnormalized power target be $\tilde{\pi}_{\alpha}(y\mid x)\propto\pi(y\mid x)^{\alpha}$. Let $A(y^{\prime},y)$ denote the Metropolis–Hastings acceptance ratio comparing completions $y,y^{\prime}$ (with $x$ fixed), where $p_{\mathrm{prop}}(y^{\prime}\mid y,x)$ denotes the autoregressive proposal density for resampling a suffix under $p_{\mathrm{prop}}(\cdot\mid x,\cdot)$:

$$A(y^{\prime},y)\;:=\;\min\left\{1,\ \frac{\tilde{\pi}_{\alpha}(y^{\prime}\mid x)}{\tilde{\pi}_{\alpha}(y\mid x)}\cdot\frac{p_{\mathrm{prop}}(y\mid y^{\prime},x)}{p_{\mathrm{prop}}(y^{\prime}\mid y,x)}\right\}. \tag{20}$$
1: Input: Base model $\pi$; proposal $p_{\mathrm{prop}}$; prompt $x$; completion length $T$ with $B\mid T$; block size $B$; inner iterations $N_{\mathrm{MCMC}}$; exponent $\alpha>0$.
2: Output: Completion $y_{1:T}$ (MH: approximate sample from the powered conditional $\pi(\cdot\mid x)^{\alpha}$ up to MCMC error; Power($\infty$): accept proposals only if $\pi(y^{\prime}\mid x)>\pi(y\mid x)$, so $\pi(y\mid x)$ is monotone along accepted moves).
3: for $k=0,1,\dots,T/B-1$ do
4:   Given the current state $y_{1:kB}$, construct an initialization $y^{(0)}$ by extending autoregressively with $p_{\mathrm{prop}}$ to length $(k+1)B$:
$$y^{(0)}_{t}\sim p_{\mathrm{prop}}(y_{t}\mid x,y_{<t}),\qquad kB+1\leq t\leq(k+1)B.$$
5:   Set $y\leftarrow y^{(0)}$.
6:   for $n=1,\dots,N_{\mathrm{MCMC}}$ do
7:     Sample $m$ uniformly from $\{1,\dots,(k+1)B\}$.
8:     Construct a proposal completion $y^{\prime}$ with prefix $y_{1:m-1}$ and resample the suffix:
$$y^{\prime}_{t}\sim p_{\mathrm{prop}}(y_{t}\mid x,y^{\prime}_{<t}),\qquad m\leq t\leq(k+1)B.$$
9:     (MH) Compute $A(y^{\prime},y)$ from Eq. (20). Draw $u\sim\mathrm{Uniform}(0,1)$. If $u\leq A(y^{\prime},y)$, set $y\leftarrow y^{\prime}$.
10:     (Power($\infty$)) If $\pi(y^{\prime}\mid x)>\pi(y\mid x)$, set $y\leftarrow y^{\prime}$.
11:   end for
12:   Set $y_{1:(k+1)B}\leftarrow y$ as the current state carried into the next block iteration.
13: end for
14: return $y_{1:T}$
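The MH branch of Algorithm 2 can be exercised on a toy autoregressive model small enough to enumerate. The sketch below is not the authors' implementation: it uses a single block ($B=T$), a made-up two-token model, and the base model itself (temperature 1) as the proposal, and it compares the visit frequencies of one long chain with the exact power distribution.

```python
import itertools
import math
import random

def next_token_probs(prefix):
    """Hypothetical toy model over tokens {0, 1}; depends on the last token only."""
    p1 = 0.8 if not prefix else (0.7 if prefix[-1] == 1 else 0.4)
    return (1.0 - p1, p1)

def log_prob(y, start=0):
    """log pi(y[start:] | y[:start]) under the toy model."""
    return sum(math.log(next_token_probs(y[:t])[y[t]]) for t in range(start, len(y)))

def resample_suffix(y, m, rng):
    """Keep y[:m] and redraw positions m, m+1, ... from the proposal (base model)."""
    out = list(y[:m])
    while len(out) < len(y):
        out.append(1 if rng.random() < next_token_probs(tuple(out))[1] else 0)
    return tuple(out)

def log_accept(y_new, y_old, m, alpha):
    """log of the acceptance ratio in Eq. (20) for a suffix resampled from index m."""
    target = alpha * (log_prob(y_new) - log_prob(y_old))
    proposal = log_prob(y_old, m) - log_prob(y_new, m)  # reverse over forward
    return min(0.0, target + proposal)

def mh_power_chain(T, alpha, steps, rng):
    y = resample_suffix((0,) * T, 0, rng)   # initialization drawn from the proposal
    visited = []
    for _ in range(steps):
        m = rng.randrange(T)                # 0-based resampling index
        y_new = resample_suffix(y, m, rng)
        if rng.random() <= math.exp(log_accept(y_new, y, m, alpha)):
            y = y_new
        visited.append(y)
    return visited

def exact_power(T, alpha):
    seqs = list(itertools.product((0, 1), repeat=T))
    w = [math.exp(alpha * log_prob(s)) for s in seqs]
    z = sum(w)
    return {s: wi / z for s, wi in zip(seqs, w)}

rng = random.Random(0)
T, alpha = 3, 4.0
chain = mh_power_chain(T, alpha, steps=20000, rng=rng)
target = exact_power(T, alpha)
freq = {s: chain.count(s) / len(chain) for s in target}
tv = 0.5 * sum(abs(freq[s] - target[s]) for s in target)
print(f"total variation distance to exact power distribution: {tv:.4f}")
```

Because the proposal here equals the base model, the acceptance ratio reduces to the suffix likelihood ratio raised to the power $\alpha-1$; with a tempered proposal, as in the experimental setup, the proposal term in `log_accept` would have to use the tempered probabilities instead.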
Table S.1: Comparison of prior work by its connection to power distributions, sampling, RL, distillation, and whether it avoids external rewards.
Paper Power distribution Sampling RL Distillation No external reward
norouzi2016reward
rusu2016policy
teh2017distral
laskin2023incontext
huang2025selfimprovement
gui2024bonbon
amini2025variational
balashankar2025infalign
sessa2025bond
yang2025fasterwind
karan2026reasoning
azizi2026power
ji2026scalable
Ours

Appendix A Proofs and background

A.1 Proof of Proposition 1

Proof of Proposition 1.

Fix $x$ and a prefix $y_{<t}$ with $\pi(y_{<t}\mid x)>0$, and write $p(s):=\pi(y_{t}=s\mid x,y_{<t})$ for $s\in\mathcal{V}$. For any suffix $y_{t+1:T}$ and token $s$, autoregressive factorization gives

$$\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)=\pi(y_{<t}\mid x)\,p(s)\,q_{t,s}(y_{t+1:T}).$$

Using Eq. (6), the numerator for token $s$ is therefore

$$\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}\pi(y_{<t},\,s,\,y_{t+1:T}\mid x)^{\alpha}=\pi(y_{<t}\mid x)^{\alpha}\,p(s)^{\alpha}\sum_{y_{t+1:T}}q_{t,s}(y_{t+1:T})^{\alpha}.$$

Summing over $s\in\mathcal{V}$ yields the corresponding denominator in Eq. (6), so the prefix factor $\pi(y_{<t}\mid x)^{\alpha}$ cancels and

$$\pi_{\mathrm{pow},\alpha}(y_{t}=s\mid x,y_{<t})=\frac{p(s)^{\alpha}\sum_{z}q_{t,s}(z)^{\alpha}}{\sum_{s^{\prime}\in\mathcal{V}}p(s^{\prime})^{\alpha}\sum_{z}q_{t,s^{\prime}}(z)^{\alpha}}. \tag{21}$$

For temperature scaling, Eq. (5) gives $\pi_{\mathrm{temp},\alpha}(y_{t}=s\mid x,y_{<t})=p(s)^{\alpha}/\sum_{s^{\prime}}p(s^{\prime})^{\alpha}$. Hence for any $a,b\in\mathcal{V}$ with $p(b)>0$,

$$\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}=\frac{p(a)^{\alpha}\sum_{z}q_{t,a}(z)^{\alpha}}{p(b)^{\alpha}\sum_{z}q_{t,b}(z)^{\alpha}},\qquad\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\left(\frac{p(a)}{p(b)}\right)^{\alpha},$$

and therefore

$$\frac{\pi_{\mathrm{pow},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{pow},\alpha}(y_{t}=b\mid x,y_{<t})}\bigg/\frac{\pi_{\mathrm{temp},\alpha}(y_{t}=a\mid x,y_{<t})}{\pi_{\mathrm{temp},\alpha}(y_{t}=b\mid x,y_{<t})}=\frac{\sum_{z}q_{t,a}(z)^{\alpha}}{\sum_{z}q_{t,b}(z)^{\alpha}}.$$

By the definition of $H_{\alpha}$, $\sum_{z}q(z)^{\alpha}=\exp\!\bigl((1-\alpha)H_{\alpha}(q)\bigr)$, which yields Eq. (7). ∎
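The last two displays can be verified numerically on a two-step toy model. In this sketch the next-token probabilities `p` and the suffix distributions `suffix` are made-up values, and `renyi_entropy` uses the standard order-$\alpha$ Rényi entropy with natural logarithm, which is assumed to match $H_{\alpha}$ in the main text.

```python
import math

def renyi_entropy(q, alpha):
    """Renyi entropy of order alpha (alpha != 1), natural log."""
    return math.log(sum(qi ** alpha for qi in q)) / (1.0 - alpha)

alpha = 4.0
p = {"a": 0.5, "b": 0.5}                      # next-token probabilities p(s)
suffix = {"a": [0.9, 0.1], "b": [0.5, 0.5]}   # suffix distributions q_{t,s}

# Sequence-level power: marginalize pi(y)^alpha over all suffixes.
pow_w = {s: sum((p[s] * qz) ** alpha for qz in suffix[s]) for s in p}
# Token-wise temperature: only the local probability is powered.
temp_w = {s: p[s] ** alpha for s in p}

odds_pow = pow_w["a"] / pow_w["b"]
odds_temp = temp_w["a"] / temp_w["b"]
correction = math.exp(
    (1 - alpha) * (renyi_entropy(suffix["a"], alpha) - renyi_entropy(suffix["b"], alpha))
)
print(odds_pow / odds_temp, correction)
```

Both tokens are equally likely locally, so temperature scaling is indifferent between them, while power sampling favors token `a` because its suffix distribution is more concentrated.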

A.2 Background and proof of Proposition 2

This appendix is aligned with the sequential Monte Carlo presentation of zhao2024probabilistic, who derive a general twist-induced proposal (their Prop. 3.3) that minimizes the variance of the one-step incremental importance weight for a given tower of intermediate targets. We provide a proof of the same variance-minimization fact specialized to $\pi_{\alpha}$ using the Cauchy–Schwarz inequality (cf. zhao2024probabilistic, App. A.2).

A.2.1 From a sequence-level target to a sequential sampler

Let $\mathcal{V}$ be a finite vocabulary and fix a prompt $x$ and completion length $T\geq 1$. Let $P(y_{1:T}):=\pi_{\alpha}(y_{1:T}\mid x)$ denote the power distribution on $\mathcal{V}^{T}$ from Eq. (4), i.e.,

$$P(y_{1:T})\propto\pi(y_{1:T}\mid x)^{\alpha},\qquad\pi(y_{1:T}\mid x)=\prod_{t=1}^{T}\pi(y_{t}\mid x,y_{<t}).$$

Exact sampling from $P$ may be intractable because normalizing constants involve sums over exponentially many sequences. Many practical samplers therefore build $y_{1:T}$ sequentially: having generated a prefix $y_{<t}$, they draw a next token $y_{t}$ from a proposal $q(\cdot\mid x,y_{<t})$ and update importance weights so that, after $T$ steps, full-length draws can be reweighted to be (exactly or approximately) correct for $P$.

A.2.2 Incremental importance weights

For $t=1,\dots,T$, let $P_{t}$ denote the marginal of $P$ on the length-$t$ prefix:

$$P_{t}(y_{1:t}):=\sum_{y_{t+1:T}\in\mathcal{V}^{T-t}}P(y_{1:T}).$$

One step of sequential importance sampling extends $y_{<t}$ by sampling $Y_{t}\sim q(\cdot\mid x,y_{<t})$. The incremental multiplicative factor appended to the running weight is [chopin2020introduction]

$$W_{t}:=\frac{P_{t}(y_{<t},Y_{t})}{P_{t-1}(y_{<t})\,q(Y_{t}\mid x,y_{<t})}, \tag{22}$$

defined on the event $\{P_{t-1}(y_{<t})>0\}$, where $(y_{<t},Y_{t})$ denotes the length-$t$ prefix ending in $Y_{t}$. For the power distribution, $P_{t}(y_{1:t})=\pi_{\alpha}(y_{1:t}\mid x)$, so Eq. (22) agrees with $W_{t}$ in Eq. (9).

If one initializes weights at $w_{0}:=1$ and updates $w_{t}:=w_{t-1}W_{t}$, then for any completed trajectory $y_{1:T}$ with $\prod_{t=1}^{T}P_{t-1}(y_{<t})>0$,

$$w_{T}=\frac{P(y_{1:T})}{q(y_{1:T}\mid x)},\qquad q(y_{1:T}\mid x):=\prod_{t=1}^{T}q(y_{t}\mid x,y_{<t}), \tag{23}$$

which is the usual full-sequence importance weight of $y_{1:T}$ for the target $P$ against the autoregressive proposal $q$. Thus each $W_{t}$ is the local factor that must be “well behaved” if the final weights are not to explode or collapse.
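The telescoping in Eq. (23) can be confirmed by direct enumeration. The following sketch uses an arbitrary target `P` over binary sequences of length 2 and an arbitrary autoregressive proposal `q_step`; both are illustrative assumptions, and the empty-prefix marginal is taken to be 1.

```python
# Hypothetical target over binary sequences of length 2.
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def q_step(prefix):
    """Hypothetical proposal: probabilities of tokens 0 and 1 given the prefix."""
    if not prefix:
        return (0.5, 0.5)
    return (0.7, 0.3) if prefix[-1] == 0 else (0.2, 0.8)

def prefix_marginal(prefix):
    """P_t(y_{1:t}): total target mass of sequences starting with `prefix`."""
    return sum(pr for y, pr in P.items() if y[:len(prefix)] == prefix)

def running_weight(y):
    """Product of the incremental weights W_t of Eq. (22) along y."""
    w = 1.0
    for t in range(len(y)):
        w_t = prefix_marginal(y[:t + 1]) / (prefix_marginal(y[:t]) * q_step(y[:t])[y[t]])
        w *= w_t
    return w

def proposal_prob(y):
    """Full-sequence proposal probability q(y_{1:T} | x)."""
    prob = 1.0
    for t in range(len(y)):
        prob *= q_step(y[:t])[y[t]]
    return prob

for y in P:
    print(y, running_weight(y), P[y] / proposal_prob(y))
```

Each row prints the same value twice: the intermediate marginals cancel in the product, leaving only the full-sequence ratio.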

A.2.3 Why minimize $\mathrm{Var}[W_{t}]$ at one step?

Condition on a fixed feasible prefix $y_{<t}$ with $P_{t-1}(y_{<t})>0$. Write $f_{t}(v):=P_{t}(y_{<t},v)/P_{t-1}(y_{<t})$ for $v\in\mathcal{V}$, i.e., the true conditional $P(y_{t}=v\mid y_{<t})$ under $P$. Then $W_{t}=f_{t}(Y_{t})/q(Y_{t})$ with $Y_{t}\sim q$.

Whenever $q(v)>0$ for all $v$ with $f_{t}(v)>0$, the mean is always $\mathbb{E}[W_{t}\mid y_{<t}]=\sum_{v\in\mathcal{V}}q(v)\,f_{t}(v)/q(v)=1$. However, $\mathrm{Var}[W_{t}\mid y_{<t}]$ depends strongly on $q$: if $q$ places too little mass where $f_{t}$ is large, occasional huge weights arise, which is the usual “weight degeneracy” pathology in importance sampling. Minimizing $\mathrm{Var}[W_{t}\mid y_{<t}]$ therefore makes the single-step contribution to weight instability as small as possible (among independent proposals), holding the prefix fixed. This is the same local objective highlighted by zhao2024probabilistic for twist-induced proposals.

A.2.4 Proof of Proposition 2

Proof of Proposition 2.

Fix $y_{<t}$ with $\pi_{\alpha}(y_{<t}\mid x)>0$ and write $f(v):=\pi_{\alpha}(y_{<t},v\mid x)/\pi_{\alpha}(y_{<t}\mid x)$ for $v\in\mathcal{V}$. Then $\sum_{v\in\mathcal{V}}f(v)=1$ and $W_{t}=f(Y_{t})/q(Y_{t})$ under $Y_{t}\sim q$, assuming $q(v)>0$ whenever $f(v)>0$.

Since $\mathbb{E}[W_{t}]=\sum_{v}f(v)=1$,

$$\mathrm{Var}[W_{t}]=\mathbb{E}[W_{t}^{2}]-1=\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}-1.$$

By Cauchy–Schwarz,

$$\Big(\sum_{v\in\mathcal{V}}f(v)\Big)^{2}=\Big(\sum_{v\in\mathcal{V}}\sqrt{q(v)}\cdot\frac{f(v)}{\sqrt{q(v)}}\Big)^{2}\leq\Big(\sum_{v\in\mathcal{V}}q(v)\Big)\Big(\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}\Big)=\sum_{v\in\mathcal{V}}\frac{f(v)^{2}}{q(v)}.$$

The left-hand side equals $1$, so $\mathrm{Var}[W_{t}]\geq 0$ with equality if and only if the Cauchy–Schwarz inequality is tight, i.e., $\sqrt{q(v)}\propto f(v)/\sqrt{q(v)}$, equivalently $q(v)\propto f(v)$. Because $\sum_{v}f(v)=1$, the unique minimizer on $\{v:f(v)>0\}$ is $q(v)=f(v)$, which is $\pi_{\alpha}(v\mid x,y_{<t})$.

Finally, with $\tilde{\pi}_{\alpha,t}$ as in Eq. (8) and $Z_{\alpha}$ as in the main text,

$$f(v)=\frac{\pi_{\alpha}(y_{<t},v\mid x)}{\pi_{\alpha}(y_{<t}\mid x)}=\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})/Z_{\alpha}}{\tilde{\pi}_{\alpha,t-1}(y_{<t})/Z_{\alpha}}=\frac{\tilde{\pi}_{\alpha,t}(y_{1:t})}{\tilde{\pi}_{\alpha,t-1}(y_{<t})},$$

where $y_{1:t}=(y_{<t},v)$, which is Eq. (10). ∎
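The variance formula and its minimizer are easy to probe numerically. The sketch below evaluates $\sum_{v}f(v)^{2}/q(v)-1$ for a hypothetical three-token conditional `f` and three candidate proposals; all numbers are illustrative.

```python
def weight_variance(f, q):
    """Var[W_t] = sum_v f(v)^2 / q(v) - 1 for W_t = f(Y)/q(Y) with Y ~ q."""
    return sum(fv * fv / qv for fv, qv in zip(f, q) if fv > 0) - 1.0

f = [0.7, 0.2, 0.1]   # hypothetical target conditional pi_alpha(v | x, y_<t)
candidates = {
    "optimal (q = f)": f,
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "mismatched": [0.1, 0.2, 0.7],
}
variances = {name: weight_variance(f, q) for name, q in candidates.items()}
for name, var in variances.items():
    print(f"{name}: {var:.4f}")
```

The variance vanishes only for `q = f` and grows quickly when the proposal under-weights the token that the target favors, which is the weight-degeneracy effect described in Section A.2.3.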

A.3 Proof of Proposition 3

We first provide the following lemma, which is used to bound the Hellinger distance between the MLE and the true conditional distribution for finite-class models.

Lemma 1 (Finite-class MLE Hellinger bound [wong1995probability, geer2000empirical, zhang2006f]).

Assume $|\Pi|<\infty$ and $\pi^{\star}\in\Pi$. Let $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ be i.i.d. with $x_{i}\sim\mu$ and $y_{i}\sim\pi^{\star}(\cdot\mid x_{i})$, and let $\widehat{\pi}\in\arg\max_{\pi\in\Pi}\sum_{i=1}^{n}\log\pi(y_{i}\mid x_{i})$ be an MLE. Then for any $\rho\in(0,1)$, with probability at least $1-\rho$,

$$\mathbb{E}_{x\sim\mu}\!\left[D_{H}^{2}\!\big(\widehat{\pi}(\cdot\mid x),\pi^{\star}(\cdot\mid x)\big)\right]\leq\frac{2\log(|\Pi|\rho^{-1})}{n}.$$

Using this lemma, we can prove Proposition 3 as follows.

Proof of Proposition 3.

Define the failure event $F(x):=\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\}$. By a simple inclusion,

$$F(x)\subseteq\Big\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\Big\}\ \cup\ \Big\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta,\ \pi_{\alpha}(\bm{y}^{\star}(x)\mid x)>1-\tfrac{\delta}{2}\Big\}.$$

Taking $\mathbb{P}_{x\sim\mu}[\cdot]$ yields

$$\mathbb{P}_{x\sim\mu}[F(x)]\leq\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]\;+\;\mathbb{P}_{x\sim\mu}[E(x)], \tag{24}$$

where $E(x):=\{\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta,\ \pi_{\alpha}(\bm{y}^{\star}(x)\mid x)>1-\tfrac{\delta}{2}\}$.

Let $B(x):=\mathcal{Y}\setminus\bm{y}^{\star}(x)$. For each $x$, write $\widehat{\pi}_{x}:=\widehat{\pi}(\cdot\mid x)$ and $p_{x}:=\pi_{\alpha}(\cdot\mid x)$. For two distributions $p,q\in\Delta(\mathcal{Y})$, define the squared Hellinger distance

$$D_{H}^{2}(p,q):=\sum_{y\in\mathcal{Y}}\big(\sqrt{p(y)}-\sqrt{q(y)}\big)^{2}.$$

By the reverse triangle inequality applied to the vectors $(\sqrt{\widehat{\pi}_{x}(y)})_{y\in B(x)}$ and $(\sqrt{p_{x}(y)})_{y\in B(x)}$,

$$D_{H}^{2}(\widehat{\pi}_{x},p_{x})\geq\sum_{y\in B(x)}\big(\sqrt{\widehat{\pi}_{x}(y)}-\sqrt{p_{x}(y)}\big)^{2}\geq\big(\sqrt{\widehat{\pi}(B(x)\mid x)}-\sqrt{\pi_{\alpha}(B(x)\mid x)}\big)^{2}. \tag{25}$$

On the event $E(x)$, we have $\widehat{\pi}(B(x)\mid x)=1-\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\geq\delta$ and $\pi_{\alpha}(B(x)\mid x)=1-\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)<\delta/2$, so Eq. (25) implies

$$D_{H}^{2}(\widehat{\pi}_{x},p_{x})\geq\big(\sqrt{\delta}-\sqrt{\delta/2}\big)^{2}=\Big(1-\tfrac{1}{\sqrt{2}}\Big)^{2}\delta=:c_{0}\,\delta.$$

Therefore $\mathbf{1}\{E(x)\}\leq D_{H}^{2}(\widehat{\pi}_{x},p_{x})/(c_{0}\delta)$, and hence

$$\mathbb{P}_{x\sim\mu}[E(x)]\leq\frac{1}{c_{0}\delta}\,\mathbb{E}_{x\sim\mu}\big[D_{H}^{2}(\widehat{\pi}_{x},p_{x})\big]. \tag{26}$$

Finally, by the finite-class MLE Hellinger bound (Lemma 1) and $|\Pi_{\alpha}|\leq M$, with probability at least $1-\rho$,

$$\mathbb{E}_{x\sim\mu}\big[D_{H}^{2}(\widehat{\pi}_{x},p_{x})\big]\leq\frac{2\log(M\rho^{-1})}{n}.$$

Combining Eqs. (24) and (26) with the bound above yields

$$\mathbb{P}_{x\sim\mu}\big[\widehat{\pi}(\bm{y}^{\star}(x)\mid x)\leq 1-\delta\big]\leq\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]\;+\;\frac{2}{c_{0}}\cdot\frac{\log(M\rho^{-1})}{\delta\,n}.$$

Absorbing constants proves Eq. (17).

Convergence of the upper bound.

The MLE term satisfies $\frac{\log(M\rho^{-1})}{\delta n}\to 0$ as $n\to\infty$.

For the limit of $\alpha$, fix $x\in\mathcal{X}$ and write $m(x):=\max_{y\in\mathcal{Y}}\pi(y\mid x)$. By definition of $\bm{y}^{\star}(x)$, we have $\pi(y\mid x)=m(x)$ for all $y\in\bm{y}^{\star}(x)$ and $\pi(y\mid x)<m(x)$ for all $y\in B(x)=\mathcal{Y}\setminus\bm{y}^{\star}(x)$. The normalizing constant of the power distribution satisfies

$$\begin{aligned}
Z_{\alpha}(x)&:=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime}\mid x)^{\alpha} &(27)\\
&=\sum_{y^{\prime}\in\bm{y}^{\star}(x)}m(x)^{\alpha}\;+\;\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha} &(28)\\
&=\bigl|\bm{y}^{\star}(x)\bigr|\,m(x)^{\alpha}\;+\;\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha}. &(29)
\end{aligned}$$

For each $y^{\prime}\in B(x)$, the ratio $\pi(y^{\prime}\mid x)/m(x)$ lies in $[0,1)$, hence $\bigl(\pi(y^{\prime}\mid x)/m(x)\bigr)^{\alpha}\to 0$ as $\alpha\to\infty$. Because $B(x)$ is finite, $\sum_{y^{\prime}\in B(x)}\pi(y^{\prime}\mid x)^{\alpha}=o\bigl(m(x)^{\alpha}\bigr)$, and therefore

$$\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)=\frac{\bigl|\bm{y}^{\star}(x)\bigr|\,m(x)^{\alpha}}{Z_{\alpha}(x)}\xrightarrow[\alpha\to\infty]{}1. \tag{30}$$

The indicators $\mathbf{1}\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\}$ converge to $0$ for $\mu$-almost every $x$ as $\alpha\to\infty$ by Eq. (30). Since indicators are bounded by $1$, dominated convergence yields

$$\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]=\mathbb{E}_{x\sim\mu}\big[\mathbf{1}\{\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\}\big]\xrightarrow[\alpha\to\infty]{}0.$$

Thus, the term $\mathbb{P}_{x\sim\mu}\big[\pi_{\alpha}(\bm{y}^{\star}(x)\mid x)\leq 1-\tfrac{\delta}{2}\big]$ in Eq. (17) converges to $0$ as $\alpha\to\infty$. Together with the $n\to\infty$ limit of the MLE term, the full upper bound converges to $0$. ∎
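The concentration of $\pi_{\alpha}$ on the argmax set can be seen on a small example. This sketch uses a made-up base distribution with two tied modes and evaluates the mass on the modes for increasing $\alpha$.

```python
def mass_on_modes(p, alpha):
    """pi_alpha(y*(x) | x): power-distribution mass on the argmax set of p."""
    m = max(p)
    # Divide by m before powering to avoid underflow for large alpha.
    scaled = [(py / m) ** alpha for py in p]
    n_modes = sum(1 for py in p if py == m)
    return n_modes / sum(scaled)

p = [0.4, 0.4, 0.15, 0.05]   # toy base distribution with two tied modes
alphas = (1, 2, 4, 8, 16)
masses = [mass_on_modes(p, a) for a in alphas]
for a, mass in zip(alphas, masses):
    print(a, mass)
```

At $\alpha=1$ the mass equals the base probability of the modes, and it increases monotonically toward 1, with the non-modal terms decaying geometrically in $\alpha$.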

A.4 Proof of Proposition 4

Proof of Proposition 4.

Recall

$$\pi_{\alpha}(y)=\frac{\pi(y)^{\alpha}}{Z_{\alpha}}=\frac{e^{\alpha\log\pi(y)}}{Z_{\alpha}},\qquad Z_{\alpha}:=\sum_{y^{\prime}\in\mathcal{Y}}\pi(y^{\prime})^{\alpha}=\sum_{y^{\prime}}e^{\alpha\log\pi(y^{\prime})}.$$

Differentiating $\pi_{\alpha}(y)$ with respect to $\alpha$ yields

$$\begin{aligned}
\frac{\partial}{\partial\alpha}\pi_{\alpha}(y)&=\frac{\partial}{\partial\alpha}\left(\frac{e^{\alpha\log\pi(y)}}{Z_{\alpha}}\right)\\
&=\frac{\log\pi(y)\,e^{\alpha\log\pi(y)}Z_{\alpha}-e^{\alpha\log\pi(y)}\frac{\partial}{\partial\alpha}Z_{\alpha}}{Z_{\alpha}^{2}}\\
&=\pi_{\alpha}(y)\left(\log\pi(y)-\frac{\partial}{\partial\alpha}\log Z_{\alpha}\right).
\end{aligned}$$

Using

$$\frac{\partial}{\partial\alpha}\log Z_{\alpha}=\frac{1}{Z_{\alpha}}\sum_{y^{\prime}}\log\pi(y^{\prime})\,e^{\alpha\log\pi(y^{\prime})}=\sum_{y^{\prime}}\pi_{\alpha}(y^{\prime})\log\pi(y^{\prime})=\mathbb{E}_{\pi_{\alpha}}[\log\pi], \tag{31}$$

we obtain

$$\begin{aligned}
\frac{\partial}{\partial\alpha}\mathbb{E}_{\pi_{\alpha}}[r^{\star}]&=\sum_{y}r^{\star}(y)\frac{\partial}{\partial\alpha}\pi_{\alpha}(y) &(32)\\
&=\sum_{y}r^{\star}(y)\,\pi_{\alpha}(y)\left(\log\pi(y)-\mathbb{E}_{\pi_{\alpha}}[\log\pi]\right) &(33)\\
&=\mathbb{E}_{\pi_{\alpha}}\!\big[r^{\star}(\log\pi-\mathbb{E}_{\pi_{\alpha}}[\log\pi])\big] &(34)\\
&=\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},\log\pi) &(35)\\
&=\mathrm{Cov}_{\pi_{\alpha}}(r^{\star},r_{\mathrm{self}}). &(36)
\end{aligned}$$
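The monotonicity claim under the affine alignment condition in Eq. (19) needs one more step beyond Eq. (36). A short derivation is given below; it uses only bilinearity of covariance and the fact that $b(x)$ does not depend on $y$, and it suppresses the conditioning on $x$ in the subscripts.

```latex
\begin{align*}
\frac{\partial}{\partial\alpha}R(\alpha;x)
  &= \mathrm{Cov}_{\pi_{\alpha}}\!\big(c(x)\,r_{\mathrm{self}} + b(x),\; r_{\mathrm{self}}\big) \\
  &= c(x)\,\mathrm{Cov}_{\pi_{\alpha}}\!\big(r_{\mathrm{self}},\, r_{\mathrm{self}}\big)
     + \mathrm{Cov}_{\pi_{\alpha}}\!\big(b(x),\, r_{\mathrm{self}}\big) \\
  &= c(x)\,\mathrm{Var}_{\pi_{\alpha}}\!\big(r_{\mathrm{self}}\big) \;\geq\; 0 .
\end{align*}
```

The second covariance vanishes because $b(x)$ is constant under $y\sim\pi_{\alpha}(\cdot\mid x)$, and the sign follows from $c(x)>0$.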

A.5 Closed-form optimizer for KL-regularized RL: restatement and proof

We restate the standard closed-form solution of KL-regularized RL used in Section 5.1.

Proposition 5 (Closed-form optimizer for KL-regularized RL [levine2018reinforcement]).

For each $x\in\mathcal{X}$, let $\pi_{\beta}^{\star}$ be the reward-tilted distribution defined in Eq. (12), and assume $Z_{r}(x)<\infty$ for every $x\in\mathcal{X}$. Then $\pi_{\beta}^{\star}$ maximizes $J_{\beta}(q;\pi,r)$ in Eq. (11) over all $q:\mathcal{X}\to\Delta(\mathcal{Y})$.

Proof of Proposition 5.

Fix $x\in\mathcal{X}$ and write $f(y):=\pi(y\mid x)\exp\!\big(\beta^{-1}r(x,y)\big)$ and $Z:=Z_{r}(x)=\sum_{y^{\prime}\in\mathcal{Y}}f(y^{\prime})$. For any $q(\cdot\mid x)\in\Delta(\mathcal{Y})$, rewriting the objective in terms of the KL divergence against $\pi_{\beta}^{\star}(\cdot\mid x)$ gives

$$\begin{aligned}
\mathbb{E}_{y\sim q(\cdot\mid x)}[r(x,y)]-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi(\cdot\mid x)\big)&=-\beta\sum_{y\in\mathcal{Y}}q(y\mid x)\log\frac{q(y\mid x)}{f(y)}\\
&=-\beta\sum_{y\in\mathcal{Y}}q(y\mid x)\log\frac{q(y\mid x)}{f(y)/Z}+\beta\log Z\\
&=-\beta\,D_{\mathrm{KL}}\!\big(q(\cdot\mid x)\,\|\,\pi_{\beta}^{\star}(\cdot\mid x)\big)+\beta\log Z,
\end{aligned}$$

where we used $\pi_{\beta}^{\star}(y\mid x)=f(y)/Z$ from Eq. (12). Since $D_{\mathrm{KL}}(\cdot\,\|\,\pi_{\beta}^{\star}(\cdot\mid x))\geq 0$ with equality if and only if $q(\cdot\mid x)=\pi_{\beta}^{\star}(\cdot\mid x)$, the inner objective is uniquely maximized at $q(\cdot\mid x)=\pi_{\beta}^{\star}(\cdot\mid x)$. Because $J_{\beta}(q;\pi,r)$ is an expectation over $x\sim\mu$ of these decoupled per-$x$ objectives, the unique global maximizer is $q=\pi_{\beta}^{\star}$. ∎
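Proposition 5 can be sanity-checked on a finite completion set. The sketch below takes the self-reward $r=\log\pi$ as the reward, builds the tilted distribution from the form $f(y)=\pi(y\mid x)\exp(\beta^{-1}r(x,y))$ used in the proof, and compares its objective value with random alternatives. The relation $\alpha=1+1/\beta$ in the code is derived here by substituting $r=\log\pi$ into that form; it is an assumption of this sketch and should be checked against the parametrization in the main text.

```python
import math
import random

def objective(q, pi, r, beta):
    """Per-prompt objective E_q[r] - beta * KL(q || pi) on a finite set."""
    return sum(qy * (ry - beta * math.log(qy / py))
               for qy, py, ry in zip(q, pi, r) if qy > 0)

def tilted(pi, r, beta):
    """Reward-tilted distribution f / Z with f(y) = pi(y) * exp(r(y) / beta)."""
    w = [py * math.exp(ry / beta) for py, ry in zip(pi, r)]
    z = sum(w)
    return [wy / z for wy in w], z

pi = [0.5, 0.3, 0.2]                       # toy base distribution (assumed values)
beta = 0.5
r_self = [math.log(py) for py in pi]       # self-reward: log-probability
q_star, Z = tilted(pi, r_self, beta)

# With r = log pi, the tilted form is pi^(1 + 1/beta), i.e. a power distribution.
alpha = 1.0 + 1.0 / beta
norm = sum(py ** alpha for py in pi)
power = [py ** alpha / norm for py in pi]

rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in pi]
    total = sum(raw)
    q = [x / total for x in raw]
    best_random = max(best_random, objective(q, pi, r_self, beta))

print(objective(q_star, pi, r_self, beta), beta * math.log(Z), best_random)
```

The first two printed values coincide (the optimum equals $\beta\log Z$), and no random candidate exceeds them.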

Appendix B Experimental details

B.1 Setup

Models and datasets.

We used Qwen2.5-Math-7B [yang2024qwen2], Qwen2.5-7B [yang2024qwen], and Phi-3.5-mini-instruct [abdin2024phi] models on the following datasets.

  • Mathematics. We used the MATH dataset [lightman2024lets], which consists of 12,500 competition-style math problems spanning seven categories (e.g., geometry, number theory, and precalculus), with 7,500 training and 5,000 test problems. For evaluation, we used MATH500, a randomly selected subset of the MATH test set standardized by OpenAI (https://huggingface.co/datasets/HuggingFaceH4/MATH-500). For distillation, we sampled 500 examples from MATH with MATH500 removed (https://raw.githubusercontent.com/rasbt/math_full_minus_math500/main/math_full_minus_math500.json).

  • Programming. For evaluation, we used HumanEval [chen2021evaluating], a set of 164 handwritten programming problems covering algorithms, reasoning, mathematics, and language understanding; each problem includes unit tests, and a solution was counted as correct if it passed all tests. For distillation, we used MBPP [austin2021program], a benchmark of crowd-sourced Python programming problems designed to be solvable by entry-level programmers. We used 420 questions from the sanitized subset, excluding the prompt split.

  • Multiple-choice science. We used GPQA [rein2024gpqa], a multiple-choice science benchmark (physics, chemistry, and biology) requiring advanced reasoning. For evaluation, we used GPQA-Diamond, a high-quality subset of 198 questions. For distillation, we used the remaining 250 GPQA questions after removing any overlap with GPQA-Diamond.

Power sampling.

We used the power sampling algorithm of karan2026reasoning, largely following their hyperparameters. Specifically, we used $\alpha=4.0$, maximum sampling token length 3072, block size 192, $N_{\mathrm{MCMC}}=10$, and the proposal LLM $p_{\mathrm{prop}}$ set to the base model with sampling temperature $\tau=1/\alpha=0.25$. The token-wise Temperature baseline uses the same $\tau$, applying the corresponding local power transform independently at each decoding step. For the randomly initialized model (RandW; Section 6), we instead used maximum token length 1024 and $N_{\mathrm{MCMC}}=2$, because under the default settings (maximum token length 3072 and $N_{\mathrm{MCMC}}=10$) EOS tokens rarely appeared for RandW and wall-clock sampling time became significantly longer.

Self-reward computation.

To report $r_{\mathrm{self}}$, we computed, under the evaluated model, the average log-likelihood over completion tokens, excluding prompt tokens. Our theoretical analysis assumes completions of a fixed length $T$, but in our experiments completion lengths vary across prompts and sampling methods, so we normalize by the number of completion tokens to remove length bias in $r_{\mathrm{self}}$.

Synthetic random rewards.

For the synthetic-reward probe in Figure 3, each completion $y$ is mapped to a scalar in $[0,1)$ by applying SHA-256 to the UTF-8 encoding of $y$ and interpreting the leading 64 bits of the digest as an unsigned fraction. Let $z_{\mathrm{self}}(y)$ and $z_{r}(y)$ denote the z-scores of the self-reward $r_{\mathrm{self}}(x,y)$ and of the hash reward above, each computed with the corresponding pooled global sample mean and sample standard deviation. We then define

$$r_{\lambda}(y)\;:=\;\lambda\,z_{\mathrm{self}}(y)+\sqrt{1-\lambda^{2}}\,\bigl(z_{r}(y)+\varepsilon(y)\bigr),\qquad\lambda\in[-1,1],$$

where the $\varepsilon(y)$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$ with $\sigma=0.5$. Figure 3 sweeps $\lambda$ and plots the mean increase in $r_{\lambda}$ under power versus standard sampling against the empirical covariance between $r_{\mathrm{self}}$ and $r_{\lambda}$, using completions produced under standard sampling. The construction is designed to sweep $\mathrm{Cov}(r_{\lambda},r_{\mathrm{self}})$ in a controlled way; we plot empirical gain against this controlled covariance to visualize the qualitative rate prediction of Proposition 4.
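A minimal sketch of this construction follows. Several details are assumptions not stated above: the 64 leading bits are read big-endian, the sample standard deviation uses `ddof=1`, and the example completions and self-rewards are hypothetical.

```python
import hashlib
import numpy as np

def hash_reward(y: str) -> float:
    """Map a completion to a pseudo-random scalar via SHA-256 of its UTF-8 bytes."""
    digest = hashlib.sha256(y.encode("utf-8")).digest()
    leading = int.from_bytes(digest[:8], "big")   # big-endian is an assumption
    # Float rounding could return exactly 1.0 for a value within ~2**-54 of 1.
    return leading / 2**64

def zscore(values):
    """Z-score with the pooled sample mean and sample std (ddof=1 assumed)."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std(ddof=1)

def mixed_reward(r_self, completions, lam, sigma=0.5, seed=0):
    """r_lambda = lam * z_self + sqrt(1 - lam^2) * (z_r + eps), eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    z_self = zscore(r_self)
    z_r = zscore([hash_reward(y) for y in completions])
    eps = rng.normal(0.0, sigma, size=len(completions))
    return lam * z_self + np.sqrt(1.0 - lam**2) * (z_r + eps)

# Hypothetical completions and self-rewards.
completions = ["answer: 17", "answer: 15", "answer: 12", "answer: 9"]
r_self = [-0.10, -0.40, -0.25, -0.60]
for lam in (-1.0, 0.0, 0.5, 1.0):
    r_lam = mixed_reward(r_self, completions, lam)
    # Empirical covariance between r_self and r_lambda, as swept in Figure 3.
    print(lam, np.cov(r_self, r_lam)[0, 1])
```

At $\lambda=\pm 1$ the hash and noise terms vanish and $r_{\lambda}$ reduces to $\pm z_{\mathrm{self}}$, which is what makes the covariance with $r_{\mathrm{self}}$ controllable through $\lambda$.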

Distillation.

We trained the student with supervised fine-tuning on the offline power-sampled dataset. Concretely, we minimized the standard token-level cross-entropy loss of a causal language model on the teacher-generated completion, masking the prompt tokens (i.e., the loss was computed only on the completion tokens). The student was initialized from the base model and was trained with LoRA adapters ($r=16$, $\alpha=32$, dropout $0.05$) applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. We trained the models for 3 epochs using the AdamW optimizer with a weight decay of 0.01 and a linear warmup ratio of 0.03. The learning rate was tuned per dataset and model as summarized in Table S.2. We used per-device batch size 1 with 8 gradient accumulation steps, and enabled gradient checkpointing. We set the maximum sequence length to 1024 tokens to keep activation memory manageable on a single GPU. Teacher completions exceeding this cap were truncated, and the cross-entropy loss was computed on all in-window completion tokens. The truncation affected only a minority of completions (e.g., 83.6% of Qwen2.5-Math-7B completions on MATH fit fully within the cap), and each in-window token still provides a valid distillation signal toward $\pi_{\alpha}$.
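The prompt masking and truncation described above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and the use of `-100` as the ignored label follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which the text does not confirm was the mechanism used.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids, completion_ids, max_len=1024):
    """Build (input_ids, labels) for completion-only cross-entropy.

    Prompt positions get IGNORE_INDEX so they contribute no loss; the
    sequence is truncated to max_len, keeping every in-window completion
    token as a training target.
    """
    input_ids = (list(prompt_ids) + list(completion_ids))[:max_len]
    n_prompt = min(len(prompt_ids), max_len)
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return input_ids, labels

# Hypothetical token ids with a tiny cap so that truncation is visible.
input_ids, labels = build_example([1, 2, 3], [4, 5, 6, 7], max_len=5)
print(input_ids)  # [1, 2, 3, 4, 5]
print(labels)     # [-100, -100, -100, 4, 5]
```

The shift between inputs and targets is left to the training loop, as is common in causal language model training code.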

Table S.2: Learning rate used for SFT distillation, per (dataset, model) pair.

| Dataset | Qwen2.5-7B | Qwen2.5-Math-7B | Phi-3.5-mini-instruct |
|---|---|---|---|
| MATH | $1\times 10^{-5}$ | $1\times 10^{-5}$ | $1\times 10^{-3}$ |
| HumanEval/MBPP | $1\times 10^{-5}$ | $1\times 10^{-5}$ | $5\times 10^{-4}$ |
| GPQA | $1\times 10^{-5}$ | $1\times 10^{-4}$ | $2\times 10^{-4}$ |

Hardware and execution time.

All experiments were conducted on GPU nodes equipped with two Intel Xeon Platinum 8360Y CPUs, 512 GiB of host memory, and eight NVIDIA A100 GPUs with 40 GiB of memory each. On a single GPU, supervised fine-tuning of one student per dataset and model finished in under one hour, while teacher generation via power sampling (Algorithm 2) took more than one day per dataset and model. The total compute is on the order of a few hundred A100-GPU-hours.

B.2 Additional results

B.2.1 Other datasets and models

This section reports results on additional dataset–model combinations that are not shown in the main text. In all cases, the distilled model has a higher $r^{\star}$ than the base under standard autoregressive decoding. The distilled model often attains $r^{\star}$ comparable to that of the corresponding base model with power sampling.

Table S.3: MATH: true reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

| Model | Sampling | All completions: $r^{\star}(\uparrow)$ | All completions: $r_{\mathrm{self}}$ | Best-of-$N$: $r^{\star}(\uparrow)$ | Best-of-$N$: $r_{\mathrm{self}}$ |
|---|---|---|---|---|---|
| Qwen / Base | Standard | $0.410\pm 0.004$ | $-0.405\pm 0.049$ | $0.579$ | $-0.185$ |
| Qwen / Base | Power | $\mathbf{0.706}\pm 0.017$ | $-0.093\pm 0.001$ | $0.677$ | $-0.080$ |
| Qwen / Distilled | Standard | $0.631\pm 0.013$ | $-0.094\pm 0.002$ | $\mathbf{0.682}$ | $-0.064$ |
| Qwen / Distilled | Temperature | $0.661\pm 0.006$ | $\mathbf{-0.073}\pm 0.001$ | $0.676$ | $\mathbf{-0.060}$ |
| Phi / Base | Standard | $0.449\pm 0.014$ | $-0.234\pm 0.001$ | $0.476$ | $-0.173$ |
| Phi / Base | Power | $\mathbf{0.513}\pm 0.017$ | $-0.175\pm 0.002$ | $\mathbf{0.493}$ | $-0.155$ |
| Phi / Distilled | Standard | $0.470\pm 0.000$ | $-0.118\pm 0.001$ | $0.481$ | $-0.088$ |
| Phi / Distilled | Temperature | $0.457\pm 0.014$ | $\mathbf{-0.106}\pm 0.001$ | $0.461$ | $\mathbf{-0.086}$ |

Table S.4: HumanEval: true reward $r^{\star}$ (HumanEval pass) and self-reward $r_{\mathrm{self}}$. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

| Model | Sampling | All completions: $r^{\star}(\uparrow)$ | All completions: $r_{\mathrm{self}}$ | Best-of-$N$: $r^{\star}(\uparrow)$ | Best-of-$N$: $r_{\mathrm{self}}$ |
|---|---|---|---|---|---|
| Qwen-Math / Base | Standard | $0.320\pm 0.016$ | $-0.741\pm 0.012$ | $0.383$ | $-0.427$ |
| Qwen-Math / Base | Power | $0.538\pm 0.030$ | $\mathbf{-0.144}\pm 0.003$ | $0.562$ | $\mathbf{-0.106}$ |
| Qwen-Math / Distilled | Standard | $0.416\pm 0.023$ | $-0.563\pm 0.040$ | $0.452$ | $-0.334$ |
| Qwen-Math / Distilled | Temperature | $\mathbf{0.541}\pm 0.005$ | $-0.304\pm 0.001$ | $\mathbf{0.566}$ | $-0.208$ |
| Qwen / Base | Standard | $0.326\pm 0.025$ | $-0.966\pm 0.046$ | $0.376$ | $-0.426$ |
| Qwen / Base | Power | $\mathbf{0.573}\pm 0.020$ | $\mathbf{-0.130}\pm 0.004$ | $0.568$ | $\mathbf{-0.096}$ |
| Qwen / Distilled | Standard | $0.425\pm 0.029$ | $-0.849\pm 0.017$ | $0.470$ | $-0.325$ |
| Qwen / Distilled | Temperature | $0.541\pm 0.017$ | $-0.479\pm 0.024$ | $\mathbf{0.600}$ | $-0.235$ |
| Phi / Base | Standard | $0.549\pm 0.021$ | $-0.913\pm 0.012$ | $0.562$ | $-0.589$ |
| Phi / Base | Power | $0.712\pm 0.027$ | $\mathbf{-0.330}\pm 0.004$ | $\mathbf{0.734}$ | $\mathbf{-0.294}$ |
| Phi / Distilled | Standard | $0.634\pm 0.031$ | $-0.730\pm 0.029$ | $0.602$ | $-0.473$ |
| Phi / Distilled | Temperature | $\mathbf{0.715}\pm 0.020$ | $-0.627\pm 0.028$ | $0.675$ | $-0.447$ |

Table S.5: GPQA: true reward $r^{\star}$ (accuracy) and self-reward $r_{\mathrm{self}}$. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds (max $r_{\mathrm{self}}$ per item, then same aggregation). 4 seeds.

| Model | Sampling | All completions: $r^{\star}(\uparrow)$ | All completions: $r_{\mathrm{self}}$ | Best-of-$N$: $r^{\star}(\uparrow)$ | Best-of-$N$: $r_{\mathrm{self}}$ |
|---|---|---|---|---|---|
| Qwen-Math / Base | Standard | $0.100\pm 0.025$ | $-0.675\pm 0.076$ | $0.103$ | $-0.675$ |
| Qwen-Math / Base | Power | $\mathbf{0.277}\pm 0.022$ | $\mathbf{-0.088}\pm 0.002$ | $0.279$ | $\mathbf{-0.087}$ |
| Qwen-Math / Distilled | Standard | $0.275\pm 0.004$ | $-0.165\pm 0.001$ | $\mathbf{0.281}$ | $-0.113$ |
| Qwen-Math / Distilled | Temperature | $\mathbf{0.277}\pm 0.001$ | $-0.149\pm 0.001$ | $0.277$ | $-0.109$ |
| Qwen / Base | Standard | $0.244\pm 0.017$ | $-1.531\pm 0.185$ | $0.245$ | $-1.527$ |
| Qwen / Base | Power | $0.283\pm 0.033$ | $\mathbf{-0.118}\pm 0.001$ | $0.287$ | $\mathbf{-0.118}$ |
| Qwen / Distilled | Standard | $0.280\pm 0.035$ | $-0.437\pm 0.098$ | $0.278$ | $-0.426$ |
| Qwen / Distilled | Temperature | $\mathbf{0.285}\pm 0.025$ | $-0.210\pm 0.007$ | $\mathbf{0.291}$ | $-0.201$ |
| Phi / Base | Standard | $0.223\pm 0.027$ | $-0.802\pm 0.023$ | $0.223$ | $-0.800$ |
| Phi / Base | Power | $\mathbf{0.309}\pm 0.019$ | $\mathbf{-0.215}\pm 0.004$ | $\mathbf{0.309}$ | $\mathbf{-0.214}$ |
| Phi / Distilled | Standard | $0.268\pm 0.004$ | $-0.321\pm 0.010$ | $0.267$ | $-0.319$ |
| Phi / Distilled | Temperature | $0.292\pm 0.012$ | $-0.284\pm 0.059$ | $0.298$ | $-0.262$ |
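
The "Self-reward Best-of-$N$" columns of Tables S.3 to S.5 select, for each item, the sample with the highest $r_{\mathrm{self}}$ across seeds. The sketch below shows one way to read that rule. It assumes rewards are stored as arrays of shape (seeds, items) and that "same aggregation" means a mean over items; the numbers are hypothetical.

```python
import numpy as np

def self_reward_best_of_n(r_self, r_true):
    """Pick, per item, the seed with maximal self-reward, then average.

    r_self, r_true: arrays of shape (n_seeds, n_items).
    Returns (mean true reward, mean self-reward) of the selected samples.
    """
    r_self = np.asarray(r_self, dtype=float)
    r_true = np.asarray(r_true, dtype=float)
    best_seed = r_self.argmax(axis=0)        # selection uses r_self only
    items = np.arange(r_self.shape[1])
    return (float(r_true[best_seed, items].mean()),
            float(r_self[best_seed, items].mean()))

# Hypothetical rewards for 2 seeds and 2 items.
r_self = [[-0.5, -0.1],
          [-0.2, -0.3]]
r_true = [[0.0, 1.0],
          [1.0, 0.0]]
bon_true, bon_self = self_reward_best_of_n(r_self, r_true)
print(bon_true, bon_self)
```

Because selection never looks at the true reward, the Best-of-$N$ value of $r^{\star}$ improves only to the extent that $r_{\mathrm{self}}$ and $r^{\star}$ agree on which sample is better.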

B.2.2 Power($\infty$)

We also evaluated Power($\infty$) using Qwen2.5-Math-7B on MATH500. This variant runs the MH power-sampling loop and accepts a proposal $y^{\prime}$ if and only if $\pi(y^{\prime}\mid x)>\pi(y\mid x)$ (Algorithm 2), corresponding to the limit $\alpha\to\infty$.
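
The acceptance rule of Power($\infty$) can be illustrated with a toy chain. This is only a sketch of the accept-if-strictly-more-likely rule. It omits the block-wise resampling of Algorithm 2, and the candidate set, log-probabilities, and uniform proposal are all hypothetical.

```python
import random

def power_inf_chain(log_prob, propose, y0, n_steps, seed=0):
    """Keep a proposal only if it is strictly more likely than the current state."""
    rng = random.Random(seed)
    y, lp = y0, log_prob(y0)
    for _ in range(n_steps):
        y_new = propose(y, rng)
        lp_new = log_prob(y_new)
        if lp_new > lp:              # pi(y' | x) > pi(y | x)
            y, lp = y_new, lp_new
    return y

# Hypothetical sequence log-probabilities for four candidate completions.
toy_logp = {"a": -5.0, "b": -3.0, "c": -1.0, "d": -4.0}
candidates = sorted(toy_logp)

def uniform_proposal(current, rng):
    """Toy proposal that ignores the current state."""
    return rng.choice(candidates)

final = power_inf_chain(toy_logp.__getitem__, uniform_proposal, "a", n_steps=200)
print(final, toy_logp[final])
```

Under this rule the likelihood of the current state never decreases, so the chain behaves like a stochastic hill climb on $\pi(y\mid x)$ rather than a sampler with a non-degenerate stationary distribution.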

Table S.6: Power($\infty$) results for Qwen2.5-Math-7B on MATH500. Left: all completions, means with $\pm$ std over seeds. Right: self-reward Best-of-$N$ over samples generated with different seeds.

| Sampling | All completions: $r^{\star}(\uparrow)$ | All completions: $r_{\mathrm{self}}$ | Best-of-$N$: $r^{\star}(\uparrow)$ | Best-of-$N$: $r_{\mathrm{self}}$ |
|---|---|---|---|---|
| Power($\infty$) | $\mathbf{0.728\pm 0.012}$ | $-0.075\pm 0.001$ | $0.736$ | $-0.061$ |

B.2.3 Qualitative results

This section presents full completions for one MATH-style geometry problem summarized in Table 2 with the gold answer $x+y=17$. The prompt is:

The coordinates of a parallelogram are $(5,3)$, $(6,8)$, $(7,4)$, and $(x,y)$ with $x>7$. What is the value of $x+y$?

        

Figure S.1: Full generations for an example in MATH500 (gold $x+y=17$) (Part 1/4). Panels: (a) Question. (b) Base sampling. (c) Temperature sampling.

Figure S.2: Full generations for an example in MATH500 (gold $x+y=17$) (Part 2/4). Panel: (a) Power sampling.

Figure S.3: Full generations for an example in MATH500 (gold $x+y=17$) (Part 3/4). Panel: (a) Distilled model with base sampling (Part 1/2).

Figure S.4: Full generations for an example in MATH500 (gold $x+y=17$) (Part 4/4). Panel: (a) Distilled model with base sampling (Part 2/2).

B.2.4 Synthetic validation of suffix-Rényi odds corrections

To validate Proposition 1 in a setting that reflects the Zipf-like word-frequency structure of natural language, we construct a finite synthetic autoregressive distribution whose language-model next-token probabilities follow a Zipf-like law over many candidates. Unlike the extreme pivotal-token construction of karan2026reasoning, every next-token candidate is followed by a full-support suffix distribution. The construction is summarized in Figure S.5. The base next-token distribution has $V=64$ tokens with Zipf-like probabilities

$$p_{i}\propto(i+1)^{-1.05},\qquad i=0,\dots,V-1.$$

For every token $i$, the conditional suffix distribution $q_{i}$ has the same support size $M=256$, no zero-probability suffixes, and a non-uniform power-law shape

$$q_{i}(z)\propto z^{-s_{i}},\qquad z=1,\dots,M.$$

The suffix exponent $s_{i}\in[0.45,1.65]$ varies deterministically and non-monotonically with the next-token rank, using a sinusoidal component plus a small trend. Thus, all suffix distributions have identical support size and full support, but differ in sharpness. This deliberately avoids the singular-versus-uniform example in karan2026reasoning: the experiment isolates the more general quantity identified by Proposition 1, namely the suffix Rényi entropy. In Figure S.5, the left panel shows the Zipf-like next-token distribution, the middle panel shows the token-dependent suffix exponent $s_{i}$, and the right panel shows representative full-support suffix distributions.

For each $\alpha\in\{1.1,1.5,2,3,4,8\}$, we compute both the token-wise temperature next-token distribution and the sequence-level power next-token conditional exactly under this synthetic distribution. The temperature next-token distribution is

$$\pi_{\mathrm{temp},\alpha}(i)=\frac{p_{i}^{\alpha}}{\sum_{j}p_{j}^{\alpha}},$$

whereas the next-token conditional induced by the sequence-level power distribution is

$$\pi_{\mathrm{pow},\alpha}(i)=\frac{p_{i}^{\alpha}\sum_{z}q_{i}(z)^{\alpha}}{\sum_{j}p_{j}^{\alpha}\sum_{z}q_{j}(z)^{\alpha}}.$$

Figure S.6 compares the two sides of Proposition 1 for every unordered token pair and every tested $\alpha$. The left panel plots the Rényi-predicted log odds correction against the directly computed power-versus-temperature log odds correction, while the right panel shows the distribution of these corrections at the main experimental exponent $\alpha=4$.
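
The agreement between the two quantities follows from dividing the two next-token conditionals defined above. A short derivation is given here, assuming the standard Rényi entropy $H_{\alpha}(q)=\frac{1}{1-\alpha}\log\sum_{z}q(z)^{\alpha}$, a definition this appendix does not restate:

```latex
\begin{align*}
\log\frac{\pi_{\mathrm{pow},\alpha}(i)/\pi_{\mathrm{pow},\alpha}(j)}
         {\pi_{\mathrm{temp},\alpha}(i)/\pi_{\mathrm{temp},\alpha}(j)}
 &= \log\frac{p_{i}^{\alpha}\sum_{z}q_{i}(z)^{\alpha}}{p_{j}^{\alpha}\sum_{z}q_{j}(z)^{\alpha}}
    - \log\frac{p_{i}^{\alpha}}{p_{j}^{\alpha}} \\
 &= \log\sum_{z}q_{i}(z)^{\alpha} - \log\sum_{z}q_{j}(z)^{\alpha} \\
 &= (1-\alpha)\bigl(H_{\alpha}(q_{i})-H_{\alpha}(q_{j})\bigr).
\end{align*}
```

The normalizers and the $p^{\alpha}$ factors cancel, so the correction depends only on the suffix distributions. For $\alpha>1$ the factor $(1-\alpha)$ is negative, so the token with the lower suffix Rényi entropy gains log odds under sequence-level power.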

Figure S.7 illustrates the consequence of the correction at the level of next-token preferences: even when $p_{i}>p_{j}$ and temperature favors token $i$, sequence-level power can favor token $j$ if $q_{j}$ has sufficiently lower suffix Rényi entropy.
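
The check described in this subsection can be reproduced in a few lines. The sketch below is not the authors' script: the text only says that $s_{i}$ combines a sinusoidal component with a small trend inside $[0.45,1.65]$, so the specific schedule used here is a hypothetical stand-in with that range.

```python
import numpy as np

V, M = 64, 256
ranks = np.arange(V)

# Zipf-like base next-token distribution p_i ∝ (i + 1)^(-1.05).
p = (ranks + 1.0) ** -1.05
p /= p.sum()

# Hypothetical exponent schedule: sinusoid plus small linear trend.
s = 1.05 + 0.5 * np.sin(0.7 * ranks) + 0.1 * (2.0 * ranks / (V - 1) - 1.0)

# Full-support power-law suffix distributions q_i(z) ∝ z^(-s_i).
z = np.arange(1, M + 1, dtype=float)
q = z[None, :] ** (-s[:, None])
q /= q.sum(axis=1, keepdims=True)

def pairwise_log_odds(pi):
    """Matrix whose (i, j) entry is log(pi[i] / pi[j])."""
    log_pi = np.log(pi)
    return log_pi[:, None] - log_pi[None, :]

def conditionals(alpha):
    """Return (pi_temp, pi_pow, suffix Rényi entropies) for one alpha."""
    mass = (q ** alpha).sum(axis=1)              # sum_z q_i(z)^alpha
    renyi = np.log(mass) / (1.0 - alpha)
    pi_temp = p ** alpha / (p ** alpha).sum()
    pi_pow = p ** alpha * mass / (p ** alpha * mass).sum()
    return pi_temp, pi_pow, renyi

max_gap = 0.0
for alpha in (1.1, 1.5, 2.0, 3.0, 4.0, 8.0):
    pi_temp, pi_pow, renyi = conditionals(alpha)
    direct = pairwise_log_odds(pi_pow) - pairwise_log_odds(pi_temp)
    predicted = (1.0 - alpha) * (renyi[:, None] - renyi[None, :])
    max_gap = max(max_gap, float(np.abs(direct - predicted).max()))
print("max |direct - predicted| over all pairs and alphas:", max_gap)

# Preference reversals at alpha = 4: pairs i < j (so p_i > p_j) where
# sequence-level power nevertheless prefers token j.
_, pi_pow4, _ = conditionals(4.0)
upper = np.triu(np.ones((V, V), dtype=bool), k=1)
n_reversed = int(np.sum((pi_pow4[None, :] > pi_pow4[:, None]) & upper))
print("reversed pairs at alpha=4:", n_reversed)
```

The number of reversed pairs depends on the assumed exponent schedule, so it should not be compared numerically with Figure S.7.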

Figure S.5: Synthetic distribution setting. Left: the base next-token distribution $p_{i}$ is Zipf-like over $64$ tokens. Middle: suffix sharpness varies non-monotonically with the next-token rank through the power-law exponent $s_{i}$. Right: representative suffix distributions $q_{i}$ are full-support, non-uniform power laws over the same support size $M=256$. This setup differs from the singular-versus-uniform pivotal-token construction of karan2026reasoning.

Figure S.6: Suffix Rényi entropy coincides with the power-vs-temperature odds gap. For every unordered token pair $(i,j)$ and every tested $\alpha$, we compare the Rényi-predicted log correction, $(1-\alpha)(H_{\alpha}(q_{i})-H_{\alpha}(q_{j}))$, with the closed-form log odds correction, $\log((\pi_{\mathrm{pow},\alpha}(i)/\pi_{\mathrm{pow},\alpha}(j))/(\pi_{\mathrm{temp},\alpha}(i)/\pi_{\mathrm{temp},\alpha}(j)))$. The diagonal agreement shows that the suffix-Rényi formula explains the full pairwise gap in a many-token full-support setting. The right panel shows the distribution of closed-form corrections at $\alpha=4$.

Figure S.7: Suffix entropy can reverse next-token preferences under sequence-level power. Pairs are ordered so that $i<j$, hence the Zipf base next-token distribution gives $p_{i}>p_{j}$, and token-wise temperature has positive log odds for $i$ over $j$. Left: examples where the sequence-level power next-token conditional reverses this ordering because token $j$ has a sharper, lower-entropy suffix distribution. Right: control examples where the Rényi entropy is not large enough to reverse the preferred token. All bars are closed-form log odds, not sampled frequencies.

B.2.5 Synthetic validation of optimal one-step proposals for sequential power sampling

We reuse the synthetic distribution of Section B.2.4 to validate Proposition 2. For a fixed prompt and an empty prefix, the unique variance-minimizing one-step proposal in Equation 10 reduces to

$$q^{\star}(i)\;\propto\;p_{i}^{\alpha}\,\sum_{z=1}^{M}q_{i}(z)^{\alpha},\qquad i=0,\dots,V-1,$$

which equals the next-token conditional of the sequence-level power distribution $\pi_{\mathrm{pow},\alpha}$ and depends on the suffix power masses $\sum_{z}q_{i}(z)^{\alpha}$ of every candidate token. We compare $q^{\star}$ with three one-step proposals that do not use those suffix totals: the base proposal $q^{\mathrm{base}}(i)=p_{i}$, the token-wise temperature proposal $q^{\mathrm{temp}}(i)\propto p_{i}^{\alpha}$, and a uniform reference $q^{\mathrm{unif}}(i)=1/V$.

For each proposal $q$, the first-step incremental importance weight in Equation 9 simplifies to

$$W_{1}(i)\;=\;\frac{q^{\star}(i)}{q(i)},$$

and we show its exact mean, the squared coefficient of variation $\mathrm{CV}^{2}(W_{1})=\mathrm{Var}[W_{1}]/\mathbb{E}[W_{1}]^{2}$, and the effective sample size fraction $\mathrm{ESS}/N=1/(1+\mathrm{CV}^{2}(W_{1}))$. By Proposition 2, only $q^{\star}$ achieves $\mathrm{Var}[W_{1}]=0$ and hence $\mathrm{ESS}/N=1$; the closed-form values for the other proposals are computed exactly from the synthetic distribution.
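
These exact quantities are straightforward to compute once the synthetic distribution is fixed. A minimal sketch follows, taking expectations under each proposal $q$; the exponent schedule for $s_{i}$ is a hypothetical stand-in (sinusoid plus small trend within $[0.45,1.65]$), since its exact form is not given, and the function names are illustrative.

```python
import numpy as np

def synthetic(V=64, M=256):
    """Zipf-like p over V tokens and power-law suffix distributions q_i."""
    ranks = np.arange(V)
    p = (ranks + 1.0) ** -1.05
    p /= p.sum()
    # Hypothetical exponent schedule in [0.45, 1.65].
    s = 1.05 + 0.5 * np.sin(0.7 * ranks) + 0.1 * (2.0 * ranks / (V - 1) - 1.0)
    z = np.arange(1, M + 1, dtype=float)
    q = z[None, :] ** (-s[:, None])
    q /= q.sum(axis=1, keepdims=True)
    return p, q

def one_step_proposals(p, q, alpha):
    """Oracle q* plus the three proposals that ignore suffix power masses."""
    oracle = p ** alpha * (q ** alpha).sum(axis=1)
    oracle /= oracle.sum()
    temperature = p ** alpha / (p ** alpha).sum()
    uniform = np.full(len(p), 1.0 / len(p))
    return {"oracle": oracle, "base": p, "temperature": temperature, "uniform": uniform}

def exact_weight_stats(q_prop, q_star):
    """Exact mean, CV^2, and ESS/N of W_1 = q*/q when tokens are drawn from q."""
    w = q_star / q_prop
    mean = float(np.sum(q_prop * w))
    var = float(np.sum(q_prop * (w - mean) ** 2))
    cv2 = var / mean ** 2
    return mean, cv2, 1.0 / (1.0 + cv2)

p, q = synthetic()
props = one_step_proposals(p, q, alpha=4.0)
stats = {name: exact_weight_stats(qp, props["oracle"]) for name, qp in props.items()}
for name, (mean, cv2, ess_frac) in stats.items():
    print(f"{name:12s} mean={mean:.3f} CV^2={cv2:.3e} ESS/N={ess_frac:.3f}")
```

With normalized distributions the mean of $W_{1}$ is 1 for every proposal, so the comparison between proposals is carried entirely by the variance term.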

Figure S.8 compares the four proposals at $\alpha=4$. The left panel shows the proposal probabilities; the oracle proposal equals the target next-token conditional $\pi_{\mathrm{pow},\alpha}$ by construction, and the temperature, base, and uniform proposals deviate from it, especially on next-token ranks where the suffix exponent $s_{i}$ is large (a sharper suffix distribution) and hence $\sum_{z}q_{i}(z)^{\alpha}$ is large. The right panel plots $\log W_{1}(i)$: only the oracle proposal yields a constant log weight, while the other proposals produce token-dependent log weights.

Figure S.9 reports the exact $\mathrm{ESS}/N$ and $\mathrm{CV}^{2}(W_{1})$ as a function of $\alpha$. The oracle proposal attains $\mathrm{ESS}/N=1$ for every $\alpha$, whereas the gap between the temperature proposal and the oracle widens as $\alpha$ grows, because larger $\alpha$ amplifies the suffix power masses that the local temperature transform ignores.

Figure S.10 checks the same conclusion with Monte Carlo: for each proposal we draw $N$ tokens, compute the self-normalized $\mathrm{ESS}$, and average across replicates. The sampled $\mathrm{ESS}/N$ concentrates around the exact values from Figure S.9 as $N$ grows, and the ordering of the proposals is preserved at every particle budget.
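
A Monte Carlo version of this check can be sketched as follows. The particle budget, replicate count, and random seed are arbitrary choices for illustration, and the exponent schedule for $s_{i}$ is again a hypothetical stand-in, so the printed values are not expected to match Figure S.10.

```python
import numpy as np

def synthetic(V=64, M=256):
    """Zipf-like p over V tokens and power-law suffix distributions q_i."""
    ranks = np.arange(V)
    p = (ranks + 1.0) ** -1.05
    p /= p.sum()
    # Hypothetical exponent schedule in [0.45, 1.65].
    s = 1.05 + 0.5 * np.sin(0.7 * ranks) + 0.1 * (2.0 * ranks / (V - 1) - 1.0)
    z = np.arange(1, M + 1, dtype=float)
    q = z[None, :] ** (-s[:, None])
    q /= q.sum(axis=1, keepdims=True)
    return p, q

def mc_ess_fraction(q_prop, q_star, n, replicates, rng):
    """Mean and std over replicates of the self-normalized ESS / N."""
    fractions = []
    for _ in range(replicates):
        tokens = rng.choice(len(q_prop), size=n, p=q_prop)
        w = q_star[tokens] / q_prop[tokens]          # W_1 for each drawn token
        ess = w.sum() ** 2 / (w ** 2).sum()          # self-normalized ESS
        fractions.append(ess / n)
    return float(np.mean(fractions)), float(np.std(fractions))

alpha = 4.0
p, q = synthetic()
q_star = p ** alpha * (q ** alpha).sum(axis=1)
q_star /= q_star.sum()
proposals = {
    "oracle": q_star,
    "base": p,
    "temperature": p ** alpha / (p ** alpha).sum(),
    "uniform": np.full(len(p), 1.0 / len(p)),
}

rng = np.random.default_rng(0)
results = {name: mc_ess_fraction(qp, q_star, n=2000, replicates=20, rng=rng)
           for name, qp in proposals.items()}
for name, (mean, std) in results.items():
    print(f"{name:12s} ESS/N = {mean:.3f} ± {std:.3f}")
```

For the oracle every weight equals 1, so the sampled $\mathrm{ESS}/N$ is 1 regardless of $N$; for the other proposals it is bounded above by 1 by the Cauchy–Schwarz inequality.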

Figure S.8: One-step SIS proposals for the sequence-level power target. Left: proposal probabilities at $\alpha=4$. The oracle proposal $q^{\star}$ of Equation 10 equals the target next-token conditional $\pi_{\mathrm{pow},\alpha}$ by construction, while the temperature, base, and uniform proposals deviate from it. Right: log incremental weight $\log W_{1}(i)$ for each proposal; the oracle yields a constant log weight, while base, temperature, and uniform proposals do not.

Figure S.9: Exact first-step weight variance and effective sample size. Left: exact $\mathrm{ESS}/N=1/(1+\mathrm{CV}^{2}(W_{1}))$ for each one-step proposal as a function of $\alpha$. Right: exact $\mathrm{CV}^{2}(W_{1})$ on a log scale. The oracle proposal $q^{\star}$ achieves $\mathrm{Var}[W_{1}]=0$ at every $\alpha$, confirming Proposition 2, while the temperature proposal degrades as $\alpha$ grows.

Figure S.10: Monte Carlo effective sample size at $\alpha=4$. For each proposal we draw $N$ tokens, compute the self-normalized $\mathrm{ESS}$, and average across replicates; error bars show one standard deviation.
\CJK@envEnd
