arXiv:2604.25110 · cs.LG · uncurated · rendered via ar5iv

Knowledge Distillation Must Account for What It Loses

Wenshuo Wang
School of Future Technology, South China University of Technology, China
202364870251@mail.scut.edu.cn
Abstract

This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. This matters because distillation is increasingly used to turn large, often frontier models into deployable systems, yet headline metrics can hide losses in uncertainty, boundary behavior, process reliability, on-policy stability, grounding, privacy, safety, and diversity. We identify the retention assumption behind current evaluation and reframe distillation as a lossy projection of teacher behavior rather than a faithful copy. We then synthesize existing evidence into a taxonomy of off-metric distillation losses, showing that these losses are concrete, recurring, and measurable. To make the position actionable, we propose scenario-specific preservation targets and a Distillation Loss Statement that reports what was preserved, what was lost, and why the remaining losses are acceptable. The goal is not lossless distillation, but accountable distillation.

1 Introduction

Knowledge distillation transfers behavior from a larger or more capable teacher model to a smaller, cheaper, specialized, or more deployable student model [1, 2]. It is now used far beyond classical model compression: modern distillation produces compact language models, reasoning students, code and tool-using agents, retrieval-augmented components, safety models, domain assistants, and synthetic-data pipelines [3, 4, 5, 6]. These systems are usually evaluated by what the student retains on a primary metric, such as accuracy, pass@k, win rate, benchmark score, task success, or downstream utility. Yet most distillation research does not explicitly state what the student loses beyond the capability represented by that metric. A student can preserve the teacher’s headline score while losing other capabilities that make the teacher reliable.

This omission matters because distilled models are increasingly deployed as substitutes for larger systems, not merely studied as compressed approximations in closed benchmarks. In deployment, reliability often depends on capabilities that primary metrics only weakly measure: uncertainty, boundary behavior, process reliability, on-policy stability, grounding, privacy, safety, and diversity. This position paper argues that knowledge distillation must account for what it loses. We do not claim that distillation should be lossless, or that every student must preserve every teacher capability. We argue that distillation papers should identify which teacher capabilities matter for the intended use, measure whether the student preserves them when possible, and justify the losses they accept.

This argument begins by making explicit the retention assumption behind current distillation evaluation: that if a student matches the teacher on the primary metric, then it has preserved the relevant teacher capability. As developed in Section 2, this assumption is convenient, but it is not warranted. Matching the teacher under one observable does not imply matching the teacher under others. We therefore reframe distillation as a lossy projection of teacher behavior rather than a faithful copy of teacher capability. This framing shifts the central question from how much score the student retained to which teacher capabilities were projected away.

The case for loss accounting is not merely conceptual. As Section 3 shows, prior work has already identified teacher–student divergences in predictive distributions, internal representations, robustness, calibration, subgroup behavior, privacy and memorization, reasoning faithfulness, on-policy stability, refusal boundaries, grounding, and synthetic-data diversity [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. We synthesize these findings into a taxonomy of off-metric distillation losses. Building on that taxonomy, Section 4 turns this position into a recommendation: distillation should be evaluated through scenario-specific preservation targets, not only by score retention. To make this operational, we propose a Distillation Loss Statement that reports the deployment context, primary metric, critical off-metric capabilities, preservation targets, stress distributions, observed losses, accepted losses, and deployment implications.

Finally, we situate this proposal in Section 5 relative to adjacent work and objections. Existing research studies particular losses, particular preservation methods, and broader documentation norms, but it does not yet provide a general norm for reporting teacher-to-student capability change [17, 18, 19, 20]. The main limitation, revisited in Section 6, is that many off-metric capabilities are difficult to define and measure, especially for black-box teachers, and some teacher behaviors should not be preserved. Even so, explicit loss accounting could change how distillation results are interpreted: it would help researchers, reviewers, and deployers distinguish useful compression from unexamined capability erosion, and move distillation from a score-retention exercise toward an accountable transformation.

2 The Retention Assumption

This section isolates the assumption that makes score-centered distillation evaluation look sufficient. The issue is not that primary metrics are useless. They are often necessary. The issue is that they are treated as evidence for a broader claim than they can support: that the student has preserved the teacher capability relevant to the intended use.

2.1 The Hidden Assumption in Current Distillation Evaluation

Let $T$ denote a teacher model and $S$ denote a distilled student model. Let $m(\cdot)$ be the primary evaluation metric used in a distillation paper, such as accuracy, pass rate, win rate, task success, or downstream utility. A common success claim can be abstracted as:

$$m(S)\approx m(T). \tag{1}$$

Equation 1 is a claim about retained performance under one metric. It is not, by itself, a claim about retained capability. The hidden assumption appears when this metric-level statement is treated as if it implied a broader teacher–student equivalence.

To make the assumption explicit, let $c_{i}(\cdot)$ denote a teacher capability that may matter for deployment but is not directly measured by $m$. Examples include calibration, refusal boundary, retrieval behavior, self-verification, or long-tail coverage. Let $\mathcal{C}=\{c_{1},\ldots,c_{k}\}$ be the set of such scenario-critical capabilities. The retention assumption is:

$$m(S)\approx m(T)\quad\Longrightarrow\quad c_{i}(S)\approx c_{i}(T)\quad\text{for relevant }c_{i}\in\mathcal{C}. \tag{2}$$

This implication is not logically warranted. A scalar benchmark can show that two models reach similar outcomes on a test distribution, while leaving open whether they assign similar probabilities, fail on similar inputs, abstain under similar uncertainty, use tools in similar ways, cite similar evidence, or preserve similar minority and tail behavior.

The same problem applies even when the training objective is richer than the final metric. Let $x$ be an input sampled from a distillation distribution $\mathcal{D}_{\mathrm{distill}}$. Let $\phi_{a}(M,x)$ denote an observable teacher signal of model $M$ on input $x$, such as an answer, logit vector, rationale, or trajectory. A distillation objective can be written abstractly as:

$$\min_{S}\;\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{distill}}}\left[d_{a}\big(\phi_{a}(T,x),\phi_{a}(S,x)\big)\right], \tag{3}$$

where $d_{a}$ is a discrepancy measure for the observed signal. Equation 3 only constrains what is observed, optimized, and evaluated. If the relevant off-metric capability is not represented in $\phi_{a}$, or not tested on the distribution where it matters, then retention of the primary metric cannot establish retention of that capability.

2.2 Distillation as Lossy Projection

The retention assumption becomes easier to see if distillation is viewed as a projection. For an input $x$, let $\Phi(M,x)$ denote a vector of teacher-relevant observables of model $M$:

$$\Phi(M,x)=\big(\phi_{m}(M,x),\phi_{c_{1}}(M,x),\ldots,\phi_{c_{k}}(M,x)\big), \tag{4}$$

where $\phi_{m}$ is the observable associated with the primary metric and $\phi_{c_{i}}$ is an observable or proxy associated with capability $c_{i}$. In practice, distillation observes only a subset of this vector. Let $P_{A}$ be the projection onto the subset of observables $A$ used for training or evaluation. Distillation tries to make:

$$P_{A}\Phi(S,x)\approx P_{A}\Phi(T,x)\quad\text{for }x\sim\mathcal{D}_{\mathrm{distill}}. \tag{5}$$

But matching a projection is not the same as matching the full capability vector:

$$P_{A}\Phi(S,x)\approx P_{A}\Phi(T,x)\quad\not\Rightarrow\quad\Phi(S,x)\approx\Phi(T,x). \tag{6}$$

This is the sense in which distillation is a lossy projection. The dimensions retained by the projection may be faithfully transferred, while the dimensions it drops are unconstrained, weakly constrained, or evaluated only indirectly.
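The gap between Equations 5 and 6 can be illustrated with a toy sketch. The observable names (`answer`, `confidence`, `abstains`) and their values are hypothetical, chosen only for illustration, and exact equality stands in for the approximate matching used in the text.

```python
# Hypothetical observable vectors Phi(M, x) for a single input x.
# Keys and values are illustrative assumptions, not taken from any experiment.
teacher_phi = {"answer": "B", "confidence": 0.62, "abstains": True}
student_phi = {"answer": "B", "confidence": 0.99, "abstains": False}


def project(phi, observed):
    """P_A: keep only the observables named in the subset A."""
    return {name: phi[name] for name in observed}


A = ["answer"]  # the only observable used for training or evaluation

# Eq. 5: the projections agree.
projection_matches = project(teacher_phi, A) == project(student_phi, A)

# Eq. 6: the full vectors need not agree.
full_vector_matches = teacher_phi == student_phi

# Observables outside A on which teacher and student diverge.
unconstrained = sorted(
    name for name in teacher_phi
    if name not in A and teacher_phi[name] != student_phi[name]
)

print(projection_matches, full_vector_matches, unconstrained)
```

In this sketch the student is indistinguishable from the teacher under $P_{A}$, yet it differs on every observable outside $A$.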

We call the resulting discrepancy an off-metric distillation loss. For a capability $c$, let $\mathcal{D}_{c}$ be the stress distribution on which that capability matters, let $\phi_{c}$ be an observable or proxy for the capability, and let $d_{c}$ be a capability-specific discrepancy measure. Then:

$$\Delta_{c}(T,S;\mathcal{D}_{c})=\mathbb{E}_{x\sim\mathcal{D}_{c}}\left[d_{c}\big(\phi_{c}(T,x),\phi_{c}(S,x)\big)\right]. \tag{7}$$

A student can therefore satisfy the usual score-retention criterion while still incurring a large off-metric loss:

$$m(S)\approx m(T)\quad\text{while}\quad\Delta_{c}(T,S;\mathcal{D}_{c})\gg 0. \tag{8}$$

Equation 8 is the central problem this paper addresses. It does not imply that all off-metric losses are unacceptable, or that every teacher behavior should be preserved. It implies that losses relevant to the intended use should be named, measured when possible, and justified rather than left invisible behind retained benchmark scores.
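A minimal sketch of how Equation 7 might be estimated from samples is given below. The toy teacher and student, the choice of confidence as the proxy $\phi_{c}$, the absolute gap as $d_{c}$, and the rule that inputs at or above 100 are "ambiguous" are all assumptions made for illustration.

```python
from statistics import mean


def off_metric_loss(teacher, student, stress_inputs, phi_c, d_c):
    """Sample-average estimate of Delta_c(T, S; D_c) as in Eq. 7."""
    return mean(d_c(phi_c(teacher, x), phi_c(student, x)) for x in stress_inputs)


def accuracy(model, labelled):
    """Primary metric m: fraction of inputs answered correctly."""
    return mean(1.0 if model(x)[0] == y else 0.0 for x, y in labelled)


# Hypothetical toy models that return (answer, confidence).
def toy_teacher(x):
    # Lower confidence on "ambiguous" inputs, arbitrarily defined as x >= 100.
    return (x % 2, 0.60 if x >= 100 else 0.95)


def toy_student(x):
    # Same answers as the teacher, but uniformly overconfident.
    return (x % 2, 0.99)


test_set = [(x, x % 2) for x in range(10)]  # benchmark distribution
stress_set = list(range(100, 110))          # stress distribution D_c


def confidence(model, x):
    """phi_c: the confidence the model reports on input x."""
    return model(x)[1]


def abs_gap(a, b):
    """d_c: absolute difference between two confidence values."""
    return abs(a - b)


m_teacher = accuracy(toy_teacher, test_set)
m_student = accuracy(toy_student, test_set)
delta_c = off_metric_loss(toy_teacher, toy_student, stress_set, confidence, abs_gap)
print(m_teacher, m_student, round(delta_c, 2))
```

Here both models reach the same accuracy on the benchmark inputs, while the estimated confidence gap on the stress inputs is about 0.39. This is the pattern described by Equation 8 in a deliberately artificial setting.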

3 Evidence for Off-Metric Distillation Loss

Section 2 showed why primary-metric retention is insufficient as a matter of logic. This section shows why the gap is also practically important. We use the literature in two ways. First, we review reporting patterns in representative distillation work to clarify what is usually treated as evidence of success. This is not a validation or refutation of prior empirical claims; it is a content-level review of what those claims are scoped to report. Second, we synthesize empirical results showing that teacher–student divergence outside the main metric is not hypothetical. The result is a taxonomy of off-metric distillation losses that will be used in Section 4 to define scenario-specific preservation targets.

3.1 Reporting Patterns in Recent Distillation Work

A loss-accounting view begins with a simple reporting question: when a paper claims that a student model has successfully inherited a teacher’s capability, what evidence is actually reported? In representative distillation work, the default evidence is retention of a primary score, often accompanied by parameter count, latency, or training cost. This is natural in classical compression and remains natural in modern LLM distillation, reasoning distillation, and agent distillation [1, 3, 4, 5, 6]. The problem is not that such metrics are irrelevant, nor that prior papers report false results. The problem is narrower: primary metrics make some losses visible while leaving other losses outside the reported scope.

We therefore distinguish between retention evidence and loss evidence. Retention evidence asks whether the student remains competitive on the main task. Loss evidence asks whether the student still matches the teacher on capabilities that the main task does not measure: predictive-distribution fidelity, calibration, robustness, subgroup behavior, process reliability, on-policy stability, grounding, safety boundaries, privacy, and tail diversity. The literature already contains many papers devoted to one of these dimensions, such as robustness transfer, calibration transfer, fairness after distillation, privacy leakage, and safety/refusal transfer [21, 10, 22, 12, 23]. This pattern is itself informative. Off-metric capabilities are usually treated as specialized research topics, not as default reporting items in ordinary distillation papers. Appendix A provides a representative 50-paper reporting-pattern checklist that makes this asymmetry explicit without treating it as a systematic meta-analysis. Our position is that this separation should narrow: papers need not report every possible loss, but they should report the losses that matter for the student’s intended use.

3.2 Existing Evidence That Primary-Metric Retention Is Insufficient

The most basic off-metric loss is distributional. The original distillation formulation already treats soft targets as carrying information beyond hard labels [1]. Later analyses make the point sharper: teacher probability estimates can matter even when teacher accuracy is not the only object of interest, and students can improve under distillation while still failing to match teacher predictive distributions [24, 7]. Recent LLM distillation methods likewise show that the direction and form of distribution matching matter for generation quality [25]. These results support a narrow but important conclusion: answer-level or score-level similarity is weaker than distribution-level fidelity.
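The difference between answer-level and distribution-level agreement can be sketched with two hypothetical three-class predictive distributions. The probabilities are invented for illustration, and KL divergence is only one of several possible discrepancy measures.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top1(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical predictive distributions over three classes for one input.
teacher_probs = [0.50, 0.30, 0.20]  # teacher keeps mass on alternatives
student_probs = [0.90, 0.05, 0.05]  # student collapses onto the top answer

same_top1 = top1(teacher_probs) == top1(student_probs)
kl = kl_divergence(teacher_probs, student_probs)
print(same_top1, round(kl, 3))
```

The two models give the same top answer, so any answer-level metric treats them as identical on this input, while the divergence between their distributions is clearly nonzero.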

A second line of evidence concerns what is hidden behind outputs. Representation- and relation-based distillation methods exist precisely because matching final responses may not preserve intermediate structure. FitNets use teacher hints from intermediate layers, while MiniLM transfers self-attention distributions and value relations; broader KD surveys distinguish response-based, feature-based, and relation-based knowledge for the same reason [26, 27, 2]. Empirical studies of what knowledge gets distilled also show that properties such as invariances, localization behavior, and adversarial susceptibility can be inherited unevenly [8, 9]. The lesson is not that every representation must be copied. It is that output retention alone cannot tell us which internal or relational properties were preserved.

A third line of evidence concerns boundaries. Average-case accuracy can remain high while robustness, group behavior, or tail behavior changes. Work on adversarial robustness transfer explicitly argues that standard KD is usually evaluated by accuracy while robustness may fail to transfer [21]. In natural language inference, in-distribution gains do not automatically imply robustness on target or out-of-distribution examples [11]. Work on bias and fairness further shows that the effects of KD can be uneven across subgroups or classes [28, 22]. These results motivate the notion of a capability boundary: the student may preserve the center of the teacher’s behavior while distorting the regions where reliability is most contested.

Privacy and memorization provide another example of an off-metric capability whose direction is not captured by task utility. Distillation is sometimes treated as privacy-improving because the student interacts with private training data only indirectly, yet membership-inference work shows that distillation alone can provide limited privacy protection [12]. Recent LLM studies further show that students can inherit membership and memorization risks from teachers, while different KD objectives can induce different leakage profiles [13, 29]. The relevant point is not that distillation always worsens privacy. It is that distillation changes the privacy and memorization profile of the student, and this change is invisible under ordinary task scores.

A further line of evidence concerns uncertainty and abstention. Calibration research shows that confidence reliability is distinct from accuracy, and calibration transfer studies show that this property is not automatically inherited by a student [30, 10, 31]. In LLM settings, uncertainty estimation and abstention are increasingly treated as independent capabilities: a model must know when it does not know, and when it should decline or defer rather than answer [32, 33, 34]. If a student retains a teacher’s answer style but not its uncertainty behavior, then the student may become more deployable while becoming less trustworthy.
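As one possible way to make a calibration gap visible, the sketch below computes expected calibration error (ECE) with equal-width confidence bins. The confidences and correctness labels are hypothetical, and ECE is an illustrative choice of measure rather than one this paper prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - acc)
    return ece


# Hypothetical results: both models answer the same 8 of 10 questions correctly.
correct = [True] * 8 + [False] * 2
teacher_conf = [0.80] * 10  # confidence matches the 80% accuracy
student_conf = [0.99] * 10  # same accuracy, but overconfident

teacher_ece = expected_calibration_error(teacher_conf, correct)
student_ece = expected_calibration_error(student_conf, correct)
print(round(teacher_ece, 3), round(student_ece, 3))
```

Accuracy is identical for the two models in this toy setting, so the difference shows up only in the calibration measure.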

Modern LLM applications also expose process, grounding, safety, and diversity losses. Step-by-step and System-2-to-System-1 distillation demonstrate that reasoning traces and deliberative procedures can be compressed, but work on chain-of-thought faithfulness shows that explanations need not faithfully reveal the model’s actual decision process [35, 36, 37, 14]. On-policy distillation work makes a related point for generative students: training on static teacher-generated data can leave students untested on their own errors, creating exposure bias and compounding failures under student-generated rollouts [15]. Retrieval and citation work shows that answer quality can diverge from evidence selection and attribution quality [38, 39, 40]. Safety work shows that refusal behavior must be evaluated at the boundary between over-refusal and under-refusal, not merely by the presence of refusal text [41, 42, 23]. Recursive data-generation work shows that generated-data pipelines can lose tails of the original distribution, even when average performance appears acceptable [16, 43]. Together, these lines of work support the same conclusion from different directions: the main metric does not exhaust what the teacher knows how to do.
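The tail-loss point can be illustrated with a simple coverage measure: the fraction of rare categories in a reference sample that still appear in generated data. The function name `tail_coverage`, the rarity threshold, and the toy samples are assumptions for illustration and do not correspond to any cited study.

```python
from collections import Counter


def tail_coverage(reference, generated, rare_threshold=0.1):
    """Fraction of rare reference categories that still occur in generated data."""
    counts = Counter(reference)
    total = len(reference)
    rare = {cat for cat, n in counts.items() if n / total < rare_threshold}
    if not rare:
        return 1.0  # nothing rare to lose
    return len(rare & set(generated)) / len(rare)


# Hypothetical category labels: one dominant mode and three rare ones.
reference = ["a"] * 90 + ["b"] * 5 + ["c"] * 3 + ["d"] * 2
generated = ["a"] * 97 + ["b"] * 3  # rare modes "c" and "d" have vanished

coverage = tail_coverage(reference, generated)
dominant_share = generated.count("a") / len(generated)
print(round(coverage, 3), dominant_share)
```

A downstream score dominated by category "a" would barely move here, while two of the three rare categories are no longer represented at all.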

Existing single-scenario studies already instantiate the kind of loss accounting we call for. For example, metamorphic testing for distilled language models of code shows that students can preserve conventional accuracy while failing to deeply mimic teacher behavior under behavior-preserving transformations; the proposed MetaCompress framework reports up to a 285% greater performance drop under adversarial attacks and identifies up to 62% behavioral discrepancies that traditional accuracy-based evaluation misses [44]. This is not a general solution to loss accounting, because it is tailored to code models and metamorphic relations. It is instead an existence proof for our central claim: measuring what the student retains is insufficient unless we also ask what the student loses.

3.3 A Taxonomy of Off-Metric Distillation Losses

Table 1 organizes the evidence above into a loss-oriented taxonomy. The taxonomy is not intended to be exhaustive, and it does not imply that every distillation paper must measure every row. Instead, each row names a capability dimension that can be invisible under the primary metric and suggests the kind of evidence that would make the loss visible.

Table 1: A taxonomy of off-metric distillation losses. Each row identifies a teacher capability that may be weakly measured or unmeasured by the primary score. The table should be read as a loss-accounting checklist, not as a claim that all losses occur in every distillation setting.
| Off-metric loss | What the primary metric may retain | What the student may lose | Representative evidence |
| --- | --- | --- | --- |
| Predictive-distribution loss | Top answer or task score | Soft probabilities, alternative answers, entropy, teacher uncertainty shape | Soft targets and distribution fidelity [1, 24, 7, 25] |
| Representation / relation loss | Similar output behavior | Intermediate representations, attention relations, invariances, localization behavior | Feature-, relation-, and property-transfer studies [26, 27, 8, 9] |
| Capability-boundary loss | Average-case accuracy or win rate | OOD robustness, adversarial robustness, subgroup behavior, class-wise reliability | Robustness and fairness studies [21, 11, 28, 22] |
| Privacy / memorization loss | Task utility or reduced average memorization | Membership leakage, teacher-specific memorized examples, different leakage profiles across KD objectives | Privacy and memorization studies [12, 13, 29] |
| Calibration / abstention loss | Correct answers on answer-forcing benchmarks | Confidence reliability, knowing when not to answer, deferral or abstention behavior | Calibration and abstention work [30, 10, 31, 33, 34] |
| Process / on-policy stability loss | Final answer, plausible rationale, or teacher-forced trace | Search behavior, self-verification, recovery from self-generated errors, rollout stability, faithful reasoning traces | CoT, process-faithfulness, and on-policy studies [35, 36, 37, 14, 15] |
| Grounding / provenance loss | Factual-looking answer text | Retrieval policy, evidence selection, citation fidelity, no-answer behavior under missing evidence | RAG and attribution work [38, 39, 40] |
| Safety-boundary loss | Refusal style or aggregate safety score | Jailbreak robustness, over-refusal / under-refusal boundary, grounded selective refusal | Refusal and jailbreak studies [41, 42, 23] |
| Long-tail / diversity loss | High-frequency patterns and downstream average score | Rare modes, minority classes, human heterogeneity, tail coverage in generated data | Model-collapse and diversity work [16, 43, 19] |

The table also clarifies how the formal loss in Equation 7 should be instantiated. For each row, a paper must choose a capability $c$, an observable or proxy $\phi_{c}$, a stress distribution $\mathcal{D}_{c}$, and a discrepancy measure $d_{c}$. For calibration, $\phi_{c}$ might be a confidence estimate and $\mathcal{D}_{c}$ might contain ambiguous or shifted inputs. For grounding, $\phi_{c}$ might be retrieved evidence and citation alignment. For safety, $\mathcal{D}_{c}$ might be a set of boundary prompts rather than ordinary helpfulness prompts. For privacy, $\phi_{c}$ might be a membership-inference or memorization probe. This is why loss accounting must be scenario-specific. The relevant losses are not determined by distillation in the abstract; they are determined by what the student is expected to do.

4 From Loss Taxonomy to Scenario-Specific Preservation

Table 2: Scenario-specific preservation targets for distillation. The primary metric remains useful, but it should be paired with preservation targets for the off-metric capabilities that determine whether the distilled student is reliable in the intended setting.
| Distillation scenario | Typical primary metric | Capabilities to account for | Preservation targets and stress tests |
| --- | --- | --- | --- |
| Reasoning-model students | Math, science, coding, or reasoning benchmarks; pass rate; self-consistency score | Capability boundary, self-verification, process reliability, when to spend more inference compute | Multi-sample teacher behavior, verifier scores, failed traces, hard and near-boundary problems, tests of reasoning faithfulness |
| Code and tool-using agents | Task success, resolved issue rate, pass@k, patch correctness | Debugging trajectory, test generation, tool-use policy, recovery from failed attempts, on-policy stability | Tool traces, execution logs, failed patches, student-generated rollouts, ambiguous specifications, hidden-test stress cases |
| RAG and domain QA systems | Exact match, F1, factuality, answer preference | Retrieval policy, evidence selection, citation fidelity, no-answer behavior under missing evidence | Retrieved documents, selected evidence, attribution alignment, insufficient-evidence cases, citation-support checks |
| High-risk domain assistants | Expert QA accuracy, preference, case-level correctness | Calibration, abstention, scope boundary, escalation or referral behavior | Confidence and uncertainty estimates, abstention labels, risk categories, incomplete-context cases, shifted or ambiguous cases |
| Privacy-sensitive students | Task utility, compression, or retained benchmark score | Membership leakage, memorization, teacher-specific example inheritance, compliance-relevant data exposure | Membership-inference probes, exposure or canary tests, memorization probes, comparisons across soft and hard KD objectives |
| Safety and refusal students | Aggregate safety score, jailbreak success rate, helpfulness | Refusal boundary, over-refusal / under-refusal tradeoff, robustness to boundary prompts | Near-boundary unsafe requests, benign hard requests, multilingual jailbreaks, selective-refusal evaluations |
| Synthetic-data pipelines | Downstream accuracy, label agreement, generation cost | Tail coverage, diversity, minority modes, human heterogeneity | Rare-class coverage, diversity measures, human-anchor ratios, comparisons to real-data tails |
| On-device or local students | Retained benchmark score, latency, memory, throughput | Safety margin, calibration, multi-turn consistency, graceful degradation under limited compute | Resource-constrained stress tests, calibration probes, safety boundary tests, long-context or multi-turn probes |

The taxonomy in Section 3.3 is useful only if it changes what distillation papers choose to preserve and report. It should not be read as a universal checklist. A reasoning student, a RAG component, a privacy-sensitive assistant, and an on-device classifier need not preserve the same teacher capabilities. The central requirement is narrower and more practical: for each intended use, authors should identify which off-metric capabilities are consequential, choose preservation targets for them, and evaluate the student on stress distributions where those capabilities matter.

This shifts distillation from a generic score-retention problem to a scenario-specific preservation problem. In the notation of Section 2, a paper should not only report whether $m(S)\approx m(T)$, but also choose a small set of relevant capabilities $c\in\mathcal{C}$, observable proxies $\phi_{c}$, stress distributions $\mathcal{D}_{c}$, and discrepancy measures $d_{c}$. The goal is not to make every student imitate every teacher property. It is to make explicit which properties are being preserved, which are being ignored, and why that choice is appropriate for the deployment context.

4.1 What Common Distillation Scenarios Should Preserve

Table 2 gives representative preservation targets for common distillation settings. The table is intentionally phrased as guidance rather than a benchmark specification. It connects each scenario to the capability losses most likely to matter beyond the primary metric, and to the kinds of teacher signals or stress tests that would make those losses visible.

Several examples illustrate the logic. Reasoning distillation can transfer rationales or compress deliberative procedures, but this does not by itself establish that the student preserves faithful reasoning, self-verification, or the boundary between problems it can and cannot solve [5, 35, 36, 37, 14]. Agent distillation should therefore preserve more than the final answer: tool calls, failed attempts, tests, recovery behavior, and stability under student-generated rollouts can be part of the teacher capability that matters [6, 15]. RAG distillation similarly should not reduce evidence-grounded behavior to answer text; retrieval, attribution, and insufficient-evidence decisions are separate preservation targets [38, 39, 40]. Privacy-sensitive distillation should not infer privacy from compression alone, because membership leakage and memorization can change with the KD objective [12, 13, 29]. In safety, the relevant object is not refusal style but the boundary between appropriate refusal and over-refusal [41, 42, 23]. In synthetic-data pipelines, the crucial loss may be diversity and tail coverage rather than immediate downstream score [16, 43, 19]. Across these cases, the same principle holds: the preservation target should be chosen from the capability that makes the primary score trustworthy.

4.2 A Distillation Loss Statement

To make scenario-specific preservation actionable, we propose that distillation papers include a Distillation Loss Statement. This statement is a short reporting component, analogous in spirit to documentation practices for models and datasets [17, 18], but focused specifically on what changes when a teacher becomes a student. It should be concise enough to be used in ordinary distillation papers, while explicit enough to prevent score retention from being mistaken for capability preservation.

Table 3: A template for a Distillation Loss Statement. The template does not require every paper to measure every possible loss. It requires authors to state which losses matter for the intended use and how those losses were handled.
| Item | Question to answer |
| --- | --- |
| Deployment context | What is the student intended to do, and under what use conditions? |
| Primary metric | Which score is used to claim successful distillation? |
| Critical off-metric capabilities | Which teacher capabilities matter beyond the primary metric in this context? |
| Preservation targets | Which teacher signals or proxies are used to preserve those capabilities? |
| Stress distributions | On which boundary, shifted, ambiguous, adversarial, privacy-sensitive, on-policy, or tail cases is preservation evaluated? |
| Observed or unmeasured losses | Where does the student diverge from the teacher outside the primary metric, and which relevant losses were not measured? |
| Accepted losses or intended divergences | Which losses are considered acceptable, which divergences are corrective rather than harmful, and why? |
| Deployment implication | What should users, reviewers, or deployers infer from the remaining losses? |

Table 3 gives the reporting template. A Distillation Loss Statement could be written in a compact form:

This student is intended for [deployment context]. Its primary metric is [metric]. The critical off-metric teacher capabilities are [capabilities]. We attempt to preserve them using [teacher signals or proxies] and evaluate them on [stress distributions]. The student diverges from the teacher in [observed losses]. We consider these losses [acceptable or unacceptable], or these divergences [intended or corrective], because [justification], with the following deployment implications: [implications].
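For authors who want a machine-readable version, the template could be represented as a small record type. The sketch below is one possible encoding, not a specification: the class name `DistillationLossStatement`, its field names, and the example values are hypothetical, and `render` only loosely follows the compact form above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DistillationLossStatement:
    """One field per item of the reporting template (hypothetical encoding)."""
    deployment_context: str
    primary_metric: str
    critical_capabilities: List[str]
    preservation_targets: List[str]
    stress_distributions: List[str]
    observed_or_unmeasured_losses: List[str]
    accepted_losses: str  # verdict plus justification, as free text
    deployment_implication: str

    def missing_items(self) -> List[str]:
        """Names of template items that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Loose plain-prose rendering of the compact form."""
        join = ", ".join
        return (
            f"This student is intended for {self.deployment_context}. "
            f"Its primary metric is {self.primary_metric}. "
            f"The critical off-metric teacher capabilities are "
            f"{join(self.critical_capabilities)}. "
            f"We attempt to preserve them using {join(self.preservation_targets)} "
            f"and evaluate them on {join(self.stress_distributions)}. "
            f"The student diverges from the teacher in "
            f"{join(self.observed_or_unmeasured_losses)}. "
            f"{self.accepted_losses} "
            f"Deployment implications: {self.deployment_implication}"
        )


# Hypothetical, deliberately incomplete statement.
statement = DistillationLossStatement(
    deployment_context="an on-device support assistant",
    primary_metric="answer accuracy",
    critical_capabilities=["calibration", "abstention"],
    preservation_targets=["teacher confidence estimates"],
    stress_distributions=["ambiguous and shifted queries"],
    observed_or_unmeasured_losses=[],
    accepted_losses="",
    deployment_implication="",
)
print(statement.missing_items())
```

A check such as `missing_items` makes an unfilled item visible rather than silently absent, which matches the intent that unmeasured losses be disclosed.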

We intentionally do not prescribe universal quantitative thresholds for acceptable loss. Risk-management and documentation frameworks emphasize that evaluation and acceptable use should be tied to intended use, limitations, and deployment context rather than treated as context-free properties of a model [17, 18, 45]. A calibration gap that is acceptable for a toy classifier may be unacceptable for a medical assistant; a refusal-boundary shift that is acceptable for a creative-writing assistant may be unacceptable for a safety-critical deployment; a memorization profile that is tolerable for a public benchmark model may be unacceptable for a private-data setting. The Distillation Loss Statement therefore standardizes the question, not the answer. Authors should identify the relevant capability, choose a suitable proxy or metric when one is available, evaluate it on a meaningful stress distribution, and justify the acceptability of the resulting loss in context.

The statement has two purposes. First, it disciplines evaluation: authors must decide whether calibration, grounding, safety boundary, privacy, diversity, process reliability, or some other capability is relevant, rather than implicitly treating the primary score as sufficient. Second, it disciplines interpretation: reviewers and deployers can see whether a distilled model is a faithful substitute, a task-specific approximation, a corrective departure from an imperfect teacher, or a useful but narrower system. This is why the proposal is compatible with successful distillation. A student may be smaller, faster, cheaper, and still worth deploying even if it loses some teacher capabilities. The point is that these losses should be named, measured when possible, and justified rather than hidden behind retained benchmark performance.

5 Alternative Views and Adjacent Work

This section positions the proposal against strong objections and adjacent research. The objections clarify what we do not claim. The adjacent work clarifies what prior research already contributes, and why a general norm for distillation loss accounting is still needed.

5.1 Strong Objections

Strong objections fall into five groups: expected loss, low-risk scope, measurement burden, black-box access, and corrective divergence. First, distillation is already a lossy compression method: classical compression and KD were designed to make useful students smaller, cheaper, or faster, not to reproduce every teacher property [46, 1, 2]. This objection supports our claim. If losses are expected, then the consequential ones should be named and justified rather than hidden behind retained benchmark performance.

Second, primary metrics may be enough in closed, low-risk settings. We agree. A narrow task with stable inputs and limited deployment consequences may need only a short loss statement. This is why our proposal is scenario-specific: it follows model, data, and risk-documentation traditions that tie reporting to intended use, limitations, and operating conditions [17, 18, 45].

Third, loss accounting may be costly or inconsistent without default thresholds. The cost concern is real, but universal thresholds would be misleading. Measurement choices must match the construct being measured, and AI risk-management frameworks treat risk tolerance as context-dependent rather than fixed across use cases [19, 45]. We therefore standardize the reporting question, not the numerical answer: authors should measure consequential losses when suitable proxies exist, disclose when they do not, and justify acceptability in context.

Fourth, frontier or black-box teachers make perfect accounting impossible. In LLM distillation, available teacher signals vary across black-box and white-box settings [4]. But black-box access still permits behavioral proxies: multi-sample behavior, self-consistency, refusal boundaries, tool traces, retrieval sets, privacy probes, stress distributions, and human or automated evaluations can reveal divergences that final scores miss [47, 25, 6, 12, 13]. The appropriate demand is explicit measurement of what can be observed and explicit acknowledgment of what cannot.
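To illustrate how far output-only access can go, the sketch below computes two of the behavioral proxies named above from sampled answers alone: per-prompt self-consistency and teacher-student agreement on the majority answer. It assumes that several final answers per prompt have already been collected from each model and normalized to comparable strings; the function names and toy samples are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_answer(samples: Sequence[str]) -> str:
    """Most frequent answer among the samples for one prompt."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][0]


def self_consistency(samples: Sequence[str]) -> float:
    """Fraction of samples that match the majority answer for one prompt."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][1] / len(samples)


def compare_black_box(
    teacher_samples: Dict[str, List[str]],
    student_samples: Dict[str, List[str]],
) -> Dict[str, float]:
    """Summarize behavioral divergence over the prompts both models answered."""
    prompts = sorted(set(teacher_samples) & set(student_samples))
    if not prompts:
        raise ValueError("no shared prompts to compare")
    agree = sum(
        majority_answer(teacher_samples[p]) == majority_answer(student_samples[p])
        for p in prompts
    )
    n = len(prompts)
    return {
        "majority_agreement": agree / n,
        "teacher_self_consistency": sum(self_consistency(teacher_samples[p]) for p in prompts) / n,
        "student_self_consistency": sum(self_consistency(student_samples[p]) for p in prompts) / n,
    }


# Toy samples for illustration only.
toy_teacher_samples = {"q1": ["A", "A", "B"], "q2": ["C", "C", "C"]}
toy_student_samples = {"q1": ["A", "B", "B"], "q2": ["C", "C", "D"]}
print(compare_black_box(toy_teacher_samples, toy_student_samples))
```

A student can match the teacher's headline accuracy while showing lower self-consistency or a different majority answer on stress prompts, which is the kind of divergence a final score alone would not reveal.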

Fifth, students may outperform or intentionally correct their teachers. Born-again and self-distillation results show that a student may exceed the teacher on some metrics [48]. A student may also correct an undesirable teacher behavior, such as over-refusal, unsupported citation, or overconfident prediction [41, 40, 30]. Loss accounting is not teacher worship; it asks authors to distinguish harmful loss, acceptable narrowing, unmeasured divergence, and intended correction.

Together, these objections narrow the proposal rather than refute it. We do not argue for exhaustive preservation, universal metric inflation, or a single benchmark for all distillation settings. We argue for a reporting norm proportional to intended use. The more a student substitutes for a teacher in open, uncertain, privacy-sensitive, or high-stakes settings, the stronger the need to account for capabilities that may have been projected away.

5.2 Related Work

Adjacent work falls into four broad groups. The first studies teacher–student fidelity and asks what distillation actually transfers: predictive distributions, learned invariances, intermediate representations, or other properties beyond the final answer [7, 8, 9]. The second develops preservation methods or diagnostics for particular capabilities, such as representation, relation, robustness, calibration, fairness, reasoning traces, agent behavior, grounding, refusal behavior, privacy, memorization, on-policy stability, or behavioral fidelity under metamorphic tests [26, 27, 21, 10, 11, 35, 6, 23, 12, 13, 29, 15, 44]. The third proposes broader documentation and evaluation norms, such as model cards, datasheets, and position papers calling for more careful measurement of under-specified claims [17, 18, 19, 20]. The fourth consists of existing worked instances in which a single domain already measures a loss that ordinary score retention would miss, such as privacy leakage in LLM KD or behavioral discrepancies in distilled code models [13, 44].

These lines of work are complementary to our position, but they do not replace it. Fidelity studies show that students and teachers can diverge, but they do not provide a general reporting norm for intended-use losses. Capability-specific methods show that particular losses can be reduced or diagnosed, but they do not tell authors which losses matter in which deployment scenario. Documentation frameworks show that machine learning artifacts can be reported more responsibly, but they do not focus on the teacher-to-student transformation itself. Existing single-domain loss-accounting studies show that the approach is feasible, but they remain local. Our contribution is to connect these threads into a single claim: distillation should be treated as a lossy transformation whose consequential losses must be accounted for. This claim is not mutually exclusive with prior methods. It explains when those methods are needed, how their results should be interpreted, and why retained score alone should not be the default evidence of successful distillation.

6 Conclusion

This paper argues that knowledge distillation should be evaluated not only by what students retain on headline metrics, but also by what they lose in teacher capabilities that are critical for deployment. We identified the retention assumption underlying score-centered evaluation, reframed distillation as a lossy projection, synthesized evidence of off-metric losses, and proposed scenario-specific preservation targets together with a Distillation Loss Statement. This matters because distillation is increasingly the bridge between frontier-scale models and deployable systems. What is lost at that bridge can determine whether a student is merely smaller or also narrower, less calibrated, less grounded, less privacy-preserving, less stable on its own rollouts, and less safe. If adopted, loss accounting would change how distillation results are interpreted: retained performance would no longer count as sufficient evidence of preserved capability. It would give authors, reviewers, and deployers a vocabulary for distinguishing useful compression from unexamined capability erosion, and could shift distillation research from score retention toward accountable model transformation.

The main limitation of this position is that off-metric capabilities are difficult to define and measure, especially when the teacher is black-box and the relevant capability is only observable through imperfect behavioral proxies. Some teacher behaviors also should not be preserved, because they may encode bias, overconfidence, excessive refusals, or other undesirable traits. These limitations do not remove the need for loss accounting. They clarify its purpose: distillation papers should not promise lossless transfer, but should make the consequential losses visible, measurable when possible, and justified for the intended use.

References

  • Hinton et al. [2015] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
  • Gou et al. [2021] Gou, J., Yu, B., Maybank, S. J., and Tao, D. (2021). Knowledge Distillation: A Survey. International Journal of Computer Vision, 129(6):1789–1819.
  • Sanh et al. [2019] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108.
  • Xu et al. [2024] Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D., and Zhou, T. (2024). A Survey on Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2402.13116.
  • DeepSeek-AI [2025] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  • Kang et al. [2025] Kang, M., Jeong, J., Lee, S., Cho, J., and Hwang, S. J. (2025). Distilling LLM Agent into Small Models with Retrieval and Code Tools. OpenReview, NeurIPS 2025 Conference (Spotlight).
  • Stanton et al. [2021] Stanton, S., Izmailov, P., Kirichenko, P., Alemi, A. A., and Wilson, A. G. (2021). Does Knowledge Distillation Really Work? In Advances in Neural Information Processing Systems.
  • Ojha et al. [2023] Ojha, U., Li, Y., Sundara Rajan, A., Liang, Y., and Lee, Y. J. (2023). What Knowledge Gets Distilled in Knowledge Distillation? In Advances in Neural Information Processing Systems, 36:11037–11048.
  • Mohanty et al. [2023] Mohanty, S., Roosta, T., and Passban, P. (2023). What Is Lost in Knowledge Distillation? arXiv preprint arXiv:2311.04142.
  • Hebbalaguppe et al. [2024] Hebbalaguppe, R., Baranwal, M., Prakash, J., Madan, N., Anand, K., and Arora, C. (2024). Understanding Calibration Transfer in Knowledge Distillation. OpenReview, ICLR 2024 withdrawn submission.
  • Stacey and Rei [2024] Stacey, J. and Rei, M. (2024). Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2239–2258.
  • Jagielski et al. [2023] Jagielski, M., Nasr, M., Lee, K., Choquette-Choo, C. A., Carlini, N., and Tramèr, F. (2023). Students Parrot Their Teachers: Membership Inference on Model Distillation. In Advances in Neural Information Processing Systems, 36.
  • Zhang et al. [2025] Zhang, Z., Shamsabadi, A. S., Lu, H., Cai, Y., and Haddadi, H. (2025). Membership and Memorization in LLM Knowledge Distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20074–20084.
  • Chen et al. [2025] Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S. R., Leike, J., Kaplan, J., and Perez, E. (2025). Reasoning Models Don’t Always Say What They Think. arXiv preprint arXiv:2505.05410.
  • Song and Zheng [2026] Song, M. and Zheng, M. (2026). A Survey of On-Policy Distillation for Large Language Models. arXiv preprint arXiv:2604.00626.
  • Shumailov et al. [2024] Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. (2024). AI Models Collapse When Trained on Recursively Generated Data. Nature, 631:755–759.
  • Mitchell et al. [2019] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229.
  • Gebru et al. [2021] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12):86–92.
  • Zhao et al. [2024] Zhao, D., Andrews, J. T. A., Papakyriakopoulos, O., and Xiang, A. (2024). Position: Measure Dataset Diversity, Don’t Just Claim It. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 60644–60673.
  • Tramèr et al. [2024] Tramèr, F., Kamath, G., and Carlini, N. (2024). Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 48453–48467.
  • Shao et al. [2022] Shao, R., Yi, J., Chen, P.-Y., and Hsieh, C.-J. (2022). How and When Adversarial Robustness Transfers in Knowledge Distillation? arXiv preprint arXiv:2110.12072.
  • Mohammadshahi and Ioannou [2025] Mohammadshahi, A. and Ioannou, Y. (2025). What Is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias. Transactions on Machine Learning Research.
  • Zhang et al. [2026] Zhang, M., Liu, D., Zhang, K., Franco, J., and Liu, H. (2026). Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety. arXiv preprint arXiv:2602.11157.
  • Menon et al. [2021] Menon, A. K., Rawat, A. S., Reddi, S. J., Kim, S., and Kumar, S. (2021). A Statistical Perspective on Distillation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7632–7642.
  • Gu et al. [2024] Gu, Y., Dong, L., Wei, F., and Huang, M. (2024). MiniLLM: Knowledge Distillation of Large Language Models. In International Conference on Learning Representations.
  • Romero et al. [2015] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. In International Conference on Learning Representations.
  • Wang et al. [2020] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems.
  • Lukasik et al. [2021] Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. (2021). Teacher’s Pet: Understanding and Mitigating Biases in Distillation. arXiv preprint arXiv:2106.10494.
  • Borkar et al. [2026] Borkar, J., Chadha, K., Mireshghallah, N., Zhang, Y., Veliche, I.-E., Mitra, A., Smith, D. A., Xu, Z., and Garcia-Olano, D. (2026). Memorization Dynamics in Knowledge Distillation for Language Models. arXiv preprint arXiv:2601.15394.
  • Guo et al. [2017] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330.
  • Fan et al. [2024] Fan, H., Jiang, Z., Lei, J., and Zhang, M. (2024). Revisit the Essence of Distilling Knowledge Through Calibration. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 12882–12894.
  • Geng et al. [2024] Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. (2024). A Survey of Confidence Estimation and Calibration in Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577–6595.
  • Kapoor et al. [2024] Kapoor, S., Gruver, N., Roberts, M., Collins, K., Pal, A., Bhatt, U., Weller, A., Dooley, S., Goldblum, M., and Wilson, A. G. (2024). Large Language Models Must Be Taught to Know What They Don’t Know. In Advances in Neural Information Processing Systems, 37.
  • Wen et al. [2025] Wen, B., Yao, J., Feng, S., Xu, C., Tsvetkov, Y., Howe, B., and Wang, L. L. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics, 13.
  • Hsieh et al. [2023] Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y., Ratner, A., Krishna, R., Lee, C.-Y., and Pfister, T. (2023). Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017.
  • Yu et al. [2024] Yu, P., Xu, J., Weston, J., and Kulikov, I. (2024). Distilling System 2 into System 1. arXiv preprint arXiv:2407.06023.
  • Turpin et al. [2023] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. In Advances in Neural Information Processing Systems, 36:74952–74965.
  • Lewis et al. [2020] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 33.
  • Jia et al. [2025] Jia, P., Xu, D., Li, X., Du, Z., Li, X., Wang, Y., Wang, Y., Liu, Q., Wang, M., Guo, H., Tang, R., and Zhao, X. (2025). Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4242–4256.
  • Huang et al. [2024] Huang, L., Feng, X., Ma, W., Gu, Y., Zhong, W., Feng, X., Yu, W., Peng, W., Tang, D., Tu, D., and Qin, B. (2024). Learning Fine-Grained Grounded Citations for Attributed Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14095–14113.
  • Cui et al. [2024] Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. (2024). OR-Bench: An Over-Refusal Benchmark for Large Language Models. arXiv preprint arXiv:2405.20947.
  • Muhamed et al. [2026] Muhamed, A., Ribeiro, L. F. R., Dreyer, M., Smith, V., and Diab, M. T. (2026). RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6811–6856.
  • Gerstgrasser et al. [2024] Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Korbak, T., Sleight, H., Agrawal, R., Hughes, J., Pai, D. B., Gromov, A., Roberts, D., Yang, D., Donoho, D. L., and Koyejo, S. (2024). Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv preprint arXiv:2404.01413.
  • Awal et al. [2025] Awal, M. A., Rochan, M., and Roy, C. K. (2025). A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher? arXiv preprint arXiv:2511.05476.
  • Tabassi [2023] Tabassi, E. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-1.
  • Bucila et al. [2006] Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model Compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541.
  • Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations.
  • Furlanello et al. [2018] Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018). Born Again Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1607–1616.
  • Phuong and Lampert [2019] Phuong, M. and Lampert, C. (2019). Towards Understanding Knowledge Distillation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5142–5151.
  • Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, 35:24824–24837.
  • Li et al. [2023] Li, Y., Zhang, H., Cao, J., Ma, X., and Gao, J. (2023). Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2665–2679.
  • Lanham et al. [2023] Lanham, T., Garriga-Alonso, A., Cooper, A. F., Hill, K., Greenblatt, R., Noble, R., Birch, A., and others (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv preprint arXiv:2307.13702.
  • Madsen et al. [2024] Madsen, A., Chandar, S., and Reddy, S. (2024). Are Self-Explanations from Large Language Models Faithful? In Findings of the Association for Computational Linguistics: ACL 2024, pages 295–337.
  • Cao [2024] Cao, L. (2024). Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3628–3646.

Appendix A Representative Reporting-Pattern Checklist

Table 4 is a representative 50-paper checklist used to support the reporting-pattern discussion in Section 3.1. It is not a systematic meta-analysis and does not validate or reject the empirical claims of the cited papers. Instead, it records how the literature is distributed across two kinds of evidence: retention evidence, which primarily documents task performance, compression, or deployment efficiency; and loss evidence, which makes a teacher–student divergence visible beyond the primary metric. The point is not that every paper should report every row. The point is that the field already has many local tools for measuring particular losses, but lacks a general norm requiring authors to state which losses matter for the intended use.

Table 4: Representative 50-paper reporting-pattern checklist. The table maps cited work to the kind of evidence it makes salient for this position paper.
| Work | Evidence type | Role in our argument |
| --- | --- | --- |
| [1] | Retention and distribution | Introduces soft targets as information beyond hard labels. |
| [2] | Method taxonomy | Distinguishes response-, feature-, and relation-based KD. |
| [3] | Retention evidence | Shows successful compression and retained language-understanding performance. |
| [4] | LLM KD background | Documents the diversity of modern LLM distillation settings. |
| [5] | Reasoning distillation | Illustrates the contemporary importance of distilled reasoning students. |
| [6] | Agent/tool distillation | Shows that tool behavior can become a distillation target. |
| [7] | Distribution loss | Shows student predictive distributions may diverge from teachers. |
| [24] | Distribution loss | Explains why teacher probability estimates can matter beyond accuracy. |
| [25] | Generative distribution | Studies how distribution-matching choices affect LLM KD. |
| [49] | Theory | Analyzes why KD can work without reducing success to score retention. |
| [8] | Property transfer | Studies which off-task properties are inherited by students. |
| [9] | Loss study | Directly studies information loss between teacher and student. |
| [26] | Representation preservation | Uses intermediate hints, showing outputs alone may be insufficient. |
| [27] | Relation preservation | Transfers attention and value relations, not only final outputs. |
| [48] | Counterpoint | Shows students may outperform teachers on some metrics. |
| [21] | Robustness loss | Shows adversarial robustness may fail to transfer under KD. |
| [11] | OOD loss | Shows in-distribution gains do not guarantee target robustness. |
| [28] | Subgroup behavior | Studies uneven group-wise effects of distillation. |
| [22] | Fairness loss | Examines fairness and bias after knowledge transfer. |
| [30] | Calibration metric | Establishes confidence calibration as distinct from accuracy. |
| [10] | Calibration transfer | Studies whether calibration transfers through KD. |
| [31] | Calibration as KD | Treats calibration as central to distilling knowledge. |
| [32] | Uncertainty background | Surveys confidence estimation and calibration in LLMs. |
| [33] | Uncertainty behavior | Argues models must learn what they do not know. |
| [34] | Abstention | Surveys abstention as a distinct LLM capability. |
| [35] | Rationale distillation | Shows rationales can improve small-model learning. |
| [50] | Reasoning traces | Establishes chain-of-thought as an important reasoning signal. |
| [51] | CoT distillation | Shows small models can learn step-by-step symbolic rationales. |
| [36] | Process compression | Compresses deliberative procedures into faster students. |
| [52] | Process faithfulness | Measures whether CoT reflects underlying reasoning. |
| [37] | Unfaithful rationales | Shows explanations can misrepresent model reasoning. |
| [53] | Explanation faithfulness | Studies faithfulness of LLM self-explanations. |
| [14] | Reasoning faithfulness | Shows reasoning models may not disclose what drives answers. |
| [38] | Grounding architecture | Establishes retrieval-augmented generation as a source-grounded setting. |
| [39] | RAG distillation | Distills rationales for retrieval-augmented generation. |
| [40] | Citation fidelity | Studies fine-grained grounded citations. |
| [41] | Over-refusal | Evaluates excessive refusal as a distinct failure mode. |
| [42] | Selective refusal | Evaluates grounded selective refusal. |
| [23] | Safety transfer | Shows response-based KD can compromise jailbreak prevention. |
| [54] | Refusal mechanism | Treats refusal as an explicit capability. |
| [16] | Tail loss | Shows recursive generated-data training can collapse distribution tails. |
| [43] | Synthetic-data mitigation | Studies when model collapse may be avoided. |
| [19] | Measurement norm | Argues diversity claims require explicit measurement. |
| [17] | Reporting norm | Provides a model documentation precedent. |
| [18] | Reporting norm | Provides a dataset documentation precedent. |
| [44] | Worked loss instance | Uses metamorphic testing to reveal behavioral discrepancies in distilled code models. |
| [12] | Privacy leakage | Shows distillation provides only limited protection against membership inference. |
| [13] | LLM privacy loss | Studies membership and memorization risks in LLM KD. |
| [29] | Memorization dynamics | Shows KD can change memorization and teacher-specific inheritance profiles. |
| [15] | On-policy stability | Connects static teacher data to exposure bias and student-rollout instability. |
