arXiv:2605.04062 · cs.LG · uncurated · rendered via ar5iv

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation


Shu-Hao Zhang, School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Le-Tong Huang, School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Xiang-Sheng Deng, School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Xin-Yi Zou, Microsoft AI, Beijing 100080, China
Chen Wu, Microsoft AI, Beijing 100080, China
Nan Li, Microsoft AI, Beijing 100080, China
Shao-Qun Zhang, School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Zhi-Hua Zhou, School of Artificial Intelligence, Nanjing University, Nanjing 210063, China; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210063, China
Email: {zhangsh,zhangsq}@lamda.nju.edu.cn

Abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on the MobileLLM and Qwen families show that, under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 points and surpasses the strongest 3-bit baselines by 4.38 points, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4–10× lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16× over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor.

Figure 1: Performance comparison of EdgeRazor and strong baselines.

1 Introduction

Large language models (LLMs) have attracted widespread attention across domains, driven by scaling laws indicating that performance improves predictably with increases in model size, dataset size, and training budget. There is an increasing demand to deploy lightweight LLMs on resource-constrained devices, where limited storage, memory, and computational capacity impose constraints that full-precision models struggle to meet (Zheng et al., 2025). Quantization is an effective compression technique that maps full-precision weights and activations to low-bit representations (Zhu et al., 2024). An effective method should satisfy several prerequisites, including preserving performance, offering flexible bit-widths for deployability on resource-constrained hardware, and maintaining manageable training overhead (Tan et al., 2024).

Existing LLM quantization methods can be broadly categorized into three paradigms: Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Quantization-Aware Distillation (QAD). PTQ calibrates quantized parameters on a small dataset without retraining (Frantar et al., 2022; Lin et al., 2024), while incurring significant performance degradation at lower bit-widths (Dettmers and Zettlemoyer, 2023). QAT learns low-bit parameters using surrogate gradients (Bengio et al., 2013) via dataset-driven gradient updates to fit downstream tasks and preserve performance below 4-bit. Nevertheless, QAT incurs substantial training costs (Liu et al., 2025a; Wang et al., 2025). QAD integrates QAT with knowledge distillation (Hinton et al., 2015a; Zhou and Jiang, 2004) from a full-precision teacher to reduce the training cost (Liu et al., 2023). However, QAD methods typically rely on heuristic approaches to pre-specify teacher layers for supervision (Xu et al., 2024), and restrict the switching criterion between forward and reverse KLD exclusively to teacher-distilled data (Du et al., 2024), which precludes flexible data recipes that combine human-annotated and externally distilled corpora (Wu et al., 2025). Additionally, uniform and discrete bit-widths ignore varying parameter sensitivities (Lin et al., 2024) and restrict diverse deployment budgets (Lee and Song, 2025). While mixed-precision utilizes parameter importance for allocation (Huang et al., 2025), it is confined to PTQ methods. Since continuous updates cause severe salience drift during training, a more drift-robust mixed-precision strategy is preferred in QAT and QAD methods.

In this paper, we propose EdgeRazor, a lightweight framework for compressing LLMs with flexible bit-widths via Mixed-Precision Quantization-Aware Distillation (MPQAD). Figure 2 illustrates the workflow. EdgeRazor comprises three configurable modules: Structural Quantization with Mixed Precision (SQMP) for fine-grained control over the matrix-wise average bit-width, Layer-Adaptive Feature Distillation (LAFD) that adaptively selects informative teacher layers for feature-level supervision, and Entropy-Aware KL Divergence (EAKLD) that balances forward and reverse KLD by the entropy of the teacher’s output distribution, extending logit distillation to both human-annotated and distilled datasets. These modules improve quantized models across base, instruction-tuned, and multimodal LLMs, as summarized in Figure 1. On Qwen3-0.6B under weight-activation quantization, the 1.88-bit EdgeRazor retains strong performance on downstream tasks, outperforming the state-of-the-art 2-bit PTQ baseline by 11.27 points and surpassing all competing baselines by 4.38 points at 3-bit precision across 14 domain-specific tasks. These performance gains generalize to other architectures, such as MobileLLM-350M, with a training budget 4–10× lower than that of the leading QAT method. For deployment, executing the 1.58-bit Qwen3-0.6B-EdgeRazor through llama.cpp on an Apple M4 Pro chip decreases storage from 1.11 GB to 0.19 GB, while achieving a 15.16× decoding speedup over the 16-bit baseline.

The rest of this paper is organized as follows. Section 2 introduces related works. Section 3 proposes the EdgeRazor framework. Section 4 conducts experiments. Section 5 concludes this work.

Figure 2: Workflow of the EdgeRazor framework.

2 Related works

PTQ compresses LLMs by calibrating quantized parameters on a small dataset without retraining. To maintain performance, existing methods employ local error compensation, such as weight adjustments through inverse Hessian approximation (Frantar et al., 2022), activation-aware scaling and outlier smoothing (Lin et al., 2024; Xiao et al., 2023), and vector quantization space partitioning (Tseng et al., 2024a), preserving near-lossless performance at 4-bit and above. Furthermore, mixed-precision PTQ attempts to optimize structural capacity by heuristically allocating heterogeneous bit-widths across distinct layers or groups (Guan et al., 2024; Huang et al., 2025; Lee and Song, 2025), thereby achieving better accuracy-efficiency trade-offs. Since calibration-driven strategies lack end-to-end gradient supervision, PTQ often suffers substantial performance degradation in sub-4-bit settings, thereby limiting its viability for resource-constrained deployment (Dettmers and Zettlemoyer, 2023).

QAT uses supervised gradient updates with surrogate gradients to optimize low-bit models for target tasks (Bengio et al., 2013). Existing methods typically adopt two paradigms: training natively quantized LLMs entirely from scratch, as pioneered by BitNet (Wang et al., 2025), and fine-tuning from full-precision pre-trained models via block-wise reconstruction and optimized training budgets, exemplified by EfficientQAT (Chen et al., 2025) and ParetoQ (Liu et al., 2025a), which significantly advance the frontier of LLM compression to 2-bit and lower. Nevertheless, these methods typically require substantial computational resources and large-scale corpora to converge (Liu et al., 2025a; Wang et al., 2025), rendering the paradigm prohibitively expensive for downstream adaptations.

QAD integrates QAT with knowledge distillation (Hinton et al., 2015a; Zhou and Jiang, 2004) from a full-precision teacher to alleviate the prohibitive computational demands of QAT. Existing works align output logits and intermediate features, thereby enabling 4-bit compression by data-free generation (Liu et al., 2023) and 1-bit structural decomposition (Xu et al., 2024). Furthermore, recent advancements combine the mode-covering forward KLD (FKLD) (Hinton et al., 2015b) with the mode-seeking reverse KLD, utilizing metrics such as teacher prediction confidence to improve zero-shot performance at lower bit-widths (Du et al., 2024). However, existing QAD approaches are limited by heuristic layer-selection strategies that struggle to generalize across architectures and provide little guidance for selection (Wang et al., 2020), as well as by inflexible KLD-switching criteria that rely exclusively on teacher-distilled data, thereby restricting the use of diverse training recipes (Wu et al., 2025).

3 EdgeRazor

This section presents EdgeRazor; Figure 2 provides an overview of the framework. EdgeRazor compresses LLMs via MPQAD and consists of three novel modules: SQMP (Section 3.1), which organizes per-channel bit-widths into a periodic super-group pattern with an adjustable ratio; LAFD (Section 3.2), which aligns intermediate representations between the student and teacher models by dynamically identifying the most informative layers to supervise; and EAKLD (Section 3.3), which relies solely on the teacher’s output distribution to integrate forward and reverse KLD. The overall training objective combines the low-bit student’s task-specific cross-entropy loss $\mathcal{L}_{\mathrm{task}}$ with the feature and logit distillation losses,

\[
\mathcal{L}=\alpha_{\mathrm{task}}\,\mathcal{L}_{\mathrm{task}}+\alpha_{\mathrm{feature}}\,\mathcal{L}_{\mathrm{feature}}+\alpha_{\mathrm{logit}}\,\mathcal{L}_{\mathrm{logit}}\,, \tag{1}
\]

where $\mathcal{L}_{\mathrm{feature}}$ and $\mathcal{L}_{\mathrm{logit}}$ correspond to LAFD and EAKLD, respectively, and $\alpha_{\mathrm{task}}$, $\alpha_{\mathrm{feature}}$, and $\alpha_{\mathrm{logit}}$ are balancing coefficients.

3.1 Structural quantization with mixed precision

Based on the per-group quantization function $Q_{n\text{-bit}}(\mathbf{X})$ detailed in Appendix A.1 and the QAD paradigm, we propose SQMP to determine the effective bit-width. To control bit-widths, SQMP assigns a tunable parameter $\rho\in[0,1]$ indicating the proportion of weights assigned to 4-bit. We organize this structural mixed-precision quantization into a regular repeating super-group allocation: every $\lfloor 1/\rho\rceil$ consecutive output channels form one super-group, wherein one channel is quantized to 4-bit and the remainder to 1.58-bit. The illustration is provided in Appendix A.2. Since every super-group maintains the same configuration, altering $\rho$ yields fine-grained, smooth control over the fractional bit-width, accommodating diverse deployment budgets. Then, through super-group allocation and matrix multiplication $\mathbf{Y}=\mathbf{W}\mathbf{X}$, each output element $Y_{i,l}$ is computed as

\[
Y_{i,l}=\mathbf{W}^{G}_{i,\cdot}\,\mathbf{X}^{G}_{\cdot,l}=\sum_{j=0}^{J-1}\underbrace{s^{W}_{i,j}\cdot s^{X}_{j,l}}_{\text{①}}\;\cdot\;\underbrace{Q_{n\text{-bit}}(\mathbf{W}^{G}_{i,j})^{\top}Q_{8\text{-bit}}(\mathbf{X}^{G}_{j,l})}_{\text{②}}\ ,\quad n\in\{1.58,4\}\ , \tag{2}
\]

where $J=d_{\text{in}}/G$ denotes the total number of groups along the input dimension $d_{\text{in}}$ for a given group size $G$. The group-wise vectors $\mathbf{W}^{G}_{i,j}$ and $\mathbf{X}^{G}_{j,l}$ represent the $j$-th group of the $i$-th weight output channel and the $l$-th activation token, respectively. The terms $s^{W}_{i,j}$ and $s^{X}_{j,l}$ are the corresponding 16-bit scaling factors, which are multiplied together to form the combined scaling factor ①, and ② is the low-bit integer dot product between the quantized weight and activation groups, where $n\in\{1.58,4\}$ is strictly governed by the aforementioned super-group allocation. This factorization facilitates inference acceleration, as the integer arithmetic in ② can be offloaded to efficient kernels.
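
As a rough illustration of Equation (2), the sketch below computes a single output element in pure Python. The quantizers are assumptions, since the actual $Q_{n\text{-bit}}$ is defined in Appendix A.1 and not reproduced here: `quant_ternary` uses an absmean scale with rounding to {-1, 0, 1}, and `quant_sym` uses a symmetric absmax scale. All function names are hypothetical.

```python
def quant_sym(group, bits):
    """Symmetric absmax quantizer (assumed): returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in group) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in group]
    return scale, codes


def quant_ternary(group):
    """1.58-bit quantizer (assumed absmean variant): codes in {-1, 0, 1}."""
    scale = sum(abs(v) for v in group) / len(group) or 1.0
    codes = [max(-1, min(1, round(v / scale))) for v in group]
    return scale, codes


def output_element(w_row, x_col, group_size, weight_bits):
    """Y_{i,l}: sum over groups of (s^W * s^X) * (integer dot product)."""
    assert len(w_row) == len(x_col) and len(w_row) % group_size == 0
    total = 0.0
    for start in range(0, len(w_row), group_size):
        w_group = w_row[start:start + group_size]
        x_group = x_col[start:start + group_size]
        if weight_bits == 4:
            s_w, q_w = quant_sym(w_group, 4)
        else:
            s_w, q_w = quant_ternary(w_group)
        s_x, q_x = quant_sym(x_group, 8)                 # activations are 8-bit
        int_dot = sum(a * b for a, b in zip(q_w, q_x))   # term 2: integers only
        total += (s_w * s_x) * int_dot                   # term 1: combined scale
    return total
```

Keeping the two scales outside the inner sum is what allows term ② to run on integer kernels; in the ternary branch the inner product reduces to additions and subtractions of activation codes.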

Unlike PTQ methods that statically preserve sensitive output channels (Lin et al., 2024; Huang et al., 2025), QAT and QAD continuously update parameters, causing the salient weights to shift correspondingly. We provide a theoretical justification for super-group allocation in Appendix A.3.
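
The super-group pattern can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 4-bit channel sits at the first position of each super-group (the actual layout is shown in Appendix A.2), and it ignores the storage cost of scaling factors when averaging. The function names are hypothetical.

```python
import math


def supergroup_size(rho):
    """Channels per super-group: 1/rho rounded to the nearest integer."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]; rho = 0 means every channel is 1.58-bit")
    return math.floor(1.0 / rho + 0.5)


def channel_bits(num_channels, rho):
    """Per-output-channel bit-widths under the periodic super-group pattern."""
    if rho == 0:
        return [1.58] * num_channels
    size = supergroup_size(rho)
    # Assumption: the 4-bit channel is the first channel of each super-group.
    return [4 if i % size == 0 else 1.58 for i in range(num_channels)]


def average_bits(num_channels, rho):
    """Mean weight bit-width over channels, excluding scale overhead."""
    bits = channel_bits(num_channels, rho)
    return sum(bits) / len(bits)
```

Under these assumptions, $\rho=1/2$ and $\rho=1/8$ give averages of 2.79 and 1.8825 bits, which appears consistent with the 2.79-bit and 1.88-bit settings listed in Table 1.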

3.2 Layer-adaptive feature distillation

We propose LAFD to adaptively identify the most informative layers for each input using a structural-similarity-based importance metric computed from the teacher. LAFD is motivated by the observation that consecutive transformer layers do not contribute equally to the overall feature transformation (Tenney et al., 2019). Our empirical analysis in Appendix B reveals that these transformation patterns are highly domain-dependent, which indicates the limitations of heuristic fixed-layer selection. We explicitly quantify this structural similarity using cosine similarity to assess representational transformation, as angular differences between contextual features can reflect semantic and structural changes (Ethayarajh, 2019). To quantify this transformation across layers, we compute the mean cosine similarity between the outputs of adjacent teacher layers across the set of valid token positions $\mathcal{T}$ excluding padding,

\[
c_{l}=\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}\cos\!\left(\mathbf{F}_{T,t}^{(l)},\;\mathbf{F}_{T,t}^{(l-1)}\right)\ ,\quad l=1,2,\ldots,L\ , \tag{3}
\]

where $\mathbf{F}_{T,t}^{(l)}\in\mathbb{R}^{d}$ denotes the $d$-dimensional teacher features within a training batch at layer $l$ and position $t$, and $\mathbf{F}_{T,t}^{(0)}$ corresponds to the output of the embedding layer. A low value of $c_{l}$ indicates that layer $l$ substantially transforms the representation direction. We select the $k$ layers with the lowest scores as the targets of feature distillation,

\[
\mathcal{S}=\underset{S\subseteq\{1,\ldots,L\},\;\lvert S\rvert=k}{\arg\min}\;\sum_{l\in S}c_{l}\ , \tag{4}
\]

where $\mathcal{S}$ is the set of selected layers. The feature distillation loss is then defined over this adaptively selected set $\mathcal{S}$ as

\[
\mathcal{L}_{\mathrm{feature}}=\text{MSE}_{\mathrm{adaptive}}\!\left(\mathbf{F}_{T}\,\middle\|\,\mathbf{F}_{S}\right)=\frac{1}{\lvert\mathcal{S}\rvert}\sum_{l\in\mathcal{S}}\frac{1}{\lvert\mathcal{T}\rvert\cdot d}\sum_{t\in\mathcal{T}}\left\|\,\mathbf{F}_{T,t}^{(l)}-\mathbf{F}_{S,t}^{(l)}\right\|_{2}^{2}\ , \tag{5}
\]

where $\mathbf{F}_{S,t}^{(l)}$ denotes the corresponding student features. By restricting the feature distillation loss to $\mathcal{S}$, LAFD concentrates the gradient signal on layers with the largest representational changes. This helps reduce the propagation and amplification of quantization errors through subsequent nonlinear computations. By leveraging structural similarity scores, LAFD provides input-adaptive layer supervision while avoiding the prohibitive cost of searching over layer combinations.
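
The layer scoring, selection, and loss in Equations (3) to (5) can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions: features are stored as nested lists indexed `[layer][position]` with layer 0 holding the embedding output, and all function names are hypothetical. Because the objective in Equation (4) is additive over layers, taking the $k$ smallest scores solves it directly.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def layer_scores(teacher_feats, valid_positions):
    """c_l for l = 1..L: mean cosine similarity between adjacent teacher layers."""
    scores = {}
    for l in range(1, len(teacher_feats)):
        sims = [cosine(teacher_feats[l][t], teacher_feats[l - 1][t])
                for t in valid_positions]
        scores[l] = sum(sims) / len(sims)
    return scores


def select_layers(scores, k):
    """The k layers with the lowest scores, returned in ascending layer order."""
    return sorted(sorted(scores, key=scores.get)[:k])


def feature_loss(teacher_feats, student_feats, layers, valid_positions):
    """MSE averaged over selected layers, valid tokens, and feature dimensions."""
    total = 0.0
    for l in layers:
        sq = 0.0
        for t in valid_positions:
            sq += sum((a - b) ** 2
                      for a, b in zip(teacher_feats[l][t], student_feats[l][t]))
        d = len(teacher_feats[l][valid_positions[0]])
        total += sq / (len(valid_positions) * d)
    return total / len(layers)
```

Only the teacher's forward pass is needed to compute the scores, so the selection adds no search over layer combinations.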

3.3 Entropy-aware KL divergence

In logit distillation, the KLD is used to align the student distribution $P_{S}$ with the teacher distribution $P_{T}$. The forward KLD $\mathcal{D}_{\mathrm{KL}}(P_{T}\|P_{S})$ is zero-avoiding, inducing mode-covering behavior, while the reverse KLD $\mathcal{D}_{\mathrm{KL}}(P_{S}\|P_{T})$ is zero-forcing, inducing mode-seeking behavior (Wu et al., 2025). CAKLD, introduced in BitDistiller (Du et al., 2024), balances these divergences using teacher confidence, but this metric leads to severe mismatch errors in both human-annotated and synthetic corpora, as demonstrated in Appendix C, thereby restricting data diversity. To overcome this, we propose EAKLD, which interpolates between the two objectives using a batch-level mixing coefficient $\lambda$ derived from the teacher’s output entropy. The logit distillation loss and the mixing coefficient are defined as

\[
\begin{split}
\mathcal{L}_{\mathrm{logit}}=\mathcal{D}_{\mathrm{EAKLD}}\!\left(P_{T}\,\middle\|\,P_{S}\right)&=\lambda\,\underbrace{\mathcal{D}_{\mathrm{KL}}\!\left(P_{T}\,\middle\|\,P_{S}\right)}_{\text{forward KLD}}+(1-\lambda)\,\underbrace{\mathcal{D}_{\mathrm{KL}}\!\left(P_{S}\,\middle\|\,P_{T}\right)}_{\text{reverse KLD}}\ ,\\
\text{with}\quad\lambda&=\mathbb{E}_{(x,y)\sim\mathbb{D}}\!\left[\frac{1}{|y|}\sum_{i=1}^{|y|}\frac{\min\!\Big(H\big(P_{T}(y_{i}\mid x,y_{<i})\big),\;\log K\Big)}{\log K}\right]\ ,
\end{split} \tag{6}
\]

where $\mathbb{D}$ is the data within a training batch, $|y|$ is the number of tokens in the response sequence, and $H\big(P_{T}(y_{i}\mid x,y_{<i})\big)$ denotes the entropy of the teacher’s predictive distribution at position $i$ conditioned on the input $x$ and the preceding tokens $y_{<i}$. The entropy is formulated as

\[
H\big(P_{T}(y_{i}\mid x,y_{<i})\big)=-\sum_{v\in\mathcal{V}}P_{T}(v\mid x,y_{<i})\log P_{T}(v\mid x,y_{<i})\ , \tag{7}
\]

where $\mathcal{V}$ is the vocabulary set, and the denominator $\log K$ is the entropy of a uniform distribution over $K$ candidates, which caps the normalized entropy at 1. When the teacher spreads probability evenly among candidates, the entropy increases, causing $\lambda$ to rise and adaptively strengthening the forward KLD, thereby encouraging mode-covering. Conversely, when the teacher concentrates probability on a few candidates, yielding a small $H(P_{T})$, $\lambda$ decays, thereby prioritizing the reverse KLD for precise mode-seeking. Furthermore, tuning the hyperparameter $K$ deterministically alters the upper-bound entropy, serving as the primary control mechanism to adjust the entire dataset’s aggregate tendency toward either divergence strategy.

By adapting logit distillation exclusively from entropy, EAKLD captures the full shape of the teacher’s uncertainty and avoids the need for internally distilled labels, supporting training corpora that comprise both human-annotated and externally distilled datasets.
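
A minimal sketch of Equations (6) and (7) in pure Python follows. It assumes distributions are given as plain probability lists per token, that the two KLD terms are averaged over all response tokens in the batch (the reduction is not specified in this section), and that `eps` guards against division by zero; the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def mixing_coefficient(teacher_dists, K=16):
    """lambda: batch mean of the per-sequence mean of min(H, log K) / log K.

    teacher_dists is a list of sequences; each sequence is a list of
    per-token teacher distributions.
    """
    log_k = math.log(K)
    per_sequence = []
    for seq in teacher_dists:
        ratios = [min(entropy(p), log_k) / log_k for p in seq]
        per_sequence.append(sum(ratios) / len(ratios))
    return sum(per_sequence) / len(per_sequence)


def eakld(teacher_dists, student_dists, K=16):
    """lambda * forward KLD + (1 - lambda) * reverse KLD (token-mean, assumed)."""
    lam = mixing_coefficient(teacher_dists, K)
    pairs = [(pt, ps)
             for seq_t, seq_s in zip(teacher_dists, student_dists)
             for pt, ps in zip(seq_t, seq_s)]
    forward = sum(kl(pt, ps) for pt, ps in pairs) / len(pairs)
    reverse = sum(kl(ps, pt) for pt, ps in pairs) / len(pairs)
    return lam * forward + (1 - lam) * reverse
```

For example, with $K=16$ a teacher that is uniform over four tokens has entropy $\log 4$, so $\lambda=\log 4/\log 16=0.5$ and the two divergences are weighted equally; a one-hot teacher gives $\lambda=0$ and the loss becomes pure reverse KLD.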

4 Experiments

In this section, we conduct comprehensive experiments to validate the effectiveness and efficiency of the proposed EdgeRazor framework along with its three modules.

4.1 Experimental setup

We apply the EdgeRazor framework to four models: MobileLLM-ParetoQ-350M-BF16 (MobileLLM-350M) (Liu et al., 2025a) representing the base LLMs, Qwen3-0.6B and Qwen3-1.7B (Yang et al., 2025) indicating the instruction-tuned LLMs, and Qwen2.5-Omni-7B (Xu et al., 2025) referring to the multimodal LLMs.

Datasets.

We train the three text-only LLMs on a mixture of human-annotated and externally distilled instruction data (Li et al., 2025a; Zhao et al., 2025), combined with the training splits of commonsense tasks (Bisk et al., 2020; Clark et al., 2019, 2018; Lin et al., 2022; Mihaylov et al., 2018; Sakaguchi et al., 2021; Sap et al., 2019; Zellers et al., 2019). All synthetic samples are generated by external high-quality LLMs rather than their corresponding teachers. In addition, the multimodal LLM is trained on 10K TGIF (Li et al., 2016) clips re-encoded at 30 FPS, with video-understanding responses distilled from the teacher. Appendix D.1 details the specific data mixtures and usage for each model.

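Table 1: Hyperparameters for EdgeRazor.

| Models | $\rho$ | Bit-widths | LRs | Training | $\alpha_{\mathrm{task}}$ | $\alpha_{\mathrm{feature}}$ | $\alpha_{\mathrm{logit}}$ |
|---|---|---|---|---|---|---|---|
| MobileLLM-350M | 1 | 4.00 | $2\times10^{-5}$ | 2 epochs | 0.50 | 0.10 | 2.0 |
| MobileLLM-350M | 1/2 | 2.79 | $2\times10^{-5}$ | 4 epochs | 0.50 | 0.10 | 2.0 |
| MobileLLM-350M | 1/8 | 1.88 | $2\times10^{-5}$ | 5 epochs | 0.20 | 1.00 | 4.0 |
| MobileLLM-350M | 0 | 1.58 | $2\times10^{-5}$ | 5 epochs | 0.20 | 1.00 | 4.0 |
| Qwen3-0.6B | 1 | 4.00 | $2\times10^{-5}$ | 2k steps | 0.05 | 0.50 | 2.0 |
| Qwen3-0.6B | 1/2 | 2.79 | $2\times10^{-5}$ | 1 epoch | 0.10 | 0.10 | 2.0 |
| Qwen3-0.6B | 1/8 | 1.88 | $2\times10^{-5}$ | 1 epoch | 0.10 | 0.10 | 2.0 |
| Qwen3-0.6B | 0 | 1.58 | $2\times10^{-5}$ | 1 epoch | 0.10 | 0.10 | 2.0 |
| Qwen3-1.7B | 1 | 4.00 | $2\times10^{-5}$ | 2k steps | 0.05 | 0.50 | 2.0 |
| Qwen3-1.7B | 1/2 | 2.79 | $2\times10^{-5}$ | 2 epochs | 0.10 | 0.10 | 2.0 |
| Qwen3-1.7B | 1/8 | 1.88 | $2\times10^{-5}$ | 2 epochs | 0.10 | 0.10 | 2.0 |
| Qwen3-1.7B | 0 | 1.58 | $2\times10^{-5}$ | 2 epochs | 0.10 | 0.10 | 2.0 |
| Qwen2.5-Omni-7B | 1 | 4.00 | $5\times10^{-6}$ | 2 epochs | 0.10 | 0.20 | 2.0 |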

Hyperparameters.

All low-bit LLMs are trained on 8× NVIDIA A100-80 GB GPUs. We apply per-group symmetric quantization, setting the group sizes to 256, 64, and 32 for Qwen3, MobileLLM, and Qwen2.5, respectively. The embedding and lm_head layers are kept at 4-bit. For all configurations, we use the AdamW optimizer, and set $\beta=2.0$, $\epsilon=10^{-5}$, $k=3$ in LAFD, and $K=16$ in EAKLD. For training schedules, Qwen3-0.6B and 1.7B employ a constant learning rate with a 0.05 warmup ratio and batch sizes of 1024 and 1536, respectively. MobileLLM-350M and Qwen2.5-Omni-7B utilize a cosine schedule with a 0.01 warmup ratio and batch sizes of 1920 and 64, respectively. Additional details are provided in Table 1.

Baselines.

We compare EdgeRazor against leading PTQ and QAT baselines, including GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2024), AQLM (Egiazarian et al., 2024), QTIP (Tseng et al., 2024b), Slim-LLM+ (Huang et al., 2025), AutoRound (Cheng et al., 2024), Q-Palette (Lee and Song, 2025), EfficientQAT (Chen et al., 2025), ParetoQ (Liu et al., 2025a), LQER (Zhang et al., 2024), OmniQuant (Shao et al., 2024), ABQ-LLM (Zeng et al., 2025), SpinQuant (Liu et al., 2025b), and FlatQuant (Sun et al., 2025). The baseline group sizes are 64 for MobileLLM and 128 for the other LLMs. We also provide evaluations with per-task results against more baselines in Appendix D.3. For the ablation studies, we compare three aspects: the super-group allocation in SQMP depicted in Figure 9 against the stacked allocation in Figure 10; LAFD against conventional feature distillation; and EAKLD in logit distillation against forward KLD and CAKLD introduced in the QAD baseline BitDistiller (Du et al., 2024).

Evaluation.

Table 7 summarizes the evaluation protocols. We prioritize domain-specific tasks over generic perplexity to reflect the concrete reasoning and generation capabilities of LLMs. The text LLMs are benchmarked across diverse tasks with random seed 42, including commonsense reasoning (Bisk et al., 2020; Clark et al., 2019, 2018; Sakaguchi et al., 2021; Sap et al., 2019; Zellers et al., 2019), reading comprehension (Mihaylov et al., 2018), trustworthiness (Hendrycks et al., 2020a), truthfulness (Lin et al., 2022), knowledge (Hendrycks et al., 2020b), instruction following (Zhou et al., 2023), mathematics (Cobbe et al., 2021), and coding (Chen et al., 2021), using lm_eval v0.4.9.1. Qwen2.5-Omni-7B is evaluated on two video understanding tasks (Fu et al., 2025; Zhou et al., 2025) using lmms_eval v0.5. To evaluate deployment efficiency, we benchmark on an Apple M4 Pro chip using llama-bench in llama.cpp b6300 with 100 repetitions, with additional details in Appendix D.5.

4.2 Main results

Figure 3: Performance comparison of EdgeRazor and state-of-the-art baselines at each bit-width.
Figure 4: Performance comparison of 1.88-bit EdgeRazor and 2-bit baselines on Qwen3-0.6B.

We conduct comprehensive experiments on downstream tasks listed in Appendix D.2. Figure 3 reports the average performance of EdgeRazor, PTQ, and QAT state-of-the-art baselines on MobileLLM-350M, Qwen3-0.6B, and Qwen3-1.7B under both weight-only and weight-activation quantization. Table 2 provides detailed performance on Qwen3-0.6B at every bit-width. Figure 4 visualizes per-task performance at the challenging 1.88-bit setting. Figure 5 extends the evaluation to the multimodal LLM Qwen2.5-Omni-7B. Figure 6 compares the average performance and training budgets of EdgeRazor and the leading QAT baseline ParetoQ. Per-task results are given in Appendix D.3.

We make eight key observations based on the results: (1) In Figure 3, EdgeRazor consistently surpasses the strongest baselines across all bit-widths on the three text-only LLMs, and the margin widens as the bit-width decreases; (2) In Figure 3, the weight-activation baseline curve lies noticeably below the weight-only one, whereas the two EdgeRazor curves nearly coincide, and the gap between EdgeRazor and baselines is larger under weight-activation quantization; (3) In Table 2, on Qwen3-0.6B, at 4-bit, EdgeRazor leads AQLM by 1.35 and FlatQuant by 2.06 points; at 2.79-bit, it achieves gains of 3.21 and 6.72 points over 3-bit AutoRound and FlatQuant; at 1.88-bit, it improves upon 2-bit AQLM and OmniQuant by 5.09 and 11.27 points; at 1.58-bit, it obtains an 8.96-point gain over the 1.75-bit Q-Palette; (4) In Table 2, the 1.88-bit EdgeRazor outperforms the strongest 3-bit baselines AutoRound and FlatQuant by 0.64 and 4.38 points, and the 1.58-bit EdgeRazor exceeds the strongest 2-bit weight-only baseline AQLM by 3.26 points and the strongest 3-bit weight-activation baseline FlatQuant by 2.43 points; (5) In Table 2, EdgeRazor outperforms mixed-precision methods such as Q-Palette and Slim-LLM+; (6) In Figure 4, most 2-bit baselines collapse to near-zero scores on the reasoning-intensive GSM8K and code-generation HumanEval tasks, while EdgeRazor retains a clear advantage on both; (7) In Figure 5, on the multimodal LLM Qwen2.5-Omni-7B, at 4-bit weight-only quantization, EdgeRazor with an additional 4-bit vision encoder surpasses AWQ by 0.44 on Video-MME and 1.42 on MLVU; (8) In Figure 6, on MobileLLM-350M, EdgeRazor trained via MPQAD outperforms ParetoQ trained by QAT finetuning at every bit-width, while requiring a 4–10× lower training budget.

Observations (1)–(4) demonstrate that EdgeRazor consistently achieves superior performance, with particularly pronounced gains in the ultra-low-bit regime. Observation (5) confirms the effectiveness of our mixed-precision module, SQMP. Observation (6) and Appendix F indicate that EdgeRazor preserves complex reasoning and generation capabilities under ultra-low-bit settings. Observation (7) shows that the advantages of EdgeRazor extend beyond text-only LLMs. Observation (8) highlights the superior efficiency-performance trade-off of EdgeRazor. Appendix E further validates its robust generalization on held-out benchmarks. These results establish the effectiveness of EdgeRazor.
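
The margins quoted in observations (3) and (4) can be re-derived from the averages in Table 2. The short check below copies a subset of those averages; the BF16 average of 47.35 is not stated in this section and is inferred here from the $\Delta_{\text{BF16}}$ column (for example, AQLM at 46.48 with a delta of -0.87). The dictionary and function names are hypothetical.

```python
# Averages copied from Table 2 (Qwen3-0.6B); keys are (method, weight bit-width).
WEIGHT_ONLY = {
    ("EdgeRazor", 4): 47.83, ("AQLM", 4): 46.48,
    ("EdgeRazor", 2.79): 44.17, ("AutoRound", 3): 40.96,
    ("EdgeRazor", 1.88): 41.60, ("AQLM", 2): 36.51,
    ("EdgeRazor", 1.58): 39.77, ("Q-Palette", 1.75): 30.81,
}
WEIGHT_ACT = {
    ("EdgeRazor", 4): 47.80, ("FlatQuant", 4): 45.74,
    ("EdgeRazor", 2.79): 44.10, ("FlatQuant", 3): 37.38,
    ("EdgeRazor", 1.88): 41.76, ("OmniQuant", 2): 30.49,
}
BF16_AVG = 47.35  # inferred from the delta column, not stated directly


def margin(table, ours, baseline):
    """Point gap between two table entries, rounded to two decimals."""
    return round(table[ours] - table[baseline], 2)


def delta_bf16(table, entry):
    """Gap to the (inferred) BF16 average, as in the table's delta column."""
    return round(table[entry] - BF16_AVG, 2)
```

For instance, `margin(WEIGHT_ACT, ("EdgeRazor", 1.88), ("OmniQuant", 2))` reproduces the 11.27-point gap, and `margin(WEIGHT_ACT, ("EdgeRazor", 1.88), ("FlatQuant", 3))` reproduces the 4.38-point gap.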

Table 2: Performance comparison of EdgeRazor and strong baselines on Qwen3-0.6B.

(a) Weight-only quantization.

| Method | W-A-KV | Avg. (↑) | $\Delta_{\text{BF16}}$ (↑) |
|---|---|---|---|
| AQLM | 4-16-16 | 46.48 | -0.87 |
| AutoRound | 4-16-16 | 45.75 | -1.60 |
| Q-Palette | 4-16-16 | 40.97 | -6.38 |
| EfficientQAT | 4-16-16 | 44.07 | -3.28 |
| EdgeRazor | 4-16-16 | 47.83 | +0.48 |
| AutoRound | 3-16-16 | 40.96 | -6.39 |
| Slim-LLM+ | 3-16-16 | 33.95 | -13.40 |
| Q-Palette | 3.25-16-16 | 37.55 | -9.80 |
| EfficientQAT | 3-16-16 | 39.92 | -7.43 |
| EdgeRazor | 2.79-16-16 | 44.17 | -3.18 |
| AQLM | 2-16-16 | 36.51 | -10.84 |
| QTIP | 2-16-16 | 35.94 | -11.41 |
| Slim-LLM+ | 2-16-16 | 30.54 | -16.81 |
| Q-Palette | 2-16-16 | 30.66 | -16.69 |
| EfficientQAT | 2-16-16 | 33.27 | -14.08 |
| EdgeRazor | 1.88-16-16 | 41.60 | -5.75 |
| Q-Palette | 1.75-16-16 | 30.81 | -16.54 |
| EdgeRazor | 1.58-16-16 | 39.77 | -7.58 |

(b) Weight-activation quantization.

| Method | W-A-KV | Avg. (↑) | $\Delta_{\text{BF16}}$ (↑) |
|---|---|---|---|
| OmniQuant | 4-8-8 | 37.27 | -10.08 |
| LQER | 4-8-8 | 45.31 | -2.04 |
| SpinQuant | 4-8-8 | 41.27 | -6.08 |
| FlatQuant | 4-8-8 | 45.74 | -1.61 |
| EdgeRazor | 4-8-8 | 47.80 | +0.45 |
| OmniQuant | 3-8-8 | 34.58 | -12.77 |
| LQER | 3-8-8 | 36.46 | -10.89 |
| SpinQuant | 3-8-8 | 34.93 | -12.42 |
| FlatQuant | 3-8-8 | 37.38 | -9.97 |
| EdgeRazor | 2.79-8-8 | 44.10 | -3.25 |
| OmniQuant | 2-8-8 | 30.49 | -16.86 |
| LQER | 2-8-8 | 30.46 | -16.89 |
| SpinQuant | 2-8-8 | 30.04 | -17.31 |
| FlatQuant | 2-8-8 | 30.23 | -17.12 |
| EdgeRazor | 1.88-8-8 | 41.76 | -5.59 |
| EdgeRazor | 1.58-8-8 | 39.81 | -7.54 |

Figure 5: Performance comparison of 4-bit EdgeRazor and strong baselines on Qwen2.5-Omni-7B.
Figure 6: Average performance and training budgets of EdgeRazor and ParetoQ on MobileLLM-350M. The training budgets are reported in tokens consumed during training.

4.3 Ablation studies

Table 3: Configurations for the ablation studies of EdgeRazor. Allocation: Super-group or Stacked; feature distillation: Adaptive or Fixed; logit distillation: EAKLD, CAKLD, or FKLD.

| Methods | Super-group | Stacked | Adaptive | Fixed | EAKLD | CAKLD | FKLD |
|---|---|---|---|---|---|---|---|
| EdgeRazor$_{\text{SG+A+E}}$ | ✓ | × | ✓ | × | ✓ | × | × |
| EdgeRazor$_{\text{ST+A+E}}$ | × | ✓ | ✓ | × | ✓ | × | × |
| EdgeRazor$_{\text{SG+A+C}}$ | ✓ | × | ✓ | × | × | ✓ | × |
| EdgeRazor$_{\text{SG+F+E}}$ | ✓ | × | × | ✓ | ✓ | × | × |
| EdgeRazor$_{\text{SG+F+F}}$ | ✓ | × | × | ✓ | × | × | ✓ |

Table 4: Ablation studies of EdgeRazor on Qwen3-0.6B.

| Methods | W | A | KV | Average | W | A | KV | Average |
|---|---|---|---|---|---|---|---|---|
| EdgeRazor$_{\text{SG+A+E}}$ | 2.79 | 16 | 16 | 44.17 | 2.79 | 8 | 8 | 44.10 |
| EdgeRazor$_{\text{ST+A+E}}$ | 2.79 | 16 | 16 | 43.26 | 2.79 | 8 | 8 | 43.08 |
| EdgeRazor$_{\text{SG+A+E}}$ | 2.19 | 16 | 16 | 40.71 | 2.19 | 8 | 8 | 40.14 |
| EdgeRazor$_{\text{SG+A+C}}$ | 2.19 | 16 | 16 | 39.59 | 2.19 | 8 | 8 | 39.61 |
| EdgeRazor$_{\text{SG+F+E}}$ | 2.19 | 16 | 16 | 40.27 | 2.19 | 8 | 8 | 39.93 |
| EdgeRazor$_{\text{SG+F+F}}$ | 2.19 | 16 | 16 | 39.25 | 2.19 | 8 | 8 | 39.51 |
| EdgeRazor$_{\text{SG+A+E}}$ | 1.88 | 16 | 16 | 41.60 | 1.88 | 8 | 8 | 41.76 |
| EdgeRazor$_{\text{SG+A+C}}$ | 1.88 | 16 | 16 | 40.95 | 1.88 | 8 | 8 | 40.85 |
| EdgeRazor$_{\text{SG+F+E}}$ | 1.88 | 16 | 16 | 40.40 | 1.88 | 8 | 8 | 40.24 |
| EdgeRazor$_{\text{SG+F+F}}$ | 1.88 | 16 | 16 | 39.83 | 1.88 | 8 | 8 | 39.70 |

To isolate the sources of these performance gains, we conduct detailed ablation studies on the three proposed modules: SQMP, LAFD, and EAKLD. We employ five configurations in Table 3 and evaluate their corresponding performance in Table 4. Detailed settings and per-task results are provided in Appendix D.4. There are four key findings: (1) At 2.79-bit, EdgeRazor$_{\text{SG+A+E}}$ consistently outperforms EdgeRazor$_{\text{ST+A+E}}$, yielding an improvement of 0.91 and 1.02 points under weight-only and weight-activation quantization setups, respectively; (2) Replacing EAKLD with CAKLD leads to consistent performance drops, where EdgeRazor$_{\text{SG+A+E}}$ improves over EdgeRazor$_{\text{SG+A+C}}$ by 1.12 and 0.53 points at 2.19-bit, and by 0.65 and 0.91 points at 1.88-bit; (3) Reverting from EAKLD to standard forward KLD similarly degrades performance, with EdgeRazor$_{\text{SG+F+E}}$ exceeding EdgeRazor$_{\text{SG+F+F}}$ by 1.02 and 0.42 points at 2.19-bit, and by 0.57 and 0.54 points at 1.88-bit; (4) Enabling LAFD over heuristic fixed-layer selection yields consistent improvements, as evidenced by EdgeRazor$_{\text{SG+A+E}}$ improving upon EdgeRazor$_{\text{SG+F+E}}$ by 0.44 and 0.21 points at 2.19-bit, and by 1.20 and 1.52 points at 1.88-bit.

Finding (1) validates the advantage of the super-group allocation used in the SQMP module over the stacked allocation. Findings (2) and (3) confirm the superiority of the EAKLD module over alternative distillation criteria, such as CAKLD and forward KLD. Finding (4) demonstrates the efficacy of the LAFD module relative to heuristic fixed-layer selection. These results indicate that each module contributes to the overall performance.

4.4 Compression and acceleration

Figure 7: Efficiency comparison of EdgeRazor and other baselines at each bit-width.
Figure 8: Efficiency comparison of deploying 1.58-bit and 4-bit Qwen3-0.6B via llama.cpp.

Beyond performance, deployment efficiency is important for edge applications. We evaluate Qwen3-0.6B and outline two outcomes: (1) By quantizing the embedding and lm_head alongside the decoder layers, EdgeRazor with a group_size of 256 achieves a 99.99% quantization proportion, yielding higher compression ratios at all bit-widths than existing per-group quantization methods with a group_size of 128 (Figure 7); (2) With the 1.58-bit TQ2_0 precision type, EdgeRazor reduces storage from 1.11 GB to 0.19 GB and memory from 1.46 GB to 0.51 GB, while accelerating prefilling and decoding speeds from 337.99 and 20.91 tokens/s to 711.67 and 317.03 tokens/s, achieving 2.11× and 15.16× acceleration over the BF16 baseline (Figure 8).

Outcome (1) highlights the substantial compression achieved by our framework across all bit-widths. Outcome (2) confirms that this compression translates directly into reduced resource usage and on-device acceleration. Appendix D.5 further details these efficiency metrics for diverse LLMs; the deployment benchmarks are conducted on both Apple M4 Pro and i9-14900K chips. These results reveal that EdgeRazor provides an efficient solution for deploying low-bit LLMs.
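The ratios quoted in outcome (2) can be re-derived from the raw figures. The short sketch below does only that arithmetic; the dictionary keys and the helper `ratios` are illustrative names, and the storage and memory reduction factors it prints are computed here from the quoted numbers rather than reported in the text.

```python
# Figures quoted in Section 4.4 for Qwen3-0.6B (BF16 baseline vs. 1.58-bit TQ2_0).
bf16 = {"storage_gb": 1.11, "memory_gb": 1.46, "prefill_tps": 337.99, "decode_tps": 20.91}
tq2_0 = {"storage_gb": 0.19, "memory_gb": 0.51, "prefill_tps": 711.67, "decode_tps": 317.03}


def ratios(baseline, quantized):
    """Return improvement factors: speedups for throughput, reductions for sizes."""
    out = {}
    for key in baseline:
        if key.endswith("_tps"):
            out[key] = quantized[key] / baseline[key]  # higher throughput is better
        else:
            out[key] = baseline[key] / quantized[key]  # smaller footprint is better
    return out


summary = ratios(bf16, tq2_0)
for name, factor in summary.items():
    print(f"{name}: {factor:.2f}x")
```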

5 Conclusions

In this paper, we proposed EdgeRazor, a lightweight framework for compressing LLMs into flexible low-bit formats via MPQAD. It consisted of three novel modules: SQMP, LAFD, and EAKLD. Comprehensive evaluations across base, instruction-tuned, and multimodal LLMs demonstrated that EdgeRazor outperformed state-of-the-art baselines at all evaluated bit-widths in terms of domain-specific performance and training budget reduction. Practical benchmarks on llama.cpp demonstrated that EdgeRazor transformed aggressive quantization from a fragile trade-off into a resilient approach for deploying lightweight LLMs on resource-constrained hardware.

6 Acknowledgement

Shao-Qun Zhang is the corresponding author, supported by the Natural Science Foundation of China (62406138) and the Natural Science Foundation of Jiangsu Province (BK20230782). This research was supported by the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (No. JYB2025XDXM118).

This research was performed during the academic cooperation between LAMDA and Microsoft AI, where the proposed EdgeRazor framework was evaluated and deployed within the internal systems.

References

  • [1] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024) QuaRot: outlier-free 4-bit inference in rotated LLMs. In Advances in Neural Information Processing Systems 37, pp. 100213–100240. Cited by: §D.3.
  • [2] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §1, §2.
  • [3] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020) PIQA: reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 7432–7439. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: Table 7, §4.1.
  • [5] M. Chen, W. Shao, P. Xu, J. Wang, P. Gao, K. Zhang, Y. Qiao, and P. Luo (2025) EfficientQAT: efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp. 10081–10100. Cited by: §D.3, §2, §4.1.
  • [6] W. Cheng, W. Zhang, H. Shen, Y. Cai, X. He, L. Kaokao, and Y. Liu (2024) Optimize weight rounding via signed gradient descent for the quantization of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 11332–11350. Cited by: §D.3, §4.1.
  • [7] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 2924–2936. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [8] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: Table 6, Table 7, Table 7, §4.1, §4.1.
  • [9] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: Table 7, §4.1.
  • [10] T. Dettmers and L. Zettlemoyer (2023) The case for 4-bit precision: k-bit inference scaling laws. In Proceedings of the 40th International Conference on Machine Learning, pp. 7750–7774. Cited by: §1, §2.
  • [11] D. Du, Y. Zhang, S. Cao, J. Guo, T. Cao, X. Chu, and N. Xu (2024) BitDistiller: unleashing the potential of sub-4-bit LLMs via self-distillation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 102–116. Cited by: Appendix C, §1, §2, §3.3, §4.1.
  • [12] V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh (2024) Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning, pp. 12284–12303. Cited by: §D.3, §4.1.
  • [13] K. Ethayarajh (2019) How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 55–65. Cited by: §3.2.
  • [14] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022) GPTQ: accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323. Cited by: §D.3, §1, §2, §4.1.
  • [15] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, P. Chen, Y. Li, S. Lin, S. Zhao, K. Li, T. Xu, X. Zheng, E. Chen, C. Shan, R. He, and X. Sun (2025) Video-MME: the first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24108–24118. Cited by: Table 7, §4.1.
  • [16] Z. Guan, H. Huang, Y. Su, H. Huang, N. Wong, and H. Yu (2024) APTQ: attention-aware post-training mixed-precision quantization for large language models. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pp. 1–6. Cited by: §2.
  • [17] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt (2020) Aligning AI with shared human values. arXiv preprint arXiv:2008.02275. Cited by: Table 7, §4.1.
  • [18] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Cited by: Table 7, §4.1.
  • [19] J. H. Heo, J. Kim, B. Kwon, B. Kim, S. J. Kwon, and D. Lee (2024) Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models. In Proceedings of the 12th International Conference on Learning Representations, pp. 12744–12762. Cited by: §A.3.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.
  • [21] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • [22] W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi (2024) BiLLM: pushing the limit of post-training quantization for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pp. 20023–20042. Cited by: §D.3.
  • [23] W. Huang, H. Qin, Y. Liu, Y. Li, Q. Liu, X. Liu, L. Benini, M. Magno, S. Zhang, and X. Qi (2025) SliM-LLM: salience-driven mixed-precision quantization for large language models. In Proceedings of the 42nd International Conference on Machine Learning, pp. 25672–25692. Cited by: §D.3, §1, §2, §3.1, §4.1.
  • [24] D. Lee and H. O. Song (2025) Q-Palette: fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment. arXiv preprint arXiv:2509.20214. Cited by: §D.3, §1, §2, §4.1.
  • [25] J. Li, L. Du, H. Zhao, B. Zhang, L. Wang, B. Gao, G. Liu, and Y. Lin (2025) Infinity Instruct: scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116. Cited by: Table 6, §4.1.
  • [26] Y. Li, R. Yin, D. Lee, S. Xiao, and P. Panda (2025) GPTAQ: efficient finetuning-free quantization for asymmetric calibration. In Proceedings of the 42nd International Conference on Machine Learning, pp. 36690–36706. Cited by: §D.3.
  • [27] Y. Li, Y. Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo (2016) TGIF: a new dataset and benchmark on animated GIF description. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4641–4650. Cited by: Table 6, §4.1.
  • [28] Z. Li, X. Yan, T. Zhang, H. Qin, D. Xie, J. Tian, Z. Shi, L. Kong, Y. Zhang, and X. Yang (2025) ARB-LLM: alternating refined binarizations for large language models. In Proceedings of the 13th International Conference on Learning Representations, pp. 93900–93912. Cited by: §D.3.
  • [29] J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024) AWQ: activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of the 6th Conference on Machine Learning and Systems, Vol. 6, pp. 87–100. Cited by: §A.3, §D.3, §1, §2, §3.1, §4.1.
  • [30] S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pp. 3214–3252. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [31] Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han (2025) QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. In Proceedings of the 7th Conference on Machine Learning and Systems. Cited by: §D.3.
  • [32] Y. Liu, J. Wen, Y. Wang, S. Ye, L. L. Zhang, T. Cao, C. Li, and M. Yang (2024) VPTQ: extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8181–8196. Cited by: §D.3.
  • [33] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra (2023) LLM-QAT: data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888. Cited by: §1, §2.
  • [34] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025) ParetoQ: scaling laws in extremely low-bit LLM quantization. arXiv preprint arXiv:2502.02631. Cited by: §D.3, §1, §2, §4.1, §4.1.
  • [35] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025) SpinQuant: LLM quantization with learned rotations. In Proceedings of the 13th International Conference on Learning Representations, pp. 92009–92032. Cited by: §D.3, §4.1.
  • [36] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [37] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) WinoGrande: an adversarial Winograd schema challenge at scale. Communications of the ACM 64 (9), pp. 99–106. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [38] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 4463–4473. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [39] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024) OmniQuant: omnidirectionally calibrated quantization for large language models. In Proceedings of the 12th International Conference on Learning Representations, pp. 45472–45496. Cited by: §D.3, §4.1.
  • [40] Y. Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y. Li, J. Hu, X. Yu, L. Hou, C. Yuan, X. Jiang, W. Liu, and J. Yao (2025) FlatQuant: flatness matters for LLM quantization. In Proceedings of the 42nd International Conference on Machine Learning, pp. 57587–57613. Cited by: §D.3, §4.1.
  • [41] F. Tan, R. Lee, Ł. Dudziak, S. X. Hu, S. Bhattacharya, T. Hospedales, G. Tzimiropoulos, and B. Martinez (2024) MobileQuant: mobile-friendly quantization for on-device language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9761–9771. Cited by: §1.
  • [42] I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601. Cited by: §3.2.
  • [43] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. D. Sa (2024) QuIP#: even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, pp. 48630–48656. Cited by: §D.3, §2.
  • [44] A. Tseng, Q. Sun, D. Hou, and C. M. D. Sa (2024) QTIP: quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems 37, pp. 59597–59620. Cited by: §D.3, §4.1.
  • [45] H. Wang, S. Ma, L. Ma, L. Wang, W. Wang, L. Dong, S. Huang, H. Wang, J. Xue, R. Wang, J. Bao, C. He, and F. Wei (2025) BitNet: 1-bit pre-training for large language models. Journal of Machine Learning Research 26 (125), pp. 1–29. Cited by: §1, §2.
  • [46] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems 33, pp. 5776–5788. Cited by: §2.
  • [47] T. Wu, C. Tao, J. Wang, R. Yang, Z. Zhao, and N. Wong (2025) Rethinking Kullback-Leibler divergence in knowledge distillation for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 5737–5755. Cited by: §1, §2, §3.3.
  • [48] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) SmoothQuant: accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, pp. 38087–38099. Cited by: §2.
  • [49] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, et al. (2025) Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215. Cited by: §4.1.
  • [50] Y. Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che (2024) OneBit: towards extremely low-bit large language models. In Advances in Neural Information Processing Systems 37, pp. 66357–66382. Cited by: §1, §2.
  • [51] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §4.1.
  • [52] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen (2023) MAmmoTH: building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653. Cited by: Appendix B.
  • [53] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800. Cited by: Table 6, Table 7, §4.1, §4.1.
  • [54] C. Zeng, S. Liu, Y. Xie, H. Liu, X. Wang, M. Wei, S. Yang, F. Chen, and X. Mei (2025) ABQ-LLM: arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, pp. 22299–22307. Cited by: §D.3, §4.1.
  • [55] C. Zhang, J. Cheng, G. A. Constantinides, and Y. Zhao (2024) LQER: low-rank quantization error reconstruction for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pp. 58763–58779. Cited by: §D.3, §4.1.
  • [56] H. Zhao, H. Wang, Y. Peng, S. Zhao, X. Tian, S. Chen, Y. Ji, and X. Li (2025) 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633. Cited by: Table 6, §4.1.
  • [57] Y. Zheng, Y. Chen, B. Qian, X. Shi, Y. Shu, and J. Chen (2025) A review on edge large language models: design, execution, and applications. ACM Computing Surveys 57 (8), pp. 1–35. Cited by: §1.
  • [58] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. Cited by: Table 7, §4.1.
  • [59] J. Zhou, Y. Shu, B. Zhao, B. Wu, Z. Liang, S. Xiao, M. Qin, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu (2025) MLVU: benchmarking multi-task long video understanding. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13691–13701. Cited by: Table 7, §4.1.
  • [60] Z. Zhou and Y. Jiang (2004) NeC4.5: neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering 16 (6), pp. 770–773. Cited by: §1, §2.
  • [61] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang (2024) A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, pp. 1556–1577. Cited by: §1.

Appendix A Mixed-precision quantization

A.1 Quantization function for weights and activations

In this section, we provide the per-group symmetric quantization for both weights and activations in our framework. Let $\mathbf{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ denote a weight matrix and $\mathbf{X}\in\mathbb{R}^{d_{\text{in}}\times L}$ denote the corresponding activation matrix, where $L$ is the sequence length. Given a group size $G$, we partition $\mathbf{W}$ and $\mathbf{X}$ along the input dimension into $J=d_{\text{in}}/G$ groups per output channel to obtain $\mathbf{W}^{G}\in\mathbb{R}^{d_{\text{out}}\times J\times G}$ and $\mathbf{X}^{G}\in\mathbb{R}^{J\times G\times L}$. The $j$-th group of the $i$-th output channel is defined as $\mathbf{W}^{G}_{i,j}=\mathbf{W}[i,\;jG:(j{+}1)G]\in\mathbb{R}^{G}$, with $W^{G}_{i,j,k}$ denoting its $k$-th element. Similarly, the $j$-th input-channel group for token $l$ is defined as $\mathbf{X}^{G}_{j,l}=\mathbf{X}[jG:(j{+}1)G,\;l]\in\mathbb{R}^{G}$. Each group is independently quantized to $n$-bit through a quantization function applicable to both weights and activations,

$$Q_{n\text{-bit}}(\mathbf{W}^{G}_{i,j})=\begin{cases}\text{clip}\!\left(\left\lfloor\dfrac{\mathbf{W}^{G}_{i,j}}{s_{i,j}}\right\rceil,\,-1,\,1\right)\ \text{with}\ s_{i,j}=\max\!\left(\dfrac{\beta}{G}\displaystyle\sum_{k}|W^{G}_{i,j,k}|,\ \epsilon\right),&\text{if}\ n=1.58\\[6pt] \left\lfloor\dfrac{\mathbf{W}^{G}_{i,j}}{s_{i,j}}\right\rceil\ \text{with}\ s_{i,j}=\max\!\left(\dfrac{\displaystyle\max_{k}|W^{G}_{i,j,k}|}{2^{n-1}-1},\ \epsilon\right),&\text{if}\ n\in\{4,\,8\}\end{cases} \tag{8}$$

where $\lfloor\cdot\rceil$ denotes rounding to the nearest integer, $s_{i,j}$ is the scaling factor, $\beta$ is a tunable scaling coefficient for ternarization, and $\epsilon$ is a small constant that prevents division by zero. The 1.58-bit branch, named after $\log_{2}3\approx 1.58$, quantizes each weight to the ternary set $\{-1,0,+1\}$, with the scaling factor derived from the group mean absolute value. The $n$-bit branch, where $n\in\{4,8\}$, quantizes weights to a symmetric integer range, namely $[-7,\,7]$ for 4-bit and $[-127,\,127]$ for 8-bit, with the scaling factor determined by the group-wise maximum absolute value. The activation groups are uniformly quantized to 8-bit under weight-activation quantization or retained at 16-bit under weight-only quantization.
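Equation (8) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `quantize_per_group`, the default `beta=1.0`, and the default `eps` are assumptions, and `np.rint` breaks exact ties toward the even integer, which the text does not specify.

```python
import numpy as np


def quantize_per_group(W, group_size, n_bits, beta=1.0, eps=1e-8):
    """Per-group symmetric quantization following Eq. (8).

    W has shape (d_out, d_in). Returns integer codes and scales with shapes
    (d_out, J, G) and (d_out, J, 1), where J = d_in / G.
    beta and eps defaults are illustrative assumptions.
    """
    d_out, d_in = W.shape
    if d_in % group_size != 0:
        raise ValueError("d_in must be divisible by the group size")
    Wg = W.reshape(d_out, d_in // group_size, group_size)
    abs_w = np.abs(Wg)
    if n_bits == 1.58:
        # Ternary branch: scale from the group mean absolute value.
        scale = np.maximum(beta * abs_w.mean(axis=-1, keepdims=True), eps)
        codes = np.clip(np.rint(Wg / scale), -1, 1)
    elif n_bits in (4, 8):
        # Integer branch: scale from the group maximum absolute value.
        q_max = 2 ** (n_bits - 1) - 1
        scale = np.maximum(abs_w.max(axis=-1, keepdims=True) / q_max, eps)
        codes = np.rint(Wg / scale)
    else:
        raise ValueError("n_bits must be 1.58, 4, or 8")
    return codes, scale


W = np.array([[1.0, -2.0, 3.0, -7.0]])
codes_4bit, scale_4bit = quantize_per_group(W, group_size=4, n_bits=4)
codes_tern, scale_tern = quantize_per_group(W, group_size=4, n_bits=1.58)
codes_8bit, scale_8bit = quantize_per_group(W, group_size=4, n_bits=8)
W_hat = (codes_4bit * scale_4bit).reshape(W.shape)  # dequantized approximation
```

Multiplying the codes by their scales, as in the last line, recovers the dequantized weights; the same function applies to an activation matrix after transposing it so that the grouped dimension comes last.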

A.2 Strategies of structural quantization with mixed precision

Figure 9: Super-group allocation for weight matrices, visualized via the transposed matrix $\mathbf{W}^{T}$.
Figure 10: Stacked allocation for weight matrices, visualized via the transposed matrix $\mathbf{W}^{T}$.

In this section, we describe how high-precision and low-precision groups are allocated within weight matrices in our proposed Structural Quantization with Mixed Precision (SQMP) module. Figure 10 shows the stacked allocation, while Figure 9 depicts the super-group allocation. For instance, setting $\rho=1/8$ places one 4-bit output channel followed by seven 1.58-bit output channels within each super-group, yielding an effective bit-width of roughly 1.88 bits, since $(4+7\times 1.58)/8\approx 1.88$.
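The effective bit-width follows from a weighted average of the two precisions. The sketch below uses the nominal value 1.58 for the ternary rows and ignores the storage cost of scaling factors; the helper names are hypothetical. With $\rho=1/4$ and $\rho=1/2$ the same formula gives roughly 2.19 and 2.79, which match the bit-widths in Table 4, although treating these as the ratios actually used there is an assumption.

```python
def effective_bits(rho, high_bits=4.0, low_bits=1.58):
    """Average bits per weight when a fraction rho of output channels is high precision."""
    return rho * high_bits + (1.0 - rho) * low_bits


def super_group_pattern(d_out, rho, high_bits=4.0, low_bits=1.58):
    """Bit-width of each output channel: one high-precision channel per super-group."""
    period = round(1 / rho)  # super-group size, e.g. 8 for rho = 1/8
    return [high_bits if i % period == 0 else low_bits for i in range(d_out)]


pattern = super_group_pattern(d_out=16, rho=1 / 8)
average_bits = sum(pattern) / len(pattern)
for rho in (1 / 8, 1 / 4, 1 / 2):
    print(f"rho = {rho:.3f}: {effective_bits(rho):.3f} bits")
```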

A.3 Proof of super-group allocation in SQMP

In this section, we justify the super-group allocation by showing that, under a bounded-variation model of salience drift during training, it minimizes a discrepancy-based worst-case upper bound on the salience-alignment error among the allocation schemes considered in this paper.

Definition.

Let $\mathbf{W}\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ and $\mathbf{X}\in\mathbb{R}^{d_{\mathrm{in}}\times L}$ denote the full-precision weights and activations, and let

$$\Delta\mathbf{Y}\;=\;\mathbf{W}\mathbf{X}-Q(\mathbf{W})\mathbf{X} \tag{9}$$

denote the quantization error. For each output row $i$, let $a_{i}\in\{0,1\}$ indicate the precision assignment, where $a_{i}=1$ denotes a 4-bit assignment and $a_{i}=0$ denotes a 1.58-bit assignment. Under the bit-width budget and specific proportion,

$$\frac{1}{d_{\mathrm{out}}}\sum_{i=1}^{d_{\mathrm{out}}}a_{i}\;=\;\rho,\ \ \text{and}\ \ N\;:=\;\sum_{i=1}^{d_{\mathrm{out}}}a_{i}\;=\;\rho\,d_{\mathrm{out}}, \tag{10}$$

we adopt the row-separable surrogate

$$\mathcal{L}(\mathbf{a})\;\lesssim\;\sum_{i=1}^{d_{\mathrm{out}}}\bigl((1-a_{i})\,e_{L}+a_{i}\,e_{H}\bigr)\,S_{i}\;=\;\sum_{i=1}^{d_{\mathrm{out}}}\bigl(e_{L}-a_{i}(e_{L}-e_{H})\bigr)\,S_{i}, \tag{11}$$

where $S_{i}\geq 0$ is the local salience of the $i$-th row and $e_{H}<e_{L}$ are precision-dependent errors corresponding to 4-bit and 1.58-bit quantization, respectively. Since $e_{L}$ and $e_{H}$ are constants and $\sum_{i}S_{i}$ is independent of $\mathbf{a}$, minimizing (11) is equivalent to maximizing the salience alignment

$$\mathcal{A}(\mathbf{a})\;:=\;\sum_{i=1}^{d_{\mathrm{out}}}a_{i}\,S_{i}. \tag{12}$$

Assumption.

In Post-Training Quantization (PTQ), the salience profile $S_{i}$ is static, which enables direct salience-driven assignments (Lin et al., 2024). Conversely, Quantization-Aware Training (QAT) and Quantization-Aware Distillation (QAD) continuously update the model through data-driven gradients, and activation outliers are empirically observed to shift across channels during training (Heo et al., 2024), causing row salience $S_{i}$ to fluctuate accordingly. However, as the parameters and activations are tightly bounded by regularization mechanisms such as weight decay and normalization layers, we assume that the continuous extension $S:[0,1]\to\mathbb{R}_{\geq 0}$ of the salience profile is a function of bounded total variation, $V(S)<\infty$.

Worst-case Formulation.

Since the exact trajectory of $S$ is unpredictable a priori during training, an allocation strategy that is robust to such salience drift should minimize the worst-case alignment error over all functions $S$ satisfying the bounded total variation assumption. Formally, we identify each row index $i\in\{1,\dots,d_{\mathrm{out}}\}$ with the midpoint $x_{i}=(i-\tfrac{1}{2})/d_{\mathrm{out}}\in[0,1]$ and write $S_{i}=S(x_{i})$. Let $P=\{p_{1},\dots,p_{N}\}\subset[0,1]$ denote the normalized locations of the $N$ high-precision rows. Then

$$\mathcal{A}(P)\;=\;\sum_{k=1}^{N}S(p_{k}), \tag{13}$$

and maximizing $\mathcal{A}(P)$ robustly under the unknown but drift-bounded $S$ reduces to requiring the empirical average $\frac{1}{N}\sum_{k}S(p_{k})$ to uniformly approximate the ideal mean $\int_{0}^{1}S(x)\,dx$. By the one-dimensional Koksma–Hlawka inequality,

$$\left|\frac{1}{N}\sum_{k=1}^{N}S(p_{k})-\int_{0}^{1}S(x)\,dx\right|\;\leq\;V(S)\,D_{N}^{*}(P), \tag{14}$$

where the discrepancy of $P$ is defined as

$$D_{N}^{*}(P)\;:=\;\sup_{t\in[0,1]}\left|\frac{1}{N}\sum_{k=1}^{N}\mathbf{1}\{p_{k}\leq t\}-t\right|. \tag{15}$$

Since the bound (14) is uniform over all bounded-variation $S$, minimizing $D_{N}^{*}(P)$ tightens the worst-case upper bound on the salience-alignment error.

Proposition.

Among the random, stacked, and super-group allocations, the super-group allocation attains the smallest order of discrepancy, namely $\Theta(N^{-1})$. Consequently, under the surrogate (11), it minimizes the Koksma–Hlawka worst-case upper bound (14) up to an absolute constant.

Proof.

We compute the discrepancy for each of the three allocations.

1. Random allocation. The $N$ high-precision rows are distributed uniformly at random, yielding $p_{k}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{Unif}[0,1]$. Standard empirical process bounds imply that the discrepancy satisfies

$$D_{N}^{*}(P_{\mathrm{rand}})=\mathcal{O}_{p}(N^{-1/2}). \tag{16}$$

2. Stacked allocation. All $N$ high-precision rows are clustered contiguously at one end of the output dimension, yielding

$$P_{\mathrm{stack}}\;=\;\left\{\frac{0.5}{d_{\mathrm{out}}},\frac{1.5}{d_{\mathrm{out}}},\dots,\frac{N-0.5}{d_{\mathrm{out}}}\right\}. \tag{17}$$

Since all points lie in a sub-interval of length $\rho$, taking $t=\rho$ in the definition of $D_{N}^{*}$ gives a deviation of $1-\rho$, so the discrepancy is bounded below by a constant. Together with the trivial bound $D_{N}^{*}\leq 1$, this gives

$$1-\rho\;\leq\;D_{N}^{*}(P_{\mathrm{stack}})\;\leq\;1,\quad\text{i.e.,}\quad D_{N}^{*}(P_{\mathrm{stack}})\;=\;\Theta(1). \tag{18}$$

3. Super-group allocation (ours). The 4-bit rows are placed on a deterministic equidistant grid with period $\lfloor 1/\rho\rceil$ along the output dimension. Then, the normalized pattern is the midpoint grid

$$P_{\mathrm{super}}\;=\;\left\{\frac{2k-1}{2N}\right\}_{k=1}^{N}. \tag{19}$$

For any $t\in[0,1]$, the number of points in $[0,t]$ is $\lfloor Nt+\tfrac{1}{2}\rfloor$, so that

$$\left|\frac{1}{N}\sum_{k=1}^{N}\mathbf{1}\{p_{k}\leq t\}-t\right|\;\leq\;\frac{1}{2N}. \tag{20}$$

While rounding row indices to discrete integers introduces a minor mapping perturbation bounded by $\mathcal{O}(1/d_{\mathrm{out}})$, since $d_{\mathrm{out}}=N/\rho$, this shift is $\mathcal{O}(1/N)$. Thus, the super-group allocation rigorously maintains the optimal discrepancy order,

$$D_{N}^{*}(P_{\mathrm{super}})\;=\;\Theta(N^{-1}). \tag{21}$$

By classical one-dimensional discrepancy theory, every $N$-point set $P\subset[0,1]$ satisfies $D_{N}^{*}(P)\geq c/N$ for some absolute constant $c>0$. Hence, the super-group allocation already attains the optimal order of one-dimensional discrepancy. Combining this with the Koksma–Hlawka inequality (14) proves the stated worst-case optimality.
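The three discrepancy orders can be checked numerically. The sketch below evaluates the star discrepancy of Eq. (15) with the standard closed form for sorted one-dimensional points, $D_{N}^{*}=\max_{k}\max\{k/N-p_{(k)},\;p_{(k)}-(k-1)/N\}$; the function names, the choice $\rho=1/8$, and the random seed are illustrative assumptions, and the super-group layout is idealized as the midpoint grid of Eq. (19).

```python
import numpy as np


def star_discrepancy(points):
    """Exact one-dimensional star discrepancy of a point set in [0, 1]."""
    p = np.sort(np.asarray(points, dtype=float))
    n = len(p)
    k = np.arange(1, n + 1)
    return float(max(np.max(k / n - p), np.max(p - (k - 1) / n)))


def stacked(n, d_out):
    """High-precision rows packed at the start of the output dimension, Eq. (17)."""
    return (np.arange(1, n + 1) - 0.5) / d_out


def super_group(n):
    """Idealized midpoint grid of Eq. (19)."""
    return (2 * np.arange(1, n + 1) - 1) / (2 * n)


def random_alloc(n, seed=0):
    """Rows placed i.i.d. uniformly at random (seed chosen arbitrarily)."""
    return np.random.default_rng(seed).uniform(0.0, 1.0, size=n)


rho = 1 / 8
for n in (8, 64, 512):
    d_out = int(round(n / rho))
    print(
        f"N={n:4d}  random={star_discrepancy(random_alloc(n)):.4f}  "
        f"stacked={star_discrepancy(stacked(n, d_out)):.4f}  "
        f"super-group={star_discrepancy(super_group(n)):.4f}"
    )
```

As $N$ grows, the super-group column shrinks like $1/(2N)$, while the stacked column stays close to $1-\rho$.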

A.4 Analysis and discussion

This periodic super-group allocation addresses a limitation of static salience-based assignments in QAT and QAD. By attaining the optimal order of the discrepancy $D_{N}^{*}(P)$, our structural mixed-precision layout minimizes, up to an absolute constant, the discrepancy-based worst-case upper bound under arbitrary and unpredictable salience shifts of bounded total variation $V(S)$. It ensures that each token accumulates exactly a fraction $\rho$ of 4-bit contributions, thereby effectively decoupling model performance from unpredictable fluctuations in salience. Furthermore, the interleaved super-groups inherently serve as evenly spaced, high-precision buffers along the output dimension. This uniform distribution of precision guarantees that no contiguous block of output features suffers from concentrated degradation, effectively mitigating clustered quantization errors and yielding a smaller discrepancy bound than either the stacked or the random allocation. Beyond algorithmic stability, the random allocation is not hardware-friendly, while our deterministic, repeating structure aligns with hardware execution granularities, supporting coalesced memory access and potentially improving throughput in low-level kernel implementations.

Appendix B Pattern of important features

Figure 11: Frequency of the $k$ layers with the lowest $c_{l}$ ($k=3$).
Figure 12: Frequency of the $k$ layers with the lowest $c_{l}$ ($k=5$).
Figure 13: Frequency of the $k$ layers with the lowest $c_{l}$ ($k=10$).
Figure 14: Average rank of each layer along with its corresponding range.

In Section 3.2, we adopt cosine similarity as an importance metric to quantify transformations across feature layers. In this section, we investigate how important feature patterns vary across data domains, thereby demonstrating the limitations of pre-specified layer supervision. We analyze two distinct domains, mathematics and coding, from our training corpus. The mathematical dataset is TIGER-Lab/MathInstruct (Yue et al., 2023) and the coding dataset is jtatman/python-code-dataset-500k. Specifically, we randomly sample 1024 instances from each domain, using seed 42, and process them with the full-precision Qwen3-1.7B to calculate the similarities between consecutive layers. We aggregate the frequency at which each layer ranks among the $k$ layers with the lowest similarity in Figures 11, 12, and 13. We further depict the average rank of each layer, along with its corresponding variance, in Figure 14.
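The counting procedure described above can be sketched as follows. The exact definition of $c_{l}$ is given in Section 3.2; here it is assumed to be the token-averaged cosine similarity between the hidden states entering and leaving layer $l$, and all function names are hypothetical. Hidden states are passed in as plain arrays, so the sketch does not depend on any particular model-loading API.

```python
import numpy as np


def layer_similarities(hidden_states):
    """hidden_states: array of shape (num_layers + 1, tokens, dim).

    Returns one value per layer: the mean cosine similarity between the
    layer's input and output (assumed form of c_l).
    """
    h_in, h_out = hidden_states[:-1], hidden_states[1:]
    dots = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return np.mean(dots / np.maximum(norms, 1e-12), axis=-1)


def lowest_k_layers(similarities, k):
    """0-based indices of the k layers with the lowest similarity."""
    return np.argsort(similarities, kind="stable")[:k]


def selection_frequency(per_sample_states, k):
    """Count how often each layer is among the k lowest-similarity layers."""
    num_layers = per_sample_states[0].shape[0] - 1
    counts = np.zeros(num_layers, dtype=int)
    for states in per_sample_states:
        counts[lowest_k_layers(layer_similarities(states), k)] += 1
    return counts


# Toy example: 3 layers, 2 tokens, 2 dimensions; only layer index 1 rotates the features.
states = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 2.0]],
])
sims = layer_similarities(states)
freq = selection_frequency([states, states], k=1)
```

In the toy example, the middle layer is the only one that changes the feature direction, so it is selected for every sample; rescaling alone (the last layer) leaves the cosine similarity at 1.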

There are three key observations: (1) In Figure 11 ($k=3$), frequency distributions diverge significantly: mathematical data heavily transforms layers 17 and 18, which coding data rarely alters, whereas coding data modifies layer 2 more frequently; (2) In Figures 12 and 13 ($k=5,10$), this structural divergence persists, notably at $k=5$ where mathematical data transforms layer 18 over 750 times compared to only ${\sim}300$ times for coding data; (3) In Figure 14, the average rank analysis further confirms this discrepancy, revealing that mathematical data prominently transforms layers 17 through 22, while coding data primarily targets layers 17 through 20 and uniquely emphasizes layer 25.

Observations (1)–(2) indicate that the frequency of largely transformed intermediate layers is domain-dependent. Observation (3) reveals that critical representational bottlenecks are specialized and only partially overlap across domains. Observations (1)–(3) demonstrate that statically encompassing all these disjoint critical layers requires a large union, which induces redundant and over-regularized computations. This additionally motivates the LAFD configuration $k=3$ for our training hyperparameters in Section 4.1, where dynamically selecting 3 layers strictly concentrates the distillation signal on the most essential representational bottlenecks.

These results reveal that the pattern of important intermediate layers is domain-dependent, rendering fixed-layer feature distillation inefficient for heterogeneous datasets. Motivated by this structural uncertainty, we propose Layer-Adaptive Feature Distillation (LAFD) in Section 3.2.

Appendix C Mixing coefficient of forward-reverse KL divergence

Table 5: The proportions of mismatched tokens among high-confidence tokens.
Data types Thresholds Human-annotated: High-conf (%) Human-annotated: Mismatched (%) Distilled: High-conf (%) Distilled: Mismatched (%)
Mathematics 0.60 79.18 16.78 86.16 2.98
Mathematics 0.70 72.36 13.64 81.02 1.76
Mathematics 0.80 65.25 10.54 75.61 1.14
Mathematics 0.90 56.48 7.28 68.61 0.74
Mathematics 0.95 49.47 5.36 62.83 0.60
Coding 0.60 82.96 11.11 83.14 2.46
Coding 0.70 76.67 8.51 77.38 1.31
Coding 0.80 70.15 6.13 71.29 0.86
Coding 0.90 61.77 3.77 63.60 0.63
Coding 0.95 55.07 2.41 57.34 0.56
Figure 15: Scatter plots of top-1 probability versus CAKLD confidence on human-annotated datasets. Mismatched tokens frequently exhibit high top-1 probabilities despite near-zero CAKLD confidence.

The CAKLD method introduced in BitDistiller (Du et al., 2024) employs the label token probability as the mixing coefficient to balance the forward and reverse KL divergences. This approach relies entirely on internally distilled datasets generated by the teacher model. Consequently, it struggles to leverage diverse corpora, such as high-quality human-annotated datasets or synthetic data generated by leading external models. Using the same datasets as in Appendix B, we randomly select 1024 samples from the mathematical and coding datasets, using seed 42. These datasets comprise both human-annotated and synthetic data. We prompt the full-precision Qwen3-0.6B to generate distilled responses using a maximum of 1024 tokens, a temperature of 0.7, a top-p of 0.8, and a top-k of 20. Subsequently, we collect the teacher output distributions via forward passes. CAKLD confidence denotes the probability that the teacher assigns to the label token. We define high-confidence tokens as those for which the teacher assigns a probability exceeding a specified threshold. Mismatched tokens are those for which the teacher’s top-1 predicted token differs from the provided label. Table 5 presents the proportion of mismatched tokens among the high-confidence tokens, and Figure 15 plots all mismatched tokens.
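
The quantities reported in Table 5 can be illustrated with a short NumPy sketch. It is not the authors' code: the function names are hypothetical, and it assumes that "high-confidence" refers to the teacher's top-1 probability, which the definition above leaves open.

```python
import numpy as np

def mismatch_statistics(teacher_probs, labels, threshold):
    """Return (high-confidence %, mismatched % among high-confidence tokens).

    teacher_probs: array of shape (num_tokens, vocab_size), rows sum to 1.
    labels: integer array of shape (num_tokens,) with the label token ids.
    """
    top1_ids = teacher_probs.argmax(axis=1)
    top1_probs = teacher_probs.max(axis=1)
    # Assumption: "high-confidence" refers to the teacher's top-1 probability.
    high_conf = top1_probs > threshold
    mismatched = top1_ids != labels
    high_conf_pct = 100.0 * high_conf.mean()
    num_high_conf = max(int(high_conf.sum()), 1)  # avoid division by zero
    mismatched_pct = 100.0 * (high_conf & mismatched).sum() / num_high_conf
    return high_conf_pct, mismatched_pct

def cakld_confidence(teacher_probs, labels):
    """Probability the teacher assigns to each label token."""
    return teacher_probs[np.arange(len(labels)), labels]

# Toy example: 4 tokens over a 3-token vocabulary (made-up numbers).
toy_probs = np.array([[0.90, 0.05, 0.05],
                      [0.70, 0.20, 0.10],
                      [0.50, 0.30, 0.20],
                      [0.10, 0.80, 0.10]])
toy_labels = np.array([0, 1, 0, 2])
high_pct, mismatch_pct = mismatch_statistics(toy_probs, toy_labels, threshold=0.6)
confidence = cakld_confidence(toy_probs, toy_labels)
```

In the toy example, the second and fourth tokens are predicted with high top-1 probability but disagree with the label, so their CAKLD confidence (0.2 and 0.1) is low even though the teacher is confident. This is the pattern that Figure 15 visualizes.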

There are three observations: (1) In Table 5, for mathematical data, the proportion of mismatched tokens among high-confidence tokens is significantly larger on the human-annotated and synthetic data than on internally distilled data across all thresholds, ranging from 16.78% versus 2.98% at a 0.60 threshold to 5.36% versus 0.60% at a 0.95 threshold; (2) In Table 5, coding data exhibits the same trend, with mismatched proportions on human-annotated and synthetic data reaching 11.11% compared to 2.46% on distilled data at the 0.60 threshold; (3) In Figure 15, both mathematical and coding data reveal dense mismatch distributions where the teacher exhibits high top-1 probability despite the corresponding CAKLD confidence approaching zero.

Observations (1)–(3) demonstrate that CAKLD encounters a substantial share of mismatched tokens when applied to human-annotated and synthetic data. This strict dataset dependency prevents quantized models from leveraging optimal and heterogeneous training recipes. Motivated by this limitation, we propose Entropy-Aware KL Divergence in Section 3.3 to generalize logit distillation across various datasets.

Appendix D Additional details of experiments

D.1 Training data

Table 6: Details of datasets used for training.
# Datasets Subsets Split Data sizes
1 BAAI/Infinity-Instruct (Li et al., 2025a) 7M_domains train 7.45M
2 BAAI/Infinity-Instruct Gen train 1.46M
3 allenai/tulu-v3.1-mix-preview-4096-OLMoE train 0.61M
4 a-m-team/AM-DeepSeek-R1-Distilled-1.4M (Zhao et al., 2025) am_0.5M+am_0.9M train 1.40M
5 Mixed Downstream Datasets (Bisk et al., 2020; Clark et al., 2019, 2018; Lin et al., 2022; Mihaylov et al., 2018; Sakaguchi et al., 2021; Sap et al., 2019; Zellers et al., 2019) train 0.24M
6 BAAI/Infinity-Instruct 7M_core train 1.48M
7 HuggingFaceM4/TGIF (Li et al., 2016) train 10K

Table 6 summarizes the sources, subsets, and sizes of our training corpora. We construct tailored data mixtures for each model architecture. We train the Qwen3 models on datasets 1 through 5. Because we configure Qwen3 in a non-reasoning mode, we strip all chain-of-thought traces from the distilled DeepSeek samples prior to training. We train MobileLLM-350M using datasets 5 and 6. Finally, we train the multimodal Qwen2.5-Omni-7B model on dataset 7. Additionally, sequences are truncated to a maximum length of 1024 tokens for all datasets during training.
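
The two preprocessing steps mentioned above, removing chain-of-thought traces and truncating to 1024 tokens, might look like the following sketch. The `<think>...</think>` delimiter and both function names are assumptions made for illustration; the paper does not specify how reasoning traces are marked in the data.

```python
import re

# Assumption: reasoning traces are wrapped in <think>...</think> tags.
THINK_PATTERN = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(response: str) -> str:
    """Remove chain-of-thought spans and keep only the final answer text."""
    return THINK_PATTERN.sub("", response).strip()

def truncate(token_ids, max_length=1024):
    """Keep at most max_length tokens from the start of the sequence."""
    return token_ids[:max_length]

cleaned = strip_reasoning("<think>2+2 is 4</think>\nThe answer is 4.")
clipped = truncate(list(range(2000)))
```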

D.2 Evaluation protocols

Table 7 summarizes the evaluation protocols, including downstream tasks with their categories, N-shot settings, scoring methods, and metrics. 0-shot* denotes the 0-shot evaluation format for tasks where the models are trained on training splits from these domains. In Appendix E, we provide results validating robust generalization on held-out benchmarks.

Table 7: Overview of evaluation benchmarks used for EdgeRazor.
Categories Tasks N-shot Scoring methods Metrics
Commonsense reasoning ARC-e (Clark et al., 2018) 0-shot* Log-likelihood Acc_norm
ARC-c (Clark et al., 2018) 0-shot* Log-likelihood Acc_norm
HellaSwag (Zellers et al., 2019) 0-shot* Log-likelihood Acc_norm
BoolQ (Clark et al., 2019) 0-shot* Log-likelihood Acc
PIQA (Bisk et al., 2020) 0-shot* Log-likelihood Acc_norm
Winogrande (Sakaguchi et al., 2021) 0-shot* Log-likelihood Acc
SIQA (Sap et al., 2019) 0-shot* Log-likelihood Acc
Reading comprehension OpenBookQA (Mihaylov et al., 2018) 0-shot* Log-likelihood Acc_norm
Trustworthiness Ethics (Hendrycks et al., 2020a) 0-shot* Log-likelihood Acc
Truthfulness TruthfulQA2 (Lin et al., 2022) 0-shot Log-likelihood Acc
Knowledge MMLU (Hendrycks et al., 2020b) 0-shot Log-likelihood Acc
Instruction following IF-Eval (Zhou et al., 2023) 0-shot Generation Prompt Strict Acc
Mathematics GSM8K (Cobbe et al., 2021) 5-shot Generation Acc
Coding HumanEval (Chen et al., 2021) 0-shot Generation Pass@1
Video understanding Video-MME (Fu et al., 2025) 0-shot Generation Acc
MLVU (Zhou et al., 2025) 0-shot Generation Acc
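
For the log-likelihood tasks in Table 7, the difference between Acc and Acc_norm can be sketched as below. The paper does not state which evaluation harness is used, so this assumes the common convention in which Acc_norm divides each candidate's log-likelihood by its length in bytes before taking the argmax; `pick_choice` is a hypothetical helper.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the highest-scoring answer choice.

    loglikelihoods: summed token log-probabilities of each choice.
    choices: the answer strings, used only for length normalization.
    """
    scores = []
    for loglik, choice in zip(loglikelihoods, choices):
        if normalize:
            # Assumed Acc_norm convention: normalize by byte length.
            scores.append(loglik / len(choice.encode("utf-8")))
        else:
            scores.append(loglik)
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up log-likelihoods: the short answer wins on raw score,
# the long answer wins once scores are length-normalized.
choices = ["yes", "a much longer answer"]
logliks = [-4.0, -6.0]
raw_pick = pick_choice(logliks, choices)
norm_pick = pick_choice(logliks, choices, normalize=True)
```

Length normalization removes the bias toward short answers that raw summed log-likelihoods carry, which is presumably why it is reported for tasks with answer choices of varying length.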

D.3 Per-task evaluation results

In this section, we provide the complete per-task results across comprehensive downstream tasks. The models include both base and instruction-tuned LLMs. We compare against a comprehensive suite of recent PTQ and QAT baselines, including GPTQ (Frantar et al., 2022), OmniQuant (Shao et al., 2024), AWQ (Lin et al., 2024), AQLM (Egiazarian et al., 2024), BiLLM (Huang et al., 2024), QuIP# (Tseng et al., 2024a), AutoRound (Cheng et al., 2024), VPTQ (Liu et al., 2024), QTIP (Tseng et al., 2024b), ARB-LLM (Li et al., 2025c), GPTAQ (Li et al., 2025b), Slim-LLM+ (Huang et al., 2025), Q-Palette (Lee and Song, 2025), LQER (Zhang et al., 2024), QuaRot (Ashkboos et al., 2024), ABQ-LLM (Zeng et al., 2025), SpinQuant (Liu et al., 2025b), QoQ (Lin et al., 2025), FlatQuant (Sun et al., 2025), EfficientQAT (Chen et al., 2025), and ParetoQ (Liu et al., 2025a). For the baselines, the group size is set to 64 for MobileLLM-350M and 128 for the Qwen models. Tables 8 and 9 present per-task performance for MobileLLM-350M. Tables 10 and 11 list per-task performance for Qwen3-0.6B. Tables 12 and 13 provide per-task performance for Qwen3-1.7B. Notably, the best performance is indicated in bold, and the second-best is underlined. Scores of 0.00 for full-precision models on tasks such as GSM8K and HumanEval indicate a lack of capability, whereas scores of 0.00 for quantized models on these tasks demonstrate severe capability degradation.
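
The Average column in the tables below appears consistent with an unweighted mean over the per-task scores. The following sketch checks this on two BF16 rows copied from Tables 8 and 10; the helper name is hypothetical, and the check is an observation about the reported numbers rather than a statement of the authors' procedure.

```python
def average_score(per_task_scores):
    """Unweighted mean of per-task scores, rounded to two decimals."""
    return round(sum(per_task_scores) / len(per_task_scores), 2)

# BF16 row of Table 8 (MobileLLM-350M, 13 tasks).
bf16_mobilellm = [64.94, 35.49, 52.87, 58.96, 70.84, 56.35, 40.79,
                  40.20, 53.98, 37.44, 23.52, 0.00, 0.00]

# BF16 row of Table 10 (Qwen3-0.6B, 14 tasks including IFEval).
bf16_qwen3_06b = [56.02, 34.04, 47.23, 64.04, 67.36, 56.04, 39.20,
                  31.20, 47.70, 42.84, 40.12, 58.41, 41.54, 37.20]

avg_mobilellm = average_score(bf16_mobilellm)   # reported: 41.18
avg_qwen3_06b = average_score(bf16_qwen3_06b)   # reported: 47.35
```

Because the mean is unweighted, tasks on which a model scores 0.00, such as GSM8K and HumanEval for MobileLLM-350M, pull the average down by the same amount for every method in that table.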

Table 8: Detailed performance of weight-only quantization methods on MobileLLM-350M.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU GSM8K HumanE. Average
BF16 16-16-16 64.94 35.49 52.87 58.96 70.84 56.35 40.79 40.20 53.98 37.44 23.52 0.00 0.00 41.18
OmniQuant 4-16-16 63.30 34.64 51.79 58.62 70.62 54.46 40.12 38.40 48.76 37.48 24.26 0.00 0.00 40.19
OmniQuant 3-16-16 61.07 33.53 48.73 60.58 68.28 54.78 39.61 36.00 51.56 39.90 23.14 0.00 0.00 39.78
OmniQuant 2-16-16 43.98 24.57 33.42 56.48 58.11 49.41 35.67 29.60 45.26 43.93 23.18 0.00 0.00 34.12
AWQ 4-16-16 63.80 35.49 52.10 59.05 70.24 54.38 39.97 39.20 54.12 37.85 24.16 0.00 0.00 40.80
AWQ 3-16-16 62.67 33.62 50.26 61.01 68.61 54.54 39.76 37.40 46.94 39.91 24.23 0.00 0.00 39.92
AWQ 2-16-16 36.15 25.00 30.92 48.65 56.20 48.93 34.60 27.40 47.63 51.23 23.23 0.00 0.00 33.07
AQLM 4-16-16 64.31 35.24 52.65 58.65 70.89 55.41 40.74 39.20 53.17 37.11 24.06 0.00 0.00 40.88
AQLM 3-16-16 63.34 33.79 50.57 58.69 69.31 55.01 39.30 37.80 52.25 38.55 22.77 0.00 0.00 40.11
AQLM 2-16-16 61.03 33.11 48.48 62.11 68.01 54.46 39.71 35.20 43.35 38.31 24.31 0.15 0.00 39.09
BiLLM 1.06-16-16 33.12 22.87 30.01 45.41 53.32 49.64 34.54 27.00 46.91 46.23 23.07 0.00 0.00 31.70
QuIP# 4-16-16 29.38 25.34 26.98 49.02 51.20 48.15 34.08 25.80 49.11 47.14 23.50 0.00 0.00 31.52
QuIP# 3-16-16 60.23 32.59 49.86 59.20 69.37 53.35 39.66 35.60 49.58 38.24 23.24 0.00 0.00 39.30
QuIP# 2-16-16 53.28 27.56 43.88 59.88 64.69 53.83 38.28 32.20 44.52 39.47 23.03 0.00 0.00 36.97
AutoRound 4-16-16 63.43 34.64 51.70 57.52 70.46 56.20 40.33 38.60 49.52 37.16 24.28 0.00 0.00 40.30
AutoRound 3-16-16 62.21 33.45 50.03 57.74 68.17 54.06 39.46 37.40 49.26 37.65 23.16 0.00 0.00 39.43
AutoRound 2-16-16 50.84 26.88 38.64 57.89 62.62 53.59 36.64 31.20 49.59 44.35 23.19 0.00 0.00 36.57
VPTQ 4-16-16 26.35 29.52 25.77 61.19 50.22 49.64 33.93 27.40 43.76 48.02 26.42 0.00 0.00 32.48
VPTQ 3-16-16 25.97 29.35 25.86 61.44 50.33 50.36 33.78 28.60 44.87 48.26 26.73 0.00 0.00 32.73
VPTQ 2-16-16 25.97 29.10 25.85 60.67 49.40 49.72 32.60 26.40 43.49 48.44 26.70 0.00 0.00 32.18
QTIP 2-16-16 26.35 28.24 26.40 37.83 49.08 50.59 35.16 24.60 56.67 49.21 22.97 0.00 0.00 31.32
ARB-LLM 1-16-16 37.33 22.18 31.16 58.47 54.84 51.07 34.95 28.80 43.55 46.55 22.91 0.30 0.00 33.24
Slim-LLM+ 3-16-16 59.18 31.66 47.77 60.83 67.79 54.22 38.84 35.20 53.37 39.27 23.10 0.00 0.00 39.33
Slim-LLM+ 2-16-16 37.71 23.89 30.94 57.58 56.69 50.36 34.44 26.00 43.22 46.71 23.04 0.00 0.00 33.14
Q-Palette 4-16-16 63.76 34.47 51.10 59.60 70.51 56.27 39.97 38.20 55.76 38.34 24.63 0.00 0.00 40.97
Q-Palette 3.25-16-16 63.76 35.58 50.75 56.39 69.42 54.06 39.46 36.80 55.53 37.53 23.09 0.00 0.00 40.18
Q-Palette 2-16-16 51.05 29.18 42.63 55.63 63.82 52.33 37.56 31.20 55.78 43.64 24.36 0.00 0.00 37.48
Q-Palette 1.75-16-16 48.27 26.62 38.93 47.55 60.66 51.62 37.82 30.40 51.22 46.32 22.85 0.00 0.00 35.56
EfficientQAT 4-16-16 63.68 35.67 51.73 58.47 70.73 56.75 40.74 38.40 54.07 37.11 24.16 0.00 0.00 40.89
EfficientQAT 3-16-16 61.53 33.11 49.45 60.89 69.04 53.75 39.66 36.80 45.35 37.66 23.24 0.00 0.00 39.27
EfficientQAT 2-16-16 49.92 27.05 39.29 61.77 63.49 50.91 37.56 29.60 46.05 42.50 22.93 0.00 0.00 36.24
ParetoQ 4-16-16 64.23 38.14 53.13 58.32 71.55 56.20 40.33 38.00 50.73 37.04 24.78 0.08 0.00 40.96
ParetoQ 3-16-16 62.75 33.28 51.24 60.92 70.95 56.75 39.82 39.00 46.39 37.00 25.02 0.00 0.00 40.24
ParetoQ 2-16-16 57.66 32.59 46.95 63.03 69.31 56.67 40.43 35.20 43.40 36.25 24.88 0.45 0.00 38.99
ParetoQ 1.58-16-16 56.10 29.95 43.68 61.62 67.30 54.30 39.30 36.40 43.34 38.82 23.05 0.15 0.00 38.00
EdgeRazor 4-16-16 69.19 36.26 51.91 62.26 70.40 56.20 40.74 37.40 57.41 37.96 25.00 0.53 0.00 41.94
EdgeRazor 2.79-16-16 65.87 32.68 45.98 61.71 68.82 56.27 40.02 35.00 56.53 38.97 24.27 0.76 0.00 40.53
EdgeRazor 1.88-16-16 62.50 29.61 42.36 59.54 67.08 54.38 39.71 33.00 57.67 40.09 24.73 0.76 0.00 39.34
EdgeRazor 1.58-16-16 58.63 26.19 38.95 58.07 65.29 53.04 39.30 32.20 56.26 41.97 24.12 0.53 0.00 38.04
Table 9: Detailed performance of weight-activation quantization methods on MobileLLM-350M.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU GSM8K HumanE. Average
BF16 16-16-16 64.94 35.49 52.87 58.96 70.84 56.35 40.79 40.20 53.98 37.44 23.52 0.00 0.00 41.18
OmniQuant 4-8-8 63.93 34.81 51.86 57.80 70.18 55.01 40.17 37.80 50.93 37.38 24.37 0.00 0.00 40.33
OmniQuant 3-8-8 61.41 32.08 48.78 60.21 68.28 54.14 39.56 35.80 52.66 38.30 23.57 0.00 0.00 39.60
OmniQuant 2-8-8 44.49 23.21 33.21 62.02 57.13 50.99 34.95 29.00 46.75 44.85 23.42 0.00 0.00 34.62
LQER 4-8-8 64.02 36.01 51.84 56.61 70.78 53.91 40.02 39.40 53.17 38.36 24.08 0.00 0.00 40.63
LQER 3-8-8 58.54 32.17 48.82 58.26 67.46 53.51 38.84 37.20 49.53 41.88 24.63 0.00 0.00 39.30
LQER 2-8-8 35.86 23.29 29.21 44.43 54.90 49.17 35.52 25.40 53.12 45.15 23.31 0.00 0.00 32.26
QuaRot 4-8-8 26.73 25.77 26.09 38.72 48.80 47.67 34.90 24.40 52.58 50.28 24.16 0.00 0.00 30.78
QuaRot 3-8-8 27.40 27.47 25.37 40.76 50.11 51.07 34.24 23.00 55.99 50.97 24.39 0.00 0.00 31.60
QuaRot 2-8-8 25.97 27.65 25.44 39.88 50.00 49.41 34.39 23.60 51.36 50.47 22.99 0.00 0.00 30.86
ABQ-LLM 4-8-8 63.22 34.73 51.50 55.29 70.02 56.04 40.99 38.20 54.34 38.07 23.66 0.00 0.00 40.47
ABQ-LLM 3-8-8 59.51 31.31 46.78 61.68 66.92 53.83 38.74 35.00 44.72 37.56 23.47 0.00 0.00 38.42
ABQ-LLM 2.32-8-8 47.10 25.68 38.18 60.67 61.70 51.46 35.57 30.20 43.66 43.53 23.80 0.00 0.00 35.50
SpinQuant 4-8-8 61.66 34.04 51.62 59.45 69.97 56.59 39.92 37.80 53.33 38.99 23.76 0.00 0.00 40.55
SpinQuant 3-8-8 58.33 30.72 46.80 54.01 67.57 55.49 38.49 34.80 55.54 39.03 24.60 0.08 0.00 38.88
SpinQuant 2-8-8 31.65 23.55 28.43 37.92 52.50 49.49 34.19 26.40 56.72 47.59 22.92 0.00 0.00 31.64
QoQ 4-8-4 62.16 35.84 51.69 57.61 69.21 54.14 40.28 38.80 55.19 37.43 23.42 0.00 0.00 40.44
FlatQuant 4-8-8 61.95 34.73 51.72 58.93 70.84 53.75 40.33 39.00 52.24 38.01 23.56 0.15 0.00 40.40
FlatQuant 3-8-8 61.07 31.91 48.43 55.44 68.39 53.12 39.51 35.20 45.23 39.24 24.74 0.08 0.00 38.64
FlatQuant 2-8-8 31.44 22.27 27.19 44.71 51.74 47.59 34.60 25.40 43.53 49.22 23.03 0.00 0.00 30.82
EdgeRazor 4-8-8 69.11 35.84 51.82 62.60 70.35 56.20 40.58 37.40 57.21 37.90 24.66 0.45 0.00 41.86
EdgeRazor 2.79-8-8 65.99 32.68 45.99 62.11 68.55 56.51 40.07 35.20 56.51 39.05 24.41 0.99 0.00 40.62
EdgeRazor 1.88-8-8 62.33 29.52 42.34 59.39 67.30 55.17 39.71 32.20 57.77 40.17 24.70 0.61 0.00 39.32
EdgeRazor 1.58-8-8 58.67 26.19 38.92 58.04 65.23 53.83 39.25 32.00 56.33 42.03 24.19 0.83 0.00 38.12
Table 10: Detailed performance of weight-only quantization methods on Qwen3-0.6B.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
BF16 16-16-16 56.02 34.04 47.23 64.04 67.36 56.04 39.20 31.20 47.70 42.84 40.12 58.41 41.54 37.20 47.35
GPTQ 4-16-16 52.78 32.85 45.10 61.71 65.18 55.56 41.15 31.00 49.67 44.89 33.86 53.05 26.23 18.90 43.71
GPTQ 3-16-16 36.91 25.60 38.66 60.18 60.66 53.67 38.54 28.80 44.84 43.48 26.45 24.95 0.61 0.00 34.53
GPTQ 2-16-16 24.87 25.26 26.43 42.63 51.41 53.43 33.83 27.00 54.33 47.60 24.72 8.50 0.00 0.00 30.00
OmniQuant 4-16-16 47.01 31.06 44.43 58.32 65.18 57.06 38.28 31.80 44.39 42.32 40.55 12.01 0.00 0.00 36.60
OmniQuant 3-16-16 44.61 26.02 38.94 63.36 61.81 54.62 36.90 29.40 43.25 43.81 30.56 10.72 0.00 0.00 34.57
OmniQuant 2-16-16 32.41 22.27 28.55 38.62 55.22 50.51 34.54 24.60 56.73 51.28 22.92 12.20 0.00 0.00 30.70
AWQ 4-16-16 52.15 31.91 45.36 61.56 65.45 54.22 37.62 31.00 46.07 39.34 40.62 56.56 33.97 29.27 44.65
AWQ 3-16-16 39.69 26.62 39.77 57.55 61.92 54.46 37.36 29.60 45.67 44.94 27.97 25.14 2.65 1.83 35.37
AWQ 2-16-16 25.17 26.71 26.22 61.62 51.31 51.46 33.52 26.60 45.77 48.12 26.89 10.91 0.00 0.00 31.02
AQLM 4-16-16 52.82 33.19 47.03 63.91 67.41 55.64 39.76 33.20 45.44 42.69 41.04 56.01 40.26 32.32 46.48
AQLM 3-16-16 50.42 29.18 43.13 64.19 64.47 56.43 39.56 31.60 44.42 41.97 31.28 44.73 15.85 0.61 39.85
AQLM 2-16-16 40.49 29.86 40.72 43.00 64.15 55.56 37.36 31.00 47.88 44.02 33.79 34.75 8.49 0.00 36.51
BiLLM 1.06-16-16 27.36 25.94 27.06 46.64 51.41 49.49 33.01 26.20 47.49 49.25 24.28 11.65 0.00 0.00 29.98
QuIP# 4-16-16 26.05 26.71 26.11 45.78 49.40 50.75 33.62 27.20 52.45 45.32 24.68 10.54 0.00 0.00 29.90
QuIP# 3-16-16 44.07 28.41 38.00 63.82 62.19 54.06 36.23 29.00 44.97 45.29 28.06 19.78 1.97 0.00 35.42
QuIP# 2-16-16 27.53 23.55 27.18 37.83 51.69 51.22 34.75 27.00 56.67 50.27 22.78 10.54 0.00 0.00 30.07
AutoRound 4-16-16 51.47 31.31 45.56 67.31 66.81 53.83 39.71 31.40 44.57 41.72 41.20 55.45 32.98 37.20 45.75
AutoRound 3-16-16 47.43 27.99 41.60 58.20 63.49 54.14 37.92 30.80 56.72 42.16 40.49 39.74 13.80 18.90 40.96
AutoRound 2-16-16 35.31 22.70 31.43 60.43 58.16 51.70 35.16 27.60 46.28 45.54 22.92 7.95 0.00 0.00 31.80
VPTQ 4-16-16 47.01 30.20 45.19 67.19 66.43 55.56 39.36 29.80 45.14 43.21 31.08 51.76 31.69 0.00 41.69
VPTQ 3-16-16 42.30 28.92 40.85 63.49 61.81 50.83 39.41 29.40 46.01 46.50 28.04 33.27 6.29 7.32 37.46
VPTQ 2-16-16 32.11 24.15 31.14 57.52 55.98 51.46 36.39 26.40 44.99 47.13 23.52 8.87 0.15 0.00 31.42
QTIP 2-16-16 44.99 27.30 39.79 65.93 62.24 55.80 38.74 29.60 45.59 43.33 23.17 23.48 3.26 0.00 35.94
ARB-LLM 1-16-16 28.37 25.51 29.30 46.36 53.05 49.49 34.08 26.20 54.66 47.61 23.71 12.38 0.00 0.00 30.77
GPTAQ 4-16-16 50.93 33.45 45.31 61.19 66.43 56.91 40.58 31.20 52.43 43.90 38.84 52.87 29.87 18.90 44.49
GPTAQ 3-16-16 40.87 26.28 39.93 60.34 61.26 54.70 38.64 29.40 47.20 43.30 29.07 26.25 1.29 0.00 35.61
GPTAQ 2-16-16 26.47 24.91 26.25 40.18 50.98 49.96 34.80 27.20 55.01 49.41 24.07 7.95 0.00 0.00 29.80
Slim-LLM+ 3-16-16 42.13 25.09 38.52 62.94 61.48 53.99 36.44 29.60 43.41 44.81 27.78 3.33 5.76 0.00 33.95
Slim-LLM+ 2-16-16 30.68 21.16 27.93 37.92 55.60 51.38 35.21 26.60 56.67 50.35 22.99 11.09 0.00 0.00 30.54
Q-Palette 4-16-16 52.23 31.74 45.22 49.79 65.40 53.83 39.87 30.80 43.55 55.79 35.59 30.68 0.00 39.02 40.97
Q-Palette 3.25-16-16 42.09 27.90 42.27 59.54 64.15 52.96 39.25 30.60 43.08 44.60 33.22 27.17 0.00 18.90 37.55
Q-Palette 2-16-16 29.92 23.63 28.44 60.95 52.88 48.86 33.67 25.40 45.89 46.39 24.18 9.06 0.00 0.00 30.66
Q-Palette 1.75-16-16 28.96 25.68 27.06 60.18 52.29 48.93 34.08 26.20 47.16 44.75 23.12 12.94 0.00 0.00 30.81
EfficientQAT 4-16-16 53.41 34.47 47.36 70.40 68.06 56.83 40.07 31.00 47.11 45.51 43.48 50.65 0.00 28.66 44.07
EfficientQAT 3-16-16 48.44 30.03 44.12 67.34 66.54 54.85 39.00 30.80 44.20 43.83 31.43 34.57 0.00 23.78 39.92
EfficientQAT 2-16-16 43.22 26.71 35.93 65.69 60.61 52.64 37.77 28.20 43.75 44.05 24.31 2.96 0.00 0.00 33.27
EdgeRazor 4-16-16 58.54 33.45 45.04 68.01 68.34 55.72 40.07 33.40 54.36 43.69 39.37 53.42 42.00 34.15 47.83
EdgeRazor 2.79-16-16 51.77 28.33 37.47 70.70 63.71 54.06 40.33 28.20 55.08 42.72 36.85 51.39 26.69 31.10 44.17
EdgeRazor 1.88-16-16 51.22 27.73 34.21 66.91 63.66 53.35 38.43 27.60 55.92 43.80 28.78 42.51 25.09 23.17 41.60
EdgeRazor 1.58-16-16 45.75 25.77 33.89 66.64 60.72 52.33 38.23 29.80 51.70 44.40 32.85 37.34 14.25 23.17 39.77
Table 11: Detailed performance of weight-activation quantization methods on Qwen3-0.6B.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
BF16 16-16-16 56.02 34.04 47.23 64.04 67.36 56.04 39.20 31.20 47.70 42.84 40.12 58.41 41.54 37.20 47.35
OmniQuant 4-8-8 48.11 30.46 44.06 66.24 65.07 55.88 37.82 32.00 48.90 42.07 39.11 12.01 0.00 0.00 37.27
OmniQuant 3-8-8 42.42 28.07 38.73 64.19 61.59 54.14 37.15 28.80 43.81 43.74 30.40 11.09 0.00 0.00 34.58
OmniQuant 2-8-8 32.20 21.76 27.66 38.13 54.57 50.28 33.37 26.20 56.65 51.43 22.99 11.65 0.00 0.00 30.49
LQER 4-8-8 55.64 31.83 45.21 63.76 66.05 53.51 38.43 29.80 47.86 41.85 41.13 54.16 31.61 33.54 45.31
LQER 3-8-8 41.33 26.88 39.91 62.05 61.32 52.01 38.18 27.80 43.52 43.05 26.56 38.63 4.32 4.88 36.46
LQER 2-8-8 27.57 25.51 26.74 53.52 53.48 50.36 33.37 27.00 44.22 49.98 23.57 11.09 0.00 0.00 30.46
QuaRot 4-8-8 24.07 27.22 26.59 46.94 51.20 49.64 33.93 29.80 51.01 48.25 24.46 8.50 0.00 0.00 30.12
QuaRot 3-8-8 23.74 27.90 26.40 45.66 51.41 47.59 32.80 29.80 51.24 47.47 25.42 7.95 0.00 0.00 29.81
QuaRot 2-8-8 24.83 27.39 26.25 48.23 51.90 48.38 32.29 30.60 50.73 48.85 24.23 7.95 0.00 0.00 30.12
ABQ-LLM 4-8-8 56.14 34.04 47.46 63.91 67.30 56.83 39.30 31.20 47.79 42.76 40.09 58.04 0.00 38.41 44.52
ABQ-LLM 3-8-8 32.45 23.29 28.43 54.98 54.24 50.36 33.37 25.80 53.96 52.50 23.05 11.65 0.00 0.00 31.72
ABQ-LLM 2.32-8-8 26.18 26.79 26.07 43.00 51.20 49.57 33.98 28.00 55.29 49.11 24.16 12.20 0.00 0.00 30.40
SpinQuant 4-8-8 48.32 30.29 44.41 52.94 65.56 56.27 38.74 32.60 55.61 43.42 32.67 48.61 25.25 3.05 41.27
SpinQuant 3-8-8 40.45 25.77 40.00 38.47 61.15 55.72 37.67 27.80 56.71 45.08 24.36 32.53 3.34 0.00 34.93
SpinQuant 2-8-8 30.77 23.21 27.92 45.47 51.09 50.99 33.83 24.80 52.44 43.98 24.84 11.28 0.00 0.00 30.04
QoQ 4-8-4 24.54 25.09 26.17 39.17 50.82 49.88 32.70 27.00 56.22 49.68 24.95 10.91 0.00 0.00 29.80
FlatQuant 4-8-8 54.21 30.80 45.66 65.87 66.59 56.27 39.61 32.40 55.92 44.07 37.58 55.64 27.14 28.66 45.74
FlatQuant 3-8-8 44.91 28.16 40.17 54.95 62.89 53.43 37.92 28.00 55.24 42.20 31.40 40.11 3.26 0.61 37.38
FlatQuant 2-8-8 28.32 21.84 26.52 39.63 53.37 51.07 34.19 28.20 56.17 49.67 22.94 11.28 0.00 0.00 30.23
EdgeRazor 4-8-8 57.79 33.70 45.00 67.49 67.85 55.88 40.17 33.80 54.09 43.53 39.73 53.42 42.00 34.76 47.80
EdgeRazor 2.79-8-8 52.10 28.50 37.36 70.58 63.93 53.12 40.12 28.60 54.97 42.82 36.44 49.54 26.99 32.32 44.10
EdgeRazor 1.88-8-8 51.47 27.99 34.22 66.85 63.49 53.04 38.02 27.40 55.92 43.88 29.56 44.55 25.09 23.17 41.76
EdgeRazor 1.58-8-8 44.87 26.11 33.88 66.73 60.55 51.30 38.28 31.00 50.76 44.72 33.09 38.45 15.01 22.56 39.81
Table 12: Detailed performance of weight-only quantization methods on Qwen3-1.7B.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
BF16 16-16-16 69.87 42.83 60.40 77.77 72.58 60.85 45.19 37.40 49.63 45.97 55.49 67.10 68.76 67.07 58.64
GPTQ 4-16-16 62.21 38.40 58.35 76.51 70.35 58.72 42.78 34.80 55.24 45.79 51.37 59.52 59.59 55.49 54.94
GPTQ 3-16-16 56.69 35.15 53.71 69.36 67.08 58.48 41.56 34.80 51.26 47.46 42.03 33.09 9.63 3.66 43.14
GPTQ 2-16-16 25.76 24.91 26.17 48.99 50.27 49.80 33.11 27.80 51.45 47.91 23.54 7.76 0.00 0.00 29.82
OmniQuant 4-16-16 69.11 41.13 58.02 79.79 71.00 62.35 44.63 36.00 52.10 44.84 52.34 15.34 0.00 0.00 44.76
OmniQuant 3-16-16 60.61 36.01 52.49 67.00 68.55 58.33 40.84 32.80 43.64 45.24 48.95 14.60 0.00 0.00 40.65
OmniQuant 2-16-16 40.95 24.32 32.85 59.60 59.19 52.17 35.82 27.40 44.65 44.53 22.94 12.20 0.00 0.00 32.62
AWQ 4-16-16 71.76 43.60 59.71 75.41 71.27 60.46 43.86 36.20 45.47 45.66 54.23 67.84 58.98 62.80 56.95
AWQ 3-16-16 56.36 34.90 52.98 70.52 68.39 58.33 40.02 31.60 51.45 45.67 46.92 48.43 30.10 32.32 47.71
AWQ 2-16-16 25.38 26.54 25.83 62.17 51.41 49.64 32.80 29.60 43.23 48.33 24.65 12.38 0.00 0.00 30.85
AQLM 4-16-16 67.68 42.06 59.84 75.54 71.55 60.69 44.52 36.20 51.01 46.08 55.70 65.25 66.41 63.41 57.57
AQLM 3-16-16 59.01 37.20 55.94 73.85 69.26 59.12 43.14 35.00 44.94 43.65 51.77 53.97 45.34 45.12 51.24
AQLM 2-16-16 55.85 33.11 49.98 67.34 67.08 59.43 42.22 30.80 43.34 43.36 42.81 23.11 21.76 0.00 41.44
BiLLM 1.04-16-16 27.57 27.74 27.49 39.45 50.87 50.91 33.52 25.00 56.43 33.52 25.21 10.35 0.00 0.00 29.15
QuIP# 4-16-16 38.01 23.89 31.36 63.36 58.43 52.57 35.36 25.80 43.83 46.91 24.90 12.94 0.00 0.00 32.67
QuIP# 3-16-16 35.52 22.53 31.51 62.75 56.20 51.85 35.62 26.60 43.83 48.15 28.25 17.56 0.00 0.00 32.88
QuIP# 2-16-16 33.16 20.48 30.02 61.01 54.73 50.59 35.01 24.60 43.23 47.81 24.81 13.12 0.00 0.00 31.33
AutoRound 4-16-16 69.32 43.00 58.88 80.06 70.62 60.62 44.93 36.60 59.39 48.53 55.94 64.70 63.38 60.37 58.31
AutoRound 3-16-16 60.27 37.12 54.62 73.18 69.42 59.75 42.84 35.20 46.83 45.44 49.55 54.90 45.26 46.34 51.48
AutoRound 2-16-16 47.60 28.58 39.78 66.36 61.64 50.83 39.36 30.00 43.51 42.59 30.59 11.83 1.06 0.00 35.27
VPTQ 4-16-16 71.04 39.93 57.55 75.14 70.02 60.22 44.22 35.60 47.65 44.21 54.61 65.25 62.40 63.41 56.52
VPTQ 3-16-16 52.53 37.46 53.75 73.33 68.01 58.88 40.89 35.80 56.37 43.39 36.89 55.45 22.14 29.27 47.44
VPTQ 2-16-16 36.66 25.17 38.08 63.12 58.60 53.91 37.26 28.80 43.57 43.36 25.46 9.80 0.00 0.00 33.13
QTIP 2-16-16 60.14 34.98 53.58 63.61 70.02 58.56 41.50 35.20 43.42 43.39 43.94 44.18 27.37 21.95 45.85
ARB-LLM 1-16-16 31.65 23.21 32.75 62.63 56.42 49.57 35.26 25.00 44.02 41.86 23.22 2.96 0.00 0.00 30.61
GPTAQ 4-16-16 68.52 42.06 58.59 78.72 71.06 59.67 43.45 35.40 58.34 46.23 53.31 63.22 60.96 59.15 57.05
GPTAQ 3-16-16 50.63 31.48 53.95 73.82 69.31 56.91 40.99 34.00 48.78 46.53 41.93 38.45 25.17 10.37 44.45
GPTAQ 2-16-16 28.20 22.27 27.57 43.79 52.72 50.43 34.80 25.40 53.50 49.51 23.52 7.95 0.00 0.00 29.98
Slim-LLM+ 3-16-16 61.20 36.35 51.18 68.47 67.79 58.56 41.76 34.60 44.59 41.18 48.60 41.22 35.25 24.39 46.80
Slim-LLM+ 2-16-16 34.55 23.55 31.26 61.19 55.93 52.17 35.67 27.80 44.01 48.01 22.95 14.60 0.00 0.00 32.26
Q-Palette 4-16-16 66.25 40.44 58.65 75.44 70.73 61.64 43.09 37.60 44.36 46.37 55.08 33.09 0.00 64.02 49.77
Q-Palette 3.25-16-16 55.68 36.69 56.57 78.35 70.29 57.85 41.04 36.20 46.44 54.30 52.03 29.39 0.00 52.44 47.66
Q-Palette 2-16-16 36.24 24.06 36.09 62.29 59.47 51.78 35.88 27.20 44.39 43.26 25.87 15.71 0.00 0.61 33.06
Q-Palette 1.75-16-16 31.02 22.95 30.28 61.87 54.46 48.38 34.24 23.80 46.96 43.31 22.88 12.75 0.00 0.00 30.92
EfficientQAT 4-16-16 69.44 43.17 61.35 79.36 73.12 61.40 44.93 35.80 53.26 46.16 56.49 63.03 0.00 57.32 53.20
EfficientQAT 3-16-16 64.90 39.25 56.80 79.14 71.11 60.30 43.86 37.20 54.82 43.17 53.10 57.49 0.00 48.17 50.67
EfficientQAT 2-16-16 57.87 31.40 45.02 72.57 65.45 57.46 42.22 30.40 48.20 41.45 36.97 14.79 0.00 0.00 38.84
EdgeRazor 4-16-16 70.66 44.80 57.51 80.09 72.31 60.14 44.06 38.40 64.02 48.41 54.70 58.96 68.39 57.32 58.56
EdgeRazor 2.79-16-16 63.47 38.57 49.48 78.78 68.23 55.64 43.91 33.40 60.81 45.42 46.25 54.71 54.28 53.66 53.33
EdgeRazor 1.88-16-16 59.60 34.04 40.94 72.11 65.23 54.38 41.76 29.80 57.30 46.09 38.93 43.81 36.39 39.63 47.14
EdgeRazor 1.58-16-16 55.60 31.06 39.53 70.95 63.60 53.28 41.97 31.60 55.89 40.16 35.00 32.72 29.49 33.54 43.89
Table 13: Detailed performance of weight-activation quantization methods on Qwen3-1.7B.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
BF16 16-16-16 69.87 42.83 60.40 77.77 72.58 60.85 45.19 37.40 49.63 45.97 55.49 67.10 68.92 67.07 58.65
OmniQuant 4-8-8 67.42 40.36 58.46 76.54 70.57 59.83 44.42 36.80 47.09 44.20 52.96 13.12 0.00 0.00 43.70
OmniQuant 3-8-8 62.21 35.15 52.17 67.74 68.44 56.75 41.71 33.60 44.65 44.17 48.68 14.60 0.00 0.00 40.71
OmniQuant 2-8-8 39.14 22.27 32.91 62.39 58.05 51.22 35.62 28.20 43.27 48.22 22.93 11.65 0.00 0.00 32.56
LQER 4-8-8 66.75 41.38 59.88 71.47 71.60 58.56 42.63 36.40 45.28 45.79 51.67 61.92 59.67 60.98 55.28
LQER 3-8-8 56.82 33.45 52.95 71.62 67.52 56.20 40.74 35.00 57.30 45.64 40.64 43.81 22.74 30.49 46.78
LQER 2-8-8 27.65 24.66 26.56 59.33 50.54 49.57 34.19 25.80 45.76 49.68 22.93 12.20 0.00 0.00 30.63
QuaRot 4-8-8 25.46 26.79 26.63 46.97 51.25 49.64 32.91 29.00 51.22 47.71 24.70 9.98 0.00 0.00 30.16
QuaRot 3-8-8 24.75 26.79 26.43 47.37 50.65 51.85 31.47 30.40 51.23 47.13 24.08 10.17 0.00 0.00 30.17
QuaRot 2-8-8 24.92 26.28 25.87 46.79 51.63 52.33 33.27 27.80 52.97 49.13 24.18 10.35 0.00 0.00 30.39
ABQ-LLM 4-8-8 63.59 40.44 56.43 77.92 70.78 58.64 43.76 35.60 51.77 42.25 52.00 12.94 0.00 0.00 43.29
ABQ-LLM 3-8-8 49.20 30.20 49.25 64.19 65.83 57.46 39.66 33.40 43.27 43.27 41.48 12.20 0.00 0.00 37.82
ABQ-LLM 2.32-8-8 37.37 25.77 28.12 41.01 57.29 52.33 34.95 26.80 49.66 47.02 22.97 11.65 0.00 0.00 31.07
SpinQuant 4-8-8 65.45 39.85 59.25 78.29 71.55 59.75 43.96 38.20 48.58 46.09 54.11 65.06 58.98 57.93 56.22
SpinQuant 3-8-8 60.77 36.60 53.16 76.64 66.54 59.67 40.17 34.40 55.74 45.64 40.75 56.01 26.23 12.80 47.51
SpinQuant 2-8-8 31.73 21.16 31.14 45.35 54.62 48.93 34.75 26.00 48.59 45.34 23.19 3.88 0.00 0.00 29.62
QoQ 4-8-4 24.62 21.50 26.15 37.98 51.74 50.51 33.06 31.20 56.73 48.43 25.50 11.28 0.00 0.00 29.91
FlatQuant 4-8-8 68.01 42.32 58.53 78.13 71.22 59.59 44.11 37.40 53.19 47.41 54.85 66.73 63.23 65.85 57.90
FlatQuant 3-8-8 60.61 36.43 53.47 75.47 67.90 58.33 41.71 34.60 53.68 44.70 47.24 50.28 35.10 28.66 49.16
FlatQuant 2-8-8 26.09 26.54 26.44 39.08 50.05 49.80 33.06 26.40 56.30 49.22 23.61 11.65 0.00 0.00 29.87
EdgeRazor 4-8-8 70.16 44.45 57.52 79.82 72.58 59.67 43.45 38.20 63.56 48.37 54.29 60.26 68.54 59.15 58.57
EdgeRazor 2.79-8-8 62.79 38.31 49.53 78.38 68.72 56.04 43.65 33.40 60.72 45.57 46.27 54.34 53.68 50.61 53.00
EdgeRazor 1.88-8-8 59.09 33.53 40.85 72.14 65.18 53.99 41.76 29.00 57.33 46.18 39.03 41.96 37.53 40.85 47.03
EdgeRazor 1.58-8-8 55.64 31.48 39.68 70.70 64.25 53.91 41.76 31.60 56.26 40.15 35.07 32.35 28.96 32.93 43.91

D.4 Ablation studies

This section reports the per-task ablation results for Qwen3-0.6B. We evaluate the proposed modules under three distinct bit-width settings, as presented in Tables 14, 15, and 16. We also provide the exact bit-width allocations for the three configurations. For the 2.79-bit and 1.88-bit models, mixed-precision quantization is applied exclusively to the decoder layers, while the embedding and lm_head layers are maintained at 4-bit. In contrast, the 2.19-bit model applies mixed-precision quantization across all of these components. Regarding the configuration abbreviations, SG and ST denote super-group and stacked allocations, respectively. The fixed feature distillation baseline aligns the first, second, and last features from the teacher model.
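
The average bit-widths follow from the mixing proportions given in the captions of Tables 14, 15, and 16, which combine 4-bit and 1.58-bit weights. A minimal sketch of the arithmetic is shown below; the helper name is hypothetical, and `Decimal` is used only so that rounding to two places is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

def average_bits(fraction_4bit, high_bits="4", low_bits="1.58"):
    """Average bit-width of a mix of 4-bit and 1.58-bit weights.

    fraction_4bit: share of weights kept at 4-bit, given as a string.
    """
    f = Decimal(fraction_4bit)
    avg = f * Decimal(high_bits) + (1 - f) * Decimal(low_bits)
    return avg.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

bits_table14 = average_bits("0.5")    # 50% 4-bit, 50% 1.58-bit
bits_table15 = average_bits("0.25")   # 25% 4-bit, 75% 1.58-bit
bits_table16 = average_bits("0.125")  # 12.5% 4-bit, 87.5% 1.58-bit
```

These evaluate to 2.79, 2.19, and 1.88 bits. For the 2.79-bit and 1.88-bit configurations the figure describes the decoder layers only, since the embedding and lm_head layers stay at 4-bit.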

Table 14: Detailed performance of quantization methods based on EdgeRazor. The decoder layers are 2.79-bit (50% 4-bit and 50% 1.58-bit), while the embedding and lm_head layers are 4-bit.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
EdgeRazor_{SG+A+E} 2.79-16-16 51.77 28.33 37.47 70.70 63.71 54.06 40.33 28.20 55.08 42.72 36.85 51.39 26.69 31.10 44.17
EdgeRazor_{SG+A+E} 2.79-8-8 52.10 28.50 37.36 70.58 63.93 53.12 40.12 28.60 54.97 42.82 36.44 49.54 26.99 32.32 44.10
EdgeRazor_{ST+A+E} 2.79-16-16 55.51 30.03 37.51 54.25 64.04 53.91 40.84 28.40 58.84 44.02 37.05 46.77 28.20 26.22 43.26
EdgeRazor_{ST+A+E} 2.79-8-8 55.47 30.29 37.56 54.04 64.09 54.30 40.79 28.40 58.77 43.93 37.05 46.03 27.45 25.00 43.08
Table 15: Detailed performance of quantization methods based on EdgeRazor. The quantized layers, including decoder, embedding, and lm_head, are all 2.19-bit (25% 4-bit and 75% 1.58-bit).
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
EdgeRazor_{SG+A+E} 2.19-16-16 50.17 29.44 34.21 63.88 62.89 50.83 37.00 29.80 48.45 43.59 31.77 38.82 24.64 24.39 40.71
EdgeRazor_{SG+A+E} 2.19-8-8 49.24 28.67 34.28 63.98 61.86 50.43 36.69 29.80 47.12 44.14 31.71 39.19 22.90 21.95 40.14
EdgeRazor_{SG+A+C} 2.19-16-16 52.27 27.82 33.59 65.05 62.46 50.20 37.92 28.00 43.29 44.54 27.05 40.30 20.39 21.34 39.59
EdgeRazor_{SG+A+C} 2.19-8-8 52.31 27.47 33.70 64.65 62.19 50.51 37.77 27.80 43.29 44.59 27.13 40.30 20.24 22.56 39.61
EdgeRazor_{SG+F+E} 2.19-16-16 49.03 27.05 34.33 59.57 62.51 51.46 38.28 30.20 53.63 45.34 27.88 36.97 27.37 20.12 40.27
EdgeRazor_{SG+F+E} 2.19-8-8 47.77 26.96 34.02 59.33 61.86 52.09 38.13 30.60 53.22 44.98 27.89 35.67 27.60 18.90 39.93
EdgeRazor_{SG+F+F} 2.19-16-16 49.15 26.11 30.04 52.29 63.38 51.70 38.23 29.00 55.86 45.85 29.13 36.04 22.59 20.12 39.25
EdgeRazor_{SG+F+F} 2.19-8-8 48.99 26.11 33.24 51.83 62.89 51.38 38.13 28.40 55.77 45.92 28.98 37.15 21.15 23.17 39.51
Table 16: Detailed performance of quantization methods based on EdgeRazor. The decoder layers are 1.88-bit (12.5% 4-bit and 87.5% 1.58-bit), while the embedding and lm_head layers are 4-bit.
Methods W-A-KV ARC-e ARC-c HellaS. BoolQ PIQA WinoG. SIQA OBQA Ethics Tr.QA2 MMLU IFEval GSM8K HumanE. Average
EdgeRazor_{SG+A+E} 1.88-16-16 51.22 27.73 34.21 66.91 63.66 53.35 38.43 27.60 55.92 43.80 28.78 42.51 25.09 23.17 41.60
EdgeRazor_{SG+A+E} 1.88-8-8 51.47 27.99 34.22 66.85 63.49 53.04 38.02 27.40 55.92 43.88 29.56 44.55 25.09 23.17 41.76
EdgeRazor_{SG+A+C} 1.88-16-16 49.20 27.56 34.64 65.05 61.75 53.67 39.36 30.00 54.72 44.35 30.90 39.56 23.05 19.51 40.95
EdgeRazor_{SG+A+C} 1.88-8-8 49.16 27.47 34.50 64.95 61.75 54.06 39.20 29.40 54.80 44.47 30.59 37.34 22.21 21.95 40.85
EdgeRazor_{SG+F+E} 1.88-16-16 48.57 26.45 33.60 58.65 61.53 53.20 39.30 29.40 56.46 43.67 33.21 40.67 20.77 20.12 40.40
EdgeRazor_{SG+F+E} 1.88-8-8 48.15 25.94 33.75 58.50 60.83 52.72 38.54 29.40 56.44 43.50 33.33 41.04 19.86 21.34 40.24
EdgeRazor_{SG+F+F} 1.88-16-16 47.90 27.39 34.23 61.83 61.43 52.49 38.23 26.80 54.12 47.87 26.46 39.37 20.55 18.90 39.83
EdgeRazor_{SG+F+F} 1.88-8-8 48.06 26.96 34.30 61.77 61.37 51.93 38.38 26.60 52.13 47.53 25.99 39.74 20.92 20.12 39.70

D.5 Efficiency

Table 17: Compression comparison of quantization methods on MobileLLM-350M.
Methods Bit-widths Group sizes Quantized layers (decoder, emb, lm_head) Quantization proportions (↑) Compression ratios (↑)
BF16 16 0.00% 1.00×
EdgeRazor 4 256 99.99% 3.76×
EdgeRazor 2.79 256 99.99% 4.94×
EdgeRazor 1.88 256 99.99% 6.46×
EdgeRazor 1.58 256 99.99% 7.19×
Other methods 4 128 × × 83.66% 2.64×
Other methods 3 128 × × 83.66% 3.06×
Other methods 2 128 × × 83.66% 3.64×
Other methods 1.58 128 × × 83.66% 3.96×
Other methods 4 channel × × 83.66% 2.68×
Other methods 3 channel × × 83.66% 3.12×
Other methods 2 channel × × 83.66% 3.72×
Other methods 1.58 channel × × 83.66% 4.05×
Table 18: Compression comparison of quantization methods on Qwen3-0.6B.
Methods Bit-widths Group sizes Quantized layers (decoder, emb, lm_head) Quantization proportions (↑) Compression ratios (↑)
BF16 16 0.00% 1.00×
EdgeRazor 4 256 99.99% 3.94×
EdgeRazor 2.79 256 99.99% 5.05×
EdgeRazor 1.88 256 99.99% 6.41×
EdgeRazor 1.58 256 99.99% 7.04×
Other methods 4 128 × × 73.89% 2.21×
Other methods 3 128 × × 73.89% 2.47×
Other methods 2 128 × × 73.89% 2.78×
Other methods 1.58 128 × × 73.89% 2.94×
Other methods 4 channel × × 73.89% 2.24×
Other methods 3 channel × × 73.89% 2.50×
Other methods 2 channel × × 73.89% 2.83×
Other methods 1.58 channel × × 73.89% 2.99×
Table 19: Compression comparison of quantization methods on Qwen3-1.7B.
Methods Bit-widths Group sizes Quantized layers (decoder, emb, lm_head) Quantization proportions (↑) Compression ratios (↑)
BF16 16 0.00% 1.00×
EdgeRazor 4 256 99.99% 3.94×
EdgeRazor 2.79 256 99.99% 5.21×
EdgeRazor 1.88 256 99.99% 6.88×
EdgeRazor 1.58 256 99.99% 7.69×
Other methods 4 128 × × 81.91% 2.55×
Other methods 3 128 × × 81.91% 2.93×
Other methods 2 128 × × 81.91% 3.45×
Other methods 1.58 128 × × 81.91% 3.73×
Other methods 4 channel × × 81.91% 2.59×
Other methods 3 channel × × 81.91% 2.99×
Other methods 2 channel × × 81.91% 3.53×
Other methods 1.58 channel × × 81.91% 3.82×
Table 20: Compression comparison of quantization methods on Qwen2.5-Omni-7B.
Methods Bit-widths Group sizes Quantized layers (vision, decoder, emb, lm_head) Prop. (↑) Compr. (↑)
BF16 16 0.00% 1.00×
EdgeRazor 4 64 96.73% 3.45×
GPTQ and AWQ 4 128 × × × 79.81% 2.45×
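The compression ratios of the channel-wise baselines in Tables 17 to 19 can be approximately recovered from the quantization proportion alone. The sketch below assumes that unquantized parameters stay at 16 bits, that quantized parameters cost exactly their nominal bit-width, and that scale and zero-point storage is negligible. These are simplifying assumptions rather than the accounting used in the paper, and the function name `compression_ratio` is hypothetical.

```python
def compression_ratio(quant_fraction: float, bits: float, full_bits: float = 16.0) -> float:
    """Approximate whole-model compression ratio, ignoring quantization metadata."""
    kept = 1.0 - quant_fraction                 # share of parameters left at full precision
    packed = quant_fraction * bits / full_bits  # relative cost of the quantized share
    return 1.0 / (kept + packed)


# Channel-wise baselines on Qwen3-0.6B (Table 18): 73.89% of parameters quantized.
for bits in (4, 3, 2, 1.58):
    print(f"{bits}-bit: {compression_ratio(0.7389, bits):.2f}x")
```

For these four settings the sketch lands within 0.01 of the reported 2.24×, 2.50×, 2.83×, and 2.99×. The group-wise rows in the tables are slightly lower, which would be consistent with per-group metadata that this simplified model leaves out.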
Table 21: Inference comparison of text LLMs on two chips: Apple M4 Pro and Intel i9-14900K.
Chips Models Methods Weight types KV types Storage (GB) Memory (GB) Prefilling (tok/s) Decoding (tok/s)
Apple M4 Pro MobileLLM-350M BF16 BF16 BF16 0.70 0.79 614.23 37.99
Q4_K Q4_K Q8_0 0.25 0.33 1019.88 325.81
EdgeRazor Q4_0 Q8_0 0.21 0.29 1901.14 426.47
Q2_K Q2_K Q8_0 0.20 0.28 1505.05 398.31
EdgeRazor TQ1_0 Q8_0 0.17 0.23 1583.24 439.52
EdgeRazor TQ2_0 Q8_0 0.18 0.24 1604.57 458.66
Qwen3-0.6B BF16 BF16 BF16 1.11 1.46 337.99 20.91
Q4_K Q4_K Q8_0 0.36 0.69 734.41 239.55
EdgeRazor Q4_0 Q8_0 0.35 0.67 1269.41 262.66
Q2_K Q2_K Q8_0 0.27 0.59 726.56 233.47
EdgeRazor TQ1_0 Q8_0 0.17 0.49 705.93 293.18
EdgeRazor TQ2_0 Q8_0 0.19 0.51 711.67 317.07
Qwen3-1.7B BF16 BF16 BF16 3.21 3.55 254.09 7.89
Q4_K Q4_K Q8_0 1.03 1.35 415.10 131.75
EdgeRazor Q4_0 Q8_0 0.98 1.29 688.56 148.21
Q2_K Q2_K Q8_0 0.72 1.04 409.87 119.16
EdgeRazor TQ1_0 Q8_0 0.44 0.76 404.06 153.97
EdgeRazor TQ2_0 Q8_0 0.50 0.83 411.27 187.02
Intel i9-14900K MobileLLM-350M BF16 BF16 BF16 0.70 0.79 814.77 73.33
Q4_K Q4_K Q8_0 0.25 0.34 666.54 127.52
EdgeRazor Q4_0 Q8_0 0.21 0.29 1096.51 149.81
Q2_K Q2_K Q8_0 0.20 0.28 907.69 150.78
EdgeRazor TQ1_0 Q8_0 0.17 0.24 926.26 160.42
EdgeRazor TQ2_0 Q8_0 0.18 0.24 1069.87 162.04
Qwen3-0.6B BF16 BF16 BF16 1.11 1.46 588.82 44.52
Q4_K Q4_K Q8_0 0.36 0.69 621.54 89.45
EdgeRazor Q4_0 Q8_0 0.35 0.67 686.89 97.09
Q2_K Q2_K Q8_0 0.27 0.59 480.29 108.44
EdgeRazor TQ1_0 Q8_0 0.17 0.49 446.37 131.02
EdgeRazor TQ2_0 Q8_0 0.19 0.51 743.87 137.02
Qwen3-1.7B BF16 BF16 BF16 3.21 3.56 241.68 17.48
Q4_K Q4_K Q8_0 1.03 1.35 273.06 41.61
EdgeRazor Q4_0 Q8_0 0.98 1.30 337.34 46.21
Q2_K Q2_K Q8_0 0.72 1.05 202.54 57.85
EdgeRazor TQ1_0 Q8_0 0.44 0.77 186.76 74.94
EdgeRazor TQ2_0 Q8_0 0.50 0.83 361.60 75.64

In this section, we present the efficiency metrics: compression for MobileLLM-350M, Qwen3-0.6B, Qwen3-1.7B, and Qwen2.5-Omni-7B, and inference performance for the text LLMs on the Apple M4 Pro and Intel i9-14900K chips. Since the multimodal projector feature is experimental in the inference framework llama.cpp and is not compatible with the llama-bench tool, Qwen2.5-Omni-7B is not included in the inference benchmark. For deployment benchmarks on text LLMs, we use the default thread count, a prompt length of 512, a generation length of 512, a batch size of 4096, and 100 repetitions. Tables 17, 18, 19, and 20 show that EdgeRazor attains the highest quantization proportions and compression ratios across diverse LLMs. Table 21 further indicates that EdgeRazor achieves promising efficiency with compatible llama.cpp precision types on both chips.
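The headline efficiency figures can be read directly off Table 21. As a small worked example, the sketch below recomputes the decoding speedup and storage reduction of the 1.58-bit Qwen3-0.6B-EdgeRazor (TQ2_0 weights) over the BF16 model on the Apple M4 Pro; the `BenchRow` container and the helper names are hypothetical, and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchRow:
    """One row of Table 21, reduced to the two fields used here."""
    storage_gb: float
    decode_tok_s: float


# Qwen3-0.6B on Apple M4 Pro, copied from Table 21.
bf16 = BenchRow(storage_gb=1.11, decode_tok_s=20.91)
edgerazor_tq2_0 = BenchRow(storage_gb=0.19, decode_tok_s=317.07)


def decode_speedup(base: BenchRow, other: BenchRow) -> float:
    """Ratio of decoding throughputs (higher means `other` is faster)."""
    return other.decode_tok_s / base.decode_tok_s


def storage_reduction(base: BenchRow, other: BenchRow) -> float:
    """Ratio of on-disk sizes (higher means `other` is smaller)."""
    return base.storage_gb / other.storage_gb


print(f"decoding speedup:  {decode_speedup(bf16, edgerazor_tq2_0):.2f}x")
print(f"storage reduction: {storage_reduction(bf16, edgerazor_tq2_0):.2f}x")
```

The decoding ratio evaluates to roughly 15.16×, matching the acceleration quoted in the abstract, and the storage ratio to roughly 5.84×.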

Appendix E Generalization on held-out benchmarks

Table 22: Average performance of weight-only quantization methods on Qwen3 models.
Methods W A KV Qwen3-0.6B Qwen3-1.7B
BF16 16 16 16 44.32 64.61
GPTQ 4 16 16 33.01 56.49
OmniQuant 4 16 16 13.14 16.92
AWQ 4 16 16 40.11 60.96
AQLM 4 16 16 42.41 62.69
QuIP# 4 16 16 8.81 9.46
AutoRound 4 16 16 41.71 61.10
VPTQ 4 16 16 28.63 61.42
GPTAQ 4 16 16 35.12 59.16
Q-Palette 4 16 16 26.32 38.05
EfficientQAT 4 16 16 30.70 44.21
EdgeRazor 4 16 16 42.24 59.84
GPTQ 3 16 16 13.00 22.10
OmniQuant 3 16 16 10.32 15.89
AWQ 3 16 16 14.40 39.44
AQLM 3 16 16 23.12 49.05
QuIP# 3 16 16 12.45 11.45
AutoRound 3 16 16 28.23 49.01
VPTQ 3 16 16 18.73 35.94
GPTAQ 3 16 16 14.15 28.98
Slim-LLM+ 3 16 16 9.22 37.37
Q-Palette 3.25 16 16 19.82 33.47
EfficientQAT 3 16 16 22.45 39.69
EdgeRazor 2.79 16 16 36.51 52.23
GPTQ 2 16 16 8.31 7.83
OmniQuant 2 16 16 8.78 8.79
AWQ 2 16 16 9.45 9.26
AQLM 2 16 16 19.26 21.92
QuIP# 2 16 16 8.33 9.48
AutoRound 2 16 16 7.72 10.87
VPTQ 2 16 16 8.14 8.82
QTIP 2 16 16 12.48 34.36
GPTAQ 2 16 16 8.01 7.87
Slim-LLM+ 2 16 16 8.52 9.39
Q-Palette 2 16 16 8.31 10.55
EfficientQAT 2 16 16 6.82 12.94
EdgeRazor 1.88 16 16 29.89 39.69
BiLLM 1.06 16 16 8.98 8.89
ARB-LLM 1 16 16 9.02 6.55
Q-Palette 1.75 16 16 9.02 8.91
EdgeRazor 1.58 16 16 26.90 32.69
Table 23: Average performance of weight-activation quantization methods on Qwen3 models.
Methods W A KV Qwen3-0.6B Qwen3-1.7B
BF16 16 16 16 44.32 64.65
OmniQuant 4 8 8 12.78 16.52
LQER 4 8 8 40.11 58.56
QuaRot 4 8 8 8.24 8.67
ABQ-LLM 4 8 8 34.14 16.24
SpinQuant 4 8 8 27.40 59.02
QoQ 4 8 4 8.97 9.20
FlatQuant 4 8 8 37.26 62.67
EdgeRazor 4 8 8 42.48 60.56
OmniQuant 3 8 8 10.37 15.82
LQER 3 8 8 18.60 34.42
QuaRot 3 8 8 8.34 8.56
ABQ-LLM 3 8 8 8.68 13.42
SpinQuant 3 8 8 15.06 33.95
FlatQuant 3 8 8 18.85 40.32
EdgeRazor 2.79 8 8 36.32 51.23
OmniQuant 2 8 8 8.66 8.65
LQER 2 8 8 8.67 8.78
QuaRot 2 8 8 8.05 8.63
ABQ-LLM 2.32 8 8 9.09 8.66
SpinQuant 2 8 8 9.03 6.77
FlatQuant 2 8 8 8.56 8.82
EdgeRazor 1.88 8 8 30.59 39.84
EdgeRazor 1.58 8 8 27.28 32.33

The evaluation scores reported in the main text demonstrate the strong performance of EdgeRazor across extensive downstream tasks. Because the training mixture incorporates the training splits of several commonsense reasoning tasks, the main results combine supervised domain adaptation with out-of-domain evaluation. To validate the generalization capabilities of our framework more rigorously, this section restricts the evaluation to held-out benchmarks. We assess knowledge with MMLU, instruction following with IFEval, and code generation with HumanEval in a zero-shot setting, and mathematical reasoning with GSM8K in a five-shot setting. We exclude Qwen2.5-Omni-7B due to its reliance on teacher-distilled data, and the base LLM MobileLLM-350M because it lacks the capacity for complex reasoning and coding and therefore yields uninformative near-zero performance, as detailed in Tables 8 and 9. Consequently, Tables 22 and 23 report the average performance on these held-out benchmarks for the instruction-tuned Qwen3 models. The best performance is indicated in bold, and the second-best is underlined.
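The held-out averages can be cross-checked against the per-task ablation tables. The sketch below assumes that each entry of Tables 22 and 23 is the unweighted mean of the MMLU, IFEval, GSM8K, and HumanEval scores, and that the Qwen3-0.6B EdgeRazor rows correspond to the SG+A+E configuration of Tables 14 and 16; both points are inferred from the matching numbers rather than stated explicitly, and the helper name `held_out_average` is hypothetical.

```python
HELD_OUT = ("MMLU", "IFEval", "GSM8K", "HumanEval")

# Per-task scores of EdgeRazor_{SG+A+E} on Qwen3-0.6B, copied from Tables 14 and 16.
per_task = {
    "2.79-16-16": {"MMLU": 36.85, "IFEval": 51.39, "GSM8K": 26.69, "HumanEval": 31.10},
    "2.79-8-8": {"MMLU": 36.44, "IFEval": 49.54, "GSM8K": 26.99, "HumanEval": 32.32},
    "1.88-16-16": {"MMLU": 28.78, "IFEval": 42.51, "GSM8K": 25.09, "HumanEval": 23.17},
    "1.88-8-8": {"MMLU": 29.56, "IFEval": 44.55, "GSM8K": 25.09, "HumanEval": 23.17},
}


def held_out_average(scores: dict[str, float]) -> float:
    """Unweighted mean over the four held-out benchmarks."""
    return sum(scores[task] for task in HELD_OUT) / len(HELD_OUT)


for setting, scores in per_task.items():
    print(f"{setting}: {held_out_average(scores):.2f}")
```

Rounded to two decimals, these means are 36.51, 36.32, 29.89, and 30.59, the same values listed for EdgeRazor on Qwen3-0.6B in Tables 22 and 23.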

There are three observations. (1) In Tables 22 and 23, existing quantization baselines degrade severely on complex cognitive tasks when compressed below 4-bit; even the strongest 2-bit weight-activation baselines, ABQ-LLM on Qwen3-0.6B and FlatQuant on Qwen3-1.7B, collapse to scores below 10 points. (2) EdgeRazor at 2.79-bit holds a clear advantage over all 3-bit baselines. In Table 22, it outperforms AutoRound, the strongest 3-bit competitor on Qwen3-0.6B, by 8.28 points on Qwen3-0.6B and by 3.22 points on Qwen3-1.7B; on Qwen3-1.7B, AQLM is marginally stronger than AutoRound (49.05 versus 49.01), which leaves a margin of 3.18 points. Under the weight-activation setup in Table 23, the margin widens: EdgeRazor surpasses the leading 3-bit baseline FlatQuant by 17.47 and 10.91 points. (3) In the sub-2-bit regime, where competing baselines fail to generate coherent responses, EdgeRazor preserves foundational cognitive and generative capabilities. In Table 23, EdgeRazor at 1.58-bit achieves 27.28 and 32.33 on Qwen3-0.6B and Qwen3-1.7B, exceeding the best 2-bit baselines ABQ-LLM and FlatQuant by 18.19 and 23.51 points, respectively.
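The weight-activation margins quoted above can be recomputed from Table 23. The following sketch picks the strongest baseline per model and subtracts it from the EdgeRazor score; the dictionaries are copied from the table (each tuple is Qwen3-0.6B, Qwen3-1.7B), while the helper name `margins_over_best` is hypothetical.

```python
# Held-out averages from Table 23 as (Qwen3-0.6B, Qwen3-1.7B).
three_bit = {
    "OmniQuant": (10.37, 15.82), "LQER": (18.60, 34.42), "QuaRot": (8.34, 8.56),
    "ABQ-LLM": (8.68, 13.42), "SpinQuant": (15.06, 33.95), "FlatQuant": (18.85, 40.32),
}
two_bit = {  # ABQ-LLM is listed at 2.32 bits in Table 23
    "OmniQuant": (8.66, 8.65), "LQER": (8.67, 8.78), "QuaRot": (8.05, 8.63),
    "ABQ-LLM": (9.09, 8.66), "SpinQuant": (9.03, 6.77), "FlatQuant": (8.56, 8.82),
}
edgerazor = {"2.79": (36.32, 51.23), "1.58": (27.28, 32.33)}


def margins_over_best(ours, baselines):
    """For each model column, return (best baseline name, our score minus its score)."""
    result = []
    for col, score in enumerate(ours):
        name, best = max(((n, s[col]) for n, s in baselines.items()), key=lambda x: x[1])
        result.append((name, round(score - best, 2)))
    return result


print(margins_over_best(edgerazor["2.79"], three_bit))
print(margins_over_best(edgerazor["1.58"], two_bit))
```

The first call identifies FlatQuant as the strongest 3-bit baseline on both models, with margins of 17.47 and 10.91 points; the second identifies ABQ-LLM and FlatQuant, with margins of 18.19 and 23.51 points.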

These observations demonstrate that the performance advantages of EdgeRazor derive from generalized capability enhancements rather than mere supervised domain adaptations. The results explicitly confirm the robustness and generalization of our proposed framework on completely held-out and complex tasks.

Appendix F Capability comparison of low-bit LLMs

Beyond commonsense question answering, instruction-tuned LLMs are expected to exhibit a broad spectrum of advanced capabilities, among which code generation is one of the most demanding: it requires the model to jointly preserve natural-language understanding, symbolic reasoning, and syntactic precision. As such, code generation serves as a particularly stringent stress test for quantization methods operating in the ultra-low-bit regime.

In this section, we empirically investigate how different quantization schemes affect the coding capability of ultra-low-bit LLMs. To make the comparison informative, we select the best baselines identified in Appendix D.3, namely AQLM (W2-A16-KV16) and OmniQuant (W2-A8-KV8), add llama.cpp Q2_K as the PTQ baseline from the open-source inference framework, and contrast them with our proposed EdgeRazor at an even more aggressive bit-width of W1.58-A8-KV8. All methods are applied to the same 16-bit Qwen3-0.6B and evaluated with the zero-shot prompt from the HumanEval benchmark.

Table 24 presents a representative case study on the count_distinct_characters problem. This generation comparison corroborates the quantitative findings reported in Appendix D.3: existing ultra-low-bit quantization methods either strip the model of its instruction-following capability, so that it asserts success without producing any code (AQLM and llama.cpp Q2_K), or drive it into incoherent token distributions (OmniQuant), while EdgeRazor substantially preserves high-level coding capability at sub-2-bit weight precision. We attribute this robustness to the integration of mixed-precision quantization and distillation schemes in EdgeRazor, which mitigates catastrophic degradation in capabilities.

Table 24: Zero-shot coding cases of ultra-low-bit Qwen3-0.6B. The zero-shot prompt is from the HumanEval benchmark.
Prompt Write a solution to the following problem and make sure that it passes the tests:
```python
def count_distinct_characters(string: str) -> int:
    """ Given a string, find out how many distinct characters (regardless of case) it consists of
    >>> count_distinct_characters('xyzXYZ')
    3
    >>> count_distinct_characters('Jerry')
    4
    """
```
AQLM W2-A16-KV16 Filtered Generation:
The solution passes the tests.
OmniQuant W2-A8-KV8 Filtered Generation:
\ubc31\ub155\ub155\ub155\ub155\ub155\ub155\ub155...
llama.cpp Q2_K W2-A8-KV8 Filtered Generation:
# The code is written correctly and passes the tests. The function implementation is as described.
EdgeRazor W1.58-A8-KV8 Filtered Generation:
# Convert the string to lowercase to remove case differences
cleaned_string = string.lower()

# Count the distinct characters
distinct_chars = set(cleaned_string)
return len(distinct_chars)
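As a quick sanity check on the EdgeRazor generation in Table 24, the sketch below places the generated body under the function signature from the prompt and runs the two docstring examples. Only the docstring examples are exercised here; the full HumanEval test suite is not reproduced.

```python
def count_distinct_characters(string: str) -> int:
    """Count distinct characters in `string`, ignoring case."""
    # Body copied from the EdgeRazor W1.58-A8-KV8 generation in Table 24.
    # Convert the string to lowercase to remove case differences
    cleaned_string = string.lower()

    # Count the distinct characters
    distinct_chars = set(cleaned_string)
    return len(distinct_chars)


# The two examples given in the prompt's docstring.
assert count_distinct_characters('xyzXYZ') == 3
assert count_distinct_characters('Jerry') == 4
print("docstring examples pass")
```

Both docstring examples pass, which supports the claim that the generated body is a functionally correct solution for the cases shown in the prompt.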
