[2605.02971] Multilingual Safety Alignment via Self-Distillation

Multilingual Safety Alignment via Self-Distillation

Ruiyang QinTongji University2432008@tongji.edu.cn Qin and Wang contributed equally to this work and agree that the order of their names may be exchanged as useful to highlight their contributions in individual professional pursuits. Qingzhuo WangTongji University2534123@tongji.edu.cnDongrui LiuShanghai AI Laboratoryliudongrui@pjlab.org.cnQiang LiTongji Universityqli@tongji.edu.cnZhihua WeiTongji Universityzhihua_wei@tongji.edu.cnWen ShenTongji Universitywenshen@tongji.edu.cn Corresponding author.

Abstract

Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM’s inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods—on-policy MSD and off-policy MSD—both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model’s general capabilities.

WARNING: This paper may contain content that is offensive and harmful.

1 Introduction

Large language models (LLMs) have emerged as powerful worldwide applications, enabling users from diverse linguistic and cultural communities to benefit from AI developments (Touvron et al., 2023; Grattafiori et al., 2024; Team et al., 2024; Yang et al., 2025a). However, recent studies reveal a severe safety misalignment across different languages. That is, LLMs exhibit robust safety behaviors in high-resource languages (e.g., English), but their safeguards often fail in low-resource languages (e.g., Javanese, Swahili) (Deng et al., 2024; Wang et al., 2023a; Shen et al., 2024; Yong et al., 2025). This multilingual safety misalignment undermines the overall reliability of LLMs, highlighting a critical demand for methods of robust cross-lingual safety alignment.

To enhance multilingual safety, some studies rely on representation engineering (Wang et al., 2025) or activation steering (Zhang et al., 2026). However, these training-free methods often exhibit weak generalization across diverse datasets. Alternatively, recent research has sought to transfer safety capabilities from high-resource languages (typically English) to other languages via cross-lingual alignment (Zhang et al., 2024; Zhao et al., 2024, 2025; Bu et al., 2026). These methods generally depend on Supervised Fine-Tuning (SFT) or Preference Optimization (PO), both of which require substantial high-quality response data for training. However, generating high-quality response data is expensive and difficult, particularly for low-resource languages. Therefore, there is an urgent need for a response-free approach capable of transferring safeguards from high-resource to low-resource languages while maintaining robust generalization capabilities.

Refer to caption — Figure 1: The MSD framework leverages self-distillation within the same LLM $p_{\theta}$ to transfer inherent safety capabilities from high-resource (Teacher) to low-resource (Student) languages. This response-free method bridges the cross-lingual safety gap while preserving the model’s general capabilities.

To this end, we introduce Multilingual Self-Distillation (MSD), a cross-lingual safeguard transfer framework based on self-distillation. This framework is designed to transfer an LLM’s inherent safety capabilities from high-resource to low-resource languages, bridging the linguistic gap in safety alignment. As illustrated in Figure 1, both the teacher and the student are initialized from the same LLM. The teacher leverages the model’s strong safeguards in high-resource languages to conduct safe reasoning and generate harmless responses, thereby guiding the student, which represents the model’s weak safety capabilities in low-resource languages. Crucially, this framework requires no external response data for multilingual alignment. Moreover, our framework is flexible and can be integrated with various self-distillation strategies, including off-policy self-distillation (Yang et al., 2024; Yuan et al., 2024) and on-policy self-distillation (Zhao et al., 2026; Shenfeld et al., 2026).

The proposed MSD provides the teacher with additional information to elicit strong safety capability in the high-resource language, while providing the student with only the target low-resource input query. For clarity, we utilize English as the high-resource language for the following illustration. The teacher’s additional information includes the input query in English (already provided in the dataset) and a Chain-of-Thought (CoT) instruction, prompting the teacher to conduct reasoning in English and generate the final answer in the target low-resource language. In this way, the MSD does not require any response data; it relies solely on multilingual queries to provide alignment signals.

Furthermore, to differentiate the importance of individual tokens during safe response generation, we introduce a divergence measure termed Dual-Perspective Safety Weighting (DPSW). DPSW adaptively increases the penalty on safety-critical tokens while reducing the penalty on uninformative ones. By jointly evaluating the confidence levels of both the teacher and the student, DPSW adjusts the penalty weight of each individual token. This divergence measure prioritizes safety-critical tokens and filters out non-critical tokens, thereby ensuring a more precise and robust cross-lingual alignment.

We evaluate the performance of our framework under both off-policy and on-policy self-distillation settings. Extensive experiments across diverse models and benchmarks demonstrate that both settings consistently outperform existing baselines, yielding exceptional multilingual safety alignment. Furthermore, our MSD framework exhibits robust generalization to more challenging datasets and out-of-distribution languages, all while effectively preserving the model’s general capabilities.

2 Related Work

Multilingual Safety Alignment. LLMs often exhibit safety vulnerabilities in low-resource languages despite strong safeguards in high-resource ones (Deng et al., 2024; Wang et al., 2023a; Shen et al., 2024; Yong et al., 2025). Existing alignment methods primarily rely on supervised fine-tuning (Deng et al., 2024; Li et al., 2024; Shen et al., 2024), preference optimization (Rafailov et al., 2023; Ethayarajh et al., 2024; Zhao et al., 2025), or distillation techniques (Zhang et al., 2024, 2025), which all require substantial response data for training. However, generating such high-quality response data across target languages is expensive and prone to translation errors. While recent studies (Bu et al., 2026; Yang et al., 2026b) alleviate the need for low-resource responses, they still depend heavily on English response data. Alternatively, training-free methods like representation engineering (Wang et al., 2025) and activation steering (Zhang et al., 2026; Liang et al., 2026) reduce data reliance but suffer from poor cross-dataset generalization and capability degradation (Cao et al., 2025). In this paper, we propose MSD, a response-free method capable of transferring safeguards from high-resource to low-resource languages while maintaining robust generalization capabilities.

Self-Distillation. Self-distillation serves as a mechanism for self-improvement, enabling models to learn from their own outputs rather than external teachers (Furlanello et al., 2018; Zhang et al., 2019; Yang et al., 2019). In the LLM era, self-distillation can branch into off-policy and on-policy strategies. Off-policy self-distillation typically generates discrete trajectories as the training data and then fine-tunes the model on them (Askell et al., 2021; Zelikman et al., 2022; Gulcehre et al., 2023; Wang et al., 2023b; Sun et al., 2023; Yang et al., 2024; Yuan et al., 2024). More recently, on-policy self-distillation (OPSD) instantiates the teacher and the student from the same model under different contexts (Zhao et al., 2026; Hübotter et al., 2026). By equipping the teacher with privileged information, OPSD evaluates the student’s own rollouts and provides dense token-level supervision. This paradigm has been extended to continual learning, context internalization, and reasoning compression (Shenfeld et al., 2026; Ye et al., 2026; Sang et al., 2026). In this paper, we propose a flexible cross-lingual self-distillation framework. It can be integrated with both off-policy and on-policy settings to transfer inherent safety capabilities. Please see Appendix A for a more detailed version of related work.

3 Multilingual Self-Distillation

In this section, we propose Multilingual Self-Distillation (MSD) to bridge the multilingual safety gap in a target LLM. First, Section 3.1 introduces the cross-lingual safeguard transfer framework, specifically designed to transfer inherent safety capabilities from high-resource to low-resource languages. Second, Section 3.2 details Dual-Perspective Safety Weighting (DPSW), a token-level divergence measure designed to optimize the distillation process.

3.1 Cross-Lingual Safety Transfer through Self-Distillation

Consider a multilingual jailbreak dataset comprising $N$ samples, denoted as $\mathcal{X}=\{X_{i}\}_{i=1}^{N}$ , where each sample $X_{i}=\{x_{i}^{(1)},x_{i}^{(2)},\dots,x_{i}^{(M)}\}$ represents a specific harmful query translated across $M$ distinct languages. For clarity, we utilize English as the representative high-resource language for the following illustration. Letting $p_{\theta}$ denote an LLM parameterized by $\theta$ , our framework instantiates both the teacher $p_{T}$ and the student $p_{S}$ from $p_{\theta}$ by conditioning on different contexts. As shown in Figure 2, given a target low-resource query $x^{\text{tgt}}\in X$ (we omit the sample index $i$ for brevity), the student $p_{S}$ is only provided with the target low-resource query $x^{\text{tgt}}$ :

\small p_{S}(\cdot\mid x^{\text{tgt}})\triangleq p_{\theta}(\cdot\mid x^{\text{tgt}}).

(1)

Conversely, the teacher $p_{T}$ is provided with additional information, including the query in English $x^{*}\in X$ and a CoT instruction $\mathcal{C}$ :

\small p_{T}(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C})\triangleq p_{\theta}(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C}).

(2)

The CoT instruction $\mathcal{C}$ is designed to effectively elicit the model’s internal safe reasoning capability in English. As detailed in Figure 3, the CoT instruction $\mathcal{C}$ explicitly instructs the teacher to reason in English and generate the final response in the target low-resource language to guide the student.

Our MSD framework is flexible and can be integrated with various self-distillation strategies. Specifically, we implement two concrete methods: on-policy MSD and off-policy MSD. During training, for a given target low-resource query $x^{\text{tgt}}$ , the distinction between the on-policy and off-policy settings lies in whether the output sequence $y$ is sampled from the student or the teacher. On-policy MSD trains on output sequences sampled from the student, i.e., $y\sim p_{S}(\cdot\mid x^{\text{tgt}})$ . Conversely, off-policy MSD trains on output sequences sampled from the teacher, i.e., $y\sim p_{T}(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C})$ . To represent both the on-policy and off-policy settings, we introduce a generalized sampling distribution, $q(\cdot)$ . In the on-policy setting, $q(\cdot)$ is directly equivalent to the student’s sampling:

\small q(\cdot)\triangleq p_{S}(\cdot\mid x^{\text{tgt}}).

(3)

In the off-policy setting, $q(\cdot)$ is directly equivalent to the teacher’s sampling:

\small q(\cdot)\triangleq p_{T}(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C}).

(4)

Consequently, the generated output sequences across both settings can be generalized as: $y\sim q(\cdot)$ . At each generation step $t$ , both the teacher and the student generate next-token distributions over the vocabulary $\mathcal{V}$ , yielding $p_{S}\!\left(\cdot\mid x^{\text{tgt}},y_{<t}\right)$ for the student and $p_{T}\!\left(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C},y_{<t}\right)$ for the teacher. The overall training objective of the MSD is formulated to minimize the divergence between the student and the teacher across all generated tokens for all samples in the dataset $\mathcal{X}$ , defined as:

\small\mathcal{L}(\theta)=\mathbb{E}_{X\sim\mathcal{X}}\left[\mathbb{E}_{x^{\text{tgt}}\sim X}\left[\mathbb{E}_{y\sim q(\cdot)}\left[\frac{1}{L_{y}}\sum_{t=1}^{L_{y}}D\biggl(p_{S}\!\left(\cdot\mid x^{\text{tgt}},y_{<t}\right)\big\|\;p_{T}\!\left(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C},y_{<t}\right)\biggr)\right]\right]\right],

(5)

where $L_{y}$ is the length of the output sequence $y$ ; $D$ can be any divergence measure between distributions. Here we utilize the reverse Kullback-Leibler divergence. Please see Table 12 in Appendix C for ablation study of different divergence measures. In our optimization process, we restrict gradient updates exclusively to the student $p_{S}$ . The parameters of the teacher $p_{T}$ are kept frozen to provide a stable supervision signal. We conduct comparative experiments analyzing the performance differences between a frozen and an unfrozen teacher, with detailed results presented in Table 8.

3.2 Dual-Perspective Safety Weighting

In Equation (5), the divergence $D\left(p_{S}\!\left(\cdot\right)\big\|\;p_{T}\!\left(\cdot\right)\right)$ is uniformly averaged across all tokens in the output sequence $y$ . However, in the context of refusal generation, the critical tokens representing the refusal behavior (e.g., “I cannot answer …”) decisively influence the safety decision of the sequence, whereas other tokens typically contain non-critical information (Jain et al., 2024; Qi et al., 2025; Wang et al., 2026). Treating all tokens equally weakens the supervision on these key tokens. To address this limitation, we propose Dual-Perspective Safety Weighting (DPSW), a token-level divergence measure. The core objective of DPSW is to adaptively increase the penalty weight for the divergence on safety-critical tokens while reducing the weight on non-critical ones.

Specifically, we determine each token’s penalty weight by considering both the teacher’s and student’s perspectives. At generation step $t$ , let $y^{*}_{t}$ denote the token assigned the highest probability by the teacher, i.e., $y^{*}_{t}=\arg\max_{y\in\mathcal{V}}p_{T}(y\mid x^{\text{tgt}},x^{*},\mathcal{C},y_{<t})$ . The penalty weight $w_{t}^{T}$ from the teacher’s perspective and the penalty weight $w_{t}^{S}$ from the student’s perspective are computed as follows.

Teacher’s Perspective ( $w_{t}^{T}$ ): Shen et al. (2026) observed that models consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. Thus, the penalty weight $w_{t}^{T}$ is positively correlated with its confidence in generating $y^{*}_{t}$ . Building upon this finding, we quantify the teacher’s confidence in outputting critical safety tokens using information entropy. To prevent the entropy from being diluted by the long-tail probabilities of the entire vocabulary, we use a top- $K$ entropy rather than the full-vocabulary entropy. Formally, we first extract the teacher’s top- $K$ token set at step $t$ , denoted as $\mathcal{V}_{t}^{K}=\text{TopK}\big(p_{T}(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C},y_{<t})\big)$ , and renormalize the probabilities within this subset: $\bar{p}_{T}(v)=\frac{p_{T}(v)}{\sum_{u\in\mathcal{V}_{t}^{K}}p_{T}(u)},v\in\mathcal{V}_{t}^{K}$ . The metric is then defined based on the normalized top- $K$ entropy:

\small w_{t}^{T}=1-\frac{-\sum_{v\in\mathcal{V}_{t}^{K}}\bar{p}_{T}(v)\log\bar{p}_{T}(v)}{\log K}.

(6)

A higher $w_{t}^{T}$ indicates that the teacher is highly confident in its generated token. We apply $K=32$ for the main experiments. Please see Table 13 in Appendix C for hyperparameter experiments on $K$ .

Student’s Perspective ( $w_{t}^{S}$ ): The penalty weight $w_{t}^{S}$ depends on the student’s confidence in the teacher’s chosen token $y^{*}_{t}$ . A lower student confidence indicates a more severe disagreement with the teacher’s choice, rendering the student susceptible to generating unsafe responses. We measure the risk of the student disagreeing with the teacher by evaluating its confidence in $y^{*}_{t}$ . Formally, this metric is defined as:

\small w_{t}^{S}=1-p_{S}(y^{*}_{t}\mid x^{\text{tgt}},y_{<t}).

(7)

This metric captures the risk that the student deviates toward harmful generation. A higher $w_{t}^{S}$ indicates a larger deviation from the teacher’s chosen token.

Dual-Perspective Weighting Formulation: By combining these two perspectives, we derive the weight for the $t$ -th token as $w_{t}=w_{t}^{T}\cdot w_{t}^{S}$ . To maintain training stability and prevent gradient explosion, we normalize these weights across the generated output $y$ , yielding $\tilde{w}_{t}=L_{y}\cdot\frac{w_{t}}{\sum_{j=1}^{L_{y}}w_{j}}$ .

Finally, DPSW upgrades Equation (5) with the weighting mechanism as follows:

\small\mathcal{L}(\theta)=\mathbb{E}_{X\sim\mathcal{X}}\left[\mathbb{E}_{x^{\text{tgt}}\sim X}\left[\mathbb{E}_{y\sim q(\cdot)}\left[\frac{1}{L_{y}}\sum_{t=1}^{L_{y}}\tilde{w}_{t}\cdot D\biggl(p_{S}\!\left(\cdot\mid x^{\text{tgt}},y_{<t}\right)\big\|\;p_{T}\!\left(\cdot\mid x^{\text{tgt}},x^{*},\mathcal{C},y_{<t}\right)\biggr)\right]\right]\right].

(8)

Here, we apply a stop-gradient operation to $\tilde{w}_{t}$ during training, treating it purely as a constant scalar.

We empirically compare the weights $w^{T}_{t}$ and $w^{S}_{t}$ assigned to safety-critical versus non-critical tokens. As shown in Table 5, the weights $w_{t}$ , as well as $w^{T}_{t}$ and $w^{S}_{t}$ , are consistently higher for safety-critical tokens than non-critical tokens. This demonstrates that the DPSW mechanism tends to assign higher penalty weights to safety-critical tokens over non-critical ones, which aligns with our objective.

4 Experiments

4.1 Experimental Setup

Models. We conduct our studies on four representative LLMs: Qwen-2.5-7B-Instruct (Yang et al., 2025b), Qwen-3-8B (Yang et al., 2025a), LLaMA-2-7B-chat (Touvron et al., 2023), and LLaMA-3-8B-Instruct (Grattafiori et al., 2024).

Languages. We evaluate ten languages from three different resource levels following Deng et al. (2024): (1) High-resource: English (EN), Chinese (ZH), Italian (IT), Vietnamese (VI); (2) Medium-resource: Arabic (AR), Korean (KO), Thai (TH); (3) Low-resource: Bengali (BN), Swahili (SW), Javanese (JV). Only EN, ZH, AR and BN are included in the training data, and testing is evaluated on all ten languages. In the main experiments, we designate EN as the high-resource language to transfer its strong safety capabilities to lower-resource languages. We also conduct additional experiments by choosing ZH as the high-resource language for Qwen-3-8B, as detailed in Table 8.

Datasets and metrics. For training, we use multilingual queries from XSafety (Wang et al., 2023a). For testing, we employ two more difficult multilingual jailbreak datasets: MultiJail (Deng et al., 2024) and PKU-SafeRLHF (Ji et al., 2025). For safety evaluation, we use Attack Success Rate (ASR) as our metric. Following Zhao et al. (2025), we use GPT-4o to classify responses as safe, unsafe, or invalid, where both unsafe and invalid responses are counted toward ASR. To test general capability, we evaluate the accuracy on MMMLU (Hendrycks et al., 2020) and MGSM (Shi et al., 2022).

Table 1: Multilingual safety performance on two benchmarks: MultiJail and PKU-SafeRLHF. Evaluation is based on the Attack Success Rate (ASR

\downarrow

). The best and second-best results are highlighted in bold and underlined, respectively. Languages marked with an asterisk (^∗), e.g., IT^∗, indicate unseen languages that are not included in the training set. Note that the average (Avg.) scores are computed across all ten evaluated languages, including the eight languages presented here alongside English (EN) and Chinese (ZH). Additional results for EN and ZH, as well as evaluations on more models (LLaMA-2-7B-chat and Qwen-2.5-7B-Instruct), are provided in Table 9 of Appendix C.

MultiJail PKU-SafeRLHF Methods \columncolorHRHigh \columncolorMRMedium \columncolorLRLow \columncolorAVGAvg. \columncolorHRHigh \columncolorMRMedium \columncolorLRLow \columncolorAVGAvg. \cellcolorHRIT^∗ \cellcolorHRVI^∗ \cellcolorMRAR \cellcolorMRKO^∗ \cellcolorMRTH^∗ \cellcolorLRBN \cellcolorLRSW^∗ \cellcolorLRJV^∗ \columncolorAVG\cellcolorAVG \cellcolorHRIT^∗ \cellcolorHRVI^∗ \cellcolorMRAR \cellcolorMRKO^∗ \cellcolorMRTH^∗ \cellcolorLRBN \cellcolorLRSW^∗ \cellcolorLRJV^∗ \columncolorAVG\cellcolorAVG Qwen-3-8B Raw 13.97 10.48 12.70 13.33 9.52 22.22 94.92 19.68 \columncolorAVG21.68 10.60 15.00 13.40 15.40 12.20 21.80 97.20 18.00 \columncolorAVG22.08 SFT 6.03 4.76 3.81 5.71 2.54 8.89 96.19 35.87 \columncolorAVG17.08 8.40 10.20 9.20 7.20 5.40 22.60 97.00 16.60 \columncolorAVG19.00 DPO 8.89 9.84 8.89 10.48 8.89 15.56 94.92 23.17 \columncolorAVG19.53 8.60 12.00 8.80 9.40 8.20 15.00 97.80 17.00 \columncolorAVG18.90 rDPO 10.16 7.94 10.79 10.48 7.62 18.10 95.24 23.17 \columncolorAVG19.65 8.00 10.60 8.60 8.80 7.40 16.60 97.20 15.40 \columncolorAVG18.70 KTO 8.89 5.71 9.21 12.70 6.98 15.24 95.56 21.27 \columncolorAVG18.70 7.00 10.00 7.60 8.60 8.80 14.80 97.80 13.00 \columncolorAVG17.96 ORPO 8.25 10.48 9.52 8.89 7.62 17.78 95.24 20.95 \columncolorAVG19.14 8.00 10.40 7.60 10.00 8.60 18.00 96.80 15.40 \columncolorAVG18.88 R-DPO 10.16 10.48 10.16 13.33 7.94 18.73 95.56 26.35 \columncolorAVG20.57 8.20 11.80 7.80 9.80 7.60 16.20 96.80 15.80 \columncolorAVG18.84 SimPO 8.89 7.62 8.89 9.84 8.25 18.10 94.29 23.49 \columncolorAVG19.27 6.80 9.60 7.60 11.20 6.80 18.60 98.20 14.80 \columncolorAVG18.58 PolyRefuse 15.56 13.02 12.38 17.14 11.11 22.54 94.29 28.89 \columncolorAVG23.53 11.00 15.20 13.20 13.80 11.80 22.40 97.40 18.20 \columncolorAVG22.32 Self-Defense 12.38 12.38 11.43 14.92 13.02 17.78 95.56 26.67 \columncolorAVG22.38 11.20 13.80 13.20 13.00 11.80 22.20 97.60 19.40 \columncolorAVG22.04 SDRRL 4.44 4.13 3.49 5.08 3.81 8.25 77.14 28.57 \columncolorAVG14.16 2.20 3.20 1.60 1.80 3.20 5.40 62.00 8.40 \columncolorAVG9.12 MPO 1.90 1.59 3.81 4.13 1.90 10.79 72.38 22.86 \columncolorAVG12.44 2.20 2.60 2.20 2.00 1.20 7.00 63.40 7.40 \columncolorAVG9.08 \rowcolorOURS MSD (Off-Policy) 1.59 2.22 1.90 2.86 2.86 4.76 63.81 17.46 \columncolorAVG9.90 0.20 1.60 1.80 1.60 1.20 2.00 52.40 3.60 \columncolorAVG6.54 \rowcolorOURS MSD (On-Policy) 1.90 0.95 3.81 2.22 1.90 3.81 67.62 16.83 \columncolorAVG10.04 1.00 1.00 2.20 1.20 2.40 2.80 61.80 3.20 \columncolorAVG7.62 LLaMA-3-8B-Instruct Raw 22.54 13.65 11.43 17.46 10.79 13.97 69.52 39.37 \columncolorAVG21.21 14.40 5.40 8.40 9.80 9.00 15.40 65.20 22.80 \columncolorAVG15.82 SFT 3.49 2.86 11.11 4.76 0.63 1.59 50.79 6.35 \columncolorAVG8.48 0.80 1.40 0.60 5.80 1.00 2.20 33.60 5.60 \columncolorAVG5.24 DPO 2.86 4.76 9.21 1.90 1.90 1.59 53.33 6.67 \columncolorAVG8.44 2.60 1.20 5.60 1.20 1.20 5.00 68.20 2.80 \columncolorAVG8.62 rDPO 4.13 2.54 1.90 1.90 0.63 6.98 29.52 10.79 \columncolorAVG6.09 5.40 3.20 2.40 6.00 3.60 4.20 28.40 7.20 \columncolorAVG6.44 KTO 4.13 1.90 6.35 3.17 0.95 1.59 43.81 5.40 \columncolorAVG6.98 0.40 2.40 3.80 2.20 1.40 1.40 37.20 2.60 \columncolorAVG5.11 ORPO 3.49 2.86 10.79 10.79 0.32 2.86 77.46 1.27 \columncolorAVG11.11 2.40 2.40 7.00 9.80 1.40 4.20 72.80 2.60 \columncolorAVG10.30 R-DPO 2.22 5.40 9.84 2.22 1.90 3.49 57.14 7.30 \columncolorAVG9.14 0.80 1.40 6.40 1.00 1.60 4.60 65.60 2.60 \columncolorAVG8.44 SimPO 3.49 2.54 5.71 8.89 0.95 2.86 77.14 7.62 \columncolorAVG11.08 2.60 2.40 3.60 12.00 2.20 5.20 72.20 2.60 \columncolorAVG10.36 PolyRefuse 21.59 15.24 13.97 18.73 12.06 16.51 67.30 44.76 \columncolorAVG22.35 13.20 5.40 8.80 9.80 9.00 15.40 65.20 23.60 \columncolorAVG15.86 Self-Defense 8.89 12.70 8.89 7.94 42.54 28.25 14.29 6.35 \columncolorAVG15.56 4.20 5.20 6.00 11.00 38.00 18.00 9.60 9.60 \columncolorAVG11.06 SDRRL 6.67 11.43 11.43 11.75 7.94 11.75 28.25 30.16 \columncolorAVG12.61 2.60 3.00 8.80 7.40 5.60 14.20 20.20 11.40 \columncolorAVG7.70 MPO 7.62 6.67 7.62 6.03 14.92 22.22 14.92 22.22 \columncolorAVG11.84 1.80 2.20 2.20 2.00 3.80 7.40 11.80 8.20 \columncolorAVG4.38 \rowcolorOURS MSD (Off-Policy) 2.86 1.27 1.59 1.90 2.22 0.63 6.03 11.11 \columncolorAVG2.92 2.60 0.60 0.60 0.40 1.00 1.20 3.20 1.80 \columncolorAVG1.22 \rowcolorOURS MSD (On-Policy) 1.27 1.59 1.27 1.27 1.90 2.22 4.13 5.08 \columncolorAVG2.03 2.20 1.20 1.00 0.80 1.60 2.40 2.80 3.40 \columncolorAVG1.78

Baselines. We compare MSD against six categories of strong baselines, including (1) SFT (Ouyang et al., 2022); (2) representative PO methods: DPO (Rafailov et al., 2023), rDPO (Chowdhury et al., 2024), KTO (Ethayarajh et al., 2024), ORPO (Hong et al., 2024), R-DPO (Park et al., 2024), and SimPO (Meng et al., 2024); (3) representation engineering method: PolyRefuse (Wang et al., 2025); (4) prompt-based method: Self-Defense (Phute et al., 2024); (5) distillation-based method: SDRRL (Zhang et al., 2024); (6) multilingual safety alignment method: MPO (Zhao et al., 2025).

More experimental details about datasets, evaluation metrics and baselines are listed in Appendix B.

4.2 Overall Evaluation on Multilingual Safety Alignment

Our proposed MSD method effectively improves multilingual safety performance, exhibiting strong cross-dataset and cross-lingual generalization. Crucially, it preserves general capabilities and operates entirely without target-language response data.

Multilingual safety performance. Table 1 presents the multilingual safety performance comparing the proposed MSD with other representative baselines. To comprehensively evaluate safety performance, we report results across three resource levels of languages: high-, medium-, and low-resource, distinguished by different colors. From these evaluations, we draw the following two insights:

$\bullet$ MSD exhibits superior safety performance across benchmarks. Compared to various strong baselines, both the on-policy MSD and off-policy MSD consistently achieve superior multilingual safety performance across both benchmarks. The excellent performance is observed across high-, medium-, and low-resource languages, underscoring MSD as a robust cross-lingual transfer framework. Notably, while our framework is trained on the relatively simple XSafety dataset, it maintains exceptional performance on the more challenging MultiJail and PKU-SafeRLHF benchmarks. This highlights the cross-dataset generalization capability of our method.

Table 2: Average accuracy (

\uparrow

) on multilingual general capability benchmarks (MMMLU and MGSM). Best and second-best results are bolded and underlined, respectively.

Method Qwen-2.5-7B-Instruct Qwen-3-8B LLaMA-2-7B-Chat LLaMA-3-8B-Instruct MMMLU MGSM MMMLU MGSM MMMLU MGSM MMMLU MGSM Raw 58.26 63.89 64.65 64.87 32.05 10.95 48.17 56.04 SFT 50.55 52.95 59.22 73.42 24.00 6.22 17.65 20.84 DPO 58.04 62.91 65.15 65.35 31.86 11.04 41.53 53.13 rDPO 57.92 63.16 65.26 65.89 31.96 10.29 22.86 51.89 KTO 57.85 63.05 65.23 65.89 31.84 11.00 33.79 52.36 ORPO 57.47 59.89 65.49 67.35 31.49 11.40 34.88 47.02 R-DPO 58.06 62.84 65.23 65.49 32.01 10.11 42.26 52.69 SimPO 57.16 55.20 65.45 67.85 31.33 10.25 35.31 47.78 PolyRefuse 57.90 58.69 64.61 64.29 31.20 10.29 44.01 55.60 SDRRL 57.34 58.58 65.07 62.40 27.11 11.75 34.28 42.65 MPO 58.17 62.95 64.78 64.62 31.91 11.92 47.87 54.25 MSD (Off-Policy) 58.25 60.44 66.88 75.53 31.61 10.55 46.57 55.89 MSD (On-Policy) 58.80 63.89 66.76 74.04 31.97 12.80 48.16 58.11

Table 3: Comparison of response data generation costs across different methods.

SFT Distillation Preference Optimization Prompt Representation Eng. MSD (Ours) Method SFT SDRRL DPO / rDPO / KTO / ORPO / SimPO / MPO Self-Defense PolyRefuse MSD Data Cost $650 $450 $1200 $0 $\sim$ $0 $0

$\bullet$ Strong generalization to unseen low-resource languages. Out of the ten languages evaluated, only four are present in the training set, allowing us to observe how effectively the multilingual alignment transfers to unseen languages. Existing baselines frequently exhibit biased performance, particularly benefiting high-resource languages or languages seen during training. In such cases, safety alignment remains inadequate for low-resource languages like SW and JV. In contrast, MSD demonstrates stable and generalizable safety behaviors, particularly in low-resource languages. For instance, for LLaMA-3-8B-Instruct evaluated on the MultiJail benchmark, the off-policy MSD significantly reduces the ASR for SW from the raw model’s 69.52% down to 6.03%, and the on-policy MSD further reduces it to 4.13%. A similar safety improvement is consistently observed on the PKU-SafeRLHF dataset. This indicates that our method does not merely overfit to the training language distribution; instead, it effectively transfers the model’s internal high-resource safeguards to low-resource safeguards.

For Qwen-3-8B, while MSD substantially reduces the MultiJail ASR on SW (from 94.92% to 63.81%), the ASR remains high.This primarily stems from the model’s weak low-resource generative capability, which generates invalid responses that are counted toward ASR. Specifically, the off-policy MSD’s 63.81% ASR comprises 57.14% invalid and only 6.67% unsafe responses. Crucially, Appendix C.5 confirms that MSD successfully generates safe English reasoning, merely failing to generate valid responses in SW. This phenomenon is specific to the Qwen series on SW. For other models and languages, unsafe responses dominate the ASR, and invalid responses are negligible.

MSD maintains previous knowledge and reasoning capabilities after alignment. We evaluate all methods across two general tasks: MMMLU and MGSM. Results in Table 2 show that MSD preserves general capabilities well after alignment. This demonstrates that the substantial improvements in multilingual safety do not come at the cost of degrading the model’s general or reasoning performance. For detailed results of all languages, please refer to Table 10 and Table 11 in Appendix C.

MSD requires zero cost for response data generation. Table 3 compares the data generation costs of different methods. To ensure high-quality data, all data generation and translation tasks are performed using the GPT-4o API. SFT incurs significant costs because it requires safe responses for every target language. SDRRL incurs a moderate cost by translating the model’s English responses into target languages. Preference Optimization is the most expensive, as it requires paired contrastive responses for every query. In contrast, our MSD requires zero data generation cost.

Table 4: The ablation study of DPSW. Evaluation is based on Attack Success Rate (ASR

\downarrow

Method Qwen-2.5-7B-Instruct Qwen-3-8B LLaMA-2-7B-Chat LLaMA-3-8B-Instruct MultiJail PKU-SafeRLHF MultiJail PKU-SafeRLHF MultiJail PKU-SafeRLHF MultiJail PKU-SafeRLHF MSD (Off-Policy) w/o DPSW 13.27 11.62 11.25 8.12 9.95 11.48 4.55 2.48 MSD (Off-Policy) w/ DPSW 10.41 10.98 9.90 6.54 8.86 9.96 2.92 1.22 MSD (On-Policy) w/o DPSW 6.38 6.92 11.09 9.10 4.99 6.80 3.33 3.02 MSD (On-Policy) w/ DPSW 5.94 6.04 10.04 7.62 2.48 3.90 2.03 1.78 Best-Performing Baseline 12.38 12.10 12.44 9.08 11.18 14.34 6.09 4.38

Table 5: DPSW weight analysis of safety-critical and non-critical tokens.

Token Type	Qwen-3-8B			LLaMA-3-8B-Instruct
Token Type	$\text{Avg.}(w_{t})$	$\text{Avg.}(w^{T}_{t})$	$\text{Avg.}(w^{S}_{t})$	$\text{Avg.}(w_{t})$	$\text{Avg.}(w^{T}_{t})$	$\text{Avg.}(w^{S}_{t})$
Safety-Critical Token	0.189	0.886	0.253	0.182	0.852	0.281
Non-Critical Token	0.112	0.740	0.199	0.117	0.709	0.235

Table 6: The ablation study of teacher-side additional information under the on-policy setting.

Method Qwen-3-8B LLaMA-3-8B-Instruct MultiJail PKU MMMLU MGSM MultiJail PKU MMMLU MGSM (ASR $\downarrow$ ) (ASR $\downarrow$ ) (Accuracy $\uparrow$ ) (Accuracy $\uparrow$ ) (ASR $\downarrow$ ) (ASR $\downarrow$ ) (Accuracy $\uparrow$ ) (Accuracy $\uparrow$ ) MSD w/ GT 11.71 11.88 66.74 74.00 11.55 12.34 18.58 30.65 MSD w/ EN 15.59 17.34 66.18 74.69 6.98 8.70 30.31 16.00 MSD w/ CoT 11.59 10.94 67.19 73.71 8.00 10.98 34.56 38.62 MSD w/ EN, w/ CoT 10.04 7.62 66.76 74.04 2.03 1.78 48.16 58.11 Raw Model 21.68 22.08 64.65 64.87 21.21 15.82 48.17 56.04

Table 7: Comparison of using different languages as the high-resource language.

Method MultiJail PKU MMMLU MGSM (ASR $\downarrow$ ) (ASR $\downarrow$ ) (Accuracy $\uparrow$ ) (Accuracy $\uparrow$ ) MSD (Off-Policy) w/ ZH 10.74 8.79 66.42 75.35 MSD (Off-Policy) w/ EN 9.90 6.54 66.88 75.53 MSD (On-Policy) w/ ZH 11.79 9.44 66.97 75.96 MSD (On-Policy) w/ EN 10.04 7.62 66.76 74.04 Best-Performing Baseline 12.44 9.08 65.49 73.42

Table 8: Comparison of freezing or updating the teacher’s parameters on Qwen-3-8B (On-Policy).

Method MultiJail PKU MMMLU MGSM (ASR $\downarrow$ ) (ASR $\downarrow$ ) (Accuracy $\uparrow$ ) (Accuracy $\uparrow$ ) $\alpha$ =0 (Ours) 10.04 7.62 66.76 74.04 $\alpha$ =0.01 12.03 6.70 65.59 69.35 $\alpha$ =0.02 14.67 6.74 64.62 69.16 $\alpha$ =0.05 35.49 22.24 58.46 69.24

4.3 Ablation Study

Effect of DPSW. We conduct an ablation study of DPSW across both the off-policy and on-policy MSD. Results in Table 4 demonstrate that integrating DPSW consistently enhances safety performance across all models and benchmarks. This confirms that DPSW is beneficial for multilingual safety alignment. We also include the best-performing baseline results in the table for comparison. Notably, even without DPSW, both the off-policy and on-policy MSD still outperform the majority of these best baselines. This observation validates that the safeguard transfer framework of MSD is effective for cross-lingual alignment on its own, and the addition of DPSW further enhances its performance.

Furthermore, to verify the validity of the DPSW design, we evaluate whether tokens assigned higher weights ( $w_{t}$ ) correspond to safety-critical tokens based on GPT-4o (detailed in Appendix B.5). Specifically, we separate the tokens of each sentence into safety-critical tokens and non-critical tokens. Then we examine the average values of the weights $w_{t}$ , $w^{T}_{t}$ and $w^{S}_{t}$ across both safety-critical and non-critical tokens. As shown in Table 5, safety-critical tokens consistently receive higher values for $w_{t}$ , $w^{T}_{t}$ , and $w^{S}_{t}$ compared to non-critical ones. This evidence supports our motivation that DPSW assigns higher penalty weights to safety-critical tokens throughout the sequence, rather than applying a uniform penalty across all generated tokens. Please see Appendix B.5 for details of the experiment.

Effect of teacher-side additional information. A core design in MSD is the additional information provided to the teacher. To validate the effectiveness of each component within our additional information, we evaluate four variants under the on-policy setting: (1) our full MSD (the query in English $x^{*}$ and the CoT instruction $\mathcal{C}$ ); (2) MSD w/ EN (the query in English $x^{*}$ only); (3) MSD w/ CoT (the CoT instruction $\mathcal{C}$ only); and (4) MSD w/ GT (ground-truth refusal responses), which are utilized in previous methods (Zhao et al., 2026; Shenfeld et al., 2026). Results in Table 6 demonstrate that the full MSD achieves the lowest ASR across both jailbreak benchmarks while preserving general capabilities. This confirms that both the query in English $x^{*}$ and the CoT instruction $\mathcal{C}$ are important for the teacher to elicit the model’s inherent safety capabilities in high-resource languages.

Furthermore, we observe that directly providing ground-truth refusals to the teacher can severely degrade the model’s general capabilities. For instance, MSD w/ GT drops LLaMA-3-8B-Instruct’s MMMLU score from 48.17 to 18.58. Aligning with recent findings (Kim et al., 2026; Yang et al., 2026a), this indicates that directly providing ground-truth refusals to the teacher may induce shortcut learning and information leakage, rather than genuinely transferring safety capabilities. These results further underscore the MSD’s superiority: MSD not only eliminates the need for refusal response data but also yields the most effective multilingual safety alignment while preserving general capabilities.

Exploring the choice of high-resource language. In our primary experiments, we designate English as the high-resource language. To evaluate the generalization of our framework to other dominant languages, we conduct an additional experiment using Chinese (ZH) as the high-resource language for Qwen-3-8B, driven by the importance of Chinese data in its pre-training corpus. Specifically, we adapt the additional information by providing the teacher with the query in Chinese and a Chinese CoT instruction. As shown in Table 8, while utilizing ZH as the high-resource language yields slightly inferior performance compared to EN, it still outperforms other baselines while successfully preserving the model’s general capabilities. This observation validates the robustness and flexibility of our framework. This indicates that the high-resource language in our framework can be flexibly selected based on the dominant languages present in the target model’s pre-training data.

Effect of freezing the teacher. During training, the parameters of the teacher $p_{T}$ are kept frozen to provide a stable supervision signal. Prior studies (Hübotter et al., 2026; Shenfeld et al., 2026) advocate updating the teacher’s parameters via an Exponential Moving Average (EMA) of the student’s parameters. Specifically, the EMA strategy updates the teacher iteratively as $p_{T}\leftarrow(1-\alpha)p_{T}+\alpha p_{S}$ , where $\alpha\in(0,1)$ . Thus, we compare the effects of employing an EMA update with $\alpha\in\{0.01,0.02,0.05\}$ (following (Shenfeld et al., 2026)) against a frozen teacher ( $\alpha=0$ ). As shown in Table 8, freezing the teacher ( $\alpha=0$ ) yields the best trade-off: it achieves strong multilingual safety performance while preserving the best general capabilities. Conversely, a minor EMA update ( $\alpha=0.01$ ) degrades general capability despite marginal safety gains on PKU-SafeRLHF, and a larger update ( $\alpha=0.05$ ) causes a severe collapse in both safety performance and general capability. This evidence suggests that updating the teacher with the student’s parameters may corrupt the teacher’s inherent high-resource safety capabilities and general capabilities. In contrast, a frozen teacher aligns perfectly with our objective of transferring existing safeguards without degrading general capabilities.

5 Conclusion and Limitations

In this paper, we introduce MSD, a response-free framework designed to bridge the severe safety alignment gap between high-resource and low-resource languages. By leveraging self-distillation, MSD effectively transfers internal safeguards from high-resource to low-resource languages, entirely eliminating the expensive costs of generating response data. To further optimize this transfer, we proposed DPSW, a token-level divergence measure that improves alignment precision based on both the teacher’s and the student’s perspectives. Extensive evaluations demonstrate that MSD achieves the best multilingual safety performance and strong generalization to unseen languages and more challenging datasets, all while preserving the model’s general capabilities. Ultimately, our approach provides a scalable, cost-effective paradigm for developing safe and reliable LLMs.

Limitations. The limitations of our MSD framework are as follows: (1) it requires the base model to already possess strong safeguards and in-context learning capabilities in the high-resource language; (2) it depends on high-quality multilingual query translations to construct the teacher’s additional information. Although such data are available in existing benchmarks, poor translation quality can negatively impact the alignment. A detailed discussion of these limitations is provided in Appendix D.

References

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations, Cited by: Appendix A.
A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. (2021) A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861. Cited by: §2.
Y. Bu, X. Liu, Z. Ren, Y. Yang, and J. Dai (2026) Align once, benefit multilingually: enforcing multilingual consistency for LLM safety alignment. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: Appendix A, §1, §2.
Z. Cao, Y. Yang, and H. Zhao (2025) Scans: mitigating the exaggerated safety for llms via safety-conscious activation steering. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 23523–23531. Cited by: Appendix A, §2.
S. R. Chowdhury, A. Kini, and N. Natarajan (2024) Provably robust dpo: aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409. Cited by: §B.3, §4.1.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §B.2.
Y. Deng, W. Zhang, S. J. Pan, and L. Bing (2024) Multilingual jailbreak challenges in large language models. In The Twelfth International Conference on Learning Representations, Cited by: Appendix A, §B.2, §B.2, §1, §2, §4.1, §4.1.
K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela (2024) Kto: model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306. Cited by: Appendix A, §B.3, §2, §4.1.
T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. In International conference on machine learning, pp. 1607–1616. Cited by: Appendix A, §2.
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. In Neural Information Processing Systems, Cited by: §1, §4.1.
Y. Gu, L. Dong, F. Wei, and M. Huang (2023) Minillm: knowledge distillation of large language models. arXiv preprint arXiv:2306.08543. Cited by: Appendix A.
C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, et al. (2023) Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998. Cited by: Appendix A, §2.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Cited by: §B.2, §4.1.
J. Hong, N. Lee, and J. Thorne (2024) Orpo: monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11170–11189. Cited by: §B.3, §4.1.
J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802. Cited by: Appendix A, §2, §4.3.
N. Jain, A. Shrivastava, C. Zhu, D. Liu, A. Samuel, A. Panda, A. Kumar, M. Goldblum, and T. Goldstein (2024) Refusal tokens: a simple way to calibrate refusals in large language models. In Neurips Safe Generative AI Workshop 2024, External Links: Link Cited by: §3.2.
J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li, et al. (2025) Pku-saferlhf: towards multi-level safety alignment for llms with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31983–32016. Cited by: §B.2, §4.1.
J. Kim, X. Luo, M. Kim, S. Lee, D. Kim, J. Jeon, D. Li, and Y. Yang (2026) Why does self-distillation (sometimes) degrade the reasoning capability of llms?. arXiv preprint arXiv:2603.24472. Cited by: §4.3.
J. Li, Y. Liu, C. Liu, L. Shi, X. Ren, Y. Zheng, Y. Liu, and Y. Xue (2024) A cross-language investigation into jailbreak attacks in large language models. arXiv preprint arXiv:2401.16765. Cited by: Appendix A, §2.
J. Liang, Z. Wang, and H. Wang (2026) Multilingual safety alignment via sparse weight editing. arXiv preprint arXiv:2602.22554. Cited by: Appendix A, §2.
K. Lu and T. M. Lab (2025) On-policy distillation. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/on-policy-distillation External Links: Document Cited by: Appendix A.
Y. Meng, M. Xia, and D. Chen (2024) Simpo: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37, pp. 124198–124235. Cited by: Appendix A, §B.3, §4.1.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, pp. 27730–27744. Cited by: §B.3, §4.1.
R. Park, R. Rafailov, S. Ermon, and C. Finn (2024) Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4998–5017. Cited by: §B.3, §4.1.
M. Phute, A. Helbling, M. D. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau (2024) LLM self defense: by self examination, LLMs know they are being tricked. In The Second Tiny Papers Track at ICLR 2024, External Links: Link Cited by: §B.3, §4.1.
X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson (2025) Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §3.2.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp. 53728–53741. Cited by: Appendix A, §B.3, §2, §4.1.
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) Deepspeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 3505–3506. Cited by: §B.4.
H. Sang, Y. Xu, Z. Zhou, R. He, Z. Wang, and J. Sun (2026) On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433. Cited by: Appendix A, §2.
G. Shen, D. Zhao, H. Tong, J. Li, F. Zhao, and Y. Zeng (2026) Safety instincts: LLMs learn to trust their internal compass for self-defense. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §3.2.
L. Shen, W. Tan, S. Chen, Y. Chen, J. Zhang, H. Xu, B. Zheng, P. Koehn, and D. Khashabi (2024) The language barrier: dissecting safety challenges of llms in multilingual contexts. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 2668–2680. Cited by: Appendix A, §1, §2.
I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026) Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897. Cited by: Appendix A, §1, §2, §4.3, §4.3.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al. (2022) Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057. Cited by: §B.2, §4.1.
Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, and C. Gan (2023) Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems 36, pp. 2511–2565. Cited by: Appendix A, §2.
G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. (2024) Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118. Cited by: §1.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §1, §4.1.
G. Wang, H. Shi, T. Ouyang, and A. Wang (2026) Few tokens, big leverage: preserving safety alignment by constraining safety tokens during fine-tuning. arXiv preprint arXiv:2603.07445. Cited by: §3.2.
W. Wang, Z. Tu, C. Chen, Y. Yuan, J. Huang, W. Jiao, and M. R. Lyu (2023a) All languages matter: on the multilingual safety of large language models. arXiv preprint arXiv:2310.00905. Cited by: Appendix A, §B.1, §1, §2, §4.1.
X. Wang, M. Wang, Y. Liu, H. Schuetze, and B. Plank (2025) Refusal direction is universal across safety-aligned languages. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: Appendix A, §B.3, §1, §2, §4.1.
Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2023b) Self-instruct: aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pp. 13484–13508. Cited by: Appendix A, §2.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §1, §4.1.
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, et al. (2025b) Qwen2.5 technical report. External Links: 2412.15115, Link Cited by: §4.1.
C. Yang, L. Xie, C. Su, and A. L. Yuille (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2859–2868. Cited by: Appendix A, §2.
C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan (2026a) Self-distilled rlvr. arXiv preprint arXiv:2604.03128. Cited by: §4.3.
J. Yang, H. Liu, J. Tu, J. Cheng, Z. Zhang, S. Cui, J. Weng, J. Tao, H. Xue, H. Wang, et al. (2026b) LASA: language-agnostic semantic alignment at the semantic bottleneck for llm safety. arXiv preprint arXiv:2604.12710. Cited by: Appendix A, §2.
Z. Yang, T. Pang, H. Feng, H. Wang, W. Chen, M. Zhu, and Q. Liu (2024) Self-distillation bridges distribution gap in language model fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1028–1043. Cited by: Appendix A, §1, §2.
T. Ye, L. Dong, X. Wu, S. Huang, and F. Wei (2026) On-policy context distillation for language models. arXiv preprint arXiv:2602.12275. Cited by: Appendix A, §2.
Z. Yong, B. Ermis, M. Fadaee, S. Bach, and J. Kreutzer (2025) The state of multilingual llm safety research: from measuring the language gap to mitigating it. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 15856–15871. Cited by: Appendix A, §1, §2.
W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, and J. E. Weston (2024) Self-rewarding language models. In International Conference on Machine Learning, pp. 57905–57923. Cited by: Appendix A, §1, §2.
E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) Star: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488. Cited by: Appendix A, §2.
L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3713–3722. Cited by: Appendix A, §2.
M. Zhang, D. Liu, K. Zhang, J. Franco, H. Liu, and K. Zhu (2025) Response-based knowledge distillation for multilingual jailbreak prevention unwittingly compromises safety. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025, External Links: Link Cited by: Appendix A, §2.
X. Zhang, C. Xie, L. Zhu, Y. Yang, W. Zhao, Z. Cheng, C. Wang, F. Shen, and T. Chua (2026) Who transfers safety? identifying and targeting cross-lingual shared safety neurons. arXiv preprint arXiv:2602.01283. Cited by: Appendix A, §1, §2.
Y. Zhang, Y. Wang, Z. Liu, S. Wang, X. Wang, P. Li, M. Sun, and Y. Liu (2024) Enhancing multilingual capabilities of large language models through self-distillation from resource-rich languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11189–11204. Cited by: Appendix A, §B.3, §1, §2, §4.1.
J. Zhao, Z. Zhang, L. Gao, Q. Zhang, T. Gui, and X. Huang (2024) Llama beyond english: an empirical study on language capability transfer. arXiv preprint arXiv:2401.01055. Cited by: §1.
S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Cited by: Appendix A, §1, §2, §4.3.
W. Zhao, Y. Hu, Y. Deng, T. Wu, W. Zhang, J. Guo, A. Zhang, Y. Zhao, B. Qin, T. Chua, et al. (2025) Mpo: multilingual safety alignment via reward gap optimization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23564–23587. Cited by: Appendix A, §B.2, §B.3, §1, §2, §4.1, §4.1.

Appendix A Additional Related Work

Multilingual Safety Alignment. Recent studies have exposed significant risks in the multilingual safety of LLMs: while they maintain strong safety capabilities in high-resource languages, their safeguards frequently fail in low-resource languages [Deng et al., 2024, Wang et al., 2023a, Shen et al., 2024, Yong et al., 2025]. Current methods to bridge this gap typically employ supervised fine-tuning [Deng et al., 2024, Li et al., 2024, Shen et al., 2024], preference optimization [Rafailov et al., 2023, Ethayarajh et al., 2024, Meng et al., 2024, Zhao et al., 2025], or distillation techniques [Zhang et al., 2024, 2025]. However, these data-driven methods demand substantial high-quality response data per target language. Generating such data through human annotation or API-based translation is costly and has a risk of introducing translation errors that harm data quality, therefore hindering large-scale alignment in low-resource languages. To alleviate this, recent studies [Bu et al., 2026, Yang et al., 2026b] introduce novel paradigms that eliminate the need for low-resource response data. Nevertheless, these methods are based on conventional training frameworks (e.g., DPO) and still need extensive English response data. Conversely, training-free approaches, including representation engineering [Wang et al., 2025] and activation steering [Zhang et al., 2026, Liang et al., 2026], reduce reliance on response data. However, they exhibit poor cross-dataset generalization and may harm the model’s general capabilities [Cao et al., 2025].

Self-Distillation. Self-distillation was studied as a way for a model to improve by learning from itself, rather than from a separately trained larger teacher. Early work [Furlanello et al., 2018, Zhang et al., 2019, Yang et al., 2019] showed that self-distillation can improve generalization and act as an effective regularizer. In the LLM era, self-distillation has been revisited as a mechanism for self-improvement. A line of off-policy methods lets LLMs generate and exploit their own supervision signals through iterative self-training, rationale bootstrapping, or self-generated instruction tuning [Zelikman et al., 2022, Gulcehre et al., 2023, Wang et al., 2023b, Sun et al., 2023, Yang et al., 2024, Yuan et al., 2024]. These approaches typically produce discrete trajectories or rationales and then fine-tune the model on them, yielding off-policy distillation. More recently, on-policy distillation (OPD) lets a stronger teacher evaluate the student’s own rollouts and provide dense token-level supervision [Agarwal et al., 2024, Gu et al., 2023, Lu and Lab, 2025]. To further remove the need for an external teacher, on-policy self-distillation (OPSD) instantiates the teacher and the student from the same model under different contexts, where the teacher is equipped with privileged information [Zhao et al., 2026, Hübotter et al., 2026]. This paradigm has been extended to continual learning, context internalization, and reasoning compression [Shenfeld et al., 2026, Ye et al., 2026, Sang et al., 2026]. In this paper, we build on this line and propose a flexible multilingual self-distillation framework that can integrate different self-distillation strategies to transfer a model’s internal high-resource-language safety capability to low-resource languages.

Appendix B More Experimental Details

B.1 Training Data

XSafety [Wang et al., 2023a]. We use XSafety as the sole multilingual query source for training. XSafety is constructed from several well-established monolingual safety corpora and extended into a multilingual setting through professional translation and proofreading. The benchmark covers 14 safety issues in total, including seven typical safety scenarios, six instruction-attack scenarios, and one commonsense safety test set. Concretely, the original benchmark contains 2,800 source queries, with 200 instances for each safety issue, and is translated into 10 languages: English, Chinese, Spanish, French, Bengali, Arabic, Hindi, Russian, Japanese, and German, resulting in 28,000 annotated instances in total. To ensure translation quality, the original paper first applies machine translation and then conducts two rounds of professional proofreading. It further reports a pass rate above 99% on a random inspection of 10% of the translated data. In our experiments, we only use the multilingual queries from XSafety and do not rely on any response-level supervision in the training stage.

B.2 Benchmark

We evaluate our method on two multilingual safety benchmarks and two multilingual utility benchmarks.

MultiJail [Deng et al., 2024]. MultiJail is a multilingual jailbreak benchmark containing 315 unsafe prompts across 10 languages, including English (EN), Chinese (ZH), Italian (IT), Vietnamese (VI), Arabic (AR), Korean (KO), Thai (TH), Bengali (BN), Swahili (SW), and Javanese (JV). The prompts cover 18 distinct categories of safety risks. To ensure benchmark quality, the multilingual prompts are manually verified by native speakers.

PKU-SafeRLHF [Ji et al., 2025]. PKU-SafeRLHF is a widely used safety preference dataset. To construct a multilingual evaluation set, we randomly sample 500 harmful English queries from PKU-SafeRLHF and translate them into 9 target languages, including Chinese (ZH), Italian (IT), Vietnamese (VI), Arabic (AR), Korean (KO), Thai (TH), Bengali (BN), Swahili (SW), and Javanese (JV), using GPT-4o with the translation prompt shown in Figure 4. This benchmark is used to test whether aligned models can generalize to a different multilingual safety distribution beyond XSafety.

MMMLU [Hendrycks et al., 2020]. MMMLU is the multilingual extension of MMLU, covering 57 subjects from elementary-level knowledge to advanced professional domains. The benchmark uses professionally translated test sets, which provide a relatively reliable evaluation of multilingual general knowledge. In our experiments, we adopt the 5-shot setting. The benchmark covers Arabic (AR), Bengali (BN), German (DE), English (EN), Spanish (ES), French (FR), Hindi (HI), Indonesian (ID), Italian (IT), Japanese (JA), Korean (KO), Portuguese (PT), Swahili (SW), Yoruba (YO), and Chinese (ZH).

MGSM [Shi et al., 2022]. MGSM is a multilingual grade-school mathematics benchmark built from 250 GSM8K [Cobbe et al., 2021] problems translated by human annotators. It evaluates multilingual mathematical reasoning under multi-step problem solving. In our experiments, we follow the 0-shot CoT setting. The benchmark covers English (EN), Spanish (ES), French (FR), German (DE), Russian (RU), Chinese (ZH), Japanese (JA), Thai (TH), Swahili (SW), Bengali (BN), and Telugu (TE).

For multilingual safety evaluation, we report the Attack Success Rate (ASR) following the evaluation protocol of Deng et al. [2024], using GPT-4o as the judge. As shown in Figure 5, GPT-4o classifies each response into three categories: safe, unsafe, and invalid. Responses judged as unsafe or invalid are both counted as successful attacks, while only meaningful refusals are considered safe [Zhao et al., 2025]. Therefore, a lower ASR indicates better multilingual safety alignment. To further ensure the accuracy and fairness of the LLM-as-a-judge evaluation, we randomly sampled 3% of the responses from all models on both benchmarks for manual verification. We observed an agreement rate of over 97% between human judgments and GPT-4o, which supports the reliability of the automatic evaluation pipeline used in this work.

B.3 Baseline Methods

SFT [Ouyang et al., 2022]: Supervised Fine-Tuning directly trains the model to imitate safe target responses with standard next-token prediction loss. It serves as a basic training-based baseline for multilingual safety alignment.

DPO [Rafailov et al., 2023]: Direct Preference Optimization learns from chosen–rejected response pairs by directly increasing the likelihood of preferred responses over dispreferred ones relative to a reference model. It is one of the most widely used offline preference alignment methods.

rDPO [Chowdhury et al., 2024]: rDPO is a robust variant of DPO designed to reduce sensitivity to noisy preference labels. It mitigates the impact of noisy feedback through a more noise-tolerant optimization objective.

KTO [Ethayarajh et al., 2024]: Kahneman-Tversky Optimization extends alignment beyond paired comparisons by learning from non-paired preference signals. It models alignment through a prospect-theoretic utility formulation.

ORPO [Hong et al., 2024]: Odds Ratio Preference Optimization removes the need for a separate reference model by introducing an odds-ratio term that directly contrasts winning and losing responses. It is optimized jointly with the SFT objective.

R-DPO [Park et al., 2024]: R-DPO augments DPO with an additional regularization term to mitigate undesirable effects such as length exploitation. This makes the learned preference signal more stable during alignment.

SimPO [Meng et al., 2024]: SimPO removes the reference model and uses the average log-likelihood of a response as the implicit reward. It further incorporates length normalization and a target reward margin to stabilize preference optimization.

PolyRefuse [Wang et al., 2025]: PolyRefuse is a training-free representation engineering method based on the observation that refusal directions are highly transferable across safety-aligned languages. It extracts a cross-lingual refusal direction from hidden representations and steers the model toward refusal behavior through activation intervention.

Self-Defense [Phute et al., 2024]: Self-Defense is a prompt-based defense method that does not require any training or parameter updates. It prompts the LLM to examine its own generated response and judge whether it is harmful, thereby using self-examination as a zero-shot defense against jailbreak attacks.

SDRRL [Zhang et al., 2024]: SDRRL improves multilingual capability by leveraging the model’s stronger responses in a resource-rich language as distillation targets for other languages. It constructs cross-lingual transfer data through translation and self-distillation to improve target-language performance.

MPO [Zhao et al., 2025]: MPO transfers multilingual safety capability by aligning the reward gap between safe and unsafe responses in target languages to that of a dominant language. It uses the dominant language as a high-quality supervision source and regularizes the model to preserve its original capability.

For preference-based baselines that require chosen–rejected response pairs, including DPO, rDPO, KTO, ORPO, R-DPO, SimPO, and MPO, we construct the preference data with GPT-4o. Specifically, for each query in XSAFETY, we use the prompt in Figure 6 to ask GPT-4o to generate one chosen response and one rejected response in the same language as the query. The chosen response is required to be safe, helpful, and aligned, while the rejected response directly follows the harmful intent without safety guardrails. This procedure provides relatively high-quality preference pairs for a fair comparison across all preference-based baselines. For SFT, we train on the same preference dataset but only use the chosen responses as supervision targets. For PolyRefuse and SDRRL, we follow the original implementations and settings described in their papers.

B.4 Implementation Details

All training experiments are conducted on four RTX PRO 6000 96GB GPUs using the Hugging Face TRL¹¹1https://github.com/huggingface/trl library. We perform full fine-tuning of the entire model’s parameters. For distributed training, we leverage the DeepSpeed [Rasley et al., 2020] framework with ZeRO-3 optimization. We search the teacher-side weight hyperparameter $K$ in DPSW over {16, 32, 64} and report the results in Table 13 in Appendix C. To ensure a fair comparison, all training-based methods are trained for one epoch under the same hyperparameter search space. Specifically, we search all training-based methods for the batch size in {16, 32} and the learning rate in {5e-7, 1e-6, 5e-6, 1e-5, 2e-5}.

B.5 Details of DPSW Weight Analysis

To further analyze whether DPSW assigns larger weights to semantically safety-critical tokens, we conduct a post-hoc token-level annotation study on the training data. Specifically, we randomly sample 1000 questions from the XSafety training set and run the first training step of MSD on two representative backbones, Qwen-3-8B and LLaMA-3-8B-Instruct. For each sampled question, we obtain the student-generated completion and record the token-level DPSW statistics before parameter updates, including the final weight $w_{t}$ , the teacher-side weight $w_{t}^{T}$ , and the student-side weight $w_{t}^{S}$ .

Since directly identifying safety-critical tokens using surface refusal phrases is unreliable in multilingual settings, we instead perform semantic span annotation. As shown in Figure 7, we ask GPT-4o to split each student-generated completion into contiguous, non-overlapping token spans and assign each span one of two labels: Safety-critical or Non-critical. A span is labeled as Safety-critical if it directly contributes to the model’s safety judgment, refusal stance, harmfulness/legality/ethics assessment, boundary setting, or safe redirection. In contrast, stylistic tokens, connective phrases, generic explanations, formatting tokens, filler tokens, and content that does not determine the safety stance are labeled as Non-critical. Importantly, the annotator only observes the completion text and token positions, but not the DPSW weights, which avoids circular reasoning in the analysis.

The annotation output is required to be a JSON object keyed by sequence index, where each value is a list of labeled spans with inclusive start_pos and end_pos. We automatically verify that the annotated spans cover every token position exactly once and that spans are non-overlapping. Invalid annotations are re-generated until they satisfy these structural constraints. After obtaining the span labels, we map each token to its corresponding span type and compute the average values of $w_{t}$ , $w_{t}^{T}$ , and $w_{t}^{S}$ for Safety-critical and Non-critical tokens separately:

\mathrm{Avg}(w)=\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}w_{t},

where $\mathcal{T}$ denotes the set of tokens belonging to a specific span type. The same computation is applied to $w_{t}^{T}$ and $w_{t}^{S}$ .

The results are reported in Table 5. Across both Qwen-3-8B and LLaMA-3-8B-Instruct, safety-critical tokens receive consistently higher values of $w_{t}$ , $w_{t}^{T}$ , and $w_{t}^{S}$ than non-critical tokens. This indicates that the teacher is more confident at safety-relevant positions and that the student deviates more from the teacher at these positions. Therefore, DPSW naturally concentrates the distillation penalty on tokens that are more important for safety alignment, rather than uniformly weighting all tokens in the generated sequence.

Case study of DPSW span annotation. Figure 8 illustrates the case study of the semantic span annotation used in our DPSW analysis. The average weight $\bar{w_{t}}$ of all tokens in each span is written in parentheses after the span name. Safety-critical spans receive clearly higher average weights than non-critical spans, supporting the intuition that DPSW emphasizes safety-relevant positions across the whole sequence.

Appendix C Additional Experimental Results

C.1 Full Results on Safety Benchmarks

Table 9 reports the full multilingual safety results across all ten languages and four backbones. Both off-policy and on-policy MSD consistently achieve strong ASR reductions across high-, medium-, and low-resource languages, with particularly clear gains on challenging low-resource languages such as Swahili and Javanese, further confirming the robustness and scalability of our cross-lingual safeguard transfer framework.

Table 9: Full results of multilingual safety performance on two benchmarks: MultiJail and PKU-SafeRLHF. Evaluation is based on the Attack Success Rate (ASR

\downarrow

). The best and second-best results are highlighted in bold and underlined, respectively.

MultiJail PKU-SafeRLHF Methods \columncolorHRHigh-resource \columncolorMRMedium-resource \columncolorLRLow-resource \columncolorAVGAvg. \columncolorHRHigh-resource \columncolorMRMedium-resource \columncolorLRLow-resource \columncolorAVGAvg. \cellcolorHREN \cellcolorHRZH \cellcolorHRIT \cellcolorHRVI \cellcolorMRAR \cellcolorMRKO \cellcolorMRTH \cellcolorLRBN \cellcolorLRSW \cellcolorLRJV \columncolorAVG\cellcolorAVG \cellcolorHREN \cellcolorHRZH \cellcolorHRIT \cellcolorHRVI \cellcolorMRAR \cellcolorMRKO \cellcolorMRTH \cellcolorLRBN \cellcolorLRSW \cellcolorLRJV \columncolorAVG\cellcolorAVG Qwen-2.5-7B-Instruct Raw 7.30 4.13 9.21 7.62 9.52 8.57 6.67 34.92 99.05 15.56 \columncolorAVG20.26 5.80 1.80 5.40 5.40 5.60 6.00 7.00 40.20 98.80 21.00 \columncolorAVG19.70 SFT 3.81 5.40 3.81 3.49 4.44 8.89 3.49 9.52 98.73 14.29 \columncolorAVG15.59 6.20 5.20 1.00 3.80 5.60 4.80 8.60 10.40 98.00 9.60 \columncolorAVG15.32 DPO 3.49 0.63 3.17 3.17 5.71 3.81 2.22 14.92 97.46 6.67 \columncolorAVG14.13 2.00 0.60 1.40 0.60 2.40 2.40 3.00 21.80 98.60 8.40 \columncolorAVG14.12 rDPO 2.22 1.27 1.27 1.59 3.17 1.90 1.90 14.60 96.19 5.08 \columncolorAVG12.92 0.40 0.20 1.60 1.00 2.00 2.00 2.40 18.20 97.80 7.00 \columncolorAVG13.26 KTO 1.27 1.27 0.95 2.22 1.59 1.27 0.95 12.38 96.51 5.40 \columncolorAVG12.38 0.40 0.40 0.40 0.60 1.60 2.20 1.80 17.00 97.60 7.40 \columncolorAVG12.94 ORPO 0.32 1.27 0.95 1.59 2.22 1.59 1.59 15.87 97.78 7.94 \columncolorAVG13.11 0.20 0.20 0.40 0.20 0.80 3.00 2.40 22.80 97.80 6.80 \columncolorAVG13.46 R-DPO 2.22 1.59 3.81 2.86 5.40 4.13 2.86 16.19 98.41 7.30 \columncolorAVG14.48 1.80 0.80 1.80 1.40 3.20 2.40 2.80 24.40 98.40 8.80 \columncolorAVG14.58 SimPO 0.00 1.27 0.32 1.90 0.95 7.62 5.71 11.75 97.14 5.71 \columncolorAVG13.24 2.60 3.20 3.00 4.00 5.20 2.20 3.80 18.80 98.00 10.80 \columncolorAVG15.16 PolyRefuse 6.67 4.13 10.16 7.30 8.57 8.25 7.30 35.24 99.05 17.78 \columncolorAVG20.45 5.40 1.80 5.60 6.20 5.40 6.00 7.40 42.40 99.00 20.80 \columncolorAVG20.00 Self-Defense 4.76 3.17 7.62 4.76 7.30 5.40 6.67 25.40 99.37 13.33 \columncolorAVG17.78 5.00 1.40 3.80 4.00 4.20 4.60 6.20 36.40 98.80 17.80 \columncolorAVG18.22 SDRRL 0.63 3.49 4.44 2.86 7.94 9.52 6.03 14.60 93.65 19.37 \columncolorAVG16.25 4.00 2.20 4.20 4.40 4.40 6.80 4.20 17.40 92.60 12.80 \columncolorAVG15.30 MPO 4.13 2.86 6.03 3.81 9.21 4.76 13.33 16.51 56.51 31.43 \columncolorAVG14.86 2.80 1.60 3.20 2.60 7.80 6.60 11.60 15.20 51.80 17.80 \columncolorAVG12.10 \rowcolorOURS MSD (Off-Policy) 1.90 2.54 6.35 1.27 6.03 2.86 4.76 10.79 53.33 14.29 \columncolorAVG10.41 3.00 1.60 4.60 5.20 3.80 5.60 6.00 10.80 55.20 14.00 \columncolorAVG10.98 \rowcolorOURS MSD (On-Policy) 1.59 0.95 3.81 1.90 5.71 0.95 1.90 5.40 33.33 3.81 \columncolorAVG5.94 3.00 2.40 4.20 2.40 3.20 1.00 1.60 4.40 32.60 5.60 \columncolorAVG6.04 Qwen-3-8B Raw 13.65 6.35 13.97 10.48 12.70 13.33 9.52 22.22 94.92 19.68 \columncolorAVG21.68 13.00 4.21 10.60 15.00 13.40 15.40 12.20 21.80 97.20 18.00 \columncolorAVG22.08 SFT 4.44 2.54 6.03 4.76 3.81 5.71 2.54 8.89 96.19 35.87 \columncolorAVG17.08 7.00 6.40 8.40 10.20 9.20 7.20 5.40 22.60 97.00 16.60 \columncolorAVG19.00 DPO 9.21 5.40 8.89 9.84 8.89 10.48 8.89 15.56 94.92 23.17 \columncolorAVG19.53 9.20 3.00 8.60 12.00 8.80 9.40 8.20 15.00 97.80 17.00 \columncolorAVG18.90 rDPO 8.25 4.76 10.16 7.94 10.79 10.48 7.62 18.10 95.24 23.17 \columncolorAVG19.65 11.80 2.60 8.00 10.60 8.60 8.80 7.40 16.60 97.20 15.40 \columncolorAVG18.70 KTO 7.30 4.13 8.89 5.71 9.21 12.70 6.98 15.24 95.56 21.27 \columncolorAVG18.70 9.40 2.60 7.00 10.00 7.60 8.60 8.80 14.80 97.80 13.00 \columncolorAVG17.96 ORPO 7.62 5.08 8.25 10.48 9.52 8.89 7.62 17.78 95.24 20.95 \columncolorAVG19.14 10.60 3.40 8.00 10.40 7.60 10.00 8.60 18.00 96.80 15.40 \columncolorAVG18.88 R-DPO 7.62 5.40 10.16 10.48 10.16 13.33 7.94 18.73 95.56 26.35 \columncolorAVG20.57 10.80 3.60 8.20 11.80 7.80 9.80 7.60 16.20 96.80 15.80 \columncolorAVG18.84 SimPO 7.94 5.40 8.89 7.62 8.89 9.84 8.25 18.10 94.29 23.49 \columncolorAVG19.27 9.60 2.60 6.80 9.60 7.60 11.20 6.80 18.60 98.20 14.80 \columncolorAVG18.58 PolyRefuse 12.70 7.62 15.56 13.02 12.38 17.14 11.11 22.54 94.29 28.89 \columncolorAVG23.53 15.20 5.01 11.00 15.20 13.20 13.80 11.80 22.40 97.40 18.20 \columncolorAVG22.32 Self-Defense 12.38 7.30 12.38 12.38 11.43 14.92 13.02 17.78 95.56 26.67 \columncolorAVG22.38 14.60 3.60 11.20 13.80 13.20 13.00 11.80 22.20 97.60 19.40 \columncolorAVG22.04 SDRRL 3.81 2.86 4.44 4.13 3.49 5.08 3.81 8.25 77.14 28.57 \columncolorAVG14.16 1.60 1.80 2.20 3.20 1.60 1.80 3.20 5.40 62.00 8.40 \columncolorAVG9.12 MPO 2.22 2.86 1.90 1.59 3.81 4.13 1.90 10.79 72.38 22.86 \columncolorAVG12.44 1.60 1.20 2.20 2.60 2.20 2.00 1.20 7.00 63.40 7.40 \columncolorAVG9.08 \rowcolorOURS MSD (Off-Policy) 0.63 0.95 1.59 2.22 1.90 2.86 2.86 4.76 63.81 17.46 \columncolorAVG9.90 0.60 0.40 0.20 1.60 1.80 1.60 1.20 2.00 52.40 3.60 \columncolorAVG6.54 \rowcolorOURS MSD (On-Policy) 0.95 0.32 1.90 0.95 3.81 2.22 1.90 3.81 67.62 16.83 \columncolorAVG10.04 0.20 0.40 1.00 1.00 2.20 1.20 2.40 2.80 61.80 3.20 \columncolorAVG7.62 LLaMA-2-7B-Chat Raw 0.95 6.03 5.40 12.70 36.83 15.87 51.43 49.52 32.06 9.52 \columncolorAVG22.03 1.80 3.80 5.00 6.40 29.80 10.00 46.20 48.00 51.40 16.40 \columncolorAVG21.88 SFT 0.00 0.63 0.63 5.71 35.87 6.98 95.87 12.38 94.92 7.30 \columncolorAVG26.03 0.60 1.20 0.40 9.40 11.60 13.60 96.40 26.80 98.20 8.00 \columncolorAVG26.62 DPO 1.27 3.49 4.76 9.52 31.75 14.92 50.48 38.10 24.44 8.25 \columncolorAVG18.70 1.00 2.00 3.40 4.60 27.80 8.00 43.00 44.40 44.80 15.40 \columncolorAVG19.44 rDPO 1.27 3.49 4.44 9.52 31.75 13.02 48.57 39.37 26.35 6.98 \columncolorAVG18.48 1.40 2.60 3.40 4.80 27.40 8.60 43.80 44.20 43.60 15.60 \columncolorAVG19.54 KTO 1.27 3.49 2.54 9.52 32.06 13.02 49.52 35.87 23.49 7.30 \columncolorAVG17.81 1.60 2.00 2.80 4.40 24.20 8.00 40.60 44.20 43.20 14.40 \columncolorAVG18.54 ORPO 0.95 2.86 5.08 7.30 30.16 11.11 46.35 36.51 24.44 7.62 \columncolorAVG17.24 1.80 2.80 3.60 5.00 22.40 7.80 41.00 45.00 43.00 13.60 \columncolorAVG18.60 R-DPO 1.27 3.81 4.76 7.62 33.33 13.33 48.57 39.37 25.71 8.57 \columncolorAVG18.63 1.20 2.80 3.40 5.00 25.80 8.40 43.20 45.80 46.20 16.00 \columncolorAVG19.78 SimPO 0.95 3.81 6.98 9.21 30.48 9.52 48.89 38.73 25.71 7.30 \columncolorAVG18.16 1.80 2.20 5.00 6.00 24.20 9.40 43.60 45.00 47.40 14.60 \columncolorAVG19.92 PolyRefuse 0.95 6.35 6.03 12.70 36.51 16.83 52.06 42.86 28.89 11.43 \columncolorAVG21.46 1.60 3.40 4.00 7.40 31.40 9.80 45.40 49.80 51.60 17.60 \columncolorAVG22.20 Self-Defense 0.95 2.86 3.49 9.52 27.62 12.38 52.38 39.68 53.33 23.17 \columncolorAVG22.54 1.40 3.40 3.40 8.80 26.60 6.80 57.60 62.60 57.80 19.40 \columncolorAVG24.78 SDRRL 0.63 4.13 0.63 2.54 13.97 8.89 39.37 21.59 17.46 2.54 \columncolorAVG11.18 1.60 1.80 2.00 2.00 14.00 4.00 36.60 25.60 41.80 14.00 \columncolorAVG14.34 MPO 0.95 5.71 4.44 10.79 33.97 14.60 46.35 42.54 26.67 10.79 \columncolorAVG19.68 1.60 2.60 4.40 6.20 28.80 9.80 46.00 50.60 49.40 17.60 \columncolorAVG21.70 \rowcolorOURS MSD (Off-Policy) 0.63 0.95 2.22 3.81 14.29 5.08 23.49 20.32 13.65 4.13 \columncolorAVG8.86 0.40 0.80 0.80 2.00 11.80 3.40 21.60 24.20 28.00 6.60 \columncolorAVG9.96 \rowcolorOURS MSD (On-Policy) 0.32 0.32 0.63 0.63 4.13 0.95 7.62 5.71 3.17 1.27 \columncolorAVG2.48 0.00 0.40 0.40 0.40 4.80 1.20 11.20 6.80 9.40 4.40 \columncolorAVG3.90 LLaMA-3-8B-Instruct Raw 2.54 10.79 22.54 13.65 11.43 17.46 10.79 13.97 69.52 39.37 \columncolorAVG21.21 2.20 5.60 14.40 5.40 8.40 9.80 9.00 15.40 65.20 22.80 \columncolorAVG15.82 SFT 1.59 1.59 3.49 2.86 11.11 4.76 0.63 1.59 50.79 6.35 \columncolorAVG8.48 0.60 0.80 0.80 1.40 0.60 5.80 1.00 2.20 33.60 5.60 \columncolorAVG5.24 DPO 1.59 0.63 2.86 4.76 9.21 1.90 1.90 1.59 53.33 6.67 \columncolorAVG8.44 0.20 0.20 2.60 1.20 5.60 1.20 1.20 5.00 68.20 2.80 \columncolorAVG8.62 rDPO 0.95 1.59 4.13 2.54 1.90 1.90 0.63 6.98 29.52 10.79 \columncolorAVG6.09 2.00 2.00 5.40 3.20 2.40 6.00 3.60 4.20 28.40 7.20 \columncolorAVG6.44 KTO 0.95 1.59 4.13 1.90 6.35 3.17 0.95 1.59 43.81 5.40 \columncolorAVG6.98 0.40 0.40 0.40 2.40 3.80 2.20 1.40 1.40 37.20 2.60 \columncolorAVG5.11 ORPO 0.63 0.63 3.49 2.86 10.79 10.79 0.32 2.86 77.46 1.27 \columncolorAVG11.11 0.00 0.00 2.40 2.40 7.00 9.80 1.40 4.20 72.80 2.60 \columncolorAVG10.30 R-DPO 0.95 0.95 2.22 5.40 9.84 2.22 1.90 3.49 57.14 7.30 \columncolorAVG9.14 0.20 0.20 0.80 1.40 6.40 1.00 1.60 4.60 65.60 2.60 \columncolorAVG8.44 SimPO 0.63 0.95 3.49 2.54 5.71 8.89 0.95 2.86 77.14 7.62 \columncolorAVG11.08 0.40 0.40 2.60 2.40 3.60 12.00 2.20 5.20 72.20 2.60 \columncolorAVG10.36 PolyRefuse 2.86 10.48 21.59 15.24 13.97 18.73 12.06 16.51 67.30 44.76 \columncolorAVG22.35 2.40 5.80 13.20 5.40 8.80 9.80 9.00 15.40 65.20 23.60 \columncolorAVG15.86 Self-Defense 14.60 11.11 8.89 12.70 8.89 7.94 42.54 28.25 14.29 6.35 \columncolorAVG15.56 6.20 2.80 4.20 5.20 6.00 11.00 38.00 18.00 9.60 9.60 \columncolorAVG11.06 SDRRL 1.27 5.40 6.67 11.43 11.43 11.75 7.94 11.75 28.25 30.16 \columncolorAVG12.61 0.00 3.80 2.60 3.00 8.80 7.40 5.60 14.20 20.20 11.40 \columncolorAVG7.70 MPO 5.40 10.79 7.62 6.67 7.62 6.03 14.92 22.22 14.92 22.22 \columncolorAVG11.84 2.40 2.00 1.80 2.20 2.20 2.00 3.80 7.40 11.80 8.20 \columncolorAVG4.38 \rowcolorOURS MSD (Off-Policy) 0.95 0.63 2.86 1.27 1.59 1.90 2.22 0.63 6.03 11.11 \columncolorAVG2.92 0.60 0.20 2.60 0.60 0.60 0.40 1.00 1.20 3.20 1.80 \columncolorAVG1.22 \rowcolorOURS MSD (On-Policy) 0.63 0.95 1.27 1.59 1.27 1.27 1.90 2.22 4.13 5.08 \columncolorAVG2.03 1.40 1.00 2.20 1.20 1.00 0.80 1.60 2.40 2.80 3.40 \columncolorAVG1.78

C.2 Full Results on Utility Benchmarks

Table 10 presents the full MMMLU results across 15 languages. MSD preserves general knowledge ability well after safety alignment and achieves the best or competitive average performance on all backbones, indicating that the safety gains do not come at the cost of broad multilingual utility.

Table 10: Full multilingual utility results on MMMLU. Best and second-best Avg. within each model block are highlighted in bold and underlined, respectively.

Method AR BN DE EN ES FR HI ID IT JA KO PT SW YO ZH Avg. Qwen-2.5-7B-Instruct Raw 52.65 46.24 62.56 73.71 68.02 67.01 49.04 63.47 66.28 62.56 59.73 67.45 34.88 32.94 67.34 58.26 SFT 36.80 42.57 55.49 71.60 61.91 54.79 45.10 57.90 60.54 35.59 47.97 62.78 33.36 31.73 60.05 50.55 DPO 52.64 46.10 61.84 73.29 67.78 67.17 48.53 63.69 66.28 62.05 59.84 67.09 34.60 32.73 66.91 58.04 rDPO 52.81 45.73 61.99 73.17 67.50 66.70 48.11 63.24 66.07 62.06 59.66 67.06 34.71 32.92 67.08 57.92 KTO 52.87 45.63 61.76 73.03 67.32 66.50 48.21 63.17 65.97 61.95 59.79 66.87 34.79 33.01 66.88 57.85 ORPO 53.79 44.93 62.28 73.21 67.00 66.34 47.74 62.11 65.72 60.97 59.20 65.96 33.76 32.72 66.34 57.47 R-DPO 52.73 46.05 62.09 73.38 67.72 66.98 48.53 63.44 66.39 62.14 59.98 67.10 34.72 32.77 66.96 58.06 SimPO 53.94 44.54 62.10 72.54 66.51 65.74 47.18 61.72 65.08 59.96 58.79 65.96 34.57 32.81 66.00 57.16 PolyRefuse 57.08 46.22 62.80 72.77 66.34 65.73 50.01 62.40 65.62 60.72 59.81 65.47 35.26 32.36 65.91 57.90 SDRRL 55.13 47.84 61.11 73.09 65.30 65.80 48.90 62.03 65.57 60.20 60.11 64.33 34.30 31.11 65.33 57.34 MPO 53.73 45.77 62.43 73.69 68.15 66.95 48.85 63.31 66.09 61.99 59.78 66.94 34.98 33.02 66.86 58.17 MSD (Off-Policy) 52.57 46.23 62.53 73.72 68.12 67.10 48.95 63.47 66.40 62.61 59.81 67.39 34.63 32.84 67.38 58.25 MSD (On-Policy) 58.15 48.21 64.09 73.52 67.28 66.71 51.20 63.45 66.59 62.01 61.00 66.32 34.97 32.13 66.38 58.80 Qwen-3-8B Raw 67.82 63.27 70.01 74.36 72.57 71.49 65.96 70.69 71.90 68.91 68.10 72.48 42.72 31.18 58.27 64.65 SFT 57.06 49.91 64.94 75.79 68.94 67.17 53.48 64.89 67.10 62.36 60.58 67.54 28.76 30.84 68.98 59.22 DPO 68.25 63.52 71.06 74.43 73.12 72.00 66.47 71.09 72.06 69.53 68.64 73.21 43.11 31.73 59.08 65.15 rDPO 67.94 63.64 71.36 74.94 73.37 72.14 66.31 71.41 72.31 69.77 68.84 72.95 42.57 31.65 59.72 65.26 KTO 68.15 63.84 71.23 74.91 73.05 71.93 66.42 71.25 72.16 69.85 69.21 73.18 42.24 31.28 59.69 65.23 ORPO 68.69 63.64 71.41 75.17 73.74 72.24 66.47 71.66 72.55 70.36 69.03 73.06 42.94 31.55 59.76 65.49 R-DPO 68.69 63.84 71.17 74.72 73.00 71.95 66.41 71.25 71.86 69.95 68.62 73.27 42.74 31.68 59.27 65.23 SimPO 68.42 63.95 71.36 75.20 73.41 72.25 67.10 71.26 72.66 69.86 69.19 73.26 42.85 31.21 59.81 65.45 PolyRefuse 67.60 62.90 70.20 74.37 72.62 71.49 66.20 70.54 71.74 69.23 68.37 72.30 42.52 30.97 58.15 64.61 SDRRL 64.14 63.75 70.52 76.86 72.24 70.74 65.95 67.78 71.71 69.51 67.38 71.26 41.95 29.64 72.64 65.07 MPO 67.82 62.96 70.84 74.21 72.35 71.79 66.44 70.52 71.76 69.39 68.47 72.84 42.54 31.36 58.37 64.78 MSD (Off-Policy) 69.01 64.39 72.01 77.31 73.99 72.88 67.36 72.09 73.62 70.23 68.98 74.01 42.32 31.95 73.11 66.88 MSD (On-Policy) 68.72 63.92 72.05 76.93 73.95 72.64 66.54 71.94 73.71 71.11 69.74 73.19 43.01 30.59 73.32 66.76 LLaMA-2-7B-Chat Raw 27.57 32.08 34.19 40.48 35.07 33.64 32.08 31.74 33.32 31.21 30.16 34.38 27.04 26.60 31.19 32.05 SFT 16.68 24.03 24.26 32.99 25.00 24.46 24.02 27.44 27.00 23.37 26.16 25.52 19.35 17.16 22.56 24.00 DPO 27.39 31.83 34.34 40.00 34.54 33.35 31.83 31.69 33.78 30.82 30.18 34.61 26.50 26.20 30.84 31.86 rDPO 27.54 31.99 34.43 40.04 34.63 33.56 31.98 31.67 33.98 31.07 30.20 34.66 26.48 26.21 30.96 31.96 KTO 27.41 31.86 34.02 39.97 34.79 33.39 31.86 31.56 33.86 31.05 30.24 34.17 26.46 26.29 30.67 31.84 ORPO 27.18 31.47 33.70 39.13 33.99 32.84 31.47 31.47 33.37 30.90 29.45 33.96 26.32 26.56 30.54 31.49 R-DPO 27.67 31.98 34.43 40.19 34.83 33.56 31.99 31.71 33.95 31.16 30.02 34.70 26.40 26.49 31.07 32.01 SimPO 27.30 31.35 33.41 39.08 33.89 32.80 31.36 31.23 33.09 30.55 29.53 33.46 26.15 26.24 30.51 31.33 PolyRefuse 27.61 31.20 33.14 39.15 33.89 32.54 31.20 30.96 33.09 30.63 29.71 33.16 26.23 25.25 30.24 31.20 SDRRL 25.66 27.09 29.28 34.61 30.60 29.80 27.10 27.98 29.94 29.06 28.99 30.89 16.79 10.07 28.79 27.11 MPO 27.58 31.89 34.33 40.44 34.35 33.27 31.90 31.72 33.54 31.25 30.00 34.43 26.85 26.23 30.87 31.91 MSD (Off-Policy) 27.50 31.63 33.46 39.54 34.41 33.13 31.64 31.26 33.56 31.36 30.50 34.08 26.36 25.52 30.20 31.61 MSD (On-Policy) 27.39 31.92 34.15 40.49 34.77 33.55 31.93 31.54 33.49 31.21 30.10 34.39 26.78 26.46 31.38 31.97 LLaMA-3-8B-Instruct Raw 44.85 38.54 53.13 62.46 55.09 55.27 42.89 49.46 53.82 47.49 46.77 54.61 36.43 30.59 51.23 48.17 SFT 14.34 25.35 5.11 28.28 18.46 5.00 12.13 24.58 21.73 12.43 34.21 24.53 13.74 2.94 21.93 17.65 DPO 38.21 30.66 47.62 61.56 52.95 51.42 36.53 44.00 49.18 41.85 40.12 51.20 17.38 12.97 47.36 41.53 rDPO 14.86 10.34 26.62 57.41 40.19 24.10 10.92 19.11 31.47 20.68 16.93 36.70 3.20 1.27 29.16 22.86 KTO 25.21 20.24 41.50 59.84 48.70 42.49 22.45 35.93 43.95 32.97 30.70 48.05 9.44 4.89 40.49 33.79 ORPO 29.20 27.08 44.69 54.57 49.52 42.31 26.63 38.29 43.43 31.16 31.26 48.11 12.75 3.53 40.69 34.88 R-DPO 38.87 32.00 47.70 61.89 53.30 52.21 37.08 44.48 49.83 42.24 40.89 51.36 19.58 14.52 47.98 42.26 SimPO 30.12 25.30 42.69 53.51 49.48 42.17 27.75 39.37 44.10 33.89 33.21 47.70 11.97 6.04 42.36 35.31 PolyRefuse 41.00 35.77 49.24 60.38 52.29 52.32 39.14 46.28 50.39 43.83 43.24 51.34 28.06 18.11 48.82 44.01 SDRRL 31.06 23.38 40.64 52.76 42.79 42.37 27.10 37.26 41.18 34.42 34.77 42.58 20.11 6.40 37.39 34.28 MPO 44.57 38.41 53.16 62.09 55.03 54.61 42.69 49.42 53.49 47.10 46.11 54.05 35.94 30.23 51.09 47.87 MSD (Off-Policy) 43.81 37.82 51.38 61.04 53.82 53.92 40.95 48.28 52.50 45.93 45.30 52.12 34.08 27.50 50.13 46.57 MSD (On-Policy) 44.88 38.54 53.12 62.47 55.13 55.25 42.81 49.41 53.80 47.51 46.81 54.62 36.25 30.51 51.21 48.16

Table 11 reports the full MGSM results across 11 languages. MSD maintains strong multilingual reasoning performance after alignment, and in several cases improves the average score over the raw model and other baselines, demonstrating that our framework preserves reasoning ability while enhancing multilingual safety.

Table 11: Full multilingual utility results on MGSM. Best and second-best Avg. within each model block are highlighted in bold and underlined, respectively.

Method EN ES FR DE RU ZH JA TH SW BN TE Avg. Qwen-2.5-7B-Instruct Raw 90.00 76.40 27.20 87.60 85.20 81.20 73.20 76.80 14.00 66.40 24.80 63.89 SFT 91.20 76.40 31.20 76.00 75.20 81.20 67.60 25.20 6.40 29.20 22.80 52.95 DPO 91.20 76.40 28.00 81.60 82.00 85.60 74.40 77.60 10.40 63.20 21.60 62.91 rDPO 91.60 74.80 28.80 82.80 83.60 85.20 73.60 76.40 10.40 64.40 23.20 63.16 KTO 90.80 76.00 28.40 82.80 83.20 84.40 72.00 78.00 10.00 62.00 26.00 63.05 ORPO 90.00 47.60 28.00 84.40 81.60 80.80 73.60 71.20 15.20 62.80 23.60 59.89 R-DPO 90.00 76.40 27.20 83.20 81.60 85.60 74.40 77.60 10.80 63.60 20.80 62.84 SimPO 91.20 42.00 28.00 80.40 72.80 78.80 68.00 49.20 13.20 59.20 24.40 55.20 PolyRefuse 90.00 68.00 35.20 86.00 74.40 80.00 70.00 66.00 0.40 62.40 13.20 58.69 SDRRL 91.60 41.20 28.40 80.80 82.40 80.80 70.00 70.80 18.80 57.20 22.40 58.58 MPO 92.00 74.40 27.20 81.60 83.20 82.00 74.40 76.80 10.80 65.20 24.80 62.95 MSD (Off-Policy) 91.20 66.80 39.60 80.00 83.20 82.00 69.20 57.20 7.60 64.40 23.60 60.44 MSD (On-Policy) 92.40 75.20 26.40 84.80 84.00 81.60 74.40 76.80 14.80 66.80 25.60 63.89 Qwen-3-8B Raw 51.60 84.40 76.00 80.80 82.80 60.40 59.20 75.20 31.20 68.40 43.60 64.87 SFT 92.80 82.40 79.60 81.60 86.00 81.60 76.40 78.80 23.20 68.80 56.40 73.42 DPO 58.40 82.40 77.60 80.80 83.20 54.80 58.80 75.60 33.60 69.20 44.40 65.35 rDPO 56.80 84.00 78.00 80.80 82.00 57.60 60.80 75.60 35.60 68.40 45.20 65.89 KTO 56.40 84.00 77.60 79.60 84.00 56.40 61.20 76.40 34.40 69.60 45.20 65.89 ORPO 61.20 83.20 77.20 78.80 86.00 60.00 64.40 73.60 34.40 71.60 50.40 67.35 R-DPO 57.60 82.80 79.20 81.20 82.80 56.00 60.80 75.60 34.40 67.20 42.80 65.49 SimPO 60.40 83.60 78.80 80.40 86.00 61.60 64.40 72.80 39.20 69.60 49.60 67.85 PolyRefuse 54.80 83.60 74.80 80.40 81.60 53.60 60.80 74.80 32.80 67.20 42.80 64.29 SDRRL 76.00 68.00 66.40 59.20 70.40 79.20 48.80 73.60 36.80 56.40 51.60 62.40 MPO 55.60 83.60 74.80 80.40 81.60 56.00 60.80 75.60 33.20 66.80 42.40 64.62 MSD (Off-Policy) 83.60 86.80 81.60 77.60 86.00 78.00 72.80 80.00 42.40 76.40 65.60 75.53 MSD (On-Policy) 84.00 83.60 79.60 75.60 85.20 79.60 70.40 81.20 41.20 74.00 60.00 74.04 LLaMA-2-7B-Chat Raw 25.60 14.80 15.60 20.00 11.20 20.40 10.80 1.60 0.00 0.40 0.00 10.95 SFT 24.40 5.20 6.00 13.20 4.40 9.60 4.40 0.80 0.40 0.00 0.00 6.22 DPO 22.80 16.00 12.80 17.20 13.60 16.80 11.20 0.00 0.00 0.00 0.00 11.04 rDPO 22.80 14.80 13.60 19.60 12.40 18.00 12.00 0.00 0.00 0.00 0.00 10.29 KTO 24.80 15.60 14.80 14.80 12.00 17.20 10.80 0.00 0.00 0.00 0.00 11.00 ORPO 23.20 16.40 16.00 18.80 10.40 17.20 10.80 1.20 0.00 0.00 0.00 11.40 R-DPO 21.60 16.80 14.00 17.20 12.00 17.60 12.00 0.00 0.00 0.00 0.00 10.11 SimPO 23.60 13.60 15.60 18.40 8.80 18.40 12.80 1.60 0.00 0.00 0.00 10.25 PolyRefuse 22.80 14.40 15.60 17.60 8.80 20.40 10.40 2.00 0.80 0.40 0.00 10.29 SDRRL 27.20 18.00 18.40 13.60 18.00 14.00 16.80 0.00 2.40 0.80 0.00 11.75 MPO 23.60 14.00 16.40 20.40 12.00 19.20 12.00 0.80 0.40 0.40 0.00 11.92 MSD (Off-Policy) 21.20 13.60 15.60 22.40 10.80 16.80 12.80 2.40 0.40 0.00 0.00 10.55 MSD (On-Policy) 26.40 19.60 17.60 12.00 20.40 16.80 14.00 0.80 0.00 0.40 0.00 12.80 LLaMA-3-8B-Instruct Raw 78.80 62.00 62.40 64.00 68.80 59.20 58.80 60.40 30.40 38.40 33.20 56.04 SFT 44.40 2.40 9.20 19.60 46.80 50.00 19.60 16.80 4.00 7.60 8.80 20.84 DPO 79.60 62.40 59.60 63.20 65.60 58.00 55.20 46.80 34.80 40.00 19.20 53.13 rDPO 79.20 65.20 57.20 57.20 62.80 61.60 49.20 50.40 30.40 38.00 19.60 51.89 KTO 78.80 62.00 59.20 60.80 66.80 62.40 46.80 49.20 34.00 36.40 19.60 52.36 ORPO 78.40 58.80 58.40 52.40 64.80 56.80 44.40 44.80 25.60 23.20 9.60 47.02 R-DPO 78.80 63.20 59.20 62.80 65.60 58.00 54.00 48.40 34.80 36.40 18.40 52.69 SimPO 81.60 59.20 55.60 52.80 61.20 59.20 44.40 47.60 23.20 24.80 16.00 47.78 PolyRefuse 79.60 64.40 60.40 68.80 68.80 57.20 57.60 55.20 28.40 38.00 33.20 55.60 SDRRL 77.60 54.40 51.60 54.80 46.80 54.00 44.80 46.40 22.80 9.20 6.80 42.65 MPO 77.20 62.00 60.00 64.40 68.00 58.40 56.80 53.60 28.80 36.40 31.20 54.25 MSD (Off-Policy) 78.80 63.60 62.80 64.80 67.60 60.80 56.40 58.80 30.00 40.00 31.20 55.89 MSD (On-Policy) 78.00 69.60 62.40 67.20 67.60 62.40 56.40 56.80 29.20 52.00 37.60 58.11

C.3 Ablation Study on Divergence Objectives

We compare different divergence objectives for the MSD distillation loss under the on-policy setting, including forward KL (FKL), reverse KL (RKL), and Jensen–Shannon divergence (JSD). For JSD, we use the standard symmetric formulation with the mixture distribution $m=\frac{1}{2}p_{T}+\frac{1}{2}p_{S}$ , i.e., $\mathrm{JSD}(p_{T}\|p_{S})=\frac{1}{2}D_{\mathrm{KL}}(p_{T}\|m)+\frac{1}{2}D_{\mathrm{KL}}(p_{S}\|m)$ . As shown in Table 12, RKL achieves the best overall safety performance across both Qwen-3-8B and LLaMA-3-8B-Instruct, suggesting that mode-seeking distillation is more suitable for transferring high-resource safety behavior to low-resource languages in our setting.

Table 12: Ablation study on different divergence objectives in MSD. We compare forward KL (FKL), reverse KL (RKL), and Jensen–Shannon divergence (JSD). For MultiJail and PKU-SafeRLHF, we report attack success rate (ASR), where lower is better.

Method	Qwen-3-8B		LLaMA-3-8B-Instruct
Method	MultiJail	PKU-SafeRLHF	MultiJail	PKU-SafeRLHF
MSD w/ FKL	12.79	9.53	2.67	3.84
MSD w/ RKL	10.04	7.62	2.03	1.78
MSD w/ JSD	10.85	8.02	2.38	2.62

C.4 Hyperparameter Experiment of Teacher’s Top- $K$ Entropy

Table 13 studies the sensitivity of DPSW to the top- $K$ candidate size in the teacher-side weight $w^{T}_{t}$ . Across both Qwen-3-8B and LLaMA-3-8B-Instruct, all DPSW variants consistently outperform the baseline without DPSW, regardless of the choice of $K$ . This demonstrates that the effectiveness of DPSW is highly robust to the parameter $K$ .

Table 13: The hyperparameter experiment about

K

in the teacher-side weight

w^{T}_{t}

of DPSW. We conduct the experiment under the on-policy setting. For MultiJail and PKU-SafeRLHF, we report attack success rate (ASR), where lower is better.

Method	Qwen-3-8B		LLaMA-3-8B-Instruct
Method	MultiJail	PKU-SafeRLHF	MultiJail	PKU-SafeRLHF
MSD w/o DPSW	11.09	9.10	3.33	3.02
MSD w/ DPSW ( $K=16$ )	10.69	8.24	2.60	2.16
MSD w/ DPSW ( $K=32$ )	10.04	7.62	2.03	1.78
MSD w/ DPSW ( $K=64$ )	10.49	7.23	2.28	1.88

C.5 Reason for High ASR on the Swahili (SW) Language for Qwen-2.5-7B-Instruct and Qwen-3-8B

Table 14: Invalid and unsafe rates on the Swahili (SW) language for Qwen-2.5-7B-Instruct and Qwen-3-8B. We report the percentages of invalid and unsafe responses on MultiJail and PKU-SafeRLHF.

Method	Qwen-2.5-7B-Instruct				Qwen-3-8B
	MultiJail		PKU-SafeRLHF		MultiJail		PKU-SafeRLHF
	invalid	unsafe	invalid	unsafe	invalid	unsafe	invalid	unsafe
MSD (Off-Policy)	39.68	13.65	40.80	14.40	57.14	6.67	46.40	6.00
MSD (On-Policy)	27.62	5.71	28.00	4.60	60.95	6.67	55.40	6.40

As detailed in Appendix B.2, our evaluation metric Attack Success Rate (ASR) categorizes both unsafe and invalid responses as successful attacks, whereas only meaningful refusals are deemed safe. This rigorous metric explains the persistently high ASR observed in certain low-resource languages, such as Swahili (SW), even after alignment. As Table 14 shows, for instance, when evaluating Qwen-3-8B on MultiJail in Swahili, off-policy MSD reports an ASR of 63.81%. However, a closer breakdown reveals that genuinely unsafe responses account for only 6.67%, while invalid responses constitute 57.14%. Similarly, on-policy MSD yields a 67.62% ASR, comprising 6.67% unsafe and 60.95% invalid responses. This indicates that the inflated ASR is predominantly driven by invalid outputs rather than actual safety failures. Furthermore, we investigate the model’s internal reasoning behind these invalid outputs to determine whether it intends to refuse. As illustrated in Figure 9, unlike the Base model and DPO—which actively reason towards generating unsafe answers—MSD successfully identifies the harmful intent during the reasoning phase. The invalid outcome mainly stems from the base model’s limited generation ability in low-resource languages, which produces a invalid final response that is penalized as a failed defense.

Appendix D Extended Discussion of Limitations

Dependency on Inherent High-Resource Capabilities. The core mechanism of MSD involves transferring a model’s internal safety capabilities from a high-resource language to low-resource languages. Consequently, the framework’s success is strictly bounded by the teacher model’s ability to reason and refuse harmful instructions in the source language. If the foundational LLM has not undergone sufficient prior safety supervision or exhibits weak in-context learning (ICL) performance, the teacher will fail to provide a stable and valid supervision signal. In such cases, the student may inadvertently inherit or even amplify the teacher’s alignment failures, preventing effective safeguard transfer.

Sensitivity to Query Translation Quality. A primary advantage of MSD is the elimination of the prohibitive cost associated with generating target-language response data. However, the framework still necessitates parallel queries—pairing a target low-resource query with its high-resource translation—to establish the teacher’s additional information. While benchmarks like XSafety and MultiJail provide such pairs, the semantic equivalence and linguistic accuracy of these translations are critical. Translation errors may lead to a misalignment between the teacher’s reasoning and the student’s input, ultimately degrading the precision of the cross-lingual alignment.

Multilingual Safety Alignment via Self-Distillation

Multilingual Safety Alignment via Self-Distillation

Abstract

1 Introduction

2 Related Work

3 Multilingual Self-Distillation

3.1 Cross-Lingual Safety Transfer through Self-Distillation

3.2 Dual-Perspective Safety Weighting

4 Experiments

4.1 Experimental Setup

4.2 Overall Evaluation on Multilingual Safety Alignment

4.3 Ablation Study

5 Conclusion and Limitations

References

Appendix A Additional Related Work

Appendix B More Experimental Details

B.1 Training Data

B.2 Benchmark

B.3 Baseline Methods

B.4 Implementation Details

B.5 Details of DPSW Weight Analysis

Appendix C Additional Experimental Results

C.1 Full Results on Safety Benchmarks

C.2 Full Results on Utility Benchmarks

C.3 Ablation Study on Divergence Objectives

C.4 Hyperparameter Experiment of Teacher’s Top-KK Entropy

C.5 Reason for High ASR on the Swahili (SW) Language for Qwen-2.5-7B-Instruct and Qwen-3-8B

Appendix D Extended Discussion of Limitations

Comments

C.4 Hyperparameter Experiment of Teacher’s Top- $K$ Entropy