[2605.04059] Continual Distillation of Teachers from Different Domains

Continual Distillation of Teachers from Different Domains

Nicolas Michel (The University of Tokyo; Japanese-French Laboratory for Informatics, CNRS), Maorong Wang (National Institute of Informatics), Jiangpeng He (Indiana University Bloomington), Toshihiko Yamasaki (The University of Tokyo)
{nicolas, yamasaki}@cvm.t.u-tokyo.ac.jp, maorong@nii.ac.jp, jhe2@iu.edu

Abstract

$^{\dagger}$ Equal supervision.

Deep learning models continue to scale, with some requiring more storage than many large-scale datasets. Thus, we introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains not present in the training data, while known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF) when transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code and implementation for this work are publicly available at: https://github.com/Nicolas1203/continual_distillation.


Figure 1: Overview of the Continual Distillation problem. A student model $\mathcal{S}$ learns through distillation from a sequence of teachers $\{\mathcal{T}_{1},\mathcal{T}_{2},\mathcal{T}_{3}\}$. During distillation, part of the teachers’ training data is unavailable. The objective is for the student to maintain high performance on all domains known to the teachers, but not necessarily introduced to the student.

1 Introduction

Over the last decade, deep learning models have reached unprecedented scales, creating a growing need for computation-efficient strategies. Consequently, Continual Learning (CL) [8, 24, 35] has become a major branch of contemporary deep learning research. The main idea is straightforward: a model is trained on a sequence of datasets where previous data becomes unavailable over time. The rationale is that re-training on both past and current data as new data arrives can be computationally expensive or require substantial storage. In this context, data access is limited during training.

With the recent adoption of Foundation Models (FMs) [9, 21, 11, 26, 1] as the backbone of modern deep learning, we propose a new paradigm, Continual Distillation (CD), tailored to FMs, illustrated in Figure 1. In CD, instead of learning from a sequence of datasets, we propose to learn from a sequence of models trained on different datasets. Specifically, a single student model learns sequentially from multiple teacher models without retaining access to earlier ones. Similar to training data in CL, new FMs are regularly introduced over time, and storing them is cumbersome since they can rapidly require more storage than large-scale datasets. For instance, storing 10B parameters requires approximately 38GB [7], while FMs often exceed 100B parameters. Even accessing FMs through restricted APIs presents serious limitations, as previous versions may become unavailable after updates. Our CD setup reflects a realistic scenario in which one aims to leverage the ever-evolving stream of FMs to train smaller and more specialized models through distillation. In analogy to CL, where a single model benefits from a continuous stream of data, in CD, a single student model benefits from a continuous stream of teacher models.

However, CD introduces two main challenges. First, the original training data of a foundation model is typically unavailable, undisclosed, or prohibitively large to reuse [2]. Thus, choosing distillation data is critical, and such data can potentially emerge from a domain unknown to the teacher [36]. Second, each teacher generally exhibits distinct abilities and performance; for instance, one may excel at recognizing animals, while another may specialize in distinguishing insects.

In this work, we focus on the case of Continual Distillation, where, analogous to Domain Incremental Learning, teachers have been trained on a domain incremental set of datasets. However, we assume that teachers share partially overlapping domains. We believe this assumption to be realistic, as, for example, it is safe to assume that all FMs have been trained on ImageNet. Therefore, we make the following key assumptions: (1) each teacher is trained on data from different domains while sharing a specific domain; (2) the training data for distillation is fixed; and (3) the training data is unlabeled. The ultimate objective is to obtain a student model that achieves high performance on every domain known by at least one teacher, without having access to all teacher training data or labels. In this context, we find that training data can be decomposed into two categories: External Data (ED), unknown to all teachers, and Internal Data (ID), known to all teachers. Importantly, we discover that ED enables the transfer of knowledge from domains unseen by the student but known by the teacher during distillation, which we refer to as Unseen Knowledge Transfer (UKT). Naturally, we observe that the student inevitably forgets unseen knowledge transferred by previous teachers when learning from future ones. We refer to this phenomenon as Unseen Knowledge Forgetting (UKF). The central problem of CD becomes reaching an optimal UKT-UKF trade-off.

Experimentally, we observe that while the use of ED enables UKT, mainstream distillation strategies fail to address UKF, as they focus solely on maximizing knowledge transfer. We therefore propose Self External Data Distillation (SE2D), a CL-inspired method tailored for the CD problem, which preserves the student's logits on ED to maintain performance on domains unseen by the student. SE2D allows positive UKT while maintaining performance on older domains. Our contributions are as follows:

• We introduce the paradigm of Continual Distillation, motivated by the practical challenges that arise when previous teacher models are no longer accessible.
• We demonstrate that the choice of distillation data is crucial for transferring knowledge to domains unseen by the student, and choose to take advantage of External Data, unknown to the teachers.
• We identify Unseen Knowledge Forgetting, mitigate it by preserving External Data logits, and validate our approach across various benchmarks.

2 Related Work

Continual Learning. Continual Learning (CL) [5, 24] focuses on enabling models to learn from a sequence of tasks while retaining previously acquired knowledge. Traditionally, each CL task is defined by different non-overlapping datasets with unique properties. In Class Incremental Learning (CIL) [10], each task is composed of a unique set of classes. In Domain Incremental Learning (DIL) [3], classes are shared, but input distributions vary. While access to data and changes in data have been widely studied, we are, to the best of our knowledge, the first to consider an alternative scenario where the data is fixed but the accessible models change over time. In this case, we focus on a situation where the teachers’ training domains differ, not unlike a DIL setup in which each teacher would be trained on a specific task of a DIL setting. Since foundation models have become ubiquitous [26], large, and ever-changing, a paradigm shift from ever-evolving data to ever-evolving teachers appears relevant. While Knowledge Distillation is often employed in CL to mitigate forgetting [22, 28, 32], we instead investigate the phenomenon of forgetting within distillation itself, treating it as the central challenge of our study.

Knowledge Distillation. Knowledge Distillation (KD) [15] is a technique that allows a smaller student model to replicate the behavior of a larger, more capable teacher model. Ideally, KD assumes that both the teacher and student are trained on the same dataset, and has been widely used for model compression [29, 39] and transfer learning [38]. When such data is not available, distillation is considered to be data-free [36]. In that context, the objective is usually to reproduce the teacher’s training domain as closely as possible. Thanks to the capability of knowledge transfer from KD, it has become one of the cornerstones for solving the forgetting problem in CL [28, 12]. Nevertheless, the ideal setting (i.e., the students are distilled with the original teachers’ training dataset) of KD rarely holds in CL due to the unavailability of historical data. Consequently, CL methods conduct distillation with new training samples [20], small memory buffer samples [22], generated samples [37], or adversarial samples [14].

While prior work has explored multi-teacher knowledge distillation and the use of distillation in continual learning, our setting differs in several key aspects. First, existing multi-teacher distillation methods typically assume simultaneous access to all teachers, whereas we consider a sequential setting where only one teacher is accessible at a time. Second, unlike continual learning approaches that focus on data streams, our formulation assumes a fixed dataset and instead models a stream of evolving teacher models. Finally, in contrast to standard distillation-based continual learning methods, we do not assume access to past data or replay buffers, nor to previously seen teachers. This combination of constraints defines a distinct and practically relevant scenario that, to the best of our knowledge, has not been explicitly studied before. We therefore frame this problem as Continual Distillation, emphasizing the shift from data-centric to model-centric CL.

3 Continual Distillation

3.1 Generic Definition

We define Continual Distillation (CD) as the process of distilling the knowledge from a sequence of teacher models continuously into one student model, on a fixed dataset. When distilling from one teacher to the student, other teachers are considered unavailable. The distillation process from a given teacher to the student is analogous to a task in standard CL. Formally, given a sequence of teachers $\{\mathcal{T}_{0},\mathcal{T}_{1},\dots,\mathcal{T}_{N}\}$, each trained on a dataset $\mathcal{D}_{t}^{\mathcal{T}}$, the student $\mathcal{S}$ is optimized to minimize a distillation loss $\mathcal{L}_{dist}$ with respect to $\mathcal{T}_{t}$ on a distillation dataset $\mathcal{D}^{\mathcal{S}}$. Importantly, we only consider distillation; no label-dependent loss is used. In this work, we focus on logits-based distillation, as representations are architecture-dependent, computation-intensive, and require access to the entire teacher model. We present the overall procedure in Figure 1. We denote by $\mathbb{D}_{\mathcal{S}}(\mathcal{T}_{t},\mathcal{D}^{\mathcal{S}})$ the operation of distilling from teacher $\mathcal{T}_{t}$, trained on $\mathcal{D}^{\mathcal{T}}_{t}$, to student $\mathcal{S}$ on distillation dataset $\mathcal{D}^{\mathcal{S}}$.
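The sequential protocol defined above can be sketched as a simple loop. This is only an illustration of the control flow: the function name `continual_distillation`, the `distill_step` callback, and the toy "halfway" update are hypothetical and do not come from the paper's code. Teachers are passed as a one-shot generator so that an earlier teacher cannot be revisited once its task is over.

```python
from typing import Callable, Dict, Iterable, Sequence

# Toy types (assumptions for illustration): a student is a table of scalar
# outputs per sample, a teacher is any callable on a sample.
Student = Dict[int, float]
Teacher = Callable[[int], float]


def continual_distillation(
    student: Student,
    teacher_stream: Iterable[Teacher],
    distill_data: Sequence[int],
    distill_step: Callable[[Student, Teacher, int], Student],
    epochs: int = 1,
) -> Student:
    """Distil each teacher in turn; only the current teacher is in scope."""
    for teacher in teacher_stream:        # task t uses teacher T_t only
        for _ in range(epochs):
            for x in distill_data:        # fixed, unlabeled distillation set D^S
                student = distill_step(student, teacher, x)
        # T_t is not stored anywhere: later tasks cannot query it again.
    return student


def halfway_step(student: Student, teacher: Teacher, x: int) -> Student:
    """Hypothetical update: move the student's output halfway to the teacher's."""
    updated = dict(student)
    updated[x] = student[x] + 0.5 * (teacher(x) - student[x])
    return updated


# Two toy teachers consumed from a generator (cannot be replayed).
teachers = (t for t in (lambda x: 1.0, lambda x: 3.0))
final_student = continual_distillation({0: 0.0}, teachers, [0], halfway_step)
```

In this toy run the student output moves from 0.0 to 0.5 under the first teacher and then to 1.75 under the second, which shows how the most recent teacher dominates when nothing preserves earlier knowledge.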

3.2 Specific Problem Scenario

Traditional KD assumes that the teacher training datasets are available for distillation. In CD, not only are such datasets considered unavailable, but dataset domains might differ from one teacher's training to another. In other words, $\mathcal{D}_{t}^{\mathcal{T}}$, $\mathcal{D}_{t^{\prime}}^{\mathcal{T}}$, and $\mathcal{D}^{\mathcal{S}}$ may cover partially or totally different domains. Therefore, various scenarios can be defined in CD depending on the domain overlap between the teachers' training data $\mathcal{D}_{t}^{\mathcal{T}}$ and the distillation data $\mathcal{D}^{\mathcal{S}}$. Realistically, when considering FMs, it is safe to assume that part of the training data is shared. Typically, publicly available datasets such as Wikipedia or ImageNet are commonly used for training FMs. However, additional data, exclusive to a specific model, might also be included during training. A visualisation of such a scenario is presented in Figure 2. Formally, we consider the case of partially exclusive teacher domains such that all teachers share a specific domain:

$$\mathcal{D}_{t}^{\mathcal{T}}\cap\mathcal{D}_{t^{\prime}}^{\mathcal{T}}=\mathcal{D}_{i},\quad \forall\, t\neq t^{\prime}, \tag{1}$$

with $\mathcal{D}_{i}$ the Internal Data (ID). The remaining distillation data is considered unknown to all teachers, hence:

$$\mathcal{D}^{\mathcal{S}}=\mathcal{D}_{e}\cup\mathcal{D}_{i}, \tag{2}$$

where $\mathcal{D}_{e}$ denotes the External Data (ED) such that for any teacher of index $t$, $\mathcal{D}_{e}\cap\mathcal{D}_{t}^{\mathcal{T}}=\emptyset$.
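As a small illustration of Eqs. (1) and (2), the decomposition can be expressed with sets of domain indices. This sketch works at the granularity of whole domains and assumes equally sized domains when computing the ratio $|\mathcal{D}_{e}|/|\mathcal{D}^{\mathcal{S}}|$; both simplifications, as well as the function names, are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def split_distillation_domains(
    teacher_domains: List[Set[int]], distill_domains: Set[int]
) -> Tuple[Set[int], Set[int]]:
    """Split distillation domains into Internal Data (ID) and External Data (ED).

    ID: distillation domains known to every teacher.
    ED: distillation domains known to no teacher.
    Domains known to only some teachers fall in neither set.
    """
    known_by_all = set.intersection(*teacher_domains)
    known_by_any = set.union(*teacher_domains)
    internal = distill_domains & known_by_all
    external = distill_domains - known_by_any
    return internal, external


def external_ratio(distill_domains: Set[int], external: Set[int]) -> float:
    """|D_e| / |D^S|, assuming every domain holds the same number of samples."""
    return len(external) / len(distill_domains)


# Teachers trained on domain pairs {0,1}, {0,2}, {0,3}; distillation on {0,4}.
teachers = [{0, 1}, {0, 2}, {0, 3}]
internal, external = split_distillation_domains(teachers, {0, 4})
ratio = external_ratio({0, 4}, external)
```

With these toy inputs, domain 0 is the only internal domain, domain 4 is external, and half of the distillation data is ED.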

3.3 Unseen Knowledge Transfer and Forgetting

We define ‘unseen’ as domains not present in the student’s training data but present in the teacher’s knowledge. In CD, $\mathcal{D}_{e}$ is unknown to the teachers. Such ED can either be introduced deliberately or appear unintentionally when generating data, a standard procedure in data-free distillation. While this could appear to be a limitation, we observe that leveraging ED allows the student to acquire knowledge about domains that have never been explicitly seen during training. We refer to this phenomenon as Unseen Knowledge Transfer (UKT). Intuitively, when integrating ED, generic knowledge is transferred because of the teacher’s uncertainty. Conversely, when the teacher is confident, only specific knowledge is transferred. In CD, we propose to take advantage of UKT by purposefully integrating ED during distillation to extract additional knowledge from the teacher.

However, in CD, the student sequentially learns from multiple teachers, each providing distinct unseen domain knowledge. While UKT enables the student to acquire information about domains not directly represented in the distillation data, this transferred knowledge is often fragile. As the student learns from subsequent teachers, it tends to lose information previously transferred from earlier ones. We refer to this phenomenon as Unseen Knowledge Forgetting (UKF). UKF differs fundamentally from the catastrophic forgetting traditionally studied in Domain Incremental Learning, as the forgotten knowledge does not originate from the student’s own training data but from the teacher’s knowledge. Since the student is never directly exposed to such knowledge, we call it unseen knowledge. An intuitive illustration of UKT and UKF is given in Figure 3.

3.4 Self External Data Distillation (SE2D)

We introduce Self External Data Distillation (SE2D), a method designed to mitigate UKF in Continual Distillation. In SE2D, the student model learns not only from the current teacher but also from its own checkpoint saved after the previous task. Such a strategy is quite common in Continual Learning [20, 27]; however, we adapt it to the specific problem at hand. Specifically, the distillation from the checkpoint is performed exclusively on external data $\mathcal{D}_{e}$, which is unknown to all teachers. An overview of our proposed approach is given in Figure 4.

Prior observations indicate that performance on domains unseen by the student yet known by past teachers depends heavily on these external samples. We restrict self-distillation to external data $\mathcal{D}_{e}$ to specifically preserve knowledge that is not directly supported by the shared internal domain $\mathcal{D}_{i}$ . Applying self-distillation on $\mathcal{D}_{i}$ would mainly reinforce already stable knowledge, while our goal is to maintain transferred knowledge from unseen domains, which is primarily captured through $\mathcal{D}_{e}$ .

Practically, at each distillation step $t$ , the student $\mathcal{S}_{t}$ learns from both the current teacher $\mathcal{T}_{t}$ and the previous student checkpoint $\mathcal{S}_{t-1}$ :

$$\mathcal{L}_{\text{SE2D}}=\mathcal{L}_{\text{KD}}(\mathcal{S}_{t},\mathcal{T}_{t};\mathcal{D}^{\mathcal{S}})+\mathcal{L}_{\text{KD}}(\mathcal{S}_{t},\mathcal{S}_{t-1};\mathcal{D}_{e}), \tag{3}$$

where $\mathcal{L}_{\text{KD}}$ denotes the temperature-scaled Kullback–Leibler divergence between the softmax distributions of the student logits $z_{\mathcal{S}}(x)$ and teacher logits $z_{\mathcal{T}}(x)$ :

$$\mathcal{L}_{\text{KD}}(\mathcal{S},\mathcal{T};\mathcal{D}^{\mathcal{S}})=T^{2}\,\mathbb{E}_{x\sim\mathcal{D}^{\mathcal{S}}}\left[\text{KL}\left(\sigma\!\left(\tfrac{z_{\mathcal{T}}(x)}{T}\right)\;\Big\|\;\sigma\!\left(\tfrac{z_{\mathcal{S}}(x)}{T}\right)\right)\right], \tag{4}$$

where $T$ is the distillation temperature and $\sigma(\cdot)$ denotes the softmax function. This simple yet effective mechanism enables the student to accumulate knowledge over time while retaining transferable information from previous distillation stages. In Section 5, we present experimental results of SE2D across various benchmarks.
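To make Eqs. (3) and (4) concrete, the following is a minimal dependency-free sketch of the two loss terms. It is not the authors' implementation (a practical version would use a tensor library and backpropagate through the student). The function names, the default temperature of 2.0, and the representation of models as plain callables returning logit lists are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Logits = Sequence[float]
Model = Callable[[object], Logits]  # maps a sample to its logits (assumed interface)


def log_softmax(logits: Logits, temperature: float) -> List[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits: Sequence[Logits], teacher_logits: Sequence[Logits],
            temperature: float = 2.0) -> float:
    """Eq. (4): T^2 times the batch mean of KL(softmax(z_T/T) || softmax(z_S/T))."""
    total = 0.0
    for z_s, z_t in zip(student_logits, teacher_logits):
        log_p_t = log_softmax(z_t, temperature)
        log_p_s = log_softmax(z_s, temperature)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return temperature ** 2 * total / len(student_logits)


def se2d_loss(student: Model, teacher: Model, prev_student: Model,
              distill_batch: Sequence[object], external_batch: Sequence[object],
              temperature: float = 2.0) -> float:
    """Eq. (3): teacher term on D^S plus self-distillation term on D_e only."""
    teacher_term = kd_loss([student(x) for x in distill_batch],
                           [teacher(x) for x in distill_batch], temperature)
    self_term = kd_loss([student(x) for x in external_batch],
                        [prev_student(x) for x in external_batch], temperature)
    return teacher_term + self_term
```

Note that the second term only ever sees samples from $\mathcal{D}_{e}$, so the frozen checkpoint $\mathcal{S}_{t-1}$ constrains the student exactly where the unseen knowledge is assumed to be stored, while the teacher term covers the whole distillation set.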

4 Experimental Setup

In the following, we describe the experimental setup for Continual Distillation. More details regarding the implementation are given in the appendix.

4.1 Domain Selection

Teacher Domains.

To reproduce a Continual Distillation context, we work on datasets containing data of the same classes but coming from different domains. For example, in Figure 1, each domain contains the class “cat”; however, the color of the cat is different depending on the considered domain. Another example would be the condition in which the picture was taken (inside or outside). Such a domain shift is typically considered in DIL [33, 16].

External Data Selection.

In CD, the choice of ED is crucial. In realistic scenarios, ED could be introduced accidentally, as knowing an FM's exact training domain is unlikely. In this work, we deliberately exploit such external data, specifically selected to fall outside all teacher training domains, to study its influence on knowledge transfer and forgetting. We consider two scenarios: (1) related external domains, where the data share the same semantic classes as the teacher domains (e.g., if teachers are trained on sea animals such as dolphins, sharks, and whales, the external data may include jellyfish); and (2) unrelated external domains, where the data are semantically distinct (e.g., using images of trucks or digits instead of marine animals). In both cases, teachers have minimal performance on ED. Additional discussion is included in the appendix.

4.2 Teacher Selection

For the experimental setup, we hypothesize that most contemporary FMs share a substantial amount of common knowledge but differ primarily in their performance on specific tasks. Based on this assumption, we train a sequence of teacher models that all share a common domain while each possesses a unique domain not shared by any other teacher. Consequently, all teachers in the sequence provide a shared knowledge base together with a teacher-specific domain that the student must learn and retain through the use of external data. An example is given in Table 1, where the teachers are trained on the domain pairs $(0,1)$, $(0,2)$, and so on.

4.3 Datasets

To simulate CD, we build upon Domain Incremental Learning datasets. We consider each domain separately and pre-train teachers on subsets of these domains.

CIFAR20 [30] is a variation of CIFAR-100 that uses the 20 superclasses instead of the 100 fine-grained classes. Since each CIFAR-100 superclass is composed of 5 subclasses, using the superclasses allows us to define 5 different domains, where each domain contains images from a different subclass of every superclass. CIFAR20 contains 10,000 training images and 2,000 test images per domain. All such domains are considered related. Additionally, we experiment with CUB [34] and MNIST as unrelated datasets.

Digits is the combination of several digit datasets, each representing a different domain. Namely, we mix MNIST [19], MNIST-M [13], USPS [17], and SVHN [23]. As a related domain, we consider KMNIST [6], a dataset composed of Japanese Hiragana. We consider this dataset to be of similar difficulty to MNIST, even though its classes differ.

DomainNet is an adapted version of the DomainNet dataset [25], containing six visual domains: Real, Clipart, Painting, Infograph, Sketch, and Quickdraw. Each domain shares the same set of 345 classes but differs significantly in style and texture statistics. This dataset contains around 600,000 images, and the domains are unbalanced, which increases the difficulty. All domains are considered related.
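The CIFAR20 construction described above can be sketched as follows. The rule used here, taking the $d$-th listed subclass of every superclass to form domain $d$, is an assumption: the text only states that each domain holds a different subclass of each superclass. Only two of the 20 CIFAR-100 superclasses are shown, and the function name is hypothetical.

```python
from typing import Dict, List


def build_domains(superclass_to_subclasses: Dict[str, List[str]],
                  num_domains: int = 5) -> List[Dict[str, str]]:
    """Return one {subclass: superclass label} mapping per domain.

    Assumed rule: domain d takes the d-th listed subclass of every superclass,
    so all domains share the same superclass label set.
    """
    return [
        {subs[d]: sup for sup, subs in superclass_to_subclasses.items()}
        for d in range(num_domains)
    ]


# Two CIFAR-100 superclasses, used as a toy excerpt of the full 20.
toy_hierarchy = {
    "aquatic mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "fish": ["aquarium fish", "flatfish", "ray", "shark", "trout"],
}
domains = build_domains(toy_hierarchy)

# Size check against the text: CIFAR-100 has 500 training and 100 test
# images per fine class, and a full domain has one subclass per superclass.
train_per_domain = 20 * 500
test_per_domain = 20 * 100
```

The size check reproduces the 10,000 training and 2,000 test images per domain quoted in the text.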

4.4 Considered Methods

We selected mainstream and recent state-of-the-art baselines for logits-based distillation. We considered the following methods.

KL-divergence. This method consists of a standard distillation loss and serves as a baseline [18].

Logits Standardization (LS) [31] is a recently proposed method that improves upon logits distillation by standardizing student and teacher logits.

Medium Difficulty Samples (MDS) [4] is a data-pruning strategy that distills on samples of medium difficulty only. The original method was proposed in a supervised scenario where the teacher’s cross-entropy was used as the criterion for assessing sample difficulty. In our setup, we adapted this method following the same principle, using teacher entropy as a sample difficulty estimator.

Decoupled Knowledge Distillation (DKD) [39] is a standard distillation method that decomposes the conventional distillation objective into two components: a target class term and a non-target class term. While most efficient in supervised scenarios, DKD can easily be used in unsupervised scenarios by considering the teacher’s maximum prediction as the target.

Self-Distillation. Distillation is regularly used in CL [22, 12, 28]. A common strategy is to save a checkpoint of the current model at the end of the task and use this checkpoint for distillation when training on the subsequent task. This method uses both $\mathcal{D}_{i}$ and $\mathcal{D}_{e}$ for distillation.
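The label-free adaptations of MDS and DKD mentioned above both rely on quantities computed from the teacher's logits alone. The sketch below shows one way to obtain them. The selection rule (dropping a fixed fraction of the lowest-entropy and highest-entropy samples) and the `drop_frac` parameter are hypothetical, since the text does not specify how "medium difficulty" is thresholded.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy of the teacher's prediction, used as a difficulty proxy."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def pseudo_target(logits: Sequence[float]) -> int:
    """Index of the teacher's most confident class, used in place of a label."""
    return max(range(len(logits)), key=lambda k: logits[k])


def medium_difficulty_indices(entropies: Sequence[float],
                              drop_frac: float = 0.2) -> List[int]:
    """Keep samples whose entropy rank is neither in the lowest nor the
    highest `drop_frac` of the batch (assumed selection rule)."""
    order = sorted(range(len(entropies)), key=lambda k: entropies[k])
    cut = int(len(order) * drop_frac)
    return sorted(order[cut:len(order) - cut])


kept = medium_difficulty_indices([0.6, 0.0, 0.3, 0.69, 0.1])
```

In the toy call, the most confident sample (entropy 0.0) and the least confident one (entropy 0.69) are dropped, and the three remaining indices are kept for distillation.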

5 Experimental Results

Table 1: Accuracy (%) of the student model on test sets, domain-wise, on CIFAR20, for various domain overlaps, after distilling for 1 epoch. The larger the ratio $|\mathcal{D}_{e}|/|\mathcal{D}^{\mathcal{S}}|$, the better the student performs on domains unseen during training (underlined values).

Domain	0	1	2	3	$|\mathcal{D}_{e}|/|\mathcal{D}^{\mathcal{S}}|$
After $\mathbb{D}(\mathcal{T}_{01},\mathcal{D}^{\mathcal{S}}_{0})$	93.25	37.80	45.30	36.90	0%
After $\mathbb{D}(\mathcal{T}_{02},\mathcal{D}^{\mathcal{S}}_{0})$	93.95	31.60	42.20	35.25	0%
After $\mathbb{D}(\mathcal{T}_{03},\mathcal{D}^{\mathcal{S}}_{0})$	92.10	32.00	39.05	33.00	0%
After $\mathbb{D}(\mathcal{T}_{012},\mathcal{D}^{\mathcal{S}}_{014})$	92.70	93.15	68.35	53.80	33%
After $\mathbb{D}(\mathcal{T}_{013},\mathcal{D}^{\mathcal{S}}_{014})$	92.45	92.95	48.95	72.55	33%
After $\mathbb{D}(\mathcal{T}_{01},\mathcal{D}^{\mathcal{S}}_{04})$	96.35	77.15	48.95	48.45	50%
After $\mathbb{D}(\mathcal{T}_{02},\mathcal{D}^{\mathcal{S}}_{04})$	96.35	43.30	80.10	46.55	50%
After $\mathbb{D}(\mathcal{T}_{03},\mathcal{D}^{\mathcal{S}}_{04})$	95.70	49.40	57.60	77.80	50%
After $\mathbb{D}(\mathcal{T}_{01},\mathcal{D}^{\mathcal{S}}_{034})$	94.60	85.20	51.85	57.15	66%
After $\mathbb{D}(\mathcal{T}_{02},\mathcal{D}^{\mathcal{S}}_{034})$	94.60	44.00	83.55	51.55	66%

5.1 External Data Impact on UKT and UKF

External Data Improves Knowledge Transfer.

A first observation in the CD setting is the necessity of training with ED to fully distill knowledge from the teacher. To showcase this effect, we train a sequence of teachers on CIFAR20 on domain pairs $\{0,1\},\{0,2\},\{0,3\}$, while distilling to the student model on ID only (domain $0$), using the KL-divergence. The results are displayed in the first rows of Table 1, where the accuracy of the student after each task is reported. Note that the considered teachers initially achieve above $95\%$ accuracy on their respective domains. Distilling only on domain $0$ yields competitive performance on this domain; however, performance on other domains remains extremely limited at every training step. The student performs well only on domains that have been encountered during training. In the remaining rows of Table 1, we maintain the same teacher sequence but include ED for distillation. Such data come from domain $4$ of CIFAR20, which is unknown to the teachers. In this case, we observe that the student maintains performance on domain $0$ while achieving much stronger performance on other domains, despite these domains never being encountered during training. UKT is observed only when distilling with external data.

External Data Accentuates UKF.

While ED typically facilitates knowledge transfer, it can bias the model toward the most recent teacher at the expense of earlier knowledge. Tables 2 and 3 show that ED does not guarantee uniform gains and may even accentuate forgetting. For instance, on Digits, DKD and LS suffer significant drops on SVHN and MNIST-M; notably, DKD’s MNIST-M performance falls from $54.50\%$ to $33.84\%$ . However, this effect is not universal, as DKD shows slight gains on DomainNet’s Infograph (Table 4). This suggests that while ED can trigger substantial UKF, the impact depends heavily on the distillation method and domain distribution.

Table 2: Performance (%, higher is better) of the student at the end of training on CIFAR20 for 4 scenarios: Internal Data Only (D0), Related External Data (D4), CUB as ED, and MNIST as ED. The number of runs is set to 3. The Gain column shows the gain over using Internal Data Only. Grey: Internal Data; Blue: Domain known by the teacher; Red: External Data (ED); White: Ignored.

CIFAR20 - Internal Data Only
Method	\columncolorInDomColD0	\columncolorOtherDomD1	\columncolorOtherDomD2	\columncolorOtherDomD3	\columncolorDomainColD4 ✗	\columncolorAvgColAvg. (0-3)	\columncolorDomainColGain ( $\uparrow$ )
\columncolorDomainCol $\mathcal{T}_{best}$ (upper bound)	\columncolorInDomCol97.75	\columncolorOtherDom95.80	\columncolorOtherDom96.70	\columncolorOtherDom95.75	\columncolorDomainCol-	\columncolorAvgCol96.5	\columncolorDomainCol0.00
\columncolorDomainColKL-divergence	\columncolorInDomCol98.10 $\pm$ 0.05	\columncolorOtherDom41.40 $\pm$ 0.84	\columncolorOtherDom53.38 $\pm$ 0.35	\columncolorOtherDom54.87 $\pm$ 3.70	\columncolorDomainCol-	\columncolorAvgCol61.94 $\pm$ 1.24	\columncolorDomainCol0.00
\columncolorDomainColDKD [CVPR’22]	\columncolorInDomCol95.70 $\pm$ 1.55	\columncolorOtherDom35.53 $\pm$ 1.94	\columncolorOtherDom45.42 $\pm$ 2.05	\columncolorOtherDom41.02 $\pm$ 2.55	\columncolorDomainCol-	\columncolorAvgCol54.42 $\pm$ 2.02	\columncolorDomainCol0.00
\columncolorDomainColLS [CVPR’24]	\columncolorInDomCol96.50 $\pm$ 1.52	\columncolorOtherDom39.13 $\pm$ 2.90	\columncolorOtherDom49.77 $\pm$ 4.40	\columncolorOtherDom49.58 $\pm$ 5.78	\columncolorDomainCol-	\columncolorAvgCol58.75 $\pm$ 3.65	\columncolorDomainCol0.00
\columncolorDomainColMDS [ICLR’25]	\columncolorInDomCol94.20 $\pm$ 0.73	\columncolorOtherDom34.30 $\pm$ 2.80	\columncolorOtherDom44.70 $\pm$ 1.40	\columncolorOtherDom41.00 $\pm$ 1.90	\columncolorDomainCol-	\columncolorAvgCol53.55 $\pm$ 1.72	\columncolorDomainCol0.00
\columncolorDomainColSelf-Distillation	\columncolorInDomCol97.45 $\pm$ 0.43	\columncolorOtherDom37.58 $\pm$ 2.06	\columncolorOtherDom50.42 $\pm$ 1.41	\columncolorOtherDom45.83 $\pm$ 2.15	\columncolorDomainCol-	\columncolorAvgCol57.82 $\pm$ 1.51	\columncolorDomainCol0.00

CIFAR20 + Related External Data
Method	\columncolorInDomColD0	\columncolorOtherDomD1	\columncolorOtherDomD2	\columncolorOtherDomD3	\columncolorOutDomColD4	\columncolorAvgColAvg. (0-3)	\columncolorDomainColGain ( $\uparrow$ )
\columncolorDomainCol $\mathcal{T}_{best}$ (upper bound)	\columncolorInDomCol97.75	\columncolorOtherDom95.80	\columncolorOtherDom96.70	\columncolorOtherDom95.75	\columncolorOutDomCol-	\columncolorAvgCol96.5	\columncolorDomainCol0.00
\columncolorDomainColKL-divergence	\columncolorInDomCol97.05 $\pm$ 0.09	\columncolorOtherDom48.55 $\pm$ 1.15	\columncolorOtherDom55.08 $\pm$ 0.70	\columncolorOtherDom84.77 $\pm$ 0.87	\columncolorOutDomCol-	\columncolorAvgCol71.36 $\pm$ 0.70	\columncolorDomainCol9.42
\columncolorDomainColDKD [CVPR’22]	\columncolorInDomCol96.05 $\pm$ 0.50	\columncolorOtherDom44.13 $\pm$ 0.98	\columncolorOtherDom51.67 $\pm$ 0.92	\columncolorOtherDom68.55 $\pm$ 2.60	\columncolorOutDomCol-	\columncolorAvgCol65.10 $\pm$ 1.25	\columncolorDomainCol10.68
\columncolorDomainColLS [CVPR’24]	\columncolorInDomCol96.85 $\pm$ 0.15	\columncolorOtherDom47.25 $\pm$ 0.69	\columncolorOtherDom54.25 $\pm$ 0.46	\columncolorOtherDom83.20 $\pm$ 1.87	\columncolorOutDomCol-	\columncolorAvgCol70.39 $\pm$ 0.79	\columncolorDomainCol11.64
\columncolorDomainColMDS [ICLR’25]	\columncolorInDomCol96.55 $\pm$ 0.07	\columncolorOtherDom45.26 $\pm$ 2.10	\columncolorOtherDom54.90 $\pm$ 0.71	\columncolorOtherDom73.51 $\pm$ 0.56	\columncolorOutDomCol-	\columncolorAvgCol67.56 $\pm$ 0.86	\columncolorDomainCol14.01
\columncolorDomainColSelf-Distillation	\columncolorInDomCol97.71 $\pm$ 0.18	\columncolorOtherDom61.23 $\pm$ 0.83	\columncolorOtherDom64.21 $\pm$ 0.51	\columncolorOtherDom76.58 $\pm$ 0.94	\columncolorOutDomCol-	\columncolorAvgCol74.93 $\pm$ 0.61	\columncolorDomainCol17.11
\columncolorDomainColSE2D (ours)	\columncolorInDomCol97.46 $\pm$ 0.19	\columncolorOtherDom70.71 $\pm$ 1.05	\columncolorOtherDom62.85 $\pm$ 0.50	\columncolorOtherDom73.65 $\pm$ 1.67	\columncolorOutDomCol-	\columncolorAvgCol76.17 $\pm$ 0.85	\columncolorDomainColn/a

CIFAR20 + CUB
Method	\columncolorInDomColD0	\columncolorOtherDomD1	\columncolorOtherDomD2	\columncolorOtherDomD3	\columncolorOutDomColCUB	\columncolorAvgColAvg. (0-3)	\columncolorDomainColGain ( $\uparrow$ )
\columncolorDomainCol $\mathcal{T}_{best}$ (upper bound)	\columncolorInDomCol97.75	\columncolorOtherDom95.80	\columncolorOtherDom96.70	\columncolorOtherDom95.75	\columncolorOutDomCol-	\columncolorAvgCol96.5	\columncolorDomainCol0.00
\columncolorDomainColKL-divergence	\columncolorInDomCol97.24 $\pm$ 0.37	\columncolorOtherDom43.89 $\pm$ 1.02	\columncolorOtherDom55.13 $\pm$ 0.76	\columncolorOtherDom71.80 $\pm$ 1.73	\columncolorOutDomCol-	\columncolorAvgCol67.02 $\pm$ 0.97	\columncolorDomainCol5.08
\columncolorDomainColDKD [CVPR’22]	\columncolorInDomCol93.39 $\pm$ 0.88	\columncolorOtherDom33.46 $\pm$ 2.51	\columncolorOtherDom43.10 $\pm$ 2.93	\columncolorOtherDom40.70 $\pm$ 2.20	\columncolorOutDomCol-	\columncolorAvgCol52.66 $\pm$ 2.13	\columncolorDomainCol-1.76
\columncolorDomainColLS [CVPR’24]	\columncolorInDomCol94.79 $\pm$ 3.17	\columncolorOtherDom38.53 $\pm$ 4.87	\columncolorOtherDom50.21 $\pm$ 5.74	\columncolorOtherDom59.32 $\pm$ 11.60	\columncolorOutDomCol-	\columncolorAvgCol60.71 $\pm$ 6.34	\columncolorDomainCol1.96
\columncolorDomainColMDS [ICLR’25]	\columncolorInDomCol97.15 $\pm$ 0.50	\columncolorOtherDom41.12 $\pm$ 0.96	\columncolorOtherDom52.38 $\pm$ 1.76	\columncolorOtherDom60.38 $\pm$ 2.48	\columncolorOutDomCol-	\columncolorAvgCol62.76 $\pm$ 0.89	\columncolorDomainCol9.21
\columncolorDomainColSelf-Distillation	\columncolorInDomCol97.47 $\pm$ 0.12	\columncolorOtherDom47.97 $\pm$ 1.07	\columncolorOtherDom58.40 $\pm$ 1.34	\columncolorOtherDom61.97 $\pm$ 1.82	\columncolorOutDomCol-	\columncolorAvgCol66.45 $\pm$ 1.09	\columncolorDomainCol8.63
\columncolorDomainColSE2D (ours)	\columncolorInDomCol97.74 $\pm$ 0.10	\columncolorOtherDom53.93 $\pm$ 0.43	\columncolorOtherDom58.02 $\pm$ 0.45	\columncolorOtherDom64.54 $\pm$ 1.81	\columncolorOutDomCol-	\columncolorAvgCol68.56 $\pm$ 0.70	\columncolorDomainColn/a
CIFAR20 + MNIST
Method	D0 (ID)	D1	D2	D3	MNIST	Avg. (0-3)	Gain ($\uparrow$)
$\mathcal{T}_{best}$ (upper bound)	97.75	95.80	96.70	95.75	-	96.5	0.00
KL-divergence	94.45 $\pm$ 4.20	38.24 $\pm$ 6.06	48.44 $\pm$ 8.56	57.97 $\pm$ 15.71	-	59.78 $\pm$ 8.63	-2.16
DKD [CVPR’22]	91.00 $\pm$ 3.35	30.71 $\pm$ 3.60	41.36 $\pm$ 3.61	37.96 $\pm$ 4.56	-	50.26 $\pm$ 3.78	-4.16
LS [CVPR’24]	92.04 $\pm$ 4.43	35.93 $\pm$ 6.37	45.79 $\pm$ 7.50	52.42 $\pm$ 13.24	-	56.54 $\pm$ 7.88	-2.21
MDS [ICLR’25]	86.83 $\pm$ 1.04	28.53 $\pm$ 2.20	37.92 $\pm$ 2.41	37.38 $\pm$ 3.24	-	47.17 $\pm$ 1.57	-6.38
Self-Distillation	91.90 $\pm$ 4.94	36.35 $\pm$ 9.46	45.23 $\pm$ 9.18	43.58 $\pm$ 12.43	-	54.26 $\pm$ 9.00	-3.56
SE2D (ours)	92.64 $\pm$ 4.66	39.94 $\pm$ 9.73	47.55 $\pm$ 9.55	48.88 $\pm$ 13.23	-	57.25 $\pm$ 9.29	n/a

Table 3: Performances (%, higher is better) of the student at the end of training on Digits for 2 scenarios: Internal Data Only (MNIST) and Related External Data (KMNIST). The number of runs is set to 3. Averages and standard deviations are reported.

Digits - Internal Data Only
Method	MNIST	SVHN	MNIST-M	USPS	KMNIST ✗	Avg.	Gain ($\uparrow$)
$\mathcal{T}_{best}$ (upper bound)	99.2	97.84	99.25	98.8	-	98.77	0.00
KL-divergence	99.17 $\pm$ 0.04	35.80 $\pm$ 1.49	62.80 $\pm$ 2.69	95.81 $\pm$ 0.46	-	73.40 $\pm$ 1.17	0.00
DKD [CVPR’22]	98.70 $\pm$ 0.11	33.35 $\pm$ 1.28	54.50 $\pm$ 2.45	95.07 $\pm$ 0.43	-	70.40 $\pm$ 1.07	0.00
LS [CVPR’24]	99.13 $\pm$ 0.09	37.60 $\pm$ 1.40	64.10 $\pm$ 1.73	96.00 $\pm$ 0.25	-	74.21 $\pm$ 0.87	0.00
MDS [ICLR’25]	98.15 $\pm$ 1.13	34.51 $\pm$ 4.80	54.97 $\pm$ 9.49	93.21 $\pm$ 3.01	-	70.21 $\pm$ 3.80	0.00
Self-Distillation	99.23 $\pm$ 0.10	35.08 $\pm$ 1.03	65.87 $\pm$ 1.38	95.28 $\pm$ 0.53	-	73.87 $\pm$ 0.76	0.00

Digits + Related External Data
Method	MNIST	SVHN	MNIST-M	USPS	KMNIST	Avg.	Gain ($\uparrow$)
KL-divergence	99.13 $\pm$ 0.05	31.53 $\pm$ 1.55	59.84 $\pm$ 2.57	96.51 $\pm$ 0.10	-	71.75 $\pm$ 1.07	-1.65
DKD [CVPR’22]	98.35 $\pm$ 0.32	25.21 $\pm$ 1.62	33.84 $\pm$ 4.28	92.87 $\pm$ 1.63	-	62.57 $\pm$ 1.96	-7.83
LS [CVPR’24]	99.13 $\pm$ 0.08	32.28 $\pm$ 0.37	61.47 $\pm$ 2.15	96.33 $\pm$ 0.33	-	72.30 $\pm$ 0.73	-1.91
MDS [ICLR’25]	99.12 $\pm$ 0.04	33.03 $\pm$ 1.79	60.74 $\pm$ 0.97	96.13 $\pm$ 0.05	-	62.50 $\pm$ 0.90	-7.71
Self-Distillation	99.38 $\pm$ 0.01	55.86 $\pm$ 1.60	90.76 $\pm$ 0.35	96.33 $\pm$ 0.15	-	85.58 $\pm$ 0.53	11.71
SE2D (ours)	99.33 $\pm$ 0.04	61.84 $\pm$ 2.05	90.44 $\pm$ 0.18	96.33 $\pm$ 0.10	-	87.00 $\pm$ 0.60	n/a

Table 4: Performances (%, higher is better) of the student at the end of training on DomainNet for 2 scenarios: Internal Data Only (Clipart) and Related External Data (Sketch). The number of runs is set to 3. Averages and standard deviations are reported.

DomainNet - Internal Data Only
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch ✗	Avg.	Gain ($\uparrow$)
$\mathcal{T}_{best}$ (upper bound)	74.35	35.83	66.66	66.19	78.90	-	64.39	0.00
KL-divergence	77.08 $\pm$ 0.45	18.16 $\pm$ 0.24	44.17 $\pm$ 0.50	17.19 $\pm$ 0.73	67.29 $\pm$ 0.24	-	44.78 $\pm$ 0.29	0.00
DKD [CVPR’22]	77.32 $\pm$ 0.22	18.04 $\pm$ 0.09	42.76 $\pm$ 0.40	17.43 $\pm$ 0.31	63.99 $\pm$ 0.14	-	43.91 $\pm$ 0.17	0.00
LS [CVPR’24]	77.21 $\pm$ 0.37	17.21 $\pm$ 0.91	40.95 $\pm$ 0.14	16.42 $\pm$ 1.01	60.39 $\pm$ 0.20	-	42.43 $\pm$ 0.44	0.00
MDS [ICLR’25]	76.92 $\pm$ 0.42	18.08 $\pm$ 0.23	42.14 $\pm$ 0.38	17.44 $\pm$ 1.03	63.29 $\pm$ 0.20	-	43.57 $\pm$ 0.34	0.00
Self-Distillation	80.57 $\pm$ 0.08	20.85 $\pm$ 0.33	46.23 $\pm$ 0.11	23.53 $\pm$ 0.25	65.38 $\pm$ 0.04	-	47.31 $\pm$ 0.11	0.00

DomainNet + Related External Data
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch	Avg.	Gain ($\uparrow$)
KL-divergence	76.00 $\pm$ 0.18	18.89 $\pm$ 0.09	44.77 $\pm$ 0.79	15.53 $\pm$ 0.15	70.65 $\pm$ 0.46	-	45.17 $\pm$ 0.31	0.39
DKD [CVPR’22]	76.53 $\pm$ 0.14	18.70 $\pm$ 0.44	44.24 $\pm$ 0.32	16.24 $\pm$ 0.40	68.29 $\pm$ 0.34	-	44.80 $\pm$ 0.12	0.89
LS [CVPR’24]	75.28 $\pm$ 0.39	16.88 $\pm$ 0.71	40.78 $\pm$ 0.49	15.47 $\pm$ 0.43	62.76 $\pm$ 0.64	-	42.24 $\pm$ 0.31	-0.20
MDS [ICLR’25]	75.32 $\pm$ 0.39	17.53 $\pm$ 0.07	42.28 $\pm$ 0.11	16.12 $\pm$ 0.61	67.29 $\pm$ 0.11	-	43.71 $\pm$ 0.19	0.14
Self-Distillation	80.10 $\pm$ 0.18	21.53 $\pm$ 0.09	48.15 $\pm$ 0.23	25.28 $\pm$ 0.19	68.75 $\pm$ 0.01	-	48.76 $\pm$ 0.07	1.45
SE2D (ours)	78.05 $\pm$ 0.11	21.98 $\pm$ 0.29	47.76 $\pm$ 0.44	23.81 $\pm$ 0.34	68.43 $\pm$ 0.14	-	48.01 $\pm$ 0.20	n/a

5.2 UKF Mitigation

Traditional Methods and UKF.

Tables 2, 3, and 4 report the results of all methods in CD setups with related ED. First, as expected, accuracy tends to be lowest on the earliest domains for all methods. This is a direct demonstration of UKF: the student forgets knowledge that was transferred on unseen domains. In all scenarios, every traditional distillation method suffers from UKF. Second, Self-Distillation can partially mitigate UKF and surpass the traditional distillation approaches, owing to its regularization effect. However, a clear trade-off emerges between UKT and UKF, not unlike the stability-plasticity trade-off in Continual Learning, where a model has to balance learning and remembering. Therefore, while Self-Distillation surpasses most baselines on earlier domains, it struggles to achieve high performance on more recent domains.

Performances of SE2D.

SE2D allows for better knowledge retention than most baselines, especially on older tasks, where it can surpass baselines by more than $9\%$ on domain $1$ of CIFAR20, as reported in Table 2. However, such results hold only for sufficiently related external data, with which the student can adequately learn from the teacher from one task to another. In Table 2, when MNIST is used as ED, the advantages of SE2D become largely limited. This behavior is even more pronounced on DomainNet, where SE2D falls behind Self-Distillation. We identify the main reason to be the low teacher quality, which gives poor supervision on ED for SE2D. Additionally, the results in Table 4 show that even when using domains from DomainNet (supposedly related), the performance gain is inconsistent across methods. We believe that the vast discrepancy between domains in this dataset hinders UKT, highlighting the importance of the origin of ED for efficient CD. This is discussed in more detail in the appendix.

5.3 Discussion

External Data Ratio Matters.

Another phenomenon that can be observed in Table 1 is that increasing $\frac{|\mathcal{D}_{e}|}{|\mathcal{D}^{\mathcal{S}}|}$, the proportion of ED relative to ID, enhances performance on domains unseen by the student. A clear trend therefore emerges: the more data unknown to the teacher is used, the stronger the observed UKT.

External Data Origin Matters.

Naturally, the origin of the external data has an impact on the intensity of UKT. Intuitively, a large domain difference between external data and teacher domains might make distillation more challenging. To showcase the impact of the domain gap, we experimented with related and unrelated external domains on CIFAR20. Table 2 presents the results for various ED scenarios. Notably, leveraging ED consistently improves the average accuracy over all domains when the ED is sufficiently related to the ID. For example, when using D4 and CUB as ED, the performance of KL-divergence distillation increases from $61.94\%$ to $71.36\%$ and $67.02\%$, respectively. However, when using MNIST as ED, the performance drops slightly to $59.78\%$. Such a trend is observed for all considered methods, as can be seen in Figure 5. Semantically, CUB consists of images of birds and is more similar to CIFAR20 than MNIST is, even though the number of classes does not align. Overall, leveraging ED is essential to promote UKT, but its benefit shrinks as the domain shift between ED and ID increases (e.g., with CUB). When the domain gap becomes too large (as with MNIST), using ED can even result in lower performance than using no ED at all.
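The averages quoted in this paragraph can be turned into a gain relative to the Internal Data Only baseline. The sketch below assumes that the gain is simply the difference between the two averages, which is consistent with the Gain column of Table 2 for KL-divergence (5.08 with CUB, -2.16 with MNIST). The helper name `gain` is hypothetical.

```python
# Average accuracy (%) of KL-divergence distillation on CIFAR20, as quoted in the text.
BASELINE_ID_ONLY = 61.94
AVG_WITH_ED = {"D4": 71.36, "CUB": 67.02, "MNIST": 59.78}


def gain(avg_with_ed: float, avg_id_only: float) -> float:
    """Assumed definition: average with ED minus average with ID only."""
    return round(avg_with_ed - avg_id_only, 2)


gains = {name: gain(avg, BASELINE_ID_ONLY) for name, avg in AVG_WITH_ED.items()}

for name, value in gains.items():
    print(f"{name}: {value:+.2f}")
```

A positive value means that adding the external data helped on average, and a negative value means that it hurt.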

Limitations of SE2D.

While SE2D reduces UKF, its impact relies on (1) the domain gap between teacher and external data and (2) teacher performance on domains unseen by the student. SE2D also requires data-origin knowledge; the student must distinguish between the teacher’s known and unknown domains. This is particularly complex when data are generated to imitate training sets, making it non-trivial to identify data outside the teacher’s domain.

6 Conclusions and Future Work

In this work, we introduced a new paradigm titled Continual Distillation, where a single model learns on a fixed dataset from a sequence of teachers. This setup is relevant in the context of ever-evolving Foundation Models, which are costly to train, expensive to store, demanding to run inference on, and in many cases only accessible via restricted APIs. In such a context, we observe that the domain of origin of the distillation data is crucial for controlling which knowledge is actually transferred to the student. Notably, we unveil two phenomena: Unseen Knowledge Transfer (UKT) and Unseen Knowledge Forgetting (UKF), which reflect the ability of the student to modify its knowledge on domains that it has never encountered. Such knowledge control depends only on the teacher and the data used for distillation. The objective then becomes reaching the best UKT-UKF trade-off. To that end, we proposed Self External Data Distillation (SE2D), which reduces UKF and maintains strong average performance on all domains, including domains unseen during training. However, we find that the domain gap between external data and the teacher domain must be carefully considered in order to foster UKT. Similarly, SE2D requires well-performing teachers to be effective.

Finally, UKT brings both opportunities and risks, as uncontrolled or undesired knowledge could be involuntarily embedded in a student model through distillation, depending on the data considered. Such a vulnerability could be easily exploited and introduce unknown bias into model training. This aspect of UKT should be explored in future work. Other potential directions include working with larger models, such as language or multimodal models.

Acknowledgments

This work was partially financially supported by JST ASPIRE Program, Japan, Grant Number JPMJAP2303. This work was partially supported by the JSPS Postdoctoral Fellowship for Research in Japan (Fellowship ID: P24752).

References

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §1.
[2] M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M. Yang, and F. S. Khan (2025) Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 47 (4), pp. 2245–2264. Cited by: §1.
[3] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara (2020) Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems, Vol. 33, pp. 15920–15930. Cited by: §2.
[4] Y. Chen, X. Xu, F. de Hoog, J. Liu, and S. Wang (2025) Medium-difficulty samples constitute smoothed decision boundary for knowledge distillation on pruned datasets. In The Thirteenth International Conference on Learning Representations, Cited by: §4.4.
[5] Z. Chen and B. Liu (2018) Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12 (3), pp. 1–207. Cited by: §2.
[6] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha (2018) Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718. Cited by: §4.3.
[7] A. De, E. Wang, R. Varma, A. Sridhar, and K. Khandelwal (2025) Scaling multimodal foundation models in torchmultimodal with pytorch distributed. Note: https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed Cited by: §1.
[8] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (7), pp. 3366–3385. Cited by: §1.
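[9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1.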
[10] N. Dong, Y. Zhang, M. Ding, and Y. Bai (2023) Class-incremental object detection. Pattern Recognition 139, pp. 109488. Cited by: §2.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1.
[12] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle (2020) PODNet: pooled outputs distillation for small-tasks incremental learning. In 16th European Conference on Computer Vision (ECCV), pp. 86–102. Cited by: §2, §4.4.
[13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. In International Conference on Machine Learning, Cited by: §4.3.
[14] D. Goswami, A. Soutif-Cormerais, Y. Liu, S. Kamath, B. Twardowski, J. Van De Weijer, et al. (2024) Resurrecting old classes with new data for exemplar-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28525–28534. Cited by: §2.
[15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
[16] Y. Hsu, Y. Liu, A. Ramasamy, and Z. Kira (2018) Re-evaluating continual learning scenarios: a categorization and case for strong baselines. arXiv preprint arXiv:1810.12488. Cited by: §4.1.
[17] J. J. Hull (2002) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5), pp. 550–554. Cited by: §4.3.
[18] S. Kullback (1997) Information theory and statistics. Courier Corporation. Cited by: §4.4.
[19] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.3.
[20] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: §2, §3.4.
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
[22] N. Michel, M. Wang, L. Xiao, and T. Yamasaki (2024) Rethinking momentum knowledge distillation in online continual learning. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 35607–35622. Cited by: §2, §2, §4.4.
[23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng, et al. (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, pp. 7. Cited by: §4.3.
[24] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: §1, §2.
[25] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406–1415. Cited by: §4.3.
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. Cited by: §1, §2.
[27] A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1320–1328. Cited by: §3.4.
[28] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2, §2, §4.4.
[29] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §2.
[30] S. Stojanov, S. Mishra, N. A. Thai, N. Dhanda, A. Humayun, C. Yu, L. B. Smith, and J. M. Rehg (2019) Incremental object learning from contiguous views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8777–8786. Cited by: §4.3.
[31] S. Sun, W. Ren, J. Li, R. Wang, and X. Cao (2024) Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15731–15740. Cited by: §4.4.
[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Cited by: §2.
[33] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias (2022) Three types of incremental learning. Nature Machine Intelligence 4 (12), pp. 1185–1197. Cited by: §4.1.
[34] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §4.3.
[35] L. Wang, X. Zhang, H. Su, and J. Zhu (2024) A comprehensive survey of continual learning: theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
[36] Y. Wang, D. Yang, Z. Chen, Y. Liu, S. Liu, W. Zhang, L. Zhang, and L. Qi (2024) De-confounded data-free knowledge distillation for handling distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12615–12625. Cited by: §1, §2.
[37] C. Wu, L. Herranz, X. Liu, J. Van De Weijer, B. Raducanu, et al. (2018) Memory replay GANs: learning to generate new categories without forgetting. Advances in Neural Information Processing Systems 31. Cited by: §2.
[38] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §2.
[39] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang (2022) Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11953–11962. Cited by: §2, §4.4.

Appendix A Experimental Setup

A.1 Implementation Details

For training, we start from pre-trained weights and use the Adam optimizer with a learning rate of $0.0001$ for 3 epochs. As students start from pre-trained weights, we observe negligible improvement when using more epochs. Images are resized to $224\times 224$ to match the size used during pre-training. We use random horizontal flips as augmentation and apply data normalization. We use a batch size of 64. For distillation, we use the KL-divergence with a temperature of $10$. Experiments are conducted with the ViT architecture, namely ViT-B/16. We use ViT-base as the teacher for CIFAR20 and DomainNet, and ViT-tiny as the teacher for Digits. Every teacher is initialized from pre-trained weights and trained for $50$ epochs with the Adam optimizer and a learning rate of $0.0001$. We use the same architectures for students in the main paper and additionally experiment with ViT-tiny. For all results, 3 seeds are used.
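To make the distillation objective concrete, here is a framework-free sketch of a temperature-scaled KL-divergence loss for a single sample, using the temperature of $10$ mentioned above. The function names are hypothetical, and the $T^{2}$ scaling follows the common practice introduced by Hinton et al. [15]; whether the released implementation applies it is an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of `logits` divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, temperature=10.0):
    """KL(teacher || student) on temperature-softened distributions.

    The result is multiplied by temperature**2 (an assumption, following [15])
    so that gradient magnitudes stay comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


teacher = [4.0, 1.0, -2.0]
student = [2.5, 1.5, 0.0]
loss_value = kd_kl_loss(student, teacher)
loss_identical = kd_kl_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's logits exactly and positive otherwise, so minimizing it pulls the student's softened predictions toward the teacher's.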

For DomainNet, since domains are unbalanced, we oversample or undersample them so that each task has the same number of steps. The base task length is determined by the Internal Data (ID) size, and the External Data (ED) is sampled accordingly. For example, if ID is larger than ED, we oversample from ED.
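The resampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `resample_to_length` is hypothetical, oversampling is done by repeating the index list and drawing a random remainder, and undersampling is done by drawing without replacement. The paper does not specify the exact sampling scheme.

```python
import random


def resample_to_length(indices, target_len, seed=0):
    """Return `target_len` sample indices drawn from `indices`.

    Oversamples (with repetition) when there are too few indices and
    undersamples (without replacement) when there are too many.
    """
    indices = list(indices)
    if not indices:
        raise ValueError("cannot resample from an empty index list")
    rng = random.Random(seed)
    if len(indices) >= target_len:
        return rng.sample(indices, target_len)
    repeats, remainder = divmod(target_len, len(indices))
    resampled = indices * repeats + rng.sample(indices, remainder)
    rng.shuffle(resampled)
    return resampled


# Hypothetical sizes: the ID has 12 samples and the ED only 5, so the ED is oversampled.
id_size = 12
ed_indices = list(range(5))
ed_epoch = resample_to_length(ed_indices, id_size)

# Opposite case: the ED is larger than the ID, so it is undersampled.
ed_small = resample_to_length(list(range(100)), 10)
```

With both streams brought to the same length, each task runs for the same number of steps regardless of the original domain sizes.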

A.2 Metric

For all experiments, we report the domain-wise Final Accuracy ( $\%$ ) of the student at the end of the training sequence.

Appendix B Additional Discussions

Feasibility of obtaining unknown ED.

In CD, a question naturally arises as to whether obtaining ED unknown to FMs is feasible, since such models are trained on vast and diverse data. We argue that obtaining unknown data is feasible in two highly plausible scenarios: (1) Temporal gap: any data uploaded to public repositories after the FM’s training cutoff is guaranteed to be unknown. (2) Private or synthetic data: in industrial settings, private proprietary datasets or synthetically generated data serve as excellent ED.

Assessing ED quality.

As presented throughout this paper, ED selection is key in Continual Distillation. While quantifying semantic similarity is hard without teacher data, we propose using the teacher’s own predictive uncertainty, measured by the entropy of its predictions, as a proxy. In Figure B.1, we show the entropy distribution of a teacher trained on two domains of CIFAR20, for various domains of CIFAR20 as well as CUB and MNIST. We observe that the more “unrelated” the external data is, the “flatter” the entropy distribution becomes. To select adequate ED without knowing the training distribution, a user can 1) filter out samples based on an entropy threshold, or 2) use the 4th-order moment (kurtosis) of the entropy distribution to quantify “flatness”. As shown in Fig. B.1, lower flatness correlates with higher UKT potential. Additionally, in Figure B.2, the entropy distribution varies widely across domains, partially explaining the mixed results observed on this dataset.
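The two selection strategies can be sketched in a few lines. The sketch below assumes that the teacher's softmax outputs are available as lists of probabilities, that samples with entropy at or below the threshold are the ones kept, and that kurtosis is the non-excess (Pearson) fourth standardized moment. All function names and the example values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant values")
    return m4 / (m2 ** 2)


def filter_by_entropy(prob_rows, threshold):
    """Indices of samples whose teacher entropy is at most `threshold`."""
    return [i for i, probs in enumerate(prob_rows) if entropy(probs) <= threshold]


# Hypothetical teacher outputs: one confident and one uniform prediction.
teacher_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
kept = filter_by_entropy(teacher_probs, threshold=0.5)
entropies = [entropy(p) for p in teacher_probs]
```

Computing `kurtosis` over the entropies of a whole candidate ED set gives a single number per set, which can then be compared across candidate sources.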

Appendix C Additional Experiments

C.1 Additional Metrics

Forgetting

For each method, we measure forgetting as the drop in performance on the learned domains after training from new teachers. Formally, for a student trained to imitate a teacher of index $t$ , the forgetting for domain $d$ is computed as:

$$F_{d}=\max_{i<t}A_{d}^{(i)}-A_{d}^{(t)}$$

where $A_{d}^{(i)}$ is the accuracy on domain $d$ after distillation from teacher $i$. The overall forgetting is averaged across all $D$ domains:

$$F=\frac{1}{D}\sum_{d=1}^{D}F_{d}$$

This metric captures the extent to which a method forgets previously learned knowledge when adapting to new tasks. The results are presented in Tables C.1, C.2 and C.3. It can be observed that our method leads to competitive forgetting in all scenarios.
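As a worked illustration of the forgetting metric, the sketch below applies the two formulas to a small accuracy history. The function names and the accuracy values are hypothetical, and the result is not clipped at zero, so a domain that improves after the last teacher would yield a negative value.

```python
def forgetting_per_domain(acc_history):
    """Per-domain forgetting after the last teacher.

    acc_history[i][d] is the accuracy on domain d after distilling from
    teacher i; the last row plays the role of teacher t in the formula.
    """
    final = acc_history[-1]
    if len(acc_history) < 2:
        return [0.0] * len(final)
    previous = acc_history[:-1]
    return [
        max(step[d] for step in previous) - final[d]
        for d in range(len(final))
    ]


def average_forgetting(acc_history):
    """Overall forgetting: mean of the per-domain values."""
    per_domain = forgetting_per_domain(acc_history)
    return sum(per_domain) / len(per_domain)


# Hypothetical accuracies (%) on 2 domains after each of 3 teachers.
history = [
    [90.0, 40.0],
    [88.0, 70.0],
    [89.0, 60.0],
]
per_domain = forgetting_per_domain(history)   # [1.0, 10.0]
overall = average_forgetting(history)         # 5.5
```

Here the second domain peaks at 70.0 after the second teacher and ends at 60.0, giving a forgetting of 10.0, while the first domain loses only 1.0 relative to its best earlier accuracy.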

Table C.1: Forgetting (%, lower is better) of the student at the end of training on CIFAR20 for 4 scenarios: Internal Data Only (D0), Related External Data (D4), CUB as ED, and MNIST as ED. The number of runs is set to 3.

CIFAR20 - Internal Data Only
Method	D0	D1	D2	D3	D4 ✗	Avg. (0-3)
KL-divergence	0	11.95	2.98	0	-	3.73
DKD [CVPR’22]	0	2.38	3.47	0	-	1.46
LS [CVPR’24]	0	10.42	1.88	0	-	3.08
MDS [ICLR’25]	0	11.95	2.98	0	-	3.23
Self-Distillation	0	16.02	2.47	0	-	4.12

CIFAR20 + Related External Data
Method	D0	D1	D2	D3	D4	Avg. (0-3)
KL-divergence	0.52	38.15	30.26	0	-	17.23
DKD [CVPR’22]	0.13	25.15	16.25	0	-	10.38
LS [CVPR’24]	0.62	39.23	32.08	0	-	17.98
MDS [ICLR’25]	0.29	37.74	31.00	0	-	17.26
Self-Distillation	0	25.33	7.93	0	-	8.32
SE2D (ours)	0	16.10	3.67	0	-	4.44

CIFAR20 + CUB
\columncolorDomainColMethod	\columncolorInDomColD0	\columncolorOtherDomD1	\columncolorOtherDomD2	\columncolorOtherDomD3	\columncolorOutDomColCUB	\columncolorAvgColAvg. (0-3)
\columncolorDomainColKL-divergence	\columncolorInDomCol0.93	\columncolorOtherDom30.68	\columncolorOtherDom17.82	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol12.36
\columncolorDomainColDKD [CVPR’22]	\columncolorInDomCol0.89	\columncolorOtherDom5.37	\columncolorOtherDom3.86	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol2.03
\columncolorDomainColLS [CVPR’24]	\columncolorInDomCol2.54	\columncolorOtherDom32.28	\columncolorOtherDom20.66	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol11.87
\columncolorDomainColMDS [ICLR’25]	\columncolorInDomCol0.61	\columncolorOtherDom30.79	\columncolorOtherDom16.26	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol11.41
\columncolorDomainColSelf-Distillation	\columncolorInDomCol0.81	\columncolorOtherDom27.88	\columncolorOtherDom2.48	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol7.29
\columncolorDomainColSE2D (ours)	\columncolorInDomCol0.40	\columncolorOtherDom21.61	\columncolorOtherDom6.81	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol7.21
CIFAR20 + MNIST
\columncolorDomainColMethod	\columncolorInDomColD0	\columncolorOtherDomD1	\columncolorOtherDomD2	\columncolorOtherDomD3	\columncolorOutDomColMNIST	\columncolorAvgColAvg. (0-3)
\columncolorDomainColKL-divergence	\columncolorInDomCol0	\columncolorOtherDom15.20	\columncolorOtherDom7.40	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol5.15
\columncolorDomainColDKD [CVPR’22]	\columncolorInDomCol0	\columncolorOtherDom2.74	\columncolorOtherDom0.57	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol0.83
\columncolorDomainColLS [CVPR’24]	\columncolorInDomCol1.57	\columncolorOtherDom14.58	\columncolorOtherDom9.50	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol5.41
\columncolorDomainColMDS [ICLR’25]	\columncolorInDomCol0.97	\columncolorOtherDom15.34	\columncolorOtherDom7.67	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol5.00
\columncolorDomainColSelf-Distillation	\columncolorInDomCol0	\columncolorOtherDom9.69	\columncolorOtherDom1.02	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol2.18
\columncolorDomainColSE2D (ours)	\columncolorInDomCol0.85	\columncolorOtherDom9.88	\columncolorOtherDom2.85	\columncolorOtherDom0	\columncolorOutDomCol-	\columncolorAvgCol4.43

Algorithm 1 Continual Distillation with KL divergence and SGD.

Require: Sequence of teachers $\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{T}\}$, student model $\mathcal{S}_{\theta}$, distillation dataset $\mathcal{D}^{\mathcal{S}}$
1: for $t=1$ to $T$ do
2:   for $x\in\mathcal{D}^{\mathcal{S}}$ do
3:     Obtain teacher predictions $p_{t}(x)=\mathcal{T}_{t}(x)$
4:     Compute student predictions $q_{\theta}(x)=\mathcal{S}_{\theta}(x)$
5:     Compute distillation loss: $\mathcal{L}_{t}=\mathrm{KL}\big(p_{t}(x)\,\|\,q_{\theta}(x)\big)$
6:     Update student parameters $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{t}$
7:   end for
8: end for
9: return Trained student model $\mathcal{S}_{\theta}$
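The loop of Algorithm 1 can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the released implementation: the student is a hypothetical linear softmax classifier (`LinearStudent`), the teachers are lookup tables returning fixed probability vectors, and the learning rate and epoch count are arbitrary. It relies on the standard identity that, for a softmax student, the gradient of $\mathrm{KL}(p\,\|\,q)$ with respect to the logits is $q-p$.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearStudent:
    """Toy stand-in for the student: logits = W x (hypothetical model)."""

    def __init__(self, dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
                  for _ in range(num_classes)]

    def predict(self, x):
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        return softmax(logits)


def continual_distillation(teachers, student, data, lr=0.5, epochs=100):
    """Distill from each teacher in turn; earlier teachers are never revisited."""
    for teacher in teachers:
        for _ in range(epochs):
            for x in data:
                p = teacher(x)          # teacher predictions p_t(x)
                q = student.predict(x)  # student predictions q_theta(x)
                # For a softmax student, d KL(p || q) / d logit_k = q_k - p_k,
                # so the SGD step on W[k][j] is lr * (q_k - p_k) * x_j.
                for k, row in enumerate(student.W):
                    for j, xj in enumerate(x):
                        row[j] -= lr * (q[k] - p[k]) * xj
    return student


# Two hypothetical teachers that disagree on the second input.
data = [(1.0, 0.0), (0.0, 1.0)]
teacher_a = {(1.0, 0.0): [0.9, 0.1], (0.0, 1.0): [0.1, 0.9]}.get
teacher_b = {(1.0, 0.0): [0.9, 0.1], (0.0, 1.0): [0.8, 0.2]}.get

student = continual_distillation([teacher_a, teacher_b],
                                 LinearStudent(dim=2, num_classes=2), data)
```

In this toy setting the final student tracks the last teacher on every input, so whatever the first teacher predicted on the second input is overwritten. This mirrors, in miniature, the forgetting that sequential distillation induces.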

Algorithm 2 Overview of SE2D training algorithm with Continual Distillation.

Require: Sequence of teachers $\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{T}\}$, student model $\mathcal{S}_{\theta}$, distillation dataset $\mathcal{D}^{\mathcal{S}}=\mathcal{D}_{i}\cup\mathcal{D}_{e}$
1: for $t=1$ to $T$ do
2:   if $t=1$ then
3:     for $x\in\mathcal{D}^{\mathcal{S}}$ do
4:       Obtain teacher predictions $p_{t}(x)=\mathcal{T}_{t}(x)$
5:       Compute student predictions $q_{\theta}(x)=\mathcal{S}_{\theta}(x)$
6:       Compute distillation loss: $\mathcal{L}_{t}=\mathrm{KL}\big(p_{t}(x)\,\|\,q_{\theta}(x)\big)$
7:       Update student parameters $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{t}$
8:     end for
9:   else
10:     Load previous student checkpoint $\mathcal{S}_{\theta}^{t-1}$
11:     for $(x_{e},x_{i})\in(\mathcal{D}_{e},\mathcal{D}_{i})$ do
12:       Compute student predictions on internal data: $q_{\theta}(x_{i})=\mathcal{S}_{\theta}(x_{i})$
13:       Compute student predictions on external data: $q_{\theta}(x_{e})=\mathcal{S}_{\theta}(x_{e})$
14:       Obtain previous student predictions on external data: $p_{t-1}(x_{e})=\mathcal{S}_{\theta}^{t-1}(x_{e})$
15:       Obtain current teacher predictions on all data: $p_{t}^{\text{teacher}}((x_{e},x_{i}))=\mathcal{T}_{t}((x_{e},x_{i}))$
16:       Compute distillation loss on external data from previous student: $\mathcal{L}_{\text{student}}=\mathrm{KL}\big(p_{t-1}(x_{e})\,\|\,q_{\theta}(x_{e})\big)$
17:       Compute distillation loss on all data from teacher: $\mathcal{L}_{\text{teacher}}=\mathrm{KL}\big(p_{t}^{\text{teacher}}((x_{e},x_{i}))\,\|\,q_{\theta}((x_{e},x_{i}))\big)$
18:       Compute total loss: $\mathcal{L}_{t}=\mathcal{L}_{\text{student}}+\mathcal{L}_{\text{teacher}}$
19:       Update student parameters $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{t}$
20:     end for
21:   end if
22: end for
23: return Trained student model $\mathcal{S}_{\theta}$
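To make steps 16 to 18 of Algorithm 2 concrete, the sketch below evaluates the combined loss for one batch, given probability vectors that have already been computed. It is a minimal illustration rather than the released code: the function names are hypothetical, and averaging the KL terms over the samples of a batch is an assumption, since the algorithm does not specify the reduction.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_kl(targets, predictions):
    """Average KL(target || prediction) over a batch (assumed reduction)."""
    total = sum(kl_divergence(p, q) for p, q in zip(targets, predictions))
    return total / len(targets)


def se2d_loss(prev_student_ext, teacher_ext, teacher_int,
              student_ext, student_int):
    """Total loss L_t = L_student + L_teacher for one batch.

    prev_student_ext: previous student checkpoint outputs on external data
    teacher_ext, teacher_int: current teacher outputs on external/internal data
    student_ext, student_int: current student outputs on external/internal data
    """
    # Step 16: preserve the previous student's outputs on external data.
    loss_student = mean_kl(prev_student_ext, student_ext)
    # Step 17: match the current teacher on external and internal data.
    loss_teacher = mean_kl(teacher_ext + teacher_int,
                           student_ext + student_int)
    # Step 18: unweighted sum of both terms.
    return loss_student + loss_teacher
```

When the current teacher and the previous student disagree on external data, the two terms pull the student in different directions on those samples. The unweighted sum in step 18 is how the algorithm balances retaining earlier knowledge against learning from the new teacher.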

Table C.2: Forgetting (%, lower is better) of the student at the end of training on Digits for 2 scenarios. Internal Data Only (D0), Related External Data (D4). The number of runs is set to 3.

Digits - Internal Data Only
Method	MNIST	SVHN	MNIST-M	USPS	KMNIST ✗	Avg.
KL-divergence	0.23	3.34	16.20	0	-	4.94
DKD [CVPR’22]	0.47	0.53	11.73	0	-	3.12
LS [CVPR’24]	0.10	4.09	12.88	0	-	4.40
MDS [ICLR’25]	0.23	3.34	16.20	0	-	4.94
Self-Distillation	0.13	4.06	7.46	0	-	3.54

Digits + Related External Data
Method	MNIST	SVHN	MNIST-M	USPS	KMNIST	Avg.
KL-divergence	0.24	40.68	38.41	0	-	19.17
DKD [CVPR’22]	0.84	12.00	50.77	0	-	15.87
LS [CVPR’24]	0.27	40.50	36.59	0	-	19.22
MDS [ICLR’25]	0.25	40.69	38.35	0	-	19.16
Self-Distillation	0.10	16.33	6.35	0	-	5.58
SE2D (ours)	0.13	10.37	4.90	0	-	3.73

Table C.3: Forgetting (%, lower is better) on DomainNet for 2 scenarios: Internal Data Only (Seq -1) and Related External Data (Seq 0). Averaged across 3 runs.

DomainNet - Internal Data Only
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch ✗	Avg.
KL-divergence	4.63	9.91	13.70	20.31	0	-	12.14
DKD [CVPR’22]	3.85	10.34	13.89	22.38	0	-	12.61
LS [CVPR’24]	3.50	11.48	15.39	23.73	0	-	13.52
MDS [ICLR’25]	5.20	10.70	15.69	21.46	0	-	13.26
Checkpoint	1.27	5.86	11.30	20.68	0	-	9.78
SE2D (ours)	4.60	9.58	13.49	20.27	0	-	11.98

DomainNet + Related External Data
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch ✗	Avg.
KL-divergence	5.56	11.89	18.07	29.12	0	-	16.16
DKD [CVPR’22]	4.64	12.15	18.31	29.69	0	-	16.20
LS [CVPR’24]	5.40	13.70	20.98	32.44	0	-	18.13
MDS [ICLR’25]	6.25	13.03	19.68	29.62	0	-	17.15
Checkpoint	1.70	6.63	14.54	29.43	0	-	13.08
SE2D (ours)	3.42	6.91	15.54	30.86	0	-	14.18

Accuracy Curves

We report per-domain accuracy curves during training with related external data in Figures D.1 to D.4.

C.2 Additional Architectures

We report results with additional architectures. Specifically, we experimented with a ViT-tiny student instead of the ViT-base student used in the main manuscript. Similarly, we experimented with larger teachers, using CLIP-based models (ViT-L/14). Results are presented in Tables C.4 and C.5, respectively.

Table C.4: Performance (%, higher is better) of the student at the end of training on CIFAR20 for 2 scenarios with a ViT-tiny. Internal Data Only (D0), Related External Data (D4). The number of runs is set to 3. Means and standard deviations are reported.

CIFAR20 - Internal Data Only
Method	D0	D1	D2	D3	D4 ✗	Avg. (0-3)
$\mathcal{T}_{best}$ (upper bound)	97.75	95.80	96.70	95.75	-	96.5
KL-divergence	96.27 $\pm$ 0.21	37.85 $\pm$ 0.78	52.28 $\pm$ 0.42	49.48 $\pm$ 1.28	-	58.97 $\pm$ 0.67
DKD [CVPR’22]	95.40 $\pm$ 1.05	33.27 $\pm$ 1.60	45.38 $\pm$ 0.90	40.45 $\pm$ 1.34	-	53.62 $\pm$ 1.22
LS [CVPR’24]	96.25 $\pm$ 0.85	38.95 $\pm$ 2.17	51.58 $\pm$ 2.47	50.15 $\pm$ 4.40	-	59.23 $\pm$ 2.47
MDS [ICLR’25]	94.45 $\pm$ 0.44	35.08 $\pm$ 0.88	45.35 $\pm$ 0.41	41.98 $\pm$ 3.10	-	54.22 $\pm$ 1.21
Self-Distillation	96.27 $\pm$ 0.35	37.93 $\pm$ 1.22	51.82 $\pm$ 0.53	46.43 $\pm$ 0.97	-	58.11 $\pm$ 0.77

CIFAR20 + Related External Data
Method	D0	D1	D2	D3	D4	Avg. (0-3)
$\mathcal{T}_{best}$ (upper bound)	97.75	95.80	96.70	95.75	-	96.5
KL-divergence	96.36 $\pm$ 0.33	46.54 $\pm$ 1.17	55.86 $\pm$ 1.10	75.96 $\pm$ 1.76	-	68.68 $\pm$ 1.09
DKD [CVPR’22]	95.18 $\pm$ 0.46	40.77 $\pm$ 0.95	50.92 $\pm$ 1.07	60.69 $\pm$ 1.32	-	61.89 $\pm$ 0.95
LS [CVPR’24]	96.45 $\pm$ 0.26	46.80 $\pm$ 0.74	55.08 $\pm$ 0.61	76.49 $\pm$ 1.50	-	68.7 $\pm$ 0.78
Self-Distillation	97.26 $\pm$ 0.25	55.42 $\pm$ 0.31	61.62 $\pm$ 0.70	68.74 $\pm$ 1.46	-	70.76 $\pm$ 0.68
SE2D (ours)	96.77 $\pm$ 0.25	62.33 $\pm$ 1.19	59.45 $\pm$ 1.36	65.82 $\pm$ 2.03	-	71.09 $\pm$ 1.21

Table C.5: Domain Accuracy (%, higher is better) on DomainNet with CLIP-based teachers. Mean and standard deviation are reported.

DomainNet - Internal Data Only
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch ✗	Avg.
$\mathcal{T}_{best}$ (upper bound)	86.33	53.89	79.74	69.62	89.86	-	76.23
KL-divergence	78.33 $\pm$ 0.23	19.00 $\pm$ 0.30	39.25 $\pm$ 0.81	35.36 $\pm$ 0.49	51.51 $\pm$ 0.74	-	44.69 $\pm$ 0.32
DKD [CVPR’22]	78.55 $\pm$ 0.03	18.89 $\pm$ 0.41	39.69 $\pm$ 0.19	34.77 $\pm$ 0.38	52.45 $\pm$ 0.49	-	44.87 $\pm$ 0.06
LS [CVPR’24]	74.39 $\pm$ 0.39	15.54 $\pm$ 0.66	33.92 $\pm$ 0.56	20.09 $\pm$ 1.11	46.62 $\pm$ 0.76	-	38.11 $\pm$ 0.64
MDS [ICLR’25]	75.51 $\pm$ 0.10	17.52 $\pm$ 0.20	35.67 $\pm$ 1.39	26.22 $\pm$ 0.70	47.71 $\pm$ 0.69	-	40.53 $\pm$ 0.27
Self-Distillation	79.99 $\pm$ 0.55	21.72 $\pm$ 0.80	43.65 $\pm$ 0.57	26.52 $\pm$ 1.42	57.21 $\pm$ 0.13	-	45.82 $\pm$ 0.61

DomainNet + Related External Data
Method	Clipart	Infograph	Painting	Quickdraw	Real	Sketch ✗	Avg.
$\mathcal{T}_{best}$ (upper bound)	86.33	53.89	79.74	69.62	89.86	-	76.23
KL-divergence	78.22 $\pm$ 0.39	19.98 $\pm$ 0.17	41.59 $\pm$ 0.55	42.83 $\pm$ 0.37	52.62 $\pm$ 0.40	-	47.05 $\pm$ 0.22
DKD [CVPR’22]	78.17 $\pm$ 0.11	20.01 $\pm$ 0.19	41.78 $\pm$ 0.39	42.69 $\pm$ 0.84	52.37 $\pm$ 0.24	-	47.00 $\pm$ 0.01
LS [CVPR’24]	74.64 $\pm$ 0.13	16.63 $\pm$ 0.31	37.09 $\pm$ 0.09	26.78 $\pm$ 0.35	47.45 $\pm$ 0.35	-	40.52 $\pm$ 0.17
MDS [ICLR’25]	75.21 $\pm$ 0.59	18.55 $\pm$ 0.45	39.32 $\pm$ 0.13	31.84 $\pm$ 0.73	49.32 $\pm$ 0.39	-	42.85 $\pm$ 0.39
Self-Distillation	79.47 $\pm$ 0.50	22.84 $\pm$ 0.60	47.51 $\pm$ 1.33	30.15 $\pm$ 0.77	58.51 $\pm$ 1.37	-	47.69 $\pm$ 0.84
SE2D (ours)	78.52 $\pm$ 0.10	21.34 $\pm$ 0.45	47.12 $\pm$ 0.35	29.18 $\pm$ 0.30	58.01 $\pm$ 0.61	-	46.83 $\pm$ 0.31

C.3 Additional Sequences

As presented in the main paper, DomainNet is particularly challenging. Therefore, we conducted additional experiments in which challenging domains are either removed or used as external data. These results are presented in Table C.6. Although SE2D's overall performance remains below that of Self-Distillation, the gap between the two methods narrows in the less complex scenarios, while SE2D's lead over the baseline widens.

Table C.6: Accuracy per domain (%). Grey: Internal Data; Blue: Domain known by the teacher (active); Red: External Data (ED); White: Ignored. Avg computed on Internal Data + Active domains.

DomainNet - Sequence 1 (Quickdraw is used as ED and Infograph is ignored)
Method	Clipart (internal)	Infograph (ignored)	Painting (active)	Quickdraw (ED)	Real (active)	Sketch (active)	Avg.
KL-divergence	74.66 $\pm$ 0.14	-	33.01 $\pm$ 0.21	-	46.17 $\pm$ 0.62	48.68 $\pm$ 0.15	50.63
DKD [CVPR’22]	76.18 $\pm$ 0.20	-	36.20 $\pm$ 0.49	-	49.70 $\pm$ 0.62	54.85 $\pm$ 0.32	54.23
MDS [ICLR’25]	70.76 $\pm$ 0.40	-	30.01 $\pm$ 0.78	-	43.52 $\pm$ 1.09	51.56 $\pm$ 0.68	48.96
Self-Distillation	80.43 $\pm$ 0.10	-	47.47 $\pm$ 0.71	-	61.10 $\pm$ 0.59	58.15 $\pm$ 0.31	61.79
SE2D (ours)	76.89 $\pm$ 0.25	-	38.72 $\pm$ 0.49	-	52.83 $\pm$ 0.20	58.84 $\pm$ 0.25	56.82

DomainNet - Sequence 2 (Infograph is used as ED)
Method	Clipart (internal)	Infograph (ED)	Painting (active)	Quickdraw (active)	Real (active)	Sketch (active)	Avg.
KL-divergence	75.78 $\pm$ 0.14	-	34.61 $\pm$ 0.52	16.59 $\pm$ 0.22	49.27 $\pm$ 0.43	52.43 $\pm$ 0.44	45.73
DKD [CVPR’22]	76.73 $\pm$ 0.33	-	36.48 $\pm$ 0.68	17.53 $\pm$ 0.44	51.84 $\pm$ 0.55	56.83 $\pm$ 0.34	47.88
MDS [ICLR’25]	75.28 $\pm$ 0.17	-	35.43 $\pm$ 0.92	16.47 $\pm$ 0.48	50.74 $\pm$ 0.88	56.65 $\pm$ 0.37	46.91
Self-Distillation	80.45 $\pm$ 0.06	-	48.48 $\pm$ 0.41	20.84 $\pm$ 0.89	64.79 $\pm$ 0.40	59.04 $\pm$ 0.24	54.72
SE2D (ours)	78.08 $\pm$ 0.20	-	49.02 $\pm$ 0.36	18.50 $\pm$ 0.40	63.29 $\pm$ 0.32	58.99 $\pm$ 0.22	53.58

DomainNet - Sequence 3 (Sketch is used as ED, Infograph and Quickdraw are ignored)
Method	Clipart (internal)	Infograph (ignored)	Painting (active)	Quickdraw (ignored)	Real (active)	Sketch (ED)	Avg.
KL-divergence	75.75 $\pm$ 0.35	-	42.44 $\pm$ 0.67	-	65.15 $\pm$ 0.53	-	61.11
DKD [CVPR’22]	76.25 $\pm$ 0.23	-	45.52 $\pm$ 0.72	-	70.50 $\pm$ 0.12	-	64.09
MDS [ICLR’25]	75.25 $\pm$ 0.07	-	44.74 $\pm$ 0.94	-	70.11 $\pm$ 0.35	-	63.37
Self-Distillation	79.54 $\pm$ 0.26	-	59.61 $\pm$ 0.30	-	71.48 $\pm$ 0.26	-	70.21
SE2D (ours)	78.04 $\pm$ 0.15	-	59.36 $\pm$ 0.68	-	70.61 $\pm$ 0.40	-	69.34

Appendix D Algorithms

To provide a clear overview of our training methodology, we present the Continual Distillation procedure in Algorithm 1. Furthermore, a detailed description of our proposed SE2D approach is provided in Algorithm 2.