[2605.05996] Gaussian Mixture Models in Hilbert Spaces via Kernel Methods

Gaussian Mixture Models in Hilbert Spaces via Kernel Methods

Daniel López-Montero Affiliation: Friedrich-Alexander-Universität Erlangen-Nürnberg Antonio Álvarez-López Affiliation: Friedrich-Alexander-Universität Erlangen-Nürnberg Affiliation: Universidad Autónoma de Madrid Marcos Matabuena Affiliation: Mohamed bin Zayed University Affiliation: of Artificial Intelligence

Abstract

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^{2}$ -functional data and random graphs in Laplacian spaces arising in modern medical applications.

⁰⁰footnotetext: Code, data and experiments are available at https://github.com/dani2442/rkhs-mixture-models/.

1 Introduction

Technological progress across multiple domains is generating high-dimensional datasets that demand new analytical tools. Modern data structures, such as random functions, spatial fields, images, signals, graphs, or trajectories are often more naturally represented as random objects in Hilbert spaces than as finite-dimensional Euclidean vectors in $\mathbb{R}^{d}$ . Analyzing the data directly using Hilbert space methods preserves their intrinsic geometric structure and allows their full potential to be exploited [53, 26].

In many applications, observed data are heterogeneous and may arise from $K$ latent populations, each characterized by a probability law $P_{i}$ , with $i=1,\dots,K$ . Mixture models [28] provide a flexible framework for representing such data-generating mechanisms and, under suitable regularity conditions, can approximate broad classes of probability measures. This paper proposes a mixture-based framework for probability measures on a separable Hilbert space $\mathcal{X}$ , extending the classical idea of Gaussian mixture models (GMM) for finite-dimensional data [22].

Extending GMMs beyond finite-dimensional spaces is challenging. Infinite-dimensional Hilbert spaces do not admit a Lebesgue measure, so densities cannot be defined in the standard way, and likelihood-based methods typically require the choice of a reference measure [19, 41]. We therefore adopt a density-free perspective: we fit Gaussian mixture models by minimizing the Maximum Mean Discrepancy (MMD) [33] between the kernel mean embedding of the empirical measure and that of the target mixture model. We now formulate the problem.

Model and objective.

Let $P$ be an unknown probability measure on a separable Hilbert space $\mathcal{X}$ . A Gaussian measure on $\mathcal{X}$ , denoted by $\mathcal{N}(m,\mathcal{K})$ , is uniquely determined by a mean element $m\in\mathcal{X}$ and a covariance operator $\mathcal{K}\colon\mathcal{X}\to\mathcal{X}$ , which generalizes the finite-dimensional covariance matrix. To ensure non-Dirac Gaussian components with finite total variance, $\mathcal{K}$ must be non-zero, self-adjoint, positive semi-definite, and trace-class.

For a fixed number of components $K\in\mathbb{N}$ , we approximate $P$ with a finite Gaussian mixture

Q_{\theta}\coloneqq\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(m_{k},\mathcal{K}_{k}),\qquad\pi_{k}\geq 0\quad\forall k,\qquad\sum_{k=1}^{K}\pi_{k}=1,

(1)

parameterized by $\theta\coloneqq(\pi_{k},m_{k},\mathcal{K}_{k})_{k=1}^{K}\in\Theta_{K}$ , where $\Theta_{K}$ denotes the admissible parameter space.

We fit $Q_{\theta}$ to $P$ by minimizing the Maximum Mean Discrepancy induced by a measurable positive-definite kernel $\kappa\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ :

\operatorname{MMD}_{\kappa}^{2}(P,Q_{\theta})=\mathbb{E}_{X,X^{\prime}\sim P}[\kappa(X,X^{\prime})]-2\,\mathbb{E}_{X\sim P,\;Y\sim Q_{\theta}}[\kappa(X,Y)]+\mathbb{E}_{Y,Y^{\prime}\sim Q_{\theta}}[\kappa(Y,Y^{\prime})],

(2)

where $X$ and $X^{\prime}$ are independent copies drawn from $P$ , $Y$ and $Y^{\prime}$ are independent copies drawn from $Q_{\theta}$ , and all four random variables are mutually independent. This leads to a density-free fitting criterion that depends only on kernel expectations. When the kernel $\kappa$ is characteristic, the MMD is a metric on probability measures. Important examples include Gaussian radial kernels in Euclidean spaces and, more generally, in suitable metric spaces of strong negative type [60]. The formal definitions and statistical background are deferred to Supplemental Material Appendix˜A.

In practice, given $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}P$ , we replace $P$ by the empirical measure $P_{n}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}$ and solve the problem

\min_{\theta\in\Theta_{K}}\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta}).

(3)

Motivation.

The motivation for this work is to perform clustering of dynamic, time-varying functional objects [23, 24] that take values in a possibly infinite-dimensional space $\mathcal{X}$ . Figure 1 shows examples of contemporary data structures that arise in this setting.

Let $\mathcal{X}$ be a Hilbert space, and let $(X(t))_{t\in[0,T]}$ be a stochastic process valued in $\mathcal{X}$ . We observe independent sample trajectories

X_{1}(t),\dots,X_{n}(t)

and denote by $P_{t}$ the law of $X(t)$ at time $t$ . We estimate a time-varying mixture model by minimizing a temporal finite-sample approximation of the integrated discrepancy

\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,t},Q_{\theta,t}\bigr)\,\mathrm{d}t,

where $P_{n,t}$ is the empirical law of $X_{1}(t),\dots,X_{n}(t)$ at time $t$ . In Section˜4.2, this methodology is applied to continuous glucose monitoring data from a diabetes clinical trial evaluating a new therapy for juvenile diabetes [67].

Notation.

Throughout, $\mathcal{X}$ denotes a separable real Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathcal{X}}$ , norm $\|\cdot\|_{\mathcal{X}}$ , and identity operator $I_{\mathcal{X}}$ . We let $\mathcal{L}(\mathcal{X})$ denote the space of bounded linear operators on $\mathcal{X}$ . For a trace-class operator $T\in\mathcal{L}(\mathcal{X})$ with eigenvalues $(\lambda_{j})_{j\geq 1}$ , its Fredholm determinant is $\det_{F}(I_{\mathcal{X}}+T)\coloneqq\prod_{j=1}^{\infty}(1+\lambda_{j})$ . Let $\mathcal{P}(\mathcal{X})$ denote the set of Borel probability measures on $\mathcal{X}$ , and for $p\geq 1$ , let $\mathcal{P}_{p}(\mathcal{X})$ be the subset of measures with finite $p$ -th moments. For finite-dimensional spaces, boldface lowercase letters ( $\mathbf{x},\mathbf{m}$ ) denote vectors and boldface uppercase letters ( $\mathbf{K},\mathbf{I}$ ) denote matrices. We write $\mathbf{1}$ for the all-ones vector and $\Delta^{K-1}$ for the probability simplex.

1.1 Contributions and paper organization

Refer to caption — Figure 1: Representative structured data samples. Left: Hourly continuous glucose monitoring paths. Center: Signals on a fixed graph with varying node intensities. Right: Correlation matrices.

The main contributions of this paper are threefold:

•

Closed-form MMD fitting for Hilbert-space mixtures. We derive exact closed-form expressions for $\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta})$ for Gaussian radial basis and polynomial kernels, extending known $\mathbb{R}^{d}$ formulas to separable Hilbert spaces [1, 13]. Moreover, we prove that, for every $p\geq 1$ , finite Gaussian mixtures of the form (1) are dense in $\mathcal{P}_{p}(\mathcal{X})$ under the Wasserstein distance $W_{p}$ . This result provides theoretical support for the approximation capacity of the proposed model in infinite-dimensional applications.
•

Consistent finite-dimensional computation. We introduce a spectral projection framework that reduces infinite-dimensional computation to standard linear algebra in $\mathbb{R}^{M}$ . We prove that both the projected MMD objective and the posterior probabilities converge to their exact infinite-dimensional counterparts as the number of terms in the finite-dimensional approximation satisfies $M\to\infty$ ; see Propositions 2.1 and 2.2.
•

Real-world performance. We evaluate the empirical performance of the proposed method across several data structures, including random functions in $L^{2}$ , spatial fields, rotations in $\mathrm{SO}(3)$ , and graph Laplacians. We compare our method against state-of-the-art approaches on established clustering benchmarks, demonstrating competitive performance; see Sections˜4 and F. To further assess its performance on dynamic functional data, we analyze a clinical trial dataset with continuous glucose monitoring data, aiming to identify subgroups of individuals with heterogeneous responses to therapy.

The supplementary material contains all proofs, together with additional analytical results.

Related work

Functional data analysis and clustering. Functional data analysis (FDA) studies samples naturally viewed as functions, curves, or surfaces, typically using basis expansions, smoothing, and functional principal component analysis [53, 26]. Our spectral projection framework is related to Karhunen–Loève expansions for Gaussian measures [53, 26, 4], but uses projection as a computational approximation to the MMD objective. Model-based functional clustering—such as Gaussian mixtures—usually assumes latent groups with group-specific low-dimensional structure [9, 17, 36, 31, 35]. Such likelihood-based approaches require a reference measure, which is delicate in infinite dimensions [8, 19, 21]. Distance-based methods [42, 27] avoid this issue, but typically lack the explicit probabilistic and temporal interpretation that a generative model provides.

Kernel mean embeddings and MMD. Kernel mean embeddings represent probability measures as RKHS elements [59, 45, 5], inducing the Maximum Mean Discrepancy (MMD) [33]. For characteristic kernels, MMD metrizes equality in law [60, 57], making it a natural density-free fitting objective. MMD fitting has been used in generative modeling [40, 12], parametric inference [16, 65], kernel-based clustering [29, 43], distributional reinforcement learning [3], and temporal distributional learning [1]. These approaches often rely on closed-form kernel mean embeddings for Gaussian distributions under Gaussian and polynomial kernels [13]; related work also studies MMD testing on function spaces [70]. We extend these formulas and fitting methods from $\mathbb{R}^{d}$ to separable Hilbert spaces, using characteristic-kernel MMD to fit finite Gaussian mixtures with closed-form optimization, approximation guarantees, and projection-consistent computation.

Mixture models. Gaussian mixture models are a classical tool for model-based clustering and density estimation [22, 28]. Extensions beyond Gaussian components have been proposed [64, 62], and more recent work applies mixtures of Gaussian processes to functional data [66, 34, 44] and to more general infinite-dimensional settings [41]. These methods remain largely likelihood- or projected-density-based; in contrast, we estimate Hilbert-space Gaussian mixtures by minimizing MMD, giving a density-free fitting criterion that is more natural in infinite-dimensional regimes.

Digital health. Digital health generates unprecedented amounts of near-continuous data, motivating dynamic, time-varying functional models [1, 23, 24]. Mixture models for these general data structures remain underdeveloped, despite their potential in clinical trials for identifying subgroups of individuals with heterogeneous treatment responses.

2 MMD fitting

To minimize (3), we evaluate kernel expectations under $\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k})$ . Expanding the squared MMD between $P_{n}=\tfrac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}$ and $Q_{\theta}=\sum_{k=1}^{K}\pi_{k}\nu_{k}$ yields

\displaystyle\eqref{eq:objective_emp}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(X_{i},X_{j})-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}\,\mathbb{E}_{Y\sim\nu_{k}}[\kappa(X_{i},Y)]+\sum_{k,s=1}^{K}\pi_{k}\pi_{s}\,\mathbb{E}_{Y\sim\nu_{k},\;Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})].

(4)

Because the first data–data term is constant in $\theta$ , optimization depends entirely on the data-to-component and component-to-component expectations:

J_{i,k}\coloneqq\mathbb{E}_{Y\sim\nu_{k}}[\kappa(X_{i},Y)],\qquad I_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\;Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})].

(5)

To make the objective a tractable differentiable function of $\theta$ , we consider Gaussian and polynomial kernels, for which these expectations admit closed-form expressions (Appendix˜B):

\kappa_{A}(x,y)=\exp\!\left(-\frac{1}{2}\langle x-y,A(x-y)\rangle_{\mathcal{X}}\right)\qquad\text{and}\qquad\kappa(x,y)=(\langle x,y\rangle_{\mathcal{X}}+c)^{p},

where $A\in\mathcal{L}(\mathcal{X})$ is bounded, self-adjoint, and positive semidefinite (often isotropic, $A=\sigma_{\kappa}^{-2}I_{\mathcal{X}}$ ). The finite moments condition required by polynomial kernels are automatically satisfied by both $Q_{\theta}$ and $P_{n}$ .

2.1 Fitting and optimization

The empirical objective (4) naturally splits into two parameter groups: mixture weights $\pi\in\Delta^{K-1}$ and component parameters $\{(m_{k},\mathcal{K}_{k})\}_{k=1}^{K}$ . This enables an alternating scheme that combines a convex weight update with a nonconvex update of the component means and covariances.

To make the dependence on $\pi$ explicit, define the averaged cross-expectation vector $\mathbf{J}\in\mathbb{R}^{K}$ with entries $J_{k}\coloneqq\frac{1}{n}\sum_{i=1}^{n}J_{i,k}$ and the Gram matrix $\mathbf{I}\in\mathbb{R}^{K\times K}$ with entries $I_{k,s}$ . Dropping the (constant) data–data term, minimizing (4) over $\pi$ for fixed components reduces to the quadratic program

\min_{\pi\in\mathbb{R}^{K}}\ \pi^{\top}\mathbf{I}\,\pi-2\mathbf{J}^{\top}\pi\qquad\text{s.t.}\qquad\pi\geq 0,\quad\mathbf{1}^{\top}\pi=1.

(6)

Since $I_{k,s}=\langle\mu_{\nu_{k}},\mu_{\nu_{s}}\rangle_{\mathcal{H}_{\kappa}}$ is a Gram inner product of kernel mean embeddings, $\mathbf{I}$ is positive semi-definite and the program is convex. Geometrically, the optimal $\pi^{\star}$ projects the empirical mean embedding $\mu_{P_{n}}$ onto the convex hull of the component embeddings $\{\mu_{\nu_{k}}\}_{k=1}^{K}$ .

Since joint optimization over the full parameter $\theta$ remains nonconvex, we employ an alternating minimization strategy, analogous to EM [22]. In each iteration, we first solve (6) for $\pi$ with fixed components, and then fix $\pi$ to update the means and covariances via a finite-dimensional gradient step (Section˜2.2). Pseudocode and implementation details are deferred to Section˜B.2.

2.2 Numerical computation

For practical implementation, we project the Hilbert space objects of Section˜2 onto a finite-dimensional subspace. Fixing an orthonormal basis $(e_{j})_{j\geq 1}$ of $\mathcal{X}$ (e.g., cosine or data-driven Karhunen–Loève [53, 4]), we apply the orthogonal projector $\Pi_{M}:\mathcal{X}\to\mathcal{X}_{M}\coloneqq\mathrm{span}\{e_{1},\dots,e_{M}\}$ . This identifies $\mathcal{X}_{M}$ with $\mathbb{R}^{M}$ , allowing each object to be represented by its coordinate vector.

For each component $\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k})$ , let $\mathbf{m}_{k,M}\in\mathbb{R}^{M}$ and $\mathbf{K}_{k,M}\in\mathbb{R}^{M\times M}$ be the coordinate representations of the projected mean $\Pi_{M}m_{k}$ and covariance $\Pi_{M}\mathcal{K}_{k}\Pi_{M}$ , and let $\mathbf{x}_{i,M}\in\mathbb{R}^{M}$ be the coordinates of $\Pi_{M}X_{i}$ . The pushforward $\nu_{k,M}\coloneqq\nu_{k}\circ\Pi_{M}^{-1}$ is then a Gaussian on $\mathbb{R}^{M}$ . The projected mixture and empirical measure on $\mathbb{R}^{M}$ are

Q_{\theta,M}\coloneqq\sum_{s=1}^{K}\pi_{s}\nu_{s,M},\qquad P_{n,M}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{\mathbf{x}_{i,M}}.

The projected cross-expectations are

J^{(M)}_{i,k}\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(\Pi_{M}X_{i},\Pi_{M}Y)\big],\qquad I^{(M)}_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}\big[\kappa(\Pi_{M}Y,\Pi_{M}Y^{\prime})\big].

(7)

Computationally, these expectations are evaluated exactly in $\mathbb{R}^{M}$ using the coordinate vectors $\mathbf{x}_{i,M}$ and the finite-dimensional Gaussian measures $\nu_{k,M}$ . Explicit matrix formulas for (7) under Gaussian and polynomial kernels are deferred to Section˜B.1. The finite-dimensional counterpart of (4) is

\operatorname{MMD}_{\kappa}^{2}(P_{n,M},Q_{\theta,M})=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(\Pi_{M}X_{i},\Pi_{M}X_{j})-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}J^{(M)}_{i,k}+\sum_{k,s=1}^{K}\pi_{k}\pi_{s}I^{(M)}_{k,s}.

(8)

Disregarding the (constant) data–data term, evaluating (8) per iteration costs $O(K^{2}M^{3}+nKM^{2})$ for both the Gaussian radial and polynomial kernels. Moreover, if we assume that the projected covariances are diagonal, the cost reduces to $O(K^{2}M+nKM)$ , making the method highly efficient and scalable to large sample sizes and projection dimensions.

Proposition 2.1 (Consistency of projected MMD).

Let $\kappa$ be a continuous kernel on $\mathcal{X}$ . Assume there exist constants $C,q\geq 0$ such that

|\kappa(x,y)|\leq C\bigl(1+\|x\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr)\qquad\text{for all }x,y\in\mathcal{X},

(9)

and that $\int_{\mathcal{X}}\|x\|_{\mathcal{X}}^{q}\,\mathrm{d}\nu_{k}(x)<\infty$ for all $k=1,\dots,K$ . Then, the projected MMD objective converges to the exact MMD:

\operatorname{MMD}_{\kappa}^{2}(P_{n,M},Q_{\theta,M})\xrightarrow{M\to\infty}\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta})\qquad\text{a.s.}

(10)

Together with the known empirical stability of MMD that controls the error from replacing $P$ by $P_{n}$ (see [65, Proposition A.1]), Proposition 2.1 shows that the finite-dimensional surrogate (8) first approximates the Hilbert-space empirical criterion as $M\to\infty$ , and then the population as $n\to\infty$ . This justifies (8) as a computationally tractable proxy for the infinite-dimensional MMD objective.

Posterior probabilities.

The mixture admits the standard generative interpretation: draw a latent assignment $Z\in\{1,\dots,K\}$ with prior $\mathbb{P}(Z=k)=\pi_{k}$ , then sample $X\mid\{Z=k\}\sim\nu_{k}$ . The posterior component probability, or responsibility, is defined by the Radon–Nikodym derivative

\gamma_{k}(X)\coloneqq\mathbb{P}(Z=k\mid X)=\frac{\,\mathrm{d}(\pi_{k}\nu_{k})}{\,\mathrm{d}Q_{\theta}}(X),\qquad Q_{\theta}\text{-a.s.},

(11)

which is well-defined because $\pi_{k}\nu_{k}\ll Q_{\theta}$ whenever $\pi_{k}>0$ . In the finite-dimensional setting, whenever the projected covariance $\mathbf{K}_{k,M}$ is nonsingular, $\nu_{k,M}$ admits a Lebesgue density $p_{k,M}$ on $\mathbb{R}^{M}$ and (11) reduces to the familiar Bayes formula

\gamma_{k}^{(M)}(x)=\frac{\pi_{k}\,p_{k,M}(x)}{\sum_{s=1}^{K}\pi_{s}\,p_{s,M}(x)}.

(12)

As $M$ grows, $\sigma(\Pi_{M}X)$ accumulates the information of $\sigma(X)$ , yielding the following convergence.

Proposition 2.2.

Let $(Z,X)$ be as above and let $\gamma_{k}^{(M)}\coloneqq\frac{\,\mathrm{d}(\pi_{k}\nu_{k,M})}{\,\mathrm{d}Q_{\theta,M}}$ . Then the projected responsibility satisfies the martingale property

\gamma_{k}^{(M)}(\Pi_{M}X)=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(\Pi_{M}X)]\quad\text{a.s.},

(13)

and $\gamma_{k}^{(M)}(\Pi_{M}X)\to\gamma_{k}(X)$ as $M\to\infty$ both almost surely and in $L^{1}$ .

The proofs of Propositions 2.1 and 2.2 are deferred to Section˜H.1 and Section˜H.2, respectively.

3 Approximation theory

In this section we establish theoretical guarantees that justify the use of finite Gaussian mixtures in Hilbert spaces as a flexible approximation class.

Proposition 3.1 (Wasserstein and MMD density).

Let $\mathcal{X}$ be a separable Hilbert space.

(i)

For every $p\geq 1$ , finite mixtures of Gaussian measures on $\mathcal{X}$ of the form (1) are dense in $\mathcal{P}_{p}(\mathcal{X})$ with respect to the Wasserstein distance $W_{p}$ .
(ii)

Let $\kappa$ be a continuous positive-definite kernel satisfying $\sup_{x\in\mathcal{X}}\kappa(x,x)\leq 1$ . Then, for all $P\in\mathcal{P}(\mathcal{X})$ and all $K\geq 1$ , there exists $Q_{K}$ of the form (1) such that

$\operatorname{MMD}_{\kappa}(P,Q_{K})\leq\frac{2}{\sqrt{K}}.$

Part (i) shows that the model class is highly expressive under Wasserstein metrics. These metrics are widely used in mathematical statistics to study convergence of estimators, resampling schemes such as the bootstrap, and related distributional properties [49]. This result also implies MMD density for characteristic bounded continuous kernels. Part (ii), however, applies to arbitrary probability measures and provides an explicit finite-mixture approximation bound for the MMD discrepancy minimized by our algorithm. The proof of Proposition 3.1 is deferred to Section˜H.3.

4 Numerical experiments

The first goal of this section is to assess the empirical clustering performance of our methods against established approaches in both finite-dimensional and infinite-dimensional regimes. We then focus on a digital health case study to demonstrate the versatility of our methods for analyzing complex biomarkers that may provide new clinical insights in practice. In particular, we summarize the glucose time series using different representations, including random functions and graphs, which can be viewed as functional dynamic objects. Additional numerical comparisons are reported in Appendix˜F.

4.1 Clustering benchmark

We compare the mixture model described in Section˜2.2 against several baselines: (i) partitioning and centroid-based methods, including K-Medoids and K-Means [55, 56]; (ii) agglomerative hierarchical methods, including average-linkage, Ward, and BIRCH [46, 68, 72]; (iii) density-based methods, including DBSCAN and OPTICS [25, 2]; (iv) model-based methods, including Gaussian mixtures [28, 41]; and (v) graph- and dimension-reduction methods such as spectral clustering [47].

Datasets. We evaluate our approach using representative synthetic and real-world datasets.

•

Synthetic environments: (i) $\mathbb{R}^{d}$ : five standard sklearn clustering datasets with $n=500$ , $d=2$ , and $K\in\{2,3\}$ [50]; and (ii) $\mathrm{SO}(3)$ : 10 synthetic datasets of rotations under an Euler-angle parameterization, projected onto Wigner-D matrix coefficients, with $n=200$ and $K=3$ .
•

Real-world datasets: (iii) $L^{2}$ : five standard functional data benchmarks [53, 11, 26, 51, 48]; (iv) CGM: continuous glucose monitoring curves from $n=98$ individuals with $K=2$ , detailed in Section˜4.2; and (v) Molecular: MUTAG graph signals with $n=188$ and $K=2$ [20].

All implementation details and evaluation protocols are reported in Appendix˜G. In particular, clustering performance is evaluated using the Adjusted Rand Index (ARI), a well-established clustering metric that compares unsupervised cluster assignments with available ground-truth labels. We fix $K$ to the known number of ground-truth classes, as our goal here is to compare clustering accuracy.

Results. Although designed for general random objects in Hilbert spaces, our method is competitive on finite-dimensional $\mathbb{R}^{d}$ benchmarks and shows better performance on the CGM and $\mathrm{SO}(3)$ datasets, which fall outside the scope of most existing clustering methods. It also performs well in the infinite-dimensional $L^{2}$ setting. In general, the proposed model remains highly competitive and computationally efficient, while preserving a common probabilistic interpretation. This flexibility enables time-dependent interpretations, as illustrated in the digital health case study.

Table 1: Benchmark of clustering algorithms comparing the operating spaces; the computational complexity of training, inference, and memory as functions of the number of samples

n

and clusters

K

; and clustering performance (ARI, mean

{}_{\text{std}}

) across several datasets grouped by geometric space.

Method	Space	Train	Infer	Memory	$\mathbb{R}^{d}$	$L^{2}$	CGM.	$\mathrm{SO}(3)$	Molecular
MMD GMM (Gaussian)	Hilbert	$O(nK)$	$O(nK)$	$O(nK)$	$0.692_{0.379}$	$0.412_{0.219}$	$\mathbf{0.053}_{0.040}$	$\mathbf{0.867}_{0.140}$	$0.104_{0.032}$
MMD GMM (Polynomial)	Hilbert	$O(nK)$	$O(nK)$	$O(nK)$	$0.615_{0.334}$	$0.376_{0.144}$	$0.051_{0.022}$	$0.675_{0.217}$	$\mathbf{0.151}_{0.009}$
Projected GMM (Appendix D)	Hilbert	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.386_{0.276}$	$0.034_{0.000}$	$0.828_{0.172}$	$0.104_{0.059}$
K-Medoids [37]	Metric	$O(n^{2}K)$	$O(nK)$	$O(n^{2})$	$0.569_{0.328}$	$\mathbf{0.458}_{0.201}$	$0.020_{0.000}$	$0.828_{0.109}$	$-0.005_{0.000}$
Hierarchical (Avg) [27, 39]	Metric	$O(n^{2}\log n)$	—	$O(n^{2})$	$0.555_{0.320}$	$0.234_{0.296}$	$0.002_{0.000}$	$0.490_{0.315}$	$0.081_{0.000}$
DBSCAN [25]	Metric	$O(n)$	—	$O(n)$	$0.844_{0.164}$	$0.000_{0.000}$	$0.000_{0.000}$	$0.627_{0.273}$	$0.074_{0.000}$
HDBSCAN [14]	Metric	$O(n)$	—	$O(n)$	$0.870_{0.171}$	$0.171_{0.227}$	$0.000_{0.000}$	$0.691_{0.227}$	$0.117_{0.000}$
K-Center [32]	Metric	$O(nK)$	$O(nK)$	$O(n)$	$0.392_{0.258}$	$0.150_{0.115}$	$0.002_{0.000}$	$0.188_{0.169}$	$0.074_{0.000}$
Kernel k-Groups [29]	Metric	$O(Kn^{2})$	—	$O(n^{2})$	$0.567_{0.328}$	$0.438_{0.219}$	$0.019_{0.000}$	$0.751_{0.189}$	$0.125_{0.000}$
FDA K-Means [42]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.451_{0.196}$	$0.023_{0.000}$	—	—
FDA Fuzzy C-Means [6]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.396_{0.157}$	$0.020_{0.000}$	—	—
Funclust [35]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.451_{0.186}$	$0.015_{0.000}$	—	—
FunHDDC [9]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.419_{0.217}$	$0.013_{0.000}$	—	—
fclust [36]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.430_{0.218}$	$0.030_{0.000}$	—	—
K-Centres [17]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.361_{0.157}$	$0.001_{0.000}$	—	—
Curvclust [31]	$L^{2}$	$O(nK)$	$O(nK)$	$O(nK)$	—	$0.383_{0.250}$	$0.012_{0.000}$	—	—
MiniBatch KMeans [56]	$\mathbb{R}^{d}$	$O(nK)$	$O(nK)$	$O(n)$	$0.554_{0.327}$	—	—	—	—
Spectral [47]	$\mathbb{R}^{d}$	$O(n^{3})$	—	$O(n^{2})$	$\mathbf{0.949}_{0.074}$	—	—	—	—
Ward [68]	$\mathbb{R}^{d}$	$O(n^{2})$	—	$O(n^{2})$	$0.613_{0.330}$	—	—	—	—
BIRCH [72]	$\mathbb{R}^{d}$	$O(n)$	$O(nK)$	$O(n)$	$0.525_{0.303}$	—	—	—	—
Gaussian Mixture (EM) [28]	$\mathbb{R}^{d}$	$O(nK)$	$O(nK)$	$O(nK)$	$0.683_{0.387}$	—	—	—	—
Affinity Propagation [30]	$\mathbb{R}^{d}$	$O(n^{2})$	—	$O(n^{2})$	$0.489_{0.308}$	—	—	—	—
MeanShift [71]	$\mathbb{R}^{d}$	$O(n^{2})$	$O(n)$	$O(n)$	$0.564_{0.328}$	—	—	—	—
OPTICS [2]	$\mathbb{R}^{d}$	$O(n^{2})$	—	$O(n)$	$0.809_{0.219}$	—	—	—	—

4.2 Case study: different representations of digital health time series

We now illustrate the proposed model in a digital health case study based on longitudinal continuous glucose monitoring (CGM) data from a juvenile clinical trial [67]. A central question in longitudinal studies is whether and how the distribution of individuals changes over time, for example, in response to a treatment [38, 18]. In the original article [67], the authors demonstrate that the new insulin pump system leads to improvements in several glycemic metrics before and after the intervention. Here, we consider these outcomes in a continuous-time formulation and use a dynamic functional representation of glucose trajectories. Digital health technologies make it possible to study this question at high temporal resolution. Starting from multi-day glucose time series, we construct three daily representations: (i) functional observations in $L^{2}$ ; (ii) correlation matrices describing the dependence between glucose values at different hours of the day; and (iii) Laplacian graphs encoding similarities in glucose dynamics across hours. Similar representations have been considered as biomarkers in clinical fMRI studies [61, 69].

To analyze temporal changes across these three CGM representations, we extend (1) by allowing the mixture weights to vary over time while sharing the component parameters, namely the mean and covariance operator parameters of the Gaussian mixture model. Our formulation is similar to [1]. However, in that work the authors propose a specific generative model to model distributions through finite-dimensional random examples over time; here, the approach is more general.

Background and data.

Continuous glucose monitors record interstitial glucose concentration every 5 minutes, producing 288 measurements per day. We represent each individual-day by its intraday glucose curve $f\colon[0,24]\to\mathbb{R}$ . The data come from a randomized controlled diabetes study evaluating the effect of a closed-loop insulin delivery system over a period of more than six months [63]. For each individual, a sliding window of $W$ days with stride 1 produces one observation per window. Within each window, the 288 time slots are averaged across valid days; slots with fewer than $\lceil 0.5\,W\rceil$ valid readings are linearly interpolated from neighboring slots. After preprocessing, the dataset comprises 98 individuals, with 34 in the control group and 64 in the treatment group.

Time-varying mixture weights.

Let $X_{1},\dots,X_{n}$ be i.i.d. copies of an $\mathcal{X}$ -valued measurable stochastic process $X(t)$ . For each $t$ , denote $P(t)$ the law of $X(t)$ , and define the empirical distribution by $P_{n}(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}(t)}$ . We fit the model

Q_{\theta}(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\,\mathcal{N}(m_{k},\mathcal{K}_{k}),\qquad\pi(t)\coloneqq\bigl(\pi_{1}(t),\dots,\pi_{K}(t)\bigr)\in\Delta^{K-1}.

(14)

Conceptually, the shared components $(m_{k},\mathcal{K}_{k})$ capture fundamental, stationary data archetypes, such as distinct physiological regimes or individual phenotypes, while the time-varying weight vector $\pi(t)$ describes population-level shifts between these archetypes over time.

Following the kernel $K$ -groups framework [29], we select $K$ by evaluating the within-cluster MMD across candidate values and applying the elbow rule. Model fitting minimizes the integrated MMD²,

\min_{\theta(t)=\{\pi_{k}(t),m_{k},\mathcal{K}_{k}\}_{k}}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t.

(15)

We parameterize $\pi(t)=\mathrm{softmax}\bigl(\mathbf{z}(t)\bigr)$ , where the logit path $\mathbf{z}\colon[0,T]\to\mathbb{R}^{K}$ is the solution to a neural ODE [15], initialized at $\mathbf{z}(0)=\log\pi(0)$ . This parameterization introduces a natural bias toward smooth temporal dynamics. Further details on density and consistency results, the projected $\mathbb{R}^{M}$ approximation of the integrated objective, and implementation specifics are deferred to Appendix˜C.

Group-level comparison.

To compare the control and treatment arms, we compute each individual’s posterior cluster membership $\gamma_{k}(x_{i},t)$ at each time point and aggregate these memberships into group-level trajectories. Specifically, for each group $g\in\{\mathrm{control},\mathrm{treatment}\}$ , we compute

\bar{\gamma}_{g}(t)=(\bar{\gamma}_{g,1}(t),\dots,\bar{\gamma}_{g,K}(t)),\qquad\bar{\gamma}_{g,k}(t)\coloneqq\frac{1}{|G_{g}(t)|}\sum_{i\in G_{g}(t)}\gamma_{k}(x_{i},t),

which averages posterior probabilities across all individuals in group $g$ observed at time $t$ . We then quantify the divergence between study arms using the total variation distance $\mathrm{TV}(t)=\frac{1}{2}\left\|\bar{\gamma}_{\mathrm{treat}}(t)-\bar{\gamma}_{\mathrm{ctrl}}(t)\right\|_{1}.$ The metric ranges from 0 to 1, where 0 reflects identical group-level cluster usage and 1 reflects complete separation. It provides an interpretable way to quantify treatment-induced distributional shifts over time.

4.2.1 Results

Intraday curves. Each individual-window is represented as an intraday glucose curve in Sobolev space $H^{1}(0,24;\mathbb{R})$ , continuous profiles are projected onto a truncated cosine basis and fitted with (15) using $K=2$ latent clusters. The two learned means contrast an elevated, variable profile ( $m_{1}$ ) with a flatter, well-regulated profile ( $m_{2}$ ); the empirical and model-predicted mean surfaces agree closely (Figure˜3, left). While the control group remains approximately time-invariant, the treatment group gradually shifts toward $m_{2}$ , with $\mathrm{TV}$ distance increasing from $\approx 0.11$ at baseline to $\approx 0.19$ at the end of the trial. This is consistent with an improvement in glycemic regulation over the course of the clinical trial.

Correlation matrices. Instead of clustering intraday curves, we cluster the running $24\times 24$ inter-hour correlation matrices in the Hilbert space $\mathrm{Sym}(24)$ equipped with the Frobenius inner product, thereby capturing how the dependence structure between hours of the day evolves over the study. Each sliding window ( $W=14$ days) is converted to 24 hourly averages per day, from which the sample correlation matrix is calculated using linear shrinkage toward the identity, with intensity $\alpha=0.1$ , and then projected onto the PSD cone. Symmetric matrices are embedded through the scaled vectorization mapping into $\mathbb{R}^{300}$ , which preserves the Frobenius distance isometrically, and are fitted with the temporal mixture framework using $K=3$ .

The learned clusters in Figure˜4 identify three inter-hour dependence patterns: moderate local correlation $m_{1}$ , broadly strong correlation $m_{2}$ , and sharper early-hour dependence $m_{3}$ . Over time, membership shifts from $m_{3}$ toward the more regulated $m_{1}$ regime, consistent with improved glycemic regulation over time.

Individual similarity graph signals. We now move from hour- to individual-level structure. For each sliding window ( $W=21$ days), we compute $H^{1}$ -pairwise individual similarities to get a time-varying similarity matrix $S(t)\in\mathbb{R}^{N\times N}$ . A fixed individual graph is then built by thresholding the time-averaged similarity matrix at the 85th percentile, and the graph Laplacian eigenbasis (using $R=4$ eigenvectors) is used to project each $S(t)$ onto graph signals. The mixture model with $K=3$ clusters these projected matrices over time. As the trial progresses (Figure˜5), the network structure evolves from diffuse inter-group connections toward denser within-group similarity in the treatment arm, consistent with individuals progressively achieving improved glycemic control.

5 Final remarks

The main contribution of this paper is a Gaussian mixture model for random objects in separable Hilbert spaces, estimated via an MMD-based optimization procedure. A key practical and methodological advantage of the proposed approach is that it bypasses the need for densities or dominating measures, which are problematic in infinite-dimensional settings. In addition, the framework flexibly handles diverse data structures, is efficient, and demonstrates competitive empirical performance.

The main limitation of the method is its scalability in large-scale biomedical studies. This could be mitigated through subsampling strategies [10] or other computational approximation techniques. Future work will focus on developing specialized temporal algorithms for particular Hilbert spaces, extending the framework to general metric spaces, and quantifying the uncertainty of the parameter and prediction of the mixture model.

References

[1] A. Álvarez-López and M. Matabuena (2026) Continuous-time learning of probability distributions: a case study in a digital trial of young children with type 1 diabetes. External Links: 2603.24427 Cited by: §F.2, Table 2, 1st item, §1, §1, §4.2.
[2] M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander (1999) OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60. Cited by: §4.1, Table 1.
[3] M. Antonetti, H. Donãncio, and F. Forbes (2025) An analysis of distributional reinforcement learning with Gaussian mixtures. Cited by: §1.
[4] X. Bay and J. Croix (2019) Karhunen–Loève decomposition of Gaussian measures on Banach spaces. 39 (2), pp. 279–297. Cited by: Appendix E, Appendix E, Corollary E.1, §1, §2.2.
[5] A. Berlinet and C. Thomas-Agnan (2004) Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer US. Cited by: §1.
[6] J. C. Bezdek (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. Springer US. Cited by: Table 1.
[7] C. M. Bishop (2006) Pattern recognition and machine learning. Information Science and Statistics, Springer. Cited by: §F.1.
[8] V. Bogachev (1998) Gaussian Measures. Mathematical Surveys and Monographs, Vol. 62, American Mathematical Society. Cited by: §1.
[9] C. Bouveyron and J. Jacques (2011) Model-based clustering of time series in group-specific functional subspaces. 5 (4), pp. 281–300. Cited by: §1, Table 1.
[10] J. R. Bradley (2021) An approach to incorporate subsampling into a generic bayesian hierarchical model. Journal of Computational and Graphical Statistics 30 (4), pp. 889–905. Cited by: §5.
[11] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (2017) Classification And Regression Trees. Routledge. Cited by: Table 2, 2nd item.
[12] F. Briol, A. Barp, A. B. Duncan, and M. Girolami (2019)Statistical inference for generative models with maximum mean discrepancy(Website) Cited by: §1.
[13] F. Briol, A. Gessner, T. Karvonen, and M. Mahsereci (2025)A Dictionary of Closed-Form Kernel Mean Embeddings(Website) Cited by: 1st item, §1.
[14] R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander (2015) Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. 10 (1), pp. 1–51. Cited by: Appendix F, Table 1.
[15] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2019)Neural Ordinary Differential Equations(Website) Cited by: §4.2.
[16] B. Chérief-Abdellatif and P. Alquier (2022) Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. 28 (1), pp. 181–213. Cited by: §1.
[17] J. Chiou and P. Li (2007) Functional Clustering and Identifying Substructures of Longitudinal Data. 69 (4), pp. 679–699. Cited by: §1, Table 1.
[18] W. Clarke and B. Kovatchev (2009) Statistical tools to analyze continuous glucose monitor data. 11, pp. S–45. Cited by: §4.2.
[19] G. Da Prato and J. Zabczyk (2014) Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and Its Applications, Cambridge University Press. Cited by: §1, §1.
[20] A. K. Debnath, R. L. Lopez De Compadre, G. Debnath, A. J. Shusterman, and C. Hansch (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. 34 (2), pp. 786–797. Cited by: Table 2, 2nd item.
[21] A. Delaigle and P. Hall (2010) Defining probability density for a distribution of random functions. 38 (2). Cited by: §1.
[22] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum Likelihood from Incomplete Data Via the \mkbibemphEM Algorithm. 39 (1), pp. 1–22. Cited by: Appendix D, §1, §1, §2.1.
[23] P. Dubey and H. Müller (2020) Functional models for time-varying random objects. 82 (2), pp. 275–327. Cited by: §1, §1.
[24] P. Dubey and H. Müller (2022) Modeling time-varying random objects and dynamic networks. 117 (540), pp. 2252–2267. Cited by: §1, §1.
[25] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 226–231. Cited by: Appendix F, §4.1, Table 1.
[26] F. Ferraty and P. Vieu (2006) Nonparametric Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, 2nd item.
[27] L. Ferreira and D. B. Hitchcock (2009) A Comparison of Hierarchical Methods for Clustering Functional Data. 38 (9), pp. 1925–1949. Cited by: §1, Table 1.
[28] C. Fraley and A. E. Raftery (2002) Model-Based Clustering, Discriminant Analysis, and Density Estimation. 97 (458), pp. 611–631. Cited by: §1, §1, §4.1, Table 1.
[29] G. França, M. L. Rizzo, and J. T. Vogelstein (2017)Kernel k-Groups via Hartigan’s Method(Website) arXiv.org. Cited by: Appendix C, §1, §4.2, Table 1.
[30] B. J. Frey and D. Dueck (2007) Clustering by Passing Messages Between Data Points. 315 (5814), pp. 972–976. Cited by: Table 1.
[31] M. Giacofci, S. Lambert‐Lacroix, G. Marot, and F. Picard (2013) Wavelet‐Based Clustering for Mixed‐Effects Functional Models in High Dimension. 69 (1), pp. 31–40. Cited by: §1, Table 1.
[32] T. F. Gonzalez (1985) Clustering to minimize the maximum intercluster distance. 38, pp. 293–306. Cited by: Appendix F, Table 1.
[33] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A Kernel Two-Sample Test. 13 (25), pp. 723–773. Cited by: §1, §1.
[34] M. Huang, R. Li, H. Wang, and W. Yao (2014) Estimating mixture of gaussian processes by kernel smoothing. 32 (2), pp. 259–270. Cited by: §1.
[35] J. Jacques and C. Preda (2013) Funclust: A curves clustering method using functional random variables density approximation. 112, pp. 164–171. Cited by: §1, Table 1.
[36] G. M. James and C. A. Sugar (2003) Clustering for Sparsely Sampled Functional Data. 98 (462), pp. 397–408. Cited by: §1, Table 1.
[37] L. Kaufman and P. J. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, Wiley. Cited by: Table 1.
[38] J. S. Krouwer and G. S. Cembrowski (2010) A review of standards and statistics used to describe blood glucose monitor performance. 4 (1), pp. 75–83. Cited by: §4.2.
[39] G. N. Lance and W. T. Williams (1967) A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. 9 (4), pp. 373–380. Cited by: Table 1.
[40] Y. Li, K. Swersky, and R. Zemel (2015) Generative Moment Matching Networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1718–1727. Cited by: §1.
[41] H. Lu, J. Jia, and D. Meng (2026)Sequential Monte Carlo with Gaussian Mixture Approximation for Infinite-Dimensional Statistical Inverse Problems(Website) Cited by: Appendix D, Appendix F, §1, §1, §4.1.
[42] M. Luz López García, R. García-Ródenas, and A. González Gómez (2015) K-means algorithms for functional data. 151, pp. 231–245. Cited by: §1, Table 1.
[43] M. Matabuena, J. C. Vidal, O. H. M. Padilla, and D. Sejdinovic (2025) Kernel biclustering algorithm in Hilbert spaces. Cited by: §1.
[44] I. C. McDowell, D. Manandhar, C. M. Vockley, A. K. Schmid, T. E. Reddy, and B. E. Engelhardt (2018) Clustering gene expression time series data using an infinite gaussian process mixture model. 14 (1), pp. e1005896. Cited by: §1.
[45] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf (2017) Kernel Mean Embedding of Distributions: A Review and Beyond. 10 (1–2), pp. 1–141. Cited by: §1.
[46] F. Murtagh and P. Contreras (2012) Algorithms for hierarchical clustering: an overview. 2 (1), pp. 86–97. Cited by: Appendix F, §4.1.
[47] A. Y. Ng, M. I. Jordan, and Y. Weiss (2001) On spectral clustering: analysis and an algorithm. In Proceedings of the 15th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pp. 849–856. Cited by: §4.1, Table 1.
[48] R. T. Olszewski (2001) Generalized feature extraction for structural pattern recognition in time-series data. Carnegie Mellon University. Cited by: Table 2, 2nd item.
[49] V. M. Panaretos and Y. Zemel (2019) Statistical aspects of wasserstein distances. Annual Review of Statistics and Its Application 6, pp. 405–431. Cited by: §3.
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: Machine learning in python. 12, pp. 2825–2830. Cited by: Table 2, 1st item.
[51] C. Preda, Q. Grimonprez, and V. Vandewalle (2021) Categorical functional data analysis. the cfda r package. 9 (23), pp. 3074. Cited by: Table 2, 2nd item.
[52] C. Ramos-Carreño, J. L. Torrecilla, M. Carbajo-Berrocal, P. Marcos, and A. Suárez (2024) Scikit-Fda : A \mkbibemphPython Package for Functional Data Analysis. 109 (2). Cited by: Appendix F.
[53] J. O. Ramsay and B. W. Silverman (2005) Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, §2.2, 2nd item.
[54] C. E. Rasmussen and C. K. I. Williams (2008) Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press. Cited by: §F.2.
[55] E. Schubert and P. J. Rousseeuw (2021) Fast and eager k -medoids clustering: O ( k ) runtime improvement of the PAM, CLARA, and CLARANS algorithms. 101, pp. 101804. Cited by: Appendix F, §4.1.
[56] D. Sculley (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Cited by: §4.1, Table 1.
[57] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. 41 (5). Cited by: §1.
[58] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016)NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis(Website) Cited by: §F.8.
[59] A. Smola, A. Gretton, L. Song, and B. Schölkopf (2007) A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory, M. Hutter, R. A. Servedio, and E. Takimoto (Eds.), pp. 13–31. Cited by: §1.
[60] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet (2011) Universality, Characteristic Kernels and RKHS Embedding of Measures. 12 (70), pp. 2389–2410. Cited by: §1, §1.
[61] K. Supekar, V. Menon, D. Rubin, M. Musen, and M. D. Greicius (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer’s Disease. 4 (6), pp. e1000100. Cited by: §4.2.
[62] C. Tang, N. Lenssen, Y. Wei, and T. Zheng (2023) Wasserstein Distributional Learning via Majorization-Minimization. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 10703–10731. Cited by: §1.
[63] M. Tauschmann, H. Thabit, L. Bally, J. M. Allen, S. Hartnell, M. E. Wilinska, Y. Ruan, J. Sibayan, C. Kollman, P. Cheng, R. W. Beck, C. L. Acerini, M. L. Evans, D. B. Dunger, D. Elleri, F. Campbell, R. M. Bergenstal, A. Criego, V. N. Shah, L. Leelarathna, R. Hovorka, B. Alvarado, C. Ashanti, J. Baggott, K. Balakrishnan, N. Barber, L. Bath, S. Beasley, C. Beatson, S. Borgman, S. Bradshaw, B. Bugielski, A. Carlson, E. Collett, J. Curtis, J. Demmitt, D. Donahue, J. Exall, R. Forshaw, J. Hayes, S. Heath, A. Hellmann, V. Huegel, J. Hyatt, L. James, H. Joseph, P. Joshee, W. Konerza, J. Lum, M. Madden, T. Martens, C. McCarthy, M. McDonald, V. Mikityuk, H. Miles, D. Miller, W. Mubita, C. Murphy, B. Olson, R. Pad, N. Patibandla, K. Riding, A. Shaju, L. Thomas, J. Thomson, D. White, S. Yau, and J. Yong (2018) Closed-loop insulin delivery in suboptimally controlled type 1 diabetes: a multicentre, 12-week randomised trial. 392 (10155), pp. 1321–1329. Cited by: Table 2, §4.2.
[64] D. M. Titterington, A. F. M. Smith, and U. E. Makov (1985) Statistical analysis of finite mixture distributions. Wiley. Cited by: §1.
[65] I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet (2017) Minimax Estimation of Kernel Mean Embeddings. 18 (86), pp. 1–47. Cited by: §1, §2.2.
[66] V. Tresp (2000) Mixtures of Gaussian Processes. In Advances in Neural Information Processing Systems, Vol. 13. Cited by: §F.2, §1.
[67] R. P. Wadwa, Z. W. Reed, B. A. Buckingham, M. D. DeBoer, L. Ekhlaspour, G. P. Forlenza, M. Schoelwer, J. Lum, C. Kollman, R. W. Beck, et al. (2023) Trial of hybrid closed-loop control in young children with type 1 diabetes. New England Journal of Medicine 388 (11), pp. 991–1001. Cited by: §1, §4.2.
[68] J. H. Ward (1963) Hierarchical Grouping to Optimize an Objective Function. 58 (301), pp. 236–244. Cited by: §4.1, Table 1.
[69] T. Wu, Y. Ma, Z. Zheng, S. Peng, X. Wu, D. Eidelberg, and P. Chan (2015) Parkinson’s Disease-Related Spatial Covariance Pattern identified with Resting-State Functional MRI. 35 (11), pp. 1764–1770. Cited by: §4.2.
[70] G. Wynne and A. B. Duncan (2022) A kernel two-sample test for functional data. 23 (73), pp. 1–51. Cited by: §H.7, §1.
[71] Yizong Cheng (1995) Mean shift, mode seeking, and clustering. 17 (8), pp. 790–799. Cited by: Table 1.
[72] T. Zhang, R. Ramakrishnan, and M. Livny (1996) BIRCH: an efficient data clustering method for very large databases. 25 (2), pp. 103–114. Cited by: §4.1, Table 1.

Appendix A Background

This section establishes the notation and foundational properties of Gaussian measures and kernel methods used throughout this work. Appendix E discusses the extension of this framework to Banach spaces.

A.1 Gaussian measures on separable Hilbert spaces

Let $\mathcal{X}$ be a separable real Hilbert space equipped with its Borel $\sigma$ -algebra $\mathcal{B}(\mathcal{X})$ . A Borel probability measure $\nu$ on $\mathcal{X}$ is said to be Gaussian if, for every continuous linear functional $f\in\mathcal{X}^{*}$ , the pushforward measure $\nu\circ f^{-1}$ is a Gaussian measure on $\mathbb{R}$ . By the Riesz representation theorem, we routinely identify $\mathcal{X}^{*}$ with $\mathcal{X}$ .

A Gaussian measure $\nu$ is uniquely determined by its mean $m\in\mathcal{X}$ and its covariance operator $\mathcal{K}:\mathcal{X}\to\mathcal{X}$ , which are defined via the relations

\langle m,h\rangle_{\mathcal{X}}=\int_{\mathcal{X}}\langle x,h\rangle_{\mathcal{X}}\,\mathrm{d}\nu(x),\quad h\in\mathcal{X},

and

\langle\mathcal{K}g,h\rangle_{\mathcal{X}}=\int_{\mathcal{X}}\langle x-m,g\rangle_{\mathcal{X}}\langle x-m,h\rangle_{\mathcal{X}}\,\mathrm{d}\nu(x),\quad g,h\in\mathcal{X}.

We denote such a measure by $\nu=\mathcal{N}(m,\mathcal{K})$ . The covariance operator $\mathcal{K}$ is necessarily self-adjoint, positive semidefinite, and trace-class. The latter property implies that for any orthonormal basis $(e_{i})_{i\geq 1}$ of $\mathcal{X}$ , the trace is finite:

\operatorname{tr}(\mathcal{K})=\sum_{i=1}^{\infty}\langle\mathcal{K}e_{i},e_{i}\rangle_{\mathcal{X}}<\infty.

The case $\mathcal{K}=0$ corresponds to the Dirac measure $\delta_{m}$ ; this case is excluded from the model class in the main text. In particular, if $X\sim\mathcal{N}(m,\mathcal{K})$ , its moments satisfy $\mathbb{E}\|X-m\|_{\mathcal{X}}^{2}=\operatorname{tr}(\mathcal{K})$ and $\mathbb{E}\|X\|_{\mathcal{X}}^{2}=\|m\|_{\mathcal{X}}^{2}+\operatorname{tr}(\mathcal{K})$ .

A.2 Kernel mean embeddings and Maximum Mean Discrepancy

Let $\kappa:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a Borel measurable, positive-definite kernel with an associated reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{\kappa}$ . For probability measures $P,Q\in\mathcal{P}(\mathcal{X})$ , we assume the following integrability condition holds:

\mathbb{E}_{x\sim P}\left[\sqrt{\kappa(x,x)}\right]<\infty,\quad\mathbb{E}_{y\sim Q}\left[\sqrt{\kappa(y,y)}\right]<\infty.

(16)

Under this assumption, the kernel mean embeddings

\mu_{P}\coloneqq\mathbb{E}_{x\sim P}[\kappa(x,\cdot)],\quad\mu_{Q}\coloneqq\mathbb{E}_{y\sim Q}[\kappa(y,\cdot)]

are well-defined as Bochner integrals in $\mathcal{H}_{\kappa}$ . The squared Maximum Mean Discrepancy (MMD) between $P$ and $Q$ is defined as the distance between their embeddings:

	$\displaystyle\operatorname{MMD}_{\kappa}^{2}(P,Q)$	$\displaystyle\coloneqq\\|\mu_{P}-\mu_{Q}\\|_{\mathcal{H}_{\kappa}}^{2}$
		$\displaystyle=\mathbb{E}_{x,x^{\prime}\sim P}[\kappa(x,x^{\prime})]-2\mathbb{E}_{x\sim P,y\sim Q}[\kappa(x,y)]+\mathbb{E}_{y,y^{\prime}\sim Q}[\kappa(y,y^{\prime})],$		(17)

where $x,x^{\prime}$ are i.i.d. samples from $P$ and $y,y^{\prime}$ are i.i.d. samples from $Q$ . Note that if $\kappa$ is bounded, condition (16) is satisfied for all probability measures.

A kernel is called characteristic if the embedding map $P\mapsto\mu_{P}$ is injective. In this case, $\operatorname{MMD}_{\kappa}(P,Q)=0$ if and only if $P=Q$ , meaning the MMD induces a true metric on the space of probability measures.

A.3 Fourier interpretation of the Gaussian MMD

The Gaussian radial kernel, given by

\kappa(x,y)\coloneqq\exp\left(-\frac{\|x-y\|_{\mathcal{X}}^{2}}{2\sigma_{\kappa}^{2}}\right),\quad\sigma_{\kappa}>0,

(18)

is both bounded and characteristic on separable Hilbert spaces.

To gain intuition into the geometry induced by this kernel, we briefly consider the finite-dimensional case $\mathcal{X}=\mathbb{R}^{d}$ . For $P,Q\in\mathcal{P}(\mathbb{R}^{d})$ , let $\varphi_{P}$ and $\varphi_{Q}$ denote their respective characteristic functions, $\varphi_{P}(\omega)\coloneqq\int_{\mathbb{R}^{d}}e^{i\omega^{\top}x}\,\mathrm{d}P(x)$ .

By Bochner’s theorem, the shift-invariant kernel (18) admits a Fourier representation with a Gaussian spectral density $\psi_{\sigma_{\kappa}}$ :

\kappa(x,y)=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}e^{i\omega^{\top}(x-y)}e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2}\,\mathrm{d}\omega.

By defining the signed measure $\mu\coloneqq P-Q$ and applying Fubini’s theorem to the expanded MMD objective (17), we obtain:

	$\displaystyle\operatorname{MMD}_{\kappa}^{2}(P,Q)$	$\displaystyle=\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\kappa(x,y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y)$
		$\displaystyle=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\left\|\int_{\mathbb{R}^{d}}e^{i\omega^{\top}x}\,\mathrm{d}\mu(x)\right\|^{2}e^{-\sigma_{\kappa}^{2}\\|\omega\\|^{2}/2}\,\mathrm{d}\omega$
		$\displaystyle=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\big\|\varphi_{P}(\omega)-\varphi_{Q}(\omega)\big\|^{2}e^{-\sigma_{\kappa}^{2}\\|\omega\\|^{2}/2}\,\mathrm{d}\omega.$

This formulation reveals a frequency-domain perspective: the squared Gaussian MMD is a weighted $L^{2}$ distance between the characteristic functions of the two distributions. The Gaussian weight $e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2}$ acts as a low-pass filter, effectively penalizing discrepancies in low-frequency modes (global structural differences) while naturally smoothing out high-frequency noise.

Appendix B Closed-form MMD expressions

To evaluate the cross-expectations $J_{i,k}$ and $I_{k,s}$ of (5) in the infinite-dimensional setting, we rely on the Fredholm determinant. Recall that for any trace-class operator $T\in\mathcal{L}(\mathcal{X})$ , the determinant $\det_{F}(I_{\mathcal{X}}+T)$ is well-defined and serves as the natural generalization of the standard matrix determinant.

Proposition B.1 (MMD for the Gaussian kernel).

Let $A\in\mathcal{L}(\mathcal{X})$ be a self-adjoint, positive semi-definite operator, and consider the Gaussian kernel induced by $A$ ,

\kappa_{A}(x,y)\coloneqq\exp\left(-\frac{1}{2}\big\langle x-y,A(x-y)\big\rangle_{\mathcal{X}}\right).

(19)

Then the expectations in (5) are given by

	$\displaystyle J_{i,k}$	$\displaystyle=\frac{\exp\Bigl(-\tfrac{1}{2}\big\langle X_{i}-m_{k},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}\mathcal{K}_{k}A^{1/2}\bigr)^{-1}A^{1/2}(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}\mathcal{K}_{k}A^{1/2}\bigr)^{1/2}},$		(20)
	$\displaystyle I_{k,s}$	$\displaystyle=\frac{\exp\Bigl(-\tfrac{1}{2}\big\langle m_{k}-m_{s},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{-1}A^{1/2}(m_{k}-m_{s})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{1/2}}.$		(21)

Proposition B.2 (MMD for the polynomial kernel).

Fix $p\in\mathbb{N}$ , $c\geq 0$ , and consider the kernel

\kappa(x,y)\coloneqq(\langle x,y\rangle_{\mathcal{X}}+c)^{p}.

(22)

Then the expectations in (5) are given by

	$\displaystyle J_{i,k}$	$\displaystyle=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}^{t}\,\left(\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c\right)^{p-2t},$
	$\displaystyle I_{k,s}$	$\displaystyle=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\big\langle M_{r,k},M_{r,s}\big\rangle_{\mathcal{X}^{\otimes r}},\hskip 18.49988pt\text{with }M_{r,k}\coloneqq\mathbb{E}_{y\sim\nu_{k}}[y^{\otimes r}]\in\mathcal{X}^{\otimes r},\qquad M_{0,k}=1.$

The formal proofs of Propositions B.1 and B.2 are given in Sections˜H.7 and H.8.

B.1 Numerical computations and algorithm description

Given an orthonormal basis $(e_{r})_{r=1}^{M}$ of $\mathcal{X}_{M}$ , we represent observations and means by their coordinate vectors

\mathbf{x}_{i}\coloneqq\big(\langle X_{i},e_{r}\rangle_{\mathcal{X}}\big)_{r=1}^{M}\in\mathbb{R}^{M},\qquad\mathbf{m}_{k}\coloneqq\big(\langle m_{k},e_{r}\rangle_{\mathcal{X}}\big)_{r=1}^{M}\in\mathbb{R}^{M},

(23)

and the covariance operator $\mathcal{K}_{k,M}$ by its matrix representation $\mathbf{K}_{k}\in\mathbb{R}^{M\times M}$ with entries

(\mathbf{K}_{k})_{r\ell}\coloneqq\langle\mathcal{K}_{k,M}e_{\ell},e_{r}\rangle_{\mathcal{X}}\qquad 1\leq r,\ell\leq M.

(24)

In these coordinates, the pushforward measure $\nu_{k,M}$ becomes the multivariate normal $\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k})$ on $\mathbb{R}^{M}$ . Using these coordinate representations, we now provide explicit formulas for the finite-dimensional approximations.

Projected responsibilities.

The pushforward $\nu_{k,M}=\nu_{k}\circ\Pi_{M}^{-1}$ is a multivariate normal distribution on $\mathbb{R}^{M}$ ,

\nu_{k,M}=\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k}).

If each $\mathbf{K}_{k}$ is positive definite, then each $\nu_{k,M}$ admits the Lebesgue density

p_{k}^{(M)}(\mathbf{x})\coloneqq(2\pi)^{-M/2}\det(\mathbf{K}_{k})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m}_{k})^{\top}\mathbf{K}_{k}^{-1}(\mathbf{x}-\mathbf{m}_{k})\Big\}.

In this case the projected responsibilities satisfy, for $i\in\{1,\dots,n\}$ and $k\in\{1,\dots,K\}$ ,

\gamma^{(M)}_{i,k}\coloneqq\gamma_{k}^{(M)}(\mathbf{x}_{i})=\frac{\pi_{k}p_{k}^{(M)}(\mathbf{x}_{i})}{\sum_{s=1}^{K}\pi_{s}p_{s}^{(M)}(\mathbf{x}_{i})}.

If some $\mathbf{K}_{k}$ are singular, the corresponding $\nu_{k,M}$ are supported on affine subspaces; the Radon–Nikodym definition of $\gamma_{k}^{(M)}$ still applies and one may compute densities on the supports or use a small ridge regularization.

1. Gaussian radial kernel (Proposition B.1) Let $\mathbf{A}\in\mathbb{R}^{M\times M}$ denote the matrix representation of the restricted operator $A|_{\mathcal{X}_{M}}$ in the basis $(e_{r})_{r=1}^{M}$ , with entries $\mathbf{A}_{r\ell}=\langle Ae_{\ell},e_{r}\rangle_{\mathcal{X}}$ . The closed forms of Proposition B.1 become the finite-dimensional approximations

	$\displaystyle J^{(M)}_{i,k}$	$\displaystyle\coloneqq\det\!\big(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}\big)^{-1/2}$
		$\displaystyle\qquad\times\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}_{i}-\mathbf{m}_{k})^{\top}\mathbf{A}^{1/2}\big(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}\big)^{-1}\mathbf{A}^{1/2}(\mathbf{x}_{i}-\mathbf{m}_{k})\Big\},$
	$\displaystyle I^{(M)}_{k,s}$	$\displaystyle\coloneqq\det\!\big(I_{M}+\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}\big)^{-1/2}$
		$\displaystyle\qquad\times\exp\!\Big\{-\tfrac{1}{2}(\mathbf{m}_{k}-\mathbf{m}_{s})^{\top}\mathbf{A}^{1/2}\big(I_{M}+\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}\big)^{-1}\mathbf{A}^{1/2}(\mathbf{m}_{k}-\mathbf{m}_{s})\Big\},$

and we set

J^{(M)}_{k}\coloneqq\frac{1}{n}\sum_{i=1}^{n}J^{(M)}_{i,k},\qquad\mathbf{J}^{(M)}\coloneqq(J^{(M)}_{1},\dots,J^{(M)}_{K})^{\top},\qquad\mathbf{I}^{(M)}\coloneqq(I^{(M)}_{k,s})_{k,s=1}^{K}.

For the isotropic case $\mathbf{A}=\sigma_{\kappa}^{-2}I_{M}$ , one writes $\alpha=-1/(2\sigma_{\kappa}^{2})$ and recovers the equivalent forms $\det(I_{M}-2\alpha\mathbf{K}_{k})^{-1/2}$ and $(I_{M}-2\alpha\mathbf{K}_{k})^{-1}$ in the exponents.

For numerically stable log-domain evaluation, compute the symmetric eigendecompositions

\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}=\mathbf{V}_{k}\mathrm{diag}(\lambda_{k,1},\dots,\lambda_{k,M})\mathbf{V}_{k}^{\top},

\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}=\mathbf{V}_{k,s}\mathrm{diag}(\lambda_{k,s,1},\dots,\lambda_{k,s,M})\mathbf{V}_{k,s}^{\top},

with orthogonal matrices $\mathbf{V}_{k},\mathbf{V}_{k,s}$ and $\lambda_{\cdot,\cdot}\geq 0$ . Note that $\det(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2})=\prod_{r}(1+\lambda_{k,r})$ . Writing $\mathbf{z}_{i,k}\coloneqq\mathbf{V}_{k}^{\top}\mathbf{A}^{1/2}(\mathbf{x}_{i}-\mathbf{m}_{k})$ and $\mathbf{d}_{k,s}\coloneqq\mathbf{V}_{k,s}^{\top}\mathbf{A}^{1/2}(\mathbf{m}_{k}-\mathbf{m}_{s})$ , we obtain the stable log-domain expressions

	$\displaystyle\log J^{(M)}_{i,k}$	$\displaystyle=-\frac{1}{2}\sum_{r=1}^{M}\log(1+\lambda_{k,r})\;-\;\frac{1}{2}\sum_{r=1}^{M}\frac{(\mathbf{z}_{i,k})_{r}^{2}}{1+\lambda_{k,r}},$
	$\displaystyle\log I^{(M)}_{k,s}$	$\displaystyle=-\frac{1}{2}\sum_{r=1}^{M}\log(1+\lambda_{k,s,r})\;-\;\frac{1}{2}\sum_{r=1}^{M}\frac{(\mathbf{d}_{k,s})_{r}^{2}}{1+\lambda_{k,s,r}}.$

When $\mathbf{A}=\sigma_{\kappa}^{-2}I_{M}$ , the matrices $\mathbf{V}_{k}$ coincide with the eigenvectors of $\mathbf{K}_{k}$ and the expressions reduce to those with $\alpha=-1/(2\sigma_{\kappa}^{2})$ . To reduce cost, one may truncate these sums to $r\leq R$ (e.g. the leading $R$ eigenvalues), i.e. replace the relevant matrices by their rank- $R$ spectral approximations.

2. Polynomial kernel (Proposition B.2) Fix an integer $p\geq 1$ and $c\geq 0$ and consider $\kappa(x,y)=(\langle x,y\rangle_{\mathcal{X}}+c)^{p}$ . In the projected space, this becomes $\kappa(\mathbf{x},\mathbf{y})=(\mathbf{x}^{\top}\mathbf{y}+c)^{p}$ on $\mathbb{R}^{M}$ .

Computing $J^{(M)}_{i,k}$ .

For $y\sim\nu_{k,M}$ , define the scalars

\mu^{(M)}_{i,k}\coloneqq\mathbf{x}_{i}^{\top}\mathbf{m}_{k}+c,\qquad v^{(M)}_{i,k}\coloneqq\mathbf{x}_{i}^{\top}\mathbf{K}_{k}\,\mathbf{x}_{i},

so that $\mathbf{x}_{i}^{\top}y+c$ is a one-dimensional Gaussian with mean $\mu^{(M)}_{i,k}$ and variance $v^{(M)}_{i,k}$ . Then Proposition B.2 yields the efficient closed form

J^{(M)}_{i,k}=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\big(v^{(M)}_{i,k}\big)^{t}\,\big(\mu^{(M)}_{i,k}\big)^{p-2t}.

Computing $I^{(M)}_{k,s}$ .

Let $y\sim\nu_{k,M}$ and $y^{\prime}\sim\nu_{s,M}$ be independent in $\mathbb{R}^{M}$ , and set $Z_{k,s}\coloneqq y^{\top}y^{\prime}$ . Then

I^{(M)}_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\mathbb{E}\!\left[Z_{k,s}^{\,r}\right].

Equivalently, by Proposition B.2,

I^{(M)}_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\big\langle\mathbf{M}^{(M)}_{r,k},\mathbf{M}^{(M)}_{r,s}\big\rangle_{(\mathbb{R}^{M})^{\otimes r}},

where

\mathbf{M}^{(M)}_{r,k}\coloneqq\mathbb{E}_{y\sim\nu_{k,M}}[y^{\otimes r}]\in(\mathbb{R}^{M})^{\otimes r}.

For small degrees, these moments admit simple closed forms in terms of $(\mathbf{m}_{k},\mathbf{K}_{k})$ . In particular,

	$\displaystyle\mathbb{E}[Z_{k,s}]$	$\displaystyle=\mathbf{m}_{k}^{\top}\mathbf{m}_{s},$
	$\displaystyle\mathbb{E}[Z_{k,s}^{2}]$	$\displaystyle=(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})^{2}+\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\,\mathbf{m}_{k}+\mathbf{m}_{s}^{\top}\mathbf{K}_{k}\,\mathbf{m}_{s}+\mathrm{tr}(\mathbf{K}_{k}\mathbf{K}_{s}),$
	$\displaystyle\mathbb{E}[Z_{k,s}^{3}]$	$\displaystyle=(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})^{3}+3(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})\Big(\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\,\mathbf{m}_{k}+\mathbf{m}_{s}^{\top}\mathbf{K}_{k}\,\mathbf{m}_{s}+\mathrm{tr}(\mathbf{K}_{k}\mathbf{K}_{s})\Big)+6\,\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\mathbf{K}_{k}\,\mathbf{m}_{s}.$

For larger $p$ , one may compute higher-order moments via Wick/Isserlis expansions.

B.2 Algorithm descriptions

We present pseudocode for the fitting procedure described in Section˜2.1. The algorithm operates in the projected coordinate space $\mathbb{R}^{M}$ using the finite-dimensional quantities $(\mathbf{J}^{(M)},\mathbf{I}^{(M)})$ from Section˜B.1, which we denote $(\mathbf{J},\mathbf{I})$ for brevity.

Algorithm 1 Alternating Optimization for Projected MMD Gaussian Mixtures

1:Projected data

\{\mathbf{x}_{i,M}\}_{i=1}^{n}\subset\mathbb{R}^{M}

; components

K

; learning rate

\eta

; ridge

\varepsilon>0

; iterations

T

2:Fitted weights

\hat{\pi}

and components

\{(\hat{\mathbf{m}}_{k,M},\hat{\mathbf{K}}_{k,M})\}_{k=1}^{K}

\{\mathbf{m}_{k,M}\}_{k=1}^{K}\leftarrow\textsc{KMeans}(\{\mathbf{x}_{i,M}\}_{i=1}^{n},K)

\triangleright

Initialize means from data

4:Assign

c_{i}\leftarrow\arg\min_{k}\|\mathbf{x}_{i,M}-\mathbf{m}_{k,M}\|

and let

S_{k}=\{i:c_{i}=k\}

\triangleright

Nearest-center partition

\mathbf{K}_{k,M}\leftarrow\tfrac{1}{|S_{k}|-1}\!\sum_{i\in S_{k}}(\mathbf{x}_{i,M}-\mathbf{m}_{k,M})(\mathbf{x}_{i,M}-\mathbf{m}_{k,M})^{\top}+\varepsilon\mathbf{I}

\triangleright

Within-cluster covariance

\mathbf{L}_{k}\leftarrow\mathrm{chol}(\mathbf{K}_{k,M})

\triangleright

Cholesky parametrization ensures

\mathbf{K}_{k,M}\succ 0

\pi\leftarrow\mathbf{1}/K

\triangleright

Uniform weight initialization

8:for

t=1,\dots,T

9: Compute

\mathbf{I}^{(M)}\in\mathbb{R}^{K\times K}

\mathbf{J}^{(M)}\in\mathbb{R}^{K}

using Section˜B.1

\triangleright

Cross-expectations

10:

\pi\leftarrow\arg\min_{\pi\in\Delta^{K-1}}\pi^{\top}\mathbf{I}^{(M)}\pi-2(\mathbf{J}^{(M)})^{\top}\pi

\triangleright

Exact QP update (6)

11:

\mathcal{L}_{M}\leftarrow\pi^{\top}\mathbf{I}^{(M)}\pi-2(\mathbf{J}^{(M)})^{\top}\pi

\triangleright

Objective at fixed

\pi

12:

\mathbf{m}_{k,M}\leftarrow\mathbf{m}_{k,M}-\eta\nabla_{\mathbf{m}_{k,M}}\mathcal{L}_{M}

\triangleright

Gradient step on means

13:

\mathbf{L}_{k}\leftarrow\mathbf{L}_{k}-\eta\nabla_{\mathbf{L}_{k}}\mathcal{L}_{M}

\triangleright

Gradient step on Cholesky factors

14:

\mathbf{K}_{k,M}\leftarrow\mathbf{L}_{k}\mathbf{L}_{k}^{\top}+\varepsilon\mathbf{I}

\triangleright

Reconstruct PSD covariances

15:end for

16:return

\hat{\pi}\leftarrow\pi

and

\{(\hat{\mathbf{m}}_{k,M},\hat{\mathbf{K}}_{k,M})\}\leftarrow\{(\mathbf{m}_{k,M},\mathbf{K}_{k,M})\}

Algorithm˜1 alternates between the analytical weight update (6) and a gradient step on the component parameters $(\mathbf{m}_{k,M},\mathbf{L}_{k})$ , with covariances parameterized through their Cholesky factors to enforce positive definiteness. The exact gradients are obtained from the differentiable closed forms of Propositions B.1 and B.2. A fully gradient-based variant—updating $\pi$ jointly with the components, either via projection onto the simplex or through logits $\ell\in\mathbb{R}^{K}$ with $\pi_{k}=\operatorname{softmax}(\ell)_{k}$ —is also possible.

Appendix C Time-varying mixtures

This appendix details the algorithmic implementation and theoretical results deferred from Section˜4.2. We outline the projected $\mathbb{R}^{M}$ fitting procedure, and subsequently establish three guarantees: the consistency of the integrated empirical objective, the consistency of its projected $\mathbb{R}^{M}$ approximation, and the density of time-varying Gaussian mixtures with shared components.

Algorithm 2 Time-varying MMD-based mixture fitting

1:Projected temporal data

\{\mathbf{x}_{i,M}^{(l)}\}_{i=1}^{n_{l}}\subset\mathbb{R}^{M}

across time slices

t_{1},\dots,t_{L}

; number of components

K

; learning rate

\eta

; epochs

E

; Neural ODE parameters

\Phi

2:Fitted parameters

\hat{\Phi}

and shared components

\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}

3:Initialize

\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}

using the pooled data

\bigcup_{l=1}^{L}\{\mathbf{x}_{i,M}^{(l)}\}_{i=1}^{n_{l}}

4:Initialize

\Phi

5:for epoch

=1,\dots,E

6: Compute

\pi(t_{l})\leftarrow\operatorname{softmax}(\mathbf{z}(t_{l}))

for

l\in\{1,\dots,L\}

via the ODE.

7: Compute

\mathbf{I}^{(M)}\in\mathbb{R}^{K\times K}

via (7).

8: for

l=1,\dots,L

9: Compute

\mathbf{J}_{l}^{(M)}\in\mathbb{R}^{K}

where

[\mathbf{J}_{l}^{(M)}]_{k}\leftarrow\frac{1}{n_{l}}\sum_{i=1}^{n_{l}}J_{i,k}^{(M)}

via (7).

10:

\operatorname{MMD}_{l}^{2}\leftarrow\pi(t_{l})^{\top}\mathbf{I}^{(M)}\,\pi(t_{l})-2\,(\mathbf{J}_{l}^{(M)})^{\top}\pi(t_{l})

\triangleright

Unnormalized per-slice MMD²

11: end for

12:

\mathcal{L}\leftarrow\frac{1}{L}\sum_{l=1}^{L}\operatorname{MMD}_{l}^{2}

\triangleright

Approximate integrated loss (15)

13: Update

(\Phi,\{\mathbf{m}_{k},\mathbf{K}_{k}\})\leftarrow(\Phi,\{\mathbf{m}_{k},\mathbf{K}_{k}\})-\eta\,\nabla\mathcal{L}

14:end for

15:return

\hat{\Phi}\leftarrow\Phi

\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}\leftarrow\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}

Algorithm˜2 implements the optimization of the time-varying objective (15) in $\mathbb{R}^{M}$ . The component parameters $\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}$ are shared across all time slices, while the temporal variation is entirely captured by the weights $\pi(t)$ , whose logits are governed by the neural ODE introduced in Section˜4.2. Implementation details are summarized in Table˜3.

To evaluate the MMD efficiently within the algorithm, we rely on the finite-dimensional projections defined in (7). We separate the evaluation into terms that can be computed globally versus those computed per time slice. The matrix $\mathbf{I}^{(M)}$ depends only on the shared components and is computed once per epoch. In contrast, the vectors $\mathbf{J}_{l}^{(M)}$ evaluate the interaction between the shared components and the data specific to time slice $t_{l}$ , requiring per-slice computation.

Motivated by the within-cluster RKHS dispersion criterion underlying kernel $k$ -groups [29], we select $K$ by applying an elbow rule to the integrated MMD loss over candidate temporal mixtures.

Consistency guarantees

In parallel with the static framework of Section˜3, we establish the theoretical foundations for the temporal mixture model. Unlike the static case, where sample consistency is standard, minimizing the integrated temporal objective requires controlling the time-varying empirical processes. We first establish this sample consistency, followed by the validity of the projected approximation and the expressiveness of the temporal model class.

Proposition C.1 (Sample consistency of the integrated MMD).

Let $\kappa$ be a continuous positive-definite kernel on the separable Hilbert space $\mathcal{X}$ . Let $X_{1},\dots,X_{n}$ be i.i.d. copies of a jointly measurable $\mathcal{X}$ -valued stochastic process $X\colon[0,T]\to\mathcal{X}$ with marginal laws $P(t)$ , and let $P_{n}(t)$ be the corresponding empirical measure. Let $Q(t)$ be a family of probability measures on $\mathcal{X}$ such that its mean embedding trajectory $t\mapsto\mu_{Q(t)}$ is strongly measurable and belongs to $L^{2}(0,T;\mathcal{H}_{\kappa})$ . If $\int_{0}^{T}\mathbb{E}_{X\sim P(t)}[\kappa(X,X)]\,\mathrm{d}t<\infty$ , we have

\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q(t)\bigr)\,\mathrm{d}t\xrightarrow{n\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P(t),Q(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

In particular, $\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),P(t)\bigr)\,\mathrm{d}t\to 0$ almost surely.

Proposition C.2 (Projection consistency for temporal mixtures).

Let $Q_{\theta}(t)=\sum_{k=1}^{K}\pi_{k}(t)\nu_{k}$ be a time-dependent Gaussian mixture where $\pi(t)\in\Delta^{K-1}$ and $t\mapsto\pi_{k}(t)$ is measurable for each $k$ . Let $P_{n,M}(t)$ and $Q_{\theta,M}(t)$ be the projected measures on $\mathbb{R}^{M}$ defined in Section˜2.2, and let $\kappa_{M}$ denote the coordinate version of the projected kernel. Suppose $\kappa$ is continuous and satisfies the polynomial growth condition (9) for some $q\geq 0$ . If the sample paths satisfy $\int_{0}^{T}\|X_{i}(t)\|_{\mathcal{X}}^{2q}\,\mathrm{d}t<\infty$ almost surely for $i=1,\dots,n$ , then

\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa_{M}}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)\,\mathrm{d}t\xrightarrow{M\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

Furthermore, for each fixed $t$ , if $X(t)\sim Q_{\theta}(t)$ , the projected responsibilities $\gamma_{k}^{(M)}(\Pi_{M}X(t),t)$ satisfy the martingale convergence property of Proposition˜2.2 with weights $\pi(t)$ .

Proposition C.3 (Density of time-varying Gaussian mixtures).

Fix $p\geq 1$ and let $P\colon[0,T]\to\mathcal{P}_{p}(\mathcal{X})$ be continuous with respect to the Wasserstein distance $W_{p}$ . Then, for every $\varepsilon>0$ , there exist $K\in\mathbb{N}$ , shared Gaussian measures $\nu_{1},\dots,\nu_{K}$ on $\mathcal{X}$ with nonzero covariance operator, and continuous functions $\pi_{1},\dots,\pi_{K}\colon[0,T]\to[0,1]$ with $\sum_{k=1}^{K}\pi_{k}(t)=1$ , such that

\sup_{t\in[0,T]}W_{p}\left(P(t),\sum_{k=1}^{K}\pi_{k}(t)\nu_{k}\right)<\varepsilon.

The proofs of Propositions C.1, C.2 and C.3 are provided in Sections˜H.4, H.5 and H.6, respectively.

Appendix D Projected maximum likelihood estimation

Although this paper focuses on MMD-based fitting, we briefly describe the projected maximum-likelihood baseline used in Section˜4; it is the only density-based competitor that is well defined in the Hilbert-space setting of (1). As discussed in the introduction, likelihood-based fitting in infinite dimensions requires a common dominating measure. Here we use the Karhunen–Loève projector associated with the Gaussian prior

\rho\coloneqq\mu_{0}=\mathcal{N}(0,C)

on $\mathcal{X}$ . Let $(\lambda_{j},\phi_{j})_{j\geq 1}$ be the eigenpairs of $C$ ordered so that $\lambda_{1}\geq\lambda_{2}\geq\cdots$ , and fix $M$ with $\lambda_{M}>0$ . Define

\mathcal{X}_{M}\coloneqq\mathrm{span}\{\phi_{1},\dots,\phi_{M}\},\qquad\Pi_{M}x\coloneqq\sum_{j=1}^{M}\langle x,\phi_{j}\rangle_{\mathcal{X}}\,\phi_{j},\qquad\Pi_{M}^{\perp}\coloneqq I_{\mathcal{X}}-\Pi_{M}.

Under the decomposition $\mathcal{X}=\mathcal{X}_{M}\oplus\mathcal{X}_{M}^{\perp}$ , the prior factorizes as

\rho=\rho_{M}\otimes\rho_{M}^{\perp},\qquad\rho_{M}=\mathcal{N}(0,\Lambda_{M}),\qquad\Lambda_{M}\coloneqq\mathrm{diag}(\lambda_{1},\dots,\lambda_{M}),

where $\rho_{M}^{\perp}$ is the law of $\Pi_{M}^{\perp}X$ for $X\sim\rho$ . Following [41], we construct each component by altering only the first $M$ coordinates. Given projected parameters $\mathbf{m}_{k}\in\mathbb{R}^{M}$ and $\mathbf{K}_{k}\in\mathbb{R}^{M\times M}$ with $\mathbf{K}_{k}\succ 0$ , let

\nu_{k,M}\coloneqq\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k})\qquad\text{on }\mathbb{R}^{M},

and lift it to

\nu_{k}\coloneqq\nu_{k,M}\otimes\rho_{M}^{\perp}\qquad\text{on }\mathcal{X}.

Equivalently, $\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k})$ , where

m_{k}=\sum_{j=1}^{M}(\mathbf{m}_{k})_{j}\,\phi_{j},

and $\mathcal{K}_{k}$ agrees with $C$ on $\mathcal{X}_{M}^{\perp}$ while its restriction to $\mathcal{X}_{M}$ has matrix $\mathbf{K}_{k}$ in the basis $\{\phi_{1},\dots,\phi_{M}\}$ . Since $\nu_{k,M}$ and $\rho_{M}$ are nondegenerate Gaussians on $\mathbb{R}^{M}$ , they are equivalent, and therefore $\nu_{k}\sim\rho$ . Hence every mixture

Q\coloneqq\sum_{k=1}^{K}\pi_{k}\nu_{k},\qquad\pi_{k}>0,\qquad\sum_{k=1}^{K}\pi_{k}=1,

satisfies $Q\ll\rho$ ; indeed $Q\sim\rho$ because its density is strictly positive $\rho$ -a.s.

For $x\in\mathcal{X}$ , write

\mathbf{x}\coloneqq\bigl(\langle x,\phi_{1}\rangle_{\mathcal{X}},\dots,\langle x,\phi_{M}\rangle_{\mathcal{X}}\bigr)^{\top}

and let

\varphi_{M}(\mathbf{x};\mathbf{m},\mathbf{K})\coloneqq(2\pi)^{-M/2}\det(\mathbf{K})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m})^{\top}\mathbf{K}^{-1}(\mathbf{x}-\mathbf{m})\Big\}

denote the usual Gaussian density on $\mathbb{R}^{M}$ . Then

p_{k}(x)\coloneqq\frac{\,\mathrm{d}\nu_{k}}{\,\mathrm{d}\rho}(x)=\frac{\,\mathrm{d}\nu_{k,M}}{\,\mathrm{d}\rho_{M}}(\mathbf{x})=\frac{\varphi_{M}(\mathbf{x};\mathbf{m}_{k},\mathbf{K}_{k})}{\varphi_{M}(\mathbf{x};0,\Lambda_{M})}

and therefore

p_{k}(x)=\det(\Lambda_{M})^{1/2}\det(\mathbf{K}_{k})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m}_{k})^{\top}\mathbf{K}_{k}^{-1}(\mathbf{x}-\mathbf{m}_{k})+\tfrac{1}{2}\mathbf{x}^{\top}\Lambda_{M}^{-1}\mathbf{x}\Big\}.

Accordingly,

q(x)\coloneqq\frac{\,\mathrm{d}Q}{\,\mathrm{d}\rho}(x)=\sum_{k=1}^{K}\pi_{k}\,p_{k}(x).

Since the factor $\varphi_{M}(\mathbf{x};0,\Lambda_{M})^{-1}$ is common to every component, maximizing the $\rho$ -log-likelihood is equivalent to maximizing the usual projected mixture log-likelihood

\ell_{M}(\pi,m,\mathcal{K})\coloneqq\sum_{i=1}^{n}\log\!\Big(\sum_{k=1}^{K}\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})\Big),

(25)

because

\sum_{i=1}^{n}\log q(X_{i})=\ell_{M}(\pi,m,\mathcal{K})-\sum_{i=1}^{n}\log\varphi_{M}(\mathbf{x}_{i};0,\Lambda_{M}),

and the second term is independent of the mixture parameters.

The EM algorithm [22] is therefore carried out in the projected coordinates $\mathbf{x}_{i}\in\mathbb{R}^{M}$ . In the E-step,

r_{ik}=\frac{\pi_{k}\,p_{k}(X_{i})}{\sum_{\ell=1}^{K}\pi_{\ell}\,p_{\ell}(X_{i})}=\frac{\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})}{\sum_{\ell=1}^{K}\pi_{\ell}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{\ell},\mathbf{K}_{\ell})}.

For the unregularized full-covariance model, the M-step updates are

\pi_{k}^{\mathrm{new}}=\frac{1}{n}\sum_{i=1}^{n}r_{ik},\qquad\mathbf{m}_{k}^{\mathrm{new}}=\frac{\sum_{i=1}^{n}r_{ik}\mathbf{x}_{i}}{\sum_{i=1}^{n}r_{ik}},

and

\mathbf{K}_{k}^{\mathrm{new}}=\frac{\sum_{i=1}^{n}r_{ik}(\mathbf{x}_{i}-\mathbf{m}_{k}^{\mathrm{new}})(\mathbf{x}_{i}-\mathbf{m}_{k}^{\mathrm{new}})^{\top}}{\sum_{i=1}^{n}r_{ik}}.

In practice, one stabilizes the covariance update by replacing $\mathbf{K}_{k}^{\mathrm{new}}$ with $\mathbf{K}_{k}^{\mathrm{new}}+\varepsilon I_{M}$ for some $\varepsilon>0$ . This is a regularized EM step rather than the exact M-step of the unpenalized likelihood, but it guarantees $\mathbf{K}_{k}\succ 0$ and therefore preserves equivalence with the reference measure after lifting. In the fixed-covariance benchmark variant used in our experiments (see Section˜4), the same E-step is used, while only the weights and means are updated and the covariance matrix is held fixed. After each M-step, the updated parameters are lifted back to $\mathcal{X}$ by setting

m_{k}^{\mathrm{new}}=\sum_{j=1}^{M}(\mathbf{m}_{k}^{\mathrm{new}})_{j}\,\phi_{j}

and defining $\mathcal{K}_{k}^{\mathrm{new}}$ to agree with $C$ on $\mathcal{X}_{M}^{\perp}$ and to have matrix $\mathbf{K}_{k}^{\mathrm{new}}$ on $\mathcal{X}_{M}$ . Algorithm 3 summarizes this projected EM routine, with optional ridge regularization controlled by $\varepsilon$ .

Algorithm 3 Projected EM for the Gaussian-mixture log-likelihood

1:Projected data

\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{M}

; number of components

K

; ridge parameter

\varepsilon\geq 0

; EM iterations

T_{\mathrm{EM}}

2:Fitted parameters

\hat{\pi},\,\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}

\{\mathbf{m}_{k}\}_{k=1}^{K}\leftarrow\textsc{KMeans}(\{\mathbf{x}_{i}\}_{i=1}^{n},K)

\triangleright

Initialize means

\mathbf{K}_{k}\leftarrow\mathbf{I}_{M}

\pi_{k}\leftarrow 1/K

\triangleright

Initialize covariances and weights

5:for

t=1,\dots,T_{\mathrm{EM}}

6: for

i=1,\dots,n

k=1,\dots,K

r_{ik}\leftarrow\dfrac{\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})}{\sum_{s=1}^{K}\pi_{s}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{s},\mathbf{K}_{s})}

\triangleright

E-step responsibilities

8: end for

9: for

k=1,\dots,K

10:

N_{k}\leftarrow\sum_{i=1}^{n}r_{ik}

\triangleright

Effective cluster size

11:

\pi_{k}\leftarrow N_{k}/n

\triangleright

M-step: weights

12:

\mathbf{m}_{k}\leftarrow\tfrac{1}{N_{k}}\sum_{i=1}^{n}r_{ik}\mathbf{x}_{i}

\triangleright

M-step: means

13:

\mathbf{K}_{k}\leftarrow\tfrac{1}{N_{k}}\sum_{i=1}^{n}r_{ik}(\mathbf{x}_{i}-\mathbf{m}_{k})(\mathbf{x}_{i}-\mathbf{m}_{k})^{\top}+\varepsilon\mathbf{I}_{M}

\triangleright

M-step: ridge-regularized covariances

14: end for

15:end for

16:return

\hat{\pi}\leftarrow\pi

\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}\leftarrow\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}

The next result shows that the projected likelihood ratios are consistent as $M\to\infty$ .

Proposition D.1 (Consistency of projected likelihood ratios).

For each $M$ , set $\rho_{M}\coloneqq\rho\circ\Pi_{M}^{-1}$ , $Q_{M}\coloneqq Q\circ\Pi_{M}^{-1}$ , and define $q_{M}\coloneqq dQ_{M}/d\rho_{M}$ . Let $X$ be an $\mathcal{X}$ -valued random variable with distribution $\rho$ and denote

X^{(M)}\coloneqq\Pi_{M}X.

Then $q_{M}(X^{(M)})=\mathbb{E}_{\rho}[q(X)\mid X^{(M)}]$ a.s. and $q_{M}(X^{(M)})\xrightarrow{M\to\infty}q(X)$ a.s. and in $L^{1}(\rho)$ . Consequently, for i.i.d. $X_{1},\dots,X_{n}\sim Q$ ,

\sum_{i=1}^{n}\log q_{M}(\Pi_{M}X_{i})\longrightarrow\sum_{i=1}^{n}\log q(X_{i})\qquad Q^{n}\text{-a.s.}

The proof is given in Section˜H.9.

Appendix E Extension to Banach spaces

This appendix outlines how our framework extends to real separable Banach spaces $(\mathcal{X},\|\cdot\|_{\mathcal{X}})$ . Let $\mathcal{X}^{*}$ be the continuous dual and $\langle x,f\rangle$ the dual pairing. A Borel probability measure $\nu$ on $\mathcal{X}$ is Gaussian if $\nu\circ f^{-1}$ is Gaussian for all $f\in\mathcal{X}^{*}$ . If $\nu$ is centered, its characteristic functional is

\widehat{\nu}(f)\coloneqq\int_{\mathcal{X}}e^{i\langle x,f\rangle}\,\mathrm{d}\nu(x)=\exp\!\Big(-\tfrac{1}{2}\,C_{\nu}(f,f)\Big),\qquad f\in\mathcal{X}^{*},

where the covariance form $C_{\nu}:\mathcal{X}^{*}\times\mathcal{X}^{*}\to\mathbb{R}$ is given by

C_{\nu}(f,g)\coloneqq\int_{\mathcal{X}}\langle x,f\rangle\,\langle x,g\rangle\,\mathrm{d}\nu(x).

By Fernique’s theorem, $\nu$ has a well-defined covariance operator $R_{\nu}:\mathcal{X}^{*}\to\mathcal{X}$ given by $R_{\nu}f=\int_{\mathcal{X}}\langle x,f\rangle x\,\mathrm{d}\nu(x)$ . The associated Cameron–Martin space $H(\nu)$ is the Hilbert completion of $R_{\nu}(\mathcal{X}^{*})$ under the inner product $\langle R_{\nu}f,R_{\nu}g\rangle_{H(\nu)}=\langle R_{\nu}f,g\rangle$ .

Karhunen–Loève decomposition.

Assume that $H(\nu)$ is infinite-dimensional. Following Bay and Croix [4], one constructs vectors $x_{n}\in\mathcal{X}$ , dual functionals $x_{n}^{*}\in\mathcal{X}^{*}$ , and normalized coordinates

h_{n}\coloneqq\sqrt{\lambda_{n}}\,x_{n}\in H(\nu),\qquad h_{n}^{*}\coloneqq\lambda_{n}^{-1/2}x_{n}^{*}\in\mathcal{X}^{*},

from successive maximizers of the Rayleigh quotient associated with the covariance operators of the residual Gaussian measures; see [4, Theorem 3.3]. The resulting coordinates satisfy $R_{\nu}h_{n}^{*}=h_{n}$ , and the random variables $\langle\cdot,h_{n}^{*}\rangle$ are independent $\mathcal{N}(0,1)$ under $\nu$ .

Corollary E.1 (Banach-space Karhunen–Loève decomposition [4, Corollary 3.8]).

Let $\nu$ be a centered Gaussian measure on a real separable Banach space $\mathcal{X}$ , assume $H(\nu)$ is infinite-dimensional, and let $(\lambda_{n},x_{n},h_{n},h_{n}^{*})_{n\geq 0}$ be constructed as above. Then

x=\sum_{n=0}^{\infty}\langle x,h_{n}^{*}\rangle\,h_{n}

converges in $\mathcal{X}$ for $\nu$ -a.e. $x\in\mathcal{X}$ , and the coordinate maps $x\mapsto\langle x,h_{n}^{*}\rangle$ are independent $\mathcal{N}(0,1)$ under $\nu$ . Equivalently, if $(\xi_{n})_{n\geq 0}$ are i.i.d. $\mathcal{N}(0,1)$ , then

\sum_{n=0}^{\infty}\sqrt{\lambda_{n}}\,\xi_{n}\,x_{n}

converges almost surely in $\mathcal{X}$ and has law $\nu$ .

In the Hilbert-space setting of the main text, we identify $\mathcal{X}\simeq\mathcal{X}^{*}$ via the Riesz representation theorem. Then $R_{\nu}$ becomes the usual covariance operator, and the above construction reduces to the classical spectral Karhunen–Loève expansion.

Applicability.

The measure-theoretic components of our framework—including Gaussian mixtures, Radon–Nikodym responsibilities, and MMD weak density arguments—extend naturally to separable Banach spaces. While orthogonal projections are unavailable in this broader setting, one can substitute finite-rank continuous projections $P_{M}:\mathcal{X}\to\mathcal{X}$ , such as the Bay–Croix projections from Corollary˜E.1. However, we do not pursue this extension further, as the closed-form MMD expressions and projection algorithms developed in the main text rely intrinsically on the inner product structure of Hilbert spaces.

Appendix F Synthetic experiments

We compare against four baseline families: (i) metric-space methods [55, 46, 25, 14, 32]; (ii) FDA-specific methods [52]; (iii) classic Euclidean methods restricted to $\mathbb{R}^{d}$ ; and (iv) a projected maximum-likelihood baseline inspired by [41] (see Appendix˜D). We report Adjusted Rand Index (ARI), train with Adam ( $\mathrm{lr}=0.1$ ) and $k$ -means initialization, and run all scripts on CPU in minutes.

Model class. Across all domains, we fit the finite-dimensional mixture of Section˜2.2 after fixing an orthonormal basis $(e_{r})_{r=1}^{M}$ of $\mathcal{X}_{M}$ . We use cosine bases for $L^{2}/H^{1}$ , Wigner-D matrix coefficients for $\mathrm{SO}(3)$ , Laplacian eigenvectors for graph signals, and the canonical basis for $\mathbb{R}^{d}$ .

We evaluate the MMD-based Gaussian mixture fitting method on several Hilbert-space representations of increasing complexity. The first five experiments are controlled mixture-recovery tasks with known ground-truth parameters. The sixth experiment tests system identification for a Gaussian process induced by a linear SDE. The last two experiments apply the same fitting procedure to real molecular and skeleton-sequence data. In all cases, observations are represented in a finite-dimensional basis as in Section˜2.2, and the model is fitted by minimizing the squared MMD objective with a Gaussian radial kernel. Final MMD values are reported as within-experiment diagnostics and are not meant to be compared directly across domains, since the coefficient dimension, scaling, and kernel bandwidth differ across settings.

F.1 Gaussian random variables in $\mathbb{R}^{d}$

As a sanity check, we first consider the finite-dimensional case $\mathcal{X}=\mathbb{R}^{d}$ with the standard inner product, where the model reduces to a classical Gaussian mixture model [7].

Experiment.

We generate $n=1500$ samples from a $K=3$ Gaussian mixture on $\mathbb{R}$ with weights $\pi=(0.5,0.3,0.2)$ , means $(-3.0,\,0.5,\,4.0)$ , and standard deviations $(0.6,\,0.9,\,0.5)$ . We then fit a $K=3$ mixture by minimizing $\mathrm{MMD}^{2}$ . Figure˜6 shows that the learned density closely matches the empirical histogram and the ground-truth density.

F.2 Functional data in $L^{2}(0,1)$

We next consider functional data in $\mathcal{X}=L^{2}(0,1;\mathbb{R}^{d})$ . Gaussian random elements in this space can be viewed as Gaussian processes with square-integrable sample paths, so the mixture model specializes to a mixture of Gaussian processes [54, 66, 1].

Experiment.

We generate multivariate functional data in $\mathcal{X}=L^{2}(0,1;\mathbb{R}^{2})$ from $K=5$ Gaussian components with weights $\pi=(0.30,0.25,0.20,0.15,0.10)$ . Each component has a distinct mean function and diagonal covariance profile in coefficient space. We sample $n=500$ trajectories and project them onto a cosine basis with $R=15$ functions per spatial dimension, yielding coefficient vectors in $\mathbb{R}^{30}$ . Figure˜7 shows raw trajectories, recovered mean functions, recovered weights, and the training curve.

F.3 Functional data in $L^{2}([0,1]^{2})$

The same construction extends to functions on two-dimensional domains by using tensor-product bases. Given an orthonormal basis $(\varphi_{r})_{r\geq 0}$ of $L^{2}(0,1)$ , the functions

\Phi_{m,n}(s,t)\coloneqq\varphi_{m}(s)\varphi_{n}(t)

form an orthonormal basis of $L^{2}([0,1]^{2})$ . This setting covers image-like signals, spatial random fields, and spatio-temporal processes.

Experiment.

We consider $\mathcal{X}=L^{2}([0,1]^{2};\mathbb{R})$ with a tensor-product cosine basis, using $R_{s}=R_{t}=8$ and hence $M=64$ coefficients. We generate $n=400$ samples from a $K=3$ Gaussian mixture with weights $\pi=(0.45,0.35,0.20)$ , where each component has a distinct 2D mean surface. Figure˜8 compares the true and predicted mean surfaces and reports the training loss and recovered weights.

F.4 $L^{2}(\mathrm{SO}(3))$ rotation data

We also consider non-Euclidean domains equipped with an $L^{2}$ structure. For a compact group $G$ with Haar probability measure $\mu$ , the space $L^{2}_{\mu}(G;\mathbb{R}^{d})$ is a Hilbert space with inner product

\langle f,g\rangle_{\mathcal{X}}\coloneqq\int_{G}f(x)^{\top}g(x)\,\mathrm{d}\mu(x).

For $G=\mathrm{SO}(3)$ , the Peter–Weyl theorem provides an orthonormal basis in terms of Wigner D-matrices.

Experiment.

We consider rotation data on $\mathrm{SO}(3)$ , relevant for orientation estimation in robotics and molecular modeling. We generate $n=200$ rotations from $K=3$ concentrated components and encode them using real Wigner-basis features up to degree $L_{\max}=3$ , giving $M=84$ coefficients. Figure˜9 displays each rotation through the image of the reference direction $[1,0,0]$ on $S^{2}$ , together with the true and learned component directions, the training loss, and the recovered weights.

F.5 Graph signals

We next test the method on graph-structured data, where the Hilbert-space geometry is induced by the graph Laplacian. Let $G=(V,E)$ be a finite weighted graph with Laplacian $L=D-W$ . For $\alpha>0$ , we equip $\mathbb{R}^{|V|}$ with the inner product

\langle f,g\rangle_{\mathcal{X}}\coloneqq f^{\top}(L+\alpha I)g.

This geometry gives larger norm to high-frequency graph components, while the finite-dimensional projection below restricts the representation to the leading Laplacian modes.

Experiment.

We generate $n=150$ graph signals on a random Erdős–Rényi graph with $|V|=30$ nodes and edge probability $0.25$ . The signals are projected onto the first $M=15$ Laplacian eigenvectors, using regularization $\alpha=0.1$ . The mixture has $K=3$ components with uniform weights. Figure˜10 shows the true and predicted mean graph signals, together with the training loss and recovered weights.

F.6 Linear SDE: system identification

As a parametric Gaussian-process example, we consider the linear stochastic differential equation

\,\mathrm{d}x(t)=\big(Ax(t)+Bu(t)\big)\,\mathrm{d}t+G\,\mathrm{d}W(t),\qquad t\in[a,b].

For Gaussian initial condition and deterministic input $u$ , the solution is a Gaussian process. Its mean $m(t)$ and covariance $P(t)=\mathrm{Cov}[x(t),x(t)]$ satisfy

\dot{m}=Am+Bu,\qquad m(a)=m_{0},\qquad\dot{P}=AP+PA^{\top}+GG^{\top},\qquad P(a)=\Sigma_{0}.

The resulting projected mean and covariance can therefore be computed analytically in any chosen $L^{2}$ basis.

Experiment.

We treat $(A,B,G)$ as learnable parameters and identify them from $n_{s}=20$ sampled trajectories of a $4$ -dimensional linear SDE over $[0,5]$ , driven by the input $u(t)=(\sin t,\cos 2t)$ . The trajectories are projected onto an $L^{2}$ cosine basis with $R=7$ functions per state dimension, giving $M=28$ coefficients. We match the empirical coefficient distribution to the single Gaussian induced by the linear SDE. Figure˜11 shows the sample paths, training loss, recovered mean function, and diagonal of the projected covariance.

F.7 3D molecular data (QM9)

We also test the method on real molecular structures using permutation-invariant graph fingerprints derived from 3D atomic coordinates.

Experiment.

We sample $n=500$ molecules from QM9, build WL-hash fingerprints of dimension $128$ from distance-threshold molecular graphs, and project them onto the first $R=64$ DCT coefficients. We fit a $K=5$ Gaussian mixture with diagonal covariance. Figure˜12 shows the three molecules closest to each learned component mean in coefficient space. The HOMO–LUMO gap and assignment responsibility are reported for each representative molecule.

F.8 NTU RGB+D skeleton sequences

Finally, we apply the method to human motion data, treating skeleton sequences as $\mathbb{R}^{75}$ -valued paths corresponding to $25$ joints with $3$ coordinates each.

Experiment.

We load $n=150$ action sequences from the NTU RGB+D dataset [58], resample each sequence to $64$ frames, subtract the mean trajectory, and project onto an $L^{2}$ cosine basis with $R=10$ . This gives coefficient vectors of dimension $M=750$ . We fit a $K=5$ Gaussian mixture with diagonal covariance by minimizing $\mathrm{MMD}^{2}$ . Figure˜13 shows the most representative skeleton sequences per component, selected as those closest to the learned component mean in coefficient space.

F.9 Summary

Table˜3 summarizes the experimental settings and final $\mathrm{MMD}^{2}$ values. The controlled experiments show that the method recovers the main mixture structure across Euclidean, functional, group-valued, and graph-based representations. The real-data experiments illustrate that the same coefficient-space fitting procedure also yields coherent representative components on molecular and skeleton-sequence data.

Appendix G Experimental details and compute resources

We record the implementation details used for the experiments in Sections˜4 and F. The complete codebase, along with instructions to reproduce the synthetic and real-world experiments, can be found in the repository https://github.com/dani2442/rkhs-mixture-models/.

Unless stated otherwise, all models are trained on projected coefficient vectors in $\mathbb{R}^{M}$ , use torch.float64, use fixed random seeds, run on CPU, and initialize mixture means with $k$ -means++.

G.1 Hilbert spaces and bases

Each experiment fits the projected mixture model of Section˜2.2 after fixing a separable Hilbert space $\mathcal{X}$ and an orthonormal basis $(e_{r})_{r\geq 1}$ , which determines the truncated subspace $\mathcal{X}_{M}=\mathrm{span}\{e_{1},\dots,e_{M}\}$ . We define here the Hilbert spaces and bases used in Tables˜2 and 3, so that the table entries refer to this section instead of repeating the definitions.

(B1) Euclidean canonical basis.

For $\mathcal{X}=\mathbb{R}^{d}$ we use the canonical basis $(e_{r})_{r=1}^{d}$ , so $M=d$ .

(B2) $L^{2}$ cosine basis.

For $\mathcal{X}=L^{2}(0,1;\mathbb{R})$ with $\langle f,g\rangle=\int_{0}^{1}f(t)g(t)\,\mathrm{d}t$ , we use the orthonormal cosine basis

e_{0}(t)=1,\qquad e_{r}(t)=\sqrt{2}\cos(\pi rt),\quad r\geq 1,

truncated at $r=R-1$ , so $M=R$ . For vector-valued data $\mathcal{X}=L^{2}(0,1;\mathbb{R}^{d})$ , we tensor with the canonical basis of $\mathbb{R}^{d}$ , so $M=Rd$ .

(B3) Tensor-product cosine basis on $[0,1]^{2}$ .

For $\mathcal{X}=L^{2}([0,1]^{2};\mathbb{R})$ we use $\Phi_{m,n}(s,t)=e_{m}(s)\,e_{n}(t)$ with $0\leq m<R_{s}$ , $0\leq n<R_{t}$ , so $M=R_{s}R_{t}$ .

(B4) $H^{1}$ cosine basis.

For $\mathcal{X}=H^{1}(0,1;\mathbb{R})$ with $\langle f,g\rangle=\int_{0}^{1}fg+\int_{0}^{1}f^{\prime}g^{\prime}$ , we use the $H^{1}$ -rescaled cosines

\tilde{e}_{r}(t)=\frac{e_{r}(t)}{\sqrt{1+\pi^{2}r^{2}}},\qquad r=0,\dots,R_{s}-1,

so $M=R_{s}$ ; the $\tilde{e}_{r}$ are orthonormal in $H^{1}$ .

(B5) Real Wigner–D basis on $\mathrm{SO}(3)$ .

For $\mathcal{X}=L^{2}(\mathrm{SO}(3);\mathbb{R})$ under the Haar inner product, the Peter–Weyl theorem yields an orthonormal basis given by the (real) Wigner D-matrix entries,

e_{\ell,m,n}(R)=\sqrt{2\ell+1}\,D^{\ell}_{m,n}(R),\qquad\ell\geq 0,\ -\ell\leq m,n\leq\ell.

Truncating at $\ell\leq L_{\max}$ gives $M=\sum_{\ell=0}^{L_{\max}}(2\ell+1)^{2}$ ; for $L_{\max}=3$ , $M=84$ .

(B6) Graph Laplacian eigenbasis.

For a finite weighted graph $G=(V,E)$ with Laplacian $L=D-W$ and regularization $\alpha>0$ , we equip $\mathcal{X}=\mathbb{R}^{|V|}$ with $\langle f,g\rangle=f^{\top}(L+\alpha I)g$ . Diagonalizing $L=U\Lambda U^{\top}$ with eigenpairs $(\lambda_{j},u_{j})$ , the rescaled eigenvectors $e_{j}=u_{j}/\sqrt{\lambda_{j}+\alpha}$ form an orthonormal basis of $\mathcal{X}$ . We keep the $M$ eigenvectors associated with the smallest eigenvalues.

(B7) Frobenius basis on $\mathrm{Sym}(d)$ .

For $\mathcal{X}=\mathrm{Sym}(d)\subset\mathbb{R}^{d\times d}$ with $\langle A,B\rangle=\mathrm{tr}(A^{\top}B)$ , we use the canonical symmetric basis $\{E_{ii}\}\cup\{(E_{ij}+E_{ji})/\sqrt{2}\}_{i<j}$ (vech embedding), so $M=\binom{d+1}{2}$ .

(B8) Weisfeiler–Lehman (WL) with kernel PCA / DCT.

For molecular graphs (MUTAG, QM9) we compute WL hash fingerprints in $\mathbb{R}^{D}$ , then project onto the leading $M$ kernel-PCA directions (MUTAG) or DCT coefficients (QM9) to obtain coefficient vectors in $\mathbb{R}^{M}$ . The basis is fixed in coefficient space and inherited from the projection.

G.2 Experimental setting

The clustering experiments are unsupervised: all methods are fit without access to labels, and labels are used only after fitting to compute Adjusted Rand Index (ARI). There are no train/test splits because the reported task is clustering or mixture recovery on a fixed sample; variability is instead measured across independently generated datasets, benchmark datasets, random restarts, or subsampled runs, as specified in Table˜2.

Table 2: Benchmark hyperparameters and evaluation protocol. Bases (B1)–(B8) are defined in Section˜G.1. All MMD-GMM fits use Adam and

k

-means++ initialization; polynomial kernels use degree

2

and

c=1

Experiment	Data and Runs	Representation and MMD-GMM training
$\mathbb{R}^{d}$ toy data	Five labeled sklearn datasets [50], $n=500$ , $d=2$ , $K\in\{2,3\}$ ; seed $30$ , MMD seed $42$ .	Representation: canonical basis (B1) on standardized coordinates, $M=2$ . Training: $300$ epochs, learning rate $0.05$ , full covariance; Gaussian bandwidth by $0.5$ times the median pairwise-distance heuristic.
$L^{2}(\mathbb{R})$ functional data	Growth [53], Waveform [11], Phoneme [26], Kneading [51], and ECG200 [48]; ARI averaged over the five datasets; seed $42$ .	Representation: $L^{2}$ cosine basis (B2), $R=15$ ( $R=10$ for Waveform), coefficient normalization per dataset. Training: $400$ epochs, learning rate $0.1$ , diagonal covariance; Gaussian bandwidth by median heuristic.
Glucodensity clustering	PEDAP CGM cohort [63, 1], $98$ patients, $K=2$ ; $5$ MMD runs; seed $42$ .	Representation: $H^{1}$ cosine basis (B4) with $R_{s}=8$ on overlapping $4$ -day sliding windows (stride $1$ day, $16$ time bins); patient-level features for our method. Training: $600$ epochs, learning rate $0.01$ , diagonal covariance; cosine annealing to $10^{-4}$ ; Gaussian bandwidth by median heuristic.
$\mathrm{SO}(3)$ rotations	$10$ generated datasets, $n=200$ , $K=3$ , noise concentration $8.0$ ; seeds $42,\ldots,51$ .	Representation: real Wigner–D basis (B5) with $L_{\max}=3$ , $M=84$ . Training: $400$ epochs, learning rate $0.1$ , diagonal covariance; Gaussian bandwidth by median heuristic.
MUTAG molecular graphs	MUTAG [20], $n=188$ , $K=2$ , $5$ runs; seed $42$ .	Representation: WL fingerprints with kernel PCA (B8), fingerprint dimension $512$ , radius $3$ , reduced to $M=32$ . Training: $500$ epochs, learning rate $0.05$ , $5$ runs; Gaussian bandwidth selected from $\{0.1,0.5,1,2,4\}$ plus the median heuristic by a $60$ -epoch MMD proxy sweep.

The standalone mixture-recovery figures and case-study panels in Sections˜4, 4.2 and F that are not part of the unified ARI benchmark follow the configurations summarized in Table˜3.

Table 3: Hyperparameters for the additional mixture-recovery and case-study figures. Bases (B1)–(B8) are defined in Section˜G.1.

Experiment	Basis	$\bm{n}$	$\bm{K}$	$\bm{M}$	Epochs	Kernel/optimizer details
$1$ D Gaussian mixture	(B1)	$1500$	$3$	$1$	$400$	Gaussian $\sigma=1.0$ , Adam learning rate $0.05$ .
$L^{2}(0,1;\mathbb{R}^{2})$ recovery	(B2)	$500$	$5$	$30$	$400$	Cosine basis with $R=15$ per coordinate; Gaussian $\sigma=1.2$ , Adam learning rate $0.1$ .
$L^{2}([0,1]^{2})$ recovery	(B3)	$400$	$3$	$64$	$400$	Tensor-product cosine $R_{s}=R_{t}=8$ ; Gaussian $\sigma=2.0$ , Adam learning rate $0.1$ .
$L^{2}(\mathrm{SO}(3))$ recovery	(B5)	$200$	$3$	$84$	$400$	Real Wigner–D, $L_{\max}=3$ ; Gaussian $\sigma=10.0$ , Adam learning rate $0.05$ .
Graph-signal recovery	(B6)	$150$	$3$	$15$	$200$	Erdős–Rényi graph, $\alpha=0.1$ ; Gaussian $\sigma=2.0$ , Adam learning rate $0.1$ .
Linear SDE identification	(B2)	$20$ paths	$1$	$28$	$400$	Cosine basis with $R=7$ per state dim.; Gaussian bandwidth $2\,\mathrm{sd}(X)$ , Adam learning rate $10^{-2}$ .
QM9 molecules	(B8)	$500$	$5$	$64$	$150$	WL fingerprints + first $64$ DCT coefficients; Gaussian $\sigma=2.0$ , Adam learning rate $0.05$ .
NTU representatives	(B2)	$150$	$5$	$750$	$250$	Per-joint cosine basis $R=10$ on $\mathbb{R}^{75}$ -valued paths; Gaussian $\sigma=2.0$ , Adam learning rate $0.05$ .
CGM temporal $H^{1}$ mixture	(B4)	$98$ patients	$2$	$8$	$400$	Overlapping $4$ -day sliding windows with stride $1$ day, $16$ time bins, $H^{1}$ cosine basis with $R_{s}=8$ ; 2-layer neural ODE weights, hidden width $64$ , RK4, Gaussian $\sigma=0.8$ , Adam learning rate $0.01$ .
CGM correlation mixture	(B7)	$21{,}889$ windows	$3$	$300$	$200$	Overlapping $W=14$ -day sliding windows with stride $1$ day; each window gives a $24\times 24$ inter-hour correlation matrix with linear shrinkage toward the identity (intensity $\alpha=0.1$ ), vech embedding into $\mathrm{Sym}(24)$ ; 2-layer neural ODE logits, hidden width $64$ , RK4, Adam learning rate $10^{-2}$ .
CGM patient graph mixture	(B6)	$76$ patients	$3$	$304$	$200$	Overlapping $21$ -day sliding windows with stride $1$ day, $10$ time bins; pairwise $H^{1}$ similarities use $R_{s}=16$ cosine modes; graph Laplacian basis rank $R=4$ ; 2-layer neural ODE logits, hidden width $64$ , RK4, Adam learning rate $10^{-2}$ .

G.3 Statistical reporting

All entries in Table˜1 are reported as mean ARI with one standard deviation in subscript, computed directly from the list of ARI scores saved by each benchmark script. The standard deviation is a descriptive measure of run-to-run or dataset-to-dataset variability, not a standard error and not a formal confidence interval. The variability source is the labelled dataset collection for the $\mathbb{R}^{d}$ and $L^{2}$ columns, independent generated datasets for $\mathrm{SO}(3)$ , and multiple runs for glucodensity and MUTAG. Deterministic single-dataset baselines have zero reported standard deviation when only one fitted value is produced.

G.4 Compute resources

All reported experiments were run on CPU only; no GPU was used or required. The local machine used for reproduction is an Ubuntu 24.04.4 LTS workstation equipped with two Intel Xeon E5-2650 v2 CPUs at $2.60\,\mathrm{GHz}$ ( $16$ physical cores in total), $188$ GiB of RAM, and a network-mounted filesystem. The software environment relies on Python 3.11.5 with the following core libraries: PyTorch 1.13.1, NumPy 1.26.4, SciPy 1.12.0, scikit-learn 1.5.2, pandas 1.5.3, scikit-fda 0.10.1, torch-geometric 2.7.0, aeon 0.11.1, rdata 1.0.0, torchdiffeq 0.2.5, and torchsde 0.2.6. The end-to-end reproduction pipeline runs as a single CPU-worker workflow and typically requires $2$ – $4$ wall-clock hours, depending on cached downloads and optional notebook regeneration.

Appendix H Proofs

H.1 Proof of Proposition 2.1

Proof.

Since $(e_{j})_{j\geq 1}$ is an orthonormal basis of $\mathcal{X}$ , the orthogonal projections $\Pi_{M}$ converge strongly to the identity. That is, for every $x\in\mathcal{X}$ , $\Pi_{M}x\to x$ in $\mathcal{X}$ as $M\to\infty$ .

We first prove the almost sure convergence of $J^{(M)}_{i,k}$ . Fix $i\in\{1,\dots,n\}$ and $k\in\{1,\dots,K\}$ . Fix $\omega$ such that $x_{i}\coloneqq X_{i}(\omega)\in\mathcal{X}$ . Then $\Pi_{M}x_{i}\to x_{i}$ in $\mathcal{X}$ . By the continuity of $\kappa$ , for every $y\in\mathcal{X}$ , we have $\kappa(\Pi_{M}x_{i},\Pi_{M}y)\to\kappa(x_{i},y)$ . Furthermore, using the growth assumption (9) and the contraction property of the orthogonal projector ( $\|\Pi_{M}x\|_{\mathcal{X}}\leq\|x\|_{\mathcal{X}}$ ), we can bound the kernel evaluation as:

|\kappa(\Pi_{M}x_{i},\Pi_{M}y)|\leq C\bigl(1+\|\Pi_{M}x_{i}\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|\Pi_{M}y\|_{\mathcal{X}}^{q}\bigr)\leq C\bigl(1+\|x_{i}\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr).

By the finite moment assumption, the right-hand side is $\nu_{k}$ -integrable with respect to $y$ for our fixed $x_{i}$ . Therefore, by the dominated convergence theorem,

J^{(M)}_{i,k}(\omega)=\int_{\mathcal{X}}\kappa(\Pi_{M}x_{i},\Pi_{M}y)\,\mathrm{d}\nu_{k}(y)\to\int_{\mathcal{X}}\kappa(x_{i},y)\,\mathrm{d}\nu_{k}(y)=J_{i,k}(\omega),

which establishes $J^{(M)}_{i,k}\to J_{i,k}$ almost surely.

Next, we establish the convergence of $I^{(M)}_{k,s}$ . Fix $k,s\in\{1,\dots,K\}$ . By the continuity of $\kappa$ , $\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})\to\kappa(y,y^{\prime})$ for all $y,y^{\prime}\in\mathcal{X}$ . Similarly to the previous case, applying (9) and the contraction property yields:

|\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})|\leq C\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y^{\prime}\|_{\mathcal{X}}^{q}\bigr).

Due to the finite moment assumption, this dominating function is integrable with respect to the product measure $\nu_{k}\otimes\nu_{s}$ . Invoking the dominated convergence theorem once more, we obtain

I^{(M)}_{k,s}=\int_{\mathcal{X}\times\mathcal{X}}\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})\,\mathrm{d}(\nu_{k}\otimes\nu_{s})(y,y^{\prime})\to\int_{\mathcal{X}\times\mathcal{X}}\kappa(y,y^{\prime})\,\mathrm{d}(\nu_{k}\otimes\nu_{s})(y,y^{\prime})=I_{k,s}.

Finally, since $\kappa(\Pi_{M}X_{i},\Pi_{M}X_{j})\to\kappa(X_{i},X_{j})$ a.s. for all $i,j$ , and the sums in (8) are finite with weights $\pi_{k}$ independent of $M$ , combining these results yields (10). ∎

H.2 Proof of Proposition 2.2

Proof.

Fix $k\in\{1,\dots,K\}$ . We first show that the exact responsibility $\gamma_{k}(X)$ is a version of the conditional expectation $\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)]$ . Let $Q_{\theta}=\sum_{s=1}^{K}\pi_{s}\nu_{s}$ be the marginal law of $X$ . For any Borel set $A\in\mathcal{B}(\mathcal{X})$ , the definition of $\gamma_{k}=\,\mathrm{d}(\pi_{k}\nu_{k})/\,\mathrm{d}Q_{\theta}$ yields

\int_{A}\gamma_{k}(x)\,\mathrm{d}Q_{\theta}(x)=\pi_{k}\nu_{k}(A)=\mathbb{P}(X\in A,\ Z=k).

Equivalently, for every $A\in\mathcal{B}(\mathcal{X})$ ,

\mathbb{E}\bigl[\gamma_{k}(X)\mathbf{1}_{\{X\in A\}}\bigr]=\mathbb{E}\bigl[\mathbf{1}_{\{Z=k\}}\mathbf{1}_{\{X\in A\}}\bigr].

Since $\gamma_{k}(X)$ is $\sigma(X)$ -measurable, this identity implies $\gamma_{k}(X)=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)]$ almost surely.

By an identical argument on the projected space, equipping $\mathcal{X}_{M}$ with the pushforward measures $\nu_{s,M}$ and $Q_{\theta,M}$ , the projected derivative $\gamma_{k}^{(M)}=\,\mathrm{d}(\pi_{k}\nu_{k,M})/\,\mathrm{d}Q_{\theta,M}$ satisfies

\int_{B}\gamma_{k}^{(M)}(u)\,\mathrm{d}Q_{\theta,M}(u)=\mathbb{P}(X^{(M)}\in B,\ Z=k)

for any $B\in\mathcal{B}(\mathcal{X}_{M})$ . Equivalently, for every $B\in\mathcal{B}(\mathcal{X}_{M})$ ,

\mathbb{E}\bigl[\gamma_{k}^{(M)}(X^{(M)})\,\mathbf{1}_{\{X^{(M)}\in B\}}\bigr]=\mathbb{E}\bigl[\mathbf{1}_{\{Z=k\}}\,\mathbf{1}_{\{X^{(M)}\in B\}}\bigr].

Since $\gamma_{k}^{(M)}(X^{(M)})$ is $\sigma(X^{(M)})$ -measurable, this confirms that

\gamma_{k}^{(M)}(X^{(M)})=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X^{(M)})]

almost surely, establishing (13).

Next, define the filtration $\mathcal{F}_{M}\coloneqq\sigma(X^{(M)})$ . The property $\Pi_{M}=\Pi_{M}\Pi_{M+1}$ ensures that $X^{(M)}=\Pi_{M}X^{(M+1)}$ , so $X^{(M)}$ is $\sigma(X^{(M+1)})$ -measurable, meaning the filtration is increasing ( $\mathcal{F}_{M}\subseteq\mathcal{F}_{M+1}$ ). Let $\mathcal{F}_{\infty}\coloneqq\sigma(\cup_{M=1}^{\infty}\mathcal{F}_{M})$ . Since each projector $\Pi_{M}$ is continuous, $X^{(M)}$ is $\sigma(X)$ -measurable, yielding $\mathcal{F}_{\infty}\subseteq\sigma(X)$ . Conversely, since $\Pi_{M}x\to x$ , we have $X^{(M)}(\omega)\to X(\omega)$ pointwise in $\mathcal{X}$ . Since $\mathcal{X}$ is a metric space, the pointwise limit of $\mathcal{F}_{\infty}$ -measurable maps into $\mathcal{X}$ is $\mathcal{F}_{\infty}$ -measurable. This forces $\sigma(X)\subseteq\mathcal{F}_{\infty}$ , concluding that $\mathcal{F}_{\infty}=\sigma(X)$ .

Finally, we apply Lévy’s upward convergence theorem. Since the target $\mathbf{1}_{\{Z=k\}}$ is integrable, the martingale sequence $\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{M}]$ converges almost surely and in $L^{1}$ to $\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{\infty}]$ . By our previous identities,

\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{\infty}]=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)]=\gamma_{k}(X)\qquad\text{a.s.}

Substituting these into the limit, we obtain, as $M\to\infty$ ,

\gamma_{k}^{(M)}(X^{(M)})\to\gamma_{k}(X)\qquad\text{a.s.\ and in }L^{1}.

∎

H.3 Proof of Proposition 3.1

Proof.

Proof of (i).

Fix $p\geq 1$ , $P\in\mathcal{P}_{p}(\mathcal{X})$ , and $\varepsilon>0$ . Let $\mathcal{K}_{0}$ be any nonzero trace-class covariance operator on $\mathcal{X}$ . It is enough to prove density for mixtures with component covariances of the form $\sigma^{2}\mathcal{K}_{0}$ , since these form a subclass of the mixtures in (1).

Because $\mathcal{X}$ is a separable Hilbert space, it is Polish. Hence finitely supported probability measures are dense in $\mathcal{P}_{p}(\mathcal{X})$ for $W_{p}$ . Therefore, there exists $\nu=\sum_{i=1}^{N}w_{i}\delta_{x_{i}}$ such that $W_{p}(P,\nu)<\varepsilon/2.$

For $\sigma>0$ , define the Gaussian smoothing

\nu_{\sigma}\coloneqq\sum_{i=1}^{N}w_{i}\,\mathcal{N}(x_{i},\sigma^{2}\mathcal{K}_{0}).

Let $I$ be an index with $\mathbb{P}(I=i)=w_{i}$ , and let $Z\sim\mathcal{N}(0,\mathcal{K}_{0})$ , independent of $I$ . Then $X=x_{I}$ has law $\nu$ , while $Y=x_{I}+\sigma Z$ has law $\nu_{\sigma}$ . This coupling gives

W_{p}^{p}(\nu,\nu_{\sigma})\leq\mathbb{E}\|X-Y\|_{\mathcal{X}}^{p}=\sigma^{p}\mathbb{E}\|Z\|_{\mathcal{X}}^{p}.

Since $\mathcal{K}_{0}$ is trace-class, Fernique’s theorem implies $\mathbb{E}\|Z\|_{\mathcal{X}}^{p}<\infty$ . Hence $W_{p}(\nu,\nu_{\sigma})\to 0$ as $\sigma\downarrow 0$ . Choosing $\sigma$ small enough yields $W_{p}(\nu,\nu_{\sigma})<\varepsilon/2$ , and the triangle inequality gives

W_{p}(P,\nu_{\sigma})<\varepsilon.

Thus, finite Gaussian mixtures with non-zero covariance operators are $W_{p}$ -dense in $\mathcal{P}_{p}(\mathcal{X})$ .

Proof of (ii). Fix $P\in\mathcal{P}(\mathcal{X})$ and draw $K$ i.i.d. samples $X_{1},\dots,X_{K}\sim P$ . Let the empirical measure be $\widehat{P}_{K}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\delta_{X_{k}}$ . Since $\mathbb{E}[\kappa(X_{k},\cdot)]=\mu_{P}$ and we assume $\sup_{x}\kappa(x,x)\leq 1$ , the expected squared MMD is bounded by the variance of the mean embedding:

\mathbb{E}\big[\operatorname{MMD}_{\kappa}^{2}(P,\widehat{P}_{K})\big]=\frac{1}{K}\mathbb{E}\big[\|\kappa(X_{1},\cdot)-\mu_{P}\|_{\mathcal{H}_{\kappa}}^{2}\big]\leq\frac{1}{K}\mathbb{E}[\kappa(X_{1},X_{1})]\leq\frac{1}{K}.

By the probabilistic method, there exists a specific realization $m_{1},\dots,m_{K}\in\mathcal{X}$ such that the deterministic empirical measure $\mu_{K}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\delta_{m_{k}}$ satisfies $\operatorname{MMD}_{\kappa}(P,\mu_{K})\leq 1/\sqrt{K}$ .

Fix any nonzero trace-class covariance operator $\mathcal{K}_{0}$ and define $Q_{K,\sigma}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\mathcal{N}(m_{k},\sigma^{2}\mathcal{K}_{0})$ . Then $Q_{K,\sigma}$ converges weakly to $\mu_{K}$ as $\sigma\downarrow 0$ . Because $\kappa$ is bounded and continuous, weak convergence implies convergence in MMD. Thus, we can choose $\sigma>0$ small enough such that $\operatorname{MMD}_{\kappa}(Q_{K,\sigma},\mu_{K})\leq 1/\sqrt{K}$ .

Applying the triangle inequality yields

\operatorname{MMD}_{\kappa}(P,Q_{K,\sigma})\leq\operatorname{MMD}_{\kappa}(P,\mu_{K})+\operatorname{MMD}_{\kappa}(\mu_{K},Q_{K,\sigma})\leq\frac{2}{\sqrt{K}},

which completes the proof, as $Q_{K,\sigma}$ is a valid mixture of the form (1). ∎

H.4 Proof of Proposition C.1

Proof.

Step 1. Let $\Phi(x)=\kappa(x,\cdot)\in\mathcal{H}_{\kappa}$ be the canonical feature map. Because $\mathcal{X}$ is separable and $\kappa$ is continuous, the reproducing kernel Hilbert space $\mathcal{H}_{\kappa}$ is separable. Define the Bochner space

\mathbb{H}\coloneqq L^{2}(0,T;\mathcal{H}_{\kappa}).

For each sample path, define the random element $Y_{i}(t)\coloneqq\Phi(X_{i}(t))$ . Since $\Phi$ is continuous, $(\omega,t)\mapsto\Phi(X_{i}(t,\omega))$ is jointly measurable. Together with the separability of $\mathcal{H}_{\kappa}$ and the integrability estimate below, this defines a strongly measurable $\mathbb{H}$ -valued random variable. By assumption, its squared norm is integrable:

\mathbb{E}[\|Y_{1}\|_{\mathbb{H}}^{2}]=\mathbb{E}\!\left[\int_{0}^{T}\|\Phi(X_{1}(t))\|_{\mathcal{H}_{\kappa}}^{2}\,\mathrm{d}t\right]=\mathbb{E}\!\left[\int_{0}^{T}\kappa(X_{1}(t),X_{1}(t))\,\mathrm{d}t\right]<\infty.

Thus, $Y_{1},\dots,Y_{n}$ are i.i.d. $\mathbb{H}$ -valued random variables with finite first and second moments.

Step 2. By Jensen’s inequality and Tonelli’s theorem, the population mean embedding trajectory $\mu_{P}\coloneqq(\mu_{P(t)})_{t\in[0,T]}$ satisfies $\int_{0}^{T}\|\mu_{P(t)}\|^{2}_{\mathcal{H}_{\kappa}}\,\mathrm{d}t\leq\mathbb{E}[\|Y_{1}\|_{\mathbb{H}}^{2}]<\infty$ , so $\mu_{P}\in\mathbb{H}$ . By the Bochner–Fubini theorem, $\mu_{P}$ is exactly the expected value $\mathbb{E}[Y_{1}]$ in $\mathbb{H}$ . Similarly, the empirical mean embedding trajectory $\mu_{P_{n}}$ corresponds to the sample average:

\mu_{P_{n}}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\quad\text{in }\mathbb{H}.

By assumption, the target embedding trajectory $\mu_{Q}\coloneqq(\mu_{Q(t)})_{t\in[0,T]}$ also belongs to $\mathbb{H}$ .

Step 3. By the strong law of large numbers for Hilbert-space-valued random variables, the sample average converges to its expectation almost surely:

\mu_{P_{n}}\xrightarrow{n\to\infty}\mu_{P}\qquad\text{in }\mathbb{H}\text{ a.s.}

Since $\mu_{P_{n}}-\mu_{Q}\to\mu_{P}-\mu_{Q}$ in $\mathbb{H}$ , the continuity of the squared norm gives $\|\mu_{P_{n}}-\mu_{Q}\|_{\mathbb{H}}^{2}\to\|\mu_{P}-\mu_{Q}\|_{\mathbb{H}}^{2}$ . Because the integrated squared MMD is precisely the squared distance in $\mathbb{H}$ scaled by $1/T$ , this immediately implies that

\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q(t)\bigr)\,\mathrm{d}t\xrightarrow{n\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P(t),Q(t)\bigr)\,\mathrm{d}t

almost surely. Finally, setting $Q(t)=P(t)$ directly yields $\frac{1}{T}\|\mu_{P_{n}}-\mu_{P}\|_{\mathbb{H}}^{2}\to 0$ almost surely, completing the proof. ∎

H.5 Proof of Proposition C.2

Proof.

For each $M$ , the identification of $\mathcal{X}_{M}$ with $\mathbb{R}^{M}$ via the basis $(e_{r})_{r=1}^{M}$ maps the projected measures $P_{n,M}(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{\Pi_{M}X_{i}(t)}$ and $Q_{\theta,M}(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\nu_{k,M}$ to their finite-dimensional coordinate representations $\mathbf{P}_{n,M}(t)$ and $\mathbf{Q}_{\theta,M}(t)$ . By defining $\kappa_{M}(\mathbf{x},\mathbf{y})\coloneqq\kappa(\sum_{r=1}^{M}x_{r}e_{r},\sum_{r=1}^{M}y_{r}e_{r})$ , we have

\operatorname{MMD}_{\kappa_{M}}^{2}\bigl(\mathbf{P}_{n,M}(t),\mathbf{Q}_{\theta,M}(t)\bigr)=\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)

for every $t$ . Expanding the exact and projected MMD objectives, we define the cross-expectations

J_{i,k}^{(M)}(t)\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)\big],\qquad J_{i,k}(t)\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(X_{i}(t),Y)\big],

I_{k,s}^{(M)}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}[\kappa(\Pi_{M}Y,\Pi_{M}Y^{\prime})],\qquad I_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})].

Fix a sample path satisfying $\int_{0}^{T}\|X_{i}(t)\|_{\mathcal{X}}^{2q}\,\mathrm{d}t<\infty$ for every $i=1,\dots,n$ . For every $t$ , $\Pi_{M}X_{i}(t)\to X_{i}(t)$ in $\mathcal{X}$ . Hence, by the continuity of $\kappa$ , the data–data interactions converge pointwise in $t$ :

\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))\to\kappa(X_{i}(t),X_{j}(t)).

For fixed $t$ , the integrand in $J_{i,k}^{(M)}(t)$ satisfies $\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)\to\kappa(X_{i}(t),Y)$ for every $Y\in\mathcal{X}$ , and is bounded by $|\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)|\leq C(1+\|X_{i}(t)\|_{\mathcal{X}}^{q})(1+\|Y\|_{\mathcal{X}}^{q})$ . Then dominated convergence applies because Gaussian measures on Hilbert spaces have finite $q$ -moments, giving $J_{i,k}^{(M)}(t)\to J_{i,k}(t)$ . The exact same argument applies to the component–component term, yielding $I_{k,s}^{(M)}\to I_{k,s}$ .

We next justify passing the limit through the time integral using the Dominated Convergence Theorem. By the polynomial growth condition on $\kappa$ and the contraction property of $\Pi_{M}$ ,

|\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))|\leq C\bigl(1+\|X_{i}(t)\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|X_{j}(t)\|_{\mathcal{X}}^{q}\bigr).

This upper bound belongs to $L^{1}(0,T)$ by Cauchy–Schwarz and the sample path assumption. Likewise, if $Y_{k}\sim\nu_{k}$ ,

|J_{i,k}^{(M)}(t)|\leq C\bigl(1+\|X_{i}(t)\|_{\mathcal{X}}^{q}\bigr)\,\mathbb{E}\bigl[1+\|Y_{k}\|_{\mathcal{X}}^{q}\bigr],

which is integrable on $[0,T]$ . Finally, the component–component term is uniformly bounded by $C\,\mathbb{E}[1+\|Y_{k}\|_{\mathcal{X}}^{q}]\,\mathbb{E}[1+\|Y_{s}\|_{\mathcal{X}}^{q}]$ , which is finite and independent of $M$ .

Since $0\leq\pi_{k}(t)\pi_{s}(t)\leq 1$ , we can apply the Dominated Convergence Theorem to integrate term by term. Integrating the finite-dimensional MMD expansion

	$\displaystyle\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)$	$\displaystyle=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))$
		$\displaystyle\quad-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}(t)J_{i,k}^{(M)}(t)+\sum_{k,s=1}^{K}\pi_{k}(t)\pi_{s}(t)I_{k,s}^{(M)}$

over $[0,T]$ yields convergence for fixed $n,K$ , and $\theta$ :

\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)\,\mathrm{d}t\xrightarrow{M\to\infty}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

Dividing by $T$ and using the coordinate identification $\kappa_{M}$ completes the proof. ∎

H.6 Proof of Proposition C.3

Proof.

Fix $p\geq 1$ , let $P\colon[0,T]\to\mathcal{P}_{p}(\mathcal{X})$ be continuous with respect to $W_{p}$ , and fix $\varepsilon>0$ . Because $[0,T]$ is compact and $P$ is continuous, the trajectory image $\mathcal{C}\coloneqq P([0,T])$ is compact in the metric space $(\mathcal{P}_{p}(\mathcal{X}),W_{p})$ .

By Proposition˜3.1, for every $\mu\in\mathcal{C}$ , there exists a finite Gaussian mixture $q_{\mu}$ such that $W_{p}(\mu,q_{\mu})<\varepsilon/2$ . The open balls $U_{\mu}\coloneqq B_{W_{p}}(\mu,\varepsilon/2)\cap\mathcal{C}$ form an open cover of $\mathcal{C}$ . By compactness, we can extract a finite subcover centered at $\mu_{1},\dots,\mu_{J}\in\mathcal{C}$ , with corresponding approximating mixtures $q_{1},\dots,q_{J}$ .

Let $\{\lambda_{j}\}_{j=1}^{J}$ be a continuous partition of unity on $\mathcal{C}$ subordinate to this finite subcover. That is, each $\lambda_{j}\colon\mathcal{C}\to[0,1]$ is continuous, $\sum_{j=1}^{J}\lambda_{j}(\mu)=1$ for all $\mu\in\mathcal{C}$ , and $\lambda_{j}(\mu)=0$ whenever $\mu\notin U_{j}$ .

Since each $q_{j}$ is a finite Gaussian mixture, we can pool all distinct Gaussian components appearing across $q_{1},\dots,q_{J}$ into a single shared dictionary $\{\nu_{1},\dots,\nu_{K}\}$ . We can then express each local mixture over this shared basis as

q_{j}=\sum_{k=1}^{K}a_{j,k}\nu_{k},\qquad\text{where }a_{j,k}\geq 0\text{ and }\sum_{k=1}^{K}a_{j,k}=1.

We now construct the temporal weights by interpolating the coefficients:

\pi_{k}(t)\coloneqq\sum_{j=1}^{J}\lambda_{j}(P(t))\,a_{j,k},\qquad t\in[0,T],\quad k=1,\dots,K.

Because $P$ and the partition functions $\lambda_{j}$ are continuous, the trajectories $t\mapsto\pi_{k}(t)$ are continuous. Furthermore, $\pi_{k}(t)\geq 0$ and $\sum_{k=1}^{K}\pi_{k}(t)=1$ . The temporal mixture is therefore given by

Q(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\nu_{k}=\sum_{j=1}^{J}\lambda_{j}(P(t))\,q_{j}.

To bound the distance between $P(t)$ and $Q(t)$ , fix $t\in[0,T]$ . We define the set of active indices at time $t$ as $A(t)\coloneqq\{j:\lambda_{j}(P(t))>0\}$ . For each $j\in A(t)$ , $P(t)\in U_{j}$ , meaning $W_{p}(P(t),\mu_{j})<\varepsilon/2$ . By the triangle inequality,

W_{p}(P(t),q_{j})\leq W_{p}(P(t),\mu_{j})+W_{p}(\mu_{j},q_{j})<\varepsilon.

For each $j\in A(t)$ , choose an optimal transport plan $\gamma_{j,t}\in\Pi(P(t),q_{j})$ , which satisfies $\int\|x-y\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{j,t}(x,y)=W_{p}^{p}(P(t),q_{j})<\varepsilon^{p}$ . We construct a mixed coupling:

\gamma_{t}\coloneqq\sum_{j\in A(t)}\lambda_{j}(P(t))\,\gamma_{j,t}.

Because each $\gamma_{j,t}$ has first marginal $P(t)$ and second marginal $q_{j}$ , the convex combination $\gamma_{t}$ has first marginal $P(t)$ and second marginal $\sum_{j\in A(t)}\lambda_{j}(P(t))\,q_{j}=Q(t)$ . Thus, $\gamma_{t}\in\Pi(P(t),Q(t))$ .

Evaluating the transport cost of this coupling yields

	$\displaystyle W_{p}^{p}(P(t),Q(t))$	$\displaystyle\leq\int_{\mathcal{X}\times\mathcal{X}}\\|x-y\\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{t}(x,y)$
		$\displaystyle=\sum_{j\in A(t)}\lambda_{j}(P(t))\int_{\mathcal{X}\times\mathcal{X}}\\|x-y\\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{j,t}(x,y).$

Since the integrals are strictly bounded by $\varepsilon^{p}$ and the sum of the active weights is 1, we conclude

W_{p}^{p}(P(t),Q(t))<\sum_{j\in A(t)}\lambda_{j}(P(t))\,\varepsilon^{p}=\varepsilon^{p}.

Hence, $W_{p}(P(t),Q(t))<\varepsilon$ for every $t\in[0,T]$ , establishing the supremum bound. ∎

H.7 Proof of Proposition B.1

Proof.

This proof is inspired by [70, Theorem 18]. Since $A$ is bounded, self-adjoint, and positive semi-definite, its square root $T\coloneqq A^{1/2}$ is a bounded, self-adjoint, positive semi-definite operator, and

\kappa_{A}(x,y)=\exp\left(-\frac{1}{2}\|Tx-Ty\|_{\mathcal{X}}^{2}\right).

By the stability of Gaussian measures under bounded linear maps, if $y\sim\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k})$ , the transformed variable is distributed as $Ty\sim\mathcal{N}(Tm_{k},\,T\mathcal{K}_{k}T)$ . Because $\mathcal{K}_{k}$ is trace-class and $T$ is bounded, the operator $T\mathcal{K}_{k}T$ remains trace-class. Thus, the general case reduces to evaluating the isotropic kernel $\kappa(x,y)=\exp(-\frac{1}{2}\|x-y\|_{\mathcal{X}}^{2})$ on the transformed variables $Tx$ and $Ty$ .

To evaluate the isotropic case, we rewrite the data-to-model expectation by centering the integration variable. Let $h\coloneqq X_{i}-m_{k}$ and write $y=m_{k}+\xi$ , where $\xi\sim\mathcal{N}(0,\mathcal{K}_{k})$ . Since $X_{i}-y=h-\xi$ , the expectation becomes:

J_{i,k}=\mathbb{E}_{y\sim\nu_{k}}\left[\exp\left(-\frac{1}{2}\|X_{i}-y\|_{\mathcal{X}}^{2}\right)\right]=\mathbb{E}_{\xi}\left[\exp\left(-\frac{1}{2}\|\xi-h\|_{\mathcal{X}}^{2}\right)\right].

By the spectral theorem for compact self-adjoint operators, there exists an orthonormal eigenbasis $(e_{r})_{r\geq 1}$ of $\mathcal{K}_{k}$ with corresponding eigenvalues $(\lambda_{r})_{r\geq 1}$ . Expanding $\xi$ and $h$ in this basis yields

\xi=\sum_{r=1}^{\infty}\sqrt{\lambda_{r}}Z_{r}e_{r},\qquad h=\sum_{r=1}^{\infty}h_{r}e_{r},

where $Z_{r}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,1)$ . For any $N\in\mathbb{N}$ , define $\xi^{(N)}\coloneqq\sum_{r=1}^{N}\sqrt{\lambda_{r}}Z_{r}e_{r}$ and $h^{(N)}\coloneqq\sum_{r=1}^{N}h_{r}e_{r}$ . By Parseval’s identity, the squared norm of the difference is simply the limit of its partial sums:

\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}=\sum_{r=1}^{N}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\xrightarrow{N\to\infty}\sum_{r=1}^{\infty}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}=\|\xi-h\|_{\mathcal{X}}^{2}\quad\text{a.s.}

Since this convergence holds almost surely and $0\leq\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\leq 1$ , the dominated convergence theorem allows us to compute the expectation as the limit of the finite-dimensional projections:

\displaystyle J_{i,k}=\lim_{N\to\infty}\mathbb{E}\left[\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\right]=\lim_{N\to\infty}\prod_{r=1}^{N}\mathbb{E}\left[\exp\left(-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\right)\right].

Expanding the square inside the exponential yields $-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}=-\frac{\lambda_{r}}{2}Z_{r}^{2}+h_{r}\sqrt{\lambda_{r}}Z_{r}-\frac{h_{r}^{2}}{2}$ . Factoring out the constant term, we obtain:

\mathbb{E}\left[\exp\left(-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\right)\right]=e^{-h_{r}^{2}/2}\mathbb{E}\left[\exp\left(h_{r}\sqrt{\lambda_{r}}Z_{r}-\frac{\lambda_{r}}{2}Z_{r}^{2}\right)\right].

Using the Gaussian identity $\mathbb{E}[\exp(aZ+bZ^{2})]=(1-2b)^{-1/2}\exp\bigl(\frac{a^{2}}{2(1-2b)}\bigr)$ for $Z\sim\mathcal{N}(0,1)$ , which holds if $2b<1$ , with $a=h_{r}\sqrt{\lambda_{r}}$ and $b=-\lambda_{r}/2$ , each factor evaluates to:

e^{-h_{r}^{2}/2}(1+\lambda_{r})^{-1/2}\exp\left(\frac{h_{r}^{2}\lambda_{r}}{2(1+\lambda_{r})}\right)=(1+\lambda_{r})^{-1/2}\exp\left(-\frac{1}{2}\frac{h_{r}^{2}}{1+\lambda_{r}}\right).

Substituting the one-dimensional evaluations back into the finite-dimensional product yields:

\mathbb{E}\left[\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\right]=\left(\prod_{r=1}^{N}(1+\lambda_{r})^{-1/2}\right)\exp\left(-\frac{1}{2}\sum_{r=1}^{N}\frac{h_{r}^{2}}{1+\lambda_{r}}\right).

Passing to the limit as $N\to\infty$ , the infinite product converges to $\det_{F}(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1/2}$ . Simultaneously, by the spectral decomposition of $(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}$ , the series in the exponent evaluates to

\sum_{r=1}^{\infty}\frac{h_{r}^{2}}{1+\lambda_{r}}=\big\langle h,(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}h\big\rangle_{\mathcal{X}}.

Recalling that $h=X_{i}-m_{k}$ , we conclude:

J_{i,k}=\det\nolimits_{F}(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1/2}\exp\left(-\frac{1}{2}\big\langle X_{i}-m_{k},(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\right).

Applying this isotropic solution to the transformed variables $TX_{i}$ and $Ty\sim\mathcal{N}(Tm_{k},T\mathcal{K}_{k}T)$ , we obtain the intermediate form:

J_{i,k}=\det\nolimits_{F}(I_{\mathcal{X}}+T\mathcal{K}_{k}T)^{-1/2}\exp\left(-\frac{1}{2}\big\langle T(X_{i}-m_{k}),(I_{\mathcal{X}}+T\mathcal{K}_{k}T)^{-1}T(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\right).

Substituting $T=A^{1/2}$ recovers Equation (20).

Finally, for the cross-expectation $I_{k,s}$ , let $y\sim\nu_{k}$ and $y^{\prime}\sim\nu_{s}$ be independent. Their difference $z\coloneqq y-y^{\prime}$ is distributed as $\mathcal{N}(m_{k}-m_{s},\mathcal{K}_{k}+\mathcal{K}_{s})$ . Since $\kappa_{A}(y,y^{\prime})=\kappa_{A}(z,0)$ , applying the formula for $J_{i,k}$ with $X_{i}=0$ immediately yields:

I_{k,s}=\frac{\exp\Bigl(-\frac{1}{2}\big\langle m_{k}-m_{s},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{-1}A^{1/2}(m_{k}-m_{s})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{1/2}},

which completes the proof.

∎

H.8 Proof of Proposition B.2

Proof.

Write $y=m_{k}+\xi$ , where $\xi\sim\mathcal{N}(0,\mathcal{K}_{k})$ . Then

Z\coloneqq\langle X_{i},y\rangle_{\mathcal{X}}+c=\underbrace{\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c}_{\mu_{i,k}}+\langle X_{i},\xi\rangle_{\mathcal{X}}

is a real Gaussian random variable with mean $\mu_{i,k}\in\mathbb{R}$ and variance

v_{i,k}\coloneqq\mathbb{E}[\langle X_{i},\xi\rangle_{\mathcal{X}}^{2}]=\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}.

Thus,

Z\stackrel{{\scriptstyle d}}{{=}}\mu_{i,k}+\sqrt{v_{i,k}}\,G,\qquad G\sim\mathcal{N}(0,1),

and therefore

J_{i,k}=\mathbb{E}[Z^{p}]=\sum_{r=0}^{p}\binom{p}{r}\mu_{i,k}^{\,p-r}v_{i,k}^{r/2}\,\mathbb{E}[G^{r}].

Since $\mathbb{E}[G^{r}]=0$ for odd $r$ and $\mathbb{E}[G^{2t}]=\frac{(2t)!}{2^{t}t!},$ we obtain

J_{i,k}=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}^{\,t}\bigl(\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c\bigr)^{p-2t}.

Now let $y\sim\nu_{k}$ and $y^{\prime}\sim\nu_{s}$ be independent. By definition of the polynomial kernel,

I_{k,s}=\mathbb{E}\big[(\langle y,y^{\prime}\rangle_{\mathcal{X}}+c)^{p}\big].

Applying the binomial theorem to expand the power, and using the linearity of expectation,

I_{k,s}=\mathbb{E}\Bigg[\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}\Bigg]=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\mathbb{E}\big[\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}\big].

To evaluate this expectation, recall that $\mathcal{X}^{\otimes r}$ denotes the $r$ -th tensor power of $\mathcal{X}$ . For any vector $y\in\mathcal{X}$ , its $r$ -fold tensor product is $y^{\otimes r}\coloneqq y\otimes\dots\otimes y$ . By the standard definition of the inner product on $\mathcal{X}^{\otimes r}$ , we have the fundamental identity:

\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}=\langle y^{\otimes r},(y^{\prime})^{\otimes r}\rangle_{\mathcal{X}^{\otimes r}},

where we adopt the convention $\mathcal{X}^{\otimes 0}=\mathbb{R}$ and $y^{\otimes 0}=1$ for $r=0$ .

Since Gaussian measures on separable Hilbert spaces have finite moments of all orders, the Bochner moments

M_{r,k}\coloneqq\mathbb{E}[y^{\otimes r}]\in\mathcal{X}^{\otimes r},\qquad M_{r,s}\coloneqq\mathbb{E}[(y^{\prime})^{\otimes r}]\in\mathcal{X}^{\otimes r}

are well-defined. Using independence, the expectation of the inner product factors as the inner product of the expectations:

\mathbb{E}\big[\langle y^{\otimes r},(y^{\prime})^{\otimes r}\rangle_{\mathcal{X}^{\otimes r}}\big]=\big\langle\mathbb{E}[y^{\otimes r}],\,\mathbb{E}[(y^{\prime})^{\otimes r}]\big\rangle_{\mathcal{X}^{\otimes r}}=\langle M_{r,k},M_{r,s}\rangle_{\mathcal{X}^{\otimes r}}.

Substituting this back into the binomial expansion yields

I_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\langle M_{r,k},M_{r,s}\rangle_{\mathcal{X}^{\otimes r}},

which completes the proof. ∎

H.9 Proof of Proposition D.1

Proof.

Let $\rho$ be as in the proposition and define $q=dQ/d\rho\in L^{1}(\rho)$ , so that $\int q\,\mathrm{d}\rho=Q(\mathcal{X})=1$ . Since $Q\ll\rho$ , the projected measure $Q_{M}=Q\circ\Pi_{M}^{-1}$ satisfies $Q_{M}\ll\rho_{M}$ , so $q_{M}=dQ_{M}/d\rho_{M}$ is well defined. Let $X$ denote the identity random element on the probability space $(\mathcal{X},\mathcal{B}(\mathcal{X}),\rho)$ and set $X^{(M)}\coloneqq\Pi_{M}X$ . Write $\mathcal{F}_{M}\coloneqq\sigma(X^{(M)})$ and $\mathcal{F}_{\infty}\coloneqq\sigma(\cup_{M\geq 1}\mathcal{F}_{M})$ . The same projection argument used in the proof of Proposition 2.2 gives $\mathcal{F}_{\infty}=\sigma(X)=\mathcal{B}(\mathcal{X})$ .

Step 1 (identification as a conditional expectation). By definition, $q_{M}=dQ_{M}/d\rho_{M}$ and for each $B\in\mathcal{B}(\mathcal{X}_{M})$ ,

\int_{B}q_{M}(u)\,\mathrm{d}\rho_{M}(u)=Q_{M}(B)=Q(\Pi_{M}^{-1}(B))=\int_{\Pi_{M}^{-1}(B)}q(x)\,\mathrm{d}\rho(x).

Since $\rho_{M}=\rho\circ\Pi_{M}^{-1}$ , the left-hand side equals

\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q_{M}(X^{(M)})\,\mathrm{d}\rho.

Combining these identities, we obtain

\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q_{M}(X^{(M)})\,\mathrm{d}\rho=\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q(X)\,\mathrm{d}\rho\qquad\forall B\in\mathcal{B}(\mathcal{X}_{M}),

so $q_{M}(X^{(M)})$ is a version of $\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}]$ . Equivalently, $q_{M}(\Pi_{M}x)=\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}](x)$ for $\rho$ -a.e. $x$ .

Step 2 (martingale convergence). Since $(\mathcal{F}_{M})_{M\geq 1}$ is an increasing filtration and $q(X)\in L^{1}(\rho)$ , the martingale convergence theorem yields

\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}]\longrightarrow\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{\infty}]=q(X)\qquad\rho\text{-a.s. and in }L^{1}(\rho).

Combining with Step 1 shows $q_{M}(\Pi_{M}x)\to q(x)$ for $\rho$ -a.e. $x$ , and in $L^{1}(\rho)$ .

Step 3 (log-likelihood convergence under $Q$ ). Since $Q\ll\rho$ , the almost sure convergence in Step 2 also holds $Q$ -a.s. Moreover, $q>0$ $Q$ -a.s. and $q_{M}>0$ $Q_{M}$ -a.s.; hence $q_{M}(\Pi_{M}X)>0$ for $Q$ -a.e. $X$ , and $\log q(X)$ and $\log q_{M}(\Pi_{M}X)$ are well-defined $Q$ -a.s. Therefore, for i.i.d. $X_{1},\dots,X_{n}\sim Q$ ,

\log q_{M}(\Pi_{M}X_{i})\longrightarrow\log q(X_{i})\qquad\text{a.s. for each }i,

and summing over $i=1,\dots,n$ yields the claimed convergence of the log-likelihoods. ∎

Gaussian mixture models in Hilbert spaces via kernel methods

Gaussian Mixture Models in Hilbert Spaces via Kernel Methods

Abstract

1 Introduction

Model and objective.

Motivation.

Notation.

1.1 Contributions and paper organization

Related work

2 MMD fitting

2.1 Fitting and optimization

2.2 Numerical computation

Proposition 2.1 (Consistency of projected MMD).

Posterior probabilities.

Proposition 2.2.

3 Approximation theory

Proposition 3.1 (Wasserstein and MMD density).

4 Numerical experiments

4.1 Clustering benchmark

4.2 Case study: different representations of digital health time series

Background and data.

Time-varying mixture weights.

Group-level comparison.

4.2.1 Results

5 Final remarks

References

Appendix A Background

A.1 Gaussian measures on separable Hilbert spaces

A.2 Kernel mean embeddings and Maximum Mean Discrepancy

A.3 Fourier interpretation of the Gaussian MMD

Appendix B Closed-form MMD expressions

Proposition B.1 (MMD for the Gaussian kernel).

Proposition B.2 (MMD for the polynomial kernel).

B.1 Numerical computations and algorithm description

Projected responsibilities.

Computing Ji,k(M)J^{(M)}_{i,k}.

Computing Ik,s(M)I^{(M)}_{k,s}.

B.2 Algorithm descriptions

Appendix C Time-varying mixtures

Consistency guarantees

Proposition C.1 (Sample consistency of the integrated MMD).

Proposition C.2 (Projection consistency for temporal mixtures).

Proposition C.3 (Density of time-varying Gaussian mixtures).

Appendix D Projected maximum likelihood estimation

Proposition D.1 (Consistency of projected likelihood ratios).

Appendix E Extension to Banach spaces

Karhunen–Loève decomposition.

Corollary E.1 (Banach-space Karhunen–Loève decomposition [4, Corollary 3.8]).

Applicability.

Appendix F Synthetic experiments

F.1 Gaussian random variables in ℝd\mathbb{R}^{d}

Experiment.

F.2 Functional data in L2​(0,1)L^{2}(0,1)

Experiment.

F.3 Functional data in L2​([0,1]2)L^{2}([0,1]^{2})

Experiment.

F.4 L2​(SO​(3))L^{2}(\mathrm{SO}(3)) rotation data

Experiment.

F.5 Graph signals

Experiment.

F.6 Linear SDE: system identification

Experiment.

F.7 3D molecular data (QM9)

Experiment.

F.8 NTU RGB+D skeleton sequences

Experiment.

F.9 Summary

Appendix G Experimental details and compute resources

G.1 Hilbert spaces and bases

(B1) Euclidean canonical basis.

(B2) L2L^{2} cosine basis.

(B3) Tensor-product cosine basis on [0,1]2[0,1]^{2}.

(B4) H1H^{1} cosine basis.

(B5) Real Wigner–D basis on SO​(3)\mathrm{SO}(3).

(B6) Graph Laplacian eigenbasis.

(B7) Frobenius basis on Sym​(d)\mathrm{Sym}(d).

(B8) Weisfeiler–Lehman (WL) with kernel PCA / DCT.

G.2 Experimental setting

G.3 Statistical reporting

G.4 Compute resources

Computing $J^{(M)}_{i,k}$ .

Computing $I^{(M)}_{k,s}$ .

F.1 Gaussian random variables in $\mathbb{R}^{d}$

F.2 Functional data in $L^{2}(0,1)$

F.3 Functional data in $L^{2}([0,1]^{2})$

F.4 $L^{2}(\mathrm{SO}(3))$ rotation data

(B2) $L^{2}$ cosine basis.

(B3) Tensor-product cosine basis on $[0,1]^{2}$ .

(B4) $H^{1}$ cosine basis.

(B5) Real Wigner–D basis on $\mathrm{SO}(3)$ .

(B7) Frobenius basis on $\mathrm{Sym}(d)$ .