arXiv:2605.05996 · stat.ML · uncurated · rendered via ar5iv

Gaussian mixture models in Hilbert spaces via kernel methods

Title and authors will populate once this paper is indexed.
This paper is rendered from ar5iv. Reproductions and verdicts are not yet available — but you can leave a comment below.
[2605.05996] Gaussian Mixture Models in Hilbert Spaces via Kernel Methods

Gaussian Mixture Models in Hilbert Spaces via Kernel Methods

Daniel López-Montero Affiliation: Friedrich-Alexander-Universität Erlangen-Nürnberg    Antonio Álvarez-López Affiliation: Friedrich-Alexander-Universität Erlangen-Nürnberg Affiliation: Universidad Autónoma de Madrid    Marcos Matabuena Affiliation: Mohamed bin Zayed University Affiliation: of Artificial Intelligence
Abstract

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including L2L^{2}-functional data and random graphs in Laplacian spaces arising in modern medical applications.

00footnotetext: Code, data and experiments are available at https://github.com/dani2442/rkhs-mixture-models/.

1 Introduction

Technological progress across multiple domains is generating high-dimensional datasets that demand new analytical tools. Modern data structures, such as random functions, spatial fields, images, signals, graphs, or trajectories are often more naturally represented as random objects in Hilbert spaces than as finite-dimensional Euclidean vectors in d\mathbb{R}^{d}. Analyzing the data directly using Hilbert space methods preserves their intrinsic geometric structure and allows their full potential to be exploited [53, 26].

In many applications, observed data are heterogeneous and may arise from KK latent populations, each characterized by a probability law PiP_{i}, with i=1,,Ki=1,\dots,K. Mixture models [28] provide a flexible framework for representing such data-generating mechanisms and, under suitable regularity conditions, can approximate broad classes of probability measures. This paper proposes a mixture-based framework for probability measures on a separable Hilbert space 𝒳\mathcal{X}, extending the classical idea of Gaussian mixture models (GMM) for finite-dimensional data [22].

Extending GMMs beyond finite-dimensional spaces is challenging. Infinite-dimensional Hilbert spaces do not admit a Lebesgue measure, so densities cannot be defined in the standard way, and likelihood-based methods typically require the choice of a reference measure [19, 41]. We therefore adopt a density-free perspective: we fit Gaussian mixture models by minimizing the Maximum Mean Discrepancy (MMD) [33] between the kernel mean embedding of the empirical measure and that of the target mixture model. We now formulate the problem.

Model and objective.

Let PP be an unknown probability measure on a separable Hilbert space 𝒳\mathcal{X}. A Gaussian measure on 𝒳\mathcal{X}, denoted by 𝒩(m,𝒦)\mathcal{N}(m,\mathcal{K}), is uniquely determined by a mean element m𝒳m\in\mathcal{X} and a covariance operator 𝒦:𝒳𝒳\mathcal{K}\colon\mathcal{X}\to\mathcal{X}, which generalizes the finite-dimensional covariance matrix. To ensure non-Dirac Gaussian components with finite total variance, 𝒦\mathcal{K} must be non-zero, self-adjoint, positive semi-definite, and trace-class.

For a fixed number of components KK\in\mathbb{N}, we approximate PP with a finite Gaussian mixture

Qθk=1Kπk𝒩(mk,𝒦k),πk0k,k=1Kπk=1,Q_{\theta}\coloneqq\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(m_{k},\mathcal{K}_{k}),\qquad\pi_{k}\geq 0\quad\forall k,\qquad\sum_{k=1}^{K}\pi_{k}=1, (1)

parameterized by θ(πk,mk,𝒦k)k=1KΘK\theta\coloneqq(\pi_{k},m_{k},\mathcal{K}_{k})_{k=1}^{K}\in\Theta_{K}, where ΘK\Theta_{K} denotes the admissible parameter space.

We fit QθQ_{\theta} to PP by minimizing the Maximum Mean Discrepancy induced by a measurable positive-definite kernel κ:𝒳×𝒳\kappa\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}:

MMDκ2(P,Qθ)=𝔼X,XP[κ(X,X)]2𝔼XP,YQθ[κ(X,Y)]+𝔼Y,YQθ[κ(Y,Y)],\operatorname{MMD}_{\kappa}^{2}(P,Q_{\theta})=\mathbb{E}_{X,X^{\prime}\sim P}[\kappa(X,X^{\prime})]-2\,\mathbb{E}_{X\sim P,\;Y\sim Q_{\theta}}[\kappa(X,Y)]+\mathbb{E}_{Y,Y^{\prime}\sim Q_{\theta}}[\kappa(Y,Y^{\prime})], (2)

where XX and XX^{\prime} are independent copies drawn from PP, YY and YY^{\prime} are independent copies drawn from QθQ_{\theta}, and all four random variables are mutually independent. This leads to a density-free fitting criterion that depends only on kernel expectations. When the kernel κ\kappa is characteristic, the MMD is a metric on probability measures. Important examples include Gaussian radial kernels in Euclidean spaces and, more generally, in suitable metric spaces of strong negative type [60]. The formal definitions and statistical background are deferred to Supplemental Material Appendix˜A.

In practice, given X1,,Xni.i.d.PX_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}P, we replace PP by the empirical measure Pn1ni=1nδXiP_{n}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}} and solve the problem

minθΘKMMDκ2(Pn,Qθ).\min_{\theta\in\Theta_{K}}\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta}). (3)
Motivation.

The motivation for this work is to perform clustering of dynamic, time-varying functional objects [23, 24] that take values in a possibly infinite-dimensional space 𝒳\mathcal{X}. Figure 1 shows examples of contemporary data structures that arise in this setting.

Let 𝒳\mathcal{X} be a Hilbert space, and let (X(t))t[0,T](X(t))_{t\in[0,T]} be a stochastic process valued in 𝒳\mathcal{X}. We observe independent sample trajectories

X1(t),,Xn(t)X_{1}(t),\dots,X_{n}(t)

and denote by PtP_{t} the law of X(t)X(t) at time tt. We estimate a time-varying mixture model by minimizing a temporal finite-sample approximation of the integrated discrepancy

0TMMDκ2(Pn,t,Qθ,t)dt,\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,t},Q_{\theta,t}\bigr)\,\mathrm{d}t,

where Pn,tP_{n,t} is the empirical law of X1(t),,Xn(t)X_{1}(t),\dots,X_{n}(t) at time tt. In Section˜4.2, this methodology is applied to continuous glucose monitoring data from a diabetes clinical trial evaluating a new therapy for juvenile diabetes [67].

Notation.

Throughout, 𝒳\mathcal{X} denotes a separable real Hilbert space with inner product ,𝒳\langle\cdot,\cdot\rangle_{\mathcal{X}}, norm 𝒳\|\cdot\|_{\mathcal{X}}, and identity operator I𝒳I_{\mathcal{X}}. We let (𝒳)\mathcal{L}(\mathcal{X}) denote the space of bounded linear operators on 𝒳\mathcal{X}. For a trace-class operator T(𝒳)T\in\mathcal{L}(\mathcal{X}) with eigenvalues (λj)j1(\lambda_{j})_{j\geq 1}, its Fredholm determinant is detF(I𝒳+T)j=1(1+λj)\det_{F}(I_{\mathcal{X}}+T)\coloneqq\prod_{j=1}^{\infty}(1+\lambda_{j}). Let 𝒫(𝒳)\mathcal{P}(\mathcal{X}) denote the set of Borel probability measures on 𝒳\mathcal{X}, and for p1p\geq 1, let 𝒫p(𝒳)\mathcal{P}_{p}(\mathcal{X}) be the subset of measures with finite pp-th moments. For finite-dimensional spaces, boldface lowercase letters (𝐱,𝐦\mathbf{x},\mathbf{m}) denote vectors and boldface uppercase letters (𝐊,𝐈\mathbf{K},\mathbf{I}) denote matrices. We write 𝟏\mathbf{1} for the all-ones vector and ΔK1\Delta^{K-1} for the probability simplex.

1.1 Contributions and paper organization

Refer to caption
Refer to caption
Refer to caption
Figure 1: Representative structured data samples. Left: Hourly continuous glucose monitoring paths. Center: Signals on a fixed graph with varying node intensities. Right: Correlation matrices.

The main contributions of this paper are threefold:

  • Closed-form MMD fitting for Hilbert-space mixtures. We derive exact closed-form expressions for MMDκ2(Pn,Qθ)\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta}) for Gaussian radial basis and polynomial kernels, extending known d\mathbb{R}^{d} formulas to separable Hilbert spaces [1, 13]. Moreover, we prove that, for every p1p\geq 1, finite Gaussian mixtures of the form (1) are dense in 𝒫p(𝒳)\mathcal{P}_{p}(\mathcal{X}) under the Wasserstein distance WpW_{p}. This result provides theoretical support for the approximation capacity of the proposed model in infinite-dimensional applications.

  • Consistent finite-dimensional computation. We introduce a spectral projection framework that reduces infinite-dimensional computation to standard linear algebra in M\mathbb{R}^{M}. We prove that both the projected MMD objective and the posterior probabilities converge to their exact infinite-dimensional counterparts as the number of terms in the finite-dimensional approximation satisfies MM\to\infty; see Propositions 2.1 and 2.2.

  • Real-world performance. We evaluate the empirical performance of the proposed method across several data structures, including random functions in L2L^{2}, spatial fields, rotations in SO(3)\mathrm{SO}(3), and graph Laplacians. We compare our method against state-of-the-art approaches on established clustering benchmarks, demonstrating competitive performance; see Sections˜4 and F. To further assess its performance on dynamic functional data, we analyze a clinical trial dataset with continuous glucose monitoring data, aiming to identify subgroups of individuals with heterogeneous responses to therapy.

The supplementary material contains all proofs, together with additional analytical results.

Related work

Functional data analysis and clustering. Functional data analysis (FDA) studies samples naturally viewed as functions, curves, or surfaces, typically using basis expansions, smoothing, and functional principal component analysis [53, 26]. Our spectral projection framework is related to Karhunen–Loève expansions for Gaussian measures [53, 26, 4], but uses projection as a computational approximation to the MMD objective. Model-based functional clustering—such as Gaussian mixtures—usually assumes latent groups with group-specific low-dimensional structure [9, 17, 36, 31, 35]. Such likelihood-based approaches require a reference measure, which is delicate in infinite dimensions [8, 19, 21]. Distance-based methods [42, 27] avoid this issue, but typically lack the explicit probabilistic and temporal interpretation that a generative model provides.

Kernel mean embeddings and MMD. Kernel mean embeddings represent probability measures as RKHS elements [59, 45, 5], inducing the Maximum Mean Discrepancy (MMD) [33]. For characteristic kernels, MMD metrizes equality in law [60, 57], making it a natural density-free fitting objective. MMD fitting has been used in generative modeling [40, 12], parametric inference [16, 65], kernel-based clustering [29, 43], distributional reinforcement learning [3], and temporal distributional learning [1]. These approaches often rely on closed-form kernel mean embeddings for Gaussian distributions under Gaussian and polynomial kernels [13]; related work also studies MMD testing on function spaces [70]. We extend these formulas and fitting methods from d\mathbb{R}^{d} to separable Hilbert spaces, using characteristic-kernel MMD to fit finite Gaussian mixtures with closed-form optimization, approximation guarantees, and projection-consistent computation.

Mixture models. Gaussian mixture models are a classical tool for model-based clustering and density estimation [22, 28]. Extensions beyond Gaussian components have been proposed [64, 62], and more recent work applies mixtures of Gaussian processes to functional data [66, 34, 44] and to more general infinite-dimensional settings [41]. These methods remain largely likelihood- or projected-density-based; in contrast, we estimate Hilbert-space Gaussian mixtures by minimizing MMD, giving a density-free fitting criterion that is more natural in infinite-dimensional regimes.

Digital health. Digital health generates unprecedented amounts of near-continuous data, motivating dynamic, time-varying functional models [1, 23, 24]. Mixture models for these general data structures remain underdeveloped, despite their potential in clinical trials for identifying subgroups of individuals with heterogeneous treatment responses.

2 MMD fitting

To minimize (3), we evaluate kernel expectations under νk=𝒩(mk,𝒦k)\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k}). Expanding the squared MMD between Pn=1ni=1nδXiP_{n}=\tfrac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}} and Qθ=k=1KπkνkQ_{\theta}=\sum_{k=1}^{K}\pi_{k}\nu_{k} yields

(3)=1n2i,j=1nκ(Xi,Xj)2ni=1nk=1Kπk𝔼Yνk[κ(Xi,Y)]+k,s=1Kπkπs𝔼Yνk,Yνs[κ(Y,Y)].\displaystyle\eqref{eq:objective_emp}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(X_{i},X_{j})-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}\,\mathbb{E}_{Y\sim\nu_{k}}[\kappa(X_{i},Y)]+\sum_{k,s=1}^{K}\pi_{k}\pi_{s}\,\mathbb{E}_{Y\sim\nu_{k},\;Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})]. (4)

Because the first data–data term is constant in θ\theta, optimization depends entirely on the data-to-component and component-to-component expectations:

Ji,k𝔼Yνk[κ(Xi,Y)],Ik,s𝔼Yνk,Yνs[κ(Y,Y)].J_{i,k}\coloneqq\mathbb{E}_{Y\sim\nu_{k}}[\kappa(X_{i},Y)],\qquad I_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\;Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})]. (5)

To make the objective a tractable differentiable function of θ\theta, we consider Gaussian and polynomial kernels, for which these expectations admit closed-form expressions (Appendix˜B):

κA(x,y)=exp(12xy,A(xy)𝒳)andκ(x,y)=(x,y𝒳+c)p,\kappa_{A}(x,y)=\exp\!\left(-\frac{1}{2}\langle x-y,A(x-y)\rangle_{\mathcal{X}}\right)\qquad\text{and}\qquad\kappa(x,y)=(\langle x,y\rangle_{\mathcal{X}}+c)^{p},

where A(𝒳)A\in\mathcal{L}(\mathcal{X}) is bounded, self-adjoint, and positive semidefinite (often isotropic, A=σκ2I𝒳A=\sigma_{\kappa}^{-2}I_{\mathcal{X}}). The finite moments condition required by polynomial kernels are automatically satisfied by both QθQ_{\theta} and PnP_{n}.

2.1 Fitting and optimization

The empirical objective (4) naturally splits into two parameter groups: mixture weights πΔK1\pi\in\Delta^{K-1} and component parameters {(mk,𝒦k)}k=1K\{(m_{k},\mathcal{K}_{k})\}_{k=1}^{K}. This enables an alternating scheme that combines a convex weight update with a nonconvex update of the component means and covariances.

To make the dependence on π\pi explicit, define the averaged cross-expectation vector 𝐉K\mathbf{J}\in\mathbb{R}^{K} with entries Jk1ni=1nJi,kJ_{k}\coloneqq\frac{1}{n}\sum_{i=1}^{n}J_{i,k} and the Gram matrix 𝐈K×K\mathbf{I}\in\mathbb{R}^{K\times K} with entries Ik,sI_{k,s}. Dropping the (constant) data–data term, minimizing (4) over π\pi for fixed components reduces to the quadratic program

minπKπ𝐈π2𝐉πs.t.π0,𝟏π=1.\min_{\pi\in\mathbb{R}^{K}}\ \pi^{\top}\mathbf{I}\,\pi-2\mathbf{J}^{\top}\pi\qquad\text{s.t.}\qquad\pi\geq 0,\quad\mathbf{1}^{\top}\pi=1. (6)

Since Ik,s=μνk,μνsκI_{k,s}=\langle\mu_{\nu_{k}},\mu_{\nu_{s}}\rangle_{\mathcal{H}_{\kappa}} is a Gram inner product of kernel mean embeddings, 𝐈\mathbf{I} is positive semi-definite and the program is convex. Geometrically, the optimal π\pi^{\star} projects the empirical mean embedding μPn\mu_{P_{n}} onto the convex hull of the component embeddings {μνk}k=1K\{\mu_{\nu_{k}}\}_{k=1}^{K}.

Since joint optimization over the full parameter θ\theta remains nonconvex, we employ an alternating minimization strategy, analogous to EM [22]. In each iteration, we first solve (6) for π\pi with fixed components, and then fix π\pi to update the means and covariances via a finite-dimensional gradient step (Section˜2.2). Pseudocode and implementation details are deferred to Section˜B.2.

2.2 Numerical computation

For practical implementation, we project the Hilbert space objects of Section˜2 onto a finite-dimensional subspace. Fixing an orthonormal basis (ej)j1(e_{j})_{j\geq 1} of 𝒳\mathcal{X} (e.g., cosine or data-driven Karhunen–Loève [53, 4]), we apply the orthogonal projector ΠM:𝒳𝒳Mspan{e1,,eM}\Pi_{M}:\mathcal{X}\to\mathcal{X}_{M}\coloneqq\mathrm{span}\{e_{1},\dots,e_{M}\}. This identifies 𝒳M\mathcal{X}_{M} with M\mathbb{R}^{M}, allowing each object to be represented by its coordinate vector.

For each component νk=𝒩(mk,𝒦k)\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k}), let 𝐦k,MM\mathbf{m}_{k,M}\in\mathbb{R}^{M} and 𝐊k,MM×M\mathbf{K}_{k,M}\in\mathbb{R}^{M\times M} be the coordinate representations of the projected mean ΠMmk\Pi_{M}m_{k} and covariance ΠM𝒦kΠM\Pi_{M}\mathcal{K}_{k}\Pi_{M}, and let 𝐱i,MM\mathbf{x}_{i,M}\in\mathbb{R}^{M} be the coordinates of ΠMXi\Pi_{M}X_{i}. The pushforward νk,MνkΠM1\nu_{k,M}\coloneqq\nu_{k}\circ\Pi_{M}^{-1} is then a Gaussian on M\mathbb{R}^{M}. The projected mixture and empirical measure on M\mathbb{R}^{M} are

Qθ,Ms=1Kπsνs,M,Pn,M1ni=1nδ𝐱i,M.Q_{\theta,M}\coloneqq\sum_{s=1}^{K}\pi_{s}\nu_{s,M},\qquad P_{n,M}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{\mathbf{x}_{i,M}}.

The projected cross-expectations are

Ji,k(M)𝔼Yνk[κ(ΠMXi,ΠMY)],Ik,s(M)𝔼Yνk,Yνs[κ(ΠMY,ΠMY)].J^{(M)}_{i,k}\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(\Pi_{M}X_{i},\Pi_{M}Y)\big],\qquad I^{(M)}_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}\big[\kappa(\Pi_{M}Y,\Pi_{M}Y^{\prime})\big]. (7)

Computationally, these expectations are evaluated exactly in M\mathbb{R}^{M} using the coordinate vectors 𝐱i,M\mathbf{x}_{i,M} and the finite-dimensional Gaussian measures νk,M\nu_{k,M}. Explicit matrix formulas for (7) under Gaussian and polynomial kernels are deferred to Section˜B.1. The finite-dimensional counterpart of (4) is

MMDκ2(Pn,M,Qθ,M)=1n2i,j=1nκ(ΠMXi,ΠMXj)2ni=1nk=1KπkJi,k(M)+k,s=1KπkπsIk,s(M).\operatorname{MMD}_{\kappa}^{2}(P_{n,M},Q_{\theta,M})=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(\Pi_{M}X_{i},\Pi_{M}X_{j})-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}J^{(M)}_{i,k}+\sum_{k,s=1}^{K}\pi_{k}\pi_{s}I^{(M)}_{k,s}. (8)

Disregarding the (constant) data–data term, evaluating (8) per iteration costs O(K2M3+nKM2)O(K^{2}M^{3}+nKM^{2}) for both the Gaussian radial and polynomial kernels. Moreover, if we assume that the projected covariances are diagonal, the cost reduces to O(K2M+nKM)O(K^{2}M+nKM), making the method highly efficient and scalable to large sample sizes and projection dimensions.

Refer to caption
(a) MMD2 versus projection dimension MM and sample size nn.
Refer to caption
(b) Training convergence for different numbers of components KK.
Refer to caption
(c) Final MMD2 across component counts KK and kernel bandwidths σ\sigma.
Figure 2: Numerical sensitivity of the MMD objective to various hyperparameters and sample sizes.
Proposition 2.1 (Consistency of projected MMD).

Let κ\kappa be a continuous kernel on 𝒳\mathcal{X}. Assume there exist constants C,q0C,q\geq 0 such that

|κ(x,y)|C(1+x𝒳q)(1+y𝒳q)for all x,y𝒳,|\kappa(x,y)|\leq C\bigl(1+\|x\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr)\qquad\text{for all }x,y\in\mathcal{X}, (9)

and that 𝒳x𝒳qdνk(x)<\int_{\mathcal{X}}\|x\|_{\mathcal{X}}^{q}\,\mathrm{d}\nu_{k}(x)<\infty for all k=1,,Kk=1,\dots,K. Then, the projected MMD objective converges to the exact MMD:

MMDκ2(Pn,M,Qθ,M)MMMDκ2(Pn,Qθ)a.s.\operatorname{MMD}_{\kappa}^{2}(P_{n,M},Q_{\theta,M})\xrightarrow{M\to\infty}\operatorname{MMD}_{\kappa}^{2}(P_{n},Q_{\theta})\qquad\text{a.s.} (10)

Together with the known empirical stability of MMD that controls the error from replacing PP by PnP_{n} (see [65, Proposition A.1]), Proposition 2.1 shows that the finite-dimensional surrogate (8) first approximates the Hilbert-space empirical criterion as MM\to\infty, and then the population as nn\to\infty. This justifies (8) as a computationally tractable proxy for the infinite-dimensional MMD objective.

Posterior probabilities.

The mixture admits the standard generative interpretation: draw a latent assignment Z{1,,K}Z\in\{1,\dots,K\} with prior (Z=k)=πk\mathbb{P}(Z=k)=\pi_{k}, then sample X{Z=k}νkX\mid\{Z=k\}\sim\nu_{k}. The posterior component probability, or responsibility, is defined by the Radon–Nikodym derivative

γk(X)(Z=kX)=d(πkνk)dQθ(X),Qθ-a.s.,\gamma_{k}(X)\coloneqq\mathbb{P}(Z=k\mid X)=\frac{\,\mathrm{d}(\pi_{k}\nu_{k})}{\,\mathrm{d}Q_{\theta}}(X),\qquad Q_{\theta}\text{-a.s.}, (11)

which is well-defined because πkνkQθ\pi_{k}\nu_{k}\ll Q_{\theta} whenever πk>0\pi_{k}>0. In the finite-dimensional setting, whenever the projected covariance 𝐊k,M\mathbf{K}_{k,M} is nonsingular, νk,M\nu_{k,M} admits a Lebesgue density pk,Mp_{k,M} on M\mathbb{R}^{M} and (11) reduces to the familiar Bayes formula

γk(M)(x)=πkpk,M(x)s=1Kπsps,M(x).\gamma_{k}^{(M)}(x)=\frac{\pi_{k}\,p_{k,M}(x)}{\sum_{s=1}^{K}\pi_{s}\,p_{s,M}(x)}. (12)

As MM grows, σ(ΠMX)\sigma(\Pi_{M}X) accumulates the information of σ(X)\sigma(X), yielding the following convergence.

Proposition 2.2.

Let (Z,X)(Z,X) be as above and let γk(M)d(πkνk,M)dQθ,M\gamma_{k}^{(M)}\coloneqq\frac{\,\mathrm{d}(\pi_{k}\nu_{k,M})}{\,\mathrm{d}Q_{\theta,M}}. Then the projected responsibility satisfies the martingale property

γk(M)(ΠMX)=𝔼[𝟏{Z=k}σ(ΠMX)]a.s.,\gamma_{k}^{(M)}(\Pi_{M}X)=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(\Pi_{M}X)]\quad\text{a.s.}, (13)

and γk(M)(ΠMX)γk(X)\gamma_{k}^{(M)}(\Pi_{M}X)\to\gamma_{k}(X) as MM\to\infty both almost surely and in L1L^{1}.

The proofs of Propositions 2.1 and 2.2 are deferred to Section˜H.1 and Section˜H.2, respectively.

3 Approximation theory

In this section we establish theoretical guarantees that justify the use of finite Gaussian mixtures in Hilbert spaces as a flexible approximation class.

Proposition 3.1 (Wasserstein and MMD density).

Let 𝒳\mathcal{X} be a separable Hilbert space.

  1. (i)

    For every p1p\geq 1, finite mixtures of Gaussian measures on 𝒳\mathcal{X} of the form (1) are dense in 𝒫p(𝒳)\mathcal{P}_{p}(\mathcal{X}) with respect to the Wasserstein distance WpW_{p}.

  2. (ii)

    Let κ\kappa be a continuous positive-definite kernel satisfying supx𝒳κ(x,x)1\sup_{x\in\mathcal{X}}\kappa(x,x)\leq 1. Then, for all P𝒫(𝒳)P\in\mathcal{P}(\mathcal{X}) and all K1K\geq 1, there exists QKQ_{K} of the form (1) such that

    MMDκ(P,QK)2K.\operatorname{MMD}_{\kappa}(P,Q_{K})\leq\frac{2}{\sqrt{K}}.

Part (i) shows that the model class is highly expressive under Wasserstein metrics. These metrics are widely used in mathematical statistics to study convergence of estimators, resampling schemes such as the bootstrap, and related distributional properties [49]. This result also implies MMD density for characteristic bounded continuous kernels. Part (ii), however, applies to arbitrary probability measures and provides an explicit finite-mixture approximation bound for the MMD discrepancy minimized by our algorithm. The proof of Proposition 3.1 is deferred to Section˜H.3.

4 Numerical experiments

The first goal of this section is to assess the empirical clustering performance of our methods against established approaches in both finite-dimensional and infinite-dimensional regimes. We then focus on a digital health case study to demonstrate the versatility of our methods for analyzing complex biomarkers that may provide new clinical insights in practice. In particular, we summarize the glucose time series using different representations, including random functions and graphs, which can be viewed as functional dynamic objects. Additional numerical comparisons are reported in Appendix˜F.

4.1 Clustering benchmark

We compare the mixture model described in Section˜2.2 against several baselines: (i) partitioning and centroid-based methods, including K-Medoids and K-Means [55, 56]; (ii) agglomerative hierarchical methods, including average-linkage, Ward, and BIRCH [46, 68, 72]; (iii) density-based methods, including DBSCAN and OPTICS [25, 2]; (iv) model-based methods, including Gaussian mixtures [28, 41]; and (v) graph- and dimension-reduction methods such as spectral clustering [47].

Datasets. We evaluate our approach using representative synthetic and real-world datasets.

  • Synthetic environments: (i) d\mathbb{R}^{d}: five standard sklearn clustering datasets with n=500n=500, d=2d=2, and K{2,3}K\in\{2,3\} [50]; and (ii) SO(3)\mathrm{SO}(3): 10 synthetic datasets of rotations under an Euler-angle parameterization, projected onto Wigner-D matrix coefficients, with n=200n=200 and K=3K=3.

  • Real-world datasets: (iii) L2L^{2}: five standard functional data benchmarks [53, 11, 26, 51, 48]; (iv) CGM: continuous glucose monitoring curves from n=98n=98 individuals with K=2K=2, detailed in Section˜4.2; and (v) Molecular: MUTAG graph signals with n=188n=188 and K=2K=2 [20].

All implementation details and evaluation protocols are reported in Appendix˜G. In particular, clustering performance is evaluated using the Adjusted Rand Index (ARI), a well-established clustering metric that compares unsupervised cluster assignments with available ground-truth labels. We fix KK to the known number of ground-truth classes, as our goal here is to compare clustering accuracy.

Results. Although designed for general random objects in Hilbert spaces, our method is competitive on finite-dimensional d\mathbb{R}^{d} benchmarks and shows better performance on the CGM and SO(3)\mathrm{SO}(3) datasets, which fall outside the scope of most existing clustering methods. It also performs well in the infinite-dimensional L2L^{2} setting. In general, the proposed model remains highly competitive and computationally efficient, while preserving a common probabilistic interpretation. This flexibility enables time-dependent interpretations, as illustrated in the digital health case study.

Table 1: Benchmark of clustering algorithms comparing the operating spaces; the computational complexity of training, inference, and memory as functions of the number of samples nn and clusters KK; and clustering performance (ARI, meanstd{}_{\text{std}}) across several datasets grouped by geometric space.
Method Space Train Infer Memory d\mathbb{R}^{d} L2L^{2} CGM. SO(3)\mathrm{SO}(3) Molecular
MMD GMM (Gaussian) Hilbert O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.6920.3790.692_{0.379} 0.4120.2190.412_{0.219} 0.0530.040\mathbf{0.053}_{0.040} 0.8670.140\mathbf{0.867}_{0.140} 0.1040.0320.104_{0.032}
MMD GMM (Polynomial) Hilbert O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.6150.3340.615_{0.334} 0.3760.1440.376_{0.144} 0.0510.0220.051_{0.022} 0.6750.2170.675_{0.217} 0.1510.009\mathbf{0.151}_{0.009}
Projected GMM (Appendix D) Hilbert O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.3860.2760.386_{0.276} 0.0340.0000.034_{0.000} 0.8280.1720.828_{0.172} 0.1040.0590.104_{0.059}
K-Medoids [37] Metric O(n2K)O(n^{2}K) O(nK)O(nK) O(n2)O(n^{2}) 0.5690.3280.569_{0.328} 0.4580.201\mathbf{0.458}_{0.201} 0.0200.0000.020_{0.000} 0.8280.1090.828_{0.109} 0.0050.000-0.005_{0.000}
Hierarchical (Avg) [27, 39] Metric O(n2logn)O(n^{2}\log n) O(n2)O(n^{2}) 0.5550.3200.555_{0.320} 0.2340.2960.234_{0.296} 0.0020.0000.002_{0.000} 0.4900.3150.490_{0.315} 0.0810.0000.081_{0.000}
DBSCAN [25] Metric O(n)O(n) O(n)O(n) 0.8440.1640.844_{0.164} 0.0000.0000.000_{0.000} 0.0000.0000.000_{0.000} 0.6270.2730.627_{0.273} 0.0740.0000.074_{0.000}
HDBSCAN [14] Metric O(n)O(n) O(n)O(n) 0.8700.1710.870_{0.171} 0.1710.2270.171_{0.227} 0.0000.0000.000_{0.000} 0.6910.2270.691_{0.227} 0.1170.0000.117_{0.000}
K-Center [32] Metric O(nK)O(nK) O(nK)O(nK) O(n)O(n) 0.3920.2580.392_{0.258} 0.1500.1150.150_{0.115} 0.0020.0000.002_{0.000} 0.1880.1690.188_{0.169} 0.0740.0000.074_{0.000}
Kernel k-Groups [29] Metric O(Kn2)O(Kn^{2}) O(n2)O(n^{2}) 0.5670.3280.567_{0.328} 0.4380.2190.438_{0.219} 0.0190.0000.019_{0.000} 0.7510.1890.751_{0.189} 0.1250.0000.125_{0.000}
FDA K-Means [42] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.4510.1960.451_{0.196} 0.0230.0000.023_{0.000}
FDA Fuzzy C-Means [6] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.3960.1570.396_{0.157} 0.0200.0000.020_{0.000}
Funclust [35] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.4510.1860.451_{0.186} 0.0150.0000.015_{0.000}
FunHDDC [9] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.4190.2170.419_{0.217} 0.0130.0000.013_{0.000}
fclust [36] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.4300.2180.430_{0.218} 0.0300.0000.030_{0.000}
K-Centres [17] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.3610.1570.361_{0.157} 0.0010.0000.001_{0.000}
Curvclust [31] L2L^{2} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.3830.2500.383_{0.250} 0.0120.0000.012_{0.000}
MiniBatch KMeans [56] d\mathbb{R}^{d} O(nK)O(nK) O(nK)O(nK) O(n)O(n) 0.5540.3270.554_{0.327}
Spectral [47] d\mathbb{R}^{d} O(n3)O(n^{3}) O(n2)O(n^{2}) 0.9490.074\mathbf{0.949}_{0.074}
Ward [68] d\mathbb{R}^{d} O(n2)O(n^{2}) O(n2)O(n^{2}) 0.6130.3300.613_{0.330}
BIRCH [72] d\mathbb{R}^{d} O(n)O(n) O(nK)O(nK) O(n)O(n) 0.5250.3030.525_{0.303}
Gaussian Mixture (EM) [28] d\mathbb{R}^{d} O(nK)O(nK) O(nK)O(nK) O(nK)O(nK) 0.6830.3870.683_{0.387}
Affinity Propagation [30] d\mathbb{R}^{d} O(n2)O(n^{2}) O(n2)O(n^{2}) 0.4890.3080.489_{0.308}
MeanShift [71] d\mathbb{R}^{d} O(n2)O(n^{2}) O(n)O(n) O(n)O(n) 0.5640.3280.564_{0.328}
OPTICS [2] d\mathbb{R}^{d} O(n2)O(n^{2}) O(n)O(n) 0.8090.2190.809_{0.219}

4.2 Case study: different representations of digital health time series

We now illustrate the proposed model in a digital health case study based on longitudinal continuous glucose monitoring (CGM) data from a juvenile clinical trial [67]. A central question in longitudinal studies is whether and how the distribution of individuals changes over time, for example, in response to a treatment [38, 18]. In the original article [67], the authors demonstrate that the new insulin pump system leads to improvements in several glycemic metrics before and after the intervention. Here, we consider these outcomes in a continuous-time formulation and use a dynamic functional representation of glucose trajectories. Digital health technologies make it possible to study this question at high temporal resolution. Starting from multi-day glucose time series, we construct three daily representations: (i) functional observations in L2L^{2}; (ii) correlation matrices describing the dependence between glucose values at different hours of the day; and (iii) Laplacian graphs encoding similarities in glucose dynamics across hours. Similar representations have been considered as biomarkers in clinical fMRI studies [61, 69].

To analyze temporal changes across these three CGM representations, we extend (1) by allowing the mixture weights to vary over time while sharing the component parameters, namely the mean and covariance operator parameters of the Gaussian mixture model. Our formulation is similar to [1]. However, in that work the authors propose a specific generative model to model distributions through finite-dimensional random examples over time; here, the approach is more general.

Background and data.

Continuous glucose monitors record interstitial glucose concentration every 5 minutes, producing 288 measurements per day. We represent each individual-day by its intraday glucose curve f:[0,24]f\colon[0,24]\to\mathbb{R}. The data come from a randomized controlled diabetes study evaluating the effect of a closed-loop insulin delivery system over a period of more than six months [63]. For each individual, a sliding window of WW days with stride 1 produces one observation per window. Within each window, the 288 time slots are averaged across valid days; slots with fewer than 0.5W\lceil 0.5\,W\rceil valid readings are linearly interpolated from neighboring slots. After preprocessing, the dataset comprises 98 individuals, with 34 in the control group and 64 in the treatment group.

Time-varying mixture weights.

Let X1,,XnX_{1},\dots,X_{n} be i.i.d. copies of an 𝒳\mathcal{X}-valued measurable stochastic process X(t)X(t). For each tt, denote P(t)P(t) the law of X(t)X(t), and define the empirical distribution by Pn(t)1ni=1nδXi(t)P_{n}(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}(t)}. We fit the model

Qθ(t)k=1Kπk(t)𝒩(mk,𝒦k),π(t)(π1(t),,πK(t))ΔK1.Q_{\theta}(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\,\mathcal{N}(m_{k},\mathcal{K}_{k}),\qquad\pi(t)\coloneqq\bigl(\pi_{1}(t),\dots,\pi_{K}(t)\bigr)\in\Delta^{K-1}. (14)

Conceptually, the shared components (mk,𝒦k)(m_{k},\mathcal{K}_{k}) capture fundamental, stationary data archetypes, such as distinct physiological regimes or individual phenotypes, while the time-varying weight vector π(t)\pi(t) describes population-level shifts between these archetypes over time.

Following the kernel KK-groups framework [29], we select KK by evaluating the within-cluster MMD across candidate values and applying the elbow rule. Model fitting minimizes the integrated MMD2,

minθ(t)={πk(t),mk,𝒦k}k1T0TMMD2(Pn(t),Qθ(t))dt.\min_{\theta(t)=\{\pi_{k}(t),m_{k},\mathcal{K}_{k}\}_{k}}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t. (15)

We parameterize π(t)=softmax(𝐳(t))\pi(t)=\mathrm{softmax}\bigl(\mathbf{z}(t)\bigr), where the logit path 𝐳:[0,T]K\mathbf{z}\colon[0,T]\to\mathbb{R}^{K} is the solution to a neural ODE [15], initialized at 𝐳(0)=logπ(0)\mathbf{z}(0)=\log\pi(0). This parameterization introduces a natural bias toward smooth temporal dynamics. Further details on density and consistency results, the projected M\mathbb{R}^{M} approximation of the integrated objective, and implementation specifics are deferred to Appendix˜C.

Refer to caption
Figure 3: Temporal glucose mixture in H1H^{1}. Left: Empirical (top) and model-predicted (bottom) mean glucose surfaces. Top right: Global cluster weights πk(t)\pi_{k}(t) (solid lines) with group-level posteriors for control (dotted) γ¯ctrl(t)\bar{\gamma}_{\mathrm{ctrl}}(t) and treatment (dashed) γ¯treat(t)\bar{\gamma}_{\mathrm{treat}}(t). Bottom right: Learned means.
Group-level comparison.

To compare the control and treatment arms, we compute each individual’s posterior cluster membership γk(xi,t)\gamma_{k}(x_{i},t) at each time point and aggregate these memberships into group-level trajectories. Specifically, for each group g{control,treatment}g\in\{\mathrm{control},\mathrm{treatment}\}, we compute

γ¯g(t)=(γ¯g,1(t),,γ¯g,K(t)),γ¯g,k(t)1|Gg(t)|iGg(t)γk(xi,t),\bar{\gamma}_{g}(t)=(\bar{\gamma}_{g,1}(t),\dots,\bar{\gamma}_{g,K}(t)),\qquad\bar{\gamma}_{g,k}(t)\coloneqq\frac{1}{|G_{g}(t)|}\sum_{i\in G_{g}(t)}\gamma_{k}(x_{i},t),

which averages posterior probabilities across all individuals in group gg observed at time tt. We then quantify the divergence between study arms using the total variation distance TV(t)=12γ¯treat(t)γ¯ctrl(t)1.\mathrm{TV}(t)=\frac{1}{2}\left\|\bar{\gamma}_{\mathrm{treat}}(t)-\bar{\gamma}_{\mathrm{ctrl}}(t)\right\|_{1}. The metric ranges from 0 to 1, where 0 reflects identical group-level cluster usage and 1 reflects complete separation. It provides an interpretable way to quantify treatment-induced distributional shifts over time.

4.2.1 Results

Intraday curves. Each individual-window is represented as an intraday glucose curve in Sobolev space H1(0,24;)H^{1}(0,24;\mathbb{R}), continuous profiles are projected onto a truncated cosine basis and fitted with (15) using K=2K=2 latent clusters. The two learned means contrast an elevated, variable profile (m1m_{1}) with a flatter, well-regulated profile (m2m_{2}); the empirical and model-predicted mean surfaces agree closely (Figure˜3, left). While the control group remains approximately time-invariant, the treatment group gradually shifts toward m2m_{2}, with TV\mathrm{TV} distance increasing from 0.11\approx 0.11 at baseline to 0.19\approx 0.19 at the end of the trial. This is consistent with an improvement in glycemic regulation over the course of the clinical trial.

Correlation matrices. Instead of clustering intraday curves, we cluster the running 24×2424\times 24 inter-hour correlation matrices in the Hilbert space Sym(24)\mathrm{Sym}(24) equipped with the Frobenius inner product, thereby capturing how the dependence structure between hours of the day evolves over the study. Each sliding window (W=14W=14 days) is converted to 24 hourly averages per day, from which the sample correlation matrix is calculated using linear shrinkage toward the identity, with intensity α=0.1\alpha=0.1, and then projected onto the PSD cone. Symmetric matrices are embedded through the scaled vectorization mapping into 300\mathbb{R}^{300}, which preserves the Frobenius distance isometrically, and are fitted with the temporal mixture framework using K=3K=3.

The learned clusters in Figure˜4 identify three inter-hour dependence patterns: moderate local correlation m1m_{1}, broadly strong correlation m2m_{2}, and sharper early-hour dependence m3m_{3}. Over time, membership shifts from m3m_{3} toward the more regulated m1m_{1} regime, consistent with improved glycemic regulation over time.

Refer to caption
Figure 4: Correlation-based temporal mixture in Sym(24)\mathrm{Sym}(24). Left: MMD2 training loss (top) and cluster weights πk(t)\pi_{k}(t) (bottom). Right: Learned mean correlation matrices.

Individual similarity graph signals. We now move from hour- to individual-level structure. For each sliding window (W=21W=21 days), we compute H1H^{1}-pairwise individual similarities to get a time-varying similarity matrix S(t)N×NS(t)\in\mathbb{R}^{N\times N}. A fixed individual graph is then built by thresholding the time-averaged similarity matrix at the 85th percentile, and the graph Laplacian eigenbasis (using R=4R=4 eigenvectors) is used to project each S(t)S(t) onto graph signals. The mixture model with K=3K=3 clusters these projected matrices over time. As the trial progresses (Figure˜5), the network structure evolves from diffuse inter-group connections toward denser within-group similarity in the treatment arm, consistent with individuals progressively achieving improved glycemic control.

Refer to caption
Figure 5: Individual similarity graph temporal mixture. Left: MMD2 training loss (top) and cluster weights πk(t)\pi_{k}(t) (bottom). Right: Network snapshots at four time points; node color distinguishes control (blue) from treatment (red), and edge width encodes similarity.

5 Final remarks

The main contribution of this paper is a Gaussian mixture model for random objects in separable Hilbert spaces, estimated via an MMD-based optimization procedure. A key practical and methodological advantage of the proposed approach is that it bypasses the need for densities or dominating measures, which are problematic in infinite-dimensional settings. In addition, the framework flexibly handles diverse data structures, is efficient, and demonstrates competitive empirical performance.

The main limitation of the method is its scalability in large-scale biomedical studies. This could be mitigated through subsampling strategies [10] or other computational approximation techniques. Future work will focus on developing specialized temporal algorithms for particular Hilbert spaces, extending the framework to general metric spaces, and quantifying the uncertainty of the parameter and prediction of the mixture model.

References

  • [1] A. Álvarez-López and M. Matabuena (2026) Continuous-time learning of probability distributions: a case study in a digital trial of young children with type 1 diabetes. External Links: 2603.24427 Cited by: §F.2, Table 2, 1st item, §1, §1, §4.2.
  • [2] M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander (1999) OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60. Cited by: §4.1, Table 1.
  • [3] M. Antonetti, H. Donãncio, and F. Forbes (2025) An analysis of distributional reinforcement learning with Gaussian mixtures. Cited by: §1.
  • [4] X. Bay and J. Croix (2019) Karhunen–Loève decomposition of Gaussian measures on Banach spaces. 39 (2), pp. 279–297. Cited by: Appendix E, Appendix E, Corollary E.1, §1, §2.2.
  • [5] A. Berlinet and C. Thomas-Agnan (2004) Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer US. Cited by: §1.
  • [6] J. C. Bezdek (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. Springer US. Cited by: Table 1.
  • [7] C. M. Bishop (2006) Pattern recognition and machine learning. Information Science and Statistics, Springer. Cited by: §F.1.
  • [8] V. Bogachev (1998) Gaussian Measures. Mathematical Surveys and Monographs, Vol. 62, American Mathematical Society. Cited by: §1.
  • [9] C. Bouveyron and J. Jacques (2011) Model-based clustering of time series in group-specific functional subspaces. 5 (4), pp. 281–300. Cited by: §1, Table 1.
  • [10] J. R. Bradley (2021) An approach to incorporate subsampling into a generic bayesian hierarchical model. Journal of Computational and Graphical Statistics 30 (4), pp. 889–905. Cited by: §5.
  • [11] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (2017) Classification And Regression Trees. Routledge. Cited by: Table 2, 2nd item.
  • [12] F. Briol, A. Barp, A. B. Duncan, and M. Girolami (2019)Statistical inference for generative models with maximum mean discrepancy(Website) Cited by: §1.
  • [13] F. Briol, A. Gessner, T. Karvonen, and M. Mahsereci (2025)A Dictionary of Closed-Form Kernel Mean Embeddings(Website) Cited by: 1st item, §1.
  • [14] R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander (2015) Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. 10 (1), pp. 1–51. Cited by: Appendix F, Table 1.
  • [15] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2019)Neural Ordinary Differential Equations(Website) Cited by: §4.2.
  • [16] B. Chérief-Abdellatif and P. Alquier (2022) Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. 28 (1), pp. 181–213. Cited by: §1.
  • [17] J. Chiou and P. Li (2007) Functional Clustering and Identifying Substructures of Longitudinal Data. 69 (4), pp. 679–699. Cited by: §1, Table 1.
  • [18] W. Clarke and B. Kovatchev (2009) Statistical tools to analyze continuous glucose monitor data. 11, pp. S–45. Cited by: §4.2.
  • [19] G. Da Prato and J. Zabczyk (2014) Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and Its Applications, Cambridge University Press. Cited by: §1, §1.
  • [20] A. K. Debnath, R. L. Lopez De Compadre, G. Debnath, A. J. Shusterman, and C. Hansch (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. 34 (2), pp. 786–797. Cited by: Table 2, 2nd item.
  • [21] A. Delaigle and P. Hall (2010) Defining probability density for a distribution of random functions. 38 (2). Cited by: §1.
  • [22] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum Likelihood from Incomplete Data Via the \mkbibemphEM Algorithm. 39 (1), pp. 1–22. Cited by: Appendix D, §1, §1, §2.1.
  • [23] P. Dubey and H. Müller (2020) Functional models for time-varying random objects. 82 (2), pp. 275–327. Cited by: §1, §1.
  • [24] P. Dubey and H. Müller (2022) Modeling time-varying random objects and dynamic networks. 117 (540), pp. 2252–2267. Cited by: §1, §1.
  • [25] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 226–231. Cited by: Appendix F, §4.1, Table 1.
  • [26] F. Ferraty and P. Vieu (2006) Nonparametric Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, 2nd item.
  • [27] L. Ferreira and D. B. Hitchcock (2009) A Comparison of Hierarchical Methods for Clustering Functional Data. 38 (9), pp. 1925–1949. Cited by: §1, Table 1.
  • [28] C. Fraley and A. E. Raftery (2002) Model-Based Clustering, Discriminant Analysis, and Density Estimation. 97 (458), pp. 611–631. Cited by: §1, §1, §4.1, Table 1.
  • [29] G. França, M. L. Rizzo, and J. T. Vogelstein (2017)Kernel k-Groups via Hartigan’s Method(Website) arXiv.org. Cited by: Appendix C, §1, §4.2, Table 1.
  • [30] B. J. Frey and D. Dueck (2007) Clustering by Passing Messages Between Data Points. 315 (5814), pp. 972–976. Cited by: Table 1.
  • [31] M. Giacofci, S. Lambert‐Lacroix, G. Marot, and F. Picard (2013) Wavelet‐Based Clustering for Mixed‐Effects Functional Models in High Dimension. 69 (1), pp. 31–40. Cited by: §1, Table 1.
  • [32] T. F. Gonzalez (1985) Clustering to minimize the maximum intercluster distance. 38, pp. 293–306. Cited by: Appendix F, Table 1.
  • [33] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A Kernel Two-Sample Test. 13 (25), pp. 723–773. Cited by: §1, §1.
  • [34] M. Huang, R. Li, H. Wang, and W. Yao (2014) Estimating mixture of gaussian processes by kernel smoothing. 32 (2), pp. 259–270. Cited by: §1.
  • [35] J. Jacques and C. Preda (2013) Funclust: A curves clustering method using functional random variables density approximation. 112, pp. 164–171. Cited by: §1, Table 1.
  • [36] G. M. James and C. A. Sugar (2003) Clustering for Sparsely Sampled Functional Data. 98 (462), pp. 397–408. Cited by: §1, Table 1.
  • [37] L. Kaufman and P. J. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, Wiley. Cited by: Table 1.
  • [38] J. S. Krouwer and G. S. Cembrowski (2010) A review of standards and statistics used to describe blood glucose monitor performance. 4 (1), pp. 75–83. Cited by: §4.2.
  • [39] G. N. Lance and W. T. Williams (1967) A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. 9 (4), pp. 373–380. Cited by: Table 1.
  • [40] Y. Li, K. Swersky, and R. Zemel (2015) Generative Moment Matching Networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1718–1727. Cited by: §1.
  • [41] H. Lu, J. Jia, and D. Meng (2026)Sequential Monte Carlo with Gaussian Mixture Approximation for Infinite-Dimensional Statistical Inverse Problems(Website) Cited by: Appendix D, Appendix F, §1, §1, §4.1.
  • [42] M. Luz López García, R. García-Ródenas, and A. González Gómez (2015) K-means algorithms for functional data. 151, pp. 231–245. Cited by: §1, Table 1.
  • [43] M. Matabuena, J. C. Vidal, O. H. M. Padilla, and D. Sejdinovic (2025) Kernel biclustering algorithm in Hilbert spaces. Cited by: §1.
  • [44] I. C. McDowell, D. Manandhar, C. M. Vockley, A. K. Schmid, T. E. Reddy, and B. E. Engelhardt (2018) Clustering gene expression time series data using an infinite gaussian process mixture model. 14 (1), pp. e1005896. Cited by: §1.
  • [45] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf (2017) Kernel Mean Embedding of Distributions: A Review and Beyond. 10 (1–2), pp. 1–141. Cited by: §1.
  • [46] F. Murtagh and P. Contreras (2012) Algorithms for hierarchical clustering: an overview. 2 (1), pp. 86–97. Cited by: Appendix F, §4.1.
  • [47] A. Y. Ng, M. I. Jordan, and Y. Weiss (2001) On spectral clustering: analysis and an algorithm. In Proceedings of the 15th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pp. 849–856. Cited by: §4.1, Table 1.
  • [48] R. T. Olszewski (2001) Generalized feature extraction for structural pattern recognition in time-series data. Carnegie Mellon University. Cited by: Table 2, 2nd item.
  • [49] V. M. Panaretos and Y. Zemel (2019) Statistical aspects of wasserstein distances. Annual Review of Statistics and Its Application 6, pp. 405–431. Cited by: §3.
  • [50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: Machine learning in python. 12, pp. 2825–2830. Cited by: Table 2, 1st item.
  • [51] C. Preda, Q. Grimonprez, and V. Vandewalle (2021) Categorical functional data analysis. the cfda r package. 9 (23), pp. 3074. Cited by: Table 2, 2nd item.
  • [52] C. Ramos-Carreño, J. L. Torrecilla, M. Carbajo-Berrocal, P. Marcos, and A. Suárez (2024) Scikit-Fda : A \mkbibemphPython Package for Functional Data Analysis. 109 (2). Cited by: Appendix F.
  • [53] J. O. Ramsay and B. W. Silverman (2005) Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, §2.2, 2nd item.
  • [54] C. E. Rasmussen and C. K. I. Williams (2008) Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press. Cited by: §F.2.
  • [55] E. Schubert and P. J. Rousseeuw (2021) Fast and eager k -medoids clustering: O ( k ) runtime improvement of the PAM, CLARA, and CLARANS algorithms. 101, pp. 101804. Cited by: Appendix F, §4.1.
  • [56] D. Sculley (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Cited by: §4.1, Table 1.
  • [57] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. 41 (5). Cited by: §1.
  • [58] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016)NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis(Website) Cited by: §F.8.
  • [59] A. Smola, A. Gretton, L. Song, and B. Schölkopf (2007) A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory, M. Hutter, R. A. Servedio, and E. Takimoto (Eds.), pp. 13–31. Cited by: §1.
  • [60] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet (2011) Universality, Characteristic Kernels and RKHS Embedding of Measures. 12 (70), pp. 2389–2410. Cited by: §1, §1.
  • [61] K. Supekar, V. Menon, D. Rubin, M. Musen, and M. D. Greicius (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer’s Disease. 4 (6), pp. e1000100. Cited by: §4.2.
  • [62] C. Tang, N. Lenssen, Y. Wei, and T. Zheng (2023) Wasserstein Distributional Learning via Majorization-Minimization. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 10703–10731. Cited by: §1.
  • [63] M. Tauschmann, H. Thabit, L. Bally, J. M. Allen, S. Hartnell, M. E. Wilinska, Y. Ruan, J. Sibayan, C. Kollman, P. Cheng, R. W. Beck, C. L. Acerini, M. L. Evans, D. B. Dunger, D. Elleri, F. Campbell, R. M. Bergenstal, A. Criego, V. N. Shah, L. Leelarathna, R. Hovorka, B. Alvarado, C. Ashanti, J. Baggott, K. Balakrishnan, N. Barber, L. Bath, S. Beasley, C. Beatson, S. Borgman, S. Bradshaw, B. Bugielski, A. Carlson, E. Collett, J. Curtis, J. Demmitt, D. Donahue, J. Exall, R. Forshaw, J. Hayes, S. Heath, A. Hellmann, V. Huegel, J. Hyatt, L. James, H. Joseph, P. Joshee, W. Konerza, J. Lum, M. Madden, T. Martens, C. McCarthy, M. McDonald, V. Mikityuk, H. Miles, D. Miller, W. Mubita, C. Murphy, B. Olson, R. Pad, N. Patibandla, K. Riding, A. Shaju, L. Thomas, J. Thomson, D. White, S. Yau, and J. Yong (2018) Closed-loop insulin delivery in suboptimally controlled type 1 diabetes: a multicentre, 12-week randomised trial. 392 (10155), pp. 1321–1329. Cited by: Table 2, §4.2.
  • [64] D. M. Titterington, A. F. M. Smith, and U. E. Makov (1985) Statistical analysis of finite mixture distributions. Wiley. Cited by: §1.
  • [65] I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet (2017) Minimax Estimation of Kernel Mean Embeddings. 18 (86), pp. 1–47. Cited by: §1, §2.2.
  • [66] V. Tresp (2000) Mixtures of Gaussian Processes. In Advances in Neural Information Processing Systems, Vol. 13. Cited by: §F.2, §1.
  • [67] R. P. Wadwa, Z. W. Reed, B. A. Buckingham, M. D. DeBoer, L. Ekhlaspour, G. P. Forlenza, M. Schoelwer, J. Lum, C. Kollman, R. W. Beck, et al. (2023) Trial of hybrid closed-loop control in young children with type 1 diabetes. New England Journal of Medicine 388 (11), pp. 991–1001. Cited by: §1, §4.2.
  • [68] J. H. Ward (1963) Hierarchical Grouping to Optimize an Objective Function. 58 (301), pp. 236–244. Cited by: §4.1, Table 1.
  • [69] T. Wu, Y. Ma, Z. Zheng, S. Peng, X. Wu, D. Eidelberg, and P. Chan (2015) Parkinson’s Disease-Related Spatial Covariance Pattern identified with Resting-State Functional MRI. 35 (11), pp. 1764–1770. Cited by: §4.2.
  • [70] G. Wynne and A. B. Duncan (2022) A kernel two-sample test for functional data. 23 (73), pp. 1–51. Cited by: §H.7, §1.
  • [71] Yizong Cheng (1995) Mean shift, mode seeking, and clustering. 17 (8), pp. 790–799. Cited by: Table 1.
  • [72] T. Zhang, R. Ramakrishnan, and M. Livny (1996) BIRCH: an efficient data clustering method for very large databases. 25 (2), pp. 103–114. Cited by: §4.1, Table 1.

Appendix A Background

This section establishes the notation and foundational properties of Gaussian measures and kernel methods used throughout this work. Appendix E discusses the extension of this framework to Banach spaces.

A.1 Gaussian measures on separable Hilbert spaces

Let 𝒳\mathcal{X} be a separable real Hilbert space equipped with its Borel σ\sigma-algebra (𝒳)\mathcal{B}(\mathcal{X}). A Borel probability measure ν\nu on 𝒳\mathcal{X} is said to be Gaussian if, for every continuous linear functional f𝒳f\in\mathcal{X}^{*}, the pushforward measure νf1\nu\circ f^{-1} is a Gaussian measure on \mathbb{R}. By the Riesz representation theorem, we routinely identify 𝒳\mathcal{X}^{*} with 𝒳\mathcal{X}.

A Gaussian measure ν\nu is uniquely determined by its mean m𝒳m\in\mathcal{X} and its covariance operator 𝒦:𝒳𝒳\mathcal{K}:\mathcal{X}\to\mathcal{X}, which are defined via the relations

m,h𝒳=𝒳x,h𝒳dν(x),h𝒳,\langle m,h\rangle_{\mathcal{X}}=\int_{\mathcal{X}}\langle x,h\rangle_{\mathcal{X}}\,\mathrm{d}\nu(x),\quad h\in\mathcal{X},

and

𝒦g,h𝒳=𝒳xm,g𝒳xm,h𝒳dν(x),g,h𝒳.\langle\mathcal{K}g,h\rangle_{\mathcal{X}}=\int_{\mathcal{X}}\langle x-m,g\rangle_{\mathcal{X}}\langle x-m,h\rangle_{\mathcal{X}}\,\mathrm{d}\nu(x),\quad g,h\in\mathcal{X}.

We denote such a measure by ν=𝒩(m,𝒦)\nu=\mathcal{N}(m,\mathcal{K}). The covariance operator 𝒦\mathcal{K} is necessarily self-adjoint, positive semidefinite, and trace-class. The latter property implies that for any orthonormal basis (ei)i1(e_{i})_{i\geq 1} of 𝒳\mathcal{X}, the trace is finite:

tr(𝒦)=i=1𝒦ei,ei𝒳<.\operatorname{tr}(\mathcal{K})=\sum_{i=1}^{\infty}\langle\mathcal{K}e_{i},e_{i}\rangle_{\mathcal{X}}<\infty.

The case 𝒦=0\mathcal{K}=0 corresponds to the Dirac measure δm\delta_{m}; this case is excluded from the model class in the main text. In particular, if X𝒩(m,𝒦)X\sim\mathcal{N}(m,\mathcal{K}), its moments satisfy 𝔼Xm𝒳2=tr(𝒦)\mathbb{E}\|X-m\|_{\mathcal{X}}^{2}=\operatorname{tr}(\mathcal{K}) and 𝔼X𝒳2=m𝒳2+tr(𝒦)\mathbb{E}\|X\|_{\mathcal{X}}^{2}=\|m\|_{\mathcal{X}}^{2}+\operatorname{tr}(\mathcal{K}).

A.2 Kernel mean embeddings and Maximum Mean Discrepancy

Let κ:𝒳×𝒳\kappa:\mathcal{X}\times\mathcal{X}\to\mathbb{R} be a Borel measurable, positive-definite kernel with an associated reproducing kernel Hilbert space (RKHS) κ\mathcal{H}_{\kappa}. For probability measures P,Q𝒫(𝒳)P,Q\in\mathcal{P}(\mathcal{X}), we assume the following integrability condition holds:

𝔼xP[κ(x,x)]<,𝔼yQ[κ(y,y)]<.\mathbb{E}_{x\sim P}\left[\sqrt{\kappa(x,x)}\right]<\infty,\quad\mathbb{E}_{y\sim Q}\left[\sqrt{\kappa(y,y)}\right]<\infty. (16)

Under this assumption, the kernel mean embeddings

μP𝔼xP[κ(x,)],μQ𝔼yQ[κ(y,)]\mu_{P}\coloneqq\mathbb{E}_{x\sim P}[\kappa(x,\cdot)],\quad\mu_{Q}\coloneqq\mathbb{E}_{y\sim Q}[\kappa(y,\cdot)]

are well-defined as Bochner integrals in κ\mathcal{H}_{\kappa}. The squared Maximum Mean Discrepancy (MMD) between PP and QQ is defined as the distance between their embeddings:

MMDκ2(P,Q)\displaystyle\operatorname{MMD}_{\kappa}^{2}(P,Q) μPμQκ2\displaystyle\coloneqq\|\mu_{P}-\mu_{Q}\|_{\mathcal{H}_{\kappa}}^{2}
=𝔼x,xP[κ(x,x)]2𝔼xP,yQ[κ(x,y)]+𝔼y,yQ[κ(y,y)],\displaystyle=\mathbb{E}_{x,x^{\prime}\sim P}[\kappa(x,x^{\prime})]-2\mathbb{E}_{x\sim P,y\sim Q}[\kappa(x,y)]+\mathbb{E}_{y,y^{\prime}\sim Q}[\kappa(y,y^{\prime})], (17)

where x,xx,x^{\prime} are i.i.d. samples from PP and y,yy,y^{\prime} are i.i.d. samples from QQ. Note that if κ\kappa is bounded, condition (16) is satisfied for all probability measures.

A kernel is called characteristic if the embedding map PμPP\mapsto\mu_{P} is injective. In this case, MMDκ(P,Q)=0\operatorname{MMD}_{\kappa}(P,Q)=0 if and only if P=QP=Q, meaning the MMD induces a true metric on the space of probability measures.

A.3 Fourier interpretation of the Gaussian MMD

The Gaussian radial kernel, given by

κ(x,y)exp(xy𝒳22σκ2),σκ>0,\kappa(x,y)\coloneqq\exp\left(-\frac{\|x-y\|_{\mathcal{X}}^{2}}{2\sigma_{\kappa}^{2}}\right),\quad\sigma_{\kappa}>0, (18)

is both bounded and characteristic on separable Hilbert spaces.

To gain intuition into the geometry induced by this kernel, we briefly consider the finite-dimensional case 𝒳=d\mathcal{X}=\mathbb{R}^{d}. For P,Q𝒫(d)P,Q\in\mathcal{P}(\mathbb{R}^{d}), let φP\varphi_{P} and φQ\varphi_{Q} denote their respective characteristic functions, φP(ω)deiωxdP(x)\varphi_{P}(\omega)\coloneqq\int_{\mathbb{R}^{d}}e^{i\omega^{\top}x}\,\mathrm{d}P(x).

By Bochner’s theorem, the shift-invariant kernel (18) admits a Fourier representation with a Gaussian spectral density ψσκ\psi_{\sigma_{\kappa}}:

κ(x,y)=σκd(2π)d/2deiω(xy)eσκ2ω2/2dω.\kappa(x,y)=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}e^{i\omega^{\top}(x-y)}e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2}\,\mathrm{d}\omega.

By defining the signed measure μPQ\mu\coloneqq P-Q and applying Fubini’s theorem to the expanded MMD objective (17), we obtain:

MMDκ2(P,Q)\displaystyle\operatorname{MMD}_{\kappa}^{2}(P,Q) =d×dκ(x,y)dμ(x)dμ(y)\displaystyle=\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\kappa(x,y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y)
=σκd(2π)d/2d|deiωxdμ(x)|2eσκ2ω2/2dω\displaystyle=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\left|\int_{\mathbb{R}^{d}}e^{i\omega^{\top}x}\,\mathrm{d}\mu(x)\right|^{2}e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2}\,\mathrm{d}\omega
=σκd(2π)d/2d|φP(ω)φQ(ω)|2eσκ2ω2/2dω.\displaystyle=\frac{\sigma_{\kappa}^{d}}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\big|\varphi_{P}(\omega)-\varphi_{Q}(\omega)\big|^{2}e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2}\,\mathrm{d}\omega.

This formulation reveals a frequency-domain perspective: the squared Gaussian MMD is a weighted L2L^{2} distance between the characteristic functions of the two distributions. The Gaussian weight eσκ2ω2/2e^{-\sigma_{\kappa}^{2}\|\omega\|^{2}/2} acts as a low-pass filter, effectively penalizing discrepancies in low-frequency modes (global structural differences) while naturally smoothing out high-frequency noise.

Appendix B Closed-form MMD expressions

To evaluate the cross-expectations Ji,kJ_{i,k} and Ik,sI_{k,s} of (5) in the infinite-dimensional setting, we rely on the Fredholm determinant. Recall that for any trace-class operator T(𝒳)T\in\mathcal{L}(\mathcal{X}), the determinant detF(I𝒳+T)\det_{F}(I_{\mathcal{X}}+T) is well-defined and serves as the natural generalization of the standard matrix determinant.

Proposition B.1 (MMD for the Gaussian kernel).

Let A(𝒳)A\in\mathcal{L}(\mathcal{X}) be a self-adjoint, positive semi-definite operator, and consider the Gaussian kernel induced by AA,

κA(x,y)exp(12xy,A(xy)𝒳).\kappa_{A}(x,y)\coloneqq\exp\left(-\frac{1}{2}\big\langle x-y,A(x-y)\big\rangle_{\mathcal{X}}\right). (19)

Then the expectations in (5) are given by

Ji,k\displaystyle J_{i,k} =exp(12Ximk,A1/2(I𝒳+A1/2𝒦kA1/2)1A1/2(Ximk)𝒳)detF(I𝒳+A1/2𝒦kA1/2)1/2,\displaystyle=\frac{\exp\Bigl(-\tfrac{1}{2}\big\langle X_{i}-m_{k},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}\mathcal{K}_{k}A^{1/2}\bigr)^{-1}A^{1/2}(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}\mathcal{K}_{k}A^{1/2}\bigr)^{1/2}}, (20)
Ik,s\displaystyle I_{k,s} =exp(12mkms,A1/2(I𝒳+A1/2(𝒦k+𝒦s)A1/2)1A1/2(mkms)𝒳)detF(I𝒳+A1/2(𝒦k+𝒦s)A1/2)1/2.\displaystyle=\frac{\exp\Bigl(-\tfrac{1}{2}\big\langle m_{k}-m_{s},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{-1}A^{1/2}(m_{k}-m_{s})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{1/2}}. (21)
Proposition B.2 (MMD for the polynomial kernel).

Fix pp\in\mathbb{N}, c0c\geq 0, and consider the kernel

κ(x,y)(x,y𝒳+c)p.\kappa(x,y)\coloneqq(\langle x,y\rangle_{\mathcal{X}}+c)^{p}. (22)

Then the expectations in (5) are given by

Ji,k\displaystyle J_{i,k} =t=0p/2p!(p2t)! 2tt!Xi,𝒦kXi𝒳t(Xi,mk𝒳+c)p2t,\displaystyle=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}^{t}\,\left(\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c\right)^{p-2t},
Ik,s\displaystyle I_{k,s} =r=0p(pr)cprMr,k,Mr,s𝒳r,with Mr,k𝔼yνk[yr]𝒳r,M0,k=1.\displaystyle=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\big\langle M_{r,k},M_{r,s}\big\rangle_{\mathcal{X}^{\otimes r}},\hskip 18.49988pt\text{with }M_{r,k}\coloneqq\mathbb{E}_{y\sim\nu_{k}}[y^{\otimes r}]\in\mathcal{X}^{\otimes r},\qquad M_{0,k}=1.

The formal proofs of Propositions B.1 and B.2 are given in Sections˜H.7 and H.8.

B.1 Numerical computations and algorithm description

Given an orthonormal basis (er)r=1M(e_{r})_{r=1}^{M} of 𝒳M\mathcal{X}_{M}, we represent observations and means by their coordinate vectors

𝐱i(Xi,er𝒳)r=1MM,𝐦k(mk,er𝒳)r=1MM,\mathbf{x}_{i}\coloneqq\big(\langle X_{i},e_{r}\rangle_{\mathcal{X}}\big)_{r=1}^{M}\in\mathbb{R}^{M},\qquad\mathbf{m}_{k}\coloneqq\big(\langle m_{k},e_{r}\rangle_{\mathcal{X}}\big)_{r=1}^{M}\in\mathbb{R}^{M}, (23)

and the covariance operator 𝒦k,M\mathcal{K}_{k,M} by its matrix representation 𝐊kM×M\mathbf{K}_{k}\in\mathbb{R}^{M\times M} with entries

(𝐊k)r𝒦k,Me,er𝒳1r,M.(\mathbf{K}_{k})_{r\ell}\coloneqq\langle\mathcal{K}_{k,M}e_{\ell},e_{r}\rangle_{\mathcal{X}}\qquad 1\leq r,\ell\leq M. (24)

In these coordinates, the pushforward measure νk,M\nu_{k,M} becomes the multivariate normal 𝒩(𝐦k,𝐊k)\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k}) on M\mathbb{R}^{M}. Using these coordinate representations, we now provide explicit formulas for the finite-dimensional approximations.

Projected responsibilities.

The pushforward νk,M=νkΠM1\nu_{k,M}=\nu_{k}\circ\Pi_{M}^{-1} is a multivariate normal distribution on M\mathbb{R}^{M},

νk,M=𝒩(𝐦k,𝐊k).\nu_{k,M}=\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k}).

If each 𝐊k\mathbf{K}_{k} is positive definite, then each νk,M\nu_{k,M} admits the Lebesgue density

pk(M)(𝐱)(2π)M/2det(𝐊k)1/2exp{12(𝐱𝐦k)𝐊k1(𝐱𝐦k)}.p_{k}^{(M)}(\mathbf{x})\coloneqq(2\pi)^{-M/2}\det(\mathbf{K}_{k})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m}_{k})^{\top}\mathbf{K}_{k}^{-1}(\mathbf{x}-\mathbf{m}_{k})\Big\}.

In this case the projected responsibilities satisfy, for i{1,,n}i\in\{1,\dots,n\} and k{1,,K}k\in\{1,\dots,K\},

γi,k(M)γk(M)(𝐱i)=πkpk(M)(𝐱i)s=1Kπsps(M)(𝐱i).\gamma^{(M)}_{i,k}\coloneqq\gamma_{k}^{(M)}(\mathbf{x}_{i})=\frac{\pi_{k}p_{k}^{(M)}(\mathbf{x}_{i})}{\sum_{s=1}^{K}\pi_{s}p_{s}^{(M)}(\mathbf{x}_{i})}.

If some 𝐊k\mathbf{K}_{k} are singular, the corresponding νk,M\nu_{k,M} are supported on affine subspaces; the Radon–Nikodym definition of γk(M)\gamma_{k}^{(M)} still applies and one may compute densities on the supports or use a small ridge regularization.

1. Gaussian radial kernel (Proposition B.1) Let 𝐀M×M\mathbf{A}\in\mathbb{R}^{M\times M} denote the matrix representation of the restricted operator A|𝒳MA|_{\mathcal{X}_{M}} in the basis (er)r=1M(e_{r})_{r=1}^{M}, with entries 𝐀r=Ae,er𝒳\mathbf{A}_{r\ell}=\langle Ae_{\ell},e_{r}\rangle_{\mathcal{X}}. The closed forms of Proposition B.1 become the finite-dimensional approximations

Ji,k(M)\displaystyle J^{(M)}_{i,k} det(IM+𝐀1/2𝐊k𝐀1/2)1/2\displaystyle\coloneqq\det\!\big(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}\big)^{-1/2}
×exp{12(𝐱i𝐦k)𝐀1/2(IM+𝐀1/2𝐊k𝐀1/2)1𝐀1/2(𝐱i𝐦k)},\displaystyle\qquad\times\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}_{i}-\mathbf{m}_{k})^{\top}\mathbf{A}^{1/2}\big(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}\big)^{-1}\mathbf{A}^{1/2}(\mathbf{x}_{i}-\mathbf{m}_{k})\Big\},
Ik,s(M)\displaystyle I^{(M)}_{k,s} det(IM+𝐀1/2(𝐊k+𝐊s)𝐀1/2)1/2\displaystyle\coloneqq\det\!\big(I_{M}+\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}\big)^{-1/2}
×exp{12(𝐦k𝐦s)𝐀1/2(IM+𝐀1/2(𝐊k+𝐊s)𝐀1/2)1𝐀1/2(𝐦k𝐦s)},\displaystyle\qquad\times\exp\!\Big\{-\tfrac{1}{2}(\mathbf{m}_{k}-\mathbf{m}_{s})^{\top}\mathbf{A}^{1/2}\big(I_{M}+\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}\big)^{-1}\mathbf{A}^{1/2}(\mathbf{m}_{k}-\mathbf{m}_{s})\Big\},

and we set

Jk(M)1ni=1nJi,k(M),𝐉(M)(J1(M),,JK(M)),𝐈(M)(Ik,s(M))k,s=1K.J^{(M)}_{k}\coloneqq\frac{1}{n}\sum_{i=1}^{n}J^{(M)}_{i,k},\qquad\mathbf{J}^{(M)}\coloneqq(J^{(M)}_{1},\dots,J^{(M)}_{K})^{\top},\qquad\mathbf{I}^{(M)}\coloneqq(I^{(M)}_{k,s})_{k,s=1}^{K}.

For the isotropic case 𝐀=σκ2IM\mathbf{A}=\sigma_{\kappa}^{-2}I_{M}, one writes α=1/(2σκ2)\alpha=-1/(2\sigma_{\kappa}^{2}) and recovers the equivalent forms det(IM2α𝐊k)1/2\det(I_{M}-2\alpha\mathbf{K}_{k})^{-1/2} and (IM2α𝐊k)1(I_{M}-2\alpha\mathbf{K}_{k})^{-1} in the exponents.

For numerically stable log-domain evaluation, compute the symmetric eigendecompositions

𝐀1/2𝐊k𝐀1/2=𝐕kdiag(λk,1,,λk,M)𝐕k,\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2}=\mathbf{V}_{k}\mathrm{diag}(\lambda_{k,1},\dots,\lambda_{k,M})\mathbf{V}_{k}^{\top},
𝐀1/2(𝐊k+𝐊s)𝐀1/2=𝐕k,sdiag(λk,s,1,,λk,s,M)𝐕k,s,\mathbf{A}^{1/2}(\mathbf{K}_{k}+\mathbf{K}_{s})\mathbf{A}^{1/2}=\mathbf{V}_{k,s}\mathrm{diag}(\lambda_{k,s,1},\dots,\lambda_{k,s,M})\mathbf{V}_{k,s}^{\top},

with orthogonal matrices 𝐕k,𝐕k,s\mathbf{V}_{k},\mathbf{V}_{k,s} and λ,0\lambda_{\cdot,\cdot}\geq 0. Note that det(IM+𝐀1/2𝐊k𝐀1/2)=r(1+λk,r)\det(I_{M}+\mathbf{A}^{1/2}\mathbf{K}_{k}\mathbf{A}^{1/2})=\prod_{r}(1+\lambda_{k,r}). Writing 𝐳i,k𝐕k𝐀1/2(𝐱i𝐦k)\mathbf{z}_{i,k}\coloneqq\mathbf{V}_{k}^{\top}\mathbf{A}^{1/2}(\mathbf{x}_{i}-\mathbf{m}_{k}) and 𝐝k,s𝐕k,s𝐀1/2(𝐦k𝐦s)\mathbf{d}_{k,s}\coloneqq\mathbf{V}_{k,s}^{\top}\mathbf{A}^{1/2}(\mathbf{m}_{k}-\mathbf{m}_{s}), we obtain the stable log-domain expressions

logJi,k(M)\displaystyle\log J^{(M)}_{i,k} =12r=1Mlog(1+λk,r)12r=1M(𝐳i,k)r21+λk,r,\displaystyle=-\frac{1}{2}\sum_{r=1}^{M}\log(1+\lambda_{k,r})\;-\;\frac{1}{2}\sum_{r=1}^{M}\frac{(\mathbf{z}_{i,k})_{r}^{2}}{1+\lambda_{k,r}},
logIk,s(M)\displaystyle\log I^{(M)}_{k,s} =12r=1Mlog(1+λk,s,r)12r=1M(𝐝k,s)r21+λk,s,r.\displaystyle=-\frac{1}{2}\sum_{r=1}^{M}\log(1+\lambda_{k,s,r})\;-\;\frac{1}{2}\sum_{r=1}^{M}\frac{(\mathbf{d}_{k,s})_{r}^{2}}{1+\lambda_{k,s,r}}.

When 𝐀=σκ2IM\mathbf{A}=\sigma_{\kappa}^{-2}I_{M}, the matrices 𝐕k\mathbf{V}_{k} coincide with the eigenvectors of 𝐊k\mathbf{K}_{k} and the expressions reduce to those with α=1/(2σκ2)\alpha=-1/(2\sigma_{\kappa}^{2}). To reduce cost, one may truncate these sums to rRr\leq R (e.g. the leading RR eigenvalues), i.e. replace the relevant matrices by their rank-RR spectral approximations.

2. Polynomial kernel (Proposition B.2) Fix an integer p1p\geq 1 and c0c\geq 0 and consider κ(x,y)=(x,y𝒳+c)p\kappa(x,y)=(\langle x,y\rangle_{\mathcal{X}}+c)^{p}. In the projected space, this becomes κ(𝐱,𝐲)=(𝐱𝐲+c)p\kappa(\mathbf{x},\mathbf{y})=(\mathbf{x}^{\top}\mathbf{y}+c)^{p} on M\mathbb{R}^{M}.

Computing Ji,k(M)J^{(M)}_{i,k}.

For yνk,My\sim\nu_{k,M}, define the scalars

μi,k(M)𝐱i𝐦k+c,vi,k(M)𝐱i𝐊k𝐱i,\mu^{(M)}_{i,k}\coloneqq\mathbf{x}_{i}^{\top}\mathbf{m}_{k}+c,\qquad v^{(M)}_{i,k}\coloneqq\mathbf{x}_{i}^{\top}\mathbf{K}_{k}\,\mathbf{x}_{i},

so that 𝐱iy+c\mathbf{x}_{i}^{\top}y+c is a one-dimensional Gaussian with mean μi,k(M)\mu^{(M)}_{i,k} and variance vi,k(M)v^{(M)}_{i,k}. Then Proposition B.2 yields the efficient closed form

Ji,k(M)=t=0p/2p!(p2t)! 2tt!(vi,k(M))t(μi,k(M))p2t.J^{(M)}_{i,k}=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\big(v^{(M)}_{i,k}\big)^{t}\,\big(\mu^{(M)}_{i,k}\big)^{p-2t}.
Computing Ik,s(M)I^{(M)}_{k,s}.

Let yνk,My\sim\nu_{k,M} and yνs,My^{\prime}\sim\nu_{s,M} be independent in M\mathbb{R}^{M}, and set Zk,syyZ_{k,s}\coloneqq y^{\top}y^{\prime}. Then

Ik,s(M)=r=0p(pr)cpr𝔼[Zk,sr].I^{(M)}_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\mathbb{E}\!\left[Z_{k,s}^{\,r}\right].

Equivalently, by Proposition B.2,

Ik,s(M)=r=0p(pr)cpr𝐌r,k(M),𝐌r,s(M)(M)r,I^{(M)}_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\big\langle\mathbf{M}^{(M)}_{r,k},\mathbf{M}^{(M)}_{r,s}\big\rangle_{(\mathbb{R}^{M})^{\otimes r}},

where

𝐌r,k(M)𝔼yνk,M[yr](M)r.\mathbf{M}^{(M)}_{r,k}\coloneqq\mathbb{E}_{y\sim\nu_{k,M}}[y^{\otimes r}]\in(\mathbb{R}^{M})^{\otimes r}.

For small degrees, these moments admit simple closed forms in terms of (𝐦k,𝐊k)(\mathbf{m}_{k},\mathbf{K}_{k}). In particular,

𝔼[Zk,s]\displaystyle\mathbb{E}[Z_{k,s}] =𝐦k𝐦s,\displaystyle=\mathbf{m}_{k}^{\top}\mathbf{m}_{s},
𝔼[Zk,s2]\displaystyle\mathbb{E}[Z_{k,s}^{2}] =(𝐦k𝐦s)2+𝐦k𝐊s𝐦k+𝐦s𝐊k𝐦s+tr(𝐊k𝐊s),\displaystyle=(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})^{2}+\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\,\mathbf{m}_{k}+\mathbf{m}_{s}^{\top}\mathbf{K}_{k}\,\mathbf{m}_{s}+\mathrm{tr}(\mathbf{K}_{k}\mathbf{K}_{s}),
𝔼[Zk,s3]\displaystyle\mathbb{E}[Z_{k,s}^{3}] =(𝐦k𝐦s)3+3(𝐦k𝐦s)(𝐦k𝐊s𝐦k+𝐦s𝐊k𝐦s+tr(𝐊k𝐊s))+6𝐦k𝐊s𝐊k𝐦s.\displaystyle=(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})^{3}+3(\mathbf{m}_{k}^{\top}\mathbf{m}_{s})\Big(\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\,\mathbf{m}_{k}+\mathbf{m}_{s}^{\top}\mathbf{K}_{k}\,\mathbf{m}_{s}+\mathrm{tr}(\mathbf{K}_{k}\mathbf{K}_{s})\Big)+6\,\mathbf{m}_{k}^{\top}\mathbf{K}_{s}\mathbf{K}_{k}\,\mathbf{m}_{s}.

For larger pp, one may compute higher-order moments via Wick/Isserlis expansions.

B.2 Algorithm descriptions

We present pseudocode for the fitting procedure described in Section˜2.1. The algorithm operates in the projected coordinate space M\mathbb{R}^{M} using the finite-dimensional quantities (𝐉(M),𝐈(M))(\mathbf{J}^{(M)},\mathbf{I}^{(M)}) from Section˜B.1, which we denote (𝐉,𝐈)(\mathbf{J},\mathbf{I}) for brevity.

Algorithm 1 Alternating Optimization for Projected MMD Gaussian Mixtures
1:Projected data {𝐱i,M}i=1nM\{\mathbf{x}_{i,M}\}_{i=1}^{n}\subset\mathbb{R}^{M}; components KK; learning rate η\eta; ridge ε>0\varepsilon>0; iterations TT
2:Fitted weights π^\hat{\pi} and components {(𝐦^k,M,𝐊^k,M)}k=1K\{(\hat{\mathbf{m}}_{k,M},\hat{\mathbf{K}}_{k,M})\}_{k=1}^{K}
3:{𝐦k,M}k=1KKMeans({𝐱i,M}i=1n,K)\{\mathbf{m}_{k,M}\}_{k=1}^{K}\leftarrow\textsc{KMeans}(\{\mathbf{x}_{i,M}\}_{i=1}^{n},K) \triangleright Initialize means from data
4:Assign ciargmink𝐱i,M𝐦k,Mc_{i}\leftarrow\arg\min_{k}\|\mathbf{x}_{i,M}-\mathbf{m}_{k,M}\| and let Sk={i:ci=k}S_{k}=\{i:c_{i}=k\} \triangleright Nearest-center partition
5:𝐊k,M1|Sk|1iSk(𝐱i,M𝐦k,M)(𝐱i,M𝐦k,M)+ε𝐈\mathbf{K}_{k,M}\leftarrow\tfrac{1}{|S_{k}|-1}\!\sum_{i\in S_{k}}(\mathbf{x}_{i,M}-\mathbf{m}_{k,M})(\mathbf{x}_{i,M}-\mathbf{m}_{k,M})^{\top}+\varepsilon\mathbf{I} \triangleright Within-cluster covariance
6:𝐋kchol(𝐊k,M)\mathbf{L}_{k}\leftarrow\mathrm{chol}(\mathbf{K}_{k,M}) \triangleright Cholesky parametrization ensures 𝐊k,M0\mathbf{K}_{k,M}\succ 0
7:π𝟏/K\pi\leftarrow\mathbf{1}/K \triangleright Uniform weight initialization
8:for t=1,,Tt=1,\dots,T do
9:  Compute 𝐈(M)K×K\mathbf{I}^{(M)}\in\mathbb{R}^{K\times K}, 𝐉(M)K\mathbf{J}^{(M)}\in\mathbb{R}^{K} using Section˜B.1 \triangleright Cross-expectations
10:  πargminπΔK1π𝐈(M)π2(𝐉(M))π\pi\leftarrow\arg\min_{\pi\in\Delta^{K-1}}\pi^{\top}\mathbf{I}^{(M)}\pi-2(\mathbf{J}^{(M)})^{\top}\pi \triangleright Exact QP update (6)
11:  Mπ𝐈(M)π2(𝐉(M))π\mathcal{L}_{M}\leftarrow\pi^{\top}\mathbf{I}^{(M)}\pi-2(\mathbf{J}^{(M)})^{\top}\pi \triangleright Objective at fixed π\pi
12:  𝐦k,M𝐦k,Mη𝐦k,MM\mathbf{m}_{k,M}\leftarrow\mathbf{m}_{k,M}-\eta\nabla_{\mathbf{m}_{k,M}}\mathcal{L}_{M} \triangleright Gradient step on means
13:  𝐋k𝐋kη𝐋kM\mathbf{L}_{k}\leftarrow\mathbf{L}_{k}-\eta\nabla_{\mathbf{L}_{k}}\mathcal{L}_{M} \triangleright Gradient step on Cholesky factors
14:  𝐊k,M𝐋k𝐋k+ε𝐈\mathbf{K}_{k,M}\leftarrow\mathbf{L}_{k}\mathbf{L}_{k}^{\top}+\varepsilon\mathbf{I} \triangleright Reconstruct PSD covariances
15:end for
16:return π^π\hat{\pi}\leftarrow\pi and {(𝐦^k,M,𝐊^k,M)}{(𝐦k,M,𝐊k,M)}\{(\hat{\mathbf{m}}_{k,M},\hat{\mathbf{K}}_{k,M})\}\leftarrow\{(\mathbf{m}_{k,M},\mathbf{K}_{k,M})\}

Algorithm˜1 alternates between the analytical weight update (6) and a gradient step on the component parameters (𝐦k,M,𝐋k)(\mathbf{m}_{k,M},\mathbf{L}_{k}), with covariances parameterized through their Cholesky factors to enforce positive definiteness. The exact gradients are obtained from the differentiable closed forms of Propositions B.1 and B.2. A fully gradient-based variant—updating π\pi jointly with the components, either via projection onto the simplex or through logits K\ell\in\mathbb{R}^{K} with πk=softmax()k\pi_{k}=\operatorname{softmax}(\ell)_{k}—is also possible.

Appendix C Time-varying mixtures

This appendix details the algorithmic implementation and theoretical results deferred from Section˜4.2. We outline the projected M\mathbb{R}^{M} fitting procedure, and subsequently establish three guarantees: the consistency of the integrated empirical objective, the consistency of its projected M\mathbb{R}^{M} approximation, and the density of time-varying Gaussian mixtures with shared components.

Algorithm 2 Time-varying MMD-based mixture fitting
1:Projected temporal data {𝐱i,M(l)}i=1nlM\{\mathbf{x}_{i,M}^{(l)}\}_{i=1}^{n_{l}}\subset\mathbb{R}^{M} across time slices t1,,tLt_{1},\dots,t_{L}; number of components KK; learning rate η\eta; epochs EE; Neural ODE parameters Φ\Phi.
2:Fitted parameters Φ^\hat{\Phi} and shared components {(𝐦^k,𝐊^k)}k=1K\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}.
3:Initialize {(𝐦k,𝐊k)}k=1K\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K} using the pooled data l=1L{𝐱i,M(l)}i=1nl\bigcup_{l=1}^{L}\{\mathbf{x}_{i,M}^{(l)}\}_{i=1}^{n_{l}}.
4:Initialize Φ\Phi.
5:for epoch =1,,E=1,\dots,E do
6:  Compute π(tl)softmax(𝐳(tl))\pi(t_{l})\leftarrow\operatorname{softmax}(\mathbf{z}(t_{l})) for l{1,,L}l\in\{1,\dots,L\} via the ODE.
7:  Compute 𝐈(M)K×K\mathbf{I}^{(M)}\in\mathbb{R}^{K\times K} via (7).
8:  for l=1,,Ll=1,\dots,L do
9:   Compute 𝐉l(M)K\mathbf{J}_{l}^{(M)}\in\mathbb{R}^{K} where [𝐉l(M)]k1nli=1nlJi,k(M)[\mathbf{J}_{l}^{(M)}]_{k}\leftarrow\frac{1}{n_{l}}\sum_{i=1}^{n_{l}}J_{i,k}^{(M)} via (7).
10:   MMDl2π(tl)𝐈(M)π(tl)2(𝐉l(M))π(tl)\operatorname{MMD}_{l}^{2}\leftarrow\pi(t_{l})^{\top}\mathbf{I}^{(M)}\,\pi(t_{l})-2\,(\mathbf{J}_{l}^{(M)})^{\top}\pi(t_{l}). \triangleright Unnormalized per-slice MMD2
11:  end for
12:  1Ll=1LMMDl2\mathcal{L}\leftarrow\frac{1}{L}\sum_{l=1}^{L}\operatorname{MMD}_{l}^{2}. \triangleright Approximate integrated loss (15)
13:  Update (Φ,{𝐦k,𝐊k})(Φ,{𝐦k,𝐊k})η(\Phi,\{\mathbf{m}_{k},\mathbf{K}_{k}\})\leftarrow(\Phi,\{\mathbf{m}_{k},\mathbf{K}_{k}\})-\eta\,\nabla\mathcal{L}.
14:end for
15:return Φ^Φ\hat{\Phi}\leftarrow\Phi,   {(𝐦^k,𝐊^k)}k=1K{(𝐦k,𝐊k)}k=1K\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}\leftarrow\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}.

Algorithm˜2 implements the optimization of the time-varying objective (15) in M\mathbb{R}^{M}. The component parameters {(𝐦k,𝐊k)}k=1K\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K} are shared across all time slices, while the temporal variation is entirely captured by the weights π(t)\pi(t), whose logits are governed by the neural ODE introduced in Section˜4.2. Implementation details are summarized in Table˜3.

To evaluate the MMD efficiently within the algorithm, we rely on the finite-dimensional projections defined in (7). We separate the evaluation into terms that can be computed globally versus those computed per time slice. The matrix 𝐈(M)\mathbf{I}^{(M)} depends only on the shared components and is computed once per epoch. In contrast, the vectors 𝐉l(M)\mathbf{J}_{l}^{(M)} evaluate the interaction between the shared components and the data specific to time slice tlt_{l}, requiring per-slice computation.

Motivated by the within-cluster RKHS dispersion criterion underlying kernel kk-groups [29], we select KK by applying an elbow rule to the integrated MMD loss over candidate temporal mixtures.

Consistency guarantees

In parallel with the static framework of Section˜3, we establish the theoretical foundations for the temporal mixture model. Unlike the static case, where sample consistency is standard, minimizing the integrated temporal objective requires controlling the time-varying empirical processes. We first establish this sample consistency, followed by the validity of the projected approximation and the expressiveness of the temporal model class.

Proposition C.1 (Sample consistency of the integrated MMD).

Let κ\kappa be a continuous positive-definite kernel on the separable Hilbert space 𝒳\mathcal{X}. Let X1,,XnX_{1},\dots,X_{n} be i.i.d. copies of a jointly measurable 𝒳\mathcal{X}-valued stochastic process X:[0,T]𝒳X\colon[0,T]\to\mathcal{X} with marginal laws P(t)P(t), and let Pn(t)P_{n}(t) be the corresponding empirical measure. Let Q(t)Q(t) be a family of probability measures on 𝒳\mathcal{X} such that its mean embedding trajectory tμQ(t)t\mapsto\mu_{Q(t)} is strongly measurable and belongs to L2(0,T;κ)L^{2}(0,T;\mathcal{H}_{\kappa}). If 0T𝔼XP(t)[κ(X,X)]dt<\int_{0}^{T}\mathbb{E}_{X\sim P(t)}[\kappa(X,X)]\,\mathrm{d}t<\infty, we have

1T0TMMDκ2(Pn(t),Q(t))dtn1T0TMMDκ2(P(t),Q(t))dta.s.\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q(t)\bigr)\,\mathrm{d}t\xrightarrow{n\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P(t),Q(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

In particular, 1T0TMMDκ2(Pn(t),P(t))dt0\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),P(t)\bigr)\,\mathrm{d}t\to 0 almost surely.

Proposition C.2 (Projection consistency for temporal mixtures).

Let Qθ(t)=k=1Kπk(t)νkQ_{\theta}(t)=\sum_{k=1}^{K}\pi_{k}(t)\nu_{k} be a time-dependent Gaussian mixture where π(t)ΔK1\pi(t)\in\Delta^{K-1} and tπk(t)t\mapsto\pi_{k}(t) is measurable for each kk. Let Pn,M(t)P_{n,M}(t) and Qθ,M(t)Q_{\theta,M}(t) be the projected measures on M\mathbb{R}^{M} defined in Section˜2.2, and let κM\kappa_{M} denote the coordinate version of the projected kernel. Suppose κ\kappa is continuous and satisfies the polynomial growth condition (9) for some q0q\geq 0. If the sample paths satisfy 0TXi(t)𝒳2qdt<\int_{0}^{T}\|X_{i}(t)\|_{\mathcal{X}}^{2q}\,\mathrm{d}t<\infty almost surely for i=1,,ni=1,\dots,n, then

1T0TMMDκM2(Pn,M(t),Qθ,M(t))dtM1T0TMMDκ2(Pn(t),Qθ(t))dta.s.\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa_{M}}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)\,\mathrm{d}t\xrightarrow{M\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

Furthermore, for each fixed tt, if X(t)Qθ(t)X(t)\sim Q_{\theta}(t), the projected responsibilities γk(M)(ΠMX(t),t)\gamma_{k}^{(M)}(\Pi_{M}X(t),t) satisfy the martingale convergence property of Proposition˜2.2 with weights π(t)\pi(t).

Proposition C.3 (Density of time-varying Gaussian mixtures).

Fix p1p\geq 1 and let P:[0,T]𝒫p(𝒳)P\colon[0,T]\to\mathcal{P}_{p}(\mathcal{X}) be continuous with respect to the Wasserstein distance WpW_{p}. Then, for every ε>0\varepsilon>0, there exist KK\in\mathbb{N}, shared Gaussian measures ν1,,νK\nu_{1},\dots,\nu_{K} on 𝒳\mathcal{X} with nonzero covariance operator, and continuous functions π1,,πK:[0,T][0,1]\pi_{1},\dots,\pi_{K}\colon[0,T]\to[0,1] with k=1Kπk(t)=1\sum_{k=1}^{K}\pi_{k}(t)=1, such that

supt[0,T]Wp(P(t),k=1Kπk(t)νk)<ε.\sup_{t\in[0,T]}W_{p}\left(P(t),\sum_{k=1}^{K}\pi_{k}(t)\nu_{k}\right)<\varepsilon.

The proofs of Propositions C.1, C.2 and C.3 are provided in Sections˜H.4, H.5 and H.6, respectively.

Appendix D Projected maximum likelihood estimation

Although this paper focuses on MMD-based fitting, we briefly describe the projected maximum-likelihood baseline used in Section˜4; it is the only density-based competitor that is well defined in the Hilbert-space setting of (1). As discussed in the introduction, likelihood-based fitting in infinite dimensions requires a common dominating measure. Here we use the Karhunen–Loève projector associated with the Gaussian prior

ρμ0=𝒩(0,C)\rho\coloneqq\mu_{0}=\mathcal{N}(0,C)

on 𝒳\mathcal{X}. Let (λj,ϕj)j1(\lambda_{j},\phi_{j})_{j\geq 1} be the eigenpairs of CC ordered so that λ1λ2\lambda_{1}\geq\lambda_{2}\geq\cdots, and fix MM with λM>0\lambda_{M}>0. Define

𝒳Mspan{ϕ1,,ϕM},ΠMxj=1Mx,ϕj𝒳ϕj,ΠMI𝒳ΠM.\mathcal{X}_{M}\coloneqq\mathrm{span}\{\phi_{1},\dots,\phi_{M}\},\qquad\Pi_{M}x\coloneqq\sum_{j=1}^{M}\langle x,\phi_{j}\rangle_{\mathcal{X}}\,\phi_{j},\qquad\Pi_{M}^{\perp}\coloneqq I_{\mathcal{X}}-\Pi_{M}.

Under the decomposition 𝒳=𝒳M𝒳M\mathcal{X}=\mathcal{X}_{M}\oplus\mathcal{X}_{M}^{\perp}, the prior factorizes as

ρ=ρMρM,ρM=𝒩(0,ΛM),ΛMdiag(λ1,,λM),\rho=\rho_{M}\otimes\rho_{M}^{\perp},\qquad\rho_{M}=\mathcal{N}(0,\Lambda_{M}),\qquad\Lambda_{M}\coloneqq\mathrm{diag}(\lambda_{1},\dots,\lambda_{M}),

where ρM\rho_{M}^{\perp} is the law of ΠMX\Pi_{M}^{\perp}X for XρX\sim\rho. Following [41], we construct each component by altering only the first MM coordinates. Given projected parameters 𝐦kM\mathbf{m}_{k}\in\mathbb{R}^{M} and 𝐊kM×M\mathbf{K}_{k}\in\mathbb{R}^{M\times M} with 𝐊k0\mathbf{K}_{k}\succ 0, let

νk,M𝒩(𝐦k,𝐊k)on M,\nu_{k,M}\coloneqq\mathcal{N}(\mathbf{m}_{k},\mathbf{K}_{k})\qquad\text{on }\mathbb{R}^{M},

and lift it to

νkνk,MρMon 𝒳.\nu_{k}\coloneqq\nu_{k,M}\otimes\rho_{M}^{\perp}\qquad\text{on }\mathcal{X}.

Equivalently, νk=𝒩(mk,𝒦k)\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k}), where

mk=j=1M(𝐦k)jϕj,m_{k}=\sum_{j=1}^{M}(\mathbf{m}_{k})_{j}\,\phi_{j},

and 𝒦k\mathcal{K}_{k} agrees with CC on 𝒳M\mathcal{X}_{M}^{\perp} while its restriction to 𝒳M\mathcal{X}_{M} has matrix 𝐊k\mathbf{K}_{k} in the basis {ϕ1,,ϕM}\{\phi_{1},\dots,\phi_{M}\}. Since νk,M\nu_{k,M} and ρM\rho_{M} are nondegenerate Gaussians on M\mathbb{R}^{M}, they are equivalent, and therefore νkρ\nu_{k}\sim\rho. Hence every mixture

Qk=1Kπkνk,πk>0,k=1Kπk=1,Q\coloneqq\sum_{k=1}^{K}\pi_{k}\nu_{k},\qquad\pi_{k}>0,\qquad\sum_{k=1}^{K}\pi_{k}=1,

satisfies QρQ\ll\rho; indeed QρQ\sim\rho because its density is strictly positive ρ\rho-a.s.

For x𝒳x\in\mathcal{X}, write

𝐱(x,ϕ1𝒳,,x,ϕM𝒳)\mathbf{x}\coloneqq\bigl(\langle x,\phi_{1}\rangle_{\mathcal{X}},\dots,\langle x,\phi_{M}\rangle_{\mathcal{X}}\bigr)^{\top}

and let

φM(𝐱;𝐦,𝐊)(2π)M/2det(𝐊)1/2exp{12(𝐱𝐦)𝐊1(𝐱𝐦)}\varphi_{M}(\mathbf{x};\mathbf{m},\mathbf{K})\coloneqq(2\pi)^{-M/2}\det(\mathbf{K})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m})^{\top}\mathbf{K}^{-1}(\mathbf{x}-\mathbf{m})\Big\}

denote the usual Gaussian density on M\mathbb{R}^{M}. Then

pk(x)dνkdρ(x)=dνk,MdρM(𝐱)=φM(𝐱;𝐦k,𝐊k)φM(𝐱;0,ΛM)p_{k}(x)\coloneqq\frac{\,\mathrm{d}\nu_{k}}{\,\mathrm{d}\rho}(x)=\frac{\,\mathrm{d}\nu_{k,M}}{\,\mathrm{d}\rho_{M}}(\mathbf{x})=\frac{\varphi_{M}(\mathbf{x};\mathbf{m}_{k},\mathbf{K}_{k})}{\varphi_{M}(\mathbf{x};0,\Lambda_{M})}

and therefore

pk(x)=det(ΛM)1/2det(𝐊k)1/2exp{12(𝐱𝐦k)𝐊k1(𝐱𝐦k)+12𝐱ΛM1𝐱}.p_{k}(x)=\det(\Lambda_{M})^{1/2}\det(\mathbf{K}_{k})^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(\mathbf{x}-\mathbf{m}_{k})^{\top}\mathbf{K}_{k}^{-1}(\mathbf{x}-\mathbf{m}_{k})+\tfrac{1}{2}\mathbf{x}^{\top}\Lambda_{M}^{-1}\mathbf{x}\Big\}.

Accordingly,

q(x)dQdρ(x)=k=1Kπkpk(x).q(x)\coloneqq\frac{\,\mathrm{d}Q}{\,\mathrm{d}\rho}(x)=\sum_{k=1}^{K}\pi_{k}\,p_{k}(x).

Since the factor φM(𝐱;0,ΛM)1\varphi_{M}(\mathbf{x};0,\Lambda_{M})^{-1} is common to every component, maximizing the ρ\rho-log-likelihood is equivalent to maximizing the usual projected mixture log-likelihood

M(π,m,𝒦)i=1nlog(k=1KπkφM(𝐱i;𝐦k,𝐊k)),\ell_{M}(\pi,m,\mathcal{K})\coloneqq\sum_{i=1}^{n}\log\!\Big(\sum_{k=1}^{K}\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})\Big), (25)

because

i=1nlogq(Xi)=M(π,m,𝒦)i=1nlogφM(𝐱i;0,ΛM),\sum_{i=1}^{n}\log q(X_{i})=\ell_{M}(\pi,m,\mathcal{K})-\sum_{i=1}^{n}\log\varphi_{M}(\mathbf{x}_{i};0,\Lambda_{M}),

and the second term is independent of the mixture parameters.

The EM algorithm [22] is therefore carried out in the projected coordinates 𝐱iM\mathbf{x}_{i}\in\mathbb{R}^{M}. In the E-step,

rik=πkpk(Xi)=1Kπp(Xi)=πkφM(𝐱i;𝐦k,𝐊k)=1KπφM(𝐱i;𝐦,𝐊).r_{ik}=\frac{\pi_{k}\,p_{k}(X_{i})}{\sum_{\ell=1}^{K}\pi_{\ell}\,p_{\ell}(X_{i})}=\frac{\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})}{\sum_{\ell=1}^{K}\pi_{\ell}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{\ell},\mathbf{K}_{\ell})}.

For the unregularized full-covariance model, the M-step updates are

πknew=1ni=1nrik,𝐦knew=i=1nrik𝐱ii=1nrik,\pi_{k}^{\mathrm{new}}=\frac{1}{n}\sum_{i=1}^{n}r_{ik},\qquad\mathbf{m}_{k}^{\mathrm{new}}=\frac{\sum_{i=1}^{n}r_{ik}\mathbf{x}_{i}}{\sum_{i=1}^{n}r_{ik}},

and

𝐊knew=i=1nrik(𝐱i𝐦knew)(𝐱i𝐦knew)i=1nrik.\mathbf{K}_{k}^{\mathrm{new}}=\frac{\sum_{i=1}^{n}r_{ik}(\mathbf{x}_{i}-\mathbf{m}_{k}^{\mathrm{new}})(\mathbf{x}_{i}-\mathbf{m}_{k}^{\mathrm{new}})^{\top}}{\sum_{i=1}^{n}r_{ik}}.

In practice, one stabilizes the covariance update by replacing 𝐊knew\mathbf{K}_{k}^{\mathrm{new}} with 𝐊knew+εIM\mathbf{K}_{k}^{\mathrm{new}}+\varepsilon I_{M} for some ε>0\varepsilon>0. This is a regularized EM step rather than the exact M-step of the unpenalized likelihood, but it guarantees 𝐊k0\mathbf{K}_{k}\succ 0 and therefore preserves equivalence with the reference measure after lifting. In the fixed-covariance benchmark variant used in our experiments (see Section˜4), the same E-step is used, while only the weights and means are updated and the covariance matrix is held fixed. After each M-step, the updated parameters are lifted back to 𝒳\mathcal{X} by setting

mknew=j=1M(𝐦knew)jϕjm_{k}^{\mathrm{new}}=\sum_{j=1}^{M}(\mathbf{m}_{k}^{\mathrm{new}})_{j}\,\phi_{j}

and defining 𝒦knew\mathcal{K}_{k}^{\mathrm{new}} to agree with CC on 𝒳M\mathcal{X}_{M}^{\perp} and to have matrix 𝐊knew\mathbf{K}_{k}^{\mathrm{new}} on 𝒳M\mathcal{X}_{M}. Algorithm 3 summarizes this projected EM routine, with optional ridge regularization controlled by ε\varepsilon.

Algorithm 3 Projected EM for the Gaussian-mixture log-likelihood
1:Projected data {𝐱i}i=1nM\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{M}; number of components KK; ridge parameter ε0\varepsilon\geq 0; EM iterations TEMT_{\mathrm{EM}}
2:Fitted parameters π^,{(𝐦^k,𝐊^k)}k=1K\hat{\pi},\,\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}
3:{𝐦k}k=1KKMeans({𝐱i}i=1n,K)\{\mathbf{m}_{k}\}_{k=1}^{K}\leftarrow\textsc{KMeans}(\{\mathbf{x}_{i}\}_{i=1}^{n},K) \triangleright Initialize means
4:𝐊k𝐈M\mathbf{K}_{k}\leftarrow\mathbf{I}_{M},   πk1/K\pi_{k}\leftarrow 1/K \triangleright Initialize covariances and weights
5:for t=1,,TEMt=1,\dots,T_{\mathrm{EM}} do
6:  for i=1,,ni=1,\dots,n, k=1,,Kk=1,\dots,K do
7:   rikπkφM(𝐱i;𝐦k,𝐊k)s=1KπsφM(𝐱i;𝐦s,𝐊s)r_{ik}\leftarrow\dfrac{\pi_{k}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{k},\mathbf{K}_{k})}{\sum_{s=1}^{K}\pi_{s}\,\varphi_{M}(\mathbf{x}_{i};\mathbf{m}_{s},\mathbf{K}_{s})} \triangleright E-step responsibilities
8:  end for
9:  for k=1,,Kk=1,\dots,K do
10:   Nki=1nrikN_{k}\leftarrow\sum_{i=1}^{n}r_{ik} \triangleright Effective cluster size
11:   πkNk/n\pi_{k}\leftarrow N_{k}/n \triangleright M-step: weights
12:   𝐦k1Nki=1nrik𝐱i\mathbf{m}_{k}\leftarrow\tfrac{1}{N_{k}}\sum_{i=1}^{n}r_{ik}\mathbf{x}_{i} \triangleright M-step: means
13:   𝐊k1Nki=1nrik(𝐱i𝐦k)(𝐱i𝐦k)+ε𝐈M\mathbf{K}_{k}\leftarrow\tfrac{1}{N_{k}}\sum_{i=1}^{n}r_{ik}(\mathbf{x}_{i}-\mathbf{m}_{k})(\mathbf{x}_{i}-\mathbf{m}_{k})^{\top}+\varepsilon\mathbf{I}_{M} \triangleright M-step: ridge-regularized covariances
14:  end for
15:end for
16:return π^π\hat{\pi}\leftarrow\pi,   {(𝐦^k,𝐊^k)}k=1K{(𝐦k,𝐊k)}k=1K\{(\hat{\mathbf{m}}_{k},\hat{\mathbf{K}}_{k})\}_{k=1}^{K}\leftarrow\{(\mathbf{m}_{k},\mathbf{K}_{k})\}_{k=1}^{K}

The next result shows that the projected likelihood ratios are consistent as MM\to\infty.

Proposition D.1 (Consistency of projected likelihood ratios).

For each MM, set ρMρΠM1\rho_{M}\coloneqq\rho\circ\Pi_{M}^{-1}, QMQΠM1Q_{M}\coloneqq Q\circ\Pi_{M}^{-1}, and define qMdQM/dρMq_{M}\coloneqq dQ_{M}/d\rho_{M}. Let XX be an 𝒳\mathcal{X}-valued random variable with distribution ρ\rho and denote

X(M)ΠMX.X^{(M)}\coloneqq\Pi_{M}X.

Then qM(X(M))=𝔼ρ[q(X)X(M)]q_{M}(X^{(M)})=\mathbb{E}_{\rho}[q(X)\mid X^{(M)}] a.s. and qM(X(M))Mq(X)q_{M}(X^{(M)})\xrightarrow{M\to\infty}q(X) a.s. and in L1(ρ)L^{1}(\rho). Consequently, for i.i.d. X1,,XnQX_{1},\dots,X_{n}\sim Q,

i=1nlogqM(ΠMXi)i=1nlogq(Xi)Qn-a.s.\sum_{i=1}^{n}\log q_{M}(\Pi_{M}X_{i})\longrightarrow\sum_{i=1}^{n}\log q(X_{i})\qquad Q^{n}\text{-a.s.}

The proof is given in Section˜H.9.

Appendix E Extension to Banach spaces

This appendix outlines how our framework extends to real separable Banach spaces (𝒳,𝒳)(\mathcal{X},\|\cdot\|_{\mathcal{X}}). Let 𝒳\mathcal{X}^{*} be the continuous dual and x,f\langle x,f\rangle the dual pairing. A Borel probability measure ν\nu on 𝒳\mathcal{X} is Gaussian if νf1\nu\circ f^{-1} is Gaussian for all f𝒳f\in\mathcal{X}^{*}. If ν\nu is centered, its characteristic functional is

ν^(f)𝒳eix,fdν(x)=exp(12Cν(f,f)),f𝒳,\widehat{\nu}(f)\coloneqq\int_{\mathcal{X}}e^{i\langle x,f\rangle}\,\mathrm{d}\nu(x)=\exp\!\Big(-\tfrac{1}{2}\,C_{\nu}(f,f)\Big),\qquad f\in\mathcal{X}^{*},

where the covariance form Cν:𝒳×𝒳C_{\nu}:\mathcal{X}^{*}\times\mathcal{X}^{*}\to\mathbb{R} is given by

Cν(f,g)𝒳x,fx,gdν(x).C_{\nu}(f,g)\coloneqq\int_{\mathcal{X}}\langle x,f\rangle\,\langle x,g\rangle\,\mathrm{d}\nu(x).

By Fernique’s theorem, ν\nu has a well-defined covariance operator Rν:𝒳𝒳R_{\nu}:\mathcal{X}^{*}\to\mathcal{X} given by Rνf=𝒳x,fxdν(x)R_{\nu}f=\int_{\mathcal{X}}\langle x,f\rangle x\,\mathrm{d}\nu(x). The associated Cameron–Martin space H(ν)H(\nu) is the Hilbert completion of Rν(𝒳)R_{\nu}(\mathcal{X}^{*}) under the inner product Rνf,RνgH(ν)=Rνf,g\langle R_{\nu}f,R_{\nu}g\rangle_{H(\nu)}=\langle R_{\nu}f,g\rangle.

Karhunen–Loève decomposition.

Assume that H(ν)H(\nu) is infinite-dimensional. Following Bay and Croix [4], one constructs vectors xn𝒳x_{n}\in\mathcal{X}, dual functionals xn𝒳x_{n}^{*}\in\mathcal{X}^{*}, and normalized coordinates

hnλnxnH(ν),hnλn1/2xn𝒳,h_{n}\coloneqq\sqrt{\lambda_{n}}\,x_{n}\in H(\nu),\qquad h_{n}^{*}\coloneqq\lambda_{n}^{-1/2}x_{n}^{*}\in\mathcal{X}^{*},

from successive maximizers of the Rayleigh quotient associated with the covariance operators of the residual Gaussian measures; see [4, Theorem 3.3]. The resulting coordinates satisfy Rνhn=hnR_{\nu}h_{n}^{*}=h_{n}, and the random variables ,hn\langle\cdot,h_{n}^{*}\rangle are independent 𝒩(0,1)\mathcal{N}(0,1) under ν\nu.

Corollary E.1 (Banach-space Karhunen–Loève decomposition [4, Corollary 3.8]).

Let ν\nu be a centered Gaussian measure on a real separable Banach space 𝒳\mathcal{X}, assume H(ν)H(\nu) is infinite-dimensional, and let (λn,xn,hn,hn)n0(\lambda_{n},x_{n},h_{n},h_{n}^{*})_{n\geq 0} be constructed as above. Then

x=n=0x,hnhnx=\sum_{n=0}^{\infty}\langle x,h_{n}^{*}\rangle\,h_{n}

converges in 𝒳\mathcal{X} for ν\nu-a.e. x𝒳x\in\mathcal{X}, and the coordinate maps xx,hnx\mapsto\langle x,h_{n}^{*}\rangle are independent 𝒩(0,1)\mathcal{N}(0,1) under ν\nu. Equivalently, if (ξn)n0(\xi_{n})_{n\geq 0} are i.i.d. 𝒩(0,1)\mathcal{N}(0,1), then

n=0λnξnxn\sum_{n=0}^{\infty}\sqrt{\lambda_{n}}\,\xi_{n}\,x_{n}

converges almost surely in 𝒳\mathcal{X} and has law ν\nu.

In the Hilbert-space setting of the main text, we identify 𝒳𝒳\mathcal{X}\simeq\mathcal{X}^{*} via the Riesz representation theorem. Then RνR_{\nu} becomes the usual covariance operator, and the above construction reduces to the classical spectral Karhunen–Loève expansion.

Applicability.

The measure-theoretic components of our framework—including Gaussian mixtures, Radon–Nikodym responsibilities, and MMD weak density arguments—extend naturally to separable Banach spaces. While orthogonal projections are unavailable in this broader setting, one can substitute finite-rank continuous projections PM:𝒳𝒳P_{M}:\mathcal{X}\to\mathcal{X}, such as the Bay–Croix projections from Corollary˜E.1. However, we do not pursue this extension further, as the closed-form MMD expressions and projection algorithms developed in the main text rely intrinsically on the inner product structure of Hilbert spaces.

Appendix F Synthetic experiments

We compare against four baseline families: (i) metric-space methods [55, 46, 25, 14, 32]; (ii) FDA-specific methods [52]; (iii) classic Euclidean methods restricted to d\mathbb{R}^{d}; and (iv) a projected maximum-likelihood baseline inspired by [41] (see Appendix˜D). We report Adjusted Rand Index (ARI), train with Adam (lr=0.1\mathrm{lr}=0.1) and kk-means initialization, and run all scripts on CPU in minutes.

Model class. Across all domains, we fit the finite-dimensional mixture of Section˜2.2 after fixing an orthonormal basis (er)r=1M(e_{r})_{r=1}^{M} of 𝒳M\mathcal{X}_{M}. We use cosine bases for L2/H1L^{2}/H^{1}, Wigner-D matrix coefficients for SO(3)\mathrm{SO}(3), Laplacian eigenvectors for graph signals, and the canonical basis for d\mathbb{R}^{d}.

We evaluate the MMD-based Gaussian mixture fitting method on several Hilbert-space representations of increasing complexity. The first five experiments are controlled mixture-recovery tasks with known ground-truth parameters. The sixth experiment tests system identification for a Gaussian process induced by a linear SDE. The last two experiments apply the same fitting procedure to real molecular and skeleton-sequence data. In all cases, observations are represented in a finite-dimensional basis as in Section˜2.2, and the model is fitted by minimizing the squared MMD objective with a Gaussian radial kernel. Final MMD values are reported as within-experiment diagnostics and are not meant to be compared directly across domains, since the coefficient dimension, scaling, and kernel bandwidth differ across settings.

F.1 Gaussian random variables in d\mathbb{R}^{d}

As a sanity check, we first consider the finite-dimensional case 𝒳=d\mathcal{X}=\mathbb{R}^{d} with the standard inner product, where the model reduces to a classical Gaussian mixture model [7].

Experiment.

We generate n=1500n=1500 samples from a K=3K=3 Gaussian mixture on \mathbb{R} with weights π=(0.5,0.3,0.2)\pi=(0.5,0.3,0.2), means (3.0, 0.5, 4.0)(-3.0,\,0.5,\,4.0), and standard deviations (0.6, 0.9, 0.5)(0.6,\,0.9,\,0.5). We then fit a K=3K=3 mixture by minimizing MMD2\mathrm{MMD}^{2}. Figure˜6 shows that the learned density closely matches the empirical histogram and the ground-truth density.

Refer to caption
Figure 6: d\mathbb{R}^{d} Gaussian mixture recovery. Left: MMD2\mathrm{MMD}^{2} training loss. Right: Empirical histogram of the samples overlaid with the true (blue) and learned (dashed coral) densities.

F.2 Functional data in L2(0,1)L^{2}(0,1)

We next consider functional data in 𝒳=L2(0,1;d)\mathcal{X}=L^{2}(0,1;\mathbb{R}^{d}). Gaussian random elements in this space can be viewed as Gaussian processes with square-integrable sample paths, so the mixture model specializes to a mixture of Gaussian processes [54, 66, 1].

Experiment.

We generate multivariate functional data in 𝒳=L2(0,1;2)\mathcal{X}=L^{2}(0,1;\mathbb{R}^{2}) from K=5K=5 Gaussian components with weights π=(0.30,0.25,0.20,0.15,0.10)\pi=(0.30,0.25,0.20,0.15,0.10). Each component has a distinct mean function and diagonal covariance profile in coefficient space. We sample n=500n=500 trajectories and project them onto a cosine basis with R=15R=15 functions per spatial dimension, yielding coefficient vectors in 30\mathbb{R}^{30}. Figure˜7 shows raw trajectories, recovered mean functions, recovered weights, and the training curve.

Refer to caption
Figure 7: L2(0,1;2)L^{2}(0,1;\mathbb{R}^{2}) mixture with K=5K=5. (a) Raw trajectories (dimension 1) overlaid with true means. (b) True (dashed) versus predicted (solid) mean functions. (c) True versus predicted mixture weights. (d) MMD2\mathrm{MMD}^{2} training loss.

F.3 Functional data in L2([0,1]2)L^{2}([0,1]^{2})

The same construction extends to functions on two-dimensional domains by using tensor-product bases. Given an orthonormal basis (φr)r0(\varphi_{r})_{r\geq 0} of L2(0,1)L^{2}(0,1), the functions

Φm,n(s,t)φm(s)φn(t)\Phi_{m,n}(s,t)\coloneqq\varphi_{m}(s)\varphi_{n}(t)

form an orthonormal basis of L2([0,1]2)L^{2}([0,1]^{2}). This setting covers image-like signals, spatial random fields, and spatio-temporal processes.

Experiment.

We consider 𝒳=L2([0,1]2;)\mathcal{X}=L^{2}([0,1]^{2};\mathbb{R}) with a tensor-product cosine basis, using Rs=Rt=8R_{s}=R_{t}=8 and hence M=64M=64 coefficients. We generate n=400n=400 samples from a K=3K=3 Gaussian mixture with weights π=(0.45,0.35,0.20)\pi=(0.45,0.35,0.20), where each component has a distinct 2D mean surface. Figure˜8 compares the true and predicted mean surfaces and reports the training loss and recovered weights.

Refer to caption
Figure 8: L2([0,1]2)L^{2}([0,1]^{2}) mixture with K=3K=3. Left columns: True mean surfaces mk(s,t)m_{k}(s,t) (top row) and predicted surfaces (bottom row), aligned by color. Right column: MMD2\mathrm{MMD}^{2} training loss (top) and true versus predicted mixture weights (bottom).

F.4 L2(SO(3))L^{2}(\mathrm{SO}(3)) rotation data

We also consider non-Euclidean domains equipped with an L2L^{2} structure. For a compact group GG with Haar probability measure μ\mu, the space Lμ2(G;d)L^{2}_{\mu}(G;\mathbb{R}^{d}) is a Hilbert space with inner product

f,g𝒳Gf(x)g(x)dμ(x).\langle f,g\rangle_{\mathcal{X}}\coloneqq\int_{G}f(x)^{\top}g(x)\,\mathrm{d}\mu(x).

For G=SO(3)G=\mathrm{SO}(3), the Peter–Weyl theorem provides an orthonormal basis in terms of Wigner D-matrices.

Experiment.

We consider rotation data on SO(3)\mathrm{SO}(3), relevant for orientation estimation in robotics and molecular modeling. We generate n=200n=200 rotations from K=3K=3 concentrated components and encode them using real Wigner-basis features up to degree Lmax=3L_{\max}=3, giving M=84M=84 coefficients. Figure˜9 displays each rotation through the image of the reference direction [1,0,0][1,0,0] on S2S^{2}, together with the true and learned component directions, the training loss, and the recovered weights.

Refer to caption
Figure 9: L2(SO(3))L^{2}(\mathrm{SO}(3)) mixture with K=3K=3. Left: Data projected on S2S^{2} with true and learned component directions. Middle: MMD2\mathrm{MMD}^{2} training loss. Right: True versus predicted mixture weights.

F.5 Graph signals

We next test the method on graph-structured data, where the Hilbert-space geometry is induced by the graph Laplacian. Let G=(V,E)G=(V,E) be a finite weighted graph with Laplacian L=DWL=D-W. For α>0\alpha>0, we equip |V|\mathbb{R}^{|V|} with the inner product

f,g𝒳f(L+αI)g.\langle f,g\rangle_{\mathcal{X}}\coloneqq f^{\top}(L+\alpha I)g.

This geometry gives larger norm to high-frequency graph components, while the finite-dimensional projection below restricts the representation to the leading Laplacian modes.

Experiment.

We generate n=150n=150 graph signals on a random Erdős–Rényi graph with |V|=30|V|=30 nodes and edge probability 0.250.25. The signals are projected onto the first M=15M=15 Laplacian eigenvectors, using regularization α=0.1\alpha=0.1. The mixture has K=3K=3 components with uniform weights. Figure˜10 shows the true and predicted mean graph signals, together with the training loss and recovered weights.

Refer to caption
Figure 10: Graph-signal mixture with K=3K=3. Left block: True (top) versus predicted (bottom) mean signals per component, plotted on the shared Erdős–Rényi graph using a common colormap. Right column: MMD2\mathrm{MMD}^{2} training loss (top) and true versus predicted weights (bottom).

F.6 Linear SDE: system identification

As a parametric Gaussian-process example, we consider the linear stochastic differential equation

dx(t)=(Ax(t)+Bu(t))dt+GdW(t),t[a,b].\,\mathrm{d}x(t)=\big(Ax(t)+Bu(t)\big)\,\mathrm{d}t+G\,\mathrm{d}W(t),\qquad t\in[a,b].

For Gaussian initial condition and deterministic input uu, the solution is a Gaussian process. Its mean m(t)m(t) and covariance P(t)=Cov[x(t),x(t)]P(t)=\mathrm{Cov}[x(t),x(t)] satisfy

m˙=Am+Bu,m(a)=m0,P˙=AP+PA+GG,P(a)=Σ0.\dot{m}=Am+Bu,\qquad m(a)=m_{0},\qquad\dot{P}=AP+PA^{\top}+GG^{\top},\qquad P(a)=\Sigma_{0}.

The resulting projected mean and covariance can therefore be computed analytically in any chosen L2L^{2} basis.

Experiment.

We treat (A,B,G)(A,B,G) as learnable parameters and identify them from ns=20n_{s}=20 sampled trajectories of a 44-dimensional linear SDE over [0,5][0,5], driven by the input u(t)=(sint,cos2t)u(t)=(\sin t,\cos 2t). The trajectories are projected onto an L2L^{2} cosine basis with R=7R=7 functions per state dimension, giving M=28M=28 coefficients. We match the empirical coefficient distribution to the single Gaussian induced by the linear SDE. Figure˜11 shows the sample paths, training loss, recovered mean function, and diagonal of the projected covariance.

Refer to caption
Figure 11: Linear SDE system identification via MMD. (a) Sample paths of the first state dimension. (b) MMD2\mathrm{MMD}^{2} training loss. (c) True versus estimated mean function m(t)m(t) for every state dimension. (d) Diagonal of the projected covariance matrix.

F.7 3D molecular data (QM9)

We also test the method on real molecular structures using permutation-invariant graph fingerprints derived from 3D atomic coordinates.

Experiment.

We sample n=500n=500 molecules from QM9, build WL-hash fingerprints of dimension 128128 from distance-threshold molecular graphs, and project them onto the first R=64R=64 DCT coefficients. We fit a K=5K=5 Gaussian mixture with diagonal covariance. Figure˜12 shows the three molecules closest to each learned component mean in coefficient space. The HOMO–LUMO gap and assignment responsibility are reported for each representative molecule.

Refer to caption
Figure 12: Representative QM9 molecules per component. Columns correspond to learned components k=1,,Kk=1,\ldots,K, and rows display the three closest molecules to each component mean in coefficient space. Each panel reports the HOMO-LUMO gap and assignment responsibility γi,k\gamma_{i,k}.

F.8 NTU RGB+D skeleton sequences

Finally, we apply the method to human motion data, treating skeleton sequences as 75\mathbb{R}^{75}-valued paths corresponding to 2525 joints with 33 coordinates each.

Experiment.

We load n=150n=150 action sequences from the NTU RGB+D dataset [58], resample each sequence to 6464 frames, subtract the mean trajectory, and project onto an L2L^{2} cosine basis with R=10R=10. This gives coefficient vectors of dimension M=750M=750. We fit a K=5K=5 Gaussian mixture with diagonal covariance by minimizing MMD2\mathrm{MMD}^{2}. Figure˜13 shows the most representative skeleton sequences per component, selected as those closest to the learned component mean in coefficient space.

Refer to caption
Figure 13: Most representative NTU skeleton sequences per component. Columns correspond to learned components k=1,,Kk=1,\ldots,K, and rows display the three closest sequences to each component mean. Each cell shows 66 evenly-spaced snapshots overlaid with increasing opacity, along with the action label and the assignment responsibility γi,k\gamma_{i,k}.

F.9 Summary

Table˜3 summarizes the experimental settings and final MMD2\mathrm{MMD}^{2} values. The controlled experiments show that the method recovers the main mixture structure across Euclidean, functional, group-valued, and graph-based representations. The real-data experiments illustrate that the same coefficient-space fitting procedure also yields coherent representative components on molecular and skeleton-sequence data.

Appendix G Experimental details and compute resources

We record the implementation details used for the experiments in Sections˜4 and F. The complete codebase, along with instructions to reproduce the synthetic and real-world experiments, can be found in the repository https://github.com/dani2442/rkhs-mixture-models/.

Unless stated otherwise, all models are trained on projected coefficient vectors in M\mathbb{R}^{M}, use torch.float64, use fixed random seeds, run on CPU, and initialize mixture means with kk-means++.

G.1 Hilbert spaces and bases

Each experiment fits the projected mixture model of Section˜2.2 after fixing a separable Hilbert space 𝒳\mathcal{X} and an orthonormal basis (er)r1(e_{r})_{r\geq 1}, which determines the truncated subspace 𝒳M=span{e1,,eM}\mathcal{X}_{M}=\mathrm{span}\{e_{1},\dots,e_{M}\}. We define here the Hilbert spaces and bases used in Tables˜2 and 3, so that the table entries refer to this section instead of repeating the definitions.

(B1) Euclidean canonical basis.

For 𝒳=d\mathcal{X}=\mathbb{R}^{d} we use the canonical basis (er)r=1d(e_{r})_{r=1}^{d}, so M=dM=d.

(B2) L2L^{2} cosine basis.

For 𝒳=L2(0,1;)\mathcal{X}=L^{2}(0,1;\mathbb{R}) with f,g=01f(t)g(t)dt\langle f,g\rangle=\int_{0}^{1}f(t)g(t)\,\mathrm{d}t, we use the orthonormal cosine basis

e0(t)=1,er(t)=2cos(πrt),r1,e_{0}(t)=1,\qquad e_{r}(t)=\sqrt{2}\cos(\pi rt),\quad r\geq 1,

truncated at r=R1r=R-1, so M=RM=R. For vector-valued data 𝒳=L2(0,1;d)\mathcal{X}=L^{2}(0,1;\mathbb{R}^{d}), we tensor with the canonical basis of d\mathbb{R}^{d}, so M=RdM=Rd.

(B3) Tensor-product cosine basis on [0,1]2[0,1]^{2}.

For 𝒳=L2([0,1]2;)\mathcal{X}=L^{2}([0,1]^{2};\mathbb{R}) we use Φm,n(s,t)=em(s)en(t)\Phi_{m,n}(s,t)=e_{m}(s)\,e_{n}(t) with 0m<Rs0\leq m<R_{s}, 0n<Rt0\leq n<R_{t}, so M=RsRtM=R_{s}R_{t}.

(B4) H1H^{1} cosine basis.

For 𝒳=H1(0,1;)\mathcal{X}=H^{1}(0,1;\mathbb{R}) with f,g=01fg+01fg\langle f,g\rangle=\int_{0}^{1}fg+\int_{0}^{1}f^{\prime}g^{\prime}, we use the H1H^{1}-rescaled cosines

e~r(t)=er(t)1+π2r2,r=0,,Rs1,\tilde{e}_{r}(t)=\frac{e_{r}(t)}{\sqrt{1+\pi^{2}r^{2}}},\qquad r=0,\dots,R_{s}-1,

so M=RsM=R_{s}; the e~r\tilde{e}_{r} are orthonormal in H1H^{1}.

(B5) Real Wigner–D basis on SO(3)\mathrm{SO}(3).

For 𝒳=L2(SO(3);)\mathcal{X}=L^{2}(\mathrm{SO}(3);\mathbb{R}) under the Haar inner product, the Peter–Weyl theorem yields an orthonormal basis given by the (real) Wigner D-matrix entries,

e,m,n(R)=2+1Dm,n(R),0,m,n.e_{\ell,m,n}(R)=\sqrt{2\ell+1}\,D^{\ell}_{m,n}(R),\qquad\ell\geq 0,\ -\ell\leq m,n\leq\ell.

Truncating at Lmax\ell\leq L_{\max} gives M==0Lmax(2+1)2M=\sum_{\ell=0}^{L_{\max}}(2\ell+1)^{2}; for Lmax=3L_{\max}=3, M=84M=84.

(B6) Graph Laplacian eigenbasis.

For a finite weighted graph G=(V,E)G=(V,E) with Laplacian L=DWL=D-W and regularization α>0\alpha>0, we equip 𝒳=|V|\mathcal{X}=\mathbb{R}^{|V|} with f,g=f(L+αI)g\langle f,g\rangle=f^{\top}(L+\alpha I)g. Diagonalizing L=UΛUL=U\Lambda U^{\top} with eigenpairs (λj,uj)(\lambda_{j},u_{j}), the rescaled eigenvectors ej=uj/λj+αe_{j}=u_{j}/\sqrt{\lambda_{j}+\alpha} form an orthonormal basis of 𝒳\mathcal{X}. We keep the MM eigenvectors associated with the smallest eigenvalues.

(B7) Frobenius basis on Sym(d)\mathrm{Sym}(d).

For 𝒳=Sym(d)d×d\mathcal{X}=\mathrm{Sym}(d)\subset\mathbb{R}^{d\times d} with A,B=tr(AB)\langle A,B\rangle=\mathrm{tr}(A^{\top}B), we use the canonical symmetric basis {Eii}{(Eij+Eji)/2}i<j\{E_{ii}\}\cup\{(E_{ij}+E_{ji})/\sqrt{2}\}_{i<j} (vech embedding), so M=(d+12)M=\binom{d+1}{2}.

(B8) Weisfeiler–Lehman (WL) with kernel PCA / DCT.

For molecular graphs (MUTAG, QM9) we compute WL hash fingerprints in D\mathbb{R}^{D}, then project onto the leading MM kernel-PCA directions (MUTAG) or DCT coefficients (QM9) to obtain coefficient vectors in M\mathbb{R}^{M}. The basis is fixed in coefficient space and inherited from the projection.

G.2 Experimental setting

The clustering experiments are unsupervised: all methods are fit without access to labels, and labels are used only after fitting to compute Adjusted Rand Index (ARI). There are no train/test splits because the reported task is clustering or mixture recovery on a fixed sample; variability is instead measured across independently generated datasets, benchmark datasets, random restarts, or subsampled runs, as specified in Table˜2.

Table 2: Benchmark hyperparameters and evaluation protocol. Bases (B1)–(B8) are defined in Section˜G.1. All MMD-GMM fits use Adam and kk-means++ initialization; polynomial kernels use degree 22 and c=1c=1.
Experiment Data and Runs Representation and MMD-GMM training
d\mathbb{R}^{d} toy data Five labeled sklearn datasets [50], n=500n=500, d=2d=2, K{2,3}K\in\{2,3\}; seed 3030, MMD seed 4242. Representation: canonical basis (B1) on standardized coordinates, M=2M=2. Training: 300300 epochs, learning rate 0.050.05, full covariance; Gaussian bandwidth by 0.50.5 times the median pairwise-distance heuristic.
L2()L^{2}(\mathbb{R}) functional data Growth [53], Waveform [11], Phoneme [26], Kneading [51], and ECG200 [48]; ARI averaged over the five datasets; seed 4242. Representation: L2L^{2} cosine basis (B2), R=15R=15 (R=10R=10 for Waveform), coefficient normalization per dataset. Training: 400400 epochs, learning rate 0.10.1, diagonal covariance; Gaussian bandwidth by median heuristic.
Glucodensity clustering PEDAP CGM cohort [63, 1], 9898 patients, K=2K=2; 55 MMD runs; seed 4242. Representation: H1H^{1} cosine basis (B4) with Rs=8R_{s}=8 on overlapping 44-day sliding windows (stride 11 day, 1616 time bins); patient-level features for our method. Training: 600600 epochs, learning rate 0.010.01, diagonal covariance; cosine annealing to 10410^{-4}; Gaussian bandwidth by median heuristic.
SO(3)\mathrm{SO}(3) rotations 1010 generated datasets, n=200n=200, K=3K=3, noise concentration 8.08.0; seeds 42,,5142,\ldots,51. Representation: real Wigner–D basis (B5) with Lmax=3L_{\max}=3, M=84M=84. Training: 400400 epochs, learning rate 0.10.1, diagonal covariance; Gaussian bandwidth by median heuristic.
MUTAG molecular graphs MUTAG [20], n=188n=188, K=2K=2, 55 runs; seed 4242. Representation: WL fingerprints with kernel PCA (B8), fingerprint dimension 512512, radius 33, reduced to M=32M=32. Training: 500500 epochs, learning rate 0.050.05, 55 runs; Gaussian bandwidth selected from {0.1,0.5,1,2,4}\{0.1,0.5,1,2,4\} plus the median heuristic by a 6060-epoch MMD proxy sweep.

The standalone mixture-recovery figures and case-study panels in Sections˜4, 4.2 and F that are not part of the unified ARI benchmark follow the configurations summarized in Table˜3.

Table 3: Hyperparameters for the additional mixture-recovery and case-study figures. Bases (B1)–(B8) are defined in Section˜G.1.
Experiment Basis 𝒏\bm{n} 𝑲\bm{K} 𝑴\bm{M} Epochs Kernel/optimizer details
11D Gaussian mixture (B1) 15001500 33 11 400400 Gaussian σ=1.0\sigma=1.0, Adam learning rate 0.050.05.
L2(0,1;2)L^{2}(0,1;\mathbb{R}^{2}) recovery (B2) 500500 55 3030 400400 Cosine basis with R=15R=15 per coordinate; Gaussian σ=1.2\sigma=1.2, Adam learning rate 0.10.1.
L2([0,1]2)L^{2}([0,1]^{2}) recovery (B3) 400400 33 6464 400400 Tensor-product cosine Rs=Rt=8R_{s}=R_{t}=8; Gaussian σ=2.0\sigma=2.0, Adam learning rate 0.10.1.
L2(SO(3))L^{2}(\mathrm{SO}(3)) recovery (B5) 200200 33 8484 400400 Real Wigner–D, Lmax=3L_{\max}=3; Gaussian σ=10.0\sigma=10.0, Adam learning rate 0.050.05.
Graph-signal recovery (B6) 150150 33 1515 200200 Erdős–Rényi graph, α=0.1\alpha=0.1; Gaussian σ=2.0\sigma=2.0, Adam learning rate 0.10.1.
Linear SDE identification (B2) 2020 paths 11 2828 400400 Cosine basis with R=7R=7 per state dim.; Gaussian bandwidth 2sd(X)2\,\mathrm{sd}(X), Adam learning rate 10210^{-2}.
QM9 molecules (B8) 500500 55 6464 150150 WL fingerprints + first 6464 DCT coefficients; Gaussian σ=2.0\sigma=2.0, Adam learning rate 0.050.05.
NTU representatives (B2) 150150 55 750750 250250 Per-joint cosine basis R=10R=10 on 75\mathbb{R}^{75}-valued paths; Gaussian σ=2.0\sigma=2.0, Adam learning rate 0.050.05.
CGM temporal H1H^{1} mixture (B4) 9898 patients 22 88 400400 Overlapping 44-day sliding windows with stride 11 day, 1616 time bins, H1H^{1} cosine basis with Rs=8R_{s}=8; 2-layer neural ODE weights, hidden width 6464, RK4, Gaussian σ=0.8\sigma=0.8, Adam learning rate 0.010.01.
CGM correlation mixture (B7) 21,88921{,}889 windows 33 300300 200200 Overlapping W=14W=14-day sliding windows with stride 11 day; each window gives a 24×2424\times 24 inter-hour correlation matrix with linear shrinkage toward the identity (intensity α=0.1\alpha=0.1), vech embedding into Sym(24)\mathrm{Sym}(24); 2-layer neural ODE logits, hidden width 6464, RK4, Adam learning rate 10210^{-2}.
CGM patient graph mixture (B6) 7676 patients 33 304304 200200 Overlapping 2121-day sliding windows with stride 11 day, 1010 time bins; pairwise H1H^{1} similarities use Rs=16R_{s}=16 cosine modes; graph Laplacian basis rank R=4R=4; 2-layer neural ODE logits, hidden width 6464, RK4, Adam learning rate 10210^{-2}.

G.3 Statistical reporting

All entries in Table˜1 are reported as mean ARI with one standard deviation in subscript, computed directly from the list of ARI scores saved by each benchmark script. The standard deviation is a descriptive measure of run-to-run or dataset-to-dataset variability, not a standard error and not a formal confidence interval. The variability source is the labelled dataset collection for the d\mathbb{R}^{d} and L2L^{2} columns, independent generated datasets for SO(3)\mathrm{SO}(3), and multiple runs for glucodensity and MUTAG. Deterministic single-dataset baselines have zero reported standard deviation when only one fitted value is produced.

G.4 Compute resources

All reported experiments were run on CPU only; no GPU was used or required. The local machine used for reproduction is an Ubuntu 24.04.4 LTS workstation equipped with two Intel Xeon E5-2650 v2 CPUs at 2.60GHz2.60\,\mathrm{GHz} (1616 physical cores in total), 188188 GiB of RAM, and a network-mounted filesystem. The software environment relies on Python 3.11.5 with the following core libraries: PyTorch 1.13.1, NumPy 1.26.4, SciPy 1.12.0, scikit-learn 1.5.2, pandas 1.5.3, scikit-fda 0.10.1, torch-geometric 2.7.0, aeon 0.11.1, rdata 1.0.0, torchdiffeq 0.2.5, and torchsde 0.2.6. The end-to-end reproduction pipeline runs as a single CPU-worker workflow and typically requires 2244 wall-clock hours, depending on cached downloads and optional notebook regeneration.

Appendix H Proofs

H.1 Proof of Proposition 2.1

Proof.

Since (ej)j1(e_{j})_{j\geq 1} is an orthonormal basis of 𝒳\mathcal{X}, the orthogonal projections ΠM\Pi_{M} converge strongly to the identity. That is, for every x𝒳x\in\mathcal{X}, ΠMxx\Pi_{M}x\to x in 𝒳\mathcal{X} as MM\to\infty.

We first prove the almost sure convergence of Ji,k(M)J^{(M)}_{i,k}. Fix i{1,,n}i\in\{1,\dots,n\} and k{1,,K}k\in\{1,\dots,K\}. Fix ω\omega such that xiXi(ω)𝒳x_{i}\coloneqq X_{i}(\omega)\in\mathcal{X}. Then ΠMxixi\Pi_{M}x_{i}\to x_{i} in 𝒳\mathcal{X}. By the continuity of κ\kappa, for every y𝒳y\in\mathcal{X}, we have κ(ΠMxi,ΠMy)κ(xi,y)\kappa(\Pi_{M}x_{i},\Pi_{M}y)\to\kappa(x_{i},y). Furthermore, using the growth assumption (9) and the contraction property of the orthogonal projector (ΠMx𝒳x𝒳\|\Pi_{M}x\|_{\mathcal{X}}\leq\|x\|_{\mathcal{X}}), we can bound the kernel evaluation as:

|κ(ΠMxi,ΠMy)|C(1+ΠMxi𝒳q)(1+ΠMy𝒳q)C(1+xi𝒳q)(1+y𝒳q).|\kappa(\Pi_{M}x_{i},\Pi_{M}y)|\leq C\bigl(1+\|\Pi_{M}x_{i}\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|\Pi_{M}y\|_{\mathcal{X}}^{q}\bigr)\leq C\bigl(1+\|x_{i}\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr).

By the finite moment assumption, the right-hand side is νk\nu_{k}-integrable with respect to yy for our fixed xix_{i}. Therefore, by the dominated convergence theorem,

Ji,k(M)(ω)=𝒳κ(ΠMxi,ΠMy)dνk(y)𝒳κ(xi,y)dνk(y)=Ji,k(ω),J^{(M)}_{i,k}(\omega)=\int_{\mathcal{X}}\kappa(\Pi_{M}x_{i},\Pi_{M}y)\,\mathrm{d}\nu_{k}(y)\to\int_{\mathcal{X}}\kappa(x_{i},y)\,\mathrm{d}\nu_{k}(y)=J_{i,k}(\omega),

which establishes Ji,k(M)Ji,kJ^{(M)}_{i,k}\to J_{i,k} almost surely.

Next, we establish the convergence of Ik,s(M)I^{(M)}_{k,s}. Fix k,s{1,,K}k,s\in\{1,\dots,K\}. By the continuity of κ\kappa, κ(ΠMy,ΠMy)κ(y,y)\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})\to\kappa(y,y^{\prime}) for all y,y𝒳y,y^{\prime}\in\mathcal{X}. Similarly to the previous case, applying (9) and the contraction property yields:

|κ(ΠMy,ΠMy)|C(1+y𝒳q)(1+y𝒳q).|\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})|\leq C\bigl(1+\|y\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|y^{\prime}\|_{\mathcal{X}}^{q}\bigr).

Due to the finite moment assumption, this dominating function is integrable with respect to the product measure νkνs\nu_{k}\otimes\nu_{s}. Invoking the dominated convergence theorem once more, we obtain

Ik,s(M)=𝒳×𝒳κ(ΠMy,ΠMy)d(νkνs)(y,y)𝒳×𝒳κ(y,y)d(νkνs)(y,y)=Ik,s.I^{(M)}_{k,s}=\int_{\mathcal{X}\times\mathcal{X}}\kappa(\Pi_{M}y,\Pi_{M}y^{\prime})\,\mathrm{d}(\nu_{k}\otimes\nu_{s})(y,y^{\prime})\to\int_{\mathcal{X}\times\mathcal{X}}\kappa(y,y^{\prime})\,\mathrm{d}(\nu_{k}\otimes\nu_{s})(y,y^{\prime})=I_{k,s}.

Finally, since κ(ΠMXi,ΠMXj)κ(Xi,Xj)\kappa(\Pi_{M}X_{i},\Pi_{M}X_{j})\to\kappa(X_{i},X_{j}) a.s. for all i,ji,j, and the sums in (8) are finite with weights πk\pi_{k} independent of MM, combining these results yields (10). ∎

H.2 Proof of Proposition 2.2

Proof.

Fix k{1,,K}k\in\{1,\dots,K\}. We first show that the exact responsibility γk(X)\gamma_{k}(X) is a version of the conditional expectation 𝔼[𝟏{Z=k}σ(X)]\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)]. Let Qθ=s=1KπsνsQ_{\theta}=\sum_{s=1}^{K}\pi_{s}\nu_{s} be the marginal law of XX. For any Borel set A(𝒳)A\in\mathcal{B}(\mathcal{X}), the definition of γk=d(πkνk)/dQθ\gamma_{k}=\,\mathrm{d}(\pi_{k}\nu_{k})/\,\mathrm{d}Q_{\theta} yields

Aγk(x)dQθ(x)=πkνk(A)=(XA,Z=k).\int_{A}\gamma_{k}(x)\,\mathrm{d}Q_{\theta}(x)=\pi_{k}\nu_{k}(A)=\mathbb{P}(X\in A,\ Z=k).

Equivalently, for every A(𝒳)A\in\mathcal{B}(\mathcal{X}),

𝔼[γk(X)𝟏{XA}]=𝔼[𝟏{Z=k}𝟏{XA}].\mathbb{E}\bigl[\gamma_{k}(X)\mathbf{1}_{\{X\in A\}}\bigr]=\mathbb{E}\bigl[\mathbf{1}_{\{Z=k\}}\mathbf{1}_{\{X\in A\}}\bigr].

Since γk(X)\gamma_{k}(X) is σ(X)\sigma(X)-measurable, this identity implies γk(X)=𝔼[𝟏{Z=k}σ(X)]\gamma_{k}(X)=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)] almost surely.

By an identical argument on the projected space, equipping 𝒳M\mathcal{X}_{M} with the pushforward measures νs,M\nu_{s,M} and Qθ,MQ_{\theta,M}, the projected derivative γk(M)=d(πkνk,M)/dQθ,M\gamma_{k}^{(M)}=\,\mathrm{d}(\pi_{k}\nu_{k,M})/\,\mathrm{d}Q_{\theta,M} satisfies

Bγk(M)(u)dQθ,M(u)=(X(M)B,Z=k)\int_{B}\gamma_{k}^{(M)}(u)\,\mathrm{d}Q_{\theta,M}(u)=\mathbb{P}(X^{(M)}\in B,\ Z=k)

for any B(𝒳M)B\in\mathcal{B}(\mathcal{X}_{M}). Equivalently, for every B(𝒳M)B\in\mathcal{B}(\mathcal{X}_{M}),

𝔼[γk(M)(X(M)) 1{X(M)B}]=𝔼[𝟏{Z=k} 1{X(M)B}].\mathbb{E}\bigl[\gamma_{k}^{(M)}(X^{(M)})\,\mathbf{1}_{\{X^{(M)}\in B\}}\bigr]=\mathbb{E}\bigl[\mathbf{1}_{\{Z=k\}}\,\mathbf{1}_{\{X^{(M)}\in B\}}\bigr].

Since γk(M)(X(M))\gamma_{k}^{(M)}(X^{(M)}) is σ(X(M))\sigma(X^{(M)})-measurable, this confirms that

γk(M)(X(M))=𝔼[𝟏{Z=k}σ(X(M))]\gamma_{k}^{(M)}(X^{(M)})=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X^{(M)})]

almost surely, establishing (13).

Next, define the filtration Mσ(X(M))\mathcal{F}_{M}\coloneqq\sigma(X^{(M)}). The property ΠM=ΠMΠM+1\Pi_{M}=\Pi_{M}\Pi_{M+1} ensures that X(M)=ΠMX(M+1)X^{(M)}=\Pi_{M}X^{(M+1)}, so X(M)X^{(M)} is σ(X(M+1))\sigma(X^{(M+1)})-measurable, meaning the filtration is increasing (MM+1\mathcal{F}_{M}\subseteq\mathcal{F}_{M+1}). Let σ(M=1M)\mathcal{F}_{\infty}\coloneqq\sigma(\cup_{M=1}^{\infty}\mathcal{F}_{M}). Since each projector ΠM\Pi_{M} is continuous, X(M)X^{(M)} is σ(X)\sigma(X)-measurable, yielding σ(X)\mathcal{F}_{\infty}\subseteq\sigma(X). Conversely, since ΠMxx\Pi_{M}x\to x, we have X(M)(ω)X(ω)X^{(M)}(\omega)\to X(\omega) pointwise in 𝒳\mathcal{X}. Since 𝒳\mathcal{X} is a metric space, the pointwise limit of \mathcal{F}_{\infty}-measurable maps into 𝒳\mathcal{X} is \mathcal{F}_{\infty}-measurable. This forces σ(X)\sigma(X)\subseteq\mathcal{F}_{\infty}, concluding that =σ(X)\mathcal{F}_{\infty}=\sigma(X).

Finally, we apply Lévy’s upward convergence theorem. Since the target 𝟏{Z=k}\mathbf{1}_{\{Z=k\}} is integrable, the martingale sequence 𝔼[𝟏{Z=k}M]\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{M}] converges almost surely and in L1L^{1} to 𝔼[𝟏{Z=k}]\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{\infty}]. By our previous identities,

𝔼[𝟏{Z=k}]=𝔼[𝟏{Z=k}σ(X)]=γk(X)a.s.\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\mathcal{F}_{\infty}]=\mathbb{E}[\mathbf{1}_{\{Z=k\}}\mid\sigma(X)]=\gamma_{k}(X)\qquad\text{a.s.}

Substituting these into the limit, we obtain, as MM\to\infty,

γk(M)(X(M))γk(X)a.s. and in L1.\gamma_{k}^{(M)}(X^{(M)})\to\gamma_{k}(X)\qquad\text{a.s.\ and in }L^{1}.

H.3 Proof of Proposition 3.1

Proof.

Proof of (i).

Fix p1p\geq 1, P𝒫p(𝒳)P\in\mathcal{P}_{p}(\mathcal{X}), and ε>0\varepsilon>0. Let 𝒦0\mathcal{K}_{0} be any nonzero trace-class covariance operator on 𝒳\mathcal{X}. It is enough to prove density for mixtures with component covariances of the form σ2𝒦0\sigma^{2}\mathcal{K}_{0}, since these form a subclass of the mixtures in (1).

Because 𝒳\mathcal{X} is a separable Hilbert space, it is Polish. Hence finitely supported probability measures are dense in 𝒫p(𝒳)\mathcal{P}_{p}(\mathcal{X}) for WpW_{p}. Therefore, there exists ν=i=1Nwiδxi\nu=\sum_{i=1}^{N}w_{i}\delta_{x_{i}} such that Wp(P,ν)<ε/2.W_{p}(P,\nu)<\varepsilon/2.

For σ>0\sigma>0, define the Gaussian smoothing

νσi=1Nwi𝒩(xi,σ2𝒦0).\nu_{\sigma}\coloneqq\sum_{i=1}^{N}w_{i}\,\mathcal{N}(x_{i},\sigma^{2}\mathcal{K}_{0}).

Let II be an index with (I=i)=wi\mathbb{P}(I=i)=w_{i}, and let Z𝒩(0,𝒦0)Z\sim\mathcal{N}(0,\mathcal{K}_{0}), independent of II. Then X=xIX=x_{I} has law ν\nu, while Y=xI+σZY=x_{I}+\sigma Z has law νσ\nu_{\sigma}. This coupling gives

Wpp(ν,νσ)𝔼XY𝒳p=σp𝔼Z𝒳p.W_{p}^{p}(\nu,\nu_{\sigma})\leq\mathbb{E}\|X-Y\|_{\mathcal{X}}^{p}=\sigma^{p}\mathbb{E}\|Z\|_{\mathcal{X}}^{p}.

Since 𝒦0\mathcal{K}_{0} is trace-class, Fernique’s theorem implies 𝔼Z𝒳p<\mathbb{E}\|Z\|_{\mathcal{X}}^{p}<\infty. Hence Wp(ν,νσ)0W_{p}(\nu,\nu_{\sigma})\to 0 as σ0\sigma\downarrow 0. Choosing σ\sigma small enough yields Wp(ν,νσ)<ε/2W_{p}(\nu,\nu_{\sigma})<\varepsilon/2, and the triangle inequality gives

Wp(P,νσ)<ε.W_{p}(P,\nu_{\sigma})<\varepsilon.

Thus, finite Gaussian mixtures with non-zero covariance operators are WpW_{p}-dense in 𝒫p(𝒳)\mathcal{P}_{p}(\mathcal{X}).

Proof of (ii). Fix P𝒫(𝒳)P\in\mathcal{P}(\mathcal{X}) and draw KK i.i.d. samples X1,,XKPX_{1},\dots,X_{K}\sim P. Let the empirical measure be P^K1Kk=1KδXk\widehat{P}_{K}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\delta_{X_{k}}. Since 𝔼[κ(Xk,)]=μP\mathbb{E}[\kappa(X_{k},\cdot)]=\mu_{P} and we assume supxκ(x,x)1\sup_{x}\kappa(x,x)\leq 1, the expected squared MMD is bounded by the variance of the mean embedding:

𝔼[MMDκ2(P,P^K)]=1K𝔼[κ(X1,)μPκ2]1K𝔼[κ(X1,X1)]1K.\mathbb{E}\big[\operatorname{MMD}_{\kappa}^{2}(P,\widehat{P}_{K})\big]=\frac{1}{K}\mathbb{E}\big[\|\kappa(X_{1},\cdot)-\mu_{P}\|_{\mathcal{H}_{\kappa}}^{2}\big]\leq\frac{1}{K}\mathbb{E}[\kappa(X_{1},X_{1})]\leq\frac{1}{K}.

By the probabilistic method, there exists a specific realization m1,,mK𝒳m_{1},\dots,m_{K}\in\mathcal{X} such that the deterministic empirical measure μK1Kk=1Kδmk\mu_{K}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\delta_{m_{k}} satisfies MMDκ(P,μK)1/K\operatorname{MMD}_{\kappa}(P,\mu_{K})\leq 1/\sqrt{K}.

Fix any nonzero trace-class covariance operator 𝒦0\mathcal{K}_{0} and define QK,σ1Kk=1K𝒩(mk,σ2𝒦0)Q_{K,\sigma}\coloneqq\frac{1}{K}\sum_{k=1}^{K}\mathcal{N}(m_{k},\sigma^{2}\mathcal{K}_{0}). Then QK,σQ_{K,\sigma} converges weakly to μK\mu_{K} as σ0\sigma\downarrow 0. Because κ\kappa is bounded and continuous, weak convergence implies convergence in MMD. Thus, we can choose σ>0\sigma>0 small enough such that MMDκ(QK,σ,μK)1/K\operatorname{MMD}_{\kappa}(Q_{K,\sigma},\mu_{K})\leq 1/\sqrt{K}.

Applying the triangle inequality yields

MMDκ(P,QK,σ)MMDκ(P,μK)+MMDκ(μK,QK,σ)2K,\operatorname{MMD}_{\kappa}(P,Q_{K,\sigma})\leq\operatorname{MMD}_{\kappa}(P,\mu_{K})+\operatorname{MMD}_{\kappa}(\mu_{K},Q_{K,\sigma})\leq\frac{2}{\sqrt{K}},

which completes the proof, as QK,σQ_{K,\sigma} is a valid mixture of the form (1). ∎

H.4 Proof of Proposition C.1

Proof.

Step 1. Let Φ(x)=κ(x,)κ\Phi(x)=\kappa(x,\cdot)\in\mathcal{H}_{\kappa} be the canonical feature map. Because 𝒳\mathcal{X} is separable and κ\kappa is continuous, the reproducing kernel Hilbert space κ\mathcal{H}_{\kappa} is separable. Define the Bochner space

L2(0,T;κ).\mathbb{H}\coloneqq L^{2}(0,T;\mathcal{H}_{\kappa}).

For each sample path, define the random element Yi(t)Φ(Xi(t))Y_{i}(t)\coloneqq\Phi(X_{i}(t)). Since Φ\Phi is continuous, (ω,t)Φ(Xi(t,ω))(\omega,t)\mapsto\Phi(X_{i}(t,\omega)) is jointly measurable. Together with the separability of κ\mathcal{H}_{\kappa} and the integrability estimate below, this defines a strongly measurable \mathbb{H}-valued random variable. By assumption, its squared norm is integrable:

𝔼[Y12]=𝔼[0TΦ(X1(t))κ2dt]=𝔼[0Tκ(X1(t),X1(t))dt]<.\mathbb{E}[\|Y_{1}\|_{\mathbb{H}}^{2}]=\mathbb{E}\!\left[\int_{0}^{T}\|\Phi(X_{1}(t))\|_{\mathcal{H}_{\kappa}}^{2}\,\mathrm{d}t\right]=\mathbb{E}\!\left[\int_{0}^{T}\kappa(X_{1}(t),X_{1}(t))\,\mathrm{d}t\right]<\infty.

Thus, Y1,,YnY_{1},\dots,Y_{n} are i.i.d. \mathbb{H}-valued random variables with finite first and second moments.

Step 2. By Jensen’s inequality and Tonelli’s theorem, the population mean embedding trajectory μP(μP(t))t[0,T]\mu_{P}\coloneqq(\mu_{P(t)})_{t\in[0,T]} satisfies 0TμP(t)κ2dt𝔼[Y12]<\int_{0}^{T}\|\mu_{P(t)}\|^{2}_{\mathcal{H}_{\kappa}}\,\mathrm{d}t\leq\mathbb{E}[\|Y_{1}\|_{\mathbb{H}}^{2}]<\infty, so μP\mu_{P}\in\mathbb{H}. By the Bochner–Fubini theorem, μP\mu_{P} is exactly the expected value 𝔼[Y1]\mathbb{E}[Y_{1}] in \mathbb{H}. Similarly, the empirical mean embedding trajectory μPn\mu_{P_{n}} corresponds to the sample average:

μPn=1ni=1nYiin .\mu_{P_{n}}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\quad\text{in }\mathbb{H}.

By assumption, the target embedding trajectory μQ(μQ(t))t[0,T]\mu_{Q}\coloneqq(\mu_{Q(t)})_{t\in[0,T]} also belongs to \mathbb{H}.

Step 3. By the strong law of large numbers for Hilbert-space-valued random variables, the sample average converges to its expectation almost surely:

μPnnμPin  a.s.\mu_{P_{n}}\xrightarrow{n\to\infty}\mu_{P}\qquad\text{in }\mathbb{H}\text{ a.s.}

Since μPnμQμPμQ\mu_{P_{n}}-\mu_{Q}\to\mu_{P}-\mu_{Q} in \mathbb{H}, the continuity of the squared norm gives μPnμQ2μPμQ2\|\mu_{P_{n}}-\mu_{Q}\|_{\mathbb{H}}^{2}\to\|\mu_{P}-\mu_{Q}\|_{\mathbb{H}}^{2}. Because the integrated squared MMD is precisely the squared distance in \mathbb{H} scaled by 1/T1/T, this immediately implies that

1T0TMMDκ2(Pn(t),Q(t))dtn1T0TMMDκ2(P(t),Q(t))dt\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q(t)\bigr)\,\mathrm{d}t\xrightarrow{n\to\infty}\frac{1}{T}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P(t),Q(t)\bigr)\,\mathrm{d}t

almost surely. Finally, setting Q(t)=P(t)Q(t)=P(t) directly yields 1TμPnμP20\frac{1}{T}\|\mu_{P_{n}}-\mu_{P}\|_{\mathbb{H}}^{2}\to 0 almost surely, completing the proof. ∎

H.5 Proof of Proposition C.2

Proof.

For each MM, the identification of 𝒳M\mathcal{X}_{M} with M\mathbb{R}^{M} via the basis (er)r=1M(e_{r})_{r=1}^{M} maps the projected measures Pn,M(t)1ni=1nδΠMXi(t)P_{n,M}(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{\Pi_{M}X_{i}(t)} and Qθ,M(t)k=1Kπk(t)νk,MQ_{\theta,M}(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\nu_{k,M} to their finite-dimensional coordinate representations 𝐏n,M(t)\mathbf{P}_{n,M}(t) and 𝐐θ,M(t)\mathbf{Q}_{\theta,M}(t). By defining κM(𝐱,𝐲)κ(r=1Mxrer,r=1Myrer)\kappa_{M}(\mathbf{x},\mathbf{y})\coloneqq\kappa(\sum_{r=1}^{M}x_{r}e_{r},\sum_{r=1}^{M}y_{r}e_{r}), we have

MMDκM2(𝐏n,M(t),𝐐θ,M(t))=MMDκ2(Pn,M(t),Qθ,M(t))\operatorname{MMD}_{\kappa_{M}}^{2}\bigl(\mathbf{P}_{n,M}(t),\mathbf{Q}_{\theta,M}(t)\bigr)=\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)

for every tt. Expanding the exact and projected MMD objectives, we define the cross-expectations

Ji,k(M)(t)𝔼Yνk[κ(ΠMXi(t),ΠMY)],Ji,k(t)𝔼Yνk[κ(Xi(t),Y)],J_{i,k}^{(M)}(t)\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)\big],\qquad J_{i,k}(t)\coloneqq\mathbb{E}_{Y\sim\nu_{k}}\big[\kappa(X_{i}(t),Y)\big],
Ik,s(M)𝔼Yνk,Yνs[κ(ΠMY,ΠMY)],Ik,s𝔼Yνk,Yνs[κ(Y,Y)].I_{k,s}^{(M)}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}[\kappa(\Pi_{M}Y,\Pi_{M}Y^{\prime})],\qquad I_{k,s}\coloneqq\mathbb{E}_{Y\sim\nu_{k},\,Y^{\prime}\sim\nu_{s}}[\kappa(Y,Y^{\prime})].

Fix a sample path satisfying 0TXi(t)𝒳2qdt<\int_{0}^{T}\|X_{i}(t)\|_{\mathcal{X}}^{2q}\,\mathrm{d}t<\infty for every i=1,,ni=1,\dots,n. For every tt, ΠMXi(t)Xi(t)\Pi_{M}X_{i}(t)\to X_{i}(t) in 𝒳\mathcal{X}. Hence, by the continuity of κ\kappa, the data–data interactions converge pointwise in tt:

κ(ΠMXi(t),ΠMXj(t))κ(Xi(t),Xj(t)).\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))\to\kappa(X_{i}(t),X_{j}(t)).

For fixed tt, the integrand in Ji,k(M)(t)J_{i,k}^{(M)}(t) satisfies κ(ΠMXi(t),ΠMY)κ(Xi(t),Y)\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)\to\kappa(X_{i}(t),Y) for every Y𝒳Y\in\mathcal{X}, and is bounded by |κ(ΠMXi(t),ΠMY)|C(1+Xi(t)𝒳q)(1+Y𝒳q)|\kappa(\Pi_{M}X_{i}(t),\Pi_{M}Y)|\leq C(1+\|X_{i}(t)\|_{\mathcal{X}}^{q})(1+\|Y\|_{\mathcal{X}}^{q}). Then dominated convergence applies because Gaussian measures on Hilbert spaces have finite qq-moments, giving Ji,k(M)(t)Ji,k(t)J_{i,k}^{(M)}(t)\to J_{i,k}(t). The exact same argument applies to the component–component term, yielding Ik,s(M)Ik,sI_{k,s}^{(M)}\to I_{k,s}.

We next justify passing the limit through the time integral using the Dominated Convergence Theorem. By the polynomial growth condition on κ\kappa and the contraction property of ΠM\Pi_{M},

|κ(ΠMXi(t),ΠMXj(t))|C(1+Xi(t)𝒳q)(1+Xj(t)𝒳q).|\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))|\leq C\bigl(1+\|X_{i}(t)\|_{\mathcal{X}}^{q}\bigr)\bigl(1+\|X_{j}(t)\|_{\mathcal{X}}^{q}\bigr).

This upper bound belongs to L1(0,T)L^{1}(0,T) by Cauchy–Schwarz and the sample path assumption. Likewise, if YkνkY_{k}\sim\nu_{k},

|Ji,k(M)(t)|C(1+Xi(t)𝒳q)𝔼[1+Yk𝒳q],|J_{i,k}^{(M)}(t)|\leq C\bigl(1+\|X_{i}(t)\|_{\mathcal{X}}^{q}\bigr)\,\mathbb{E}\bigl[1+\|Y_{k}\|_{\mathcal{X}}^{q}\bigr],

which is integrable on [0,T][0,T]. Finally, the component–component term is uniformly bounded by C𝔼[1+Yk𝒳q]𝔼[1+Ys𝒳q]C\,\mathbb{E}[1+\|Y_{k}\|_{\mathcal{X}}^{q}]\,\mathbb{E}[1+\|Y_{s}\|_{\mathcal{X}}^{q}], which is finite and independent of MM.

Since 0πk(t)πs(t)10\leq\pi_{k}(t)\pi_{s}(t)\leq 1, we can apply the Dominated Convergence Theorem to integrate term by term. Integrating the finite-dimensional MMD expansion

MMDκ2(Pn,M(t),Qθ,M(t))\displaystyle\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr) =1n2i,j=1nκ(ΠMXi(t),ΠMXj(t))\displaystyle=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\kappa(\Pi_{M}X_{i}(t),\Pi_{M}X_{j}(t))
2ni=1nk=1Kπk(t)Ji,k(M)(t)+k,s=1Kπk(t)πs(t)Ik,s(M)\displaystyle\quad-\frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\pi_{k}(t)J_{i,k}^{(M)}(t)+\sum_{k,s=1}^{K}\pi_{k}(t)\pi_{s}(t)I_{k,s}^{(M)}

over [0,T][0,T] yields convergence for fixed n,Kn,K, and θ\theta:

0TMMDκ2(Pn,M(t),Qθ,M(t))dtM0TMMDκ2(Pn(t),Qθ(t))dta.s.\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n,M}(t),Q_{\theta,M}(t)\bigr)\,\mathrm{d}t\xrightarrow{M\to\infty}\int_{0}^{T}\operatorname{MMD}_{\kappa}^{2}\bigl(P_{n}(t),Q_{\theta}(t)\bigr)\,\mathrm{d}t\qquad\text{a.s.}

Dividing by TT and using the coordinate identification κM\kappa_{M} completes the proof. ∎

H.6 Proof of Proposition C.3

Proof.

Fix p1p\geq 1, let P:[0,T]𝒫p(𝒳)P\colon[0,T]\to\mathcal{P}_{p}(\mathcal{X}) be continuous with respect to WpW_{p}, and fix ε>0\varepsilon>0. Because [0,T][0,T] is compact and PP is continuous, the trajectory image 𝒞P([0,T])\mathcal{C}\coloneqq P([0,T]) is compact in the metric space (𝒫p(𝒳),Wp)(\mathcal{P}_{p}(\mathcal{X}),W_{p}).

By Proposition˜3.1, for every μ𝒞\mu\in\mathcal{C}, there exists a finite Gaussian mixture qμq_{\mu} such that Wp(μ,qμ)<ε/2W_{p}(\mu,q_{\mu})<\varepsilon/2. The open balls UμBWp(μ,ε/2)𝒞U_{\mu}\coloneqq B_{W_{p}}(\mu,\varepsilon/2)\cap\mathcal{C} form an open cover of 𝒞\mathcal{C}. By compactness, we can extract a finite subcover centered at μ1,,μJ𝒞\mu_{1},\dots,\mu_{J}\in\mathcal{C}, with corresponding approximating mixtures q1,,qJq_{1},\dots,q_{J}.

Let {λj}j=1J\{\lambda_{j}\}_{j=1}^{J} be a continuous partition of unity on 𝒞\mathcal{C} subordinate to this finite subcover. That is, each λj:𝒞[0,1]\lambda_{j}\colon\mathcal{C}\to[0,1] is continuous, j=1Jλj(μ)=1\sum_{j=1}^{J}\lambda_{j}(\mu)=1 for all μ𝒞\mu\in\mathcal{C}, and λj(μ)=0\lambda_{j}(\mu)=0 whenever μUj\mu\notin U_{j}.

Since each qjq_{j} is a finite Gaussian mixture, we can pool all distinct Gaussian components appearing across q1,,qJq_{1},\dots,q_{J} into a single shared dictionary {ν1,,νK}\{\nu_{1},\dots,\nu_{K}\}. We can then express each local mixture over this shared basis as

qj=k=1Kaj,kνk,where aj,k0 and k=1Kaj,k=1.q_{j}=\sum_{k=1}^{K}a_{j,k}\nu_{k},\qquad\text{where }a_{j,k}\geq 0\text{ and }\sum_{k=1}^{K}a_{j,k}=1.

We now construct the temporal weights by interpolating the coefficients:

πk(t)j=1Jλj(P(t))aj,k,t[0,T],k=1,,K.\pi_{k}(t)\coloneqq\sum_{j=1}^{J}\lambda_{j}(P(t))\,a_{j,k},\qquad t\in[0,T],\quad k=1,\dots,K.

Because PP and the partition functions λj\lambda_{j} are continuous, the trajectories tπk(t)t\mapsto\pi_{k}(t) are continuous. Furthermore, πk(t)0\pi_{k}(t)\geq 0 and k=1Kπk(t)=1\sum_{k=1}^{K}\pi_{k}(t)=1. The temporal mixture is therefore given by

Q(t)k=1Kπk(t)νk=j=1Jλj(P(t))qj.Q(t)\coloneqq\sum_{k=1}^{K}\pi_{k}(t)\nu_{k}=\sum_{j=1}^{J}\lambda_{j}(P(t))\,q_{j}.

To bound the distance between P(t)P(t) and Q(t)Q(t), fix t[0,T]t\in[0,T]. We define the set of active indices at time tt as A(t){j:λj(P(t))>0}A(t)\coloneqq\{j:\lambda_{j}(P(t))>0\}. For each jA(t)j\in A(t), P(t)UjP(t)\in U_{j}, meaning Wp(P(t),μj)<ε/2W_{p}(P(t),\mu_{j})<\varepsilon/2. By the triangle inequality,

Wp(P(t),qj)Wp(P(t),μj)+Wp(μj,qj)<ε.W_{p}(P(t),q_{j})\leq W_{p}(P(t),\mu_{j})+W_{p}(\mu_{j},q_{j})<\varepsilon.

For each jA(t)j\in A(t), choose an optimal transport plan γj,tΠ(P(t),qj)\gamma_{j,t}\in\Pi(P(t),q_{j}), which satisfies xy𝒳pdγj,t(x,y)=Wpp(P(t),qj)<εp\int\|x-y\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{j,t}(x,y)=W_{p}^{p}(P(t),q_{j})<\varepsilon^{p}. We construct a mixed coupling:

γtjA(t)λj(P(t))γj,t.\gamma_{t}\coloneqq\sum_{j\in A(t)}\lambda_{j}(P(t))\,\gamma_{j,t}.

Because each γj,t\gamma_{j,t} has first marginal P(t)P(t) and second marginal qjq_{j}, the convex combination γt\gamma_{t} has first marginal P(t)P(t) and second marginal jA(t)λj(P(t))qj=Q(t)\sum_{j\in A(t)}\lambda_{j}(P(t))\,q_{j}=Q(t). Thus, γtΠ(P(t),Q(t))\gamma_{t}\in\Pi(P(t),Q(t)).

Evaluating the transport cost of this coupling yields

Wpp(P(t),Q(t))\displaystyle W_{p}^{p}(P(t),Q(t)) 𝒳×𝒳xy𝒳pdγt(x,y)\displaystyle\leq\int_{\mathcal{X}\times\mathcal{X}}\|x-y\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{t}(x,y)
=jA(t)λj(P(t))𝒳×𝒳xy𝒳pdγj,t(x,y).\displaystyle=\sum_{j\in A(t)}\lambda_{j}(P(t))\int_{\mathcal{X}\times\mathcal{X}}\|x-y\|_{\mathcal{X}}^{p}\,\mathrm{d}\gamma_{j,t}(x,y).

Since the integrals are strictly bounded by εp\varepsilon^{p} and the sum of the active weights is 1, we conclude

Wpp(P(t),Q(t))<jA(t)λj(P(t))εp=εp.W_{p}^{p}(P(t),Q(t))<\sum_{j\in A(t)}\lambda_{j}(P(t))\,\varepsilon^{p}=\varepsilon^{p}.

Hence, Wp(P(t),Q(t))<εW_{p}(P(t),Q(t))<\varepsilon for every t[0,T]t\in[0,T], establishing the supremum bound. ∎

H.7 Proof of Proposition B.1

Proof.

This proof is inspired by [70, Theorem 18]. Since AA is bounded, self-adjoint, and positive semi-definite, its square root TA1/2T\coloneqq A^{1/2} is a bounded, self-adjoint, positive semi-definite operator, and

κA(x,y)=exp(12TxTy𝒳2).\kappa_{A}(x,y)=\exp\left(-\frac{1}{2}\|Tx-Ty\|_{\mathcal{X}}^{2}\right).

By the stability of Gaussian measures under bounded linear maps, if yνk=𝒩(mk,𝒦k)y\sim\nu_{k}=\mathcal{N}(m_{k},\mathcal{K}_{k}), the transformed variable is distributed as Ty𝒩(Tmk,T𝒦kT)Ty\sim\mathcal{N}(Tm_{k},\,T\mathcal{K}_{k}T). Because 𝒦k\mathcal{K}_{k} is trace-class and TT is bounded, the operator T𝒦kTT\mathcal{K}_{k}T remains trace-class. Thus, the general case reduces to evaluating the isotropic kernel κ(x,y)=exp(12xy𝒳2)\kappa(x,y)=\exp(-\frac{1}{2}\|x-y\|_{\mathcal{X}}^{2}) on the transformed variables TxTx and TyTy.

To evaluate the isotropic case, we rewrite the data-to-model expectation by centering the integration variable. Let hXimkh\coloneqq X_{i}-m_{k} and write y=mk+ξy=m_{k}+\xi, where ξ𝒩(0,𝒦k)\xi\sim\mathcal{N}(0,\mathcal{K}_{k}). Since Xiy=hξX_{i}-y=h-\xi, the expectation becomes:

Ji,k=𝔼yνk[exp(12Xiy𝒳2)]=𝔼ξ[exp(12ξh𝒳2)].J_{i,k}=\mathbb{E}_{y\sim\nu_{k}}\left[\exp\left(-\frac{1}{2}\|X_{i}-y\|_{\mathcal{X}}^{2}\right)\right]=\mathbb{E}_{\xi}\left[\exp\left(-\frac{1}{2}\|\xi-h\|_{\mathcal{X}}^{2}\right)\right].

By the spectral theorem for compact self-adjoint operators, there exists an orthonormal eigenbasis (er)r1(e_{r})_{r\geq 1} of 𝒦k\mathcal{K}_{k} with corresponding eigenvalues (λr)r1(\lambda_{r})_{r\geq 1}. Expanding ξ\xi and hh in this basis yields

ξ=r=1λrZrer,h=r=1hrer,\xi=\sum_{r=1}^{\infty}\sqrt{\lambda_{r}}Z_{r}e_{r},\qquad h=\sum_{r=1}^{\infty}h_{r}e_{r},

where Zri.i.d.𝒩(0,1)Z_{r}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,1). For any NN\in\mathbb{N}, define ξ(N)r=1NλrZrer\xi^{(N)}\coloneqq\sum_{r=1}^{N}\sqrt{\lambda_{r}}Z_{r}e_{r} and h(N)r=1Nhrerh^{(N)}\coloneqq\sum_{r=1}^{N}h_{r}e_{r}. By Parseval’s identity, the squared norm of the difference is simply the limit of its partial sums:

ξ(N)h(N)𝒳2=r=1N(λrZrhr)2Nr=1(λrZrhr)2=ξh𝒳2a.s.\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}=\sum_{r=1}^{N}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\xrightarrow{N\to\infty}\sum_{r=1}^{\infty}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}=\|\xi-h\|_{\mathcal{X}}^{2}\quad\text{a.s.}

Since this convergence holds almost surely and 0exp(12ξ(N)h(N)𝒳2)10\leq\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\leq 1, the dominated convergence theorem allows us to compute the expectation as the limit of the finite-dimensional projections:

Ji,k=limN𝔼[exp(12ξ(N)h(N)𝒳2)]=limNr=1N𝔼[exp(12(λrZrhr)2)].\displaystyle J_{i,k}=\lim_{N\to\infty}\mathbb{E}\left[\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\right]=\lim_{N\to\infty}\prod_{r=1}^{N}\mathbb{E}\left[\exp\left(-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\right)\right].

Expanding the square inside the exponential yields 12(λrZrhr)2=λr2Zr2+hrλrZrhr22-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}=-\frac{\lambda_{r}}{2}Z_{r}^{2}+h_{r}\sqrt{\lambda_{r}}Z_{r}-\frac{h_{r}^{2}}{2}. Factoring out the constant term, we obtain:

𝔼[exp(12(λrZrhr)2)]=ehr2/2𝔼[exp(hrλrZrλr2Zr2)].\mathbb{E}\left[\exp\left(-\frac{1}{2}(\sqrt{\lambda_{r}}Z_{r}-h_{r})^{2}\right)\right]=e^{-h_{r}^{2}/2}\mathbb{E}\left[\exp\left(h_{r}\sqrt{\lambda_{r}}Z_{r}-\frac{\lambda_{r}}{2}Z_{r}^{2}\right)\right].

Using the Gaussian identity 𝔼[exp(aZ+bZ2)]=(12b)1/2exp(a22(12b))\mathbb{E}[\exp(aZ+bZ^{2})]=(1-2b)^{-1/2}\exp\bigl(\frac{a^{2}}{2(1-2b)}\bigr) for Z𝒩(0,1)Z\sim\mathcal{N}(0,1), which holds if 2b<12b<1, with a=hrλra=h_{r}\sqrt{\lambda_{r}} and b=λr/2b=-\lambda_{r}/2, each factor evaluates to:

ehr2/2(1+λr)1/2exp(hr2λr2(1+λr))=(1+λr)1/2exp(12hr21+λr).e^{-h_{r}^{2}/2}(1+\lambda_{r})^{-1/2}\exp\left(\frac{h_{r}^{2}\lambda_{r}}{2(1+\lambda_{r})}\right)=(1+\lambda_{r})^{-1/2}\exp\left(-\frac{1}{2}\frac{h_{r}^{2}}{1+\lambda_{r}}\right).

Substituting the one-dimensional evaluations back into the finite-dimensional product yields:

𝔼[exp(12ξ(N)h(N)𝒳2)]=(r=1N(1+λr)1/2)exp(12r=1Nhr21+λr).\mathbb{E}\left[\exp\left(-\frac{1}{2}\|\xi^{(N)}-h^{(N)}\|_{\mathcal{X}}^{2}\right)\right]=\left(\prod_{r=1}^{N}(1+\lambda_{r})^{-1/2}\right)\exp\left(-\frac{1}{2}\sum_{r=1}^{N}\frac{h_{r}^{2}}{1+\lambda_{r}}\right).

Passing to the limit as NN\to\infty, the infinite product converges to detF(I𝒳+𝒦k)1/2\det_{F}(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1/2}. Simultaneously, by the spectral decomposition of (I𝒳+𝒦k)1(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}, the series in the exponent evaluates to

r=1hr21+λr=h,(I𝒳+𝒦k)1h𝒳.\sum_{r=1}^{\infty}\frac{h_{r}^{2}}{1+\lambda_{r}}=\big\langle h,(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}h\big\rangle_{\mathcal{X}}.

Recalling that h=Ximkh=X_{i}-m_{k}, we conclude:

Ji,k=detF(I𝒳+𝒦k)1/2exp(12Ximk,(I𝒳+𝒦k)1(Ximk)𝒳).J_{i,k}=\det\nolimits_{F}(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1/2}\exp\left(-\frac{1}{2}\big\langle X_{i}-m_{k},(I_{\mathcal{X}}+\mathcal{K}_{k})^{-1}(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\right).

Applying this isotropic solution to the transformed variables TXiTX_{i} and Ty𝒩(Tmk,T𝒦kT)Ty\sim\mathcal{N}(Tm_{k},T\mathcal{K}_{k}T), we obtain the intermediate form:

Ji,k=detF(I𝒳+T𝒦kT)1/2exp(12T(Ximk),(I𝒳+T𝒦kT)1T(Ximk)𝒳).J_{i,k}=\det\nolimits_{F}(I_{\mathcal{X}}+T\mathcal{K}_{k}T)^{-1/2}\exp\left(-\frac{1}{2}\big\langle T(X_{i}-m_{k}),(I_{\mathcal{X}}+T\mathcal{K}_{k}T)^{-1}T(X_{i}-m_{k})\big\rangle_{\mathcal{X}}\right).

Substituting T=A1/2T=A^{1/2} recovers Equation (20).

Finally, for the cross-expectation Ik,sI_{k,s}, let yνky\sim\nu_{k} and yνsy^{\prime}\sim\nu_{s} be independent. Their difference zyyz\coloneqq y-y^{\prime} is distributed as 𝒩(mkms,𝒦k+𝒦s)\mathcal{N}(m_{k}-m_{s},\mathcal{K}_{k}+\mathcal{K}_{s}). Since κA(y,y)=κA(z,0)\kappa_{A}(y,y^{\prime})=\kappa_{A}(z,0), applying the formula for Ji,kJ_{i,k} with Xi=0X_{i}=0 immediately yields:

Ik,s=exp(12mkms,A1/2(I𝒳+A1/2(𝒦k+𝒦s)A1/2)1A1/2(mkms)𝒳)detF(I𝒳+A1/2(𝒦k+𝒦s)A1/2)1/2,I_{k,s}=\frac{\exp\Bigl(-\frac{1}{2}\big\langle m_{k}-m_{s},\,A^{1/2}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{-1}A^{1/2}(m_{k}-m_{s})\big\rangle_{\mathcal{X}}\Bigr)}{\det_{F}\bigl(I_{\mathcal{X}}+A^{1/2}(\mathcal{K}_{k}+\mathcal{K}_{s})A^{1/2}\bigr)^{1/2}},

which completes the proof.

H.8 Proof of Proposition B.2

Proof.

Write y=mk+ξy=m_{k}+\xi, where ξ𝒩(0,𝒦k)\xi\sim\mathcal{N}(0,\mathcal{K}_{k}). Then

ZXi,y𝒳+c=Xi,mk𝒳+cμi,k+Xi,ξ𝒳Z\coloneqq\langle X_{i},y\rangle_{\mathcal{X}}+c=\underbrace{\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c}_{\mu_{i,k}}+\langle X_{i},\xi\rangle_{\mathcal{X}}

is a real Gaussian random variable with mean μi,k\mu_{i,k}\in\mathbb{R} and variance

vi,k𝔼[Xi,ξ𝒳2]=Xi,𝒦kXi𝒳.v_{i,k}\coloneqq\mathbb{E}[\langle X_{i},\xi\rangle_{\mathcal{X}}^{2}]=\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}.

Thus,

Z=dμi,k+vi,kG,G𝒩(0,1),Z\stackrel{{\scriptstyle d}}{{=}}\mu_{i,k}+\sqrt{v_{i,k}}\,G,\qquad G\sim\mathcal{N}(0,1),

and therefore

Ji,k=𝔼[Zp]=r=0p(pr)μi,kprvi,kr/2𝔼[Gr].J_{i,k}=\mathbb{E}[Z^{p}]=\sum_{r=0}^{p}\binom{p}{r}\mu_{i,k}^{\,p-r}v_{i,k}^{r/2}\,\mathbb{E}[G^{r}].

Since 𝔼[Gr]=0\mathbb{E}[G^{r}]=0 for odd rr and 𝔼[G2t]=(2t)!2tt!,\mathbb{E}[G^{2t}]=\frac{(2t)!}{2^{t}t!}, we obtain

Ji,k=t=0p/2p!(p2t)! 2tt!Xi,𝒦kXi𝒳t(Xi,mk𝒳+c)p2t.J_{i,k}=\sum_{t=0}^{\lfloor p/2\rfloor}\frac{p!}{(p-2t)!\,2^{t}\,t!}\,\langle X_{i},\mathcal{K}_{k}X_{i}\rangle_{\mathcal{X}}^{\,t}\bigl(\langle X_{i},m_{k}\rangle_{\mathcal{X}}+c\bigr)^{p-2t}.

Now let yνky\sim\nu_{k} and yνsy^{\prime}\sim\nu_{s} be independent. By definition of the polynomial kernel,

Ik,s=𝔼[(y,y𝒳+c)p].I_{k,s}=\mathbb{E}\big[(\langle y,y^{\prime}\rangle_{\mathcal{X}}+c)^{p}\big].

Applying the binomial theorem to expand the power, and using the linearity of expectation,

Ik,s=𝔼[r=0p(pr)cpry,y𝒳r]=r=0p(pr)cpr𝔼[y,y𝒳r].I_{k,s}=\mathbb{E}\Bigg[\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}\Bigg]=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\mathbb{E}\big[\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}\big].

To evaluate this expectation, recall that 𝒳r\mathcal{X}^{\otimes r} denotes the rr-th tensor power of 𝒳\mathcal{X}. For any vector y𝒳y\in\mathcal{X}, its rr-fold tensor product is yryyy^{\otimes r}\coloneqq y\otimes\dots\otimes y. By the standard definition of the inner product on 𝒳r\mathcal{X}^{\otimes r}, we have the fundamental identity:

y,y𝒳r=yr,(y)r𝒳r,\langle y,y^{\prime}\rangle_{\mathcal{X}}^{r}=\langle y^{\otimes r},(y^{\prime})^{\otimes r}\rangle_{\mathcal{X}^{\otimes r}},

where we adopt the convention 𝒳0=\mathcal{X}^{\otimes 0}=\mathbb{R} and y0=1y^{\otimes 0}=1 for r=0r=0.

Since Gaussian measures on separable Hilbert spaces have finite moments of all orders, the Bochner moments

Mr,k𝔼[yr]𝒳r,Mr,s𝔼[(y)r]𝒳rM_{r,k}\coloneqq\mathbb{E}[y^{\otimes r}]\in\mathcal{X}^{\otimes r},\qquad M_{r,s}\coloneqq\mathbb{E}[(y^{\prime})^{\otimes r}]\in\mathcal{X}^{\otimes r}

are well-defined. Using independence, the expectation of the inner product factors as the inner product of the expectations:

𝔼[yr,(y)r𝒳r]=𝔼[yr],𝔼[(y)r]𝒳r=Mr,k,Mr,s𝒳r.\mathbb{E}\big[\langle y^{\otimes r},(y^{\prime})^{\otimes r}\rangle_{\mathcal{X}^{\otimes r}}\big]=\big\langle\mathbb{E}[y^{\otimes r}],\,\mathbb{E}[(y^{\prime})^{\otimes r}]\big\rangle_{\mathcal{X}^{\otimes r}}=\langle M_{r,k},M_{r,s}\rangle_{\mathcal{X}^{\otimes r}}.

Substituting this back into the binomial expansion yields

Ik,s=r=0p(pr)cprMr,k,Mr,s𝒳r,I_{k,s}=\sum_{r=0}^{p}\binom{p}{r}c^{p-r}\,\langle M_{r,k},M_{r,s}\rangle_{\mathcal{X}^{\otimes r}},

which completes the proof. ∎

H.9 Proof of Proposition D.1

Proof.

Let ρ\rho be as in the proposition and define q=dQ/dρL1(ρ)q=dQ/d\rho\in L^{1}(\rho), so that qdρ=Q(𝒳)=1\int q\,\mathrm{d}\rho=Q(\mathcal{X})=1. Since QρQ\ll\rho, the projected measure QM=QΠM1Q_{M}=Q\circ\Pi_{M}^{-1} satisfies QMρMQ_{M}\ll\rho_{M}, so qM=dQM/dρMq_{M}=dQ_{M}/d\rho_{M} is well defined. Let XX denote the identity random element on the probability space (𝒳,(𝒳),ρ)(\mathcal{X},\mathcal{B}(\mathcal{X}),\rho) and set X(M)ΠMXX^{(M)}\coloneqq\Pi_{M}X. Write Mσ(X(M))\mathcal{F}_{M}\coloneqq\sigma(X^{(M)}) and σ(M1M)\mathcal{F}_{\infty}\coloneqq\sigma(\cup_{M\geq 1}\mathcal{F}_{M}). The same projection argument used in the proof of Proposition 2.2 gives =σ(X)=(𝒳)\mathcal{F}_{\infty}=\sigma(X)=\mathcal{B}(\mathcal{X}).

Step 1 (identification as a conditional expectation). By definition, qM=dQM/dρMq_{M}=dQ_{M}/d\rho_{M} and for each B(𝒳M)B\in\mathcal{B}(\mathcal{X}_{M}),

BqM(u)dρM(u)=QM(B)=Q(ΠM1(B))=ΠM1(B)q(x)dρ(x).\int_{B}q_{M}(u)\,\mathrm{d}\rho_{M}(u)=Q_{M}(B)=Q(\Pi_{M}^{-1}(B))=\int_{\Pi_{M}^{-1}(B)}q(x)\,\mathrm{d}\rho(x).

Since ρM=ρΠM1\rho_{M}=\rho\circ\Pi_{M}^{-1}, the left-hand side equals

𝒳𝟏{X(M)B}qM(X(M))dρ.\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q_{M}(X^{(M)})\,\mathrm{d}\rho.

Combining these identities, we obtain

𝒳𝟏{X(M)B}qM(X(M))dρ=𝒳𝟏{X(M)B}q(X)dρB(𝒳M),\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q_{M}(X^{(M)})\,\mathrm{d}\rho=\int_{\mathcal{X}}\mathbf{1}_{\{X^{(M)}\in B\}}\,q(X)\,\mathrm{d}\rho\qquad\forall B\in\mathcal{B}(\mathcal{X}_{M}),

so qM(X(M))q_{M}(X^{(M)}) is a version of 𝔼ρ[q(X)M]\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}]. Equivalently, qM(ΠMx)=𝔼ρ[q(X)M](x)q_{M}(\Pi_{M}x)=\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}](x) for ρ\rho-a.e. xx.

Step 2 (martingale convergence). Since (M)M1(\mathcal{F}_{M})_{M\geq 1} is an increasing filtration and q(X)L1(ρ)q(X)\in L^{1}(\rho), the martingale convergence theorem yields

𝔼ρ[q(X)M]𝔼ρ[q(X)]=q(X)ρ-a.s. and in L1(ρ).\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{M}]\longrightarrow\mathbb{E}_{\rho}[q(X)\mid\mathcal{F}_{\infty}]=q(X)\qquad\rho\text{-a.s. and in }L^{1}(\rho).

Combining with Step 1 shows qM(ΠMx)q(x)q_{M}(\Pi_{M}x)\to q(x) for ρ\rho-a.e. xx, and in L1(ρ)L^{1}(\rho).

Step 3 (log-likelihood convergence under QQ). Since QρQ\ll\rho, the almost sure convergence in Step 2 also holds QQ-a.s. Moreover, q>0q>0 QQ-a.s. and qM>0q_{M}>0 QMQ_{M}-a.s.; hence qM(ΠMX)>0q_{M}(\Pi_{M}X)>0 for QQ-a.e. XX, and logq(X)\log q(X) and logqM(ΠMX)\log q_{M}(\Pi_{M}X) are well-defined QQ-a.s. Therefore, for i.i.d. X1,,XnQX_{1},\dots,X_{n}\sim Q,

logqM(ΠMXi)logq(Xi)a.s. for each i,\log q_{M}(\Pi_{M}X_{i})\longrightarrow\log q(X_{i})\qquad\text{a.s. for each }i,

and summing over i=1,,ni=1,\dots,n yields the claimed convergence of the log-likelihoods. ∎

Comments

· 0
Be the first to comment on this paper.