Gaussian Mixture Models in Hilbert Spaces via Kernel Methods
Abstract
Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including -functional data and random graphs in Laplacian spaces arising in modern medical applications.
1 Introduction
Technological progress across multiple domains is generating high-dimensional datasets that demand new analytical tools. Modern data structures, such as random functions, spatial fields, images, signals, graphs, or trajectories are often more naturally represented as random objects in Hilbert spaces than as finite-dimensional Euclidean vectors in . Analyzing the data directly using Hilbert space methods preserves their intrinsic geometric structure and allows their full potential to be exploited [53, 26].
In many applications, observed data are heterogeneous and may arise from latent populations, each characterized by a probability law , with . Mixture models [28] provide a flexible framework for representing such data-generating mechanisms and, under suitable regularity conditions, can approximate broad classes of probability measures. This paper proposes a mixture-based framework for probability measures on a separable Hilbert space , extending the classical idea of Gaussian mixture models (GMM) for finite-dimensional data [22].
Extending GMMs beyond finite-dimensional spaces is challenging. Infinite-dimensional Hilbert spaces do not admit a Lebesgue measure, so densities cannot be defined in the standard way, and likelihood-based methods typically require the choice of a reference measure [19, 41]. We therefore adopt a density-free perspective: we fit Gaussian mixture models by minimizing the Maximum Mean Discrepancy (MMD) [33] between the kernel mean embedding of the empirical measure and that of the target mixture model. We now formulate the problem.
Model and objective.
Let be an unknown probability measure on a separable Hilbert space . A Gaussian measure on , denoted by , is uniquely determined by a mean element and a covariance operator , which generalizes the finite-dimensional covariance matrix. To ensure non-Dirac Gaussian components with finite total variance, must be non-zero, self-adjoint, positive semi-definite, and trace-class.
For a fixed number of components , we approximate with a finite Gaussian mixture
| (1) |
parameterized by , where denotes the admissible parameter space.
We fit to by minimizing the Maximum Mean Discrepancy induced by a measurable positive-definite kernel :
| (2) |
where and are independent copies drawn from , and are independent copies drawn from , and all four random variables are mutually independent. This leads to a density-free fitting criterion that depends only on kernel expectations. When the kernel is characteristic, the MMD is a metric on probability measures. Important examples include Gaussian radial kernels in Euclidean spaces and, more generally, in suitable metric spaces of strong negative type [60]. The formal definitions and statistical background are deferred to Supplemental Material Appendix˜A.
In practice, given , we replace by the empirical measure and solve the problem
| (3) |
Motivation.
The motivation for this work is to perform clustering of dynamic, time-varying functional objects [23, 24] that take values in a possibly infinite-dimensional space . Figure 1 shows examples of contemporary data structures that arise in this setting.
Let be a Hilbert space, and let be a stochastic process valued in . We observe independent sample trajectories
and denote by the law of at time . We estimate a time-varying mixture model by minimizing a temporal finite-sample approximation of the integrated discrepancy
where is the empirical law of at time . In Section˜4.2, this methodology is applied to continuous glucose monitoring data from a diabetes clinical trial evaluating a new therapy for juvenile diabetes [67].
Notation.
Throughout, denotes a separable real Hilbert space with inner product , norm , and identity operator . We let denote the space of bounded linear operators on . For a trace-class operator with eigenvalues , its Fredholm determinant is . Let denote the set of Borel probability measures on , and for , let be the subset of measures with finite -th moments. For finite-dimensional spaces, boldface lowercase letters () denote vectors and boldface uppercase letters () denote matrices. We write for the all-ones vector and for the probability simplex.
1.1 Contributions and paper organization



The main contributions of this paper are threefold:
-
•
Closed-form MMD fitting for Hilbert-space mixtures. We derive exact closed-form expressions for for Gaussian radial basis and polynomial kernels, extending known formulas to separable Hilbert spaces [1, 13]. Moreover, we prove that, for every , finite Gaussian mixtures of the form (1) are dense in under the Wasserstein distance . This result provides theoretical support for the approximation capacity of the proposed model in infinite-dimensional applications.
-
•
Consistent finite-dimensional computation. We introduce a spectral projection framework that reduces infinite-dimensional computation to standard linear algebra in . We prove that both the projected MMD objective and the posterior probabilities converge to their exact infinite-dimensional counterparts as the number of terms in the finite-dimensional approximation satisfies ; see Propositions 2.1 and 2.2.
-
•
Real-world performance. We evaluate the empirical performance of the proposed method across several data structures, including random functions in , spatial fields, rotations in , and graph Laplacians. We compare our method against state-of-the-art approaches on established clustering benchmarks, demonstrating competitive performance; see Sections˜4 and F. To further assess its performance on dynamic functional data, we analyze a clinical trial dataset with continuous glucose monitoring data, aiming to identify subgroups of individuals with heterogeneous responses to therapy.
The supplementary material contains all proofs, together with additional analytical results.
Related work
Functional data analysis and clustering. Functional data analysis (FDA) studies samples naturally viewed as functions, curves, or surfaces, typically using basis expansions, smoothing, and functional principal component analysis [53, 26]. Our spectral projection framework is related to Karhunen–Loève expansions for Gaussian measures [53, 26, 4], but uses projection as a computational approximation to the MMD objective. Model-based functional clustering—such as Gaussian mixtures—usually assumes latent groups with group-specific low-dimensional structure [9, 17, 36, 31, 35]. Such likelihood-based approaches require a reference measure, which is delicate in infinite dimensions [8, 19, 21]. Distance-based methods [42, 27] avoid this issue, but typically lack the explicit probabilistic and temporal interpretation that a generative model provides.
Kernel mean embeddings and MMD. Kernel mean embeddings represent probability measures as RKHS elements [59, 45, 5], inducing the Maximum Mean Discrepancy (MMD) [33]. For characteristic kernels, MMD metrizes equality in law [60, 57], making it a natural density-free fitting objective. MMD fitting has been used in generative modeling [40, 12], parametric inference [16, 65], kernel-based clustering [29, 43], distributional reinforcement learning [3], and temporal distributional learning [1]. These approaches often rely on closed-form kernel mean embeddings for Gaussian distributions under Gaussian and polynomial kernels [13]; related work also studies MMD testing on function spaces [70]. We extend these formulas and fitting methods from to separable Hilbert spaces, using characteristic-kernel MMD to fit finite Gaussian mixtures with closed-form optimization, approximation guarantees, and projection-consistent computation.
Mixture models. Gaussian mixture models are a classical tool for model-based clustering and density estimation [22, 28]. Extensions beyond Gaussian components have been proposed [64, 62], and more recent work applies mixtures of Gaussian processes to functional data [66, 34, 44] and to more general infinite-dimensional settings [41]. These methods remain largely likelihood- or projected-density-based; in contrast, we estimate Hilbert-space Gaussian mixtures by minimizing MMD, giving a density-free fitting criterion that is more natural in infinite-dimensional regimes.
Digital health. Digital health generates unprecedented amounts of near-continuous data, motivating dynamic, time-varying functional models [1, 23, 24]. Mixture models for these general data structures remain underdeveloped, despite their potential in clinical trials for identifying subgroups of individuals with heterogeneous treatment responses.
2 MMD fitting
To minimize (3), we evaluate kernel expectations under . Expanding the squared MMD between and yields
| (4) |
Because the first data–data term is constant in , optimization depends entirely on the data-to-component and component-to-component expectations:
| (5) |
To make the objective a tractable differentiable function of , we consider Gaussian and polynomial kernels, for which these expectations admit closed-form expressions (Appendix˜B):
where is bounded, self-adjoint, and positive semidefinite (often isotropic, ). The finite moments condition required by polynomial kernels are automatically satisfied by both and .
2.1 Fitting and optimization
The empirical objective (4) naturally splits into two parameter groups: mixture weights and component parameters . This enables an alternating scheme that combines a convex weight update with a nonconvex update of the component means and covariances.
To make the dependence on explicit, define the averaged cross-expectation vector with entries and the Gram matrix with entries . Dropping the (constant) data–data term, minimizing (4) over for fixed components reduces to the quadratic program
| (6) |
Since is a Gram inner product of kernel mean embeddings, is positive semi-definite and the program is convex. Geometrically, the optimal projects the empirical mean embedding onto the convex hull of the component embeddings .
Since joint optimization over the full parameter remains nonconvex, we employ an alternating minimization strategy, analogous to EM [22]. In each iteration, we first solve (6) for with fixed components, and then fix to update the means and covariances via a finite-dimensional gradient step (Section˜2.2). Pseudocode and implementation details are deferred to Section˜B.2.
2.2 Numerical computation
For practical implementation, we project the Hilbert space objects of Section˜2 onto a finite-dimensional subspace. Fixing an orthonormal basis of (e.g., cosine or data-driven Karhunen–Loève [53, 4]), we apply the orthogonal projector . This identifies with , allowing each object to be represented by its coordinate vector.
For each component , let and be the coordinate representations of the projected mean and covariance , and let be the coordinates of . The pushforward is then a Gaussian on . The projected mixture and empirical measure on are
The projected cross-expectations are
| (7) |
Computationally, these expectations are evaluated exactly in using the coordinate vectors and the finite-dimensional Gaussian measures . Explicit matrix formulas for (7) under Gaussian and polynomial kernels are deferred to Section˜B.1. The finite-dimensional counterpart of (4) is
| (8) |
Disregarding the (constant) data–data term, evaluating (8) per iteration costs for both the Gaussian radial and polynomial kernels. Moreover, if we assume that the projected covariances are diagonal, the cost reduces to , making the method highly efficient and scalable to large sample sizes and projection dimensions.
Proposition 2.1 (Consistency of projected MMD).
Let be a continuous kernel on . Assume there exist constants such that
| (9) |
and that for all . Then, the projected MMD objective converges to the exact MMD:
| (10) |
Together with the known empirical stability of MMD that controls the error from replacing by (see [65, Proposition A.1]), Proposition 2.1 shows that the finite-dimensional surrogate (8) first approximates the Hilbert-space empirical criterion as , and then the population as . This justifies (8) as a computationally tractable proxy for the infinite-dimensional MMD objective.
Posterior probabilities.
The mixture admits the standard generative interpretation: draw a latent assignment with prior , then sample . The posterior component probability, or responsibility, is defined by the Radon–Nikodym derivative
| (11) |
which is well-defined because whenever . In the finite-dimensional setting, whenever the projected covariance is nonsingular, admits a Lebesgue density on and (11) reduces to the familiar Bayes formula
| (12) |
As grows, accumulates the information of , yielding the following convergence.
Proposition 2.2.
Let be as above and let . Then the projected responsibility satisfies the martingale property
| (13) |
and as both almost surely and in .
The proofs of Propositions 2.1 and 2.2 are deferred to Section˜H.1 and Section˜H.2, respectively.
3 Approximation theory
In this section we establish theoretical guarantees that justify the use of finite Gaussian mixtures in Hilbert spaces as a flexible approximation class.
Proposition 3.1 (Wasserstein and MMD density).
Let be a separable Hilbert space.
Part (i) shows that the model class is highly expressive under Wasserstein metrics. These metrics are widely used in mathematical statistics to study convergence of estimators, resampling schemes such as the bootstrap, and related distributional properties [49]. This result also implies MMD density for characteristic bounded continuous kernels. Part (ii), however, applies to arbitrary probability measures and provides an explicit finite-mixture approximation bound for the MMD discrepancy minimized by our algorithm. The proof of Proposition 3.1 is deferred to Section˜H.3.
4 Numerical experiments
The first goal of this section is to assess the empirical clustering performance of our methods against established approaches in both finite-dimensional and infinite-dimensional regimes. We then focus on a digital health case study to demonstrate the versatility of our methods for analyzing complex biomarkers that may provide new clinical insights in practice. In particular, we summarize the glucose time series using different representations, including random functions and graphs, which can be viewed as functional dynamic objects. Additional numerical comparisons are reported in Appendix˜F.
4.1 Clustering benchmark
We compare the mixture model described in Section˜2.2 against several baselines: (i) partitioning and centroid-based methods, including K-Medoids and K-Means [55, 56]; (ii) agglomerative hierarchical methods, including average-linkage, Ward, and BIRCH [46, 68, 72]; (iii) density-based methods, including DBSCAN and OPTICS [25, 2]; (iv) model-based methods, including Gaussian mixtures [28, 41]; and (v) graph- and dimension-reduction methods such as spectral clustering [47].
Datasets. We evaluate our approach using representative synthetic and real-world datasets.
-
•
Synthetic environments: (i) : five standard sklearn clustering datasets with , , and [50]; and (ii) : 10 synthetic datasets of rotations under an Euler-angle parameterization, projected onto Wigner-D matrix coefficients, with and .
- •
All implementation details and evaluation protocols are reported in Appendix˜G. In particular, clustering performance is evaluated using the Adjusted Rand Index (ARI), a well-established clustering metric that compares unsupervised cluster assignments with available ground-truth labels. We fix to the known number of ground-truth classes, as our goal here is to compare clustering accuracy.
Results. Although designed for general random objects in Hilbert spaces, our method is competitive on finite-dimensional benchmarks and shows better performance on the CGM and datasets, which fall outside the scope of most existing clustering methods. It also performs well in the infinite-dimensional setting. In general, the proposed model remains highly competitive and computationally efficient, while preserving a common probabilistic interpretation. This flexibility enables time-dependent interpretations, as illustrated in the digital health case study.
| Method | Space | Train | Infer | Memory | CGM. | Molecular | |||
| MMD GMM (Gaussian) | Hilbert | ||||||||
| MMD GMM (Polynomial) | Hilbert | ||||||||
| Projected GMM (Appendix D) | Hilbert | — | |||||||
| K-Medoids [37] | Metric | ||||||||
| Hierarchical (Avg) [27, 39] | Metric | — | |||||||
| DBSCAN [25] | Metric | — | |||||||
| HDBSCAN [14] | Metric | — | |||||||
| K-Center [32] | Metric | ||||||||
| Kernel k-Groups [29] | Metric | — | |||||||
| FDA K-Means [42] | — | — | — | ||||||
| FDA Fuzzy C-Means [6] | — | — | — | ||||||
| Funclust [35] | — | — | — | ||||||
| FunHDDC [9] | — | — | — | ||||||
| fclust [36] | — | — | — | ||||||
| K-Centres [17] | — | — | — | ||||||
| Curvclust [31] | — | — | — | ||||||
| MiniBatch KMeans [56] | — | — | — | — | |||||
| Spectral [47] | — | — | — | — | — | ||||
| Ward [68] | — | — | — | — | — | ||||
| BIRCH [72] | — | — | — | — | |||||
| Gaussian Mixture (EM) [28] | — | — | — | — | |||||
| Affinity Propagation [30] | — | — | — | — | — | ||||
| MeanShift [71] | — | — | — | — | |||||
| OPTICS [2] | — | — | — | — | — |
4.2 Case study: different representations of digital health time series
We now illustrate the proposed model in a digital health case study based on longitudinal continuous glucose monitoring (CGM) data from a juvenile clinical trial [67]. A central question in longitudinal studies is whether and how the distribution of individuals changes over time, for example, in response to a treatment [38, 18]. In the original article [67], the authors demonstrate that the new insulin pump system leads to improvements in several glycemic metrics before and after the intervention. Here, we consider these outcomes in a continuous-time formulation and use a dynamic functional representation of glucose trajectories. Digital health technologies make it possible to study this question at high temporal resolution. Starting from multi-day glucose time series, we construct three daily representations: (i) functional observations in ; (ii) correlation matrices describing the dependence between glucose values at different hours of the day; and (iii) Laplacian graphs encoding similarities in glucose dynamics across hours. Similar representations have been considered as biomarkers in clinical fMRI studies [61, 69].
To analyze temporal changes across these three CGM representations, we extend (1) by allowing the mixture weights to vary over time while sharing the component parameters, namely the mean and covariance operator parameters of the Gaussian mixture model. Our formulation is similar to [1]. However, in that work the authors propose a specific generative model to model distributions through finite-dimensional random examples over time; here, the approach is more general.
Background and data.
Continuous glucose monitors record interstitial glucose concentration every 5 minutes, producing 288 measurements per day. We represent each individual-day by its intraday glucose curve . The data come from a randomized controlled diabetes study evaluating the effect of a closed-loop insulin delivery system over a period of more than six months [63]. For each individual, a sliding window of days with stride 1 produces one observation per window. Within each window, the 288 time slots are averaged across valid days; slots with fewer than valid readings are linearly interpolated from neighboring slots. After preprocessing, the dataset comprises 98 individuals, with 34 in the control group and 64 in the treatment group.
Time-varying mixture weights.
Let be i.i.d. copies of an -valued measurable stochastic process . For each , denote the law of , and define the empirical distribution by . We fit the model
| (14) |
Conceptually, the shared components capture fundamental, stationary data archetypes, such as distinct physiological regimes or individual phenotypes, while the time-varying weight vector describes population-level shifts between these archetypes over time.
Following the kernel -groups framework [29], we select by evaluating the within-cluster MMD across candidate values and applying the elbow rule. Model fitting minimizes the integrated MMD2,
| (15) |
We parameterize , where the logit path is the solution to a neural ODE [15], initialized at . This parameterization introduces a natural bias toward smooth temporal dynamics. Further details on density and consistency results, the projected approximation of the integrated objective, and implementation specifics are deferred to Appendix˜C.
Group-level comparison.
To compare the control and treatment arms, we compute each individual’s posterior cluster membership at each time point and aggregate these memberships into group-level trajectories. Specifically, for each group , we compute
which averages posterior probabilities across all individuals in group observed at time . We then quantify the divergence between study arms using the total variation distance The metric ranges from 0 to 1, where 0 reflects identical group-level cluster usage and 1 reflects complete separation. It provides an interpretable way to quantify treatment-induced distributional shifts over time.
4.2.1 Results
Intraday curves. Each individual-window is represented as an intraday glucose curve in Sobolev space , continuous profiles are projected onto a truncated cosine basis and fitted with (15) using latent clusters. The two learned means contrast an elevated, variable profile () with a flatter, well-regulated profile (); the empirical and model-predicted mean surfaces agree closely (Figure˜3, left). While the control group remains approximately time-invariant, the treatment group gradually shifts toward , with distance increasing from at baseline to at the end of the trial. This is consistent with an improvement in glycemic regulation over the course of the clinical trial.
Correlation matrices. Instead of clustering intraday curves, we cluster the running inter-hour correlation matrices in the Hilbert space equipped with the Frobenius inner product, thereby capturing how the dependence structure between hours of the day evolves over the study. Each sliding window ( days) is converted to 24 hourly averages per day, from which the sample correlation matrix is calculated using linear shrinkage toward the identity, with intensity , and then projected onto the PSD cone. Symmetric matrices are embedded through the scaled vectorization mapping into , which preserves the Frobenius distance isometrically, and are fitted with the temporal mixture framework using .
The learned clusters in Figure˜4 identify three inter-hour dependence patterns: moderate local correlation , broadly strong correlation , and sharper early-hour dependence . Over time, membership shifts from toward the more regulated regime, consistent with improved glycemic regulation over time.
Individual similarity graph signals. We now move from hour- to individual-level structure. For each sliding window ( days), we compute -pairwise individual similarities to get a time-varying similarity matrix . A fixed individual graph is then built by thresholding the time-averaged similarity matrix at the 85th percentile, and the graph Laplacian eigenbasis (using eigenvectors) is used to project each onto graph signals. The mixture model with clusters these projected matrices over time. As the trial progresses (Figure˜5), the network structure evolves from diffuse inter-group connections toward denser within-group similarity in the treatment arm, consistent with individuals progressively achieving improved glycemic control.
5 Final remarks
The main contribution of this paper is a Gaussian mixture model for random objects in separable Hilbert spaces, estimated via an MMD-based optimization procedure. A key practical and methodological advantage of the proposed approach is that it bypasses the need for densities or dominating measures, which are problematic in infinite-dimensional settings. In addition, the framework flexibly handles diverse data structures, is efficient, and demonstrates competitive empirical performance.
The main limitation of the method is its scalability in large-scale biomedical studies. This could be mitigated through subsampling strategies [10] or other computational approximation techniques. Future work will focus on developing specialized temporal algorithms for particular Hilbert spaces, extending the framework to general metric spaces, and quantifying the uncertainty of the parameter and prediction of the mixture model.
References
- [1] (2026) Continuous-time learning of probability distributions: a case study in a digital trial of young children with type 1 diabetes. External Links: 2603.24427 Cited by: §F.2, Table 2, 1st item, §1, §1, §4.2.
- [2] (1999) OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60. Cited by: §4.1, Table 1.
- [3] (2025) An analysis of distributional reinforcement learning with Gaussian mixtures. Cited by: §1.
- [4] (2019) Karhunen–Loève decomposition of Gaussian measures on Banach spaces. 39 (2), pp. 279–297. Cited by: Appendix E, Appendix E, Corollary E.1, §1, §2.2.
- [5] (2004) Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer US. Cited by: §1.
- [6] (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. Springer US. Cited by: Table 1.
- [7] (2006) Pattern recognition and machine learning. Information Science and Statistics, Springer. Cited by: §F.1.
- [8] (1998) Gaussian Measures. Mathematical Surveys and Monographs, Vol. 62, American Mathematical Society. Cited by: §1.
- [9] (2011) Model-based clustering of time series in group-specific functional subspaces. 5 (4), pp. 281–300. Cited by: §1, Table 1.
- [10] (2021) An approach to incorporate subsampling into a generic bayesian hierarchical model. Journal of Computational and Graphical Statistics 30 (4), pp. 889–905. Cited by: §5.
- [11] (2017) Classification And Regression Trees. Routledge. Cited by: Table 2, 2nd item.
- [12] (2019)Statistical inference for generative models with maximum mean discrepancy(Website) Cited by: §1.
- [13] (2025)A Dictionary of Closed-Form Kernel Mean Embeddings(Website) Cited by: 1st item, §1.
- [14] (2015) Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. 10 (1), pp. 1–51. Cited by: Appendix F, Table 1.
- [15] (2019)Neural Ordinary Differential Equations(Website) Cited by: §4.2.
- [16] (2022) Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. 28 (1), pp. 181–213. Cited by: §1.
- [17] (2007) Functional Clustering and Identifying Substructures of Longitudinal Data. 69 (4), pp. 679–699. Cited by: §1, Table 1.
- [18] (2009) Statistical tools to analyze continuous glucose monitor data. 11, pp. S–45. Cited by: §4.2.
- [19] (2014) Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and Its Applications, Cambridge University Press. Cited by: §1, §1.
- [20] (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. 34 (2), pp. 786–797. Cited by: Table 2, 2nd item.
- [21] (2010) Defining probability density for a distribution of random functions. 38 (2). Cited by: §1.
- [22] (1977) Maximum Likelihood from Incomplete Data Via the \mkbibemphEM Algorithm. 39 (1), pp. 1–22. Cited by: Appendix D, §1, §1, §2.1.
- [23] (2020) Functional models for time-varying random objects. 82 (2), pp. 275–327. Cited by: §1, §1.
- [24] (2022) Modeling time-varying random objects and dynamic networks. 117 (540), pp. 2252–2267. Cited by: §1, §1.
- [25] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 226–231. Cited by: Appendix F, §4.1, Table 1.
- [26] (2006) Nonparametric Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, 2nd item.
- [27] (2009) A Comparison of Hierarchical Methods for Clustering Functional Data. 38 (9), pp. 1925–1949. Cited by: §1, Table 1.
- [28] (2002) Model-Based Clustering, Discriminant Analysis, and Density Estimation. 97 (458), pp. 611–631. Cited by: §1, §1, §4.1, Table 1.
- [29] (2017)Kernel k-Groups via Hartigan’s Method(Website) arXiv.org. Cited by: Appendix C, §1, §4.2, Table 1.
- [30] (2007) Clustering by Passing Messages Between Data Points. 315 (5814), pp. 972–976. Cited by: Table 1.
- [31] (2013) Wavelet‐Based Clustering for Mixed‐Effects Functional Models in High Dimension. 69 (1), pp. 31–40. Cited by: §1, Table 1.
- [32] (1985) Clustering to minimize the maximum intercluster distance. 38, pp. 293–306. Cited by: Appendix F, Table 1.
- [33] (2012) A Kernel Two-Sample Test. 13 (25), pp. 723–773. Cited by: §1, §1.
- [34] (2014) Estimating mixture of gaussian processes by kernel smoothing. 32 (2), pp. 259–270. Cited by: §1.
- [35] (2013) Funclust: A curves clustering method using functional random variables density approximation. 112, pp. 164–171. Cited by: §1, Table 1.
- [36] (2003) Clustering for Sparsely Sampled Functional Data. 98 (462), pp. 397–408. Cited by: §1, Table 1.
- [37] (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, Wiley. Cited by: Table 1.
- [38] (2010) A review of standards and statistics used to describe blood glucose monitor performance. 4 (1), pp. 75–83. Cited by: §4.2.
- [39] (1967) A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. 9 (4), pp. 373–380. Cited by: Table 1.
- [40] (2015) Generative Moment Matching Networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1718–1727. Cited by: §1.
- [41] (2026)Sequential Monte Carlo with Gaussian Mixture Approximation for Infinite-Dimensional Statistical Inverse Problems(Website) Cited by: Appendix D, Appendix F, §1, §1, §4.1.
- [42] (2015) K-means algorithms for functional data. 151, pp. 231–245. Cited by: §1, Table 1.
- [43] (2025) Kernel biclustering algorithm in Hilbert spaces. Cited by: §1.
- [44] (2018) Clustering gene expression time series data using an infinite gaussian process mixture model. 14 (1), pp. e1005896. Cited by: §1.
- [45] (2017) Kernel Mean Embedding of Distributions: A Review and Beyond. 10 (1–2), pp. 1–141. Cited by: §1.
- [46] (2012) Algorithms for hierarchical clustering: an overview. 2 (1), pp. 86–97. Cited by: Appendix F, §4.1.
- [47] (2001) On spectral clustering: analysis and an algorithm. In Proceedings of the 15th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pp. 849–856. Cited by: §4.1, Table 1.
- [48] (2001) Generalized feature extraction for structural pattern recognition in time-series data. Carnegie Mellon University. Cited by: Table 2, 2nd item.
- [49] (2019) Statistical aspects of wasserstein distances. Annual Review of Statistics and Its Application 6, pp. 405–431. Cited by: §3.
- [50] (2011) Scikit-learn: Machine learning in python. 12, pp. 2825–2830. Cited by: Table 2, 1st item.
- [51] (2021) Categorical functional data analysis. the cfda r package. 9 (23), pp. 3074. Cited by: Table 2, 2nd item.
- [52] (2024) Scikit-Fda : A \mkbibemphPython Package for Functional Data Analysis. 109 (2). Cited by: Appendix F.
- [53] (2005) Functional Data Analysis. Springer Series in Statistics, Springer New York. Cited by: Table 2, §1, §1, §2.2, 2nd item.
- [54] (2008) Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press. Cited by: §F.2.
- [55] (2021) Fast and eager k -medoids clustering: O ( k ) runtime improvement of the PAM, CLARA, and CLARANS algorithms. 101, pp. 101804. Cited by: Appendix F, §4.1.
- [56] (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Cited by: §4.1, Table 1.
- [57] (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. 41 (5). Cited by: §1.
- [58] (2016)NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis(Website) Cited by: §F.8.
- [59] (2007) A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory, M. Hutter, R. A. Servedio, and E. Takimoto (Eds.), pp. 13–31. Cited by: §1.
- [60] (2011) Universality, Characteristic Kernels and RKHS Embedding of Measures. 12 (70), pp. 2389–2410. Cited by: §1, §1.
- [61] (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer’s Disease. 4 (6), pp. e1000100. Cited by: §4.2.
- [62] (2023) Wasserstein Distributional Learning via Majorization-Minimization. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 10703–10731. Cited by: §1.
- [63] (2018) Closed-loop insulin delivery in suboptimally controlled type 1 diabetes: a multicentre, 12-week randomised trial. 392 (10155), pp. 1321–1329. Cited by: Table 2, §4.2.
- [64] (1985) Statistical analysis of finite mixture distributions. Wiley. Cited by: §1.
- [65] (2017) Minimax Estimation of Kernel Mean Embeddings. 18 (86), pp. 1–47. Cited by: §1, §2.2.
- [66] (2000) Mixtures of Gaussian Processes. In Advances in Neural Information Processing Systems, Vol. 13. Cited by: §F.2, §1.
- [67] (2023) Trial of hybrid closed-loop control in young children with type 1 diabetes. New England Journal of Medicine 388 (11), pp. 991–1001. Cited by: §1, §4.2.
- [68] (1963) Hierarchical Grouping to Optimize an Objective Function. 58 (301), pp. 236–244. Cited by: §4.1, Table 1.
- [69] (2015) Parkinson’s Disease-Related Spatial Covariance Pattern identified with Resting-State Functional MRI. 35 (11), pp. 1764–1770. Cited by: §4.2.
- [70] (2022) A kernel two-sample test for functional data. 23 (73), pp. 1–51. Cited by: §H.7, §1.
- [71] (1995) Mean shift, mode seeking, and clustering. 17 (8), pp. 790–799. Cited by: Table 1.
- [72] (1996) BIRCH: an efficient data clustering method for very large databases. 25 (2), pp. 103–114. Cited by: §4.1, Table 1.
Appendix A Background
This section establishes the notation and foundational properties of Gaussian measures and kernel methods used throughout this work. Appendix E discusses the extension of this framework to Banach spaces.
A.1 Gaussian measures on separable Hilbert spaces
Let be a separable real Hilbert space equipped with its Borel -algebra . A Borel probability measure on is said to be Gaussian if, for every continuous linear functional , the pushforward measure is a Gaussian measure on . By the Riesz representation theorem, we routinely identify with .
A Gaussian measure is uniquely determined by its mean and its covariance operator , which are defined via the relations
and
We denote such a measure by . The covariance operator is necessarily self-adjoint, positive semidefinite, and trace-class. The latter property implies that for any orthonormal basis of , the trace is finite:
The case corresponds to the Dirac measure ; this case is excluded from the model class in the main text. In particular, if , its moments satisfy and .
A.2 Kernel mean embeddings and Maximum Mean Discrepancy
Let be a Borel measurable, positive-definite kernel with an associated reproducing kernel Hilbert space (RKHS) . For probability measures , we assume the following integrability condition holds:
| (16) |
Under this assumption, the kernel mean embeddings
are well-defined as Bochner integrals in . The squared Maximum Mean Discrepancy (MMD) between and is defined as the distance between their embeddings:
| (17) |
where are i.i.d. samples from and are i.i.d. samples from . Note that if is bounded, condition (16) is satisfied for all probability measures.
A kernel is called characteristic if the embedding map is injective. In this case, if and only if , meaning the MMD induces a true metric on the space of probability measures.
A.3 Fourier interpretation of the Gaussian MMD
The Gaussian radial kernel, given by
| (18) |
is both bounded and characteristic on separable Hilbert spaces.
To gain intuition into the geometry induced by this kernel, we briefly consider the finite-dimensional case . For , let and denote their respective characteristic functions, .
By Bochner’s theorem, the shift-invariant kernel (18) admits a Fourier representation with a Gaussian spectral density :
By defining the signed measure and applying Fubini’s theorem to the expanded MMD objective (17), we obtain:
This formulation reveals a frequency-domain perspective: the squared Gaussian MMD is a weighted distance between the characteristic functions of the two distributions. The Gaussian weight acts as a low-pass filter, effectively penalizing discrepancies in low-frequency modes (global structural differences) while naturally smoothing out high-frequency noise.
Appendix B Closed-form MMD expressions
To evaluate the cross-expectations and of (5) in the infinite-dimensional setting, we rely on the Fredholm determinant. Recall that for any trace-class operator , the determinant is well-defined and serves as the natural generalization of the standard matrix determinant.
Proposition B.1 (MMD for the Gaussian kernel).
Let be a self-adjoint, positive semi-definite operator, and consider the Gaussian kernel induced by ,
| (19) |
Then the expectations in (5) are given by
| (20) | ||||
| (21) |
Proposition B.2 (MMD for the polynomial kernel).
The formal proofs of Propositions B.1 and B.2 are given in Sections˜H.7 and H.8.
B.1 Numerical computations and algorithm description
Given an orthonormal basis of , we represent observations and means by their coordinate vectors
| (23) |
and the covariance operator by its matrix representation with entries
| (24) |
In these coordinates, the pushforward measure becomes the multivariate normal on . Using these coordinate representations, we now provide explicit formulas for the finite-dimensional approximations.
Projected responsibilities.
The pushforward is a multivariate normal distribution on ,
If each is positive definite, then each admits the Lebesgue density
In this case the projected responsibilities satisfy, for and ,
If some are singular, the corresponding are supported on affine subspaces; the Radon–Nikodym definition of still applies and one may compute densities on the supports or use a small ridge regularization.
1. Gaussian radial kernel (Proposition B.1) Let denote the matrix representation of the restricted operator in the basis , with entries . The closed forms of Proposition B.1 become the finite-dimensional approximations
and we set
For the isotropic case , one writes and recovers the equivalent forms and in the exponents.
For numerically stable log-domain evaluation, compute the symmetric eigendecompositions
with orthogonal matrices and . Note that . Writing and , we obtain the stable log-domain expressions
When , the matrices coincide with the eigenvectors of and the expressions reduce to those with . To reduce cost, one may truncate these sums to (e.g. the leading eigenvalues), i.e. replace the relevant matrices by their rank- spectral approximations.
2. Polynomial kernel (Proposition B.2) Fix an integer and and consider . In the projected space, this becomes on .
Computing .
For , define the scalars
so that is a one-dimensional Gaussian with mean and variance . Then Proposition B.2 yields the efficient closed form
Computing .
Let and be independent in , and set . Then
Equivalently, by Proposition B.2,
where
For small degrees, these moments admit simple closed forms in terms of . In particular,
For larger , one may compute higher-order moments via Wick/Isserlis expansions.
B.2 Algorithm descriptions
We present pseudocode for the fitting procedure described in Section˜2.1. The algorithm operates in the projected coordinate space using the finite-dimensional quantities from Section˜B.1, which we denote for brevity.
Algorithm˜1 alternates between the analytical weight update (6) and a gradient step on the component parameters , with covariances parameterized through their Cholesky factors to enforce positive definiteness. The exact gradients are obtained from the differentiable closed forms of Propositions B.1 and B.2. A fully gradient-based variant—updating jointly with the components, either via projection onto the simplex or through logits with —is also possible.
Appendix C Time-varying mixtures
This appendix details the algorithmic implementation and theoretical results deferred from Section˜4.2. We outline the projected fitting procedure, and subsequently establish three guarantees: the consistency of the integrated empirical objective, the consistency of its projected approximation, and the density of time-varying Gaussian mixtures with shared components.
Algorithm˜2 implements the optimization of the time-varying objective (15) in . The component parameters are shared across all time slices, while the temporal variation is entirely captured by the weights , whose logits are governed by the neural ODE introduced in Section˜4.2. Implementation details are summarized in Table˜3.
To evaluate the MMD efficiently within the algorithm, we rely on the finite-dimensional projections defined in (7). We separate the evaluation into terms that can be computed globally versus those computed per time slice. The matrix depends only on the shared components and is computed once per epoch. In contrast, the vectors evaluate the interaction between the shared components and the data specific to time slice , requiring per-slice computation.
Motivated by the within-cluster RKHS dispersion criterion underlying kernel -groups [29], we select by applying an elbow rule to the integrated MMD loss over candidate temporal mixtures.
Consistency guarantees
In parallel with the static framework of Section˜3, we establish the theoretical foundations for the temporal mixture model. Unlike the static case, where sample consistency is standard, minimizing the integrated temporal objective requires controlling the time-varying empirical processes. We first establish this sample consistency, followed by the validity of the projected approximation and the expressiveness of the temporal model class.
Proposition C.1 (Sample consistency of the integrated MMD).
Let be a continuous positive-definite kernel on the separable Hilbert space . Let be i.i.d. copies of a jointly measurable -valued stochastic process with marginal laws , and let be the corresponding empirical measure. Let be a family of probability measures on such that its mean embedding trajectory is strongly measurable and belongs to . If , we have
In particular, almost surely.
Proposition C.2 (Projection consistency for temporal mixtures).
Let be a time-dependent Gaussian mixture where and is measurable for each . Let and be the projected measures on defined in Section˜2.2, and let denote the coordinate version of the projected kernel. Suppose is continuous and satisfies the polynomial growth condition (9) for some . If the sample paths satisfy almost surely for , then
Furthermore, for each fixed , if , the projected responsibilities satisfy the martingale convergence property of Proposition˜2.2 with weights .
Proposition C.3 (Density of time-varying Gaussian mixtures).
Fix and let be continuous with respect to the Wasserstein distance . Then, for every , there exist , shared Gaussian measures on with nonzero covariance operator, and continuous functions with , such that
Appendix D Projected maximum likelihood estimation
Although this paper focuses on MMD-based fitting, we briefly describe the projected maximum-likelihood baseline used in Section˜4; it is the only density-based competitor that is well defined in the Hilbert-space setting of (1). As discussed in the introduction, likelihood-based fitting in infinite dimensions requires a common dominating measure. Here we use the Karhunen–Loève projector associated with the Gaussian prior
on . Let be the eigenpairs of ordered so that , and fix with . Define
Under the decomposition , the prior factorizes as
where is the law of for . Following [41], we construct each component by altering only the first coordinates. Given projected parameters and with , let
and lift it to
Equivalently, , where
and agrees with on while its restriction to has matrix in the basis . Since and are nondegenerate Gaussians on , they are equivalent, and therefore . Hence every mixture
satisfies ; indeed because its density is strictly positive -a.s.
For , write
and let
denote the usual Gaussian density on . Then
and therefore
Accordingly,
Since the factor is common to every component, maximizing the -log-likelihood is equivalent to maximizing the usual projected mixture log-likelihood
| (25) |
because
and the second term is independent of the mixture parameters.
The EM algorithm [22] is therefore carried out in the projected coordinates . In the E-step,
For the unregularized full-covariance model, the M-step updates are
and
In practice, one stabilizes the covariance update by replacing with for some . This is a regularized EM step rather than the exact M-step of the unpenalized likelihood, but it guarantees and therefore preserves equivalence with the reference measure after lifting. In the fixed-covariance benchmark variant used in our experiments (see Section˜4), the same E-step is used, while only the weights and means are updated and the covariance matrix is held fixed. After each M-step, the updated parameters are lifted back to by setting
and defining to agree with on and to have matrix on . Algorithm 3 summarizes this projected EM routine, with optional ridge regularization controlled by .
The next result shows that the projected likelihood ratios are consistent as .
Proposition D.1 (Consistency of projected likelihood ratios).
For each , set , , and define . Let be an -valued random variable with distribution and denote
Then a.s. and a.s. and in . Consequently, for i.i.d. ,
The proof is given in Section˜H.9.
Appendix E Extension to Banach spaces
This appendix outlines how our framework extends to real separable Banach spaces . Let be the continuous dual and the dual pairing. A Borel probability measure on is Gaussian if is Gaussian for all . If is centered, its characteristic functional is
where the covariance form is given by
By Fernique’s theorem, has a well-defined covariance operator given by . The associated Cameron–Martin space is the Hilbert completion of under the inner product .
Karhunen–Loève decomposition.
Assume that is infinite-dimensional. Following Bay and Croix [4], one constructs vectors , dual functionals , and normalized coordinates
from successive maximizers of the Rayleigh quotient associated with the covariance operators of the residual Gaussian measures; see [4, Theorem 3.3]. The resulting coordinates satisfy , and the random variables are independent under .
Corollary E.1 (Banach-space Karhunen–Loève decomposition [4, Corollary 3.8]).
Let be a centered Gaussian measure on a real separable Banach space , assume is infinite-dimensional, and let be constructed as above. Then
converges in for -a.e. , and the coordinate maps are independent under . Equivalently, if are i.i.d. , then
converges almost surely in and has law .
In the Hilbert-space setting of the main text, we identify via the Riesz representation theorem. Then becomes the usual covariance operator, and the above construction reduces to the classical spectral Karhunen–Loève expansion.
Applicability.
The measure-theoretic components of our framework—including Gaussian mixtures, Radon–Nikodym responsibilities, and MMD weak density arguments—extend naturally to separable Banach spaces. While orthogonal projections are unavailable in this broader setting, one can substitute finite-rank continuous projections , such as the Bay–Croix projections from Corollary˜E.1. However, we do not pursue this extension further, as the closed-form MMD expressions and projection algorithms developed in the main text rely intrinsically on the inner product structure of Hilbert spaces.
Appendix F Synthetic experiments
We compare against four baseline families: (i) metric-space methods [55, 46, 25, 14, 32]; (ii) FDA-specific methods [52]; (iii) classic Euclidean methods restricted to ; and (iv) a projected maximum-likelihood baseline inspired by [41] (see Appendix˜D). We report Adjusted Rand Index (ARI), train with Adam () and -means initialization, and run all scripts on CPU in minutes.
Model class. Across all domains, we fit the finite-dimensional mixture of Section˜2.2 after fixing an orthonormal basis of . We use cosine bases for , Wigner-D matrix coefficients for , Laplacian eigenvectors for graph signals, and the canonical basis for .
We evaluate the MMD-based Gaussian mixture fitting method on several Hilbert-space representations of increasing complexity. The first five experiments are controlled mixture-recovery tasks with known ground-truth parameters. The sixth experiment tests system identification for a Gaussian process induced by a linear SDE. The last two experiments apply the same fitting procedure to real molecular and skeleton-sequence data. In all cases, observations are represented in a finite-dimensional basis as in Section˜2.2, and the model is fitted by minimizing the squared MMD objective with a Gaussian radial kernel. Final MMD values are reported as within-experiment diagnostics and are not meant to be compared directly across domains, since the coefficient dimension, scaling, and kernel bandwidth differ across settings.
F.1 Gaussian random variables in
As a sanity check, we first consider the finite-dimensional case with the standard inner product, where the model reduces to a classical Gaussian mixture model [7].
Experiment.
We generate samples from a Gaussian mixture on with weights , means , and standard deviations . We then fit a mixture by minimizing . Figure˜6 shows that the learned density closely matches the empirical histogram and the ground-truth density.
F.2 Functional data in
We next consider functional data in . Gaussian random elements in this space can be viewed as Gaussian processes with square-integrable sample paths, so the mixture model specializes to a mixture of Gaussian processes [54, 66, 1].
Experiment.
We generate multivariate functional data in from Gaussian components with weights . Each component has a distinct mean function and diagonal covariance profile in coefficient space. We sample trajectories and project them onto a cosine basis with functions per spatial dimension, yielding coefficient vectors in . Figure˜7 shows raw trajectories, recovered mean functions, recovered weights, and the training curve.
F.3 Functional data in
The same construction extends to functions on two-dimensional domains by using tensor-product bases. Given an orthonormal basis of , the functions
form an orthonormal basis of . This setting covers image-like signals, spatial random fields, and spatio-temporal processes.
Experiment.
We consider with a tensor-product cosine basis, using and hence coefficients. We generate samples from a Gaussian mixture with weights , where each component has a distinct 2D mean surface. Figure˜8 compares the true and predicted mean surfaces and reports the training loss and recovered weights.
F.4 rotation data
We also consider non-Euclidean domains equipped with an structure. For a compact group with Haar probability measure , the space is a Hilbert space with inner product
For , the Peter–Weyl theorem provides an orthonormal basis in terms of Wigner D-matrices.
Experiment.
We consider rotation data on , relevant for orientation estimation in robotics and molecular modeling. We generate rotations from concentrated components and encode them using real Wigner-basis features up to degree , giving coefficients. Figure˜9 displays each rotation through the image of the reference direction on , together with the true and learned component directions, the training loss, and the recovered weights.
F.5 Graph signals
We next test the method on graph-structured data, where the Hilbert-space geometry is induced by the graph Laplacian. Let be a finite weighted graph with Laplacian . For , we equip with the inner product
This geometry gives larger norm to high-frequency graph components, while the finite-dimensional projection below restricts the representation to the leading Laplacian modes.
Experiment.
We generate graph signals on a random Erdős–Rényi graph with nodes and edge probability . The signals are projected onto the first Laplacian eigenvectors, using regularization . The mixture has components with uniform weights. Figure˜10 shows the true and predicted mean graph signals, together with the training loss and recovered weights.
F.6 Linear SDE: system identification
As a parametric Gaussian-process example, we consider the linear stochastic differential equation
For Gaussian initial condition and deterministic input , the solution is a Gaussian process. Its mean and covariance satisfy
The resulting projected mean and covariance can therefore be computed analytically in any chosen basis.
Experiment.
We treat as learnable parameters and identify them from sampled trajectories of a -dimensional linear SDE over , driven by the input . The trajectories are projected onto an cosine basis with functions per state dimension, giving coefficients. We match the empirical coefficient distribution to the single Gaussian induced by the linear SDE. Figure˜11 shows the sample paths, training loss, recovered mean function, and diagonal of the projected covariance.
F.7 3D molecular data (QM9)
We also test the method on real molecular structures using permutation-invariant graph fingerprints derived from 3D atomic coordinates.
Experiment.
We sample molecules from QM9, build WL-hash fingerprints of dimension from distance-threshold molecular graphs, and project them onto the first DCT coefficients. We fit a Gaussian mixture with diagonal covariance. Figure˜12 shows the three molecules closest to each learned component mean in coefficient space. The HOMO–LUMO gap and assignment responsibility are reported for each representative molecule.
F.8 NTU RGB+D skeleton sequences
Finally, we apply the method to human motion data, treating skeleton sequences as -valued paths corresponding to joints with coordinates each.
Experiment.
We load action sequences from the NTU RGB+D dataset [58], resample each sequence to frames, subtract the mean trajectory, and project onto an cosine basis with . This gives coefficient vectors of dimension . We fit a Gaussian mixture with diagonal covariance by minimizing . Figure˜13 shows the most representative skeleton sequences per component, selected as those closest to the learned component mean in coefficient space.
F.9 Summary
Table˜3 summarizes the experimental settings and final values. The controlled experiments show that the method recovers the main mixture structure across Euclidean, functional, group-valued, and graph-based representations. The real-data experiments illustrate that the same coefficient-space fitting procedure also yields coherent representative components on molecular and skeleton-sequence data.
Appendix G Experimental details and compute resources
We record the implementation details used for the experiments in Sections˜4 and F. The complete codebase, along with instructions to reproduce the synthetic and real-world experiments, can be found in the repository https://github.com/dani2442/rkhs-mixture-models/.
Unless stated otherwise, all models are trained on projected coefficient vectors in , use torch.float64, use fixed random seeds, run on CPU, and initialize mixture means with -means++.
G.1 Hilbert spaces and bases
Each experiment fits the projected mixture model of Section˜2.2 after fixing a separable Hilbert space and an orthonormal basis , which determines the truncated subspace . We define here the Hilbert spaces and bases used in Tables˜2 and 3, so that the table entries refer to this section instead of repeating the definitions.
(B1) Euclidean canonical basis.
For we use the canonical basis , so .
(B2) cosine basis.
For with , we use the orthonormal cosine basis
truncated at , so . For vector-valued data , we tensor with the canonical basis of , so .
(B3) Tensor-product cosine basis on .
For we use with , , so .
(B4) cosine basis.
For with , we use the -rescaled cosines
so ; the are orthonormal in .
(B5) Real Wigner–D basis on .
For under the Haar inner product, the Peter–Weyl theorem yields an orthonormal basis given by the (real) Wigner D-matrix entries,
Truncating at gives ; for , .
(B6) Graph Laplacian eigenbasis.
For a finite weighted graph with Laplacian and regularization , we equip with . Diagonalizing with eigenpairs , the rescaled eigenvectors form an orthonormal basis of . We keep the eigenvectors associated with the smallest eigenvalues.
(B7) Frobenius basis on .
For with , we use the canonical symmetric basis (vech embedding), so .
(B8) Weisfeiler–Lehman (WL) with kernel PCA / DCT.
For molecular graphs (MUTAG, QM9) we compute WL hash fingerprints in , then project onto the leading kernel-PCA directions (MUTAG) or DCT coefficients (QM9) to obtain coefficient vectors in . The basis is fixed in coefficient space and inherited from the projection.
G.2 Experimental setting
The clustering experiments are unsupervised: all methods are fit without access to labels, and labels are used only after fitting to compute Adjusted Rand Index (ARI). There are no train/test splits because the reported task is clustering or mixture recovery on a fixed sample; variability is instead measured across independently generated datasets, benchmark datasets, random restarts, or subsampled runs, as specified in Table˜2.
| Experiment | Data and Runs | Representation and MMD-GMM training |
| toy data | Five labeled sklearn datasets [50], , , ; seed , MMD seed . | Representation: canonical basis (B1) on standardized coordinates, . Training: epochs, learning rate , full covariance; Gaussian bandwidth by times the median pairwise-distance heuristic. |
| functional data | Growth [53], Waveform [11], Phoneme [26], Kneading [51], and ECG200 [48]; ARI averaged over the five datasets; seed . | Representation: cosine basis (B2), ( for Waveform), coefficient normalization per dataset. Training: epochs, learning rate , diagonal covariance; Gaussian bandwidth by median heuristic. |
| Glucodensity clustering | PEDAP CGM cohort [63, 1], patients, ; MMD runs; seed . | Representation: cosine basis (B4) with on overlapping -day sliding windows (stride day, time bins); patient-level features for our method. Training: epochs, learning rate , diagonal covariance; cosine annealing to ; Gaussian bandwidth by median heuristic. |
| rotations | generated datasets, , , noise concentration ; seeds . | Representation: real Wigner–D basis (B5) with , . Training: epochs, learning rate , diagonal covariance; Gaussian bandwidth by median heuristic. |
| MUTAG molecular graphs | MUTAG [20], , , runs; seed . | Representation: WL fingerprints with kernel PCA (B8), fingerprint dimension , radius , reduced to . Training: epochs, learning rate , runs; Gaussian bandwidth selected from plus the median heuristic by a -epoch MMD proxy sweep. |
The standalone mixture-recovery figures and case-study panels in Sections˜4, 4.2 and F that are not part of the unified ARI benchmark follow the configurations summarized in Table˜3.
| Experiment | Basis | Epochs | Kernel/optimizer details | |||
| D Gaussian mixture | (B1) | Gaussian , Adam learning rate . | ||||
| recovery | (B2) | Cosine basis with per coordinate; Gaussian , Adam learning rate . | ||||
| recovery | (B3) | Tensor-product cosine ; Gaussian , Adam learning rate . | ||||
| recovery | (B5) | Real Wigner–D, ; Gaussian , Adam learning rate . | ||||
| Graph-signal recovery | (B6) | Erdős–Rényi graph, ; Gaussian , Adam learning rate . | ||||
| Linear SDE identification | (B2) | paths | Cosine basis with per state dim.; Gaussian bandwidth , Adam learning rate . | |||
| QM9 molecules | (B8) | WL fingerprints + first DCT coefficients; Gaussian , Adam learning rate . | ||||
| NTU representatives | (B2) | Per-joint cosine basis on -valued paths; Gaussian , Adam learning rate . | ||||
| CGM temporal mixture | (B4) | patients | Overlapping -day sliding windows with stride day, time bins, cosine basis with ; 2-layer neural ODE weights, hidden width , RK4, Gaussian , Adam learning rate . | |||
| CGM correlation mixture | (B7) | windows | Overlapping -day sliding windows with stride day; each window gives a inter-hour correlation matrix with linear shrinkage toward the identity (intensity ), vech embedding into ; 2-layer neural ODE logits, hidden width , RK4, Adam learning rate . | |||
| CGM patient graph mixture | (B6) | patients | Overlapping -day sliding windows with stride day, time bins; pairwise similarities use cosine modes; graph Laplacian basis rank ; 2-layer neural ODE logits, hidden width , RK4, Adam learning rate . |
G.3 Statistical reporting
All entries in Table˜1 are reported as mean ARI with one standard deviation in subscript, computed directly from the list of ARI scores saved by each benchmark script. The standard deviation is a descriptive measure of run-to-run or dataset-to-dataset variability, not a standard error and not a formal confidence interval. The variability source is the labelled dataset collection for the and columns, independent generated datasets for , and multiple runs for glucodensity and MUTAG. Deterministic single-dataset baselines have zero reported standard deviation when only one fitted value is produced.
G.4 Compute resources
All reported experiments were run on CPU only; no GPU was used or required. The local machine used for reproduction is an Ubuntu 24.04.4 LTS workstation equipped with two Intel Xeon E5-2650 v2 CPUs at ( physical cores in total), GiB of RAM, and a network-mounted filesystem. The software environment relies on Python 3.11.5 with the following core libraries: PyTorch 1.13.1, NumPy 1.26.4, SciPy 1.12.0, scikit-learn 1.5.2, pandas 1.5.3, scikit-fda 0.10.1, torch-geometric 2.7.0, aeon 0.11.1, rdata 1.0.0, torchdiffeq 0.2.5, and torchsde 0.2.6. The end-to-end reproduction pipeline runs as a single CPU-worker workflow and typically requires – wall-clock hours, depending on cached downloads and optional notebook regeneration.
Appendix H Proofs
H.1 Proof of Proposition 2.1
Proof.
Since is an orthonormal basis of , the orthogonal projections converge strongly to the identity. That is, for every , in as .
We first prove the almost sure convergence of . Fix and . Fix such that . Then in . By the continuity of , for every , we have . Furthermore, using the growth assumption (9) and the contraction property of the orthogonal projector (), we can bound the kernel evaluation as:
By the finite moment assumption, the right-hand side is -integrable with respect to for our fixed . Therefore, by the dominated convergence theorem,
which establishes almost surely.
Next, we establish the convergence of . Fix . By the continuity of , for all . Similarly to the previous case, applying (9) and the contraction property yields:
Due to the finite moment assumption, this dominating function is integrable with respect to the product measure . Invoking the dominated convergence theorem once more, we obtain
H.2 Proof of Proposition 2.2
Proof.
Fix . We first show that the exact responsibility is a version of the conditional expectation . Let be the marginal law of . For any Borel set , the definition of yields
Equivalently, for every ,
Since is -measurable, this identity implies almost surely.
By an identical argument on the projected space, equipping with the pushforward measures and , the projected derivative satisfies
for any . Equivalently, for every ,
Since is -measurable, this confirms that
almost surely, establishing (13).
Next, define the filtration . The property ensures that , so is -measurable, meaning the filtration is increasing (). Let . Since each projector is continuous, is -measurable, yielding . Conversely, since , we have pointwise in . Since is a metric space, the pointwise limit of -measurable maps into is -measurable. This forces , concluding that .
Finally, we apply Lévy’s upward convergence theorem. Since the target is integrable, the martingale sequence converges almost surely and in to . By our previous identities,
Substituting these into the limit, we obtain, as ,
∎
H.3 Proof of Proposition 3.1
Proof.
Proof of (i).
Fix , , and . Let be any nonzero trace-class covariance operator on . It is enough to prove density for mixtures with component covariances of the form , since these form a subclass of the mixtures in (1).
Because is a separable Hilbert space, it is Polish. Hence finitely supported probability measures are dense in for . Therefore, there exists such that
For , define the Gaussian smoothing
Let be an index with , and let , independent of . Then has law , while has law . This coupling gives
Since is trace-class, Fernique’s theorem implies . Hence as . Choosing small enough yields , and the triangle inequality gives
Thus, finite Gaussian mixtures with non-zero covariance operators are -dense in .
Proof of (ii). Fix and draw i.i.d. samples . Let the empirical measure be . Since and we assume , the expected squared MMD is bounded by the variance of the mean embedding:
By the probabilistic method, there exists a specific realization such that the deterministic empirical measure satisfies .
Fix any nonzero trace-class covariance operator and define . Then converges weakly to as . Because is bounded and continuous, weak convergence implies convergence in MMD. Thus, we can choose small enough such that .
Applying the triangle inequality yields
which completes the proof, as is a valid mixture of the form (1). ∎
H.4 Proof of Proposition C.1
Proof.
Step 1. Let be the canonical feature map. Because is separable and is continuous, the reproducing kernel Hilbert space is separable. Define the Bochner space
For each sample path, define the random element . Since is continuous, is jointly measurable. Together with the separability of and the integrability estimate below, this defines a strongly measurable -valued random variable. By assumption, its squared norm is integrable:
Thus, are i.i.d. -valued random variables with finite first and second moments.
Step 2. By Jensen’s inequality and Tonelli’s theorem, the population mean embedding trajectory satisfies , so . By the Bochner–Fubini theorem, is exactly the expected value in . Similarly, the empirical mean embedding trajectory corresponds to the sample average:
By assumption, the target embedding trajectory also belongs to .
Step 3. By the strong law of large numbers for Hilbert-space-valued random variables, the sample average converges to its expectation almost surely:
Since in , the continuity of the squared norm gives . Because the integrated squared MMD is precisely the squared distance in scaled by , this immediately implies that
almost surely. Finally, setting directly yields almost surely, completing the proof. ∎
H.5 Proof of Proposition C.2
Proof.
For each , the identification of with via the basis maps the projected measures and to their finite-dimensional coordinate representations and . By defining , we have
for every . Expanding the exact and projected MMD objectives, we define the cross-expectations
Fix a sample path satisfying for every . For every , in . Hence, by the continuity of , the data–data interactions converge pointwise in :
For fixed , the integrand in satisfies for every , and is bounded by . Then dominated convergence applies because Gaussian measures on Hilbert spaces have finite -moments, giving . The exact same argument applies to the component–component term, yielding .
We next justify passing the limit through the time integral using the Dominated Convergence Theorem. By the polynomial growth condition on and the contraction property of ,
This upper bound belongs to by Cauchy–Schwarz and the sample path assumption. Likewise, if ,
which is integrable on . Finally, the component–component term is uniformly bounded by , which is finite and independent of .
Since , we can apply the Dominated Convergence Theorem to integrate term by term. Integrating the finite-dimensional MMD expansion
over yields convergence for fixed , and :
Dividing by and using the coordinate identification completes the proof. ∎
H.6 Proof of Proposition C.3
Proof.
Fix , let be continuous with respect to , and fix . Because is compact and is continuous, the trajectory image is compact in the metric space .
By Proposition˜3.1, for every , there exists a finite Gaussian mixture such that . The open balls form an open cover of . By compactness, we can extract a finite subcover centered at , with corresponding approximating mixtures .
Let be a continuous partition of unity on subordinate to this finite subcover. That is, each is continuous, for all , and whenever .
Since each is a finite Gaussian mixture, we can pool all distinct Gaussian components appearing across into a single shared dictionary . We can then express each local mixture over this shared basis as
We now construct the temporal weights by interpolating the coefficients:
Because and the partition functions are continuous, the trajectories are continuous. Furthermore, and . The temporal mixture is therefore given by
To bound the distance between and , fix . We define the set of active indices at time as . For each , , meaning . By the triangle inequality,
For each , choose an optimal transport plan , which satisfies . We construct a mixed coupling:
Because each has first marginal and second marginal , the convex combination has first marginal and second marginal . Thus, .
Evaluating the transport cost of this coupling yields
Since the integrals are strictly bounded by and the sum of the active weights is 1, we conclude
Hence, for every , establishing the supremum bound. ∎
H.7 Proof of Proposition B.1
Proof.
This proof is inspired by [70, Theorem 18]. Since is bounded, self-adjoint, and positive semi-definite, its square root is a bounded, self-adjoint, positive semi-definite operator, and
By the stability of Gaussian measures under bounded linear maps, if , the transformed variable is distributed as . Because is trace-class and is bounded, the operator remains trace-class. Thus, the general case reduces to evaluating the isotropic kernel on the transformed variables and .
To evaluate the isotropic case, we rewrite the data-to-model expectation by centering the integration variable. Let and write , where . Since , the expectation becomes:
By the spectral theorem for compact self-adjoint operators, there exists an orthonormal eigenbasis of with corresponding eigenvalues . Expanding and in this basis yields
where . For any , define and . By Parseval’s identity, the squared norm of the difference is simply the limit of its partial sums:
Since this convergence holds almost surely and , the dominated convergence theorem allows us to compute the expectation as the limit of the finite-dimensional projections:
Expanding the square inside the exponential yields . Factoring out the constant term, we obtain:
Using the Gaussian identity for , which holds if , with and , each factor evaluates to:
Substituting the one-dimensional evaluations back into the finite-dimensional product yields:
Passing to the limit as , the infinite product converges to . Simultaneously, by the spectral decomposition of , the series in the exponent evaluates to
Recalling that , we conclude:
Applying this isotropic solution to the transformed variables and , we obtain the intermediate form:
Substituting recovers Equation (20).
Finally, for the cross-expectation , let and be independent. Their difference is distributed as . Since , applying the formula for with immediately yields:
which completes the proof.
∎
H.8 Proof of Proposition B.2
Proof.
Write , where . Then
is a real Gaussian random variable with mean and variance
Thus,
and therefore
Since for odd and we obtain
Now let and be independent. By definition of the polynomial kernel,
Applying the binomial theorem to expand the power, and using the linearity of expectation,
To evaluate this expectation, recall that denotes the -th tensor power of . For any vector , its -fold tensor product is . By the standard definition of the inner product on , we have the fundamental identity:
where we adopt the convention and for .
Since Gaussian measures on separable Hilbert spaces have finite moments of all orders, the Bochner moments
are well-defined. Using independence, the expectation of the inner product factors as the inner product of the expectations:
Substituting this back into the binomial expansion yields
which completes the proof. ∎
H.9 Proof of Proposition D.1
Proof.
Let be as in the proposition and define , so that . Since , the projected measure satisfies , so is well defined. Let denote the identity random element on the probability space and set . Write and . The same projection argument used in the proof of Proposition 2.2 gives .
Step 1 (identification as a conditional expectation). By definition, and for each ,
Since , the left-hand side equals
Combining these identities, we obtain
so is a version of . Equivalently, for -a.e. .
Step 2 (martingale convergence). Since is an increasing filtration and , the martingale convergence theorem yields
Combining with Step 1 shows for -a.e. , and in .
Step 3 (log-likelihood convergence under ). Since , the almost sure convergence in Step 2 also holds -a.s. Moreover, -a.s. and -a.s.; hence for -a.e. , and and are well-defined -a.s. Therefore, for i.i.d. ,
and summing over yields the claimed convergence of the log-likelihoods. ∎
Comments
· 0