Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance
Abstract
We introduce the Yat kernel
$$k_{b,\varepsilon}(x,w)=\frac{(\langle x,w\rangle+b)^{2}}{\|x-w\|^{2}+\varepsilon},$$
a rational hidden-unit primitive whose units are Mercer sections over a shared input/weight space. For $b\ge 0$ the kernel is PSD; for $b>0$ it dominates a scaled inverse-multiquadric (IMQ) kernel in the Loewner order, yielding fixed-kernel universality, characteristicness, and strict positive definiteness on every compact domain. The polynomial numerator opens nonradial alignment channels absent from finite IMQ expansions, witnessed by the directional far-field trace $\lim_{r\to\infty}k_{b,\varepsilon}(ru,w)=\langle u,w\rangle^{2}$. Algebraically, a second finite difference in the bias recovers any IMQ atom exactly from three positive-bias Yat atoms, and three atoms are necessary in every dimension at the level of exact pointwise equality. A trained shared-$(b,\varepsilon)$ Yat layer is therefore a finite learned-center expansion in a fixed universal characteristic RKHS, with a closed-form norm and an explicit diagonal driving a Rademacher generalization bound.
1 Introduction
A trained MLP layer has no canonical Hilbert-space norm. Generalization bounds therefore route through Lipschitz or spectral surrogates of the weight matrix (Bartlett et al., 2017; Neyshabur et al., 2018), which are loose by construction as function-level quantities. The reason is structural: standard activations (ReLU, GELU (Hendrycks and Gimpel, 2016), sigmoid) are scalar functions of an inner product, not symmetric kernel sections of the joint variable $(x,w)$. A trained layer’s read-out is therefore not a finite element of any RKHS, and the toolkit that comes with one (closed-form Hilbert norm, fixed-kernel universality, characteristicness, MMD) is unavailable.
The available Mercer-section primitives fall into two groups. Radial kernels (Gaussian RBF, IMQ) have locality but no alignment: weights act only as centers, the kernel diagonal is constant, and the polynomial alignment $\langle x,w\rangle$ plays no role. Universal dot-product kernels of analytic positive-coefficient form (Smola et al., 2000) have alignment and Hilbert structure but no distance locality, hence no decay. A finite, trainable hidden-unit primitive that combines locality, alignment, and Hilbert structure is the missing piece.
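There is a primitive that splits the difference. The Yat kernel
$$k_{b,\varepsilon}(x,w)=\frac{(\langle x,w\rangle+b)^{2}}{\|x-w\|^{2}+\varepsilon}\tag{1}$$
is the product of a polynomial alignment numerator $(\langle x,w\rangle+b)^{2}$ and an inverse-multiquadric (IMQ) denominator $(\|x-w\|^{2}+\varepsilon)^{-1}$: rational, finite, trainable. It is symmetric and PSD in $(x,w)$, so a trained shared-$(b,\varepsilon)$ layer $f=\sum_i\alpha_i\,k_{b,\varepsilon}(\cdot,w_i)$ is genuinely a finite RKHS expansion with closed-form norm $\|f\|^{2}=\alpha^{\top}K\alpha$, and single-layer generalization follows from the RKHS-ball Rademacher framework with no surrogate on the weight matrix.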
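Why should such a primitive exist, and why are its properties what they are? Schur products of PSD kernels are PSD, and both factors $(\langle x,w\rangle+b)^{2}$ and $(\|x-w\|^{2}+\varepsilon)^{-1}$ are PSD; the technically obvious fact is that the product is itself a Mercer kernel. The interesting fact is a single Loewner inequality. Write $k^{\mathrm{IMQ}}_{\varepsilon}(x,w)=(\|x-w\|^{2}+\varepsilon)^{-1}$. Expanding $(\langle x,w\rangle+b)^{2}=\langle x,w\rangle^{2}+2b\langle x,w\rangle+b^{2}$ and dividing through by $\|x-w\|^{2}+\varepsilon$, the difference $k_{b,\varepsilon}-b^{2}k^{\mathrm{IMQ}}_{\varepsilon}=(\langle x,w\rangle^{2}+2b\langle x,w\rangle)\,k^{\mathrm{IMQ}}_{\varepsilon}$ is itself a sum of Schur products of PSD kernels, manifestly PSD. So the radial IMQ kernel sits inside the Yat kernel in the Loewner order; Aronszajn’s inclusion theorem combined with Micchelli’s IMQ universality (Micchelli et al., 2006) gives universality of the fixed kernel, not just of a family. The radial alternatives are recovered as a strict sub-RKHS of the Yat RKHS $\mathcal H_{b,\varepsilon}$.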
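The Loewner inequality can be probed numerically. The sketch below assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$ reconstructed from the description above; the helper names, the random nodes, and the hyperparameter values are illustrative choices, not taken from the paper. It evaluates random quadratic forms of the Gram matrix of $k_{b,\varepsilon}-b^{2}k^{\mathrm{IMQ}}_{\varepsilon}$ and also shows that the constant $b^{2}$ cannot be doubled at the origin.

```python
import random

def dot(x, w):
    return sum(a * c for a, c in zip(x, w))

def sqdist(x, w):
    return sum((a - c) ** 2 for a, c in zip(x, w))

def imq(x, w, eps):
    # Assumed IMQ form: 1 / (||x - w||^2 + eps)
    return 1.0 / (sqdist(x, w) + eps)

def yat(x, w, b, eps):
    # Assumed Yat form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

def min_quadratic_form(gram, trials, rng):
    # Smallest value of c^T G c over random Gaussian coefficient vectors c
    n = len(gram)
    worst = float("inf")
    for _ in range(trials):
        c = [rng.gauss(0.0, 1.0) for _ in range(n)]
        q = sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))
        worst = min(worst, q)
    return worst

rng = random.Random(0)
b, eps, d, n = 0.5, 1.0, 3, 8  # illustrative values
nodes = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

# Gram matrix of the difference kernel k_yat - b^2 * k_imq
diff = [[yat(x, w, b, eps) - b ** 2 * imq(x, w, eps) for w in nodes] for x in nodes]
worst = min_quadratic_form(diff, 200, rng)

# Doubling the constant already fails on the 1x1 Gram at the origin
zero = [0.0] * d
origin_gap = yat(zero, zero, b, eps) - 2 * b ** 2 * imq(zero, zero, eps)

print("smallest sampled quadratic form:", worst)
print("k - 2 b^2 k_imq at the origin:", origin_gap)
```

Random quadratic forms are only a spot check, not a proof of positive semidefiniteness; the algebraic Schur-product argument in the text is what carries the claim.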
Loewner domination delivers RKHS containment by a soft analytic argument, but it leaves an algebraic question: can the IMQ atoms be constructed from Yat atoms, or only known to lie inside the Yat RKHS by Aronszajn? Section 4 answers the constructive question with a sharper one-line fact:
$$k_{h,\varepsilon}(x,w)-2\,k_{2h,\varepsilon}(x,w)+k_{3h,\varepsilon}(x,w)=\frac{2h^{2}}{\|x-w\|^{2}+\varepsilon},\qquad h>0,\tag{2}$$
that is, a second-order finite difference in the bias parameter of three biased Yat atoms recovers every IMQ atom exactly, and the count of three is sharp in every dimension. This is what the analytic Loewner argument cannot see, and what makes the construction concrete: any quantitative IMQ approximation rate transfers to Yat at factor-three width overhead.
The polynomial numerator does more than improve the constants. IMQ RKHS functions decay at infinity, but a Yat atom retains a directional quadratic shadow along every ray $x=ru$, $u\in S^{d-1}$:
$$\lim_{r\to\infty}k_{b,\varepsilon}(ru,w)=\langle u,w\rangle^{2},\tag{3}$$
independent of $b$ and $\varepsilon$, a kernel invariant that survives every choice of hyperparameter. No finite IMQ combination has nonzero trace, so the alignment numerator opens directional channels in the Yat RKHS that the radial factor alone cannot reach. Section 3 makes this precise: the trace map surjects onto homogeneous quadratic forms on the sphere while annihilating every finite IMQ combination.
Both kernel hyperparameters appear directly in the single-layer Rademacher bound as regularization targets, a transparency that constant-diagonal radial kernels lose and scalar MLP units do not have at all. The reason is in the kernel diagonal: $k_{b,\varepsilon}(x,x)=(\|x\|^{2}+b)^{2}/\varepsilon$ is polynomial in $\|x\|$ with explicit dependence on both $b$ and $\varepsilon$, and the diagonal carries those two scalars into the bound. The closed-form norm itself is generic to any continuous Mercer kernel; what is Yat-specific is the form of the diagonal (Section 5).
Organization. Section 2 establishes Mercer structure, Loewner domination, and fixed-kernel universality. Section 3 develops the channel decomposition, the directional far-field trace, and the bounded-domain width-complexity gap. Section 4 proves the bias-finite-difference identity (2) and its three-atom sharpness. Section 5 derives the closed-form RKHS norm and the Rademacher bound. Section 6 treats depth: per-prefix exact bookkeeping, ambient Sobolev containment, and a conjectured impossibility of any prefix-independent global Yat-Gram norm. Background, related work, and proofs are in Appendices B–E onwards.
2 Mercer Structure and Fixed-Kernel Universality
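Notation. Throughout, $d\ge 1$, $\mathcal X\subset\mathbb R^{d}$ is compact, $b\ge 0$, $\varepsilon>0$. Write $D_{\varepsilon}(x,w)=\|x-w\|^{2}+\varepsilon$ for the regularised squared distance, and use the shorthands
$$k^{\mathrm{IMQ}}_{\varepsilon}(x,w)=\frac{1}{D_{\varepsilon}(x,w)},\qquad k_{b,\varepsilon}(x,w)=\frac{(\langle x,w\rangle+b)^{2}}{D_{\varepsilon}(x,w)}.$$
The unbiased section corresponds to $b=0$. We adopt throughout the convention that the RKHS of the identically-zero kernel is the zero subspace with trivial norm; this makes the channel decomposition of Theorem 3 and the limit $b\to 0$ well-defined when individual channels collapse.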
Two algebraic facts anchor the kernel-section view of the Yat unit. First, $k_{b,\varepsilon}$ is positive semidefinite, and thus a Mercer kernel on every compact $\mathcal X$, for every $b\ge 0$ and $\varepsilon>0$. Second, beyond mere PSD, $k_{b,\varepsilon}$ dominates a scaled IMQ kernel in the Loewner order: $k_{b,\varepsilon}\succeq b^{2}\,k^{\mathrm{IMQ}}_{\varepsilon}$. This second fact transfers IMQ density into the Yat RKHS and pins down fixed-kernel universality on every compact domain whenever $b>0$. Both follow from the Schur-product factorisation into the polynomial alignment numerator $(\langle x,w\rangle+b)^{2}$ and the IMQ denominator $(\|x-w\|^{2}+\varepsilon)^{-1}$.
Theorem 1 (PSD and Mercer for $b\ge 0$).
For every $b\ge 0$, $\varepsilon>0$, the kernel $k_{b,\varepsilon}$ in (1) is symmetric and positive semidefinite on $\mathbb R^{d}$, and on every compact $\mathcal X\subset\mathbb R^{d}$ the restriction is continuous and Mercer. By Moore–Aronszajn there exists a unique RKHS $\mathcal H_{b,\varepsilon}$ with reproducing kernel $k_{b,\varepsilon}$. The sign condition $b\ge 0$ is essential: for $b<0$ PSD can fail, and an explicit counterexample exists.
The sign condition has a structural cause: the polynomial-numerator feature map requires $\sqrt{2b}$ to be real. Per-atom variation of hyperparameters within an expansion lies in the larger biased Yat span (see Appendix A) but not in any single RKHS $\mathcal H_{b,\varepsilon}$. All quantitative RKHS results that follow, namely the closed-form norm (Proposition 3) and the Rademacher bound (Theorem 6), therefore require $(b,\varepsilon)$ to be shared across summands. This is the unit of analysis for the rest of the paper.
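To see concretely how a negative bias breaks positive semidefiniteness, here is a small two-node example in dimension one. It is a sketch under the assumed kernel form $k_{b,\varepsilon}(x,w)=(xw+b)^{2}/((x-w)^{2}+\varepsilon)$; the nodes $\{1,2\}$ and the values $b=-1$, $\varepsilon=1$ are hypothetical choices made for this illustration and are not the counterexample referred to in Theorem 1.

```python
def yat(x, w, b, eps):
    # Assumed Yat form for scalar inputs (d = 1)
    return (x * w + b) ** 2 / ((x - w) ** 2 + eps)

def gram2(nodes, b, eps):
    return [[yat(x, w, b, eps) for w in nodes] for x in nodes]

def det2(g):
    return g[0][0] * g[1][1] - g[0][1] * g[1][0]

nodes = [1.0, 2.0]  # hypothetical nodes chosen for this illustration
eps = 1.0

# Negative bias: k(1, 1) = 0 while k(1, 2) = 0.5, so the determinant is negative
gram_neg = gram2(nodes, -1.0, eps)
det_neg = det2(gram_neg)

# Explicit witness c with c^T G c < 0
c = [-10.0, 1.0]
form_neg = sum(c[i] * c[j] * gram_neg[i][j] for i in range(2) for j in range(2))

# Same nodes with positive bias give a positive-definite 2x2 Gram
gram_pos = gram2(nodes, 1.0, eps)
det_pos = det2(gram_pos)

print("det (b = -1):", det_neg, " quadratic form at c:", form_neg)
print("det (b = +1):", det_pos)
```

A symmetric $2\times 2$ matrix with negative determinant has one negative eigenvalue, which is why the single determinant suffices here.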
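PSD structure implies a Loewner-order inequality that yields RKHS containment and singular (fixed-kernel) universality as direct algebraic consequences. Write $k^{\mathrm{IMQ}}_{\varepsilon}(x,w)=(\|x-w\|^{2}+\varepsilon)^{-1}$ for the IMQ kernel. Expanding the numerator gives $k_{b,\varepsilon}-b^{2}k^{\mathrm{IMQ}}_{\varepsilon}=\bigl(\langle x,w\rangle^{2}+2b\langle x,w\rangle\bigr)k^{\mathrm{IMQ}}_{\varepsilon}$, a sum of Schur products of PSD kernels, hence PSD as a kernel: every finite Gram matrix of $k_{b,\varepsilon}-b^{2}k^{\mathrm{IMQ}}_{\varepsilon}$ is itself PSD.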
Theorem 2 (Kernel-order domination).
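For every $b>0$, $\varepsilon>0$,
$$b^{2}\,k^{\mathrm{IMQ}}_{\varepsilon}\preceq k_{b,\varepsilon}$$
in the Loewner PSD order on kernels. By Aronszajn’s inclusion theorem (Paulsen and Raghupathi, 2016),
$$\mathcal H^{\mathrm{IMQ}}_{\varepsilon}\subseteq\mathcal H_{b,\varepsilon},\qquad \|f\|_{\mathcal H_{b,\varepsilon}}\le\frac{1}{b}\,\|f\|_{\mathcal H^{\mathrm{IMQ}}_{\varepsilon}}\quad\text{for every } f\in\mathcal H^{\mathrm{IMQ}}_{\varepsilon}.$$
Equivalently, $\|f\|^{2}_{\mathcal H_{b,\varepsilon}}\le b^{-2}\,\|f\|^{2}_{\mathcal H^{\mathrm{IMQ}}_{\varepsilon}}$.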
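The constant $b^{2}$ on the left of the Loewner inequality is sharp on $\mathbb R^{d}$: at the single point $x=w=0$ the ratio $k_{b,\varepsilon}/k^{\mathrm{IMQ}}_{\varepsilon}$ is exactly $b^{2}$, so the $1\times 1$ Gram of $k_{b,\varepsilon}-c\,k^{\mathrm{IMQ}}_{\varepsilon}$ at $\{0\}$ becomes negative for any $c>b^{2}$, and therefore $c\,k^{\mathrm{IMQ}}_{\varepsilon}\not\preceq k_{b,\varepsilon}$. This sharpness is tied to the origin: on a compact $\mathcal X$ that excludes a neighbourhood of the origin, the optimal domination constant can be strictly larger than $b^{2}$, since (after cancelling the IMQ denominators) the diagonal numerator $(\|x\|^{2}+b)^{2}$ is bounded below away from $b^{2}$ once $\|x\|$ is bounded away from $0$.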
Two related constants govern the $b\to 0$ behaviour and should not be conflated. The kernel-domination constant $b^{2}$ enters squared because it compares kernels (and hence Gram matrices), while the RKHS embedding constant in Theorem 2 is $1/b$, comparing norms.
Quantitative approximation rates transferred through the embedding therefore inherit a factor $1/b$ on norms or $1/b^{2}$ on squared norms, harmless at fixed $b>0$ but exploding at the boundary $b\to 0$.
That boundary is genuine, not an artefact of the embedding. At $b=0$ the channel decomposition (Theorem 3 below) shows that $\mathcal H_{0,\varepsilon}$ collapses to the quadratic-alignment channel alone: the radial channel and the linear-alignment channel disappear under the zero-kernel convention. Equivalently, the IMQ-radial subspace embeds continuously into $\mathcal H_{b,\varepsilon}$ for every $b>0$ but is not contained in $\mathcal H_{0,\varepsilon}$ at all, since every $f\in\mathcal H_{0,\varepsilon}$ vanishes at the origin while generic IMQ functions do not. The transition at $b=0$ is a phase transition of the RKHS, recorded in Table 1: the practical recommendation is to use $b>0$ rather than to treat $b=0$ as a regularised limit. With the boundary understood, fixed-kernel universality follows for the open regime $b>0$.
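Proposition 1 (Singular universality at $b>0$).
Let $\mathcal X\subset\mathbb R^{d}$ be compact and let $b>0$, $\varepsilon>0$. Then the RKHS $\mathcal H_{b,\varepsilon}$ of the fixed biased kernel $k_{b,\varepsilon}$ is dense in $C(\mathcal X)$ in the uniform norm. Hence $k_{b,\varepsilon}$ is a universal kernel, not merely a member of a universal family.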
Universality in turn implies the kernel mean embedding is injective and Gram matrices at distinct nodes are strictly positive definite, both consequences we use throughout.
Corollary 1 (Characteristic kernel and SPD).
For every $b>0$, $\varepsilon>0$, and every compact $\mathcal X\subset\mathbb R^{d}$: (i) $k_{b,\varepsilon}$ is characteristic on $\mathcal X$, so the kernel mean embedding is injective on finite signed Borel measures and the maximum mean discrepancy metrizes weak convergence (Sriperumbudur et al., 2010, Thm. 23); (ii) $k_{b,\varepsilon}$ is strictly positive definite in the standard kernel-theoretic sense: for any $n\ge 1$ and any pairwise distinct $x_1,\dots,x_n\in\mathcal X$, the Gram matrix $K=[k_{b,\varepsilon}(x_i,x_j)]_{i,j}$ is strictly positive definite, hence invertible, and the quadratic form $\alpha\mapsto\alpha^{\top}K\alpha$ is a proper RKHS norm on coefficient space.
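As an illustration of part (i), the maximum mean discrepancy between two samples can be estimated directly from kernel evaluations. The sketch below uses the standard biased (V-statistic) estimator and assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$; the sample sizes, the shift, and the hyperparameters are hypothetical.

```python
import random

def yat(x, w, b=0.5, eps=1.0):
    # Assumed Yat kernel form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    s = sum(a * c for a, c in zip(x, w))
    r2 = sum((a - c) ** 2 for a, c in zip(x, w))
    return (s + b) ** 2 / (r2 + eps)

def mean_kernel(us, vs):
    # Average kernel value over all pairs (u, v)
    return sum(yat(u, v) for u in us for v in vs) / (len(us) * len(vs))

def mmd2(xs, ys):
    # Biased (V-statistic) estimate of the squared maximum mean discrepancy
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

rng = random.Random(1)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
ys = [[a + 1.0, c + 1.0] for a, c in xs]  # same sample, shifted by (1, 1)

same = mmd2(xs, xs)     # identical empirical measures
shifted = mmd2(xs, ys)  # distinct empirical measures

print("MMD^2(xs, xs) =", same)
print("MMD^2(xs, ys) =", shifted)
```

The estimator is a quadratic form of the joint Gram matrix, so it is nonnegative for any PSD kernel; the characteristic property is what guarantees that it vanishes only when the two measures coincide.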
Together with Theorem 1, these results license a shared-$(b,\varepsilon)$ Yat layer as a finite expansion in a universal characteristic RKHS, the unit of analysis the rest of the paper exploits. Table 1 summarises how Yat compares with its closest classical kernel-valued alternatives along the properties used in the rest of the paper.
| Primitive | PSD/Mercer | Universal (family) | Universal (fixed kernel) | Characteristic | Bounded | Numerator | Denominator |
|---|---|---|---|---|---|---|---|
| Yat ($b>0$) | ✓ | ✓ | ✓ | ✓ | ✓ | align | local |
| Yat ($b=0$) | ✓ | ✓ | ✗§ | ✗§ | ✓ | align | local |
| Polynomial | ✓ | ✗† | ✗† | ✗‡ | ✗ | align | none |
| Univ. dot-prod | ✓ | ✓ | ✓ | ✓ | ✗ | align | none |
| RBF | ✓ | ✓ | ✓ | ✓ | ✓ | none | local |
| IMQ | ✓ | ✓ | ✓ | ✓ | ✓ | none | local |

† Fails universality: finite-rank RKHS (not dense in $C(\mathcal X)$ for infinite $\mathcal X$).
‡ Fails characteristic: cannot separate all probability measures.
§ Fails on origin-containing domains: $b=0$ forces $f(0)=0$ for every $f\in\mathcal H_{0,\varepsilon}$.
3 Nonradial Alignment Structure
Theorem 2 embeds the IMQ RKHS into $\mathcal H_{b,\varepsilon}$ for $b>0$, but it leaves a structural question open: does the polynomial alignment numerator add anything that the radial denominator alone could not produce? It does, and the gap is recorded by a single far-field invariant. The IMQ RKHS lives entirely inside the vanishing-at-infinity part of function space, since every IMQ RKHS element decays at infinity, while a Yat atom retains a directional quadratic shadow along every ray $ru$ as $r\to\infty$. That mismatch is the structural witness of nonradiality, and once it is in hand the rest of the section follows by algebra: a strict RKHS containment, a clean three-channel decomposition of $\mathcal H_{b,\varepsilon}$, and a sharp directional trace identity that no finite IMQ combination can reproduce.
The decay statement is the simplest of the three.
Lemma 1 (IMQ RKHS functions vanish at infinity).
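Every $f\in\mathcal H^{\mathrm{IMQ}}_{\varepsilon}$ belongs to $C_{0}(\mathbb R^{d})$, i.e., $f(x)\to 0$ as $\|x\|\to\infty$.
The lemma is a global statement on $\mathbb R^{d}$, distinct from the compact-domain universality of the IMQ kernel: on every compact $\mathcal X$, the IMQ RKHS restricts densely to $C(\mathcal X)$ (Micchelli et al., 2006), but as functions on the ambient $\mathbb R^{d}$ the elements of $\mathcal H^{\mathrm{IMQ}}_{\varepsilon}$ vanish at infinity, while generic Yat sections do not. A direct consequence is that the Loewner embedding cannot reverse, so the RKHS containment is strict.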
Corollary 2 (Strict containment and non-reversibility).
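For $b>0$, $\varepsilon>0$: $\mathcal H^{\mathrm{IMQ}}_{\varepsilon}\subsetneq\mathcal H_{b,\varepsilon}$. Moreover, no constant $C<\infty$ satisfies $k_{b,\varepsilon}\preceq C\,k^{\mathrm{IMQ}}_{\varepsilon}$ on $\mathbb R^{d}$.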
The strict gap has an explicit factorisation. Expanding the squared alignment numerator splits $k_{b,\varepsilon}$ into three Schur products with the IMQ kernel. Each piece is itself PSD, so the RKHS sum rule decomposes the Yat space accordingly into a radial channel and two alignment channels.
Theorem 3 (Yat RKHS channel decomposition).
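Write $k_{b,\varepsilon}=k_{\mathrm{rad}}+k_{\mathrm{lin}}+k_{\mathrm{quad}}$ where $k_{\mathrm{rad}}=b^{2}\,k^{\mathrm{IMQ}}_{\varepsilon}$, $k_{\mathrm{lin}}=2b\,\langle x,w\rangle\,k^{\mathrm{IMQ}}_{\varepsilon}$, $k_{\mathrm{quad}}=\langle x,w\rangle^{2}\,k^{\mathrm{IMQ}}_{\varepsilon}$. Each summand is PSD: $k^{\mathrm{IMQ}}_{\varepsilon}$ is PSD, the linear kernel $\langle x,w\rangle$ is PSD (its feature map is the identity), the quadratic $\langle x,w\rangle^{2}$ is PSD as the Schur square of a PSD kernel, and Schur products of PSD kernels are PSD; non-negative scalar multiples preserve PSD. By the RKHS sum rule (Paulsen and Raghupathi, 2016; Berlinet and Thomas-Agnan, 2004),
$$\mathcal H_{b,\varepsilon}=\mathcal H_{k_{\mathrm{rad}}}+\mathcal H_{k_{\mathrm{lin}}}+\mathcal H_{k_{\mathrm{quad}}}$$
with squared infimal-convolution norm
$$\|f\|^{2}_{\mathcal H_{b,\varepsilon}}=\min\Bigl\{\|f_{\mathrm{rad}}\|^{2}_{\mathcal H_{k_{\mathrm{rad}}}}+\|f_{\mathrm{lin}}\|^{2}_{\mathcal H_{k_{\mathrm{lin}}}}+\|f_{\mathrm{quad}}\|^{2}_{\mathcal H_{k_{\mathrm{quad}}}}\;:\;f=f_{\mathrm{rad}}+f_{\mathrm{lin}}+f_{\mathrm{quad}}\Bigr\}.$$
The three channels are: the radial IMQ channel ($k_{\mathrm{rad}}$; $\mathcal H_{k_{\mathrm{rad}}}=\mathcal H^{\mathrm{IMQ}}_{\varepsilon}$ as sets with rescaled norm (constant-rescaling identity for RKHS norms, Paulsen and Raghupathi 2016, Prop. 4.4), present when $b>0$), the linear-alignment IMQ channel ($k_{\mathrm{lin}}$, vanishes at $b=0$), and the quadratic-alignment IMQ channel ($k_{\mathrm{quad}}$, carrying the far-field trace, present for all $b\ge 0$). We adopt the convention that the RKHS of the identically-zero kernel is the zero subspace with trivial norm. At $b=0$ both $k_{\mathrm{rad}}$ and $k_{\mathrm{lin}}$ vanish under this convention: the RKHS loses the IMQ subspace and the constant coordinate, which is the degeneracy responsible for failure of fixed-kernel universality on domains containing the origin. The same degeneracy appears as the kernel diagonal vanishing at the origin: $k_{0,\varepsilon}(0,0)=0$, so every $f\in\mathcal H_{0,\varepsilon}$ satisfies $f(0)=0$. The universality failure and the diagonal-vanishing diagnostic thus record the same fact.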
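The channel split is a pointwise algebraic identity and can be checked directly. The sketch below assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$; the function name `yat_channels` and the test point are illustrative. It also shows the degeneracy at $b=0$: the radial and linear channels are identically zero and the diagonal vanishes at the origin.

```python
def dot(x, w):
    return sum(a * c for a, c in zip(x, w))

def sqdist(x, w):
    return sum((a - c) ** 2 for a, c in zip(x, w))

def yat(x, w, b, eps):
    # Assumed Yat form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    return (dot(x, w) + b) ** 2 / (sqdist(x, w) + eps)

def yat_channels(x, w, b, eps):
    # Radial, linear-alignment, and quadratic-alignment pieces of the kernel
    s = dot(x, w)
    q = 1.0 / (sqdist(x, w) + eps)  # IMQ factor
    return b * b * q, 2.0 * b * s * q, s * s * q

x, w, eps = [0.3, -1.2, 0.7], [1.0, 0.5, -0.4], 0.25

# With b > 0 the three channels add up to the full kernel value
total = sum(yat_channels(x, w, 0.8, eps))
direct = yat(x, w, 0.8, eps)

# With b = 0 only the quadratic-alignment channel survives
radial0, linear0, quad0 = yat_channels(x, w, 0.0, eps)

# ... and the kernel diagonal vanishes at the origin
origin_diag = yat([0.0] * 3, [0.0] * 3, 0.0, eps)

print("sum of channels:", total, " direct:", direct)
print("b = 0 channels:", radial0, linear0, quad0, " k(0,0):", origin_diag)
```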
The quadratic-alignment channel is what survives every choice of $b\ge 0$ and is the source of the far-field shadow that distinguishes Yat from IMQ. The next proposition makes the shadow precise as a sphere-valued asymptotic trace and certifies that the trace map is surjective onto homogeneous quadratic forms while annihilating every finite IMQ combination.
Proposition 2 (Directional asymptotic trace separates Yat from IMQ).
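Let $\mathcal A$ denote the linear subspace of functions $f:\mathbb R^{d}\to\mathbb R$ for which the radial limit $\lim_{r\to\infty}f(ru)$ exists for every $u\in S^{d-1}$, and define on $\mathcal A$ the asymptotic trace operator $(Tf)(u)=\lim_{r\to\infty}f(ru)$. Both the finite Yat span and the finite IMQ span lie in $\mathcal A$. For every $w\in\mathbb R^{d}$, every $b\ge 0$, and every $\varepsilon>0$, the biased Yat atom satisfies
$$T\bigl[k_{b,\varepsilon}(\cdot,w)\bigr](u)=\langle u,w\rangle^{2},\tag{4}$$
independent of $b$ and $\varepsilon$, and the convergence is uniform in $u\in S^{d-1}$. By contrast, $Tg=0$ for every finite IMQ combination $g$. Consequently the linear map induced by $T$ on the finite Yat span is surjective onto the space of restrictions to $S^{d-1}$ of homogeneous quadratic forms, of dimension $d(d+1)/2$, while $T$ vanishes identically on the finite IMQ span.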
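The far-field limit can be observed numerically along a single ray. The sketch below assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$; the direction `u`, center `w`, and radii are hypothetical values chosen so that $\langle u,w\rangle^{2}=4.84$.

```python
def dot(x, w):
    return sum(a * c for a, c in zip(x, w))

def sqdist(x, w):
    return sum((a - c) ** 2 for a, c in zip(x, w))

def imq(x, w, eps):
    # Assumed IMQ form: 1 / (||x - w||^2 + eps)
    return 1.0 / (sqdist(x, w) + eps)

def yat(x, w, b, eps):
    # Assumed Yat form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

u = [0.6, 0.8]      # unit direction on the circle
w = [1.0, 2.0]      # hypothetical center
b, eps = 0.5, 1.0

trace = dot(u, w) ** 2  # predicted far-field value <u, w>^2
radii = [1e1, 1e2, 1e4, 1e6]

yat_vals = [yat([r * c for c in u], w, b, eps) for r in radii]
imq_vals = [imq([r * c for c in u], w, eps) for r in radii]
errors = [abs(v - trace) for v in yat_vals]

for r, e, g in zip(radii, errors, imq_vals):
    print(f"r = {r:.0e}   |k_yat - <u,w>^2| = {e:.3e}   k_imq = {g:.3e}")
```

The Yat atom approaches $\langle u,w\rangle^{2}$ at rate $O(1/r)$, while the IMQ atom decays like $1/r^{2}$, which is the separation the trace operator records.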
The directional trace identity above is qualitative: a bounded radial combination eventually misses a Yat atom’s far-field value. The gap becomes quantitative on a bounded domain by combining the directional-trace identity (Proposition 2) with the classical radial curse of dimensionality for ridge approximation (Eldan and Shamir, 2016): a single Yat atom approximates a quadratic ridge function to arbitrary accuracy, while any bounded-variation IMQ expansion needs exponentially many atoms in the ambient dimension.
A precise quantitative form (Proposition 7, Appendix E.8) makes this concrete under a joint constraint on the IMQ side: bounded coefficient mass and bounded center norms , both held fixed as . Under this joint constraint, on a single Yat atom with approximates the quadratic ridge uniformly to , while every IMQ expansion in this class needs atoms. Lifting either constraint, by letting grow with or letting centers escape into the shell, collapses the lower bound, so the separation is a constrained radial-vs-ridge statement, not a model-free benchmark prediction.
Sharper but extrapolative quantitative separations on exterior shells for are recorded in Appendix G.1: a uniform bound and an directional-tail risk lower bound, both for bounded-variation IMQ and RBF expansions. On normalized-feature compact domains both IMQ and RBF are universal (Micchelli et al., 2006; Steinwart and Christmann, 2008), so these are far-field structural separations rather than compact-domain benchmark predictions.
4 IMQ Embedding via Bias Finite Differences
Theorem 2 buys an IMQ-into-Yat embedding through Loewner domination, but that argument is analytic: Aronszajn’s inclusion theorem only certifies an inclusion of RKHSs, not a constructive recipe for producing an IMQ atom from Yat atoms. A purely algebraic recipe exists, and the underlying reason is one line: the Yat numerator is exactly quadratic in $b$ with constant second derivative $2$, so any second-order finite difference in $b$ reduces it to a constant and leaves only the IMQ denominator times that constant. The stencil $(h,2h,3h)$ is the smallest one with all biases strictly positive. Concretely, taking that stencil at a common center $w$ gives an exact IMQ atom, multiplied by a constant. The identity is pointwise, not approximate, and the count of three Yat atoms is necessary in every dimension: two atoms cannot reproduce a single IMQ section. This construction gives a second, constructive route to fixed-kernel universality and immediately transfers any IMQ approximation rate to Yat at a factor-three width overhead.
Theorem 4 (Bias-finite-difference IMQ embedding).
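Fix $\varepsilon>0$. For every $w\in\mathbb R^{d}$ and every $h>0$, the positive-bias second-order finite difference of the biased Yat atom satisfies the exact pointwise identity
$$k_{h,\varepsilon}(x,w)-2\,k_{2h,\varepsilon}(x,w)+k_{3h,\varepsilon}(x,w)=\frac{2h^{2}}{\|x-w\|^{2}+\varepsilon}\tag{5}$$
for every $x\in\mathbb R^{d}$. Consequently $\mathcal S^{\mathrm{IMQ}}_{\varepsilon}\subset\mathcal S^{\mathrm{Yat}}_{\varepsilon}$, where $\mathcal S^{\mathrm{IMQ}}_{\varepsilon}=\operatorname{span}\{k^{\mathrm{IMQ}}_{\varepsilon}(\cdot,w):w\in\mathbb R^{d}\}$ and $\mathcal S^{\mathrm{Yat}}_{\varepsilon}=\operatorname{span}\{k_{b,\varepsilon}(\cdot,w):w\in\mathbb R^{d},\,b>0\}$. The containment is strict.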
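The identity can be verified by hand. As a sketch, assume the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$ and the bias stencil $(h,2h,3h)$, and abbreviate $s=\langle x,w\rangle$ and $D=\|x-w\|^{2}+\varepsilon$ (both abbreviations are introduced here for the calculation only). Since the three atoms share the denominator $D$, only the numerators need to be combined:

```latex
\begin{aligned}
(s+h)^{2}-2\,(s+2h)^{2}+(s+3h)^{2}
  &= (s^{2}+2sh+h^{2}) \;-\; 2\,(s^{2}+4sh+4h^{2}) \;+\; (s^{2}+6sh+9h^{2}) \\
  &= (1-2+1)\,s^{2} \;+\; (2-8+6)\,sh \;+\; (1-8+9)\,h^{2} \\
  &= 2h^{2}.
\end{aligned}
```

Dividing by $D$ gives $2h^{2}/(\|x-w\|^{2}+\varepsilon)$: the alignment variable $s$ cancels entirely, which is why the result is a pure radial IMQ atom at the same center.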
The factor-three overhead is not slack: removing any one of the three Yat atoms loses the cancellation that produces the IMQ section. The next theorem establishes this as a tightness statement at the level of exact pointwise equality.
Theorem 5 (Three atoms are necessary).
The factor-$3$ reduction in (5) is sharp in the number of atoms: for every $d\ge 1$, $\varepsilon>0$, $w\neq 0$, no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of the IMQ atom $k^{\mathrm{IMQ}}_{\varepsilon}(\cdot,w)$.
The full proof is in Appendix E.2, where the result is restated for convenience (Theorem 9), via an irreducible-quadric argument. Tightness is at the level of exact pointwise equality; whether two atoms can uniformly approximate an IMQ atom on a compact set is open. The hypothesis $w\neq 0$ is essential: at the origin, the single Yat atom $k_{b,\varepsilon}(\cdot,0)=b^{2}\,k^{\mathrm{IMQ}}_{\varepsilon}(\cdot,0)$ recovers the IMQ atom exactly, so one atom suffices in that case.
Two consequences follow without additional analysis. The first propagates Micchelli’s IMQ universality (Micchelli et al., 2006) into the Yat family by composition; the second turns the embedding into a quantitative rate-transfer statement.
Corollary 3 (Family universality, algebraic route).
For every compact , both at any single and the broader families and are dense in in the uniform norm. This gives a second algebraic route to the universality already established analytically by Proposition 1 via Loewner-IMQ domination.
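Corollary 4 (Approximation rate transfer at factor-$3$ width overhead).
Fix $\varepsilon>0$ and $h>0$, and let $\mathcal X\subset\mathbb R^{d}$ be compact. For every finite IMQ approximant $g$ of $N$ atoms, there exists a Yat approximant $f$ of at most $3N$ atoms (positive biases drawn from $\{h,2h,3h\}$) with $f(x)=g(x)$ for every $x\in\mathcal X$. In particular, any quantitative IMQ approximation rate $\rho(N)$ transfers to a Yat rate $\rho(\lfloor M/3\rfloor)$ at $M$ Yat atoms; equivalently, any IMQ rate $N^{-\gamma}$ gives a Yat rate $3^{\gamma}M^{-\gamma}$.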
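The rate transfer is constructive, and the conversion can be written out in a few lines. The sketch below assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$ and the bias stencil $(h,2h,3h)$ with weights $(1,-2,1)/(2h^{2})$; the function name `imq_to_yat`, the centers, and the coefficients are hypothetical.

```python
def dot(x, w):
    return sum(a * c for a, c in zip(x, w))

def sqdist(x, w):
    return sum((a - c) ** 2 for a, c in zip(x, w))

def imq(x, w, eps):
    # Assumed IMQ form: 1 / (||x - w||^2 + eps)
    return 1.0 / (sqdist(x, w) + eps)

def yat(x, w, b, eps):
    # Assumed Yat form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

def imq_to_yat(centers, coefs, h):
    """Rewrite sum_i c_i * imq(., w_i) as a Yat expansion, 3 atoms per center."""
    atoms = []  # list of (center, bias, coefficient)
    for w, c in zip(centers, coefs):
        scale = c / (2.0 * h * h)
        atoms.append((w, h, scale))
        atoms.append((w, 2.0 * h, -2.0 * scale))
        atoms.append((w, 3.0 * h, scale))
    return atoms

def eval_imq(x, centers, coefs, eps):
    return sum(c * imq(x, w, eps) for w, c in zip(centers, coefs))

def eval_yat(x, atoms, eps):
    return sum(a * yat(x, w, b, eps) for w, b, a in atoms)

centers = [[0.0, 1.0], [1.5, -0.5], [-1.0, -1.0]]
coefs = [1.0, -2.0, 0.5]
eps, h = 0.3, 0.5

atoms = imq_to_yat(centers, coefs, h)
probes = [[0.2, 0.1], [-3.0, 4.0], [10.0, -7.0]]
max_gap = max(
    abs(eval_imq(x, centers, coefs, eps) - eval_yat(x, atoms, eps)) for x in probes
)

print("Yat atoms:", len(atoms), " max pointwise gap:", max_gap)
```

The resulting expansion mixes three different biases, so it sits in the biased Yat span rather than in a single shared-bias RKHS.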
Corollary 4 is the answer to “what does Yat inherit from IMQ approximation theory?”. We do not prove new Sobolev or Besov approximation rates here; the bias finite-difference identity shows that Yat is never worse than fixed-$\varepsilon$ IMQ at the level of any rate inherited through this embedding. Yat-native rates that exploit the alignment numerator beyond what radial expansions can match remain open. Note that the atoms on the Yat side carry three distinct biases $h,2h,3h$, so the resulting expansion lies in the wider biased Yat span rather than in any single shared-$(b,\varepsilon)$ RKHS; the closed-form norm and Rademacher bound of Section 5 apply to each fixed-bias channel separately, with the per-channel norms combining via the RKHS sum rule.
The rate-transfer corollary leaves the RKHS-ball radius unspecified: in the absence of a constructive bound on , the inherited rate is information-theoretic rather than operational. The next section closes this gap by computing in closed form for any trained shared- Yat layer and propagating it to a Rademacher generalization bound, so that the rate transfer above and the capacity bound below are companion statements: the first says what error rate a target in admits at atoms; the second says what radius the trained network’s coordinate functions occupy.
5 Layer-Local RKHS Capacity
Section 4 transferred IMQ approximation rates to Yat at factor-three width overhead but left the radius of the relevant RKHS ball unspecified, and an inherited rate without a constructive radius is information-theoretic, not operational.¹ The reproducing property of $\mathcal H_{b,\varepsilon}$ closes the gap: the squared RKHS norm of any finite kernel expansion equals the quadratic form $\alpha^{\top}K\alpha$ in the trained weights and read-out coefficients, computable from those parameters alone, with no Lipschitz surrogate and no spectral product across depth. Plugging this norm into the RKHS-ball Rademacher framework yields a single-layer bound whose only Yat-specific ingredient is the kernel diagonal $k_{b,\varepsilon}(x,x)$, which carries both $b$ and $\varepsilon$ into the bound as explicit regularisation targets, in contrast to the constant diagonals of Gaussian RBF and IMQ.

¹ The factor-three rate-transfer construction uses three biases per center, so the constructed approximant lives in the wider biased Yat span rather than in any single shared-$(b,\varepsilon)$ RKHS. The closed-form norm and Rademacher bound here apply per shared-$(b,\varepsilon)$ Yat layer; the two are companion statements over the kernel family, not over one trained network. Per-channel norms combine via the RKHS sum rule.
Proposition 3 (Closed-form RKHS norm).
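Let $b\ge 0$, $\varepsilon>0$, and let $f=\sum_{i=1}^{m}\alpha_i\,k_{b,\varepsilon}(\cdot,w_i)$ be a finite kernel expansion with shared $(b,\varepsilon)$. Then
$$\|f\|^{2}_{\mathcal H_{b,\varepsilon}}=\alpha^{\top}K\alpha=\sum_{i,j=1}^{m}\alpha_i\alpha_j\,k_{b,\varepsilon}(w_i,w_j).\tag{6}$$
The identity holds for any choice of $w_1,\dots,w_m$, including repeated centers; in that case $K$ may be singular, and although different coefficient vectors can represent the same function $f$, their difference lies in $\ker K$ and the quadratic form is invariant across equivalent representations.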
The reproducing property (Aronszajn, 1950; Steinwart and Christmann, 2008) delivers this as an exact finite-dimensional formula: for a shared-$(b,\varepsilon)$ Yat layer, the right-hand side is computable from the trained weights alone, in closed form, with no Lipschitz surrogate. The identity is exact for every $\alpha$ even when $K$ is singular; near-coincident centers are a numerical-conditioning issue, not a mathematical overstatement, and stable factorisation or deduplication suffices in floating-point implementations. Plugging this norm into the standard RKHS-ball Rademacher inequality (Bartlett and Mendelson, 2002) yields the Yat-specific generalisation bound below; the only ingredient that depends on the kernel beyond a generic Mercer-section property is the diagonal $k_{b,\varepsilon}(x,x)$, which transfers $b$ and $\varepsilon$ into explicit terms in the bound.
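A small computation illustrates both the closed-form norm and its invariance under repeated centers. The sketch below assumes the kernel form $k_{b,\varepsilon}(x,w)=(\langle x,w\rangle+b)^{2}/(\|x-w\|^{2}+\varepsilon)$; the centers and coefficient vectors are hypothetical. The two coefficient vectors place the same total weight on the repeated center, so they represent the same function.

```python
def yat(x, w, b, eps):
    # Assumed Yat form: (<x, w> + b)^2 / (||x - w||^2 + eps)
    s = sum(a * c for a, c in zip(x, w))
    r2 = sum((a - c) ** 2 for a, c in zip(x, w))
    return (s + b) ** 2 / (r2 + eps)

def gram(centers, b, eps):
    return [[yat(u, v, b, eps) for v in centers] for u in centers]

def rkhs_norm_sq(alpha, centers, b, eps):
    # Squared RKHS norm of f = sum_i alpha_i k(., w_i): the quadratic form alpha^T K alpha
    K = gram(centers, b, eps)
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

b, eps = 0.5, 1.0
w0, w1 = [1.0, 0.0], [0.0, 2.0]
centers = [w0, w1, w1]  # repeated center, so the Gram matrix is singular

# Two representations of the same function: total weight 2.0 on w1 in both
alpha_a = [1.0, 2.0, 0.0]
alpha_b = [1.0, 0.5, 1.5]

norm_a = rkhs_norm_sq(alpha_a, centers, b, eps)
norm_b = rkhs_norm_sq(alpha_b, centers, b, eps)

print("norm^2 (representation a):", norm_a)
print("norm^2 (representation b):", norm_b)
```

For a layer of width $m$ in dimension $d$ this direct evaluation touches all $m^{2}$ center pairs; in practice one would likely vectorise the Gram computation rather than loop as above.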
Theorem 6 (Single-layer Rademacher bound).
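Fix $b\ge 0$, $\varepsilon>0$. Let $\mathcal F_{B}=\{f\in\mathcal H_{b,\varepsilon}:\|f\|_{\mathcal H_{b,\varepsilon}}\le B\}$ be the RKHS ball of radius $B$, $B>0$, and let $x_1,\dots,x_n\in\mathcal X$ be i.i.d. samples. With the convention $R=\sup_{x\in\mathcal X}\|x\|$, $\kappa=\sup_{x\in\mathcal X}k_{b,\varepsilon}(x,x)=(R^{2}+b)^{2}/\varepsilon$,
$$\mathfrak R_{n}(\mathcal F_{B})\;\le\;B\sqrt{\frac{\kappa}{n}}\;=\;\frac{B\,(R^{2}+b)}{\sqrt{\varepsilon\,n}}.\tag{7}$$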
A data-dependent empirical refinement (Corollary 7) replaces the worst-case radius $R$ by its empirical counterpart on the sample. The polynomial diagonal has a real cost on unnormalised high-dimensional inputs and we record it explicitly: under isotropic Gaussian $x\sim\mathcal N(0,I_d)$, $\mathbb E\|x\|^{2}=d$ and the bound scales as $d/\sqrt{\varepsilon n}$, against the constant-diagonal $1/\sqrt{n}$ for Gaussian RBF and $1/\sqrt{\varepsilon n}$ for IMQ. Feature-scale control (e.g. unit-sphere normalisation, which collapses the diagonal to $(1+b)^{2}/\varepsilon$) recovers a dimension-free bound; the analysis makes the dependence explicit via $R$ and $(b,\varepsilon)$ rather than hiding it. The high-dimensional variance preservation of Proposition 5 survives under this normalisation because the relevant quantity is scale-invariant. On the unit sphere, $\|x-w\|^{2}=2-2\langle x,w\rangle$, so $k_{b,\varepsilon}$ becomes a function of the single variable $t=\langle x,w\rangle$, i.e. a zonal kernel; the directional far-field separation of §3 then refines into a sharper spectral statement, with exponential eigenvalue decay in the spherical-harmonic basis (Appendix K).
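The effect of input normalisation on the bound can be seen with a few lines of arithmetic. The sketch below uses the standard empirical RKHS-ball bound $(B/n)\sqrt{\sum_i k(x_i,x_i)}$ and assumes the diagonal $k_{b,\varepsilon}(x,x)=(\|x\|^{2}+b)^{2}/\varepsilon$; the dimension, sample size, and the values of $B$, $b$, $\varepsilon$ are hypothetical.

```python
import math
import random

def yat_diag(x, b, eps):
    # Assumed Yat diagonal: k(x, x) = (||x||^2 + b)^2 / eps
    return (sum(c * c for c in x) + b) ** 2 / eps

def empirical_bound(xs, B, b, eps):
    # Standard empirical RKHS-ball Rademacher bound: (B / n) * sqrt(sum_i k(x_i, x_i))
    n = len(xs)
    return B / n * math.sqrt(sum(yat_diag(x, b, eps) for x in xs))

def worst_case_bound(R, n, B, b, eps):
    # Worst-case form with data radius R: B * (R^2 + b) / sqrt(eps * n)
    return B * (R * R + b) / math.sqrt(eps * n)

rng = random.Random(0)
d, n, B, b, eps = 64, 200, 1.0, 0.5, 1.0  # illustrative values

raw = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
unit = [[c / math.sqrt(sum(t * t for t in x)) for c in x] for x in raw]

R_raw = max(math.sqrt(sum(c * c for c in x)) for x in raw)
bound_raw = empirical_bound(raw, B, b, eps)
bound_raw_worst = worst_case_bound(R_raw, n, B, b, eps)
bound_unit = empirical_bound(unit, B, b, eps)

print("unnormalised Gaussian inputs:", bound_raw, "(worst case:", bound_raw_worst, ")")
print("unit-sphere inputs:          ", bound_unit)
```

On the unit sphere every diagonal entry equals $(1+b)^{2}/\varepsilon$, so the empirical bound reduces to $B(1+b)/\sqrt{\varepsilon n}$ regardless of $d$.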
The Rademacher bound is hidden-layer-local: $B$ is computed from the trained $(W,\alpha)$ of one shared-$(b,\varepsilon)$ Yat layer, $R$ is a data-side radius on the layer’s input, and there is no spectral product across depth or Lipschitz surrogate.
The reproducing-property derivation is generic to any continuous Mercer kernel: a learned-center RBF or IMQ layer admits the same formula and the same diagonal-based Rademacher bound. The Yat-specific content is the form of the diagonal $k_{b,\varepsilon}(x,x)$, which depends nontrivially on $\|x\|$ and on both $b$ and $\varepsilon$, whereas the Gaussian RBF diagonal is the constant $1$ and the IMQ diagonal is $1/\varepsilon$. Ordinary scalar MLP units do not induce such a diagonal at all because they do not define a Mercer section over the shared input/weight space. Evaluating $\alpha^{\top}K\alpha$ at width $m$ on input dimension $d$ costs $O(m^{2}d)$ versus $O(md)$ for a forward pass. Gradient descent driving $b\to 0$ inflates the embedding constant $1/b$ and worsens the conditioning of $K$, so a softplus-style reparametrisation is recommended.
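One way such a reparametrisation might look is sketched below. It assumes that the quantities to be kept away from zero are the bias $b$ and, optionally, the regulariser $\varepsilon$; the floor values and the function names are hypothetical and not taken from the paper. Each hyperparameter is stored as an unconstrained scalar and mapped through a numerically stable softplus plus a small floor.

```python
import math

def softplus(t):
    # Numerically stable log(1 + exp(t))
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def positive_param(raw, floor):
    # Map an unconstrained trainable scalar to a value at or above `floor`
    return floor + softplus(raw)

# Hypothetical floors; in practice they would be tuned per task
B_FLOOR, EPS_FLOOR = 1e-2, 1e-3

def yat_hyperparameters(raw_b, raw_eps):
    # Returns (b, eps) with b >= B_FLOOR and eps >= EPS_FLOOR
    return positive_param(raw_b, B_FLOOR), positive_param(raw_eps, EPS_FLOOR)

# Even a very negative raw value keeps the embedding constant 1 / b bounded
b, eps = yat_hyperparameters(-30.0, -30.0)
embedding_constant = 1.0 / b

print("b =", b, " eps =", eps, " 1/b =", embedding_constant)
```

The floor caps the embedding constant at $1/\texttt{B\_FLOOR}$, at the price of excluding the degenerate regime $b=0$ from the search space; whether that trade-off is appropriate depends on the application.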
The Yat kernel section is uniformly bounded on every compact , by , and globally bounded on for every fixed ; at and the global supremum is attained at (Proposition 4, Appendix E.6). The diagonal used in Theorem 6 is the special case . Classical unbounded scalar activations such as ReLU and GELU satisfy as , so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.
Table 1 compares Yat against the closest classical kernel-valued primitives along the properties that drive layer-local RKHS analysis. Yat is the only row in which a polynomial alignment numerator and IMQ locality coexist while retaining fixed-kernel universality, characteristicness, and a closed-form layer-local RKHS norm. The directional asymptotic trace (Proposition 2) and the three-atom tightness of the IMQ reduction (Theorem 9, App. E.2) turn the table from a rhetorical summary into a structural separation: Yat atoms add directional quadratic far-field components that no finite IMQ combination can match, and no combination of at most two biased atoms can reproduce a single IMQ section.
What singles out Yat in Table 1 is not any one of these properties but their simultaneous combination: IMQ-type locality, polynomial alignment with nonzero quadratic far-field trace, an exact three-atom IMQ-recovery identity, fixed-kernel universality from Loewner domination, and concentration that survives high-dimensional inputs: under isotropic Gaussian inputs, while (Proposition 5, Appendix E.6), because the Yat numerator depends on the one-dimensional projection .
A multiclass margin bound of order with follows from the closed-form norm; a peeling argument (Theorem 13) replaces the worst-case radius by the trained at cost. Four further RKHS-ball comparisons propagate the Loewner domination to interpolation-norm, pullback, alignment-excess, and spectral-effective-dimension bounds (Appendix G).
6 Conclusion and Discussion
The Yat kernel is a single-layer object on which four properties simultaneously hold: PSD/Mercer for $b\ge 0$, fixed-kernel universality for $b>0$ from Loewner-IMQ domination, an exact three-atom finite-difference IMQ identity (sharp in every dimension), and a closed-form layer-local RKHS norm with diagonal-driven Rademacher control. Composing the unit in depth preserves none of these globally: no parameter-independent Mercer kernel on the input domain captures the depth-$L$ Yat-Gram norm of every trained network. The single-layer story does not extend to a global theory, but it survives in two complementary regimes that delimit the structural scope of the analysis.
In the per-prefix regime, we fix the prior layers. The pulled-back kernel is PSD on , and upper-bounds (Theorem 14), with equality on the prefix-range section subspace; inherits universality when and is continuous injective (generic at , not automatic). The pullback construction is closest in form to convolutional kernel networks (Mairal et al., 2014; Mairal, 2016), differing in unit and in the explicit closed-form per-layer norm. In the parameter-independent regime, every depth- coordinate under bounded lies in a fixed ambient Sobolev RKHS (, Theorem 19); the depth dependence is super-exponential, so the result is qualitative containment (degenerate at ), not a capacity certificate. The infinite-width Yat NTK is universal on every compact domain for (Appendix N).
These two finite-depth regimes sit at opposite ends of one trade-off: per-prefix bookkeeping is exact but prefix-dependent, ambient Sobolev containment is parameter-independent but coarse. No single Mercer kernel on absorbs every trained Yat stack’s layer-local Gram structure simultaneously, and we formalise this impossibility as a conjecture.
Conjecture 1 (No exact global deep Yat-Gram norm).
Fix , input domain , layer widths , and the Yat-stack hypothesis class with every prefix continuous, injective on , bilipschitz onto its image, and every Yat layer satisfying , , pairwise distinct centers. For each output coordinate let denote the trained read-out matrix carrying coordinate through layer (with the -th row of the final read-out and the corresponding contribution at intermediate layers via the chain). There exists no Mercer kernel on , parameter-independent of , satisfying, for every output coordinate and every trained Yat stack in the hypothesis class,
References
- Sobolev spaces. 2nd edition, Pure and Applied Mathematics, Vol. 140, Academic Press.
- Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (3), pp. 337–404.
- On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
- Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18 (19), pp. 1–53.
- Local Rademacher complexities. The Annals of Statistics 33 (4), pp. 1497–1537.
- Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482.
- Reproducing kernel Hilbert spaces in probability and statistics. Springer.
- On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems.
- A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning (ICML).
- Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning (ICML).
- Rational neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Multivariable functional interpolation and adaptive networks. Complex Systems 2 (3), pp. 321–355.
- Radial basis functions: theory and implementations. Cambridge University Press.
- A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (2), pp. 121–167.
- Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems.
- Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), pp. 331–368.
- Kernel methods for deep learning. In Advances in Neural Information Processing Systems.
- Rational kernels: theory and algorithms. Journal of Machine Learning Research 5, pp. 1035–1062.
- On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, pp. 265–292.
- Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems.
- ImageNet: a large-scale hierarchical image database. In CVPR.
- A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences 17 (5), pp. 1407–1425.
- The spectrum of kernel random matrices. The Annals of Statistics 38 (1), pp. 1–50.
- The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), pp. 907–940.
- Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research 2, pp. 299–312.
- Linearized two-layers neural networks in high dimension. In Advances in Neural Information Processing Systems.
- A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773. Cited by: Appendix M.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. Cited by: §1.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: Appendix T.
- Matrix analysis. 2nd edition, Cambridge University Press. Cited by: Appendix C, §E.1, §E.7, §I.2.
- Neural tangent kernel: convergence and generalization in neural networks. In NeurIPS, Cited by: Appendix N, Appendix N, Appendix B.
- Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28 (5), pp. 1302–1338. Cited by: §E.6.
- Deep neural networks as Gaussian processes. In ICLR, Cited by: Appendix B.
- Decoupled weight decay regularization. In ICLR, Cited by: Appendix T.
- Convolutional kernel networks. In Advances in Neural Information Processing Systems, Cited by: Appendix B, §6.
- End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
- Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A 209, pp. 415–446. Cited by: Appendix C.
- Universal kernels. Journal of Machine Learning Research 7, pp. 2651–2667. Cited by: Appendix C, §E.3, §E.9, §1, §3, §3, §4, Theorem 8.
- Padé activation units: end-to-end learning of flexible activation functions in deep networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B.
- Fast learning in networks of locally-tuned processing units. Neural Computation 1 (2), pp. 281–294. Cited by: Appendix B.
- Bayesian learning for neural networks. Lecture Notes in Statistics, Vol. 118, Springer. Cited by: Appendix B.
- A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B, §1.
- Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17 (5), pp. 483–506. Cited by: Appendix B.
- Universal approximation using radial-basis-function networks. Neural Computation 3 (2), pp. 246–257. Cited by: Appendix B.
- An introduction to the theory of reproducing kernel Hilbert spaces. Cambridge Studies in Advanced Mathematics, Vol. 152, Cambridge University Press. Cited by: Appendix N, Appendix C, §E.8, Theorem 2, Theorem 3, Theorem 3.
- Optimal approximation of piecewise smooth functions using deep ReLU networks. Neural Networks 108, pp. 296–330. Cited by: §J.1.
- Approximation theory of the MLP model in neural networks. Acta Numerica 8, pp. 143–195. Cited by: Appendix B.
- Learning transferable visual models from natural language supervision. In ICML, Cited by: NeurIPS Paper Checklist.
- Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations (ICLR), Cited by: Table 6.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. Cited by: Appendix T.
- Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
- Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
- Sobolev spaces of fractional order, Nemytskij operators, and nonlinear partial differential equations. De Gruyter. Cited by: §J.1.
- Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics 3 (3), pp. 251–264. Cited by: Appendix B, Appendix F.
- Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press. Cited by: Appendix B, Appendix B.
- Regularization with dot-product kernels. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, Table 1.
- Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12, pp. 2389–2410. Cited by: §J.3, Appendix B, Appendix C.
- Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11, pp. 1517–1561. Cited by: Appendix C, §E.4, Corollary 1.
- Support vector machines. Springer. Cited by: Appendix B, Appendix C, §E.3, §3, §5, Theorem 7.
- Optimal rates for regularized least squares regression. In Conference on Learning Theory (COLT), pp. 79–93. Cited by: §L.2, Appendix L.
- Neural networks and rational functions. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
- Scattered data approximation. Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press. Cited by: Appendix K, Appendix K, Appendix K, Appendix K, Appendix B, Appendix B, Appendix F.
- The Laplace transform. Princeton University Press. Cited by: Appendix C.
- Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
- Deep kernel learning. In AISTATS, Cited by: Appendix B.
Appendix roadmap.
The appendices are organized in three tiers. Tier 1 (proofs of main-body results): Appendix E (one top-level section, four subsections, one per main-body theorem block) contains proofs of every result in Sections 2–5. Tier 2 (extensions of the main-body theory): Appendix G treats exterior-shell separations and the polynomial-separation gap on line-containing domains; Appendix H develops the multiclass learned-norm generalization bound (Rademacher + peeling); Appendix I contains the prefix-pullback theorem and the four Loewner-comparison theorems (interpolation norms, pullback domination, alignment-excess, spectral comparison); Appendix J establishes the generic ambient Sobolev-RKHS containment of bounded smooth-atom stacks; Appendix F records the three-step asymptotic filtration of the biased span and a native-space rate-transfer statement with coefficient-mass lower bound; Appendix P converts the directional-trace separation into a quantitative atom-count bound; Appendix Q states the degree-matching skeleton for a uniqueness conjecture. Tier 3 (orthogonal applications of the same single-layer kernel): Appendix K computes the Funk–Hecke spectrum on , including exponential eigenvalue decay and logarithmic effective dimension; Appendix L derives fast rates via eigenvalue decay (ERM and KRR routes); Appendix M gives quantitative MMD and two-sample-testing sample complexities; Appendix N computes the infinite-width NTK; Appendix O derives an intrinsic RKHS-Lipschitz bound and certified adversarial radius. Tier-3 sections are self-contained and read independently of one another. The CLIP-probe and directional-tail benchmark appendices (Appendices R, S) collect numerical evidence for results stated in the main body, and Appendix T exercises the per-prefix theory inside a trained M Yat causal language model.
Appendix A Preliminaries and Notation
This section consolidates the notation used throughout the paper, both in the main body and in the appendices. Generic RKHS background (Mercer’s theorem, Moore–Aronszajn, universality / characteristicness / SPD, Schur products, Laplace representation) is in Appendix C; named results from the literature we cite by label (Steinwart–Christmann product-kernel containment, Micchelli IMQ universality) are in Appendix D.
| Symbol | Definition | First used |
| Domains and dimensions | | |
| | input dimension; depth- width | §1 |
| | compact input domain | §1 |
| | unit sphere | §1 |
| | closed ball | §6 |
| Kernels and atoms | | |
| | (regularised squared distance) | §1 |
| | (biased Yat atom) | §1 |
| | Yat kernel ; unbiased | §1 |
| | IMQ kernel | §1 |
| RKHS objects | | |
| , , | RKHSs of , , | §3 |
| | squared norm of at distinct | §6 |
| | Loewner PSD order on kernels / Gram matrices | §3 |
| Spans of atoms | | |
| | | §5 |
| | broader Yat spans over resp. , | App. A |
| | | §5 |
| | bounded-variation IMQ class: , | App. G |
| | Yat analogue of | App. P |
| Operators | | |
| | (asymptotic-trace operator) | §4 |
| | quadratic forms on the sphere | §4 |
| | Mercer integral operator | App. L |
| | effective dimension | App. L |
| Network and pullback | | |
| | prefix map, layer map, layer width | §7 |
| | layer- output coordinate and its read-out vector | §7 |
| | (pullback) | §7 |
Standing assumptions.
Throughout, , is compact, , . The sphere is and the closed ball .
Kernels.
The regularised squared distance and the three kernels we work with:
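A plausible reading of these definitions can be inferred from the proof of Theorem 1 (a squared extended inner product divided by a regularised squared distance) and from the Laplace identity for $1/s$. The symbol names $D_\varepsilon$, $k_b$, $k_0$, $k_{\mathrm{IMQ}}$ and the letters $x, w, b, \varepsilon$ are assumptions made for illustration and may differ from the notation used elsewhere in the paper:

```latex
\begin{align*}
  D_\varepsilon(x,w) &= \lVert x - w \rVert^2 + \varepsilon, \qquad \varepsilon > 0, \\
  k_b(x,w) &= \frac{(\langle x, w\rangle + b)^2}{D_\varepsilon(x,w)}, \qquad b \ge 0
      \quad\text{(biased Yat kernel)}, \\
  k_0(x,w) &= \frac{\langle x, w\rangle^2}{D_\varepsilon(x,w)}
      \quad\text{(unbiased Yat kernel)}, \\
  k_{\mathrm{IMQ}}(x,w) &= \frac{1}{D_\varepsilon(x,w)}
      \quad\text{(IMQ kernel)}.
\end{align*}
```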
RKHS objects.
, , denote the RKHSs of , , respectively. The Loewner PSD order is denoted . The RKHS associated with a continuous PSD kernel is the unique Hilbert space of functions on for which holds for all . The squared norm of a finite expansion at distinct centers equals where is the Gram matrix.
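The closed-form norm of a finite expansion can be illustrated numerically. The sketch below assumes the kernel form $(\langle x,w\rangle+b)^2/(\lVert x-w\rVert^2+\varepsilon)$ inferred from the proof of Theorem 1; the function names `yat`, `gram`, `squared_norm` and the sample centers and coefficients are hypothetical.

```python
def yat(x, w, b=1.0, eps=1.0):
    """Assumed Yat kernel: squared biased alignment over regularised squared distance."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)

def gram(centers, kernel):
    """Gram matrix K[i][j] = kernel(centers[i], centers[j])."""
    return [[kernel(u, v) for v in centers] for u in centers]

def squared_norm(alpha, K):
    """Quadratic form alpha^T K alpha, the squared RKHS norm of the expansion."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

# Hypothetical centers (pairwise distinct) and coefficients.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
alpha = [0.5, -1.0, 2.0]

K = gram(centers, yat)
norm_sq = squared_norm(alpha, K)
print(norm_sq)
```

Because the Gram matrix of a PSD kernel is PSD, the quadratic form is nonnegative for any coefficient vector.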
Spans of atoms.
The atom families that appear repeatedly:
Bounded-variation and bounded-center sub-classes used in lower bounds:
Operators on the spans.
The directional asymptotic-trace operator on the radial limit is
defined on the linear subspace where the limit exists for every direction. The image of on is the space of quadratic forms restricted to , . The Mercer integral operator of a kernel on is denoted , with effective dimension .
Network notation.
For a depth- Yat stack with input and per-layer widths ,
where is the prefix map, the layer map, a coordinate index, and the layer- output coordinate. The pulled-back kernel on is .
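As a minimal sketch of the pullback construction, the snippet below composes a kernel with a prefix map on both arguments. The kernel form is the one inferred from the proof of Theorem 1, and the particular `prefix` map (an invertible linear mix followed by `tanh`, hence continuous and injective) is a hypothetical stand-in for a trained prefix.

```python
import math

def yat(x, w, b=1.0, eps=1.0):
    """Assumed Yat kernel: squared biased alignment over regularised squared distance."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)

def prefix(x):
    """Hypothetical prefix map: invertible 2x2 linear mix, then coordinatewise tanh."""
    u = x[0] + 0.5 * x[1]
    v = -0.5 * x[0] + x[1]
    return (math.tanh(u), math.tanh(v))

def pullback(kernel, phi):
    """Return the kernel (x, y) -> kernel(phi(x), phi(y)) on the input domain."""
    return lambda x, y: kernel(phi(x), phi(y))

k_pull = pullback(yat, prefix)
x, y = (0.2, -0.4), (1.0, 0.3)
value = k_pull(x, y)
print(value)
```

Composing both arguments with the same map preserves symmetry and positive semidefiniteness, since every Gram matrix of the pulled-back kernel is a Gram matrix of the original kernel at the mapped points.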
Conventions.
carries the uniform surface measure where unspecified; is the space of real symmetric matrices; is column-major matrix vectorisation; without subscript is the Euclidean inner product on . The Tifinagh letter ⵟ (Unicode U+2D5F) is used as the kernel-specific glyph.
Appendix B Related Work
RBF and learned-center kernel networks.
Radial-basis-function networks [Broomhead and Lowe, 1988, Moody and Darken, 1989] train finite expansions of locally-tuned radial kernels and are universal approximators [Park and Sandberg, 1991]; the classical MLP approximation theory underlying this is surveyed by Pinkus [1999]. The classical theory of native spaces and IMQ universality is developed in Wendland [2004]. With a shared bandwidth, an IMQ or Gaussian RBF layer is already a finite learned-center RKHS expansion admitting the same norm and the same diagonal-based Rademacher bound as Yat. The Yat primitive differs in two respects: it replaces the radial numerator with the polynomial alignment factor , producing a nonzero directional far-field trace that all pure-radial combinations lack; and the bias places a constant-coordinate feature in the polynomial factor’s feature map, establishing fixed-kernel universality via kernel-order domination rather than through a variable-bandwidth construction.
Neural kernels and arc-cosine constructions.
Cho and Saul [2009] pioneered the program of treating neural-network hidden units as Mercer sections, introducing the arc-cosine family as kernels associated with threshold activations. The Yat construction shares the spirit of that program — treating the unit as a kernel section rather than a scalar activation — but trades the integral arc-cosine form for a finite, trainable, rational primitive whose RKHS structure (Mercer property, Loewner-IMQ domination, channel decomposition, finite-difference IMQ embedding) is fully closed-form.
Kernel machines and product kernels.
Polynomially modulated radial kernels and product or nonstationary kernel constructions are classical in kernel machines and Gaussian processes [Schölkopf and Smola, 2002, Steinwart and Christmann, 2008]. The Steinwart–Christmann product-kernel containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] is one of two routes to Yat’s fixed-kernel universality (the other being Loewner-order domination). The contribution here is not the abstract existence of a product kernel, but the specific rational hidden-unit form for which product-kernel universality, an exact IMQ finite-difference identity, a nonzero directional asymptotic trace, and a layer-local closed-form RKHS norm simultaneously hold. Prior product-kernel analyses in the GP and kernel-machine literature do not study this hidden-unit parameterization, its finite-difference structure, or its asymptotic-trace geometry.
Deep-learning generalization and the NTK.
For standard scalar-activation MLPs, layer-level capacity is controlled through spectral or Lipschitz surrogates on the weight matrix [Bartlett et al., 2017, Neyshabur et al., 2018], which are function-level bounds rather than Hilbert-space objects attached to individual units. The historical origin of the neural-network-as-kernel viewpoint is Neal [1996], where infinite-width Bayesian networks induce Gaussian-process priors. The neural tangent kernel [Jacot et al., 2018] is an infinite-width post-hoc object defined at the network output; the NNGP [Lee et al., 2018] indexes a training-sample covariance at the output; the conjugate kernel construction of Daniely et al. [2016] treats layered networks as compositions of Mercer-section objects and is the closest antecedent to our prefix-pullback formulation in Section 6; deep kernel learning [Wilson et al., 2016] places a GP on top of a learned feature extractor; the pullback formalism we use in Section 6 is closely related to the convolutional kernel networks of Mairal et al. [2014], where a kernel is constructed by repeated pullback through learned feature maps. The arc-cosine kernels of Cho and Saul [2009] and random-feature constructions [Rahimi and Recht, 2007, 2008] also yield Mercer-section-flavoured hidden-unit views, in different parameterizations. The spectral analysis of NTKs on the sphere via Funk–Hecke / Gegenbauer expansion is by now a small subliterature in ML: Bietti and Mairal [2019] characterise the NTK spectrum on the sphere, and Cao and Gu [2019] use spectral decay rates analogous to our Theorem 21 in deep-network generalization arguments; Bordelon et al. [2020] connect Mercer-eigendecay and learning curves in the direction our fast-rates appendix follows. 
What is Yat-specific is the conjunction: a finite, trainable, non-radial, rational primitive with fixed-kernel universality from explicit Loewner-IMQ domination, an exact algebraic IMQ embedding sharp at three atoms, a nonzero directional far-field trace, and a closed-form layer-local norm with an explicit input-dependent diagonal.
RKHS generalization bounds.
The single-layer Rademacher bound (Theorem 6) and the multiclass learned-norm generalization bound (Theorem 12) follow the RKHS-ball Rademacher framework of Bartlett and Mendelson [2002]; the multiclass extension uses the pairwise-margin approach of Crammer and Singer [2001]. The modern reference for the chain “universal $\Rightarrow$ characteristic $\Rightarrow$ SPD” on locally compact spaces, including the equivalence between universality and characteristicness under the finite-signed-measure formulation, is Sriperumbudur et al. [2011]. What is Yat-specific is the explicit, input-dependent kernel diagonal , which makes and explicit regularization targets in the bound, unlike the constant diagonal of the Gaussian () or IMQ () kernels.
Approximation rates and ridge-vs-radial separations.
The width-complexity gap (Proposition 7) sits in a longer tradition of separations between ridge and radial approximation. Eldan and Shamir [2016] established an exponential gap for ridge functions against radial expansions; Bach [2017] sharpened this in the convex-neural-network setting with -bounded coefficient classes that closely match our . The Barron-space MLP capacity literature [E et al., 2019] provides the analogous functional class for two-layer networks; positioning Yat against the Barron framework is left to future work, as the Yat layer is a fixed-RKHS object rather than a Barron-norm-bounded class.
Rational and ratio-form kernels.
Kernels written as a polynomial-in-numerator over a polynomial-in-denominator have been studied in machine learning principally through the rational-transducer construction of Cortes et al. [2004], which formalises a class of rational kernels on sequences via finite-state operations and proves Mercer-style closure properties; the abstract framing of “kernel as a ratio of polynomial-like objects” is the same, but the input domain (sequences vs. ) and the algebraic primitives (transducer composition vs. Schur products) are disjoint, so neither the asymptotic-trace separation nor the bias finite-difference IMQ embedding has an analogue in that framework. A separate line of work installs rationality at the scalar activation level rather than at the kernel-section level: Telgarsky [2017] establishes expressivity of rational networks, Molina et al. [2020] learn Padé approximants end-to-end, and Boullé et al. [2020] train low-degree rational activations and obtain favourable approximation rates. These constructions keep the unit a scalar function of , do not define a Mercer section over the joint variable, and therefore do not carry the closed-form RKHS norm or the fixed-kernel universality the Yat construction targets; the axis is orthogonal to ours. Closer in form are the normalised polynomial kernels of the SVM literature [Burges, 1998, Schölkopf and Smola, 2002], which divide by , i.e. by their own self-kernel; the Yat kernel divides instead by the regularised distance , which preserves alignment-dependence in the kernel diagonal and adds local IMQ-type decay that self-kernel normalisation does not. The classical theory of inverse-multiquadric kernels and their native spaces is developed at length in Wendland [2004] and Buhmann [2003], including the Bessel-potential Sobolev characterisation of the IMQ native space we use throughout.
Schaback [1995] established the conditional-positive-definiteness framework that places kernels with vanishing diagonals (such as the unbiased case) in a unified setting with strictly positive radial kernels.
Non-stationary and anisotropic kernels.
The Yat kernel is non-stationary in the sense that depends on rather than only on ; the polynomial alignment numerator is intrinsically anisotropic. This places Yat in a literature on non-stationary covariance constructions in Gaussian processes and kernel machines: Genton [2001] surveys the design space of kernel families including non-stationary and anisotropic constructions, Paciorek and Schervish [2006] develop a class of non-stationary covariance functions parameterised by spatially varying length scales, and Wilson and Adams [2013] introduce spectral mixture kernels that obtain alignment-style modulation through a sum over frequency components. The Yat kernel differs from these constructions by being a single rational expression rather than a mixture or a length-scale field, by carrying an exact algebraic IMQ embedding (Theorem 4) absent from the spectral-mixture and varying-length-scale families, and by admitting a closed-form layer-local RKHS norm that is generic to shared-bandwidth Mercer-section layers.
Appendix C Background
Reproducing kernel Hilbert spaces.
A symmetric function is positive semidefinite (PSD) if for every finite set and coefficients . Every continuous PSD on a compact set is a Mercer kernel and admits a uniformly convergent eigenexpansion with and orthonormal [Mercer, 1909]. The Moore–Aronszajn theorem [Aronszajn, 1950] associates to every PSD a unique reproducing kernel Hilbert space (RKHS) whose reproducing property holds for all . Two properties of RKHSs are used throughout. First, if in the Loewner PSD order (meaning is PSD), then with [Paulsen and Raghupathi, 2016]. Second, the RKHS sum rule: if then with squared infimal-convolution norm .
Universality, characteristicness, and strict positive definiteness.
A continuous PSD kernel on compact is universal if its RKHS is dense in [Micchelli et al., 2006, Steinwart and Christmann, 2008]. We say is characteristic if the kernel mean embedding is injective on finite signed Borel measures on (this finite-signed formulation is the one we use throughout). A kernel is strictly positive definite (SPD) if for every and every set of pairwise distinct points the Gram matrix is positive definite. The implications we use for continuous kernels on compact are
$$\text{universal} \;\Longrightarrow\; \text{characteristic} \;\Longrightarrow\; \text{SPD},$$
where the first arrow is Sriperumbudur et al. [2011] and the second follows because if for distinct nodes then the atomic measure has zero kernel mean embedding. (Under the finite-signed-measure definition above, characteristicness on a compact domain is in fact equivalent to universality by a Hahn–Banach/Riesz argument; we only need the one-way chain stated here. A weaker probability-measure-only definition of characteristicness is strictly weaker than universality, but we do not use it.) For a bounded continuous characteristic kernel on a compact metric space, metrizes weak convergence of probability measures [Sriperumbudur et al., 2010, Thm. 23]; we use this endpoint of the chain throughout.
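The distance between kernel mean embeddings can be estimated from samples. The sketch below computes the plug-in (V-statistic) estimate of the squared embedding distance, commonly called MMD², between two empirical measures. The kernel form is the one inferred from the proof of Theorem 1, and the sample points and function names are hypothetical.

```python
def yat(x, w, b=1.0, eps=1.0):
    """Assumed Yat kernel: squared biased alignment over regularised squared distance."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)

def mean_kernel(A, B, kernel):
    """Average of kernel(p, q) over all pairs p in A, q in B."""
    return sum(kernel(p, q) for p in A for q in B) / (len(A) * len(B))

def mmd_squared(X, Y, kernel):
    """Squared RKHS distance between the mean embeddings of the empirical measures of X and Y."""
    return (mean_kernel(X, X, kernel)
            - 2.0 * mean_kernel(X, Y, kernel)
            + mean_kernel(Y, Y, kernel))

# Hypothetical samples from two different distributions.
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 1.0), (2.0, 2.0)]

same = mmd_squared(X, X, yat)
diff = mmd_squared(X, Y, yat)
print(same, diff)
```

For a characteristic kernel the population quantity vanishes only when the two measures coincide, which is what makes this statistic usable for two-sample testing.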
Schur products and Laplace integrals.
Two closure properties are used repeatedly. (S) The Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]: the entrywise (Hadamard) product of two PSD kernels is PSD. (L) The Laplace representation for [Widder, 1941, Ch. IV]: the IMQ kernel is a nonnegative mixture of Gaussian PSD kernels, hence PSD. Combining (S) and (L), any Schur product with a PSD polynomial kernel is PSD — this is the core of the Yat PSD proof. For the Yat numerator , the feature map requires for to be real; the sign condition is necessary, as Theorem 1 exhibits a counterexample at .
Appendix D Background Theorems from the Literature
This appendix collects the named results invoked by label in the proofs of Sections 2–5. Standard background (Mercer’s theorem, Moore–Aronszajn, universality, characteristicness, strict positive definiteness, empirical Rademacher complexity, the Schur product theorem, and the Laplace representation) is given in Appendix C and not repeated here.
Theorem 7 (Product-kernel RKHS containment [Steinwart and Christmann, 2008, Lemma 4.6]).
Let be continuous PSD kernels on , and let be their RKHSs. Then the Schur product is a continuous PSD kernel with RKHS satisfying: for every and , the pointwise product , with .
Theorem 8 (Micchelli IMQ universality [Micchelli et al., 2006, Thm. 17]).
For every and every compact , the IMQ span is dense in in the uniform norm; equivalently, the IMQ kernel is universal on every compact .
Appendix E Proofs of Main-Body Results
E.1 Proof of Theorem 1: PSD/Mercer for
We restate Theorem 1 for the reader’s convenience and prove it as a single block.
Theorem (Theorem 1 restated).
Let and . The kernel is symmetric and positive semidefinite on . On every compact , is continuous, so Mercer’s theorem yields a decomposition with and orthonormal. By Moore–Aronszajn there exists a unique RKHS with reproducing kernel .
Proof.
Factor , where and .
Factor . Extend symmetrically by : define and (well-defined because ). Then , so is the square of a linear kernel in extended feature space. The linear kernel is PSD; by the Schur product theorem its Hadamard square is also PSD, so is PSD. Equivalently, satisfies , which is real iff . The counterexample below shows that this condition is necessary.
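The extended-feature argument can be checked numerically. The sketch below assumes the numerator has the form $(\langle x,w\rangle+b)^2$ and that the extension appends $\sqrt{b}$ to each vector; the helper names `extended`, `feature_map`, `numerator` and the test vectors are hypothetical.

```python
import math

def extended(x, b):
    """Append sqrt(b) to x; requires b >= 0 for a real-valued extension."""
    return list(x) + [math.sqrt(b)]

def feature_map(x, b):
    """Flattened outer product of the extended vector with itself."""
    xt = extended(x, b)
    return [p * q for p in xt for q in xt]

def numerator(x, w, b):
    """Assumed polynomial numerator (<x, w> + b)^2."""
    return (sum(p * q for p, q in zip(x, w)) + b) ** 2

x, w, b = (0.3, -1.2), (2.0, 0.5), 0.7
phi_x, phi_w = feature_map(x, b), feature_map(w, b)

# Inner product of the features reproduces the numerator.
inner = sum(p * q for p, q in zip(phi_x, phi_w))
print(inner, numerator(x, w, b))

# The last feature coordinate is sqrt(b) * sqrt(b) = b, independent of x.
print(phi_x[-1])
```

The constant last coordinate is the feature that the universality argument in Appendix E.3 relies on when the bias is strictly positive.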
Factor . By the Laplace identity for ,
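$$\frac{1}{\lVert x-w\rVert^{2}+\varepsilon} \;=\; \int_{0}^{\infty} e^{-t\varepsilon}\, e^{-t\lVert x-w\rVert^{2}}\,dt. \tag{8}$$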
For each the Gaussian is a PSD kernel and . A nonnegative mixture of PSD kernels is PSD, so is PSD.
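The Laplace identity $1/s=\int_0^\infty e^{-ts}\,dt$ for $s>0$ is easy to confirm numerically. The sketch below uses composite Simpson quadrature on a truncated interval; the function name `laplace_integral`, the cutoff, and the sample value of `s` (standing in for the regularised squared distance) are hypothetical choices.

```python
import math

def laplace_integral(s, n=20000, cutoff=40.0):
    """Composite Simpson approximation of the integral of exp(-t*s) over [0, cutoff/s].

    n must be even; the neglected tail beyond cutoff/s is of size exp(-cutoff)/s.
    """
    upper = cutoff / s
    h = upper / n
    total = 1.0 + math.exp(-upper * s)  # endpoint values f(0) and f(upper)
    for i in range(1, n):
        weight = 4 if i % 2 else 2      # Simpson weights: 4 on odd nodes, 2 on even
        total += weight * math.exp(-i * h * s)
    return total * h / 3.0

s = 2.5  # stands in for the regularised squared distance
approx = laplace_integral(s)
print(approx, 1.0 / s)
```

Each integrand value $e^{-ts}$ with $s=\lVert x-w\rVert^2+\varepsilon$ is a positive multiple of a Gaussian kernel, which is why the integral exhibits the IMQ factor as a nonnegative mixture of PSD kernels.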
Schur product. By the Schur product theorem [Horn and Johnson, 2012], the entry-wise product of two PSD matrices is PSD. Applied to the Gram matrices of and on an arbitrary finite point set, this shows is PSD on .
Continuity on compact . The denominator is bounded below by , so is a ratio of a continuous numerator and a strictly positive continuous denominator, hence continuous on . Mercer’s theorem applies and yields the decomposition. Moore–Aronszajn gives the unique RKHS.
Counterexample at . Take , , nodes , . The numerator Gram is
with eigenvalues , so the numerator is indefinite. The denominator at these nodes is , both positive, so the Schur product inherits the eigenvalue and is not PSD. ∎
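A failure of positive semidefiniteness for negative bias can be reproduced on two points. The sketch below assumes the kernel form $(\langle x,w\rangle+b)^2/(\lVert x-w\rVert^2+\varepsilon)$; the bias, regulariser, and nodes are illustrative choices that need not coincide with those used in the proof.

```python
def yat(x, w, b, eps):
    """Assumed Yat kernel, evaluated here with a negative bias b."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)

# Illustrative one-dimensional configuration with b < 0.
b, eps = -1.0, 1.0
x1, x2 = (0.0,), (1.0,)

K = [[yat(x1, x1, b, eps), yat(x1, x2, b, eps)],
     [yat(x2, x1, b, eps), yat(x2, x2, b, eps)]]

# A symmetric 2x2 matrix with negative determinant has one negative eigenvalue.
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
print(K, det)
```

Here the second diagonal entry vanishes while the off-diagonal entry does not, so the Gram matrix cannot be PSD.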
E.2 Proof of Theorem 4: bias-finite-difference IMQ embedding
Theorem (Theorem 4 restated).
Fix . For every and every , the bias-finite-difference identity (2) holds identically in . In particular, . The containment is strict.
Proof.
The denominator is independent of the bias parameter. For fixed set . Then
The numerator second-order difference is
independent of . Dividing by gives ; multiplying both sides by recovers (2). Since the three biases are all strictly positive, so the identity uses only atoms from and hence .
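The second-difference identity can be verified numerically. The sketch below assumes the atom form $(\langle x,w\rangle+b)^2/(\lVert x-w\rVert^2+\varepsilon)$ and a forward difference at biases $b$, $b+h$, $b+2h$; the exact bias grid used in identity (2) is an assumption, and the sample vectors are hypothetical.

```python
def yat(x, w, b, eps=1.0):
    """Assumed biased Yat atom."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)

def imq(x, w, eps=1.0):
    """Assumed IMQ atom: reciprocal of the regularised squared distance."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)

x, w = (0.4, -1.3, 2.0), (1.5, 0.2, -0.7)
b, h = 0.5, 0.25  # all three biases b, b+h, b+2h are strictly positive

# (a+b+2h)^2 - 2(a+b+h)^2 + (a+b)^2 = 2h^2 for every alignment value a.
second_diff = yat(x, w, b + 2 * h) - 2 * yat(x, w, b + h) + yat(x, w, b)
recovered = second_diff / (2 * h * h)
print(recovered, imq(x, w))
```

The alignment term cancels in the second difference, which is why the recovered atom is purely radial.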
Strictness. Fix and a unit vector with . Along the ray ,
while every finite linear combination obeys
The two far-field limits disagree, so for every . By Lemma 3 this same unbiased section belongs to via for any , so it witnesses strict containment . ∎
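The far-field separation can be observed numerically along a ray. The sketch below assumes the unbiased atom $\langle x,w\rangle^2/(\lVert x-w\rVert^2+\varepsilon)$ and the IMQ atom $1/(\lVert x-w\rVert^2+\varepsilon)$; the center `w`, direction `u`, and radius `r` are hypothetical.

```python
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def dist2(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def yat_unbiased(x, w, eps=1.0):
    """Assumed unbiased Yat atom."""
    return dot(x, w) ** 2 / (dist2(x, w) + eps)

def imq(x, w, eps=1.0):
    """Assumed IMQ atom."""
    return 1.0 / (dist2(x, w) + eps)

w = (1.0, 2.0)
u = (0.6, 0.8)          # unit direction with <u, w> = 2.2, which is nonzero
r = 1e6
x = tuple(r * ui for ui in u)

trace = dot(u, w) ** 2  # expected directional limit <u, w>^2
far_yat = yat_unbiased(x, w)
far_imq = imq(x, w)
print(far_yat, trace, far_imq)
```

The Yat value approaches the squared alignment of the direction with the center, while the IMQ value decays like the inverse squared radius, so no finite IMQ combination can match the Yat limit.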
Lemma 2 (Irreducibility and coprimality of the regularised distance).
Let , , and for set . Then (i) is irreducible in ; (ii) for , and are coprime in .
Proof.
(i) Translate by : with , . Suppose this factors as in with . Both factors are linear: , . Matching coefficients gives
Since , neither nor vanishes, so the linear constraint gives — the two direction vectors are parallel. But then has rank at most , while has rank . Contradiction; is irreducible. (ii) If and shared an irreducible factor for , since both are irreducible they would equal each other up to a scalar; comparing the degree- part vs. then forces , contradiction. ∎
Theorem 9 (Tightness of the factor- reduction).
Let , . For every , no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of .
Outline. The argument splits by dimension. For we use Lemma 2 (irreducibility and coprimality of in ) and a case-by-case polynomial analysis. For , factors over as , so Lemma 2 fails; we instead match poles directly in the complex plane.
Proof for .
We first reduce to non-trivial atoms. An atom with , or with (i.e. and ), contributes zero; dropping it gives a one-atom or zero-atom identity. A zero-atom combination is identically for . A single non-trivial atom: clearing denominators in gives in , where and . If : cancelling gives , impossible since is nonconstant for . If : by Lemma 2, is irreducible (hence prime in the UFD ) and coprime to ; from we get , but and : contradiction. Hence in the one-atom case. Assume both atoms are non-trivial, , , and for contradiction . Let and . Clearing denominators gives
Case 1: . Then and becomes where . Subcase : , so . The degree- homogeneous part of is . For and this vanishes only if ; the degree- part then forces , so , and , contradicting . Subcase : by Lemma 2, and are distinct irreducibles in , hence coprime. From : and , so . Since , we have for some constant ; then , a degree- polynomial equal to a constant, forcing : contradiction.
Case 2: (the case is symmetric). Then ; cancelling from gives , hence
Since , by Lemma 2 and are coprime irreducibles in , so . Both have degree , so for some . Comparing degree- homogeneous parts: . For and , the polynomials and are linearly independent in , witnessed by two evaluations: (i) at , the values are and respectively, with ratio ; (ii) at any non-zero (which exists for and is the only step that uses the dimensional hypothesis), the values are and , with a different ratio, so the two polynomials are not proportional. At both polynomials equal up to a non-zero scalar, so Case 2 genuinely requires ; Case 1 applies for all . Hence , giving : contradiction.
Case 3: pairwise distinct. By Lemma 2, each is irreducible in and distinct are pairwise coprime. Since is a unique factorisation domain (UFD), irreducibles are prime and the principal ideal is therefore prime, making an integral domain. Reducing modulo : . Since , the classes are nonzero in ; since is an integral domain and , we get , i.e. . But and : contradiction. By symmetry (quotient by ), : contradiction. ∎
Proof for via complex pole matching.
At the variables collapse to scalars: , and factors over as , so Lemma 2 no longer applies. Instead, assume for contradiction
holds for all with , . The standing hypothesis gives in (this is what excludes the trivial case , where one biased atom already realises the IMQ atom). By analytic continuation the identity extends to . The right-hand side has exactly two simple poles, at . We partition by the pattern of equality among into four mutually exclusive and jointly exhaustive cases: contains (1) three distinct values; (2) exactly and both differ from ; (3) exactly one of equals (and the two atom centers differ); (4) .
Case 1: pairwise distinct. The first atom contributes potential simple poles at . These poles are present in neither the second atom (whose poles are at ) nor the right-hand side (whose poles are at ). Hence they must be removable, requiring the linear factor to vanish at :
Since are real with , separating real and imaginary parts forces and , i.e. the first atom is identically zero. The same argument applied to atom 2 forces . The identity collapses to , contradicting .
Case 2: . The two atoms share denominator . The combined left-hand-side numerator has degree . Equality of rational functions implies equality of pole sets with multiplicity; since , the LHS poles at must cancel, forcing . Since , for some constant . Then LHS is constant on , while RHS is non-constant (since ). Contradiction.
Case 3: (and the symmetric subcase ). The atom whose centre differs from has poles at the off- locations; by the Case 1 argument applied to that atom alone, the atom is identically zero. The identity reduces to one atom , i.e. . The left side is a polynomial in of degree exactly (using and for a non-trivial atom), while the right side is a non-zero constant. Contradiction.
Case 4: . LHS and RHS share denominator ; equating numerators gives the polynomial identity
Matching the coefficient: , so (since ). Matching the coefficient: , so (since ). Substituting back into the constant term: . Contradiction.
All four cases yield a contradiction, so the factor-$3$ lower bound holds at $d = 1$ as well. ∎
E.3 Proof of Proposition 1: singular universality at
Proposition (Proposition 1 restated).
Let be compact, , . Then is dense in in the uniform norm.
Proof.
Write with the IMQ kernel and .
Step 1: the IMQ factor is universal. The RKHS of is dense in for every compact by the Micchelli universality theorem [Micchelli et al., 2006].
Step 2: the polynomial factor has a constant feature. The numerator kernel has the explicit feature map , so . One coordinate of is the nonzero constant .
Step 3: the polynomial RKHS contains constants. The RKHS of is . The dual coefficient yields for every , so the constant function with (an upper bound because is non-minimal, so the RKHS norm is the infimum over representers).
Step 4: contains . By the product-kernel RKHS containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] (also Theorem 7 above), the RKHS of contains every pointwise product with and , and . Take from Step 3, with . Then for every the pointwise product lies in , and . Thus the embedding has operator norm , recovering the Loewner bound of Theorem 2. (The cleaner route to the same conclusion is the Loewner-order proof in the main body.)
Step 5: density. Let and . By Step 1, there exists with . Step 4 gives . Hence is dense in . ∎
Remark 1 (Necessity of ).
The constant feature in is what makes Step 4 produce constants in . At this coordinate vanishes and the argument fails. When additionally , the simpler observation for every forces every to satisfy , ruling out density in .
E.4 Characteristic kernel and strict positive definiteness
Corollary (Characteristic kernel, restating Corollary 1(i)).
For , , the kernel is characteristic on every compact .
Proof.
By Proposition 1, is universal on . A universal continuous PSD kernel on compact is characteristic: if in for some finite signed Borel , then by the reproducing property for every . By density, for every , so by the Riesz representation theorem. The metrization of weak convergence by then follows from Sriperumbudur et al. [2010, Thm. 23]. ∎
Corollary (Strict positive definiteness and Gram invertibility, restating part of Cor. 1).
For , , and any pairwise distinct , the Gram matrix is strictly positive definite, hence invertible.
Proof.
Let and set , the finite signed Borel measure on with atomic masses . By bilinearity of the kernel mean embedding,
If , the mean embedding of in is zero. Since is characteristic on every compact (Corollary 1(i)), the mean embedding is injective on finite signed Borel measures, hence . Distinctness of then forces for every . Equivalently, . ∎
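As a numerical illustration of the corollary, the following minimal sketch builds a Gram matrix on six pairwise distinct points and inspects its smallest eigenvalue. The kernel form $(\langle x, x'\rangle + b)^2 / (\|x - x'\|^2 + \epsilon)$, the helper name `yat_gram`, and the choice of points and parameters are assumptions made for illustration.

```python
import numpy as np

def yat_gram(X, b, eps):
    """Gram matrix of the assumed biased Yat kernel on the rows of X."""
    dots = X @ X.T
    sq_norms = np.diag(dots)
    sq_dists = sq_norms[:, None] - 2.0 * dots + sq_norms[None, :]
    return (dots + b) ** 2 / (sq_dists + eps)

# Six pairwise distinct points: +/- 2 e_i in R^3 (illustrative choice).
X = np.vstack([2.0 * np.eye(3), -2.0 * np.eye(3)])
G = yat_gram(X, b=1.0, eps=1.0)

eigvals = np.linalg.eigvalsh(G)          # ascending order
min_eig = eigvals[0]
cond = eigvals[-1] / eigvals[0]
print(f"min eigenvalue = {min_eig:.4f}, condition number = {cond:.2f}")
```

A strictly positive smallest eigenvalue is what the corollary predicts. A finite check of this kind illustrates the statement but does not prove it.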
E.5 Proof of Proposition 3 and Theorem 6: closed-form norm and single-layer Rademacher bound
Proposition (Proposition 3 restated).
Let , , and . Then with .
Proof.
By bilinearity and the reproducing property,
The identity is exact for every coefficient vector , including when is rank-deficient. If two centers coincide or two coefficient vectors represent the same function (equivalently, ), then bilinearity of the RKHS inner product gives , since both equal . The pseudoinverse expression is algebraically equal to on PSD via , so either form may be used. Coincident or near-coincident centers therefore raise a numerical-conditioning question, not a mathematical correctness question; in floating-point implementations, deduplicating exactly-coincident centers and using a stable factorisation (e.g. pivoted Cholesky) on the remaining strictly-PD block is sufficient.
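The remarks above on coincident centers can be checked numerically. The sketch below assumes the kernel form $(\langle x, w\rangle + b)^2 / (\|x - w\|^2 + \epsilon)$ and uses hypothetical names (`yat_gram`, `a_merged`). It repeats one center so that $G$ is rank-deficient, then compares the quadratic form for two coefficient vectors that represent the same function, and compares it with the pseudoinverse expression.

```python
import numpy as np

def yat_gram(W, b, eps):
    """Gram matrix of the assumed biased Yat kernel on the rows of W."""
    dots = W @ W.T
    sq_norms = np.diag(dots)
    sq_dists = sq_norms[:, None] - 2.0 * dots + sq_norms[None, :]
    return (dots + b) ** 2 / (sq_dists + eps)

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))
centers = np.vstack([centers, centers[:1]])   # repeat centre 0: G is rank-deficient
G = yat_gram(centers, b=1.0, eps=1.0)

a = np.array([1.0, 0.5, -0.3, 0.2, 0.7])
# Same function: move the weight of the duplicate (index 4) onto index 0.
a_merged = np.array([1.7, 0.5, -0.3, 0.2, 0.0])

norm_sq = a @ G @ a
norm_sq_merged = a_merged @ G @ a_merged

# Pseudoinverse form: with y = G a, y^T G^+ y = a^T G G^+ G a = a^T G a.
values = G @ a
norm_sq_pinv = values @ np.linalg.pinv(G, rcond=1e-10) @ values

print(norm_sq, norm_sq_merged, norm_sq_pinv)
```

All three numbers should agree up to rounding. The explicit `rcond` cutoff is a design choice that keeps the pseudoinverse from amplifying the numerically zero singular value.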
Theorem (Theorem 6 restated).
Fix , . Let , , i.i.d. samples. Then .
Proof.
For any and , the reproducing property of the fixed kernel gives . By Cauchy–Schwarz,
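$$|f(x)| = |\langle f, k(\cdot, x)\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}}\,\sqrt{k(x, x)}. \tag{8}$$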
The diagonal of the biased Yat kernel is
since and . For ,
The standard RKHS Rademacher bound for a ball of radius follows from the reproducing property and Jensen’s inequality (Bartlett and Mendelson 2002, Lemma 22, with the one-sided convention used at line 7). Writing , pulls out a factor , and Jensen exchanges with the square root:
The constant is exactly $1$ (no symmetrization factor of $2$) under this one-sided convention. Bounding the per-sample term by and taking the square root gives
∎
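The chain of inequalities in this proof can be illustrated on a small sample by enumerating all sign vectors. The sketch below is not the authors' code. It assumes the kernel form $(\langle x, x'\rangle + b)^2 / (\|x - x'\|^2 + \epsilon)$, and the two bound expressions, `(B / n) * sqrt(trace(K))` and `B * (R**2 + b) / sqrt(eps * n)`, are reconstructed from the steps of the proof and should be read as assumptions.

```python
import itertools
import numpy as np

def yat_gram(X, b, eps):
    """Gram matrix of the assumed biased Yat kernel on the rows of X."""
    dots = X @ X.T
    sq_norms = np.diag(dots)
    sq_dists = sq_norms[:, None] - 2.0 * dots + sq_norms[None, :]
    return (dots + b) ** 2 / (sq_dists + eps)

rng = np.random.default_rng(0)
n, d, b, eps, B = 8, 3, 1.0, 0.5, 2.0
X = rng.normal(size=(n, d))
R = np.linalg.norm(X, axis=1).max()      # sample radius
K = yat_gram(X, b, eps)

# For a fixed sign vector s, the supremum over the RKHS ball of radius B of
# (1/n) sum_i s_i f(x_i) equals (B/n) * sqrt(s^T K s) (one-sided convention).
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
quad = np.einsum("si,ij,sj->s", signs, K, signs)
rademacher_exact = ((B / n) * np.sqrt(np.maximum(quad, 0.0))).mean()

data_dependent = (B / n) * np.sqrt(np.trace(K))     # after Jensen
worst_case = B * (R**2 + b) / np.sqrt(eps * n)      # after bounding the diagonal

print(rademacher_exact, data_dependent, worst_case)
```

With $n = 8$ the average runs over all $2^8$ sign patterns, so the first number is the exact empirical Rademacher complexity of the ball rather than a Monte Carlo estimate.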
Corollary (Unbiased case).
At , .
E.6 Uniform Boundedness and High-Dimensional Variance
Proposition 4 (Uniform boundedness with sharpness).
For , , , ,
For every fixed , , , the section is globally bounded on . At and the global supremum is , attained at . For the unbiased section is identically zero. For the supremum on does not have a clean closed form; it is bounded by on any compact domain with , .
Proof.
Compact-domain bound. By Cauchy–Schwarz, , so . Also , so . Nonnegativity follows from the squared numerator and positive denominator.
Global boundedness at fixed . Set ; since and ,
For this gives ; for , continuity and give .
Sharpness at . Decompose with . Then and , so
This is monotonically decreasing in , so the supremum is attained at . Writing and , reduce to the one-variable maximisation
Differentiating, , so the unique positive critical point is . Substituting:
attained at . ∎
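The one-variable maximisation in the sharpness step can be checked numerically. In the sketch below the closed form $\|w\|^2(\|w\|^2 + \epsilon)/\epsilon$ and the maximiser $x^\star = (1 + \epsilon/\|w\|^2)\,w$ are re-derived from the proof's steps under the assumed unbiased kernel $\langle x, w\rangle^2 / (\|x - w\|^2 + \epsilon)$. They are assumptions of this sketch, not quotations of the displayed formulas.

```python
import numpy as np

def yat_unbiased(x, w, eps):
    """Assumed unbiased Yat section; x may be one point or an array of points."""
    return (x @ w) ** 2 / (np.sum((x - w) ** 2, axis=-1) + eps)

rng = np.random.default_rng(0)
d, eps = 4, 0.3
w = rng.normal(size=d)
rho_sq = w @ w

sup_closed = rho_sq * (rho_sq + eps) / eps      # candidate supremum
x_star = (1.0 + eps / rho_sq) * w               # candidate maximiser
value_at_x_star = yat_unbiased(x_star, w, eps)

# Random search around the candidate maximiser should never exceed the closed form.
samples = x_star + rng.normal(scale=2.0, size=(200_000, d))
max_sampled = yat_unbiased(samples, w, eps).max()

print(sup_closed, value_at_x_star, max_sampled)
```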
The diagonal value used in the Rademacher proof is the special case of the kernel definition. Classical unbounded scalar activations such as ReLU and GELU satisfy as , so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.
High-dimensional variance preservation under isotropic Gaussian inputs.
A second structural difference between the Yat unit and a purely radial unit shows up in the high-dimensional input regime: the IMQ output concentrates while the Yat output does not. The phenomenon “isotropic Gaussian inputs concentrate radial kernels in high $d$” is by now standard in random-matrix and neural-network theory: El Karoui [2010] established the original spectral collapse for kernel random matrices on isotropic high-dimensional inputs, and Ghorbani et al. [2019] characterise the same concentration for radial features in two-layer linearised networks. The proposition below is the matching structural statement at the level of the kernel diagonal: under the same inputs, the Yat numerator depends on a single one-dimensional projection and resists the collapse.
Proposition 5 (High-dimensional variance preservation).
Let , let have unit norm , and fix , . For a positive random variable , write . As ,
Proof.
By rotational invariance of the standard Gaussian, take . Then , and the regularised distance satisfies (non-central chi-squared, degrees of freedom, non-centrality ). Standard moment formulas give and .
IMQ unit. For , second-order Taylor expansion of around (justified since a.s. and by chi-squared concentration) gives
so .
Yat unit. Set , so . Direct calculation gives , , and
(the contributions of to are independent of , so do not enter the covariance). Bivariate Taylor expansion of around yields
Substituting the dimensional scalings , , and , the three terms are , dominated by . Hence .
Higher-order Taylor remainders are controlled by combining the chi-squared concentration with a high-probability lower bound on . Specifically, the chi-squared concentration inequality [Laurent and Massart, 2000, Lemma 1] gives for some absolute constant and all . Split the expectation over the events and its complement: on , deterministically and the standard chi-squared moment formula gives ; on , the deterministic floor together with the deterministic upper bound (since on the relevant range) gives , which is asymptotically negligible. We do not assume independence of and , which fails; the deterministic bound circumvents this. Combining, for . This is the bound the bivariate Taylor expansion needs: the leading variance term dominates the corrections, and the asymptotic CV scaling is unaffected. The earlier deterministic bound alone is too loose to control the remainder at this scale, since grows with ; the high-probability split above is the correct bookkeeping. ∎
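A small Monte Carlo experiment makes the contrast in Proposition 5 visible. The sketch below assumes the unit forms $1/(\|x - w\|^2 + \epsilon)$ for IMQ and $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$ for Yat, and takes the coefficient of variation to be standard deviation over mean. The function names, sample sizes, and parameter values are illustrative choices.

```python
import numpy as np

def coefficient_of_variation(values):
    return values.std() / values.mean()

def simulate_cv(d, n_samples=4000, b=1.0, eps=1.0, seed=0):
    """Return (CV of IMQ unit, CV of Yat unit) for x ~ N(0, I_d), unit-norm w."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, d))
    w = np.zeros(d)
    w[0] = 1.0                                    # unit-norm centre
    denom = np.sum((x - w) ** 2, axis=1) + eps
    imq = 1.0 / denom
    yat = (x @ w + b) ** 2 / denom
    return coefficient_of_variation(imq), coefficient_of_variation(yat)

results = {d: simulate_cv(d) for d in (10, 100, 1000)}
for d, (cv_imq, cv_yat) in results.items():
    print(f"d = {d:5d}   CV(IMQ) = {cv_imq:.3f}   CV(Yat) = {cv_yat:.3f}")
```

In runs of this kind the IMQ column shrinks as $d$ grows while the Yat column stays of order one, which is the mechanism the proposition isolates. The simulation says nothing about non-Gaussian inputs.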
Remark 2.
Proposition 5 is a structural calculation under the specific toy model of isotropic Gaussian inputs and unit-norm centers; real ML feature distributions are not isotropic Gaussian, and trained Yat centers need not be unit-norm. What the proposition isolates is the asymptotic mechanism: the IMQ denominator concentrates around its mean of order , so the radial output concentrates around and its relative spread vanishes; the Yat numerator depends only on the one-dimensional projection and so does not concentrate, yielding an relative spread that survives the limit.
E.7 Additional proofs from Section 2
Proof of Theorem 2.
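By Theorem 1 the kernel $k_b(x,w) = (\langle x,w\rangle + b)^2/(\|x-w\|^2+\epsilon)$ is PSD, so its RKHS is well-defined and the Loewner-order Aronszajn inclusion below is meaningful. Writing $k_{\mathrm{IMQ}}(x,w) = 1/(\|x-w\|^2+\epsilon)$, compute $k_b - b^2 k_{\mathrm{IMQ}} = (\langle x,w\rangle^2 + 2b\langle x,w\rangle)\,k_{\mathrm{IMQ}}$. The kernels $\langle x,w\rangle^2$ and $\langle x,w\rangle$ are PSD, and the scalar $2b \ge 0$ (using $b \ge 0$) makes $2b\langle x,w\rangle$ a nonnegatively scaled PSD kernel; $k_{\mathrm{IMQ}}$ is PSD (IMQ); Schur products of PSD kernels are PSD (Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]). The sum is PSD; Aronszajn’s inclusion theorem then yields the RKHS containment and norm bound. For $b = 0$ this reduces to the trivial $k_0 \succeq 0$; the RKHS inclusion is meaningful only for $b > 0$. ∎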
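The Loewner-order argument can be sanity-checked on a random sample. The sketch below assumes the kernel forms $1/(\|x - w\|^2 + \epsilon)$ for IMQ and $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$ for Yat. It confirms that the Gram difference equals the Schur product of the polynomial channels with the IMQ Gram matrix, and that its smallest eigenvalue is nonnegative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eps = 30, 5, 0.7, 0.5
X = rng.normal(size=(n, d))

dots = X @ X.T
sq_norms = np.diag(dots)
G_imq = 1.0 / (sq_norms[:, None] - 2.0 * dots + sq_norms[None, :] + eps)
G_yat = (dots + b) ** 2 * G_imq            # entrywise (Schur) product

excess = G_yat - b**2 * G_imq
channels = (dots**2 + 2.0 * b * dots) * G_imq   # quadratic + linear alignment channels

eigs = np.linalg.eigvalsh(excess)
min_eig, max_eig = eigs[0], eigs[-1]
print(f"min eig of G_yat - b^2 G_imq: {min_eig:.3e} (max {max_eig:.3e})")
```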
E.8 Proofs from Section 3
Proof of Lemma 1.
Every finite-span element vanishes at infinity since each atom decays as . The finite span is dense in ; if in RKHS norm, the reproducing property gives
so uniformly. Since is closed under uniform limits, . ∎
Proof of Corollary 2.
Proof of Theorem 3.
The RKHS sum rule states that with the infimal-convolution norm [Paulsen and Raghupathi, 2016]; extended to three summands by induction. Each is PSD: since is PSD and ; because the dot-product kernel is PSD (), is PSD, their Schur product is PSD by the Schur product theorem, and (the channel vanishes at ); by the Schur product theorem applied to and , both PSD. The RKHS sum rule therefore applies. ∎
Proof of Proposition 2.
For (4), write
uniformly in . For the IMQ side, , so finite combinations satisfy . The image of contains every rank-one quadratic form ; the symmetric rank-one tensors span (the off-diagonal generators are obtained as ), so . ∎
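The far-field behaviour used in this proof is easy to observe numerically. The sketch below assumes the atom forms $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$ and $1/(\|x - c\|^2 + \epsilon)$, and evaluates both along a ray $x = r u$. The limit $\langle u, w\rangle^2$ for the Yat atom is derived here from those forms and is an assumption about the stripped display. The specific vectors are illustrative.

```python
import numpy as np

def yat(x, w, b, eps):
    return (x @ w + b) ** 2 / (np.sum((x - w) ** 2) + eps)

def imq(x, c, eps):
    return 1.0 / (np.sum((x - c) ** 2) + eps)

w = np.array([1.0, 2.0, -0.5])
b, eps = 0.5, 1.0
u = np.array([2.0, 1.0, 2.0]) / 3.0      # unit direction with <u, w> = 1
trace = (u @ w) ** 2                     # candidate directional far-field trace

for r in (1e1, 1e2, 1e3, 1e4):
    print(f"r = {r:8.0f}   yat = {yat(r * u, w, b, eps):.6f}   "
          f"imq = {imq(r * u, w, eps):.3e}")

yat_far = yat(1e6 * u, w, b, eps)
imq_far = imq(1e6 * u, w, eps)
```

The Yat column approaches `trace` while the IMQ column decays like $r^{-2}$, so no finite IMQ combination with bounded coefficients can reproduce a nonzero trace along the ray.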
Corollary 5 (Concrete large-radius regression gap).
Fix , , , and choose with . Let . Then for every finite IMQ combination ,
Hence for every there exists such that . In particular, a single Yat atom carries an directional signal on a sufficiently large exterior ray that no finite IMQ expansion can uniformly match there.
Proof of Corollary 5.
Proposition 2 gives and , hence . Given , choose so that for all : and . Then , and the supremum over preserves the bound. ∎
Proposition 6 (Far-field gradient asymptotics).
Fix , , , and a unit direction with . Along the ray , the center-gradients of the IMQ section and the Yat section satisfy
Remark 3.
The proposition records an asymptotic difference in the center-gradient of a single Yat atom versus a single IMQ atom at large , evaluated in isolation. It is a statement about the kernel itself, not about gradients of trained networks: in a finite expansion the per-atom gradients combine through the readout coefficients , and far-field signal can be informative or uninformative depending on where the data actually live.
Proof of Proposition 6.
For the IMQ atom, . With , the numerator and the denominator , so .
For the Yat atom, set and . The quotient rule gives
Substituting , the first term equals . The leading-order behaviour of numerator and denominator is over , giving the limit . The second term has numerator over denominator , giving . Summing yields the stated limit. ∎
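The quotient-rule computation can be cross-checked against finite differences. The sketch below assumes the atom forms $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$ and $1/(\|x - w\|^2 + \epsilon)$. The gradient expressions and the far-field value $2\langle u, w\rangle u$ are derived here from those forms, so they are assumptions of this sketch rather than the displayed statement. All function names are hypothetical.

```python
import numpy as np

def yat(x, w, b, eps):
    return (x @ w + b) ** 2 / (np.sum((x - w) ** 2) + eps)

def grad_w_yat(x, w, b, eps):
    # Quotient rule with N = (<x, w> + b)^2 and D = ||x - w||^2 + eps.
    s = x @ w + b
    D = np.sum((x - w) ** 2) + eps
    return 2.0 * s * x / D + 2.0 * s**2 * (x - w) / D**2

def grad_w_imq(x, w, eps):
    D = np.sum((x - w) ** 2) + eps
    return 2.0 * (x - w) / D**2

def finite_difference(f, w, step=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = step
        g[i] = (f(w + e) - f(w - e)) / (2.0 * step)   # central difference
    return g

w = np.array([1.0, 2.0, -0.5])
b, eps = 0.5, 1.0
u = np.array([2.0, 1.0, 2.0]) / 3.0       # unit direction with <u, w> = 1

x_near = 3.0 * u
fd = finite_difference(lambda v: yat(x_near, v, b, eps), w)
analytic = grad_w_yat(x_near, w, b, eps)

r = 1e5
yat_far = grad_w_yat(r * u, w, b, eps)
imq_far = grad_w_imq(r * u, w, eps)
print(yat_far, np.linalg.norm(imq_far))
```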
Proposition 7 (Width-complexity gap for quadratic ridge functions).
Let , , , , , and . Set and fix in the regime , with . For , write .
(i) Yat upper bound. The single Yat atom satisfies for every , where is an explicit polynomial in .
(ii) IMQ lower bound. Every with uses at least atoms; whenever stays bounded as , this is .
Proof of Proposition 7.
(i) Substituting ,
For , and . If , then , and . Hence
Setting the right-hand side to and taking to dominate both (first term, sufficient for ) and (second term, sufficient for ) gives as claimed.
(ii) Restrict attention to the inscribed Euclidean ball . By rotational invariance of , take , and consider the affine slice , where is the -ball of radius . On , in the slice direction and the uniform approximation hypothesis (strengthened to ) gives for every . For any ,
Combining, for every . Equivalently, where denotes the closed Euclidean ball in . The cross-section is contained in a -dimensional ball of radius at most , with -volume bounded by . The slice has -volume , so volume-counting gives
which rearranges to . By Stirling’s approximation, the Gamma factors cancel between the slice and the ball volumes; absorbing the constant into the implicit Stirling-error term and taking logs gives , which is for any that does not grow with . ∎
E.9 Additional proofs from Section 4
Lemma 3 (The unbiased section lies in the positive-bias span).
Fix and . The map is quadratic in , so for every
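$$k_0(x,w) = 3\,k_h(x,w) - 3\,k_{2h}(x,w) + k_{3h}(x,w), \qquad k_b(x,w) := \frac{(\langle x,w\rangle + b)^2}{\|x-w\|^2+\epsilon}, \tag{9}$$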
where every atom on the right has bias in . In particular .
Proof of Lemma 3.
is independent of , so is a univariate quadratic in . Three values over-determine a quadratic, and the unique linear extrapolant to has coefficients on respectively, since for any quadratic . Dividing by gives (9). ∎
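The extrapolation identity with coefficients $(3, -3, 1)$ can be verified directly. The sketch below assumes the atom form $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$. The point, center, and step $h$ are arbitrary illustrative values.

```python
import numpy as np

def yat(x, w, b, eps):
    return (x @ w + b) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(0)
x, w = rng.normal(size=5), rng.normal(size=5)
eps, h = 0.5, 0.8

unbiased = yat(x, w, 0.0, eps)
# Quadratic extrapolation to b = 0 from the positive biases h, 2h, 3h.
extrapolated = (3.0 * yat(x, w, h, eps)
                - 3.0 * yat(x, w, 2.0 * h, eps)
                + yat(x, w, 3.0 * h, eps))
print(unbiased, extrapolated)
```

The identity is just the vanishing third finite difference of a quadratic, so it holds for every $h > 0$, not only the value used here.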
Corollary 6 (Strict span containment).
for every . Concretely, for every the unbiased section lies in (Lemma 3) but not in . This is the span-level companion to the RKHS-level strict containment of Corollary 2: the same separation holds at the level of finite linear spans of atoms, witnessed by the directional far-field trace rather than by Loewner domination.
Proof of Corollary 6.
Proof of Corollary 3.
E.10 Additional proofs from Section 5
Corollary 7 (Data-dependent empirical bound).
Under the assumptions of Theorem 6, the same proof yields the data-dependent bound
| (10) |
computable from the sample alone and at most on any sample with .
Appendix F Asymptotic Filtration and Native-Space Rate Transfer
This appendix records two structural results of the unbiased and biased Yat spans that extend the main-body theory: a three-step asymptotic filtration of the biased span detected by the directional-trace operator and its higher-order analogues, and a native-space rate-transfer statement coupled with a coefficient-mass lower bound.
Three-step asymptotic filtration of the biased span.
The biased Yat span admits a three-step filtration
detected by linear asymptotic-trace operators , , acting at orders , , on radial rays. The successive quotients are canonically isomorphic to the spaces of restrictions of homogeneous quadratic, linear, and constant forms on , decomposing further into spherical-harmonic sectors , , . This sharpens the directional-trace separation of Proposition 2 from a single-witness statement (Yat atoms have nonzero quadratic shadow, IMQ atoms have zero shadow) into a full filtration whose quotients refine into spherical-harmonic sectors. The first two non-trivial levels are strictly nested: via an explicit four-atom construction, and as a direct sum (the linear-alignment span is disjoint from the IMQ span). The intrinsic characterisation of and whether are open.
Native-space rate transfer and coefficient-mass lower bound.
The exact bias-finite-difference reduction of Theorem 4 carries IMQ scattered-data approximation theory into at a factor- overhead in atom count: every finite IMQ approximant on a compact set induces a biased-span approximant with identical pointwise error. The IMQ kernel has an analytic native space, so the scattered-data theory of Schaback [1995], Wendland [2004] delivers fill-distance approximation rates that are exponential in for native-space targets, in contrast to the polynomial rates of -Sobolev kernels; under the bias finite-difference reduction these exponential rates transfer to with at most a factor- overhead. In the opposite direction, the codimension-one zero set produces approximation lower bounds: on , fewer than unbiased atoms cannot approximate the constant function to error below , and the total weighted coefficient cost is bounded below by .
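The bias finite-difference reduction referred to here can be illustrated as follows. The sketch assumes the atom form $(\langle x, w\rangle + b)^2/(\|x - w\|^2 + \epsilon)$ and uses a centered stencil $(b - h, b, b + h)$ with normalisation $2h^2$. That stencil is an assumption consistent with the description "second finite difference in the bias", not a quotation of Theorem 4.

```python
import numpy as np

def yat_atom(X, w, bias, eps):
    """Assumed biased Yat atom evaluated at the rows of X."""
    return (X @ w + bias) ** 2 / (np.sum((X - w) ** 2, axis=1) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
w = rng.normal(size=4)
eps, b, h = 0.5, 1.0, 0.5                 # biases 0.5, 1.0, 1.5 are all positive

imq_atom = 1.0 / (np.sum((X - w) ** 2, axis=1) + eps)

# (u + h)^2 - 2 u^2 + (u - h)^2 = 2 h^2 for every u, so the numerator collapses.
recovered = (yat_atom(X, w, b + h, eps)
             - 2.0 * yat_atom(X, w, b, eps)
             + yat_atom(X, w, b - h, eps)) / (2.0 * h**2)

max_err = np.max(np.abs(recovered - imq_atom))
print(f"max pointwise error: {max_err:.2e}")
```

Because the identity is exact and pointwise, any finite IMQ expansion with $m$ atoms maps to a biased Yat expansion with at most $3m$ atoms and the same values.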
Appendix G Far-Field Separations
We write
For fixed shared , let denote the RKHS of . For , write
This appendix collects the radial-vs-Yat far-field separations: an asymptotic exterior-shell bound (in both $L^\infty$ and $L^2$ versions) and a polynomial-separation gap on line-containing domains. The multiclass-generalization bounds are given in Appendix H, and the prefix-pullback and Loewner-comparison theorems in Appendix I.
G.1 Asymptotic Exterior-Shell Separation
We complement the bounded-domain gap of Section 2 with asymptotic and separations on exterior shells for . For , write and for the bounded-variation classes and with and .
Lemma 4 (Uniform radial decay on exterior shells).
Let . For , ; for , .
Proof.
by reverse triangle inequality, so and . Sum against . ∎
Lemma 5 (Quantitative Yat far-field approximation).
Fix , , and . Let and . For every and every ,
In particular, as .
Proof.
Let , , so . Then and . The numerator of equals , whose absolute value is at most . The denominator . Combining gives . ∎
Theorem 10 (Far-field separation: and ).
Fix , , , , , , , and let be as in Lemma 5.
(i) -bound on the exterior shell.
with the analogous bound over replacing the IMQ decay term by .
(ii) risk lower bound under directional sampling. Let be a probability distribution on with independent of , , and . Then every satisfies
with the analogous RBF bound. A single Yat atom realises at zero error in either norm.
Proof.
(i) Take . For , by Lemma 5 and by Lemma 4. The triangle inequality and supremum over give the IMQ bound; the RBF case is identical. (ii) (Lemma 5 transferred ), so reverse triangle gives . from Lemma 4. A second reverse-triangle step, positive part, and squaring give the IMQ bound; RBF identical. ∎
Remark 4.
Both bounds require and ; lifting either restriction allows radial methods to place centers in the shell or use coefficients growing with the radius, and the bounds become vacuous.
G.2 Polynomial Separation
Theorem 11 (Polynomial approximation gap on line-containing domains).
Let , , , and . Let be the space of degree-at-most-two polynomials. Suppose contains a nontrivial line segment with and a compact interval with nonempty interior. Then
Proof.
Let . Along , . Restrictions of to this line are univariate polynomials of degree at most two. Suppose for all with . By real-analyticity this extends globally: . If has degree , the right side has degree (since is monic quadratic), while the left has degree two, so , i.e. constant. Then . If : impossible since . If : the left side has real double root while has nonreal roots ; contradiction. Hence , and since is closed in , its distance from is strictly positive. ∎
Appendix H Learned-Norm Multiclass Generalization
Let and . For a -class predictor with each , define . For a finite Yat head with Gram matrix , the reproducing property gives .
Define the ramp loss , the pairwise margin surrogate , and the empirical surrogate .
Lemma 6 (Rademacher bound for pairwise margin loss).
Let . Then
Proof.
By subadditivity, . The ramp loss is -Lipschitz, so by contraction, where restricts to on the class- support. Setting and applying Cauchy–Schwarz in the product Hilbert space gives . By Jensen’s inequality and the reproducing property, . Summing over the ordered pairs gives the result. ∎
Theorem 12 (Learned-norm multiclass generalization bound).
With probability at least over i.i.d. samples from , every satisfies
For a finite shared- Yat head, .
Proof.
The zero-one loss is dominated by . A standard Rademacher generalization inequality for -valued losses, combined with Lemma 6, gives the result. The norm identity follows from the reproducing property. ∎
Remark 5.
The actual learned RKHS norm , not a post-hoc radius, controls the margin class.
Theorem 13 (Data-dependent learned-norm bound by peeling).
Fix . With probability at least , every finite shared- Yat head satisfies
where and . For a finite Yat head, .
Proof.
Standard peeling argument: for set and ; note . Apply Theorem 12 to with confidence . For any with , one has , giving the data-dependent bound after substituting . ∎
Corollary 8 (Regularized training controls the learned norm).
If training returns satisfying , then and in particular .
Proof.
The zero predictor has for every , giving . The assumed inequality and then give . ∎
Appendix I Pullback and Loewner-Comparison Theorems
I.1 Layerwise Composition: Pullback RKHS Structure
This section proves a prefix-conditioned pullback statement: once the prefix of a trained network is fixed, the next Yat layer induces a valid RKHS on the original input space.
Let be compact. For a depth- Yat stack, write
where layer has coordinates . Write and .
When learned centers do not lie in the prefix range , we view each layer coordinate as the restriction to of a function in the global RKHS on . Equivalently, define the extended domain , apply the RKHS construction on , and restrict back to ; the restriction RKHS carries the quotient norm. The norm bound below holds in either view.
Theorem 14 (Prefix-conditioned pullback RKHS: PSD, quotient norm, universality).
Let be a continuous PSD kernel on and a measurable map. Define and .
(a) PSD and norm bound. is PSD on . For every , with .
(b) Exact quotient-norm characterisation. where is orthogonal projection onto ; the infimum is attained at .
(c) Universality propagates through injective continuous prefixes. If is compact, is continuous and injective, and is universal on , then is universal on .
Application to layer- Yat. Specialising to , , , part (a) gives the per-layer norm bound , with equality (via (b)) iff . For and continuous injective , part (c) together with Proposition 1 yields universality of on .
Proof.
(a) For and , by PSD of . Let be the canonical feature map and , so . The reproducing property gives , so with the displayed norm bound. Specialising and using delivers the per-layer Yat-Gram bound.
(b) The map defined on sections by is an isometric isomorphism (kernel-section inner products agree). Under , corresponds to , so . The affine set equals , minimised at .
(c) compact and Hausdorff with continuous injective makes a homeomorphism. For , set ; by universality of , find with . Then and . ∎
I.2 RKHS Ball Comparisons: Interpolation, Pullback, and Spectral Structure
The kernel-order domination of Theorem 2 propagates to three further structural comparisons: finite-sample interpolation norms, deep-prefix pullback geometry, and integral-operator spectra.
Finite-sample interpolation norm comparison.
For training points , define empirical Gram matrices and .
Theorem 15 (Finite-sample interpolation norm comparison).
as PSD matrices. If and both Gram matrices are strictly positive definite, then , and for any label vector ,
The minimum RKHS norm to interpolate at in is at most times the minimum interpolation norm in .
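The interpolation-norm comparison can be checked on synthetic data. The sketch below assumes the kernel forms $1/(\|x - x'\|^2 + \epsilon)$ and $(\langle x, x'\rangle + b)^2/(\|x - x'\|^2 + \epsilon)$, and takes the minimum squared interpolation norm to be $y^\top G^{-1} y$. The factor $b^{-2}$ on the squared norms is reconstructed from the Loewner bound and is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eps = 20, 5, 0.7, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                    # arbitrary label vector

dots = X @ X.T
sq_norms = np.diag(dots)
G_imq = 1.0 / (sq_norms[:, None] - 2.0 * dots + sq_norms[None, :] + eps)
G_yat = (dots + b) ** 2 * G_imq

# Minimum-norm interpolants: coefficients alpha = G^{-1} y, squared norm y^T alpha.
alpha_yat = np.linalg.solve(G_yat, y)
alpha_imq = np.linalg.solve(G_imq, y)
norm_sq_yat = y @ alpha_yat
norm_sq_imq = y @ alpha_imq

print(f"Yat: {norm_sq_yat:.4f}   IMQ / b^2: {norm_sq_imq / b**2:.4f}")
```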
Pullback kernel-order domination for deep prefixes.
Theorem 16 (Pullback domination).
Let be any measurable map, and define and . Then on , and
In particular, for any trained Yat stack with per-layer bias , the RKHS ball comparison of Theorem 2 survives at every layer, conditionally on the trained prefix.
Proof.
For any and : , since is PSD by Theorem 2. Aronszajn’s inclusion theorem then gives the containment. ∎
Gram-ball alignment excess.
Proposition 8 (Alignment excess decomposition).
For any trained prototype set , define and . Set ; is PSD by Schur-product factorization. For any ,
where . Moreover,
equivalently iff the finite expansion is the zero function in the RKHS . In particular, if the alignment Gram matrix is nonsingular on the chosen prototypes, equality forces . The trained Yat norm decomposes into a radial IMQ component and a non-negative alignment excess .
Proof.
Substituting , where , gives the decomposition. Non-negativity follows from the PSD property , which is the Schur product of the PSD kernels (with multiplier ) plus and . The equality condition follows because for a PSD matrix , iff , equivalently iff in . ∎
Spectral domination.
For a probability measure on , define the integral operators and , both compact self-adjoint positive operators on .
Theorem 17 (Spectral domination and effective-dimension comparison).
as operators on . Consequently,
and for every and the effective dimensions satisfy
Proof.
For any , , since is PSD (Theorem 2). The eigenvalue bound follows from the min-max theorem. For the effective dimension, use monotonicity of together with :
and summing in gives . ∎
Theorem 18 (Layerwise perturbation propagation).
Let and be two depth- networks acting on . Let be the union of the two intermediate ranges. Suppose each is -Lipschitz on and . Then
The sup is over the relevant intermediate representations actually visited; it is not required to hold globally on .
Proof.
Let with . Then . Unrolling gives . ∎
Lemma 7 (Lipschitz bound for one Yat layer).
Let , , , . Define
Each atom is -Lipschitz on . The coordinate is Lipschitz with constant , and the vector layer is Lipschitz with constant .
Proof.
With and , . Using , , and gives . The coordinate and vector-layer bounds follow by the chain rule and the Frobenius norm bound on the Jacobian. ∎
Appendix J Generic Sobolev-RKHS Containment for Bounded Smooth-Atom Stacks (Applied to Yat)
This appendix proves Theorem 19. The result is an ambient containment theorem: every bounded finite-depth Yat stack lies in one fixed Sobolev restriction RKHS on the original input domain. It does not claim that RKHSs are closed under nonlinear composition, nor does it replace the exact single-layer Gram norm with an exact global Gram norm.
J.1 Sobolev restriction RKHS
Let be compact and let be a bounded Lipschitz domain with . Since bounded Lipschitz domains are Sobolev extension domains [Adams and Fournier, 2003], the Banach-algebra and Moser-composition estimates transfer to without loss; the Sobolev-induction template we use here is standard in deep-network analysis, with the closest neighbours being Petersen and Voigtländer [2018] for -class deep ReLU networks and E et al. [2019] for the Barron-space variant. Fix . Define
By the Sobolev embedding (valid for ), point evaluation is continuous on , making it an RKHS on .
Lemma 8 (Sobolev algebra and Nemytskii inverse stability).
Let be a bounded Lipschitz domain and . Then:
1. (Banach algebra.) is a Banach algebra: there exists such that for all .
2. (Nemytskii inverse stability.) If satisfies for every , then . Moreover, there exists a nondecreasing function such that .
Proof.
Part (1): the Sobolev multiplication theorem for on Lipschitz extension domains [Adams and Fournier, 2003, Thm. 4.39]. Part (2): since (Sobolev embedding), is bounded above. Apply the Moser/Nemytskii composition estimate [Runst and Sickel, 1996, Ch. 5] to on ; the resulting norm bound depends only on , , , and the domain . ∎
J.2 Network class and parameter budgets
For a depth- Yat stack, write and, for and ,
Assume the uniform budgets
Note (not ): the Sobolev containment holds for any bounded real bias; is required only for the Mercer/RKHS regime of the single-layer results elsewhere in the paper. The width enters only through ; if per-atom bounds are assumed, take .
J.3 Full Proof of Global Sobolev-RKHS Containment
Theorem 19 (Global ambient RKHS containment).
Let be compact and a bounded Lipschitz domain. Fix and define with the quotient norm. Under the budgets , , , , every coordinate of every admissible Yat stack belongs to , with for finite constants depending only on and not on the trained parameters. The reproducing kernel of is universal on , hence characteristic.
Proof.
We induct on . Write (finite since is bounded).
Base case (). Each coordinate is a polynomial, hence in , with for some depending on and .
Inductive step. Assume and for all .
Denominator. . Algebra bounds give . Pointwise, . By Lemma 8(2), with .
Atom. by another application of (1), with
Layer coordinate. . Set . Restricting from to and using the quotient norm gives , closing the induction. The constants depend only on the budgets and not on the trained parameters. ∎
Remark 6 (Growth of in depth is super-exponential).
The recurrence above defines a sequence of finite constants , but the dependence on depth is severe and worth recording explicitly. Each step of the induction is at least quadratic: the algebra bound contributes a square in , the denominator bound contributes another square, and the Moser-type inverse-stability factor has, in the standard chain-rule decomposition for on , polynomial dependence of degree on (with prefactors that absorb from the -norms of on the interval ). Composing through one Yat layer therefore takes via a polynomial of degree at least , and iterating times gives
i.e., a tower-of-twos blow-up in depth. Theorem 19 should therefore be read as a qualitative containment statement ( for every admissible Yat stack and every ), not as a quantitatively useful capacity certificate at any practical depth. Improving the depth dependence of — to polynomial-in-, or even exponential-in- — would require either restricting the activation atom to exponential-decay-friendly functions (Gaussian RBF, in place of the rational ) or using a different ambient class (e.g., a depth-aware Besov scale) that absorbs the chain-rule blow-up. Both routes are out of scope here. The theorem we record is the parameter-independent containment of bounded-budget Yat stacks in a fixed universal RKHS; the constants are not the load-bearing content.
Corollary 9 (Universality and characteristicness of ).
is universal on , hence characteristic.
Proof.
Every polynomial restricted to belongs to (polynomials are smooth on bounded ), and hence its restriction to belongs to . By Stone–Weierstrass, polynomials are dense in . Characteristicness follows from universality on compact [Sriperumbudur et al., 2011]. ∎
Remark 7 (What this theorem does and does not give).
Theorem 19 is an ambient containment result. It does not give an exact deep Yat-Gram norm: the identity is a single-layer statement. For a depth- stack the induced kernel on the original input space is the prefix-dependent pullback ; the global Sobolev RKHS avoids this prefix-dependence at the cost of giving only a coarse ambient norm bound.
Remark 8 (Sign of bias and Mercer regime).
The proof requires only ; the sign of is irrelevant for Sobolev containment. The condition is required only for the Mercer/RKHS structure of the single-layer Yat kernel (Theorem 1). In the common case where a single shared bias is used, the theorem applies with .
Appendix K Spectral Structure on the Sphere
The Mercer eigenvalues of on a generic compact domain are governed by Sobolev regularity of the IMQ factor and decay polynomially. On the unit sphere , the kernel reduces to a zonal function whose Funk–Hecke decomposition yields an exponential eigenvalue decay rate, sharper than the polynomial Sobolev bound. The sphere is the operationally relevant subdomain in feature-normalised practice (layer-norm, -normalised CLIP embeddings) and the analytically cleanest setting: , so the Yat kernel becomes a univariate function of the inner product , and the Funk–Hecke formula [Wendland, 2004, Sec. 10.4] diagonalises the integral operator in spherical-harmonic sectors. We work with the unbiased ; the biased for adds two terms and that share the same off-interval pole at , so the asymptotic decay rate below is unchanged and only the prefactors and low-frequency content shift. Downstream consequences for fast generalization rates appear in Appendix L.
For , the unbiased Yat kernel restricts to the zonal function
$$\kappa(t) = \frac{t^2}{2 - 2t + \epsilon}, \qquad t = \langle x, w\rangle \in [-1, 1], \tag{11}$$
which extends meromorphically to with a single simple pole at and residue .
Theorem 20 (Funk–Hecke spectrum of on the sphere).
The integral operator associated with the zonal kernel (11) admits the spectral decomposition , where is orthogonal projection onto the space of degree- spherical harmonics (multiplicity ), and the eigenvalues are given by the Funk–Hecke formula
with the Gegenbauer polynomial of index .
Proof.
is continuous on (the pole lies strictly outside the closed interval since ), so the kernel is bounded and continuous on . The associated operator is therefore Hilbert–Schmidt, hence compact and self-adjoint on . Spherical harmonics simultaneously diagonalise every zonal Hilbert–Schmidt operator by the Funk–Hecke formula [Wendland, 2004, Sec. 10.4], and the displayed integral is the standard Gegenbauer expansion coefficient on the eigenspace of degree . ∎
Theorem 21 (Exponential decay rate).
The Funk–Hecke eigenvalues of on satisfy as , with
Proof.
Strategy. The Funk–Hecke expansion of in Gegenbauer polynomials gives , where is the Mercer eigenvalue and is the Gegenbauer polynomial of degree . The asymptotic decay rate for simple-pole kernels is recovered by the standard residue argument: the closest singularity of in the complex -plane to the interval controls the radius of convergence of the Gegenbauer expansion via the standard Bernstein-ellipse / pole-distance correspondence [Wendland, 2004, Ch. 12]. We do not extract from the Gegenbauer generating function directly; rather, we use the generating function only to identify the relevant ellipse parameter . For (), the Gegenbauer polynomials reduce to Chebyshev polynomials of the first kind and the generating function takes the logarithmic form ; the same Bernstein-ellipse argument applies, with the asymptotic decay rate unchanged.
Joukowski substitution. The Bernstein ellipse with foci is the image of under the Joukowski map for . The map is a conformal bijection from onto , with the inverse selecting the branch for real (so ). Under this substitution, a real point corresponds to (the outer pre-image), and equivalently to (the inner pre-image inside the unit disk). Substituting ,
and the inner pre-image is the corresponding singularity location in the Gegenbauer-generating-function variable.
Upper bound. For every , is analytic on the open ellipse and continuous up to its boundary (the pole lies strictly outside ). Bernstein’s theorem on Gegenbauer expansions [Wendland, 2004, Ch. 12] gives . Letting yields for every .
Matching lower bound. Decompose
A direct computation gives , a polynomial of degree , so the only Bernstein-ellipse-bounded contributions to the Gegenbauer coefficients of for come from the simple-pole term. The Gegenbauer generating identity is
The denominator factors as when , so the simple pole at in -coordinates pulls back to a simple pole at in the generating-function variable. Standard residue analysis at then gives the Gegenbauer coefficients of as with (the Gegenbauer-weight residue evaluated at , which converges to a positive constant as ). Hence . ∎
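The decay rate can be checked numerically in the simplest case, the sphere in three dimensions, where the Gegenbauer polynomials reduce to Legendre polynomials. The sketch below is illustrative only and rests on two stated assumptions: the unbiased Yat kernel is taken to be `<x, w>^2 / (||x - w||^2 + eps)`, so that its zonal profile is `t^2 / (2 - 2t + eps)` with pole at `t0 = 1 + eps/2`, and the ellipse parameter is taken to be `rho = t0 + sqrt(t0^2 - 1)`. The function names and the use of the unnormalised surface measure (Funk–Hecke constant `2*pi`) are choices made for this example.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval


def zonal_profile(t, eps):
    """Assumed unbiased Yat profile on the unit sphere: t^2 / (2 - 2t + eps)."""
    return t**2 / (2.0 - 2.0 * t + eps)


def funk_hecke_eigenvalues_s2(eps, max_degree, n_nodes=200):
    """lambda_l = 2*pi * int_{-1}^{1} kappa(t) P_l(t) dt, by Gauss-Legendre quadrature."""
    nodes, weights = leggauss(n_nodes)  # quadrature rule on [-1, 1]
    kappa = zonal_profile(nodes, eps)
    eigenvalues = np.empty(max_degree + 1)
    for degree in range(max_degree + 1):
        coeffs = np.zeros(degree + 1)
        coeffs[degree] = 1.0  # selects the Legendre polynomial P_degree
        p_l = legval(nodes, coeffs)
        eigenvalues[degree] = 2.0 * np.pi * np.sum(weights * kappa * p_l)
    return eigenvalues


def bernstein_rho(eps):
    """Assumed ellipse parameter for a pole at t0 = 1 + eps/2."""
    t0 = 1.0 + eps / 2.0
    return t0 + np.sqrt(t0**2 - 1.0)


eps = 0.5
lam = funk_hecke_eigenvalues_s2(eps, max_degree=25)
rho = bernstein_rho(eps)
ratios = lam[1:] / lam[:-1]
print("rho:", rho)
print("lambda_21 / lambda_20:", ratios[20], "compared with 1/rho:", 1.0 / rho)
```

Under these assumptions the consecutive ratio approaches `1/rho` only up to a slowly varying algebraic factor, so the comparison is expected to agree to a few percent at moderate degree rather than exactly.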
Corollary 10 (Logarithmic effective dimension on ).
The effective dimension of on satisfies
Proof.
Set . Each eigenvalue has multiplicity , so . By Theorem 21, exactly when . The head from . The tail by geometric summation. Adding gives the displayed rate. ∎
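The head/tail split in this proof can be illustrated on a model spectrum. The sketch below is not the Yat spectrum itself: it assumes eigenvalues exactly equal to `rho**(-l)` with multiplicity `2l + 1` (the sphere in three dimensions), and compares the resulting effective dimension with the squared crossover index `(log(1/lam) / log(rho))**2`. The function name and the truncation degree are hypothetical choices.

```python
import numpy as np


def effective_dimension_s2(rho, lam, max_degree=500):
    """N(lam) = sum_l (2l + 1) * mu_l / (mu_l + lam) for the model spectrum mu_l = rho**(-l)."""
    degrees = np.arange(max_degree + 1)
    mu = rho ** (-degrees.astype(float))
    multiplicity = 2 * degrees + 1  # number of degree-l spherical harmonics on S^2
    return float(np.sum(multiplicity * mu / (mu + lam)))


rho = 2.0
for lam in (2.0**-10, 2.0**-20, 2.0**-30):
    crossover = np.log(1.0 / lam) / np.log(rho)  # degree at which mu_l falls to lam
    print(lam, effective_dimension_s2(rho, lam), crossover**2)
```

Degrees below the crossover each contribute at least half their multiplicity, and degrees above it contribute a geometrically decaying remainder, which is why the two printed columns stay within a constant factor of each other.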
Remark 9 (Comparison with classical kernels and downstream rate).
The polynomial kernel has rank on (only are non-zero); the IMQ kernel shares the singularity at on the spherical reduction, hence the same exponential rate ; the Gaussian RBF has super-exponential decay . The unbiased Yat kernel on the sphere therefore inherits the asymptotic IMQ rate, while the polynomial numerator alters only the prefactor and contributes finite-rank low-frequency content. The exponential decay drives a near-parametric rate for both ERM and KRR on the sphere; see Appendix L.
Appendix L Fast Generalization Rates via Eigenvalue Decay
The single-layer Rademacher bound of Theorem 6 is the worst-case rate , optimal only when no spectral information is available. Mercer eigenvalue decay yields fast rates via local Rademacher complexity [Bartlett et al., 2005] for empirical-risk minimisation, and via Caponnetto–De Vito bias-variance bounds [Caponnetto and De Vito, 2007, Steinwart et al., 2009] for kernel ridge regression. On the sphere, the exponential decay of Theorem 21 drives both routes to a near-parametric rate.
L.1 Local Rademacher upper bound for ERM
Theorem 22 (Fast rate via Mercer eigenvalue decay).
Let be compact with , and let be a Borel probability measure on . Suppose the Mercer eigenvalues of on satisfy for some . Under realisability and i.i.d. samples with conditionally sub-Gaussian noise of variance proxy , the empirical-risk minimiser over the ball satisfies
for an absolute constant .
Proof.
By the diagonal of , (the unbiased section is the case of the diagonal used in Theorem 6). Apply Bartlett et al. [2005, Theorem 3.3] to the function class : the local Rademacher complexity at radius is bounded by , and the fixed point of the inequality satisfies under polynomial decay . The local Rademacher excess-risk bound yields the displayed expectation inequality. ∎
Corollary 11 (Near-parametric ERM rate on the sphere).
For and the uniform surface measure, Theorem 21 gives with multiplicity . The local Rademacher fixed point becomes , yielding
where hides factors polynomial in .
Proof.
On the kernel diagonal is . Substituting the Funk–Hecke spectrum into the local Rademacher fixed point, . The crossover index is determined by , giving . The head contributes , the tail times the same scale, hence . The fixed-point equation self-consistently gives . ∎
L.2 Refined KRR bias-variance bound
Theorem 23 (Refined Yat KRR excess risk).
Under the setup above with conditionally sub-Gaussian noise of variance proxy and source condition , the kernel ridge regression estimator with regularisation satisfies the bias-variance bound
where is the Yat effective dimension.
Proof.
The bounded diagonal together with the source condition place us inside the bias-variance template of Caponnetto and De Vito [2007] (see also Steinwart et al., 2009). Their Theorem 1 yields, with probability at least ,
where the leading variance term is scale-invariant under and the lower-order term tracks the kernel scale. Substituting for and absorbing the high-probability factor into the constants gives the displayed expectation bound. ∎
Theorem 24 (Near-parametric KRR rate on the sphere).
Let with the uniform surface measure, and choose . Then , with logarithmic factor .
Proof.
Remark 10 (Comparison with Sobolev RKHS).
A Sobolev- RKHS on with has KRR rate , sub-parametric in [Caponnetto and De Vito, 2007]. The Yat kernel’s analytic profile gives the strictly faster rate uniformly in for targets in , which is a strictly smaller class than any Sobolev space of the same domain: the comparison reflects the higher analytic regularity of the Yat native space rather than a uniform improvement on a fixed function class.
Remark 11 (ERM and KRR are companion routes).
Remark 12 (Reading the variance constant).
The leading variance term in Theorem 23 is , scale-invariant under ; the kernel diagonal enters only as a polynomial prefactor of the lower-order remainder. A data-dependent refinement replacing with the empirical fourth moment in the remainder term would require a separate matrix-Bernstein argument and is not implied by Theorem 23.
Appendix M Quantitative MMD and Two-Sample Testing
Corollary 1 establishes that is characteristic on every compact for : the kernel mean embedding is injective on Borel probability measures. We strengthen this qualitative property to a quantitative sample-complexity statement for two-sample testing, with the kernel-specific constant on the diagonal.
Theorem 25 (Empirical MMD convergence rate for ).
Let be Borel probability measures on a compact set with , and let be i.i.d. samples from and i.i.d. from , all mutually independent. The unbiased U-statistic estimator
satisfies
for an absolute constant , hence .
Proof.
Corollary 12 (Sample complexity for kernel two-sample testing).
Fix significance level and threshold . The test that rejects when has Type-I error on the null and power on alternatives with , provided
Proof.
Apply Theorem 25 with Chebyshev’s inequality on both null (, deviation ) and alternative (, deviation ) regimes; the deviation probability is bounded by under the displayed sample-size lower bound. ∎
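For concreteness, the unbiased U-statistic can be sketched as follows. This is a minimal illustration, assuming the biased Yat kernel has the form `(<x, y> + b)^2 / (||x - y||^2 + eps)`; the function names, the default values of `b` and `eps`, and the allowance for unequal sample sizes are assumptions made for the example.

```python
import numpy as np


def yat_gram(X, Y, b=1.0, eps=1.0):
    """Gram matrix of the assumed Yat kernel (x.y + b)^2 / (||x - y||^2 + eps)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    inner = X @ Y.T
    sq_dist = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * inner
    return (inner + b) ** 2 / (np.maximum(sq_dist, 0.0) + eps)


def mmd2_unbiased(X, Y, b=1.0, eps=1.0):
    """Unbiased MMD^2 estimate: within-sample sums exclude the diagonal terms."""
    m, n = len(X), len(Y)
    k_xx = yat_gram(X, X, b, eps)
    k_yy = yat_gram(Y, Y, b, eps)
    k_xy = yat_gram(X, Y, b, eps)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)))
shifted = mmd2_unbiased(rng.standard_normal((200, 2)), 1.5 + rng.standard_normal((200, 2)))
print("same distribution:", same)
print("shifted distribution:", shifted)
```

A test in the sense of Corollary 12 would compare such an estimate with a threshold; the threshold and sample-size constants are those of the corollary and are not reproduced here.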
Remark 13 (Trade-off in and ).
The sample complexity scales as and as : a smaller or a larger inflates the diagonal and thereby the variance constant of the U-statistic. The same diagonal drives the Rademacher constant in Theorem 6, while in Theorem 23 the kernel diagonal enters only as a prefactor on the lower-order remainder, with the leading variance term controlled by the effective dimension alone.
Appendix N The Yat Neural Tangent Kernel at Infinite Width
The biased atom is the natural neural unit attached to . We compute the infinite-width Neural Tangent Kernel [Jacot et al., 2018, Arora et al., 2019] of a width- shared-bias Yat layer and show that the limit is itself dominated by an IMQ kernel via the bias decomposition, giving universality on every compact for .
Setup.
Fix , , . We adopt the standard NTK parametrisation [Jacot et al., 2018]: the width- shared- Yat layer is
with i.i.d. initialisation and . The empirical NTK is
Theorem 26 (Yat NTK closed form).
As , in probability on every compact , with
Both summands are continuous PSD kernels on , hence so is .
Proof.
Under the NTK parametrisation, , so
The summands are i.i.d. with mean and bounded by Proposition 4 on compact ; the law of large numbers gives convergence in probability to . Similarly, , hence
By independence of and together with , the law of large numbers gives convergence in probability to . PSD of both summands follows from the Gram representation: where is the Gaussian measure, and is the same statement applied coordinate-wise to the gradient. Continuity follows from continuity of and on compact sets together with dominated convergence under . ∎
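The Gram representation used for the first summand can be illustrated by Monte Carlo. The sketch below estimates only the readout-weight term, the average over random centers of `k(x, w) k(x', w)`, and confirms that the resulting matrix is symmetric PSD. It assumes the Yat atom `(<x, w> + b)^2 / (||x - w||^2 + eps)` and Gaussian centers with a hypothetical scale `sigma`; the gradient term with respect to the centers is omitted.

```python
import numpy as np


def yat_features(X, W, b, eps):
    """phi[n, j] = (x_n . w_j + b)^2 / (||x_n - w_j||^2 + eps), the assumed Yat atom."""
    inner = X @ W.T
    sq_dist = np.sum(X**2, axis=1)[:, None] + np.sum(W**2, axis=1)[None, :] - 2.0 * inner
    return (inner + b) ** 2 / (np.maximum(sq_dist, 0.0) + eps)


def ntk_readout_term(X, width, b=1.0, eps=1.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of E_w[k(x, w) k(x', w)] with w ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    W = sigma * rng.standard_normal((width, X.shape[1]))
    phi = yat_features(X, W, b, eps)
    return phi @ phi.T / width  # Gram matrix of random features, hence PSD


X = np.random.default_rng(1).standard_normal((5, 3))
for width in (100, 10_000):
    G = ntk_readout_term(X, width)
    print(width, np.linalg.eigvalsh(G).min())
```

Because each estimate is a Gram matrix of feature vectors, positive semi-definiteness holds at every finite width, and the limit inherits it.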
Corollary 13 (Universality of the Yat NTK for ).
For every , the Yat NTK is universal on every compact : the RKHS is dense in in the sup norm.
Proof.
Step 1 (Gram representation of ). The proof of Theorem 26 established
where . The map is a continuous feature map from into (continuity follows from continuity of in and the Proposition 4 bound, which makes the family uniformly -integrable on compact ).
Step 2 ( is integrally strictly PD on ). Let be a finite non-zero signed Borel measure on . By Tonelli,
Suppose this integral vanishes. Then for -almost every ; continuity of in (dominated convergence) and full support of extend this to on . In particular, . But is precisely the kernel mean embedding of under (since ). For , Corollary 1(i) states that is characteristic on in the strong sense that the kernel mean embedding is injective on finite signed Borel measures on . Hence forces , a contradiction. Therefore is integrally strictly PD on .
Step 3 (Density of in via duality). A signed Borel measure on annihilates iff for every . By the reproducing property, , vanishing for every iff the element is zero, equivalently . By Step 2 this forces . Hence has trivial annihilator in (signed Radon measures, by Riesz), so is dense in in the sup norm.
Remark 14 (General principle).
Step 2 above instantiates a general principle: under spherically symmetric initialisation with full support, the -summand of the empirical NTK has the Gram representation , and integral strict positive definiteness of on a compact follows from characteristicness of the underlying kernel on . Universality of the Yat NTK is therefore a direct consequence of universality of the Yat kernel itself, with no additional spectral hypothesis on beyond full support.
Appendix O RKHS-Lipschitz Bounds and Certified Adversarial Radius
The single-layer Lipschitz bound of Lemma 7 is parametric in the trained and is used for stack-Lipschitz propagation. We complement it with an intrinsic RKHS-Lipschitz bound expressed only in the RKHS norm, derived from the reproducing property and a closed-form bound on the mixed partial of . This RKHS-norm route to certified robustness for kernel classifiers is developed for general kernels by Bietti et al. [2019]; here we instantiate it with the -specific mixed partial.
Lemma 9 (Mixed partial of on the diagonal).
For , set
Then
Proof.
Write and , so that . Differentiating in ,
Differentiating in and noting that any term with an explicit factor vanishes at , and that ,
Substituting gives the displayed expression for . The trace formula follows from and . ∎
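The differentiation steps can be cross-checked by finite differences. The sketch below assumes the kernel `k(x, y) = (<x, y> + b)^2 / (||x - y||^2 + eps)`; the closed form in the code is re-derived for that assumed kernel (with `s = ||x||^2 + b`, the trace is `(2||x||^2 + 2 d s)/eps + 2 d s^2/eps^2`) and may be arranged differently from the displayed expression. Function names and the step size `h` are hypothetical.

```python
import numpy as np


def yat_kernel(x, y, b, eps):
    """Assumed Yat kernel (x.y + b)^2 / (||x - y||^2 + eps)."""
    return (x @ y + b) ** 2 / (np.sum((x - y) ** 2) + eps)


def mixed_partial_trace_closed_form(x, b, eps):
    """Trace of d/dx d/dy k at y = x, derived for the assumed kernel form."""
    d = x.size
    s = x @ x + b
    return (2.0 * (x @ x) + 2.0 * d * s) / eps + 2.0 * d * s**2 / eps**2


def mixed_partial_trace_fd(x, b, eps, h=1e-4):
    """Central finite-difference estimate of sum_i d^2 k / (dx_i dy_i) at y = x."""
    total = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        total += (
            yat_kernel(x + e, x + e, b, eps)
            - yat_kernel(x + e, x - e, b, eps)
            - yat_kernel(x - e, x + e, b, eps)
            + yat_kernel(x - e, x - e, b, eps)
        ) / (4.0 * h * h)
    return total


x = np.array([0.3, -0.2, 0.5])
print(mixed_partial_trace_closed_form(x, b=0.5, eps=1.0))
print(mixed_partial_trace_fd(x, b=0.5, eps=1.0))
```

The two printed values should agree up to the finite-difference error, which reflects the observation in the proof that only the second derivative of the denominator survives on the diagonal.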
Theorem 27 (RKHS-Lipschitz constant for ).
Let be path-connected and compact with . For every with and every ,
with
Proof.
By the reproducing property and Cauchy–Schwarz, , hence
using the standard identity . Lemma 9 gives for . The mean-value inequality on the path-connected set (along a piecewise-smooth path from to in ) gives . ∎
Corollary 14 (Certified adversarial radius for an RKHS classifier).
Let be a -class predictor with . For an input with predicted class and margin , every adversarial perturbation satisfying
keeps the predicted class equal to .
Proof.
The budget gives for each , hence by the triangle inequality. Theorem 27 applied to yields
so for every and the prediction is preserved. ∎
Remark 15 (Capacity-robustness trade-off).
The certified radius is : increasing buys robustness at the price of capacity, since the kernel diagonal shrinks. This trade-off is intrinsic to and explicit in the parameter , in contrast to scalar-activation networks where Lipschitz control is imposed externally through weight-norm penalties.
Appendix P Yat-Native Atom-Count Bounds via the Polynomial Component
The exterior-shell separation (Appendix G.1) and the bounded-domain width-complexity gap (Proposition 7) capture qualitative atom-count separations between Yat and IMQ. We complement them with a constant-atom upper bound for arbitrary symmetric quadratic forms, paired with a dimension-counting IMQ lower bound. The combination converts the directional asymptotic-trace separation of Proposition 2 into a quantitative atom-count gap on PSD quadratic targets.
Setup.
For , define the bounded-variation atom families
The atom count of an element is the smallest such expansion length.
Lemma 10 (Single-atom Yat approximation of a rank-one quadratic).
Let , , , and set
For every ,
Proof.
With , . Compute
On , , so . For the second term is at most , and for the first term is at most ; hence (for , otherwise the conclusion follows trivially by taking even larger). The bound for gives
∎
Proposition 9 (Spectral Yat approximation of any symmetric quadratic form).
Let have rank with spectral decomposition . For every , set and choose as in Lemma 10. Then
is an unbiased Yat expansion with exactly atoms satisfying .
Proof.
. Lemma 10 applied to each unit eigenvector with tolerance gives . Multiplying by and summing gives the displayed bound. ∎
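The construction can be sketched numerically: one far-center atom per eigenvector, rescaled so that it tends to the squared projection. The code assumes the unbiased Yat atom `(<x, w>)^2 / (||x - w||^2 + eps)`, centers `R * u_i`, and the normalising factor `(R^2 + eps) / R^2`; this scaling is an assumption chosen for the illustration and need not coincide with the constants of Lemma 10.

```python
import numpy as np


def yat_unbiased(X, w, eps):
    """Assumed unbiased Yat atom (x.w)^2 / (||x - w||^2 + eps), evaluated row-wise."""
    return (X @ w) ** 2 / (np.sum((X - w) ** 2, axis=1) + eps)


def spectral_yat_quadratic(X, A, R, eps=1.0):
    """Approximate x^T A x with one far-center atom per non-zero eigenvalue of A."""
    eigvals, eigvecs = np.linalg.eigh(A)
    scale = (R**2 + eps) / R**2  # makes each rescaled atom tend to (u.x)^2 as R grows
    out = np.zeros(X.shape[0])
    for lam, u in zip(eigvals, eigvecs.T):
        if abs(lam) > 1e-12:
            out += lam * scale * yat_unbiased(X, R * u, eps)
    return out


rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = (B + B.T) / 2.0
X = rng.standard_normal((500, 3))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # project into the unit ball
target = np.einsum("ni,ij,nj->n", X, A, X)
for R in (10.0, 100.0, 1000.0):
    print(R, np.max(np.abs(spectral_yat_quadratic(X, A, R) - target)))
```

On the unit ball the error of each rescaled atom is at most `(2R + 1) / (R^2 - 2R)`, so the total error shrinks like `1/R` while the atom count stays equal to the rank; the price is the growing center norm `R`.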
Proposition 10 (IMQ atom-count lower bound for a PSD quadratic target).
Let be PSD with , and let be a unit eigenvector of with eigenvalue . Fix and . Suppose with satisfies and . Set
Any expansion realising the bounded-variation budget has
In particular, when stays bounded as (equivalently, with fixed), the right-hand side is .
Proof.
Restrict attention to the inscribed Euclidean ball . By rotational invariance of , align with a unit eigenvector of for . Consider the slice where is the -ball of radius . For , , so (we strengthen the hypothesis to ). The bounded-variation bound gives
so for every , with the constant in now depending on rather than . Hence . Each ball intersects the affine slice in a -ball of radius at most and volume . The slice has -volume , so volume counting yields the displayed lower bound. By Stirling, the Stirling-error contribution from cancels between the slice and the ball volumes, and the constant is absorbed into ; taking logs, , which is when is bounded as . ∎
Theorem 28 (Yat-native rate-form separation).
Let on , where for some PSD of rank with , and with and IMQ approximation rate (i.e., for every there exists with atoms and ). Fix with , and set .
(i) Yat upper bound. There exists with at most atoms and .
(ii) IMQ lower bound. Fix and . Every with has atom count at least the dimension-counting lower bound of Proposition 10; in the regime where are held fixed as , this bound is .
(iii) Separation. For full-rank , the Yat side achieves the target error with atoms while the IMQ side requires in the fixed- regime of (ii).
Proof.
(i) Apply Proposition 9 to with tolerance , producing an unbiased Yat expansion with atoms and . By definition of , choose with IMQ atoms and . Theorem 4 converts into a biased Yat expansion with at most atoms and pointwise. Set with atom count ; the triangle inequality gives the displayed sup bound.
(ii) Given , . Apply Proposition 10 with the perturbed tolerance in place of (the hypothesis ensures , so is well-defined). The atom count is bounded below by the dimension-counting estimate, which under the fixed- asymptotic regime is .
(iii) For full-rank , ; combining (i) and (ii) gives the displayed separation. ∎
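The conversion step in part (i) relies on recovering an IMQ atom from three positive-bias Yat atoms via a second finite difference in the bias. A minimal check of that identity follows, assuming the Yat atom is `(<x, w> + b)^2 / (||x - w||^2 + eps)` and the IMQ atom is `1 / (||x - w||^2 + eps)`; the bias `b`, the step `h`, and the function names are hypothetical.

```python
import numpy as np


def yat_atom(x, w, b, eps):
    """Assumed biased Yat atom (x.w + b)^2 / (||x - w||^2 + eps)."""
    return (x @ w + b) ** 2 / (np.sum((x - w) ** 2) + eps)


def imq_from_three_yat(x, w, eps, b=1.0, h=0.5):
    """Second finite difference in the bias; b > h > 0 keeps all three biases positive."""
    second_diff = (
        yat_atom(x, w, b + h, eps)
        - 2.0 * yat_atom(x, w, b, eps)
        + yat_atom(x, w, b - h, eps)
    )
    return second_diff / (2.0 * h**2)


x = np.array([0.3, -1.2])
w = np.array([0.7, 0.4])
eps = 0.25
print(imq_from_three_yat(x, w, eps))
print(1.0 / (np.sum((x - w) ** 2) + eps))
```

The identity is exact because the numerator is quadratic in the bias, so its second difference is the constant `2 h^2` regardless of the inner product.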
Remark 16 (RKHS-norm cost of the Yat upper bound).
Theorem 28(i) is an atom-count statement, not an RKHS-norm statement. The single-atom rank-one approximation (Lemma 10) requires , so the resulting Yat atom has center norm and squared RKHS norm
Aggregating over the atoms in the rank- spectral decomposition (Proposition 9) inflates this further by a factor at most in the full-rank case. So the Yat-side resource cost decomposes into a constant atom count and a polynomial-in- RKHS-ball radius; the IMQ side requires atoms but each at bounded center norm, giving a polynomial RKHS-ball radius for any single expansion. The separation in Theorem 28 is therefore poly-vs-exp in atom count and poly-vs-poly in RKHS-ball radius; downstream generalization comparisons via Theorem 6 inherit both factors. The exponential atom-count gap in favour of Yat is real and quantitative, but it is a gap in the discrete combinatorial complexity of the expansion, not a free-lunch reduction in the underlying capacity radius.
Appendix Q Toward a Uniqueness Characterization
The structural results of the main paper isolate three properties that the Yat kernel satisfies and the classical RBF/IMQ/polynomial kernels do not jointly satisfy: rational symmetric form, bounded global diagonal, and a non-trivial quadratic asymptotic trace at infinity. We record the rigorous degree-matching skeleton these conditions imply, and state the full classification as a conjecture.
Conditions.
Let , , be a continuous PSD kernel satisfying
- (C1) (Rational symmetric form, in reduced form.) with jointly polynomial in , symmetric in the variable pair, for every , and in the polynomial ring (i.e., the rational form is reduced).
- (C2) (At-most-quadratic numerator growth.) The diagonal admits as . Equivalently, in the reduced rational form of (C1), on the diagonal .
- (C3) (Non-trivial quadratic directional trace.) For some , the limit exists for every and is a non-zero quadratic form in .
Proposition 11 (Degree matching under (C1)–(C3)).
For any kernel satisfying (C1)–(C3), (and by symmetry the same in ).
Proof.
Write and as homogeneous decompositions in . As , has leading order where are the largest indices with non-zero homogeneous parts. Existence of a finite, non-zero limit (C3) forces and the limiting ratio to be a non-zero rational function of alone. Since the limit is a quadratic form in , the limit’s total numerator degree in after cancellation is , which is consistent only with (the cases are ruled out by the requirement that the trace is non-zero quadratic of degree exactly ). Together with (C2), no higher-degree homogeneous parts are admissible: if but , then is unbounded. Therefore . ∎
Conjecture 2 (Uniqueness of up to symmetry).
Every continuous PSD kernel satisfying (C1)–(C3) has the form
for some , , , and .
Remark 17 (Status).
Proposition 11 establishes the degree- constraint, which is the first of several steps in the conjectured classification. The remaining steps—(i) showing PSD on all finite point sets forces to be the square of a degree- symmetric form , up to a linear change of variables, and (ii) the analogous classification for strictly positive degree- symmetric denominators —require an explicit case analysis with multi-point Gram constraints that we have not carried out, and which constitute the bulk of the work behind Conjecture 2.
Remark 18 (Why higher homogeneous degrees do not survive on ).
A potential concern is that may collapse to a quadratic on — for instance, — but this does not invalidate the degree conclusion: such a collapse is a redundancy of the homogeneous expansion, and any kernel admitting it can be rewritten with via cancellation of the factor. Equivalently, after passing to the reduced rational form in which numerator and denominator share no common polynomial factor (the convention adopted in (C1)), the directional-trace argument forces .
Remark 19 (Independence of (C1)–(C3)).
Each condition is necessary for the conjectured classification: dropping (C1) admits Gaussian RBF and IMQ (no rational symmetric form); dropping (C2) admits higher-degree rational kernels such as , whose diagonal grows as and so violates the at-most-quadratic numerator condition while still satisfying (C1) and (C3); dropping (C3) admits the polynomial (no IMQ denominator, so the far-field along blows up rather than approaching a finite quadratic limit) and all bounded radial kernels (Matérn, IMQ) which have . The simultaneous combination forces the rigid degree structure of Proposition 11 and conjecturally isolates as a two-parameter family.
Appendix R CLIP Probe Classification Sweep
Scope of this experiment.
This appendix is intended as a trainability/diagnostic check, not a head-to-head benchmark of Yat against tuned kernel baselines. The bandwidths of RBF and IMQ are fixed to match Yat’s (no per-variant tuning), so the comparison is deliberately under-tuned for the radial baselines. The two structurally informative comparisons we do draw are Yat vs. Poly (with vs. without the IMQ denominator) and Yat vs. Yatrand (trained vs. frozen centers). A bandwidth-tuned comparison against learned-center RBF/IMQ heads (and against random Fourier features and a standard MLP head) is left as future empirical work; we therefore avoid drawing performance-superiority conclusions from this table alone.
To illustrate the practical consequence of the alignment numerator, we train single-layer kernel classifiers on frozen CLIP ViT-B/32 image features (ImageNet-1k, 1000 classes, ). Each classifier is a one-vs-rest kernel expansion with learned centers (one per class), optimised by Adam for 20 epochs with a grid of six learning rates over three seeds. The five variants compared are: Yat (, trained centers, shared , ), Poly (quadratic polynomial kernel, same numerator as Yat, no IMQ denominator), Yatrand (Yat with frozen random centers, only trained), RBF (Gaussian kernel with learned centers, fixed bandwidth to match Yat’s ), and IMQ (inverse multiquadric kernel with learned centers, fixed matching Yat’s ). All five variants use the same Adam optimizer and six-point learning-rate grid; no per-variant bandwidth tuning was performed.
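The forward pass of such a head can be sketched in a few lines. This is a hypothetical illustration rather than the training code: it assumes one center per class, a per-class output scale `alpha`, a shared bias `b` and shared `eps`, and the Yat atom `(<x, w> + b)^2 / (||x - w||^2 + eps)`. The array shapes and names are chosen for the example.

```python
import numpy as np


def yat_head_logits(features, centers, alpha, b, eps):
    """logits[n, c] = alpha[c] * (x_n . w_c + b)^2 / (||x_n - w_c||^2 + eps)."""
    inner = features @ centers.T
    sq_dist = (
        np.sum(features**2, axis=1)[:, None]
        + np.sum(centers**2, axis=1)[None, :]
        - 2.0 * inner
    )
    return alpha[None, :] * (inner + b) ** 2 / (np.maximum(sq_dist, 0.0) + eps)


features = np.array([[1.0, 0.0]])            # one input with two feature dimensions
centers = np.array([[1.0, 0.0], [0.0, 1.0]])  # one center per class
alpha = np.array([1.0, 2.0])
logits = yat_head_logits(features, centers, alpha, b=1.0, eps=1.0)
print(logits, logits.argmax(axis=1))
```

Dropping the denominator gives the Poly variant, and freezing `centers` while training only `alpha` corresponds to the frozen-center ablation.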
| Variant | LR 1 (lowest) | LR 2 | LR 3 | LR 4 | LR 5 | LR 6 (highest) | Best LR | Best acc |
|---|---|---|---|---|---|---|---|---|
| Yat (trained) | 68.4 | 70.8 | 73.1 | 73.9 | 72.2 | 69.7 | LR 4 | 73.9 |
| Poly | 72.5 | 71.2 | 70.1 | 64.3 | 51.6 | 49.1 | LR 1 | 72.5 |
| Yatrand | 62.6 | 67.5 | 69.5 | 69.2 | 58.4 | 46.2 | LR 3 | 69.5 |
Reading Table 3. The two structurally informative comparisons are: (1) Yat (73.9%) vs. Poly (72.5%): the 1.4 pp gap isolates the benefit of the IMQ locality denominator over the polynomial alignment numerator alone. The Poly kernel has finite-dimensional RKHS (Table 1, footnote ) of dimension at degree , so the gap roughly reflects the cost of forcing a 1000-class problem on features into a finite-rank RKHS; (2) Yat (73.9%) vs. Yatrand (69.5%): the 4.4 pp gap shows that center optimisation is required to fully exploit the alignment numerator, consistent with the learned-center RKHS view. Both ablations isolate one variable at a time and use the same Yat hyperparameters; the comparison is not contaminated by bandwidth choice. We note that the Poly baseline (like the bandwidth-mismatched RBF/IMQ baselines in Table 4) peaks at the lower end of the searched LR grid; extending the grid downwards may reduce the 1.4 pp Yat-vs-Poly gap, and we report the result with this caveat. RBF/IMQ are reported with the additional bandwidth-mismatch caveat already noted.
Bandwidth-mismatched RBF/IMQ baselines (separated for visual honesty).
For completeness we also ran fixed-bandwidth RBF and IMQ heads with the same optimizer and learning-rate grid; we report these in a separate Table 4 rather than alongside Yat to avoid an apples-to-oranges headline juxtaposition. The bandwidth was fixed to match Yat’s ( for RBF; ), not tuned per variant, and the resulting shortfall against Yat at every learning rate (RBF and IMQ peak at 46.1% and 28.7% respectively against Yat’s 73.9%, and both fall to 0.1% at the highest learning rates, with the gap widening as the learning rate increases) is the optimization-side counterpart of Proposition 5: with fixed at and at , both kernels concentrate around a near-constant value, so both the forward signal and the Adam gradient lose discriminative scale (for RBF, is numerically zero; for IMQ, is near-uniform across pairs). The remedy is a bandwidth scaled to the data (), not a different optimizer; with per-variant bandwidth selection these baselines would be expected to perform competitively. We draw no performance-superiority conclusion against RBF or IMQ from this table; a bandwidth-tuned head-to-head comparison is left as future empirical work.
| Variant | LR 1 (lowest) | LR 2 | LR 3 | LR 4 | LR 5 | LR 6 (highest) | Best LR | Best acc |
|---|---|---|---|---|---|---|---|---|
| RBF (fixed ) | 46.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | LR 1 | 46.1 |
| IMQ (fixed ) | 28.7 | 25.1 | 11.9 | 0.1 | 0.1 | 0.1 | LR 1 | 28.7 |
Per-class RKHS norm correlation.
For the Yat head trained at the best learning rate (), the per-class closed-form RKHS norm (Proposition 3) has positive Spearman correlation with per-class top-1 accuracy across classes (one observation per class, validation examples each). The correlation is stable across a ablation grid spanning two orders of magnitude ( over seven configurations; reproduction scripts are provided in the supplementary material). The interpretation of the positive sign is discussed in §6. The closed-form norm is computable directly from the trained weights with no held-out data. Learned-center RBF and IMQ heads admit the same formula at the same cost: the reproducing-property derivation of Proposition 3 is generic to finite kernel expansions and applies equally to any Mercer kernel. What is specific to Yat is the kernel itself — its diagonal , its directional far-field trace, and the bias finite-difference structure. Ordinary scalar MLP units () do not admit the formula at all because they do not define a Mercer section over the shared input/weight space.
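For a finite kernel expansion the squared RKHS norm is the quadratic form of the coefficients in the Gram matrix of the centers, which is the generic reproducing-property computation referred to above. The sketch below assumes the Yat kernel `(<x, w> + b)^2 / (||x - w||^2 + eps)` with a shared bias; function names and parameter values are hypothetical.

```python
import numpy as np


def yat_gram(W, b, eps):
    """Gram matrix of the centers under the assumed Yat kernel."""
    inner = W @ W.T
    sq_norms = np.sum(W**2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * inner
    return (inner + b) ** 2 / (np.maximum(sq_dist, 0.0) + eps)


def rkhs_norm_squared(coeffs, centers, b, eps):
    """||f||^2 = a^T K a for f = sum_i a_i k(., w_i), by the reproducing property."""
    K = yat_gram(centers, b, eps)
    return float(coeffs @ K @ coeffs)


centers = np.array([[1.0, 0.0], [0.0, 1.0]])
coeffs = np.array([1.0, -1.0])
print(rkhs_norm_squared(coeffs, centers, b=1.0, eps=1.0))
```

The computation uses only the trained coefficients and centers, so it needs no held-out data; swapping `yat_gram` for another Mercer kernel's Gram matrix gives the corresponding norm for that kernel.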
Appendix S Directional-Tail Benchmark (Corollary 5)
To make Corollary 5 concrete we run a small synthetic experiment in . The target function is the single Yat atom with , , . Along the ray the alignment factor satisfies , so Corollary 5 predicts for every finite IMQ combination . This experiment provides concrete numerical verification of the analytical prediction. The regime (training on an annulus, evaluating on a far-field ray) is chosen so that the analytical prediction is tight by construction.
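The far-field behaviour behind this prediction can be seen directly, without any training. The sketch below assumes the Yat atom `(<x, w> + b)^2 / (||x - w||^2 + eps)` and IMQ atoms `1 / (||x - c||^2 + eps)`; the center, direction, and IMQ coefficients are hypothetical values, not the ones used in the benchmark script.

```python
import numpy as np


def yat_atom(x, w, b, eps):
    """Assumed Yat atom (x.w + b)^2 / (||x - w||^2 + eps)."""
    return (x @ w + b) ** 2 / (np.sum((x - w) ** 2) + eps)


def imq_combination(x, centers, coeffs, eps):
    """Finite IMQ combination sum_i a_i / (||x - c_i||^2 + eps)."""
    return float(np.sum(coeffs / (np.sum((centers - x) ** 2, axis=1) + eps)))


w = np.array([1.0, 0.0])  # Yat center
u = np.array([1.0, 0.0])  # unit direction of the evaluation ray
b, eps = 0.5, 1.0
centers = np.array([[0.5, 0.5], [-1.0, 2.0], [2.0, -1.0]])
coeffs = np.array([1.0, -2.0, 0.5])
for t in (1e1, 1e3, 1e6):
    x = t * u
    print(t, yat_atom(x, w, b, eps), imq_combination(x, centers, coeffs, eps))
```

Along the ray the Yat atom tends to the squared alignment `(<u, w>)^2`, while every finite IMQ combination tends to zero, so the gap between them cannot be closed by adding IMQ atoms.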
Setup.
We draw training points uniformly (area-measure) from the annulus , labels . Two IMQ models are trained: and learned centers, shared bandwidth trained jointly, Adam for epochs at learning rates and . Both models are evaluated along the ray for . The script (directional_tail_benchmark.py) is provided in the supplementary material; it uses MLX and runs in s on Apple Silicon M-series.
Results.
| Model | Learned centers | Mean at | Max at |
|---|---|---|---|
| IMQ-50 | 50 | | |
| IMQ-200 | 200 | | |
| Predicted () | — | | |
Both models train successfully in the annulus (MSE declines steadily) yet converge to the predicted error floor in the far field, consistent with Corollary 5: increasing the number of IMQ atoms does not recover the directional tail because the analytical result is unconditional on model size. This is the finite-sample instantiation of the property of the IMQ span (Proposition 2).
Appendix T The Yat Primitive at Depth in a Trained Causal Language Model
The single-layer theory of Sections 2–5 makes its claims at one shared- Yat layer; the pullback theorem (Theorem 14) extends them to a fixed prefix inside a deep stack so that the closed-form norm (Proposition 3) and the diagonal-driven Rademacher bound (Theorem 6) survive composition at depth. This appendix exercises that depth extension on a trained causal language model: at matched architecture and matched Chinchilla-optimal compute on C4, the Yat primitive composed across many layers reaches the GELU-baseline accuracy regime, so the layer-local theoretical objects (per-layer Gram, closed-form norm, Rademacher diagonal) are not just well-defined at depth but populated with non-degenerate trained content.
Setup.
We replace the standard GELU MLP block of a M decoder-only causal language model ( layers, , context ) with a block at the same parameter count, and train both at the GELU-Chinchilla-optimal budget of B C4 tokens [Raffel et al., 2020, Hoffmann et al., 2022] ( steps, global batch , AdamW [Loshchilov and Hutter, 2019] with cosine decay, weight decay ). Learning rates were chosen by a K-token pre-sweep: for GELU and for Yat. We sweep the full pn,sb Yat grid: pn = per-neuron , sb = shared within a layer, = learnable per-channel scaling, = constant . Each row is mean ± std over three independent training seeds; downstream zero-shot evaluations are run via lm-evaluation-harness.
| Wiki PPL | LAMBADA acc | ARC-E | ARC-C | HellaSwag | OpenBookQA | PIQA | Long-range PPL | |
|---|---|---|---|---|---|---|---|---|
| GELU | ||||||||
| Yat (pn) | ||||||||
| Yat (sb) | ||||||||
| Yat (pn) | ||||||||
| Yat (sb) |
Reading.
At matched architecture and matched Chinchilla compute, the four Yat variants and the GELU baseline land in the same accuracy regime within seed variance on the six lm-evaluation-harness tasks. The two Yat rows lead on Wikitext-103 perplexity ( vs. for GELU) and on long-range perplexity ( vs. ) by margins that exceed seed variance; the rows match GELU on accuracy and lead by PPL (lower is better). Yat throughput at this scale is tokens/sec versus for GELU ( lower), measured at the training batch size on TPU v5e and averaged over the full training run. The point of the experiment is the depth-viability claim: the Yat primitive composed across layers in a trained causal LM reaches the same accuracy regime as a GELU MLP at matched compute, and the closed-form RKHS norm of Proposition 3 continues to apply per layer to the trained shared- rows. We make no head-to-head benchmark claim and view differences smaller than seed variance as inconclusive.
Compute resources.
Each variant was trained on C4 tokens at the Chinchilla-optimal compute ratio. Per-variant training used Google Cloud TPU v5e/v6e (TPU Research Cloud) with approximately TPU-hours per run; with three seeds per variant (GELU, pn, sb, pn, sb) alongside a -point LR pre-sweep at K tokens, total compute for the M LM experiment was approximately TPU-hours.
NeurIPS Paper Checklist
-
1.
Claims
-
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
-
Answer: [Yes]
-
Justification: The four contributions in the introduction (PSD/Mercer + Loewner-domination universality, nonradial alignment via directional trace, exact IMQ recovery with three-atom tightness, and layer-local RKHS bookkeeping) correspond directly to Theorems 1, 2 (universality), Theorem 3 and Proposition 2 (alignment / trace), Theorem 4 and Theorem 5 (IMQ recovery and three-atom tightness), and Proposition 3 together with Theorem 6 (closed-form norm and Rademacher bound). The deep-stack scope is explicitly limited to a prefix-conditioned pullback (Theorem 14) and an ambient Sobolev containment (Theorem 19); we do not claim an exact global Yat-Gram norm.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Section 6 contains a dedicated limitations discussion: shared requirement for the exact RKHS norm, qualitative (not rate-form) universality, two-tier rather than exact deep-stack theory, extrapolative exterior-shell separations, and under-tuned empirical baselines. Each is paired with the scope-bounded remedy or the open problem it implies.
3. Theory assumptions and proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Every theorem, proposition, lemma, and corollary states its assumptions inline (, , compactness of , etc.); notation is consolidated in Appendix A, generic RKHS background in Appendix C, and named results from the literature in Appendix D. Proofs of main-body results are in Appendix E; proofs of extension theorems are split across Appendix G (far-field separations), Appendix H (multiclass generalisation), Appendix I (pullback and Loewner-comparison), Appendix J (Sobolev containment), Appendix P (atom-count separation), and Appendix Q (degree-matching skeleton for the uniqueness conjecture).
4. Experimental result reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper?
Answer: [Yes]
Justification: The CLIP probe (Appendix R), the directional-tail benchmark (Appendix S), and the M causal-LM proof-of-concept (Appendix T) report the architecture, optimizer, learning-rate grid, seed count, dataset, and bandwidth choices. All three are diagnostic experiments accompanying a theoretical paper, not benchmark contributions; the paper makes no performance-superiority claim from them.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Reproduction scripts for both diagnostic experiments are referenced in the appendix; CLIP features are public ImageNet-1k embeddings from a public CLIP ViT-B/32 checkpoint.
6. Experimental setting/details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Appendix R specifies the optimizer (Adam), six-point learning-rate grid, three-seed averaging, and the per-variant bandwidth choices. The deliberate under-tuning of RBF/IMQ baselines is disclosed explicitly.
7. Experiment statistical significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The CLIP probe table reports mean ± standard deviation across three seeds; the per-class Spearman correlation is reported with its stability range across a ablation grid.
8. Experiments compute resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The directional-tail benchmark runs in approximately 6 s on Apple Silicon M-series; the CLIP probe is single-GPU with frozen features at .
9. Code of ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
Answer: [Yes]
Justification: The paper is a theoretical contribution about a kernel construction; no human subjects, no personal data, no harmful applications.
10. Broader impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [N/A]
Justification: The paper develops mathematical foundations for a hidden-unit primitive; broader societal impact is not directly applicable to this contribution.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse?
Answer: [N/A]
Justification: No new datasets or pretrained models are released.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The CLIP ViT-B/32 model [Radford et al., 2021] (MIT license) and ImageNet-1k validation features [Deng et al., 2009] (custom research-only terms; we use only image embeddings from the public validation set, with no redistribution of raw images) are the only external assets used. Both are cited at the points of use in Appendix R.
13. New assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [N/A]
Justification: The paper introduces a kernel construction and theoretical results, not new assets requiring documentation.
14. Crowdsourcing and research with human subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: No human subjects.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: No human subjects.
16. Declaration of LLM usage
Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research?
Answer: [N/A]
Justification: LLMs are not part of the methodology of this paper.