arXiv:2605.03262 · cs.LG

Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

Taha Bouhsine (Azetta AI, taha@azetta.ai)
Abstract

We introduce the Yat kernel

$$k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\ \varepsilon>0,$$

a rational hidden-unit primitive whose units are Mercer sections over a shared input/weight space. For $b\geq 0$ the kernel is PSD; for $b>0$ it dominates a scaled inverse-multiquadric (IMQ) kernel in the Loewner order, yielding fixed-kernel universality, characteristicness, and strict positive definiteness on every compact domain. The polynomial numerator opens nonradial alignment channels absent from finite IMQ expansions, witnessed by the directional far-field trace $T_{\infty}g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})=(\mathbf{u}^{\top}\mathbf{w})^{2}$. Algebraically, a second finite difference in the bias recovers any IMQ atom exactly from three positive-bias Yat atoms, and the count of three is sharp in every dimension at the level of exact pointwise equality. A trained shared-$(b,\varepsilon)$ Yat layer is therefore a finite learned-center expansion in a fixed universal characteristic RKHS, with closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ and explicit diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ driving a Rademacher generalization bound.

1 Introduction

A trained MLP layer has no canonical Hilbert-space norm. Generalization bounds therefore route through Lipschitz or spectral surrogates of the weight matrix (Bartlett et al., 2017; Neyshabur et al., 2018), function-level objects that are loose by construction. The reason is structural: standard activations $\sigma(\mathbf{w}^{\top}\mathbf{x})$ (ReLU, GELU (Hendrycks and Gimpel, 2016), sigmoid) are scalar functions of an inner product, not symmetric kernel sections of the joint variable $(\mathbf{w},\mathbf{x})$, so a trained layer’s read-out is not a finite element of any RKHS, and the toolkit that comes with one (closed-form Hilbert norm, fixed-kernel universality, characteristicness, MMD) is unavailable.

The available Mercer-section primitives are either radial without alignment (Gaussian RBF, IMQ: weights act only as centers, the kernel diagonal is constant, and the polynomial alignment $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$ plays no role) or aligned without distance locality (universal dot-product kernels of analytic positive-coefficient form (Smola et al., 2000): alignment and Hilbert structure but no $\|\mathbf{x}-\mathbf{w}\|$ decay). A finite, trainable hidden-unit primitive that combines locality, alignment, and Hilbert structure is the missing piece.

There is a primitive that splits the difference. The Yat kernel

$$k_{\text{ⵟ}}(\mathbf{w},\mathbf{x})\;=\;\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\ \varepsilon>0, \tag{1}$$

is the rational product of a polynomial alignment numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and an inverse-multiquadric (IMQ) denominator $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$: rational, finite, trainable. It is symmetric and PSD in $(\mathbf{w},\mathbf{x})$, so a trained shared-$(b,\varepsilon)$ layer is genuinely a finite RKHS expansion $\sum_{j}\alpha_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)$ with closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$, and single-layer generalization follows from the RKHS-ball Rademacher framework with no surrogate on the weight matrix.

Why should such a primitive exist, and why are its properties what they are? Schur products of PSD kernels are PSD, and both factors $p_{b}$ and $h_{\varepsilon}$ are PSD; the technically obvious fact is that the product is itself a Mercer kernel. The interesting fact is a single Loewner inequality. Expanding $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ and dividing through by the IMQ denominator, $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is itself a sum of Schur products of PSD kernels, manifestly PSD. So the radial IMQ kernel sits inside the Yat kernel in the Loewner order; for $b>0$, Aronszajn’s inclusion theorem combined with Micchelli’s IMQ universality (Micchelli et al., 2006) gives universality of the fixed kernel, not just of a family. The radial alternatives are recovered as a strict sub-RKHS of $\mathcal{H}_{b,\varepsilon}$.

Loewner domination delivers RKHS containment by a soft analytic argument, but it leaves an algebraic question: can the IMQ atoms be constructed from Yat atoms, or only known to lie inside the Yat RKHS by Aronszajn? Section 4 answers the constructive question with a sharper one-line fact:

$$g_{\varepsilon}(\cdot;\mathbf{w},3h)\;-\;2\,g_{\varepsilon}(\cdot;\mathbf{w},2h)\;+\;g_{\varepsilon}(\cdot;\mathbf{w},h)\;=\;2h^{2}\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}). \tag{2}$$

That is, a second-order finite difference in the bias parameter of three biased Yat atoms recovers every IMQ atom exactly, up to the constant $2h^{2}$, and the count of three is sharp in every dimension. This is what the analytic Loewner argument cannot see, and what makes the construction concrete: any quantitative IMQ approximation rate transfers to Yat at factor-three width overhead.

The polynomial numerator does more than improve the constants. IMQ RKHS functions decay at infinity, but a Yat atom retains a directional quadratic shadow along every ray $r\mathbf{u}$:

$$T_{\infty}\,g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})\;=\;(\mathbf{u}^{\top}\mathbf{w})^{2}, \tag{3}$$

independent of $b$ and $\varepsilon$, a kernel invariant that survives every choice of hyperparameter. No finite IMQ combination has nonzero trace, so the alignment numerator opens directional channels in $\mathcal{H}_{b,\varepsilon}$ that the radial factor alone cannot reach. Section 3 makes this precise: the trace map surjects onto homogeneous quadratic forms on the sphere while annihilating every finite IMQ combination.

Both kernel hyperparameters appear directly in the single-layer Rademacher bound as regularization targets, a transparency that constant-diagonal radial kernels lose and scalar MLP units do not have at all. The reason is in the kernel diagonal: $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ is polynomial in $\|\mathbf{x}\|$ with explicit dependence on both $b$ and $\varepsilon$, and the diagonal carries those two scalars into the bound. The closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ itself is generic to any continuous Mercer kernel; what is Yat-specific is the form of the diagonal (Section 5).
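As a minimal sketch of the two quantities named above, the following pure-Python snippet evaluates the kernel diagonal and the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ for a small shared-$(b,\varepsilon)$ expansion. The function names (`yat_kernel`, `diagonal`, `rkhs_norm_sq`) are illustrative and not taken from the paper.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot(w, x) + b) ** 2 / (sq_dist + eps)

def diagonal(x, b, eps):
    """Closed-form kernel diagonal (||x||^2 + b)^2 / eps."""
    return (dot(x, x) + b) ** 2 / eps

def rkhs_norm_sq(alpha, centers, b, eps):
    """Quadratic form alpha^T K alpha for f = sum_j alpha_j k(w_j, .).
    Only meaningful when every atom shares the same (b, eps)."""
    n = len(centers)
    return sum(
        alpha[i] * yat_kernel(centers[i], centers[j], b, eps) * alpha[j]
        for i in range(n) for j in range(n)
    )
```

For a single atom with unit coefficient, `rkhs_norm_sq` reduces to the diagonal value at its center, which is how $b$ and $\varepsilon$ enter a norm-based bound in this sketch.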

Organization. Section 2 establishes Mercer structure, Loewner domination, and fixed-kernel universality. Section 3 develops the channel decomposition, the directional far-field trace, and the bounded-domain width-complexity gap. Section 4 proves the bias-finite-difference identity (2) and its three-atom sharpness. Section 5 derives the closed-form RKHS norm and the Rademacher bound. Section 6 treats depth: per-prefix exact bookkeeping, ambient Sobolev containment, and a conjectured impossibility of any prefix-independent global Yat-Gram norm. Background, related work, and proofs are in Appendices BE onwards.

2 Mercer Structure and Fixed-Kernel Universality

Notation. Throughout, $d\geq 1$, $\mathcal{X}\subset\mathbb{R}^{d}$ is compact, $b\geq 0$, $\varepsilon>0$. Write $D_{\varepsilon}(\mathbf{x},\mathbf{w}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ for the regularised squared distance, and use the shorthands

$$g_{\varepsilon}(\mathbf{x};\mathbf{w},b):=\frac{(\mathbf{x}^{\top}\mathbf{w}+b)^{2}}{D_{\varepsilon}(\mathbf{x},\mathbf{w})}\quad\text{(biased Yat atom)},\qquad h_{\varepsilon}(\mathbf{x},\mathbf{w}):=k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}):=D_{\varepsilon}(\mathbf{x},\mathbf{w})^{-1}\quad\text{(IMQ)}.$$

The unbiased section $k_{\text{ⵟ}}(\mathbf{w},\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w},0)$ corresponds to $b=0$. We adopt throughout the convention that the RKHS of the identically-zero kernel is the zero subspace $\{0\}$ with trivial norm; this makes the channel decomposition of Theorem 3 and the limit $b\to 0^{+}$ well-defined when individual channels collapse.
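The notation above translates directly into code. The sketch below defines $D_{\varepsilon}$, the biased Yat atom $g_{\varepsilon}$, and the IMQ section $h_{\varepsilon}$ as plain Python functions; the names `D`, `g`, and `h` simply mirror the symbols and are otherwise arbitrary.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def D(x, w, eps):
    """Regularised squared distance ||x - w||^2 + eps."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps

def g(x, w, b, eps):
    """Biased Yat atom (x.w + b)^2 / D_eps(x, w)."""
    return (dot(x, w) + b) ** 2 / D(x, w, eps)

def h(x, w, eps):
    """IMQ section 1 / D_eps(x, w), in the paper's convention."""
    return 1.0 / D(x, w, eps)

x, w = [1.0, 2.0], [3.0, -1.0]
b, eps = 0.5, 0.1

# The atom factors as (polynomial numerator) * (IMQ denominator).
numerator = (dot(x, w) + b) ** 2
factorised = numerator * h(x, w, eps)
```

Both `g` and `h` are symmetric under swapping `x` and `w`, which is the property that lets weights and inputs share one space.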

Two algebraic facts anchor the kernel-section view of the Yat unit. First, $k_{b,\varepsilon}$ is positive semidefinite, and thus a Mercer kernel on every compact $\mathcal{X}$, for every $b\geq 0$ and $\varepsilon>0$. Second, on top of mere PSD, $k_{b,\varepsilon}$ dominates a scaled IMQ kernel in the Loewner order: $b^{2}h_{\varepsilon}\preceq k_{b,\varepsilon}$. This second fact transfers IMQ density into $\mathcal{H}_{b,\varepsilon}$ and pins down fixed-kernel universality on every compact domain whenever $b>0$. Both follow from the Schur-product factorisation $k_{b,\varepsilon}=p_{b}\cdot h_{\varepsilon}$ of the polynomial alignment numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and the IMQ denominator $h_{\varepsilon}$.

Theorem 1 (PSD and Mercer for $b\geq 0$).

For every $b\geq 0$, $\varepsilon>0$, the kernel $k_{\text{ⵟ}}$ in (1) is symmetric and positive semidefinite on $\mathbb{R}^{d}\times\mathbb{R}^{d}$, and on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ the restriction is continuous and Mercer. By Moore–Aronszajn there exists a unique RKHS $\mathcal{H}_{b,\varepsilon}$ with reproducing kernel $k_{b,\varepsilon}$. The sign condition is essential: for $b<0$ PSD can fail (explicit $2\times 2$ counterexample at $b=-1$, $d=2$, nodes $\mathbf{e}_{1},\mathbf{e}_{2}$).
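The $2\times 2$ counterexample mentioned in the theorem can be reproduced numerically. The sketch below builds the Gram matrix at nodes $\mathbf{e}_{1},\mathbf{e}_{2}$ with $b=-1$; the value $\varepsilon=1$ is an arbitrary illustrative choice, since the theorem statement does not fix it. The diagonal entries vanish while the off-diagonal entries are positive, so the determinant is negative and the matrix is indefinite.

```python
def yat_kernel(w, x, b, eps):
    s = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (s + b) ** 2 / (sq_dist + eps)

e1, e2 = [1.0, 0.0], [0.0, 1.0]
b, eps = -1.0, 1.0  # eps = 1 is an assumption made for illustration

# Gram matrix: diagonal (1 + b)^2 / eps = 0, off-diagonal b^2 / (2 + eps) = 1/3.
K = [[yat_kernel(u, v, b, eps) for v in (e1, e2)] for u in (e1, e2)]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]

# A coefficient vector witnessing a negative quadratic form.
alpha = [1.0, -1.0]
quad_form = sum(alpha[i] * K[i][j] * alpha[j] for i in range(2) for j in range(2))
```

With these values `det` equals $-1/9$ and `quad_form` equals $-2/3$, so PSD fails.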

The sign condition has a structural cause: the polynomial-numerator feature map $\Phi_{b}(\mathbf{x})=(\mathbf{x}\otimes\mathbf{x},\sqrt{2b}\,\mathbf{x},b)$ requires $b\geq 0$ to be real. Per-atom variation of hyperparameters within an expansion $f=\sum_{j}\alpha_{j}\,k_{b_{j},\varepsilon_{j}}(\mathbf{w}_{j},\cdot)$ lies in the larger biased Yat span $F_{\text{ⵟ}}^{\geq 0}$ (see Appendix A) but not in any single RKHS $\mathcal{H}_{b,\varepsilon}$, so all quantitative RKHS results that follow, namely the closed-form norm (Proposition 3) and the Rademacher bound (Theorem 6), require shared $(b,\varepsilon)$ across summands. This is the unit of analysis for the rest of the paper.
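The feature map $\Phi_{b}$ can be checked directly: its inner product should reproduce the numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$. The sketch below flattens $\mathbf{x}\otimes\mathbf{x}$ into a list; the helper name `feature_map` is illustrative. Note that `math.sqrt(2 * b)` raises an error for $b<0$, which is the code-level counterpart of the sign condition.

```python
import math

def feature_map(x, b):
    """Phi_b(x) = (x (x) x, sqrt(2b) x, b), flattened into one list."""
    outer = [xi * xj for xi in x for xj in x]
    linear = [math.sqrt(2.0 * b) * xi for xi in x]
    return outer + linear + [b]

def inner(u, v):
    return sum(a * c for a, c in zip(u, v))

x, w, b = [1.0, 2.0], [3.0, -1.0], 0.5
lhs = inner(feature_map(x, b), feature_map(w, b))
rhs = (inner(x, w) + b) ** 2  # polynomial alignment numerator p_b(x, w)
```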

PSD structure implies a Loewner-order inequality that yields RKHS containment and singular universality as direct algebraic consequences. Write $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$ for the IMQ kernel. Expanding $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ gives $k_{b,\varepsilon}-b^{2}h_{\varepsilon}=[2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}]h_{\varepsilon}$, a sum of Schur products of PSD kernels, hence PSD as a kernel: every finite Gram matrix of $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is itself PSD.
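A randomized spot check of the claim above, not a proof: the sketch forms the Gram matrix of the difference kernel $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ at random points and evaluates its quadratic form on random coefficient vectors. The point count, dimension, seed, and tolerance are arbitrary assumptions; a nonnegative minimum is consistent with PSD but does not by itself certify it.

```python
import random

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def loewner_gap(x, w, b, eps):
    """(k_{b,eps} - b^2 h_eps)(x, w) = [2b s + s^2] * h_eps(x, w), s = x.w."""
    s = dot(x, w)
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (2.0 * b * s + s * s) / (sq_dist + eps)

def min_quadratic_form(n_points=6, dim=3, b=0.8, eps=0.5, trials=200, seed=0):
    rng = random.Random(seed)
    pts = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_points)]
    G = [[loewner_gap(p, q, b, eps) for q in pts] for p in pts]
    worst = float("inf")
    for _ in range(trials):
        a = [rng.uniform(-1, 1) for _ in range(n_points)]
        val = sum(a[i] * G[i][j] * a[j]
                  for i in range(n_points) for j in range(n_points))
        worst = min(worst, val)
    return worst  # expected to be >= 0 up to rounding
```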

Theorem 2 (Kernel-order domination).

For every $b\geq 0$, $\varepsilon>0$,

$$b^{2}\,h_{\varepsilon}\preceq k_{b,\varepsilon}$$

in the Loewner PSD order on kernels. By Aronszajn’s inclusion theorem (Paulsen and Raghupathi, 2016),

$$\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon},\qquad\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq\frac{1}{b}\|f\|_{\mathcal{H}_{h_{\varepsilon}}}\quad(b>0).$$

Equivalently, $B_{\mathcal{H}_{h_{\varepsilon}}}(R)\subseteq B_{\mathcal{H}_{b,\varepsilon}}(R/b)$.

The constant $b^{2}$ on the left of the Loewner inequality is sharp on $\mathbb{R}^{d}$: at the single point $\mathbf{x}=\mathbf{0}$ the ratio is $k_{b,\varepsilon}(\mathbf{0},\mathbf{0})/h_{\varepsilon}(\mathbf{0},\mathbf{0})=(b^{2}/\varepsilon)/(1/\varepsilon)=b^{2}$ exactly, so the $1\times 1$ Gram of $k_{b,\varepsilon}-c\,h_{\varepsilon}$ at $\mathbf{x}=\mathbf{0}$ becomes negative for any $c>b^{2}$, and therefore $\sup\{c\geq 0:c\,h_{\varepsilon}\preceq k_{b,\varepsilon}\}=b^{2}$. This sharpness is a global statement on $\mathbb{R}^{d}$: on a compact $\mathcal{X}\subset\mathbb{R}^{d}$ that excludes a neighbourhood of the origin, the optimal domination constant can be strictly larger than $b^{2}$, since $k_{b,\varepsilon}/h_{\varepsilon}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ (after cancelling the IMQ denominators) is bounded below away from $b^{2}$ once $\mathbf{x}^{\top}\mathbf{w}$ is bounded away from $0$.
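The sharpness argument at the origin is a one-point computation, sketched below. The helper `origin_gap` (an illustrative name) returns the $1\times 1$ Gram entry of $k_{b,\varepsilon}-c\,h_{\varepsilon}$ at $\mathbf{x}=\mathbf{0}$, which equals $(b^{2}-c)/\varepsilon$ and turns negative as soon as $c>b^{2}$. The values of `b` and `eps` are arbitrary.

```python
def origin_gap(c, b, eps):
    """1x1 Gram entry of k_{b,eps} - c * h_eps at x = w = 0."""
    k_00 = (0.0 + b) ** 2 / (0.0 + eps)   # Yat diagonal at the origin: b^2 / eps
    h_00 = 1.0 / (0.0 + eps)              # IMQ diagonal at the origin: 1 / eps
    return k_00 - c * h_00

b, eps = 0.7, 0.5
ratio_at_origin = ((b ** 2) / eps) / (1.0 / eps)  # equals b^2

gap_at_b2 = origin_gap(b ** 2, b, eps)          # zero up to rounding
gap_above = origin_gap(b ** 2 + 0.1, b, eps)    # negative: domination fails
gap_below = origin_gap(b ** 2 - 0.1, b, eps)    # positive at this single point
```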

Two related constants govern the $b\to 0^{+}$ behaviour and should not be conflated. The kernel-domination constant $b^{2}$ enters squared because it compares kernels (and hence Gram matrices), while the RKHS embedding constant in Theorem 2 is $1/b$, comparing norms.

Quantitative approximation rates transferred through the embedding therefore inherit $1/b$ on norms or $1/b^{2}$ on squared norms, harmless at fixed $b>0$ but exploding at the boundary.

That boundary is genuine, not an artefact of the embedding. At $b=0$ the channel decomposition (Theorem 3 below) shows that $\mathcal{H}_{0,\varepsilon}$ collapses to the quadratic-alignment channel $\mathcal{H}_{k_{2}}$ alone: the radial channel $\mathcal{H}_{k_{0}}$ and the linear-alignment channel $\mathcal{H}_{k_{1}}$ disappear under the zero-kernel convention. Equivalently, the IMQ-radial subspace $\mathcal{H}_{h_{\varepsilon}}$ embeds continuously into $\mathcal{H}_{b,\varepsilon}$ for every $b>0$ but is not contained in $\mathcal{H}_{0,\varepsilon}$ at all, since every $f\in\mathcal{H}_{0,\varepsilon}$ vanishes at the origin while generic IMQ functions do not. The transition $b=0\rightleftharpoons b>0$ is a phase transition of the RKHS, recorded analytically in Table 1: the practical recommendation is to use $b>0$ rather than to treat $b=0$ as a regularised limit. With the boundary understood, fixed-kernel universality follows for the open regime.
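The vanishing-at-the-origin degeneracy is easy to observe for finite expansions. In the sketch below, `expansion` (an illustrative helper) evaluates $f=\sum_{j}\alpha_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)$; the coefficients and centers are arbitrary. At $b=0$ every atom has numerator $(\mathbf{w}_{j}^{\top}\mathbf{0})^{2}=0$, so $f(\mathbf{0})=0$ regardless of the coefficients, while the same expansion with $b>0$ is generally nonzero there.

```python
def yat_kernel(w, x, b, eps):
    s = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (s + b) ** 2 / (sq_dist + eps)

def expansion(x, alpha, centers, b, eps):
    """f(x) = sum_j alpha_j * k_{b,eps}(w_j, x) with shared (b, eps)."""
    return sum(a * yat_kernel(w, x, b, eps) for a, w in zip(alpha, centers))

alpha = [1.0, -2.0, 0.5]                          # arbitrary coefficients
centers = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]   # arbitrary centers
origin = [0.0, 0.0]

f_unbiased = expansion(origin, alpha, centers, b=0.0, eps=0.1)  # forced to 0
f_biased = expansion(origin, alpha, centers, b=1.0, eps=0.1)    # free to be nonzero
```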

Proposition 1 (Singular universality at $b_{0}>0$).

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact and let $b_{0}>0$, $\varepsilon_{0}>0$. Then the RKHS $\mathcal{H}_{b_{0},\varepsilon_{0}}$ of the fixed biased kernel $k_{b_{0},\varepsilon_{0}}(\mathbf{w},\mathbf{x})=(\mathbf{w}^{\top}\mathbf{x}+b_{0})^{2}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon_{0})$ is dense in $C(\mathcal{X})$ in the uniform norm. Hence $k_{b_{0},\varepsilon_{0}}$ is a universal kernel, not merely a member of a universal family.

Universality in turn implies the kernel mean embedding is injective and Gram matrices at distinct nodes are strictly positive definite, both consequences we use throughout.

Corollary 1 (Characteristic kernel and SPD).

For every $b_{0}>0$, $\varepsilon_{0}>0$, and every compact $\mathcal{X}\subset\mathbb{R}^{d}$: (i) $k_{b_{0},\varepsilon_{0}}$ is characteristic on $\mathcal{X}$, so the kernel mean embedding $\mu\mapsto\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})$ is injective on finite signed Borel measures and the maximum mean discrepancy $\mathrm{MMD}_{k_{b_{0},\varepsilon_{0}}}$ metrizes weak convergence (Sriperumbudur et al., 2010, Thm. 23); (ii) $k_{b_{0},\varepsilon_{0}}$ is strictly positive definite in the standard kernel-theoretic sense: for any $n\geq 1$ and any $n$ pairwise distinct $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$, the Gram matrix $\mathbf{K}$ is strictly positive definite, hence invertible, and the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is a proper RKHS norm on coefficient space.
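To illustrate part (i), the sketch below computes a squared MMD between two small samples with the Yat kernel. It uses the standard biased (V-statistic) estimator, which is an assumption of this example and not a construction taken from the paper; the samples, $b=1$, and $\varepsilon=1$ are likewise arbitrary. Because the estimator equals the squared RKHS distance between empirical mean embeddings, it is nonnegative for any PSD kernel.

```python
def yat_kernel(w, x, b, eps):
    s = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (s + b) ** 2 / (sq_dist + eps)

def mean_kernel(A, B, b, eps):
    """Average of k(a, c) over all pairs a in A, c in B."""
    return sum(yat_kernel(a, c, b, eps) for a in A for c in B) / (len(A) * len(B))

def mmd2(X, Y, b, eps):
    """Biased V-statistic estimate of MMD^2 between samples X and Y."""
    return (mean_kernel(X, X, b, eps)
            + mean_kernel(Y, Y, b, eps)
            - 2.0 * mean_kernel(X, Y, b, eps))

X = [[0.0], [1.0]]
Y = [[2.0], [3.0]]
same = mmd2(X, X, b=1.0, eps=1.0)       # identical samples give 0
different = mmd2(X, Y, b=1.0, eps=1.0)  # 1.5 + 43.5 - 2 * 2.0 = 41.0
```

Since the Yat diagonal grows with $\|\mathbf{x}\|$, samples far from the origin dominate this estimate, which is worth keeping in mind when comparing it with constant-diagonal radial kernels.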

Together with Theorem 1, these results license a shared-$(b,\varepsilon)$ Yat layer as a finite expansion in a universal characteristic RKHS, the unit of analysis the rest of the paper exploits. Table 1 summarises how Yat compares with its three closest classical kernel-valued alternatives along the properties used in the rest of the paper.

Table 1: Theoretical properties at the hidden-layer level. Fixed bias $b\geq 0$, $\varepsilon>0$; “UAT-fam” = family-level universal approximation; “UAT-sing” = single fixed kernel is universal; “Char.” = characteristic kernel (MMD metrizes weak convergence on compact $\mathcal{X}$); “Bounded” = $k(\mathbf{w},\cdot)$ bounded on $\mathbb{R}^{d}$. Polynomial = $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$; RBF = $e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}$; IMQ = $(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$; Univ. dot-prod = $e^{\gamma\mathbf{x}^{\top}\mathbf{w}}$ (universal analytic dot-product kernel (Smola et al., 2000)).
Primitive PSD/Mercer UAT-fam UAT-sing Char. Bounded Numer. Denom.
Yat ($k_{\text{ⵟ}}$, $b>0$) align local
Yat ($k_{\text{ⵟ}}$, $b=0$) § § align local
Polynomial $(\cdot)^{2}$ align
Univ. dot-prod align
RBF local
IMQ local
Fails universality: finite-rank RKHS (not dense in $C(\mathcal{X})$ for infinite $\mathcal{X}$).
Fails characteristic: cannot separate all probability measures.
§Fails on origin-containing domains: $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$ forces $f(\mathbf{0})=0$ for every $f\in\mathcal{H}_{0,\varepsilon}$.

3 Nonradial Alignment Structure

Theorem 2 embeds the IMQ RKHS into $\mathcal{H}_{b,\varepsilon}$ for $b>0$, but it leaves a structural question open: does the polynomial alignment numerator add anything that the radial denominator alone could not produce? It does, and the gap is recorded by a single far-field invariant. The IMQ RKHS lives entirely inside the $C_{0}$ part of function space, since every IMQ RKHS element decays at infinity, while a Yat atom retains a directional quadratic shadow $(\mathbf{u}^{\top}\mathbf{w})^{2}$ along every ray $r\mathbf{u}$ as $r\to\infty$. That mismatch is the structural witness of nonradiality, and once it is in hand the rest of the section follows by algebra: a strict RKHS containment, a clean three-channel decomposition of $\mathcal{H}_{b,\varepsilon}$, and a sharp directional trace identity that no finite IMQ combination can reproduce.

The decay statement is the simplest of the three.

Lemma 1 (IMQ RKHS functions vanish at infinity).

Every $f\in\mathcal{H}_{h_{\varepsilon}}$ belongs to $C_{0}(\mathbb{R}^{d})$, i.e., $f(\mathbf{x})\to 0$ as $\|\mathbf{x}\|\to\infty$.

The lemma is a global statement on $\mathbb{R}^{d}$, distinct from the compact-domain universality of $h_{\varepsilon}$: on every compact $\mathcal{X}\subset\mathbb{R}^{d}$, the IMQ RKHS restricts densely to $C(\mathcal{X})$ (Micchelli et al., 2006), but as functions on the ambient $\mathbb{R}^{d}$ the elements of $\mathcal{H}_{h_{\varepsilon}}$ vanish at infinity, while generic Yat sections do not. A direct consequence is that the Loewner embedding cannot reverse, so the RKHS containment is strict.

Corollary 2 (Strict containment and non-reversibility).

For $b>0$, $\varepsilon>0$: $\mathcal{H}_{h_{\varepsilon}}\subsetneq\mathcal{H}_{b,\varepsilon}$. Moreover, no constant $C>0$ satisfies $k_{b,\varepsilon}\preceq C\,h_{\varepsilon}$ on $\mathbb{R}^{d}$.

The strict gap has an explicit factorisation. Expanding the squared alignment numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ splits $k_{b,\varepsilon}$ into three Schur products with the IMQ denominator: each piece is itself PSD, so the RKHS sum rule decomposes the Yat space accordingly into a radial channel and two alignment channels.

Theorem 3 (Yat RKHS channel decomposition).

Write $k_{b,\varepsilon}=k_{0}+k_{1}+k_{2}$ where $k_{0}:=b^{2}h_{\varepsilon}$, $k_{1}:=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$, $k_{2}:=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$. Each summand is PSD: $h_{\varepsilon}$ is PSD, the linear kernel $\mathbf{x}^{\top}\mathbf{w}$ is PSD (it is the inner-product feature map), the quadratic $(\mathbf{x}^{\top}\mathbf{w})^{2}$ is PSD as the Schur square of a PSD kernel, and Schur products of PSD kernels are PSD; non-negative scalar multiples preserve PSD. By the RKHS sum rule (Paulsen and Raghupathi, 2016; Berlinet and Thomas-Agnan, 2004),

$$\mathcal{H}_{b,\varepsilon}=\mathcal{H}_{k_{0}}+\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}},$$

with squared infimal-convolution norm

$$\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\inf_{\begin{subarray}{c}f_{0}+f_{1}+f_{2}=f\\ f_{i}\in\mathcal{H}_{k_{i}}\end{subarray}}\bigl(\|f_{0}\|_{\mathcal{H}_{k_{0}}}^{2}+\|f_{1}\|_{\mathcal{H}_{k_{1}}}^{2}+\|f_{2}\|_{\mathcal{H}_{k_{2}}}^{2}\bigr).$$

The three channels are: the radial IMQ channel ($k_{0}=b^{2}h_{\varepsilon}$; as sets $\mathcal{H}_{k_{0}}=\mathcal{H}_{h_{\varepsilon}}$ with rescaled norm $\|f\|_{\mathcal{H}_{k_{0}}}=b^{-1}\|f\|_{\mathcal{H}_{h_{\varepsilon}}}$ (constant-rescaling identity for RKHS norms, Paulsen and Raghupathi 2016, Prop. 4.4), present when $b>0$), the linear-alignment IMQ channel ($k_{1}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$, vanishes at $b=0$), and the quadratic-alignment IMQ channel ($k_{2}=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$, carrying the far-field trace, present for all $b\geq 0$). We adopt the convention that the RKHS of the identically-zero kernel is the zero subspace $\{0\}$ with trivial norm. At $b=0$ both $k_{0}$ and $k_{1}$ vanish under this convention: the RKHS loses the IMQ subspace and the constant coordinate, which is the degeneracy responsible for failure of fixed-kernel universality on domains containing the origin. The same degeneracy appears as the kernel diagonal vanishing at the origin: $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$, so every $f\in\mathcal{H}_{0,\varepsilon}$ satisfies $f(\mathbf{0})=0$. The universality failure and the diagonal-vanishing diagnostic thus record the same fact.
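The pointwise identity behind the channel decomposition, $k_{b,\varepsilon}=k_{0}+k_{1}+k_{2}$, can be confirmed with a few lines of code. In this sketch the helper `channels` (an illustrative name) returns the three kernel values at a pair of points; it checks only the kernel-level splitting, not the infimal-convolution norm.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def imq(x, w, eps):
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def yat_kernel(w, x, b, eps):
    return (dot(w, x) + b) ** 2 * imq(x, w, eps)

def channels(x, w, b, eps):
    """Return (k0, k1, k2): radial, linear-alignment, quadratic-alignment."""
    s, h_val = dot(x, w), imq(x, w, eps)
    return (b * b * h_val, 2.0 * b * s * h_val, s * s * h_val)

x, w, eps = [0.3, -1.2, 0.7], [1.1, 0.4, -0.5], 0.25
k0, k1, k2 = channels(x, w, b=0.6, eps=eps)
total = k0 + k1 + k2                          # should match yat_kernel(w, x, 0.6, eps)

z0, z1, z2 = channels(x, w, b=0.0, eps=eps)   # at b = 0 only k2 survives
```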

The quadratic-alignment channel $\mathcal{H}_{k_{2}}$ is what survives every choice of $b\geq 0$ and is the source of the far-field shadow that distinguishes Yat from IMQ. The next proposition makes the shadow precise as a sphere-valued asymptotic trace and certifies that the trace map is surjective onto homogeneous quadratic forms while annihilating every finite IMQ combination.

Proposition 2 (Directional asymptotic trace separates Yat from IMQ).

Let $\mathcal{D}_{T_{\infty}}$ denote the linear subspace of functions $F:\mathbb{R}^{d}\to\mathbb{R}$ for which the radial limit $\lim_{r\to\infty}F(r\mathbf{u})$ exists for every $\mathbf{u}\in\mathbb{S}^{d-1}$, and define on $\mathcal{D}_{T_{\infty}}$ the asymptotic trace operator $T_{\infty}F(\mathbf{u}):=\lim_{r\to\infty}F(r\mathbf{u})$. Both $F_{\text{ⵟ}}^{\geq 0}$ and $F_{\mathrm{IMQ},\varepsilon}$ lie in $\mathcal{D}_{T_{\infty}}$. For every $\mathbf{w}\in\mathbb{R}^{d}$, every $b\in\mathbb{R}$, and every $\mathbf{u}\in\mathbb{S}^{d-1}$, the biased Yat atom satisfies

$$T_{\infty}\,g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})\;=\;(\mathbf{u}^{\top}\mathbf{w})^{2}, \tag{4}$$

independent of $b$ and $\varepsilon$, and the convergence is uniform in $\mathbf{u}$. By contrast, $T_{\infty}F\equiv 0$ for every finite IMQ combination $F\in F_{\mathrm{IMQ},\varepsilon}$. Consequently the induced linear map $T_{\infty}:F_{\text{ⵟ}}^{\geq 0}\to C(\mathbb{S}^{d-1})$ is surjective onto the space of restrictions of homogeneous quadratic forms, $\mathcal{Q}_{2}(\mathbb{S}^{d-1})=\{\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}:\mathbf{A}\in\mathrm{Sym}(d)\}$, of dimension $d(d+1)/2$, while $F_{\mathrm{IMQ},\varepsilon}\subseteq\ker T_{\infty}$.
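A numerical illustration of the trace identity (4): the sketch below evaluates a Yat atom and an IMQ atom far out along a ray $r\mathbf{u}$ and compares the Yat value with $(\mathbf{u}^{\top}\mathbf{w})^{2}$. The radius $r=10^{6}$ stands in for the limit $r\to\infty$ and is an arbitrary choice, as are the direction, center, and hyperparameters; the function name `far_field` is illustrative.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def imq(x, w, eps):
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def yat_atom(x, w, b, eps):
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

def far_field(atom, u, r=1e6):
    """Approximate T_inf by evaluating the atom at the finite radius r."""
    return atom([r * ui for ui in u])

u = [0.6, 0.8]    # unit direction
w = [1.0, -2.0]   # center; u.w = -1, so (u.w)^2 = 1

yat_trace_a = far_field(lambda x: yat_atom(x, w, b=0.5, eps=0.3), u)
yat_trace_b = far_field(lambda x: yat_atom(x, w, b=5.0, eps=2.0), u)
imq_trace = far_field(lambda x: imq(x, w, eps=0.3), u)
```

Both Yat values land near $1$ despite the different $(b,\varepsilon)$, while the IMQ value is of order $r^{-2}$.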

The directional trace identity above is qualitative: a bounded radial combination eventually misses a Yat atom’s far-field value. The gap becomes quantitative on a bounded domain by combining the directional-trace identity (Proposition 2) with the classical radial curse of dimensionality for ridge approximation (Eldan and Shamir, 2016): a single Yat atom approximates a quadratic ridge function on $[-R,R]^{d}$ to arbitrary accuracy, while any bounded-variation IMQ expansion needs exponentially many atoms in the ambient dimension.

A precise quantitative form (Proposition 7, Appendix E.8) makes this concrete under a joint constraint on the IMQ side: bounded coefficient mass $\sum|a_{j}|\leq\Lambda$ and bounded center norms $\|\mathbf{v}_{j}\|\leq W\leq R/2$, both held fixed as $d\to\infty$. Under this joint constraint, on $[-R,R]^{d}$ a single Yat atom $k_{0,\varepsilon}(\alpha\mathbf{w}_{\star},\cdot)$ with $\alpha=\Omega(\sqrt{d})$ approximates the quadratic ridge $(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}$ uniformly to $\delta$, while every IMQ expansion in this class needs $m\geq\exp(\Omega(d\log d))$ atoms. Lifting either constraint, by letting $\Lambda$ grow with $d$ or letting centers escape into the shell, collapses the lower bound, so the separation is a constrained radial-vs-ridge statement, not a model-free benchmark prediction.

Sharper but extrapolative quantitative separations on exterior shells $A_{R}=\{R\leq\|\mathbf{x}\|\leq 2R\}$ for $R\gg W$ are recorded in Appendix G.1: a uniform $L^{\infty}$ bound and an $L^{2}$ directional-tail risk lower bound, both for bounded-variation IMQ and RBF expansions. On normalized-feature compact domains both IMQ and RBF are universal (Micchelli et al., 2006; Steinwart and Christmann, 2008), so these are far-field structural separations rather than compact-domain benchmark predictions.

4 IMQ Embedding via Bias Finite Differences

Theorem 2 buys an IMQ-into-Yat embedding through Loewner domination, but that argument is analytic: Aronszajn’s inclusion theorem only certifies an inclusion of RKHSs, not a constructive recipe for producing an IMQ atom from Yat atoms. A purely algebraic recipe exists, and the underlying reason is one line: the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is exactly quadratic in $b$ with constant second derivative $2$, so any second-order finite difference in $b$ with step $h$ collapses it to the constant $2h^{2}$ and leaves only the IMQ denominator times that constant. The stencil $(h,2h,3h)$ is the smallest one with all biases strictly positive. Concretely, taking that stencil at a common center gives an exact IMQ atom, multiplied by a constant. The identity is pointwise, not approximate, and the count of three Yat atoms is necessary in every dimension: two atoms cannot reproduce a single IMQ section centered away from the origin. This construction gives a second, constructive route to fixed-kernel universality and immediately transfers any IMQ approximation rate to Yat at a factor-three width overhead.
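The one-line reason can be written out explicitly. Abbreviating $s:=\mathbf{x}^{\top}\mathbf{w}$ (a shorthand introduced here for the worked step only), the second difference of the numerator on the stencil $(h,2h,3h)$ is:

```latex
\begin{aligned}
(s+3h)^{2}-2(s+2h)^{2}+(s+h)^{2}
  &= (s^{2}+6sh+9h^{2}) - 2(s^{2}+4sh+4h^{2}) + (s^{2}+2sh+h^{2}) \\
  &= (1-2+1)\,s^{2} + (6-8+2)\,sh + (9-8+1)\,h^{2} \\
  &= 2h^{2}.
\end{aligned}
```

Dividing by the common denominator $\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$, which does not depend on the bias, leaves $2h^{2}$ times the IMQ section at center $\mathbf{w}$.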

Theorem 4 (Bias-finite-difference IMQ embedding).

Fix $\varepsilon>0$. For every $\mathbf{w}\in\mathbb{R}^{d}$ and every $h>0$, the positive-bias second-order finite difference of the biased Yat atom satisfies the exact pointwise identity

$$g_{\varepsilon}(\mathbf{x};\mathbf{w},3h)\;-\;2\,g_{\varepsilon}(\mathbf{x};\mathbf{w},2h)\;+\;g_{\varepsilon}(\mathbf{x};\mathbf{w},h)\;=\;\frac{2h^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon}\;=\;2h^{2}\,k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}), \tag{5}$$

for every $\mathbf{x}\in\mathbb{R}^{d}$. Consequently $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{ⵟ},\varepsilon}^{+}$, where $F_{\text{ⵟ},\varepsilon}^{+}=\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},b>0\}$ and $F_{\mathrm{IMQ},\varepsilon}=\mathrm{span}\{k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}$. The containment is strict.
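Identity (5) can be checked numerically at random points. The sketch below measures the largest deviation between the three-atom finite difference and $2h^{2}$ times the IMQ atom; the sampling box, dimension, seed, and the helper name `max_identity_error` are assumptions of this example.

```python
import random

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def imq(x, w, eps):
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def yat_atom(x, w, b, eps):
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

def second_difference(x, w, h, eps):
    """g(x; w, 3h) - 2 g(x; w, 2h) + g(x; w, h)."""
    return (yat_atom(x, w, 3 * h, eps)
            - 2.0 * yat_atom(x, w, 2 * h, eps)
            + yat_atom(x, w, h, eps))

def max_identity_error(h=0.7, eps=0.5, dim=3, trials=100, seed=0):
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-2, 2) for _ in range(dim)]
        w = [rng.uniform(-2, 2) for _ in range(dim)]
        err = abs(second_difference(x, w, h, eps) - 2.0 * h * h * imq(x, w, eps))
        worst = max(worst, err)
    return worst  # should be at rounding level
```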

The factor-three overhead is not slack: removing any one of the three Yat atoms loses the cancellation that produces the IMQ section. The next theorem establishes this as a tightness statement at the level of exact pointwise equality.

Theorem 5 (Three atoms are necessary).

The factor-$3$ reduction in (5) is sharp in the number of atoms: for every $\mathbf{w}_{0}\neq\mathbf{0}$, $\varepsilon>0$, $d\geq 1$, no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$.

The full proof is in Appendix E.2, where the result is restated for convenience (Theorem 9), via an irreducible-quadric argument in $\mathbb{C}[\mathbf{x}]/(D_{1})$. Tightness is at the level of exact pointwise equality; whether two atoms can $L^{\infty}$-approximate $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ on a compact set is open. The hypothesis $\mathbf{w}_{0}\neq\mathbf{0}$ is essential: at the origin, the single Yat atom $g_{\varepsilon}(\cdot;\mathbf{0},b)/b^{2}=k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{0})$ recovers the IMQ atom exactly, so one atom suffices in that case.
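The origin-centered exception is simple to verify: with $\mathbf{w}=\mathbf{0}$ the alignment term $\mathbf{x}^{\top}\mathbf{w}$ vanishes, so the numerator is the constant $b^{2}$. The sketch below compares the rescaled single atom with the IMQ atom at a few arbitrary test points.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def imq(x, w, eps):
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def yat_atom(x, w, b, eps):
    return (dot(x, w) + b) ** 2 * imq(x, w, eps)

origin = [0.0, 0.0, 0.0]
b, eps = 1.3, 0.4
points = [[0.0, 0.0, 0.0], [1.0, -2.0, 0.5], [10.0, 3.0, -7.0]]

# Largest gap between g(x; 0, b) / b^2 and the IMQ atom centered at the origin.
max_gap = max(
    abs(yat_atom(x, origin, b, eps) / (b * b) - imq(x, origin, eps))
    for x in points
)
```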

Two consequences follow without additional analysis. The first propagates Micchelli’s IMQ universality (Micchelli et al., 2006) into the Yat family by composition; the second turns the embedding into a quantitative rate-transfer statement.

Corollary 3 (Family universality, algebraic route).

For every compact $\mathcal{X}\subset\mathbb{R}^{d}$, both $F_{\text{ⵟ},\varepsilon}^{+}$ at any single $\varepsilon>0$ and the broader families $F_{\text{ⵟ}}^{+}=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},b>0,\varepsilon>0\}$ and $F_{\text{ⵟ}}^{\geq 0}=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},b\geq 0,\varepsilon>0\}$ are dense in $C(\mathcal{X})$ in the uniform norm. This gives a second algebraic route to the universality already established analytically by Proposition 1 via Loewner-IMQ domination.

Corollary 4 (Approximation rate transfer at factor-3 width overhead).

Fix $\varepsilon>0$ and $h>0$, and let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact. For every finite IMQ approximant $F=\sum_{j=1}^{m}a_{j}\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{j})\in F_{\mathrm{IMQ},\varepsilon}$ of $m$ atoms, there exists a Yat approximant $G\in F_{\text{ⵟ},\varepsilon}^{+}$ of at most $3m$ atoms (positive biases drawn from $\{h,2h,3h\}$) with $G(\mathbf{x})=F(\mathbf{x})$ for every $\mathbf{x}\in\mathbb{R}^{d}$. In particular, any quantitative IMQ approximation rate $\inf_{F\in F_{\mathrm{IMQ},\varepsilon},\,|F|\leq m}\|f-F\|_{L^{\infty}(\mathcal{X})}\leq\phi(m)$ transfers to a Yat rate $\inf_{G\in F_{\text{ⵟ},\varepsilon}^{+},\,|G|\leq 3m}\|f-G\|_{L^{\infty}(\mathcal{X})}\leq\phi(m)$; equivalently, in terms of the Yat atom count $m$, any IMQ rate $\phi(m)$ gives a Yat rate $\phi(\lfloor m/3\rfloor)$.

Corollary 4 is the answer to “what does Yat inherit from IMQ approximation theory?”. We do not prove new Sobolev or Besov approximation rates here; the bias finite-difference identity shows that Yat is never worse than fixed-$\varepsilon$ IMQ at the level of any rate inherited through this embedding. Yat-native rates that exploit the alignment numerator beyond what radial expansions can match remain open. Note that the $3m$ atoms on the Yat side carry three distinct biases $\{h,2h,3h\}$, so the resulting expansion lies in the wider biased Yat span $F_{\text{ⵟ}}^{\geq 0}$ rather than in any single shared-$(b,\varepsilon)$ RKHS; the closed-form norm and Rademacher bound of Section 5 apply to each fixed-bias channel separately, with the per-channel norms combining via the RKHS sum rule.

The rate-transfer corollary leaves the RKHS-ball radius $B$ unspecified: in the absence of a constructive bound on $\|f\|_{\mathcal{H}_{b,\varepsilon}}$, the inherited rate is information-theoretic rather than operational. The next section closes this gap by computing $B$ in closed form for any trained shared-$(b,\varepsilon)$ Yat layer and propagating it to a Rademacher generalization bound, so that the rate transfer above and the capacity bound below are companion statements: the first says what error rate a target in $\mathcal{H}_{b,\varepsilon}$ admits at $3N$ atoms; the second says what radius the trained network’s coordinate functions occupy.
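The bias finite-difference identity behind Corollary 4 can be checked numerically. The sketch below is a minimal illustration, not the paper's code. It assumes the identity takes the second-difference form $[g_{\varepsilon}(\cdot;\mathbf{w},3h)-2g_{\varepsilon}(\cdot;\mathbf{w},2h)+g_{\varepsilon}(\cdot;\mathbf{w},h)]/(2h^{2})=k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w})$, which follows from the atom definition because the second difference of $(t+b)^{2}$ in $b$ with step $h$ is the constant $2h^{2}$. All function names, the test point, and the hyperparameter values are hypothetical.

```python
def yat_atom(x, w, b, eps):
    """Biased Yat atom g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)


def imq(x, w, eps):
    """IMQ kernel as defined in this paper: 1 / (||x - w||^2 + eps)."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)


def imq_from_three_yat_atoms(x, w, h, eps):
    """Second finite difference in the bias over {h, 2h, 3h}, scaled by 1/(2h^2)."""
    second_diff = (
        yat_atom(x, w, 3 * h, eps)
        - 2.0 * yat_atom(x, w, 2 * h, eps)
        + yat_atom(x, w, h, eps)
    )
    return second_diff / (2.0 * h * h)


# Hypothetical test point, center, and hyperparameters (assumptions).
x = [1.5, 0.3, -2.0]
w = [0.7, -1.2, 0.4]
eps, h = 0.5, 0.25

recovered = imq_from_three_yat_atoms(x, w, h, eps)
direct = imq(x, w, eps)
print(abs(recovered - direct))  # expected to sit at floating-point rounding level
```

Applying the same three-atom replacement to each of the $m$ centers of an IMQ expansion gives the $3m$-atom Yat expansion of the corollary, with coefficients $a_{j}/(2h^{2})\cdot(1,-2,1)$ under the assumed form of the identity.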

5 Layer-Local RKHS Capacity

Section 4 transferred IMQ approximation rates to Yat at factor-three width overhead but left the radius of the relevant RKHS ball unspecified, and an inherited rate without a constructive radius is information-theoretic, not operational (see Footnote 1). The reproducing property of $\mathcal{H}_{b,\varepsilon}$ closes the gap: the squared RKHS norm of any finite kernel expansion $\sum_{j}\alpha_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)$ equals the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ in the trained weights and read-out coefficients, computable from those parameters alone, with no Lipschitz surrogate and no spectral product across depth. Plugging this norm into the RKHS-ball Rademacher framework yields a single-layer bound $B(R^{2}+b)/(\sqrt{n}\sqrt{\varepsilon})$ whose only Yat-specific ingredient is the kernel diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, which carries both $b$ and $\varepsilon$ into the bound as explicit regularisation targets, in contrast to the constant diagonals of Gaussian RBF and IMQ.

Footnote 1: The factor-three rate-transfer construction uses three biases $\{h,2h,3h\}$ per center, so the constructed approximant lives in the wider biased Yat span $F_{\text{ⵟ}}^{\geq 0}$ rather than in any single shared-$(b,\varepsilon)$ RKHS. The closed-form norm and Rademacher bound here apply per shared-$(b,\varepsilon)$ Yat layer; the two are companion statements over the kernel family, not over one trained network. Per-channel norms combine via the RKHS sum rule.

Proposition 3 (Closed-form RKHS norm).

Let $b\geq 0$, $\varepsilon>0$, and let $f(\cdot)=\sum_{j=1}^{m}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\in\mathcal{H}_{b,\varepsilon}$ be a finite kernel expansion with shared $(b,\varepsilon)$. Then

$$\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}\;=\;\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha},\qquad\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j}). \tag{6}$$

The identity holds for any choice of $\{\mathbf{w}_{j}\}$, including repeated centers; in that case $\mathbf{K}$ may be singular, and although different coefficient vectors can represent the same function $f$, their difference lies in $\ker\mathbf{K}$ and the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is invariant across equivalent representations.

The reproducing property (Aronszajn, 1950; Steinwart and Christmann, 2008) delivers this as an exact finite-dimensional formula: for a shared-$(b,\varepsilon)$ Yat layer, the right-hand side is computable from the trained weights $(\mathbf{w}_{j},\boldsymbol{\alpha})$ alone, in closed form, with no Lipschitz surrogate. The identity is mathematically tight at every $\boldsymbol{\alpha}$ even when $\mathbf{K}$ is singular; near-coincident centers are a numerical-conditioning issue, not a mathematical overstatement, and stable factorisation or deduplication suffices in floating-point implementations. Plugging this norm into the standard RKHS-ball Rademacher inequality (Bartlett and Mendelson, 2002) yields the Yat-specific generalisation bound below; the only ingredient that depends on the kernel beyond a generic Mercer-section property is the diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, which transfers $b$ and $\varepsilon$ into explicit terms in the bound.
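As an illustration of Proposition 3, the following sketch evaluates the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ directly from centers and read-out coefficients, and checks the stated invariance under repeated centers by comparing a three-atom expansion with a duplicated center against its merged two-atom form. It is a minimal pure-Python sketch under assumed toy values; the helper names `gram_matrix` and `rkhs_norm_sq` are hypothetical and do not come from the paper.

```python
def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def gram_matrix(centers, b, eps):
    """Gram matrix K_ij = k_{b,eps}(w_i, w_j) over the layer's centers."""
    return [[yat_kernel(wi, wj, b, eps) for wj in centers] for wi in centers]


def rkhs_norm_sq(alpha, centers, b, eps):
    """Squared RKHS norm alpha^T K alpha of sum_j alpha_j k(w_j, .)."""
    K = gram_matrix(centers, b, eps)
    m = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))


# Toy centers and shared (b, eps); all values are assumptions for illustration.
w1, w2 = [1.0, 0.0], [0.5, -1.0]
b, eps = 0.3, 0.1

# Same function written two ways: w1 repeated, versus coefficients merged.
norm_dup = rkhs_norm_sq([1.0, 0.5, 2.0], [w1, w2, w1], b, eps)
norm_merged = rkhs_norm_sq([3.0, 0.5], [w1, w2], b, eps)
print(norm_dup, norm_merged)  # the two values should agree up to rounding
```

The double loop makes the $O(m^{2}d)$ evaluation cost visible: one kernel evaluation of cost $O(d)$ per pair of centers.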

Theorem 6 (Single-layer Rademacher bound).

Fix $b\geq 0$, $\varepsilon>0$. Let $\mathcal{F}_{B,b,\varepsilon}=\{f\in\mathcal{H}_{b,\varepsilon}:\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq B\}$ be the RKHS ball of radius $B$, assume inputs satisfy $\|\mathbf{x}\|\leq R$, and let $\{\mathbf{x}_{i}\}_{i=1}^{n}$ be i.i.d. samples. With the convention $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}):=\mathbb{E}_{\boldsymbol{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(\mathbf{x}_{i})$, $\sigma_{i}\stackrel{\mathrm{iid}}{\sim}\mathrm{Unif}\{\pm 1\}$,

$$\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;\frac{B(R^{2}+b)}{\sqrt{n}\,\sqrt{\varepsilon}}. \tag{7}$$

A data-dependent empirical refinement (Corollary 7) replaces $(R^{2}+b)$ by $\sqrt{(1/n)\sum_{i}(\|\mathbf{x}_{i}\|^{2}+b)^{2}}$. The polynomial diagonal has a real cost on unnormalised high-dimensional inputs and we record it explicitly: under isotropic Gaussian $\mathbf{x}\in\mathbb{R}^{d}$, $\mathbb{E}[k_{b,\varepsilon}(\mathbf{x},\mathbf{x})]=\Theta(d^{2}/\varepsilon)$ and the bound scales as $B(d+b)/\sqrt{n\varepsilon}$, against the constant diagonals $1$ for Gaussian RBF and $1/\varepsilon$ for IMQ. Feature-scale control (e.g. unit-sphere normalisation, which collapses the diagonal to $(1+b)^{2}/\varepsilon$) recovers a dimension-free bound; the analysis makes the dependence explicit via $b$ and $\varepsilon$ rather than hiding it. The high-dimensional variance preservation $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(1)$ vs. $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-1/2})$ (Proposition 5) survives under this normalisation because $\mathrm{CV}$ is scale-invariant. On the unit sphere, $\|\mathbf{x}-\mathbf{w}\|^{2}=2-2\mathbf{x}^{\top}\mathbf{w}$, so $k_{b,\varepsilon}$ becomes a function of the single variable $\mathbf{x}^{\top}\mathbf{w}$, i.e. a zonal kernel; the directional far-field separation of §3 then refines into a sharper spectral statement, with exponential eigenvalue decay in the spherical-harmonic basis (Appendix K).
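To make the two bounds concrete, the sketch below evaluates the worst-case bound (7) and the empirical refinement described above on a small hypothetical sample. It assumes the refinement is obtained exactly as the text states, by replacing $(R^{2}+b)$ with $\sqrt{(1/n)\sum_{i}(\|\mathbf{x}_{i}\|^{2}+b)^{2}}$; the function names, the radius $B$, and the data are illustrative assumptions.

```python
import math


def worst_case_bound(B, R, b, eps, n):
    """Right-hand side of (7): B (R^2 + b) / (sqrt(n) sqrt(eps))."""
    return B * (R ** 2 + b) / (math.sqrt(n) * math.sqrt(eps))


def empirical_bound(B, xs, b, eps):
    """Refinement: (R^2 + b) replaced by sqrt(mean_i (||x_i||^2 + b)^2)."""
    n = len(xs)
    mean_sq_diag = sum((sum(v * v for v in x) + b) ** 2 for x in xs) / n
    return B * math.sqrt(mean_sq_diag) / (math.sqrt(n) * math.sqrt(eps))


# Hypothetical sample and hyperparameters (assumptions, not from the paper).
xs = [[0.6, 0.8], [1.0, 0.0], [0.0, 0.5], [-0.3, 0.4]]
B, b, eps = 2.0, 0.5, 0.25
R = max(math.sqrt(sum(v * v for v in x)) for x in xs)  # data-side radius

worst = worst_case_bound(B, R, b, eps, len(xs))
empirical = empirical_bound(B, xs, b, eps)
print(worst, empirical)  # the empirical value should not exceed the worst case
```

Because each $\|\mathbf{x}_{i}\|^{2}+b$ is at most $R^{2}+b$, the empirical quantity can only tighten the bound; the gap grows when most samples lie well inside the ball of radius $R$.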

The Rademacher bound is hidden-layer-local: $B$ is computed from the trained $(\mathbf{w}_{j},\boldsymbol{\alpha})$ of one shared-$(b,\varepsilon)$ Yat layer, $R$ is a data-side radius on the layer’s input, and there is no spectral product across depth or Lipschitz surrogate.

The reproducing-property derivation is generic to any continuous Mercer kernel: a learned-center RBF or IMQ layer admits the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ formula and the same diagonal-based Rademacher bound. The Yat-specific content is the form of the diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, which depends nontrivially on $\mathbf{x}$ and on both $b$ and $\varepsilon$, whereas the Gaussian RBF diagonal is the constant $1$ and the IMQ diagonal is $\varepsilon^{-1}$. Ordinary scalar MLP units do not induce such a diagonal at all because they do not define a Mercer section over the shared input/weight space. Evaluating $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ at width $m$ on input dimension $d$ costs $O(m^{2}d)$ versus $O(md)$ for a forward pass; gradient descent driving $b\to 0^{+}$ inflates the embedding constant $1/b$ and the conditioning of $\mathbf{K}$, so a softplus-style reparametrisation is recommended.
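One way to realise the softplus-style reparametrisation mentioned above is sketched here. This is an assumption about the form such a reparametrisation could take, not the paper's implementation: an unconstrained raw parameter is mapped to $b = b_{\min} + \mathrm{softplus}(\text{raw})$, where the floor `b_min` is a hypothetical hyperparameter introduced only so that $1/b$ stays bounded by $1/b_{\min}$.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t)); avoids overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def bias_from_raw(raw, b_min=1e-3):
    """Map an unconstrained parameter to a bias b >= b_min > 0.

    b_min is a hypothetical floor (an assumption of this sketch); it caps
    the embedding constant 1/b at 1/b_min however far raw is driven down.
    """
    return b_min + softplus(raw)


# The optimiser would update `raw`; the kernel only ever sees b > 0.
for raw in (-50.0, 0.0, 5.0):
    b = bias_from_raw(raw)
    print(raw, b, 1.0 / b)
```

The floor trades a small restriction on the admissible biases for a hard cap on $1/b$; whether that restriction is acceptable depends on the application.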

The Yat kernel section $k_{b,\varepsilon}(\mathbf{w},\cdot)$ is uniformly bounded on every compact $\|\mathbf{x}\|\leq R$, $\|\mathbf{w}\|\leq W$ by $(RW+b)^{2}/\varepsilon$, and globally bounded on $\mathbb{R}^{d}$ for every fixed $\mathbf{w}$; at $b=0$ and $\mathbf{w}\neq\mathbf{0}$ the global supremum $\|\mathbf{w}\|^{4}/\varepsilon+\|\mathbf{w}\|^{2}$ is attained at $(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$ (Proposition 4, Appendix E.6). The diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in Theorem 6 is the special case $\mathbf{w}=\mathbf{x}$. Classical unbounded scalar activations such as ReLU and GELU satisfy $\sigma(\mathbf{w}^{\top}\mathbf{x})\to\infty$ as $\|\mathbf{x}\|\to\infty$, so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.

Table 1 compares Yat against the three closest classical kernel-valued primitives along the properties that drive layer-local RKHS analysis. Yat is the only row in which a polynomial alignment numerator and IMQ locality coexist while retaining fixed-kernel universality, characteristicness, and a closed-form layer-local RKHS norm. The directional asymptotic trace (Proposition 2) and the factor-3 tightness of the IMQ reduction (Theorem 9, App. E.2) turn the table from a rhetorical summary into a structural separation: Yat atoms add directional quadratic far-field components no finite IMQ combination can match, and at most two biased atoms cannot reproduce a single IMQ section.

What singles out Yat in Table 1 is not any one of these properties but their simultaneous combination: IMQ-type locality, polynomial alignment with nonzero quadratic far-field trace, an exact three-atom IMQ-recovery identity, fixed-kernel universality from Loewner domination, and variance preservation on high-dimensional inputs. Under isotropic Gaussian inputs, $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(1)$ while $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-1/2})$ (Proposition 5, Appendix E.6), because the Yat numerator depends on the one-dimensional projection $\mathbf{x}^{\top}\mathbf{w}$.

A multiclass margin bound of order $O(C(C-1)\kappa B/(\gamma\sqrt{n}))$ with $\kappa\leq(R^{2}+b)/\sqrt{\varepsilon}$ follows from the closed-form norm; a peeling argument (Theorem 13) replaces the worst-case radius by the trained $\max\{B(\mathbf{f}),B_{0}\}$ at $\log\log$ cost. Four further RKHS-ball comparisons propagate the Loewner domination to interpolation-norm, pullback, alignment-excess, and spectral-effective-dimension bounds (Appendix G).
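The closed-form supremum at $b=0$ can be sanity-checked numerically. The sketch below evaluates the unbiased atom at the stated maximiser $(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$, compares the result with $\|\mathbf{w}\|^{4}/\varepsilon+\|\mathbf{w}\|^{2}$, and probes a small grid of perturbed points. The center, $\varepsilon$, and perturbation offsets are arbitrary assumptions chosen for illustration, and a finite grid is of course only a spot check, not a proof.

```python
def yat_unbiased(x, w, eps):
    """Unbiased Yat atom (b = 0): (x.w)^2 / (||x - w||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return dot ** 2 / (dist2 + eps)


def sup_location(w, eps):
    """Stated maximiser (1 + eps / ||w||^2) w."""
    s = sum(wi * wi for wi in w)
    return [(1.0 + eps / s) * wi for wi in w]


def sup_value(w, eps):
    """Stated supremum ||w||^4 / eps + ||w||^2."""
    s = sum(wi * wi for wi in w)
    return s * s / eps + s


# Hypothetical center and regulariser (assumptions).
w = [1.0, 2.0]
eps = 0.5

x_star = sup_location(w, eps)
attained = yat_unbiased(x_star, w, eps)
claimed = sup_value(w, eps)

# Perturb the maximiser on a small grid; every value should fall below it.
offsets = [-0.5, -0.1, 0.1, 0.5]
nearby = [
    yat_unbiased([x_star[0] + dx, x_star[1] + dy], w, eps)
    for dx in offsets
    for dy in offsets
]
print(attained, claimed, max(nearby))
```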

6 Conclusion and Discussion

The Yat kernel $k_{b,\varepsilon}$ is a single-layer object on which four properties simultaneously hold: PSD/Mercer for $b\geq 0$, fixed-kernel universality for $b>0$ from Loewner-IMQ domination, an exact three-atom finite-difference IMQ identity (sharp in every dimension), and a closed-form layer-local RKHS norm with diagonal-driven Rademacher control. Composing the unit at depth preserves none of these globally: no parameter-independent Mercer kernel on the input domain captures the depth-$L$ Yat-Gram norm of every trained network. The single-layer story does not extend to a global theory, but it survives in two complementary regimes that delimit the structural scope of the analysis.

In the per-prefix regime, we fix the prior layers. The pulled-back kernel $\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime}):=k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}^{\prime}))$ is PSD on $\mathcal{X}$, and $\boldsymbol{\alpha}_{\ell,c}^{\top}\mathbf{K}_{\ell}\boldsymbol{\alpha}_{\ell,c}$ upper-bounds $\|z_{\ell,c}\|_{\mathcal{H}_{\widetilde{k}_{\ell}}}^{2}$ (Theorem 14), with equality on the prefix-range section subspace; $\widetilde{k}_{\ell}$ inherits universality when $b_{\ell}>0$ and $\Phi_{\ell-1}$ is continuous and injective (generic at $m_{\ell}\geq d_{\ell-1}+1$, not automatic). The pullback construction is closest in form to convolutional kernel networks (Mairal et al., 2014; Mairal, 2016), differing in unit and in the explicit closed-form per-layer norm. In the parameter-independent regime, every depth-$L$ coordinate under bounded $(\mathbf{w}_{\ell,j},b_{\ell},\varepsilon_{\ell},\boldsymbol{\alpha}_{\ell})$ lies in a fixed ambient Sobolev RKHS $H^{s}(\mathbb{R}^{d_{0}})$ ($s>d_{0}/2$, Theorem 19); the depth dependence is super-exponential, so the result is qualitative containment (degenerate at $d_{0}\gg 1$), not a capacity certificate. The infinite-width Yat NTK is universal on every compact domain for $b>0$ (Appendix N).

These two finite-depth regimes sit at opposite ends of one trade-off: per-prefix bookkeeping is exact but prefix-dependent, whereas ambient Sobolev containment is parameter-independent but coarse. No single Mercer kernel on $X$ absorbs every trained Yat stack’s layer-local Gram structure simultaneously, and we formalise this impossibility as a conjecture.

Conjecture 1 (No exact global deep Yat-Gram norm).

Fix $L\geq 2$, input domain $X\subset\mathbb{R}^{d_{0}}$, layer widths $(m_{1},\dots,m_{L})$, and the Yat-stack hypothesis class with every prefix $\Phi_{\ell-1}$ continuous, injective on $X$, and bilipschitz onto its image, and every Yat layer satisfying $b_{\ell}\geq 0$, $\varepsilon_{\ell}>0$, with pairwise distinct centers. For each output coordinate $c\in\{1,\dots,m_{L}\}$ let $\mathbf{A}_{\ell,c}\in\mathbb{R}^{m_{\ell}\times d_{\ell}}$ denote the trained read-out matrix carrying coordinate $c$ through layer $\ell$ (with $\mathbf{A}_{L,c}$ the $c$-th row of the final read-out and $\mathbf{A}_{\ell,c}$ the corresponding contribution at intermediate layers via the chain). There exists no Mercer kernel $K^{\star}$ on $X\times X$, independent of the parameters $(\mathbf{w}_{\ell},\boldsymbol{\alpha}_{\ell},b_{\ell},\varepsilon_{\ell})$, satisfying, for every output coordinate $c\in\{1,\dots,m_{L}\}$ and every trained Yat stack in the hypothesis class,

$$\|z_{L,c}\|_{\mathcal{H}_{K^{\star}}}^{2}\;=\;\sum_{\ell=1}^{L}\mathrm{Tr}\!\bigl(\mathbf{A}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{A}_{\ell,c}\bigr).$$

Two empirical checks corroborate the theory: $(\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c})^{1/2}$ correlates with per-class accuracy at $\rho=0.564$ on a 1000-class CLIP probe (Appendix R), and a matched-Chinchilla 261M Yat causal LM reaches the GELU baseline over 3 seeds (Appendix T).

References

  • R. A. Adams and J. J. F. Fournier (2003) Sobolev spaces. 2nd edition, Pure and Applied Mathematics, Vol. 140, Academic Press. Cited by: §J.1, §J.1.
  • N. Aronszajn (1950) Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (3), pp. 337–404. Cited by: Appendix C, §5.
  • S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, Cited by: Appendix N.
  • F. Bach (2017) Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18 (19), pp. 1–53. Cited by: Appendix B.
  • P. L. Bartlett, O. Bousquet, and S. Mendelson (2005) Local Rademacher complexities. The Annals of Statistics 33 (4), pp. 1497–1537. Cited by: §L.1, Appendix L.
  • P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix B, §1.
  • P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: Appendix B, §E.5, §5.
  • A. Berlinet and C. Thomas-Agnan (2004) Reproducing kernel Hilbert spaces in probability and statistics. Springer. Cited by: Theorem 3.
  • A. Bietti and J. Mairal (2019) On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
  • A. Bietti, G. Mialon, D. Chen, and J. Mairal (2019) A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning (ICML), Cited by: Appendix O.
  • B. Bordelon, A. Canatar, and C. Pehlevan (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
  • N. Boullé, Y. Nakatsukasa, and A. Townsend (2020) Rational neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix B.
  • D. S. Broomhead and D. Lowe (1988) Multivariable functional interpolation and adaptive networks. Complex Systems 2 (3), pp. 321–355. Cited by: Appendix B.
  • M. D. Buhmann (2003) Radial basis functions: theory and implementations. Cambridge University Press. Cited by: Appendix B.
  • C. J. C. Burges (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (2), pp. 121–167. Cited by: Appendix B.
  • Y. Cao and Q. Gu (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
  • A. Caponnetto and E. De Vito (2007) Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), pp. 331–368. Cited by: §L.2, Appendix L, Remark 10.
  • Y. Cho and L. K. Saul (2009) Kernel methods for deep learning. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Appendix B.
  • C. Cortes, P. Haffner, and M. Mohri (2004) Rational kernels: theory and algorithms. Journal of Machine Learning Research 5, pp. 1035–1062. Cited by: Appendix B.
  • K. Crammer and Y. Singer (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, pp. 265–292. Cited by: Appendix B.
  • A. Daniely, R. Frostig, and Y. Singer (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: NeurIPS Paper Checklist.
  • W. E, C. Ma, and L. Wu (2019) A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences 17 (5), pp. 1407–1425. Cited by: §J.1, Appendix B.
  • N. El Karoui (2010) The spectrum of kernel random matrices. The Annals of Statistics 38 (1), pp. 1–50. Cited by: §E.6.
  • R. Eldan and O. Shamir (2016) The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), pp. 907–940. Cited by: Appendix B, §3.
  • M. G. Genton (2001) Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research 2, pp. 299–312. Cited by: Appendix B.
  • B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari (2019) Linearized two-layers neural networks in high dimension. In Advances in Neural Information Processing Systems, Cited by: §E.6.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773. Cited by: Appendix M.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. Cited by: §1.
  • J. Hoffmann et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: Appendix T.
  • R. A. Horn and C. R. Johnson (2012) Matrix analysis. 2nd edition, Cambridge University Press. Cited by: Appendix C, §E.1, §E.7, §I.2.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In NeurIPS, Cited by: Appendix N, Appendix N, Appendix B.
  • B. Laurent and P. Massart (2000) Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28 (5), pp. 1302–1338. Cited by: §E.6.
  • J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep neural networks as Gaussian processes. In ICLR, Cited by: Appendix B.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR, Cited by: Appendix T.
  • J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid (2014) Convolutional kernel networks. In Advances in Neural Information Processing Systems, Cited by: Appendix B, §6.
  • J. Mairal (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
  • J. Mercer (1909) Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A 209, pp. 415–446. Cited by: Appendix C.
  • C. A. Micchelli, Y. Xu, and H. Zhang (2006) Universal kernels. Journal of Machine Learning Research 7, pp. 2651–2667. Cited by: Appendix C, §E.3, §E.9, §1, §3, §3, §4, Theorem 8.
  • A. Molina, P. Schramowski, and K. Kersting (2020) Padé activation units: end-to-end learning of flexible activation functions in deep networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B.
  • J. Moody and C. J. Darken (1989) Fast learning in networks of locally-tuned processing units. Neural Computation 1 (2), pp. 281–294. Cited by: Appendix B.
  • R. M. Neal (1996) Bayesian learning for neural networks. Lecture Notes in Statistics, Vol. 118, Springer. Cited by: Appendix B.
  • B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2018) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B, §1.
  • C. J. Paciorek and M. J. Schervish (2006) Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17 (5), pp. 483–506. Cited by: Appendix B.
  • J. Park and I. W. Sandberg (1991) Universal approximation using radial-basis-function networks. Neural Computation 3 (2), pp. 246–257. Cited by: Appendix B.
  • V. I. Paulsen and M. Raghupathi (2016) An introduction to the theory of reproducing kernel Hilbert spaces. Cambridge Studies in Advanced Mathematics, Vol. 152, Cambridge University Press. Cited by: Appendix N, Appendix C, §E.8, Theorem 2, Theorem 3, Theorem 3.
  • P. Petersen and F. Voigtländer (2018) Optimal approximation of piecewise smooth functions using deep ReLU networks. Neural Networks 108, pp. 296–330. Cited by: §J.1.
  • A. Pinkus (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8, pp. 143–195. Cited by: Appendix B.
  • A. Radford et al. (2021) Learning transferable visual models from natural language supervision. In ICML, Cited by: NeurIPS Paper Checklist.
  • J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap (2020) Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations (ICLR), Cited by: Table 6.
  • C. Raffel et al. (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. Cited by: Appendix T.
  • A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
  • A. Rahimi and B. Recht (2008) Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
  • T. Runst and W. Sickel (1996) Sobolev spaces of fractional order, Nemytskij operators, and nonlinear partial differential equations. De Gruyter. Cited by: §J.1.
  • R. Schaback (1995) Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics 3 (3), pp. 251–264. Cited by: Appendix B, Appendix F.
  • B. Schölkopf and A. J. Smola (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press. Cited by: Appendix B, Appendix B.
  • A. J. Smola, Z. L. Óvári, and R. C. Williamson (2000) Regularization with dot-product kernels. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, Table 1.
  • B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet (2011) Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12, pp. 2389–2410. Cited by: §J.3, Appendix B, Appendix C.
  • B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet (2010) Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11, pp. 1517–1561. Cited by: Appendix C, §E.4, Corollary 1.
  • I. Steinwart and A. Christmann (2008) Support vector machines. Springer. Cited by: Appendix B, Appendix C, §E.3, §3, §5, Theorem 7.
  • I. Steinwart, D. R. Hush, and C. Scovel (2009) Optimal rates for regularized least squares regression. In Conference on Learning Theory (COLT), pp. 79–93. Cited by: §L.2, Appendix L.
  • M. Telgarsky (2017) Neural networks and rational functions. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
  • H. Wendland (2004) Scattered data approximation. Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press. Cited by: Appendix K, Appendix K, Appendix K, Appendix K, Appendix B, Appendix B, Appendix F.
  • D. V. Widder (1941) The Laplace transform. Princeton University Press. Cited by: Appendix C.
  • A. G. Wilson and R. P. Adams (2013) Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In AISTATS, Cited by: Appendix B.

Appendix roadmap.

The appendices are organized in three tiers. Tier 1 (proofs of main-body results): Appendix E (one top-level section, four subsections, one per main-body theorem block) contains proofs of every result in Sections 2–5. Tier 2 (extensions of the main-body theory): Appendix G treats exterior-shell separations and the polynomial-separation gap on line-containing domains; Appendix H develops the multiclass learned-norm generalization bound (Rademacher + peeling); Appendix I contains the prefix-pullback theorem and the four Loewner-comparison theorems (interpolation norms, pullback domination, alignment-excess, spectral comparison); Appendix J establishes the generic ambient Sobolev-RKHS containment of bounded smooth-atom stacks; Appendix F records the three-step asymptotic filtration of the biased span and a native-space rate-transfer statement with coefficient-mass lower bound; Appendix P converts the directional-trace separation into a quantitative atom-count bound; Appendix Q states the degree-matching skeleton for a uniqueness conjecture. Tier 3 (orthogonal applications of the same single-layer kernel): Appendix K computes the Funk–Hecke spectrum on $\mathbb{S}^{d-1}$, including exponential eigenvalue decay and logarithmic effective dimension; Appendix L derives fast rates via eigenvalue decay (ERM and KRR routes); Appendix M gives quantitative MMD and two-sample-testing sample complexities; Appendix N computes the infinite-width NTK; Appendix O derives an intrinsic RKHS-Lipschitz bound and certified adversarial radius. Tier-3 sections are self-contained and can be read independently of one another. The CLIP-probe and directional-tail benchmark appendices (Appendices R, S) collect numerical evidence for results stated in the main body, and Appendix T exercises the per-prefix theory inside a trained 261M Yat causal language model.

Appendix A Preliminaries and Notation

This section consolidates the notation used throughout the paper, both in the main body and in the appendices. Generic RKHS background (Mercer’s theorem, Moore–Aronszajn, universality / characteristicness / SPD, Schur products, Laplace representation) is in Appendix C; named results from the literature we cite by label (Steinwart–Christmann product-kernel containment, Micchelli IMQ universality) are in Appendix D.

Table 2: Quick reference for the notation. The table is grouped by category; longer definitions and standing assumptions follow.
Symbol | Definition | First used
Domains and dimensions
$d,\ d_{0},\ d_{\ell}$ | input dimension; depth-$\ell$ width | §1
$\mathcal{X}\subset\mathbb{R}^{d}$ | compact input domain | §1
$\mathbb{S}^{d-1}$ | unit sphere $\{\mathbf{u}:\lVert\mathbf{u}\rVert=1\}$ | §1
$B_{R}$ | closed ball $\{\mathbf{x}:\lVert\mathbf{x}\rVert\leq R\}$ | §6
Kernels and atoms
$D_{\varepsilon}(\mathbf{x},\mathbf{w})$ | $\lVert\mathbf{x}-\mathbf{w}\rVert^{2}+\varepsilon$ (regularised squared distance) | §1
$g_{\varepsilon}(\mathbf{x};\mathbf{w},b)$ | $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}/D_{\varepsilon}$ (biased Yat atom) | §1
$k_{b,\varepsilon},\ k_{\text{ⵟ}}$ | Yat kernel $g_{\varepsilon}$; unbiased $k_{0,\varepsilon}$ | §1
$h_{\varepsilon},\ k_{\mathrm{IMQ}}^{\varepsilon}$ | IMQ kernel $1/D_{\varepsilon}$ | §1
RKHS objects
$\mathcal{H}_{b,\varepsilon}$, $\mathcal{H}_{\text{ⵟ}}$, $\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$ | RKHSs of $k_{b,\varepsilon}$, $k_{\text{ⵟ}}$, $h_{\varepsilon}$ | §3
$\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ | squared norm of $\sum_{j}\alpha_{j}k(\mathbf{w}_{j},\cdot)$ at distinct $\mathbf{w}_{j}$ | §6
$\preceq$ | Loewner PSD order on kernels / Gram matrices | §3
Spans of atoms
$F_{\text{ⵟ},\varepsilon}^{+}$ | $\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},\,b>0\}$ | §5
$F_{\text{ⵟ}}^{+},\ F_{\text{ⵟ}}^{\geq 0}$ | broader Yat spans over $b>0$ resp. $b\geq 0$, $\varepsilon>0$ | App. A
$F_{\mathrm{IMQ},\varepsilon}$ | $\mathrm{span}\{h_{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}$ | §5
$\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ | bounded-variation IMQ class: $\sum\lvert a_{j}\rvert\leq\Lambda$, $\lVert\mathbf{v}_{j}\rVert\leq W$ | App. G
$\mathcal{R}_{\text{ⵟ}}(\Lambda,W)$ | Yat analogue of $\mathcal{R}_{\mathrm{IMQ}}$ | App. P
Operators
$T_{\infty}F(\mathbf{u})$ | $\lim_{r\to\infty}F(r\mathbf{u})$ (asymptotic-trace operator) | §4
$\mathcal{Q}_{2}(\mathbb{S}^{d-1})$ | quadratic forms $\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}$ on the sphere | §4
$T_{k}$ | Mercer integral operator $f\mapsto\int k(\cdot,\mathbf{x})f(\mathbf{x})\,d\rho(\mathbf{x})$ | App. L
$\mathcal{N}_{k}(\lambda)$ | effective dimension $\operatorname{tr}(T_{k}(T_{k}+\lambda I)^{-1})$ | App. L
Network and pullback
$\Phi_{\ell},\ T_{\ell},\ m_{\ell}$ | prefix map, layer map, layer width | §7
$z_{\ell,c},\ \boldsymbol{\alpha}_{\ell,c}$ | layer-$\ell$ output coordinate $c$ and its read-out vector | §7
$\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime})$ | $k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}^{\prime}))$ (pullback) | §7

Standing assumptions.

Throughout, $d\geq 1$, $\mathcal{X}\subset\mathbb{R}^{d}$ is compact, $b\geq 0$, and $\varepsilon>0$. The sphere is $\mathbb{S}^{d-1}=\{\mathbf{u}\in\mathbb{R}^{d}:\|\mathbf{u}\|=1\}$ and the closed ball is $B_{R}=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}$.

Kernels.

The regularised squared distance and the three kernels we work with:

\begin{align*}
D_{\varepsilon}(\mathbf{x},\mathbf{w}) &:= \|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon,\\
\text{biased Yat atom:}\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},b) &:= (\mathbf{x}^{\top}\mathbf{w}+b)^{2}/D_{\varepsilon}(\mathbf{x},\mathbf{w}),\\
\text{Yat kernel:}\quad k_{b,\varepsilon}(\mathbf{w},\mathbf{x}) &:= g_{\varepsilon}(\mathbf{x};\mathbf{w},b),\quad k_{\text{ⵟ}}(\mathbf{w},\mathbf{x}):=k_{0,\varepsilon}(\mathbf{w},\mathbf{x}),\\
\text{IMQ kernel:}\quad h_{\varepsilon}(\mathbf{x},\mathbf{w}) &:= k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}) := D_{\varepsilon}(\mathbf{x},\mathbf{w})^{-1}.
\end{align*}
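As a reading aid, the four definitions above can be transcribed into plain Python. This is a minimal sketch rather than the authors' implementation: the function names `d_eps`, `yat_atom`, `yat_kernel`, and `imq_kernel` are illustrative choices, and vectors are assumed to be plain sequences of floats.

```python
def d_eps(x, w, eps):
    """Regularised squared distance D_eps(x, w) = ||x - w||^2 + eps."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps


def yat_atom(x, w, b, eps):
    """Biased Yat atom g_eps(x; w, b) = (x.w + b)^2 / D_eps(x, w)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return (dot + b) ** 2 / d_eps(x, w, eps)


def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) := g_eps(x; w, b); b = 0 gives the unbiased kernel."""
    return yat_atom(x, w, b, eps)


def imq_kernel(x, w, eps):
    """IMQ kernel h_eps(x, w) = 1 / D_eps(x, w)."""
    return 1.0 / d_eps(x, w, eps)


# Illustrative values only (not taken from the paper).
x = (1.0, 2.0)
diagonal = yat_kernel(x, x, b=0.5, eps=0.1)  # should match (||x||^2 + b)^2 / eps
```

On the diagonal the distance term vanishes, so `yat_kernel(x, x, b, eps)` reduces to $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, the input-dependent diagonal quoted in the abstract.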

RKHS objects.

$\mathcal{H}_{b,\varepsilon}$, $\mathcal{H}_{\text{ⵟ}}$, and $\mathcal{H}_{h_{\varepsilon}}=\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$ denote the RKHSs of $k_{b,\varepsilon}$, $k_{\text{ⵟ}}$, and $h_{\varepsilon}$ respectively. The Loewner PSD order is denoted $\preceq$. The RKHS $\mathcal{H}_{k}$ associated with a continuous PSD kernel $k$ is the unique Hilbert space of functions on $\mathcal{X}$ that contains every section $k(\cdot,\mathbf{x})$ and for which the reproducing property $f(\mathbf{x})=\langle f,k(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{k}}$ holds for all $f\in\mathcal{H}_{k}$ and $\mathbf{x}\in\mathcal{X}$. The squared norm $\|f\|_{\mathcal{H}_{k}}^{2}$ of a finite expansion $f=\sum_{j}\alpha_{j}k(\mathbf{w}_{j},\cdot)$ at distinct centers $\{\mathbf{w}_{j}\}_{j=1}^{m}$ equals $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$, where $\mathbf{K}_{ij}:=k(\mathbf{w}_{i},\mathbf{w}_{j})$ is the Gram matrix.
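The closed-form norm can be evaluated directly from the Gram matrix. The sketch below assumes the Yat kernel with shared $(b,\varepsilon)$; the helper names `gram` and `rkhs_sq_norm` and the numerical values are hypothetical and chosen only for illustration.

```python
def yat_kernel(w, x, b, eps):
    """k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def gram(centers, b, eps):
    """Gram matrix K_ij = k(w_i, w_j) over the centers."""
    return [[yat_kernel(wi, wj, b, eps) for wj in centers] for wi in centers]


def rkhs_sq_norm(alpha, centers, b, eps):
    """Squared RKHS norm alpha^T K alpha of f = sum_j alpha_j k(w_j, .)."""
    K = gram(centers, b, eps)
    m = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))


# Illustrative centers and coefficients (assumptions, not from the paper).
centers = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alpha = [0.5, -1.0, 0.25]
sq_norm = rkhs_sq_norm(alpha, centers, b=1.0, eps=0.5)
```

Because the kernel is PSD for $b\geq 0$, `sq_norm` is nonnegative for any choice of `alpha`; a single atom with coefficient $\alpha$ has squared norm $\alpha^{2}k(\mathbf{w},\mathbf{w})$.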

Spans of atoms.

The atom families that appear repeatedly:

\begin{align*}
F_{\text{ⵟ},\varepsilon}^{+} &:= \mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},\ b>0\},\\
F_{\text{ⵟ}}^{+} &:= \mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},\ b>0,\ \varepsilon>0\},\\
F_{\text{ⵟ}}^{\geq 0} &:= \mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},\ b\geq 0,\ \varepsilon>0\},\\
F_{\mathrm{IMQ},\varepsilon} &:= \mathrm{span}\{k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}.
\end{align*}

Bounded-variation and bounded-center sub-classes used in lower bounds:

\[
\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W):=\Bigl\{\textstyle\sum_{j}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j}):\ \sum_{j}|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\Bigr\},\quad\mathcal{R}_{\text{ⵟ}}(\Lambda,W)\text{ analogously.}
\]

Operators on the spans.

The directional asymptotic-trace operator takes the radial limit along each direction:

\[
T_{\infty}F(\mathbf{u})\;:=\;\lim_{r\to\infty}F(r\mathbf{u}),\qquad\mathbf{u}\in\mathbb{S}^{d-1},
\]

defined on the linear subspace $\mathcal{D}_{T_{\infty}}\subset C(\mathbb{R}^{d})$ where the limit exists for every direction. The image of $T_{\infty}$ on $F_{\text{ⵟ}}^{\geq 0}$ is the space of quadratic forms restricted to $\mathbb{S}^{d-1}$, $\mathcal{Q}_{2}(\mathbb{S}^{d-1}):=\{\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}:\mathbf{A}\in\mathrm{Sym}(d)\}$. The Mercer integral operator of a kernel $k$ on $L^{2}(\rho)$ is denoted $T_{k}:f\mapsto\int k(\cdot,\mathbf{x})f(\mathbf{x})\,d\rho(\mathbf{x})$, with effective dimension $\mathcal{N}_{k}(\lambda):=\mathrm{tr}(T_{k}(T_{k}+\lambda I)^{-1})$.
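The radial limit can be probed numerically by evaluating an atom at increasing radii along a fixed unit direction. The following sketch uses illustrative values for $\mathbf{u}$, $\mathbf{w}$, $b$, and $\varepsilon$, and a hypothetical helper `along_ray`; it compares a Yat atom, whose values approach $(\mathbf{u}^{\top}\mathbf{w})^{2}$, with an IMQ atom, whose values decay to zero.

```python
def yat_atom(x, w, b, eps):
    """Biased Yat atom g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)


def imq_kernel(x, w, eps):
    """IMQ kernel h_eps(x, w) = 1 / (||x - w||^2 + eps)."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)


def along_ray(f, r, u):
    """Evaluate f at the point r * u on the ray with unit direction u."""
    return f([r * ui for ui in u])


u = (0.6, 0.8)   # unit direction (illustrative)
w = (2.0, -1.0)  # center (illustrative)
b, eps = 1.0, 0.5
trace = sum(ui * wi for ui, wi in zip(u, w)) ** 2  # (u.w)^2

for r in (1e2, 1e4, 1e6):
    yat_val = along_ray(lambda x: yat_atom(x, w, b, eps), r, u)
    imq_val = along_ray(lambda x: imq_kernel(x, w, eps), r, u)
    print(f"r={r:.0e}  yat={yat_val:.6f}  imq={imq_val:.2e}  trace={trace:.6f}")
```

A finite radius only approximates the limit, so this is a sanity check on the definition rather than a proof of the trace formula.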

Network notation.

For a depth-$L$ Yat stack with input $\mathbf{x}\in X\subset\mathbb{R}^{d_{0}}$ and per-layer widths $m_{\ell}$,

\[
\Phi_{0}(\mathbf{x})=\mathbf{x},\qquad\Phi_{\ell}(\mathbf{x})=T_{\ell}(\Phi_{\ell-1}(\mathbf{x})),\qquad[T_{\ell}(\mathbf{z})]_{c}=\sum_{j=1}^{m_{\ell}}\alpha_{\ell,j,c}\,k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{w}_{\ell,j},\mathbf{z}),
\]

where $\Phi_{\ell-1}$ is the prefix map, $T_{\ell}$ the layer map, $c$ a coordinate index, and $z_{\ell,c}:=[\Phi_{\ell}]_{c}$ the layer-$\ell$ output coordinate. The pulled-back kernel on $X$ is $\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}'):=k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}'))$, where $k_{\ell}$ abbreviates $k_{b_{\ell},\varepsilon_{\ell}}$.
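The recursion above can be sketched as a forward pass. This is an illustrative reading of the notation, not the authors' code: the names `yat_layer` and `yat_stack`, the dictionary layout of a layer, and all numerical values are assumptions, and the read-out weights are stored as `alpha[j][c]` for $\alpha_{\ell,j,c}$.

```python
def yat_kernel(w, z, b, eps):
    """k_{b,eps}(w, z) = (w.z + b)^2 / (||z - w||^2 + eps)."""
    dot = sum(wi * zi for wi, zi in zip(w, z))
    dist2 = sum((zi - wi) ** 2 for wi, zi in zip(w, z))
    return (dot + b) ** 2 / (dist2 + eps)


def yat_layer(z, centers, alpha, b, eps):
    """Layer map: [T(z)]_c = sum_j alpha[j][c] * k_{b,eps}(w_j, z)."""
    sections = [yat_kernel(w, z, b, eps) for w in centers]
    n_out = len(alpha[0])
    return [sum(alpha[j][c] * sections[j] for j in range(len(centers)))
            for c in range(n_out)]


def yat_stack(x, layers):
    """Prefix maps: Phi_0(x) = x and Phi_l(x) = T_l(Phi_{l-1}(x))."""
    z = list(x)
    for layer in layers:
        z = yat_layer(z, layer["centers"], layer["alpha"], layer["b"], layer["eps"])
    return z


# Two illustrative layers: widths m_1 = 3 and m_2 = 2 (values are assumptions).
layers = [
    {"centers": [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
     "alpha": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]], "b": 1.0, "eps": 0.5},
    {"centers": [(0.5, 0.5), (-1.0, 0.0)],
     "alpha": [[1.0], [-0.5]], "b": 0.5, "eps": 1.0},
]
output = yat_stack((0.2, -0.3), layers)
```

Each layer's centers live in the output space of the previous layer, which is why the second layer's centers here are two-dimensional.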

Conventions.

$\mathbb{S}^{d-1}$ carries the uniform surface measure where unspecified; $\mathrm{Sym}(d)$ is the space of real $d\times d$ symmetric matrices; $\mathrm{vec}(\cdot)$ is column-major matrix vectorisation; $\langle\cdot,\cdot\rangle$ without subscript is the Euclidean inner product on $\mathbb{R}^{d}$. The Tifinagh letter ⵟ (Unicode U+2D5F) is used as the kernel-specific glyph in $k_{\text{ⵟ}}$.

Appendix B Related Work

RBF and learned-center kernel networks.

Radial-basis-function networks [Broomhead and Lowe, 1988, Moody and Darken, 1989] train finite expansions of locally-tuned radial kernels and are universal approximators [Park and Sandberg, 1991]; the classical MLP approximation theory underlying this is surveyed by Pinkus [1999]. The classical theory of native spaces and IMQ universality is developed in Wendland [2004]. With a shared bandwidth, an IMQ or Gaussian RBF layer is already a finite learned-center RKHS expansion admitting the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ norm and the same diagonal-based Rademacher bound as Yat. The Yat primitive differs in two respects: it replaces the radial numerator with the polynomial alignment factor $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$, producing a nonzero directional far-field trace that all pure-radial combinations lack; and the bias $b>0$ places a constant-coordinate feature in the polynomial factor’s feature map, establishing fixed-kernel universality via kernel-order domination rather than through a variable-bandwidth construction.

Neural kernels and arc-cosine constructions.

Cho and Saul [2009] pioneered the program of treating neural-network hidden units as Mercer sections, introducing the arc-cosine family as kernels associated with threshold activations. The Yat construction shares the spirit of that program — treating the unit as a kernel section rather than a scalar activation — but trades the integral arc-cosine form for a finite, trainable, rational primitive whose RKHS structure (Mercer property, Loewner-IMQ domination, channel decomposition, finite-difference IMQ embedding) is fully closed-form.

Kernel machines and product kernels.

Polynomially modulated radial kernels and product or nonstationary kernel constructions are classical in kernel machines and Gaussian processes [Schölkopf and Smola, 2002, Steinwart and Christmann, 2008]. The Steinwart–Christmann product-kernel containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] is one of two routes to Yat’s fixed-kernel universality (the other being Loewner-order domination). The contribution here is not the abstract existence of a product kernel, but the specific rational hidden-unit form for which product-kernel universality, an exact IMQ finite-difference identity, a nonzero directional asymptotic trace, and a layer-local closed-form RKHS norm simultaneously hold. Prior product-kernel analyses in the GP and kernel-machine literature do not study this hidden-unit parameterization, its finite-difference structure, or its asymptotic-trace geometry.

Deep-learning generalization and the NTK.

For standard scalar-activation MLPs, layer-level capacity is controlled through spectral or Lipschitz surrogates on the weight matrix [Bartlett et al., 2017, Neyshabur et al., 2018], which are function-level bounds rather than Hilbert-space objects attached to individual units. The historical origin of the neural-network-as-kernel viewpoint is Neal [1996], where infinite-width Bayesian networks induce Gaussian-process priors. The neural tangent kernel [Jacot et al., 2018] is an infinite-width post-hoc object defined at the network output; the NNGP [Lee et al., 2018] indexes a training-sample covariance at the output; the conjugate kernel construction of Daniely et al. [2016] treats layered networks as compositions of Mercer-section objects and is the closest antecedent to our prefix-pullback formulation in Section 6; deep kernel learning [Wilson et al., 2016] places a GP on top of a learned feature extractor; the pullback formalism we use in Section 6 is closely related to the convolutional kernel networks of Mairal et al. [2014], where a kernel is constructed by repeated pullback through learned feature maps. The arc-cosine kernels of Cho and Saul [2009] and random-feature constructions [Rahimi and Recht, 2007, 2008] also yield Mercer-section-flavoured hidden-unit views, in different parameterizations. The spectral analysis of NTKs on the sphere via Funk–Hecke / Gegenbauer expansion is by now a small subliterature in ML: Bietti and Mairal [2019] characterise the NTK spectrum on the sphere, and Cao and Gu [2019] use spectral decay rates analogous to our Theorem 21 in deep-network generalization arguments; Bordelon et al. [2020] connect Mercer-eigendecay and learning curves in the direction our fast-rates appendix follows. 
What is Yat-specific is the conjunction: a finite, trainable, non-radial, rational primitive with fixed-kernel universality from explicit Loewner-IMQ domination, an exact algebraic IMQ embedding sharp at three atoms, a nonzero directional far-field trace, and a closed-form layer-local norm with an explicit input-dependent diagonal.

RKHS generalization bounds.

The single-layer Rademacher bound (Theorem 6) and the multiclass learned-norm generalization bound (Theorem 12) follow the RKHS-ball Rademacher framework of Bartlett and Mendelson [2002]; the multiclass extension uses the pairwise-margin approach of Crammer and Singer [2001]. The modern reference for the chain “universal $\Rightarrow$ characteristic $\Rightarrow$ SPD” on locally compact spaces, including the equivalence between universality and characteristicness under the finite-signed-measure formulation, is Sriperumbudur et al. [2011]. What is Yat-specific is the explicit, input-dependent kernel diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, which makes $b$ and $\varepsilon$ explicit regularization targets in the bound, unlike the constant diagonals of the Gaussian kernel ($1$) and the IMQ kernel ($\varepsilon^{-1}$).

Approximation rates and ridge-vs-radial separations.

The width-complexity gap (Proposition 7) sits in a longer tradition of separations between ridge and radial approximation. Eldan and Shamir [2016] established an exponential gap for ridge functions against radial expansions; Bach [2017] sharpened this in the convex-neural-network setting with $\ell_{1}$-bounded coefficient classes that closely match our $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$. The Barron-space MLP capacity literature [E et al., 2019] provides the analogous functional class for two-layer networks; positioning Yat against the Barron framework is left to future work, as the Yat layer is a fixed-RKHS object rather than a Barron-norm-bounded class.

Rational and ratio-form kernels.

Kernels written as a polynomial-in-numerator over a polynomial-in-denominator have been studied in machine learning principally through the rational-transducer construction of Cortes et al. [2004], which formalises a class of rational kernels on sequences via finite-state operations and proves Mercer-style closure properties; the abstract framing of “kernel as a ratio of polynomial-like objects” is the same, but the input domain (sequences vs. $\mathbb{R}^{d}$) and the algebraic primitives (transducer composition vs. Schur products) are disjoint, so neither the asymptotic-trace separation nor the bias finite-difference IMQ embedding has an analogue in that framework. A separate line of work installs rationality at the scalar activation level rather than at the kernel-section level: Telgarsky [2017] establishes expressivity of rational networks, Molina et al. [2020] learn Padé approximants end-to-end, and Boullé et al. [2020] train low-degree rational activations and obtain favourable approximation rates. These constructions keep the unit a scalar function of $\mathbf{w}^{\top}\mathbf{x}$, do not define a Mercer section over the joint variable, and therefore do not carry the closed-form RKHS norm or the fixed-kernel universality the Yat construction targets; the axis is orthogonal to ours. Closer in form are the normalised polynomial kernels of the SVM literature [Burges, 1998, Schölkopf and Smola, 2002], which divide $(\mathbf{x}^{\top}\mathbf{w}+b)^{d}$ by $\sqrt{k(\mathbf{x},\mathbf{x})k(\mathbf{w},\mathbf{w})}$, i.e. by their own self-kernel; the Yat kernel divides instead by the regularised distance $\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$, which preserves alignment-dependence in the kernel diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ and adds local IMQ-type decay that self-kernel normalisation does not.
The classical theory of inverse-multiquadric kernels and their native spaces is developed at length in Wendland [2004], Buhmann [2003], including the Bessel-potential Sobolev characterisation of the IMQ native space we use throughout. Schaback [1995] established the conditional-positive-definiteness framework that places kernels with vanishing diagonals (such as the unbiased case $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$) in a unified setting with strictly positive radial kernels.

Non-stationary and anisotropic kernels.

The Yat kernel is non-stationary in the sense that $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$ depends on $(\mathbf{w},\mathbf{x})$ rather than only on $\mathbf{x}-\mathbf{w}$; the polynomial alignment numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is intrinsically anisotropic. This places Yat in a literature on non-stationary covariance constructions in Gaussian processes and kernel machines: Genton [2001] surveys the design space of kernel families including non-stationary and anisotropic constructions, Paciorek and Schervish [2006] develop a class of non-stationary covariance functions parameterised by spatially varying length scales, and Wilson and Adams [2013] introduce spectral mixture kernels that obtain alignment-style modulation through a sum over frequency components. The Yat kernel differs from these constructions by being a single rational expression rather than a mixture or a length-scale field, by carrying an exact algebraic IMQ embedding (Theorem 4) absent from the spectral-mixture and varying-length-scale families, and by admitting a closed-form layer-local RKHS norm that is generic to shared-bandwidth Mercer-section layers.

Appendix C Background

Reproducing kernel Hilbert spaces.

A symmetric function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is positive semidefinite (PSD) if $\sum_{i,j}c_{i}c_{j}\,k(\mathbf{x}_{i},\mathbf{x}_{j})\geq 0$ for every finite set $\{\mathbf{x}_{i}\}\subset\mathcal{X}$ and coefficients $\{c_{i}\}\subset\mathbb{R}$. Every continuous PSD $k$ on a compact set $\mathcal{X}$ is a Mercer kernel and admits a uniformly convergent eigenexpansion $k(\mathbf{x},\mathbf{y})=\sum_{i}\lambda_{i}\phi_{i}(\mathbf{x})\phi_{i}(\mathbf{y})$ with $\lambda_{i}\geq 0$ and $\{\phi_{i}\}\subset L^{2}(\mathcal{X})$ orthonormal [Mercer, 1909]. The Moore–Aronszajn theorem [Aronszajn, 1950] associates to every PSD $k$ a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{k}$ whose reproducing property $f(\mathbf{x})=\langle f,k(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{k}}$ holds for all $f\in\mathcal{H}_{k}$. Two properties of RKHSs are used throughout. First, if $c^{2}k_{1}\preceq k_{2}$ in the Loewner PSD order for some $c>0$ (meaning $k_{2}-c^{2}k_{1}$ is PSD), then $\mathcal{H}_{k_{1}}\subseteq\mathcal{H}_{k_{2}}$ with $\|f\|_{\mathcal{H}_{k_{2}}}\leq c^{-1}\|f\|_{\mathcal{H}_{k_{1}}}$ [Paulsen and Raghupathi, 2016]. Second, the RKHS sum rule: if $k=k_{1}+k_{2}$ then $\mathcal{H}_{k}=\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}}$ with squared infimal-convolution norm $\|f\|_{\mathcal{H}_{k}}^{2}=\inf_{f_{1}+f_{2}=f}(\|f_{1}\|_{\mathcal{H}_{k_{1}}}^{2}+\|f_{2}\|_{\mathcal{H}_{k_{2}}}^{2})$.

Universality, characteristicness, and strict positive definiteness.

A continuous PSD kernel on compact $\mathcal{X}$ is universal if its RKHS is dense in $C(\mathcal{X})$ [Micchelli et al., 2006, Steinwart and Christmann, 2008]. We say $k$ is characteristic if the kernel mean embedding $\mu\mapsto\int k(\cdot,\mathbf{x})\,d\mu(\mathbf{x})$ is injective on finite signed Borel measures on $\mathcal{X}$ (this finite-signed formulation is the one we use throughout). A kernel is strictly positive definite (SPD) if for every $n\geq 1$ and every set of pairwise distinct points $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}$ the Gram matrix is positive definite. The implications we use for continuous kernels on compact $\mathcal{X}$ are

\[
\text{universal}\;\Longrightarrow\;\text{characteristic}\;\Longrightarrow\;\text{SPD},
\]

where the first arrow is Sriperumbudur et al. [2011] and the second follows because if $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}=0$ for some $\mathbf{c}\neq\mathbf{0}$ at distinct nodes, then the nonzero atomic measure $\mu=\sum_{i}c_{i}\delta_{\mathbf{x}_{i}}$ has zero kernel mean embedding, contradicting injectivity. (Under the finite-signed-measure definition above, characteristicness on a compact domain is in fact equivalent to universality by a Hahn–Banach/Riesz argument; we only need the one-way chain stated here. The probability-measure-only definition of characteristicness is strictly weaker than universality, but we do not use it.) For a bounded continuous characteristic kernel on a compact metric space, $\mathrm{MMD}_{k}$ metrizes weak convergence of probability measures [Sriperumbudur et al., 2010, Thm. 23]; we use this endpoint of the chain throughout.
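The step from characteristic to SPD rests on one computation, spelled out here as a worked equation. For the atomic measure $\mu=\sum_{i}c_{i}\delta_{\mathbf{x}_{i}}$ at pairwise distinct nodes, the reproducing property gives

```latex
\[
\Bigl\|\int k(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\Bigr\|_{\mathcal{H}_{k}}^{2}
=\Bigl\langle\sum_{i}c_{i}\,k(\cdot,\mathbf{x}_{i}),\ \sum_{j}c_{j}\,k(\cdot,\mathbf{x}_{j})\Bigr\rangle_{\mathcal{H}_{k}}
=\sum_{i,j}c_{i}c_{j}\,k(\mathbf{x}_{i},\mathbf{x}_{j})
=\mathbf{c}^{\top}\mathbf{K}\mathbf{c}.
\]
```

So $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}=0$ means the embedding of $\mu$ is zero; injectivity then forces $\mu=0$, and since the nodes are distinct this gives $\mathbf{c}=\mathbf{0}$.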

Schur products and Laplace integrals.

Two closure properties are used repeatedly. (S) The Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]: the entrywise (Hadamard) product of two PSD kernels is PSD. (L) The Laplace representation $a^{-1}=\int_{0}^{\infty}e^{-ta}\,dt$ for $a>0$ [Widder, 1941, Ch. IV]: the IMQ kernel $h_{\varepsilon}(\mathbf{x},\mathbf{w})=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}=\int_{0}^{\infty}e^{-t\varepsilon}e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}\,dt$ is a nonnegative mixture of Gaussian PSD kernels, hence PSD. Combining (S) and (L), any Schur product $p(\mathbf{x},\mathbf{w})\cdot h_{\varepsilon}(\mathbf{x},\mathbf{w})$ with a PSD polynomial kernel $p$ is PSD; this is the core of the Yat PSD proof. For the Yat numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$, the feature map $\Phi_{b}(\mathbf{x})=(\mathbf{x}\otimes\mathbf{x},\sqrt{2b}\,\mathbf{x},b)$ requires $b\geq 0$ for $\sqrt{2b}$ to be real; the sign condition is necessary, as Theorem 1 exhibits a $2\times 2$ counterexample at $b=-1$.

Appendix D Background Theorems from the Literature

This appendix collects the named results invoked by label in the proofs of Sections 2–5. Standard background (Mercer’s theorem, Moore–Aronszajn, universality, characteristicness, strict positive definiteness, empirical Rademacher complexity, the Schur product theorem, and the Laplace representation) is given in Appendix C and not repeated here.

Theorem 7 (Product-kernel RKHS containment [Steinwart and Christmann, 2008, Lemma 4.6]).

Let $k_{1},k_{2}$ be continuous PSD kernels on $\mathcal{X}$, and let $\mathcal{H}_{1},\mathcal{H}_{2}$ be their RKHSs. Then the Schur product $k=k_{1}\cdot k_{2}$ is a continuous PSD kernel with RKHS $\mathcal{H}_{k}$ satisfying: for every $f_{1}\in\mathcal{H}_{1}$ and $f_{2}\in\mathcal{H}_{2}$, the pointwise product $f_{1}\cdot f_{2}\in\mathcal{H}_{k}$, with $\|f_{1}\cdot f_{2}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{1}}\|f_{2}\|_{\mathcal{H}_{2}}$.

Theorem 8 (Micchelli IMQ universality [Micchelli et al., 2006, Thm. 17]).

For every $\varepsilon>0$ and every compact $\mathcal{X}\subset\mathbb{R}^{d}$, the IMQ span $F_{\mathrm{IMQ},\varepsilon}=\mathrm{span}\{(\|\cdot-\mathbf{w}\|^{2}+\varepsilon)^{-1}:\mathbf{w}\in\mathbb{R}^{d}\}$ is dense in $C(\mathcal{X})$ in the uniform norm; equivalently, the IMQ kernel is universal on every compact $\mathcal{X}\subset\mathbb{R}^{d}$.

Appendix E Proofs of Main-Body Results

E.1 Proof of Theorem 1: PSD/Mercer for b0b\geq 0

We restate Theorem 1 for the reader’s convenience and prove it as a single block.

Theorem (Theorem 1 restated).

Let $\varepsilon>0$ and $b\geq 0$. The kernel $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=(\mathbf{w}^{\top}\mathbf{x}+b)^{2}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ is symmetric and positive semidefinite on $\mathbb{R}^{d}\times\mathbb{R}^{d}$. On every compact $\mathcal{X}\subset\mathbb{R}^{d}$, $k_{b,\varepsilon}|_{\mathcal{X}\times\mathcal{X}}$ is continuous, so Mercer’s theorem yields a decomposition $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\sum_{i\geq 1}\lambda_{i}\phi_{i}(\mathbf{w})\phi_{i}(\mathbf{x})$ with $\lambda_{i}\geq 0$ and $\{\phi_{i}\}\subset L^{2}(\mathcal{X})$ orthonormal. By Moore–Aronszajn there exists a unique RKHS $\mathcal{H}_{b,\varepsilon}$ with reproducing kernel $k_{b,\varepsilon}$.

Proof.

Factor $k_{b,\varepsilon}=p\cdot h_{\varepsilon}$, where $p(\mathbf{x},\mathbf{w}):=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$.

Factor $p$. Extend symmetrically by $\sqrt{b}$: define $\tilde{\mathbf{x}}:=(\mathbf{x},\sqrt{b})\in\mathbb{R}^{d+1}$ and $\tilde{\mathbf{w}}:=(\mathbf{w},\sqrt{b})\in\mathbb{R}^{d+1}$ (well-defined because $b\geq 0$). Then $\tilde{\mathbf{x}}^{\top}\tilde{\mathbf{w}}=\mathbf{x}^{\top}\mathbf{w}+b$, so $p(\mathbf{x},\mathbf{w})=(\tilde{\mathbf{x}}^{\top}\tilde{\mathbf{w}})^{2}$ is the square of a linear kernel in the extended feature space. The linear kernel is PSD; by the Schur product theorem its Hadamard square is also PSD, so $p$ is PSD. Equivalently, $\Phi_{b}(\mathbf{x})=(\mathrm{vec}(\mathbf{x}\mathbf{x}^{\top}),\sqrt{2b}\,\mathbf{x},b)\in\mathbb{R}^{d^{2}+d+1}$ satisfies $\langle\Phi_{b}(\mathbf{x}),\Phi_{b}(\mathbf{w})\rangle=(\mathbf{x}^{\top}\mathbf{w})^{2}+2b\,\mathbf{x}^{\top}\mathbf{w}+b^{2}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}=p(\mathbf{x},\mathbf{w})$, and $\Phi_{b}$ is real-valued iff $b\geq 0$. The $b<0$ counterexample below shows this condition is necessary.
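The explicit feature map can be checked numerically. In the sketch below the helper names `phi_b`, `inner`, and `poly_numerator` are hypothetical, and the test vectors are arbitrary; the flattening order of $\mathbf{x}\mathbf{x}^{\top}$ does not matter for the inner product as long as it is the same for both arguments.

```python
import math


def phi_b(x, b):
    """Feature map (vec(x x^T), sqrt(2b) x, b); real-valued only when b >= 0."""
    quad = [xi * xj for xi in x for xj in x]         # d^2 entries of x x^T
    lin = [math.sqrt(2.0 * b) * xi for xi in x]      # d entries
    return quad + lin + [b]                          # total d^2 + d + 1


def inner(u, v):
    """Euclidean inner product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_numerator(x, w, b):
    """Yat numerator p(x, w) = (x.w + b)^2."""
    return (inner(x, w) + b) ** 2


# Arbitrary illustrative inputs.
x = (1.0, 2.0, -1.0)
w = (0.5, -1.0, 3.0)
b = 0.7
lhs = inner(phi_b(x, b), phi_b(w, b))
rhs = poly_numerator(x, w, b)
```

Calling `phi_b` with a negative `b` raises a `ValueError` from `math.sqrt`, which mirrors the requirement $b\geq 0$ for a real feature map.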

Factor $h_{\varepsilon}$. By the Laplace identity $a^{-1}=\int_{0}^{\infty}e^{-ta}\,dt$ for $a>0$,

\[
h_{\varepsilon}(\mathbf{x},\mathbf{w})\;=\;\int_{0}^{\infty}e^{-t\varepsilon}\,e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}\,dt. \tag{8}
\]

For each $t>0$ the Gaussian $e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}$ is a PSD kernel and $e^{-t\varepsilon}\geq 0$. A nonnegative mixture of PSD kernels is PSD, so $h_{\varepsilon}$ is PSD.
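The Gaussian-mixture representation (8) can be sanity-checked by numerical quadrature. The sketch below uses composite Simpson integration on a truncated interval; the function names, the truncation point `50 / rate`, and the number of panels are all illustrative choices, not part of the paper.

```python
import math


def imq_kernel(x, w, eps):
    """Closed form h_eps(x, w) = 1 / (||x - w||^2 + eps)."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)


def imq_via_gaussian_mixture(x, w, eps, n=20000):
    """Approximate the integral of exp(-t*eps) * exp(-t*||x-w||^2) over t >= 0."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    rate = dist2 + eps
    upper = 50.0 / rate          # tail beyond this is of order exp(-50)
    step = upper / n             # n must be even for Simpson's rule

    def integrand(t):
        # Weight exp(-t*eps) >= 0 times the Gaussian kernel exp(-t*||x-w||^2).
        return math.exp(-t * eps) * math.exp(-t * dist2)

    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * step)
    return total * step / 3.0


x, w, eps = (1.0, 0.0), (0.0, 1.0), 0.5
closed_form = imq_kernel(x, w, eps)
quadrature = imq_via_gaussian_mixture(x, w, eps)
```

The two values should agree up to quadrature error, which is a numerical illustration of the identity and not a substitute for the Laplace argument.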

Schur product. By the Schur product theorem [Horn and Johnson, 2012], the entry-wise product of two PSD matrices is PSD. Applied to the Gram matrices of $p$ and $h_{\varepsilon}$ on an arbitrary finite point set, this shows $k_{b,\varepsilon}=p\cdot h_{\varepsilon}$ is PSD on $\mathbb{R}^{d}$.

Continuity on compact $\mathcal{X}$. The denominator $\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ is bounded below by $\varepsilon>0$, so $k_{b,\varepsilon}$ is a ratio of a continuous numerator and a strictly positive continuous denominator, hence continuous on $\mathcal{X}\times\mathcal{X}$. Mercer’s theorem applies and yields the decomposition. Moore–Aronszajn gives the unique RKHS.

Counterexample at $b<0$. Take $b=-1$, $d=2$, nodes $\mathbf{x}_{1}=\mathbf{e}_{1}$, $\mathbf{x}_{2}=\mathbf{e}_{2}$. The numerator Gram is

\[
[(\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+b)^{2}]\;=\;\begin{pmatrix}(1-1)^{2}&(0-1)^{2}\\ (0-1)^{2}&(1-1)^{2}\end{pmatrix}\;=\;\begin{pmatrix}0&1\\ 1&0\end{pmatrix},
\]

with eigenvalues $\{1,-1\}$, so the numerator is indefinite. The denominator at these nodes is $D_{\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j})\in\{\varepsilon,2+\varepsilon\}$, both positive. The Schur product therefore has zero diagonal and off-diagonal entries $(2+\varepsilon)^{-1}$, so the Gram matrix of $k_{-1,\varepsilon}$ has the eigenvalue $-(2+\varepsilon)^{-1}<0$ and $k_{-1,\varepsilon}$ is not PSD. ∎
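The counterexample is small enough to evaluate by hand or in a few lines of code. The sketch below, with a hypothetical helper `quad_form` and an illustrative $\varepsilon=0.5$, evaluates the quadratic form $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}$ at $\mathbf{c}=(1,-1)$, the eigenvector for the negative eigenvalue.

```python
def yat_kernel(w, x, b, eps):
    """k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps); b may be negative here."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def quad_form(c, nodes, b, eps):
    """Quadratic form c^T K c with K_ij = k_{b,eps}(x_i, x_j)."""
    n = len(nodes)
    return sum(c[i] * c[j] * yat_kernel(nodes[i], nodes[j], b, eps)
               for i in range(n) for j in range(n))


nodes = [(1.0, 0.0), (0.0, 1.0)]   # e_1 and e_2
c = (1.0, -1.0)
eps = 0.5                           # illustrative value

negative_bias = quad_form(c, nodes, b=-1.0, eps=eps)   # equals -2 / (2 + eps)
positive_bias = quad_form(c, nodes, b=1.0, eps=eps)    # nonnegative, as Theorem 1 requires
```

Dividing `negative_bias` by $\|\mathbf{c}\|^{2}=2$ gives the Rayleigh quotient $-(2+\varepsilon)^{-1}$, matching the eigenvalue in the proof.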

E.2 Proof of Theorem 4: bias-finite-difference IMQ embedding

Theorem (Theorem 4 restated).

Fix $\varepsilon>0$. For every $\mathbf{w}\in\mathbb{R}^{d}$ and every $h>0$, the bias-finite-difference identity (2) holds identically in $\mathbf{x}\in\mathbb{R}^{d}$. In particular, $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{ⵟ},\varepsilon}^{+}$. The containment is strict.

Proof.

The denominator $D_{\varepsilon}(\mathbf{x},\mathbf{w})=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ is independent of the bias parameter. For fixed $(\mathbf{x},\mathbf{w})$ set $t:=\mathbf{x}^{\top}\mathbf{w}+2h$. Then

\[
g_{\varepsilon}(\mathbf{x};\mathbf{w},3h)=\frac{(t+h)^{2}}{D_{\varepsilon}},\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},2h)=\frac{t^{2}}{D_{\varepsilon}},\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},h)=\frac{(t-h)^{2}}{D_{\varepsilon}}.
\]

The numerator second-order difference is

\[
(t+h)^{2}-2t^{2}+(t-h)^{2}\;=\;(t^{2}+2th+h^{2})-2t^{2}+(t^{2}-2th+h^{2})\;=\;2h^{2},
\]

independent of $t$. Dividing by $h^{2}D_{\varepsilon}$ gives $2/D_{\varepsilon}=2k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w})$; multiplying both sides by $h^{2}$ recovers (2). Since $h>0$ the three biases $\{h,2h,3h\}$ are all strictly positive, so the identity uses only atoms from $F_{\text{ⵟ},\varepsilon}^{+}$ and hence $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{ⵟ},\varepsilon}^{+}$.
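The computation in this proof says that the second difference of the Yat atom in the bias, taken at $h$, $2h$, $3h$, equals $2h^{2}$ times the IMQ atom. A quick numerical check, with a hypothetical helper `bias_second_difference` and arbitrary test values, looks as follows.

```python
def yat_atom(x, w, b, eps):
    """Biased Yat atom g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)


def imq_kernel(x, w, eps):
    """IMQ kernel 1 / (||x - w||^2 + eps)."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)


def bias_second_difference(x, w, h, eps):
    """g(x; w, 3h) - 2 g(x; w, 2h) + g(x; w, h), using only positive biases."""
    return (yat_atom(x, w, 3.0 * h, eps)
            - 2.0 * yat_atom(x, w, 2.0 * h, eps)
            + yat_atom(x, w, h, eps))


# Arbitrary illustrative inputs.
x = (0.3, -1.2, 2.0)
w = (1.5, 0.4, -0.7)
h, eps = 0.25, 0.1
lhs = bias_second_difference(x, w, h, eps)
rhs = 2.0 * h ** 2 * imq_kernel(x, w, eps)
```

Agreement of `lhs` and `rhs` up to rounding at a handful of points is only a spot check; the algebra above is what establishes the identity for all $\mathbf{x}$.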

Strictness. Fix $\mathbf{w}\neq 0$ and a unit vector $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$. Along the ray $\mathbf{x}=r\mathbf{u}$,

\[
k_{\text{ⵟ}}(r\mathbf{u},\mathbf{w})\;=\;\frac{r^{2}(\mathbf{u}^{\top}\mathbf{w})^{2}}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon}\;\xrightarrow[r\to\infty]{}\;(\mathbf{u}^{\top}\mathbf{w})^{2}\;\neq\;0,
\]

while every finite linear combination $F=\sum_{i=1}^{N}c_{i}k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{i})\in F_{\mathrm{IMQ},\varepsilon}$ obeys

\[
|F(r\mathbf{u})|\;\leq\;\sum_{i=1}^{N}\frac{|c_{i}|}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}_{i}+\|\mathbf{w}_{i}\|^{2}+\varepsilon}\;=\;O(r^{-2})\;\xrightarrow[r\to\infty]{}\;0.
\]

The two far-field limits disagree, so $k_{\text{ⵟ}}(\cdot,\mathbf{w})\notin F_{\mathrm{IMQ},\varepsilon}$ for every $\mathbf{w}\neq 0$. By Lemma 3 this same unbiased section belongs to $F_{\text{ⵟ},\varepsilon}^{+}$ via $k_{\text{ⵟ}}(\cdot,\mathbf{w})=g_{\varepsilon}(\cdot;\mathbf{w},0)=3g_{\varepsilon}(\cdot;\mathbf{w},h)-3g_{\varepsilon}(\cdot;\mathbf{w},2h)+g_{\varepsilon}(\cdot;\mathbf{w},3h)$ for any $h>0$, so it witnesses strict containment $F_{\mathrm{IMQ},\varepsilon}\subsetneq F_{\text{ⵟ},\varepsilon}^{+}$. ∎
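The three-atom representation of the unbiased section quoted from Lemma 3 can be verified by expanding the numerators. Writing $s:=\mathbf{x}^{\top}\mathbf{w}$ (a shorthand introduced here only for this check), the common denominator $D_{\varepsilon}$ factors out and the numerators combine as

```latex
\begin{align*}
3(s+h)^{2}-3(s+2h)^{2}+(s+3h)^{2}
&=(3-3+1)\,s^{2}+(6-12+6)\,sh+(3-12+9)\,h^{2}\\
&=s^{2}.
\end{align*}
```

Dividing by $D_{\varepsilon}(\mathbf{x},\mathbf{w})$ gives $g_{\varepsilon}(\mathbf{x};\mathbf{w},0)$ on the right-hand side, for every $h>0$.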

Lemma 2 (Irreducibility and coprimality of the regularised distance).

Let $d\geq 2$, $\varepsilon>0$, and for $\mathbf{w}\in\mathbb{R}^{d}$ set $D_{\mathbf{w}}(\mathbf{x}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$. Then (i) $D_{\mathbf{w}}$ is irreducible in $\mathbb{C}[\mathbf{x}_{1},\ldots,\mathbf{x}_{d}]$; (ii) for $\mathbf{v}\neq\mathbf{w}$, $D_{\mathbf{v}}$ and $D_{\mathbf{w}}$ are coprime in $\mathbb{C}[\mathbf{x}]$.

Proof.

(i) Translate by $\mathbf{w}$: with $\mathbf{y}:=\mathbf{x}-\mathbf{w}$, $D_{\mathbf{w}}=\|\mathbf{y}\|^{2}+\varepsilon$. Suppose this factors as $f(\mathbf{y})g(\mathbf{y})$ in $\mathbb{C}[\mathbf{y}]$ with $\deg f,\deg g\geq 1$. Since $\deg f+\deg g=2$, both factors are linear: $f=\boldsymbol{\alpha}^{\top}\mathbf{y}+\beta$, $g=\boldsymbol{\gamma}^{\top}\mathbf{y}+\delta$. Matching coefficients gives

\[
\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}+\boldsymbol{\gamma}\boldsymbol{\alpha}^{\top}=2\mathbf{I}_{d},\qquad\delta\boldsymbol{\alpha}+\beta\boldsymbol{\gamma}=\mathbf{0},\qquad\beta\delta=\varepsilon.
\]

Since $\varepsilon\neq 0$, neither $\beta$ nor $\delta$ vanishes, so the linear constraint gives $\boldsymbol{\gamma}=-(\delta/\beta)\boldsymbol{\alpha}$, i.e. the two direction vectors are parallel. But then $\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}+\boldsymbol{\gamma}\boldsymbol{\alpha}^{\top}=-(2\delta/\beta)\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}$ has rank at most $1$, while $2\mathbf{I}_{d}$ has rank $d\geq 2$. Contradiction; $D_{\mathbf{w}}$ is irreducible. (ii) If $D_{\mathbf{v}}$ and $D_{\mathbf{w}}$ shared an irreducible factor for $\mathbf{v}\neq\mathbf{w}$, since both are irreducible they would equal each other up to a scalar; the common degree-$2$ part $\|\mathbf{x}\|^{2}$ fixes that scalar to $1$, and comparing the degree-$1$ parts $-2\mathbf{x}^{\top}\mathbf{v}$ and $-2\mathbf{x}^{\top}\mathbf{w}$ then forces $\mathbf{v}=\mathbf{w}$, contradiction. ∎
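For readers who want the coefficient matching in part (i) spelled out, the product of the two linear factors expands as follows; the symmetrisation in the quadratic term is the only step not written in the proof.

```latex
\begin{align*}
(\boldsymbol{\alpha}^{\top}\mathbf{y}+\beta)(\boldsymbol{\gamma}^{\top}\mathbf{y}+\delta)
&=\mathbf{y}^{\top}\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}\mathbf{y}
 +(\delta\boldsymbol{\alpha}+\beta\boldsymbol{\gamma})^{\top}\mathbf{y}+\beta\delta\\
&=\tfrac{1}{2}\,\mathbf{y}^{\top}\bigl(\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}
 +\boldsymbol{\gamma}\boldsymbol{\alpha}^{\top}\bigr)\mathbf{y}
 +(\delta\boldsymbol{\alpha}+\beta\boldsymbol{\gamma})^{\top}\mathbf{y}+\beta\delta.
\end{align*}
```

Comparing with $\|\mathbf{y}\|^{2}+\varepsilon=\mathbf{y}^{\top}\mathbf{I}_{d}\,\mathbf{y}+\mathbf{0}^{\top}\mathbf{y}+\varepsilon$ term by term yields the three constraints used in the proof.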

Theorem 9 (Tightness of the factor-33 reduction).

Let $d\geq 1$, $\varepsilon>0$. For every $\mathbf{w}_{0}\in\mathbb{R}^{d}\setminus\{0\}$, no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$.

Outline. The argument splits by dimension. For $d\geq 2$ we use Lemma 2 (irreducibility and coprimality of $D_{\mathbf{w}}$ in $\mathbb{C}[\mathbf{x}]$) and a case-by-case polynomial analysis. For $d=1$, $D_{w}$ factors over $\mathbb{C}$ as $(z-w-i\sqrt{\varepsilon})(z-w+i\sqrt{\varepsilon})$, so Lemma 2 fails; we instead match poles directly in the complex plane.

Proof for d2d\geq 2.

We first reduce to non-trivial atoms. Throughout, $D_{0}$ denotes $D_{\mathbf{w}_{0}}$. An atom with $c_{i}=0$, or with $L_{i}\equiv 0$ (i.e. $\mathbf{w}_{i}=\mathbf{0}$ and $b_{i}=0$), contributes zero; dropping it gives a one-atom or zero-atom identity. A zero-atom combination is identically $0\neq\lambda k_{\mathrm{IMQ}}^{\varepsilon}$ for $\lambda\neq 0$. For a single non-trivial atom, clearing denominators in $c\,g_{\varepsilon}(\cdot;\mathbf{w},b)=\lambda\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ gives $c\,L^{2}\,D_{0}=\lambda\,D_{w}$ in $\mathbb{C}[\mathbf{x}]$, where $L(\mathbf{x})=\mathbf{x}^{\top}\mathbf{w}+b$ and $D_{w}=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$. If $\mathbf{w}=\mathbf{w}_{0}$: cancelling $D_{0}$ gives $c\,L^{2}=\lambda$, impossible since $L$ is nonconstant for $\mathbf{w}_{0}\neq\mathbf{0}$. If $\mathbf{w}\neq\mathbf{w}_{0}$: by Lemma 2, $D_{w}$ is irreducible (hence prime in the UFD $\mathbb{C}[\mathbf{x}]$) and coprime to $D_{0}$; from $D_{w}\mid c\,L^{2}D_{0}$ we get $D_{w}\mid L^{2}$ and hence $D_{w}\mid L$, but $\deg D_{w}=2>1\geq\deg L$ and $L\not\equiv 0$: contradiction. Hence $\lambda=0$ in the one-atom case.

Now assume both atoms are non-trivial, $c_{i}\neq 0$, $L_{i}\not\equiv 0$, and for contradiction $\lambda\neq 0$. Let $D_{i}(\mathbf{x})=\|\mathbf{x}-\mathbf{w}_{i}\|^{2}+\varepsilon$ and $L_{i}(\mathbf{x})=\mathbf{x}^{\top}\mathbf{w}_{i}+b_{i}$. Clearing denominators gives

$$c_{1}L_{1}^{2}D_{0}D_{2}+c_{2}L_{2}^{2}D_{0}D_{1}=\lambda D_{1}D_{2}\quad\text{in }\mathbb{R}[\mathbf{x}].\tag{$*$}$$

Case 1: $\mathbf{w}_{1}=\mathbf{w}_{2}=:\mathbf{w}$. Then $D_{1}=D_{2}=:D$ and, after cancelling the common factor $D$, the cleared-denominator identity $(*)$ becomes $N\,D_{0}=\lambda\,D$ where $N:=c_{1}L_{1}^{2}+c_{2}L_{2}^{2}$. Subcase $\mathbf{w}=\mathbf{w}_{0}$: $D=D_{0}$, so $N=\lambda$. The degree-$2$ homogeneous part of $N$ is $(c_{1}+c_{2})(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}$. For $d\geq 2$ and $\mathbf{w}_{0}\neq\mathbf{0}$ this vanishes only if $c_{1}+c_{2}=0$; the degree-$1$ part then forces $c_{1}b_{1}+c_{2}b_{2}=0$, so $b_{1}=b_{2}$, and $N=b_{1}^{2}(c_{1}+c_{2})=0$, contradicting $\lambda\neq 0$. Subcase $\mathbf{w}\neq\mathbf{w}_{0}$: by Lemma 2, $D$ and $D_{0}$ are distinct irreducibles in $\mathbb{C}[\mathbf{x}]$, hence coprime. From $ND_{0}=\lambda D$: $D\mid ND_{0}$ and $\gcd(D,D_{0})=1$, so $D\mid N$. Since $\deg N\leq 2=\deg D$, we have $N=\mu D$ for some constant $\mu$; then $\mu D_{0}=\lambda$, a degree-$2$ polynomial equal to a constant, forcing $\mu=\lambda=0$: contradiction.

Case 2: $\mathbf{w}_{1}=\mathbf{w}_{0}\neq\mathbf{w}_{2}$ (the case $\mathbf{w}_{2}=\mathbf{w}_{0}\neq\mathbf{w}_{1}$ is symmetric). Then $D_{1}=D_{0}$; cancelling $D_{0}$ from $(*)$ gives $c_{1}L_{1}^{2}D_{2}+c_{2}L_{2}^{2}D_{0}=\lambda D_{2}$, hence

$$(c_{1}L_{1}^{2}-\lambda)\,D_{2}=-c_{2}L_{2}^{2}\,D_{0}.$$

Since $\mathbf{w}_{0}\neq\mathbf{w}_{2}$, by Lemma 2 $D_{0}$ and $D_{2}$ are coprime irreducibles in $\mathbb{C}[\mathbf{x}]$, so $D_{0}\mid(c_{1}L_{1}^{2}-\lambda)$. The polynomial $c_{1}L_{1}^{2}-\lambda$ has degree $\leq 2=\deg D_{0}$, so $c_{1}L_{1}^{2}-\lambda=\mu D_{0}$ for some $\mu\in\mathbb{C}$. Comparing degree-$2$ homogeneous parts: $c_{1}(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}=\mu\|\mathbf{x}\|^{2}$. For $d\geq 2$ and $\mathbf{w}_{0}\neq\mathbf{0}$, the polynomials $(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}$ and $\|\mathbf{x}\|^{2}$ are linearly independent in $\mathbb{C}[\mathbf{x}]$, witnessed by two evaluations: (i) at $\mathbf{x}=\mathbf{w}_{0}$, the values are $\|\mathbf{w}_{0}\|^{4}$ and $\|\mathbf{w}_{0}\|^{2}$ respectively, with ratio $\|\mathbf{w}_{0}\|^{2}$; (ii) at any non-zero $\mathbf{x}\perp\mathbf{w}_{0}$ (which exists for $d\geq 2$ and is the only step that uses the dimensional hypothesis), the values are $0$ and $\|\mathbf{x}\|^{2}>0$, with a different ratio, so the two polynomials are not proportional. Hence $\mu=c_{1}=0$, and then $c_{1}L_{1}^{2}-\lambda=\mu D_{0}$ gives $\lambda=0$: contradiction. (At $d=1$ both polynomials equal $x^{2}$ up to a non-zero scalar, so Case 2 genuinely requires $d\geq 2$; Case 1 applies for all $d\geq 1$.)

Case 3: $\mathbf{w}_{0},\mathbf{w}_{1},\mathbf{w}_{2}$ pairwise distinct. By Lemma 2, each $D_{i}$ is irreducible in $\mathbb{C}[\mathbf{x}]$ and distinct $D_{i}$ are pairwise coprime. Since $\mathbb{C}[\mathbf{x}]$ is a unique factorisation domain (UFD), irreducibles are prime and the principal ideal $(D_{1})$ is therefore prime, making $R:=\mathbb{C}[\mathbf{x}]/(D_{1})$ an integral domain. Reduce $(*)$ modulo $(D_{1})$: the second term on the left and the right-hand side both carry a factor $D_{1}$, so $c_{1}\,\overline{L_{1}}^{\,2}\,\overline{D_{0}}\,\overline{D_{2}}=0$. Since $\mathbf{w}_{0},\mathbf{w}_{2}\neq\mathbf{w}_{1}$, the classes $\overline{D_{0}},\overline{D_{2}}$ are nonzero in $R$; since $R$ is an integral domain and $c_{1}\neq 0$, we get $\overline{L_{1}}=0$, i.e. $L_{1}\in(D_{1})$. But $L_{1}\not\equiv 0$ and $\deg L_{1}\leq 1<2=\deg D_{1}$: contradiction. (The symmetric reduction modulo $(D_{2})$ yields the same contradiction for the second atom.) ∎

Proof for d=1d=1 via complex pole matching.

At $d=1$ the variables collapse to scalars: $w_{i},b_{i},w_{0}\in\mathbb{R}$, and $D_{w}(z)=(z-w)^{2}+\varepsilon$ factors over $\mathbb{C}$ as $(z-w-i\sqrt{\varepsilon})(z-w+i\sqrt{\varepsilon})$, so Lemma 2 no longer applies. Instead, assume for contradiction that

$$c_{1}\frac{(w_{1}z+b_{1})^{2}}{(z-w_{1})^{2}+\varepsilon}+c_{2}\frac{(w_{2}z+b_{2})^{2}}{(z-w_{2})^{2}+\varepsilon}=\frac{\lambda}{(z-w_{0})^{2}+\varepsilon}$$

holds for all $z\in\mathbb{R}$ with $\lambda\neq 0$, $\varepsilon>0$. The standing hypothesis $\mathbf{w}_{0}\neq\mathbf{0}$ gives $w_{0}\neq 0$ in $d=1$ (this is what excludes the trivial case $\mathbf{w}_{0}=\mathbf{0}$, where one biased atom with $b>0$ already realises the IMQ atom: $g_{\varepsilon}(\cdot;\mathbf{0},b)/b^{2}=b^{2}/(b^{2}(\|\mathbf{x}\|^{2}+\varepsilon))=k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{0})$). By analytic continuation the identity extends to $z\in\mathbb{C}$. The right-hand side has exactly two simple poles, at $z=w_{0}\pm i\sqrt{\varepsilon}$. We partition by the pattern of equality among $\{w_{0},w_{1},w_{2}\}$ into four mutually exclusive and jointly exhaustive cases: (1) $w_{0},w_{1},w_{2}$ are three distinct values; (2) $w_{1}=w_{2}$ and both differ from $w_{0}$; (3) exactly one of $w_{1},w_{2}$ equals $w_{0}$ (so the two atom centers differ); (4) $w_{1}=w_{2}=w_{0}$.

Case 1: $w_{0},w_{1},w_{2}$ pairwise distinct. The first atom contributes potential simple poles at $z=w_{1}\pm i\sqrt{\varepsilon}$. These poles are present in neither the second atom (whose poles are at $w_{2}\pm i\sqrt{\varepsilon}\neq w_{1}\pm i\sqrt{\varepsilon}$) nor the right-hand side (whose poles are at $w_{0}\pm i\sqrt{\varepsilon}\neq w_{1}\pm i\sqrt{\varepsilon}$). Hence they must be removable, requiring the linear factor $w_{1}z+b_{1}$ to vanish at $z=w_{1}+i\sqrt{\varepsilon}$:

$$w_{1}(w_{1}+i\sqrt{\varepsilon})+b_{1}=(w_{1}^{2}+b_{1})+iw_{1}\sqrt{\varepsilon}=0.$$

Since $w_{1},b_{1},\varepsilon$ are real with $\varepsilon>0$, the imaginary part forces $w_{1}=0$ and the real part then forces $b_{1}=0$, i.e. the first atom is identically zero. The same argument applied to atom 2 forces $w_{2}=0=b_{2}$. The identity collapses to $0=\lambda/D_{w_{0}}$, contradicting $\lambda\neq 0$.

Case 2: $w_{1}=w_{2}=:w\neq w_{0}$. The two atoms share denominator $D_{w}(z)$. The combined left-hand-side numerator $P(z):=c_{1}(wz+b_{1})^{2}+c_{2}(wz+b_{2})^{2}$ has degree $\leq 2$. Equality of rational functions implies equality of pole sets with multiplicity; since $w\neq w_{0}$, the left-hand-side poles at $w\pm i\sqrt{\varepsilon}$ must cancel, forcing $D_{w}\mid P$. Since $\deg P\leq 2=\deg D_{w}$, $P=C\cdot D_{w}$ for some constant $C\in\mathbb{C}$. Then the left-hand side equals $C$, which is constant on $\mathbb{C}$, while the right-hand side $\lambda/D_{w_{0}}$ is non-constant (since $\lambda\neq 0$). Contradiction.

Case 3: $w_{1}=w_{0}\neq w_{2}$ (and the symmetric subcase $w_{2}=w_{0}\neq w_{1}$). The atom whose centre differs from $w_{0}$ has poles away from $w_{0}\pm i\sqrt{\varepsilon}$; by the Case 1 argument applied to that atom alone, the atom is identically zero. The identity reduces to one atom, $c_{1}(w_{0}z+b_{1})^{2}/D_{w_{0}}(z)=\lambda/D_{w_{0}}(z)$, i.e. $c_{1}(w_{0}z+b_{1})^{2}=\lambda$. The left side is a polynomial in $z$ of degree exactly $2$ (using $w_{0}\neq 0$ and $c_{1}\neq 0$ for a non-trivial atom), while the right side is a non-zero constant. Contradiction.

Case 4: $w_{1}=w_{2}=w_{0}$. Both sides share denominator $D_{w_{0}}(z)$; equating numerators gives the polynomial identity

$$c_{1}(w_{0}z+b_{1})^{2}+c_{2}(w_{0}z+b_{2})^{2}=\lambda\quad\text{in }\mathbb{R}[z].$$

Matching the $z^{2}$ coefficient: $w_{0}^{2}(c_{1}+c_{2})=0$, so $c_{2}=-c_{1}$ (since $w_{0}\neq 0$). Matching the $z^{1}$ coefficient: $2w_{0}(c_{1}b_{1}+c_{2}b_{2})=2w_{0}c_{1}(b_{1}-b_{2})=0$, so $b_{1}=b_{2}$ (since $c_{1}\neq 0$). Substituting back into the constant term: $c_{1}b_{1}^{2}+c_{2}b_{2}^{2}=c_{1}b_{1}^{2}-c_{1}b_{1}^{2}=0\neq\lambda$. Contradiction.

All four cases yield a contradiction, so the factor-33 lower bound holds at d=1d=1 as well. ∎
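The lower bound above is matched by the three-atom construction stated in the abstract: since $(a+b+h)^{2}-2(a+b)^{2}+(a+b-h)^{2}=2h^{2}$ for any $a$, a second finite difference in the bias at a fixed center cancels the polynomial numerator. The following is a minimal numerical sketch of that identity, not the paper's code; the function names, the step `h`, and the test points are illustrative assumptions.

```python
def yat_atom(x, w, b, eps):
    """Biased Yat atom g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (dist2 + eps)


def imq_atom(x, w, eps):
    """IMQ atom 1 / (||x - w||^2 + eps), the convention used in this appendix."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 1.0 / (dist2 + eps)


def imq_from_three_yat(x, w, b, h, eps):
    """Second finite difference in the bias, normalised by 2 h^2."""
    second_diff = (yat_atom(x, w, b + h, eps)
                   - 2.0 * yat_atom(x, w, b, eps)
                   + yat_atom(x, w, b - h, eps))
    return second_diff / (2.0 * h * h)


# Illustrative values; the three biases b - h, b, b + h = 1, 2, 3 are all positive.
w, eps, b, h = (0.7, -1.2), 0.5, 2.0, 1.0
points = [(0.0, 0.0), (1.5, -0.3), (-2.0, 4.0), (10.0, 10.0)]
max_err = max(abs(imq_from_three_yat(x, w, b, h, eps) - imq_atom(x, w, eps))
              for x in points)
print(max_err)  # expected to be at floating-point rounding level
```

The three atoms share one center, so the identity is pointwise and independent of the dimension; Theorem 9 says no two-atom combination can achieve the same.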

E.3 Proof of Proposition 1: singular universality at b0>0b_{0}>0

Proposition (Proposition 1 restated).

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact, $b_{0}>0$, $\varepsilon_{0}>0$. Then $\mathcal{H}_{b_{0},\varepsilon_{0}}$ is dense in $C(\mathcal{X})$ in the uniform norm.

Proof.

Write $k=p\cdot h$ with $h(\mathbf{x},\mathbf{w})=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon_{0})^{-1}$ the IMQ kernel and $p(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b_{0})^{2}$.

Step 1: the IMQ factor is universal. The RKHS $\mathcal{H}_{h}$ of $h$ is dense in $C(\mathcal{X})$ for every compact $\mathcal{X}$ by the Micchelli universality theorem [Micchelli et al., 2006].

Step 2: the polynomial factor has a constant feature. The numerator kernel $p$ has the explicit feature map $\Phi_{b_{0}}(\mathbf{x})=(\mathrm{vec}(\mathbf{x}\mathbf{x}^{\top}),\sqrt{2b_{0}}\,\mathbf{x},b_{0})\in\mathbb{R}^{d^{2}+d+1}$, so $p(\mathbf{x},\mathbf{w})=\langle\Phi_{b_{0}}(\mathbf{x}),\Phi_{b_{0}}(\mathbf{w})\rangle$. One coordinate of $\Phi_{b_{0}}$ is the nonzero constant $b_{0}$.

Step 3: the polynomial RKHS contains constants. The RKHS of $p$ is $\mathcal{H}_{p}=\{\langle\mathbf{u},\Phi_{b_{0}}(\cdot)\rangle:\mathbf{u}\in\mathbb{R}^{d^{2}+d+1}\}$. The dual coefficient $\mathbf{u}_{1}=(0,\ldots,0,1/b_{0})$ yields $\langle\mathbf{u}_{1},\Phi_{b_{0}}(\mathbf{x})\rangle=b_{0}\cdot(1/b_{0})=1$ for every $\mathbf{x}$, so the constant function $\mathbf{1}_{\mathcal{X}}\in\mathcal{H}_{p}$ with $\|\mathbf{1}_{\mathcal{X}}\|_{\mathcal{H}_{p}}\leq\|\mathbf{u}_{1}\|=1/b_{0}$ (an upper bound because $\Phi_{b_{0}}$ is non-minimal, so the RKHS norm is the infimum over representers).

Step 4: $\mathcal{H}_{k}$ contains $\mathcal{H}_{h}$. By the product-kernel RKHS containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] (also Theorem 7 above), the RKHS of $k=p\cdot h$ contains every pointwise product $f_{1}\cdot f_{2}$ with $f_{1}\in\mathcal{H}_{h}$ and $f_{2}\in\mathcal{H}_{p}$, and $\|f_{1}\cdot f_{2}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{h}}\|f_{2}\|_{\mathcal{H}_{p}}$. Take $f_{2}=\mathbf{1}_{\mathcal{X}}\in\mathcal{H}_{p}$ from Step 3, with $\|\mathbf{1}_{\mathcal{X}}\|_{\mathcal{H}_{p}}\leq 1/b_{0}$. Then for every $f_{1}\in\mathcal{H}_{h}$ the pointwise product $f_{1}\cdot\mathbf{1}_{\mathcal{X}}=f_{1}$ lies in $\mathcal{H}_{k}$, and $\|f_{1}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{h}}\cdot(1/b_{0})$. Thus the embedding $\mathcal{H}_{h}\hookrightarrow\mathcal{H}_{k}$ has operator norm $\leq 1/b_{0}$, recovering the Loewner bound $\|f\|_{\mathcal{H}_{b_{0},\varepsilon_{0}}}\leq b_{0}^{-1}\|f\|_{\mathcal{H}_{h}}$ of Theorem 2. (The cleaner route to the same conclusion is the Loewner-order proof in the main body.)

Step 5: density. Let $g\in C(\mathcal{X})$ and $\eta>0$. By Step 1, there exists $f\in\mathcal{H}_{h}$ with $\|g-f\|_{\infty}<\eta$. Step 4 gives $f\in\mathcal{H}_{k}$. Hence $\mathcal{H}_{k}$ is dense in $C(\mathcal{X})$. ∎
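Steps 2 and 3 can be spot-checked numerically. The sketch below, with illustrative names and values that are not from the paper, builds $\Phi_{b_{0}}$ as a flat list, confirms $p(\mathbf{x},\mathbf{w})=\langle\Phi_{b_{0}}(\mathbf{x}),\Phi_{b_{0}}(\mathbf{w})\rangle$ at one pair of points, and evaluates the dual coefficient $\mathbf{u}_{1}$ that reproduces the constant function.

```python
import math


def phi(x, b0):
    """Feature map (vec(x x^T), sqrt(2 b0) x, b0) as a flat list of length d^2 + d + 1."""
    quad = [xi * xj for xi in x for xj in x]
    lin = [math.sqrt(2.0 * b0) * xi for xi in x]
    return quad + lin + [b0]


def poly_kernel(x, w, b0):
    """Numerator kernel p(x, w) = (x.w + b0)^2."""
    return (sum(xi * wi for xi, wi in zip(x, w)) + b0) ** 2


# Illustrative test points and bias.
x, w, b0 = (0.3, -1.1, 2.0), (1.5, 0.4, -0.7), 0.8

lhs = poly_kernel(x, w, b0)
rhs = sum(a * c for a, c in zip(phi(x, b0), phi(w, b0)))

# Dual coefficient u_1 = (0, ..., 0, 1/b0) picks out the constant coordinate.
d = len(x)
u1 = [0.0] * (d * d + d) + [1.0 / b0]
const_vals = [sum(u * f for u, f in zip(u1, phi(p, b0))) for p in (x, w)]

print(lhs, rhs, const_vals)
```

At $b_{0}=0$ the last coordinate of `phi` is zero and `u1` is undefined, which mirrors the role of $b_{0}>0$ in the proof.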

Remark 1 (Necessity of b0>0b_{0}>0).

The constant feature $b_{0}$ in $\Phi_{b_{0}}$ is what makes Step 3 produce constants in $\mathcal{H}_{p}$. At $b_{0}=0$ this coordinate vanishes and the argument fails. When additionally $\mathbf{0}\in\mathcal{X}$, the simpler observation $k_{0,\varepsilon}(\mathbf{w},\mathbf{0})=0$ for every $\mathbf{w}$ forces every $f\in\mathcal{H}_{0,\varepsilon}$ to satisfy $f(\mathbf{0})=0$, ruling out density in $C(\mathcal{X})$.

E.4 Characteristic kernel and strict positive definiteness

Corollary (Characteristic kernel, restating Corollary 1(i)).

For $b_{0}>0$, $\varepsilon_{0}>0$, the kernel $k_{b_{0},\varepsilon_{0}}$ is characteristic on every compact $\mathcal{X}\subset\mathbb{R}^{d}$.

Proof.

By Proposition 1, $k_{b_{0},\varepsilon_{0}}$ is universal on $\mathcal{X}$. A universal continuous PSD kernel on compact $\mathcal{X}$ is characteristic: if $\mu_{k}:=\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})=0$ in $\mathcal{H}_{b_{0},\varepsilon_{0}}$ for some finite signed Borel measure $\mu$, then by the reproducing property $\int f\,d\mu=\langle f,\mu_{k}\rangle=0$ for every $f\in\mathcal{H}_{b_{0},\varepsilon_{0}}$. By density, $\int f\,d\mu=0$ for every $f\in C(\mathcal{X})$, so $\mu=0$ by the Riesz representation theorem. The metrization of weak convergence by $\mathrm{MMD}_{k_{b_{0},\varepsilon_{0}}}$ then follows from Sriperumbudur et al. [2010, Thm. 23]. ∎

Corollary (Strict positive definiteness and Gram invertibility, restating part of Cor. 1).

For $b_{0}>0$, $\varepsilon_{0}>0$, and any $n$ pairwise distinct points $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$, the Gram matrix $\mathbf{K}_{ij}=k_{b_{0},\varepsilon_{0}}(\mathbf{x}_{i},\mathbf{x}_{j})$ is strictly positive definite, hence invertible.

Proof.

Let $\mathbf{c}=(c_{1},\dots,c_{n})\in\mathbb{R}^{n}$ and set $\mu:=\sum_{i=1}^{n}c_{i}\delta_{\mathbf{x}_{i}}$, the finite signed Borel measure on $\mathcal{X}=\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}$ with atomic masses $c_{i}$. By bilinearity of the kernel mean embedding,

$$\mathbf{c}^{\top}\mathbf{K}\mathbf{c}\;=\;\sum_{i,j}c_{i}c_{j}\,k_{b_{0},\varepsilon_{0}}(\mathbf{x}_{i},\mathbf{x}_{j})\;=\;\Big\|\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\Big\|_{\mathcal{H}_{b_{0},\varepsilon_{0}}}^{2}.$$

If $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}=0$, the mean embedding of $\mu$ in $\mathcal{H}_{b_{0},\varepsilon_{0}}$ is zero. Since $k_{b_{0},\varepsilon_{0}}$ is characteristic on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ (Corollary 1(i)), the mean embedding is injective on finite signed Borel measures, hence $\mu=0$. Distinctness of $\{\mathbf{x}_{i}\}$ then forces $c_{i}=0$ for every $i$. Equivalently, $\mathbf{K}\succ 0$. ∎
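One way to see the corollary in practice is to run an unpivoted Cholesky factorisation on a small Gram matrix: the factorisation succeeds with strictly positive pivots exactly when the matrix is numerically positive definite. The sketch below is a plain-Python illustration under assumed values ($b_{0}=\varepsilon_{0}=1$ and the four corners of the unit square); the helper names are hypothetical.

```python
import math


def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def cholesky_pivots(K):
    """Return the squared diagonal pivots of an unpivoted Cholesky factorisation."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    pivots = []
    for j in range(n):
        s = K[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not numerically positive definite")
        pivots.append(s)
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            L[i][j] = (K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return pivots


# Four pairwise distinct points (assumed example data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = [[yat_kernel(p, q, 1.0, 1.0) for q in points] for p in points]
pivots = cholesky_pivots(K)
print(pivots)
```

Strict positive definiteness is an exact statement; in floating point, nearly coincident points can still produce pivots that round to zero, which is a conditioning issue rather than a failure of the corollary.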

E.5 Proof of Proposition 3 and Theorem 6: closed-form norm and single-layer Rademacher bound

Proposition (Proposition 3 restated).

Let $b\geq 0$, $\varepsilon>0$, and $f(\cdot)=\sum_{j=1}^{m}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\in\mathcal{H}_{b,\varepsilon}$. Then $\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ with $\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$.

Proof.

By bilinearity and the reproducing property,

$$\begin{aligned}\|f\|^{2}_{\mathcal{H}_{b,\varepsilon}}&=\sum_{i,j}\alpha_{i}\alpha_{j}\,\bigl\langle k_{b,\varepsilon}(\mathbf{w}_{i},\cdot),\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\bigr\rangle_{\mathcal{H}_{b,\varepsilon}}\\&=\sum_{i,j}\alpha_{i}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})\;=\;\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}.\end{aligned}$$ ∎

The identity $\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is exact for every coefficient vector $\boldsymbol{\alpha}$, including when $\mathbf{K}$ is rank-deficient. If two centers coincide or two coefficient vectors $\boldsymbol{\alpha},\boldsymbol{\alpha}^{\prime}$ represent the same function $f$ (equivalently, $\mathbf{K}\boldsymbol{\alpha}=\mathbf{K}\boldsymbol{\alpha}^{\prime}$), then bilinearity of the RKHS inner product gives $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}=\boldsymbol{\alpha}^{\prime\top}\mathbf{K}\boldsymbol{\alpha}^{\prime}$, since both equal $\langle f,f\rangle_{\mathcal{H}_{b,\varepsilon}}$. The pseudoinverse expression $(\mathbf{K}\boldsymbol{\alpha})^{\top}\mathbf{K}^{\dagger}(\mathbf{K}\boldsymbol{\alpha})$ is algebraically equal to $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ on PSD $\mathbf{K}$ via $\mathbf{K}\mathbf{K}^{\dagger}\mathbf{K}=\mathbf{K}$, so either form may be used. Coincident or near-coincident centers therefore raise a numerical-conditioning question, not a mathematical correctness question; in floating-point implementations, deduplicating exactly coincident centers and using a stable factorisation (e.g. pivoted Cholesky) on the remaining strictly PD block is sufficient.
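The representation-independence claim can be illustrated with a duplicated center. In the sketch below (illustrative names and values, not the paper's implementation), the first two centers coincide, so the coefficient vectors `alpha_a` and `alpha_b` define the same function, and the quadratic form returns the same squared norm for both even though the Gram matrix is singular.

```python
def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def rkhs_norm_sq(alpha, centers, b, eps):
    """Closed-form squared RKHS norm alpha^T K alpha of sum_j alpha_j k(w_j, .)."""
    m = len(centers)
    K = [[yat_kernel(centers[i], centers[j], b, eps) for j in range(m)]
         for i in range(m)]
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))


# Centers 0 and 1 coincide on purpose, so K has two identical rows.
centers = [(1.0, 0.5), (1.0, 0.5), (-0.3, 2.0)]
alpha_a = [1.0, 0.0, -0.4]
alpha_b = [0.25, 0.75, -0.4]   # same total weight on the duplicated center

norm_a = rkhs_norm_sq(alpha_a, centers, b=0.5, eps=0.1)
norm_b = rkhs_norm_sq(alpha_b, centers, b=0.5, eps=0.1)
print(norm_a, norm_b)
```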

Theorem (Theorem 6 restated).

Fix $b\geq 0$, $\varepsilon>0$. Let $\mathcal{F}_{B,b,\varepsilon}=\{f\in\mathcal{H}_{b,\varepsilon}:\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq B\}$, and let $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ be $n$ i.i.d. samples with $\|\mathbf{x}\|\leq R$. Then $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\leq B(R^{2}+b)/(\sqrt{n}\,\sqrt{\varepsilon})$.

Proof.

For any $f\in\mathcal{F}_{B,b,\varepsilon}$ and $\mathbf{x}\in\mathbb{R}^{d}$, the reproducing property of the fixed kernel $k_{b,\varepsilon}$ gives $f(\mathbf{x})=\langle f,k_{b,\varepsilon}(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{b,\varepsilon}}$. By Cauchy–Schwarz,

$$|f(\mathbf{x})|\;\leq\;\|f\|_{\mathcal{H}_{b,\varepsilon}}\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\;\leq\;B\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}.\tag{8}$$

The diagonal of the biased Yat kernel is

$$k_{b,\varepsilon}(\mathbf{x},\mathbf{x})\;=\;\frac{(\|\mathbf{x}\|^{2}+b)^{2}}{0+\varepsilon}\;=\;\frac{(\|\mathbf{x}\|^{2}+b)^{2}}{\varepsilon},$$

since $\mathbf{x}^{\top}\mathbf{x}+b=\|\mathbf{x}\|^{2}+b$ and $\|\mathbf{x}-\mathbf{x}\|^{2}+\varepsilon=\varepsilon$. For $\|\mathbf{x}\|\leq R$,

$$\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\;=\;\frac{\|\mathbf{x}\|^{2}+b}{\sqrt{\varepsilon}}\;\leq\;\frac{R^{2}+b}{\sqrt{\varepsilon}}.$$

The standard RKHS Rademacher bound for a ball of radius $B$ follows from the reproducing property and Jensen's inequality (Bartlett and Mendelson 2002, Lemma 22, with the one-sided convention $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}):=\mathbb{E}_{\boldsymbol{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i}\sigma_{i}f(\mathbf{x}_{i})$ used at line 7). Writing $f(\mathbf{x}_{i})=\langle f,k_{b,\varepsilon}(\cdot,\mathbf{x}_{i})\rangle$, the supremum over $\|f\|\leq B$ pulls out a factor $B$, and Jensen exchanges $\mathbb{E}_{\boldsymbol{\sigma}}$ with the square root:

$$\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;=\;\frac{B}{n}\,\mathbb{E}_{\boldsymbol{\sigma}}\Big\|\sum_{i}\sigma_{i}k_{b,\varepsilon}(\cdot,\mathbf{x}_{i})\Big\|_{\mathcal{H}_{b,\varepsilon}}\;\leq\;\frac{B}{\sqrt{n}}\cdot\sqrt{\frac{1}{n}\sum_{i=1}^{n}k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})}.$$

The constant is exactly $1$ (no symmetrization factor of $2$) under this one-sided convention. Bounding each per-sample term by $(R^{2}+b)^{2}/\varepsilon$ and taking the square root gives

$$\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;B\sqrt{\frac{1}{n}\cdot\frac{(R^{2}+b)^{2}}{\varepsilon}}\;=\;\frac{B(R^{2}+b)}{\sqrt{n}\,\sqrt{\varepsilon}}.$$ ∎

Corollary (Unbiased case).

At $b=0$, $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,0,\varepsilon})\leq BR^{2}/(\sqrt{n}\,\sqrt{\varepsilon})$.
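For a small sample the chain of inequalities in the proof of Theorem 6 can be evaluated exactly rather than by Monte Carlo, because $\big\|\sum_{i}\sigma_{i}k_{b,\varepsilon}(\cdot,\mathbf{x}_{i})\big\|^{2}=\boldsymbol{\sigma}^{\top}\mathbf{K}\boldsymbol{\sigma}$ and all $2^{n}$ sign vectors can be enumerated. The following sketch assumes $n=8$ synthetic points in $\mathbb{R}^{2}$ and takes $R$ as the largest sample norm; names and parameter values are illustrative only.

```python
import itertools
import math
import random


def yat_kernel(w, x, b, eps):
    """Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)


def exact_rademacher(K, B):
    """(B / n) * E_sigma sqrt(sigma^T K sigma), averaged over all 2^n sign vectors."""
    n = len(K)
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=n):
        quad = sum(sigma[i] * K[i][j] * sigma[j] for i in range(n) for j in range(n))
        total += math.sqrt(max(quad, 0.0))   # guard against tiny negative rounding
    return B / n * total / 2 ** n


rng = random.Random(0)                       # fixed seed: synthetic example data
n, b, eps, B = 8, 0.5, 0.1, 1.0
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
R = max(math.hypot(*x) for x in xs)          # radius of the sample
K = [[yat_kernel(p, q, b, eps) for q in xs] for p in xs]

rad = exact_rademacher(K, B)
jensen = B / math.sqrt(n) * math.sqrt(sum(K[i][i] for i in range(n)) / n)
bound = B * (R * R + b) / (math.sqrt(n) * math.sqrt(eps))
print(rad, jensen, bound)
```

The three printed numbers correspond, in order, to the exact empirical complexity of the ball, the Jensen step, and the closed-form bound of the theorem.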

E.6 Uniform Boundedness and High-Dimensional Variance

Proposition 4 (Uniform boundedness with sharpness).

For $\|\mathbf{x}\|\leq R$, $\|\mathbf{w}\|\leq W$, $b\geq 0$, $\varepsilon>0$,

$$0\;\leq\;k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\;\leq\;\frac{(RW+b)^{2}}{\varepsilon}.$$

For every fixed $\mathbf{w}$, $b\geq 0$, $\varepsilon>0$, the section $k_{b,\varepsilon}(\mathbf{w},\cdot)$ is globally bounded on $\mathbb{R}^{d}$. At $b=0$ and $\mathbf{w}\neq\mathbf{0}$ the global supremum is $\sup_{\mathbf{x}\in\mathbb{R}^{d}}k_{\text{ⵟ}}(\mathbf{w},\mathbf{x})=\|\mathbf{w}\|^{4}/\varepsilon+\|\mathbf{w}\|^{2}$, attained at $\mathbf{x}^{\ast}=(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$. For $\mathbf{w}=\mathbf{0}$ the unbiased section is identically zero. For $b>0$ the supremum on $\mathbb{R}^{d}$ does not have a clean closed form; it is bounded by $(RW+b)^{2}/\varepsilon$ on any compact domain with $\|\mathbf{x}\|\leq R$, $\|\mathbf{w}\|\leq W$.

Proof.

Compact-domain bound. By Cauchy–Schwarz, $|\mathbf{w}^{\top}\mathbf{x}|\leq\|\mathbf{w}\|\|\mathbf{x}\|\leq WR$, so $|\mathbf{w}^{\top}\mathbf{x}+b|\leq WR+b$. Also $D_{\varepsilon}(\mathbf{x},\mathbf{w})\geq\varepsilon>0$, so $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\leq(WR+b)^{2}/\varepsilon$. Nonnegativity follows from the squared numerator and positive denominator.

Global boundedness at fixed $\mathbf{w}$. Set $c:=\|\mathbf{w}\|^{2}+b\geq 0$. Writing $\mathbf{x}^{\top}\mathbf{w}+b=(\mathbf{x}-\mathbf{w})^{\top}\mathbf{w}+c$ gives $|\mathbf{x}^{\top}\mathbf{w}+b|\leq\|\mathbf{x}-\mathbf{w}\|\|\mathbf{w}\|+c$, and since $D_{\varepsilon}(\mathbf{x},\mathbf{w})\geq\|\mathbf{x}-\mathbf{w}\|^{2}$,

$$k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\;\leq\;\left(\|\mathbf{w}\|+\frac{c}{\|\mathbf{x}-\mathbf{w}\|}\right)^{2}\quad\text{for }\|\mathbf{x}-\mathbf{w}\|>0.$$

For $\|\mathbf{x}-\mathbf{w}\|\geq 2c$ this gives $k_{b,\varepsilon}\leq(\|\mathbf{w}\|+\tfrac{1}{2})^{2}$; for $\|\mathbf{x}-\mathbf{w}\|\leq 2c$, the numerator bound $|\mathbf{x}^{\top}\mathbf{w}+b|\leq 2c\|\mathbf{w}\|+c$ together with $D_{\varepsilon}\geq\varepsilon$ gives $k_{b,\varepsilon}\leq c^{2}(2\|\mathbf{w}\|+1)^{2}/\varepsilon$.

Sharpness at $b=0$. Decompose $\mathbf{x}=\alpha\,\mathbf{w}/\|\mathbf{w}\|+\mathbf{x}_{\perp}$ with $\mathbf{x}_{\perp}\perp\mathbf{w}$. Then $\mathbf{x}^{\top}\mathbf{w}=\alpha\|\mathbf{w}\|$ and $\|\mathbf{x}-\mathbf{w}\|^{2}=(\alpha-\|\mathbf{w}\|)^{2}+\|\mathbf{x}_{\perp}\|^{2}$, so

$$k_{\text{ⵟ}}(\mathbf{w},\mathbf{x})\;=\;\frac{\alpha^{2}\|\mathbf{w}\|^{2}}{(\alpha-\|\mathbf{w}\|)^{2}+\|\mathbf{x}_{\perp}\|^{2}+\varepsilon}.$$

This is monotonically decreasing in $\|\mathbf{x}_{\perp}\|^{2}$, so the supremum is attained at $\mathbf{x}_{\perp}=\mathbf{0}$. Writing $\lambda=\alpha/\|\mathbf{w}\|$ and $u=\|\mathbf{w}\|^{2}$, reduce to the one-variable maximisation

$$f(\lambda)\;=\;\frac{\lambda^{2}u^{2}}{(\lambda-1)^{2}u+\varepsilon}.$$

Differentiating, $f^{\prime}(\lambda)=2\lambda u^{2}\,[(1-\lambda)u+\varepsilon]/((\lambda-1)^{2}u+\varepsilon)^{2}$, so the unique positive critical point is $\lambda^{\ast}=1+\varepsilon/u$. Substituting:

$$f(\lambda^{\ast})\;=\;\frac{(1+\varepsilon/u)^{2}\,u^{2}}{(\varepsilon/u)^{2}\,u+\varepsilon}\;=\;\frac{(u+\varepsilon)^{2}}{\varepsilon(1+\varepsilon/u)}\;=\;\frac{u(u+\varepsilon)}{\varepsilon}\;=\;\frac{u^{2}}{\varepsilon}+u\;=\;\frac{\|\mathbf{w}\|^{4}}{\varepsilon}+\|\mathbf{w}\|^{2},$$

attained at $\mathbf{x}^{\ast}=\lambda^{\ast}\|\mathbf{w}\|\cdot\mathbf{w}/\|\mathbf{w}\|=(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$. ∎
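The closed-form supremum at $b=0$ is easy to sanity-check: evaluate the unbiased kernel at $\mathbf{x}^{\ast}$ and compare against a random search. The sketch below uses an arbitrary center and $\varepsilon$ chosen for illustration; the random search is only a spot check, not a proof of global optimality.

```python
import random


def yat_unbiased(w, x, eps):
    """Unbiased Yat kernel (w.x)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return dot ** 2 / (dist2 + eps)


# Illustrative center and regulariser.
w, eps = (1.2, -0.5, 0.3), 0.25
u = sum(wi * wi for wi in w)                      # u = ||w||^2

sup_closed = u * u / eps + u                      # ||w||^4 / eps + ||w||^2
x_star = tuple((1.0 + eps / u) * wi for wi in w)  # claimed maximiser
value_at_star = yat_unbiased(w, x_star, eps)

rng = random.Random(1)
best_random = max(
    yat_unbiased(w, tuple(rng.uniform(-10.0, 10.0) for _ in w), eps)
    for _ in range(20000)
)
print(sup_closed, value_at_star, best_random)
```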

The diagonal value $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in the Rademacher proof is the special case $\mathbf{w}=\mathbf{x}$ of the kernel definition. Classical unbounded scalar activations such as ReLU and GELU satisfy $\sigma(\mathbf{w}^{\top}\mathbf{x})\to\infty$ as $\mathbf{w}^{\top}\mathbf{x}\to+\infty$, so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.

High-dimensional variance preservation under isotropic Gaussian inputs.

A second structural difference between the Yat unit and a purely radial unit shows up in the high-dimensional input regime: the IMQ output concentrates while the Yat output does not. The phenomenon "isotropic Gaussian inputs concentrate radial kernels in high $d$" is by now standard in random-matrix and neural-network theory: El Karoui [2010] established the original spectral collapse for kernel random matrices on isotropic high-dimensional inputs, and Ghorbani et al. [2019] characterise the same concentration for radial features in two-layer linearised networks. The proposition below is the matching structural statement at the level of the kernel diagonal: under the same inputs, the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ depends on a single 1-D projection and resists the collapse.

Proposition 5 (High-dimensional variance preservation).

Let $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$, let $\mathbf{w}\in\mathbb{R}^{d}$ have unit norm $\|\mathbf{w}\|=1$, and fix $b\geq 0$, $\varepsilon>0$. For a positive random variable $Y$, write $\mathrm{CV}[Y]:=\sqrt{\mathrm{Var}[Y]}/\mathbb{E}[Y]$. As $d\to\infty$,

$$\mathrm{CV}[h_{\varepsilon}(\mathbf{x},\mathbf{w})]=\Theta(d^{-1/2}),\qquad\mathrm{CV}[k_{b,\varepsilon}(\mathbf{w},\mathbf{x})]=\Theta(1).$$
Proof.

By rotational invariance of the standard Gaussian, take $\mathbf{w}=\mathbf{e}_{1}$. Then $\mathbf{w}^{\top}\mathbf{x}=x_{1}\sim\mathcal{N}(0,1)$, and the regularised distance $D:=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon=(x_{1}-1)^{2}+\sum_{i=2}^{d}x_{i}^{2}+\varepsilon$ satisfies $D-\varepsilon\sim\chi_{d}^{2}(\lambda=1)$ (non-central chi-squared, $d$ degrees of freedom, non-centrality $\|\mathbf{w}\|^{2}=1$). Standard moment formulas give $\mu_{D}:=\mathbb{E}[D]=d+1+\varepsilon$ and $\sigma_{D}^{2}:=\mathrm{Var}[D]=2d+4$.

IMQ unit. For $h_{\varepsilon}=1/D$, second-order Taylor expansion of $1/D$ around $\mu_{D}$ (justified since $D\geq\varepsilon>0$ a.s. and $|D-\mu_{D}|/\mu_{D}=O_{P}(d^{-1/2})$ by chi-squared concentration) gives

$$\mathbb{E}[h_{\varepsilon}]=\mu_{D}^{-1}+O(\sigma_{D}^{2}/\mu_{D}^{3})=\Theta(d^{-1}),\qquad\mathrm{Var}[h_{\varepsilon}]=\sigma_{D}^{2}/\mu_{D}^{4}+O(\mu_{D}^{-5})=\Theta(d^{-3}),$$

so $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-3/2})/\Theta(d^{-1})=\Theta(d^{-1/2})$.

Yat unit. Set $N:=(x_{1}+b)^{2}$, so $k_{b,\varepsilon}=N/D$. Direct calculation gives $\mu_{N}=1+b^{2}$, $\sigma_{N}^{2}=2+4b^{2}$, and

$$\mathrm{Cov}(N,D)=\mathrm{Cov}\bigl((x_{1}+b)^{2},(x_{1}-1)^{2}\bigr)=2-4b$$

(the contributions of $x_{2},\ldots,x_{d}$ to $D$ are independent of $N$, so do not enter the covariance). Bivariate Taylor expansion of $N/D$ around $(\mu_{N},\mu_{D})$ yields

$$\mathbb{E}[k_{b,\varepsilon}]=\mu_{N}/\mu_{D}+O(\mu_{D}^{-2})=\Theta(d^{-1}),$$

$$\mathrm{Var}[k_{b,\varepsilon}]=\frac{\sigma_{N}^{2}}{\mu_{D}^{2}}-\frac{2\mu_{N}\,\mathrm{Cov}(N,D)}{\mu_{D}^{3}}+\frac{\mu_{N}^{2}\sigma_{D}^{2}}{\mu_{D}^{4}}+O(\mu_{D}^{-5}).$$

Substituting the dimensional scalings $\mu_{D}=\Theta(d)$, $\sigma_{D}^{2}=\Theta(d)$, $\mu_{N},\sigma_{N}^{2}=\Theta(1)$, and $\mathrm{Cov}(N,D)=O(1)$ (it vanishes at $b=1/2$), the three terms are of order $\Theta(d^{-2})$, $O(d^{-3})$, and $\Theta(d^{-3})$ respectively, so the sum is $\Theta(d^{-2})$, dominated by $\sigma_{N}^{2}/\mu_{D}^{2}$. Hence $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(d^{-1})/\Theta(d^{-1})=\Theta(1)$.

Higher-order Taylor remainders are controlled by combining the chi-squared concentration $|D-\mu_{D}|/\mu_{D}=O_{P}(d^{-1/2})$ with a high-probability lower bound on $D$. Specifically, the chi-squared concentration inequality [Laurent and Massart, 2000, Lemma 1] gives $\Pr[D<\mu_{D}/2]\leq e^{-cd}$ for some absolute constant $c>0$ and all $d\geq d_{0}$. Split the expectation $\mathbb{E}[|D-\mu_{D}|^{k}/D^{k}]$ over the event $\mathcal{E}:=\{D\geq\mu_{D}/2\}$ and its complement. On $\mathcal{E}$, $D^{-k}\leq(2/\mu_{D})^{k}=O(d^{-k})$ deterministically, and the standard chi-squared moment formula $\mathbb{E}[|D-\mu_{D}|^{k}\mathbf{1}_{\mathcal{E}}]=O(d^{k/2})$ gives $\mathbb{E}[|D-\mu_{D}|^{k}/D^{k}\,\mathbf{1}_{\mathcal{E}}]=O(d^{-k/2})$. On $\mathcal{E}^{c}$, the deterministic floor $D\geq\varepsilon$ together with the deterministic upper bound $|D-\mu_{D}|^{k}\leq\mu_{D}^{k}=\Theta(d^{k})$ (valid since $0<D<\mu_{D}/2$ on $\mathcal{E}^{c}$) gives $\mathbb{E}[|D-\mu_{D}|^{k}/D^{k}\,\mathbf{1}_{\mathcal{E}^{c}}]\leq\varepsilon^{-k}\,\mu_{D}^{k}\,\Pr[\mathcal{E}^{c}]\leq\varepsilon^{-k}\,\mu_{D}^{k}\,e^{-cd}=o(d^{-k})$, which is asymptotically negligible. We do not assume independence of $|D-\mu_{D}|^{k}$ and $\mathbf{1}_{\mathcal{E}^{c}}$, which fails; the deterministic bound circumvents this. Combining, $\mathbb{E}[|D-\mu_{D}|^{k}/D^{k}]=O(d^{-k/2})$ for $k\geq 3$. This is the bound the bivariate Taylor expansion needs: the leading $\Theta(d^{-2})$ variance term dominates the $O(d^{-3})$ corrections, and the asymptotic CV scaling is unaffected. The deterministic bound $D^{-k}\leq\varepsilon^{-k}$ alone is too loose to control the remainder at this scale, since $\mathbb{E}[|D-\mu_{D}|^{k}]\cdot\varepsilon^{-k}$ grows with $d$; the high-probability split above is the correct bookkeeping. ∎
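The two scalings in Proposition 5 can be observed in a small simulation. The sketch below follows the reduction $\mathbf{w}=\mathbf{e}_{1}$ used in the proof and samples $\sum_{i\geq 2}x_{i}^{2}$ directly as a $\chi^{2}_{d-1}$ variable (a Gamma with shape $(d-1)/2$ and scale $2$) instead of drawing all $d$ coordinates. The sample size, seed, dimensions, and $b=\varepsilon=1$ are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_units(d, b, eps, n_samples, rng):
    """Draw IMQ outputs 1/D and Yat outputs (x1 + b)^2 / D for w = e_1."""
    imq, yat = [], []
    for _ in range(n_samples):
        x1 = rng.gauss(0.0, 1.0)
        # sum_{i>=2} x_i^2 ~ chi^2_{d-1} = Gamma(shape=(d-1)/2, scale=2)
        rest = rng.gammavariate((d - 1) / 2.0, 2.0) if d > 1 else 0.0
        D = (x1 - 1.0) ** 2 + rest + eps
        imq.append(1.0 / D)
        yat.append((x1 + b) ** 2 / D)
    return imq, yat


def cv(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


rng = random.Random(0)
results = {}
for d in (4, 64, 1024):
    imq, yat = sample_units(d, b=1.0, eps=1.0, n_samples=4000, rng=rng)
    results[d] = (cv(imq), cv(yat))
    print(d, results[d])
```

Under these assumptions the IMQ coefficient of variation should shrink roughly like $d^{-1/2}$, while the Yat value should settle near $\sqrt{2+4b^{2}}/(1+b^{2})$, the coefficient of variation of the numerator alone.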

Remark 2.

Proposition 5 is a structural calculation under the specific toy model of isotropic Gaussian inputs and unit-norm centers; real ML feature distributions are not isotropic Gaussian, and trained Yat centers need not be unit-norm. What the proposition isolates is the asymptotic mechanism: the IMQ denominator $D$ concentrates around its mean of order $d$, so the radial output $1/D$ concentrates around $1/d$ and its relative spread vanishes; the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ depends only on the one-dimensional projection $\mathbf{x}^{\top}\mathbf{w}$ and so does not concentrate, yielding a $\Theta(1)$ relative spread that survives the limit.

E.7 Additional proofs from Section 2

Proof of Theorem 2.

By Theorem 1 the kernel $k_{b,\varepsilon}$ is PSD, so $\mathcal{H}_{b,\varepsilon}$ is well-defined and the Loewner-order Aronszajn inclusion below is meaningful. Compute $k_{b,\varepsilon}-b^{2}h_{\varepsilon}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}+(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$. The kernels $\mathbf{x}^{\top}\mathbf{w}$ and $(\mathbf{x}^{\top}\mathbf{w})^{2}$ are PSD, and the scalar $2b\geq 0$ (using $b\geq 0$) makes $2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ a nonnegatively scaled PSD kernel; $h_{\varepsilon}$ is PSD (IMQ); Schur products of PSD kernels are PSD (Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]). The sum is PSD; Aronszajn's inclusion theorem then yields the RKHS containment and norm bound. For $b=0$ this reduces to the trivial $0\preceq k_{0,\varepsilon}$; the RKHS inclusion $\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon}$ is meaningful only for $b>0$. ∎
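The Loewner domination $b^{2}h_{\varepsilon}\preceq k_{b,\varepsilon}$ can be spot-checked numerically. The sketch below is illustrative only: the helper names (`imq`, `yat`, `min_gap_form`), the sample points, and the parameter values are assumptions, and random sampling of quadratic forms is a sanity check, not a proof of positive semidefiniteness.

```python
import random

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def imq(x, w, eps):
    # IMQ kernel h_eps(x, w) = 1 / (||x - w||^2 + eps)
    return 1.0 / (sum((a - c) ** 2 for a, c in zip(x, w)) + eps)

def yat(w, x, b, eps):
    # Yat kernel k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)
    return (dot(w, x) + b) ** 2 * imq(x, w, eps)

def min_gap_form(points, b, eps, trials=200, seed=0):
    """Smallest sampled value of c^T (K - b^2 H) c over random vectors c."""
    rng = random.Random(seed)
    n = len(points)
    gap = [[yat(p, q, b, eps) - b * b * imq(p, q, eps) for q in points]
           for p in points]
    worst = float("inf")
    for _ in range(trials):
        c = [rng.gauss(0.0, 1.0) for _ in range(n)]
        form = sum(c[i] * gap[i][j] * c[j] for i in range(n) for j in range(n))
        worst = min(worst, form)
    return worst

rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(6)]
worst_form = min_gap_form(points, b=0.5, eps=0.1)
print(worst_form)  # should be nonnegative up to rounding if the domination holds
```

A negative value well below rounding error would contradict the theorem on this sample; a nonnegative value is merely consistent with it.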

E.8 Proofs from Section 3

Proof of Lemma 1.

Every finite-span element $f_{m}(\mathbf{x})=\sum_{j=1}^{m}a_{j}h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})$ vanishes at infinity since each atom decays as $O(\|\mathbf{x}\|^{-2})$. The finite span is dense in $\mathcal{H}_{h_{\varepsilon}}$; if $f_{m}\to f$ in RKHS norm, the reproducing property gives

\[\sup_{\mathbf{x}\in\mathbb{R}^{d}}|f_{m}(\mathbf{x})-f(\mathbf{x})|\leq\sup_{\mathbf{x}}\sqrt{h_{\varepsilon}(\mathbf{x},\mathbf{x})}\,\|f_{m}-f\|_{\mathcal{H}_{h_{\varepsilon}}}=\varepsilon^{-1/2}\|f_{m}-f\|_{\mathcal{H}_{h_{\varepsilon}}},\]

so $f_{m}\to f$ uniformly. Since $C_{0}(\mathbb{R}^{d})$ is closed under uniform limits, $f\in C_{0}(\mathbb{R}^{d})$. ∎

Proof of Corollary 2.

The inclusion $\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon}$ is Theorem 2. For strictness, choose $\mathbf{w}\neq\mathbf{0}$ and any $\mathbf{u}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$: the far-field trace gives $\lim_{r\to\infty}k_{b,\varepsilon}(\mathbf{w},r\mathbf{u})=(\mathbf{u}^{\top}\mathbf{w})^{2}\neq 0$, so $k_{b,\varepsilon}(\mathbf{w},\cdot)\notin C_{0}(\mathbb{R}^{d})$ while, by Lemma 1, $\mathcal{H}_{h_{\varepsilon}}\subset C_{0}(\mathbb{R}^{d})$. Hence $k_{b,\varepsilon}(\mathbf{w},\cdot)\in\mathcal{H}_{b,\varepsilon}\setminus\mathcal{H}_{h_{\varepsilon}}$. Non-reversibility: if $k_{b,\varepsilon}\preceq C\,h_{\varepsilon}$, Aronszajn inclusion would give $\mathcal{H}_{b,\varepsilon}\subseteq\mathcal{H}_{h_{\varepsilon}}\subset C_{0}(\mathbb{R}^{d})$, contradicting the existence of Yat sections not vanishing at infinity. ∎

Proof of Theorem 3.

The RKHS sum rule states that $\mathcal{H}_{k_{1}+k_{2}}=\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}}$ with the infimal-convolution norm [Paulsen and Raghupathi, 2016]; it extends to three summands by induction. Each $k_{i}$ is PSD: $k_{0}=b^{2}h_{\varepsilon}$ since $h_{\varepsilon}$ is PSD and $b^{2}\geq 0$; $k_{1}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ because the dot-product kernel $\mathbf{x}^{\top}\mathbf{w}$ is PSD ($\sum_{ij}c_{i}c_{j}\mathbf{x}_{i}^{\top}\mathbf{x}_{j}=\|\sum_{i}c_{i}\mathbf{x}_{i}\|^{2}\geq 0$), $h_{\varepsilon}$ is PSD, their Schur product is PSD by the Schur product theorem, and $2b\geq 0$ (the channel vanishes at $b=0$); $k_{2}=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$ by the Schur product theorem applied to $(\mathbf{x}^{\top}\mathbf{w})^{2}$ and $h_{\varepsilon}$, both PSD. The RKHS sum rule therefore applies. ∎

Proof of Proposition 2.

For (4), write

\[g_{\varepsilon}(r\mathbf{u};\mathbf{w},b)\;=\;\frac{(r\,\mathbf{u}^{\top}\mathbf{w}+b)^{2}}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon}\;=\;\frac{(\mathbf{u}^{\top}\mathbf{w}+b/r)^{2}}{1-2(\mathbf{u}^{\top}\mathbf{w})/r+(\|\mathbf{w}\|^{2}+\varepsilon)/r^{2}}\;\xrightarrow[r\to\infty]{}\;(\mathbf{u}^{\top}\mathbf{w})^{2},\]

uniformly in $\mathbf{u}\in\mathbb{S}^{d-1}$. For the IMQ side, $|k_{\mathrm{IMQ}}^{\varepsilon}(r\mathbf{u},\mathbf{w}_{i})|\leq(r^{2}-2r\|\mathbf{w}_{i}\|+\|\mathbf{w}_{i}\|^{2}+\varepsilon)^{-1}=O(r^{-2})$, so finite combinations $F=\sum_{i}c_{i}k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{i})$ satisfy $T_{\infty}F\equiv 0$. The image of $T_{\infty}$ contains every rank-one quadratic form $(\mathbf{u}^{\top}\mathbf{w})^{2}=\mathbf{u}^{\top}(\mathbf{w}\mathbf{w}^{\top})\mathbf{u}$; the symmetric rank-one tensors $\mathbf{w}\mathbf{w}^{\top}$ span $\mathrm{Sym}(d)$ (the off-diagonal generators are obtained as $\tfrac{1}{2}((\mathbf{e}_{i}+\mathbf{e}_{j})(\mathbf{e}_{i}+\mathbf{e}_{j})^{\top}-\mathbf{e}_{i}\mathbf{e}_{i}^{\top}-\mathbf{e}_{j}\mathbf{e}_{j}^{\top})$), so $\mathrm{Im}(T_{\infty})=\mathcal{Q}_{2}(\mathbb{S}^{d-1})$. ∎
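The two far-field behaviours in this proof are easy to observe numerically. The following is a minimal sketch with assumed values for $\mathbf{w}$, $\mathbf{u}$, $b$, $\varepsilon$; the function names are illustrative and do not come from the paper.

```python
import math

def yat_atom(x, w, b, eps):
    # g_eps(x; w, b) = (w.x + b)^2 / (||x - w||^2 + eps)
    num = (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return num / den

def imq_atom(x, w, eps):
    # IMQ atom 1 / (||x - w||^2 + eps)
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

w = [1.0, -2.0, 0.5]
u = [1.0 / math.sqrt(3.0)] * 3            # unit direction with u.w != 0
b, eps = 0.7, 0.1
trace = sum(ui * wi for ui, wi in zip(u, w)) ** 2   # (u^T w)^2

for r in (1e1, 1e3, 1e5):
    x = [r * ui for ui in u]
    # The Yat atom approaches `trace`; the IMQ atom decays like r^-2.
    print(r, yat_atom(x, w, b, eps) - trace, imq_atom(x, w, eps))
```

The printed Yat residual shrinks roughly like $1/r$, while the IMQ value shrinks like $1/r^{2}$, in line with the two estimates above.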

Corollary 5 (Concrete large-radius regression gap).

Fix $\mathbf{w}\neq\mathbf{0}$, $b\in\mathbb{R}$, $\varepsilon>0$, and choose $\mathbf{u}\in\mathbb{S}^{d-1}$ with $a:=(\mathbf{u}^{\top}\mathbf{w})^{2}>0$. Let $f(\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w},b)$. Then for every finite IMQ combination $F\in F_{\mathrm{IMQ},\varepsilon}$,

\[|f(r\mathbf{u})-F(r\mathbf{u})|\;\xrightarrow[r\to\infty]{}\;a.\]

Hence for every $\eta\in(0,a)$ there exists $R_{\eta}$ such that $\sup_{r\geq R_{\eta}}|f(r\mathbf{u})-F(r\mathbf{u})|\geq a-\eta$. In particular, a single Yat atom carries an $O(1)$ directional signal on a sufficiently large exterior ray that no finite IMQ expansion can uniformly match there.

Proof of Corollary 5.

Proposition 2 gives $f(r\mathbf{u})\to a$ and $F(r\mathbf{u})\to 0$, hence $f(r\mathbf{u})-F(r\mathbf{u})\to a$. Given $\eta\in(0,a)$, choose $R_{\eta}$ so that for all $r\geq R_{\eta}$: $|f(r\mathbf{u})-a|<\eta/2$ and $|F(r\mathbf{u})|<\eta/2$. Then $|f(r\mathbf{u})-F(r\mathbf{u})|\geq a-\eta$, and the supremum over $r\geq R_{\eta}$ preserves the bound. ∎

Proposition 6 (Far-field gradient asymptotics).

Fix $\mathbf{w}\in\mathbb{R}^{d}\setminus\{\mathbf{0}\}$, $b\geq 0$, $\varepsilon>0$, and a unit direction $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$. Along the ray $\mathbf{x}=r\mathbf{u}$, the center-gradients of the IMQ section $h_{\varepsilon}(\mathbf{x},\mathbf{w})$ and the Yat section $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$ satisfy

\[\bigl\|\nabla_{\mathbf{w}}\,h_{\varepsilon}(r\mathbf{u},\mathbf{w})\bigr\|=\Theta(r^{-3}),\qquad\lim_{r\to\infty}\nabla_{\mathbf{w}}\,k_{b,\varepsilon}(\mathbf{w},r\mathbf{u})=2(\mathbf{u}^{\top}\mathbf{w})\,\mathbf{u}.\]
Remark 3.

The proposition records an asymptotic difference in the center-gradient of a single Yat atom versus a single IMQ atom at large $\|\mathbf{x}\|$, evaluated in isolation. It is a statement about the kernel itself, not about gradients of trained networks: in a finite expansion the per-atom gradients combine through the readout coefficients $\boldsymbol{\alpha}$, and far-field signal can be informative or uninformative depending on where the data actually live.

Proof of Proposition 6.

For the IMQ atom, $\nabla_{\mathbf{w}}h_{\varepsilon}=2(\mathbf{x}-\mathbf{w})/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{2}$. With $\mathbf{x}=r\mathbf{u}$, the numerator satisfies $\|r\mathbf{u}-\mathbf{w}\|=r+O(1)$ and the denominator $(r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon)^{2}=r^{4}+O(r^{3})$, so $\|\nabla_{\mathbf{w}}h_{\varepsilon}\|=2r^{-3}(1+O(r^{-1}))=\Theta(r^{-3})$.

For the Yat atom, set $N(\mathbf{x}):=(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$ and $D(\mathbf{x}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$. The quotient rule gives

\[\nabla_{\mathbf{w}}\,k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{2(\mathbf{w}^{\top}\mathbf{x}+b)\,\mathbf{x}}{D(\mathbf{x})}+\frac{2(\mathbf{w}^{\top}\mathbf{x}+b)^{2}(\mathbf{x}-\mathbf{w})}{D(\mathbf{x})^{2}}.\]

Substituting $\mathbf{x}=r\mathbf{u}$, the first term equals $2(r\mathbf{u}^{\top}\mathbf{w}+b)\,r\mathbf{u}/D(r\mathbf{u})$. The leading-order behaviour of numerator and denominator is $2r^{2}(\mathbf{u}^{\top}\mathbf{w})\mathbf{u}+O(r)$ over $r^{2}+O(r)$, giving the limit $2(\mathbf{u}^{\top}\mathbf{w})\mathbf{u}$. The second term has numerator $2(r\mathbf{u}^{\top}\mathbf{w}+b)^{2}(r\mathbf{u}-\mathbf{w})=O(r^{3})$ over denominator $D(r\mathbf{u})^{2}=\Theta(r^{4})$, giving $O(r^{-1})\to 0$. Summing yields the stated limit. ∎
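The quotient-rule gradient and its far-field limit can be cross-checked numerically. This is a sketch under assumed parameter values: `grad_fd` is a plain central-difference check introduced only for illustration, and none of the names come from the paper.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def yat(w, x, b, eps):
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return (dot(w, x) + b) ** 2 / den

def grad_yat_w(w, x, b, eps):
    """Closed-form center-gradient from the quotient rule."""
    s = dot(w, x) + b
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return [2 * s * xi / den + 2 * s * s * (xi - wi) / den ** 2
            for xi, wi in zip(x, w)]

def grad_fd(w, x, b, eps, step=1e-6):
    """Central finite differences in w, used only as a cross-check."""
    out = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += step
        wm[i] -= step
        out.append((yat(wp, x, b, eps) - yat(wm, x, b, eps)) / (2 * step))
    return out

w, b, eps = [1.0, -2.0, 0.5], 0.7, 0.1
u = [2 / 3, -1 / 3, 2 / 3]                    # unit direction with u.w = 5/3
limit = [2 * dot(u, w) * ui for ui in u]      # claimed limit 2 (u^T w) u
far = grad_yat_w(w, [1e6 * ui for ui in u], b, eps)
print(far, limit)
```

At $r=10^{6}$ the closed-form gradient and the claimed limit should agree to several digits, with a residual of order $1/r$.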

Proposition 7 (Width-complexity gap for quadratic ridge functions).

Let $d\geq 2$, $\mathbf{w}_{\star}\in\mathbb{S}^{d-1}$, $\varepsilon>0$, $R>0$, $\delta\in(0,R^{2})$, and $\mathcal{X}:=[-R,R]^{d}$. Set $f^{\star}(\mathbf{x}):=(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}$ and fix $\Lambda$ in the regime $\Lambda>\varepsilon(R^{2}-\delta)$, with $\rho:=\sqrt{\Lambda/(R^{2}-\delta)-\varepsilon}$. For $W,\Lambda>0$, write $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)=\{\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j}):m\in\mathbb{N},\ \sum_{j}|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\}$.

(i) Yat upper bound. The single Yat atom $g_{\alpha}(\mathbf{x}):=k_{0,\varepsilon}(\alpha\mathbf{w}_{\star},\mathbf{x})$ satisfies $\sup_{\mathbf{x}\in\mathcal{X}}|g_{\alpha}(\mathbf{x})-f^{\star}(\mathbf{x})|\leq\delta$ for every $\alpha\geq\alpha_{0}(d,R,\varepsilon,\delta)$, where $\alpha_{0}$ is an explicit polynomial in $d^{1/2},R,\varepsilon^{1/2},\delta^{-1/2}$.

(ii) IMQ lower bound. Every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $\sup_{\mathbf{x}\in\mathcal{X}}|F(\mathbf{x})-f^{\star}(\mathbf{x})|\leq\delta$ uses at least $m\geq(2R/\rho)^{d-1}\Gamma((d+1)/2)/\pi^{(d-1)/2}$ atoms; whenever $\Lambda$ stays bounded as $d\to\infty$, this is $\exp(\Omega(d\log d))$.

Proof of Proposition 7.

(i) Substituting $\mathbf{w}=\alpha\mathbf{w}_{\star}$,

\[g_{\alpha}(\mathbf{x})=\frac{\alpha^{2}(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{\|\mathbf{x}-\alpha\mathbf{w}_{\star}\|^{2}+\varepsilon}=\frac{(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{1+\eta(\mathbf{x},\alpha)},\qquad\eta(\mathbf{x},\alpha):=-\frac{2\mathbf{w}_{\star}^{\top}\mathbf{x}}{\alpha}+\frac{\|\mathbf{x}\|^{2}+\varepsilon}{\alpha^{2}}.\]

For $\mathbf{x}\in\mathcal{X}$, $|\mathbf{w}_{\star}^{\top}\mathbf{x}|\leq R\sqrt{d}$ and $\|\mathbf{x}\|^{2}+\varepsilon\leq dR^{2}+\varepsilon$. If $\alpha\geq\max\{8R\sqrt{d},\,2\sqrt{dR^{2}+\varepsilon}\}$, then $|\eta|\leq 1/2$, and $|1/(1+\eta)-1|\leq 2|\eta|$. Hence

\[|g_{\alpha}(\mathbf{x})-f^{\star}(\mathbf{x})|\leq(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\cdot 2|\eta|\leq dR^{2}\Bigl(\frac{4R\sqrt{d}}{\alpha}+\frac{2(dR^{2}+\varepsilon)}{\alpha^{2}}\Bigr).\]

Setting the right-hand side to $\delta$ and taking $\alpha$ to dominate both $8R^{3}d^{3/2}/\delta$ (first term, sufficient for $4R^{3}d^{3/2}/\alpha\leq\delta/2$) and $2\sqrt{dR^{2}(dR^{2}+\varepsilon)/\delta}$ (second term, sufficient for $2dR^{2}(dR^{2}+\varepsilon)/\alpha^{2}\leq\delta/2$) gives $\alpha_{0}$ as claimed.

(ii) Restrict attention to the inscribed Euclidean ball $\mathcal{B}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}\subset[-R,R]^{d}$. By rotational invariance of $\mathcal{B}$, take $\mathbf{w}_{\star}=\mathbf{e}_{1}$, and consider the affine slice $S^{\prime}:=\{R/2\}\times\mathcal{B}^{\prime}_{d-1}\subset\mathcal{B}$, where $\mathcal{B}^{\prime}_{d-1}\subset\mathbb{R}^{d-1}$ is the $(d-1)$-ball of radius $R\sqrt{3}/2$. On $S^{\prime}$, $f^{\star}\equiv R^{2}/4$, and the uniform approximation hypothesis (strengthened to $\delta<R^{2}/4$) gives $F(\mathbf{x})\geq R^{2}/4-\delta>0$ for every $\mathbf{x}\in S^{\prime}$. For any $F=\sum_{j}a_{j}h_{\varepsilon}(\cdot,\mathbf{w}_{j})\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$,

\[F(\mathbf{x})\leq|F(\mathbf{x})|\leq\sum_{j}|a_{j}|\,h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})\leq\Lambda\cdot\max_{j}h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})=\frac{\Lambda}{\min_{j}\|\mathbf{x}-\mathbf{w}_{j}\|^{2}+\varepsilon}.\]

Combining, $\min_{j}\|\mathbf{x}-\mathbf{w}_{j}\|^{2}\leq\Lambda/(R^{2}/4-\delta)-\varepsilon=\rho^{2}$ for every $\mathbf{x}\in S^{\prime}$, where $\rho$ is here evaluated with $R^{2}/4-\delta$ in place of $R^{2}-\delta$ to match the strengthened hypothesis. Equivalently, $S^{\prime}\subseteq\bigcup_{j=1}^{m}B_{d}(\mathbf{w}_{j},\rho)$ where $B_{d}$ denotes the closed Euclidean ball in $\mathbb{R}^{d}$. The cross-section $B_{d}(\mathbf{w}_{j},\rho)\cap S^{\prime}$ is contained in a $(d-1)$-dimensional ball of radius at most $\rho$, with $(d-1)$-volume bounded by $V_{d-1}(\rho):=\pi^{(d-1)/2}\rho^{d-1}/\Gamma((d+1)/2)$. The slice $S^{\prime}$ has $(d-1)$-volume $\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)$, so volume-counting gives

\[\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)=|S^{\prime}|_{d-1}\leq\sum_{j=1}^{m}|B_{d}(\mathbf{w}_{j},\rho)\cap S^{\prime}|_{d-1}\leq m\,V_{d-1}(\rho),\]

which rearranges to $m\geq(R\sqrt{3}/(2\rho))^{d-1}$: the Gamma factors cancel exactly between the slice and the ball volumes, so no Stirling approximation is needed at this step. Taking logs gives $\log m\geq(d-1)\log(R\sqrt{3}/(2\rho))$, which grows linearly in $d$ for any fixed $\rho<R\sqrt{3}/2$ and reaches $\Omega(d\log d)$ when $R/\rho$ grows polynomially in $d$. ∎
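Part (i) of the proof can be checked on a small instance. The sketch below assumes $d=3$, $R=1$, $\varepsilon=0.1$, $\delta=0.05$ and takes $\alpha_{0}$ as the maximum of the four thresholds appearing in the proof; sampling the cube (corners plus random points) only estimates the supremum, and all names are illustrative.

```python
import itertools
import math
import random

def alpha0(d, R, eps, delta):
    # Maximum of the four sufficient thresholds used in part (i).
    return max(8 * R * math.sqrt(d),
               2 * math.sqrt(d * R * R + eps),
               8 * R ** 3 * d ** 1.5 / delta,
               2 * math.sqrt(d * R * R * (d * R * R + eps) / delta))

def yat_unbiased(w, x, eps):
    # k_{0,eps}(w, x) = (w.x)^2 / (||x - w||^2 + eps)
    num = sum(wi * xi for wi, xi in zip(w, x)) ** 2
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return num / den

d, R, eps, delta = 3, 1.0, 0.1, 0.05
w_star = [1.0 / math.sqrt(d)] * d                  # unit-norm ridge direction
center = [alpha0(d, R, eps, delta) * wi for wi in w_star]

rng = random.Random(0)
samples = [list(c) for c in itertools.product((-R, R), repeat=d)]
samples += [[rng.uniform(-R, R) for _ in range(d)] for _ in range(2000)]

def target(x):
    return sum(wi * xi for wi, xi in zip(w_star, x)) ** 2

sup_err = max(abs(yat_unbiased(center, x, eps) - target(x)) for x in samples)
print(sup_err, delta)  # sampled sup error against the tolerance delta
```

The thresholds are sufficient rather than tight, so the sampled error is expected to sit comfortably below $\delta$.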

E.9 Additional proofs from Section 4

Lemma 3 (The unbiased section lies in the positive-bias span).

Fix $\mathbf{w}\in\mathbb{R}^{d}$ and $\varepsilon>0$. The map $b\mapsto g_{\varepsilon}(\cdot;\mathbf{w},b)$ is quadratic in $b$, so for every $h>0$

\[g_{\varepsilon}(\cdot;\mathbf{w},0)\;=\;3\,g_{\varepsilon}(\cdot;\mathbf{w},h)\;-\;3\,g_{\varepsilon}(\cdot;\mathbf{w},2h)\;+\;g_{\varepsilon}(\cdot;\mathbf{w},3h),\tag{9}\]

where every atom on the right has bias in $\{h,2h,3h\}\subset(0,\infty)$. In particular $g_{\varepsilon}(\cdot;\mathbf{w},0)\in F_{\text{ⵟ},\varepsilon}^{+}$.

Proof of Lemma 3.

$D_{\varepsilon}$ is independent of $b$, so $g_{\varepsilon}(\cdot;\mathbf{w},b)D_{\varepsilon}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is a univariate quadratic in $b$. Its values at the three nodes $\{h,2h,3h\}$ determine a quadratic uniquely, and the resulting extrapolation to $b=0$ has coefficients $(3,-3,1)$ on $(h,2h,3h)$ respectively: the third finite difference of a quadratic vanishes, so $f(0)=3f(h)-3f(2h)+f(3h)$ for any quadratic $f$. Dividing by $D_{\varepsilon}$ gives (9). ∎
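Identity (9) can be confirmed pointwise in a few lines. This is a minimal sketch; the point $\mathbf{x}$, center $\mathbf{w}$, step $h$ and the function name are assumptions chosen for illustration.

```python
def yat_atom(x, w, b, eps):
    # g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)
    num = (sum(xi * wi for xi, wi in zip(x, w)) + b) ** 2
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return num / den

x = [0.4, -1.3, 2.2]
w = [1.0, 0.5, -0.7]
eps, h = 0.2, 0.3

unbiased = yat_atom(x, w, 0.0, eps)
# Extrapolation from the positive biases h, 2h, 3h with weights (3, -3, 1).
extrapolated = (3 * yat_atom(x, w, h, eps)
                - 3 * yat_atom(x, w, 2 * h, eps)
                + yat_atom(x, w, 3 * h, eps))
print(unbiased - extrapolated)  # zero up to floating-point rounding
```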

Corollary 6 (Strict span containment).

$F_{\mathrm{IMQ},\varepsilon}\subsetneq F_{\text{ⵟ},\varepsilon}^{+}$ for every $\varepsilon>0$. Concretely, for every $\mathbf{w}\neq\mathbf{0}$ the unbiased section $g_{\varepsilon}(\cdot;\mathbf{w},0)$ lies in $F_{\text{ⵟ},\varepsilon}^{+}$ (Lemma 3) but not in $F_{\mathrm{IMQ},\varepsilon}$. This is the span-level companion to the RKHS-level strict containment of Corollary 2: the same separation holds at the level of finite linear spans of atoms, witnessed by the directional far-field trace rather than by Loewner domination.

Proof of Corollary 6.

Inclusion is Theorem 4. For the strict separation, fix a unit vector $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$. Along the ray $\mathbf{x}=r\mathbf{u}$, $g_{\varepsilon}(r\mathbf{u};\mathbf{w},0)=r^{2}(\mathbf{u}^{\top}\mathbf{w})^{2}/(r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon)\to(\mathbf{u}^{\top}\mathbf{w})^{2}\neq 0$, while every finite combination $F\in F_{\mathrm{IMQ},\varepsilon}$ obeys $|F(r\mathbf{u})|=O(r^{-2})\to 0$. The far-field limits disagree, so $g_{\varepsilon}(\cdot;\mathbf{w},0)\notin F_{\mathrm{IMQ},\varepsilon}$. By Lemma 3 it does lie in $F_{\text{ⵟ},\varepsilon}^{+}$, witnessing strictness. ∎

Proof of Corollary 3.

$F_{\mathrm{IMQ},\varepsilon}$ is dense in $C(\mathcal{X})$ by Micchelli et al. [2006, Thm. 17]. By Theorem 4, $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{ⵟ},\varepsilon}^{+}\subseteq F_{\text{ⵟ}}^{+}\subseteq F_{\text{ⵟ}}^{\geq 0}$, so each superset is dense. ∎

Proof of Corollary 4.

By Theorem 4, every IMQ atom equals $(2h^{2})^{-1}$ times a positive-bias second-order finite difference of three Yat atoms, identically in $\mathbf{x}$. Substituting term by term in $F$ produces a $3m$-atom Yat expansion $G$ with $G\equiv F$ pointwise; the quantitative rate transfer is then immediate. ∎
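The second-difference reduction can be illustrated numerically. The sketch assumes a forward stencil at biases $b$, $b+h$, $b+2h$ with $b>0$; the precise nodes used in Theorem 4 are stated in the main body and may differ, and the helper names are hypothetical.

```python
def yat_atom(x, w, b, eps):
    # g_eps(x; w, b) = (x.w + b)^2 / (||x - w||^2 + eps)
    num = (sum(xi * wi for xi, wi in zip(x, w)) + b) ** 2
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return num / den

def imq_atom(x, w, eps):
    return 1.0 / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def imq_from_yat(x, w, b, h, eps):
    """Second finite difference in the bias (assumed forward stencil), scaled by 1/(2 h^2)."""
    second_diff = (yat_atom(x, w, b, eps)
                   - 2 * yat_atom(x, w, b + h, eps)
                   + yat_atom(x, w, b + 2 * h, eps))
    return second_diff / (2 * h * h)

x = [0.4, -1.3, 2.2]
w = [1.0, 0.5, -0.7]
recovered = imq_from_yat(x, w, b=0.5, h=1.0, eps=0.2)
direct = imq_atom(x, w, eps=0.2)
print(recovered - direct)  # zero up to floating-point rounding
```

The identity holds because the numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is a monic quadratic in $b$, whose second difference with step $h$ equals $2h^{2}$ regardless of $\mathbf{x}$.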

E.10 Additional proofs from Section 5

Corollary 7 (Data-dependent empirical bound).

Under the assumptions of Theorem 6, the same proof yields the data-dependent bound

\[\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;\frac{B}{\sqrt{n\varepsilon}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}+b)^{2}},\tag{10}\]

computable from the sample alone and at most $B(R^{2}+b)/(\sqrt{n}\,\sqrt{\varepsilon})$ on any sample with $\max_{i}\|\mathbf{x}_{i}\|\leq R$.

Proof of Corollary 7.

The proof of Theorem 6 bounds $\widehat{\mathrm{Rad}}_{n}$ by $(B/\sqrt{n})\sqrt{(1/n)\sum_{i}k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})}$ before applying the worst-case bound on the diagonal. Using the Yat-kernel diagonal $k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})=(\|\mathbf{x}_{i}\|^{2}+b)^{2}/\varepsilon$ directly gives (10). ∎
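Bound (10) is directly computable from a sample. The following sketch evaluates it next to the worst-case bound on a tiny hand-picked sample; the function names and the sample are illustrative assumptions.

```python
import math

def empirical_rademacher_bound(xs, B, b, eps):
    """Right-hand side of (10): B / sqrt(n eps) * sqrt(mean (||x_i||^2 + b)^2)."""
    n = len(xs)
    mean_sq = sum((sum(v * v for v in x) + b) ** 2 for x in xs) / n
    return B / math.sqrt(n * eps) * math.sqrt(mean_sq)

def worst_case_bound(n, B, b, eps, R):
    """Worst-case value B (R^2 + b) / (sqrt(n) sqrt(eps)) for samples in the R-ball."""
    return B * (R * R + b) / (math.sqrt(n) * math.sqrt(eps))

xs = [[1.0, 0.0], [0.0, 0.5], [0.6, 0.8]]     # all points satisfy ||x|| <= 1
B, b, eps, R = 2.0, 0.5, 0.25, 1.0
emp = empirical_rademacher_bound(xs, B, b, eps)
worst = worst_case_bound(len(xs), B, b, eps, R)
print(emp, worst)  # the data-dependent bound never exceeds the worst case
```

The gap between the two values reflects how far the sample norms sit below the radius $R$.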

Appendix F Asymptotic Filtration and Native-Space Rate Transfer

This appendix records two structural results of the unbiased and biased Yat spans that extend the main-body theory: a three-step asymptotic filtration of the biased span detected by the directional-trace operator and its higher-order analogues, and a native-space rate-transfer statement coupled with a coefficient-mass lower bound.

Three-step asymptotic filtration of the biased span.

The biased Yat span $\mathcal{F}_{\text{ⵟ}}=\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},b\in\mathbb{R}\}$ admits a three-step filtration

\[\mathcal{F}_{\text{ⵟ}}\supseteq\ker T_{\infty}\supseteq\mathcal{K}_{1}\supseteq\mathcal{K}_{2}\supseteq\mathcal{K}_{3},\]

detected by linear asymptotic-trace operators $T_{\infty}$, $T_{1}$, $T_{2}$ acting at orders $r^{0}$, $r^{-1}$, $r^{-2}$ on radial rays. The successive quotients are canonically isomorphic to the spaces of restrictions of homogeneous quadratic, linear, and constant forms on $\mathbb{S}^{d-1}$, decomposing further into spherical-harmonic sectors $\mathcal{H}_{0}\oplus\mathcal{H}_{2}$, $\mathcal{H}_{1}$, $\mathcal{H}_{0}$. This sharpens the directional-trace separation of Proposition 2 from a single-witness statement (Yat atoms have nonzero quadratic shadow, IMQ atoms have zero shadow) into a full filtration whose quotients refine into spherical-harmonic sectors. The first two non-trivial levels are strictly nested: $\ker T_{\infty}\setminus\mathcal{K}_{1}\neq\emptyset$ via an explicit four-atom construction, and $\mathcal{K}_{1}=\mathcal{L}\oplus\mathcal{F}_{\mathrm{IMQ}}$ as a direct sum (the linear-alignment span $\mathcal{L}$ meets the IMQ span only in the zero function). The intrinsic characterisation of $\mathcal{K}_{3}=\ker(T_{2}|_{\mathcal{F}_{\mathrm{IMQ}}})$ and whether $\mathcal{K}_{2}\supsetneq\mathcal{K}_{3}$ are open.

Native-space rate transfer and coefficient-mass lower bound.

The exact bias-finite-difference reduction of Theorem 4 carries IMQ scattered-data approximation theory into $\mathcal{F}_{\text{ⵟ}}$ at a factor-$3$ overhead in atom count: every finite IMQ approximant on a compact set induces a biased-span approximant with identical pointwise error. The IMQ kernel has an analytic native space, so the scattered-data theory of Schaback [1995], Wendland [2004] delivers fill-distance approximation rates that are exponential in $1/h_{\Xi}$ for native-space targets, in contrast to the polynomial $h_{\Xi}^{k}$ rates of $C^{k}$-Sobolev kernels; under the bias finite-difference reduction these exponential rates transfer to $\mathcal{F}_{\text{ⵟ}}$ with at most a factor-$3$ overhead. In the opposite direction, the codimension-one zero set $k_{\text{ⵟ}}(\cdot,\mathbf{w})|_{\mathbf{w}^{\perp}}=0$ produces approximation lower bounds: on $\mathbb{S}^{d-1}$, fewer than $d$ unbiased atoms cannot approximate the constant function to error below $1$, and the total weighted coefficient cost is bounded below by $\Omega(\varepsilon d/N)$.

Appendix G Far-Field Separations

We write

\[k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\quad\varepsilon>0.\]

For fixed shared $(b,\varepsilon)$, let $\mathcal{H}_{b,\varepsilon}$ denote the RKHS of $k_{b,\varepsilon}$. For $R>0$, write

\[B_{R}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\},\qquad\mathbb{S}^{d-1}:=\{\mathbf{u}\in\mathbb{R}^{d}:\|\mathbf{u}\|=1\}.\]

This appendix collects the radial-vs-Yat far-field separations: an asymptotic exterior-shell bound (both $L^{\infty}$ and $L^{2}$ versions) and a polynomial-separation gap on line-containing domains. The multiclass-generalization machinery, prefix-pullback theorem, and Loewner-comparison theorems previously bundled here have been split into separate appendices (Appendices H and I) for clarity.

G.1 Asymptotic Exterior-Shell Separation

We complement the bounded-domain gap of Section 2 with asymptotic $L^{\infty}$ and $L^{2}$ separations on exterior shells $A_{R}:=\{r\mathbf{u}:R\leq r\leq 2R,\ \mathbf{u}\in\mathbb{S}^{d-1}\}$ for $R\gg W$. For $W,\Lambda>0$, write $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ and $\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$ for the bounded-variation classes $\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})$ and $\sum_{j=1}^{m}a_{j}\,e^{-\gamma\|\cdot-\mathbf{v}_{j}\|^{2}}$ with $\sum|a_{j}|\leq\Lambda$ and $\|\mathbf{v}_{j}\|\leq W$.

Lemma 4 (Uniform radial decay on exterior shells).

Let $R>W$. For $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$, $\sup_{A_{R}}|F|\leq\Lambda/((R-W)^{2}+\varepsilon)$; for $F\in\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$, $\sup_{A_{R}}|F|\leq\Lambda e^{-\gamma(R-W)^{2}}$.

Proof.

For $\mathbf{x}\in A_{R}$ and $\|\mathbf{v}_{j}\|\leq W$, the reverse triangle inequality gives $\|\mathbf{x}-\mathbf{v}_{j}\|\geq\|\mathbf{x}\|-\|\mathbf{v}_{j}\|\geq R-W$, so $(\|\mathbf{x}-\mathbf{v}_{j}\|^{2}+\varepsilon)^{-1}\leq((R-W)^{2}+\varepsilon)^{-1}$ and $e^{-\gamma\|\mathbf{x}-\mathbf{v}_{j}\|^{2}}\leq e^{-\gamma(R-W)^{2}}$. Summing against $|a_{j}|$ and using $\sum_{j}|a_{j}|\leq\Lambda$ gives both bounds. ∎

Lemma 5 (Quantitative Yat far-field approximation).

Fix $\mathbf{w}_{\star}\in\mathbb{R}^{d}$, $b\geq 0$, and $\varepsilon>0$. Let $g_{\star}(\mathbf{x})=k_{b,\varepsilon}(\mathbf{w}_{\star},\mathbf{x})$ and $q(\mathbf{u}):=(\mathbf{u}^{\top}\mathbf{w}_{\star})^{2}$. For every $R>\|\mathbf{w}_{\star}\|$, every $r\in[R,2R]$, and every $\mathbf{u}\in\mathbb{S}^{d-1}$,

\[\left|g_{\star}(r\mathbf{u})-q(\mathbf{u})\right|\leq\eta_{R},\qquad\eta_{R}:=\frac{4R\|\mathbf{w}_{\star}\|(b+\|\mathbf{w}_{\star}\|^{2})+b^{2}+\|\mathbf{w}_{\star}\|^{2}(\|\mathbf{w}_{\star}\|^{2}+\varepsilon)}{(R-\|\mathbf{w}_{\star}\|)^{2}+\varepsilon}.\]

In particular, $\eta_{R}=O(R^{-1})$ as $R\to\infty$.

Proof.

Let $W_{\star}:=\|\mathbf{w}_{\star}\|$, $a:=\mathbf{u}^{\top}\mathbf{w}_{\star}$, so $|a|\leq W_{\star}$. Then $g_{\star}(r\mathbf{u})=(ra+b)^{2}/(r^{2}-2ra+W_{\star}^{2}+\varepsilon)$ and $q(\mathbf{u})=a^{2}$. The numerator of $g_{\star}(r\mathbf{u})-a^{2}$ equals $(ra+b)^{2}-a^{2}(r^{2}-2ra+W_{\star}^{2}+\varepsilon)=2ra(b+a^{2})+b^{2}-a^{2}(W_{\star}^{2}+\varepsilon)$, whose absolute value is at most $2rW_{\star}(b+W_{\star}^{2})+b^{2}+W_{\star}^{2}(W_{\star}^{2}+\varepsilon)\leq 4RW_{\star}(b+W_{\star}^{2})+b^{2}+W_{\star}^{2}(W_{\star}^{2}+\varepsilon)$. The denominator satisfies $\|r\mathbf{u}-\mathbf{w}_{\star}\|^{2}+\varepsilon\geq(R-W_{\star})^{2}+\varepsilon$. Combining gives $\eta_{R}$. ∎
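The bound $\eta_{R}$ can be compared against the actual deviation on randomly sampled shell points. The sketch uses assumed values for $\mathbf{w}_{\star}$, $b$, $\varepsilon$, $R$; sampling only probes the inequality and the helper names are illustrative.

```python
import math
import random

def yat(w, x, b, eps):
    num = (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2
    den = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps
    return num / den

def eta(R, w_norm, b, eps):
    """eta_R from Lemma 5, written in terms of ||w_star||."""
    num = (4 * R * w_norm * (b + w_norm ** 2) + b ** 2
           + w_norm ** 2 * (w_norm ** 2 + eps))
    return num / ((R - w_norm) ** 2 + eps)

def random_unit(rng, d):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

w_star, b, eps, R = [1.0, -2.0, 0.5], 0.7, 0.1, 10.0
w_norm = math.sqrt(sum(c * c for c in w_star))

rng = random.Random(0)
worst_gap = 0.0
for _ in range(2000):
    u = random_unit(rng, len(w_star))
    r = rng.uniform(R, 2 * R)
    q = sum(ui * wi for ui, wi in zip(u, w_star)) ** 2
    gap = abs(yat(w_star, [r * ui for ui in u], b, eps) - q)
    worst_gap = max(worst_gap, gap)

print(worst_gap, eta(R, w_norm, b, eps))  # sampled deviation vs. the bound eta_R
```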

Theorem 10 (Far-field separation: LL^{\infty} and L2L^{2}).

Fix $\mathbf{w}_{\star}\neq\mathbf{0}$, $b\geq 0$, $\varepsilon>0$, $g_{\star}=k_{b,\varepsilon}(\mathbf{w}_{\star},\cdot)$, $W\geq\|\mathbf{w}_{\star}\|$, $\Lambda>0$, $R>W$, and let $\eta_{R}$ be as in Lemma 5.

(i) $L^{\infty}$-bound on the exterior shell.

\[\inf_{F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)}\|g_{\star}-F\|_{L^{\infty}(A_{R})}\geq\bigl(\|\mathbf{w}_{\star}\|^{2}-\eta_{R}-\tfrac{\Lambda}{(R-W)^{2}+\varepsilon}\bigr)_{+},\]

with the analogous bound over $\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$ replacing the IMQ decay term by $\Lambda e^{-\gamma(R-W)^{2}}$.

(ii) $L^{2}$ risk lower bound under directional sampling. Let $\mathbf{x}=r\mathbf{u}$, where $\mathbf{u}\sim\nu$ for a probability distribution $\nu$ on $\mathbb{S}^{d-1}$ and $r\in[R,2R]$ is drawn independently of $\mathbf{u}$. Let $q(\mathbf{u}):=(\mathbf{u}^{\top}\mathbf{w}_{\star})^{2}$ and $c_{\nu}(\mathbf{w}_{\star}):=\mathbb{E}[q(\mathbf{u})^{2}]>0$. Then every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ satisfies

\[\mathbb{E}\bigl[(g_{\star}(\mathbf{x})-F(\mathbf{x}))^{2}\bigr]\geq\Bigl[\bigl(\sqrt{c_{\nu}(\mathbf{w}_{\star})}-\eta_{R}-\tfrac{\Lambda}{(R-W)^{2}+\varepsilon}\bigr)_{+}\Bigr]^{2},\]

with the analogous RBF bound. A single Yat atom realises $g_{\star}$ at zero error in either norm.

Proof.

(i) Take $\mathbf{u}_{\star}:=\mathbf{w}_{\star}/\|\mathbf{w}_{\star}\|$. For $r\in[R,2R]$, $g_{\star}(r\mathbf{u}_{\star})\geq\|\mathbf{w}_{\star}\|^{2}-\eta_{R}$ by Lemma 5 and $|F(r\mathbf{u}_{\star})|\leq\Lambda/((R-W)^{2}+\varepsilon)$ by Lemma 4. The triangle inequality and supremum over $A_{R}$ give the IMQ bound; the RBF case is identical. (ii) $\|g_{\star}-q\|_{L^{2}}\leq\eta_{R}$ (Lemma 5 transferred $L^{\infty}\to L^{2}$), so the reverse triangle inequality gives $\|g_{\star}\|_{L^{2}}\geq\sqrt{c_{\nu}(\mathbf{w}_{\star})}-\eta_{R}$. $\|F\|_{L^{2}}\leq\Lambda/((R-W)^{2}+\varepsilon)$ from Lemma 4. A second reverse-triangle step, positive part, and squaring give the IMQ bound; the RBF case is identical. ∎
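Part (i) can be probed on a randomly drawn member of $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$. The sketch below uses assumed values for all parameters and a hypothetical random construction of centers and coefficients; it checks the pointwise inequality along the ray $r\mathbf{u}_{\star}$ used in the proof.

```python
import math
import random

def yat(w, x, b, eps):
    num = (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2
    return num / (sum((xi - wi) ** 2 for xi, wi in zip(x, w)) + eps)

def imq_sum(x, coeffs, centers, eps):
    # F(x) = sum_j a_j / (||x - v_j||^2 + eps)
    return sum(a / (sum((xi - vi) ** 2 for xi, vi in zip(x, v)) + eps)
               for a, v in zip(coeffs, centers))

def linf_lower_bound(w_norm, b, eps, lam, W, R):
    """(||w||^2 - eta_R - Lambda / ((R - W)^2 + eps))_+ from part (i)."""
    eta_R = ((4 * R * w_norm * (b + w_norm ** 2) + b ** 2
              + w_norm ** 2 * (w_norm ** 2 + eps))
             / ((R - w_norm) ** 2 + eps))
    return max(0.0, w_norm ** 2 - eta_R - lam / ((R - W) ** 2 + eps))

w_star, b, eps = [1.0, -2.0, 0.5], 0.7, 0.1
lam, W, R, m = 10.0, 3.0, 50.0, 5
w_norm = math.sqrt(sum(c * c for c in w_star))
u_star = [c / w_norm for c in w_star]

rng = random.Random(0)
centers = []
for _ in range(m):                       # centers with ||v_j|| <= W
    v = [rng.gauss(0.0, 1.0) for _ in w_star]
    scale = W * rng.random() / math.sqrt(sum(c * c for c in v))
    centers.append([scale * c for c in v])
raw = [rng.uniform(-1.0, 1.0) for _ in range(m)]
coeffs = [lam * a / sum(abs(t) for t in raw) for a in raw]   # sum |a_j| = Lambda

bound = linf_lower_bound(w_norm, b, eps, lam, W, R)
gap_min = min(abs(yat(w_star, [r * c for c in u_star], b, eps)
                  - imq_sum([r * c for c in u_star], coeffs, centers, eps))
              for r in (R, 1.5 * R, 2 * R))
print(gap_min, bound)  # observed gap on the ray vs. the lower bound
```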

Remark 4.

Both bounds require $\|\mathbf{v}_{j}\|\leq W$ and $\sum|a_{j}|\leq\Lambda$; lifting either restriction allows radial methods to place centers in the shell or use coefficients growing with the radius, and the bounds become vacuous.

G.2 Polynomial Separation

Theorem 11 (Polynomial approximation gap on line-containing domains).

Let $\mathbf{w}_{\star}\neq\mathbf{0}$, $b\geq 0$, $\varepsilon>0$, and $g_{\star}(\mathbf{x})=k_{b,\varepsilon}(\mathbf{w}_{\star},\mathbf{x})$. Let $\mathcal{P}_{2}=\{\mathbf{x}\mapsto\mathbf{x}^{\top}\mathbf{A}\mathbf{x}+\mathbf{a}^{\top}\mathbf{x}+c:\mathbf{A}=\mathbf{A}^{\top}\}$ be the space of degree-at-most-two polynomials. Suppose $X\subset\mathbb{R}^{d}$ contains a nontrivial line segment $\{t\mathbf{v}:t\in I\}$ with $\mathbf{v}:=\mathbf{w}_{\star}/\|\mathbf{w}_{\star}\|$ and $I\subset\mathbb{R}$ a compact interval with nonempty interior. Then

\[\inf_{P\in\mathcal{P}_{2}}\|g_{\star}-P\|_{L^{\infty}(X)}>0.\]
Proof.

Let $\rho:=\|\mathbf{w}_{\star}\|>0$. Along $\mathbf{x}=t\mathbf{v}$, $h(t):=g_{\star}(t\mathbf{v})=(\rho t+b)^{2}/((t-\rho)^{2}+\varepsilon)$. Restrictions of $\mathcal{P}_{2}$ to this line are univariate polynomials of degree at most two. Suppose $h(t)=p(t)$ for all $t\in I$ with $p\in\Pi_{2}$. Both sides are real-analytic on $\mathbb{R}$ and agree on an interval with nonempty interior, so the identity extends globally: $(\rho t+b)^{2}=p(t)((t-\rho)^{2}+\varepsilon)$. If $p\neq 0$ has degree $k$, the right side has degree $k+2$ (since $(t-\rho)^{2}+\varepsilon$ is monic quadratic), while the left has degree two, so $k=0$, i.e. $p=C$ is constant. Then $(\rho t+b)^{2}=C((t-\rho)^{2}+\varepsilon)$. If $C=0$: impossible since $\rho>0$. If $C\neq 0$: the left side has the real double root $-b/\rho$ while $(t-\rho)^{2}+\varepsilon$ has nonreal roots $\rho\pm i\sqrt{\varepsilon}$; contradiction. Hence $h\notin\Pi_{2}$, and since $\Pi_{2}$ is finite-dimensional and therefore closed in $C(I)$, its distance from $h$ is strictly positive. ∎
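The gap can also be made quantitative by a different, elementary route that is not the argument used in the proof: the third finite difference of any quadratic vanishes, so for four equispaced nodes in $I$ and any $p\in\Pi_{2}$ one has $|\Delta^{3}h|=|\Delta^{3}(h-p)|\leq 8\,\|h-p\|_{L^{\infty}(I)}$. The sketch below evaluates the resulting lower bound $|\Delta^{3}h|/8$ for assumed parameter values; the function names are illustrative.

```python
def ray_profile(t, rho, b, eps):
    # h(t) = g_star(t v) along v = w_star / ||w_star||, with rho = ||w_star||
    return (rho * t + b) ** 2 / ((t - rho) ** 2 + eps)

def third_difference_gap(t0, step, rho, b, eps):
    """|third difference of h| / 8: a lower bound on the sup-norm distance
    from h to quadratics on any interval containing the four nodes."""
    vals = [ray_profile(t0 + k * step, rho, b, eps) for k in range(4)]
    third = vals[3] - 3 * vals[2] + 3 * vals[1] - vals[0]
    return abs(third) / 8

gap = third_difference_gap(t0=0.0, step=0.5, rho=1.0, b=0.5, eps=0.1)
print(gap)  # strictly positive, so no quadratic matches h on [0, 1.5]
```

The constant $8$ is the sum of the absolute stencil weights $1+3+3+1$; the bound is generally not tight but is cheap to evaluate.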

Appendix H Learned-Norm Multiclass Generalization

Let $X\subseteq B_{R}$ and $\kappa:=\sup_{\mathbf{x}\in X}\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\leq(R^{2}+b)/\sqrt{\varepsilon}$. For a $C$-class predictor $\mathbf{f}=(f_{1},\ldots,f_{C})$ with each $f_{c}\in\mathcal{H}_{b,\varepsilon}$, define $\|\mathbf{f}\|_{\mathcal{H}^{C}}^{2}:=\sum_{c=1}^{C}\|f_{c}\|_{\mathcal{H}_{b,\varepsilon}}^{2}$. For a finite Yat head $f_{c}(\mathbf{x})=\sum_{j}\alpha_{j,c}k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{x})$ with Gram matrix $\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$, the reproducing property gives $\|\mathbf{f}\|_{\mathcal{H}^{C}}^{2}=\sum_{c=1}^{C}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$.

Define the ramp loss $\phi(t):=\min\{1,\max\{0,1-t\}\}$, the pairwise margin surrogate $\ell_{\gamma}(\mathbf{f};(\mathbf{x},y)):=\sum_{c\neq y}\phi((f_{y}(\mathbf{x})-f_{c}(\mathbf{x}))/\gamma)$, and the empirical surrogate $\widehat{L}_{\gamma}(\mathbf{f}):=(1/n)\sum_{i}\ell_{\gamma}(\mathbf{f};(\mathbf{x}_{i},y_{i}))$.

Lemma 6 (Rademacher bound for pairwise margin loss).

Let $\mathcal{F}_{B}:=\{\mathbf{f}:\|\mathbf{f}\|_{\mathcal{H}^{C}}\leq B\}$. Then

\[\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{B})\leq\frac{\sqrt{2}\,C(C-1)B\kappa}{\gamma\sqrt{n}}.\]
Proof.

By subadditivity, $\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{B})\leq\sum_{a\neq c}\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{a,c})$. The ramp loss is $(1/\gamma)$-Lipschitz, so by contraction, $\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{a,c})\leq(1/\gamma)\widehat{\mathrm{Rad}}_{n}(\mathcal{G}_{a,c})$ where $\mathcal{G}_{a,c}$ restricts to $(f_{a}-f_{c})$ on the class-$a$ support. Setting $h_{i}:=\mathbf{1}\{y_{i}=a\}k_{b,\varepsilon}(\mathbf{x}_{i},\cdot)\in\mathcal{H}_{b,\varepsilon}$ and applying Cauchy–Schwarz in the product Hilbert space gives $\widehat{\mathrm{Rad}}_{n}(\mathcal{G}_{a,c})\leq(\sqrt{2}B/n)\mathbb{E}_{\sigma}\|\sum_{i}\sigma_{i}h_{i}\|_{\mathcal{H}}$. By Jensen's inequality and the reproducing property, $\mathbb{E}_{\sigma}\|\sum_{i}\sigma_{i}h_{i}\|_{\mathcal{H}}\leq(\sum_{i}\|h_{i}\|_{\mathcal{H}}^{2})^{1/2}\leq\kappa\sqrt{n}$. Summing over the $C(C-1)$ ordered pairs gives the result. ∎

Theorem 12 (Learned-norm multiclass generalization bound).

With probability at least 1δ1-\delta over i.i.d. samples {(𝐱i,yi)}i=1n\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n} from X×{1,,C}X\times\{1,\ldots,C\}, every 𝐟B\mathbf{f}\in\mathcal{F}_{B} satisfies

\[ \mathbb{P}\{\arg\max_{c}f_{c}(\mathbf{x})\neq y\}\leq\widehat{L}_{\gamma}(\mathbf{f})+\frac{2\sqrt{2}\,C(C-1)B\kappa}{\gamma\sqrt{n}}+3(C-1)\sqrt{\frac{\log(2/\delta)}{2n}}. \]

For a finite shared-(b,ε)(b,\varepsilon) Yat head, B2=c=1C𝛂c𝐊𝛂cB^{2}=\sum_{c=1}^{C}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}.

Proof.

The zero-one loss is dominated by γ\ell_{\gamma}. A standard Rademacher generalization inequality for [0,C1][0,C-1]-valued losses, combined with Lemma 6, gives the result. The norm identity follows from the reproducing property. ∎
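To get a feel for the scale of the two correction terms in Theorem 12, the right-hand side can be evaluated for given constants. The sketch below simply transcribes the displayed bound; the function name and any numerical inputs are illustrative assumptions, not values from the paper.

```python
import math

def theorem12_bound(emp_loss, B, kappa, gamma, n, C, delta):
    """Right-hand side of the fixed-radius bound for a norm ball of radius B."""
    complexity = 2.0 * math.sqrt(2.0) * C * (C - 1) * B * kappa / (gamma * math.sqrt(n))
    confidence = 3.0 * (C - 1) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return emp_loss + complexity + confidence
```

Both correction terms decay as $n^{-1/2}$; the complexity term scales quadratically in the number of classes through $C(C-1)$.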

Remark 5.

The actual learned RKHS norm c𝛂c𝐊𝛂c\sum_{c}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}, not a post-hoc radius, controls the margin class.

Theorem 13 (Data-dependent learned-norm bound by peeling).

Fix B0>0B_{0}>0. With probability at least 1δ1-\delta, every finite shared-(b,ε)(b,\varepsilon) Yat head 𝐟\mathbf{f} satisfies

\[ \mathbb{P}\{\arg\max_{c}f_{c}(\mathbf{x})\neq y\}\leq\widehat{L}_{\gamma}(\mathbf{f})+\frac{4\sqrt{2}\,C(C-1)\,\kappa\max\{B(\mathbf{f}),B_{0}\}}{\gamma\sqrt{n}}+3(C-1)\sqrt{\frac{\log\bigl(\pi^{2}(s(\mathbf{f})+1)^{2}/(3\delta)\bigr)}{2n}}, \]

where B(𝐟):=𝐟CB(\mathbf{f}):=\|\mathbf{f}\|_{\mathcal{H}^{C}} and s(𝐟):=log2(max{B(𝐟),B0}/B0)s(\mathbf{f}):=\lceil\log_{2}(\max\{B(\mathbf{f}),B_{0}\}/B_{0})\rceil. For a finite Yat head, B(𝐟)2=c𝛂c𝐊𝛂cB(\mathbf{f})^{2}=\sum_{c}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}.

Proof.

Standard peeling argument: for s=0,1,2,s=0,1,2,\ldots set Bs:=2sB0B_{s}:=2^{s}B_{0} and δs:=6δ/(π2(s+1)2)\delta_{s}:=6\delta/(\pi^{2}(s+1)^{2}); note sδs=δ\sum_{s}\delta_{s}=\delta. Apply Theorem 12 to Bs\mathcal{F}_{B_{s}} with confidence δs\delta_{s}. For any 𝐟\mathbf{f} with s=s(𝐟)s=s(\mathbf{f}), one has B(𝐟)Bs2max{B(𝐟),B0}B(\mathbf{f})\leq B_{s}\leq 2\max\{B(\mathbf{f}),B_{0}\}, giving the data-dependent bound after substituting δs\delta_{s}. ∎
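The peeling schedule in the proof is easy to tabulate. The sketch below computes the shell index $s(\mathbf{f})$, the shell radius $B_{s}$, and the confidence budget $\delta_{s}$ for a given learned norm; the function names are illustrative.

```python
import math

def peeling_index(B_f, B0):
    """s(f) = ceil(log2(max{B(f), B0} / B0))."""
    return math.ceil(math.log2(max(B_f, B0) / B0))

def peeling_level(B_f, B0, delta):
    """Return (s, B_s, delta_s) with B_s = 2^s B0 and delta_s = 6 delta / (pi^2 (s+1)^2)."""
    s = peeling_index(B_f, B0)
    B_s = (2 ** s) * B0
    delta_s = 6.0 * delta / (math.pi ** 2 * (s + 1) ** 2)
    return s, B_s, delta_s
```

Because $\sum_{s\geq 0}(s+1)^{-2}=\pi^{2}/6$, the budgets $\delta_{s}$ sum to $\delta$, and the shell radius never exceeds twice $\max\{B(\mathbf{f}),B_{0}\}$, which is the source of the factor $4\sqrt{2}$ in place of $2\sqrt{2}$.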

Corollary 8 (Regularized training controls the learned norm).

If training returns 𝐟^\widehat{\mathbf{f}} satisfying L^γ(𝐟^)+λ𝐟^C2(C1)\widehat{L}_{\gamma}(\widehat{\mathbf{f}})+\lambda\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq(C-1), then 𝐟^C2(C1)/λ\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq(C-1)/\lambda and in particular c𝛂^c𝐊𝛂^c(C1)/λ\sum_{c}\widehat{\boldsymbol{\alpha}}_{c}^{\top}\mathbf{K}\widehat{\boldsymbol{\alpha}}_{c}\leq(C-1)/\lambda.

Proof.

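The zero predictor has $\ell_{\gamma}(0;(\mathbf{x},y))=C-1$ for every $(\mathbf{x},y)$, since each of the $C-1$ terms equals $\phi(0)=1$; hence $\widehat{L}_{\gamma}(0)=C-1$, and the hypothesis is met by any training procedure that does at least as well as the zero predictor on the regularized objective. Since $\widehat{L}_{\gamma}(\widehat{\mathbf{f}})\geq 0$ and $\lambda>0$, the assumed inequality gives $\lambda\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq C-1$. ∎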

Appendix I Pullback and Loewner-Comparison Theorems

I.1 Layerwise Composition: Pullback RKHS Structure

This section proves a prefix-conditioned pullback statement: once the prefix of a trained network is fixed, the next Yat layer induces a valid RKHS on the original input space.

Let Xd0X\subset\mathbb{R}^{d_{0}} be compact. For a depth-LL Yat stack, write

\[ \Phi_{0}(\mathbf{x})=\mathbf{x},\qquad\Phi_{\ell}(\mathbf{x})=T_{\ell}(\Phi_{\ell-1}(\mathbf{x})), \]

where layer \ell has coordinates [T(𝐳)]c=j=1ma,j,ckb,ε(𝐰,j,𝐳)[T_{\ell}(\mathbf{z})]_{c}=\sum_{j=1}^{m_{\ell}}a_{\ell,j,c}k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{w}_{\ell,j},\mathbf{z}). Write k(𝐳,𝐳):=kb,ε(𝐳,𝐳)k_{\ell}(\mathbf{z},\mathbf{z}^{\prime}):=k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{z},\mathbf{z}^{\prime}) and [𝐊]ij:=k(𝐰,i,𝐰,j)[\mathbf{K}_{\ell}]_{ij}:=k_{\ell}(\mathbf{w}_{\ell,i},\mathbf{w}_{\ell,j}).

When learned centers 𝐰,j\mathbf{w}_{\ell,j} do not lie in the prefix range Φ1(X)\Phi_{\ell-1}(X), we view each layer coordinate as the restriction to Φ1(X)\Phi_{\ell-1}(X) of a function in the global RKHS k\mathcal{H}_{k_{\ell}} on d1\mathbb{R}^{d_{\ell-1}}. Equivalently, define the extended domain D:=Φ1(X){𝐰,1,,𝐰,m}D_{\ell}:=\Phi_{\ell-1}(X)\cup\{\mathbf{w}_{\ell,1},\ldots,\mathbf{w}_{\ell,m_{\ell}}\}, apply the RKHS construction on DD_{\ell}, and restrict back to Φ1(X)\Phi_{\ell-1}(X); the restriction RKHS carries the quotient norm. The norm bound below holds in either view.

Theorem 14 (Prefix-conditioned pullback RKHS: PSD, quotient norm, universality).

Let kk be a continuous PSD kernel on ZZ and Φ:XZ\Phi:X\to Z a measurable map. Define k~(𝐱,𝐱):=k(Φ(𝐱),Φ(𝐱))\widetilde{k}(\mathbf{x},\mathbf{x}^{\prime}):=k(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime})) and 𝒮:=span¯{k(Φ(𝐱),):𝐱X}k\mathcal{S}:=\overline{\mathrm{span}}\{k(\Phi(\mathbf{x}),\cdot):\mathbf{x}\in X\}\subset\mathcal{H}_{k}.

(a) PSD and norm bound. k~\widetilde{k} is PSD on XX. For every fkf\in\mathcal{H}_{k}, fΦk~f\circ\Phi\in\mathcal{H}_{\widetilde{k}} with fΦk~fk\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}\leq\|f\|_{\mathcal{H}_{k}}.

(b) Exact quotient-norm characterisation. fΦk~=inf{gk:gk,g|Φ(X)=f|Φ(X)}=P𝒮fk,\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}=\inf\{\|g\|_{\mathcal{H}_{k}}:g\in\mathcal{H}_{k},\;g|_{\Phi(X)}=f|_{\Phi(X)}\}=\|P_{\mathcal{S}}f\|_{\mathcal{H}_{k}}, where P𝒮P_{\mathcal{S}} is orthogonal projection onto 𝒮\mathcal{S}; the infimum is attained at P𝒮fP_{\mathcal{S}}f.

(c) Universality propagates through injective continuous prefixes. If XX is compact, Φ\Phi is continuous and injective, and kk is universal on Φ(X)\Phi(X), then k~\widetilde{k} is universal on XX.

Application to layer-\ell Yat. Specialising to k=kb,εk=k_{b_{\ell},\varepsilon_{\ell}}, Φ=Φ1\Phi=\Phi_{\ell-1}, k~=k~\widetilde{k}=\widetilde{k}_{\ell}, part (a) gives the per-layer norm bound [TΦ1]ck~2𝐚,c𝐊𝐚,c\|[T_{\ell}\circ\Phi_{\ell-1}]_{c}\|_{\mathcal{H}_{\widetilde{k}_{\ell}}}^{2}\leq\mathbf{a}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{a}_{\ell,c}, with equality (via (b)) iff f,c𝒮:=span¯{k(Φ1(𝐱),):𝐱X}f_{\ell,c}\in\mathcal{S}_{\ell}:=\overline{\mathrm{span}}\{k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\cdot):\mathbf{x}\in X\}. For b>0b_{\ell}>0 and continuous injective Φ1\Phi_{\ell-1}, part (c) together with Proposition 1 yields universality of k~\widetilde{k}_{\ell} on XX.

Proof.

(a) For {𝐱i}X\{\mathbf{x}_{i}\}\subset X and {βi}\{\beta_{i}\}\subset\mathbb{R}, βiβjk~(𝐱i,𝐱j)=βiβjk(Φ(𝐱i),Φ(𝐱j))0\sum\beta_{i}\beta_{j}\widetilde{k}(\mathbf{x}_{i},\mathbf{x}_{j})=\sum\beta_{i}\beta_{j}k(\Phi(\mathbf{x}_{i}),\Phi(\mathbf{x}_{j}))\geq 0 by PSD of kk. Let ψ(𝐳):=k(𝐳,)\psi(\mathbf{z}):=k(\mathbf{z},\cdot) be the canonical feature map and ψ~(𝐱):=ψ(Φ(𝐱))\widetilde{\psi}(\mathbf{x}):=\psi(\Phi(\mathbf{x})), so k~(𝐱,𝐱)=ψ~(𝐱),ψ~(𝐱)\widetilde{k}(\mathbf{x},\mathbf{x}^{\prime})=\langle\widetilde{\psi}(\mathbf{x}),\widetilde{\psi}(\mathbf{x}^{\prime})\rangle. The reproducing property gives (fΦ)(𝐱)=f,ψ~(𝐱)k(f\circ\Phi)(\mathbf{x})=\langle f,\widetilde{\psi}(\mathbf{x})\rangle_{\mathcal{H}_{k}}, so fΦk~f\circ\Phi\in\mathcal{H}_{\widetilde{k}} with the displayed norm bound. Specialising f=f,c=ja,j,ck(𝐰,j,)f=f_{\ell,c}=\sum_{j}a_{\ell,j,c}k_{\ell}(\mathbf{w}_{\ell,j},\cdot) and using f,c2=𝐚,c𝐊𝐚,c\|f_{\ell,c}\|_{\mathcal{H}_{\ell}}^{2}=\mathbf{a}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{a}_{\ell,c} delivers the per-layer Yat-Gram bound.

(b) The map U:k~𝒮U:\mathcal{H}_{\widetilde{k}}\to\mathcal{S} defined on sections by U(k~(𝐱,))=k(Φ(𝐱),)U(\widetilde{k}(\mathbf{x},\cdot))=k(\Phi(\mathbf{x}),\cdot) is an isometric isomorphism (kernel-section inner products agree). Under UU, fΦf\circ\Phi corresponds to P𝒮fP_{\mathcal{S}}f, so fΦk~=P𝒮fk\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}=\|P_{\mathcal{S}}f\|_{\mathcal{H}_{k}}. The affine set {gk:g|Φ(X)=f|Φ(X)}\{g\in\mathcal{H}_{k}:g|_{\Phi(X)}=f|_{\Phi(X)}\} equals P𝒮f+𝒮P_{\mathcal{S}}f+\mathcal{S}^{\perp}, minimised at P𝒮fP_{\mathcal{S}}f.

(c) XX compact and ZZ Hausdorff with Φ\Phi continuous injective makes Φ:XΦ(X)\Phi:X\to\Phi(X) a homeomorphism. For gC(X)g\in C(X), set h:=gΦ1C(Φ(X))h:=g\circ\Phi^{-1}\in C(\Phi(X)); by universality of kk, find fkf\in\mathcal{H}_{k} with supΦ(X)|fh|<η\sup_{\Phi(X)}|f-h|<\eta. Then fΦk~f\circ\Phi\in\mathcal{H}_{\widetilde{k}} and supX|fΦg|=supΦ(X)|fh|<η\sup_{X}|f\circ\Phi-g|=\sup_{\Phi(X)}|f-h|<\eta. ∎

I.2 RKHS Ball Comparisons: Interpolation, Pullback, and Spectral Structure

The kernel-order domination of Theorem 2 propagates to three further structural comparisons: finite-sample interpolation norms, deep-prefix pullback geometry, and integral-operator spectra.

Finite-sample interpolation norm comparison.

For training points 𝐱1,,𝐱n\mathbf{x}_{1},\ldots,\mathbf{x}_{n}, define empirical Gram matrices [𝐊Y]ij:=kb,ε(𝐱i,𝐱j)[\mathbf{K}_{Y}]_{ij}:=k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j}) and [𝐊I]ij:=hε(𝐱i,𝐱j)[\mathbf{K}_{I}]_{ij}:=h_{\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j}).

Theorem 15 (Finite-sample interpolation norm comparison).

𝐊Yb2𝐊I\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I} as PSD matrices. If b>0b>0 and both Gram matrices are strictly positive definite, then 𝐊Y1b2𝐊I1\mathbf{K}_{Y}^{-1}\preceq b^{-2}\mathbf{K}_{I}^{-1}, and for any label vector 𝐲n\mathbf{y}\in\mathbb{R}^{n},

\[ \mathbf{y}^{\top}\mathbf{K}_{Y}^{-1}\mathbf{y}\;\leq\;\frac{1}{b^{2}}\,\mathbf{y}^{\top}\mathbf{K}_{I}^{-1}\mathbf{y}. \]

The minimum RKHS norm to interpolate 𝐲\mathbf{y} at 𝐱1,,𝐱n\mathbf{x}_{1},\ldots,\mathbf{x}_{n} in b,ε\mathcal{H}_{b,\varepsilon} is at most b1b^{-1} times the minimum interpolation norm in hε\mathcal{H}_{h_{\varepsilon}}.

Proof.

$\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I}$ follows from Theorem 2: the kernel $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is PSD, so its Gram matrix on $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ is PSD. Inversion reverses the Loewner order: $\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I}\succ 0$ implies $\mathbf{K}_{Y}^{-1}\preceq b^{-2}\mathbf{K}_{I}^{-1}$ [Horn and Johnson, 2012, Thm. 7.7.4]. The quadratic-form bound follows; the minimum-norm interpolant has squared norm $\mathbf{y}^{\top}\mathbf{K}^{-1}\mathbf{y}$ by the standard representer theorem. ∎
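Theorem 15 can be checked numerically on a small sample. The sketch below assumes $h_{\varepsilon}(\mathbf{x},\mathbf{x}')=1/(\|\mathbf{x}-\mathbf{x}'\|^{2}+\varepsilon)$, the form implied by the factorization $k_{b,\varepsilon}=(\mathbf{x}^{\top}\mathbf{x}'+b)^{2}h_{\varepsilon}$ used elsewhere in this appendix; the random data and parameter values are illustrative.

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def gram_yat(X, b, eps):
    """Empirical Gram matrix K_Y of the Yat kernel."""
    return (X @ X.T + b) ** 2 / (sq_dists(X) + eps)

def gram_h(X, eps):
    """Empirical Gram matrix K_I of h_eps (assumed form, see lead-in)."""
    return 1.0 / (sq_dists(X) + eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))     # hypothetical training points
y = rng.normal(size=8)          # hypothetical label vector
b, eps = 0.7, 0.5

K_Y, K_I = gram_yat(X, b, eps), gram_h(X, eps)
min_eig_gap = np.linalg.eigvalsh(K_Y - b**2 * K_I).min()   # should be >= 0 up to round-off
norm_Y = y @ np.linalg.solve(K_Y, y)                        # y^T K_Y^{-1} y
norm_I = y @ np.linalg.solve(K_I, y) / b**2                 # b^{-2} y^T K_I^{-1} y
```

Up to floating-point error, `min_eig_gap` is non-negative and `norm_Y <= norm_I`, matching the two displayed inequalities.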

Pullback kernel-order domination for deep prefixes.

Theorem 16 (Pullback domination).

Let Φ:Xd\Phi:X\to\mathbb{R}^{d^{\prime}} be any measurable map, and define k~b,ε(𝐱,𝐱):=kb,ε(Φ(𝐱),Φ(𝐱))\widetilde{k}_{b,\varepsilon}(\mathbf{x},\mathbf{x}^{\prime}):=k_{b,\varepsilon}(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime})) and h~ε(𝐱,𝐱):=hε(Φ(𝐱),Φ(𝐱))\widetilde{h}_{\varepsilon}(\mathbf{x},\mathbf{x}^{\prime}):=h_{\varepsilon}(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime})). Then b2h~εk~b,εb^{2}\widetilde{h}_{\varepsilon}\preceq\widetilde{k}_{b,\varepsilon} on XX, and

\[ \mathcal{H}_{\widetilde{h}_{\varepsilon}}\subseteq\mathcal{H}_{\widetilde{k}_{b,\varepsilon}},\qquad\|f\|_{\mathcal{H}_{\widetilde{k}_{b,\varepsilon}}}\leq\frac{1}{b}\|f\|_{\mathcal{H}_{\widetilde{h}_{\varepsilon}}}\quad(b>0). \]

In particular, for any trained Yat stack with per-layer bias b>0b_{\ell}>0, the RKHS ball comparison of Theorem 2 survives at every layer, conditionally on the trained prefix.

Proof.

For any {𝐱i}X\{\mathbf{x}_{i}\}\subset X and {βi}\{\beta_{i}\}\subset\mathbb{R}: i,jβiβj(k~b,εb2h~ε)(𝐱i,𝐱j)=i,jβiβj(kb,εb2hε)(Φ(𝐱i),Φ(𝐱j))0\sum_{i,j}\beta_{i}\beta_{j}(\widetilde{k}_{b,\varepsilon}-b^{2}\widetilde{h}_{\varepsilon})(\mathbf{x}_{i},\mathbf{x}_{j})=\sum_{i,j}\beta_{i}\beta_{j}(k_{b,\varepsilon}-b^{2}h_{\varepsilon})(\Phi(\mathbf{x}_{i}),\Phi(\mathbf{x}_{j}))\geq 0, since kb,εb2hεk_{b,\varepsilon}-b^{2}h_{\varepsilon} is PSD by Theorem 2. Aronszajn’s inclusion theorem then gives the containment. ∎

Gram-ball alignment excess.

Proposition 8 (Alignment excess decomposition).

For any trained prototype set 𝐰1,,𝐰m\mathbf{w}_{1},\ldots,\mathbf{w}_{m}, define [𝐆Y]ij:=kb,ε(𝐰i,𝐰j)[\mathbf{G}_{Y}]_{ij}:=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j}) and [𝐆I]ij:=hε(𝐰i,𝐰j)[\mathbf{G}_{I}]_{ij}:=h_{\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j}). Set q(𝐱,𝐰):=[2b𝐱𝐰+(𝐱𝐰)2]hε(𝐱,𝐰)q(\mathbf{x},\mathbf{w}):=[2b\,\mathbf{x}^{\top}\mathbf{w}+(\mathbf{x}^{\top}\mathbf{w})^{2}]\,h_{\varepsilon}(\mathbf{x},\mathbf{w}); qq is PSD by Schur-product factorization. For any 𝛂m\boldsymbol{\alpha}\in\mathbb{R}^{m},

\[ \boldsymbol{\alpha}^{\top}\mathbf{G}_{Y}\boldsymbol{\alpha}=b^{2}\,\boldsymbol{\alpha}^{\top}\mathbf{G}_{I}\boldsymbol{\alpha}+E_{\mathrm{align}}(\boldsymbol{\alpha}), \]

where Ealign(𝛂):=𝛂(𝐆Yb2𝐆I)𝛂0E_{\mathrm{align}}(\boldsymbol{\alpha}):=\boldsymbol{\alpha}^{\top}(\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I})\boldsymbol{\alpha}\geq 0. Moreover,

\[ E_{\mathrm{align}}(\boldsymbol{\alpha})=0\;\iff\;\boldsymbol{\alpha}\in\ker(\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I}), \]

equivalently iff the finite expansion iαiq(𝐰i,)\sum_{i}\alpha_{i}\,q(\mathbf{w}_{i},\cdot) is the zero function in the RKHS q\mathcal{H}_{q}. In particular, if the alignment Gram matrix 𝐆Yb2𝐆I\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I} is nonsingular on the chosen prototypes, equality forces 𝛂=𝟎\boldsymbol{\alpha}=\mathbf{0}. The trained Yat norm 𝛂𝐆Y𝛂\boldsymbol{\alpha}^{\top}\mathbf{G}_{Y}\boldsymbol{\alpha} decomposes into a radial IMQ component and a non-negative alignment excess EalignE_{\mathrm{align}}.

Proof.

Substituting $\mathbf{G}_{Y}=b^{2}\mathbf{G}_{I}+\mathbf{G}_{q}$, where $[\mathbf{G}_{q}]_{ij}:=q(\mathbf{w}_{i},\mathbf{w}_{j})$, gives the decomposition. Non-negativity follows from $\mathbf{G}_{q}\succeq 0$: the kernel $2b\,\mathbf{x}^{\top}\mathbf{w}+(\mathbf{x}^{\top}\mathbf{w})^{2}$ is PSD (a non-negative combination of the linear kernel and its square, using $2b\geq 0$), and $q$ is its Schur product with the PSD kernel $h_{\varepsilon}$. The equality condition follows because, for a PSD matrix $\mathbf{G}_{q}$, $\boldsymbol{\alpha}^{\top}\mathbf{G}_{q}\boldsymbol{\alpha}=0$ iff $\mathbf{G}_{q}\boldsymbol{\alpha}=\mathbf{0}$, equivalently iff $\sum_{i}\alpha_{i}q(\mathbf{w}_{i},\cdot)\equiv 0$ in $\mathcal{H}_{q}$. ∎
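The substitution used in the proof can be read off by expanding the squared numerator. The worked identity below takes $h_{\varepsilon}(\mathbf{x},\mathbf{w})=1/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$, which is the form implied by the definition of $q$ rather than a definition restated in this appendix.

```latex
\begin{align*}
k_{b,\varepsilon}(\mathbf{x},\mathbf{w})
 &= \frac{(\mathbf{x}^{\top}\mathbf{w}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon} \\
 &= \bigl[b^{2}+2b\,\mathbf{x}^{\top}\mathbf{w}+(\mathbf{x}^{\top}\mathbf{w})^{2}\bigr]\,h_{\varepsilon}(\mathbf{x},\mathbf{w}) \\
 &= b^{2}\,h_{\varepsilon}(\mathbf{x},\mathbf{w})+q(\mathbf{x},\mathbf{w}).
\end{align*}
```

Evaluating both sides on the prototype set gives the matrix identity entry by entry.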

Spectral domination.

For a probability measure ϱ\varrho on XX, define the integral operators (TYf)(𝐱):=kb,ε(𝐱,𝐱)f(𝐱)𝑑ϱ(𝐱)(T_{Y}f)(\mathbf{x}):=\int k_{b,\varepsilon}(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x}^{\prime})\,d\varrho(\mathbf{x}^{\prime}) and (TIf)(𝐱):=hε(𝐱,𝐱)f(𝐱)𝑑ϱ(𝐱)(T_{I}f)(\mathbf{x}):=\int h_{\varepsilon}(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x}^{\prime})\,d\varrho(\mathbf{x}^{\prime}), both compact self-adjoint positive operators on L2(ϱ)L^{2}(\varrho).

Theorem 17 (Spectral domination and effective-dimension comparison).

TYb2TIT_{Y}\succeq b^{2}T_{I} as operators on L2(ϱ)L^{2}(\varrho). Consequently,

\[ \lambda_{j}(T_{Y})\geq b^{2}\lambda_{j}(T_{I})\quad\forall j\geq 1, \]

and for every λ>0\lambda>0 and b>0b>0 the effective dimensions 𝒩k(λ):=Tr(Tk(Tk+λI)1)\mathcal{N}_{k}(\lambda):=\mathrm{Tr}(T_{k}(T_{k}+\lambda I)^{-1}) satisfy

\[ \mathcal{N}_{Y}(\lambda)\;\geq\;\mathcal{N}_{I}(\lambda/b^{2}). \]
Proof.

For any fL2(ϱ)f\in L^{2}(\varrho), f,(TYb2TI)fL2=(kb,εb2hε)(𝐱,𝐱)f(𝐱)f(𝐱)𝑑ϱ20\langle f,(T_{Y}-b^{2}T_{I})f\rangle_{L^{2}}=\iint(k_{b,\varepsilon}-b^{2}h_{\varepsilon})(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x})f(\mathbf{x}^{\prime})\,d\varrho^{2}\geq 0, since kb,εb2hεk_{b,\varepsilon}-b^{2}h_{\varepsilon} is PSD (Theorem 2). The eigenvalue bound follows from the min-max theorem. For the effective dimension, use monotonicity of tt/(t+λ)t\mapsto t/(t+\lambda) together with λj(TY)b2λj(TI)\lambda_{j}(T_{Y})\geq b^{2}\lambda_{j}(T_{I}):

\[ \frac{\lambda_{j}(T_{Y})}{\lambda_{j}(T_{Y})+\lambda}\;\geq\;\frac{b^{2}\lambda_{j}(T_{I})}{b^{2}\lambda_{j}(T_{I})+\lambda}\;=\;\frac{\lambda_{j}(T_{I})}{\lambda_{j}(T_{I})+\lambda/b^{2}}, \]

and summing in jj gives 𝒩Y(λ)𝒩I(λ/b2)\mathcal{N}_{Y}(\lambda)\geq\mathcal{N}_{I}(\lambda/b^{2}). ∎
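The same monotonicity argument applies to any pair of PSD matrices in Loewner order, which gives a finite-dimensional sanity check. In the sketch below the matrices `T_I` and `T_Y` are synthetic stand-ins for the integral operators (an assumption made for illustration), and `c` plays the role of $b^{2}$.

```python
import numpy as np

def effective_dimension(T, lam):
    """Tr(T (T + lam I)^{-1}) for a symmetric PSD matrix T."""
    mu = np.clip(np.linalg.eigvalsh(T), 0.0, None)   # clip round-off negatives
    return float(np.sum(mu / (mu + lam)))

rng = np.random.default_rng(2)
n, c = 6, 0.49                       # c stands in for b^2
G = rng.normal(size=(n, n))
H = rng.normal(size=(n, 3))
T_I = G @ G.T / n                    # synthetic PSD stand-in for T_I
T_Y = c * T_I + H @ H.T / n          # T_Y - c * T_I is PSD by construction
```

For every `lam > 0`, `effective_dimension(T_Y, lam)` is at least `effective_dimension(T_I, lam / c)`, mirroring the displayed comparison.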

Theorem 18 (Layerwise perturbation propagation).

Let $T_{L}\circ\cdots\circ T_{1}$ and $\widetilde{T}_{L}\circ\cdots\circ\widetilde{T}_{1}$ be two depth-$L$ networks acting on $X_{0}\subset\mathbb{R}^{d_{0}}$. Let $X_{\ell}:=T_{\ell}(X_{\ell-1})\cup\widetilde{T}_{\ell}(X_{\ell-1})\subset\mathbb{R}^{d_{\ell}}$ be the union of the two intermediate ranges, so that the layer-$\ell$ representations of both networks lie in $X_{\ell}$. Suppose each $T_{\ell}$ is $L_{\ell}$-Lipschitz on $X_{\ell-1}$ and $\sup_{\mathbf{z}\in X_{\ell-1}}\|T_{\ell}(\mathbf{z})-\widetilde{T}_{\ell}(\mathbf{z})\|\leq\delta_{\ell}$. Then

\[ \sup_{\mathbf{x}\in X_{0}}\|T_{L}\circ\cdots\circ T_{1}(\mathbf{x})-\widetilde{T}_{L}\circ\cdots\circ\widetilde{T}_{1}(\mathbf{x})\|\leq\sum_{\ell=1}^{L}\delta_{\ell}\prod_{r=\ell+1}^{L}L_{r}. \]

The sup is over the relevant intermediate representations actually visited; it is not required to hold globally on d\mathbb{R}^{d_{\ell}}.

Proof.

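Write $\mathbf{z}_{\ell}(\mathbf{x})$ and $\widetilde{\mathbf{z}}_{\ell}(\mathbf{x})$ for the layer-$\ell$ representations of the two networks, and let $e_{\ell}:=\sup_{\mathbf{x}\in X_{0}}\|\mathbf{z}_{\ell}(\mathbf{x})-\widetilde{\mathbf{z}}_{\ell}(\mathbf{x})\|$ with $e_{0}=0$. The triangle inequality gives $\|T_{\ell}(\mathbf{z}_{\ell-1})-\widetilde{T}_{\ell}(\widetilde{\mathbf{z}}_{\ell-1})\|\leq\|T_{\ell}(\mathbf{z}_{\ell-1})-T_{\ell}(\widetilde{\mathbf{z}}_{\ell-1})\|+\|T_{\ell}(\widetilde{\mathbf{z}}_{\ell-1})-\widetilde{T}_{\ell}(\widetilde{\mathbf{z}}_{\ell-1})\|$, so $e_{\ell}\leq L_{\ell}e_{\ell-1}+\delta_{\ell}$. Unrolling gives $e_{L}\leq\sum_{\ell=1}^{L}\delta_{\ell}\prod_{r=\ell+1}^{L}L_{r}$. ∎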
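The recursion and its unrolled form can be compared directly. The sketch below treats the per-layer deviations and Lipschitz constants as given lists; the function names and the sample numbers are illustrative.

```python
import math

def propagated_error(deltas, lipschitz):
    """Iterate e_l = L_l * e_{l-1} + delta_l starting from e_0 = 0."""
    e = 0.0
    for delta_l, L_l in zip(deltas, lipschitz):
        e = L_l * e + delta_l
    return e

def unrolled_error(deltas, lipschitz):
    """Closed form: sum_l delta_l * prod_{r > l} L_r."""
    return sum(
        delta_l * math.prod(lipschitz[l + 1:])
        for l, delta_l in enumerate(deltas)
    )

deltas = [0.1, 0.2, 0.3]      # hypothetical per-layer deviations
lipschitz = [2.0, 3.0, 4.0]   # hypothetical per-layer Lipschitz constants
bound = propagated_error(deltas, lipschitz)
```

The first-layer Lipschitz constant never enters the bound, because it multiplies $e_{0}=0$; only the constants of later layers amplify an earlier perturbation.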

Lemma 7 (Lipschitz bound for one Yat layer).

Let 𝐳R\|\mathbf{z}\|\leq R, 𝐰jW\|\mathbf{w}_{j}\|\leq W, b0b\geq 0, ε>0\varepsilon>0. Define

\[ M_{R,W,b,\varepsilon}:=\frac{2(RW+b)W}{\varepsilon}+\frac{2(RW+b)^{2}(R+W)}{\varepsilon^{2}}. \]

Each atom 𝐳kb,ε(𝐰j,𝐳)\mathbf{z}\mapsto k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{z}) is MR,W,b,εM_{R,W,b,\varepsilon}-Lipschitz on BRB_{R}. The coordinate Tc(𝐳)=jaj,ckb,ε(𝐰j,𝐳)T_{c}(\mathbf{z})=\sum_{j}a_{j,c}k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{z}) is Lipschitz with constant j|aj,c|MR,W,b,ε\sum_{j}|a_{j,c}|M_{R,W,b,\varepsilon}, and the vector layer T=(T1,,TC)T=(T_{1},\ldots,T_{C}) is Lipschitz with constant (c[j|aj,c|MR,W,b,ε]2)1/2(\sum_{c}[\sum_{j}|a_{j,c}|M_{R,W,b,\varepsilon}]^{2})^{1/2}.

Proof.

With $D(\mathbf{z}):=\|\mathbf{z}-\mathbf{w}\|^{2}+\varepsilon$ and $N(\mathbf{z}):=(\mathbf{w}^{\top}\mathbf{z}+b)^{2}$ for a fixed center $\mathbf{w}=\mathbf{w}_{j}$, the quotient rule gives $\nabla_{\mathbf{z}}k_{b,\varepsilon}(\mathbf{w},\mathbf{z})=\bigl(2(\mathbf{w}^{\top}\mathbf{z}+b)\,\mathbf{w}\,D(\mathbf{z})-2(\mathbf{w}^{\top}\mathbf{z}+b)^{2}(\mathbf{z}-\mathbf{w})\bigr)/D(\mathbf{z})^{2}$. Using $D(\mathbf{z})\geq\varepsilon$, $|\mathbf{w}^{\top}\mathbf{z}+b|\leq RW+b$, and $\|\mathbf{z}-\mathbf{w}\|\leq R+W$ gives $\|\nabla_{\mathbf{z}}k_{b,\varepsilon}(\mathbf{w},\mathbf{z})\|\leq M_{R,W,b,\varepsilon}$, and the Lipschitz bound on the convex set $B_{R}$ follows from the mean value inequality. The coordinate bound follows from the triangle inequality, and the vector-layer bound from the Frobenius-norm bound on the Jacobian. ∎
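The gradient formula and the constant $M_{R,W,b,\varepsilon}$ can be sanity-checked by sampling. This is a sketch with illustrative parameter values and a hypothetical helper `random_in_ball`; it samples points rather than certifying the bound.

```python
import numpy as np

def yat_atom(w, z, b, eps):
    """Single atom k_{b,eps}(w, z)."""
    return (w @ z + b) ** 2 / (np.sum((z - w) ** 2) + eps)

def yat_atom_grad(w, z, b, eps):
    """Gradient in z from the quotient rule used in the proof."""
    s = w @ z + b
    D = np.sum((z - w) ** 2) + eps
    return (2.0 * s * w * D - 2.0 * s**2 * (z - w)) / D**2

def lipschitz_constant(R, W, b, eps):
    """The constant M_{R,W,b,eps} of Lemma 7."""
    return 2.0 * (R * W + b) * W / eps + 2.0 * (R * W + b) ** 2 * (R + W) / eps**2

rng = np.random.default_rng(3)
R, W, b, eps = 1.5, 1.0, 0.5, 0.3    # illustrative budgets

def random_in_ball(radius, dim=4):
    """Random point with norm at most `radius` (hypothetical sampling helper)."""
    v = rng.normal(size=dim)
    return radius * rng.uniform() * v / np.linalg.norm(v)

M = lipschitz_constant(R, W, b, eps)
max_grad = max(
    np.linalg.norm(yat_atom_grad(random_in_ball(W), random_in_ball(R), b, eps))
    for _ in range(2000)
)
```

On such samples `max_grad` stays below `M`, typically by a wide margin, since the bound replaces the denominator by its worst case $\varepsilon$ while simultaneously taking the numerator at its largest.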

Appendix J Generic Sobolev-RKHS Containment for Bounded Smooth-Atom Stacks (Applied to Yat)

This appendix proves Theorem 19. The result is an ambient containment theorem: every bounded finite-depth Yat stack lies in one fixed Sobolev restriction RKHS on the original input domain. It does not claim that RKHSs are closed under nonlinear composition, nor does it replace the exact single-layer Gram norm 𝜶𝐊𝜶\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha} with an exact global Gram norm.

J.1 Sobolev restriction RKHS

Let Xd0X\subset\mathbb{R}^{d_{0}} be compact and let Ud0U\subset\mathbb{R}^{d_{0}} be a bounded Lipschitz domain with XU¯X\subset\overline{U}. Since bounded Lipschitz domains are Sobolev extension domains [Adams and Fournier, 2003], the Hs(d0)H^{s}(\mathbb{R}^{d_{0}}) Banach-algebra and Moser-composition estimates transfer to Hs(U)H^{s}(U) without loss; the Sobolev-induction template we use here is standard in deep-network analysis, with the closest neighbours being Petersen and Voigtländer [2018] for CkC^{k}-class deep ReLU networks and E et al. [2019] for the Barron-space variant. Fix s>d0/2s>d_{0}/2. Define

\[ \mathcal{H}_{s}(X):=\bigl\{f|_{X}:f\in H^{s}(U)\bigr\},\qquad\|g\|_{\mathcal{H}_{s}(X)}:=\inf\bigl\{\|f\|_{H^{s}(U)}:f|_{X}=g\bigr\}. \]

By the Sobolev embedding Hs(U)C(U¯)H^{s}(U)\hookrightarrow C(\overline{U}) (valid for s>d0/2s>d_{0}/2), point evaluation is continuous on s(X)\mathcal{H}_{s}(X), making it an RKHS on XX.

Lemma 8 (Sobolev algebra and Nemytskii inverse stability).

Let UU be a bounded Lipschitz domain and s>d0/2s>d_{0}/2. Then:

  1. 1.

    (Banach algebra.) Hs(U)H^{s}(U) is a Banach algebra: there exists Calg1C_{\mathrm{alg}}\geq 1 such that fgHs(U)CalgfHs(U)gHs(U)\|fg\|_{H^{s}(U)}\leq C_{\mathrm{alg}}\|f\|_{H^{s}(U)}\|g\|_{H^{s}(U)} for all f,gHs(U)f,g\in H^{s}(U).

  2. 2.

    (Nemytskii inverse stability.) If rHs(U)r\in H^{s}(U) satisfies r(𝐱)λ>0r(\mathbf{x})\geq\lambda>0 for every 𝐱U\mathbf{x}\in U, then 1/rHs(U)1/r\in H^{s}(U). Moreover, there exists a nondecreasing function Γs:(0,)×[0,)[0,)\Gamma_{s}:(0,\infty)\times[0,\infty)\to[0,\infty) such that 1/rHs(U)Γs(λ1,rHs(U))\|1/r\|_{H^{s}(U)}\leq\Gamma_{s}(\lambda^{-1},\|r\|_{H^{s}(U)}).

Proof.

Part (1): the Sobolev multiplication theorem for s>d0/2s>d_{0}/2 on Lipschitz extension domains [Adams and Fournier, 2003, Thm. 4.39]. Part (2): since Hs(U)L(U)H^{s}(U)\hookrightarrow L^{\infty}(U) (Sobolev embedding), rr is bounded above. Apply the Moser/Nemytskii composition estimate [Runst and Sickel, 1996, Ch. 5] to φ(t)=1/t\varphi(t)=1/t on [λ,rL][\lambda,\|r\|_{L^{\infty}}]; the resulting norm bound depends only on λ1\lambda^{-1}, rHs(U)\|r\|_{H^{s}(U)}, ss, and the domain UU. ∎

J.2 Network class and parameter budgets

For a depth-LL Yat stack, write 𝐳0(𝐱)=𝐱d0\mathbf{z}_{0}(\mathbf{x})=\mathbf{x}\in\mathbb{R}^{d_{0}} and, for =1,,L\ell=1,\dots,L and c=1,,dc=1,\dots,d_{\ell},

\[ z_{\ell,c}(\mathbf{x})=\sum_{j=1}^{m_{\ell}}\alpha_{\ell,j,c}\,g_{\ell,j}(\mathbf{x}),\qquad g_{\ell,j}(\mathbf{x})=\frac{(\mathbf{w}_{\ell,j}^{\top}\mathbf{z}_{\ell-1}(\mathbf{x})+b_{\ell})^{2}}{\|\mathbf{z}_{\ell-1}(\mathbf{x})-\mathbf{w}_{\ell,j}\|^{2}+\varepsilon_{\ell}}. \]

Assume the uniform budgets

\[ \|\mathbf{w}_{\ell,j}\|_{2}\leq W_{\ell},\qquad|b_{\ell}|\leq B_{\ell},\qquad 0<\varepsilon_{\min}\leq\varepsilon_{\ell}\leq\varepsilon_{\max},\qquad\sum_{j=1}^{m_{\ell}}|\alpha_{\ell,j,c}|\leq A_{\ell}. \]

Note |b|B|b_{\ell}|\leq B_{\ell} (not b0b_{\ell}\geq 0): the Sobolev containment holds for any bounded real bias; b0b_{\ell}\geq 0 is required only for the Mercer/RKHS regime of the single-layer results elsewhere in the paper. The width mm_{\ell} enters only through AA_{\ell}; if per-atom bounds |α,j,c|a|\alpha_{\ell,j,c}|\leq a_{\ell} are assumed, take A=maA_{\ell}=m_{\ell}a_{\ell}.

J.3 Full Proof of Global Sobolev-RKHS Containment

Theorem 19 (Global ambient RKHS containment).

Let Xd0X\subset\mathbb{R}^{d_{0}} be compact and UXU\supset X a bounded Lipschitz domain. Fix s>d0/2s>d_{0}/2 and define s(X):={f|X:fHs(U)}\mathcal{H}_{s}(X):=\{f|_{X}:f\in H^{s}(U)\} with the quotient norm. Under the budgets 𝐰,jW\|\mathbf{w}_{\ell,j}\|\leq W_{\ell}, |b|B|b_{\ell}|\leq B_{\ell}, 0<εminεεmax0<\varepsilon_{\min}\leq\varepsilon_{\ell}\leq\varepsilon_{\max}, j|α,j,c|A\sum_{j}|\alpha_{\ell,j,c}|\leq A_{\ell}, every coordinate z,c|Xz_{\ell,c}|_{X} of every admissible Yat stack belongs to s(X)\mathcal{H}_{s}(X), with z,cs(X)M\|z_{\ell,c}\|_{\mathcal{H}_{s}(X)}\leq M_{\ell} for finite constants M0,,MLM_{0},\dots,M_{L} depending only on X,U,s,L,{d,m,A,W,B},εmin,εmaxX,U,s,L,\{d_{\ell},m_{\ell},A_{\ell},W_{\ell},B_{\ell}\},\varepsilon_{\min},\varepsilon_{\max} and not on the trained parameters. The reproducing kernel of s(X)\mathcal{H}_{s}(X) is universal on XX, hence characteristic.

Proof.

We induct on \ell. Write C1:=1Hs(U)C_{1}:=\|1\|_{H^{s}(U)} (finite since UU is bounded).

Base case (=0\ell=0). Each coordinate z0,i(𝐱)=xiz_{0,i}(\mathbf{x})=x_{i} is a polynomial, hence in Hs(U)H^{s}(U), with xiHs(U)M0\|x_{i}\|_{H^{s}(U)}\leq M_{0} for some M0M_{0} depending on UU and ss.

Inductive step. Assume z1,iHs(U)z_{\ell-1,i}\in H^{s}(U) and z1,iHs(U)M1\|z_{\ell-1,i}\|_{H^{s}(U)}\leq M_{\ell-1} for all i=1,,d1i=1,\dots,d_{\ell-1}.

Linear factor. q,j:=𝐰,j𝐳1+bHs(U)q_{\ell,j}:=\mathbf{w}_{\ell,j}^{\top}\mathbf{z}_{\ell-1}+b_{\ell}\in H^{s}(U) as a linear combination of HsH^{s} functions plus a constant, with

\[ \|q_{\ell,j}\|_{H^{s}(U)}\leq\sqrt{d_{\ell-1}}\,W_{\ell}M_{\ell-1}+B_{\ell}C_{1}=:Q_{\ell}(M_{\ell-1}). \]

By Lemma 8(1), q,j2Hs(U)q_{\ell,j}^{2}\in H^{s}(U) with q,j2Hs(U)CalgQ(M1)2\|q_{\ell,j}^{2}\|_{H^{s}(U)}\leq C_{\mathrm{alg}}Q_{\ell}(M_{\ell-1})^{2}.

Denominator. r,j:=𝐳1𝐰,j2+ε=i(z1,iw,j,i)2+εHs(U)r_{\ell,j}:=\|\mathbf{z}_{\ell-1}-\mathbf{w}_{\ell,j}\|^{2}+\varepsilon_{\ell}=\sum_{i}(z_{\ell-1,i}-w_{\ell,j,i})^{2}+\varepsilon_{\ell}\in H^{s}(U). Algebra bounds give r,jHs(U)d1Calg(M1+WC1)2+εmaxC1=:D(M1)\|r_{\ell,j}\|_{H^{s}(U)}\leq d_{\ell-1}C_{\mathrm{alg}}(M_{\ell-1}+W_{\ell}C_{1})^{2}+\varepsilon_{\max}C_{1}=:D_{\ell}(M_{\ell-1}). Pointwise, r,j(𝐱)εmin>0r_{\ell,j}(\mathbf{x})\geq\varepsilon_{\min}>0. By Lemma 8(2), 1/r,jHs(U)1/r_{\ell,j}\in H^{s}(U) with 1/r,jHs(U)Γs(εmin1,D(M1))\|1/r_{\ell,j}\|_{H^{s}(U)}\leq\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1})).

Atom. g,j=q,j2(1/r,j)Hs(U)g_{\ell,j}=q_{\ell,j}^{2}\cdot(1/r_{\ell,j})\in H^{s}(U) by another application of (1), with

\[ \|g_{\ell,j}\|_{H^{s}(U)}\leq C_{\mathrm{alg}}^{2}\,Q_{\ell}(M_{\ell-1})^{2}\,\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1}))=:G_{\ell}(M_{\ell-1}). \]

Layer coordinate. z,cHs(U)j|α,j,c|g,jHs(U)AG(M1)\|z_{\ell,c}\|_{H^{s}(U)}\leq\sum_{j}|\alpha_{\ell,j,c}|\|g_{\ell,j}\|_{H^{s}(U)}\leq A_{\ell}G_{\ell}(M_{\ell-1}). Set M:=AG(M1)M_{\ell}:=A_{\ell}G_{\ell}(M_{\ell-1}). Restricting from UU to XX and using the quotient norm gives z,c|Xs(X)z,cHs(U)M\|z_{\ell,c}|_{X}\|_{\mathcal{H}_{s}(X)}\leq\|z_{\ell,c}\|_{H^{s}(U)}\leq M_{\ell}, closing the induction. The constants Q,D,G,MQ_{\ell},D_{\ell},G_{\ell},M_{\ell} depend only on the budgets and not on the trained parameters. ∎

Remark 6 (Growth of MM_{\ell} in depth LL is super-exponential).

The recurrence above defines a sequence of finite constants M0,M1,,MLM_{0},M_{1},\dots,M_{L}, but the dependence on depth is severe and worth recording explicitly. Each step of the induction is at least quadratic: the algebra bound q2HsCalgQ2\|q^{2}\|_{H^{s}}\leq C_{\mathrm{alg}}Q_{\ell}^{2} contributes a square in M1M_{\ell-1}, the denominator bound D(M1)=O(M12)D_{\ell}(M_{\ell-1})=O(M_{\ell-1}^{2}) contributes another square, and the Moser-type inverse-stability factor Γs(εmin1,D(M1))\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1})) has, in the standard chain-rule decomposition for f(t)=1/tf(t)=1/t on tεmint\geq\varepsilon_{\min}, polynomial dependence of degree s\lceil s\rceil on rHs\|r\|_{H^{s}} (with prefactors that absorb εmin(s+1)\varepsilon_{\min}^{-(\lceil s\rceil+1)} from the CkC^{k}-norms of 1/t1/t on the interval [εmin,)[\varepsilon_{\min},\infty)). Composing through one Yat layer therefore takes M1MM_{\ell-1}\mapsto M_{\ell} via a polynomial of degree at least 2s+22\lceil s\rceil+2, and iterating LL times gives

\[ M_{L}\;=\;O\!\bigl(M_{0}^{(2\lceil s\rceil+2)^{L}}\bigr)\;\leq\;O\!\bigl(M_{0}^{e^{O(L\log s)}}\bigr), \]

i.e., a doubly exponential blow-up in depth. Theorem 19 should therefore be read as a qualitative containment statement ($z_{\ell,c}|_{X}\in\mathcal{H}_{s}(X)$ for every admissible Yat stack and every $\ell\leq L$), not as a quantitatively useful capacity certificate at any practical depth. Improving the depth dependence of $M_{L}$ to polynomial-in-$L$, or even exponential-in-$L$, would require either restricting the activation atom to exponential-decay-friendly functions (Gaussian RBF, in place of the rational $g_{\varepsilon}$) or using a different ambient class (e.g., a depth-aware Besov scale) that absorbs the chain-rule blow-up. Both routes are out of scope here. The theorem we record is the parameter-independent containment of bounded-budget Yat stacks in a fixed universal RKHS; the constants are not the load-bearing content.
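To see how quickly the exponent of $M_{0}$ in the display grows, it can be tabulated for small depths. The sketch below only evaluates $(2\lceil s\rceil+2)^{L}$; the function name and the sample values of $s$ and $L$ are illustrative.

```python
import math

def growth_exponent(s, depth):
    """Exponent of M_0 after `depth` layers: (2 * ceil(s) + 2) ** depth."""
    return (2 * math.ceil(s) + 2) ** depth

# Example: d_0 = 3 requires s > 1.5; taking s = 2 gives per-layer degree 6.
exponents = [growth_exponent(2, L) for L in range(1, 6)]
```

With $s=2$ the exponent already exceeds $200$ at depth three, which illustrates why the constants are not usable as a capacity certificate.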

Corollary 9 (Universality and characteristicness of s(X)\mathcal{H}_{s}(X)).

s(X)\mathcal{H}_{s}(X) is universal on XX, hence characteristic.

Proof.

Every polynomial restricted to UU belongs to Hs(U)H^{s}(U) (polynomials are smooth on bounded UU), and hence its restriction to XX belongs to s(X)\mathcal{H}_{s}(X). By Stone–Weierstrass, polynomials are dense in C(X)C(X). Characteristicness follows from universality on compact XX [Sriperumbudur et al., 2011]. ∎

Remark 7 (What this theorem does and does not give).

Theorem 19 is an ambient containment result. It does not give an exact deep Yat-Gram norm: the identity j𝛂jkb,ε(𝐰j,)b,ε2=𝛂𝐊𝛂\|\sum_{j}\boldsymbol{\alpha}_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha} is a single-layer statement. For a depth-LL stack the induced kernel on the original input space is the prefix-dependent pullback k~(𝐱,𝐱)=k(𝐳1(𝐱),𝐳1(𝐱))\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime})=k_{\ell}(\mathbf{z}_{\ell-1}(\mathbf{x}),\mathbf{z}_{\ell-1}(\mathbf{x}^{\prime})); the global Sobolev RKHS s(X)\mathcal{H}_{s}(X) avoids this prefix-dependence at the cost of giving only a coarse ambient norm bound.

Remark 8 (Sign of bias and Mercer regime).

The proof requires only |b|B|b_{\ell}|\leq B_{\ell}; the sign of bb_{\ell} is irrelevant for Sobolev containment. The condition b0b_{\ell}\geq 0 is required only for the Mercer/RKHS structure of the single-layer Yat kernel (Theorem 1). In the common case where a single shared bias b=b0b_{\ell}=b\geq 0 is used, the theorem applies with B=bB_{\ell}=b.

Appendix K Spectral Structure on the Sphere

The Mercer eigenvalues of k0,εk_{0,\varepsilon} on a generic compact domain are governed by Sobolev regularity of the IMQ factor and decay polynomially. On the unit sphere 𝕊d1\mathbb{S}^{d-1}, the kernel reduces to a zonal function whose Funk–Hecke decomposition yields an exponential eigenvalue decay rate, sharper than the polynomial Sobolev bound. The sphere is the operationally relevant subdomain in feature-normalised practice (layer-norm, 2\ell_{2}-normalised CLIP embeddings) and the analytically cleanest setting: 𝐮𝐯2=22𝐮𝐯\|\mathbf{u}-\mathbf{v}\|^{2}=2-2\mathbf{u}^{\top}\mathbf{v}, so the Yat kernel becomes a univariate function of the inner product t=𝐮𝐯[1,1]t=\mathbf{u}^{\top}\mathbf{v}\in[-1,1], and the Funk–Hecke formula [Wendland, 2004, Sec. 10.4] diagonalises the integral operator in spherical-harmonic sectors. We work with the unbiased k0,εk_{0,\varepsilon}; the biased kb,εk_{b,\varepsilon} for b>0b>0 adds two terms 2bt/Dε2b\,t/D_{\varepsilon} and b2hεb^{2}h_{\varepsilon} that share the same off-interval pole at tt_{\star}, so the asymptotic decay rate ρ\rho_{\star}^{-\ell} below is unchanged and only the prefactors and low-frequency content shift. Downstream consequences for fast generalization rates appear in Appendix L.

For $\mathbf{u},\mathbf{v}\in\mathbb{S}^{d-1}$, the unbiased Yat kernel restricts to the zonal function

\[ k_{\text{ⵟ}}(\mathbf{u},\mathbf{v})\;=\;\kappa(\mathbf{u}^{\top}\mathbf{v}),\qquad\kappa(t)\;:=\;\frac{t^{2}}{\varepsilon+2-2t},\qquad t\in[-1,1], \tag{11} \]

which extends meromorphically to $\mathbb{C}$ with a single simple pole at $t_{\star}:=1+\varepsilon/2>1$ and residue $-(t_{\star})^{2}/2\neq 0$.
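The pole location and residue can be verified by polynomial division; the short derivation below uses only the definition of $\kappa$ in (11) and $t_{\star}=1+\varepsilon/2$.

```latex
\begin{align*}
\kappa(t) &= \frac{t^{2}}{\varepsilon+2-2t} = -\frac{t^{2}}{2\,(t-t_{\star})},
  \qquad t_{\star}=1+\frac{\varepsilon}{2}, \\
t^{2} &= (t-t_{\star})(t+t_{\star})+t_{\star}^{2}, \\
\kappa(t) &= -\frac{t+t_{\star}}{2}-\frac{t_{\star}^{2}/2}{t-t_{\star}},
  \qquad \operatorname*{Res}_{t=t_{\star}}\kappa = -\frac{t_{\star}^{2}}{2}.
\end{align*}
```

The affine part $-(t+t_{\star})/2$ is orthogonal to every Gegenbauer polynomial of degree $\ell\geq 2$, so only the simple-pole term contributes to the high-degree coefficients.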

Theorem 20 (Funk–Hecke spectrum of $k_{\text{ⵟ}}$ on the sphere).

The integral operator $T_{k_{\text{ⵟ}}}:L^{2}(\mathbb{S}^{d-1})\to L^{2}(\mathbb{S}^{d-1})$ associated with the zonal kernel (11) admits the spectral decomposition $T_{k_{\text{ⵟ}}}=\sum_{\ell\geq 0}\lambda_{\ell}\,\mathbf{P}_{\ell}$, where $\mathbf{P}_{\ell}$ is orthogonal projection onto the space of degree-$\ell$ spherical harmonics (multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$), and the eigenvalues are given by the Funk–Hecke formula

\[ \lambda_{\ell}\;=\;\frac{|\mathbb{S}^{d-2}|}{C_{\ell}^{(d-2)/2}(1)}\int_{-1}^{1}\kappa(t)\,C_{\ell}^{(d-2)/2}(t)\,(1-t^{2})^{(d-3)/2}\,dt, \]

with $C_{\ell}^{(d-2)/2}$ the Gegenbauer polynomial of index $(d-2)/2$.

Proof.

$\kappa$ is continuous on $[-1,1]$ (the pole $t_{\star}=1+\varepsilon/2$ lies strictly outside the closed interval since $\varepsilon>0$), so the kernel $(\mathbf{u},\mathbf{v})\mapsto\kappa(\mathbf{u}^{\top}\mathbf{v})$ is bounded and continuous on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$. The associated operator $T_{k_{\text{ⵟ}}}$ is therefore Hilbert–Schmidt, hence compact and self-adjoint on $L^{2}(\mathbb{S}^{d-1})$. Spherical harmonics simultaneously diagonalise every zonal Hilbert–Schmidt operator by the Funk–Hecke formula [Wendland, 2004, Sec. 10.4], and the displayed integral is the standard Gegenbauer expansion coefficient on the eigenspace of degree $\ell$. ∎

Theorem 21 (Exponential decay rate).

The Funk–Hecke eigenvalues of $k_{\text{ⵟ}}$ on $\mathbb{S}^{d-1}$ satisfy $\lambda_{\ell}=\Theta(\rho_{\star}^{-\ell})$ as $\ell\to\infty$, with

\[
\rho_{\star}\;:=\;1+\frac{\varepsilon}{2}+\sqrt{\varepsilon+\frac{\varepsilon^{2}}{4}}\;=\;1+\sqrt{\varepsilon}\,\bigl(1+O(\sqrt{\varepsilon})\bigr).
\]

Proof.

Strategy. The Funk–Hecke expansion of $\kappa(\mathbf{u}^{\top}\mathbf{w})$ in Gegenbauer polynomials gives $\kappa(t)=\sum_{\ell\geq 0}\lambda_{\ell}\,c_{\ell}^{\nu}\,C_{\ell}^{(\nu)}(t)$, where $\lambda_{\ell}$ is the Mercer eigenvalue and $C_{\ell}^{(\nu)}$ is the Gegenbauer polynomial of degree $\ell$. The asymptotic decay rate $\lambda_{\ell}\sim\rho_{\star}^{-\ell}$ for simple-pole kernels $\kappa(t)=-A/(t-t_{\star})$ is recovered by the standard residue argument: the singularity of $\kappa$ in the complex $t$-plane closest to the interval $[-1,1]$ controls the radius of convergence of the Gegenbauer expansion via the Bernstein-ellipse / pole-distance correspondence [Wendland, 2004, Ch. 12]. We do not extract $\lambda_{\ell}$ from the Gegenbauer generating function $(1-2tz+z^{2})^{-\nu}$ directly; rather, we use the generating function only to identify the relevant ellipse parameter $\rho_{\star}$. For $d=2$ ($\nu=0$), the Gegenbauer polynomials reduce to Chebyshev polynomials of the first kind and the generating function takes the logarithmic form $-\log(1-2tz+z^{2})/2=\sum_{\ell\geq 1}T_{\ell}(t)\,z^{\ell}/\ell$; the same Bernstein-ellipse argument applies, with the asymptotic decay rate unchanged.

Joukowski substitution. The Bernstein ellipse $E_{\rho}\subset\mathbb{C}$ with foci $\pm 1$ is the image of $\{|z|=\rho\}\subset\mathbb{C}$ under the Joukowski map $z\mapsto t(z):=(z+z^{-1})/2$ for $\rho>1$. The map $t(z)$ is a conformal bijection from $\{|z|>1\}$ onto $\mathbb{C}\setminus[-1,1]$, with the inverse selecting the branch $z(t)=t+\sqrt{t^{2}-1}$ for real $t>1$ (so $|z|>1$). Under this substitution, a real point $t_{\star}>1$ corresponds to $z_{\star}=t_{\star}+\sqrt{t_{\star}^{2}-1}>1$ (the outer pre-image), and equivalently to $z_{\star}^{-1}=t_{\star}-\sqrt{t_{\star}^{2}-1}<1$ (the inner pre-image inside the unit disk). Substituting $t_{\star}=1+\varepsilon/2$,

\[
\rho_{\star}\;:=\;z_{\star}\;=\;\bigl(1+\tfrac{\varepsilon}{2}\bigr)+\sqrt{\bigl(1+\tfrac{\varepsilon}{2}\bigr)^{2}-1}\;=\;1+\tfrac{\varepsilon}{2}+\sqrt{\varepsilon+\tfrac{\varepsilon^{2}}{4}},
\]

and the inner pre-image $z_{\star}^{-1}=1/\rho_{\star}$ is the corresponding singularity location in the Gegenbauer-generating-function variable.
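The Joukowski bookkeeping can be checked with a few lines of arithmetic. The sketch below (function names are ours, and $\varepsilon=0.5$ is chosen only because it gives round numbers) confirms that both pre-images map back to $t_{\star}$ and that $\rho_{\star}-(1+\sqrt{\varepsilon})$ is of order $\varepsilon$ for small $\varepsilon$.

```python
import math

def joukowski(z):
    """Joukowski map t(z) = (z + 1/z) / 2."""
    return 0.5 * (z + 1.0 / z)

def rho_star(eps):
    """Outer pre-image of the pole t_star = 1 + eps/2 under the Joukowski map."""
    t_star = 1.0 + eps / 2.0
    return t_star + math.sqrt(t_star * t_star - 1.0)

eps = 0.5
rho = rho_star(eps)                  # t_star = 1.25, sqrt(t_star^2 - 1) = 0.75
t_from_outer = joukowski(rho)        # should recover t_star
t_from_inner = joukowski(1.0 / rho)  # the inner pre-image maps to the same point

# Gap to the leading-order approximation 1 + sqrt(eps) for small eps.
gaps = [(e, rho_star(e) - (1.0 + math.sqrt(e))) for e in (1e-2, 1e-4, 1e-6)]
```

For small $\varepsilon$ the gap is roughly $\varepsilon/2$, consistent with the expansion $\rho_{\star}=1+\sqrt{\varepsilon}\,(1+O(\sqrt{\varepsilon}))$.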

Upper bound. For every $1<\rho<\rho_{\star}$, $\kappa$ is analytic in the interior of the ellipse $E_{\rho}$ and continuous up to its boundary (the pole $t_{\star}$ lies strictly outside $E_{\rho}$). Bernstein’s theorem on Gegenbauer expansions [Wendland, 2004, Ch. 12] gives $\limsup_{\ell\to\infty}|\lambda_{\ell}|^{1/\ell}\leq 1/\rho$. Letting $\rho\uparrow\rho_{\star}$ yields $\limsup_{\ell\to\infty}|\lambda_{\ell}|^{1/\ell}\leq 1/\rho_{\star}$, that is, $|\lambda_{\ell}|=O\bigl((\rho_{\star}-\eta)^{-\ell}\bigr)$ for every $\eta\in(0,\rho_{\star}-1)$.

Matching lower bound. Decompose

\[
\kappa(t)\;=\;-\frac{A}{t-t_{\star}}+g(t),\qquad A:=\frac{(t_{\star})^{2}}{2}>0.
\]

A direct computation gives $g(t)=-(t+t_{\star})/2$, a polynomial of degree $1$, so the only Bernstein-ellipse-bounded contributions to the Gegenbauer coefficients of $\kappa$ for $\ell\geq 2$ come from the simple-pole term. The Gegenbauer generating identity is

\[
\frac{1}{(1-2tz+z^{2})^{\nu}}=\sum_{\ell\geq 0}C_{\ell}^{(\nu)}(t)\,z^{\ell},\qquad\nu=(d-2)/2,\quad|z|<\min\!\bigl(z_{\star},z_{\star}^{-1}\bigr)=1/\rho_{\star}.
\]

The denominator factors as $(1-2tz+z^{2})=(1-z\,z_{\star}^{-1})(1-z\,z_{\star})$ when $t=(z_{\star}+z_{\star}^{-1})/2$, so the simple pole at $t=t_{\star}$ in $t$-coordinates pulls back to a simple pole at $z=z_{\star}^{-1}=1/\rho_{\star}$ in the generating-function variable. Standard residue analysis at $z=1/\rho_{\star}$ then gives the Gegenbauer coefficients of $-A/(t-t_{\star})$ as $-A\,b_{\ell}^{(d)}\rho_{\star}^{-\ell}$ with $b_{\ell}^{(d)}=\Theta(1)$ (the Gegenbauer-weight residue evaluated at $z_{\star}^{-1}$, which converges to a positive constant as $\ell\to\infty$). Hence $|\lambda_{\ell}|=\Omega(\rho_{\star}^{-\ell})$. ∎
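To see the decay rate numerically, the sketch below treats the circle case $d=2$ (the Chebyshev case mentioned in the strategy paragraph), where the degree-$\ell$ eigenvalue with respect to arc length is $\int_{0}^{2\pi}\kappa(\cos\theta)\cos(\ell\theta)\,d\theta$. That normalisation, the trapezoid rule, and the value $\varepsilon=0.5$ are our assumptions for illustration. For $\ell\geq 2$ only the pole term contributes, and the classical integral $\int_{0}^{2\pi}\cos(\ell\theta)/(a-\cos\theta)\,d\theta=2\pi\,(a-\sqrt{a^{2}-1})^{\ell}/\sqrt{a^{2}-1}$ gives a closed form to compare against.

```python
import math

def kappa(t, eps):
    """Zonal profile t^2 / (eps + 2 - 2t)."""
    return t * t / (eps + 2.0 - 2.0 * t)

def circle_eigenvalue(ell, eps, m=2048):
    """Approximate int_0^{2 pi} kappa(cos th) cos(ell th) dth by the periodic trapezoid rule."""
    h = 2.0 * math.pi / m
    return h * sum(kappa(math.cos(j * h), eps) * math.cos(ell * j * h) for j in range(m))

eps = 0.5
t_star = 1.0 + eps / 2.0
rho_star = t_star + math.sqrt(t_star ** 2 - 1.0)   # equals 2 for eps = 0.5

degrees = list(range(2, 12))
lams = [circle_eigenvalue(ell, eps) for ell in degrees]

# Successive ratios lambda_l / lambda_{l+1}; these should match rho_star.
ratios = [lams[i] / lams[i + 1] for i in range(len(lams) - 1)]

# Closed form of the pole contribution on the circle, valid for ell >= 2.
A = t_star ** 2 / 2.0
closed = [2.0 * math.pi * A * rho_star ** (-ell) / math.sqrt(t_star ** 2 - 1.0)
          for ell in degrees]
```

On the circle the successive ratios equal $\rho_{\star}$ up to quadrature error, which is the geometric rate asserted by the theorem.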

Corollary 10 (Logarithmic effective dimension on $\mathbb{S}^{d-1}$).

The effective dimension of $k_{\text{ⵟ}}$ on $\mathbb{S}^{d-1}$ satisfies

\[
\mathcal{N}_{\text{ⵟ}}(\lambda)\;:=\;\operatorname{tr}\!\bigl(T_{k_{\text{ⵟ}}}(T_{k_{\text{ⵟ}}}+\lambda I)^{-1}\bigr)\;=\;\Theta\!\bigl((\log(1/\lambda))^{d-1}\bigr)\qquad(\lambda\to 0^{+}).
\]

Proof.

Set $\ell_{\lambda}:=\lfloor\log(1/\lambda)/\log\rho_{\star}\rfloor$. Each eigenvalue $\lambda_{\ell}$ has multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$, so $\mathcal{N}_{\text{ⵟ}}(\lambda)=\sum_{\ell\geq 0}N(\ell,d)\,\lambda_{\ell}/(\lambda_{\ell}+\lambda)$. By Theorem 21, $\lambda_{\ell}\geq\lambda$ exactly when $\ell\leq\ell_{\lambda}+O(1)$. The head contributes $\sum_{\ell\leq\ell_{\lambda}}N(\ell,d)\cdot\Theta(1)=\Theta(\ell_{\lambda}^{d-1})$, using $\sum_{\ell\leq L}\ell^{d-2}=\Theta(L^{d-1})$. The tail contributes $\sum_{\ell>\ell_{\lambda}}N(\ell,d)\,\rho_{\star}^{-\ell}/\lambda=O(\ell_{\lambda}^{d-2})$ by geometric summation. Adding the two gives the displayed rate. ∎
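The head/tail split can be illustrated on a model spectrum. The sketch below assumes eigenvalues exactly equal to $\rho^{-\ell}$ with $\rho=2$ and the $d=3$ multiplicity $2\ell+1$; both are simplifications of ours, since the true spectrum carries prefactors that only shift the crossover index by $O(1)$.

```python
import math

def effective_dimension(lam, rho, multiplicity, max_degree=400):
    """Sum over l of N(l) * mu_l / (mu_l + lam) for the model spectrum mu_l = rho^{-l}."""
    total = 0.0
    for ell in range(max_degree + 1):
        mu = rho ** (-ell)
        total += multiplicity(ell) * mu / (mu + lam)
    return total

def mult_s2(ell):
    """Multiplicity of degree-l spherical harmonics on S^2 (d = 3)."""
    return 2 * ell + 1

rho = 2.0                      # assumed decay base, illustrative
d = 3
rows = []
for lam in (1e-4, 1e-6, 1e-8, 1e-10):
    n_eff = effective_dimension(lam, rho, mult_s2)
    ell_lam = math.log(1.0 / lam) / math.log(rho)   # crossover degree
    rows.append((lam, n_eff, n_eff / ell_lam ** (d - 1)))
```

The last entry of each row, the ratio of the effective dimension to $\ell_{\lambda}^{d-1}$, stays of order one as $\lambda$ shrinks over six decades, which is the logarithmic growth claimed in the corollary.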

Remark 9 (Comparison with classical kernels and downstream rate).

The polynomial kernel $(\mathbf{u}^{\top}\mathbf{v})^{2}$ has finite rank $d(d+1)/2$ on $\mathbb{S}^{d-1}$ (only $\lambda_{0}$ and $\lambda_{2}$ are non-zero, since $t^{2}$ is an even polynomial of degree $2$); the IMQ kernel $h_{\varepsilon}$ shares the singularity at $t_{\star}$ on the spherical reduction, hence the same exponential rate $\rho_{\star}^{-\ell}$; the Gaussian RBF has super-exponential decay $O(C^{\ell}/\ell!)$. The unbiased Yat kernel on the sphere therefore inherits the asymptotic IMQ rate, while the polynomial numerator $t^{2}$ alters only the prefactor and contributes finite-rank low-frequency content. The exponential decay drives a near-parametric $\widetilde{O}(n^{-1})$ rate for both ERM and KRR on the sphere; see Appendix L.

Appendix L Fast Generalization Rates via Eigenvalue Decay

The single-layer Rademacher bound of Theorem 6 is the worst-case rate $O(n^{-1/2})$, optimal only when no spectral information is available. Mercer eigenvalue decay yields fast rates via local Rademacher complexity [Bartlett et al., 2005] for empirical-risk minimisation, and via Caponnetto–De Vito bias-variance bounds [Caponnetto and De Vito, 2007, Steinwart et al., 2009] for kernel ridge regression. On the sphere, the exponential decay of Theorem 21 drives both routes to a near-parametric $\widetilde{O}(n^{-1})$ rate.

L.1 Local Rademacher upper bound for ERM

Theorem 22 (Fast rate via Mercer eigenvalue decay).

Let $\mathcal{X}\subset\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ be compact with $\sup_{\mathcal{X}}\|\mathbf{x}\|\leq R$, and let $\rho$ be a Borel probability measure on $\mathcal{X}$. Suppose the Mercer eigenvalues $\{\mu_{n}\}_{n\geq 1}$ of $T_{k_{0,\varepsilon}}$ on $L^{2}(\rho)$ satisfy $\mu_{n}\leq c_{0}\,n^{-\alpha}$ for some $\alpha>1$. Under realisability $f^{\star}\in B_{\mathcal{H}_{0,\varepsilon}}(B):=\{f\in\mathcal{H}_{0,\varepsilon}:\|f\|\leq B\}$ and i.i.d. samples with conditionally sub-Gaussian noise of variance proxy $\sigma^{2}$, the empirical-risk minimiser $\hat{f}_{n}$ over the ball satisfies

\[
\mathbb{E}\bigl[R(\hat{f}_{n})-R(f^{\star})\bigr]\;\leq\;C\,\Bigl(\frac{B^{2}R^{4}}{\varepsilon\,n}\Bigr)^{\alpha/(\alpha+1)},
\]

for an absolute constant $C>0$.

Proof.

By the diagonal of $k_{\text{ⵟ}}$, $\sup_{\mathcal{X}}k_{\text{ⵟ}}(\mathbf{x},\mathbf{x})\leq R^{4}/\varepsilon$ (the unbiased section is the $b=0$ case of the diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in Theorem 6). Apply Bartlett et al. [2005, Theorem 3.3] to the function class $B_{\mathcal{H}_{0,\varepsilon}}(B)$: the local Rademacher complexity at radius $r$ is bounded by $\bar{\psi}(r)\leq\sqrt{(2/n)\sum_{j\geq 1}\min(r,B^{2}\mu_{j})}$, and the fixed point $r^{\star}$ of the equation $r=\bar{\psi}(r)$ satisfies $r^{\star}\leq C^{\prime}(B^{2}R^{4}/(\varepsilon n))^{\alpha/(\alpha+1)}$ under the polynomial decay $\mu_{j}\leq c_{0}\,j^{-\alpha}$. The local Rademacher excess-risk bound then yields the displayed expectation inequality. ∎
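The fixed-point step can be unpacked as follows. This is a heuristic sketch that compares the sum with an integral and does not track constants; the crossover index $j_{\star}$ is notation introduced here, and the summation index is written $j$ to keep it apart from the sample size $n$. It recovers the exponent $\alpha/(\alpha+1)$ in $n$, with the kernel scale entering through $c_{0}$.

```latex
% Heuristic fixed-point computation under \mu_j \le c_0 j^{-\alpha}, \alpha > 1.
% j_\star is the index at which the two arguments of the min cross.
\begin{align*}
j_\star &:= (B^2 c_0 / r)^{1/\alpha}, \\
\sum_{j\ge 1}\min(r, B^2\mu_j)
  &\lesssim r\,j_\star + B^2 c_0 \int_{j_\star}^{\infty} s^{-\alpha}\,ds
   = \frac{\alpha}{\alpha-1}\,(B^2 c_0)^{1/\alpha}\, r^{1-1/\alpha}, \\
r = \bar\psi(r)
  &\;\Longrightarrow\; r^{2} \lesssim \frac{2}{n}\cdot\frac{\alpha}{\alpha-1}\,(B^2 c_0)^{1/\alpha}\, r^{1-1/\alpha}
  \;\Longrightarrow\; r^{\star} \lesssim \Bigl(\frac{2\alpha}{\alpha-1}\cdot\frac{(B^2 c_0)^{1/\alpha}}{n}\Bigr)^{\alpha/(\alpha+1)}.
\end{align*}
```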

Corollary 11 (Near-parametric ERM rate on the sphere).

For $\mathcal{X}=\mathbb{S}^{d-1}$ and $\rho$ the uniform surface measure, Theorem 21 gives $\mu_{\ell}=\Theta(\rho_{\star}^{-\ell})$ with multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$. The local Rademacher fixed point becomes $r^{\star}=\Theta((\log n)^{d-1}/n)$, yielding

\[
\mathbb{E}\bigl[R(\hat{f}_{n})-R(f^{\star})\bigr]\;=\;\widetilde{O}(n^{-1}),
\]

where $\widetilde{O}$ hides factors polynomial in $\log n$.

Proof.

On $\mathbb{S}^{d-1}$ the kernel diagonal is $1/\varepsilon$. Substituting the Funk–Hecke spectrum into the local Rademacher fixed point, $\sum_{j\geq 1}\min(r,B^{2}\mu_{j})\asymp\sum_{\ell\geq 0}N(\ell,d)\min(r,B^{2}\rho_{\star}^{-\ell})$. The crossover index $\ell_{\star}$ is determined by $\rho_{\star}^{-\ell_{\star}}=r/B^{2}$, giving $\ell_{\star}=\Theta(\log(B^{2}/r))$. The head $\ell\leq\ell_{\star}$ contributes $\Theta(r\,\ell_{\star}^{d-1})$ and the tail $\ell>\ell_{\star}$ contributes $\Theta(r\,\ell_{\star}^{d-2})$, hence $\bar{\psi}(r)\leq\sqrt{(C/n)\,r\,(\log(B^{2}/r))^{d-1}}$. The fixed-point equation $r=\bar{\psi}(r)$ self-consistently gives $r^{\star}=\Theta((\log n)^{d-1}/n)$. ∎
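The self-consistent solution can be checked numerically on a model spectrum. The sketch below assumes $d=3$ (multiplicity $2\ell+1$), eigenvalues exactly $\rho^{-\ell}$ with $\rho=2$, and $B=1$; all of these are illustrative choices of ours. It solves $r=\bar{\psi}(r)$ by bisection on a logarithmic scale and records $n\,r^{\star}/(\log n)^{d-1}$.

```python
import math

def psi_bar(r, n, rho, B=1.0, max_degree=400):
    """sqrt((2/n) * sum_l N(l) * min(r, B^2 rho^{-l})) with N(l) = 2l + 1 (d = 3)."""
    total = 0.0
    for ell in range(max_degree + 1):
        total += (2 * ell + 1) * min(r, B * B * rho ** (-ell))
    return math.sqrt(2.0 * total / n)

def fixed_point(n, rho, B=1.0, lo=1e-12, hi=10.0, iters=200):
    """Bisection for the positive solution of r = psi_bar(r)."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint: r spans many decades
        if psi_bar(mid, n, rho, B) > mid:
            lo = mid                      # psi_bar still above the diagonal
        else:
            hi = mid
    return math.sqrt(lo * hi)

rho, d = 2.0, 3                            # illustrative decay base and dimension
table = []
for n in (10**3, 10**4, 10**5, 10**6):
    r_star = fixed_point(n, rho)
    table.append((n, r_star, n * r_star / math.log(n) ** (d - 1)))
```

The normalised column stays of order one across three decades of $n$, whereas $\sqrt{n}\,r^{\star}$ shrinks, which is the gap between the fast rate and the worst-case $O(n^{-1/2})$ rate.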

L.2 Refined KRR bias-variance bound

Theorem 23 (Refined Yat KRR excess risk).

Under the setup above with conditionally sub-Gaussian noise of variance proxy $\sigma^{2}$ and source condition $f^{\star}\in\mathcal{H}_{0,\varepsilon}$, the kernel ridge regression estimator $\hat{f}_{\lambda}$ with regularisation $\lambda>0$ satisfies the bias-variance bound

\[
\mathbb{E}\bigl\|\hat{f}_{\lambda}-f^{\star}\bigr\|_{L^{2}(\rho)}^{2}\;\leq\;\lambda\,\|f^{\star}\|_{\mathcal{H}_{0,\varepsilon}}^{2}\;+\;\frac{\sigma^{2}}{n}\,\mathcal{N}_{\text{ⵟ}}(\lambda)\;+\;\frac{C\,\sigma^{2}R^{4}/\varepsilon}{n^{2}\lambda},
\]

where $\mathcal{N}_{\text{ⵟ}}(\lambda):=\operatorname{tr}\!\bigl(T_{k_{0,\varepsilon}}(T_{k_{0,\varepsilon}}+\lambda I)^{-1}\bigr)$ is the Yat effective dimension.

Proof.

The bounded diagonal $\kappa^{2}:=\sup_{\mathcal{X}}k_{\text{ⵟ}}(\mathbf{x},\mathbf{x})\leq R^{4}/\varepsilon$ together with the source condition $f^{\star}\in\mathcal{H}_{0,\varepsilon}$ places us inside the bias-variance template of Caponnetto and De Vito [2007] (see also Steinwart et al., 2009). Their Theorem 1 yields, with probability at least $1-\delta$,

\[
\|\hat{f}_{\lambda}-f^{\star}\|_{L^{2}(\rho)}^{2}\;\leq\;C_{1}\,\lambda\|f^{\star}\|_{\mathcal{H}_{k}}^{2}\;+\;C_{2}\,\frac{\sigma^{2}\,\mathcal{N}_{k}(\lambda)}{n}\;+\;C_{3}\,\frac{\sigma^{2}\kappa^{2}}{n^{2}\lambda}\log^{2}(1/\delta),
\]

where the leading variance term $\mathcal{N}_{k}(\lambda)/n$ is scale-invariant under $k\to ck$ and the lower-order term $\kappa^{2}/(n^{2}\lambda)$ tracks the kernel scale. Substituting $\kappa^{2}\leq R^{4}/\varepsilon$ for $k=k_{0,\varepsilon}$ and absorbing the high-probability $\log$ factor into the constants gives the displayed expectation bound. ∎

Theorem 24 (Near-parametric KRR rate on the sphere).

Let $\mathcal{X}=\mathbb{S}^{d-1}$ with $\rho$ the uniform surface measure, and choose $\lambda^{\star}=\Theta((\log n)^{d-1}/n)$. Then $\mathbb{E}\|\hat{f}_{\lambda^{\star}}-f^{\star}\|_{L^{2}(\rho)}^{2}=\widetilde{O}(n^{-1})$, with logarithmic factor $(\log n)^{d-1}$.

Proof.

On $\mathbb{S}^{d-1}$, $R=1$, so the variance constant is $\sigma^{2}/\varepsilon$. By Corollary 10, $\mathcal{N}_{\text{ⵟ}}(\lambda)=\Theta((\log(1/\lambda))^{d-1})$. Theorem 23 gives

\[
\mathbb{E}\|\hat{f}_{\lambda}-f^{\star}\|^{2}\;\leq\;\lambda\|f^{\star}\|_{\mathcal{H}_{0,\varepsilon}}^{2}+\frac{C}{\varepsilon n}\,(\log(1/\lambda))^{d-1}.
\]

Optimising in $\lambda$, bias and variance balance at $\lambda=\Theta((\log(1/\lambda))^{d-1}/n)$, with self-consistent solution $\lambda^{\star}=\Theta((\log n)^{d-1}/n)$. Substituting gives the displayed rate. ∎
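The balance between bias and variance can be tabulated on a model spectrum. The sketch below evaluates only the two leading terms $\lambda\|f^{\star}\|^{2}+\sigma^{2}\mathcal{N}(\lambda)/n$ at $\lambda^{\star}=(\log n)^{d-1}/n$, omitting the lower-order remainder. The spectrum $\rho^{-\ell}$ with $\rho=2$, the $d=3$ multiplicity $2\ell+1$, and unit norm and noise level are assumptions made for illustration.

```python
import math

def effective_dimension(lam, rho, max_degree=400):
    """Effective dimension for mu_l = rho^{-l} with multiplicity 2l + 1 (d = 3)."""
    return sum((2 * ell + 1) * rho ** (-ell) / (rho ** (-ell) + lam)
               for ell in range(max_degree + 1))

def krr_bound(n, lam, rho, f_norm=1.0, sigma=1.0):
    """Leading terms lam * ||f||^2 + sigma^2 * N(lam) / n of the bias-variance bound."""
    return lam * f_norm ** 2 + sigma ** 2 * effective_dimension(lam, rho) / n

rho, d = 2.0, 3                      # illustrative decay base and dimension
results = []
for n in (10**3, 10**4, 10**5, 10**6):
    lam_star = math.log(n) ** (d - 1) / n
    bound = krr_bound(n, lam_star, rho)
    results.append((n, lam_star, bound, n * bound / math.log(n) ** (d - 1)))
```

The last column, the bound rescaled by $n/(\log n)^{d-1}$, stays of order one, in line with the $\widetilde{O}(n^{-1})$ rate.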

Remark 10 (Comparison with Sobolev RKHS).

A Sobolev-$s$ RKHS on $\mathbb{S}^{d-1}$ with $s>(d-1)/2$ has KRR rate $O(n^{-2s/(2s+d-1)})$, sub-parametric in $d$ [Caponnetto and De Vito, 2007]. The Yat kernel’s analytic profile gives the strictly faster $\widetilde{O}(n^{-1})$ rate uniformly in $s$ for targets in $\mathcal{H}_{0,\varepsilon}$, which is a strictly smaller class than any Sobolev space on the same domain: the comparison reflects the higher analytic regularity of the Yat native space rather than a uniform improvement on a fixed function class.

Remark 11 (ERM and KRR are companion routes).

Corollary 11 (norm-constrained ERM via local Rademacher complexity) and Theorem 24 (KRR via Caponnetto–De Vito) deliver the same $\widetilde{O}(n^{-1})$ rate via different estimators; both rest on the exponential eigenvalue decay of Theorem 21.

Remark 12 (Reading the variance constant).

The leading variance term in Theorem 23 is $\sigma^{2}\mathcal{N}_{\text{ⵟ}}(\lambda)/n$, scale-invariant under $k\to ck$; the kernel diagonal $\kappa^{2}=R^{4}/\varepsilon$ enters only as a polynomial prefactor of the lower-order $1/(n^{2}\lambda)$ remainder. A data-dependent refinement replacing $R^{4}$ with the empirical fourth moment in the remainder term would require a separate matrix-Bernstein argument and is not implied by Theorem 23.

Appendix M Quantitative MMD and Two-Sample Testing

Corollary 1 establishes that $k_{b,\varepsilon}$ is characteristic on every compact $\mathcal{X}$ for $b>0$: the kernel mean embedding $\mu\mapsto\int k_{b,\varepsilon}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})$ is injective on Borel probability measures. We strengthen this qualitative property to a quantitative sample-complexity statement for two-sample testing, with the kernel-specific constant $(R^{2}+b)^{2}/\varepsilon$ on the diagonal.

Theorem 25 (Empirical MMD convergence rate for $k_{b,\varepsilon}$).

Let $\mu,\nu$ be Borel probability measures on a compact set $\mathcal{X}\subset\mathbb{R}^{d}$ with $\sup_{\mathbf{x}\in\mathcal{X}}\|\mathbf{x}\|\leq R$, and let $\mathbf{X}_{1},\dots,\mathbf{X}_{n}$ be i.i.d. samples from $\mu$ and $\mathbf{Y}_{1},\dots,\mathbf{Y}_{n}$ i.i.d. samples from $\nu$, all mutually independent. The unbiased U-statistic estimator

\[
\widehat{\mathrm{MMD}}^{2}\;:=\;\frac{1}{n(n-1)}\sum_{i\neq j}k_{b,\varepsilon}(\mathbf{X}_{i},\mathbf{X}_{j})+\frac{1}{n(n-1)}\sum_{i\neq j}k_{b,\varepsilon}(\mathbf{Y}_{i},\mathbf{Y}_{j})-\frac{2}{n^{2}}\sum_{i,j}k_{b,\varepsilon}(\mathbf{X}_{i},\mathbf{Y}_{j})
\]

satisfies

\[
\mathbb{E}\bigl[\bigl(\widehat{\mathrm{MMD}}^{2}-\mathrm{MMD}^{2}\bigr)^{2}\bigr]\;\leq\;\frac{C\,(R^{2}+b)^{4}}{\varepsilon^{2}\,n}
\]

for an absolute constant $C>0$, hence $|\widehat{\mathrm{MMD}}^{2}-\mathrm{MMD}^{2}|=O_{p}\bigl((R^{2}+b)^{2}/(\varepsilon\sqrt{n})\bigr)$.

Proof.

$\widehat{\mathrm{MMD}}^{2}$ is the standard unbiased U-statistic estimator of $\mathrm{MMD}^{2}$ [Gretton et al., 2012]. By Proposition 4, $\sup_{\mathbf{x},\mathbf{w}\in\mathcal{X}}|k_{b,\varepsilon}(\mathbf{x},\mathbf{w})|\leq(R^{2}+b)^{2}/\varepsilon$ on the closed ball of radius $R$. The standard variance bound for a degree-$2$ U-statistic with bounded kernel gives $\mathrm{Var}(\widehat{\mathrm{MMD}}^{2})\leq C\sup|k|^{2}/n=C(R^{2}+b)^{4}/(\varepsilon^{2}n)$, so the mean-squared error is bounded as displayed. ∎
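A direct transcription of the estimator makes the three sums concrete. The sketch below is a plain-Python illustration with hypothetical helper names, not an optimised implementation; the tiny four-point example is chosen so the value can be verified by hand (each within-sample average and the scaled cross term all equal $1/3$).

```python
def yat_kernel(w, x, b, eps):
    """Yat kernel (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)

def mmd2_unbiased(X, Y, b, eps):
    """Estimate of MMD^2 from equal-size samples X and Y, as in the displayed formula."""
    n = len(X)
    assert len(Y) == n and n >= 2
    k = lambda p, q: yat_kernel(p, q, b, eps)
    xx = sum(k(X[i], X[j]) for i in range(n) for j in range(n) if i != j)
    yy = sum(k(Y[i], Y[j]) for i in range(n) for j in range(n) if i != j)
    xy = sum(k(X[i], Y[j]) for i in range(n) for j in range(n))
    return xx / (n * (n - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (n * n)

def diagonal_bound(R, b, eps):
    """Uniform bound (R^2 + b)^2 / eps on |k| over the ball of radius R."""
    return (R * R + b) ** 2 / eps

# Hand-checkable example: two antipodal pairs on the unit circle, b = eps = 1.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[-1.0, 0.0], [0.0, -1.0]]
estimate = mmd2_unbiased(X, Y, b=1.0, eps=1.0)   # 1/3 + 1/3 - 1/3
```

Since every kernel value is bounded by `diagonal_bound(R, b, eps)`, the estimate is bounded in absolute value by four times that constant, which is the quantity driving the variance bound above.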

Corollary 12 (Sample complexity for kernel two-sample testing).

Fix significance level $\beta\in(0,1)$ and threshold $\eta>0$. The test that rejects $\mu=\nu$ when $\widehat{\mathrm{MMD}}^{2}>\eta$ has Type-I error $\leq\beta$ on the null and power $\geq 1-\beta$ on alternatives with $\mathrm{MMD}^{2}\geq 2\eta$, provided

\[
n\;\geq\;\frac{C\,(R^{2}+b)^{4}}{\varepsilon^{2}\,\eta^{2}\,\beta}.
\]

Proof.

Apply Theorem 25 with Chebyshev’s inequality in both the null regime ($\mathrm{MMD}^{2}=0$, deviation $\eta$) and the alternative regime ($\mathrm{MMD}^{2}\geq 2\eta$, deviation $\eta$); the deviation probability is bounded by $C(R^{2}+b)^{4}/(\varepsilon^{2}\eta^{2}n)\leq\beta$ under the displayed sample-size lower bound. ∎
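The sample-size condition is simple arithmetic, and a small helper shows how it scales. The constant `C` below is a placeholder for the unspecified absolute constant (the value `1.0` is our assumption), so only ratios between calls are meaningful.

```python
import math

def required_samples(R, b, eps, eta, beta, C=1.0):
    """Smallest integer n with n >= C (R^2 + b)^4 / (eps^2 eta^2 beta).

    C stands in for the unspecified absolute constant; C = 1.0 is a placeholder.
    """
    return math.ceil(C * (R * R + b) ** 4 / (eps ** 2 * eta ** 2 * beta))

base = required_samples(R=1.0, b=1.0, eps=0.5, eta=0.5, beta=0.05)
half_eps = required_samples(R=1.0, b=1.0, eps=0.25, eta=0.5, beta=0.05)
```

Halving $\varepsilon$ roughly quadruples the requirement, reflecting the $1/\varepsilon^{2}$ dependence.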

Remark 13 (Trade-off in $\varepsilon$ and $b$).

The sample complexity scales as $1/\varepsilon^{2}$ and as $(R^{2}+b)^{4}$: a smaller $\varepsilon$ or a larger $b$ inflates the diagonal $(R^{2}+b)^{2}/\varepsilon$ and thereby the variance constant of the U-statistic. The same diagonal drives the Rademacher constant in Theorem 6, while in Theorem 23 the kernel diagonal enters only as a prefactor on the lower-order remainder, with the leading variance term controlled by the effective dimension $\mathcal{N}_{\text{ⵟ}}(\lambda)$ alone.

Appendix N The Yat Neural Tangent Kernel at Infinite Width

The biased atom $g_{\varepsilon}(\cdot;\mathbf{w},b)$ is the natural neural unit attached to $k_{b,\varepsilon}$. We compute the infinite-width Neural Tangent Kernel [Jacot et al., 2018, Arora et al., 2019] of a width-$m$ shared-bias Yat layer and show that the limit is itself dominated by an IMQ kernel via the bias decomposition, giving universality on every compact $\mathcal{X}$ for $b>0$.

Setup.

Fix $b\geq 0$, $\varepsilon>0$, $\sigma_{w}>0$. We adopt the standard NTK parametrisation [Jacot et al., 2018]: the width-$m$ shared-$(b,\varepsilon)$ Yat layer is

\[
f_{m}(\mathbf{x};\boldsymbol{\theta})\;:=\;\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\alpha_{j}\,g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b),\qquad\boldsymbol{\theta}=(\mathbf{w}_{1},\dots,\mathbf{w}_{m},\alpha_{1},\dots,\alpha_{m}),
\]

with i.i.d. initialisation $\mathbf{w}_{j}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})$ and $\alpha_{j}\sim\mathcal{N}(0,1)$. The empirical NTK is

\[
\Theta_{m}(\mathbf{x},\mathbf{x}^{\prime})\;:=\;\sum_{j=1}^{m}\Bigl[\bigl\langle\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}),\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}^{\prime})\bigr\rangle+\partial_{\alpha_{j}}f_{m}(\mathbf{x})\,\partial_{\alpha_{j}}f_{m}(\mathbf{x}^{\prime})\Bigr].
\]

Theorem 26 (Yat NTK closed form).

As $m\to\infty$, $\Theta_{m}(\mathbf{x},\mathbf{x}^{\prime})\to\Theta(\mathbf{x},\mathbf{x}^{\prime})$ in probability on every compact $\mathcal{X}\subset\mathbb{R}^{d}$, with

\begin{align*}
\Theta(\mathbf{x},\mathbf{x}^{\prime})\;&=\;\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})+\Theta_{w}(\mathbf{x},\mathbf{x}^{\prime}),\\
\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\;&:=\;\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})}\!\bigl[g_{\varepsilon}(\mathbf{x};\mathbf{w},b)\,g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w},b)\bigr],\\
\Theta_{w}(\mathbf{x},\mathbf{x}^{\prime})\;&:=\;\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})}\!\bigl\langle\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w},b),\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w},b)\bigr\rangle.
\end{align*}

Both summands are continuous PSD kernels on $\mathcal{X}\times\mathcal{X}$, hence so is $\Theta$.

Proof.

Under the NTK parametrisation, $\partial_{\alpha_{j}}f_{m}(\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)/\sqrt{m}$, so

\[
\sum_{j=1}^{m}\partial_{\alpha_{j}}f_{m}(\mathbf{x})\,\partial_{\alpha_{j}}f_{m}(\mathbf{x}^{\prime})\;=\;\frac{1}{m}\sum_{j=1}^{m}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)\,g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w}_{j},b).
\]

The summands are i.i.d. with mean $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})$ and bounded by Proposition 4 on compact $\mathcal{X}$; the law of large numbers gives convergence in probability to $\Theta_{\alpha}$. Similarly, $\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x})=\alpha_{j}\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)/\sqrt{m}$, hence

\[
\sum_{j=1}^{m}\bigl\langle\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}),\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}^{\prime})\bigr\rangle\;=\;\frac{1}{m}\sum_{j=1}^{m}\alpha_{j}^{2}\,\bigl\langle\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b),\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w}_{j},b)\bigr\rangle.
\]

By independence of $\alpha_{j}$ and $\mathbf{w}_{j}$ together with $\mathbb{E}\alpha_{j}^{2}=1$, the law of large numbers gives convergence in probability to $\Theta_{w}$. PSD of both summands follows from the Gram representation: $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})=\langle g_{\varepsilon}(\mathbf{x};\cdot,b),g_{\varepsilon}(\mathbf{x}^{\prime};\cdot,b)\rangle_{L^{2}(\mu_{w})}$, where $\mu_{w}$ is the Gaussian measure, and $\Theta_{w}$ is the same statement applied coordinate-wise to the gradient. Continuity follows from continuity of $g_{\varepsilon}$ and $\nabla_{\mathbf{w}}g_{\varepsilon}$ on compact sets together with dominated convergence under $\mu_{w}$. ∎
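The two blocks of the empirical NTK can be written out directly. The sketch below is a finite-width illustration in plain Python; the weight gradient $\nabla_{\mathbf{w}}g_{\varepsilon}=2(\mathbf{w}^{\top}\mathbf{x}+b)\,\mathbf{x}/D+2(\mathbf{w}^{\top}\mathbf{x}+b)^{2}(\mathbf{x}-\mathbf{w})/D^{2}$ with $D=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ is obtained from the quotient rule, and the width, seed, and test points are arbitrary choices of ours.

```python
import math
import random

def g(x, w, b, eps):
    """Yat atom g_eps(x; w, b) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (dist2 + eps)

def grad_w_g(x, w, b, eps):
    """Gradient of the atom with respect to the weight vector w (quotient rule)."""
    dot = sum(wi * xi for wi, xi in zip(w, x)) + b
    D = sum((xi - wi) ** 2 for wi, xi in zip(w, x)) + eps
    return [2.0 * dot * xi / D + 2.0 * dot * dot * (xi - wi) / (D * D)
            for wi, xi in zip(w, x)]

def empirical_ntk(x, xp, W, alpha, b, eps):
    """Finite-width NTK of f_m = m^{-1/2} sum_j alpha_j g(x; w_j, b)."""
    m = len(W)
    total = 0.0
    for w, a in zip(W, alpha):
        gx, gxp = grad_w_g(x, w, b, eps), grad_w_g(xp, w, b, eps)
        total += a * a * sum(p * q for p, q in zip(gx, gxp))   # weight block
        total += g(x, w, b, eps) * g(xp, w, b, eps)             # alpha block
    return total / m

rng = random.Random(0)
d, m, b, eps, sigma_w = 3, 2000, 1.0, 0.5, 1.0   # illustrative values
W = [[rng.gauss(0.0, sigma_w) for _ in range(d)] for _ in range(m)]
alpha = [rng.gauss(0.0, 1.0) for _ in range(m)]
x, xp = [0.3, -0.2, 0.5], [0.1, 0.4, -0.3]
theta_xxp = empirical_ntk(x, xp, W, alpha, b, eps)
theta_xpx = empirical_ntk(xp, x, W, alpha, b, eps)
theta_xx = empirical_ntk(x, x, W, alpha, b, eps)
theta_xpxp = empirical_ntk(xp, xp, W, alpha, b, eps)
```

Because the empirical kernel is a sum of inner products of per-unit feature vectors, it is symmetric and satisfies the Cauchy–Schwarz inequality at any width, mirroring the Gram-representation argument used for the limit.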

Corollary 13 (Universality of the Yat NTK for $b>0$).

For every $b>0$, the Yat NTK $\Theta$ is universal on every compact $\mathcal{X}\subset\mathbb{R}^{d}$: the RKHS $\mathcal{H}_{\Theta}|_{\mathcal{X}}$ is dense in $C(\mathcal{X})$ in the sup norm.

Proof.

Step 1 (Gram representation of $\Theta_{\alpha}$). The proof of Theorem 26 established

\[
\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\;=\;\bigl\langle g_{\varepsilon}(\mathbf{x};\,\cdot\,,b),\,g_{\varepsilon}(\mathbf{x}^{\prime};\,\cdot\,,b)\bigr\rangle_{L^{2}(\mu_{w})},
\]

where $\mu_{w}=\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})$. The map $\Phi:\mathbf{x}\mapsto g_{\varepsilon}(\mathbf{x};\,\cdot\,,b)$ is a continuous feature map from $\mathcal{X}$ into $L^{2}(\mu_{w})$ (continuity follows from continuity of $g_{\varepsilon}$ in $\mathbf{x}$ and the Proposition 4 bound, which makes the family uniformly $L^{2}$-integrable on compact $\mathcal{X}$).

Step 2 ($\Theta_{\alpha}$ is integrally strictly PD on $\mathcal{X}$). Let $\mu$ be a finite non-zero signed Borel measure on $\mathcal{X}$. By Tonelli,

\[
\int_{\mathcal{X}}\!\int_{\mathcal{X}}\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\,d\mu(\mathbf{x})\,d\mu(\mathbf{x}^{\prime})\;=\;\int_{\mathbb{R}^{d}}F(\mathbf{w})^{2}\,d\mu_{w}(\mathbf{w}),\qquad F(\mathbf{w}):=\int_{\mathcal{X}}g_{\varepsilon}(\mathbf{x};\mathbf{w},b)\,d\mu(\mathbf{x}).
\]

Suppose this integral vanishes. Then $F(\mathbf{w})=0$ for $\mu_{w}$-almost every $\mathbf{w}\in\mathbb{R}^{d}$; continuity of $F$ in $\mathbf{w}$ (dominated convergence) and full support of $\mu_{w}$ extend this to $F\equiv 0$ on $\mathbb{R}^{d}$. In particular, $F|_{\mathcal{X}}\equiv 0$. But $F|_{\mathcal{X}}$ is precisely the kernel mean embedding of $\mu$ under $k_{b,\varepsilon}$ (since $g_{\varepsilon}(\mathbf{x};\mathbf{w},b)=k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$). For $b>0$, Corollary 1(i) states that $k_{b,\varepsilon}$ is characteristic on $\mathcal{X}$ in the strong sense that the kernel mean embedding is injective on finite signed Borel measures on $\mathcal{X}$. Hence $F|_{\mathcal{X}}\equiv 0$ forces $\mu=0$, a contradiction. Therefore $\Theta_{\alpha}$ is integrally strictly PD on $\mathcal{X}$.

Step 3 (Density of $\mathcal{H}_{\Theta_{\alpha}}$ in $C(\mathcal{X})$ via duality). A signed Borel measure $\mu$ on $\mathcal{X}$ annihilates $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ iff $\int_{\mathcal{X}}f\,d\mu=0$ for every $f\in\mathcal{H}_{\Theta_{\alpha}}$. By the reproducing property, $\int_{\mathcal{X}}f\,d\mu=\bigl\langle f,\int_{\mathcal{X}}\Theta_{\alpha}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\bigr\rangle_{\mathcal{H}_{\Theta_{\alpha}}}$, which vanishes for every $f$ iff the element $\int_{\mathcal{X}}\Theta_{\alpha}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\in\mathcal{H}_{\Theta_{\alpha}}$ is zero, equivalently $\int\!\int\Theta_{\alpha}\,d\mu\otimes d\mu=0$. By Step 2 this forces $\mu=0$. Hence $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ has trivial annihilator in $C(\mathcal{X})^{*}$ (signed Radon measures, by Riesz), so $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ is dense in $C(\mathcal{X})$ in the sup norm.

Step 4 (Transfer to $\Theta$). The summand $\Theta_{w}$ is PSD by Theorem 26, so $\Theta=\Theta_{\alpha}+\Theta_{w}\succeq\Theta_{\alpha}$ in the Loewner order on Gram matrices. By the Aronszajn inclusion theorem [Paulsen and Raghupathi, 2016], the inclusion $\mathcal{H}_{\Theta_{\alpha}}\hookrightarrow\mathcal{H}_{\Theta}$ is continuous; density of $\mathcal{H}_{\Theta_{\alpha}}$ in $C(\mathcal{X})$ therefore transfers to $\mathcal{H}_{\Theta}$. ∎

Remark 14 (General principle).

Step 2 above instantiates a general principle: under spherically symmetric initialisation $\mathbf{w}\sim\mu_{w}$ with full support, the $\alpha$-summand of the limiting NTK has the Gram representation $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})=\langle k(\mathbf{x},\cdot),k(\mathbf{x}^{\prime},\cdot)\rangle_{L^{2}(\mu_{w})}$, and integral strict positive definiteness of $\Theta_{\alpha}$ on a compact $\mathcal{X}$ follows from characteristicness of the underlying kernel $k$ on $\mathcal{X}$. Universality of the Yat NTK is therefore a direct consequence of universality of the Yat kernel itself, with no additional spectral hypothesis on $\mu_{w}$ beyond full support.

Appendix O RKHS-Lipschitz Bounds and Certified Adversarial Radius

The single-layer Lipschitz bound of Lemma 7 is parametric in the trained $(\mathbf{w}_{j},\boldsymbol{\alpha},b,\varepsilon)$ and is used for stack-Lipschitz propagation. We complement it with an intrinsic RKHS-Lipschitz bound expressed only in the RKHS norm, derived from the reproducing property and a closed-form bound on the mixed partial of $k_{\text{ⵟ}}$. This RKHS-norm route to certified robustness for kernel classifiers is developed for general kernels by Bietti et al. [2019]; here we instantiate it with the $k_{\text{ⵟ}}$-specific mixed partial.

Lemma 9 (Mixed partial of $k_{\text{ⵟ}}$ on the diagonal).

For $\mathbf{x}\in\mathbb{R}^{d}\setminus\{\mathbf{0}\}$, set

\[
M_{ij}(\mathbf{x})\;:=\;\partial_{x_{i}}\partial_{x^{\prime}_{j}}k_{\text{ⵟ}}(\mathbf{x},\mathbf{x}^{\prime})\Big|_{\mathbf{x}^{\prime}=\mathbf{x}}.
\]

Then

\[
M_{ij}(\mathbf{x})\;=\;\frac{2x_{i}x_{j}+2\,\|\mathbf{x}\|^{2}\delta_{ij}}{\varepsilon}+\frac{2\,\|\mathbf{x}\|^{4}\delta_{ij}}{\varepsilon^{2}},\qquad\operatorname{tr}M(\mathbf{x})\;=\;\frac{2\,\|\mathbf{x}\|^{2}(1+d)}{\varepsilon}+\frac{2d\,\|\mathbf{x}\|^{4}}{\varepsilon^{2}}.
\]

Proof.

Write $N(\mathbf{x},\mathbf{x}^{\prime}):=(\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}$ and $D(\mathbf{x},\mathbf{x}^{\prime}):=\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}+\varepsilon$, so that $k_{\text{ⵟ}}=N/D$. Differentiating in $x_{i}$,

\[
\partial_{x_{i}}k_{\text{ⵟ}}\;=\;\frac{2(\mathbf{x}^{\top}\mathbf{x}^{\prime})\,x^{\prime}_{i}}{D}-\frac{(\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}\cdot 2(x_{i}-x^{\prime}_{i})}{D^{2}}.
\]

Differentiating in $x^{\prime}_{j}$ and noting that any term with an explicit $(x_{i}-x^{\prime}_{i})$ factor vanishes at $\mathbf{x}^{\prime}=\mathbf{x}$, and that $\partial_{x^{\prime}_{j}}(x_{i}-x^{\prime}_{i})=-\delta_{ij}$,

\[
\partial_{x_{i}}\partial_{x^{\prime}_{j}}k_{\text{ⵟ}}\Big|_{\mathbf{x}^{\prime}=\mathbf{x}}\;=\;\frac{2x_{j}x_{i}+2(\mathbf{x}^{\top}\mathbf{x})\delta_{ij}}{\varepsilon}+\frac{2(\mathbf{x}^{\top}\mathbf{x})^{2}\delta_{ij}}{\varepsilon^{2}}.
\]

Substituting $\mathbf{x}^{\top}\mathbf{x}=\|\mathbf{x}\|^{2}$ gives the displayed expression for $M_{ij}$. The trace formula follows from $\sum_{i}x_{i}^{2}=\|\mathbf{x}\|^{2}$ and $\sum_{i}\delta_{ii}=d$. ∎
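The closed form for $M(\mathbf{x})$ can be cross-checked against central finite differences, perturbing the first argument in coordinate $i$ and the second in coordinate $j$. The test point, $\varepsilon=0.5$, and the step size are arbitrary choices made for this illustration.

```python
def k_yat(x, xp, eps):
    """Unbiased Yat kernel (x.x')^2 / (||x - x'||^2 + eps)."""
    dot = sum(a * c for a, c in zip(x, xp))
    dist2 = sum((a - c) ** 2 for a, c in zip(x, xp))
    return dot * dot / (dist2 + eps)

def mixed_partial_closed_form(x, eps):
    """M(x) from the lemma: (2 x_i x_j + 2|x|^2 d_ij)/eps + 2|x|^4 d_ij/eps^2."""
    d = len(x)
    n2 = sum(c * c for c in x)
    return [[(2.0 * x[i] * x[j] + (2.0 * n2 if i == j else 0.0)) / eps
             + (2.0 * n2 * n2 / eps ** 2 if i == j else 0.0)
             for j in range(d)] for i in range(d)]

def mixed_partial_fd(x, eps, h=1e-4):
    """Central differences in x_i (first slot) and x'_j (second slot) at x' = x."""
    d = len(x)
    def shift(v, idx, delta):
        out = list(v)
        out[idx] += delta
        return out
    M = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            M[i][j] = (k_yat(shift(x, i, h), shift(x, j, h), eps)
                       - k_yat(shift(x, i, h), shift(x, j, -h), eps)
                       - k_yat(shift(x, i, -h), shift(x, j, h), eps)
                       + k_yat(shift(x, i, -h), shift(x, j, -h), eps)) / (4.0 * h * h)
    return M

x, eps = [0.6, -0.3, 0.5], 0.5          # ||x||^2 = 0.7
M_exact = mixed_partial_closed_form(x, eps)
M_fd = mixed_partial_fd(x, eps)
max_err = max(abs(M_exact[i][j] - M_fd[i][j]) for i in range(3) for j in range(3))
trace_exact = sum(M_exact[i][i] for i in range(3))   # 2*0.7*4/0.5 + 2*3*0.49/0.25
```

For this point the trace formula evaluates to $11.2+11.76=22.96$, and the entrywise discrepancy with finite differences should be at the level of the discretisation error.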

Theorem 27 (RKHS-Lipschitz constant for $k_{\text{ⵟ}}$).

Let $\mathcal{X}\subset\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ be path-connected and compact with $\sup_{\mathcal{X}}\|\mathbf{x}\|\leq R$. For every $f\in\mathcal{H}_{0,\varepsilon}|_{\mathcal{X}}$ with $\|f\|_{\mathcal{H}_{0,\varepsilon}}\leq B$ and every $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X}$,

\[
|f(\mathbf{x})-f(\mathbf{x}^{\prime})|\;\leq\;B\,L(R,\varepsilon,d)\,\|\mathbf{x}-\mathbf{x}^{\prime}\|,
\]

with

\[
L(R,\varepsilon,d)\;:=\;\sqrt{\frac{2R^{2}(1+d)}{\varepsilon}+\frac{2dR^{4}}{\varepsilon^{2}}}\;=\;O\!\Bigl(\frac{R^{2}\sqrt{d}}{\varepsilon}\Bigr)\quad\text{as }R^{2}/\varepsilon\to\infty.
\]

Proof.

By the reproducing property for derivatives, $\partial_{i}f(\mathbf{x})=\langle f,\partial_{x_{i}}k_{\text{ⵟ}}(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{0,\varepsilon}}$, hence by Cauchy–Schwarz

\[
\|\nabla f(\mathbf{x})\|^{2}\;=\;\sum_{i=1}^{d}|\partial_{i}f(\mathbf{x})|^{2}\;\leq\;\|f\|^{2}\sum_{i=1}^{d}\|\partial_{x_{i}}k_{\text{ⵟ}}(\cdot,\mathbf{x})\|^{2}\;=\;\|f\|^{2}\,\operatorname{tr}M(\mathbf{x}),
\]

using the standard identity $\|\partial_{x_{i}}k_{\text{ⵟ}}(\cdot,\mathbf{x})\|_{\mathcal{H}_{0,\varepsilon}}^{2}=M_{ii}(\mathbf{x})$. Lemma 9 gives $\operatorname{tr}M(\mathbf{x})\leq 2R^{2}(1+d)/\varepsilon+2dR^{4}/\varepsilon^{2}$ for $\|\mathbf{x}\|\leq R$, so $\|\nabla f(\mathbf{x})\|\leq B\,L$ throughout the closed ball of radius $R$. That ball is convex and contains the segment from $\mathbf{x}$ to $\mathbf{x}^{\prime}$, so the mean-value inequality along this segment, applied to the minimum-norm extension of $f$ (which has the same RKHS norm), gives $|f(\mathbf{x})-f(\mathbf{x}^{\prime})|\leq B\,L\,\|\mathbf{x}-\mathbf{x}^{\prime}\|$. ∎

Corollary 14 (Certified adversarial radius for an RKHS classifier).

Let $\mathbf{f}=(f_{1},\dots,f_{C})$ be a $C$-class predictor with $\sum_{c=1}^{C}\|f_{c}\|_{\mathcal{H}_{0,\varepsilon}}^{2}\leq B^{2}$. For an input $\mathbf{x}\in\mathcal{X}$ with predicted class $y$ and margin $\gamma(\mathbf{x}):=f_{y}(\mathbf{x})-\max_{c\neq y}f_{c}(\mathbf{x})>0$, every adversarial perturbation $\boldsymbol{\delta}$ with $\mathbf{x}+\boldsymbol{\delta}\in\mathcal{X}$ satisfying

\[\|\boldsymbol{\delta}\|\;<\;\frac{\gamma(\mathbf{x})}{2BL(R,\varepsilon,d)}\]

keeps the predicted class equal to $y$.

Proof.

The budget $\sum_{c}\|f_{c}\|^{2}\leq B^{2}$ gives $\|f_{c}\|\leq B$ for each $c$, hence $\|f_{y}-f_{c}\|\leq\|f_{y}\|+\|f_{c}\|\leq 2B$ by the triangle inequality. Theorem 27 applied to $f_{y}-f_{c}$ yields

\[\bigl|(f_{y}-f_{c})(\mathbf{x}+\boldsymbol{\delta})-(f_{y}-f_{c})(\mathbf{x})\bigr|\;\leq\;2BL\,\|\boldsymbol{\delta}\|\;<\;\gamma(\mathbf{x})\;\leq\;(f_{y}-f_{c})(\mathbf{x}),\]

so $f_{y}(\mathbf{x}+\boldsymbol{\delta})>f_{c}(\mathbf{x}+\boldsymbol{\delta})$ for every $c\neq y$ and the prediction is preserved. ∎

Remark 15 (Capacity-robustness trade-off).

At fixed margin $\gamma$ and norm budget $B$, the certified radius scales as $\Theta(\varepsilon/(R^{2}\sqrt{d}))$ in the regime $R^{2}/\varepsilon\to\infty$: increasing $\varepsilon$ buys robustness at the price of capacity, since the kernel diagonal $\|\mathbf{x}\|^{4}/\varepsilon$ shrinks. This trade-off is intrinsic to $k_{\text{ⵟ}}$ and explicit in the parameter $\varepsilon$, in contrast to scalar-activation networks, where Lipschitz control is imposed externally through weight-norm penalties.
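To make the trade-off concrete, the Lipschitz constant of Theorem 27 and the certified radius of Corollary 14 can be evaluated directly. This is a minimal sketch; the function names and the numerical values are illustrative assumptions, not values from the paper.

```python
import math

def lipschitz_constant(R, eps, d):
    """L(R, eps, d) = sqrt(2 R^2 (1 + d) / eps + 2 d R^4 / eps^2)."""
    return math.sqrt(2 * R**2 * (1 + d) / eps + 2 * d * R**4 / eps**2)

def certified_radius(margin, B, R, eps, d):
    """Radius gamma / (2 B L) below which the predicted class cannot change."""
    return margin / (2 * B * lipschitz_constant(R, eps, d))

# Illustrative values (assumptions): margin 1, norm budget 1, R = 10, d = 512.
for eps in (1.0, 10.0, 100.0):
    print(eps, certified_radius(1.0, 1.0, 10.0, eps, 512))

# In the regime R^2/eps -> infinity, L approaches sqrt(2 d) R^2 / eps.
ratio = lipschitz_constant(100.0, 1.0, 4) / (math.sqrt(2 * 4) * 100.0**2 / 1.0)
print(ratio)
```

The printed radii grow with $\varepsilon$, which is the robustness side of the trade-off described in the remark.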

Appendix P Yat-Native Atom-Count Bounds via the Polynomial Component

The exterior-shell separation (Appendix G.1) and the bounded-domain width-complexity gap (Proposition 7) capture qualitative atom-count separations between Yat and IMQ. We complement them with an upper bound of $\operatorname{rank}(\mathbf{A})\leq d$ atoms, independent of the tolerance, for arbitrary symmetric quadratic forms $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}$, paired with a dimension-counting IMQ lower bound. The combination converts the directional asymptotic-trace separation of Proposition 2 into a quantitative atom-count gap on PSD quadratic targets.

Setup.

For $\Lambda,W>0$, define the bounded-variation atom families

\[\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)\;:=\;\Bigl\{\textstyle\sum_{j}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})\,:\,\textstyle\sum_{j}|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\Bigr\},\]
\[\mathcal{R}_{\text{ⵟ}}(\Lambda,W)\;:=\;\Bigl\{\textstyle\sum_{j}c_{j}\,g_{\varepsilon}(\cdot;\mathbf{w}_{j},b_{j})\,:\,\textstyle\sum_{j}|c_{j}|\leq\Lambda,\ \|\mathbf{w}_{j}\|\leq W,\ b_{j}\in\mathbb{R}\Bigr\}.\]

The atom count of an element is the length of its shortest such expansion.

Lemma 10 (Single-atom Yat approximation of a rank-one quadratic).

Let $\mathbf{w}_{\star}\in\mathbb{S}^{d-1}$, $\delta>0$, $R>0$, and set

\[\alpha_{0}(d,R,\varepsilon,\delta)\;:=\;\max\Bigl(\tfrac{8R^{3}d^{3/2}}{\delta},\ 2\sqrt{\tfrac{dR^{2}(dR^{2}+\varepsilon)}{\delta}}\Bigr).\]

For every $\alpha\geq\alpha_{0}$,

\[\sup_{\mathbf{x}\in[-R,R]^{d}}\bigl|\,k_{\text{ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})-(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\,\bigr|\;\leq\;\delta.\]
Proof.

For $\mathbf{x}\in[-R,R]^{d}$ we have $\|\mathbf{x}\|\leq R\sqrt{d}$. Since $\|\mathbf{w}_{\star}\|=1$, dividing numerator and denominator by $\alpha^{2}$ gives

\[k_{\text{ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})\;=\;\frac{(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{1+\eta(\mathbf{x},\alpha)},\qquad\eta(\mathbf{x},\alpha)\;:=\;-\frac{2\,\mathbf{w}_{\star}^{\top}\mathbf{x}}{\alpha}+\frac{\|\mathbf{x}\|^{2}+\varepsilon}{\alpha^{2}}.\]

On $[-R,R]^{d}$, $|\mathbf{w}_{\star}^{\top}\mathbf{x}|\leq R\sqrt{d}$, so $|\eta|\leq 2R\sqrt{d}/\alpha+(dR^{2}+\varepsilon)/\alpha^{2}$. For $\alpha\geq 2\sqrt{dR^{2}(dR^{2}+\varepsilon)/\delta}$ the second term is at most $\delta/(4dR^{2})$, and for $\alpha\geq 8R^{3}d^{3/2}/\delta$ the first term is at most $\delta/(4dR^{2})$; hence $|\eta|\leq\delta/(2dR^{2})\leq 1/2$ (assuming $\delta\leq dR^{2}$; for larger $\delta$, apply the argument with tolerance $dR^{2}$ in place of $\delta$, which enlarges $\alpha_{0}$ accordingly). The bound $|1/(1+\eta)-1|\leq 2|\eta|$ for $|\eta|\leq 1/2$ gives

\[\bigl|k_{\text{ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})-(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\bigr|\;\leq\;(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\cdot 2|\eta|\;\leq\;dR^{2}\cdot 2\cdot\delta/(2dR^{2})\;=\;\delta.\]

∎
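The lemma is easy to probe numerically. The sketch below, under the assumption $\delta\leq dR^{2}$ used in the proof, samples the cube and compares a single far-away Yat atom with the rank-one quadratic; the helper names, the sample size and the chosen $(d,R,\varepsilon,\delta)$ are illustrative.

```python
import math
import random

def yat0(w, x, eps):
    """Unbiased Yat kernel (w.x)^2 / (||x - w||^2 + eps)."""
    dot = sum(a * b for a, b in zip(w, x))
    sq = sum((a - b) ** 2 for a, b in zip(w, x))
    return dot * dot / (sq + eps)

def alpha0(d, R, eps, delta):
    """Threshold alpha_0(d, R, eps, delta) from Lemma 10."""
    return max(8 * R**3 * d**1.5 / delta,
               2 * math.sqrt(d * R**2 * (d * R**2 + eps) / delta))

def sampled_sup_error(w_star, alpha, R, eps, n_samples=2000, seed=0):
    """Largest sampled |k(alpha w*, x) - (w*.x)^2| over the cube [-R, R]^d."""
    rng = random.Random(seed)
    center = [alpha * c for c in w_star]
    worst = 0.0
    for _ in range(n_samples):
        x = [rng.uniform(-R, R) for _ in w_star]
        target = sum(a * b for a, b in zip(w_star, x)) ** 2
        worst = max(worst, abs(yat0(center, x, eps) - target))
    return worst

d, R, eps, delta = 3, 1.0, 1.0, 0.1          # illustrative values, delta <= d R^2
w_star = [1 / math.sqrt(d)] * d              # a unit vector
alpha = alpha0(d, R, eps, delta)
err = sampled_sup_error(w_star, alpha, R, eps)
print(alpha, err)
```

A sampled maximum only lower-bounds the true supremum, so this is a consistency check rather than a verification of the bound.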

Proposition 9 (Spectral Yat approximation of any symmetric quadratic form).

Let $\mathbf{A}\in\mathrm{Sym}(d)$ have rank $r$ with spectral decomposition $\mathbf{A}=\sum_{k=1}^{r}\lambda_{k}\mathbf{e}_{k}\mathbf{e}_{k}^{\top}$. For every $\delta>0$, set $\delta_{k}:=\delta/(r|\lambda_{k}|)$ and choose $\alpha_{k}\geq\alpha_{0}(d,R,\varepsilon,\delta_{k})$ as in Lemma 10. Then

\[G(\mathbf{x})\;:=\;\sum_{k=1}^{r}\lambda_{k}\,k_{\text{ⵟ}}(\alpha_{k}\mathbf{e}_{k},\mathbf{x})\]

is an unbiased Yat expansion with exactly $r$ atoms satisfying $\sup_{\mathbf{x}\in[-R,R]^{d}}|G(\mathbf{x})-\mathbf{x}^{\top}\mathbf{A}\mathbf{x}|\leq\delta$.

Proof.

We have $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}=\sum_{k=1}^{r}\lambda_{k}(\mathbf{e}_{k}^{\top}\mathbf{x})^{2}$. Lemma 10 applied to each unit eigenvector $\mathbf{e}_{k}$ with tolerance $\delta_{k}$ gives $|k_{\text{ⵟ}}(\alpha_{k}\mathbf{e}_{k},\mathbf{x})-(\mathbf{e}_{k}^{\top}\mathbf{x})^{2}|\leq\delta/(r|\lambda_{k}|)$. Multiplying by $|\lambda_{k}|$ and summing over $k$ gives the displayed bound. ∎
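The construction can be sketched for a small matrix whose eigenpairs are known in closed form. The example below takes $\mathbf{A}=\bigl(\begin{smallmatrix}2&1\\1&2\end{smallmatrix}\bigr)$, with eigenvalues $3$ and $1$ and eigenvectors $(1,1)/\sqrt2$ and $(1,-1)/\sqrt2$; the matrix, grid and helper names are illustrative assumptions, not taken from the paper.

```python
import math

def yat0(w, x, eps):
    """Unbiased Yat kernel (w.x)^2 / (||x - w||^2 + eps)."""
    dot = sum(a * b for a, b in zip(w, x))
    sq = sum((a - b) ** 2 for a, b in zip(w, x))
    return dot * dot / (sq + eps)

def alpha0(d, R, eps, delta):
    """Threshold alpha_0(d, R, eps, delta) from Lemma 10."""
    return max(8 * R**3 * d**1.5 / delta,
               2 * math.sqrt(d * R**2 * (d * R**2 + eps) / delta))

def yat_quadratic(eigpairs, d, R, eps, delta):
    """G(x) = sum_k lam_k k(alpha_k e_k, x), with delta_k = delta / (r |lam_k|)."""
    r = len(eigpairs)
    atoms = []
    for lam, e in eigpairs:
        a = alpha0(d, R, eps, delta / (r * abs(lam)))
        atoms.append((lam, [a * c for c in e]))
    return lambda x: sum(lam * yat0(c, x, eps) for lam, c in atoms)

s = 1 / math.sqrt(2)
eigpairs = [(3.0, [s, s]), (1.0, [s, -s])]      # A = [[2, 1], [1, 2]]
delta = 0.2
G = yat_quadratic(eigpairs, d=2, R=1.0, eps=1.0, delta=delta)

grid = [-1 + 0.1 * i for i in range(21)]
worst = max(abs(G([x1, x2]) - (2 * x1 * x1 + 2 * x1 * x2 + 2 * x2 * x2))
            for x1 in grid for x2 in grid)
print(worst)
```

Two atoms suffice here because the matrix has rank two; the price is paid in the size of the centers $\alpha_k\mathbf{e}_k$, as Remark 16 discusses.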

Proposition 10 (IMQ atom-count lower bound for a PSD quadratic target).

Let $\mathbf{A}\in\mathrm{Sym}(d)$ be PSD with $\lambda_{\max}:=\lambda_{\max}(\mathbf{A})>0$, and let $\mathbf{e}_{1}$ be a unit eigenvector of $\mathbf{A}$ with eigenvalue $\lambda_{\max}$. Fix $R>0$ and $0<\delta<\lambda_{\max}R^{2}$. Suppose $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $W\leq R/2$ satisfies $\sup_{[-R,R]^{d}}|F-\mathbf{x}^{\top}\mathbf{A}\mathbf{x}|\leq\delta$ and $\Lambda>\varepsilon\,(\lambda_{\max}R^{2}-\delta)$. Set

\[\rho^{2}\;:=\;\frac{\Lambda}{\lambda_{\max}R^{2}-\delta}\,-\,\varepsilon\;>\;0.\]

Any expansion $F=\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})$ realising the bounded-variation budget has

\[m\;\geq\;\frac{(2R)^{d-1}\,\Gamma\!\bigl(\tfrac{d+1}{2}\bigr)}{\pi^{(d-1)/2}\,\rho^{\,d-1}}.\]

In particular, when $\rho$ stays bounded as $d\to\infty$ (equivalently, $\Lambda=O(1)$ with $R,\lambda_{\max},\delta,\varepsilon$ fixed), the right-hand side is $\exp(\Omega(d\log d))$.

Proof.

Restrict attention to the inscribed Euclidean ball $\mathcal{B}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}\subset[-R,R]^{d}$. By rotational invariance of $\mathcal{B}$, choose coordinates in which $\mathbf{e}_{1}$ is the first coordinate direction. Consider the slice $S':=\{R/2\}\times\mathcal{B}'_{d-1}$, where $\mathcal{B}'_{d-1}\subset\mathbb{R}^{d-1}$ is the $(d-1)$-ball of radius $R\sqrt{3}/2$. For $\mathbf{x}\in S'$, $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\geq\lambda_{\max}(\mathbf{e}_{1}^{\top}\mathbf{x})^{2}=\lambda_{\max}R^{2}/4$, so $F(\mathbf{x})\geq\lambda_{\max}R^{2}/4-\delta>0$ (here we strengthen the hypothesis to $\delta<\lambda_{\max}R^{2}/4$). The bounded-variation bound gives

\[F(\mathbf{x})\;\leq\;\sum_{j=1}^{m}|a_{j}|\,h_{\varepsilon}(\mathbf{x},\mathbf{v}_{j})\;\leq\;\frac{\Lambda}{\min_{j}(\|\mathbf{x}-\mathbf{v}_{j}\|^{2}+\varepsilon)},\]

so $\min_{j}\|\mathbf{x}-\mathbf{v}_{j}\|^{2}\leq\rho^{2}$ for every $\mathbf{x}\in S'$, with the constant in $\rho$ now depending on $\lambda_{\max}R^{2}/4$ rather than $\lambda_{\max}R^{2}$. Hence $S'\subset\bigcup_{j}B_{d}(\mathbf{v}_{j},\rho)$. Each ball intersects the affine slice $\{x_{1}=R/2\}$ in a $(d-1)$-ball of radius at most $\rho$ and volume at most $\pi^{(d-1)/2}\rho^{d-1}/\Gamma((d+1)/2)$. The slice $S'$ has $(d-1)$-volume $\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)$, so volume counting on $S'$ yields $m\geq(R\sqrt{3}/(2\rho))^{d-1}$: the factors $\pi^{(d-1)/2}/\Gamma((d+1)/2)$ cancel between the slice and the ball volumes, and $\log m\geq(d-1)\log(R\sqrt{3}/(2\rho))$ is linear in $d$ when $\rho$ stays bounded below $R\sqrt{3}/2$. The factor $\Gamma((d+1)/2)$ in the displayed bound, which is the source of the $\exp(\Omega(d\log d))$ rate, is retained when the covered set is instead the cube face $\{R\}\times[-R,R]^{d-1}$ of $(d-1)$-volume $(2R)^{d-1}$: if $\mathbf{e}_{1}$ is a coordinate direction, then $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\geq\lambda_{\max}R^{2}$ on that face, and the same covering argument gives the displayed bound with $\rho$ exactly as defined in the statement. ∎
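The displayed lower bound of Proposition 10 overflows floating point quickly, so it is convenient to evaluate its logarithm. A minimal sketch using the standard-library log-gamma function follows; the function names and sample parameters are illustrative assumptions.

```python
import math

def rho_from_budget(Lam, lam_max, R, delta, eps):
    """rho = sqrt(Lam / (lam_max R^2 - delta) - eps), as defined in Proposition 10."""
    rho_sq = Lam / (lam_max * R**2 - delta) - eps
    if rho_sq <= 0:
        raise ValueError("need Lam > eps * (lam_max * R^2 - delta)")
    return math.sqrt(rho_sq)

def log_imq_lower_bound(d, R, rho):
    """Natural log of (2R)^(d-1) Gamma((d+1)/2) / (pi^((d-1)/2) rho^(d-1))."""
    return ((d - 1) * math.log(2 * R)
            + math.lgamma((d + 1) / 2)
            - 0.5 * (d - 1) * math.log(math.pi)
            - (d - 1) * math.log(rho))

# Illustrative values (assumptions): Lam = 10, lam_max = 1, R = 1, delta = 0.5, eps = 1.
rho = rho_from_budget(10.0, 1.0, 1.0, 0.5, 1.0)
for d in (8, 64, 512):
    print(d, log_imq_lower_bound(d, 1.0, rho))
```

For $d=3$ and $R=\rho=1$ the bound reduces to $4/\pi$, the ratio of the area of a square of side $2$ to that of a unit disc, which is a quick way to check the formula.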

Theorem 28 (Yat-native rate-form separation).

Let $f^{\star}=p^{\star}+g^{\star}$ on $[-R,R]^{d}$, where $p^{\star}(\mathbf{x})=\mathbf{x}^{\top}\mathbf{A}\mathbf{x}$ for some PSD $\mathbf{A}\in\mathrm{Sym}(d)$ of rank $r$ with $\lambda_{\max}(\mathbf{A})>0$, and $g^{\star}\in\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$ with $\|g^{\star}\|_{L^{\infty}}\leq M$ and IMQ approximation rate $\phi$ (i.e., for every $\delta'>0$ there exists $\bar{F}\in\mathcal{F}_{\mathrm{IMQ}}$ with at most $\phi^{-1}(\delta')$ atoms and $\|g^{\star}-\bar{F}\|_{L^{\infty}}\leq\delta'$). Fix $\delta>0$ with $4\delta+M<\lambda_{\max}R^{2}$, and set $m_{\delta}:=\min\{m:\phi(m)\leq\delta/2\}$.

(i) Yat upper bound. There exists $G\in\mathcal{F}_{\text{ⵟ}}$ with at most $r+3m_{\delta}$ atoms and $\sup_{\mathbf{x}\in[-R,R]^{d}}|f^{\star}(\mathbf{x})-G(\mathbf{x})|\leq\delta$.

(ii) IMQ lower bound. Fix $\Lambda>0$ and $W\leq R/2$. Every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $\sup_{[-R,R]^{d}}|f^{\star}-F|\leq\delta$ has atom count at least the dimension-counting lower bound of Proposition 10; in the regime where $R,\lambda_{\max},\delta,\varepsilon,\Lambda$ are held fixed as $d\to\infty$, this bound is $\exp(\Omega(d\log d))$.

(iii) Separation. For full-rank $\mathbf{A}$, the Yat side achieves the target error with $d+3m_{\delta}$ atoms while the IMQ side requires $\exp(\Omega(d\log d))$ atoms in the fixed-$\Lambda$ regime of (ii).

Proof.

(i) Apply Proposition 9 to $\mathbf{A}$ with tolerance $\delta/2$, producing an unbiased Yat expansion $G_{1}$ with $r$ atoms and $\sup|G_{1}-p^{\star}|\leq\delta/2$. By definition of $m_{\delta}$, choose $\bar{F}\in\mathcal{F}_{\mathrm{IMQ}}$ with at most $m_{\delta}$ IMQ atoms and $\|g^{\star}-\bar{F}\|_{L^{\infty}}\leq\delta/2$. Theorem 4 converts $\bar{F}$ into a biased Yat expansion $G_{2}\in\mathcal{F}_{\text{ⵟ}}$ with at most $3m_{\delta}$ atoms and $G_{2}=\bar{F}$ pointwise. Set $G:=G_{1}+G_{2}$, with atom count at most $r+3m_{\delta}$; the triangle inequality gives the displayed sup bound.

(ii) Given $\sup|f^{\star}-F|\leq\delta$, the triangle inequality and $\|g^{\star}\|_{L^{\infty}}\leq M$ give $\sup|p^{\star}-F|\leq\delta+M$. Apply Proposition 10 with the perturbed tolerance $\delta':=\delta+M$ in place of $\delta$ (the hypothesis $4\delta+M<\lambda_{\max}R^{2}$ ensures $\delta'<\lambda_{\max}R^{2}$, so $\rho$ is well-defined). The atom count is bounded below by the dimension-counting estimate, which under the fixed-$\Lambda$ asymptotic regime is $\exp(\Omega(d\log d))$.

(iii) For full-rank $\mathbf{A}$, $r=d$; combining (i) and (ii) gives the displayed separation. ∎

Remark 16 (RKHS-norm cost of the Yat upper bound).

Theorem 28(i) is an atom-count statement, not an RKHS-norm statement. The single-atom rank-one approximation (Lemma 10) requires $\alpha_{0}=\Omega(d^{3/2}/\delta)$, so the resulting Yat atom $k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\cdot)$ has center norm $\|\alpha_{0}\mathbf{w}_{\star}\|=\alpha_{0}=\Omega(d^{3/2}/\delta)$ and, at fixed $R$ and in the regime where the first term of $\alpha_{0}$ dominates, squared RKHS norm

\[\|k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\cdot)\|_{\mathcal{H}_{0,\varepsilon}}^{2}\;=\;k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\alpha_{0}\mathbf{w}_{\star})\;=\;\frac{\alpha_{0}^{4}}{\varepsilon}\;=\;\Theta\!\Bigl(\tfrac{d^{6}}{\varepsilon\,\delta^{4}}\Bigr).\]

Aggregating over the $r$ atoms in the rank-$r$ spectral decomposition (Proposition 9) inflates this further by a factor at most $r=d$ in the full-rank case. So the Yat-side resource cost decomposes into an atom count of at most $d+3m_{\delta}$ and a polynomial-in-$d$ RKHS-ball radius; the IMQ side requires $\exp(\Omega(d\log d))$ atoms but each at bounded center norm, giving a polynomial RKHS-ball radius for any single expansion. The separation in Theorem 28 is therefore poly-vs-exp in atom count and poly-vs-poly in RKHS-ball radius; downstream generalization comparisons via Theorem 6 inherit both factors. The exponential atom-count gap in favour of Yat is real and quantitative, but it is a gap in the discrete combinatorial complexity of the expansion, not a free-lunch reduction in the underlying capacity radius.

Appendix Q Toward a Uniqueness Characterization

The structural results of the main paper isolate three properties that the Yat kernel satisfies and the classical RBF/IMQ/polynomial kernels do not jointly satisfy: rational symmetric form, a diagonal of at most quartic growth, and a non-trivial quadratic asymptotic trace at infinity. We record the rigorous degree-matching skeleton these conditions imply, and state the full classification as a conjecture.

Conditions.

Let $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, $d\geq 2$, be a continuous PSD kernel satisfying

  1. (C1)

    (Rational symmetric form, in reduced form.) $k(\mathbf{x},\mathbf{w})=P(\mathbf{x},\mathbf{w})/Q(\mathbf{x},\mathbf{w})$ with $P,Q$ jointly polynomial in $(\mathbf{x},\mathbf{w})$, $P$ symmetric in the variable pair, $Q(\mathbf{x},\mathbf{w})>0$ for every $(\mathbf{x},\mathbf{w})$, and $\gcd(P,Q)=1$ in the polynomial ring (i.e., the rational form is reduced).

  2. (C2)

    (At-most-quadratic numerator growth.) The diagonal admits $k(\mathbf{x},\mathbf{x})=O(\|\mathbf{x}\|^{4})$ as $\|\mathbf{x}\|\to\infty$. In particular, in the reduced rational form of (C1), the diagonal restrictions satisfy $\deg P(\mathbf{x},\mathbf{x})-\deg Q(\mathbf{x},\mathbf{x})\leq 4$; for $k_{b,\varepsilon}$ these two degrees are $4$ and $0$.

  3. (C3)

    (Non-trivial quadratic directional trace.) For some $\mathbf{w}\neq\mathbf{0}$, the limit $T_{\infty}k(\cdot,\mathbf{w})(\mathbf{u}):=\lim_{r\to\infty}k(r\mathbf{u},\mathbf{w})$ exists for every $\mathbf{u}\in\mathbb{S}^{d-1}$ and is a non-zero quadratic form in $\mathbf{u}$.

Proposition 11 (Degree matching under (C1)–(C3)).

For any kernel $k$ satisfying (C1)–(C3), $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$ (and by symmetry the same in $\mathbf{w}$).

Proof.

Write $P=\sum_{p}P_{p}$ and $Q=\sum_{q}Q_{q}$ as homogeneous decompositions in $\mathbf{x}$. As $r\to\infty$, $k(r\mathbf{u},\mathbf{w})$ has leading order $r^{p_{\star}-q_{\star}}$, where $p_{\star},q_{\star}$ are the largest indices with non-zero homogeneous parts, so that $p_{\star}=\deg_{\mathbf{x}}P$ and $q_{\star}=\deg_{\mathbf{x}}Q$. Existence of a finite, non-zero limit (C3) forces $p_{\star}=q_{\star}$ and the limiting ratio $P_{p_{\star}}/Q_{q_{\star}}$ to be a non-zero rational function of $\mathbf{u}$ alone. Since the limit is a quadratic form in $\mathbf{u}$, its total numerator degree in $\mathbf{u}$ after cancellation is $2$, which is consistent only with $p_{\star}=q_{\star}=2$ (the cases $p_{\star}=q_{\star}\in\{0,1\}$ are ruled out by the requirement that the trace is a non-zero quadratic of degree exactly $2$; larger common degrees are excluded by the reduced-form convention of (C1), see Remark 18). A degree mismatch is not admissible either: if $\deg_{\mathbf{x}}P>2$ but $\deg_{\mathbf{x}}Q=2$, then $k(r\mathbf{u},\mathbf{w})$ diverges as $r\to\infty$ in every direction where the leading part of $P$ does not vanish, contradicting (C3). Therefore $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$. ∎
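The far-field behaviour that condition (C3) isolates can be illustrated numerically. The sketch below compares the Yat kernel with the pure quadratic polynomial and with the rational radial kernel $h_{\varepsilon}(\mathbf{x},\mathbf{w})=1/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ used as the IMQ atom in Appendix P; the function names and the test vectors are illustrative assumptions.

```python
def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def sq_dist(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))

def yat(x, w, b, eps):
    """k_{b,eps}(x, w) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    return (dot(x, w) + b) ** 2 / (sq_dist(x, w) + eps)

def poly2(x, w):
    """Quadratic polynomial kernel (x.w)^2, with no distance denominator."""
    return dot(x, w) ** 2

def imq(x, w, eps):
    """Radial atom 1 / (||x - w||^2 + eps)."""
    return 1.0 / (sq_dist(x, w) + eps)

def along_ray(u, r):
    return [r * c for c in u]

u = [0.6, 0.8]          # unit direction
w = [1.0, 2.0]          # fixed center
b, eps, r = 0.5, 1.0, 1e6

yat_far = yat(along_ray(u, r), w, b, eps)
poly_far = poly2(along_ray(u, r), w)
imq_far = imq(along_ray(u, r), w, eps)
print(yat_far, dot(u, w) ** 2, poly_far, imq_far)
```

Along the ray, the Yat value approaches $(\mathbf{u}^{\top}\mathbf{w})^{2}$ with an $O(1/r)$ error, the polynomial grows like $r^{2}$, and the radial atom decays like $r^{-2}$.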

Conjecture 2 (Uniqueness of $k_{b,\varepsilon}$ up to symmetry).

Every continuous PSD kernel satisfying (C1)–(C3) has the form

\[k(\mathbf{x},\mathbf{w})\;=\;c\,\frac{(\mathbf{x}^{\top}\mathbf{w}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon}\;=\;c\,k_{b,\varepsilon}(\Phi^{-1}\mathbf{x},\Phi^{-1}\mathbf{w}),\]

for some $b\geq 0$, $\varepsilon>0$, $c>0$, and $\Phi\in\mathrm{GL}(d)$.

Remark 17 (Status).

Proposition 11 establishes the degree-$2$ constraint, which is the first of several steps in the conjectured classification. Two steps remain: (i) showing that PSD on all finite point sets forces $P$ to be the square of a degree-$1$ symmetric form $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$, up to a linear change of variables, and (ii) the analogous classification for strictly positive degree-$2$ symmetric denominators $Q$. Both require an explicit case analysis with multi-point Gram constraints that we have not carried out, and they constitute the bulk of the work behind Conjecture 2.

Remark 18 (Why higher homogeneous degrees do not survive on $\mathbb{S}^{d-1}$).

A potential concern is that $P_{4}(\mathbf{u},\mathbf{w})/Q_{4}(\mathbf{u},\mathbf{w})$ may collapse to a quadratic on $\|\mathbf{u}\|^{2}=1$ (for instance, $(\mathbf{u}^{\top}\mathbf{w})^{2}\|\mathbf{u}\|^{2}/\|\mathbf{w}\|^{4}=(\mathbf{u}^{\top}\mathbf{w})^{2}/\|\mathbf{w}\|^{4}$ there), but this does not invalidate the degree conclusion: such a collapse is a redundancy of the homogeneous expansion, and any kernel admitting it can be rewritten with $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$ via cancellation of the $\|\mathbf{u}\|^{2}$ factor. Equivalently, after passing to the reduced rational form $P^{*}/Q^{*}$ in which numerator and denominator share no common polynomial factor (the convention adopted in (C1)), the directional-trace argument forces $\deg_{\mathbf{x}}P^{*}=\deg_{\mathbf{x}}Q^{*}=2$.

Remark 19 (Independence of (C1)–(C3)).

Each condition is necessary for the conjectured classification: dropping (C1) admits Gaussian RBF and IMQ (no rational symmetric form); dropping (C2) admits higher-degree rational kernels such as $(\mathbf{x}^{\top}\mathbf{w})^{4}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$, whose diagonal grows as $\|\mathbf{x}\|^{8}$ and so violates the at-most-quadratic numerator condition while still satisfying (C1); dropping (C3) admits the polynomial $(\mathbf{x}^{\top}\mathbf{w})^{2}$ (no IMQ denominator, so the far field along $r\mathbf{u}$ blows up rather than approaching a finite quadratic limit) and all bounded radial kernels (Matérn, IMQ), which have $T_{\infty}\equiv 0$. The simultaneous combination forces the rigid degree structure of Proposition 11 and conjecturally isolates $k_{b,\varepsilon}$ as a two-parameter family.

Appendix R CLIP Probe Classification Sweep

Scope of this experiment.

This appendix is intended as a trainability/diagnostic check, not a head-to-head benchmark of Yat against tuned kernel baselines. The bandwidths of RBF and IMQ are fixed to match Yat's $\varepsilon$ (no per-variant tuning), so the comparison is deliberately under-tuned for the radial baselines. The two structurally informative comparisons we do draw are Yat vs. Poly (with vs. without the IMQ denominator) and Yat vs. Yat$_{\text{rand}}$ (trained vs. frozen centers). A bandwidth-tuned comparison against learned-center RBF/IMQ heads (and against random Fourier features and a standard MLP head) is left as future empirical work; we therefore avoid drawing performance-superiority conclusions from this table alone.

To illustrate the practical consequence of the alignment numerator, we train single-layer kernel classifiers on frozen CLIP ViT-B/32 image features (ImageNet-1k, 1000 classes, $d=512$). Each classifier is a one-vs-rest kernel expansion with $m=1000$ learned centers (one per class), optimised by Adam for 20 epochs with a grid of six learning rates over three seeds. The five variants compared are: Yat ($k_{b,\varepsilon}$, trained centers, shared $b=\ln 2$, $\varepsilon=2$), Poly (quadratic polynomial kernel, same numerator as Yat, no IMQ denominator), Yat$_{\text{rand}}$ (Yat with frozen random centers, only $\alpha$ trained), RBF (Gaussian kernel $e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}$ with learned centers, fixed bandwidth $\gamma=1/(2\varepsilon)=0.25$ to match Yat's $\varepsilon$), and IMQ (inverse multiquadric kernel with learned centers, fixed $\varepsilon_{\text{IMQ}}=2$ matching Yat's $\varepsilon$). All five variants use the same Adam optimizer and six-point learning-rate grid; no per-variant bandwidth tuning was performed.

Table 3: ImageNet-1k top-1 validation accuracy (%) for a single-layer kernel classifier on frozen CLIP ViT-B/32 features, structural (alignment-numerator vs. trained-center) ablations only. Columns $10^{-2.5}$ to $10^{0.5}$ are learning rates. Mean over 3 seeds; the best-LR column reports the peak over the learning-rate grid.

| Variant | $10^{-2.5}$ | $10^{-2}$ | $10^{-1.5}$ | $10^{-1}$ | $10^{0}$ | $10^{0.5}$ | Best LR | Best acc (%) |
|---|---|---|---|---|---|---|---|---|
| Yat (trained) | 68.4 | 70.8 | 73.1 | 73.9 | 72.2 | 69.7 | $10^{-1}$ | $73.9\pm 0.1$ |
| Poly | 72.5 | 71.2 | 70.1 | 64.3 | 51.6 | 49.1 | $10^{-2.5}$ | $72.5\pm 0.1$ |
| Yat$_{\text{rand}}$ | 62.6 | 67.5 | 69.5 | 69.2 | 58.4 | 46.2 | $10^{-1.5}$ | $69.5\pm 0.1$ |

Reading Table 3. The two structurally informative comparisons are: (1) Yat (73.9%) vs. Poly (72.5%): the $+1.4$ pp gap isolates the benefit of the IMQ locality denominator over the polynomial alignment numerator alone. The Poly kernel has a finite-dimensional RKHS (Table 1, footnote †) of dimension $O(d^{2})$ at degree $2$, so the gap roughly reflects the cost of forcing a $1000$-class problem on $d=512$ features into a finite-rank RKHS; (2) Yat (73.9%) vs. Yat$_{\text{rand}}$ (69.5%): the $+4.4$ pp gap shows that center optimisation is required to fully exploit the alignment numerator, consistent with the learned-center RKHS view. Both ablations isolate one variable at a time and use the same Yat hyperparameters, so the comparison is not contaminated by bandwidth choice. We note that the Poly baseline (and the bandwidth-mismatched RBF/IMQ baselines in Table 4) peaks at the lower end of the searched learning-rate grid; extending the grid downwards may reduce the $+1.4$ pp Yat-vs-Poly gap, and we report the result with this caveat. RBF/IMQ are reported with the additional bandwidth-mismatch caveat already noted.

Bandwidth-mismatched RBF/IMQ baselines (separated for visual honesty).

For completeness we also ran fixed-bandwidth RBF and IMQ heads with the same optimizer and learning-rate grid; we report these in a separate Table 4 rather than alongside Yat to avoid an apples-to-oranges headline juxtaposition. The bandwidth was fixed to match Yat's $\varepsilon$ ($\gamma=1/(2\varepsilon)=0.25$ for RBF; $\varepsilon_{\text{IMQ}}=2$), not tuned per variant. The resulting collapse (RBF and IMQ peak at $46.1\%$ and $28.7\%$ respectively against Yat's $73.9\%$, and both fall to the chance level of $0.1\%$ at the larger learning rates) is the optimization-side counterpart of Proposition 5: with $\varepsilon$ fixed at $O(1)$ and $\|\mathbf{x}-\mathbf{w}\|^{2}=\Theta(d)$ at $d=512$, both kernels concentrate around a near-constant value, so both the forward signal and the Adam gradient lose discriminative scale (for RBF, $e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}\approx e^{-\Theta(d)}$ is numerically zero; for IMQ, $1/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ is near-uniform across $(\mathbf{x},\mathbf{w})$ pairs). The remedy is a bandwidth scaled to the data ($\varepsilon\sim d$), not a different optimizer; with per-variant bandwidth selection these baselines would be expected to perform competitively. We draw no performance-superiority conclusion against RBF or IMQ from this table; a bandwidth-tuned head-to-head comparison is left as future empirical work.

Table 4: Bandwidth-mismatched RBF and IMQ heads with the same optimizer and learning-rate grid as Table 3 (top-1 accuracy, %), with bandwidth fixed to match Yat's $\varepsilon$ rather than tuned per variant. Reported here for completeness; not a fair head-to-head comparison against Yat.

| Variant | $10^{-2.5}$ | $10^{-2}$ | $10^{-1.5}$ | $10^{-1}$ | $10^{0}$ | $10^{0.5}$ | Best LR | Best acc (%) |
|---|---|---|---|---|---|---|---|---|
| RBF (fixed $\gamma$) | 46.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | $10^{-2.5}$ | $46.1\pm 0.5$ |
| IMQ (fixed $\varepsilon$) | 28.7 | 25.1 | 11.9 | 0.1 | 0.1 | 0.1 | $10^{-2.5}$ | $28.7\pm 0.3$ |
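The concentration effect behind the bandwidth mismatch can be reproduced on synthetic vectors. The sketch below is only an illustration under a loudly stated assumption: it uses i.i.d. standard Gaussian vectors in $d=512$ as stand-ins, so that $\|\mathbf{x}-\mathbf{w}\|^{2}=\Theta(d)$; it does not use CLIP features or the trained centers.

```python
import math
import random

rng = random.Random(0)
d, eps = 512, 2.0
gamma = 1.0 / (2.0 * eps)       # RBF bandwidth matched to eps, as in the text

def sq_dist(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))

def gaussian_vector(n):
    # Assumption: unit-variance coordinates, a synthetic stand-in for features.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

sq = [sq_dist(gaussian_vector(d), gaussian_vector(d)) for _ in range(20)]
rbf = [math.exp(-gamma * s) for s in sq]
imq = [1.0 / (s + eps) for s in sq]

print(min(sq), max(sq))                 # squared distances of order 2d
print(max(rbf))                         # numerically negligible
print(max(imq) / min(imq))              # close to 1: near-uniform IMQ values
```

Under this assumption every RBF value is far below any useful scale and the IMQ values differ by well under a factor of two across pairs, which is the loss of discriminative scale described above.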

Per-class RKHS norm correlation.

For the Yat head trained at the best learning rate ($10^{-1}$), the per-class closed-form RKHS norm $\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ (Proposition 3) has positive Spearman correlation $\rho=0.564$ with per-class top-1 accuracy across $n=1000$ classes (one observation per class, $50$ validation examples each). The correlation is stable across a $(b,\varepsilon)$ ablation grid spanning two orders of magnitude ($\rho\in[0.54,0.56]$ over seven configurations; reproduction scripts are provided in the supplementary material). The interpretation of the positive sign is discussed in §6. The closed-form norm is computable directly from the trained weights $(\mathbf{w}_{j},\boldsymbol{\alpha}_{c})$ with no held-out data. Learned-center RBF and IMQ heads admit the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ formula at the same cost: the reproducing-property derivation of Proposition 3 is generic to finite kernel expansions and applies equally to any Mercer kernel. What is specific to Yat is the kernel itself: its diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$, its directional far-field trace, and the bias finite-difference structure. Ordinary scalar MLP units ($\mathbf{x}\mapsto\sigma(\mathbf{w}^{\top}\mathbf{x}+b)$) do not admit the formula at all because they do not define a Mercer section over the shared input/weight space.
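The closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ needs only the centers and coefficients. A minimal sketch follows, assuming a shared $(b,\varepsilon)$ and Gram entries $K_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$; the function names and the toy centers are illustrative, not the trained CLIP head.

```python
def yat(x, w, b, eps):
    """k_{b,eps}(x, w) = (x.w + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(p * q for p, q in zip(x, w))
    sq = sum((p - q) ** 2 for p, q in zip(x, w))
    return (dot + b) ** 2 / (sq + eps)

def rkhs_norm_sq(centers, alpha, b, eps):
    """alpha^T K alpha with K_ij = k_{b,eps}(w_i, w_j)."""
    m = len(centers)
    return sum(alpha[i] * alpha[j] * yat(centers[i], centers[j], b, eps)
               for i in range(m) for j in range(m))

# Toy example (assumption): two centers in the plane, b = 1, eps = 1.
centers = [[1.0, 0.0], [0.0, 1.0]]
alpha = [1.0, -1.0]
norm_sq = rkhs_norm_sq(centers, alpha, b=1.0, eps=1.0)
print(norm_sq)   # diagonal entries 4, off-diagonal 1/3, so 4 + 4 - 2/3
```

The double loop costs $O(m^{2}d)$ per class; for $m=1000$ centers this is small, and the Gram matrix can be shared across classes.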

Appendix S Directional-Tail Benchmark (Corollary 5)

To make Corollary 5 concrete we run a small synthetic experiment in $d=2$. The target function is the single Yat atom $g_{\varepsilon}(\mathbf{x};\mathbf{w}^{*},b^{*})$ with $\mathbf{w}^{*}=[1,0]$, $b^{*}=1$, $\varepsilon=1$. Along the ray $\mathbf{u}=[1,0]$ the alignment factor satisfies $a=(\mathbf{u}^{\top}\mathbf{w}^{*})^{2}=1$, so Corollary 5 predicts $|g_{\varepsilon}(r\mathbf{u})-F(r\mathbf{u})|\to 1$ as $r\to\infty$ for every finite IMQ combination $F$. This experiment provides concrete numerical verification of the analytical prediction. The regime (training on an annulus, evaluating on a far-field ray) is chosen so that the analytical prediction is tight by construction.

Setup.

We draw $N=4{,}000$ training points uniformly (area-measure) from the annulus $\|\mathbf{x}\|\in[50,100]$, with labels $y=g_{\varepsilon}(\mathbf{x})$. Two IMQ models are trained: $M=50$ and $M=200$ learned centers, shared bandwidth $\varepsilon_{\mathrm{IMQ}}$ trained jointly, Adam for $5{,}000$ epochs at learning rates $10^{-3}$ and $5\times 10^{-4}$. Both models are evaluated along the ray $r\mathbf{u}$ for $r\in[0,500]$. The script (directional_tail_benchmark.py) is provided in the supplementary material; it uses MLX and runs in ${\approx}6$ s on Apple Silicon M-series.

Results.

Table 5: Mean and max pointwise error on the far tail ($r\geq 400$) for finite IMQ expansions trained to fit a single Yat atom on an annulus. The asymptotic limit predicted by Corollary 5 is $a=1$.

| Model | $M$ | Mean $\lvert g-F\rvert$ at $r\geq 400$ | Max $\lvert g-F\rvert$ at $r\geq 400$ |
|---|---|---|---|
| IMQ-50 | 50 | 1.008 | 1.008 |
| IMQ-200 | 200 | 1.007 | 1.007 |
| Predicted ($a=1$) | | 1.000 | 1.000 |

Both models train successfully in the annulus (MSE declines steadily) yet sit at the predicted error floor $a=1$ in the far field, to within $10^{-2}$, consistent with Corollary 5: increasing the number of IMQ atoms does not recover the directional tail, because the analytical result holds for every finite number of atoms. This is the finite-sample instantiation of the $\|T_{\infty}F\|=0$ property of the IMQ span (Proposition 2).
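The far-field floor does not depend on training, which a few lines make visible. The sketch below is not the benchmark script: it evaluates the target atom and an arbitrary untrained finite combination of radial atoms $a_{j}/(\|\mathbf{x}-\mathbf{v}_{j}\|^{2}+\varepsilon)$ along the ray; the random coefficients, centers and function names are illustrative assumptions.

```python
import random

def yat_atom(x, w, b, eps):
    """g_eps(x; w, b) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(p * q for p, q in zip(w, x))
    sq = sum((p - q) ** 2 for p, q in zip(w, x))
    return (dot + b) ** 2 / (sq + eps)

def imq_sum(x, coeffs, centers, eps):
    """Finite combination sum_j a_j / (||x - v_j||^2 + eps)."""
    return sum(a / (sum((p - q) ** 2 for p, q in zip(x, v)) + eps)
               for a, v in zip(coeffs, centers))

rng = random.Random(0)
M = 50
coeffs = [rng.uniform(-1.0, 1.0) for _ in range(M)]                  # arbitrary, untrained
centers = [[rng.uniform(-100.0, 100.0), rng.uniform(-100.0, 100.0)] for _ in range(M)]

w_star, b_star, eps = [1.0, 0.0], 1.0, 1.0
for r in (400.0, 1e4, 1e6):
    x = [r, 0.0]                                                     # point r * u with u = [1, 0]
    gap = abs(yat_atom(x, w_star, b_star, eps) - imq_sum(x, coeffs, centers, eps))
    print(r, gap)

far_gap = abs(yat_atom([1e4, 0.0], w_star, b_star, eps)
              - imq_sum([1e4, 0.0], coeffs, centers, eps))
```

Any finite choice of coefficients and centers gives a sum that decays like $r^{-2}$, so the gap tends to $a=1$ regardless of $M$.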

Appendix T The Yat Primitive at Depth in a Trained Causal Language Model

The single-layer theory of Sections 2–5 makes its claims at one shared-$(b,\varepsilon)$ Yat layer; the pullback theorem (Theorem 14) extends them to a fixed prefix inside a deep stack so that the closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ (Proposition 3) and the diagonal-driven Rademacher bound (Theorem 6) survive composition at depth. This appendix exercises that depth extension on a trained causal language model: at matched architecture and matched Chinchilla-optimal compute on C4, the Yat primitive composed across many layers reaches the GELU-baseline accuracy regime, so the layer-local theoretical objects (per-layer Gram, closed-form norm, Rademacher diagonal) are not just well-defined at depth but populated with non-degenerate trained content.

Setup.

We replace the standard GELU MLP block of a $261$M decoder-only causal language model ($12$ layers, $n_{\text{embd}}=768$, context $1024$) with a $\mathbf{x}\to\text{Yat}\to\text{Linear}$ block at the same parameter count, and train both at the GELU-Chinchilla-optimal budget of $5.2$B C4 tokens [Raffel and others, 2020, Hoffmann and others, 2022] ($20{,}000$ steps, global batch $256$, AdamW [Loshchilov and Hutter, 2019] with cosine decay, weight decay $0.1$). Learning rates were chosen by a $500$K-token pre-sweep: $\mathrm{lr}=0.01$ for GELU and $\mathrm{lr}=0.03$ for Yat. We sweep the full $\{\text{pn},\text{sb}\}\times\{+\alpha,+\text{ca}\}$ Yat grid: pn = per-neuron $(b_{j},\varepsilon_{j})$, sb = shared $(b,\varepsilon)$ within a layer, $+\alpha$ = learnable per-channel scaling, $+\text{ca}$ = constant $\alpha$. Each row is mean $\pm$ std over $3$ independent training seeds; downstream zero-shot evaluations are run via lm-evaluation-harness.
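The stated token budget follows from the step count, batch size and context length. A short arithmetic sketch, assuming every step consumes one full batch of full-length sequences (the variable names are illustrative):

```python
steps, global_batch, context = 20_000, 256, 1024   # values quoted in the setup
params = 261e6                                     # 261M parameters

tokens = steps * global_batch * context            # tokens seen during training
tokens_per_param = tokens / params

print(tokens)             # 5242880000, i.e. about 5.2B tokens
print(tokens_per_param)   # about 20 tokens per parameter
```

Roughly 20 tokens per parameter is the ratio commonly associated with the Chinchilla-optimal regime, which matches the budget named in the setup.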

Table 6: $261$M causal-LM evaluation, $5.2$B C4 tokens, mean $\pm$ std over $3$ training seeds. Wiki PPL is WikiText-103 next-token perplexity at the model's training context length ($1024$ tokens); long-range PPL is computed on the long-context split of PG-19 [Rae et al., 2020] with sliding-window evaluation at the same $1024$-token context. Lower is better ($\downarrow$); zero-shot accuracies via lm-evaluation-harness, higher is better ($\uparrow$). Variant naming: pn/sb = per-neuron / shared $(b,\varepsilon)$; $+\alpha$/$+\text{ca}$ = learnable / constant per-channel scaling $\alpha$. Bold = best per column.

| Variant | Wiki PPL $\downarrow$ | LAMBADA acc $\uparrow$ | ARC-E $\uparrow$ | ARC-C $\uparrow$ | HellaSwag $\uparrow$ | OpenBookQA $\uparrow$ | PIQA $\uparrow$ | Long-range PPL $\downarrow$ |
|---|---|---|---|---|---|---|---|---|
| GELU | $44.23_{\pm 1.44}$ | $23.59_{\pm 0.86}\%$ | $\mathbf{33.53_{\pm 1.12}\%}$ | $\mathbf{22.47_{\pm 0.90}\%}$ | $29.00_{\pm 0.05}\%$ | $27.87_{\pm 1.63}\%$ | $61.77_{\pm 0.65}\%$ | $114.08_{\pm 2.82}$ |
| Yat (pn$+\alpha$) | $\mathbf{39.04_{\pm 1.03}}$ | $\mathbf{24.16_{\pm 0.43}\%}$ | $33.01_{\pm 0.80}\%$ | $22.01_{\pm 1.67}\%$ | $\mathbf{29.62_{\pm 0.51}\%}$ | $28.20_{\pm 2.23}\%$ | $\mathbf{62.44_{\pm 0.56}\%}$ | $104.20_{\pm 3.40}$ |
| Yat (sb$+\alpha$) | $39.07_{\pm 0.39}$ | $23.80_{\pm 1.20}\%$ | $32.62_{\pm 0.84}\%$ | $21.39_{\pm 0.80}\%$ | $29.52_{\pm 0.19}\%$ | $\mathbf{28.40_{\pm 0.35}\%}$ | $62.37_{\pm 0.74}\%$ | $\mathbf{101.09_{\pm 3.90}}$ |
| Yat (pn$+\text{ca}$) | $42.18_{\pm 0.28}$ | $23.79_{\pm 0.71}\%$ | $33.02_{\pm 0.43}\%$ | $22.07_{\pm 0.79}\%$ | $29.20_{\pm 0.39}\%$ | $27.47_{\pm 1.40}\%$ | $61.26_{\pm 1.07}\%$ | $108.62_{\pm 3.52}$ |
| Yat (sb$+\text{ca}$) | $42.09_{\pm 1.96}$ | $23.43_{\pm 0.84}\%$ | $32.98_{\pm 1.47}\%$ | $22.44_{\pm 0.31}\%$ | $\mathbf{29.62_{\pm 0.17}\%}$ | $27.20_{\pm 0.20}\%$ | $61.50_{\pm 0.69}\%$ | $106.51_{\pm 4.08}$ |

Reading.

At matched architecture and matched Chinchilla compute, the four Yat variants and the GELU baseline land in the same accuracy regime within seed variance on the six lm-evaluation-harness tasks. The two $+\alpha$ Yat rows lead on WikiText-103 perplexity ($39.0$ to $39.1$ vs. $44.2$ for GELU) and on long-range perplexity ($101.1$ to $104.2$ vs. $114.1$) by margins that exceed seed variance; the $+\text{ca}$ rows match GELU on accuracy and lead on WikiText-103 perplexity by ${\sim}2$ PPL (lower is better). Yat throughput at this scale is ${\sim}655$K tokens/sec versus ${\sim}715$K for GELU (${\sim}8\%$ lower), measured at the training batch size on TPU v5e and averaged over the full training run. The point of the experiment is the depth-viability claim: the Yat primitive composed across $12$ layers in a trained causal LM reaches the same accuracy regime as a GELU MLP at matched compute, and the closed-form RKHS norm of Proposition 3 continues to apply per layer to the trained shared-$(b,\varepsilon)$ rows. We make no head-to-head benchmark claim and view differences smaller than seed variance as inconclusive.
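
One way to read "exceeds seed variance" against Table 6 is to divide each mean gap to GELU by the combined spread of the two rows. The sketch below does this for the two $+\alpha$ rows on both perplexities and, for contrast, on ARC-E. The ratio `gap_over_spread` is an illustrative diagnostic chosen here, not a criterion stated in the paper, and with three seeds per row it should not be read as a significance test.

```python
import math

# (mean, std over 3 seeds) copied from Table 6.
TABLE6 = {
    "GELU":         {"wiki_ppl": (44.23, 1.44), "long_ppl": (114.08, 2.82), "arc_e": (33.53, 1.12)},
    "Yat pn+alpha": {"wiki_ppl": (39.04, 1.03), "long_ppl": (104.20, 3.40), "arc_e": (33.01, 0.80)},
    "Yat sb+alpha": {"wiki_ppl": (39.07, 0.39), "long_ppl": (101.09, 3.90), "arc_e": (32.62, 0.84)},
}


def gap_over_spread(a, b):
    """|mean_a - mean_b| divided by sqrt(std_a^2 + std_b^2)."""
    (mean_a, std_a), (mean_b, std_b) = a, b
    return abs(mean_a - mean_b) / math.hypot(std_a, std_b)


# Compare every non-GELU row to the GELU baseline, metric by metric.
ratios = {
    (variant, metric): gap_over_spread(TABLE6["GELU"][metric], row[metric])
    for variant, row in TABLE6.items() if variant != "GELU"
    for metric in row
}

for (variant, metric), ratio in sorted(ratios.items()):
    print(f"{variant:13s} {metric:9s} gap/spread = {ratio:.2f}")
```

Under this diagnostic the perplexity ratios for both $+\alpha$ rows come out above $2$, while the ARC-E ratios stay below $1$, which matches the reading that the perplexity gaps are outside seed noise and the accuracy gaps are not.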

Compute resources.

Each variant was trained on $5.2$B C4 tokens at the Chinchilla-optimal compute ratio. Per-variant training used Google Cloud TPU v5e/v6e (TPU Research Cloud) with approximately $2$ TPU-hours per run; with three seeds per variant (GELU, pn$+\alpha$, sb$+\alpha$, pn$+\text{ca}$, sb$+\text{ca}$) alongside a $10$-point LR pre-sweep at $500$K tokens, total compute for the $261$M LM experiment was approximately $32$ TPU-hours.
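
The reported budget figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this appendix, plus two assumptions that are not from the paper: the commonly cited Chinchilla rule of thumb of about $20$ training tokens per parameter, and that the quoted tokens/sec figures are aggregate throughput for the whole run, so that dividing tokens by throughput gives wall-clock time.

```python
# Token budget implied by the stated schedule.
steps, global_batch, context = 20_000, 256, 1024
tokens = steps * global_batch * context      # tokens seen per run
params = 261e6                               # stated model size

# Rule of thumb (assumption): Chinchilla-optimal is about 20 tokens per parameter.
tokens_per_param = tokens / params

# Wall-clock per run implied by the reported average throughput (tokens/sec),
# assuming those figures are aggregate over all devices.
hours_yat = tokens / 655e3 / 3600
hours_gelu = tokens / 715e3 / 3600

# Run count: 5 variants x 3 seeds at the reported ~2 TPU-hours each.
runs = 5 * 3
main_hours = runs * 2
remainder_hours = 32 - main_hours            # what is left of the reported ~32 total

print(f"tokens per run      : {tokens:,}")
print(f"tokens per parameter: {tokens_per_param:.2f}")
print(f"hours per run       : Yat {hours_yat:.2f}, GELU {hours_gelu:.2f}")
print(f"main runs           : {runs} runs, {main_hours} TPU-hours, remainder {remainder_hours}")
```

The schedule gives $20{,}000\times 256\times 1024 = 5{,}242{,}880{,}000$ tokens, consistent with the stated $5.2$B. Fifteen runs at about $2$ TPU-hours account for about $30$ of the reported $32$ TPU-hours; since both figures are approximate, the remainder covers the pre-sweep and rounding rather than a separately reported cost.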

NeurIPS Paper Checklist

1. Claims
   Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   Answer: [Yes]
   Justification: The four contributions in the introduction (PSD/Mercer + Loewner-domination universality, nonradial alignment via directional trace, exact IMQ recovery with three-atom tightness, and layer-local RKHS bookkeeping) correspond directly to Theorems 1, 2 (universality), Theorem 3 and Proposition 2 (alignment / trace), Theorem 4 and Theorem 5 (IMQ recovery and three-atom tightness), and Proposition 3 together with Theorem 6 (closed-form norm and Rademacher bound). The deep-stack scope is explicitly limited to a prefix-conditioned pullback (Theorem 14) and an ambient Sobolev containment (Theorem 19); we do not claim an exact global Yat-Gram norm.

2. Limitations
   Question: Does the paper discuss the limitations of the work performed by the authors?
   Answer: [Yes]
   Justification: Section 6 contains a dedicated limitations discussion: shared $(b,\varepsilon)$ requirement for the exact RKHS norm, qualitative (not rate-form) universality, two-tier rather than exact deep-stack theory, extrapolative exterior-shell separations, and under-tuned empirical baselines. Each is paired with the scope-bounded remedy or the open problem it implies.

3. Theory assumptions and proofs
   Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   Answer: [Yes]
   Justification: Every theorem, proposition, lemma, and corollary states its assumptions inline ($b\geq 0$, $\varepsilon>0$, compactness of $\mathcal{X}$, etc.); notation is consolidated in Appendix A, generic RKHS background in Appendix C, and named results from the literature in Appendix D. Proofs of main-body results are in Appendix E; proofs of extension theorems are split across Appendix G (far-field separations), Appendix H (multiclass generalisation), Appendix I (pullback and Loewner-comparison), Appendix J (Sobolev containment), Appendix P (atom-count separation), and Appendix Q (degree-matching skeleton for the uniqueness conjecture).

4. Experimental result reproducibility
   Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper?
   Answer: [Yes]
   Justification: The CLIP probe (Appendix R), the directional-tail benchmark (Appendix S), and the $261$M causal-LM proof-of-concept (Appendix T) report the architecture, optimizer, learning-rate grid, seed count, dataset, and bandwidth choices. All three are diagnostic experiments accompanying a theoretical paper, not benchmark contributions; the paper makes no performance-superiority claim from them.

5. Open access to data and code
   Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   Answer: [Yes]
   Justification: Reproduction scripts for both diagnostic experiments are referenced in the appendix; CLIP features are public ImageNet-1k embeddings from a public CLIP ViT-B/32 checkpoint.

6. Experimental setting/details
   Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
   Answer: [Yes]
   Justification: Appendix R specifies the optimizer (Adam), six-point learning-rate grid, three-seed averaging, and the per-variant bandwidth choices. The deliberate under-tuning of RBF/IMQ baselines is disclosed explicitly.

7. Experiment statistical significance
   Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   Answer: [Yes]
   Justification: The CLIP probe table reports mean $\pm$ standard deviation across three seeds; the per-class Spearman correlation is reported with its stability range $\rho\in[0.54,0.56]$ across a $(b,\varepsilon)$ ablation grid.

8. Experiments compute resources
   Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   Answer: [Yes]
   Justification: The directional-tail benchmark runs in approximately 6 s on Apple Silicon M-series; the CLIP probe is single-GPU with frozen features at $d=512$.

9. Code of ethics
   Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
   Answer: [Yes]
   Justification: The paper is a theoretical contribution about a kernel construction; no human subjects, no personal data, no harmful applications.

10. Broader impacts
    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    Answer: [N/A]
    Justification: The paper develops mathematical foundations for a hidden-unit primitive; broader societal impact is not directly applicable to this contribution.

11. Safeguards
    Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse?
    Answer: [N/A]
    Justification: No new datasets or pretrained models are released.

12. Licenses for existing assets
    Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    Answer: [Yes]
    Justification: The CLIP ViT-B/32 model [Radford and others, 2021] (MIT license) and ImageNet-1k validation features [Deng et al., 2009] (custom research-only terms; we use only image embeddings from the public validation set, with no redistribution of raw images) are the only external assets used. Both are cited at the points of use in Appendix R.

13. New assets
    Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    Answer: [N/A]
    Justification: The paper introduces a kernel construction and theoretical results, not new assets requiring documentation.

14. Crowdsourcing and research with human subjects
    Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    Answer: [N/A]
    Justification: No human subjects.

15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    Answer: [N/A]
    Justification: No human subjects.

16. Declaration of LLM usage
    Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research?
    Answer: [N/A]
    Justification: LLMs are not part of the methodology of this paper.
