[2605.03262] Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

\bbl@luahyphenate\directlua

if Babel.locale_mapped == nil then Babel.locale_mapped = true Babel.linebreaking.add_before(Babel.locale_map, 1) Babel.loc_to_scr = Babel.chr_to_loc = Babel.chr_to_loc or end Babel.locale_props[1].letters = false \directlua if Babel.script_blocks[’Latn’] then Babel.loc_to_scr[1] = Babel.script_blocks[’Latn’] Babel.locale_props[1].lg = 89 end \directlua if Babel.script_blocks[’Latn’] then Babel.loc_to_scr[1] = Babel.script_blocks[’Latn’] end \bbl@luahyphenate\directlua if Babel.locale_mapped == nil then Babel.locale_mapped = true Babel.linebreaking.add_before(Babel.locale_map, 1) Babel.loc_to_scr = Babel.chr_to_loc = Babel.chr_to_loc or end Babel.locale_props[2].letters = false \directlua if Babel.script_blocks[”] then Babel.loc_to_scr[2] = Babel.script_blocks[”] Babel.locale_props[2].lg = 89 end \directlua if Babel.script_blocks[”] then Babel.loc_to_scr[2] = Babel.script_blocks[”] end [tifinagh]rm[Path=fonts/, Extension=.ttf, UprightFont=*]ebrima \bbl@patterns@luaenglish\bbl@patterns@luatifinagh

Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

Taha Bouhsine Affiliation: Azetta AI Affiliation: taha@azetta.ai

Abstract

We introduce the Yat kernel

k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\ \varepsilon>0,

a rational hidden-unit primitive whose units are Mercer sections over a shared input/weight space. For $b\geq 0$ the kernel is PSD; for $b>0$ it dominates a scaled inverse-multiquadric (IMQ) in the Loewner order, yielding fixed-kernel universality, characteristicness, and strict positive definiteness on every compact domain. The polynomial numerator opens nonradial alignment channels absent from finite IMQ expansions, witnessed by the directional far-field trace $T_{\infty}g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})=(\mathbf{u}^{\top}\mathbf{w})^{2}$ . Algebraically, a second finite difference in the bias recovers any IMQ atom from three positive-bias Yat atoms exactly, sharp at three atoms in every dimension at exact pointwise equality. A trained shared- $(b,\varepsilon)$ Yat layer is therefore a finite learned-center expansion in a fixed universal characteristic RKHS, with closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ and explicit diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ driving a Rademacher generalization bound.

1 Introduction

A trained MLP layer has no canonical Hilbert-space norm. Generalization bounds therefore route through Lipschitz or spectral surrogates of the weight matrix (Bartlett et al., 2017; Neyshabur et al., 2018), function-level objects loose by construction. The reason is structural: standard activations $\sigma(\mathbf{w}^{\top}\mathbf{x})$ (ReLU, GELU (Hendrycks and Gimpel, 2016), sigmoid) are scalar functions of an inner product, not symmetric kernel sections of the joint variable $(\mathbf{w},\mathbf{x})$ , so a trained layer’s read-out is not a finite element of any RKHS, and the toolkit that comes with one (closed-form Hilbert norm, fixed-kernel universality, characteristicness, MMD) is unavailable.

The available Mercer-section primitives are either radial without alignment (Gaussian RBF, IMQ: weights act only as centers, the kernel diagonal is constant, and the polynomial alignment $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$ plays no role) or aligned without distance locality (universal dot-product kernels of analytic positive-coefficient form (Smola et al., 2000): alignment and Hilbert structure but no $\|\mathbf{x}-\mathbf{w}\|$ decay). A finite, trainable hidden-unit primitive that combines locality, alignment, and Hilbert structure is the missing piece.

There is a primitive that splits the difference. The Yat kernel

k_{\text{\tifinaghfont ⵟ}}(\mathbf{w},\mathbf{x})\;=\;\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\ \varepsilon>0,

(1)

is the rational product of a polynomial alignment numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and an inverse-multiquadric (IMQ) denominator $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$ : rational, finite, trainable. It is symmetric and PSD in $(\mathbf{w},\mathbf{x})$ , so a trained shared- $(b,\varepsilon)$ layer is genuinely a finite RKHS expansion $\sum_{j}\alpha_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)$ with closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ , and single-layer generalization follows from the RKHS-ball Rademacher framework with no surrogate on the weight matrix.

Why should such a primitive exist, and why are its properties what they are? Schur products of PSD kernels are PSD, and both factors $p_{b}$ and $h_{\varepsilon}$ are PSD; the technically obvious fact is that the product is itself a Mercer kernel. The interesting fact is a single Loewner inequality. Expanding $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ and dividing through, $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is itself a sum of Schur products of PSD kernels, manifestly PSD. So the radial IMQ kernel sits inside the Yat kernel in the Loewner order; Aronszajn’s inclusion theorem combined with Micchelli’s IMQ universality (Micchelli et al., 2006) gives universality of the fixed kernel, not just of a family. The radial alternatives are recovered as a strict sub-RKHS of $\mathcal{H}_{b,\varepsilon}$ .

Loewner domination delivers RKHS containment by a soft analytic argument, but it leaves an algebraic question: can the IMQ atoms be constructed from Yat atoms, or only known to lie inside the Yat RKHS by Aronszajn? Section 4 answers the constructive question with a sharper one-line fact:

g_{\varepsilon}(\cdot;\mathbf{w},3h)\;-\;2\,g_{\varepsilon}(\cdot;\mathbf{w},2h)\;+\;g_{\varepsilon}(\cdot;\mathbf{w},h)\;=\;2h^{2}\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}),

(2)

a second-order finite difference of three biased Yat atoms in the bias parameter recovers every IMQ atom exactly, and the count of three is sharp in every dimension. This is what the analytic Loewner argument cannot see, and what makes the construction concrete: any quantitative IMQ approximation rate transfers to Yat at factor-three width overhead.

The polynomial numerator does more than improve the constants. IMQ RKHS functions decay at infinity, but a Yat atom retains a directional quadratic shadow along every ray $r\mathbf{u}$ :

T_{\infty}\,g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})\;=\;(\mathbf{u}^{\top}\mathbf{w})^{2},

(3)

independent of $b$ and $\varepsilon$ , a kernel invariant that survives every choice of hyperparameter. No finite IMQ combination has nonzero trace, so the alignment numerator opens directional channels in $\mathcal{H}_{b,\varepsilon}$ that the radial factor alone cannot reach. Section 3 makes this precise: the trace map surjects onto homogeneous quadratic forms on the sphere while annihilating every finite IMQ combination.

Both kernel hyperparameters appear directly in the single-layer Rademacher bound as regularization targets, a transparency that constant-diagonal radial kernels lose and scalar MLP units do not have at all. The reason is in the kernel diagonal: $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ is polynomial in $\|\mathbf{x}\|$ with explicit dependence on both $b$ and $\varepsilon$ , and the diagonal carries those two scalars into the bound. The closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ itself is generic to any continuous Mercer kernel; what is Yat-specific is the form of the diagonal (Section 5).

Organization. Section 2 establishes Mercer structure, Loewner domination, and fixed-kernel universality. Section 3 develops the channel decomposition, the directional far-field trace, and the bounded-domain width-complexity gap. Section 4 proves the bias-finite-difference identity (2) and its three-atom sharpness. Section 5 derives the closed-form RKHS norm and the Rademacher bound. Section 6 treats depth: per-prefix exact bookkeeping, ambient Sobolev containment, and a conjectured impossibility of any prefix-independent global Yat-Gram norm. Background, related work, and proofs are in Appendices B–E onwards.

2 Mercer Structure and Fixed-Kernel Universality

Notation. Throughout, $d\geq 1$ , $\mathcal{X}\subset\mathbb{R}^{d}$ is compact, $b\geq 0$ , $\varepsilon>0$ . Write $D_{\varepsilon}(\mathbf{x},\mathbf{w}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ for the regularised squared distance, and use the shorthands

g_{\varepsilon}(\mathbf{x};\mathbf{w},b):=\frac{(\mathbf{x}^{\top}\mathbf{w}+b)^{2}}{D_{\varepsilon}(\mathbf{x},\mathbf{w})}\quad\text{(biased Yat atom)},\qquad h_{\varepsilon}(\mathbf{x},\mathbf{w}):=k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}):=D_{\varepsilon}(\mathbf{x},\mathbf{w})^{-1}\quad\text{(IMQ)}.

The unbiased section $k_{\text{\tifinaghfont ⵟ}}(\mathbf{w},\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w},0)$ corresponds to $b=0$ . We adopt throughout the convention that the RKHS of the identically-zero kernel is the zero subspace $\{0\}$ with trivial norm; this makes the channel decomposition of Theorem 3 and the limit $b\to 0^{+}$ well-defined when individual channels collapse.

Two algebraic facts anchor the kernel-section view of the Yat unit. First, $k_{b,\varepsilon}$ is positive semidefinite, and thus a Mercer kernel on every compact $\mathcal{X}$ , for every $b\geq 0$ and $\varepsilon>0$ . Second, on top of mere PSD, $k_{b,\varepsilon}$ dominates a scaled IMQ kernel in the Loewner order: $b^{2}h_{\varepsilon}\preceq k_{b,\varepsilon}$ . This second fact transfers IMQ density into $\mathcal{H}_{b,\varepsilon}$ and pins down fixed-kernel universality on every compact domain whenever $b>0$ . Both follow from the Schur-product factorisation $k_{b,\varepsilon}=p_{b}\cdot h_{\varepsilon}$ of the polynomial alignment numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and the IMQ denominator $h_{\varepsilon}$ .

Theorem 1 (PSD and Mercer for $b\geq 0$ ).

For every $b\geq 0$ , $\varepsilon>0$ , the kernel $k_{\text{\tifinaghfont ⵟ}}$ in (1) is symmetric and positive semidefinite on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ , and on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ the restriction is continuous and Mercer. By Moore–Aronszajn there exists a unique RKHS $\mathcal{H}_{b,\varepsilon}$ with reproducing kernel $k_{b,\varepsilon}$ . The sign condition is essential: for $b<0$ PSD can fail (explicit $2\times 2$ counterexample at $b=-1$ , $d=2$ , nodes $\mathbf{e}_{1},\mathbf{e}_{2}$ ).

The sign condition has a structural cause: the polynomial-numerator feature map $\Phi_{b}(\mathbf{x})=(\mathbf{x}\otimes\mathbf{x},\sqrt{2b}\,\mathbf{x},b)$ requires $b\geq 0$ to be real. Per-atom variation of hyperparameters within an expansion $f=\sum_{j}\alpha_{j}\,k_{b_{j},\varepsilon_{j}}(\mathbf{w}_{j},\cdot)$ lies in the larger biased Yat span $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ (see Appendix A) but not in any single RKHS $\mathcal{H}_{b,\varepsilon}$ , so all quantitative RKHS results that follow, namely the closed-form norm (Proposition 3) and the Rademacher bound (Theorem 6), require shared $(b,\varepsilon)$ across summands. This is the unit of analysis for the rest of the paper.

PSD structure implies a Loewner-order inequality that yields RKHS containment and singular universality as direct algebraic consequences. Write $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$ for the IMQ kernel. Expanding $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ gives $k_{b,\varepsilon}-b^{2}h_{\varepsilon}=[2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}]h_{\varepsilon}$ , a sum of Schur products of PSD kernels, hence PSD as a kernel: every finite Gram matrix of $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is itself PSD.

Theorem 2 (Kernel-order domination).

For every $b\geq 0$ , $\varepsilon>0$ ,

b^{2}\,h_{\varepsilon}\preceq k_{b,\varepsilon}

in the Loewner PSD order on kernels. By Aronszajn’s inclusion theorem (Paulsen and Raghupathi, 2016),

\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon},\qquad\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq\frac{1}{b}\|f\|_{\mathcal{H}_{h_{\varepsilon}}}\quad(b>0).

Equivalently, $B_{\mathcal{H}_{h_{\varepsilon}}}(R)\subseteq B_{\mathcal{H}_{b,\varepsilon}}(R/b)$ .

The constant $b^{2}$ on the left of the Loewner inequality is sharp on $\mathbb{R}^{d}$ : at the single point $\mathbf{x}=\mathbf{0}$ the ratio is $k_{b,\varepsilon}(\mathbf{0},\mathbf{0})/h_{\varepsilon}(\mathbf{0},\mathbf{0})=(b^{2}/\varepsilon)/(1/\varepsilon)=b^{2}$ exactly, so the $1\times 1$ Gram of $k_{b,\varepsilon}-c\,h_{\varepsilon}$ at $\mathbf{x}=\mathbf{0}$ becomes negative for any $c>b^{2}$ , and therefore $\sup\{c\geq 0:c\,h_{\varepsilon}\preceq k_{b,\varepsilon}\}=b^{2}$ . The sharpness is global: on a compact $\mathcal{X}\subset\mathbb{R}^{d}$ that excludes a neighbourhood of the origin, the optimal domination constant can be strictly larger than $b^{2}$ , since $k_{b,\varepsilon}/h_{\varepsilon}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ (after cancelling the IMQ denominators) is bounded below away from $b^{2}$ once $\mathbf{x}^{\top}\mathbf{w}$ is bounded away from $0$ .

Two related constants govern the $b\to 0^{+}$ behaviour and should not be conflated. The kernel-domination constant $b^{2}$ enters squared because it compares kernels (and hence Gram matrices), while the RKHS embedding constant in Theorem 2 is $1/b$ , comparing norms.

Quantitative approximation rates transferred through the embedding therefore inherit $1/b$ on norms or $1/b^{2}$ on squared norms, harmless at fixed $b>0$ but exploding at the boundary.

That boundary is genuine, not an artefact of the embedding. At $b=0$ the channel decomposition (Theorem 3 below) shows that $\mathcal{H}_{0,\varepsilon}$ collapses to the quadratic-alignment channel $\mathcal{H}_{k_{2}}$ alone: the radial channel $\mathcal{H}_{k_{0}}$ and the linear-alignment channel $\mathcal{H}_{k_{1}}$ disappear under the zero-kernel convention. Equivalently, the IMQ-radial subspace $\mathcal{H}_{h_{\varepsilon}}$ embeds continuously into $\mathcal{H}_{b,\varepsilon}$ for every $b>0$ but is not contained in $\mathcal{H}_{0,\varepsilon}$ at all, since every $f\in\mathcal{H}_{0,\varepsilon}$ vanishes at the origin while generic IMQ functions do not. The transition $b=0\rightleftharpoons b>0$ is a phase transition of the RKHS, recorded analytically in Table 1: the practical recommendation is to use $b>0$ rather than to treat $b=0$ as a regularised limit. With the boundary understood, fixed-kernel universality follows for the open regime.

Proposition 1 (Singular universality at $b_{0}>0$ ).

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact and let $b_{0}>0$ , $\varepsilon_{0}>0$ . Then the RKHS $\mathcal{H}_{b_{0},\varepsilon_{0}}$ of the fixed biased kernel $k_{b_{0},\varepsilon_{0}}(\mathbf{w},\mathbf{x})=(\mathbf{w}^{\top}\mathbf{x}+b_{0})^{2}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon_{0})$ is dense in $C(\mathcal{X})$ in the uniform norm. Hence $k_{b_{0},\varepsilon_{0}}$ is a universal kernel, not merely a member of a universal family.

Universality in turn implies the kernel mean embedding is injective and Gram matrices at distinct nodes are strictly positive definite, both consequences we use throughout.

Corollary 1 (Characteristic kernel and SPD).

For every $b_{0}>0$ , $\varepsilon_{0}>0$ , and every compact $\mathcal{X}\subset\mathbb{R}^{d}$ : (i) $k_{b_{0},\varepsilon_{0}}$ is characteristic on $\mathcal{X}$ , so the kernel mean embedding $\mu\mapsto\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})$ is injective on finite signed Borel measures and the maximum mean discrepancy $\mathrm{MMD}_{k_{b_{0},\varepsilon_{0}}}$ metrizes weak convergence (Sriperumbudur et al., 2010, Thm. 23); (ii) $k_{b_{0},\varepsilon_{0}}$ is strictly positive definite in the standard kernel-theoretic sense: for any $n\geq 1$ and any $n$ pairwise distinct $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$ , the Gram matrix $\mathbf{K}$ is strictly positive definite, hence invertible, and the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is a proper RKHS norm on coefficient space.

Together with Theorem 1, these results license a shared- $(b,\varepsilon)$ Yat layer as a finite expansion in a universal characteristic RKHS, the unit of analysis the rest of the paper exploits. Table 1 summarises how Yat compares with its three closest classical kernel-valued alternatives along the properties used in the rest of the paper.

Table 1: Theoretical properties at the hidden-layer level. Fixed bias

b\geq 0

\varepsilon>0

; “UAT-fam” = family-level universal approximation; “UAT-sing” = single fixed kernel is universal; “Char.” = characteristic kernel (MMD metrizes weak convergence on compact

\mathcal{X}

); “Bounded” =

k(\mathbf{w},\cdot)

bounded on

\mathbb{R}^{d}

. Polynomial =

(\mathbf{w}^{\top}\mathbf{x}+b)^{2}

; RBF =

e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}

; IMQ =

(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}

; Univ. dot-prod =

e^{\gamma\mathbf{x}^{\top}\mathbf{w}}

(universal analytic dot-product kernel (Smola et al., 2000)).

^†Fails universality: finite-rank RKHS (not dense in $C(\mathcal{X})$ for infinite $\mathcal{X}$ ).
Primitive	PSD/Mercer	UAT-fam	UAT-sing	Char.	Bounded	Numer.	Denom.
Yat ( $k_{\text{\tifinaghfont ⵟ}},\,b>0$ )	✓	✓	✓	✓	✓	align	local
Yat ( $k_{\text{\tifinaghfont ⵟ}},\,b=0$ )	✓	✓	—^§	—^§	✓	align	local
Polynomial $(\cdot)^{2}$	✓	—^†	—^†	—^‡	—	align	—
Univ. dot-prod	✓	✓	✓	✓	—	align	—
RBF	✓	✓	✓	✓	✓	—	local
IMQ	✓	✓	✓	✓	✓	—	local
^‡Fails characteristic: cannot separate all probability measures.
^§Fails on origin-containing domains: $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$ forces $f(\mathbf{0})=0$ for every $f\in\mathcal{H}_{0,\varepsilon}$ .

3 Nonradial Alignment Structure

Theorem 2 embeds the IMQ RKHS into $\mathcal{H}_{b,\varepsilon}$ for $b>0$ , but it leaves a structural question open: does the polynomial alignment numerator add anything that the radial denominator alone could not produce? It does, and the gap is recorded by a single far-field invariant. The IMQ RKHS lives entirely inside the $C_{0}$ part of function space, since every IMQ RKHS element decays at infinity, while a Yat atom retains a directional quadratic shadow $(\mathbf{u}^{\top}\mathbf{w})^{2}$ along every ray $r\mathbf{u}$ as $r\to\infty$ . That mismatch is the structural witness of nonradiality, and once it is in hand the rest of the section follows by algebra: a strict RKHS containment, a clean three-channel decomposition of $\mathcal{H}_{b,\varepsilon}$ , and a sharp directional trace identity that no finite IMQ combination can reproduce.

The decay statement is the simplest of the three.

Lemma 1 (IMQ RKHS functions vanish at infinity).

Every $f\in\mathcal{H}_{h_{\varepsilon}}$ belongs to $C_{0}(\mathbb{R}^{d})$ , i.e., $f(\mathbf{x})\to 0$ as $\|\mathbf{x}\|\to\infty$ .

The lemma is a global statement on $\mathbb{R}^{d}$ , distinct from the compact-domain universality of $h_{\varepsilon}$ : on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ , the IMQ RKHS restricts densely to $C(\mathcal{X})$ (Micchelli et al., 2006), but as functions on the ambient $\mathbb{R}^{d}$ the elements of $\mathcal{H}_{h_{\varepsilon}}$ vanish at infinity, while generic Yat sections do not. A direct consequence is that the Loewner embedding cannot reverse, so the RKHS containment is strict.

Corollary 2 (Strict containment and non-reversibility).

For $b>0$ , $\varepsilon>0$ : $\mathcal{H}_{h_{\varepsilon}}\subsetneq\mathcal{H}_{b,\varepsilon}$ . Moreover, no constant $C>0$ satisfies $k_{b,\varepsilon}\preceq C\,h_{\varepsilon}$ on $\mathbb{R}^{d}$ .

The strict gap has an explicit factorisation. Expanding the squared alignment numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}=b^{2}+2b(\mathbf{x}^{\top}\mathbf{w})+(\mathbf{x}^{\top}\mathbf{w})^{2}$ splits $k_{b,\varepsilon}$ into three Schur products with the IMQ denominator: each piece is itself PSD, so the RKHS sum rule decomposes the Yat space accordingly into a radial channel and two alignment channels.

Theorem 3 (Yat RKHS channel decomposition).

Write $k_{b,\varepsilon}=k_{0}+k_{1}+k_{2}$ where $k_{0}:=b^{2}h_{\varepsilon}$ , $k_{1}:=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ , $k_{2}:=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$ . Each summand is PSD: $h_{\varepsilon}$ is PSD, the linear kernel $\mathbf{x}^{\top}\mathbf{w}$ is PSD (it is the inner-product feature map), the quadratic $(\mathbf{x}^{\top}\mathbf{w})^{2}$ is PSD as the Schur square of a PSD kernel, and Schur products of PSD kernels are PSD; non-negative scalar multiples preserve PSD. By the RKHS sum rule (Paulsen and Raghupathi, 2016; Berlinet and Thomas-Agnan, 2004),

\mathcal{H}_{b,\varepsilon}=\mathcal{H}_{k_{0}}+\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}},

with squared infimal-convolution norm

\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\inf_{\begin{subarray}{c}f_{0}+f_{1}+f_{2}=f\\ f_{i}\in\mathcal{H}_{k_{i}}\end{subarray}}\bigl(\|f_{0}\|_{\mathcal{H}_{k_{0}}}^{2}+\|f_{1}\|_{\mathcal{H}_{k_{1}}}^{2}+\|f_{2}\|_{\mathcal{H}_{k_{2}}}^{2}\bigr).

The three channels are: the radial IMQ channel ( $k_{0}=b^{2}h_{\varepsilon}$ ; as sets $\mathcal{H}_{k_{0}}=\mathcal{H}_{h_{\varepsilon}}$ with rescaled norm $\|f\|_{\mathcal{H}_{k_{0}}}=b^{-1}\|f\|_{\mathcal{H}_{h_{\varepsilon}}}$ (constant-rescaling identity for RKHS norms, Paulsen and Raghupathi 2016, Prop. 4.4), present when $b>0$ ), the linear-alignment IMQ channel ( $k_{1}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ , vanishes at $b=0$ ), and the quadratic-alignment IMQ channel ( $k_{2}=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$ , carrying the far-field trace, present for all $b\geq 0$ ). We adopt the convention that the RKHS of the identically-zero kernel is the zero subspace $\{0\}$ with trivial norm. At $b=0$ both $k_{0}$ and $k_{1}$ vanish under this convention: the RKHS loses the IMQ subspace and the constant coordinate, which is the degeneracy responsible for failure of fixed-kernel universality on domains containing the origin. The same degeneracy appears as the kernel diagonal vanishing at the origin: $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$ , so every $f\in\mathcal{H}_{0,\varepsilon}$ satisfies $f(\mathbf{0})=0$ , so both the universality failure and the diagonal-vanishing diagnostic record the same fact.

The quadratic-alignment channel $\mathcal{H}_{k_{2}}$ is what survives every choice of $b\geq 0$ and is the source of the far-field shadow that distinguishes Yat from IMQ. The next proposition makes the shadow precise as a sphere-valued asymptotic trace and certifies that the trace map is surjective onto homogeneous quadratic forms while annihilating every finite IMQ combination.

Proposition 2 (Directional asymptotic trace separates Yat from IMQ).

Let $\mathcal{D}_{T_{\infty}}$ denote the linear subspace of functions $F:\mathbb{R}^{d}\to\mathbb{R}$ for which the radial limit $\lim_{r\to\infty}F(r\mathbf{u})$ exists for every $\mathbf{u}\in\mathbb{S}^{d-1}$ , and define on $\mathcal{D}_{T_{\infty}}$ the asymptotic trace operator $T_{\infty}F(\mathbf{u}):=\lim_{r\to\infty}F(r\mathbf{u})$ . Both $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ and $F_{\mathrm{IMQ},\varepsilon}$ lie in $\mathcal{D}_{T_{\infty}}$ . For every $\mathbf{w}\in\mathbb{R}^{d}$ , every $b\in\mathbb{R}$ , and every $\mathbf{u}\in\mathbb{S}^{d-1}$ , the biased Yat atom satisfies

T_{\infty}\,g_{\varepsilon}(\cdot;\mathbf{w},b)(\mathbf{u})\;=\;(\mathbf{u}^{\top}\mathbf{w})^{2},

(4)

independent of $b$ and $\varepsilon$ , and the convergence is uniform in $\mathbf{u}$ . By contrast, $T_{\infty}F\equiv 0$ for every finite IMQ combination $F\in F_{\mathrm{IMQ},\varepsilon}$ . Consequently the induced linear map $T_{\infty}:F_{\text{\tifinaghfont ⵟ}}^{\geq 0}\to C(\mathbb{S}^{d-1})$ is surjective onto the space of restrictions of homogeneous quadratic forms, $\mathcal{Q}_{2}(\mathbb{S}^{d-1})=\{\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}:\mathbf{A}\in\mathrm{Sym}(d)\},$ of dimension $d(d{+}1)/2$ , while $F_{\mathrm{IMQ},\varepsilon}\subseteq\ker T_{\infty}$ .

The directional trace identity above is qualitative: a bounded radial combination eventually misses a Yat atom’s far-field value. The gap becomes quantitative on a bounded domain by combining the directional-trace identity (Proposition 2) with the classical radial curse of dimensionality for ridge approximation (Eldan and Shamir, 2016): a single Yat atom approximates a quadratic ridge function on $[-R,R]^{d}$ to arbitrary accuracy, while any bounded-variation IMQ expansion needs exponentially many atoms in the ambient dimension.

A precise quantitative form (Proposition 7, Appendix E.8) makes this concrete under a joint constraint on the IMQ side: bounded coefficient mass $\sum|a_{j}|\leq\Lambda$ and bounded center norms $\|\mathbf{v}_{j}\|\leq W\leq R/2$ , both held fixed as $d\to\infty$ . Under this joint constraint, on $[-R,R]^{d}$ a single Yat atom $k_{0,\varepsilon}(\alpha\mathbf{w}_{\star},\cdot)$ with $\alpha=\Omega(\sqrt{d})$ approximates the quadratic ridge $(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}$ uniformly to $\delta$ , while every IMQ expansion in this class needs $m\geq\exp(\Omega(d\log d))$ atoms. Lifting either constraint, by letting $\Lambda$ grow with $d$ or letting centers escape into the shell, collapses the lower bound, so the separation is a constrained radial-vs-ridge statement, not a model-free benchmark prediction.

Sharper but extrapolative quantitative separations on exterior shells $A_{R}=\{R\leq\|\mathbf{x}\|\leq 2R\}$ for $R\gg W$ are recorded in Appendix G.1: a uniform $L^{\infty}$ bound and an $L^{2}$ directional-tail risk lower bound, both for bounded-variation IMQ and RBF expansions. On normalized-feature compact domains both IMQ and RBF are universal (Micchelli et al., 2006; Steinwart and Christmann, 2008), so these are far-field structural separations rather than compact-domain benchmark predictions.

4 IMQ Embedding via Bias Finite Differences

Theorem 2 buys an IMQ-into-Yat embedding through Loewner domination, but that argument is analytic: Aronszajn’s inclusion theorem only certifies an injection of RKHS, not a constructive recipe for producing an IMQ atom from Yat atoms. A purely algebraic recipe exists, and the underlying reason is one line: the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is exactly quadratic in $b$ with constant second derivative $2$ , so any second-order finite difference in $b$ kills it and leaves only the IMQ denominator times a constant. The stencil $(h,2h,3h)$ is the smallest one with all biases strictly positive. Concretely, taking that stencil at a common center gives an exact IMQ atom, multiplied by a constant. The identity is pointwise, not approximate, and the count of three Yat atoms is necessary in every dimension: two atoms cannot reproduce a single IMQ section. This construction gives a second, constructive route to fixed-kernel universality and immediately transfers any IMQ approximation rate to Yat at a factor-three width overhead.

Theorem 4 (Bias-finite-difference IMQ embedding).

Fix $\varepsilon>0$ . For every $\mathbf{w}\in\mathbb{R}^{d}$ and every $h>0$ , the positive-bias second-order finite difference of the biased Yat atom satisfies the exact pointwise identity

g_{\varepsilon}(\mathbf{x};\mathbf{w},3h)\;-\;2\,g_{\varepsilon}(\mathbf{x};\mathbf{w},2h)\;+\;g_{\varepsilon}(\mathbf{x};\mathbf{w},h)\;=\;\frac{2h^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon}\;=\;2h^{2}\,k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}),

(5)

for every $\mathbf{x}\in\mathbb{R}^{d}$ . Consequently $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ , where $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}=\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},b>0\}$ and $F_{\mathrm{IMQ},\varepsilon}=\mathrm{span}\{k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}$ . The containment is strict.

The factor-three overhead is not slack: removing any one of the three Yat atoms loses the cancellation that produces the IMQ section. The next theorem establishes this as a tightness statement at the level of exact pointwise equality.

Theorem 5 (Three atoms are necessary).

The factor- $3$ reduction in (5) is sharp in the number of atoms: for every $\mathbf{w}_{0}\neq\mathbf{0}$ , $\varepsilon>0$ , $d\geq 1$ , no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ .

The full proof is in Appendix E.2, where the result is restated for convenience (Theorem 9), via an irreducible-quadric argument in $\mathbb{C}[\mathbf{x}]/(D_{1})$ . Tightness is at the level of exact pointwise equality; whether two atoms can $L^{\infty}$ -approximate $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ on a compact set is open. The hypothesis $\mathbf{w}_{0}\neq\mathbf{0}$ is essential: at the origin, the single Yat atom $g_{\varepsilon}(\cdot;\mathbf{0},b)/b^{2}=k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{0})$ recovers the IMQ atom exactly, so one atom suffices in that case.

Two consequences follow without additional analysis. The first propagates Micchelli’s IMQ universality (Micchelli et al., 2006) into the Yat family by composition; the second turns the embedding into a quantitative rate-transfer statement.

Corollary 3 (Family universality, algebraic route).

For every compact $\mathcal{X}\subset\mathbb{R}^{d}$ , both $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ at any single $\varepsilon>0$ and the broader families $F_{\text{\tifinaghfont ⵟ}}^{+}=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},b>0,\varepsilon>0\}$ and $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},b\geq 0,\varepsilon>0\}$ are dense in $C(\mathcal{X})$ in the uniform norm. This gives a second algebraic route to the universality already established analytically by Proposition 1 via Loewner-IMQ domination.

Corollary 4 (Approximation rate transfer at factor- $3$ width overhead).

Fix $\varepsilon>0$ and $h>0$ , and let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact. For every finite IMQ approximant $F=\sum_{j=1}^{m}a_{j}\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{j})\in F_{\mathrm{IMQ},\varepsilon}$ of $m$ atoms, there exists a Yat approximant $G\in F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ of at most $3m$ atoms (positive biases drawn from $\{h,2h,3h\}$ ) with $G(\mathbf{x})=F(\mathbf{x})$ for every $\mathbf{x}\in\mathbb{R}^{d}$ . In particular, any quantitative IMQ approximation rate $\inf_{F\in F_{\mathrm{IMQ},\varepsilon},\,|F|\leq m}\|f-F\|_{L^{\infty}(\mathcal{X})}\leq\phi(m)$ transfers to a Yat rate $\inf_{G\in F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+},\,|G|\leq 3m}\|f-G\|_{L^{\infty}(\mathcal{X})}\leq\phi(m)$ ; equivalently, any IMQ rate $\phi(m)$ gives a Yat rate $\phi(\lfloor m/3\rfloor)$ .

Corollary 4 is the answer to “what does Yat inherit from IMQ approximation theory?”. We do not prove new Sobolev or Besov approximation rates here; the bias finite-difference identity shows that Yat is never worse than fixed- $\varepsilon$ IMQ at the level of any rate inherited through this embedding. Yat-native rates that exploit the alignment numerator beyond what radial expansions can match remain open. Note that the $3m$ atoms on the Yat side carry three distinct biases $\{h,2h,3h\}$ , so the resulting expansion lies in the wider biased Yat span $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ rather than in any single shared- $(b,\varepsilon)$ RKHS; the closed-form norm and Rademacher bound of Section 5 apply to each fixed-bias channel separately, with the per-channel norms combining via the RKHS sum rule.

The rate-transfer corollary leaves the RKHS-ball radius $B$ unspecified: in the absence of a constructive bound on $\|f\|_{\mathcal{H}_{b,\varepsilon}}$ , the inherited rate is information-theoretic rather than operational. The next section closes this gap by computing $B$ in closed form for any trained shared- $(b,\varepsilon)$ Yat layer and propagating it to a Rademacher generalization bound, so that the rate transfer above and the capacity bound below are companion statements: the first says what error rate a target in $\mathcal{H}_{b,\varepsilon}$ admits at $3N$ atoms; the second says what radius the trained network’s coordinate functions occupy.

5 Layer-Local RKHS Capacity

Section 4 transferred IMQ approximation rates to Yat at factor-three width overhead but left the radius of the relevant RKHS ball unspecified, and an inherited rate without a constructive radius is information-theoretic, not operational.¹¹1The factor-three rate-transfer construction uses three biases $\{h,2h,3h\}$ per center, so the constructed approximant lives in the wider biased Yat span $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ rather than in any single shared- $(b,\varepsilon)$ RKHS. The closed-form norm and Rademacher bound here apply per shared- $(b,\varepsilon)$ Yat layer; the two are companion statements over the kernel family, not over one trained network. Per-channel norms combine via the RKHS sum rule. The reproducing property of $\mathcal{H}_{b,\varepsilon}$ closes the gap: the squared RKHS norm of any finite kernel expansion $\sum_{j}\alpha_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)$ equals the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ in the trained weights and read-out coefficients, computable from those parameters alone, with no Lipschitz surrogate and no spectral product across depth. Plugging this norm into the RKHS-ball Rademacher framework yields a single-layer bound $B(R^{2}+b)/(\sqrt{n}\sqrt{\varepsilon})$ whose only Yat-specific ingredient is the kernel diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ , which carries both $b$ and $\varepsilon$ into the bound as explicit regularisation targets, in contrast to the constant diagonals of Gaussian RBF and IMQ.

Proposition 3 (Closed-form RKHS norm).

Let $b\geq 0$ , $\varepsilon>0$ , and let $f(\cdot)=\sum_{j=1}^{m}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\in\mathcal{H}_{b,\varepsilon}$ be a finite kernel expansion with shared $(b,\varepsilon)$ . Then

\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}\;=\;\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha},\qquad\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j}).

(6)

The identity holds for any choice of $\{\mathbf{w}_{j}\}$ , including repeated centers; in that case $\mathbf{K}$ may be singular, and although different coefficient vectors can represent the same function $f$ , their difference lies in $\ker\mathbf{K}$ and the quadratic form $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is invariant across equivalent representations.

The reproducing property (Aronszajn, 1950; Steinwart and Christmann, 2008) delivers this as an exact finite-dimensional formula: for a shared- $(b,\varepsilon)$ Yat layer, the right-hand side is computable from the trained weights $(\mathbf{w}_{j},\boldsymbol{\alpha})$ alone, in closed form, with no Lipschitz surrogate. The identity is mathematically tight at every $\boldsymbol{\alpha}$ even when $\mathbf{K}$ is singular; near-coincident centers are a numerical-conditioning issue, not a mathematical overstatement, and stable factorisation or deduplication suffices in floating-point implementations. Plugging this norm into the standard RKHS-ball Rademacher inequality (Bartlett and Mendelson, 2002) yields the Yat-specific generalisation bound below; the only ingredient that depends on the kernel beyond a generic Mercer-section property is the diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ , which transfers $b$ and $\varepsilon$ into explicit terms in the bound.

Theorem 6 (Single-layer Rademacher bound).

Fix $b\geq 0$ , $\varepsilon>0$ . Let $\mathcal{F}_{B,b,\varepsilon}=\{f\in\mathcal{H}_{b,\varepsilon}:\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq B\}$ be the RKHS ball of radius $B$ , $\|\mathbf{x}\|\leq R$ , and let $\{\mathbf{x}_{i}\}_{i=1}^{n}$ be i.i.d. samples. With the convention $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}):=\mathbb{E}_{\boldsymbol{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(\mathbf{x}_{i})$ , $\sigma_{i}\stackrel{{\scriptstyle\mathrm{iid}}}{{\sim}}\mathrm{Unif}\{\pm 1\}$ ,

\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;\frac{B(R^{2}+b)}{\sqrt{n}\,\sqrt{\varepsilon}}.

(7)

A data-dependent empirical refinement (Corollary 7) replaces $(R^{2}+b)$ by $\sqrt{(1/n)\sum_{i}(\|\mathbf{x}_{i}\|^{2}+b)^{2}}$ . The polynomial diagonal has a real cost on unnormalised high-dimensional inputs and we record it explicitly: under isotropic Gaussian $\mathbf{x}\in\mathbb{R}^{d}$ , $\mathbb{E}[k_{b,\varepsilon}(\mathbf{x},\mathbf{x})]=\Theta(d^{2}/\varepsilon)$ and the bound scales as $B(d+b)/\sqrt{n\varepsilon}$ , against constant-diagonal $1$ for Gaussian RBF and $1/\varepsilon$ for IMQ. Feature-scale control (e.g. unit-sphere normalisation, which collapses the diagonal to $(1+b)^{2}/\varepsilon$ ) recovers a dimension-free bound; the analysis makes the dependence explicit via $b$ and $\varepsilon$ rather than hiding it. The high-dimensional variance preservation $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(1)$ vs. $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-1/2})$ (Proposition 5) survives under this normalisation because $\mathrm{CV}$ is scale-invariant. On the unit sphere, $\|\mathbf{x}-\mathbf{w}\|^{2}=2-2\mathbf{x}^{\top}\mathbf{w}$ , so $k_{b,\varepsilon}$ becomes a function of the single variable $\mathbf{x}^{\top}\mathbf{w}$ , i.e. a zonal kernel; the directional far-field separation of §3 then refines into a sharper spectral statement, with exponential eigenvalue decay in the spherical-harmonic basis (Appendix K).

The Rademacher bound is hidden-layer-local: $B$ is computed from the trained $(\mathbf{w}_{j},\boldsymbol{\alpha})$ of one shared- $(b,\varepsilon)$ Yat layer, $R$ is a data-side radius on the layer’s input, and there is no spectral product across depth or Lipschitz surrogate.

The reproducing-property derivation is generic to any continuous Mercer kernel: a learned-center RBF or IMQ layer admits the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ formula and the same diagonal-based Rademacher bound. The Yat-specific content is the form of the diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ , which depends nontrivially on $\mathbf{x}$ and on both $b$ and $\varepsilon$ , whereas the Gaussian RBF diagonal is the constant $1$ and the IMQ diagonal is $\varepsilon^{-1}$ . Ordinary scalar MLP units do not induce such a diagonal at all because they do not define a Mercer section over the shared input/weight space. Evaluating $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ at width $m$ on input dimension $d$ costs $O(m^{2}d)$ versus $O(md)$ for a forward pass; gradient descent driving $b\to 0^{+}$ inflates the embedding constant $1/b$ and the conditioning of $\mathbf{K}$ , so a softplus-style reparametrisation is recommended.

The Yat kernel section $k_{b,\varepsilon}(\mathbf{w},\cdot)$ is uniformly bounded on every compact $\|\mathbf{x}\|\leq R$ , $\|\mathbf{w}\|\leq W$ by $(RW+b)^{2}/\varepsilon$ , and globally bounded on $\mathbb{R}^{d}$ for every fixed $\mathbf{w}$ ; at $b=0$ and $\mathbf{w}\neq\mathbf{0}$ the global supremum $\|\mathbf{w}\|^{4}/\varepsilon+\|\mathbf{w}\|^{2}$ is attained at $(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$ (Proposition 4, Appendix E.6). The diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in Theorem 6 is the special case $\mathbf{w}=\mathbf{x}$ . Classical unbounded scalar activations such as ReLU and GELU satisfy $\sigma(\mathbf{w}^{\top}\mathbf{x})\to\infty$ as $\|\mathbf{x}\|\to\infty$ , so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.

Table 1 compares Yat against the three closest classical kernel-valued primitives along the properties that drive layer-local RKHS analysis. Yat is the only row in which a polynomial alignment numerator and IMQ locality coexist while retaining fixed-kernel universality, characteristicness, and a closed-form layer-local RKHS norm. The directional asymptotic trace (Proposition 2) and the factor- $3$ tightness of the IMQ reduction (Theorem 9, App. E.2) turn the table from a rhetorical summary into a structural separation: Yat atoms add directional quadratic far-field components no finite IMQ combination can match, and at most two biased atoms cannot reproduce a single IMQ section.

What singles out Yat in Table 1 is not any one of these properties but their simultaneous combination: IMQ-type locality, polynomial alignment with nonzero quadratic far-field trace, an exact three-atom IMQ-recovery identity, fixed-kernel universality from Loewner domination, and concentration that survives high-dimensional inputs: under isotropic Gaussian inputs, $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(1)$ while $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-1/2})$ (Proposition 5, Appendix E.6), because the Yat numerator depends on the one-dimensional projection $\mathbf{x}^{\top}\mathbf{w}$ .

A multiclass margin bound of order $O(C(C-1)\kappa B/(\gamma\sqrt{n}))$ with $\kappa\leq(R^{2}+b)/\sqrt{\varepsilon}$ follows from the closed-form norm; a peeling argument (Theorem 13) replaces the worst-case radius by the trained $\max\{B(\mathbf{f}),B_{0}\}$ at $\log\log$ cost. Four further RKHS-ball comparisons propagate the Loewner domination to interpolation-norm, pullback, alignment-excess, and spectral-effective-dimension bounds (Appendix G).

6 Conclusion and Discussion

The Yat kernel $k_{b,\varepsilon}$ is a single-layer object on which four properties simultaneously hold: PSD/Mercer for $b\geq 0$ , fixed-kernel universality for $b>0$ from Loewner-IMQ domination, an exact three-atom finite-difference IMQ identity (sharp in every dimension), and a closed-form layer-local RKHS norm with diagonal-driven Rademacher control. Composing the unit at depth preserves none of these globally: no parameter-independent Mercer kernel on the input domain captures the depth- $L$ Yat-Gram norm of every trained network. The single-layer story does not extend to a global theory, but it survives in two complementary regimes that delimit the structural scope of the analysis.

In the per-prefix regime, we fix the prior layers. The pulled-back kernel $\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime}):=k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}^{\prime}))$ is PSD on $\mathcal{X}$ , and $\boldsymbol{\alpha}_{\ell,c}^{\top}\mathbf{K}_{\ell}\boldsymbol{\alpha}_{\ell,c}$ upper-bounds $\|z_{\ell,c}\|_{\mathcal{H}_{\widetilde{k}_{\ell}}}^{2}$ (Theorem 14), with equality on the prefix-range section subspace; $\widetilde{k}_{\ell}$ inherits universality when $b_{\ell}>0$ and $\Phi_{\ell-1}$ is continuous injective (generic at $m_{\ell}\geq d_{\ell-1}+1$ , not automatic). The pullback construction is closest in form to convolutional kernel networks (Mairal et al., 2014; Mairal, 2016), differing in unit and in the explicit closed-form per-layer norm. In the parameter-independent regime, every depth- $L$ coordinate under bounded $(\mathbf{w}_{\ell,j},b_{\ell},\varepsilon_{\ell},\boldsymbol{\alpha}_{\ell})$ lies in a fixed ambient Sobolev RKHS $H^{s}(\mathbb{R}^{d_{0}})$ ( $s>d_{0}/2$ , Theorem 19); the depth dependence is super-exponential, so the result is qualitative containment (degenerate at $d_{0}\gg 1$ ), not a capacity certificate. The infinite-width Yat NTK is universal on every compact domain for $b>0$ (Appendix N).

These two finite-depth regimes sit at opposite ends of one trade-off: per-prefix bookkeeping is exact but prefix-dependent, ambient Sobolev containment is parameter-independent but coarse. No single Mercer kernel on $X$ absorbs every trained Yat stack’s layer-local Gram structure simultaneously, and we formalise this impossibility as a conjecture.

Conjecture 1 (No exact global deep Yat-Gram norm).

Fix $L\geq 2$ , input domain $X\subset\mathbb{R}^{d_{0}}$ , layer widths $(m_{1},\dots,m_{L})$ , and the Yat-stack hypothesis class with every prefix $\Phi_{\ell-1}$ continuous, injective on $X$ , bilipschitz onto its image, and every Yat layer satisfying $b_{\ell}\geq 0$ , $\varepsilon_{\ell}>0$ , pairwise distinct centers. For each output coordinate $c\in\{1,\dots,m_{L}\}$ let $\mathbf{A}_{\ell,c}\in\mathbb{R}^{m_{\ell}\times d_{\ell}}$ denote the trained read-out matrix carrying coordinate $c$ through layer $\ell$ (with $\mathbf{A}_{L,c}$ the $c$ -th row of the final read-out and $\mathbf{A}_{\ell,c}$ the corresponding contribution at intermediate layers via the chain). There exists no Mercer kernel $K^{\star}$ on $X\times X$ , parameter-independent of $(\mathbf{w}_{\ell},\boldsymbol{\alpha}_{\ell},b_{\ell},\varepsilon_{\ell})$ , satisfying, for every output coordinate $c\in\{1,\dots,m_{L}\}$ and every trained Yat stack in the hypothesis class,

\|z_{L,c}\|_{\mathcal{H}_{K^{\star}}}^{2}\;=\;\sum_{\ell=1}^{L}\mathrm{Tr}\!\bigl(\mathbf{A}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{A}_{\ell,c}\bigr).

Two empirical checks corroborate the theory: $(\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c})^{1/2}$ correlates with per-class accuracy at $\rho=0.564$ on a 1000-class CLIP probe (Appendix R), and a matched-Chinchilla $261$ M Yat causal LM reaches the GELU baseline over $3$ seeds (Appendix T).

References

R. A. Adams and J. J. F. Fournier (2003) Sobolev spaces. 2 edition, Pure and Applied Mathematics, Vol. 140, Academic Press. Cited by: §J.1, §J.1.
N. Aronszajn (1950) Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (3), pp. 337–404. Cited by: Appendix C, §5.
S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, Cited by: Appendix N.
F. Bach (2017) Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18 (19), pp. 1–53. Cited by: Appendix B.
P. L. Bartlett, O. Bousquet, and S. Mendelson (2005) Local Rademacher complexities. The Annals of Statistics 33 (4), pp. 1497–1537. Cited by: §L.1, Appendix L.
P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix B, §1.
P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: Appendix B, §E.5, §5.
A. Berlinet and C. Thomas-Agnan (2004) Reproducing kernel Hilbert spaces in probability and statistics. Springer. Cited by: Theorem 3.
A. Bietti and J. Mairal (2019) On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
A. Bietti, G. Mialon, D. Chen, and J. Mairal (2019) A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning (ICML), Cited by: Appendix O.
B. Bordelon, A. Canatar, and C. Pehlevan (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
N. Boullé, Y. Nakatsukasa, and A. Townsend (2020) Rational neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix B.
D. S. Broomhead and D. Lowe (1988) Multivariable functional interpolation and adaptive networks. Complex Systems 2 (3), pp. 321–355. Cited by: Appendix B.
M. D. Buhmann (2003) Radial basis functions: theory and implementations. Cambridge University Press. Cited by: Appendix B.
C. J. C. Burges (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (2), pp. 121–167. Cited by: Appendix B.
Y. Cao and Q. Gu (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
A. Caponnetto and E. De Vito (2007) Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), pp. 331–368. Cited by: §L.2, Appendix L, Remark 10.
Y. Cho and L. K. Saul (2009) Kernel methods for deep learning. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Appendix B.
C. Cortes, P. Haffner, and M. Mohri (2004) Rational kernels: theory and algorithms. Journal of Machine Learning Research 5, pp. 1035–1062. Cited by: Appendix B.
K. Crammer and Y. Singer (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, pp. 265–292. Cited by: Appendix B.
A. Daniely, R. Frostig, and Y. Singer (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: NeurIPS Paper Checklist.
W. E, C. Ma, and L. Wu (2019) A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences 17 (5), pp. 1407–1425. Cited by: §J.1, Appendix B.
N. El Karoui (2010) The spectrum of kernel random matrices. The Annals of Statistics 38 (1), pp. 1–50. Cited by: §E.6.
R. Eldan and O. Shamir (2016) The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), pp. 907–940. Cited by: Appendix B, §3.
M. G. Genton (2001) Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research 2, pp. 299–312. Cited by: Appendix B.
B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari (2019) Linearized two-layers neural networks in high dimension. In Advances in Neural Information Processing Systems, Cited by: §E.6.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773. Cited by: Appendix M.
D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. Cited by: §1.
J. Hoffmann et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: Appendix T.
R. A. Horn and C. R. Johnson (2012) Matrix analysis. 2 edition, Cambridge University Press. Cited by: Appendix C, §E.1, §E.7, §I.2.
A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In NeurIPS, Cited by: Appendix N, Appendix N, Appendix B.
B. Laurent and P. Massart (2000) Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28 (5), pp. 1302–1338. Cited by: §E.6.
J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep neural networks as Gaussian processes. In ICLR, Cited by: Appendix B.
I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR, Cited by: Appendix T.
J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid (2014) Convolutional kernel networks. In Advances in Neural Information Processing Systems, Cited by: Appendix B, §6.
J. Mairal (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
J. Mercer (1909) Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A 209, pp. 415–446. Cited by: Appendix C.
C. A. Micchelli, Y. Xu, and H. Zhang (2006) Universal kernels. Journal of Machine Learning Research 7, pp. 2651–2667. Cited by: Appendix C, §E.3, §E.9, §1, §3, §3, §4, Theorem 8.
A. Molina, P. Schramowski, and K. Kersting (2020) Padé activation units: end-to-end learning of flexible activation functions in deep networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B.
J. Moody and C. J. Darken (1989) Fast learning in networks of locally-tuned processing units. Neural Computation 1 (2), pp. 281–294. Cited by: Appendix B.
R. M. Neal (1996) Bayesian learning for neural networks. Lecture Notes in Statistics, Vol. 118, Springer. Cited by: Appendix B.
B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2018) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), Cited by: Appendix B, §1.
C. J. Paciorek and M. J. Schervish (2006) Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17 (5), pp. 483–506. Cited by: Appendix B.
J. Park and I. W. Sandberg (1991) Universal approximation using radial-basis-function networks. Neural Computation 3 (2), pp. 246–257. Cited by: Appendix B.
V. I. Paulsen and M. Raghupathi (2016) An introduction to the theory of reproducing kernel Hilbert spaces. Cambridge Studies in Advanced Mathematics, Vol. 152, Cambridge University Press. Cited by: Appendix N, Appendix C, §E.8, Theorem 2, Theorem 3, Theorem 3.
P. Petersen and F. Voigtländer (2018) Optimal approximation of piecewise smooth functions using deep ReLU networks. Neural Networks 108, pp. 296–330. Cited by: §J.1.
A. Pinkus (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8, pp. 143–195. Cited by: Appendix B.
A. Radford et al. (2021) Learning transferable visual models from natural language supervision. In ICML, Cited by: NeurIPS Paper Checklist.
J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap (2020) Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations (ICLR), Cited by: Table 6.
C. Raffel et al. (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. Cited by: Appendix T.
A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
A. Rahimi and B. Recht (2008) Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, Cited by: Appendix B.
T. Runst and W. Sickel (1996) Sobolev spaces of fractional order, nemytskij operators, and nonlinear partial differential equations. De Gruyter. Cited by: §J.1.
R. Schaback (1995) Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics 3 (3), pp. 251–264. Cited by: Appendix B, Appendix F.
B. Schölkopf and A. J. Smola (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press. Cited by: Appendix B, Appendix B.
A. J. Smola, Z. L. Óvári, and R. C. Williamson (2000) Regularization with dot-product kernels. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, Table 1.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet (2011) Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12, pp. 2389–2410. Cited by: §J.3, Appendix B, Appendix C.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet (2010) Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11, pp. 1517–1561. Cited by: Appendix C, §E.4, Corollary 1.
I. Steinwart and A. Christmann (2008) Support vector machines. Springer. Cited by: Appendix B, Appendix C, §E.3, §3, §5, Theorem 7.
I. Steinwart, D. R. Hush, and C. Scovel (2009) Optimal rates for regularized least squares regression. In Conference on Learning Theory (COLT), pp. 79–93. Cited by: §L.2, Appendix L.
M. Telgarsky (2017) Neural networks and rational functions. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
H. Wendland (2004) Scattered data approximation. Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press. Cited by: Appendix K, Appendix K, Appendix K, Appendix K, Appendix B, Appendix B, Appendix F.
D. V. Widder (1941) The Laplace transform. Princeton University Press. Cited by: Appendix C.
A. G. Wilson and R. P. Adams (2013) Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning (ICML), Cited by: Appendix B.
A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In AISTATS, Cited by: Appendix B.

Appendix roadmap.

The appendices are organized in three tiers. Tier 1 (proofs of main-body results): Appendix E (one top-level section, four subsections, one per main-body theorem block) contains proofs of every result in Sections 2–5. Tier 2 (extensions of the main-body theory): Appendix G treats exterior-shell separations and the polynomial-separation gap on line-containing domains; Appendix H develops the multiclass learned-norm generalization bound (Rademacher + peeling); Appendix I contains the prefix-pullback theorem and the four Loewner-comparison theorems (interpolation norms, pullback domination, alignment-excess, spectral comparison); Appendix J establishes the generic ambient Sobolev-RKHS containment of bounded smooth-atom stacks; Appendix F records the three-step asymptotic filtration of the biased span and a native-space rate-transfer statement with coefficient-mass lower bound; Appendix P converts the directional-trace separation into a quantitative atom-count bound; Appendix Q states the degree-matching skeleton for a uniqueness conjecture. Tier 3 (orthogonal applications of the same single-layer kernel): Appendix K computes the Funk–Hecke spectrum on $\mathbb{S}^{d-1}$ , including exponential eigenvalue decay and logarithmic effective dimension; Appendix L derives fast rates via eigenvalue decay (ERM and KRR routes); Appendix M gives quantitative MMD and two-sample-testing sample complexities; Appendix N computes the infinite-width NTK; Appendix O derives an intrinsic RKHS-Lipschitz bound and certified adversarial radius. Tier-3 sections are self-contained and read independently of one another. The CLIP-probe and directional-tail benchmark appendices (Appendices R, S) collect numerical evidence for results stated in the main body, and Appendix T exercises the per-prefix theory inside a trained $261$ M Yat causal language model.

Appendix A Preliminaries and Notation

This section consolidates the notation used throughout the paper, both in the main body and in the appendices. Generic RKHS background (Mercer’s theorem, Moore–Aronszajn, universality / characteristicness / SPD, Schur products, Laplace representation) is in Appendix C; named results from the literature we cite by label (Steinwart–Christmann product-kernel containment, Micchelli IMQ universality) are in Appendix D.

Table 2: Quick reference for the notation. The table is grouped by category; longer definitions and standing assumptions follow.

Symbol	Definition	First used
Domains and dimensions
$d,d_{0},d_{\ell}$	input dimension; depth- $\ell$ width	§1
$\mathcal{X}\subset\mathbb{R}^{d}$	compact input domain	§1
$\mathbb{S}^{d-1}$	unit sphere $\{\mathbf{u}:\\|\mathbf{u}\\|=1\}$	§1
$B_{R}$	closed ball $\{\mathbf{x}:\\|\mathbf{x}\\|\leq R\}$	§6
Kernels and atoms
$D_{\varepsilon}(\mathbf{x},\mathbf{w})$	$\\|\mathbf{x}-\mathbf{w}\\|^{2}+\varepsilon$ (regularised squared distance)	§1
$g_{\varepsilon}(\mathbf{x};\mathbf{w},b)$	$(\mathbf{x}^{\top}\mathbf{w}+b)^{2}/D_{\varepsilon}$ (biased Yat atom)	§1
$k_{b,\varepsilon},k_{\text{\tifinaghfont ⵟ}}$	Yat kernel $g_{\varepsilon}$ ; unbiased $k_{0,\varepsilon}$	§1
$h_{\varepsilon},k_{\mathrm{IMQ}}^{\varepsilon}$	IMQ kernel $1/D_{\varepsilon}$	§1
RKHS objects
$\mathcal{H}_{b,\varepsilon}$ , $\mathcal{H}_{\text{\tifinaghfont ⵟ}}$ , $\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$	RKHSs of $k_{b,\varepsilon}$ , $k_{\text{\tifinaghfont ⵟ}}$ , $h_{\varepsilon}$	§3
$\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$	squared norm of $\sum_{j}\alpha_{j}k(\mathbf{w}_{j},\cdot)$ at distinct $\mathbf{w}_{j}$	§6
$\preceq$	Loewner PSD order on kernels / Gram matrices	§3
Spans of atoms
$F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$	$\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},\,b>0\}$	§5
$F_{\text{\tifinaghfont ⵟ}}^{+},F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$	broader Yat spans over $b>0$ resp. $b\geq 0$ , $\varepsilon>0$	App. A
$F_{\mathrm{IMQ},\varepsilon}$	$\mathrm{span}\{h_{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}$	§5
$\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$	bounded-variation IMQ class: $\sum\|a_{j}\|\leq\Lambda$ , $\\|\mathbf{v}_{j}\\|\leq W$	App. G
$\mathcal{R}_{\text{\tifinaghfont ⵟ}}(\Lambda,W)$	Yat analogue of $\mathcal{R}_{\mathrm{IMQ}}$	App. P
Operators
$T_{\infty}F(\mathbf{u})$	$\lim_{r\to\infty}F(r\mathbf{u})$ (asymptotic-trace operator)	§4
$\mathcal{Q}_{2}(\mathbb{S}^{d-1})$	quadratic forms $\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}$ on the sphere	§4
$T_{k}$	Mercer integral operator $f\mapsto\int k(\cdot,\mathbf{x})f(\mathbf{x})\,d\rho(\mathbf{x})$	App. L
$\mathcal{N}_{k}(\lambda)$	effective dimension $\operatorname{tr}(T_{k}(T_{k}+\lambda I)^{-1})$	App. L
Network and pullback
$\Phi_{\ell},T_{\ell},m_{\ell}$	prefix map, layer map, layer width	§7
$z_{\ell,c},\boldsymbol{\alpha}_{\ell,c}$	layer- $\ell$ output coordinate $c$ and its read-out vector	§7
$\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime})$	$k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}^{\prime}))$ (pullback)	§7

Standing assumptions.

Throughout, $d\geq 1$ , $\mathcal{X}\subset\mathbb{R}^{d}$ is compact, $b\geq 0$ , $\varepsilon>0$ . The sphere is $\mathbb{S}^{d-1}=\{\mathbf{u}\in\mathbb{R}^{d}:\|\mathbf{u}\|=1\}$ and the closed ball $B_{R}=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}$ .

Kernels.

The regularised squared distance and the three kernels we work with:

	$\displaystyle D_{\varepsilon}(\mathbf{x},\mathbf{w})$	$\displaystyle:=\\|\mathbf{x}-\mathbf{w}\\|^{2}+\varepsilon,$
	$\displaystyle\text{biased Yat atom: }\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},b)$	$\displaystyle:=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}/D_{\varepsilon}(\mathbf{x},\mathbf{w}),$
	$\displaystyle\text{Yat kernel: }\quad k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$	$\displaystyle:=g_{\varepsilon}(\mathbf{x};\mathbf{w},b),\quad k_{\text{\tifinaghfont ⵟ}}(\mathbf{w},\mathbf{x}):=k_{0,\varepsilon}(\mathbf{w},\mathbf{x}),$
	$\displaystyle\text{IMQ kernel: }\quad h_{\varepsilon}(\mathbf{x},\mathbf{w})$	$\displaystyle:=k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w}):=D_{\varepsilon}(\mathbf{x},\mathbf{w})^{-1}.$

RKHS objects.

$\mathcal{H}_{b,\varepsilon}$ , $\mathcal{H}_{\text{\tifinaghfont ⵟ}}$ , $\mathcal{H}_{h_{\varepsilon}}=\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$ denote the RKHSs of $k_{b,\varepsilon}$ , $k_{\text{\tifinaghfont ⵟ}}$ , $h_{\varepsilon}$ respectively. The Loewner PSD order is denoted $\preceq$ . The RKHS $\mathcal{H}_{k}$ associated with a continuous PSD kernel $k$ is the unique Hilbert space of functions on $\mathcal{X}$ for which $f(\mathbf{x})=\langle f,k(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{k}}$ holds for all $f\in\mathcal{H}_{k}$ . The squared norm $\|f\|_{\mathcal{H}_{k}}^{2}$ of a finite expansion $f=\sum_{j}\alpha_{j}k(\mathbf{w}_{j},\cdot)$ at distinct centers $\{\mathbf{w}_{j}\}_{j=1}^{m}$ equals $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ where $\mathbf{K}_{ij}:=k(\mathbf{w}_{i},\mathbf{w}_{j})$ is the Gram matrix.

Spans of atoms.

The atom families that appear repeatedly:

	$\displaystyle F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$	$\displaystyle:=\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},\ b>0\},$
	$\displaystyle F_{\text{\tifinaghfont ⵟ}}^{+}$	$\displaystyle:=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},\ b>0,\ \varepsilon>0\},$
	$\displaystyle F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$	$\displaystyle:=\mathrm{span}\{k_{b,\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d},\ b\geq 0,\ \varepsilon>0\},$
	$\displaystyle F_{\mathrm{IMQ},\varepsilon}$	$\displaystyle:=\mathrm{span}\{k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}):\mathbf{w}\in\mathbb{R}^{d}\}.$

Bounded-variation and bounded-center sub-classes used in lower bounds:

\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W):=\Bigl\{\textstyle\sum_{j}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j}):\,\sum|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\Bigr\},\quad\mathcal{R}_{\text{\tifinaghfont ⵟ}}(\Lambda,W)\text{ analogously.}

Operators on the spans.

The directional asymptotic-trace operator on the radial limit is

T_{\infty}F(\mathbf{u})\;:=\;\lim_{r\to\infty}F(r\mathbf{u}),\qquad\mathbf{u}\in\mathbb{S}^{d-1},

defined on the linear subspace $\mathcal{D}_{T_{\infty}}\subset C(\mathbb{R}^{d})$ where the limit exists for every direction. The image of $T_{\infty}$ on $F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ is the space of quadratic forms restricted to $\mathbb{S}^{d-1}$ , $\mathcal{Q}_{2}(\mathbb{S}^{d-1}):=\{\mathbf{u}\mapsto\mathbf{u}^{\top}\mathbf{A}\mathbf{u}:\mathbf{A}\in\mathrm{Sym}(d)\}$ . The Mercer integral operator of a kernel $k$ on $L^{2}(\rho)$ is denoted $T_{k}\!:f\mapsto\int k(\cdot,\mathbf{x})f(\mathbf{x})\,d\rho(\mathbf{x})$ , with effective dimension $\mathcal{N}_{k}(\lambda):=\mathrm{tr}(T_{k}(T_{k}+\lambda I)^{-1})$ .

Network notation.

For a depth- $L$ Yat stack with input $\mathbf{x}\in X\subset\mathbb{R}^{d_{0}}$ and per-layer widths $m_{\ell}$ ,

\Phi_{0}(\mathbf{x})=\mathbf{x},\qquad\Phi_{\ell}(\mathbf{x})=T_{\ell}(\Phi_{\ell-1}(\mathbf{x})),\qquad[T_{\ell}(\mathbf{z})]_{c}=\sum_{j=1}^{m_{\ell}}\alpha_{\ell,j,c}\,k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{w}_{\ell,j},\mathbf{z}),

where $\Phi_{\ell-1}$ is the prefix map, $T_{\ell}$ the layer map, $c$ a coordinate index, and $z_{\ell,c}:=[\Phi_{\ell}]_{c}$ the layer- $\ell$ output coordinate. The pulled-back kernel on $X$ is $\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime}):=k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\Phi_{\ell-1}(\mathbf{x}^{\prime}))$ .

Conventions.

$\mathbb{S}^{d-1}$ carries the uniform surface measure where unspecified; $\mathrm{Sym}(d)$ is the space of real $d\times d$ symmetric matrices; $\mathrm{vec}(\cdot)$ is column-major matrix vectorisation; $\langle\cdot,\cdot\rangle$ without subscript is the Euclidean inner product on $\mathbb{R}^{d}$ . The Tifinagh letter \tifinaghfontⵟ (Unicode U+2D5F) is used as the kernel-specific glyph $k_{\text{\tifinaghfont ⵟ}}$ .

Appendix B Related Work

RBF and learned-center kernel networks.

Radial-basis-function networks [Broomhead and Lowe, 1988, Moody and Darken, 1989] train finite expansions of locally-tuned radial kernels and are universal approximators [Park and Sandberg, 1991]; the classical MLP approximation theory underlying this is surveyed by Pinkus [1999]. The classical theory of native spaces and IMQ universality is developed in Wendland [2004]. With a shared bandwidth, an IMQ or Gaussian RBF layer is already a finite learned-center RKHS expansion admitting the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ norm and the same diagonal-based Rademacher bound as Yat. The Yat primitive differs in two respects: it replaces the radial numerator with the polynomial alignment factor $(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$ , producing a nonzero directional far-field trace that all pure-radial combinations lack; and the bias $b>0$ places a constant-coordinate feature in the polynomial factor’s feature map, establishing fixed-kernel universality via kernel-order domination rather than through a variable-bandwidth construction.

Neural kernels and arc-cosine constructions.

Cho and Saul [2009] pioneered the program of treating neural-network hidden units as Mercer sections, introducing the arc-cosine family as kernels associated with threshold activations. The Yat construction shares the spirit of that program — treating the unit as a kernel section rather than a scalar activation — but trades the integral arc-cosine form for a finite, trainable, rational primitive whose RKHS structure (Mercer property, Loewner-IMQ domination, channel decomposition, finite-difference IMQ embedding) is fully closed-form.

Kernel machines and product kernels.

Polynomially modulated radial kernels and product or nonstationary kernel constructions are classical in kernel machines and Gaussian processes [Schölkopf and Smola, 2002, Steinwart and Christmann, 2008]. The Steinwart–Christmann product-kernel containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] is one of two routes to Yat’s fixed-kernel universality (the other being Loewner-order domination). The contribution here is not the abstract existence of a product kernel, but the specific rational hidden-unit form for which product-kernel universality, an exact IMQ finite-difference identity, a nonzero directional asymptotic trace, and a layer-local closed-form RKHS norm simultaneously hold. Prior product-kernel analyses in the GP and kernel-machine literature do not study this hidden-unit parameterization, its finite-difference structure, or its asymptotic-trace geometry.

Deep-learning generalization and the NTK.

For standard scalar-activation MLPs, layer-level capacity is controlled through spectral or Lipschitz surrogates on the weight matrix [Bartlett et al., 2017, Neyshabur et al., 2018], which are function-level bounds rather than Hilbert-space objects attached to individual units. The historical origin of the neural-network-as-kernel viewpoint is Neal [1996], where infinite-width Bayesian networks induce Gaussian-process priors. The neural tangent kernel [Jacot et al., 2018] is an infinite-width post-hoc object defined at the network output; the NNGP [Lee et al., 2018] indexes a training-sample covariance at the output; the conjugate kernel construction of Daniely et al. [2016] treats layered networks as compositions of Mercer-section objects and is the closest antecedent to our prefix-pullback formulation in Section 6; deep kernel learning [Wilson et al., 2016] places a GP on top of a learned feature extractor; the pullback formalism we use in Section 6 is closely related to the convolutional kernel networks of Mairal et al. [2014], where a kernel is constructed by repeated pullback through learned feature maps. The arc-cosine kernels of Cho and Saul [2009] and random-feature constructions [Rahimi and Recht, 2007, 2008] also yield Mercer-section-flavoured hidden-unit views, in different parameterizations. The spectral analysis of NTKs on the sphere via Funk–Hecke / Gegenbauer expansion is by now a small subliterature in ML: Bietti and Mairal [2019] characterise the NTK spectrum on the sphere, and Cao and Gu [2019] use spectral decay rates analogous to our Theorem 21 in deep-network generalization arguments; Bordelon et al. [2020] connect Mercer-eigendecay and learning curves in the direction our fast-rates appendix follows. What is Yat-specific is the conjunction: a finite, trainable, non-radial, rational primitive with fixed-kernel universality from explicit Loewner-IMQ domination, an exact algebraic IMQ embedding sharp at three atoms, a nonzero directional far-field trace, and a closed-form layer-local norm with an explicit input-dependent diagonal.

RKHS generalization bounds.

The single-layer Rademacher bound (Theorem 6) and the multiclass learned-norm generalization bound (Theorem 12) follow the RKHS-ball Rademacher framework of Bartlett and Mendelson [2002]; the multiclass extension uses the pairwise-margin approach of Crammer and Singer [2001]. The modern reference for the chain “universal $\Rightarrow$ characteristic $\Rightarrow$ SPD” on locally compact spaces, including the equivalence between universality and characteristicness under the finite-signed-measure formulation, is Sriperumbudur et al. [2011]. What is Yat-specific is the explicit, input-dependent kernel diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ , which makes $b$ and $\varepsilon$ explicit regularization targets in the bound, unlike the constant diagonal of the Gaussian ( $1$ ) or IMQ ( $\varepsilon^{-1}$ ) kernels.

Approximation rates and ridge-vs-radial separations.

The width-complexity gap (Proposition 7) sits in a longer tradition of separations between ridge and radial approximation. Eldan and Shamir [2016] established an exponential gap for ridge functions against radial expansions; Bach [2017] sharpened this in the convex-neural-network setting with $\ell_{1}$ -bounded coefficient classes that closely match our $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ . The Barron-space MLP capacity literature [E et al., 2019] provides the analogous functional class for two-layer networks; positioning Yat against the Barron framework is left to future work, as the Yat layer is a fixed-RKHS object rather than a Barron-norm-bounded class.

Rational and ratio-form kernels.

Kernels written as a polynomial-in-numerator over a polynomial-in-denominator have been studied in machine learning principally through the rational-transducer construction of Cortes et al. [2004], which formalises a class of rational kernels on sequences via finite-state operations and proves Mercer-style closure properties; the abstract framing of ”kernel as a ratio of polynomial-like objects” is the same, but the input domain (sequences vs. $\mathbb{R}^{d}$ ) and the algebraic primitives (transducer composition vs. Schur products) are disjoint, so neither the asymptotic-trace separation nor the bias finite-difference IMQ embedding has an analogue in that framework. A separate line of work installs rationality at the scalar activation level rather than at the kernel-section level: Telgarsky [2017] establishes expressivity of rational networks, Molina et al. [2020] learn Padé approximants end-to-end, and Boullé et al. [2020] train low-degree rational activations and obtain favourable approximation rates. These constructions keep the unit a scalar function of $\mathbf{w}^{\top}\mathbf{x}$ , do not define a Mercer section over the joint variable, and therefore do not carry the closed-form RKHS norm or the fixed-kernel universality the Yat construction targets; the axis is orthogonal to ours. Closer in form are the normalised polynomial kernels of the SVM literature [Burges, 1998, Schölkopf and Smola, 2002], which divide $(\mathbf{x}^{\top}\mathbf{w}+b)^{d}$ by $\sqrt{k(\mathbf{x},\mathbf{x})k(\mathbf{w},\mathbf{w})}$ , i.e. by their own self-kernel; the Yat kernel divides instead by the regularised distance $\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ , which preserves alignment-dependence in the kernel diagonal $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ and adds local IMQ-type decay that self-kernel normalisation does not. The classical theory of inverse-multiquadric kernels and their native spaces is developed at length in Wendland [2004], Buhmann [2003], including the Bessel-potential Sobolev characterisation of the IMQ native space we use throughout. Schaback [1995] established the conditional-positive-definiteness framework that places kernels with vanishing diagonals (such as the unbiased $k_{0,\varepsilon}(\mathbf{0},\mathbf{0})=0$ case) in a unified setting with strictly positive radial kernels.

Non-stationary and anisotropic kernels.

The Yat kernel is non-stationary in the sense that $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$ depends on $(\mathbf{w},\mathbf{x})$ rather than only on $\mathbf{x}-\mathbf{w}$ ; the polynomial alignment numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is intrinsically anisotropic. This places Yat in a literature on non-stationary covariance constructions in Gaussian processes and kernel machines: Genton [2001] surveys the design space of kernel families including non-stationary and anisotropic constructions, Paciorek and Schervish [2006] develop a class of non-stationary covariance functions parameterised by spatially varying length scales, and Wilson and Adams [2013] introduce spectral mixture kernels that obtain alignment-style modulation through a sum over frequency components. The Yat kernel differs from these constructions by being a single rational expression rather than a mixture or a length-scale field, by carrying an exact algebraic IMQ embedding (Theorem 4) absent from the spectral-mixture and varying-length-scale families, and by admitting a closed-form layer-local RKHS norm that is generic to shared-bandwidth Mercer-section layers.

Appendix C Background

Reproducing kernel Hilbert spaces.

A symmetric function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is positive semidefinite (PSD) if $\sum_{i,j}c_{i}c_{j}\,k(\mathbf{x}_{i},\mathbf{x}_{j})\geq 0$ for every finite set $\{\mathbf{x}_{i}\}\subset\mathcal{X}$ and coefficients $\{c_{i}\}\subset\mathbb{R}$ . Every continuous PSD $k$ on a compact set $\mathcal{X}$ is a Mercer kernel and admits a uniformly convergent eigenexpansion $k(\mathbf{x},\mathbf{y})=\sum_{i}\lambda_{i}\phi_{i}(\mathbf{x})\phi_{i}(\mathbf{y})$ with $\lambda_{i}\geq 0$ and $\{\phi_{i}\}\subset L^{2}(\mathcal{X})$ orthonormal [Mercer, 1909]. The Moore–Aronszajn theorem [Aronszajn, 1950] associates to every PSD $k$ a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{k}$ whose reproducing property $f(\mathbf{x})=\langle f,k(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{k}}$ holds for all $f\in\mathcal{H}_{k}$ . Two properties of RKHSs are used throughout. First, if $c^{2}k_{1}\preceq k_{2}$ in the Loewner PSD order (meaning $k_{2}-c^{2}k_{1}$ is PSD), then $\mathcal{H}_{k_{1}}\subseteq\mathcal{H}_{k_{2}}$ with $\|f\|_{\mathcal{H}_{k_{2}}}\leq c^{-1}\|f\|_{\mathcal{H}_{k_{1}}}$ [Paulsen and Raghupathi, 2016]. Second, the RKHS sum rule: if $k=k_{1}+k_{2}$ then $\mathcal{H}_{k}=\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}}$ with squared infimal-convolution norm $\|f\|_{\mathcal{H}_{k}}^{2}=\inf_{f_{1}+f_{2}=f}(\|f_{1}\|_{\mathcal{H}_{k_{1}}}^{2}+\|f_{2}\|_{\mathcal{H}_{k_{2}}}^{2})$ .

Universality, characteristicness, and strict positive definiteness.

A continuous PSD kernel on compact $\mathcal{X}$ is universal if its RKHS is dense in $C(\mathcal{X})$ [Micchelli et al., 2006, Steinwart and Christmann, 2008]. We say $k$ is characteristic if the kernel mean embedding $\mu\mapsto\int k(\cdot,\mathbf{x})\,d\mu(\mathbf{x})$ is injective on finite signed Borel measures on $\mathcal{X}$ (this finite-signed formulation is the one we use throughout). A kernel is strictly positive definite (SPD) if for every $n\geq 1$ and every set of pairwise distinct points $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}$ the Gram matrix is positive definite. The implications we use for continuous kernels on compact $\mathcal{X}$ are

\text{universal}\;\Longrightarrow\;\text{characteristic}\;\Longrightarrow\;\text{SPD},

where the first arrow is Sriperumbudur et al. [2011] and the second follows because if $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}=0$ for distinct nodes then the atomic measure $\mu=\sum_{i}c_{i}\delta_{\mathbf{x}_{i}}$ has zero kernel mean embedding. (Under the finite-signed-measure definition above, characteristicness on a compact domain is in fact equivalent to universality by a Hahn–Banach/Riesz argument; we only need the one-way chain stated here. A weaker probability-measure-only definition of characteristicness is strictly weaker than universality, but we do not use it.) For a bounded continuous characteristic kernel on a compact metric space, $\mathrm{MMD}_{k}$ metrizes weak convergence of probability measures [Sriperumbudur et al., 2010, Thm. 23]; we use this endpoint of the chain throughout.

Schur products and Laplace integrals.

Two closure properties are used repeatedly. (S) The Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]: the entrywise (Hadamard) product of two PSD kernels is PSD. (L) The Laplace representation $a^{-1}=\int_{0}^{\infty}e^{-ta}\,dt$ for $a>0$ [Widder, 1941, Ch. IV]: the IMQ kernel $h_{\varepsilon}(\mathbf{x},\mathbf{w})$ $=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}=\int_{0}^{\infty}e^{-t\varepsilon}e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}\,dt$ is a nonnegative mixture of Gaussian PSD kernels, hence PSD. Combining (S) and (L), any Schur product $p(\mathbf{x},\mathbf{w})\cdot h_{\varepsilon}(\mathbf{x},\mathbf{w})$ with a PSD polynomial kernel $p$ is PSD — this is the core of the Yat PSD proof. For the Yat numerator $p_{b}(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ , the feature map $\Phi_{b}(\mathbf{x})=(\mathbf{x}\otimes\mathbf{x},\sqrt{2b}\,\mathbf{x},b)$ requires $b\geq 0$ for $\sqrt{2b}$ to be real; the sign condition is necessary, as Theorem 1 exhibits a $2\times 2$ counterexample at $b=-1$ .

Appendix D Background Theorems from the Literature

This appendix collects the named results invoked by label in the proofs of Sections 2–5. Standard background (Mercer’s theorem, Moore–Aronszajn, universality, characteristicness, strict positive definiteness, empirical Rademacher complexity, the Schur product theorem, and the Laplace representation) is given in Section C and not repeated here.

Theorem 7 (Product-kernel RKHS containment [Steinwart and Christmann, 2008, Lemma 4.6]).

Let $k_{1},k_{2}$ be continuous PSD kernels on $\mathcal{X}$ , and let $\mathcal{H}_{1},\mathcal{H}_{2}$ be their RKHSs. Then the Schur product $k=k_{1}\cdot k_{2}$ is a continuous PSD kernel with RKHS $\mathcal{H}_{k}$ satisfying: for every $f_{1}\in\mathcal{H}_{1}$ and $f_{2}\in\mathcal{H}_{2}$ , the pointwise product $f_{1}\cdot f_{2}\in\mathcal{H}_{k}$ , with $\|f_{1}\cdot f_{2}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{1}}\|f_{2}\|_{\mathcal{H}_{2}}$ .

Theorem 8 (Micchelli IMQ universality [Micchelli et al., 2006, Thm. 17]).

For every $\varepsilon>0$ and every compact $\mathcal{X}\subset\mathbb{R}^{d}$ , the IMQ span $F_{\mathrm{IMQ},\varepsilon}=\mathrm{span}\{(\|\cdot-\mathbf{w}\|^{2}+\varepsilon)^{-1}:\mathbf{w}\in\mathbb{R}^{d}\}$ is dense in $C(\mathcal{X})$ in the uniform norm; equivalently, the IMQ kernel is universal on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ .

Appendix E Proofs of Main-Body Results

E.1 Proof of Theorem 1: PSD/Mercer for $b\geq 0$

We restate Theorem 1 for the reader’s convenience and prove it as a single block.

Theorem (Theorem 1 restated).

Let $\varepsilon>0$ and $b\geq 0$ . The kernel $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=(\mathbf{w}^{\top}\mathbf{x}+b)^{2}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ is symmetric and positive semidefinite on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ . On every compact $\mathcal{X}\subset\mathbb{R}^{d}$ , $k_{b,\varepsilon}|_{\mathcal{X}\times\mathcal{X}}$ is continuous, so Mercer’s theorem yields a decomposition $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\sum_{i\geq 1}\lambda_{i}\phi_{i}(\mathbf{w})\phi_{i}(\mathbf{x})$ with $\lambda_{i}\geq 0$ and $\{\phi_{i}\}\subset L^{2}(\mathcal{X})$ orthonormal. By Moore–Aronszajn there exists a unique RKHS $\mathcal{H}_{b,\varepsilon}$ with reproducing kernel $k_{b,\varepsilon}$ .

Proof.

Factor $k_{b,\varepsilon}=p\cdot h_{\varepsilon}$ , where $p(\mathbf{x},\mathbf{w}):=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ and $h_{\varepsilon}(\mathbf{x},\mathbf{w}):=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{-1}$ .

Factor $p$ . Extend symmetrically by $\sqrt{b}$ : define $\tilde{\mathbf{x}}:=(\mathbf{x},\sqrt{b})\in\mathbb{R}^{d+1}$ and $\tilde{\mathbf{w}}:=(\mathbf{w},\sqrt{b})\in\mathbb{R}^{d+1}$ (well-defined because $b\geq 0$ ). Then $\tilde{\mathbf{x}}^{\top}\tilde{\mathbf{w}}=\mathbf{x}^{\top}\mathbf{w}+b$ , so $p(\mathbf{x},\mathbf{w})=(\tilde{\mathbf{x}}^{\top}\tilde{\mathbf{w}})^{2}$ is the square of a linear kernel in extended feature space. The linear kernel is PSD; by the Schur product theorem its Hadamard square is also PSD, so $p$ is PSD. Equivalently, $\Phi_{b}(\mathbf{x})=(\mathrm{vec}(\mathbf{x}\mathbf{x}^{\top}),\sqrt{2b}\,\mathbf{x},b)\in\mathbb{R}^{d^{2}+d+1}$ satisfies $\langle\Phi_{b}(\mathbf{x}),\Phi_{b}(\mathbf{w})\rangle=(\mathbf{x}^{\top}\mathbf{w})^{2}+2b\,\mathbf{x}^{\top}\mathbf{w}+b^{2}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}=p(\mathbf{x},\mathbf{w})$ , which is real iff $b\geq 0$ . The $b<0$ counter-example below shows this condition is necessary.

Factor $h_{\varepsilon}$ . By the Laplace identity $a^{-1}=\int_{0}^{\infty}e^{-ta}\,dt$ for $a>0$ ,

h_{\varepsilon}(\mathbf{x},\mathbf{w})\;=\;\int_{0}^{\infty}e^{-t\varepsilon}\,e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}\,dt.

(8)

For each $t>0$ the Gaussian $e^{-t\|\mathbf{x}-\mathbf{w}\|^{2}}$ is a PSD kernel and $e^{-t\varepsilon}\geq 0$ . A nonnegative mixture of PSD kernels is PSD, so $h_{\varepsilon}$ is PSD.

Schur product. By the Schur product theorem [Horn and Johnson, 2012], the entry-wise product of two PSD matrices is PSD. Applied to the Gram matrices of $p$ and $h_{\varepsilon}$ on an arbitrary finite point set, this shows $k_{b,\varepsilon}=p\cdot h_{\varepsilon}$ is PSD on $\mathbb{R}^{d}$ .

Continuity on compact $\mathcal{X}$ . The denominator $\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ is bounded below by $\varepsilon>0$ , so $k_{b,\varepsilon}$ is a ratio of a continuous numerator and a strictly positive continuous denominator, hence continuous on $\mathcal{X}\times\mathcal{X}$ . Mercer’s theorem applies and yields the decomposition. Moore–Aronszajn gives the unique RKHS.

Counterexample at $b<0$ . Take $b=-1$ , $d=2$ , nodes $\mathbf{x}_{1}=\mathbf{e}_{1}$ , $\mathbf{x}_{2}=\mathbf{e}_{2}$ . The numerator Gram is

[(\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+b)^{2}]\;=\;\begin{pmatrix}(\,1-1)^{2}&(0-1)^{2}\\ (0-1)^{2}&(1-1)^{2}\end{pmatrix}\;=\;\begin{pmatrix}0&1\\ 1&0\end{pmatrix},

with eigenvalues $\{1,-1\}$ , so the numerator is indefinite. The denominator at these nodes is $D_{\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j})\in\{\varepsilon,2+\varepsilon\}$ , both positive, so the Schur product inherits the eigenvalue $-(2+\varepsilon)^{-1}<0$ and $k_{-1,\varepsilon}$ is not PSD. ∎

E.2 Proof of Theorem 4: bias-finite-difference IMQ embedding

Theorem (Theorem 4 restated).

Fix $\varepsilon>0$ . For every $\mathbf{w}\in\mathbb{R}^{d}$ and every $h>0$ , the bias-finite-difference identity (2) holds identically in $\mathbf{x}\in\mathbb{R}^{d}$ . In particular, $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ . The containment is strict.

Proof.

The denominator $D_{\varepsilon}(\mathbf{x},\mathbf{w})=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ is independent of the bias parameter. For fixed $(\mathbf{x},\mathbf{w})$ set $t:=\mathbf{x}^{\top}\mathbf{w}+2h$ . Then

g_{\varepsilon}(\mathbf{x};\mathbf{w},3h)=\frac{(t+h)^{2}}{D_{\varepsilon}},\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},2h)=\frac{t^{2}}{D_{\varepsilon}},\quad g_{\varepsilon}(\mathbf{x};\mathbf{w},h)=\frac{(t-h)^{2}}{D_{\varepsilon}}.

The numerator second-order difference is

(t+h)^{2}-2t^{2}+(t-h)^{2}\;=\;(t^{2}+2th+h^{2})-2t^{2}+(t^{2}-2th+h^{2})\;=\;2h^{2},

independent of $t$ . Dividing by $h^{2}D_{\varepsilon}$ gives $2/D_{\varepsilon}=2k_{\mathrm{IMQ}}^{\varepsilon}(\mathbf{x},\mathbf{w})$ , multiplied by $h^{2}$ on both sides recovers (2). Since $h>0$ the three biases $\{h,2h,3h\}$ are all strictly positive, so the identity uses only atoms from $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ and hence $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ .

Strictness. Fix $\mathbf{w}\neq 0$ and a unit vector $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$ . Along the ray $\mathbf{x}=r\mathbf{u}$ ,

k_{\text{\tifinaghfont ⵟ}}(r\mathbf{u},\mathbf{w})\;=\;\frac{r^{2}(\mathbf{u}^{\top}\mathbf{w})^{2}}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon}\;\xrightarrow[r\to\infty]{}\;(\mathbf{u}^{\top}\mathbf{w})^{2}\;\neq\;0,

while every finite linear combination $F=\sum_{i=1}^{N}c_{i}k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{i})\in F_{\mathrm{IMQ},\varepsilon}$ obeys

|F(r\mathbf{u})|\;\leq\;\sum_{i=1}^{N}\frac{|c_{i}|}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}_{i}+\|\mathbf{w}_{i}\|^{2}+\varepsilon}\;=\;O(r^{-2})\;\xrightarrow[r\to\infty]{}\;0.

The two far-field limits disagree, so $k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{w})\notin F_{\mathrm{IMQ},\varepsilon}$ for every $\mathbf{w}\neq 0$ . By Lemma 3 this same unbiased section belongs to $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ via $k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{w})=g_{\varepsilon}(\cdot;\mathbf{w},0)=3g_{\varepsilon}(\cdot;\mathbf{w},h)-3g_{\varepsilon}(\cdot;\mathbf{w},2h)+g_{\varepsilon}(\cdot;\mathbf{w},3h)$ for any $h>0$ , so it witnesses strict containment $F_{\mathrm{IMQ},\varepsilon}\subsetneq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ . ∎

Lemma 2 (Irreducibility and coprimality of the regularised distance).

Let $d\geq 2$ , $\varepsilon>0$ , and for $\mathbf{w}\in\mathbb{R}^{d}$ set $D_{\mathbf{w}}(\mathbf{x}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ . Then (i) $D_{\mathbf{w}}$ is irreducible in $\mathbb{C}[\mathbf{x}_{1},\ldots,\mathbf{x}_{d}]$ ; (ii) for $\mathbf{v}\neq\mathbf{w}$ , $D_{\mathbf{v}}$ and $D_{\mathbf{w}}$ are coprime in $\mathbb{C}[\mathbf{x}]$ .

Proof.

(i) Translate by $\mathbf{w}$ : with $\mathbf{y}:=\mathbf{x}-\mathbf{w}$ , $D_{\mathbf{w}}=\|\mathbf{y}\|^{2}+\varepsilon$ . Suppose this factors as $f(\mathbf{y})g(\mathbf{y})$ in $\mathbb{C}[\mathbf{y}]$ with $\deg f,\deg g\geq 1$ . Both factors are linear: $f=\boldsymbol{\alpha}^{\top}\mathbf{y}+\beta$ , $g=\boldsymbol{\gamma}^{\top}\mathbf{y}+\delta$ . Matching coefficients gives

\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}+\boldsymbol{\gamma}\boldsymbol{\alpha}^{\top}=2\mathbf{I}_{d},\qquad\delta\boldsymbol{\alpha}+\beta\boldsymbol{\gamma}=\mathbf{0},\qquad\beta\delta=\varepsilon.

Since $\varepsilon\neq 0$ , neither $\beta$ nor $\delta$ vanishes, so the linear constraint gives $\boldsymbol{\gamma}=-(\delta/\beta)\boldsymbol{\alpha}$ — the two direction vectors are parallel. But then $\boldsymbol{\alpha}\boldsymbol{\gamma}^{\top}+\boldsymbol{\gamma}\boldsymbol{\alpha}^{\top}=-(2\delta/\beta)\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}$ has rank at most $1$ , while $2\mathbf{I}_{d}$ has rank $d\geq 2$ . Contradiction; $D_{\mathbf{w}}$ is irreducible. (ii) If $D_{\mathbf{v}}$ and $D_{\mathbf{w}}$ shared an irreducible factor for $\mathbf{v}\neq\mathbf{w}$ , since both are irreducible they would equal each other up to a scalar; comparing the degree- $1$ part $-2\mathbf{x}^{\top}\mathbf{v}$ vs. $-2\mathbf{x}^{\top}\mathbf{w}$ then forces $\mathbf{v}=\mathbf{w}$ , contradiction. ∎

Theorem 9 (Tightness of the factor- $3$ reduction).

Let $d\geq 1$ , $\varepsilon>0$ . For every $\mathbf{w}_{0}\in\mathbb{R}^{d}\setminus\{0\}$ , no linear combination of at most two biased Yat atoms (at arbitrary centers and biases) can equal a nonzero scalar multiple of $k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ .

Outline. The argument splits by dimension. For $d\geq 2$ we use Lemma 2 (irreducibility and coprimality of $D_{\mathbf{w}}$ in $\mathbb{C}[\mathbf{x}]$ ) and a case-by-case polynomial analysis. For $d=1$ , $D_{w}$ factors over $\mathbb{C}$ as $(z-w-i\sqrt{\varepsilon})(z-w+i\sqrt{\varepsilon})$ , so Lemma 2 fails; we instead match poles directly in the complex plane.

Proof for $d\geq 2$ .

We first reduce to non-trivial atoms. An atom with $c_{i}=0$ , or with $L_{i}\equiv 0$ (i.e. $\mathbf{w}_{i}=\mathbf{0}$ and $b_{i}=0$ ), contributes zero; dropping it gives a one-atom or zero-atom identity. A zero-atom combination is identically $0\neq\lambda k_{\mathrm{IMQ}}^{\varepsilon}$ for $\lambda\neq 0$ . A single non-trivial atom: clearing denominators in $c\,g_{\varepsilon}(\cdot;\mathbf{w},b)=\lambda\,k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{0})$ gives $c\,L^{2}\,D_{0}=\lambda\,D_{w}$ in $\mathbb{C}[\mathbf{x}]$ , where $L(\mathbf{x})=\mathbf{x}^{\top}\mathbf{w}+b$ and $D_{w}=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ . If $\mathbf{w}=\mathbf{w}_{0}$ : cancelling $D_{0}$ gives $c\,L^{2}=\lambda$ , impossible since $L$ is nonconstant for $\mathbf{w}_{0}\neq\mathbf{0}$ . If $\mathbf{w}\neq\mathbf{w}_{0}$ : by Lemma 2, $D_{w}$ is irreducible (hence prime in the UFD $\mathbb{C}[\mathbf{x}]$ ) and coprime to $D_{0}$ ; from $D_{w}\mid c\,L^{2}$ we get $D_{w}\mid L$ , but $\deg D_{w}=2>\deg L\leq 1$ and $L\not\equiv 0$ : contradiction. Hence $\lambda=0$ in the one-atom case. Assume both atoms are non-trivial, $c_{i}\neq 0$ , $L_{i}\not\equiv 0$ , and for contradiction $\lambda\neq 0$ . Let $D_{i}(\mathbf{x})=\|\mathbf{x}-\mathbf{w}_{i}\|^{2}+\varepsilon$ and $L_{i}(\mathbf{x})=\mathbf{x}^{\top}\mathbf{w}_{i}+b_{i}$ . Clearing denominators gives

c_{1}L_{1}^{2}D_{0}D_{2}+c_{2}L_{2}^{2}D_{0}D_{1}=\lambda D_{1}D_{2}\quad\text{in }\mathbb{R}[\mathbf{x}].

Case 1: $\mathbf{w}_{1}=\mathbf{w}_{2}=:\mathbf{w}$ . Then $D_{1}=D_{2}=:D$ and $(*)$ becomes $N\,D_{0}=\lambda\,D$ where $N:=c_{1}L_{1}^{2}+c_{2}L_{2}^{2}$ . Subcase $\mathbf{w}=\mathbf{w}_{0}$ : $D=D_{0}$ , so $N=\lambda$ . The degree- $2$ homogeneous part of $N$ is $(c_{1}+c_{2})(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}$ . For $d\geq 2$ and $\mathbf{w}_{0}\neq\mathbf{0}$ this vanishes only if $c_{1}+c_{2}=0$ ; the degree- $1$ part then forces $c_{1}b_{1}+c_{2}b_{2}=0$ , so $b_{1}=b_{2}$ , and $N=b_{1}^{2}(c_{1}+c_{2})=0$ , contradicting $\lambda\neq 0$ . Subcase $\mathbf{w}\neq\mathbf{w}_{0}$ : by Lemma 2, $D$ and $D_{0}$ are distinct irreducibles in $\mathbb{C}[\mathbf{x}]$ , hence coprime. From $ND_{0}=\lambda D$ : $D\mid ND_{0}$ and $\gcd(D,D_{0})=1$ , so $D\mid N$ . Since $\deg N\leq 2=\deg D$ , we have $N=\mu D$ for some constant $\mu$ ; then $\mu D_{0}=\lambda$ , a degree- $2$ polynomial equal to a constant, forcing $\mu=\lambda=0$ : contradiction.

Case 2: $\mathbf{w}_{1}=\mathbf{w}_{0}\neq\mathbf{w}_{2}$ (the case $\mathbf{w}_{2}=\mathbf{w}_{0}\neq\mathbf{w}_{1}$ is symmetric). Then $D_{1}=D_{0}$ ; cancelling $D_{0}$ from $(*)$ gives $c_{1}L_{1}^{2}D_{2}+c_{2}L_{2}^{2}D_{0}=\lambda D_{2}$ , hence

(c_{1}L_{1}^{2}-\lambda)\,D_{2}=-c_{2}L_{2}^{2}\,D_{0}.

Since $\mathbf{w}_{0}\neq\mathbf{w}_{2}$ , by Lemma 2 $D_{0}$ and $D_{2}$ are coprime irreducibles in $\mathbb{C}[\mathbf{x}]$ , so $D_{0}\mid(c_{1}L_{1}^{2}-\lambda)$ . Both have degree $\leq 2=\deg D_{0}$ , so $c_{1}L_{1}^{2}-\lambda=\mu D_{0}$ for some $\mu\in\mathbb{C}$ . Comparing degree- $2$ homogeneous parts: $c_{1}(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}=\mu\|\mathbf{x}\|^{2}$ . For $d\geq 2$ and $\mathbf{w}_{0}\neq\mathbf{0}$ , the polynomials $(\mathbf{x}^{\top}\mathbf{w}_{0})^{2}$ and $\|\mathbf{x}\|^{2}$ are linearly independent in $\mathbb{C}[\mathbf{x}]$ , witnessed by two evaluations: (i) at $\mathbf{x}=\mathbf{w}_{0}$ , the values are $\|\mathbf{w}_{0}\|^{4}$ and $\|\mathbf{w}_{0}\|^{2}$ respectively, with ratio $\|\mathbf{w}_{0}\|^{2}$ ; (ii) at any non-zero $\mathbf{x}\perp\mathbf{w}_{0}$ (which exists for $d\geq 2$ and is the only step that uses the dimensional hypothesis), the values are $0$ and $\|\mathbf{x}\|^{2}>0$ , with a different ratio, so the two polynomials are not proportional. At $d=1$ both polynomials equal $w_{0,1}^{2}x^{2}$ up to a non-zero scalar, so Case 2 genuinely requires $d\geq 2$ ; Case 1 applies for all $d\geq 1$ . Hence $\mu=c_{1}=0$ , giving $\lambda=0$ : contradiction.

Case 3: $\mathbf{w}_{0},\mathbf{w}_{1},\mathbf{w}_{2}$ pairwise distinct. By Lemma 2, each $D_{i}$ is irreducible in $\mathbb{C}[\mathbf{x}]$ and distinct $D_{i}$ are pairwise coprime. Since $\mathbb{C}[\mathbf{x}]$ is a unique factorisation domain (UFD), irreducibles are prime and the principal ideal $(D_{1})$ is therefore prime, making $R:=\mathbb{C}[\mathbf{x}]/(D_{1})$ an integral domain. Reducing $(*)$ modulo $(D_{1})$ : $c_{1}\,\overline{L_{1}}^{\,2}\,\overline{D_{0}}\,\overline{D_{2}}=0$ . Since $\mathbf{w}_{0},\mathbf{w}_{2}\neq\mathbf{w}_{1}$ , the classes $\overline{D_{0}},\overline{D_{2}}$ are nonzero in $R$ ; since $R$ is an integral domain and $c_{1}\neq 0$ , we get $\overline{L_{1}}=0$ , i.e. $L_{1}\in(D_{1})$ . But $L_{1}\not\equiv 0$ and $\deg L_{1}=1<2=\deg D_{1}$ : contradiction. By symmetry (quotient by $D_{2}$ ), $c_{2}=0$ : contradiction. ∎

Proof for $d=1$ via complex pole matching.

At $d=1$ the variables collapse to scalars: $w_{i},b_{i},w_{0}\in\mathbb{R}$ , and $D_{w}(z)=(z-w)^{2}+\varepsilon$ factors over $\mathbb{C}$ as $(z-w-i\sqrt{\varepsilon})(z-w+i\sqrt{\varepsilon})$ , so Lemma 2 no longer applies. Instead, assume for contradiction

c_{1}\frac{(w_{1}z+b_{1})^{2}}{(z-w_{1})^{2}+\varepsilon}+c_{2}\frac{(w_{2}z+b_{2})^{2}}{(z-w_{2})^{2}+\varepsilon}=\frac{\lambda}{(z-w_{0})^{2}+\varepsilon}

holds for all $z\in\mathbb{R}$ with $\lambda\neq 0$ , $\varepsilon>0$ . The standing hypothesis $\mathbf{w}_{0}\neq\mathbf{0}$ gives $w_{0}\neq 0$ in $d=1$ (this is what excludes the trivial case $\mathbf{w}_{0}=\mathbf{0}$ , where one biased atom $g_{\varepsilon}(\cdot;\mathbf{0},b)/b^{2}=b^{2}/(b^{2}(\|\mathbf{x}\|^{2}+\varepsilon))=k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{0})$ already realises the IMQ atom). By analytic continuation the identity extends to $z\in\mathbb{C}$ . The right-hand side has exactly two simple poles, at $z=w_{0}\pm i\sqrt{\varepsilon}$ . We partition by the pattern of equality among $\{w_{0},w_{1},w_{2}\}$ into four mutually exclusive and jointly exhaustive cases: $\{w_{1},w_{2},w_{0}\}$ contains (1) three distinct values; (2) exactly $w_{1}=w_{2}$ and both differ from $w_{0}$ ; (3) exactly one of $w_{1},w_{2}$ equals $w_{0}$ (and the two atom centers differ); (4) $w_{1}=w_{2}=w_{0}$ .

Case 1: $w_{0},w_{1},w_{2}$ pairwise distinct. The first atom contributes potential simple poles at $z=w_{1}\pm i\sqrt{\varepsilon}$ . These poles are not present in either the second atom (whose poles are at $w_{2}\pm i\sqrt{\varepsilon}\neq w_{1}\pm i\sqrt{\varepsilon}$ ) nor the right-hand side (whose poles are at $w_{0}\pm i\sqrt{\varepsilon}\neq w_{1}\pm i\sqrt{\varepsilon}$ ). Hence they must be removable, requiring the linear factor $w_{1}z+b_{1}$ to vanish at $z=w_{1}+i\sqrt{\varepsilon}$ :

w_{1}(w_{1}+i\sqrt{\varepsilon})+b_{1}=(w_{1}^{2}+b_{1})+iw_{1}\sqrt{\varepsilon}=0.

Since $w_{1},b_{1},\varepsilon$ are real with $\varepsilon>0$ , separating real and imaginary parts forces $w_{1}=0$ and $b_{1}=0$ , i.e. the first atom is identically zero. The same argument applied to atom 2 forces $w_{2}=0=b_{2}$ . The identity collapses to $0=\lambda/D_{w_{0}}$ , contradicting $\lambda\neq 0$ .

Case 2: $w_{1}=w_{2}=:w\neq w_{0}$ . The two atoms share denominator $D_{w}(z)$ . The combined left-hand-side numerator $P(z):=c_{1}(wz+b_{1})^{2}+c_{2}(wz+b_{2})^{2}$ has degree $\leq 2$ . Equality of rational functions implies equality of pole sets with multiplicity; since $w\neq w_{0}$ , the LHS poles at $w\pm i\sqrt{\varepsilon}$ must cancel, forcing $D_{w}\mid P$ . Since $\deg P\leq 2=\deg D_{w}$ , $P=C\cdot D_{w}$ for some constant $C\in\mathbb{C}$ . Then LHS $=C$ is constant on $\mathbb{C}$ , while RHS $=\lambda/D_{w_{0}}$ is non-constant (since $\lambda\neq 0$ ). Contradiction.

Case 3: $w_{1}=w_{0}\neq w_{2}$ (and the symmetric subcase $w_{2}=w_{0}\neq w_{1}$ ). The atom whose centre differs from $w_{0}$ has poles at the off- $w_{0}$ locations; by the Case 1 argument applied to that atom alone, the atom is identically zero. The identity reduces to one atom $c_{1}(w_{0}z+b_{1})^{2}/D_{w_{0}}(z)=\lambda/D_{w_{0}}(z)$ , i.e. $c_{1}(w_{0}z+b_{1})^{2}=\lambda$ . The left side is a polynomial in $z$ of degree exactly $2$ (using $w_{0}\neq 0$ and $c_{1}\neq 0$ for a non-trivial atom), while the right side is a non-zero constant. Contradiction.

Case 4: $w_{1}=w_{2}=w_{0}$ . LHS and RHS share denominator $D_{w_{0}}(z)$ ; equating numerators gives the polynomial identity

c_{1}(w_{0}z+b_{1})^{2}+c_{2}(w_{0}z+b_{2})^{2}=\lambda\quad\text{in }\mathbb{R}[z].

Matching the $z^{2}$ coefficient: $w_{0}^{2}(c_{1}+c_{2})=0$ , so $c_{2}=-c_{1}$ (since $w_{0}\neq 0$ ). Matching the $z^{1}$ coefficient: $2w_{0}(c_{1}b_{1}+c_{2}b_{2})=2w_{0}c_{1}(b_{1}-b_{2})=0$ , so $b_{1}=b_{2}$ (since $c_{1}\neq 0$ ). Substituting back into the constant term: $c_{1}b_{1}^{2}+c_{2}b_{2}^{2}=c_{1}b_{1}^{2}-c_{1}b_{1}^{2}=0\neq\lambda$ . Contradiction.

All four cases yield a contradiction, so the factor- $3$ lower bound holds at $d=1$ as well. ∎

E.3 Proof of Proposition 1: singular universality at $b_{0}>0$

Proposition (Proposition 1 restated).

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be compact, $b_{0}>0$ , $\varepsilon_{0}>0$ . Then $\mathcal{H}_{b_{0},\varepsilon_{0}}$ is dense in $C(\mathcal{X})$ in the uniform norm.

Proof.

Write $k=p\cdot h$ with $h(\mathbf{x},\mathbf{w})=(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon_{0})^{-1}$ the IMQ kernel and $p(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\top}\mathbf{w}+b_{0})^{2}$ .

Step 1: the IMQ factor is universal. The RKHS $\mathcal{H}_{h}$ of $h$ is dense in $C(\mathcal{X})$ for every compact $\mathcal{X}$ by the Micchelli universality theorem [Micchelli et al., 2006].

Step 2: the polynomial factor has a constant feature. The numerator kernel $p$ has the explicit feature map $\Phi_{b_{0}}(\mathbf{x})=(\mathrm{vec}(\mathbf{x}\mathbf{x}^{\top}),\sqrt{2b_{0}}\,\mathbf{x},b_{0})\in\mathbb{R}^{d^{2}+d+1}$ , so $p(\mathbf{x},\mathbf{w})=\langle\Phi_{b_{0}}(\mathbf{x}),\Phi_{b_{0}}(\mathbf{w})\rangle$ . One coordinate of $\Phi_{b_{0}}$ is the nonzero constant $b_{0}$ .

Step 3: the polynomial RKHS contains constants. The RKHS of $p$ is $\mathcal{H}_{p}=\{\langle\mathbf{u},\Phi_{b_{0}}(\cdot)\rangle:\mathbf{u}\in\mathbb{R}^{d^{2}+d+1}\}$ . The dual coefficient $\mathbf{u}_{1}=(0,\ldots,0,1/b_{0})$ yields $\langle\mathbf{u}_{1},\Phi_{b_{0}}(\mathbf{x})\rangle=b_{0}\cdot(1/b_{0})=1$ for every $\mathbf{x}$ , so the constant function $\mathbf{1}_{\mathcal{X}}\in\mathcal{H}_{p}$ with $\|\mathbf{1}_{\mathcal{X}}\|_{\mathcal{H}_{p}}\leq\|\mathbf{u}_{1}\|=1/b_{0}$ (an upper bound because $\Phi_{b_{0}}$ is non-minimal, so the RKHS norm is the infimum over representers).

Step 4: $\mathcal{H}_{k}$ contains $\mathcal{H}_{h}$ . By the product-kernel RKHS containment theorem [Steinwart and Christmann, 2008, Lemma 4.6] (also Theorem 7 above), the RKHS of $k=p\cdot h$ contains every pointwise product $f_{1}\cdot f_{2}$ with $f_{1}\in\mathcal{H}_{h}$ and $f_{2}\in\mathcal{H}_{p}$ , and $\|f_{1}\cdot f_{2}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{h}}\|f_{2}\|_{\mathcal{H}_{p}}$ . Take $f_{2}=\mathbf{1}_{\mathcal{X}}\in\mathcal{H}_{p}$ from Step 3, with $\|\mathbf{1}_{\mathcal{X}}\|_{\mathcal{H}_{p}}\leq 1/b_{0}$ . Then for every $f_{1}\in\mathcal{H}_{h}$ the pointwise product $f_{1}\cdot\mathbf{1}_{\mathcal{X}}=f_{1}$ lies in $\mathcal{H}_{k}$ , and $\|f_{1}\|_{\mathcal{H}_{k}}\leq\|f_{1}\|_{\mathcal{H}_{h}}\cdot(1/b_{0})$ . Thus the embedding $\mathcal{H}_{h}\hookrightarrow\mathcal{H}_{k}$ has operator norm $\leq 1/b_{0}$ , recovering the Loewner bound $\|f\|_{\mathcal{H}_{b_{0},\varepsilon_{0}}}\leq b_{0}^{-1}\|f\|_{\mathcal{H}_{h}}$ of Theorem 2. (The cleaner route to the same conclusion is the Loewner-order proof in the main body.)

Step 5: density. Let $g\in C(\mathcal{X})$ and $\eta>0$ . By Step 1, there exists $f\in\mathcal{H}_{h}$ with $\|g-f\|_{\infty}<\eta$ . Step 4 gives $f\in\mathcal{H}_{k}$ . Hence $\mathcal{H}_{k}$ is dense in $C(\mathcal{X})$ . ∎

Remark 1 (Necessity of $b_{0}>0$ ).

The constant feature $b_{0}$ in $\Phi_{b_{0}}$ is what makes Step 4 produce constants in $\mathcal{H}_{p}$ . At $b_{0}=0$ this coordinate vanishes and the argument fails. When additionally $\mathbf{0}\in\mathcal{X}$ , the simpler observation $k_{0,\varepsilon}(\mathbf{w},\mathbf{0})=0$ for every $\mathbf{w}$ forces every $f\in\mathcal{H}_{0,\varepsilon}$ to satisfy $f(\mathbf{0})=0$ , ruling out density in $C(\mathcal{X})$ .

E.4 Characteristic kernel and strict positive definiteness

Corollary (Characteristic kernel, restating Corollary 1(i)).

For $b_{0}>0$ , $\varepsilon_{0}>0$ , the kernel $k_{b_{0},\varepsilon_{0}}$ is characteristic on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ .

Proof.

By Proposition 1, $k_{b_{0},\varepsilon_{0}}$ is universal on $\mathcal{X}$ . A universal continuous PSD kernel on compact $\mathcal{X}$ is characteristic: if $\mu_{k}:=\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})=0$ in $\mathcal{H}_{b_{0},\varepsilon_{0}}$ for some finite signed Borel $\mu$ , then by the reproducing property $\int f\,d\mu=\langle f,\mu_{k}\rangle=0$ for every $f\in\mathcal{H}_{b_{0},\varepsilon_{0}}$ . By density, $\int f\,d\mu=0$ for every $f\in C(\mathcal{X})$ , so $\mu=0$ by the Riesz representation theorem. The metrization of weak convergence by $\mathrm{MMD}_{k_{b_{0},\varepsilon_{0}}}$ then follows from Sriperumbudur et al. [2010, Thm. 23]. ∎

Corollary (Strict positive definiteness and Gram invertibility, restating part of Cor. 1).

For $b_{0}>0$ , $\varepsilon_{0}>0$ , and any $n$ pairwise distinct $\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$ , the Gram matrix $\mathbf{K}_{ij}=k_{b_{0},\varepsilon_{0}}(\mathbf{x}_{i},\mathbf{x}_{j})$ is strictly positive definite, hence invertible.

Proof.

Let $\mathbf{c}=(c_{1},\dots,c_{n})\in\mathbb{R}^{n}$ and set $\mu:=\sum_{i=1}^{n}c_{i}\delta_{\mathbf{x}_{i}}$ , the finite signed Borel measure on $\mathcal{X}=\{\mathbf{x}_{1},\dots,\mathbf{x}_{n}\}$ with atomic masses $c_{i}$ . By bilinearity of the kernel mean embedding,

\mathbf{c}^{\top}\mathbf{K}\mathbf{c}\;=\;\sum_{i,j}c_{i}c_{j}\,k_{b_{0},\varepsilon_{0}}(\mathbf{x}_{i},\mathbf{x}_{j})\;=\;\Big\|\int k_{b_{0},\varepsilon_{0}}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\Big\|_{\mathcal{H}_{b_{0},\varepsilon_{0}}}^{2}.

If $\mathbf{c}^{\top}\mathbf{K}\mathbf{c}=0$ , the mean embedding of $\mu$ in $\mathcal{H}_{b_{0},\varepsilon_{0}}$ is zero. Since $k_{b_{0},\varepsilon_{0}}$ is characteristic on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ (Corollary 1(i)), the mean embedding is injective on finite signed Borel measures, hence $\mu=0$ . Distinctness of $\{\mathbf{x}_{i}\}$ then forces $c_{i}=0$ for every $i$ . Equivalently, $\mathbf{K}\succ 0$ . ∎

E.5 Proof of Proposition 3 and Theorem 6: closed-form norm and single-layer Rademacher bound

Proposition (Proposition 3 restated).

Let $b\geq 0$ , $\varepsilon>0$ , and $f(\cdot)=\sum_{j=1}^{m}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\in\mathcal{H}_{b,\varepsilon}$ . Then $\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ with $\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$ .

Proof.

By bilinearity and the reproducing property,

	$\displaystyle\\|f\\|^{2}_{\mathcal{H}_{b,\varepsilon}}$	$\displaystyle\;=\;\sum_{i,j}\alpha_{i}\alpha_{j}\,\bigl\langle k_{b,\varepsilon}(\mathbf{w}_{i},\cdot),\,k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\bigr\rangle_{\mathcal{H}_{b,\varepsilon}}$
		$\displaystyle\;=\;\sum_{i,j}\alpha_{i}\alpha_{j}\,k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})\;=\;\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}.\qed$

The identity $\|f\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is exact for every coefficient vector $\boldsymbol{\alpha}$ , including when $\mathbf{K}$ is rank-deficient. If two centers coincide or two coefficient vectors $\boldsymbol{\alpha},\boldsymbol{\alpha}^{\prime}$ represent the same function $f$ (equivalently, $\mathbf{K}\boldsymbol{\alpha}=\mathbf{K}\boldsymbol{\alpha}^{\prime}$ ), then bilinearity of the RKHS inner product gives $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}=\boldsymbol{\alpha}^{\prime\top}\mathbf{K}\boldsymbol{\alpha}^{\prime}$ , since both equal $\langle f,f\rangle_{\mathcal{H}_{b,\varepsilon}}$ . The pseudoinverse expression $(\mathbf{K}\boldsymbol{\alpha})^{\top}\mathbf{K}^{\dagger}(\mathbf{K}\boldsymbol{\alpha})$ is algebraically equal to $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ on PSD $\mathbf{K}$ via $\mathbf{K}\mathbf{K}^{\dagger}\mathbf{K}=\mathbf{K}$ , so either form may be used. Coincident or near-coincident centers therefore raise a numerical-conditioning question, not a mathematical correctness question; in floating-point implementations, deduplicating exactly-coincident centers and using a stable factorisation (e.g. pivoted Cholesky) on the remaining strictly-PD block is sufficient.

Theorem (Theorem 6 restated).

Fix $b\geq 0$ , $\varepsilon>0$ . Let $\mathcal{F}_{B,b,\varepsilon}=\{f\in\mathcal{H}_{b,\varepsilon}:\|f\|_{\mathcal{H}_{b,\varepsilon}}\leq B\}$ , $\|\mathbf{x}\|\leq R$ , $n$ i.i.d. samples. Then $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\leq B(R^{2}+b)/(\sqrt{n}\,\sqrt{\varepsilon})$ .

Proof.

For any $f\in\mathcal{F}_{B,b,\varepsilon}$ and $\mathbf{x}\in\mathbb{R}^{d}$ , the reproducing property of the fixed kernel $k_{b,\varepsilon}$ gives $f(\mathbf{x})=\langle f,k_{b,\varepsilon}(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{b,\varepsilon}}$ . By Cauchy–Schwarz,

|f(\mathbf{x})|\;\leq\;\|f\|_{\mathcal{H}_{b,\varepsilon}}\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\;\leq\;B\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}.

(8)

The diagonal of the biased Yat kernel is

k_{b,\varepsilon}(\mathbf{x},\mathbf{x})\;=\;\frac{(\|\mathbf{x}\|^{2}+b)^{2}}{0+\varepsilon}\;=\;\frac{(\|\mathbf{x}\|^{2}+b)^{2}}{\varepsilon},

since $\mathbf{x}^{\top}\mathbf{x}+b=\|\mathbf{x}\|^{2}+b$ and $\|\mathbf{x}-\mathbf{x}\|^{2}+\varepsilon=\varepsilon$ . For $\|\mathbf{x}\|\leq R$ ,

\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\;\leq\;\frac{\|\mathbf{x}\|^{2}+b}{\sqrt{\varepsilon}}\;\leq\;\frac{R^{2}+b}{\sqrt{\varepsilon}}.

The standard RKHS Rademacher bound for a ball of radius $B$ follows from the reproducing property and Jensen’s inequality (Bartlett and Mendelson 2002, Lemma 22, with the one-sided convention $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}):=\mathbb{E}_{\boldsymbol{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i}\sigma_{i}f(\mathbf{x}_{i})$ used at line 7). Writing $f(\mathbf{x}_{i})=\langle f,k_{b,\varepsilon}(\cdot,\mathbf{x}_{i})\rangle$ , $\sup_{\|f\|\leq B}$ pulls out a factor $B$ , and Jensen exchanges $\mathbb{E}_{\boldsymbol{\sigma}}$ with the square root:

\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;=\;\frac{B}{n}\,\mathbb{E}_{\boldsymbol{\sigma}}\Big\|\sum_{i}\sigma_{i}k_{b,\varepsilon}(\cdot,\mathbf{x}_{i})\Big\|_{\mathcal{H}_{b,\varepsilon}}\;\leq\;\frac{B}{\sqrt{n}}\cdot\sqrt{\frac{1}{n}\sum_{i=1}^{n}k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})}.

The constant is exactly $1$ (no symmetrization factor of $2$ ) under this one-sided convention. Bounding the per-sample term by $(R^{2}+b)^{2}/\varepsilon$ and taking the square root gives

\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;B\sqrt{\frac{1}{n}\cdot\frac{(R^{2}+b)^{2}}{\varepsilon}}\;=\;\frac{B(R^{2}+b)}{\sqrt{n}\,\sqrt{\varepsilon}}.

∎

Corollary (Unbiased case).

At $b=0$ , $\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,0,\varepsilon})\leq BR^{2}/(\sqrt{n}\,\sqrt{\varepsilon})$ .

E.6 Uniform Boundedness and High-Dimensional Variance

Proposition 4 (Uniform boundedness with sharpness).

For $\|\mathbf{x}\|\leq R$ , $\|\mathbf{w}\|\leq W$ , $b\geq 0$ , $\varepsilon>0$ ,

0\;\leq\;k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\;\leq\;\frac{(RW+b)^{2}}{\varepsilon}.

For every fixed $\mathbf{w}$ , $b\geq 0$ , $\varepsilon>0$ , the section $k_{b,\varepsilon}(\mathbf{w},\cdot)$ is globally bounded on $\mathbb{R}^{d}$ . At $b=0$ and $\mathbf{w}\neq\mathbf{0}$ the global supremum is $\sup_{\mathbf{x}\in\mathbb{R}^{d}}k_{\text{\tifinaghfont ⵟ}}(\mathbf{w},\mathbf{x})=\|\mathbf{w}\|^{4}/\varepsilon+\|\mathbf{w}\|^{2}$ , attained at $\mathbf{x}^{\ast}=(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$ . For $\mathbf{w}=\mathbf{0}$ the unbiased section is identically zero. For $b>0$ the supremum on $\mathbb{R}^{d}$ does not have a clean closed form; it is bounded by $(RW+b)^{2}/\varepsilon$ on any compact domain with $\|\mathbf{x}\|\leq R$ , $\|\mathbf{w}\|\leq W$ .

Proof.

Compact-domain bound. By Cauchy–Schwarz, $|\mathbf{w}^{\top}\mathbf{x}|\leq\|\mathbf{w}\|\|\mathbf{x}\|\leq WR$ , so $|\mathbf{w}^{\top}\mathbf{x}+b|\leq WR+b$ . Also $D_{\varepsilon}(\mathbf{x},\mathbf{w})\geq\varepsilon>0$ , so $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\leq(WR+b)^{2}/\varepsilon$ . Nonnegativity follows from the squared numerator and positive denominator.

Global boundedness at fixed $\mathbf{w}$ . Set $c:=\|\mathbf{w}\|^{2}+b\geq 0$ ; since $|\mathbf{x}^{\top}\mathbf{w}+b|\leq\|\mathbf{x}-\mathbf{w}\|\|\mathbf{w}\|+c$ and $D_{\varepsilon}(\mathbf{x},\mathbf{w})\geq\|\mathbf{x}-\mathbf{w}\|^{2}$ ,

k_{b,\varepsilon}(\mathbf{w},\mathbf{x})\;\leq\;\left(\|\mathbf{w}\|+\frac{c}{\|\mathbf{x}-\mathbf{w}\|}\right)^{2}\quad\text{for }\|\mathbf{x}-\mathbf{w}\|>0.

For $\|\mathbf{x}-\mathbf{w}\|\geq 2c$ this gives $k_{b,\varepsilon}\leq(\|\mathbf{w}\|+\tfrac{1}{2})^{2}$ ; for $\|\mathbf{x}-\mathbf{w}\|\leq 2c$ , continuity and $D_{\varepsilon}\geq\varepsilon$ give $k_{b,\varepsilon}\leq c^{2}(2\|\mathbf{w}\|+1)^{2}/\varepsilon$ .

Sharpness at $b=0$ . Decompose $\mathbf{x}=\alpha\,\mathbf{w}/\|\mathbf{w}\|+\mathbf{x}_{\perp}$ with $\mathbf{x}_{\perp}\perp\mathbf{w}$ . Then $\mathbf{x}^{\top}\mathbf{w}=\alpha\|\mathbf{w}\|$ and $\|\mathbf{x}-\mathbf{w}\|^{2}=(\alpha-\|\mathbf{w}\|)^{2}+\|\mathbf{x}_{\perp}\|^{2}$ , so

k_{\text{\tifinaghfont ⵟ}}(\mathbf{w},\mathbf{x})\;=\;\frac{\alpha^{2}\|\mathbf{w}\|^{2}}{(\alpha-\|\mathbf{w}\|)^{2}+\|\mathbf{x}_{\perp}\|^{2}+\varepsilon}.

This is monotonically decreasing in $\|\mathbf{x}_{\perp}\|^{2}$ , so the supremum is attained at $\mathbf{x}_{\perp}=0$ . Writing $\lambda=\alpha/\|\mathbf{w}\|$ and $u=\|\mathbf{w}\|^{2}$ , reduce to the one-variable maximisation

f(\lambda)\;=\;\frac{\lambda^{2}u^{2}}{(\lambda-1)^{2}u+\varepsilon}.

Differentiating, $f^{\prime}(\lambda)=2\lambda u^{2}\,[(1-\lambda)u+\varepsilon]/((\lambda-1)^{2}u+\varepsilon)^{2}$ , so the unique positive critical point is $\lambda^{\ast}=1+\varepsilon/u$ . Substituting:

f(\lambda^{\ast})\;=\;\frac{(1+\varepsilon/u)^{2}\,u^{2}}{(\varepsilon/u)^{2}\,u+\varepsilon}\;=\;\frac{(u+\varepsilon)^{2}}{\varepsilon(1+\varepsilon/u)}\;=\;\frac{u(u+\varepsilon)}{\varepsilon}\;=\;\frac{u^{2}}{\varepsilon}+u\;=\;\frac{\|\mathbf{w}\|^{4}}{\varepsilon}+\|\mathbf{w}\|^{2},

attained at $\mathbf{x}^{\ast}=\lambda^{\ast}\,\mathbf{w}/\|\mathbf{w}\|\cdot\|\mathbf{w}\|=(1+\varepsilon/\|\mathbf{w}\|^{2})\mathbf{w}$ . ∎

The diagonal value $k_{b,\varepsilon}(\mathbf{x},\mathbf{x})=(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in the Rademacher proof is the special case $\mathbf{w}=\mathbf{x}$ of the kernel definition. Classical unbounded scalar activations such as ReLU and GELU satisfy $\sigma(\mathbf{w}^{\top}\mathbf{x})\to\infty$ as $\|\mathbf{x}\|\to\infty$ , so the Rademacher route via the kernel diagonal is not available in this same unit-as-Mercer-section form.

High-dimensional variance preservation under isotropic Gaussian inputs.

A second structural difference between the Yat unit and a purely radial unit shows up in the high-dimensional input regime: the IMQ output concentrates while the Yat output does not. The phenomenon ”isotropic Gaussian inputs concentrate radial kernels in high $d$ ” is by now standard in random-matrix and neural-network theory: El Karoui [2010] established the original spectral collapse for kernel random matrices on isotropic high-dimensional inputs, and Ghorbani et al. [2019] characterise the same concentration for radial features in two-layer linearised networks. The proposition below is the matching structural statement at the level of the kernel diagonal: under the same inputs, the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ depends on a single 1-D projection and resists the collapse.

Proposition 5 (High-dimensional variance preservation).

Let $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ , let $\mathbf{w}\in\mathbb{R}^{d}$ have unit norm $\|\mathbf{w}\|=1$ , and fix $b\geq 0$ , $\varepsilon>0$ . For a positive random variable $Y$ , write $\mathrm{CV}[Y]:=\sqrt{\mathrm{Var}[Y]}/\mathbb{E}[Y]$ . As $d\to\infty$ ,

\mathrm{CV}[h_{\varepsilon}(\mathbf{x},\mathbf{w})]=\Theta(d^{-1/2}),\qquad\mathrm{CV}[k_{b,\varepsilon}(\mathbf{w},\mathbf{x})]=\Theta(1).

Proof.

By rotational invariance of the standard Gaussian, take $\mathbf{w}=\mathbf{e}_{1}$ . Then $\mathbf{w}^{\top}\mathbf{x}=x_{1}\sim\mathcal{N}(0,1)$ , and the regularised distance $D:=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon=(x_{1}-1)^{2}+\sum_{i=2}^{d}x_{i}^{2}+\varepsilon$ satisfies $D-\varepsilon\sim\chi_{d}^{2}(\lambda=1)$ (non-central chi-squared, $d$ degrees of freedom, non-centrality $\|\mathbf{w}\|^{2}=1$ ). Standard moment formulas give $\mu_{D}:=\mathbb{E}[D]=d+1+\varepsilon$ and $\sigma_{D}^{2}:=\mathrm{Var}[D]=2d+4$ .

IMQ unit. For $h_{\varepsilon}=1/D$ , second-order Taylor expansion of $1/D$ around $\mu_{D}$ (justified since $D\geq\varepsilon>0$ a.s. and $|D-\mu_{D}|/\mu_{D}=O_{P}(d^{-1/2})$ by chi-squared concentration) gives

\mathbb{E}[h_{\varepsilon}]=\mu_{D}^{-1}+O(\sigma_{D}^{2}/\mu_{D}^{3})=\Theta(d^{-1}),\qquad\mathrm{Var}[h_{\varepsilon}]=\sigma_{D}^{2}/\mu_{D}^{4}+O(\mu_{D}^{-5})=\Theta(d^{-3}),

so $\mathrm{CV}[h_{\varepsilon}]=\Theta(d^{-3/2})/\Theta(d^{-1})=\Theta(d^{-1/2})$ .

Yat unit. Set $N:=(x_{1}+b)^{2}$ , so $k_{b,\varepsilon}=N/D$ . Direct calculation gives $\mu_{N}=1+b^{2}$ , $\sigma_{N}^{2}=2+4b^{2}$ , and

\mathrm{Cov}(N,D)=\mathrm{Cov}\bigl((x_{1}+b)^{2},(x_{1}-1)^{2}\bigr)=2-4b

(the contributions of $x_{2},\ldots,x_{d}$ to $D$ are independent of $N$ , so do not enter the covariance). Bivariate Taylor expansion of $N/D$ around $(\mu_{N},\mu_{D})$ yields

\mathbb{E}[k_{b,\varepsilon}]=\mu_{N}/\mu_{D}+O(\mu_{D}^{-2})=\Theta(d^{-1}),

\mathrm{Var}[k_{b,\varepsilon}]=\frac{\sigma_{N}^{2}}{\mu_{D}^{2}}-\frac{2\mu_{N}\,\mathrm{Cov}(N,D)}{\mu_{D}^{3}}+\frac{\mu_{N}^{2}\sigma_{D}^{2}}{\mu_{D}^{4}}+O(\mu_{D}^{-5}).

Substituting the dimensional scalings $\mu_{D}=\Theta(d)$ , $\sigma_{D}^{2}=\Theta(d)$ , and $\mu_{N},\sigma_{N}^{2},\mathrm{Cov}(N,D)=\Theta(1)$ , the three terms are $\Theta(d^{-2})-\Theta(d^{-3})+\Theta(d^{-3})=\Theta(d^{-2})$ , dominated by $\sigma_{N}^{2}/\mu_{D}^{2}$ . Hence $\mathrm{CV}[k_{b,\varepsilon}]=\Theta(d^{-1})/\Theta(d^{-1})=\Theta(1)$ .

Higher-order Taylor remainders are controlled by combining the chi-squared concentration $|D-\mu_{D}|/\mu_{D}=O_{P}(d^{-1/2})$ with a high-probability lower bound on $D$ . Specifically, the chi-squared concentration inequality [Laurent and Massart, 2000, Lemma 1] gives $\Pr[D<\mu_{D}/2]\leq e^{-cd}$ for some absolute constant $c>0$ and all $d\geq d_{0}$ . Split the expectation $\mathbb{E}[(D-\mu_{D})^{k}/D^{k}]$ over the events $\mathcal{E}:=\{D\geq\mu_{D}/2\}$ and its complement: on $\mathcal{E}$ , $D^{-k}\leq(2/\mu_{D})^{k}=O(d^{-k})$ deterministically and the standard chi-squared moment formula $\mathbb{E}[(D-\mu_{D})^{k}\mathbf{1}_{\mathcal{E}}]=O(d^{k/2})$ gives $\mathbb{E}[(D-\mu_{D})^{k}/D^{k}\,\mathbf{1}_{\mathcal{E}}]=O(d^{-k/2})$ ; on $\mathcal{E}^{c}$ , the deterministic floor $D\geq\varepsilon$ together with the deterministic upper bound $|D-\mu_{D}|^{k}\leq\mu_{D}^{k}=\Theta(d^{k})$ (since $0<D\leq 2\mu_{D}$ on the relevant range) gives $\mathbb{E}[(D-\mu_{D})^{k}/D^{k}\,\mathbf{1}_{\mathcal{E}^{c}}]\leq\varepsilon^{-k}\,d^{k}\,\Pr[\mathcal{E}^{c}]\leq\varepsilon^{-k}\,d^{k}\,e^{-cd}=o(d^{-k})$ , which is asymptotically negligible. We do not assume independence of $|D-\mu_{D}|^{k}$ and $\mathbf{1}_{\mathcal{E}^{c}}$ , which fails; the deterministic bound circumvents this. Combining, $\mathbb{E}[(D-\mu_{D})^{k}/D^{k}]=O(d^{-k/2})$ for $k\geq 3$ . This is the bound the bivariate Taylor expansion needs: the leading $\Theta(d^{-2})$ variance term dominates the $O(d^{-3})$ corrections, and the asymptotic CV scaling is unaffected. The earlier deterministic bound $D^{-k}\leq\varepsilon^{-k}$ alone is too loose to control the remainder at this scale, since $\mathbb{E}[(D-\mu_{D})^{k}]\cdot\varepsilon^{-k}$ grows with $d$ ; the high-probability split above is the correct bookkeeping. ∎

Remark 2.

Proposition 5 is a structural calculation under the specific toy model of isotropic Gaussian inputs and unit-norm centers; real ML feature distributions are not isotropic Gaussian, and trained Yat centers need not be unit-norm. What the proposition isolates is the asymptotic mechanism: the IMQ denominator $D$ concentrates around its mean of order $d$ , so the radial output $1/D$ concentrates around $1/d$ and its relative spread vanishes; the Yat numerator $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ depends only on the one-dimensional projection $\mathbf{x}^{\top}\mathbf{w}$ and so does not concentrate, yielding an $\Theta(1)$ relative spread that survives the limit.

E.7 Additional proofs from Section 2

Proof of Theorem 2.

By Theorem 1 the kernel $k_{b,\varepsilon}$ is PSD, so $\mathcal{H}_{b,\varepsilon}$ is well-defined and the Loewner-order Aronszajn inclusion below is meaningful. Compute $k_{b,\varepsilon}-b^{2}h_{\varepsilon}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}+(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$ . The kernels $\mathbf{x}^{\top}\mathbf{w}$ and $(\mathbf{x}^{\top}\mathbf{w})^{2}$ are PSD, and the scalar $2b\geq 0$ (using $b\geq 0$ ) makes $2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ a nonneg-scaled PSD kernel; $h_{\varepsilon}$ is PSD (IMQ); Schur products of PSD kernels are PSD (Schur product theorem [Horn and Johnson, 2012, Sec. 7.5]). The sum is PSD; Aronszajn’s inclusion theorem then yields the RKHS containment and norm bound. For $b=0$ this reduces to the trivial $0\preceq k_{0,\varepsilon}$ ; the RKHS inclusion $\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon}$ is meaningful only for $b>0$ . ∎

E.8 Proofs from Section 3

Proof of Lemma 1.

Every finite-span element $f_{m}(\mathbf{x})=\sum_{j=1}^{m}a_{j}h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})$ vanishes at infinity since each atom decays as $O(\|\mathbf{x}\|^{-2})$ . The finite span is dense in $\mathcal{H}_{h_{\varepsilon}}$ ; if $f_{m}\to f$ in RKHS norm, the reproducing property gives

\sup_{\mathbf{x}\in\mathbb{R}^{d}}|f_{m}(\mathbf{x})-f(\mathbf{x})|\leq\sup_{\mathbf{x}}\sqrt{h_{\varepsilon}(\mathbf{x},\mathbf{x})}\,\|f_{m}-f\|_{\mathcal{H}_{h_{\varepsilon}}}=\varepsilon^{-1/2}\|f_{m}-f\|_{\mathcal{H}_{h_{\varepsilon}}},

so $f_{m}\to f$ uniformly. Since $C_{0}(\mathbb{R}^{d})$ is closed under uniform limits, $f\in C_{0}(\mathbb{R}^{d})$ . ∎

Proof of Corollary 2.

The inclusion $\mathcal{H}_{h_{\varepsilon}}\subseteq\mathcal{H}_{b,\varepsilon}$ is Theorem 2. For strictness, choose $\mathbf{w}\neq\mathbf{0}$ and any $\mathbf{u}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$ : the far-field trace gives $\lim_{r\to\infty}k_{b,\varepsilon}(\mathbf{w},r\mathbf{u})=(\mathbf{u}^{\top}\mathbf{w})^{2}\neq 0$ , so $k_{b,\varepsilon}(\mathbf{w},\cdot)\notin C_{0}(\mathbb{R}^{d})$ while, by Lemma 1, $\mathcal{H}_{h_{\varepsilon}}\subset C_{0}(\mathbb{R}^{d})$ . Hence $k_{b,\varepsilon}(\mathbf{w},\cdot)\in\mathcal{H}_{b,\varepsilon}\setminus\mathcal{H}_{h_{\varepsilon}}$ . Non-reversibility: if $k_{b,\varepsilon}\preceq C\,h_{\varepsilon}$ , Aronszajn inclusion would give $\mathcal{H}_{b,\varepsilon}\subseteq\mathcal{H}_{h_{\varepsilon}}\subset C_{0}(\mathbb{R}^{d})$ , contradicting the existence of Yat sections not vanishing at infinity. ∎

Proof of Theorem 3.

The RKHS sum rule states that $\mathcal{H}_{k_{1}+k_{2}}=\mathcal{H}_{k_{1}}+\mathcal{H}_{k_{2}}$ with the infimal-convolution norm [Paulsen and Raghupathi, 2016]; extended to three summands by induction. Each $k_{i}$ is PSD: $k_{0}=b^{2}h_{\varepsilon}$ since $h_{\varepsilon}$ is PSD and $b^{2}\geq 0$ ; $k_{1}=2b(\mathbf{x}^{\top}\mathbf{w})h_{\varepsilon}$ because the dot-product kernel $\mathbf{x}^{\top}\mathbf{w}$ is PSD ( $\sum_{ij}c_{i}c_{j}\mathbf{x}_{i}^{\top}\mathbf{x}_{j}=\|\sum_{i}c_{i}\mathbf{x}_{i}\|^{2}\geq 0$ ), $h_{\varepsilon}$ is PSD, their Schur product is PSD by the Schur product theorem, and $2b\geq 0$ (the channel vanishes at $b=0$ ); $k_{2}=(\mathbf{x}^{\top}\mathbf{w})^{2}h_{\varepsilon}$ by the Schur product theorem applied to $(\mathbf{x}^{\top}\mathbf{w})^{2}$ and $h_{\varepsilon}$ , both PSD. The RKHS sum rule therefore applies. ∎

Proof of Proposition 2.

For (4), write

g_{\varepsilon}(r\mathbf{u};\mathbf{w},b)\;=\;\frac{(r\,\mathbf{u}^{\top}\mathbf{w}+b)^{2}}{r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon}\;=\;\frac{(\mathbf{u}^{\top}\mathbf{w}+b/r)^{2}}{1-2(\mathbf{u}^{\top}\mathbf{w})/r+(\|\mathbf{w}\|^{2}+\varepsilon)/r^{2}}\;\xrightarrow[r\to\infty]{}\;(\mathbf{u}^{\top}\mathbf{w})^{2},

uniformly in $\mathbf{u}\in\mathbb{S}^{d-1}$ . For the IMQ side, $|k_{\mathrm{IMQ}}^{\varepsilon}(r\mathbf{u},\mathbf{w}_{i})|\leq(r^{2}-2r\|\mathbf{w}_{i}\|+\|\mathbf{w}_{i}\|^{2}+\varepsilon)^{-1}=O(r^{-2})$ , so finite combinations $F=\sum_{i}c_{i}k_{\mathrm{IMQ}}^{\varepsilon}(\cdot,\mathbf{w}_{i})$ satisfy $T_{\infty}F\equiv 0$ . The image of $T_{\infty}$ contains every rank-one quadratic form $(\mathbf{u}^{\top}\mathbf{w})^{2}=\mathbf{u}^{\top}(\mathbf{w}\mathbf{w}^{\top})\mathbf{u}$ ; the symmetric rank-one tensors $\mathbf{w}\mathbf{w}^{\top}$ span $\mathrm{Sym}(d)$ (the off-diagonal generators are obtained as $\tfrac{1}{2}((\mathbf{e}_{i}+\mathbf{e}_{j})(\mathbf{e}_{i}+\mathbf{e}_{j})^{\top}-\mathbf{e}_{i}\mathbf{e}_{i}^{\top}-\mathbf{e}_{j}\mathbf{e}_{j}^{\top})$ ), so $\mathrm{Im}(T_{\infty})=\mathcal{Q}_{2}(\mathbb{S}^{d-1})$ . ∎

Corollary 5 (Concrete large-radius regression gap).

Fix $\mathbf{w}\neq\mathbf{0}$ , $b\in\mathbb{R}$ , $\varepsilon>0$ , and choose $\mathbf{u}\in\mathbb{S}^{d-1}$ with $a:=(\mathbf{u}^{\top}\mathbf{w})^{2}>0$ . Let $f(\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w},b)$ . Then for every finite IMQ combination $F\in F_{\mathrm{IMQ},\varepsilon}$ ,

|f(r\mathbf{u})-F(r\mathbf{u})|\;\xrightarrow[r\to\infty]{}\;a.

Hence for every $\eta\in(0,a)$ there exists $R_{\eta}$ such that $\sup_{r\geq R_{\eta}}|f(r\mathbf{u})-F(r\mathbf{u})|\geq a-\eta$ . In particular, a single Yat atom carries an $O(1)$ directional signal on a sufficiently large exterior ray that no finite IMQ expansion can uniformly match there.

Proof of Corollary 5.

Proposition 2 gives $f(r\mathbf{u})\to a$ and $F(r\mathbf{u})\to 0$ , hence $f(r\mathbf{u})-F(r\mathbf{u})\to a$ . Given $\eta\in(0,a)$ , choose $R_{\eta}$ so that for all $r\geq R_{\eta}$ : $|f(r\mathbf{u})-a|<\eta/2$ and $|F(r\mathbf{u})|<\eta/2$ . Then $|f(r\mathbf{u})-F(r\mathbf{u})|\geq a-\eta$ , and the supremum over $r\geq R_{\eta}$ preserves the bound. ∎

Proposition 6 (Far-field gradient asymptotics).

Fix $\mathbf{w}\in\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ , $b\geq 0$ , $\varepsilon>0$ , and a unit direction $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$ . Along the ray $\mathbf{x}=r\mathbf{u}$ , the center-gradients of the IMQ section $h_{\varepsilon}(\mathbf{x},\mathbf{w})$ and the Yat section $k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$ satisfy

\bigl\|\nabla_{\mathbf{w}}\,h_{\varepsilon}(r\mathbf{u},\mathbf{w})\bigr\|=\Theta(r^{-3}),\qquad\lim_{r\to\infty}\nabla_{\mathbf{w}}\,k_{b,\varepsilon}(\mathbf{w},r\mathbf{u})=2(\mathbf{u}^{\top}\mathbf{w})\,\mathbf{u}.

Remark 3.

The proposition records an asymptotic difference in the center-gradient of a single Yat atom versus a single IMQ atom at large $\|\mathbf{x}\|$ , evaluated in isolation. It is a statement about the kernel itself, not about gradients of trained networks: in a finite expansion the per-atom gradients combine through the readout coefficients $\boldsymbol{\alpha}$ , and far-field signal can be informative or uninformative depending on where the data actually live.

Proof of Proposition 6.

For the IMQ atom, $\nabla_{\mathbf{w}}h_{\varepsilon}=2(\mathbf{x}-\mathbf{w})/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)^{2}$ . With $\mathbf{x}=r\mathbf{u}$ , the numerator $\|r\mathbf{u}-\mathbf{w}\|=r+O(1)$ and the denominator $(r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon)^{2}=r^{4}+O(r^{3})$ , so $\|\nabla_{\mathbf{w}}h_{\varepsilon}\|=2r^{-3}(1+O(r^{-1}))=\Theta(r^{-3})$ .

For the Yat atom, set $N(\mathbf{x}):=(\mathbf{w}^{\top}\mathbf{x}+b)^{2}$ and $D(\mathbf{x}):=\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon$ . The quotient rule gives

\nabla_{\mathbf{w}}\,k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{2(\mathbf{w}^{\top}\mathbf{x}+b)\,\mathbf{x}}{D(\mathbf{x})}+\frac{2(\mathbf{w}^{\top}\mathbf{x}+b)^{2}(\mathbf{x}-\mathbf{w})}{D(\mathbf{x})^{2}}.

Substituting $\mathbf{x}=r\mathbf{u}$ , the first term equals $2(r\mathbf{u}^{\top}\mathbf{w}+b)\,r\mathbf{u}/D(r\mathbf{u})$ . The leading-order behaviour of numerator and denominator is $2r^{2}(\mathbf{u}^{\top}\mathbf{w})\mathbf{u}+O(r)$ over $r^{2}+O(r)$ , giving the limit $2(\mathbf{u}^{\top}\mathbf{w})\mathbf{u}$ . The second term has numerator $2(r\mathbf{u}^{\top}\mathbf{w}+b)^{2}(r\mathbf{u}-\mathbf{w})=O(r^{3})$ over denominator $D(r\mathbf{u})^{2}=O(r^{4})$ , giving $O(r^{-1})\to 0$ . Summing yields the stated limit. ∎

Proposition 7 (Width-complexity gap for quadratic ridge functions).

Let $d\geq 2$ , $\mathbf{w}_{\star}\in\mathbb{S}^{d-1}$ , $\varepsilon>0$ , $R>0$ , $\delta\in(0,R^{2})$ , and $\mathcal{X}:=[-R,R]^{d}$ . Set $f^{\star}(\mathbf{x}):=(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}$ and fix $\Lambda$ in the regime $\Lambda>\varepsilon(R^{2}-\delta)$ , with $\rho:=\sqrt{\Lambda/(R^{2}-\delta)-\varepsilon}$ . For $W,\Lambda>0$ , write $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)=\{\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j}):m\in\mathbb{N},\ \sum_{j}|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\}$ .

(i)

Yat upper bound. The single Yat atom $g_{\alpha}(\mathbf{x}):=k_{0,\varepsilon}(\alpha\mathbf{w}_{\star},\mathbf{x})$ satisfies $\sup_{\mathbf{x}\in\mathcal{X}}|g_{\alpha}(\mathbf{x})-f^{\star}(\mathbf{x})|\leq\delta$ for every $\alpha\geq\alpha_{0}(d,R,\varepsilon,\delta)$ , where $\alpha_{0}$ is an explicit polynomial in $d^{1/2},R,\varepsilon^{1/2},\delta^{-1/2}$ .
(ii)

IMQ lower bound. Every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $\sup_{\mathbf{x}\in\mathcal{X}}|F(\mathbf{x})-f^{\star}(\mathbf{x})|\leq\delta$ uses at least $m\geq(2R/\rho)^{d-1}\Gamma((d+1)/2)/\pi^{(d-1)/2}$ atoms; whenever $\Lambda$ stays bounded as $d\to\infty$ , this is $\exp(\Omega(d\log d))$ .

Proof of Proposition 7.

(i) Substituting $\mathbf{w}=\alpha\mathbf{w}_{\star}$ ,

g_{\alpha}(\mathbf{x})=\frac{\alpha^{2}(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{\|\mathbf{x}-\alpha\mathbf{w}_{\star}\|^{2}+\varepsilon}=\frac{(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{1+\eta(\mathbf{x},\alpha)},\qquad\eta(\mathbf{x},\alpha):=-\frac{2\mathbf{w}_{\star}^{\top}\mathbf{x}}{\alpha}+\frac{\|\mathbf{x}\|^{2}+\varepsilon}{\alpha^{2}}.

For $\mathbf{x}\in\mathcal{X}$ , $|\mathbf{w}_{\star}^{\top}\mathbf{x}|\leq R\sqrt{d}$ and $\|\mathbf{x}\|^{2}+\varepsilon\leq dR^{2}+\varepsilon$ . If $\alpha\geq\max\{8R\sqrt{d},\,2\sqrt{dR^{2}+\varepsilon}\}$ , then $|\eta|\leq 1/2$ , and $|1/(1+\eta)-1|\leq 2|\eta|$ . Hence

|g_{\alpha}(\mathbf{x})-f^{\star}(\mathbf{x})|\leq(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\cdot 2|\eta|\leq dR^{2}\Bigl(\frac{4R\sqrt{d}}{\alpha}+\frac{2(dR^{2}+\varepsilon)}{\alpha^{2}}\Bigr).

Setting the right-hand side to $\delta$ and taking $\alpha$ to dominate both $8R^{3}d^{3/2}/\delta$ (first term, sufficient for $4R^{3}d^{3/2}/\alpha\leq\delta/2$ ) and $2\sqrt{dR^{2}(dR^{2}+\varepsilon)/\delta}$ (second term, sufficient for $2dR^{2}(dR^{2}+\varepsilon)/\alpha^{2}\leq\delta/2$ ) gives $\alpha_{0}$ as claimed.

(ii) Restrict attention to the inscribed Euclidean ball $\mathcal{B}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}\subset[-R,R]^{d}$ . By rotational invariance of $\mathcal{B}$ , take $\mathbf{w}_{\star}=\mathbf{e}_{1}$ , and consider the affine slice $S^{\prime}:=\{R/2\}\times\mathcal{B}^{\prime}_{d-1}\subset\mathcal{B}$ , where $\mathcal{B}^{\prime}_{d-1}\subset\mathbb{R}^{d-1}$ is the $(d-1)$ -ball of radius $R\sqrt{3}/2$ . On $S^{\prime}$ , $f^{\star}\equiv R^{2}/4$ in the slice direction and the uniform approximation hypothesis (strengthened to $\delta<R^{2}/4$ ) gives $F(\mathbf{x})\geq R^{2}/4-\delta>0$ for every $\mathbf{x}\in S^{\prime}$ . For any $F=\sum_{j}a_{j}h_{\varepsilon}(\cdot,\mathbf{w}_{j})\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ ,

F(\mathbf{x})\leq|F(\mathbf{x})|\leq\sum_{j}|a_{j}|\,h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})\leq\Lambda\cdot\max_{j}h_{\varepsilon}(\mathbf{x},\mathbf{w}_{j})=\frac{\Lambda}{\min_{j}\|\mathbf{x}-\mathbf{w}_{j}\|^{2}+\varepsilon}.

Combining, $\min_{j}\|\mathbf{x}-\mathbf{w}_{j}\|^{2}\leq\Lambda/(R^{2}/4-\delta)-\varepsilon=\rho^{2}$ for every $\mathbf{x}\in S^{\prime}$ . Equivalently, $S^{\prime}\subseteq\bigcup_{j=1}^{m}B_{d}(\mathbf{w}_{j},\rho)$ where $B_{d}$ denotes the closed Euclidean ball in $\mathbb{R}^{d}$ . The cross-section $B_{d}(\mathbf{w}_{j},\rho)\cap S^{\prime}$ is contained in a $(d-1)$ -dimensional ball of radius at most $\rho$ , with $(d-1)$ -volume bounded by $V_{d-1}(\rho):=\pi^{(d-1)/2}\rho^{d-1}/\Gamma((d+1)/2)$ . The slice $S^{\prime}$ has $(d-1)$ -volume $\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)$ , so volume-counting gives

\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)=|S^{\prime}|_{d-1}\leq\sum_{j=1}^{m}|B_{d}(\mathbf{w}_{j},\rho)\cap S^{\prime}|_{d-1}\leq m\,V_{d-1}(\rho),

which rearranges to $m\geq(R\sqrt{3}/(2\rho))^{d-1}$ . By Stirling’s approximation, the Gamma factors cancel between the slice and the ball volumes; absorbing the constant $(\sqrt{3}/2)^{d-1}$ into the implicit $O(d)$ Stirling-error term and taking logs gives $\log m\geq(d-1)\log(R\sqrt{3}/(2\rho))-O(d)$ , which is $\Omega(d\log d)$ for any $\rho$ that does not grow with $d$ . ∎

E.9 Additional proofs from Section 4

Lemma 3 (The unbiased section lies in the positive-bias span).

Fix $\mathbf{w}\in\mathbb{R}^{d}$ and $\varepsilon>0$ . The map $b\mapsto g_{\varepsilon}(\cdot;\mathbf{w},b)$ is quadratic in $b$ , so for every $h>0$

g_{\varepsilon}(\cdot;\mathbf{w},0)\;=\;3\,g_{\varepsilon}(\cdot;\mathbf{w},h)\;-\;3\,g_{\varepsilon}(\cdot;\mathbf{w},2h)\;+\;g_{\varepsilon}(\cdot;\mathbf{w},3h),

(9)

where every atom on the right has bias in $\{h,2h,3h\}\subset(0,\infty)$ . In particular $g_{\varepsilon}(\cdot;\mathbf{w},0)\in F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ .

Proof of Lemma 3.

$D_{\varepsilon}$ is independent of $b$ , so $g_{\varepsilon}(\cdot;\mathbf{w},b)D_{\varepsilon}=(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ is a univariate quadratic in $b$ . Three values $\{h,2h,3h\}$ over-determine a quadratic, and the unique linear extrapolant to $b=0$ has coefficients $(3,-3,1)$ on $(h,2h,3h)$ respectively, since $f(0)=3f(h)-3f(2h)+f(3h)$ for any quadratic $f$ . Dividing by $D_{\varepsilon}$ gives (9). ∎

Corollary 6 (Strict span containment).

$F_{\mathrm{IMQ},\varepsilon}\subsetneq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ for every $\varepsilon>0$ . Concretely, for every $\mathbf{w}\neq\mathbf{0}$ the unbiased section $g_{\varepsilon}(\cdot;\mathbf{w},0)$ lies in $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ (Lemma 3) but not in $F_{\mathrm{IMQ},\varepsilon}$ . This is the span-level companion to the RKHS-level strict containment of Corollary 2: the same separation holds at the level of finite linear spans of atoms, witnessed by the directional far-field trace rather than by Loewner domination.

Proof of Corollary 6.

Inclusion is Theorem 4. For the strict separation, fix a unit vector $\mathbf{u}\in\mathbb{S}^{d-1}$ with $\mathbf{u}^{\top}\mathbf{w}\neq 0$ . Along the ray $\mathbf{x}=r\mathbf{u}$ , $g_{\varepsilon}(r\mathbf{u};\mathbf{w},0)=r^{2}(\mathbf{u}^{\top}\mathbf{w})^{2}/(r^{2}-2r\,\mathbf{u}^{\top}\mathbf{w}+\|\mathbf{w}\|^{2}+\varepsilon)\to(\mathbf{u}^{\top}\mathbf{w})^{2}\neq 0,$ while every finite combination $F\in F_{\mathrm{IMQ},\varepsilon}$ obeys $|F(r\mathbf{u})|=O(r^{-2})\to 0$ . The far-field limits disagree, so $g_{\varepsilon}(\cdot;\mathbf{w},0)\notin F_{\mathrm{IMQ},\varepsilon}$ . By Lemma 3 it does lie in $F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}$ , witnessing strictness. ∎

Proof of Corollary 3.

$F_{\mathrm{IMQ},\varepsilon}$ is dense in $C(\mathcal{X})$ by Micchelli et al. [2006, Thm. 17]. By Theorem 4, $F_{\mathrm{IMQ},\varepsilon}\subseteq F_{\text{\tifinaghfont ⵟ},\varepsilon}^{+}\subseteq F_{\text{\tifinaghfont ⵟ}}^{+}\subseteq F_{\text{\tifinaghfont ⵟ}}^{\geq 0}$ , so each superset is dense. ∎

Proof of Corollary 4.

By Theorem 4, every IMQ atom equals $(2h^{2})^{-1}$ times a positive-bias second-order finite difference of three Yat atoms, identically in $\mathbf{x}$ . Substituting term by term in $F$ produces a $3m$ -atom Yat expansion $G$ with $G\equiv F$ pointwise; the quantitative rate transfer is then immediate. ∎

E.10 Additional proofs from Section 5

Corollary 7 (Data-dependent empirical bound).

Under the assumptions of Theorem 6, the same proof yields the data-dependent bound

\widehat{\mathrm{Rad}}_{n}(\mathcal{F}_{B,b,\varepsilon})\;\leq\;\frac{B}{\sqrt{n\varepsilon}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}+b)^{2}},

(10)

computable from the sample alone and at most $B(R^{2}+b)/(\sqrt{n}\,\sqrt{\varepsilon})$ on any sample with $\max_{i}\|\mathbf{x}_{i}\|\leq R$ .

Proof of Corollary 7.

The proof of Theorem 6 bounds $\widehat{\mathrm{Rad}}_{n}$ by $(B/\sqrt{n})\sqrt{(1/n)\sum_{i}k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})}$ before applying the worst-case bound on the diagonal. Using the Yat-kernel diagonal $k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{i})=(\|\mathbf{x}_{i}\|^{2}+b)^{2}/\varepsilon$ directly gives (10). ∎

Appendix F Asymptotic Filtration and Native-Space Rate Transfer

This appendix records two structural results of the unbiased and biased Yat spans that extend the main-body theory: a three-step asymptotic filtration of the biased span detected by the directional-trace operator and its higher-order analogues, and a native-space rate-transfer statement coupled with a coefficient-mass lower bound.

Three-step asymptotic filtration of the biased span.

The biased Yat span $\mathcal{F}_{\text{\tifinaghfont ⵟ}}=\mathrm{span}\{g_{\varepsilon}(\cdot;\mathbf{w},b):\mathbf{w}\in\mathbb{R}^{d},b\in\mathbb{R}\}$ admits a three-step filtration

\mathcal{F}_{\text{\tifinaghfont ⵟ}}\supseteq\ker T_{\infty}\supseteq\mathcal{K}_{1}\supseteq\mathcal{K}_{2}\supseteq\mathcal{K}_{3},

detected by linear asymptotic-trace operators $T_{\infty}$ , $T_{1}$ , $T_{2}$ acting at orders $r^{0}$ , $r^{-1}$ , $r^{-2}$ on radial rays. The successive quotients are canonically isomorphic to the spaces of restrictions of homogeneous quadratic, linear, and constant forms on $\mathbb{S}^{d-1}$ , decomposing further into spherical-harmonic sectors $\mathcal{H}_{0}\oplus\mathcal{H}_{2}$ , $\mathcal{H}_{1}$ , $\mathcal{H}_{0}$ . This sharpens the directional-trace separation of Proposition 2 from a single-witness statement (Yat atoms have nonzero quadratic shadow, IMQ atoms have zero shadow) into a full filtration whose quotients refine into spherical-harmonic sectors. The first two non-trivial levels are strictly nested: $\ker T_{\infty}\setminus\mathcal{K}_{1}\neq\emptyset$ via an explicit four-atom construction, and $\mathcal{K}_{1}=\mathcal{L}\oplus\mathcal{F}_{\mathrm{IMQ}}$ as a direct sum (the linear-alignment span $\mathcal{L}$ is disjoint from the IMQ span). The intrinsic characterisation of $\mathcal{K}_{3}=\ker(T_{2}|_{\mathcal{F}_{\mathrm{IMQ}}})$ and whether $\mathcal{K}_{2}\supsetneq\mathcal{K}_{3}$ are open.

Native-space rate transfer and coefficient-mass lower bound.

The exact bias-finite-difference reduction of Theorem 4 carries IMQ scattered-data approximation theory into $\mathcal{F}_{\text{\tifinaghfont ⵟ}}$ at a factor- $3$ overhead in atom count: every finite IMQ approximant on a compact set induces a biased-span approximant with identical pointwise error. The IMQ kernel has an analytic native space, so the scattered-data theory of Schaback [1995], Wendland [2004] delivers fill-distance approximation rates that are exponential in $1/h_{\Xi}$ for native-space targets, in contrast to the polynomial $h_{\Xi}^{k}$ rates of $C^{k}$ -Sobolev kernels; under the bias finite-difference reduction these exponential rates transfer to $\mathcal{F}_{\text{\tifinaghfont ⵟ}}$ with at most a factor- $3$ overhead. In the opposite direction, the codimension-one zero set $k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{w})|_{\mathbf{w}^{\perp}}=0$ produces approximation lower bounds: on $\mathbb{S}^{d-1}$ , fewer than $d$ unbiased atoms cannot approximate the constant function to error below $1$ , and the total weighted coefficient cost is bounded below by $\Omega(\varepsilon d/N)$ .

Appendix G Far-Field Separations

We write

k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{(\mathbf{w}^{\top}\mathbf{x}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon},\qquad b\geq 0,\quad\varepsilon>0.

For fixed shared $(b,\varepsilon)$ , let $\mathcal{H}_{b,\varepsilon}$ denote the RKHS of $k_{b,\varepsilon}$ . For $R>0$ , write

B_{R}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\},\qquad\mathbb{S}^{d-1}:=\{\mathbf{u}\in\mathbb{R}^{d}:\|\mathbf{u}\|=1\}.

This appendix collects the radial-vs-Yat far-field separations: an asymptotic exterior-shell bound (both $L^{\infty}$ and $L^{2}$ versions) and a polynomial-separation gap on line-containing domains. The multiclass-generalization machinery, prefix-pullback theorem, and Loewner-comparison theorems previously bundled here have been split into separate appendices (Appendices H and I) for clarity.

G.1 Asymptotic Exterior-Shell Separation

We complement the bounded-domain gap of Section 2 with asymptotic $L^{\infty}$ and $L^{2}$ separations on exterior shells $A_{R}:=\{r\mathbf{u}:R\leq r\leq 2R,\ \mathbf{u}\in\mathbb{S}^{d-1}\}$ for $R\gg W$ . For $W,\Lambda>0$ , write $\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ and $\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$ for the bounded-variation classes $\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})$ and $\sum_{j=1}^{m}a_{j}\,e^{-\gamma\|\cdot-\mathbf{v}_{j}\|^{2}}$ with $\sum|a_{j}|\leq\Lambda$ and $\|\mathbf{v}_{j}\|\leq W$ .

Lemma 4 (Uniform radial decay on exterior shells).

Let $R>W$ . For $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ , $\sup_{A_{R}}|F|\leq\Lambda/((R-W)^{2}+\varepsilon)$ ; for $F\in\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$ , $\sup_{A_{R}}|F|\leq\Lambda e^{-\gamma(R-W)^{2}}$ .

Proof.

$\|\mathbf{x}-\mathbf{v}_{j}\|\geq R-W$ by reverse triangle inequality, so $(\|\mathbf{x}-\mathbf{v}_{j}\|^{2}+\varepsilon)^{-1}\leq((R-W)^{2}+\varepsilon)^{-1}$ and $e^{-\gamma\|\mathbf{x}-\mathbf{v}_{j}\|^{2}}\leq e^{-\gamma(R-W)^{2}}$ . Sum against $|a_{j}|$ . ∎

Lemma 5 (Quantitative Yat far-field approximation).

Fix $\mathbf{w}_{\star}\in\mathbb{R}^{d}$ , $b\geq 0$ , and $\varepsilon>0$ . Let $g_{\star}(\mathbf{x})=k_{b,\varepsilon}(\mathbf{w}_{\star},\mathbf{x})$ and $q(\mathbf{u}):=(\mathbf{u}^{\top}\mathbf{w}_{\star})^{2}$ . For every $R>\|\mathbf{w}_{\star}\|$ and every $r\in[R,2R]$ ,

\left|g_{\star}(r\mathbf{u})-q(\mathbf{u})\right|\leq\eta_{R},\qquad\eta_{R}:=\frac{4R\|\mathbf{w}_{\star}\|(b+\|\mathbf{w}_{\star}\|^{2})+b^{2}+\|\mathbf{w}_{\star}\|^{2}(\|\mathbf{w}_{\star}\|^{2}+\varepsilon)}{(R-\|\mathbf{w}_{\star}\|)^{2}+\varepsilon}.

In particular, $\eta_{R}=O(R^{-1})$ as $R\to\infty$ .

Proof.

Let $W_{\star}:=\|\mathbf{w}_{\star}\|$ , $a:=\mathbf{u}^{\top}\mathbf{w}_{\star}$ , so $|a|\leq W_{\star}$ . Then $g_{\star}(r\mathbf{u})=(ra+b)^{2}/(r^{2}-2ra+W_{\star}^{2}+\varepsilon)$ and $q(\mathbf{u})=a^{2}$ . The numerator of $g_{\star}(r\mathbf{u})-a^{2}$ equals $(ra+b)^{2}-a^{2}(r^{2}-2ra+W_{\star}^{2}+\varepsilon)=2ra(b+a^{2})+b^{2}-a^{2}(W_{\star}^{2}+\varepsilon)$ , whose absolute value is at most $2rW_{\star}(b+W_{\star}^{2})+b^{2}+W_{\star}^{2}(W_{\star}^{2}+\varepsilon)\leq 4RW_{\star}(b+W_{\star}^{2})+b^{2}+W_{\star}^{2}(W_{\star}^{2}+\varepsilon)$ . The denominator $\|r\mathbf{u}-\mathbf{w}_{\star}\|^{2}+\varepsilon\geq(R-W_{\star})^{2}+\varepsilon$ . Combining gives $\eta_{R}$ . ∎

Theorem 10 (Far-field separation: $L^{\infty}$ and $L^{2}$ ).

Fix $\mathbf{w}_{\star}\neq\mathbf{0}$ , $b\geq 0$ , $\varepsilon>0$ , $g_{\star}=k_{b,\varepsilon}(\mathbf{w}_{\star},\cdot)$ , $W\geq\|\mathbf{w}_{\star}\|$ , $\Lambda>0$ , $R>W$ , and let $\eta_{R}$ be as in Lemma 5.

(i) $L^{\infty}$ -bound on the exterior shell.

\inf_{F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)}\|g_{\star}-F\|_{L^{\infty}(A_{R})}\geq\bigl(\|\mathbf{w}_{\star}\|^{2}-\eta_{R}-\tfrac{\Lambda}{(R-W)^{2}+\varepsilon}\bigr)_{+},

with the analogous bound over $\mathcal{R}_{\mathrm{RBF}}(\Lambda,W,\gamma)$ replacing the IMQ decay term by $\Lambda e^{-\gamma(R-W)^{2}}$ .

(ii) $L^{2}$ risk lower bound under directional sampling. Let $\nu$ be a probability distribution on $\mathbb{S}^{d-1}$ with $r\in[R,2R]$ independent of $\mathbf{u}\sim\nu$ , $q(\mathbf{u}):=(\mathbf{u}^{\top}\mathbf{w}_{\star})^{2}$ , and $c_{\nu}(\mathbf{w}_{\star}):=\mathbb{E}[q(\mathbf{u})^{2}]>0$ . Then every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ satisfies

\mathbb{E}\bigl[(g_{\star}(\mathbf{x})-F(\mathbf{x}))^{2}\bigr]\geq\Bigl[\bigl(\sqrt{c_{\nu}(\mathbf{w}_{\star})}-\eta_{R}-\tfrac{\Lambda}{(R-W)^{2}+\varepsilon}\bigr)_{+}\Bigr]^{2},

with the analogous RBF bound. A single Yat atom realises $g_{\star}$ at zero error in either norm.

Proof.

(i) Take $\mathbf{u}_{\star}:=\mathbf{w}_{\star}/\|\mathbf{w}_{\star}\|$ . For $r\in[R,2R]$ , $g_{\star}(r\mathbf{u}_{\star})\geq\|\mathbf{w}_{\star}\|^{2}-\eta_{R}$ by Lemma 5 and $|F(r\mathbf{u}_{\star})|\leq\Lambda/((R-W)^{2}+\varepsilon)$ by Lemma 4. The triangle inequality and supremum over $A_{R}$ give the IMQ bound; the RBF case is identical. (ii) $\|g_{\star}-q\|_{L^{2}}\leq\eta_{R}$ (Lemma 5 transferred $L^{\infty}\to L^{2}$ ), so reverse triangle gives $\|g_{\star}\|_{L^{2}}\geq\sqrt{c_{\nu}(\mathbf{w}_{\star})}-\eta_{R}$ . $\|F\|_{L^{2}}\leq\Lambda/((R-W)^{2}+\varepsilon)$ from Lemma 4. A second reverse-triangle step, positive part, and squaring give the IMQ bound; RBF identical. ∎

Remark 4.

Both bounds require $\|\mathbf{v}_{j}\|\leq W$ and $\sum|a_{j}|\leq\Lambda$ ; lifting either restriction allows radial methods to place centers in the shell or use coefficients growing with the radius, and the bounds become vacuous.

G.2 Polynomial Separation

Theorem 11 (Polynomial approximation gap on line-containing domains).

Let $\mathbf{w}_{\star}\neq 0$ , $b\geq 0$ , $\varepsilon>0$ , and $g_{\star}(\mathbf{x})=k_{b,\varepsilon}(\mathbf{w}_{\star},\mathbf{x})$ . Let $\mathcal{P}_{2}=\{\mathbf{x}\mapsto\mathbf{x}^{\top}\mathbf{A}\mathbf{x}+\mathbf{a}^{\top}\mathbf{x}+c:\mathbf{A}=\mathbf{A}^{\top}\}$ be the space of degree-at-most-two polynomials. Suppose $X\subset\mathbb{R}^{d}$ contains a nontrivial line segment $\{t\mathbf{v}:t\in I\}$ with $\mathbf{v}:=\mathbf{w}_{\star}/\|\mathbf{w}_{\star}\|$ and $I\subset\mathbb{R}$ a compact interval with nonempty interior. Then

\inf_{P\in\mathcal{P}_{2}}\|g_{\star}-P\|_{L^{\infty}(X)}>0.

Proof.

Let $\rho:=\|\mathbf{w}_{\star}\|>0$ . Along $\mathbf{x}=t\mathbf{v}$ , $h(t):=g_{\star}(t\mathbf{v})=(\rho t+b)^{2}/((t-\rho)^{2}+\varepsilon)$ . Restrictions of $\mathcal{P}_{2}$ to this line are univariate polynomials of degree at most two. Suppose $h(t)=p(t)$ for all $t$ with $p\in\Pi_{2}$ . By real-analyticity this extends globally: $(\rho t+b)^{2}=p(t)((t-\rho)^{2}+\varepsilon)$ . If $p\neq 0$ has degree $k$ , the right side has degree $k+2$ (since $(t-\rho)^{2}+\varepsilon$ is monic quadratic), while the left has degree two, so $k=0$ , i.e. $p=C$ constant. Then $(\rho t+b)^{2}=C((t-\rho)^{2}+\varepsilon)$ . If $C=0$ : impossible since $\rho>0$ . If $C\neq 0$ : the left side has real double root $-b/\rho$ while $(t-\rho)^{2}+\varepsilon$ has nonreal roots $\rho\pm i\sqrt{\varepsilon}$ ; contradiction. Hence $h\notin\Pi_{2}$ , and since $\Pi_{2}$ is closed in $C(I)$ , its distance from $h$ is strictly positive. ∎

Appendix H Learned-Norm Multiclass Generalization

Let $X\subseteq B_{R}$ and $\kappa:=\sup_{\mathbf{x}\in X}\sqrt{k_{b,\varepsilon}(\mathbf{x},\mathbf{x})}\leq(R^{2}+b)/\sqrt{\varepsilon}$ . For a $C$ -class predictor $\mathbf{f}=(f_{1},\ldots,f_{C})$ with each $f_{c}\in\mathcal{H}_{b,\varepsilon}$ , define $\|\mathbf{f}\|_{\mathcal{H}^{C}}^{2}:=\sum_{c=1}^{C}\|f_{c}\|_{\mathcal{H}_{b,\varepsilon}}^{2}$ . For a finite Yat head $f_{c}(\mathbf{x})=\sum_{j}\alpha_{j,c}k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{x})$ with Gram matrix $\mathbf{K}_{ij}=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$ , the reproducing property gives $\|\mathbf{f}\|_{\mathcal{H}^{C}}^{2}=\sum_{c=1}^{C}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ .

Define the ramp loss $\phi(t):=\min\{1,\max\{0,1-t\}\}$ , the pairwise margin surrogate $\ell_{\gamma}(\mathbf{f};(\mathbf{x},y)):=\sum_{c\neq y}\phi((f_{y}(\mathbf{x})-f_{c}(\mathbf{x}))/\gamma)$ , and the empirical surrogate $\widehat{L}_{\gamma}(\mathbf{f}):=(1/n)\sum_{i}\ell_{\gamma}(\mathbf{f};(\mathbf{x}_{i},y_{i}))$ .

Lemma 6 (Rademacher bound for pairwise margin loss).

Let $\mathcal{F}_{B}:=\{\mathbf{f}:\|\mathbf{f}\|_{\mathcal{H}^{C}}\leq B\}$ . Then

\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{B})\leq\frac{\sqrt{2}\,C(C-1)B\kappa}{\gamma\sqrt{n}}.

Proof.

By subadditivity, $\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{B})\leq\sum_{a\neq c}\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{a,c})$ . The ramp loss is $(1/\gamma)$ -Lipschitz, so by contraction, $\widehat{\mathrm{Rad}}_{n}(\mathcal{L}_{a,c})\leq(1/\gamma)\widehat{\mathrm{Rad}}_{n}(\mathcal{G}_{a,c})$ where $\mathcal{G}_{a,c}$ restricts to $(f_{a}-f_{c})$ on the class- $a$ support. Setting $h_{i}:=\mathbf{1}\{y_{i}=a\}k_{b,\varepsilon}(\mathbf{x}_{i},\cdot)\in\mathcal{H}_{b,\varepsilon}$ and applying Cauchy–Schwarz in the product Hilbert space gives $\widehat{\mathrm{Rad}}_{n}(\mathcal{G}_{a,c})\leq(\sqrt{2}B/n)\mathbb{E}_{\sigma}\|\sum_{i}\sigma_{i}h_{i}\|_{\mathcal{H}}$ . By Jensen’s inequality and the reproducing property, $\mathbb{E}_{\sigma}\|\sum_{i}\sigma_{i}h_{i}\|_{\mathcal{H}}\leq(\sum_{i}\|h_{i}\|_{\mathcal{H}}^{2})^{1/2}\leq\kappa\sqrt{n}$ . Summing over the $C(C-1)$ ordered pairs gives the result. ∎

Theorem 12 (Learned-norm multiclass generalization bound).

With probability at least $1-\delta$ over i.i.d. samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ from $X\times\{1,\ldots,C\}$ , every $\mathbf{f}\in\mathcal{F}_{B}$ satisfies

\mathbb{P}\{\arg\max_{c}f_{c}(\mathbf{x})\neq y\}\leq\widehat{L}_{\gamma}(\mathbf{f})+\frac{2\sqrt{2}\,C(C-1)B\kappa}{\gamma\sqrt{n}}+3(C-1)\sqrt{\frac{\log(2/\delta)}{2n}}.

For a finite shared- $(b,\varepsilon)$ Yat head, $B^{2}=\sum_{c=1}^{C}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ .

Proof.

The zero-one loss is dominated by $\ell_{\gamma}$ . A standard Rademacher generalization inequality for $[0,C-1]$ -valued losses, combined with Lemma 6, gives the result. The norm identity follows from the reproducing property. ∎

Remark 5.

The actual learned RKHS norm $\sum_{c}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ , not a post-hoc radius, controls the margin class.

Theorem 13 (Data-dependent learned-norm bound by peeling).

Fix $B_{0}>0$ . With probability at least $1-\delta$ , every finite shared- $(b,\varepsilon)$ Yat head $\mathbf{f}$ satisfies

\mathbb{P}\{\arg\max_{c}f_{c}(\mathbf{x})\neq y\}\leq\widehat{L}_{\gamma}(\mathbf{f})\\ +\frac{4\sqrt{2}\,C(C{-}1)\kappa\max\{B(\mathbf{f}),B_{0}\}}{\gamma\sqrt{n}}+3(C{-}1)\sqrt{\frac{\log\!\bigl(\pi^{2}(s(\mathbf{f}){+}1)^{2}/(3\delta)\bigr)}{2n}},

where $B(\mathbf{f}):=\|\mathbf{f}\|_{\mathcal{H}^{C}}$ and $s(\mathbf{f}):=\lceil\log_{2}(\max\{B(\mathbf{f}),B_{0}\}/B_{0})\rceil$ . For a finite Yat head, $B(\mathbf{f})^{2}=\sum_{c}\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ .

Proof.

Standard peeling argument: for $s=0,1,2,\ldots$ set $B_{s}:=2^{s}B_{0}$ and $\delta_{s}:=6\delta/(\pi^{2}(s+1)^{2})$ ; note $\sum_{s}\delta_{s}=\delta$ . Apply Theorem 12 to $\mathcal{F}_{B_{s}}$ with confidence $\delta_{s}$ . For any $\mathbf{f}$ with $s=s(\mathbf{f})$ , one has $B(\mathbf{f})\leq B_{s}\leq 2\max\{B(\mathbf{f}),B_{0}\}$ , giving the data-dependent bound after substituting $\delta_{s}$ . ∎

Corollary 8 (Regularized training controls the learned norm).

If training returns $\widehat{\mathbf{f}}$ satisfying $\widehat{L}_{\gamma}(\widehat{\mathbf{f}})+\lambda\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq(C-1)$ , then $\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq(C-1)/\lambda$ and in particular $\sum_{c}\widehat{\boldsymbol{\alpha}}_{c}^{\top}\mathbf{K}\widehat{\boldsymbol{\alpha}}_{c}\leq(C-1)/\lambda$ .

Proof.

The zero predictor has $\ell_{\gamma}(0;(\mathbf{x},y))=C-1$ for every $(\mathbf{x},y)$ , giving $\widehat{L}_{\gamma}(0)=C-1$ . The assumed inequality and $\lambda>0$ then give $\lambda\|\widehat{\mathbf{f}}\|_{\mathcal{H}^{C}}^{2}\leq C-1$ . ∎

Appendix I Pullback and Loewner-Comparison Theorems

I.1 Layerwise Composition: Pullback RKHS Structure

This section proves a prefix-conditioned pullback statement: once the prefix of a trained network is fixed, the next Yat layer induces a valid RKHS on the original input space.

Let $X\subset\mathbb{R}^{d_{0}}$ be compact. For a depth- $L$ Yat stack, write

\Phi_{0}(\mathbf{x})=\mathbf{x},\qquad\Phi_{\ell}(\mathbf{x})=T_{\ell}(\Phi_{\ell-1}(\mathbf{x})),

where layer $\ell$ has coordinates $[T_{\ell}(\mathbf{z})]_{c}=\sum_{j=1}^{m_{\ell}}a_{\ell,j,c}k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{w}_{\ell,j},\mathbf{z})$ . Write $k_{\ell}(\mathbf{z},\mathbf{z}^{\prime}):=k_{b_{\ell},\varepsilon_{\ell}}(\mathbf{z},\mathbf{z}^{\prime})$ and $[\mathbf{K}_{\ell}]_{ij}:=k_{\ell}(\mathbf{w}_{\ell,i},\mathbf{w}_{\ell,j})$ .

When learned centers $\mathbf{w}_{\ell,j}$ do not lie in the prefix range $\Phi_{\ell-1}(X)$ , we view each layer coordinate as the restriction to $\Phi_{\ell-1}(X)$ of a function in the global RKHS $\mathcal{H}_{k_{\ell}}$ on $\mathbb{R}^{d_{\ell-1}}$ . Equivalently, define the extended domain $D_{\ell}:=\Phi_{\ell-1}(X)\cup\{\mathbf{w}_{\ell,1},\ldots,\mathbf{w}_{\ell,m_{\ell}}\}$ , apply the RKHS construction on $D_{\ell}$ , and restrict back to $\Phi_{\ell-1}(X)$ ; the restriction RKHS carries the quotient norm. The norm bound below holds in either view.

Theorem 14 (Prefix-conditioned pullback RKHS: PSD, quotient norm, universality).

Let $k$ be a continuous PSD kernel on $Z$ and $\Phi:X\to Z$ a measurable map. Define $\widetilde{k}(\mathbf{x},\mathbf{x}^{\prime}):=k(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime}))$ and $\mathcal{S}:=\overline{\mathrm{span}}\{k(\Phi(\mathbf{x}),\cdot):\mathbf{x}\in X\}\subset\mathcal{H}_{k}$ .

(a) PSD and norm bound. $\widetilde{k}$ is PSD on $X$ . For every $f\in\mathcal{H}_{k}$ , $f\circ\Phi\in\mathcal{H}_{\widetilde{k}}$ with $\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}\leq\|f\|_{\mathcal{H}_{k}}$ .

(b) Exact quotient-norm characterisation. $\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}=\inf\{\|g\|_{\mathcal{H}_{k}}:g\in\mathcal{H}_{k},\;g|_{\Phi(X)}=f|_{\Phi(X)}\}=\|P_{\mathcal{S}}f\|_{\mathcal{H}_{k}},$ where $P_{\mathcal{S}}$ is orthogonal projection onto $\mathcal{S}$ ; the infimum is attained at $P_{\mathcal{S}}f$ .

(c) Universality propagates through injective continuous prefixes. If $X$ is compact, $\Phi$ is continuous and injective, and $k$ is universal on $\Phi(X)$ , then $\widetilde{k}$ is universal on $X$ .

Application to layer- $\ell$ Yat. Specialising to $k=k_{b_{\ell},\varepsilon_{\ell}}$ , $\Phi=\Phi_{\ell-1}$ , $\widetilde{k}=\widetilde{k}_{\ell}$ , part (a) gives the per-layer norm bound $\|[T_{\ell}\circ\Phi_{\ell-1}]_{c}\|_{\mathcal{H}_{\widetilde{k}_{\ell}}}^{2}\leq\mathbf{a}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{a}_{\ell,c}$ , with equality (via (b)) iff $f_{\ell,c}\in\mathcal{S}_{\ell}:=\overline{\mathrm{span}}\{k_{\ell}(\Phi_{\ell-1}(\mathbf{x}),\cdot):\mathbf{x}\in X\}$ . For $b_{\ell}>0$ and continuous injective $\Phi_{\ell-1}$ , part (c) together with Proposition 1 yields universality of $\widetilde{k}_{\ell}$ on $X$ .

Proof.

(a) For $\{\mathbf{x}_{i}\}\subset X$ and $\{\beta_{i}\}\subset\mathbb{R}$ , $\sum\beta_{i}\beta_{j}\widetilde{k}(\mathbf{x}_{i},\mathbf{x}_{j})=\sum\beta_{i}\beta_{j}k(\Phi(\mathbf{x}_{i}),\Phi(\mathbf{x}_{j}))\geq 0$ by PSD of $k$ . Let $\psi(\mathbf{z}):=k(\mathbf{z},\cdot)$ be the canonical feature map and $\widetilde{\psi}(\mathbf{x}):=\psi(\Phi(\mathbf{x}))$ , so $\widetilde{k}(\mathbf{x},\mathbf{x}^{\prime})=\langle\widetilde{\psi}(\mathbf{x}),\widetilde{\psi}(\mathbf{x}^{\prime})\rangle$ . The reproducing property gives $(f\circ\Phi)(\mathbf{x})=\langle f,\widetilde{\psi}(\mathbf{x})\rangle_{\mathcal{H}_{k}}$ , so $f\circ\Phi\in\mathcal{H}_{\widetilde{k}}$ with the displayed norm bound. Specialising $f=f_{\ell,c}=\sum_{j}a_{\ell,j,c}k_{\ell}(\mathbf{w}_{\ell,j},\cdot)$ and using $\|f_{\ell,c}\|_{\mathcal{H}_{\ell}}^{2}=\mathbf{a}_{\ell,c}^{\top}\mathbf{K}_{\ell}\mathbf{a}_{\ell,c}$ delivers the per-layer Yat-Gram bound.

(b) The map $U:\mathcal{H}_{\widetilde{k}}\to\mathcal{S}$ defined on sections by $U(\widetilde{k}(\mathbf{x},\cdot))=k(\Phi(\mathbf{x}),\cdot)$ is an isometric isomorphism (kernel-section inner products agree). Under $U$ , $f\circ\Phi$ corresponds to $P_{\mathcal{S}}f$ , so $\|f\circ\Phi\|_{\mathcal{H}_{\widetilde{k}}}=\|P_{\mathcal{S}}f\|_{\mathcal{H}_{k}}$ . The affine set $\{g\in\mathcal{H}_{k}:g|_{\Phi(X)}=f|_{\Phi(X)}\}$ equals $P_{\mathcal{S}}f+\mathcal{S}^{\perp}$ , minimised at $P_{\mathcal{S}}f$ .

(c) $X$ compact and $Z$ Hausdorff with $\Phi$ continuous injective makes $\Phi:X\to\Phi(X)$ a homeomorphism. For $g\in C(X)$ , set $h:=g\circ\Phi^{-1}\in C(\Phi(X))$ ; by universality of $k$ , find $f\in\mathcal{H}_{k}$ with $\sup_{\Phi(X)}|f-h|<\eta$ . Then $f\circ\Phi\in\mathcal{H}_{\widetilde{k}}$ and $\sup_{X}|f\circ\Phi-g|=\sup_{\Phi(X)}|f-h|<\eta$ . ∎

I.2 RKHS Ball Comparisons: Interpolation, Pullback, and Spectral Structure

The kernel-order domination of Theorem 2 propagates to three further structural comparisons: finite-sample interpolation norms, deep-prefix pullback geometry, and integral-operator spectra.

Finite-sample interpolation norm comparison.

For training points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ , define empirical Gram matrices $[\mathbf{K}_{Y}]_{ij}:=k_{b,\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j})$ and $[\mathbf{K}_{I}]_{ij}:=h_{\varepsilon}(\mathbf{x}_{i},\mathbf{x}_{j})$ .

Theorem 15 (Finite-sample interpolation norm comparison).

$\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I}$ as PSD matrices. If $b>0$ and both Gram matrices are strictly positive definite, then $\mathbf{K}_{Y}^{-1}\preceq b^{-2}\mathbf{K}_{I}^{-1}$ , and for any label vector $\mathbf{y}\in\mathbb{R}^{n}$ ,

\mathbf{y}^{\top}\mathbf{K}_{Y}^{-1}\mathbf{y}\;\leq\;\frac{1}{b^{2}}\,\mathbf{y}^{\top}\mathbf{K}_{I}^{-1}\mathbf{y}.

The minimum RKHS norm to interpolate $\mathbf{y}$ at $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ in $\mathcal{H}_{b,\varepsilon}$ is at most $b^{-1}$ times the minimum interpolation norm in $\mathcal{H}_{h_{\varepsilon}}$ .

Proof.

$\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I}$ follows from Theorem 2 applied entry-wise. Inversion reverses Loewner order: $\mathbf{K}_{Y}\succeq b^{2}\mathbf{K}_{I}\succ 0$ implies $\mathbf{K}_{Y}^{-1}\preceq b^{-2}\mathbf{K}_{I}^{-1}$ [Horn and Johnson, 2012, Thm. 7.7.4]. The quadratic form bound follows; the minimum-norm interpolant has squared norm $\mathbf{y}^{\top}\mathbf{K}^{-1}\mathbf{y}$ by the standard representer theorem. ∎

Pullback kernel-order domination for deep prefixes.

Theorem 16 (Pullback domination).

Let $\Phi:X\to\mathbb{R}^{d^{\prime}}$ be any measurable map, and define $\widetilde{k}_{b,\varepsilon}(\mathbf{x},\mathbf{x}^{\prime}):=k_{b,\varepsilon}(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime}))$ and $\widetilde{h}_{\varepsilon}(\mathbf{x},\mathbf{x}^{\prime}):=h_{\varepsilon}(\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime}))$ . Then $b^{2}\widetilde{h}_{\varepsilon}\preceq\widetilde{k}_{b,\varepsilon}$ on $X$ , and

\mathcal{H}_{\widetilde{h}_{\varepsilon}}\subseteq\mathcal{H}_{\widetilde{k}_{b,\varepsilon}},\qquad\|f\|_{\mathcal{H}_{\widetilde{k}_{b,\varepsilon}}}\leq\frac{1}{b}\|f\|_{\mathcal{H}_{\widetilde{h}_{\varepsilon}}}\quad(b>0).

In particular, for any trained Yat stack with per-layer bias $b_{\ell}>0$ , the RKHS ball comparison of Theorem 2 survives at every layer, conditionally on the trained prefix.

Proof.

For any $\{\mathbf{x}_{i}\}\subset X$ and $\{\beta_{i}\}\subset\mathbb{R}$ : $\sum_{i,j}\beta_{i}\beta_{j}(\widetilde{k}_{b,\varepsilon}-b^{2}\widetilde{h}_{\varepsilon})(\mathbf{x}_{i},\mathbf{x}_{j})=\sum_{i,j}\beta_{i}\beta_{j}(k_{b,\varepsilon}-b^{2}h_{\varepsilon})(\Phi(\mathbf{x}_{i}),\Phi(\mathbf{x}_{j}))\geq 0$ , since $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is PSD by Theorem 2. Aronszajn’s inclusion theorem then gives the containment. ∎

Gram-ball alignment excess.

Proposition 8 (Alignment excess decomposition).

For any trained prototype set $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}$ , define $[\mathbf{G}_{Y}]_{ij}:=k_{b,\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$ and $[\mathbf{G}_{I}]_{ij}:=h_{\varepsilon}(\mathbf{w}_{i},\mathbf{w}_{j})$ . Set $q(\mathbf{x},\mathbf{w}):=[2b\,\mathbf{x}^{\top}\mathbf{w}+(\mathbf{x}^{\top}\mathbf{w})^{2}]\,h_{\varepsilon}(\mathbf{x},\mathbf{w})$ ; $q$ is PSD by Schur-product factorization. For any $\boldsymbol{\alpha}\in\mathbb{R}^{m}$ ,

\boldsymbol{\alpha}^{\top}\mathbf{G}_{Y}\boldsymbol{\alpha}=b^{2}\,\boldsymbol{\alpha}^{\top}\mathbf{G}_{I}\boldsymbol{\alpha}+E_{\mathrm{align}}(\boldsymbol{\alpha}),

where $E_{\mathrm{align}}(\boldsymbol{\alpha}):=\boldsymbol{\alpha}^{\top}(\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I})\boldsymbol{\alpha}\geq 0$ . Moreover,

E_{\mathrm{align}}(\boldsymbol{\alpha})=0\;\iff\;\boldsymbol{\alpha}\in\ker(\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I}),

equivalently iff the finite expansion $\sum_{i}\alpha_{i}\,q(\mathbf{w}_{i},\cdot)$ is the zero function in the RKHS $\mathcal{H}_{q}$ . In particular, if the alignment Gram matrix $\mathbf{G}_{Y}-b^{2}\mathbf{G}_{I}$ is nonsingular on the chosen prototypes, equality forces $\boldsymbol{\alpha}=\mathbf{0}$ . The trained Yat norm $\boldsymbol{\alpha}^{\top}\mathbf{G}_{Y}\boldsymbol{\alpha}$ decomposes into a radial IMQ component and a non-negative alignment excess $E_{\mathrm{align}}$ .

Proof.

Substituting $\mathbf{G}_{Y}=b^{2}\mathbf{G}_{I}+\mathbf{G}_{q}$ , where $[\mathbf{G}_{q}]_{ij}:=q(\mathbf{w}_{i},\mathbf{w}_{j})$ , gives the decomposition. Non-negativity follows from the PSD property $\mathbf{G}_{q}\succeq 0$ , which is the Schur product of the PSD kernels $\mathbf{x}^{\top}\mathbf{w}$ (with multiplier $2b\geq 0$ ) plus $(\mathbf{x}^{\top}\mathbf{w})^{2}$ and $h_{\varepsilon}$ . The equality condition follows because for a PSD matrix $\mathbf{G}_{q}$ , $\boldsymbol{\alpha}^{\top}\mathbf{G}_{q}\boldsymbol{\alpha}=0$ iff $\mathbf{G}_{q}\boldsymbol{\alpha}=\mathbf{0}$ , equivalently iff $\sum_{i}\alpha_{i}q(\mathbf{w}_{i},\cdot)\equiv 0$ in $\mathcal{H}_{q}$ . ∎

Spectral domination.

For a probability measure $\varrho$ on $X$ , define the integral operators $(T_{Y}f)(\mathbf{x}):=\int k_{b,\varepsilon}(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x}^{\prime})\,d\varrho(\mathbf{x}^{\prime})$ and $(T_{I}f)(\mathbf{x}):=\int h_{\varepsilon}(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x}^{\prime})\,d\varrho(\mathbf{x}^{\prime})$ , both compact self-adjoint positive operators on $L^{2}(\varrho)$ .

Theorem 17 (Spectral domination and effective-dimension comparison).

$T_{Y}\succeq b^{2}T_{I}$ as operators on $L^{2}(\varrho)$ . Consequently,

\lambda_{j}(T_{Y})\geq b^{2}\lambda_{j}(T_{I})\quad\forall j\geq 1,

and for every $\lambda>0$ and $b>0$ the effective dimensions $\mathcal{N}_{k}(\lambda):=\mathrm{Tr}(T_{k}(T_{k}+\lambda I)^{-1})$ satisfy

\mathcal{N}_{Y}(\lambda)\;\geq\;\mathcal{N}_{I}(\lambda/b^{2}).

Proof.

For any $f\in L^{2}(\varrho)$ , $\langle f,(T_{Y}-b^{2}T_{I})f\rangle_{L^{2}}=\iint(k_{b,\varepsilon}-b^{2}h_{\varepsilon})(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x})f(\mathbf{x}^{\prime})\,d\varrho^{2}\geq 0$ , since $k_{b,\varepsilon}-b^{2}h_{\varepsilon}$ is PSD (Theorem 2). The eigenvalue bound follows from the min-max theorem. For the effective dimension, use monotonicity of $t\mapsto t/(t+\lambda)$ together with $\lambda_{j}(T_{Y})\geq b^{2}\lambda_{j}(T_{I})$ :

\frac{\lambda_{j}(T_{Y})}{\lambda_{j}(T_{Y})+\lambda}\;\geq\;\frac{b^{2}\lambda_{j}(T_{I})}{b^{2}\lambda_{j}(T_{I})+\lambda}\;=\;\frac{\lambda_{j}(T_{I})}{\lambda_{j}(T_{I})+\lambda/b^{2}},

and summing in $j$ gives $\mathcal{N}_{Y}(\lambda)\geq\mathcal{N}_{I}(\lambda/b^{2})$ . ∎

Theorem 18 (Layerwise perturbation propagation).

Let $T_{L}\circ\cdots\circ T_{1}$ and $\widetilde{T}_{L}\circ\cdots\circ\widetilde{T}_{1}$ be two depth- $L$ networks acting on $X_{0}\subset\mathbb{R}^{d_{0}}$ . Let $X_{\ell}:=T_{\ell}(X_{\ell-1})\cup\widetilde{T}_{\ell}(\widetilde{X}_{\ell-1})\subset\mathbb{R}^{d_{\ell}}$ be the union of the two intermediate ranges. Suppose each $T_{\ell}$ is $L_{\ell}$ -Lipschitz on $X_{\ell-1}$ and $\sup_{\mathbf{z}\in X_{\ell-1}}\|T_{\ell}(\mathbf{z})-\widetilde{T}_{\ell}(\mathbf{z})\|\leq\delta_{\ell}$ . Then

\sup_{\mathbf{x}\in X_{0}}\|T_{L}\circ\cdots\circ T_{1}(\mathbf{x})-\widetilde{T}_{L}\circ\cdots\circ\widetilde{T}_{1}(\mathbf{x})\|\leq\sum_{\ell=1}^{L}\delta_{\ell}\prod_{r=\ell+1}^{L}L_{r}.

The sup is over the relevant intermediate representations actually visited; it is not required to hold globally on $\mathbb{R}^{d_{\ell}}$ .

Proof.

Let $e_{\ell}:=\sup_{\mathbf{x}\in X_{0}}\|\mathbf{z}_{\ell}(\mathbf{x})-\widetilde{\mathbf{z}}_{\ell}(\mathbf{x})\|$ with $e_{0}=0$ . Then $e_{\ell}\leq L_{\ell}e_{\ell-1}+\delta_{\ell}$ . Unrolling gives $e_{L}\leq\sum_{\ell=1}^{L}\delta_{\ell}\prod_{r=\ell+1}^{L}L_{r}$ . ∎

Lemma 7 (Lipschitz bound for one Yat layer).

Let $\|\mathbf{z}\|\leq R$ , $\|\mathbf{w}_{j}\|\leq W$ , $b\geq 0$ , $\varepsilon>0$ . Define

M_{R,W,b,\varepsilon}:=\frac{2(RW+b)W}{\varepsilon}+\frac{2(RW+b)^{2}(R+W)}{\varepsilon^{2}}.

Each atom $\mathbf{z}\mapsto k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{z})$ is $M_{R,W,b,\varepsilon}$ -Lipschitz on $B_{R}$ . The coordinate $T_{c}(\mathbf{z})=\sum_{j}a_{j,c}k_{b,\varepsilon}(\mathbf{w}_{j},\mathbf{z})$ is Lipschitz with constant $\sum_{j}|a_{j,c}|M_{R,W,b,\varepsilon}$ , and the vector layer $T=(T_{1},\ldots,T_{C})$ is Lipschitz with constant $(\sum_{c}[\sum_{j}|a_{j,c}|M_{R,W,b,\varepsilon}]^{2})^{1/2}$ .

Proof.

With $D(\mathbf{z}):=\|\mathbf{z}-\mathbf{w}\|^{2}+\varepsilon$ and $N(\mathbf{z}):=(\mathbf{w}^{\top}\mathbf{z}+b)^{2}$ , $\nabla_{\mathbf{z}}k_{b,\varepsilon}(\mathbf{w},\mathbf{z})=(2(\mathbf{w}^{\top}\mathbf{z}+b)\mathbf{w}D(\mathbf{z})-2(\mathbf{w}^{\top}\mathbf{z}+b)^{2}(\mathbf{z}-\mathbf{w}))/D(\mathbf{z})^{2}$ . Using $D(\mathbf{z})\geq\varepsilon$ , $|\mathbf{w}^{\top}\mathbf{z}+b|\leq RW+b$ , and $\|\mathbf{z}-\mathbf{w}\|\leq R+W$ gives $\|\nabla_{\mathbf{z}}k_{b,\varepsilon}(\mathbf{w},\mathbf{z})\|\leq M_{R,W,b,\varepsilon}$ . The coordinate and vector-layer bounds follow by the chain rule and the Frobenius norm bound on the Jacobian. ∎

Appendix J Generic Sobolev-RKHS Containment for Bounded Smooth-Atom Stacks (Applied to Yat)

This appendix proves Theorem 19. The result is an ambient containment theorem: every bounded finite-depth Yat stack lies in one fixed Sobolev restriction RKHS on the original input domain. It does not claim that RKHSs are closed under nonlinear composition, nor does it replace the exact single-layer Gram norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ with an exact global Gram norm.

J.1 Sobolev restriction RKHS

Let $X\subset\mathbb{R}^{d_{0}}$ be compact and let $U\subset\mathbb{R}^{d_{0}}$ be a bounded Lipschitz domain with $X\subset\overline{U}$ . Since bounded Lipschitz domains are Sobolev extension domains [Adams and Fournier, 2003], the $H^{s}(\mathbb{R}^{d_{0}})$ Banach-algebra and Moser-composition estimates transfer to $H^{s}(U)$ without loss; the Sobolev-induction template we use here is standard in deep-network analysis, with the closest neighbours being Petersen and Voigtländer [2018] for $C^{k}$ -class deep ReLU networks and E et al. [2019] for the Barron-space variant. Fix $s>d_{0}/2$ . Define

\mathcal{H}_{s}(X):=\bigl\{f|_{X}:f\in H^{s}(U)\bigr\},\qquad\|g\|_{\mathcal{H}_{s}(X)}:=\inf\bigl\{\|f\|_{H^{s}(U)}:f|_{X}=g\bigr\}.

By the Sobolev embedding $H^{s}(U)\hookrightarrow C(\overline{U})$ (valid for $s>d_{0}/2$ ), point evaluation is continuous on $\mathcal{H}_{s}(X)$ , making it an RKHS on $X$ .

Lemma 8 (Sobolev algebra and Nemytskii inverse stability).

Let $U$ be a bounded Lipschitz domain and $s>d_{0}/2$ . Then:

1.

(Banach algebra.) $H^{s}(U)$ is a Banach algebra: there exists $C_{\mathrm{alg}}\geq 1$ such that $\|fg\|_{H^{s}(U)}\leq C_{\mathrm{alg}}\|f\|_{H^{s}(U)}\|g\|_{H^{s}(U)}$ for all $f,g\in H^{s}(U)$ .
2.

(Nemytskii inverse stability.) If $r\in H^{s}(U)$ satisfies $r(\mathbf{x})\geq\lambda>0$ for every $\mathbf{x}\in U$ , then $1/r\in H^{s}(U)$ . Moreover, there exists a nondecreasing function $\Gamma_{s}:(0,\infty)\times[0,\infty)\to[0,\infty)$ such that $\|1/r\|_{H^{s}(U)}\leq\Gamma_{s}(\lambda^{-1},\|r\|_{H^{s}(U)})$ .

Proof.

Part (1): the Sobolev multiplication theorem for $s>d_{0}/2$ on Lipschitz extension domains [Adams and Fournier, 2003, Thm. 4.39]. Part (2): since $H^{s}(U)\hookrightarrow L^{\infty}(U)$ (Sobolev embedding), $r$ is bounded above. Apply the Moser/Nemytskii composition estimate [Runst and Sickel, 1996, Ch. 5] to $\varphi(t)=1/t$ on $[\lambda,\|r\|_{L^{\infty}}]$ ; the resulting norm bound depends only on $\lambda^{-1}$ , $\|r\|_{H^{s}(U)}$ , $s$ , and the domain $U$ . ∎

J.2 Network class and parameter budgets

For a depth- $L$ Yat stack, write $\mathbf{z}_{0}(\mathbf{x})=\mathbf{x}\in\mathbb{R}^{d_{0}}$ and, for $\ell=1,\dots,L$ and $c=1,\dots,d_{\ell}$ ,

z_{\ell,c}(\mathbf{x})=\sum_{j=1}^{m_{\ell}}\alpha_{\ell,j,c}\,g_{\ell,j}(\mathbf{x}),\qquad g_{\ell,j}(\mathbf{x})=\frac{(\mathbf{w}_{\ell,j}^{\top}\mathbf{z}_{\ell-1}(\mathbf{x})+b_{\ell})^{2}}{\|\mathbf{z}_{\ell-1}(\mathbf{x})-\mathbf{w}_{\ell,j}\|^{2}+\varepsilon_{\ell}}.

Assume the uniform budgets

\|\mathbf{w}_{\ell,j}\|_{2}\leq W_{\ell},\qquad|b_{\ell}|\leq B_{\ell},\qquad 0<\varepsilon_{\min}\leq\varepsilon_{\ell}\leq\varepsilon_{\max},\qquad\sum_{j=1}^{m_{\ell}}|\alpha_{\ell,j,c}|\leq A_{\ell}.

Note $|b_{\ell}|\leq B_{\ell}$ (not $b_{\ell}\geq 0$ ): the Sobolev containment holds for any bounded real bias; $b_{\ell}\geq 0$ is required only for the Mercer/RKHS regime of the single-layer results elsewhere in the paper. The width $m_{\ell}$ enters only through $A_{\ell}$ ; if per-atom bounds $|\alpha_{\ell,j,c}|\leq a_{\ell}$ are assumed, take $A_{\ell}=m_{\ell}a_{\ell}$ .

J.3 Full Proof of Global Sobolev-RKHS Containment

Theorem 19 (Global ambient RKHS containment).

Let $X\subset\mathbb{R}^{d_{0}}$ be compact and $U\supset X$ a bounded Lipschitz domain. Fix $s>d_{0}/2$ and define $\mathcal{H}_{s}(X):=\{f|_{X}:f\in H^{s}(U)\}$ with the quotient norm. Under the budgets $\|\mathbf{w}_{\ell,j}\|\leq W_{\ell}$ , $|b_{\ell}|\leq B_{\ell}$ , $0<\varepsilon_{\min}\leq\varepsilon_{\ell}\leq\varepsilon_{\max}$ , $\sum_{j}|\alpha_{\ell,j,c}|\leq A_{\ell}$ , every coordinate $z_{\ell,c}|_{X}$ of every admissible Yat stack belongs to $\mathcal{H}_{s}(X)$ , with $\|z_{\ell,c}\|_{\mathcal{H}_{s}(X)}\leq M_{\ell}$ for finite constants $M_{0},\dots,M_{L}$ depending only on $X,U,s,L,\{d_{\ell},m_{\ell},A_{\ell},W_{\ell},B_{\ell}\},\varepsilon_{\min},\varepsilon_{\max}$ and not on the trained parameters. The reproducing kernel of $\mathcal{H}_{s}(X)$ is universal on $X$ , hence characteristic.

Proof.

We induct on $\ell$ . Write $C_{1}:=\|1\|_{H^{s}(U)}$ (finite since $U$ is bounded).

Base case ( $\ell=0$ ). Each coordinate $z_{0,i}(\mathbf{x})=x_{i}$ is a polynomial, hence in $H^{s}(U)$ , with $\|x_{i}\|_{H^{s}(U)}\leq M_{0}$ for some $M_{0}$ depending on $U$ and $s$ .

Inductive step. Assume $z_{\ell-1,i}\in H^{s}(U)$ and $\|z_{\ell-1,i}\|_{H^{s}(U)}\leq M_{\ell-1}$ for all $i=1,\dots,d_{\ell-1}$ .

Linear factor. $q_{\ell,j}:=\mathbf{w}_{\ell,j}^{\top}\mathbf{z}_{\ell-1}+b_{\ell}\in H^{s}(U)$ as a linear combination of $H^{s}$ functions plus a constant, with

\|q_{\ell,j}\|_{H^{s}(U)}\leq\sqrt{d_{\ell-1}}W_{\ell}M_{\ell-1}+B_{\ell}C_{1}=:Q_{\ell}(M_{\ell-1}).

By Lemma 8(1), $q_{\ell,j}^{2}\in H^{s}(U)$ with $\|q_{\ell,j}^{2}\|_{H^{s}(U)}\leq C_{\mathrm{alg}}Q_{\ell}(M_{\ell-1})^{2}$ .

Denominator. $r_{\ell,j}:=\|\mathbf{z}_{\ell-1}-\mathbf{w}_{\ell,j}\|^{2}+\varepsilon_{\ell}=\sum_{i}(z_{\ell-1,i}-w_{\ell,j,i})^{2}+\varepsilon_{\ell}\in H^{s}(U)$ . Algebra bounds give $\|r_{\ell,j}\|_{H^{s}(U)}\leq d_{\ell-1}C_{\mathrm{alg}}(M_{\ell-1}+W_{\ell}C_{1})^{2}+\varepsilon_{\max}C_{1}=:D_{\ell}(M_{\ell-1})$ . Pointwise, $r_{\ell,j}(\mathbf{x})\geq\varepsilon_{\min}>0$ . By Lemma 8(2), $1/r_{\ell,j}\in H^{s}(U)$ with $\|1/r_{\ell,j}\|_{H^{s}(U)}\leq\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1}))$ .

Atom. $g_{\ell,j}=q_{\ell,j}^{2}\cdot(1/r_{\ell,j})\in H^{s}(U)$ by another application of (1), with

\|g_{\ell,j}\|_{H^{s}(U)}\leq C_{\mathrm{alg}}^{2}Q_{\ell}(M_{\ell-1})^{2}\,\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1}))=:G_{\ell}(M_{\ell-1}).

Layer coordinate. $\|z_{\ell,c}\|_{H^{s}(U)}\leq\sum_{j}|\alpha_{\ell,j,c}|\|g_{\ell,j}\|_{H^{s}(U)}\leq A_{\ell}G_{\ell}(M_{\ell-1})$ . Set $M_{\ell}:=A_{\ell}G_{\ell}(M_{\ell-1})$ . Restricting from $U$ to $X$ and using the quotient norm gives $\|z_{\ell,c}|_{X}\|_{\mathcal{H}_{s}(X)}\leq\|z_{\ell,c}\|_{H^{s}(U)}\leq M_{\ell}$ , closing the induction. The constants $Q_{\ell},D_{\ell},G_{\ell},M_{\ell}$ depend only on the budgets and not on the trained parameters. ∎

Remark 6 (Growth of $M_{\ell}$ in depth $L$ is super-exponential).

The recurrence above defines a sequence of finite constants $M_{0},M_{1},\dots,M_{L}$ , but the dependence on depth is severe and worth recording explicitly. Each step of the induction is at least quadratic: the algebra bound $\|q^{2}\|_{H^{s}}\leq C_{\mathrm{alg}}Q_{\ell}^{2}$ contributes a square in $M_{\ell-1}$ , the denominator bound $D_{\ell}(M_{\ell-1})=O(M_{\ell-1}^{2})$ contributes another square, and the Moser-type inverse-stability factor $\Gamma_{s}(\varepsilon_{\min}^{-1},D_{\ell}(M_{\ell-1}))$ has, in the standard chain-rule decomposition for $f(t)=1/t$ on $t\geq\varepsilon_{\min}$ , polynomial dependence of degree $\lceil s\rceil$ on $\|r\|_{H^{s}}$ (with prefactors that absorb $\varepsilon_{\min}^{-(\lceil s\rceil+1)}$ from the $C^{k}$ -norms of $1/t$ on the interval $[\varepsilon_{\min},\infty)$ ). Composing through one Yat layer therefore takes $M_{\ell-1}\mapsto M_{\ell}$ via a polynomial of degree at least $2\lceil s\rceil+2$ , and iterating $L$ times gives

M_{L}\;=\;O\!\bigl(M_{0}^{(2\lceil s\rceil+2)^{L}}\bigr)\;\leq\;O\!\bigl(M_{0}^{e^{O(L\log s)}}\bigr),

i.e., a tower-of-twos blow-up in depth. Theorem 19 should therefore be read as a qualitative containment statement ( $z_{\ell,c}|_{X}\in\mathcal{H}_{s}(X)$ for every admissible Yat stack and every $\ell\leq L$ ), not as a quantitatively useful capacity certificate at any practical depth. Improving the depth dependence of $M_{L}$ — to polynomial-in- $L$ , or even exponential-in- $L$ — would require either restricting the activation atom to exponential-decay-friendly functions (Gaussian RBF, in place of the rational $g_{\varepsilon}$ ) or using a different ambient class (e.g., a depth-aware Besov scale) that absorbs the chain-rule blow-up. Both routes are out of scope here. The theorem we record is the parameter-independent containment of bounded-budget Yat stacks in a fixed universal RKHS; the constants are not the load-bearing content.

Corollary 9 (Universality and characteristicness of $\mathcal{H}_{s}(X)$ ).

$\mathcal{H}_{s}(X)$ is universal on $X$ , hence characteristic.

Proof.

Every polynomial restricted to $U$ belongs to $H^{s}(U)$ (polynomials are smooth on bounded $U$ ), and hence its restriction to $X$ belongs to $\mathcal{H}_{s}(X)$ . By Stone–Weierstrass, polynomials are dense in $C(X)$ . Characteristicness follows from universality on compact $X$ [Sriperumbudur et al., 2011]. ∎

Remark 7 (What this theorem does and does not give).

Theorem 19 is an ambient containment result. It does not give an exact deep Yat-Gram norm: the identity $\|\sum_{j}\boldsymbol{\alpha}_{j}k_{b,\varepsilon}(\mathbf{w}_{j},\cdot)\|_{\mathcal{H}_{b,\varepsilon}}^{2}=\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ is a single-layer statement. For a depth- $L$ stack the induced kernel on the original input space is the prefix-dependent pullback $\widetilde{k}_{\ell}(\mathbf{x},\mathbf{x}^{\prime})=k_{\ell}(\mathbf{z}_{\ell-1}(\mathbf{x}),\mathbf{z}_{\ell-1}(\mathbf{x}^{\prime}))$ ; the global Sobolev RKHS $\mathcal{H}_{s}(X)$ avoids this prefix-dependence at the cost of giving only a coarse ambient norm bound.

Remark 8 (Sign of bias and Mercer regime).

The proof requires only $|b_{\ell}|\leq B_{\ell}$ ; the sign of $b_{\ell}$ is irrelevant for Sobolev containment. The condition $b_{\ell}\geq 0$ is required only for the Mercer/RKHS structure of the single-layer Yat kernel (Theorem 1). In the common case where a single shared bias $b_{\ell}=b\geq 0$ is used, the theorem applies with $B_{\ell}=b$ .

Appendix K Spectral Structure on the Sphere

The Mercer eigenvalues of $k_{0,\varepsilon}$ on a generic compact domain are governed by Sobolev regularity of the IMQ factor and decay polynomially. On the unit sphere $\mathbb{S}^{d-1}$ , the kernel reduces to a zonal function whose Funk–Hecke decomposition yields an exponential eigenvalue decay rate, sharper than the polynomial Sobolev bound. The sphere is the operationally relevant subdomain in feature-normalised practice (layer-norm, $\ell_{2}$ -normalised CLIP embeddings) and the analytically cleanest setting: $\|\mathbf{u}-\mathbf{v}\|^{2}=2-2\mathbf{u}^{\top}\mathbf{v}$ , so the Yat kernel becomes a univariate function of the inner product $t=\mathbf{u}^{\top}\mathbf{v}\in[-1,1]$ , and the Funk–Hecke formula [Wendland, 2004, Sec. 10.4] diagonalises the integral operator in spherical-harmonic sectors. We work with the unbiased $k_{0,\varepsilon}$ ; the biased $k_{b,\varepsilon}$ for $b>0$ adds two terms $2b\,t/D_{\varepsilon}$ and $b^{2}h_{\varepsilon}$ that share the same off-interval pole at $t_{\star}$ , so the asymptotic decay rate $\rho_{\star}^{-\ell}$ below is unchanged and only the prefactors and low-frequency content shift. Downstream consequences for fast generalization rates appear in Appendix L.

For $\mathbf{u},\mathbf{v}\in\mathbb{S}^{d-1}$ , the unbiased Yat kernel restricts to the zonal function

k_{\text{\tifinaghfont ⵟ}}(\mathbf{u},\mathbf{v})\;=\;\kappa(\mathbf{u}^{\top}\mathbf{v}),\qquad\kappa(t)\;:=\;\frac{t^{2}}{\varepsilon+2-2t},\qquad t\in[-1,1],

(11)

which extends meromorphically to $\mathbb{C}$ with a single simple pole at $t_{\star}:=1+\varepsilon/2>1$ and residue $-(t_{\star})^{2}/2\neq 0$ .

Theorem 20 (Funk–Hecke spectrum of $k_{\text{\tifinaghfont ⵟ}}$ on the sphere).

The integral operator $T_{k_{\text{\tifinaghfont ⵟ}}}:L^{2}(\mathbb{S}^{d-1})\to L^{2}(\mathbb{S}^{d-1})$ associated with the zonal kernel (11) admits the spectral decomposition $T_{k_{\text{\tifinaghfont ⵟ}}}=\sum_{\ell\geq 0}\lambda_{\ell}\,\mathbf{P}_{\ell}$ , where $\mathbf{P}_{\ell}$ is orthogonal projection onto the space of degree- $\ell$ spherical harmonics (multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$ ), and the eigenvalues are given by the Funk–Hecke formula

\lambda_{\ell}\;=\;\frac{|\mathbb{S}^{d-2}|}{C_{\ell}^{(d-2)/2}(1)}\int_{-1}^{1}\kappa(t)\,C_{\ell}^{(d-2)/2}(t)\,(1-t^{2})^{(d-3)/2}\,dt,

with $C_{\ell}^{(d-2)/2}$ the Gegenbauer polynomial of index $(d-2)/2$ .

Proof.

$\kappa$ is continuous on $[-1,1]$ (the pole $t_{\star}=1+\varepsilon/2$ lies strictly outside the closed interval since $\varepsilon>0$ ), so the kernel $(\mathbf{u},\mathbf{v})\mapsto\kappa(\mathbf{u}^{\top}\mathbf{v})$ is bounded and continuous on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$ . The associated operator $T_{k_{\text{\tifinaghfont ⵟ}}}$ is therefore Hilbert–Schmidt, hence compact and self-adjoint on $L^{2}(\mathbb{S}^{d-1})$ . Spherical harmonics simultaneously diagonalise every zonal Hilbert–Schmidt operator by the Funk–Hecke formula [Wendland, 2004, Sec. 10.4], and the displayed integral is the standard Gegenbauer expansion coefficient on the eigenspace of degree $\ell$ . ∎

Theorem 21 (Exponential decay rate).

The Funk–Hecke eigenvalues of $k_{\text{\tifinaghfont ⵟ}}$ on $\mathbb{S}^{d-1}$ satisfy $\lambda_{\ell}=\Theta(\rho_{\star}^{-\ell})$ as $\ell\to\infty$ , with

\rho_{\star}\;:=\;1+\frac{\varepsilon}{2}+\sqrt{\varepsilon+\frac{\varepsilon^{2}}{4}}\;=\;1+\sqrt{\varepsilon}\,\bigl(1+O(\sqrt{\varepsilon})\bigr).

Proof.

Strategy. The Funk–Hecke expansion of $\kappa(\mathbf{u}^{\top}\mathbf{w})$ in Gegenbauer polynomials gives $\kappa(t)=\sum_{\ell\geq 0}\lambda_{\ell}\,c_{\ell}^{\nu}\,C_{\ell}^{(\nu)}(t)$ , where $\lambda_{\ell}$ is the Mercer eigenvalue and $C_{\ell}^{(\nu)}$ is the Gegenbauer polynomial of degree $\ell$ . The asymptotic decay rate $\lambda_{\ell}\sim\rho_{\star}^{-\ell}$ for simple-pole kernels $\kappa(t)=-A/(t-t_{\star})$ is recovered by the standard residue argument: the closest singularity of $\kappa$ in the complex $t$ -plane to the interval $[-1,1]$ controls the radius of convergence of the Gegenbauer expansion via the standard Bernstein-ellipse / pole-distance correspondence [Wendland, 2004, Ch. 12]. We do not extract $\lambda_{\ell}$ from the Gegenbauer generating function $(1-2tz+z^{2})^{-\nu}$ directly; rather, we use the generating function only to identify the relevant ellipse parameter $\rho_{\star}$ . For $d=2$ ( $\nu=0$ ), the Gegenbauer polynomials reduce to Chebyshev polynomials of the first kind and the generating function takes the logarithmic form $-\log(1-2tz+z^{2})/2=\sum_{\ell\geq 1}T_{\ell}(t)z^{\ell}/\ell$ ; the same Bernstein-ellipse argument applies, with the asymptotic decay rate unchanged.

Joukowski substitution. The Bernstein ellipse $E_{\rho}\subset\mathbb{C}$ with foci $\pm 1$ is the image of $\{|z|=\rho\}\subset\mathbb{C}$ under the Joukowski map $z\mapsto t(z):=(z+z^{-1})/2$ for $\rho>1$ . The map $t(z)$ is a conformal bijection from $\{|z|>1\}$ onto $\mathbb{C}\setminus[-1,1]$ , with the inverse selecting the branch $z(t)=t+\sqrt{t^{2}-1}$ for real $t>1$ (so $|z|>1$ ). Under this substitution, a real point $t_{\star}>1$ corresponds to $z_{\star}=t_{\star}+\sqrt{t_{\star}^{2}-1}>1$ (the outer pre-image), and equivalently to $z_{\star}^{-1}=t_{\star}-\sqrt{t_{\star}^{2}-1}<1$ (the inner pre-image inside the unit disk). Substituting $t_{\star}=1+\varepsilon/2$ ,

\rho_{\star}\;:=\;z_{\star}\;=\;\bigl(1+\tfrac{\varepsilon}{2}\bigr)+\sqrt{\bigl(1+\tfrac{\varepsilon}{2}\bigr)^{2}-1}\;=\;1+\tfrac{\varepsilon}{2}+\sqrt{\varepsilon+\tfrac{\varepsilon^{2}}{4}},

and the inner pre-image $z_{\star}^{-1}=1/\rho_{\star}$ is the corresponding singularity location in the Gegenbauer-generating-function variable.

Upper bound. For every $1<\rho<\rho_{\star}$ , $\kappa$ is analytic on the open ellipse $E_{\rho}$ and continuous up to its boundary (the pole $t_{\star}$ lies strictly outside $E_{\rho}$ ). Bernstein’s theorem on Gegenbauer expansions [Wendland, 2004, Ch. 12] gives $\limsup_{\ell\to\infty}|\lambda_{\ell}|^{1/\ell}\leq 1/\rho$ . Letting $\rho\uparrow\rho_{\star}$ yields $|\lambda_{\ell}|=O(\rho_{\star}^{-\ell+\eta})$ for every $\eta>0$ .

Matching lower bound. Decompose

\kappa(t)\;=\;-\frac{A}{t-t_{\star}}+g(t),\qquad A:=\frac{(t_{\star})^{2}}{2}>0.

A direct computation gives $g(t)=-(t+t_{\star})/2$ , a polynomial of degree $1$ , so the only Bernstein-ellipse-bounded contributions to the Gegenbauer coefficients of $\kappa$ for $\ell\geq 2$ come from the simple-pole term. The Gegenbauer generating identity is

\frac{1}{(1-2tz+z^{2})^{\nu}}=\sum_{\ell\geq 0}C_{\ell}^{(\nu)}(t)\,z^{\ell},\qquad\nu=(d-2)/2,\quad|z|<\min\!\bigl(z_{\star},z_{\star}^{-1}\bigr)=1/\rho_{\star}.

The denominator factors as $(1-2tz+z^{2})=(1-z\,z_{\star}^{-1})(1-z\,z_{\star})$ when $t=(z_{\star}+z_{\star}^{-1})/2$ , so the simple pole at $t=t_{\star}$ in $t$ -coordinates pulls back to a simple pole at $z=z_{\star}^{-1}=1/\rho_{\star}$ in the generating-function variable. Standard residue analysis at $z=1/\rho_{\star}$ then gives the Gegenbauer coefficients of $-A/(t-t_{\star})$ as $-A\,b_{\ell}^{(d)}\rho_{\star}^{-\ell}$ with $b_{\ell}^{(d)}=\Theta(1)$ (the Gegenbauer-weight residue evaluated at $z_{\star}^{-1}$ , which converges to a positive constant as $\ell\to\infty$ ). Hence $|\lambda_{\ell}|=\Omega(\rho_{\star}^{-\ell})$ . ∎

Corollary 10 (Logarithmic effective dimension on $\mathbb{S}^{d-1}$ ).

The effective dimension of $k_{\text{\tifinaghfont ⵟ}}$ on $\mathbb{S}^{d-1}$ satisfies

\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)\;:=\;\operatorname{tr}\!\bigl(T_{k_{\text{\tifinaghfont ⵟ}}}(T_{k_{\text{\tifinaghfont ⵟ}}}+\lambda I)^{-1}\bigr)\;=\;\Theta\!\bigl((\log(1/\lambda))^{d-1}\bigr)\qquad(\lambda\to 0^{+}).

Proof.

Set $\ell_{\lambda}:=\lfloor\log(1/\lambda)/\log\rho_{\star}\rfloor$ . Each eigenvalue $\lambda_{\ell}$ has multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$ , so $\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)=\sum_{\ell\geq 0}N(\ell,d)\,\lambda_{\ell}/(\lambda_{\ell}+\lambda)$ . By Theorem 21, $\lambda_{\ell}\geq\lambda$ exactly when $\ell\leq\ell_{\lambda}+O(1)$ . The head $\sum_{\ell\leq\ell_{\lambda}}N(\ell,d)\cdot\Theta(1)=\Theta(\ell_{\lambda}^{d-1})$ from $\sum_{\ell\leq L}\ell^{d-2}=\Theta(L^{d-1})$ . The tail $\sum_{\ell>\ell_{\lambda}}N(\ell,d)\rho_{\star}^{-\ell}/\lambda=O(\ell_{\lambda}^{d-2})$ by geometric summation. Adding gives the displayed rate. ∎

Remark 9 (Comparison with classical kernels and downstream rate).

The polynomial kernel $(\mathbf{u}^{\top}\mathbf{v})^{2}$ has rank $3$ on $\mathbb{S}^{d-1}$ (only $\lambda_{0},\lambda_{1},\lambda_{2}$ are non-zero); the IMQ kernel $h_{\varepsilon}$ shares the singularity at $t_{\star}$ on the spherical reduction, hence the same exponential rate $\rho_{\star}^{-\ell}$ ; the Gaussian RBF has super-exponential decay $O(C^{\ell}/\ell!)$ . The unbiased Yat kernel on the sphere therefore inherits the asymptotic IMQ rate, while the polynomial numerator $t^{2}$ alters only the prefactor and contributes finite-rank low-frequency content. The exponential decay drives a near-parametric $\widetilde{O}(n^{-1})$ rate for both ERM and KRR on the sphere; see Appendix L.

Appendix L Fast Generalization Rates via Eigenvalue Decay

The single-layer Rademacher bound of Theorem 6 is the worst-case rate $O(n^{-1/2})$ , optimal only when no spectral information is available. Mercer eigenvalue decay yields fast rates via local Rademacher complexity [Bartlett et al., 2005] for empirical-risk minimisation, and via Caponnetto–De Vito bias-variance bounds [Caponnetto and De Vito, 2007, Steinwart et al., 2009] for kernel ridge regression. On the sphere, the exponential decay of Theorem 21 drives both routes to a near-parametric $\widetilde{O}(n^{-1})$ rate.

L.1 Local Rademacher upper bound for ERM

Theorem 22 (Fast rate via Mercer eigenvalue decay).

Let $\mathcal{X}\subset\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ be compact with $\sup_{\mathcal{X}}\|\mathbf{x}\|\leq R$ , and let $\rho$ be a Borel probability measure on $\mathcal{X}$ . Suppose the Mercer eigenvalues $\{\mu_{n}\}_{n\geq 1}$ of $T_{k_{0,\varepsilon}}$ on $L^{2}(\rho)$ satisfy $\mu_{n}\leq c_{0}\,n^{-\alpha}$ for some $\alpha>1$ . Under realisability $f^{\star}\in B_{\mathcal{H}_{0,\varepsilon}}(B):=\{f\in\mathcal{H}_{0,\varepsilon}:\|f\|\leq B\}$ and i.i.d. samples with conditionally sub-Gaussian noise of variance proxy $\sigma^{2}$ , the empirical-risk minimiser $\hat{f}_{n}$ over the ball satisfies

\mathbb{E}\bigl[R(\hat{f}_{n})-R(f^{\star})\bigr]\;\leq\;C\,\Bigl(\frac{B^{2}R^{4}}{\varepsilon\,n}\Bigr)^{\alpha/(\alpha+1)},

for an absolute constant $C>0$ .

Proof.

By the diagonal of $k_{\text{\tifinaghfont ⵟ}}$ , $\sup_{\mathcal{X}}k_{\text{\tifinaghfont ⵟ}}(\mathbf{x},\mathbf{x})=R^{4}/\varepsilon$ (the unbiased section is the $b=0$ case of the diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ used in Theorem 6). Apply Bartlett et al. [2005, Theorem 3.3] to the function class $B_{\mathcal{H}_{0,\varepsilon}}(B)$ : the local Rademacher complexity at radius $r$ is bounded by $\bar{\psi}(r)\leq\sqrt{(2/n)\sum_{n\geq 1}\min(r,B^{2}\mu_{n})}$ , and the fixed point $r^{\star}$ of the inequality $r=\bar{\psi}(r)$ satisfies $r^{\star}\leq C^{\prime}(B^{2}R^{4}/(\varepsilon n))^{\alpha/(\alpha+1)}$ under polynomial decay $\mu_{n}\leq c_{0}n^{-\alpha}$ . The local Rademacher excess-risk bound yields the displayed expectation inequality. ∎

Corollary 11 (Near-parametric ERM rate on the sphere).

For $\mathcal{X}=\mathbb{S}^{d-1}$ and $\rho$ the uniform surface measure, Theorem 21 gives $\mu_{\ell}=\Theta(\rho_{\star}^{-\ell})$ with multiplicity $N(\ell,d)=\Theta(\ell^{d-2})$ . The local Rademacher fixed point becomes $r^{\star}=\Theta((\log n)^{d-1}/n)$ , yielding

\mathbb{E}\bigl[R(\hat{f}_{n})-R(f^{\star})\bigr]\;=\;\widetilde{O}(n^{-1}),

where $\widetilde{O}$ hides factors polynomial in $\log n$ .

Proof.

On $\mathbb{S}^{d-1}$ the kernel diagonal is $1/\varepsilon$ . Substituting the Funk–Hecke spectrum into the local Rademacher fixed point, $\sum_{n\geq 1}\min(r,B^{2}\mu_{n})=\sum_{\ell\geq 0}N(\ell,d)\min(r,B^{2}\rho_{\star}^{-\ell})$ . The crossover index $\ell_{\star}$ is determined by $\rho_{\star}^{-\ell_{\star}}=r/B^{2}$ , giving $\ell_{\star}=\Theta(\log(B^{2}/r))$ . The head $\ell\leq\ell_{\star}$ contributes $\Theta(r\,\ell_{\star}^{d-1})$ , the tail $\Theta(\ell_{\star}^{d-2})$ times the same scale, hence $\bar{\psi}(r)\leq\sqrt{(C/n)\,r\,(\log(B^{2}/r))^{d-1}}$ . The fixed-point equation $r=\bar{\psi}(r)$ self-consistently gives $r^{\star}=\Theta((\log n)^{d-1}/n)$ . ∎

L.2 Refined KRR bias-variance bound

Theorem 23 (Refined Yat KRR excess risk).

Under the setup above with conditionally sub-Gaussian noise of variance proxy $\sigma^{2}$ and source condition $f^{\star}\in\mathcal{H}_{0,\varepsilon}$ , the kernel ridge regression estimator $\hat{f}_{\lambda}$ with regularisation $\lambda>0$ satisfies the bias-variance bound

\mathbb{E}\bigl\|\hat{f}_{\lambda}-f^{\star}\bigr\|_{L^{2}(\rho)}^{2}\;\leq\;\lambda\,\|f^{\star}\|_{\mathcal{H}_{0,\varepsilon}}^{2}\;+\;\frac{\sigma^{2}}{n}\,\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)\;+\;\frac{C\,\sigma^{2}R^{4}/\varepsilon}{n^{2}\lambda},

where $\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda):=\operatorname{tr}\!\bigl(T_{k_{0,\varepsilon}}(T_{k_{0,\varepsilon}}+\lambda I)^{-1}\bigr)$ is the Yat effective dimension.

Proof.

The bounded diagonal $\kappa^{2}:=\sup_{\mathcal{X}}k_{\text{\tifinaghfont ⵟ}}(\mathbf{x},\mathbf{x})=R^{4}/\varepsilon$ together with the source condition $f^{\star}\in\mathcal{H}_{0,\varepsilon}$ place us inside the bias-variance template of Caponnetto and De Vito [2007] (see also Steinwart et al., 2009). Their Theorem 1 yields, with probability at least $1-\delta$ ,

\|\hat{f}_{\lambda}-f^{\star}\|_{L^{2}(\rho)}^{2}\;\leq\;C_{1}\,\lambda\|f^{\star}\|_{\mathcal{H}_{k}}^{2}\;+\;C_{2}\,\frac{\sigma^{2}\,\mathcal{N}_{k}(\lambda)}{n}\;+\;C_{3}\,\frac{\sigma^{2}\kappa^{2}}{n^{2}\lambda}\log^{2}(1/\delta),

where the leading variance term $\mathcal{N}_{k}(\lambda)/n$ is scale-invariant under $k\to ck$ and the lower-order term $\kappa^{2}/(n^{2}\lambda)$ tracks the kernel scale. Substituting $\kappa^{2}=R^{4}/\varepsilon$ for $k=k_{0,\varepsilon}$ and absorbing the high-probability $\log$ factor into the constants gives the displayed expectation bound. ∎

Theorem 24 (Near-parametric KRR rate on the sphere).

Let $\mathcal{X}=\mathbb{S}^{d-1}$ with $\rho$ the uniform surface measure, and choose $\lambda^{\star}=\Theta((\log n)^{d-1}/n)$ . Then $\mathbb{E}\|\hat{f}_{\lambda^{\star}}-f^{\star}\|_{L^{2}(\rho)}^{2}=\widetilde{O}(n^{-1})$ , with logarithmic factor $(\log n)^{d-1}$ .

Proof.

On $\mathbb{S}^{d-1}$ , $R=1$ , so the variance constant is $\sigma^{2}/\varepsilon$ . By Corollary 10, $\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)=\Theta((\log(1/\lambda))^{d-1})$ . Theorem 23 gives

\mathbb{E}\|\hat{f}_{\lambda}-f^{\star}\|^{2}\;\leq\;\lambda\|f^{\star}\|_{\mathcal{H}_{0,\varepsilon}}^{2}+\frac{C}{\varepsilon n}\,(\log(1/\lambda))^{d-1}.

Optimising in $\lambda$ , bias and variance balance at $\lambda=\Theta((\log(1/\lambda))^{d-1}/n)$ with self-consistent solution $\lambda^{\star}=\Theta((\log n)^{d-1}/n)$ . Substituting gives the displayed rate. ∎

Remark 10 (Comparison with Sobolev RKHS).

A Sobolev- $s$ RKHS on $\mathbb{S}^{d-1}$ with $s>(d-1)/2$ has KRR rate $O(n^{-2s/(2s+d-1)})$ , sub-parametric in $d$ [Caponnetto and De Vito, 2007]. The Yat kernel’s analytic profile gives the strictly faster $\widetilde{O}(n^{-1})$ rate uniformly in $s$ for targets in $\mathcal{H}_{0,\varepsilon}$ , which is a strictly smaller class than any Sobolev space of the same domain: the comparison reflects the higher analytic regularity of the Yat native space rather than a uniform improvement on a fixed function class.

Remark 11 (ERM and KRR are companion routes).

Corollary 11 (norm-constrained ERM via local Rademacher complexity) and Theorem 24 (KRR via Caponnetto–De Vito) deliver the same $\widetilde{O}(n^{-1})$ rate via different estimators; both rest on the exponential eigenvalue decay of Theorem 21.

Remark 12 (Reading the variance constant).

The leading variance term in Theorem 23 is $\sigma^{2}\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)/n$ , scale-invariant under $k\to ck$ ; the kernel diagonal $\kappa^{2}=R^{4}/\varepsilon$ enters only as a polynomial prefactor of the lower-order $1/(n^{2}\lambda)$ remainder. A data-dependent refinement replacing $R^{4}$ with the empirical fourth moment in the remainder term would require a separate matrix-Bernstein argument and is not implied by Theorem 23.

Appendix M Quantitative MMD and Two-Sample Testing

Corollary 1 establishes that $k_{b,\varepsilon}$ is characteristic on every compact $\mathcal{X}$ for $b>0$ : the kernel mean embedding $\mu\mapsto\int k_{b,\varepsilon}(\cdot,\mathbf{x})\,d\mu$ is injective on Borel probability measures. We strengthen this qualitative property to a quantitative sample-complexity statement for two-sample testing, with the kernel-specific constant $(R^{2}+b)^{2}/\varepsilon$ on the diagonal.

Theorem 25 (Empirical MMD convergence rate for $k_{b,\varepsilon}$ ).

Let $\mu,\nu$ be Borel probability measures on a compact set $\mathcal{X}\subset\mathbb{R}^{d}$ with $\sup_{\mathbf{x}\in\mathcal{X}}\|\mathbf{x}\|\leq R$ , and let $\mathbf{X}_{1},\dots,\mathbf{X}_{n}$ be i.i.d. samples from $\mu$ and $\mathbf{Y}_{1},\dots,\mathbf{Y}_{n}$ i.i.d. from $\nu$ , all mutually independent. The unbiased U-statistic estimator

\widehat{\mathrm{MMD}}^{2}\;:=\;\frac{1}{n(n-1)}\sum_{i\neq j}k_{b,\varepsilon}(\mathbf{X}_{i},\mathbf{X}_{j})+\frac{1}{n(n-1)}\sum_{i\neq j}k_{b,\varepsilon}(\mathbf{Y}_{i},\mathbf{Y}_{j})-\frac{2}{n^{2}}\sum_{i,j}k_{b,\varepsilon}(\mathbf{X}_{i},\mathbf{Y}_{j})

satisfies

\mathbb{E}\bigl[\bigl(\widehat{\mathrm{MMD}}^{2}-\mathrm{MMD}^{2}\bigr)^{2}\bigr]\;\leq\;\frac{C\,(R^{2}+b)^{4}}{\varepsilon^{2}\,n}

for an absolute constant $C>0$ , hence $|\widehat{\mathrm{MMD}}^{2}-\mathrm{MMD}^{2}|=O_{p}((R^{2}+b)^{2}/(\varepsilon\sqrt{n}))$ .

Proof.

$\widehat{\mathrm{MMD}}^{2}$ is the standard unbiased U-statistic estimator of $\mathrm{MMD}^{2}$ [Gretton et al., 2012]. By Proposition 4, $\sup_{\mathbf{x},\mathbf{w}\in\mathcal{X}}|k_{b,\varepsilon}(\mathbf{x},\mathbf{w})|\leq(R^{2}+b)^{2}/\varepsilon$ on the closed ball of radius $R$ . The standard variance bound for a degree- $2$ U-statistic with bounded kernel gives $\mathrm{Var}(\widehat{\mathrm{MMD}}^{2})\leq C\sup|k|^{2}/n=C(R^{2}+b)^{4}/(\varepsilon^{2}n)$ , so the mean-squared error is bounded as displayed. ∎

Corollary 12 (Sample complexity for kernel two-sample testing).

Fix significance level $\beta\in(0,1)$ and threshold $\eta>0$ . The test that rejects $\mu=\nu$ when $\widehat{\mathrm{MMD}}^{2}>\eta$ has Type-I error $\leq\beta$ on the null and power $\geq 1-\beta$ on alternatives with $\mathrm{MMD}^{2}\geq 2\eta$ , provided

n\;\geq\;\frac{C\,(R^{2}+b)^{4}}{\varepsilon^{2}\,\eta^{2}\,\beta}.

Proof.

Apply Theorem 25 with Chebyshev’s inequality on both null ( $\mathrm{MMD}^{2}=0$ , deviation $\eta$ ) and alternative ( $\mathrm{MMD}^{2}\geq 2\eta$ , deviation $\eta$ ) regimes; the deviation probability is bounded by $C(R^{2}+b)^{4}/(\varepsilon^{2}\eta^{2}n)\leq\beta$ under the displayed sample-size lower bound. ∎

Remark 13 (Trade-off in $\varepsilon$ and $b$ ).

The sample complexity scales as $1/\varepsilon^{2}$ and as $(R^{2}+b)^{4}$ : a smaller $\varepsilon$ or a larger $b$ inflates the diagonal $(R^{2}+b)^{2}/\varepsilon$ and thereby the variance constant of the U-statistic. The same diagonal drives the Rademacher constant in Theorem 6, while in Theorem 23 the kernel diagonal enters only as a prefactor on the lower-order remainder, with the leading variance term controlled by the effective dimension $\mathcal{N}_{\text{\tifinaghfont ⵟ}}(\lambda)$ alone.

Appendix N The Yat Neural Tangent Kernel at Infinite Width

The biased atom $g_{\varepsilon}(\cdot;\mathbf{w},b)$ is the natural neural unit attached to $k_{b,\varepsilon}$ . We compute the infinite-width Neural Tangent Kernel [Jacot et al., 2018, Arora et al., 2019] of a width- $m$ shared-bias Yat layer and show that the limit is itself dominated by an IMQ kernel via the bias decomposition, giving universality on every compact $\mathcal{X}$ for $b>0$ .

Setup.

Fix $b\geq 0$ , $\varepsilon>0$ , $\sigma_{w}>0$ . We adopt the standard NTK parametrisation [Jacot et al., 2018]: the width- $m$ shared- $(b,\varepsilon)$ Yat layer is

f_{m}(\mathbf{x};\boldsymbol{\theta})\;:=\;\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\alpha_{j}\,g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b),\qquad\boldsymbol{\theta}=(\mathbf{w}_{1},\dots,\mathbf{w}_{m},\alpha_{1},\dots,\alpha_{m}),

with i.i.d. initialisation $\mathbf{w}_{j}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})$ and $\alpha_{j}\sim\mathcal{N}(0,1)$ . The empirical NTK is

\Theta_{m}(\mathbf{x},\mathbf{x}^{\prime})\;:=\;\sum_{j=1}^{m}\Bigl[\bigl\langle\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}),\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}^{\prime})\bigr\rangle+\partial_{\alpha_{j}}f_{m}(\mathbf{x})\,\partial_{\alpha_{j}}f_{m}(\mathbf{x}^{\prime})\Bigr].

Theorem 26 (Yat NTK closed form).

As $m\to\infty$ , $\Theta_{m}(\mathbf{x},\mathbf{x}^{\prime})\to\Theta(\mathbf{x},\mathbf{x}^{\prime})$ in probability on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ , with

\Theta(\mathbf{x},\mathbf{x}^{\prime})\;=\;\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})+\Theta_{w}(\mathbf{x},\mathbf{x}^{\prime}),

\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\;:=\;\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})}\!\bigl[g_{\varepsilon}(\mathbf{x};\mathbf{w},b)\,g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w},b)\bigr],

\Theta_{w}(\mathbf{x},\mathbf{x}^{\prime})\;:=\;\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})}\!\bigl\langle\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w},b),\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w},b)\bigr\rangle.

Both summands are continuous PSD kernels on $\mathcal{X}\times\mathcal{X}$ , hence so is $\Theta$ .

Proof.

Under the NTK parametrisation, $\partial_{\alpha_{j}}f_{m}(\mathbf{x})=g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)/\sqrt{m}$ , so

\sum_{j=1}^{m}\partial_{\alpha_{j}}f_{m}(\mathbf{x})\,\partial_{\alpha_{j}}f_{m}(\mathbf{x}^{\prime})\;=\;\frac{1}{m}\sum_{j=1}^{m}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)\,g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w}_{j},b).

The summands are i.i.d. with mean $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})$ and bounded by Proposition 4 on compact $\mathcal{X}$ ; the law of large numbers gives convergence in probability to $\Theta_{\alpha}$ . Similarly, $\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x})=\alpha_{j}\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b)/\sqrt{m}$ , hence

\sum_{j=1}^{m}\bigl\langle\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}),\partial_{\mathbf{w}_{j}}f_{m}(\mathbf{x}^{\prime})\bigr\rangle\;=\;\frac{1}{m}\sum_{j=1}^{m}\alpha_{j}^{2}\,\bigl\langle\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x};\mathbf{w}_{j},b),\nabla_{\mathbf{w}}g_{\varepsilon}(\mathbf{x}^{\prime};\mathbf{w}_{j},b)\bigr\rangle.

By independence of $\alpha_{j}$ and $\mathbf{w}_{j}$ together with $\mathbb{E}\alpha_{j}^{2}=1$ , the law of large numbers gives convergence in probability to $\Theta_{w}$ . PSD of both summands follows from the Gram representation: $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})=\langle g_{\varepsilon}(\mathbf{x};\cdot,b),g_{\varepsilon}(\mathbf{x}^{\prime};\cdot,b)\rangle_{L^{2}(\mu_{w})}$ where $\mu_{w}$ is the Gaussian measure, and $\Theta_{w}$ is the same statement applied coordinate-wise to the gradient. Continuity follows from continuity of $g_{\varepsilon}$ and $\nabla_{\mathbf{w}}g_{\varepsilon}$ on compact sets together with dominated convergence under $\mu_{w}$ . ∎

Corollary 13 (Universality of the Yat NTK for $b>0$ ).

For every $b>0$ , the Yat NTK $\Theta$ is universal on every compact $\mathcal{X}\subset\mathbb{R}^{d}$ : the RKHS $\mathcal{H}_{\Theta}|_{\mathcal{X}}$ is dense in $C(\mathcal{X})$ in the sup norm.

Proof.

Step 1 (Gram representation of $\Theta_{\alpha}$ ). The proof of Theorem 26 established

\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\;=\;\bigl\langle g_{\varepsilon}(\mathbf{x};\,\cdot\,,b),\,g_{\varepsilon}(\mathbf{x}^{\prime};\,\cdot\,,b)\bigr\rangle_{L^{2}(\mu_{w})},

where $\mu_{w}=\mathcal{N}(\mathbf{0},\sigma_{w}^{2}I_{d})$ . The map $\Phi:\mathbf{x}\mapsto g_{\varepsilon}(\mathbf{x};\,\cdot\,,b)$ is a continuous feature map from $\mathcal{X}$ into $L^{2}(\mu_{w})$ (continuity follows from continuity of $g_{\varepsilon}$ in $\mathbf{x}$ and the Proposition 4 bound, which makes the family uniformly $L^{2}$ -integrable on compact $\mathcal{X}$ ).

Step 2 ( $\Theta_{\alpha}$ is integrally strictly PD on $\mathcal{X}$ ). Let $\mu$ be a finite non-zero signed Borel measure on $\mathcal{X}$ . By Tonelli,

\int_{\mathcal{X}}\!\int_{\mathcal{X}}\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})\,d\mu(\mathbf{x})\,d\mu(\mathbf{x}^{\prime})\;=\;\int_{\mathbb{R}^{d}}F(\mathbf{w})^{2}\,d\mu_{w}(\mathbf{w}),\qquad F(\mathbf{w}):=\int_{\mathcal{X}}g_{\varepsilon}(\mathbf{x};\mathbf{w},b)\,d\mu(\mathbf{x}).

Suppose this integral vanishes. Then $F\equiv 0$ for $\mu_{w}$ -almost every $\mathbf{w}\in\mathbb{R}^{d}$ ; continuity of $F$ in $\mathbf{w}$ (dominated convergence) and full support of $\mu_{w}$ extend this to $F\equiv 0$ on $\mathbb{R}^{d}$ . In particular, $F|_{\mathcal{X}}\equiv 0$ . But $F|_{\mathcal{X}}$ is precisely the kernel mean embedding of $\mu$ under $k_{b,\varepsilon}$ (since $g_{\varepsilon}(\mathbf{x};\mathbf{w},b)=k_{b,\varepsilon}(\mathbf{w},\mathbf{x})$ ). For $b>0$ , Corollary 1(i) states that $k_{b,\varepsilon}$ is characteristic on $\mathcal{X}$ in the strong sense that the kernel mean embedding is injective on finite signed Borel measures on $\mathcal{X}$ . Hence $F|_{\mathcal{X}}\equiv 0$ forces $\mu=0$ , a contradiction. Therefore $\Theta_{\alpha}$ is integrally strictly PD on $\mathcal{X}$ .

Step 3 (Density of $\mathcal{H}_{\Theta_{\alpha}}$ in $C(\mathcal{X})$ via duality). A signed Borel measure $\mu$ on $\mathcal{X}$ annihilates $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ iff $\int_{\mathcal{X}}f\,d\mu=0$ for every $f\in\mathcal{H}_{\Theta_{\alpha}}$ . By the reproducing property, $\int_{\mathcal{X}}f\,d\mu=\bigl\langle f,\int_{\mathcal{X}}\Theta_{\alpha}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\bigr\rangle_{\mathcal{H}_{\Theta_{\alpha}}}$ , vanishing for every $f$ iff the element $\int_{\mathcal{X}}\Theta_{\alpha}(\cdot,\mathbf{x})\,d\mu(\mathbf{x})\in\mathcal{H}_{\Theta_{\alpha}}$ is zero, equivalently $\int\!\int\Theta_{\alpha}\,d\mu\otimes d\mu=0$ . By Step 2 this forces $\mu=0$ . Hence $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ has trivial annihilator in $C(\mathcal{X})^{*}$ (signed Radon measures, by Riesz), so $\mathcal{H}_{\Theta_{\alpha}}|_{\mathcal{X}}$ is dense in $C(\mathcal{X})$ in the sup norm.

Step 4 (Transfer to $\Theta$ ). The summand $\Theta_{w}$ is PSD by Theorem 26, so $\Theta=\Theta_{\alpha}+\Theta_{w}\succeq\Theta_{\alpha}$ in the Loewner order on Gram matrices. By the Aronszajn inclusion theorem [Paulsen and Raghupathi, 2016], the inclusion $\mathcal{H}_{\Theta_{\alpha}}\hookrightarrow\mathcal{H}_{\Theta}$ is continuous; density of $\mathcal{H}_{\Theta_{\alpha}}$ in $C(\mathcal{X})$ transfers to $\mathcal{H}_{\Theta}$ . ∎

Remark 14 (General principle).

Step 2 above instantiates a general principle: under spherically symmetric initialisation $\mathbf{w}\sim\mu_{w}$ with full support, the $\alpha$ -summand of the empirical NTK has the Gram representation $\Theta_{\alpha}(\mathbf{x},\mathbf{x}^{\prime})=\langle k(\mathbf{x},\cdot),k(\mathbf{x}^{\prime},\cdot)\rangle_{L^{2}(\mu_{w})}$ , and integral strict positive definiteness of $\Theta_{\alpha}$ on a compact $\mathcal{X}$ follows from characteristicness of the underlying kernel $k$ on $\mathcal{X}$ . Universality of the Yat NTK is therefore a direct consequence of universality of the Yat kernel itself, with no additional spectral hypothesis on $\mu_{w}$ beyond full support.

Appendix O RKHS-Lipschitz Bounds and Certified Adversarial Radius

The single-layer Lipschitz bound of Lemma 7 is parametric in the trained $(\mathbf{w}_{j},\boldsymbol{\alpha},b,\varepsilon)$ and is used for stack-Lipschitz propagation. We complement it with an intrinsic RKHS-Lipschitz bound expressed only in the RKHS norm, derived from the reproducing property and a closed-form bound on the mixed partial of $k_{\text{\tifinaghfont ⵟ}}$ . This RKHS-norm route to certified robustness for kernel classifiers is developed for general kernels by Bietti et al. [2019]; here we instantiate it with the $k_{\text{\tifinaghfont ⵟ}}$ -specific mixed partial.

Lemma 9 (Mixed partial of $k_{\text{\tifinaghfont ⵟ}}$ on the diagonal).

For $\mathbf{x}\in\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ , set

M_{ij}(\mathbf{x})\;:=\;\partial_{x_{i}}\partial_{x^{\prime}_{j}}k_{\text{\tifinaghfont ⵟ}}(\mathbf{x},\mathbf{x}^{\prime})\Big|_{\mathbf{x}^{\prime}=\mathbf{x}}.

Then

M_{ij}(\mathbf{x})\;=\;\frac{2x_{i}x_{j}+2\,\|\mathbf{x}\|^{2}\delta_{ij}}{\varepsilon}+\frac{2\,\|\mathbf{x}\|^{4}\delta_{ij}}{\varepsilon^{2}},\qquad\operatorname{tr}M(\mathbf{x})\;=\;\frac{2\,\|\mathbf{x}\|^{2}(1+d)}{\varepsilon}+\frac{2d\,\|\mathbf{x}\|^{4}}{\varepsilon^{2}}.

Proof.

Write $N(\mathbf{x},\mathbf{x}^{\prime}):=(\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}$ and $D(\mathbf{x},\mathbf{x}^{\prime}):=\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}+\varepsilon$ , so that $k_{\text{\tifinaghfont ⵟ}}=N/D$ . Differentiating in $x_{i}$ ,

\partial_{x_{i}}k_{\text{\tifinaghfont ⵟ}}\;=\;\frac{2(\mathbf{x}^{\top}\mathbf{x}^{\prime})\,x^{\prime}_{i}}{D}-\frac{(\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}\cdot 2(x_{i}-x^{\prime}_{i})}{D^{2}}.

Differentiating in $x^{\prime}_{j}$ and noting that any term with an explicit $(x_{i}-x^{\prime}_{i})$ factor vanishes at $\mathbf{x}^{\prime}=\mathbf{x}$ , and that $\partial_{x^{\prime}_{j}}(x_{i}-x^{\prime}_{i})=-\delta_{ij}$ ,

\partial_{x_{i}}\partial_{x^{\prime}_{j}}k_{\text{\tifinaghfont ⵟ}}\Big|_{\mathbf{x}^{\prime}=\mathbf{x}}\;=\;\frac{2x_{j}x_{i}+2(\mathbf{x}^{\top}\mathbf{x})\delta_{ij}}{\varepsilon}+\frac{2(\mathbf{x}^{\top}\mathbf{x})^{2}\delta_{ij}}{\varepsilon^{2}}.

Substituting $\mathbf{x}^{\top}\mathbf{x}=\|\mathbf{x}\|^{2}$ gives the displayed expression for $M_{ij}$ . The trace formula follows from $\sum_{i}x_{i}^{2}=\|\mathbf{x}\|^{2}$ and $\sum_{i}\delta_{ii}=d$ . ∎

Theorem 27 (RKHS-Lipschitz constant for $k_{\text{\tifinaghfont ⵟ}}$ ).

Let $\mathcal{X}\subset\mathbb{R}^{d}\setminus\{\mathbf{0}\}$ be path-connected and compact with $\sup_{\mathcal{X}}\|\mathbf{x}\|\leq R$ . For every $f\in\mathcal{H}_{0,\varepsilon}|_{\mathcal{X}}$ with $\|f\|_{\mathcal{H}_{0,\varepsilon}}\leq B$ and every $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X}$ ,

|f(\mathbf{x})-f(\mathbf{x}^{\prime})|\;\leq\;B\,L(R,\varepsilon,d)\,\|\mathbf{x}-\mathbf{x}^{\prime}\|,

with

L(R,\varepsilon,d)\;:=\;\sqrt{\frac{2R^{2}(1+d)}{\varepsilon}+\frac{2dR^{4}}{\varepsilon^{2}}}\;=\;O\!\Bigl(\frac{R^{2}\sqrt{d}}{\varepsilon}\Bigr)\quad\text{as }R^{2}/\varepsilon\to\infty.

Proof.

By the reproducing property and Cauchy–Schwarz, $\partial_{i}f(\mathbf{x})=\langle f,\partial_{x_{i}}k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{x})\rangle_{\mathcal{H}_{0,\varepsilon}}$ , hence

\|\nabla f(\mathbf{x})\|^{2}\;=\;\sum_{i=1}^{d}|\partial_{i}f(\mathbf{x})|^{2}\;\leq\;\|f\|^{2}\sum_{i=1}^{d}\|\partial_{x_{i}}k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{x})\|^{2}\;=\;\|f\|^{2}\,\operatorname{tr}M(\mathbf{x}),

using the standard identity $\|\partial_{x_{i}}k_{\text{\tifinaghfont ⵟ}}(\cdot,\mathbf{x})\|_{\mathcal{H}_{0,\varepsilon}}^{2}=M_{ii}(\mathbf{x})$ . Lemma 9 gives $\operatorname{tr}M(\mathbf{x})\leq 2R^{2}(1+d)/\varepsilon+2dR^{4}/\varepsilon^{2}$ for $\|\mathbf{x}\|\leq R$ . The mean-value inequality on the path-connected set $\mathcal{X}$ (along a piecewise-smooth path from $\mathbf{x}$ to $\mathbf{x}^{\prime}$ in $\mathcal{X}$ ) gives $|f(\mathbf{x})-f(\mathbf{x}^{\prime})|\leq B\,L\,\|\mathbf{x}-\mathbf{x}^{\prime}\|$ . ∎

Corollary 14 (Certified adversarial radius for an RKHS classifier).

Let $\mathbf{f}=(f_{1},\dots,f_{C})$ be a $C$ -class predictor with $\sum_{c=1}^{C}\|f_{c}\|_{\mathcal{H}_{0,\varepsilon}}^{2}\leq B^{2}$ . For an input $\mathbf{x}\in\mathcal{X}$ with predicted class $y$ and margin $\gamma(\mathbf{x}):=f_{y}(\mathbf{x})-\max_{c\neq y}f_{c}(\mathbf{x})>0$ , every adversarial perturbation $\boldsymbol{\delta}$ satisfying

\|\boldsymbol{\delta}\|\;<\;\frac{\gamma(\mathbf{x})}{2BL(R,\varepsilon,d)}

keeps the predicted class equal to $y$ .

Proof.

The budget $\sum_{c}\|f_{c}\|^{2}\leq B^{2}$ gives $\|f_{c}\|\leq B$ for each $c$ , hence $\|f_{y}-f_{c}\|\leq\|f_{y}\|+\|f_{c}\|\leq 2B$ by the triangle inequality. Theorem 27 applied to $f_{y}-f_{c}$ yields

\bigl|(f_{y}-f_{c})(\mathbf{x}+\boldsymbol{\delta})-(f_{y}-f_{c})(\mathbf{x})\bigr|\;\leq\;2BL\,\|\boldsymbol{\delta}\|\;<\;\gamma(\mathbf{x}),

so $f_{y}(\mathbf{x}+\boldsymbol{\delta})>f_{c}(\mathbf{x}+\boldsymbol{\delta})$ for every $c\neq y$ and the prediction is preserved. ∎

Remark 15 (Capacity-robustness trade-off).

The certified radius is $\Theta(\varepsilon/(R^{2}\sqrt{d}))$ : increasing $\varepsilon$ buys robustness at the price of capacity, since the kernel diagonal $\|\mathbf{x}\|^{4}/\varepsilon$ shrinks. This trade-off is intrinsic to $k_{\text{\tifinaghfont ⵟ}}$ and explicit in the parameter $\varepsilon$ , in contrast to scalar-activation networks where Lipschitz control is imposed externally through weight-norm penalties.

Appendix P Yat-Native Atom-Count Bounds via the Polynomial Component

The exterior-shell separation (Appendix G.1) and the bounded-domain width-complexity gap (Proposition 7) capture qualitative atom-count separations between Yat and IMQ. We complement them with a constant-atom upper bound for arbitrary symmetric quadratic forms, paired with a dimension-counting IMQ lower bound. The combination converts the directional asymptotic-trace separation of Proposition 2 into a quantitative atom-count gap on PSD quadratic targets.

Setup.

For $\Lambda,W>0$ , define the bounded-variation atom families

\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)\;:=\;\Bigl\{\textstyle\sum_{j}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})\,:\,\textstyle\sum|a_{j}|\leq\Lambda,\ \|\mathbf{v}_{j}\|\leq W\Bigr\},

\mathcal{R}_{\text{\tifinaghfont ⵟ}}(\Lambda,W)\;:=\;\Bigl\{\textstyle\sum_{j}c_{j}\,g_{\varepsilon}(\cdot;\mathbf{w}_{j},b_{j})\,:\,\textstyle\sum|c_{j}|\leq\Lambda,\ \|\mathbf{w}_{j}\|\leq W,\ b_{j}\in\mathbb{R}\Bigr\}.

The atom count of an element is the smallest such expansion length.

Lemma 10 (Single-atom Yat approximation of a rank-one quadratic).

Let $\mathbf{w}_{\star}\in\mathbb{S}^{d-1}$ , $\delta>0$ , $R>0$ , and set

\alpha_{0}(d,R,\varepsilon,\delta)\;:=\;\max\Bigl(\tfrac{8R^{3}d^{3/2}}{\delta},\ 2\sqrt{\tfrac{dR^{2}(dR^{2}+\varepsilon)}{\delta}}\Bigr).

For every $\alpha\geq\alpha_{0}$ ,

\sup_{\mathbf{x}\in[-R,R]^{d}}\bigl|\,k_{\text{\tifinaghfont ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})-(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\,\bigr|\;\leq\;\delta.

Proof.

With $\mathbf{x}\in[-R,R]^{d}$ , $\|\mathbf{x}\|\leq R\sqrt{d}$ . Compute

k_{\text{\tifinaghfont ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})\;=\;\frac{(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}}{1+\eta(\mathbf{x},\alpha)},\qquad\eta(\mathbf{x},\alpha)\;:=\;-\frac{2\,\mathbf{w}_{\star}^{\top}\mathbf{x}}{\alpha}+\frac{\|\mathbf{x}\|^{2}+\varepsilon}{\alpha^{2}}.

On $[-R,R]^{d}$ , $|\mathbf{w}_{\star}^{\top}\mathbf{x}|\leq R\sqrt{d}$ , so $|\eta|\leq 2R\sqrt{d}/\alpha+(dR^{2}+\varepsilon)/\alpha^{2}$ . For $\alpha\geq 2\sqrt{dR^{2}(dR^{2}+\varepsilon)/\delta}$ the second term is at most $\delta/(4dR^{2})$ , and for $\alpha\geq 8R^{3}d^{3/2}/\delta$ the first term is at most $\delta/(4dR^{2})$ ; hence $|\eta|\leq\delta/(2dR^{2})\leq 1/2$ (for $\delta\leq dR^{2}$ , otherwise the conclusion follows trivially by taking $\alpha$ even larger). The bound $|1/(1+\eta)-1|\leq 2|\eta|$ for $|\eta|\leq 1/2$ gives

\bigl|k_{\text{\tifinaghfont ⵟ}}(\alpha\mathbf{w}_{\star},\mathbf{x})-(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\bigr|\;\leq\;(\mathbf{w}_{\star}^{\top}\mathbf{x})^{2}\cdot 2|\eta|\;\leq\;dR^{2}\cdot 2\cdot\delta/(2dR^{2})\;=\;\delta.

∎

Proposition 9 (Spectral Yat approximation of any symmetric quadratic form).

Let $\mathbf{A}\in\mathrm{Sym}(d)$ have rank $r$ with spectral decomposition $\mathbf{A}=\sum_{k=1}^{r}\lambda_{k}\mathbf{e}_{k}\mathbf{e}_{k}^{\top}$ . For every $\delta>0$ , set $\delta_{k}:=\delta/(r|\lambda_{k}|)$ and choose $\alpha_{k}\geq\alpha_{0}(d,R,\varepsilon,\delta_{k})$ as in Lemma 10. Then

G(\mathbf{x})\;:=\;\sum_{k=1}^{r}\lambda_{k}\,k_{\text{\tifinaghfont ⵟ}}(\alpha_{k}\mathbf{e}_{k},\mathbf{x})

is an unbiased Yat expansion with exactly $r$ atoms satisfying $\sup_{\mathbf{x}\in[-R,R]^{d}}|G(\mathbf{x})-\mathbf{x}^{\top}\mathbf{A}\mathbf{x}|\leq\delta$ .

Proof.

$\mathbf{x}^{\top}\mathbf{A}\mathbf{x}=\sum_{k=1}^{r}\lambda_{k}(\mathbf{e}_{k}^{\top}\mathbf{x})^{2}$ . Lemma 10 applied to each unit eigenvector $\mathbf{e}_{k}$ with tolerance $\delta_{k}$ gives $|k_{\text{\tifinaghfont ⵟ}}(\alpha_{k}\mathbf{e}_{k},\mathbf{x})-(\mathbf{e}_{k}^{\top}\mathbf{x})^{2}|\leq\delta/(r|\lambda_{k}|)$ . Multiplying by $|\lambda_{k}|$ and summing gives the displayed bound. ∎

Proposition 10 (IMQ atom-count lower bound for a PSD quadratic target).

Let $\mathbf{A}\in\mathrm{Sym}(d)$ be PSD with $\lambda_{\max}:=\lambda_{\max}(\mathbf{A})>0$ , and let $\mathbf{e}_{1}$ be a unit eigenvector of $\mathbf{A}$ with eigenvalue $\lambda_{\max}$ . Fix $R>0$ and $0<\delta<\lambda_{\max}R^{2}$ . Suppose $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $W\leq R/2$ satisfies $\sup_{[-R,R]^{d}}|F-\mathbf{x}^{\top}\mathbf{A}\mathbf{x}|\leq\delta$ and $\Lambda>\varepsilon\,(\lambda_{\max}R^{2}-\delta)$ . Set

\rho^{2}\;:=\;\frac{\Lambda}{\lambda_{\max}R^{2}-\delta}\,-\,\varepsilon\;>\;0.

Any expansion $F=\sum_{j=1}^{m}a_{j}\,h_{\varepsilon}(\cdot,\mathbf{v}_{j})$ realising the bounded-variation budget has

m\;\geq\;\frac{(2R)^{d-1}\,\Gamma\!\bigl(\tfrac{d+1}{2}\bigr)}{\pi^{(d-1)/2}\,\rho^{\,d-1}}.

In particular, when $\rho$ stays bounded as $d\to\infty$ (equivalently, $\Lambda=O(1)$ with $R,\lambda_{\max},\delta,\varepsilon$ fixed), the right-hand side is $\exp(\Omega(d\log d))$ .

Proof.

Restrict attention to the inscribed Euclidean ball $\mathcal{B}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq R\}\subset[-R,R]^{d}$ . By rotational invariance of $\mathcal{B}$ , align $\mathbf{e}_{1}$ with a unit eigenvector of $\mathbf{A}$ for $\lambda_{\max}$ . Consider the slice $S^{\prime}:=\{R/2\}\times\mathcal{B}^{\prime}_{d-1}$ where $\mathcal{B}^{\prime}_{d-1}\subset\mathbb{R}^{d-1}$ is the $(d-1)$ -ball of radius $R\sqrt{3}/2$ . For $\mathbf{x}\in S^{\prime}$ , $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\geq\lambda_{\max}(\mathbf{e}_{1}^{\top}\mathbf{x})^{2}=\lambda_{\max}R^{2}/4$ , so $F(\mathbf{x})\geq\lambda_{\max}R^{2}/4-\delta>0$ (we strengthen the hypothesis to $\delta<\lambda_{\max}R^{2}/4$ ). The bounded-variation bound gives

F(\mathbf{x})\;\leq\;\sum_{j=1}^{m}|a_{j}|\,h_{\varepsilon}(\mathbf{x},\mathbf{v}_{j})\;\leq\;\frac{\Lambda}{\min_{j}(\|\mathbf{x}-\mathbf{v}_{j}\|^{2}+\varepsilon)},

so $\min_{j}\|\mathbf{x}-\mathbf{v}_{j}\|^{2}\leq\rho^{2}$ for every $\mathbf{x}\in S^{\prime}$ , with the constant in $\rho$ now depending on $\lambda_{\max}R^{2}/4$ rather than $\lambda_{\max}R^{2}$ . Hence $S^{\prime}\subset\bigcup_{j}B_{d}(\mathbf{v}_{j},\rho)$ . Each ball intersects the affine slice $\{x_{1}=R/2\}$ in a $(d-1)$ -ball of radius at most $\rho$ and volume $\pi^{(d-1)/2}\rho^{d-1}/\Gamma((d+1)/2)$ . The slice $S^{\prime}$ has $(d-1)$ -volume $\pi^{(d-1)/2}(R\sqrt{3}/2)^{d-1}/\Gamma((d+1)/2)$ , so volume counting yields the displayed lower bound. By Stirling, the Stirling-error contribution from $\Gamma((d+1)/2)$ cancels between the slice and the ball volumes, and the constant $(\sqrt{3}/2)^{d-1}$ is absorbed into $O(d)$ ; taking logs, $\log m\geq(d-1)\log(R\sqrt{3}/(2\rho))-O(d)$ , which is $\Omega(d\log d)$ when $\rho$ is bounded as $d\to\infty$ . ∎

Theorem 28 (Yat-native rate-form separation).

Let $f^{\star}=p^{\star}+g^{\star}$ on $[-R,R]^{d}$ , where $p^{\star}(\mathbf{x})=\mathbf{x}^{\top}\mathbf{A}\mathbf{x}$ for some PSD $\mathbf{A}\in\mathrm{Sym}(d)$ of rank $r$ with $\lambda_{\max}(\mathbf{A})>0$ , and $g^{\star}\in\mathcal{H}_{\mathrm{IMQ}}^{\varepsilon}$ with $\|g^{\star}\|_{L^{\infty}}\leq M$ and IMQ approximation rate $\phi$ (i.e., for every $\delta^{\prime}>0$ there exists $\bar{F}\in\mathcal{F}_{\mathrm{IMQ}}$ with $\leq\phi^{-1}(\delta^{\prime})$ atoms and $\|g^{\star}-\bar{F}\|_{L^{\infty}}\leq\delta^{\prime}$ ). Fix $\delta>0$ with $4\delta+M<\lambda_{\max}R^{2}$ , and set $m_{\delta}:=\min\{m:\phi(m)\leq\delta/2\}$ .

(i) Yat upper bound. There exists $G\in\mathcal{F}_{\text{\tifinaghfont ⵟ}}$ with at most $r+3m_{\delta}$ atoms and $\sup_{\mathbf{x}\in[-R,R]^{d}}|f^{\star}(\mathbf{x})-G(\mathbf{x})|\leq\delta$ .

(ii) IMQ lower bound. Fix $\Lambda>0$ and $W\leq R/2$ . Every $F\in\mathcal{R}_{\mathrm{IMQ}}(\Lambda,W)$ with $\sup_{[-R,R]^{d}}|f^{\star}-F|\leq\delta$ has atom count at least the dimension-counting lower bound of Proposition 10; in the regime where $R,\lambda_{\max},\delta,\varepsilon,\Lambda$ are held fixed as $d\to\infty$ , this bound is $\exp(\Omega(d\log d))$ .

(iii) Separation. For full-rank $\mathbf{A}$ , the Yat side achieves the target error with $d+3m_{\delta}$ atoms while the IMQ side requires $\exp(\Omega(d\log d))$ in the fixed- $\Lambda$ regime of (ii).

Proof.

(i) Apply Proposition 9 to $\mathbf{A}$ with tolerance $\delta/2$ , producing an unbiased Yat expansion $G_{1}$ with $r$ atoms and $\sup|G_{1}-p^{\star}|\leq\delta/2$ . By definition of $m_{\delta}$ , choose $\bar{F}\in\mathcal{F}_{\mathrm{IMQ}}$ with $\leq m_{\delta}$ IMQ atoms and $\|g^{\star}-\bar{F}\|_{L^{\infty}}\leq\delta/2$ . Theorem 4 converts $\bar{F}$ into a biased Yat expansion $G_{2}\in\mathcal{F}_{\text{\tifinaghfont ⵟ}}$ with at most $3m_{\delta}$ atoms and $G_{2}=\bar{F}$ pointwise. Set $G:=G_{1}+G_{2}$ with atom count $\leq r+3m_{\delta}$ ; the triangle inequality gives the displayed sup bound.

(ii) Given $\sup|f^{\star}-F|\leq\delta$ , $\sup|p^{\star}-F|\leq\delta+M$ . Apply Proposition 10 with the perturbed tolerance $\delta^{\prime}:=\delta+M$ in place of $\delta$ (the hypothesis $4\delta+M<\lambda_{\max}R^{2}$ ensures $\delta^{\prime}<\lambda_{\max}R^{2}$ , so $\rho$ is well-defined). The atom count is bounded below by the dimension-counting estimate, which under the fixed- $\Lambda$ asymptotic regime is $\exp(\Omega(d\log d))$ .

(iii) For full-rank $\mathbf{A}$ , $r=d$ ; combining (i) and (ii) gives the displayed separation. ∎

Remark 16 (RKHS-norm cost of the Yat upper bound).

Theorem 28(i) is an atom-count statement, not an RKHS-norm statement. The single-atom rank-one approximation (Lemma 10) requires $\alpha_{0}=\Omega(d^{3/2}/\delta)$ , so the resulting Yat atom $k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\cdot)$ has center norm $\|\alpha_{0}\mathbf{w}_{\star}\|=\alpha_{0}=\Omega(d^{3/2})$ and squared RKHS norm

\|k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\cdot)\|_{\mathcal{H}_{0,\varepsilon}}^{2}\;=\;k_{0,\varepsilon}(\alpha_{0}\mathbf{w}_{\star},\alpha_{0}\mathbf{w}_{\star})\;=\;\frac{\alpha_{0}^{4}}{\varepsilon}\;=\;\Theta\!\Bigl(\tfrac{d^{6}}{\varepsilon\,\delta^{4}}\Bigr).

Aggregating over the $r$ atoms in the rank- $r$ spectral decomposition (Proposition 9) inflates this further by a factor at most $r=d$ in the full-rank case. So the Yat-side resource cost decomposes into a constant atom count and a polynomial-in- $d$ RKHS-ball radius; the IMQ side requires $\exp(\Omega(d\log d))$ atoms but each at bounded center norm, giving a polynomial RKHS-ball radius for any single expansion. The separation in Theorem 28 is therefore poly-vs-exp in atom count and poly-vs-poly in RKHS-ball radius; downstream generalization comparisons via Theorem 6 inherit both factors. The exponential atom-count gap on the Yat side is real and quantitative, but it is a gap in the discrete combinatorial complexity of the expansion, not a free-lunch reduction in the underlying capacity radius.

Appendix Q Toward a Uniqueness Characterization

The structural results of the main paper isolate three properties that the Yat kernel satisfies and the classical RBF/IMQ/polynomial kernels do not jointly satisfy: rational symmetric form, bounded global diagonal, and a non-trivial quadratic asymptotic trace at infinity. We record the rigorous degree-matching skeleton these conditions imply, and state the full classification as a conjecture.

Conditions.

Let $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ , $d\geq 2$ , be a continuous PSD kernel satisfying

(C1)

(Rational symmetric form, in reduced form.) $k(\mathbf{x},\mathbf{w})=P(\mathbf{x},\mathbf{w})/Q(\mathbf{x},\mathbf{w})$ with $P,Q$ jointly polynomial in $(\mathbf{x},\mathbf{w})$ , $P$ symmetric in the variable pair, $Q(\mathbf{x},\mathbf{w})>0$ for every $(\mathbf{x},\mathbf{w})$ , and $\gcd(P,Q)=1$ in the polynomial ring (i.e., the rational form is reduced).
(C2)

(At-most-quadratic numerator growth.) The diagonal admits $k(\mathbf{x},\mathbf{x})=O(\|\mathbf{x}\|^{4})$ as $\|\mathbf{x}\|\to\infty$ . Equivalently, in the reduced rational form of (C1), $\deg_{\mathbf{x}}P-\deg_{\mathbf{x}}Q\leq 2$ on the diagonal $\mathbf{w}=\mathbf{x}$ .
(C3)

(Non-trivial quadratic directional trace.) For some $\mathbf{w}\neq\mathbf{0}$ , the limit $T_{\infty}k(\cdot,\mathbf{w})(\mathbf{u}):=\lim_{r\to\infty}k(r\mathbf{u},\mathbf{w})$ exists for every $\mathbf{u}\in\mathbb{S}^{d-1}$ and is a non-zero quadratic form in $\mathbf{u}$ .

Proposition 11 (Degree matching under (C1)–(C3)).

For any kernel $k$ satisfying (C1)–(C3), $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$ (and by symmetry the same in $\mathbf{w}$ ).

Proof.

Write $P=\sum_{p}P_{p}$ and $Q=\sum_{q}Q_{q}$ as homogeneous decompositions in $\mathbf{x}$ . As $r\to\infty$ , $k(r\mathbf{u},\mathbf{w})$ has leading order $r^{p_{\star}-q_{\star}}$ where $p_{\star},q_{\star}$ are the largest indices with non-zero homogeneous parts. Existence of a finite, non-zero limit (C3) forces $p_{\star}=q_{\star}$ and the limiting ratio $P_{p_{\star}}/Q_{q_{\star}}$ to be a non-zero rational function of $\mathbf{u}$ alone. Since the limit is a quadratic form in $\mathbf{u}$ , the limit’s total numerator degree in $\mathbf{u}$ after cancellation is $2$ , which is consistent only with $p_{\star}=q_{\star}=2$ (the cases $p_{\star}=q_{\star}\in\{0,1\}$ are ruled out by the requirement that the trace is non-zero quadratic of degree exactly $2$ ). Together with (C2), no higher-degree homogeneous parts are admissible: if $\deg_{\mathbf{x}}P>2$ but $\deg_{\mathbf{x}}Q=2$ , then $k(\mathbf{x},\mathbf{x})$ is unbounded. Therefore $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$ . ∎

Conjecture 2 (Uniqueness of $k_{b,\varepsilon}$ up to symmetry).

Every continuous PSD kernel satisfying (C1)–(C3) has the form

k(\mathbf{x},\mathbf{w})\;=\;c\,\frac{(\mathbf{x}^{\top}\mathbf{w}+b)^{2}}{\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon}\;=\;c\,k_{b,\varepsilon}(\Phi^{-1}\mathbf{x},\Phi^{-1}\mathbf{w}),

for some $b\geq 0$ , $\varepsilon>0$ , $c>0$ , and $\Phi\in\mathrm{GL}(d)$ .

Remark 17 (Status).

Proposition 11 establishes the degree- $2$ constraint, which is the first of several steps in the conjectured classification. The remaining steps—(i) showing PSD on all finite point sets forces $P$ to be the square of a degree- $1$ symmetric form $(\mathbf{x}^{\top}\mathbf{w}+b)^{2}$ , up to a linear change of variables, and (ii) the analogous classification for strictly positive degree- $2$ symmetric denominators $Q$ —require an explicit case analysis with multi-point Gram constraints that we have not carried out, and which constitute the bulk of the work behind Conjecture 2.

Remark 18 (Why higher homogeneous degrees do not survive on $\mathbb{S}^{d-1}$ ).

A potential concern is that $P_{4}(\mathbf{u},\mathbf{w})/Q_{4}(\mathbf{u},\mathbf{w})$ may collapse to a quadratic on $\|\mathbf{u}\|^{2}=1$ — for instance, $(\mathbf{u}^{\top}\mathbf{w})^{2}\|\mathbf{u}\|^{2}/\|\mathbf{w}\|^{4}=(\mathbf{u}^{\top}\mathbf{w})^{2}/\|\mathbf{w}\|^{2}$ — but this does not invalidate the degree conclusion: such a collapse is a redundancy of the homogeneous expansion, and any kernel admitting it can be rewritten with $\deg_{\mathbf{x}}P=\deg_{\mathbf{x}}Q=2$ via cancellation of the $\|\mathbf{u}\|^{2}$ factor. Equivalently, after passing to the reduced rational form $P^{*}/Q^{*}$ in which numerator and denominator share no common polynomial factor (the convention adopted in (C1)), the directional-trace argument forces $\deg_{\mathbf{x}}P^{*}=\deg_{\mathbf{x}}Q^{*}=2$ .

Remark 19 (Independence of (C1)–(C3)).

Each condition is necessary for the conjectured classification: dropping (C1) admits Gaussian RBF and IMQ (no rational symmetric form); dropping (C2) admits higher-degree rational kernels such as $(\mathbf{x}^{\top}\mathbf{w})^{4}/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ , whose diagonal grows as $\|\mathbf{x}\|^{8}$ and so violates the at-most-quadratic numerator condition while still satisfying (C1) and (C3); dropping (C3) admits the polynomial $(\mathbf{x}^{\top}\mathbf{w})^{2}$ (no IMQ denominator, so the far-field along $r\mathbf{u}$ blows up rather than approaching a finite quadratic limit) and all bounded radial kernels (Matérn, IMQ) which have $T_{\infty}\equiv 0$ . The simultaneous combination forces the rigid degree structure of Proposition 11 and conjecturally isolates $k_{b,\varepsilon}$ as a two-parameter family.

Appendix R CLIP Probe Classification Sweep

Scope of this experiment.

This appendix is intended as a trainability/diagnostic check, not a head-to-head benchmark of Yat against tuned kernel baselines. The bandwidths of RBF and IMQ are fixed to match Yat’s $\varepsilon$ (no per-variant tuning), so the comparison is deliberately under-tuned for the radial baselines. The two structurally informative comparisons we do draw are Yat vs. Poly (with vs. without the IMQ denominator) and Yat vs. Yat_rand (trained vs. frozen centers). A bandwidth-tuned comparison against learned-center RBF/IMQ heads (and against random Fourier features and a standard MLP head) is left as future empirical work; we therefore avoid drawing performance-superiority conclusions from this table alone.

To illustrate the practical consequence of the alignment numerator, we train single-layer kernel classifiers on frozen CLIP ViT-B/32 image features (ImageNet-1k, 1000 classes, $d=512$ ). Each classifier is a one-vs-rest kernel expansion with $m=1000$ learned centers (one per class), optimised by Adam for 20 epochs with a grid of six learning rates over three seeds. The five variants compared are: Yat ( $k_{b,\varepsilon}$ , trained centers, shared $b=\ln 2$ , $\varepsilon=2$ ), Poly (quadratic polynomial kernel, same numerator as Yat, no IMQ denominator), Yat_rand (Yat with frozen random centers, only $\alpha$ trained), RBF (Gaussian kernel $e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}$ with learned centers, fixed bandwidth $\gamma=1/(2\varepsilon)=0.25$ to match Yat’s $\varepsilon$ ), and IMQ (inverse multiquadric kernel with learned centers, fixed $\varepsilon_{\text{IMQ}}=2$ matching Yat’s $\varepsilon$ ). All five variants use the same Adam optimizer and six-point learning-rate grid; no per-variant bandwidth tuning was performed.

Table 3: ImageNet-1k top-1 validation accuracy (%) for a single-layer kernel classifier on frozen CLIP ViT-B/32 features, structural (alignment-numerator vs. trained-center) ablations only. Mean over 3 seeds; best-LR column reports the peak over the lr grid.

Variant	$10^{-2.5}$	$10^{-2}$	$10^{-1.5}$	$10^{-1}$	$10^{0}$	$10^{0.5}$	Best LR	Best acc
Yat (trained)	68.4	70.8	73.1	73.9	72.2	69.7	$10^{-1}$	$73.9\pm 0.1$
Poly	72.5	71.2	70.1	64.3	51.6	49.1	$10^{-2.5}$	$72.5\pm 0.1$
Yat_rand	62.6	67.5	69.5	69.2	58.4	46.2	$10^{-1.5}$	$69.5\pm 0.1$

Reading Table 3. The two structurally informative comparisons are: (1) Yat (73.9%) vs. Poly (72.5%): the $+1.4$ pp gap isolates the benefit of the IMQ locality denominator over the polynomial alignment numerator alone. The Poly kernel has finite-dimensional RKHS (Table 1, footnote $\dagger$ ) of dimension $O(d^{2})$ at degree $2$ , so the gap roughly reflects the cost of forcing a $1000$ -class problem on $d=512$ features into a finite-rank RKHS; (2) Yat (73.9%) vs. Yat_rand (69.5%): the $+4.4$ pp gap shows that center optimisation is required to fully exploit the alignment numerator, consistent with the learned-center RKHS view. Both ablations isolate one variable at a time and use the same Yat hyperparameters; the comparison is not contaminated by bandwidth choice. We note that the Poly baseline (and the bandwidth-mismatched RBF/IMQ baselines in Table 4) peak at the lower end of the searched LR grid; extending the grid downwards may reduce the $+1.4$ pp Yat-vs-Poly gap, and we report the result with this caveat. RBF/IMQ are reported with the additional bandwidth-mismatch caveat already noted.

Bandwidth-mismatched RBF/IMQ baselines (separated for visual honesty).

For completeness we also ran fixed-bandwidth RBF and IMQ heads at the same optimizer / lr grid; we report these in a separate Table 4 rather than alongside Yat to avoid an apples-to-oranges headline juxtaposition. The bandwidth was fixed to match Yat’s $\varepsilon$ ( $\gamma=1/(2\varepsilon)=0.25$ for RBF; $\varepsilon_{\text{IMQ}}=2$ ), not tuned per variant, and the resulting collapse to a small fraction of Yat’s accuracy at every learning rate (RBF and IMQ peak at $46.1\%$ and $28.7\%$ respectively against Yat’s $73.9\%$ , with the gap widening as the learning rate increases) is the optimization-side counterpart of Proposition 5: with $\varepsilon$ fixed at $O(1)$ and $\|\mathbf{x}-\mathbf{w}\|^{2}=\Theta(d)$ at $d=512$ , both kernels concentrate around a near-constant value, so both the forward signal and the Adam gradient lose discriminative scale (for RBF, $e^{-\gamma\|\mathbf{x}-\mathbf{w}\|^{2}}\approx e^{-O(d)}$ is numerically zero; for IMQ, $1/(\|\mathbf{x}-\mathbf{w}\|^{2}+\varepsilon)$ is near-uniform across $(\mathbf{x},\mathbf{w})$ pairs). The remedy is a bandwidth scaled to the data ( $\varepsilon\sim d$ ), not a different optimizer; with per-variant bandwidth selection these baselines would be expected to perform competitively. We draw no performance-superiority conclusion against RBF or IMQ from this table; a bandwidth-tuned head-to-head comparison is left as future empirical work.

Table 4: Bandwidth-mismatched RBF and IMQ heads at the same optimizer / lr grid as Table 3, with bandwidth fixed to match Yat’s

\varepsilon

rather than tuned per variant. Reported here for completeness; not a fair head-to-head comparison against Yat.

Variant	$10^{-2.5}$	$10^{-2}$	$10^{-1.5}$	$10^{-1}$	$10^{0}$	$10^{0.5}$	Best LR	Best acc
RBF (fixed $\gamma$ )	46.1	0.1	0.1	0.1	0.1	0.1	$10^{-2.5}$	$46.1\pm 0.5$
IMQ (fixed $\varepsilon$ )	28.7	25.1	11.9	0.1	0.1	0.1	$10^{-2.5}$	$28.7\pm 0.3$

Per-class RKHS norm correlation.

For the Yat head trained at the best learning rate ( $10^{-1}$ ), the per-class closed-form RKHS norm $\boldsymbol{\alpha}_{c}^{\top}\mathbf{K}\boldsymbol{\alpha}_{c}$ (Proposition 3) has positive Spearman correlation $\rho=0.564$ with per-class top-1 accuracy across $n=1000$ classes (one observation per class, $50$ validation examples each). The correlation is stable across a $(b,\varepsilon)$ ablation grid spanning two orders of magnitude ( $\rho\in[0.54,0.56]$ over seven configurations; reproduction scripts are provided in the supplementary material). The interpretation of the positive sign is discussed in §6. The closed-form norm is computable directly from the trained weights $(\mathbf{w}_{j},\boldsymbol{\alpha}_{c})$ with no held-out data. Learned-center RBF and IMQ heads admit the same $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ formula at the same cost: the reproducing-property derivation of Proposition 3 is generic to finite kernel expansions and applies equally to any Mercer kernel. What is specific to Yat is the kernel itself — its diagonal $(\|\mathbf{x}\|^{2}+b)^{2}/\varepsilon$ , its directional far-field trace, and the bias finite-difference structure. Ordinary scalar MLP units ( $\mathbf{x}\mapsto\sigma(\mathbf{w}^{\top}\mathbf{x}+b)$ ) do not admit the formula at all because they do not define a Mercer section over the shared input/weight space.

Appendix S Directional-Tail Benchmark (Corollary 5)

To make Corollary 5 concrete we run a small synthetic experiment in $d=2$ . The target function is the single Yat atom $g_{\varepsilon}(\mathbf{x};\mathbf{w}^{*},b^{*})$ with $\mathbf{w}^{*}=[1,0]$ , $b^{*}=1$ , $\varepsilon=1$ . Along the ray $\mathbf{u}=[1,0]$ the alignment factor satisfies $a=(\mathbf{u}^{\top}\mathbf{w}^{*})^{2}=1$ , so Corollary 5 predicts $|g_{\varepsilon}(r\mathbf{u})-F(r\mathbf{u})|\to 1$ for every finite IMQ combination $F$ . This experiment provides concrete numerical verification of the analytical prediction. The regime (training on an annulus, evaluating on a far-field ray) is chosen so that the analytical prediction is tight by construction.

Setup.

We draw $N=4{,}000$ training points uniformly (area-measure) from the annulus $\|\mathbf{x}\|\in[50,100]$ , labels $y=g_{\varepsilon}(\mathbf{x})$ . Two IMQ models are trained: $M{=}50$ and $M{=}200$ learned centers, shared bandwidth $\varepsilon_{\mathrm{IMQ}}$ trained jointly, Adam for $5{,}000$ epochs at learning rates $10^{-3}$ and $5{\times}10^{-4}$ . Both models are evaluated along the ray $r\mathbf{u}$ for $r\in[0,500]$ . The script (directional_tail_benchmark.py) is provided in the supplementary material; it uses MLX and runs in ${\approx}6$ s on Apple Silicon M-series.

Results.

Table 5: Mean and max pointwise error on the far tail (

r\geq 400

) for finite IMQ expansions trained to fit a single Yat atom on an annulus. The asymptotic limit predicted by Corollary 5 is

a=1

Model	$M$	Mean $\|g-F\|$ at $r\geq 400$	Max $\|g-F\|$ at $r\geq 400$
IMQ-50	50	$1.008$	$1.008$
IMQ-200	200	$1.007$	$1.007$
Predicted ( $a=1$ )	—	$1.000$	$1.000$

Both models train successfully in the annulus (MSE declines steadily) yet converge to the predicted error floor $a\approx 1$ in the far field, consistent with Corollary 5: increasing the number of IMQ atoms does not recover the directional tail because the analytical result is unconditional on model size. This is the finite-sample instantiation of the $\|T_{\infty}F\|=0$ property of the IMQ span (Proposition 2).

Appendix T The Yat Primitive at Depth in a Trained Causal Language Model

The single-layer theory of Sections 2–5 makes its claims at one shared- $(b,\varepsilon)$ Yat layer; the pullback theorem (Theorem 14) extends them to a fixed prefix inside a deep stack so that the closed-form norm $\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$ (Proposition 3) and the diagonal-driven Rademacher bound (Theorem 6) survive composition at depth. This appendix exercises that depth extension on a trained causal language model: at matched architecture and matched Chinchilla-optimal compute on C4, the Yat primitive composed across many layers reaches the GELU-baseline accuracy regime, so the layer-local theoretical objects (per-layer Gram, closed-form norm, Rademacher diagonal) are not just well-defined at depth but populated with non-degenerate trained content.

Setup.

We replace the standard GELU MLP block of a $261$ M decoder-only causal language model ( $12$ layers, $n_{\text{embd}}{=}768$ , context $1024$ ) with a $\mathbf{x}\to\text{Yat}\to\text{Linear}$ block at the same parameter count, and train both at the GELU-Chinchilla-optimal budget of $5.2$ B C4 tokens [Raffel and others, 2020, Hoffmann and others, 2022] ( $20{,}000$ steps, global batch $256$ , AdamW [Loshchilov and Hutter, 2019] with cosine decay, weight decay $0.1$ ). Learning rates were chosen by a $500$ K-token pre-sweep: $\mathrm{lr}{=}0.01$ for GELU and $\mathrm{lr}{=}0.03$ for Yat. We sweep the full $\{$ pn,sb $\}\times\{+\alpha,+\text{ca}\}$ Yat grid: pn = per-neuron $(b_{j},\varepsilon_{j})$ , sb = shared $(b,\varepsilon)$ within a layer, $+\alpha$ = learnable per-channel scaling, $+\text{ca}$ = constant $\alpha$ . Each row is mean $\,\pm\,$ std over $3$ independent training seeds; downstream zero-shot evaluations are run via lm-evaluation-harness.

Table 6:

261

M causal-LM evaluation,

5.2

B C4 tokens, mean

\,\pm\,

std over

3

training seeds. Wiki PPL is WikiText-103 next-token perplexity at the model’s training context length (

1024

tokens); long-range PPL is computed on the long-context split of PG-19 [Rae et al., 2020] with sliding-window evaluation at the same

1024

-token context. Lower is better (

\downarrow

); zero-shot accuracies via lm-evaluation-harness, higher is better (

\uparrow

). Variant naming: pn/sb = per-neuron / shared

(b,\varepsilon)

;

+\alpha

+\text{ca}

= learnable / constant per-channel scaling

\alpha

. Bold = best per column.

	Wiki PPL $\downarrow$	LAMBADA acc $\uparrow$	ARC-E $\uparrow$	ARC-C $\uparrow$	HellaSwag $\uparrow$	OpenBookQA $\uparrow$	PIQA $\uparrow$	Long-range PPL $\downarrow$
GELU	$44.23_{\pm 1.44}$	$23.59_{\pm 0.86}\%$	$\mathbf{33.53_{\pm 1.12}\%}$	$\mathbf{22.47_{\pm 0.90}\%}$	$29.00_{\pm 0.05}\%$	$27.87_{\pm 1.63}\%$	$61.77_{\pm 0.65}\%$	$114.08_{\pm 2.82}$
Yat (pn $+\alpha$ )	$\mathbf{39.04_{\pm 1.03}}$	$\mathbf{24.16_{\pm 0.43}\%}$	$33.01_{\pm 0.80}\%$	$22.01_{\pm 1.67}\%$	$\mathbf{29.62_{\pm 0.51}\%}$	$28.20_{\pm 2.23}\%$	$\mathbf{62.44_{\pm 0.56}\%}$	$104.20_{\pm 3.40}$
Yat (sb $+\alpha$ )	$39.07_{\pm 0.39}$	$23.80_{\pm 1.20}\%$	$32.62_{\pm 0.84}\%$	$21.39_{\pm 0.80}\%$	$29.52_{\pm 0.19}\%$	$\mathbf{28.40_{\pm 0.35}\%}$	$62.37_{\pm 0.74}\%$	$\mathbf{101.09_{\pm 3.90}}$
Yat (pn $+\text{ca}$ )	$42.18_{\pm 0.28}$	$23.79_{\pm 0.71}\%$	$33.02_{\pm 0.43}\%$	$22.07_{\pm 0.79}\%$	$29.20_{\pm 0.39}\%$	$27.47_{\pm 1.40}\%$	$61.26_{\pm 1.07}\%$	$108.62_{\pm 3.52}$
Yat (sb $+\text{ca}$ )	$42.09_{\pm 1.96}$	$23.43_{\pm 0.84}\%$	$32.98_{\pm 1.47}\%$	$22.44_{\pm 0.31}\%$	$\mathbf{29.62_{\pm 0.17}\%}$	$27.20_{\pm 0.20}\%$	$61.50_{\pm 0.69}\%$	$106.51_{\pm 4.08}$

Reading.

At matched architecture and matched Chinchilla compute, the four Yat variants and the GELU baseline land in the same accuracy regime within seed variance on the six lm-evaluation-harness tasks. The two $+\alpha$ Yat rows lead on Wikitext-103 perplexity ( $39.0$ vs. $44.2$ for GELU) and on long-range perplexity ( $101$ vs. $114$ ) by margins that exceed seed variance; the $+\text{ca}$ rows match GELU on accuracy and lead by ${\sim}\!2$ PPL (lower is better). Yat throughput at this scale is ${\sim}\!655\mathrm{K}$ tokens/sec versus ${\sim}\!715\mathrm{K}$ for GELU ( ${\sim}\!8\%$ lower), measured at the training batch size on TPU v5e and averaged over the full training run. The point of the experiment is the depth-viability claim: the Yat primitive composed across $12$ layers in a trained causal LM reaches the same accuracy regime as a GELU MLP at matched compute, and the closed-form RKHS norm of Proposition 3 continues to apply per layer to the trained shared- $(b,\varepsilon)$ rows. We make no head-to-head benchmark claim and view differences smaller than seed variance as inconclusive.

Compute resources.

Each variant was trained on $5.2\mathrm{B}$ C4 tokens at the Chinchilla-optimal compute ratio. Per-variant training used Google Cloud TPU v5e/v6e (TPU Research Cloud) with approximately $2$ TPU-hours per run; with three seeds per variant (GELU, pn $+\alpha$ , sb $+\alpha$ , pn $+\text{ca}$ , sb $+\text{ca}$ ) alongside a $10$ -point LR pre-sweep at $500$ K tokens, total compute for the $261$ M LM experiment was approximately $32$ TPU-hours.

NeurIPS Paper Checklist

1.

Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The four contributions in the introduction (PSD/Mercer + Loewner-domination universality, nonradial alignment via directional trace, exact IMQ recovery with three-atom tightness, and layer-local RKHS bookkeeping) correspond directly to Theorems 1, 2 (universality), Theorem 3 and Proposition 2 (alignment / trace), Theorem 4 and Theorem 5 (IMQ recovery and three-atom tightness), and Proposition 3 together with Theorem 6 (closed-form norm and Rademacher bound). The deep-stack scope is explicitly limited to a prefix-conditioned pullback (Theorem 14) and an ambient Sobolev containment (Theorem 19); we do not claim an exact global Yat-Gram norm.
2.

Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Section 6 contains a dedicated limitations discussion: shared $(b,\varepsilon)$ requirement for the exact RKHS norm, qualitative (not rate-form) universality, two-tier rather than exact deep-stack theory, extrapolative exterior-shell separations, and under-tuned empirical baselines. Each is paired with the scope-bounded remedy or the open problem it implies.
3.

Theory assumptions and proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Every theorem, proposition, lemma, and corollary states its assumptions inline ( $b\geq 0$ , $\varepsilon>0$ , compactness of $\mathcal{X}$ , etc.); notation is consolidated in Appendix A, generic RKHS background in Appendix C, and named results from the literature in Appendix D. Proofs of main-body results are in Appendix E; proofs of extension theorems are split across Appendix G (far-field separations), Appendix H (multiclass generalisation), Appendix I (pullback and Loewner-comparison), Appendix J (Sobolev containment), Appendix P (atom-count separation), and Appendix Q (degree-matching skeleton for the uniqueness conjecture).
4.

Experimental result reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper?
Answer: [Yes]
Justification: The CLIP probe (Appendix R), the directional-tail benchmark (Appendix S), and the $261$ M causal-LM proof-of-concept (Appendix T) report the architecture, optimizer, learning-rate grid, seed count, dataset, and bandwidth choices. All three are diagnostic experiments accompanying a theoretical paper, not benchmark contributions; the paper makes no performance-superiority claim from them.
5.

Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Reproduction scripts for both diagnostic experiments are referenced in the appendix; CLIP features are public ImageNet-1k embeddings from a public CLIP ViT-B/32 checkpoint.
6.

Experimental setting/details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Appendix R specifies the optimizer (Adam), six-point learning-rate grid, three-seed averaging, and the per-variant bandwidth choices. The deliberate under-tuning of RBF/IMQ baselines is disclosed explicitly.
7.

Experiment statistical significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The CLIP probe table reports mean $\pm$ standard deviation across three seeds; the per-class Spearman correlation is reported with its stability range $\rho\in[0.54,0.56]$ across a $(b,\varepsilon)$ ablation grid.
8.

Experiments compute resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The directional-tail benchmark runs in approximately 6 s on Apple Silicon M-series; the CLIP probe is single-GPU with frozen features at $d=512$ .
9.

Code of ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
Answer: [Yes]
Justification: The paper is a theoretical contribution about a kernel construction; no human subjects, no personal data, no harmful applications.
10.

Broader impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [N/A]
Justification: The paper develops mathematical foundations for a hidden-unit primitive; broader societal impact is not directly applicable to this contribution.
11.

Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse?
Answer: [N/A]
Justification: No new datasets or pretrained models are released.
12.

Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The CLIP ViT-B/32 model [Radford and others, 2021] (MIT license) and ImageNet-1k validation features [Deng et al., 2009] (custom research-only terms; we use only image embeddings from the public validation set, with no redistribution of raw images) are the only external assets used. Both are cited at the points of use in Appendix R.
13.

New assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [N/A]
Justification: The paper introduces a kernel construction and theoretical results, not new assets requiring documentation.
14.

Crowdsourcing and research with human subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: No human subjects.
15.

Institutional review board (IRB) approvals or equivalent for research with human subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: No human subjects.
16.

Declaration of LLM usage
Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research?
Answer: [N/A]
Justification: LLMs are not part of the methodology of this paper.

A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

Abstract

1 Introduction

2 Mercer Structure and Fixed-Kernel Universality

Theorem 1 (PSD and Mercer for b≥0b\geq 0).

Theorem 2 (Kernel-order domination).

Proposition 1 (Singular universality at b0>0b_{0}>0).

Corollary 1 (Characteristic kernel and SPD).

3 Nonradial Alignment Structure

Lemma 1 (IMQ RKHS functions vanish at infinity).

Corollary 2 (Strict containment and non-reversibility).

Theorem 3 (Yat RKHS channel decomposition).

Proposition 2 (Directional asymptotic trace separates Yat from IMQ).

4 IMQ Embedding via Bias Finite Differences

Theorem 4 (Bias-finite-difference IMQ embedding).

Theorem 5 (Three atoms are necessary).

Corollary 3 (Family universality, algebraic route).

Corollary 4 (Approximation rate transfer at factor-33 width overhead).

5 Layer-Local RKHS Capacity

Proposition 3 (Closed-form RKHS norm).

Theorem 6 (Single-layer Rademacher bound).

6 Conclusion and Discussion

Conjecture 1 (No exact global deep Yat-Gram norm).

References

Appendix roadmap.

Appendix A Preliminaries and Notation

Standing assumptions.

Kernels.

RKHS objects.

Spans of atoms.

Operators on the spans.

Network notation.

Conventions.

Appendix B Related Work

RBF and learned-center kernel networks.

Neural kernels and arc-cosine constructions.

Kernel machines and product kernels.

Deep-learning generalization and the NTK.

RKHS generalization bounds.

Approximation rates and ridge-vs-radial separations.

Rational and ratio-form kernels.

Non-stationary and anisotropic kernels.

Appendix C Background

Reproducing kernel Hilbert spaces.

Universality, characteristicness, and strict positive definiteness.

Schur products and Laplace integrals.

Appendix D Background Theorems from the Literature

Theorem 7 (Product-kernel RKHS containment [Steinwart and Christmann, 2008, Lemma 4.6]).

Theorem 8 (Micchelli IMQ universality [Micchelli et al., 2006, Thm. 17]).

Appendix E Proofs of Main-Body Results

E.1 Proof of Theorem 1: PSD/Mercer for b≥0b\geq 0

Theorem (Theorem 1 restated).

Proof.

E.2 Proof of Theorem 4: bias-finite-difference IMQ embedding

Theorem (Theorem 4 restated).

Proof.

Lemma 2 (Irreducibility and coprimality of the regularised distance).

Proof.

Theorem 9 (Tightness of the factor-33 reduction).

Proof for d≥2d\geq 2.

Proof for d=1d=1 via complex pole matching.

E.3 Proof of Proposition 1: singular universality at b0>0b_{0}>0

Proposition (Proposition 1 restated).

Proof.

Remark 1 (Necessity of b0>0b_{0}>0).

E.4 Characteristic kernel and strict positive definiteness

Corollary (Characteristic kernel, restating Corollary 1(i)).

Proof.

Corollary (Strict positive definiteness and Gram invertibility, restating part of Cor. 1).

Proof.

E.5 Proof of Proposition 3 and Theorem 6: closed-form norm and single-layer Rademacher bound

Proposition (Proposition 3 restated).

Proof.

Theorem (Theorem 6 restated).

Proof.

Corollary (Unbiased case).

E.6 Uniform Boundedness and High-Dimensional Variance

Proposition 4 (Uniform boundedness with sharpness).

Proof.

Theorem 1 (PSD and Mercer for $b\geq 0$ ).

Proposition 1 (Singular universality at $b_{0}>0$ ).

Corollary 4 (Approximation rate transfer at factor- $3$ width overhead).

E.1 Proof of Theorem 1: PSD/Mercer for $b\geq 0$

Theorem 9 (Tightness of the factor- $3$ reduction).

Proof for $d\geq 2$ .

Proof for $d=1$ via complex pole matching.

E.3 Proof of Proposition 1: singular universality at $b_{0}>0$

Remark 1 (Necessity of $b_{0}>0$ ).

Theorem 10 (Far-field separation: $L^{\infty}$ and $L^{2}$ ).