arXiv:2605.01862 · cs.LG · uncurated · rendered via ar5iv

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL



Xing Lei (National Key Laboratory of Human-Machine Hybrid Augmented Intelligence and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; correspondence to leixing@stu.xjtu.edu.cn)
Jincheng Wang (Computer Science, University College London)
Xuetao Zhang (National Key Laboratory of Human-Machine Hybrid Augmented Intelligence and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University)
Donglin Wang (School of Engineering, Westlake University)

Abstract

Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian behaviors that violates standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependency modeling, their fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.

Keywords: 
Machine Learning, ICML

1 Introduction

Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) aims to learn goal-reaching policies from static datasets, offering a promising paradigm for real-world applications where online interaction is costly or infeasible (Levine et al., 2020; Liu et al., 2022). While most existing offline GCRL datasets are collected by Markovian behavior policies, an increasing number of practical datasets exhibit non-Markovian properties, where actions depend on historical context rather than on current observations alone (Park et al., 2025a). This property poses fundamental challenges for existing value-based methods (Kostrikov et al., 2022; Park et al., 2023; Zhou & Kao, 2025; Ahn et al., 2025; Giammarino et al., 2025; Giammarino & Qureshi, 2026) that rely on Bellman backups. In contrast, sequence modeling approaches like Decision Transformer (DT) (Chen et al., 2021) naturally handle non-Markovian problems by conditioning on return-to-go (RTG), states, and actions, leveraging self-attention to capture long-range dependencies from extended historical sequences.

Although DT naturally handles non-Markovian patterns through history conditioning, it exhibits two fundamental limitations when applied to offline GCRL. On one hand, RTG is trajectory-dependent rather than state-dependent: it assigns values based on trajectory success rather than state quality, which provides no discriminative information for distinguishing promising states within failed trajectories. Such discrimination is a critical requirement for trajectory stitching under goal-conditioned sparse rewards. On the other hand, pure attention struggles to efficiently balance global goal-directed reasoning with fine-grained local dynamics modeling. Recent hybrid architectures like LSDT (Wang et al., 2025) and DMixer (Zheng et al., 2025a) incorporate convolution alongside attention to capture local patterns. However, non-Markovian offline GCRL data exhibits variable-length temporal dependencies that change dynamically across states and trajectory segments. Convolution with fixed receptive fields either wastes model capacity on irrelevant context when dependencies are short or truncates critical information when dependencies are long, and therefore cannot adapt to this inherent variability.

We propose QHyer (Q-conditioned Hybrid Attention-Mamba Transformer), the first sequence modeling framework to jointly resolve both limitations for offline GCRL. Our key observation is that these two limitations are coupled. Effective trajectory stitching under sparse rewards requires both a state-dependent value signal and an architecture whose effective memory matches the temporal structure of the underlying behavior policy. Addressing either alone is insufficient, because Q-conditioning layered on a fixed-window hybrid retains the convolutional pathology on non-Markovian play data, while a better temporal architecture with RTG retains the trajectory-dependence bottleneck. Concretely, we (i) replace trajectory-dependent RTG with state-dependent Q-values estimated via Normalizing Flows (Ghugare & Eysenbach, 2025), chosen specifically for their exact, properly normalized log-density, a property CVAEs, contrastive critics, and diffusion likelihoods cannot provide (Section 3.1), and (ii) design a gated Hybrid Attention-Mamba (Gu & Dao, 2024) backbone where Mamba’s input-dependent selective state-space dynamics provide content-adaptive history compression, adjusting effective memory per-token rather than through a hand-tuned receptive field. Unlike prior value-guided Decision Transformers (Yamagata et al., 2023; Wang et al., 2024; Hu et al., 2024; Zhuang et al., 2024; Zheng et al., 2025b) that retain RTG and attach Q-values as auxiliary losses or regularizers, QHyer eliminates RTG and uses Q-values directly as conditioning tokens. Under goal-conditioned sparse rewards, where RTG collapses to a near-binary signal, this distinction is decisive (Figures 2 and 5).

Our evaluation on OGBench (Park et al., 2025a) and D4RL (Fu et al., 2020) demonstrates that QHyer achieves state-of-the-art performance across both non-Markovian datasets (OGBench play and D4RL Maze) and Markovian datasets (OGBench noisy), validating the effectiveness of NFs-based Q-value conditioning and the Hybrid Attention-Mamba architecture for offline GCRL.

2 Background

2.1 Offline GCRL

Offline GCRL is defined over a Markov Decision Process (MDP) $(\mathcal{S},\mathcal{A},\mathcal{G},p,\gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mathcal{G}$ the goal space, $p(s^{\prime}|s,a)$ the transition dynamics, and $\gamma\in[0,1)$ the discount factor. Following prior work (Park et al., 2023, 2025a), we assume $\mathcal{G}\equiv\mathcal{S}$. The agent has access only to a static dataset $\mathcal{D}=\{\tau^{(i)}\}_{i=1}^{N}$ collected by behavioral policies $\beta$, where each trajectory takes the form $\tau^{(i)}=\{(s_{t},a_{t},s_{t+1})\}_{t=0}^{T^{(i)}-1}$. The objective is to learn a goal-conditioned policy $\pi(a|s,g)$ that maximizes the expected cumulative return $J(\pi)=\mathbb{E}_{\tau\sim p^{\pi},g\sim p(g)}[\sum_{t=0}^{T}\gamma^{t}r(s_{t},g)]$ without interacting with the environment. To obtain goal-conditioned supervision, we employ hindsight experience replay (HER) (Andrychowicz et al., 2017), which samples goals from future achieved states along the same trajectory.

In standard GCRL with sparse rewards, the reward function is defined as $r(s,g)=\mathds{1}[s=g]$, where the agent receives $1$ only upon reaching the goal and $0$ otherwise. Consequently, most state-action pairs yield no learning signal for states far from the goal. To address this, following prior work (Eysenbach et al., 2020, 2022), a probabilistic reward can be defined as $r(s,a,g)\triangleq(1-\gamma)\cdot p(g|s^{\prime})$, where $s^{\prime}$ is the next state. Under this formulation, the goal-conditioned Q-function corresponds to the discounted state occupancy measure (Eysenbach et al., 2022; Bortkiewicz et al., 2025):

$$
\begin{aligned}
Q^{\pi}(s,a,g) &= p^{\pi}_{+}(s^{+}=g \mid s_{0}=s,a_{0}=a) \\
&\triangleq (1-\gamma)\sum_{k=0}^{\infty}\gamma^{k}\,p^{\pi}(s_{k+1}=g \mid s_{0}=s,a_{0}=a),
\end{aligned}
\tag{1}
$$

where $s^{+}\triangleq s_{K}$ for $K\sim\mathrm{Geom}(1-\gamma)$ denotes a future state sampled at a geometrically distributed time step. Unlike sparse rewards that are directly observed, this formulation requires learning a density model to estimate Q-values.
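The geometric future-state distribution in Equation 1 can be illustrated with a small sampler. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper name `sample_future_goal`, the offset convention (at least one step ahead, matching $s_{k+1}$ with $k\geq 0$), and clipping at the end of the trajectory are all choices made for illustration.

```python
import random

def sample_future_goal(trajectory, t, gamma, rng):
    """Pick a hindsight goal from the states after index t.

    The offset K is geometric: P(K = k + 1) = (1 - gamma) * gamma**k for k >= 0,
    which matches the weights in Equation 1. Offsets that run past the end of
    the trajectory are clipped to the final state (an assumption of this sketch).
    """
    k = 1
    while rng.random() < gamma:  # keep stepping forward with probability gamma
        k += 1
    return trajectory[min(t + k, len(trajectory) - 1)]

rng = random.Random(0)
states = list(range(100))  # toy trajectory: state i is simply the integer i
goals = [sample_future_goal(states, 10, 0.9, rng) for _ in range(20000)]
mean_offset = sum(g - 10 for g in goals) / len(goals)
```

With $\gamma=0.9$ the expected offset is $1/(1-\gamma)=10$ steps, so `mean_offset` should land close to 10; larger $\gamma$ pushes relabeled goals further into the future.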

2.2 Normalizing Flows for Q-Value Estimation

Normalizing Flows (NFs) (Zhai et al., 2024) are invertible generative models that learn a bijective mapping $f_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ from a complex data distribution to a simple prior $p_{0}$ (typically standard Gaussian), with density computed exactly via the change of variables formula:

$$
p_{\theta}(x)=p_{0}(f_{\theta}(x))\left|\det\frac{\partial f_{\theta}(x)}{\partial x}\right|. \tag{2}
$$

Following Ghugare & Eysenbach (2025), NFs can be constructed using coupling layers (Dinh et al., 2017). For the $t$-th block with input $x^{t}$ and condition $y$:

$$
\begin{aligned}
x^{t}_{1},x^{t}_{2} &= \mathrm{split}(x^{t}), \\
\tilde{x}^{t}_{2} &= \bigl(x^{t}_{2}+a^{t}_{\theta}(x^{t}_{1},y)\bigr)\times\exp\bigl(-s^{t}_{\theta}(x^{t}_{1},y)\bigr), \\
\tilde{x}^{t} &= \mathrm{concat}(x^{t}_{1},\tilde{x}^{t}_{2}),
\end{aligned}
\tag{3}
$$

where $\mathrm{split}(\cdot)$ partitions the input into two halves along the feature dimension, and $a_{\theta}^{t}$ and $s_{\theta}^{t}$ are neural networks that output translation and scale parameters, respectively.
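To make Equations 2 and 3 concrete, the following sketch implements one conditional affine coupling block in plain Python, together with its inverse and the resulting exact log-density. The functions `toy_shift` and `toy_scale` are hypothetical fixed stand-ins for the learned networks $a_{\theta}$ and $s_{\theta}$; a real flow would stack several such blocks with learned parameters.

```python
import math

def coupling_forward(x, y, shift_fn, scale_fn):
    """One conditional affine coupling block (Equation 3).

    x is a list of floats with even length and y is the conditioning value.
    Returns the transformed vector and log|det Jacobian|.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    a = shift_fn(x1, y)
    s = scale_fn(x1, y)
    x2_new = [(v + a_i) * math.exp(-s_i) for v, a_i, s_i in zip(x2, a, s)]
    log_det = -sum(s)  # triangular Jacobian with diagonal entries exp(-s)
    return x1 + x2_new, log_det

def coupling_inverse(x_tilde, y, shift_fn, scale_fn):
    """Invert the block: x1 passes through unchanged, so a and s can be recomputed."""
    half = len(x_tilde) // 2
    x1, x2_new = x_tilde[:half], x_tilde[half:]
    a = shift_fn(x1, y)
    s = scale_fn(x1, y)
    return x1 + [v * math.exp(s_i) - a_i for v, a_i, s_i in zip(x2_new, a, s)]

def log_density(x, y, shift_fn, scale_fn):
    """Equation 2 in log form with a standard Gaussian prior p_0."""
    z, log_det = coupling_forward(x, y, shift_fn, scale_fn)
    log_p0 = sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in z)
    return log_p0 + log_det

# Hypothetical stand-ins for the learned networks (fixed functions, not trained).
def toy_shift(x1, y):
    return [0.5 * v + y for v in x1]

def toy_scale(x1, y):
    return [math.tanh(v - y) for v in x1]

x = [0.3, -1.2, 2.0, 0.7]
z, log_det = coupling_forward(x, 0.4, toy_shift, toy_scale)
x_back = coupling_inverse(z, 0.4, toy_shift, toy_scale)
```

Because the first half of the input passes through unchanged, the Jacobian is triangular and its log-determinant is just the negative sum of the scale outputs, which is why the density is both exact and cheap to evaluate.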

In Offline GCRL, NFs can directly estimate the Q-function in Equation 1 by modeling $p_{\theta}(g|s,a)$ (Ghugare & Eysenbach, 2025). The conditioning information $(s,a)$ is encoded by a state-action encoder, and the behavior Monte Carlo (MC) Q-value is obtained as:

$$
Q^{\beta}_{\theta}(s,a,g)=\log p_{0}(f_{\theta}(g;z))+\log\left|\det\frac{\partial f_{\theta}(g;z)}{\partial g}\right|, \tag{4}
$$

where $f_{\theta}(\cdot;z)$ is the conditional NF mapping goals to the latent space, with $z$ being the encoded state-action representation. Note that $Q^{\beta}_{\theta}(s,a,g)=\log p_{\theta}(g|s,a)$ represents the log-probability, which serves as an unnormalized score for conditioning. During inference, we use $\exp(Q^{\beta}_{\theta})$ when a probability interpretation is needed.

In practice, the NF is trained via maximum likelihood on hindsight-relabeled transitions:

$$
\mathcal{L}_{\text{NFs}}=-\mathbb{E}_{(s_{t},a_{t},g)\sim\mathcal{D}}\left[\log p_{\theta}(g|s_{t},a_{t})\right]. \tag{5}
$$

2.3 Sequence Modeling for Decision Making

Decision Transformer (DT) (Chen et al., 2021) models decision-making from offline datasets as a sequence modeling problem. Unlike traditional RL methods that estimate Q-functions or compute policy gradients, DT generates an action $a_{t}$ at timestep $t$ conditioned on the context of the previous $K$ timesteps along with the current state and return-to-go (RTG). The input sequence is formulated as $\tau=(\hat{R}_{t-K+1},s_{t-K+1},a_{t-K+1},\ldots,\hat{R}_{t-1},s_{t-1},a_{t-1},\hat{R}_{t},s_{t})$, where the RTG $\hat{R}_{t}=\sum_{t^{\prime}=t}^{T}r_{t^{\prime}}$ is defined as the sum of rewards from the current step to the end of the trajectory and $K$ is the context length. For each timestep, three tokens (RTG, state, and action) are embedded and fed into the model. DT employs a causal Transformer that leverages self-attention layers to capture long-range dependencies.
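The RTG definition and token layout above can be sketched in a few lines. The helper names `returns_to_go` and `interleave_tokens` and the tagged-tuple token format are illustrative assumptions; an actual DT embeds each token with a learned projection.

```python
def returns_to_go(rewards):
    """R_hat_t = sum of rewards from step t to the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def interleave_tokens(rtg, states, actions):
    """Build (R_1, s_1, a_1, ..., R_K, s_K); the final action is left to predict."""
    tokens = []
    for i, (r, s) in enumerate(zip(rtg, states)):
        tokens += [("rtg", r), ("state", s)]
        if i < len(actions):
            tokens.append(("action", actions[i]))
    return tokens

success = returns_to_go([0, 0, 0, 1])  # goal reached at the final step
failure = returns_to_go([0, 0, 0, 0])  # goal never reached
context = interleave_tokens(success, ["s0", "s1", "s2", "s3"], ["a0", "a1", "a2"])
```

With a single sparse terminal reward, the RTG is constant along each trajectory (all ones on success, all zeros on failure), so it carries no per-state information.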

Decision Mamba (DMamba) (Ota, 2024) integrates the Mamba (Gu & Dao, 2024) architecture into the DT framework by replacing self-attention with the Mamba block. The DMamba block first applies a one-dimensional causal convolution to extract local features:

$$
x^{\prime}=\text{SiLU}(\text{Conv1d}(x)), \tag{6}
$$

where Conv1d operates with a local kernel over adjacent positions. The transformed sequence is then processed by the discrete-time selective state space model (SSM):

$$
h_{t}=\bar{A}h_{t-1}+\bar{B}x^{\prime}_{t},\quad y_{t}=Ch_{t}, \tag{7}
$$

where $h_{t}$ is the hidden state and $y_{t}$ is the output. The key innovation of Mamba is the input-dependent selective mechanism:

$$
\begin{aligned}
B &= \text{Linear}_{B}(x^{\prime}),\quad C=\text{Linear}_{C}(x^{\prime}), \\
\Delta &= \text{softplus}(\text{Linear}_{\Delta}(x^{\prime})),
\end{aligned}
\tag{8}
$$

where $\Delta$ controls the discretization step size.

3 QHyer: Unlocking Sequence Modeling for Offline GCRL

While sequence modeling naturally addresses the non-Markovian challenge, DT-based methods exhibit critical limitations when applied to GCRL. We propose QHyer, which introduces NFs-based Q-value conditioning (Section 3.1) and a Hybrid Attention-Mamba architecture (Section 3.2) to overcome these limitations. The overall architecture is illustrated in Figure 1.

Figure 1: Overview of the QHyer architecture. The framework consists of four main components: (1) an NFs Q-value estimator that replaces RTG conditioning, (2) a Hybrid Attention-Mamba block, (3) concatenated state-goal tokenization for effective goal information propagation, and (4) reinforced learning with expectile regression.

3.1 Limitation 1: RTG Fails Under Sparse Rewards

In standard DT-based methods, the return-to-go (RTG) serves as the conditioning signal that guides action generation. However, RTG is fundamentally inadequate for Offline GCRL with sparse binary rewards.

The Root Cause: Trajectory-Dependence Prevents Stitching. The fundamental limitation of RTG lies in its trajectory-dependence: RTG answers “did this trajectory succeed?” rather than “how valuable is this state for reaching the goal?” Consider a state $s$ that appears on both a successful trajectory (RTG=1) and a failed one (RTG=0). RTG assigns contradictory values to the same state based solely on trajectory outcome, making cross-trajectory comparison impossible. This directly prevents trajectory stitching because composing segments from different trajectories requires a trajectory-agnostic value metric that RTG fundamentally cannot provide. As shown in Figure 2 (a, b), successful and failed trajectories receive uniformly different RTG values regardless of state quality, with only 25% of state-action pairs receiving discriminative signals.

Figure 2: RTG vs. NFs-based Q-value conditioning in the D4RL non-Markovian AntMaze-medium dataset. (a) Trajectories: successful (blue) and failed (purple). (b) RTG conditioning: successful trajectories show a color gradient while failed trajectories are uniformly gray (no signal), yielding poor coverage. (c) NFs-based $Q^{\beta}$ conditioning: all trajectories colored by Q-value, achieving significantly better coverage. (d) High-Q segments: trajectory portions where $Q^{\beta}$ exceeds a threshold naturally form paths toward the goal, enabling trajectory stitching even from failed demonstrations. Coverage percentages indicate the fraction of state-action pairs with discriminative signals (variance > 0.01 across trajectory segments).

Our Key Insight: From Trajectory-Dependence to State-Dependence. The Q-function $Q^{\beta}(s,a,g)=p^{\beta}_{+}(g|s,a)$ represents the probability of reaching goal $g$ from state-action pair $(s,a)$, measured independently of which trajectory that pair came from. This state-dependence enables a fundamentally new capability: identifying high-value segments from failed trajectories (they have high $Q^{\beta}$ despite low RTG) and composing them toward goals. Figure 2 (c) confirms this prediction, showing that Q-value conditioning achieves 92% coverage compared to RTG’s 25%. Figure 2 (d) further illustrates that high-Q segments naturally form paths toward goals even when extracted from failed demonstrations.

Why MC Estimation Instead of TD Learning. Having established the need for Q-value conditioning, we must choose how to estimate Q-values. Many standard offline RL methods (Fujimoto & Gu, 2021; Kostrikov et al., 2022) are built upon temporal difference (TD) learning. While TD learning can learn optimal value functions and possesses stitching capabilities, its reliance on bootstrapping leads to compounding errors that hinder the acquisition of optimal policies, especially in long-horizon tasks (Myers et al., 2025; Park et al., 2026). In contrast, MC learning directly estimates the cumulative reward for reaching a goal. By integrating it with a maximum Q-expectile regression loss proposed in our later analysis, we theoretically demonstrate that our method can also converge to an optimal stitched policy. In empirical evaluations, recent MC-based contrastive RL approaches (Eysenbach et al., 2022; Myers et al., 2025) have been shown to consistently and significantly outperform TD-based methods on long-horizon GCRL tasks.

Why NFs for MC Q-Estimation. Given that MC estimation is preferable, we must choose how to model the Q-value density $p^{\beta}_{+}(g\mid s,a)$. Our framework places one structural requirement on this density model: it must produce an exact, properly normalized log-density. The expectile target in Equation 10 is defined on $\log p_{\theta}(g\mid s,a)$ directly, and the transformer consumes Q-tokens that span multiple goals within one context window (Section 3.3), so goal-independent normalization is necessary for the learned Q-to-action pattern to transfer across goals. This requirement rules out the otherwise reasonable alternatives.

Conditional VAEs (Sohn et al., 2015) produce only the ELBO, a structural lower bound that cannot be closed by increasing capacity, and which distorts the Q-landscape in a goal-dependent way. Contrastive RL (Eysenbach et al., 2022) trains a binary cross-entropy classifier whose Bayes-optimal output is the log density ratio $\log\frac{p(g\mid s,a)}{p(g)}$. While the goal-dependent partition $p(g)$ cancels when selecting actions at a fixed goal, it introduces goal-dependent offsets in the Q-token sequence our transformer reads across multiple goals, which degrades cross-goal conditioning. Diffusion models (Ho et al., 2020) and continuous flow-matching objectives (Lipman et al., 2023) can reach high sample quality, but their per-sample likelihood requires solving a probability-flow ODE with a Hutchinson trace estimator (Grathwohl et al., 2018), injecting variance into precisely the signal that expectile regression must fit.

Coupling-based NFs (Dinh et al., 2017) uniquely meet the requirement. The triangular Jacobian makes $\log p_{\theta}(g\mid s,a)$ exactly and cheaply computable in closed form, and coupling architectures are universal diffeomorphism approximators (Teshima et al., 2020), so no structural gap remains. Figure 17 (Section G.4) empirically confirms that NFs attain the lowest estimation error against the analytic future-state density among CVAE, CRL and MC C-learning. Because accurate, normalized Q-values are the bottleneck for trajectory stitching under sparse rewards (Figure 5), this is the property we optimize for.

Expectile Regression for In-Distribution Optimal Q-Value Prediction. Given accurate $Q^{\beta}$ estimates from NFs, we still need to extract optimal behaviors from suboptimal data. The expectile regression loss (Kostrikov et al., 2022; Wu et al., 2023; Zhuang et al., 2024) asymmetrically weights prediction errors:

$$
L_{\tau}^{2}(u)=|\tau-\mathds{1}(u<0)|\cdot u^{2}, \tag{9}
$$

where $\tau\in(0.5,1)$ controls the asymmetry. When $\tau>0.5$, the loss penalizes underestimation more heavily, causing the learned value to concentrate on the upper portion of the empirical distribution. Applying this to Q-value prediction, we define:

$$
\mathcal{L}_{Q}=\mathbb{E}_{(s_{t},a_{t},g)\sim\mathcal{D}}\left[L^{2}_{\tau}\left(Q^{\beta}_{\theta}(s_{t},a_{t},g)-\hat{Q}_{\phi}(s_{t},g)\right)\right], \tag{10}
$$

where $\hat{Q}_{\phi}(s_{t},g)$ is the Q-value predicted by the Hybrid Attention-Mamba transformer with parameters $\phi$, and $Q^{\beta}_{\theta}(s_{t},a_{t},g)$ is the target from the NFs-based critic (Equation 4). Our theoretical analysis (Section 3.5) demonstrates that this enables our sequential model, QHyer, to predict Q-values that approach the in-distribution maximum. These predictions correspond to the high-Q segments shown in Figure 2 (d), which are essential for trajectory stitching.
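The effect of the asymmetry parameter can be checked numerically. The sketch below fits a single scalar prediction to a handful of toy targets by gradient descent on the expectile loss of Equation 9; the target values, learning rate, and helper name `fit_expectile` are illustrative assumptions, and the actual model predicts $\hat{Q}_{\phi}(s_{t},g)$ with a transformer rather than a scalar.

```python
def expectile_loss(u, tau):
    """L_tau^2(u) = |tau - 1(u < 0)| * u^2 (Equation 9)."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u

def fit_expectile(targets, tau, lr=0.1, steps=5000):
    """Gradient descent on the mean expectile loss over a scalar prediction q."""
    q = sum(targets) / len(targets)
    for _ in range(steps):
        grad = 0.0
        for y in targets:
            u = y - q  # target minus prediction, as in Equation 10
            weight = (1.0 - tau) if u < 0 else tau
            grad += -2.0 * weight * u
        q -= lr * grad / len(targets)
    return q

targets = [-5.0, -4.0, -3.5, -1.0, -0.5]  # toy log-density targets
q_mean = fit_expectile(targets, tau=0.5)
q_high = fit_expectile(targets, tau=0.9)
q_top = fit_expectile(targets, tau=0.99)
```

At $\tau=0.5$ the fit recovers the mean of the targets; as $\tau$ approaches 1 the prediction moves toward the largest target without exceeding it, which is the in-distribution maximum behavior the analysis relies on.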

3.2 Limitation 2: Temporal Modeling Requires Content-Adaptive History Compression

Beyond the conditioning signal, effective sequence modeling for Offline GCRL demands architectures that can capture heterogeneous temporal dependencies inherent in Offline GCRL datasets.

Why Offline GCRL Data Exhibits Variable-Length Historical Dependencies. Offline GCRL datasets exhibit different temporal structures depending on behavior policy properties. As documented in OGBench (Park et al., 2025a), the manipulation suite provides two representative dataset types: play datasets collected by non-Markovian expert policies with temporally correlated noise where the behavior policy follows $\beta(a_{t}|s_{t},h_{<t})$, and noisy datasets collected by Markovian expert policies with uncorrelated Gaussian noise where $\beta(a_{t}|s_{t})$ depends only on the current state. The play data demands extended memory for action coherence, while noisy data requires only short-term local information. A principled solution must adapt to both properties without manual tuning.

Why Convolution Cannot Address Variable-Length Dependencies. To address the inherent tension of datasets exhibiting the two aforementioned properties, both LSDT (Wang et al., 2025) and DMixer (Zheng et al., 2025a) incorporate attention and convolution as parallel branches. Convolution-based local modeling computes features through causal convolution with fixed-size kernels:

$$
y_{t}=\text{Conv1d}(\mathbf{x})_{t}=\sum_{j=0}^{k-1}w_{j}\cdot x_{t-j}, \tag{11}
$$

where $k$ is the fixed kernel size and $w_{j}$ are input-independent weights. When convolution serves as the final output of a branch, this creates three fundamental limitations. First, convolution imposes a fixed receptive field set by the chosen kernel (and any dilation/stacking), making the effective context length a hand-tuned architectural prior that is sensitive to hyperparameters and often fails to transfer across datasets with different temporal dependencies. Second, in the first layer, convolution has a fixed receptive field and thus a fixed effective memory that cannot adapt to varying dependency lengths within or across datasets. Third, on non-Markovian trajectories where relevant cues lie beyond this window, the local branch becomes weakly informative and is often down-weighted by fusion.

Why Mamba Enables Content-Adaptive History Compression. We address the fixed-window bottleneck of a convolutional short-term branch by adopting a Mamba-style selective SSM (DMamba) module (Ota, 2024). A DMamba block combines (i) a lightweight causal convolution that mixes nearby tokens and produces local features $x^{\prime}_{t}$ (and gating signals), with (ii) a selective state-space update (Equations 7 and 8) that propagates a recurrent state across the entire prefix. Importantly, the effective memory is not determined by the convolutional kernel, but by the input-dependent SSM dynamics (via the selective discretization), which enables smooth, learned forgetting/retention over history. As a result, compared to using convolution as the branch output, DMamba provides a content-adaptive mechanism to compress long-range context into a compact state, reducing sensitivity to hand-tuned receptive fields and improving robustness on non-Markovian segments where disambiguating cues may lie beyond any fixed local window.

Figure 3: Content-adaptive $\Delta_{t}$ on cube-single. Left: $\Delta_{t}$ distribution. Center: mean $\Delta_{t}$ across sequence positions. Right: learned attention/Mamba gate weights. Statistics over 50 batches of batch size 256.

To make the adaptive history modeling explicit, we expand the SSM recurrence (Equation 7) to express the output at timestep $t$:

$$
y_{t}=\sum_{i=0}^{t}C_{t}\left(\prod_{j=i+1}^{t}\bar{A}_{j}\right)\bar{B}_{i}x^{\prime}_{i}, \tag{12}
$$

where $x^{\prime}_{i}$ is the convolution-extracted feature at step $i$. The influence of historical input $x^{\prime}_{i}$ on current output $y_{t}$ is governed by the cumulative decay $\prod_{j=i+1}^{t}\bar{A}_{j}$. Critically, through the selective mechanism (Equation 8), the discretization step $\Delta$ is input-dependent:

$$
\bar{A}_{t}=\exp(\Delta_{t}\cdot A),\quad\text{where}\quad\Delta_{t}=\text{softplus}(\text{Linear}_{\Delta}(x^{\prime}_{t})), \tag{13}
$$

and $A<0$ is a negative real value following Gu & Dao (2024). This creates content-adaptive effective memory: when $x^{\prime}_{t}$ yields small $\Delta_{t}$, the decay $\bar{A}_{t}=\exp(\Delta_{t}\cdot A)\approx 1$ preserves long-range history suitable for play data; when $x^{\prime}_{t}$ yields large $\Delta_{t}$, $\bar{A}_{t}\approx 0$ retains only local context appropriate for noisy data. In contrast, convolution imposes a fixed influence $w_{j}$ for $j<k$ and zero beyond, resulting in hard, input-independent truncation. The key distinction is that Mamba provides smooth, learned decay while DynamicConv enforces hard, fixed truncation.
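The recurrence, its unrolled form, and the role of $\Delta_{t}$ can be checked with a scalar toy model. This sketch assumes a one-dimensional state, fixed $A=-1$ and $C=1$, and the simplified input discretization $\bar{B}_{t}=\Delta_{t}$; none of these values come from the paper, and a real Mamba block uses vector states with learned, input-dependent $B$ and $C$.

```python
import math

def softplus(v):
    return math.log1p(math.exp(v))

def selective_scan(x, delta, A=-1.0, C=1.0):
    """Scalar recurrence h_t = A_bar_t * h_{t-1} + B_bar_t * x_t, y_t = C * h_t.

    A_bar_t = exp(delta_t * A) as in Equation 13. B_bar_t = delta_t is a
    simplifying assumption of this sketch.
    """
    h, ys = 0.0, []
    for x_t, d_t in zip(x, delta):
        h = math.exp(d_t * A) * h + d_t * x_t
        ys.append(C * h)
    return ys

def unrolled_output(x, delta, t, A=-1.0, C=1.0):
    """Equation 12: y_t as an explicit sum over past inputs with cumulative decay."""
    total = 0.0
    for i in range(t + 1):
        decay = math.exp(A * sum(delta[i + 1 : t + 1]))  # product of A_bar_j, j > i
        total += C * decay * delta[i] * x[i]
    return total

def influence_of_first_input(delta_value, steps, A=-1.0):
    """Cumulative decay applied to x_0 after `steps` updates with constant delta."""
    return math.exp(A * delta_value * steps)

x = [1.0, 0.5, -0.3, 2.0, 0.1]
delta = [softplus(v) for v in [-2.0, 0.0, 1.0, -1.0, 0.5]]
ys = selective_scan(x, delta)
slow_forgetting = influence_of_first_input(0.05, steps=10)  # small delta
fast_forgetting = influence_of_first_input(2.0, steps=10)   # large delta
```

After ten steps, a small step size of 0.05 leaves roughly 61% of the first input's influence, while a step size of 2.0 leaves essentially none, so the same module can behave like a long or a short memory depending on its input.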


Figure 4: Hybrid Attention-Mamba Block.

Hybrid Architecture with Attention-Mamba. As illustrated in Figure 4, we design a Hybrid Attention-Mamba architecture with two parallel branches: attention for global goal-directed planning and Mamba for temporal dynamics modeling. The outputs are fused through a learnable gating mechanism that computes a scalar weight $\alpha=\sigma(\mathbf{w}^{T}\mathbf{x}+b)$ to combine branch outputs: $\mathbf{y}=\alpha\cdot\mathbf{y}_{\text{attn}}+(1-\alpha)\cdot\mathbf{y}_{\text{mamba}}$. This enables complementary specialization across both play and noisy datasets.
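A minimal sketch of the gating rule, written for a single token with list-valued features. The weights, bias, and branch outputs below are made-up numbers for illustration, and which representation the gate reads as $\mathbf{x}$ is an assumption here, since the text only specifies the scalar form of $\alpha$.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(x, y_attn, y_mamba, w, b):
    """y = alpha * y_attn + (1 - alpha) * y_mamba with alpha = sigmoid(w^T x + b)."""
    alpha = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    fused = [alpha * ya + (1.0 - alpha) * ym for ya, ym in zip(y_attn, y_mamba)]
    return fused, alpha

x = [0.2, -0.4, 1.0]  # gate input for one token (hypothetical values)
fused, alpha = gated_fusion(x, [1.0, 1.0], [0.0, 2.0], w=[0.5, 0.5, 0.5], b=0.0)
```

Because $\alpha$ is a single scalar in $(0,1)$, the fused output is always a convex combination of the two branches, so the gate can shift weight between attention and Mamba but cannot amplify either.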

Figure 3 visualizes this adaptation on cube-single. On play, smaller $\Delta_{t}$ preserves an effective memory of about $12$ steps and the gate favors attention. On noisy, larger $\Delta_{t}$ contracts memory to about $3$ steps and the gate favors Mamba. The essential reason is that $\Delta_{t}=\text{softplus}(\text{Linear}_{\Delta}(x^{\prime}_{t}))$ is input-dependent, so effective memory tracks the local temporal correlation of the data, which convolution’s input-independent receptive field cannot do.

3.3 Concatenated State-Goal Tokenization Strategy

We represent each state-goal pair as a concatenated token $[s_{t};g]$ rather than separate tokens. Combined with NFs-based Q-value conditioning, the input sequence becomes $\tau=(Q_{1},[s_{1};g],a_{1},Q_{2},[s_{2};g],a_{2},\ldots,Q_{T},[s_{T};g],a_{T})$, where $Q_{t}=\log p_{\theta}(g|s_{t},a_{t})$ is the NFs-estimated Q-value. This design ensures goal information is directly available at each decision point without increasing sequence length from $3T$ to $4T$, avoiding quadratic computational overhead in attention. A detailed visual explanation of this tokenization strategy is provided in Section G.1.
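The token layout can be sketched as follows. The helper name `build_sequence`, the tagged-tuple representation, and the toy values are assumptions for illustration; the real model embeds each token before feeding it to the backbone.

```python
def build_sequence(q_values, states, goal, actions):
    """Interleave (Q_t, [s_t; g], a_t) so that each timestep contributes three tokens."""
    tokens = []
    for q, s, a in zip(q_values, states, actions):
        tokens.append(("q", q))
        tokens.append(("state_goal", list(s) + list(goal)))  # concatenation [s_t; g]
        tokens.append(("action", a))
    return tokens

states = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]
goal = [1.0, 1.0]
sequence = build_sequence([-2.0, -1.5, -0.7], states, goal, ["a1", "a2", "a3"])
```

For $T=3$ timesteps this yields 9 tokens, whereas a separate goal token per step would yield 12; the goal instead widens the state token's feature dimension.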

3.4 Training and Inference

Training. We train QHyer end-to-end with three losses:

$$
\mathcal{L}_{\text{QHyer}}=\lambda_{\text{critic}}\mathcal{L}_{\text{NFs}}+\lambda_{\text{BC}}\mathcal{L}_{\text{BC}}+\lambda_{Q}\mathcal{L}_{Q}, \tag{14}
$$

where $\mathcal{L}_{\text{BC}}$ is the behavior cloning loss that predicts actions conditioned on Q-values instead of RTG:

$$
\mathcal{L}_{\text{BC}}=-\mathbb{E}_{(s_{t},a_{t},g)\sim\mathcal{D}}\left[\log\pi_{\theta}(a_{t}|Q_{t},[s_{t};g])\right]. \tag{15}
$$

Inference. QHyer performs two-stage autoregressive generation: (1) predict the maximum Q-value $\hat{Q}(s_{t},g)$ from the current context; (2) predict the optimal action conditioned on the predicted maximum Q-value. The detailed algorithm is provided in Appendix D.

3.5 Theoretical Analysis

We establish convergence guarantees for QHyer: expectile regression yields near-optimal Q-values, and the learned policy has bounded suboptimality relative to the in-distribution optimal stitched policy, with explicit dependence on sample size, NFs accuracy, and coverage.

Setup. Let $Q^{\beta}(\tau,g,h)$ denote the goal-reaching probability conditioned on history. The in-distribution optimal Q-value $Q^{\star}(s,a,g,h):=\max_{\tau\in\mathcal{T}^{\beta}:(s_{h},a_{h})=(s,a)}Q^{\beta}(\tau,g,h)$ represents the maximum achievable within the behavior policy’s support. We assume: (i) Q-value coverage with constant $\tilde{c}\in(0,1]$, measuring the minimum density ratio of optimal actions in the dataset; (ii) bounded NFs error $\epsilon_{\text{NFs}}$; (iii) bounded function class with approximation error $\delta_{\text{approx}}$. Full definitions are in Appendix B.

Theorem 3.1 (Convergence of Expectile Regression to In-Distribution Optimal Q-Value).

Under Q-value coverage with constant $\tilde{c}\in(0,1]$ and sample size satisfying Equation 37 in the Appendix, for $\tau\in(0.5,1)$, the expectile estimator satisfies $|Q^{\star}-\hat{Q}^{\tau}|\leq\epsilon_{\tau}$ with high probability, where the bias term $\epsilon_{\tau}:=\frac{(1-\tau)(Q^{\star}-Q_{\min})}{\tau\cdot\tilde{c}/2+(1-\tau)(1-\tilde{c}/2)}$ decreases as $\tau$ increases, at the cost of requiring more samples for variance control.
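The behavior of the bias term is easy to check numerically. The coverage constant and Q-value range below are arbitrary illustrative values, not quantities reported in the paper.

```python
def expectile_bias(tau, coverage, q_star, q_min):
    """epsilon_tau from Theorem 3.1 for coverage constant c in (0, 1]."""
    numerator = (1.0 - tau) * (q_star - q_min)
    denominator = tau * coverage / 2.0 + (1.0 - tau) * (1.0 - coverage / 2.0)
    return numerator / denominator

taus = (0.6, 0.8, 0.9, 0.99)
biases = [expectile_bias(t, coverage=0.2, q_star=0.0, q_min=-10.0) for t in taus]
```

Dividing numerator and denominator by $(1-\tau)$ gives $\epsilon_{\tau}=(Q^{\star}-Q_{\min})/\bigl(\tfrac{\tilde{c}}{2}\cdot\tfrac{\tau}{1-\tau}+1-\tfrac{\tilde{c}}{2}\bigr)$, which makes the monotone decrease in $\tau$ explicit: the bias shrinks as $\tau/(1-\tau)$ grows, and it shrinks more slowly when coverage $\tilde{c}$ is small.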

Theorem 3.2 (Convergence to In-Distribution Optimal Stitched Policy).

Under assumptions (i) to (iii), the learned policy $\hat{\pi}^{\star}_{\mathcal{D}}$ satisfies:

$$
J(\pi^{\star}_{\beta})-J(\hat{\pi}^{\star}_{\mathcal{D}})\leq\underbrace{\mathcal{O}(N^{-1/4})}_{\text{policy}}+\underbrace{\sqrt{\delta_{\text{approx}}}}_{\text{MLE}}+\underbrace{\mathcal{O}(\sqrt{\epsilon_{\text{NFs}}}+\epsilon_{\tau})}_{\text{Q-value}}. \tag{16}
$$

Complete proofs are in Appendix C.

Table 1: Results on OGBench manipulation tasks. Average success rate (%) across 5 test-time goals. Results averaged over 8 seeds (4 for pixel-based). Bold = best (orange in the original), underline = second best.
Column groups: Hierarchical Policy (HIQL, SAW, OTA, Eik-HiQRL); Flat Policy (GCBC, GCIVL, GCIQL, QRL, CRL, QHyer).

| Env | Type | Dataset | HIQL | SAW | OTA | Eik-HiQRL | GCBC | GCIVL | GCIQL | QRL | CRL | QHyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cube | play | single | 15 ± 3 | 23 ± 2 | 13 ± 1 | 0 ± 0 | 6 ± 2 | 53 ± 4 | 68 ± 6 | 5 ± 1 | 19 ± 2 | **84** ± 4 |
| cube | play | double | 6 ± 2 | 26 ± 3 | 2 ± 1 | 0 ± 0 | 1 ± 1 | 36 ± 3 | 40 ± 5 | 1 ± 0 | 10 ± 2 | **56** ± 2 |
| cube | play | triple | 3 ± 1 | 19 ± 4 | 1 ± 0 | 0 ± 0 | 1 ± 1 | 1 ± 0 | 3 ± 1 | 0 ± 0 | 4 ± 1 | **10** ± 5 |
| cube | play | quadruple | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | **2** ± 1 | **2** ± 1 |
| cube | play | Total | 24 | 68 | 16 | 0 | 8 | 90 | 111 | 6 | 35 | 152 |
| cube | noisy | single | 41 ± 6 | 38 ± 2 | 40 ± 2 | 2 ± 1 | 8 ± 3 | 71 ± 9 | **99** ± 1 | 25 ± 6 | 38 ± 2 | 95 ± 5 |
| cube | noisy | double | 2 ± 1 | 12 ± 1 | 5 ± 2 | 0 ± 0 | 1 ± 1 | 14 ± 3 | 23 ± 3 | 3 ± 1 | 2 ± 1 | **30** ± 4 |
| cube | noisy | triple | 2 ± 1 | 13 ± 1 | 1 ± 0 | 0 ± 0 | 1 ± 1 | 9 ± 1 | 2 ± 1 | 1 ± 0 | 3 ± 1 | **14** ± 1 |
| cube | noisy | quadruple | 0 ± 0 | 1 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | **6** ± 4 |
| cube | noisy | Total | 45 | 64 | 46 | 2 | 10 | 94 | 124 | 29 | 43 | 145 |
| scene | play | scene | 38 ± 3 | **58** ± 3 | 19 ± 4 | 11 ± 3 | 5 ± 1 | 42 ± 4 | 51 ± 4 | 5 ± 1 | 19 ± 2 | 53 ± 2 |
| scene | noisy | scene | **25** ± 4 | **25** ± 2 | 12 ± 1 | 12 ± 0 | 1 ± 1 | **26** ± 5 | **26** ± 2 | 9 ± 2 | 1 ± 1 | **25** ± 5 |
| puzzle | play | 3x3 | 12 ± 2 | 6 ± 1 | 21 ± 5 | 9 ± 0 | 2 ± 0 | 6 ± 1 | **95** ± 1 | 1 ± 0 | 3 ± 1 | 92 ± 2 |
| puzzle | play | 4x4 | 7 ± 2 | 6 ± 1 | 5 ± 2 | 2 ± 1 | 0 ± 0 | 13 ± 2 | 26 ± 3 | 0 ± 0 | 0 ± 0 | **28** ± 5 |
| puzzle | play | 4x5 | 4 ± 1 | 2 ± 1 | 2 ± 1 | 0 ± 0 | 0 ± 0 | 7 ± 1 | 14 ± 1 | 0 ± 0 | 1 ± 0 | **31** ± 1 |
| puzzle | play | 4x6 | 3 ± 1 | 5 ± 0 | 1 ± 1 | 0 ± 0 | 0 ± 0 | 10 ± 2 | 12 ± 1 | 0 ± 0 | 4 ± 1 | **18** ± 2 |
| puzzle | play | Total | 26 | 19 | 29 | 11 | 2 | 36 | 147 | 1 | 8 | 169 |
| puzzle | noisy | 3x3 | 51 ± 11 | 78 ± 39 | 53 ± 6 | 6 ± 2 | 1 ± 0 | 42 ± 19 | **94** ± 3 | 0 ± 0 | 30 ± 6 | 89 ± 8 |
| puzzle | noisy | 4x4 | 16 ± 4 | 0 ± 0 | 0 ± 0 | 3 ± 1 | 0 ± 0 | 20 ± 3 | 29 ± 7 | 0 ± 0 | 0 ± 0 | **33** ± 6 |
| puzzle | noisy | 4x5 | 5 ± 1 | 15 ± 1 | 1 ± 1 | 2 ± 2 | 0 ± 0 | 19 ± 0 | 19 ± 0 | 0 ± 0 | 3 ± 2 | **26** ± 2 |
| puzzle | noisy | 4x6 | 2 ± 1 | 11 ± 2 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 17 ± 2 | **18** ± 2 | 0 ± 0 | 6 ± 3 | **24** ± 3 |
| puzzle | noisy | Total | 74 | 104 | 54 | 11 | 1 | 98 | 160 | 0 | 39 | 172 |
| visual-cube | play | single | **89** ± 0 | 88 ± 3 | 40 ± 4 | / | 5 ± 1 | 60 ± 5 | 30 ± 5 | 41 ± 15 | 31 ± 15 | 42 ± 2 |
| visual-cube | play | double | 39 ± 2 | **40** ± 3 | 36 ± 17 | / | 1 ± 1 | 10 ± 2 | 1 ± 1 | 5 ± 0 | 2 ± 1 | 37 ± 3 |
| visual-cube | play | triple | **21** ± 0 | 20 ± 1 | 24 ± 2 | / | 15 ± 2 | 14 ± 2 | 15 ± 1 | 16 ± 1 | 17 ± 2 | 20 ± 4 |
| visual-cube | play | Total | 149 | 148 | 100 | / | 21 | 84 | 46 | 62 | 50 | 99 |
| visual-scene | play | scene | 49 ± 4 | 47 ± 6 | 62 ± 3 | / | 12 ± 2 | 25 ± 3 | 12 ± 2 | 10 ± 1 | 11 ± 2 | **96** ± 1 |
| visual-scene | noisy | scene | 50 ± 1 | 54 ± 3 | **63** ± 2 | / | 13 ± 2 | 23 ± 2 | 12 ± 4 | 2 ± 0 | 15 ± 2 | 36 ± 4 |

Table 2: Results on D4RL. Normalized scores (5 seeds) from original papers except QHyer. Orange = best, underline = second best.
Column groups: RL (CQL, IQL); Supervised Learning (all remaining columns).

| Antmaze-v2 | CQL | IQL | DT | RvS | EDT | CGDT | DC | DMamba | Reinformer | QT | LSDT | QHyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| umaze | 74.0 | 87.5 | 64.5 | 65.4 | 67.8 | 71.0 | 85.0 | 81.8 | 84.4 | 96.7 | 80.0 | 98.4 ± 1.9 |
| umaze-diverse | 84.0 | 62.2 | 60.5 | 60.9 | 58.3 | 71.0 | 78.5 | 71.6 | 65.8 | 96.7 | 83.2 | 97.1 ± 2.3 |
| medium-play | 61.2 | 71.2 | 0.8 | 58.1 | 0.0 | / | 1.5 | 79.6 | 13.2 | / | 85.5 | 92.2 ± 3.5 |
| medium-diverse | 53.7 | 70.0 | 0.5 | 67.3 | 0.0 | / | 0.0 | 83.2 | 10.6 | 59.3 | 75.8 | 94.0 ± 2.7 |
| large-play | 15.8 | 39.6 | 0.0 | 32.4 | 0.0 | / | 0.0 | 23.2 | 0.4 | / | 0.0 | 44.2 ± 1.9 |
| large-diverse | 14.9 | 47.5 | 0.0 | 32.9 | 0.0 | / | 0.0 | 34.6 | 0.4 | 53.3 | 0.0 | 57.5 ± 13.5 |
| Total | 303.6 | 378.0 | 126.3 | 317.0 | 126.1 | / | 165.0 | 374.0 | 174.8 | / | 324.5 | 483.4 |

| Maze2d | CQL | IQL | DT | QDT | GDT | VDT | DC | DMamba | DMixer | QT | LSDT | QHyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| umaze | 94.7 | 74.0 | 31.0 | 57.3 | 50.4 | 60.3 | 20.1 | 83.4 | 86.9 | 105.4 | 72.3 | 118.5 ± 1.9 |
| medium | 41.8 | 84.0 | 8.2 | 13.3 | 7.8 | 88.0 | 38.2 | 98.7 | 95.2 | 172.0 | 68.4 | 173.0 ± 11.9 |
| Total | 136.5 | 158.0 | 39.2 | 70.6 | 58.2 | 148.3 | 58.3 | 182.1 | 182.1 | 277.4 | 140.7 | 291.5 |

4 Experiments

We extensively evaluate QHyer’s effectiveness and conduct ablation studies on both non-Markovian and Markovian offline GCRL datasets.

Datasets. We consider two widely used benchmarks. For OGBench (Park et al., 2025a), we evaluate on manipulation tasks including cube, scene, and puzzle environments with both play (non-Markovian) and noisy (Markovian) datasets. For D4RL (Fu et al., 2020), we evaluate on Maze (non-Markovian) tasks. A detailed introduction to these environments is presented in Appendix F.

Baselines. We compare QHyer against three categories of methods: (1) sequence modeling methods including DT (Chen et al., 2021), EDT (Wu et al., 2023), GDT (Hu et al., 2023), QDT (Yamagata et al., 2023), CGDT (Wang et al., 2024), Reinformer (Zhuang et al., 2024), DC (Kim et al., 2024b), DMamba (Ota, 2024), QT (Hu et al., 2024), LSDT (Wang et al., 2025), DMixer (Zheng et al., 2025a), and VDT (Zheng et al., 2025b); (2) TD-based methods including CQL (Kumar et al., 2020) and IQL (Kostrikov et al., 2022); (3) offline GCRL methods including GCBC (Ghosh et al., 2021), GCIVL, GCIQL (Kostrikov et al., 2022), QRL (Wang et al., 2023), CRL (Eysenbach et al., 2022), HIQL (Park et al., 2023), SAW (Zhou & Kao, 2025), OTA (Ahn et al., 2025), and Eik-HiQRL (Giammarino & Qureshi, 2026). For completeness, Section G.6 additionally reports comparisons against four recent offline RL methods (i.e., QCFQL (Li et al., 2025), SHARSA (Park et al., 2025b), Transitive RL (Park et al., 2026), and DEAS (Kim et al., 2026)) adapted to GCRL with HER, as well as GAS (Baek et al., 2025) on navigation manipulation.

4.1 OGBench Results

Table 1 supports our core claims about sequence modeling for non-Markovian offline GCRL. On state-based play datasets collected by non-Markovian expert policies, QHyer attains the highest total success rate on cube and puzzle and stays close to the best baseline on scene (53 vs. 58 for SAW). Hierarchical methods (HIQL, SAW, OTA) underperform on most state-based play datasets because their subgoal decomposition assumes Markovian transitions between subgoals, an assumption violated when behavior policies exhibit temporal correlations. Eik-HiQRL further suffers from exponential quasimetric approximation error in high-dimensional spaces (Giammarino & Qureshi, 2026), limiting its effectiveness across both state-based and visual manipulation tasks. TD-based hierarchical methods (HIQL, SAW, OTA) achieve competitive performance on visual tasks, and lead on visual-cube, because hierarchical value functions provide representation learning signals beneficial for pixel inputs. On noisy datasets, QHyer maintains competitive performance through adaptive gating between attention and Mamba branches.

4.2 D4RL Results

Table 2 confirms QHyer’s advantages on long-horizon navigation tasks where trajectory stitching is essential. QHyer consistently outperforms both TD-based methods and sequence modeling baselines, with the most pronounced gains on large mazes requiring extensive stitching. Vanilla DT and its variants (EDT, DC) achieve near-zero performance on medium and large mazes, directly confirming our analysis in Section 3.1: RTG under sparse goal-conditioned rewards reduces to binary signals that provide no discriminative information for stitching trajectories. These results also demonstrate the effectiveness of our method on non-Markovian locomotion tasks.
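The collapse of RTG to a binary signal can be seen in a small worked example. The sketch below assumes an undiscounted return and a 0/1 reward that is given only on the step that reaches the goal; the two reward sequences are hypothetical.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: RTG_t is the sum of rewards from step t onward."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


# Hypothetical sparse-reward trajectories: reward 1 only when the goal is reached.
success = [0.0, 0.0, 0.0, 0.0, 1.0]
failure = [0.0, 0.0, 0.0, 0.0, 0.0]

rtg_success = returns_to_go(success)
rtg_failure = returns_to_go(failure)
print(rtg_success)  # the same value at every step of the successful trajectory
print(rtg_failure)  # the same value at every step of the failed trajectory
```

Every step of a trajectory carries the same RTG, so the signal identifies which trajectory a state came from but not how close that state is to the goal, which is the information stitching needs.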

4.3 Ablation Studies

Figure 5: Ablation study on various Q-value estimators and the impact of not estimating Q-values on QHyer.

Q: How does the Q-value estimator affect performance?

A: Figure 5 reveals a consistent ordering: No Q < CVAE < CRL < NFs, directly reflecting the relationship between density estimation accuracy and policy quality established in Section 3.1. Without Q-values, the model degenerates to behavior cloning that cannot distinguish states by their proximity to goals under sparse rewards. CVAE introduces systematic bias through the ELBO gap, distorting the goal-reaching probability landscape. CRL improves through contrastive objectives but inherits negative sampling bias that underestimates probabilities for distant goals. NFs achieve the best performance by computing exact likelihoods through invertible transformations (Equation 2), enabling accurate identification of high-value state-action pairs via expectile regression. This mechanism is essential for extracting optimal behaviors from suboptimal data.
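The exact-likelihood property of NFs comes from the change-of-variables formula. The sketch below is a one-dimensional illustration using a single affine flow, which is an assumption made for brevity and not the flow architecture used in QHyer; Equation 2 itself is not reproduced here. It checks that the flow log-density matches the closed-form Gaussian log-density.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, shift, log_scale):
    """Exact log p(x) for the flow x = shift + exp(log_scale) * z, z ~ N(0, 1)."""
    z = (x - shift) * math.exp(-log_scale)  # inverse transform x -> z
    log_det_inverse = -log_scale            # log |dz/dx|
    return standard_normal_logpdf(z) + log_det_inverse


def gaussian_logpdf(x, mean, std):
    """Closed-form reference: log-density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


flow_value = affine_flow_logpdf(1.3, shift=0.5, log_scale=math.log(2.0))
reference = gaussian_logpdf(1.3, mean=0.5, std=2.0)
print(flow_value, reference)
```

No lower bound or sampling approximation enters the computation, in contrast to the ELBO used by a CVAE or the contrastive estimate used by CRL.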

Figure 6: Ablation study on SSM variants for temporal modeling. '-u', '-m', and '-d' denote umaze, medium, and diverse, respectively.

Q: Does the architecture alone improve performance?

A: Figure 6 isolates the architectural contribution by removing NFs-based Q-conditioning from all methods, using standard RTG instead. The results reveal a consistent ordering: LSDT < DMixer < QHyer across both AntMaze and Maze2d environments. This indicates that the architectural gains hold independently of Q-conditioning. LSDT’s Dynamic Convolution branch is limited by its fixed kernel size, which cannot adaptively capture dependencies of varying ranges. DMixer’s token-level selection mechanism improves upon LSDT but may disrupt continuous action patterns through discrete token dropping. In contrast, QHyer’s Mamba branch maintains compressed hidden states that enable content-adaptive dependency modeling. The selective SSM parameters ($B$, $C$, $\Delta$) dynamically determine how much historical context to retain based on input content, rather than relying on predefined kernel sizes or discrete selection thresholds. Combined with the results in Figure 5, this demonstrates that QHyer’s two innovations provide complementary and additive performance improvements.

Figure 7: Ablation study on the expectile parameter $\tau$ for Q-value prediction.

Q: How should the expectile parameter $\tau$ be selected?

A: Figure 7 shows monotonic improvement from $\tau=0.5$ to $\tau=0.9$, with optimal performance at $\tau\in[0.9,0.95]$. This validates Theorem 3.1: higher $\tau$ reduces the bias term $\epsilon_{\tau}$ by focusing on upper expectiles, enabling identification of high-Q segments for trajectory stitching that RTG’s trajectory-dependence fundamentally cannot provide. However, extreme $\tau$ causes degradation by over-concentrating on too few samples, increasing estimation variance. This aligns with our theoretical analysis where $\epsilon_{\tau}$ depends on both $\tau$ and coverage $\tilde{c}$. As $\tau$ approaches 1, sensitivity to coverage limitations amplifies. We use $\tau=0.9$ for low-coverage (play) and $\tau=0.95$ for high-coverage (noisy) data.
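The effect of $\tau$ can be illustrated with the standard asymmetric squared loss behind expectile regression (Newey & Powell, 1987). The sketch below is not the paper’s training code: the sample values are hypothetical, and the fixed-point solver `sample_expectile` is an illustrative helper that finds the minimizer of the loss for a scalar estimate.

```python
def expectile_loss(residual, tau):
    """Asymmetric squared loss |tau - 1(residual < 0)| * residual^2."""
    weight = tau if residual > 0 else 1.0 - tau
    return weight * residual ** 2


def sample_expectile(values, tau, iters=200):
    """Scalar minimizer of the summed expectile loss, via a fixed-point iteration."""
    estimate = sum(values) / len(values)
    for _ in range(iters):
        # Samples above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if v > estimate else 1.0 - tau for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


# Hypothetical goal-reaching values observed for one state-action pair.
values = [0.1, 0.2, 0.2, 0.3, 0.9]
for tau in (0.5, 0.7, 0.9, 0.99):
    print(tau, round(sample_expectile(values, tau), 4))
```

At $\tau=0.5$ the estimate is the sample mean, and it moves toward the largest observed value as $\tau$ increases. With few samples near the top, the estimate at large $\tau$ is driven by very few points, which is the variance cost discussed above.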

Q: Is the Hybrid architecture’s gain actually architectural, or is it confounded by Q-conditioning, and does Mamba truly adapt its memory to data type?

A: We answer both jointly. Table 3 fixes NFs Q-conditioning and varies only the backbone. On the non-Markovian cube-single-play, Attention-only reaches 74, Mamba-only 80, and Hybrid 84. On the Markovian cube-single-noisy, the corresponding values are 60, 91, and 95. The Hybrid beats the best single branch by 4 points in both regimes, which suggests that the scalar gate captures genuinely complementary specialization rather than interpolating two near-identical branches.

Table 3: Backbone ablation with identical NFs Q-conditioning. Mean success rate (%) over 4 seeds. Bold = best (orange in the original), underline = second best.

| Environment | Attention-only | Mamba-only | Hybrid (QHyer) |
|---|---|---|---|
| cube-single-play (non-Markov.) | 74 ± 1 | 80 ± 2 | **84** ± 4 |
| cube-single-noisy (Markov.) | 60 ± 3 | 91 ± 3 | **95** ± 5 |
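A scalar gate of the kind discussed here can be sketched as a convex combination of the two branch outputs. The parameterization below (a sigmoid over one learned logit) is an assumption for illustration only; the gate used in QHyer is specified in the method section, not in this section, and the function name `gated_hybrid` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_hybrid(attn_out, mamba_out, gate_logit):
    """Mix per-feature branch outputs with one scalar gate.

    The attention branch receives weight g = sigmoid(gate_logit) and the
    Mamba branch receives 1 - g, so the two weights always sum to one.
    """
    g = sigmoid(gate_logit)
    return [g * a + (1.0 - g) * m for a, m in zip(attn_out, mamba_out)]


# Hypothetical branch outputs chosen so that the mixing weights can be read off.
attn_out = [1.0, 0.0]
mamba_out = [0.0, 1.0]
mixed = gated_hybrid(attn_out, mamba_out, gate_logit=math.log(0.57 / 0.43))
print(mixed)  # first entry is the attention weight, second is the Mamba weight
```

A gate of this form can only help beyond the best single branch when the two branches make different errors, which is the complementarity that the ablation in Table 3 is probing.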

Table 4 then explains why, by extracting Mamba’s $\Delta_{t}$ and the learned gate weight from the trained model (cf. Figure 3). On play, mean $\Delta_{t}=0.38$ and $\bar{A}_{t}=0.92$, so the SSM retains about 12 steps of effective history, and the gate shifts 0.57 of capacity to attention for global goal-directed reasoning. On noisy, $\Delta_{t}=1.05$ and $\bar{A}_{t}=0.61$, so memory collapses to about 3 steps, and the gate shifts 0.58 to Mamba. The essential reason is that Mamba’s selective mechanism makes $\Delta_{t}$ a function of the input, so effective memory varies per-token with the local temporal correlation, which convolution-based hybrids (LSDT, DMixer) cannot produce because their receptive field is an architectural constant.

Table 4: $\Delta_{t}$ and gate statistics on cube-single, extracted from trained QHyer (50 batches, batch size 256). Effective memory length is the largest $k$ with $\prod_{j=1}^{k}\bar{A}_{t-j+1}>0.5$.

| Metric | play (non-Markov.) | noisy (Markov.) |
|---|---|---|
| Mean $\Delta_{t}$ | 0.38 | 1.05 |
| Std $\Delta_{t}$ | 0.12 | 0.31 |
| Mean $\bar{A}_{t}=\exp(\Delta_{t}\cdot A)$ | 0.92 | 0.61 |
| Effective memory (steps) | $\sim 12$ | $\sim 3$ |
| Gate weight (Attention) | 0.57 | 0.42 |
| Gate weight (Mamba) | 0.43 | 0.58 |
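The effective memory length defined in the caption of Table 4 can be computed directly from a sequence of decay factors. The sketch below uses a scalar state parameter $A=-0.2$ and constant step sizes, all of which are hypothetical values chosen for illustration; it does not use the per-token statistics extracted from the trained model.

```python
import math


def discretized_decay(delta, a):
    """Per-token decay A_bar_t = exp(delta_t * A) for a scalar A < 0."""
    return math.exp(delta * a)


def effective_memory_length(a_bars, threshold=0.5):
    """Largest k such that the product of the k most recent decays exceeds threshold.

    a_bars is ordered from oldest to newest; each value lies in (0, 1], so the
    running product is non-increasing and the first failure fixes k.
    """
    product, k = 1.0, 0
    for a_bar in reversed(a_bars):
        product *= a_bar
        if product <= threshold:
            break
        k += 1
    return k


A = -0.2  # hypothetical scalar state parameter
slow = effective_memory_length([discretized_decay(0.1, A)] * 50)  # small step sizes
fast = effective_memory_length([discretized_decay(1.0, A)] * 50)  # large step sizes
print(slow, fast)
```

Smaller $\Delta_{t}$ keeps $\bar{A}_{t}$ close to one and lengthens the retained history, while larger $\Delta_{t}$ shortens it, which is the direction of the play versus noisy contrast reported in Table 4.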

Combined with Figure 5 (NFs vs. CVAE/CRL/No-Q), Figure 6 (backbone with RTG), Table 3 (backbone with NFs), and Table 4 (mechanism), the ablations establish that each of QHyer’s two innovations is necessary and the pair is genuinely complementary.

5 Conclusion

We presented QHyer, the first sequence modeling framework for non-Markovian offline GCRL that addresses two fundamental limitations: replacing trajectory-dependent RTG with state-dependent Q-values estimated via Normalizing Flows for effective trajectory stitching, and introducing a Hybrid Attention-Mamba architecture for content-adaptive temporal modeling. Experiments on OGBench and D4RL demonstrate state-of-the-art performance, particularly on non-Markovian datasets.

Limitations and future work. QHyer remains constrained on visual-noisy, where Markovian behavior neutralizes the non-Markovian modeling advantage and pixel-level NFs density estimation becomes the dominant source of error. Promising future directions include robust visual density estimation and extension of the deterministic-transition theory (Appendix B) to stochastic environments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Ahn et al. (2025) Ahn, H., Choi, H., Han, J., and Moon, T. Option-aware temporally abstracted value for offline goal-conditioned reinforcement learning. arXiv preprint arXiv:2505.12737, 2025.
  • Akimov et al. (2022) Akimov, D., Kurenkov, V., Nikulin, A., Tarasov, D., and Kolesnikov, S. Let offline RL flow: Training conservative agents in the latent space of normalizing flows. In 3rd Offline RL Workshop: Offline RL as a “Launchpad”, 2022.
  • Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
  • Baek et al. (2025) Baek, S., Park, T., Park, J., Oh, S., and Kim, Y. Graph-assisted stitching for offline hierarchical reinforcement learning. In Forty-second International Conference on Machine Learning, 2025.
  • Bortkiewicz et al. (2025) Bortkiewicz, M., Pałucki, W., Myers, V., Dziarmaga, T., Arczewski, T., Kuciński, Ł., and Eysenbach, B. Accelerating goal-conditioned reinforcement learning algorithms and research. In The Thirteenth International Conference on Learning Representations, 2025.
  • Brahmanage et al. (2023) Brahmanage, J., Ling, J., and Kumar, A. Flowpg: action-constrained policy gradient with normalizing flows. Advances in Neural Information Processing Systems, 36:20118–20132, 2023.
  • Brandfonbrener et al. (2022) Brandfonbrener, D., Bietti, A., Buckman, J., Laroche, R., and Bruna, J. When does return-conditioned supervised learning work for offline reinforcement learning? Advances in Neural Information Processing Systems, 35:1542–1553, 2022.
  • Chao et al. (2024) Chao, C.-H., Feng, C., Sun, W.-F., Lee, C.-K., See, S., and Lee, C.-Y. Maximum entropy reinforcement learning via energy-based normalizing flow. Advances in Neural Information Processing Systems, 37:56136–56165, 2024.
  • Cheikhi & Russo (2023) Cheikhi, D. and Russo, D. On the statistical benefits of temporal difference learning. In International Conference on Machine Learning, pp. 4269–4293. PMLR, 2023.
  • Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
  • Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pp. 1407–1416. PMLR, 2018.
  • Eysenbach et al. (2020) Eysenbach, B., Salakhutdinov, R., and Levine, S. C-learning: Learning to achieve goals via recursive classification. arXiv preprint arXiv:2011.08909, 2020.
  • Eysenbach et al. (2022) Eysenbach, B., Zhang, T., Levine, S., and Salakhutdinov, R. R. Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620, 2022.
  • Eysenbach et al. (2024) Eysenbach, B., Myers, V., Salakhutdinov, R., and Levine, S. Inference via interpolation: Contrastive representations provably enable planning and inference. Advances in Neural Information Processing Systems, 37:58901–58928, 2024.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Ghosh et al. (2021) Ghosh, D., Gupta, A., Reddy, A., Fu, J., Devin, C. M., Eysenbach, B., and Levine, S. Learning to reach goals via iterated supervised learning. In International Conference on Learning Representations, 2021.
  • Ghugare & Eysenbach (2025) Ghugare, R. and Eysenbach, B. Normalizing flows are capable models for rl. arXiv preprint arXiv:2505.23527, 2025.
  • Ghugare et al. (2024) Ghugare, R., Geist, M., Berseth, G., and Eysenbach, B. Closing the gap between TD learning and supervised learning - a generalisation point of view. In The Twelfth International Conference on Learning Representations, 2024.
  • Giammarino & Qureshi (2026) Giammarino, V. and Qureshi, A. H. Goal reaching with eikonal-constrained hierarchical quasimetric reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.
  • Giammarino et al. (2025) Giammarino, V., Ni, R., and Qureshi, A. H. Physics-informed value learner for offline goal-conditioned reinforcement learning. arXiv preprint arXiv:2509.06782, 2025.
  • Grathwohl et al. (2018) Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
  • Gu & Dao (2024) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Hong et al. (2023) Hong, M., Kang, M., and Oh, S. Diffused task-agnostic milestone planner. Advances in Neural Information Processing Systems, 36:387–405, 2023.
  • Hu et al. (2023) Hu, S., Shen, L., Zhang, Y., and Tao, D. Graph decision transformer. arXiv preprint arXiv:2303.03747, 2023.
  • Hu et al. (2024) Hu, S., Fan, Z., Huang, C., Shen, L., Zhang, Y., Wang, Y., and Tao, D. Q-value regularized transformer for offline reinforcement learning. In Forty-first International Conference on Machine Learning, 2024.
  • Jain & Ravanbakhsh (2024) Jain, V. and Ravanbakhsh, S. Learning to reach goals via diffusion. In International Conference on Machine Learning, pp. 21170–21195. PMLR, 2024.
  • Jullien et al. (2023) Jullien, S., Deffayet, R., Renders, J.-M., Groth, P., and de Rijke, M. Distributional reinforcement learning with dual expectile-quantile regression. arXiv preprint arXiv:2305.16877, 2023.
  • Kakade (2001) Kakade, S. M. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
  • Kim et al. (2026) Kim, C., Lee, H., Seo, Y., Lee, K., and Zhu, Y. DEAS: DEtached value learning with action sequence for scalable offline RL. In The Fourteenth International Conference on Learning Representations, 2026.
  • Kim et al. (2024a) Kim, J., Lee, S., Kim, W., and Sung, Y. Adaptive Q-aid for conditional supervised learning in offline reinforcement learning. Advances in Neural Information Processing Systems, 37:87104–87135, 2024a.
  • Kim et al. (2024b) Kim, J., Lee, S., Kim, W., and Sung, Y. Decision convformer: Local filtering in metaformer is sufficient for decision making. In The Twelfth International Conference on Learning Representations, 2024b.
  • Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018.
  • Koenker & Hallock (2001) Koenker, R. and Hallock, K. F. Quantile regression. Journal of economic perspectives, 15(4):143–156, 2001.
  • Kostrikov et al. (2022) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Lei et al. (2025a) Lei, X., Yang, W., Ke, K., Yang, S., Zhang, X., Pajarinen, J., and Wang, D. Gchr: Goal-conditioned hindsight regularization for sample-efficient reinforcement learning. arXiv preprint arXiv:2508.06108, 2025a.
  • Lei et al. (2025b) Lei, X., Zhang, X., and Wang, D. Mgda: Model-based goal data augmentation for offline goal-conditioned weighted supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 18172–18180, 2025b.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2025) Li, Q., Zhou, Z., and Levine, S. Reinforcement learning with action chunking. arXiv preprint arXiv:2507.07969, 2025.
  • Lipman et al. (2023) Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
  • Liu et al. (2022) Liu, M., Zhu, M., and Zhang, W. Goal-conditioned reinforcement learning: Problems and solutions. arXiv preprint arXiv:2201.08299, 2022.
  • Liu et al. (2025) Liu, Z., Yang, Y., Wang, R., Xu, P., and Zhou, D. How to provably improve return conditioned supervised learning? arXiv preprint arXiv:2506.08463, 2025.
  • Loshchilov (2017) Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lynch et al. (2020) Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. In Conference on robot learning, pp. 1113–1132. PMLR, 2020.
  • Ma et al. (2022) Ma, Y. J., Yan, J., Jayaraman, D., and Bastani, O. How far i’ll go: Offline goal-conditioned reinforcement learning via f-advantage regression. arXiv preprint arXiv:2206.03023, 2022.
  • Myers et al. (2024) Myers, V., Zheng, C., Dragan, A., Levine, S., and Eysenbach, B. Learning temporal distances: Contrastive successor features can provide a metric structure for decision-making. arXiv preprint arXiv:2406.17098, 2024.
  • Myers et al. (2025) Myers, V., Zheng, B., Eysenbach, B., and Levine, S. Offline goal-conditioned reinforcement learning with quasimetric representations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  • Newey & Powell (1987) Newey, W. K. and Powell, J. L. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pp. 819–847, 1987.
  • Opryshko et al. (2025) Opryshko, E., Quan, J., Voelcker, C., Du, Y., and Gilitschenski, I. Test-time graph search for goal-conditioned reinforcement learning. arXiv preprint arXiv:2510.07257, 2025.
  • Ota (2024) Ota, T. Decision mamba: Reinforcement learning via sequence modeling with selective state spaces. arXiv preprint arXiv:2403.19925, 2024.
  • Park et al. (2023) Park, S., Ghosh, D., Eysenbach, B., and Levine, S. HIQL: Offline goal-conditioned RL with latent states as actions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Park et al. (2024) Park, S., Kreiman, T., and Levine, S. Foundation policies with hilbert representations. arXiv preprint arXiv:2402.15567, 2024.
  • Park et al. (2025a) Park, S., Frans, K., Eysenbach, B., and Levine, S. Ogbench: Benchmarking offline goal-conditioned rl. In International Conference on Learning Representations (ICLR), 2025a.
  • Park et al. (2025b) Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., and Levine, S. Horizon reduction makes rl scalable. arXiv preprint arXiv:2506.04168, 2025b.
  • Park et al. (2026) Park, S., Oberai, A., Atreya, P., and Levine, S. Transitive RL: Value learning via divide and conquer. In The Fourteenth International Conference on Learning Representations, 2026.
  • Reuss et al. (2023) Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.
  • Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320. PMLR, 2015.
  • Sikchi et al. (2024) Sikchi, H., Chitnis, R., Touati, A., Geramifard, A., Zhang, A., and Niekum, S. Score models for offline goal-conditioned reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024.
  • Singh et al. (2020) Singh, A., Liu, H., Zhou, G., Yu, A., Rhinehart, N., and Levine, S. Parrot: Data-driven behavioral priors for reinforcement learning. arXiv preprint arXiv:2011.10024, 2020.
  • Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
  • Teshima et al. (2020) Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M., and Sugiyama, M. Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373, 2020.
  • Wang et al. (2025) Wang, J., Karanasou, P., Wei, P., Gatti, E., Plasencia, D. M., and Kanoulas, D. Long-short decision transformer: Bridging global and local dependencies for generalized decision-making. In The Thirteenth International Conference on Learning Representations, 2025.
  • Wang et al. (2023) Wang, T., Torralba, A., Isola, P., and Zhang, A. Optimal goal-reaching reinforcement learning via quasimetric learning. In International Conference on Machine Learning, pp. 36411–36430. PMLR, 2023.
  • Wang et al. (2024) Wang, Y., Yang, C., Wen, Y., Liu, Y., and Qiao, Y. Critic-guided decision transformer for offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 15706–15714, 2024.
  • Ward et al. (2019) Ward, P. N., Smofsky, A., and Bose, A. J. Improving exploration in soft-actor-critic with normalizing flows policies. arXiv preprint arXiv:1906.02771, 2019.
  • Wu et al. (2022) Wu, J., Wu, H., Qiu, Z., Wang, J., and Long, M. Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:31278–31291, 2022.
  • Wu et al. (2023) Wu, Y.-H., Wang, X., and Hamaya, M. Elastic decision transformer. arXiv preprint arXiv:2307.02484, 2023.
  • Yamagata et al. (2023) Yamagata, T., Khalil, A., and Santos-Rodriguez, R. Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline rl. In International Conference on Machine Learning, pp. 38989–39007. PMLR, 2023.
  • Yarats et al. (2022) Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations, 2022.
  • Yoon et al. (2024) Yoon, Y., Lee, G., Ahn, S., and Ok, J. Breadth-first exploration on adaptive grid for reinforcement learning. In Forty-first International Conference on Machine Learning, 2024.
  • Zhai et al. (2024) Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., and Susskind, J. Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329, 2024.
  • Zheng et al. (2025a) Zheng, H., Shen, L., Luo, Y., Ye, D., Du, B., Shen, J., and Tao, D. Decision mixer: Integrating long-term and local dependencies via dynamic token selection for decision-making. In Forty-second International Conference on Machine Learning, 2025a.
  • Zheng et al. (2025b) Zheng, H., Shen, L., Luo, Y., Ye, D., Xu, S., Du, B., Shen, J., and Tao, D. Value-guided decision transformer: A unified reinforcement learning framework for online and offline settings. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b.
  • Zhou & Kao (2025) Zhou, J. L. and Kao, J. C. Flattening hierarchies with policy bootstrapping. arXiv preprint arXiv:2505.14975, 2025.
  • Zhuang et al. (2024) Zhuang, Z., Peng, D., Liu, J., Zhang, Z., and Wang, D. Reinformer: Max-return sequence modeling for offline RL. In Forty-first International Conference on Machine Learning, 2024.

Appendix A Related Work

Offline Goal-Conditioned RL (GCRL). Offline GCRL aims to learn goal-reaching policies from static datasets without environment interaction. Existing approaches can be categorized into several paradigms: goal-conditioned hindsight relabeling and data augmentation (Andrychowicz et al., 2017; Lei et al., 2025b), hierarchical or subgoal-based learning (Park et al., 2023; Ahn et al., 2025; Giammarino & Qureshi, 2026; Zhou & Kao, 2025; Lei et al., 2025a), graph-based planning (Yoon et al., 2024; Eysenbach et al., 2024), metric learning (Wang et al., 2023; Park et al., 2024; Myers et al., 2024, 2025), dual optimization (Ma et al., 2022; Sikchi et al., 2024), generative modeling (Hong et al., 2023; Reuss et al., 2023; Jain & Ravanbakhsh, 2024; Myers et al., 2025), and test-time adaptation (Opryshko et al., 2025). However, these methods predominantly assume that the offline data is Markovian, i.e., that the optimal action depends solely on the current state and goal. They therefore struggle on non-Markovian datasets, where they cannot capture the temporal dependencies that govern the behavior policy’s decisions. In contrast, QHyer explicitly models these dependencies through a sequence modeling framework.

Normalizing Flows in RL. Normalizing flows (NFs) are invertible generative models that enable exact likelihood computation and efficient sampling (Dinh et al., 2014, 2017; Kingma & Dhariwal, 2018). Recent work has demonstrated their effectiveness in RL for policy modeling (Singh et al., 2020; Ward et al., 2019; Chao et al., 2024) and Q-function estimation. Chao et al. (2024) propose Energy-Based Normalizing Flows (EBFlow) that unify policy evaluation and improvement into a single objective for maximum entropy RL, enabling exact soft value function calculation without Monte Carlo approximation. Brahmanage et al. (2023) leverage NFs to learn invertible mappings between feasible action spaces and Gaussian latent spaces for action-constrained policy gradient methods.

For offline settings, Akimov et al. (2022) use NFs-based action encoders to construct conservative action spaces, addressing distributional shift without explicit regularization. Notably, Ghugare & Eysenbach (2025) show that NFs can serve as Q-functions in GCRL by modeling the discounted state occupancy distribution, achieving strong performance on offline GCRL benchmarks with a simple feedforward architecture. However, their approach cannot capture temporal dependencies in non-Markovian datasets. Our work integrates NFs-based Q-estimation into a sequence modeling framework, enabling both accurate value estimation and temporal dependency modeling.

Sequence Modeling in Offline RL. Decision Transformer (DT) (Chen et al., 2021) reformulates offline RL as conditional sequence modeling, where actions are generated conditioned on desired returns and past states. This paradigm has spurred extensive research, which can be broadly categorized into two directions: value-enhanced methods and architectural innovations.

Value-enhanced methods integrate reinforcement learning principles to address DT’s fundamental limitation in stitching sub-optimal trajectories (Brandfonbrener et al., 2022). For instance, Q-learning Decision Transformer (QDT) (Yamagata et al., 2023) employs dynamic programming for optimal path synthesis. Critic-Guided Decision Transformer (CGDT) (Wang et al., 2024) incorporates a value-based critic to align expected returns with target returns. Q-value Regularized Transformer (QT) (Hu et al., 2024) introduces explicit Q-value regularization to tackle long-horizon and sparse-reward tasks. Reinformer (Zhuang et al., 2024) utilizes expectile regression for maximizing returns, while Value-Guided Decision Transformer (VDT) (Zheng et al., 2025b) leverages value functions for advantage-weighted behavior regularization. These methods primarily employ TD-learning for Q-value estimation and use value functions as auxiliary losses or regularizers. In contrast, QHyer estimates Q-values via Normalizing Flows with Monte Carlo learning and directly uses them as conditioning tokens to replace RTG.

Architectural innovations aim to more effectively capture the heterogeneous temporal patterns present in offline datasets. Elastic Decision Transformer (EDT) (Wu et al., 2023) enables adaptive history length selection to facilitate stitching. Graph Decision Transformer (GDT) (Hu et al., 2023) structures input sequences as causal graphs with relation-enhanced attention mechanisms. Decision Convformer (DC) (Kim et al., 2024b) replaces attention with causal convolution filters to model local, Markovian associations efficiently. Decision Mamba (DMamba) (Ota, 2024) substitutes attention with selective state space models for linear-time sequence modeling. Long-Short Decision Transformer (LSDT) (Wang et al., 2025) combines attention with dynamic convolution using a fixed capacity ratio, and Decision Mixer (DMixer) (Zheng et al., 2025a) integrates long-term and local features via dynamic token selection. QHyer introduces a Hybrid Attention-Mamba architecture with learnable gating that dynamically allocates capacity, allowing attention to handle global goal-directed planning while Mamba captures local temporal patterns with content-adaptive memory.

Our work departs from both the value-enhanced and the architectural lines of work. To our knowledge, this is the first work to unlock the potential of sequence modeling for Offline GCRL.

Appendix B Notation and Assumptions

B.1 Notation

We consider goal-conditioned episodic MDPs with finite horizon $H$. Following Reinforced Return-conditioned Supervised Learning (R2CSL) (Liu et al., 2025), we assume deterministic transitions, i.e., given state $s$ and action $a$, the next state $s^{\prime}=P(s,a)$ is uniquely determined. This ensures that the in-distribution optimal Q-value $Q^{\star}(s,a,g,h)$ is well-defined as a unique value. Extension to stochastic environments is an important direction for future work.

| Symbol | Definition |
|---|---|
| Spaces and Indices | |
| $\mathcal{S},\mathcal{A},\mathcal{G}$ | State space, action space, goal space |
| $H$ | Episode horizon (total number of stages per episode) |
| $h\in[H]:=\{1,\ldots,H\}$ | Stage index (timestep within an episode) |
| $\phi:\mathcal{S}\to\mathcal{G}$ | Goal mapping; in our experiments, $\mathcal{G}\subseteq\mathcal{S}$ |
| Policies and Distributions | |
| $\beta$ | Behavior policy that generated the offline dataset |
| $\pi^{\star}_{\beta}$ | In-distribution optimal stitched policy (Eq. 19) |
| $\hat{\pi}^{\star}_{\mathcal{D}}$ | Learned policy from QHyer |
| $d^{\beta}_{h}(s)$ | State visitation probability at stage $h$ under policy $\beta$ |
| $d^{\beta}_{\min}$ | Minimum positive state visitation: $\min_{h,s}\{d^{\beta}_{h}(s)\mid d^{\beta}_{h}(s)>0\}$ |
| $c^{\star}_{\beta}$ | Distribution mismatch coefficient (Assumption B.5) |
| Q-Values (Key Distinction) | |
| $Q^{\beta}(s,a,g)$ | True goal-reaching probability under $\beta$: $p^{\beta}_{+}(g\mid s,a)$ |
| $Q^{\star}(s,a,g,h)$ | In-distribution optimal Q-value: $\max_{\tau:(s_{h},a_{h})=(s,a)}Q^{\beta}(\tau,g,h)$ |
| $\hat{Q}^{\beta}_{\theta}(s,a,g)$ | NFs estimate of $Q^{\beta}$, trained via Equation 5 |
| $\hat{Q}^{\tau}(s,a,g,h)$ | Expectile regression output on NFs-estimated Q-values |
| $\hat{Q}_{\phi}(s,g,h)$ | Transformer-predicted Q-value for conditioning (Equation 10) |
| Error Terms | |
| $\epsilon_{\text{NFs}}$ | NFs estimation MSE (Assumption B.7) |
| $\epsilon_{\tau}$ | Expectile regression bias (Theorem 3.1) |
| $\delta_{\text{approx}}$ | MLE approximation error (Assumption B.3) |
| $\tilde{c}$ | Q-value coverage constant (Assumption B.4) |

Q-Value Definition. For a trajectory $\tau\in\mathcal{T}^{\beta}$ passing through $(s_{h},a_{h})=(s,a)$ at stage $h$, the goal-reaching Q-value is:

$$Q^{\beta}(\tau,g,h):=p^{\beta}_{+}(g\mid s_{h},a_{h})=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\cdot\mathds{1}[\phi(s_{h+t+1})=g], \tag{17}$$

where in deterministic environments, this reduces to the discounted indicator of whether the trajectory reaches goal $g$.

In-Distribution Optimal Q-Value:

$$Q^{\star}(s,a,g,h):=\max_{\tau\in\mathcal{T}^{\beta}:(s_{h},a_{h})=(s,a)}Q^{\beta}(\tau,g,h). \tag{18}$$

Optimal Stitched Policy:

$$\pi^{\star}_{\beta}(a\mid s,g,h):=P_{\beta}(a\mid s,g,h,Q^{\star}(s,a,g,h)). \tag{19}$$

Performance Metric:

$$J(\pi):=\mathbb{E}_{s_{1}\sim\rho,\,g\sim p(g)}[V^{\pi}_{1}(s_{1},g)], \tag{20}$$

where $V^{\pi}_{h}(s,g):=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r(s_{t},a_{t},g)\mid s_{h}=s]$ and $r(s,a,g)=\mathds{1}[\phi(s^{\prime})=g]$.
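As a concrete reading of Equations 17 and 18, the following minimal sketch evaluates the discounted goal-reaching indicator for a few toy trajectories that share the same $(s_h,a_h)$ and then takes their maximum. The function names, the boolean `goal_hits` encoding, and the truncation of the infinite sum at the end of the recorded trajectory are assumptions made for illustration, not part of the paper.

```python
def goal_reaching_q(goal_hits, h, gamma):
    """Eq. (17): (1 - gamma) * sum_t gamma^t * 1[phi(s_{h+t+1}) = g].

    goal_hits[i] is True when phi(s_i) = g. Index 0 is unused so that
    stages are 1-indexed. The sum stops at the end of the recorded
    trajectory (an assumption of this sketch).
    """
    future = goal_hits[h + 1:]  # indicators for s_{h+1}, s_{h+2}, ...
    return (1.0 - gamma) * sum(gamma ** t for t, hit in enumerate(future) if hit)


def in_distribution_optimal_q(trajectories, h, gamma):
    """Eq. (18): maximum over dataset trajectories passing through (s_h, a_h)."""
    return max(goal_reaching_q(hits, h, gamma) for hits in trajectories)


# Three hypothetical trajectories through the same (s_1, a_1).
traj_a = [False, False, False, True, False]   # reaches g at stage 3
traj_b = [False, False, True, False, False]   # reaches g at stage 2
traj_c = [False, False, False, False, False]  # never reaches g
q_values = [goal_reaching_q(t, 1, 0.5) for t in (traj_a, traj_b, traj_c)]
q_star = in_distribution_optimal_q([traj_a, traj_b, traj_c], 1, 0.5)
```

Trajectories that reach the goal sooner receive a larger value, so the maximum in Equation 18 selects the fastest goal-reaching continuation present in the data.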

B.2 Assumptions

Assumption B.1 (Deterministic Environment).

The transition dynamics $P:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is deterministic, i.e., given $(s,a)$, the next state $s^{\prime}=P(s,a)$ is unique. This is standard in goal-conditioned RL theory (Park et al., 2025a) and holds approximately in robotic manipulation tasks.

Remark B.2 (Scope of Assumption B.1).

Assumption B.1 constrains the transition dynamics $P(s^{\prime}\mid s,a)$, not the behavior policy $\beta$. This is compatible with all our experimental settings. OGBench runs deterministic MuJoCo dynamics even for noisy datasets, where the "noise" is a Gaussian perturbation of $\beta$ rather than of $P$, and D4RL mazes likewise use deterministic $P$. Non-Markovian play data corresponds to a history-dependent $\beta(a_{t}\mid s_{t},h_{<t})$ over a deterministic MDP. QHyer’s sequence modeling targets exactly this non-Markovianness of the behavior policy, while Theorems 3.1 and 3.2 analyze stitching on the underlying MDP. The assumption matches R2CSL (Liu et al., 2025) and is standard in the offline GCRL theory literature. Extension to stochastic $P$ is a genuine open problem that we flag in the conclusion.

Assumption B.3 (Policy Class Regularity).

The policy class $\Pi$ satisfies:

  1. $|\Pi|<\infty$ (can be relaxed to finite covering number).

  2. For all $(a,s,g,h,Q)\in\mathcal{A}\times\mathcal{S}\times\mathcal{G}\times[H]\times[0,1]$ and $\pi\in\Pi$: $|\log\pi(a\mid s,g,h,Q)|\leq c$.

  3. $\min_{\pi\in\Pi}L(\pi)\leq\delta_{\text{approx}}$, where $L(\pi):=\mathbb{E}_{(s,g,Q)\sim P_{\beta}}[D_{\text{KL}}(P_{\beta}(\cdot\mid s,g,h,Q)\|\pi(\cdot\mid s,g,h,Q))]$.

Assumption B.4 (Q-Value Coverage).

For each $(s,a,g,h)$ in the support of $\beta$, define:

$$\mathcal{T}_{\mathcal{D}}(s,a,g,h):=\{k\in[N]:(s^{k}_{h},a^{k}_{h})=(s,a)\}. \tag{21}$$

For trajectory $k$, let $Q^{k}_{h}(g):=Q^{\beta}(\tau^{k},g,h)$ be the empirical goal-reaching probability computed via hindsight relabeling. There exists $\tilde{c}\in(0,1]$ such that:

$$\frac{|\{k\in\mathcal{T}_{\mathcal{D}}(s,a,g,h):Q^{k}_{h}(g)=Q^{\star}(s,a,g,h)\}|}{|\mathcal{T}_{\mathcal{D}}(s,a,g,h)|}\geq\tilde{c}. \tag{22}$$

Interpretation: At least a $\tilde{c}$-fraction of the trajectories through $(s,a)$ achieve the optimal Q-value. Under Assumption B.1, $Q^{\star}$ is well-defined as the maximum over a finite set of deterministic outcomes.
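The left-hand side of Equation 22 can be computed directly from the per-trajectory Q-values at a fixed $(s,a,g,h)$. The sketch below is illustrative only; the function name and the tolerance used to compare floating-point values are assumptions.

```python
def coverage_fraction(q_values, tol=1e-9):
    """Fraction of trajectories whose Q-value attains the in-distribution maximum."""
    q_star = max(q_values)
    n_star = sum(1 for q in q_values if abs(q - q_star) <= tol)
    return n_star / len(q_values)


# Four hypothetical trajectories through the same (s, a); two attain the maximum.
fraction = coverage_fraction([0.25, 0.5, 0.0, 0.5])
```

The assumption requires this fraction to be at least $\tilde{c}$ for every $(s,a,g,h)$ in the support of $\beta$.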

Assumption B.5 (Distribution Mismatch).

There exists $c^{\star}_{\beta}>0$ such that $d^{\star,\beta}_{h}(s)/d^{\beta}_{h}(s)\leq c^{\star}_{\beta}$ for all $(h,s)\in[H]\times\mathcal{S}^{\beta}_{h}$.

Assumption B.6 (Bounded Q-Values).

$Q^{\beta}(s,a,g)\in[0,1]$ for all $(s,a,g)$, since it represents a probability.

Assumption B.7 (NFs Estimation Error).

The NFs estimator satisfies:

$$\mathbb{E}_{(s,a,g)\sim d^{\beta}_{h}}\left[(Q^{\beta}(s,a,g)-\hat{Q}^{\beta}_{\theta}(s,a,g))^{2}\right]\leq\epsilon_{\text{NFs}},\quad\forall h\in[H]. \tag{23}$$

Assumption B.8 (Policy Lipschitz Continuity).

For any $(s,g,h)$, $\pi\in\Pi$, and $Q_{1},Q_{2}\in[0,1]$:

$$\text{TV}(\pi(\cdot\mid s,g,h,Q_{1})\|\pi(\cdot\mid s,g,h,Q_{2}))\leq L_{\pi}|Q_{1}-Q_{2}|. \tag{24}$$

Assumption B.9 (Expectile Lipschitz Stability).

Let $\mathcal{E}_{\tau}(\{Q_{k}\})$ denote the $\tau$-expectile of samples $\{Q_{k}\}$. For any two sample sets $\{Q_{k}\}$ and $\{\tilde{Q}_{k}\}$ with $|Q_{k}-\tilde{Q}_{k}|\leq\epsilon$ for all $k$:

$$|\mathcal{E}_{\tau}(\{Q_{k}\})-\mathcal{E}_{\tau}(\{\tilde{Q}_{k}\})|\leq L_{Q}\cdot\epsilon, \tag{25}$$

where $L_{Q}\leq 1$ is a Lipschitz constant. This holds because the expectile is a weighted average of samples.
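To make the expectile stability property concrete, the sketch below computes the $\tau$-expectile of a finite sample set by bisection on the first-order condition of the loss $\sum_k|\tau-\mathds{1}(Q_k-q<0)|(Q_k-q)^2$, then compares two sample sets that differ by at most $\epsilon$ per entry. The solver, the function name, and the toy numbers are assumptions for illustration; the paper does not specify how the expectile is computed.

```python
def expectile(samples, tau, iters=100):
    """tau-expectile: the q minimizing sum_k |tau - 1(Q_k - q < 0)| * (Q_k - q)^2."""
    lo, hi = min(samples), max(samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # Negative half-gradient of the loss at q = mid; it decreases as mid grows.
        grad = sum(
            tau * (q - mid) if q >= mid else (1.0 - tau) * (q - mid)
            for q in samples
        )
        if grad > 0:
            lo = mid  # the minimizer lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


clean = [0.1, 0.4, 0.9]
noisy = [0.12, 0.38, 0.91]  # each entry perturbed by at most eps = 0.02
gap = abs(expectile(clean, 0.8) - expectile(noisy, 0.8))
```

With $\tau=0.5$ the expectile is the sample mean, and as $\tau\to 1$ it moves toward the sample maximum, which is the behavior the analysis in Appendix C relies on.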

Appendix C Proofs of Theoretical Results

C.1 Proof of Theorem 3.1

Proof.

We prove convergence of expectile regression to the in-distribution optimal Q-value, accounting for NFs estimation error.

Problem Setup. Fix $(s,a,g,h)$. Let $\{Q_{1},\ldots,Q_{K}\}$ be the true Q-values across the $K:=|\mathcal{T}_{\mathcal{D}}(s,a,g,h)|$ trajectories. In practice, we observe NFs estimates $\{\hat{Q}_{1},\ldots,\hat{Q}_{K}\}$ where $\hat{Q}_{k}=\hat{Q}^{\beta}_{\theta}(s^{k}_{h+1},g)$. The expectile loss is $L_{\tau}^{2}(u):=|\tau-\mathds{1}(u<0)|\cdot u^{2}$. We define:

  • $\hat{Q}^{\tau,\text{oracle}}:=\operatorname*{arg\,min}_{\tilde{Q}}\sum_{k}L_{\tau}^{2}(Q_{k}-\tilde{Q})$: the expectile on true Q-values

  • $\hat{Q}^{\tau}:=\operatorname*{arg\,min}_{\tilde{Q}}\sum_{k}L_{\tau}^{2}(\hat{Q}_{k}-\tilde{Q})$: the expectile on NFs-estimated Q-values

Our goal is to bound $|Q^{\star}-\hat{Q}^{\tau}|$.

First, Decomposition via Triangle Inequality.

$$|Q^{\star}-\hat{Q}^{\tau}|\leq\underbrace{|Q^{\star}-\hat{Q}^{\tau,\text{oracle}}|}_{\text{(A) Expectile bias on true Q}}+\underbrace{|\hat{Q}^{\tau,\text{oracle}}-\hat{Q}^{\tau}|}_{\text{(B) NFs error propagation}}. \tag{26}$$

Second, Bounding Term (A): Expectile Bias. From the first-order condition, the expectile satisfies:

$$\hat{Q}^{\tau,\text{oracle}}=\frac{(1-\tau)n_{-}\bar{Q}_{-}+\tau n_{+}\bar{Q}_{+}}{(1-\tau)n_{-}+\tau n_{+}}, \tag{27}$$

where $n_{+}=|\{k:Q_{k}\geq\hat{Q}^{\tau,\text{oracle}}\}|$, $n_{-}=K-n_{+}$, and $\bar{Q}_{+}$ and $\bar{Q}_{-}$ are the means of the samples at or above and strictly below $\hat{Q}^{\tau,\text{oracle}}$, respectively.

Case 1: If $\hat{Q}^{\tau,\text{oracle}}=Q^{\star}$, then $|Q^{\star}-\hat{Q}^{\tau,\text{oracle}}|=0\leq\epsilon_{\tau}$ trivially.

Case 2: If $\hat{Q}^{\tau,\text{oracle}}<Q^{\star}$ (generic case). Since $\hat{Q}^{\tau,\text{oracle}}$ is a convex combination of the samples:

$$Q_{\min}\leq\hat{Q}^{\tau,\text{oracle}}\leq Q^{\star}. \tag{28}$$

By Assumption B.4, at least a $\tilde{c}$-fraction of trajectories achieve $Q^{\star}$. Using Hoeffding’s inequality, with high probability, $n^{\star}:=|\{k:Q_{k}=Q^{\star}\}|\geq\tilde{c}K/2$. Since $\hat{Q}^{\tau,\text{oracle}}<Q^{\star}$, all $Q^{\star}$-samples are in the "above" group: $n_{+}\geq\tilde{c}K/2$.

Worst-case analysis with $\bar{Q}_{+}=Q^{\star}$ and $\bar{Q}_{-}=Q_{\min}$:

$$\begin{aligned}
Q^{\star}-\hat{Q}^{\tau,\text{oracle}} &=\frac{(1-\tau)n_{-}(Q^{\star}-Q_{\min})}{(1-\tau)n_{-}+\tau n_{+}} && \text{(29)}\\
&\leq\frac{(1-\tau)(1-\tilde{c}/2)K(Q^{\star}-Q_{\min})}{(1-\tau)(1-\tilde{c}/2)K+\tau(\tilde{c}/2)K} && \text{(30)}\\
&=\frac{(1-\tau)(Q^{\star}-Q_{\min})}{\tau\cdot\tilde{c}/2+(1-\tau)(1-\tilde{c}/2)}=:\epsilon_{\tau}. && \text{(31)}
\end{aligned}$$
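A quick numerical reading of the final expression for $\epsilon_{\tau}$ shows how the bias bound shrinks as $\tau\to 1$. This is a sketch that evaluates the closed form as printed in Equation 31; the function name and the example values of $\tilde{c}$, $Q^{\star}$, and $Q_{\min}$ are assumptions chosen for illustration.

```python
def expectile_bias_bound(tau, c_tilde, q_star, q_min):
    """Closed form of epsilon_tau as printed in Eq. (31)."""
    numerator = (1.0 - tau) * (q_star - q_min)
    denominator = tau * c_tilde / 2.0 + (1.0 - tau) * (1.0 - c_tilde / 2.0)
    return numerator / denominator


# Hypothetical setting: half of the trajectories are optimal, Q-values span [0, 1].
bounds = {tau: expectile_bias_bound(tau, 0.5, 1.0, 0.0) for tau in (0.5, 0.9, 0.99, 1.0)}
```

For example, with $\tilde{c}=0.5$ the value at $\tau=0.9$ is $0.1/0.3=1/3$, and the value is exactly zero at $\tau=1$.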

Third, Bounding Term (B): NFs Error Propagation. By Assumption B.9, the expectile is Lipschitz in its inputs:

$$|\hat{Q}^{\tau,\text{oracle}}-\hat{Q}^{\tau}|=|\mathcal{E}_{\tau}(\{Q_{k}\})-\mathcal{E}_{\tau}(\{\hat{Q}_{k}\})|\leq L_{Q}\cdot\max_{k}|Q_{k}-\hat{Q}_{k}|. \tag{32}$$

By Assumption B.7 and Markov’s inequality, with high probability:

$$\max_{k}|Q_{k}-\hat{Q}_{k}|\leq O(\sqrt{\epsilon_{\text{NFs}}}). \tag{33}$$

Therefore, absorbing the constant into $L_{Q}$, we have:

$$|\hat{Q}^{\tau,\text{oracle}}-\hat{Q}^{\tau}|\leq L_{Q}\sqrt{\epsilon_{\text{NFs}}}. \tag{34}$$

Fourth, Combining Terms. From Equation 26, we have:

$$|Q^{\star}-\hat{Q}^{\tau}|\leq\epsilon_{\tau}+L_{Q}\sqrt{\epsilon_{\text{NFs}}}. \tag{35}$$

Finally, Sample Complexity. We need uniform convergence over all $(s,a,g,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{G}\times[H]$, which we obtain by a union bound that includes the $|\mathcal{G}|$ goals. Two conditions must hold:

Condition 1 (sufficient visits): For each $(s,a,h)$, we need $N^{s,a}_{h}\geq Nd^{\beta}_{\min}/2$. By Hoeffding:

$$\Pr(N^{s,a}_{h}<Nd^{\beta}_{\min}/2)\leq\exp(-(d^{\beta}_{\min})^{2}N/2). \tag{36}$$

Condition 2 (coverage concentration): Given $K$ visits, we need $n^{\star}\geq\tilde{c}K/2$.

Setting the failure probability to at most $\delta/(2|\mathcal{S}||\mathcal{A}||\mathcal{G}|H)$ for each tuple gives:

$$N\geq\max\left\{\frac{2}{(d^{\beta}_{\min})^{2}}\log\frac{2|\mathcal{S}||\mathcal{A}||\mathcal{G}|H}{\delta},\frac{4}{\tilde{c}^{2}d^{\beta}_{\min}}\log\frac{2|\mathcal{S}||\mathcal{A}||\mathcal{G}|H}{\delta}\right\}. \tag{37}$$
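To get a feel for the scale of Equation 37, the following sketch evaluates the two terms inside the maximum for a small tabular problem. The function name and all numerical values are hypothetical; they are not taken from the paper's experiments.

```python
import math


def required_trajectories(d_min, c_tilde, n_states, n_actions, n_goals, horizon, delta):
    """Smallest integer N satisfying the bound in Eq. (37)."""
    log_term = math.log(2 * n_states * n_actions * n_goals * horizon / delta)
    visit_term = 2.0 / d_min ** 2 * log_term                   # Condition 1
    coverage_term = 4.0 / (c_tilde ** 2 * d_min) * log_term    # Condition 2
    return math.ceil(max(visit_term, coverage_term))


# Hypothetical tabular setting: 10 states, 4 actions, 5 goals, horizon 50.
n_required = required_trajectories(
    d_min=0.1, c_tilde=0.5, n_states=10, n_actions=4, n_goals=5, horizon=50, delta=0.05
)
```

The visit term dominates whenever $\tilde{c}^{2}\geq 2d^{\beta}_{\min}$, and the coverage term dominates otherwise; both grow only logarithmically in $|\mathcal{S}||\mathcal{A}||\mathcal{G}|H/\delta$.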

C.2 Proof of Theorem 3.2

Proof.

We prove convergence to the optimal stitched policy with careful treatment of distribution mismatch.

First, Performance Difference. Since rewards are bounded in $[0,1]$:

$$J(\pi^{\star}_{\beta})-J(\hat{\pi}^{\star}_{\mathcal{D}})\leq H\cdot\|d^{\pi^{\star}_{\beta}}-d^{\hat{\pi}^{\star}_{\mathcal{D}}}\|_{1}. \tag{38}$$

Second, Simulation Lemma. By the simulation lemma (Kakade, 2001), we have:

$$\|d^{\pi^{\star}_{\beta}}-d^{\hat{\pi}^{\star}_{\mathcal{D}}}\|_{1}\leq 2\sum_{h=1}^{H}\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}\left[\text{TV}(\pi^{\star}_{\beta}(\cdot|s)\|\hat{\pi}^{\star}_{\mathcal{D}}(\cdot|s))\right]. \tag{39}$$

Third, Policy Difference Decomposition.

$$\begin{aligned}
&\text{TV}(\pi^{\star}_{\beta}(\cdot|s,g,h)\|\hat{\pi}^{\star}_{\mathcal{D}}(\cdot|s,g,h))\\
&\quad\leq\underbrace{\text{TV}(P_{\beta}(\cdot|s,g,h,Q^{\star})\|\hat{\pi}(\cdot|s,g,h,Q^{\star}))}_{\text{(I) Policy learning error}}+\underbrace{\text{TV}(\hat{\pi}(\cdot|s,g,h,Q^{\star})\|\hat{\pi}(\cdot|s,g,h,\hat{Q}^{\tau}))}_{\text{(II) Q-value error}}.
\end{aligned} \tag{40}$$

Bounding Term (I) of Equation 40. We apply the inequalities in the following order:

a: Applying Jensen’s inequality to the expectation of TV gives:

$$\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[\text{TV}(P_{\beta}\|\hat{\pi})]\leq\sqrt{\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[\text{TV}(P_{\beta}\|\hat{\pi})^{2}]}. \tag{41}$$

b: Applying Pinsker’s inequality ($\text{TV}^{2}\leq\frac{1}{2}D_{\text{KL}}$) gives:

$$\sqrt{\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[\text{TV}^{2}]}\leq\sqrt{\frac{1}{2}\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[D_{\text{KL}}(P_{\beta}\|\hat{\pi})]}. \tag{42}$$

c: Applying the distribution mismatch bound (Assumption B.5) gives:

$$\begin{aligned}
\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[D_{\text{KL}}] &=\sum_{s}d^{\pi^{\star}_{\beta}}_{h}(s)\cdot D_{\text{KL}}(P_{\beta}(\cdot|s)\|\hat{\pi}(\cdot|s)) && \text{(43)}\\
&=\sum_{s}\frac{d^{\pi^{\star}_{\beta}}_{h}(s)}{d^{\beta}_{h}(s)}\cdot d^{\beta}_{h}(s)\cdot D_{\text{KL}} && \text{(44)}\\
&\leq c^{\star}_{\beta}\cdot\mathbb{E}_{s\sim d^{\beta}_{h}}[D_{\text{KL}}]. && \text{(45)}
\end{aligned}$$

Combining a-c, we have:

$$\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[\text{TV}(P_{\beta}\|\hat{\pi})]\leq\sqrt{\frac{c^{\star}_{\beta}}{2}\mathbb{E}_{s\sim d^{\beta}_{h}}[D_{\text{KL}}(P_{\beta}\|\hat{\pi})]}=\sqrt{\frac{c^{\star}_{\beta}}{2}L(\hat{\pi})}. \tag{46}$$

By MLE analysis (Liu et al., 2025), with probability $\geq 1-\delta$:

$$L(\hat{\pi})\leq\mathcal{O}\left(\sqrt{c\cdot\frac{\log|\Pi|/\delta}{N}}\right)+\delta_{\text{approx}}. \tag{47}$$

Summing over $H$ stages:

$$\sum_{h=1}^{H}\mathbb{E}[\text{Term (I)}]\leq H\sqrt{\frac{c^{\star}_{\beta}}{2}}\left(\mathcal{O}\left(\left(\frac{\log|\Pi|/\delta}{N}\right)^{1/4}\right)+\sqrt{\delta_{\text{approx}}}\right). \tag{48}$$
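Steps b and c in the bound on Term (I) can be sanity-checked on small discrete distributions. The sketch below verifies Pinsker's inequality and the change-of-measure step on toy numbers; the helper names and the example distributions are assumptions made for illustration.

```python
import math


def total_variation(p, q):
    """TV distance between two discrete distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def mismatch_coefficient(d_star, d_beta):
    """Smallest c with d_star(s) / d_beta(s) <= c on the support of d_beta."""
    return max(a / b for a, b in zip(d_star, d_beta) if b > 0)


# Step b: Pinsker's inequality, TV^2 <= KL / 2.
p, q = [0.5, 0.5], [0.9, 0.1]
tv, kl = total_variation(p, q), kl_divergence(p, q)

# Step c: E_{d_star}[f] <= c * E_{d_beta}[f] for any nonnegative f.
d_star, d_beta, f = [0.7, 0.3], [0.5, 0.5], [0.2, 0.1]
c = mismatch_coefficient(d_star, d_beta)
lhs = sum(w * v for w, v in zip(d_star, f))
rhs = c * sum(w * v for w, v in zip(d_beta, f))
```

Here `tv ** 2` is 0.16 while `kl / 2` is roughly 0.255, and the reweighted expectation 0.17 is bounded by 0.21, consistent with the two inequalities.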

Fourth, Bounding Term (II) of Equation 40. By Assumption B.8, we have:

$$\text{TV}(\hat{\pi}(\cdot|Q^{\star})\|\hat{\pi}(\cdot|\hat{Q}^{\tau}))\leq L_{\pi}|Q^{\star}-\hat{Q}^{\tau}|. \tag{49}$$

From Theorem 3.1, we have:

$$|Q^{\star}-\hat{Q}^{\tau}|\leq\epsilon_{\tau}+L_{Q}\sqrt{\epsilon_{\text{NFs}}}. \tag{50}$$

Taking the expectation under $d^{\pi^{\star}_{\beta}}_{h}$ and applying distribution mismatch to the NFs error term, we have:

$$\begin{aligned}
\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[|Q^{\star}-\hat{Q}^{\tau}|] &\leq\epsilon_{\tau}+L_{Q}\cdot\mathbb{E}_{s\sim d^{\pi^{\star}_{\beta}}_{h}}[\sqrt{\epsilon_{\text{NFs},s}}] && \text{(51)}\\
&\leq\epsilon_{\tau}+L_{Q}\sqrt{c^{\star}_{\beta}\cdot\epsilon_{\text{NFs}}}. && \text{(52)}
\end{aligned}$$

Summing over stages, we have:

$$\sum_{h=1}^{H}\mathbb{E}[\text{Term (II)}]\leq H\cdot L_{\pi}\left(\epsilon_{\tau}+L_{Q}\sqrt{c^{\star}_{\beta}\cdot\epsilon_{\text{NFs}}}\right). \tag{53}$$

Final Bound. Combining, we have:

$$\begin{aligned}
J(\pi^{\star}_{\beta})-J(\hat{\pi}^{\star}_{\mathcal{D}}) &\leq 2H\left[\sum_{h}\text{Term (I)}+\sum_{h}\text{Term (II)}\right] && \text{(54)}\\
&\leq\mathcal{O}\left(\frac{c^{\star}_{\beta}H^{2}}{\tilde{c}}\sqrt{c}\left(\frac{\log|\Pi|/\delta}{N}\right)^{1/4}\right)+\sqrt{c^{\star}_{\beta}}H^{2}\sqrt{\delta_{\text{approx}}} && \text{(55)}\\
&\quad+c^{\star}_{\beta}H^{2}L_{\pi}\left(\sqrt{c^{\star}_{\beta}\cdot\epsilon_{\text{NFs}}}+\epsilon_{\tau}\right). && \text{(56)}
\end{aligned}$$

A union bound over the events from Theorem 3.1 and the MLE analysis gives probability $\geq 1-2\delta$. ∎

C.3 Comparison with R2CSL

| Aspect | R2CSL | QHyer |
|---|---|---|
| Conditioning signal | RTG: $f(s,h)=\sum_{t=h}^{H}r_{t}$ | Q-value: $Q(s,a,g)=p_{+}^{\beta}(g\mid s,a)$ |
| Signal property | Trajectory-dependent | State-dependent |
| Consistency constraint | Required | Not required |
| Stitching mechanism | Explicit RTG relabeling | Implicit via expectile |
| Estimation method | Quantile regression | Expectile + NFs |
| Additional error term | None | $L_{Q}\sqrt{c^{\star}_{\beta}\epsilon_{\text{NFs}}}$ |
| Sample complexity | $\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert H$ | $\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert\lvert\mathcal{G}\rvert H$ |
| Convergence rate | $\mathcal{O}(N^{-1/4})$ | $\mathcal{O}(N^{-1/4})$ |

Appendix D QHyer Algorithm Details

This section describes the architecture, training, and inference procedures of QHyer. The overall structure is depicted in Figure 8, and the complete algorithm is summarized in Algorithm 1.

Figure 8: Overview of QHyer. Left: Offline dataset $\mathcal{D}$ with state-goal-action tuples. Middle: NFs-based critic estimates $Q^{\beta}_{\theta}(s_{t},a_{t},g)=\log p_{\theta}(g|s_{t},a_{t})$ via an SA-Encoder and RealNVP. Right: The Hybrid Attention-Mamba actor predicts $\hat{Q}(s_{t},g)$ via expectile regression and outputs action $\hat{a}_{t}$ conditioned on the predicted maximum Q-value.

Model Architecture.

The input sequence follows the format $\langle\tilde{Q}_{t},sg_{t},a_{t}\rangle$ where $sg_{t}=[s_{t};g]$ denotes state-goal concatenation (Schaul et al., 2015), and $\tilde{Q}_{t}$ is the normalized Q-value computed from the NFs-based critic (Equation 4):

$$\tilde{Q}_{t}=Q^{\beta}_{\theta}(s_{t},a_{t},g)/(\overline{|Q^{\beta}|}+\delta), \tag{57}$$

where $Q^{\beta}_{\theta}(s_{t},a_{t},g)=\log p_{\theta}(g|s_{t},a_{t})$ is the behavior Q-value estimated by NFs, $\overline{|Q^{\beta}|}$ denotes the mean absolute Q-value over the batch, and $\delta$ is a small constant for numerical stability. At timestep $t$, the model takes a context window of length $K$:

Input: $\langle\tilde{Q}_{t-K+1},sg_{t-K+1},a_{t-K+1},\ldots,\tilde{Q}_{t},sg_{t},a_{t}\rangle$
Output: $\langle\hat{Q}_{t-K+1},\hat{a}_{t-K+1},\Box,\ldots,\hat{Q}_{t},\hat{a}_{t},\Box\rangle$
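The batch normalization of Equation 57 amounts to dividing each log-density by the batch mean of absolute values. A minimal sketch follows, assuming a plain list of per-sample critic outputs; the function name and the default value of `delta` are assumptions, since the paper only describes $\delta$ as a small constant.

```python
def normalize_q(q_batch, delta=1e-6):
    """Eq. (57): divide each Q-value by the batch mean absolute Q-value plus delta."""
    mean_abs = sum(abs(q) for q in q_batch) / len(q_batch)
    return [q / (mean_abs + delta) for q in q_batch]


# Hypothetical log-densities from the critic for a batch of three samples.
normalized = normalize_q([-2.0, -4.0, -6.0])
```

After normalization the mean absolute value of the conditioning tokens is close to one, which keeps their scale comparable across batches and tasks.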

The NF critic consists of an SA-Encoder that maps $(s_{t},a_{t})$ to a latent representation, followed by a RealNVP (Dinh et al., 2017) that computes $Q^{\beta}_{\theta}(s_{t},a_{t},g)$. The Hybrid Attention-Mamba backbone processes tokens through $N$ transformer blocks with learnable attention-Mamba gating as described in Section 3.2.
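To illustrate how a RealNVP-style flow yields an exact log-density that can serve as a Q-value, the sketch below implements a single conditional affine coupling layer over a two-dimensional goal with a standard normal base distribution. It is a toy stand-in under strong assumptions: the scalar conditioning value `cond` plays the role of the SA-Encoder output, the linear scale and shift functions and their weights are hypothetical, and the actual critic in the paper stacks learned layers.

```python
import math


def coupling_log_prob(g, cond, w_scale, w_shift):
    """log p(g | cond) under one affine coupling layer and a standard normal base.

    The first goal coordinate passes through unchanged and parameterizes an
    affine map of the second coordinate, so the Jacobian is triangular and
    its log-determinant equals the log-scale.
    """
    g1, g2 = g
    log_scale = math.tanh(w_scale[0] * g1 + w_scale[1] * cond)  # bounded for stability
    shift = w_shift[0] * g1 + w_shift[1] * cond
    z1, z2 = g1, g2 * math.exp(log_scale) + shift
    log_base = -0.5 * (z1 * z1 + z2 * z2) - math.log(2.0 * math.pi)
    return log_base + log_scale  # change of variables: base density + log|det J|


# With zero weights the layer is the identity and the density is standard normal.
q_value = coupling_log_prob((0.3, -0.2), cond=0.7, w_scale=(0.0, 0.0), w_shift=(0.0, 0.0))
```

Because the transformation is invertible with a tractable Jacobian, the returned value is an exact log-likelihood rather than a bound, which is the property the critic relies on.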

Q-Conditioned Policy Learning.

Unlike prior Q-enhanced supervised learning methods that incorporate Q-values into loss functions, we use Q-values as conditioning tokens input to the policy network. This design enables the policy to explicitly leverage Q-value signals for action selection during both training and inference. The total loss is defined in Equation 14, combining the NFs-based critic loss (Equation 5), behavior cloning loss (Equation 15), and expectile regression loss (Equation 10).

Algorithm 1 QHyer Training and Inference
Input: Offline dataset $\mathcal{D}$, context length $K$, expectile $\tau$, noise std $\sigma$, stability constant $\delta$
Initialize: SA-Encoder $\psi$, NFs-based critic $\theta$, Hybrid Attention-Mamba actor $\phi$

// Joint Training (end-to-end)
for each training iteration do
  Sample batch of trajectories $\{(s_{t},a_{t},g)\}$ from $\mathcal{D}$
  // Step 1: Compute behavior Q-values via NF critic (Equation 4)
  $\text{repr}_{t}\leftarrow\text{SA-Encoder}_{\psi}(s_{t},a_{t})$
  $Q^{\beta}_{\theta}(s_{t},a_{t},g)\leftarrow\log p_{\theta}(g+\epsilon\mid\text{repr}_{t})$ // $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$, denoising
  $\tilde{Q}_{t}\leftarrow Q^{\beta}_{\theta}(s_{t},a_{t},g)/(\overline{|Q^{\beta}|}+\delta)$ // normalize (Equation 57)
  // Step 2: Forward through Q-conditioned policy
  Construct input: $\mathbf{x}=\langle\tilde{Q}_{t-K+1},sg_{t-K+1},a_{t-K+1},\ldots,\texttt{stopgrad}(\tilde{Q}_{t}),sg_{t}\rangle$
  $\hat{Q}(s_{t},g),\hat{a}_{t}\leftarrow\text{QHyer}_{\phi}(\mathbf{x})$
  // Step 3: Compute losses (Equations 5, 15 and 10)
  $\mathcal{L}_{\text{NFs}}\leftarrow-\mathbb{E}[\log p_{\theta}(g|s_{t},a_{t})]$
  $\mathcal{L}_{\text{BC}}\leftarrow-\mathbb{E}[\log\pi_{\phi}(a_{t}|\tilde{Q}_{t},[s_{t};g])]$
  $\mathcal{L}_{Q}\leftarrow\mathbb{E}[L^{2}_{\tau}(\hat{Q}(s_{t},g)-Q^{\beta}_{\theta}(s_{t},a_{t},g))]$
  Update $(\psi,\theta,\phi)$ by $\nabla(\lambda_{\text{critic}}\mathcal{L}_{\text{NFs}}+\lambda_{\text{BC}}\mathcal{L}_{\text{BC}}+\lambda_{Q}\mathcal{L}_{Q})$
end for

// Inference: Trajectory Stitching via Q-Conditioning
Initialize buffers: $\mathbf{sg},\mathbf{a},\mathbf{Q}\leftarrow\mathbf{0}$
$s_{0},g\leftarrow\text{Env.reset}()$
for $t=0,1,2,\ldots$ until done do
  $sg_{t}\leftarrow[s_{t};g]$ // concatenate state and goal
  Retrieve context window: $(\mathbf{sg},\mathbf{a},\mathbf{Q})_{t-K+1:t}$
  // Stage 1: Predict maximum in-distribution Q-value
  $\hat{Q}(s_{t},g)\leftarrow\text{QHyer}_{\phi}^{Q}(\mathbf{sg}_{t-K+1:t},\mathbf{a}_{t-K+1:t-1},\mathbf{Q}_{t-K+1:t-1})$
  Update buffer: $\mathbf{Q}_{t}\leftarrow\hat{Q}(s_{t},g)$
  // Stage 2: Predict action conditioned on maximum Q-value
  $\hat{a}_{t}\leftarrow\text{QHyer}_{\phi}^{a}(\mathbf{sg}_{t-K+1:t},\mathbf{a}_{t-K+1:t-1},\mathbf{Q}_{t-K+1:t})$
  $s_{t+1}\leftarrow\text{Env.step}(\text{clip}(\hat{a}_{t},-1,1))$
  Update buffer: $\mathbf{a}_{t}\leftarrow\hat{a}_{t}$
end for

In practice, we apply a denoising trick to the NFs-based critic by adding Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$ to goals during training, which improves density estimation quality. The expectile loss $L^{2}_{\tau}(\cdot)$ is defined in Equation 9, where $\tau\in(0.5,1)$ controls the asymmetry. When $\tau>0.5$, underestimation is penalized more heavily than overestimation, driving the learned $\hat{Q}(s_{t},g)$ toward the maximum of $Q^{\beta}_{\theta}(s_{t},a_{t},g)$ over all actions in the dataset.
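The asymmetry of the expectile loss can be seen on a single sample. The sketch below uses the convention $u=\text{target}-\text{prediction}$ with $L^{2}_{\tau}(u)=|\tau-\mathds{1}(u<0)|\,u^{2}$, as in the proof of Theorem 3.1; this choice is an assumption here, since Equation 9 is not restated in this appendix and Algorithm 1 writes the argument in the opposite order. The function name is hypothetical.

```python
def expectile_loss(target, prediction, tau):
    """L_tau^2(u) = |tau - 1(u < 0)| * u^2 with u = target - prediction."""
    u = target - prediction
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


# Same absolute error of 0.2 on either side of a target of 1.0, with tau = 0.9.
under = expectile_loss(target=1.0, prediction=0.8, tau=0.9)  # prediction too low
over = expectile_loss(target=1.0, prediction=1.2, tau=0.9)   # prediction too high
```

Under this convention a prediction below the target is weighted by $\tau$ and one above it by $1-\tau$, so for $\tau$ close to 1 the minimizer is pulled toward the largest targets seen for a given state and goal.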

Inference: Trajectory Stitching via Q-Conditioning.

In classical Q-learning, the optimal value function $Q^{*}$ derives the optimal action given the current state. In our framework, we leverage the maximum Q-value $\hat{Q}(s_{t},g)$ to help the policy select near-optimal actions. Note that $\hat{Q}(s_{t},g)$ depends only on state and goal because the action is marginalized by the expectile regression. The inference pipeline follows:

$$\overset{\text{Env}}{\longmapsto}\left(s_{0},g\right)\xrightarrow{\hat{Q}}\hat{Q}(s_{0},g)\xrightarrow{\pi_{\phi}}a_{0}\xrightarrow{\text{Env}}\left(s_{1},g\right)\xrightarrow{\hat{Q}}\hat{Q}(s_{1},g)\xrightarrow{\pi_{\phi}}a_{1}\rightarrow\cdots \tag{58}$$

At each timestep $t$, QHyer performs two-stage autoregressive generation as shown in Algorithm 1:

  1. Predict maximum Q-value: Given the historical context window, the model first predicts $\hat{Q}(s_{t},g)$, which represents the maximum achievable goal-reaching probability from the current state.

  2. Predict action: Conditioned on the predicted $\hat{Q}(s_{t},g)$, the model then outputs the action $\hat{a}_{t}$ that achieves this maximum Q-value.

When the initial state and goal correspond to different trajectories in the dataset, which is precisely the scenario requiring trajectory stitching, our model outputs effective actions by leveraging the Q-conditioned policy.
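The two-stage inference loop of Algorithm 1 can be sketched with stand-in components. Everything below is hypothetical scaffolding: `ToyEnv` is a one-dimensional chain, `ToyModel` replaces the two output heads of the trained network with hand-written rules, and `tail` is a helper for slicing the context window. Only the control flow (predict the Q-value, append it to the buffer, predict the action, clip, step, store the action) mirrors the algorithm.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def tail(buf, n):
    """Last n entries of buf (empty list when n == 0)."""
    return buf[-n:] if n > 0 else []


class ToyEnv:
    """1-D chain: the state moves by the clipped action until it hits the goal."""
    def reset(self):
        self.state, self.goal = 0.0, 3.0
        return self.state, self.goal

    def step(self, action):
        self.state += action
        return self.state, abs(self.state - self.goal) < 1e-9


class ToyModel:
    """Stand-in for the Q head and the action head; not the trained network."""
    def predict_q(self, sg_ctx, a_ctx, q_ctx):
        s, g = sg_ctx[-1]
        return -abs(g - s)

    def predict_action(self, sg_ctx, a_ctx, q_ctx):
        s, g = sg_ctx[-1]
        return 2.0 * (g - s)  # deliberately outside [-1, 1] to exercise clipping


def rollout(env, model, context_len, max_steps=20):
    sg_buf, a_buf, q_buf = [], [], []
    s, g = env.reset()
    for _ in range(max_steps):
        sg_buf.append((s, g))
        sg_ctx = tail(sg_buf, context_len)
        a_ctx = tail(a_buf, context_len - 1)
        # Stage 1: the Q head sees Q-values only up to t-1.
        q_hat = model.predict_q(sg_ctx, a_ctx, tail(q_buf, context_len - 1))
        q_buf.append(q_hat)
        # Stage 2: the action head additionally sees the freshly predicted Q_t.
        a_hat = model.predict_action(sg_ctx, a_ctx, tail(q_buf, context_len))
        s, done = env.step(clip(a_hat, -1.0, 1.0))
        a_buf.append(a_hat)
        if done:
            break
    return sg_buf, a_buf, q_buf


env = ToyEnv()
states_goals, actions, q_preds = rollout(env, ToyModel(), context_len=2)
```

As in Algorithm 1, the buffer stores the unclipped prediction $\hat{a}_{t}$ while the environment receives the clipped action.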

Appendix E Baseline Details

We compare our approach with a wide variety of baselines, including sequence modeling, TD-based RL methods and Offline GCRL methods. Particularly, we include the following methods:

  • For sequence modeling methods, we include Decision Transformer (DT) (Chen et al., 2021), Elastic Decision Transformer (EDT) (Wu et al., 2023), Graph Decision Transformer (GDT) (Hu et al., 2023), Q-learning Decision Transformer (QDT) (Yamagata et al., 2023), Critic-Guided Decision Transformer (CGDT) (Wang et al., 2024), Reinforced Transformer (Reinformer) (Zhuang et al., 2024), Decision ConvFormer (DC) (Kim et al., 2024b), Decision Mamba (DMamba) (Ota, 2024), Q-value Regularized Transformer (QT) (Hu et al., 2024), Long-Short Decision Transformer (LSDT) (Wang et al., 2025), Decision Mixer (DMixer) (Zheng et al., 2025a), and Value-guided Decision Transformer (VDT) (Zheng et al., 2025b). DT is a classic sequence modeling method that utilizes a Transformer architecture to model and reproduce sequences from demonstrations, integrating a goal-conditioned policy to convert Offline RL into a supervised learning task. Despite its competitive performance in Offline RL tasks, DT falls short in achieving trajectory stitching (Brandfonbrener et al., 2022). GDT extends DT by explicitly structuring the input sequence as a causal graph and incorporating relation-enhanced attention to better model the dependencies between states, actions, and rewards. EDT is a variant of DT whose key feature is its ability to determine the optimal history length to promote trajectory stitching. However, it does not incorporate the RL objective that maximizes returns to enhance the model (Zhuang et al., 2024), and its stitching capabilities are limited (Kim et al., 2024a). QDT integrates Dynamic Programming with the DT framework to enhance the optimal path generation ability of DT. CGDT enhances DT by incorporating a value-based critic to align the expected returns of actions with target returns, effectively addressing the inconsistency issues of Return-Conditioned Supervised Learning in stochastic environments and suboptimal datasets.
DC replaces attention blocks with convolution filters to more efficiently capture local associations. Reinformer is similar to our work; however, it exhibits limited stitching capabilities due to the absence of $Q$-values, resulting in a significant performance gap compared to TD-based RL methods. DMamba replaces the attention mechanism in DT with the Mamba selective state space model to achieve linear computational complexity while maintaining sequence modeling capabilities. QT introduces Q-value regularization to optimize action selection on top of DT and excels in handling long time horizons and sparse reward tasks. LSDT enhances the model structure of DT with a dual-branch architecture (long-term and local features) adept at extracting information within different ranges. DMixer integrates both long-term and local features, and additionally introduces a plug-and-play dynamic token selection mechanism to ensure that the model can adaptively allocate attention to different features based on the specific requirements of each task. VDT leverages value functions to perform advantage-weighting and behavior regularization on the DT, guiding the policy toward upper-bound optimal decisions during the offline training phase.

  • For TD-based RL methods, we include Conservative Q-Learning (CQL) (Kumar et al., 2020) and Implicit Q-Learning (IQL) (Kostrikov et al., 2022). CQL and IQL are classical offline RL methods that utilize dynamic programming. This trick endows them with stitching properties (Cheikhi & Russo, 2023; Ghugare et al., 2024).

  • For Offline GCRL methods, we include goal-conditioned behavioral cloning (GCBC) (Ghosh et al., 2021), goal-conditioned implicit V-learning (GCIVL) and Q-learning (GCIQL) (Kostrikov et al., 2022), Quasimetric RL (QRL) (Wang et al., 2023), Contrastive RL (CRL) (Eysenbach et al., 2022), and Hierarchical implicit Q-learning (HIQL) (Park et al., 2023). For these baselines, we follow the implementation setup established by OGBench (Park et al., 2025a) throughout our experiments. Additionally, we select Subgoal Advantage-Weighted Policy Bootstrapping (SAW) (Zhou & Kao, 2025), Option-aware Temporally Abstracted (OTA) (Ahn et al., 2025) and Eikonal-Constrained Quasimetric RL (Eik-QRL) (Giammarino & Qureshi, 2026) as our state-of-the-art GCRL baselines. SAW trains a flat policy by directly sampling subgoals from offline datasets through advantage-weighted policy bootstrapping, thereby eliminating the need for complex subgoal generation models, and achieves superior performance on long-horizon, high-dimensional control tasks. OTA employs temporal abstraction to reduce the effective planning horizon, which substantially improves the scalability of high-level policies to long-horizon tasks. Eik-HiQRL overcomes QRL’s dependence on trajectory continuity for local constraints and its struggle to maintain a valid quasimetric structure in high-dimensional, long-horizon tasks by introducing a trajectory-free Eikonal PDE constraint at the high level and a hierarchical policy decomposition.

Appendix F Experiment Details

In this section, we provide details of the offline datasets and of the implementation of all algorithms used in our experiments: Offline GCRL Datasets, Normalizing Flows, and QHyer.

Figure 9: Example non-Markovian GCRL datasets from OGBench. For the stitch dataset type, each trajectory is limited to traveling at most 4 blocks, while at inference the distance between the start and goal can be up to 30 in the Giant maze.

F.1 Offline GCRL non-Markovian Datasets

We adopt the manipulation suite from OGBench (Park et al., 2025a), which consists of three robotic manipulation environments based on a 6-DoF UR5e robot arm. These environments are designed to evaluate the agent’s capabilities in object manipulation, sequential generalization, and combinatorial generalization.

  • Cube: This task involves pick-and-place manipulation of cube blocks, where the goal is to arrange cubes into designated configurations. Four variants are provided with different numbers of cubes: single, double, triple, and quadruple (1–4 cubes). At test time, the agent must perform moving, stacking, swapping, or permuting operations on the cube blocks.

  • Scene: This task is designed to challenge sequential, long-horizon reasoning capabilities. It involves manipulating diverse everyday objects including a cube block, a window, a drawer, and two button locks. The longest evaluation task requires completing up to eight atomic behaviors in sequence.

  • Puzzle: This task evaluates combinatorial generalization by requiring the agent to solve the “Lights Out” puzzle with a robot arm. Four difficulty levels are provided: 3x3, 4x4, 4x5, and 4x6, with state spaces containing up to $2^{24}=16{,}777{,}216$ distinct configurations.
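As a quick check of the configuration count quoted for the Puzzle task, the sketch below counts the on/off states of each board size and illustrates the classic Lights Out rule, in which pressing a button flips it and its orthogonal neighbours. The function names are illustrative and not taken from OGBench, and the toggle rule is the textbook one, assumed here rather than confirmed against the benchmark code.

```python
import numpy as np

def num_configurations(rows: int, cols: int) -> int:
    """Each button is either on or off, so there are 2^(rows*cols) states."""
    return 2 ** (rows * cols)

def press(board: np.ndarray, r: int, c: int) -> np.ndarray:
    """Return a new board after pressing (r, c) under the classic rule (assumed)."""
    out = board.copy()
    rows, cols = board.shape
    for dr, dc in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            out[rr, cc] ^= 1  # flip the button between 0 and 1
    return out

sizes = [(3, 3), (4, 4), (4, 5), (4, 6)]
counts = {f"{r}x{c}": num_configurations(r, c) for r, c in sizes}

board = np.zeros((4, 6), dtype=int)
after_one = press(board, 0, 0)      # a corner press flips 3 buttons
after_two = press(after_one, 0, 0)  # pressing the same button twice undoes it
```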

Visualization examples of these tasks are shown in Figure 9. For each manipulation environment, OGBench provides two types of datasets with different collection policies:

  • Play datasets (play): Collected by non-Markovian expert policies with temporally correlated noise, following the “play data” paradigm (Lynch et al., 2020). This results in smoother, more realistic trajectories that pose additional challenges for standard RL algorithms.

  • Noisy datasets (noisy): Collected by Markovian expert policies with uncorrelated Gaussian noise. These datasets serve as controlled baselines for ablation studies, allowing researchers to isolate the effects of non-Markovian data collection.

In the experiments comparing with related sequence modeling approaches, we adopt the maze navigation tasks from D4RL (Fu et al., 2020), which provide challenging benchmarks for evaluating offline RL algorithms on undirected, multitask data with sparse rewards.

  • Maze2D: This domain is a navigation task requiring a 2D point-mass agent to reach a fixed goal location. Three maze layouts are provided with increasing complexity: umaze, medium, and large. The tasks are designed to test the ability of offline RL algorithms to stitch together previously collected sub-trajectories to find the shortest path to the evaluation goal.

  • AntMaze-v2: This domain replaces the simple 2D ball from Maze2D with a more complex 8-DoF quadrupedal “Ant” robot, introducing morphological complexity that mimics real-world robotic navigation tasks. The same three maze layouts (umaze, medium, large) are used, with a sparse 0-1 reward that is activated only upon reaching the goal. Three dataset variants are provided: standard goal-reaching from fixed start locations, “diverse” datasets with random start and goal locations, and “play” datasets with hand-picked navigation waypoints.

Visualization examples are shown in Figure 10. A critical characteristic of both Maze2D and AntMaze-v2 datasets is that they are collected by non-Markovian policies. The data generation process employs a hierarchical controller: a high-level planner generates sequences of waypoints, which are then followed by a low-level PD controller (for Maze2D) or a trained goal-reaching policy (for AntMaze-v2). Because these controllers maintain internal states to track visited waypoints and update their targets upon reaching intermediate goals, the resulting behavior policies are inherently non-Markovian. This property introduces additional challenges for offline RL algorithms, as the data cannot be accurately modeled by assuming a Markovian behavior policy, potentially causing bias in methods that rely on such assumptions (Fu et al., 2020).
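To make the non-Markovian property concrete, the following toy sketch shows a waypoint-following controller whose action at the same position changes once its internal waypoint index advances. It is a simplified illustration under our own assumptions (a proportional controller, a hypothetical `WaypointController` class, and a hand-picked waypoint list), not the D4RL data-collection code.

```python
import numpy as np

class WaypointController:
    """Toy proportional controller that tracks a list of waypoints in order."""

    def __init__(self, waypoints, gain=1.0, tol=0.1):
        self.waypoints = [np.asarray(w, dtype=float) for w in waypoints]
        self.gain = gain
        self.tol = tol
        self.idx = 0  # internal state: index of the current target waypoint

    def act(self, pos):
        pos = np.asarray(pos, dtype=float)
        reached = np.linalg.norm(self.waypoints[self.idx] - pos) < self.tol
        if reached and self.idx < len(self.waypoints) - 1:
            self.idx += 1  # switch to the next waypoint
        return self.gain * (self.waypoints[self.idx] - pos)

ctrl = WaypointController([(1.0, 0.0), (0.0, 0.0)])
pos = np.array([0.5, 0.0])
a_before = ctrl.act(pos)               # heading towards (1, 0)
ctrl.act(np.array([1.0, 0.0]))         # first waypoint reached, target switches
a_after = ctrl.act(pos)                # same position, now heading towards (0, 0)
```

The two actions at the identical position point in opposite directions, so no function of the current state alone can reproduce this behavior policy.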

Panels (left to right): Umaze, Medium, Large.

Figure 10: GCRL example non-Markovian datasets from D4RL (Fu et al., 2020): The AntMaze-v2 datasets involve controlling an 8-DoF quadruped to navigate towards a specified goal state. This benchmark requires value propagation to effectively stitch together sub-optimal trajectories from the collected data.

F.2 Implementation Details

We ran all our experiments on NVIDIA RTX 3090 GPUs with 24GB of memory within an internal cluster. We use the default configurations in Park et al. (2025a), with some values modified. In pixel-based environments, following Park et al. (2025a), we employ an IMPALA-style encoder to transform images into state tokens. The architecture and training process of the Normalizing Flows are identical to those described in Ghugare & Eysenbach (2025).

Table 5: NF Q-value Estimation Time Analysis in QHyer. All timing values are in milliseconds (ms) per batch, averaged over 100 steps (batch size = 256). Experiments conducted on a single NVIDIA RTX 3090 GPU with dual Intel Xeon E5-2620 v4 CPUs @ 2.10GHz.
Environment NF Train (ms) Actor (ms) Infer-Q (ms) Infer-A (ms) NF Ratio
cube-single-play-v0 2.42 3.07 0.007 0.015 28.3%
cube-double-play-v0 2.43 2.99 0.008 0.019 26.7%
cube-triple-play-v0 2.47 3.14 0.008 0.015 28.8%
cube-quadruple-play-v0 2.54 3.12 0.011 0.019 27.0%
cube-single-noisy-v0 2.44 3.06 0.009 0.015 29.3%
cube-double-noisy-v0 2.49 3.07 0.011 0.019 26.9%
cube-triple-noisy-v0 2.42 3.01 0.008 0.014 29.2%
cube-quadruple-noisy-v0 2.55 3.14 0.012 0.018 27.1%
scene-play-v0 2.43 3.04 0.009 0.015 28.8%
scene-noisy-v0 2.43 3.02 0.008 0.020 26.9%
puzzle-3x3-play-v0 2.39 3.01 0.008 0.015 28.7%
puzzle-4x4-play-v0 2.52 3.08 0.010 0.019 27.1%
puzzle-4x5-play-v0 2.48 3.12 0.008 0.015 28.7%
puzzle-4x6-play-v0 2.54 3.10 0.009 0.020 26.8%
puzzle-3x3-noisy-v0 2.45 3.10 0.009 0.015 29.0%
puzzle-4x4-noisy-v0 2.49 3.09 0.009 0.019 26.9%
puzzle-4x5-noisy-v0 2.51 2.95 0.009 0.015 30.1%
puzzle-4x6-noisy-v0 2.44 2.98 0.008 0.021 27.2%
Average 2.47 3.06 0.009 0.017 28.0%

Our QHyer implementation draws inspiration from LSDT (Wang et al., 2025) and Decision Mamba (Ota, 2024). The state tokens, goal tokens, $Q$-function tokens and action tokens are first processed by different linear layers. These tokens are then fed into the decoder layer to obtain the embedding. Here the decoder layer is a lightweight implementation from Reinformer (Zhuang et al., 2024). The context length for the decoder layer is denoted as $K$. We use the AdamW optimizer (Loshchilov, 2017) to optimize the total loss, in line with the original papers of these methods. The hyperparameter of the $L_{QHyer}$ loss is denoted as $\tau$.

F.3 Hyperparameter Settings

Table 6 summarizes the hyperparameters shared across all experiments. The Hybrid Attention-Mamba architecture uses learnable mixing weights between attention and Mamba branches, with a total hidden dimension of $d_{\text{model}}$. The Normalizing Flow architecture follows Ghugare & Eysenbach (2025). The expectile regression parameter $\tau$ is set according to our theoretical guidance (Theorem 3.1).

Table 6: Common hyperparameters shared across all tasks.
Hyperparameter OGBench (State) OGBench (Pixel) D4RL
Training steps 1M 500K 100K
Batch size 1024 512 256
Optimizer AdamW AdamW AdamW
Weight decay 0.0 0.0 1e-4
Gradient clipping 1.0 1.0 0.25
NF noise std 0.05 0.05
Encoder hidden dim 1024 1024
NF representation size 64–128 64–128
BC weight $\lambda_{\text{BC}}$ 1.0 1.0 1.0
Q weight $\lambda_{\text{Q}}$ 1.0 1.0 1.0
Expectile $\tau$ 0.95–0.99 0.95–0.99 0.90–0.99
State-goal concatenation True False True
Image size 64×64
Image encoder IMPALA-small
Warmup steps 10000
LR schedule Cosine

Architecture notes. For both OGBench and D4RL experiments, the Hybrid Attention-Mamba backbone uses learnable mixing weights that are automatically optimized during training, eliminating the need for manual tuning of attention-to-Mamba ratios. The total hidden dimension $d_{\text{model}}$ (denoted as h_dim in OGBench and embed_dim in D4RL) represents the combined capacity of both branches, with the proportion learned end-to-end via gradient descent.
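One simple way to realize learnable mixing weights is a sigmoid-gated convex combination of the two branch outputs. The sketch below assumes this form with a per-channel gate; the function `mix_branches`, the gate parameterization, and the random stand-ins for the branch outputs are our illustrative assumptions, and the gating actually used in QHyer may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_branches(attn_out, mamba_out, gate_logit):
    """Blend two (T, d_model) branch outputs with a learnable per-channel gate.

    gate_logit has shape (d_model,); alpha = sigmoid(gate_logit) lies in (0, 1),
    so the result is a convex combination of the two branches.
    """
    alpha = sigmoid(gate_logit)
    return alpha * attn_out + (1.0 - alpha) * mamba_out, alpha

rng = np.random.default_rng(0)
T, d_model = 20, 256
attn_out = rng.normal(size=(T, d_model))   # stand-in for the attention branch output
mamba_out = rng.normal(size=(T, d_model))  # stand-in for the Mamba branch output
gate_logit = np.zeros(d_model)             # zero logits give an equal 50/50 mix
mixed, alpha = mix_branches(attn_out, mamba_out, gate_logit)
```

In a trained model `gate_logit` would be a parameter updated by gradient descent together with the rest of the network.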

Table 7 presents the task-specific hyperparameters for OGBench state-based manipulation environments. Following our theoretical analysis (Theorem 3.1), we set $\tau=0.99$ for play datasets (medium Q-value coverage) and $\tau=0.95$ for noisy datasets (higher coverage due to exploration noise).

Table 7: Task-specific hyperparameters for OGBench state-based manipulation tasks (1M training steps). $K$: context length, $d_{\text{model}}$: hidden dimension, $L$: number of Transformer blocks, $H$: number of attention heads.
Environment $K$ $d_{\text{model}}$ $L$ $H$ LR Dropout $\tau$ NF Blocks NF Channels
cube-single-play-v0 20 256 4 4 3e-4 0.1 0.99 6 256
cube-single-noisy-v0 20 256 4 4 3e-4 0.1 0.95 6 256
cube-double-play-v0 25 384 5 6 3e-4 0.1 0.99 8 256
cube-double-noisy-v0 25 384 5 6 3e-4 0.1 0.95 8 256
cube-triple-play-v0 30 512 6 8 2e-4 0.15 0.99 10 384
cube-triple-noisy-v0 30 512 6 8 2e-4 0.15 0.95 10 384
cube-quadruple-play-v0 35 640 6 8 1e-4 0.2 0.99 12 512
cube-quadruple-noisy-v0 35 640 6 8 1e-4 0.2 0.95 12 512
scene-play-v0 30 384 5 6 3e-4 0.1 0.99 8 384
scene-noisy-v0 30 384 5 6 3e-4 0.1 0.95 8 384
puzzle-3x3-play-v0 25 512 6 8 3e-4 0.1 0.99 8 384
puzzle-3x3-noisy-v0 25 512 6 8 3e-4 0.1 0.95 8 384
puzzle-4x4-play-v0 30 640 6 8 2e-4 0.15 0.99 10 384
puzzle-4x4-noisy-v0 30 640 6 8 2e-4 0.15 0.95 10 384
puzzle-4x5-play-v0 35 768 6 8 1e-4 0.2 0.99 10 512
puzzle-4x5-noisy-v0 35 768 6 8 1e-4 0.2 0.95 10 512
puzzle-4x6-play-v0 40 768 6 8 1e-4 0.2 0.99 10 512
puzzle-4x6-noisy-v0 40 768 6 8 1e-4 0.2 0.95 10 512

Table 8 presents the hyperparameters for pixel-based (visual) manipulation tasks. Compared to state-based tasks, pixel-based tasks use a smaller batch size (512 vs. 1024) due to memory constraints and shorter training (500K steps). Goals are represented as images rather than concatenated state vectors.

Table 8: Task-specific hyperparameters for OGBench pixel-based manipulation tasks (500K training steps). $K$: context length, $d_{\text{model}}$: hidden dimension, $L$: number of Transformer blocks, $H$: number of attention heads.
Environment $K$ $d_{\text{model}}$ $L$ $H$ LR Dropout $\tau$ NF Blocks NF Channels
visual-cube-single-play-v0 15 256 4 4 3e-4 0.1 0.99 6 256
visual-cube-double-play-v0 20 384 5 6 3e-4 0.1 0.99 8 256
visual-cube-triple-play-v0 25 512 6 8 2e-4 0.15 0.99 10 384
visual-scene-play-v0 25 384 5 6 3e-4 0.1 0.99 8 384
visual-scene-noisy-v0 25 384 5 6 3e-4 0.1 0.95 8 384

For pixel-based tasks, the NF uses a DrQ-v2 style CNN (Yarats et al., 2022) to encode images into 256-dim features, which are concatenated with actions and passed through a 4-layer MLP to produce the state-action representation. The NF models goal-reaching probability in the low-dimensional coordinate space (e.g., object positions), extracted from simulator state. The LSDM actor uses an IMPALA-style encoder (Espeholt et al., 2018) to encode both observation and goal images into 256-dim vectors.

Table 9 presents the hyperparameters for D4RL maze tasks. We use a unified Transformer architecture with $d_{\text{model}}=128$ and 3 blocks. Unlike OGBench, D4RL experiments use a cosine learning rate schedule with 10K warmup steps.

Table 9: Task-specific hyperparameters for D4RL maze tasks (100K training steps, 50 iterations × 2000 updates).
Environment $K$ LR $\tau$ $d_{\text{model}}$ $L$
antmaze-umaze-v2 2 2e-4 0.90 128 3
antmaze-umaze-diverse-v2 2 2e-4 0.90 128 3
antmaze-medium-play-v2 3 2e-4 0.99 128 3
antmaze-medium-diverse-v2 3 2e-4 0.99 128 3
antmaze-large-play-v2 3 4e-4 0.90 128 3
antmaze-large-diverse-v2 3 4e-4 0.90 128 3
maze2d-umaze-v1 10 2e-4 0.90 128 3
maze2d-medium-v1 10 2e-4 0.90 128 3

D4RL-specific settings. For D4RL maze tasks, we concatenate the 2D goal position to the state (--goalconcate), increasing the state dimension by 2. Training uses a combined learning rate schedule: linear warmup for 10K steps followed by cosine decay. We use a smaller batch size (256) and fewer training steps (100K) compared to OGBench, as D4RL maze tasks are less complex. The expectile parameter $\tau$ is set to 0.90 for umaze and large tasks, and 0.99 for medium tasks based on empirical tuning.
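The combined schedule can be sketched as follows. This is a minimal illustration using the warmup length and training length stated above; the peak learning rate of 2e-4 is one of the Table 9 values, while decaying to a floor of zero and the helper name `lr_at` are our assumptions.

```python
import math

def lr_at(step, peak_lr=2e-4, warmup_steps=10_000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at a few representative steps.
schedule = {s: lr_at(s) for s in (0, 5_000, 10_000, 55_000, 100_000)}
```

Under these assumptions the rate rises linearly to the peak at step 10K, is half the peak at the midpoint of the decay phase (step 55K), and reaches the floor at step 100K.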

For computational efficiency, we extract only task-relevant goal coordinates when training the NFs-based Q-value estimator in Equation 5. Given a full goal state $g_{\text{full}}$, we use $g=g_{\text{full}}[i_{\text{start}}:i_{\text{end}}]$ where the index range is environment-specific. Table 10 summarizes the configurations:

Table 10: Goal dimension configurations for NF Q-value estimation.
Task Category Goal Dim Description
cube-single-* / visual-cube-single-* 3 Object (x, y, z)
cube-double-* / visual-cube-double-* 6 Two objects
cube-triple-* / visual-cube-triple-* 9 Three objects
cube-quadruple-* 12 Four objects
scene-* / visual-scene-* 13 Scene objects
puzzle-3x3-* 9 3×3 tiles
puzzle-4x4-* 16 4×4 tiles
puzzle-4x5-* 20 4×5 tiles
puzzle-4x6-* 24 4×6 tiles
antmaze-*, maze2d-* 2 Agent (x, y)
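The goal extraction summarized in Table 10 amounts to a slice of the full goal vector. The sketch below records the goal dimensions from the table and applies the slice; the start index and the 29-dimensional example vector are hypothetical, since the environment-specific index ranges are not listed here.

```python
import numpy as np

# Goal dimensions copied from Table 10, keyed by task-name prefix.
GOAL_DIM = {
    "cube-single": 3, "cube-double": 6, "cube-triple": 9, "cube-quadruple": 12,
    "scene": 13,
    "puzzle-3x3": 9, "puzzle-4x4": 16, "puzzle-4x5": 20, "puzzle-4x6": 24,
    "antmaze": 2, "maze2d": 2,
}

def extract_goal(g_full, i_start, goal_dim):
    """Return g = g_full[..., i_start : i_start + goal_dim]."""
    g_full = np.asarray(g_full)
    return g_full[..., i_start:i_start + goal_dim]

# Hypothetical example: a batch of 29-dim goal states whose first two entries
# are assumed to hold the agent's (x, y) position.
g_full = np.arange(4 * 29, dtype=float).reshape(4, 29)
g = extract_goal(g_full, i_start=0, goal_dim=GOAL_DIM["antmaze"])
```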

For goal sampling in OGBench, we use $p_{\text{trajgoal}}=1.0$, $p_{\text{randomgoal}}=0.0$ for play datasets and $p_{\text{trajgoal}}=0.8$, $p_{\text{randomgoal}}=0.2$ for noisy datasets.
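The role of these two probabilities can be illustrated with a small sampler that picks the goal from the future of the same trajectory with probability $p_{\text{trajgoal}}$ and from the whole dataset otherwise. The function below is a sketch under our assumptions (uniform sampling over future indices, a flat dataset indexed by integers, a hypothetical helper name); the exact sampling distribution used by OGBench may differ.

```python
import numpy as np

def sample_goal_index(rng, t, traj_end, dataset_size, p_trajgoal, p_randomgoal):
    """Pick the dataset index of the goal for the transition at index t.

    traj_end is the last index of the trajectory containing t (inclusive).
    """
    assert abs(p_trajgoal + p_randomgoal - 1.0) < 1e-8
    if rng.random() < p_trajgoal:
        # Future (or current) state from the same trajectory, uniform by assumption.
        return int(rng.integers(t, traj_end + 1))
    # Otherwise any state in the dataset.
    return int(rng.integers(0, dataset_size))

rng = np.random.default_rng(0)
# play setting: goals always come from the same trajectory
play_goals = [sample_goal_index(rng, 120, 199, 10_000, 1.0, 0.0) for _ in range(1000)]
# noisy setting: 20% of goals are drawn from the whole dataset
noisy_goals = [sample_goal_index(rng, 120, 199, 10_000, 0.8, 0.2) for _ in range(1000)]
```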

Appendix G Additional Results

This section presents supplementary experiments and analyses for QHyer, including: (1) detailed discussion of the state-goal tokenization strategy and its role in enabling trajectory stitching, (2) ablation studies on regression functions, (3) qualitative visualization of trajectory stitching capabilities, (4) validation of Normalizing Flows for goal-reaching probability estimation, and (5) empirical verification of expectile regression for capturing maximum Q-values. Due to space constraints, these additional results are not included in the main body of this paper. The details are provided below.

G.1 Detailed Discussion of State-Goal Tokenization Strategy

Figure 11: State-goal tokenization strategy. (A) Offline data represented as a graph, where nodes denote states and edges represent transitions. Different trajectories (colored in orange/yellow) may share common states but target different goals. (B) State-goal concatenation: each state $s$ is concatenated with goal $g$ to form a unified token $[s;g]$, enabling the model to directly attend to goal-relevant state features. (C) Goal stitching illustration (see Section 3.3): by conditioning on concatenated state-goal tokens, QHyer can identify and combine successful trajectory segments from different source trajectories that share the same goal, enabling optimal path discovery that neither original trajectory achieves alone.

This section details our state-goal tokenization strategy illustrated in Figure 11 and its role in enabling trajectory stitching. The key insight is that concatenating state and goal into a unified token $[s;g]$ allows the Transformer’s self-attention mechanism to directly model cross-dependencies between current state features and goal specifications within each token position.

Panel A shows the offline dataset structure as a graph, where multiple trajectories (indicated by different colors) traverse overlapping state regions while pursuing different goals. This shared structure creates opportunities for trajectory stitching, which combines successful segments from different trajectories.

Figure 12: Ablation study on state-goal tokenization strategies.

Panel B contrasts DT’s standard tokenization with QHyer’s approach. In vanilla DT, states and goals may be processed separately or with weak coupling. QHyer instead concatenates $[s;g]$ at each timestep, ensuring that goal information is directly available when computing attention over state features. This design maintains the sequence length at $3T$ (Q-value, state-goal, action tokens) rather than increasing to $4T$ with separate goal tokens, avoiding quadratic attention overhead.
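The token layout described for Panel B can be sketched as follows. This is an illustrative construction with raw (unembedded) features and hypothetical dimensions; in the actual model each token type is first mapped to $d_{\text{model}}$ by its own linear layer.

```python
import numpy as np

def build_tokens(q, s, g, a):
    """Interleave (Q-value, [s; g], action) tokens for T timesteps.

    q: (T,) Q-value signals, s: (T, ds) states, g: (dg,) goal, a: (T, da) actions.
    Returns a list of 3T (name, feature) pairs.
    """
    T = s.shape[0]
    # Concatenate the same goal onto every state: shape (T, ds + dg).
    sg = np.concatenate([s, np.broadcast_to(g, (T, g.shape[-1]))], axis=-1)
    tokens = []
    for t in range(T):
        tokens += [("q", q[t : t + 1]), ("sg", sg[t]), ("a", a[t])]
    return tokens

T, ds, dg, da = 20, 19, 3, 5  # hypothetical sizes
rng = np.random.default_rng(0)
tokens = build_tokens(rng.random(T), rng.random((T, ds)), rng.random(dg), rng.random((T, da)))

seq_len_concat = len(tokens)  # 3T
seq_len_separate = 4 * T      # with a separate goal token per timestep
attention_cost_ratio = (seq_len_separate / seq_len_concat) ** 2  # 16/9
```

Since self-attention cost grows quadratically in sequence length, separate goal tokens would cost about 16/9 ≈ 1.78 times as much attention computation at the same $T$.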

Panel C demonstrates how this enables goal stitching. Consider two trajectories targeting goals $a$ and $b$ respectively. Neither trajectory alone reaches the optimal path to goal $a$. However, by conditioning on state-goal concatenated tokens with NFs-based Q-value signals, QHyer identifies high-value segments from both trajectories and stitches them together, discovering an optimal path (shown in green) that was not present in any single demonstration.

We empirically validate the effectiveness of state-goal concatenation through ablation studies comparing three tokenization strategies: No Goal (state-only input), State-Goal Separate (goal as additional token), and State-Goal Concat (our approach). Figure 12 shows a consistent ordering across all environments: No Goal < Separate < Concat.

The performance gap between No Goal and goal-conditioned variants (37%–53% absolute improvement) confirms that goal information is essential for learning meaningful goal-reaching behaviors. Without explicit goal conditioning, the model degenerates to unconditional behavior cloning, unable to distinguish between trajectories targeting different goals.

Among goal-conditioned strategies, concatenation outperforms separation by 16%–28%. This improvement stems from two factors: (1) Direct cross-dependency modeling: Concatenation enables self-attention to directly learn which state features are relevant for specific goals within each token, whereas separation requires the model to establish state-goal relationships across tokens through multiple attention layers. (2) Stronger conditioning signal: Separate tokenization dilutes the goal signal as it propagates through attention layers, weakening goal-awareness at decision time. Concatenation preserves the full goal information at every position where action prediction occurs.

Figure 13: Ablation study on regression functions for maximum Q-value learning.

These results validate our design choice and explain why QHyer achieves effective trajectory stitching: the concatenated state-goal representation provides the necessary goal-aware context for identifying and combining high-value segments from different trajectories.

G.2 Effect of Regression Functions on Learning Stability

We compare MSE, Quantile Loss ($L_{1}$-based) (Koenker & Hallock, 2001), and Expectile Regression ($L_{2}$-based) (Newey & Powell, 1987; Kostrikov et al., 2022). Figure 13 shows a consistent ordering: MSE < Quantile < Expectile, with Expectile achieving the best results and the smallest variance.

Why MSE Fails. MSE learns the mean Q-value across all trajectories passing through each state. In Offline GCRL where both successful and failed trajectories share common states, this averaging produces predictions that lie between the maximum and minimum Q-values. Such middle-ground estimates provide no discriminative signal for trajectory stitching because the model cannot distinguish promising paths from dead ends.

Why Quantile Loss Struggles. Quantile regression (Koenker & Hallock, 2001) correctly targets high-value regions via asymmetric weighting. However, the $L_{1}$ loss creates a non-smooth point at zero error where gradients change direction abruptly (Liu et al., 2025; Jullien et al., 2023). For deep networks with many near-zero predictions, this causes oscillatory training dynamics and high variance across seeds. Recent theoretical work (Liu et al., 2025) shows that while quantile regression can recover in-distribution optimal values in deterministic environments, its $L_{1}$ loss makes optimization less stable than $L_{2}$-based alternatives.

Why Expectile Regression Succeeds. Expectile regression (Newey & Powell, 1987) replaces the $L_{1}$ non-smooth point with an $L_{2}$ smooth curve, achieving both optimistic targeting and gradient consistency. This smooth gradient landscape is particularly important for non-Markovian learning: inconsistent gradients from quantile loss disrupt the temporal representations learned by attention and Mamba branches, while expectile’s stable gradients allow these components to capture history-dependent patterns effectively. This explains why the Quantile-Expectile gap is largest on Cube-double-play, the environment with the strongest non-Markovian properties. As shown in Theorem 3.1, expectile regression with $\tau\to 1$ converges to the in-distribution optimal Q-value, providing theoretical justification for our empirical findings.
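The three regression targets discussed above can be compared on a toy example. The sketch below uses the standard definitions of the pinball (quantile) loss and the asymmetric squared (expectile) loss with residual $u = y - \hat{y}$, and computes the scalar expectile of a small set of Q-value targets by fixed-point iteration. The toy targets and function names are ours and are not taken from the QHyer code.

```python
import numpy as np

def quantile_loss(u, tau):
    """Pinball loss: asymmetric L1, non-smooth at u = 0."""
    return np.where(u >= 0, tau, 1.0 - tau) * np.abs(u)

def expectile_loss(u, tau):
    """Asymmetric L2 loss |tau - 1(u < 0)| * u^2, smooth at u = 0."""
    return np.where(u >= 0, tau, 1.0 - tau) * u ** 2

def expectile(y, tau, iters=100):
    """Scalar minimizer of the mean expectile loss, via a weighted-mean fixed point."""
    y = np.asarray(y, dtype=float)
    m = y.mean()
    for _ in range(iters):
        w = np.where(y > m, tau, 1.0 - tau)
        m = np.sum(w * y) / np.sum(w)
    return float(m)

# Toy Q-value targets at one shared state: three failed paths, one successful path.
y = np.array([0.0, 0.0, 0.0, 1.0])
mse_fit = float(y.mean())        # MSE regresses to the mean (0.25)
exp_fit_50 = expectile(y, 0.5)   # tau = 0.5 recovers the mean
exp_fit_99 = expectile(y, 0.99)  # tau -> 1 approaches the maximum target

loss_at_fit = expectile_loss(y - exp_fit_99, 0.99).mean()
loss_nearby = expectile_loss(y - (exp_fit_99 - 0.05), 0.99).mean()
```

On this example the mean-based fit sits at 0.25, far from the successful path's value, whereas the expectile at $\tau=0.99$ is close to the maximum of 1.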

G.3 Trajectory Stitching Visualization


Figure 14: D4RL Antmaze-medium environment with trajectories from different behavioral policies.

To further illustrate the trajectory stitching capabilities of different methods, we provide a qualitative comparison on the D4RL Antmaze-Medium task. As shown in Figure 15, we visualize the trajectories generated by DT, LSDT, IQL, and QHyer (with Expectile Regression).

The maze environment consists of multiple regions, each represented by a distinct color corresponding to different data collection policies in the offline dataset (Figure 14):

  • Cyan: Bottom-left start region

  • Purple: Middle corridor

  • Yellow: Top-right goal region

  • Green: Bottom-right area

  • Red: Top-left area

  • Black: Out-of-distribution (OOD) states (i.e., passing through walls)

The key challenge is to stitch trajectory segments from different regions to discover optimal paths from start to goal.


Figure 15: Qualitative comparison of trajectory stitching capabilities on D4RL Antmaze-Medium task. Different colors represent trajectory segments from different data collection policies in the offline dataset. Black segments indicate OOD states where the agent passes through walls. (a) DT fails to reach the goal due to ineffective RTG conditioning. (b) LSDT moves correctly but stops early. (c) IQL successfully reaches the goal via value-based stitching, but requires bootstrapping and policy projection. (d) QHyer successfully reaches the goal with the advantages of no bootstrapping, no policy projection, and sequence modeling for non-Markovian data.

Successful trajectory stitching requires the agent to combine trajectory segments from different regions to reach the goal. Our key observations are:

  • DT (Chen et al., 2021) fails to reach the goal and instead wanders toward the bottom-right area, demonstrating its inability to stitch trajectories across different data collection policies. This failure stems from DT’s reliance on return-to-go conditioning, which provides no discriminative signal in sparse reward settings where all failed trajectories receive identical RTG values.

  • LSDT (Wang et al., 2025) moves in the correct direction but stops in the middle corridor, showing limited stitching capability. Although LSDT improves upon DT by combining attention with Dynamic Convolution for better local pattern extraction, it still relies on RTG conditioning and cannot identify high-value stitching points without explicit value guidance.

  • IQL (Kostrikov et al., 2022) successfully reaches the goal through a valid path without OOD states. IQL’s expectile regression-based value learning enables trajectory stitching by identifying high-value actions. However, IQL requires bootstrapping to learn the maximum Q-value, which means it must first learn $Q^{\beta}$ before learning $\hat{Q}$. This can lead to error accumulation in complex environments.

  • QHyer also successfully reaches the goal through a valid path without any OOD states. The trajectory smoothly transitions through cyan → purple → yellow regions, demonstrating proper trajectory stitching. Compared to IQL, QHyer avoids bootstrapping by using NFs for direct Q-value estimation, and avoids policy projection by using Q-conditioned supervised learning.

G.4 Evaluating the Capability of NFs to Accurately Estimate Goal-reaching Probability

In this section, we validate the accuracy of the NFs' (Ghugare & Eysenbach, 2025) estimate of the discounted future state distribution by implementing the computation method outlined in Eysenbach et al. (2020) within a tabular setting. Note that here we solely validate the accuracy of the NFs in estimating the discounted future state distribution; this validation is separate from the actual implementation of the NFs in our QHyer framework.

Figure 16: $5\times 5$ GridWorld environment.

Specifically, we compute the true discounted future state distribution in a modified GridWorld environment and evaluate the estimation error by comparing it against the true distribution. We also compare the predictions of CVAE (Sohn et al., 2015), C-learning (Eysenbach et al., 2020), and CRL (Eysenbach et al., 2022) with the true future state density. First, we introduce the modified GridWorld environment used in this experiment. This environment is characterized by stochastic dynamics and a continuous state space, such that the true $Q$-function for the indicator reward is zero. Specifically, the environment has a size of $5\times 5$ (Figure 16), where the agent observes a noisy version of its current state. More precisely, when the agent is located at position $(i,j)$, it observes the state $(i+\epsilon_{i},j+\epsilon_{j})$, where $\epsilon_{i},\epsilon_{j}\sim\text{Unif}[-0.5,0.5]$. Note that the observation uniquely identifies the agent’s position, so there is no partial observability. Similar to Eysenbach et al. (2020), we analytically compute the exact future state density function by first determining the future state density of the underlying GridWorld, noting that the density is uniform within each cell. We generated a tabular policy by sampling from a Dirichlet(1) distribution, and sampled 100 trajectories of length 100 from this policy for NFs training.
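The observation model and the random tabular policy described above can be sketched in a few lines. The helper names and the random seed are ours; the sketch only illustrates why the noisy observation still identifies the underlying cell and how a Dirichlet(1) policy can be drawn.

```python
import numpy as np

GRID, N_ACTIONS = 5, 4
rng = np.random.default_rng(0)

def observe(cell, rng):
    """Noisy continuous observation (i + eps_i, j + eps_j), eps ~ Unif[-0.5, 0.5)."""
    return np.asarray(cell, dtype=float) + rng.uniform(-0.5, 0.5, size=2)

def decode(obs):
    """Recover the underlying cell: each cell owns a unit square centred on (i, j)."""
    return tuple(int(v) for v in np.floor(np.asarray(obs) + 0.5))

# One Dirichlet(1) draw per state gives a random tabular policy pi(a | s).
policy = rng.dirichlet(np.ones(N_ACTIONS), size=GRID * GRID)  # shape (25, 4)

# Check that decoding inverts the observation noise for every cell.
recovered = all(
    decode(observe((i, j), rng)) == (i, j)
    for i in range(GRID) for j in range(GRID) for _ in range(20)
)
```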

Figure 17: Experiments on the effectiveness of density estimation using NFs. Left: We evaluate CVAE, C-learning, CRL and NFs for predicting the future state distribution in the on-policy setting. As anticipated, NFs achieve the lowest estimation error among all methods evaluated, whereas CVAE exhibits the poorest estimation accuracy. In our empirical implementation, we observed that CVAE incurs significantly higher computational complexity due to its requirements for pre-training and importance sampling-based inference procedures (Wu et al., 2022). Middle and Right: Visual comparison. For a given state, action, and future goal in the GridWorld trajectory data, we visualize the comparison between the actual future state density (goal-reaching probability) and the estimates provided by the NFs. The results indicate a minimal difference, further validating the effectiveness of the NFs in estimating the future state density (goal-reaching probability).

Analytic Future State Distribution

Then, as described in Eysenbach et al. (2020), we can compute the true discounted future state distribution by first constructing the following transition matrix $T$ and transition tensor $T_{0}$:

$$
\begin{aligned}
T\in\mathbb{R}^{25\times 25}:\quad & T[s,s^{\prime}]=\sum_{a}\mathds{1}(f(s,a)=s^{\prime})\,\pi(a\mid s)\\
T_{0}\in\mathbb{R}^{25\times 4\times 25}:\quad & T_{0}[s,a,s^{\prime}]=\mathds{1}(f(s,a)=s^{\prime}),
\end{aligned}
$$

where f(s,a)f(s,a) denotes the deterministic transition function. The future discounted state distribution is then given by:

$$
\begin{aligned}
P&=(1-\gamma)\left[T_{0}+\gamma T_{0}T+\gamma^{2}T_{0}T^{2}+\gamma^{3}T_{0}T^{3}+\cdots\right]\\
&=(1-\gamma)T_{0}\left[I+\gamma T+\gamma^{2}T^{2}+\gamma^{3}T^{3}+\cdots\right]\\
&=(1-\gamma)T_{0}\left(I-\gamma T\right)^{-1}.
\end{aligned}
$$

The tensor-matrix product $T_{0}T$ is equivalent to einsum('ijk,kh->ijh', $T_{0}$, $T$). We measure the error of our estimate with the forward KL divergence, $D_{\mathrm{KL}}(P\,\|\,Q)$, where $Q$ is the tensor of predictions:

$$
Q\in\mathbb{R}^{25\times 4\times 25}:\quad Q[s,a,g]=q(g\mid s,a).
$$
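The analytic computation above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the discount `GAMMA = 0.9`, the action set `MOVES`, the stay-in-place wall behavior of `f`, and the averaging of the KL divergence over $(s,a)$ pairs are all choices made here for the example. The sketch builds $T_{0}$ and $T$, evaluates the closed form $(1-\gamma)T_{0}(I-\gamma T)^{-1}$, and provides a truncated series to check it against.

```python
import numpy as np

N, N_ACTIONS, GAMMA = 5, 4, 0.9              # GAMMA is an assumed value
S = N * N                                    # 25 cells
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # assumed action set

def f(s, a):
    """Deterministic transition; moving into a wall keeps the agent in place (assumed)."""
    i, j = divmod(s, N)
    ni, nj = i + MOVES[a][0], j + MOVES[a][1]
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j
    return ni * N + nj

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(N_ACTIONS), size=S)   # pi[s, a] = pi(a | s)

# T0[s, a, s'] = 1(f(s, a) = s');  T[s, s'] = sum_a T0[s, a, s'] * pi(a | s)
T0 = np.zeros((S, N_ACTIONS, S))
for s in range(S):
    for a in range(N_ACTIONS):
        T0[s, a, f(s, a)] = 1.0
T = np.einsum("sa,sat->st", pi, T0)

# Closed form: P = (1 - gamma) * T0 (I - gamma T)^{-1}
resolvent = np.linalg.inv(np.eye(S) - GAMMA * T)
P = (1.0 - GAMMA) * np.einsum("ijk,kh->ijh", T0, resolvent)

def truncated_series(T0, T, gamma, terms):
    """(1 - gamma) * sum_{k < terms} gamma^k T0 T^k, for checking the closed form."""
    total = np.zeros_like(T0)
    power = np.eye(T.shape[0])
    for k in range(terms):
        total += gamma**k * np.einsum("ijk,kh->ijh", T0, power)
        power = power @ T
    return (1.0 - gamma) * total

def forward_kl(P, Q, eps=1e-12):
    """Mean over (s, a) of D_KL(P[s, a, :] || Q[s, a, :]); the averaging is an assumption."""
    log_ratio = np.log(np.clip(P, eps, None)) - np.log(np.clip(Q, eps, None))
    return float(np.mean(np.sum(P * log_ratio, axis=-1)))

uniform_Q = np.full_like(P, 1.0 / S)   # a trivial baseline prediction
```

Each slice `P[s, a, :]` is a probability distribution over goal cells, so `forward_kl(P, Q)` can be applied directly to any prediction tensor `Q` of the same shape, for example `forward_kl(P, uniform_Q)`.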

Following the configuration outlined in Eysenbach et al. (2020), we compare the accuracy of the estimated future discounted state distribution against C-learning and $Q$-learning in the following setting.

On-policy Setting

This experiment asks whether NFs solve the future state density estimation problem. Figure 17 presents our evaluation of CVAE, C-learning, CRL, and NFs on the modified "continuous GridWorld" environment described above under the on-policy setting. In this scenario, CVAE has higher error than C-learning, while NFs achieve the lowest error, which highlights the accuracy of NFs in estimating the discounted state occupancy measure.

G.5 Can Expectile Regression Effectively Capture Maximum $Q$-values in Practice?

We empirically validate that expectile regression converges to in-distribution maximum $Q$-values in a controlled GridWorld setting, supporting our theoretical analysis in Theorem 3.1.

Metrics. We use the coefficient of determination ($R^{2}$), which measures explained variance, and the Mean Absolute Error (MAE), which quantifies prediction deviation:

$$
R^{2}=1-\frac{\sum(y_{\mathrm{true}}-y_{\mathrm{pred}})^{2}}{\sum(y_{\mathrm{true}}-\bar{y}_{\mathrm{true}})^{2}},\quad\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_{\mathrm{true},i}-y_{\mathrm{pred},i}|. \tag{59}
$$
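The two metrics in Eq. (59) can be computed as in the sketch below. The function names and the toy values are hypothetical and serve only to illustrate the formulas: a perfect predictor gives $R^{2}=1$ and MAE $=0$, while always predicting the mean of the targets gives $R^{2}=0$.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_absolute_error(y_true, y_pred):
    """Mean of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical ground-truth values, for illustration only.
y_true = np.array([0.1, 0.4, 0.7, 1.0])
mean_predictor = np.full_like(y_true, y_true.mean())

r2_perfect = r2_score(y_true, y_true)
r2_mean = r2_score(y_true, mean_predictor)
mae_perfect = mean_absolute_error(y_true, y_true)
```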

Results. As shown in Figures 18 and 19, the results strongly support our theoretical analysis:

  1. Standard MSE ($\tau=0.5$) learns the mean rather than the maximum, yielding $R^{2}=0.781$;

  2. Performance improves monotonically with $\tau$: $R^{2}$ increases from $0.781$ to $0.995$ as $\tau$ goes from $0.5$ to $0.99$;

  3. At $\tau=0.99$, predicted values closely match the ground-truth $Q^{\beta}_{\max}$, with $R^{2}=0.995$ and $\mathrm{MAE}=0.0017$.
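The behavior listed above can be reproduced on a toy sample with the standard asymmetric squared loss $|\tau-\mathds{1}(u<0)|\,u^{2}$, where $u$ is the target minus the prediction. The sketch below is not the experiment from this section: the sample `q_samples` is hypothetical, and the fixed-point solver is one simple way to find the minimizer. It shows that $\tau=0.5$ recovers the mean and that the estimate moves toward the sample maximum as $\tau$ grows.

```python
import numpy as np

def expectile_loss(pred, targets, tau):
    """Mean of |tau - 1(u < 0)| * u^2 with u = targets - pred."""
    u = np.asarray(targets, dtype=float) - pred
    weight = np.where(u < 0, 1.0 - tau, tau)
    return float(np.mean(weight * u**2))

def expectile(targets, tau, iters=100):
    """Minimizer of expectile_loss, found by iterating a re-weighted mean."""
    targets = np.asarray(targets, dtype=float)
    m = targets.mean()
    for _ in range(iters):
        # Targets above the current estimate get weight tau, the rest 1 - tau.
        w = np.where(targets > m, tau, 1.0 - tau)
        m = float(np.sum(w * targets) / np.sum(w))
    return m

# Hypothetical Q-value samples for a single state, for illustration only.
q_samples = np.array([0.0, 0.2, 0.5, 1.0])
taus = [0.5, 0.7, 0.9, 0.99]
estimates = [expectile(q_samples, t) for t in taus]
```

With only a handful of samples, the estimate at large $\tau$ is driven almost entirely by the single largest value, which illustrates why very large $\tau$ can be sensitive to outliers.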

Implications. These results validate that expectile regression effectively captures maximum in-distribution $Q$-values, which is essential for QHyer’s trajectory stitching capability. The convergence aligns with Theorem 3.1: as $\tau\to 1$, the approximation error $\epsilon_{\tau}\to 0$ and $\hat{Q}^{\tau}\to Q^{\star}$. Specifically, our theoretical bound predicts:

$$
\epsilon_{\tau}\leq\frac{(1-\tau)(Q^{\star}-Q_{\min})}{\tau\cdot\tilde{c}/2+(1-\tau)(1-\tilde{c}/2)}, \tag{60}
$$

which decreases as $\tau$ increases, consistent with the monotonic improvement observed in Figure 19.
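The monotonicity of the bound in Eq. (60) can be made explicit with a short rearrangement. The step below assumes $\tau\in[1/2,1)$ and $\tilde{c}\in(0,2]$, so that the denominator stays positive; the admissible range of $\tilde{c}$ is an assumption here and should be checked against Theorem 3.1.

```latex
% Divide numerator and denominator of Eq. (60) by (1 - \tau):
\epsilon_{\tau}
  \leq \frac{Q^{\star}-Q_{\min}}
            {\dfrac{\tilde{c}}{2}\cdot\dfrac{\tau}{1-\tau} + \left(1-\dfrac{\tilde{c}}{2}\right)} .
% The ratio \tau/(1-\tau) is strictly increasing on [1/2, 1), so for \tilde{c} > 0
% the denominator grows and the bound shrinks as \tau increases.
% Endpoints: at \tau = 1/2 the denominator equals 1, giving the bound Q^{\star} - Q_{\min};
% as \tau \to 1 the ratio \tau/(1-\tau) \to \infty, so the bound tends to 0.
```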

However, an excessively large $\tau$ (e.g., $0.999$) may cause overfitting to outliers because it focuses on too few high-value samples, leading to increased variance. In practice, $\tau\in[0.9,0.95]$ balances accuracy and training stability, as validated in our ablation studies (Section 4.3).

Figure 18: Visualization comparing predicted maximum $Q$-values from expectile regression with different $\tau$ values against ground-truth maximum $Q$-values ($Q^{\star}$) in the GridWorld environment. As $\tau$ increases from $0.5$ to $0.99$, the predictions converge toward the diagonal line (perfect prediction), validating Theorem 3.1.

Figure 19: Curves of the $R^{2}$ and MAE metrics as a function of the expectile parameter $\tau$. $R^{2}$ increases monotonically from $0.781$ ($\tau=0.5$) to $0.995$ ($\tau=0.99$), while MAE decreases correspondingly. This empirical trend confirms that expectile regression with high $\tau$ effectively approximates the in-distribution optimal $Q$-value $Q^{\star}$, consistent with the theoretical bound in Theorem 3.1.

G.6 Comparison with Recent Offline RL Methods Adapted to GCRL

To strengthen baseline coverage, we adapt four recent offline RL methods to offline GCRL by attaching HER goal relabeling and following the OGBench evaluation protocol. These are Transitive RL (Park et al., 2026), SHARSA (Park et al., 2025b), DEAS (Kim et al., 2026), and QCFQL (Li et al., 2025). Results are averaged over 8 seeds at 1M training steps.

Table 11: Recent offline RL methods adapted to offline GCRL on OGBench play datasets. Mean success rate ($\%$) over 8 seeds. Bold = best.

| Environment | GC-TrL | GC-SHARSA | GC-DEAS | GC-QCFQL | QHyer |
|---|---|---|---|---|---|
| cube-single-play | $46.2 \pm 0.8$ | $30.7 \pm 3.8$ | $44.1 \pm 2.6$ | $38.4 \pm 1.5$ | $\mathbf{84} \pm 4$ |
| cube-double-play | $1.6 \pm 0.8$ | $49.3 \pm 4.7$ | $10.7 \pm 3.0$ | $5.1 \pm 0.0$ | $\mathbf{56} \pm 2$ |
| cube-triple-play | $0.8 \pm 0.6$ | $\mathbf{36.8} \pm 4.0$ | $18.1 \pm 2.3$ | $1.4 \pm 0.7$ | $10 \pm 5$ |
| cube-quadruple-play | $0.0 \pm 0.0$ | $\mathbf{2.3} \pm 0.3$ | $0.1 \pm 0.1$ | $0.0 \pm 0.0$ | $2 \pm 1$ |
| scene-play | $27.7 \pm 6.2$ | $44.0 \pm 9.1$ | $48.7 \pm 3.4$ | $36.9 \pm 4.5$ | $\mathbf{53} \pm 2$ |
| puzzle-3x3-play | $7.6 \pm 1.6$ | $35.6 \pm 2.5$ | $32.7 \pm 11.0$ | $14.9 \pm 9.6$ | $\mathbf{92} \pm 2$ |
| puzzle-4x4-play | $2.3 \pm 0.7$ | $\mathbf{32.4} \pm 9.7$ | $26.7 \pm 6.7$ | $0.3 \pm 0.7$ | $28 \pm 5$ |
| puzzle-4x5-play | $2.0 \pm 1.5$ | $15.1 \pm 3.1$ | $16.1 \pm 1.3$ | $9.7 \pm 3.1$ | $\mathbf{31} \pm 1$ |
| puzzle-4x6-play | $1.6 \pm 0.5$ | $12.1 \pm 2.4$ | $17.9 \pm 4.5$ | $8.3 \pm 5.5$ | $\mathbf{18} \pm 2$ |
| Average | 10.0 | 28.7 | 23.9 | 12.8 | 41.6 |
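As a quick arithmetic check, the Average row of Table 11 can be recomputed from the per-environment means. The sketch below copies the mean success rates from the table (standard deviations omitted) and also counts how many environments each method wins; the variable names are hypothetical.

```python
import numpy as np

methods = ["GC-TrL", "GC-SHARSA", "GC-DEAS", "GC-QCFQL", "QHyer"]

# Rows follow the table order: cube-single, cube-double, cube-triple, cube-quadruple,
# scene, puzzle-3x3, puzzle-4x4, puzzle-4x5, puzzle-4x6 (all "-play" datasets).
success = np.array([
    [46.2, 30.7, 44.1, 38.4, 84.0],
    [1.6, 49.3, 10.7, 5.1, 56.0],
    [0.8, 36.8, 18.1, 1.4, 10.0],
    [0.0, 2.3, 0.1, 0.0, 2.0],
    [27.7, 44.0, 48.7, 36.9, 53.0],
    [7.6, 35.6, 32.7, 14.9, 92.0],
    [2.3, 32.4, 26.7, 0.3, 28.0],
    [2.0, 15.1, 16.1, 9.7, 31.0],
    [1.6, 12.1, 17.9, 8.3, 18.0],
])

averages = success.mean(axis=0)   # unweighted mean over the 9 environments
wins = np.bincount(success.argmax(axis=1), minlength=len(methods))

for name, avg, n_wins in zip(methods, averages, wins):
    print(f"{name:10s} average={avg:5.1f}  best on {n_wins} of 9 environments")
```

The averages are unweighted across environments, so the ranking they imply can differ from the per-environment picture: GC-SHARSA is best on three of the nine tasks even though its average is lower than QHyer's.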

Interpretation. Three reasons explain the gap. First, none of these methods was originally validated under offline GCRL with sparse binary rewards at standard OGBench scale. TRL and SHARSA rely on oracle goals or the large-data regime, DEAS targets semi-sparse single-task settings, and QCFQL’s strongest numbers come from offline-to-online training. When forced into the pure offline, sparse-binary, multi-goal regime, their value targets and exploration mechanisms become mis-specified. Second, there are structural mismatches. TRL’s triangle inequality on temporal distance holds for continuous navigation but breaks under manipulation’s discrete contact-mode transitions, which is why TRL drops from $46.2$ on cube-single-play to $1.6$ on cube-double-play. Third, SHARSA must predict subgoals in the full multi-object pose space, which is far harder than predicting 2D navigation waypoints, and DEAS and QCFQL execute fixed-length open-loop action chunks, so early errors compound and the fixed chunk length cannot align with variable-duration manipulation primitives. SHARSA and DEAS nonetheless remain competitive on the hardest long-horizon tasks, suggesting that action chunking and temporal abstraction are complementary to our contributions.

G.7 Comparison with Graph-based Stitching (GAS)

We also compare against GAS (Baek et al., 2025), a graph-based offline GCRL stitching method.

Table 12: Comparison with GAS on navigation and visual manipulation. Mean success rate ($\%$) over 5 seeds. Bold = best.

| Environment | GAS | QHyer |
|---|---|---|
| antmaze-giant-stitch (navigation) | $\mathbf{88} \pm 4$ | $70 \pm 2$ |
| visual-scene-play (manipulation) | $54 \pm 6$ | $\mathbf{96} \pm 1$ |

Interpretation. GAS and QHyer occupy complementary regimes, and the reason is structural rather than a matter of tuning. On antmaze-giant-stitch, GAS replaces high-level policy learning with Dijkstra shortest-path search over a precomputed temporal-distance graph, which directly exploits the metric structure of continuous navigation. QHyer’s flat sequence model cannot match this advantage on pure navigation. On visual-scene-play, the tables turn. GAS’s graph construction is bottlenecked by its ability to learn high-dimensional representations, whereas QHyer’s end-to-end sequence modeling with content-adaptive memory benefits directly from pixel inputs, producing roughly a $42$-point improvement and nearly doubling the previous OGBench best. We therefore view the two methods as complementary tools rather than directly competing baselines.
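To make the navigation argument concrete, the sketch below runs a generic Dijkstra search over a small graph whose edge weights stand in for temporal distances. It is a textbook illustration, not GAS's implementation: the graph, node names, and weights are hypothetical, and how GAS builds its graph is not reproduced here.

```python
import heapq
import math

def dijkstra(graph, source, target):
    """Return (distance, path) for the shortest path from source to target."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical graph: nodes are states, edge weights are estimated temporal distances.
graph = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.0), ("D", 6.0)],
    "C": [("D", 2.0)],
}
distance, path = dijkstra(graph, "A", "D")
```

On this toy graph, the search chains three short edges instead of taking either longer direct edge, which is the kind of stitching a planner obtains when temporal distances are well estimated.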
