QHyer: Q-conditioned Hybrid Attention-Mamba Transformer
for Offline Goal-conditioned RL
Abstract
Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian structure that violates standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependency modeling, their fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness across diverse scenarios.
Keywords:
Machine Learning, ICML
1 Introduction
Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) aims to learn goal-reaching policies from static datasets, offering a promising paradigm for real-world applications where online interaction is costly or infeasible (Levine et al., 2020; Liu et al., 2022). While most existing offline GCRL datasets are collected by Markovian behavior policies, an increasing number of practical datasets exhibit non-Markovian properties where actions depend on historical context rather than current observations alone (Park et al., 2025a). This property poses fundamental challenges for existing value-based methods (Kostrikov et al., 2022; Park et al., 2023; Zhou & Kao, 2025; Ahn et al., 2025; Giammarino et al., 2025; Giammarino & Qureshi, 2026) that rely on Bellman backups. In contrast, sequence modeling approaches like Decision Transformer (DT) (Chen et al., 2021) naturally handle non-Markovian problems by conditioning on return-to-go (RTG), states, and actions, leveraging self-attention to capture long-range dependencies from extended historical sequences.
Although DT naturally handles non-Markovian patterns through history conditioning, it exhibits two fundamental limitations when applied to offline GCRL. On one hand, RTG is trajectory-dependent rather than state-dependent, assigning values based on trajectory success rather than state quality, which provides no discriminative information for distinguishing promising states within failed trajectories. This is a critical requirement for trajectory stitching under goal-conditioned sparse rewards. On the other hand, pure attention struggles to efficiently balance global goal-directed reasoning with fine-grained local dynamics modeling. Recent hybrid architectures like LSDT (Wang et al., 2025) and DMixer (Zheng et al., 2025a) incorporate convolution alongside attention to capture local patterns. However, non-Markovian offline GCRL data exhibits variable-length temporal dependencies that change dynamically across states and trajectory segments. Convolution with fixed receptive fields either wastes model capacity on irrelevant context when dependencies are short, or truncates critical information when dependencies are long, unable to adapt to this inherent variability.
We propose QHyer (Q-conditioned Hybrid Attention-Mamba Transformer), the first sequence modeling framework to jointly resolve both limitations for offline GCRL. Our key observation is that these two limitations are coupled. Effective trajectory stitching under sparse rewards requires both a state-dependent value signal and an architecture whose effective memory matches the temporal structure of the underlying behavior policy. Addressing either alone is insufficient, because Q-conditioning layered on a fixed-window hybrid retains the convolutional pathology on non-Markovian play data, while a better temporal architecture with RTG retains the trajectory-dependence bottleneck. Concretely, we (i) replace trajectory-dependent RTG with state-dependent Q-values estimated via Normalizing Flows (Ghugare & Eysenbach, 2025), chosen specifically for their exact, properly normalized log-density, a property that CVAEs, contrastive critics, and diffusion likelihoods cannot provide (Section 3.1), and (ii) design a gated Hybrid Attention-Mamba (Gu & Dao, 2024) backbone where Mamba’s input-dependent selective state-space dynamics provide content-adaptive history compression, adjusting effective memory per-token rather than through a hand-tuned receptive field. Unlike prior value-guided Decision Transformers (Yamagata et al., 2023; Wang et al., 2024; Hu et al., 2024; Zhuang et al., 2024; Zheng et al., 2025b) that retain RTG and attach Q-values as auxiliary losses or regularizers, QHyer eliminates RTG and uses Q-values directly as conditioning tokens. Under goal-conditioned sparse rewards, where RTG collapses to a near-binary signal, this distinction is decisive (Figures 2 and 5).
Our evaluation on OGBench (Park et al., 2025a) and D4RL (Fu et al., 2020) demonstrates that QHyer achieves state-of-the-art performance across both non-Markovian datasets (OGBench play and D4RL Maze) and Markovian datasets (OGBench noisy), validating the effectiveness of NFs-based Q-value conditioning and the Hybrid Attention-Mamba architecture for offline GCRL.
2 Background
2.1 Offline GCRL
Offline GCRL is defined over a Markov Decision Process (MDP) , where denotes the state space, the action space, the goal space, the transition dynamics, and the discount factor. Following prior work (Park et al., 2023, 2025a), we assume . The agent has access only to a static dataset collected by behavioral policies , where each trajectory takes the form . The objective is to learn a goal-conditioned policy that maximizes the expected cumulative return without interaction in the environment. To obtain goal-conditioned supervision, we employ hindsight experience replay (HER) (Andrychowicz et al., 2017), which samples goals from future achieved states along the same trajectory.
In standard GCRL with sparse rewards, the reward function is defined as , where the agent receives only upon reaching the goal and otherwise. Consequently, most state-action pairs yield no learning signal for states far from the goal. To address this, following prior work (Eysenbach et al., 2020, 2022), a probabilistic reward can be defined as , where is the next state. Under this formulation, the goal-conditioned Q-function corresponds to the discounted state occupancy measure (Eysenbach et al., 2022; Bortkiewicz et al., 2025):
| (1) |
where for denotes a future state sampled at a geometrically distributed time step. Unlike sparse rewards that are directly observed, this formulation requires learning a density model to estimate Q-values.
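The geometrically distributed future time step described above can be sketched as a hindsight goal sampler. This is a minimal illustration, not the paper's implementation: the function name `sample_future_goal`, the Geometric(1 - gamma) parameterization, and the truncation at the end of the trajectory are assumptions of this sketch.

```python
import random

def sample_future_goal(trajectory, t, gamma, rng):
    """Pick a hindsight goal from the states that follow index t.

    The offset k >= 1 follows P(k) = (1 - gamma) * gamma**(k - 1),
    truncated at the end of the trajectory (an assumption of this sketch).
    """
    remaining = len(trajectory) - 1 - t
    if remaining <= 0:
        raise ValueError("no future state available after index t")
    k = 1
    # Keep stepping forward with probability gamma.
    while k < remaining and rng.random() < gamma:
        k += 1
    return trajectory[t + k], k

rng = random.Random(0)
states = list(range(50))  # toy trajectory: state i is just the integer i
goal, offset = sample_future_goal(states, t=10, gamma=0.9, rng=rng)
```

With a discount of 0.9 the untruncated offset has mean 1 / (1 - 0.9) = 10 steps, so most relabeled goals are nearby while distant goals still appear occasionally.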
2.2 Normalizing Flows for Q-Value Estimation
Normalizing Flows (NFs) (Zhai et al., 2024) are invertible generative models that learn a bijective mapping from a complex data distribution to a simple prior (typically standard Gaussian), with density computed exactly via the change of variables formula:
| (2) |
Following Ghugare & Eysenbach (2025), NFs can be constructed using coupling layers (Dinh et al., 2017). For the -th block with input and condition :
| (3) |
where partitions the input into two halves along the feature dimension, and are neural networks that output translation and scale parameters respectively.
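A single conditional affine coupling block, together with the change-of-variables log-density, can be sketched as follows. This assumes the RealNVP-style form (scale then shift on the second half); the toy functions `toy_shift` and `toy_log_scale` stand in for the neural networks and are purely illustrative, as is the scalar condition `cond`.

```python
import math

def coupling_forward(x, cond, shift_fn, log_scale_fn):
    """Keep the first half, affinely transform the second half.

    Returns the output and log|det J|, which is the sum of log-scales
    because the Jacobian is triangular.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    t = shift_fn(x1, cond)
    log_s = log_scale_fn(x1, cond)
    y2 = [v * math.exp(s) + b for v, s, b in zip(x2, log_s, t)]
    return x1 + y2, sum(log_s)

def coupling_inverse(y, cond, shift_fn, log_scale_fn):
    """Exact inverse: the untouched half reproduces the same shift and scale."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    t = shift_fn(y1, cond)
    log_s = log_scale_fn(y1, cond)
    x2 = [(v - b) * math.exp(-s) for v, s, b in zip(y2, log_s, t)]
    return y1 + x2

def standard_normal_logpdf(z):
    return sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in z)

def log_density(x, cond, shift_fn, log_scale_fn):
    """Change of variables: log p(x) = log N(f(x); 0, I) + log|det J|."""
    z, log_det = coupling_forward(x, cond, shift_fn, log_scale_fn)
    return standard_normal_logpdf(z) + log_det

# Hypothetical stand-ins for the translation and scale networks.
def toy_shift(x1, cond):
    return [0.5 * v + cond for v in x1]

def toy_log_scale(x1, cond):
    return [math.tanh(v - cond) for v in x1]

x = [0.3, -1.2, 0.8, 2.0]
z, log_det = coupling_forward(x, 0.1, toy_shift, toy_log_scale)
x_rec = coupling_inverse(z, 0.1, toy_shift, toy_log_scale)
logp = log_density(x, 0.1, toy_shift, toy_log_scale)
```

Because the inverse and the log-determinant are both available in closed form, the density is exact rather than a bound, which is the property the later sections rely on.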
In Offline GCRL, NFs can directly estimate it by modeling (Ghugare & Eysenbach, 2025). The conditioning information is encoded by a state-action encoder, and the behavior Monte Carlo (MC) Q-value is obtained as:
| (4) |
where is the conditional NFs mapping goals to the latent space with being the encoded state-action representation. Note that represents the log-probability, which serves as an unnormalized score for conditioning. During inference, we use when probability interpretation is needed.
In practice, the NF model is trained via maximum likelihood on hindsight-relabeled transitions:
| (5) |
2.3 Sequence Modeling for Decision Making
Decision Transformer (DT) (Chen et al., 2021) models decision-making from offline datasets as a sequence modeling problem. Unlike traditional RL methods that estimate Q-functions or compute policy gradients, DT generates an action at timestep conditioned on the context of the previous timesteps along with the current state and return-to-go (RTG). The input sequence is formulated as , where RTG is defined as the sum of rewards from the current step to the end of the trajectory and is the context length. For each timestep, three tokens (RTG, state, and action) are embedded and fed into the model. DT employs a causal Transformer that leverages self-attention layers to capture long-range dependencies.
Decision Mamba (DMamba) (Ota, 2024) integrates the Mamba (Gu & Dao, 2024) architecture into the DT framework by replacing self-attention with the Mamba block. The DMamba block first applies a one-dimensional causal convolution to extract local features:
| (6) |
where Conv1d operates with a local kernel over adjacent positions. The transformed sequence is then processed by the discrete-time selective state space model (SSM):
| (7) |
where is the hidden state and is the output. The key innovation of Mamba is the input-dependent selective mechanism:
| (8) |
where $\Delta$ controls the discretization step size.
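For reference, the selective SSM that Equations 6 to 8 describe is commonly written as below in the Mamba literature (Gu & Dao, 2024). This is a sketch in that standard notation: the symbols $u_t$, $A$, $B_t$, $C_t$, $\Delta_t$, the simplified discretization of $B$, and the softplus parameterization are assumptions and may differ from the exact form used in this paper.

```latex
\begin{aligned}
u_t &= \sigma\big(\mathrm{Conv1d}(x)_t\big), \\
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, u_t, \qquad y_t = C_t\, h_t, \\
\bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t = \Delta_t B_t, \\
B_t &= \mathrm{Linear}_B(u_t), \quad
C_t = \mathrm{Linear}_C(u_t), \quad
\Delta_t = \mathrm{softplus}\big(\mathrm{Linear}_\Delta(u_t)\big).
\end{aligned}
```

Since $A$ has negative real entries, a larger $\Delta_t$ pushes $\bar{A}_t$ toward zero (fast forgetting) and a smaller $\Delta_t$ keeps it close to one (long retention).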
3 QHyer: Unlocking Sequence Modeling for Offline GCRL
While sequence modeling naturally addresses the non-Markovian challenge, DT-based methods exhibit critical limitations when applied to GCRL. We propose QHyer, which introduces NFs-based Q-value conditioning (Section 3.1) and a Hybrid Attention-Mamba architecture (Section 3.2) to overcome these limitations. The overall architecture is illustrated in Figure 1.

3.1 Limitation 1: RTG Fails Under Sparse Rewards
In standard DT-based methods, the return-to-go (RTG) serves as the conditioning signal that guides action generation. However, RTG is fundamentally inadequate for Offline GCRL with sparse binary rewards.
The Root Cause: Trajectory-Dependence Prevents Stitching. The fundamental limitation of RTG lies in its trajectory-dependence: RTG answers “did this trajectory succeed?” rather than “how valuable is this state for reaching the goal?” Consider a state that appears on both a successful trajectory (RTG=1) and a failed one (RTG=0). RTG assigns contradictory values to the same state based solely on trajectory outcome, making cross-trajectory comparison impossible. This directly prevents trajectory stitching because composing segments from different trajectories requires a trajectory-agnostic value metric that RTG fundamentally cannot provide. As shown in Figure 2 (a, b), successful and failed trajectories receive uniformly different RTG values regardless of state quality, with only 25% of state-action pairs receiving discriminative signals.
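The contradiction described above is easy to reproduce on a toy example. The sketch below assumes undiscounted RTG (as in DT) and a single reward of 1 on the goal-reaching step; the state labels and the two trajectories are made up for illustration.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: sum of rewards from each step to the end."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Two toy trajectories that both pass through the same state "B".
success = {"states": ["A", "B", "C", "G"], "rewards": [0.0, 0.0, 0.0, 1.0]}
failure = {"states": ["D", "B", "E", "F"], "rewards": [0.0, 0.0, 0.0, 0.0]}

rtg_success = dict(zip(success["states"], returns_to_go(success["rewards"])))
rtg_failure = dict(zip(failure["states"], returns_to_go(failure["rewards"])))

# Same state, contradictory conditioning signal.
print("RTG of B on the successful trajectory:", rtg_success["B"])
print("RTG of B on the failed trajectory:", rtg_failure["B"])
```

Every state on the successful trajectory gets RTG 1 and every state on the failed one gets RTG 0, so the signal says nothing about how close state "B" actually is to the goal.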




Our Key Insight: From Trajectory-Dependence to State-Dependence. The Q-function represents the probability of reaching goal from state-action pair , measured independently of which trajectory that pair came from. This state-dependence enables a fundamentally new capability: identifying high-value segments from failed trajectories (they have high despite low RTG) and composing them toward goals. Figure 2 (c) confirms this prediction, showing that Q-value conditioning achieves 92% coverage compared to RTG’s 25%. Figure 2 (d) further illustrates that high-Q segments naturally form paths toward goals even when extracted from failed demonstrations.
Why MC Estimation Instead of TD Learning. Having established the need for Q-value conditioning, we must choose how to estimate Q-values. Many standard offline RL methods (Fujimoto & Gu, 2021; Kostrikov et al., 2022) are built upon temporal difference (TD) learning. While TD learning can learn optimal value functions and possesses stitching capabilities, its reliance on bootstrapping leads to compounding errors that hinder the acquisition of optimal policies, especially in long-horizon tasks (Myers et al., 2025; Park et al., 2026). In contrast, MC learning directly estimates the cumulative reward for reaching a goal. By integrating it with a maximum Q-expectile regression loss proposed in our later analysis, we theoretically demonstrate that our method can also converge to an optimal stitched policy. In empirical evaluations, recent MC-based contrastive RL approaches (Eysenbach et al., 2022; Myers et al., 2025) have been shown to consistently and significantly outperform TD-based methods on long-horizon GCRL tasks.
Why NFs for MC Q-Estimation. Given that MC estimation is preferable, we must choose how to model the Q-value density . Our framework places one structural requirement on this density model: it must produce an exact, properly normalized log-density. The expectile target in Equation 10 is defined on directly, and the transformer consumes Q-tokens that span multiple goals within one context window (Section 3.3), so goal-independent normalization is necessary for the learned Q-to-action pattern to transfer across goals. This requirement rules out the otherwise reasonable alternatives.
Conditional VAEs (Sohn et al., 2015) produce only the ELBO, a structural lower bound that cannot be closed by increasing capacity, and which distorts the Q-landscape in a goal-dependent way. Contrastive RL (Eysenbach et al., 2022) trains a binary cross-entropy classifier whose Bayes-optimal output is the log density ratio . While the goal-dependent partition cancels when selecting actions at a fixed goal, it introduces goal-dependent offsets in the Q-token sequence our transformer reads across multiple goals, which degrades cross-goal conditioning. Diffusion models (Ho et al., 2020) and continuous flow-matching objectives (Lipman et al., 2023) can reach high sample quality, but their per-sample likelihood requires solving a probability-flow ODE with a Hutchinson trace estimator (Grathwohl et al., 2018), injecting variance into precisely the signal that expectile regression must fit.
Coupling-based NFs (Dinh et al., 2017) uniquely meet the requirement. The triangular Jacobian makes exactly and cheaply computable in closed form, and coupling architectures are universal diffeomorphism approximators (Teshima et al., 2020), so no structural gap remains. Figure 17 (Section G.4) empirically confirms that NFs attain the lowest estimation error against the analytic future-state density among CVAE, CRL and MC C-learning. Because accurate, normalized Q-values are the bottleneck for trajectory stitching under sparse rewards (Figure 5), this is the property we optimize for.
Expectile Regression for In-Distribution Optimal Q-Value Prediction. Given accurate estimates from NFs, we still need to extract optimal behaviors from suboptimal data. The expectile regression loss (Kostrikov et al., 2022; Wu et al., 2023; Zhuang et al., 2024) asymmetrically weights prediction errors:
| (9) |
where controls the asymmetry. When , the loss penalizes underestimation more heavily, causing the learned value to concentrate on the upper portion of the empirical distribution. Applying this to Q-value prediction, we define:
| (10) |
where is the Q-value predicted by the Hybrid Attention-Mamba transformer with parameters , and is the target from the NFs-based critic (Equation 4). Our theoretical analysis (Section 3.5) demonstrates that this enables our sequential model, QHyer, to predict Q-values that approach the in-distribution maximum. These predictions correspond to the high-Q segments shown in Figure 2 (d), which are essential for trajectory stitching.
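The effect of the asymmetry parameter can be illustrated with a scalar expectile fit. The sketch assumes the usual expectile loss, with weight tau on under-predictions and 1 - tau on over-predictions; the toy targets and the fixed-point solver `fit_expectile` are illustrative and are not the paper's training procedure.

```python
def expectile_loss(pred, targets, tau):
    """Mean asymmetric squared error: weight tau when pred is below the target."""
    total = 0.0
    for q in targets:
        u = q - pred
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(targets)

def fit_expectile(targets, tau, iters=200):
    """Solve the first-order condition by iteratively reweighted averaging."""
    pred = sum(targets) / len(targets)
    for _ in range(iters):
        weights = [tau if q > pred else 1.0 - tau for q in targets]
        pred = sum(w * q for w, q in zip(weights, targets)) / sum(weights)
    return pred

# Toy log-probability targets for one state under different dataset actions.
targets = [-5.0, -3.0, -2.5, -1.0, -0.2]
e50 = fit_expectile(targets, 0.5)   # equals the mean
e90 = fit_expectile(targets, 0.9)   # pulled toward the upper targets
e99 = fit_expectile(targets, 0.99)  # close to, but below, the maximum
```

As tau approaches 1 the fitted value moves toward the largest target that actually occurs in the data, which is the sense in which the prediction approaches an in-distribution maximum.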
3.2 Limitation 2: Temporal Modeling Requires Content-Adaptive History Compression
Beyond the conditioning signal, effective sequence modeling for Offline GCRL demands architectures that can capture heterogeneous temporal dependencies inherent in Offline GCRL datasets.
Why Offline GCRL Data Exhibits Variable-Length Historical Dependencies. Offline GCRL datasets exhibit different temporal structures depending on behavior policy properties. As documented in OGBench (Park et al., 2025a), the manipulation suite provides two representative dataset types: play datasets collected by non-Markovian expert policies with temporally correlated noise where the behavior policy follows , and noisy datasets collected by Markovian expert policies with uncorrelated Gaussian noise where depends only on the current state. The play data demands extended memory for action coherence, while noisy data requires only short-term local information. A principled solution must adapt to both properties without manual tuning.
Why Convolution Cannot Address Variable-Length Dependencies. To address the inherent tension of datasets exhibiting the two aforementioned properties, both LSDT (Wang et al., 2025) and DMixer (Zheng et al., 2025a) incorporate attention and convolution as parallel branches. Convolution-based local modeling computes features through causal convolution with fixed-size kernels:
| (11) |
where is the fixed kernel size and are input-independent weights. When convolution serves as the final output of a branch, this creates three fundamental limitations. First, convolution imposes a fixed receptive field set by the chosen kernel (and any dilation/stacking), making the effective context length a hand-tuned architectural prior that is sensitive to hyperparameters and often fails to transfer across datasets with different temporal dependencies. Second, in the first layer, convolution has a fixed receptive field and thus a fixed effective memory that cannot adapt to varying dependency lengths within or across datasets. Third, on non-Markovian trajectories where relevant cues lie beyond this window, the local branch becomes weakly informative and is often down-weighted by fusion.
Why Mamba Enables Content-Adaptive History Compression. We address the fixed-window bottleneck of a convolutional short-term branch by adopting a Mamba-style selective SSM (DMamba) module (Ota, 2024). A DMamba block combines (i) a lightweight causal convolution that mixes nearby tokens and produces local features (and gating signals), with (ii) a selective state-space update (Equations 7 and 8) that propagates a recurrent state across the entire prefix. Importantly, the effective memory is not determined by the convolutional kernel, but by the input-dependent SSM dynamics (via the selective discretization), which enables smooth, learned forgetting/retention over history. As a result, compared to using convolution as the branch output, DMamba provides a content-adaptive mechanism to compress long-range context into a compact state, reducing sensitivity to hand-tuned receptive fields and improving robustness on non-Markovian segments where disambiguating cues may lie beyond any fixed local window.
To make the adaptive history modeling explicit, we expand the SSM recurrence (Equation 7) to express the output at timestep :
| (12) |
where is the convolution-extracted feature at step . The influence of historical input on current output is governed by the cumulative decay . Critically, through the selective mechanism (Equation 8), the discretization step is input-dependent:
| (13) |
and is a negative real value following Gu & Dao (2024). This creates content-adaptive effective memory: when yields small , the decay preserves long-range history suitable for play data; when yields large , retains only local context appropriate for noisy data. In contrast, convolution imposes a fixed influence for and zero beyond, resulting in hard, input-independent truncation. The key distinction is that Mamba provides smooth, learned decay while DynamicConv enforces hard, fixed truncation.
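The contrast between smooth decay and hard truncation can be made concrete with a scalar toy model. The sketch below ignores the input and output projections and keeps only the cumulative decay term; the state coefficient `a = -1.0`, the step sizes, the kernel size, and the 0.1 threshold used to count "effective memory" are all assumptions chosen for illustration.

```python
import math

def ssm_influence(deltas, a=-1.0):
    """Influence of each past step j on the last step t for a scalar SSM.

    The weight is prod_{i=j+1..t} exp(deltas[i] * a) with a < 0.
    """
    t = len(deltas) - 1
    weights = []
    for j in range(t + 1):
        log_decay = sum(deltas[i] * a for i in range(j + 1, t + 1))
        weights.append(math.exp(log_decay))
    return weights

def conv_influence(length, kernel_size):
    """A causal kernel sees only the last kernel_size steps (uniform toy weights)."""
    return [1.0 if (length - 1 - j) < kernel_size else 0.0 for j in range(length)]

def effective_memory(weights, threshold=0.1):
    """Number of past steps whose influence is still above the threshold."""
    return sum(1 for w in weights if w >= threshold)

T = 64
small_step = ssm_influence([0.05] * T)  # small step size: slow decay
large_step = ssm_influence([1.0] * T)   # large step size: fast decay
fixed_conv = conv_influence(T, kernel_size=4)

for name, w in [("small step", small_step), ("large step", large_step), ("conv k=4", fixed_conv)]:
    print(name, "-> effective memory:", effective_memory(w))
```

In a trained model the step sizes differ per token, so the same layer can behave like the first case on some segments and like the second on others, whereas the convolution's window never changes.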

Hybrid Architecture with Attention-Mamba. As illustrated in Figure 4, we design a Hybrid Attention-Mamba architecture with two parallel branches: attention for global goal-directed planning and Mamba for temporal dynamics modeling. The outputs are fused through a learnable gating mechanism that computes a scalar weight to combine branch outputs: . This enables complementary specialization across both play and noisy datasets.
Figure 3 visualizes this adaptation on cube-single. On play, smaller $\Delta$ preserves an effective memory of about steps and the gate favors attention. On noisy, larger $\Delta$ contracts memory to about steps and the gate favors Mamba. The essential reason is that $\Delta$ is input-dependent, so effective memory tracks the local temporal correlation of the data, which convolution’s input-independent receptive field cannot do.
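One simple way to realize a scalar gate over two branches is a convex combination with a sigmoid weight. The sketch below assumes that form; how the gate logit is computed from the inputs, and whether the paper's fusion is exactly a convex combination, are not specified here, so `gated_fusion` and its arguments are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(attn_out, mamba_out, gate_logit):
    """Blend two branch outputs with a scalar gate g in (0, 1).

    g weights the attention branch and (1 - g) weights the Mamba branch.
    """
    g = sigmoid(gate_logit)
    fused = [g * a + (1.0 - g) * m for a, m in zip(attn_out, mamba_out)]
    return fused, g

attn_features = [1.0, 2.0]
mamba_features = [3.0, 4.0]

balanced, g_balanced = gated_fusion(attn_features, mamba_features, gate_logit=0.0)
attn_heavy, g_attn = gated_fusion(attn_features, mamba_features, gate_logit=4.0)
```

A logit of zero averages the two branches, while a large positive logit makes the output follow the attention branch almost exclusively.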
3.3 Concatenated State-Goal Tokenization Strategy
We represent each state-goal pair as a concatenated token rather than separate tokens. Combined with NFs-based Q-value conditioning, the input sequence becomes: where is the NFs-estimated Q-value. This design ensures goal information is directly available at each decision point without increasing sequence length from to , avoiding quadratic computational overhead in attention. A detailed visual explanation of this tokenization strategy is provided in Section G.1.
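The sequence-length argument can be illustrated by building the token list both ways. This sketch assumes a per-step ordering of (Q-value, state-goal, action), mirroring DT's (RTG, state, action); the ordering, the token labels, and the helper names are assumptions, and raw lists stand in for learned embeddings.

```python
def build_concatenated_sequence(q_values, states, goals, actions):
    """Three tokens per step: Q-value, concatenated state-goal, action."""
    tokens = []
    for q, s, g, a in zip(q_values, states, goals, actions):
        tokens.append(("q", [q]))
        tokens.append(("state_goal", list(s) + list(g)))
        tokens.append(("action", list(a)))
    return tokens

def build_separate_sequence(q_values, states, goals, actions):
    """Four tokens per step when the goal gets its own token."""
    tokens = []
    for q, s, g, a in zip(q_values, states, goals, actions):
        tokens.append(("q", [q]))
        tokens.append(("state", list(s)))
        tokens.append(("goal", list(g)))
        tokens.append(("action", list(a)))
    return tokens

K = 5  # context length in timesteps
q_values = [-1.0 * k for k in range(K)]
states = [[float(k), 0.0] for k in range(K)]
goals = [[9.0, 9.0]] * K
actions = [[0.1] for _ in range(K)]

concat_tokens = build_concatenated_sequence(q_values, states, goals, actions)
separate_tokens = build_separate_sequence(q_values, states, goals, actions)
```

Since self-attention cost grows with the square of the number of tokens, keeping three tokens per step instead of four reduces the attention cost by a factor of (3/4)^2 in this toy accounting.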
3.4 Training and Inference
Training. We train QHyer end-to-end with three losses:
| (14) |
where is the behavior cloning loss that predicts actions conditioned on Q-values instead of RTG:
| (15) |
Inference. QHyer performs two-stage autoregressive generation: (1) predict maximum Q-value from current context; (2) predict optimal action conditioned on the predicted maximum Q-value. The detailed algorithm is provided in Appendix D.
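The two-stage loop can be sketched with a stand-in model on a 1-D toy task. Everything here is hypothetical: `ToyModel`, its `predict_q` and `predict_action` methods, and the integer dynamics only illustrate the control flow (predict a Q-token, then condition the action on it) and are not the algorithm in the appendix.

```python
class ToyModel:
    """Hypothetical stand-in for the trained sequence model."""

    def predict_q(self, context, state, goal):
        # Stage 1: scalar guess of the best achievable Q; here, negative distance.
        return -abs(goal - state)

    def predict_action(self, context, state, goal, q_token):
        # Stage 2: action conditioned on the predicted Q-token (unused by this toy).
        if goal > state:
            return 1
        if goal < state:
            return -1
        return 0

def rollout(model, start, goal, horizon, context_len):
    """Run two-stage generation with a sliding context window."""
    context, state = [], start
    for _ in range(horizon):
        q_token = model.predict_q(context, state, goal)
        action = model.predict_action(context, state, goal, q_token)
        context.append((q_token, (state, goal), action))
        context = context[-context_len:]  # keep only the most recent steps
        state = state + action            # toy deterministic dynamics
        if state == goal:
            break
    return state, context

final_state, final_context = rollout(ToyModel(), start=0, goal=5, horizon=20, context_len=3)
```

The point of the structure is that the Q-token fed to the action head comes from the model's own prediction at test time, rather than from a user-specified target return as in RTG-conditioned DT.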
3.5 Theoretical Analysis
We establish convergence guarantees for QHyer: expectile regression yields near-optimal Q-values, and the learned policy has bounded suboptimality relative to the in-distribution optimal stitched policy, with explicit dependence on sample size, NFs accuracy, and coverage.
Setup. Let denote the goal-reaching probability conditioned on history. The in-distribution optimal Q-value represents the maximum achievable within the behavior policy’s support. We assume: (i) Q-value coverage with constant , measuring the minimum density ratio of optimal actions in the dataset; (ii) bounded NFs error ; (iii) bounded function class with approximation error . Full definitions are in Appendix B.
Theorem 3.1 (Convergence of Expectile Regression to In-Distribution Optimal Q-Value).
Under Q-value coverage with constant and sample size satisfying Equation 37 in the Appendix, for , the expectile estimator satisfies with high probability, where the bias term decreases as increases, at the cost of requiring more samples for variance control.
Theorem 3.2 (Convergence to In-Distribution Optimal Stitched Policy).
Under assumptions (i) to (iii), the learned policy satisfies:
| (16) |
Complete proofs are in Appendix C.
| | | | Hierarchical Policy | | | | Flat Policy | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Env | Type | Dataset | HIQL | SAW | OTA | Eik-HiQRL | GCBC | GCIVL | GCIQL | QRL | CRL | QHyer |
| cube | play | single | | | | | | | | | | |
| | | double | | | | | | | | | | |
| | | triple | | | | | | | | | | |
| | | quadruple | | | | | | | | | | |
| | | Total | 24 | 68 | 16 | 0 | 8 | 90 | 111 | 6 | 35 | 152 |
| | noisy | single | | | | | | | | | | |
| | | double | | | | | | | | | | |
| | | triple | | | | | | | | | | |
| | | quadruple | | | | | | | | | | |
| | | Total | 45 | 64 | 46 | 2 | 10 | 94 | 124 | 29 | 43 | 145 |
| scene | play | scene | | | | | | | | | | |
| | noisy | scene | | | | | | | | | | |
| puzzle | play | 3x3 | | | | | | | | | | |
| | | 4x4 | | | | | | | | | | |
| | | 4x5 | | | | | | | | | | |
| | | 4x6 | | | | | | | | | | |
| | | Total | 26 | 19 | 29 | 11 | 2 | 36 | 147 | 1 | 8 | 169 |
| | noisy | 3x3 | | | | | | | | | | |
| | | 4x4 | | | | | | | | | | |
| | | 4x5 | | | | | | | | | | |
| | | 4x6 | | | | | | | | | | |
| | | Total | 74 | 104 | 54 | 11 | 1 | 98 | 160 | 0 | 39 | 172 |
| visual-cube | play | single | | | | / | | | | | | |
| | | double | | | | / | | | | | | |
| | | triple | | | | / | | | | | | |
| | | Total | 149 | 148 | 100 | / | 21 | 84 | 46 | 62 | 50 | 99 |
| visual-scene | play | scene | | | | / | | | | | | |
| | noisy | scene | | | | / | | | | | | |

| | RL | | Supervised Learning | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Antmaze-v2 | CQL | IQL | DT | RvS | EDT | CGDT | DC | DMamba | Reinformer | QT | LSDT | QHyer |
| umaze | 74.0 | 87.5 | 64.5 | 65.4 | 67.8 | 71.0 | 85.0 | 81.8 | 84.4 | 96.7 | 80.0 | 98.4 |
| umaze-diverse | 84.0 | 62.2 | 60.5 | 60.9 | 58.3 | 71.0 | 78.5 | 71.6 | 65.8 | 96.7 | 83.2 | 97.1 |
| medium-play | 61.2 | 71.2 | 0.8 | 58.1 | 0.0 | / | 1.5 | 79.6 | 13.2 | / | 85.5 | 92.2 |
| medium-diverse | 53.7 | 70.0 | 0.5 | 67.3 | 0.0 | / | 0.0 | 83.2 | 10.6 | 59.3 | 75.8 | 94.0 |
| large-play | 15.8 | 39.6 | 0.0 | 32.4 | 0.0 | / | 0.0 | 23.2 | 0.4 | / | 0.0 | 44.2 |
| large-diverse | 14.9 | 47.5 | 0.0 | 32.9 | 0.0 | / | 0.0 | 34.6 | 0.4 | 53.3 | 0.0 | 57.5 |
| Total | 303.6 | 378.0 | 126.3 | 317.0 | 126.1 | / | 165.0 | 374.0 | 174.8 | / | 324.5 | 483.4 |
| Maze2d | CQL | IQL | DT | QDT | GDT | VDT | DC | DMamba | DMixer | QT | LSDT | QHyer |
| umaze | 94.7 | 74.0 | 31.0 | 57.3 | 50.4 | 60.3 | 20.1 | 83.4 | 86.9 | 105.4 | 72.3 | 118.5 |
| medium | 41.8 | 84.0 | 8.2 | 13.3 | 7.8 | 88.0 | 38.2 | 98.7 | 95.2 | 172.0 | 68.4 | 173.0 |
| Total | 136.5 | 158.0 | 39.2 | 70.6 | 58.2 | 148.3 | 58.3 | 182.1 | 182.1 | 277.4 | 140.7 | 291.5 |
4 Experiments
We extensively evaluate QHyer’s effectiveness and conduct ablation studies on both non-Markovian and Markovian offline GCRL datasets.
Datasets. We consider two widely used benchmarks. For OGBench (Park et al., 2025a), we evaluate on manipulation tasks including cube, scene, and puzzle environments with both play (non-Markovian) and noisy (Markovian) datasets. For D4RL (Fu et al., 2020), we evaluate on Maze (non-Markovian) tasks. A detailed introduction to these environments is presented in Appendix F.
Baselines. We compare QHyer against three categories of methods: (1) sequence modeling methods including DT (Chen et al., 2021), EDT (Wu et al., 2023), GDT (Hu et al., 2023), QDT (Yamagata et al., 2023), CGDT (Wang et al., 2024), Reinformer (Zhuang et al., 2024), DC (Kim et al., 2024b), DMamba (Ota, 2024), QT (Hu et al., 2024), LSDT (Wang et al., 2025), DMixer (Zheng et al., 2025a), and VDT (Zheng et al., 2025b); (2) TD-based methods including CQL (Kumar et al., 2020) and IQL (Kostrikov et al., 2022); (3) offline GCRL methods including GCBC (Ghosh et al., 2021), GCIVL, GCIQL (Kostrikov et al., 2022), QRL (Wang et al., 2023), CRL (Eysenbach et al., 2022), HIQL (Park et al., 2023), SAW (Zhou & Kao, 2025), OTA (Ahn et al., 2025), and Eik-HiQRL (Giammarino & Qureshi, 2026). For completeness, Section G.6 additionally reports comparisons against four recent offline RL methods (i.e., QCFQL (Li et al., 2025), SHARSA (Park et al., 2025b), Transitive RL (Park et al., 2026), and DEAS (Kim et al., 2026)) adapted to GCRL with HER, as well as GAS (Baek et al., 2025) on navigation manipulation.
4.1 OGBench Results
Table 1 validates our core claims about sequence modeling for non-Markovian Offline GCRL. On state-based play datasets collected by non-Markovian expert policies, QHyer outperforms all baselines across manipulation tasks; for example, its total score is 152 on cube and 169 on puzzle, against 111 and 147 for the strongest baseline, GCIQL. Hierarchical methods (HIQL, SAW, OTA) underperform on state-based play datasets because their subgoal decomposition assumes Markovian transitions between subgoals, an assumption violated when behavior policies exhibit temporal correlations. Eik-HiQRL further suffers from exponential quasimetric approximation error in high-dimensional spaces (Giammarino & Qureshi, 2026), limiting its effectiveness across both state-based and visual manipulation tasks. TD-based hierarchical methods (HIQL, SAW, OTA) perform strongly on visual tasks (visual-cube play totals of 149, 148, and 100, against 99 for QHyer) because hierarchical value functions provide representation learning signals beneficial for pixel inputs. On noisy datasets, QHyer maintains competitive performance through adaptive gating between attention and Mamba branches.
4.2 D4RL Results
Table 2 confirms QHyer’s advantages on long-horizon navigation tasks where trajectory stitching is essential. QHyer consistently outperforms both TD-based methods and sequence modeling baselines, with the most pronounced gains on large mazes requiring extensive stitching. Vanilla DT and its variants (EDT, DC) achieve near-zero performance on medium and large mazes, directly confirming our analysis in Section 3.1: RTG under sparse goal-conditioned rewards reduces to binary signals that provide no discriminative information for stitching trajectories. These results also demonstrate the effectiveness of our method on non-Markovian locomotion tasks.
4.3 Ablation Studies


Q: How does the Q-value estimator affect performance?
A: Figure 5 reveals a consistent ordering: No Q < CVAE < CRL < NFs, directly reflecting the relationship between density estimation accuracy and policy quality established in Section 3.1. Without Q-values, the model degenerates to behavior cloning that cannot distinguish states by their proximity to goals under sparse rewards. CVAE introduces systematic bias through the ELBO gap, distorting the goal-reaching probability landscape. CRL improves through contrastive objectives but inherits negative sampling bias that underestimates probabilities for distant goals. NFs achieve the best performance by computing exact likelihoods through invertible transformations (Equation 2), enabling accurate identification of high-value state-action pairs via expectile regression. This mechanism is essential for extracting optimal behaviors from suboptimal data.


Q: Does the architecture alone improve performance?
A: Figure 6 isolates the architectural contribution by removing NFs-based Q-conditioning from all methods, using standard RTG instead. The results reveal a consistent ordering: LSDT < DMixer < QHyer across both AntMaze and Maze2d environments. This validates that the performance gains stem from both innovations independently. LSDT’s Dynamic Convolution branch is limited by its fixed kernel size, which cannot adaptively capture dependencies of varying ranges. DMixer’s token-level selection mechanism improves upon LSDT but may disrupt continuous action patterns through discrete token dropping. In contrast, QHyer’s Mamba branch maintains compressed hidden states that enable content-adaptive dependency modeling. The selective SSM parameters (B, C, $\Delta$) dynamically determine how much historical context to retain based on input content, rather than relying on predefined kernel sizes or discrete selection thresholds. Combined with the results in Figure 5, this demonstrates that QHyer’s two innovations provide complementary and additive performance improvements.


Q: How should the expectile parameter be selected?
A: Figure 7 shows monotonic improvement from to , with optimal performance at . This validates Theorem 3.1: higher reduces the bias term by focusing on upper expectiles, enabling identification of high-Q segments for trajectory stitching that RTG’s trajectory-dependence fundamentally cannot provide. However, extreme causes degradation by over-concentrating on too few samples, increasing estimation variance. This aligns with our theoretical analysis where depends on both and coverage . As approaches 1, sensitivity to coverage limitations amplifies. We use for low-coverage (play) and for high-coverage (noisy) data.
Q: Is the Hybrid architecture’s gain actually architectural, or is it confounded by Q-conditioning, and does Mamba truly adapt its memory to data type?
A: We answer both jointly. Table 3 fixes NFs Q-conditioning and varies only the backbone. On the non-Markovian cube-single-play, Attention-only reaches , Mamba-only , Hybrid . On the Markovian cube-single-noisy the ordering is , , . The Hybrid beats the best single branch by to points on both regimes, which means the scalar gate captures genuinely complementary specialization rather than interpolating two near-identical branches.
| Environment | Attention-only | Mamba-only | Hybrid (QHyer) |
|---|---|---|---|
| cube-single-play (non-Markov.) | | | |
| cube-single-noisy (Markov.) | | | |
Table 4 then explains why, by extracting Mamba’s and the learned gate weight from the trained model (cf. Figure 3). On play, mean and , the SSM retains about steps of effective history, and the gate shifts of capacity to attention for global goal-directed reasoning. On noisy, and , memory collapses to about steps, and the gate shifts to Mamba. The essential reason is that Mamba’s selective mechanism makes a function of the input, so effective memory varies per-token with the local temporal correlation, which convolution-based hybrids (LSDT, DMixer) cannot produce because their receptive field is an architectural constant.
| Metric | play (non-Markov.) | noisy (Markov.) |
|---|---|---|
| Mean | | |
| Std | | |
| Mean | | |
| Effective memory (steps) | | |
| Gate weight (Attention) | | |
| Gate weight (Mamba) | | |
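How an input-dependent step size translates into a varying memory length can be sketched as follows. This is a minimal scalar illustration, not the paper's measurement procedure: the state value `a = -1`, the softplus parameterization with hypothetical weights `w_delta` and `b_delta`, and the 1/e definition of effective memory are all assumptions made for the example.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def effective_memory(delta, a=-1.0):
    """Steps until a past input's weight decays to 1/e, for per-step decay exp(delta * a)."""
    decay = math.exp(delta * a)      # discretized state transition, in (0, 1)
    return -1.0 / math.log(decay)    # equals 1 / (delta * |a|)

def selective_scan(xs, w_delta, b_delta, a=-1.0):
    """Scalar selective SSM: the step size delta_t is a function of the input x_t."""
    h, states, deltas = 0.0, [], []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)   # input-dependent step size
        h = math.exp(delta * a) * h + delta * x   # simplified discretization of the input term
        states.append(h)
        deltas.append(delta)
    return states, deltas
```

Under these assumptions a small step size keeps the hidden state nearly unchanged (long memory), while a large step size overwrites it with the current input (short memory); a fixed convolution kernel has no comparable per-token control.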
5 Conclusion
We presented QHyer, the first sequence modeling framework for non-Markovian offline GCRL that addresses two fundamental limitations: replacing trajectory-dependent RTG with state-dependent Q-values estimated via Normalizing Flows for effective trajectory stitching, and introducing a Hybrid Attention-Mamba architecture for content-adaptive temporal modeling. Experiments on OGBench and D4RL demonstrate state-of-the-art performance, particularly on non-Markovian datasets.
Limitations and future work. QHyer remains constrained on visual-noisy, where Markovian behavior neutralizes the non-Markovian modeling advantage and pixel-level NFs density estimation becomes the dominant source of error. Promising future directions include robust visual density estimation and extension of the deterministic-transition theory (Appendix B) to stochastic environments.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Ahn et al. (2025) Ahn, H., Choi, H., Han, J., and Moon, T. Option-aware temporally abstracted value for offline goal-conditioned reinforcement learning. arXiv preprint arXiv:2505.12737, 2025.
- Akimov et al. (2022) Akimov, D., Kurenkov, V., Nikulin, A., Tarasov, D., and Kolesnikov, S. Let offline RL flow: Training conservative agents in the latent space of normalizing flows. In 3rd Offline RL Workshop: Offline RL as a “Launchpad”, 2022.
- Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
- Baek et al. (2025) Baek, S., Park, T., Park, J., Oh, S., and Kim, Y. Graph-assisted stitching for offline hierarchical reinforcement learning. In Forty-second International Conference on Machine Learning, 2025.
- Bortkiewicz et al. (2025) Bortkiewicz, M., Pałucki, W., Myers, V., Dziarmaga, T., Arczewski, T., Kuciński, Ł., and Eysenbach, B. Accelerating goal-conditioned reinforcement learning algorithms and research. In The Thirteenth International Conference on Learning Representations, 2025.
- Brahmanage et al. (2023) Brahmanage, J., Ling, J., and Kumar, A. Flowpg: action-constrained policy gradient with normalizing flows. Advances in Neural Information Processing Systems, 36:20118–20132, 2023.
- Brandfonbrener et al. (2022) Brandfonbrener, D., Bietti, A., Buckman, J., Laroche, R., and Bruna, J. When does return-conditioned supervised learning work for offline reinforcement learning? Advances in Neural Information Processing Systems, 35:1542–1553, 2022.
- Chao et al. (2024) Chao, C.-H., Feng, C., Sun, W.-F., Lee, C.-K., See, S., and Lee, C.-Y. Maximum entropy reinforcement learning via energy-based normalizing flow. Advances in Neural Information Processing Systems, 37:56136–56165, 2024.
- Cheikhi & Russo (2023) Cheikhi, D. and Russo, D. On the statistical benefits of temporal difference learning. In International Conference on Machine Learning, pp. 4269–4293. PMLR, 2023.
- Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
- Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
- Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pp. 1407–1416. PMLR, 2018.
- Eysenbach et al. (2020) Eysenbach, B., Salakhutdinov, R., and Levine, S. C-learning: Learning to achieve goals via recursive classification. arXiv preprint arXiv:2011.08909, 2020.
- Eysenbach et al. (2022) Eysenbach, B., Zhang, T., Levine, S., and Salakhutdinov, R. R. Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620, 2022.
- Eysenbach et al. (2024) Eysenbach, B., Myers, V., Salakhutdinov, R., and Levine, S. Inference via interpolation: Contrastive representations provably enable planning and inference. Advances in Neural Information Processing Systems, 37:58901–58928, 2024.
- Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
- Ghosh et al. (2021) Ghosh, D., Gupta, A., Reddy, A., Fu, J., Devin, C. M., Eysenbach, B., and Levine, S. Learning to reach goals via iterated supervised learning. In International Conference on Learning Representations, 2021.
- Ghugare & Eysenbach (2025) Ghugare, R. and Eysenbach, B. Normalizing flows are capable models for rl. arXiv preprint arXiv:2505.23527, 2025.
- Ghugare et al. (2024) Ghugare, R., Geist, M., Berseth, G., and Eysenbach, B. Closing the gap between TD learning and supervised learning - a generalisation point of view. In The Twelfth International Conference on Learning Representations, 2024.
- Giammarino & Qureshi (2026) Giammarino, V. and Qureshi, A. H. Goal reaching with eikonal-constrained hierarchical quasimetric reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.
- Giammarino et al. (2025) Giammarino, V., Ni, R., and Qureshi, A. H. Physics-informed value learner for offline goal-conditioned reinforcement learning. arXiv preprint arXiv:2509.06782, 2025.
- Grathwohl et al. (2018) Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
- Gu & Dao (2024) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hong et al. (2023) Hong, M., Kang, M., and Oh, S. Diffused task-agnostic milestone planner. Advances in Neural Information Processing Systems, 36:387–405, 2023.
- Hu et al. (2023) Hu, S., Shen, L., Zhang, Y., and Tao, D. Graph decision transformer. arXiv preprint arXiv:2303.03747, 2023.
- Hu et al. (2024) Hu, S., Fan, Z., Huang, C., Shen, L., Zhang, Y., Wang, Y., and Tao, D. Q-value regularized transformer for offline reinforcement learning. In Forty-first International Conference on Machine Learning, 2024.
- Jain & Ravanbakhsh (2024) Jain, V. and Ravanbakhsh, S. Learning to reach goals via diffusion. In International Conference on Machine Learning, pp. 21170–21195. PMLR, 2024.
- Jullien et al. (2023) Jullien, S., Deffayet, R., Renders, J.-M., Groth, P., and de Rijke, M. Distributional reinforcement learning with dual expectile-quantile regression. arXiv preprint arXiv:2305.16877, 2023.
- Kakade (2001) Kakade, S. M. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
- Kim et al. (2026) Kim, C., Lee, H., Seo, Y., Lee, K., and Zhu, Y. DEAS: DEtached value learning with action sequence for scalable offline RL. In The Fourteenth International Conference on Learning Representations, 2026.
- Kim et al. (2024a) Kim, J., Lee, S., Kim, W., and Sung, Y. Adaptive Q-aid for conditional supervised learning in offline reinforcement learning. Advances in Neural Information Processing Systems, 37:87104–87135, 2024a.
- Kim et al. (2024b) Kim, J., Lee, S., Kim, W., and Sung, Y. Decision convformer: Local filtering in metaformer is sufficient for decision making. In The Twelfth International Conference on Learning Representations, 2024b.
- Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018.
- Koenker & Hallock (2001) Koenker, R. and Hallock, K. F. Quantile regression. Journal of economic perspectives, 15(4):143–156, 2001.
- Kostrikov et al. (2022) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
- Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- Lei et al. (2025a) Lei, X., Yang, W., Ke, K., Yang, S., Zhang, X., Pajarinen, J., and Wang, D. Gchr: Goal-conditioned hindsight regularization for sample-efficient reinforcement learning. arXiv preprint arXiv:2508.06108, 2025a.
- Lei et al. (2025b) Lei, X., Zhang, X., and Wang, D. Mgda: Model-based goal data augmentation for offline goal-conditioned weighted supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 18172–18180, 2025b.
- Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Li et al. (2025) Li, Q., Zhou, Z., and Levine, S. Reinforcement learning with action chunking. arXiv preprint arXiv:2507.07969, 2025.
- Lipman et al. (2023) Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
- Liu et al. (2022) Liu, M., Zhu, M., and Zhang, W. Goal-conditioned reinforcement learning: Problems and solutions. arXiv preprint arXiv:2201.08299, 2022.
- Liu et al. (2025) Liu, Z., Yang, Y., Wang, R., Xu, P., and Zhou, D. How to provably improve return conditioned supervised learning? arXiv preprint arXiv:2506.08463, 2025.
- Loshchilov (2017) Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lynch et al. (2020) Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. In Conference on robot learning, pp. 1113–1132. PMLR, 2020.
- Ma et al. (2022) Ma, Y. J., Yan, J., Jayaraman, D., and Bastani, O. How far i’ll go: Offline goal-conditioned reinforcement learning via f-advantage regression. arXiv preprint arXiv:2206.03023, 2022.
- Myers et al. (2024) Myers, V., Zheng, C., Dragan, A., Levine, S., and Eysenbach, B. Learning temporal distances: Contrastive successor features can provide a metric structure for decision-making. arXiv preprint arXiv:2406.17098, 2024.
- Myers et al. (2025) Myers, V., Zheng, B., Eysenbach, B., and Levine, S. Offline goal-conditioned reinforcement learning with quasimetric representations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Newey & Powell (1987) Newey, W. K. and Powell, J. L. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pp. 819–847, 1987.
- Opryshko et al. (2025) Opryshko, E., Quan, J., Voelcker, C., Du, Y., and Gilitschenski, I. Test-time graph search for goal-conditioned reinforcement learning. arXiv preprint arXiv:2510.07257, 2025.
- Ota (2024) Ota, T. Decision mamba: Reinforcement learning via sequence modeling with selective state spaces. arXiv preprint arXiv:2403.19925, 2024.
- Park et al. (2023) Park, S., Ghosh, D., Eysenbach, B., and Levine, S. HIQL: Offline goal-conditioned RL with latent states as actions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Park et al. (2024) Park, S., Kreiman, T., and Levine, S. Foundation policies with hilbert representations. arXiv preprint arXiv:2402.15567, 2024.
- Park et al. (2025a) Park, S., Frans, K., Eysenbach, B., and Levine, S. Ogbench: Benchmarking offline goal-conditioned rl. In International Conference on Learning Representations (ICLR), 2025a.
- Park et al. (2025b) Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., and Levine, S. Horizon reduction makes rl scalable. arXiv preprint arXiv:2506.04168, 2025b.
- Park et al. (2026) Park, S., Oberai, A., Atreya, P., and Levine, S. Transitive RL: Value learning via divide and conquer. In The Fourteenth International Conference on Learning Representations, 2026.
- Reuss et al. (2023) Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.
- Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320. PMLR, 2015.
- Sikchi et al. (2024) Sikchi, H., Chitnis, R., Touati, A., Geramifard, A., Zhang, A., and Niekum, S. Score models for offline goal-conditioned reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024.
- Singh et al. (2020) Singh, A., Liu, H., Zhou, G., Yu, A., Rhinehart, N., and Levine, S. Parrot: Data-driven behavioral priors for reinforcement learning. arXiv preprint arXiv:2011.10024, 2020.
- Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
- Teshima et al. (2020) Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M., and Sugiyama, M. Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373, 2020.
- Wang et al. (2025) Wang, J., Karanasou, P., Wei, P., Gatti, E., Plasencia, D. M., and Kanoulas, D. Long-short decision transformer: Bridging global and local dependencies for generalized decision-making. In The Thirteenth International Conference on Learning Representations, 2025.
- Wang et al. (2023) Wang, T., Torralba, A., Isola, P., and Zhang, A. Optimal goal-reaching reinforcement learning via quasimetric learning. In International Conference on Machine Learning, pp. 36411–36430. PMLR, 2023.
- Wang et al. (2024) Wang, Y., Yang, C., Wen, Y., Liu, Y., and Qiao, Y. Critic-guided decision transformer for offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 15706–15714, 2024.
- Ward et al. (2019) Ward, P. N., Smofsky, A., and Bose, A. J. Improving exploration in soft-actor-critic with normalizing flows policies. arXiv preprint arXiv:1906.02771, 2019.
- Wu et al. (2022) Wu, J., Wu, H., Qiu, Z., Wang, J., and Long, M. Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:31278–31291, 2022.
- Wu et al. (2023) Wu, Y.-H., Wang, X., and Hamaya, M. Elastic decision transformer. arXiv preprint arXiv:2307.02484, 2023.
- Yamagata et al. (2023) Yamagata, T., Khalil, A., and Santos-Rodriguez, R. Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline rl. In International Conference on Machine Learning, pp. 38989–39007. PMLR, 2023.
- Yarats et al. (2022) Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations, 2022.
- Yoon et al. (2024) Yoon, Y., Lee, G., Ahn, S., and Ok, J. Breadth-first exploration on adaptive grid for reinforcement learning. In Forty-first International Conference on Machine Learning, 2024.
- Zhai et al. (2024) Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., and Susskind, J. Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329, 2024.
- Zheng et al. (2025a) Zheng, H., Shen, L., Luo, Y., Ye, D., Du, B., Shen, J., and Tao, D. Decision mixer: Integrating long-term and local dependencies via dynamic token selection for decision-making. In Forty-second International Conference on Machine Learning, 2025a.
- Zheng et al. (2025b) Zheng, H., Shen, L., Luo, Y., Ye, D., Xu, S., Du, B., Shen, J., and Tao, D. Value-guided decision transformer: A unified reinforcement learning framework for online and offline settings. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b.
- Zhou & Kao (2025) Zhou, J. L. and Kao, J. C. Flattening hierarchies with policy bootstrapping. arXiv preprint arXiv:2505.14975, 2025.
- Zhuang et al. (2024) Zhuang, Z., Peng, D., Liu, J., Zhang, Z., and Wang, D. Reinformer: Max-return sequence modeling for offline RL. In Forty-first International Conference on Machine Learning, 2024.
Appendix A Related Work
Offline Goal-Conditioned RL (GCRL). Offline GCRL aims to learn goal-reaching policies from static datasets without environment interaction. Existing approaches can be categorized into several paradigms: goal-conditioned hindsight relabeling and data augmentation (Andrychowicz et al., 2017; Lei et al., 2025b), hierarchical or subgoal-based learning (Park et al., 2023; Ahn et al., 2025; Giammarino & Qureshi, 2026; Zhou & Kao, 2025; Lei et al., 2025a), graph-based planning (Yoon et al., 2024; Eysenbach et al., 2024), metric learning (Wang et al., 2023; Park et al., 2024; Myers et al., 2024, 2025), dual optimization (Ma et al., 2022; Sikchi et al., 2024), generative modeling (Hong et al., 2023; Reuss et al., 2023; Jain & Ravanbakhsh, 2024; Myers et al., 2025), and test-time adaptation (Opryshko et al., 2025). However, these methods predominantly assume that the offline data is Markovian, i.e., that the optimal action depends solely on the current state and goal. They therefore struggle on non-Markovian datasets because they cannot capture the temporal dependencies that govern the behavior policy’s decisions. In contrast, QHyer explicitly models these dependencies through a sequence modeling framework.
Normalizing Flows in RL. Normalizing flows (NFs) are invertible generative models that enable exact likelihood computation and efficient sampling (Dinh et al., 2014, 2017; Kingma & Dhariwal, 2018). Recent work has demonstrated their effectiveness in RL for policy modeling (Singh et al., 2020; Ward et al., 2019; Chao et al., 2024) and Q-function estimation. Chao et al. (2024) propose Energy-Based Normalizing Flows (EBFlow) that unify policy evaluation and improvement into a single objective for maximum entropy RL, enabling exact soft value function calculation without Monte Carlo approximation. Brahmanage et al. (2023) leverage NFs to learn invertible mappings between feasible action spaces and Gaussian latent spaces for action-constrained policy gradient methods.
For offline settings, Akimov et al. (2022) use NFs-based action encoders to construct conservative action spaces, addressing distributional shift without explicit regularization. Notably, Ghugare & Eysenbach (2025) show that NFs can serve as Q-functions in GCRL by modeling the discounted state occupancy distribution, achieving strong performance on offline GCRL benchmarks with a simple feedforward architecture. However, their approach cannot capture temporal dependencies in non-Markovian datasets. Our work integrates NFs-based Q-estimation into a sequence modeling framework, enabling both accurate value estimation and temporal dependency modeling.
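The exact-likelihood property of normalizing flows mentioned above comes from the change-of-variables formula. The following is a minimal two-dimensional sketch of a single affine coupling layer in the style of RealNVP, with hypothetical fixed weights `w` and `v` standing in for learned networks; it is not the critic used in this paper.

```python
import math

def coupling_forward(x1, x2, w=0.5, v=-0.3):
    """Affine coupling: x1 passes through, x2 is scaled and shifted as a function of x1."""
    s = math.tanh(w * x1)          # log-scale, depends only on x1
    t = v * x1                     # shift, depends only on x1
    y2 = x2 * math.exp(s) + t
    return x1, y2, s               # log|det Jacobian| equals s

def coupling_inverse(y1, y2, w=0.5, v=-0.3):
    """Exact inverse of coupling_forward; no iterative solve is needed."""
    s = math.tanh(w * y1)
    t = v * y1
    return y1, (y2 - t) * math.exp(-s)

def log_density(x1, x2):
    """Exact log-likelihood: standard normal base density plus the log-determinant."""
    y1, y2, log_det = coupling_forward(x1, x2)
    log_base = -0.5 * (y1 * y1 + y2 * y2) - math.log(2.0 * math.pi)
    return log_base + log_det
```

Because the Jacobian is triangular, its log-determinant is just the log-scale term, which is what makes density evaluation cheap and exact.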
Sequence Modeling in Offline RL. Decision Transformer (DT) (Chen et al., 2021) reformulates offline RL as conditional sequence modeling, where actions are generated conditioned on desired returns and past states. This paradigm has spurred extensive research, which can be broadly categorized into two directions: value-enhanced methods and architectural innovations.
Value-enhanced methods integrate reinforcement learning principles to address DT’s fundamental limitation in stitching sub-optimal trajectories (Brandfonbrener et al., 2022). For instance, Q-learning Decision Transformer (QDT) (Yamagata et al., 2023) employs dynamic programming for optimal path synthesis. Critic-Guided Decision Transformer (CGDT) (Wang et al., 2024) incorporates a value-based critic to align expected returns with target returns. Q-value Regularized Transformer (QT) (Hu et al., 2024) introduces explicit Q-value regularization to tackle long-horizon and sparse-reward tasks. Reinformer (Zhuang et al., 2024) utilizes expectile regression for maximizing returns, while Value-Guided Decision Transformer (VDT) (Zheng et al., 2025b) leverages value functions for advantage-weighted behavior regularization. These methods primarily employ TD-learning for Q-value estimation and use value functions as auxiliary losses or regularizers. In contrast, QHyer estimates Q-values via Normalizing Flows with Monte Carlo learning and directly uses them as conditioning tokens to replace RTG.
Architectural innovations aim to more effectively capture the heterogeneous temporal patterns present in offline datasets. Elastic Decision Transformer (EDT) (Wu et al., 2023) enables adaptive history length selection to facilitate stitching. Graph Decision Transformer (GDT) (Hu et al., 2023) structures input sequences as causal graphs with relation-enhanced attention mechanisms. Decision Convformer (DC) (Kim et al., 2024b) replaces attention with causal convolution filters to model local, Markovian associations efficiently. Decision Mamba (DMamba) (Ota, 2024) substitutes attention with selective state space models for linear-time sequence modeling. Long-Short Decision Transformer (LSDT) (Wang et al., 2025) combines attention with dynamic convolution using a fixed capacity ratio, and Decision Mixer (DMixer) (Zheng et al., 2025a) integrates long-term and local features via dynamic token selection. QHyer introduces a Hybrid Attention-Mamba architecture with learnable gating that dynamically allocates capacity, allowing attention to handle global goal-directed planning while Mamba captures local temporal patterns with content-adaptive memory.
Our work departs from both value-enhanced methods and purely architectural innovations. To our knowledge, this is the first work to unlock the potential of sequence modeling for Offline GCRL.
Appendix B Notation and Assumptions
B.1 Notation
We consider goal-conditioned episodic MDPs with a finite horizon. Following Reinforced Return-conditioned Supervised Learning (R2CSL) (Liu et al., 2025), we assume deterministic transitions, i.e., given a state and an action, the next state is uniquely determined. This ensures that the in-distribution optimal Q-value is well-defined as a unique value. Extension to stochastic environments is an important direction for future work.
| Symbol | Definition |
|---|---|
| Spaces and Indices | |
| State space, action space, goal space | |
| Episode horizon (total number of stages per episode) | |
| Stage index (timestep within an episode) | |
| Goal mapping; in our experiments, | |
| Policies and Distributions | |
| Behavior policy that generated the offline dataset | |
| In-distribution optimal stitched policy (Eq. 19) | |
| Learned policy from QHyer | |
| State visitation probability at stage under policy | |
| Minimum positive state visitation: | |
| Distribution mismatch coefficient (Section B.2) | |
| Q-Values (Key Distinction) | |
| True goal-reaching probability under : | |
| In-distribution optimal Q-value: | |
| NFs estimate of , trained via Equation 5 | |
| Expectile regression output on NFs-estimated Q-values | |
| Transformer-predicted Q-value for conditioning (Equation 10) | |
| Error Terms | |
| NFs estimation MSE (Section B.2) | |
| Expectile regression bias (Theorem 3.1) | |
| MLE approximation error (Section B.2) | |
| Q-value coverage constant (Section B.2) | |
Q-Value Definition. For a trajectory passing through at stage , the goal-reaching Q-value is:
| (17) |
where in deterministic environments, this reduces to the discounted indicator of whether the trajectory reaches goal .
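For intuition, the discounted-indicator form can be computed from a single trajectory by hindsight relabeling. The following is a minimal sketch under stated assumptions: the discount convention (one factor of `gamma` per step until the first visit to the goal), the equality-based goal test, and the helper name `goal_reaching_returns` are illustrative choices and are not taken from Equation 17.

```python
def goal_reaching_returns(states, goal, gamma=0.99):
    """Discounted indicator of reaching `goal` from each stage of one trajectory."""
    returns = [0.0] * len(states)
    steps_to_goal = None                  # stays None until the goal is seen scanning backward
    for t in range(len(states) - 1, -1, -1):
        if states[t] == goal:
            steps_to_goal = 0             # the goal is reached at this stage
        elif steps_to_goal is not None:
            steps_to_goal += 1            # one more step to the nearest future visit
        returns[t] = 0.0 if steps_to_goal is None else gamma ** steps_to_goal
    return returns

# Toy trajectory over symbolic states; "g" is the relabeled goal.
toy_returns = goal_reaching_returns(["a", "b", "g", "c"], "g", gamma=0.9)
```

Stages before the goal receive a value that decays with distance, and stages after the last visit receive zero, so two trajectories through the same state can carry different values, which is what the in-distribution maximum then resolves.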
In-Distribution Optimal Q-Value:
| (18) |
Optimal Stitched Policy:
| (19) |
Performance Metric:
| (20) |
where and .
B.2 Assumptions
Assumption B.1 (Deterministic Environment).
The transition dynamics are deterministic, i.e., given a state-action pair, the next state is unique. This is standard in goal-conditioned RL theory (Park et al., 2025a) and holds approximately in robotic manipulation tasks.
Remark B.2 (Scope of Assumption B.1).
Assumption B.1 constrains the transition dynamics, not the behavior policy. This is compatible with all our experimental settings. OGBench runs deterministic MuJoCo dynamics even for noisy datasets, where the "noise" is a Gaussian perturbation of the behavior policy rather than of the dynamics, and D4RL mazes likewise use deterministic dynamics. Non-Markovian play data corresponds to a history-dependent behavior policy over a deterministic MDP. QHyer’s sequence modeling targets exactly this behavior-policy non-Markovianness, while Theorems 3.1 and 3.2 analyze stitching on the underlying MDP. The assumption matches R2CSL (Liu et al., 2025) and is standard in the offline GCRL theory literature. Extension to stochastic dynamics is a genuine open problem that we flag in the conclusion.
Assumption B.3 (Policy Class Regularity).
The policy class satisfies:
1. (can be relaxed to finite covering number).
2. For all and : .
3. , where .
Assumption B.4 (Q-Value Coverage).
For each in the support of , define:
| (21) |
For trajectory , let be the empirical goal-reaching probability computed via hindsight relabeling. There exists such that:
| (22) |
Interpretation: At least -fraction of trajectories through achieve the optimal Q-value. Under Assumption B.1, is well-defined as the maximum over a finite set of deterministic outcomes.
Assumption B.5 (Distribution Mismatch).
There exists such that for all .
Assumption B.6 (Bounded Q-Values).
for all , since it represents a probability.
Assumption B.7 (NFs Estimation Error).
The NFs estimator satisfies:
| (23) |
Assumption B.8 (Policy Lipschitz Continuity).
For any , , and :
| (24) |
Assumption B.9 (Expectile Lipschitz Stability).
Let denote the -expectile of samples . For any two sample sets and with for all :
| (25) |
where is a Lipschitz constant. This holds because the expectile is a weighted average of samples.
Appendix C Proofs of Theoretical Results
C.1 Proof of Theorem 3.1
Proof.
We prove convergence of expectile regression to the in-distribution optimal Q-value, accounting for NFs estimation error.
Problem Setup. Fix . Let be the true Q-values across trajectories. In practice, we observe NFs estimates where . The expectile loss is . We define:
- the expectile computed on the true Q-values;
- the expectile computed on the NFs-estimated Q-values.
Our goal is to bound .
First, Decomposition via Triangle Inequality.
| (26) |
Second, Bounding Term (A): Expectile Bias. From the first-order condition, the expectile satisfies:
| (27) |
where , , and are conditional means.
Case 1: If , then trivially.
Case 2: If (generic case). Since is a convex combination:
| (28) |
By Assumption B.4, at least -fraction achieve . Using Hoeffding’s inequality, with high probability, . Since , all -samples are in the "above" group: .
Worst-case analysis with and :
| (29) | ||||
| (30) | ||||
| (31) |
Third, Bounding Term (B): NFs Error Propagation. By Assumption B.9, the expectile is Lipschitz in its inputs:
| (32) |
By Assumption B.7 and Markov’s inequality, with high probability:
| (33) |
Therefore, we have:
| (34) |
Fourth, Combining Terms. From Equation 26, we have:
| (35) |
Finally, Sample Complexity. We need uniform convergence over all . By a union bound with goals, we have:
Condition 1 (sufficient visits): For each , we need . By Hoeffding:
| (36) |
Condition 2 (coverage concentration): Given visits, need .
Setting failure probability for each tuple:
| (37) |
∎
C.2 Proof of Theorem 3.2
Proof.
We prove convergence to the optimal stitched policy with careful treatment of distribution mismatch.
First, Performance Difference. Since rewards are bounded in :
| (38) |
Second, Simulation Lemma. By the simulation lemma (Kakade, 2001), we have:
| (39) |
Third, Policy Difference Decomposition.
| (40) |
Bounding Term (I) of Equation 40. We apply the inequalities in the following order:
a: Applying Jensen’s inequality to the expectation of TV, we have:
| (41) |
b: Applying Pinsker’s inequality ($\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\mathrm{KL}(P \,\|\, Q)}$), we have:
| (42) |
c: Applying the distribution mismatch bound (Assumption B.5), we have:
| (43) | ||||
| (44) | ||||
| (45) |
Combining a-c, we have:
| (46) |
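Steps a and b can be sanity-checked numerically. The following is a minimal sketch on made-up discrete action distributions; it assumes step a is the usual Jensen bound of the mean TV by its root mean square, uses uniform weights over three hypothetical states, and is not a substitute for the derivation.

```python
import math

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical per-state action distributions: (target policy, learned policy).
pairs = [
    ([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]),
    ([0.1, 0.1, 0.8], [0.3, 0.3, 0.4]),
    ([0.4, 0.4, 0.2], [0.4, 0.35, 0.25]),
]
tvs = [total_variation(p, q) for p, q in pairs]
kls = [kl_divergence(p, q) for p, q in pairs]

mean_tv = sum(tvs) / len(tvs)                                # E[TV]
rms_tv = math.sqrt(sum(tv * tv for tv in tvs) / len(tvs))    # sqrt(E[TV^2]), Jensen
pinsker = math.sqrt(0.5 * sum(kls) / len(kls))               # sqrt(E[KL] / 2), Pinsker
```

The chain `mean_tv <= rms_tv <= pinsker` holds for any choice of distributions; step c then changes the measure under which the KL expectation is taken.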
By MLE analysis (Liu et al., 2025), with probability :
| (47) |
Summing over stages:
| (48) |
Fourth, Bounding Term (II) of Equation 40. By Assumption B.8, we have:
| (49) |
From Theorem 3.1, we have:
| (50) |
Taking expectation under and applying distribution mismatch for the NFs error term, we have:
| (51) | ||||
| (52) |
Summing over stages, we have:
| (53) |
Final Bound. Combining, we have:
| (54) | ||||
| (55) | ||||
| (56) |
A union bound over the events from Theorem 3.1 and the MLE analysis gives probability . ∎
C.3 Comparison with R2CSL
| Aspect | R2CSL | QHyer |
|---|---|---|
| Conditioning signal | RTG: | Q-value: |
| Signal property | Trajectory-dependent | State-dependent |
| Consistency constraint | Required | Not required |
| Stitching mechanism | Explicit RTG relabeling | Implicit via expectile |
| Estimation method | Quantile regression | Expectile + NFs |
| Additional error term | None | |
| Sample complexity | ||
| Convergence rate |
Appendix D QHyer Algorithm Details
This section describes the architecture, training, and inference procedures of QHyer. The overall structure is depicted in Figure 8, and the complete algorithm is summarized in Algorithm 1.

Model Architecture.
The input sequence follows the format where denotes state-goal concatenation (Schaul et al., 2015), and is the normalized Q-value computed from the NFs-based critic (Equation 4):
| (57) |
where is the behavior Q-value estimated by NFs, denotes the mean absolute Q-value over the batch, and is a small constant for numerical stability. At timestep , the model takes a context window of length :
| Input: | |||
| Output: |
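The Q-value normalization described in words above can be sketched as follows. This is a minimal illustration that follows the verbal description only (division by the batch mean absolute value plus a small constant); the helper name `normalize_q` and the default `eps` are assumptions.

```python
def normalize_q(q_batch, eps=1e-6):
    """Scale NFs-estimated Q-values by the batch mean absolute value."""
    mean_abs = sum(abs(q) for q in q_batch) / len(q_batch)
    # eps keeps the division well-defined when every Q-value in the batch is zero.
    return [q / (mean_abs + eps) for q in q_batch]

normalized = normalize_q([0.2, 0.4, 0.6])
```

After this scaling the conditioning tokens have a mean magnitude close to one regardless of how sparse the goal-reaching signal is in a given batch.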
The NF critic consists of an SA-Encoder that maps to a latent representation, followed by a RealNVP (Dinh et al., 2017) that computes . The Hybrid Attention-Mamba backbone processes tokens through transformer blocks with learnable attention-Mamba gating as described in Section 3.2.
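The gating between the two branches can be pictured with a single learnable scalar. The following is a minimal sketch, assuming the gate is a sigmoid of one logit and the block output is a convex combination of the branch outputs; the precise form is the one given in Section 3.2, and `gated_mix` and `gate_logit` are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_mix(attn_out, mamba_out, gate_logit):
    """Convex combination of two branch outputs with one learnable scalar gate."""
    g = sigmoid(gate_logit)        # weight on the attention branch, in (0, 1)
    mixed = [g * a + (1.0 - g) * m for a, m in zip(attn_out, mamba_out)]
    return mixed, g

mixed, g = gated_mix([1.0, 0.0], [0.0, 1.0], gate_logit=0.0)
```

With this parameterization the two gate weights always sum to one, so a shift of capacity toward one branch is a shift away from the other.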
Q-Conditioned Policy Learning.
Unlike prior Q-enhanced supervised learning methods that incorporate Q-values into loss functions, we use Q-values as conditioning tokens input to the policy network. This design enables the policy to explicitly leverage Q-value signals for action selection during both training and inference. The total loss is defined in Equation 14, combining the NFs-based critic loss (Equation 5), behavior cloning loss (Equation 15), and expectile regression loss (Equation 10).
In practice, we apply a denoising trick to the NFs-based critic by adding Gaussian noise to goals during training, which improves density estimation quality. The expectile loss is defined in Equation 9, where controls the asymmetry. When , underestimation is penalized more heavily, driving the learned toward the maximum of over all actions in the dataset.
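The asymmetry can be made concrete with a minimal sketch of the standard expectile loss (the form used in IQL), which weights a squared residual by tau when the prediction falls below the target and by 1 - tau otherwise. Equation 9 is not shown in this text, so we assume it matches this standard form; `expectile_loss`, `fit_expectile`, and the sample values are illustrative only.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: weight tau when pred < target, else 1 - tau."""
    diff = np.asarray(target, dtype=np.float64) - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def fit_expectile(targets, tau, steps=2000, lr=0.1):
    """Find the scalar minimizer of the expectile loss by gradient descent."""
    targets = np.asarray(targets, dtype=np.float64)
    v = targets.mean()
    for _ in range(steps):
        diff = targets - v
        weight = np.where(diff > 0, tau, 1.0 - tau)
        v -= lr * (-2.0 * np.mean(weight * diff))  # gradient of the mean loss w.r.t. v
    return v

q_samples = np.array([0.1, 0.3, 0.5, 0.9])
v_mean = fit_expectile(q_samples, tau=0.5)   # tau = 0.5 recovers the sample mean
v_high = fit_expectile(q_samples, tau=0.99)  # a large tau moves toward the maximum
```

With tau = 0.5 the minimizer is the mean of the samples, while tau = 0.99 lands close to, but still below, the largest sample.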
Inference: Trajectory Stitching via Q-Conditioning.
In classical Q-learning, the optimal value function derives the optimal action given the current state. In our framework, we leverage the maximum Q-value to help the policy select near-optimal actions. Note that depends only on state and goal because action is marginalized by the expectile regression. The inference pipeline follows:
| (58) |
At each timestep , QHyer performs two-stage autoregressive generation as shown in Algorithm 1:
1. Predict maximum Q-value: Given the historical context window, the model first predicts which represents the maximum achievable goal-reaching probability from the current state.
2. Predict action: Conditioned on the predicted , the model then outputs the action that achieves this maximum Q-value.
When the initial state and goal correspond to different trajectories in the dataset, which is precisely the scenario requiring trajectory stitching, our model outputs effective actions by leveraging the Q-conditioned policy.
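The two-stage generation can be sketched as a simple rollout loop. Everything below is a toy stand-in: `ToyQHyerModel`, its `predict_max_q` and `predict_action` methods, and the 1-D dynamics are hypothetical names and behavior chosen for illustration, not the actual QHyer interface.

```python
from collections import deque

class ToyQHyerModel:
    """Hypothetical stand-in exposing the two prediction heads."""

    def predict_max_q(self, context, state_goal):
        # Placeholder: pretend states nearer the goal have higher value.
        state, goal = state_goal
        return -abs(goal - state)

    def predict_action(self, context, state_goal, q_max):
        # Placeholder: step toward the goal; a real model would condition on q_max.
        state, goal = state_goal
        return 1 if goal > state else -1 if goal < state else 0

def rollout(model, state, goal, context_len=3, max_steps=20):
    context = deque(maxlen=context_len)  # sliding window of (q, state-goal, action)
    for _ in range(max_steps):
        if state == goal:
            break
        sg = (state, goal)
        q_max = model.predict_max_q(list(context), sg)           # stage 1
        action = model.predict_action(list(context), sg, q_max)  # stage 2
        context.append((q_max, sg, action))
        state += action  # toy 1-D dynamics
    return state, list(context)

final_state, history = rollout(ToyQHyerModel(), state=0, goal=5)
```

The point of the sketch is the ordering: the Q-token is predicted first and then fed back as a conditioning input for the action at the same timestep.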
Appendix E Baseline Details
We compare our approach with a wide variety of baselines, including sequence modeling methods, TD-based RL methods, and Offline GCRL methods. Specifically, we include the following:
• For sequence modeling methods, we include Decision Transformer (DT) (Chen et al., 2021), Elastic Decision Transformer (EDT) (Wu et al., 2023), Graph Decision Transformer (GDT) (Hu et al., 2023), Q-learning Decision Transformer (QDT) (Yamagata et al., 2023), Critic-Guided Decision Transformer (CGDT) (Wang et al., 2024), Reinforced Transformer (Reinformer) (Zhuang et al., 2024), Decision ConvFormer (DC) (Kim et al., 2024b), Decision Mamba (DMamba) (Ota, 2024), Q-value Regularized Transformer (QT) (Hu et al., 2024), Long-Short Decision Transformer (LSDT) (Wang et al., 2025), Decision Mixer (DMixer) (Zheng et al., 2025a), and Value-guided Decision Transformer (VDT) (Zheng et al., 2025b). DT is a classic sequence modeling method that uses a Transformer architecture to model and reproduce sequences from demonstrations, integrating a goal-conditioned policy to convert Offline RL into a supervised learning task. Despite its competitive performance on Offline RL tasks, DT falls short in achieving trajectory stitching (Brandfonbrener et al., 2022). GDT extends DT by explicitly structuring the input sequence as a causal graph and incorporating relation-enhanced attention to better model the dependencies between states, actions, and rewards. EDT is a variant of DT that determines the optimal history length to promote trajectory stitching. However, it does not incorporate the RL objective of maximizing returns (Zhuang et al., 2024), and its stitching capabilities are limited (Kim et al., 2024a). QDT integrates Dynamic Programming with the DT framework to enhance the optimal path generation ability of DT. CGDT enhances DT by incorporating a value-based critic to align the expected returns of actions with target returns, effectively addressing the inconsistency issues of Return-Conditioned Supervised Learning in stochastic environments and suboptimal datasets.
DC replaces attention blocks with convolution filters to more efficiently capture local associations. Reinformer is similar to our work; however, it exhibits limited stitching capabilities due to the absence of Q-values, resulting in a significant performance gap compared to TD-based RL methods. DMamba replaces the attention mechanism in DT with the Mamba selective state space model to achieve linear computational complexity while maintaining sequence modeling capabilities. QT introduces Q-value regularization to optimize action selection on top of DT and excels in handling long time horizons and sparse reward tasks. LSDT enhances the model structure of DT with a dual-branch architecture (long-term and local features) adept at extracting information within different ranges. DMixer integrates both long-term and local features, and additionally introduces a plug-and-play dynamic token selection mechanism so that the model can adaptively allocate attention to different features based on the specific requirements of each task. VDT leverages value functions to perform advantage-weighting and behavior regularization on DT, guiding the policy toward upper-bound optimal decisions during the offline training phase.
• For TD-based RL methods, we include Conservative Q-Learning (CQL) (Kumar et al., 2020) and Implicit Q-Learning (IQL) (Kostrikov et al., 2022). CQL and IQL are classical offline RL methods that utilize dynamic programming. This trick endows them with stitching properties (Cheikhi & Russo, 2023; Ghugare et al., 2024).
• For Offline GCRL methods, we include goal-conditioned behavioral cloning (GCBC) (Ghosh et al., 2021), goal-conditioned implicit V-learning (GCIVL) and Q-learning (GCIQL) (Kostrikov et al., 2022), Quasimetric RL (QRL) (Wang et al., 2023), Contrastive RL (CRL) (Eysenbach et al., 2022), and Hierarchical implicit Q-learning (HIQL) (Park et al., 2023). For these baselines, we follow the implementation setup established by OGBench (Park et al., 2025a) throughout our experiments. Additionally, we select Subgoal Advantage-Weighted Policy Bootstrapping (SAW) (Zhou & Kao, 2025), Option-aware Temporally Abstracted (OTA) (Ahn et al., 2025) and Eikonal-Constrained Quasimetric RL (Eik-QRL) (Giammarino & Qureshi, 2026) as our state-of-the-art GCRL baselines. SAW trains a flat policy by directly sampling subgoals from offline datasets through advantage-weighted policy bootstrapping, thereby eliminating the need for complex subgoal generation models, and achieves superior performance on long-horizon, high-dimensional control tasks. OTA employs temporal abstraction to reduce the effective planning horizon, which substantially improves the scalability of high-level policies to long-horizon tasks. Eik-HiQRL overcomes QRL’s dependence on trajectory continuity for local constraints and its struggle to maintain a valid quasimetric structure in high-dimensional, long-horizon tasks by introducing a trajectory-free Eikonal PDE constraint at the high level and a hierarchical policy decomposition.
Appendix F Experiment Details
In this section, we provide details of the offline datasets and of the implementation of all algorithms used in our experiments: Offline GCRL datasets, Normalizing Flows, and QHyer.




F.1 Offline GCRL non-Markovian Datasets
We adopt the manipulation suite from OGBench (Park et al., 2025a), which consists of three robotic manipulation environments based on a 6-DoF UR5e robot arm. These environments are designed to evaluate the agent’s capabilities in object manipulation, sequential generalization, and combinatorial generalization.
• Cube: This task involves pick-and-place manipulation of cube blocks, where the goal is to arrange cubes into designated configurations. Four variants are provided with different numbers of cubes: single, double, triple, and quadruple (1–4 cubes). At test time, the agent must perform moving, stacking, swapping, or permuting operations on the cube blocks.
• Scene: This task is designed to challenge sequential, long-horizon reasoning capabilities. It involves manipulating diverse everyday objects including a cube block, a window, a drawer, and two button locks. The longest evaluation task requires completing up to eight atomic behaviors in sequence.
• Puzzle: This task evaluates combinatorial generalization by requiring the agent to solve the “Lights Out” puzzle with a robot arm. Four difficulty levels are provided: 3x3, 4x4, 4x5, and 4x6, with state spaces containing up to distinct configurations.
Visualization examples of these tasks are shown in Figure 9. For each manipulation environment, OGBench provides two types of datasets with different collection policies:
• Play datasets (play): Collected by non-Markovian expert policies with temporally correlated noise, following the “play data” paradigm (Lynch et al., 2020). This results in smoother, more realistic trajectories that pose additional challenges for standard RL algorithms.
• Noisy datasets (noisy): Collected by Markovian expert policies with uncorrelated Gaussian noise. These datasets serve as controlled baselines for ablation studies, allowing researchers to isolate the effects of non-Markovian data collection.
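To give a feel for the difference between the two noise types, the sketch below contrasts a temporally correlated process with independent Gaussian noise. The AR(1) process is only a stand-in chosen for illustration; the actual noise process used by OGBench is not specified in this text, and the names `ar1_noise`, `iid_noise`, and `lag1_autocorr` are hypothetical.

```python
import numpy as np

def ar1_noise(n, rho=0.9, sigma=1.0, seed=0):
    """Temporally correlated noise: x[t] = rho * x[t-1] + sqrt(1 - rho^2) * eps[t]."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    x = np.zeros(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho ** 2) * eps[t]
    return x

def iid_noise(n, sigma=1.0, seed=0):
    """Uncorrelated Gaussian noise."""
    return np.random.default_rng(seed).normal(0.0, sigma, size=n)

def lag1_autocorr(x):
    """Correlation between consecutive samples."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

corr_play = lag1_autocorr(ar1_noise(5000))   # strongly correlated across time
corr_noisy = lag1_autocorr(iid_noise(5000))  # close to zero
```

Because correlated noise carries information from previous steps, an action perturbed by it cannot be explained by the current observation alone, which is what makes the resulting behavior policy non-Markovian.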
In the experiments comparing with related sequence modeling approaches, we adopt the maze navigation tasks from D4RL (Fu et al., 2020), which provide challenging benchmarks for evaluating offline RL algorithms on undirected, multitask data with sparse rewards.
• Maze2D: This domain is a navigation task requiring a 2D point-mass agent to reach a fixed goal location. Three maze layouts are provided with increasing complexity: umaze, medium, and large. The tasks are designed to test the ability of offline RL algorithms to stitch together previously collected sub-trajectories to find the shortest path to the evaluation goal.
• AntMaze-v2: This domain replaces the simple 2D ball from Maze2D with a more complex 8-DoF quadrupedal “Ant” robot, introducing morphological complexity that mimics real-world robotic navigation tasks. The same three maze layouts (umaze, medium, large) are used, with a sparse 0-1 reward that is activated only upon reaching the goal. Three dataset variants are provided: standard goal-reaching from fixed start locations, “diverse” datasets with random start and goal locations, and “play” datasets with hand-picked navigation waypoints.
Visualization examples are shown in Figure 10. A critical characteristic of both Maze2D and AntMaze-v2 datasets is that they are collected by non-Markovian policies. The data generation process employs a hierarchical controller: a high-level planner generates sequences of waypoints, which are then followed by a low-level PD controller (for Maze2D) or a trained goal-reaching policy (for AntMaze-v2). Because these controllers maintain internal states to track visited waypoints and update their targets upon reaching intermediate goals, the resulting behavior policies are inherently non-Markovian. This property introduces additional challenges for offline RL algorithms, as the data cannot be accurately modeled by assuming a Markovian behavior policy, potentially causing bias in methods that rely on such assumptions (Fu et al., 2020).
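A toy example helps show why a waypoint-tracking controller is non-Markovian. The `WaypointController` class below is a hypothetical 1-D simplification, not the D4RL data-collection code: it keeps an internal waypoint index, so the same position can produce different actions depending on which waypoints were already visited.

```python
class WaypointController:
    """Toy 1-D controller that tracks which waypoint it is heading to."""

    def __init__(self, waypoints, tol=0.5):
        self.waypoints = list(waypoints)
        self.idx = 0  # internal memory: index of the current target
        self.tol = tol

    def act(self, pos):
        # Advance to the next waypoint once the current one is reached.
        reached = abs(pos - self.waypoints[self.idx]) <= self.tol
        if reached and self.idx < len(self.waypoints) - 1:
            self.idx += 1
        target = self.waypoints[self.idx]
        return 1.0 if target > pos else -1.0

ctrl = WaypointController(waypoints=[4.0, 0.0])
pos, visited = 2.0, []
for _ in range(6):
    a = ctrl.act(pos)
    visited.append((pos, a))
    pos += a

# The position 3.0 is visited twice, once moving right and once moving left.
actions_at_3 = {a for p, a in visited if p == 3.0}
```

A Markovian policy conditioned only on the position could not reproduce both actions at 3.0, whereas a model that sees the preceding history can.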



[Figure 10 panel titles: Umaze, Medium, Large]
F.2 Implementation Details
We ran all our experiments on NVIDIA RTX 3090 GPUs with 24GB of memory within an internal cluster. We use the default configurations in Park et al. (2025a), with some values modified. In pixel-based environments, following Park et al. (2025a), we employ an IMPALA-style encoder to transform images into state tokens. The architecture and training process of the Normalizing Flows are identical to those described in Ghugare & Eysenbach (2025).
| Environment | NF Train (ms) | Actor (ms) | Infer-Q (ms) | Infer-A (ms) | NF Ratio |
|---|---|---|---|---|---|
| cube-single-play-v0 | |||||
| cube-double-play-v0 | |||||
| cube-triple-play-v0 | |||||
| cube-quadruple-play-v0 | |||||
| cube-single-noisy-v0 | |||||
| cube-double-noisy-v0 | |||||
| cube-triple-noisy-v0 | |||||
| cube-quadruple-noisy-v0 | |||||
| scene-play-v0 | |||||
| scene-noisy-v0 | |||||
| puzzle-3x3-play-v0 | |||||
| puzzle-4x4-play-v0 | |||||
| puzzle-4x5-play-v0 | |||||
| puzzle-4x6-play-v0 | |||||
| puzzle-3x3-noisy-v0 | |||||
| puzzle-4x4-noisy-v0 | |||||
| puzzle-4x5-noisy-v0 | |||||
| puzzle-4x6-noisy-v0 | |||||
| Average |
Our QHyer implementation draws inspiration from LSDT (Wang et al., 2025) and Decision Mamba (Ota, 2024). The state tokens, goal tokens, Q-function tokens, and action tokens are first processed by separate linear layers. These tokens are then fed into the decoder layer to obtain the embedding. Here, the decoder layer is a lightweight implementation from Reinformer (Zhuang et al., 2024). The context length for the decoder layer is denoted as . We use the AdamW optimizer (Loshchilov, 2017) to optimize the total loss, consistent with the original papers. The hyperparameter of loss is denoted as .
F.3 Hyperparameter Settings
Table 6 summarizes the hyperparameters shared across all experiments. The Hybrid Attention-Mamba architecture uses learnable mixing weights between attention and Mamba branches, with a total hidden dimension of . The Normalizing Flow architecture follows Ghugare & Eysenbach (2025). The expectile regression parameter is set according to our theoretical guidance (Theorem 3.1).
| Hyperparameter | OGBench (State) | OGBench (Pixel) | D4RL |
|---|---|---|---|
| Training steps | 1M | 500K | 100K |
| Batch size | 1024 | 512 | 256 |
| Optimizer | AdamW | AdamW | AdamW |
| Weight decay | 0.0 | 0.0 | 1e-4 |
| Gradient clipping | 1.0 | 1.0 | 0.25 |
| NF noise std | 0.05 | 0.05 | – |
| Encoder hidden dim | 1024 | 1024 | – |
| NF representation size | 64–128 | 64–128 | – |
| BC weight | 1.0 | 1.0 | 1.0 |
| Q weight | 1.0 | 1.0 | 1.0 |
| Expectile | 0.95–0.99 | 0.95–0.99 | 0.90–0.99 |
| State-goal concatenation | True | False | True |
| Image size | – | 64×64 | – |
| Image encoder | – | IMPALA-small | – |
| Warmup steps | – | – | 10000 |
| LR schedule | – | – | Cosine |
Architecture notes. For both OGBench and D4RL experiments, the Hybrid Attention-Mamba backbone uses learnable mixing weights that are automatically optimized during training, eliminating the need for manual tuning of attention-to-Mamba ratios. The total hidden dimension (denoted as h_dim in OGBench and embed_dim in D4RL) represents the combined capacity of both branches, with the proportion learned end-to-end via gradient descent.
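One simple way to realize a learnable mixing weight between two branches is a sigmoid-gated convex combination, sketched below. This is an assumption made for illustration: the actual gating of Section 3.2 is not reproduced in this appendix, and `gated_mix` and `gate_logit` are hypothetical names.

```python
import numpy as np

def gated_mix(attn_out, mamba_out, gate_logit):
    """Convex combination of two branch outputs with a learnable scalar gate."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the weight in (0, 1)
    return gate * attn_out + (1.0 - gate) * mamba_out

attn_out = np.ones((2, 4))    # stand-in for the attention branch output
mamba_out = np.zeros((2, 4))  # stand-in for the Mamba branch output
mixed = gated_mix(attn_out, mamba_out, gate_logit=0.0)  # equal weighting at logit 0
```

Since the gate is differentiable, the attention-to-Mamba proportion can be trained jointly with the other parameters instead of being set by hand.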
Table 7 presents the task-specific hyperparameters for OGBench state-based manipulation environments. Following our theoretical analysis (Theorem 3.1), we set the expectile to 0.99 for play datasets (medium Q-value coverage) and 0.95 for noisy datasets (higher coverage due to exploration noise).
| Environment | | h_dim | | | LR | Dropout | Expectile | NF Blocks | NF Channels |
|---|---|---|---|---|---|---|---|---|---|
| cube-single-play-v0 | 20 | 256 | 4 | 4 | 3e-4 | 0.1 | 0.99 | 6 | 256 |
| cube-single-noisy-v0 | 20 | 256 | 4 | 4 | 3e-4 | 0.1 | 0.95 | 6 | 256 |
| cube-double-play-v0 | 25 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.99 | 8 | 256 |
| cube-double-noisy-v0 | 25 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.95 | 8 | 256 |
| cube-triple-play-v0 | 30 | 512 | 6 | 8 | 2e-4 | 0.15 | 0.99 | 10 | 384 |
| cube-triple-noisy-v0 | 30 | 512 | 6 | 8 | 2e-4 | 0.15 | 0.95 | 10 | 384 |
| cube-quadruple-play-v0 | 35 | 640 | 6 | 8 | 1e-4 | 0.2 | 0.99 | 12 | 512 |
| cube-quadruple-noisy-v0 | 35 | 640 | 6 | 8 | 1e-4 | 0.2 | 0.95 | 12 | 512 |
| scene-play-v0 | 30 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.99 | 8 | 384 |
| scene-noisy-v0 | 30 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.95 | 8 | 384 |
| puzzle-3x3-play-v0 | 25 | 512 | 6 | 8 | 3e-4 | 0.1 | 0.99 | 8 | 384 |
| puzzle-3x3-noisy-v0 | 25 | 512 | 6 | 8 | 3e-4 | 0.1 | 0.95 | 8 | 384 |
| puzzle-4x4-play-v0 | 30 | 640 | 6 | 8 | 2e-4 | 0.15 | 0.99 | 10 | 384 |
| puzzle-4x4-noisy-v0 | 30 | 640 | 6 | 8 | 2e-4 | 0.15 | 0.95 | 10 | 384 |
| puzzle-4x5-play-v0 | 35 | 768 | 6 | 8 | 1e-4 | 0.2 | 0.99 | 10 | 512 |
| puzzle-4x5-noisy-v0 | 35 | 768 | 6 | 8 | 1e-4 | 0.2 | 0.95 | 10 | 512 |
| puzzle-4x6-play-v0 | 40 | 768 | 6 | 8 | 1e-4 | 0.2 | 0.99 | 10 | 512 |
| puzzle-4x6-noisy-v0 | 40 | 768 | 6 | 8 | 1e-4 | 0.2 | 0.95 | 10 | 512 |
Table 8 presents the hyperparameters for pixel-based (visual) manipulation tasks. Compared to state-based tasks, pixel-based tasks use a smaller batch size (512 vs. 1024) due to memory constraints and shorter training (500K steps). Goals are represented as images rather than concatenated state vectors.
| Environment | | h_dim | | | LR | Dropout | Expectile | NF Blocks | NF Channels |
|---|---|---|---|---|---|---|---|---|---|
| visual-cube-single-play-v0 | 15 | 256 | 4 | 4 | 3e-4 | 0.1 | 0.99 | 6 | 256 |
| visual-cube-double-play-v0 | 20 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.99 | 8 | 256 |
| visual-cube-triple-play-v0 | 25 | 512 | 6 | 8 | 2e-4 | 0.15 | 0.99 | 10 | 384 |
| visual-scene-play-v0 | 25 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.99 | 8 | 384 |
| visual-scene-noisy-v0 | 25 | 384 | 5 | 6 | 3e-4 | 0.1 | 0.95 | 8 | 384 |
For pixel-based tasks, the NF uses a DrQ-v2 style CNN (Yarats et al., 2022) to encode images into 256-dim features, which are concatenated with actions and passed through a 4-layer MLP to produce the state-action representation. The NF models goal-reaching probability in the low-dimensional coordinate space (e.g., object positions), extracted from simulator state. The LSDM actor uses an IMPALA-style encoder (Espeholt et al., 2018) to encode both observation and goal images into 256-dim vectors.
Table 9 presents the hyperparameters for D4RL maze tasks. We use a unified Transformer architecture with a hidden dimension of 128 and 3 blocks. Unlike OGBench, D4RL experiments use a cosine learning rate schedule with 10K warmup steps.
| Environment | | LR | Expectile | embed_dim | Blocks |
|---|---|---|---|---|---|
| antmaze-umaze-v2 | 2 | 2e-4 | 0.90 | 128 | 3 |
| antmaze-umaze-diverse-v2 | 2 | 2e-4 | 0.90 | 128 | 3 |
| antmaze-medium-play-v2 | 3 | 2e-4 | 0.99 | 128 | 3 |
| antmaze-medium-diverse-v2 | 3 | 2e-4 | 0.99 | 128 | 3 |
| antmaze-large-play-v2 | 3 | 4e-4 | 0.90 | 128 | 3 |
| antmaze-large-diverse-v2 | 3 | 4e-4 | 0.90 | 128 | 3 |
| maze2d-umaze-v1 | 10 | 2e-4 | 0.90 | 128 | 3 |
| maze2d-medium-v1 | 10 | 2e-4 | 0.90 | 128 | 3 |
D4RL-specific settings. For D4RL maze tasks, we concatenate the 2D goal position to the state (--goalconcate), increasing the state dimension by 2. Training uses a combined learning rate schedule: linear warmup for 10K steps followed by cosine decay. We use a smaller batch size (256) and fewer training steps (100K) than for OGBench, as D4RL maze tasks are less complex. The expectile parameter is set to 0.90 for umaze and large tasks, and 0.99 for medium tasks, based on empirical tuning.
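The combined schedule can be sketched as a single function of the training step. The defaults mirror the D4RL settings reported above (10K warmup steps, 100K total steps, base rate 2e-4); starting the warmup from zero and decaying to `min_lr = 0` are assumptions, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, base_lr=2e-4, warmup_steps=10_000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr after the last step
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in (0, 5_000, 10_000, 55_000, 100_000)]
```

The rate peaks at the end of warmup and reaches half of the base rate midway through the cosine phase.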
For computational efficiency, we extract only task-relevant goal coordinates when training the NFs-based Q-value estimator in Equation 5. Given a full goal state , we use where the index range is environment-specific. Table 10 summarizes the configurations:
| Task Category | Goal Dim | Description |
|---|---|---|
| cube-single-* / visual-cube-single-* | 3 | Object (x, y, z) |
| cube-double-* / visual-cube-double-* | 6 | Two objects |
| cube-triple-* / visual-cube-triple-* | 9 | Three objects |
| cube-quadruple-* | 12 | Four objects |
| scene-* / visual-scene-* | 13 | Scene objects |
| puzzle-3x3-* | 9 | 3×3 tiles |
| puzzle-4x4-* | 16 | 4×4 tiles |
| puzzle-4x5-* | 20 | 4×5 tiles |
| puzzle-4x6-* | 24 | 4×6 tiles |
| antmaze-*, maze2d-* | 2 | Agent (x, y) |
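The extraction amounts to slicing the goal vector along its last axis. In the sketch below, only the resulting dimensions (3 for cube-single, 2 for the maze tasks) come from the table above; the start offsets in `GOAL_SLICES` are hypothetical, since the real index ranges depend on each environment's state layout.

```python
import numpy as np

# Hypothetical (start, stop) index ranges; the real offsets are environment-specific
# and are not given in the text.
GOAL_SLICES = {
    "cube-single": (0, 3),  # object (x, y, z)
    "antmaze": (0, 2),      # agent (x, y)
}

def extract_goal(full_goal, task):
    """Keep only the task-relevant coordinates of a goal state (or a batch of them)."""
    start, stop = GOAL_SLICES[task]
    return np.asarray(full_goal)[..., start:stop]

g_full = np.arange(10.0)
g_task = extract_goal(g_full, "cube-single")
```

Using `...` in the index lets the same helper handle a single goal vector or a batch of goals.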
For goal sampling in OGBench, we use , for play datasets and , for noisy datasets.
Appendix G Additional Results
This section presents supplementary experiments and analyses for QHyer, including: (1) detailed discussion of the state-goal tokenization strategy and its role in enabling trajectory stitching, (2) ablation studies on regression functions, (3) qualitative visualization of trajectory stitching capabilities, (4) validation of Normalizing Flows for goal-reaching probability estimation, and (5) empirical verification of expectile regression for capturing maximum Q-values. Due to space constraints, these additional results are not included in the main body of this paper. The details are provided below.
G.1 Detailed Discussion of State-Goal Tokenization Strategy

This section details our state-goal tokenization strategy illustrated in Figure 11 and its role in enabling trajectory stitching. The key insight is that concatenating state and goal into a unified token allows the Transformer’s self-attention mechanism to directly model cross-dependencies between current state features and goal specifications within each token position.
Panel A shows the offline dataset structure as a graph, where multiple trajectories (indicated by different colors) traverse overlapping state regions while pursuing different goals. This shared structure creates opportunities for trajectory stitching, which combines successful segments from different trajectories.
Panel B contrasts DT’s standard tokenization with QHyer’s approach. In vanilla DT, states and goals may be processed separately or with weak coupling. QHyer instead concatenates at each timestep, ensuring that goal information is directly available when computing attention over state features. This design maintains the sequence length at (Q-value, state-goal, action tokens) rather than increasing to with separate goal tokens, avoiding quadratic attention overhead.
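The effect on sequence length can be checked with a small sketch that builds the token list both ways. The helper `build_tokens` and the toy dimensions are hypothetical; the only property taken from the text is that concatenation yields three tokens per timestep (Q-value, state-goal, action), while a separate goal token adds a fourth.

```python
import numpy as np

def build_tokens(q, states, goals, actions, concat=True):
    """Return the flat token list for a context window; each token is a 1-D array."""
    tokens = []
    for t in range(len(states)):
        tokens.append(np.atleast_1d(q[t]))
        if concat:
            tokens.append(np.concatenate([states[t], goals[t]]))  # one state-goal token
        else:
            tokens.append(states[t])
            tokens.append(goals[t])  # separate goal token
        tokens.append(actions[t])
    return tokens

K = 5  # toy context length
q = np.zeros(K)
states, goals, actions = np.zeros((K, 4)), np.zeros((K, 2)), np.zeros((K, 3))
concat_tokens = build_tokens(q, states, goals, actions, concat=True)
separate_tokens = build_tokens(q, states, goals, actions, concat=False)
```

Here the concatenated layout produces 15 tokens and the separate layout 20, so attention cost, which grows quadratically with sequence length, is noticeably lower with concatenation.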
Panel C demonstrates how this enables goal stitching. Consider two trajectories targeting goals and respectively. Neither trajectory alone reaches the optimal path to goal . However, by conditioning on state-goal concatenated tokens with NFs-based Q-value signals, QHyer identifies high-value segments from both trajectories and stitches them together, discovering an optimal path (shown in green) that was not present in any single demonstration.
We empirically validate the effectiveness of state-goal concatenation through ablation studies comparing three tokenization strategies: No Goal (state-only input), State-Goal Separate (goal as additional token), and State-Goal Concat (our approach). Figure 12 shows a consistent ordering across all environments: No Goal < Separate < Concat.
The performance gap between No Goal and goal-conditioned variants (– absolute improvement) confirms that goal information is essential for learning meaningful goal-reaching behaviors. Without explicit goal conditioning, the model degenerates to unconditional behavior cloning, unable to distinguish between trajectories targeting different goals.
Among goal-conditioned strategies, concatenation outperforms separation by –. This improvement stems from two factors: (1) Direct cross-dependency modeling: Concatenation enables self-attention to directly learn which state features are relevant for specific goals within each token, whereas separation requires the model to establish state-goal relationships across tokens through multiple attention layers. (2) Stronger conditioning signal: Separate tokenization dilutes the goal signal as it propagates through attention layers, weakening goal-awareness at decision time. Concatenation preserves the full goal information at every position where action prediction occurs.
These results validate our design choice and explain why QHyer achieves effective trajectory stitching: the concatenated state-goal representation provides the necessary goal-aware context for identifying and combining high-value segments from different trajectories.
G.2 Effect of Regression Functions on Learning Stability
We compare MSE, Quantile Loss (L1-based) (Koenker & Hallock, 2001), and Expectile Regression (L2-based) (Newey & Powell, 1987; Kostrikov et al., 2022). Figure 13 shows a consistent ordering: MSE < Quantile < Expectile, with Expectile achieving the best results and the smallest variance.
Why MSE Fails. MSE learns the mean Q-value across all trajectories passing through each state. In Offline GCRL where both successful and failed trajectories share common states, this averaging produces predictions that lie between the maximum and minimum Q-values. Such middle-ground estimates provide no discriminative signal for trajectory stitching because the model cannot distinguish promising paths from dead ends.
Why Quantile Loss Struggles. Quantile regression (Koenker & Hallock, 2001) correctly targets high-value regions via asymmetric weighting. However, the L1 loss creates a non-smooth point at zero error where gradients change direction abruptly (Liu et al., 2025; Jullien et al., 2023). For deep networks with many near-zero predictions, this causes oscillatory training dynamics and high variance across seeds. Recent theoretical work (Liu et al., 2025) shows that while quantile regression can recover in-distribution optimal values in deterministic environments, its L1 loss makes optimization less stable than L2-based alternatives.
Why Expectile Regression Succeeds. Expectile regression (Newey & Powell, 1987) replaces the non-smooth L1 point with a smooth L2 curve, achieving both optimistic targeting and gradient consistency. This smooth gradient landscape is particularly important for non-Markovian learning: inconsistent gradients from quantile loss disrupt the temporal representations learned by attention and Mamba branches, while expectile’s stable gradients allow these components to capture history-dependent patterns effectively. This explains why the Quantile-Expectile gap is largest on Cube-double-play, the environment with the strongest non-Markovian properties. As shown in Theorem 3.1, expectile regression with converges to the in-distribution optimal Q-value, providing theoretical justification for our empirical findings.
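The contrast in gradient behavior near zero error can be verified numerically. The sketch below uses the standard pinball (quantile) and expectile losses as functions of the residual u; the value of `tau` and the function names are chosen for illustration.

```python
import numpy as np

def quantile_grad(u, tau):
    """d/du of the pinball loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return np.where(u < 0, tau - 1.0, tau)

def expectile_grad(u, tau):
    """d/du of the expectile loss |tau - 1[u < 0]| * u**2."""
    return 2.0 * np.where(u < 0, 1.0 - tau, tau) * u

tau = 0.9
u = np.array([-1e-6, 1e-6])  # residuals just left and right of zero

q_grads = quantile_grad(u, tau)
e_grads = expectile_grad(u, tau)
q_jump = q_grads[1] - q_grads[0]  # gradient jumps by 1 across zero
e_jump = e_grads[1] - e_grads[0]  # gradient changes only infinitesimally
```

The pinball gradient jumps from tau - 1 to tau regardless of how small the residual is, whereas the expectile gradient shrinks to zero with the residual, which matches the stability argument above.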
G.3 Trajectory Stitching Visualization

To further illustrate the trajectory stitching capabilities of different methods, we provide a qualitative comparison on the D4RL Antmaze-Medium task. As shown in Figure 15, we visualize the trajectories generated by DT, LSDT, IQL, and QHyer (with Expectile Regression).
The maze environment consists of multiple regions, each represented by a distinct color corresponding to different data collection policies in the offline dataset (Figure 14):
• Cyan: Bottom-left start region
• Purple: Middle corridor
• Yellow: Top-right goal region
• Green: Bottom-right area
• Red: Top-left area
• Black: Out-of-distribution (OOD) states (i.e., passing through walls)
The key challenge is to stitch trajectory segments from different regions to discover optimal paths from start to goal.




Successful trajectory stitching requires the agent to combine trajectory segments from different regions to reach the goal. Our key observations are:
• DT (Chen et al., 2021) fails to reach the goal and instead wanders toward the bottom-right area, demonstrating its inability to stitch trajectories across different data collection policies. This failure stems from DT’s reliance on return-to-go conditioning, which provides no discriminative signal in sparse reward settings where all failed trajectories receive identical RTG values.
• LSDT (Wang et al., 2025) moves in the correct direction but stops in the middle corridor, showing limited stitching capability. Although LSDT improves upon DT by combining attention with Dynamic Convolution for better local pattern extraction, it still relies on RTG conditioning and cannot identify high-value stitching points without explicit value guidance.
• IQL (Kostrikov et al., 2022) successfully reaches the goal through a valid path without OOD states. IQL’s expectile regression-based value learning enables trajectory stitching by identifying high-value actions. However, IQL requires bootstrapping to learn the maximum Q-value, which means it must first learn before learning . This can lead to error accumulation in complex environments.
• QHyer also successfully reaches the goal through a valid path without any OOD states. The trajectory smoothly transitions through the cyan → purple → yellow regions, demonstrating proper trajectory stitching. Compared to IQL, QHyer avoids bootstrapping by using NFs for direct Q-value estimation, and avoids policy projection by using Q-conditioned supervised learning.
G.4 Evaluating the Capability of NFs to Accurately Estimate Goal-reaching Probability
In this section, we validate the accuracy with which NFs (Ghugare & Eysenbach, 2025) estimate the discounted future state distribution by implementing the computation method outlined in Eysenbach et al. (2020) within a tabular setting. Note that here we validate only the accuracy of NFs in estimating the discounted future state distribution; this is unrelated to the actual implementation of the NFs in our QHyer framework.

Specifically, we compute the true discounted future state distribution in a modified GridWorld environment and evaluate the estimation error by comparing the estimate against the true distribution. We also compare the predictions of CVAE (Sohn et al., 2015), C-learning (Eysenbach et al., 2020), and CRL (Eysenbach et al., 2022) with the true future state density. First, we introduce the modified GridWorld environment used in this experiment. This environment is characterized by stochastic dynamics and a continuous state space, such that the true Q-function for the indicator reward is zero. Specifically, the environment has a size of (Figure 16), where the agent observes a noisy version of its current state. More precisely, when the agent is located at position , it observes the state , where . Note that the observation uniquely identifies the agent’s position, so there is no partial observability. Similar to Eysenbach et al. (2020), we analytically compute the exact future state density function by first determining the future state density of the underlying GridWorld, noting that the density is uniform within each cell. We generated a tabular policy by sampling from a Dirichlet(1) distribution, and sampled 100 trajectories of length 100 from this policy for NFs training.



Analytic Future State Distribution
Then, as described in Eysenbach et al. (2020), we can compute the true discounted future state distribution by first constructing the following two matrices:
where denotes the deterministic transition function. The future discounted state distribution is then given by:
The tensor-matrix product is equivalent to einsum('ijk,kh->ijh', , ). We use the forward KL divergence for estimating the error in our estimate, , where is the tensor of predictions:
Following the configuration outlined in Eysenbach et al. (2020), we compare the accuracy of the future discounted state distribution under against C-Learning and Q-learning:
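The tabular computation can be sketched as follows. Since the equations themselves are missing from this text, the closed form (1 - gamma) * T * (I - gamma * P)^(-1) is assumed here as the standard expression for a discounted future state distribution, with T the (state, action, next-state) transition tensor and P the policy-induced state transition matrix; the two-state chain and all names are illustrative.

```python
import numpy as np

def future_state_distribution(T, pi, gamma):
    """T: (S, A, S) transition tensor, pi: (S, A) policy, gamma in [0, 1)."""
    S = T.shape[0]
    P = np.einsum("sa,sak->sk", pi, T)  # policy-induced (S, S) transition matrix
    M = (1.0 - gamma) * np.linalg.inv(np.eye(S) - gamma * P)
    return np.einsum("ijk,kh->ijh", T, M)  # tensor-matrix product from the text

def forward_kl(p, q, eps=1e-12):
    """Mean over (s, a) of KL(p || q) along the last axis."""
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# Two-state, two-action deterministic chain: action 0 stays, action 1 switches.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
pi = np.full((2, 2), 0.5)  # uniform policy
d = future_state_distribution(T, pi, gamma=0.9)
```

Each slice `d[s, a]` is a probability vector over future states, and `forward_kl(d, d_hat)` gives the error of an estimate `d_hat` against this ground truth.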
On-policy Setting
Figure 17 presents the results of our evaluation comparing CVAE, C-learning, CRL, and NFs on the modified "continuous GridWorld" environment described above under the on-policy setting. This experiment asks whether NFs solve the future state density estimation problem. In this scenario, CVAE shows higher error than C-learning, while NFs achieve the best performance, which highlights the accuracy of NFs in estimating the discounted state occupancy measure.
G.5 Can Expectile Regression Effectively Capture Maximum Q-values in Practice?
We empirically validate that expectile regression converges to in-distribution maximum Q-values in a controlled GridWorld setting, supporting our theoretical analysis in Theorem 3.1.
Metrics. We use the coefficient of determination (R²), which measures explained variance, and the Mean Absolute Error (MAE), which quantifies prediction deviation:
| (59) |
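For reference, both metrics can be computed as in the sketch below, which uses their standard textbook definitions; the symbols of Equation (59) are not reproduced in this text, so the function names and argument order are assumptions.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mae(y_true, y_pred):
    """Mean absolute error between predictions and ground truth."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - y_pred)))

q_true = np.array([1.0, 2.0, 3.0])
perfect = r2_score(q_true, q_true)            # exact predictions
mean_only = r2_score(q_true, np.full(3, 2.0)) # predicting the mean everywhere
```

A predictor that outputs the mean of the targets scores zero on the coefficient of determination, which is the reference point for a regressor that learns the mean rather than the maximum.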
Results. As shown in Figures 18 and 19, the results strongly support our theoretical analysis:
1. Standard MSE () learns the mean rather than maximum, yielding ;
2. Performance improves monotonically with : increases from to as goes from to ;
3. At , predicted values closely match ground-truth with and MAE.
Implications. These results validate that expectile regression effectively captures maximum in-distribution Q-values, which is essential for QHyer’s trajectory stitching capability. The convergence aligns with Theorem 3.1: as , the approximation error and . Specifically, our theoretical bound predicts:
| (60) |
which decreases as increases, consistent with the monotonic improvement observed in Figure 19.
However, excessively large (e.g., ) may cause overfitting to outliers due to focusing on too few high-value samples, leading to increased variance. In practice, balances accuracy and training stability, as validated in our ablation studies (Section 4.3).




G.6 Comparison with Recent Offline RL Methods Adapted to GCRL
To strengthen baseline coverage, we adapt four recent offline RL methods to offline GCRL by attaching HER goal relabeling and following the OGBench evaluation protocol. These are Transitive RL (Park et al., 2026), SHARSA (Park et al., 2025b), DEAS (Kim et al., 2026), and QCFQL (Li et al., 2025). Results are averaged over 8 seeds at 1M training steps.
| Environment | GC-TrL | GC-SHARSA | GC-DEAS | GC-QCFQL | QHyer |
|---|---|---|---|---|---|
| cube-single-play | |||||
| cube-double-play | |||||
| cube-triple-play | |||||
| cube-quadruple-play | |||||
| scene-play | |||||
| puzzle-3x3-play | |||||
| puzzle-4x4-play | |||||
| puzzle-4x5-play | |||||
| puzzle-4x6-play | |||||
| Average | 10.0 | 28.7 | 23.9 | 12.8 | 41.6 |
Interpretation. Three reasons explain the gap. First, none of these methods was originally validated under offline GCRL with sparse binary rewards at standard OGBench scale. TRL and SHARSA rely on oracle goals or the large-data regime, DEAS targets semi-sparse single-task settings, and QCFQL’s strongest numbers come from offline-to-online training. When forced into the pure offline, sparse-binary, multi-goal regime, their value targets and exploration mechanisms become mis-specified. Second, there are structural mismatches. TRL’s triangle inequality on temporal distance holds for continuous navigation but breaks under manipulation’s discrete contact-mode transitions, which is why TRL drops from on cube-single to on cube-double. Third, SHARSA must predict subgoals in the full multi-object pose space, which is far harder than predicting 2D navigation waypoints, and DEAS and QCFQL execute fixed-length open-loop action chunks, so early errors compound and the fixed chunk length cannot align with variable-duration manipulation primitives. SHARSA and DEAS nonetheless remain nontrivially competitive on the hardest long-horizon tasks, suggesting that action chunking and temporal abstraction are complementary to our contributions.
G.7 Comparison with Graph-based Stitching (GAS)
We also compare against GAS (Baek et al., 2025), a graph-based offline GCRL stitching method.
| Environment | GAS | QHyer |
|---|---|---|
| antmaze-giant-stitch (navigation) | ||
| visual-scene-play (manipulation) | ||
Interpretation. GAS and QHyer occupy complementary regimes, and the reason is structural rather than a matter of tuning. On antmaze-giant-stitch, GAS replaces high-level policy learning with Dijkstra shortest-path search over a precomputed temporal-distance graph, which directly exploits the metric structure of continuous navigation. QHyer’s flat sequence model cannot match this advantage on pure navigation. On visual-scene-play, the tables turn. GAS’s graph construction is bottlenecked by its ability to learn high-dimensional representations, whereas QHyer’s end-to-end sequence modeling with content-adaptive memory benefits directly from pixel inputs, producing roughly a -point improvement and nearly doubling the previous OGBench best. We therefore view the two methods as complementary tools rather than directly competing baselines.
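To make the graph-search idea concrete, the sketch below runs Dijkstra's algorithm over a tiny weighted graph whose edge weights stand in for learned temporal distances. It is a generic illustration and not GAS's implementation: the graph, node names, and weights are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Shortest-path distances from source over non-negative edge weights."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical graph: nodes are states, weights are temporal distances in steps.
graph: Graph = {
    "start": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("goal", 6.0)],
    "b": [("goal", 2.0)],
    "goal": [],
}

distances = dijkstra(graph, "start")
```

Here the shortest route to `goal` passes through `a` and then `b`, even though no single edge covers that route, which mirrors how a search over a precomputed graph can stitch short segments into a long path.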