MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
Abstract
We present MambaNetBurst, a compact tokenizer-free byte-level sequence classifier for network burst classification based on a Mamba-2 backbone. In contrast to most recent strong traffic-classification and intrusion-detection approaches, our method operates directly on raw packet bytes, avoids tokenization, patching, and heavily engineered multimodal representations, and does not require any self-supervised pre-training stage. Given a packet flow, we form a fixed-length burst from the first few packets, embed the resulting byte sequence, append a learnable CLS token, and process it with a stack of residual pre-normalized Mamba-2 blocks for end-to-end supervised classification. Across six public benchmarks spanning encrypted mobile app identification, VPN/Tor traffic classification, malware traffic classification, and IoT attack traffic, MambaNetBurst achieves consistently strong results and is competitive with, or outperforms, substantially heavier and often pre-trained baselines. Our ablation study shows that preserving byte-level temporal resolution is critical, that early downsampling through striding is consistently harmful, and that moderate state sizes are sufficient for robust generalization. We further show that Mamba-2, despite its more constrained transition structure relative to Mamba-1, remains highly effective for packet-byte modeling while providing clear efficiency advantages, particularly in training speed. Overall, our results demonstrate that direct, undiluted byte-to-classification learning with compact selective state space models is a practical, effective, and novel direction for efficient, deployable traffic analysis that bypasses the complexity of pre-training pipelines and remains attractive even relative to highly optimized linear attention architectures.
Index Terms:
Byte-level traffic classification, tokenizer-free learning, pretraining-free, Mamba-2, state space models, encrypted traffic analysis, network intrusion detection, burst classification, state space duality, Mamba vs Mamba-2, state transition matrix

| Method | Header | Payload | PT¶ | MM* | Resolution | Pretraining (training) | Pretrain | Attention | Mamba |
|---|---|---|---|---|---|---|---|---|---|
| PERT [9] | ✗ | ✓ | ✗ | ✗ | token | MLM (cross-entropy) | ✓ | ✓ | ✗ |
| ET-BERT [16] | ✗ | ✓ | ✗ | ✗ | token | MLM + NSP (cross-entropy) | ✓ | ✓ | ✗ |
| YaTC [46] | ✓ | ✓ | ✓ | ✗ | patch | MAE reconstruction (MSE) | ✓ | ✓ | ✗ |
| FlowMAE [8] | ✓ | ✓ | ✗ | ✗ | patch | MAE reconstruction (MSE) | ✓ | ✓ | ✗ |
| NetGPT [23] | ✓ | ✓ | ✗ | ✗ | token | Next-token (causal LM, cross-entropy) | ✓ | ✓ | ✗ |
| Lens [34] | ✓ | ✓ | ✗ | ✗ | token | Span corruption / seq2seq (cross-entropy) | ✓ | ✓ | ✗ |
| NetMamba [35] | ✓ | ✓ | ✓ | ✗ | stride | MAE (cross-entropy) | ✓ | ✗ | 1 |
| NetMamba+ [36] | ✓ | ✓ | ✓ | ✓ | stride | MAE (Label Distribution-Aware Margin) | ✓ | Partial | 1 |
| MambaNetBurst | ✓ | ✓ | ✓ | ✗ | byte [0…255] | No-pretraining (cross-entropy) | ✗ | ✗ | 2 |
I Introduction
Network traffic classification, which aims to identify the application or service associated with a traffic flow, and network intrusion detection systems (NIDS), which aim to detect malicious or anomalous behavior, are increasingly critical research domains in cybersecurity. Recent learning-based approaches for traffic analysis have achieved strong results, but many of the most competitive methods rely on large-scale training data, domain-specific pre-training objectives [17, 35, 36], and sometimes auxiliary metadata [36]. In the networking and NIDS domains, obtaining such data is costly, time-consuming, and often incomplete or noisy. As a result, many state-of-the-art methods resort to a two-stage strategy in which a model is first pre-trained on traffic-specific self-supervised objectives and then fine-tuned for downstream classification tasks, as in ET-BERT, YaTC, and TrafficFormer [17, 47, 48].
Despite the inherently sequential nature of network communications (packets and flows), most existing ML pipelines apply some form of early summarization to the raw traffic representation, as seen in Table I. This is largely motivated by the challenge of processing long input sequences (large input 'contexts'), especially with popular Transformer-based architectures, whose attention mechanism scales quadratically with sequence length, so compute and memory costs grow rapidly as the input lengthens. Although several efficient attention variants with linear complexity in the attention module have been proposed [11], at scale the computational cost of a Transformer is dominated by the large feed-forward layers that must also be applied to every input position [25]. Consequently, tokenization, patching, or striding has generally been treated as necessary for efficient packet sequence classification, both to cope with long flow sequences [25] and to amortize the cost of operating directly at native (byte) resolution [43].
However, directly modeling raw bytes offers several advantages. Similar to byte-level modeling in natural language processing, byte-level packet modeling avoids fixed vocabularies, supports arbitrary formats, reduces or eliminates preprocessing complexity, and preserves fine-grained structural information that may otherwise be lost during early aggregation [43]. This is particularly appealing in network traffic analysis, where discriminative cues may occur at very small scales, including protocol-specific fields, payload signatures, and short local patterns.
Because byte-level inputs produce long one-dimensional sequences, architectures with linear-time sequence modeling are especially attractive. State space models (SSMs), and Mamba in particular, have emerged as efficient alternatives to Transformers for long-sequence learning, combining linear-time complexity with strong long-range modeling capability [7]. Mamba-2 further improves hardware efficiency by reformulating structured state-space computation into GEMM-friendly kernels via structured state space duality (SSD), resulting in faster training and inference in practice [4]. In parallel, tokenizer-free byte modeling has also shown promise outside networking; ByT5 [43] demonstrated that byte-level transformer models are competitive with their token-level counterparts but significantly more robust to noise, and MambaByte demonstrated that selective SSMs can model raw byte sequences competitively without subword tokenization [33].
Applying byte-resolution SSMs to network traffic classification is not a direct transfer from language or vision. Critical implementation-specific decisions must be made regarding input ingestion, sequence modeling, and how classification outputs are obtained from the recurrent state or final representations. Network packet bytes form long sequential signals with multi-scale structure, including intra-packet byte patterns, packet boundaries, and higher-level exchanges across flows. Although many existing pipelines mitigate excessive sequence length through early aggregation using tokens, patches, or striding (due to performance and resource constraints), this may dilute fine-grained discriminative byte-level cues essential for accurate threat detection, malware detection, or intrusion analysis. In addition, while Mamba-2 offers clear GPU-efficiency benefits, it imposes structural constraints on its state transition matrix, raising the question of whether the constrained model is expressive enough for network byte sequences. To the best of our knowledge, no prior work explores Mamba and Mamba-2 in the context of direct byte-based packet/flow classification in the networking and NIDS domains.
In this work, we ask whether direct modeling of packet bytes can eliminate the need for expensive pre-training, heavily engineered representations, or early downsampling schemes. We introduce MambaNetBurst, a compact burst-level classifier that operates directly on raw packet bytes using a Mamba-2 backbone and a lightweight classification head. Our results show that direct byte-to-classification learning is both feasible and highly effective across a range of encrypted traffic, VPN/Tor, IoT, and malware classification benchmarks. In summary, our contributions are as follows:
- We propose MambaNetBurst, the first tokenizer-free, pre-training-free, pure byte-level network packet classifier using a compact Mamba-2 backbone.
- We demonstrate that Mamba-2’s constrained transition matrix (scalar identity) is not only sufficient but acts as a beneficial regularizer for packet-byte sequences, while delivering clear GPU efficiency gains.
- We show that discriminative traffic representations can be learned directly from packet bytes in a fully supervised setting, eliminating the need for costly pre-training pipelines dominant in the literature.
- We evaluate on six public benchmarks (encrypted mobile apps, VPN/Tor, malware, IoT attacks) where MambaNetBurst is competitive with or superior to heavier pre-trained baselines.
- We provide extensive ablations over embedding design, positional encoding, striding, depth, state size, and Mamba-1 vs. Mamba-2, together with detailed forward/backward/eval timing and memory profiles. We release code and models for reproducibility.
II Related Works
Network traffic classification, especially for encrypted traffic analysis and intrusion detection, has evolved from handcrafted statistical features to deep learning. Early approaches applied CNNs, GNNs, and RNNs to packet payloads or flow sequences [19, 28]. Recent work has adopted pre-trained Transformers for packet-byte and flow-level modeling. Methods such as ET-BERT [16] and TrafficFormer learn general traffic representations through self-supervised objectives before being fine-tuned on downstream benchmarks, achieving strong performance across datasets such as USTC-TFC2016 [39] and CICIoT2022 [2]. Large Transformer architectures such as GPT-4o and LLaVA have also been used for zero-shot NIDS [21].
A central design choice in traffic representation is the degree of preprocessing applied before the sequence model [22]. Pre-trained embeddings have demonstrated strong downstream performance [46, 16, 48, 26], but these methods typically depend on carefully engineered input pipelines and pre-training objectives.
Byte-level modeling, mapping from raw data to predictions without any intermediate tokenization or early aggregation, offers a compelling alternative [33]. By operating on a fixed 256-symbol vocabulary, byte-level modeling avoids subword tokenizer biases. It supports arbitrary data modalities (text, binaries, packets) naturally and directly. ByT5 [43] and MambaByte [33] demonstrate the viability of 256-vocabulary modeling on raw bytes, reporting competitive bits-per-byte perplexity and robustness in long-context language modeling. Raw byte encodings preserve packet structure more directly by encoding headers [37, 45, 41, 24, 30] as well as payload [10] as high-dimensional sequences. Byte based packet classification using CNNs, with no traditional preprocessing, achieves competitive results [37]. These results suggest that direct byte modeling may also be well suited to packet and flow classification, where inputs are naturally binary and fine-grained local structure is often important.
However, raw packet streams also contain substantial syntactic repetition, such as padding, predictable header fields, and session-specific artifacts. Furthermore, variable packet lengths (IP packets can range from 20 to 65,535 bytes) necessitate additional batch-level padding in ML pipelines. This increases the computational cost and the risk of learning incidental correlations. To address this, many recent approaches rely on tokens, patches, strides, or multimodal representations that combine payload bytes [17, 35, 36] and/or packet-level statistics or metadata [36].
For long-sequence modeling, state space models (SSMs) have emerged as efficient alternatives to attention-based architectures. Mamba [7] introduced selective, input-dependent SSM dynamics together with an efficient scan implementation, enabling linear-time sequence processing with strong empirical performance. Mamba-2 [4] refined this design through structured state space duality (SSD), reformulating the core computation into matrix-multiplication-friendly operations that improve hardware efficiency while preserving selective sequence modeling behavior. Relative to Mamba-1, Mamba-2 retains the selective, input-conditioned SSM dynamics, interpretable as a learned temporal filtering mechanism. Its SSD-based reformulation, which maps structured SSMs to attention-like computations, replaces the scan-leaning computation of Mamba-1 with a matmul-leaning one. Block matmuls and chunked operations are more GEMM-like (General Matrix Multiply), leading to substantially more efficient core utilization and training throughput, which is well suited to packet-level byte sequence modeling. The hardware optimization comes at a price: Mamba-2 is more constrained in its core matrix structure, using a 'scalar-times-identity' form for its $A$ matrix, whereas Mamba-1 uses a more flexible diagonal structure.
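The structural difference between a diagonal and a scalar-times-identity transition can be illustrated with a toy linear recurrence $h_t = a_t \odot h_{t-1} + b\,x_t$, $y_t = c^\top h_t$. The sketch below is only an illustration under simplifying assumptions: it is not the official Mamba kernel, the function names are hypothetical, and the decay values are fixed by hand rather than computed from the input as in a selective SSM.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], decays: Sequence[Sequence[float]],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Run h_t = a_t * h_{t-1} + b * x_t (elementwise) and emit y_t = <c, h_t>.

    decays[t] holds one decay factor per state dimension at step t.
    """
    if len(x) != len(decays):
        raise ValueError("need one decay vector per input step")
    n = len(b)
    h = [0.0] * n
    ys = []
    for x_t, a_t in zip(x, decays):
        h = [a_t[i] * h[i] + b[i] * x_t for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


def diagonal_decays(per_state: Sequence[float], steps: int) -> List[List[float]]:
    """Diagonal transition: every state dimension has its own decay factor."""
    return [list(per_state) for _ in range(steps)]


def scalar_identity_decays(scalars: Sequence[float], n_state: int) -> List[List[float]]:
    """Scalar-times-identity transition: one decay per step, shared by all states."""
    return [[a] * n_state for a in scalars]


x = [1.0, 0.0, 0.0]          # unit impulse
b = [1.0, 1.0]
c = [1.0, 1.0]
# Diagonal: the two state dimensions forget at different rates (0.5 and 0.9).
y_diag = ssm_scan(x, diagonal_decays([0.5, 0.9], len(x)), b, c)
# Scalar-times-identity: both state dimensions share the same decay (0.5).
y_scalar = ssm_scan(x, scalar_identity_decays([0.5, 0.5, 0.5], 2), b, c)
```

With a diagonal transition the impulse response mixes two time constants, whereas the scalar-times-identity variant decays with a single shared rate per step; this is the restriction that Mamba-2 trades for matmul-friendly computation.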
Mamba-based traffic classifiers have only recently begun to appear. NetMamba [35], the pioneering work, replaced Transformer components with a unidirectional Mamba backbone while retaining a carefully designed traffic representation, striding, and a self-supervised pre-training stage. NetMamba+ [36], by the same team, further extended this line with multi-modal representations (adding inter-arrival times), FlashAttention, and a label-distribution-aware loss. ET-Mamba [42] proposed a lightweight Mamba model with pre-training and task-specific fine-tuning for encrypted traffic, emphasizing ultra-low parameter count and competitive accuracy on non-VPN datasets. Further variants include graph-augmented (IDS-GraphMamba [1]), frequency-aware (HFE-Traffic), and hybrid approaches.
However, existing Mamba-based traffic models rely on engineered inputs such as flow statistics, multimodal features, graph edges, or packet-level aggregates, rather than directly ingesting raw packet bytes. Self-supervised pre-training, via contrastive learning [12, 13] or masked reconstruction objectives [35, 47], is dominant in traffic classification pipelines. At the same time, prior work on byte modeling has argued that fixed monotonic patching or striding can be counterproductive, since it ignores disparities in local information density [25], may break meaningful structures across patch/stride boundaries, and may group informative (hard) bytes together. Furthermore, despite the possible efficiency gains, no prior work has explored the more constrained Mamba-2 architecture directly on byte-level inputs.
Motivated by these observations, we study whether the more constrained but efficient Mamba-2 can be applied directly to byte-level packet bursts for network traffic classification, taking advantage of the linear complexity to use undiluted byte resolution, thus avoiding the need for any pre-training. As shown in Table I, to the best of our knowledge, MambaNetBurst is the first direct byte-to-classification application of Mamba-2 in the network traffic domain, leveraging the (SSM/SSD) architecture’s native efficiency for long byte sequences while retaining the simplicity of a CLS based classification head.
III Architecture
This section describes MambaNetBurst (Figure 1). The model constructs a fixed-length byte sequence from the first packets of a network flow and feeds this sequence directly into a compact Mamba backbone for supervised classification. Unlike most recent state-of-the-art traffic classifiers, MambaNetBurst does not use any self-supervised pre-training stage.
III-A Byte sequence construction
We consider burst-level flow classification. Let the set of all packets be denoted by

$$\mathcal{P} = \{p_1, p_2, \dots, p_{|\mathcal{P}|}\},$$

where each packet $p_i = (x_i, s_i, t_i)$ for $i = 1, \dots, |\mathcal{P}|$. Here, $x_i$ denotes the 5-tuple consisting of source IP, destination IP, source port, destination port, and protocol; $s_i$ denotes the packet size in bytes, with $s_i \in \mathbb{N}^{+}$; and $t_i$ denotes the transmission timestamp in seconds, with $t_i \in \mathbb{R}_{\ge 0}$ [20].

Let $\mathcal{F}$ denote the set of flows, where each flow is the collection of packets sharing the same 5-tuple. Formally, a unidirectional flow $f \in \mathcal{F}$ with 5-tuple $x_f$ is defined as [40]

$$f = \{\, p_i \in \mathcal{P} \mid x_i = x_f \,\} \tag{1}$$

Given a flow $f$, our goal is to construct a fixed-length burst representation and predict a class label $y \in \{1, \dots, C\}$, corresponding for example to an application, device type, VPN/Tor category, or malware family. We use the first $N_p$ packets from the flow and retain the first $N_b$ bytes from each packet, producing a total sequence length of $L = N_p \cdot N_b$ bytes. This design preserves packet boundaries while focusing on the earliest portion of each packet, which often contains the most informative header and initial payload content [40]. In all experiments, we use $N_p = 5$ and $N_b = 320$, yielding a 1600-byte sequence. This is consistent with commonly used truncation settings in prior work [17].
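The burst construction described above can be sketched in a few lines. This is a minimal illustration, not the released preprocessing code: the text states five packets and a 1600-byte burst, which implies 320 bytes per packet; the function name `build_burst` and the choice of zero as the padding byte are assumptions made here.

```python
from typing import List, Sequence

NUM_PACKETS = 5          # first packets kept per flow (five, per the text)
BYTES_PER_PACKET = 320   # 1600-byte burst / 5 packets
PAD_BYTE = 0             # assumption: short or missing packets are zero-padded


def build_burst(packets: Sequence[bytes],
                num_packets: int = NUM_PACKETS,
                bytes_per_packet: int = BYTES_PER_PACKET,
                pad: int = PAD_BYTE) -> List[int]:
    """Concatenate the first bytes of the first packets into a fixed-length burst."""
    burst: List[int] = []
    for i in range(num_packets):
        raw = packets[i] if i < len(packets) else b""      # missing packet -> all padding
        chunk = raw[:bytes_per_packet]                      # truncate long packets
        chunk += bytes([pad]) * (bytes_per_packet - len(chunk))  # pad short packets
        burst.extend(chunk)                                 # iterating bytes yields ints 0..255
    return burst


# Usage: a flow with one short and one long packet still yields 1600 byte values.
example_burst = build_burst([b"\x01\x02", b"\xff" * 400])
```

Padding each packet to a fixed slot keeps packet boundaries at known offsets (multiples of 320), which is the property the text relies on.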
Bias control and masking.
Some datasets contain fields that may introduce label leakage, such as IP addresses, ports, or MAC addresses [40]. Following common practice [17, 47, 35, 36], we mask IP addresses by replacing them with 0.0.0.0, remove Ethernet headers during sequence construction, and preserve flow-level train/validation/test splits to avoid overlap across partitions. We also exclude non-IP protocols such as ARP and DHCP. The remaining bytes consist of padded or truncated IP headers together with upper-layer content and payload bytes, matching the benchmark preprocessing conventions.
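As an illustration of the address masking step, the following sketch zeroes the source and destination addresses of a packet that starts at the IPv4 header (source address at byte offsets 12 to 15, destination at 16 to 19). It is a simplified assumption-laden example: the function name is hypothetical, only IPv4 is handled, and the header checksum is left unchanged.

```python
def mask_ipv4_addresses(ip_packet: bytes) -> bytes:
    """Replace the IPv4 source and destination addresses with 0.0.0.0.

    Assumes the buffer starts at the IP header (Ethernet header already removed).
    Buffers that are not a complete IPv4 header are returned unchanged.
    """
    if len(ip_packet) < 20 or (ip_packet[0] >> 4) != 4:
        return ip_packet
    masked = bytearray(ip_packet)
    masked[12:20] = bytes(8)   # 4 bytes source + 4 bytes destination -> zeros
    return bytes(masked)


# Usage: minimal 20-byte IPv4 header (version 4, IHL 5) followed by a payload.
header = bytes([0x45]) + bytes(11) + bytes([10, 0, 0, 1, 192, 168, 1, 2])
masked_packet = mask_ipv4_addresses(header + b"payload")
```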
III-B Tokenizer-free byte embedding and projection
Our input consists of raw network bytes in the range $[0, 255]$. Each byte value $b_t$ at position $t$ is mapped to a learnable embedding vector $e_t$ of dimension $d$:

$$e_t = E[b_t], \qquad E \in \mathbb{R}^{256 \times d} \tag{2}$$
By default, we use . This byte-level representation avoids tokenization and allows the model to operate directly on arbitrary packet content.
III-C Embedding projection
A raw embedding lookup provides a learned vector for each byte, but does not itself introduce nonlinear interaction within the feature dimension. To provide a slightly richer per-byte representation before sequence modeling, we apply a lightweight two-layer projection MLP:

$$z_t = W_2\,\sigma(W_1 e_t) \tag{3}$$

where $e_t$ is the embedding of the byte at position $t$, $\sigma$ is the GELU activation, and $W_1$ and $W_2$ map $\mathbb{R}^{d} \to \mathbb{R}^{d}$. This projection is motivated by prior byte-level modeling work such as MambaByte [33], and provides a simple local feature lifting stage before the Mamba backbone. In practice, we find that this component modestly improves robustness, although the model remains effective without it.
III-D Positional encoding (Std pos) and CLS token
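We append a learnable CLS token to the end of the projected byte sequence and add learnable positional embeddings:

$$X = [\, z_1 + p_1,\ \dots,\ z_L + p_L,\ c + p_{L+1} \,] \tag{4}$$

where $z_t$ is the projected embedding at byte position $t$, $L$ is the number of bytes in the burst, $c$ is the learnable classification token embedding and $p_t$ denotes the positional embedding for position $t$. The final representation used for classification is the output state at the CLS position, i.e., $h_{L+1}$.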
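The input pipeline of this section (byte lookup, CLS appended at the end, positional embeddings added) can be sketched with plain Python lists to make the shapes explicit. This is a toy illustration rather than the model code: parameters are random stand-ins for learned tables, the width of 4 is arbitrary, the projection MLP is omitted, and all names are hypothetical.

```python
import random
from typing import List, Sequence

Vector = List[float]


def init_table(rows: int, dim: int, rng: random.Random) -> List[Vector]:
    """Stand-in for a learnable parameter table (random values, no training here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed_burst(byte_ids: Sequence[int], byte_table: List[Vector],
                cls_vec: Vector, pos_table: List[Vector]) -> List[Vector]:
    """Look up byte embeddings, append CLS at the end, add positional embeddings."""
    if len(pos_table) < len(byte_ids) + 1:
        raise ValueError("need one positional embedding per byte plus one for CLS")
    tokens = [byte_table[b] for b in byte_ids]   # one vector per byte value 0..255
    tokens.append(cls_vec)                        # CLS goes last, after all bytes
    return [[v + p for v, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_table)]


rng = random.Random(0)
dim = 4                                   # toy width, not the paper's setting
burst = [0x45, 0x00, 0x00, 0x3C, 0xFF]    # five example byte values
byte_table = init_table(256, dim, rng)    # 256 rows: one per possible byte
pos_table = init_table(len(burst) + 1, dim, rng)
cls_vec = init_table(1, dim, rng)[0]
sequence = embed_burst(burst, byte_table, cls_vec, pos_table)
```

Placing the CLS token last matters for a causal, recurrent backbone: only the final position has seen every byte of the burst.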
III-E Mamba backbone and classifier head
We stack Mamba-2 blocks (default ) and use the final CLS representation for classification:
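$$H = [h_1, \dots, h_{L+1}] = \mathrm{Backbone}(X) \tag{5}$$

$$\hat{y} = \mathrm{softmax}\big(\mathrm{Head}(h_{L+1})\big) \tag{6}$$

where $X$ is the embedded input sequence ($L$ bytes followed by the CLS token), $\hat{y} \in \mathbb{R}^{C}$ denotes the predicted class distribution and $C$ is the number of classes. Training minimizes the cross-entropy loss between $\hat{y}$ and the ground-truth label $y$:

$$\mathcal{L}_{\mathrm{CE}} = -\log \hat{y}_{y} \tag{7}$$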
The MambaNetBurst Mamba-2 block forward pass is outlined in Algorithm 1.
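For concreteness, the softmax and cross-entropy computations used by the classifier head can be written as follows. This is a generic, numerically stable sketch in plain Python; the logits are made-up values and the function names are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert class logits into a probability distribution."""
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the true class, computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


example_logits = [2.0, 0.5, -1.0]   # hypothetical head outputs for 3 classes
probs = softmax(example_logits)
loss = cross_entropy(example_logits, label=0)
```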
IV Experiments
IV-A Datasets
We evaluate on six public benchmarks spanning application identification, IoT device and attack classification, VPN/Tor traffic classification, and malware traffic classification. All splits are performed at the flow level to avoid leakage, and flows from the same capture session or device do not overlap across train, validation, and test partitions when evaluating application identification datasets. For comparison, we use the same splits from prior work [17, 47, 35]. Header bytes retain the IP header with masked IP addresses, and each packet is padded or truncated before five packets are concatenated into a fixed-length flow burst. The evaluated datasets are:
- CrossPlatform (Android/iOS) - Encrypted mobile app traffic identification, consisting of 254 and 253 applications respectively [17].
- ISCXVPN2016 - VPN traffic data from 7 communication categories [5].
- ISCXTor2016 - Tor traffic data from 8 communication categories [14].
- USTC-TFC2016 - Malware traffic classification, distinguishing between malware and benign traffic, of 10 classes each [38].
- CICIoT2022 - Attack traffic classification, such as Denial of Service (DoS) attacks and brute force attacks [2].
We compare against classical feature-based approaches (AppScanner [29], FlowPrint [31]), supervised deep learning baselines (FS-Net [18], TFE-GNN [44]), pre-trained Transformers (ET-BERT [16], YaTC [46]), and a Mamba-based baseline (NetMamba [35]). YaTC(OF) replaces packet-level and flow-level attention with a global attention module. Where indicated, baseline numbers are taken directly from the cited papers.
IV-B Implementation details and Hyper-parameters
Unless otherwise stated, we train all models for epochs using the AdamW optimizer with an initial learning rate of and weight decay of . A linear warm-up is applied for the first epochs, followed by cosine annealing with a minimum learning rate of . Training uses mixed-precision with automatic casting and gradient scaling. The default model configuration sets the embedding dimension to , the encoder depth to layers, the Mamba state size to , the classifier hidden dimension to , and dropout to .
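The schedule described above (linear warm-up followed by cosine annealing to a minimum learning rate) can be sketched as a per-epoch function. The numeric values in the usage lines are hypothetical placeholders chosen for illustration only and are not the paper's hyper-parameters; the function name is likewise an assumption.

```python
import math


def lr_at_epoch(epoch: int, total_epochs: int, warmup_epochs: int,
                base_lr: float, min_lr: float) -> float:
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if warmup_epochs > 0 and epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    span = max(1, total_epochs - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs) / span)
    # Cosine annealing from base_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
schedule = [lr_at_epoch(e, total_epochs=10, warmup_epochs=2,
                        base_lr=1e-3, min_lr=1e-5) for e in range(10)]
```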
For the backbone, we consider four alternatives: Mamba-1, Mamba-2, a standard Transformer encoder, and a linear Transformer encoder. We use the official Mamba implementation (https://github.com/state-spaces/mamba) [7, 4]. We implement the vanilla Transformer [32] blocks with quadratic complexity and the Linear Transformer [11] blocks with linear complexity following the respective published works. For Mamba-based encoders, each residual pre-normalized block uses state-space expansion , expansion factor , convolution width , bias=False, and conv_bias=True; for Mamba-2, we additionally set the head dimension to , whereas Mamba-1 does not use a head-dimension parameter. For the Transformer-based variants, we use attention heads; the Transformer and linear Transformer feed-forward dimension is set to in the encoder blocks.
The sequence length is $L = 1600$ bytes. A learnable [CLS] token is appended to the end of the sequence, and learnable positional embeddings are added to all tokens. We trained on an NVIDIA RTX 3090 (23.54 GiB) with a batch size of for Mamba-based models, and a much smaller batch size of for Transformer variants due to their higher memory cost. We do not use any pre-training. During evaluation, we report accuracy (AC), precision (PR), recall (RC), and macro-F1 (F1).
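The reported metrics can be computed from predicted and true labels as sketched below. The text states that F1 is macro-averaged; macro-averaging precision and recall in the same way is an assumption of this sketch, as are the function name and the toy labels.

```python
from typing import Dict, Hashable, Sequence


def classification_metrics(y_true: Sequence[Hashable],
                           y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    classes = set(y_true) | set(y_pred)
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"AC": accuracy, "PR": sum(precisions) / n,
            "RC": sum(recalls) / n, "F1": sum(f1s) / n}


# Toy example with three classes and one misclassified flow.
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```

Macro averaging weights every class equally, so rare classes influence the score as much as frequent ones, which matters for imbalanced traffic datasets.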
V Results
V-A Network packet classification
Tables II–III summarize results across six publicly available datasets. Baseline entries are reported from NetMamba [35] where indicated. Across the six evaluated datasets, MambaNetBurst achieves high macro-F1 in the main results tables, including very strong performance on ISCXTor2016 and USTC-TFC2016 (Table III) and strong results on CrossPlatform (Android/iOS) (Table II). Importantly, these results are obtained without any pretraining stage.
Task groups: encrypted mobile app traffic classification (CrossPlatform(Android) [27], CrossPlatform(iOS) [27]) and attack traffic classification (CICIoT2022 [3]). Parameter counts are in millions (M).

| Method | Params PT (M) | Params FT (M) | Android AC | Android PR | Android RC | Android F1 | iOS AC | iOS PR | iOS RC | iOS F1 | CICIoT2022 AC | CICIoT2022 PR | CICIoT2022 RC | CICIoT2022 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AppScanner [29] | - | - | 0.1626 | 0.1646 | 0.1456 | 0.1413 | 0.1718 | 0.1400 | 0.1440 | 0.1283 | 0.7556 | 0.8093 | 0.7244 | 0.6938 |
| FlowPrint [31] | - | - | 0.8739 | 0.8941 | 0.8739 | 0.8700 | 0.8712 | 0.8687 | 0.8712 | 0.8603 | 0.5820 | 0.4164 | 0.5820 | 0.4643 |
| FS-Net [18] | - | 5.3 | 0.0147 | 0.0023 | 0.0147 | 0.0034 | 0.0293 | 0.0014 | 0.0293 | 0.0025 | 0.5747 | 0.3800 | 0.5747 | 0.4216 |
| TFE-GNN [44] | - | 44.3 | 0.8141 | 0.8308 | 0.8141 | 0.8067 | 0.8241 | 0.8326 | 0.8241 | 0.8130 | **1.000** | **1.000** | **1.000** | **1.000** |
| ET-BERT [16] | 187.4 | 136.4 | 0.8743 | 0.8913 | 0.8743 | 0.8786 | 0.9105 | 0.8809 | 0.9105 | 0.8850 | 0.9937 | 0.9938 | 0.9937 | 0.9937 |
| YaTC(OF) [46] | 2.3 | 2.1 | 0.9076 | 0.9107 | 0.9076 | 0.9077 | 0.9263 | 0.9282 | 0.9263 | 0.9264 | 0.9949 | 0.9949 | 0.9949 | 0.9949 |
| YaTC [46] | 2.3 | 2.1 | 0.8952 | 0.8989 | 0.8952 | 0.8952 | 0.9270 | 0.9296 | 0.9270 | 0.9272 | 0.9974 | 0.9975 | 0.9974 | 0.9974 |
| NetMamba [35] | 2.2 | 1.9 | 0.9094 | 0.9133 | 0.9094 | 0.9096 | 0.9301 | 0.9327 | 0.9301 | 0.9305 | 0.9928 | 0.9931 | 0.9928 | 0.9929 |
| MambaNetBurst | NA | 2.7/2.5 | **0.9860** | **0.9838** | **0.9831** | **0.9824** | **0.9900** | **0.9837** | **0.9875** | **0.9851** | 0.9974 | 0.9967 | 0.9964 | 0.9966 |
Task groups: Tor traffic classification (ISCXTor2016 [15]), VPN traffic classification (ISCXVPN2016 [6]), and malware traffic classification (USTC-TFC2016 [38]). Parameter counts are in millions (M).

| Method | Params PT (M) | Params FT (M) | Tor AC | Tor PR | Tor RC | Tor F1 | VPN AC | VPN PR | VPN RC | VPN F1 | USTC AC | USTC PR | USTC RC | USTC F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AppScanner [29] | - | - | 0.4034 | 0.2850 | 0.2149 | 0.2113 | 0.7643 | 0.8047 | 0.7045 | 0.7256 | 0.6998 | 0.8591 | 0.6062 | 0.6633 |
| FlowPrint [31] | - | - | 0.1316 | 0.0173 | 0.1316 | 0.0306 | 0.9666 | 0.9733 | 0.9666 | 0.9681 | 0.7992 | 0.7745 | 0.7992 | 0.7755 |
| FS-Net [18] | - | 5.3 | 0.7020 | 0.7010 | 0.7020 | 0.6999 | 0.7023 | 0.7487 | 0.7023 | 0.6660 | 0.4381 | 0.2011 | 0.4381 | 0.2672 |
| TFE-GNN [44] | - | 44.3 | 0.7692 | 0.8030 | 0.7692 | 0.7618 | 0.8428 | 0.8508 | 0.8428 | 0.8447 | 0.9747 | 0.9747 | 0.9747 | 0.9734 |
| ET-BERT [16] | 187.4 | 136.4 | 0.9967 | 0.9967 | 0.9967 | 0.9967 | 0.9566 | 0.9566 | 0.9566 | 0.9565 | 0.9910 | 0.9911 | 0.9910 | 0.9910 |
| YaTC(OF) [46] | 2.3 | 2.1 | 0.9986 | 0.9986 | 0.9986 | 0.9986 | 0.9805 | 0.9808 | 0.9805 | 0.9806 | 0.9960 | 0.9955 | 0.9960 | 0.9957 |
| YaTC [46] | 2.3 | 2.1 | 0.9959 | 0.9959 | 0.9959 | 0.9959 | **0.9848** | 0.9849 | 0.9848 | 0.9848 | 0.9972 | **0.9976** | **0.9972** | **0.9970** |
| NetMamba [35] | 2.2 | 1.9 | 0.9986 | 0.9986 | 0.9986 | 0.9986 | 0.9805 | 0.9808 | 0.9805 | 0.9806 | 0.9960 | 0.9957 | 0.9960 | 0.9957 |
| MambaNetBurst | NA | 2.7/2.5 | **0.9993** | **0.9991** | **0.9990** | **0.9990** | 0.9834 | **0.9884** | **0.9859** | **0.9871** | **0.9995** | 0.9964 | 0.9949 | 0.9954 |
V-B Ablations
Table IV summarizes our ablation study across six byte-based packet sequence benchmarks. Unless otherwise stated, all variants use a Mamba-2 backbone with , , layers, , and . We report per-dataset macro-F1, together with the mean (AVG), worst-case (MIN), best-case (MAX), and cross-dataset variance (VAR). Input bytes are represented using one of three embedding strategies: (i) learned byte embeddings with a residual projection (byte), (ii) learned byte embeddings without projection (ByteEmbedNoProj), or (iii) stride-based convolutional embedding (stride) with stride size . Overall, the ablations indicate that (i) the task is highly solvable with compact selective SSM backbones, (ii) preserving byte-level temporal resolution is critical, and (iii) moderate state capacity and adequate model width yield the most robust generalization across datasets.
| Method | ISCXVPN2016 | ISCXTor2016 | USTC-TFC2016 | CICIoT2022 | CP(Android) | CP(iOS) | AVG | MIN | MAX | VAR |
|---|---|---|---|---|---|---|---|---|---|---|
| A: Without pos enc | 0.9870 | 0.9986 | 0.9998 | 0.9964 | 0.9880 | 0.9864 | | | | |
| A: Std pos | 0.9834 | 0.9993 | 0.9995 | 0.9974 | 0.9860 | 0.9900 | | | | |
| A: Std pos (Mamba-1) | 0.9834 | 0.9966 | 0.9997 | 0.9974 | 0.9800 | 0.9822 | | | | |
| A: Stride (4) | 0.9704 | 0.9979 | 0.9991 | 0.9969 | 0.9704 | 0.9603 | | | | |
| A: Stride (4, Mamba-1) | 0.9812 | 0.9966 | 0.9938 | 0.9954 | 0.9729 | 0.9774 | | | | |
| A: Stride (2) | 0.9812 | 0.9986 | 0.9982 | 0.9944 | 0.9738 | 0.9781 | | | | |
| A: Stride (2, Mamba-1) | 0.9718 | 0.9925 | 0.9992 | 0.9836 | 0.9665 | 0.9812 | | | | |
| A: No emb proj | 0.9855 | 0.9959 | 0.9997 | 0.9964 | 0.9824 | 0.9869 | | | | |
| A: 2 layers | 0.9790 | 0.9973 | 0.9994 | 0.9979 | 0.9820 | 0.9864 | | | | |
| A: 1 layers | 0.9827 | 0.9979 | 0.9994 | 0.9959 | 0.9822 | 0.9893 | | | | |
| A: d_state 32 | 0.9827 | 0.9973 | 0.9994 | 0.9954 | 0.9856 | 0.9883 | | | | |
| A: d_state 64 | 0.9812 | 0.9973 | 0.9994 | 0.9964 | 0.9869 | 0.9883 | | | | |
| A: d_state 128 | 0.9769 | 0.9973 | 0.9997 | 0.9969 | 0.9867 | 0.9800 | | | | |
| A: Compact(64/64/2) | 0.9725 | 0.9959 | 0.9997 | 0.9959 | 0.9747 | 0.9812 | | | | |
| A: Compact(32/32/2) | 0.9624 | 0.9966 | 0.9995 | 0.9933 | 0.8982 | 0.9303 | | | | |
| A: Transformer | 0.9783 | 0.9979 | 0.9994 | 0.9969 | 0.9280 | 0.9857 | | | | |
| A: Linear Tr(flash att 2)* | 0.9913 | 0.9959 | 0.9997 | 0.9974 | 0.9895 | 0.9912 | | | | |
| A: Linear Tr(flash att 2)*(64/64/2) | 0.9877 | 0.9966 | 0.9995 | 0.9944 | 0.9369 | 0.9719 | | | | |
| F1: Without pos enc | 0.9859 | 0.9980 | 0.9978 | 0.9952 | 0.9847 | 0.9810 | 0.9904 | 0.9810 | 0.9980 | 5.53E-05 |
| F1: Std pos | 0.9871 | 0.9990* | 0.9954 | 0.9966 | 0.9824 | 0.9851 | 0.9909 | 0.9824 | 0.9990 | 4.77E-05 |
| F1: Std pos (Mamba-1) | 0.9828 | 0.9957 | 0.9956 | 0.9966 | 0.9769 | 0.9770 | 0.9874 | 0.9769 | 0.9966 | 9.21E-05 |
| F1: Stride (4) | 0.9524 | 0.9972 | 0.9950 | 0.9961 | 0.9662 | 0.9561 | 0.9772 | 0.9524 | 0.9972 | 4.51E-04 |
| F1: Stride (4, Mamba-1) | 0.9732 | 0.9955 | 0.9614 | 0.992 | 0.9703 | 0.9731 | 0.9776 | 0.9614 | 0.9955 | 1.77E-04 |
| F1: Stride (2) | 0.9727 | 0.9980 | 0.9881 | 0.9902 | 0.9709 | 0.9739 | 0.9823 | 0.9709 | 0.9980 | 1.27E-04 |
| F1: Stride (2, Mamba-1) | 0.9586 | 0.9914 | 0.9930 | 0.9790 | 0.9645 | 0.9772 | 0.9773 | 0.9586 | 0.9930 | 1.92E-04 |
| F1: No emb proj | 0.9838 | 0.9952 | 0.9977 | 0.9955 | 0.9799 | 0.9841 | 0.9894 | 0.9799 | 0.9977 | 5.79E-05 |
| F1: 2 layers | 0.9782 | 0.9960 | 0.9944 | 0.9971 | 0.9794 | 0.9816 | 0.9878 | 0.9782 | 0.9971 | 7.97E-05 |
| F1: 1 layers | 0.9722 | 0.9970 | 0.9945 | 0.9944 | 0.9796 | 0.9845 | 0.9870 | 0.9722 | 0.9970 | 9.82E-05 |
| F1: d_state 32 | 0.9791 | 0.9960 | 0.9912 | 0.9938 | 0.9837 | 0.9833 | 0.9879 | 0.9791 | 0.9960 | 4.55E-05 |
| F1: d_state 64 | 0.9769 | 0.9960 | 0.9950 | 0.9952 | 0.9838 | 0.9834 | 0.9883 | 0.9769 | 0.9960 | 6.52E-05 |
| F1: d_state 128 | 0.9612 | 0.9960 | 0.9977 | 0.9963 | 0.9841 | 0.9743 | 0.9849 | 0.9612 | 0.9977 | 2.18E-04 |
| F1: Compact(64/64/2) | 0.9618 | 0.9942 | 0.9956 | 0.9946 | 0.9711 | 0.9755 | 0.9821 | 0.9618 | 0.9956 | 2.12E-04 |
| F1: Compact(32/32/2) | 0.9476 | 0.9950 | 0.9963 | 0.9917 | 0.9060 | 0.9322 | 0.9615 | 0.9060 | 0.9963 | 1.48E-03 |
| F1: Transformer | 0.9766 | 0.9972 | 0.9933 | 0.9960 | 0.9289 | 0.9810 | 0.9788 | 0.9289 | 0.9972 | 6.69E-04 |
| F1: Linear Tr(flash att 2)* | 0.9920 | 0.9945 | 0.9977 | 0.9966 | 0.9871 | 0.9868 | 0.9925 | 0.9868 | 0.9977 | 2.19E-05 |
| F1: Linear Tr(flash att 2)*(64/64/2) | 0.9866 | 0.9954 | 0.9933 | 0.9927 | 0.9367 | 0.9673 | 0.9787 | 0.9367 | 0.9954 | 5.29E-04 |
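The summary columns of Table IV can be reproduced from the per-dataset macro-F1 scores. The sketch below recomputes them for the Std pos row; the reported VAR for that row is consistent with the sample variance (division by n - 1), which is what `statistics.variance` computes. The helper name `summarize` is illustrative.

```python
from statistics import mean, variance
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Cross-dataset mean, worst case, best case, and sample variance."""
    return {
        "AVG": mean(scores),
        "MIN": min(scores),
        "MAX": max(scores),
        "VAR": variance(scores),   # sample variance (n - 1 in the denominator)
    }


# Per-dataset macro-F1 of the "F1: Std pos" row, in table column order.
std_pos_f1 = [0.9871, 0.9990, 0.9954, 0.9966, 0.9824, 0.9851]
summary = summarize(std_pos_f1)
```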
Positional embeddings are not dominant. We compare standard positional encoding (Std pos) against removing positional encoding entirely (Without pos enc). While both configurations attain close scores, Std pos achieves a better mean (AVG 0.9909 vs. 0.9904), better worst-case performance (MIN 0.9824 vs. 0.9810), and lower variance (VAR 4.77E-05 vs. 5.53E-05).
This suggests that, for byte-based packet sequences, the ordering bias induced by causal convolution and recurrent state-space dynamics already provides strong sequential structure enabling robust discrimination even without added (absolute) positional features. Positional embeddings remain useful primarily because they improve consistency, as indicated by lower variance, across datasets. Practically, this indicates that packet-byte classification can be dominated by local and local-to-intermediate patterns (meso-scale motifs) 222intermediate structural or functional patterns that exist between the micro- and macro-scales, acting as building blocks for complex system behavior (e.g., protocol headers, fields, short repeated patterns, delimiters, length fields, checksum starts, common byte sequences in packet structures (amounting to tens to hundreds of bytes), or characteristic ’motifs’ that appear in many packets of the same class such as TCP flags + options patterns, HTTP method + version strings.) rather than requiring absolute global positional anchoring. The lower variance is the key reason to justify the use of positional encodings in MambaNetBurst.
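The comparisons in this section rest on the AVG, MIN, MAX, and VAR summaries of the per-dataset F1 scores. As a minimal sketch, the snippet below recomputes them for the No emb proj row of the ablation table. It assumes that the last four columns of that table are these summaries and that VAR is the sample variance (n - 1 denominator); both assumptions are inferred from the numbers rather than stated in the text, and `summarize` is a hypothetical helper name.

```python
import statistics

# Per-dataset F1 scores copied from the "No emb proj" row of the ablation table.
f1_scores = [0.9838, 0.9952, 0.9977, 0.9955, 0.9799, 0.9841]


def summarize(scores):
    """Return cross-dataset summary statistics for one ablation row."""
    return {
        "AVG": statistics.mean(scores),
        "MIN": min(scores),
        "MAX": max(scores),
        # Sample variance (n - 1 denominator); an assumption that matches the table.
        "VAR": statistics.variance(scores),
    }


summary = summarize(f1_scores)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the recomputed values agree with the row's reported summaries to the printed precision, which suggests that VAR measures the spread across the six datasets rather than across repeated runs.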
Mamba-2 is more consistent than Mamba-1 under matched settings. We replace Mamba-2 with Mamba-1 under otherwise matched settings (Std pos (Mamba-1)). Mamba-2 yields higher mean performance (AVG), a better worst-case score (MIN), and substantially lower variance (VAR) in Table IV. These results indicate that the more constrained structured dynamics of Mamba-2 are not only sufficient for byte-level packet classification, but may also act as an implicit regularizer for this modality, where overly flexible per-channel dynamics (as in Mamba-1) may not be necessary to capture the discriminative structure present in the data.
Early downsampling is harmful. To test sensitivity to temporal resolution, we apply striding with factor 4 (Stride(4)). This change produces the largest degradation among all single-factor ablations: relative to the baseline, it reduces AVG, substantially increases variance, and lowers the worst-case score (MIN) (Table IV). This indicates that some benchmarks critically depend on fine-grained byte order and short-range structure, which is lost under downsampling. Using Mamba-1 with the same striding (Stride(4,Mamba-1): AVG 0.9776, MIN 0.9614, VAR 1.77E-04) slightly improves robustness relative to strided Mamba-2, indicating that the additional flexibility of Mamba-1 can partially compensate, but performance remains well below the non-strided baseline (AVG 0.9909). This confirms that preserving fine-grained byte-level temporal fidelity is essential for robust classification across heterogeneous datasets. Consequently, we recommend avoiding striding in accuracy-critical settings and employing it only when compute constraints require downsampling.
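To make the information-loss argument concrete, the following sketch uses plain subsampling as a simplified stand-in for early downsampling. It is not the striding operator used in the experiments, whose implementation is not described here; `subsample` and the two byte strings are hypothetical. Two bursts that differ only at offsets that are not multiples of the stride become indistinguishable after downsampling.

```python
def subsample(byte_seq: bytes, stride: int) -> bytes:
    """Keep every `stride`-th byte; a simplified stand-in for early downsampling."""
    return byte_seq[::stride]


# Two hypothetical 16-byte bursts that differ only at offsets 2 and 6.
burst_a = bytes.fromhex("16030100c8010000c40303aabbccddee")
burst_b = bytes.fromhex("16030300c8010200c40303aabbccddee")

for stride in (1, 2, 4):
    same = subsample(burst_a, stride) == subsample(burst_b, stride)
    print(f"stride={stride}: downsampled bursts identical -> {same}")

# A 1600-byte input shrinks to 400 positions at stride 4.
print(len(subsample(bytes(1600), 4)))
```

A learned strided layer aggregates neighboring bytes instead of dropping them, so the loss is less absolute than in this toy case, but the same pressure applies: several byte positions must share one downstream position.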
Moderate depth is sufficient, but additional layers improve robustness. Reducing depth to two layers (2 layers) yields a modest drop in mean performance (AVG 0.9878, versus 0.9909 for Std pos), indicating that the task does not require deep models to achieve strong results. However, the worst-case performance declines as depth decreases (MIN 0.9782 with two layers and 0.9722 with one), suggesting that additional layers mainly improve robustness rather than average accuracy. This is consistent with the view that packet-byte classification contains strong local discriminative cues that shallow sequence models can already capture effectively. Capacity reductions that jointly shrink width and depth (e.g., Compact...) cause a pronounced collapse, driven by the CrossPlatform datasets. These results suggest that the modality benefits from additional composition, but does not require substantial depth to attain high F1.
Large state sizes are unnecessary. Increasing d_state from 16 to 32 or 64 produces only minor changes in mean performance (AVG 0.9879 and 0.9883), while further increasing it to 128 degrades both the average and worst-case scores (AVG 0.9849, MIN 0.9612) and increases variance (VAR 2.18E-04). This pattern implies that byte-based packet classification does not strongly benefit from very large latent dynamical memory, and that excessive state capacity may be detrimental under fixed training and regularization budgets (e.g., by overfitting dataset-specific long-range artifacts or by shifting capacity away from the short- and mid-range structures that dominate discrimination). For this modality, moderate d_state (16–64) appears to be a favorable operating range.
Adequate model width is important for robust generalization. Compact variants that reduce width-related hyperparameters and depth simultaneously remain competitive up to a point. The medium-sized variant (Compact(64/64/2), two layers) preserves strong performance (AVG 0.9821), whereas the smallest variant (Compact(32/32/2)) degrades noticeably (AVG 0.9615, MIN 0.9060), particularly on the CrossPlatform datasets. These findings underscore that sufficient representation width is crucial for robust generalization in packet-byte classification, particularly for datasets with higher intra-class variability or weaker signature patterns. While the smallest model remains effective on several benchmarks, it lacks the channel capacity required to consistently encode the diverse discriminative cues present in the CP datasets.
Embedding projection helps. Removing the embedding projection (No emb proj) marginally affects performance (AVG 0.9894 vs. 0.9909 for the baseline), with comparable worst-case and variance. This indicates that the model can learn effective representations directly from the byte embedding stream in this configuration, while the projection layer mainly offers a small robustness gain.
In summary, across all ablations, we identify three consistent trends. First, preserving byte-level temporal resolution is critical: striding causes the largest and most variance-increasing degradation, implying that fine-grained byte patterns carry essential discriminative signal. Second, Mamba-2 provides the best overall robustness in the non-strided setting, improving mean and worst-case F1 while reducing variance across datasets relative to Mamba-1. Third, hyperparameter scaling exhibits modality-specific optima: moderate d_state (16–64) and adequate model width are more important than increasing state size aggressively. Finally, we note that certain hyperparameters, including the expansion factor (expand), are held constant in Table IV; thus, conclusions about these parameters are necessarily indirect and should be validated with targeted sweeps in future work.
V-C Mamba-1 vs Mamba-2 scaling
Table V reports timing and memory for matched Mamba-1 vs Mamba-2 configurations across batch sizes (RTX 3090). For the 2.5–2.7M parameter setting, Mamba-2 consistently reduces forward and (especially) backward time relative to Mamba-1, while memory usage remains of similar magnitude across batch sizes. The table also shows an out-of-memory condition at batch 256 for Mamba-2 in this specific configuration.
Averaged over 10 runs, each with a warm-up of 10. Implementation: mamba_ssm. OOM indicates out of memory on RTX 3090, 23.54GiB.
| Architecture | Batch | Forward time (ms) | Backward time (ms) | Eval time (ms) | Forward peak memory (MiB) | Backward peak memory (MiB) | Eval peak memory (MiB) |
|---|---|---|---|---|---|---|---|
| Mamba 1 (2.7M params) | 8 | 6.38 | 18.79 | 6.34 | 620.50 | 754.29 | 246.89 |
| Mamba 1 (2.7M params) | 16 | 12.16 | 43.74 | 16.68 | 1197.66 | 1459.44 | 448.36 |
| Mamba 1 (2.7M params) | 32 | 23.71 | 98.90 | 29.33 | 2353.65 | 2873.82 | 856.34 |
| Mamba 1 (2.7M params) | 64 | 51.35 | 205.56 | 53.52 | 4675.68 | 5714.95 | 1674.65 |
| Mamba 1 (2.7M params) | 128 | 107.07 | 414.15 | 108.51 | 9324.30 | 11401.78 | 3311.33 |
| Mamba 1 (2.7M params) | 256 | 258.84 | 824.62 | 415.79 | 18629.28 | 22784.18 | 6586.31 |
| Mamba 2 (2.5M params) | 8 | 5.65 | 12.88 | 5.91 | 618.63 | 798.47 | 254.36 |
| Mamba 2 (2.5M params) | 16 | 10.85 | 23.36 | 11.23 | 1175.65 | 1504.09 | 451.32 |
| Mamba 2 (2.5M params) | 32 | 21.28 | 44.89 | 20.86 | 2300.96 | 2954.69 | 837.57 |
| Mamba 2 (2.5M params) | 64 | 42.87 | 86.60 | 41.84 | 4554.12 | 5860.71 | 1613.82 |
| Mamba 2 (2.5M params) | 128 | 86.50 | 166.61 | 85.55 | 9061.93 | 11673.69 | 3167.67 |
| Mamba 2 (2.5M params) | 256 | OOM | OOM | OOM | OOM | OOM | OOM |
The backward-pass speedup (50% on average) is especially important in practice since it directly accelerates training and fine-tuning. Forward times are also consistently better with Mamba-2 (roughly 10–19% in Table V), as are evaluation times (roughly 7–33%). Forward-pass peak memory is marginally lower for Mamba-2 at the same batch sizes (up to roughly 3%), whereas backward-pass peak memory is slightly higher (e.g., 11673.69 vs. 11401.78 MiB at batch 128), which is consistent with the OOM at batch 256. For the backward pass, Mamba-1 uses a recompute approach to lower memory usage at the cost of slower processing, while Mamba-2 stores more intermediates (chunk intermediates or additional buffers), which speeds up the backward pass but increases memory consumption and can cause OOM. These results are consistent with the design motivation of Mamba-2, which reformulates the structured state-space computation into more hardware-efficient SSD-based operations [4]. While both models have comparable complexity, Mamba-2’s blockwise decomposition and multi-head SSM provide efficiency at larger batches despite memory limits.
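The average backward-pass figure can be checked directly against Table V. The sketch below computes the relative reduction in backward time per batch size and its mean; the dictionary names and the `time_reduction` helper are hypothetical, and batch 256 is left out because Mamba-2 ran out of memory there.

```python
# Backward-pass times in ms from Table V, keyed by batch size.
# Batch 256 is omitted because Mamba-2 reports OOM at that size.
mamba1_back = {8: 18.79, 16: 43.74, 32: 98.90, 64: 205.56, 128: 414.15}
mamba2_back = {8: 12.88, 16: 23.36, 32: 44.89, 64: 86.60, 128: 166.61}


def time_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Fraction of the baseline time saved by the candidate."""
    return 1.0 - candidate_ms / baseline_ms


reductions = {
    batch: time_reduction(mamba1_back[batch], mamba2_back[batch])
    for batch in mamba1_back
}
mean_reduction = sum(reductions.values()) / len(reductions)

for batch, value in reductions.items():
    print(f"batch {batch:>3}: {value:.1%} less backward time")
print(f"mean: {mean_reduction:.1%}")
```

The reduction grows with batch size in these measurements, so the average understates the benefit at the larger batches that training typically uses.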
V-D Mamba vs others: Accuracy (F1) vs Inference Time
Figure 2 shows macro-F1 vs. inference time for batch sizes 8, 16, 32, 64, 128, and 256. Mamba-2 (Std pos) is Pareto optimal (a configuration is Pareto optimal if and only if no other configuration is at least as good in both accuracy and inference time and strictly better in at least one): it achieves near state-of-the-art accuracy (0.9909) while maintaining substantially faster inference than the linear Transformer with FlashAttention-2 (highest F1 at 0.9925 but noticeably slower inference), and it dramatically outperforms the vanilla Transformer and Mamba-1 in speed. While Stride-4 offers the fastest inference, its F1 is the lowest. Mamba-2 delivers substantially faster backward passes (often 30–60% over Mamba-1 and 2–3× over the linear Transformer at medium-to-large batch sizes) and uses 2–4× less GPU memory, enabling larger effective batch sizes and faster ablation cycles on commodity hardware such as a single RTX 3090 (see Appendix Table VI). These results highlight Mamba-2’s superior practical balance of predictive performance and inference efficiency for byte-level network burst classification, even over highly optimized linear attention architectures, when deployability and training throughput are prioritized.
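As an illustration of the Pareto criterion, the sketch below extracts the non-dominated configurations from a subset of the reported numbers. It pairs each configuration's mean F1 from the ablation table with its batch-8 evaluation time from the appendix table. This pairing is an assumption and may not match the exact points plotted in Figure 2; in particular it treats the appendix "Linear transformer" row as the same model as the FlashAttention-2 linear Transformer. The `Config`, `dominates`, and `pareto_front` names are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    f1: float       # mean F1 across datasets (higher is better)
    eval_ms: float  # evaluation time at batch size 8 (lower is better)


configs = [
    Config("Std pos", 0.9909, 3.28),
    Config("No emb proj", 0.9894, 3.23),
    Config("1 layers", 0.9870, 1.09),
    Config("d_state 32", 0.9879, 3.32),
    Config("d_state 64", 0.9883, 3.33),
    Config("d_state 128", 0.9849, 3.39),
    Config("Transformer", 0.9788, 17.97),
    Config("Linear transformer", 0.9925, 3.56),
]


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.eval_ms <= b.eval_ms
    strictly_better = a.f1 > b.f1 or a.eval_ms < b.eval_ms
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = sorted(pareto_front(configs), key=lambda c: c.eval_ms)
for cfg in front:
    print(f"{cfg.name}: F1={cfg.f1:.4f}, eval={cfg.eval_ms:.2f} ms")
```

On this subset Std pos lies on the front together with cheaper but less accurate variants and the slower, slightly more accurate linear Transformer, which matches the trade-off described above.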
VI Discussion
VI-A Direct byte-level supervised learning is sufficient for strong traffic classification.
A central finding of this work is that Mamba-based linear-time sequence modeling can learn discriminative traffic representations directly from raw packet bytes in a fully supervised setting. This contrasts with the dominant trend in recent traffic classification, where strong performance is often associated with heavy pre-training, large engineered representations, or both. As summarized in Table I, Transformer-based methods such as ET-BERT [17] and YaTC [47], and prior Mamba approaches such as NetMamba [35] and NetMamba+ [36], rely on pre-training to obtain general-purpose traffic representations before downstream fine-tuning. Classical machine learning methods employing statistical features, such as AppScanner [29] and FlowPrint [31], also rely on pre-training. Furthermore, even supervised deep learning methods for traffic analysis that use packet lengths or raw bytes, such as FS-Net [18] and TFE-GNN [44], rely on pre-training and handcrafted features. We show that avoiding early patch/stride/token aggregation, and thereby providing undiluted fine-grained byte-resolution information to the Mamba backbone, eliminates the need for an entire pre-training stage and can match or exceed much heavier pre-trained alternatives.
VI-B Mamba-2’s constrained transition structure is adequate for byte-level traffic modeling.
Mamba-2 differs from Mamba-1 in that its state transition matrix ($A$) is more constrained: the core transition matrix takes a scalar-times-identity form rather than a more flexible full diagonal matrix. In principle, this reduces the diversity of intrinsic time constants that can be represented directly within the transition dynamics.
By governing the state transition, i.e., how information is retained, decays, or oscillates over time, the $A$-matrix controls how temporal (sequential) dynamics are represented in the SSM. In Mamba-1 (diagonal $A$), each hidden channel/state dimension can have its own decay rate/time constant, so the model can represent a mixture of many different memory scales (fast, medium, slow) in parallel [7]. In contrast, in Mamba-2 $A$ is a scalar multiple of the identity, $A = aI$, so all state dimensions share essentially the same base decay rate (within a head/block). While more GPU-friendly and efficient, this form is more restrictive: the diversity of dynamics must instead emerge through other components, including multi-head structure, learned projections, input-conditioned parameters, local convolution, and gating [4].
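The difference between the two transition structures can be seen in a toy linear recurrence. The sketch below is a deliberately reduced illustration, not the Mamba kernels: it uses fixed discrete decay factors and omits the input-dependent discretization, the input and output projections, and gating. The decay values and the `impulse_response` helper are hypothetical.

```python
def impulse_response(decays, steps):
    """Simulate h_t = a_i * h_{t-1} for each state dimension i after a unit impulse."""
    state = [1.0] * len(decays)
    trace = [list(state)]
    for _ in range(steps):
        state = [a * h for a, h in zip(decays, state)]
        trace.append(list(state))
    return trace


# Mamba-1 style: diagonal A, one decay per state dimension (hypothetical values).
diagonal_decays = [0.5, 0.9, 0.99]
# Mamba-2 style: A = a * I, one decay shared by every state dimension in a head.
scalar_decays = [0.9] * 3

diag_trace = impulse_response(diagonal_decays, steps=10)
scalar_trace = impulse_response(scalar_decays, steps=10)

print("diagonal A after 10 steps:", [round(h, 4) for h in diag_trace[-1]])
print("scalar A after 10 steps:  ", [round(h, 4) for h in scalar_trace[-1]])
```

With a diagonal transition the three state dimensions retain different amounts of the impulse, which corresponds to fast, medium, and slow memory. With the scalar form they all decay identically, so any variety in timescales has to come from multiple heads or from the input-dependent parts of the model.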
This design trade-off is particularly relevant for network traffic, which contains compositional multi-timescale structures: (a) local byte patterns within packets, (b) local-to-medium packet boundaries and short protocol markers, (c) mid-range handshake or message sequences such as TLS, and (d) long-range flow-level behavior. One might therefore expect Mamba-1’s more flexible diagonal $A$ to be a natural way to allocate different channels to different time constants.
However, our experiments show that this is not necessary in practice for burst-level byte classification. Under matched settings, Mamba-2 matches or exceeds Mamba-1 on nearly all evaluated datasets. For the network byte modality, the rest of Mamba-2 (multi-head structure, mixing, and selective gating) compensates for the simpler base transition matrix. This suggests that the remaining degrees of freedom in Mamba-2 are sufficient to capture the relevant multi-timescale behavior of packet-byte signals.
VI-C The main empirical bottleneck is not transition flexibility, but early information loss.
Our ablation study shows that the most damaging change is not the choice between Mamba-1 and Mamba-2, but the use of early aggregation and summarization through patches, strides, or tokenization. Downsampling the input sequence substantially reduces average performance and increases cross-dataset variance, indicating that fine-grained byte order carries critical discriminative information. This observation aligns with prior byte-level modeling work arguing that fixed patching or striding can obscure important structure [25]. In the network setting, such structure may include local protocol signatures, packet header organization, and short payload motifs. Preserving undiluted byte resolution appears to be more important than maximizing the flexibility of the latent dynamical system.
VI-D Compact SSMs are a good fit for packet-byte classification.
Our experiments show that strong performance does not require large latent state sizes or deep backbones. Moderate state sizes (16–64) and relatively shallow models already perform extremely well, while overly large states or aggressively reduced widths tend to hurt robustness. This indicates that the discriminative structure in packet bursts can be captured with compact selective SSMs, making the approach attractive for practical deployment scenarios where compute and memory are limited.
VI-E Mamba-2 provides both modeling and systems advantages.
Our scaling study shows that Mamba-2 delivers clear computational benefits over Mamba-1 in this application, especially during backpropagation. These efficiency gains matter because supervised training, benchmarking, and ablation studies all depend heavily on turnaround time. Taken together with the classification results, the evidence suggests that Mamba-2 is a particularly suitable backbone for byte-level burst classification: it preserves the linear-time advantages of SSMs, achieves strong predictive performance, and provides better training efficiency than Mamba-1 under comparable settings.
VI-F Avoiding Self-Supervised Pre-training
One of the key advantages of MambaNetBurst is that it completely eliminates the need for self-supervised pre-training, a dominant but costly component in most state-of-the-art traffic classification pipelines. Quantitatively, removing the pre-training stage yields major efficiency gains. In representative baselines such as ET-BERT, YaTC, and NetMamba, the pre-training phase typically consumes 10–100× more compute than the downstream fine-tuning stage, resulting in an estimated 3–15× reduction in total wall-clock training time on commodity GPUs. Training memory footprint is simultaneously reduced by a factor of 2–4 relative to Transformer-based counterparts (see the scaling tables), enabling larger effective batch sizes and single-GPU operation even at sequence length 1600. The approach also avoids the risk of negative transfer from mismatched pre-training corpora, a well-known failure mode in traffic analysis under concept drift. Qualitatively, it collapses the experimental pipeline from two complex stages into a single end-to-end task, drastically lowers the hyperparameter-tuning burden (masking ratios, reconstruction objectives, or auxiliary losses), simplifies code maintenance and reproducibility, and eases real-world deployment in resource-constrained or rapidly evolving NIDS environments where frequent retraining is required. Overall, this positions direct byte-level supervised learning with a compact selective state-space model (Mamba-2) as a markedly more practical and deployable alternative to the pre-training-heavy paradigm that has dominated recent literature.
VII Conclusion
We present MambaNetBurst, a compact, tokenizer-free byte-level sequence classifier for network burst classification built on a Mamba-2 backbone. Unlike most recent strong baselines in encrypted traffic analysis and NIDS, our approach operates directly on raw packet bytes, avoids tokenization, patching, and heavy engineered multimodal inputs, and requires no self-supervised pre-training stage. Across six public benchmarks spanning encrypted mobile app identification, VPN/Tor classification, malware traffic classification, and IoT attack traffic, MambaNetBurst achieves consistently strong performance and is competitive with, or superior to, substantially heavier and often pre-trained baselines. These results show that direct byte-to-classification learning is not only feasible for network traffic, but can be highly effective when paired with a suitable linear-time sequence model.
Our experiments show that (1) preserving undiluted byte-level temporal resolution is critical for performance, and early downsampling through striding causes severe degradation; (2) Mamba-2’s more constrained transition dynamics not only suffice for this domain but can act as a beneficial regularizer, yielding improved consistency across datasets compared to Mamba-1; and (3) moderate model width and state sizes of 16–64, much lower than the defaults, are sufficient for robust generalization. Mamba-2 offers clear efficiency advantages, with 30–60% faster backward passes than Mamba-1 and substantially lower memory usage than Transformer variants, enabling faster training and inference on commodity GPUs without pre-training overhead. These findings challenge the prevailing assumption that expensive pre-training pipelines are necessary for state-of-the-art traffic classification. Byte-level packet classification does not inherently require tokenization, large latent state sizes, or expensive pre-training pipelines, offering a simpler, more efficient, and more deployable paradigm for network traffic analysis. We believe this work opens promising new directions for byte-level SSMs in cybersecurity, including extensions to longer flows, online inference, concept-drift robustness, and edge-device deployment.
Acknowledgments
Dedicated to Sugandi.
References
- [1] (2025) IDS–graphmamba: a markov-enhanced graph mamba framework for real-time intrusion detection in iomt edge networks. Computer Networks, pp. 111933.
- [2] (2022) Towards the development of a realistic multidimensional iot profiling dataset. In 2022 19th Annual International Conference on Privacy, Security & Trust (PST), pp. 1–11.
- [3] (2022) Towards the development of a realistic multidimensional iot profiling dataset. In 2022 19th Annual International Conference on Privacy, Security & Trust (PST), pp. 1–11.
- [4] (2024) Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. arXiv:2405.21060.
- [5] (2016) Characterization of encrypted and vpn traffic using time-related features. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP), pp. 407–414.
- [6] (2016) Characterization of encrypted and vpn traffic using time-related features. In Proceedings of the 2nd international conference on information systems security and privacy (ICISSP 2016), pp. 407–414.
- [7] (2023) Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- [8] (2023) Flow-mae: leveraging masked autoencoder for accurate, efficient and robust malicious traffic classification. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 297–314.
- [9] (2020) Pert: payload encoding representation from transformer for encrypted traffic classification. In 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K), pp. 1–8.
- [10] (2022) Flow-based encrypted network traffic classification with graph neural networks. IEEE Transactions on Network and Service Management 20 (2), pp. 1224–1237.
- [11] (2020) Transformers are rnns: fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165.
- [12] (2022) SCGC: self-supervised contrastive graph clustering. arXiv preprint arXiv:2204.12656.
- [13] (2023) Efficient block contrastive learning via parameter-free meta-node approximation. Neurocomputing 561, pp. 126850.
- [14] (2017) Characterization of tor traffic using time based features. In Proceedings of the 3rd International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, pp. 253–262. ISBN 978-989-758-209-7.
- [15] (2017) Characterization of tor traffic using time based features. In International conference on information systems security and privacy, Vol. 2, pp. 253–262.
- [16] (2022) Et-bert: a contextualized datagram representation with pre-training transformers for encrypted traffic classification. In Proceedings of the ACM Web Conference 2022, pp. 633–642.
- [17] (2022) Et-bert: a contextualized datagram representation with pre-training transformers for encrypted traffic classification. In Proceedings of the ACM Web Conference 2022, pp. 633–642.
- [18] (2019) Fs-net: a flow sequence network for encrypted traffic classification. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 1171–1179.
- [19] (2023) XG-bot: an explainable deep graph neural network for botnet detection and forensics. Internet of Things 22, pp. 100747.
- [20] (2026) Time matters: temporal netflow features for ml-based network intrusion detection. IEEE Access.
- [21] (2025) Multimodal llms for zero-shot intrusion detection using netflow visualisations. In 2025 IEEE 50th Conference on Local Computer Networks (LCN), pp. 1–7.
- [22] (2024) Flowtransformer: a transformer framework for flow-based network intrusion detection systems. Expert Systems with Applications 241, pp. 122564.
- [23] (2023) Netgpt: generative pretrained transformer for network traffic. arXiv preprint arXiv:2304.09513.
- [24] (2022) Packet representation learning for traffic classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3546–3554.
- [25] (2025) Byte latent transformer: patches scale better than tokens. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9238–9258.
- [26] (2024) Ptu: pre-trained model for network traffic understanding. In 32nd IEEE International Conference on Network Protocols, ICNP 2024, Charleroi, Belgium, October 28-31, 2024, pp. 1–12.
- [27] (2019) An international view of privacy risks for mobile apps. Online.
- [28] (2023) Doc-nad: a hybrid deep one-class classifier for network anomaly detection. In 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), pp. 1–7.
- [29] (2017) Robust smartphone app identification via encrypted network traffic analysis. IEEE Transactions on Information Forensics and Security 13 (1), pp. 63–78.
- [30] (2025) Quantifying the privacy implications of high-fidelity synthetic network traffic. arXiv preprint arXiv:2511.20497.
- [31] (2020) Flowprint: semi-supervised mobile-app fingerprinting on encrypted network traffic. In Network and Distributed System Security Symposium (NDSS), Vol. 27.
- [32] (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
- [33] (2024) MambaByte: token-free selective state space model. arXiv preprint arXiv:2401.13660. Published at COLM 2024.
- [34] (2024) Lens: a foundation model for network traffic in cybersecurity. arXiv e-prints, arXiv:2402.
- [35] (2024) Netmamba: efficient network traffic classification via pre-training unidirectional mamba. In 2024 IEEE 32nd International Conference on Network Protocols (ICNP), pp. 1–11.
- [36] (2026) NetMamba+: a framework of pre-trained models for efficient and accurate network traffic classification. arXiv:2601.21792.
- [37] (2017) End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In 2017 IEEE International Conference on Intelligence and Security Informatics, ISI 2017, Beijing, China, July 22-24, 2017, pp. 43–48.
- [38] (2017) Malware traffic classification using convolutional neural network for representation learning. In 2017 International conference on information networking (ICOIN), pp. 712–717.
- [39] (2017) Malware traffic classification using convolutional neural network for representation learning. In 2017 International conference on information networking (ICOIN), pp. 712–717.
- [40] (2025) SoK: decoding the enigma of encrypted network traffic classifiers. In 2025 IEEE Symposium on Security and Privacy (SP), pp. 1825–1843.
- [41] (2022) EBSNN: extended byte segment neural network for network traffic classification. IEEE Transactions on Dependable and Secure Computing 19 (5), pp. 3521–3538.
- [42] (2025) ET-mamba: a mamba model for encrypted traffic classification. Information 16 (4), pp. 314.
- [43] (2022) ByT5: towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics 10, pp. 291–306.
- [44] (2023) Tfe-gnn: a temporal fusion encoder using graph neural networks for fine-grained encrypted traffic classification. In Proceedings of the ACM Web Conference 2023, pp. 2066–2075.
- [45] (2023) TFE-gnn: a temporal fusion encoder using graph neural networks for fine-grained encrypted traffic classification. In Proceedings of the ACM Web Conference 2023, WWW ’23, New York, NY, USA, pp. 2066–2075. ISBN 9781450394161.
- [46] (2023) Yet another traffic classifier: a masked autoencoder based traffic transformer with multi-level flow representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 5420–5427.
- [47] (2023) Yet another traffic classifier: a masked autoencoder based traffic transformer with multi-level flow representation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 37, pp. 5420–5427.
- [48] (2025) Trafficformer: an efficient pre-trained model for traffic data. In 2025 IEEE Symposium on Security and Privacy (SP), pp. 1844–1860.
Appendix
| Architecture | Param (M) | Batch | Forward time (ms) | Backward time (ms) | Eval time (ms) | Forward peak memory (MiB) | Backward peak memory (MiB) | Eval peak memory (MiB) |
|---|---|---|---|---|---|---|---|---|
| Std pos | 2.5 | 8 | 3.69 | 9.71 | 3.28 | 364.02 | 476.56 | 164.98 |
| Std pos (Mamba-1) | 2.6 | 8 | 4.35 | 13.28 | 4.38 | 395.61 | 455.60 | 190.68 |
| Stride (4) | 2.2 | 8 | 3.51 | 9.34 | 3.26 | 144.80 | 185.01 | 110.62 |
| Stride (4,Mamba-1) | 2.3 | 8 | 2.20 | 4.54 | 1.86 | 165.74 | 183.07 | 129.55 |
| No emb proj | 2.2 | 8 | 3.57 | 9.56 | 3.23 | 398.01 | 511.18 | 240.53 |
| 4 layers | 2.5 | 8 | 3.69 | 10.11 | 3.30 | 461.86 | 574.39 | 261.13 |
| 1 layers | 1.3 | 8 | 1.42 | 3.82 | 1.09 | 301.93 | 414.46 | 256.87 |
| d_state 32 | 2.5 | 8 | 3.79 | 10.15 | 3.32 | 496.84 | 613.76 | 294.22 |
| d_state 64 | 2.6 | 8 | 3.71 | 10.13 | 3.33 | 527.68 | 652.96 | 321.67 |
| d_state 128 | 2.7 | 8 | 3.75 | 10.32 | 3.39 | 571.56 | 712.35 | 355.45 |
| Transformer | 2.5 | 8 | 21.67 | 26.85 | 17.97 | 3194.19 | 3332.52 | 783.94 |
| Linear transformer | 2.5 | 8 | 3.67 | 8.93 | 3.56 | 714.78 | 714.78 | 339.63 |
| Std pos | 2.5 | 16 | 5.32 | 13.68 | 5.45 | 911.62 | 1108.89 | 497.67 |
| Std pos (Mamba-1) | 2.6 | 16 | 8.52 | 30.62 | 8.62 | 936.27 | 1055.16 | 506.89 |
| Stride (4) | 2.2 | 16 | 3.65 | 9.52 | 3.27 | 383.69 | 447.70 | 309.48 |
| Stride (4,Mamba-1) | 2.3 | 16 | 2.64 | 5.48 | 2.44 | 392.05 | 425.63 | 312.30 |
| No emb proj | 2.2 | 16 | 4.77 | 12.52 | 4.89 | 823.79 | 1020.06 | 496.56 |
| 4 layers | 2.5 | 16 | 5.32 | 13.65 | 5.45 | 912.11 | 1109.38 | 497.39 |
| 1 layers | 1.3 | 16 | 2.00 | 5.00 | 1.94 | 566.94 | 764.21 | 471.33 |
| d_state 32 | 2.5 | 16 | 5.40 | 13.86 | 5.51 | 922.80 | 1127.14 | 502.41 |
| d_state 64 | 2.6 | 16 | 5.82 | 14.76 | 5.94 | 940.79 | 1161.57 | 513.52 |
| d_state 128 | 2.7 | 16 | 6.49 | 16.73 | 6.61 | 979.90 | 1232.80 | 535.08 |
| Transformer | 2.5 | 16 | 43.26 | 51.70 | 35.54 | 6185.40 | 6461.41 | 1354.12 |
| Linear transformer | 2.5 | 16 | 6.61 | 15.78 | 6.59 | 1190.87 | 1190.87 | 427.73 |
| Std pos | 2.5 | 32 | 10.36 | 25.35 | 10.57 | 1571.90 | 1964.18 | 742.45 |
| Std pos (Mamba-1) | 2.6 | 32 | 16.13 | 74.69 | 16.76 | 1611.04 | 1847.63 | 761.45 |
| Stride (4) | 2.2 | 32 | 3.69 | 9.71 | 3.33 | 527.36 | 654.59 | 371.14 |
| Stride (4,Mamba-1) | 2.3 | 32 | 4.44 | 10.12 | 4.37 | 544.86 | 610.96 | 375.98 |
| No emb proj | 2.2 | 32 | 9.38 | 23.30 | 9.57 | 1397.49 | 1790.39 | 744.55 |
| 4 layers | 2.5 | 32 | 10.36 | 25.36 | 10.57 | 1574.63 | 1967.98 | 747.77 |
| 1 layers | 1.3 | 32 | 3.77 | 8.89 | 3.80 | 891.71 | 1285.36 | 696.18 |
| d_state 32 | 2.5 | 32 | 10.49 | 25.90 | 10.69 | 1602.91 | 2011.12 | 758.89 |
| d_state 64 | 2.6 | 32 | 11.35 | 27.67 | 11.49 | 1639.76 | 2080.56 | 780.65 |
| d_state 128 | 2.7 | 32 | 12.73 | 31.73 | 12.90 | 1719.38 | 2224.92 | 823.52 |
| Transformer | 2.5 | 32 | 86.41 | 102.11 | 71.03 | 12129.91 | 12681.76 | 2460.00 |
| Linear transformer | 2.5 | 32 | 13.00 | 30.16 | 12.76 | 2161.72 | 2161.72 | 607.01 |
| Std pos | 2.5 | 64 | 20.80 | 48.12 | 20.80 | 2907.68 | 3691.31 | 1237.62 |
| Std pos (Mamba-1) | 2.6 | 64 | 31.26 | 156.31 | 33.94 | 2981.16 | 3452.22 | 1276.67 |
| Stride (4) | 2.2 | 64 | 5.00 | 13.15 | 5.10 | 816.58 | 1039.42 | 489.64 |
| Stride (4,Mamba-1) | 2.3 | 64 | 8.37 | 19.74 | 8.42 | 846.53 | 978.50 | 500.13 |
| No emb proj | 2.2 | 64 | 18.48 | 43.97 | 18.82 | 2553.30 | 3336.44 | 1236.38 |
| 4 layers | 2.5 | 64 | 20.77 | 47.79 | 20.88 | 2903.16 | 3686.94 | 1235.64 |
| 1 layers | 1.3 | 64 | 7.39 | 16.38 | 7.55 | 1531.21 | 2314.84 | 1133.96 |
| d_state 32 | 2.5 | 64 | 21.09 | 48.92 | 21.01 | 2945.52 | 3760.53 | 1259.53 |
| d_state 64 | 2.6 | 64 | 22.34 | 53.79 | 22.68 | 3027.67 | 3907.00 | 1302.30 |
| d_state 128 | 2.7 | 64 | 25.04 | 60.62 | 25.48 | 3189.62 | 4196.85 | 1389.24 |
| Transformer | 2.5 | 64 | OOM | OOM | OOM | OOM | OOM | OOM |
| Linear transformer | 2.5 | 64 | 24.33 | 56.68 | 24.02 | 4018.61 | 4018.61 | 957.22 |
| Std pos | 2.5 | 128 | 40.70 | 95.35 | 41.17 | 5568.93 | 7134.06 | 2219.62 |
| Std pos (Mamba-1) | 2.6 | 128 | 68.26 | 313.53 | 71.66 | 5716.04 | 6655.32 | 2297.93 |
| Stride (4) | 2.2 | 128 | 9.74 | 24.33 | 10.04 | 1383.47 | 1826.24 | 727.33 |
| Stride (4,Mamba-1) | 2.3 | 128 | 16.03 | 40.63 | 16.23 | 1444.04 | 1705.94 | 748.99 |
| No emb proj | 2.2 | 128 | 36.58 | 86.11 | 37.33 | 4865.29 | 6430.75 | 2219.34 |
| 4 layers | 2.5 | 128 | 40.56 | 95.22 | 41.20 | 5566.41 | 7131.54 | 2219.71 |
| 1 layers | 1.3 | 128 | 14.54 | 32.46 | 14.89 | 2815.73 | 4380.90 | 2018.03 |
| d_state 32 | 2.5 | 128 | 41.19 | 96.98 | 41.78 | 5647.66 | 7277.54 | 2260.63 |
| d_state 64 | 2.6 | 128 | 44.46 | 106.05 | 45.07 | 5810.08 | 7568.36 | 2347.53 |
| d_state 128 | 2.7 | 128 | 49.96 | 120.06 | 50.68 | 6131.53 | 8145.29 | 2520.49 |
| Transformer | 2.5 | 128 | OOM | OOM | OOM | OOM | OOM | OOM |
| Linear transformer | 2.5 | 128 | 48.56 | 107.18 | 47.53 | 7778.24 | 7778.24 | 1657.55 |
| Std pos | 2.5 | 256 | 81.71 | 184.59 | 87.96 | 10892.18 | 14022.54 | 4182.62 |
| Std pos (Mamba-1) | 2.6 | 256 | 142.56 | 624.64 | 142.56 | 11185.15 | 13062.64 | 4338.48 |
| Stride (4) | 2.2 | 256 | 19.45 | 46.30 | 19.77 | 2520.27 | 3404.49 | 1198.21 |
| Stride (4,Mamba-1) | 2.3 | 256 | 31.34 | 89.58 | 33.49 | 2635.81 | 3157.05 | 1241.11 |
| No emb proj | 2.2 | 256 | 74.08 | 169.53 | 75.54 | 9487.40 | 12617.04 | 4181.06 |
| 4 layers | 2.5 | 256 | 81.82 | 184.24 | 87.84 | 10889.76 | 14019.57 | 4179.79 |
| 1 layers | 1.3 | 256 | 28.92 | 61.77 | 29.70 | 5379.59 | 8509.31 | 3777.90 |
| d_state 32 | 2.5 | 256 | 82.97 | 188.53 | 89.14 | 11049.19 | 14306.66 | 4264.63 |
| d_state 64 | 2.6 | 256 | 89.51 | 209.09 | 92.47 | 11370.64 | 14884.23 | 4436.43 |
| d_state 128 | 2.7 | 256 | 100.38 | 239.32 | 101.51 | 12013.63 | 16039.47 | 4781.30 |
| Transformer | 2.5 | 256 | OOM | OOM | OOM | OOM | OOM | OOM |
| Linear transformer | 2.5 | 256 | 97.73 | 208.76 | 95.67 | 15311.50 | 15311.50 | 3056.88 |
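
The relative costs implied by the table above can be read off with a few divisions. The following is a minimal sketch, not part of the original benchmark code: the numbers are copied verbatim from the rows above, the helper `ratio` and the column constants are hypothetical names introduced for illustration, and "Std pos" is assumed to be the default Mamba-2 configuration.

```python
# Values copied verbatim from the table above.
# Each entry: (forward_ms, backward_ms, forward_peak_mib)
rows = {
    ("Std pos", 256): (81.71, 184.59, 10892.18),
    ("Std pos (Mamba-1)", 256): (142.56, 624.64, 11185.15),
    ("Std pos", 32): (10.36, 25.35, 1571.90),
    ("Transformer", 32): (86.41, 102.11, 12129.91),
}

# Hypothetical column indices into the tuples above.
FORWARD, BACKWARD, FORWARD_MEM = 0, 1, 2


def ratio(baseline, reference, column):
    """Return how many times larger `baseline` is than `reference` in one column."""
    return rows[baseline][column] / rows[reference][column]


# Backward-pass time of the Mamba-1 variant relative to "Std pos" at batch 256.
mamba1_backward = ratio(("Std pos (Mamba-1)", 256), ("Std pos", 256), BACKWARD)

# Transformer relative to "Std pos" at batch 32, the largest batch
# for which the Transformer row reports numbers instead of OOM.
transformer_forward = ratio(("Transformer", 32), ("Std pos", 32), FORWARD)
transformer_memory = ratio(("Transformer", 32), ("Std pos", 32), FORWARD_MEM)

print(f"Mamba-1 / Std pos backward time (batch 256): {mamba1_backward:.2f}x")
print(f"Transformer / Std pos forward time (batch 32): {transformer_forward:.2f}x")
print(f"Transformer / Std pos forward peak memory (batch 32): {transformer_memory:.2f}x")
```

Ratios are computed within a single batch size only, since the rows for different batch sizes are not directly comparable.
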

| Architecture | Params (M) | Batch | Forward time (ms) | Backward time (ms) | Eval time (ms) | Forward peak memory (MiB) | Backward peak memory (MiB) | Eval peak memory (MiB) |
|---|---|---|---|---|---|---|---|---|
| Std pos | 0.2 | 8 | 2.25 | 5.24 | 1.86 | 105.51 | 154.54 | 62.29 |
| Std pos (Mamba-1) | 0.3 | 8 | 1.40 | 2.66 | 1.11 | 99.59 | 116.05 | 56.25 |
| Stride (4) | 0.2 | 8 | 2.09 | 5.06 | 1.80 | 35.81 | 54.43 | 33.01 |
| Stride (4,Mamba-1) | 0.2 | 8 | 1.34 | 2.52 | 1.06 | 35.00 | 39.55 | 31.88 |
| No emb proj | 0.2 | 8 | 2.01 | 5.09 | 1.76 | 81.00 | 130.07 | 68.10 |
| 4 layers | 0.3 | 8 | 3.76 | 9.66 | 3.38 | 145.37 | 194.44 | 70.62 |
| 1 layers | 0.2 | 8 | 1.43 | 3.29 | 1.10 | 99.71 | 148.78 | 68.28 |
| d_state 32 | 0.3 | 8 | 2.13 | 5.56 | 1.88 | 119.25 | 171.66 | 75.12 |
| d_state 64 | 0.3 | 8 | 2.22 | 5.48 | 1.91 | 126.47 | 185.58 | 80.81 |
| d_state 128 | 0.3 | 8 | 2.16 | 5.39 | 1.90 | 207.24 | 279.82 | 147.37 |
| Transformer | 0.3 | 8 | 10.26 | 12.14 | 8.27 | 1608.67 | 1759.58 | 528.26 |
| Linear transformer | 0.3 | 8 | 1.29 | 2.83 | 0.97 | 154.87 | 161.31 | 71.46 |
| Std pos | 0.2 | 16 | 2.23 | 5.78 | 1.90 | 208.86 | 306.95 | 122.22 |
| Std pos (Mamba-1) | 0.3 | 16 | 1.60 | 4.89 | 1.41 | 195.29 | 228.17 | 106.10 |
| Stride (4) | 0.2 | 16 | 2.11 | 5.20 | 1.90 | 67.75 | 103.16 | 61.13 |
| Stride (4,Mamba-1) | 0.2 | 16 | 1.36 | 2.47 | 1.09 | 63.52 | 73.48 | 56.65 |
| No emb proj | 0.2 | 16 | 2.07 | 5.18 | 1.80 | 149.42 | 247.51 | 122.10 |
| 4 layers | 0.3 | 16 | 3.84 | 10.12 | 3.47 | 268.95 | 367.04 | 122.30 |
| 1 layers | 0.2 | 16 | 1.45 | 3.63 | 1.11 | 178.82 | 276.91 | 115.94 |
| d_state 32 | 0.3 | 16 | 2.27 | 5.84 | 1.88 | 214.10 | 318.89 | 125.92 |
| d_state 64 | 0.3 | 16 | 2.24 | 5.79 | 1.88 | 225.34 | 343.51 | 133.32 |
| d_state 128 | 0.3 | 16 | 2.25 | 5.90 | 1.88 | 247.67 | 392.60 | 148.57 |
| Transformer | 0.3 | 16 | 20.18 | 22.96 | 16.41 | 3181.91 | 3485.72 | 1020.46 |
| Linear transformer | 0.3 | 16 | 1.87 | 4.51 | 1.64 | 268.68 | 281.36 | 99.50 |
| Std pos | 0.2 | 32 | 2.35 | 6.85 | 2.23 | 377.21 | 573.34 | 204.67 |
| Std pos (Mamba-1) | 0.3 | 32 | 2.79 | 10.99 | 2.75 | 350.92 | 416.64 | 172.41 |
| Stride (4) | 0.2 | 32 | 2.11 | 5.19 | 1.85 | 96.66 | 151.42 | 82.68 |
| Stride (4,Mamba-1) | 0.2 | 32 | 1.33 | 2.56 | 1.08 | 88.08 | 106.14 | 71.99 |
| No emb proj | 0.2 | 32 | 2.08 | 5.71 | 1.83 | 258.19 | 454.33 | 204.55 |
| 4 layers | 0.3 | 32 | 3.89 | 11.15 | 3.74 | 497.49 | 693.70 | 204.74 |
| 1 layers | 0.2 | 32 | 1.58 | 4.56 | 1.53 | 317.07 | 513.20 | 192.13 |
| d_state 32 | 0.3 | 32 | 2.35 | 6.85 | 2.29 | 388.45 | 598.34 | 212.06 |
| d_state 64 | 0.3 | 32 | 2.60 | 7.35 | 2.47 | 409.39 | 645.77 | 226.83 |
| d_state 128 | 0.3 | 32 | 2.93 | 8.31 | 2.86 | 453.06 | 744.03 | 256.38 |
| Transformer | 0.3 | 32 | 38.54 | 42.91 | 31.00 | 6322.62 | 6930.17 | 1999.58 |
| Linear transformer | 0.3 | 32 | 3.51 | 8.04 | 3.15 | 502.76 | 502.76 | 159.11 |
| Std pos | 0.2 | 64 | 4.42 | 10.05 | 4.41 | 716.18 | 1109.25 | 369.57 |
| Std pos (Mamba-1) | 0.3 | 64 | 5.45 | 21.29 | 5.66 | 668.55 | 800.22 | 306.08 |
| Stride (4) | 0.2 | 64 | 2.11 | 5.31 | 1.87 | 151.76 | 261.22 | 124.97 |
| Stride (4,Mamba-1) | 0.2 | 64 | 1.51 | 3.30 | 1.34 | 135.19 | 171.27 | 105.12 |
| No emb proj | 0.2 | 64 | 3.37 | 8.39 | 3.30 | 477.56 | 869.82 | 369.44 |
| 4 layers | 0.3 | 64 | 7.22 | 17.46 | 7.25 | 956.58 | 1350.31 | 369.64 |
| 1 layers | 0.2 | 64 | 3.02 | 6.34 | 2.96 | 597.43 | 990.36 | 344.52 |
| d_state 32 | 0.3 | 64 | 4.52 | 10.59 | 4.53 | 739.73 | 1158.74 | 384.33 |
| d_state 64 | 0.3 | 64 | 4.89 | 11.47 | 4.91 | 783.17 | 1254.78 | 413.87 |
| d_state 128 | 0.3 | 64 | 5.67 | 13.82 | 5.65 | 866.12 | 1446.16 | 473.90 |
| Transformer | 0.3 | 64 | 79.26 | 84.11 | 63.95 | 12608.90 | 13821.93 | 3959.27 |
| Linear transformer | 0.3 | 64 | 6.72 | 12.67 | 6.15 | 960.58 | 960.58 | 278.31 |
| Std pos | 0.2 | 128 | 8.62 | 18.79 | 8.71 | 1392.42 | 2176.99 | 699.36 |
| Std pos (Mamba-1) | 0.3 | 128 | 11.03 | 42.39 | 11.51 | 1278.08 | 1541.18 | 570.62 |
| Stride (4) | 0.2 | 128 | 2.22 | 5.78 | 1.91 | 262.84 | 481.72 | 208.65 |
| Stride (4,Mamba-1) | 0.2 | 128 | 2.42 | 6.03 | 2.34 | 229.61 | 301.72 | 169.76 |
| No emb proj | 0.2 | 128 | 6.62 | 15.74 | 6.64 | 916.99 | 1701.42 | 699.23 |
| 4 layers | 0.3 | 128 | 14.25 | 33.14 | 14.39 | 1873.92 | 2658.39 | 699.43 |
| 1 layers | 0.2 | 128 | 5.91 | 11.66 | 5.93 | 1151.70 | 1936.14 | 649.29 |
| d_state 32 | 0.3 | 128 | 8.85 | 19.72 | 8.89 | 1436.49 | 2275.90 | 728.88 |
| d_state 64 | 0.3 | 128 | 9.62 | 21.63 | 9.61 | 1518.72 | 2464.19 | 788.15 |
| d_state 128 | 0.3 | 128 | 11.13 | 26.32 | 11.11 | 1687.40 | 2847.04 | 906.02 |
| Transformer | 0.3 | 128 | OOM | OOM | OOM | OOM | OOM | OOM |
| Linear transformer | 0.3 | 128 | 13.19 | 24.05 | 12.01 | 1880.54 | 1880.54 | 516.73 |
| Std pos | 0.2 | 256 | 16.92 | 36.63 | 17.02 | 2747.06 | 4318.06 | 1359.03 |
| Std pos (Mamba-1) | 0.3 | 256 | 21.32 | 84.31 | 22.30 | 2518.75 | 3044.30 | 1100.66 |
| Stride (4) | 0.2 | 256 | 3.50 | 8.72 | 3.44 | 487.27 | 925.69 | 377.65 |
| Stride (4,Mamba-1) | 0.2 | 256 | 4.70 | 11.50 | 4.75 | 425.00 | 568.34 | 298.22 |
| No emb proj | 0.2 | 256 | 13.05 | 30.21 | 13.05 | 1794.09 | 3366.64 | 1358.90 |
| 4 layers | 0.3 | 256 | 27.94 | 64.61 | 28.00 | 3710.08 | 5280.28 | 1359.10 |
| 1 layers | 0.2 | 256 | 11.34 | 22.58 | 11.40 | 2265.16 | 3834.67 | 1259.13 |
| d_state 32 | 0.3 | 256 | 17.19 | 38.31 | 17.23 | 2832.97 | 4509.78 | 1418.07 |
| d_state 64 | 0.3 | 256 | 18.59 | 42.36 | 18.84 | 3000.35 | 4890.60 | 1536.15 |
| d_state 128 | 0.3 | 256 | 21.77 | 52.10 | 21.60 | 3336.10 | 5654.90 | 1772.49 |
| Transformer | 0.3 | 256 | OOM | OOM | OOM | OOM | OOM | OOM |
| Linear transformer | 0.3 | 256 | 25.41 | 46.96 | 23.09 | 3701.21 | 3701.21 | 993.56 |