O(N) complexity seizure detection via dual-stack state-space architecture
Current Status (v4.2.0):
- FLA Baseline: Epoch 30+, plateaued at 0.257 for 13 epochs (best: 0.284 @ epoch 9), early stop ~epoch 36
- Exp4 (Cyclic LR): Ready to launch - SGDR restarts to escape local minimum
- ⏸️ BiMamba2: Paused (focusing on local training due to cost)
50 million people worldwide suffer from epilepsy. Continuous EEG monitoring in ICUs could catch seizures early, but current systems fail at a critical bottleneck: false alarm fatigue.
At 10 false alarms per 24 hours, clinical staff stop responding. The gold standard? <1 false alarm per day while maintaining >75% seizure detection. That's what we're building.
Seizures aren't just temporal patterns or spatial patterns; they're both simultaneously:
- Temporal dynamics: Multi-scale patterns from milliseconds (spike transients) → seconds (rhythmic activity) → minutes (ictal evolution)
- Spatial propagation: Time-varying electrode connectivity as seizures propagate through neural networks (e.g., C3 → C4 → P3)
Traditional approaches fail because they treat these as separate problems. We model them jointly via time-then-graph ordering.
Controlled A/B comparison of two state-space architectures on an identical pipeline:
- What: Mamba2 with bidirectional processing
- Status: ⏸️ PAUSED at Epoch 6 (Modal A100, $1.1k spent, checkpoints backed up in backups/modal_bimamba2_epoch6/)
- Foundation: Fast CUDA kernels, selective state propagation (Gu & Dao 2023)
- Motivation: Proven SSM architecture with O(N) efficiency
- What: FLA (Flash Linear Attention) with gating + delta rule
- Status: Baseline Complete - 0.284 @ 10 FA/24h (epoch 9), 🧪 Exp1 running (testing regularization)
- Foundation: Beats Mamba2 on language modeling (ICLR 2025)
- Hypothesis: Better for EEG's abrupt context switches (seizure onsets)
- Next: Resume baseline with patience=20 to test "second peak" hypothesis (after exp1 completes)
Why both? Seizures have abrupt onsets (need memory clearing via gating) and persistent patterns (need selective retention via delta rule). Gated Delta theoretically handles both. But does theory match clinical reality? That's what we're testing.
Research transparency: All three outcomes (Gated Delta wins, BiMamba2 wins, or tie) are scientifically valuable. No prior work compares these architectures on clinical EEG analysis. See docs/04-model/flash-linear-attention/FLA_ROADMAP.md for full strategy.
EvoBrain (NeurIPS 2025) establishes two critical theorems:
- Theorem 1 (Dynamic Graphs): Explicit dynamic modeling (time-varying adjacency) is strictly more expressive than implicit (static graphs)
- Theorem 2 (Temporal Ordering): time-then-graph > time-and-graph > graph-then-time
Intuition: Temporal features must stabilize before graph operations. Processing graph structure first forces simultaneous learning of both patterns, a harder optimization landscape.
Empirical: EvoBrain achieves 95% AUROC on TUSZ (+23% over baselines).
Problem scale: 60-second EEG windows at 256Hz = 15,360 samples per channel. Traditional Transformers:
- Attention cost: O(N²) = 236M operations per layer
- Memory: O(N²) = 900MB just for attention matrices (batch=1)
- Inference: 8 Hz/batch (too slow for clinical real-time)
State-space solution: Mamba/GatedDelta achieve O(N) via selective state propagation:
- Cost: 15K operations (~15,000× reduction, since N²/N = N = 15,360)
- Memory: O(N) = 60KB per layer
- Inference: 128 Hz/batch (EEG-Mamba 2024) vs 8 Hz/batch for Transformers
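The scale figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes float32 storage (4 bytes per value) and counts one operation per attention pair or per recurrent step; it ignores heads, hidden dimensions, and constant factors.

```python
# Back-of-the-envelope check of the quoted scale figures.
SAMPLE_RATE_HZ = 256
WINDOW_SECONDS = 60
BYTES_PER_FLOAT32 = 4  # assumption: values stored as float32

n = SAMPLE_RATE_HZ * WINDOW_SECONDS              # sequence length per channel
attention_pairs = n * n                          # O(N^2) pairwise scores
attention_bytes = attention_pairs * BYTES_PER_FLOAT32
linear_steps = n                                 # O(N) recurrent state updates
linear_bytes = n * BYTES_PER_FLOAT32

print(f"N = {n:,} samples")
print(f"attention pairs = {attention_pairs / 1e6:.0f}M")
print(f"attention memory = {attention_bytes / 2**20:.0f} MiB")
print(f"linear memory = {linear_bytes / 1024:.0f} KiB")
print(f"quadratic / linear = {attention_pairs // linear_steps:,}x")
```

The ratio between the quadratic and linear counts is simply N, which is why the gap widens as windows get longer.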
    EEG Input (B, 19 channels, 15360 samples @ 256Hz = 60s)
                           │
                           ▼
    ┌─────────────────────────────────────────────┐
    │ TCN ENCODER (8 layers, 16× downsampling)    │
    │ - Multi-scale temporal decomposition        │
    │ - Dilations: 1→2→4→8→16→32→64→128           │
    │ - Output: (B, 512, 960) compressed features │
    └─────────────────────────────────────────────┘
                           │
                           ▼
    ┌─────────────────────────────────────────────┐
    │ PROJECTION → Per-Electrode Features         │
    │ - 512 channels → 19 electrodes × 64 dims    │
    │ - Output: (B, 19, 960, 64)                  │
    └─────────────────────────────────────────────┘
                           │
           ┌───────────────┼────────────────┐
           ▼               ▼                ▼
      ┌─────────┐     ┌─────────┐     ┌───────────┐
      │  NODE   │     │  EDGE   │     │ ADJACENCY │
      │  SSM    │     │  SSM    │     │ ASSEMBLY  │
      │  (19×)  │     │ (171×)  │     │ (learned) │
      └────┬────┘     └────┬────┘     └─────┬─────┘
           │               │                │
           │               └───────┬────────┘
           │                       ▼
           │           ┌────────────────────────┐
           │           │ DYNAMIC LAPLACIAN PE   │
           │           │ - k=16 eigenvectors    │
           │           │ - Every 5 timesteps    │
           │           └───────────┬────────────┘
           │                       ▼
           │           ┌────────────────────────┐
           │           │ GNN (2× SSGConv)       │
           │           │ - Spatial aggregation  │
           │           │ - Alpha=0.05           │
           │           └───────────┬────────────┘
           │                       │
           └───────────────────────┼──► (B, 19, 960, 128)
                                   ▼
                          ┌──────────────────┐
                          │ GATED FUSION     │
                          │ - 4-head combine │
                          │ - Node + spatial │
                          └────────┬─────────┘
                                   ▼
                          ┌──────────────────┐
                          │ DECODER          │
                          │ - Upsample 16×   │
                          │ - Per-sample     │
                          └────────┬─────────┘
                                   ▼
                           (B, 15360) logits
Key: SSM boxes = 🔷 BiMamba2 (Stack 1) or 🔶 Gated DeltaNet (Stack 2)
Everything else is identical: TCN frontend, GNN backend, fusion layer. Only the temporal core changes.
Temporal Convolutional Networks (Bai et al. 2018):
- Parallelism: Entire 60s window processed simultaneously (vs sequential RNN)
- Multi-scale: Dilated convolutions capture patterns at exponentially growing timescales:
  - Layer 1 (dilation=1): 50ms receptive field (spike detection)
  - Layer 4 (dilation=8): 400ms (rhythmic patterns)
  - Layer 8 (dilation=128): 6.4s (ictal evolution)
- Stable gradients: Residual connections prevent vanishing gradients
Tradeoff: O(N log N) complexity due to dilation, but negligible for N=15K.
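The dilation schedule and the timescales quoted above follow a simple doubling pattern. The sketch below only reproduces that arithmetic; it takes the 50ms span for dilation=1 from the list above as given and scales it linearly with dilation, which is an assumption about how those figures were derived (kernel sizes are not stated here).

```python
# Doubling dilation schedule and the per-layer spans it implies.
SAMPLE_RATE_HZ = 256
WINDOW_SAMPLES = 15360
NUM_LAYERS = 8
DOWNSAMPLE = 16
BASE_SPAN_MS = 50  # span quoted for dilation=1; taken as given

dilations = [2 ** i for i in range(NUM_LAYERS)]
span_ms = {layer: BASE_SPAN_MS * d for layer, d in enumerate(dilations, start=1)}
encoded_steps = WINDOW_SAMPLES // DOWNSAMPLE  # temporal length after the encoder

for layer, d in enumerate(dilations, start=1):
    print(f"layer {layer}: dilation={d:>3}, span={span_ms[layer] / 1000:.2f}s")
print(f"encoder output length: {encoded_steps} steps")
```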
Core innovation: Selective state propagation with data-dependent gates

$$S_t = \alpha_t \odot S_{t-1} + v_t \otimes k_t^{\top} \qquad \text{(forget via } \alpha_t \text{, update via } v_t \otimes k_t^{\top} \text{)}$$

$$o_t = S_t\, q_t \qquad \text{(retrieve)}$$

Where $\alpha_t \in (0,1)$ controls per-timestep memory decay (not global like RNNs).
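The gated recurrence above can be traced by hand on a tiny state. This is a pure-Python sketch with a scalar gate and 2-dimensional keys and values; the real kernels are fused CUDA implementations with per-channel gates, so the helper names here (`outer`, `matvec`, `gated_step`) are illustrative only.

```python
def outer(v, k):
    """Outer product v k^T as a nested list (rows follow v, columns follow k)."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, q):
    """Retrieve o = S q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]

def gated_step(S, alpha, k, v):
    """One step of S_t = alpha * S_{t-1} + v k^T with a scalar gate alpha."""
    update = outer(v, k)
    return [[alpha * s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(S, update)]

S = [[0.0, 0.0], [0.0, 0.0]]

# Store value [1, 0] under key [1, 0]; the gate is irrelevant while S is zero.
S = gated_step(S, alpha=1.0, k=[1.0, 0.0], v=[1.0, 0.0])
before = matvec(S, [1.0, 0.0])

# A small alpha at the next step decays the old association
# while a new one is written under key [0, 1].
S = gated_step(S, alpha=0.1, k=[0.0, 1.0], v=[0.0, 1.0])
after = matvec(S, [1.0, 0.0])
fresh = matvec(S, [0.0, 1.0])

print(before, after, fresh)
```

Querying with the first key returns the full value before the low-gate step and only a tenth of it afterwards, which is the "memory clearing" behaviour the gate provides.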
Node Stream (19 parallel SSMs):
- Purpose: Model per-electrode temporal dynamics independently
- Config: 6 layers, d_model=64, d_state=16, bidirectional
- Example: Rhythmic spiking in C3 electrode evolves independently
- Parameters: 7.2M
Edge Stream (171 pairwise SSMs):
- Purpose: Model inter-electrode connectivity strength over time
- Config: 2 layers, d_model=16, d_state=8, bidirectional
- Example: C3-C4 coherence increases during seizure propagation
- Parameters: 1.2M
Total SSM: 8.4M parameters, O(N) complexity
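The 171 edge streams correspond to the unordered electrode pairs of a 19-channel montage. A quick sketch of that count; the electrode labels and their order below are an assumption (a common 10-20 naming), not necessarily the order used in this repo.

```python
from itertools import combinations

# Assumed 19-channel 10-20 labels; the repo's actual channel order may differ.
ELECTRODES = [
    "Fp1", "Fp2", "F7", "F3", "Fz", "F4", "F8",
    "T3", "C3", "Cz", "C4", "T4",
    "T5", "P3", "Pz", "P4", "T6", "O1", "O2",
]

# One edge per unordered pair (i < j): n * (n - 1) / 2 of them.
edges = list(combinations(range(len(ELECTRODES)), 2))
edge_names = [(ELECTRODES[i], ELECTRODES[j]) for i, j in edges]

print(len(ELECTRODES), "nodes,", len(edges), "edges")
print("C3-C4 is an edge:", ("C3", "C4") in edge_names)
```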
Key difference: Adds delta rule on top of gating
Delta rule: Selective key-value updates without forgetting others
Mamba2, global gate (erases everything):

$$S_t = \alpha_t \odot S_{t-1} + \mathrm{update}$$

Gated DeltaNet, targeted update (selective retention):

$$S_t = \alpha_t \odot S_{t-1} + \beta_t \odot \left(v_t \otimes k_t^{\top} - \mathrm{old\_memory}\right)$$

Configuration:
- Node Stream: 6 layers, d_model=512, num_heads=6, headdim=8
- Edge Stream: 2 layers, d_model=32, num_heads=3, headdim=8
Total SSM: ~8.4M parameters (matched to BiMamba2), O(N) complexity
Hypothesis: Delta rule handles EEG better because:
- Gating clears memory during seizure onset (abrupt context switch)
- Delta rule preserves persistent patterns (rhythmic activity continues)
- BiMamba2 has only gating → may "forget" ongoing rhythms during onset
Reality check: This is a hypothesis. Full TUSZ training will tell us if it's true.
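To make the difference concrete, here is a toy comparison of a purely additive (gating-only) update against a gated delta update. It is a sketch, not the FLA kernel: it uses the gated delta rule in the form S_t = α·S_{t-1}·(I − β·k·kᵀ) + β·v·kᵀ with scalar α and β, orthonormal keys, and α = 1 so that only the delta term is being compared.

```python
def matvec(S, q):
    """Retrieve o = S q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]

def additive_step(S, alpha, k, v):
    """Gating only: S_t = alpha * S_{t-1} + v k^T."""
    return [[alpha * s + vi * kj for s, kj in zip(row, k)]
            for row, vi in zip(S, v)]

def gated_delta_step(S, alpha, beta, k, v):
    """Gated delta rule: decay S, then replace what is stored under key k."""
    decayed = [[alpha * s for s in row] for row in S]
    old = matvec(decayed, k)  # value currently associated with k
    return [[s + beta * (vi - oi) * kj for s, kj in zip(row, k)]
            for row, vi, oi in zip(decayed, v, old)]

k_a, k_b = [1.0, 0.0], [0.0, 1.0]
writes = [(k_a, [1.0, 0.0]),   # first association for k_a
          (k_b, [0.0, 1.0]),   # a second, unrelated pattern
          (k_a, [0.0, 2.0])]   # k_a is rewritten with a new value

S_add = [[0.0, 0.0], [0.0, 0.0]]
S_delta = [[0.0, 0.0], [0.0, 0.0]]
for k, v in writes:
    S_add = additive_step(S_add, alpha=1.0, k=k, v=v)
    S_delta = gated_delta_step(S_delta, alpha=1.0, beta=1.0, k=k, v=v)

print("additive  k_a ->", matvec(S_add, k_a))    # old and new values pile up
print("delta     k_a ->", matvec(S_delta, k_a))  # only the new value remains
print("delta     k_b ->", matvec(S_delta, k_b))  # unrelated pattern untouched
```

With gating alone, the only way to remove the stale value under `k_a` is to lower α, which would also shrink what is stored under `k_b`.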
EvoBrain's Theorem 1 proves explicit time-varying adjacency is strictly more expressive than static graphs.
Implementation:
- Compute k=16 eigenvectors of normalized graph Laplacian every 5 timesteps
- Eigenvectors = fixed positional coordinates in spectral space (like Transformer sinusoidal PE)
- Learning happens in GNN layers that process PE, not in PE itself (best practice)
Why top-k=3 neighbors? 3 strongest connections capture 85%+ of spatial variance (validated by EvoBrain on EEG).
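A minimal sketch of the graph side of this step: sparsify a dense edge-weight matrix to each node's top-k neighbours, then build the normalized Laplacian L = I − D^(-1/2) A D^(-1/2). The symmetrization rule (elementwise max) and the toy weights are assumptions for illustration; the eigenvectors themselves would then come from a symmetric eigensolver such as `numpy.linalg.eigh`, which is left out to keep the example dependency-free.

```python
import math

def top_k_adjacency(weights, k=3):
    """Keep each node's k strongest neighbours, then symmetrize with max."""
    n = len(weights)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: weights[i][j], reverse=True)[:k]
        for j in neighbours:
            adj[i][j] = weights[i][j]
    return [[max(adj[i][j], adj[j][i]) for j in range(n)] for i in range(n)]

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

# Toy edge weights for 19 nodes: closer indices are more strongly connected.
N_NODES = 19
weights = [[0.0 if i == j else 1.0 / (1 + abs(i - j)) for j in range(N_NODES)]
           for i in range(N_NODES)]

adj = top_k_adjacency(weights, k=3)
lap = normalized_laplacian(adj)
min_degree = min(sum(1 for a in row if a > 0) for row in adj)
print("min neighbours per node:", min_degree)
```

Because symmetrization can only add edges, every node ends up with at least k neighbours, and the resulting Laplacian is symmetric, which is what a symmetric eigensolver expects.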
Problem: Node stream and GNN produce different feature scales and semantics.
Solution: Multi-head gated fusion learns optimal combination:
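$$g = \sigma\left(W_g\,[\mathrm{node\_out};\ \mathrm{gnn\_out}]\right) \qquad \text{(per-feature gates)}$$

$$\mathrm{fused} = g \odot \mathrm{node\_out} + (1 - g) \odot \mathrm{gnn\_out} \qquad \text{(weighted merge)}$$

This allows the model to emphasize: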
- Node features when electrodes evolve independently (early seizure)
- GNN features when spatial synchronization dominates (propagated seizure)
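The gate formula can be sketched for a single feature vector in plain Python. This is a single-head illustration with hand-picked weights (no bias term, no learned parameters), not the 4-head module used in the model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(node_out, gnn_out, W_g):
    """fused = g * node_out + (1 - g) * gnn_out, with g = sigmoid(W_g [node; gnn])."""
    concat = node_out + gnn_out  # list concatenation, i.e. [node_out; gnn_out]
    gates = [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W_g]
    return [g * n + (1.0 - g) * s for g, n, s in zip(gates, node_out, gnn_out)]

node_out = [1.0, 2.0]
gnn_out = [3.0, 4.0]

# Zero weights give g = 0.5 everywhere: a plain average of the two streams.
balanced = gated_fusion(node_out, gnn_out, [[0.0] * 4, [0.0] * 4])

# Large positive weights push g towards 1: the node stream dominates.
node_heavy = gated_fusion(node_out, gnn_out, [[10.0, 0.0, 0.0, 0.0],
                                              [10.0, 0.0, 0.0, 0.0]])

print(balanced)
print(node_heavy)
```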
| Component | BiMamba2 (Stack 1) | Gated DeltaNet (Stack 2) | Complexity |
|---|---|---|---|
| TCN Encoder | 12.8M | 12.8M (identical) | O(N log N) |
| Node SSM | 7.2M (d_model=64) | 7.2M (d_model=512) | O(N) |
| Edge SSM | 1.2M (d_model=16) | 1.2M (d_model=32) | O(N) |
| GNN + LPE | 6.2M | 6.2M (identical) | O(N·k²) |
| Fusion | 2.1M | 2.1M (identical) | O(N) |
| Decoder | 1.0M | 1.0M (identical) | O(N) |
| Total | 30.5M | 30.5M (matched) | O(N) |
Key: Parameter counts matched for fair comparison. Only Node/Edge SSM layers differ. TCN frontend, GNN backend, fusion, and decoder are 100% identical.
World's largest open-source seizure dataset (Temple University):
- 504 hours of continuous EEG from 592 patients
- 36 hours of seizures (~7% prevalence) → 12:1 class imbalance
- 19-channel 10-20 montage @ 256Hz (clinical standard)
- Patient-based splits (train/dev/eval) → no data leakage
Preprocessing pipeline:
- Bandpass filter: 0.5-120Hz
- Notch filter: 60Hz (removes powerline noise)
- Resample: 256Hz (standardize across recordings)
- Windowing: 60s windows, 10s stride (83% overlap)
- Normalization: Per-channel z-score + clip to ±10σ (removes outliers)
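The windowing and normalization steps above can be sketched without any signal-processing dependencies (filtering and resampling are omitted). Dropping a trailing partial window and using the population standard deviation are assumptions of this sketch, not confirmed details of the cache builder.

```python
import statistics

SAMPLE_RATE_HZ = 256
WINDOW_S, STRIDE_S = 60, 10

def window_starts(n_samples,
                  window=WINDOW_S * SAMPLE_RATE_HZ,
                  stride=STRIDE_S * SAMPLE_RATE_HZ):
    """Start indices of full windows; a trailing partial window is dropped."""
    return list(range(0, n_samples - window + 1, stride))

def zscore_clip(channel, limit=10.0):
    """Per-channel z-score, then clip to +/- limit standard deviations."""
    mean = statistics.fmean(channel)
    std = statistics.pstdev(channel) or 1.0  # guard against flat channels
    return [max(-limit, min(limit, (x - mean) / std)) for x in channel]

# A 5-minute recording yields 25 overlapping 60s windows at a 10s stride.
starts = window_starts(5 * 60 * SAMPLE_RATE_HZ)
overlap = 1 - STRIDE_S / WINDOW_S

# One large artifact among otherwise flat samples gets clipped to +10.
z = zscore_clip([0.0] * 199 + [1000.0])

print(len(starts), f"windows, overlap = {overlap:.0%}")
print("max z after clipping:", max(z))
```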
Our cache system (memory-mapped NPY format):
- Train: 4667 files → 61,616 balanced windows (34.2% seizure ratio via oversampling)
- Dev: 1832 files → 148,224 natural windows (7.7% seizure ratio, real distribution)
- Speed: 99.6% faster startup than NPZ (manifest-based loading)
- Memory: <1 GB RAM vs 387 GB for NPZ
Why oversample training? Standard ML practice: Train on balanced data (model learns seizure patterns), validate on natural distribution (measures real-world performance). See docs/05-training/training-methodology.md for detailed explanation.
Based on verified clinical benchmarks and SOTA research (see docs/00-overview/performance-targets.md for comprehensive analysis):
≤4 FA/24h @ ≥50% sensitivity (NEDC OVERLAP scoring)
- Temple NEDC verified: 4 FA/24h @ ~50% sensitivity (real clinical deployments)
- SeizureTransformer #1: 26.89 FA/24h @ 45.63% sensitivity (TUSZ eval, 2025)
- Our goal: Match or beat Temple's verified clinical benchmark
≤10 FA/24h @ ≥75% sensitivity (NEDC OVERLAP scoring)
- Enables ICU monitoring with manageable alarm fatigue
- Current gap: SeizureTransformer @ 10 FA = 33.90% sensitivity (41-point gap to close)
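For intuition about how the two numbers in each target relate, here is a simplified event-level scorer: a reference seizure counts as detected if any predicted event overlaps it, and a predicted event with no overlapping reference counts as a false alarm. This is an illustrative approximation only; it is not the NEDC scoring software, and reported results should come from the official scorer.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def overlap_metrics(reference, hypothesis, total_hours):
    """Simplified any-overlap scoring: (sensitivity, false alarms per 24h)."""
    hits = sum(any(overlaps(r, h) for h in hypothesis) for r in reference)
    false_alarms = sum(not any(overlaps(h, r) for r in reference)
                       for h in hypothesis)
    sensitivity = hits / len(reference) if reference else float("nan")
    fa_per_24h = false_alarms * 24.0 / total_hours
    return sensitivity, fa_per_24h

# Toy example: event times in seconds over a 12-hour recording.
reference = [(100, 160), (1000, 1100), (5000, 5050), (9000, 9100)]
hypothesis = [(110, 150), (1050, 1200), (3000, 3010), (7000, 7020)]

sensitivity, fa_per_24h = overlap_metrics(reference, hypothesis, total_hours=12)
print(f"sensitivity = {sensitivity:.0%}, FA/24h = {fa_per_24h:.1f}")
```

In this toy case two of four seizures are detected and two predictions are spurious over 12 hours, which works out to 50% sensitivity at 4 FA/24h.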
| Metric | Target | Baseline (SeizureTransformer) | Rationale |
|---|---|---|---|
| AUROC | ≥0.90 | 0.902 (TUSZ eval) | Overall discrimination capability |
| AUPRC | ≥0.40 | Not reported | Better for 12:1 class imbalance |
| F1 Score | ≥0.45 | 0.414 (NEDC OVERLAP) | Balanced precision/recall |
| Outcome | Sensitivity @ 4 FA/24h | Publication Tier |
|---|---|---|
| Breakthrough | ≥60% | Top-tier venue (beats all known systems) |
| Strong | ≥50% | Highly publishable (matches Temple SOTA) |
| Publishable | ≥45% | Solid contribution (architectural novelty) |
| Minimum | ≥40% | Viable if architectural insights clear |
Reality check: Temple NEDC research confirms ROC curves are very steep at low FA rates. 5% absolute sensitivity change = massive FA rate shift. Our dual-stack (BiMamba2 vs Gated DeltaNet) comparison provides scientific value regardless of absolute performance.
Scoring impact: Same predictions can yield 3-16× different FA rates depending on scorer (SzCORE vs NEDC OVERLAP vs NEDC TAES). We use NEDC OVERLAP as primary metric. See docs/06-evaluation/TAES_DISAMBIGUATION.md for critical naming collision explanation.
# 1️⃣ Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2️⃣ Clone repo
git clone https://github.com/clarity-digital-twin/brain-go-brr-v2.git
cd brain-go-brr-v2
# 3️⃣ Setup environment (installs mamba-ssm, PyG)
make setup
make setup-gpu
# Optional: Install FLA for Gated DeltaNet research
make setup-fla
# 4️⃣ Download TUSZ corpus
# Visit: https://isip.piconepress.com/projects/nedc/html/tuh_eeg/index.shtml
# Place in: data_ext4/tusz/edf/
# 5️⃣ Build preprocessing cache (one-time, ~2 hours)
python -m src build-cache \
--data-dir data_ext4/tusz/edf/train \
--cache-dir cache/tusz_mmap/train \
--split train
python -m src build-cache \
--data-dir data_ext4/tusz/edf/dev \
--cache-dir cache/tusz_mmap/dev \
--split dev
# 6️⃣ Smoke test (3 files, 5 minutes)
make smoke-bimamba # Test BiMamba2 stack
make smoke-fla # Test Gated DeltaNet stack
# 7️⃣ Full local training (RTX 4090, ~960 hours / 40 days)
export BGB_NAN_DEBUG=1
tmux new -s train
make train-bimamba # or: make train-fla
# Ctrl+B then D to detach | tmux attach -t train to reattach

Cloud training (Modal A100-80GB) - See docs/05-training/modal.md for details:
# Deploy Modal functions first
modal deploy deploy/modal/app.py
# BiMamba2 production (hands-free, auto-restart)
modal run --detach deploy/modal/app.py \
--action schedule-training \
--config configs/modal/train_bimamba.yaml
# Gated DeltaNet production (hands-free, auto-restart)
modal run --detach deploy/modal/app.py \
--action schedule-training \
--config configs/modal/train_fla.yaml
# Monitor progress
modal app list
modal app logs <app-id>

🚨 CRITICAL: Use --action schedule-training for 100-epoch production runs (auto-restart every 23h). Use --action train ONLY for smoke tests and experiments.
See docs/01-installation/ and docs/05-training/ for complete setup guides.
- Quickstart - 5-minute validation
- First Training Run - Complete walkthrough
- V3 Spec - Full implementation details
- Laplacian PE - Dynamic graph theory
- Stability Evolution - NaN prevention history
- FLA Roadmap - Complete A/B strategy
- FLA Quick Reference - Config guide
- Future Work - Post-training enhancements
- Training Guide - Local + Modal setup
- Training Methodology - Why validation has more batches
- Modal Timeout Guard - Three-layer defense system
- Troubleshooting - Common issues
- NaN Prevention - Gradient stability
We welcome contributions! See docs/09-development/ for:
- Coding Standards (Ruff, mypy, no comments unless requested)
- Testing Strategy (make q before committing)
- Technical Debt (currently zero!)
Zero technical debt policy: All P0/P1/P2 issues resolved before major releases.
@software{brain-go-brr-v4,
title = {Brain-Go-Brr V4: Clinical EEG Seizure Detection via Dual-Stack State-Space Models},
author = {Clarity Digital Twin},
year = {2025},
version = {4.2.0},
url = {https://github.com/clarity-digital-twin/brain-go-brr-v2},
note = {Empirical A/B comparison of BiMamba2 and Flash Linear Attention (BiGatedDeltaNet) architectures on TUSZ}
}

Apache 2.0 - See LICENSE for full text.
Datasets:
- TUH EEG Seizure Corpus (Temple University)
- CHB-MIT Scalp EEG Database (Boston Children's Hospital / MIT)
Foundational Papers:
- EvoBrain (Kotoge et al., NeurIPS 2025) - Time-then-graph paradigm, dynamic graphs
- Mamba (Gu & Dao 2023) - Selective state-space models
- Gated DeltaNet (Yang et al., ICLR 2025) - Memory erasure + delta rule
- SeizureTransformer (Wu et al. 2025) - SOTA baseline, U-Net + Transformer (EpilepsyBench #1)
- EEGMamba (Gui et al. 2024) - Bidirectional Mamba for EEG (speed benchmark)
- TCN (Bai et al. 2018) - Temporal convolutional networks
- Focal Loss (Lin et al. 2017) - Class imbalance handling
Infrastructure & Libraries:
- Modal.com - A100-80GB GPU infrastructure
- PyTorch Geometric - Graph neural networks
- mamba-ssm (Tri Dao) - Mamba2 implementation
- FLA (Songlin Yang) - Gated DeltaNet implementation
Questions? Open an issue • Updates? Watch the repo • Discussion? Start a discussion
Status: v4.2.0 checkpoint resume fixes released • FLA training resumed (Epoch 18/100) • BiMamba2 paused (Epoch 6, backed up) • See STATUS.md for full details