Clarity-Digital-Twin/brain-go-brr-v2


🧠 Brain-Go-Brr V4: Clinical EEG Seizure Detection

O(N) complexity seizure detection via dual-stack state-space architecture

Python 3.11+ | PyTorch 2.5.0 | CUDA 12.4 | License: Apache 2.0 | v4.2.0

Current Status (v4.2.0):

  • 🔄 FLA Baseline: Epoch 30+, plateaued at 0.257 for 13 epochs (best: 0.284 @ epoch 9), early stop ~epoch 36
  • 🚀 Exp4 (Cyclic LR): Ready to launch - SGDR restarts to escape local minimum
  • ⏸️ BiMamba2: Paused (focusing on local training due to cost)

📋 The Clinical Problem

50 million people worldwide suffer from epilepsy. Continuous EEG monitoring in ICUs could catch seizures early, but current systems fail at a critical bottleneck: false alarm fatigue.

At 10 false alarms per 24 hours, clinical staff stop responding. The gold standard? <1 false alarm per day while maintaining >75% seizure detection. That's what we're building.


🎯 The Technical Challenge

Seizures aren't just temporal patterns or spatial patterns; they're both simultaneously:

  • Temporal dynamics: Multi-scale patterns from milliseconds (spike transients) → seconds (rhythmic activity) → minutes (ictal evolution)
  • Spatial propagation: Time-varying electrode connectivity as seizures propagate through neural networks (e.g., C3 → C4 → P3)

Traditional approaches fail because they treat these as separate problems. We model them jointly via time-then-graph ordering.


🔬 Our Approach: Dual-Stack Research Experiment

Controlled A/B comparison of two state-space architectures on an identical pipeline:

🔷 Stack 1: BiMamba2 (Baseline)

  • What: Mamba2 with bidirectional processing
  • Status: ⏸️ PAUSED at Epoch 6 (Modal A100, $1.1k spent, checkpoints backed up in backups/modal_bimamba2_epoch6/)
  • Foundation: Fast CUDA kernels, selective state propagation (Gu & Dao 2023)
  • Motivation: Proven SSM architecture with O(N) efficiency

🔶 Stack 2: Gated DeltaNet (Research Variant)

  • What: FLA (Flash Linear Attention) with gating + delta rule
  • Status: 📊 Baseline Complete - 0.284 @ 10 FA/24h (epoch 9), 🧪 Exp1 running (testing regularization)
  • Foundation: Beats Mamba2 on language modeling (ICLR 2025)
  • Hypothesis: Better for EEG's abrupt context switches (seizure onsets)
  • Next: Resume baseline with patience=20 to test "second peak" hypothesis (after exp1 completes)

Why both? Seizures have abrupt onsets (need memory clearing via gating) and persistent patterns (need selective retention via delta rule). Gated Delta theoretically handles both. But does theory match clinical reality? That's what we're testing.

Research transparency: All three outcomes (Gated Delta wins, BiMamba2 wins, or tie) are scientifically valuable. No prior work compares these architectures on clinical EEG analysis. See docs/04-model/flash-linear-attention/FLA_ROADMAP.md for full strategy.


🏗️ Architecture: Theory & Design

🤔 Why Time-Then-Graph?

EvoBrain (NeurIPS 2025) establishes two critical theorems:

  • Theorem 1 (Dynamic Graphs): Explicit dynamic modeling (time-varying adjacency) is strictly more expressive than implicit (static graphs)
  • Theorem 2 (Temporal Ordering): time-then-graph > time-and-graph > graph-then-time

Intuition: Temporal features must stabilize before graph operations. Processing graph structure first forces simultaneous learning of both patterns, which is a harder optimization landscape.

Empirical: EvoBrain achieves 95% AUROC on TUSZ (+23% over baselines).

⚡ Why O(N) Complexity?

Problem scale: 60-second EEG windows at 256Hz = 15,360 samples per channel. Traditional Transformers:

  • Attention cost: O(N²) = 236M operations per layer
  • Memory: O(N²) = 900MB just for attention matrices (batch=1)
  • Inference: 8 Hz/batch (too slow for clinical real-time)

State-space solution: Mamba/GatedDelta achieve O(N) via selective state propagation:

  • Cost: 15K operations (roughly 15,000× reduction, since 236M / 15K ≈ 15,360)
  • Memory: O(N) = 60KB per layer
  • Inference: 128 Hz/batch (EEG-Mamba 2024) vs 8 Hz/batch for Transformers
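The figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a profiler measurement: it assumes float32 values (4 bytes), batch=1, a single attention head, and counts one "operation" per query-key pair or per state update.

```python
# Back-of-the-envelope cost of one 60 s window.
# Assumptions (not from the repo): float32 storage, batch=1, one attention head.
SAMPLE_RATE_HZ = 256
WINDOW_SECONDS = 60
BYTES_PER_FLOAT32 = 4

n = SAMPLE_RATE_HZ * WINDOW_SECONDS               # sequence length per channel
attention_pairs = n * n                           # O(N^2): every sample attends to every sample
attention_bytes = attention_pairs * BYTES_PER_FLOAT32
linear_steps = n                                  # O(N): one state update per sample
linear_bytes = linear_steps * BYTES_PER_FLOAT32

print(f"N                = {n:,}")
print(f"attention pairs  = {attention_pairs / 1e6:.0f}M")
print(f"attention memory = {attention_bytes / 2**20:.0f} MiB")
print(f"linear steps     = {linear_steps:,}")
print(f"linear memory    = {linear_bytes / 1024:.0f} KiB")
print(f"reduction        = {attention_pairs // linear_steps:,}x")
```

The ratio between the two costs is simply N, so the gap widens linearly with window length.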

🔄 Architecture Flow

EEG Input (B, 19 channels, 15360 samples @ 256Hz = 60s)
        │
        ▼
  ┌─────────────────────────────────────────────┐
  │ TCN ENCODER (8 layers, 16× downsampling)    │
  │ → Multi-scale temporal decomposition        │
  │ → Dilations: 1→2→4→8→16→32→64→128           │
  │ → Output: (B, 512, 960) compressed features │
  └─────────────────────────────────────────────┘
        │
        ▼
  ┌─────────────────────────────────────────────┐
  │ PROJECTION → Per-Electrode Features         │
  │ → 512 channels → 19 electrodes × 64 dims    │
  │ → Output: (B, 19, 960, 64)                  │
  └─────────────────────────────────────────────┘
        │
        ├─────────────┬──────────────┐
        ▼             ▼              ▼
   ┌─────────┐   ┌─────────┐   ┌───────────┐
   │  NODE   │   │  EDGE   │   │ ADJACENCY │
   │   SSM   │   │   SSM   │   │ ASSEMBLY  │
   │  (19×)  │   │ (171×)  │   │ (learned) │
   └────┬────┘   └────┬────┘   └─────┬─────┘
        │             │              │
        │             └──────┬───────┘
        │                    ▼
        │          ┌────────────────────────┐
        │          │ DYNAMIC LAPLACIAN PE   │
        │          │ → k=16 eigenvectors    │
        │          │ → Every 5 timesteps    │
        │          └──────────┬─────────────┘
        │                     ▼
        │          ┌────────────────────────┐
        │          │ GNN (2× SSGConv)       │
        │          │ → Spatial aggregation  │
        │          │ → Alpha=0.05           │
        │          └──────────┬─────────────┘
        │                     │
        └─────────────────────┴─► (B, 19, 960, 128)
                                 ▼
                        ┌──────────────────┐
                        │ GATED FUSION     │
                        │ → 4-head combine │
                        │ → Node + spatial │
                        └────────┬─────────┘
                                 ▼
                        ┌──────────────────┐
                        │ DECODER          │
                        │ → Upsample 16×   │
                        │ → Per-sample     │
                        └────────┬─────────┘
                                 ▼
                        (B, 15360) logits

Key: SSM boxes = 🔷 BiMamba2 (Stack 1) or 🔶 Gated DeltaNet (Stack 2)

Everything else is identical: TCN frontend, GNN backend, fusion layer. Only the temporal core changes.
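As a sanity check on the diagram, the tensor shapes can be traced with plain arithmetic. The sketch below does not touch the real model; `trace_shapes` and `edge_count` are hypothetical helper names, and the numbers are the ones quoted in the flow above.

```python
from math import comb


def trace_shapes(batch=1, channels=19, sample_rate_hz=256, window_seconds=60,
                 downsample=16, tcn_channels=512, node_dim=64, fused_dim=128):
    """Return (stage, shape) pairs for the stages named in the diagram."""
    samples = sample_rate_hz * window_seconds
    if samples % downsample:
        raise ValueError("window length must be divisible by the downsampling factor")
    steps = samples // downsample
    return [
        ("input", (batch, channels, samples)),
        ("tcn_encoder", (batch, tcn_channels, steps)),
        ("projection", (batch, channels, steps, node_dim)),
        ("node_plus_spatial", (batch, channels, steps, fused_dim)),
        ("decoder_logits", (batch, samples)),
    ]


def edge_count(channels=19):
    """Number of unordered electrode pairs, one edge stream per pair."""
    return comb(channels, 2)


for stage, shape in trace_shapes(batch=2):
    print(f"{stage:>18}: {shape}")
print("edge streams:", edge_count())
```

The 171 edge streams are exactly the 19·18/2 unordered electrode pairs, and 15360 / 16 gives the 960 timesteps seen by the SSM and GNN stages.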


💡 Component Justification

1. TCN Encoder: Multi-Scale Temporal Decomposition

Temporal Convolutional Networks (Bai et al. 2018):

  • Parallelism: Entire 60s window processed simultaneously (vs sequential RNN)
  • Multi-scale: Dilated convolutions capture patterns at exponentially growing timescales:
    • Layer 1 (dilation=1): 50ms receptive field (spike detection)
    • Layer 4 (dilation=8): 400ms (rhythmic patterns)
    • Layer 8 (dilation=128): 6.4s (ictal evolution)
  • Stable gradients: Residual connections prevent vanishing gradients

Tradeoff: O(N log N) complexity due to dilation, but negligible for N=15K.

2. State-Space Models: The Heart of the System

Core innovation: Selective state propagation with data-dependent gates

S_t = α_t ⊙ S_{t-1} + v_t ⊗ k_t^T    # Forget (α) + update (v⊗k)
o_t = S_t q_t                          # Retrieve

Where α_t ∈ (0,1) controls per-timestep memory decay (not global like RNNs).
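The recurrence is easy to step through by hand. Below is a minimal pure-Python sketch, assuming a scalar gate per timestep (a special case of the elementwise gate written above) and tiny toy dimensions. `gated_recurrence` is a hypothetical name for illustration; real implementations use fused CUDA kernels rather than a Python loop.

```python
def outer(v, k):
    """Outer product v ⊗ k^T as a nested list (rows follow v, columns follow k)."""
    return [[vi * kj for kj in k] for vi in v]


def matvec(S, q):
    """Matrix-vector product S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


def gated_recurrence(alphas, keys, values, queries):
    """Run S_t = alpha_t * S_{t-1} + v_t ⊗ k_t^T and o_t = S_t q_t over a sequence."""
    d_v, d_k = len(values[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]          # state starts empty
    outputs = []
    for a, k, v, q in zip(alphas, keys, values, queries):
        update = outer(v, k)
        S = [[a * s + u for s, u in zip(s_row, u_row)]
             for s_row, u_row in zip(S, update)]   # forget, then write
        outputs.append(matvec(S, q))               # retrieve
    return outputs


outputs = gated_recurrence(
    alphas=[0.5, 0.5, 0.25],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[2.0], [4.0], [3.0]],
    queries=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
print(outputs)
```

Each step costs a fixed amount of work regardless of how long the sequence already is, which is where the O(N) total comes from.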

🔷 BiMamba2 Architecture (Stack 1)

Node Stream (19 parallel SSMs):

  • Purpose: Model per-electrode temporal dynamics independently
  • Config: 6 layers, d_model=64, d_state=16, bidirectional
  • Example: Rhythmic spiking in C3 electrode evolves independently
  • Parameters: 7.2M

Edge Stream (171 pairwise SSMs):

  • Purpose: Model inter-electrode connectivity strength over time
  • Config: 2 layers, d_model=16, d_state=8, bidirectional
  • Example: C3-C4 coherence increases during seizure propagation
  • Parameters: 1.2M

Total SSM: 8.4M parameters, O(N) complexity

🔶 Gated DeltaNet Architecture (Stack 2)

Key difference: Adds delta rule on top of gating

Delta rule: Selective key-value updates without forgetting others

# Mamba2: one gate decays the whole state (every stored association fades together)
S_t = α_t ⊙ S_{t-1} + update

# Gated DeltaNet: Targeted update (selective retention)
# old_memory = what the state currently stores under key k_t
S_t = α_t ⊙ S_{t-1} + β_t ⊙ (v_t ⊗ k_t^T - old_memory)
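To make the difference concrete, here is a toy comparison in pure Python. It is a sketch under assumptions: the delta step uses one common formulation of the gated delta rule, S_t = α_t S_{t-1} + β_t (v_t - α_t S_{t-1} k_t) ⊗ k_t^T, with scalar α and β and a unit-norm key. The function names are hypothetical and this is not the FLA kernel.

```python
def gated_step(S, k, v, alpha):
    """Gate-only update: decay the whole state, then add the new association."""
    return [[alpha * s + vi * kj for s, kj in zip(row, k)]
            for row, vi in zip(S, v)]


def gated_delta_step(S, k, v, alpha, beta):
    """Gated delta update: decay, read what key k currently returns, correct only that."""
    decayed = [[alpha * s for s in row] for row in S]
    predicted = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]  # old memory for k
    error = [vi - pi for vi, pi in zip(v, predicted)]
    return [[s + beta * e * kj for s, kj in zip(row, k)]
            for row, e in zip(decayed, error)]


# The state already stores two associations: key e1 -> 2.0 and key e2 -> 5.0.
S = [[2.0, 5.0]]
e1 = [1.0, 0.0]

overwrite = gated_delta_step(S, e1, [3.0], alpha=1.0, beta=1.0)
accumulate = gated_step(S, e1, [3.0], alpha=1.0)
wipe = gated_step(S, e1, [3.0], alpha=0.0)

print("delta rule      :", overwrite)   # e1 replaced, e2 kept
print("gate, alpha=1.0 :", accumulate)  # e1 piles up on the old value
print("gate, alpha=0.0 :", wipe)        # e1 replaced, but e2 is lost too
```

With gating alone, replacing the value under one key means either accumulating on top of the old value or clearing the whole state. The delta term edits a single key while leaving the rest intact.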

Configuration:

  • Node Stream: 6 layers, d_model=512, num_heads=6, headdim=8
  • Edge Stream: 2 layers, d_model=32, num_heads=3, headdim=8

Total SSM: ~8.4M parameters (matched to BiMamba2), O(N) complexity

Hypothesis: Delta rule handles EEG better because:

  1. Gating clears memory during seizure onset (abrupt context switch)
  2. Delta rule preserves persistent patterns (rhythmic activity continues)
  3. BiMamba2 has only gating → may "forget" ongoing rhythms during onset

Reality check: This is a hypothesis. Full TUSZ training will tell us if it's true.

3. Dynamic Laplacian PE: Time-Evolving Graph Structure

EvoBrain's Theorem 1 proves explicit time-varying adjacency is strictly more expressive than static graphs.

Implementation:

  • Compute k=16 eigenvectors of normalized graph Laplacian every 5 timesteps
  • Eigenvectors = fixed positional coordinates in spectral space (like Transformer sinusoidal PE)
  • Learning happens in GNN layers that process PE, not in PE itself (best practice)

Why top-k=3 neighbors? 3 strongest connections capture 85%+ of spatial variance (validated by EvoBrain on EEG).
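A rough sketch of the graph preparation step is shown below: keep each node's strongest neighbors, symmetrize, and build the normalized Laplacian L = I - D^(-1/2) A D^(-1/2). It assumes a symmetric, non-negative weight matrix and uses hypothetical helper names; it is not the repository's implementation. The eigenvectors themselves would then come from a numerical routine such as `numpy.linalg.eigh`, which is left out to keep the example dependency-free.

```python
def top_k_symmetric(weights, k):
    """Keep each node's k strongest neighbors, then symmetrize the kept edges."""
    n = len(weights)
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: weights[i][j], reverse=True)
        for j in ranked[:k]:
            keep[i][j] = keep[j][i] = True
    return [[weights[i][j] if keep[i][j] else 0.0 for j in range(n)]
            for i in range(n)]


def normalized_laplacian(adj):
    """L = I - D^(-1/2) A D^(-1/2); isolated nodes get a zero scaling factor."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    inv_sqrt = [d ** -0.5 if d > 0 else 0.0 for d in degree]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]


# Toy 4-electrode example with k=1 (the text above uses k=3 on 19 electrodes).
weights = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.3, 0.1],
    [0.1, 0.3, 0.0, 0.8],
    [0.2, 0.1, 0.8, 0.0],
]
adjacency = top_k_symmetric(weights, k=1)
laplacian = normalized_laplacian(adjacency)
for row in laplacian:
    print([round(x, 3) for x in row])
```

In the toy example the graph splits into two strongly connected pairs (0-1 and 2-3), which shows up as a block-diagonal Laplacian.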

4. Gated Fusion: Adaptive Feature Combination

Problem: Node stream and GNN produce different feature scales and semantics.

Solution: Multi-head gated fusion learns optimal combination:

g = σ(W_g [node_out; gnn_out])        # Per-feature gates
fused = g ⊙ node_out + (1-g) ⊙ gnn_out  # Weighted merge

This allows the model to emphasize:

  • Node features when electrodes evolve independently (early seizure)
  • GNN features when spatial synchronization dominates (propagated seizure)
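The gate equations above can be tried on toy vectors. This is a single-head sketch in pure Python, assuming no bias term (matching the formula as written); `gated_fusion` and the weight values are made up for illustration and say nothing about the trained 4-head layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(node_out, gnn_out, gate_weights):
    """fused_i = g_i * node_i + (1 - g_i) * gnn_i, with g = sigmoid(W_g [node; gnn])."""
    concat = node_out + gnn_out                    # list concatenation = [node; gnn]
    fused = []
    for i, (a, b) in enumerate(zip(node_out, gnn_out)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat))
        g = sigmoid(logit)                         # per-feature gate in (0, 1)
        fused.append(g * a + (1.0 - g) * b)
    return fused


node_out = [1.0, 3.0]
gnn_out = [3.0, 1.0]

# All-zero gate weights give g = 0.5 everywhere: a plain average.
balanced = gated_fusion(node_out, gnn_out, [[0.0] * 4, [0.0] * 4])

# A large positive weight on feature 0 pushes its gate toward the node stream.
node_heavy = gated_fusion(node_out, gnn_out, [[10.0, 0.0, 0.0, 0.0], [0.0] * 4])

print(balanced)
print(node_heavy)
```

Because the gate is computed from the inputs, the mix between temporal and spatial features can change from one timestep to the next.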

📊 Model Statistics: Side-by-Side Comparison

| Component | BiMamba2 (Stack 1) | Gated DeltaNet (Stack 2) | Complexity |
|---|---|---|---|
| TCN Encoder | 12.8M | 12.8M (identical) | O(N log N) |
| Node SSM | 7.2M (d_model=64) | 7.2M (d_model=512) | O(N) |
| Edge SSM | 1.2M (d_model=16) | 1.2M (d_model=32) | O(N) |
| GNN + LPE | 6.2M | 6.2M (identical) | O(N·k²) |
| Fusion | 2.1M | 2.1M (identical) | O(N) |
| Decoder | 1.0M | 1.0M (identical) | O(N) |
| Total | 30.5M | 30.5M (matched) | O(N) |

🔑 Key: Parameter counts matched for fair comparison. Only Node/Edge SSM layers differ. TCN frontend, GNN backend, fusion, and decoder are 100% identical.


🏥 Dataset: TUSZ Clinical Reality

TUH EEG Seizure Corpus

World's largest open-source seizure dataset (Temple University):

  • 504 hours of continuous EEG from 592 patients
  • 36 hours of seizures (~7% prevalence) → 12:1 class imbalance
  • 19-channel 10-20 montage @ 256Hz (clinical standard)
  • Patient-based splits (train/dev/eval) → no data leakage

Preprocessing pipeline:

  1. Bandpass filter: 0.5-120Hz
  2. Notch filter: 60Hz (removes powerline noise)
  3. Resample: 256Hz (standardize across recordings)
  4. Windowing: 60s windows, 10s stride (83% overlap)
  5. Normalization: Per-channel z-score + clip to ±10σ (removes outliers)
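Steps 4 and 5 can be sketched without any signal-processing dependencies. The helpers below are hypothetical (not the repository's `build-cache` code) and assume the recording has already been filtered and resampled to 256Hz. The population standard deviation and the small `eps` guard are also assumptions.

```python
from statistics import fmean, pstdev


def window_starts(n_samples, sample_rate_hz=256, window_s=60, stride_s=10):
    """Start indices of every full window; a trailing partial window is dropped."""
    window = window_s * sample_rate_hz
    stride = stride_s * sample_rate_hz
    if n_samples < window:
        return []
    return list(range(0, n_samples - window + 1, stride))


def zscore_clip(channel, clip=10.0, eps=1e-8):
    """Per-channel z-score, then clip to [-clip, +clip] standard deviations."""
    mu = fmean(channel)
    sigma = pstdev(channel)
    return [max(-clip, min(clip, (x - mu) / (sigma + eps))) for x in channel]


# A 5-minute recording yields (300 - 60) / 10 + 1 = 25 windows.
starts = window_starts(300 * 256)
overlap = 1 - 10 / 60
print(len(starts), f"windows, {overlap:.0%} overlap")

# One large artifact among 199 flat samples is clipped to +10 sigma.
normalized = zscore_clip([0.0] * 199 + [1.0])
print(max(normalized))
```

The 83% overlap in the list above is just 1 - stride/window = 1 - 10/60.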

Our cache system (memory-mapped NPY format):

  • Train: 4667 files → 61,616 balanced windows (34.2% seizure ratio via oversampling)
  • Dev: 1832 files → 148,224 natural windows (7.7% seizure ratio, real distribution)
  • Speed: 99.6% faster startup than NPZ (manifest-based loading)
  • Memory: <1 GB RAM vs 387 GB for NPZ

Why oversample training? Standard ML practice: Train on balanced data (model learns seizure patterns), validate on natural distribution (measures real-world performance). See docs/05-training/training-methodology.md for detailed explanation.


🎯 Performance Targets: Evidence-Based Goals

Based on verified clinical benchmarks and SOTA research (see docs/00-overview/performance-targets.md for comprehensive analysis):

Primary Target (Match Temple Clinical SOTA)

≤4 FA/24h @ ≥50% sensitivity (NEDC OVERLAP scoring)

  • Temple NEDC verified: 4 FA/24h @ ~50% sensitivity (real clinical deployments)
  • SeizureTransformer #1: 26.89 FA/24h @ 45.63% sensitivity (TUSZ eval, 2025)
  • Our goal: Match or beat Temple's verified clinical benchmark

Stretch Goal (Clinical Deployment)

≤10 FA/24h @ ≥75% sensitivity (NEDC OVERLAP scoring)

  • Enables ICU monitoring with manageable alarm fatigue
  • Current gap: SeizureTransformer reaches 33.90% sensitivity at 10 FA/24h (a gap of about 41 points to close)

Additional Metrics (Threshold-Independent)

| Metric | Target | Baseline (SeizureTransformer) | Rationale |
|---|---|---|---|
| AUROC | ≥0.90 | 0.902 (TUSZ eval) | Overall discrimination capability |
| AUPRC | ≥0.40 | Not reported | Better for 12:1 class imbalance |
| F1 Score | ≥0.45 | 0.414 (NEDC OVERLAP) | Balanced precision/recall |

Realistic Success Criteria

| Outcome | Sensitivity @ 4 FA/24h | Publication Tier |
|---|---|---|
| Breakthrough | ≥60% | Top-tier venue (beats all known systems) |
| Strong | ≥50% | Highly publishable (matches Temple SOTA) |
| Publishable | ≥45% | Solid contribution (architectural novelty) |
| Minimum | ≥40% | Viable if architectural insights clear |

Reality check: Temple NEDC research confirms ROC curves are very steep at low FA rates. 5% absolute sensitivity change = massive FA rate shift. Our dual-stack (BiMamba2 vs Gated DeltaNet) comparison provides scientific value regardless of absolute performance.

Scoring impact: Same predictions can yield 3-16× different FA rates depending on scorer (SzCORE vs NEDC OVERLAP vs NEDC TAES). We use NEDC OVERLAP as primary metric. See docs/06-evaluation/TAES_DISAMBIGUATION.md for critical naming collision explanation.
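To show how an operating point such as "X FA/24h @ Y% sensitivity" is computed, here is a deliberately simplified any-overlap event scorer. It is an illustration only and is not the NEDC scoring software. Events are assumed to be `(start_s, end_s)` tuples, and a predicted event counts as a false alarm only if it overlaps no reference seizure.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_score(reference, hypothesis, recording_hours):
    """Return (sensitivity, false alarms per 24h) under any-overlap matching."""
    hits = sum(any(overlaps(r, h) for h in hypothesis) for r in reference)
    false_alarms = sum(not any(overlaps(h, r) for r in reference) for h in hypothesis)
    sensitivity = hits / len(reference) if reference else float("nan")
    fa_per_24h = false_alarms * 24.0 / recording_hours
    return sensitivity, fa_per_24h


# 12 hours of EEG with 4 annotated seizures and 3 predicted events (times in seconds).
reference = [(100, 160), (1000, 1050), (5000, 5100), (9000, 9030)]
hypothesis = [(150, 170), (5050, 5060), (7000, 7010)]

sensitivity, fa_per_24h = overlap_score(reference, hypothesis, recording_hours=12)
print(f"sensitivity = {sensitivity:.0%}, FA/24h = {fa_per_24h:.1f}")
```

Stricter scorers that weight partial overlaps or penalize fragmented detections would give different numbers for the same predictions, which is why the scorer has to be named alongside every result.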


🚀 Quick Start

# 1️⃣ Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2️⃣ Clone repo
git clone https://github.com/clarity-digital-twin/brain-go-brr-v2.git
cd brain-go-brr-v2

# 3️⃣ Setup environment (installs mamba-ssm, PyG)
make setup
make setup-gpu

# Optional: Install FLA for Gated DeltaNet research
make setup-fla

# 4️⃣ Download TUSZ corpus
# Visit: https://isip.piconepress.com/projects/nedc/html/tuh_eeg/index.shtml
# Place in: data_ext4/tusz/edf/

# 5️⃣ Build preprocessing cache (one-time, ~2 hours)
python -m src build-cache \
  --data-dir data_ext4/tusz/edf/train \
  --cache-dir cache/tusz_mmap/train \
  --split train

python -m src build-cache \
  --data-dir data_ext4/tusz/edf/dev \
  --cache-dir cache/tusz_mmap/dev \
  --split dev

# 6️⃣ Smoke test (3 files, 5 minutes)
make smoke-bimamba    # Test BiMamba2 stack
make smoke-fla        # Test Gated DeltaNet stack

# 7️⃣ Full local training (RTX 4090, ~960 hours / 40 days)
export BGB_NAN_DEBUG=1
tmux new -s train
make train-bimamba    # or: make train-fla
# Ctrl+B then D to detach | tmux attach -t train to reattach

Cloud training (Modal A100-80GB) - See docs/05-training/modal.md for details:

# Deploy Modal functions first
modal deploy deploy/modal/app.py

# BiMamba2 production (hands-free, auto-restart)
modal run --detach deploy/modal/app.py \
  --action schedule-training \
  --config configs/modal/train_bimamba.yaml

# Gated DeltaNet production (hands-free, auto-restart)
modal run --detach deploy/modal/app.py \
  --action schedule-training \
  --config configs/modal/train_fla.yaml

# Monitor progress
modal app list
modal app logs <app-id>

🚨 CRITICAL: Use --action schedule-training for 100-epoch production runs (auto-restart every 23h). Use --action train ONLY for smoke tests and experiments.

See docs/01-installation/ and docs/05-training/ for complete setup guides.


📚 Documentation

Getting Started

Architecture

Research

Operations


🀝 Contributing

We welcome contributions! See docs/09-development/ for:

Zero technical debt policy: All P0/P1/P2 issues resolved before major releases.


📖 Citation

@software{brain-go-brr-v4,
  title = {Brain-Go-Brr V4: Clinical EEG Seizure Detection via Dual-Stack State-Space Models},
  author = {Clarity Digital Twin},
  year = {2025},
  version = {4.2.0},
  url = {https://github.com/clarity-digital-twin/brain-go-brr-v2},
  note = {Empirical A/B comparison of BiMamba2 and Flash Linear Attention (BiGatedDeltaNet) architectures on TUSZ}
}

⚖️ License

Apache 2.0 - See LICENSE for full text.


🙏 Acknowledgments

Datasets:

Foundational Papers:

Infrastructure & Libraries:


Questions? Open an issue • Updates? Watch the repo • Discussion? Start a discussion

Status: v4.2.0 checkpoint resume fixes released • FLA training resumed (Epoch 18/100) • BiMamba2 paused (Epoch 6, backed up) • See STATUS.md for full details