A novel attention mechanism for language models that achieves O(N log N) complexity, replacing the standard O(N²) dot-product attention.
Tokens interact through wave propagation on a continuous field rather than direct pairwise comparison, which enables efficient scaling to context lengths where standard transformers run out of memory.
| Metric | Value |
|---|---|
| DCLM CORE (130M model) | 46.8% (GPT-2 target: 26.5%) |
| Throughput at 32K context | 21.8x faster than standard |
| Memory at 32K context | 5.3x less than standard |
| 128K context | Runs (standard OOMs) |
| Model size | 505 MB (runs on a laptop) |
See BENCHMARKS.md for full results.
- Scatter — tokens deposit information onto a continuous field
- Convolve — FFT-based wave kernel propagates information across the field
- Gather — tokens read back from the field at their positions
Each attention head learns three parameters: frequency, damping, and phase — controlling how information flows between tokens.
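
The scatter, convolve, and gather steps can be illustrated with a minimal, dependency-free sketch for a single head. Everything specific here is an assumption made for illustration and is not taken from this repository: the damped-cosine kernel shape `exp(-damping * t) * cos(frequency * t + phase)`, the integer field positions, the causal (one-sided) kernel, and the names `fft`, `wave_kernel`, and `wave_field_mix`. In practice a library FFT such as `numpy.fft` or `torch.fft` would replace the hand-written one.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized). len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def wave_kernel(size, frequency, damping, phase):
    """Hypothetical per-head kernel: a damped cosine sampled at t = 0..size-1."""
    return [
        math.exp(-damping * t) * math.cos(frequency * t + phase)
        for t in range(size)
    ]


def wave_field_mix(values, positions, field_size, frequency, damping, phase):
    """Scatter -> FFT convolve -> gather for scalar token values.

    positions are integer field cells in [0, field_size) (an assumption;
    a continuous field would interpolate instead).
    """
    # Scatter: each token deposits its value onto its field cell.
    field = [0.0] * field_size
    for v, p in zip(values, positions):
        field[p] += v

    # Convolve: zero-pad to a power of two >= 2 * field_size so the
    # circular FFT convolution equals a linear (causal) convolution.
    n = 1
    while n < 2 * field_size:
        n *= 2
    kernel = wave_kernel(field_size, frequency, damping, phase)
    f_hat = fft(field + [0.0] * (n - field_size))
    k_hat = fft(kernel + [0.0] * (n - field_size))
    conv = fft([a * b for a, b in zip(f_hat, k_hat)], inverse=True)
    propagated = [c.real / n for c in conv[:field_size]]

    # Gather: each token reads the propagated field at its own position.
    return [propagated[p] for p in positions]
```

For example, `wave_field_mix([1.0, 2.0, -1.0, 0.5], [0, 2, 5, 7], 8, 0.5, 0.1, 0.0)` returns one mixed value per token. The cost is dominated by the three FFTs, which is where the O(N log N) scaling comes from; scatter and gather are linear in the number of tokens.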
Standard attention compares every pair of tokens, which costs O(N²). At long context lengths:
- 32K tokens: Standard needs 35 GB → Wave Field needs 6.7 GB
- 128K tokens: Standard OOMs → Wave Field runs at 26.7 GB
- 1M+ tokens: Standard impossible → Wave Field feasible
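
To give a rough sense of why the gap widens with context length, the sketch below compares the asymptotic terms N² and N log₂ N at the three context sizes listed above. It ignores constant factors, hidden dimensions, and implementation details, so it does not reproduce or predict the measured memory and throughput figures; the function name `growth_ratio` is hypothetical.

```python
import math


def growth_ratio(n):
    """Ratio of the N^2 term to the N * log2(N) term for context length n."""
    return (n * n) / (n * math.log2(n))


for n in (32_768, 131_072, 1_048_576):
    print(f"{n:>9} tokens: N^2 / (N log2 N) = {growth_ratio(n):,.0f}")
```

The ratio simplifies to N / log₂ N, so it keeps growing as the context gets longer.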
Wave Field's throughput increases with context length, while standard attention's throughput decreases.
Active research. Training and benchmarking at 130M–1.5B parameter scale.
Badaramoni Avinash
All rights reserved.