# Noeris

*the fusions your compiler can't find*


Noeris discovers and optimizes cross-operation kernel fusions that `torch.compile` and single-op libraries like Liger Kernel can't find. `pip install noeris`, add one line, and training runs 1.3--1.4x faster.

## What makes Noeris different

| What | Liger Kernel | Noeris |
|---|---|---|
| Fusion level | Single-op (RMSNorm alone, RoPE alone) | Cross-op (RMSNorm+RoPE in 1 kernel) |
| Config tuning | Fixed | Bandit-autotuned per shape, per GPU |
| Attention | No | Beats cuDNN by 6.24x on sliding-window |
| Architecture | Transformers only | Transformers + SSMs |
| Cross-hardware | No | Zero-shot ρ=0.907 |
| Backward pass | Individual op backward | Fused cross-op backward (QK-norm+RoPE) |

## Key results

| Result | Metric |
|---|---|
| Sliding-window attention vs cuDNN FlashAttention (A100) | 6.24x faster (8/8 shapes) |
| Cross-op fusion: kernel launch reduction | 40 launches → 1 (QK-RMSNorm+RoPE) |
| End-to-end 26-layer Gemma 4 (A100) | 1.43x (41.4 ms → 29.1 ms) |
| Gemma 4 decoder layer deeper fusion (A100) | 1.13x (31b_local), 1.07x (31b_global), 1.90x (e2b_local) |
| Gemma 4 decoder layer deeper fusion (H100) | 1.26x (31b_local), 1.17x (31b_global), 2.32x (e2b_local) |
| Bandit convergence | 98% of optimal in 1 iteration (50x faster than grid search) |
| Model coverage | 19 models / 13 families |
| Cross-hardware zero-shot config prediction | ρ=0.907 (A100 from free T4 data) |
| Fused QK-RMSNorm+RoPE prologue (A100) | 10.2--12.9x vs separated launches |
| Fused QK-RMSNorm+RoPE prologue (H100) | 10.4--11.9x, peak 1628 GB/s |

Latest layer artifacts: `docs/results/gemma4-layer-bench-deeper-fusion-a100-after-geglu-retune.json` and `docs/results/gemma4-layer-bench-deeper-fusion-h100-after-geglu-retune.json`. Canonical results index: `docs/results/README.md`.

## Quick start

```python
import noeris
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-2b")
noeris.patch(model)  # Fused QK-norm+RoPE, autotuned RMSNorm, fused GeGLU
# Training is now 1.3--1.4x faster. Works with the HF Trainer.
```

## CLI / search

```bash
python3 -m venv .venv && . .venv/bin/activate
pip install -e .

# Search a specific operator (A100 via Modal)
python -m research_engine.cli triton-iterate \
    --operator rmsnorm --gpu A100 --llm --configs-per-run 8
```

## Free GPU validation (no paid compute)

```python
# Kaggle (30 hr/week free T4) or Google Colab
!git clone https://github.com/PwnKit-Labs/noeris && cd noeris
!pip install -e . numpy scikit-learn -q
!python scripts/colab_validate_all.py
```

## Works alongside Liger Kernel

Noeris is complementary to Liger Kernel. Liger optimizes individual operations (RMSNorm, RoPE, SwiGLU separately). Noeris fuses operations across boundaries that single-op libraries can't cross -- like combining RMSNorm + RoPE into one kernel pass. Use both for maximum performance.

## Model support

All 19 shapes pass correctness. Fusion speedup measured on T4 (Kaggle) and A100 (Modal).

| Model family | Models | Fusion speedup (T4) | Fusion speedup (A100) |
|---|---|---|---|
| Gemma | 4 E2B, 4 31B, 4 26B-A4B | 6.5--8.3x | 5.6--9.8x |
| LLaMA | 3 8B, 3 70B, 4 Scout | 8.3--8.9x | 3.9--6.3x |
| Qwen | 3 8B, 3 32B | 7.8--8.5x | 4.8--6.1x |
| Mistral / Mixtral | 7B, 8x7B | 8.3--8.9x | 5.2--6.3x |
| Phi | 3 mini, 4 mini | 8.3--8.5x | 5.2--6.1x |
| Falcon | 3 7B | 8.3x | 5.2x |
| DBRX | 132B | 8.9x | 6.3x |
| OLMo | 2 7B | 8.3x | 5.2x |
| InternLM | 3 8B | 8.3x | 5.2x |
| Mamba-3 | SSM scan | 1.88 GB/s | -- |

## Operators (21 parameterized Triton templates)

| Category | Operators |
|---|---|
| Core | matmul, rmsnorm (1+w Gemma affine), layernorm, softmax (+ softcap), cross_entropy |
| Attention | GQA + causal + sliding-window + QK-norm + YOCO KV-share, paged-KV decode (pure Triton) |
| Fusion | fused QK-RMSNorm+RoPE (fwd + bwd), fused GeGLU, fused norm+matmul |
| Routing | RoPE (dual-base with p-RoPE), MoE router (matmul+softmax+top-k), grouped GEMM (sort-free) |
| Embedding | PLE gather (Gemma E2B/E4B per-layer), PLE fusion (2.07x), K=V attention |
| SSM | selective scan (Mamba-3) |

110+ shape buckets. 606 unit tests. 18/20 operators validated on T4; all pass correctness on A100 and H100.
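To make the flagship fusion concrete, here is a NumPy reference of the math the fused QK-RMSNorm+RoPE prologue computes in one pass. This is an illustrative sketch, not the Triton kernel: the shapes, epsilon, and RoPE base are assumed values, and the pair-splitting convention (first half / second half of the head dimension) is one common layout, not necessarily the one Noeris uses.

```python
import numpy as np

def rmsnorm(x, w, eps=1e-6):
    # Root-mean-square normalization with a learned scale (Gemma-style 1+w affine).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * w

def rope(x, pos, base=10000.0):
    # Rotary position embedding: rotate feature pairs by position-dependent angles.
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)
    ang = pos[:, None] * inv_freq                    # (seq, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    while cos.ndim < x.ndim:                         # broadcast over head dims
        cos, sin = cos[:, None], sin[:, None]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# A fused kernel computes rope(rmsnorm(q)) and rope(rmsnorm(k)) in a single
# pass, instead of materializing the normalized tensors between two launches.
seq, heads, head_dim = 8, 4, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((seq, heads, head_dim))
w = rng.standard_normal(head_dim) * 0.1
pos = np.arange(seq, dtype=np.float64)
q_out = rope(rmsnorm(q, 1.0 + w), pos)
```

The same reference math doubles as a correctness oracle for the fused kernel: RMSNorm should produce unit mean-square activations (for w=1), and RoPE, being a pure rotation of feature pairs, should preserve per-vector norms.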

## Search system

- Thompson-sampling bandit + gradient-boosted cost model (R² = 0.94) + MAP-Elites quality-diversity
- Cross-run shape-indexed config database -- knowledge compounds across sessions
- Cross-hardware transfer -- A100-trained cost model rankings transfer to H100 with ρ=0.967
- ~$0.01 per iteration on Modal
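The bandit loop above can be sketched in a few lines. This is a simplified illustration, not Noeris's implementation: the Gaussian posterior, the simulated per-config latencies, and the reward scaling are all assumptions made for the demo.

```python
import numpy as np

class GaussianBandit:
    """Thompson sampling over kernel configs: sample a reward belief per
    config, benchmark the arg-max sample, update that config's posterior."""

    def __init__(self, n_configs):
        self.mean = np.zeros(n_configs)   # running mean of reward (-latency)
        self.count = np.zeros(n_configs)  # times each config was benchmarked

    def select(self, rng):
        # Rarely tried configs keep wide posteriors, encouraging exploration.
        std = 1.0 / np.sqrt(self.count + 1.0)
        return int(np.argmax(rng.normal(self.mean, std)))

    def update(self, arm, reward):
        self.count[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.count[arm]

# Simulated per-config latencies in ms; config 2 is the true optimum.
latency = np.array([0.9, 0.7, 0.55, 0.8])
rng = np.random.default_rng(0)
bandit = GaussianBandit(len(latency))
for _ in range(200):
    arm = bandit.select(rng)
    bandit.update(arm, -(latency[arm] + rng.normal(0.0, 0.02)))
best = int(np.argmax(bandit.mean))  # converges to the fastest config
```

In the real system the reward would come from an on-GPU benchmark rather than a simulator, and the cost model would warm-start the posteriors instead of starting from zero.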

## Compiler comparison (T4)

| Stage | Kernel launches | Prologue time |
|---|---|---|
| PyTorch eager | 40 | 3.45 ms |
| torch.compile | 9 (4 Triton kernels) | 0.92 ms (3.75x) |
| Noeris | 1 | 0.57 ms (6.08x) |

`torch.compile` splits at the RMSNorm reduction boundary and materializes intermediates to HBM between passes. Noeris fuses the entire prologue into a single kernel launch.
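A back-of-envelope model shows why the launch count matters: every unfused stage round-trips one activation tensor through HBM, so bytes moved scale with the number of kernel boundaries. The shapes and dtype below are illustrative assumptions, and the model deliberately ignores weights, caches, and launch overhead.

```python
def prologue_bytes(n_stages, seq=4096, heads=32, head_dim=128, dtype_bytes=2):
    # One activation tensor in fp16/bf16.
    tensor = seq * heads * head_dim * dtype_bytes
    # Each stage reads its input from HBM and writes its output back.
    return n_stages * 2 * tensor

eager = prologue_bytes(40)     # 40 launches (PyTorch eager)
compiled = prologue_bytes(9)   # 9 launches (torch.compile)
fused = prologue_bytes(1)      # 1 launch (fused prologue)
```

Under this model, traffic drops linearly with launch count, which is why a memory-bound prologue sees close to the launch-reduction factor in speedup until it hits compute or bandwidth limits.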

## Honest framing

- vLLM has an experimental `enable_qk_norm_rope_fusion` pass (disabled by default due to an H100 regression, issue #34391). We make this fusion practical with parameterized Triton + bandit autotuning.
- Mirage (OSDI 2025) demonstrated fused norm+matmul. We implement it independently in Triton with autotuning.
- FlexAttention does block-sparse tile skipping. To our knowledge, we are the first to measure >3x wins on narrow-window shapes.
- Liger Kernel fuses the RMSNorm and RoPE backward passes separately. To our knowledge, no existing framework fuses the combined QK-RMSNorm+RoPE backward into a single kernel.
- All novelty claims use "to our knowledge" qualification.

## Citing

Paper draft: `docs/paper/noeris.md`. arXiv preprint coming.

```bibtex
@misc{noeris2026,
  title   = {Noeris: Architecture-Agnostic Kernel Fusion and Autotuning},
  author  = {Doruk Tan Ozturk},
  year    = {2026},
  url     = {https://github.com/PwnKit-Labs/noeris}
}
```

MIT License.
