the fusions your compiler can't find
Noeris discovers and optimizes cross-operation kernel fusions that torch.compile and single-op libraries like Liger Kernel can't find.
The research engine contains the full operator/search stack; the current public
noeris.patch() API is a conservative drop-in surface for RMSNorm and gated MLP
activation patches.
| What | Liger Kernel | Noeris |
|---|---|---|
| Fusion level | Single-op (RMSNorm alone, RoPE alone) | Cross-op (RMSNorm+RoPE in 1 kernel) |
| Config tuning | Fixed | Bandit autotuned per shape per GPU |
| Attention | No | Beats cuDNN 6.24x on sliding-window |
| Architecture | Transformers only | Transformers + SSMs |
| Cross-hardware | No | Zero-shot ρ=0.907 |
| Backward pass | Individual op backward | Fused cross-op backward (QK-norm+RoPE) |
| Result | Metric |
|---|---|
| Sliding-window attention vs cuDNN FlashAttention (A100) | 6.24x faster (8/8 shapes) |
| Cross-op fusion: kernel launch reduction | 40 launches → 1 (QK-RMSNorm+RoPE) |
| End-to-end 26-layer Gemma 4 (A100) | 1.43x (41.4 ms → 29.1 ms) |
| Gemma 4 decoder layer deeper fusion (A100) | 1.13x (31b_local), 1.07x (31b_global), 1.90x (e2b_local) |
| Gemma 4 decoder layer deeper fusion (H100) | 1.26x (31b_local), 1.18x (31b_global), 2.29x (e2b_local) |
| Bandit convergence | 98% of optimal in 1 iteration (50x faster than grid search) |
| Model coverage | 19 models / 13 families |
| Cross-hardware zero-shot config prediction | ρ=0.907 (A100 from free T4 data) |
| Fused QK-RMSNorm+RoPE prologue (A100) | 10.2--12.9x vs separated launches |
| Fused QK-RMSNorm+RoPE prologue (H100) | 10.4--11.9x, peak 1628 GB/s |
Latest layer artifacts: docs/results/gemma4-layer-bench-deeper-fusion-a100-after-geglu-retune.json, docs/results/gemma4-layer-bench-deeper-fusion-h100-after-geglu-retune.json.
Canonical results index: docs/results/README.md.
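The cross-hardware rows above report a correlation ρ between config rankings on two GPUs. Assuming ρ denotes Spearman's rank correlation (the source does not say which estimator is used), a minimal pure-Python sketch of the computation looks like this. The latency values are hypothetical and exist only to exercise the function.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical latencies (ms) for the same five configs on two GPUs.
t4_ms = [4.1, 2.9, 3.5, 6.0, 2.2]
a100_ms = [1.3, 0.9, 1.4, 2.1, 0.8]
rho = spearman_rho(t4_ms, a100_ms)
```

A value near 1 means a config that ranks well on one GPU tends to rank well on the other, which is what makes zero-shot config prediction plausible.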
```python
import noeris
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-2b")
noeris.patch(model)  # Drop-in RMSNorm + gated MLP activation patches.

# QK-RMSNorm+RoPE kernels are available for custom integrations, but generic
# HuggingFace attention patching is not wired into noeris.patch() yet.
```

Contributor setup has separate CPU-only and Linux CUDA paths. Start with
CONTRIBUTING.md if you are on macOS arm64, working without
an NVIDIA GPU, or trying to run local CI parity.
noeris.patch() currently wires two module-level optimizations into supported
HuggingFace-style models:
- RMSNorm module replacement, including Gemma's `(1+w)` affine mode.
- Gated MLP activation fusion for GeGLU/SwiGLU-style `gate_proj`, `up_proj`, `down_proj` blocks.
The project also includes lower-level Triton kernels for QK-RMSNorm+RoPE,
cross-entropy, attention, and other operators. Those kernels back the benchmark
artifacts below, but QK-RMSNorm+RoPE and cross-entropy are not generic
noeris.patch() hooks yet.
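For readers unfamiliar with the two patched operations, here is a pure-Python reference of the math they compute. This is a sketch, not the Noeris kernels: the function names and the `one_plus_weight` flag are hypothetical, and the exact (erf-based) GELU is an assumption, since some models use the tanh approximation instead.

```python
import math


def rmsnorm(x, weight, eps=1e-6, one_plus_weight=False):
    """Reference RMSNorm over one vector.

    one_plus_weight=True scales by (1 + w), the Gemma-style affine mode;
    otherwise the output is scaled by w directly.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + w) if one_plus_weight else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]


def gated_activation(gate, up, kind="geglu"):
    """Elementwise act(gate) * up, the part that sits between the
    gate_proj/up_proj matmuls and down_proj in a gated MLP."""
    def gelu(v):
        return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

    def silu(v):
        return v / (1.0 + math.exp(-v))

    act = gelu if kind == "geglu" else silu
    return [act(g) * u for g, u in zip(gate, up)]


x = [3.0, 4.0]
standard = rmsnorm(x, [1.0, 1.0], eps=0.0)
gemma = rmsnorm(x, [0.0, 0.0], eps=0.0, one_plus_weight=True)
```

A zero-initialized weight in `(1+w)` mode gives the same output as a unit weight in the standard mode, which is why a replacement module has to know which convention the checkpoint uses.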
```bash
python3 -m venv .venv && . .venv/bin/activate
pip install -e .

# Search a specific operator (A100 via Modal)
python -m research_engine.cli triton-iterate \
  --operator rmsnorm --gpu A100 --llm --configs-per-run 8
```

Run the same core checks used in GitHub CI from the repo root on a Linux CUDA-capable development environment:

```bash
./scripts/ci_local.sh
```

If your default python3 does not have the test dependencies installed, select an interpreter explicitly:

```bash
PYTHON_BIN=python3.11 ./scripts/ci_local.sh
```

This runs unit tests, public artifact reference checks, two matmul-speedup
benchmark runs, history export, and the history regression gate with
`--fail-on-missing`.
On macOS arm64 or CPU-only machines, use the source-tree setup in
CONTRIBUTING.md instead of pip install -e .; Triton does
not ship a native macOS arm64 wheel in the checked-in lockfile.
```python
# Kaggle (30 hr/week free T4) or Google Colab
!git clone https://github.com/0sec-labs/noeris
# Use %cd so the directory change persists; `!cd` only affects a throwaway subshell.
%cd noeris
!pip install -e . numpy scikit-learn -q
!python scripts/colab_validate_all.py
```

Noeris is complementary to Liger Kernel. Liger optimizes individual operations (RMSNorm, RoPE, and SwiGLU separately). Noeris fuses operations across boundaries that single-op libraries can't cross, such as combining RMSNorm + RoPE into one kernel pass. Use both for maximum performance.
All 19 shapes pass correctness. Fusion speedup measured on T4 (Kaggle) and A100 (Modal).
| Model family | Models | Fusion speedup (T4) | Fusion speedup (A100) |
|---|---|---|---|
| Gemma | 4 E2B, 4 31B, 4 26B-A4B | 6.5--8.3x | 5.6--9.8x |
| LLaMA | 3 8B, 3 70B, 4 Scout | 8.3--8.9x | 3.9--6.3x |
| Qwen | 3 8B, 3 32B | 7.8--8.5x | 4.8--6.1x |
| Mistral / Mixtral | 7B, 8x7B | 8.3--8.9x | 5.2--6.3x |
| Phi | 3 mini, 4 mini | 8.3--8.5x | 5.2--6.1x |
| Falcon 3 | 7B | 8.3x | 5.2x |
| DBRX | 132B | 8.9x | 6.3x |
| OLMo 2 | 7B | 8.3x | 5.2x |
| InternLM 3 | 8B | 8.3x | 5.2x |
| Mamba-3 | SSM scan | 1.88 GB/s | -- |
Noeris currently registers 22 operator specs. Ten are wired into the generic
triton-iterate / ablation CLI, eight run in the default GitHub Actions
Triton Iterate matrix, and two are exposed as drop-in noeris.patch() hooks.
The full surface split is documented in
docs/system/OPERATOR_SURFACE.md.
| Surface | Count | Operators |
|---|---|---|
| Registered specs | 22 | attention, attention_decode, attention_v2, cross_entropy, cuda_qk_norm_rope, fused_norm_linear, geglu, gelu, grouped_gemm, kv_shared_attention, layernorm, matmul, matmul_splitk, moe_router, ple_fusion, ple_gather, qk_norm_rope, qk_norm_rope_bwd, rmsnorm, rotary, softmax, ssm_scan |
| Generic search CLI | 10 | matmul, rmsnorm, softmax, layernorm, cross_entropy, attention, attention_v2, rotary, geglu, fused_norm_linear |
| Default workflow matrix | 8 | matmul, rmsnorm, softmax, layernorm, cross_entropy, attention, rotary, geglu |
| Public noeris.patch() hooks | 2 | rmsnorm, geglu |
| Category | Operators |
|---|---|
| Core | matmul, matmul_splitk, rmsnorm, layernorm, softmax, cross_entropy, gelu |
| Attention | attention, attention_v2, attention_decode, rotary, qk_norm_rope, qk_norm_rope_bwd, cuda_qk_norm_rope, kv_shared_attention |
| Fusion | geglu, fused_norm_linear, ple_gather, ple_fusion |
| Routing | moe_router, grouped_gemm |
| SSM | ssm_scan |
110+ shape buckets. 606 unit tests. T4/A100/H100 validation status varies by
operator and benchmark path; see docs/results/ for the current canonical
artifacts.
- Thompson-sampling bandit + gradient-boosted cost model (R^2 = 0.94) + MAP-Elites quality-diversity
- Cross-run shape-indexed config database -- knowledge compounds across sessions
- Cross-hardware transfer -- A100-trained cost model rankings transfer to H100 with ρ=0.967
- Kernel-aware NAS proxy -- ranks architecture candidates against A100/T4/H100 kernel profiles
- ~$0.01 per iteration on Modal
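To illustrate the Thompson-sampling part of the search loop, here is a minimal sketch of a bandit that picks kernel configs by sampling a latency estimate per config and benchmarking the lowest draw. It assumes a Gaussian posterior with known noise, which is a simplification chosen for the example; the class name, config labels, and latencies are hypothetical, and the cost model and MAP-Elites components are omitted.

```python
import random


class GaussianThompsonBandit:
    """Thompson sampling over configs, minimizing observed latency."""

    def __init__(self, arms, prior_mean=1.0, prior_std=1.0, noise_std=0.05, seed=0):
        self.rng = random.Random(seed)
        self.prior_mean = prior_mean
        self.prior_var = prior_std ** 2
        self.noise_var = noise_std ** 2
        self.stats = {arm: [0, 0.0] for arm in arms}  # [pull count, latency sum]

    def posterior(self, arm):
        """Posterior mean and std of the arm's latency (conjugate normal update)."""
        n, total = self.stats[arm]
        precision = 1.0 / self.prior_var + n / self.noise_var
        mean = (self.prior_mean / self.prior_var + total / self.noise_var) / precision
        return mean, (1.0 / precision) ** 0.5

    def select(self):
        """Draw one latency sample per arm and pick the smallest."""
        draws = {arm: self.rng.gauss(*self.posterior(arm)) for arm in self.stats}
        return min(draws, key=draws.get)

    def update(self, arm, latency_ms):
        self.stats[arm][0] += 1
        self.stats[arm][1] += latency_ms


# Hypothetical true latencies (ms) standing in for real benchmark runs.
true_ms = {"BLOCK=64": 1.0, "BLOCK=128": 0.6, "BLOCK=256": 1.5}
bandit = GaussianThompsonBandit(true_ms)
env = random.Random(1)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, env.gauss(true_ms[arm], 0.05))

pulls = {arm: n for arm, (n, _) in bandit.stats.items()}
best = max(pulls, key=pulls.get)
```

Untried configs keep a wide posterior, so they still get sampled occasionally, while configs that have measured slow are quickly abandoned. In a persistent setup, `stats` would be the state worth saving between runs.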
| Stage | Kernel launches | Prologue time |
|---|---|---|
| PyTorch eager | 40 | 3.45 ms |
| torch.compile | 9 (4 Triton kernels) | 0.92 ms (3.75x) |
| Noeris | 1 | 0.57 ms (6.08x) |
torch.compile splits at the RMSNorm reduction boundary and materializes to HBM between passes. Noeris fuses the entire prologue into a single kernel launch.
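The difference between the split and fused paths can be sketched in pure Python for a single head vector. This is a numerical reference, not the Triton kernel: the half-split RoPE pairing (element `i` with element `i + d/2`) and the `base` value are assumptions, and the function names are hypothetical.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope(x, pos, base=10000.0):
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def fused_rmsnorm_rope(x, weight, pos, eps=1e-6, base=10000.0):
    """Normalize and rotate in one loop; no normalized vector is stored."""
    half = len(x) // 2
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [0.0] * len(x)
    for i in range(half):
        lo = x[i] * inv_rms * weight[i]
        hi = x[i + half] * inv_rms * weight[i + half]
        theta = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = lo * c - hi * s
        out[i + half] = hi * c + lo * s
    return out


q = [0.5, -1.0, 2.0, 0.25]
w = [1.0, 0.9, 1.1, 1.0]
two_pass = rope(rmsnorm(q, w), pos=7)  # intermediate list plays the role of the HBM round trip
one_pass = fused_rmsnorm_rope(q, w, pos=7)
```

Both paths produce the same values. The fused version only needs the reduction result (`inv_rms`) before the rotation can proceed, so the normalized intermediate never has to be written out and read back.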
- vLLM has an experimental `enable_qk_norm_rope_fusion` pass (disabled by default due to an H100 regression, issue #34391). We make this fusion practical with parameterized Triton + bandit autotuning.
- Mirage (OSDI 2025) demonstrated fused norm+matmul. We implement it independently in Triton with autotuning.
- FlexAttention does block-sparse tile skipping. To our knowledge, we are the first to measure >3x wins on narrow-window shapes.
- Liger Kernel fuses RMSNorm and RoPE backward passes separately. To our knowledge, no existing framework fuses the combined QK-RMSNorm+RoPE backward into a single kernel.
- All novelty claims carry a "to our knowledge" qualification.
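As an illustration of why narrow windows leave room for tile skipping, the sketch below counts which KV tiles a causal sliding-window query tile has to visit. It assumes each query row `i` attends to keys `j` with `i - window < j <= i`; that convention, the function name, and the shape values are hypothetical and not taken from the Noeris or FlexAttention code.

```python
def kv_tile_range(q_tile, block_m, block_n, window, seq_len):
    """Return (first, last) KV tile indices a query tile must visit
    under a causal sliding window; all other KV tiles can be skipped."""
    q_start = q_tile * block_m
    q_end = min(q_start + block_m, seq_len) - 1  # last query row in the tile
    k_min = max(0, q_start - window + 1)  # oldest key seen by the first row
    k_max = q_end  # causal: newest key seen by the last row
    return k_min // block_n, k_max // block_n


# Hypothetical shape: 4096 tokens, 128x128 tiles, 512-token window.
seq_len, block, window = 4096, 128, 512
num_q_tiles = seq_len // block

visited = 0
for t in range(num_q_tiles):
    first, last = kv_tile_range(t, block, block, window, seq_len)
    visited += last - first + 1

# Dense causal attention visits every KV tile up to the diagonal.
dense_causal = sum(t + 1 for t in range(num_q_tiles))
```

Comparing `visited` with `dense_causal` gives the fraction of tile work that remains. The narrower the window relative to the sequence length, the smaller that fraction becomes.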
Paper draft: docs/paper/noeris.md. arXiv preprint coming.
```bibtex
@misc{noeris2026,
  title  = {Noeris: Architecture-Agnostic Kernel Fusion and Autotuning},
  author = {Doruk Tan Ozturk},
  year   = {2026},
  url    = {https://github.com/0sec-labs/noeris}
}
```

MIT License.