Noeris

the fusions your compiler can't find


Noeris discovers and optimizes cross-operation kernel fusions that torch.compile and single-op libraries like Liger Kernel can't find. The research engine contains the full operator/search stack; the current public noeris.patch() API is a conservative drop-in surface for RMSNorm and gated MLP activation patches.

What makes Noeris different

| What | Liger Kernel | Noeris |
| --- | --- | --- |
| Fusion level | Single-op (RMSNorm alone, RoPE alone) | Cross-op (RMSNorm+RoPE in 1 kernel) |
| Config tuning | Fixed | Bandit autotuned per shape per GPU |
| Attention | No | Beats cuDNN 6.24x on sliding-window |
| Architecture | Transformers only | Transformers + SSMs |
| Cross-hardware | No | Zero-shot ρ=0.907 |
| Backward pass | Individual op backward | Fused cross-op backward (QK-norm+RoPE) |

Key results

| Result | Metric |
| --- | --- |
| Sliding-window attention vs cuDNN FlashAttention (A100) | 6.24x faster (8/8 shapes) |
| Cross-op fusion: kernel launch reduction | 40 launches → 1 (QK-RMSNorm+RoPE) |
| End-to-end 26-layer Gemma 4 (A100) | 1.43x (41.4 ms → 29.1 ms) |
| Gemma 4 decoder layer deeper fusion (A100) | 1.13x (31b_local), 1.07x (31b_global), 1.90x (e2b_local) |
| Gemma 4 decoder layer deeper fusion (H100) | 1.26x (31b_local), 1.18x (31b_global), 2.29x (e2b_local) |
| Bandit convergence | 98% of optimal in 1 iteration (50x faster than grid search) |
| Model coverage | 19 models / 13 families |
| Cross-hardware zero-shot config prediction | ρ=0.907 (A100 from free T4 data) |
| Fused QK-RMSNorm+RoPE prologue (A100) | 10.2--12.9x vs separated launches |
| Fused QK-RMSNorm+RoPE prologue (H100) | 10.4--11.9x, peak 1628 GB/s |
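
The ρ figures in these results read as rank correlations between config orderings on two GPUs. As a minimal sketch, assuming ρ denotes Spearman's rank correlation (the README does not say so explicitly), this is how such a number could be computed. The latency lists are hypothetical, not measured Noeris data.

```python
def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical latencies (ms) of the same five kernel configs on two GPUs.
t4_ms = [4.1, 3.2, 5.0, 2.9, 3.8]
a100_ms = [1.0, 0.8, 1.3, 0.9, 1.1]
rho = spearman_rho(t4_ms, a100_ms)  # 0.8 for this toy data
```

A ρ close to 1 means the config ranking measured on the cheaper GPU mostly carries over, which is what makes zero-shot config prediction useful.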

Latest layer artifacts: docs/results/gemma4-layer-bench-deeper-fusion-a100-after-geglu-retune.json, docs/results/gemma4-layer-bench-deeper-fusion-h100-after-geglu-retune.json. Canonical results index: docs/results/README.md.

Quick start

import noeris
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-2b")
noeris.patch(model)  # Drop-in RMSNorm + gated MLP activation patches.
# QK-RMSNorm+RoPE kernels are available for custom integrations, but generic
# HuggingFace attention patching is not wired into noeris.patch() yet.

Contributor setup has separate CPU-only and Linux CUDA paths. Start with CONTRIBUTING.md if you are on macOS arm64, working without an NVIDIA GPU, or trying to run local CI parity.

Current public patch coverage

noeris.patch() currently wires two module-level optimizations into supported HuggingFace-style models:

  • RMSNorm module replacement, including Gemma's (1+w) affine mode.
  • Gated MLP activation fusion for GeGLU/SwiGLU-style gate_proj, up_proj, down_proj blocks.

The project also includes lower-level Triton kernels for QK-RMSNorm+RoPE, cross-entropy, attention, and other operators. Those kernels back the benchmark artifacts below, but QK-RMSNorm+RoPE and cross-entropy are not generic noeris.patch() hooks yet.
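
For readers unfamiliar with the two patched operations, here is a reference sketch of their math in plain Python. It is not the Noeris Triton code. The function names are hypothetical, and the tanh GELU approximation is an assumption about which GELU variant a GeGLU block uses.

```python
import math


def rmsnorm(x, w, eps=1e-6, gemma_mode=False):
    """RMSNorm over one vector; gemma_mode scales by (1 + w) instead of w."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * ((1.0 + wi) if gemma_mode else wi) for v, wi in zip(x, w)]


def gelu_tanh(v):
    """tanh approximation of GELU (assumed variant for GeGLU)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def silu(v):
    """SiLU, the activation used by SwiGLU-style blocks."""
    return v / (1.0 + math.exp(-v))


def gated_activation(gate, up, act):
    """Elementwise act(gate) * up, the step between gate_proj/up_proj and down_proj."""
    return [act(g) * u for g, u in zip(gate, up)]


x = [3.0, 4.0]
# A zero weight in Gemma's (1+w) mode matches a unit weight in the standard mode.
gemma_out = rmsnorm(x, [0.0, 0.0], gemma_mode=True)
plain_out = rmsnorm(x, [1.0, 1.0])
swiglu_out = gated_activation([0.0, 1.0], [5.0, 2.0], silu)
```

The (1+w) mode matters for a drop-in replacement: loading Gemma weights into a standard RMSNorm without the offset would silently scale activations by w instead of 1+w.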

CLI / search

python3 -m venv .venv && . .venv/bin/activate
pip install -e .

# Search a specific operator (A100 via Modal)
python -m research_engine.cli triton-iterate \
    --operator rmsnorm --gpu A100 --llm --configs-per-run 8

Local CI parity runner

From the repo root, on a Linux development environment with CUDA, run the same core checks that GitHub CI uses:

./scripts/ci_local.sh

If your default python3 does not have the test dependencies installed, select an interpreter that does:

PYTHON_BIN=python3.11 ./scripts/ci_local.sh

This runs unit tests, public artifact reference checks, two matmul-speedup benchmark runs, history export, and the history regression gate with --fail-on-missing.

On macOS arm64 or CPU-only machines, use the source-tree setup in CONTRIBUTING.md instead of pip install -e . because the checked-in lockfile has no native macOS arm64 wheel for Triton.

Free GPU validation (no paid compute)

# Kaggle (30 hr/week free T4) or Google Colab
!git clone https://github.com/0sec-labs/noeris
# Use %cd, not !cd: each ! command runs in its own subshell, so !cd does not persist.
%cd noeris
!pip install -e . numpy scikit-learn -q
!python scripts/colab_validate_all.py

Works alongside Liger Kernel

Noeris is complementary to Liger Kernel. Liger optimizes individual operations (RMSNorm, RoPE, SwiGLU separately). Noeris fuses operations across boundaries that single-op libraries can't cross -- like combining RMSNorm + RoPE into one kernel pass. Use both for maximum performance.

Model support

All 19 shapes pass correctness. Fusion speedup measured on T4 (Kaggle) and A100 (Modal).

| Model family | Models | Fusion speedup (T4) | Fusion speedup (A100) |
| --- | --- | --- | --- |
| Gemma | 4 E2B, 4 31B, 4 26B-A4B | 6.5--8.3x | 5.6--9.8x |
| LLaMA | 3 8B, 3 70B, 4 Scout | 8.3--8.9x | 3.9--6.3x |
| Qwen | 3 8B, 3 32B | 7.8--8.5x | 4.8--6.1x |
| Mistral / Mixtral | 7B, 8x7B | 8.3--8.9x | 5.2--6.3x |
| Phi | 3 mini, 4 mini | 8.3--8.5x | 5.2--6.1x |
| Falcon | 3 7B | 8.3x | 5.2x |
| DBRX | 132B | 8.9x | 6.3x |
| OLMo | 2 7B | 8.3x | 5.2x |
| InternLM | 3 8B | 8.3x | 5.2x |
| Mamba-3 | SSM scan | 1.88 GB/s | -- |

Operators

Noeris currently registers 22 operator specs. Ten are wired into the generic triton-iterate / ablation CLI, eight run in the default GitHub Actions Triton Iterate matrix, and two are exposed as drop-in noeris.patch() hooks. The full surface split is documented in docs/system/OPERATOR_SURFACE.md.

| Surface | Count | Operators |
| --- | --- | --- |
| Registered specs | 22 | attention, attention_decode, attention_v2, cross_entropy, cuda_qk_norm_rope, fused_norm_linear, geglu, gelu, grouped_gemm, kv_shared_attention, layernorm, matmul, matmul_splitk, moe_router, ple_fusion, ple_gather, qk_norm_rope, qk_norm_rope_bwd, rmsnorm, rotary, softmax, ssm_scan |
| Generic search CLI | 10 | matmul, rmsnorm, softmax, layernorm, cross_entropy, attention, attention_v2, rotary, geglu, fused_norm_linear |
| Default workflow matrix | 8 | matmul, rmsnorm, softmax, layernorm, cross_entropy, attention, rotary, geglu |
| Public noeris.patch() hooks | 2 | rmsnorm, geglu |

Registered Categories

| Category | Operators |
| --- | --- |
| Core | matmul, matmul_splitk, rmsnorm, layernorm, softmax, cross_entropy, gelu |
| Attention | attention, attention_v2, attention_decode, rotary, qk_norm_rope, qk_norm_rope_bwd, cuda_qk_norm_rope, kv_shared_attention |
| Fusion | geglu, fused_norm_linear, ple_gather, ple_fusion |
| Routing | moe_router, grouped_gemm |
| SSM | ssm_scan |
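
The surface split above is nested: every patch hook is in the workflow matrix, every workflow operator is in the search CLI, and the categories partition the registered specs. A small sketch that checks this mechanically follows. The variable and function names are hypothetical and do not mirror the actual registry code. The operator names are copied from the tables.

```python
REGISTERED = {
    "attention", "attention_decode", "attention_v2", "cross_entropy",
    "cuda_qk_norm_rope", "fused_norm_linear", "geglu", "gelu", "grouped_gemm",
    "kv_shared_attention", "layernorm", "matmul", "matmul_splitk", "moe_router",
    "ple_fusion", "ple_gather", "qk_norm_rope", "qk_norm_rope_bwd", "rmsnorm",
    "rotary", "softmax", "ssm_scan",
}
SEARCH_CLI = {
    "matmul", "rmsnorm", "softmax", "layernorm", "cross_entropy",
    "attention", "attention_v2", "rotary", "geglu", "fused_norm_linear",
}
WORKFLOW_MATRIX = {
    "matmul", "rmsnorm", "softmax", "layernorm", "cross_entropy",
    "attention", "rotary", "geglu",
}
PATCH_HOOKS = {"rmsnorm", "geglu"}
CATEGORIES = {
    "Core": {"matmul", "matmul_splitk", "rmsnorm", "layernorm", "softmax",
             "cross_entropy", "gelu"},
    "Attention": {"attention", "attention_v2", "attention_decode", "rotary",
                  "qk_norm_rope", "qk_norm_rope_bwd", "cuda_qk_norm_rope",
                  "kv_shared_attention"},
    "Fusion": {"geglu", "fused_norm_linear", "ple_gather", "ple_fusion"},
    "Routing": {"moe_router", "grouped_gemm"},
    "SSM": {"ssm_scan"},
}


def check_surface():
    """Verify nesting and the category partition, then return surface sizes."""
    assert PATCH_HOOKS <= WORKFLOW_MATRIX <= SEARCH_CLI <= REGISTERED
    categorized = [op for ops in CATEGORIES.values() for op in ops]
    assert len(categorized) == len(set(categorized))  # no operator in two categories
    assert set(categorized) == REGISTERED             # every spec is categorized
    return {
        "registered": len(REGISTERED),
        "search_cli": len(SEARCH_CLI),
        "workflow": len(WORKFLOW_MATRIX),
        "patch": len(PATCH_HOOKS),
    }


counts = check_surface()
```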

110+ shape buckets. 606 unit tests. T4/A100/H100 validation status varies by operator and benchmark path; see docs/results/ for the current canonical artifacts.

Search system

  • Thompson-sampling bandit + gradient-boosted cost model (R^2 = 0.94) + MAP-Elites quality-diversity
  • Cross-run shape-indexed config database -- knowledge compounds across sessions
  • Cross-hardware transfer -- A100-trained cost model rankings transfer to H100 with ρ=0.967
  • Kernel-aware NAS proxy -- ranks architecture candidates against A100/T4/H100 kernel profiles
  • ~$0.01 per iteration on Modal
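
To make the bandit component concrete, here is a minimal Gaussian Thompson-sampling sketch over a handful of kernel configs. It is an illustration under stated assumptions, not the Noeris search engine: the config names, latencies, prior, and noise level are all hypothetical. A simulated measurement stands in for a real benchmark run, and the cost model and MAP-Elites parts are left out.

```python
import random


def thompson_pick(stats, rng, prior_mean=1.0, prior_var=1.0, noise_var=0.0025):
    """Draw one plausible mean latency per config from its posterior; pick the lowest."""
    best, best_draw = None, float("inf")
    for cfg, (n, total) in stats.items():
        # Conjugate Gaussian update with known measurement-noise variance.
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var + total / noise_var)
        draw = rng.gauss(post_mean, post_var ** 0.5)
        if draw < best_draw:
            best, best_draw = cfg, draw
    return best


def tune(true_latency_ms, iterations=200, seed=0):
    """Run the bandit loop; return the most-sampled config and per-config stats."""
    rng = random.Random(seed)
    stats = {cfg: (0, 0.0) for cfg in true_latency_ms}  # (count, sum of measurements)
    for _ in range(iterations):
        cfg = thompson_pick(stats, rng)
        measured = rng.gauss(true_latency_ms[cfg], 0.05)  # stand-in for a benchmark run
        n, total = stats[cfg]
        stats[cfg] = (n + 1, total + measured)
    return max(stats, key=lambda c: stats[c][0]), stats


# Hypothetical configs and their true mean latencies in milliseconds.
LATENCY_MS = {"cfg_a": 1.2, "cfg_b": 0.6, "cfg_c": 0.9}
best, stats = tune(LATENCY_MS)
```

Because each pick samples from the posterior rather than taking its mean, uncertain configs still get tried occasionally, while measurements concentrate on the configs that look fastest.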

Compiler comparison (T4)

| Stage | Kernel launches | Prologue time |
| --- | --- | --- |
| PyTorch eager | 40 | 3.45 ms |
| torch.compile | 9 (4 Triton kernels) | 0.92 ms (3.75x) |
| Noeris | 1 | 0.57 ms (6.08x) |

torch.compile splits at the RMSNorm reduction boundary and materializes to HBM between passes. Noeris fuses the entire prologue into a single kernel launch.
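
The fusion is legal because RoPE needs only the normalization scale, which comes from a single reduction over the head vector. The normalized vector never has to be written out in between. The sketch below shows this in plain Python for one head vector. It is a reference for the math, not the Triton kernel, and it assumes the rotate-half RoPE pairing (element i with element i + d/2) and a base of 10000, neither of which the README specifies.

```python
import math


def rmsnorm(x, w, eps=1e-6):
    """Pass 1 of the unfused path: materializes the normalized vector."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]


def rope(x, pos, base=10000.0):
    """Pass 2 of the unfused path: rotate-half RoPE at position pos."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        angle = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def fused_rmsnorm_rope(x, w, pos, eps=1e-6, base=10000.0):
    """One reduction, then normalize and rotate each pair without an intermediate vector."""
    half = len(x) // 2
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [0.0] * len(x)
    for i in range(half):
        lo = x[i] * inv_rms * w[i]
        hi = x[i + half] * inv_rms * w[i + half]
        angle = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = lo * c - hi * s
        out[i + half] = hi * c + lo * s
    return out


x = [0.5, -1.0, 2.0, 0.25]
w = [1.0, 0.5, 2.0, 1.5]
two_pass = rope(rmsnorm(x, w), pos=7)
one_pass = fused_rmsnorm_rope(x, w, pos=7)
```

On a GPU, the list returned by `rmsnorm` corresponds to the intermediate tensor that an unfused pipeline writes to and reads back from HBM.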

Honest framing

  • vLLM has an experimental enable_qk_norm_rope_fusion pass (disabled by default due to H100 regression, issue #34391). We make this fusion practical with parameterized Triton + bandit autotuning.
  • Mirage (OSDI 2025) demonstrated fused norm+matmul. We implement it independently in Triton with autotuning.
  • FlexAttention does block-sparse tile skipping. To our knowledge, we are the first to measure >3x wins on narrow-window shapes.
  • Liger Kernel fuses RMSNorm and RoPE backward passes separately. To our knowledge, no existing framework fuses the combined QK-RMSNorm+RoPE backward into a single kernel.
  • All novelty claims carry a "to our knowledge" qualification.
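
To see why narrow-window shapes leave so much room for tile skipping, one can count which key/value tiles a causal sliding-window mask actually touches. The sketch below is illustrative arithmetic only. It assumes a window that includes the current token, square tiles, and a sequence length divisible by the tile size, and it describes neither the Noeris nor the FlexAttention implementation.

```python
def tile_needed(q, k, block, window):
    """True if any (query i, key j) pair in tile (q, k) satisfies 0 <= i - j < window."""
    max_diff = (q + 1) * block - 1 - k * block    # last query row minus first key column
    min_diff = q * block - ((k + 1) * block - 1)  # first query row minus last key column
    return max_diff >= 0 and min_diff <= window - 1


def tile_counts(seq_len, block, window):
    """Return (tiles touched by the window, tiles touched by plain causal, all tiles)."""
    n = seq_len // block
    needed = sum(tile_needed(q, k, block, window) for q in range(n) for k in range(n))
    return needed, n * (n + 1) // 2, n * n


# Hypothetical shape: 4096 tokens, 128x128 tiles, 512-token window.
needed, causal, total = tile_counts(seq_len=4096, block=128, window=512)
# needed == 150, causal == 528, total == 1024
```

For this shape only 150 of the 528 causal tiles contain any unmasked entry, so a kernel that skips the rest does under 30% of the causal tile work.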

Citing

Paper draft: docs/paper/noeris.md. arXiv preprint coming.

@misc{noeris2026,
  title   = {Noeris: Architecture-Agnostic Kernel Fusion and Autotuning},
  author  = {Doruk Tan Ozturk},
  year    = {2026},
  url     = {https://github.com/0sec-labs/noeris}
}

MIT License.
