Noeris

the fusions your compiler can't find

Noeris discovers and optimizes cross-operation kernel fusions that torch.compile and single-op libraries like Liger Kernel can't find. The research engine contains the full operator/search stack; the current public noeris.patch() API is a conservative drop-in surface for RMSNorm and gated MLP activation patches.

What makes Noeris different

| What | Liger Kernel | Noeris |
| --- | --- | --- |
| Fusion level | Single-op (RMSNorm alone, RoPE alone) | Cross-op (RMSNorm+RoPE in 1 kernel) |
| Config tuning | Fixed | Bandit autotuned per shape per GPU |
| Attention | No | Beats cuDNN 6.24x on sliding-window |
| Architecture | Transformers only | Transformers + SSMs |
| Cross-hardware | No | Zero-shot ρ=0.907 |
| Backward pass | Individual op backward | Fused cross-op backward (QK-norm+RoPE) |

Key results

| Result | Metric |
| --- | --- |
| Sliding-window attention vs cuDNN FlashAttention (A100) | 6.24x faster (8/8 shapes) |
| Cross-op fusion: kernel launch reduction | 40 launches → 1 (QK-RMSNorm+RoPE) |
| End-to-end 26-layer Gemma 4 (A100) | 1.43x (41.4 ms → 29.1 ms) |
| Gemma 4 decoder layer deeper fusion (A100) | 1.13x (`31b_local`), 1.07x (`31b_global`), 1.90x (`e2b_local`) |
| Gemma 4 decoder layer deeper fusion (H100) | 1.26x (`31b_local`), 1.18x (`31b_global`), 2.29x (`e2b_local`) |
| Bandit convergence | 98% of optimal in 1 iteration (50x faster than grid search) |
| Model coverage | 19 models / 13 families |
| Cross-hardware zero-shot config prediction | ρ=0.907 (A100 from free T4 data) |
| Fused QK-RMSNorm+RoPE prologue (A100) | 10.2--12.9x vs separated launches |
| Fused QK-RMSNorm+RoPE prologue (H100) | 10.4--11.9x, peak 1628 GB/s |

Latest layer artifacts: docs/results/gemma4-layer-bench-deeper-fusion-a100-after-geglu-retune.json, docs/results/gemma4-layer-bench-deeper-fusion-h100-after-geglu-retune.json. Canonical results index: docs/results/README.md.
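
The ρ values in the results above are reported without naming the coefficient. Assuming they are Spearman rank correlations between the orderings of the same configs on two GPUs (an assumption, not something this README states), the metric can be sketched as follows. The latencies are invented for illustration and are not Noeris measurements.

```python
def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical latencies (ms) for the same five configs on two GPUs.
t4_ms = [1.9, 1.2, 2.4, 1.5, 3.0]
a100_ms = [0.50, 0.31, 0.48, 0.40, 0.77]
rho = spearman_rho(t4_ms, a100_ms)
print(f"rho = {rho:.3f}")  # rho = 0.900
```

A ρ close to 1 means that configs which rank well on the source GPU also rank well on the target, which is what makes zero-shot config transfer plausible.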

Quick start

import noeris
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-2b")
noeris.patch(model)  # Drop-in RMSNorm + gated MLP activation patches.
# QK-RMSNorm+RoPE kernels are available for custom integrations, but generic
# HuggingFace attention patching is not wired into noeris.patch() yet.

Contributor setup has separate CPU-only and Linux CUDA paths. Start with CONTRIBUTING.md if you are on macOS arm64, working without an NVIDIA GPU, or trying to run local CI parity.

Current public patch coverage

noeris.patch() currently wires two module-level optimizations into supported HuggingFace-style models:

- RMSNorm module replacement, including Gemma's (1+w) affine mode.
- Gated MLP activation fusion for GeGLU/SwiGLU-style gate_proj, up_proj, down_proj blocks.

The project also includes lower-level Triton kernels for QK-RMSNorm+RoPE, cross-entropy, attention, and other operators. Those kernels back the benchmark artifacts below, but QK-RMSNorm+RoPE and cross-entropy are not generic noeris.patch() hooks yet.
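
For reference, the math behind the two patched module types can be written in a few lines of plain Python. This is a sketch of the semantics only, not the Noeris kernels: the function names are hypothetical, and the erf-based GELU is one common choice (some models use the tanh approximation instead).

```python
import math


def rmsnorm(x, w, eps=1e-6, gemma_mode=False):
    """RMSNorm over one vector; gemma_mode scales by (1 + w) instead of w."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * ((1.0 + wi) if gemma_mode else wi) for v, wi in zip(x, w)]


def gelu(v):
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v):
    """SiLU / swish: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_activation(gate, up, act=gelu):
    """Elementwise act(gate) * up: GeGLU with gelu, SwiGLU with silu."""
    return [act(g) * u for g, u in zip(gate, up)]


x = [1.0, -2.0, 3.0, -4.0]
# A zero weight is the identity scale in the (1 + w) mode ...
y_gemma = rmsnorm(x, [0.0] * 4, gemma_mode=True)
# ... and matches the standard mode with a weight of ones.
y_plain = rmsnorm(x, [1.0] * 4)
# gate and up stand in for the outputs of gate_proj(x) and up_proj(x).
h = gated_activation([0.0, 1.0], [5.0, 2.0], act=silu)
```

The (1+w) mode matters for a drop-in replacement because a checkpoint trained with that convention stores weights centered on zero rather than one.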

CLI / search

python3 -m venv .venv && . .venv/bin/activate
pip install -e .

# Search a specific operator (A100 via Modal)
python -m research_engine.cli triton-iterate \
    --operator rmsnorm --gpu A100 --llm --configs-per-run 8

Local CI parity runner

Run the same core checks used in GitHub CI from repo root on a Linux CUDA-capable development environment:

./scripts/ci_local.sh

If your default python3 does not have test dependencies installed, select one:

PYTHON_BIN=python3.11 ./scripts/ci_local.sh

This runs unit tests, public artifact reference checks, two matmul-speedup benchmark runs, history export, and the history regression gate with --fail-on-missing.

On macOS arm64 or CPU-only machines, use the source-tree setup in CONTRIBUTING.md instead of pip install -e .; the checked-in lockfile has no native macOS arm64 wheel for Triton.

Free GPU validation (no paid compute)

# Kaggle (30 hr/week free T4) or Google Colab
!git clone https://github.com/0sec-labs/noeris
# Use %cd so the directory change persists; `!cd` only affects its own subshell.
%cd noeris
!pip install -e . numpy scikit-learn -q
!python scripts/colab_validate_all.py

Works alongside Liger Kernel

Noeris is complementary to Liger Kernel. Liger optimizes individual operations (RMSNorm, RoPE, and SwiGLU, each in its own kernel). Noeris fuses operations across boundaries that single-op libraries can't cross, such as combining RMSNorm + RoPE into one kernel pass. Use both for maximum performance.

Model support

All 19 model shapes pass correctness checks. Fusion speedup is measured on T4 (Kaggle) and A100 (Modal).

| Model family | Models | Fusion speedup (T4) | Fusion speedup (A100) |
| --- | --- | --- | --- |
| Gemma | 4 E2B, 4 31B, 4 26B-A4B | 6.5--8.3x | 5.6--9.8x |
| LLaMA | 3 8B, 3 70B, 4 Scout | 8.3--8.9x | 3.9--6.3x |
| Qwen | 3 8B, 3 32B | 7.8--8.5x | 4.8--6.1x |
| Mistral / Mixtral | 7B, 8x7B | 8.3--8.9x | 5.2--6.3x |
| Phi | 3 mini, 4 mini | 8.3--8.5x | 5.2--6.1x |
| Falcon 3 | 7B | 8.3x | 5.2x |
| DBRX | 132B | 8.9x | 6.3x |
| OLMo 2 | 7B | 8.3x | 5.2x |
| InternLM 3 | 8B | 8.3x | 5.2x |
| Mamba-3 | SSM scan | 1.88 GB/s | -- |

Operators

Noeris currently registers 22 operator specs. Ten are wired into the generic triton-iterate / ablation CLI, eight run in the default GitHub Actions Triton Iterate matrix, and two are exposed as drop-in noeris.patch() hooks. The full surface split is documented in docs/system/OPERATOR_SURFACE.md.

| Surface | Count | Operators |
| --- | --- | --- |
| Registered specs | 22 | `attention`, `attention_decode`, `attention_v2`, `cross_entropy`, `cuda_qk_norm_rope`, `fused_norm_linear`, `geglu`, `gelu`, `grouped_gemm`, `kv_shared_attention`, `layernorm`, `matmul`, `matmul_splitk`, `moe_router`, `ple_fusion`, `ple_gather`, `qk_norm_rope`, `qk_norm_rope_bwd`, `rmsnorm`, `rotary`, `softmax`, `ssm_scan` |
| Generic search CLI | 10 | `matmul`, `rmsnorm`, `softmax`, `layernorm`, `cross_entropy`, `attention`, `attention_v2`, `rotary`, `geglu`, `fused_norm_linear` |
| Default workflow matrix | 8 | `matmul`, `rmsnorm`, `softmax`, `layernorm`, `cross_entropy`, `attention`, `rotary`, `geglu` |
| Public `noeris.patch()` hooks | 2 | `rmsnorm`, `geglu` |

Registered Categories

| Category | Operators |
| --- | --- |
| Core | `matmul`, `matmul_splitk`, `rmsnorm`, `layernorm`, `softmax`, `cross_entropy`, `gelu` |
| Attention | `attention`, `attention_v2`, `attention_decode`, `rotary`, `qk_norm_rope`, `qk_norm_rope_bwd`, `cuda_qk_norm_rope`, `kv_shared_attention` |
| Fusion | `geglu`, `fused_norm_linear`, `ple_gather`, `ple_fusion` |
| Routing | `moe_router`, `grouped_gemm` |
| SSM | `ssm_scan` |

110+ shape buckets. 606 unit tests. T4/A100/H100 validation status varies by operator and benchmark path; see docs/results/ for the current canonical artifacts.

Search system

- Thompson-sampling bandit + gradient-boosted cost model (R^2 = 0.94) + MAP-Elites quality-diversity
- Cross-run shape-indexed config database -- knowledge compounds across sessions
- Cross-hardware transfer -- A100-trained cost model rankings transfer to H100 with ρ=0.967
- Kernel-aware NAS proxy -- ranks architecture candidates against A100/T4/H100 kernel profiles
- ~$0.01 per iteration on Modal
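
To make the bandit component concrete, here is a minimal Gaussian Thompson-sampling loop over kernel configs. It is a sketch under stated assumptions, not the Noeris search engine: the class name, config strings, prior, and noise level are all hypothetical, and the cost model and MAP-Elites parts are left out entirely.

```python
import random


class GaussianThompson:
    """Thompson sampling over kernel configs, minimizing measured latency."""

    def __init__(self, arms, prior_mean=1.0, prior_var=1.0,
                 noise_var=0.02 ** 2, seed=0):
        self.arms = list(arms)
        self.prior_mean = prior_mean
        self.prior_var = prior_var
        self.noise_var = noise_var
        self.count = {a: 0 for a in self.arms}
        self.total = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def posterior(self, arm):
        """Conjugate Normal update with known observation noise."""
        precision = 1.0 / self.prior_var + self.count[arm] / self.noise_var
        mean = (self.prior_mean / self.prior_var
                + self.total[arm] / self.noise_var) / precision
        return mean, 1.0 / precision

    def select(self):
        """Draw one latency sample per arm and pick the lowest draw."""
        draws = {}
        for arm in self.arms:
            mean, var = self.posterior(arm)
            draws[arm] = self.rng.gauss(mean, var ** 0.5)
        return min(draws, key=draws.get)

    def update(self, arm, latency_ms):
        """Record one latency measurement for the chosen arm."""
        self.count[arm] += 1
        self.total[arm] += latency_ms


# Hypothetical configs with made-up "true" latencies (ms).
# A real run would time a kernel launch instead of sampling a Gaussian.
true_ms = {
    "BLOCK=64,warps=2": 1.00,
    "BLOCK=128,warps=4": 0.60,
    "BLOCK=256,warps=8": 0.80,
}
bandit = GaussianThompson(true_ms)
bench_rng = random.Random(1)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, bench_rng.gauss(true_ms[arm], 0.02))

best = max(bandit.count, key=bandit.count.get)
print(best, bandit.count[best])
```

Because each arm is chosen in proportion to the posterior probability that it is the fastest, measurements concentrate on promising configs after only a few pulls, while clearly slower configs are rarely re-timed.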

Compiler comparison (T4)

| Stage | Kernel launches | Prologue time |
| --- | --- | --- |
| PyTorch eager | 40 | 3.45 ms |
| `torch.compile` | 9 (4 Triton kernels) | 0.92 ms (3.75x) |
| Noeris | 1 | 0.57 ms (6.08x) |

torch.compile splits at the RMSNorm reduction boundary and materializes to HBM between passes. Noeris fuses the entire prologue into a single kernel launch.
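
As a reference for what a fused prologue has to compute, the sketch below normalizes one head vector and applies RoPE in a single pass, without storing the normalized vector in between. It is plain Python describing the math, not the Triton kernel. The function name is hypothetical, and the rotate-half pairing (element i with element i + d/2) is an assumed convention.

```python
import math


def qk_norm_rope(head, weight, position, base=10000.0, eps=1e-6):
    """RMSNorm one head vector, then apply rotate-half RoPE in the same pass."""
    d = len(head)
    half = d // 2
    # The reduction that a split pipeline would finish in its own kernel.
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in head) / d + eps)
    out = [0.0] * d
    for i in range(half):
        # Normalize the two elements that RoPE rotates together ...
        a = head[i] * inv_rms * weight[i]
        b = head[i + half] * inv_rms * weight[i + half]
        # ... and rotate them immediately, with no intermediate buffer.
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


q = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * 4
at_pos0 = qk_norm_rope(q, ones, position=0)  # rotation by angle 0
at_pos5 = qk_norm_rope(q, ones, position=5)
```

In a GPU kernel the same structure keeps the normalized values in registers, so the round trip through HBM between the norm and the rotation disappears along with the extra launches.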

Honest framing

- vLLM has an experimental enable_qk_norm_rope_fusion pass (disabled by default due to an H100 regression, issue #34391). We make this fusion practical with parameterized Triton + bandit autotuning.
- Mirage (OSDI 2025) demonstrated fused norm+matmul. We implement it independently in Triton with autotuning.
- FlexAttention does block-sparse tile skipping. To our knowledge, we are the first to measure >3x wins on narrow-window shapes.
- Liger Kernel fuses RMSNorm and RoPE backward passes separately. To our knowledge, no existing framework fuses the combined QK-RMSNorm+RoPE backward into a single kernel.
- All novelty claims carry a "to our knowledge" qualification.
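
The tile-skipping point can be illustrated with simple counting. The sketch below counts how many (query tile, key tile) pairs contain at least one unmasked position under causal attention, with and without a sliding window. The function name, tile size, and shape are hypothetical, and the resulting ratio is only a bound on skipped work, not a latency prediction or a reproduction of the measured speedups.

```python
def tiles_visited(seq_len, block, window=None):
    """Count tile pairs with at least one unmasked (query, key) position.

    Causal masking lets query i see keys <= i. With a sliding window,
    query i sees only keys in [i - window + 1, i].
    """
    n_tiles = (seq_len + block - 1) // block
    visited = 0
    for qt in range(n_tiles):
        q_lo = qt * block
        q_hi = min((qt + 1) * block, seq_len) - 1
        for kt in range(n_tiles):
            k_lo = kt * block
            k_hi = min((kt + 1) * block, seq_len) - 1
            if k_lo > q_hi:
                continue  # tile lies entirely in the causal future
            if window is not None and k_hi < q_lo - window + 1:
                continue  # tile is older than any key these queries can see
            visited += 1
    return visited


full = tiles_visited(4096, 128)                # 528 tiles
narrow = tiles_visited(4096, 128, window=512)  # 150 tiles
print(full, narrow, round(full / narrow, 2))   # 528 150 3.52
```

The narrower the window relative to the sequence length, the larger the share of tiles that can be skipped, which is why narrow-window shapes are where tile skipping has the most room to pay off.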

Citing

Paper draft: docs/paper/noeris.md. arXiv preprint coming.

@misc{noeris2026,
  title   = {Noeris: Architecture-Agnostic Kernel Fusion and Autotuning},
  author  = {Doruk Tan Ozturk},
  year    = {2026},
  url     = {https://github.com/0sec-labs/noeris}
}

MIT License.
