Fused GPU kernels for LLM inference. Memory-bound → Compute-bound.
- Operator fusion with correctness guarantees: every kernel ships with CPU-testable NumPy reference implementations, not just speed claims.
- Production-ready FP8 GEMM pipeline: explicit scale management and overflow handling, not toy quantization examples.
- Latency-driven autotuner with persistent config cache: TritonAutoTuner + ConfigCache, not one-off benchmark scripts.
- OpenSpec-driven development: every non-trivial change is design-documented before code, not YOLO-coded.
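To make the autotuner bullet concrete, here is a minimal sketch of what "latency-driven tuning with a persistent cache" can look like. It is not the TritonAutoTuner or ConfigCache API: `JsonConfigCache`, `tune`, the cache key format, and the latencies are all hypothetical, chosen only to illustrate the idea of measuring once and reusing the winner across processes.

```python
import json
import tempfile
from pathlib import Path


class JsonConfigCache:
    """Hypothetical persistent cache: maps a problem key to its fastest config."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if the cache file already exists.
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, config, latency_ms):
        self.entries[key] = {"config": config, "latency_ms": latency_ms}
        self.path.write_text(json.dumps(self.entries, indent=2))


def tune(key, candidates, measure_ms, cache):
    """Return the lowest-latency config, measuring only on a cache miss."""
    hit = cache.get(key)
    if hit is not None:
        return hit["config"]
    timed = [(measure_ms(cfg), cfg) for cfg in candidates]
    best_ms, best_cfg = min(timed, key=lambda pair: pair[0])
    cache.put(key, best_cfg, best_ms)
    return best_cfg


# Demo with made-up latencies standing in for real kernel timings.
fake_latency = {64: 0.42, 128: 0.31, 256: 0.37}
candidates = [{"BLOCK_SIZE": b} for b in fake_latency]
calls = []


def measure_ms(cfg):
    calls.append(cfg)
    return fake_latency[cfg["BLOCK_SIZE"]]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "tuning_cache.json"
    first = tune("rmsnorm:4096:fp16", candidates, measure_ms, JsonConfigCache(cache_file))
    # A fresh cache object reads the file back, so nothing is re-measured.
    second = tune("rmsnorm:4096:fp16", candidates, measure_ms, JsonConfigCache(cache_file))

print(first, second, len(calls))
```

Keying the cache on the problem description (kernel, shape, dtype) rather than on the process is what separates this from a one-off benchmark script: the second lookup costs a file read, not a sweep.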
User API (triton_ops.__init__)
├── Validation Layer (device, dtype, shape, contiguity)
├── Compute Reference Layer (NumPy, CPU-testable)
├── Kernel Layer (Triton, GPU)
└── Tooling Layer (autotuner, benchmark, performance metrics)
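As an illustration of how the validation and compute reference layers might fit together, the sketch below checks dtype, shape, and contiguity before running a plain NumPy RMSNorm. The function names, the accepted dtypes, and the `eps` default are assumptions for this example and are not taken from `triton_ops`; the device check is omitted because NumPy arrays always live on the CPU.

```python
import numpy as np


def validate_input(x, weight):
    """Reject inputs that a kernel could not safely consume (hypothetical checks)."""
    if x.dtype not in (np.float16, np.float32):
        raise TypeError(f"unsupported dtype: {x.dtype}")
    if x.ndim != 3:
        raise ValueError(f"expected [batch, seq, hidden], got shape {x.shape}")
    if weight.shape != (x.shape[-1],):
        raise ValueError("weight must have shape [hidden]")
    if not x.flags["C_CONTIGUOUS"]:
        raise ValueError("input must be contiguous")


def rmsnorm_reference(x, weight, eps=1e-6):
    """Plain RMSNorm: x / sqrt(mean(x^2) + eps) * weight, accumulated in float32."""
    validate_input(x, weight)
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms * weight.astype(np.float32)).astype(x.dtype)


x = np.ones((1, 2, 4), dtype=np.float32)
w = np.full(4, 2.0, dtype=np.float32)
out = rmsnorm_reference(x, w)
print(out.shape, out.dtype)
```

A reference like this runs anywhere NumPy does, so a GPU kernel's output can be compared against it in CI without a CUDA device.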
See the Architecture Lab and Kernel Families docs for details.
git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"

CPU-only validation (no GPU required):
ruff format --check . && ruff check . && mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build

Full GPU benchmark (requires CUDA):
import torch
from triton_ops import fused_rmsnorm_rope, BenchmarkSuite
from triton_ops.reference import fused_rmsnorm_rope as fused_rmsnorm_rope_reference
x = torch.randn(2, 2048, 4096, device="cuda", dtype=torch.float16)
suite = BenchmarkSuite(warmup_runs=10, benchmark_runs=100)
result = suite.benchmark_kernel(
    fused_rmsnorm_rope, fused_rmsnorm_rope_reference,
    "fused_rmsnorm_rope", (2, 2048, 4096), x, ...
)
print(result.metrics.latency_ms)

Representative numbers on NVIDIA A100 SXM4 80GB (CUDA 12.1, PyTorch 2.1, Triton 2.1). Methodology: 10 warmup runs + 100 benchmark runs with torch.cuda.synchronize() before and after timing.
| Kernel | Speedup vs PyTorch | Memory Traffic Reduction |
|---|---|---|
| fused_rmsnorm_rope | up to ~3.0× | ~40% |
| fused_gated_mlp | ~1.3×–1.8× | ~25% |
| fp8_gemm | ~1.2×–1.5× | ~50% (weights) |
See Benchmark Visualization for interactive charts.
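The timing methodology stated above (warmup runs first, then timed runs bracketed by synchronization) can be sketched in a few lines. This is not BenchmarkSuite's implementation: `time_callable` is a hypothetical helper, and synchronizing around every individual run and reporting the median are assumptions made for this example. The `sync` hook is a no-op on CPU; on a GPU one would pass `torch.cuda.synchronize`.

```python
import statistics
import time


def time_callable(fn, warmup_runs=10, benchmark_runs=100, sync=None):
    """Return median latency in ms; `sync` runs before and after each timed call."""
    sync = sync or (lambda: None)
    for _ in range(warmup_runs):
        fn()  # warmup absorbs compilation and cache effects
    samples_ms = []
    for _ in range(benchmark_runs):
        sync()  # drain pending work so it is not billed to this run
        start = time.perf_counter()
        fn()
        sync()  # wait for asynchronous work before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms)


# CPU-only demo: counters stand in for a kernel launch and a device sync.
calls = {"fn": 0, "sync": 0}


def work():
    calls["fn"] += 1
    sum(range(1000))


def fake_sync():
    calls["sync"] += 1


latency_ms = time_callable(work, sync=fake_sync)
print(f"median latency: {latency_ms:.4f} ms")
```

Without the trailing synchronization, an asynchronous kernel launch returns almost immediately and the measured time reflects launch overhead rather than execution time.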
| Section | Best For | Key Takeaway |
|---|---|---|
| Academy | First-time users | Narrative reading path from system overview to implementation seams |
| Architecture Lab | Interview prep | Module seams, runtime contracts, public exports |
| Performance | Tuning practitioners | Correct timing, bottleneck analysis |
| Reference & Research | Deep learning researchers | Papers, projects, tech stack landscape |
This repository is OpenSpec-driven for non-trivial work. See AGENTS.md, CLAUDE.md, and openspec/README.md.
@software{triton_fused_ops,
title = {Triton Fused Ops: High-Performance GPU Kernels for Transformer Inference},
author = {LessUp},
year = {2025},
url = {https://github.com/LessUp/triton-fused-ops},
note = {Built on OpenAI Triton, PyTorch, and CUDA}
}

- OpenAI Triton for the compiler and Python DSL
- PyTorch for the tensor runtime
- NVIDIA for CUDA, FP8 hardware, and performance tooling
MIT.