AICL-Lab/triton-fused-ops

Triton Fused Ops

CI | Pages | License: MIT | Python 3.9+ | PyTorch 2.0+ | Triton 2.1+ | CUDA 12.1+ | Code Style: Ruff | Types: mypy

Fused GPU kernels for LLM inference. Memory-bound → Compute-bound.

📖 Docs | 🇨🇳 Chinese | 💡 Examples | 🤝 Contributing


Why this repo stands out

  • Operator fusion with correctness guarantees — every kernel ships with CPU-testable NumPy reference implementations, not just speed claims.
  • Production-ready FP8 GEMM pipeline — explicit scale management and overflow handling, not toy quantization examples.
  • Latency-driven autotuner with persistent config cache: TritonAutoTuner + ConfigCache, not one-off benchmark scripts.
  • OpenSpec-driven development — every non-trivial change is design-documented before code, not YOLO-coded.
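
The autotuner bullet above can be illustrated with a minimal sketch: time each candidate config, keep the one with the lowest latency, and persist the winner in a JSON file keyed by problem shape so later runs skip tuning. The names JsonConfigCache, tune, fake_kernel, and the BLOCK_SIZE field are hypothetical and are not the TritonAutoTuner or ConfigCache API; the "kernel" is a CPU stand-in.

```python
import json
import tempfile
import time
from pathlib import Path


class JsonConfigCache:
    """Hypothetical persistent cache (not the repo's ConfigCache)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if the cache file already exists.
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, config):
        self.entries[key] = config
        self.path.write_text(json.dumps(self.entries, indent=2))


def tune(run, configs, key, cache, repeats=5):
    """Return the lowest-latency config for `key`, reusing a cached winner if present."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    best_config, best_latency = None, float("inf")
    for config in configs:
        run(config)  # one warmup call, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            run(config)
        latency = (time.perf_counter() - start) / repeats
        if latency < best_latency:
            best_config, best_latency = config, latency
    cache.put(key, best_config)
    return best_config


# Demo with a fake CPU "kernel"; calls are recorded to show that the cache skips re-tuning.
calls = []


def fake_kernel(config):
    calls.append(config["BLOCK_SIZE"])
    sum(range(config["BLOCK_SIZE"]))


cache_path = Path(tempfile.mkdtemp()) / "tuning_cache.json"
candidates = [{"BLOCK_SIZE": 64}, {"BLOCK_SIZE": 128}, {"BLOCK_SIZE": 256}]
key = "fake_kernel:2x2048x4096:float16"

first = tune(fake_kernel, candidates, key, JsonConfigCache(cache_path))
calls_after_first = len(calls)
# A fresh cache object reads the same file, so no kernel call is made.
second = tune(fake_kernel, candidates, key, JsonConfigCache(cache_path))
```

For a real GPU kernel the timed region would also need device synchronization, since kernel launches return before the work finishes.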

Architecture

User API (triton_ops.__init__)
    ├── Validation Layer (device, dtype, shape, contiguity)
    ├── Compute Reference Layer (NumPy, CPU-testable)
    ├── Kernel Layer (Triton, GPU)
    └── Tooling Layer (autotuner, benchmark, performance metrics)

See the Architecture Lab and Kernel Families docs for details.
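
As a rough illustration of how the validation and compute reference layers might fit together, the sketch below checks dtype, shape, and contiguity, then computes RMSNorm in NumPy. It covers only the RMSNorm half of fused_rmsnorm_rope and omits the device check, since NumPy arrays live on the CPU. The function names, the eps default, and the accepted dtypes are assumptions, not the repository's API.

```python
import numpy as np


def validate(x, weight):
    """Validation layer (sketch): reject bad inputs before any compute runs."""
    if x.dtype not in (np.float16, np.float32):
        raise TypeError(f"unsupported dtype: {x.dtype}")
    if x.ndim != 3:
        raise ValueError(f"expected a 3-D input, got shape {x.shape}")
    if weight.shape != (x.shape[-1],):
        raise ValueError("weight must have shape (hidden,)")
    if not x.flags["C_CONTIGUOUS"]:
        raise ValueError("x must be contiguous")


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference layer (sketch): RMSNorm in float32, cast back to the input dtype."""
    validate(x, weight)
    x32 = x.astype(np.float32)
    # Root mean square over the last (hidden) axis.
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms * weight.astype(np.float32)).astype(x.dtype)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16)).astype(np.float32)
weight = np.ones(16, dtype=np.float32)
y = rmsnorm_reference(x, weight)
```

A reference like this runs in CI without a GPU, and a GPU test can compare kernel output against it with a tolerance suited to FP16.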

Quick Start

git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"

CPU-only validation (no GPU required):

ruff format --check . && ruff check . && mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build

Full GPU benchmark (requires CUDA):

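import torch
from triton_ops import fused_rmsnorm_rope, BenchmarkSuite
# Reference implementation, aliased so it does not shadow the fused kernel.
from triton_ops.reference import fused_rmsnorm_rope as fused_rmsnorm_rope_reference

# FP16 input tensor allocated on the GPU.
x = torch.randn(2, 2048, 4096, device="cuda", dtype=torch.float16)
# 10 warmup runs followed by 100 timed runs.
suite = BenchmarkSuite(warmup_runs=10, benchmark_runs=100)
# Benchmark the fused kernel against the reference on the same input.
result = suite.benchmark_kernel(
    fused_rmsnorm_rope, fused_rmsnorm_rope_reference,
    "fused_rmsnorm_rope", (2, 2048, 4096), x, ...
)
print(result.metrics.latency_ms)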

Performance

Representative numbers on NVIDIA A100 SXM4 80GB (CUDA 12.1, PyTorch 2.1, Triton 2.1). Methodology: 10 warmup runs + 100 benchmark runs with torch.cuda.synchronize() before and after timing.
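
The methodology above can be sketched as a small timing loop. This is an illustration, not BenchmarkSuite itself: measure_latency_ms and its sync parameter are hypothetical, synchronizing around every timed run and reporting the median are assumptions, and the demo uses a CPU function with a counting stand-in for torch.cuda.synchronize.

```python
import statistics
import time


def measure_latency_ms(fn, warmup_runs=10, benchmark_runs=100, sync=lambda: None):
    """Median latency of fn() in milliseconds.

    `sync` is called before the timer starts and before it stops. On a GPU it
    would be torch.cuda.synchronize, so that queued kernels finish inside the
    timed region instead of leaking into the next measurement.
    """
    for _ in range(warmup_runs):
        fn()  # warmup: compilation and cache effects are excluded from timing
    samples = []
    for _ in range(benchmark_runs):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU demo: count calls to show the warmup/benchmark split.
counts = {"fn": 0, "sync": 0}


def work():
    counts["fn"] += 1
    sum(range(1000))


def fake_sync():
    counts["sync"] += 1


latency_ms = measure_latency_ms(work, sync=fake_sync)
```

Without the synchronize calls, a host-side timer mostly measures kernel launch overhead, because CUDA launches are asynchronous.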

Kernel             | Speedup vs PyTorch | Memory Traffic Reduction
fused_rmsnorm_rope | up to ~3.0×        | ~40%
fused_gated_mlp    | ~1.3×–1.8×         | ~25%
fp8_gemm           | ~1.2×–1.5×         | ~50% (weights)

See Benchmark Visualization for interactive charts.
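
To make the "memory traffic reduction" column concrete, here is a back-of-the-envelope counting model. It is an assumption for illustration, not the repository's measurement method: it counts only full passes over the activation tensor and ignores norm weights, cos/sin tables, and caches. The weight shape is hypothetical.

```python
def traffic_reduction(unfused_bytes, fused_bytes):
    """Fraction of memory traffic removed by the fused path."""
    return 1.0 - fused_bytes / unfused_bytes


# Activation tensor from the Quick Start example: (2, 2048, 4096) in float16.
elements = 2 * 2048 * 4096
fp16_bytes = 2

# Unfused model: RMSNorm reads x and writes y, then RoPE reads y and writes z (4 passes).
unfused = 4 * elements * fp16_bytes
# Fused model: one read of x and one write of z (2 passes).
fused = 2 * elements * fp16_bytes
ideal_rmsnorm_rope = traffic_reduction(unfused, fused)  # 0.5

# FP8 weights take 1 byte per element instead of 2 for FP16.
weight_elements = 4096 * 4096  # hypothetical weight shape
fp8_weights = traffic_reduction(weight_elements * 2, weight_elements * 1)  # 0.5
```

The 50% figure for fused_rmsnorm_rope is an idealized value under this model. Any bytes that both paths must move, such as norm weights, add the same amount to both sides and pull the ratio down, so a measured value can differ from it.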

Documentation Index

Section              | Best For                  | Key Takeaway
Academy              | First-time users          | Narrative reading path from system overview to implementation seams
Architecture Lab     | Interview prep            | Module seams, runtime contracts, public exports
Performance          | Tuning practitioners      | Correct timing, bottleneck analysis
Reference & Research | Deep learning researchers | Papers, projects, tech stack landscape

Development

This repository is OpenSpec-driven for non-trivial work. See AGENTS.md, CLAUDE.md, and openspec/README.md.

Citation

@software{triton_fused_ops,
  title = {Triton Fused Ops: High-Performance GPU Kernels for Transformer Inference},
  author = {LessUp},
  year = {2025},
  url = {https://github.com/LessUp/triton-fused-ops},
  note = {Built on OpenAI Triton, PyTorch, and CUDA}
}

Acknowledgements

  • OpenAI Triton for the compiler and Python DSL
  • PyTorch for the tensor runtime
  • NVIDIA for CUDA, FP8 hardware, and performance tooling

License

MIT.