Triton Fused Ops

Fused GPU kernels for LLM inference. Memory-bound → Compute-bound.

📖 Docs | 🇨🇳 Chinese (中文) | 💡 Examples | 🤝 Contributing

Why this repo stands out

Operator fusion with correctness guarantees — every kernel ships with CPU-testable NumPy reference implementations, not just speed claims.
Production-ready FP8 GEMM pipeline — explicit scale management and overflow handling, not toy quantization examples.
Latency-driven autotuner with persistent config cache — TritonAutoTuner + ConfigCache, not one-off benchmark scripts.
OpenSpec-driven development — every non-trivial change is design-documented before code, not YOLO-coded.
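
To make the autotuner bullet concrete, here is a minimal sketch of latency-driven config selection backed by a JSON cache on disk. Every name in it (`measure_ms`, `pick_fastest`, the cache file layout, the `block_size` key) is hypothetical and chosen for illustration; it is not the API of `TritonAutoTuner` or `ConfigCache`, and the timed function is a plain CPU callable rather than a Triton kernel.

```python
import json
import time
from pathlib import Path


def measure_ms(fn, config, runs=5):
    """Return the best wall-clock latency of fn(**config) in milliseconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn(**config)
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


def pick_fastest(fn, key, candidates, cache_path, timer=measure_ms):
    """Return the cached config for `key`, or time all candidates and cache the winner."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: no tuning run needed
    best = min(candidates, key=lambda cfg: timer(fn, cfg))
    cache[key] = best
    path.write_text(json.dumps(cache, indent=2))  # persist for later processes
    return best
```

A second call with the same `key` returns the stored config without timing anything, which is the property that separates a persistent cache from a one-off benchmark script.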

Architecture

User API (triton_ops.__init__)
    ├── Validation Layer (device, dtype, shape, contiguity)
    ├── Compute Reference Layer (NumPy, CPU-testable)
    ├── Kernel Layer (Triton, GPU)
    └── Tooling Layer (autotuner, benchmark, performance metrics)

See the Architecture Lab and Kernel Families docs for details.
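
As an illustration of what a CPU-testable entry in the Compute Reference Layer could look like, the sketch below implements unfused RMSNorm followed by rotary embedding in NumPy. The function names, the `eps` default, and the rotate-half RoPE convention are assumptions made for this example; the actual signatures in `triton_ops.reference` may differ.

```python
import numpy as np


def rmsnorm_ref(x, weight, eps=1e-6):
    """RMSNorm over the last axis, computed in float32 for stability."""
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return x32 / rms * weight.astype(np.float32)


def rope_ref(x, cos, sin):
    """Rotary embedding using the rotate-half convention (an assumption)."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)
    return x * cos + rotated * sin


def rmsnorm_rope_ref(x, weight, cos, sin, eps=1e-6):
    """Unfused reference: normalize first, then rotate."""
    return rope_ref(rmsnorm_ref(x, weight, eps), cos, sin)
```

A GPU kernel can then be checked against such a reference with a tolerance-based comparison (for example `np.allclose` on the kernel output copied back to the host), while the reference itself runs in CPU-only CI.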

Quick Start

git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"

CPU-only validation (no GPU required):

ruff format --check . && ruff check . && mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build

Full GPU benchmark (requires CUDA):

import torch
from triton_ops import fused_rmsnorm_rope, BenchmarkSuite
from triton_ops.reference import fused_rmsnorm_rope as fused_rmsnorm_rope_reference

# FP16 input tensor allocated on the GPU.
x = torch.randn(2, 2048, 4096, device="cuda", dtype=torch.float16)
suite = BenchmarkSuite(warmup_runs=10, benchmark_runs=100)
# Benchmark the fused kernel; the reference implementation is passed alongside it.
result = suite.benchmark_kernel(
    fused_rmsnorm_rope, fused_rmsnorm_rope_reference,
    "fused_rmsnorm_rope", (2, 2048, 4096), x, ...  # remaining arguments elided
)
print(result.metrics.latency_ms)

Performance

Representative numbers on NVIDIA A100 SXM4 80GB (CUDA 12.1, PyTorch 2.1, Triton 2.1). Methodology: 10 warmup runs + 100 benchmark runs with torch.cuda.synchronize() before and after timing.

Kernel	Speedup vs PyTorch	Memory Traffic Reduction
`fused_rmsnorm_rope`	up to ~3.0×	~40%
`fused_gated_mlp`	~1.3×–1.8×	~25%
`fp8_gemm`	~1.2×–1.5×	~50% (weights)
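
The ~50% weight-traffic figure for `fp8_gemm` is consistent with storing each weight in one byte instead of the two bytes FP16 needs. The sketch below works through that arithmetic and shows one way per-tensor scale selection with overflow clamping could look. It assumes the E4M3 format (largest finite value 448), and the helper names are hypothetical; it models only scaling and clamping, not rounding to real FP8 values, and it is not the repository's quantization code.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 (fn) format


def weight_bytes(rows, cols, bytes_per_element):
    """Bytes needed to store a dense rows x cols weight matrix."""
    return rows * cols * bytes_per_element


def scale_to_fp8_range(w):
    """Per-tensor scaling so that max |w| maps to E4M3_MAX.

    Only scale selection and clamping are modeled; mantissa rounding to
    real FP8 values is not simulated here.
    """
    amax = float(np.max(np.abs(w)))
    scale = amax / E4M3_MAX if amax > 0 else 1.0  # avoid dividing by zero
    scaled = np.clip(w / scale, -E4M3_MAX, E4M3_MAX)  # guard against overflow
    return scaled, scale


fp16 = weight_bytes(4096, 4096, 2)
fp8 = weight_bytes(4096, 4096, 1)
print(f"weight traffic reduction: {1 - fp8 / fp16:.0%}")  # prints 50%
```

The returned `scale` has to travel with the quantized tensor so the GEMM output can be rescaled; multiplying `scaled` by `scale` recovers the original values up to the rounding that this sketch leaves out.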

See Benchmark Visualization for interactive charts.
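
The timing methodology described above (warmup runs, then measured runs bracketed by synchronization) can be sketched as a small harness. `time_kernel` and its `sync` parameter are hypothetical names for this example, synchronizing around each individual run is one reading of the stated methodology, and reporting the median is a choice made here; `BenchmarkSuite` may aggregate differently. On a GPU one would pass `torch.cuda.synchronize` as `sync`; without it the harness times ordinary CPU callables.

```python
import statistics
import time


def time_kernel(fn, warmup_runs=10, benchmark_runs=100, sync=None):
    """Return the median latency of fn() in milliseconds.

    `sync` should block until queued device work has finished, e.g.
    torch.cuda.synchronize for CUDA kernels. With sync=None the function
    times plain CPU callables.
    """
    sync = sync or (lambda: None)
    for _ in range(warmup_runs):  # warm caches and trigger JIT compilation
        fn()
    samples = []
    for _ in range(benchmark_runs):
        sync()  # drain pending work so it is not billed to this run
        start = time.perf_counter()
        fn()
        sync()  # wait for the launched work to actually finish
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

Without the second synchronization, an asynchronous kernel launch returns almost immediately and the measured time reflects launch overhead rather than kernel execution.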

Documentation Index

Section	Best For	Key Takeaway
Academy	First-time users	Narrative reading path from system overview to implementation seams
Architecture Lab	Interview prep	Module seams, runtime contracts, public exports
Performance	Tuning practitioners	Correct timing, bottleneck analysis
Reference & Research	Deep learning researchers	Papers, projects, tech stack landscape

Development

This repository is OpenSpec-driven for non-trivial work. See AGENTS.md, CLAUDE.md, and openspec/README.md.

Citation

@software{triton_fused_ops,
  title = {Triton Fused Ops: High-Performance GPU Kernels for Transformer Inference},
  author = {LessUp},
  year = {2025},
  url = {https://github.com/LessUp/triton-fused-ops},
  note = {Built on OpenAI Triton, PyTorch, and CUDA}
}

Acknowledgements

OpenAI Triton for the compiler and Python DSL
PyTorch for the tensor runtime
NVIDIA for CUDA, FP8 hardware, and performance tooling

License

MIT.

