Runner Docs | Dump Docs | Benchmark Docs | 中文文档 | triton-runner.org
Triton Runner is a lightweight execution and debugging layer for Triton. It lets you launch kernels from multiple compilation stages, inspect intermediate IR, and reuse compiled artifacts directly during performance tuning.
Compatibility summary:

- Supported Triton versions: `v3.0.0` through `v3.7.0`
- Primary target: `v3.7.0`
- Supported runner inputs: Python Triton, Gluon, TTIR, TTGIR, LLIR, PTX, cubin, AMDGCN, and hsaco
- Dump support: Python, TTIR, and TTGIR
- Optional CUDA bridge: TVM-FFI on Triton `v3.7.0` only
- MLIR split output: set `MLIR_ENABLE_DUMP=1` to expand `all.mlir` into per-pass files in the cache directory (see the sketch after this list)
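For example, a minimal sketch of the split-dump workflow, assuming one of the bundled runner examples as the target script:

```bash
# Expand all.mlir into per-pass files inside the Triton cache directory
MLIR_ENABLE_DUMP=1 python examples/runner/python/triton/matmul.py
```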
Install from PyPI:

```bash
pip install triton-runner
```

Or install from source:

```bash
git clone https://github.com/toyaix/triton-runner
cd triton-runner
pip install -e .
```

Triton Runner also provides a CUDA/cubin-only bridge to TVM-FFI for Triton v3.7.0 only:

```bash
pip install triton-runner[tvm-ffi]
# or, from source
pip install -e .[tvm-ffi]
```

- Multi-level runner examples: examples/runner/README.md
- Dump examples: examples/dump/README.md
- Benchmarks: benchmark/README.md
- Issue case studies: doc/solving_triton_issues/README.md
Triton Runner can launch kernels from multiple points in the Triton compilation pipeline.
```mermaid
---
title: Triton Compilation Pipeline
---
flowchart LR
    subgraph Triton
        A["Python<br>Triton"]:::supported --> B["TTIR<br>Triton IR"]:::supported
        B --> C["TTGIR<br>Triton GPU IR"]:::supported
        C --> D["LLIR<br>LLVM IR"]:::supported
        Gluon["Python<br>Gluon"]:::supported --> C
        TLX["Python<br>TLX"]:::supported --> B
    end
    subgraph Backend
        D --> E["PTX"]:::supported
        D --> G["GCN"]:::supported
        E --> F["cubin<br>CUDA Binary"]:::supported
        G --> H["hsaco<br>HIP Binary"]:::supported
    end
    classDef supported fill:#AED6F1,stroke:#2E86C1,stroke-width:2px,color:#000000;
    classDef unsupported fill:#F5B7B1,stroke:#C0392B,stroke-width:2px,color:#000000;
```
TLX support is pinned to commit 9a7a23d; see examples/runner/tlx/README.md.
Triton Runner supports two integration styles for Python kernels, and both are valid:
- Replace `@triton.jit` with `@triton_runner.jit`
- Monkey-patch Triton's decorators and keep using `@triton.jit`
If the module also uses `@triton.autotune`, call `triton_runner.configure_autotune_backend()` when using the monkey-patch style.
Decorator replacement style:

```python
import triton_runner

@triton_runner.jit
def kernel(...):
    ...
```

Monkey-patch style:

```python
import triton
import triton.language as tl
import triton_runner

triton_runner.configure_jit_backend()
# Optional when using @triton.autotune
# triton_runner.configure_autotune_backend()

@triton.jit
def kernel(...):
    ...
```

Examples:
- Python runner: examples/runner/python/triton/matmul.py
- cubin autotune with monkey patch: examples/autotune/cubin/gate.py
```bash
python examples/runner/python/triton/matmul.py
```

On success, Triton Runner prints the kernel launch banner. When the kernel cache is reused, it also prints the cache location.
Provide the `.ttir` file and point the runner at its directory, typically with `ttir_dir=triton_runner.get_file_dir(__file__)`. See examples/runner/v3.7.0/ttir/matmul/matmul.py.
You can also reuse the Triton cache generated by the Python runner.
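A minimal sketch of this TTIR launch in decorator form, assuming `ttir_dir` is passed as a keyword argument to `triton_runner.jit` and that the directory holds the kernel's `.ttir` file (the bundled example shows the exact layout):

```python
import torch
import triton.language as tl
import triton_runner

# Assumption: ttir_dir tells the runner where to look for the kernel's .ttir file
@triton_runner.jit(ttir_dir=triton_runner.get_file_dir(__file__))
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
add_kernel[(4,)](x, y, out, x.numel(), BLOCK_SIZE=256)
```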
```bash
python examples/runner/v3.7.0/ttir/matmul/matmul.py
```

TTGIR is architecture-aware. Provide the matching `.ttgir` file and, when needed, the corresponding metadata JSON. See examples/runner/v3.7.0/ttgir/sm90/matmul-with-tma-v4.py.
If you hit `torch.AcceleratorError: CUDA error: an illegal instruction was encountered`, the selected TTGIR artifact likely does not match the target GPU, or the metadata JSON is missing.
For LLIR, PTX, and cubin launches, provide the input file plus the matching metadata JSON; a hedged sketch follows the list below.
- LLIR example: examples/runner/v3.7.0/llir/sm90/matmul-with-tma-v4.py
- PTX example: examples/runner/v3.7.0/ptx/sm90/matmul-with-tma-v4.py
- cubin example: examples/runner/v3.7.0/cubin/sm90/matmul-with-tma-v4.py
- Example metadata: examples/runner/v3.7.0/llir/sm90/matmul_kernel_make_tensor_desciptor.json
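As an illustration only, a hypothetical cubin launch in the same decorator style as the TTIR runner above. The `cubin_dir` keyword is an assumption by analogy with the documented `ttir_dir`; consult the bundled cubin example for the exact argument names:

```python
import triton.language as tl
import triton_runner

# Hypothetical kwarg: cubin_dir, assumed by analogy with ttir_dir.
# The matching metadata JSON is expected to sit next to the .cubin file.
@triton_runner.jit(cubin_dir=triton_runner.get_file_dir(__file__))
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, BLOCK: tl.constexpr):
    # Assumption: the Python body is not recompiled here; the prebuilt
    # cubin is launched instead, so the body only supplies the signature.
    pass
```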
Gluon uses the same compiler stack as Triton but exposes a lower-level programming model. The repository currently includes two Gluon examples:
```bash
python examples/runner/python/gluon/01-intro.py
python examples/runner/python/gluon/02-layouts.py
```

Architecture-specific examples are collected in examples/runner/README.md.
- NVIDIA examples cover `sm75`, `sm80`, `sm86`, `sm90`, and `sm120`
- AMD examples for MI300/CDNA3 are under examples/runner/amd/v3.6.0
Representative commands:
```bash
python examples/runner/python/triton/matmul-with-tma-v4.py
python examples/runner/v3.7.0/ttgir/sm90/matmul-with-tma-v4.py
python examples/runner/v3.7.0/cubin/sm90/matmul-with-tma-v4.py
python examples/runner/amd/v3.6.0/hsaco/matmul.py
```

If your GPU does not match one of the bundled examples, set `TRITON_CACHE_DIR=$PWD/.cache`, compile once on the target machine, and then reuse the generated kernel cache, as sketched below.
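A minimal sketch of that cache-reuse workflow, assuming the Python matmul example as the kernel being compiled:

```bash
# Keep the Triton kernel cache in the working tree
export TRITON_CACHE_DIR=$PWD/.cache

# First run compiles on the target GPU and populates the cache
python examples/runner/python/triton/matmul.py

# Later runs (including the lower-level runners) reuse the cached artifacts
python examples/runner/python/triton/matmul.py
```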
Use the example set that matches your Triton version:
- Triton `v3.7.0`: examples/runner/v3.7.0
- Triton `v3.6.0`: examples/runner/v3.6.0
- Triton `v3.5.x`: examples/runner/v3.5.x
- Triton `v3.4.0`: examples/runner/v3.4.0
- Triton `v3.3.x`: examples/runner/v3.3.x
- Triton `v3.2.0`: examples/runner/v3.2.0
- Triton `v3.1.0`: examples/runner/v3.1.0
- Triton `v3.0.0`: examples/runner/v3.0.0
- TLX examples: examples/runner/tlx
Triton Runner supports dump workflows at the Python, TTIR, and TTGIR levels.
```mermaid
---
title: Triton Dump Coverage
---
flowchart LR
    subgraph Triton
        A["Python<br>Triton"]:::supported --> B["TTIR<br>Triton IR"]:::supported
        B --> C["TTGIR<br>Triton GPU IR"]:::supported
        C --> D["LLIR<br>LLVM IR"]:::unsupported
        Gluon["Python<br>Gluon"]:::unsupported --> C
    end
    subgraph Backend
        D --> E["PTX"]:::unsupported
        E --> F["cubin<br>CUDA Binary"]:::unsupported
    end
    classDef supported fill:#AED6F1,stroke:#2E86C1,stroke-width:2px,color:#000000;
    classDef unsupported fill:#F5B7B1,stroke:#C0392B,stroke-width:2px,color:#000000;
```
The full dump guide lives in examples/dump/README.md.
Inside a Triton kernel, use `triton_runner.language.dump()` to inspect a block. You can also use `triton_runner.language.dump_boundary()` for boundary blocks and `triton_runner.language.dump_grids()` for grid inspection.
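A minimal sketch of an in-kernel dump, assuming `dump()` accepts the block to print (the exact signature may differ; the bundled examples below are authoritative):

```python
import triton.language as tl
import triton_runner
import triton_runner.language as trl

@triton_runner.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    acc = tl.load(x_ptr + offsets, mask=mask) + tl.load(y_ptr + offsets, mask=mask)
    trl.dump(acc)  # assumption: dump() prints this block's values at this point
    tl.store(out_ptr + offsets, acc, mask=mask)
```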
Representative examples:
```bash
python examples/dump/python/01-vec_add/dump_output.py
python examples/dump/python/03-matrix_multiplication/dump_acc.py
python examples/dump/python/04-softmax/dump_max_in_loop.py
python examples/dump/python/06-attention/dump_out.py
```

TTIR dump examples cover common ops such as `tt.load`, `arith.addf`, and `tt.trans`.
```bash
python examples/dump/ttir/01-vector_add/dump_addf.py
python examples/dump/ttir/03-matrix_multiplication/dump_acc.py
python examples/dump/ttir/04-softmax/dump_maxnumf.py
python examples/dump/ttir/06-attention/dump_out.py
```

TTGIR dump examples cover the same class of operations at the GPU IR level.
```bash
python examples/dump/ttgir/01-vec_add/dump_addf.py
python examples/dump/ttgir/03-matrix_multiplication/dump_acc.py
python examples/dump/ttgir/04-softmax/dump_maxnumf.py
python examples/dump/ttgir/06-attention/dump_out.py
```

Benchmark examples are under benchmark/README.md. The repository currently includes:
- `launch_latency`: kernel launch overhead
- `matmul`: matrix multiplication performance
- `flash_attention`: attention benchmark cases
Example commands:
```bash
python benchmark/launch_latency/bench.py
python benchmark/matmul/mma/bench.py
python benchmark/attn/flash_attention/bench.py
```

benchmark/launch_latency/bench.py requires Triton v3.3.0 or newer.
The case studies in doc/solving_triton_issues/README.md show how to reproduce and work around Triton regressions with Triton Runner, especially by reusing cubin artifacts.
The full list of documented cases lives in that README.

Triton Runner defines two environment variables of its own:
| Variable | Default | Description |
|---|---|---|
| `TRITON_RUNNER_PROD` | `0` | Enable Triton Runner production mode on CUDA with Triton v3.7.0; this switches `triton_runner.jit` to the production launcher path and requires `triton-runner[tvm-ffi]`. |
| `TRITON_RUNNER_PROD_TEST` | `0` | Enable production mode and keep the extra production cache consistency checks used by the production JIT path. |
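For example, a sketch of enabling production mode, assuming a CUDA machine with Triton v3.7.0 and the tvm-ffi extra installed:

```bash
pip install 'triton-runner[tvm-ffi]'
# Switch triton_runner.jit to the production launcher path for this run
TRITON_RUNNER_PROD=1 python examples/runner/python/triton/matmul.py
```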
Other environment variables such as `TRITON_CACHE_DIR`, `TRITON_ALWAYS_COMPILE`, `TRITON_KERNEL_OVERRIDE`, `TRITON_KERNEL_DUMP`, `TRITON_STORE_BINARY_ONLY`, `TRITON_DEBUG`, `MLIR_ENABLE_DUMP`, `MLIR_DUMP_PATH`, and `USE_IR_LOC` are Triton or Triton compiler controls that Triton Runner reuses rather than redefining.
This project is licensed under the MIT License. See LICENSE for details.
This project includes code from:
- Triton (MIT License): https://github.com/triton-lang/triton
- TritonBench (BSD 3-Clause License): https://github.com/pytorch-labs/tritonbench