
TileGym

TileGym is a CUDA Tile kernel library that provides a rich collection of kernel tutorials and examples for tile-based GPU programming.

Overview | Features | Installation | Quick Start | Contributing | License

Overview

This repository aims to provide helpful kernel tutorials and examples for tile-based GPU programming. TileGym is a playground for experimenting with CUDA Tile, where you can learn how to build efficient GPU kernels and explore their integration into real-world large language models such as Llama 3.1 and DeepSeek V2. Whether you're learning tile-based GPU programming or looking to optimize your LLM implementations, TileGym offers practical examples and comprehensive guidance.

Features

  • Rich collection of CUDA Tile kernel examples
  • Practical kernel implementations for common deep learning operators
  • Performance benchmarking to evaluate kernel efficiency
  • End-to-end integration examples with popular LLMs (Llama 3.1, DeepSeek V2)

Installation

Prerequisites

⚠️ Important: TileGym requires CUDA 13.1 and NVIDIA Blackwell architecture GPUs (e.g., B200, RTX 5080, RTX 5090). We will support other GPU architectures in the future. Download CUDA from NVIDIA CUDA Downloads.

  • PyTorch (version 2.9.1 or compatible)
  • CUDA 13.1 (Required - TileGym is built and tested exclusively on CUDA 13.1)
  • Triton (included with PyTorch installation)
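To sanity-check these prerequisites, you can run a quick probe from the shell. This is only a convenience sketch; the compute-capability values used to identify Blackwell-class GPUs (10.x for B200, 12.x for RTX 5080/5090) are an assumption, not an official TileGym check.

python - <<'EOF'
import torch

# Report the toolchain versions TileGym expects (CUDA 13.1, torch 2.9.x).
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)

# Assumption: Blackwell-class GPUs report compute capability 10.x or 12.x.
major, minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0), f"(sm_{major}{minor})")
if major not in (10, 12):
    print("Warning: this does not look like a Blackwell-class GPU.")
EOF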

Setup Steps

1. Prepare torch and triton environment

If you already have torch and triton, skip this step.

pip install --pre torch --index-url https://download.pytorch.org/whl/cu130

We have verified that torch==2.9.1 works. Installing torch this way also pulls in the matching triton package.
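To confirm both packages are importable after installation (a quick sanity check, not an official setup step):

python -c "import torch, triton; print(torch.__version__, triton.__version__)"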

2. Install TileGym

git clone <tilegym-repository-url>
cd tilegym
pip install .

This will automatically install cuda-tile; see https://github.com/nvidia/cutile-python.

If you want to install TileGym in editable mode, run pip install -e . instead.
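Either way, you can verify that the package resolves correctly; the tilegym module name is inferred from the src/tilegym/ layout:

python -c "import tilegym; print(tilegym.__file__)"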

We also provide a Dockerfile; see modeling/transformers/README.md.

Quick Start

There are three main ways to use TileGym:

1. Explore Kernel Examples

All kernel implementations are located in the src/tilegym/ops/ directory, and you can test individual operations with minimal scripts. Function-level usage and the minimal scripts for each op are documented in tests/ops/README.md.
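As a rough sketch of what calling an op looks like, the snippet below compares a TileGym kernel against the PyTorch reference. The tilegym.ops.softmax name and its signature are assumptions for illustration only; check tests/ops/README.md for the real entry points.

import torch
import tilegym.ops as ops  # import path inferred from src/tilegym/ops/

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

# Hypothetical op name/signature -- see tests/ops/README.md for the actual API.
y = ops.softmax(x)

# Compare against the PyTorch reference implementation.
ref = torch.softmax(x, dim=-1)
torch.testing.assert_close(y, ref, rtol=1e-2, atol=1e-2)
print("max abs diff:", (y - ref).abs().max().item())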

2. Run Benchmarks

Evaluate kernel performance with micro-benchmarks:

cd tests/benchmark
bash run_all.sh

The complete benchmark guide is available in tests/benchmark/README.md.
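If you only want to time a single kernel outside of run_all.sh, a pattern like the one below works. It relies on triton.testing.do_bench (Triton ships with the PyTorch install described above); the tilegym.ops.softmax call is again a hypothetical placeholder for whichever op you are measuring.

import torch
import triton
import tilegym.ops as ops  # import path inferred from src/tilegym/ops/

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# do_bench handles warmup and repeated timing, returning a median time in ms.
tile_ms = triton.testing.do_bench(lambda: ops.softmax(x))          # hypothetical op
torch_ms = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
print(f"tilegym: {tile_ms:.3f} ms  torch: {torch_ms:.3f} ms")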

3. Run LLM Transformer Examples

Use TileGym kernels in end-to-end inference scenarios. We provide runnable scripts and instructions for transformer language models (e.g., Llama 3.1-8B) accelerated using TileGym kernels.

First, install the additional dependency:

pip install accelerate --no-deps

Containerized Setup (Docker):

docker build -t tilegym-transformers -f modeling/transformers/Dockerfile .
docker run --gpus all -it tilegym-transformers bash

More details in modeling/transformers/README.md
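For orientation, an end-to-end run with Hugging Face transformers looks roughly like the sketch below. How TileGym kernels get swapped in is specific to the scripts in modeling/transformers/, so the tilegym.patch_transformers() call is purely a hypothetical placeholder, and the model ID assumes you have access to the Llama 3.1-8B checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import tilegym

# Hypothetical hook -- the real mechanism is documented in modeling/transformers/README.md.
tilegym.patch_transformers()

model_id = "meta-llama/Llama-3.1-8B"  # assumes you have access to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

inputs = tokenizer("Tile-based GPU programming is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))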

Contributing

We welcome contributions of all kinds. Please read our CONTRIBUTING.md for guidelines, including the Contributor License Agreement (CLA) process.

License and third-party notices
