Stars
Tensors and dynamic neural networks in Python with strong GPU acceleration
A high-throughput and memory-efficient inference and serving engine for LLMs
An extremely fast Python type checker and language server, written in Rust.
Machine Learning Engineering Open Book
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Efficient Triton Kernels for LLM Training
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A PyTorch native platform for training generative AI models
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
A storage solution for PyTorch tensors with distributed tensor support.
Implementation of the Triangle Multiplicative module, used in AlphaFold2 as an efficient way to mix rows or columns of a 2D feature map, as a standalone package for PyTorch
My submission for the GPUMODE/AMD fp8 mm challenge