Stars
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
A collection of GPU experiments and benchmarks for my personal understanding and research.
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
Fast and memory-efficient exact attention
A PyTorch native platform for training generative AI models
My tools for the Slurm HPC workload manager
Write a fast kernel and see how you compare against the best humans and AI on gpumode.com
deepbeepmeep / Wan2GP
Forked from Wan-Video/Wan2.1. A fast AI Video Generator for the GPU Poor. Supports Wan 2.1/2.2, Qwen Image, Hunyuan Video, LTX Video and Flux.
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
Genai-bench is a benchmarking tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton
Allow torch tensor memory to be released and resumed later
An open-source AI agent that brings the power of Gemini directly into your terminal.
Train your Agent model via our easy and efficient framework
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
From a+b to sparsemax(QK^T)V in Triton!
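The sparsemax in that repo's title refers to the sparse alternative to softmax from Martins & Astudillo (2016), which can assign exactly zero attention weight to some keys. As a rough orientation (not the repo's Triton code), a minimal NumPy sketch of sparsemax for a single score vector looks like this:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.
    Unlike softmax, the result can contain exact zeros (sparse attention)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    # support set: largest k such that 1 + k * z_(k) > sum of top-k scores
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_max  # threshold
    return np.maximum(z - tau, 0.0)

# e.g. sparsemax([2.0, 1.0, 0.1]) puts all mass on the first entry,
# whereas softmax would spread probability over every entry.
```

In the attention setting the same projection is applied row-wise to the QK^T score matrix before multiplying by V; the Triton version in the repo fuses those steps into one GPU kernel.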
FlashInfer: Kernel Library for LLM Serving
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Tensors and Dynamic neural networks in Python with strong GPU acceleration
My tests and experiments with some popular dl frameworks.