Stars
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
A tool to parse PyTorch profiler trace files for kernel-level analysis.
A benchmark of real-world DL kernel problems
FlyDSL is the Python front‑end of the project: Flexible LaYout DSL.
FlashInfer: Kernel Library for LLM Serving
🚀 Efficient implementations for emerging model architectures
DeepSeek Native Sparse Attention PyTorch implementation
sogalin / benchmark
Forked from pytorch/benchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
Flash Attention from Scratch on CUDA Ampere
Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocations such as NCCL, ...)
LLM notes, including model inference, transformer model structure, and LLM framework code analysis.
Fast and memory-efficient exact attention
[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror
My learning notes for ML SYS.
This repository hosts configuration files for HPC Toolkit, ROCprof, NVprof and ERT, and scripts to help us create roofline and instruction based roofline diagrams (performance models) for applications
A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)
SGLang is a high-performance serving framework for large language models and multimodal models.