Stars
Productive, portable, and performant GPU programming in Python.
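This tagline matches OpenAI's Triton. As an illustration of what GPU programming in Python looks like in that style, here is a minimal vector-add kernel sketch (array size and block size are arbitrary examples):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # one program per block of elements
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                           # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # enough programs to cover n
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
```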
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
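As a hedged sketch of what defining and running a model through this kind of high-level Python LLM API can look like (the exact surface varies by TensorRT LLM release, and the model name here is only an example):

```python
# Sketch of TensorRT LLM's high-level LLM API; details vary by version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # loads/builds an engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for out in llm.generate(["GPU inference is fast because"], params):
    print(out.outputs[0].text)
```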
FlashMLA: Efficient Multi-head Latent Attention Kernels
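For intuition, a simplified PyTorch sketch of the latent-KV idea behind multi-head latent attention (as described for DeepSeek-V2): K and V are reconstructed from a shared low-rank latent, so the cache stores the small latent instead of full per-head K/V. All dimensions are illustrative, and this omits details such as decoupled RoPE:

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
down = nn.Linear(d_model, d_latent, bias=False)          # compress to latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> K
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> V
q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)

x = torch.randn(1, 16, d_model)              # (batch, seq, dim)
latent = down(x)                             # cache this: (1, 16, 64), not K/V
q = q_proj(x).view(1, 16, n_heads, d_head).transpose(1, 2)
k = up_k(latent).view(1, 16, n_heads, d_head).transpose(1, 2)
v = up_v(latent).view(1, 16, n_heads, d_head).transpose(1, 2)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```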
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
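To make the core idea concrete, a NumPy reference sketch of the tiled GEMM decomposition that such CUDA templates implement on-device (tile sizes are arbitrary examples, not the library's algorithm):

```python
import numpy as np

def tiled_gemm(A, B, TM=32, TN=32, TK=32):
    # Compute C = A @ B one (TM, TN) output tile at a time,
    # accumulating over (TK)-wide slices of the shared K dimension.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TM):
        for j in range(0, N, TN):
            acc = np.zeros((min(TM, M - i), min(TN, N - j)), dtype=A.dtype)
            for k in range(0, K, TK):
                acc += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
            C[i:i+TM, j:j+TN] = acc
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```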
High-speed Large Language Model Serving for Local Deployment
Implementation of popular deep learning networks with TensorRT network definition API
Lightweight, standalone C++ inference engine for Google's Gemma models.
Transformer-related optimizations, including BERT and GPT
Header-only, dependency-free deep learning framework in C++14
Diffusion model (SD, Flux, Wan, Qwen Image, ...) inference in pure C/C++
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
fastllm is a high-performance LLM inference library with no backend dependencies. It supports both tensor-parallel inference of dense models and mixed-mode inference of MoE models; any GPU with 10 GB+ memory can run the full DeepSeek model. A dual-socket 9004/9005 server plus a single GPU can serve the original full-precision DeepSeek model at 20 tps for a single request; the INT4-quantized model reaches 30 tps single-request and 60+ tps under concurrency.
A great project for campus recruiting (fall/spring hiring) and internships! Build a high-performance deep learning inference library from scratch, supporting inference for llama2, Unet, Yolov5, Resnet, and other models. Implement a high-performance deep learning inference library step by step.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, and Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking, and user-space IO
SPlisHSPlasH is an open-source library for the physically-based simulation of fluids.
Large-scale LLM inference engine
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
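The overlap pattern itself can be sketched generically in PyTorch (this is not the library's API): launch an asynchronous collective, run independent computation while it is in flight, then wait before consuming the result:

```python
import os
import torch
import torch.distributed as dist

# Single-process demo so the sketch runs anywhere (gloo backend, CPU).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.randn(1024, 1024)
handle = dist.all_reduce(grad, async_op=True)  # communication now in flight
a, b = torch.randn(1024, 512), torch.randn(512, 1024)
local = a @ b                                  # independent, overlapped compute
handle.wait()                                  # sync before reading `grad`
dist.destroy_process_group()
```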
A highly optimized LLM inference acceleration engine for Llama and its variants.
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
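A hedged numeric sketch of the W4A8 part of that scheme (4-bit weights, 8-bit activations; the 4-bit KV cache follows the same quantize/dequantize pattern), using plain symmetric per-tensor quantization rather than QServe's actual kernels:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric quantization: qmax is 7 for int4, 127 for int8.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32), scale

W = np.random.randn(64, 64).astype(np.float32)   # weights -> 4-bit
A = np.random.randn(8, 64).astype(np.float32)    # activations -> 8-bit
Wq, w_scale = quantize(W, bits=4)
Aq, a_scale = quantize(A, bits=8)

# Integer-domain matmul, then dequantize with the product of the scales.
Y = (Aq @ Wq.T).astype(np.float32) * (a_scale * w_scale)
print("max abs error vs. fp32:", np.abs(Y - A @ W.T).max())
```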
Collection of benchmarks to measure basic GPU capabilities
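As an example of one such basic measurement, a CuPy sketch that estimates device-memory bandwidth from a large device-to-device copy (sizes are arbitrary; real suites measure many more effects):

```python
import cupy as cp

n = 1 << 26                                   # 64 Mi float32 (256 MiB)
src = cp.random.rand(n, dtype=cp.float32)
dst = cp.empty_like(src)
cp.copyto(dst, src)                           # warm-up
cp.cuda.Device().synchronize()

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
cp.copyto(dst, src)                           # timed device-to-device copy
end.record()
end.synchronize()

ms = cp.cuda.get_elapsed_time(start, end)
# Count both read and write traffic for the copy.
print(f"copy bandwidth: {2 * src.nbytes / (ms / 1e3) / 1e9:.1f} GB/s")
```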
MSCCL++: A GPU-driven communication stack for scalable AI applications
Demonstration of various hardware effects on CUDA GPUs.