Stars
Fully autonomous, self-evolving research from idea to paper: chat an idea, get a paper. 🦞
Samples for CUDA developers that demonstrate features in the CUDA Toolkit.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
📚LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
An NVIDIA-curated collection of educational resources on general-purpose GPU programming.
GPU programming related news and material links
AIInfra (AI infrastructure) covers the full AI system stack, from low-level hardware such as chips up through the software layers that support training and inference of large AI models.
Summary of some awesome work for optimizing LLM inference
Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]
Shares AI Infra knowledge and coding exercises: introductions to the PyTorch/vLLM/SGLang frameworks ⚡️, performance acceleration 🚀, LLM fundamentals 🧠, AI hardware and software 🔧, and more.
A Throughput-Optimized Pipeline Parallel Inference System for Large Language Models
The official GitHub page for the survey paper "A Survey of Large Language Models".
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. ACM Computing Surveys, 2026.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Persist and reuse KV Cache to speedup your LLM.
Source code for the paper "KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing".
Official code repo for the O'Reilly Book - "Hands-On Large Language Models"
The simplest, fastest repository for training/finetuning medium-sized GPTs.
A minimal PyTorch re-implementation of OpenAI GPT (Generative Pretrained Transformer) training.
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
Supercharge Your LLM with the Fastest KV Cache Layer
A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
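Several entries above (Marconi, LMCache, the KV Cache Compression paper list) revolve around reusing KV cache across requests that share a prompt prefix. As a rough illustration of that idea only, here is a toy prefix KV cache in plain Python; `PrefixKVCache`, `prefill`, and `_compute_kv` are hypothetical names for this sketch, not any of these projects' real APIs.

```python
# Toy illustration of prefix KV-cache reuse: sequences that share a prompt
# prefix (e.g. the same system prompt) reuse that prefix's cached KV entries
# instead of recomputing them. Purely a sketch; real systems store KV in
# paged/block GPU memory, not Python dicts.

class PrefixKVCache:
    """Caches per-token 'KV' entries keyed by the token prefix that produced them."""

    def __init__(self):
        self._store = {}  # tuple of prefix tokens -> list of per-token KV entries
        self.hits = 0     # tokens served from cache
        self.misses = 0   # tokens that had to be recomputed

    def _compute_kv(self, token):
        # Stand-in for the real attention K/V projection of one token.
        return ("kv", token)

    def prefill(self, tokens):
        """Return KV entries for `tokens`, reusing the longest cached prefix."""
        tokens = tuple(tokens)
        # Find the longest previously seen prefix of this sequence.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._store:
                best = n
                break
        kv = list(self._store.get(tokens[:best], []))
        self.hits += best
        # Compute KV only for the uncached suffix, caching each new prefix.
        for i in range(best, len(tokens)):
            kv.append(self._compute_kv(tokens[i]))
            self._store[tokens[: i + 1]] = list(kv)
            self.misses += 1
        return kv


cache = PrefixKVCache()
cache.prefill(["sys", "You", "are", "helpful"])             # cold start: 4 misses
kv = cache.prefill(["sys", "You", "are", "helpful", "Hi"])  # reuses the 4-token prefix
print(cache.hits, cache.misses, len(kv))  # -> 4 5 5
```

Storing a copied list per prefix is O(n²) memory and is only for clarity here; production serving systems keep one block-structured KV pool and share blocks between sequences.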