Stars
AISystem mainly refers to AI systems, covering the full low-level AI stack: AI chips, AI compilers, AI inference and training frameworks, and related infrastructure
fmchisel: Efficient Compression and Training Algorithms for Foundation Models
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
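As a hedged sketch of what that Python API looks like in recent TensorRT-LLM releases (the high-level LLM API; class and argument names may differ across versions, and the model id is just a placeholder):

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API; verify names
# against the installed version. The model id is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # HF id or local path
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["What is speculative decoding?"], params):
    print(output.outputs[0].text)  # first candidate completion
```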
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
A modern GUI client based on Tauri, designed to run on Windows, macOS, and Linux for a tailored proxy experience
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
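Not COAT's actual recipe, but a minimal sketch of the per-tensor FP8 quantize/dequantize step such memory-efficient training methods build on, assuming a recent PyTorch with float8 dtypes:

```python
import torch

# Per-tensor symmetric FP8 (E4M3) quantization: scale so the largest
# magnitude maps near the E4M3 max (~448), cast down, keep the scale.
def fp8_quantize(x: torch.Tensor):
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

x = torch.randn(1024)
q, s = fp8_quantize(x)
print((x - fp8_dequantize(q, s)).abs().max())  # quantization error
```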
A PyTorch native platform for training generative AI models
Using PyTorch autograd to compute Hessian of Perplexity for Large Language Models
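The standard autograd trick behind that kind of Hessian computation is a double backward pass; here is a minimal, generic PyTorch sketch with a toy quadratic loss, not the repo's perplexity pipeline:

```python
import torch

def hvp(loss_fn, params: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Hessian-vector product: differentiate grad(loss) . v a second time.
    loss = loss_fn(params)
    (g,) = torch.autograd.grad(loss, params, create_graph=True)
    (Hv,) = torch.autograd.grad(g @ v, params)
    return Hv

p = torch.randn(4, requires_grad=True)
v = torch.randn(4)
print(hvp(lambda x: (x * x).sum(), p, v))  # quadratic: H = 2I, so Hv ≈ 2v
```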
LLM quantization (compression) toolkit with hardware-acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
[ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Official inference framework for 1-bit LLMs
Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, and Assembly, from numerics and SIMD to coroutines, ranges, exception handling, networking, and user-space IO
Calculating the actual value of your job beyond just salary
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
My learning notes for ML systems (MLSys).
Codebase for the Progressive Mixed-Precision Decoding paper.
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
fwtan / any-precision-llm
Forked from SNU-ARC/any-precision-llm: [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
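As I understand the any-precision idea (one high-bit parent model whose lower-bit variants come from dropping least-significant bits rather than re-quantizing), a toy illustration, not the paper's actual codebase:

```python
import numpy as np

# Parent weights stored as 8-bit codes; an n-bit child model just keeps
# the top n bits of each code, so all precisions share one set of weights.
parent8 = np.random.randint(0, 256, size=6, dtype=np.uint8)
child4 = parent8 >> 4          # 4-bit model: codes in [0, 16)
child2 = parent8 >> 6          # 2-bit model: codes in [0, 4)
print(parent8, child4, child2)
```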
Lightning fast C++/CUDA neural network framework
Efficient Triton Kernels for LLM Training
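For flavor, a minimal Triton kernel in the standard tutorial style (a plain vector add, far simpler than the fused LLM-training kernels the repo actually ships):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # one program per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    return out
```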
FlashInfer: Kernel Library for LLM Serving
Dynamic Memory Management for Serving LLMs without PagedAttention
Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]
A highly optimized LLM inference acceleration engine for Llama and its variants.