Stars
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
FlagGems is an operator library for large language models implemented in the Triton Language.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
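The common thread in HGEMM kernels like these is tiling: each thread block computes a small output tile so its operands stay in fast memory while the reduction proceeds tile by tile. A pure-Python sketch of that blocking idea (the tile size and plain-list matrices are illustrative only; real WMMA kernels use e.g. 16x16 hardware fragments):

```python
# Toy blocked matrix multiply illustrating the tiling idea behind
# WMMA/MMA HGEMM kernels: C is computed in TILE x TILE output blocks,
# accumulating over TILE-wide slices of A and B.
TILE = 2  # illustrative; hardware tensor-core fragments are larger

def blocked_matmul(A, B, n):
    """Multiply two n x n matrices (lists of lists) tile by tile."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):          # output tile row
        for bj in range(0, n, TILE):      # output tile column
            for bk in range(0, n, TILE):  # reduction dimension, in tiles
                for i in range(bi, min(bi + TILE, n)):
                    for j in range(bj, min(bj + TILE, n)):
                        acc = 0.0
                        for k in range(bk, min(bk + TILE, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C
```

On a GPU the per-tile loops map to threads and the tiles of A and B are staged through shared memory or tensor-core fragments; the loop structure is the same.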
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
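Libraries like BitBLAS run GEMMs where weights are stored at low precision and dequantized on the fly. A hedged pure-Python sketch of the underlying idea, symmetric per-tensor int4 quantization (the scale formula and clamp range are standard textbook choices, not BitBLAS's actual scheme):

```python
# Symmetric quantization of weights to 4-bit integers plus one scale,
# the storage format that mixed-precision GEMM kernels dequantize on the fly.
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from int4 codes."""
    return [v * scale for v in q]
```

The payoff is that a weight matrix shrinks 4x versus FP16, and the kernel multiplies dequantized values against full-precision activations.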
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
A curated list of resources on efficient Large Language Models.
Awesome LLM compression research papers and tools.
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
FlashInfer: Kernel Library for LLM Serving
A collection of benchmarks to measure basic GPU capabilities.
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
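H2O's core observation is that a small set of "heavy-hitter" tokens accumulates most of the attention mass, so the KV cache can evict the rest under a fixed budget. A toy sketch of that selection policy (the budget, window, and scores here are made up; the real method ranks positions by per-head accumulated attention inside the model):

```python
def select_kv_cache(accumulated_attention, recent_window, budget):
    """Keep the most recent `recent_window` positions plus the
    highest-scoring older positions ('heavy hitters'), up to `budget`."""
    n = len(accumulated_attention)
    recent = set(range(max(0, n - recent_window), n))
    # rank the older positions by accumulated attention score
    older = sorted(
        (i for i in range(n) if i not in recent),
        key=lambda i: accumulated_attention[i],
        reverse=True,
    )
    keep = recent | set(older[: max(0, budget - len(recent))])
    return sorted(keep)
```

Everything outside the returned set is evicted from the cache, bounding KV memory regardless of sequence length.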
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Development repository for the Triton language and compiler
TinyChatEngine: On-Device LLM Inference Library
A simple high performance CUDA GEMM implementation.
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
List of papers related to neural network quantization in recent AI conferences and journals.
The official repository for the gem5 computer-system architecture simulator.
How to optimize common algorithms in CUDA.
A high-throughput and memory-efficient inference and serving engine for LLMs
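vLLM's memory efficiency comes largely from PagedAttention: the KV cache is allocated in fixed-size physical blocks, and each sequence maps logical token positions to blocks through a block table, much like virtual-memory pages. A minimal allocator sketch (the class, block size, and method names are illustrative, not vLLM's API):

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps logical token positions
    to physical KV-cache blocks via a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Return the physical block holding this position,
        allocating a fresh block when the sequence crosses a boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // self.block_size >= len(table):
            table.append(self.free_blocks.pop())
        return table[position // self.block_size]

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are allocated on demand and returned on completion, no sequence reserves cache for its maximum possible length, which is what enables the high batch sizes.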
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
An LLM deployment project based on MNN; it has since been merged into MNN.
[ICCV 2023] Consistent Image Synthesis and Editing
Flexible and powerful tensor operations for readable and reliable code (for pytorch, jax, TF and others)
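einops replaces chains of reshape/transpose calls with a single pattern string. For instance, `rearrange(x, 'b h w c -> b (h w) c')` flattens the spatial axes, which in plain NumPy is the reshape below (the pattern is a common einops example; this sketch shows only the equivalence, not einops itself):

```python
import numpy as np

def rearrange_b_hw_c(x):
    """NumPy equivalent of einops.rearrange(x, 'b h w c -> b (h w) c'):
    merge the h and w axes into a single axis of length h * w."""
    b, h, w, c = x.shape
    return x.reshape(b, h * w, c)
```

The readability win is that the einops pattern names every axis, so a reader can check the transformation without reconstructing shapes in their head.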