Stars
🤗 Transformers: the model-definition framework for state-of-the-art machine learning in text, vision, audio, and multimodal domains, for both inference and training.
A high-throughput and memory-efficient inference and serving engine for LLMs
Fast and memory-efficient exact attention
A framework for few-shot evaluation of language models.
A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.
A curated list for Efficient Large Language Models
[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
A low-latency & high-throughput serving engine for LLMs
A baseline repository of Auto-Parallelism in Training Neural Networks
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
Sirius, an efficient correction mechanism that significantly boosts contextual sparsity models on reasoning tasks while maintaining their efficiency gains.
Codebase for MUSTAFAR: Promoting Unstructured Sparsity for KV Pruning in LLM Inference