Stars
Codebase for Coruscant: Co-Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference
Codebase for MUSTAFAR: Promoting Unstructured Sparsity for KV Pruning in LLM Inference
A summary of awesome work on optimizing LLM inference
From a+b to sparsemax(QK^T)V in Triton!
A comprehensive list of papers on large language diffusion models.
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
High-speed GEMV kernels, with up to a 2.7x speedup over the PyTorch baseline (a minimal HGEMV kernel sketch follows this list).
📚 LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners 🐑, 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉
A curated collection of papers on MoE model inference
Awesome LLM Books: Curated list of books on Large Language Models
A curated list of neural network pruning resources.
[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
A curated list for Efficient Large Language Models
A low-latency & high-throughput serving engine for LLMs
Fast and memory-efficient exact attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
A framework for few-shot evaluation of language models.
Sirius, an efficient correction mechanism that significantly boosts contextual sparsity models on reasoning tasks while maintaining their efficiency gains.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Sample code for my CUDA programming book
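For context on the HGEMV entries above, here is a minimal sketch of the kind of half-precision GEMV baseline those repos optimize: one warp per matrix row, a strided fp16 dot product accumulated in fp32, and a warp-shuffle reduction. The kernel name, the row-major layout, and the launch shape are illustrative assumptions here, not code taken from either repository.

```cuda
// Minimal HGEMV sketch: y = A * x, with A in half precision (M x N, row-major).
// One warp computes one row of A; this is a common baseline, not either
// repo's actual implementation.
#include <cuda_fp16.h>

__global__ void hgemv_warp_per_row(const half* __restrict__ A,
                                   const half* __restrict__ x,
                                   half* __restrict__ y,
                                   int M, int N) {
    const int lane = threadIdx.x & 31;                 // lane id within the warp
    const int row  = blockIdx.x * (blockDim.x >> 5)    // one warp per matrix row
                   + (threadIdx.x >> 5);
    if (row >= M) return;

    // Each lane accumulates a strided slice of the dot product in fp32
    // to avoid fp16 accumulation error.
    float acc = 0.0f;
    for (int col = lane; col < N; col += 32)
        acc += __half2float(A[row * N + col]) * __half2float(x[col]);

    // Butterfly reduction: combine the 32 partial sums across the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) y[row] = __float2half(acc);
}
```

A launch such as `hgemv_warp_per_row<<<(M + 7) / 8, 256>>>(A, x, y, M, N)` assigns eight rows (eight warps) per 256-thread block; the optimized kernels in the repos above typically improve on this baseline with vectorized `half2` loads and better memory coalescing.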