Stars
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
A CUDA tutorial to help people learn CUDA programming from scratch
Leveraging Critical Proof Obligations for Efficient IC3 Verification
ModelChecker: A bit-level model checking tool
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding code links.
HabanaAI / vllm-fork
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
DeepAuto-AI / sglang
Forked from sgl-project/sglang. This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details.
🍒 Cherry Studio is a desktop client that supports multiple LLM providers.
🤯 LobeHub - an open-source, modern-design AI Agent Workspace. Supports multiple AI providers (OpenAI / Claude 4 / Gemini / DeepSeek / Ollama / Qwen), Knowledge Base (file upload / RAG), one click …
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
FlashMLA: Efficient Multi-head Latent Attention Kernels
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Shanghai Jiao Tong University LaTeX templates for thesis proposals and mid-term/annual reports (unofficial)
Large Language Model (LLM) Systems Paper List
An introductory PyTorch tutorial; read online at https://datawhalechina.github.io/thorough-pytorch/
Supercharge Your LLM with the Fastest KV Cache Layer
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
Hackable and optimized Transformers building blocks, supporting a composable construction.
Fast and memory-efficient exact attention
Official code for paper: Chain of Ideas: Revolutionizing Research via Novel Idea Development with LLM Agents