Stars
Sparser Block-Sparse Attention via Token Permutation
The evaluation framework for training-free sparse attention in LLMs
Tensors and dynamic neural networks in Python with strong GPU acceleration
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
A Python wrapper around METIS, a graph partitioning package
ParMETIS - Parallel Graph Partitioning and Fill-reducing Matrix Ordering
METIS - Serial Graph Partitioning and Fill-reducing Matrix Ordering
Intercept Google Antigravity IDE API calls and use your own Gemini API token
Custom recipes for post-collection analysis with NVIDIA Nsight Systems.
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism in Long-Context Training via Dynamic Context Parallelism.
SGLang is a fast serving framework for large language models and vision language models.
[ICML 2025, NeurIPS 2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Implementation of plug-and-play attention from "LongNet: Scaling Transformers to 1,000,000,000 Tokens"
A sparse attention kernel supporting mixed sparse patterns
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
[ICLR 2025 Oral] Code for FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference with approximate, dynamic sparse attention computation, reducing inference latency by up to 10x for pre-filli…
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
FlashInfer: Kernel Library for LLM Serving
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch