[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023).
Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Implementation of ICCV 2025 paper "Growing a Twig to Accelerate Large Vision-Language Models".
A mixed-precision GEMM with a quantize-and-reorder kernel.
The official implementation of the paper "Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts".
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Code for the paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices".
AURA: Augmented Representation for Unified Accuracy-aware Quantization