Stars
[ICLR 2025] STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
[ACL 2025] Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada, and Blackwell GPUs, to provide better performance…
A framework to compare low-bit integer and floating-point formats
arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
This repository contains low-bit quantization papers from 2020 to 2025 at top conferences.
A selective knowledge distillation algorithm for efficient speculative decoders
[EMNLP 2025] LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization (Oral)
A collection of research papers on low-precision training methods
Hierarchical Reasoning Model Official Release
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Kimi K2 is the large language model series developed by the Moonshot AI team
[ICML 2025, NeurIPS 2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention
[NeurIPS 2025] Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
[ICML 2025 oral] Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
[NeurIPS 2025 Spotlight] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone.
A high-throughput and memory-efficient inference and serving engine for LLMs
Official repository of Agent Attention (ECCV 2024)
[SIGGRAPH 2025] One Model to Rig Them All: Diverse Skeleton Rigging with UniRig