IST Austria & Neural Magic
Efficient non-uniform quantization with GPTQ for GGUF
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
Code for data-aware compression of DeepSeek models
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.
An Open Large Reasoning Model for Real-World Solutions
A high-throughput and memory-efficient inference and serving engine for LLMs
List of (mostly ML) papers where the description of the method could be shortened significantly
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Code for the EMNLP 2024 paper "Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on LLMs".
Efficient Triton Kernels for LLM Training
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
A fast inference library for running LLMs locally on modern consumer-class GPUs
Vector Approximate Message Passing inference framework for GWAS
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
4-bit quantization of LLaMA using GPTQ
A collection of libraries to optimize AI model performance
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
Code for ICML 2022 paper "SPDY: Accurate Pruning with Speedup Guarantees"
PyTorch distributed backend extension with compression support