SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
Native Windows build of vLLM 0.19.0 — no WSL, no Docker. Pre-built wheels + 33-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.
Compress Any LLM Up to 6x in One Command. Unified CLI for GGUF, GPTQ, and AWQ quantization.
Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model
[ICLR 2026] When Reasoning Meets Compression: Understanding the Effects of LLM Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.
Quantize LLM using AWQ
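For orientation, a minimal AWQ quantization sketch using the AutoAWQ library; the model ID, output path, and settings below are illustrative defaults, not necessarily this repo's configuration:

```python
# Minimal AWQ quantization sketch with AutoAWQ (illustrative settings).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model, not this repo's target
quant_path = "mistral-7b-instruct-awq"

# 4-bit weights with group size 128 are common AWQ defaults.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs activation-aware calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```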
Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.
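As a reference point, loading an AWQ checkpoint with vLLM's offline Python API generally looks like the sketch below; the checkpoint name and context length are placeholders, not this repo's Docker setup:

```python
# Minimal vLLM sketch: load an AWQ-quantized model and generate.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    max_model_len=8192,  # illustrative; the repo above targets 128K-1M contexts
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```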
Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.
Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.
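Streaming against a vLLM server uses the standard OpenAI-compatible API; a minimal client sketch, assuming a local server on port 8000 (the model name is a placeholder):

```python
# Streaming chat against a local vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

stream = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```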
Quantization quality analyzer - benchmark GGUF/GPTQ/AWQ quantization accuracy.
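One common way to measure quantization accuracy is to compare perplexity between a quantized checkpoint and its full-precision baseline; a self-contained sketch (model IDs and sample text are placeholders, and this is not necessarily this repo's method):

```python
# Sketch: compare perplexity of a quantized model vs. its FP16 baseline
# (placeholder model IDs; lower perplexity means less quality loss).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

sample = "The quick brown fox jumps over the lazy dog. " * 50
for mid in ["mistralai/Mistral-7B-v0.1", "TheBloke/Mistral-7B-v0.1-AWQ"]:
    print(mid, perplexity(mid, sample))
```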
Artificial Personality is a text2text AI chatbot that can use character cards
AWQ Quantization of Microsoft/Phi-4-Reasoning
Pure Gleam tensor library with quantization (INT8, NF4, AWQ), Flash Attention, and 2:4 sparsity for up to 7.5x memory savings
Network-specific model quantization benchmarks — GPTQ, AWQ, GGUF on infrastructure NLP tasks
Production-grade vLLM serving with an OpenAI-compatible API, per-request LoRA routing, KEDA autoscaling on Prometheus metrics, Grafana/OTel observability, and a benchmark comparing AWQ vs GPTQ vs GGUF.
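For context, vLLM's OpenAI-compatible server supports per-request LoRA routing when started with --enable-lora and --lora-modules name=path: the request's model field selects the adapter. A client-side sketch with placeholder adapter names:

```python
# Per-request LoRA routing: the `model` field selects a LoRA adapter that
# was registered at server startup via --lora-modules name=path.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for adapter in ["sql-lora", "support-lora"]:  # placeholder adapter names
    resp = client.chat.completions.create(
        model=adapter,  # vLLM applies this adapter on top of the base model
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=16,
    )
    print(adapter, "->", resp.choices[0].message.content)
```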
White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1