SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
Updated Jun 11, 2026 - Python
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
FastAPI wrapper around the original VibeVoice 1.5B and 7B models, with support for AWQ 4-bit quantization
Native Windows build of vLLM 0.21.0 — no WSL, no Docker. Now for RTX 50-series (Blackwell, sm_120): Python 3.13 + CUDA 12.8 + PyTorch 2.11. Pre-built wheel + Windows patch, 10 KV-cache compression dtypes, and the OpenAI API server fixed to run on Windows.
A light, transparent, and modular inference & quantization engine for studying LLMs.
Compress Any LLM Up to 6x in One Command. Unified CLI for GGUF, GPTQ, and AWQ quantization.
Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model
Originally called nano, but it turned out Qwen3.5 would not fit, so it was renamed big
[ICLR2026] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models. Support interpretation of Qwen, Llama, etc.
Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.
Quantize LLMs using AWQ
Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.
Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.
White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1
Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.
AWQ Quantization of Microsoft/Phi-4-Reasoning