awq

Here are 44 public repositories matching this topic...

intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime

sparsity pruning quantization knowledge-distillation auto-tuning int8 low-precision quantization-aware-training post-training-quantization awq int4 large-language-models gptq smoothquant sparsegpt fp4 mxformat

Updated Jun 11, 2026
Python

ModelTC / LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

benchmark deployment tool evaluation pruning quantization wan awq large-language-models llm token-pruning vllm smoothquant token-reduction mixtral internlm2 token-merging deepseek-v3

Updated May 14, 2026
Python

hec-ovi / vllm-awq4-qwen

vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated May 10, 2026
Python

ncoder-ai / VibeVoice-FastAPI

FastAPI wrapper around the original VibeVoice 1.5B and 7B models, with support for AWQ4 quantization

tts-api fastapi awq vibevoice-microsoft vibevoice-large

Updated May 7, 2026
Python

aivrar / vllm-windows-build

Native Windows build of vLLM 0.21.0 — no WSL, no Docker. Now for RTX 50-series (Blackwell, sm_120): Python 3.13 + CUDA 12.8 + PyTorch 2.11. Pre-built wheel + Windows patch, 10 KV-cache compression dtypes, and the OpenAI API server fixed to run on Windows.

Updated May 26, 2026
Python

hcd233 / Aris-AI-Model-Server

An OpenAI-compatible API that integrates LLM, Embedding, and Reranker models.

ai embedding mlx reranker rag fastapi sentence-transformers awq llm vllm gptq openai-compatible-api

Updated Aug 21, 2025
Python

AEON-7 / supergemma4-26b-abliterated-multimodal-nvfp4

NVFP4 AWQ Full quantization of SuperGemma4-26B-Abliterated-Multimodal for Blackwell GPUs — pre-built vLLM container + patches included

moe quantization multimodal blackwell awq llm vllm nvfp4 dgx-spark gemma4 modelopt

Updated May 1, 2026
Python

BoundlessWindMoon / minivllm

A light, transparent, and modular inference & quantization engine for studying LLMs.

framework inference awq multi-backends quantum-kernel cuda-graph megakernel

Updated Jun 4, 2026
Cuda

harleyszhang / harleyszhang.github.io

🧗‍♂️ harleyszhang's personal blog

blog awq llm llm-inference

Updated May 10, 2026
HTML

ShipItAndPray / turboquant

Compress Any LLM Up to 6x in One Command. Unified CLI for GGUF, GPTQ, and AWQ quantization.

quantization model-compression awq llm llama-cpp vllm gptq ollama gguf

Updated Mar 25, 2026
Python

mtecnic / research-test-Qwen3-Coder-Next-REAP-AWQ

Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model

python machine-learning research ai deep-learning optimization transformers moe pruning quantization model-compression mixture-of-experts awq llm

Updated Apr 4, 2026
Python

duchengyao / big-vllm

Originally called nano, but it turned out Qwen3.5 would not fit, so it was renamed big.

python cuda quantization awq vllm gptq llm-inference qwen llm-compressor

Updated May 6, 2026
Python

psunlpgroup / Compression-Effects

[ICLR 2026] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.

pruning quantization distillation awq llm mechanistic-interpretability gptq llm-compression

Updated May 6, 2026
Python

neosun100 / kimi-linear-vllm-docker-serve

Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.

docker awq long-context llm-serving vllm kimi-linear

Updated Jun 8, 2026
Python

GURPREETKAURJETHRA / Quantize-LLM-using-AWQ

Quantize LLM using AWQ

quantize awq large-language-models llms generative-ai llm-training

Updated Apr 26, 2024
Jupyter Notebook

aphroditeformal93 / vllm-awq4-qwen

Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated Jun 11, 2026
Python

lpalbou / model-quantizer

Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.

python nlp machine-learning cross-platform optimization transformers inference pytorch quantization model-compression huggingface awq llm gptq bitsandbytes cpu-compatible

Updated Mar 15, 2025
Python

MayurVijayPatil / amd-llm-rocm

White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1

benchmark amd hip quantization rocm awq vllm llm-inference mi300x flashattention

Updated Apr 17, 2026
Jupyter Notebook

chris-colinsky / Zorac

Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.

python cli ai self-hosted chat-client homelab mistral nvidia-gpu awq llm vllm chatgpt-alternative local-llm llm-inference offline-ai consumer-gpu

Updated Feb 27, 2026
Python

ronantakizawa / phi4-reasoning-awq

AWQ Quantization of Microsoft/Phi-4-Reasoning

quantization awq phi-4

Updated Oct 6, 2025
Jupyter Notebook
