ARCQuant is a high-performance quantization framework designed to resolve the conflict between accuracy and inference efficiency in low-bit LLMs.
While fine-grained quantization (e.g., Block-wise/NVFP4) effectively isolates quantization noise, activation outliers still degrade performance in critical channels. Traditional mixed-precision methods address this by splitting computations into separate branches (INT4 + FP16), which introduces significant kernel launch overhead and memory fragmentation.
ARCQuant takes a different approach. Instead of treating outliers separately, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of these critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
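To make the mechanism concrete, here is a minimal NumPy sketch of channel augmentation. The block quantizer, shapes, and variable names are illustrative assumptions, not ARCQuant's actual API; the point is that appending the critical activation columns and the matching residual rows lets one standard GEMM produce the quantized product plus the compensation.

```python
import numpy as np

def quantize_blockwise(w, bits=4, block=32):
    """Toy symmetric per-block quantization along the input (K) dimension,
    returned in dequantized form for readability."""
    qmax = 2 ** (bits - 1) - 1
    w_deq = np.empty_like(w)
    for k0 in range(0, w.shape[0], block):
        blk = w[k0:k0 + block]
        scale = np.abs(blk).max() / qmax + 1e-12
        w_deq[k0:k0 + block] = np.round(blk / scale) * scale
    return w_deq

rng = np.random.default_rng(0)
M, K, N, n_arc = 4, 128, 64, 8
x = rng.standard_normal((M, K))
w = rng.standard_normal((K, N))

w_q = quantize_blockwise(w)      # low-bit weights (dequantized view)
residual = w - w_q               # per-channel quantization error

# Pick "heavy-hitter" input channels, e.g. by peak activation magnitude.
critical = np.argsort(np.abs(x).max(axis=0))[-n_arc:]

# Fuse the compensation into the GEMM: append the critical activation
# columns and the matching residual rows, then run ONE standard matmul.
x_aug = np.concatenate([x, x[:, critical]], axis=1)        # (M, K + n_arc)
w_aug = np.concatenate([w_q, residual[critical]], axis=0)  # (K + n_arc, N)
y_arc = x_aug @ w_aug

# Equivalent two-branch form that mixed-precision schemes run as
# separate kernels: quantized GEMM plus an explicit correction GEMM.
y_ref = x @ w_q + x[:, critical] @ residual[critical]
assert np.allclose(y_arc, y_ref)
```

The augmented matmul and the two-branch reference are numerically identical; the difference is that the former is a single dense GEMM, which is what enables the single-kernel execution described below.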
- Unified Single-Kernel Execution: By converting error compensation into channel augmentation, ARCQuant performs the entire inference using a single, standard GEMM kernel. This decouples the algorithm from complex custom operators and allows full utilization of optimized libraries like CUTLASS.
- Accuracy-Aware Compensation: Powered by a rigorous analysis of error bounds, ARCQuant identifies and compensates only the most critical "heavy-hitter" channels, recovering FP16-level accuracy with negligible computational cost (see the selection sketch after this list).
- Hardware-Friendly Design: Designed for modern GPU architectures, ARCQuant eliminates the bottleneck of integer dequantization on CUDA cores, making it a future-proof solution for native floating-point quantization.
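How the critical channels are chosen can also be sketched cheaply. The scoring rule below is a hypothetical reading of an error-bound criterion, not necessarily the paper's exact analysis: rank each input channel by the product of its activation norm and its weight-residual norm, which upper-bounds that channel's contribution to the output error, and compensate the top scorers.

```python
import numpy as np

def select_arc_channels(x_calib, w, w_q, n_arc):
    """Rank input channels by an upper bound on their error contribution.

    Channel k adds the rank-1 term x[:, k] (outer) r[k, :] to the output
    error, whose Frobenius norm is ||x[:, k]|| * ||r[k, :]|| -- a cheap
    proxy for how much quantization noise that channel injects.
    """
    residual = w - w_q                           # (K, N) quantization error
    x_norm = np.linalg.norm(x_calib, axis=0)     # (K,) activation energy
    r_norm = np.linalg.norm(residual, axis=1)    # (K,) residual energy
    return np.argsort(x_norm * r_norm)[-n_arc:]  # indices to compensate
```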
On Llama-3 and Qwen-2.5 models, ARCQuant achieves state-of-the-art accuracy while delivering significantly lower latency compared to traditional mixed-precision baselines.
```bash
conda create -n arcquant python=3.10 -y
conda activate arcquant
```

Please make sure that CUDA 12.8 is in your environment.
Clone the repository and install the Python dependencies:

```bash
git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt
```

Then install the build dependencies for the CUDA kernels:

```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

Build the kernels:

```bash
cd kernels/
bash remake.sh
```

This might take a few minutes.
The reorder indices and `select_num` are needed for quantization; generate them with:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max
```

Results are saved in `./saved/`.
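For intuition, here is a rough sketch of what a `max` activation-sort metric could compute during calibration. This is a hypothetical helper, not the actual internals of `reorder_indices.py`:

```python
import torch

@torch.no_grad()
def max_act_order(calib_acts):
    """Rank hidden channels by peak |activation| across calibration data.

    calib_acts: list of (tokens, hidden) activation tensors, e.g. captured
    with forward hooks while running the calibration samples.
    """
    stat = torch.full((calib_acts[0].shape[-1],), float("-inf"))
    for a in calib_acts:
        stat = torch.maximum(stat, a.abs().amax(dim=0).cpu())
    return torch.argsort(stat, descending=True)  # channel reorder indices
```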
Quantize the model:

```bash
bash run_arcquant.sh /PATH/TO/YOUR/MODEL/
```

End-to-end efficiency:

```bash
python benchmarks/benchmark_e2e_arc.py --model 'llama-2-7b' --batch_size 8 --prefill_seq_len 1024 --decode_steps 50
```
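If you want to time runs outside the provided script, one common pattern (an assumed measurement approach, not the benchmark's actual code) is to time prefill and decode separately with CUDA events:

```python
import torch

def time_cuda_ms(fn, warmup=3, iters=10):
    """Average latency of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```

Timing the prefill forward pass and a single decode step separately mirrors the `--prefill_seq_len` / `--decode_steps` split used by the benchmark.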
TensorRT efficiency:

```bash
pip install tensorrt
python benchmark/trt-fp8-prefill-llama.py
```