AWQ (Activation-aware Weight Quantization) is a post-training quantization (PTQ) method that provides efficient and accurate low-bit (INT3/INT4) weight quantization for LLMs.
Orion models can be quantized with AWQ easily. Follow the step-by-step tutorial below.
To run AWQ, we will use AutoAWQ.
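
To give a feel for what "low-bit weight quantization" and a group size of 128 mean, here is a minimal pure-Python sketch of group-wise 4-bit round-to-nearest quantization. It is only an illustration: the function names are hypothetical, and it deliberately omits the activation-aware scaling that distinguishes AWQ from plain rounding.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:                               # constant group: any scale works
        scale = 1.0
    zero_point = round(-w_min / scale)           # integer offset mapping w_min to code 0
    codes = [
        min(q_max, max(0, round(w / scale) + zero_point))  # clamp to [0, q_max]
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, n_bits=4):
    """Quantize one weight row in independent groups of `group_size` values."""
    return [
        quantize_group(row[start:start + group_size], n_bits)
        for start in range(0, len(row), group_size)
    ]
```

Each group stores its own scale and zero point, so a smaller group size tracks the weights more closely at the cost of more metadata per weight.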
The quant.py script is provided for you to perform AWQ quantization:
python quant.py --model_path /base_model \
--save_path /quantized_model --group_size 128 --version "gemm"

You can run the quantized model using eval_quant.py:
python eval_quant.py --model /quantized_model --trust_remote_code
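
For readers who want to see roughly what a quantization script built on AutoAWQ does, the following is a minimal sketch using AutoAWQ's Python API. It is not the actual contents of quant.py. The helper names (`build_quant_config`, `quantize_model`, `load_quantized`) are hypothetical, 4-bit weights with zero points are assumed defaults, and the sketch does not verify that a stock AutoAWQ install supports a given Orion checkpoint.

```python
def build_quant_config(group_size=128, version="GEMM", w_bit=4, zero_point=True):
    """Build a quantization config dict with the keys AutoAWQ's examples use."""
    return {
        "zero_point": zero_point,    # asymmetric quantization with zero points
        "q_group_size": group_size,  # number of weights sharing one scale
        "w_bit": w_bit,              # bits per weight (assumed 4 here)
        "version": version,          # kernel variant; AutoAWQ examples use "GEMM"
    }


def quantize_model(model_path, save_path, group_size=128, version="GEMM"):
    """Quantize a model with AWQ and write it to `save_path`."""
    # Imported lazily so build_quant_config works without AutoAWQ installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Runs calibration and replaces linear layers with quantized ones.
    model.quantize(tokenizer, quant_config=build_quant_config(group_size, version))

    # Save quantized weights and the tokenizer side by side.
    model.save_quantized(save_path)
    tokenizer.save_pretrained(save_path)


def load_quantized(quant_path):
    """Load a previously quantized model and its tokenizer for inference."""
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_quantized(
        quant_path, fuse_layers=False, trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
    return model, tokenizer
```

Under these assumptions, `quantize_model("/base_model", "/quantized_model")` would mirror the quant.py command line above, and `load_quantized("/quantized_model")` would return a model ready for generation.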