SpQR model compression

Note: This repository contains the quantization algorithm and the model evaluation code for the SpQR method for LLM compression. The efficient inference code will be added soon.

It accompanies the research paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".

Installation

Packages

Install packages from requirements.txt:

pip install -r requirements.txt

Note that, due to compatibility issues with LLaMA, we recommend using version 4.28.dev0 of transformers.
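To confirm which transformers version is actually installed, a small check like the one below may help. It is only a sketch: the helper names are hypothetical, and the prefix test assumes that the recommended development build reports a version string starting with "4.28." (for example "4.28.0.dev0"), which the text above does not spell out.

```python
from importlib import metadata

# Assumption: the recommended dev build reports a version starting with "4.28."
RECOMMENDED_PREFIX = "4.28."


def matches_recommended(version: str, prefix: str = RECOMMENDED_PREFIX) -> bool:
    """Return True if the version string falls in the recommended 4.28 series."""
    return version.startswith(prefix)


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed; run: pip install -r requirements.txt")
elif matches_recommended(found):
    print(f"transformers {found} is in the recommended 4.28 series")
else:
    print(f"transformers {found} found; the 4.28 dev version is recommended for LLaMA")
```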

Downloading model weights and dataset(s)

These scripts assume that the model weights have already been downloaded and stored locally. See the MODEL_PATH references below.

The scripts can use a variety of datasets for compression. To use RedPajamas, download it locally and then pass its location to the scripts. See the PAJAMAS_PATH references below.

Loading / caching datasets and tokenizer

The script downloads and locally caches the relevant LLaMA tokenizer and one or a few datasets for testing. They are saved in the default locations.

Data

The tokenized and preprocessed subset of RedPajamas (the mixture of datasets used for LLaMA training) is located at data/red_pajama_n=1024.pth. Below, PAJAMAS_PATH denotes the path to this subset.

Launching

GPU requirements

This code was developed and tested using a single A100 GPU with 80GB of GPU RAM. It may also run successfully on GPUs with 32-40GB.

Model downloading

The code requires the LLaMA model to be downloaded in Hugging Face format and saved locally. The scripts below take the path to this model folder as an argument.
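Before launching a long compression run, it can be worth checking that the model folder looks right. The sketch below relies only on the statement in the launch arguments that the model folder contains config.json. The function name check_model_dir is hypothetical and is not part of this repository.

```python
from pathlib import Path


def check_model_dir(model_path: str) -> Path:
    """Return the model folder as a Path, or raise if it does not look usable.

    The only check made is the one this README states: the folder must
    exist and contain config.json.
    """
    model_dir = Path(model_path).expanduser()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"model folder not found: {model_dir}")
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"config.json is missing from: {model_dir}")
    return model_dir


# Usage (the environment variable is the MODEL_PATH exported further below):
#   import os
#   check_model_dir(os.environ["MODEL_PATH"])
```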

Perplexity benchmarks:

This script compresses the model and then tests its performance in terms of perplexity on the Wikitext2, C4, and Penn Treebank datasets. Note that perplexity is related to the loss used in the article by loss = log2(perplexity).
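The relation between perplexity and loss can be made concrete with a short conversion helper. This is a minimal sketch that takes the formula above at face value. The function names are hypothetical, and the numbers are illustrative rather than measured results.

```python
import math


def loss_from_perplexity(perplexity: float) -> float:
    """Apply the relation stated above: loss = log2(perplexity)."""
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return math.log2(perplexity)


def perplexity_from_loss(loss: float) -> float:
    """Invert the relation: perplexity = 2 ** loss."""
    return 2.0 ** loss


# Illustrative value only: a perplexity of 8.0 corresponds to a loss of 3.0.
print(loss_from_perplexity(8.0))
```

Because the relation is logarithmic, a fixed absolute change in perplexity corresponds to a smaller change in loss at higher perplexities.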

The command to launch the script should look like this:

export MODEL_PATH=<INSERT PATH_TO_MODEL_DIR>
export PAJAMAS_PATH=<INSERT PATH TO PAJAMAS DIR>

python main.py $MODEL_PATH custom \
    --load_from_saved=$PAJAMAS_PATH \
    --wbits 4 \
    --groupsize 16 \
    --perchannel \
    --qq_scale_bits 3 \
    --qq_zero_bits 3 \
    --qq_groupsize 16 \
    --fit_quantizer_without_outliers \
    --outlier_threshold=0.2 \
    --permutation_order act_order \
    --percdamp 1e0 \
    --nsamples 128

The command above runs near-lossless compression as described in the article. Adjusting the above parameters allows for tighter compression with a slightly greater loss.

Note the launch arguments:

<PATH_TO_MODEL_DIR> - path to model folder, which contains config.json
one of [c4, ptb, wikitext2, custom] -- name of dataset to use for compression
--load_from_saved -- path to the preprocessed and tokenized dataset (only if custom is chosen; otherwise do not specify it)
--wbits 4 -- number of bits for the quantized weight representation
--groupsize 16 -- size of first-order groups for compression
--qq_groupsize 16 -- size of second-order (quantized) groups for compression
--qq_scale_bits 3 --qq_zero_bits 3 -- number of bits for quantizing the first-order scales and zeros
--fit_quantizer_without_outliers -- when finding optimal quantizer parameters, remove any points that would be declared outliers

Run python main.py --help for more details on command-line arguments, including compression parameters.
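When trying several parameter settings, it may be convenient to assemble the command programmatically. The sketch below uses only the flags and default values from the example command above. The helper build_spqr_command and the path /path/to/model are hypothetical, and the code only builds and prints the command without running it.

```python
import shlex

# Default values copied from the example command in this README.
DEFAULT_OPTIONS = {
    "wbits": 4,
    "groupsize": 16,
    "qq_scale_bits": 3,
    "qq_zero_bits": 3,
    "qq_groupsize": 16,
    "outlier_threshold": 0.2,
    "permutation_order": "act_order",
    "percdamp": "1e0",
    "nsamples": 128,
}


def build_spqr_command(model_path, dataset="custom", load_from_saved=None, **overrides):
    """Build the argument list for main.py (hypothetical helper, not part of the repo)."""
    if dataset == "custom" and not load_from_saved:
        raise ValueError("--load_from_saved is required when the dataset is 'custom'")
    if dataset != "custom" and load_from_saved:
        raise ValueError("--load_from_saved should only be given with 'custom'")
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")

    options = {**DEFAULT_OPTIONS, **overrides}
    argv = ["python", "main.py", str(model_path), dataset]
    if load_from_saved:
        argv.append(f"--load_from_saved={load_from_saved}")
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    # Boolean switches used by the example command.
    argv += ["--perchannel", "--fit_quantizer_without_outliers"]
    return argv


# Example: the same command with 3-bit weights instead of 4-bit.
cmd = build_spqr_command(
    "/path/to/model",  # hypothetical path
    load_from_saved="data/red_pajama_n=1024.pth",
    wbits=3,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files.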

LM Evaluation Harness benchmark.

To perform zero-shot evaluation, we use the lm-eval-harness framework with slight modifications. The LICENSE and CODEOWNERS files inside lm-evaluation-harness refer to the original authors of lm-eval-harness and not the authors of this paper.

For instructions on zero-shot evaluation, refer to README.md inside the lm-evaluation-harness directory.
