GitHub - Costigan-Stephen/airllm-ui: AirLLM 70B inference with single 4GB GPU

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama3.1 405B on 8GB of VRAM.

Updates

[2024/08/20] v2.11.0: Support Qwen2.5

[2024/08/18] v2.10.1 Support CPU inference. Support non sharded models. Thanks @NavodPeiris for the great work!

[2024/07/30] Support Llama3.1 405B (example notebook). Support 8bit/4bit quantization.

[2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.

[2023/12/25] v2.8.2: Support MacOS running 70B large language models.

[2023/12/20] v2.7: Support AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.

[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.

[2023/12/03] added support of ChatGLM, QWen, Baichuan, Mistral, InternLM!

[2023/12/02] added support for safetensors. Now support all top 10 models in open llm leaderboard.

[2023/12/01] airllm 2.0. Support compressions: 3x run time speed up!

[2023/11/20] airllm Initial version!

Quickstart

1. Install package

First, install the airllm pip package.

pip install airllm

2. Inference

Then, initialize the model with AutoModel, passing in the Hugging Face repo ID of the model or its local path. Inference can then be performed much as with a regular transformers model.

You can also specify where to save the layer-wise split model by passing layer_shards_saving_path when initializing the model.

from airllm import AutoModel

MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
        'What is the capital of United States?',
        #'I like',
    ]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=False)
           
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
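
Because the split layers are written to disk before inference starts, it can help to check free space first. The following is a minimal sketch using only the standard library; the helper name free_gib and the REQUIRED_GIB budget are hypothetical, and the cache location is assumed to be the Hugging Face default (~/.cache/huggingface, or HF_HOME when set).

```python
import os
import shutil
from pathlib import Path


def free_gib(path: str) -> float:
    """Return free space, in GiB, on the filesystem that holds `path`."""
    probe = Path(path).expanduser()
    # Walk up to the nearest existing directory so a missing cache dir still works.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1024**3


# Assumed default cache location; HF_HOME overrides it when set.
cache_dir = os.environ.get("HF_HOME", "~/.cache/huggingface")
REQUIRED_GIB = 300  # hypothetical budget, adjust for your model

available = free_gib(cache_dir)
if available < REQUIRED_GIB:
    print(f"Only {available:.1f} GiB free near {cache_dir}; splitting may fail.")
else:
    print(f"{available:.1f} GiB free near {cache_dir}.")
```

If you pass layer_shards_saving_path, point the check at that directory instead.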

Web UI (Local App)

This project also includes a local web interface in app.py for chat, model settings, and Hugging Face downloads.

Start the web app

pip install -r requirements.txt
python app.py

Open http://127.0.0.1:8000.

Dependency note:

transformers is intentionally pinned to 4.56.2 for runtime compatibility with this app.
Avoid upgrading transformers to latest unless you are testing compatibility changes.

First-run setup (recommended)

Open the Settings tab.
In Model Config, set:
- AIRLLM_MODEL_BASE_DIR to your local model directory root (for example O:\\AI\\Models).
- AIRLLM_DEVICE to cuda:0 (or cpu if no CUDA GPU is available).
Click Scan Base Directory to detect compatible local model folders.
Select a model from Discovered Models (or set AIRLLM_MODEL_PATH manually).
Click Load / Reload Model.
Go to Chat and send a prompt.

Downloading models from the UI

Open Settings -> Hugging Face Download.
Enter the model repo ID (for example Qwen/Qwen3-Coder-Next).
Optionally set revision, target subdir, and allow/ignore patterns.
Click Download from Hugging Face.
Monitor progress in Download Progress (status, file counts, logs).
If enabled, Set downloaded model as active updates AIRLLM_MODEL_PATH automatically after download.
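
To see what allow/ignore patterns do before starting a large download, the sketch below filters a list of file names with glob-style matching. It assumes the UI treats patterns as shell-style globs; select_files and the sample file names are hypothetical and are not taken from app.py.

```python
from fnmatch import fnmatch


def select_files(files, allow_patterns=None, ignore_patterns=None):
    """Keep files that match any allow pattern and no ignore pattern."""
    selected = []
    for name in files:
        if allow_patterns and not any(fnmatch(name, p) for p in allow_patterns):
            continue  # an allow list is set and this file is not on it
        if ignore_patterns and any(fnmatch(name, p) for p in ignore_patterns):
            continue  # explicitly ignored
        selected.append(name)
    return selected


# Hypothetical repository listing.
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model.gguf",
    "README.md",
]
print(select_files(repo_files, allow_patterns=["*.json", "*.safetensors"]))
```

Restricting a download to config and safetensors files like this skips formats the loader cannot use.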

Notes:

The web UI tracks one active download job at a time.
Reloading the page reconnects to an in-progress download job.
AirLLM loading in this UI expects Transformers-style model directories (config.json + weight files), not GGUF-only folders.
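
A quick way to tell whether a folder matches the layout described in the last note is to look for config.json plus at least one weight file. This is a rough sketch, not the scanner the UI uses; the function name and the list of weight suffixes are assumptions.

```python
from pathlib import Path

# Suffixes assumed here to count as Transformers-style weight files.
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def looks_like_transformers_dir(path: str) -> bool:
    """Return True if `path` holds config.json and at least one weight file."""
    folder = Path(path)
    if not (folder / "config.json").is_file():
        return False  # also covers missing folders and GGUF-only folders
    return any(
        entry.suffix in WEIGHT_SUFFIXES
        for entry in folder.iterdir()
        if entry.is_file()
    )


print(looks_like_transformers_dir("."))
```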

UI overview

Chat tab: send prompts to the loaded model (Enter sends, Shift+Enter adds a newline).
Settings -> Model Config: configure AIRLLM_MODEL_ID, AIRLLM_MODEL_PATH, AIRLLM_MODEL_BASE_DIR, and AIRLLM_DEVICE; scan the base directory for compatible local Transformers models.
Settings -> Hugging Face Download: download a model into the base directory with revision/pattern filters, with live progress logs in the UI.
Download persistence: if a download is already running, reloading the page reconnects to the active job and restores progress.
Dark mode: toggle from the top bar.

Environment variables used by the web app

AIRLLM_MODEL_ID: model repo id (for example TinyLlama/TinyLlama-1.1B-Chat-v1.0).
AIRLLM_MODEL_PATH: explicit local model directory; overrides base-dir resolution when set.
AIRLLM_MODEL_BASE_DIR: root folder used for model scans and HF downloads.
AIRLLM_DEVICE: runtime device (for example cuda:0 or cpu).
HF_TOKEN: optional Hugging Face token for gated/private repositories.
PORT: web server port (default 8000).

When Persist selected values to project .env is enabled in Settings, changes are written back to the local .env file.
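
The variables above could be read in Python roughly as follows. This is only a sketch of the documented behavior, not the code in app.py: the Settings class, both function names, the cpu fallback, and the base_dir/model-name folder layout are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    model_id: Optional[str]
    model_path: Optional[str]
    base_dir: Optional[str]
    device: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables; empty values are treated as unset."""
    return Settings(
        model_id=env.get("AIRLLM_MODEL_ID") or None,
        model_path=env.get("AIRLLM_MODEL_PATH") or None,
        base_dir=env.get("AIRLLM_MODEL_BASE_DIR") or None,
        device=env.get("AIRLLM_DEVICE") or "cpu",  # fallback is an assumption
        port=int(env.get("PORT") or "8000"),  # documented default port
    )


def resolve_model_source(settings: Settings) -> Optional[str]:
    """Pick what to load: an explicit path wins over base-dir resolution."""
    if settings.model_path:
        return settings.model_path
    if settings.base_dir and settings.model_id:
        # Assumed layout: <base_dir>/<model name without the org prefix>.
        return os.path.join(settings.base_dir, settings.model_id.split("/")[-1])
    return settings.model_id


print(resolve_model_source(load_settings()))
```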

Access from other machines on your network

By default the app binds to localhost. To expose it on your LAN, run:

uvicorn app:app --host 0.0.0.0 --port 8000

Then open http://<your-machine-ip>:8000 from another device on the same network.

Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization. It can speed up inference by up to 3x, with almost negligible accuracy loss! (See more performance evaluation and why we use block-wise quantization in this paper.)

How to enable model compression speed up:

Step 1. Make sure bitsandbytes is installed: pip install -U bitsandbytes
Step 2. Make sure your airllm version is later than 2.0.0: pip install -U airllm
Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization 
                    )

What are the differences between model compression and quantization?

Quantization normally needs to quantize both weights and activations to really speed things up, which makes it harder to maintain accuracy and avoid the impact of outliers across all kinds of inputs.

In our case, the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. That lets us quantize only the weights, which makes it easier to preserve accuracy.
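
To make the idea of weight-only, block-wise quantization concrete, here is a small pure-Python sketch of 8-bit absmax quantization with one scale per block. It illustrates the general technique only; it is not the bitsandbytes implementation that AirLLM relies on, and the function names and the tiny block size are chosen for illustration.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize floats to int8 codes using one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Map the largest magnitude in the block to 127; use 1.0 for all-zero blocks.
        scale = max(abs(w) for w in block) / 127 or 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate floats from int8 codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The second block contains large values; the first block keeps its own fine scale.
weights = [0.02, -0.01, 0.03, 0.00, 5.0, -4.0, 2.5, 1.0]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(f"max abs error: {max_error:.4f}")
```

Because each block has its own scale, an outlier only coarsens the values that share its block, which is one reason block-wise schemes tend to preserve accuracy well.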

Configurations

When initializing the model, the following configuration options are supported:

compression: supported options are 4bit and 8bit, for 4-bit or 8-bit block-wise quantization; defaults to None (no compression).
profiling_mode: set to True to output timing information; defaults to False.
layer_shards_saving_path: optional alternative path in which to save the split model.
hf_token: a Hugging Face token can be provided here when downloading gated models such as meta-llama/Llama-2-7b-hf.
prefetching: overlaps model loading with compute. Turned on by default. For now, only AirLLMLlama2 supports this.
delete_original: if you are short on disk space, set delete_original to true to delete the original downloaded Hugging Face model and keep only the transformed one, saving half of the disk space.
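
The prefetching option overlaps loading the next layer with computing the current one. The sketch below shows that general pattern with a background thread; load_layer and apply_layer are hypothetical stand-ins that simulate disk and compute time, and none of this is AirLLM's internal code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer shard from disk (hypothetical)."""
    time.sleep(0.01)  # simulate disk latency
    return index + 1  # pretend the "weights" are a single scale factor


def apply_layer(weights, activations):
    """Stand-in for running one layer on the device (hypothetical)."""
    time.sleep(0.01)  # simulate compute time
    return activations * weights


def run_with_prefetch(num_layers, activations):
    """Run layers in order while the next layer loads in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()  # wait until layer i is loaded
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # start loading layer i+1
            activations = apply_layer(weights, activations)  # overlaps with that load
    return activations


print(run_with_prefetch(4, 1.0))
```

Only one layer is loaded ahead at a time, so memory use stays at roughly two layers while the load time is hidden behind compute.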

MacOS

Just install airllm and run the code the same way as on Linux. See more in Quickstart.

Make sure you have installed mlx and torch.
You probably need to install native Python; see more here.
Only Apple silicon is supported.

Example Python notebook: https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb

Example Python Notebook

Example colabs here:

Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

ChatGLM:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

QWen:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

Baichuan, InternLM, Mistral, etc:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

To request other model support: here

Acknowledgement

Much of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:

GitHub account @SimJeg, the code on Kaggle, the associated discussion.

FAQ

1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model is very disk-intensive. See this. You may need to extend your disk space, clear the Hugging Face .cache directory, and rerun.

2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:

For QWen model:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

For ChatGLM model:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

3. 401 Client Error....Repo model ... is gated.

Some models are gated and require a Hugging Face API token. You can provide it via hf_token:

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')  # replace with your own token

4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some models' tokenizers don't have a padding token, so you can either set a padding token or simply turn padding off:

input_tokens = model.tokenizer(input_text,
   return_tensors="pt", 
   return_attention_mask=False, 
   truncation=True, 
   max_length=MAX_LENGTH, 
   padding=False  #<-----------   turn off padding 
)
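
The other option, setting a padding token, is often done by reusing the end-of-sequence token, a common idiom with Hugging Face tokenizers. The helper below is a hypothetical sketch; it is demonstrated on a stand-in object so it runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the end-of-sequence token for padding when none is defined."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in for a tokenizer that ships without a padding token.
stub = SimpleNamespace(pad_token=None, eos_token="</s>")
ensure_pad_token(stub)
print(stub.pad_token)
```

With a loaded model, you would call ensure_pad_token(model.tokenizer) before tokenizing with padding=True.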

Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTeX entry:

@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}

Contribution

Contributions, ideas, and discussions are welcome!

If you find it useful, please ⭐ or buy me a coffee! 🙏
