TwigVLM

This repository contains the official code of our paper accepted at ICCV 2025. TwigVLM is a general and effective framework that accelerates large vision-language models (LVLMs) by “growing” a lightweight twig block on top of an early layer of the base VLM.

Compared with existing VLM acceleration methods that rely purely on visual token pruning, TwigVLM not only retains better accuracy by employing a twig-guided token pruning (TTP) strategy, but also achieves higher generation speed by utilizing a self-speculative decoding (SSD) strategy. More specifically, the LLaVA-1.5-7B model equipped with TwigVLM retains 96% of the original performance when 88.9% of the visual tokens are pruned, and achieves a 154% improvement in generation speed, establishing a new state of the art in both accuracy retention and generation speed for VLM acceleration.
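As a rough illustration of the TTP idea only (not the actual TwigVLM implementation, whose details are in the paper and code), pruning guided by the twig block's attention over visual tokens can be sketched as follows; all names, shapes, and scores here are illustrative:

import torch

def ttp_prune(visual_tokens, twig_attention, r):
    # visual_tokens: (N, D) hidden states of the N visual tokens
    # twig_attention: (N,) attention mass each visual token receives in the twig block
    keep = torch.topk(twig_attention, k=r).indices.sort().values
    return visual_tokens[keep], keep

visual = torch.randn(576, 4096)   # LLaVA-1.5-7B produces 576 visual tokens (hidden size 4096)
attention = torch.rand(576)       # placeholder attention scores standing in for the twig block's
pruned, kept_idx = ttp_prune(visual, attention, r=192)
print(pruned.shape)               # torch.Size([192, 4096])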


Table of Contents

  - Prerequisites
  - Training
  - Evaluation
  - Demo
  - License
  - About us
  - Citation

Prerequisites

  1. To train the models, you will need a server with at least 4 GPUs, each with more than 40GB of memory (e.g., 4×NVIDIA A6000). For inference or testing, a single GPU with more than 40GB of memory is sufficient.
  2. Clone this repository and navigate to the folder:
git clone https://github.com/MILVLG/twigvlm.git
cd twigvlm
  3. Prepare the software environment. We recommend using Anaconda to create a new environment for the project and installing the requirements with the following commands:
conda create -n twigvlm python=3.10 -y
conda activate twigvlm
pip install -r requirements.txt
pip install flash-attn==2.3.2 --no-build-isolation
  4. Please note that SDPA is currently unsupported. We recommend FlashAttention-2 as the preferred attention backend. If FlashAttention-2 cannot be installed, the eager implementation can be used as a fallback (see the sketch after this list).
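If you need to select the backend manually, recent versions of HuggingFace Transformers expose an attn_implementation argument on from_pretrained. The snippet below is only a generic illustration of that option using the HF-converted LLaVA-1.5 checkpoint; it is not the exact loading code used in this repository, which builds on the LLaVA codebase:

import torch
from transformers import LlavaForConditionalGeneration

# Generic illustration: use "flash_attention_2" when flash-attn is installed,
# otherwise fall back to "eager". SDPA is not supported by this project.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # or "eager" if flash-attn is unavailable
)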

Training

This section provides instructions for training the twig block of TwigVLM on top of the LLaVA-1.5-7B model. Please refer to the original LLaVA project for instructions on preparing the training data and the base LLaVA-1.5-7B model. After that, you can use the following script to train TwigVLM:

# Training the twig block
twig_K=2 twig_T=3 bash scripts/v1_5/train_twig.sh

where twig_K and twig_T are the position of the twig block and the number of twig layers, respectively (see the paper for details).

The trained checkpoints are stored in ./checkpoints/TwigVLM-llava1.5-7b-K2-T3 by default. The trained TwigVLM model (only the learned twig block) is also available for download here.

Evaluation

This section provides instructions for evaluating TwigVLM and reproducing the LLaVA-1.5-7B results reported in the paper. Before preparing the task-specific data, you should download eval.zip and unzip it to ./playground/data/eval. For more detailed instructions, please refer to LLaVA's Evaluation.md.

Example of evaluating on the GQA benchmark, where -R specifies the average number of retained visual tokens:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  twig_K=2 twig_T=3 bash scripts/v1_5/eval/gqa.sh  -R 192

Using our provided model, you can reproduce the following results with R=192.

| Models     | GQA  | MME  | MMBench | SQA (IMG) | TextVQA | VQAv2 | RelAcc |
| ---------- | ---- | ---- | ------- | --------- | ------- | ----- | ------ |
| SparseVLM  | 57.6 | 1721 | 62.5    | 69.1      | 56.1    | 75.6  | 95.7%  |
| MustDrop   | 58.2 | 1787 | 62.3    | 69.2      | 56.5    | 76.0  | 96.6%  |
| VisionZip  | 59.3 | 1783 | 63.0    | 68.9      | 57.3    | 76.8  | 97.4%  |
| VisionZip* | 60.1 | 1834 | 63.4    | 68.2      | 57.8    | 77.4  | 98.3%  |
| TwigVLM    | 61.2 | 1848 | 64.0    | 68.9      | 58.0    | 78.1  | 99.2%  |
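RelAcc is presumably the accuracy relative to the unpruned LLaVA-1.5-7B baseline, averaged over the benchmarks. Under that assumption, the computation is simply the following sketch; the baseline numbers below are placeholders, not actual LLaVA-1.5-7B scores:

def relative_accuracy(pruned_scores, baseline_scores):
    # Assumption: RelAcc = mean over benchmarks of (pruned score / baseline score)
    ratios = [p / b for p, b in zip(pruned_scores, baseline_scores)]
    return 100.0 * sum(ratios) / len(ratios)

twigvlm_scores  = [61.2, 1848, 64.0, 68.9, 58.0, 78.1]  # TwigVLM row from the table above
baseline_scores = [62.0, 1860, 64.5, 69.5, 58.5, 78.5]  # placeholder values only
print(f"RelAcc ~ {relative_accuracy(twigvlm_scores, baseline_scores):.1f}%")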

Example of evaluating generation speed:

CUDA_VISIBLE_DEVICES=0  twig_K=2 twig_T=3 bash scripts/v1_5/eval/mmvet.sh  -R 64

Running on an RTX 4090 GPU, the average generation speed is about 60.6 tokens/s. Note that the measured speed may fluctuate across different GPUs and concurrent workloads.
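If you want to measure throughput outside the provided scripts, a generic timing sketch looks like the following; model and inputs are placeholders for whatever generation interface you benchmark, not an API of this repository:

import time

def tokens_per_second(model, inputs, max_new_tokens=256):
    # `model` is any HuggingFace-style model exposing .generate(), and `inputs`
    # is the tokenized prompt (a dict containing "input_ids"); both are placeholders.
    start = time.perf_counter()
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed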

Demo

To try out some examples, you can use the provided cli_demo.py script, which allows you to interactively ask questions about an image using the TwigVLM model.

python cli_demo.py \
    --base-model liuhaotian/llava-v1.5-7b \
    --twig-block "TwigVLM-llava-v1.5-7b-K2-T3" \
    --twig-K 2 \
    --twig-T 3 \
    --stream \
    --image-file "./assets/image.png"

The generation process is demonstrated in the following GIF image. The tokens in green are generated by the draft model.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

About us

This project is maintained by MILVLG @ Hangzhou Dianzi University (HDU).

Citation

If this code is used in your research, please cite our paper:

@inproceedings{shao2025twigvlm,
  title={Growing a twig to accelerate large vision-language models},
  author={Shao, Zhenwei and Wang, Mingyang and Yu, Zhou and Pan, Wenwen and Yang, Yan and Wei, Tao and Zhang, Hongyuan and Mao, Ning and Chen, Wei and Yu, Jun},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  year={2025}
}