DUET-VLM: Dual-Stage Unified Efficient Token Reduction for Vision-Language Models


DUET-VLM is a dual-stage token reduction framework that significantly speeds up both training and inference in Vision-Language Models (VLMs). It removes redundant visual tokens while preserving task-critical information, enabling faster iterations and lower serving latency for both image and video reasoning tasks.

Modern VLMs can produce 2,800+ visual tokens from a single high-resolution image, making attention cost the dominant bottleneck. DUET-VLM addresses this by coordinating compression across both the vision encoder and the language backbone:

  • Stage 1 — Vision-to-Vision (V2V) Merging: Uses vision self-attention to identify "dominant" tokens and groups the remaining tokens via localized cluster aggregation (VisionZip). This removes background redundancy before it ever reaches the LLM (see the sketch after this list).
  • Stage 2 — Text-to-Vision (T2V) Pruning: Uses salient text tokens to guide layer-wise rank-and-drop pruning of visual tokens inside the LLM (PyramidDrop). This makes token retention context-aware — the model keeps the visual evidence that supports the question being asked.
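
A minimal, self-contained sketch of the Stage-1 idea (an illustration only, not the repo's implementation; the real logic lives in visionzip/). It scores patch tokens by the attention they receive, keeps the top dominant ones, and pools the remainder into contextual merged tokens by cosine-similarity assignment; the function name, scoring rule, and pooling scheme are simplifying assumptions:

import torch
import torch.nn.functional as F

def v2v_merge_sketch(tokens, attn, dominant=170, contextual=35):
    # tokens: [N, D] patch embeddings; attn: [H, N, N] last-layer self-attention
    scores = attn.mean(dim=0).sum(dim=0)        # total attention each token receives
    keep = scores.topk(dominant).indices        # indices of the dominant tokens
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[keep] = False
    rest = tokens[mask]                         # non-dominant tokens to be merged
    anchors = rest[:contextual]                 # crude anchor choice, for illustration
    sim = F.normalize(rest, dim=-1) @ F.normalize(anchors, dim=-1).T
    assign = sim.argmax(dim=-1)                 # nearest anchor per leftover token
    merged = torch.stack(
        [rest[assign == c].mean(dim=0) for c in range(contextual)]
    )
    return torch.cat([tokens[keep], merged], dim=0)

out = v2v_merge_sketch(torch.randn(576, 1024), torch.rand(16, 576, 576))
print(out.shape)  # torch.Size([205, 1024]): 170 dominant + 35 contextual tokens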

Key Results

  • ~31% training speedup with less than 1% accuracy drop
  • >99% of baseline accuracy with 67% fewer tokens at inference
  • Video: matches or slightly exceeds baseline accuracy while cutting tokens by ~53%
  • Even at an extreme 93.4% token reduction on video, retains 97.6% of baseline accuracy

Supported Models

| Model       | VisionZip Function          | PyramidDrop Config      | Location                              |
|-------------|-----------------------------|-------------------------|---------------------------------------|
| LLaVA-1.5   | visionzip()                 | modeling_llama_pdrop.py | llava/model/                          |
| Video-LLaVA | visionzip_video()           | modeling_llama_pdrop.py | videollava/model/                     |
| Qwen2.5-VL  | Built-in: configure_duet()  | Built-in                | qwen2_5_vl/modeling_qwen2_5vl_duet.py |

Getting Started

Installation

git clone https://github.com/AMD-AGI/DUET-VLM.git
cd DUET-VLM

# Core LLaVA-1.5 support
pip install -e .

# With Video-LLaVA support (adds decord, einops)
pip install -e ".[video]"

# With Qwen2.5-VL support (adds qwen-vl-utils)
pip install -e ".[qwen]"

# Everything
pip install -e ".[all]"

Dependencies by Model

| Model       | Required Packages                        |
|-------------|------------------------------------------|
| LLaVA-1.5   | torch, transformers, pillow, accelerate  |
| Video-LLaVA | LLaVA-1.5 packages + decord, einops, av  |
| Qwen2.5-VL  | LLaVA-1.5 packages + qwen-vl-utils       |

Example Usage

LLaVA-1.5 with DUET-VLM

from llava.model.builder import load_pretrained_model
from visionzip import visionzip

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-7b",
    model_base=None,
    model_name="llava-v1.5-7b"
)

# Apply VisionZip (Stage 1: patches the CLIP encoder).
# Keeps 170 dominant + 35 contextual = 205 visual tokens per image.
model = visionzip(model, dominant=170, contextual=35, cluster_width=4)

# PyramidDrop (Stage 2) is integrated in the model forward pass
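
Inference then proceeds exactly as with stock LLaVA-1.5. A minimal generation sketch using the standard LLaVA conversation utilities (helper names follow upstream LLaVA and are assumed unchanged in this fork; example.jpg and the prompt are placeholders):

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token, process_images
from PIL import Image
import torch

# Preprocess the image with the processor returned by load_pretrained_model
image = Image.open("example.jpg")
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Build a vicuna_v1 prompt containing the image placeholder token
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))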

Video-LLaVA with DUET-VLM

from videollava.model.builder import load_pretrained_model
from visionzip import visionzip_video

tokenizer, model, processor, context_len = load_pretrained_model(
    model_path="LanguageBind/Video-LLaVA-7B",
    model_base=None,
    model_name="Video-LLaVA-7B"
)

# Apply VisionZip for Video-LLaVA (patches LanguageBind towers)
model = visionzip_video(model, dominant=170, contextual=35, cluster_width=4)

Qwen2.5-VL with DUET-VLM

from qwen2_5_vl import Qwen2_5_VLForConditionalGeneration
from transformers import AutoProcessor
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Configure DUET (VisionZip + PyramidDrop):
# prune visual tokens at LLM layers 14 and 21 with keep ratios 0.5 and 0.25
model.configure_duet(
    visionzip_enabled=True,
    dominant_tokens=170,
    contextual_tokens=35,
    pdrop_enabled=True,
    layer_list=[14, 21],
    ratio_list=[0.5, 0.25]
)
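
The layer_list / ratio_list pair defines the PyramidDrop schedule: visual tokens are pruned when the forward pass reaches each listed LLM layer. A back-of-the-envelope check of the average per-layer visual-token count, assuming a 28-layer LLM, ratios taken relative to the initial token count, and an illustrative 640-token input (all three are assumptions, not values taken from this repo):

def avg_visual_tokens(n_tokens, n_layers, layer_list, ratio_list):
    # Each (layer, ratio) pair keeps ratio * n_tokens from that layer onward
    schedule = dict(zip(layer_list, ratio_list))
    kept, ratio = [], 1.0
    for layer in range(n_layers):
        ratio = schedule.get(layer, ratio)
        kept.append(ratio * n_tokens)
    return sum(kept) / n_layers

print(avg_visual_tokens(640, 28, [14, 21], [0.5, 0.25]))  # 440.0, vs. 640 at every layer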

Evaluation

Running Benchmarks

# LLaVA-1.5 TextVQA
bash scripts/llava/v1_5/pdrop_eval/textvqa.sh

# Video-LLaVA MSVD
bash scripts/videollava/v1_5/eval/eval_qa_msvd.sh

# Qwen2.5-VL TextVQA
bash scripts/qwen/textvqa.sh duet_640

# Qwen2.5-VL POPE
bash scripts/qwen/pope.sh duet_640

Inference-Only Results (LLaVA-1.5-7B)

| Method                   | Avg Tokens | Token Reduction | Avg Accuracy (% of baseline) |
|--------------------------|------------|-----------------|------------------------------|
| LLaVA-1.5-7B (baseline)  | 576        | 0%              | 100.0                        |
| VisionZip                | 192        | 66.7%           | 97.7                         |
| PyramidDrop              | 192        | 66.7%           | 96.4                         |
| DUET-VLM                 | 192        | 66.7%           | 99.0                         |
| DUET-VLM                 | 64         | 88.9%           | 95.4                         |

Video Results (Video-LLaVA-7B)

| Method                  | Avg Tokens | Token Reduction | Avg Accuracy (% of baseline) |
|-------------------------|------------|-----------------|------------------------------|
| Video-LLaVA (baseline)  | 2048       | 0%              | 100.0                        |
| PyramidDrop             | 960        | 53.1%           | 100.7                        |
| DUET-VLM                | 960        | 53.1%           | 100.8                        |
| DUET-VLM                | 136        | 93.4%           | 97.6                         |

Project Structure

DUET-VLM/
├── llava/                      # LLaVA-1.5 model (image VLM)
├── videollava/                 # Video-LLaVA model (image + video VLM)
├── qwen2_5_vl/                 # Qwen2.5-VL DUET (standalone implementation)
├── visionzip/                  # Shared VisionZip module
├── scripts/                    # Evaluation and training scripts
│   ├── llava/                  # LLaVA-1.5 scripts
│   ├── videollava/             # Video-LLaVA scripts
│   └── qwen/                   # Qwen2.5-VL scripts
├── setup.py                    # Package installation
├── STRUCTURE.md                # Detailed codebase documentation
└── utils.py                    # Modified HF generation utils

Training

DUET-VLM supports training with integrated token compression. See the training scripts for each model:

# LLaVA-1.5 pre-training
bash scripts/llava/v1_5/pdrop_train/pretrain.sh

# LLaVA-1.5 fine-tuning
bash scripts/llava/v1_5/pdrop_train/finetune.sh

# Video-LLaVA fine-tuning
bash scripts/videollava/v1_5/finetune.sh

Acknowledgement

This codebase builds on LLaVA, Video-LLaVA, VisionZip, PyramidDrop, and Qwen2.5-VL.

License

DUET-VLM is released under the Apache License 2.0.
