This is the official implementation of the paper "Visual Representation Alignment for Multimodal Large Language Models (VIRAL)"
by Heeji Yoon*, Jaewoo Jung*, Junwan Kim*, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
*: Equal Contribution
We introduce VIRAL (VIsual Representation ALignment), a simple regularization strategy that explicitly aligns intermediate visual features in MLLMs with representations from pretrained vision encoders or stronger vision foundation models (VFMs). This alignment preserves rich spatial and semantic information, enabling MLLMs to reason more effectively over complex visual inputs.
Extensive experiments demonstrate that VIRAL consistently improves performance across standard multimodal benchmarks, highlighting the benefit of directly supervising the visual pathway.
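In practice, this amounts to adding an auxiliary alignment term on top of the usual next-token objective during instruction tuning. Below is a minimal PyTorch sketch of the idea; the module name, the two-layer MLP projector, and the cosine-similarity form are illustrative assumptions rather than the repository's exact implementation (4096 is the Vicuna-7B hidden size).

import torch.nn as nn
import torch.nn.functional as F

class VisualRepresentationAlignmentLoss(nn.Module):
    # Sketch: project intermediate LLM hidden states at the visual-token positions
    # into the VFM feature space and pull them toward the frozen VFM features.
    def __init__(self, llm_dim=4096, projector_dim=2048, z_dim=768):
        super().__init__()
        # Trainable MLP projector from the LLM hidden size to the VFM feature dimension.
        self.projector = nn.Sequential(
            nn.Linear(llm_dim, projector_dim),
            nn.GELU(),
            nn.Linear(projector_dim, z_dim),
        )

    def forward(self, visual_hidden_states, vfm_features):
        # visual_hidden_states: (B, N, llm_dim) from an intermediate LLM layer,
        #                       restricted to the visual-token positions.
        # vfm_features:         (B, N, z_dim) from the frozen vision foundation model.
        pred = self.projector(visual_hidden_states)
        # Negative cosine similarity, averaged over tokens, as the alignment penalty.
        return 1.0 - F.cosine_similarity(pred, vfm_features.detach(), dim=-1).mean()

The total training objective is then the language-modeling loss plus this term scaled by a weight (vra_weight in the configuration shown later).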
We implement VIRAL on top of LLaVA. To set up the environment, run:
pip install -e .
pip install -e ".[train]"
pip install flash-attn==2.5.7 --no-build-isolation
pip install peft==0.10.0
pip install -U timm==1.0.16
pip install git+https://github.com/facebookresearch/segment-anything.git # if using SAM
pip install opencv-python --no-deps # if using Depth-Anything
This codebase was tested with Python 3.10 and CUDA 12.1.
Please download the annotation file of the final instruction-tuning data mixture, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script (save all files as .jpg)
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data in ./playground/llava-665k as follows:
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
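As a quick sanity check before training, a small helper like the one below (hypothetical, not part of this repository) can confirm that every expected image directory from the tree above exists:

import os

# Expected image directories relative to the data root (see the tree above).
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

def check_data_root(root="./playground/llava-665k"):
    missing = [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected image directories found under", root)

if __name__ == "__main__":
    check_data_root()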
To finetune Vicuna-1.5-7b with VIRAL, run:
bash scripts/v1_5/finetune_lora.sh
Make sure --config_path ./config.json is included in the bash script. Also, keep the effective batch size at 128 for reproduction.
Example configuration of config.json:
{
    "vra_loss": true,                  # set to false when running the baseline model
    "target_layers": [16],             # nth-layer outputs (1-based index; 0 denotes the LLM input)
    "vra_target": "dinov2-vit-b",      # dinov2-vit-b, sam_vit_b_01ec64, clip, radio_v2.5-b, c-radio_v3-b, depth_anything_v2_vitb
    "vra_weight": 0.5,                 # loss weight for VRA
    "projector_dim": 2048,             # hidden dimension of the MLP projector (default: 2048)
    "z_dim": 768,                      # target VFM feature dimension
    "use_multiple_projectors": true    # use separate MLPs for the target layers (default: false)
}
For an example of how to run inference, please refer to the inference.py script.
python inference.py \
--model-path PATH/TO/MODEL \
--image-path PATH/TO/IMAGE \
--prompt "YOUR PROMPT" \
--temperature 0.2 \
--top_p 0.9 \
--max-new-tokens 128
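Since the codebase is built on top of LLaVA, the upstream LLaVA Python API should also work for programmatic inference, assuming this fork keeps the stock llava package layout; the snippet below follows LLaVA's own quick-start pattern, with paths and prompt as placeholders.

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "PATH/TO/MODEL"  # for LoRA checkpoints, also set model_base to the base model path

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image.",
    "conv_mode": None,
    "image_file": "PATH/TO/IMAGE",
    "sep": ",",
    "temperature": 0.2,
    "top_p": 0.9,
    "num_beams": 1,
    "max_new_tokens": 128,
})()

eval_model(args)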
Code is implemented with extensive reference to LLaVA and REPA. We sincerely thank the original authors for their invaluable work and contributions!
If you find this research useful, please consider citing:
@misc{yoon2025visualrepresentationalignmentmultimodal,
title={Visual Representation Alignment for Multimodal Large Language Models},
author={Heeji Yoon and Jaewoo Jung and Junwan Kim and Hyungyu Choi and Heeseong Shin and Sangbeom Lim and Honggyu An and Chaehyun Kim and Jisang Han and Donghyun Kim and Chanho Eom and Sunghwan Hong and Seungryong Kim},
year={2025},
eprint={2509.07979},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.07979},
}