This is the official implementation of the paper "Visual Representation Alignment for Multimodal Large Language Models (VIRAL)"
by Heeji Yoon*, Jaewoo Jung*, Junwan Kim*, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
*: Equal Contribution
We introduce VIRAL (VIsual Representation ALignment), a simple regularization strategy that explicitly aligns intermediate visual features in MLLMs with representations from pretrained vision encoders or stronger vision foundation models (VFMs). This alignment preserves rich spatial and semantic information, enabling MLLMs to reason more effectively over complex visual inputs.
Extensive experiments demonstrate that VIRAL consistently improves performance across standard multimodal benchmarks, highlighting the benefit of directly supervising the visual pathway.
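In practice, this amounts to adding an auxiliary alignment term on top of the usual next-token objective during instruction tuning. Below is a minimal PyTorch sketch of the idea; the module name, the two-layer MLP projector, and the cosine-similarity form are illustrative assumptions rather than the repository's exact implementation (4096 is the Vicuna-7B hidden size).

import torch.nn as nn
import torch.nn.functional as F

class VisualRepresentationAlignmentLoss(nn.Module):
    # Sketch: project intermediate LLM hidden states at the visual-token positions
    # into the VFM feature space and pull them toward the frozen VFM features.
    def __init__(self, llm_dim=4096, projector_dim=2048, z_dim=768):
        super().__init__()
        # Trainable MLP projector from the LLM hidden size to the VFM feature dimension.
        self.projector = nn.Sequential(
            nn.Linear(llm_dim, projector_dim),
            nn.GELU(),
            nn.Linear(projector_dim, z_dim),
        )

    def forward(self, visual_hidden_states, vfm_features):
        # visual_hidden_states: (B, N, llm_dim) from an intermediate LLM layer,
        #                       restricted to the visual-token positions.
        # vfm_features:         (B, N, z_dim) from the frozen vision foundation model.
        pred = self.projector(visual_hidden_states)
        # Negative cosine similarity, averaged over tokens, as the alignment penalty.
        return 1.0 - F.cosine_similarity(pred, vfm_features.detach(), dim=-1).mean()

The total training objective is then the language-modeling loss plus this term scaled by a weight (vra_weight in the configuration shown later).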
We implement VIRAL on top of LLaVA. To set up the environment, run:
pip install -e .
pip install -e ".[train]"
pip install flash-attn==2.5.7 --no-build-isolation
pip install peft==0.10.0
pip install -U timm==1.0.16
pip install git+https://github.com/facebookresearch/segment-anything.git # if using SAM
pip install opencv-python --no-deps # if using Depth-Anything
This codebase was tested with Python 3.10 and CUDA 12.1.
Please download the annotation file of the final instruction-tuning data mixture, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script (save all files as .jpg)
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data in ./playground/llava-665k as follows:
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
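As a quick sanity check before training, a small helper like the one below (hypothetical, not part of this repository) can confirm that every expected image directory from the tree above exists:

import os

# Expected image directories relative to the data root (see the tree above).
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

def check_data_root(root="./playground/llava-665k"):
    missing = [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected image directories found under", root)

if __name__ == "__main__":
    check_data_root()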
To finetune Vicuna-1.5-7b with VIRAL, run:
bash scripts/v1_5/finetune_lora.sh
Make sure --config_path ./config.json is included in the bash script. Also, keep the effective batch size at 128 for reproduction.
Example configuration of config.json:
{
    "vra_loss": true,                  # set to false when running the baseline model
    "target_layers": [16],             # nth-layer outputs (1-based index; 0 denotes the LLM input)
    "vra_target": "dinov2-vit-b",      # dinov2-vit-b, sam_vit_b_01ec64, clip, radio_v2.5-b, c-radio_v3-b, depth_anything_v2_vitb
    "vra_weight": 0.5,                 # loss weight for VRA
    "projector_dim": 2048,             # hidden dimension of the MLP projector (default: 2048)
    "z_dim": 768,                      # target VFM feature dimension
    "use_multiple_projectors": true    # use separate MLPs for the target layers (default: false)
}
For an example of how to run inference, please refer to the inference.py script.
python inference.py \
--model-path PATH/TO/MODEL \
--image-path PATH/TO/IMAGE \
--prompt "YOUR PROMPT" \
--temperature 0.2 \
--top_p 0.9 \
--max-new-tokens 128
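Since the codebase is built on top of LLaVA, the upstream LLaVA Python API should also work for programmatic inference, assuming this fork keeps the stock llava package layout; the snippet below follows LLaVA's own quick-start pattern, with paths and prompt as placeholders.

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "PATH/TO/MODEL"  # for LoRA checkpoints, also set model_base to the base model path

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image.",
    "conv_mode": None,
    "image_file": "PATH/TO/IMAGE",
    "sep": ",",
    "temperature": 0.2,
    "top_p": 0.9,
    "num_beams": 1,
    "max_new_tokens": 128,
})()

eval_model(args)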
Code is implemented with extensive reference to LLaVA and REPA. We sincerely thank the original authors for their invaluable work and contributions!
If you find this research useful, please consider citing:
@misc{yoon2025visualrepresentationalignmentmultimodal,
title={Visual Representation Alignment for Multimodal Large Language Models},
author={Heeji Yoon and Jaewoo Jung and Junwan Kim and Hyungyu Choi and Heeseong Shin and Sangbeom Lim and Honggyu An and Chaehyun Kim and Jisang Han and Donghyun Kim and Chanho Eom and Sunghwan Hong and Seungryong Kim},
year={2025},
eprint={2509.07979},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.07979},
}