Authors: Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei
Affiliations: China University of Geosciences, Wuhan; Shanghai Innovation Institute; University of Southern California; Fudan University.
Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from
helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
- [2025-12-08] The code is released.
- [2025-12-08] The paper is released on arXiv.
The code is tested with Python 3.11. We recommend using Conda for environment management.
```bash
# 1. Create a conda environment
conda create -n ilvr python=3.11
conda activate ilvr

# 2. Install standard dependencies
pip install -r requirements.txt

# 3. Install the custom Transformers library
# ILVR requires modifications to the standard transformers library.
# We provide the modified source code in this repository.
cd transformers
pip install -e .
cd ..
```

This project uses HuggingFace accelerate for distributed training. Please configure it before running the training script.
```bash
accelerate config
```

We utilize the CoMT (Chain of Multi-modal Thought) dataset as an example for constructing training data. For more details about the benchmark, please refer to the CoMT paper.
We provide the processed data on HuggingFace. Please download it from shuai22/comt and organize the directory as follows.
- Download `TRAIN.jsonl`, `TEST.jsonl`, and `comt.tar.gz`.
- Extract the images from the tarball (a scripted download-and-extract sketch is shown below).
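If you prefer to script these two steps, here is a minimal sketch using `huggingface_hub` and Python's `tarfile` module. Treating `shuai22/comt` as a dataset repo and assuming the tarball unpacks into `images_comt/` are assumptions, not guarantees from the release.

```python
# Sketch: download the processed CoMT data and unpack the image tarball.
# Assumes shuai22/comt is hosted as a HuggingFace *dataset* repo and that
# comt.tar.gz extracts into the images_comt/ directory shown below.
import tarfile
from pathlib import Path

from huggingface_hub import snapshot_download

data_dir = Path("data")
snapshot_download(
    repo_id="shuai22/comt",
    repo_type="dataset",   # assumption: the files live in a dataset repo
    local_dir=data_dir,
)

archive = data_dir / "comt.tar.gz"
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path=data_dir)  # expected to yield data/images_comt/...
```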
Expected Directory Structure:
```
ILVR/
├── data/
│   ├── TRAIN.jsonl
│   ├── TEST.jsonl
│   └── images_comt/          <-- Extracted from comt.tar.gz
│       ├── creation/
│       └── ...
├── src/
├── transformers/
├── run_training.sh
└── README.md
```
The dataset is stored in JSONL format; each record contains the following fields (an illustrative record is sketched below):
- `text_input`: the question / instruction.
- `image_input`: the initial input images.
- `sequence_plan`: the interleaved chain-of-thought rationale, containing "text" steps and "helper_image" paths.
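The following sketch peeks at the first training record to confirm these fields. The field names come from the list above; the exact types (e.g. whether `image_input` is a string or a list, and the per-step keys inside `sequence_plan`) are assumptions about the released data.

```python
# Sketch: inspect the first record of TRAIN.jsonl to confirm the expected fields.
import json

with open("data/TRAIN.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record["text_input"])    # question / instruction
print(record["image_input"])   # initial input image(s)
for step in record["sequence_plan"]:
    # each step should carry either reasoning text or a helper_image path
    print(step)
```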
We provide a shell script run_training.sh to launch distributed training.
Open run_training.sh and modify the paths to match your local setup:
```bash
# In run_training.sh:

# Path to the directory containing TRAIN.jsonl
DATA_PATH="/path/to/your/data"

# Directory to save model checkpoints
SAVE_MODEL_PATH="/path/to/save/checkpoints"

# File path for training logs
LOG_FILE="/path/to/save/train.log"

# (Optional) HuggingFace cache directory
export HF_HOME="/path/to/cache"
```

Start training with the following command:
```bash
bash run_training.sh
```

Default Hyperparameters:
- Base Model: `Qwen/Qwen2.5-VL-7B-Instruct`
- Epochs: 15
- Gradient Accumulation Steps: 8
- Latent Size: 8
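As a quick sanity check that the editable transformers install and the base checkpoint resolve correctly, the sketch below loads the default base model with standard transformers APIs. It is not part of the official training scripts, and it assumes the patched library keeps the stock Qwen2.5-VL classes and that enough memory is available for the 7B weights.

```python
# Sketch: verify the local transformers install and the base model.
import transformers
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Should point into the local ./transformers directory after `pip install -e .`
print(transformers.__version__, transformers.__file__)

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",   # bf16/fp16 depending on the checkpoint and hardware
    device_map="auto",    # requires accelerate
)
print(model.config.model_type)
```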
If you find this project or the ILVR framework useful, please cite our paper:
```bibtex
@article{dong2025interleaved,
  title={Interleaved Latent Visual Reasoning with Selective Perceptual Modeling},
  author={Shuai Dong and Siyuan Wang and Xingyu Liu and Zhongyu Wei},
  year={2025},
  eprint={2512.05665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.05665},
}
```

This codebase is built upon Qwen-VL, Transformers, and Mirage. We thank the authors for their open-source contributions.
This project is licensed under the MIT License.