ILVR: Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

arXiv: 2512.05665 | License: MIT

Authors: Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei

Affiliations: China University of Geosciences, Wuhan; Shanghai Innovation Institute; University of Southern California; Fudan University.


📖 Abstract

[Figure: Model Architecture]

Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.


📢 News

  • [2025-12-08] The code is released.
  • [2025-12-08] The paper is released on arXiv.

🛠️ Installation

1. Environment Setup

The code is tested with Python 3.11. We recommend using Conda for environment management.

# 1. Create a conda environment
conda create -n ilvr python=3.11
conda activate ilvr

# 2. Install standard dependencies
pip install -r requirements.txt

# 3. Install custom Transformers library
# ILVR requires modifications to the standard transformers library.
# We provide the modified source code in this repository.
cd transformers
pip install -e .
cd ..
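
As an optional sanity check, you can confirm that Python resolves the locally modified library rather than a stock PyPI install; the printed path should point inside this repository's transformers/ folder:

# Verify that the editable install of the modified transformers is picked up
python -c "import transformers; print(transformers.__version__, transformers.__file__)"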

2. Accelerate Configuration

This project uses HuggingFace accelerate for distributed training. Please configure it before running the training script.

accelerate config
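
Running accelerate config interactively writes a default_config.yaml under the HuggingFace cache (typically ~/.cache/huggingface/accelerate/). For reference, a minimal single-node multi-GPU configuration could look roughly like the following; the values below are illustrative and are not the exact settings used for ILVR:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4      # illustrative: one process per GPU
gpu_ids: all
mixed_precision: bf16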

📚 Data Preparation

We use the CoMT (Chain of Multi-modal Thought) dataset as an example source for constructing training data. For more details about the benchmark, please refer to the CoMT paper.

Download Data

We provide the processed data on HuggingFace. Please download it from shuai22/comt and organize the directory as follows (a download sketch is shown after the steps).

  1. Download TRAIN.jsonl, TEST.jsonl, and comt.tar.gz.
  2. Extract the images from the tarball.
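
For reference, one way to fetch and unpack the files with the huggingface_hub CLI; this assumes shuai22/comt is hosted as a dataset repository and that the tarball unpacks into an images_comt/ folder (adjust paths if yours differ):

# Download the processed CoMT files into ./data (repo type assumed to be "dataset")
huggingface-cli download shuai22/comt --repo-type dataset --local-dir data

# Unpack the helper images next to the JSONL files
tar -xzf data/comt.tar.gz -C data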

Expected Directory Structure:

ILVR/
├── data/
│   ├── TRAIN.jsonl
│   ├── TEST.jsonl
│   └── images_comt/      <-- Extracted from comt.tar.gz
│       ├── creation/
│       └── ...
├── src/
├── transformers/
├── run_training.sh
└── README.md

Data Format

The dataset follows the JSONL format; each line is a single record with the following fields (an illustrative record is shown after the list):

  • text_input: The question/instruction.
  • image_input: Initial input images.
  • sequence_plan: The interleaved chain-of-thought rationale, consisting of "text" segments and "helper_image" paths.
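
For illustration only, a record might look like the sketch below; the field values and the exact nesting of sequence_plan are assumptions, so consult TRAIN.jsonl for the authoritative schema.

# Illustrative record (values and nesting are assumptions, not taken from the real data)
record = {
    "text_input": "Which piece completes the puzzle?",
    "image_input": ["images_comt/creation/0001_input.png"],
    "sequence_plan": [
        {"text": "First, inspect the missing region of the puzzle ..."},
        {"helper_image": "images_comt/creation/0001_helper.png"},
        {"text": "The highlighted piece matches, so the answer is (B)."},
    ],
}

In TRAIN.jsonl and TEST.jsonl, each such record is serialized as a single JSON object per line.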

🚀 Training

We provide a shell script run_training.sh to launch distributed training.

1. Configure Script

Open run_training.sh and modify the paths to match your local setup:

# In run_training.sh:

# Path to the directory containing TRAIN.jsonl
DATA_PATH="/path/to/your/data" 

# Directory to save model checkpoints
SAVE_MODEL_PATH="/path/to/save/checkpoints"

# File path for training logs
LOG_FILE="/path/to/save/train.log"

# (Optional) HuggingFace Cache Directory
export HF_HOME="/path/to/cache" 

2. Run Training

Start training with the following command:

bash run_training.sh
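
Training output is written to the log file configured above, so progress can be followed from another terminal (the path below is the placeholder value of LOG_FILE from run_training.sh):

tail -f /path/to/save/train.log   # LOG_FILE as set in run_training.sh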

Default Hyperparameters:

  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Epochs: 15
  • Gradient Accumulation Steps: 8
  • Latent Size: 8

📝 Citation

If you find this project or the ILVR framework useful, please cite our paper:

@article{dong2025interleaved,
  title={Interleaved Latent Visual Reasoning with Selective Perceptual Modeling},
  author={Shuai Dong and Siyuan Wang and Xingyu Liu and Zhongyu Wei},
  year={2025},
  eprint={2512.05665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.05665},
}

🙏 Acknowledgements

This codebase is built upon Qwen-VL, Transformers, and Mirage. We thank the authors for their open-source contributions.

📄 License

This project is licensed under the MIT License.
