Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

We introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts.

🔥Updates

  • 2025.12.11 Update the prompt in ./inference/vllm_inference_example.py (add "Put your final answer in \boxed{}.").
  • 2025.12.04 Fix typos in RL/examples/vlpo_train.sh
  • 2025.12.02 Fix typos in script_examples/sft_stage1.sh, script_examples/sft_stage2.sh, script_examples/sft_stage3.sh

🔍Overview

Table of Contents
  1. Installation
  2. Training Data
  3. SFT Training
  4. RL Training
  5. Inference
  6. Citation
  7. Acknowledgement

To support latent reasoning, we use a customized Qwen2.5-VL-7B model that replaces the official implementations in Transformers and vLLM.

⚙Installation

git clone https://github.com/NOVAglow646/Monet.git

SFT environment:

conda create -n monet python=3.10
conda activate monet
cd Monet
pip install -r requirements.txt

RL environment:

cd Monet/RL
conda create -n easyr1 python=3.11
conda activate easyr1
pip install -r requirements.txt

📕Training Data

🔧SFT Training

Training Scripts

See this folder.

Implementation Details

Training requires a modification of the official Qwen2.5-VL-7B code, which is implemented in this file. The forward pass with latent embeddings is mainly implemented in Qwen2_5_VLModel.forward and Qwen2_5_VLForConditionalGeneration.forward.
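
As a rough, hypothetical sketch of the general technique (not the repository's actual code): the forward pass conceptually splices the continuous latent embeddings into the token-embedding sequence at the positions of latent placeholder tokens. The placeholder id below is an assumption for illustration only.

import torch

# LATENT_TOKEN_ID is a HYPOTHETICAL placeholder id; the real id is defined
# in the repository's modified Qwen2.5-VL code.
LATENT_TOKEN_ID = 151665

def splice_latent_embeds(input_ids, inputs_embeds, latent_embeds):
    """Replace the embeddings at latent-placeholder positions with the
    continuous latent embeddings (e.g., last-layer hidden states from a
    previous step).

    input_ids:     (batch, seq_len)       LongTensor
    inputs_embeds: (batch, seq_len, dim)  FloatTensor
    latent_embeds: (num_latent, dim)      FloatTensor, one row per placeholder
    """
    mask = input_ids == LATENT_TOKEN_ID            # (batch, seq_len) bool mask
    inputs_embeds = inputs_embeds.clone()          # avoid in-place modification
    inputs_embeds[mask] = latent_embeds.to(inputs_embeds.dtype)
    return inputs_embeds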

🚀RL Training

We implement our RL training based on EasyR1.

Training Scripts

See this training script.

Explanation of key parameters:

  • worker.rollout.sampling_strategy=monet performs latent reasoning during rollout (VLPO is enabled by this setting); worker.rollout.sampling_strategy=greedy uses plain text reasoning.
  • export LATENT_SIZE=10 sets the number of latent embeddings.
  • worker.rollout.monet.select_acc_threshold=0.6 selects samples whose rollout accuracy lies in $(0, 0.6)$ for training.
  • worker.rollout.online_difficulty_sampling=true dynamically samples hard examples for training (used together with select_acc_threshold).
  • worker.actor.monet_rl_sigma=10.0 and worker.ref.monet_rl_sigma set the VLPO $\sigma$.
  • worker.reward.repetition_penalty=true penalizes repetitive, meaningless outputs; repetition detection is implemented via an API call.

After training, remember to use the model merging script to merge the parameter shards and obtain the final model.

API Calling

For RL training, we use external LLM APIs (Gemini / DeepSeek) via the helper in RL/tools/custom_api.py to support accurate rule-based judgement.

  • Gemini (Google AI)

    • Install the SDK:
      pip install google-genai
    • Set your API key (from Google AI Studio) before running RL scripts:
      export GOOGLE_API_KEY="<your_gemini_api_key>"
    • In the training script, use worker.rule_based_judge.api_name="gemini-2.5-pro".
  • DeepSeek

    • Install the OpenAI-compatible SDK:
      pip install openai
    • Set the API key:
      export DEEPSEEK_API_KEY="<your_deepseek_api_key>"
    • In the training script, use worker.rule_based_judge.api_name="deepseek-chat".

Please refer to RL/tools/custom_api.py for the exact calling interface.
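
As a minimal, hedged sketch of the raw SDK calls such a judge might wrap (the function names and prompt below are illustrative, not the interface in RL/tools/custom_api.py):

import os
from google import genai          # Gemini SDK (pip install google-genai)
from openai import OpenAI         # OpenAI-compatible SDK (pip install openai)

# Illustrative judging prompt; the actual prompt lives in RL/tools/custom_api.py.
JUDGE_PROMPT = ("Question: {q}\nPrediction: {p}\nReference answer: {a}\n"
                "Reply with 'correct' or 'incorrect'.")

def judge_with_gemini(q, p, a):
    # Reads GOOGLE_API_KEY from the environment.
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=JUDGE_PROMPT.format(q=q, p=p, a=a),
    )
    return response.text

def judge_with_deepseek(q, p, a):
    # DeepSeek exposes an OpenAI-compatible endpoint.
    client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                    base_url="https://api.deepseek.com")
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, p=p, a=a)}],
    )
    return response.choices[0].message.content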

⭐Inference

Download Monet-7B Model

You can download Monet-7B from this repo. Inference requires replacing the official vLLM code (see Modified vLLM model).
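
If the checkpoint is hosted on the Hugging Face Hub, a minimal download sketch with huggingface_hub could look like the following (the repo id is a placeholder; use the id from the linked model page):

# A minimal sketch, assuming the checkpoint is on the Hugging Face Hub.
from huggingface_hub import snapshot_download

# PLACEHOLDER repo id; replace with the id from the linked model page.
local_dir = snapshot_download(repo_id="<org>/Monet-7B")
print(f"Monet-7B downloaded to: {local_dir}")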

Inference Example

See this quick example for running Monet-7B with latent reasoning.

  • Setting the latent size at inference: you can control the number of latent embeddings generated each time the model enters latent reasoning with export LATENT_SIZE=10.
  • Handling model outputs that contain latent reasoning: to achieve latent-text interleaved reasoning, the model may generate <abs_vis_token> to switch into latent thinking mode. With our modified vLLM gpu_model_runner.py, the following tokens are then replaced with last-layer representations. Since these latent tokens are not human-readable, you can post-process the output by detecting the start token <abs_vis_token> and replacing the enclosed latent tokens with a clean placeholder such as <latent>; a sketch follows below this list.
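
Below is a minimal post-processing sketch. It assumes the latent span opened by <abs_vis_token> is closed by an end marker; the end-marker name here is a placeholder, so check the modified vLLM code for the actual token.

import re

LATENT_START = "<abs_vis_token>"     # documented start token
LATENT_END = "<abs_vis_token_end>"   # PLACEHOLDER: verify the real end marker

def clean_latent_spans(text, placeholder="<latent>"):
    """Replace each latent span (start marker, latent tokens, end marker)
    with a single human-readable placeholder."""
    pattern = re.escape(LATENT_START) + r".*?" + re.escape(LATENT_END)
    return re.sub(pattern, placeholder, text, flags=re.DOTALL)

# Example: "Look closer. <abs_vis_token>...<abs_vis_token_end> The answer is 3."
# becomes  "Look closer. <latent> The answer is 3."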

Evaluation

We evaluate Monet-7B with VLMEvalKit. Notably, we replace the original exact-matching judgement with an API judge to ensure a more accurate assessment.

⚠Note: to accurately reproduce the results:

  • Please use the following system prompt for VLMEvalKit evaluation: You are a helpful multimodal assistant. You are required to answer the question based on the image provided. Put your final answer in \boxed{}.
  • Please apply an API model as a supplementary judge.

🖊Citation

If you find this work useful, please use the following BibTeX. Thank you for your support!

@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language}, 
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395}, 
}

🙏Acknowledgement

We sincerely thank the following great works for providing valuable data and code for our project:
