Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
We introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts.

- 2025.12.11 Update the prompt in ./inference/vllm_inference_example.py (add "Put your final answer in \boxed{}.").
- 2025.12.04 Fix typos in RL/examples/vlpo_train.sh.
- 2025.12.02 Fix typos in script_examples/sft_stage1.sh, script_examples/sft_stage2.sh, and script_examples/sft_stage3.sh.
Table of Contents
To support latent reasoning, we use a customized Qwen2.5-VL-7B model that replaces the official implementations in Transformers and vLLM.
- Modified Transformers model (for SFT Training)
- Modified Transformers model (for RL Training)
- Modified vLLM model (for RL Training)
- Modified vLLM model (for inference)
git clone https://github.com/NOVAglow646/Monet.git

SFT environment:
conda create -n monet python=3.10
conda activate monet
cd Monet
pip install -r requirements.txt

RL environment:
cd Monet/RL
conda create -n easyr1 python=3.11
conda activate easyr1
pip install -r requirements.txt

See this folder.
The training requires a modification of the official Qwen2.5-VL-7B code, which is implemented in this file. The main implementation of the forward pass with latent embeddings is in Qwen2_5_VLModel.forward and Qwen2_5_VLForConditionalGeneration.forward.
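The core mechanism can be illustrated with a toy, self-contained sketch (ours, not the modified Qwen2.5-VL code; the model, shapes, and names below are stand-ins): during a latent segment, the last-layer hidden state is fed back as the next input embedding instead of an embedded discrete token, so the model "thinks" in continuous space for a fixed number of steps before resuming normal decoding.

```python
# Toy illustration of latent reasoning (NOT the repository's implementation).
import torch
import torch.nn as nn

hidden, vocab, latent_size = 64, 100, 4  # latent_size plays the role of LATENT_SIZE

embed = nn.Embedding(vocab, hidden)
backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the LLM trunk
lm_head = nn.Linear(hidden, vocab)

def decode_step(inputs_embeds, state=None):
    out, state = backbone(inputs_embeds, state)
    return out[:, -1], state  # last-position, last-layer hidden state

# Normal text step: embed a discrete token and run it through the trunk.
h, state = decode_step(embed(torch.tensor([[3]])))

# Latent steps: feed the last hidden state back as the next input embedding,
# skipping the vocabulary entirely (no sampling of discrete tokens).
for _ in range(latent_size):
    h, state = decode_step(h.unsqueeze(1), state)

# After the latent segment, decoding over the vocabulary resumes as usual.
next_token = lm_head(h).argmax(-1)
```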
We implement our RL training based on EasyR1.
See this training script.
Key parameters:

- `worker.rollout.sampling_strategy=monet`: perform latent reasoning during rollout (VLPO is enabled by specifying this parameter); `worker.rollout.sampling_strategy=greedy`: plain text reasoning.
- `export LATENT_SIZE=10`: number of latent embeddings.
- `worker.rollout.monet.select_acc_threshold=0.6`: select samples with accuracy in $(0, 0.6)$ for training.
- `worker.rollout.online_difficulty_sampling=true`: dynamically sample hard examples for training (used together with `select_acc_threshold`; see the sketch after this list).
- `worker.actor.monet_rl_sigma=10.0` and `worker.ref.monet_rl_sigma`: the VLPO $\sigma$.
- `worker.reward.repetition_penalty=true`: penalize repetitive, meaningless outputs. Repetition detection is implemented via the API judge.
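As a rough illustration of the difficulty-based selection above (a conceptual sketch, not the EasyR1/Monet code; the function name and data layout are hypothetical):

```python
# Keep only prompts whose rollout accuracy lies in (0, select_acc_threshold):
# prompts the policy never solves or always solves carry little learning signal.
select_acc_threshold = 0.6

def select_for_training(rollout_groups):
    """rollout_groups: list of dicts like {"prompt": ..., "rewards": [0, 1, ...]}."""
    selected = []
    for group in rollout_groups:
        acc = sum(group["rewards"]) / len(group["rewards"])
        if 0.0 < acc < select_acc_threshold:
            selected.append(group)
    return selected
```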
After training, remember to use the model merging script to merge the parameter shards and obtain the final model.
For RL training, we use external LLM APIs (Gemini / DeepSeek) via the helper in RL/tools/custom_api.py to support accurate rule-based judgement.
- Gemini (Google AI)
- Install the SDK:
pip install google-genai
- Set your API key (from Google AI Studio) before running RL scripts:
export GOOGLE_API_KEY="<your_gemini_api_key>"
- In the training script, use
worker.rule_based_judge.api_name="gemini-2.5-pro".
- DeepSeek
- Install the OpenAI-compatible SDK:
pip install openai
- Set the API key:
export DEEPSEEK_API_KEY="<your_deepseek_api_key>"
- In the training script, use
worker.rule_based_judge.api_name="deepseek-chat".
Please refer to RL/tools/custom_api.py for the exact calling interface.
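For orientation, here is a minimal sketch of how such judge calls can look with the two SDKs (the judge prompt and response handling below are our own illustration, not the code in RL/tools/custom_api.py):

```python
# Hypothetical judge prompt; the actual prompt lives in RL/tools/custom_api.py.
import os

judge_prompt = (
    "Question: {q}\nGround truth: {gt}\nModel answer: {ans}\n"
    "Reply with 'correct' or 'incorrect'."
).format(q="2 + 2 = ?", gt="4", ans="\\boxed{4}")

# Gemini via the google-genai SDK.
from google import genai
gemini = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
resp = gemini.models.generate_content(model="gemini-2.5-pro", contents=judge_prompt)
print(resp.text)

# DeepSeek via the OpenAI-compatible SDK.
from openai import OpenAI
deepseek = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")
chat = deepseek.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(chat.choices[0].message.content)
```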
You can download Monet-7B from this repo. Inference requires replacing the official vLLM code (see Modified vLLM model).
See this quick example of using Monet-7B with latent reasoning.
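To get a feel for what such a call looks like, here is a rough offline-inference sketch using vLLM's LLM.chat API (the checkpoint path, image URL, and question are placeholders; inference/vllm_inference_example.py in the repo is the authoritative example and requires the modified vLLM model described above):

```python
# Rough sketch only; assumes the modified vLLM model is installed.
import os
os.environ["LATENT_SIZE"] = "10"  # latent embeddings per latent segment; set before the engine starts

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/Monet-7B")  # placeholder checkpoint path
messages = [
    {"role": "system", "content": (
        "You are a helpful multimodal assistant. You are required to answer the "
        "question based on the image provided. Put your final answer in \\boxed{}."
    )},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/question.png"}},  # placeholder image
        {"type": "text", "text": "How many chairs are in the image?"},
    ]},
]
outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=2048))
print(outputs[0].outputs[0].text)
```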
- Setting the latent size at inference: you can control the number of latent embeddings generated each time the model starts latent reasoning with:
export LATENT_SIZE=10
- Handling model outputs that contain latent reasoning: to achieve latent-text interleaved reasoning, the model may generate `<abs_vis_token>` to switch into latent thinking mode. With our modified vLLM gpu_model_runner.py, the following tokens are then replaced with last-layer representations. Since these latent tokens are not human-readable, you can post-process the output by detecting the start token `<abs_vis_token>` and replacing the enclosed latent tokens with a clean placeholder such as `<latent>`.
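A minimal post-processing sketch (assumption: the latent span opened by `<abs_vis_token>` is closed by a matching end marker, hypothetically written `</abs_vis_token>` below; check the delimiters your checkpoint actually emits and adjust the pattern):

```python
import re

# Hypothetical end marker; adjust to the actual delimiter emitted by the model.
LATENT_SPAN = re.compile(r"<abs_vis_token>.*?</abs_vis_token>", re.DOTALL)

def clean_latent_output(text: str) -> str:
    """Replace non-human-readable latent spans with a <latent> placeholder."""
    return LATENT_SPAN.sub("<latent>", text)
```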
We evaluate Monet-7B on VLMEvalKit. Notably, we replace the original exact-match judgement with an API judge to ensure more accurate assessment.
Note that, to accurately reproduce the results:
- Please use the following system prompt for VLMEvalKit evaluation:
You are a helpful multimodal assistant. You are required to answer the question based on the image provided. Put your final answer in \boxed{}.
- Please apply an API model as a supplementary judge.
If you find this work useful, please use the following BibTeX. Thank you for your support!
@misc{wang2025monetreasoninglatentvisual,
title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
year={2025},
eprint={2511.21395},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.21395},
}

We sincerely thank the following great works, as they provide valuable data or code for our work: