Official implementation of Laser (Latent Superposition for Efficient Visual Reasoning). Laser enables vision-language models to perform implicit reasoning in continuous latent space, prioritizing global understanding ("Forest") before detailed processing ("Trees").
Note: Training data and code are now available!
- [2026/04] Laser-7B checkpoint released on Hugging Face: wybb/laser-7b
- [2026/04] Training data (ScanPath) released!
- [2026/01] Code release for Laser.
git clone https://github.com/ybb6/laser.git
cd laser
pip install -r requirements.txt
# Optional: Flash Attention 2
pip install flash-attn --no-build-isolation

Requirements:
- Python >= 3.10
- PyTorch >= 2.1.0
- CUDA >= 11.8
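
The version floors above can be checked before a long training run is launched. The following is a minimal sketch and not part of the repository: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check assumes PyTorch is importable as `torch`.

```python
import re
import sys


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, required):
    """Compare two dotted version strings component by component."""
    a, b = parse_version(found), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '11.8' and '11.8.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    if not meets_minimum(python_version, "3.10"):
        problems.append(f"Python {python_version} is older than 3.10")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if not meets_minimum(torch.__version__, "2.1.0"):
        problems.append(f"PyTorch {torch.__version__} is older than 2.1.0")
    # torch.version.cuda is None for CPU-only builds.
    cuda = torch.version.cuda
    if cuda is None:
        problems.append("PyTorch was built without CUDA support")
    elif not meets_minimum(cuda, "11.8"):
        problems.append(f"CUDA {cuda} is older than 11.8")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit installed on the machine.
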
To start training with the default configuration:
bash scripts/finetune_laser_dwal_7b.sh

To run parallel evaluation across supported benchmarks:
bash evaluation/run_evaluation_dwal_parallel.sh

Training data should follow the LLaVA-style JSON format, extended with Laser-specific tokens:
[
  {
    "id": "sample_001",
    "image": ["path/to/image.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat is shown in this image?"
      },
      {
        "from": "gpt",
        "value": "<|laser_start|><laser><laser>...<laser><|laser_end|><answer>A cat sitting on a couch.</answer>"
      }
    ]
  }
]

- `<|laser_start|>` / `<|laser_end|>`: Delimiters for the latent reasoning region.
- `<laser>`: Placeholder token for each latent reasoning step (replaced dynamically during training).
- `<answer>`: Wraps the final textual output.
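
The token layout described above can be sanity-checked before training. The sketch below is illustrative only and not part of the repository: `validate_sample` and `extract_answer` are hypothetical helpers, and the rule that each image path is paired with exactly one `<image>` placeholder is an assumption based on the usual LLaVA convention. The `...` in the sample above stands for further `<laser>` tokens, so a real sample is expected to contain only `<laser>` placeholders between the delimiters.

```python
import re

LASER_START = "<|laser_start|>"
LASER_END = "<|laser_end|>"
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def validate_sample(sample):
    """Return a list of problems found in one training sample (empty if none)."""
    problems = []
    for key in ("id", "image", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    turns = sample["conversations"]
    human_text = "".join(t["value"] for t in turns if t["from"] == "human")
    # Assumption: one <image> placeholder per entry in the "image" list.
    if human_text.count("<image>") != len(sample["image"]):
        problems.append("number of <image> placeholders != number of image paths")

    for turn in turns:
        if turn["from"] != "gpt":
            continue
        value = turn["value"]
        start, end = value.find(LASER_START), value.find(LASER_END)
        if start == -1 or end == -1 or end < start:
            problems.append("latent region delimiters missing or out of order")
        else:
            latent = value[start + len(LASER_START):end]
            # The region should contain only <laser> placeholders.
            if latent.replace("<laser>", "").strip():
                problems.append("unexpected text inside the latent region")
        if ANSWER_RE.search(value) is None:
            problems.append("missing <answer>...</answer> block")
    return problems


def extract_answer(text):
    """Return the text wrapped in <answer> tags, or None if absent."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None
```

Running `validate_sample` over every entry of the training JSON and printing the non-empty results gives a quick report of malformed samples; `extract_answer` applies the same `<answer>` convention to generated text.
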
For efficient dynamic batching during training, precompute the token lengths:
python scripts/precompute_lengths.py \
    --data_path data/training_data.json \
    --output_path data/sample_lengths.json \
    --model_id Qwen/Qwen2.5-VL-7B-Instruct

We support a comprehensive suite of visual reasoning benchmarks:
| Benchmark | Description |
|---|---|
| BLINK | Visual reasoning (14 subtasks) |
| MMVP | Multimodal visual perception |
| MMStar | Multimodal reasoning |
| SEED-Bench-2-Plus | Text-rich understanding |
| HallusionBench | Hallucination detection |
| HR-Bench | High-resolution understanding |
The released checkpoint is now available on Hugging Face.
| Model | Base Model | Status | Download |
|---|---|---|---|
| Laser-7B | Qwen2.5-VL-7B-Instruct | Released | wybb/laser-7b |
If you find our work useful, please consider citing:
@article{laser2026forest,
title={Forest Before Trees: Latent Superposition for Efficient Visual Reasoning},
author={Wang, Yubo and Zhang, Juntian and Wu, Yichen and Lin, Yankai and Lukas, Nils and Liu, Yuhan},
journal={arXiv preprint arXiv:2601.06803},
year={2026}
}

This project is licensed under the Apache-2.0 License.
We thank the authors of the following projects for their open-source contributions: