- [October 30, 2025]: Our paper is now available on arXiv and HF Paper.
- [October 28, 2025]: Our codebase and model are released. You can now use Video-Thinker-7B on Hugging Face.
Video-Thinker is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts. To spark this capability, we construct Video-Thinker-10K, a curated dataset with structured reasoning traces synthesized through hindsight-curation reasoning, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers. Furthermore, we propose a two-stage training strategy combining SFT for format learning and GRPO with pure outcome reward for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.
Video-Thinker-7B achieves state-of-the-art performance among 7B-sized MLLMs across multiple challenging video reasoning benchmarks. Our model demonstrates exceptional capabilities on both in-domain and out-of-domain tasks:
- Out-of-Domain Benchmarks:
  - Video-Holmes: 43.22% (↑4.68% over best baseline)
  - CG-Bench-Reasoning: 33.25% (↑3.81% over best baseline)
  - VRBench: 80.69% (↑11.44% over best baseline)
- In-Domain Benchmarks:
  - ActivityNet: 78.72% | STAR: 70.66% | ScaleLong: 49.53%
  - YouCook2: 73.66% | LVBench: 37.04%
Our approach enables MLLMs to "Think with Videos" by autonomously leveraging intrinsic grounding and captioning capabilities, achieving superior reasoning performance with only 10K training samples.
We construct Video-Thinker-10K through a systematic pipeline that transforms diverse video data into structured reasoning samples:
- Data Sources: We curate from 6 datasets spanning multiple domains:
  - Caption-labeled (ActivityNet, TutorialVQA, YouCook2): rich temporal annotations, but lacking complex reasoning questions
  - QA-labeled (STAR, ScaleLong, LVBench): challenging QA pairs, but lacking granular visual descriptions
- Complementary Generation:
  - For caption-labeled data → generate complex multi-segment reasoning questions
  - For QA-labeled data → generate answer-conditioned visual descriptions for key segments
- Hindsight-Curation Reasoning: We employ a novel quality assurance process where generated `<time>` and `<caption>` contents are validated by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision.
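The following is a minimal sketch of that validation loop, assuming hypothetical helpers `generate_trace` (which proposes `<time>`/`<caption>` content) and `answer_with_trace` (which checks whether a model can recover the ground-truth answer from that content alone); neither helper name comes from the released code.

```python
# Hypothetical sketch of the hindsight-curation loop; the helper functions are
# placeholders, not part of the released Video-Thinker code.
MAX_ATTEMPTS = 3  # up to 3 regeneration attempts per sample

def curate_sample(video, question, ground_truth, generate_trace, answer_with_trace):
    """Keep a generated <time>/<caption> trace only if it lets a model reach the right answer."""
    for _ in range(MAX_ATTEMPTS):
        trace = generate_trace(video, question)         # candidate <time> spans + <caption> text
        predicted = answer_with_trace(question, trace)  # answer derived from the trace alone
        if predicted == ground_truth:
            return {"question": question, "trace": trace, "answer": ground_truth}
    return None  # discard samples whose traces never support the correct answer
```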
We adopt a two-stage training approach to progressively build video reasoning capabilities:
Stage 1: SFT for Format-Following
- Initializes the model to generate structured reasoning traces with `<time>`, `<caption>`, and `<think>` tags (an illustrative trace is sketched below)
- Provides an essential cold start by teaching the specialized reasoning format
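For illustration, a trace in this format might look like the example below; the timestamp notation, tag ordering, and answer style shown here are assumptions for readability, not the exact prompts or outputs of the released model.

```text
<think> The question asks what the person does right after opening the fridge, so I first locate that moment. </think>
<time> 00:12 - 00:18 </time>
<caption> The person opens the fridge, takes out a carton of milk, and pours it into a glass. </caption>
<think> Right after opening the fridge, the person pours milk, which matches option (B). </think>
Answer: (B)
```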
Stage 2: GRPO for Autonomous Navigation
- Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
- Uses outcome-based rewards (correctness + format adherence) without requiring step-wise annotations; a minimal reward sketch follows this list
- Enables the model to autonomously discover effective temporal reasoning strategies
- Demonstrates remarkable data efficiency (10K samples)
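As a rough illustration of an outcome reward of this shape, the sketch below scores correctness of the final answer plus presence of the reasoning tags; the answer parsing and weighting are assumptions, not the released GRPO reward implementation.

```python
import re

# Illustrative outcome reward: correctness + format adherence, no step-wise supervision.
# The tag set follows the paper description; weights and answer parsing are assumptions.
REQUIRED_TAGS = ("<think>", "<time>", "<caption>")

def format_reward(response: str) -> float:
    """1.0 if the response contains every required reasoning tag, else 0.0."""
    return float(all(tag in response for tag in REQUIRED_TAGS))

def correctness_reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer letter matches the ground truth, else 0.0."""
    match = re.search(r"Answer:\s*\(?([A-E])\)?", response)
    return float(bool(match) and match.group(1) == ground_truth)

def outcome_reward(response: str, ground_truth: str) -> float:
    # Correctness dominates; format adherence adds a smaller bonus (weighting is an assumption).
    return correctness_reward(response, ground_truth) + 0.5 * format_reward(response)
```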
# Create conda environment
conda create -n videothinker python=3.10
conda activate videothinker
# Install requirements
cd Video-Thinker
pip install -r requirements.txt
Training and evaluation data are available in `data/`:
- `data/train/` - Training data
- `data/eval/id/` - In-domain evaluation data
- `data/eval/ood/` - Out-of-domain evaluation data
Note: Video files will be released soon. Current data files contain video IDs and annotations.
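To peek at the annotations before the videos are released, something like the sketch below should work, assuming the annotation files are JSON; the file naming and record fields are not documented here, so treat them as placeholders to inspect rather than a fixed schema.

```python
import json
from pathlib import Path

# Illustrative only: annotation file names and record fields are assumptions.
for path in sorted(Path("data/train").glob("*.json")):
    with open(path) as f:
        records = json.load(f)
    print(f"{path.name}: {len(records)} records")
    if isinstance(records, list) and records:
        # Print the keys of the first record to discover the actual schema (video IDs, QA, traces).
        print("  fields:", sorted(records[0].keys()))
```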
We evaluate on both in-domain and out-of-domain benchmarks:
Out-of-Domain:
- Video-Holmes, CG-Bench-Reasoning, VRBench
In-Domain:
- ActivityNet, STAR, ScaleLong, YouCook2, LVBench
Video-Thinker-10K is curated from diverse video reasoning tasks:
- Caption-labeled: ActivityNet, TutorialVQA, YouCook2
- QA-labeled: STAR, ScaleLong, LVBench
We build upon Qwen2.5-VL-7B-Instruct as our foundation model, which provides strong multimodal understanding capabilities.
Configure your training parameters and run:
bash scripts/run_sft_video.sh
After SFT completes, run GRPO training:
bash scripts/run_grpo_video.sh
Our trained model Video-Thinker-7B is available on Hugging Face. You can use it directly to evaluate on your custom video reasoning tasks.
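Since Video-Thinker-7B is built on Qwen2.5-VL-7B-Instruct, inference should follow the standard Qwen2.5-VL recipe with `transformers` and `qwen_vl_utils`; the sketch below uses a placeholder Hugging Face repo id and video path, so substitute the actual model id and generation settings from this repo.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Placeholder repo id: replace with the actual Video-Thinker-7B model id on Hugging Face.
MODEL_ID = "path/or/repo-id/of/Video-Thinker-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/your_video.mp4"},
        {"type": "text", "text": "What does the person do right after opening the fridge?"},
    ],
}]

# Build the chat prompt and preprocess the video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the structured reasoning trace and final answer.
output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```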
To run batch evaluation on trained models:
bash scripts/run_eval_batch.py
- Release Paper
- Release Model Weights (Video-Thinker-7B)
- Release Training & Evaluation Data (Annotations)
- Release Code
- Release Video Files
- Provide Detailed Training Guidelines
- Provide Detailed Evaluation Guidelines
We sincerely appreciate the contributions of the open-source community.
If you find Video-Thinker useful in your research, please consider citing:
@article{wang2025video,
title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
journal={arXiv preprint arXiv:2510.23473},
year={2025}
}
This project is released under the MIT License.
For any questions or feedback, please reach out to us at shijian@seu.edu.cn.