Sparking "Thinking with Videos" via Reinforcement Learning


If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

📣 Latest News

  • [October 30, 2025]: 📄 Our paper is now available on arXiv and Hugging Face Papers.
  • [October 28, 2025]: 🚀 Our codebase and model are released. You can now use Video-Thinker-7B on the Hugging Face Hub.

💡 Overview

Video-Thinker is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts. To spark this capability, we construct Video-Thinker-10K, a curated dataset with structured reasoning traces synthesized through hindsight-curation reasoning, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers. Furthermore, we propose a two-stage training strategy combining SFT for format learning and GRPO with pure outcome reward for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.

📊 Overall Performance

Video-Thinker-7B achieves state-of-the-art performance among 7B-sized MLLMs across multiple challenging video reasoning benchmarks. Our model demonstrates exceptional capabilities on both in-domain and out-of-domain tasks:

  • Out-of-Domain Benchmarks:

    • Video-Holmes: 43.22% (↑4.68% over best baseline)
    • CG-Bench-Reasoning: 33.25% (↑3.81% over best baseline)
    • VRBench: 80.69% (↑11.44% over best baseline)
  • In-Domain Benchmarks:

    • ActivityNet: 78.72% | STAR: 70.66% | ScaleLong: 49.53%
    • YouCook2: 73.66% | LVBench: 37.04%

Our approach enables MLLMs to "Think with Videos" by autonomously leveraging intrinsic grounding and captioning capabilities, achieving superior reasoning performance with only 10K training samples.

✨ The Video-Thinker Framework

🔄 Data Synthesis Pipeline

We construct Video-Thinker-10K through a systematic pipeline that transforms diverse video data into structured reasoning samples:

  • Data Sources: We curate from 6 datasets spanning multiple domains:

    • Caption-labeled (ActivityNet, TutorialVQA, YouCook2): rich temporal annotations but no complex reasoning questions
    • QA-labeled (STAR, ScaleLong, LVBench): challenging QA pairs but no granular visual descriptions
  • Complementary Generation:

    • For caption-labeled data → Generate complex multi-segment reasoning questions
    • For QA-labeled data → Generate answer-conditioned visual descriptions for key segments
  • Hindsight-Curation Reasoning: We employ a novel quality assurance process where generated <time> and <caption> contents are validated by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision.
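
The hindsight-curation step can be pictured with a short sketch. This is not the released synthesis code: generate_trace and answer_with_trace are hypothetical callables standing in for the trace-synthesis model and the answer-verification model, and the string-matching check is only illustrative.

# Hindsight-curation sketch (illustrative, not the released pipeline): a synthesized
# <time>/<caption> trace is kept only if it lets a verifier derive the correct answer,
# with up to 3 regeneration attempts before the sample is discarded.
MAX_ATTEMPTS = 3

def curate_sample(video, question, gold_answer, generate_trace, answer_with_trace):
    """generate_trace / answer_with_trace are hypothetical callables wrapping the
    trace-synthesis model and the answer-verification model, respectively."""
    for _ in range(MAX_ATTEMPTS):
        trace = generate_trace(video, question, gold_answer)   # proposes <time> + <caption> content
        predicted = answer_with_trace(video, question, trace)  # answers using only the trace
        if predicted.strip().lower() == gold_answer.strip().lower():
            return {"video": video, "question": question,
                    "answer": gold_answer, "trace": trace}     # validated supervision sample
    return None  # traces that never recover the answer are dropped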

🎯 Training Strategy of Video-Thinker

We adopt a two-stage training approach to progressively build video reasoning capabilities:

Stage 1: SFT for Format-Following

  • Initializes the model to generate structured reasoning traces with <time>, <caption>, and <think> tags
  • Provides an essential cold start by teaching the specialized reasoning format
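
For illustration, a hypothetical target trace and a minimal structural check are sketched below. The wording, timestamps, and "Final answer" line are made up; only the <time>, <caption>, and <think> tag names come from the framework description above.

import re

# Hypothetical SFT target trace (content is illustrative; only the tag names are
# taken from this README, and the "Final answer" line is an assumed convention).
example_trace = (
    "<think>The question asks what the person does right after opening the fridge, "
    "so I should inspect that segment.</think>\n"
    "<time>00:12 - 00:18</time>\n"
    "<caption>The person takes out a carton of milk and closes the fridge door.</caption>\n"
    "<think>Taking out the milk immediately afterwards matches option B.</think>\n"
    "Final answer: B"
)

# The kind of structural constraint a format-following cold start teaches:
# each reasoning tag appears at least once and is properly closed.
def follows_format(trace: str) -> bool:
    return all(re.search(fr"<{tag}>.*?</{tag}>", trace, re.DOTALL)
               for tag in ("think", "time", "caption"))

assert follows_format(example_trace)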

Stage 2: GRPO for Autonomous Navigation

  • Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
  • Uses outcome-based rewards (correctness + format adherence) without requiring step-wise annotations
  • Enables the model to autonomously discover effective temporal reasoning strategies
  • Demonstrates remarkable data efficiency (10K samples)
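
A rough sketch of such an outcome-based reward is given below. The answer-extraction pattern, the binary scores, and the weighting between correctness and format adherence are assumptions for illustration, not the released GRPO reward implementation.

import re

# Outcome-reward sketch (assumed parsing and weights, not the released code): the reward
# depends only on the final answer and the tag structure, with no step-wise supervision
# of the intermediate <time>/<caption> content.
def format_reward(completion: str) -> float:
    """1.0 if <think>, <time>, and <caption> all appear well-formed, else 0.0."""
    ok = all(re.search(fr"<{tag}>.*?</{tag}>", completion, re.DOTALL)
             for tag in ("think", "time", "caption"))
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted option letter matches the ground truth, else 0.0.
    The 'Final answer:' pattern is an assumed output convention."""
    match = re.search(r"Final answer:\s*([A-E])", completion)
    return 1.0 if match and match.group(1) == gold_answer else 0.0

def outcome_reward(completion: str, gold_answer: str,
                   w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    # Weighted sum of correctness and format adherence; the weights are illustrative.
    return w_acc * accuracy_reward(completion, gold_answer) + w_fmt * format_reward(completion)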

🔧 Installation

# Create conda environment
conda create -n videothinker python=3.10
conda activate videothinker

# Install requirements
cd Video-Thinker
pip install -r requirements.txt

📦 Data Preparation

📂 Training and evaluation data are available in the data/ directory:

  • data/train/ - Training data
  • data/eval/id/ - In-domain Evaluation data
  • data/eval/ood/ - Out-of-domain Evaluation data

Note: Video files will be released soon. Current data files contain video IDs and annotations.

📊 Benchmark Datasets

We evaluate on both in-domain and out-of-domain benchmarks:

Out-of-Domain:

  • Video-Holmes, CG-Bench-Reasoning, VRBench

In-Domain:

  • ActivityNet, STAR, ScaleLong, YouCook2, LVBench

🎯 Training Data

Video-Thinker-10K is curated from diverse video reasoning tasks:

  • Caption-labeled: ActivityNet, TutorialVQA, YouCook2
  • QA-labeled: STAR, ScaleLong, LVBench

🎨 Base Model

We build upon Qwen2.5-VL-7B-Instruct as our foundation model, which provides strong multimodal understanding capabilities.

🚀 Training

Step 1: Supervised Fine-Tuning (SFT)

Configure your training parameters and run:

bash scripts/run_sft_video.sh

Step 2: Group Relative Policy Optimization (GRPO)

After SFT completion, run GRPO training:

bash scripts/run_grpo_video.sh

📈 Evaluation

Our trained model Video-Thinker-7B is available on Hugging Face. You can directly use it to evaluate on your custom video reasoning tasks.
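
Since Video-Thinker-7B is built on Qwen2.5-VL-7B-Instruct, a minimal inference sketch follows the standard Qwen2.5-VL pattern with Transformers and qwen-vl-utils. The repo id placeholder, frame rate, video path, and prompt are assumptions; see the Hugging Face model card for the exact settings.

# Minimal inference sketch (standard Qwen2.5-VL usage; repo id, fps, video path,
# and prompt are placeholders -- consult the Hugging Face model card).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Video-Thinker-7B"  # replace with the exact Hugging Face repo id linked above

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 1.0},
        {"type": "text", "text": "What does the person do after opening the fridge? "
                                 "Answer with the option letter."},
    ],
}]

# Build the chat prompt and pack the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])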

To run batch evaluation on trained models:

python scripts/run_eval_batch.py

📋 TODO

  • Release Paper
  • Release Model Weights (Video-Thinker-7B)
  • Release Training & Evaluation Data (Annotations)
  • Release Code
  • Release Video Files
  • Provide Detailed Training Guidelines
  • Provide Detailed Evaluation Guidelines

πŸ™ Acknowledgement

We sincerely appreciate the contributions of the open-source community.

πŸ“ Citation

If you find Video-Thinker useful in your research, please consider citing:

@article{wang2025video,
  title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
  author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
  journal={arXiv preprint arXiv:2510.23473},
  year={2025}
}

📄 License

This project is released under the MIT License.

📞 Contact

For any questions or feedback, please reach out to us at shijian@seu.edu.cn.

