Official repository for the project "Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-COF Benchmark"
[🌍 Homepage] [📖 arXiv Paper] [🤗 HF Datasets]
- [2025.11.15] 🔥 We update the MME-CoF results for the Wan 2.2 series and HunyuanVideo, alongside the previously reported results for the closed-source Veo 3 series, Sora 2 series, Kling, and Seedance. The leaderboard covering all evaluated models on the updated benchmark will be refreshed shortly.
- [2025.11.15] 🔥 We expand MME-CoF to support a more comprehensive and reliable evaluation. Please access the updated benchmark on [🤗 HF Datasets].
- [2025.11.04] 🔥 We release the evaluation code.
- [2025.11.03] 🔥 We publish MME-CoF benchmark data at [🤗 Huggingface Dataset].
- [2025.11.01] 🚀 We release the arXiv paper.
Overview of Our Study on the Reasoning Potential of Video Models.
We investigate a key question: Are current video models reliable zero-shot reasoners? While modern video models can “see the world” and show a promising ability to perceive, understand, and manipulate complex visual scenes, their actual reliability in visual reasoning remains unverified.
We conduct a comprehensive Chain-of-Frame (CoF) evaluation of the leading model Veo-3 across 12 core dimensions and introduce MME-CoF, a compact and standardized benchmark for systematic CoF reasoning assessment. Our findings show that current video models are not yet dependable standalone zero-shot reasoners, but they demonstrate strong potential as powerful visual perception and scene-understanding modules to complement dedicated reasoning systems.
We provide the first investigation of the visual reasoning potential of video models (Veo-3), detailing representative successes, characteristic errors, and the conditions under which CoF reasoning emerges, holds, or breaks.
```
git lfs install
git clone https://huggingface.co/datasets/ZiyuG/MME-CoF
```

By default, each image is padded to 16:9, and the video model generates six videos per image. We evaluate using Gemini-2.5-Pro.
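For orientation, below is a minimal sketch of the 16:9 padding step using Pillow. The repository performs this preprocessing itself; the function name, fill color, and example file names here are illustrative assumptions, not the project's actual code.

```python
from PIL import Image

def pad_to_16_9(image_path: str, fill=(0, 0, 0)) -> Image.Image:
    """Pad an image onto a 16:9 canvas without resizing its content (illustrative sketch)."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    if w / h >= 16 / 9:
        # Image is wider than 16:9 -> extend the height.
        canvas_w, canvas_h = w, round(w * 9 / 16)
    else:
        # Image is taller than 16:9 -> extend the width.
        canvas_w, canvas_h = round(h * 16 / 9), h
    canvas = Image.new("RGB", (canvas_w, canvas_h), fill)
    canvas.paste(img, ((canvas_w - w) // 2, (canvas_h - h) // 2))
    return canvas

# Hypothetical usage:
# pad_to_16_9("question_001.png").save("question_001_16x9.png")
```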
- Place `evaluate.py` and `genai_client.py` under the dataset folder
- Edit line 24 in `genai_client.py` to add your Google AI API key
- Run:

```
python evaluate.py
```
Results will be saved to `mme-cof_eval_results.json`.
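The judging calls themselves are handled by `genai_client.py`. As a rough sketch of what such a call looks like with the google-genai SDK, the snippet below sends generated frames plus a question to Gemini-2.5-Pro; the model identifier, prompt wording, and frame handling are assumptions for illustration, not the repository's actual implementation.

```python
from google import genai
from google.genai import types

# In the repo, the key is set inside genai_client.py (line 24).
client = genai.Client(api_key="YOUR_GOOGLE_AI_API_KEY")

def judge_frames(frame_paths, question: str) -> str:
    """Ask the Gemini judge to assess CoF reasoning quality for a set of frames (sketch)."""
    parts = [
        types.Part.from_bytes(data=open(p, "rb").read(), mime_type="image/png")
        for p in frame_paths
    ]
    # Hypothetical judging prompt; the actual rubric lives in evaluate.py.
    prompt = f"Question: {question}\nAssess how well these frames reason toward the answer."
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=parts + [prompt],
    )
    return response.text
```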
We curate MME-CoF, a compact benchmark providing a standardized taxonomy and an evaluation protocol aligned with CoF reasoning, enabling consistent and category-wise assessment beyond surface-level visual fidelity.
Evaluation Radar Map and Word Cloud of MME-CoF.
If you find this work useful, please cite:
```bibtex
@article{guo2025video,
  title={Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark},
  author={Guo, Ziyu and Chen, Xinyan and Zhang, Renrui and An, Ruichuan and Qi, Yu and Jiang, Dongzhi and Li, Xiangtai and Zhang, Manyuan and Li, Hongsheng and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2510.26802},
  year={2025}
}
```