A shortcut-aware benchmark for spatio-temporal and intuitive physics video understanding (VideoQA) using minimally different video pairs.
To enable reproducible evaluation, we use the lmms-eval library, which is included as a submodule. Clone this repo with the `--recurse-submodules` flag so that the required submodules are set up automatically. Alternatively, run `git submodule update --init` manually after cloning.
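For example (a sketch; replace `<repository-url>` with this repository's clone URL):

```bash
# Clone the repository together with the lmms-eval submodule
git clone --recurse-submodules <repository-url>

# If you already cloned without the flag, fetch the submodule afterwards
git submodule update --init
```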
Next, navigate to the root directory of the repository and create your conda environment:
make .env.init
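Then activate the environment before running the commands below; the environment name is defined by the Makefile, so `<env_name>` here is a placeholder:

```bash
# Activate the conda environment created by `make .env.init`
conda activate <env_name>
```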
The annotations are released at `facebook/minimal_video_pairs` on Huggingface Datasets. We provide scripts for downloading the videos in the Makefile of this repository; make sure to accept the data license requirements for each data source before attempting to download. Next, log into Huggingface (this is needed to download the Vinoground subset):
huggingface-cli login
Now you can download all videos from their original data sources:
make download_videos
This will create a `videos` folder with 9 subfolders, one per data source, which are used to create the following subsets:
| Subset | Data sources |
|---|---|
| Human object interactions | PerceptionTest, SomethingSomethingV2 |
| Robot object interactions | Language Table |
| Intuitive physics and collisions | IntPhys, InfLevel, GRASP, CLEVRER |
| Temporal reasoning | STAR, Vinoground |
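If you want to inspect the annotation files locally, you can also pull them directly from the Hub; this is an optional sketch using the Hugging Face CLI with an arbitrary local directory:

```bash
# Optional: download the annotation files for local inspection
huggingface-cli download facebook/minimal_video_pairs --repo-type dataset --local-dir ./annotations
```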
As previously mentioned, we utilize the lmms-eval library to enable reproducible evaluation.
We provide the task files needed to run `mvp` and `mvp_mini`; `mvp_mini` is a smaller, balanced evaluation set of 9k examples that enables faster evaluations.
To run the evals:
- Copy the task files from the `tasks/mvp` folder to `lmms-eval/lmms_eval/tasks/`
- Ensure the videos are downloaded into the `videos` folder in the root of this repository
- Run the evaluations with the task names `mvp` and `mvp_mini` (see the example invocation below). You can also run individual subsets.
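As a sketch, an evaluation run with lmms-eval typically looks like the following; `<model>` and `<model_args>` are placeholders for the model backend and arguments you want to evaluate (see the lmms-eval documentation for supported values):

```bash
# Sketch of an lmms-eval run on the mvp_mini task; <model> and <model_args>
# are placeholders for the model backend and its arguments.
accelerate launch -m lmms_eval \
    --model <model> \
    --model_args <model_args> \
    --tasks mvp_mini \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```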
We primarily report `paired_accuracy`. An example in `mvp` consists of two QA instances with an identical question and answer options (A or B), but the videos differ and so do the correct options (A is correct for video 1, B for video 2). For `paired_accuracy`, a model scores a point (+1) only if it answers both questions of a pair correctly. For example, a model that answers three of four individual questions correctly, but both questions of only one of the two pairs, gets a paired accuracy of 1/2 even though its per-question accuracy is 3/4.
We have set up a leaderboard on Huggingface as part of FAIR's Physical World Models release: the Physical Reasoning Leaderboard. To submit your model's results to the leaderboard, combine the `mvp_[mini]_{task}.jsonl` files in the `./logs/{model}` folder and upload them along with the specifics of your run:
cat submissions/mvp_*.jsonl > mvp_submission.jsonl
We are grateful to the creators of the many open-source datasets on which our benchmark is built: Perception Test, Something Something v2, CLEVRER, Language Table, IntPhys, InfLevel, GRASP, STAR, Vinoground.
If you find this repository useful in your research, please consider giving it a star ⭐ and a citation, and make sure to also cite the original video data sources referenced above:
@article{krojer2025shortcut,
title={A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs},
author={Benno Krojer and Mojtaba Komeili and Candace Ross and Quentin Garrido and Koustuv Sinha and Nicolas Ballas and Mahmoud Assran},
journal={arXiv},
year={2025}
}

We release this benchmark under the license found in the LICENSE file in the root directory of this source tree.