Shijie Zhou*1, Alexander Vilesov*1, Xuehai He*2,3, Ziyu Wan2, Shuwang Zhang1, Aditya Nagachandra1, Di Chang4, Dongdong Chen2, Xin Eric Wang3, Achuta Kadambi1
1UCLA, 2Microsoft, 3UCSC, 4USC
- Release dataset
- Release evaluation code
The dataset can be downloaded from Hugging Face. Each entry in the dataset contains the following fields:
- id: Unique identifier for each evaluation question
- video: Hugging Face URL of the video
- question_type: We use the objective question type "multiple-choice"
- question: The question
- choices: 4 choices for the multiple-choice question
- answer: Ground-truth answer to the question
[
{
"id": "validation_160",
"video": "https://huggingface.co/datasets/shijiezhou/VLM4D/resolve/main/videos_real/davis/city-ride.mp4",
"question_type": "multiple-choice",
"question": "From the camera perspective, which direction are the cyclists moving toward?",
"choices": {
"A": "not moving",
"B": "left",
"C": "right",
"D": "backwards"
},
"answer": "left"
}
]
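For quick inspection of the entries, a minimal loading sketch is shown below. It is not part of this repo: it assumes the benchmark can be loaded directly with the Hugging Face datasets library and that the split is named "validation" (inferred from ids like "validation_160"); adjust if the repo is laid out differently.

```python
# Minimal sketch, not part of this repo: load the benchmark and print one entry.
# Assumptions: the dataset loads via datasets.load_dataset and the split name is
# "validation" (inferred from ids like "validation_160"); adjust if needed.
from datasets import load_dataset

ds = load_dataset("shijiezhou/VLM4D", split="validation")

entry = ds[0]
print(entry["id"], entry["video"])
print(entry["question"])
for letter, choice in entry["choices"].items():
    print(f"  {letter}: {choice}")
print("ground truth:", entry["answer"])
```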
Install the required packages:

pip install -r requirements/requirements.txt

Run the scripts under model_inference_scripts, for example:
bash model_inference_scripts/run_vllm_video_models.sh

The model outputs are saved in the outputs/{data_type}_{prompt} directory, where:
- {data_type}:
  - real_mc: multiple-choice answers on real video data
  - synthetic_mc: multiple-choice answers on synthetic video data
- {prompt}:
  - cot: chain-of-thought reasoning
  - direct-output: direct answers without intermediate reasoning steps
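With the two settings above, the inference scripts produce four output directories in total. The short sketch below (not part of the repo) simply enumerates them, assuming the names listed above are used verbatim.

```python
# Enumerate the four expected output directories of the form outputs/{data_type}_{prompt}.
from itertools import product

data_types = ["real_mc", "synthetic_mc"]   # real vs. synthetic video questions
prompts = ["cot", "direct-output"]         # chain-of-thought vs. direct answers

for data_type, prompt in product(data_types, prompts):
    print(f"outputs/{data_type}_{prompt}")
```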
To evaluate the generated responses, run the following command:
python acc_evaluation.py --output_dir outputs/real_mc_cot

The evaluation results are saved in the outputs/processed_outputs/ directory.
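To evaluate all four output directories in one go, a loop along the following lines should work. This is a sketch rather than part of the repo; only the --output_dir flag is taken from the command above.

```python
# Sketch: run acc_evaluation.py on every {data_type}_{prompt} output directory.
# Only the --output_dir flag comes from the README; everything else is assumed.
import subprocess

for data_type in ["real_mc", "synthetic_mc"]:
    for prompt in ["cot", "direct-output"]:
        out_dir = f"outputs/{data_type}_{prompt}"
        subprocess.run(
            ["python", "acc_evaluation.py", "--output_dir", out_dir],
            check=True,  # stop early if any evaluation run fails
        )
```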
As illustrated in our paper, the LLM-as-judge may occasionally make mistakes. To address this, we also provide manually verified evaluation results, obtained by cross-checking the outputs of two LLM judges (OpenAI o3 and o4-mini), which can be found in processed_outputs_paper_results.
Finally, run the following command to generate the statistics of the evaluation results:
python acc_final_statistics.py

The input and output folders can be set inside the script. You can also reproduce the numbers reported in Table 1 of our paper by changing the paths to the following:
real_data_folder = "processed_outputs_paper_results/real_mc_cot"
synthetic_data_folder = "processed_outputs_paper_results/synthetic_mc_cot"
output_csv = "csv_final_results/final_accuracy_table_cot.csv"
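To inspect the aggregated numbers afterwards, something like the snippet below is enough; it only assumes the output_csv path set above and makes no assumptions about the column layout.

```python
# Sketch: print the final accuracy table written by acc_final_statistics.py.
import pandas as pd

df = pd.read_csv("csv_final_results/final_accuracy_table_cot.csv")
print(df.to_string(index=False))
```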
Please refer to LICENSE. All videos in the VLM4D benchmark are obtained from public research video datasets (DAVIS, YouTube-VOS, Ego4D) and are not the property of our institutions; the copyright remains with the original owners of the videos. This repo is built on the evaluation framework of MMVU; many thanks to the authors for open-sourcing their codebase.
@inproceedings{zhou2025vlm4d,
title={VLM4D: Towards Spatiotemporal Awareness in Vision Language Models},
author={Zhou, Shijie and Vilesov, Alexander and He, Xuehai and Wan, Ziyu and Zhang, Shuwang and Nagachandra, Aditya and Chang, Di and Chen, Dongdong and Wang, Xin Eric and Kadambi, Achuta},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={8600--8612},
year={2025}
}