Shijie Zhou*1, Alexander Vilesov*1, Xuehai He*2,3, Ziyu Wan2, Shuwang Zhang1, Aditya Nagachandra1, Di Chang4, Dongdong Chen2, Xin Eric Wang3, Achuta Kadambi1
1UCLA, 2Microsoft, 3UCSC, 4USC
- Release dataset
- Release evaluation code
The dataset can be downloaded from Hugging Face. Each entry in the dataset contains the following fields:
- id: Unique identifier for each evaluation question
- video: Hugging Face URL of the video
- question_type: We use the objective question type "multiple-choice"
- question: The question
- choices: 4 choices for the multiple-choice question
- answer: Ground-truth answer to the question
[
{
"id": "validation_160",
"video": "https://huggingface.co/datasets/shijiezhou/VLM4D/resolve/main/videos_real/davis/city-ride.mp4",
"question_type": "multiple-choice",
"question": "From the camera perspective, which direction are the cyclists moving toward?",
"choices": {
"A": "not moving",
"B": "left",
"C": "right",
"D": "backwards"
},
"answer": "left"
}
]
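For quick inspection of the entries, a minimal loading sketch is shown below. It is not part of this repo: it assumes the benchmark can be loaded directly with the Hugging Face datasets library and that the split is named "validation" (inferred from ids like "validation_160"); adjust if the repo is laid out differently.

```python
# Minimal sketch, not part of this repo: load the benchmark and print one entry.
# Assumptions: the dataset loads via datasets.load_dataset and the split name is
# "validation" (inferred from ids like "validation_160"); adjust if needed.
from datasets import load_dataset

ds = load_dataset("shijiezhou/VLM4D", split="validation")

entry = ds[0]
print(entry["id"], entry["video"])
print(entry["question"])
for letter, choice in entry["choices"].items():
    print(f"  {letter}: {choice}")
print("ground truth:", entry["answer"])
```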
Install the required packages:

pip install -r requirements/requirements.txt

Run the scripts under model_inference_scripts, for example:
bash model_inference_scripts/run_vllm_video_models.sh

The model outputs are saved in the outputs/{data_type}_{prompt} directory, where:
- {data_type}:
  - real_mc: multiple-choice answers on real video data
  - synthetic_mc: multiple-choice answers on synthetic video data
- {prompt}:
  - cot: chain-of-thought reasoning
  - direct-output: direct answers without intermediate reasoning steps
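With the two settings above, the inference scripts produce four output directories in total. The short sketch below (not part of the repo) simply enumerates them, assuming the names listed above are used verbatim.

```python
# Enumerate the four expected output directories of the form outputs/{data_type}_{prompt}.
from itertools import product

data_types = ["real_mc", "synthetic_mc"]   # real vs. synthetic video questions
prompts = ["cot", "direct-output"]         # chain-of-thought vs. direct answers

for data_type, prompt in product(data_types, prompts):
    print(f"outputs/{data_type}_{prompt}")
```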
To evaluate the generated responses, run the following command:
python acc_evaluation.py --output_dir outputs/real_mc_cot

The evaluation results are saved in the outputs/processed_outputs/ directory.
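To evaluate all four output directories in one go, a loop along the following lines should work. This is a sketch rather than part of the repo; only the --output_dir flag is taken from the command above.

```python
# Sketch: run acc_evaluation.py on every {data_type}_{prompt} output directory.
# Only the --output_dir flag comes from the README; everything else is assumed.
import subprocess

for data_type in ["real_mc", "synthetic_mc"]:
    for prompt in ["cot", "direct-output"]:
        out_dir = f"outputs/{data_type}_{prompt}"
        subprocess.run(
            ["python", "acc_evaluation.py", "--output_dir", out_dir],
            check=True,  # stop early if any evaluation run fails
        )
```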
As illustrated in our paper, the LLM-as-judge may occasionally make mistakes. To address this, we also provide manually verified evaluation results, obtained by cross-checking the outputs of two LLM judges (OpenAI o3 and o4-mini), which can be found in processed_outputs_paper_results.
Finally, run the following command to generate the statistics of the evaluation results:
python acc_final_statistics.py

The input and output folders can be set inside the script. You can also reproduce the numbers reported in Table 1 of our paper by changing the paths to the following:
real_data_folder = "processed_outputs_paper_results/real_mc_cot"
synthetic_data_folder = "processed_outputs_paper_results/synthetic_mc_cot"
output_csv = "csv_final_results/final_accuracy_table_cot.csv"
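To inspect the aggregated numbers afterwards, something like the snippet below is enough; it only assumes the output_csv path set above and makes no assumptions about the column layout.

```python
# Sketch: print the final accuracy table written by acc_final_statistics.py.
import pandas as pd

df = pd.read_csv("csv_final_results/final_accuracy_table_cot.csv")
print(df.to_string(index=False))
```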
Please refer to LICENSE. All videos in the VLM4D benchmark are obtained from public research video datasets (DAVIS, YouTube-VOS, Ego4D) and are not the property of our institutions; the copyright remains with the original owners of the videos. This repo is built on the evaluation framework of MMVU; many thanks to the authors for open-sourcing their codebase.
@inproceedings{zhou2025vlm4d,
title={VLM4D: Towards Spatiotemporal Awareness in Vision Language Models},
author={Zhou, Shijie and Vilesov, Alexander and He, Xuehai and Wan, Ziyu and Zhang, Shuwang and Nagachandra, Aditya and Chang, Di and Chen, Dongdong and Wang, Xin Eric and Kadambi, Achuta},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={8600--8612},
year={2025}
}