Sihyun Yu1,2†*,
Nanye Ma1†*,
Pinzhi Huang1†*,
Hyunseok Lee2*,
Shusheng Yang1,
June Suk Choi2,
Ellis Brown1,
Oscar Michel1,
Boyang Zheng1,
Jinwoo Shin2,
Saining Xie1
1New York University 2KAIST
† Project lead. * Equal technical contribution.
VSTAT (Visual STAte Tracking) is a benchmark for evaluating the ability of Multimodal Large Language Models (MLLMs) to track fine-grained visual state changes in long-form videos. Unlike benchmarks that test static scene understanding or simple event recognition, VSTAT requires models to maintain a running mental model of object states, count changes, and temporal order across extended video sequences sourced from YouTube.
2026-05: 🚀 We release the VSTAT benchmark and evaluation code.
We benchmark video-supporting MLLMs from diverse model families in zero-shot settings with greedy decoding. Each question is labeled by state element (Count, Location, Attribute) and state structure (Atomic, Sequence, Set, Dict). MCQ tasks are scored by accuracy; numerical tasks by Mean Relative Accuracy (MRA). The reported average is computed over all questions.
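The scoring rule for numerical tasks can be sketched as follows. This README does not define MRA, so the sketch assumes the definition commonly used in earlier spatial-reasoning benchmarks: a prediction earns credit at each confidence threshold θ in {0.50, 0.55, ..., 0.95} when its relative error is below 1 − θ, and the credits are averaged. The function names `mean_relative_accuracy` and `overall_average` are hypothetical and are not taken from the VSTAT code.

```python
def mean_relative_accuracy(pred, target, thresholds=None):
    """Score one numerical answer in [0, 1] under the ASSUMED MRA definition."""
    if thresholds is None:
        # Assumed threshold grid: 0.50, 0.55, ..., 0.95 (ten values).
        thresholds = [0.50 + 0.05 * i for i in range(10)]
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    rel_err = abs(pred - target) / abs(target)
    # A threshold counts as a hit when the relative error is below 1 - theta.
    hits = sum(1 for theta in thresholds if rel_err < 1.0 - theta)
    return hits / len(thresholds)


def overall_average(scores):
    """Average per-question scores (MCQ: 0 or 1, numerical: MRA) over all questions."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)
```

Under this assumed definition an exact answer scores 1.0, and an answer that is off by 100% or more scores 0.0. Pooling MCQ and numerical scores into one list mirrors the statement that the average is computed over all questions rather than per category.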
Even the strongest proprietary model (Gemini-3.1 Pro) reaches an average score of only 44.4, far below human performance (90.5), which shows how difficult visual state tracking remains for current MLLMs. See our project website for the full, interactive leaderboard.
conda create --name vstat python=3.10
conda activate vstat
git clone https://github.com/vision-x-nyu/vstat.git
cd vstat
git submodule update --init --recursive
pip install -e ".[video]"

VSTAT is hosted on HuggingFace: nyu-visionx/VSTAT.
Download it into the local data/ folder:
mkdir -p data
huggingface-cli download nyu-visionx/VSTAT \
--repo-type=dataset \
--local-dir data/vstat

Then run the video download and redaction scripts bundled with the benchmark:
cd data/vstat
python scripts/download_youtube.py --resolution-map youtube_resolutions.json
bash scripts/redact.sh
cd ../..

Ensure every video referenced in vstat_qa_clean.json exists under data/vstat/ before evaluation. Missing files can cause silent multi-rank hangs during distributed runs.
Custom paths: If your data lives elsewhere, set:
export VSTAT_QA_PATH=/path/to/vstat_qa_clean.json
export VSTAT_VIDEO_ROOT=/path/to/vstat

Task name: vstat
Open-weight model (single process):
python -m lmms_eval \
--include_path "$(pwd)/lmms_eval/tasks" \
--model qwen3_vl \
--model_args "pretrained=Qwen/Qwen3-VL-8B-Instruct,min_pixels=784,max_pixels=50176,max_num_frames=128" \
--tasks vstat \
--batch_size 1 \
--output_path ./results/vstat \
--log_samples

Open-weight model (multi-GPU with Accelerate):
python -m accelerate.commands.launch \
--num_processes 8 \
-m lmms_eval \
--include_path "$(pwd)/lmms_eval/tasks" \
--model qwen3_vl \
--model_args "pretrained=Qwen/Qwen3-VL-8B-Instruct,min_pixels=784,max_pixels=50176,max_num_frames=128" \
--tasks vstat \
--batch_size 1 \
--output_path ./results/vstat \
--log_samples

API model (Gemini):
export GOOGLE_API_KEY=your_key
python -m lmms_eval \
--include_path "$(pwd)/lmms_eval/tasks" \
--model gemini_api \
--model_args "model_version=gemini-3.1-pro-preview,timeout=240,num_concurrent=64" \
--tasks vstat \
--batch_size 1 \
--output_path ./results/vstat \
--log_samples

Swap --model and --model_args for other supported backends (e.g. internvl3, cambrians, llava_onevision, mimo_vl). See lmms_eval/models/simple/ for available model wrappers.
| Variable | Role | Default |
|---|---|---|
| VSTAT_QA_PATH | Path to QA JSON | data/vstat/vstat_qa_clean.json |
| VSTAT_VIDEO_ROOT | Root directory for video files | data/vstat/ |
| HF_HOME | HuggingFace cache | ~/.cache/huggingface |
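As an illustration of how these variables and defaults fit together, the sketch below resolves the two VSTAT paths from the environment and falls back to the documented defaults. It is not the task's actual loading code; the helper `resolve_vstat_paths` is hypothetical, and only the variable names and default values come from the table.

```python
import os
from pathlib import Path

# Defaults as documented in the table above.
DEFAULT_QA_PATH = "data/vstat/vstat_qa_clean.json"
DEFAULT_VIDEO_ROOT = "data/vstat/"


def resolve_vstat_paths(env=None):
    """Return (qa_path, video_root), preferring environment overrides.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    qa_path = Path(env.get("VSTAT_QA_PATH", DEFAULT_QA_PATH))
    video_root = Path(env.get("VSTAT_VIDEO_ROOT", DEFAULT_VIDEO_ROOT))
    return qa_path, video_root
```

Because the defaults are relative paths, they resolve against the current working directory, which is why the evaluation commands are run from the repository root.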
This project is licensed under the Apache License 2.0.
Our evaluation framework is built upon lmms-eval. We thank the LMMs-Lab team for providing this excellent toolkit for evaluating multimodal large language models.
If you find our benchmark and code useful, please consider citing our work:
@article{vstat2026,
title={Benchmarking Visual State Tracking in Multimodal Video Understanding},
author={Sihyun Yu and Nanye Ma and Pinzhi Huang and Hyunseok Lee and Shusheng Yang and June Suk Choi and Ellis Brown and Oscar Michel and Boyang Zheng and Jinwoo Shin and Saining Xie},
year={2026},
journal={arXiv preprint arXiv:2606.03920},
}