 VSTAT: Benchmarking Visual State Tracking in Multimodal Video Understanding

[Figure: VSTAT teaser]

Sihyun Yu1,2†*, Nanye Ma1†*, Pinzhi Huang1†*, Hyunseok Lee2*, Shusheng Yang1, June Suk Choi2, Ellis Brown1,
Oscar Michel1, Boyang Zheng1, Jinwoo Shin2, Saining Xie1

1New York University   2KAIST

† Project lead.   * Equal technical contribution.


Overview

VSTAT (Visual STAte Tracking) is a benchmark for evaluating the ability of Multimodal Large Language Models (MLLMs) to track fine-grained visual state changes in long-form videos. Unlike benchmarks that test static scene understanding or simple event recognition, VSTAT requires models to maintain a running mental model of object states, count changes, and temporal order across extended video sequences sourced from YouTube.

Release

  • 2026-05 🚀 We release the VSTAT benchmark and evaluation code.

Results

We benchmark video-supporting MLLMs from diverse model families in zero-shot settings with greedy decoding. Each question is labeled by state element (Count, Location, Attribute) and state structure (Atomic, Sequence, Set, Dict). MCQ tasks are scored by accuracy; numerical tasks by Mean Relative Accuracy (MRA). The reported average is computed over all questions.
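The text names Mean Relative Accuracy but does not define it. The sketch below assumes the definition popularized by VSI-Bench: a prediction is checked against ten confidence thresholds (0.50 to 0.95 in steps of 0.05) and counts as correct at a threshold θ when its relative error is below 1 − θ. The threshold set and the function name are assumptions; VSTAT's own scorer may differ.

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Fraction of thresholds at which the relative error is acceptable.

    ASSUMPTION: thresholds 0.50, 0.55, ..., 0.95 (VSI-Bench style).
    VSTAT's actual evaluation settings are not stated in this README.
    """
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    rel_err = abs(pred - target) / abs(target)
    # Work in integer percent so each tolerance is computed by one division.
    tolerances = [(100 - pct) / 100 for pct in range(50, 100, 5)]
    hits = sum(1 for tol in tolerances if rel_err < tol)
    return hits / len(tolerances)
```

Under this definition an exact answer scores 1.0, and the score falls in steps of 0.1 as the relative error grows, so near misses on counting questions still earn partial credit.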

[Figure: VSTAT leaderboard]

Even the strongest proprietary model (Gemini-3.1 Pro) reaches an average of only 44.4, far below human performance (90.5), which shows how difficult visual state tracking remains for current MLLMs. See our project website for the full, interactive leaderboard.

Installation

conda create --name vstat python=3.10
conda activate vstat

git clone https://github.com/vision-x-nyu/vstat.git
cd vstat

git submodule update --init --recursive
pip install -e ".[video]"

Benchmark

VSTAT is hosted on HuggingFace: nyu-visionx/VSTAT.

Download it into the local data/ folder:

mkdir -p data
huggingface-cli download nyu-visionx/VSTAT \
  --repo-type=dataset \
  --local-dir data/vstat

Then run the video download and redaction scripts bundled with the benchmark:

cd data/vstat
python scripts/download_youtube.py --resolution-map youtube_resolutions.json
bash scripts/redact.sh
cd ../..

Ensure every video referenced in vstat_qa_clean.json exists under data/vstat/ before evaluation. Missing files can cause silent multi-rank hangs during distributed runs.
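A pre-flight check along these lines can catch missing files before a distributed run starts. This is only a sketch: the README does not describe the schema of vstat_qa_clean.json, so the code assumes the file is a JSON list of objects and that a hypothetical "video" field holds a path relative to the video root. Adjust the key to match the real file.

```python
import json
from pathlib import Path


def find_missing_videos(qa_path: str, video_root: str, video_key: str = "video") -> list[str]:
    """Return the referenced video paths that do not exist under video_root.

    ASSUMPTION: the QA file is a JSON list of objects, and `video_key`
    (hypothetical name) holds a path relative to video_root.
    """
    with open(qa_path, encoding="utf-8") as f:
        records = json.load(f)
    root = Path(video_root)
    # Deduplicate first: many questions may point at the same video.
    referenced = {record[video_key] for record in records}
    return sorted(rel for rel in referenced if not (root / rel).is_file())
```

Calling find_missing_videos("data/vstat/vstat_qa_clean.json", "data/vstat") from the repository root and aborting when the returned list is non-empty turns a potential hang into an immediate, readable error.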

Custom paths: If your data lives elsewhere, set:

export VSTAT_QA_PATH=/path/to/vstat_qa_clean.json
export VSTAT_VIDEO_ROOT=/path/to/vstat

Evaluation

Task name: vstat

Open-weight model (single process):

python -m lmms_eval \
  --include_path "$(pwd)/lmms_eval/tasks" \
  --model qwen3_vl \
  --model_args "pretrained=Qwen/Qwen3-VL-8B-Instruct,min_pixels=784,max_pixels=50176,max_num_frames=128" \
  --tasks vstat \
  --batch_size 1 \
  --output_path ./results/vstat \
  --log_samples

Open-weight model (multi-GPU with Accelerate):

python -m accelerate.commands.launch \
  --num_processes 8 \
  -m lmms_eval \
  --include_path "$(pwd)/lmms_eval/tasks" \
  --model qwen3_vl \
  --model_args "pretrained=Qwen/Qwen3-VL-8B-Instruct,min_pixels=784,max_pixels=50176,max_num_frames=128" \
  --tasks vstat \
  --batch_size 1 \
  --output_path ./results/vstat \
  --log_samples

API model (Gemini):

export GOOGLE_API_KEY=your_key

python -m lmms_eval \
  --include_path "$(pwd)/lmms_eval/tasks" \
  --model gemini_api \
  --model_args "model_version=gemini-3.1-pro-preview,timeout=240,num_concurrent=64" \
  --tasks vstat \
  --batch_size 1 \
  --output_path ./results/vstat \
  --log_samples

Swap --model and --model_args for other supported backends (e.g. internvl3, cambrians, llava_onevision, mimo_vl). See lmms_eval/models/simple/ for available model wrappers.
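When sweeping several backends, it can help to assemble the command programmatically. The sketch below rebuilds the single-process invocation shown above using only the flags that appear in this README; the helper name build_eval_command is hypothetical, and the model_args keys are passed through unchanged, so they must be ones the chosen wrapper accepts.

```python
import os
import sys


def build_eval_command(model, model_args, output_path="./results/vstat"):
    """Assemble the single-process lmms_eval invocation as an argv list."""
    # --model_args is a single comma-separated string of key=value pairs.
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        sys.executable, "-m", "lmms_eval",
        "--include_path", os.path.join(os.getcwd(), "lmms_eval", "tasks"),
        "--model", model,
        "--model_args", joined_args,
        "--tasks", "vstat",
        "--batch_size", "1",
        "--output_path", output_path,
        "--log_samples",
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True) from the repository root, which avoids shell quoting issues in the model_args string.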

Variable            Role                             Default
VSTAT_QA_PATH       Path to QA JSON                  data/vstat/vstat_qa_clean.json
VSTAT_VIDEO_ROOT    Root directory for video files   data/vstat/
HF_HOME             HuggingFace cache                ~/.cache/huggingface
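The override behavior implied by the variables and defaults above can be sketched as follows. How the task code actually reads these variables is not shown in this README, so resolve_vstat_paths is a hypothetical helper that only illustrates the "environment variable if set, otherwise default" pattern for the two VSTAT-specific paths.

```python
import os

# Defaults copied from the variable list above.
VSTAT_DEFAULTS = {
    "VSTAT_QA_PATH": "data/vstat/vstat_qa_clean.json",
    "VSTAT_VIDEO_ROOT": "data/vstat/",
}


def resolve_vstat_paths(env=None):
    """Return the effective QA path and video root.

    ASSUMPTION: a set environment variable simply replaces the default;
    the real task code may resolve paths differently.
    """
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in VSTAT_DEFAULTS.items()}
```

Printing the resolved values before launching a run is a quick way to confirm that the evaluation will read the intended copy of the data.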

License

This project is licensed under the Apache License 2.0.

Acknowledgement

Our evaluation framework is built upon lmms-eval. We thank the LMMs-Lab team for providing this excellent toolkit for evaluating multimodal large language models.

Citation

If you find our benchmark and code useful, please consider citing our work:

@article{vstat2026,
    title={Benchmarking Visual State Tracking in Multimodal Video Understanding},
    author={Sihyun Yu and Nanye Ma and Pinzhi Huang and Hyunseok Lee and Shusheng Yang and June Suk Choi and Ellis Brown and Oscar Michel and Boyang Zheng and Jinwoo Shin and Saining Xie},
    year={2026},
    journal={arXiv preprint arXiv:2606.03920},
}
