Project Page | Paper | Video | Data
Official PyTorch implementation for the NeurIPS 2025 paper: "Tracking and Understanding Object Transformations".
The code is tested with `python=3.10`, `torch==2.7.0+cu126`, and `torchvision==0.22.0+cu126` on an RTX A6000 GPU.
```bash
# Clone and setup environment
git clone --recurse-submodules https://github.com/YihongSun/TubeletGraph/
cd TubeletGraph/
conda create -n tubeletgraph python=3.10 -y
conda activate tubeletgraph
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
bash thirdparty/setup_ckpts.sh
# Install SAM2 with multi-mask predictions
cd thirdparty/sam2
pip install -e .
pip install -e ".[notebooks]"
python setup.py build_ext --inplace
cd ../..
# Install CropFormer
cd thirdparty
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2 --no-build-isolation
ln -s "$(pwd)"/Entity/Entityv2/CropFormer detectron2/projects/CropFormer
cd detectron2/projects/CropFormer/mask2former/modeling/pixel_decoder/ops
bash make.sh
cd ../../../../../../../..
# conda install -c conda-forge libstdcxx-ng # if libstdc++ version mismatch occurs
# Install FC-CLIP
cd thirdparty/fc-clip
pip install -r requirements.txt
cd ../..
```
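To confirm the environment matches the tested configuration above, a quick optional check (assuming the CUDA 12.6 wheels were installed as shown; version strings will differ for other CUDA toolkits):

```python
import torch, torchvision

# Expected (per the tested configuration above): 2.7.0+cu126 and 0.22.0+cu126.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```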
In addition, please configure your OpenAI API key (required for GPT-4.1 querying) as follows (add to `~/.bashrc` to persist across sessions).

```bash
export OPENAI_API_KEY="sk-..."
```

You can quickly run TubeletGraph on your own videos using `quick_run.py`:
```bash
python quick_run.py \
    --input_dir <VIDEO_FRAME_DIR> \
    --input_mask <FIRST_FRAME_MASK.png> \
    [--fps 30]
```

It generates video visualizations (`.mp4`) and state graph diagrams (`.pdf`) for all prompt objects in `<FIRST_FRAME_MASK.png>`.
- `--input_dir`: directory containing video frames as individual images (e.g., `0001.jpg`, `0002.jpg`, ...); a sketch for extracting frames from a video file appears at the end of this section
- `--input_mask`: PNG annotation of the first frame with object IDs as pixel values (0 = background, 1 = object 1, 2 = object 2, ..., 255 = ignore); see the sketch just below for writing such a mask
- `--fps` (optional): frames per second, default 30
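As a reference for the `--input_mask` format, here is a minimal sketch (using numpy and Pillow; the resolution and object regions are hypothetical) that writes a single-channel, ID-valued PNG:

```python
import numpy as np
from PIL import Image

H, W = 480, 854                            # hypothetical frame resolution
anno = np.zeros((H, W), dtype=np.uint8)    # 0 = background everywhere

anno[100:200, 150:300] = 1                 # object 1 (hypothetical region)
anno[250:350, 400:550] = 2                 # object 2 (hypothetical region)
# anno[...] = 255 would mark pixels to ignore

# Save as a single-channel PNG whose pixel values are the object IDs.
Image.fromarray(anno).save("first_frame_mask.png")
```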
Example: 0334_cut_fruit_1
```bash
python quick_run.py --input_dir assets/example/0334_cut_fruit_1 --input_mask assets/example/0334_cut_fruit_1_0000000.png
```

The output visualizations are found under `./_pred_out/predictions/custom-0334_cut_fruit_1-Ours_gpt-4.1`.
- To ensure consistency, the expected outputs are pre-computed and provided under `./assets/expected_output/`.
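If your source is a video file rather than a directory of frames, a sketch like the following (assuming OpenCV is installed; the paths and zero-padded naming are illustrative) produces the layout expected by `--input_dir`:

```python
import cv2
from pathlib import Path

video_path = "my_video.mp4"            # hypothetical input video
out_dir = Path("frames/my_video")      # to be passed as --input_dir
out_dir.mkdir(parents=True, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Write zero-padded JPEG frames, e.g., 0001.jpg, 0002.jpg, ...
    cv2.imwrite(str(out_dir / f"{idx:04d}.jpg"), frame)
    idx += 1
cap.release()
```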
Please first download VOST and update the corresponding paths in `configs/default.yaml`.
🔹 To compute dataset-wise predictions, please run the following line.
```bash
python TubeletGraph/run.py -c configs/default.yaml -d vost -s val -m Ours [--gpus 0 1 2 3]   # optional --gpus flag for multi-GPU
```

🔹 To evaluate tracking / state graph performance, please run the following lines.
```bash
python3 eval/eval_tracking.py -c configs/default.yaml -p vost-val-Ours
python3 eval/eval_state_graph.py -c configs/default.yaml -p vost-val-Ours_gpt-4.1
```

| Data-Split-Method |  |  |  |  |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vost-val-Ours(†) | 50.9 | 41.3 | 53.0 | 68.6 | 68.1 | 63.7 | 36.7 | 23.6 | 40.2 | 60.1 | 55.2 | 47.0 |
| Data-Split-Method_VLM | Sem-Acc Verb | Sem-Acc Obj | Temp-Loc Pre | Temp-Loc Rec | Overall-Rec (ST) | Overall-Rec |
|---|---|---|---|---|---|---|
| vost-val-Ours_gpt-4.1 | 81.8 (*) | 72.3 (*) | 43.1 | 20.4 | 12.0 | 6.5 (*) |
(†) We observe very minor differences from the results reported in the paper when CropFormer and FC-CLIP are integrated into the same PyTorch environment as SAM2.
(*) Minor variance may be observed across runs due to non-deterministic LLM behavior in metric computation.
Please first download VSCOS and update the corresponding paths in `configs/default.yaml`.
🔹 To compute dataset-wise predictions, please run the following line.
```bash
python TubeletGraph/run.py -c configs/default.yaml -d vscos -s val -m Ours [--gpus 0 1 2 3]   # optional --gpus flag for multi-GPU
```

🔹 To evaluate tracking performance, please run the following line.
```bash
python3 eval/eval_tracking.py -c configs/default.yaml -p vscos-val-Ours
```

| Data-Split-Method |  |  |  |  |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vscos-val-Ours | 75.9 | 67.8 | 79.1 | 81.0 | 89.3 | 82.9 | 72.2 | 60.7 | 77.6 | 78.4 | 87.4 | 81.7 |
Please first download M3-VOS and update the corresponding paths in `configs/default.yaml`.
🔹 To compute dataset-wise predictions, please run the following line.
```bash
python TubeletGraph/run.py -c configs/default.yaml -d m3vos -s val -m Ours [--gpus 0 1 2 3]   # optional --gpus flag for multi-GPU
```

🔹 To evaluate tracking performance, please run the following line.
```bash
python3 eval/eval_tracking.py -c configs/default.yaml -p m3vos-val-Ours
```

| Data-Split-Method |  |  |  |  |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m3vos-val-Ours(†) | 74.1 | 67.4 | 78.7 | 78.2 | 88.4 | 79.8 | 64.1 | 55.9 | 68.5 | 70.3 | 82.4 | 71.5 |
(†) We observe very minor differences from the results reported in the paper when CropFormer and FC-CLIP are integrated into the same PyTorch environment as SAM2.
Option 1: Quick Run (Recommended for Single Videos)
```bash
python quick_run.py --input_dir <VIDEO_FRAME_DIR> --input_mask <FIRST_FRAME_MASK.png>
```

This automatically handles dataset setup and config generation.
Option 2: Manual Config (For Dataset-Wide Evaluation)
🔹 To run on entire datasets with multiple videos, please add the following lines to `configs/default.yaml` under `datasets`:
```yaml
datasets:
  <data_name>:
    name: <data_name>
    data_dir: <DATA_PATH>
    image_dir: <DATA_PATH>/JPEGImages
    anno_dir: <DATA_PATH>/Annotations
    split_dir: <DATA_SPLIT_PATH>
    image_format: <IMAGE_FORMAT>   # e.g., "*.jpg"
    anno_format: <ANNO_FORMAT>     # e.g., "*.png"
    fps: <FPS>                     # visualization fps
```
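For instance, a hypothetical dataset stored under `/data/my_videos` (all names and paths below are illustrative) could be registered as:

```yaml
datasets:
  my_videos:
    name: my_videos
    data_dir: /data/my_videos
    image_dir: /data/my_videos/JPEGImages
    anno_dir: /data/my_videos/Annotations
    split_dir: /data/my_videos/ImageSets
    image_format: "*.jpg"
    anno_format: "*.png"
    fps: 30
```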
🔹 Then, run the following line to compute dataset-wise predictions.
```bash
python TubeletGraph/run.py -c configs/default.yaml -d <data_name> -s <split> -m Ours [--gpus 0 1 2 3]   # optional --gpus flag for multi-GPU
```

- Note that `<DATA_SPLIT_PATH>/<split>.txt` should exist; if you need to create one, see the sketch below.
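A split file can be generated with a sketch like the following, under the assumption (ours, based on the DAVIS/VOST-style layout above) that it lists one video directory name per line:

```python
from pathlib import Path

image_dir = Path("/data/my_videos/JPEGImages")   # hypothetical paths
split_dir = Path("/data/my_videos/ImageSets")
split_dir.mkdir(parents=True, exist_ok=True)

# Assumption: one video (sub-directory) name per line.
names = sorted(p.name for p in image_dir.iterdir() if p.is_dir())
(split_dir / "val.txt").write_text("\n".join(names) + "\n")
```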
🔹 To evaluate tracking performance, please run the following line.
```bash
python3 eval/eval_tracking.py -c configs/default.yaml -p <data_name>-<split>-Ours
```

- In this case, `<split>` should be `val`, and `val-{S/M/L}.txt` should also be found under `<DATA_SPLIT_PATH>`.
🔹 To visualize model predictions, please run the following lines.
```bash
python3 eval/vis.py -c <CONFIG> -p <PRED> [-i <INSTANCE>_<OBJ_ID>]   # optional -i flag to visualize only 1 instance

## example
python3 eval/vis.py -c configs/default.yaml -p vost-val-Ours_gpt-4.1                                 ## visualize all
python3 eval/vis.py -c configs/default.yaml -p vost-val-Ours_gpt-4.1 -i 555_tear_aluminium_foil_1   ## visualize one instance
```

- Output visualizations can be found in `_vis_out/predictions/vost-val-Ours_gpt-4.1/`, where `<INSTANCE>_<OBJ_ID>.mp4` and `<INSTANCE>_<OBJ_ID>.pdf` contain the visualized object tracks and state graph, respectively.
🔹 To visualize the internally-computed spatiotemporal partition (tubelets), please run the following lines.
```bash
python3 TubeletGraph/vis/tubelets.py -c <CONFIG> -d <DATASET> -m <MODEL> -i <INSTANCE>_<OBJ_ID>

## example
python3 TubeletGraph/vis/tubelets.py -c configs/default.yaml -d vost -m cropformer -i 555_tear_aluminium_foil_1
```

- The output visualization can be found at `_vis_out/tubelets/tubelets_vost_cropformer_555_tear_aluminium_foil_1.mp4`, showing the input video (top-left), entity segmentation (top-right), initial tubelets (bottom-left), and newly emergent tubelets (bottom-right).
If you find our work useful in your research, please consider citing our paper:
```bibtex
@article{sun2025tracking,
  title={Tracking and Understanding Object Transformations},
  author={Sun, Yihong and Yang, Xinyu and Sun, Jennifer J and Hariharan, Bharath},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```