Yu Hu1* Chong Cheng1,2* Sicheng Yu1*
Xiaoyang Guo2 Hao Wang1†
1The Hong Kong University of Science and Technology (Guangzhou)
2Horizon Robotics
* Equal contribution. † Corresponding author.
This section will guide you through setting up the environment and running VGGT4D on your own data.
We recommend using pyenv together with Python's built-in venv to ensure a clean and reproducible Python environment.
# Select Python version
pyenv shell 3.12
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install core dependencies
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu118
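# (Optional sanity check, not part of the original instructions) Verify that
# PyTorch imports correctly and sees a CUDA device before proceeding
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"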
# Install remaining project requirements
pip install -r requirements.txt

Download the pre-trained model checkpoint:
mkdir -p ckpts/
wget -c "https://huggingface.co/facebook/VGGT_tracker_fixed/resolve/main/model_tracker_fixed_e20.pt?download=true" -O ckpts/model_tracker_fixed_e20.ptRun the VGGT4D demo script to process your scene data:
Run the VGGT4D demo script to process your scene data:

python demo_vggt4d.py --input_dir <path_to_input_dir> --output_dir <path_to_output_dir>

Input Directory Structure:
The input directory should follow this structure:
input_dir/
├── scene1/
│ ├── image001.jpg
│ ├── image002.jpg
│ └── ...
└── scene2/
├── image001.png
├── image002.png
└── ...
Each scene subdirectory should contain image files in .jpg or .png format.
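If your source data is a video, a minimal sketch for producing such a scene folder is shown below. This assumes opencv-python is installed; the video filename and scene path are placeholders:

import os
import cv2  # pip install opencv-python

video_path = "my_video.mp4"      # placeholder input video
scene_dir = "input_dir/scene1"   # matches the layout shown above
os.makedirs(scene_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Write frames as image001.jpg, image002.jpg, ...
    cv2.imwrite(os.path.join(scene_dir, f"image{idx:03d}.jpg"), frame)
    idx += 1
cap.release()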
Example Usage:
python demo_vggt4d.py --input_dir ./datasets/input_dir --output_dir ./outputs

Output Files:
The script processes each scene and generates the following outputs in the output directory:
- Depth maps (frame_%04d.npy format)
- Depth confidence maps (conf_%04d.npy format)
- Camera intrinsics (pred_intrinsics.txt)
- Camera poses in TUM format (pred_traj.txt)
- Refined dynamic masks (dynamic_mask_%04d.png format)
- RGB images (frame_%04d.png format)
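A minimal sketch for inspecting these outputs with numpy. The per-scene folder name and the exact file layout are assumptions; adjust the paths to your run:

import numpy as np
from pathlib import Path

scene = Path("./outputs/scene1")  # assumed: one output folder per scene

# Per-frame depth and confidence maps, saved as .npy arrays
depth = np.load(scene / "frame_0000.npy")
conf = np.load(scene / "conf_0000.npy")

# Camera intrinsics, one line per frame (exact layout may vary)
intrinsics = np.loadtxt(scene / "pred_intrinsics.txt")

# TUM-format trajectory: each row is "timestamp tx ty tz qx qy qz qw"
traj = np.loadtxt(scene / "pred_traj.txt")
timestamps, translations, quats = traj[:, 0], traj[:, 1:4], traj[:, 4:8]

print(depth.shape, conf.shape, intrinsics.shape, translations.shape)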
TODO:
- Release code
- Data preprocessing scripts
- Evaluation scripts
- Visualization scripts
- Long sequence implementation
We thank the authors of VGGT, DUSt3R, and Easi3R for releasing their models and code. Their contributions to geometric learning and dynamic reconstruction, together with many other inspiring works in the community, provided essential foundations for this work.
This project is licensed under the MIT License.
You are free to use, modify, and distribute this software for both academic and commercial purposes, provided that proper attribution is given.
See the LICENSE file for details.
If you find VGGT4D useful for your research, please cite our paper:
@misc{hu2025vggt4d,
title={VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction},
author={Yu Hu and Chong Cheng and Sicheng Yu and Xiaoyang Guo and Hao Wang},
year={2025},
eprint={2511.19971},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.19971},
}