VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

The official implementation of the paper “VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction.”

Project Page · arXiv · Code

Yu Hu1*    Chong Cheng1,2*    Sicheng Yu1*

Xiaoyang Guo2    Hao Wang1

1The Hong Kong University of Science and Technology (Guangzhou)
2Horizon Robotics

* Equal contribution.    † Corresponding author.

Quick Start

This section will guide you through setting up the environment and running VGGT4D on your own data.

1. Environment Setup

We recommend using pyenv together with virtualenv to ensure a clean and reproducible Python environment.

# Select Python version
pyenv shell 3.12

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install core dependencies
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu118

# Install remaining project requirements
pip install -r requirements.txt
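
Optionally, you can sanity-check the install from inside the virtual environment. This is a minimal sketch, not a script shipped with the repository:

# Optional sanity check: confirm the CUDA build of PyTorch is active.
import torch

print(torch.__version__)          # expect 2.7.1+cu118
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine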

2. Download Pre-trained Checkpoint

Download the pre-trained model checkpoint:

mkdir -p ckpts/
wget -c "https://huggingface.co/facebook/VGGT_tracker_fixed/resolve/main/model_tracker_fixed_e20.pt?download=true" -O ckpts/model_tracker_fixed_e20.pt
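
Alternatively, if you already use the Hugging Face Hub client, the same file can be fetched with huggingface_hub. This is a sketch assuming the package is installed; the repo ID and filename are taken from the wget URL above:

# Alternative download via huggingface_hub (hypothetical helper, not part of the repo).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="facebook/VGGT_tracker_fixed",
    filename="model_tracker_fixed_e20.pt",
    local_dir="ckpts",
)
print(path)  # ckpts/model_tracker_fixed_e20.pt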

3. Run the Demo

Run the VGGT4D demo script to process your scene data:

python demo_vggt4d.py --input_dir <path_to_input_dir> --output_dir <path_to_output_dir>

Input Directory Structure:

The input directory should follow this structure:

input_dir/
├── scene1/
│   ├── image001.jpg
│   ├── image002.jpg
│   └── ...
└── scene2/
    ├── image001.png
    ├── image002.png
    └── ...

Each scene subdirectory should contain image files in .jpg or .png format.
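
If your source material is a video rather than image files, a minimal sketch for dumping frames into the expected layout is shown below (assuming OpenCV is installed; scene and file names are illustrative, and official preprocessing scripts are still on the TODO list):

# Hypothetical preprocessing helper, not part of the repository.
import os
import cv2

def video_to_scene(video_path, scene_dir):
    # Write every frame as image001.jpg, image002.jpg, ... in scene_dir.
    os.makedirs(scene_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(scene_dir, f"image{idx:03d}.jpg"), frame)
        idx += 1
    cap.release()

video_to_scene("my_clip.mp4", "datasets/input_dir/scene1")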

Example Usage:

python demo_vggt4d.py --input_dir ./datasets/input_dir --output_dir ./outputs

Output Files:

The script processes each scene and generates the following outputs in the output directory:

  • Depth maps (frame_%04d.npy format)
  • Depth confidence maps (conf_%04d.npy format)
  • Camera intrinsics (pred_intrinsics.txt)
  • Camera poses in TUM format (pred_traj.txt)
  • Refined dynamic masks (dynamic_mask_%04d.png format)
  • RGB images (frame_%04d.png format)
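
A minimal sketch for reading these outputs back, assuming NumPy and Pillow; the per-scene folder name is illustrative, and the line layout `timestamp tx ty tz qx qy qz qw` is the standard TUM convention rather than something verified against this repo:

# Hypothetical post-processing sketch, not part of the repository.
import numpy as np
from PIL import Image

scene_out = "outputs/scene1"  # illustrative per-scene output folder

depth = np.load(f"{scene_out}/frame_0000.npy")   # H x W depth map
conf = np.load(f"{scene_out}/conf_0000.npy")     # H x W depth confidence
mask = np.array(Image.open(f"{scene_out}/dynamic_mask_0000.png"))

intrinsics = np.loadtxt(f"{scene_out}/pred_intrinsics.txt")  # assumed: one row per frame
traj = np.loadtxt(f"{scene_out}/pred_traj.txt")  # TUM: timestamp tx ty tz qx qy qz qw

print(depth.shape, conf.shape, mask.shape, traj.shape)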

TODO

  • Release code
  • Data preprocessing scripts
  • Evaluation scripts
  • Visualization scripts
  • Long-sequence implementation

Acknowledgements

We thank the authors of VGGT, DUSt3R, and Easi3R for releasing their models and code. Their contributions to geometric learning and dynamic reconstruction, along with many other inspiring works from the community, provided essential foundations for this work.

License

This project is licensed under the MIT License.
You are free to use, modify, and distribute this software for both academic and commercial purposes, provided that proper attribution is given.

See the LICENSE file for details.

Citation

If you find VGGT4D useful for your research, please cite our paper:

@misc{hu2025vggt4d,
      title={VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction}, 
      author={Yu Hu and Chong Cheng and Sicheng Yu and Xiaoyang Guo and Hao Wang},
      year={2025},
      eprint={2511.19971},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.19971}, 
}
