VGG-T³: Offline Feed-Forward 3D Reconstruction at Scale

arXiv Project Page Hugging Face Model

NVIDIA     University of Toronto     Vector Institute

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep

Overview

VGG-T³ processes large image collections significantly faster than other feed-forward methods (1k images in <1 minute vs. 10 minutes for VGGT) by replacing the quadratic-scaling softmax attention in the global attention layers with a linear alternative based on test-time training.
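To illustrate why the scaling behaviour matters, here is a rough cost model rather than a measurement. It only counts how much work grows with the number of images: global softmax attention scores every query-key pair, while a linear-time layer visits each token once. The value of `TOKENS_PER_IMAGE` and both function names are hypothetical and chosen for illustration; they are not taken from the VGG-T³ code.

```python
# Rough cost model: how the work of a global layer grows with the image count.
# TOKENS_PER_IMAGE is a hypothetical value, not the model's real token count.
TOKENS_PER_IMAGE = 1000


def softmax_attention_pairs(num_images, tokens_per_image=TOKENS_PER_IMAGE):
    """Query-key pairs scored by global softmax attention (quadratic in tokens)."""
    total_tokens = num_images * tokens_per_image
    return total_tokens * total_tokens


def linear_layer_tokens(num_images, tokens_per_image=TOKENS_PER_IMAGE):
    """Tokens visited once by a linear-time layer (linear in tokens)."""
    return num_images * tokens_per_image


if __name__ == "__main__":
    for num_images in (250, 500, 1000):
        pairs = softmax_attention_pairs(num_images)
        tokens = linear_layer_tokens(num_images)
        print(f"{num_images:>5} images: {pairs:.2e} pairs vs. {tokens:.2e} tokens")
```

Doubling the number of images quadruples the pair count but only doubles the token count, which is why the gap widens as collections grow. Constant factors (feature dimension, number of layers, the cost of the test-time training updates) are deliberately left out.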

Quick Start

Clone this repo and then install (preferably in a conda environment):

pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install .

VGG-T³ is compatible with the VGGT API and can be used in a similar way:

from vggttt.nets.vggt.models.vggt import VGGT
from vggttt.nets.vggt.img import load_and_preprocess_images

vggttt = VGGT.from_pretrained("nvidia/vgg-ttt").eval().cuda()

image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_and_preprocess_images(image_names).to("cuda")

preds = vggttt.infer(images)
# Dict containing the predicted outputs with the following keys:
#  - 'pose':        [#images, 4, 4]  Camera-to-world transformation
#  - 'intrinsics':  [#images, 3, 3]  Pinhole camera matrix
#  - 'pts3d':       [#images, height, width, 3]  Per-pixel points in world coordinates
#  - 'conf':        [#images, height, width]  Per-pixel confidence in range ]1, inf[
#  - 'depth':       [#images, height, width, 1]  Per-pixel depth
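
The following sketch shows how the listed outputs relate geometrically under a standard pinhole model: a pixel with a depth value is back-projected using the intrinsics and then moved into world coordinates with the camera-to-world pose. It assumes that depth is measured along the optical axis (z-depth), that skew is zero, and that `u` is the column and `v` the row of the pixel. The helper `unproject_pixel` is hypothetical and not part of the package, and the network's `pts3d` output is not guaranteed to match this computation exactly.

```python
def unproject_pixel(u, v, depth, intrinsics, pose):
    """Lift pixel (u, v) with z-depth `depth` into world coordinates.

    intrinsics: 3x3 pinhole matrix as nested lists (zero skew assumed).
    pose:       4x4 camera-to-world transform as nested lists.
    """
    fx, fy = intrinsics[0][0], intrinsics[1][1]
    cx, cy = intrinsics[0][2], intrinsics[1][2]

    # Back-project into the camera frame: X_cam = depth * K^-1 [u, v, 1]^T.
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    point_cam = (x_cam, y_cam, depth, 1.0)  # homogeneous coordinates

    # Apply the camera-to-world transform (rotation and translation).
    return [sum(pose[r][c] * point_cam[c] for c in range(4)) for r in range(3)]


if __name__ == "__main__":
    # Toy values for illustration only (not model outputs).
    K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
    cam_to_world = [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    print(unproject_pixel(420, 240, 2.0, K, cam_to_world))
```

With real predictions, the nested lists would be replaced by the tensors for a single image, for example via `.tolist()` on the corresponding entries of `preds`.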

Demo

We provide an interactive web interface for running 3D reconstruction on images and videos and visualizing the results.

python vggttt/demo.py

Note: When running on a remote server, you need to forward both the viser and Gradio ports. See the CLI output for details.

Evaluation

Find details on how to reproduce the results in the paper here.

Training

We release the training harness; however, the dataset implementations and preprocessing code are not included. We are currently checking whether the relevant code can be released.

Acknowledgments

We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:

Citation

If you find this work useful, please cite:

@inproceedings{elflein2026vggttt,
  title     = {VGG-T\textsuperscript{3}: Offline Feed-Forward 3D Reconstruction at Scale},
  author    = {Elflein, Sven and Li, Ruilong and Agostinho, S{\'e}rgio and Gojcic, Zan and Leal-Taix{\'e}, Laura and Zhou, Qunjie and Osep, Aljosa},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

License

The code and model are released under the NVIDIA OneWay Noncommercial License, with the following exceptions:

See THIRD_PARTY_LICENSES.md for the full license texts of all third-party components.

About

[CVPR'26] Official code for the paper "VGG-T³: Offline Feed-Forward 3D Reconstruction at Scale"
