NVIDIA University of Toronto Vector Institute
Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
VGG-T³ processes large image collections significantly faster than other feed-forward methods (1k images in <1 minute vs. 10 minutes for VGGT) by replacing the quadratic-scaling softmax attention in the global attention layers with a linear alternative based on test-time training.
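To make the scaling argument concrete, here is a rough back-of-the-envelope sketch of how the cost of one global layer grows with the number of images. The function name `attention_cost`, the token count per image, the feature dimension, and the cost model for the linear layer (per-token cost taken as `dim * dim`) are all illustrative assumptions, not values from the paper or this repository. The counts say nothing about actual wall-clock time.

```python
def attention_cost(num_images, tokens_per_image=1000, dim=1024):
    """Rough operation counts for one global layer (illustrative assumptions only)."""
    n = num_images * tokens_per_image   # total tokens across all images
    softmax_ops = n * n * dim           # every token attends to every other token
    linear_ops = n * dim * dim          # assumed per-token cost that does not depend on n
    return softmax_ops, linear_ops


for num_images in (10, 100, 1000):
    softmax_ops, linear_ops = attention_cost(num_images)
    ratio = softmax_ops / linear_ops
    print(f"{num_images:>5} images: softmax/linear cost ratio ~ {ratio:.0f}x")
```

Under these assumptions, doubling the number of images quadruples the softmax cost but only doubles the linear cost, which is why the gap widens as collections grow.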
Clone this repo and then install (preferably in a conda environment):
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install .
VGG-T³ is compatible with the VGGT API and can be used in a similar way:
from vggttt.nets.vggt.models.vggt import VGGT
from vggttt.nets.vggt.img import load_and_preprocess_images
vggttt = VGGT.from_pretrained("nvidia/vgg-ttt").eval().cuda()
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_and_preprocess_images(image_names).to("cuda")
preds = vggttt.infer(images)
# Dict containing the predicted outputs with the following keys:
# - 'pose': [#images, 4, 4] Camera-to-world transformation
# - 'intrinsics': [#images, 3, 3] Pinhole camera matrix
# - 'pts3d': [#images, height, width, 3] Per-pixel points in world coordinates
# - 'conf': [#images, height, width] Per-pixel confidence in range ]1, inf[
# - 'depth': [#images, height, width, 1] Per-pixel depth
We provide an interactive web interface to perform 3D reconstruction of images and videos and visualize the result.
python vggttt/demo.py
Note: When running on a remote server, you need to forward both the viser and Gradio ports. See the CLI output for details.
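Outside the web interface, the prediction dictionary returned by `infer` can be post-processed directly. The following is a minimal sketch that keeps only high-confidence points and reads the camera centers from the camera-to-world poses. It runs on synthetic NumPy arrays with the documented shapes, so it does not need the model. The helper names `confident_points` and `camera_centers` and the threshold `min_conf=2.0` are hypothetical choices for illustration. With real outputs, which are assumed here to be torch tensors on the GPU, each entry would first be converted with `.detach().cpu().numpy()`.

```python
import numpy as np


def confident_points(preds, min_conf=2.0):
    """Return world-space points whose confidence exceeds min_conf as an (M, 3) array."""
    pts3d = np.asarray(preds["pts3d"])  # [#images, height, width, 3]
    conf = np.asarray(preds["conf"])    # [#images, height, width]
    mask = conf > min_conf              # boolean mask over all pixels of all images
    return pts3d[mask]


def camera_centers(preds):
    """Return camera centers in world coordinates as an (#images, 3) array."""
    pose = np.asarray(preds["pose"])    # [#images, 4, 4], camera-to-world
    return pose[:, :3, 3]               # translation column of each camera-to-world matrix


# Synthetic stand-in for the model output (random values, documented shapes only).
rng = np.random.default_rng(0)
num_images, height, width = 3, 4, 5
fake_preds = {
    "pose": np.tile(np.eye(4), (num_images, 1, 1)),
    "pts3d": rng.normal(size=(num_images, height, width, 3)),
    "conf": 1.0 + rng.exponential(size=(num_images, height, width)),
}

points = confident_points(fake_preds, min_conf=2.0)
centers = camera_centers(fake_preds)
print(points.shape, centers.shape)
```

The threshold trades completeness for noise: a higher `min_conf` yields a sparser but cleaner point cloud, and a suitable value depends on the scene.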
Find details on how to reproduce the results in the paper here.
We release the training harness; however, the dataset implementations and preprocessing code are not included. We are currently checking the feasibility of releasing the relevant code.
We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:
If you find this work useful, please cite:
@inproceedings{elflein2026vggttt,
title = {VGG-T\textsuperscript{3}: Offline Feed-Forward 3D Reconstruction at Scale},
author = {Elflein, Sven and Li, Ruilong and Agostinho, S{\'e}rgio and Gojcic, Zan and Leal-Taix{\'e}, Laura and Zhou, Qunjie and Osep, Aljosa},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
The code and model are released under the NVIDIA OneWay Noncommercial License, with the following exceptions:
- vggttt/nets/vggt/: released under the VGGT license.
- vggttt/nets/ttt.py: adapted from LaCT and released under the MIT License.
- vggttt/evaluation/pointmaps/utils.py: adapted from CUT3R and released under CC BY-NC-SA 4.0.
See THIRD_PARTY_LICENSES.md for the full license texts of all third-party components.