Sicheng Zuo*, Zixun Xie*, Wenzhao Zheng*$\ddagger$, Shaoqing Xu$\dagger$, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

* Equal contributions.
DVGT, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data.
- [2025/12/19] We have released the paper, inference code, and visualization checkpoints.
DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.
DVGT significantly outperforms existing models across a variety of scenarios. As shown in the comparison below, our method (red) demonstrates superior accuracy.
Firstly, clone this repository to your local machine and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.
git clone https://github.com/wzzheng/DVGT.git
cd dvgt
conda create -n dvgt python=3.11
conda activate dvgt
pip install -r requirements.txt
Secondly, download the pretrained checkpoint and save it to the ./ckpt directory.
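If the checkpoint is hosted on the Hugging Face Hub, it can also be fetched with huggingface_hub (already in the dependency list above). This is a minimal sketch: the repo id below is a placeholder, so substitute the identifier given on the release page.
import os
from huggingface_hub import hf_hub_download

os.makedirs("ckpt", exist_ok=True)
# NOTE: "wzzheng/DVGT" is a hypothetical repo id; "open_ckpt.pt" matches the filename used later in this README.
checkpoint_path = hf_hub_download(repo_id="wzzheng/DVGT", filename="open_ckpt.pt", local_dir="ckpt")
print(f"Checkpoint saved to {checkpoint_path}")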
Now, try the model with just a few lines of code:
import torch
from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr
checkpoint_path = 'path to your checkpoint'
device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+)
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()
# Load and preprocess example images (replace with your own image paths)
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)
with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
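The exact contents of predictions depend on which branches are enabled. A quick way to see what came back is to print each output's name and shape; this is a generic sketch that assumes the model returns a dictionary of tensors, which the snippet above does not guarantee.
# Sketch: inspect the returned predictions (assumes a dict of tensors).
for name, value in predictions.items():
    if torch.is_tensor(value):
        print(f"{name}: shape={tuple(value.shape)}, dtype={value.dtype}")
    else:
        print(f"{name}: {type(value).__name__}")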
You can also choose which attributes (branches) to predict, as shown below; this achieves the same result as the example above. The example uses a batch size of 1 (processing a single scene), but it works naturally for multiple scenes.
import torch
from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr
from dvgt.utils.pose_enc import pose_encoding_to_ego_pose
from dvgt.utils.geometry import convert_point_in_ego_0_to_ray_depth_in_ego_n
checkpoint_path = 'ckpt/open_ckpt.pt'
device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+)
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()
# Load and preprocess example images (replace with your own image paths)
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=23).to(device)
with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        aggregated_tokens_list, ps_idx = model.aggregator(images)
        # Predict the pose of each ego frame relative to the first ego frame.
        pose_enc = model.ego_pose_head(aggregated_tokens_list)[-1]
        # Ego poses follow the OpenCV convention, relative to the ego frame of the first time step.
        ego_n_to_ego_0 = pose_encoding_to_ego_pose(pose_enc)
        # Predict point maps in the ego frame of the first time step.
        point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)
        # The predicted ray depth maps originate from the ego vehicle's position in each corresponding frame.
        ray_depth_in_ego_n = convert_point_in_ego_0_to_ray_depth_in_ego_n(point_map, ego_n_to_ego_0)
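As a quick sanity check on the decoded geometry, you can dump the predicted point map to a PLY file and open it in any point-cloud viewer. This is a minimal sketch, assuming point_map has shape (B, S, H, W, 3) and point_conf has shape (B, S, H, W); adjust the reshaping if the actual layout differs.
import numpy as np

# Sketch: export the point map (in the first frame's ego coordinates) as an ASCII PLY file.
# Shape assumption (not confirmed): point_map is (B, S, H, W, 3), point_conf is (B, S, H, W).
points = point_map[0].float().cpu().numpy().reshape(-1, 3)  # flatten all frames and pixels
conf = point_conf[0].float().cpu().numpy().reshape(-1)
points = points[conf > np.median(conf)]  # keep the more confident half of the points

with open("scene_points.ply", "w") as f:
    f.write("ply\nformat ascii 1.0\n")
    f.write(f"element vertex {len(points)}\nproperty float x\nproperty float y\nproperty float z\nend_header\n")
    np.savetxt(f, points, fmt="%.4f")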
Run the following command to perform reconstruction and visualize the point clouds in Viser. This script requires a path to an image folder formatted as follows:
data_dir/
├── frame_0/ (contains view images, e.g., CAM_F.jpg, CAM_B.jpg...)
├── frame_1/
...
Note on Data Requirements:
- Consistency: The data must be sampled at 2 Hz, and all frames must contain the same number of views arranged in a fixed order (see the check sketched below).
- Capacity: Inference supports up to 24 frames with an arbitrary number of views per frame.
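A minimal sketch of this consistency check under the layout above (one sub-folder per frame, one .jpg per view): it confirms that every frame folder exposes the same view filenames and that the frame count stays within the supported limit.
from pathlib import Path

# Sketch: validate a data_dir before running the demo (assumes the frame_*/CAM_*.jpg layout shown above).
data_dir = Path("examples/openscene_log-0104-scene-0007")
frames = sorted(p for p in data_dir.iterdir() if p.is_dir())
assert frames, "no frame folders found"
view_lists = [sorted(img.name for img in frame.glob("*.jpg")) for frame in frames]

assert len(frames) <= 24, "inference supports at most 24 frames"
assert all(views == view_lists[0] for views in view_lists), "all frames must contain the same views"
print(f"{len(frames)} frames, {len(view_lists[0])} views per frame: {view_lists[0]}")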
python demo_viser.py --image_folder examples/openscene_log-0104-scene-0007
- Paper, inference code, and pre-trained weights (for visualization).
- Training suite: includes training code, evaluation scripts, and the data preparation pipeline.
- Dataset release: comprehensive datasets for training and testing.
Our code is based on the following brilliant repositories:
MoGe-2, CUT3R, Driv3R, VGGT, MapAnything, Pi3
Many thanks to these authors!
If you find this project helpful, please consider citing the following paper:
@article{zuo2025dvgt,
title={DVGT: Driving Visual Geometry Transformer},
author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
journal={arXiv preprint arXiv:2512.16919},
year={2025}
}