This is the official PyTorch implementation of St4RTrack (pronounced “Star Trek”).
We propose a unified feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. By predicting two appropriately defined pointmaps for each frame pair, our method naturally combines 3D reconstruction with 3D tracking through video sequences.
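To make the pair-wise interface concrete, here is a minimal, hypothetical sketch; the wrapper name, dictionary keys, and tensor shapes are illustrative assumptions, not the actual model API (see the model code for the real interface):

```python
import torch

def st4rtrack_pair(model, frame_anchor, frame_t):
    """Hypothetical wrapper around a pair-wise forward pass.

    frame_anchor, frame_t: (B, 3, H, W) RGB tensors.
    Returns two pointmaps expressed in a shared world coordinate frame:
      pts_track (B, H, W, 3): the anchor frame's pixels located at time t (tracking head)
      pts_recon (B, H, W, 3): the geometry of frame_t (reconstruction head)
    plus per-pixel confidences conf_track, conf_recon of shape (B, H, W).
    """
    out = model(frame_anchor, frame_t)  # assumed to return a dict of predictions
    return out["pts_track"], out["pts_recon"], out["conf_track"], out["conf_recon"]
```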
Key Features:
- Unified Representation: Simultaneous 4D reconstruction and tracking in world coordinates
- Feed-Forward: Efficient processing without post-hoc optimization
- In-The-Wild Adaptation: Test-time adaptation using reprojection loss on unlabeled data
- Comprehensive Evaluation: New WorldTrack benchmark for world coordinate tracking
Please refer to the arXiv paper for more technical details and the Project Page for interactive video results.
- Clean code for dataset and loss computation
- Add configurable track evaluation data root argument
- Remove unused training arguments for cleaner codebase
- Remove abs path in scripts
- Release pre-trained model weights: Hugging Face and Google Drive
- More unit tests
- Add dataset/benchmark download and preprocess instructions
- Add pre-trained models download scripts
- Check requirements.txt
- Clone St4RTrack with submodules:
git clone --recursive https://github.com/HavenFeng/St4RTrack.git
cd St4RTrack
- Create the environment: we use torch 2.5.1 with CUDA 12.1 in our implementation; you can set up the environment with
conda env create -f environment.yml
conda activate st4rtrack
Optionally, you can also use
conda create -n st4rtrack python=3.12 cmake=3.14.0
conda activate st4rtrack
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
# Optional dependencies for training and evaluation:
pip install -r requirements_optional.txt
- Optional: Compile CUDA kernels for RoPE (faster runtime):
cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
We currently provide fine-tuned model weights for St4RTrack, which can be downloaded locally via Google Drive. We recommend St4RTrack_Seqmode_reweightMax5.pth by default.
Optionally, you can also load the checkpoint from Hugging Face by adding
--hf_model "yupengchengg147/St4RTrack" \
--hf_variant seq \
--hf_force_download  # optional
to the training and inference commands.
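If you prefer to fetch the checkpoint yourself, a minimal sketch using huggingface_hub is below; the exact filename inside the Hugging Face repo is an assumption (taken from the Google Drive release name), so check the repo file listing first:

```python
from huggingface_hub import hf_hub_download

# Assumed filename; verify against the files listed at
# https://huggingface.co/yupengchengg147/St4RTrack
ckpt_path = hf_hub_download(
    repo_id="yupengchengg147/St4RTrack",
    filename="St4RTrack_Seqmode_reweightMax5.pth",
)
print("checkpoint downloaded to:", ckpt_path)
```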
Download the dataset for PointOdyssey and DynamicReplica:
bash data/download_po.sh
bash data/download_dynamic_replica.sh
For Kubric, please download and prepare the dataset based on the official instructions.
mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth -P ./checkpoints/
Train with pair-wise image reconstruction and dynamic pixel reweighting:
# Remember to replace the dataset path to your own path
# the script has been tested on a 4xA100(80G) machine
bash scripts_run/train_pair_reweight.sh
Train with sequence-based tracking:
# Remember to replace the dataset path to your own path
# the script has been tested on a 4xA100(80G) machine
bash scripts_run/train_seq_reweight.sh
Fine-tune on specific sequences:
# Remember to replace the dataset path and checkpoint path to your own path
# the script has been tested on a 4xA100(80G) machine
bash scripts_run/train_tta.sh
Post-Submission Training and Adaptation Improvements (led by Pengcheng Yu)
After the initial submission, we identified and addressed training stability issues related to confidence estimation for dynamic pixels. The original confidence-based loss function is:
L = conf1 * l1 - alpha * log_conf1 + conf2 * l2 - alpha * log_conf2
We observed that the confidence-based loss formulation used in prior works (e.g., DUSt3R, MonST3R, MASt3R) is suboptimal for St4RTrack, as it often leads the model to neglect pixels belonging to moving objects. Since DUSt3R-based models are mainly trained on static scenes, they tend to assign low confidence to dynamic or translucent pixels. As a result, during training, the model lowers conf1 (the confidence) rather than minimizing the actual error l1, which undermines learning on dynamic content. Additionally, simply removing the confidence term without introducing alternative constraints degrades performance, as previous approaches rely on confidence weighting for effective pointmap regression.
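For reference, a minimal sketch of this original confidence-weighted objective; variable names are illustrative, not the repository's actual code:

```python
import torch

def conf_weighted_loss(l1, conf1, l2, conf2, alpha):
    """l1, l2: per-pixel regression errors of the two heads.
    conf1, conf2: per-pixel confidences (typically kept >= 1 so log(conf) >= 0)."""
    loss = (conf1 * l1 - alpha * torch.log(conf1)
            + conf2 * l2 - alpha * torch.log(conf2))
    return loss.mean()
```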
Our Quick Solution - Dynamic Pixel Reweighting for Head1: We reweight the confidence values of dynamic pixels in head1 by replacing them with a scaled static-pixel confidence (a code sketch follows the threshold list below):
w = reweight_scale * conf1_static.max() # or conf1_static.mean()
L = conf1_static * l1_static - alpha * log_conf1_static + w * l1_dynamic + conf2 * l2 - alpha * log_conf2
Dynamic vs Static Pixel Classification: During training, pixels are classified based on ground-truth trajectory displacement:
- Dynamic pixels: Displacement > dataset-specific threshold
- Static pixels: Displacement ≤ threshold
Threshold Selection Strategy:
- Pair mode training:
- PointOdyssey and Dynamic Replica: max(0.75 quantile displacement, mean displacement)
- Kubric: median displacement
- Sequence mode training: mean displacement across all datasets
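A minimal sketch of the reweighted loss together with the displacement-based classification and threshold choices described above; the tensor names, dataset keys, `reweight_scale` default, and masking details are illustrative assumptions, not the repository's actual implementation:

```python
import torch

def select_threshold(gt_disp, mode, dataset):
    """Dataset-specific displacement threshold, following the strategy above."""
    disp = gt_disp.flatten()
    if mode == "pair":
        if dataset in ("pointodyssey", "dynamic_replica"):
            return torch.maximum(torch.quantile(disp, 0.75), disp.mean())
        return disp.median()          # Kubric
    return disp.mean()                # sequence mode, all datasets

def reweighted_loss(l1, conf1, l2, conf2, gt_disp, thresh, alpha, reweight_scale=5.0):
    """l1, l2: per-pixel errors of head1/head2; conf1, conf2: per-pixel confidences;
    gt_disp: ground-truth trajectory displacement per pixel."""
    dynamic = gt_disp > thresh        # dynamic pixels: displacement above the threshold
    static = ~dynamic                 # static pixels: displacement at or below the threshold

    # Head1: keep the confidence-weighted term for static pixels; dynamic pixels use a
    # fixed weight derived from the static-pixel confidences instead of their own confidence.
    w = reweight_scale * conf1[static].max()      # or conf1[static].mean()
    loss_head1 = (conf1[static] * l1[static] - alpha * torch.log(conf1[static])).sum()
    loss_head1 = loss_head1 + (w * l1[dynamic]).sum()

    # Head2: unchanged confidence-weighted loss.
    loss_head2 = (conf2 * l2 - alpha * torch.log(conf2)).sum()
    return (loss_head1 + loss_head2) / l1.numel()
```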
This improvement significantly enhances training stability and final performance, especially for scenes with substantial dynamic content.
Here we show a quantitative comparison of tracking performance between models trained with and without Dynamic Pixel Reweighting for Head1. Please refer to Table 1 and Table 3 in the paper for reference.
| Method | PO (all) | DR (all) | ADT (all) | PStudio (all) | PO (dyn.) | DR (dyn.) | ADT (dyn.) | PStudio (dyn.) |
|---|---|---|---|---|---|---|---|---|
| pair_mode with reweight | 67.29 | 71.28 | 68.97 | 67.59 | 69.76 | 74.65 | 76.22 | 67.59 |
| sequence with reweight | 67.34 | 74.34 | 73.03 | 70.67 | 72.04 | 76.82 | 78.01 | 70.67 |
| St4RTrack w/o reweight | 67.95 | 73.74 | 76.00 | 69.67 | 68.71 | 68.13 | 75.34 | 69.67 |
| Method | PO (all) | DR (all) | ADT (all) | PStudio (all) | PO (dyn.) | DR (dyn.) | ADT (dyn.) | PStudio (dyn.) |
|---|---|---|---|---|---|---|---|---|
| pair_mode with reweight | 0.3163 | 0.3016 | 0.3324 | 0.2850 | 0.2612 | 0.2180 | 0.1158 | 0.2850 |
| sequence with reweight | 0.3169 | 0.2605 | 0.2946 | 0.2489 | 0.2367 | 0.1978 | 0.1087 | 0.2489 |
| St4RTrack w/o reweight | 0.3140 | 0.2682 | 0.2680 | 0.2637 | 0.2970 | 0.2961 | 0.1212 | 0.2637 |
Run inference on your data:
python infer.py \
--batch_size 128 \
--input_dir /path/to/your/data \
--weights checkpoints/your_model.pth \
--output_dir results/your_path
Visualize the results:
python visualizer_st4rtrack.py --traj_path results/your_path
Evaluate on standard benchmarks:
bash scripts_run/eval.sh
Download the WorldTrack dataset from Google Drive and place it in ./data/worldtrack_release; the structure should look like this:
./data/worldtrack_release/
├── adt_mini/
├── pstudio_mini/
├── po_mini/
├── ds_mini/
└── tum/
We provide the following evaluation datasets:
- Trajectory evaluation datasets: adt_mini, pstudio_mini, po_mini, ds_mini
- Reconstruction evaluation datasets: po_mini, tum
If you find our work useful, please cite:
@inproceedings{st4rtrack2025,
title={St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World},
author={Feng*, Haiwen and Zhang*, Junyi and Wang, Qianqian and Ye, Yufei and Yu, Pengcheng and Black, Michael J. and Darrell, Trevor and Kanazawa, Angjoo},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
This code and model are available for non-commercial scientific research purposes.
Shout-out to other concurrent efforts on pointmap-based dense 3D tracking:
ZeroMSF - https://research.nvidia.com/labs/lpr/zero_msf/
DynaDUSt3R - https://stereo4d.github.io/
Dynamic Point Maps - https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/
We would like to thank the authors of DUSt3R and MonST3R for their foundational work in stereo matching and 3D reconstruction. We also thank the contributors of PointOdyssey, TUM-Dynamics, Dynamic Replica, TAPVid-3D and DAVIS datasets for enabling comprehensive evaluation.
This work was supported by UC Berkeley and the Max Planck Institute for Intelligent Systems.