This repository contains the official implementation of
FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution (ICCV 2025 Highlight)
We recommend creating a conda environment and then installing the required packages with our setup_env.sh script. Note that the mamba package should be installed from our local folder and the torch version should be 2.4 (as of early May 2025, Mamba2 does not work if compiled against torch 2.5 and above).
conda create -n flashdepth python=3.11 --yes
conda activate flashdepth
bash setup_env.sh
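After setup, a quick sanity check can catch version mismatches early. This is a minimal sketch; it assumes the local Mamba build installs under the standard mamba_ssm package name:

```python
# Sanity check after running setup_env.sh.
# Assumes the local Mamba build installs under the standard `mamba_ssm` package name.
import torch

print(torch.__version__)           # should start with 2.4
assert torch.cuda.is_available()   # the Mamba2 kernels require a CUDA device

from mamba_ssm import Mamba2       # fails here if the local mamba install is broken
print("Mamba2 import OK")
```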
We provide three checkpoints on Hugging Face. They correspond to FlashDepth (Full), FlashDepth-L, and FlashDepth-S, respectively, as referenced in the paper. Generally, FlashDepth-L is the most accurate and FlashDepth (Full) is the fastest, but we recommend FlashDepth-L when the input resolution is low (e.g., short side less than 518).
Save the checkpoints to configs/flashdepth/iter_43002.pth, configs/flashdepth-l/iter_10001.pth, and configs/flashdepth-s/iter_14001.pth, respectively.
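For reference, here is a hedged sketch of fetching a checkpoint with huggingface_hub; the repo_id below is a placeholder, so substitute the actual Hugging Face repository name:

```python
# Sketch: download a checkpoint and place it where the configs expect it.
# The repo_id is a placeholder -- replace it with the actual FlashDepth Hugging Face repo.
import shutil
from huggingface_hub import hf_hub_download

cached_path = hf_hub_download(repo_id="<org>/FlashDepth", filename="iter_43002.pth")
shutil.copy(cached_path, "configs/flashdepth/iter_43002.pth")
```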
To run inference on a video:
torchrun train.py --config-path configs/flashdepth inference=true eval.random_input=<path to video> eval.outfolder=output
The output depth maps (as .npy files) and .mp4 videos will be saved to configs/flashdepth/output/. Change the config path to use another model. We provide some examples:
torchrun train.py --config-path configs/flashdepth inference=true eval.random_input=examples/video1.mp4 eval.outfolder=output
torchrun train.py --config-path configs/flashdepth inference=true eval.random_input=examples/video2.mp4 eval.outfolder=output
**If you run into TypeError: Invalid NaN comparison errors, add eval.compile=false to the command.**
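For downstream use, here is a hedged sketch of loading the saved outputs. The exact .npy filenames inside the output folder depend on the input video, so the snippet simply globs whatever the run produced:

```python
# Sketch: load the saved depth maps and write simple grayscale visualizations.
# Filenames inside the output folder depend on the input video; adjust the glob as needed.
import glob
import numpy as np
import matplotlib.pyplot as plt

for npy_path in sorted(glob.glob("configs/flashdepth/output/*.npy")):
    depth = np.load(npy_path)                        # (H, W) per frame, or (T, H, W) if stacked
    frame = depth if depth.ndim == 2 else depth[0]
    plt.imsave(npy_path.replace(".npy", ".png"), frame, cmap="gray")
```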
As reported in the paper, training is split into two stages. We first train FlashDepth-L and FlashDepth-S at resolution 518x518. Then, we train FlashDepth (Full) at higher resolution.
To train the first stage, download the Depth Anything V2 checkpoints and save them to the checkpoints folder.
For data, see dataloaders/README.md to verify the data format. We generally used the default format downloaded from the official dataset websites. If needed, also check the dataloader Python files to see exactly how the data is loaded.
# first stage
torchrun --nproc_per_node=8 train.py --config-path configs/flashdepth-l/ load=checkpoints/depth_anything_v2_vitl.pth dataset.data_root=<path to data>
torchrun --nproc_per_node=8 train.py --config-path configs/flashdepth-s/ load=checkpoints/depth_anything_v2_vits.pth dataset.data_root=<path to data>
# second stage
torchrun --nproc_per_node=8 train.py --config-path configs/flashdepth load=configs/flashdepth-s/<latest flashdepth-s checkpoint .pth> hybrid_configs.teacher_model_path=configs/flashdepth-l/<latest flashdepth-l checkpoint .pth> dataset.data_root=<path to data>
Check the config.yaml files in the configs folders for hyperparameters and logging.
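If you prefer inspecting the resolved options programmatically, here is a minimal sketch that assumes the configs are plain OmegaConf-compatible YAML, which may not match exactly how train.py loads them:

```python
# Sketch: print a config's hyperparameters.
# Assumes the YAML is OmegaConf-compatible; not necessarily how train.py parses it.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/flashdepth/config.yaml")
print(OmegaConf.to_yaml(cfg))
```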
By default, we print out the wall time and FPS during inference. You can also run
torchrun train.py --config-path configs/flashdepth inference=true eval.dummy_timing=true
to get the wall time over 100 frames (excluding warmup) with resolution 2044x1148. This is our console output on an A100 GPU:
INFO - shape: torch.Size([1, 105, 3, 1148, 2044])
INFO - wall time taken: 4.15; fps: 24.12; num frames: 100
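For context, the FPS number above is straightforward wall-clock timing. Below is a minimal sketch of how such a measurement is typically done, not the repo's exact timing code:

```python
# Sketch: wall-time / FPS measurement with warmup excluded (not the exact code used in train.py).
import time
import torch

@torch.no_grad()
def measure_fps(model, frames, warmup=5):
    for frame in frames[:warmup]:          # warmup iterations (compilation, autotuning) are excluded
        model(frame)
    torch.cuda.synchronize()
    start = time.time()
    for frame in frames[warmup:]:
        model(frame)
    torch.cuda.synchronize()               # wait for queued GPU work before stopping the clock
    elapsed = time.time() - start
    return elapsed, (len(frames) - warmup) / elapsed
```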
As mentioned in the paper, we originally implemented CUDA graphs but found that simply compiling the model provided similar performance.
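If you want to try compilation yourself, here is a minimal torch.compile sketch; the stand-in model below is a placeholder, not the repo's API:

```python
# Sketch: torch.compile on a stand-in model; replace the Conv2d with a loaded FlashDepth model.
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda().eval()  # placeholder model
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 3, 518, 518, device="cuda")
with torch.no_grad():
    compiled(x)   # first call triggers compilation (slow)
    compiled(x)   # later calls run the optimized kernels
```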
We included an ablation study on a couple of temporal modules in our supplement: bi-directional Mamba (specifically, Hydra), xLSTM, and a transformer-based RNN inspired by CUT3R. To use them, set the corresponding flags in config.yaml and install the required packages (e.g., set use_xlstm=true, but make sure use_mamba is still true).
In the paper we reported placing the Mamba layers after the last DPT layer. In this repo we moved them to after the first DPT layer in FlashDepth (Full); see mamba_in_dpt_layer in config.yaml. We will update the paper accordingly.
Our code is modified from and borrows heavily from the following projects:
Depth Anything V2
Mamba 2
If you find our code or paper useful, please consider citing:
@inproceedings{chou2025flashdepth,
title = {FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution},
author = {Chou, Gene and Xian, Wenqi and Yang, Guandao and Abdelfattah, Mohamed and Hariharan, Bharath and Snavely, Noah and Yu, Ning and Debevec, Paul},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
year = {2025},
}