@article{wang2025lightweight,
title={Lightweight and Accurate Multi-View Stereo With Confidence-Aware Diffusion Model},
author={Wang, Fangjinhua and Xu, Qingshan and Ong, Yew-Soon and Pollefeys, Marc},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
- 11.09.2025: Releasing a new checkpoint for CasDiffMVS. Instead of finetuning the DTU-pretrained model on BlendedMVS (a subset of BlendedMVG), we finetune it on BlendedMVG. The performance on the benchmarks improves consistently without changing any other hyper-parameters.
Method | T&T Intermediate | T&T Advanced | ETH3D Training | ETH3D Test |
---|---|---|---|---|
CasDiffMVS_MVG | 66.14 | 42.00 | 77.79 | 85.99 |
There is always a trade-off between accuracy and efficiency. To find a better balance between quality and computation, we introduce diffusion models into Multi-View Stereo (MVS) for efficient and accurate reconstruction, and propose two methods, named DiffMVS and CasDiffMVS. Because of their high efficiency, strong performance and lightweight structure, we hope that our methods can serve as new strong baselines for future research in MVS.
Specifically, we estimate a coarse initial depth map and refine it with a conditional diffusion model. The diffusion model learns a depth prior and uses random noise to avoid local minima. Unlike other diffusion models that start from pure random noise, we refine a coarse depth map with the diffusion model, which explicitly reduces the number of sampling steps and increases stability w.r.t. random seeds. In addition, we design a lightweight diffusion network to improve both performance and efficiency.
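To make the idea concrete, here is a minimal sketch of refining a coarse depth map with a few conditional denoising steps instead of sampling from pure noise. The function name, `denoise_net`, `condition`, the noise scale and the timestep schedule are illustrative placeholders, not the actual network or scheduler in this repository:

```python
import torch

def refine_depth(coarse_depth, condition, denoise_net, timesteps=(2, 1, 0), noise_scale=0.1):
    """Illustrative only: refine a coarse depth map with a few conditional
    denoising steps rather than sampling from pure Gaussian noise."""
    # Perturb the coarse depth slightly; starting near a good estimate is
    # why only a handful of sampling steps are needed.
    residual = noise_scale * torch.randn_like(coarse_depth)
    for t in timesteps:
        t_batch = torch.full((coarse_depth.shape[0],), t, device=coarse_depth.device)
        # The (placeholder) network predicts a refined residual conditioned on
        # image / cost-volume features and the current depth estimate.
        residual = denoise_net(coarse_depth + residual, condition, t_batch)
    return coarse_depth + residual
```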
We compare the efficiency with previous methods on the same workstation with an NVIDIA RTX 2080 Ti GPU and summarize the results as follows.
First, clone this repository and install the dependencies.
git clone https://github.com/cvg/diffmvs.git
cd diffmvs
pip install -r requirements.txt
Next, download checkpoints and unzip them.
Try the models with your own images. First, place the images under the `images` folder:
SCENE_DIR/
├── images/
We support VGGT and COLMAP to estimate camera poses and reconstruct a sparse point cloud.
VGGT. We suggest creating a new conda environment for VGGT because the default PyTorch version is different.
git clone git@github.com:facebookresearch/vggt.git
cd vggt
pip install -r requirements.txt
pip install -r requirements_demo.txt
python vggt/demo_colmap.py --scene_dir="/SCENE_DIR"
COLMAP. Follow the COLMAP wiki to get a sparse reconstruction. The easiest way is to use COLMAP's GUI and run automatic reconstruction.
Back in our codebase, we need to convert the data from COLMAP format to the format that we use for DTU, Tanks & Temples, and ETH3D.
VGGT. Though VGGT exports its output directly in COLMAP format, there are differences. For example, each 3D point from VGGT corresponds to a single pixel, while a 3D point from COLMAP may correspond to multiple keypoints in multiple images. Therefore, it is not suitable to follow MVSNet and use common 3D tracks to compute the view score for co-visibility.
Here, we instead use the similarity of image retrieval features to estimate the co-visibility between different images. Download the pretrained model CVPR23_DeitS_Rerank.pth from R2Former. Then we can run:
python colmap_input.py --input_folder="/SCENE_DIR" --output_folder="/SCENE_DIR/mvs" --VGGT --checkpoint="/R2FORMER_CKPT"
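For the VGGT path, the retrieval-based view selection can be pictured with the following simplified sketch. It assumes one global descriptor per image (e.g. extracted with R2Former); the function name, scoring and number of source views are illustrative placeholders, not the exact logic of colmap_input.py:

```python
import numpy as np

def build_view_pairs(features, num_src=10):
    """Illustrative only: rank source views for every reference image by the
    cosine similarity of global image-retrieval descriptors, as a stand-in
    for track-based co-visibility scores."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)  # (N, D)
    sim = feats @ feats.T                                               # (N, N) cosine similarities
    np.fill_diagonal(sim, -np.inf)                                      # never pair an image with itself
    pairs = {}
    for ref in range(len(feats)):
        src_ids = np.argsort(-sim[ref])[:num_src]                       # best-matching source views
        pairs[ref] = [(int(s), float(sim[ref, s])) for s in src_ids]
    return pairs  # {reference_id: [(source_id, score), ...]} for writing pair.txt
```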
COLMAP
python colmap_input.py --input_folder="/SCENE_DIR" --output_folder="/SCENE_DIR/mvs"
After conversion, the directory looks like:
SCENE_DIR/
├── images/
├── sparse/
│   ├── cameras.bin
│   ├── images.bin
│   └── points3D.bin
└── mvs/
    ├── images/
    ├── cams/
    └── pair.txt
Next, we can estimate depth maps and fuse them into a point cloud with our methods. Take CasDiffMVS as an example:
MVS_DIR="/SCENE_DIR/mvs"
CKPT_FILE="./checkpoints/casdiffmvs_blend.ckpt"
OUT_DIR='./outputs_demo'
if [ ! -d $OUT_DIR ]; then
mkdir -p $OUT_DIR
fi
python test.py --dataset=general --batch_size=1 --num_view=7 --method=casdiffmvs --save_depth \
--testpath=$MVS_DIR --numdepth_initial=48 --numdepth=384 \
--loadckpt=$CKPT_FILE --outdir=$OUT_DIR \
--scale 0.0 0.125 0.025 --sampling_timesteps 0 1 1 --ddim_eta 0 1 1 \
--stage_iters 1 3 3 --cost_dim_stage 4 4 4 --CostNum 0 4 4 \
--hidden_dim 0 32 20 --context_dim 32 32 16 --unet_dim 0 16 8 \
--min_radius 0.125 --max_radius 8 \
--geo_pixel_thres 0.125 --geo_depth_thres 0.01 --geo_mask_thres 2 \
--photo_thres 0.3 0.5 0.5
Note that you may want to tune the post-processing hyperparameters, i.e. `geo_pixel_thres`, `geo_depth_thres`, `geo_mask_thres` and `photo_thres`, to get the best reconstruction quality. `geo_pixel_thres`, `geo_depth_thres` and `geo_mask_thres` are used for geometric consistency filtering across different views. `photo_thres` is used for photometric consistency filtering to remove unconfident estimates. For more details, see Sec. 8.2 of this survey.
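As an intuition for how the geometric thresholds act, the check between a reference view and a single source view can be sketched as follows. This is a simplified, per-pixel illustration assuming 3x3 intrinsics `K_ref`/`K_src` and a 4x4 reference-to-source transform `T_ref2src`; the actual fusion code is vectorized and may differ in details:

```python
import numpy as np

def geo_consistent_mask(depth_ref, depth_src, K_ref, K_src, T_ref2src,
                        pixel_thres=0.125, depth_thres=0.01):
    """Illustrative only: a reference pixel is projected into the source view,
    the source depth is looked up and projected back; the pixel passes if both
    the reprojection error (pixels) and the relative depth error are small."""
    H, W = depth_ref.shape
    K_ref_inv, K_src_inv = np.linalg.inv(K_ref), np.linalg.inv(K_src)
    T_src2ref = np.linalg.inv(T_ref2src)
    mask = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            d = depth_ref[y, x]
            if d <= 0:
                continue
            # Reference pixel -> 3D point -> source view.
            p_src = T_ref2src[:3, :3] @ (d * K_ref_inv @ [x, y, 1.0]) + T_ref2src[:3, 3]
            u, v = (K_src @ p_src)[:2] / p_src[2]
            if not (0 <= u < W and 0 <= v < H) or depth_src[int(v), int(u)] <= 0:
                continue
            # Source pixel -> 3D point -> back into the reference view.
            d_src = depth_src[int(v), int(u)]
            p_ref = T_src2ref[:3, :3] @ (d_src * K_src_inv @ [u, v, 1.0]) + T_src2ref[:3, 3]
            u_b, v_b = (K_ref @ p_ref)[:2] / p_ref[2]
            ok_pix = np.hypot(u_b - x, v_b - y) < pixel_thres
            ok_depth = abs(p_ref[2] - d) / d < depth_thres
            mask[y, x] = ok_pix and ok_depth
    # In fusion, a pixel is kept only if enough source views (geo_mask_thres) pass this check.
    return mask
```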
Here is a visual comparison for a scene with 25 images:
- Download the pre-processed datasets provided by PatchmatchNet: DTU's evaluation set, Tanks & Temples and the ETH3D benchmark. Each dataset is organized as follows:
root_directory
├── scan1 (scene_name1)
│   ├── images
│   │   ├── 00000000.jpg
│   │   ├── 00000001.jpg
│   │   └── ...
│   ├── cams_1
│   │   ├── 00000000_cam.txt
│   │   ├── 00000001_cam.txt
│   │   └── ...
│   └── pair.txt
├── scan2 (scene_name2)
└── ...
The camera file `cam.txt` stores the camera parameters, including the extrinsics, intrinsics, minimum depth and maximum depth. `pair.txt` stores the view selection result. For details, check PatchmatchNet. A minimal parsing sketch for both files is given after the results table below.
- Run the corresponding scripts in `scripts/test/` to evaluate on the different datasets.
- We have reproduced the results ourselves with this codebase and the checkpoints that we provide. The results are listed as follows:
Methods | DTU Overall | T&T Intermediate | T&T Advanced | ETH3D Training |
---|---|---|---|---|
DiffMVS | 0.308 | 63.49 | 40.02 | 74.80 |
CasDiffMVS | 0.297 | 65.90 | 41.87 | 76.73 |
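As a reference for working with the `cam.txt` and `pair.txt` files described above, here is a minimal parsing sketch. It assumes the common MVSNet-style layout (an `extrinsic` header followed by a 4x4 matrix, an `intrinsic` header followed by a 3x3 matrix, and a final line starting with the depth range); the exact fields on the last line vary across datasets, so treat this as illustrative rather than the loader used in this codebase:

```python
import numpy as np

def read_cam(path):
    """Illustrative parser for an MVSNet-style cam.txt."""
    lines = [l.strip() for l in open(path) if l.strip()]
    extrinsic = np.array([lines[i].split() for i in range(1, 5)], dtype=np.float64)  # 4x4 world-to-camera
    intrinsic = np.array([lines[i].split() for i in range(6, 9)], dtype=np.float64)  # 3x3 intrinsics
    depth_params = [float(v) for v in lines[9].split()]  # depth range; field count varies by dataset
    return extrinsic, intrinsic, depth_params

def read_pairs(path):
    """Illustrative parser for pair.txt: first line is the number of views; then,
    per view, one line with the reference id and one line with
    'num_src src_id score src_id score ...'."""
    lines = [l.strip() for l in open(path) if l.strip()]
    num_views = int(lines[0])
    pairs = {}
    for i in range(num_views):
        ref = int(lines[1 + 2 * i])
        vals = lines[2 + 2 * i].split()
        pairs[ref] = [(int(vals[j]), float(vals[j + 1])) for j in range(1, len(vals), 2)]
    return pairs
```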
- Download the preprocessed DTU training data and depth maps, unzip them, and organize them as follows:
root_directory
├── Cameras
│ ├── train
│ │ ├── 00000000_cam.txt
│ │ └── ...
│ └── pair.txt
├── Depths_raw
│ ├── scan1
│ │ ├── depth_map_0000.pfm
│ │ ├── depth_visual_0000.png
│ │ └── ...
│ ├── scan2
│ └── ...
└── Rectified
├── scan1_train
│ ├── rect_001_0_r5000.png
│ ├── rect_001_1_r5000.png
│ ├── ...
│ ├── rect_001_6_r5000.png
│ └── ...
├── scan2_train
└── ...
- Download BlendedMVS.
- Train DiffMVS or CasDiffMVS with the scripts in `scripts/train`.
Thanks to these great repositories: MVSNet, MVSNet-pytorch, PatchmatchNet, IterMVS, EffiMVS, denoising-diffusion-pytorch.