Seonho Lee*, Jiho Choi*, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim†
*: equal contribution, †: corresponding author
Graduate School of Artificial Intelligence, KAIST, Republic of Korea
{glanceyes, jihochoi, rkswlsj, tom919, jshackist, kateshim}@kaist.ac.kr

We propose a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture.
By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs.
- [2025.06.12] Our paper is now available! You can find the paper here.
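To make the idea concrete, the sketch below illustrates the three distillation signals in simplified form. It is an illustrative sketch only: the function names, tensor shapes, and loss formulations are our own assumptions for exposition and do not reproduce the repository's actual implementation.

# Illustrative sketch (not the actual implementation) of the three geometric
# distillation signals: sparse correspondences, relative depth relations, and
# dense cost volumes. Shapes and loss choices here are assumptions.
import torch
import torch.nn.functional as F

def sparse_correspondence_loss(feat_a, feat_b, temperature=0.07):
    # feat_a, feat_b: (N, C) student features sampled at N matched pixels
    # (matches coming from a 3D foundation model such as MASt3R).
    fa = F.normalize(feat_a, dim=-1)
    fb = F.normalize(feat_b, dim=-1)
    logits = fa @ fb.t() / temperature              # (N, N) cross-view similarities
    target = torch.arange(fa.size(0), device=fa.device)
    return F.cross_entropy(logits, target)          # matched pairs lie on the diagonal

def relative_depth_loss(student_depth, teacher_depth, num_pairs=1024, margin=0.0):
    # Preserve the teacher's ordinal (closer/farther) relations between
    # randomly sampled pixel pairs with a pairwise ranking hinge.
    s, t = student_depth.flatten(), teacher_depth.flatten()
    i = torch.randint(0, t.numel(), (num_pairs,), device=t.device)
    j = torch.randint(0, t.numel(), (num_pairs,), device=t.device)
    sign = torch.sign(t[i] - t[j])                  # teacher's depth ordering
    return F.relu(margin - sign * (s[i] - s[j])).mean()

def cost_volume_loss(student_cost, teacher_cost):
    # Match the student's dense cross-view similarity (cost) volume to the
    # teacher's (e.g., VGGT) with a KL divergence over the matching dimension.
    return F.kl_div(
        F.log_softmax(student_cost, dim=-1),
        F.softmax(teacher_cost, dim=-1),
        reduction="batchmean",
    )

# A combined objective would weight the three terms, e.g.
# loss = w1 * corr + w2 * depth + w3 * cost  (weights are hyperparameters).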
# ------------------
# Init conda
# ------------------
conda create -n 3dvlm_gd python=3.10 -y
conda activate 3dvlm_gd
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
# --------------------------
# Install Python packages
# --------------------------
pip install -r requirements.txt
# --------------------------
# Install CroCo / DUSt3R / MASt3R
# --------------------------
# If an error occurs, please refer to each official repository.
pip install -r dust3r/requirements.txt
pip install -r dust3r/requirements_optional.txt
# DUSt3R relies on RoPE positional embeddings, for which you can compile CUDA kernels for faster runtime.
cd dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../
# --------------------------
# Install VGGT
# --------------------------
pip install -r vggt/requirements.txt

The dataset can be downloaded from Objaverse. We use the 10k subset of Objaverse, as in Multiview-Equivariance Finetuning. Please follow the instructions introduced in this section. A scripted-download sketch follows the directory layout below.
After setup, the resulting directory structure should look like the following:
data/
├── objaverse/
│   └── hf-objaverse-v1/
│       └── glbs/
│           ├── 000-000/
│           ├── ...
│           └── 000-159/
├── objaverse_renderings/
│   ├── 000-000/
│   ├── ...
│   └── 000-159/
└── 10k.txt
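If you prefer to script the download, the objaverse Python package can fetch objects by UID. The snippet below is only a sketch: it assumes 10k.txt lists one Objaverse UID per line and that you relocate the downloaded files to match the layout above.

# Sketch: fetch the 10k Objaverse subset with the `objaverse` package
# (pip install objaverse). Assumes data/10k.txt lists one UID per line.
import multiprocessing
import objaverse

with open("data/10k.txt") as f:
    uids = [line.strip() for line in f if line.strip()]

# By default the package downloads .glb files under ~/.objaverse/hf-objaverse-v1/;
# move or symlink them into data/objaverse/ to match the layout above.
objects = objaverse.load_objects(uids=uids, download_processes=multiprocessing.cpu_count())
print(f"downloaded {len(objects)} objects")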
We use ScanNet++ preprocessed by FiT3D for both training and evaluation. To prepare the dataset, please follow the instructions provided in this section, or directly download the preprocessed data from the link.
After downloading, place the data in the data/scannetpp directory. The resulting directory structure should look like the following:
data/
├── {other datasets}
└── scannetpp/
    ├── masks/
    ├── metadata/
    └── scenes/
        ├── 036bce3393
        ├── ...
        └── fe1733741f
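As an optional, illustrative check that the preprocessed data landed in the expected places:

# Optional sanity check for the ScanNet++ layout described above.
from pathlib import Path

root = Path("data/scannetpp")
for sub in ("masks", "metadata", "scenes"):
    print(f"{sub}: {'ok' if (root / sub).is_dir() else 'MISSING'}")
print("scenes found:", len(list((root / "scenes").glob("*"))))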
We use the PF-PASCAL dataset for evaluation. Please follow the instructions in this section.
After setup, put the data/test_pairs_pf_different_views.csv and data/test_pairs_pf_same_views.csv files in the data/PF-dataset-PASCAL directory. The resulting directory structure should look like the following:
data/
└── PF-dataset-PASCAL/
    ├── Annotations/
    ├── JPEGImages/
    ├── test_pairs_pf_different_views.csv
    └── test_pairs_pf_same_views.csv
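Optionally, you can verify the pair files are readable (illustrative check; it makes no assumption about the CSV columns):

# Optional check that the PF-PASCAL pair files are in place.
import pandas as pd

for name in ("test_pairs_pf_different_views.csv", "test_pairs_pf_same_views.csv"):
    df = pd.read_csv(f"data/PF-dataset-PASCAL/{name}")
    print(name, "->", len(df), "pairs; columns:", list(df.columns))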
We use the OnePose-LowTexture dataset for evaluation. Please follow the instructions in this section.
Specifically, download the LowTexture dataset from OnePose++ and rename the directory to data/onepose_lowtexture. Then reconstruct the object point clouds with the following command, as in the OnePose++ repository:
python run.py +preprocess=sfm_inference_lowtexture.yaml use_local_ray=True  # for lowtexture test data

This produces data/sfm_output, and the resulting directory structure should look like the following:
data/
├── lowtexture_test_data/
│   ├── 0700-toyrobot-others
│   ├── ...
│   └── 0748-penboxvert-others
└── sfm_output/
    └── outputs_softmax_loftr_loftr
        ├── 0408-colorbox-box
        ├── ...
        ├── 0748-penboxvert-others
        └── vis3d
We use the TAP-Vid DAVIS dataset for evaluation. Please follow the instructions in this section.
Specifically, download the video dataset from this link and rename the directory to data/davis_480. Make sure the file tapvid_davis_data_strided.pkl is in the data directory. The resulting directory structure should look like the following:
data/
├── davis_480/
│   └── ...
└── tapvid_davis_data_strided.pkl
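To confirm the annotation file loads, a minimal check (the exact schema follows the TAP-Vid release, so nothing beyond loading is assumed here):

# Optional check that the TAP-Vid DAVIS annotation file loads.
import pickle

with open("data/tapvid_davis_data_strided.pkl", "rb") as f:
    data = pickle.load(f)
print(type(data), "with", len(data), "entries")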
python3 src/main.py --config-name {config_name}
# Example
python3 src/main.py --config-name finetune_timm_mast3r_scannetpp

You can modify the configuration files in the config/ directory. The default configuration is finetune_timm_mast3r_scannetpp.yaml, which is used for fine-tuning on the ScanNet++ dataset with MASt3R.
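The --config-name flag suggests a Hydra-style configuration; if so, you can inspect a config programmatically with OmegaConf. This is a sketch and assumes the default config lives at config/finetune_timm_mast3r_scannetpp.yaml:

# Sketch: print the default training config (assumes the path below).
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/finetune_timm_mast3r_scannetpp.yaml")
print(OmegaConf.to_yaml(cfg))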
python evaluate_timm_mast3r.py \
--ckpt {checkpoint path} \
--transfer

Please follow the FiT3D repository for evaluation.
Please follow the Lexicon3D repository for evaluation.
We would like to express our gratitude to the open-source projects and their contributors, including MEF, FiT3D, and Lexicon3D.