Seonho Lee*, Jiho Choi*, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim†
*: equal contribution, †: corresponding author
Graduate School of Artificial Intelligence, KAIST, Republic of Korea
{glanceyes, jihochoi, rkswlsj, tom919, jshackist, kateshim}@kaist.ac.kr

We propose a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture.
By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs.
- [2025.06.12] Our paper is now available! You can find the paper here.
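To make the idea concrete, the sketch below illustrates the three distillation signals in simplified form. It is an illustrative sketch only: the function names, tensor shapes, and loss formulations are our own assumptions for exposition and do not reproduce the repository's actual implementation.

# Illustrative sketch (not the actual implementation) of the three geometric
# distillation signals: sparse correspondences, relative depth relations, and
# dense cost volumes. Shapes and loss choices here are assumptions.
import torch
import torch.nn.functional as F

def sparse_correspondence_loss(feat_a, feat_b, temperature=0.07):
    # feat_a, feat_b: (N, C) student features sampled at N matched pixels
    # (matches coming from a 3D foundation model such as MASt3R).
    fa = F.normalize(feat_a, dim=-1)
    fb = F.normalize(feat_b, dim=-1)
    logits = fa @ fb.t() / temperature              # (N, N) cross-view similarities
    target = torch.arange(fa.size(0), device=fa.device)
    return F.cross_entropy(logits, target)          # matched pairs lie on the diagonal

def relative_depth_loss(student_depth, teacher_depth, num_pairs=1024, margin=0.0):
    # Preserve the teacher's ordinal (closer/farther) relations between
    # randomly sampled pixel pairs with a pairwise ranking hinge.
    s, t = student_depth.flatten(), teacher_depth.flatten()
    i = torch.randint(0, t.numel(), (num_pairs,), device=t.device)
    j = torch.randint(0, t.numel(), (num_pairs,), device=t.device)
    sign = torch.sign(t[i] - t[j])                  # teacher's depth ordering
    return F.relu(margin - sign * (s[i] - s[j])).mean()

def cost_volume_loss(student_cost, teacher_cost):
    # Match the student's dense cross-view similarity (cost) volume to the
    # teacher's (e.g., VGGT) with a KL divergence over the matching dimension.
    return F.kl_div(
        F.log_softmax(student_cost, dim=-1),
        F.softmax(teacher_cost, dim=-1),
        reduction="batchmean",
    )

# A combined objective would weight the three terms, e.g.
# loss = w1 * corr + w2 * depth + w3 * cost  (weights are hyperparameters).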
# ------------------
# Init conda
# ------------------
conda create -n 3dvlm_gd python=3.10 -y
conda activate 3dvlm_gd
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
# --------------------------
# Install Python packages
# --------------------------
pip install -r requirements.txt
# --------------------------
# Install CroCo / DUSt3R / MASt3R
# --------------------------
# If an error occurs, please refer to each official repository.
pip install -r dust3r/requirements.txt
pip install -r dust3r/requirements_optional.txt
# DUSt3R relies on RoPE positional embeddings, for which you can compile CUDA kernels for faster runtime.
cd dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../
# --------------------------
# Install VGGT
# --------------------------
pip install -r vggt/requirements.txt

The dataset can be downloaded from Objaverse. We use the 10k subset of Objaverse, as in Multiview-Equivariance Finetuning. Please follow the instructions introduced in this section. A scripted-download sketch follows the directory layout below.
After setup, the resulting directory structure should look like the following:
data/
├── objaverse/
│   └── hf-objaverse-v1/
│       └── glbs/
│           ├── 000-000/
│           ├── ...
│           └── 000-159/
├── objaverse_renderings/
│   ├── 000-000/
│   ├── ...
│   └── 000-159/
└── 10k.txt
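If you prefer to script the download, the objaverse Python package can fetch objects by UID. The snippet below is only a sketch: it assumes 10k.txt lists one Objaverse UID per line and that you relocate the downloaded files to match the layout above.

# Sketch: fetch the 10k Objaverse subset with the `objaverse` package
# (pip install objaverse). Assumes data/10k.txt lists one UID per line.
import multiprocessing
import objaverse

with open("data/10k.txt") as f:
    uids = [line.strip() for line in f if line.strip()]

# By default the package downloads .glb files under ~/.objaverse/hf-objaverse-v1/;
# move or symlink them into data/objaverse/ to match the layout above.
objects = objaverse.load_objects(uids=uids, download_processes=multiprocessing.cpu_count())
print(f"downloaded {len(objects)} objects")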
We use ScanNet++ preprocessed by FiT3D for both training and evaluation. To prepare the dataset, please follow the instructions provided in this section, or directly download the preprocessed data from the link.
After downloading, place the data in the data/scannetpp directory. The resulting directory structure should look like the following:
data/
├── {other datasets}
└── scannetpp/
    ├── masks/
    ├── metadata/
    └── scenes/
        ├── 036bce3393
        ├── ...
        └── fe1733741f
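As an optional, illustrative check that the preprocessed data landed in the expected places:

# Optional sanity check for the ScanNet++ layout described above.
from pathlib import Path

root = Path("data/scannetpp")
for sub in ("masks", "metadata", "scenes"):
    print(f"{sub}: {'ok' if (root / sub).is_dir() else 'MISSING'}")
print("scenes found:", len(list((root / "scenes").glob("*"))))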
We use the PF-PASCAL dataset for evaluation. Please follow the instructions in this section.
After setup, put the data/test_pairs_pf_different_views.csv and data/test_pairs_pf_same_views.csv files in the data/PF-dataset-PASCAL directory. The resulting directory structure should look like the following:
data/
└── PF-dataset-PASCAL/
    ├── Annotations/
    ├── JPEGImages/
    ├── test_pairs_pf_different_views.csv
    └── test_pairs_pf_same_views.csv
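Optionally, you can verify the pair files are readable (illustrative check; it makes no assumption about the CSV columns):

# Optional check that the PF-PASCAL pair files are in place.
import pandas as pd

for name in ("test_pairs_pf_different_views.csv", "test_pairs_pf_same_views.csv"):
    df = pd.read_csv(f"data/PF-dataset-PASCAL/{name}")
    print(name, "->", len(df), "pairs; columns:", list(df.columns))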
We use the OnePose-LowTexture dataset for evaluation. Please follow the instructions in this section.
Specifically, download the LowTexture dataset from OnePose++ and rename the directory to data/onepose_lowtexture. Then reconstruct the object point clouds with the following command, as in the OnePose++ repository:
python run.py +preprocess=sfm_inference_lowtexture.yaml use_local_ray=True  # for lowtexture test data

This produces data/sfm_output, and the resulting directory structure should look like the following:
data/
├── lowtexture_test_data/
│   ├── 0700-toyrobot-others
│   ├── ...
│   └── 0748-penboxvert-others
└── sfm_output/
    └── outputs_softmax_loftr_loftr
        ├── 0408-colorbox-box
        ├── ...
        ├── 0748-penboxvert-others
        └── vis3d
We use the TAP-Vid DAVIS dataset for evaluation. Please follow the instructions in this section.
Specifically, download the video dataset from this link and rename the directory to data/davis_480. Make sure the file tapvid_davis_data_strided.pkl is in the data directory. The resulting directory structure should look like the following:
data/
├── davis_480/
│   └── ...
└── tapvid_davis_data_strided.pkl
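To confirm the annotation file loads, a minimal check (the exact schema follows the TAP-Vid release, so nothing beyond loading is assumed here):

# Optional check that the TAP-Vid DAVIS annotation file loads.
import pickle

with open("data/tapvid_davis_data_strided.pkl", "rb") as f:
    data = pickle.load(f)
print(type(data), "with", len(data), "entries")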
python3 src/main.py --config-name {config_name}
# Example
python3 src/main.py --config-name finetune_timm_mast3r_scannetpp

You can modify the configuration files in the config/ directory. The default configuration is finetune_timm_mast3r_scannetpp.yaml, which is used for fine-tuning on the ScanNet++ dataset with MASt3R.
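The --config-name flag suggests a Hydra-style configuration; if so, you can inspect a config programmatically with OmegaConf. This is a sketch and assumes the default config lives at config/finetune_timm_mast3r_scannetpp.yaml:

# Sketch: print the default training config (assumes the path below).
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/finetune_timm_mast3r_scannetpp.yaml")
print(OmegaConf.to_yaml(cfg))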
python evaluate_timm_mast3r.py \
--ckpt {checkpoint path} \
--transfer

Please follow the FiT3D repository for evaluation.
Please follow the Lexicon3D repository for evaluation.
We would like to express our gratitude to the open-source projects and their contributors, including MEF, FiT3D, and Lexicon3D.