3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer
ICCV 2025, Paper at arXiv 2507.23567

News and ToDo

  • 27.08.2025: Added scripts/demo.py and a Hugging Face demo!
  • 25.08.2025: Released code and models.
  • 25.06.2025: 3D-MOOD is accepted at ICCV 2025!

Getting Started

We use Vis4D as the framework to implement 3D-MOOD. Please check its documentation for more details.

Installation

We support Python 3.11+ and PyTorch 2.4.0+. Please install the PyTorch build that matches your hardware (e.g., your CUDA version).

conda create -n opendet3d python=3.11 -y

conda activate opendet3d

# Install Vis4D
# This should also install PyTorch with CUDA support, but please verify.
pip install vis4d==1.0.0

# Install CUDA ops
pip install git+https://github.com/SysCV/vis4d_cuda_ops.git --no-build-isolation --no-cache-dir

# Install 3D-MOOD
pip install -v -e .
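To confirm that the environment is set up correctly (including the CUDA-enabled PyTorch mentioned above), here is a minimal sanity check; it only uses standard Python and PyTorch calls and is not part of the repository.

# Minimal post-install sanity check (sketch, not part of the repository)
import importlib.metadata
import torch

print("vis4d:", importlib.metadata.version("vis4d"))  # expected: 1.0.0
print("torch:", torch.__version__)                    # expected: 2.4.0 or newer
print("CUDA available:", torch.cuda.is_available())   # expected: True on a GPU machine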

Demo

We provide scripts/demo.py to verify that the installation is complete.

python scripts/demo.py

It will save the prediction to assets/demo/output.png.

You can also try the live Hugging Face demo here!
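To inspect the saved prediction programmatically, a minimal sketch using Pillow (assumed to be available in the environment, as it is not a stated dependency):

# Inspect the demo output written by scripts/demo.py (sketch; requires Pillow)
from PIL import Image

img = Image.open("assets/demo/output.png")
print(img.size, img.mode)  # image resolution and color mode
img.show()                 # opens the default image viewer; skip on headless machines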

Data Preparation

We provide the HDF5 files and annotations here for ScanNet v2 and Argoverse 2, as well as the depth ground truth for the Omni3D datasets.

For training and testing with Omni3D, please refer to DATA to set up the Omni3D data.

We also illustrate the coordinate system we use here.

The final data folder should look like this:

REPO_ROOT
├── data
│   ├── omni3d
│   │   └── annotations
│   ├── KITTI_object
│   ├── KITTI_object_depth
│   ├── nuscenes
│   ├── nuscenes_depth
│   ├── objectron
│   ├── objectron_depth
│   ├── SUNRGBD
│   ├── ARKitScenes
│   ├── ARKitScenes_depth
│   ├── hypersim
│   ├── hypersim_depth
│   ├── argoverse2
│   │   ├── annotations
│   │   └── val.hdf5
│   └── scannet
│       ├── annotations
│       └── val.hdf5

By default, our provided configs use HDF5 as the data backend. You can convert each data folder with the script below to generate the HDF5 files.

Note that if you download the provided .hdf5 files from here, you only need to convert the Omni3D datasets to HDF5 yourself.

To be more specific:

cd data

python -m vis4d.data.io.to_hdf5 -p KITTI_object
python -m vis4d.data.io.to_hdf5 -p KITTI_object_depth # Only needed if you generate depth on your own

...

python -m vis4d.data.io.to_hdf5 -p hypersim
python -m vis4d.data.io.to_hdf5 -p hypersim_depth # Only needed if you generate depth on your own

After the conversion, all datasets will be available as .hdf5 files.
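To verify that a converted file is readable before training, a quick check with h5py is sketched below. It assumes the converter writes a <folder>.hdf5 file next to the source folder; adjust the path to whatever file it actually produces.

# Quick sanity check of a converted HDF5 file (sketch; requires h5py)
import h5py

with h5py.File("KITTI_object.hdf5", "r") as f:  # path is an assumption; adjust as needed
    keys = list(f.keys())
    print(len(keys), "top-level entries")
    for key in keys[:5]:                        # print a few entries as a spot check
        print(key)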

Alternatively, you can change the data_backend in the configs to FileBackend.

Model Zoo

Note that the scores for Argoverse 2 and ScanNet are the proposed open detection score (ODS), while the score for the Omni3D test set is AP.

| Backbone | Config | Omni3D | Argoverse 2 | ScanNet |
|----------|--------|--------|-------------|---------|
| Swin-T   | config | 28.4   | 22.4        | 30.2    |
| Swin-B   | config | 30.0   | 23.8        | 31.5    |

For per-dataset results on Omni3D, please refer to Table 3 of the paper.

Testing

# Swin-T
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_t_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-t_120e_omni3d_699f69.pt

# Swin-B 
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt

Training

We train our models with a total batch size of 128. The settings below assume running on a cluster with RTX 4090 GPUs.

# Swin-T
vis4d fit --config opendet3d/zoo/gdino3d/gdino3d_swin_t_omni3d.py --gpus 8 --nodes 4 --config.params.samples_per_gpu=4

# Swin-B 
vis4d fit --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 8 --nodes 8
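For the Swin-T command, the total batch size of 128 comes from 8 GPUs per node × 4 nodes × 4 samples per GPU; for Swin-B, the per-GPU sample count is taken from the config (not overridden above). A trivial sketch of the arithmetic:

# Effective batch size for the Swin-T run above
gpus_per_node = 8
nodes = 4
samples_per_gpu = 4                             # set via --config.params.samples_per_gpu=4
print(gpus_per_node * nodes * samples_per_gpu)  # 128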

ScanNet200

We also provide the code to reproduce our ScanNet200 results from the supplementary material. Note that it takes longer since the classes need to be chunked.

vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_scannet200.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt

Visualization

The following command will dump all visualization results under vis4d-workspace/gdino3d_swin-b_omni3d/${VERSION}/vis/test/.

vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt --vis --config.params.nms=True --config.params.score_threshold=0.1
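To gather the dumped files afterwards (for example, to copy them off a cluster), a small pathlib sketch that only assumes the workspace layout stated above; the run version subdirectory is found by globbing:

# List every file dumped under the visualization directory (sketch)
from pathlib import Path

workspace = Path("vis4d-workspace/gdino3d_swin-b_omni3d")
for path in sorted(workspace.glob("*/vis/test/**/*")):  # "*" matches the run version directory
    if path.is_file():
        print(path)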

Citation

If you find our work useful in your research, please consider citing our publication:

@article{yang20253d,
  title={3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection},
  author={Yang, Yung-Hsu and Piccinelli, Luigi and Segu, Mattia and Li, Siyuan and Huang, Rui and Fu, Yuqian and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
  journal={arXiv preprint arXiv:2507.23567},
  year={2025}
}
