3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer
ICCV 2025, Paper at arXiv 2507.23567
- 27.08.2025: Add scripts/demo.py and Hugging Face demo!
- 25.08.2025: Release code and models.
- 25.06.2025: 3D-MOOD is accepted at ICCV 2025!
We use Vis4D as the framework to implement 3D-MOOD. Please check its documentation for more details.
We support Python 3.11+ and PyTorch 2.4.0+. Please install the PyTorch version that matches your hardware.
conda create -n opendet3d python=3.11 -y
conda activate opendet3d
# Install Vis4D
# This should also install PyTorch with CUDA support, but please verify.
pip install vis4d==1.0.0
# Install CUDA ops
pip install git+https://github.com/SysCV/vis4d_cuda_ops.git --no-build-isolation --no-cache-dir
# Install 3D-MOOD
pip install -v -e .
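After installation, you can do a quick sanity check that PyTorch was installed with CUDA support and that the packages import. This is a minimal sketch, not part of the repository:

# quick installation check (hypothetical snippet, not shipped with the repo)
import torch
import vis4d  # the import succeeding confirms Vis4D is installed

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())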
We provide scripts/demo.py to test whether the installation is complete.
python scripts/demo.py
It will save the prediction to assets/demo/output.png.
You can also try the live demo here!
We provide the HDF5 files and annotations for ScanNet v2 and Argoverse 2, as well as the depth GT for the Omni3D datasets, here.
For training and testing with Omni3D, please refer to DATA to set up the Omni3D data.
We also illustrate the coordinate system we use here.
The final data folder should look like:
REPO_ROOT
├── data
│ ├── omni3d
│ │ └── annotations
│ ├── KITTI_object
│ ├── KITTI_object_depth
│ ├── nuscenes
│ ├── nuscenes_depth
│ ├── objectron
│ ├── objectron_depth
│ ├── SUNRGBD
│ ├── ARKitScenes
│ ├── ARKitScenes_depth
│ ├── hypersim
│ ├── hypersim_depth
│ ├── argoverse2
│ │ ├── annotations
│ │ └── val.hdf5
│ └── scannet
│ ├── annotations
│ └── val.hdf5
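To quickly verify the layout above, you can run a small check such as the following. This is a minimal sketch (the helper name and the expected folders are illustrative; adjust them to what you actually downloaded):

# check_data_layout.py (hypothetical helper, not part of the repo)
from pathlib import Path

REPO_ROOT = Path(".")  # assumption: run from the repository root
expected = [
    "data/omni3d/annotations",
    "data/KITTI_object",
    "data/nuscenes",
    "data/objectron",
    "data/SUNRGBD",
    "data/ARKitScenes",
    "data/hypersim",
    "data/argoverse2/annotations",
    "data/scannet/annotations",
]
for rel in expected:
    status = "OK     " if (REPO_ROOT / rel).exists() else "MISSING"
    print(f"{status} {rel}")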
By default, our provided configs use HDF5 as the data backend.
You can convert each data folder with the vis4d.data.io.to_hdf5 script to generate the HDF5 files.
Note that if you download the provided .hdf5 files from here, you only need to convert the Omni3D datasets to HDF5.
To be more specific:
cd data
python -m vis4d.data.io.to_hdf5 -p KITTI_object
python -m vis4d.data.io.to_hdf5 -p KITTI_object_depth # Only needed if you generate depth on your own
...
python -m vis4d.data.io.to_hdf5 -p hypersim
python -m vis4d.data.io.to_hdf5 -p hypersim_depth # Only needed if you generate depth on your own
Then you will have all datasets in .hdf5 format.
Alternatively, you can change the data_backend in the configs to FileBackend.
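If you want to sanity-check a converted (or downloaded) HDF5 file, you can open it with h5py. A minimal sketch, assuming h5py is installed and using one of the provided files as an example path:

# verify_hdf5.py (hypothetical helper, not part of the repo)
import h5py

path = "data/scannet/val.hdf5"  # e.g. one of the provided files, or your converted output

with h5py.File(path, "r") as f:
    print("Top-level keys:", list(f.keys())[:10])
    counter = {"datasets": 0}

    def _count(name, obj):
        # Count every stored dataset (i.e., encoded file) in the archive.
        if isinstance(obj, h5py.Dataset):
            counter["datasets"] += 1

    f.visititems(_count)
    print("Stored items:", counter["datasets"])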
Note that the scores for Argoverse 2 and ScanNet are the proposed open detection score (ODS), while the score for the Omni3D test set is AP.
| Backbone | Config | Omni3D | Argoverse 2 | ScanNet |
| --- | --- | --- | --- | --- |
| Swin-T | config | 28.4 | 22.4 | 30.2 |
| Swin-B | config | 30.0 | 23.8 | 31.5 |
For per-dataset results on Omni3D, please refer to Table 3 of the paper.
# Swin-T
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_t_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-t_120e_omni3d_699f69.pt
# Swin-B
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt
We use a batch size of 128 to train our models.
The settings below assume training on a cluster with RTX 4090 GPUs.
# Swin-T
vis4d fit --config opendet3d/zoo/gdino3d/gdino3d_swin_t_omni3d.py --gpus 8 --nodes 4 --config.params.samples_per_gpu=4
# Swin-B
vis4d fit --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 8 --nodes 8
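For reference, the effective batch size is GPUs per node × nodes × samples per GPU. A minimal sketch of the arithmetic (the Swin-B per-GPU sample count of 2 is an assumption about the default config, since the command above does not override it):

# effective batch size arithmetic (illustrative only)
def effective_batch_size(gpus_per_node: int, nodes: int, samples_per_gpu: int) -> int:
    return gpus_per_node * nodes * samples_per_gpu

print(effective_batch_size(8, 4, 4))  # Swin-T command above -> 128
print(effective_batch_size(8, 8, 2))  # Swin-B command above -> 128 (samples_per_gpu assumed)

If you train with fewer GPUs, adjust --config.params.samples_per_gpu so the product stays at 128 (memory permitting).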
We also provide the code to reproduce our ScanNet200 results in the supplementary material. Note that it will take longer since we need to chunk the classes.
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_scannet200.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt
The following command will dump all the visualization results under vis4d-workspace/gdino3d_swin-b_omni3d/${VERSION}/vis/test/:
vis4d test --config opendet3d/zoo/gdino3d/gdino3d_swin_b_omni3d.py --gpus 1 --ckpt https://huggingface.co/RoyYang0714/3D-MOOD/resolve/main/gdino3d_swin-b_120e_omni3d_834c97.pt --vis --config.params.nms=True --config.params.score_threshold=0.1
If you find our work useful in your research, please consider citing our publication:
@article{yang20253d,
title={3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection},
author={Yang, Yung-Hsu and Piccinelli, Luigi and Segu, Mattia and Li, Siyuan and Huang, Rui and Fu, Yuqian and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
journal={arXiv preprint arXiv:2507.23567},
year={2025}
}