Gaia Di Lorenzo1 · Federico Tombari3 · Marc Pollefeys1,2 · Dániel Béla Baráth1,3
1ETH Zürich · 2Microsoft · 3Google
Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g., images, point clouds, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding that fuses the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.
- Add code and instructions for evaluation
- Add code and instructions for baseline evaluation
- Release checkpoints and metadata for 3RScan and ScanNet
The code has been tested with the following environment:
- Operating System: Ubuntu
- Architecture: x86_64 GNU/Linux
- Python Version: 3.9.18
- CUDA Version: 12.4
- NVIDIA Driver Version: 550.144.03
- GPU: NVIDIA A100 PCIe 40GB
- Total GPU Memory: Above 40GB
- Create and Activate a Virtual Environment:
python -m venv .venv
source .venv/bin/activate
(Ensure that your virtual environment is activated before proceeding with the installation.)
- Install dependencies from requirements.txt:
pip install -r requirements.txt  # add [--no-deps] if installation causes dependency issues
pip install -r other_deps.txt
- Install dependencies separately:
pip install dependencies/gaussian-splatting
pip install git+https://github.com/nerfstudio-project/gsplat.git  # Needed for evaluation
pip install dependencies/2d-gaussian-splatting  # Needed for baselines evaluation
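Optionally, you can sanity-check that the gsplat package installed above imports cleanly; this one-liner is just a suggested check, not part of the official setup:
python -c "import gsplat; print('gsplat imported successfully')"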
After installing the dependencies, verify your environment:
- Check Python Version:
python --version
(Should output Python 3.9.18.)
- Check CUDA & GPU Availability:
nvidia-smi
(Ensure the GPU is detected and available for computation.)
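If you prefer a check from Python, the one-liner below confirms that a CUDA device is visible to PyTorch; it assumes torch is installed via requirements.txt, which the list above does not state explicitly:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"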
This section outlines the required datasets and their organization within the root directory.
- 3RScan: Download from here and move all files to \<root_dir>/scenes/.
- 3DSSG: Download from here and place all files in \<root_dir>/files/.
- ScanNet: Download from here and move the scenes to \<root_dir>/scenes/.
- Additional Meta Files: Download from this link and move them to \<root_dir>/files/.
After this step, the directory structure should look as follows:
├── <root_dir>
│   ├── files                 <- Meta files and annotations
│   │   ├── <meta_files_0>
│   │   ├── <meta_files_1>
│   │   └── ...
│   ├── scenes                <- Scans (3RScan/ScanNet)
│   │   ├── <id_scan_0>
│   │   ├── <id_scan_1>
│   │   └── ...
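Before preprocessing, a quick check like the sketch below can confirm that the two top-level folders are in place; ROOT_DIR is a placeholder to replace with your actual root directory:
ROOT_DIR=/path/to/root_dir  # placeholder; adjust to your setup
for d in "$ROOT_DIR/files" "$ROOT_DIR/scenes"; do
  [ -d "$d" ] && echo "found:   $d" || echo "missing: $d"
done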
To generate labels.instances.align.annotated.v2.ply for each 3RScan scan, refer to the 3DSSG Data Processing repository.
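To spot scans that are still missing the aligned mesh, a check along these lines can help; it assumes the generated ply is stored directly inside each scan folder, which may differ in your setup:
# Lists 3RScan scan folders that do not yet contain the aligned ply
# (reuses the ROOT_DIR placeholder from the layout check above; path layout is an assumption)
find "$ROOT_DIR/scenes" -mindepth 1 -maxdepth 1 -type d \
  ! -exec test -f '{}/labels.instances.align.annotated.v2.ply' \; -print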
- The preprocessing code for 3RScan is located in the dependencies/VLSG directory.
- Ensure the following environment variables are set (example exports are shown after this list):
  - VLSG_SPACE: the repository path
  - DATA_ROOT_DIR: the path to the downloaded dataset (i.e., root_dir)
  - CONDA_BIN: .venv/bin (as set up during installation)
- Execute the preprocessing script:
cd dependencies/VLSG && bash scripts/preprocess/scan3r_data_preprocess.sh
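For reference, setting the variables and launching the preprocessing could look like the following; all paths are placeholders for your own setup:
export VLSG_SPACE=/path/to/this/repository  # placeholder: where you cloned this repository
export DATA_ROOT_DIR=/path/to/root_dir      # placeholder: the dataset root (root_dir)
export CONDA_BIN="$VLSG_SPACE/.venv/bin"    # assumes the .venv from the installation step lives in the repository root
cd dependencies/VLSG && bash scripts/preprocess/scan3r_data_preprocess.sh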
Run the following command to generate pixel-wise and patch-level ground truth annotations:
cd dependencies/VLSG && bash scripts/gt_annotations/scan3r_gt_annotations.sh
Precompute patch-level features using DINOv2:
cd dependencies/VLSG && bash scripts/features2D/scan3r_dinov2.sh
Run the following command to generate featured voxel annotations:
bash scripts/voxel_annotations/voxelise_features.sh --split {split}
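To run it for every split in one go, a small loop works; the split names train and val are assumptions, so use whichever splits your metadata files define:
for split in train val; do  # assumed split names; adjust as needed
  bash scripts/voxel_annotations/voxelise_features.sh --split "$split"
done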
# 2.5.1 Generating Subscenes Annotations (Optional)
# Follow the instructions from SGAligner at dependencies/sgaligner to create subscenes annotations
# Then run the following command:
# bash scripts/voxel_annotations/voxelise_features_scene_alignment.sh --split {split}
Generate Gaussian splat annotations using the following commands:
bash scripts/gs_annotations/map_to_colmap.sh --split {split}
bash scripts/gs_annotations/annotate_gaussian.sh --split {split}
ScanNet requires scene graph annotations generated using SceneGraphFusion.
- Download the pretrained model from here and move it to dependencies/SCENE-GRAPH-FUSION/.
- Build SceneGraphFusion by following the instructions in its repository.
Run the following commands:
python preprocessing/scene_graph_anno/scenegraphfusion_prediction.py
python preprocessing/scene_graph_anno/scenegraphfusion2scan3r.py
Generate ground truth annotations with:
python preprocessing/gt_anno/scannet_obj_projector.py
Run the following script to generate featured voxel annotations:
bash scripts/voxel_annotations/voxelise_features_scannet.sh
To generate Gaussian splat annotations, execute:
bash scripts/gs_annotations/map_to_colmap_scannet.sh
bash scripts/gs_annotations/annotate_gaussian_scannet.sh
After the above preprocessing, the directory structure should look as follows:
├── <root_dir>
│   ├── files                      <- Meta files and annotations
│   │   ├── Features2D             <- (Step 2.4)
│   │   ├── gt_projection          <- (Step 2.3)
│   │   ├── orig                   <- (Step 2.1)
│   │   ├── patch_anno             <- (Step 2.3)
│   │   ├── gs_annotations         <- (Step 2.5/2.6)
│   │   ├── gs_annotations_scannet <- (Step 3.4/3.5)
│   │   ├── <meta_files_0>
│   │   ├── <meta_files_1>
│   │   └── ...
│   ├── scenes                     <- Scans (3RScan/ScanNet)
│   │   ├── <id_scan_0>
│   │   ├── <id_scan_1>
│   │   └── ...
│   ├── scene_graph_fusion         <- (Step 3.1)
│   │   ├── <id_scan_0>
│   │   ├── <id_scan_1>
│   │   └── ...
│   └── out                        <- (Step 2.5.1)
│       ├── files ...
│       └── scenes ...
Refer to TRAIN.md for training instructions.
@misc{dilorenzo2025objectxlearningreconstructmultimodal,
title={Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations},
author={Gaia Di Lorenzo and Federico Tombari and Marc Pollefeys and Daniel Barath},
year={2025},
eprint={2506.04789},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.04789},
}
In this project we use (parts of) the official implementations of the following works:
- SceneGraphLoc
- Trellis
- SGAligner
- GSplat
- 2D Gaussian Splatting
- 3DSSG
- SceneGraphFusion