Jianyun Xu†, Song Wang†, Ziqian Ni†, Chunyong Hu, Sheng Yang*, Jianke Zhu, Qiang Li
This is the official implementation of SAM4D: Segment Anything in Camera and LiDAR Streams (ICCV2025) [Paper] [Project Page].
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg dataset, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
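To make the ego-motion compensation that MCMA builds on more concrete, below is a minimal NumPy sketch (illustrative only, not the model code) that warps a previous frame's LiDAR points into the current ego frame using the per-frame lidar2world poses provided with the dataset; the function name and array shapes are assumptions.

# Illustrative sketch of ego-motion compensation (not the SAM4D model code):
# warp points observed in a previous LiDAR frame into the current LiDAR frame
# using the 4x4 lidar2world poses stored in meta_infos.
import numpy as np

def warp_to_current_frame(prev_points, prev_lidar2world, cur_lidar2world):
    # Relative pose mapping the previous LiDAR frame into the current one
    rel = np.linalg.inv(cur_lidar2world) @ prev_lidar2world            # 4x4
    pts_h = np.concatenate([prev_points, np.ones((len(prev_points), 1))], axis=1)
    return (pts_h @ rel.T)[:, :3]                                      # Nx3 in the current frame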
Please ensure that your Linux system has CUDA 12.1 or above installed, along with nvcc.
# If you don't have it, you may install CUDA via:
wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
sudo sh cuda_12.3.2_545.23.08_linux.run \
--silent \
--toolkit \
--toolkitpath=/usr/local/cuda-12.3 \
--no-opengl-libs \
--override
- Create a conda environment and install torch
conda create -n sam4d python=3.8
conda activate sam4d
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
- Clone the repo and install dependencies
git clone https://github.com/CN-ADLab/SAM4D.git && cd SAM4D
pip install -r requirements.txt
python setup.py develop
Note: If you encounter a google/dense_hash_map issue when installing torchsparse, please install sparsehash first via:
sudo apt update && sudo apt install libsparsehash-dev
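After installation, a quick sanity check (a minimal sketch, assuming the steps above completed without errors) confirms that PyTorch was built against CUDA 12.1 and can see a GPU:

# Verify the PyTorch + CUDA 12.1 installation.
import torch

print(torch.__version__)          # expected: 2.3.1+cu121
print(torch.version.cuda)         # expected: 12.1
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine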
Download the Waymo-4DSeg dataset from here: modelscope. You can put the extracted example data from Waymo-4DSeg/samples.zip into ./data.
The data structure is as follows:
${dataset}
├── meta_infos
│   └── ${sequence_name}.pkl
├── pcds
│   └── ${sequence_name}
│       ├── {timestamp1}.npz
│       ├── {timestamp2}.npz
│       └── ...
├── sam4d_labels (optional)
│   └── ${sequence_name}
│       ├── {timestamp1}.json
│       ├── {timestamp2}.json
│       └── ...
└── undistort_images
    └── ${sequence_name}
        ├── ${timestamp1}
        │   ├── ${cam_name}.jpg
        │   └── ...
        ├── ${timestamp2}
        │   ├── ${cam_name}.jpg
        │   └── ...
        └── ...

meta_infos: a pickle file containing meta information of the sequence
# meta_infos/${sequence_name}.pkl structure:
from typing import Dict, List, Tuple, Union

MetaInfoType = Dict[str, Union[
    str,
    List[Dict[str, Union[
        Dict[str, Dict[str, Union[
            str,
            List[List[float]],
            None
        ]]],
        Dict[str, str],
        List[List[float]]
    ]]]
]]

example_meta_info: MetaInfoType = {
    'seq_name': 'your_sequence_name',
    'frames': [
        {
            'cams_info': {
                'your_cam_name': {
                    'data_path': 'undistort_images/your_sequence_name/your_timestamp/your_cam_name.jpg',  # path to image
                    'camera_intrinsics': [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],  # 3x3 matrix
                    'camera2lidar': [[...], [...], [...], [...]]  # 4x4 matrix
                },
                'your_cam_name2': {...}
            },
            'path': {
                'pcd': 'pcds/your_sequence_name/your_timestamp.npz',  # path to point cloud
            },
            'lidar2world': [[...], [...], [...], [...]]  # 4x4 matrix
        }
    ]
}

You can follow data/visualize_sam4d_labels.ipynb to explore the dataset. Make sure you have extracted samples.zip to ./data.
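To see how these entries fit together, here is a minimal sketch (not the official loader used in the notebook) that projects the first frame's LiDAR points into one camera using camera2lidar and camera_intrinsics; the 'points' key inside the .npz archive and the ./data root are assumptions, so check the actual array names first.

# Minimal sketch: project LiDAR points of the first frame into one camera image.
# Assumptions: data extracted under ./data, and the point cloud stored under a
# 'points' key in the .npz archive (inspect arc.files to confirm).
import pickle
import numpy as np

with open('data/meta_infos/your_sequence_name.pkl', 'rb') as f:
    meta = pickle.load(f)

frame = meta['frames'][0]
cam_name, cam = next(iter(frame['cams_info'].items()))
K = np.array(cam['camera_intrinsics'])                     # 3x3 intrinsics
lidar2cam = np.linalg.inv(np.array(cam['camera2lidar']))   # invert camera->LiDAR extrinsics

arc = np.load('data/' + frame['path']['pcd'])
pts = arc['points'][:, :3]                                 # assumed key; see arc.files
pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
pts_cam = (pts_h @ lidar2cam.T)[:, :3]                     # LiDAR frame -> camera frame
keep = pts_cam[:, 2] > 0                                   # keep points in front of the camera
uv = pts_cam[keep] @ K.T
uv = uv[:, :2] / uv[:, 2:3]                                # pixel coordinates in cam_name's image
print(cam_name, uv.shape)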
Download the model checkpoints from: modelscope.
Please follow notebooks/sam4d_predictor_example.ipynb step by step to proceed. Make sure you have extracted samples.zip to ./data.
We gratefully acknowledge the developers of the following open-source projects and datasets, whose foundational tools enabled our research: SAM2, GroundingDINO, Grounded-SAM-2, Waymo Open Dataset, VDBFusion, among others.
@article{xu2025sam4d,
  title={SAM4D: Segment Anything in Camera and LiDAR Streams},
  author={Xu, Jianyun and Wang, Song and Ni, Ziqian and Hu, Chunyong and Yang, Sheng and Zhu, Jianke and Li, Qiang},
  journal={arXiv preprint arXiv:2506.21547},
  year={2025}
}