DepthCLIP is a research framework that explores open-vocabulary semantic segmentation using depth maps only, leveraging contrastive learning to align depth-based features with frozen CLIP text and image embeddings. This work was developed as part of Ryan Jin’s senior thesis at Yale University (advised by Prof. Alex Wong, Yale Vision Lab).
While CLIP has shown strong zero-shot generalization on RGB imagery, its extension to depth data — crucial for low-light, texture-poor, or RGB-scarce environments — remains underexplored. DepthCLIP bridges this gap with a ResNet-UNet architecture and a hybrid contrastive loss that combines pixel-text and area-image alignment.
Refer to DepthCLIP.pdf for a more comprehensive report.
- Depth-only open-vocabulary segmentation – No RGB data required.
- ResNet-based UNet with ASPP for multi-scale context.
- Contrastive learning with:
  - Pixel-text alignment (InfoNCE loss against CLIP text embeddings).
  - Area-image alignment (aligns object crops with CLIP image embeddings).
  - Smoothness regularization for spatial consistency.
- Curriculum-based distractor sampling – progressively harder negative samples (see the sketch after this list).
- Equivalence-aware evaluation metrics – handles near-synonymous labels with top-k pixel accuracy and mIoU.
- Distributed, mixed-precision training with PyTorch DDP.
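As a rough illustration of the curriculum-based distractor sampling idea, the sketch below draws negative text embeddings whose similarity to the anchor class increases as training progresses. The function name, temperature schedule, and sample count are hypothetical, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def sample_distractors(anchor_text_emb, candidate_text_embs, step, total_steps, k=32):
    """Hypothetical curriculum sampler: early in training negatives are drawn
    nearly uniformly; later, sampling concentrates on the candidates most
    similar to the anchor class (harder negatives). In practice the
    ground-truth class itself would be excluded from the candidate pool."""
    # Cosine similarity between the anchor label embedding and all candidate labels.
    sims = F.cosine_similarity(anchor_text_emb.unsqueeze(0), candidate_text_embs, dim=-1)
    # Anneal the softmax temperature: high early (near-uniform), low late (peaked).
    progress = step / max(total_steps, 1)
    temperature = 5.0 * (1.0 - progress) + 0.1 * progress
    probs = torch.softmax(sims / temperature, dim=0)
    # Sample k distractor labels without replacement according to the schedule.
    idx = torch.multinomial(probs, num_samples=k, replacement=False)
    return candidate_text_embs[idx]
```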
Evaluated on the SUN RGB-D dataset (depth maps only):
- 85% Top-5 Pixel Accuracy
- 67% Top-5 mIoU
- 27% Standard mIoU
DepthCLIP demonstrates robust segmentation in low-light and texture-poor environments, narrowing the gap with RGB-based methods under challenging conditions.
Below are sample visualizations of the RGB image (not fed into the model at inference; shown only to convey the scene), the depth map (the model's input), the ground-truth segmentation, and the predicted segmentation.
The architecture, sketched in code after this list, consists of:
- Input: 256×256 depth maps
- Encoder-Decoder: ResNet-18 backbone with ASPP
- Two contrastive pathways:
  - Pixel embeddings aligned to candidate text embeddings (CLIP ViT-B/32).
  - Object-level area embeddings aligned to CLIP image embeddings of cropped regions.
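A minimal sketch of this design follows: a ResNet-18 encoder adapted to single-channel depth input, an ASPP bottleneck, and a light decoder producing per-pixel embeddings in the CLIP ViT-B/32 text-embedding space (512-d). It omits the UNet skip connections and the area-embedding head for brevity; module names and the decoder layout are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torchvision

class DepthSegmentationNet(nn.Module):
    """Illustrative ResNet-18 + ASPP encoder-decoder over 256x256 depth maps."""

    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Adapt the stem to 1-channel depth input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep everything up to the last residual stage: (B, 512, 8, 8) for 256x256 input.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # ASPP: parallel dilated convolutions for multi-scale context.
        self.aspp = nn.ModuleList([
            nn.Conv2d(512, 128, 3, padding=d, dilation=d) for d in (1, 6, 12, 18)
        ])
        # Light decoder projecting into the CLIP embedding space at full resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(4 * 128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, embed_dim, 1),
            nn.Upsample(size=(256, 256), mode="bilinear", align_corners=False),
        )

    def forward(self, depth):                   # depth: (B, 1, 256, 256)
        feats = self.encoder(depth)             # (B, 512, 8, 8)
        multi_scale = torch.cat([branch(feats) for branch in self.aspp], dim=1)
        return self.decoder(multi_scale)        # (B, embed_dim, 256, 256)
```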
Both pathways are optimized jointly via:
L_total = W_t * L_text + W_i * L_image + W_s * L_smooth
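The snippet below is a hedged sketch of how such a combined objective can be computed: an InfoNCE term between per-pixel embeddings and candidate CLIP text embeddings, a cosine alignment term between pooled area embeddings and CLIP image embeddings of object crops, and an edge-aware smoothness term. The weights, temperature, and edge weighting are placeholders, not the values used in the thesis.

```python
import torch
import torch.nn.functional as F

def total_loss(pixel_emb, text_emb, labels, area_emb, clip_image_emb, depth,
               w_t=1.0, w_i=0.5, w_s=0.1, tau=0.07):
    """Sketch of L_total = W_t * L_text + W_i * L_image + W_s * L_smooth."""
    B, D, H, W = pixel_emb.shape

    # Pixel-text InfoNCE: every pixel embedding is scored against all candidate labels.
    pix = F.normalize(pixel_emb.permute(0, 2, 3, 1).reshape(-1, D), dim=-1)   # (B*H*W, D)
    txt = F.normalize(text_emb, dim=-1)                                       # (C, D)
    logits = pix @ txt.t() / tau                                              # (B*H*W, C)
    l_text = F.cross_entropy(logits, labels.reshape(-1), ignore_index=-1)

    # Area-image alignment: pooled object embeddings vs. CLIP image embeddings of crops.
    l_image = 1.0 - F.cosine_similarity(
        F.normalize(area_emb, dim=-1), F.normalize(clip_image_emb, dim=-1), dim=-1
    ).mean()

    # Smoothness: penalize embedding gradients, down-weighted across depth edges.
    dx = (pixel_emb[..., :, 1:] - pixel_emb[..., :, :-1]).abs().mean(1)
    dy = (pixel_emb[..., 1:, :] - pixel_emb[..., :-1, :]).abs().mean(1)
    wx = torch.exp(-(depth[:, 0, :, 1:] - depth[:, 0, :, :-1]).abs())
    wy = torch.exp(-(depth[:, 0, 1:, :] - depth[:, 0, :-1, :]).abs())
    l_smooth = (dx * wx).mean() + (dy * wy).mean()

    return w_t * l_text + w_i * l_image + w_s * l_smooth
```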
- Clone the repo:
  `git clone https://github.com/jinryan/DepthCLIP.git`
  `cd DepthCLIP`
- Install dependencies:
  `pip install -r requirements.txt`
  Key dependencies: `torch`, `torchvision`, `transformers`, `matplotlib`, `tensorboard`, and `nltk` (for label cleaning). A short example of computing CLIP text embeddings with `transformers` follows the setup steps.
- (Optional) Set up SUN RGB-D:
  - Download the dataset from SUN RGB-D.
  - Preprocess with the provided scripts to unify labels and generate equivalence sets.
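For reference, CLIP ViT-B/32 text embeddings for candidate labels can be obtained with the `transformers` dependency roughly as follows; the prompt template and label list here are illustrative, not taken from the preprocessing scripts.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Load the CLIP ViT-B/32 checkpoint referenced above.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

labels = ["chair", "table", "bed", "wall"]          # example candidate classes
prompts = [f"a photo of a {c}" for c in labels]     # prompt wording is a guess
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)    # (num_labels, 512)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
```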
To train, run `bash train_segmentation_model.sh`.
The script supports DistributedDataParallel (DDP) and mixed-precision training.
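A skeletal version of such a DDP + mixed-precision loop is shown below; it assumes the process group has already been initialized (e.g. via `torchrun`) and uses a placeholder `compute_loss`, so it is not the repository's `train.py`.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_loop(model, optimizer, loader, device, epochs=1):
    """Illustrative DDP + AMP training loop with a placeholder loss function."""
    model = DDP(model.to(device), device_ids=[device.index])
    scaler = torch.cuda.amp.GradScaler()
    for _ in range(epochs):
        for depth, labels in loader:
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():              # mixed-precision forward pass
                pixel_emb = model(depth.to(device))
                loss = compute_loss(pixel_emb, labels.to(device))  # placeholder
            scaler.scale(loss).backward()                # scaled backward to avoid underflow
            scaler.step(optimizer)
            scaler.update()
```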
The training script also performs evaluation and reports standard, top-k, and equivalence-aware metrics (pixel accuracy and mIoU).
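The equivalence-aware top-k metric can be pictured as follows: a pixel counts as correct if any of its top-k predicted classes lies in the ground-truth label's equivalence set of near-synonyms. The sketch below is illustrative; the mapping format and function name are assumptions, not `evaluation.py`'s API.

```python
import torch

def topk_pixel_accuracy(logits, target, equivalence, k=5):
    """Equivalence-aware top-k pixel accuracy over flattened pixels.

    logits:      (N_pixels, num_classes) per-pixel class scores
    target:      (N_pixels,) ground-truth class ids (negative = unlabeled)
    equivalence: dict mapping a class id to the set of acceptable class ids
    """
    topk = logits.topk(k, dim=-1).indices                  # (N_pixels, k)
    correct, valid = 0, 0
    for preds, gt in zip(topk.tolist(), target.tolist()):
        if gt < 0:                                         # skip unlabeled pixels
            continue
        valid += 1
        acceptable = equivalence.get(gt, {gt})
        if any(p in acceptable for p in preds):
            correct += 1
    return correct / max(valid, 1)
```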
DepthCLIP/src/depth_segmentation_model
├── dataloader.py
├── datasets.py
├── evaluation.py
├── log.py
├── model.py
├── train.py
├── train_util.py
└── validate.py
- Experiment with Vision Transformer (ViT) based architectures (e.g., SegFormer, Mask2Former) for better long-range context.
- Incorporate ontological hierarchies (WordNet, ConceptNet) for label reasoning.
- Explore automatic prompt optimization (reinforcement learning, multi-prompt ensembling).
- Optimize for real-time inference (distillation, quantization, pruning).
- Extend beyond indoor SUN RGB-D to robotics, outdoor, and AR/VR domains.
If you use DepthCLIP in your research, please cite:
@misc{jin2025depthclip,
  title={DepthCLIP: Open-Vocabulary Segmentation on Depth Maps via Contrastive Learning},
  author={Ryan Jin},
  year={2025},
  institution={Yale University},
  note={Senior Thesis, advised by Alex Wong}
}
MIT License.