
DepthCLIP: Open-Vocabulary Segmentation on Depth Maps via Contrastive Learning

DepthCLIP is a research framework that explores open-vocabulary semantic segmentation using depth maps only, leveraging contrastive learning to align depth-based features with frozen CLIP text and image embeddings. This work was developed as part of Ryan Jin’s senior thesis at Yale University (advised by Prof. Alex Wong, Yale Vision Lab).

While CLIP has shown strong zero-shot generalization on RGB imagery, its extension to depth data — crucial for low-light, texture-poor, or RGB-scarce environments — remains underexplored. DepthCLIP bridges this gap with a ResNet-UNet architecture and a hybrid contrastive loss that combines pixel-text and area-image alignment.

Refer to DepthCLIP.pdf for a more comprehensive report.


Features

  • Depth-only open-vocabulary segmentation – No RGB data required.
  • ResNet-based UNet with ASPP for multi-scale context.
  • Contrastive learning with:
    • Pixel-text alignment (InfoNCE loss against CLIP text embeddings; a sketch follows this list).
    • Area-image alignment (align object crops with CLIP image embeddings).
    • Smoothness regularization for spatial consistency.
  • Curriculum-based distractor sampling – progressively harder negative samples.
  • Equivalence-aware evaluation metrics – handles near-synonymous labels with top-k pixel accuracy and mIoU.
  • Distributed, mixed-precision training with PyTorch DDP.
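
For concreteness, here is a minimal sketch of an InfoNCE-style pixel-text alignment loss under assumed shapes: pixel_emb holds sampled per-pixel embeddings from the depth encoder, and text_emb stacks frozen CLIP text embeddings for the candidate labels (ground truth plus sampled distractors). The function name, shapes, and temperature are illustrative assumptions, not the repository's API.

import torch
import torch.nn.functional as F

def pixel_text_infonce(pixel_emb, text_emb, target_idx, temperature=0.07):
    """InfoNCE alignment of pixel embeddings to candidate CLIP text embeddings.

    pixel_emb:  (N, D) embeddings of N sampled pixels from the depth encoder
    text_emb:   (K, D) frozen CLIP text embeddings (true labels + distractors)
    target_idx: (N,)   index of each pixel's ground-truth label in text_emb
    """
    pixel_emb = F.normalize(pixel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pixel_emb @ text_emb.t() / temperature  # (N, K) scaled cosine similarities
    return F.cross_entropy(logits, target_idx)

Curriculum-based distractor sampling would control which rows of text_emb appear as negatives, moving from easy distractors to semantically close classes as training progresses.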

Results

Evaluated on the SUN RGB-D dataset (depth maps only):

  • 85% Top-5 Pixel Accuracy
  • 67% Top-5 mIoU
  • 27% Standard mIoU

DepthCLIP demonstrates robust segmentation in low-light and texture-poor environments, narrowing the gap with RGB-based methods under challenging conditions.

Below are sample visualizations of the RGB image (shown only for context; it is not fed to the model at inference), the depth map (the model's input), the ground-truth segmentation, and the predicted segmentation.

(Figures: Sample 4, Sample 11)

Motivation for This Research: Impact of Low Light on the Performance of RGB Segmentation Models

(Figure: sample_004619_variation)

Model Architecture

The architecture consists of:

  • Input: 256×256 depth maps
  • Encoder-Decoder: ResNet-18 backbone with ASPP
  • Two contrastive pathways (the frozen CLIP targets are sketched after this list):
    • Pixel embeddings aligned to candidate text embeddings (CLIP ViT-B/32).
    • Object-level area embeddings aligned to CLIP image embeddings of cropped regions.
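
For reference, a hedged sketch of how the frozen CLIP ViT-B/32 targets for the two pathways can be obtained with the transformers library. The prompt template and the use of image crops (available only during training) are assumptions, not necessarily the exact pipeline used here.

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def candidate_text_embeddings(class_names):
    """Frozen CLIP text targets for the pixel-text pathway (assumed prompt template)."""
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    return clip.get_text_features(**inputs)   # (K, 512) for ViT-B/32

@torch.no_grad()
def crop_image_embeddings(crops):
    """Frozen CLIP image targets for the area-image pathway (crops assumed to be
    PIL images of object regions, used only at training time)."""
    inputs = processor(images=crops, return_tensors="pt")
    return clip.get_image_features(**inputs)  # (M, 512)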

Both pathways are optimized jointly via:

L_total = W_t * L_text + W_i * L_image + W_s * L_smooth
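
A minimal sketch of assembling this combined objective. The smoothness term below is a simple first-difference penalty on the dense embedding map; the exact form of the regularizer and the weight values are assumptions.

import torch

def smoothness_loss(emb_map):
    """Spatial smoothness on the dense pixel-embedding map (B, D, H, W):
    penalize differences between neighboring embeddings (assumed TV-style form)."""
    dh = (emb_map[:, :, 1:, :] - emb_map[:, :, :-1, :]).abs().mean()
    dw = (emb_map[:, :, :, 1:] - emb_map[:, :, :, :-1]).abs().mean()
    return dh + dw

def total_loss(l_text, l_image, emb_map, w_t=1.0, w_i=1.0, w_s=0.1):
    """L_total = W_t * L_text + W_i * L_image + W_s * L_smooth (weights are assumed)."""
    return w_t * l_text + w_i * l_image + w_s * smoothness_loss(emb_map)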

(Architecture diagram)


Installation

  1. Clone the repo:

git clone https://github.com/jinryan/DepthCLIP.git
cd DepthCLIP

  2. Install dependencies:

pip install -r requirements.txt

Key dependencies:

  • torch, torchvision, transformers
  • matplotlib, tensorboard
  • nltk (for label cleaning)

  3. (Optional) Set up SUN RGB-D:
    • Download the dataset from SUN RGB-D.
    • Preprocess it with the provided scripts to unify labels and generate equivalence sets.

Usage

To train the model, run:

bash train_segmentation_model.sh

The script supports DistributedDataParallel (DDP) and mixed-precision training.
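
As a rough illustration of what DDP plus mixed-precision training looks like in PyTorch (not the repository's exact train.py), assuming one process per GPU launched via torchrun and a model whose forward pass returns the loss:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_ddp_amp(model, loader, optimizer, local_rank, epochs=1):
    """Illustrative DDP + mixed-precision loop (hypothetical, not the repo's code)."""
    dist.init_process_group(backend="nccl")          # called once per process
    device = torch.device("cuda", local_rank)
    model = DDP(model.to(device), device_ids=[local_rank])
    scaler = torch.cuda.amp.GradScaler()
    for _ in range(epochs):
        for depth, labels in loader:
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():
                loss = model(depth.to(device), labels.to(device))  # assumed to return the loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()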

The script also performs evaluation and reports standard, top-k, and equivalence-aware metrics (pixel accuracy and mIoU).
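
A hedged sketch of what an equivalence-aware top-k pixel accuracy could look like, where equiv_sets maps each ground-truth class index to the set of indices counted as correct (for example, near-synonyms). The function name and data layout are assumptions.

import torch

def topk_equiv_pixel_accuracy(logits, target, equiv_sets, k=5):
    """Fraction of pixels whose top-k predictions hit any label equivalent to the target.

    logits:     (N, K) per-pixel class scores
    target:     (N,)   ground-truth class indices
    equiv_sets: dict mapping class index -> set of equivalent class indices
    """
    topk = logits.topk(k, dim=-1).indices                    # (N, k)
    hits = torch.zeros_like(target, dtype=torch.bool)
    for i, t in enumerate(target.tolist()):
        ok = equiv_sets.get(t, {t})
        hits[i] = any(p in ok for p in topk[i].tolist())
    return hits.float().mean().item()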


Repository Structure

DepthCLIP/src/depth_segmentation_model
├── dataloader.py
├── datasets.py
├── evaluation.py
├── log.py
├── model.py
├── train.py
├── train_util.py
├── validate.py

Future Work

  • Experiment with Vision Transformers (ViTs) (e.g., SegFormer, Mask2Former) for better long-range context.
  • Incorporate ontological hierarchies (WordNet, ConceptNet) for label reasoning.
  • Explore automatic prompt optimization (reinforcement learning, multi-prompt ensembling).
  • Optimize for real-time inference (distillation, quantization, pruning).
  • Extend beyond indoor SUN RGB-D to robotics, outdoor, and AR/VR domains.

Citation

If you use DepthCLIP in your research, please cite:

@misc{jin2025depthclip,
  title={DepthCLIP: Open-Vocabulary Segmentation on Depth Maps via Contrastive Learning},
  author={Ryan Jin},
  year={2025},
  institution={Yale University},
  note={Senior Thesis, advised by Alex Wong}
}

License

MIT License.
