Skip to content

[ICCV 2025] Official repository of the paper "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation"

Notifications You must be signed in to change notification settings

lorebianchi98/Talk2DINO

Repository files navigation

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

arXiv Paper Project Website Demo on Hugging Face Hugging Face Models

Talk2DINO is an open-vocabulary segmentation architecture that combines the localized and semantically rich patch-level features of DINOv2 with the multimodal understanding capabilities of CLIP. This is achieved by learning a projection from the CLIP text encoder to the embedding space of DINOv2 using only image-caption pairs and exploiting the self-attention properties of DINOv2 to understand which part of the image has to be aligned to the corresponding caption.

Updates

  • ☄️ 10/2025: Added support for DINOv3 🦖🦖🦕!
  • 🚀 10/2025: Gradio demo is now live! Try Talk2DINO interactively on the Hugging Face Spaces 🦖
  • 🤗 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub 🎉
  • 🔥 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV2025 in Honolulu! 🌺🌴🏖️

Results

Image Ground Truth FreeDA ProxyCLIP CLIP-DINOiser Ours (Talk2DINO)
Image Ground Truth FreeDA ProxyCLIP CLIP-DINOiser Ours
Image Ground Truth FreeDA ProxyCLIP CLIP-DINOiser Ours
Image Ground Truth FreeDA ProxyCLIP CLIP-DINOiser Ours
Image Ground Truth FreeDA ProxyCLIP CLIP-DINOiser Ours

Here’s a refined and concise version of your installation guidelines that separates Hugging Face inference from full MMCV-based evaluation, while keeping them clear and easy to follow:


Installation

1️⃣ Hugging Face Interface (for inference)

To quickly run Talk2DINO on your own images:

# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO

# Install dependencies
pip install -r requirements.txt

# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

This setup allows you to load Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.


2️⃣ MMCV Interface (for evaluation & full pipelines)

If you want to perform benchmark evaluation using MMSegmentation:

# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino

# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"

# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda 
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0

# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine

# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html

# Install MMSegmentation
pip install mmsegmentation==0.30.0

Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO

Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:

🔹 Using the Hugging Face Hub

Easily load pretrained models with the HF interface:

from transformers import AutoModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()

with torch.no_grad():
    text_embed = model.encode_text("a pikachu")

🔹 Using the Original Talk2DINO Interface

If you prefer local configs and weights:

import clip
from src.model import ProjectionLayer
import torch, os

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")

talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)

# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize

# Example: Tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)

Feature Extraction

To speed up training, we use pre-extracted features. Follow these steps:

  1. Download the 2014 images and annotations from the COCO website.
  2. Run the following commands to extract features:
    mkdir ../coco2014_b14
    python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python text_features_extraction.py --ann_path ../coco2014_b14/train.pth
    python text_features_extraction.py --ann_path ../coco2014_b14/val.pth

Training

To train the model, use the following command (this example runs training for the ViT-Base configuration):

python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pth

Evaluation

This section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:

data
├── cityscapes
│   ├── leftImg8bit
│   │   ├── train
│   │   ├── val
│   ├── gtFine
│   │   ├── train
│   │   ├── val
├── VOCdevkit
│   ├── VOC2012
│   │   ├── JPEGImages
│   │   ├── SegmentationClass
│   │   ├── ImageSets
│   │   │   ├── Segmentation
│   ├── VOC2010
│   │   ├── JPEGImages
│   │   ├── SegmentationClassContext
│   │   ├── ImageSets
│   │   │   ├── SegmentationContext
│   │   │   │   ├── train.txt
│   │   │   │   ├── val.txt
│   │   ├── trainval_merged.json
│   ├── VOCaug
│   │   ├── dataset
│   │   │   ├── cls
├── ade
│   ├── ADEChallengeData2016
│   │   ├── annotations
│   │   │   ├── training
│   │   │   ├── validation
│   │   ├── images
│   │   │   ├── training
│   │   │   ├── validation
├── coco_stuff164k
│   ├── images
│   │   ├── train2017
│   │   ├── val2017
│   ├── annotations
│   │   ├── train2017
│   │   ├── val2017

Please download and setup PASCAL VOC , PASCAL Context, COCO-Stuff164k , Cityscapes, and ADE20k datasets following MMSegmentation data preparation document.

COCO-Object dataset uses only object classes from COCO-Stuff164k dataset by collecting instance semgentation annotations. Run the following command to convert instance segmentation annotations to semantic segmentation annotations:

python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/

To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the appropriate configuration based on the model, benchmark, and PAMR settings. The available models are [vitb, vitl], while the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, cityscapes, coco_object, stuff]. Below we provide the list of evaluations to reproduce the results reported in the paper for the ViT-Base architecture:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml

# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml

# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml

# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml

# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml

# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml

# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml

# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml

Instead, the evaluations for the ViT-Large architecture are:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml

# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml

# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml

# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml

# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml

# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml

# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml

# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.yml

Demo

We provide two simple entry points for trying out Talk2DINO:

  • hf_demo.ipynb – an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.
  • demo.py – a lightweight script for running inference on a custom image with your own textual categories. . Run
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..

Example:

python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route

Result:

Acknowledgments

Thanks to AyoubDamak for contributing to the updated installation instructions.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{barsellotti2025talking,
  title={Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation},
  author={Barsellotti, Luca and Bianchi, Lorenzo and Messina, Nicola and Carrara, Fabio and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22025--22035},
  year={2025}
}

About

[ICCV 2025] Official repository of the paper "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation"

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •