Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Talk2DINO is an open-vocabulary segmentation architecture that combines the localized and semantically rich patch-level features of DINOv2 with the multimodal understanding capabilities of CLIP. This is achieved by learning a projection from the CLIP text encoder to the embedding space of DINOv2 using only image-caption pairs and exploiting the self-attention properties of DINOv2 to understand which part of the image has to be aligned to the corresponding caption.
- ☄️ 10/2025: Added support for DINOv3 🦖🦖🦕!
- 🚀 10/2025: Gradio demo is now live! Try Talk2DINO interactively on the Hugging Face Spaces 🦖
- 🤗 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub 🎉
- 🔥 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV2025 in Honolulu! 🌺🌴🏖️
| Image | Ground Truth | FreeDA | ProxyCLIP | CLIP-DINOiser | Ours (Talk2DINO) |
|---|---|---|---|---|---|
Here’s a refined and concise version of your installation guidelines that separates Hugging Face inference from full MMCV-based evaluation, while keeping them clear and easy to follow:
To quickly run Talk2DINO on your own images:
# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO
# Install dependencies
pip install -r requirements.txt
# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126This setup allows you to load Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.
If you want to perform benchmark evaluation using MMSegmentation:
# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino
# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"
# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0
# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html
# Install MMSegmentation
pip install mmsegmentation==0.30.0Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:
Easily load pretrained models with the HF interface:
from transformers import AutoModel
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()
with torch.no_grad():
text_embed = model.encode_text("a pikachu")If you prefer local configs and weights:
import clip
from src.model import ProjectionLayer
import torch, os
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")
talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)
# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize
# Example: Tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)To speed up training, we use pre-extracted features. Follow these steps:
- Download the 2014 images and annotations from the COCO website.
- Run the following commands to extract features:
mkdir ../coco2014_b14 python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn python text_features_extraction.py --ann_path ../coco2014_b14/train.pth python text_features_extraction.py --ann_path ../coco2014_b14/val.pth
To train the model, use the following command (this example runs training for the ViT-Base configuration):
python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pthThis section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:
data
├── cityscapes
│ ├── leftImg8bit
│ │ ├── train
│ │ ├── val
│ ├── gtFine
│ │ ├── train
│ │ ├── val
├── VOCdevkit
│ ├── VOC2012
│ │ ├── JPEGImages
│ │ ├── SegmentationClass
│ │ ├── ImageSets
│ │ │ ├── Segmentation
│ ├── VOC2010
│ │ ├── JPEGImages
│ │ ├── SegmentationClassContext
│ │ ├── ImageSets
│ │ │ ├── SegmentationContext
│ │ │ │ ├── train.txt
│ │ │ │ ├── val.txt
│ │ ├── trainval_merged.json
│ ├── VOCaug
│ │ ├── dataset
│ │ │ ├── cls
├── ade
│ ├── ADEChallengeData2016
│ │ ├── annotations
│ │ │ ├── training
│ │ │ ├── validation
│ │ ├── images
│ │ │ ├── training
│ │ │ ├── validation
├── coco_stuff164k
│ ├── images
│ │ ├── train2017
│ │ ├── val2017
│ ├── annotations
│ │ ├── train2017
│ │ ├── val2017Please download and setup PASCAL VOC , PASCAL Context, COCO-Stuff164k , Cityscapes, and ADE20k datasets following MMSegmentation data preparation document.
COCO-Object dataset uses only object classes from COCO-Stuff164k dataset by collecting instance semgentation annotations. Run the following command to convert instance segmentation annotations to semantic segmentation annotations:
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the appropriate configuration based on the model, benchmark, and PAMR settings. The available models are [vitb, vitl], while the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, cityscapes, coco_object, stuff]. Below we provide the list of evaluations to reproduce the results reported in the paper for the ViT-Base architecture:
# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml
Instead, the evaluations for the ViT-Large architecture are:
# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.ymlWe provide two simple entry points for trying out Talk2DINO:
hf_demo.ipynb– an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.demo.py– a lightweight script for running inference on a custom image with your own textual categories. . Run
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..Example:
python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,routeResult:
Thanks to AyoubDamak for contributing to the updated installation instructions.
If you found this code useful, please cite the following paper:
@inproceedings{barsellotti2025talking,
title={Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation},
author={Barsellotti, Luca and Bianchi, Lorenzo and Messina, Nicola and Carrara, Fabio and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Cucchiara, Rita},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={22025--22035},
year={2025}
}