DETR-ViP is a detection framework for visual prompted object detection, an interactive paradigm that uses visual features — rather than text — to define target categories on the fly. While visual prompts excel at recognizing rare and fine-grained categories, existing methods suffer from poor class discriminability because they treat visual prompts as a byproduct of text-prompted training.
DETR-ViP addresses this with three key innovations:
- Global Prompt Integration — incorporates global class relationships into visual prompt learning
- Visual-Textual Prompt Relation Distillation — transfers discriminability from text to visual prompts via knowledge distillation
- Selective Fusion Strategy — stably combines visual and textual prompts for robust detection
Built on image-text contrastive learning, DETR-ViP achieves substantial improvements on COCO, LVIS, ODinW, and Roboflow100 for both zero-shot generic and interactive detection.
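As a rough illustration of the relation distillation idea listed above, the sketch below matches the class-to-class similarity structure of visual prompts to that of text prompts. It is not taken from the paper or this repository: the function names, the cosine-similarity relation matrix, the KL objective, and the temperature value are all assumptions chosen for illustration, and prompt vectors are assumed to be non-zero.

```python
import math


def cosine_similarity_matrix(prompts):
    """Pairwise cosine similarity between per-class prompt embeddings."""
    norms = [math.sqrt(sum(x * x for x in p)) for p in prompts]
    return [
        [
            sum(a * b for a, b in zip(p, q)) / (norms[i] * norms[j])
            for j, q in enumerate(prompts)
        ]
        for i, p in enumerate(prompts)
    ]


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarities."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def relation_distillation_loss(text_prompts, visual_prompts, temperature=0.1):
    """Mean KL(text relations || visual relations) over classes.

    Hypothetical objective: the text prompts act as the teacher and the
    visual prompts as the student.
    """
    teacher = cosine_similarity_matrix(text_prompts)
    student = cosine_similarity_matrix(visual_prompts)
    loss = 0.0
    for t_row, s_row in zip(teacher, student):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher)
```

The loss is zero when both prompt sets induce the same class relations and grows as visual prompts of different classes collapse toward each other.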
This repository contains the official implementation of DETR-ViP-T and DETR-ViP-L.
| Model | Pretrain | COCO AP | LVIS AP | LVIS APf | LVIS APc | LVIS APr | ODinW APavg | RF100 APavg |
|---|---|---|---|---|---|---|---|---|
| DETR-ViP-T | O365 | 42.3 | 41.1 | 40.4 | 43.3 | 35.1 | 65.4 | 66.1 |
| DETR-ViP-L | GoldG | 52.4 | 43.5 | 42.3 | 45.1 | 42.9 | — | 64.2 |
Below are qualitative results of zero-shot generic detection on COCO:
Below are qualitative results of zero-shot generic detection on LVIS:
| Model | COCO AP | LVIS AP | LVIS APf | LVIS APc | LVIS APr | ODinW APavg | RF100 APavg |
|---|---|---|---|---|---|---|---|
| DETR-ViP-T | 65.4 | 66.1 | 57.5 | 73.5 | 78.4 | 46.8 | 40.1 |
| DETR-ViP-L | 71.1 | 71.9 | 64.2 | 78.2 | 83.6 | 51.2 | 44.3 |
Below are qualitative results of zero-shot interactive detection on COCO:
Below are qualitative results of zero-shot interactive detection on LVIS:
| Package | Version |
|---|---|
| PyTorch | 2.0.1+cu117 |
| torchaudio | 2.0.2+cu117 |
| torchvision | 0.15.2+cu117 |
| MMCV | 2.1.0 |
| MMDetection | 3.3.0 |
| MMEngine | 0.11.0rc2 |
| numpy | 1.26.4 |
| spacy | 2.3.9 |
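To compare an existing environment against the versions listed above, a small check along these lines may help. It is only a sketch: the distribution names (`torch` for PyTorch, `mmdet` for MMDetection, and so on) are assumptions about how the packages were installed, and `find_mismatches` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Versions copied from the table above. The keys are the assumed
# pip distribution names, which may differ in your environment.
PINNED = {
    "torch": "2.0.1+cu117",
    "torchaudio": "2.0.2+cu117",
    "torchvision": "0.15.2+cu117",
    "mmcv": "2.1.0",
    "mmdet": "3.3.0",
    "mmengine": "0.11.0rc2",
    "numpy": "1.26.4",
    "spacy": "2.3.9",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each package that differs.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a CPU-only or differently built wheel (for example `2.0.1` without the `+cu117` suffix) is reported as a mismatch.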
Online installation: refer to the MMCV installation guide.
Offline installation:
cd third_party
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
pip install -e . -v
Online installation: refer to the MMDetection installation guide.
Offline installation:
cd third_party
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -e . -v
Online installation: refer to the MMEngine installation guide.
Offline installation:
cd third_party
git clone https://github.com/open-mmlab/mmengine.git
cd mmengine
pip install -e . -v
Swin Transformer backbones (auto-downloaded by default; offline fallback):
| Model | Download Link |
|---|---|
| Swin-Tiny | swin_tiny_patch4_window7_224.pth |
| Swin-Large | swin_large_patch4_window12_384_22k.pth |
If offline, download the above and place them under ~/.cache/torch/hub/checkpoints/.
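For offline machines, copying the downloaded weights into the cache directory can be scripted. The sketch below assumes the default cache location named above; `stage_checkpoint` is a hypothetical helper and is not part of this repository.

```python
import shutil
from pathlib import Path


def stage_checkpoint(src, cache_dir=None):
    """Copy a downloaded checkpoint into the torch hub checkpoint cache.

    An existing file with the same name is left untouched.
    """
    src = Path(src)
    if cache_dir is None:
        # Default location mentioned in the text above.
        cache_dir = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    dst = cache_dir / src.name
    if not dst.exists():
        shutil.copy2(src, dst)
    return dst
```

For example, `stage_checkpoint("downloads/swin_tiny_patch4_window7_224.pth")` would place the Swin-Tiny weights where the loader is expected to look; the `downloads/` path is illustrative.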
CLIP model (for generating category feature caches):
Download from clip-vit-base-patch32 and set the path as path_clip_weights in the commands below.
The models are pretrained on Objects365 V1 and GoldG datasets. Configs for training on COCO or Objects365 alone are also provided.
Corresponding config: DETR-ViP_swin-t_pretrain_obj365.py
Objects365 V1 can be downloaded from opendatalab. Both CLI and SDK download methods are supported.
After downloading and extracting, place or symlink it to data/objects365v1 with the following structure:
DETR-ViP
├── configs
├── data
│ ├── objects365v1
│ │ ├── objects365_train.json
│ │ ├── objects365_val.json
│ │ ├── train
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
│ │ ├── val
│ │ │ ├── xxxx.jpg
│ │ │ ├── ...
│ │ ├── test
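Before running the conversion, it can be useful to confirm that the layout matches the tree above. The following is a minimal sketch; `missing_entries` is a hypothetical helper, and the entry list is simply transcribed from the tree.

```python
from pathlib import Path

# Entries transcribed from the directory tree above,
# relative to data/objects365v1.
OBJECTS365_LAYOUT = [
    "objects365_train.json",
    "objects365_val.json",
    "train",
    "val",
    "test",
]


def missing_entries(root, expected):
    """Return the expected files or directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("data/objects365v1", OBJECTS365_LAYOUT)` from the project root returns an empty list when everything is in place. Symlinked directories count as present as long as their targets exist.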
Convert to ODVG format using coco2odvg.py:
python -m tools.dataset_converters.coco2odvg data/objects365v1/objects365_train.json -d o365v1
After conversion, o365v1_train_od.json and o365v1_label_map.json will be created under data/objects365v1:
DETR-ViP
├── configs
├── data
│ ├── objects365v1
│ │ ├── objects365_train.json
│ │ ├── objects365_val.json
│ │ ├── o365v1_train_od.json
│ │ ├── o365v1_label_map.json
│ │ ├── train
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
│ │ ├── val
│ │ │ ├── xxxx.jpg
│ │ │ ├── ...
│ │ ├── test
Generate the CLIP feature cache for Objects365 category names using prepare_OD_cache.py:
python -m tools.data_prepare.prepare_OD_cache data/objects365v1/objects365_train.json --clip-path <path_clip_weights> --output cache/vocabulary/o365_vocabulary.pkl

The GoldG dataset consists of GQA and Flickr30k, originally from the MixedGrounding dataset in the GLIP paper (excluding COCO).
First, download the annotation files from mdetr_annotations. The required files are:
- final_mixed_train_no_coco.json
- final_flickr_separateGT_train.json
GQA images can be downloaded from here. After downloading and extracting, place or symlink to data/gqa:
DETR-ViP
├── configs
├── data
│ ├── gqa
│ │ ├── final_mixed_train_no_coco.json
│ │ ├── images
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
Flickr30k images can be downloaded from here, which requires an application for access. After downloading and extracting, place or symlink to data/flickr30k_entities:
DETR-ViP
├── configs
├── data
│ ├── flickr30k_entities
│ │ ├── final_flickr_separateGT_train.json
│ │ ├── flickr30k_images
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
Convert GQA annotations to ODVG format using goldg2odvg.py:
python -m tools.dataset_converters.goldg2odvg data/gqa/final_mixed_train_no_coco.json
After conversion, final_mixed_train_no_coco_vg.json will be created under data/gqa:
DETR-ViP
├── configs
├── data
│ ├── gqa
│ │ ├── final_mixed_train_no_coco.json
│ │ ├── final_mixed_train_no_coco_vg.json
│ │ ├── images
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
Convert Flickr30k annotations to ODVG format:
python -m tools.dataset_converters.goldg2odvg data/flickr30k_entities/final_flickr_separateGT_train.json
After conversion, final_flickr_separateGT_train_vg.json will be created under data/flickr30k_entities:
DETR-ViP
├── configs
├── data
│ ├── flickr30k_entities
│ │ ├── final_flickr_separateGT_train.json
│ │ ├── final_flickr_separateGT_train_vg.json
│ │ ├── flickr30k_images
│ │ │ ├── xxx.jpg
│ │ │ ├── ...
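Since the cache generation step reads the converted files, a quick check that both conversions produced their outputs can save a failed run. The sketch below relies only on the naming shown in the trees above (the `_vg.json` suffix next to the source annotation); the helper names are hypothetical.

```python
from pathlib import Path


def odvg_output_path(annotation_json):
    """Map an annotation file to the *_vg.json name shown in the trees above."""
    src = Path(annotation_json)
    return src.with_name(f"{src.stem}_vg.json")


def pending_conversions(annotation_jsons):
    """Return the annotation files whose converted counterpart is missing."""
    return [p for p in annotation_jsons if not odvg_output_path(p).exists()]
```

For example, passing `data/gqa/final_mixed_train_no_coco.json` and `data/flickr30k_entities/final_flickr_separateGT_train.json` to `pending_conversions` returns whichever of the two still needs to be converted.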
Generate the GoldG cache file using prepare_OG_cache.py:
python -m tools.data_prepare.prepare_OG_cache --clip-path <path_clip_weights> --gqa-path <gqa_json> --flickr-path <flickr_json> --save-path <save_path>
Example:
python -m tools.data_prepare.prepare_OG_cache --clip-path weights/clip-vit-base-patch32 --gqa-path data/gqa/final_mixed_train_no_coco_vg.json --flickr-path data/flickr30k_entities/final_flickr_separateGT_train_vg.json --save-path cache/vocabulary/

The above configs evaluate on COCO 2017 during training, so that dataset must be prepared as well. Download it from the COCO website or opendatalab, then place or symlink it to data/coco.
Generate the COCO category CLIP feature cache (required for text detection evaluation):
python -m tools.data_prepare.prepare_OD_cache data/coco/annotations/instances_train2017.json --clip-path <path_clip_weights> --output cache/vocabulary/coco_text_cache.pkl
Note: Make sure to create the cache/support/ directory before running the support set sampling below.
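The directory can be created by hand or from Python. The helper below is a small sketch (the name `ensure_dirs` is hypothetical) that is safe to call repeatedly because existing directories are left as they are.

```python
from pathlib import Path


def ensure_dirs(*relative_dirs, root="."):
    """Create each directory under root if it does not exist yet."""
    created = []
    for rel in relative_dirs:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Running `ensure_dirs("cache/support")` from the project root is enough for the sampling step.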
Sample the support set for Visual-G prompt detection using support_dataset.py:
python -m tools.data_prepare.support_dataset data/coco/annotations/instances_train2017.json -o cache/support/coco_sub.json

Generate the LVIS category CLIP feature cache:
python -m tools.data_prepare.prepare_OD_cache data/lvis/annotations/lvis_v1_train.json --clip-path <path_clip_weights> --output cache/vocabulary/lvis_vocabulary.pkl
Note: There is a typo in the LVIS vocabulary. After generating the cache, manually rename speaker_(stero_equipment) to speaker_(stereo_equipment):
import pickle

# Load the generated LVIS vocabulary cache
with open("cache/vocabulary/lvis_vocabulary.pkl", "rb") as f:
    data = pickle.load(f)

# Move the entry from the misspelled key to the corrected key
data["speaker_(stereo_equipment)"] = data.pop("speaker_(stero_equipment)")

# Write the corrected cache back to the same file
with open("cache/vocabulary/lvis_vocabulary.pkl", "wb") as f:
    pickle.dump(data, f)
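Running the rename a second time raises a `KeyError` because the misspelled key is already gone. If the step is part of a script that may be re-run, a guarded variant such as the sketch below may be more convenient. It assumes, as the snippet above does, that the cache is a pickled dict keyed by category name; `rename_cache_key` is a hypothetical helper.

```python
import pickle


def rename_cache_key(cache_path, old_key, new_key):
    """Rename one key in a pickled dict; return True if the file was changed."""
    with open(cache_path, "rb") as f:
        data = pickle.load(f)
    if old_key not in data:
        return False  # already renamed, or the key was never present
    data[new_key] = data.pop(old_key)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return True
```

Usage: `rename_cache_key("cache/vocabulary/lvis_vocabulary.pkl", "speaker_(stero_equipment)", "speaker_(stereo_equipment)")`.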
Sample the support set for Visual-G prompt detection on LVIS:
python -m tools.data_prepare.support_dataset data/lvis/annotations/lvis_v1_train.json -o cache/support/lvis_sup.json --minival data/lvis/annotations/lvis_v1_minival.json

python -m tools.train <config> --work-dir <work_dirs>
Example:
python -m tools.train configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py --work-dir work_dirs/debug

Note: Before running, set path-to-project to your project root path in tools/dist_train.sh.
bash tools/dist_train.sh <config> <GPUs> --work-dir <work_dirs>
Example:
bash tools/dist_train.sh configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py 4 --work-dir work_dirs/debug

Configure .vscode/launch.json:
{
    "name": "train",
    "cwd": "${workspaceFolder}",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/tools/train.py",
    "console": "integratedTerminal",
    "env": {
        "PYTHONPATH": "${workspaceFolder}"
    },
    "args": [
        "configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py",
        "--work-dir", "work_dirs/debug"
    ],
    "justMyCode": false
}

Note: You can switch between test modes (text / visual-generic / visual-interactive) by modifying val_mode and test_mode in val_cfg and test_cfg of the config file.
python -m tools.test <config> <checkpoint> --work-dir <work_dirs>

Note: Before running, set path-to-project to your project root path in tools/dist_test.sh.
bash tools/dist_test.sh <config> <checkpoint> <work_dirs> <GPUs>

Due to company confidentiality requirements, pretrained weights cannot be released at this time. We will consider open-sourcing them once permitted.
If any part of our paper or code helps your research, please consider citing us and starring the repository.
@article{qian2026detr,
title={DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts},
author={Qian, Bo and Shi, Dahu and Wei, Xing},
journal={arXiv preprint arXiv:2604.14684},
year={2026}
}