DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Introduction

DETR-ViP is a detection framework for visual-prompted object detection, an interactive paradigm that uses visual features, rather than text, to define target categories on the fly. While visual prompts excel at recognizing rare and fine-grained categories, existing methods suffer from poor class discriminability because they treat visual prompts as a byproduct of text-prompted training.

DETR-ViP addresses this with three key innovations:

Global Prompt Integration — incorporates global class relationships into visual prompt learning
Visual-Textual Prompt Relation Distillation — transfers discriminability from text to visual prompts via knowledge distillation
Selective Fusion Strategy — stably combines visual and textual prompts for robust detection
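
To make the idea of relation distillation more concrete, here is a minimal sketch of one generic way to transfer class relations from text prompts (teacher) to visual prompts (student). It is an illustration only and not the loss implemented in this repository: the function names, the cosine-similarity relation matrix, the KL objective, and the temperature value are all assumptions.

```python
import math

def cosine_similarity_matrix(prompts):
    """Pairwise cosine similarity between class prompt vectors (non-zero vectors assumed)."""
    norms = [math.sqrt(sum(x * x for x in p)) for p in prompts]
    return [
        [sum(a * b for a, b in zip(p, q)) / (norms[i] * norms[j])
         for j, q in enumerate(prompts)]
        for i, p in enumerate(prompts)
    ]

def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarities."""
    top = max(row)
    exps = [math.exp((v - top) / temperature) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def relation_distillation_loss(text_prompts, visual_prompts, temperature=0.1):
    """Mean KL(teacher || student) between row-wise class-relation distributions."""
    teacher = cosine_similarity_matrix(text_prompts)    # relations among text prompts
    student = cosine_similarity_matrix(visual_prompts)  # relations among visual prompts
    loss = 0.0
    for t_row, s_row in zip(teacher, student):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher)
```

The loss is zero when the visual prompts reproduce the inter-class similarity structure of the text prompts and grows as the two structures diverge.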

Built on image-text contrastive learning, DETR-ViP achieves substantial improvements on COCO, LVIS, ODinW, and Roboflow100 for both zero-shot generic and interactive detection.

This repository contains the official implementation of DETR-ViP-T and DETR-ViP-L.

Results

Zero-shot Generic Detection (COCO & LVIS)

Model	Pretrain	COCO AP	LVIS AP	LVIS AP_f	LVIS AP_c	LVIS AP_r	ODinW AP_avg	RF100 AP_avg
DETR-ViP-T	O365	42.3	41.1	40.4	43.3	35.1	65.4	66.1
DETR-ViP-L	GoldG	52.4	43.5	42.3	45.1	42.9	—	64.2

[Figure: qualitative results of zero-shot generic detection on COCO]

[Figure: qualitative results of zero-shot generic detection on LVIS]

Zero-shot Interactive Detection (COCO & LVIS)

Model	COCO AP	LVIS AP	LVIS AP_f	LVIS AP_c	LVIS AP_r	ODinW AP_avg	RF100 AP_avg
DETR-ViP-T	65.4	66.1	57.5	73.5	78.4	46.8	40.1
DETR-ViP-L	71.1	71.9	64.2	78.2	83.6	51.2	44.3

[Figure: qualitative results of zero-shot interactive detection on COCO]

[Figure: qualitative results of zero-shot interactive detection on LVIS]

Installation

Requirements

Package	Version
PyTorch	2.0.1+cu117
torchaudio	2.0.2+cu117
torchvision	0.15.2+cu117
MMCV	2.1.0
MMDetection	3.3.0
MMEngine	0.11.0rc2
numpy	1.26.4
spacy	2.3.9
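
Before training, it can help to confirm that the environment matches the table above. The following is a small sketch using only the standard library. The distribution names (`torch`, `mmcv`, `mmdet`, and so on) are the names these packages are usually installed under and are an assumption here, as are the helper names `installed_version` and `find_mismatches`.

```python
from importlib import metadata

# Versions copied from the requirements table; local build tags such as
# "+cu117" are dropped. The distribution names are assumed.
REQUIRED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "torchvision": "0.15.2",
    "mmcv": "2.1.0",
    "mmdet": "3.3.0",
    "mmengine": "0.11.0rc2",
    "numpy": "1.26.4",
    "spacy": "2.3.9",
}

def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(required, lookup=installed_version):
    """Map package name -> (expected, found) for every package that differs."""
    mismatches = {}
    for name, expected in required.items():
        found = lookup(name)
        # Ignore the local build tag (e.g. "+cu117") when comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches

if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(REQUIRED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package is installed at the pinned version; a `found` value of `None` means the package is missing.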

MMCV

Online installation: refer to the MMCV installation guide.

Offline installation (run from the project root):

cd third_party
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
pip install -e . -v

MMDetection

Online installation: refer to the MMDetection installation guide.

Offline installation (run from the project root):

cd third_party
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -e . -v

MMEngine

Online installation: refer to the MMEngine installation guide.

Offline installation (run from the project root):

cd third_party
git clone https://github.com/open-mmlab/mmengine.git
cd mmengine
pip install -e . -v

Data Preparation

Pretrained Models

Swin Transformer backbones (auto-downloaded by default; offline fallback):

Model	Download Link
Swin-Tiny	swin_tiny_patch4_window7_224.pth
Swin-Large	swin_large_patch4_window12_384_22k.pth

If offline, download the above and place them under ~/.cache/torch/hub/checkpoints/.
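
For offline machines, a quick check that the backbone files are where they are expected can save a failed start. This is a minimal sketch: the file names and directory come from the text above, while the helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# File names from the table above; the directory is the one named in the text.
SWIN_CHECKPOINTS = [
    "swin_tiny_patch4_window7_224.pth",
    "swin_large_patch4_window12_384_22k.pth",
]
DEFAULT_CACHE = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"

def missing_checkpoints(names=SWIN_CHECKPOINTS, cache_dir=DEFAULT_CACHE):
    """Return the checkpoint file names that are not present in cache_dir."""
    cache_dir = Path(cache_dir)
    return [name for name in names if not (cache_dir / name).is_file()]
```

Calling `missing_checkpoints()` with no arguments inspects the default cache directory and returns an empty list when both files are in place.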

CLIP model (for generating category feature caches):
Download from clip-vit-base-patch32 and set the path as path_clip_weights in the commands below.

Pretraining Data

The models are pretrained on Objects365 V1 and GoldG datasets. Configs for training on COCO or Objects365 alone are also provided.

1. Objects365 V1

Corresponding config: DETR-ViP_swin-t_pretrain_obj365.py

Objects365 V1 can be downloaded from opendatalab. Both CLI and SDK download methods are supported.

After downloading and extracting, place or symlink it to data/objects365v1 with the following structure:

DETR-ViP
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test

Convert to ODVG format using coco2odvg.py:

python -m tools.dataset_converters.coco2odvg data/objects365v1/objects365_train.json -d o365v1

After conversion, o365v1_train_od.json and o365v1_label_map.json will be created under data/objects365v1:

DETR-ViP
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── objects365_train_od.json
│   │   ├── o365v1_label_map.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test
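
For readers unfamiliar with the conversion step, the sketch below illustrates the kind of transformation a COCO-to-ODVG converter performs: boxes move from `[x, y, width, height]` to corner format, category ids are remapped to contiguous labels, and annotations are grouped per image. It is a simplified illustration and not the repository's coco2odvg.py. The output field names follow the ODVG layout used by MMDetection and are assumed to apply here, and the function name `coco_to_odvg_records` is hypothetical.

```python
def coco_to_odvg_records(coco):
    """Group COCO annotations by image and convert boxes to corner format."""
    # Map the original category ids to contiguous labels 0..N-1.
    categories = sorted(coco["categories"], key=lambda c: c["id"])
    label_of = {c["id"]: i for i, c in enumerate(categories)}
    name_of = {c["id"]: c["name"] for c in categories}

    instances = {}
    for ann in coco["annotations"]:
        if ann.get("iscrowd", 0):
            continue  # crowd regions are skipped in this sketch
        x, y, w, h = ann["bbox"]  # COCO stores [x, y, width, height]
        instances.setdefault(ann["image_id"], []).append({
            "bbox": [x, y, x + w, y + h],
            "label": label_of[ann["category_id"]],
            "category": name_of[ann["category_id"]],
        })

    records = []
    for img in coco["images"]:
        records.append({
            "filename": img["file_name"],
            "height": img["height"],
            "width": img["width"],
            "detection": {"instances": instances.get(img["id"], [])},
        })
    # Label map: contiguous label (as a string key) -> category name.
    label_map = {str(i): c["name"] for i, c in enumerate(categories)}
    return records, label_map
```

Each record would typically be written as one JSON object per line, with the label map stored as a separate JSON file.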

Generate the CLIP feature cache for Objects365 category names using prepare_OD_cache.py:

python -m tools.data_prepare.prepare_OD_cache data/objects365v1/objects365_train.json --clip-path <path_clip_weights> --output cache/vocabulary/o365_vocabulary.pkl
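
After the cache is written, a simple coverage check can confirm that every category in the annotation file has an entry. The sketch below assumes the cache is a pickled dict keyed by category name (consistent with how the LVIS cache is edited later in this README, but not verified against prepare_OD_cache.py); the helper name `uncached_categories` is hypothetical. Only unpickle files you generated yourself.

```python
import json
import pickle

def uncached_categories(annotation_path, cache_path):
    """List category names in a COCO-style file that have no cache entry."""
    with open(annotation_path, encoding="utf-8") as f:
        names = [c["name"] for c in json.load(f)["categories"]]
    with open(cache_path, "rb") as f:
        cache = pickle.load(f)  # assumed: dict keyed by category name
    return [name for name in names if name not in cache]
```

For example, `uncached_categories("data/objects365v1/objects365_train.json", "cache/vocabulary/o365_vocabulary.pkl")` should return an empty list if the cache covers the full vocabulary.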

2. GoldG

The GoldG dataset consists of GQA and Flickr30k, originally from the MixedGrounding dataset in the GLIP paper (excluding COCO).

First, download the annotation files from mdetr_annotations. The required files are:

final_mixed_train_no_coco.json
final_flickr_separateGT_train.json

GQA images can be downloaded from here. After downloading and extracting, place or symlink to data/gqa:

DETR-ViP
├── configs
├── data
│   ├── gqa
│   │   ├── final_mixed_train_no_coco.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Flickr30k images can be downloaded from here, which requires an application for access. After downloading and extracting, place or symlink to data/flickr30k_entities:

DETR-ViP
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Convert GQA annotations to ODVG format using goldg2odvg.py:

python -m tools.dataset_converters.goldg2odvg data/gqa/final_mixed_train_no_coco.json

After conversion, final_mixed_train_no_coco_vg.json will be created under data/gqa:

DETR-ViP
├── configs
├── data
│   ├── gqa
│   │   ├── final_mixed_train_no_coco.json
│   │   ├── final_mixed_train_no_coco_vg.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Convert Flickr30k annotations to ODVG format:

python -m tools.dataset_converters.goldg2odvg data/flickr30k_entities/final_flickr_separateGT_train.json

After conversion, final_flickr_separateGT_train_vg.json will be created under data/flickr30k_entities:

DETR-ViP
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_train_vg.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
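
The GoldG annotations are phrase-grounding data: each box is tied to character spans (`tokens_positive`) in the image caption. The sketch below shows, under assumptions, how such spans can be turned into phrases and collected into a per-image grounding record. It is not the repository's goldg2odvg.py; the output field names follow the grounding layout used by MMDetection's ODVG format and are assumed to apply here, and both function names are hypothetical.

```python
def phrase_from_spans(caption, tokens_positive):
    """Join the caption substrings selected by [start, end) character spans."""
    return " ".join(caption[start:end] for start, end in tokens_positive)

def grounding_record(image, annotations):
    """Build one grounding-style record for an image and its annotations."""
    regions = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]  # input boxes are [x, y, width, height]
        regions.append({
            "bbox": [x, y, x + w, y + h],
            "phrase": phrase_from_spans(image["caption"], ann["tokens_positive"]),
            "tokens_positive": ann["tokens_positive"],
        })
    return {
        "filename": image["file_name"],
        "height": image["height"],
        "width": image["width"],
        "grounding": {"caption": image["caption"], "regions": regions},
    }
```

With the caption "a red car next to a tree", the span `[[2, 9]]` selects the phrase "red car".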

Generate GoldG cache file using prepare_OG_cache.py:

python -m tools.data_prepare.prepare_OG_cache --clip-path <path_clip_weights> --gqa-path <gqa_json> --flickr-path <flickr_json> --save-path <save_path>

Example:

python -m tools.data_prepare.prepare_OG_cache --clip-path weights/clip-vit-base-patch32 --gqa-path data/gqa/final_mixed_train_no_coco_vg.json --flickr-path data/flickr30k_entities/final_flickr_separateGT_train_vg.json --save-path cache/vocabulary/

3. COCO 2017

The above configs evaluate on COCO 2017 during training, so the dataset needs to be prepared. Download from the COCO website or opendatalab. Place or symlink to data/coco.

Generate the COCO category CLIP feature cache (required for text detection evaluation):

python -m tools.data_prepare.prepare_OD_cache data/coco/annotations/instances_train2017.json --clip-path <path_clip_weights> --output cache/vocabulary/coco_text_cache.pkl

Note: Make sure to create the cache/support/ directory before running the support set sampling below.

Sample the support set for Visual-G prompt detection using support_dataset.py:

python -m tools.data_prepare.support_dataset data/coco/annotations/instances_train2017.json -o cache/support/coco_sub.json
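
Conceptually, a support set is a small per-category sample of annotated instances that serves as visual prompts. The following sketch shows one plausible sampling scheme on a COCO-style dict. It is not the logic of support_dataset.py: the function name, the `shots_per_class` parameter and its default, the crowd filtering, and the seeding are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_support_set(coco, shots_per_class=5, seed=0):
    """Pick up to shots_per_class non-crowd annotations for every category."""
    by_category = defaultdict(list)
    for ann in coco["annotations"]:
        if not ann.get("iscrowd", 0):
            by_category[ann["category_id"]].append(ann)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = []
    for category_id in sorted(by_category):
        candidates = by_category[category_id]
        k = min(shots_per_class, len(candidates))
        chosen.extend(rng.sample(candidates, k))

    # Keep only the images that host at least one sampled annotation.
    kept_images = {ann["image_id"] for ann in chosen}
    return {
        "images": [img for img in coco["images"] if img["id"] in kept_images],
        "annotations": chosen,
        "categories": coco["categories"],
    }
```

Sampling with a fixed seed keeps evaluation comparable across runs, since the prompts drawn for each category stay the same.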

Evaluation Data

LVIS

Generate the LVIS category CLIP feature cache:

python -m tools.data_prepare.prepare_OD_cache data/lvis/annotations/lvis_v1_train.json --clip-path <path_clip_weights> --output cache/vocabulary/lvis_vocabulary.pkl

Note: There is a typo in the LVIS vocabulary. After generating the cache, manually rename speaker_(stero_equipment) to speaker_(stereo_equipment):

import pickle

path = "cache/vocabulary/lvis_vocabulary.pkl"
with open(path, "rb") as f:
    data = pickle.load(f)
# Move the entry from the misspelled key to the corrected one.
data["speaker_(stereo_equipment)"] = data.pop("speaker_(stero_equipment)")
with open(path, "wb") as f:
    pickle.dump(data, f)  # overwrite the cache in place

Sample the support set for Visual-G prompt detection on LVIS:

python -m tools.data_prepare.support_dataset data/lvis/annotations/lvis_v1_train.json -o cache/support/lvis_sup.json --minival data/lvis/annotations/lvis_v1_minival.json

Train

Single GPU

python -m tools.train <config> --work-dir <work_dirs>

Example:

python -m tools.train configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py --work-dir work_dirs/debug

Multi-node Multi-GPU

Note: Before running, set path-to-project to your project root path in tools/dist_train.sh.

bash tools/dist_train.sh <config> <GPUs> --work-dir <work_dirs>

Example:

bash tools/dist_train.sh configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py 4 --work-dir work_dirs/debug

VS Code Debug

Configure .vscode/launch.json:

{
    "name": "train",
    "cwd": "${workspaceFolder}",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/tools/train.py",
    "console": "integratedTerminal",
    "env": {
        "PYTHONPATH": "${workspaceFolder}"
    },
    "args": [
        "configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py",
        "--work-dir", "work_dirs/debug"
    ],
    "justMyCode": false
}

Evaluation

Note: You can switch between test modes (text / visual-generic / visual-interactive) by modifying val_mode and test_mode in val_cfg and test_cfg of the config file.

Single GPU

python -m tools.test <config> <checkpoint> --work-dir <work_dirs>

Multi-GPU

Note: Before running, set path-to-project to your project root path in tools/dist_test.sh.

bash tools/dist_test.sh <config> <checkpoint> <work_dirs> <GPUs>

Model Zoo

Due to company confidentiality requirements, pretrained weights cannot be released at this time. We will consider open-sourcing them once permitted.

Citation

If any part of our paper or code helps your research, please consider citing us and starring the repository.

@article{qian2026detr,
  title={DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts},
  author={Qian, Bo and Shi, Dahu and Wei, Xing},
  journal={arXiv preprint arXiv:2604.14684},
  year={2026}
}
