MIV-XJTU/DETR-ViP

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

[Paper] [PDF] [BibTeX]

Introduction

DETR-ViP is a detection framework for visual-prompted object detection, an interactive paradigm that uses visual features, rather than text, to define target categories on the fly. Visual prompts excel at recognizing rare and fine-grained categories, but existing methods suffer from poor class discriminability because they treat visual prompts as a byproduct of text-prompted training.

DETR-ViP addresses this with three key innovations:

  • Global Prompt Integration — incorporates global class relationships into visual prompt learning
  • Visual-Textual Prompt Relation Distillation — transfers discriminability from text to visual prompts via knowledge distillation
  • Selective Fusion Strategy — stably combines visual and textual prompts for robust detection

Built on image-text contrastive learning, DETR-ViP achieves substantial improvements on COCO, LVIS, ODinW, and Roboflow100 for both zero-shot generic and interactive detection.

This repository contains the official implementation of DETR-ViP-T and DETR-ViP-L.

DETR-ViP Framework

Results

Zero-shot Generic Detection (COCO & LVIS)

Model Pretrain COCO LVIS ODinW RF100
AP AP APf APc APr APavg APavg
DETR-ViP-T O365 42.3 41.1 40.4 43.3 35.1 65.4 66.1
DETR-ViP-L GoldG 52.4 43.5 42.3 45.1 42.9 64.2

Below are qualitative results of zero-shot generic detection on COCO:

Generic Detection on COCO

Below are qualitative results of zero-shot generic detection on LVIS:

Generic Detection on LVIS

Zero-shot Interactive Detection (COCO & LVIS)

Model        COCO AP   LVIS AP   LVIS APf   LVIS APc   LVIS APr   ODinW APavg   RF100 APavg
DETR-ViP-T   65.4      66.1      57.5       73.5       78.4       46.8          40.1
DETR-ViP-L   71.1      71.9      64.2       78.2       83.6       51.2          44.3

Below are qualitative results of zero-shot interactive detection on COCO:

Interactive Detection on COCO

Below are qualitative results of zero-shot interactive detection on LVIS:

Interactive Detection on LVIS

Installation

Requirements

Package Version
PyTorch 2.0.1+cu117
torchaudio 2.0.2+cu117
torchvision 0.15.2+cu117
MMCV 2.1.0
MMDetection 3.3.0
MMEngine 0.11.0rc2
numpy 1.26.4
spacy 2.3.9
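
To confirm that an environment matches the table above, a short check along these lines may help. It is a minimal sketch: the distribution names (for example `torch` for PyTorch and `mmdet` for MMDetection) are assumptions about how the packages are registered, and `find_mismatches` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Pins copied from the requirements table. The distribution names used as keys
# (e.g. "torch", "mmdet") are assumptions about how each package is registered.
EXPECTED = {
    "torch": "2.0.1+cu117",
    "torchaudio": "2.0.2+cu117",
    "torchvision": "0.15.2+cu117",
    "mmcv": "2.1.0",
    "mmdet": "3.3.0",
    "mmengine": "0.11.0rc2",
    "numpy": "1.26.4",
    "spacy": "2.3.9",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.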

MMCV

Online installation: refer to the MMCV installation guide.

Offline installation:

cd third_party
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
pip install -e . -v

MMDetection

Online installation: refer to the MMDetection installation guide.

Offline installation:

cd third_party
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -e . -v

MMEngine

Online installation: refer to the MMEngine installation guide.

Offline installation:

cd third_party
git clone https://github.com/open-mmlab/mmengine.git
cd mmengine
pip install -e . -v

Data Preparation

Pretrained Models

Swin Transformer backbones (auto-downloaded by default; offline fallback):

Model Download Link
Swin-Tiny swin_tiny_patch4_window7_224.pth
Swin-Large swin_large_patch4_window12_384_22k.pth

If offline, download the above and place them under ~/.cache/torch/hub/checkpoints/.
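
For the offline case, the copy step can be scripted. The sketch below assumes the checkpoint was already downloaded by hand; `install_checkpoint` is a hypothetical helper, and the cache directory is the one named above.

```python
import shutil
from pathlib import Path

# Cache directory named in the text above.
HUB_CHECKPOINTS = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"


def install_checkpoint(downloaded, cache_dir=HUB_CHECKPOINTS):
    """Copy a manually downloaded .pth file into the cache directory."""
    src = Path(downloaded)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    dst = cache_dir / src.name
    if not dst.exists():  # never overwrite an existing checkpoint
        shutil.copy2(src, dst)
    return dst
```

For example, `install_checkpoint("downloads/swin_tiny_patch4_window7_224.pth")` would place the Swin-Tiny weights in the cache, where `downloads/` stands for whatever local folder holds the file.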

CLIP model (for generating category feature caches):
Download from clip-vit-base-patch32 and set the path as path_clip_weights in the commands below.

Pretraining Data

The models are pretrained on Objects365 V1 and GoldG datasets. Configs for training on COCO or Objects365 alone are also provided.

1. Objects365 V1

Corresponding config: DETR-ViP_swin-t_pretrain_obj365.py

Objects365 V1 can be downloaded from opendatalab. Both CLI and SDK download methods are supported.

After downloading and extracting, place or symlink it to data/objects365v1 with the following structure:

DETR-ViP
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test

Convert to ODVG format using coco2odvg.py:

python -m tools.dataset_converters.coco2odvg data/objects365v1/objects365_train.json -d o365v1

After conversion, o365v1_train_od.json and o365v1_label_map.json will be created under data/objects365v1:

DETR-ViP
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── objects365_train_od.json
│   │   ├── o365v1_label_map.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test
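
For orientation, the conversion can be pictured roughly as follows. This is a minimal sketch and not the repository's coco2odvg.py: it assumes the ODVG detection layout used by MMDetection-style converters (one JSON object per image, with corner-format boxes under `detection.instances`), it assumes labels are remapped to contiguous 0-based indices, and the function names are hypothetical.

```python
import json


def coco_to_odvg_records(coco):
    """Turn a loaded COCO-style dict into a list of ODVG-style image records."""
    # Map original category ids to contiguous 0-based labels (an assumption).
    cats = sorted(coco["categories"], key=lambda c: c["id"])
    label_of = {c["id"]: i for i, c in enumerate(cats)}
    name_of = {c["id"]: c["name"] for c in cats}

    instances = {}
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= 0 or h <= 0:
            continue  # skip degenerate boxes
        instances.setdefault(ann["image_id"], []).append({
            "bbox": [x, y, x + w, y + h],  # corner format [x1, y1, x2, y2]
            "label": label_of[ann["category_id"]],
            "category": name_of[ann["category_id"]],
        })

    return [
        {
            "filename": img["file_name"],
            "height": img["height"],
            "width": img["width"],
            "detection": {"instances": instances.get(img["id"], [])},
        }
        for img in coco["images"]
    ]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

The main point is the box conversion from COCO's `[x, y, width, height]` to `[x1, y1, x2, y2]`; the exact fields written by the real converter should be checked against its output.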

Generate the CLIP feature cache for Objects365 category names using prepare_OD_cache.py:

python -m tools.data_prepare.prepare_OD_cache data/objects365v1/objects365_train.json --clip-path <path_clip_weights> --output cache/vocabulary/o365_vocabulary.pkl
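
Conceptually, such a cache maps each category name to a text feature. The sketch below illustrates that idea only and is not the repository's prepare_OD_cache.py: the name-keyed dictionary layout is inferred from the LVIS renaming snippet in this README, and `encode_text` is a hypothetical callable standing in for the CLIP text encoder.

```python
import json
import pickle
from pathlib import Path


def build_category_cache(ann_file, encode_text, output):
    """Encode every category name in a COCO-style file and pickle the result."""
    with open(ann_file, encoding="utf-8") as f:
        categories = json.load(f)["categories"]
    # One feature per category name; encode_text is a stand-in (assumption).
    cache = {cat["name"]: encode_text(cat["name"]) for cat in categories}
    out = Path(output)
    out.parent.mkdir(parents=True, exist_ok=True)  # e.g. cache/vocabulary/
    with open(out, "wb") as f:
        pickle.dump(cache, f)
    return cache
```

Passing the encoder in as an argument keeps the sketch independent of any particular CLIP loading code.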

2. GoldG

The GoldG dataset consists of GQA and Flickr30k, originally from the MixedGrounding dataset in the GLIP paper (excluding COCO).

First, download the annotation files from mdetr_annotations. The required files are:

  • final_mixed_train_no_coco.json
  • final_flickr_separateGT_train.json

GQA images can be downloaded from here. After downloading and extracting, place or symlink to data/gqa:

DETR-ViP
├── configs
├── data
│   ├── gqa
│   │   ├── final_mixed_train_no_coco.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Flickr30k images can be downloaded from here, which requires an application for access. After downloading and extracting, place or symlink to data/flickr30k_entities:

DETR-ViP
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Convert GQA annotations to ODVG format using goldg2odvg.py:

python -m tools.dataset_converters.goldg2odvg data/gqa/final_mixed_train_no_coco.json

After conversion, final_mixed_train_no_coco_vg.json will be created under data/gqa:

DETR-ViP
├── configs
├── data
│   ├── gqa
│   │   ├── final_mixed_train_no_coco.json
│   │   ├── final_mixed_train_no_coco_vg.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Convert Flickr30k annotations to ODVG format:

python -m tools.dataset_converters.goldg2odvg data/flickr30k_entities/final_flickr_separateGT_train.json

After conversion, final_flickr_separateGT_train_vg.json will be created under data/flickr30k_entities:

DETR-ViP
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_train_vg.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
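
Before generating the GoldG cache, it can be worth confirming that the layout matches the trees above. The following is a small sketch; `missing_paths` is a hypothetical helper, and the path list is copied from the directory listings in this section.

```python
from pathlib import Path

# Paths taken from the directory trees above, relative to the project root.
GOLDG_PATHS = [
    "data/gqa/final_mixed_train_no_coco.json",
    "data/gqa/final_mixed_train_no_coco_vg.json",
    "data/gqa/images",
    "data/flickr30k_entities/final_flickr_separateGT_train.json",
    "data/flickr30k_entities/final_flickr_separateGT_train_vg.json",
    "data/flickr30k_entities/flickr30k_images",
]


def missing_paths(root, relative_paths=GOLDG_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_paths("."):
        print(f"missing: {path}")
```

Run from the project root, the script prints nothing when every expected file and folder is present.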

Generate GoldG cache file using prepare_OG_cache.py:

python -m tools.data_prepare.prepare_OG_cache --clip-path <path_clip_weights> --gqa-path <gqa_json> --flickr-path <flickr_json> --save-path <save_path>

Example:

python -m tools.data_prepare.prepare_OG_cache --clip-path weights/clip-vit-base-patch32 --gqa-path data/gqa/final_mixed_train_no_coco_vg.json --flickr-path data/flickr30k_entities/final_flickr_separateGT_train_vg.json --save-path cache/vocabulary/

3. COCO 2017

The above configs evaluate on COCO 2017 during training, so the dataset needs to be prepared. Download from the COCO website or opendatalab. Place or symlink to data/coco.

Generate the COCO category CLIP feature cache (required for text detection evaluation):

python -m tools.data_prepare.prepare_OD_cache data/coco/annotations/instances_train2017.json --clip-path <path_clip_weights> --output cache/vocabulary/coco_text_cache.pkl

Note: Make sure to create the cache/support/ directory before running the support set sampling below.

Sample the support set for Visual-G prompt detection using support_dataset.py:

python -m tools.data_prepare.support_dataset data/coco/annotations/instances_train2017.json -o cache/support/coco_sub.json
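
The general idea of a support set is a handful of annotated examples per category from which visual prompts can be drawn. The sketch below illustrates that idea only; it is not the logic of support_dataset.py, and the `shots` and `seed` parameters as well as the crowd filter are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_support_set(coco, shots=5, seed=0):
    """Keep at most `shots` randomly chosen annotations per category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for ann in coco["annotations"]:
        if ann.get("iscrowd", 0):
            continue  # crowd regions make poor single-object prompts
        by_category[ann["category_id"]].append(ann)

    kept = []
    for cat_id in sorted(by_category):
        anns = by_category[cat_id]
        kept.extend(rng.sample(anns, min(shots, len(anns))))

    # Keep only the images that still have at least one sampled annotation.
    image_ids = {ann["image_id"] for ann in kept}
    return {
        "images": [img for img in coco["images"] if img["id"] in image_ids],
        "annotations": kept,
        "categories": coco["categories"],
    }
```

The result keeps the COCO-style structure, so it can be written with `json.dump` to a file such as the one under cache/support/.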

Evaluation Data

LVIS

Generate the LVIS category CLIP feature cache:

python -m tools.data_prepare.prepare_OD_cache data/lvis/annotations/lvis_v1_train.json --clip-path <path_clip_weights> --output cache/vocabulary/lvis_vocabulary.pkl

Note: There is a typo in the LVIS vocabulary. After generating the cache, manually rename speaker_(stero_equipment) to speaker_(stereo_equipment):

import pickle

CACHE_PATH = "cache/vocabulary/lvis_vocabulary.pkl"

# Load the generated cache
with open(CACHE_PATH, "rb") as f:
    data = pickle.load(f)

# Move the entry from the misspelled key to the corrected key
data["speaker_(stereo_equipment)"] = data.pop("speaker_(stero_equipment)")

# Write the corrected cache back in place
with open(CACHE_PATH, "wb") as f:
    pickle.dump(data, f)

Sample the support set for Visual-G prompt detection on LVIS:

python -m tools.data_prepare.support_dataset data/lvis/annotations/lvis_v1_train.json -o cache/support/lvis_sup.json --minival data/lvis/annotations/lvis_v1_minival.json

Train

Single GPU

python -m tools.train <config> --work-dir <work_dirs>

Example:

python -m tools.train configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py --work-dir work_dirs/debug

Multi-node Multi-GPU

Note: Before running, set path-to-project to your project root path in tools/dist_train.sh.

bash tools/dist_train.sh <config> <GPUs> --work-dir <work_dirs>

Example:

bash tools/dist_train.sh configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py 4 --work-dir work_dirs/debug

VS Code Debug

Add the following entry to the "configurations" array in .vscode/launch.json:

{
    "name": "train",
    "cwd": "${workspaceFolder}",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/tools/train.py",
    "console": "integratedTerminal",
    "env": {
        "PYTHONPATH": "${workspaceFolder}"
    },
    "args": [
        "configs/detr_vip/DETR-ViP_swin-t_pretrain_obj365.py",
        "--work-dir", "work_dirs/debug"
    ],
    "justMyCode": false
}

Evaluation

Note: You can switch between test modes (text / visual-generic / visual-interactive) by modifying val_mode and test_mode in val_cfg and test_cfg of the config file.

Single GPU

python -m tools.test <config> <checkpoint> --work-dir <work_dirs>

Multi-GPU

Note: Before running, set path-to-project to your project root path in tools/dist_test.sh.

bash tools/dist_test.sh <config> <checkpoint> <work_dirs> <GPUs>

Model Zoo

Due to company confidentiality requirements, pretrained weights cannot be released at this time. We will consider open-sourcing them once permitted.

Citation

If any part of our paper or code helps your research, please consider citing us and starring the repository.

@article{qian2026detr,
  title={DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts},
  author={Qian, Bo and Shi, Dahu and Wei, Xing},
  journal={arXiv preprint arXiv:2604.14684},
  year={2026}
}
