Official implementation of the ICASSP 2025 paper "Object-Centric Discriminative Learning for Text-Based Person Retrieval".
We propose a novel framework for text-based person retrieval (TBPR), Object-Centric Discriminative Learning (OCDL), which incorporates person masks to indicate attentive regions, thereby enhancing the model’s focus on the pedestrians in images while suppressing background noise. Additionally, a novel cross-modal matching loss, Soft Angular Distribution Matching (SADM), is introduced to learn discriminative visual and textual representations. Experiments on three widely used TBPR benchmarks demonstrate the effectiveness of our approach.
We use a single NVIDIA A100 GPU for training and evaluation.
conda create -n ocdl_reid python=3.8
conda activate ocdl_reid
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
cd AlphaCLIP
pip install -e .
We have uploaded the datasets to Google Drive. The upload includes the original CUHK-PEDES, ICFG-PEDES, and RSTPReid data, as well as the additional person mask data (saved under the path ...alphas/).
Note that we have not released the code for generating person masks: the procedure is simple, and many open-source projects can now produce even better masks. If you are interested, Grounded-Segment-Anything is a good starting point.
Unzip the archives and organize them under your dataset root directory as follows:
|-- YOUR_DATA_ROOT
    |-- CUHK-PEDES
        |-- imgs
            |-- cam_a
            |-- cam_b
            |-- ...
        |-- alphas
            |-- cam_a
            |-- cam_b
            |-- ...
        |-- reid_raw.json
    |-- ICFG-PEDES
        |-- imgs
            |-- test
            |-- train
        |-- alphas
            |-- test
            |-- train
        |-- ICFG_PEDES.json
    |-- RSTPReid
        |-- imgs
        |-- alphas
        |-- data_captions.json
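Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the released code: the helper name `missing_entries` and the `EXPECTED_LAYOUT` table are hypothetical, and the check only looks at the top-level entries listed in the tree (it does not inspect individual images or masks).

```python
from pathlib import Path

# Top-level entries per dataset, copied from the directory tree above.
# EXPECTED_LAYOUT and missing_entries are illustrative names, not repo APIs.
EXPECTED_LAYOUT = {
    "CUHK-PEDES": ["imgs", "alphas", "reid_raw.json"],
    "ICFG-PEDES": ["imgs", "alphas", "ICFG_PEDES.json"],
    "RSTPReid": ["imgs", "alphas", "data_captions.json"],
}


def missing_entries(data_root, dataset_name):
    """Return the expected paths that are absent under data_root/dataset_name."""
    dataset_dir = Path(data_root) / dataset_name
    return [
        str(dataset_dir / entry)
        for entry in EXPECTED_LAYOUT[dataset_name]
        if not (dataset_dir / entry).exists()
    ]


if __name__ == "__main__":
    # "data" mirrors the YOUR_DATA_ROOT value used in the training command.
    for name in EXPECTED_LAYOUT:
        missing = missing_entries("data", name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

An empty list from `missing_entries` means the images, masks, and annotation file for that dataset are all where the tree says they should be.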
Download the model weights from the links below and place the downloaded files in the pretrain/ directory (e.g. pretrain/clip_b16_grit+mim_fultune_4xe.pth), or specify the weights directory using the --alpha_ckpt parameter. You can select the architecture with --pretrain_choice (e.g. --pretrain_choice ViT-B/16 for AlphaCLIP-B/16).
| model | google drive link | openxlab link |
|---|---|---|
| AlphaCLIP-B/16 | clip_b16_grit1m+mim_fultune_4xe | clip_b16_grit1m+mim_fultune_4xe |
| AlphaCLIP-L/14 | clip_l14_grit1m+mim_fultune_6xe | clip_l14_grit1m+mim_fultune_6xe |
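A quick way to confirm that a downloaded checkpoint sits where the instructions above expect it is sketched below. This is an assumption-laden helper, not part of the repository: `locate_checkpoint` is a hypothetical name, and the example filename is the one quoted in the text above, so adjust it to whatever file you actually downloaded.

```python
from pathlib import Path


def locate_checkpoint(filename, pretrain_dir="pretrain"):
    """Return the checkpoint path if it exists, else raise and list what is there."""
    directory = Path(pretrain_dir)
    path = directory / filename
    if path.is_file():
        return path
    # List the .pth files that are present to make a misnamed download easy to spot.
    found = sorted(p.name for p in directory.glob("*.pth")) if directory.is_dir() else []
    raise FileNotFoundError(
        f"{path} not found; .pth files in {directory}: {found or 'none'}"
    )


if __name__ == "__main__":
    try:
        # Filename taken from the example path in the text; replace as needed.
        print(locate_checkpoint("clip_b16_grit+mim_fultune_4xe.pth"))
    except FileNotFoundError as err:
        print(err)
```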
Set YOUR_DATA_ROOT to your own path, choose a dataset, and start training your TBPR model.
# Training on text-based person retrieval benchmarks
YOUR_DATA_ROOT="data"
# Pick one of: CUHK-PEDES, ICFG-PEDES, RSTPReid
DATASET_NAME="CUHK-PEDES"
CUDA_VISIBLE_DEVICES=0 \
python train_ocdl.py \
--root_dir "$YOUR_DATA_ROOT" \
--name OCDL \
--batch_size 128 \
--dataset_name "$DATASET_NAME" \
--loss_names 'sadm+id' \
--img_aug \
--lr 1e-5 \
--num_epoch 60 \
--pretrain_choice 'ViT-B/16' \
--sampler 'identity' \
--num_cls 4

Some components of this implementation are adapted from CLIP, IRRA and AlphaCLIP. We sincerely appreciate their contributions.
If you find our work useful for your research, please cite our paper.
@inproceedings{li2025object,
title={Object-Centric Discriminative Learning for Text-Based Person Retrieval},
author={Li, Haiwen and Liu, Delong and Su, Fei and Zhao, Zhicheng},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2025},
organization={IEEE}
}

If you have any questions, please contact us by e-mail: lihaiwen@bupt.edu.cn, liudelong@bupt.edu.cn.