Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, Xiankai Lu
See installation instructions.
See Preparing Datasets for OVFormer.
First, we train the OVFormer model on the LVIS dataset:

```bash
python train_net.py --num-gpus 4 \
  --config-file configs/lvis/ovformer_R50_bs8.yaml
```

To evaluate the model's zero-shot generalization performance on the VIS datasets, use:

```bash
python train_net_video.py \
  --config-file configs/youtubevis_2019/ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvis.pth
```

For YTVIS19/21, split the resulting results.json into base and novel categories with the provided tool; for OVIS, package the results and upload them to the evaluation server (see the packaging sketch after the results table); for BURST, run mAP.py.
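If you prefer to script the base/novel split yourself, here is a minimal Python sketch. The base/novel category ID sets and the results.json path are placeholders you must take from the official split definition, and the per-prediction `category_id` field is an assumption based on the standard YTVIS-style results format.

```python
import json
from pathlib import Path

# Minimal sketch: split a YTVIS-style results.json into base and novel subsets.
# ASSUMPTIONS: each prediction is a dict with a "category_id" field, and the
# ID sets below are placeholders -- take them from the official base/novel split.
BASE_CATEGORY_IDS = {1, 2, 3}   # placeholder
NOVEL_CATEGORY_IDS = {4, 5}     # placeholder

results_path = Path("results.json")  # output of the evaluation command above
predictions = json.loads(results_path.read_text())

base = [p for p in predictions if p["category_id"] in BASE_CATEGORY_IDS]
novel = [p for p in predictions if p["category_id"] in NOVEL_CATEGORY_IDS]

Path("results_base.json").write_text(json.dumps(base))
Path("results_novel.json").write_text(json.dumps(novel))
print(f"{len(base)} base and {len(novel)} novel predictions written")
```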
You are expected to get results like this:
| Model | Backbone | YTVIS19 | YTVIS21 | OVIS | BURST | weights |
|---|---|---|---|---|---|---|
| OVFormer | R-50 | 34.8 | 29.8 | 15.1 | 6.8 | model |
| OVFormer | Swin-B | 44.3 | 37.6 | 21.3 | 7.6 | model |
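For the OVIS submission step mentioned above, the predictions typically have to be zipped before uploading. Below is a minimal sketch; the results path and the archive layout expected by the server are assumptions, so check the server's submission instructions.

```python
import zipfile
from pathlib import Path

# Minimal sketch: package predictions for upload to the OVIS evaluation server.
# ASSUMPTION: the server expects a zip archive with results.json at its root;
# the path below is a placeholder for wherever your evaluation run wrote it.
results = Path("output/inference/results.json")  # placeholder path

with zipfile.ZipFile("ovis_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(results, arcname="results.json")
print("wrote ovis_submission.zip")
```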
Then, we train the OVFormer model on the LV-VIS dataset in a video-based manner:

```bash
python train_net_lvvis.py --num-gpus 4 \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml
```

To evaluate a model's performance on the LV-VIS dataset, use:

```bash
python train_net_lvvis.py \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvvis.pth
```

Then run mAP.py; you are expected to get results like this:
| Model | Backbone | LVVIS val | LVVIS test | weights |
|---|---|---|---|---|
| OVFormer | R-50 | 21.9 | 15.2 | model |
| OVFormer | Swin-B | 24.7 | 19.5 | model |
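Both mAP.py and the server submissions above consume a YTVIS-style results.json. The quick sanity check below assumes that format (a list of per-video predictions with `video_id`, `category_id`, `score`, and per-frame RLE `segmentations`); verify the exact fields against the evaluation toolkit before relying on it.

```python
import json
from collections import Counter
from pathlib import Path

# Minimal sanity check of a results.json before running mAP.py or uploading.
# ASSUMPTION: YTVIS-style format -- a list of dicts with "video_id",
# "category_id", "score", and "segmentations" (one RLE dict or None per frame).
predictions = json.loads(Path("results.json").read_text())

print(f"{len(predictions)} predictions")
print("videos covered:", len({p["video_id"] for p in predictions}))
print("top categories:", Counter(p["category_id"] for p in predictions).most_common(5))

bad = [p for p in predictions
       if not (0.0 <= p["score"] <= 1.0) or not isinstance(p["segmentations"], list)]
print(f"{len(bad)} malformed predictions")
```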
```bibtex
@inproceedings{fang2024unified,
title={Unified embedding alignment for open-vocabulary video instance segmentation},
author={Fang, Hao and Wu, Peng and Li, Yawei and Zhang, Xinxin and Lu, Xiankai},
booktitle={ECCV},
pages={225--241},
year={2025},
organization={Springer}
}
```

This repo is based on detectron2, Mask2Former, and LVVIS. Thanks for their great work!