Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, Xiankai Lu
See installation instructions.
See Preparing Datasets for OVFormer.
First, we train the OVFormer model on the LVIS dataset:

```bash
python train_net.py --num-gpus 4 \
  --config-file configs/lvis/ovformer_R50_bs8.yaml
```

To evaluate the model's zero-shot generalization performance on the VIS datasets, use:

```bash
python train_net_video.py \
  --config-file configs/youtubevis_2019/ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvis.pth
```

For YTVIS19/21, split the resulting results.json into base and novel categories with the provided tool; for OVIS, package the results and upload them to the evaluation server (see the packaging sketch after the results table); for BURST, run mAP.py.
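If you prefer to script the base/novel split yourself, here is a minimal Python sketch. The base/novel category ID sets and the results.json path are placeholders you must take from the official split definition, and the per-prediction `category_id` field is an assumption based on the standard YTVIS-style results format.

```python
import json
from pathlib import Path

# Minimal sketch: split a YTVIS-style results.json into base and novel subsets.
# ASSUMPTIONS: each prediction is a dict with a "category_id" field, and the
# ID sets below are placeholders -- take them from the official base/novel split.
BASE_CATEGORY_IDS = {1, 2, 3}   # placeholder
NOVEL_CATEGORY_IDS = {4, 5}     # placeholder

results_path = Path("results.json")  # output of the evaluation command above
predictions = json.loads(results_path.read_text())

base = [p for p in predictions if p["category_id"] in BASE_CATEGORY_IDS]
novel = [p for p in predictions if p["category_id"] in NOVEL_CATEGORY_IDS]

Path("results_base.json").write_text(json.dumps(base))
Path("results_novel.json").write_text(json.dumps(novel))
print(f"{len(base)} base and {len(novel)} novel predictions written")
```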
You are expected to get results like this:
| Model | Backbone | YTVIS19 | YTVIS21 | OVIS | BURST | weights |
|---|---|---|---|---|---|---|
| OVFormer | R-50 | 34.8 | 29.8 | 15.1 | 6.8 | model |
| OVFormer | Swin-B | 44.3 | 37.6 | 21.3 | 7.6 | model |
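For the OVIS submission step mentioned above, the predictions typically have to be zipped before uploading. Below is a minimal sketch; the results path and the archive layout expected by the server are assumptions, so check the server's submission instructions.

```python
import zipfile
from pathlib import Path

# Minimal sketch: package predictions for upload to the OVIS evaluation server.
# ASSUMPTION: the server expects a zip archive with results.json at its root;
# the path below is a placeholder for wherever your evaluation run wrote it.
results = Path("output/inference/results.json")  # placeholder path

with zipfile.ZipFile("ovis_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(results, arcname="results.json")
print("wrote ovis_submission.zip")
```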
Then, we train the OVFormer model on the LV-VIS dataset in a video-based manner:

```bash
python train_net_lvvis.py --num-gpus 4 \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml
```

To evaluate a model's performance on the LV-VIS dataset, use:

```bash
python train_net_lvvis.py \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvvis.pth
```

Then run mAP.py; you are expected to get results like this:
| Model | Backbone | LVVIS val | LVVIS test | weights |
|---|---|---|---|---|
| OVFormer | R-50 | 21.9 | 15.2 | model |
| OVFormer | Swin-B | 24.7 | 19.5 | model |
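Both mAP.py and the server submissions above consume a YTVIS-style results.json. The quick sanity check below assumes that format (a list of per-video predictions with `video_id`, `category_id`, `score`, and per-frame RLE `segmentations`); verify the exact fields against the evaluation toolkit before relying on it.

```python
import json
from collections import Counter
from pathlib import Path

# Minimal sanity check of a results.json before running mAP.py or uploading.
# ASSUMPTION: YTVIS-style format -- a list of dicts with "video_id",
# "category_id", "score", and "segmentations" (one RLE dict or None per frame).
predictions = json.loads(Path("results.json").read_text())

print(f"{len(predictions)} predictions")
print("videos covered:", len({p["video_id"] for p in predictions}))
print("top categories:", Counter(p["category_id"] for p in predictions).most_common(5))

bad = [p for p in predictions
       if not (0.0 <= p["score"] <= 1.0) or not isinstance(p["segmentations"], list)]
print(f"{len(bad)} malformed predictions")
```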
```bibtex
@inproceedings{fang2024unified,
title={Unified embedding alignment for open-vocabulary video instance segmentation},
author={Fang, Hao and Wu, Peng and Li, Yawei and Zhang, Xinxin and Lu, Xiankai},
booktitle={ECCV},
pages={225--241},
year={2025},
organization={Springer}
}
```

This repo is based on detectron2, Mask2Former, and LVVIS. Thanks for their great work!