Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Paper | Project Page | BibTeX
conda create -n SOLA python=3.10
conda activate SOLA
pip install -r requirements.txt
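Track generation runs on GPU (the commands below set CUDA_VISIBLE_DEVICES), so it may be worth confirming that the installed PyTorch build can actually see your GPU before continuing; a quick check, assuming PyTorch is pulled in by requirements.txt or the SAM2 install:
# Optional: verify that PyTorch sees a CUDA device
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"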
As our work requires SAM2 and GroundingDINO, please follow the installation guides from each repository [SAM2, GroundingDINO].
Clone each repository into the track_generation directory.
cd track_generation
git clone https://github.com/facebookresearch/sam2.git
cd sam2
(continue with instructions in SAM2 repository)
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
(continue with instructions in GroundingDINO repository)
cd ../..
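Both toolkits also need pretrained weights. The exact download steps live in the upstream repositories and may change over time; the sketch below assumes the checkpoint script and release URL currently documented there, and is run from inside track_generation (verify the paths against the SAM2 and GroundingDINO READMEs):
# SAM2 checkpoints (the SAM2 repo ships a download script under checkpoints/)
cd sam2/checkpoints && ./download_ckpts.sh && cd ../..
# GroundingDINO Swin-T weights from the official release page
mkdir -p GroundingDINO/weights
wget -P GroundingDINO/weights https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth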
Download MeViS and Ref-Youtube-VOS into the dataset directory.
The datasets should be organized as follows:
dataset/
    mevis/
        train/
            JPEGImages/
            meta_expressions.json
            mask_dict.json
        valid_u/
            ...
        valid/
            ...
    ref-ytbvos/
        train/
            Annotations/
            JPEGImages/
            meta_expressions.json
        valid/
            ...
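A quick way to catch a misplaced folder before running anything heavy is to check that the paths from the tree above exist; a minimal sketch, assuming the dataset/ root shown above (adjust the root if yours differs):
# Sanity-check the expected dataset layout
for d in dataset/mevis/train/JPEGImages dataset/mevis/valid_u dataset/mevis/valid \
         dataset/ref-ytbvos/train/JPEGImages dataset/ref-ytbvos/valid; do
  [ -e "$d" ] || echo "missing: $d"
done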
For Track Generation, use the code inside the track_generation directory, which is assumed to contain both the SAM2 and GroundingDINO repositories cloned above.
Each dataset and split requires both Prompt Generation and Track Generation.
You can refer to the scripts directory for usage examples.
# MeViS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_mevis.py --dataset mevis --data_type train --pid 0 --n_pids 1
# MeViS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
# MeViS (valid_u / valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1
# MeViS (valid_u / valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type valid_u --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
# Ref-Youtube-VOS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_ytbvos.py --dataset ref-ytbvos --data_type train --pid 0 --n_pid 1
# Ref-Youtube-VOS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
# Ref-Youtube-VOS (valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1
# Ref-Youtube-VOS (valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type valid --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
These commands generate SAM2 object tokens and the corresponding masklets in the sam2_tracks directory.
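The --pid and --n_pids (or --n_pid) arguments look like a shard index and shard count for splitting a dataset across processes; the sketch below assumes that semantics and simply launches one shard per GPU, reusing the MeViS valid_u GRID command above (adjust the GPU indices and shard count to your machine):
# Hypothetical 4-way parallel launch, one shard per GPU
N_PIDS=4
for PID in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$PID python generate_tokens_grid.py --dataset mevis --data_type valid_u \
    --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid $PID --n_pids $N_PIDS &
done
wait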
After generating the SAM2 tracks and object tokens, train the model and run inference to obtain the final results.
Usage examples are provided in the scripts directory.
# Training
sh train.sh mevis/default
# Evaluation
sh eval.sh mevis/default [epoch] --eval_pred_threshold [threshold]
# Inference
sh inference.sh mevis/default [epoch] --eval_pred_threshold [threshold]
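For example, to evaluate and run inference with the checkpoint from epoch 8 at a prediction threshold of 0.5 (both values are placeholders for illustration; use the ones that match your training run):
sh eval.sh mevis/default 8 --eval_pred_threshold 0.5
sh inference.sh mevis/default 8 --eval_pred_threshold 0.5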
To obtain zero-shot results:
# Zero-shot Evaluation
sh eval.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]
# Zero-shot Inference
sh inference.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]
Please consider citing SOLA if it helps your research.
@article{kim2024referring,
  title={Referring Video Object Segmentation via Language-aligned Track Selection},
  author={Kim, Seongchan and Jin, Woojeong and Lim, Sangbeom and Yoon, Heeji and Choi, Hyunwook and Kim, Seungryong},
  journal={arXiv preprint arXiv:2412.01136},
  year={2024}
}