Referring Video Object Segmentation via Language-aligned Track Selection


Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim


Environment Settings

conda create -n SOLA python=3.10
conda activate SOLA

pip install -r requirements.txt
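
To quickly confirm that PyTorch was installed with GPU support (the track-generation commands below assume a CUDA device), you can run a short check. This is an optional sanity check, not part of the official setup:

# optional: verify the PyTorch install and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"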

Our work requires SAM2 and GroundingDINO, so please follow the installation guides from each repository [SAM2, GroundingDINO].

You need to clone each repository into the track_generation directory.

cd track_generation

git clone https://github.com/facebookresearch/sam2.git
cd sam2
(continue with instructions in SAM2 repository)

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
(continue with instructions in GroundingDINO repository)

cd ../..
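
Once both repositories are installed, an optional import check can catch a broken build early. This assumes the packages install under the module names sam2 and groundingdino, which may differ if the upstream repositories change:

# optional: verify that both dependencies are importable
python -c "import sam2, groundingdino; print('SAM2 and GroundingDINO import OK')"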

Dataset Preparation

Download MeViS and Ref-Youtube-VOS into the dataset folder.

The datasets should be organized as follows:

dataset/
    mevis/
        train/
            JPEGImages/
            meta_expressions.json
            mask_dict.json
        valid_u/
            ...
        valid/
            ...
    ref-ytbvos/
        train/
            Annotations/
            JPEGImages/
            meta_expressions.json
        valid/
            ...
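
As an optional sanity check, the key paths from the tree above can be listed before running track generation; adjust the root directory if your local folder name differs:

# optional: confirm the expected dataset layout
ls dataset/mevis/train/JPEGImages dataset/mevis/train/meta_expressions.json dataset/mevis/train/mask_dict.json
ls dataset/ref-ytbvos/train/Annotations dataset/ref-ytbvos/train/JPEGImages dataset/ref-ytbvos/train/meta_expressions.json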

Track Generation

For track generation, use the code inside the track_generation directory, which is assumed to contain both the SAM2 and GroundingDINO repositories.

Each dataset and split requires both Prompt Generation and Track Generation.

You can refer to the scripts directory for usage examples.

MeViS train / valid_u / valid

# MeViS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_mevis.py --dataset mevis --data_type train --pid 0 --n_pids 1

# MeViS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type valid_u --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

Ref-Youtube-VOS train / valid

# Ref-Youtube-VOS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_ytbvos.py --dataset ref-ytbvos --data_type train --pid 0 --n_pid 1

# Ref-Youtube-VOS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type valid --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

These commands generate SAM2 object tokens and the corresponding masklets in the sam2_tracks directory.
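
The --pid / --n_pids arguments appear to shard a split across parallel workers. Assuming they split the videos of one split among n_pids processes, a run such as the MeViS (train) GRID token generation could be parallelized over two GPUs as sketched below; this is an illustrative example, not a verified configuration:

# hypothetical two-GPU sharding via --pid / --n_pids
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 2 &
CUDA_VISIBLE_DEVICES=1 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 1 --n_pids 2 &
wait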

Track Selection

After generating SAM2 tracks and object tokens, train the model and run inference to obtain the final results.

The scripts directory provides simple usage examples:

# Training
sh train.sh mevis/default

# Evaluation
sh eval.sh mevis/default [epoch] --eval_pred_threshold [threshold]

# Inference
sh inference.sh mevis/default [epoch] --eval_pred_threshold [threshold]
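
For example, with placeholder values (epoch 10 and threshold 0.5 are purely illustrative, not recommended settings):

sh eval.sh mevis/default 10 --eval_pred_threshold 0.5
sh inference.sh mevis/default 10 --eval_pred_threshold 0.5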

To obtain Zero-shot results:

# Zero-shot Evaluation
sh eval.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]

# Zero-shot Inference
sh inference.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]

BibTeX

Please consider citing SOLA if it helps your research.

@article{kim2024referring,
  title={Referring Video Object Segmentation via Language-aligned Track Selection},
  author={Kim, Seongchan and Jin, Woojeong and Lim, Sangbeom and Yoon, Heeji and Choi, Hyunwook and Kim, Seungryong},
  journal={arXiv preprint arXiv:2412.01136},
  year={2024}
}
