Referring Video Object Segmentation via Language-aligned Track Selection


Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim


Environment Settings

conda create -n SOLA python=3.10
conda activate SOLA

pip install -r requirements.txt
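
To quickly confirm that PyTorch was installed with GPU support (the track-generation commands below assume a CUDA device), you can run a short check. This is an optional sanity check, not part of the official setup:

# optional: verify the PyTorch install and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"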

Our work requires SAM2 and GroundingDINO, so please follow the installation guides from each repository [SAM2, GroundingDINO].

You need to clone each repository into the track_generation directory.

cd track_generation

git clone https://github.com/facebookresearch/sam2.git
cd sam2
(continue with instructions in SAM2 repository)

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
(continue with instructions in GroundingDINO repository)

cd ../..
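
Once both repositories are installed, an optional import check can catch a broken build early. This assumes the packages install under the module names sam2 and groundingdino, which may differ if the upstream repositories change:

# optional: verify that both dependencies are importable
python -c "import sam2, groundingdino; print('SAM2 and GroundingDINO import OK')"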

Dataset Preparation

Download MeViS and Ref-Youtube-VOS into the dataset folder.

The datasets should be organized as follows:

dataset/
    mevis/
        train/
            JPEGImages/
            meta_expressions.json
            mask_dict.json
        valid_u/
            ...
        valid/
            ...
    ref-ytbvos/
        train/
            Annotations/
            JPEGImages/
            meta_expressions.json
        valid/
            ...
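
As an optional sanity check, the key paths from the tree above can be listed before running track generation; adjust the root directory if your local folder name differs:

# optional: confirm the expected dataset layout
ls dataset/mevis/train/JPEGImages dataset/mevis/train/meta_expressions.json dataset/mevis/train/mask_dict.json
ls dataset/ref-ytbvos/train/Annotations dataset/ref-ytbvos/train/JPEGImages dataset/ref-ytbvos/train/meta_expressions.json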

Track Generation

For track generation, use the code inside the track_generation directory, which is assumed to contain both the SAM2 and GroundingDINO repositories.

Each dataset and split requires both Prompt Generation and Track Generation.

You can refer to the scripts directory for usage examples.

MeViS train / valid_u / valid

# MeViS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_mevis.py --dataset mevis --data_type train --pid 0 --n_pids 1

# MeViS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type valid_u --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

Ref-Youtube-VOS train / valid

# Ref-Youtube-VOS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_ytbvos.py --dataset ref-ytbvos --data_type train --pid 0 --n_pid 1

# Ref-Youtube-VOS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type valid --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

These commands generate SAM2 object tokens and the corresponding masklets in the sam2_tracks directory.
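
The --pid / --n_pids arguments appear to shard a split across parallel workers. Assuming they split the videos of one split among n_pids processes, a run such as the MeViS (train) GRID token generation could be parallelized over two GPUs as sketched below; this is an illustrative example, not a verified configuration:

# hypothetical two-GPU sharding via --pid / --n_pids
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 2 &
CUDA_VISIBLE_DEVICES=1 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 1 --n_pids 2 &
wait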

Track Selection

After generating SAM2 tracks and object tokens, train the model and run inference to obtain the final results.

The scripts directory provides simple usage examples:

# Training
sh train.sh mevis/default

# Evaluation
sh eval.sh mevis/default [epoch] --eval_pred_threshold [threshold]

# Inference
sh inference.sh mevis/default [epoch] --eval_pred_threshold [threshold]
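
For example, with placeholder values (epoch 10 and threshold 0.5 are purely illustrative, not recommended settings):

sh eval.sh mevis/default 10 --eval_pred_threshold 0.5
sh inference.sh mevis/default 10 --eval_pred_threshold 0.5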

To obtain Zero-shot results:

# Zero-shot Evaluation
sh eval.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]

# Zero-shot Inference
sh inference.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]

BibTeX

Please consider citing SOLA if it helps your research.

@article{kim2024referring,
  title={Referring Video Object Segmentation via Language-aligned Track Selection},
  author={Kim, Seongchan and Jin, Woojeong and Lim, Sangbeom and Yoon, Heeji and Choi, Hyunwook and Kim, Seungryong},
  journal={arXiv preprint arXiv:2412.01136},
  year={2024}
}
