[Project page] | [arXiv] | [Evaluation Server v1 (legacy)] | [Evaluation Server v2]
This repository contains the code for the ICCV 2023 and TPAMI 2025 papers:

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
TPAMI 2025

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
ICCV 2023
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate the weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes.
Figure 1. Examples from Motion expressions Video Segmentation (MeViS) showing the dataset's nature and complexity. The selected target objects are masked in orange. The expressions in MeViS primarily focus on motion attributes, making it impossible to identify the target object from a single frame. For example, the first example has three parrots with similar appearances, and the target object is identified as "The bird flying away". This object can only be recognized by capturing its motion throughout the video. The updated MeViS 2024 further provides motion-reasoning and no-target expressions, adds audio expressions alongside text, and provides mask and bounding box trajectory annotations.
| Dataset | Pub.&Year | Videos | Object | Expression | Mask | Obj/Video | Obj/Expn | Target | Multi-target | No-target | Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 58k | 1.28 | 1 | Actor | - | - | - |
| DAVIS17-RVOS | ACCV 2018 | 90 | 205 | 205 | 13.5k | 2.27 | 1 | Object | - | - | - |
| ReferYoutubeVOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 131k | 1.86 | 1 | Object | - | - | - |
| MeViS 2023 | ICCV 2023 | 2,006 | 8,171 | 28,570 | 443k | 4.28 | 1.59 | Object(s) | 7,539 | - | - |
| MeViS 2024 | TPAMI 2025 | 2,006 | 8,171 | 33,072 | 443k | 4.28 | 1.58 | Object(s) | 8,028 | 3,503 | 33,072 |
Dataset Split
- 2,006 videos & 33,458 sentences in total;
- Train set: 1,662 videos & 27,502 sentences, used for training;
- Val^u set: 50 videos & 907 sentences, ground-truth provided, used for offline self-evaluation (e.g., ablation study) during training;
- Val set: 140 videos & 2,523 sentences, ground-truth not provided, used for CodaLab online evaluation;
- Test set: Will be progressively and selectively released and used for evaluation during the competition periods (PVUW, LSVOS);
It is suggested to report results on both the Val^u set and the Val set.
Please submit your Val set results to the CodaLab online evaluation server.
It is strongly suggested to first evaluate your model locally on the Val^u set before submitting your Val set results to the online evaluation system.
The dataset follows a structure similar to Refer-YouTube-VOS. Each split consists of three parts: JPEGImages, which holds the frame images; meta_expressions.json, which provides the referring expressions and video metadata; and mask_dict.json, which contains the ground-truth object masks. Ground-truth segmentation masks are saved in COCO RLE format, and expressions are organized similarly to Refer-YouTube-VOS (a minimal loading sketch is given after the directory tree below).
Please note that while annotations for all frames in the Train set and the Val^u set are provided, the Val set only provides frame images and referring expressions for inference.
mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1>
│   │   ├── <video #2>
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
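For reference, below is a minimal Python sketch for loading expressions and decoding the COCO RLE masks with pycocotools. The field names used here (videos, expressions, exp, anno_id, frames) follow the Refer-YouTube-VOS-style layout described above but should be verified against the released JSON files; the split path and the chosen video are placeholders.

```python
import json
import os

from pycocotools import mask as coco_mask  # pip install pycocotools

split_root = "mevis/valid_u"  # path to one split; adjust as needed

# meta_expressions.json: per-video frame lists and referring expressions
with open(os.path.join(split_root, "meta_expressions.json")) as f:
    videos = json.load(f)["videos"]

# mask_dict.json: ground-truth masks stored as COCO RLE, one entry per frame
with open(os.path.join(split_root, "mask_dict.json")) as f:
    mask_dict = json.load(f)

video_id, video = next(iter(videos.items()))           # e.g. a folder under JPEGImages
exp_id, exp_meta = next(iter(video["expressions"].items()))
print(video_id, exp_id, exp_meta["exp"])                # the language expression

# Each expression refers to one or more annotations; decode their masks.
# "anno_id" is assumed to index into mask_dict.json (verify against the data).
for anno_id in exp_meta.get("anno_id", []):
    rles = mask_dict[str(anno_id)]                      # one RLE (or None) per frame
    for frame_name, rle in zip(video["frames"], rles):
        if rle is None:                                 # object absent in this frame
            continue
        binary_mask = coco_mask.decode(rle)             # H x W uint8 array
        print(frame_name, binary_mask.shape, int(binary_mask.sum()))
```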
Please see INSTALL.md for installation instructions.
Obtain the output masks of the Val^u set:
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
Obtain the J&F results on the Val^u set:
python tools/eval_mevis.py
Obtain the output masks of Val set for CodaLab online evaluation:
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir] DATASETS.TEST '("mevis_test",)'
The submission format should be a .zip file containing the predicted .png results of the Val set (for the current competition stage).
You can use the following command to prepare the .zip submission file (a sketch for writing masks into this layout follows the sample structure below):
cd [output_dir]
zip -r ../xxx.zip *
A submission example named sample_submission_valid.zip can be found on CodaLab.
sample_submission_valid.zip     // .zip file, which directly packages 140 val video folders
├── 0ab4afe7fb46                // video folder name
│   ├── 0                       // expression_id folder name
│   │   ├── 00000.png           // .png files
│   │   ├── 00001.png
│   │   └── ....
│   │
│   ├── 1
│   │   ├── 00000.png
│   │   └── ....
│   │
│   └── ....
│
├── 0fea0cb75a25
│   ├── 0
│   │   ├── 00000.png
│   │   └── ....
│   │
│   └── ....
│
└── ....
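As a rough illustration of producing this layout before zipping, here is a hedged Python sketch that writes one binary mask as a PNG. The directory and file names mirror the sample above; the helper name and the 0/255 value convention are assumptions to be checked against the CodaLab instructions.

```python
import os

import numpy as np
from PIL import Image

output_dir = "output_dir"  # the same directory that gets zipped above


def save_prediction(video_id: str, exp_id: str, frame_name: str, mask: np.ndarray):
    """Save one binary mask (H x W, values 0/1) to <output_dir>/<video>/<exp_id>/<frame>.png."""
    folder = os.path.join(output_dir, video_id, exp_id)
    os.makedirs(folder, exist_ok=True)
    # Saved as a 0/255 grayscale PNG; confirm the expected value convention on CodaLab.
    Image.fromarray(mask.astype(np.uint8) * 255).save(
        os.path.join(folder, frame_name + ".png")
    )


# hypothetical usage with an empty dummy mask
save_prediction("0ab4afe7fb46", "0", "00000", np.zeros((720, 1280), dtype=np.uint8))
```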
First, download the backbone weights (model_final_86143f.pkl) and convert them using the provided script:
wget https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_swin_tiny_bs16_50ep/model_final_86143f.pkl
python tools/process_ckpt.py
Then start training:
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
Note: We also support training ReferFormer by providing ReferFormer_dataset.py
Our results on the Val^u set and Val set of the MeViS dataset.
- The Val^u set is used for offline evaluation by users themselves, e.g., for ablation studies.
- The Val set is used for CodaLab online evaluation by the MeViS dataset organizers.
| Backbone | Val^u J&F | Val^u J | Val^u F | Val J&F | Val J | Val F |
|---|---|---|---|---|---|---|
| Swin-Tiny & RoBERTa | 40.23 | 36.51 | 43.90 | 37.21 | 34.25 | 40.17 |
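Here, J is the region similarity (Jaccard index, i.e., IoU between the predicted and ground-truth masks) and F is the contour accuracy; J&F is their mean. The official numbers come from tools/eval_mevis.py; the snippet below is only a minimal per-frame J sketch for intuition.

```python
import numpy as np


def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (J) between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:        # both masks empty: count as perfect agreement
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


# toy example: a 2x2 prediction inside a 3x3 ground-truth region -> J = 4/9
pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=np.uint8);   gt[1:4, 1:4] = 1
print(region_similarity_j(pred, gt))    # ~0.444
```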
Google Drive
This project is based on VITA, GRES, Mask2Former, and VLT. Many thanks to the authors for their great works!
Please consider citing MeViS if it helps your research.
@article{MeViSv2,
title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}

A majority of videos in MeViS are from MOSE: Complex Video Object Segmentation Dataset.
@article{MOSEv2,
title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
journal={arXiv preprint arXiv:2508.05630},
year={2025}
}
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}

MeViS is licensed under a CC BY-NC-SA 4.0 License. The data of MeViS is released for non-commercial research purposes only.