Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang
Paper | Project Page | Code
MuSS is a large-scale cinematic dataset and benchmark for multi-shot video generation and Subject-to-Video (S2V) generation. It targets three limitations that are difficult to expose in single-shot settings:
- lack of authentic cinematic narrative logic;
- conflicts between global captions and local shot-level alignment;
- copy-paste shortcuts in Subject-to-Video models.
MuSS contains two complementary data settings: Complex Cinematic Narrative, which focuses on montage, shot transitions, and multi-character storytelling, and Subject-Centric Narrative, which focuses on preserving the same subject across disjoint shots and viewpoints.
- 2026-04: Paper released on arXiv.
- Initial public release: data construction code is released first. Dataset files, benchmark implementation, and model checkpoints are not included in this repository.
MuSS is sourced from over 3,000 movies and contains more than 30,000 professionally captioned multi-shot clips, totaling over 1,000 hours of high-quality video content.
The construction pipeline first turns raw cinematic footage into high-quality physical shots with coherent captions, then builds cross-shot S2V pairs by sampling reference subjects from disjoint shot contexts.
Complex Cinematic Narrative. Progressive multi-shot captions are aligned to physical shots, capturing shot transitions, scene changes, and multi-character narrative flow.
Subject-Centric Narrative. A reference subject is extracted from a disjoint shot, while the target sequence preserves identity across different viewpoints and contexts.
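The cross-shot sampling idea can be illustrated with a minimal sketch. It is not the code in `s2v-pipeline/`: the `SubjectAppearance` record, the `sample_cross_shot_pair` function, and the shot indices are hypothetical names chosen for illustration. It assumes subject detections per physical shot are already available, and it only shows the constraint that the reference shot must be disjoint from the target sequence.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectAppearance:
    """One detected appearance of a subject in a physical shot (hypothetical record)."""
    subject_id: str
    shot_index: int


def sample_cross_shot_pair(appearances, target_shots, subject_id, rng=None):
    """Pick a reference shot for `subject_id` that lies outside `target_shots`.

    Returns (reference_shot_index, sorted_target_shots), or None when the
    subject never appears outside the target sequence.
    """
    rng = rng or random.Random()
    target = set(target_shots)
    # Candidate reference shots: the subject is visible, and the shot is
    # disjoint from the target sequence.
    candidates = sorted(
        {a.shot_index for a in appearances
         if a.subject_id == subject_id and a.shot_index not in target}
    )
    if not candidates:
        return None
    return rng.choice(candidates), sorted(target)


appearances = [
    SubjectAppearance("hero", 0),
    SubjectAppearance("hero", 3),
    SubjectAppearance("hero", 4),
    SubjectAppearance("villain", 4),
]
pair = sample_cross_shot_pair(
    appearances, target_shots=[3, 4], subject_id="hero", rng=random.Random(0)
)
print(pair)  # (0, [3, 4])
```

Here the only shot that shows `hero` outside the target shots 3 and 4 is shot 0, so it becomes the reference. A subject that appears only inside the target sequence yields `None` and would be skipped.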
The Cinematic Narrative Benchmark evaluates generated videos under realistic multi-shot storytelling conditions. It combines shot boundary parsing, expert perception models, and LMM-based visual-logic assessment.
| Track | Evaluation Goal | Metrics |
|---|---|---|
| Track 1: Narrative Effectiveness | Shot-level alignment, transition precision, scene continuity, and visual logic. | Txt.Align, Trans.Dev, Scene.Con, Con.Gap, Scene.Logic, Casting.Logic, Act.Logic, Spat.Logic |
| Track 2: Subject Consistency | Cross-shot identity preservation, subject grounding, motion strength, and anti-copy-paste behavior. | Subj.Recall, Ref-Sub.Con, Inter-Sub.Con, Act.Str, ACP-Var, CP-Rate |
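The benchmark's shot boundary parser is not part of this release. As a rough illustration of what splitting a generated video into shots involves, the sketch below uses a generic histogram-difference heuristic. This is a swapped-in textbook technique, not the benchmark's method. The function names, the three-bin histograms, and the `0.5` threshold are assumptions made for the example.

```python
def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms, scaled to [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(histograms, threshold=0.5):
    """Return frame indices i where frame i differs sharply from frame i - 1."""
    return [
        i for i in range(1, len(histograms))
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold
    ]


# Toy per-frame brightness histograms (each sums to 1): three dark frames,
# then two bright frames, i.e. a hard cut at frame index 3.
dark = [0.9, 0.1, 0.0]
bright = [0.0, 0.2, 0.8]
frames = [dark, dark, dark, bright, bright]
boundaries = detect_shot_boundaries(frames)
print(boundaries)  # [3]
```

A fixed threshold like this only catches hard cuts. Gradual transitions such as fades or dissolves need a learned detector or a windowed comparison.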
ACP-Var measures pose and structural diversity between the reference image and generated frames, explicitly penalizing rigid 2D reference copying.
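To make the intuition concrete, the sketch below scores how far generated-frame poses drift from the reference pose. It is an illustrative stand-in and not the ACP-Var formula from the paper. The keypoint layout, the function names, and the use of mean Euclidean displacement are all assumptions. A score near zero means every frame repeats the reference pose, which is the rigid copying behavior the metric is meant to penalize.

```python
import math


def pose_displacement(ref_keypoints, frame_keypoints):
    """Mean Euclidean distance between matching keypoints (normalized coordinates)."""
    dists = [math.dist(r, f) for r, f in zip(ref_keypoints, frame_keypoints)]
    return sum(dists) / len(dists)


def pose_diversity(ref_keypoints, frames_keypoints):
    """Average displacement of generated-frame poses from the reference pose.

    Illustrative stand-in only; this is NOT the ACP-Var formula from the paper.
    """
    scores = [pose_displacement(ref_keypoints, kp) for kp in frames_keypoints]
    return sum(scores) / len(scores)


# Hypothetical 3-keypoint poses as (x, y) in normalized image coordinates.
reference = [(0.5, 0.2), (0.5, 0.5), (0.5, 0.8)]
copied = [reference, reference]  # every frame repeats the reference pose
varied = [
    [(0.5, 0.2), (0.6, 0.5), (0.7, 0.8)],
    [(0.4, 0.2), (0.3, 0.5), (0.2, 0.8)],
]
print(pose_diversity(reference, copied))        # 0.0
print(pose_diversity(reference, varied) > 0.0)  # True
```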
This repository currently releases the data-side code. The benchmark implementation is intentionally not included in this initial repository and will be released separately after further verification.
MuSS/
├── download/ # Raw video acquisition from organized yearly YouTube lists
├── main-pipeline/ # Multi-shot video curation and progressive captioning
├── s2v-pipeline/ # Cross-shot subject extraction and S2V pair construction
├── assets/ # README figures converted from the paper and supplementary material
├── CITATION.cff
└── README.md
git clone https://github.com/<your-org>/MuSS.git
cd MuSS
python3 download/download.py 2011

Module-level guides:
@article{zhang2026muss,
title = {MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation},
author = {Zhang, Haojie and Wu, Di and Liu, Bingyan and Zhong, Linjie and Wei, Yuancheng and Ye, Xingsong and Liu, Nanqing and Liang, Yaling},
journal = {arXiv preprint arXiv:2604.23789},
year = {2026}
}

The code and dataset licenses are being finalized. Please check this section before redistribution or commercial use.
MuSS builds on open research infrastructure for video processing, perception, visual-language reasoning, and generative video evaluation. Please also follow the licenses of any third-party models, datasets, and tools used by the individual pipelines.