GitHub - zhang-haojie/MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

A Large-Scale Dataset and Cinematic Narrative Benchmark for
Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

Overview

MuSS is a large-scale cinematic dataset and benchmark for multi-shot video generation and Subject-to-Video (S2V) generation. It targets three limitations that are difficult to expose in single-shot settings:

lack of authentic cinematic narrative logic;
conflicts between global captions and local shot-level alignment;
copy-paste shortcuts in Subject-to-Video models.

MuSS contains two complementary data settings: Complex Cinematic Narrative, which focuses on montage, shot transitions, and multi-character storytelling, and Subject-Centric Narrative, which focuses on preserving the same subject across disjoint shots and viewpoints.

News

2026-04: Paper released on arXiv.
Initial public release: data construction code is released first. Dataset files, benchmark implementation, and model checkpoints are not included in this repository.

Dataset

MuSS is sourced from over 3,000 movies and contains more than 30,000 professionally captioned multi-shot clips, totaling over 1,000 hours of high-quality video content.

The construction pipeline first turns raw cinematic footage into high-quality physical shots with coherent captions, then builds cross-shot S2V pairs by sampling reference subjects from disjoint shot contexts.
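The cross-shot sampling step can be illustrated with a minimal sketch. It is not the code in `s2v-pipeline/`: the `SubjectCrop` record, its fields, and the function name are hypothetical, and it assumes subjects have already been detected and given an identity label per shot.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectCrop:
    subject_id: str  # hypothetical identity label for a detected subject
    shot_index: int  # index of the physical shot the crop was taken from


def sample_cross_shot_reference(crops, subject_id, target_shots, rng=None):
    """Pick a reference crop of `subject_id` from a shot outside `target_shots`.

    Returns None when the subject appears only inside the target shots,
    in which case no cross-shot pair can be built.
    """
    if rng is None:
        rng = random.Random()
    excluded = set(target_shots)
    candidates = [
        crop for crop in crops
        if crop.subject_id == subject_id and crop.shot_index not in excluded
    ]
    return rng.choice(candidates) if candidates else None
```

Excluding the target shots from the candidate pool is what keeps the reference disjoint from the sequence to be generated, so a model cannot succeed by copying a frame it is asked to produce.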

Data Examples

Complex Cinematic Narrative. Progressive multi-shot captions are aligned to physical shots, capturing shot transitions, scene changes, and multi-character narrative flow.

Subject-Centric Narrative. A reference subject is extracted from a disjoint shot, while the target sequence preserves identity across different viewpoints and contexts.


Benchmark

The Cinematic Narrative Benchmark evaluates generated videos under realistic multi-shot storytelling conditions. It combines shot boundary parsing, expert perception models, and LMM-based visual-logic assessment.

Track	Evaluation Goal	Metrics
Track 1: Narrative Effectiveness	Shot-level alignment, transition precision, scene continuity, and visual logic.	`Txt.Align`, `Trans.Dev`, `Scene.Con`, `Con.Gap`, `Scene.Logic`, `Casting.Logic`, `Act.Logic`, `Spat.Logic`
Track 2: Subject Consistency	Cross-shot identity preservation, subject grounding, motion strength, and anti-copy-paste behavior.	`Subj.Recall`, `Ref-Sub.Con`, `Inter-Sub.Con`, `Act.Str`, `ACP-Var`, `CP-Rate`
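As a rough illustration of the transition precision goal in Track 1, the sketch below compares expected cut positions with cuts found by a shot boundary detector. It is an assumption for illustration only and not the benchmark's definition of `Trans.Dev`. The function name, the frame-index inputs, and the nearest-cut matching rule are all hypothetical.

```python
def transition_deviation(expected_cuts, detected_cuts):
    """Mean absolute offset, in frames, from each expected cut to its nearest detected cut.

    Returns 0.0 when no cuts are expected and None when cuts are expected
    but none were detected, since no offset can be computed.
    """
    if not expected_cuts:
        return 0.0
    if not detected_cuts:
        return None
    offsets = [
        min(abs(expected - detected) for detected in detected_cuts)
        for expected in expected_cuts
    ]
    return sum(offsets) / len(offsets)
```

For example, `transition_deviation([48, 120], [50, 118, 200])` gives `2.0`, because each expected cut lies two frames from its nearest detected cut. This simple matching ignores spurious extra cuts, which a complete metric would likely need to account for.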

ACP-Var measures pose and structural diversity between the reference image and generated frames, explicitly penalizing rigid 2D reference copying.
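The idea of rewarding pose diversity can be sketched as follows. This is a hypothetical stand-in and not the paper's ACP-Var formula: it assumes 2D keypoints are already available for the reference image and for each generated frame, and the names `normalize` and `pose_diversity` are invented for this example.

```python
import math


def normalize(points):
    """Center keypoints on their centroid and scale them to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    # Fall back to 1.0 when all points coincide, to avoid dividing by zero.
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def pose_diversity(reference, frames):
    """Mean normalized keypoint distance between the reference and each frame."""
    ref = normalize(reference)
    per_frame = []
    for frame in frames:
        current = normalize(frame)
        distance = sum(math.dist(a, b) for a, b in zip(ref, current)) / len(ref)
        per_frame.append(distance)
    return sum(per_frame) / len(per_frame)
```

Because the keypoints are normalized for translation and scale, a generated frame that pastes the reference at another position or size still scores close to zero, which is the rigid 2D copying behavior such a metric is meant to penalize.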

Code

This repository currently releases the data-side code. The benchmark implementation is intentionally not included in this initial repository and will be released separately after further verification.

MuSS/
├── download/        # Raw video acquisition from organized yearly YouTube lists
├── main-pipeline/   # Multi-shot video curation and progressive captioning
├── s2v-pipeline/    # Cross-shot subject extraction and S2V pair construction
├── assets/          # README figures converted from the paper and supplementary material
├── CITATION.cff
└── README.md

Quick Start

git clone https://github.com/zhang-haojie/MuSS.git
cd MuSS
python3 download/download.py 2011
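To fetch several yearly lists in one go, the command above can be wrapped in a small helper. This is a sketch rather than part of the repository: it assumes only what the Quick Start shows, namely that `download/download.py` takes a year as its single positional argument. Which years are available is not stated here, so the caller supplies them.

```python
import subprocess
import sys


def build_download_commands(years, script="download/download.py"):
    """Return one argv list per year, mirroring `python3 download/download.py <year>`."""
    return [[sys.executable, script, str(year)] for year in years]


def run_downloads(years):
    """Run the download script once per year, stopping at the first failure."""
    for command in build_download_commands(years):
        subprocess.run(command, check=True)
```

Calling `run_downloads([2011])` from the repository root is equivalent to the single command in the Quick Start.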

Module-level guides:

Citation

@article{zhang2026muss,
  title   = {MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation},
  author  = {Zhang, Haojie and Wu, Di and Liu, Bingyan and Zhong, Linjie and Wei, Yuancheng and Ye, Xingsong and Liu, Nanqing and Liang, Yaling},
  journal = {arXiv preprint arXiv:2604.23789},
  year    = {2026}
}

License

The code and dataset licenses are being finalized. Please check this section before redistribution or commercial use.

Acknowledgements

MuSS builds on open research infrastructure for video processing, perception, visual-language reasoning, and generative video evaluation. Please also follow the licenses of any third-party models, datasets, and tools used by the individual pipelines.
