
zhang-haojie/MuSS

MuSS logo

A Large-Scale Dataset and Cinematic Narrative Benchmark for
Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

Paper | Project Page | Code


MuSS overview

Overview

MuSS is a large-scale cinematic dataset and benchmark for multi-shot video generation and Subject-to-Video (S2V) generation. It targets three limitations that are difficult to expose in single-shot settings:

  • lack of authentic cinematic narrative logic;
  • conflicts between global captions and local shot-level alignment;
  • copy-paste shortcuts in Subject-to-Video models.

MuSS contains two complementary data settings: Complex Cinematic Narrative, which focuses on montage, shot transitions, and multi-character storytelling, and Subject-Centric Narrative, which focuses on preserving the same subject across disjoint shots and viewpoints.

News

  • 2026-04: Paper released on arXiv.
  • Initial public release: data construction code is released first. Dataset files, benchmark implementation, and model checkpoints are not included in this repository.

Dataset

MuSS is sourced from over 3,000 movies and contains more than 30,000 professionally captioned multi-shot clips, totaling over 1,000 hours of high-quality video content.

MuSS dataset statistics

The construction pipeline first turns raw cinematic footage into high-quality physical shots with coherent captions, then builds cross-shot S2V pairs by sampling reference subjects from disjoint shot contexts.

MuSS dataset construction pipeline
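
The cross-shot pairing step can be sketched as follows. This is a minimal illustration, not the released `s2v-pipeline` code: the `Shot` record, its field names, and `sample_cross_shot_pair` are hypothetical, and the sketch assumes each shot already lists the IDs of the subjects visible in it.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Shot:
    """Hypothetical shot record; field names are assumptions, not the MuSS schema."""
    shot_id: int
    caption: str
    subject_ids: frozenset = field(default_factory=frozenset)


def sample_cross_shot_pair(shots, target_ids, subject_id, rng=None):
    """Pick a reference shot for `subject_id` that is disjoint from the target shots.

    Returns (reference_shot, target_shots), or None when the subject never
    appears outside the target sequence.
    """
    rng = rng or random.Random()
    target_ids = set(target_ids)
    targets = [s for s in shots if s.shot_id in target_ids]
    # Candidate references show the subject but lie outside the target clip,
    # so the reference image never comes from a frame the model must generate.
    candidates = [
        s for s in shots
        if s.shot_id not in target_ids and subject_id in s.subject_ids
    ]
    if not candidates:
        return None
    return rng.choice(candidates), targets
```

For example, with four shots where a subject appears in shots 0, 1, and 3, calling `sample_cross_shot_pair(shots, [1, 2], subject_id)` draws the reference from shot 0 or shot 3 only. Passing a seeded `random.Random` makes the draw reproducible.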

Data Examples

Complex Cinematic Narrative. Progressive multi-shot captions are aligned to physical shots, capturing shot transitions, scene changes, and multi-character narrative flow.

Track 1 data examples

Subject-Centric Narrative. A reference subject is extracted from a disjoint shot, while the target sequence preserves identity across different viewpoints and contexts.

Track 2 data examples

More raw cinematic examples

Raw Track 1 cinematic transitions

Raw Track 2 subject-centric sequences

Benchmark

The Cinematic Narrative Benchmark evaluates generated videos under realistic multi-shot storytelling conditions. It combines shot boundary parsing, expert perception models, and LMM-based visual-logic assessment.

Cinematic Narrative Benchmark pipeline

| Track | Evaluation Goal | Metrics |
| --- | --- | --- |
| Track 1: Narrative Effectiveness | Shot-level alignment, transition precision, scene continuity, and visual logic. | Txt.Align, Trans.Dev, Scene.Con, Con.Gap, Scene.Logic, Casting.Logic, Act.Logic, Spat.Logic |
| Track 2: Subject Consistency | Cross-shot identity preservation, subject grounding, motion strength, and anti-copy-paste behavior. | Subj.Recall, Ref-Sub.Con, Inter-Sub.Con, Act.Str, ACP-Var, CP-Rate |

ACP-Var measures pose and structural diversity between the reference image and generated frames, explicitly penalizing rigid 2D reference copying.
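
The intuition behind an anti-copy-paste score can be illustrated with a toy computation. The sketch below is not the benchmark's ACP-Var definition, which this README does not give. It assumes 2D keypoints have already been extracted (by any pose estimator) for the reference image and for each generated frame, and `normalize_pose` and `pose_diversity` are hypothetical helpers. Each pose is normalized for translation and uniform scale before comparison, so a rigid 2D paste of the reference scores close to zero.

```python
import math


def normalize_pose(keypoints):
    """Center keypoints at their mean and scale them to unit RMS radius.

    This removes translation and uniform scale, so shifting or resizing the
    reference image does not count as pose diversity.
    """
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centered = [(x - cx, y - cy) for x, y in keypoints]
    rms = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    if rms == 0:
        return centered
    return [(x / rms, y / rms) for x, y in centered]


def pose_diversity(reference, frames):
    """Mean per-keypoint distance between the reference pose and each frame pose.

    Illustrative stand-in only; assumes every frame has the same keypoints,
    in the same order, as the reference.
    """
    ref = normalize_pose(reference)
    total = 0.0
    for frame in frames:
        cur = normalize_pose(frame)
        total += sum(math.dist(a, b) for a, b in zip(ref, cur)) / len(ref)
    return total / len(frames)
```

Under this toy score, a frame that is only a shifted and resized copy of the reference contributes almost nothing, while a frame with a genuinely different pose contributes a clearly positive value.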

Qualitative benchmark comparison

Code

This repository currently releases the data-side code. The benchmark implementation is intentionally not included in this initial repository and will be released separately after further verification.

MuSS/
├── download/        # Raw video acquisition from organized yearly YouTube lists
├── main-pipeline/   # Multi-shot video curation and progressive captioning
├── s2v-pipeline/    # Cross-shot subject extraction and S2V pair construction
├── assets/          # README figures converted from the paper and supplementary material
├── CITATION.cff
└── README.md

Quick Start

git clone https://github.com/zhang-haojie/MuSS.git
cd MuSS
python3 download/download.py 2011
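
To fetch several yearly lists in one go, the download command above can be wrapped in a small script. This sketch assumes only what the Quick Start shows, namely that `download/download.py` takes a year as its single positional argument. The `build_command` and `download_years` helpers are illustrative and not part of the repository.

```python
import subprocess
import sys


def build_command(year, script="download/download.py"):
    """Return the Quick Start download command for one year as an argument list."""
    return [sys.executable, script, str(year)]


def download_years(years, dry_run=False):
    """Run the downloader once per year, stopping at the first failure."""
    for year in years:
        cmd = build_command(year)
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(cmd, check=True)
```

Call `download_years([2011], dry_run=True)` from the repository root to preview the commands without running them. Which years are available depends on the lists shipped in `download/`.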

Module-level guides:

Citation

@article{zhang2026muss,
  title   = {MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation},
  author  = {Zhang, Haojie and Wu, Di and Liu, Bingyan and Zhong, Linjie and Wei, Yuancheng and Ye, Xingsong and Liu, Nanqing and Liang, Yaling},
  journal = {arXiv preprint arXiv:2604.23789},
  year    = {2026}
}

License

The code and dataset licenses are still being finalized. Please check this section before redistribution or commercial use.

Acknowledgements

MuSS builds on open research infrastructure for video processing, perception, visual-language reasoning, and generative video evaluation. Please also follow the licenses of any third-party models, datasets, and tools used by the individual pipelines.
