CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

🎬 CineSRD

A tri-modal framework integrating visual, acoustic, and linguistic cues for robust speaker recognition and diarization in cinematic content.

🔹 Audio-Visual Fusion: Combines face tracking with voice embeddings.
🔹 LLM Reasoning: Leverages linguistic context for precise speaker turn detection.
🔹 Open-World Ready: Designed for the complex, dynamic nature of movies.

📄 Read the full paper: CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

🎯 Overview

CineSRD addresses the challenging task of speaker diarization in movies and TV shows, where multiple speakers often overlap and visual information is crucial for accurate identification. Our framework integrates:

Multi-modal Speaker Recognition: Fuses audio embeddings with visual face recognition
Active Speaker Detection: Identifies who is speaking using visual cues
Speaker Turn Detection: Leverages LLMs to detect speaker changes from audio-text context
SubtitleSD Benchmark: A comprehensive dataset for evaluating speaker diarization in cinematic content

✨ Features

Speaker Detection

Audio-Visual Fusion: Combines ERes2Net audio embeddings with face recognition
Multi-stage Pipeline:
- Face detection and quality assessment
- Active speaker detection
- Audio-visual clustering and post-processing
- Speaker role assignment via avatar matching
Robust to Overlaps: Handles multiple simultaneous speakers
High Accuracy: Optimized for cinematic content with complex scenes
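As a rough illustration of the audio-visual fusion and clustering idea listed above, the sketch below fuses one audio embedding and one face embedding per speech segment and groups the fused vectors with a greedy cosine-similarity rule. It is not the CineSRD implementation: the function names, the `audio_weight` parameter, the 0.7 threshold, and the greedy clustering rule are all assumptions made for this example, and the toy 2-D vectors stand in for real ERes2Net and face-recognition embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(audio_emb, face_emb, audio_weight=0.5):
    """Weighted concatenation of normalized embeddings (hypothetical fusion rule)."""
    audio_part = [audio_weight * x for x in l2_normalize(audio_emb)]
    face_part = [(1.0 - audio_weight) * x for x in l2_normalize(face_emb)]
    return l2_normalize(audio_part + face_part)


def cosine(u, v):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(u, v))


def greedy_cluster(embeddings, threshold=0.7):
    """Assign each embedding to the most similar existing cluster, or open a new one."""
    centroid_sums, labels = [], []
    for emb in embeddings:
        best_idx, best_sim = -1, threshold
        for idx, total in enumerate(centroid_sums):
            sim = cosine(emb, l2_normalize(total))
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx < 0:
            centroid_sums.append(list(emb))  # start a new speaker cluster
            labels.append(len(centroid_sums) - 1)
        else:
            # Update the running sum that represents the cluster centroid.
            centroid_sums[best_idx] = [c + e for c, e in zip(centroid_sums[best_idx], emb)]
            labels.append(best_idx)
    return labels


# Toy segments: (audio embedding, face embedding). Values are made up.
segments = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.9, 0.1], [1.0, 0.1]),
    ([0.0, 1.0], [0.0, 1.0]),
]
fused = [fuse(audio, face) for audio, face in segments]
labels = greedy_cluster(fused)
print(labels)
```

The first two toy segments land in the same cluster and the third opens a new one. A production system would likely use a more robust clustering method and handle segments with no visible face.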

Speaker Turn Detection

LLM-Powered: Uses Qwen2-Audio for audio-text based speaker change detection
Context-Aware: Analyzes conversational context to identify speaker switches
Probabilistic Output: Provides confidence scores for speaker turns
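To show how confidence scores for speaker turns might be consumed downstream, here is a minimal sketch that groups consecutive subtitle lines into turns by thresholding a per-boundary change probability. The function name `group_by_turns`, the input layout, and the 0.5 threshold are assumptions for illustration; the actual output format of the Qwen2-Audio based module is not specified here.

```python
def group_by_turns(lines, change_probs, threshold=0.5):
    """Group consecutive subtitle lines into speaker turns.

    change_probs[i] is the (assumed) probability that the speaker changes
    between lines[i] and lines[i + 1], so it has len(lines) - 1 entries.
    """
    if len(change_probs) != max(len(lines) - 1, 0):
        raise ValueError("expected one probability per gap between lines")
    turns, current = [], []
    for i, line in enumerate(lines):
        current.append(line)
        is_last = i == len(lines) - 1
        # Close the turn at the end of the input or at a confident speaker change.
        if is_last or change_probs[i] >= threshold:
            turns.append(current)
            current = []
    return turns


lines = ["Hello.", "How are you?", "Fine, thanks.", "And you?"]
probs = [0.1, 0.9, 0.2]  # made-up scores for the three gaps
turns = group_by_turns(lines, probs)
print(turns)
```

Raising the threshold merges more lines into the same turn, while lowering it splits more aggressively; the right value would need to be tuned on held-out data.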

🚀 Installation

Prerequisites

Python 3.10+
CUDA 12.0+ (for GPU acceleration)
FFmpeg
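The prerequisites above can be checked before installing anything. The sketch below is a hypothetical helper, not part of the repository: it verifies the Python version and looks for `ffmpeg` and `nvidia-smi` on the `PATH`. Finding `nvidia-smi` only suggests that an NVIDIA driver is installed; it does not confirm the CUDA version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        "python": sys.version_info >= min_python,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Presence of nvidia-smi is only a hint that a GPU driver exists.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```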

Setup

Clone the repository

git clone https://github.com/yourusername/CineSRD.git
cd CineSRD

Install Speaker Detection dependencies

cd code/speaker_detection
pip install -r requirements.txt

Install Speaker Turn Detection dependencies

cd ../speaker_turn_detection
pip install -r requirements.txt

Download Pretrained Models

Download the required pretrained models and place them in the appropriate directories:

Audio embedding models (ERes2Net)
Face detection models
Speaker turn detection models (Qwen2-Audio)

🎬 Quick Start

Speaker Detection

Configure paths in code/speaker_detection/config.yaml, then run the pipeline:

cd code/speaker_detection
bash main.sh

Speaker Turn Detection

Configure code/speaker_turn_detection/config.yaml, then run inference:

cd code/speaker_turn_detection
bash main.sh

📖 Usage

Speaker Detection Pipeline

The speaker detection module provides a complete pipeline including:

Video preprocessing and face detection
Audio embedding extraction (ERes2Net)
Active speaker detection
Clustering and post-processing
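One way to picture how active speaker detection output could be linked to speech segments is a temporal-overlap assignment: each speech segment is attributed to the face track that is flagged as speaking for the longest shared duration. This is an illustrative sketch under assumed data layouts, with hypothetical names such as `assign_segments` and `face_A`; it is not taken from the CineSRD code.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_segments(speech_segments, active_tracks):
    """Map each speech segment to the active face track it overlaps most.

    speech_segments: list of (start, end) tuples in seconds.
    active_tracks: list of (track_id, start, end) spans where a face is
    judged to be speaking (assumed output of an active speaker detector).
    Returns one track_id per segment, or None when no track overlaps.
    """
    assignments = []
    for seg_start, seg_end in speech_segments:
        best_id, best_overlap = None, 0.0
        for track_id, t_start, t_end in active_tracks:
            shared = overlap(seg_start, seg_end, t_start, t_end)
            if shared > best_overlap:
                best_id, best_overlap = track_id, shared
        assignments.append(best_id)
    return assignments


speech = [(0.0, 2.0), (2.5, 4.0), (10.0, 11.0)]
tracks = [("face_A", 0.0, 1.8), ("face_B", 1.5, 4.2)]
assignments = assign_segments(speech, tracks)
print(assignments)
```

Segments that map to `None` correspond to off-screen speech in this toy setup; those are the cases where audio embeddings and clustering would have to carry the decision.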

Speaker Turn Detection

Uses Qwen2-Audio to detect speaker changes from audio-text context.

📄 Citation

If you use CineSRD in your research, please cite:

[Placeholder]

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

🙏 Acknowledgments

FunASR - For audio processing tools
ModelScope - For model hub support
ms-swift - For model training framework

📧 Contact

For questions or support, please open an issue on GitHub or contact us at h13971032630@163.com.
