
BSTLL/CineSRD


CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

Python 3.10+ PyTorch

🎬 CineSRD

A tri-modal framework integrating visual, acoustic, and linguistic cues for robust speaker recognition and diarization in cinematic content.

🔹 Audio-Visual Fusion: Combines face tracking with voice embeddings.
🔹 LLM Reasoning: Leverages linguistic context for precise speaker turn detection.
🔹 Open-World Ready: Designed for the complex, dynamic nature of movies.

📄 Read the full paper: CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization




🎯 Overview

CineSRD addresses the challenging task of speaker diarization in movies and TV shows, where multiple speakers often overlap and visual information is crucial for accurate identification. Our framework integrates:

  • Multi-modal Speaker Recognition: Fuses audio embeddings with visual face recognition
  • Active Speaker Detection: Identifies who is speaking using visual cues
  • Speaker Turn Detection: Leverages LLMs to detect speaker changes from audio-text context
  • SubtitleSD Benchmark: A comprehensive dataset for evaluating speaker diarization in cinematic content

✨ Features

Speaker Detection

  • Audio-Visual Fusion: Combines ERes2Net audio embeddings with face recognition
  • Multi-stage Pipeline:
    • Face detection and quality assessment
    • Active speaker detection
    • Audio-visual clustering and post-processing
    • Speaker role assignment via avatar matching
  • Robust to Overlaps: Handles multiple simultaneous speakers
  • High Accuracy: Optimized for cinematic content with complex scenes
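As a rough illustration of the audio-visual fusion and clustering steps listed above, the sketch below L2-normalizes an audio embedding and a face embedding, concatenates them with a weight, and groups segments greedily by cosine similarity. The function names, the `audio_weight` parameter, and the `threshold` value are hypothetical and are not taken from the CineSRD code; the actual pipeline uses ERes2Net embeddings and its own clustering and post-processing.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(audio_emb, face_emb, audio_weight=0.5):
    """Concatenate weighted, normalized audio and face embeddings (hypothetical scheme)."""
    audio = [audio_weight * x for x in l2_normalize(audio_emb)]
    face = [(1.0 - audio_weight) * x for x in l2_normalize(face_emb)]
    return l2_normalize(audio + face)


def cosine(u, v):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(u, v))


def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the most similar cluster representative.

    The first member of a cluster serves as its representative; a new
    cluster is opened when no representative reaches the threshold.
    """
    representatives = []
    labels = []
    for emb in embeddings:
        best_idx, best_sim = None, threshold
        for idx, rep in enumerate(representatives):
            sim = cosine(emb, rep)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
        else:
            labels.append(best_idx)
    return labels
```

Calling `greedy_cluster` on a list of vectors produced by `fuse` returns one integer label per segment. A single fused vector keeps the example short; a real system might instead score the two modalities separately and combine the scores.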

Speaker Turn Detection

  • LLM-Powered: Uses Qwen2-Audio for audio-text based speaker change detection
  • Context-Aware: Analyzes conversational context to identify speaker switches
  • Probabilistic Output: Provides confidence scores for speaker turns
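To show how per-line confidence scores could be turned into speaker turns, here is a minimal sketch that starts a new turn whenever the speaker-change probability reaches a threshold. The `ScoredLine` structure, the `change_prob` field, and the default threshold of 0.5 are assumptions made for illustration; they do not describe the output format of the CineSRD model.

```python
from dataclasses import dataclass


@dataclass
class ScoredLine:
    text: str
    # Hypothetical field: probability that the speaker changed before this line.
    change_prob: float


def segment_turns(lines, threshold=0.5):
    """Group consecutive lines into turns, splitting where change_prob >= threshold."""
    turns = []
    for index, line in enumerate(lines):
        if index == 0 or line.change_prob >= threshold:
            turns.append([line.text])  # open a new turn
        else:
            turns[-1].append(line.text)  # same speaker continues
    return turns
```

Raising the threshold yields fewer, longer turns; lowering it splits more aggressively.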

🚀 Installation

Prerequisites

  • Python 3.10+
  • CUDA 12.0+ (for GPU acceleration)
  • FFmpeg
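Before installing, it can help to confirm that the prerequisites above are present. The following is a small sketch, not part of the repository: `check_prerequisites` is a hypothetical helper that checks the Python version, looks for `ffmpeg` on the `PATH`, and reports whether PyTorch can see a CUDA device if PyTorch is already installed.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required}+ is required")
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg was not found on PATH")
    try:
        import torch  # only available once the dependencies are installed
        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA device (GPU acceleration disabled)")
    except ImportError:
        problems.append("PyTorch is not installed yet")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

This sketch does not verify the CUDA toolkit version itself; it only reports whether PyTorch detects a usable GPU.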

Setup

  1. Clone the repository
     git clone https://github.com/yourusername/CineSRD.git
     cd CineSRD
  2. Install Speaker Detection dependencies
     cd code/speaker_detection
     pip install -r requirements.txt
  3. Install Speaker Turn Detection dependencies
     cd ../speaker_turn_detection
     pip install -r requirements.txt
  4. Download Pretrained Models

Download the required pretrained models and place them in the appropriate directories:

  • Audio embedding models (ERes2Net)
  • Face detection models
  • Speaker turn detection models (Qwen2-Audio)

🎬 Quick Start

Speaker Detection

  1. Configure paths in code/speaker_detection/config.yaml:

  2. Run the pipeline:

cd code/speaker_detection
bash main.sh

Speaker Turn Detection

  1. Configure in code/speaker_turn_detection/config.yaml:

  2. Run inference:

cd code/speaker_turn_detection
bash main.sh

📖 Usage

Speaker Detection Pipeline

The speaker detection module provides a complete pipeline including:

  • Video preprocessing and face detection
  • Audio embedding extraction (ERes2Net)
  • Active speaker detection
  • Clustering and post-processing
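One common kind of post-processing in diarization is merging adjacent segments that belong to the same speaker. The sketch below illustrates that idea only; the `(start, end, speaker)` tuple format and the `max_gap` parameter are hypothetical, and the post-processing in this repository may work differently.

```python
def merge_segments(segments, max_gap=0.3):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds.

    segments: iterable of (start, end, speaker) tuples, times in seconds.
    """
    merged = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged
```

Segments are sorted by start time first, so the input does not need to be ordered.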

Speaker Turn Detection

Uses Qwen2-Audio to detect speaker changes from audio-text context.


📄 Citation

If you use CineSRD in your research, please cite:

[Placeholder]

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ™ Acknowledgments


📧 Contact

For questions or support, please open an issue on GitHub or contact us at h13971032630@163.com.

About

The source code for CineSRD and the SubtitleSD benchmark is provided in this repository.
