CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
A tri-modal framework integrating visual, acoustic, and linguistic cues for robust speaker recognition and diarization in cinematic content.
- Audio-Visual Fusion: Combines face tracking with voice embeddings.
- LLM Reasoning: Leverages linguistic context for precise speaker turn detection.
- Open-World Ready: Designed for the complex, dynamic nature of movies.
Read the full paper: CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
CineSRD addresses the challenging task of speaker diarization in movies and TV shows, where multiple speakers often overlap and visual information is crucial for accurate identification. Our framework integrates:
- Multi-modal Speaker Recognition: Fuses audio embeddings with visual face recognition
- Active Speaker Detection: Identifies who is speaking using visual cues
- Speaker Turn Detection: Leverages LLMs to detect speaker changes from audio-text context
- SubtitleSD Benchmark: A comprehensive dataset for evaluating speaker diarization in cinematic content
- Audio-Visual Fusion: Combines ERes2Net audio embeddings with face recognition
- Multi-stage Pipeline:
  - Face detection and quality assessment
  - Active speaker detection
  - Audio-visual clustering and post-processing
  - Speaker role assignment via avatar matching
- Robust to Overlaps: Handles multiple simultaneous speakers
- High Accuracy: Optimized for cinematic content with complex scenes
- LLM-Powered: Uses Qwen2-Audio for audio-text based speaker change detection
- Context-Aware: Analyzes conversational context to identify speaker switches
- Probabilistic Output: Provides confidence scores for speaker turns
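To illustrate how per-utterance confidence scores could be turned into speaker turns, here is a minimal sketch. It assumes the turn detector yields one probability per subtitle line, meaning "this line is spoken by a different speaker than the previous one". The function names, the input format, and the 0.5 threshold are hypothetical and are not taken from the CineSRD code.

```python
from typing import List


def speaker_turns(change_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return indices of utterances where a new speaker is assumed to start.

    change_probs[i] is a hypothetical confidence that utterance i is spoken by
    a different speaker than utterance i - 1. Index 0 has no predecessor and
    is never reported as a turn.
    """
    return [i for i, p in enumerate(change_probs) if i > 0 and p >= threshold]


def group_by_turn(
    utterances: List[str], change_probs: List[float], threshold: float = 0.5
) -> List[List[str]]:
    """Split subtitle lines into consecutive same-speaker groups."""
    if len(utterances) != len(change_probs):
        raise ValueError("utterances and change_probs must have the same length")
    boundaries = set(speaker_turns(change_probs, threshold))
    groups: List[List[str]] = []
    for i, text in enumerate(utterances):
        # Start a new group at the first line and at every detected turn.
        if i == 0 or i in boundaries:
            groups.append([])
        groups[-1].append(text)
    return groups


if __name__ == "__main__":
    lines = ["Where were you?", "At the office.", "I had a deadline.", "All night?"]
    probs = [0.0, 0.92, 0.08, 0.85]  # made-up scores for illustration
    for group in group_by_turn(lines, probs):
        print(group)
```

Raising the threshold trades missed turns for fewer false speaker switches, so in practice it would be tuned on held-out data.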
- Python 3.10+
- CUDA 12.0+ (for GPU acceleration)
- FFmpeg
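A quick way to confirm the Python and FFmpeg prerequisites before installing anything is a small standard-library check like the sketch below. The helper name `check_environment` is made up for this example, and the script does not verify CUDA, since that would require a GPU library such as PyTorch to be installed first.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # taken from the prerequisites list


def check_environment(min_python=MIN_PYTHON, ffmpeg_name="ffmpeg"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(ffmpeg_name) is None:
        problems.append(f"{ffmpeg_name} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("MISSING:", issue)
    if not issues:
        print("Python and FFmpeg prerequisites look OK")
```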
- Clone the repository
    git clone https://github.com/yourusername/CineSRD.git
    cd CineSRD
- Install Speaker Detection dependencies
    cd code/speaker_detection
    pip install -r requirements.txt
- Install Speaker Turn Detection dependencies
    cd ../speaker_turn_detection
    pip install -r requirements.txt
- Download Pretrained Models
Download the required pretrained models and place them in the appropriate directories:
- Audio embedding models (ERes2Net)
- Face detection models
- Speaker turn detection models (Qwen2-Audio)
- Configure paths in code/speaker_detection/config.yaml.
- Run the pipeline:
    cd code/speaker_detection
    bash main.sh
- Set the configuration in code/speaker_turn_detection/config.yaml.
- Run inference:
    cd code/speaker_turn_detection
    bash main.sh
The speaker detection module provides a complete pipeline including:
- Video preprocessing and face detection
- Audio embedding extraction (ERes2Net)
- Active speaker detection
- Clustering and post-processing
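The fusion and clustering stages listed above can be pictured with a toy example. The sketch below is not the CineSRD implementation: it assumes each speech segment comes with an audio embedding and, when a face was visible, a face embedding, then combines the two cosine similarities with a hypothetical weight and groups segments with a simple greedy threshold rule. The function names, the 0.5 weight, and the 0.7 threshold are all illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Segment = Tuple[Vector, Optional[Vector]]  # (audio embedding, face embedding or None)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_similarity(
    audio_a: Vector,
    audio_b: Vector,
    face_a: Optional[Vector] = None,
    face_b: Optional[Vector] = None,
    audio_weight: float = 0.5,
) -> float:
    """Weighted mix of audio and face similarity; audio only if a face is missing."""
    audio_sim = cosine(audio_a, audio_b)
    if face_a is None or face_b is None:
        return audio_sim
    return audio_weight * audio_sim + (1 - audio_weight) * cosine(face_a, face_b)


def greedy_cluster(segments: List[Segment], threshold: float = 0.7) -> List[int]:
    """Assign each segment to the most similar existing cluster, or open a new one.

    Each cluster is represented by its first segment, which keeps the sketch short.
    """
    representatives: List[Segment] = []
    labels: List[int] = []
    for audio, face in segments:
        best_label, best_sim = -1, threshold
        for label, (rep_audio, rep_face) in enumerate(representatives):
            sim = fused_similarity(audio, rep_audio, face, rep_face)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label == -1:
            representatives.append((audio, face))
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    toy_segments = [
        ([1.0, 0.0], [1.0, 0.0]),
        ([0.9, 0.1], [1.0, 0.1]),
        ([0.0, 1.0], None),  # off-screen speaker: no face embedding
        ([0.1, 1.0], [0.0, 1.0]),
    ]
    print(greedy_cluster(toy_segments))
```

Falling back to audio-only similarity when no face is available is one simple way to handle off-screen speech; a real system would likely use a more robust clustering method than this single-pass rule.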
The speaker turn detection module uses Qwen2-Audio to detect speaker changes from audio-text context.
If you use CineSRD in your research, please cite:
[Placeholder]
We welcome contributions! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- FunASR - For audio processing tools
- ModelScope - For model hub support
- ms-swift - For model training framework
For questions or support, please open an issue on GitHub or contact us at h13971032630@163.com.