InterPARES-Audio is a sophisticated system designed to listen to, transcribe, translate, and analyze complex audio recordings. It is built to handle files with multiple speakers conversing in different languages, making it an ideal tool for processing archives of meetings, interviews, and panel discussions.
- Live Demo: demos.dlnlp.ai/InterPARES/
- Jupyter Notebook: multilingual_audio_analysis.ipynb
InterPARES-Audio is designed as a top-line system that focuses on:
- Speaker Diarization: To identify and separate different speakers in an audio recording, even when they switch languages.
- Robust Transcription and Translation: To accurately transcribe spoken words and translate them into a target language, preserving meaning across linguistic boundaries.
- Summarization: To generate concise summaries that capture the essence of multi-speaker conversations, highlighting key points and decisions made.
- End-to-End Processing: Ingests a long audio file and outputs a structured text report.
- Speaker Diarization: Determines "who spoke when" by identifying and tagging different speakers.
- Multilingual Transcription & LID: Accurately transcribes speech to text while automatically identifying the language being spoken.
- Advanced LLM Analysis: Uses a large language model to perform high-level analysis, summarizing content and extracting key information.
- Structured Multilingual Output: Generates clean, organized reports in Arabic, English, French, Spanish, German, and Italian (a sketch of one report's shape follows this list). The reports include:
  - Speaker profiles with predicted names and roles
  - Main topics discussed
  - Decisions made during the conversation
  - Action items assigned to participants
  - Key insights and takeaways from the discussion
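As a rough illustration, one such report could carry a shape like the following, rendered here as a Python dict. All field names and values are hypothetical, not the system's actual output schema:

```python
# Hypothetical report structure; every field name and value below is
# illustrative, not the system's actual schema.
report = {
    "report_language": "fr",  # one of: ar, en, fr, es, de, it
    "speaker_profiles": [
        {"id": "SPEAKER_00", "predicted_name": "Dr. Smith", "role": "Chair"},
        {"id": "SPEAKER_01", "predicted_name": "Unknown", "role": "Archivist"},
    ],
    "main_topics": ["Digitization budget", "Metadata standards"],
    "decisions": ["Adopt the proposed preservation metadata schema"],
    "action_items": [
        {"assignee": "SPEAKER_01", "task": "Draft the metadata mapping"},
    ],
    "key_insights": ["Multilingual records need per-utterance language tags"],
}
```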
The pipeline consists of four main stages (a minimal orchestration sketch follows the list):

1. Speaker Diarization and Segmentation: The long audio input is processed to identify speaker changes and segment the audio into a sequence of individual utterances, each tagged with a speaker ID.
2. Multilingual Speech Model: Each utterance is fed into a speech model that performs both transcription (speech-to-text) and language identification (LID).
3. Transcription Manager: This component merges the individual transcribed utterances, saves the full transcript, and creates manageable, contextually coherent chunks of text optimized for the LLM.
4. LLM Analysis: The structured text chunks are analyzed by an LLM to transform raw transcript data into a meaningful and actionable structured report.
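A minimal sketch of how these stages might be chained, assuming hypothetical callables for each model stage. The helper signatures, chunk size, and speaker-tag format are illustrative, not the project's actual API:

```python
from typing import Callable

def analyze_audio(
    path: str,
    diarize: Callable[[str], list[tuple[float, float, str]]],  # stage 1
    transcribe: Callable[[str, float, float], dict],           # stage 2
    llm_report: Callable[[list[str], str], str],               # stage 4
    target_lang: str = "en",
    chunk_chars: int = 8000,  # chunk size is an assumption
) -> str:
    # Stage 1: diarization -> time-stamped (start, end, speaker_id) segments
    segments = diarize(path)

    # Stage 2: per-utterance transcription + language ID
    lines = []
    for start, end, speaker in segments:
        r = transcribe(path, start, end)  # assumed: {"text": ..., "language": ...}
        lines.append(f"[{speaker}|{r['language']}] {r['text']}")

    # Stage 3: merge into a full transcript, then split into chunks that
    # respect utterance boundaries and fit the LLM's context window
    chunks, buf = [], ""
    for line in lines:
        if buf and len(buf) + len(line) > chunk_chars:
            chunks.append(buf)
            buf = ""
        buf += line + "\n"
    if buf:
        chunks.append(buf)

    # Stage 4: the LLM turns the chunks into a structured report
    return llm_report(chunks, target_lang)
```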
This pipeline integrates state-of-the-art deep learning models to handle the complexity of multilingual, multi-speaker audio environments.
- Model Name: pyannote/speaker-diarization-3.1
- Model Card: Pyannote Speaker Diarization-3.1
- Publications: Powerset Multi-Class Cross Entropy Loss for Neural Speaker Diarization and pyannote.audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe
- Function: Responsible for the "Who Spoke When" task. It generates time-stamped speaker segments, distinguishing between different participants in the meeting even amidst interruptions or overlaps.
- Why it's used: Pyannote is a de facto standard for open-source diarization, offering high accuracy in clustering speaker embeddings.
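A minimal usage sketch for this stage. The token and audio file name are placeholders, and pyannote's gated models require accepting their access conditions on Hugging Face:

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (gated model: needs a
# Hugging Face token with the access terms accepted).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)
diarization = pipeline("meeting.wav")  # placeholder file

# Time-stamped speaker turns: the "who spoke when" output
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s -> {turn.end:6.1f}s  {speaker}")
```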
- Model Name: openai/whisper-large-v3
- Model Card: OpenAI Whisper large-v3
- Publications: Robust Speech Recognition via Large-Scale Weak Supervision
- Function: Performs robust Speech-to-Text (STT) transcription and automatic Language Identification (LID).
- Capabilities:
  - Handles noisy audio effectively.
  - Automatically detects language switches (e.g., switching from English to Arabic or French).
  - Provides timestamps that align with the diarization segments.
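A minimal transcription-plus-LID sketch using the reference openai-whisper package; the project could equally load this checkpoint through Hugging Face transformers, and the audio file name is a placeholder:

```python
import whisper  # pip install openai-whisper

# Load the same checkpoint the model card describes.
model = whisper.load_model("large-v3")

# transcribe() runs language identification automatically when no
# language is forced, then decodes speech to text with timestamps.
result = model.transcribe("utterance_003.wav")  # placeholder file
print(result["language"])           # auto-detected language code, e.g. "ar"
print(result["text"])               # full transcription
for seg in result["segments"]:      # timestamps to align with diarization
    print(f"{seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```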
- Model Name: openai/gpt-oss-20b
- Model Card: OpenAI GPT-OSS 20B
- Publications: gpt-oss-120b & gpt-oss-20b Model Card
- Function: Processes the raw, diarized transcripts to generate structured outputs.
- Output Generation:
  - Contextualizes the conversation.
  - Extracts action items, decisions, and summaries.
  - Translates the final report into the user's requested target language (Arabic, English, French, Spanish, German, Italian) while preserving the nuance of the original discussion.
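A minimal analysis sketch via Hugging Face transformers, following the model card's pipeline-style usage. The prompt wording, chunk file name, and token budget are assumptions, not the project's actual prompts:

```python
from transformers import pipeline

# Load the open-weight LLM for transcript analysis.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

chunk = open("transcript_chunk_01.txt").read()  # placeholder chunk file

messages = [
    {"role": "system",
     "content": "You analyze meeting transcripts. Respond in French."},
    {"role": "user",
     "content": "From this transcript chunk, extract the main topics, "
                "decisions, action items, and key insights:\n" + chunk},
]

output = generator(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])  # assistant reply
```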