InterPARES-Audio is a sophisticated system designed to listen to, transcribe, translate, and analyze complex audio recordings. It is built to handle files with multiple speakers conversing in different languages, making it an ideal tool for processing archives of meetings, interviews, and panel discussions.
- Live Demo: demos.dlnlp.ai/InterPARES/
- Jupyter Notebook: multilingual_audio_analysis.ipynb
InterPARES-Audio is designed as a top-line system that focuses on:
- Speaker Diarization: To identify and separate different speakers in an audio recording, even when they switch languages.
- Robust Transcription and Translation: To accurately transcribe spoken words and translate them into a target language, preserving meaning across linguistic boundaries.
- Summarization: To generate concise summaries that capture the essence of multi-speaker conversations, highlighting key points and decisions made.
- End-to-End Processing: Ingests a long audio file and outputs a structured text report.
- Speaker Diarization: Determines "who spoke when" by identifying and tagging different speakers.
- Multilingual Transcription & LID: Accurately transcribes speech to text while automatically identifying the language being spoken.
- Advanced LLM Analysis: Uses a large language model to perform high-level analysis, summarizing content and extracting key information.
- Structured Multilingual Output: Generates clean, organized reports in Arabic, English, French, Spanish, German, and Italian (a sketch of one report's shape follows this list). The reports include:
  - Speaker profiles with predicted names and roles
  - Main topics discussed
  - Decisions made during the conversation
  - Action items assigned to participants
  - Key insights and takeaways from the discussion
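As a rough illustration, one such report could carry a shape like the following, rendered here as a Python dict. All field names and values are hypothetical, not the system's actual output schema:

```python
# Hypothetical report structure; every field name and value below is
# illustrative, not the system's actual schema.
report = {
    "report_language": "fr",  # one of: ar, en, fr, es, de, it
    "speaker_profiles": [
        {"id": "SPEAKER_00", "predicted_name": "Dr. Smith", "role": "Chair"},
        {"id": "SPEAKER_01", "predicted_name": "Unknown", "role": "Archivist"},
    ],
    "main_topics": ["Digitization budget", "Metadata standards"],
    "decisions": ["Adopt the proposed preservation metadata schema"],
    "action_items": [
        {"assignee": "SPEAKER_01", "task": "Draft the metadata mapping"},
    ],
    "key_insights": ["Multilingual records need per-utterance language tags"],
}
```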
The pipeline consists of four main stages (a minimal orchestration sketch follows the list):

1. Speaker Diarization and Segmentation: The long audio input is processed to identify speaker changes and segment the audio into a sequence of individual utterances, each tagged with a speaker ID.
2. Multilingual Speech Model: Each utterance is fed into a speech model that performs both transcription (speech-to-text) and language identification (LID).
3. Transcription Manager: This component merges the individual transcribed utterances, saves the full transcript, and creates manageable, contextually coherent chunks of text optimized for the LLM.
4. LLM Analysis: The structured text chunks are analyzed by an LLM to transform raw transcript data into a meaningful and actionable structured report.
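A minimal sketch of how these stages might be chained, assuming hypothetical callables for each model stage. The helper signatures, chunk size, and speaker-tag format are illustrative, not the project's actual API:

```python
from typing import Callable

def analyze_audio(
    path: str,
    diarize: Callable[[str], list[tuple[float, float, str]]],  # stage 1
    transcribe: Callable[[str, float, float], dict],           # stage 2
    llm_report: Callable[[list[str], str], str],               # stage 4
    target_lang: str = "en",
    chunk_chars: int = 8000,  # chunk size is an assumption
) -> str:
    # Stage 1: diarization -> time-stamped (start, end, speaker_id) segments
    segments = diarize(path)

    # Stage 2: per-utterance transcription + language ID
    lines = []
    for start, end, speaker in segments:
        r = transcribe(path, start, end)  # assumed: {"text": ..., "language": ...}
        lines.append(f"[{speaker}|{r['language']}] {r['text']}")

    # Stage 3: merge into a full transcript, then split into chunks that
    # respect utterance boundaries and fit the LLM's context window
    chunks, buf = [], ""
    for line in lines:
        if buf and len(buf) + len(line) > chunk_chars:
            chunks.append(buf)
            buf = ""
        buf += line + "\n"
    if buf:
        chunks.append(buf)

    # Stage 4: the LLM turns the chunks into a structured report
    return llm_report(chunks, target_lang)
```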
This pipeline integrates state-of-the-art deep learning models to handle the complexity of multilingual, multi-speaker audio environments.
- Model Name: pyannote/speaker-diarization-3.1
- Model Card: Pyannote Speaker Diarization-3.1
- Publications: Powerset Multi-Class Cross Entropy Loss for Neural Speaker Diarization and pyannote.audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe
- Function: Responsible for the "Who Spoke When" task. It generates time-stamped speaker segments, distinguishing between different participants in the meeting even amidst interruptions or overlaps.
- Why it's used: Pyannote is a de facto standard for open-source diarization, offering high accuracy in clustering speaker embeddings.
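A minimal usage sketch for this stage. The token and audio file name are placeholders, and pyannote's gated models require accepting their access conditions on Hugging Face:

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (gated model: needs a
# Hugging Face token with the access terms accepted).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)
diarization = pipeline("meeting.wav")  # placeholder file

# Time-stamped speaker turns: the "who spoke when" output
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s -> {turn.end:6.1f}s  {speaker}")
```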
- Model Name: openai/whisper-large-v3
- Model Card: OpenAI Whisper large-v3
- Publications: Robust Speech Recognition via Large-Scale Weak Supervision
- Function: Performs robust Speech-to-Text (STT) transcription and automatic Language Identification (LID).
- Capabilities:
  - Handles noisy audio effectively.
  - Automatically detects language switches (e.g., switching from English to Arabic or French).
  - Provides timestamps that align with the diarization segments.
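A minimal transcription-plus-LID sketch using the reference openai-whisper package; the project could equally load this checkpoint through Hugging Face transformers, and the audio file name is a placeholder:

```python
import whisper  # pip install openai-whisper

# Load the same checkpoint the model card describes.
model = whisper.load_model("large-v3")

# transcribe() runs language identification automatically when no
# language is forced, then decodes speech to text with timestamps.
result = model.transcribe("utterance_003.wav")  # placeholder file
print(result["language"])           # auto-detected language code, e.g. "ar"
print(result["text"])               # full transcription
for seg in result["segments"]:      # timestamps to align with diarization
    print(f"{seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```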
- Model Name: openai/gpt-oss-20b
- Model Card: OpenAI GPT-OSS 20B
- Publications: gpt-oss-120b & gpt-oss-20b Model Card
- Function: Processes the raw, diarized transcripts to generate structured outputs.
- Output Generation:
  - Contextualizes the conversation.
  - Extracts action items, decisions, and summaries.
  - Translates the final report into the user's requested target language (Arabic, English, French, Spanish, German, Italian) while preserving the nuance of the original discussion.
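A minimal analysis sketch via Hugging Face transformers, following the model card's pipeline-style usage. The prompt wording, chunk file name, and token budget are assumptions, not the project's actual prompts:

```python
from transformers import pipeline

# Load the open-weight LLM for transcript analysis.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

chunk = open("transcript_chunk_01.txt").read()  # placeholder chunk file

messages = [
    {"role": "system",
     "content": "You analyze meeting transcripts. Respond in French."},
    {"role": "user",
     "content": "From this transcript chunk, extract the main topics, "
                "decisions, action items, and key insights:\n" + chunk},
]

output = generator(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])  # assistant reply
```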