Presentation 3

The document provides an overview of speech recognition and text-to-speech (TTS) technologies, detailing their processes, components, and applications. It explains the steps involved in speech recognition, including audio input, preprocessing, acoustic and language modeling, and output generation. Additionally, it outlines TTS processes, types of systems, and their impact across various sectors, along with examples of Python libraries for implementation.

Uploaded by

emaddiana82


MULTIMODAL NLP

1. WHAT IS SPEECH RECOGNITION?
Speech recognition is the process by which a
computer system can listen to spoken
language, analyze it, and convert it into
written text.

SPEECH RECOGNITION STEPS

• Computer system: This can be your phone, laptop, or smart speaker (like Alexa).
• Listen to spoken language: Using a microphone, the system captures your voice.
• Analyze it: It processes the voice using algorithms to understand what’s being said.
• Convert into text: It then outputs what you said as text on screen or uses it to take
action (like searching on Google or typing for you).

2. HOW DOES IT WORK?
STEP 1: AUDIO INPUT

• The user speaks into a microphone.

• The system records this as an audio signal (a waveform).
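
To make "audio signal (a waveform)" concrete, here is a minimal sketch using only the Python standard library. It assumes 16-bit mono audio at 16 kHz and uses a generated tone as a stand-in for a real microphone recording; the helper names (make_tone, write_wav, read_wav) are illustrative, not part of any speech library.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed; a common rate for speech)

def make_tone(freq_hz, seconds, amplitude=0.5):
    """Return a sine tone as floats in [-1, 1] (a stand-in for recorded speech)."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def write_wav(samples, fileobj):
    """Store float samples as 16-bit mono PCM."""
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

def read_wav(fileobj):
    """Read 16-bit mono PCM back into floats in [-1, 1]."""
    with wave.open(fileobj, "rb") as wav:
        raw = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [v / 32767 for v in ints], rate

buffer = io.BytesIO()                  # in-memory file, so nothing is written to disk
write_wav(make_tone(440, 0.1), buffer)
buffer.seek(0)
waveform, rate = read_wav(buffer)
print(len(waveform), "samples at", rate, "Hz")
```

The point of the sketch is that a recording is just a long list of amplitude values; every later step works on that list.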

STEP 2: PREPROCESSING (FEATURE EXTRACTION)

• The audio is cleaned: removing background noise, adjusting volume.

• Then it's converted into features like:
• MFCC (Mel-Frequency Cepstral Coefficients)
• Spectrogram
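
The preprocessing step can be sketched as follows, assuming a simple peak normalization for "adjusting volume" and a short-time Fourier analysis for the spectrogram. The frame size, hop size, and function names are illustrative choices. MFCCs go further (mel filterbank, logarithm, DCT), which this sketch does not attempt, and it does no noise removal.

```python
import cmath
import math

def normalize(samples):
    """Scale the signal so its loudest sample has magnitude 1."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]

def frames(samples, size, hop):
    """Split the signal into overlapping frames of `size` samples."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def magnitude_spectrum(frame):
    """Hann-window one frame and return |DFT| for the non-negative frequency bins."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, size=64, hop=32):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [magnitude_spectrum(f) for f in frames(normalize(samples), size, hop)]

rate = 8000  # assumed sample rate for the test tone
tone = [0.3 * math.sin(2 * math.pi * 1000 * i / rate) for i in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames x", len(spec[0]), "bins; strongest bin:", peak_bin)
```

The naive DFT here is written for readability; real systems use an FFT.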

STEP 3: ACOUSTIC MODEL

• This model maps audio features to phonemes.

• Phonemes are the smallest units of sound that distinguish one word from another (like "b", "d", "sh").
• Models used:
• HMM (Hidden Markov Models): Older model, statistical.
• RNN (Recurrent Neural Networks): Good for sequences like speech.
• Transformer: Newer, very powerful model for processing long sequences.
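
As an illustration of the HMM approach, the sketch below runs Viterbi decoding over a toy two-phoneme model. All probabilities are invented for the example, and the "features" are reduced to two hypothetical labels (noisy or voiced) instead of real MFCC vectors.

```python
import math

STATES = ["s", "iy"]                       # two phonemes, as in the word "see"
START = {"s": 0.8, "iy": 0.2}              # assumed start probabilities
TRANS = {"s": {"s": 0.6, "iy": 0.4},       # assumed transition probabilities
         "iy": {"s": 0.1, "iy": 0.9}}
# Assumed emission probabilities over a toy feature: is the frame noisy or voiced?
EMIT = {"s": {"noisy": 0.9, "voiced": 0.1},
        "iy": {"noisy": 0.2, "voiced": 0.8}}

def viterbi(observations):
    """Return the most likely phoneme sequence for a list of frame labels."""
    # best[t][state] = (log-probability of the best path ending in state, previous state)
    best = [{s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), None)
             for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p][0] + math.log(TRANS[p][s]))
            score = best[-1][prev][0] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            column[s] = (score, prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

print(viterbi(["noisy", "noisy", "voiced", "voiced"]))
```

RNN and Transformer acoustic models replace these hand-set tables with learned networks, but the goal is the same: one phoneme label (or distribution) per stretch of audio.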

STEP 4: LANGUAGE MODEL

• Predicts the most likely words in context.

• Helps choose between words that sound alike (e.g., "write" vs "right").
• Uses:
• Grammar rules
• Probability of word combinations
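
"Probability of word combinations" can be sketched with a bigram model. The five-sentence corpus below is hypothetical and far too small for real use; it only shows how counts of word pairs let the model prefer "write" over "right" in context.

```python
from collections import Counter

# A tiny hypothetical training corpus; real language models learn from far more text.
CORPUS = [
    "please write a letter",
    "write a short note",
    "turn right at the corner",
    "you are right about that",
    "i will write a report",
]

unigrams = Counter()
bigrams = Counter()
for sentence in CORPUS:
    words = ["<s>"] + sentence.split()     # <s> marks the start of a sentence
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

VOCAB_SIZE = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing so unseen pairs are unlikely, not impossible."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE)

def sentence_score(sentence):
    """Multiply the bigram probabilities of every word given the word before it."""
    words = ["<s>"] + sentence.split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

candidates = ["please right a letter", "please write a letter"]
print(max(candidates, key=sentence_score))
```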

STEP 5: DECODER

• Combines the acoustic model and the language model.

• Outputs the final text that matches what was spoken.
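
One common way to combine the two models is to add their log-probabilities, with a weight on the language model. The sketch below assumes that scheme; the candidate transcripts, their scores, and the weight are all made-up values for illustration.

```python
import math

# Hypothetical scores for two candidate transcripts of the same audio.
# The acoustic model cannot separate them well because they sound alike.
ACOUSTIC = {"write a letter": 0.48, "right a letter": 0.52}
# The language model strongly prefers the grammatical word sequence.
LANGUAGE = {"write a letter": 0.09, "right a letter": 0.001}

LM_WEIGHT = 0.8  # assumed tuning knob: how much to trust the language model

def combined_score(text):
    """Log-linear combination: log P_acoustic + weight * log P_language."""
    return math.log(ACOUSTIC[text]) + LM_WEIGHT * math.log(LANGUAGE[text])

def decode(candidates):
    """Pick the candidate transcript with the highest combined score."""
    return max(candidates, key=combined_score)

print(decode(list(ACOUSTIC)))
```

Here the acoustic model slightly favors "right", but the language model outweighs it. A real decoder searches over many more hypotheses, typically with beam search, instead of scoring a fixed list.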

STEP 6: OUTPUT
• The system gives the result as text.
• This can be:
• Displayed on screen
• Used to execute a command (e.g., call someone)

HOW A SPEECH RECOGNITION SYSTEM WORKS?
SPEAKER-INDEPENDENT VS. SPEAKER-DEPENDENT
• Speaker-Independent: Works with any voice.
• Speaker-Dependent: Trained to understand one person's voice better.

ONLINE VS. OFFLINE RECOGNITION

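• Online: Requires internet; usually more accurate because it can use larger cloud-hosted models.
• Offline: Works without internet, but is often less accurate because the model must fit on the device.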

REAL-TIME VS. BATCH

• Real-Time: Transcribes while you speak.
• Batch: Transcribes a full recorded file later.
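
The difference between the two modes is mostly about when the audio is handed to the recognizer. In this sketch, recognize is a hypothetical stand-in that only reports how much audio it received; a real recognizer would return text.

```python
def recognize(samples):
    """Hypothetical stand-in for a recognizer: reports how many samples it saw."""
    return "<%d samples>" % len(samples)

def batch_transcribe(recording):
    """Batch mode: wait for the whole recording, then process it once."""
    return recognize(recording)

def stream_transcribe(recording, chunk_size):
    """Real-time mode: yield a partial result as soon as each chunk arrives."""
    for start in range(0, len(recording), chunk_size):
        yield recognize(recording[start:start + chunk_size])

recording = [0.0] * 1000
print(batch_transcribe(recording))
for partial in stream_transcribe(recording, chunk_size=400):
    print(partial)
```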
EXAMPLES

Application        Example
Voice Assistants   Siri, Alexa
Transcription      Zoom captions
Smart Homes        “Turn on the lights”
Medical            Doctor dictation
Call Centers       IVR menus

WHAT IS TTS?

• Text-to-Speech (TTS) is a technology that converts written text into spoken voice
output using artificial intelligence and signal processing.

• TTS process:
1. Text Analysis
2. Phonetic Conversion
3. Prosody Generation
4. Speech Synthesis
5. Audio Output
6. Post-processing
1. Text Analysis and Preprocessing
Text Normalization: This process converts elements like numbers, abbreviations, and symbols into their
corresponding spoken forms. For instance, "Dr." is expanded to "Doctor," and "123" becomes "one
hundred twenty-three."
Linguistic Analysis: The system examines the text to determine its syntactic and semantic structures,
identifying parts of speech, sentence boundaries, and areas requiring emphasis.
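
Text normalization as described above can be sketched with a small lookup table and a number speller. This is a minimal illustration that assumes numbers of at most three digits and a two-entry abbreviation table; real systems need context (for example, "Dr." can also mean "Drive").

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
# A deliberately tiny, hypothetical abbreviation table.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}

def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")

def normalize_text(text):
    """Expand known abbreviations, then spell out numbers of up to three digits."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)

print(normalize_text("Dr. Smith saw 123 patients."))
```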

2. Phonetic Conversion
Grapheme-to-Phoneme (G2P) Conversion: The normalized text is transformed into phonetic
transcriptions, mapping letters and letter combinations (graphemes) to their corresponding sounds
(phonemes).
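
A simple form of G2P is a pronunciation dictionary with a rule-based fallback. The sketch below assumes ARPAbet-style phoneme symbols; the three-word lexicon and the single-letter rules are hypothetical and much cruder than what production systems use.

```python
# Hypothetical mini-lexicon using ARPAbet-style phoneme symbols.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "to": ["T", "UW"],
    "text": ["T", "EH", "K", "S", "T"],
}
# Naive single-letter fallback rules for words missing from the lexicon.
LETTER_RULES = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "g": "G",
                "i": "IH", "k": "K", "l": "L", "m": "M", "n": "N", "o": "AA",
                "p": "P", "r": "R", "s": "S", "t": "T", "u": "AH"}

def g2p(word):
    """Look the word up in the lexicon; otherwise map letters one by one."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]

def phonemize(sentence):
    """Convert each word of an already-normalized sentence to phonemes."""
    return [g2p(w) for w in sentence.split()]

print(phonemize("Text to speech"))
print(g2p("cat"))
```

The fallback shows why English needs a lexicon or a learned model: letter-by-letter rules would get "speech" wrong.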

3. Prosody Generation
Prosodic Features: The system generates features such as intonation, stress, rhythm, and pauses to
produce natural-sounding speech. This involves modulating pitch, duration, and volume throughout the
speech.
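
To show what "modulating pitch, duration, and volume" can look like as data, the sketch below attaches those three values to each syllable. The rules are simplifying assumptions made for illustration (statements fall in pitch, questions rise, stressed syllables are longer and louder), and the numbers are arbitrary.

```python
def pitch_contour(num_syllables, is_question, base_hz=120.0, span_hz=30.0):
    """Return one target pitch per syllable.

    Assumption for illustration: statements fall steadily, questions rise.
    """
    if num_syllables == 1:
        return [base_hz]
    step = span_hz / (num_syllables - 1)
    if is_question:
        return [base_hz + i * step for i in range(num_syllables)]
    return [base_hz + span_hz - i * step for i in range(num_syllables)]

def add_prosody(syllables, stressed, is_question=False):
    """Attach pitch (Hz), duration (ms), and relative volume to each syllable."""
    pitches = pitch_contour(len(syllables), is_question)
    plan = []
    for i, (syl, pitch) in enumerate(zip(syllables, pitches)):
        emphasized = i in stressed
        plan.append({
            "syllable": syl,
            "pitch_hz": round(pitch, 1),
            "duration_ms": 220 if emphasized else 150,   # stressed syllables last longer
            "volume": 1.0 if emphasized else 0.8,        # and are slightly louder
        })
    return plan

for row in add_prosody(["hel", "lo", "world"], stressed={1, 2}):
    print(row)
```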

4. Speech Synthesis
Concatenative Synthesis: This method stitches together pre-recorded speech segments (like diphones
or entire words) stored in a database to produce fluent speech.
Formant Synthesis: Utilizes mathematical models to simulate the acoustic properties of human
speech by manipulating formants (resonant frequencies of the vocal tract).
HMM-based Synthesis: Employs Hidden Markov Models to statistically model sequences of speech
sounds and their prosody.
Neural TTS: Modern systems use deep learning models such as Tacotron, which predicts
spectrograms from text, and WaveNet, which generates the audio waveform sample by sample,
resulting in highly natural and expressive speech.
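
Of the four methods, concatenative synthesis is the easiest to sketch. In the example below, short sine tones stand in for recorded speech units, and the unit names, sample rate, and crossfade length are all assumptions; a real database holds recorded diphones or words.

```python
import math

RATE = 8000  # samples per second (assumed)

def fake_unit(freq_hz, ms):
    """Stand-in for a recorded speech unit: a short sine tone."""
    n = RATE * ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]

# Hypothetical unit database keyed by diphone name.
UNIT_DB = {
    "HH-EH": fake_unit(200, 50),
    "EH-L": fake_unit(220, 50),
    "L-OW": fake_unit(180, 50),
}

def concatenate(unit_names, overlap=40):
    """Join units, crossfading `overlap` samples at each seam to avoid clicks."""
    output = list(UNIT_DB[unit_names[0]])
    for name in unit_names[1:]:
        unit = UNIT_DB[name]
        for i in range(overlap):
            fade = i / overlap                     # 0 keeps the old unit, 1 the new one
            j = len(output) - overlap + i
            output[j] = (1 - fade) * output[j] + fade * unit[i]
        output.extend(unit[overlap:])
    return output

speech = concatenate(["HH-EH", "EH-L", "L-OW"])
print(len(speech), "samples,", 1000 * len(speech) / RATE, "ms")
```

The crossfade is why concatenative systems can sound natural within a unit but struggle at the joins when neighboring units do not match well.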
5. Audio Output
The generated speech is converted into an audio waveform, which can be played back to the user
through speakers or headphones.
6. Post-processing
Audio Enhancements: The audio waveform may undergo further processing to improve quality,
including noise reduction, equalization, and dynamic range compression.
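
Two of these enhancements can be sketched in a few lines: a basic dynamic range compressor followed by peak normalization. The threshold, ratio, and target level are arbitrary illustrative values, and noise reduction and equalization are not covered here.

```python
def compress(samples, threshold=0.5, ratio=4.0):
    """Reduce the level above `threshold` by `ratio` (simple dynamic range compression)."""
    out = []
    for s in samples:
        level = abs(s)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        out.append(level if s >= 0 else -level)
    return out

def peak_normalize(samples, target=0.9):
    """Scale the signal so its loudest sample reaches `target`."""
    peak = max(abs(s) for s in samples)
    return list(samples) if peak == 0 else [s * target / peak for s in samples]

quiet_and_loud = [0.1, -0.2, 0.9, -1.0, 0.3]
processed = peak_normalize(compress(quiet_and_loud))
print([round(s, 3) for s in processed])
```

Compression narrows the gap between quiet and loud samples; normalization then raises the overall level without clipping.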

TYPES OF TTS SYSTEMS

• Concatenative TTS:
o Uses pre-recorded units (words/syllables). Natural but limited flexibility.

• Parametric TTS:
o Uses statistical models (such as HMMs) to predict speech parameters that a vocoder turns into audio. More flexible, but less natural.

• Neural TTS (e.g. Tacotron, WaveNet):
o Deep learning-based. Produces highly natural human-like speech.

APPLICATIONS OF TTS

• Screen readers for the visually impaired

• Virtual assistants like Siri, Alexa, and Google Assistant
• Language learning and pronunciation help
• Audiobook generation
• Customer service bots

IMPACT OF TTS ACROSS VARIOUS SECTORS
TTS technology has significantly influenced multiple industries:
• Education: Enhances accessibility for students with visual impairments or reading
difficulties by converting text into speech, aiding comprehension and retention.
• Accessibility: Assists individuals with visual impairments or literacy challenges by
providing auditory access to written content, promoting inclusivity.
• Customer Service: Enables automated, consistent, and cost-effective customer
interactions through voice responses, improving service availability and efficiency.
• Healthcare: Facilitates patient care by reading medical information aloud, aiding
those with visual impairments, and assisting healthcare professionals with
hands-free data access.
• Entertainment: Enhances user experiences in gaming and media by providing realistic
and dynamic voice outputs.
• Automotive: Improves in-car systems by offering voice navigation and hands-free
control, enhancing safety and user experience.
• Global Communication: Helps break language barriers by voicing translated text in
real-time translation systems, supporting voice-based interactions and international collaboration.

LIBRARIES
Text-to-Speech (TTS) technology enables machines to convert written text into spoken audio,
enhancing accessibility, automation, and user interaction. Two popular Python libraries for TTS are
gTTS (Google Text-to-Speech) and pyttsx3.

• gTTS (Google Text-to-Speech) is a Python library that sends text to Google Translate's
text-to-speech service and saves the returned audio as an MP3 file. It supports multiple languages
and produces natural-sounding voices. However, it requires an internet connection, as it relies on
Google's online services.

• pyttsx3 is an offline, platform-independent TTS library that works with text-to-speech engines like
SAPI5 (Windows), NSSpeechSynthesizer (macOS), and espeak (Linux). It allows for detailed voice
control, including adjusting speech rate, volume, and voice type, making it suitable for local
applications that need reliable offline performance.

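EXAMPLE WITH GTTS LIBRARY:
from gtts import gTTS  # requires an internet connection
import os

text = "Welcome to artificial intelligence learning"
language = 'en'

# Request the speech from Google's service and save it as an MP3 file.
speech = gTTS(text=text, lang=language, slow=False)
speech.save("output.mp3")

# Open the file in the default player. xdg-open is Linux-only;
# use "start" on Windows or "open" on macOS instead.
os.system("xdg-open output.mp3")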

EXAMPLE WITH PYTTSX3 LIBRARY:

import pyttsx3

engine = pyttsx3.init()                  # selects the platform's speech driver
engine.setProperty('rate', 150)          # speaking rate in words per minute
engine.setProperty('volume', 1.0)        # volume from 0.0 to 1.0
engine.say("Welcome to AI learning")     # queue the sentence
engine.runAndWait()                      # speak and block until finished

Note:
• gTTS is better for Arabic and natural voice, but needs internet.
• pyttsx3 works offline but depends on system-installed voices.

THANK YOU
Dana Abdallah
Diana Emad
