Presentation 3

The document provides an overview of speech recognition and text-to-speech (TTS) technologies, detailing their processes, components, and applications. It explains the steps involved in speech recognition, including audio input, preprocessing, acoustic and language modeling, and output generation. Additionally, it outlines TTS processes, types of systems, and their impact across various sectors, along with examples of Python libraries for implementation.

Uploaded by

emaddiana82
Copyright
© All Rights Reserved

MULTIMODAL NLP

1. WHAT IS SPEECH RECOGNITION?
Speech recognition is the process by which a
computer system can listen to spoken
language, analyze it, and convert it into
written text.

SPEECH RECOGNITION STEPS

• Computer system: This can be your phone, laptop, or smart speaker (like Alexa).
• Listen to spoken language: Using a microphone, the system captures your voice.
• Analyze it: It processes the voice using algorithms to understand what’s being said.
• Convert into text: It then outputs what you said as text on screen or uses it to
take action (like searching on Google or typing for you).

2. HOW DOES IT WORK?
STEP 1: AUDIO INPUT

• The user speaks into a microphone.
• The system records this as an audio signal (a waveform).

STEP 2: PREPROCESSING (FEATURE EXTRACTION)

• The audio is cleaned: background noise is removed and the volume is adjusted.
• Then it is converted into features such as:
  o MFCC (Mel-Frequency Cepstral Coefficients)
  o Spectrogram
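
As a rough illustration of the feature-extraction step, the sketch below computes a magnitude spectrogram using only the Python standard library: it splits the signal into overlapping frames, applies a Hann window, and takes the DFT magnitude of each frame. The frame size, hop size, sample rate, and test tone are assumptions chosen for the example. MFCCs are typically derived from such a spectrogram by applying a mel filterbank, a logarithm, and a discrete cosine transform.

```python
import cmath
import math


def frame_signal(samples, frame_size, hop_size):
    """Split a signal into overlapping frames of frame_size samples."""
    return [
        samples[start:start + frame_size]
        for start in range(0, len(samples) - frame_size + 1, hop_size)
    ]


def magnitude_spectrum(frame):
    """Apply a Hann window, then return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [
        x * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_size=64, hop_size=32):
    """Return one magnitude spectrum per frame (rows = time, columns = frequency)."""
    frames = frame_signal(samples, frame_size, hop_size)
    return [magnitude_spectrum(frame) for frame in frames]


# Synthetic input (an assumption): a 1000 Hz sine sampled at 8000 Hz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone, frame_size=FRAME_SIZE, hop_size=32)
# Strongest frequency bin in the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
```

A direct DFT like this is slow for real audio; practical systems use an FFT, but the framing and windowing idea is the same.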

STEP 3: ACOUSTIC MODEL

• This model maps audio features to phonemes.
• Phonemes are the smallest units of sound that distinguish one word from another (like "b", "d", "sh").
• Models used:
  o HMM (Hidden Markov Models): Older, statistical approach.
  o RNN (Recurrent Neural Networks): Good for sequences like speech.
  o Transformer: Newer, very powerful model for processing long sequences.
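
To make the HMM idea concrete, here is a toy sketch of Viterbi decoding, the standard way to find the most likely hidden-state sequence in an HMM. The two phonemes, the quantized frame labels ("low" and "high" frequency energy), and every probability are invented for illustration; a real acoustic model works on continuous feature vectors and far more states.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               previous state on that path)
    first = observations[0]
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: two phonemes and two kinds of quantized frames.
states = ["b", "sh"]
start_p = {"b": 0.5, "sh": 0.5}
trans_p = {"b": {"b": 0.7, "sh": 0.3}, "sh": {"b": 0.3, "sh": 0.7}}
emit_p = {"b": {"low": 0.9, "high": 0.1}, "sh": {"low": 0.2, "high": 0.8}}

frames = ["low", "low", "high", "high"]
phoneme_path = viterbi(frames, states, start_p, trans_p, emit_p)
print(phoneme_path)  # ['b', 'b', 'sh', 'sh']
```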

STEP 4: LANGUAGE MODEL

• Predicts the most likely words in context.
• Helps correct things that sound similar (e.g., "write" vs "right").
• Uses:
  o Grammar rules
  o Probability of word combinations
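
One simple way to model the "probability of word combinations" is a bigram model, which estimates how likely a word is given the word before it. The sketch below counts bigrams in a tiny made-up corpus and applies add-one smoothing; the corpus and the smoothing choice are assumptions for illustration, not what any particular recognizer uses.

```python
from collections import Counter

# Tiny made-up training corpus.
corpus = [
    "please write a letter",
    "turn right at the corner",
    "write a short note",
    "the answer is right",
]

vocab = set()
context_counts = Counter()  # how often each word appears as a left context
bigram_counts = Counter()   # how often each (previous, word) pair appears
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    vocab.update(words[1:])
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


p_write = bigram_prob("please", "write")
p_right = bigram_prob("please", "right")
```

After "please", the model rates "write" as more likely than "right", which is the kind of evidence a recognizer can use to choose between words that sound alike.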

STEP 5: DECODER

• Combines the acoustic model and the language model.
• Outputs the final text that matches what was spoken.
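
A minimal sketch of how a decoder might combine the two models: each candidate word gets an acoustic score and a language-model score, and the decoder picks the candidate with the highest combined log-probability. All scores, the `lm_weight` parameter, and the `decode` helper are hypothetical and chosen only to show the idea.

```python
import math

# Hypothetical acoustic scores, log P(audio | word). "write" and "right"
# sound the same, so the acoustic model cannot tell them apart.
acoustic_logp = {"write": math.log(0.5), "right": math.log(0.5)}

# Hypothetical language-model scores, log P(word | previous word).
lm_logp = {
    ("please", "write"): math.log(0.20),
    ("please", "right"): math.log(0.01),
    ("turn", "write"): math.log(0.01),
    ("turn", "right"): math.log(0.30),
}


def decode(previous_word, candidates, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def score(word):
        return acoustic_logp[word] + lm_weight * lm_logp[(previous_word, word)]
    return max(candidates, key=score)


after_please = decode("please", ["write", "right"])
after_turn = decode("turn", ["write", "right"])
print(after_please, after_turn)  # write right
```

Real decoders search over whole word sequences (for example with beam search) rather than one word at a time, but the scoring principle is the same.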

STEP 6: OUTPUT
• The system gives the result as text.
• This can be:
  o Displayed on screen
  o Used to execute a command (e.g., call someone)

HOW DOES A SPEECH RECOGNITION SYSTEM WORK?
SPEAKER-INDEPENDENT VS. SPEAKER-DEPENDENT
• Speaker-Independent: Works with any voice.
• Speaker-Dependent: Trained to understand one person's voice better.

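ONLINE VS. OFFLINE RECOGNITION
• Online: Requires an internet connection; usually more accurate because recognition runs on larger server-side models.
• Offline: Works without internet, but is usually less accurate because the model must fit on the device.

REAL-TIME VS. BATCH
• Real-Time: Transcribes while you speak.
• Batch: Transcribes a full recorded file later.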
EXAMPLES

Application        Example
Voice Assistants   Siri, Alexa
Transcription      Zoom captions
Smart Homes        “Turn on the lights”
Medical            Doctor dictation
Call Centers       IVR menus

WHAT IS TTS?

• Text-to-Speech (TTS) is a technology that converts written text into spoken voice
output using artificial intelligence and signal processing.

• TTS process:
  1. Text Analysis
  2. Phonetic Conversion
  3. Prosody Generation
  4. Speech Synthesis
  5. Voice Output
  6. Post-processing
1. Text Analysis and Preprocessing
Text Normalization: This process converts elements like numbers, abbreviations, and symbols into their
corresponding spoken forms. For instance, "Dr." is expanded to "Doctor," and "123" becomes "one
hundred twenty-three."
Linguistic Analysis: The system examines the text to determine its syntactic and semantic structures,
identifying parts of speech, sentence boundaries, and areas requiring emphasis.
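
The text-normalization step can be sketched in a few lines. The example below expands a small, hand-picked abbreviation table and spells out integers from 0 to 999; the table, the `normalize` helper, and the number range are assumptions for illustration. Real systems also need context to resolve ambiguous cases (for example, "Dr." can mean "Doctor" or "Drive").

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical abbreviation table; a real system would be far larger.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand known abbreviations and spell out 1- to 3-digit numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Longer digit runs are left unchanged by this sketch.
    return re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())), text)


normalized = normalize("Dr. Smith has 123 books")
print(normalized)  # Doctor Smith has one hundred twenty-three books
```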

2. Phonetic Conversion
Grapheme-to-Phoneme (G2P) Conversion: The normalized text is transformed into phonetic
transcriptions, mapping letters and letter combinations (graphemes) to their corresponding sounds
(phonemes).
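
A common way to implement G2P is a pronunciation lexicon with fallback letter-to-sound rules for unknown words. The sketch below follows that pattern; the mini-lexicon, the rule table, and the ARPAbet-style phoneme symbols are simplified assumptions and cover only a handful of English spellings.

```python
# Hypothetical mini-lexicon using ARPAbet-style phoneme symbols.
LEXICON = {
    "ship": ["SH", "IH", "P"],
    "bed": ["B", "EH", "D"],
    "the": ["DH", "AH"],
}

# Simplified fallback rules; two-letter graphemes are tried before single letters.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH",
    "a": "AE", "b": "B", "d": "D", "e": "EH", "i": "IH",
    "o": "AA", "p": "P", "s": "S", "t": "T",
}


def g2p(word):
    """Convert a word to phonemes: lexicon lookup first, then rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    i = 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phonemes


known = g2p("ship")    # found in the lexicon
unknown = g2p("dish")  # built from the fallback rules
print(known, unknown)  # ['SH', 'IH', 'P'] ['D', 'IH', 'SH']
```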

3. Prosody Generation
Prosodic Features: The system generates features such as intonation, stress, rhythm, and pauses to
produce natural-sounding speech. This involves modulating pitch, duration, and volume throughout the
speech.

4. Speech Synthesis
Concatenative Synthesis: This method stitches together pre-recorded speech segments (like diphones
or entire words) stored in a database to produce fluent speech.
Formant Synthesis: Utilizes mathematical models to simulate the acoustic properties of human
speech by manipulating formants (resonant frequencies of the vocal tract).
HMM-based Synthesis: Employs Hidden Markov Models to statistically model sequences of speech
sounds and their prosody.
Neural TTS: Modern systems use deep learning models, such as Tacotron (which predicts
spectrograms from text) and WaveNet (which generates the waveform from such features),
resulting in highly natural and expressive speech.
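
The concatenative approach can be sketched as follows: look up stored units and join them, crossfading a few samples at each boundary to avoid clicks. Short sine tones stand in for recorded diphones here, and the unit names, unit length, and overlap are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000


def fake_unit(freq_hz, n_samples=800):
    """Stand-in for a recorded speech unit: a short sine tone (not real speech)."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


# Hypothetical unit database keyed by diphone name.
UNIT_DB = {
    "h-e": fake_unit(220),
    "e-l": fake_unit(330),
    "l-o": fake_unit(440),
}


def concatenate(unit_names, overlap=80):
    """Join units, linearly crossfading `overlap` samples at each boundary."""
    output = list(UNIT_DB[unit_names[0]])
    for name in unit_names[1:]:
        unit = UNIT_DB[name]
        start = len(output) - overlap
        for i in range(overlap):
            fade_in = i / overlap
            # Fade the previous unit out while fading the next unit in.
            output[start + i] = output[start + i] * (1 - fade_in) + unit[i] * fade_in
        output.extend(unit[overlap:])
    return output


speech = concatenate(["h-e", "e-l", "l-o"])
duration_s = len(speech) / SAMPLE_RATE
```

Each boundary shares 80 samples between neighbouring units, so three 800-sample units produce 2240 samples rather than 2400.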
5. Audio Output
The generated speech is converted into an audio waveform, which can be played back to the user
through speakers or headphones.
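
As an illustration of the audio output step, the sketch below writes a list of samples to a 16-bit mono WAV file using Python's standard `wave` module. The 440 Hz test tone stands in for synthesized speech, and the sample rate and file name are assumptions.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write floats in [-1.0, 1.0] to a 16-bit mono PCM WAV file."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


# One second of a 440 Hz tone as placeholder audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
wav_path = os.path.join(tempfile.gettempdir(), "tts_output_demo.wav")
write_wav(wav_path, tone)
```

The resulting file can be opened in any audio player.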
6. Post-processing
Audio Enhancements: The audio waveform may undergo further processing to improve quality,
including noise reduction, equalization, and dynamic range compression.
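
Two of the simpler enhancements can be sketched directly on a list of samples: peak normalization, which scales the signal to a target level, and a basic static compressor, which reduces the part of each sample that exceeds a threshold. The threshold, ratio, and target peak are arbitrary example values; production compressors also smooth the gain over time.

```python
import math


def peak_normalize(samples, target_peak=0.9):
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def compress(samples, threshold=0.5, ratio=4.0):
    """Static compressor: shrink the amount above threshold by `ratio`."""
    compressed = []
    for s in samples:
        level = abs(s)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        compressed.append(math.copysign(level, s))
    return compressed


quiet = [0.1, -0.2, 0.05]
louder = peak_normalize(quiet)            # peak becomes 0.9
squashed = compress([0.9, -0.9, 0.2])     # 0.9 is reduced to about 0.6
```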

TYPES OF TTS SYSTEMS

• Concatenative TTS:
o Uses pre-recorded units (words/syllables). Natural-sounding, but with limited flexibility.

• Parametric TTS:
o Uses statistical models (such as HMMs) to generate speech from acoustic parameters. More flexible, but less natural.

• Neural TTS (e.g. Tacotron, WaveNet):
o Deep learning-based. Produces highly natural, human-like speech.

APPLICATIONS OF TTS

• Screen readers for the visually impaired
• Virtual assistants like Siri, Alexa, and Google Assistant
• Language learning and pronunciation help
• Audiobook generation
• Customer service bots

IMPACT OF TTS ACROSS VARIOUS SECTORS
TTS technology has significantly influenced multiple industries:
• Education: Enhances accessibility for students with visual impairments or reading
difficulties by converting text into speech, aiding comprehension and retention.
• Accessibility: Assists individuals with visual impairments or literacy challenges by
providing auditory access to written content, promoting inclusivity.
• Customer Service: Enables automated, consistent, and cost-effective customer
interactions through voice responses, improving service availability and efficiency.
• Healthcare: Facilitates patient care by reading medical information aloud, aiding
those with visual impairments, and assisting healthcare professionals with
hands-free data access.
• Entertainment: Enhances user experiences in gaming and media by providing realistic
and dynamic voice outputs.
• Automotive: Improves in-car systems by offering voice navigation and hands-free
control, enhancing safety and user experience.
• Global Communication: Helps break language barriers by voicing translated text in
real-time translation systems, supporting voice-based interactions and international collaboration.

LIBRARIES
Text-to-Speech (TTS) technology enables machines to convert written text into spoken audio,
enhancing accessibility, automation, and user interaction. Two popular Python libraries for TTS are
gTTS (Google Text-to-Speech) and pyttsx3.

• gTTS (Google Text-to-Speech) is a cloud-based library that sends text to Google Translate's
text-to-speech service and saves the returned audio as an MP3 file. It supports multiple languages
and produces natural-sounding voices. However, it requires an internet connection, because the
speech is synthesized by Google's online services.

• pyttsx3 is an offline, platform-independent TTS library that works with text-to-speech engines like
SAPI5 (Windows), NSSpeechSynthesizer (macOS), and espeak (Linux). It allows for detailed voice
control, including adjusting speech rate, volume, and voice type, making it suitable for local
applications that need reliable offline performance.

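EXAMPLE WITH GTTS LIBRARY:
from gtts import gTTS  # pip install gTTS; needs internet access
import os

text = "Welcome to artificial intelligence learning"
language = 'en'  # language code of the text
# slow=False requests normal speaking speed
speech = gTTS(text=text, lang=language, slow=False)
speech.save("output.mp3")  # write the synthesized audio to an MP3 file
# Open the file in the default player. xdg-open is Linux-only;
# use "start" on Windows or "open" on macOS instead.
os.system("xdg-open output.mp3")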

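EXAMPLE WITH PYTTSX3 LIBRARY:

import pyttsx3
engine = pyttsx3.init()  # use the platform's default speech driver
engine.setProperty('rate', 150)  # speaking rate in words per minute
engine.setProperty('volume', 1.0)  # volume from 0.0 to 1.0
engine.say("Welcome to AI learning")  # queue the text to speak
engine.runAndWait()  # speak the queued text and wait until it finishes

Note:
• gTTS is better for Arabic and for natural-sounding voices, but needs an internet connection.
• pyttsx3 works offline but depends on system-installed voices.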

THANK YOU
Dana Abdallah
Diana Emad
