VOICE ASSISTANT
POWERED BY YOUR VOICE. DRIVEN BY INTELLIGENCE.
COURSE: FUNDAMENTALS IN AI/ML
COURSE CODE: CSA2001
SLOT: A24+C21+F22
OUR GROUP
1. Rudransh Bhardwaj   24BCE10011
2. Himanshu Singh      24BCE10123
3. Tanishka Chauhan    24BCE10353
4. Rashi Tiwari        24BSA10132
5. Daksh Patodi        24BCE10304
6. Saransh Singh       23BAS10099
ABOUT
Speech recognition, also known as automatic speech recognition (ASR)
or voice recognition, is a technology that converts spoken language
into written text.
The primary goal of speech recognition systems is to accurately and
efficiently transcribe spoken words into a format that can be processed,
stored, or used for various applications.
 This technology relies on sophisticated algorithms and machine
learning techniques to interpret and understand human speech
patterns.
OBJECTIVES AND GOALS
To design and develop a smart voice assistant capable of understanding and executing user commands through natural language interaction, offering seamless integration with various devices and services for improved user convenience and efficiency.
1. ACCURATE SPEECH RECOGNITION
2. TASK EXECUTION AND AUTOMATION
3. REAL-TIME AND NATURAL INTERACTION
FUNCTIONALITIES
Listening: The assistant uses a microphone to capture the user's
voice and processes it.
Speech Recognition: Converts the user's spoken input into text for
processing.
Command Processing: Interprets the text to understand the user's
intent.
Performing Actions: Executes the required task, like fetching
information, opening applications, or responding verbally.
Responding: Converts the response text back into speech for output.
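A minimal sketch of these five functionalities, assuming the speech_recognition and pyttsx3 libraries named later in this report (error handling is trimmed for brevity):

```python
import speech_recognition as sr  # speech-to-text
import pyttsx3                   # text-to-speech

recognizer = sr.Recognizer()
engine = pyttsx3.init()

def listen():
    """Listening + Speech Recognition: capture microphone audio
    and convert it to text via the Google Speech API."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text):
    """Responding: convert the response text back into speech."""
    engine.say(text)
    engine.runAndWait()

speak("How can I help you?")
print("You said:", listen())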
                  INTRODUCTION
The assistant can perform simple tasks like opening apps, fetching
information, or giving time and date updates.
It employs technologies like Speech Recognition to convert spoken words
into text, Natural Language Processing to understand user intent, and Text-
to-Speech (TTS) to provide audio responses. Libraries such as
speech_recognition, pyttsx3, and pyaudio are commonly used.
While simple, it demonstrates the fundamentals of voice-based interaction
and can be enhanced with additional features for more advanced
functionality.
METHODOLOGY
USE OF CONDITIONAL STATEMENTS
Conditional statements, also known as "if statements", allow programs to make decisions based on specific conditions. They are fundamental in programming for controlling the flow of a program. Basic structure of conditional statements:
•If statement: Executes a block of code if a condition is true.
•If-Else statement: Provides an alternative block of code if the condition is false.
•Elif statement: Used as part of conditional statements to check multiple conditions. It is short for "else if": when you need to test more than one condition in sequence, you place elif between if and else (see the sketch after this section).
USE OF LOOPS
Loops in Python are used to execute a block of code repeatedly, either for a fixed number of times or until a condition is met. They are fundamental in automating repetitive tasks, handling large datasets, and iterating over elements in collections like lists or strings. In this program we have used a while loop in the main function to run the command cycle until the user exits.
USE OF FUNCTIONS IN PYTHON PROGRAM
A function in Python is a block of reusable code designed to perform a specific task. Functions help organize and structure code, make it easier to debug, and improve reusability.
In this program we have used various functions:
Listen to speech or voice → Process commands → Open website
EXIT CONDITION:
The program stops running when the user says "exit" or "quit."
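A minimal sketch of this control flow, reusing the hypothetical listen() and speak() helpers from the earlier snippet (the specific command handlers are illustrative, not the project's exact code):

```python
import datetime
import webbrowser

def main():
    # Assumes listen() and speak() from the earlier sketch are defined.
    while True:                              # loop runs until the user exits
        command = listen().lower()
        if command in ("exit", "quit"):      # exit condition
            speak("Goodbye!")
            break
        elif "time" in command:              # elif chain routes each command
            speak(datetime.datetime.now().strftime("It is %H:%M"))
        elif "open google" in command:
            speak("Opening Google")
            webbrowser.open("https://www.google.com")
        else:                                # fallback for unknown input
            speak("Sorry, I did not understand that.")
```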
LIBRARIES USED
1. wave
2. NumPy
3. datetime
4. requests
5. threading
6. webbrowser
7. speech_recognition
8. pyttsx3
FEATURES
•Error handling: Handles unknown input and failed API calls.
•Reduced latency: Shortened timeout for speech recognition and faster text-to-speech output.
•Threaded operations: Uses threads to open websites without delaying other tasks.
•Voice optimization: Fast speech synthesis with adjustable rate and volume; female voice prioritization.
•Core functions: Provides word meanings, fetches weather details, tells time and date, opens websites.
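A minimal sketch of the threaded-operations feature, assuming Python's standard threading and webbrowser modules (the URL is illustrative):

```python
import threading
import webbrowser

def open_site(url):
    webbrowser.open(url)  # may block briefly, so run it off the main thread

# daemon=True lets the program exit without waiting for the browser thread.
thread = threading.Thread(target=open_site,
                          args=("https://www.google.com",), daemon=True)
thread.start()
print("Assistant is free to handle the next command.")
```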
How does speech to text work?!
Steps in speech to text conversion....
Step 1. Acoustic Signal Processing:
The input to a speech recognition system is an acoustic signal: the analogue waveform of the spoken words. This signal is captured by a microphone and converted into a digital format. A signal-processing algorithm known as the Fast Fourier Transform (FFT) is then applied to short frames of the digital signal to convert the waveform into a spectrogram.
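A minimal sketch of this step with NumPy's FFT; the 16 kHz sample rate and the frame/hop sizes are illustrative assumptions, not values from the project:

```python
import numpy as np

def spectrogram(audio, frame_len=400, hop=160):
    """Split the signal into overlapping frames, window each frame,
    and take the magnitude of its FFT (a short-time Fourier transform)."""
    window = np.hanning(frame_len)
    frames = [audio[i:i + frame_len] * window
              for i in range(0, len(audio) - frame_len, hop)]
    # Rows are time steps; columns are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (time_steps, frequency_bins)
```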
Step 2. Feature Extraction
Spectral Analysis: The digital signal undergoes spectral analysis to extract relevant features. This involves breaking down the signal into frequency components, revealing patterns that represent the characteristics of speech sounds.
Pitch and Intensity Analysis: Additional features, such as pitch (the fundamental frequency of the speech) and intensity (loudness), are extracted to capture more nuances of the spoken language.
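As one simple illustration of such a feature, intensity can be estimated as root-mean-square energy per frame (pitch estimation is more involved and omitted here; the frame sizes are again illustrative):

```python
import numpy as np

def frame_rms(audio, frame_len=400, hop=160):
    """Intensity (loudness) per frame as root-mean-square energy."""
    return np.array([
        np.sqrt(np.mean(audio[i:i + frame_len] ** 2))
        for i in range(0, len(audio) - frame_len, hop)
    ])

t = np.linspace(0, 1, 16000, endpoint=False)
loudness = frame_rms(0.5 * np.sin(2 * np.pi * 440 * t))
print(loudness[:3])  # roughly constant (~0.35) for a steady tone
```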
Standardization prepares the data; the CNN extracts the features.

STANDARDIZATION
Standardization is a data preprocessing step that transforms input features so they have:
Mean = 0
Standard deviation = 1
Formula: z = (x − μ) / σ
Where:
x = input value
μ = mean of the feature
σ = standard deviation
Why it's used: it normalizes the range of values, helps neural networks converge faster, and prevents features with large scales from dominating learning.

CNN (Convolutional Neural Network)
A CNN is a type of neural network that learns to extract features from data automatically, especially spatial or temporal patterns. It applies filters (kernels) that slide over the input data and detect local features such as edges and shapes (in images) or phonetic patterns (in spectrograms). Each convolution layer learns increasingly abstract representations. A CNN learns which features are important; it doesn't just scale or normalize them.
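A minimal sketch of the standardization formula with NumPy (the sample values are illustrative):

```python
import numpy as np

# Two features on very different scales.
features = np.array([[2.0, 200.0],
                     [4.0, 400.0],
                     [6.0, 600.0]])

mu = features.mean(axis=0)     # per-feature mean (μ)
sigma = features.std(axis=0)   # per-feature standard deviation (σ)
z = (features - mu) / sigma    # z = (x − μ) / σ

print(z.mean(axis=0))  # ~[0. 0.]  -> mean 0 per feature
print(z.std(axis=0))   # [1. 1.]   -> standard deviation 1 per feature
```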
Step 3. Acoustic Modelling
Acoustic models come in various types with different loss functions, but the most widely used in the literature and in production are Connectionist Temporal Classification (CTC) based models, which take a spectrogram (X) as input and produce log-probability scores (P) over all vocabulary tokens at each time step.
The problem CTC solves in speech recognition:
The input is a long sequence of audio frames (e.g., 1,000 time steps).
The output is a much shorter sequence of text (e.g., 10 characters).
We don't know exactly which part of the audio corresponds to which letter or word (no alignment).
How CTC works (conceptually):
At every time step the model may predict a character or a special blank token (_).
During training, CTC considers all possible alignments of the output text within the input length and computes the total probability of all valid alignments using dynamic programming.
In short: CTC solves the alignment problem in speech recognition. It lets models map long input sequences to shorter outputs without frame-level labels, uses dynamic programming to sum over all possible alignments, and powers many speech models, especially those built before attention-based methods became mainstream.
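A minimal sketch of CTC's collapse rule, the decoding-side view of this mapping: merge repeated tokens, then drop the blank symbol. The frame sequence below is illustrative:

```python
def ctc_collapse(frames, blank="_"):
    """Merge consecutive repeated tokens, then remove blanks."""
    out = []
    prev = None
    for token in frames:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

# Ten frame-level predictions collapse to the 3-character word "cat".
print(ctc_collapse(list("cc_aaa_ttt")))  # -> "cat"
```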
Step 4. Decoding
Matching Patterns: The acoustic and language models work
together in decoding. The system matches the observed acoustic
patterns with the learned models to identify the most probable
sequence of phonemes and words.
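A minimal sketch of one simple decoding strategy, greedy decoding, using hypothetical per-frame scores; real systems combine these acoustic scores with a language model:

```python
import numpy as np

vocab = ["_", "c", "a", "t"]  # "_" is the CTC blank

# Hypothetical log-probability scores P: one row per time step.
log_probs = np.log(np.array([
    [0.1, 0.7, 0.1, 0.1],   # most probable token: "c"
    [0.6, 0.2, 0.1, 0.1],   # most probable token: "_"
    [0.1, 0.1, 0.7, 0.1],   # most probable token: "a"
    [0.1, 0.1, 0.1, 0.7],   # most probable token: "t"
]))

best = [vocab[i] for i in log_probs.argmax(axis=1)]
print(best)  # ['c', '_', 'a', 't'] -> collapses to "cat" (see sketch above)
```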
RESEARCH PAPER
By:
Ms. Preethi G
Mr. Abhishek K
Mr. Thiruppugal S
Mr. Vishwaa D A
CODE
TARGET AUDIENCE
TECH-SAVVY INDIVIDUALS INCLUDING STUDENTS, RESEARCHERS, DEVELOPERS, AND INDUSTRY PROFESSIONALS INTERESTED IN ARTIFICIAL INTELLIGENCE, NATURAL LANGUAGE PROCESSING, HUMAN-COMPUTER INTERACTION, AND THE FUTURE DEVELOPMENT OF SMART VOICE ASSISTANT TECHNOLOGY. THE PROJECT IS INTENDED
HOW IT WORKS?!
1. INITIALIZATION: Set up text-to-speech and speech recognition; adjust voice, speed, and volume.
2. LISTENING: Wait for user input with a microphone; recognize speech using the Google Speech API.
3. PROCESSING: Match the input with predefined commands and execute the corresponding functions.
4. OUTPUT: Provide responses through text-to-speech or web browsing.
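A minimal sketch of the initialization step, assuming the pyttsx3 and speech_recognition APIs; which voice index is female varies by system, so the choice below is an assumption:

```python
import pyttsx3
import speech_recognition as sr

# Text-to-speech setup: adjustable rate and volume, female voice if available.
engine = pyttsx3.init()
engine.setProperty("rate", 180)    # words per minute (illustrative value)
engine.setProperty("volume", 1.0)  # 0.0 to 1.0
voices = engine.getProperty("voices")
if len(voices) > 1:
    engine.setProperty("voice", voices[1].id)  # often female, system-dependent

# Speech recognition setup: a shortened timeout reduces waiting (latency).
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source, timeout=3, phrase_time_limit=5)
print(recognizer.recognize_google(audio))
```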
TIMELINE
NOVEMBER 2016: Google Home was introduced in the United States.
APRIL 2017: A software update brought back multi-user functionality.
OCTOBER 2017: Google announced two new products: the Google Home Mini and the Google Home Max.
MAY 2019: Google announced that virtual home devices, including the Nest Hub Max, would be rebranded under the Google Nest standard.
THANK YOU