Pimpri Chinchwad Education Trust’s
PIMPRI CHINCHWAD UNIVERSITY
Plot No. 44, 49 & 50, Mohitewadi Rd, Mohitewadi, Maharashtra 412106
SCHOOL OF ENGINEERING AND TECHNOLOGY
PROGRAMME: B.TECH CSE (AIML)
PROJECT GROUP NO: 5
COMMUNITY ENGINEERING PROJECT
Sr.no Name Roll no
1. Ambarish Popatwar I-58
2. Khemraj Jadhav I-59
3. Tejaswini Suman I-73
4. Mazin Akkalath I-56
Guided by: Ashwini Biradar
This is to certify that,
Ambarish Popatwar, Khemraj Jadhav, Tejaswini Suman, Mazin
Akkalath, students of S.Y B.Tech AIML have satisfactorily completed
a community engineering project titled “Speech Recognition Using
Computer Vision” towards partial fulfilment of S.Y B.Tech CSE
(AIML) Degree for the academic year 2024 – 2025 at Pimpri
Chinchwad University.
Mrs. Ashwini Biradar Prof. V.N. Patil
(Project Guide) (HOD, Computer Science)
INDEX
Sr.no Title
1 Abstract
2 Introduction
3 Literature Survey
4 Working
5 Future Scope
6 References
7 Conclusion
SPEECH RECOGNITION
USING COMPUTER
VISION
Abstract
Lip reading, the process of understanding spoken words by
visually interpreting lip movements, holds great potential
for improving human-computer interaction, accessibility for
the hearing-impaired, and silent communication systems.
Our project proposes an AI-based lip-reading system that
recognizes phonemes—basic sound units—from lip
movements using deep learning models. The focus is to use
the GRID corpus dataset for training a model capable of
mapping lip movements in video frames to corresponding
phonemes, thereby enabling partial or full speech
reconstruction. This work avoids reliance on large-scale
datasets like LRS2 by using a more accessible, permission-
free dataset. The proposed system can eventually be
enhanced with facial expression recognition and text-to-
speech modules, allowing more expressive and intuitive
communication from silent videos.
Introduction
The ability to lip-read has traditionally been a human skill,
often requiring years of practice and contextual
understanding. With recent advances in computer vision and
deep learning, automatic lip reading systems have become a
viable solution for silent speech recognition, especially in
noisy environments, for accessibility tools, and even in
surveillance and security.
Unlike speech recognition models that rely solely on audio
input, lip-reading systems analyze visual cues to infer
speech. This is a more challenging task due to factors such as
speaker variability, coarticulation (overlap of mouth
movements for different phonemes), and visual ambiguity
between certain phonemes. To address these, we aim to
create a system that learns phoneme-level visual patterns
using deep neural networks trained on the GRID corpus—a
dataset designed for visual speech research.
The project’s end goal is to build a proof-of-concept demo
that can accurately identify phonemes from lip movements
in short video clips, thereby laying the foundation for a more
comprehensive speech reconstruction model.
Literature Survey
1. Chung et al. (2017) - Lip Reading in the Wild (LRW,
LRS2)
Chung and colleagues developed large-scale lip reading
datasets (LRW, LRS2, and LRS3) and proposed models that
leverage 3D CNNs and attention-based RNNs. These datasets
require permission and are storage-heavy, which makes
them challenging to use in resource-constrained setups.
Their models, however, set a benchmark in the field.
2. Assael et al. (2016) - LipNet
LipNet was one of the first end-to-end models for sentence-
level lip reading. It used spatiotemporal convolutions
followed by GRU-based sequence modeling. Trained on the
GRID dataset, it achieved high accuracy in word-level
recognition, validating GRID's utility for controlled
experiments.
3. Wand et al. (2016) - Visual Speech Recognition with
a PCA/LSTM Network
They proposed using Principal Component Analysis (PCA)
for dimensionality reduction of lip features, followed by
LSTM layers to recognize temporal sequences. This
approach was data-efficient but struggled with scalability.
4. Zhao et al. (2019) - A Phoneme-Level Approach for Lip
Reading
Rather than predicting full words or characters, their
method mapped video frames to phonemes, which helped
with better generalization and reconstruction. This aligns
closely with our approach of phoneme-level prediction.
5. GRID Corpus Dataset
The GRID corpus consists of over 30,000 utterances from 34
speakers in a controlled setting. Each sentence follows a
fixed grammar structure, making it ideal for phoneme-level
and word-level visual speech recognition tasks.
Working
The core of our lip-reading system lies in the ability to accurately
map visual cues—specifically lip movements over time—to their
corresponding phoneme sequences. This section describes each
technical component of the system in detail, from raw data
preprocessing to model deployment.
1. Dataset: GRID Corpus
The GRID Corpus is a multi-speaker audio-visual sentence
corpus with a fixed sentence grammar. It is structured and ideal
for research in audio-visual speech recognition. The corpus
contains:
• 34 speakers
• 1,000 sentences per speaker
• ~34,000 total video samples
• Each video clip is 3 seconds long
• Annotated at word, phoneme, and frame levels
Each sentence follows the format:
command + color + preposition + letter + digit + adverb
E.g., “Place blue at J 2 now”
Why GRID?
• High-quality, controlled conditions
• Fixed vocabulary allows easier phoneme alignment
• Suitable size (~20GB) for limited storage environments
2. Preprocessing Pipeline
a. Frame Extraction
• Videos are split into individual frames using OpenCV.
• We maintain a fixed frame rate (e.g., 25 fps), giving ~75
frames per 3-second video.
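A minimal sketch of this step is shown below, assuming OpenCV is installed; the function name extract_frames is illustrative and not part of the project code.

import cv2

def extract_frames(video_path):
    """Read a video frame-by-frame and return the frames as a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:              # no more frames to read
            break
        frames.append(frame)
    cap.release()
    return frames               # ~75 frames for a 3-second GRID clip recorded at 25 fps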
b. Lip Region Localization
• We use Mediapipe Face Mesh or Dlib 68-point facial
landmarks to detect the lip region.
• The mouth region is cropped with a bounding box slightly
larger than the lips to capture coarticulation cues.
c. Frame Resizing and Normalization
• Cropped mouth frames are resized to (64x64) or (96x96)
pixels.
• Each frame is converted to grayscale (to reduce complexity)
and normalized (pixel values scaled to [0,1]).
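A minimal sketch of steps (b) and (c) using dlib's 68-point landmarks (mouth points 48–67); the margin value, output size, and function name are illustrative assumptions, and the standard shape_predictor_68_face_landmarks.dat model must be downloaded separately.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, size=64, margin=10):
    """Return a normalized grayscale lip crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # dlib's mouth landmarks are points 48-67; pad the box to keep coarticulation cues
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    x1, x2 = max(min(xs) - margin, 0), max(xs) + margin
    y1, y2 = max(min(ys) - margin, 0), max(ys) + margin
    crop = cv2.resize(gray[y1:y2, x1:x2], (size, size))
    return crop.astype(np.float32) / 255.0   # pixel values scaled to [0, 1]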
d. Temporal Alignment with Phonemes
• GRID provides word-level timestamps in its alignment files;
phoneme timings are derived from them via dictionary mapping or
forced alignment.
• We segment video frames into smaller sub-sequences
aligned with phonemes (e.g., frames 10–18 map to phoneme
/p/).
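A sketch of how such an alignment file could be parsed; the assumption that dividing the raw times by 1000 yields 25 fps frame indices should be verified against the downloaded copy of the corpus.

def load_alignment(align_path, time_unit=1000):
    """Parse a GRID alignment file ("start end word" per line) into frame-level segments."""
    segments = []
    with open(align_path) as f:
        for line in f:
            start, end, word = line.split()
            if word == "sil":       # skip the silence markers at the clip boundaries
                continue
            # assumption: raw times divided by 1000 correspond to 25 fps frame indices
            segments.append((int(start) // time_unit, int(end) // time_unit, word))
    return segments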
e. Dataset Creation
• For training, each data point becomes:
→ a sequence of preprocessed lip frames + corresponding
phoneme label
• Final structure: (lip_frame_sequence, phoneme_label) pairs,
saved as arrays for fast loading (see the sketch below)
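A minimal sketch of how these pairs could be packed and saved with NumPy; the file prefix and the assumption that all sequences are padded or truncated to the same number of frames are illustrative.

import numpy as np

def save_dataset(X, y, prefix="grid_phonemes"):
    """Stack (lip_frame_sequence, phoneme_label) pairs and save them as .npy files."""
    # X: list of lip-frame sequences, each assumed to share one shape, e.g. (num_frames, 64, 64)
    # y: list of integer phoneme labels
    frames = np.stack([np.asarray(seq, dtype=np.float32) for seq in X])
    labels = np.asarray(y, dtype=np.int64)
    np.save(f"{prefix}_frames.npy", frames)
    np.save(f"{prefix}_labels.npy", labels)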
3. Code Architecture and Python Files
To maintain scalability, readability, and modularity, our project is
divided into several Python files, each performing a specific
function. Below is a detailed explanation of each Python script
and its role in the system.
4. lipmovement_analyzer.py
Purpose:
This is the core script that processes raw video data and
extracts the region of interest (the lips) from each frame. It
handles preprocessing, frame extraction, and prepares input
for the lip-reading model.
Key Functionalities:
➤ Frame Extraction
• Uses OpenCV to read a video file frame-by-frame.
• Converts frames to grayscale or RGB depending on the
model input.
• Maintains a consistent frame rate (e.g., 25 fps) to match
phoneme timestamps later.
cap = cv2.VideoCapture(video_path)   # open the source video for frame-by-frame reading
➤ Lip Region Detection
• Uses Mediapipe Face Mesh or dlib’s 68-point landmark
detector.
• Detects landmarks on the face and isolates the region
corresponding to the lips (landmarks 48–67 in dlib).
results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))   # Mediapipe expects RGB input
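A slightly fuller sketch of the same idea with Mediapipe Face Mesh; the FACEMESH_LIPS connection set is used to pick out the lip landmark indices, and the function name is illustrative.

import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh
LIP_IDX = sorted({i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair})
face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def lip_landmarks(frame_bgr):
    """Return lip landmark pixel coordinates for one frame, or None if no face is found."""
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    return np.array([(lm[i].x * w, lm[i].y * h) for i in LIP_IDX])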
➤ Cropping and Normalization
• Extracts a bounding box around the lips and resizes it to
a fixed dimension (e.g., 64x64).
• Normalizes pixel values to a [0,1] range to help the
model converge faster.
➤ Output:
• A sequence of lip-cropped images (NumPy arrays or
tensors) saved for model input.
• Can optionally save a visualized version of lip
movement as a video/GIF for debugging.
5. phoneme_mapper.py
Purpose:
Maps each processed video (or segment) to a phoneme label,
allowing supervised learning. This is where label assignment
happens—crucial for phoneme-level recognition.
Key Functionalities:
➤ Load Alignment or Text Files
• For the GRID corpus, each video has an associated .txt
file containing the spoken sentence.
• This file is parsed to extract the phoneme-level
transcription (via dictionary mapping or forced
alignment results).
➤ Phoneme Tokenization
• Converts words into their corresponding phoneme
sequences using a lexicon like the CMU Pronouncing
Dictionary.
from nltk.corpus import cmudict      # requires a one-time nltk.download('cmudict')
pronunciations = cmudict.dict()
phones = pronunciations["bin"][0]              # ['B', 'IH1', 'N'] (digits mark stress)
phones = [p.rstrip("012") for p in phones]     # strip stress markers -> ['B', 'IH', 'N']
• Example: "bin" → ['B', 'IH', 'N']
➤ Sequence Alignment
• Maps lip frame sequences to phoneme sequences. Two
main strategies can be used:
o Equal segmentation: Assume phonemes are evenly
distributed across frames.
o Forced alignment (optional advanced step): Align
phoneme duration with time using a tool like
Montreal Forced Aligner.
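A minimal sketch of the equal-segmentation strategy above; it simply divides the frame sequence into as many chunks as there are phonemes.

import numpy as np

def segment_frames(frames, phonemes):
    """Split a frame sequence into equal chunks, one chunk per phoneme (equal segmentation)."""
    # np.array_split tolerates sequences that do not divide evenly
    chunks = np.array_split(np.asarray(frames), len(phonemes))
    return list(zip(chunks, phonemes))   # [(frame_chunk, phoneme), ...]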
➤ Label Encoding
• Each phoneme is encoded as a number for
classification.
• Dictionary like {‘B’: 0, ‘D’: 1, ‘P’: 2, ..., ‘N’: 38} is used to
convert strings to labels.
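A small sketch of building such a mapping from the training transcriptions; the phoneme_sequences variable below is a hypothetical stand-in for the parsed transcriptions.

# Hypothetical phoneme transcriptions for two clips (illustrative only)
phoneme_sequences = [['B', 'IH', 'N'], ['P', 'L', 'EY', 'S']]
all_phonemes = sorted({p for seq in phoneme_sequences for p in seq})
phoneme_to_idx = {p: i for i, p in enumerate(all_phonemes)}    # e.g. {'B': 0, 'EY': 1, ...}
labels = [phoneme_to_idx[p] for p in phoneme_sequences[0]]     # integer labels for one clip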
➤ Output:
• Dataset pairs: (lip_frame_sequence, phoneme_label) for
training the model.
• Optionally outputs .npy or .pt files for fast data loading.
6. Challenges Faced
• Visual Ambiguity: Some phonemes look identical on lips
(e.g., /p/, /b/, /m/)
• Speaker Dependency: Performance may degrade with
new/unseen speakers
• Dataset Imbalance: Rare phonemes have fewer examples
Future Scope
1. Full Sentence Reconstruction from Phonemes
o Integrate phoneme-to-word models using language
models like BERT or GPT to generate natural
language output.
2. Facial Expression Analysis
o Combine lip movements with facial emotion
recognition to infer tone, intention, and emotional
context.
3. Real-time Lip Reading
o Optimize model inference to support real-time
predictions, useful for assistive devices.
4. Multilingual Support
o Extend the system to support languages beyond
English by training on multilingual datasets.
5. Text-to-Speech Integration
o Convert predicted phonemes into speech using
phoneme-to-speech engines (e.g., Tacotron 2 +
WaveGlow), enabling complete silent-to-audio
transformation.
6. Dataset Expansion
o Incorporate varied datasets with real-world noise,
lighting variations, and accents to improve
generalization.
References
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017).
Lip Reading Sentences in the Wild.
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2017.
[https://arxiv.org/abs/1611.05358]
➤ This is the foundational work that introduced large-scale
sentence-level lip reading using deep learning. The paper uses
deep CNN + RNN + CTC architecture.
Assael, Y. M., Shillingford, B., Whiteson, S., & de Freitas, N.
(2016).
LipNet: End-to-End Sentence-level Lipreading.
arXiv preprint arXiv:1611.01599.
[https://arxiv.org/abs/1611.01599]
➤ This paper introduced LipNet, the first end-to-end lip-reading
model using spatiotemporal convolution and GRUs trained with
CTC loss on the GRID corpus.
Petridis, S., Stafylakis, T., Ma, P., Cai, J., & Pantic, M. (2018).
End-to-End Multi-View Lipreading.
BMVC 2018.
[https://arxiv.org/abs/1806.09053]
➤ This paper discusses improvements using multiple views
(angles) of the mouth and deep 3D CNNs for improved phoneme
and word recognition.
Afouras, T., Chung, J. S., & Zisserman, A. (2018).
Deep Audio-Visual Speech Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 2018.
[https://arxiv.org/abs/1809.02108]
➤ Explores combining audio and video inputs using attention
mechanisms. While our project is visual-only, this work offers useful
insights into attention-based sequence modeling.
Stafylakis, T., & Tzimiropoulos, G. (2017).
Combining Residual Networks with LSTMs for Lipreading.
Interspeech 2017.
[https://arxiv.org/abs/1703.04105]
➤ A clean example of combining ResNet for frame-level feature
extraction with temporal modeling using LSTMs.
Bear, H. L., Harvey, R., Theobald, B. J., & Lan, Y. (2014).
Which Phonemes Are Hard to Lip-Read?
Proc. of AVSP 2014.
➤ Investigates the visual distinguishability of phonemes, a key
challenge in visual-only speech recognition.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006).
An Audio-Visual Corpus for Speech Perception and Automatic
Speech Recognition.
Journal of the Acoustical Society of America, 2006.
➤ The paper that introduced the GRID Corpus used in our
project.
Conclusion
This project presents a robust framework for phoneme-level
lip reading using deep learning. By leveraging the GRID
corpus and avoiding large-scale datasets like LRS2, we’ve
built an efficient and accessible entry point into visual
speech recognition. Although still in its early stages, our
approach lays the groundwork for more advanced
applications in accessibility, human-computer interaction,
and silent communication systems. Future additions like
emotion detection, real-time processing, and speech
generation can significantly expand its usability across
diverse fields.