Pimpri Chinchwad Education Trust’s
PIMPRI CHINCHWAD UNIVERSITY
Plot No. 44, 49 & 50, Mohitewadi Rd, Mohitewadi, Maharashtra 412106
SCHOOL OF ENGINEERING AND TECHNOLOGY
PROGRAMME: B.TECH CSE (AIML)
PROJECT GROUP NO: 5
COMMUNITY ENGINEERING PROJECT
Sr.no Name Roll no
1. Ambarish Popatwar I-58
2. Khemraj Jadhav I-59
3. Tejaswini Suman I-73
4. Mazin Akkalath I-56
Guided by: Ashwini Biradar
This is to certify that,
Ambarish Popatwar, Khemraj Jadhav, Tejaswini Suman, Mazin
Akkalath, students of S.Y B.Tech AIML have satisfactorily completed
a community engineering project titled “Speech Recognition Using
Computer Vision” towards partial fulfilment of S.Y B.Tech CSE
(AIML) Degree for the academic year 2024 – 2025 at Pimpri
Chinchwad University.
Mrs. Ashwini Biradar Prof. V.N. Patil
(Project Guide) (HOD, Computer Science)
INDEX
Sr.no Title
1 Abstract
2 Introduction
3 Literature Survey
4 Working
5 Future Scope
6 References
7 Conclusion
SPEECH RECOGNITION
USING COMPUTER
VISION
Abstract
Lip reading, the process of understanding spoken words by
visually interpreting lip movements, holds great potential
for improving human-computer interaction, accessibility for
the hearing-impaired, and silent communication systems.
Our project proposes an AI-based lip-reading system that
recognizes phonemes—basic sound units—from lip
movements using deep learning models. The focus is to use
the GRID corpus dataset for training a model capable of
mapping lip movements in video frames to corresponding
phonemes, thereby enabling partial or full speech
reconstruction. This work avoids reliance on large-scale
datasets like LRS2 by using a more accessible, permission-
free dataset. The proposed system can eventually be
enhanced with facial expression recognition and text-to-
speech modules, allowing more expressive and intuitive
communication from silent videos.
Introduction
The ability to lip-read has traditionally been a human skill,
often requiring years of practice and contextual
understanding. With recent advances in computer vision and
deep learning, automatic lip reading systems have become a
viable solution for silent speech recognition, especially in
noisy environments, for accessibility tools, and even in
surveillance and security.
Unlike speech recognition models that rely solely on audio
input, lip-reading systems analyze visual cues to infer
speech. This is a more challenging task due to factors such as
speaker variability, coarticulation (overlap of mouth
movements for different phonemes), and visual ambiguity
between certain phonemes. To address these, we aim to
create a system that learns phoneme-level visual patterns
using deep neural networks trained on the GRID corpus—a
dataset designed for visual speech research.
The project’s end goal is to build a proof-of-concept demo
that can accurately identify phonemes from lip movements
in short video clips, thereby laying the foundation for a more
comprehensive speech reconstruction model.
Literature Survey
1. Chung et al. (2017) - Lip Reading in the Wild (LRW,
LRS2)
Chung and colleagues developed large-scale lip reading
datasets (LRW, LRS2, and LRS3) and proposed models that
leverage 3D CNNs and attention-based RNNs. These datasets
require permission and are storage-heavy, which makes
them challenging to use in resource-constrained setups.
Their models, however, set a benchmark in the field.
2. Assael et al. (2016) - LipNet
LipNet was one of the first end-to-end models for sentence-
level lip reading. It used spatiotemporal convolutions
followed by GRU-based sequence modeling. Trained on the
GRID dataset, it achieved high accuracy in word-level
recognition, validating GRID's utility for controlled
experiments.
3. Wand et al. (2016) - Visual Speech Recognition with
a PCA/LSTM Network
They proposed using Principal Component Analysis (PCA)
for dimensionality reduction of lip features, followed by
LSTM layers to recognize temporal sequences. This
approach was data-efficient but struggled with scalability.
4. Zhao et al. (2019) - A Phoneme-Level Approach for Lip
Reading
Rather than predicting full words or characters, their
method mapped video frames to phonemes, which helped
with better generalization and reconstruction. This aligns
closely with our approach of phoneme-level prediction.
5. GRID Corpus Dataset
The GRID corpus consists of over 30,000 utterances from 34
speakers in a controlled setting. Each sentence follows a
fixed grammar structure, making it ideal for phoneme-level
and word-level visual speech recognition tasks.
Working
The core of our lip-reading system lies in the ability to accurately
map visual cues—specifically lip movements over time—to their
corresponding phoneme sequences. This section describes each
technical component of the system in detail, from raw data
preprocessing to model deployment.
1. Dataset: GRID Corpus
The GRID Corpus is a multi-speaker audio-visual sentence
corpus with a fixed sentence grammar. It is structured and ideal
for research in audio-visual speech recognition. The corpus
contains:
• 34 speakers
• 1,000 sentences per speaker
• ~34,000 total video samples
• Each video clip is 3 seconds long
• Annotated at word, phoneme, and frame levels
Each sentence follows the format:
command + color + preposition + letter + digit + adverb
E.g., “Place blue at J 2 now”
Why GRID?
• High-quality, controlled conditions
• Fixed vocabulary allows easier phoneme alignment
• Suitable size (~20GB) for limited storage environments
2. Preprocessing Pipeline
a. Frame Extraction
• Videos are split into individual frames using OpenCV.
• We maintain a fixed frame rate (e.g., 25 fps), giving ~75
frames per 3-second video.
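A minimal sketch of this step is shown below, assuming OpenCV is installed; the function name extract_frames is illustrative and not part of the project code.

import cv2

def extract_frames(video_path):
    """Read a video frame-by-frame and return the frames as a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:              # no more frames to read
            break
        frames.append(frame)
    cap.release()
    return frames               # ~75 frames for a 3-second GRID clip recorded at 25 fps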
b. Lip Region Localization
• We use Mediapipe Face Mesh or Dlib 68-point facial
landmarks to detect the lip region.
• The mouth region is cropped with a bounding box slightly
larger than the lips to capture coarticulation cues.
c. Frame Resizing and Normalization
• Cropped mouth frames are resized to (64x64) or (96x96)
pixels.
• Each frame is converted to grayscale (to reduce complexity)
and normalized (pixel values scaled to [0,1]).
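A minimal sketch of steps (b) and (c) using dlib's 68-point landmarks (mouth points 48–67); the margin value, output size, and function name are illustrative assumptions, and the standard shape_predictor_68_face_landmarks.dat model must be downloaded separately.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, size=64, margin=10):
    """Return a normalized grayscale lip crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # dlib's mouth landmarks are points 48-67; pad the box to keep coarticulation cues
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    x1, x2 = max(min(xs) - margin, 0), max(xs) + margin
    y1, y2 = max(min(ys) - margin, 0), max(ys) + margin
    crop = cv2.resize(gray[y1:y2, x1:x2], (size, size))
    return crop.astype(np.float32) / 255.0   # pixel values scaled to [0, 1]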
d. Temporal Alignment with Phonemes
• GRID provides word-level timestamps in its alignment files;
phoneme timings are derived from them via dictionary mapping or
forced alignment.
• We segment video frames into smaller sub-sequences
aligned with phonemes (e.g., frames 10–18 map to phoneme
/p/).
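A sketch of how such an alignment file could be parsed; the assumption that dividing the raw times by 1000 yields 25 fps frame indices should be verified against the downloaded copy of the corpus.

def load_alignment(align_path, time_unit=1000):
    """Parse a GRID alignment file ("start end word" per line) into frame-level segments."""
    segments = []
    with open(align_path) as f:
        for line in f:
            start, end, word = line.split()
            if word == "sil":       # skip the silence markers at the clip boundaries
                continue
            # assumption: raw times divided by 1000 correspond to 25 fps frame indices
            segments.append((int(start) // time_unit, int(end) // time_unit, word))
    return segments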
e. Dataset Creation
• For training, each data point becomes:
→ a sequence of preprocessed lip frames + corresponding
phoneme label
• Final structure: (lip_frame_sequence, phoneme_label) pairs,
saved as arrays for fast loading (see the sketch below)
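A minimal sketch of how these pairs could be packed and saved with NumPy; the file prefix and the assumption that all sequences are padded or truncated to the same number of frames are illustrative.

import numpy as np

def save_dataset(X, y, prefix="grid_phonemes"):
    """Stack (lip_frame_sequence, phoneme_label) pairs and save them as .npy files."""
    # X: list of lip-frame sequences, each assumed to share one shape, e.g. (num_frames, 64, 64)
    # y: list of integer phoneme labels
    frames = np.stack([np.asarray(seq, dtype=np.float32) for seq in X])
    labels = np.asarray(y, dtype=np.int64)
    np.save(f"{prefix}_frames.npy", frames)
    np.save(f"{prefix}_labels.npy", labels)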
3. Code Architecture and Python Files
To maintain scalability, readability, and modularity, our project is
divided into several Python files, each performing a specific
function. Below is a detailed explanation of each Python script
and its role in the system.
4. lipmovement_analyzer.py
Purpose:
This is the core script that processes raw video data and
extracts the region of interest (the lips) from each frame. It
handles preprocessing, frame extraction, and prepares input
for the lip-reading model.
Key Functionalities:
➤ Frame Extraction
• Uses OpenCV to read a video file frame-by-frame.
• Converts frames to grayscale or RGB depending on the
model input.
• Maintains a consistent frame rate (e.g., 25 fps) to match
phoneme timestamps later.
cap = cv2.VideoCapture(video_path)   # open the source video for frame-by-frame reading
➤ Lip Region Detection
• Uses Mediapipe Face Mesh or dlib’s 68-point landmark
detector.
• Detects landmarks on the face and isolates the region
corresponding to the lips (landmarks 48–67 in dlib).
results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))   # Mediapipe expects RGB input
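A slightly fuller sketch of the same idea with Mediapipe Face Mesh; the FACEMESH_LIPS connection set is used to pick out the lip landmark indices, and the function name is illustrative.

import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh
LIP_IDX = sorted({i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair})
face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def lip_landmarks(frame_bgr):
    """Return lip landmark pixel coordinates for one frame, or None if no face is found."""
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    return np.array([(lm[i].x * w, lm[i].y * h) for i in LIP_IDX])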
➤ Cropping and Normalization
• Extracts a bounding box around the lips and resizes it to
a fixed dimension (e.g., 64x64).
• Normalizes pixel values to a [0,1] range to help the
model converge faster.
➤ Output:
• A sequence of lip-cropped images (NumPy arrays or
tensors) saved for model input.
• Can optionally save a visualized version of lip
movement as a video/GIF for debugging.
5. phoneme_mapper.py
Purpose:
Maps each processed video (or segment) to a phoneme label,
allowing supervised learning. This is where label assignment
happens—crucial for phoneme-level recognition.
Key Functionalities:
➤ Load Alignment or Text Files
• For the GRID corpus, each video has an associated .txt
file containing the spoken sentence.
• This file is parsed to extract the phoneme-level
transcription (via dictionary mapping or forced
alignment results).
➤ Phoneme Tokenization
• Converts words into their corresponding phoneme
sequences using a lexicon like the CMU Pronouncing
Dictionary.
from nltk.corpus import cmudict      # requires a one-time nltk.download('cmudict')
pronunciations = cmudict.dict()
phones = pronunciations["bin"][0]              # ['B', 'IH1', 'N'] (digits mark stress)
phones = [p.rstrip("012") for p in phones]     # strip stress markers -> ['B', 'IH', 'N']
• Example: "bin" → ['B', 'IH', 'N']
➤ Sequence Alignment
• Maps lip frame sequences to phoneme sequences. Two
main strategies can be used:
o Equal segmentation: Assume phonemes are evenly
distributed across frames.
o Forced alignment (optional advanced step): Align
phoneme duration with time using a tool like
Montreal Forced Aligner.
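A minimal sketch of the equal-segmentation strategy above; it simply divides the frame sequence into as many chunks as there are phonemes.

import numpy as np

def segment_frames(frames, phonemes):
    """Split a frame sequence into equal chunks, one chunk per phoneme (equal segmentation)."""
    # np.array_split tolerates sequences that do not divide evenly
    chunks = np.array_split(np.asarray(frames), len(phonemes))
    return list(zip(chunks, phonemes))   # [(frame_chunk, phoneme), ...]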
➤ Label Encoding
• Each phoneme is encoded as a number for
classification.
• Dictionary like {‘B’: 0, ‘D’: 1, ‘P’: 2, ..., ‘N’: 38} is used to
convert strings to labels.
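A small sketch of building such a mapping from the training transcriptions; the phoneme_sequences variable below is a hypothetical stand-in for the parsed transcriptions.

# Hypothetical phoneme transcriptions for two clips (illustrative only)
phoneme_sequences = [['B', 'IH', 'N'], ['P', 'L', 'EY', 'S']]
all_phonemes = sorted({p for seq in phoneme_sequences for p in seq})
phoneme_to_idx = {p: i for i, p in enumerate(all_phonemes)}    # e.g. {'B': 0, 'EY': 1, ...}
labels = [phoneme_to_idx[p] for p in phoneme_sequences[0]]     # integer labels for one clip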
➤ Output:
• Dataset pairs: (lip_frame_sequence, phoneme_label) for
training the model.
• Optionally outputs .npy or .pt files for fast data loading.
6. Challenges Faced
• Visual Ambiguity: Some phonemes look identical on lips
(e.g., /p/, /b/, /m/)
• Speaker Dependency: Performance may degrade with
new/unseen speakers
• Dataset Imbalance: Rare phonemes have fewer examples
Future Scope
1. Full Sentence Reconstruction from Phonemes
o Integrate phoneme-to-word models using language
models like BERT or GPT to generate natural
language output.
2. Facial Expression Analysis
o Combine lip movements with facial emotion
recognition to infer tone, intention, and emotional
context.
3. Real-time Lip Reading
o Optimize model inference to support real-time
predictions, useful for assistive devices.
4. Multilingual Support
o Extend the system to support languages beyond
English by training on multilingual datasets.
5. Text-to-Speech Integration
o Convert predicted phonemes into speech using
phoneme-to-speech engines (e.g., Tacotron 2 +
WaveGlow), enabling complete silent-to-audio
transformation.
6. Dataset Expansion
o Incorporate varied datasets with real-world noise,
lighting variations, and accents to improve
generalization.
References
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017).
Lip Reading Sentences in the Wild.
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2017.
[https://arxiv.org/abs/1611.05358]
➤ This is the foundational work that introduced large-scale
sentence-level lip reading using deep learning. The paper uses
deep CNN + RNN + CTC architecture.
Assael, Y. M., Shillingford, B., Whiteson, S., & de Freitas, N.
(2016).
LipNet: End-to-End Sentence-level Lipreading.
arXiv preprint arXiv:1611.01599.
[https://arxiv.org/abs/1611.01599]
➤ This paper introduced LipNet, the first end-to-end lip-reading
model using spatiotemporal convolution and GRUs trained with
CTC loss on the GRID corpus.
Petridis, S., Stafylakis, T., Ma, P., Cai, J., & Pantic, M. (2018).
End-to-End Multi-View Lipreading.
BMVC 2018.
[https://arxiv.org/abs/1806.09053]
➤ This paper discusses improvements using multiple views
(angles) of the mouth and deep 3D CNNs for improved phoneme
and word recognition.
Afouras, T., Chung, J. S., & Zisserman, A. (2018).
Deep Audio-Visual Speech Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 2018.
[https://arxiv.org/abs/1809.02108]
➤ Explores combining audio and video inputs using attention
mechanisms. While our project is visual-only, this work offers useful
insights into attention-based sequence modeling.
Stafylakis, T., & Tzimiropoulos, G. (2017).
Combining Residual Networks with LSTMs for Lipreading.
Interspeech 2017.
[https://arxiv.org/abs/1703.04105]
➤ A clean example of combining ResNet for frame-level feature
extraction with temporal modeling using LSTMs.
Bear, H. L., Harvey, R., Theobald, B. J., & Lan, Y. (2014).
Which Phonemes Are Hard to Lip-Read?
Proc. of AVSP 2014.
➤ Investigates the visual distinguishability of phonemes, a key
challenge in visual-only speech recognition.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006).
An Audio-Visual Corpus for Speech Perception and Automatic
Speech Recognition.
Journal of the Acoustical Society of America, 2006.
➤ The paper that introduced the GRID Corpus used in our
project.
Conclusion
This project presents a robust framework for phoneme-level
lip reading using deep learning. By leveraging the GRID
corpus and avoiding large-scale datasets like LRS2, we’ve
built an efficient and accessible entry point into visual
speech recognition. Although still in its early stages, our
approach lays the groundwork for more advanced
applications in accessibility, human-computer interaction,
and silent communication systems. Future additions like
emotion detection, real-time processing, and speech
generation can significantly expand its usability across
diverse fields.