
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

REAL-TIME OBJECT RECOGNITION AND AUDIO NARRATION SYSTEM FOR ASSISTING VISUALLY IMPAIRED INDIVIDUALS

5th Review

Presented by:
Harshinee (21UAI034)
Megha Shinoj (21UAI055)
Pravina.A (21UAI064)
Swettha.E (21UAI096)

Guided by:
Ilakkia Ramanan
Assistant Professor
Department of Artificial Intelligence and Data Science
PROJECT OBJECTIVE

Objective:
To develop a system that assists visually impaired individuals in obtaining information about objects in their surroundings by detecting objects in captured images, converting the detected objects into speech output, and providing detailed scene descriptions, thereby enhancing their ability to understand and interact with their surroundings.

Role:
To create a model that uses a camera module to capture images, identifies objects using the YOLOv4 model, and converts the recognized objects into audio signals using Google Text-to-Speech (gTTS), thereby enabling visually impaired persons to independently understand their environment.
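As a minimal sketch of the speech-output step, assuming the gTTS package is installed (the label and output file name below are illustrative placeholders, not the project's actual values):

```python
# Minimal sketch: converting a detected object label into speech with gTTS.
# The label and output file name are illustrative placeholders.
from gtts import gTTS

label = "chair"  # e.g., a class name returned by the YOLOv4 detector
tts = gTTS(text=f"A {label} is in front of you", lang="en")
tts.save("narration.mp3")  # play back with any audio player on the device
```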
DOMAIN
Domain of the Project: Computer Vision

Domain Explanation: Computer Vision is a field of artificial intelligence (AI) that enables computers and systems to interpret and understand the visual world. By applying deep learning models to digital images from cameras and videos, machines can accurately identify and classify objects, and then react to what they "see."
DOMAIN EXPLANATION

Computer Vision
Computer Vision is a field of artificial intelligence that allows machines to interpret and understand the visual world. This is achieved by enabling computers to accurately identify and classify objects, analyze images and videos, and extract meaningful information from visual data.

Object Detection
Object detection is a crucial aspect of computer vision, where algorithms are trained to identify and locate specific objects within images or videos. This technology finds applications in diverse fields like autonomous vehicles, medical imaging, and surveillance systems.

Speech Synthesis
Speech synthesis is the artificial production of human speech. It plays a crucial role in converting text into spoken language, making technology more accessible to users across various domains. In this project, it is the final stage that delivers the detected objects and scene descriptions to the user as audio.
ARCHITECTURE
1. Camera Feed (Image Input): Captures the image data.

2. Image Normalization: Preprocesses the image for further analysis.

3. Object Detection (parallel processing path): Identifies objects in the image and converts them into text.

4. OCR Text Extraction (parallel processing path): Extracts text from the image using Optical Character Recognition (OCR) and processes it.

5. Feature Extraction: Uses CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) networks to extract key image features, aiding in description generation.

6. Post-Processing: Refines the extracted text and generated descriptions.

7. Speech Conversion & Audio Output: Converts the processed text into speech for audio output.
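A minimal sketch of how these stages might be wired together in Python, using OpenCV's DNN module, Tesseract OCR, and gTTS, is shown below. This is not the project's actual code: the model file names (yolov4.cfg, yolov4.weights, coco.names) and the 0.5 confidence threshold are assumptions, and the CNN + LSTM feature-extraction stage is omitted for brevity.

```python
# Sketch of the pipeline stages above. File names and the confidence
# threshold are assumptions about the configuration.
import cv2
import pytesseract
from gtts import gTTS

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
classes = open("coco.names").read().strip().split("\n")

def detect_objects(frame):
    """Stage 3: run YOLOv4 and return the names of detected classes."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    labels = set()
    for output in outputs:
        for det in output:
            scores = det[5:]
            class_id = scores.argmax()
            if scores[class_id] > 0.5:  # assumed confidence threshold
                labels.add(classes[class_id])
    return labels

def extract_text(frame):
    """Stage 4: OCR text extraction with Tesseract."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip()

def narrate(text):
    """Stage 7: speech conversion with gTTS."""
    gTTS(text=text, lang="en").save("narration.mp3")

cap = cv2.VideoCapture(0)  # Stage 1: camera feed
ok, frame = cap.read()
if ok:
    labels = detect_objects(frame)
    ocr_text = extract_text(frame)
    narrate("I can see " + ", ".join(sorted(labels)) + ". " + ocr_text)
cap.release()
```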
PERFORMANCE EVALUATION
1. Experimental Setup & Parameters

The system was tested using a laptop camera for real-time object detection and scene description.
Model Used: YOLOv4 for object detection, LSTM for scene description
Text-to-Speech (TTS): Google Text-to-Speech (gTTS) for audio conversion
Image Preprocessing: OpenCV for normalization, resizing, and grayscale conversion
Hardware: Standard laptop with CPU processing (no additional hardware)
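A sketch of the preprocessing step described above; the 416x416 size matches the standard YOLOv4 input and is an assumption about this project's configuration:

```python
# Sketch of the OpenCV preprocessing described above: resizing,
# grayscale conversion (useful for OCR), and pixel normalization.
import cv2

def preprocess(frame):
    resized = cv2.resize(frame, (416, 416))            # resize to model input size
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)   # grayscale copy
    normalized = resized.astype("float32") / 255.0     # scale pixels to [0, 1]
    return normalized, gray
```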
2. Results & Observations

Object Detection Accuracy: Achieved high accuracy in detecting objects from various environments.
Processing Speed: Real-time detection with an average latency of ~0.5 s per frame.
Scene Description Quality: Meaningful and coherent descriptions were generated using CNN + LSTM.
Audio Output: Clear and accurate narration of detected objects and scene details.
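A figure like the ~0.5 s average latency can be measured with a simple timing loop such as the sketch below, where detect_objects stands in for one full detection pass (it refers to the function sketched in the architecture section):

```python
# Sketch: measuring average per-frame latency and FPS over 30 frames.
# detect_objects is the detection routine sketched earlier.
import time
import cv2

cap = cv2.VideoCapture(0)
latencies = []
for _ in range(30):
    ok, frame = cap.read()
    if not ok:
        break
    start = time.perf_counter()
    detect_objects(frame)  # one full detection pass
    latencies.append(time.perf_counter() - start)
cap.release()

if latencies:
    avg = sum(latencies) / len(latencies)
    print(f"average latency: {avg:.2f} s/frame ({1 / avg:.1f} FPS)")
```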
3. Output Screens & Simulated Graphs

Detection Output: Bounding boxes and labels over detected objects.
Scene Description Output: Generated text summarizing the environment.
Audio Output: Successfully converted text into speech.
Performance Graphs: FPS rate, object detection accuracy, and latency variations over multiple test cases.
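The detection output described above (bounding boxes and labels over detected objects) can be rendered with standard OpenCV drawing calls; in this sketch the box coordinates, label, and confidence are placeholder arguments:

```python
# Sketch: overlaying a bounding box and class label on a frame.
# Coordinates, label, and confidence are placeholder arguments.
import cv2

def draw_detection(frame, x, y, w, h, label, confidence):
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, f"{label} {confidence:.2f}", (x, y - 6),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```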
ADVANTAGES

1. Real-Time Object Detection & Description
Provides instant feedback, ensuring quick identification of objects.
Enables seamless interaction with the environment.

2. High Accuracy & Efficiency
Uses YOLOv4 for robust and precise object detection.
CNN + LSTM ensures meaningful scene descriptions.

3. Independence for Visually Impaired Users
Reduces reliance on physical assistance.
Helps users navigate and understand surroundings effortlessly.

4. User-Friendly & Automated
Simple interface with hands-free operation.
Automatic detection and narration without manual input.

5. Scalability & Future Enhancements
Can be expanded with multi-language support, navigation assistance, and voice commands.
Adaptable for indoor and outdoor environments.

6. Cost-Effective & Open-Source
Uses open-source tools (YOLO, OpenCV, gTTS), reducing overall cost.
Works on standard hardware without additional devices.
Conclusion
The proposed system successfully enables real-time object detection
and scene description, providing instant auditory feedback for
visually impaired users. By leveraging YOLOv4 for object detection
and LSTM for scene description, the system ensures accurate and
meaningful narration of surroundings. This enhances independence
and mobility, reducing the need for external assistance.
Performance Comparison
The proposed system demonstrates significant improvements over existing
solutions:
Accuracy Improvement: Proposed system achieves higher object detection
accuracy (~90%) compared to traditional methods (~75%).
Processing Speed: Real-time processing with ~0.5s latency, significantly faster
than older models (~1.5s).
User Experience: Automated and hands-free, offering a more seamless
interaction than existing assistive technologies.
Justification & Final Thoughts
The system outperforms existing assistive solutions in terms of efficiency, accuracy,
and user experience. Its ability to provide instant feedback and meaningful scene
descriptions makes it a powerful tool for visually impaired individuals. Future
enhancements, such as multi-language support, cloud-based improvements, and
optimized deep learning models, can further expand its impact.
Future Work
Enhancing System Capabilities
To keep up with recent advancements and improve system performance,
future developments may include:
Upgrading AI Models: Exploring YOLOv8 and transformer-based models for
enhanced object detection accuracy.
Optimized Processing: Reducing latency through more efficient model
compression and parallel processing techniques.
Improved Scene Understanding: Enhancing context recognition by
integrating advanced NLP techniques for more descriptive outputs.
Aligning with Recent Trends
Multi-Modal AI: Combining vision, text, and speech models to
improve interaction quality.
Cloud Integration: Implementing real-time cloud synchronization
to enhance processing efficiency.
Expanded Language Support: Developing multilingual capabilities
to cater to diverse users.
Adaptive Learning: Training models on diverse datasets to
improve robustness in different environments.
THANK YOU!
