DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
REAL-TIME OBJECT RECOGNITION AND AUDIO NARRATION SYSTEM FOR ASSISTING VISUALLY IMPAIRED INDIVIDUALS
5th Review

Guided by:
Ilakkia Ramanan, Assistant Professor, Department of Artificial Intelligence and Data Science

Presented by:
Harshinee        (21UAI034)
Megha Shinoj     (21UAI055)
Pravina A        (21UAI064)
Swettha E        (21UAI096)
PROJECT OBJECTIVE
Objective:
To develop a system that assists visually impaired individuals in obtaining information about objects in their surroundings by detecting objects in captured images, converting the detected objects into speech output, and providing detailed scene descriptions, enhancing their ability to understand and interact with their surroundings.

Role:
To create a model that uses a camera module to capture images, identifies objects using the YOLOv4 model, and converts the recognized objects into audio using Google Text-to-Speech (gTTS), thereby enabling visually impaired persons to understand their environment independently.
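A minimal sketch of this capture, detect, and narrate loop, assuming a default webcam at index 0 and a placeholder detector (the YOLOv4 path itself is sketched under ARCHITECTURE):

# Minimal sketch of the capture -> detect -> narrate loop.
# Assumptions: a webcam at index 0; detect_objects() is a placeholder
# standing in for the YOLOv4 detector sketched later in this deck.
import cv2
from gtts import gTTS

def detect_objects(frame):
    # Placeholder: the real system runs YOLOv4 here and returns class labels.
    return ["person", "chair"]

def narrate(labels, out_path="narration.mp3"):
    # gTTS converts the label text to spoken audio saved as an MP3 file.
    text = ("I can see " + ", ".join(labels)) if labels else "Nothing detected"
    gTTS(text=text, lang="en").save(out_path)

cap = cv2.VideoCapture(0)   # open the default camera
ok, frame = cap.read()      # grab a single frame
cap.release()
if ok:
    narrate(detect_objects(frame))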
DOMAIN
Domain of the Project: Computer Vision
Domain Explanation: Computer Vision is a field of artificial intelligence (AI) that enables computers and systems to interpret and understand the visual world. Using digital images from cameras and videos together with deep learning models, machines can accurately identify and classify objects, and then react to what they "see."
DOMAIN EXPLANATION
Computer Vision: Computer Vision is a field of artificial intelligence that allows machines to interpret and understand the visual world. This is achieved by enabling computers to accurately identify and classify objects, analyze images and videos, and extract meaningful information from visual data.

Object Detection: Object detection is a crucial aspect of computer vision, where algorithms are trained to identify and locate specific objects within images or videos. This technology finds applications in diverse fields such as autonomous vehicles, medical imaging, and surveillance systems.

Speech Synthesis: Speech synthesis is the artificial production of human speech. It plays a crucial role in converting text into spoken language, making technology more accessible to users across various domains.
ARCHITECTURE
1. Camera Feed (Image Input): Captures the image data.
2. Image Normalization: Preprocesses the image for further analysis.
3. Object Detection (parallel path): Identifies objects in the image and converts them into text.
4. OCR Text Extraction (parallel path): Extracts text from the image using Optical Character Recognition (OCR) and processes it.
5. Feature Extraction: Uses CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) networks to extract key image features, aiding description generation.
6. Post-Processing: Refines the extracted text and generated descriptions.
7. Speech Conversion & Audio Output: Converts the processed text into speech for audio output.
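A minimal sketch of the object detection path (step 3) using OpenCV's DNN module. It assumes the standard Darknet release files yolov4.cfg, yolov4.weights, and coco.names are available locally; file names and thresholds are illustrative, not fixed by the project.

# Sketch of the YOLOv4 detection path via OpenCV's DNN module.
# Assumptions: yolov4.cfg, yolov4.weights, and coco.names placed
# alongside this script (from the official Darknet release).
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
layer_names = net.getUnconnectedOutLayersNames()
classes = open("coco.names").read().strip().split("\n")

def detect(frame, conf_thresh=0.5, nms_thresh=0.4):
    h, w = frame.shape[:2]
    # Normalize to [0, 1], resize to the network input size, swap BGR -> RGB.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences, labels = [], [], []
    for output in net.forward(layer_names):
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf > conf_thresh:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh)])
                confidences.append(conf)
                labels.append(classes[class_id])
    # Non-maximum suppression removes overlapping duplicate boxes.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [(labels[i], boxes[i]) for i in np.array(keep).flatten()]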
PERFORMANCE EVALUATION
1. Experimental Setup & Parameters
The system was tested using a laptop camera for real-time object
detection and scene description.
   Model Used: YOLOv4 for object detection, LSTM for scene
   description
   Text-to-Speech (TTS): Google Text-to-Speech (gTTS) for audio
   conversion
   Image Preprocessing: OpenCV for normalization, resizing, and
   grayscale conversion (sketched after this list)
   Hardware: Standard laptop with CPU processing (no additional
   hardware)
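A minimal sketch of the preprocessing stage described above, assuming a BGR input frame as returned by OpenCV's camera capture:

# Sketch of the OpenCV preprocessing used before detection/description.
# Assumption: `frame` is a BGR image from cv2.VideoCapture.read().
import cv2
import numpy as np

def preprocess(frame, size=(416, 416)):
    resized = cv2.resize(frame, size)                  # fixed network input size
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)   # grayscale copy for OCR
    normalized = resized.astype(np.float32) / 255.0    # scale pixels to [0, 1]
    return normalized, gray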
2. Results & Observations
   Object Detection Accuracy: High detection accuracy (~90%; see
   Performance Comparison) across varied test environments.
   Processing Speed: Real-time detection with an average latency of
   ~0.5 s per frame.
   Scene Description Quality: Meaningful, coherent descriptions were
   generated using CNN + LSTM (architecture sketched after this list).
   Audio Output: Clear and accurate narration of detected objects and
   scene details.
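An architecture-only sketch of a CNN + LSTM description generator of the kind referenced above. It assumes Keras/TensorFlow, 2048-dimensional image features from a pretrained CNN encoder, and hypothetical vocabulary and caption-length sizes; it shows the model shape, not the project's trained weights.

# Architecture-only sketch of a CNN + LSTM caption generator.
# Assumptions: image features are 2048-d vectors from a pretrained CNN;
# vocab_size and max_len are hypothetical placeholder values.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     Dropout, add)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 30, 2048   # hypothetical sizes

img_in = Input(shape=(feat_dim,))                # CNN feature vector
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

txt_in = Input(shape=(max_len,))                 # partial caption (word ids)
txt_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(txt_in))

merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)  # next-word distribution

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()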
3. Output Screens & Simulated Graphs
  Detection Output: Bounding boxes and labels over
  detected objects.
  Scene Description Output: Generated text
  summarizing the environment.
  Audio Output: Successfully converted text into
  speech.
  Performance Graphs: FPS rate, object detection
  accuracy, and latency variations over multiple test
  cases.
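A minimal sketch of how per-frame latency and FPS figures like those above can be collected, assuming the detect() function sketched under ARCHITECTURE:

# Sketch of per-frame latency / FPS measurement over multiple test frames.
# Assumption: detect(frame) is the detection function sketched earlier.
import time
import cv2

cap = cv2.VideoCapture(0)
latencies = []
for _ in range(100):                 # sample up to 100 frames
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    detect(frame)                    # timed stage: detection only
    latencies.append(time.perf_counter() - t0)
cap.release()

if latencies:
    avg = sum(latencies) / len(latencies)
    print(f"avg latency: {avg:.3f} s  ({1 / avg:.1f} FPS)")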
ADVANTAGES
1. Real-Time Object Detection & Description
   Provides instant feedback, ensuring quick identification of objects.
   Enables seamless interaction with the environment.
2. High Accuracy & Efficiency
   Uses YOLOv4 for robust and precise object detection.
   CNN + LSTM ensures meaningful scene descriptions.
3. Independence for Visually Impaired Users
   Reduces reliance on physical assistance.
   Helps users navigate and understand surroundings effortlessly.
4. User-Friendly & Automated
   Simple interface with hands-free operation.
   Automatic detection and narration without manual input.
5. Scalability & Future Enhancements
   Can be expanded with multi-language support, navigation assistance,
   and voice commands.
   Adaptable for indoor and outdoor environments.
6. Cost-Effective & Open-Source
   Uses open-source tools (YOLO, OpenCV, gTTS), reducing overall cost.
   Works on standard hardware without additional devices.
Conclusion
The proposed system successfully enables real-time object detection
and scene description, providing instant auditory feedback for
visually impaired users. By leveraging YOLOv4 for object detection
and LSTM for scene description, the system ensures accurate and
meaningful narration of surroundings. This enhances independence
and mobility, reducing the need for external assistance.
Performance Comparison
The proposed system demonstrates significant improvements over existing
solutions:
   Accuracy Improvement: Proposed system achieves higher object detection
   accuracy (~90%) compared to traditional methods (~75%).
   Processing Speed: Real-time processing with ~0.5s latency, significantly faster
   than older models (~1.5s).
   User Experience: Automated and hands-free, offering a more seamless
   interaction than existing assistive technologies.
Justification & Final Thoughts
The system outperforms existing assistive solutions in terms of efficiency, accuracy,
and user experience. Its ability to provide instant feedback and meaningful scene
descriptions makes it a powerful tool for visually impaired individuals. Future
enhancements, such as multi-language support, cloud-based improvements, and
optimized deep learning models, can further expand its impact.
Future Work
Enhancing System Capabilities
To keep up with recent advancements and improve system performance,
future developments may include:
   Upgrading AI Models: Exploring YOLOv8 and transformer-based models for
   enhanced object detection accuracy.
   Optimized Processing: Reducing latency through more efficient model
   compression and parallel processing techniques.
   Improved Scene Understanding: Enhancing context recognition by
   integrating advanced NLP techniques for more descriptive outputs.
                Aligning with Recent Trends
Multi-Modal AI: Combining vision, text, and speech models to
improve interaction quality.
Cloud Integration: Implementing real-time cloud synchronization
to enhance processing efficiency.
Expanded Language Support: Developing multilingual capabilities
to cater to diverse users.
Adaptive Learning: Training models on diverse datasets to
improve robustness in different environments.
THANK YOU!