Smart Drishti For Blind Report
CHAPTER – 1
PREAMBLE
1.1 INTRODUCTION
revolution has almost grasped our industry. The foremost technology that is turning this
Learning or Deep Learning. There are many areas in the Machine Learning domain
the humongous availability of data around the world, it has become comparatively
In this project, our prime concept is to simplify the lives of people affected
visually by translating the perceptible surroundings around them into audio comments
such that they can understand what is going on around them. Visual impairment is a
serious cause that segregates people from the basic way of survival in times like these,
making them go devoid of various opportunities and general chores that people don’t
even pay proper heed to. However, with the advent of technology and the
optimal process. Thus, for our proposed concept and to reach our required objective,
we shall be using the aforementioned domain in ways that shall be explained hereunder.
For the same purpose, we use various Machine Learning libraries, of which
OpenCV is a prime one. Open Source Computer Vision Library, generally termed as
OpenCV is an open-source library that deals with image manipulation and various
instantaneous operations. In this project, we will be seeing the use of OpenCV for
detecting objects. Other than that, the Google Speech API and a suitable training algorithm
In this project, our goal is to build a prototype of a model that easily sees the
objects around us and translates them in a way that visually impaired people can
recognition and computer vision to conduct this project. The dataset used to train the
model is MS COCO, which stands for Common Objects in Context, as the image dataset
was created with the aim of advancing image recognition. In
addition, COCO also provides 80+ object categories called COCO Classes, 91 object
categories called COCO Things, and provides 5 captions per image. Also, for the
purpose of pose estimation, 250,000 people are annotated with 17 different pre-trained key points.
The project can be segmented into various steps for a better understanding of
the work. The partial objective of the project is to detect objects around us and to
identify them using the YOLOv3 algorithm that supports Darknet architecture, thus
giving a broad view of the object detection and recognition process as a
whole. Furthermore, the latter half of the project aims at translating the recognized
objects into speech through the Google Text-to-Speech (gTTS) Python library.
Thereby, the initial designs of the project prototype are made, thus addressing
This project is developed and structured using various Python libraries for
computer vision and natural language processing. The whole project is made in Jupyter
Notebook and requires OpenCV 3.4+. Since OpenCV 4 is still in beta right now and the official release has
not been initiated yet, it is safer to use version 3 of the OpenCV library. Apart from
that, we need to install the YOLOv3 training dataset and the MS COCO dataset to start with.
image: The path to the input image. We shall be detecting objects in this image
using YOLO.
yolo: This is the base path to the YOLO directory. The scripts will load the
required YOLO weight and configuration files from this directory.
confidence: The minimum probability used to filter out weak detections. The
default value is given as 0.5 and the value is open to experimentation.
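A minimal sketch of how these command-line arguments could be parsed is shown below; the exact flag names and defaults here are assumptions for illustration rather than the project's final script.

import argparse

# Hypothetical argument parsing for the YOLO detection script.
parser = argparse.ArgumentParser(description="YOLOv3 object detection with audio feedback")
parser.add_argument("--image", required=True,
                    help="path to the input image in which objects will be detected")
parser.add_argument("--yolo", required=True,
                    help="base path to the YOLO directory with weights, config and class names")
parser.add_argument("--confidence", type=float, default=0.5,
                    help="minimum probability used to filter weak detections")
args = vars(parser.parse_args())

print(args["image"], args["yolo"], args["confidence"])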
which shall be explained further, and the detected output is translated to audio for
audio feedback. Implementing the same on a video-based input meets the objective
The basic concept behind our research is to propose a system that will assist
the typed, handwritten and printed text without using the old-fashioned, tough and difficult
advancements have been made for the same purpose, but none of them was completely
able to overcome the technical challenges and hurdles. So we aim to overcome all
CHAPTER – 2
LITERATURE SURVEY
IEEE published papers were made. We shall briefly be discussing the review
2.1.1 Real Time Object Detection with Audio Feedback using Yolo vs. Yolo_v3:
The first paper is titled “Real Time Object Detection with Audio Feedback
using Yolo vs. Yolo_v3” and was published in the year 2021. This paper uses
algorithms and techniques like the OpenCV library, Yolo, Yolo v3. The performance
recorded in this paper indicates that it works better for smaller objects, with future
work mentioned as the expansion of the research on a self-explored dataset [1].
The next paper is titled “Reader and Object Detector for Blind” which was
published in the year 2020. This paper uses algorithms and techniques such as
Raspberry Pi, OCR, Tesseract, and TensorFlow for carrying out the project. Text
reading and object detection were successful, but not for font sizes smaller than 16,
as recorded in the performance of the paper. As the objective for future works,
The paper that was studied next is titled “Obstacle Detection for Visually
Impaired Patients” which was published in the year 2014. The techniques and
algorithms used in this paper are a stereoscopic sonar system, sound buzzers, and the voice IC
APR9600. A wearable optical detection system is provided that gives a full-body
vibration effect on obstacle detection. However, the device has a very limited range
when compared to its own size, and users also find it difficult to comprehend the
The paper that was studied next is titled “Voice Based Smart Assistive Device
for Visually Challenged” and was published in the year 2020. The Raspberry Pi, Deep
Learning algorithms and related methodologies were described in the article. After being trained on
only 50 photos of each object, the model has an accuracy of 83 percent and detects
campus objects that are commonly available. However, because it was trained on
8000 photos from the Flickr 8k dataset, the accuracy drops as the image complexity
grows [4].
2.1.5 A Wearable Assistive Technology for the Visually Impaired with Door Knob Detection and Real-Time Feedback for Hand-to-Handle Manipulation:
The next paper is titled “A Wearable Assistive Technology for the Visually
Impaired with Door Knob Detection and Real-Time Feedback for Hand-to-Handle
Manipulation” and was published in the year 2017. Algorithms and techniques such
as YOLOv2, Deep Learning, and Neural Networks were used. The performance of the
device increases manifold if the hand detection is stable. The biggest difficulty,
however, is the consistency of the hand detection performance. More images will be
added to the database in the future, and the door knob identification feature will be improved.
2.1.6 VISION - Wearable Speech Based Feedback System for the Visually Impaired:
The paper is titled “VISION - Wearable Speech Based Feedback System for the Visually
Impaired” and describes a device based on Raspberry Pi, gTTS and YOLO. The text is read out in English
at a slow speed and saved as an MP3 file, and future work is mentioned as making the system more
effective [6].
for location navigation that works in low-light conditions while remaining cost-
effective. The model would be more precise if the depth, width and precision were
improved [7].
The next 2020-published paper, having the title “CPU based YOLO: A Real Time
CNN, SSD, Mask R-CNN, R-FCN, OpenCV and RetinaNet. The Model discovers
CPU Based YOLO, with the aforementioned future work being an increase in FPS and mAP
people in a huge crowd, and the accuracy states that it works perfectly fine for a huge
crowd and for a single person as well. Future work stated in the paper is to add
The last paper that we have reviewed is “Edge detection based boundary box
Prediction, and Object Detection are all used in this research. The intersection over
union for the proposed algorithm and YOLO v3 is determined, and the proposed
approach outperforms YOLO v3 in terms of boundary box accuracy. When there are
sharp objects in the image or there is too much noise, the model becomes
constrained [10].
accessing printed and digital text independently. With conventional reading methods
often proving insufficient, there is a compelling need to harness the power of artificial
intelligence to empower the blind. This project aims to address these challenges by
rely on others for reading assistance, limiting their access to information and hindering
assistance. The proposed AI smart reader seeks to bridge this accessibility gap, offering
The overarching problem is the limited access to information for the visually
materials. This dependency restricts independence and hinders the ability to engage
with a diverse range of written content. The lack of an efficient and real-time solution
context. These challenges underscore the critical need for an AI-driven solution that
accessing and comprehending printed or digital text, relying on external assistance for
reading various materials. This dependence limits their autonomy and access to
information, creating a need for an AI-driven smart reader. The problem is to develop
a solution that employs Python to convert text from diverse sources into audio in real-time.
The users are the visually impaired people who shall be addressed by the
proponent developing this model. Common users will get to access the model for
their own benefit under the supervision of the medical facility within their own vicinity.
The following qualities are expected by the proponents:
Accuracy
Reliability
Usability
Efficiency
User Friendliness
2.3 OBJECTIVES
visual impairment and to help people suffering from visual impairment know what
is going on around them. Adding on to that, the project also aims at running seamlessly without any
system barrier and with high usability. The accuracy of the built model is one to look out
for, since it shall be dealing with detecting objects and recognizing them in real time.
Scope
The built model shall be carrying out the idea and concept of healthcare
user.
The model shall also be providing the best suited and most accurate ML
this project.
The model is most accurate when it comes to the given objective, thereby
The model is easy to use and understandable by regular people, giving them
Limitation
This is a prototype of the idea that we have proposed, and thus we are yet to
use the hardware that is required for the proper execution of this project.
Although it is highly reliable and gives an accurate algorithm to follow, the
accuracy of the algorithms used might differ depending upon the processing power of the system.
2.4 MOTIVATION
improving the quality of life for visually impaired individuals. This initiative seeks to
providing a solution that not only enhances accessibility but also fosters independence,
designed for those with sight. The inability to independently access and comprehend
smart reader that transcends traditional solutions, offering a holistic and adaptive
Empowering Independence:
The core motivation revolves around empowering blind users to navigate the
artificial intelligence into the smart reader, the goal is to elevate the reading experience
diverse text formats, provide natural and intelligible speech synthesis, and incorporate
Fostering Inclusivity:
that limit the participation of visually impaired individuals in various aspects of life.
The AI smart reader aspires to be a catalyst for positive change, ensuring that blind
users can seamlessly access a multitude of written materials, regardless of the source
landscape.
Enabling Personalization:
Understanding that each user has unique preferences and needs, the motivation
CHAPTER – 3
improvement and innovation in the Smart Drishti project for assisting the blind. These
insights have been critical in guiding the modifications and enhancements we plan to
challenges in the realm of assistive devices for the visually impaired, specifically
several key conclusions can be drawn to guide the development of the "Smart Drishti"
project.
INTRODUCTION TO ANACONDA
Anaconda is a Python distribution that ships with the Conda package manager, which allows users to install, update, and manage packages.
It's widely used in data analysis and research communities due to its ease of use.
3.1.1 DATASET
datasets for computer vision, mostly using state-of-the-art neural networks. This name
is also used to refer to the format in which the datasets are stored. It is an object
detection dataset in which more than 200k images are labelled, which makes it even easier to
recognize the class (category) of a detected object. It has around 1.5 million object
instances and 80 object categories. COCO annotations employ the JSON file format,
which has a top-level value of a dictionary (key-value pairs inside braces). It can also have
nested lists and dictionaries, as shown in the outline below:
"info": {…},
"licenses": […],
"images": […],
"annotations": […],
"categories": […],
Info Section: It contains metadata about the dataset like description, url,
version etc.
Licenses Section: It contains the links to the licenses for the images present in
the dataset. Every license contains the id field, which is used to recognize the
license.
Images Section: It is the second most important dictionary of the dataset. It has the details of all the
images.
Annotations Section: This is the most important section of the dataset, which
contains the bounding box and category id of every labelled object instance.
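To make this structure concrete, the following is a minimal sketch that loads a local COCO-format annotation file with Python's json module and prints the sections described above; the file path is a placeholder.

import json

# Placeholder path to a COCO-format annotation file.
with open("annotations/instances_val2017.json", "r") as f:
    coco = json.load(f)

# Top-level dictionary keys: info, licenses, images, annotations, categories.
print(coco["info"]["description"], coco["info"]["version"])
print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))

# Each category entry maps a numeric id to a human-readable class name.
id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}

# Each annotation carries the image id, category id and bounding box [x, y, width, height].
first = coco["annotations"][0]
print(id_to_name[first["category_id"]], first["bbox"])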
3.1.2 PACKAGES
The following packages have been used for building the model:
variety of databases.
ii. OpenCV: Open Source Computer Vision. It is one of the most widely used
tools for computer vision and image processing tasks. It is used in various
applications that are very important in today's systems. By using it, one can process images and videos.
iii. gTTS: gTTS (Google Text-to-Speech) is a very easy-to-use tool which converts the text entered into
audio that can be saved as an MP3 file. The gTTS API supports several
languages including English, Hindi, Tamil, French, German and many more.
The speech can be delivered in either of the two available audio speeds,
fast or slow. However, as of the latest update, it is not possible to change the
voice of the generated audio.
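A minimal usage sketch of gTTS is shown below; the sentence and output file name are placeholders.

from gtts import gTTS

# Text that would normally come from the object detection / OCR stage.
text = "A chair is detected two metres ahead."

# lang selects the language; slow=False picks the faster of the two speeds.
tts = gTTS(text=text, lang="en", slow=False)
tts.save("feedback.mp3")   # the saved audio can then be played back to the user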
3.2 ALGORITHM
algorithms. However, the high-accuracy algorithms that we have further used to
YOLO v3:
The term 'You Only Look Once' is abbreviated as YOLO. This is an algorithm
for detecting and recognising different items in a photograph (in real-time). Object
detection in YOLO is done as a regression problem, and the identified photos' class
probabilities are provided. Convolutional neural networks (CNN) are used in the
YOLO method to recognise objects in real time. To detect objects, the approach just
takes a single forward propagation through a neural network, as the name suggests.
This indicates that a single algorithm run is used to forecast the entire image. The
CNN is used to forecast multiple bounding boxes and class probabilities at the same
time. YOLOv3 achieves a mAP of 57.9% while analyzing images at a rate of 30 frames per second. The main
features of YOLOv3 lie in it being very fast and accurate, which can easily be traded
off by simply customizing the size of the model, thereby requiring no retraining
whatsoever.
YOLO is regression-based. Initially it takes the video input and segments the
video into 24 frames. Each frame is then divided into cells. Image classification and
localization are applied to each grid cell. YOLO then predicts the bounding boxes and their corresponding class probabilities.
To break it down into simpler terms, let us say the labelled data is divided
into 3x3 grids and there are a total of 3 classes into which we want the objects to be classified. So,
for every grid cell the label becomes a vector of the form [pc, bx, by, bh, bw, c1, c2, c3].
Here, pc is the probability of whether an object is present in the grid or not, bx, by,
bh and bw describe the bounding box, and c1, c2 and c3 are the class probabilities.
The bounding box values bx, by, bh and bw are calculated relative to the grid cell being
dealt with. bx and by are the x and y coordinates of the midpoint of the object with
respect to this grid. bh is the ratio of the height of the bounding box to the height of the
corresponding grid cell, and bw is the ratio of the width of the bounding box to the width
of the grid cell. bx and by will always range between 0 and 1, as the midpoint will
always lie within the grid, whereas bh and bw can be more than 1 in case the
dimensions of the bounding box exceed the dimensions of the grid cell.
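The following sketch illustrates this encoding by converting one grid cell's (bx, by, bh, bw) values back into pixel coordinates; the grid size, image size and numeric values are made up purely for illustration.

def decode_box(bx, by, bh, bw, row, col, grid_size, img_w, img_h):
    # Convert grid-relative YOLO box values into pixel-space corner coordinates.
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    # bx, by are offsets of the object's midpoint inside cell (row, col).
    center_x = (col + bx) * cell_w
    center_y = (row + by) * cell_h
    # bh, bw are ratios of the box size to the cell size (they may exceed 1).
    box_w = bw * cell_w
    box_h = bh * cell_h
    x1, y1 = center_x - box_w / 2, center_y - box_h / 2
    x2, y2 = center_x + box_w / 2, center_y + box_h / 2
    return x1, y1, x2, y2

# Example: midpoint at the centre of cell (1, 1) of a 3x3 grid on a 416x416 image,
# with a box twice as tall and 1.5 times as wide as the cell.
print(decode_box(0.5, 0.5, 2.0, 1.5, row=1, col=1, grid_size=3, img_w=416, img_h=416))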
YOLO v3 is more robust but a little slower than its previous versions. This model features multi-
scale detection, a stronger feature extraction network, and a few changes in the loss
function. For understanding the network architecture on a high- level, let’s divide the
entire architecture into two major components: Feature Extractor and Feature Detector
(Multi-scale Detector). The image is first given to the Feature extractor which extracts
feature embeddings and then is passed on to the feature detector part of the network
that spits out the processed image with bounding boxes around the detected classes.
Darknet-19 was used as a feature extractor in prior YOLO versions, with 19 layers as the name suggests.
Darknet-19 grew to a total of 30 layers in YOLO v2, which adds 11 extra
layers. However, because of the down-sampling of the input image and the loss of
fine-grained characteristics, the system had difficulty detecting small objects. The
YOLO v3 backbone, Darknet-53, therefore borrows residual (skip) connections from the ResNet
architecture.
The network is formed with consecutive 3x3 and 1x1 convolution layers
together with residual connections. The darknet's 53 layers are stacked with another 53 layers for the detection head,
giving YOLO v3 a total of 106 layers. As a result, it has a huge architecture, which makes it a little slower than YOLO v2,
but also more accurate.
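Since the project loads YOLOv3 through OpenCV, a minimal sketch of a single forward pass with OpenCV's dnn module is given below; the weight, configuration and class-name file paths are assumptions and would normally point inside the yolo directory mentioned earlier.

import cv2
import numpy as np

# Assumed file names inside the YOLO directory.
net = cv2.dnn.readNetFromDarknet("yolo/yolov3.cfg", "yolo/yolov3.weights")
with open("yolo/coco.names") as f:
    classes = [line.strip() for line in f]

image = cv2.imread("example.jpg")

# YOLOv3 expects a 416x416, scaled, RGB blob.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
layer_outputs = net.forward(net.getUnconnectedOutLayersNames())

# Keep the highest-scoring class of every detection above a 0.5 confidence threshold.
for output in layer_outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            print(classes[class_id], float(scores[class_id]))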
Fig. 3 part B shows the architecture of the proposed model and describes the process in the
following way. When the input image containing text, or the text file, is fed into the
system, it passes through different phases before coming out as voice output.
In the text analysis phase, the text is arranged into a manageable list of words.
Identification of any pauses or punctuation marks is the key aim of this process.
[Fig. 4: Block diagram of the optical character recognition process and its stepwise detail. Fig. 5: Process of the Text-to-Speech Smart Reader for the Visually Impaired.]
Each word is then mapped to its pronunciation; this step is known
as grapheme-to-phoneme conversion.
The amalgamation of stress pattern, the rise and fall in the speech, and rhythm gives naturalness to the
synthesized speech.
Acoustic processing refers to the process in which the type of speech synthesis is applied; synthesis based
on models of the human vocal tract falls in the domain of acoustic processing.
After all processing through these phases, the intended voice output is taken
out.
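As a rough sketch of how these phases could be chained in Python for the smart reader, the snippet below uses the pytesseract wrapper around Tesseract for the OCR phase and gTTS for the synthesis phase; both library choices are assumptions for illustration, since the report only names the phases.

import pytesseract
from PIL import Image
from gtts import gTTS

# Phase 1: optical character recognition on the captured image (placeholder file name).
text = pytesseract.image_to_string(Image.open("page.jpg"))

# Phase 2: text analysis - arrange the recognized text into a manageable list of words.
words = text.split()

# Phase 3: speech synthesis - gTTS handles pronunciation and prosody internally.
if words:
    gTTS(text=" ".join(words), lang="en", slow=False).save("page_audio.mp3")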
CHAPTER – 4
IMPLEMENTATION
hardware, integrating the software components, and ensuring real-time processing for
Software Setup
Object Detection Model: Download and set up a pre-trained object detection model
(e.g., SSD MobileNet, YOLO). For this example, we'll assume you have a
TensorFlow SavedModel downloaded locally.
Writing the Code: Create a Python script to integrate all components. Complete
Python script:
import cv2
import tensorflow as tf
import pyttsx3
import RPi.GPIO as GPIO
import time

# Hardware setup: the ultrasonic sensor pins below are placeholders.
TRIG = 23
ECHO = 24
GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

# Software setup: path to a pre-trained TensorFlow SavedModel (placeholder).
model = tf.saved_model.load("ssd_mobilenet_v2/saved_model")
camera = cv2.VideoCapture(0)   # Pi camera or USB camera
engine = pyttsx3.init()        # offline text-to-speech engine

def measure_distance():
    # Return the distance (in cm) measured by the ultrasonic sensor.
    GPIO.output(TRIG, False)
    time.sleep(0.05)           # let the sensor settle
    GPIO.output(TRIG, True)
    time.sleep(0.00001)        # 10 microsecond trigger pulse
    GPIO.output(TRIG, False)
    pulse_start = pulse_end = time.time()
    while GPIO.input(ECHO) == 0:
        pulse_start = time.time()
    while GPIO.input(ECHO) == 1:
        pulse_end = time.time()
    # Sound travels ~34300 cm/s; the echo covers the distance twice.
    return (pulse_end - pulse_start) * 17150

def detect_objects(frame):
    # Process a single BGR frame with the object detection model.
    input_tensor = tf.convert_to_tensor(frame, dtype=tf.uint8)
    input_tensor = input_tensor[tf.newaxis, ...]   # add the batch dimension
    detections = model(input_tensor)
    # Extract object detection results
    return detections

while True:
    ret, frame = camera.read()
    if not ret:
        break
    objects = detect_objects(frame)
    distance = measure_distance()
    # Announce the obstacle distance (class-name lookup omitted for brevity).
    engine.say("Obstacle ahead at %d centimetres" % int(distance))
    engine.runAndWait()

camera.release()
GPIO.cleanup()
This results in a functional prototype of the "Smart Drishti for the Blind" project using Python,
providing visually impaired individuals with a valuable tool for safer and more
independent navigation.
CHAPTER – 5
The system successfully detects and identifies objects in real-time using the
camera feed. The TensorFlow object detection model provides accurate bounding
boxes and labels for common obstacles such as furniture, stairs, and people. Detection
speed and accuracy are satisfactory for real-time navigation, with minimal latency.
Distance Measurement:
Audio Feedback:
The text-to-speech engine announces the detected objects and their distances to the user. Feedback is timely, clear, and
customizable, allowing users to adjust the volume and speech rate according to their
preferences.
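Assuming the audio feedback is produced with pyttsx3, as in the implementation script, the volume and speech-rate adjustment mentioned above can be done through its property interface; the numeric values below are examples only.

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)     # words per minute; the default is around 200
engine.setProperty("volume", 0.8)   # 0.0 (mute) to 1.0 (full volume)
engine.say("Chair detected, one metre ahead")
engine.runAndWait()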
User Testing:
Initial user testing with visually impaired individuals showed positive results,
with users finding the system helpful for indoor navigation. Users reported feeling
more confident when navigating unfamiliar environments with the aid of audio feedback.
DISCUSSION
1. System Performance:
The combination of the Raspberry Pi, camera, and ultrasonic sensors provides
a reliable and cost-effective solution for real-time obstacle detection. The use of
Python and its extensive libraries, such as OpenCV and TensorFlow, streamlines the development process.
2. Challenges Encountered:
3. User Experience:
The system's usability is a key factor in its success. Ensuring that the device is
lightweight, easy to use, and non-intrusive is vital for adoption by visually impaired
users. Continuous user feedback and iterative design improvements are essential to refining the system.
CHAPTER – 6
The proposed method was applied to the hardware and it was tested with
different samples repeatedly. Our research methodology has successfully carried out the
processing of an image containing text and its transformation into audible speech.
In this project, we have hereby built a prototype of a model that can accurately
and efficiently detect, recognize and give audio feedback of the objects around us in a
multitude of ways that we can only imagine. Day-to-day activities of life get affected
by the same. With this project, we thus have built an idea that can help people
suffering from any of such vision related issues. The plight of these suffering people
is beyond our measures of control, but we can help them substitute vision with the sense of hearing.
For building this project, all that has been learnt from various sources was
put to use and hence resulted in the successful implementation of the project.
Although there are quite a few drastic future enhancements that have been suggested
as aforementioned, the model that exists right now shall be the strong foundation to that
of pace, and with the booming of technology, it can not only be cured, but also in the
REFERENCES
[1] Patel, P., & Bhavsar, B. (2021). Object Detection and Identification. International
Journal, 10(3).
[2] Mahendru, M., & Dubey, S. K. (2021, January). Real Time Object Detection with
Audio Feedback using Yolo vs. Yolo_v3. In 2021 11th International Conference on
Cloud Computing, Data Science & Engineering (Confluence) (pp. 734-740). IEEE.
[3] Annapoorani, A., Kumar, N. S., & Vidhya, V. (2021). Blind-Sight: Object Detection
with Voice Feedback.
[4] Murali, M., Sharma, S., & Nagansure, N. (2020, July). Reader and Object Detector
for Blind. In 2020 International Conference on Communication and Signal Processing
(ICCSP) (pp. 0795-0798). IEEE.
[5] Srikanteswara, R., Reddy, M. C., Himateja, M., & Kumar, K. M. (2022). Object
Detection and Voice Guidance for the Visually Impaired Using a Smart App. In
Recent Advances in Artificial Intelligence and Data Engineering (pp. 133-144).
Springer, Singapore.
[7] Dewangan, R. K., & Chaubey, S. (2021). Object Detection System with Voice Output
using Python.
[8] Samhita, M. S., Ashrita, T., Raju, D. P., & Ramachandran, B. (2021). A critical
investigation on blind guiding device using cnn algorithm based on motion stereo
tomography images. Materials Today: Proceedings.
[9] Potdar, K., Pai, C. D., & Akolkar, S. (2018). A convolutional neural network based
live object recognition system as blind aid. arXiv preprint arXiv:1811.10399.
[10] Lakde, C. K., & Prasad, P. S. (2015, April). Navigation system for visually impaired
people. In 2015 International Conference on Computation of Power, Energy,
Information and Communication (ICCPEIC) (pp. 0093-0098). IEEE.