Image Processing Using Machine Learning & Real Time Use Cases
Feature Mapping Using the SIFT Algorithm, Image Registration Using the RANSAC Algorithm: estimate_affine,
residual_lengths, Processing the Images, The Complete Code.
Image Classification Using Artificial Neural Networks, Image Classification Using CNNs, Image Classification
Using Machine Learning Approaches: Decision Trees, Support Vector Machines, Logistic Regression, Code,
Important Terms
Introduction to Real-Time Use Cases: Finding Palm Lines, Detecting Faces, Recognizing
Faces, Tracking Movements, Detecting Lanes
Image Processing Using Machine Learning & Real Time Use Cases
Feature Extraction
After an image has been segmented into regions or their boundaries using methods such as those discussed in
earlier chapters, the resulting sets of segmented pixels usually have to be converted into a form suitable for further computer processing.
Typically, the step after segmentation is Feature extraction, which consists of Feature detection and Feature
description.
For example, we might detect corners in a region boundary, and describe those corners by their orientation and
location, both of which are quantitative attributes.
Feature processing methods discussed in this chapter are subdivided into three principal categories, depending on
whether they are applicable to
• Boundaries,
• Regions, or
• Whole images.
Some features are applicable to more than one category.
Feature descriptors should be as insensitive as possible to variations in parameters such as scaling, translation,
rotation, illumination, and viewpoint. The descriptors discussed in this chapter are either insensitive to, or can be
normalized to compensate for, variations in one or more of these parameters.
Feature mapping is a technique used in data analysis and machine learning to transform input data from
a lower-dimensional space to a higher-dimensional space, where it can be more easily analyzed or classified.
• Feature mapping involves selecting or designing a set of functions that map the original data to a
new set of features that better capture the underlying patterns in the data. The resulting feature
space can then be used as input to a machine learning algorithm or other analysis technique.
• Feature mapping can be used in a wide range of applications, from natural language processing to
computer vision, and is a powerful tool for transforming data into a format that can be analyzed
more easily. However, there are also potential issues to consider, such as the curse of dimensionality,
overfitting, and computational complexity.
• Feature mapping, also known as Feature engineering, is the process of transforming raw input data
into a set of meaningful features that can be used by a machine learning algorithm. Feature mapping
is an important step in machine learning, as the quality of the features can have a significant impact
on the performance of the algorithm.
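As a small, concrete illustration of feature mapping, the sketch below uses scikit-learn's polynomial feature expansion, which is one common choice rather than anything prescribed by the text; the sample values are made up.

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Two original features per sample are mapped into a 5-dimensional space
# containing the original features, their squares, and their cross-term.
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])
mapper = PolynomialFeatures(degree=2, include_bias=False)
X_mapped = mapper.fit_transform(X)
print(X_mapped)
# The first row [2, 3] becomes [2, 3, 4, 6, 9]: x1, x2, x1^2, x1*x2, x2^2

A linear classifier trained on the mapped features can separate patterns that are not linearly separable in the original two-dimensional space, which is exactly the motivation described above.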
SIFT stands for Scale-Invariant Feature Transform and was first presented in 2004 by D. Lowe of the University of British
Columbia. SIFT is invariant to image scaling and rotation. The algorithm was patented, which is why it was long shipped in
the non-free (contrib) module of OpenCV; the patent has since expired.
SIFT (Scale-Invariant Feature Transform) is a powerful technique for image matching that can identify and
match features that are invariant to scaling and rotation, and robust to a degree of affine distortion. It is widely used in
computer vision applications, including image matching, object recognition, and 3D reconstruction.
Scale-space
Real-world objects are meaningful only at a certain scale. You can see a sugar cube perfectly well on a table, but if you
look at the entire Milky Way, the cube simply does not exist at that scale. This multi-scale nature of objects is quite
common in nature, and a scale space attempts to replicate this concept on digital images.
The scale space of an image is a function L(x, y, σ) that is produced by convolving a Gaussian
kernel (blurring) at different scales with the input image. The scale space is separated into octaves, and the number
of octaves and scales depends on the size of the original image. So we generate several octaves of the original
image; each octave's image size is half that of the previous one.
Blurring
Within an octave, images are progressively blurred using the Gaussian Blur operator. Mathematically,
“blurring” is referred to as the convolution of the Gaussian operator and the image. Gaussian blur has a
particular expression or “operator” that is applied to each pixel. What results is the blurred image.
The blurred image is $L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$, where $G$ is the Gaussian blur operator, $I$ is the image,
$(x, y)$ are the location coordinates, and $\sigma$ is the "scale" parameter; think of it as the amount of blur: the greater the
value of $\sigma$, the greater the blur. The Gaussian operator itself is
$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$.
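To make this concrete, here is a minimal sketch (the filename and the number of scales are illustrative, not taken from the text) of blurring one octave with increasing σ and forming the Difference-of-Gaussian images used in the next step.

import cv2
import numpy as np

# Build one octave of a SIFT-style scale space: blur the same image with
# progressively larger sigma, then subtract adjacent images to obtain the
# Difference-of-Gaussian (DoG) images used for keypoint detection.
img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma = 1.6                      # base sigma used in Lowe's paper
k = 2 ** 0.5                     # scale multiplier between adjacent images
octave = [cv2.GaussianBlur(img, (0, 0), sigmaX=sigma * (k ** i)) for i in range(5)]

# Each DoG image approximates a scale-normalised Laplacian of Gaussian
dogs = [octave[i + 1] - octave[i] for i in range(len(octave) - 1)]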
Finding keypoints
Up till now, we have generated a scale space and used the scale space to calculate the Difference of Gaussians.
Those are then used to calculate Laplacian of Gaussian approximations that are scale invariant.
One pixel in an image is compared with its 8 neighbours as well as 9 pixels in the next scale and 9 pixels in the previous
scale. This way, a total of 26 checks are made. If the pixel is a local extremum, it is a potential keypoint. It basically means
that the keypoint is best represented at that scale.
Keypoint Localization
Keypoints generated in the previous step produce a lot of keypoints. Some of them lie along an edge, or they don’t
have enough contrast. In both cases, they are not as useful as features. So we get rid of them. The approach is
similar to the one used in the Harris Corner Detector for removing edge features. For low contrast features, we
simply check their intensities.
They used Taylor series expansion of scale space to get a more accurate location of extrema, and if the intensity at
this extrema is less than a threshold value (0.03 as per the paper), it is rejected. DoG has a higher response for
edges, so edges also need to be removed. They used a 2x2 Hessian matrix (H) to compute the principal curvature.
Orientation Assignment
Now we have legitimate keypoints, tested to be stable. We already know the scale at which each
keypoint was detected (it is the same as the scale of the blurred image), so we have scale invariance. The next step
is to assign an orientation to each keypoint to make it rotation invariant.
A neighborhood is taken around the keypoint location depending on the scale, and the gradient magnitude and
direction is calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created. Let's
say the gradient direction at a certain point (in the “orientation collection region”) is 18.759 degrees, then it will go
into the 10–19-degree bin. And the “amount” that is added to the bin is proportional to the magnitude of the
gradient at that point. Once you’ve done this for all pixels around the keypoint, the histogram will have a peak at
some point.
The highest peak in the histogram is taken, and any peak above 80% of it is also considered when calculating the
orientation. This creates keypoints with the same location and scale but different directions, which contributes to the
stability of matching.
Keypoint descriptor
At this point, each keypoint has a location, scale, and orientation. Next we compute a descriptor for the local image
region around each keypoint that is highly distinctive and as invariant as possible to variations such as changes in
viewpoint and illumination.
To do this, a 16x16 window around the keypoint is taken and divided into 16 sub-blocks of 4x4 size.
For each sub-block an 8-bin orientation histogram is computed, so 4 x 4 sub-blocks x 8 directions give 128 bin
values. These are concatenated into a feature vector that forms the keypoint descriptor. This feature vector introduces a
few complications, which we need to get rid of before finalizing the fingerprint.
Rotation dependence The feature vector uses gradient orientations. Clearly, if you rotate the image, everything
changes. All gradient orientations also change. To achieve rotation independence, the keypoint’s rotation is
subtracted from each orientation. Thus each gradient orientation is relative to the keypoint’s orientation.
Illumination dependence: if we threshold the large values, we can achieve illumination independence. The 128-dimensional
vector is normalized to unit length, then any component greater than 0.2 is clamped to 0.2, and the vector is normalized
again. The result is an illumination-independent feature vector.
Keypoint Matching
Keypoints between two images are matched by identifying their nearest neighbours. In some cases, however, the second-closest
match may be very near to the first, due to noise or other reasons. In that case, the ratio of the closest distance to the
second-closest distance is taken; if this ratio is greater than 0.8, the match is rejected. This eliminates
around 90% of false matches while discarding only 5% of correct matches, as per the paper.
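A short sketch of SIFT detection and matching with this ratio test, using OpenCV; the filenames are placeholders, and the 0.8 threshold follows the description above.

import cv2

img1 = cv2.imread('query.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('train.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()            # needs opencv-contrib in older OpenCV versions
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor, find its two nearest neighbours in the other image
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)

# Keep a match only if it is clearly better than the second-best candidate
good = [m for m, n in matches if m.distance < 0.8 * n.distance]

out = cv2.drawMatches(img1, kp1, img2, kp2, good, None,
                      flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imwrite('sift_matches.jpg', out)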
Advantages of SIFT
• Distinctiveness: the features that are obtained can be compared against large datasets of objects.
• Quantity: SIFT can help generate many features even from small objects.
Disadvantages of SIFT:
• Used to be expensive: SIFT was patented, so commercial use used to require a license (the patent has since expired).
• Needs a lot of computing power: SIFT can be slow and needs a strong computer, especially for big pictures or
lots of features.
5.2 Image Registration Using the RANSAC Algorithm
The RANSAC algorithm is often used in computer vision, e.g., to simultaneously solve the correspondence
problem and estimate the fundamental matrix relating a pair of stereo cameras; see also structure from
motion, the scale-invariant feature transform, image stitching, and rigid motion segmentation.
Image registration is a critical step in computer vision and image processing that involves aligning two or
more images of the same scene taken at different times, from different viewpoints, or by different sensors.
The Random Sample Consensus (RANSAC) algorithm is often used for robust estimation in the presence
of outliers. Here's a high-level overview of how RANSAC can be applied to image registration:
1. Feature Detection:
• Identify distinctive features in both images. Common features include corners, keypoints,
or other unique patterns.
2. Feature Matching:
• Match the features between the two images. This can be done using descriptors such as SIFT,
SURF, or ORB.
3. Random Sampling:
• Randomly select a minimal subset of the matched correspondences (e.g., three pairs for an
affine transform, four for a homography).
4. Model Estimation:
• Use the randomly selected correspondences to estimate a transformation model. This could
be an affine transformation, a homography, or another transformation depending on the
nature of the images.
5. Inlier Selection:
• Apply the estimated model to all feature correspondences and identify inliers, i.e., matches
that agree well with the model.
6. Evaluate the Model:
• Assess the quality of the model by counting the number of inliers. This is a measure of
how well the model aligns the images.
7. Repeat:
• Repeat steps 3-6 for a predefined number of iterations or until a sufficiently good model is
found.
8. Refinement (Optional):
• Refine the final transformation model using all inliers. This step may involve a more
sophisticated optimization method.
9. Apply the Transformation:
• Apply the computed transformation to register one image onto the other (a minimal OpenCV
sketch of this loop follows the list).
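A minimal OpenCV sketch of the registration loop above, reusing the keypoints kp1/kp2 and the ratio-test matches from the earlier SIFT example and letting cv2.findHomography run RANSAC internally; the reprojection threshold is illustrative.

import cv2
import numpy as np

# Collect the matched point coordinates from both images
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Estimate a homography; 'mask' flags the inliers that agree with the model
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, ransacReprojThreshold=5.0)
print(f"{int(mask.sum())} of {len(good)} matches are inliers")

# Warp (register) the first image onto the second image's frame
h, w = img2.shape[:2]
registered = cv2.warpPerspective(img1, H, (w, h))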
Image stitching [1, 2] is the stitching of two or more images with overlapping or identical features into a
single image with a larger angle of view and scene. The main focus of the paper is feature-point-based
image mosaicking. Researchers have proposed a number of point-based feature detection and extraction
algorithms, such as the Harris algorithm [3], the FAST algorithm [4], the SIFT algorithm [5] and so
on. However, captured images are affected by lighting, viewing angle, etc., resulting in slow matching and
poor extraction accuracy. Herbert Bay et al. proposed Speeded-Up Robust Features (SURF) [6] in 2006
and improved it in 2008. SURF is a local, robust feature detection algorithm, partly inspired by the
scale-invariant feature transform (SIFT) [5]. The extracted feature point pairs usually contain large errors or
wrong matches, resulting in an inaccurate transformation matrix. Robust methods are usually
used to eliminate mis-matched points, such as M-estimation, least median of squares, and random sample
consensus (RANSAC) [7]. The RANSAC algorithm is a robust data-fitting algorithm, first proposed
by Fischler et al. in 1981. Its basic assumption is that the data consist of a sample set composed of
correct data plus a small amount of abnormal data, and the algorithm iteratively eliminates the erroneous data.
Applying the RANSAC algorithm to the screening of matched feature points can effectively eliminate
erroneous matching points. However, the RANSAC algorithm also has the disadvantages of a long run time, caused
by iterating over all pairs of points to be matched, and a lack of strong stability. In order to make image
matching more accurate and efficient, the paper proposes an improved RANSAC feature image
matching method based on SURF. First, the SURF algorithm is used to select the feature points of the
images, and a FLANN (Fast Library for Approximate Nearest Neighbours) based matcher is used to pre-match
the extracted feature points. For the mis-matched points in the matching process, the improved RANSAC
algorithm is used to select and reject features. This reduces the number of iterations and improves the
accuracy and efficiency of the match.
The RANSAC algorithm is used in OpenCV to eliminate falsely matched points. First, an optimal homography
matrix H is found such that the number of feature points satisfying this matrix is maximised. The relationship between a
feature point pair $(x, y) \leftrightarrow (x', y')$ and the homography matrix is (up to a scale factor):
$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \sim H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \qquad
H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}$$
For the mis-matched points in the initial matching point pairs, the RANSAC algorithm [10, 11] is usually used for
filtering and rejection. The RANSAC algorithm uses an iterative method to randomly sample all pairs of feature
matching points, obtain a number of minimum sample sets, and test each of these sample sets in turn.
Among the N matching pairs, the sample sets whose error is less than a given threshold are considered correct
matching pairs, and the other matching pairs are considered outliers, also called mis-matched points. However, when
the RANSAC algorithm filters matching points, the probability that each pair of matching points in the sample is
selected is the same, and the results of each selection do not affect each other. Considering that there is a fixed
transformation matrix between the two images to be matched, the Euclidean distance between correct matching point
pairs varies only within a certain range and will not be abnormally large or abnormally small. The paper therefore
proposes an improved RANSAC algorithm based on the idea in [12]. Before the RANSAC algorithm is run, the
Euclidean distances of the matching point pairs are calculated and sorted, the pairs of points that are too close or too
far apart are eliminated, and the RANSAC algorithm is performed only on the middle matching point pairs. This
improves the probability that a correct matching pair is sampled, so that a correct model is reached sooner. At the
same time, the amount of sample data is reduced and the iteration time is reduced as well.
(i) Calculate the Euclidean distance of the N pairs of matching points, sort them in order of distance, and delete
the first 20% and the last 20% of the pairs;
(ii) Randomly select four pairs of matched points from the remaining 0.6*N pairs, ensuring that no three of the four
points are collinear, and calculate the transformation matrix H;
(iii) For the remaining (0.6*N - 4) matching pairs, calculate the distance d between each point transformed by H and
its corresponding matching point. Set a threshold T (T = 2 in this experiment): pairs whose distance is less than the
threshold are treated as correct matching pairs; otherwise the pair is removed;
(iv) Count the number of interior points obtained in step (iii) and re-fit the transformation matrix;
(v) Repeat until the number of interior points no longer changes; the final interior point set is then obtained, and the
final transformation matrix is calculated at the same time (a sketch of this pre-filtering procedure follows).
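The following rough sketch illustrates the pre-filtering idea of steps (i)-(iii). Variable names are illustrative; kp1, kp2 and raw_matches are assumed to come from an earlier SURF or SIFT matching step, and OpenCV's built-in RANSAC stands in for the iterative refinement of steps (iv)-(v).

import cv2
import numpy as np

# Sort matches by the Euclidean distance between matched point coordinates,
# keep only the middle 60%, then run RANSAC on this reduced set.
def point_distance(m):
    (x1, y1), (x2, y2) = kp1[m.queryIdx].pt, kp2[m.trainIdx].pt
    return np.hypot(x2 - x1, y2 - y1)

matches_sorted = sorted(raw_matches, key=point_distance)
n = len(matches_sorted)
middle = matches_sorted[int(0.2 * n): int(0.8 * n)]      # drop first and last 20%

src = np.float32([kp1[m.queryIdx].pt for m in middle]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in middle]).reshape(-1, 1, 2)

# Threshold T = 2 pixels, as in the experiment described above
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=2.0)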
Relative to the standard RANSAC algorithm, the improved RANSAC algorithm changes how matching
points are selected: the probability of selecting a correct pair of points increases, reducing the impact of mis-matched
points on the desired matrix. At the same time, the number of iterations is greatly reduced, the efficiency is
improved, and the accuracy of the transformation matrix is improved. This also lays the foundation for real-time
panoramic stitching later on.
5.3 estimate_affine
What is estimation?
Estimation is the case where a clean image is contaminated with noise, usually through sensing, transmission, or
storage. Image restoration means that, in addition to the noise, there is some blurring due to motion or lack of
focus. Both nonrecursive and recursive approaches for 2-D estimation exist.
What is affine?
The affine transformation technique is typically used to correct for geometric distortions or deformations that
occur with non-ideal camera angles. For example, satellite imagery uses affine transformations to correct for wide
angle lens distortion, panorama stitching, and image registration.
In image processing, estimating an affine transformation is a common task when trying to align or register two
images. An affine transformation is a linear mapping that preserves points, straight lines, and planes. It includes
translations, rotations, scaling, and shearing. Here's how you can estimate an affine transformation using Python
and OpenCV:
import cv2
import numpy as np

# Load images
image1 = cv2.imread('image1.jpg')
image2 = cv2.imread('image2.jpg')

# Detect ORB keypoints/descriptors and match them (brute force, Hamming distance)
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(image1, None)
kp2, des2 = orb.detectAndCompute(image2, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Estimate the affine transform with RANSAC and warp image1 onto image2
src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
affine_matrix, inliers = cv2.estimateAffine2D(src_pts, dst_pts, method=cv2.RANSAC)
result = cv2.warpAffine(image1, affine_matrix, (image2.shape[1], image2.shape[0]))

# Display results
cv2.imshow('Image 1', image1)
cv2.imshow('Image 2', image2)
cv2.imshow('Aligned Image 1', result)
cv2.waitKey(0)
cv2.destroyAllWindows()
In this example:
We use the ORB detector and descriptor to find keypoints and descriptors in both images.
We match the features using a brute-force matcher with the Hamming distance.
We use the cv2.estimateAffine2D function with the RANSAC method to estimate the affine transformation matrix.
The resulting transformation matrix (affine transform) is then used to align or warp image1 onto image2.
The aligned image is displayed for visual inspection.
Adjust the parameters and methods according to the characteristics of your images and the specific requirements
of your application. Keep in mind that RANSAC helps in robustly estimating the transformation by handling
outliers (mismatched or erroneous correspondences).
What is an Affine Transformation?
1. A transformation that can be expressed in the form of a matrix multiplication (linear transformation)
followed by a vector addition (translation).
2. From the above, we can use an Affine Transformation to express:
a. Rotations (linear transformation)
b. Translations (vector addition)
c. Scale operations (linear transformation)
you can see that, in essence, an Affine Transformation represents a relation between two images.
3. The usual way to represent an Affine Transformation is by using a 2×3 matrix.
$$A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}_{2\times 2}, \qquad
B = \begin{bmatrix} b_{00} \\ b_{10} \end{bmatrix}_{2\times 1}$$
$$M = [\,A \;|\; B\,] = \begin{bmatrix} a_{00} & a_{01} & b_{00} \\ a_{10} & a_{11} & b_{10} \end{bmatrix}_{2\times 3}$$
Considering that we want to transform a 2D vector $X = \begin{bmatrix} x \\ y \end{bmatrix}$ using $A$ and $B$, we can do the same with:
$$T = A \cdot \begin{bmatrix} x \\ y \end{bmatrix} + B \qquad \text{or} \qquad T = M \cdot [x,\; y,\; 1]^{T}$$
$$T = \begin{bmatrix} a_{00}x + a_{01}y + b_{00} \\ a_{10}x + a_{11}y + b_{10} \end{bmatrix}$$
How do we get an Affine Transformation?
1. We mentioned that an Affine Transformation is basically a relation between two images. The information
about this relation can come, roughly, in two ways:
a. We know both X and T and we also know that they are related. Then our task is to find M
b. We know M and X. To obtain T we only need to apply T=M⋅X. Our information for M may be
explicit (i.e. have the 2-by-3 matrix) or it can come as a geometric relation between points.
2. Let's explain this in a better way (b). Since M relates two images, we can analyze the simplest case in which
it relates three points in both images: the points 1, 2 and 3 (forming a triangle in image 1) are mapped into image 2,
still forming a triangle, but their positions have changed noticeably. If we find the affine transformation from these
3 points (you can choose them as you like), then we can apply this relation to all the pixels in the image.
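When the relation is given as three corresponding points, OpenCV can solve for the 2x3 matrix M directly. A minimal sketch follows; the coordinates are made up, and img1 is assumed to be a previously loaded image.

import cv2
import numpy as np

src_tri = np.float32([[50, 50], [200, 50], [50, 200]])     # points 1, 2, 3 in image 1
dst_tri = np.float32([[70, 100], [220, 80], [80, 250]])    # the same points in image 2

M = cv2.getAffineTransform(src_tri, dst_tri)               # exact solution from 3 points
print(M)                                                   # the 2x3 affine matrix

# Once M is known, it can be applied to every pixel of the image
warped = cv2.warpAffine(img1, M, (img1.shape[1], img1.shape[0]))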
5.4 residual_lengths
What are residual lengths?
Definition: the residual for each observation is the difference between the observed value of y (the dependent
variable) and the value of y predicted by the model:
$$r_i = y_i - \hat{y}_i \qquad \text{(residual = actual y value} - \text{predicted y value)}.$$
Residual plot
A residual plot shows the difference between the observed response and the fitted response values.
The ideal residual plot, called the null residual plot, shows a random scatter of points forming an approximately
constant width band around the identity line.
It is important to check the fit of the model and assumptions – constant variance, normality, and independence of
the errors, using the residual plot, along with normal, sequence, and lag plot.
Constant variance: If the points tend to form an increasing, decreasing, or otherwise non-constant-width band,
then the variance is not constant. You should consider transforming the response variable or incorporating weights
into the model. When variance increases as a percentage of the response, you can use a log transform, although you
should ensure it does not produce a poorly fitting model.
Independence: When the order of the cases in the dataset is the order in which they occurred, examine a sequence
plot of the residuals against that order to identify any dependency between the residuals and time. Also examine a
lag-1 plot of each residual against the previous residual to identify serial correlation, where observations are not
independent and there is a correlation between an observation and the previous observation. Time-series analysis
may be more suitable for modelling data where serial correlation is present.
For a model with many terms, it can be difficult to identify specific problems using the residual plot. A non-null
residual plot indicates that there are problems with the model, but not necessarily what these are.
Residuals - normality
Normality is the assumption that the underlying residuals are normally distributed, or approximately so.
While a residual plot, or normal plot of the residuals can identify non-normality, you can formally test the
hypothesis using the Shapiro-Wilk or similar test.
The null hypothesis states that the residuals are normally distributed, against the alternative hypothesis that they
are not normally-distributed. If the test p-value is less than the predefined significance level, you can reject the
null hypothesis and conclude the residuals are not from a normal distribution. If the p-value is greater than the
predefined significance level, you cannot reject the null hypothesis.
Violation of the normality assumption only becomes an issue with small sample sizes. For large sample sizes, the
assumption is less important due to the central limit theorem, and the fact that the F- and t-tests used for
hypothesis tests and forming confidence intervals are quite robust to modest departures from normality.
Residuals – independence
Autocorrelation occurs when the residuals are not independent of each other. That is, when the value of e[i+1] is
not independent from e[i].
While a residual plot, or lag-1 plot allows you to visually check for autocorrelation, you can formally test the
hypothesis using the Durbin-Watson test. The Durbin-Watson statistic is used to detect the presence of
autocorrelation at lag 1 (or higher) in the residuals from a regression. The value of the test statistic lies between 0
and 4, small values indicate successive residuals are positively correlated. If the Durbin-Watson statistic is much
less than 2, there is evidence of positive autocorrelation, if much greater than 2 evidence of negative
autocorrelation.
The null hypothesis states that the residuals are not autocorrelated, against the alternative hypothesis that they
are. If the test p-value is less than the predefined significance level, you can reject the null hypothesis and
conclude the residuals are correlated. If the p-value is greater than the predefined significance level, you cannot
reject the null hypothesis.
Note: The p-value is computed using the bootstrap method and can take a long time to compute.
"Residual lengths" is not a standard statistical term on its own; it simply refers to the magnitude of a residual. In the
image-registration context of this chapter, the residual length of a matched point pair is the Euclidean distance
between the destination point and the source point mapped through the estimated transformation (for example, the
affine matrix returned by estimate_affine). Correct matches produce small residual lengths, while mis-matched
points produce large ones, which is exactly the criterion RANSAC uses to separate inliers from outliers.
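A small sketch of how such residual lengths can be computed for an estimated 2x3 affine matrix M; the point arrays are assumed to hold N matched (x, y) coordinates.

import numpy as np

def residual_lengths(M, src_pts, dst_pts):
    src = np.asarray(src_pts, dtype=np.float64)            # shape (N, 2)
    dst = np.asarray(dst_pts, dtype=np.float64)            # shape (N, 2)
    ones = np.ones((src.shape[0], 1))
    projected = np.hstack([src, ones]) @ M.T               # predicted positions, shape (N, 2)
    return np.linalg.norm(projected - dst, axis=1)         # one length per point pair

# Example use: small residuals indicate points that agree with the model (inliers)
# residuals = residual_lengths(M, src_pts, dst_pts)
# print(residuals.mean(), residuals.max())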
5.5 Processing the Images
What is image processing?
Image processing is a method of performing operations on an image in order to obtain an enhanced image or to
extract useful information from it. It is a type of signal processing in which the input is an image and the output may
be an image or characteristics/features associated with that image. In other words, image processing refers to the
manipulation of an image to extract meaningful information or enhance certain features. This field is crucial in various
applications, including computer vision, medical imaging, remote sensing, and more. Image processing can
involve a wide range of operations; here are some common tasks:
1. Image Acquisition: The process begins with capturing or obtaining the image data through cameras,
sensors, or other devices.
2. Preprocessing: This step involves preparing the image for further analysis. Operations may include
resizing, noise reduction, and image enhancement to improve the quality of the image.
3. Image Enhancement: This step aims to improve the visual appearance of an image. Techniques such as
contrast adjustment, sharpening, and filtering are commonly used.
4. Image Restoration: Involves the removal or reduction of artifacts or distortions introduced during image
acquisition or transmission.
5. Image Segmentation: The process of dividing an image into meaningful segments or regions. This is often
a crucial step in object recognition and computer vision.
6. Feature Extraction: Involves identifying and extracting relevant features from an image, such as edges,
corners, or texture patterns, which are important for subsequent analysis.
7. Image Recognition: Using patterns and features identified in the previous steps to recognize and classify
objects or patterns within an image. This is a fundamental aspect of computer vision.
8. Image Compression: Reducing the size of an image to save storage space or enable faster transmission.
9. Image Analysis: Involves extracting quantitative information from an image. This can include
measurements, statistical analysis, and other data extraction techniques.
10. Image Synthesis: Creating new images from existing ones, often using computer graphics techniques.
11. Image Understanding: The highest level of image processing involves interpreting and understanding the
content of an image, often requiring advanced artificial intelligence and machine learning techniques.
Image processing can be performed using various tools and programming languages, and it often involves a
combination of traditional methods and modern machine learning approaches. The specific techniques used
depend on the goals of the image processing task and the characteristics of the images being analyzed.
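As a brief illustration, a few of the preprocessing and enhancement operations listed above can be performed with OpenCV as follows; the filename is a placeholder.

import cv2

img = cv2.imread('input.jpg')

resized = cv2.resize(img, (256, 256))                      # resizing
denoised = cv2.fastNlMeansDenoisingColored(resized)        # noise reduction
gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
enhanced = cv2.equalizeHist(gray)                          # contrast enhancement
normalized = enhanced.astype('float32') / 255.0            # scale pixel values to [0, 1]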
In one lesion-segmentation study, the authors achieved a 3% boost in performance from a simple preprocessing
procedure, which is a considerable enhancement, especially in a biomedical application where the accuracy of
diagnosis is crucial for AI systems. The paper reports quantitative results obtained with and without preprocessing
for the lesion segmentation problem on three different datasets.
Types of Images / How Machines “See” Images?
Digital images are interpreted as 2D or 3D matrices by a computer, where each value or pixel in the matrix
represents the amplitude, known as the “intensity” of the pixel. Typically, we are used to dealing with 8-bit
images, wherein the amplitude value ranges from 0 to 255.
1. Binary Image
Images that have only two unique values of pixel intensity, 0 (representing black) and 1 (representing white), are
called binary images. Such images are generally used to highlight a discriminating portion of a coloured image; for
example, they are commonly used as masks in image segmentation.
2. Grayscale Image
Grayscale or 8-bit images are composed of 256 unique intensity levels, where a pixel intensity of 0 represents black
and a pixel intensity of 255 represents white. The 254 values in between are the different shades of gray.
When an RGB image is converted to its grayscale version, the overall shape of the intensity histogram typically
remains similar for the RGB and grayscale images.
3. RGB Image
Up until now, we had images with only one channel; that is, two coordinates were enough to define the location of
any value in the matrix. In an RGB image, three equal-sized matrices (called channels), each with values ranging
from 0 to 255, are stacked on top of each other, so three coordinates are required to specify the value of a
matrix element.
Thus, a pixel in an RGB image will be of color black when the pixel value is (0, 0, 0) and white when it is (255, 255,
255). Any combination of numbers in between gives rise to all the different colors existing in nature. For example,
(255, 0, 0) is the color red (since only the red channel is activated for this pixel). Similarly, (0, 255, 0) is green and
(0, 0, 255) is blue.
When an RGB image is split into its Red, Green, and Blue channel components, the shapes of the histograms for
each of the channels are generally different.
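A minimal sketch of inspecting and splitting the channels of a colour image with OpenCV; note that OpenCV loads images in BGR order, and the filename is a placeholder.

import cv2

img = cv2.imread('photo.jpg')
b, g, r = cv2.split(img)               # three single-channel matrices

print(img.shape)                       # e.g. (height, width, 3)
print(img[0, 0])                       # one pixel: its [B, G, R] values in 0-255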
4. RGBA Image
RGBA images are colored RGB images with an extra channel known as “alpha” that depicts the opacity of the RGB
image. Opacity ranges from a value of 0% to 100% and is essentially a “see-through” property.
Opacity in physics describes how much light is blocked by an object. For instance, cellophane paper is transparent
(close to 0% opacity), frosted glass is translucent, and wood is opaque (100% opacity). The alpha channel in RGBA
images mimics this property: lowering the alpha value of a pixel makes it increasingly see-through.
Phases of Image Processing
The fundamental steps in any typical Digital Image Processing pipeline are as follows:
1. Image Acquisition
The image is captured by a camera and digitized (if the camera output is not digitized automatically) using an
analogue-to-digital converter for further processing in a computer.
2. Image Enhancement
In this step, the acquired image is manipulated to meet the requirements of the specific task for which the image
will be used. Such techniques are primarily aimed at highlighting the hidden or important details in an image, like
contrast and brightness adjustment, etc. Image enhancement is highly subjective in nature.
3. Image Restoration
This step deals with improving the appearance of an image and is an objective operation since the degradation of
an image can be attributed to a mathematical or probabilistic model. For example, removing noise or blur from
images.
6. Image Compression
For transferring images to other devices or due to computational storage constraints, images need to be
compressed and cannot be kept at their original size. This is also important in displaying images over the internet;
for example, on Google, a small thumbnail of an image is a highly compressed version of the original. Only when
you click on the image is it shown in the original resolution. This process saves bandwidth on the servers.
7. Morphological Processing
Image components that are useful in the representation and description of shape need to be extracted for further
processing or downstream tasks. Morphological Processing provides the tools (which are essentially mathematical
operations) to accomplish this. For example, erosion and dilation operations are used to sharpen and blur the
edges of objects in an image, respectively.
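A short OpenCV sketch of these two basic morphological operations; the filename and the kernel size are illustrative.

import cv2
import numpy as np

# Erosion shrinks bright (foreground) regions, dilation grows them.
img = cv2.imread('shapes.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

kernel = np.ones((5, 5), np.uint8)                          # structuring element
eroded = cv2.erode(binary, kernel, iterations=1)
dilated = cv2.dilate(binary, kernel, iterations=1)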
8. Image Segmentation
This step involves partitioning an image into different key parts to simplify and/or change the representation of
an image into something that is more meaningful and easier to analyze. Image segmentation allows for
computers to put attention on the more important parts of the image, discarding the rest, which enables
automated systems to have improved performance.
5.6 The Complete Code
First published in 1993, "Code Complete" by Steve McConnell covers a broad range of topics, including software design
principles, coding practices, debugging, testing, and project management. It aims to help software developers improve their
coding skills and produce higher-quality software. The book has been updated over the years to reflect changes in technology
and software development practices.
Static examples — plain code blocks, possibly with a screenshot to statically show the result of such code if it were to be run.
Interactive examples — Our system for creating live interactive examples that show the code running live but also allow you
to change code on the fly to see what the effect is and easily copy the results.
Traditional MDN "live samples" — A macro that takes plain code blocks, dynamically puts them into a document inside an
<iframe> element, and embeds it into the page to show the code running live.
GitHub "live samples" — A macro that takes a document in a GitHub repo inside the MDN organization, puts it inside an
<iframe> element, and embeds it into the page to show the code running live.
We'll discuss each one in later sections.
Static examples are useful if you just need to show some code, and it isn't super important to show what the live result is.
Some people just want something to copy and paste. Maybe you are just showing an intermediate step, or the source code is
enough. (For example, the article is for an advanced audience, and they just need to see the code.) Also, you might be
demonstrating an API feature that doesn't work well as an embedded example, which might need its own separate page to
link to.
The interactive examples are great as readers can modify values on the fly — this is very valuable for learning. However, they
are more complex to set up than the other forms, with more limitations, and are intended for specific purposes.
Traditional live samples are useful if you want to show source code on a page, then show it running, and you're not that
bothered about it being accessible as a standalone example. This approach also has the advantage that if you are showing
source code and live examples side by side, you only need to update the code once to update both. They can however be
awkward to edit and get working.
GitHub live samples are useful when you've got an existing example you want to embed, don't want to show the source code
for, and/or you want to make sure the example is available in standalone form. They have a better contribution workflow,
but it does require you to know GitHub. Also because on-page code and source code are in two different places, it is easier
for them to get out of sync.
General guidelines
Aside from the specific system for presenting the live samples, there are style and content considerations to keep in mind
when adding or updating samples on MDN.
When placing samples on a page, try to ensure that all of the features or options of the API or concept you're writing about
are covered. At a minimum, at least the most-common options or properties should be included in examples.
Precede each example with an explanation of what the example does and why it's interesting or useful.
Follow each piece of code with an explanation of what it does.
When possible, break large examples into smaller pieces. For instance, the "live sample" system will automatically
concatenate all your code together into one piece before running the example, so you can actually break your JavaScript,
HTML, and/or CSS into smaller pieces with descriptive text after each piece if you choose to do so. This is a great way to help
explain long or complicated stretches of code more clearly.
Go beyond just demonstrating how each piece of the API or technology works. Consider possible real-world use cases you
might try to demonstrate.
Static examples
By static examples, we are talking about static code blocks that show how a feature might be used in code. These are put on
a page using Markdown "code fences", as described in Example code blocks. An example result might look like this:
What is source code?
Source code is the fundamental component of a computer program that is created by a programmer, often written in the
form of functions, descriptions, definitions, calls, methods and other operational statements. It is designed to be human-
readable and formatted in a way that developers and other users can understand.
As an example, when a programmer types a sequence of C programming language statements into Windows Notepad and
saves the sequence as a text file, the text file now contains source code.
Source code and object code are sometimes referred to as the before and after versions of a compiled computer program.
However, the terms source code and object code do not really apply to scripting (non-compiled or interpreted) programming
languages, since there is no separately compiled object code.
Programmers can use a text editor, a visual programming tool or an integrated development environment (IDE) such as a
software development kit (SDK) to create source code. In large program development environments, there are often
management systems that help programmers separate and keep track of different states and levels of source code files.
Source code can be proprietary or open, and licensing agreements often reflect this distinction.
When a user installs a software suite like Microsoft Office, for example, the source code is proprietary: Microsoft only gives
the customer access to the software's compiled executables and the associated library files that those executables require.
By comparison, when a user installs Apache OpenOffice, its open source code can be downloaded and modified.
5.7 Image Classification Using Artificial Neural Networks
What is Image classification?
The process of categorizing and labeling groups of pixels or vectors within an image based on specific rules. The
categorization law can be devised using one or more spectral or textural characteristics. Two general methods
of classification are 'supervised' and 'unsupervised'.
Convolutional neural network models are ubiquitous in the image data space. They work phenomenally
well on computer vision tasks like image classification, object detection, and image recognition. They have hence
been widely used in artificial intelligence modeling, especially to create image classifiers.
Image classification using artificial neural networks is a popular and powerful application of machine learning.
Convolutional Neural Networks (CNNs) are particularly well-suited for this task. Here's a step-by-step guide on
how image classification using artificial neural networks, specifically CNNs, can be implemented:
1. Dataset Preparation:
Obtain a labeled dataset with images and corresponding labels. Common datasets for image classification include
CIFAR-10, CIFAR-100, ImageNet, etc.
2. Data Preprocessing:
Resize images to a consistent size.
Normalize pixel values (typically between 0 and 1).
Augment data for better generalization (rotate, flip, zoom, etc.) to increase the variety of training examples.
3. Architecture Design:
Build a CNN architecture. A typical architecture includes convolutional layers, pooling layers, and fully connected
layers.
Popular CNN architectures include LeNet, AlexNet, VGG, GoogLeNet (Inception), ResNet, and more. You can also
design a custom architecture based on your specific requirements.
4. Model Compilation:
Choose an appropriate loss function (categorical crossentropy for multi-class classification) and an optimizer (e.g.,
Adam, SGD).
Compile the model with these settings.
5. Training:
Split the dataset into training and validation sets.
Train the model using the training set and validate it using the validation set.
Adjust hyperparameters like learning rate, batch size, and architecture based on the validation performance.
6. Model Evaluation:
Evaluate the model on a separate test set to measure its performance accurately.
7. Fine-Tuning:
If the model performance is not satisfactory, consider fine-tuning the architecture or hyperparameters.
Techniques like transfer learning can be employed by using pre-trained models and adapting them to your
specific task.
8. Deployment:
Once satisfied with the performance, deploy the model for inference. This can involve integrating the model into
a web application, mobile app, or other platforms.
9. Monitoring and Maintenance:
Regularly monitor the model's performance, and update it if necessary with new data or retraining to ensure its
accuracy over time.
Tips and Considerations:
Experiment with different architectures and hyperparameters to find the best combination for your specific
problem.
Use GPU acceleration for faster training times.
Regularization techniques like dropout can help prevent overfitting.
Keep an eye on class imbalances and use techniques such as class weights or oversampling to address them.
Birds inspired us to fly; nature has inspired countless inventions. It seems logical, then, to look at the brain's
architecture for inspiration on how to build an intelligent machine. This is the logic that sparked Artificial Neural
Networks (ANNs). An ANN is a machine learning model inspired by the networks of biological neurons found in our
brains. However, although planes were inspired by birds, they don't have to flap their wings; similarly, ANNs have
gradually become quite different from their biological cousins. In this section, we build an image classification
model with an ANN to show how ANNs work.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# Common imports
import numpy as np
import os
import matplotlib.pyplot as plt   # needed for the plots below
Keras provides utility functions to fetch and load common datasets, including MNIST, Fashion MNIST, and
the California housing dataset. Let's start by loading the Fashion MNIST dataset to create an image classification
model.
Keras has a number of functions to load popular datasets in keras.datasets. The dataset is already split for you
between a training set and a test set, but it can be useful to split the training set further to have a validation set:
import tensorflow as tf
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
When loading MNIST or Fashion MNIST using Keras rather than Scikit-Learn, one important difference is that
every image is represented as a 28 x 28 array rather than a 1D array of size 784. Moreover, the pixel intensities
are represented as integers rather than floats. Let's take a look at the shape and data type of the training set:
X_train_full.shape
(60000, 28, 28)
plt.imshow(X_train_full[0], cmap="binary")
plt.axis('off')
plt.show()
With MNIST, when the label is equal to 5, it means that the image represents the handwritten digit 5. Easy. For
Fashion MNIST, however, we need the list of class names to know what we are dealing with.
The validation set contains 5,000 images, and the test set contains 10,000 images.
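The code that performs this split and defines the class names is not shown in the excerpt above; a minimal sketch consistent with the description (scaling pixel values to [0, 1] and holding out the first 5,000 training images for validation) would be:

# Hold out the first 5,000 training images for validation and scale
# pixel intensities from integers in [0, 255] to floats in [0, 1].
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

# Class names for the Fashion MNIST labels 0-9
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]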
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
for col in range(n_cols):
index = n_cols * row + col
plt.subplot(n_rows, n_cols, index + 1)
plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
plt.axis('off')
plt.title(class_names[y_train[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig('fashion_mnist_plot', tight_layout=False)   # save_fig is a figure-saving helper from the source notebook
plt.show()
Now, let’s build the neural network. Here is a classification MLP with two hidden layers:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
Let's go through the above code line by line:
• The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that
are just composed of a single stack of layers connected sequentially. This is called the Sequential API.
• Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input
image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28 * 28). This layer does not have any
parameters; it is just there to do some simple preprocessing. Since it is the first layer in the model, you should
specify the input_shape, which doesn't include the batch size, only the shape of the instances. Alternatively, you
could add a keras.layers.InputLayer as the first layer, setting input_shape=[28, 28].
• Next we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense
layer manages its own weight matrix, containing all the connection weights between the neurons and their
inputs. It also manages a vector of bias terms.
• Then we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function.
• Finally, we add a Dense output layer with 10 neurons, using the softmax activation function.
Instead of adding the layers one by one as we just did, you can pass a list of layers when creating the Sequential
model:
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(300, activation="relu"),
keras.layers.Dense(100, activation="relu"),
keras.layers.Dense(10, activation="softmax")
])
model.layers
The model's summary() method displays all the model's layers, including each layer's name, its output shape,
and its number of parameters, including trainable and non-trainable parameters.
After a model is created, you must call its compile() method to specify the loss function and the optimizer to
use. Optionally, you can specify a list of extra metrics to compute during training and evaluation:
model.compile(loss="sparse_categorical_crossentropy",
optimizer="sgd",
metrics=["accuracy"])
Now the model is ready to be trained. For this we simply need to call its fit() method:
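The fit() call itself is omitted from the excerpt; a minimal sketch, assuming the validation split defined earlier (the epoch count here is illustrative), is:

# Train the model, monitoring performance on the validation set; fit()
# returns a History object whose .history dict is plotted below.
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))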
import pandas as pd
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)
save_fig("keras_learning_curves_plot")
plt.show()
You can see that both the training accuracy and the validation accuracy steadily increase during training, while
the training loss and the validation loss decrease.
Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the
generalization error before you deploy the model to production. You can easily do this using the evaluate() method.
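A minimal sketch of this call, using the test set prepared earlier:

model.evaluate(X_test, y_test)   # returns [test loss, test accuracy]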
Next, we can use the model’s predict() method to make predictions on new instances. Since we don’t have actual
new instances, we will just use the first three instances of the test set:
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)
array([[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.03, 0. , 0.96], [0. , 0. , 0.99, 0. , 0.01, 0. , 0. , 0. , 0. , 0. ], [0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ]], dtype=float32)
y_pred = np.argmax(model.predict(X_new), axis=-1)   # predict_classes() was removed in recent Keras versions
y_pred
array([9, 2, 1])
Here, the classification model actually classified all three images correctly:
y_new = y_test[:3]
plt.figure(figsize=(7.2, 2.4))
for index, image in enumerate(X_new):
plt.subplot(1, 3, index + 1)
plt.imshow(image, cmap="binary", interpolation="nearest")
plt.axis('off')
plt.title(class_names[y_test[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig('fashion_mnist_images_plot', tight_layout=False)
plt.show()
What is a CNN?
A CNN-based classifier for image classification is a model specifically designed to classify images into different
predefined classes. It learns to extract relevant features from input images and map them to the corresponding
classes, enabling accurate image classification.
Step 1: Choose a Dataset
Step 2: Prepare Dataset for Training
Step 3: Create Training Data
Step 4: Shuffle the Dataset
Step 5: Assigning Labels and Features
Step 6: Normalising X and Converting Labels to Categorical Data
Step 7: Split X and Y for Use in CNN
Image classification using Convolutional Neural Networks (CNNs) is a popular and effective approach in the field
of computer vision. CNNs are particularly well-suited for tasks like image classification because they can
automatically learn hierarchical representations of features from raw pixel values. Here's a general outline of the
steps involved in building an image classification system using CNNs:
Dataset Preparation:
Collect a labeled dataset of images for training and testing. Ensure that the dataset is diverse and representative
of the target classes. Split the dataset into training and testing sets.
Data Preprocessing:
Resize images to a standard size.
Normalize pixel values to a common scale (e.g., between 0 and 1).
Augment the dataset with techniques like rotation, flipping, and zooming to increase variability in the training set.
Building the CNN Model:
Import necessary libraries (e.g., TensorFlow, PyTorch, Keras).
Define the CNN architecture, typically consisting of convolutional layers, pooling layers, and fully connected
layers.
Add activation functions (e.g., ReLU) to introduce non-linearity.
Use dropout layers to reduce overfitting.
Choose an appropriate output layer activation function based on the number of classes in your problem (e.g.,
softmax for multi-class classification).
Compiling the Model:
Specify the optimizer (e.g., Adam, SGD), loss function (e.g., categorical cross entropy), and evaluation metric (e.g.,
accuracy). Compile the model.
Training the Model:
Feed the training data into the model.
Adjust the model weights during training using backpropagation and optimization algorithms.
Monitor training performance using validation data.
Evaluation:
Evaluate the trained model on the test set to assess its performance.
Analyze metrics like accuracy, precision, recall, and F1 score.
Fine-tuning:
Fine-tune the model based on performance metrics.
Adjust hyperparameters or experiment with different architectures if needed.
Prediction:
Use the trained model for making predictions on new, unseen data.
Deployment:
Deploy the model in a production environment if necessary.
Optimize the model for inference speed and resource usage.
For example, if we have a 50 x 50 image of a cat and we want to train a traditional ANN on that image to classify it as a
dog or a cat, the number of trainable parameters becomes:
(50 × 50 input pixels × 100 hidden neurons) + 100 hidden biases + (100 × 2 output neurons) + 2 output biases = 250,302.
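As a quick check of this arithmetic, the same fully connected network can be built in Keras and its parameters counted; the layer sizes follow the example above, while the activation choices are illustrative.

import tensorflow as tf

# 50x50 input image, one hidden layer of 100 neurons, 2 output neurons (dog/cat)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(50, 50)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),
])
print(model.count_params())   # 250302 = (2500*100 + 100) + (100*2 + 2)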
We use filters when using CNNs. Filters come in many different types according to their purpose.
Filters help us exploit the spatial locality of a particular image by enforcing a local connectivity pattern between
neurons. Convolution here means sliding a filter over the image and, at each location, taking the pointwise
multiplication of the filter with the underlying image patch and summing the result (a dot product). One of the two
functions involved is our image pixel matrix and the other is our filter. The resulting matrix is called an
"Activation Map" or "Feature Map".
There are multiple convolutional layers extracting features from the image and finally the output layer.
Image classification involves assigning labels or classes to input images. It is a supervised learning task where a
model is trained on labeled image data to predict the class of unseen images. CNNs are commonly used for
image classification because they can learn hierarchical features like edges, textures, and shapes, enabling accurate
object recognition in images. CNNs excel at this task because they can automatically extract meaningful spatial
features from images.
Input Layer
The input layer of a CNN takes in the raw image data as input. The images are typically represented as matrices
of pixel values. The dimensions of the input layer correspond to the size of the input images (e.g., height,
width, and color channels).
Convolutional Layers
Convolutional layers are responsible for feature extraction. They consist of filters (also known as kernels) that
are convolved with the input images to capture relevant patterns and features. These layers learn to detect
edges, textures, shapes, and other important visual elements.
Pooling Layers
Pooling layers reduce the spatial dimensions of the feature maps produced by the convolutional layers. They
perform downsampling operations (e.g., max pooling) to retain the most salient information while discarding
unnecessary details. This helps in achieving translation invariance and reducing computational complexity.
I will be working on Google Colab and I have connected the dataset through Google Drive, so the code provided
by me should work if the same setup is being used. Remember to make appropriate changes according to your
setup.
Choose a dataset of your interest or you can also create your own image dataset for solving your own image
classification problem. An easy place to choose a dataset is on kaggle.com.
The dataset I’m going with can be found here. This dataset contains 12,500 augmented images of blood cells
(JPEG) with accompanying cell type labels (CSV). There are approximately 3,000 images for each of 4 different cell
types grouped into 4 different folders (according to cell type). The cell types are Eosinophil, Lymphocyte,
Monocyte, and Neutrophil.
Here are all the libraries that we would require and the code for importing them:
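A typical import block for this workflow (a sketch, assuming TensorFlow/Keras together with NumPy, OpenCV, and scikit-learn; adjust it to your own setup) might look like:
import os
import cv2                      # reading and resizing the images
import numpy as np              # array handling
import tensorflow as tf         # building and training the CNN
from sklearn.model_selection import train_test_split   # splitting the data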
X = []
y = []
for features, label in training:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
Step 6: Normalising X and Converting Labels to Categorical Data
X = X.astype('float32')
X /= 255
from tensorflow.keras.utils import to_categorical
Y = to_categorical(y, 4)
print(Y[100])
print(Y.shape)
Step 7: Split X and Y for Use in CNN
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)
Note: because the model below is compiled with the sparse_categorical_crossentropy loss, the integer labels y are split here rather than the one-hot encoded Y.
Step 8: Define, Compile and Train the CNN Model
batch_size = 16
nb_classes =4
nb_epochs = 5
img_rows, img_columns = 200, 200
img_channel = 3
nb_filters = 32
nb_pool = 2
nb_conv = 3
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, (3,3), padding='same', activation=tf.nn.relu,
input_shape=(200, 200, 3)),
tf.keras.layers.MaxPooling2D((2, 2), strides=2),
tf.keras.layers.Conv2D(32, (3,3), padding='same', activation=tf.nn.relu),
tf.keras.layers.MaxPooling2D((2, 2), strides=2),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation=tf.nn.relu),
tf.keras.layers.Dense(4, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train, y_train, batch_size = batch_size, epochs = nb_epochs, verbose = 1, validation_data = (X_test,
y_test))
Step 9: Accuracy and Score of Model
score = model.evaluate(X_test, y_test, verbose = 0 )
print("Test Score: ", score[0])
print("Test accuracy: ", score[1])
In these 9 simple steps, you would be ready to train your own Convolutional Neural Network model and solve
real-world problems using these skills. You can practice these skills on platforms like Analytics Vidhya and Kaggle.
You can also play around with different parameters and discover how to get the best accuracy and score. Try
changing the batch size, the number of epochs, or even adding/removing layers in the CNN model, and observe
how the results change.
CNN image classification has revolutionized the field of computer vision, enabling accurate recognition of
objects within images. With its ability to automatically learn and extract complex features, CNNs have become
a powerful tool for various applications.
➢ Image classification is a task in machine learning that involves categorizing images into
predefined classes or labels. It is achieved by training a machine learning algorithm on a
dataset of labeled images, which allows the algorithm to learn patterns and features that
differentiate one class from another. Once trained, the algorithm can classify new, unseen
images into the appropriate classes.
➢ Image classification is the task of assigning a label to an image based on its content. For
example, an image classifier can recognize whether an image contains a cat, a dog, a flower, or
a car. Image classification is one of the most common applications of machine learning and
computer vision.
There are different approaches to perform image classification using machine learning, depending on
the type of features and algorithms used. Some of the most popular approaches are:
➢ Multilayer Perceptron (MLP): This approach treats an image as a vector of pixel values, and
feeds it directly to a neural network with multiple hidden layers. The neural network learns to
extract features and classify the image in an end-to-end manner.
MLP is more powerful than BoVW (Bag of Visual Words, a classical approach that represents an image as a histogram of local feature descriptors), as it can learn non-linear and complex features from the raw
pixels. However, MLP is also more prone to overfitting, as it has many parameters to tune and
requires a large amount of training data.
Moreover, MLP does not take advantage of the spatial structure and the local patterns in the
image, as it treats each pixel independently.
➢ Convolutional Neural Network (CNN): This approach is a special type of neural network that
uses convolutional layers to extract features from the image.
A convolutional layer consists of a set of filters that slide over the image and produce a feature
map, which captures the presence of certain patterns or shapes in the image. By stacking
multiple convolutional layers, the network can learn hierarchical and abstract features from the
image, such as edges, textures, shapes, and objects.
CNN also uses pooling layers to reduce the dimensionality and increase the invariance of the
features. A pooling layer applies a function, such as max or average, to a local region of the
feature map and outputs a single value. CNN is followed by one or more fully connected layers,
which perform the final classification.
CNN is the most advanced and successful approach for image classification, as it can learn high-
level and semantic features from the image, and exploit the spatial structure and the local
patterns in the image. CNN also requires fewer parameters and less training data than MLP, as it
shares the weights of the filters across the image.
➢ Transfer Learning: This approach leverages the knowledge and the features learned by a pre-
trained CNN on a large and generic dataset, such as ImageNet, and applies it to a new and
specific dataset.
Transfer learning can be done in two ways: feature extraction and fine-tuning. Feature extraction
involves using the pre-trained CNN as a fixed feature extractor, and feeding its output to a new
classifier, such as SVM or MLP. Fine-tuning involves updating the weights of the pre-trained CNN,
or some of its layers, using the new dataset.
Transfer learning is useful when the new dataset is small or similar to the original dataset, as it
can improve the performance and reduce the training time of the classifier.
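To make the feature-extraction flavour of transfer learning concrete, here is a minimal Keras sketch (the choice of MobileNetV2 and a 4-class head are illustrative assumptions, not part of the discussion above):
import tensorflow as tf
# pre-trained convolutional base, frozen so only the new head is trained
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False,
                                         weights='imagenet')
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation='softmax')   # new classifier head for 4 classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])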
These are some of the main approaches to perform image classification using machine learning. Each
approach has its own advantages and disadvantages, and the choice of the best approach depends on
the characteristics and the requirements of the problem.
• Machine learning approaches, on the other hand, have revolutionized image classification by
automating the feature extraction process. These methods, particularly deep learning models,
can learn hierarchical representations of images directly from raw pixel data, capturing complex
patterns and relationships that may be difficult to define manually.
• The size and the quality of your dataset: If you have a large and diverse dataset, you can use
more complex and powerful approaches, such as CNN or MLP, to learn from the raw pixels. If
you have a small or noisy dataset, you can use simpler and faster approaches, such as BoVW, or
use transfer learning to leverage the features learned by a pre-trained CNN.
• The similarity and the complexity of your classes: If your classes are very similar or very
complex, you need more discriminative and abstract features, which can be obtained by using
CNN or transfer learning. If your classes are very different or very simple, you can use more
generic and low-level features, which can be obtained by using BoVW or MLP.
• The computational resources and the time constraints: If you have limited resources or time,
you can use more efficient and scalable approaches, such as BoVW or feature extraction, which
require less parameters and less training time. If you have more resources or time, you can use
more expressive and accurate approaches, such as CNN or fine-tuning, which require more
parameters and more training time.
These are some of the general guidelines to help you choose the best approach for your problem.
However, there is no definitive answer, and you may need to experiment with different approaches and
compare their results to find the optimal solution for your problem.
• Support Vector Machines (SVMs): SVMs are a powerful classification algorithm that finds a
hyperplane that best separates data points of different classes.
• Random Forests: Random forests are ensemble methods that combine multiple decision trees to
improve classification accuracy.
• K-Nearest Neighbors (KNN): KNN is a non-parametric algorithm that classifies an image based on
the labels of its k nearest neighbors in the feature space.
• Deep Learning Models: Deep learning models, particularly CNNs, have achieved state-of-the-art
results in image classification tasks.
• Precision: Precision is the proportion of positive predictions that are actually correct, i.e. TP / (TP + FP).
• Recall: Recall is the proportion of actual positive cases that are correctly identified, i.e. TP / (TP + FN).
• F1-score: The F1-score is the harmonic mean of precision and recall, 2 · (Precision · Recall) / (Precision + Recall),
providing a balanced measure of classification performance (a short computation example follows the list below).
• Object Recognition: Identifying and classifying objects in images, such as cars, pedestrians, and
animals, has applications in autonomous vehicles, surveillance systems, and robotics.
• Medical Diagnosis: Classifying medical images, such as X-rays and MRI scans, can assist doctors
in diagnosing diseases and identifying abnormalities.
• Content-Based Image Retrieval: Enabling users to search for images based on their content,
such as finding images of cats or landscapes.
• Satellite Image Analysis: Classifying land cover types and identifying features in satellite imagery
for environmental monitoring and urban planning.
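As a quick illustration of the evaluation metrics listed above, they can be computed with scikit-learn (the labels and predictions below are made up for the example):
from sklearn.metrics import precision_score, recall_score, f1_score
# hypothetical ground-truth labels and model predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))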
Conclusion
Image classification has become an indispensable tool in various fields, driven by advancements in
machine learning, particularly deep learning. With the increasing availability of labeled image data and
computational resources, machine learning approaches are continuously pushing the boundaries of
image classification performance, enabling new and groundbreaking applications.
A decision tree is one of the most powerful tools of supervised learning algorithms used for both
classification and regression tasks.
It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (terminal node) holds a class label. It is constructed by
recursively splitting the training data into subsets based on the values of the attributes until a stopping
criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split
a node.
During training, the Decision Tree algorithm selects the best attribute to split the data based on a metric
such as entropy or Gini impurity, which measures the level of impurity or randomness in the subsets.
The goal is to find the attribute that maximizes the information gain or the reduction in impurity after the
split.
• A decision tree can be used for both classification and regression problems, where the goal is to assign
a label or a value to a given input based on a set of rules or criteria.
• A decision tree can be constructed by recursively splitting the data into smaller and smaller subsets,
based on the values of one or more features, until a stopping criterion is met, such as the maximum
depth of the tree, the minimum number of samples in a node, or the purity of the node. The feature
that is used to split the data at each node is chosen by a splitting criterion, such as the information gain
or the Gini index, which measures how much the split reduces the impurity or the uncertainty in the
data. The impurity of a node is a measure of how mixed or homogeneous the samples in the node are,
in terms of their labels or values. The lower the impurity, the more confident the prediction.
• A decision tree can be easily interpreted and understood, as it mimics the human way of thinking and
reasoning. It can also handle both numerical and categorical features, and can deal with missing values
and outliers. However, a decision tree can also suffer from some drawbacks, such as overfitting,
instability, and bias.
Entropy
The entropy of a dataset sample S with K classes is defined as:
Entropy(S) = - Σ (k = 1 to K) p(k) · log2 p(k)
Where,
• S is the dataset sample,
• k is a particular class out of the K classes, and
• p(k) is the proportion of the data points that belong to class k out of the total number of data points in the
dataset sample S.
Gini Impurity
Gini Impurity is a score that evaluates how accurate a split is among the classified groups. The Gini Impurity
evaluates a score in the range between 0 and 1, where 0 is when all observations belong to one class, and 1
is a random distribution of the elements within classes. In this case, we want to have a Gini index score as low
as possible. Gini Index is the evaluation metric we shall use to evaluate our Decision Tree Model.
The Gini Impurity of a set with C categories is computed as:
Gini = 1 - Σ (i = 1 to C) pi²
Here, pi is the proportion of elements in the set that belong to the ith category.
Information Gain:
Information gain measures the reduction in entropy or variance that results from splitting a dataset based on
a specific property. It is used in decision tree algorithms to determine the usefulness of a feature by
partitioning the dataset into more homogeneous subsets with respect to the class labels or target variable.
The higher the information gain, the more valuable the feature is in predicting the target variable.
The information gain of an attribute A, with respect to a dataset S, is calculated as follows:
Gain(S, A) = H(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) · H(Sv)
where
• A is the specific attribute or class label,
• H(S) is the entropy of the dataset sample S, and
• Sv is the subset of instances in S that have the value v for attribute A, |Sv| is its size, and H(Sv) is its entropy.
Information gain measures the reduction in entropy or variance achieved by partitioning the dataset on
attribute A. The attribute that maximizes information gain is chosen as the splitting criterion for building the
decision tree.
Information gain is used in both classification and regression decision trees. In classification, entropy is used as
a measure of impurity, while in regression, variance is used as a measure of impurity. The information gain
calculation remains the same in both cases, except that variance replaces entropy as the impurity measure in the
formula for regression trees.
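These impurity measures are straightforward to compute directly; the small NumPy sketch below (with a made-up toy split) illustrates entropy, Gini Impurity, and information gain:
import numpy as np

def entropy(labels):
    # H(S) = - sum_k p(k) * log2 p(k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum_i p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # entropy of the parent minus the weighted entropy of the two child subsets
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

# toy example: 10 labels split into two subsets of 5
parent = ['yes'] * 5 + ['no'] * 5
left = ['yes'] * 4 + ['no']
right = ['yes'] + ['no'] * 4
print(gini(parent), information_gain(parent, left, right))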
The decision tree operates by analyzing the data set to predict its classification. It commences from the tree’s
root node, where the algorithm views the value of the root attribute compared to the attribute of the record in
the actual data set. Based on the comparison, it proceeds to follow the branch and move to the next node.
The algorithm repeats this action for every subsequent node by comparing its attribute values with those of the
sub-nodes and continuing the process further. It repeats until it reaches the leaf node of the tree. The
complete mechanism can be better explained through the algorithm given below.
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; such final
nodes are called leaf nodes. This procedure is the basis of the Classification and Regression Tree (CART) algorithm.
Dataset 2:
That is, the first case has lower Gini Impurity and is the chosen split. In this simple example, only one
feature remains, and we can build the final decision tree.
Final Decision Tree considering only the features ‘likes gravity’ and ‘likes dogs’
Until now, we considered only a subset of our data set - the categorical variables. Now we will add the numerical
variable ‘age’. The criterion for splitting is the same. We already know the Gini Impurities for ‘likes gravity’ and
‘likes dogs’. The calculation of the Gini Impurity for a numerical variable is similar; however, it takes more
calculations, because every candidate split threshold of the numerical variable has to be evaluated.
We can see that the Gini Impurity of all possible ‘age’ splits is higher than the one for ‘likes gravity’ and
‘likes dogs’. The lowest Gini Impurity is, when using ‘likes gravity’, i.e. this is our root node and the first
split.
The first split of the tree. ‘likes gravity’ is the root node.
The subset Dataset 2 is already pure, that is, this node is a leaf and no further splitting is necessary. The branch on
the left-hand side, Dataset 1 is not pure and can be split further. We do this in the same way as before: We
calculate the Gini Impurity for each feature: ‘likes dogs’ and ‘age’.
Let the data available at node m be Qm with nm samples, and let tm be a candidate threshold for node m. A candidate
split θ = (feature, tm) partitions Qm into a left subset Qm_left(θ) and a right subset Qm_right(θ), and the
classification and regression tree (CART) criterion for the split can be written as:
G(Qm, θ) = (nm_left / nm) · H(Qm_left(θ)) + (nm_right / nm) · H(Qm_right(θ))
Here,
• H is the measure of impurity of the left and right subsets at node m; it can be entropy or Gini impurity.
• nm_left and nm_right are the numbers of instances in the left and right subsets at node m.
To select the splitting parameter, we choose the split that minimizes this criterion:
θ* = argmin over θ of G(Qm, θ)
# Import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source
# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=2)
tree_clf.fit(X, y)
# Export the trained tree to a .dot file so it can be rendered with graphviz
export_graphviz(tree_clf, out_file="iris_tree.dot",
                feature_names=iris.feature_names,
                class_names=iris.target_names,
                rounded=True, filled=True)
with open("iris_tree.dot") as f:
    dot_graph = f.read()
Source(dot_graph)
Output:
• To use this decision tree, we need to answer the questions or conditions at each node, starting from the
root node. For example, if we have a person who is 35 years old, male, and has an income of $50,000, we
would follow the path:
The right child node is a leaf node, which predicts that the person will not buy the product. Therefore, the
decision tree gives us a negative prediction for this person.
There are different algorithms to build a decision tree, such as ID3, C4.5, CART, and CHAID. These algorithms
differ in the way they handle the splitting criterion, the stopping criterion, the pruning technique, and the
handling of missing values and continuous features. To use a decision tree for image classification, one needs to
extract relevant features from the images, such as pixel values, color histograms, or other image descriptors, and
feed them to the decision tree algorithm.
• It is easy to understand and interpret, as it visualizes the decision-making process and the logic behind
the prediction.
• It can handle both numerical and categorical features, and can deal with missing values and outliers by
using different strategies, such as ignoring, replacing, or splitting.
• It can perform feature selection and dimensionality reduction, as it chooses the most relevant and
informative features to split the data.
• It is fast and scalable, as it can handle large datasets and perform parallel computations.
• It can overfit the data, especially if the tree is too deep or too complex, and capture the noise or the
outliers in the data, leading to poor generalization and high variance.
• It can be unstable, as small changes in the data or the parameters can result in large changes in the
structure and the prediction of the tree, leading to high sensitivity and low robustness.
• It can be biased, as some features or splits may be favored over others, depending on the splitting
criterion, the data distribution, and the order of the features, leading to poor accuracy and high error.
• Tuning the parameters, such as the maximum depth, the minimum samples, the splitting criterion, the
pruning technique, etc., to find the optimal balance between the complexity and the accuracy of the
tree.
• Using cross-validation, such as k-fold or leave-one-out, to evaluate the performance of the tree on
different subsets of the data, and to avoid overfitting and underfitting.
• Using ensemble methods, such as bagging, boosting, or random forest, to combine multiple decision
trees, and to reduce the variance, the bias, and the error of the prediction.
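As a concrete illustration of the ensemble remedy mentioned above, a random forest can be trained with scikit-learn in a few lines (the iris dataset and the parameter values are just example choices):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
# 100 trees, each grown on a bootstrap sample with a random subset of features at every split
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
print(cross_val_score(forest, iris.data, iris.target, cv=5).mean())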
5.11 Support Vector Machines
Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear
classification, regression, and even outlier detection tasks.
SVMs can be used for a variety of tasks, such as text classification, image classification, spam
detection, handwriting identification, gene expression analysis, face detection, and anomaly detection. SVMs
are adaptable and efficient in a variety of applications because they can manage high-dimensional data and
nonlinear relationships.
SVM algorithms are very effective as we try to find the maximum separating hyperplane between the different
classes available in the target feature.
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of
different classes in a feature space. In the case of linear classifications, it will be a linear equation
i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, and they play a
critical role in deciding the hyperplane and the margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main objective of
the support vector machine algorithm is to maximize the margin. The wider margin indicates
better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original input data
points into high-dimensional feature spaces, so, that the hyperplane can be easily found out even if
the data points are not linearly separable in the original input space. Some of the common kernel
functions are linear, polynomial, radial basis function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that
properly separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft
margin technique. Each data point has a slack variable introduced by the soft-margin SVM
formulation, which softens the strict margin requirement and permits certain misclassifications or
violations. It discovers a compromise between increasing the margin and reducing violations.
7. C: Margin maximisation and misclassification penalties are balanced by the regularisation parameter C
in SVM. The penalty for going over the margin or misclassifying data items is decided by it. A
stricter penalty is imposed with a greater value of C, which results in a smaller margin and perhaps
fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or
margin violations. The objective function in SVM is frequently formed by combining it with the
regularisation term.
9. Dual Problem: A dual Problem of the optimisation problem that requires locating the Lagrange
multipliers related to the support vectors can be used to solve SVM. The dual formulation enables
the use of kernel tricks and more efficient computation.
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because we are
considering only two input features x1, x2) that segregate our data points or do a classification between red and
blue circles. So how do we choose the best line or in general the best hyperplane that segregates our data
points?
One reasonable choice as the best hyperplane is the one that represents the largest separation or margin
between the two classes.
Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset
consisting of input feature vectors X and their corresponding class labels Y.
The equation for the linear hyperplane can be written as:
w · x + b = 0
The vector w represents the normal vector to the hyperplane, i.e. the direction perpendicular to the
hyperplane. The parameter b in the equation represents the offset or distance of the hyperplane from the
origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as:
d_i = (w · x_i + b) / ||w||
Optimization:
• For the hard margin linear SVM classifier, the optimization problem is:
minimize (1/2) · ||w||²  subject to  t_i (w · x_i + b) ≥ 1 for every training instance i
The target variable or label for the ith training instance is denoted by the symbol t_i in this statement, with t_i = -1
for negative instances (when y_i = 0) and t_i = +1 for positive instances (when y_i = 1). The constraint requires every
training point to lie on the correct side of the margin.
• Dual Problem: A dual problem of the optimisation problem, which requires locating the Lagrange
multipliers related to the support vectors, can be used to solve SVM. We look for the optimal Lagrange
multipliers α_i that maximize the following dual objective function:
maximize  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j t_i t_j K(x_i, x_j)
where,
• αi is the Lagrange multiplier associated with the ith training sample.
• K(xi, xj) is the kernel function that computes the similarity between two samples xi and xj. It
allows SVM to handle nonlinear classification problems by implicitly mapping the samples into a
higher-dimensional feature space.
• The term ∑αi represents the sum of all Lagrange multipliers.
The SVM decision boundary can be described in terms of these optimal Lagrange multipliers and the support
vectors once the dual problem has been solved and the optimal Lagrange multipliers have been discovered. The
training samples that have α_i > 0 are the support vectors, and the decision function is given by:
f(x) = sign( Σ_i α_i t_i K(x_i, x) + b )
The SVM kernel is a function that takes low-dimensional input space and transforms it into higher-dimensional
space, ie it converts nonseparable problems to separable problems. It is mostly useful in non-linear separation
problems. Simply put the kernel, does some extremely complex data transformations and then finds out the
process to separate the data based on the labels or outputs defined.
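For instance, in scikit-learn the kernel is just a parameter of the classifier; the sketch below trains an RBF-kernel SVM on synthetic, non-linearly separable data (the dataset and parameter values are illustrative assumptions):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # two interleaving half-moons, not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale')             # RBF kernel implicitly maps the data to a higher-dimensional space
rbf_svm.fit(X_train, y_train)
print(rbf_svm.score(X_test, y_test))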
Advantages of SVM
• Effective in high-dimensional cases.
• Its memory is efficient as it uses a subset of training points in the decision function called support
vectors.
• Different kernel functions can be specified for the decision functions and its possible to specify
custom kernels.
Code
# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC
# Load the dataset; keep only the first two features so the decision boundary can be plotted
cancer = load_breast_cancer()
X, y = cancer.data[:, :2], cancer.target
# Build and train the SVM model (an RBF kernel is assumed here)
svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
# Plot the decision regions of the trained classifier
DecisionBoundaryDisplay.from_estimator(svm, X, response_method="predict", alpha=0.5)
# Scatter plot
plt.scatter(X[:, 0], X[:, 1],
            c=y,
            s=20, edgecolors="k")
plt.show()
Output:
Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the
goal is to predict the probability that an instance belongs to a given class. Although it is used for classification,
it is called logistic regression because it takes the output of a linear regression function as input and uses a
sigmoid function to estimate the probability for the given class.
The difference between linear regression and logistic regression is that linear regression output is the
continuous value that can be anything while logistic regression predicts the probability that an instance
belongs to a given class or not.
Logistic Regression:
It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is much like Linear Regression except in how it is used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic function,
which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within the range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a
curve like the “S” shape.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of
either 0 or 1: values above the threshold tend towards 1, and values below the threshold tend
towards 0.
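A tiny NumPy illustration of the sigmoid and the threshold rule (the input values are chosen arbitrarily):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))            # maps any real value into (0, 1)

z = np.array([-3.0, 0.0, 2.5])
probs = sigmoid(z)
print(probs, (probs >= 0.5).astype(int))   # probabilities and the thresholded class labels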
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as “low”, “Medium”, or “High”.
Terminologies involved in Logistic Regression:
Here are some common terms involved in logistic regression:
• Independent variables: The input characteristics or predictor factors applied to the dependent
variable’s predictions.
• Dependent variable: The target variable in a logistic regression model, which we are trying to
predict.
• Logistic function: The formula used to represent how the independent and dependent variables
relate to one another. The logistic function transforms the input variables into a probability value
between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
• Odds: It is the ratio of something occurring to something not occurring. it is different from
probability as the probability is the ratio of something occurring to everything that could possibly
occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In
logistic regression, the log odds of the dependent variable are modeled as a linear combination of
the independent variables and the intercept.
• Coefficient: The logistic regression model’s estimated parameters, show how the independent and
dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log odds when all
independent variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the logistic
regression model, which maximizes the likelihood of observing the data given the model.
Here x_i is the ith observation of X, w = (w_1, w_2, ..., w_m) is the vector of weights or coefficients, and b is the bias
term, also known as the intercept. This linear combination can simply be represented as the dot product of the weights
and the features plus the bias: z = w · x_i + b.
Sigmoid Function
Now we use the sigmoid function, where the input is z, and we obtain a probability between 0 and 1, i.e. the
predicted y:
σ(z) = 1 / (1 + e^(-z))
As shown in the figure above, the sigmoid function converts the continuous variable z into a probability between 0 and 1.
The odds are the ratio of something occurring to something not occurring. They are different from probability, as
probability is the ratio of something occurring to everything that could possibly occur. So the odds will be:
odds = p / (1 - p)
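A minimal scikit-learn sketch of binomial logistic regression (the breast cancer dataset and the split parameters are assumptions, so the exact accuracy you obtain may differ slightly from the output shown below):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
clf = LogisticRegression(max_iter=10000)        # binomial logistic regression
clf.fit(X_train, y_train)
print("Logistic Regression model accuracy (in %):",
      accuracy_score(y_test, clf.predict(X_test)) * 100)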
Output:
Logistic Regression model accuracy (in %): 95.6140350877193
Multinomial Logistic Regression
The target variable can have 3 or more possible types which are not ordered (i.e. the types have no quantitative
significance), like “disease A” vs “disease B” vs “disease C”.
In this case, the softmax function is used in place of the sigmoid function. The softmax function for K classes is:
softmax(z_j) = e^(z_j) / Σ (k = 1 to K) e^(z_k)
In Multinomial Logistic Regression, the output variable can have more than two possible discrete outputs.
Consider the Digit Dataset.
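A minimal sketch of multinomial logistic regression on scikit-learn's digits dataset (the split parameters are assumptions, so the exact accuracy may differ from the output below):
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)             # 10 unordered digit classes (0-9)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression(max_iter=10000)        # a softmax (multinomial) formulation is used for more than two classes
clf.fit(X_train, y_train)
print("Logistic Regression model accuracy(in %):",
      accuracy_score(y_test, clf.predict(X_test)) * 100)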
Output:
Logistic Regression model accuracy(in %): 96.52294853963839
Ordinal Logistic Regression
It deals with target variables with ordered categories. For example, a test score can be categorized as: “very
poor”, “poor”, “good”, or “very good”. Here, each category can be given a score like 0, 1, 2, or 3.
Applying steps in logistic regression modeling:
The following are the steps involved in logistic regression modeling:
• Define the problem: Identify the dependent variable and independent variables and determine if
the problem is a binary classification problem.
• Data preparation: Clean and preprocess the data, and make sure the data is suitable for logistic
regression modeling.
• Exploratory Data Analysis (EDA): Visualize the relationships between the dependent and
independent variables, and identify any outliers or anomalies in the data.
• Feature Selection: Choose the independent variables that have a significant relationship with the
dependent variable, and remove any redundant or irrelevant features.
• Model Building: Train the logistic regression model on the selected independent variables and
estimate the coefficients of the model.
• Model Evaluation: Evaluate the performance of the logistic regression model using appropriate
metrics such as accuracy, precision, recall, F1-score, or AUC-ROC.
• Model improvement: Based on the results of the evaluation, fine-tune the model by adjusting the
independent variables, adding new features, or using regularization techniques to reduce
overfitting.
• Model Deployment: Deploy the logistic regression model in a real-world scenario and make
predictions on new data.
In the case of a Precision-Recall tradeoff, we use the following arguments to decide upon the threshold:
1. Low Precision/High Recall: In applications where we want to reduce the number of false negatives
without necessarily reducing the number of false positives, we choose a decision value that has a
low value of Precision or a high value of Recall. For example, in a cancer diagnosis application, we
do not want any affected patient to be classified as not affected without giving much heed to if the
patient is being wrongfully diagnosed with cancer. This is because the absence of cancer can be
detected by further medical tests, but the presence of the disease cannot be detected in an
already rejected candidate.
2. High Precision/Low Recall: In applications where we want to reduce the number of false positives
without necessarily reducing the number of false negatives, we choose a decision value that has a
high value of Precision or a low value of Recall. For example, if we are classifying customers
whether they will react positively or negatively to a personalized advertisement, we want to be
absolutely sure that the customer will react positively to the advertisement because otherwise, a
negative reaction can cause a loss of potential sales from the customer.
The snippet below detects prominent lines in an image (for example, the palm lines in a photo of a hand) using histogram equalization followed by the probabilistic Hough transform:
import cv2
import numpy as np
# save_image_file() is assumed to be a small helper that writes intermediate results to disk
def save_image_file(img, name):
    cv2.imwrite(name + ".jpg", img)
# 'file' holds the path to the input image
original = cv2.imread(file)
img = cv2.cvtColor(original, cv2.COLOR_BGR2GRAY)
save_image_file(img, "gray")
img = cv2.equalizeHist(img)
save_image_file(img, "equalize")
lined = np.copy(original) * 0
lines = cv2.HoughLinesP(img, 1, np.pi / 180, 15, np.array([]), 50, 20)
for line in lines:
    for x1, y1, x2, y2 in line:
        cv2.line(lined, (x1, y1), (x2, y2), (0, 0, 255))
save_image_file(lined, "lined")
The objective of this program is to use OpenCV, a popular computer vision library, to detect and track faces and eyes in
real-time using a webcam. The program utilizes pre-trained Haar Cascade classifiers, which are machine learning models
used for object detection. These classifiers can identify specific features of objects, such as faces and eyes, in an image or
video stream.
Before running the code, make sure your system meets the following requirements:
a. Software Requirements:
• Python 2.7.x: The original code was written for Python 2.7, which works only with older versions of OpenCV. Consider
using Python 3.x and OpenCV 4.x for better performance and support.
• NumPy: This library is used for handling arrays and matrices.
• OpenCV 2.x: The OpenCV version originally used here is from the 2.x series, which is compatible with Python 2.7.
b. Installation Steps:
Let's go through the code line by line and explain its functionality:
a. Import Libraries
• cv2.CascadeClassifier(): Loads the Haar Cascade classifier XML file, which is used for detecting objects
(faces and eyes in this case).
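The corresponding import and classifier-loading lines typically look like the following sketch (assuming a modern OpenCV build where cv2.data.haarcascades points to the bundled XML files):
import cv2
import numpy as np

# load the pre-trained Haar Cascade classifiers shipped with OpenCV
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')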
cap = cv2.VideoCapture(0) # Capture video from the default camera (0 for the default webcam)
• cv2.VideoCapture(0): Initializes the video capture from the default camera. You can use 1 or another number if
you're using an external camera.
while True:
# Read a frame from the camera
ret, img = cap.read() # `ret` indicates if the frame was read successfully, `img` is the
frame itself
• Loop: The while True loop runs indefinitely, continuously capturing frames from the webcam.
• cv2.cvtColor(): Converts the captured frame from color (BGR) to grayscale. Grayscale images simplify
processing and are more efficient for detection.
e. Face Detection
• detectMultiScale():
o Detects objects (faces) of varying sizes in the input image.
o scaleFactor: Specifies how much the image size is reduced at each image scale. A value of 1.3 means
that the image is reduced by 30% at each scale.
o minNeighbors: Specifies how many neighbors each candidate rectangle should have to retain it. A value
of 5 works well for face detection.
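Putting the grayscale conversion and the face detection call together, the corresponding lines look roughly like this (variable names are assumed to match the snippets above):
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                          # grayscale copy of the frame
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)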
f. Draw Rectangles Around Detected Faces
• cv2.rectangle(): Draws a rectangle around the detected face with color (255, 255, 0) (cyan in OpenCV's BGR ordering) and thickness 2.
• roi_gray and roi_color: Define the region of interest (ROI) within the frame to be used for eye detection.
• Detect eyes: The eye_cascade.detectMultiScale() method detects eyes within the roi_gray, which is the
gray-scale version of the detected face.
• Draw rectangles: For each detected eye, cv2.rectangle() is used to draw rectangles around them with color
(0, 127, 255) (orange) and thickness 2.
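A sketch of the face-and-eye drawing loop described above (using the same variable names):
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 0), 2)      # rectangle around the face
    roi_gray = gray[y:y + h, x:x + w]                                 # face region used for eye detection
    roi_color = img[y:y + h, x:x + w]                                 # same region in the colour frame
    eyes = eye_cascade.detectMultiScale(roi_gray)
    for (ex, ey, ew, eh) in eyes:
        cv2.rectangle(roi_color, (ex, ey), (ex + ew, ey + eh), (0, 127, 255), 2)  # rectangle around each eye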
cv2.imshow('img', img) # Display the image with detected faces and eyes
• cv2.imshow(): Opens a window named 'img' to display the current frame with rectangles drawn around faces and
eyes.
j. Release Resources
• Release the webcam: cap.release() releases the webcam so other applications can use it.
• Close the window: cv2.destroyAllWindows() closes all the OpenCV windows that were opened.
4. Expected Output
When you run the script, a window titled img will display the webcam feed with rectangles drawn around detected faces
and eyes. The program will continue running until you press the Esc key to stop it.
• Detect other objects: Train or use pre-trained Haar Cascade classifiers for other objects (e.g., cars, animals).
• Use Python 3.x and OpenCV 4.x: Consider updating to Python 3.x and the latest OpenCV library for enhanced
features and better support.
• Adjust detection parameters: Modify the scaleFactor and minNeighbors parameters for optimal detection
based on lighting and camera quality.
• Poor detection in low light: Ensure good lighting conditions for better detection.
• False positives/negatives: Adjust the parameters or use more robust classifiers like deep learning-based detectors
(e.g., DNNs with OpenCV).
• Compatibility issues: Use the latest version of Python and OpenCV for improved performance.
7. Potential Improvements
• Add face tracking: Integrate a tracking algorithm (e.g., KLT optical flow via cv2.calcOpticalFlowPyrLK, or a
MOSSE tracker from the opencv-contrib package) to track faces across multiple frames.
• Integrate facial recognition: Combine this code with facial recognition libraries like face_recognition for
identifying and matching specific faces.
• Enhance user interface: Use cv2.putText() to label detected faces or show relevant information on the display.
Conclusion
This study material provides you with an understanding of how to implement real-time face and eye detection using
OpenCV. With this foundation, you can experiment with more complex computer vision tasks and extend the program
for specific use cases like surveillance, access control, and more.
5.15 Recognizing Faces
We will build a detector to identify the human face in a photo from Unsplash. Make sure to save the picture to
your working directory and rename it to input_image before coding along.
Now, let’s import OpenCV and enter the input image path with the following lines of code:
import cv2
imagePath = 'input_image.jpg'
img = cv2.imread(imagePath)
This will load the image from the specified file path and return it in the form of a NumPy array. We can inspect its shape:
img.shape
(4000, 2667, 3)
Notice that this is a 3-dimensional array. The array’s values represent the picture’s height, width, and channels
respectively. Since this is a color image, there are three channels used to depict it - blue, green, and red (BGR).
Note that while the conventional sequence used to represent images is RGB (Red, Blue, Green), the OpenCV library
uses the opposite layout (Blue, Green, Red).
Next, we convert the image to grayscale, since the classifier works on single-channel input:
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray_image.shape
(4000, 2667)
Notice that this array only has two values since the image is grayscale and no longer has the third color channel.
Let’s load the pre-trained Haar Cascade classifier that is built into OpenCV:
face_classifier = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
Notice that we are using a file called haarcascade_frontalface_default.xml. This classifier is designed specifically for
detecting frontal faces in visual input.
OpenCV also provides other pre-trained models to detect different objects within an image - such as a person’s
eyes, smile, upper body, and even a vehicle’s license plate. You can learn more about the different classifiers built
into OpenCV by examining the library’s GitHub repository.
We can now perform face detection on the grayscale image using the classifier we just loaded:
face = face_classifier.detectMultiScale(
    gray_image, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40)
)
Let’s break down the methods and parameters specified in the above code:
1. detectMultiScale():
The detectMultiScale() method is used to identify faces of different sizes in the input image.
2. gray_image:
The first parameter in this method is gray_image, which is the grayscale image we created previously.
3. scaleFactor:
This parameter is used to scale down the size of the input image to make it easier for the algorithm to detect larger
faces. In this case, we have specified a scale factor of 1.1, indicating that we want to reduce the image size by 10%.
4. minNeighbors:
The cascade classifier applies a sliding window through the image to detect faces in it. You can think of these
windows as rectangles.
Initially, the classifier will capture a large number of false positives. These are eliminated using
the minNeighbors parameter, which specifies the number of neighboring rectangles that need to be identified for an
object to be considered a valid detection.
To summarize, passing a small value like 0 or 1 to this parameter would result in a high number of false positives,
whereas a large number could lead to losing out on many true positives.
The trick here is to find a tradeoff that allows us to eliminate false positives while also accurately identifying true
positives.
5. minSize:
Finally, the minSize parameter sets the minimum size of the object to be detected. The model will ignore faces that
are smaller than the minimum size specified.
Now that the model has detected the faces within the image, let’s run the following lines of code to create a
bounding box around these faces:
for (x, y, w, h) in face:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 4)
The face variable is an array with four values: the x and y axis in which the faces were detected, and their width
and height. The above code iterates over the identified faces and creates a bounding box that spans across these
measurements.
The parameter 0,255,0 represents the color of the bounding box, which is green, and 4 indicates its thickness.
To display the image with the detected faces, we first need to convert the image from the BGR format to RGB:
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
Then, the result can be displayed with Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plt.imshow(img_rgb)
plt.axis('off')
plt.show()
The model has successfully detected the human face in this image and created a bounding box around it.
Step 1: Pre-Requisites
First, let’s go ahead and import the OpenCV library and load the Haar Cascade model just like we did in the
previous section. You can skip this block of code if you already ran it previously:
import cv2
face_classifier = cv2.CascadeClassifier(
cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
Now, we need to access our device’s camera to read a live stream of video data. This can be done with the
following code:
video_capture = cv2.VideoCapture(0)
Notice that we have passed the parameter 0 to the VideoCapture() function. This tells OpenCV to use the default
camera on our device. If you have multiple cameras attached to your device, you can change this parameter value
accordingly.
Now, let’s create a function to detect faces in the video stream and draw a bounding box around them:
def detect_bounding_box(vid):
    gray_image = cv2.cvtColor(vid, cv2.COLOR_BGR2GRAY)
    faces = face_classifier.detectMultiScale(gray_image, 1.1, 5, minSize=(40, 40))
    for (x, y, w, h) in faces:
        cv2.rectangle(vid, (x, y), (x + w, y + h), (0, 255, 0), 4)
    return faces
In this function, we are using the same codes as we did earlier to convert the frame into grayscale before
performing face detection.
Then, we are also detecting the face in this image using the same parameter values for scaleFactor, minNeighbors,
and minSize as we did previously.
Now, we need to create an indefinite while loop that will capture the video frame from our webcam and apply the
face detection function to it:
while True:
    result, video_frame = video_capture.read()  # read a frame from the webcam
    if result is False:
        break  # terminate the loop if the frame is not read successfully
    faces = detect_bounding_box(
        video_frame
    )  # apply the detection function to the video frame
    cv2.imshow(
        "My Face Detection Project", video_frame
    )  # display the processed frame in a window named "My Face Detection Project"
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break  # exit the loop when the "q" key is pressed
video_capture.release()
cv2.destroyAllWindows()
After running the above code, you should see a window called My Face Detection Project appear on the screen:
The algorithm should track your face and create a green bounding box around it regardless of where you move
within the frame.
In the frame above, the model recognizes my face and my picture on the driving license I’m holding up.
You can also test the efficacy of this model by holding up multiple pictures or by getting different people to stand
at various angles behind the camera. The model should be able to identify all human faces in different
backgrounds or lighting settings.
If you’d like to exit the program, you can press the “q” key on your keyboard to break out of the loop.
• Capturing and decoding video file: We will capture the video using VideoFileClip
object and after the capturing has been initialized every video frame is decoded (i.e.
converting into a sequence of images).
• Grayscale conversion of image: The video frames are in RGB format, RGB is
converted to grayscale because processing a single channel image is faster than
processing a three-channel colored image.
• Reduce noise: Noise can create false edges, therefore before going further, it’s
imperative to perform image smoothening. Gaussian blur is used to perform this
process. Gaussian blur is a typical image filtering technique for lowering noise and
enhancing image characteristics. The weights are selected using a Gaussian
distribution, and each pixel is subjected to a weighted average that considers the
pixels surrounding it. By reducing high-frequency elements and improving overall
image quality, this blurring technique creates softer, more visually pleasant images.
• Canny Edge Detector: It computes gradient in all directions of our blurred image and
traces the edges with large changes in intensity. For more explanation please go
through this article: Canny Edge Detector
• Region of Interest: This step is to take into account only the region covered by the
road lane. A mask is created here, which is of the same dimension as our road
image. Furthermore, bitwise AND operation is performed between each pixel of our
canny image and this mask. It ultimately masks the canny image and shows the
region of interest traced by the polygonal contour of the mask.
• Hough Line Transform: In image processing, the Hough transformation is a feature
extraction method used to find basic geometric objects like lines and circles. By
converting the picture space into a parameter space, it makes it possible to identify
shapes by accumulating voting points. We’ll use the probabilistic Hough Line
Transform in our algorithm. The Hough transformation has been extended to
address the computational complexity with the probabilistic Hough transformation.
In order to speed up processing while preserving accuracy in shape detection, it
randomly chooses a selection of picture points and applies the Hough
transformation solely to those points.
• Draw lines on the Image or Video: After identifying lane lines in our field of interest
using Hough Line Transform, we overlay them on our visual input(video
stream/image).
Dataset: To demonstrate the working of this algorithm we will be working on a video
file of a road. You can download the dataset from this GitHub link – Dataset
Note: This code is implemented in Google Colab. If you are working in any other editor,
you might have to make some alterations to the code, because Colab has some dependency
issues with OpenCV.
Steps to Implement Road Lane Detection
Step 1: Install the OpenCV library in Python (for example, pip install opencv-python moviepy) and import the packages used in the code below.
# Python3
import cv2
import numpy as np
# moviepy is used to read the input video and to write the processed output
from moviepy import editor
def frame_processor(image):
"""
Process the input frame to detect lane lines.
Parameters:
image: image of a road where one wants to detect lane lines
(we will be passing frames of video to this function)
"""
# convert the RGB image to Gray scale
grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# applying gaussian Blur which removes noise from the image
# and focuses on our region of interest
# size of gaussian kernel
kernel_size = 5
# Applying gaussian blur to remove noise from the frames
blur = cv2.GaussianBlur(grayscale, (kernel_size, kernel_size), 0)
# first threshold for the hysteresis procedure
low_t = 50
# second threshold for the hysteresis procedure
high_t = 150
# applying canny edge detection and save edges in a variable
edges = cv2.Canny(blur, low_t, high_t)
# since we are getting too many edges from our image, we apply
# a mask polygon to only focus on the road
# Will explain Region selection in detail in further steps
region = region_selection(edges)
# Applying hough transform to get straight lines from our image
# and find the lane lines
# Will explain Hough Transform in detail in further steps
hough = hough_transform(region)
#lastly we draw the lines on our resulting frame and return it as output
result = draw_lane_lines(image, lane_lines(image, hough))
return result
Output:
def hough_transform(image):
"""
Apply the probabilistic Hough Line Transform to find line segments in the input image.
Parameter:
image: masked edge image, i.e. the output of the Canny edge detector after region selection
"""
# Distance resolution of the accumulator in pixels.
rho = 1
# Angle resolution of the accumulator in radians.
theta = np.pi/180
# Only lines that are greater than threshold will be returned.
threshold = 20
# Line segments shorter than that are rejected.
minLineLength = 20
# Maximum allowed gap between points on the same line to link them
maxLineGap = 500
# function returns an array containing dimensions of straight lines
# appearing in the input image
return cv2.HoughLinesP(image, rho = rho, theta = theta, threshold = threshold,
minLineLength = minLineLength, maxLineGap = maxLineGap)
Output:
[[[284 180 382 278]]
def average_slope_intercept(lines):
    """Find the averaged (slope, intercept) of the left and right lane lines from the Hough segments."""
    left_lines, left_weights = [], []     # (slope, intercept) pairs and segment lengths
    right_lines, right_weights = [], []
    for line in lines:
        for x1, y1, x2, y2 in line:
            if x1 == x2:
                continue                  # skip vertical segments (undefined slope)
            slope = (y2 - y1) / (x2 - x1)
            intercept = y1 - slope * x1
            length = np.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
            if slope < 0:                 # negative slope -> left lane line
                left_lines.append((slope, intercept)); left_weights.append(length)
            else:                         # positive slope -> right lane line
                right_lines.append((slope, intercept)); right_weights.append(length)
    # length-weighted average, so longer segments influence the estimate more
    left_lane = np.dot(left_weights, left_lines) / np.sum(left_weights) if left_weights else None
    right_lane = np.dot(right_weights, right_lines) / np.sum(right_weights) if right_weights else None
    return left_lane, right_lane
def region_selection(image):
"""
Determine and cut the region of interest in the input image.
Parameters:
image: we pass here the output from canny where we have
identified edges in the frame
"""
# create an array of the same size as of the input image
mask = np.zeros_like(image)
# if you pass an image with more then one channel
if len(image.shape) > 2:
channel_count = image.shape[2]
ignore_mask_color = (255,) * channel_count
# our image only has one channel so it will go under "else"
else:
# color of the mask polygon (white)
ignore_mask_color = 255
# creating a polygon to focus only on the road in the picture
# we have created this polygon in accordance to how the camera was placed
rows, cols = image.shape[:2]
bottom_left = [cols * 0.1, rows * 0.95]
top_left = [cols * 0.4, rows * 0.6]
bottom_right = [cols * 0.9, rows * 0.95]
top_right = [cols * 0.6, rows * 0.6]
vertices = np.array([[bottom_left, top_left, top_right, bottom_right]],
dtype=np.int32)
# filling the polygon with white color and generating the final mask
cv2.fillPoly(mask, vertices, ignore_mask_color)
# performing Bitwise AND on the input image and mask to get only the edges on the
road
masked_image = cv2.bitwise_and(image, mask)
return masked_image
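frame_processor() above also calls lane_lines() and draw_lane_lines(); a sketch of these helpers, consistent with the rest of the pipeline, is given below (the 0.6 frame-height cut-off and the drawing colour are assumptions):
def pixel_points(y1, y2, line):
    # convert a (slope, intercept) pair into integer pixel end-points for drawing
    if line is None:
        return None
    slope, intercept = line
    if slope == 0:
        return None
    x1 = int((y1 - intercept) / slope)
    x2 = int((y2 - intercept) / slope)
    return ((x1, int(y1)), (x2, int(y2)))

def lane_lines(image, lines):
    # average the Hough segments and return drawable left/right lane lines
    left_lane, right_lane = average_slope_intercept(lines)
    y1 = image.shape[0]          # bottom of the frame
    y2 = y1 * 0.6                # a bit above the middle of the frame
    return pixel_points(y1, y2, left_lane), pixel_points(y1, y2, right_lane)

def draw_lane_lines(image, lines, color=[255, 0, 0], thickness=12):
    # draw the lane lines on a blank image and blend it with the original frame
    line_image = np.zeros_like(image)
    for line in lines:
        if line is not None:
            cv2.line(line_image, *line, color, thickness)
    return cv2.addWeighted(image, 1.0, line_image, 1.0, 0.0)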
# driver function
def process_video(test_video, output_video):
"""
Read input video stream and produce a video file with detected lane lines.
Parameters:
test_video: location of input video file
output_video: location where output video file is to be saved
"""
# read the video file using VideoFileClip without audio
input_video = editor.VideoFileClip(test_video, audio=False)
# apply the function "frame_processor" to each frame of the video
# will give more detail about "frame_processor" in further steps
# "processed" stores the output video
processed = input_video.fl_image(frame_processor)
# save the output video stream to an mp4 file
processed.write_videofile(output_video, audio=False)
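Finally, the driver function can be called on a sample clip (the file names here are placeholders):
# run the lane detection pipeline on an input video (placeholder file names)
process_video('input_road_video.mp4', 'output_with_lanes.mp4')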