REAL-TIME FACE DETECTION
AND TRACKING
USING VIOLA JONES ALGORITHM
PFA Internship, Computer Vision
ENSA TÉTOUAN
Face Detection and
Tracking
Supervised by: Prof. HADDI Ali
Mr. Monsef BENLAZAR
Prepared by: AGLILAH Iliass
EL-AOUAME Achraf
SUMMARY
I- Abstract
II- Company Presentation
III- Introduction
IV- Viola Jones Algorithm
V- Face Detection using Viola Jones Object Detection Framework
VI- Image Processing (Detection and Tracking)
VII- MATLAB Code of Viola Jones Algorithm
VIII- Simulink Model of Video and Image Edge Detection
IX- Conception of Three Axes Gimbal for Camera Support in CATIA V5
X- Conclusion
I- Abstract
Face recognition has been a very active research area over the past two
decades, and many attempts have been made to understand how human beings
recognize faces. It is widely accepted that face recognition may depend on
both componential information (such as the eyes, mouth, and nose) and
non-componential/holistic information (the spatial relations between these
features), though how these cues should be optimally integrated remains
unclear. One approach proposed in the literature uses Eigen/Fisher features
of multi-scaled face components together with artificial neural networks;
its basic idea is to construct a facial feature vector by down-sampling
face components such as the eyes, nose, mouth, and the whole face at
different resolutions, according to the significance of each component.
In this report, we use the Viola-Jones approach to detect a face and track
it continuously. A video sequence provides more information than a still
image, but tracking a target object in live video is always a challenging
task: illumination changes, pose variation, and occlusion must be handled
in the pre-processing stages. These difficulties can be overcome by
re-detecting the target object in each and every frame. The Viola-Jones
algorithm detects the face based on Haar features; we have modified the
algorithm slightly to make its operation somewhat more efficient and simpler.
II- Company Presentation
Techwin Consulting is a training, consulting, support, and audit company.
Its founder is a consultant engineer, an expert in industrial management,
operational excellence, and project management, with more than 15 years of
experience in several managerial positions in multinational companies.
The company's missions:
Help companies improve their performance so that they can adapt
to changing markets and satisfy their customers.
Help create synergies within the company and develop team spirit
and collective skills.
Improve the individual skills of the company's human resources in
managerial practices, operational procedures and methods, and personal
development, in the conviction that performance management leads the
company to organizational excellence.
Industry: Management Consulting
Type: Partnership
Specialties: Operational excellence, quality, safety, Lean Manufacturing,
and Project management.
III- Introduction
Face detection is one of the most complex and challenging problems in the
field of computer vision, due to the large variations caused by changes in
facial appearance, lighting, and facial expression. These variations make
the face distribution highly nonlinear and complex in any space that is
linear with respect to the original image space. In real-time applications
such as surveillance and biometrics, camera limitations and pose variations
make the distribution of human faces in feature space even more complicated
than that of frontal faces, which further complicates the problem of robust
face detection.
Many techniques have been researched for years, and much progress has been
reported in the literature; most detection methods concentrate on detecting
frontal faces under sufficient lighting conditions. Yang et al. categorized
these methods into four types in their survey: knowledge-based,
feature-invariant, template-matching, and appearance-based.
Knowledge-based methods model facial features using human-coded rules,
such as two symmetric eyes, a mouth, a nose, etc.
Feature-invariant methods look for facial features that are invariant to
pose and lighting conditions.
Methods based on the correlation between a test image and pre-stored face
templates fall into the template-matching category.
Appearance-based methods use machine learning techniques to extract
discriminative features from a pre-labeled training set.
III-1 Face Detection
Face detection is a computer technology that determines the locations
and sizes of human faces in an image. It helps extract the facial features
while ignoring other objects. Human face perception is currently a major
research area.
It basically consists of detecting a human face through a set of trained
features. Here, face detection is a preliminary step for many other
applications, such as face recognition and video surveillance.
IV- Viola Jones Algorithm
This algorithm detects the features of a face in a particular frame of a
video sequence. It was the first object detection framework to achieve
competitive detection rates in real time, and it was introduced by Paul
Viola and Michael Jones, motivated mainly by the problem of face detection.
Four steps have to be followed to detect a face; the first is to train the
system with Haar features.
Haar features are rectangular templates composed of black and white boxes.
Each is a simple rectangular feature whose value is the difference between
the sums of the pixels inside its rectangular areas. The feature can be
placed at any position in the frame and at any scale; a feature built from
two such areas is called a 2-rectangle feature. Each feature type can
indicate the existence or absence of certain characteristics in the frame,
such as edges or changes in texture.
These Haar features are applied to determine the facial features. The black
part can be used to detect the nose of a human face, since the darker
region corresponds to the nose located at the center of the face. Figure-1
shows a 4-rectangle feature, where the black part is denoted as +1 and the
white part as -1. The feature value is calculated by subtracting the sum of
the pixels under the white rectangles from the sum of the pixels under the
black rectangles.
To sum up, Viola and Jones's algorithm is used as the basis of our design.
Since all human faces share some similarities, we use this property,
encoded as Haar features, to detect faces in an image.
The algorithm looks for specific Haar features of the face; if these
features are found, it passes the candidate sub-window, of size 24x24
pixels, on to the next stage of classification.
Fig-1. Integral Image Calculation
IV-1 Haar Features
As noted above, human faces share certain similarities, and we use this
property to build Haar features. The features are composed of two or three
rectangles. They are applied to a face candidate to find out whether a face
is present or not. Each Haar feature has a value, calculated by taking the
weighted pixel sum of each rectangle and adding the results together. Using
the integral image concept, we can easily find the pixel sum of any
rectangle.
Fig-2. Examples of Haar Features
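To make this concrete, the following is a minimal MATLAB sketch (our own
illustration, not part of the original framework code) of how a 2-rectangle
Haar feature could be evaluated on a grayscale patch; the patch contents and
the split position are arbitrary choices for the example.

% Minimal sketch: evaluate a vertical 2-rectangle Haar feature on a patch.
% The patch contents and the split position are illustrative assumptions.
patch = double(rand(24, 24) * 255);   % stand-in for a 24x24 grayscale sub-window

white = patch(:, 1:12);               % left half  (weight -1)
black = patch(:, 13:24);              % right half (weight +1)

% Feature value = sum of pixels under the black area minus the white area.
feature_value = sum(black(:)) - sum(white(:));

% A large magnitude indicates a strong intensity edge between the halves,
% e.g. a dark eye region next to the brighter bridge of the nose.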
IV-2. Integral Image
The integral image is defined as a cumulative summation of the pixel values
of the original image: the value at any location (x, y) of the integral
image is the sum of the image's pixels above and to the left of location
(x, y). "Fig. 3" illustrates the integral image generation.
Fig-3. Integral Image        Fig-4. Integral Image Calculation Process
Fast calculation in the integral image: Fig. 4 presents the calculation
process. In order to calculate the intensity sum of the green region, just
four values of F have to be considered. As a consequence, the intensity sum
of any rectangular area can be calculated from as few as four values of F.
This allows an extremely fast calculation of a convolution with any of the
rectangular Haar features described above.
The integral image F can be calculated in a pre-processing stage, prior to
detection, in a recursive manner, in just one pass over the original image
I, as in equations (2) and (3) below:
R(x, y) = R(x, y-1) + I(x, y) (2)
F(x, y) = F(x-1, y) + R(x, y) (3)
where R and F are initialized by R(x, -1) = 0 and F(-1, y) = 0.
The sum of intensities of a rectangular area ranging from (x, y) to
(x1, y1) can then be calculated by considering the values of F at the four
corner points of the region, instead of summing the intensities of all
pixels inside it:
Fig-5. Conversion of the original image to an integral image (top) and how
to calculate a rectangular region using an integral image (bottom)
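As an illustration, here is a small MATLAB sketch (our own, with an
arbitrary example rectangle) of how the integral image can be built in one
pass with cumulative sums and then queried with just four lookups:

% Minimal sketch: build an integral image and query a rectangle sum.
I = double(imread('cameraman.tif'));   % any grayscale test image

% Pad with a leading row/column of zeros so border rectangles need no
% special cases; F(x+1, y+1) = sum of I over rows 1..x, columns 1..y.
F = zeros(size(I) + 1);
F(2:end, 2:end) = cumsum(cumsum(I, 1), 2);

% Sum of intensities over rows r1..r2 and columns c1..c2, using only the
% four corner values of F (the rule illustrated in Fig-5).
r1 = 50; r2 = 80; c1 = 60; c2 = 100;   % arbitrary example region
rect_sum = F(r2+1, c2+1) - F(r1, c2+1) - F(r2+1, c1) + F(r1, c1);

% Sanity check against direct summation.
assert(rect_sum == sum(sum(I(r1:r2, c1:c2))));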
A cascade classifier is a multi-stage classifier that can perform detection
quickly and accurately. Each stage consists of a strong classifier produced
by the AdaBoost algorithm. From one stage to the next, the number of weak
classifiers in the strong classifier increases. An input is evaluated
sequentially, stage by stage.
If the classifier of a specific stage outputs a negative result, the input
is discarded immediately. If the output is positive, the input is forwarded
to the next stage. According to Viola & Jones (2001), this multi-stage
approach allows the construction of simpler classifiers, which can then be
used to reject most negative (non-face) inputs quickly while spending more
time on positive (face) inputs.
IV-3. Haar Feature Classifier
A Haar feature classifier uses the rectangle integrals to calculate the
value of a feature: it multiplies the weight of each rectangle by the sum
of the pixels inside it (obtained from the integral image) and adds the
results together.
Several Haar feature classifiers compose a stage. A stage comparator sums
the results of all the Haar feature classifiers in a stage and compares
this sum with a stage threshold. The threshold is also a constant obtained
from the AdaBoost algorithm. Stages do not all have the same number of Haar
features.
For example, Viola and Jones' data set used 2 features in the first stage
and 10 in the second. Altogether, they used a total of 38 stages and 6060
features.
Fig-6. Stages of Haar Features Classifier
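The following MATLAB sketch (our own schematic, with made-up feature
responses and thresholds) shows the stage-by-stage logic described above:
each stage sums its weak classifier responses, compares the sum against the
stage threshold, and rejects the window as soon as one stage fails.

% Schematic sketch of cascade evaluation for one candidate sub-window.
% Stage count, thresholds and feature responses are made-up examples.
num_stages = 3;
stage_thresholds = [0.5, 1.2, 2.0];   % from AdaBoost training (illustrative)
is_face = true;

for s = 1:num_stages
    % In a real detector, each response is a Haar feature classifier
    % evaluated on the sub-window; here we fake them with random numbers.
    weak_responses = rand(1, 2 * s);  % later stages use more features
    stage_sum = sum(weak_responses);

    if stage_sum < stage_thresholds(s)
        is_face = false;              % reject immediately, skip later stages
        break;
    end
end

if is_face
    disp('Candidate window passed all stages: face detected.');
else
    fprintf('Candidate rejected at stage %d.\n', s);
end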
IV-4. Learning Classification Functions
The complete set of features is quite large: about 160,000 features for a
single 24x24 sub-window. Though computing a single feature requires only a
few simple operations, evaluating the entire set of features is still
extremely expensive and cannot be performed by a real-time application.
Viola and Jones assumed that a very small number of the extracted
features can be used to form an effective classifier for face detection.
Thus, the main challenge was to find these distinctive features. They
decided to use the AdaBoost learning algorithm as a feature selection
mechanism. In its original form, AdaBoost is used to improve classification
results of a learning algorithm by combining a collection of weak classifiers
to form a strong classifier.
The algorithm starts with equal weights for all examples. In each round,
the weights are updated so that the misclassified examples receive more
weight. By drawing an analogy between weak classifiers and features, Viola
and Jones decided to use the AdaBoost algorithm for the aggressive
selection of a small number of good features which nevertheless have
significant variety.
Practically, the weak learning algorithm was restricted to the set of
classification functions, each of which depends on a single feature. The
weak classifier h(x, f, p, θ) was then defined for a sample x (i.e., a
24x24 sub-window) by a feature f, a threshold θ, and a polarity p
indicating the direction of the inequality:
h(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise.
The key advantage of AdaBoost over its competitors is the speed of
learning. For each feature, the examples are sorted by feature value; the
optimal threshold for that feature can then be computed in a single pass
over this sorted list.
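To illustrate the idea (this is our own toy sketch, not the authors' actual
training code), the following MATLAB snippet finds the best threshold for a
single feature in one pass over the sorted examples, using made-up feature
values, labels, and weights:

% Toy sketch of the single-pass threshold search for one feature.
f      = [0.2, 0.5, 0.9, 1.3, 1.8, 2.4];   % feature value per training example
labels = [  0,   0,   1,   0,   1,   1];   % 1 = face, 0 = non-face
w      = ones(1, 6) / 6;                   % AdaBoost starts with equal weights

[fs, order] = sort(f);
ws = w(order); ys = labels(order);

% Running sums of positive/negative weight at or below each example.
s_pos = cumsum(ws .* (ys == 1));  t_pos = s_pos(end);
s_neg = cumsum(ws .* (ys == 0));  t_neg = s_neg(end);

% Error of thresholding just above example k, for both polarities: either
% everything below is called non-face, or everything below is called face.
err_p1 = s_pos + (t_neg - s_neg);          % faces below the threshold are mistakes
err_p2 = s_neg + (t_pos - s_pos);          % non-faces below are mistakes
[best_err, k] = min(min(err_p1, err_p2));
best_theta = fs(k);                        % polarity = whichever side won

fprintf('theta = %.2f, weighted error = %.3f\n', best_theta, best_err);
% In a full AdaBoost round, misclassified examples would then receive
% more weight before the next feature is selected.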
Fig-7. Cascading
IV-5 Cascading
This step is introduced to speed up the process while keeping the result
accurate. It consists of several stages, each containing a strong
classifier.
All features are grouped into these stages. Faces are detected by sliding a
window over the frame. A given input is checked against the classifiers of
the first stage, then the next, and so on; it is passed to a successive
stage only if it satisfies the classifier of the preceding stage.
Stage cascading makes it possible to eliminate false candidates quickly:
the cascade discards a candidate as soon as it fails a stage, and otherwise
sends it on to the next, more complicated stage. If a candidate passes all
the stages, a face is detected.
Fig-8. Cascading Process
V- Face Detection using Viola Jones
Object Detection Framework
After learning about the major concepts used in the Viola-Jones Object
Detection Framework, we are now ready to learn about how those concepts
work together. The framework consists of two phases: Training and
Testing/Application. Let’s look at each of them one by one.
V-1 Training
V-1.1 Data Preparation
Assume that you already have a training set consisting of positive samples
(faces) and negative samples (non-faces).
The first step is to extract features from those sample images. Viola &
Jones (2001) recommend that the images be 24 x 24 pixels. Since each type
of Haar-like feature can take different sizes and positions within a
24 x 24 window, over 160,000 Haar-like features can be extracted, and at
this stage all of them need to be calculated. Fortunately, the introduction
of integral images helps speed up this process. Figure 9 illustrates the
entire data preparation process.
Fig-9. Data Preparation Process
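The 160,000+ figure can be checked with a short MATLAB sketch (our own)
that enumerates every position and scale of the five classic feature shapes
inside a 24 x 24 window:

% Count all placements of the five classic Haar feature shapes in a
% 24x24 window. Shapes are given as (width, height) of their base unit.
W = 24; H = 24;
shapes = [2 1; 1 2; 3 1; 1 3; 2 2];   % 2-rect (x2), 3-rect (x2), 4-rect
total = 0;
for s = 1:size(shapes, 1)
    bw = shapes(s, 1); bh = shapes(s, 2);
    for w = bw:bw:W                    % feature width, multiple of base width
        for h = bh:bh:H                % feature height, multiple of base height
            % number of (x, y) positions where a w-by-h feature fits
            total = total + (W - w + 1) * (H - h + 1);
        end
    end
end
fprintf('Total Haar-like features in a %dx%d window: %d\n', W, H, total);
% Prints 162336, in line with the "over 160,000" figure quoted above.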
V-1.2 Constructing a Cascade Classifier with a Modified AdaBoost Algorithm
Fig-10. Cascade Classifier Construction Process
VI- Image Processing (Detection & Tracking)
A lot of work has been done in this field, and various algorithms for
object tracking have been proposed. Object tracking is basically completed
in two steps:
1. Object Detection
2. Tracking of the Detected Object
VI-1. Tracking
The purpose of tracking is to build the trajectory of the object across the
frames. A classification of object tracking algorithms is given below.
Point tracking
Once the object is detected, the purpose of point tracking is to represent
it in the form of points, with the previous object state taken as a
reference. These points are then joined to form the object trajectory.
Different approaches can be used, e.g., multi-point correspondence,
parametric transformation, or contour evaluation. The most common point
tracking algorithms are the MCE tracker, the GOA tracker, the Kalman
filter, JPDAF, and PMHT.
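As a point-tracking illustration, here is a minimal sketch using the Kalman
filter support in MATLAB's Computer Vision Toolbox; the initial location,
noise parameters, and the faked per-frame measurements are arbitrary
example values of our own.

% Minimal sketch: track a point with a constant-velocity Kalman filter.
initial_location = [100, 150];            % example (x, y) detection, in pixels
kf = configureKalmanFilter('ConstantVelocity', initial_location, ...
                           [25 10], ...   % initial estimate error (location, velocity)
                           [10 5],  ...   % motion noise
                           10);           % measurement noise

for k = 1:5
    predicted = predict(kf);              % predicted location for this frame
    measurement = initial_location + 3*k; % stand-in for a real per-frame detection
    corrected = correct(kf, measurement); % fuse prediction with the measurement
    fprintf('frame %d: predicted (%.1f, %.1f), corrected (%.1f, %.1f)\n', ...
            k, predicted(1), predicted(2), corrected(1), corrected(2));
end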
Kernel tracking
This is tracking based on the object's features, shape, and appearance. A
kernel of any shape can be chosen to track the object; the motion of the
kernel shows the motion of the object across consecutive frames. Algorithms
for kernel tracking include mean-shift, KLT, and layering.
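As a kernel-tracking illustration, MATLAB's Computer Vision Toolbox
provides a histogram-based tracker from the mean-shift (CAMShift) family;
the following sketch assumes an example video file shipped with the toolbox
and an initial bounding box that would normally come from a detector.

% Minimal sketch: kernel (mean-shift family) tracking of a hue histogram.
video = VideoReader('visionface.avi');    % example video shipped with the toolbox
frame = readFrame(video);
initial_bbox = [150 100 80 80];           % assumed initial region [x y w h]

tracker = vision.HistogramBasedTracker;   % CAMShift-style tracker
hsv = rgb2hsv(frame);
initializeObject(tracker, hsv(:, :, 1), initial_bbox);  % model the hue histogram

while hasFrame(video)
    frame = readFrame(video);
    hsv = rgb2hsv(frame);
    bbox = step(tracker, hsv(:, :, 1));   % shift the kernel to the new mode
    frame = insertShape(frame, 'Rectangle', bbox);
    imshow(frame); drawnow;
end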
Silhouette tracking
The object region is tracked by shape matching or contour evolution, using
information about the object such as its volume and surface density
together with its shape model. A silhouette is formed in the object region.
Representative silhouette tracking approaches include state-space models,
variational models, heuristic models, Hausdorff matching, and
histogram-based methods.
VI-2. Simulation using MATLAB
The following steps are involved in software-based object tracking. First
of all, the video is converted into frames; MATLAB generates 15 frames per
second here, and this rate can be adjusted to the desired frame rate. The
next steps involve grayscale conversion, thresholding, image enhancement,
edge detection, object identification, and finally object tracking using
normalized cross-correlation, as sketched below.
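The final matching step can be illustrated with MATLAB's normalized
cross-correlation function; the image and the template coordinates below
are arbitrary example values of our own.

% Minimal sketch: locate a template in a frame with normalized cross-correlation.
frame    = rgb2gray(imread('peppers.png'));   % example image shipped with MATLAB
template = frame(120:160, 200:260);           % arbitrary patch used as the "object"

c = normxcorr2(template, frame);              % correlation surface
[~, peak] = max(c(:));
[peak_y, peak_x] = ind2sub(size(c), peak);

% normxcorr2 pads, so the peak refers to the template's bottom-right corner.
top_left = [peak_y - size(template, 1) + 1, peak_x - size(template, 2) + 1];
fprintf('Object found with top-left corner at row %d, col %d\n', top_left);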
Edge Detection
Edge detection includes a variety of mathematical methods that aim at
identifying points in a digital image at which the image brightness changes
sharply or, more formally, has discontinuities.
The points at which image brightness changes sharply are typically
organized into a set of curved line segments termed edges.
The same problem of finding discontinuities in one-dimensional signals is
known as step detection, and the problem of finding signal discontinuities
over time is known as change detection. Edge detection is a fundamental
tool in image processing, machine vision, and computer vision, particularly
in the areas of feature detection and feature extraction.
Fig-11. Example of a Canny Edge Detection
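In MATLAB, both the Sobel operator (used in our Simulink model in Section
VIII) and the Canny detector shown in Fig-11 are available through the edge
function; a minimal sketch:

% Minimal sketch: Sobel and Canny edge detection on a grayscale image.
I = imread('cameraman.tif');        % example image shipped with the toolbox

edges_sobel = edge(I, 'sobel');     % gradient-magnitude thresholding
edges_canny = edge(I, 'canny');     % smoothing + hysteresis thresholding

figure;
subplot(1, 3, 1); imshow(I);           title('Original');
subplot(1, 3, 2); imshow(edges_sobel); title('Sobel');
subplot(1, 3, 3); imshow(edges_canny); title('Canny');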
Grayscale Image
Grayscale is a range of monochromatic shades from black to white.
Therefore, a grayscale image contains only shades of gray and no color.
While digital images can be saved as grayscale (or black and white)
images, even color images contain grayscale information.
This is because each pixel has a luminance value, regardless of its color.
Luminance can also be described as brightness or intensity, which can be
measured on a scale from black (zero intensity) to white (full intensity).
Fig-12. Colors converted to Grayscale
Thresholding
In digital image processing, thresholding is the simplest method of
segmenting images. From a grayscale image, thresholding can be used to
create binary images.
The simplest thresholding methods replace each pixel in an image with a
black pixel if the image intensity I(i,j) is less than some fixed constant
T (that is, if I(i,j) < T), or with a white pixel if the intensity is
greater than that constant. In a photograph of a dark tree against snow,
for example, this results in the tree becoming completely black and the
snow becoming completely white.
Fig-13. Example of Otsu Thresholding
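Otsu's method, illustrated in Fig-13, chooses the constant T automatically;
in MATLAB this looks like the following sketch (the input image is an
example file shipped with the toolbox):

% Minimal sketch: binarize a grayscale image with Otsu's threshold.
I = imread('coins.png');            % example grayscale image shipped with MATLAB

T  = graythresh(I);                 % Otsu's method: threshold in [0, 1]
BW = imbinarize(I, T);              % pixels above T become white, others black

figure;
subplot(1, 2, 1); imshow(I);  title('Grayscale');
subplot(1, 2, 2); imshow(BW); title('Otsu Thresholding');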
VII- MATLAB Code of Viola Jones Algorithm
VII-1. Video face detection and tracking (MATLAB Code)
clear all;

% Choose a camera and its resolution; in our case, the laptop webcam
cam = webcam();
cam.Resolution = '424x240';

% Capture an initial snapshot from the video stream
video_frame = snapshot(cam);

% Initialize the video player to display the results
video_player = vision.VideoPlayer('Position', [100 100 424 240]);

% Create a cascade detector object (Viola-Jones)
face_detector = vision.CascadeObjectDetector();

% Initialize a tracker to track the points
point_tracker = vision.PointTracker('MaxBidirectionalError', 2);

% Run the loop for at most 400 frames of the tracked video
run_loop = true;
number_of_points = 0;
frame_count = 0;

while run_loop && frame_count < 400
    % Capture a frame and convert it to grayscale with rgb2gray
    video_frame = snapshot(cam);
    gray_frame = rgb2gray(video_frame);
    frame_count = frame_count + 1;

    if number_of_points < 10
        % Detection mode: look for a face in the current frame
        face_rectangle = face_detector.step(gray_frame);

        if ~isempty(face_rectangle)
            % Find corner points inside the region of interest
            points = detectMinEigenFeatures(gray_frame, 'ROI', face_rectangle(1, :));
            xy_points = points.Location;
            number_of_points = size(xy_points, 1);
            release(point_tracker);
            initialize(point_tracker, xy_points, gray_frame);
            previous_points = xy_points;

            % Convert the first box into a list of 4 points;
            % this is needed to visualize the rotation of the object
            rectangle = bbox2points(face_rectangle(1, :));
            face_polygon = reshape(rectangle', 1, []);

            % Insert a bounding box around the object being tracked
            video_frame = insertShape(video_frame, 'Polygon', face_polygon, 'LineWidth', 3);

            % Display the tracked points
            video_frame = insertMarker(video_frame, xy_points, '+', 'Color', 'White');
        end
    else
        % Tracking mode: track the points (note that some may be lost)
        [xy_points, isFound] = step(point_tracker, gray_frame);
        new_points = xy_points(isFound, :);
        old_points = previous_points(isFound, :);
        number_of_points = size(new_points, 1);

        if number_of_points >= 10 % need at least 10 points
            % Estimate the geometric transformation between the old points
            % and the new points, and eliminate outliers
            [xform, old_points, new_points] = estimateGeometricTransform(...
                old_points, new_points, 'similarity', 'MaxDistance', 4);
            rectangle = transformPointsForward(xform, rectangle);

            % Insert a bounding box around the object being tracked
            face_polygon = reshape(rectangle', 1, []);
            video_frame = insertShape(video_frame, 'Polygon', face_polygon, 'LineWidth', 3);

            % Display the tracked points
            video_frame = insertMarker(video_frame, new_points, '+', 'Color', 'White');

            % Reset the points
            previous_points = new_points;
            setPoints(point_tracker, previous_points);
        end
    end

    step(video_player, video_frame);
    run_loop = isOpen(video_player);
end

clear cam;
release(video_player);
release(point_tracker);
release(face_detector);
Result
VII-2. Image face detection and tracking (MATLAB Code)
Image = imread('Achraf.jpg');        % read the image with the imread function
[height, width, ~] = size(Image);    % determine the dimensions of the image

% Resize the image if its height exceeds 320 px
if height > 320
    Image = imresize(Image, [320 NaN]);
end

% Detect objects using the Viola-Jones algorithm
% To detect the face
Face_Detector = vision.CascadeObjectDetector();
Location_of_the_face = step(Face_Detector, Image);
Detected_Image = insertShape(Image, 'Rectangle', Location_of_the_face);

figure;
imshow(Detected_Image);              % show the result
title('Detected Face');              % give the detected image a title
Result
VIII- Simulink Model of Video and Image Edge Detection
VIII-1. Image Edge Detection
Fig-14. Simulink model of image edge detection using the Sobel method
Result
The input image: Image From File
Video Viewer 1: Grayscale Image
Video Viewer: Sobel Edge Detection (Output)
VIII-2. Video Edge Detection
Result
Input Video: From Multimedia File
Video Viewer 2: Grayscale Image from the video
Video Viewer 1: Video Edge Detection using the Sobel method
IX- Conception of Three-Axes Gimbal for Camera Support in CATIA V5
IX-1. Different Parts of the Three-Axes Gimbal
Part 1
Part 2
Part 3
Part 4
Part 2 & Part 1 Assembly
Double Assembly of Part 1 & 2
Three Axes Gimbal: Final Prototype
X- Conclusion
To conclude, we have learned about the Viola-Jones object detection
framework and its application to face detection. Many technologies today
have benefited from Paul Viola and Michael Jones's work. By understanding
how the framework works, we can confidently implement our own version of it
or use an open-source implementation like the one provided by OpenCV. We
hope that our explanation moves things forward in that direction and
inspires the creation of new technologies that use this remarkable
framework.
Sources
Dorin Comaniciu and Visvanathan Ramesh. IEEE Visual Surveillance.
Mamata S. Kalas. International Journal of Soft Computing and Artificial Intelligence.
Cen, K. (2016). Study of Viola-Jones Real Time Face Detector. Retrieved from: https://web.stanford.edu/class/cs231a/prev_projects_2016/cs231a_final_report.pdf
Zhu Liu and Yao Wang. Department of Electrical Engineering, Polytechnic University, Brooklyn.
Etienne Corvee and Francois Bremond. Pulsar team, INRIA.
Divya George and Arunkant A. Jose. ARPN Journal of Engineering and Applied Sciences, 10(17): 7678-7683.
Paul Viola and Michael Jones. Conference on Computer Vision and Pattern Recognition.
Jensen, O. H. (2008). Implementing the Viola-Jones Face Detection Algorithm. M.Sc. thesis, Technical University of Denmark, DTU. Retrieved from: https://pdfs.semanticscholar.org/40b1/0e330a5511a6a45f42c8b86da222504c717f.pdf
MathWorks: https://www.mathworks.com/help/vision/ref/vision.cascadeobjectdetector-system-object.html
Viola and Jones's original paper (PDF): https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
WikiWand: https://www.wikiwand.com/en/Edge_detection
ResearchGate: https://www.researchgate.net/publication/318327664_Design_and_simulation_of_various_edge_detection_techniques_using_Matlab_Simulink