
Review

A Systematic Review of CNN Architectures, Databases, Performance Metrics, and Applications in Face Recognition
Andisani Nemavhola 1,†, Colin Chibaya 2,*,‡ and Serestina Viriri 3,‡

1 School of Consumer Intelligence and Information Systems, University of Johannesburg, Johannesburg 2197, South Africa; 201571528@student.uj.ac.za
2 School of Natural and Applied Sciences, Sol Plaatje University, Kimberley 8300, South Africa
3 School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal,
Durban 4000, South Africa; viriris@ukzn.ac.za
* Correspondence: colin.chibaya@spu.ac.za
† This author contributed 80% to this work.
‡ These authors contributed equally to this work.

Abstract: This study provides a comparative evaluation of face recognition databases and Convolutional Neural Network (CNN) architectures used in training and testing face recognition systems. The databases span from early datasets like Olivetti Research Laboratory (ORL) and Facial Recognition Technology (FERET) to more recent collections such as MegaFace and Ms-Celeb-1M, offering a range of sizes, subject diversity, and image quality. Older databases, such as ORL and FERET, are smaller and cleaner, while newer datasets enable large-scale training with millions of images but pose challenges like inconsistent data quality and high computational costs. The study also examines CNN architectures, including FaceNet and Visual Geometry Group 16 (VGG16), which show strong performance on large datasets like Labeled Faces in the Wild (LFW) and VGGFace, achieving accuracy rates above 98%. In contrast, earlier models like Support Vector Machine (SVM) and Gabor Wavelets perform well on smaller datasets but lack scalability for larger, more complex datasets. The analysis highlights the growing importance of multi-task learning and ensemble methods, as seen in Multi-Task Cascaded Convolutional Networks (MTCNNs). Overall, the findings emphasize the need for advanced algorithms capable of handling large-scale, real-world challenges while optimizing accuracy and computational efficiency in face recognition systems.

Keywords: face recognition; CNN; neural networks; artificial intelligence; imaging

Academic Editors: Eleni Vrochidou, George A. Papakostas and Ioannis Tsimperidis
Received: 27 November 2024
Revised: 25 January 2025
Accepted: 27 January 2025
Published: 5 February 2025

Citation: Nemavhola, A.; Chibaya, C.; Viriri, S. A Systematic Review of CNN Architectures, Databases, Performance Metrics, and Applications in Face Recognition. Information 2025, 16, 107. https://doi.org/10.3390/info16020107

Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Face recognition has piqued the interest of researchers in the fields of artificial intelligence and computer vision, and it has made tremendous progress over the previous four decades [1]. Face detection, face alignment, feature extraction, and classification are the steps of a realistic face recognition system, with feature extraction being one of the most important problems [2]. To obtain excellent recognition performance, it is crucial to find strong face descriptors for the look of face regions that are unique, resilient, and computationally inexpensive [3].
Face recognition has been widely used in a variety of applications such as surveillance control, face attendance, border control gates, entrance/exit from public communities, facial security checks at airports and railway stations, and so on [1]. Due to differences in head posture, lighting, age, and facial expression, face identification in unconstrained
situations is a tough challenge. Additionally, cosmetics, facial hair, and accessories (such as
scarves or spectacles) may alter one’s image. The resemblance between people (e.g., twins,
relatives) presents another challenge to face recognition [4,5]. Face recognition is the most
favoured bio-metric for identity recognition for the following reasons: 1. high accuracy, 2. cross-platform applicability, 3. reliability [6], and 4. no consent required—unlike other bio-metric systems, face recognition does not need active cooperation from the subject [7].
In our scoping review [8], we consulted 266 CNN face recognition articles from the
year 2013 to 2023 and found the following gaps in the literature: (1) Most researchers are
interested in face recognition using images compared to videos. (2) Researchers prefer
using clean images compared to occluded ones. This is a problem because it affects
model performance when applied in the real world, where occlusion exists. (3) There has
been a lot of research that has been conducted using traditional CNN compared to other
CNN architectures.
The objectives of this systematic review are as follows:
1. To determine which techniques have been applied in the face recognition domain.
2. To identify which databases of face recognition are most common.
3. To find out which areas have adopted face recognition.
4. To assess and identify suitable evaluation metrics to use when comparative studies
are carried out in the field of face recognition.
This study seeks to review CNN architectures, databases, metrics, and applications for
face recognition.

2. Face Recognition History


• 1964: American researchers investigated computer programming for face recognition.
They envisioned a semi-automated process in which users input twenty computer
measurements, such as the length of the mouth or the width of the eyes [9]. After that,
the computer would automatically compare the distances shown in each picture,
determine how much the distances differed, and provide a potential match from
closed records [10].
• 1970: Takeo Kanade introduced a facial recognition system that considered the spacing
between facial features to identify anatomical elements, including features like the
chin. Subsequent trials showed that the system’s ability to accurately recognize face
characteristics was not always consistent. Yet, as curiosity about the topic increased,
Kanade produced the first comprehensive book on face recognition technology in
1977 [11].
• 1990: Research on face recognition increased dramatically due to advancements in
technology and the growing significance of applications connected to security [5].
• 1991: Eigenfaces [12], a facial recognition system that uses the statistical principal
component analysis (PCA) approach, was introduced by Alex Pentland and Matthew
Turk of the Massachusetts Institute of Technology (MIT) as the first effective example
of facial identification technology [9].
• 1993: The Defense Advanced Research Project Agency (DARPA) and the Army Re-
search Laboratory (ARL) launched the Face Recognition Technology Program (FERET)
with the goal of creating “automatic face recognition capabilities” that could be used
in a real-world setting to productively support law enforcement, security, and intelli-
gence personnel in carrying out their official duties [13].
• 1997: The PCA Eigenface technique of face recognition was refined by employing
linear discriminant analysis (LDA) to generate Fisherfaces [14].
• 2000s: Hand-crafted features such as Gabor features [14,15], local binary patterns
(LBPs), and variations became popular for face recognition. The Viola–Jones object identification framework for faces was developed in 2001, making it feasible to
recognize faces in real time from video material [16].
• 2011: Deep learning, a machine learning technology based on artificial neural net-
works [17], accelerated everything. The computer chooses which points to compare: it
learns faster when more photos are provided. Studies aimed to enhance the perfor-
mance of existing approaches by exploring novel loss functions such as ArcFace.
• 2015: The Viola–Jones method was implemented on portable devices and embedded
systems employing tiny low-power detectors. As a result, the Viola–Jones method has
been utilized to enable new features in user interfaces and teleconferencing, further broadening the practical use of face recognition systems [18].
• 2022: Ukraine is utilizing Clearview AI face recognition software from the United
States to identify deceased Russian servicemen. Ukraine has undertaken 8600 searches
and identified the families of 582 Russian troops who died in action [19].

3. Application of Face Recognition


3.1. Security and Surveillance
Face recognition is frequently used in surveillance systems for security, especially in
places like banks, government buildings, and airports. By detecting criminals, people of in-
terest, or those on watchlists, it increases safety. Furthermore, cellphones and other gadgets
include this technology for user identification, providing safe and practical access [20].

3.2. Law Enforcement


Through the comparison of facial photographs with criminal databases, law enforce-
ment organizations use face recognition technology to identify criminals or find missing
individuals. This technology speeds up case resolution by automating the process of
looking through vast amounts of photos and videos [21].

3.3. Healthcare
Face recognition may be used in the healthcare industry to identify and monitor pa-
tients, guaranteeing that the appropriate person receives the proper therapy. Furthermore,
facial expressions and characteristics can be examined to track patient reactions to medicine,
identify early indicators of mental health conditions like depression, and monitor emotional
states [22].

3.4. Access Control


Access control in secure facilities also uses facial recognition to make sure that only
people with permission may enter areas that are off-limits. Automating time-tracking and
attendance procedures in offices with facial recognition software lowers the possibility of
fraud and mistakes that come with manual approaches [20].

3.5. Automotive Industry


Face recognition technology in automobiles may be used to identify drivers [23]
and personalize settings like temperature, seat position, and favorite routes. To improve
road safety, it may also track driver attentiveness and identify signs of exhaustion or
distraction [24].
Table 1 summarizes the different uses of facial recognition technology, which include
security, surveillance, healthcare, and mobile devices.
Table 1. Summary of applications of face recognition.

Application Area | Use
Security | Office access, email authentication on multimedia workstations, flight boarding systems, and building access management [20]
Surveillance | CCTV control, power grid surveillance, portal control, and drug offender monitoring and search [20]
Health | To identify patients and manage patients' medical records [25]
Cell phone and gaming consoles | Unlocking devices, gaming, and mobile banking [10]

4. Face Recognition Systems


A face recognition system consists of four steps: face detection, face pre-processing,
feature extraction, and face matching [26]. The stages are illustrated in Figure 1 below.

Figure 1. Face recognition steps [26].

4.1. Face Recognition Systems Traditionally Consist of Four Main Stages


• The face is captured in an input photograph or video.
• Pre-processing is the process of applying several techniques to an image or video,
such as alignment, noise reduction, contrast enhancement, or video frame selection.
• Extracting facial features from a picture or video. Holistic, model-based, or texture-
based feature extraction approaches are used in image-based methods, whilst set-based
or sequence-based approaches are used in video-based methods.
• Face matching is performed against a database of stored images. If the subject exists in the database, a match is returned; otherwise, no match is found. A minimal sketch of this four-stage pipeline is shown below.
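The following is a minimal, illustrative sketch of the four stages just listed. The Haar-cascade detector, the flattened-pixel "embedding", the 112 × 112 crop size, and the similarity threshold are placeholder choices made for brevity and are not taken from any system reviewed here; a production pipeline would substitute a CNN-based embedding such as FaceNet.

```python
# Sketch of the four-stage face recognition pipeline: detect, preprocess,
# extract features, match. Placeholder choices are noted in each stage.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(image):
    """Stage 1: return the first detected face box (x, y, w, h), or None."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return faces[0] if len(faces) else None

def preprocess(image, box, size=(112, 112)):
    """Stage 2: crop, convert to grayscale, resize, and normalize."""
    x, y, w, h = box
    face = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return cv2.resize(face, size).astype(np.float32) / 255.0

def extract_features(face):
    """Stage 3: placeholder embedding (flattened pixels); swap in a CNN here."""
    vec = face.flatten()
    return vec / (np.linalg.norm(vec) + 1e-8)

def match(query_vec, gallery, threshold=0.9):
    """Stage 4: return the best-matching identity if similarity exceeds threshold."""
    best_id, best_sim = None, -1.0
    for identity, ref_vec in gallery.items():
        sim = float(np.dot(query_vec, ref_vec))
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```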
We provide a quick overview of face detection and facial land-marking approaches in
the literature below. Face detection and facial land-marking techniques that are accurate
and effective improve the accuracy of face recognition systems.

4.1.1. Face Detection


The process of face detection involves determining the bounding-box of a face within a
certain picture or video frame. Every face in the pictures is identified if there are several. In
addition to removing the backdrop as much as possible, face detection should be resistant
to changes in position, lighting, and size [5,27].
The ability to identify faces at various scales is a significant difficulty in face detec-
tion. Even with the usage of deep CNNs, this problem still exists. Deeper networks
alone cannot solve the problem of detection across scales [28]. Despite these challenges,
great progress has been achieved in the past decade, with several systems demonstrat-
ing excellent real-time performance. These algorithms’ recent advancements have also
made major contributions to recognizing other objects such as humans/pedestrians and
automobiles [29].

4.1.2. Face Preprocessing


This stage’s goal is to remove characteristics that make it difficult to classify pho-
tographs of the same person (intra-class differences), which makes them stand out from
other people (inter-class differences) [30].

4.1.3. Face Extraction


The technique of removing face component characteristics such as eyes, nose, mouth,
and so on from a human face picture is known as facial feature extraction. Facial feature
extraction is critical for the start-up of processing techniques such as face tracking, facial
emotion detection, and face identification [31]. Every face is unique and may be recognized
by its structure, size, and shape. Using the size and distance, ways of carrying this out
include identifying the face by removing the mouth, eyes, or nose form [32].

4.1.4. Face Matching


The process of face matching involves comparison of a digital target image against a
stored image.
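As a concrete illustration of the matching step, the sketch below performs 1:1 verification by comparing a probe embedding against a stored template using cosine similarity and a decision threshold. The embedding dimension (128) and the threshold (0.6) are arbitrary values chosen for illustration, not values prescribed by the reviewed systems.

```python
# Illustrative face matching as 1:1 verification with cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe: np.ndarray, enrolled: np.ndarray, threshold: float = 0.6) -> bool:
    """Return True if the probe embedding matches the enrolled template."""
    return cosine_similarity(probe, enrolled) >= threshold

rng = np.random.default_rng(0)
enrolled = rng.normal(size=128)                       # stored template for a known subject
same_person = enrolled + 0.1 * rng.normal(size=128)   # small intra-class variation
other_person = rng.normal(size=128)                   # a different identity

print(verify(same_person, enrolled))   # expected: True
print(verify(other_person, enrolled))  # expected: False
```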

5. Methodology
Our systematic review’s primary objective is to identify the databases, approaches,
and issues that face recognition is now facing. Future research in this field will be based
on the knowledge gathered from this study. The goals and research topics covered in this
study are displayed in Table 2.

Table 2. Research questions.

Question | Purpose
Q1: What are the most prevalent methods used for face recognition? How do they compare in performance? | To determine which techniques have been applied in the face recognition domain
Q2: What databases are used in face recognition? | To identify which databases of face recognition are most common
Q3: In which areas in the real world have face recognition techniques been applied? | To find out which areas have adopted face recognition
Q4: What are the most common evaluation metrics of face recognition systems? | To assess and identify suitable evaluation metrics to use when comparative studies are carried out in the field of face recognition

In the following subsection, we will cover the steps suggested to identify and screen
the studies in our systematic review.
5.1. Data Selection


Twelve internet databases provided the data used in this study, which were utilized to
generate a population of pertinent papers for the systematic review. EBSCOhost (Green-
File), ScienceDirect, MasterFile, Emerald ERIC, ProQuest, Taylor & Francis, Cambridge
Core, JSTOR, ACM Digital Library, Springer, IEEE Xplore Digital Library, and Premier
Scopus were among these databases. Articles were only considered for literature currency
if they were published between 2013 and 2023, inclusive. Following the PRISMA tech-
nique [33], the records were screened using the four stages of a systematic review, which
are detailed below.

5.1.1. Inclusion and Exclusion Criteria


The research questions served as guidance throughout the evaluation process. Tables 3 and 4 show the inclusion and exclusion criteria used to find relevant papers for the investigation.

Table 3. Inclusion criteria.

Inclusion Criteria | Explanation
IC1: Publicly available articles. | This criterion guarantees that only publicly available papers or those published in open-access journals are examined. The purpose is to make the articles accessible to all researchers and readers, thus fostering transparency and reproducibility.
IC2: English-language studies. | This promotes uniformity and avoids problems with linguistic hurdles. It also streamlines the review process because all research can be examined in a single language, making it more efficient and useful.
IC3: Papers about face recognition research. | The review's goal is to give particular insights on facial recognition. This assures that the gathered research immediately adds to the knowledge and improvement of face recognition systems.
IC4: Articles published between 2013 and 2023. | Face recognition has advanced significantly, particularly with the introduction of deep learning and convolutional neural networks (CNNs). Research published during the previous decade is more likely to represent the most recent advances, trends, and breakthroughs in facial recognition. Excluding older papers guarantees that the evaluation focuses on current methodologies and technology.
Table 4. Exclusion criteria.

Exclusion Criteria | Explanation
EC1: Articles published in languages other than English. | This exclusion is mostly intended for efficiency. English-language studies are significantly more accessible to a worldwide audience, ensuring uniformity in terminology, research methodologies, and conclusions. Translating non-English papers would take time and might include mistakes or misinterpretations, jeopardizing the review's credibility.
EC2: Studies that do not offer the complete article. | The complete paper is required for a comprehensive evaluation because it offers specific information about the approach, findings, and conclusions. Relying just on abstracts or summaries may result in a misleading or partial assessment of the study's quality and significance. Excluding incomplete papers guarantees that the review includes only fully accessible and transparent research.
EC3: Studies without an abstract. | An abstract is a short overview of a study's aims, methodology, findings, and conclusions. It enables reviewers to swiftly assess a study's relevance to the research subject. Without an abstract, it is impossible to evaluate if the study is aligned with the review's aims, and such studies may lack clarity or organization, resulting in exclusion.
EC4: Articles published before 2013. | The removal of studies published before 2013 assures that the study is focused on the most current developments in facial recognition technology. Over the last decade, there have been major advances in deep learning, notably the use of convolutional neural networks (CNNs) for facial recognition. Older articles may not reflect these improvements and may be out of date in comparison to current cutting-edge procedures.

5.1.2. Search Strategy


A detailed search strategy was created employing keywords and boolean operators.
Primary search phrases were the following:
• “Facial recognition”;
• “Face recognition technology”;
• “Biometric identification”;
• “Deep learning in facial recognition”;
• “Privacy and facial recognition”;
• “Bias in facial recognition”;
• “Face recognition applications”;
• “Ethics of facial recognition”;
• “Convolutional Neural Networks for facial recognition”;
• “Deep learning”.
Based on the research questions, a search string was created to specify the studies’
parameters. To integrate pertinent terms, boolean operators were employed:
(“Facial recognition” OR “Face recognition technology” OR “Biometric identification”)
AND (“Deep learning” OR “Deep learning in facial recognition” OR “Convolutional
Neural Networks”) AND (“Privacy and facial recognition” OR “Bias in facial recogni-
tion” OR “Ethics of facial recognition” OR “Face recognition applications”).
The search did not use any population filter—on the contrary, all fields were explored.
A total of 3622 records were found in accordance with the search strategy and the twelve
databases proposed.
5.1.3. Screening
In order to eliminate duplicate records, a first filter was applied during the screening phase. Across the twelve databases, two duplicate records were found and removed, leaving 3620 records.

5.1.4. Eligibility and Inclusion of Articles


Four categories were created from the 266 articles that met the requirements for inclu-
sion. We started by examining the distribution of these papers by year of publication. The
primary objectives and areas of emphasis of these articles were also examined. Next, we
looked at the distribution of these articles based on the type of CNN they represented. Fi-
nally, we categorized the included publications according to the CNN types they employed
for facial recognition. The outcomes for each category are described in the remainder of
this section.

5.1.5. Quality Assessment Rules (QAR)


A quality evaluation procedure ensures that only the most credible and relevant
research are included in a review or analysis. Table 5’s questions give a systematic approach
to evaluating research based on essential characteristics such as clarity, study design, and
the validity of outcome measures. Figure 2 below shows the PRISMA-ScR methodology
that was applied when filtering the articles.

Table 5. Quality assessment questions.

QAR | Quality Question | Explanation
QAR1 | Are the study's objectives well defined? | The study's objectives must be well-defined in order to focus the research and link it with the research topic. Ambiguity in the study's aims might result in imprecise or inconclusive findings.
QAR2 | Is the study design appropriate for the research question? | This question asks if the study design is appropriate for solving the research topic. A well-chosen design (e.g., experimental, observational, etc.) assures that the study will provide valid and trustworthy findings.
QAR3 | Is there any comparative study conducted on deep learning methods for video processing? | This determines whether the study compares different deep learning algorithms used for image and video processing. Comparative studies assist in determining which strategies are most effective and give deeper insights into the topic matter.
QAR4 | Are the facial recognition algorithms or methods clearly described? | This question determines if the facial recognition techniques or methodologies utilized in the study are adequately presented. A thorough explanation is essential for understanding the technique used, reproducing the study, and assessing its success.
QAR5 | Does the study have an adequate average citation count per year? | This question assesses the study's academic effect by assessing its average citation count annually. A greater citation count might suggest that the study is well known and significant in the area, but it should also be seen in context (for example, the study's age and field of research).
QAR6 | Are the outcome measures clearly defined and valid? | This evaluates if the study's outcome measures (the variables or metrics used to assess success) are properly defined and valid (meaning they accurately assess what they are meant to measure). Valid outcome measurements guarantee that the study's findings are significant and dependable.
Figure 2. PRISMA-ScR diagram showing all the steps taken to filter out articles [8].

6. Databases
Researchers studying face recognition have access to many face datasets. The databases
can be offered with free access to data or for sale. Figure 3 shows the databases used in face
recognition systems.
Figure 3. Databases used in face recognition systems.

6.1. ORL Database


The ORL Database of Faces provides a collection of lab-taken facial photographs from
April 1992 to April 1994 [5]. The database was utilized as part of a face recognition exper-
iment conducted in partnership with Cambridge University Engineering Department’s
Speech, Vision, and Robotics Group. There are 10 different photos for each of the forty subjects. Some subjects were photographed at different times, with variable lighting, facial expressions (open or closed eyes, smiling or not smiling), and facial details (glasses or
no glasses). All photographs were taken against a black, homogenous background, with
the individuals standing erect and facing forward [5]. Figure 4 below shows a preview
picture of the database.

Figure 4. ORL Database samples.

6.2. FERET Database


The Face Recognition Technology (FERET) database is a dataset that is used to evaluate
face recognition systems as part of the Face Recognition Technology (FERET) initiative.
It was assembled from 1993 to 1996 as a collaboration between Harry Wechsler at George
Mason University and Jonathan Phillips at the Army Research Laboratory in Adelphi,
Maryland [34]. The FERET database is a standard library of face photos that academics
may use to design algorithms and report on findings. The usage of a common database
also enables one to evaluate the effectiveness of different methodologies and measure
their strengths and shortcomings [34]. The objective was to create machine-based facial
recognition for authentication, security, and forensic applications. Facial pictures were
captured during 15 sessions in a semi-controlled setting. The dataset includes 1564 sets of face pictures, totaling 14,126 images of 1199 individuals, including 365 duplicate sets [5]. Figure 5 below shows a
preview picture of the database.

Figure 5. FERET database samples.

6.3. Yale Face Database


The Yale Face Database, which was established in 1997, has 165 GIF-formatted
grayscale photos of 15 different people [35]. There are 11 pictures for each subject, one for each distinct configuration or expression: center-light, left-light, right-light, normal, with glasses, without glasses, happy, sad, sleepy, surprised, and wink [35].

6.4. AR Database
Aleix Martinez and Robert Benavente created the AR Face database in 1998 at the
Computer Vision Center (CVC) of the Universitat Autònoma de Barcelona in Barcelona,
Spain [20,36,37]. This database focuses on face recognition but can also recognize facial
expressions. The AR database includes almost 4000 frontal pictures from 126 participants
(70 men and 56 females). Each subject had 26 samples recorded in two sessions on different
days, with 13 photos per session measuring 768 × 576 pixels [20,36,37]. Figure 6 below
shows a preview picture of the database.

Figure 6. AR database samples.

6.5. CVL Database


The CVL Face Database was established in 1999 by Peter Peer [38]. The database
features 114 people, each associated with 7 images that were captured in a controlled
environment [38].

6.6. XM2VTS Databases


The XM2VTS frontal data collection includes 2360 mug images of 295 individuals
gathered during four sessions [39]. Figure 7 below shows a preview picture of the database.
Figure 7. XM2VTS database samples.

6.7. BANCA Database


The BANCA database was created in 2003 to evaluate multi-modal verification in a variety of
settings (controlled, deteriorated, and unfavorable) using two cameras and two micro-
phones as acquisition devices [40]. Video and audio data were gathered for 52 respondents
(26 males and 26 females) across twelve occasions in four distinct languages (English,
French, Italian, and Spanish), for a total of 208 people [40]. Every population that was par-
ticular to a language and a gender was further split into two groups of thirteen participants
each [40].

6.8. FRGC (Face Recognition Grand Challenge) Database


The FRGC Database was collected between May 2004 and March 2006; it consists of
50,000 recordings separated into training and validation divisions [41]. The controlled
photographs were captured in a studio environment and are full frontal facial images with
two lighting settings and two facial emotions (smiling and neutral). The uncontrolled
photographs were taken in various lighting circumstances, such as corridors, atriums, and
outdoors [41]. Figure 8 below shows a preview picture of the database.

Figure 8. FGRC database samples.

6.9. LFW (Labeled Faces in the Wild) Database


Labeled Faces in the Wild (LFW) is a library of facial photos created in 2007 to inves-
tigate the topic of unconstrained face recognition [36]. Researchers at the University of
Massachusetts, Amherst established and maintain this database (particular references are
provided in the Acknowledgements section). The Viola–Jones face detector recognized and
centered 13,233 photos of 5749 people taken from the internet. In the dataset, 1680 persons
are represented by two or more photographs [36]. Figure 9 below shows a preview picture
of the database.

Figure 9. LFW database samples.


6.10. The MUCT Face Database


Stephen Milborrow, John Morkel, and Fred Nicolls of the University of Cape Town
created the MUCT database in 2008 [42]. The MUCT database has 3755 faces along with
76 manual markers. The database was developed to increase variety in lighting, age, and
ethnicity [42].

6.11. CMU Multi-PIE Database


The CMU Multi-PIE face database comprises over 750,000 photos of 337 persons cap-
tured in up to four sessions over the course of five months [43]. Subjects were photographed
using 15 view angles and 19 lighting settings while making a variety of facial expressions.
High-resolution frontal pictures were also captured [43]. Figure 10 below shows a preview
picture of the database.

Figure 10. CMU Multi-PIE database samples.

6.12. CASIA-Webface Database


The CASIA-WebFace large-scale dataset was proposed by Yi et al. [44] in 2014 for the
facial recognition challenge. It was acquired semiautomatically from the IMDb website,
containing 494,414 photographs of the faces of 10,575 individuals. CASIA-WebFace can be considered an independent training set for LFW [44].

6.13. IARPA Janus Benchmark-A Database


There are 2085 face pictures from a video and 5712 face photos from the network in the
IARPA Janus Benchmark A (IJB-A) dataset [45]. A total of 500 distinct people contributed
these facial photos. Each target typically contains 4 images from the network video and
11 images from the network photos [45].

6.14. Megaface
The MegaFace dataset is a massive facial recognition dataset intended to test and
enhance face recognition capabilities at scale [46]. It is one of the biggest datasets for
face recognition system training and benchmarking, with over 1 million tagged photos
representing 690,000 distinct identities [46]. MegaFace provides a strong platform for
testing face verification and identification algorithms, including a probe set for testing and
a gallery set containing the majority of identities [46]. Despite the fact that its magnitude
offers important insights into actual face recognition situations, the dataset’s web-based
collection presents problems including noise and poor image quality [46].

6.15. IARPA Janus Benchmark-B Database


The IARPA Janus Benchmark-B dataset pushes the boundaries of unconstrained
face recognition technology with its extensive annotated corpus of facial imagery [47].
The collection includes both video and photos, as well as facial imagery that is Creative
Commons licensed (i.e., free to be reused as long as due credit is given to the original data
source). Face photos and videos of 1500 more people were gathered as an addition to the
IJB-A dataset [47].

6.16. VGGFACE Database


The 2015 edition of the VGGFace collection, one of the biggest publicly accessible
datasets, has 2.6 million photos with 2622 individuals [48]. There are 800,000 photos in
the curated edition, with about 305 photos for each identity after label noise was
eliminated by human annotators [48].

6.17. CFP Database


The difficult and publicly available CFP (celebrities in frontal-profile) dataset was
created in 2016 by Sengupta et al. [49] at the University of Maryland. It has 7000 images
with 500 different subjects [49]. Every individual has 10 frontal images and 4 profile images [49].

6.18. Ms-Celeb-1M Database


In 2016, Microsoft made available for training and testing the massive Ms-Celeb-1M
dataset, which has 10 million photos from 100,000 celebrities [50].

6.19. DMFD Database


The Disguise and Makeup Database (with 410 subjects and 2460 total photos) is pre-
sented in [51]. These pictures include celebrities (movie/TV stars, athletes, or politicians)
wearing disguises and/or cosmetics (beards, mustaches, eyeglasses, goggles, etc.) that alter their true appearance [51]. The primary task of this database (DMFD) is to match face
photos taken in natural settings with many variables, including light, occlusion, distance,
position, and expression change [51].

6.20. VGGFACE 2 Database


The VGGFace2 dataset is made up of 3.31 million photos from 9131 celebrities repre-
senting a diverse range of professions (such as politicians and athletes) and ethnicities (such
as more Chinese and Indian faces than VGGFace, though the distribution of celebrities
and public figures still limits the ethnic balance) [52]. Pose, age, backdrop, lighting, and
other aspects of the images vary greatly; they were acquired from Google Image Search.
The dataset has a gender distribution that is roughly balanced, with males accounting for 59.3% of the subjects. The number of photos per identity ranges from 80 to 843 [52].

6.21. IARPA Janus Benchmark C Database


In 2018, Maze et al. developed the IARPA Janus Benchmark-C (IJB-C) database, which
is an extension of IJB-B. It has 31,334 still photos (21,294 faces and 10,040 non-faces), with
an average of 6 pictures per person, and 11,779 full-motion videos (117,542 frames, with an
average of 33 frames and 3 videos per person) [53].

6.22. MF2 Database


Nech and Shlizerman of the University of Washington generated the public facial
recognition dataset known as MF2 (MegaFace 2) in 2017 [54]. It features 4.7 million photos
and 672,000 identities [54].

6.23. DFW (Disguised Faces in the Wild) Database


The Disguised Faces in the Wild (DFW) collection comprises approximately 11,000 pho-
tos and 1000 people [55]. The images were captured in uncontrolled situations to support face recognition research. The dataset includes variations in disguise related to hairstyles,
spectacles, facial hair, beards, hats, turbans, veils, masquerades, and ball masks. The dataset
is difficult for face identification because of these changes in addition to those related to
stance, lighting, emotion, backdrop, ethnicity, age, gender, clothes, hairstyles, and camera
quality [55]. There are 1001 normal face images, 903 validation face photos, 4814 disguised
face images, and 4440 impersonator images in the DFW collection overall. Any individual
who, whether knowingly or unknowingly, assumes the identity of a subject is considered an impersonator of that subject.

6.24. LFR Database


The Left–Front–Right Pose dataset was created by combining four datasets—LFW,
CFP, CASIA-WebFace, and VGGFace2—into one [56].

6.25. CASIA-Mask Database


There are 494,414 photos of 10,575 people with masked faces in the CASIA-Mask
database [57].
Table 6 presents a summary of various face recognition databases, including key
details such as the number of images, videos, subjects, data accessibility, and whether the
data includes clean or occluded faces. These databases, ranging from early datasets like
ORL to more recent ones like CASIA Mask, are pivotal in evaluating and training face
recognition systems.
The face recognition databases presented in the table exhibit a wide variety of charac-
teristics, offering different datasets for training and testing. These databases, spanning from
the early ORL database in 1994 to more recent datasets like CASIA Mask in 2021, provide
various combinations of images, videos, and subjects. The ORL and FERET databases, for
instance, primarily feature clean data with a relatively smaller number of subjects (40 and
1199, respectively), making them suitable for early face recognition tasks. In contrast, newer
datasets like MegaFace (2016) and Ms-Celeb-M1 (2016) include vast numbers of images (up
to millions) and subjects (hundreds of thousands), allowing for large-scale training of face
recognition models.
Many of the older datasets, such as Yale (1997) and FRGC (2006), focus on clean images,
while recent datasets like CASIA Webface (2014) and IARPA Janus Benchmark-A (2015)
include a mix of clean and occluded images, providing a more challenging environment
for testing face recognition systems under varying conditions. Furthermore, IARPA Janus
Benchmark-A and IARPA Janus Benchmark-B (2015, 2017) introduce datasets with both
images and videos, adding a dynamic element for evaluating algorithms. The CASIA Mask
database (2021), specifically focusing on occluded faces, poses an additional challenge to
recognition systems, testing their ability to handle face occlusions effectively.
The majority of these datasets are publicly accessible, which has facilitated the devel-
opment and benchmarking of various face recognition algorithms, while a few like DMFD
(2016) are private, restricting broader accessibility. This diversity in the databases enables
researchers to evaluate face recognition models under a range of conditions, including
varying numbers of subjects, image quality, and occlusion types, making these datasets
valuable for advancing the field of face recognition.
Table 6. Summary of the databases used for training and testing face recognition systems.

Database | Year | Images | Videos | Subjects | Clean/Occluded | Accessible
ORL [5] | 1994 | 400 | 0 | 40 | Both | Public
FERET [5,34] | 1996 | 14,126 | 0 | 1199 | Clean | Public
Yale [35] | 1997 | 165 | 0 | 15 | Clean | Public
AR [20,36,37] | 1998 | >3000 | 0 | 126 | Both | Public
CVL [38] | 1999 | 798 | 0 | 114 | Clean | Public
XM2VTS [39] | 1999 | 2360 | 0 | 295 | Both | Public
BANCA [40] | 2003 | — | 0 | 208 | Clean | Public
FRGC [41] | 2006 | 50,000 | 0 | 7143 | Clean | Public
LFW [36] | 2007 | 13,233 | 0 | 5749 | Both | Public
MUCT [42] | 2008 | 3755 | 0 | — | Both | Public
CMU Multi-PIE [43] | 2009 | 750,000 | 0 | 337 | Both | Public
CASIA-WebFace [44] | 2014 | 494,414 | 0 | 10,575 | Both | Public
IARPA Janus Benchmark-A [45] | 2015 | 5712 | 2085 | 500 | Both | Public
MegaFace [46] | 2016 | 1,000,000 | 0 | 690,572 | Both | Public
CFP [49] | 2016 | 7000 | 0 | 500 | Both | Public
Ms-Celeb-1M [50] | 2016 | 10,000,000 | 0 | 100,000 | Both | Public
DMFD [51] | 2016 | 2460 | 0 | 410 | Both | Private
VGGFACE [48] | 2016 | 2,600,000 | 0 | 2622 | Both | Public
VGGFACE 2 [52] | 2017 | 3,310,000 | 0 | 9131 | Both | Public
IARPA Janus Benchmark-B [47] | 2017 | 21,798 | 7011 | 1845 | Both | Public
MF2 [54] | 2017 | 4,700,000 | 0 | 672,000 | Both | Public
DFW [55] | 2018 | 11,157 | 0 | 1000 | Both | Public
IARPA Janus Benchmark-C [53] | 2018 | 31,334 | 11,779 | 3531 | Both | Public
CASIA Mask [57] | 2021 | 494,414 | 0 | 10,575 | Occluded | Public

Table 7 provides an overview of the strengths and limitations of various face recogni-
tion databases used for training and testing face recognition systems. These databases vary
in terms of the diversity of subjects, quality of images, and environmental conditions, each
presenting unique advantages and challenges for researchers in the field.
The above table presents a comparison of various face recognition databases used for
training and testing systems, focusing on their strengths and limitations. Some datasets,
such as ORL, FERET, and Yale, are smaller and suitable for controlled experiments, but
they are limited in terms of subject diversity and environmental conditions, often with
low-resolution images. On the other hand, databases like FRGC, MegaFace, and Ms-Celeb-
M1 offer large-scale datasets with diverse subjects, poses, and lighting conditions, though
they can suffer from issues like high computational costs, data quality inconsistencies, or
limited diversity in real-world scenarios. Datasets like CASIA Webface and VGGFACE
provide large collections of high-quality images, but some have imbalances in data dis-
tribution or limited pose variation. Certain specialized databases like AR, CASIA Mask,
and DMFD focus on specific challenges, such as occlusion or manipulation detection, but
may have limited generalizability to unconstrained environments. Overall, while larger
datasets provide better scalability, the limitations related to data quality, environmental
conditions, and pose variation must be considered when choosing a database for training
face recognition systems.

Table 7. Strengths and limitations of face recognition databases used for training and testing face
recognition systems.

Database | Strengths | Limitations
ORL [5] | Small dataset, good for controlled experiments | Limited number of subjects (40), low-resolution images, restricted pose variation
FERET [5,34] | Diverse faces, widely used in face recognition research | Limited pose variation, restricted illumination conditions, outdated data
Yale [35] | Good for face recognition in controlled settings | Limited number of images, poses and expressions not varied enough
AR [20,36,37] | Large number of subjects and images, includes both clean and occluded faces | Significant noise due to occlusion, limited ethnic diversity, low-quality images
CVL [38] | Includes a variety of ethnicities, poses, and lighting conditions | Limited number of subjects (114), less variation in environmental conditions
XM2VTS [39] | High-quality, high-resolution images, widely used in benchmarking | Small sample of subjects, data are not diverse enough for real-world scenarios
BANCA [40] | Focused on low-impersonation tasks, balanced dataset | Restricted in terms of pose variation, focused mostly on controlled settings
FRGC [41] | Large scale, high-resolution images, variety of facial expressions and lighting | High computational cost due to the large number of images, limited diversity of subjects
LFW [36] | Large-scale dataset, commonly used for benchmarking face recognition | Limited to frontal face images, performance drops in challenging real-world scenarios
MUCT [42] | Diverse dataset in terms of ethnicity, good for real-world scenarios | Limited to images with visible faces, low variation in poses
CMU Multi-PIE [43] | Includes a variety of poses, lighting, and expressions | Faces with extreme poses or occlusions underrepresented, limited lighting conditions
CASIA Webface [44] | Large dataset with a variety of subjects and images | Imbalanced data distribution, low-quality images, faces mostly frontal with limited lighting
IARPA Janus Benchmark-A [45] | High-quality data, diverse subjects and poses | Relatively limited number of subjects, focused mostly on controlled settings
MegaFace [46] | Extremely large-scale dataset with many identities | Faces may be poorly annotated or of low resolution, data quality varies
CFP [49] | High diversity of subjects and images, challenging tasks | Limited diversity, performance drops in challenging real-world conditions
Ms-Celeb-1M [50] | Very large dataset, includes a wide variety of subjects | Mislabeled or noisy data, limited to celebrity faces, limited diversity in real-world scenarios
DMFD [51] | Focuses on face manipulation detection, high-quality data | Small number of subjects (410), limited ethnic diversity, focuses only on facial manipulation detection
VGGFACE [48] | Large-scale dataset with diverse identities, popular for face recognition | Limited variation in lighting conditions, faces mostly frontal with minimal pose changes
VGGFACE 2 [52] | Large dataset with a good variety of subjects and poses | Faces are mostly frontal, variation in poses and lighting conditions not well covered
IARPA Janus Benchmark-B [47] | High-quality images, good variety of facial poses and conditions | Mislabeled data in some cases, faces from controlled settings, limited pose variation
MF2 [54] | Large-scale dataset, useful for testing large-scale face recognition systems | Faces are of low resolution and poor quality in certain instances, not diverse enough
DFW [55] | Large-scale dataset, includes challenging scenarios for face recognition | Faces with extreme poses or occlusions underrepresented, limited facial expressions
IARPA Janus Benchmark-C [53] | Includes high-quality data, diverse set of subjects | Data may be noisy or mislabeled, limited variation in facial expressions
CASIA Mask [57] | Focuses on occlusion, valuable for studying occluded faces | Occlusion focus limits the dataset's applicability to face recognition in unconstrained environments

The next section covers face recognition methods, spanning both traditional approaches and deep learning architectures.

7. Face Recognition Methods


7.1. Traditional
7.1.1. Principal Component Analysis (PCA)
Matthew Turk and Alex Pentland pioneered the use of the PCA approach for recognizing faces; they
coupled the conceptual approach of the Karhunen–Loève theorem with factor analysis
to create a linear model [16]. Eigenfaces are determined using global and orthogonal
characteristics from human faces. A human face is calculated as a weighted mixture of
many Eigenfaces. However, this technique is not extremely accurate when the lighting and
position of facial photos change significantly [58]. With PCA, the face database represents all
images as long vectors that are correlated, rather than the typical matrix structure [58]. The
PCA Eigenface method of face recognition was improved by utilizing linear discriminant
analysis (LDA) to obtain Fisherfaces [14]. The most popular applications of LDA are feature
extraction and dimensionality reduction [59,60]. For supervised classification problems, it
is more reliable than PCA. LDA requires labeled data and struggles with pose variation
and large datasets [35].
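The sketch below illustrates the Eigenfaces idea with scikit-learn's PCA: flattened face images are projected onto a small set of orthogonal components, and each face becomes a weight vector in that basis. The random placeholder data, the 64 × 64 image size, and the choice of 20 components are illustrative assumptions, not values from the cited studies.

```python
# Illustrative Eigenfaces sketch: PCA on flattened face images.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
faces = rng.random((100, 64 * 64))       # 100 flattened 64x64 "face" images (placeholder data)

pca = PCA(n_components=20, whiten=True)  # keep 20 Eigenfaces
weights = pca.fit_transform(faces)       # each face becomes a 20-D weight vector

eigenfaces = pca.components_.reshape((20, 64, 64))   # components viewable as face-like images
reconstruction = pca.inverse_transform(weights[:1])  # a face as a weighted mixture of Eigenfaces
print(weights.shape, eigenfaces.shape, reconstruction.shape)
```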
7.1.2. Gabor Filter


The Gabor filter optimizes spatial and frequency resolution by acting as a band-pass
filter for local frequency distributions [61]. In texture analysis, the Gabor filter is a linear
filter that effectively determines if the picture contains any certain frequency content in
particular directions within a small area surrounding the point or region of examination [62].
Gabor filter-based feature selection is resilient, but computationally expensive due to high-
dimensional Gabor features [61]. In a variety of image-based applications, Gabor filters have
demonstrated exceptional performance. Gabor filters are capable of achieving excellent
results in both the frequency and spatial domains [63].
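A minimal sketch of Gabor filtering with OpenCV is shown below: a small bank of oriented band-pass kernels is applied to a grayscale face crop and the responses serve as texture features. The kernel parameters, the number of orientations, and the mean-response summary are illustrative choices, not tuned settings from the cited work.

```python
# Sketch of a small Gabor filter bank applied to a grayscale face crop.
import cv2
import numpy as np

image = np.random.randint(0, 256, (112, 112), dtype=np.uint8)  # placeholder face crop

features = []
for theta in np.arange(0, np.pi, np.pi / 4):          # 4 orientations
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                lambd=10.0, gamma=0.5, psi=0)
    response = cv2.filter2D(image, cv2.CV_32F, kernel)
    features.append(response.mean())                  # crude summary statistic per filter

print(features)  # one mean response per orientation
```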

7.1.3. Viola–Jones Object Detection Framework


Paul Viola and Michael Jones introduced the Viola–Jones object detection framework,
a machine learning object identification framework, in 2001 [64]. It is made up of several
classifiers. A single perceptron with several binary masks (Haar features) makes up each
classifier. Although it is less accurate than more recent techniques like convolutional neural
networks, it is effective and computationally inexpensive.
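OpenCV ships a Haar-cascade face detector in the Viola–Jones style, which the sketch below uses; the input image path is a placeholder, and the detection parameters are illustrative defaults rather than values from the original framework paper.

```python
# Sketch of Viola-Jones style detection using OpenCV's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group_photo.jpg")            # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Each detection is a bounding box (x, y, width, height).
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(30, 30))
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", image)
```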

7.1.4. Support Vector Machine (SVM)


An SVM separates classes by finding a dividing hyperplane. The hyperplane with the greatest distance to the closest training-data point of any
class (also known as the “functional margin”) intuitively achieves a decent separation
since, generally speaking, the bigger the margin, the lower the classifier’s generalization
error [65]. Overfitting is less likely to occur for the implementer when the generalization
error is lower. There are two types of support vector machines: linear and nonlinear [65].
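The following hedged sketch shows SVM classification of face feature vectors (for example, PCA weights or embeddings) with scikit-learn. The synthetic two-identity data, feature dimensionality, and kernel choice are illustrative only.

```python
# Sketch of SVM classification over face feature vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two "identities", each a cluster of 50 feature vectors in a 20-D feature space.
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(3.0, 1.0, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = SVC(kernel="linear", C=1.0)   # linear SVM; kernel="rbf" gives the nonlinear variant
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```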

7.1.5. Histogram of Oriented Gradients (HOG)


An image’s gradient information, or edge-like characteristics, can be captured via the
Histogram of Oriented Gradients (HOG) approach. Face recognition and detection have
been effectively implemented using it. Each cell’s gradient is calculated once the picture is
split up into tiny cells [66]. Histograms, which serve as an image feature descriptor, are
created by grouping these gradients. HOG is good at capturing texture information but
struggles with handling many poses and occlusion [66].
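A minimal HOG extraction sketch with scikit-image follows: the image is divided into small cells, per-cell gradient-orientation histograms are computed, and the concatenated, block-normalized histograms form the descriptor. The crop size and HOG parameters are illustrative assumptions.

```python
# Sketch of HOG feature extraction for a face crop using scikit-image.
import numpy as np
from skimage.feature import hog

face = np.random.random((128, 64))   # placeholder grayscale face crop

descriptor = hog(face,
                 orientations=9,          # bins per gradient-orientation histogram
                 pixels_per_cell=(8, 8),  # cell size
                 cells_per_block=(2, 2),  # block normalization window
                 block_norm="L2-Hys")
print(descriptor.shape)                   # one flat feature vector per image
```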
Modern facial recognition systems have their roots in traditional technologies. Despite
their notable achievements in controlled settings, techniques such as Eigenfaces (PCA),
Fisherfaces (LDA), LBP, Gabor filter, and HOG are limited in their ability to handle variables
in the real world, such as posture, illumination, and occlusion. These conventional tech-
niques have mostly been superseded by more current deep learning techniques, especially
CNNs and Transformers, which can extract more intricate and discriminative features from
big datasets.

7.2. Deep Learning


7.2.1. AlexNet
The convolutional neural network (CNN) architecture known as AlexNet was created
by Alex Krizhevsky in association with Ilya Sutskever and Geoffrey Hinton [67]. On
30 September 2012, AlexNet participated in and won the ImageNet Large Scale Visual
Recognition Challenge. Eight layers make up the basic architecture of AlexNet, three
of which are fully connected and five of which are convolutional. AlexNet is similar to LeNet but has a deeper architecture with more filters and stacked convolutional layers. The depth of the AlexNet model improved the performance reported in Levi and Hassner's studies [68].
7.2.2. VGGNet
In the publication “Very Deep Convolutional Networks for Large-Scale Image Recogni-
tion”, K. Simonyan and A. Zisserman introduced the convolutional neural network model
known as the VGG-Network [69]. Their primary contribution was to employ the VGGNet
design, which has modest (3 × 3) convolution filters and doubles the amount of feature
maps following the (2 × 2) pooling. To improve the deep architecture’s ability to learn
continuous nonlinear mappings, the network’s depth was raised to 16–19 weight layers.
Figure 11 below shows an example of a VGG architecture.

Figure 11. VGG architecture.

7.2.3. ResNet
In 2016, He et al. designed the residual neural network (also known as a residual
network or ResNet) architecture largely to improve the performance of existing CNN
architectures such as VGGNet, GoogLeNet, and AlexNet [70]. A ResNet is a deep learning
model in which weight layers train residual functions based on the layer inputs. It functions
like a highway network, with gates opened using significantly positive bias weights [71].
ResNet uses “global average pooling” instead of “fully connected” layers, resulting in a
significantly reduced model size compared to the VGG network [70].
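The sketch below shows a basic residual block in PyTorch: the block learns a residual function F(x) and adds it to the identity shortcut. It is a generic illustration of the idea, not the exact block of any specific reviewed model, and the channel count and input size are illustrative.

```python
# Sketch of a basic residual block: output = ReLU(x + F(x)).
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # F(x): first conv + BN + ReLU
        residual = self.bn2(self.conv2(residual))       # F(x): second conv + BN
        return self.relu(x + residual)                  # identity shortcut + residual

block = BasicResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```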

7.2.4. FaceNet
Google researchers Florian Schroff, Dmitry Kalenichenko, and James Philbin created
the FaceNet face recognition technology. The technology was initially demonstrated at the
2015 IEEE Conference on Computer Vision and Pattern Recognition [72]. Using the Labeled
Faces in the Wild (LFW) and YouTube Faces Databases, FaceNet was able to recognize faces
with an accuracy of 99.63% and 95.12%, respectively [72].
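FaceNet is trained with a triplet loss on L2-normalized embeddings, which the simplified sketch below illustrates: an anchor is pulled toward a positive (same identity) and pushed away from a negative (different identity) by at least a margin. The embedding size, batch size, and margin are illustrative values; triplet mining is omitted.

```python
# Simplified sketch of the triplet loss used for face embeddings.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Squared-L2 triplet loss on L2-normalized embeddings."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    pos_dist = (anchor - positive).pow(2).sum(dim=1)   # anchor-positive distance
    neg_dist = (anchor - negative).pow(2).sum(dim=1)   # anchor-negative distance
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()

a, p, n = (torch.randn(8, 128) for _ in range(3))  # batch of 8 triplets, 128-D embeddings
print(triplet_loss(a, p, n))
```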

7.2.5. LBPNet
LBPNet uses two filters that are based on Principal Component Analysis (PCA) and
LBP methodologies, respectively [73]. The two components of LBPNet’s architecture are
the (i) regular network for classification and (ii) deep network for feature extraction [73].

7.2.6. Lightweight Convolutional Neural Network (LWCNN)


LWCNNs are lightweight frameworks with shorter inference times and fewer parameters that learn a compact 256-D embedding from large-scale face data with very noisy labels [74].

7.2.7. YOLO
A convolutional neural network architecture called YOLO was developed by Joseph
Redmon and his colleagues [75]. It predicts bounding-box positions and class labels for many candidates in a single pass. Regression is the method used by YOLO to
handle object recognition, simplifying the process from picture input to category and
position output [75].
7.2.8. MTCNN
MTCNNs or Multi-Task Cascaded Convolutional Neural Networks represent a neural network that detects faces and facial landmarks in images. It was published in 2016 by
Zhang et al. [76]. Additionally, this system uses one-shot learning, or just one image of the
offender to identify him. The goal is to recognize the criminal’s face, locate the information
that has been entered in the database for that criminal, and notify the police of all the facts,
including the place where the criminal was being watched by cameras [76].
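As a usage illustration, the sketch below relies on the third-party `mtcnn` Python package (assumed installed via `pip install mtcnn`), which wraps a cascaded P-Net/R-Net/O-Net detector and returns, per face, a bounding box, a confidence score, and five landmarks. The image path is a placeholder, and this package is an assumption on our part rather than the implementation used in the cited work.

```python
# Sketch of cascaded face and landmark detection with the "mtcnn" package.
import cv2
from mtcnn import MTCNN

detector = MTCNN()
image = cv2.cvtColor(cv2.imread("camera_frame.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path

for face in detector.detect_faces(image):
    x, y, w, h = face["box"]              # face bounding box
    landmarks = face["keypoints"]         # eyes, nose, and mouth corners
    print(face["confidence"], (x, y, w, h), landmarks["left_eye"], landmarks["nose"])
```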

7.2.9. DeepMaskNet
Ullah et al. presented DeepMaskNet in 2021 during COVID-19 [77]. It is a powerful
system that can distinguish between individuals who are wearing face masks and those
who are not. With an accuracy of 93.33%, DeepMaskNet outperformed various cutting-edge
CNN models, including VGG19, AlexNet, and Resnet18 [77].

7.2.10. DenseNet
DenseNet was suggested as a solution to the vanishing gradient problem and is
comparable to ResNet [78]. DenseNet solves the problem with ResNet by using feed-
forward connections between each previous layer and the subsequent layer, utilizing
cross-layer connectivity. ResNet explicitly retains information through additive identity
transformations, which adds to its complexity. DenseNet instead makes use of dense blocks; as a result, each layer receives the feature maps of all preceding layers as input [78].

7.2.11. MobileNetV2
A convolutional neural network design called MobileNetV2 aims to function well
on mobile devices [79]. Its foundation is an inverted residual structure, in which the
bottleneck layers are connected by residuals. Lightweight depthwise convolutions are used
by the intermediate expansion layer to filter features as a source of non-linearity. MobileNetV2's design begins with a full convolution layer with 32 filters, followed by 19 residual bottleneck layers [79].

7.2.12. MobileFaceNets
MobileFaceNets are a type of very efficient CNN model designed for high-accuracy
real-time face verification on mobile and embedded devices [80]. These models have less
than one million parameters. The superiority of MobileFaceNets over MobileNetV2 has
been established. MobileFaceNets require smaller amounts of data while maintaining
superior accuracy when compared to other cutting-edge CNNs [80].

7.2.13. Vision Transformer (ViT)


The Vision Transformer treats pictures as a series of patches, adapting the Transformer
design from Natural Language Processing to computer vision. Like word embeddings
in natural language processing, these patches are flattened and linearly projected into
embeddings [81]. In order to preserve spatial information, positional embeddings are
inserted. Global relationship capture is then made possible by a typical Transformer encoder
with multi-head self-attention layers processing the patch embedding sequence. To classify
images, a classification token compiles data [81]. ViT works well for applications requiring
comprehensive picture interpretation because of its ability to capture global context and
long-range interdependence [81]. In some situations, Vision Transformers (ViTs) have
outperformed traditional convolutional neural networks (CNNs) in image identification
tests, exhibiting impressive performance [81]. But in order to function at their best, ViTs
often demand a lot of data and processing power, but new developments like hybrid
models—which combine CNNs and Transformers—have lessened this drawback [82].
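The sketch below illustrates the ViT front end described above: the image is split into fixed-size patches, each patch is linearly projected to an embedding, a classification token is prepended, and positional embeddings are added before the sequence enters the Transformer encoder. The sizes (224 × 224 image, 16 × 16 patches, 768-D embeddings) are illustrative assumptions.

```python
# Sketch of the ViT patch-embedding front end.
import torch
import torch.nn as nn

image_size, patch_size, dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2            # 196 patches

# A strided convolution is a standard way to implement "flatten + linear projection".
to_patch_embedding = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

x = torch.randn(1, 3, image_size, image_size)                 # one input image
patches = to_patch_embedding(x).flatten(2).transpose(1, 2)    # (1, 196, 768)
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1) + pos_embedding
print(tokens.shape)  # torch.Size([1, 197, 768]) -> fed to the Transformer encoder
```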
7.2.14. Face Transformer for Recognition Model


By treating face photos as a series of patches, Face Transformer for Recognition models,
as investigated in [83], make use of the Transformer architecture’s prowess in processing
sequential data. These picture patches are flattened and linearly projected into embeddings,
much like the tokenization of words in natural language processing. In order to maintain
spatial information that is essential for comprehending face features, positional embeddings
are used. After that, the patch embedding sequence is sent into a conventional Transformer
encoder made up of feed-forward and multi-head self-attention networks. This makes it
possible for the model to depict the face holistically by capturing the global correlations
between various facial characteristics. Face recognition tasks are subsequently performed
using the final representation, which is frequently obtained from a classification token.
With this method, Face Transformers can learn efficiently from massive datasets and may
even challenge the dominance of CNNs in face recognition [84].

7.2.15. DeepFace
Facebook’s research group developed DeepFace, a deep learning face recognition
technology. It recognizes human faces in digital photos. The approach employs a deep
convolutional neural network (CNN) to develop robust face representations for the purpose
of face verification [85]. DeepFace achieved a human-level accuracy of 97.35% on the LFW
(Labeled Faces in the Wild) dataset, a substantial advance over existing approaches at the
time [85]. The network comprises nine layers and pre-processes photos using 3D face
alignment, which aligns them based on facial landmarks and helps with pose variations.
It uses a softmax loss function during training to discriminate between
distinct identities [85].

7.2.16. Attention Mechanism


Attention mechanisms increase the performance of face recognition models by allow-
ing them to focus on the most relevant facial characteristics, deal with occlusions, illumi-
nation differences, and position changes, and improve generalization. Self-attention [86],
channel [87] and spatial attention [87] processes, and multi-head self-attention [88] are
effective tools for improving feature maps, increasing model resilience, and attaining high
accuracy even in difficult situations such as low-resolution inputs or enormous datasets.
Attention processes are an important component of contemporary face recognition systems
due to their strengths.
The channel mechanism describes the relevance or link between various channels
in feature maps. CNNs process input data (such as an image) through many filters (or
kernels), resulting in distinct feature maps. Each feature map represents a distinct channel.
For example, in a color picture, there might be three channels representing the RGB (Red,
Green, Blue) components [87]. However, when pictures progress through deeper layers of
the network, each channel carries more abstract and high-level characteristics. The problem
is determining which channels carry the most useful features for a specific activity [87]. The
spatial mechanism focuses on the interactions between multiple spatial locations (pixels) in
a feature map. In CNNs, spatial mechanisms are critical for capturing spatial hierarchies
and patterns in the input data, such as the relative location of objects or elements in an
image [87]. Self-attention, also known as intra-attention, is a process in which a series
of input items attend to themselves. This implies that each element in the sequence is
compared to every other element, allowing the model to determine how important other
items are for each element in the sequence [86]. Multi-head attention is a variation of the
self-attention process that enables the model to pay attention to several sections of the
sequence at once. Rather than employing a single set of attention weights, multi-head
attention employs many sets (or “heads”) to capture various relationships or aspects in the
data [88].
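As a rough sketch of the multi-head self-attention operation described above, the following example attends over a short sequence of embeddings (for instance, face-patch embeddings); the dimensions and head count are illustrative assumptions, not a specific model from [86,88]:

```python
import torch
import torch.nn as nn

def multi_head_self_attention(x: torch.Tensor, num_heads: int, qkv: nn.Linear, out: nn.Linear) -> torch.Tensor:
    """Scaled dot-product self-attention: every token attends to every other token."""
    b, n, d = x.shape
    head_dim = d // num_heads
    q, k, v = qkv(x).chunk(3, dim=-1)                        # project to queries, keys, values
    # Reshape to (batch, heads, tokens, head_dim) so each head can capture a different relationship.
    q, k, v = (t.view(b, n, num_heads, head_dim).transpose(1, 2) for t in (q, k, v))
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)  # attention weights
    y = (attn @ v).transpose(1, 2).reshape(b, n, d)          # weighted sum of values, heads re-merged
    return out(y)

dim, heads = 64, 4
x = torch.randn(2, 10, dim)                                  # 10 tokens per image (hypothetical)
y = multi_head_self_attention(x, heads, nn.Linear(dim, 3 * dim), nn.Linear(dim, dim))
print(y.shape)  # torch.Size([2, 10, 64])
```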

7.2.17. Swin Face Recognition


The Swin Transformer for recognition uses a multi-stage, hierarchical design to handle
picture patches [89]. Like a Vision Transformer, it begins by segmenting the image into small
patches. However, in contrast to ViT, the Swin Transformer creates a hierarchical representation
that captures both local and global information by merging neighboring patches at each
stage [89]. This merging uses a shifted windowing scheme to provide
effective cross-window information exchange. The Swin Transformer block, which
models interactions inside windows via shifting window-based multi-head self-attention, is
the central component of each stage [89]. This makes it scalable by lowering the computing
complexity in comparison to global attention. The Swin Transformer performs well on
tasks like facial recognition because of its hierarchical structure and shifting windows,
which let it collect global context and fine details [89].
Table 8 summarizes the strengths and weaknesses of various face recognition meth-
ods. These methods, including PCA, Gabor filters, and Viola–Jones, each have distinct
advantages, such as resource efficiency and robustness to certain variations, but also face
limitations in handling large datasets, non-frontal faces, and occlusions.
The above are the most commonly used CNN architectures for training and testing
face recognition systems. We can observe that some CNNs require small datasets while
others require large datasets in order to achieve high accuracy. We can also observe that
some CNNs have more layers than others, and that some are best suited for real-time
verification or for masked faces. The corresponding open-source resources are available at
this link: https://github.com/ddlee-cn/awesome_cnn (accessed on 17 December 2024).
When it comes to facial recognition, Transformers encounter a number of difficulties.
Compared to CNNs, they may be less successful since they need big datasets and a lot of
processing power [82]. They may also be computationally costly due to their quadratic
self-attention complexity, which restricts real-time applications. To address these
shortcomings, hybrid models that combine CNNs and Transformers have been suggested;
nevertheless, optimization remains difficult [90]. Only a small number of studies have so
far used Transformers for face recognition.

Table 8. Summary of face recognition methods’ strengths and weaknesses.

Algorithm | Strength | Weakness
PCA [14,16,58–60] | Works well in low-dimensional spaces and is easy on resources for small, well-aligned face datasets. | Difficulties with big datasets and non-frontal faces; sensitive to changes in illumination, emotion, and posture.
Gabor filter [61–63] | Withstands variations in illumination and is capable of capturing spatial details and face texture at various sizes. | Complex and resource-intensive, with challenges managing wide stance variations and non-frontal faces.
Viola–Jones [64] | Fast, efficient, real-time face detection that is resistant to lighting fluctuations and effective for frontal faces. | Low accuracy with non-frontal faces, sensitive to position changes, and challenged by occlusions and crowded backdrops.
SVM [65] | Outstanding generalization, adaptation to classification challenges, handling of non-linear data using kernels, and accuracy. | Computationally costly, sensitive to feature selection, has trouble handling big datasets, and needs precise parameter adjustment.
HOG [66] | Excellent at capturing edge and texture details, and resilient to minor changes in pose and lighting. | Demands precise adjustment of parameters (e.g., cell size, block normalization); less efficient when there are significant pose variations or occlusions.
AlexNet [67,68] | High accuracy, can handle enormous datasets, and is resistant to position, lighting, and expression changes. | High computing costs, extensive training data needs, and sensitivity to overfitting with limited datasets.
VGGNet [69] | High accuracy, deep architecture, and robust performance on complicated datasets with a variety of faces. | Computationally costly, requires huge datasets, and may suffer from overfitting with insufficient data.
ResNet [70,71] | Deep architecture improves accuracy, handles complicated features, and reduces vanishing gradient concerns. | Computationally complex, requires massive datasets, and can be difficult to train and infer.
FaceNet [72] | Real-time performance, strong feature extraction, high accuracy, and efficacy in large-scale face recognition. | Needs a lot of computing power and big datasets, and could have trouble with significant occlusions or position changes.
LBPNet [73] | Excels in texture categorization, capturing local patterns with great performance and economical calculation. | Difficulties with high computational cost and may underperform on complicated, extremely diverse materials.
LWCNN [74] | Excels in capturing spatial data and using lightweight, effective convolutional layers to increase classification accuracy. | Due to its lightweight design and parameter limitations, it struggles with extremely complicated patterns.
YOLO [75] | Excels in real-time object identification, providing quick, precise, and effective results for a range of activities. | Has trouble detecting small objects, maintaining accuracy in busy environments, and has limited precision in complicated situations.
MTCNN [76] | Demonstrates exceptional proficiency in multi-task facial detection, providing great precision in facial alignment and identification. | Performs poorly in complicated or obstructed face circumstances and has trouble in real time.
DeepMaskNet [77] | Specializes in precise object segmentation and uses deep learning to produce high-quality, accurate mask predictions. | Has significant computing needs and may struggle to execute in real time in complicated situations.
DenseNet [78] | Excels in feature reuse, increasing efficiency by densely linking layers for efficient information flow. | Has significant memory usage and computational complexity, which limits scalability to large-scale models or datasets.
MobileNetV2 [79] | Specializes in a lightweight, efficient architecture, providing quick performance with minimal computational and memory expenses. | May compromise accuracy for efficiency; difficulty with complicated jobs that need great precision and intricacy.
MobileFaceNets [80] | Excels at facial recognition in real time, combining cheap computational cost and great accuracy. | Decreased accuracy under difficult circumstances, like rapid changes in posture and occlusions.
ViT [81,82] | Uses the Transformer architecture to achieve high accuracy in collecting global context for picture recognition. | Struggles with smaller datasets or a lack of training data, and demands huge datasets and computing resources.
Face Transformer [83,84] | Excels in facial recognition, using a Transformer-based architecture to capture context and fine features. | Demands a lot of processing power and big datasets, having trouble with efficiency on smaller datasets.
DeepFace [85] | Achieves great accuracy even in difficult conditions like low-resolution inputs or big datasets. | High computing cost and complexity, especially with huge datasets or lengthy sequences, which might be a constraint.
Attention [86,87] | Excels in end-to-end learning for effective face recognition, precision, and robustness to variances. | Requires huge datasets, has trouble with heavy occlusions, and demands a lot of computing power during training.
Swin [89] | Excellent at collecting hierarchical features and provides great scalability and accuracy for image processing. | Requires a lot of resources, has high computational complexity, and might have trouble with jobs that need to be completed in real time.

The next section covers performance measures; it will inform us about the performance
metrics used in face recognition systems.

8. Performance Measures
8.1. Accuracy
Accuracy quantifies the model's overall correctness by comparing the number of correct
predictions to the total number of predictions. Below is the mathematical computation
of accuracy:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

In face recognition systems, a prediction is regarded as correct when it properly identifies
or confirms the person in the image. Incorrect predictions occur when the algorithm does
not recognize the person or incorrectly labels them as someone else.

8.2. Precision
Precision refers to the fraction of predicted positive matches that are correct. In face
recognition systems, it is the percentage of correct positive matches out of all matches
reported by the system. Below is the mathematical computation of precision:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

8.3. Recall
Recall is a statistic that indicates how often a machine learning model accurately
detects positive examples (true positives) among all of the actual positive samples in the
dataset. Below is the mathematical computation of recall:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

8.4. F1-Score
The F1-score is a critical assessment parameter for facial recognition algorithms, particularly
when dealing with unbalanced datasets. It offers a single measurement that balances
precision and recall. The F1-score is a more trustworthy estimate of a model's performance
than accuracy since it takes into account both false positives and false negatives. The
F1-score is the harmonic mean of precision and recall; the harmonic mean is used instead of
the standard arithmetic mean because it penalizes extreme values more. Below is the
mathematical computation of the F1-score:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \]

A perfect F1-score of 1 indicates excellent precision and recall, whereas the poorest
possible F1-score is 0.

8.5. Sensitivity
This is the system’s capacity to accurately recognize positives. This metric measures
the system’s ability to identify subject persons. Below is the mathematical computation
of sensitivity:

\[ \text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} \]

8.6. Specificity
This is the system’s capacity to accurately recognize negatives. This metric measures
the system’s ability to identify non-subject persons. Below is the mathematical computation
of specificity:

\[ \text{Specificity} = \frac{TN}{TN + FP} \]
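
As a simple illustration, the following sketch computes the metrics defined above from hypothetical confusion-matrix counts (TP, TN, FP, FN); the example values are invented for demonstration:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical verification results: 90 true matches, 5 false matches, 10 misses, 95 true rejections.
print(classification_metrics(tp=90, tn=95, fp=5, fn=10))
```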

8.7. AUC (Area Under the Curve)


The area under the ROC curve is a single measure that represents a binary classification
model’s overall performance.

8.8. ROC Curve (Receiver Operating Characteristic Curve)


A graphical representation of the trade-off between True Positive Rate and False
Positive Rate at different categorization levels.

Table 9 provides a summary of the strengths and weaknesses of various performance
metrics commonly used in face recognition systems. These metrics, including accuracy,
precision, recall, F1-score, ROC curve, and AUC, each have specific advantages and
limitations, especially when dealing with skewed datasets, false positives, and false negatives.

Table 9. Summary of the strengths and weaknesses of the performance metrics for face recognition.

Performance Metric | Strength | Weakness
Accuracy | Simple, intuitive, and easy to compute and comprehend. | Does not offer a whole view, particularly in skewed datasets; high accuracy might be deceptive in situations such as facial blockage or aging.
Precision and Recall | Useful for unbalanced datasets. Aids in determining false positives (precision) and proper identification (recall). | Precision and recall are negatively connected; does not provide a fair perspective when both false positives and false negatives must be reduced.
F1-Score | Balances precision and recall. Effective when both false positives and false negatives are essential. | Does not discriminate between the relative relevance of precision and recall; in some circumstances, discrepancies may be masked.
Receiver Operating Characteristic (ROC) Curve | Visualizes the trade-off between FAR and FRR. Aids in comparing models across various thresholds. | Does not offer direct insight into absolute performance; AUC-ROC might be deceptive in unbalanced datasets.
Area Under the Curve (AUC) | Provides a single number that summarizes performance. Resistant to skewed datasets. | Does not consider operational thresholds; some subgroups may be overestimated in terms of model performance.

The above are the most commonly used performance measures for face recognition
systems. The next section is on Face Recognition Loss Functions; the section will inform us
about how faces are mapped in order to achieve the best accuracy.

9. Face Recognition Loss Functions


In face recognition training, loss functions play a critical role in directing models to
optimize face representations. They seek to map faces into a feature space where intra-class
distances (same person) are minimized and inter-class distances (different persons) are
maximized, which directly impacts model performance.

9.1. Softmax Cross-Entropy Loss

\[ L_{\text{softmax}} = -\sum_{i=1}^{N} y_i \log(p_i) \]

where L_softmax is the loss, N is the number of classes, y_i is the ground-truth label (1 for
the correct class, 0 for others), and p_i is the predicted probability for class i, given by the
Softmax function. The goal is to minimize this loss function, which means maximizing the
probability predicted for the true class.
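
A minimal numerical sketch of this loss, assuming raw classifier logits as input (the values are hypothetical):

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, true_class: int) -> float:
    """Cross-entropy of a one-hot label against Softmax probabilities p_i."""
    z = logits - logits.max()                  # shift logits for numerical stability
    p = np.exp(z) / np.exp(z).sum()            # Softmax probabilities
    return float(-np.log(p[true_class]))       # only the true-class term survives the sum

# Hypothetical logits for 4 identities; the true identity is class 2.
print(softmax_cross_entropy(np.array([1.0, 0.5, 3.0, -1.0]), true_class=2))
```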

9.2. Triplet Loss

\[ L_{\text{triplet}} = \max(d(a, p) - d(a, n) + \alpha, 0) \]

In the Triplet Loss function, the variables represent key components for learning
discriminative embeddings. a is the anchor, which is the reference input, typically a face
image. p is the positive sample, which is another image of the same identity or class as the
anchor, while n is the negative sample, from a different class or identity. The term d(a, p)
measures the distance between the anchor and positive samples, aiming to keep them close,
while d(a, n) measures the distance between the anchor and the negative sample, aiming
to maximize the distance. The parameter α is a margin that ensures the anchor-negative
distance is sufficiently larger than the anchor-positive distance by a margin of α, preventing
trivial solutions. The max function ensures the loss is zero if the margin condition is met;
otherwise, it penalizes the model for not achieving sufficient separation between positive
and negative pairs.
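
A minimal sketch of the Triplet Loss computation on hypothetical 128-dimensional embeddings (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray, margin: float = 0.2) -> float:
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distances between embeddings."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance (same identity)
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance (different identity)
    return float(max(d_ap - d_an + margin, 0.0))

rng = np.random.default_rng(0)
a = rng.normal(size=128)
p = a + 0.05 * rng.normal(size=128)            # slightly perturbed copy plays the positive sample
n = rng.normal(size=128)                       # unrelated vector plays the negative sample
print(triplet_loss(a, p, n))                   # 0 when the negative is already far enough away
```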

9.3. Center Loss


\[ L_{\text{center}} = \frac{1}{2} \sum_{i=1}^{N} \| x_i - c_{y_i} \|_2^2 \]

In this equation, L_center represents the center loss. The variable N denotes the number
of samples in the batch. x_i refers to the feature vector of the i-th sample, and c_{y_i} is the
center of the class y_i to which the i-th sample belongs. The notation ||x_i − c_{y_i}||_2^2
represents the squared Euclidean distance between the feature vector x_i and the
corresponding class center c_{y_i}. This loss function minimizes the distance between the
feature representations of the same class by encouraging each sample to lie close to the
center of its corresponding class.
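
A minimal sketch of the center loss for a hypothetical batch of embeddings and class centers (the values are invented for illustration):

```python
import numpy as np

def center_loss(features: np.ndarray, labels: np.ndarray, centers: np.ndarray) -> float:
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over a batch of embeddings."""
    diffs = features - centers[labels]         # each sample minus the center of its own class
    return float(0.5 * np.sum(diffs ** 2))

# Hypothetical batch: four 3-dimensional embeddings, two classes with their current centers.
features = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.1]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(center_loss(features, labels, centers))
```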

9.4. ArcFace Loss


\[ L_{\text{arcface}} = -\log\left( \frac{\exp(\cos(\theta_y + m))}{\exp(\cos(\theta_y + m)) + \sum_{j \neq y} \exp(\cos(\theta_j))} \right) \]

In this equation, L_arcface represents the ArcFace loss. The term θ_y is the angle between
the feature vector and the weight vector of the true class y. m is a margin added to the
true class angle θ_y to enforce a larger angular distance for correct classification. The cosine
of this modified angle, cos(θ_y + m), is used to increase the decision margin for the true
class. The denominator is the sum of the exponentials of the cosine values of the
angles for all classes, including the true class and all other classes j ≠ y. This loss function
encourages the model to distinguish between classes by maximizing the angular margin
between the true class and the other classes, thereby improving classification accuracy.
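
A minimal sketch of the ArcFace formulation as written above (note that practical implementations usually also apply a scale factor, which the expression above omits; the margin and dimensions here are illustrative):

```python
import numpy as np

def arcface_loss(embedding: np.ndarray, weights: np.ndarray, true_class: int, margin: float = 0.5) -> float:
    """Additive angular-margin loss, following the formulation above (no scale factor)."""
    x = embedding / np.linalg.norm(embedding)                       # normalize the feature vector
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)    # normalize each class weight vector
    cos_theta = w @ x                                               # cosine of the angle to every class
    theta_y = np.arccos(np.clip(cos_theta[true_class], -1.0, 1.0))
    logits = cos_theta.copy()
    logits[true_class] = np.cos(theta_y + margin)                   # enlarge the angular margin of the true class
    exp = np.exp(logits)
    return float(-np.log(exp[true_class] / exp.sum()))

rng = np.random.default_rng(1)
print(arcface_loss(rng.normal(size=64), rng.normal(size=(10, 64)), true_class=3))
```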

9.5. Contrastive Loss

\[ L_{\text{contrastive}} = \frac{1}{2} \left[ y \cdot \| a - b \|_2^2 + (1 - y) \cdot \max(0, m - \| a - b \|_2)^2 \right] \]
In this equation, L_contrastive represents the contrastive loss. The terms a and b are
feature vectors of two samples (e.g., two images), and ||a − b||_2 is the Euclidean distance
between them. The label y is a binary indicator where y = 1 indicates that the two
samples are from the same class (positive pair), and y = 0 indicates that they are from
different classes (negative pair).
For positive pairs, the loss is proportional to the squared Euclidean distance between
the feature vectors, encouraging the model to bring similar samples closer. For negative
pairs, the loss is based on a margin m, encouraging the model to push dissimilar samples
apart. If the distance between the negative pair is smaller than the margin m, the loss will
be proportional to the square of the difference from the margin. If the distance exceeds the
margin, no penalty is applied. This loss function promotes the model to learn embeddings
that minimize the distance for similar samples and maximize the distance for dissimilar
ones, thus improving the model’s discriminative ability.
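
A minimal sketch of the contrastive loss for a hypothetical pair of embeddings (the margin is illustrative):

```python
import numpy as np

def contrastive_loss(a: np.ndarray, b: np.ndarray, same_identity: int, margin: float = 1.0) -> float:
    """0.5 * [ y * d^2 + (1 - y) * max(0, margin - d)^2 ] with d the Euclidean distance."""
    d = np.linalg.norm(a - b)
    positive_term = same_identity * d ** 2                             # pull same-identity pairs together
    negative_term = (1 - same_identity) * max(0.0, margin - d) ** 2    # push different identities apart
    return float(0.5 * (positive_term + negative_term))

rng = np.random.default_rng(2)
a, b = rng.normal(size=128), rng.normal(size=128)
print(contrastive_loss(a, b, same_identity=0))   # no penalty if the pair is already farther than the margin
```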

9.6. Margin Cosine Loss


\[ L_{\text{cosine}} = 1 - \frac{\cos(\theta_y + m)}{\| w_y \|_2 \cdot \| x_i \|_2} \]
In this equation, L_cosine represents the cosine loss. The term cos(θ_y + m) is the cosine
similarity between the feature vector x_i of the input image and the weight vector w_y
corresponding to the target class, adjusted by a margin m.
The vectors x_i and w_y are normalized, and the denominator ||w_y||_2 · ||x_i||_2 ensures
that they are scaled to unit length. The cosine similarity measures the angle between
two vectors in a high-dimensional space: a value closer to 1 indicates the vectors are
similar, while values closer to 0 or negative values indicate dissimilarity.
The margin m is added to the angle θy to introduce a margin between the correct class
and other classes, forcing the network to produce more discriminative embeddings. The
cosine loss function is designed to encourage the model to maximize the cosine similarity
for the correct class while minimizing it for other classes. This makes the model more robust
in differentiating between different classes by ensuring a clear margin in the feature space.
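A literal sketch of the expression above for a single sample (since the vectors are normalized to unit length, the denominator evaluates to 1; the margin value is illustrative):

```python
import numpy as np

def margin_cosine_loss(x: np.ndarray, w_y: np.ndarray, margin: float = 0.35) -> float:
    """Literal reading of the expression above: 1 - cos(theta_y + m) / (||w_y|| * ||x_i||)."""
    x = x / np.linalg.norm(x)          # normalize the feature vector to unit length
    w_y = w_y / np.linalg.norm(w_y)    # normalize the class weight vector to unit length
    theta_y = np.arccos(np.clip(np.dot(w_y, x), -1.0, 1.0))   # angle between the two unit vectors
    # After normalization, the denominator ||w_y|| * ||x_i|| equals 1.
    return float(1.0 - np.cos(theta_y + margin) / (np.linalg.norm(w_y) * np.linalg.norm(x)))

rng = np.random.default_rng(3)
print(margin_cosine_loss(rng.normal(size=64), rng.normal(size=64)))
```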
The above are the most commonly used loss functions for face recognition systems. The
next section covers recent applications of CNN architectures and will inform us about the
areas where facial recognition systems have been applied and accuracy has been achieved.

10. Recent Applications of CNN Architectures


In 2016, Sharma et al. [91] suggested a new face recognition approach to effectively
recognize individuals. They developed a face recognition system utilizing deep learning
and a convolutional neural network (CNN) with Dlib face alignment. The suggested
system includes four core processes: face identification, alignment, cropping, and feature
extraction. The study employed a set of 286 labels on 11,284 cropped grayscale photos
with a resolution of 96 × 96. After 20 epochs, FAREC achieved 96% accuracy for FRGC
with a false acceptance rate of 0.1% (1 in 100). However, position variation and intensity
fluctuation remain a concern.
In 2017, Arsenovic et al. [92] suggested a face recognition model that may be incor-
porated into other systems as a supporting or major component for monitoring, with or
without minimal changes. The suggested technique uses face augmentation to enlarge the
dataset and achieve improved accuracy on smaller datasets. The augmentation procedure
was divided into two steps. The initial step used typical picture augmentation techniques,
such as noise and blurring at various intensities. This model comprises a number of
critical steps: Face detection, picture pre-processing (identifying landmarks, positioning,
embeddings, and classification). With fewer face photos and the suggested augmentation
approach, a high overall accuracy of 95.02% was attained. Despite being mainly automated,
these methods are nonetheless prone to mistakes.

Al-Azzawi et al. [93]’s experimental findings in 2018 indicate that utilizing a Localized-
Based CNN structure for face recognition and identification improves performance by
leveraging the Localized Deep feature map. The model addresses issues with large variations
in expression, pose, and lighting, as well as poor resolution. The suggested model achieved 98.76% facial
recognition accuracy.
In 2018, Lu et al. [94] proposed Deep Coupled ResNet (DCR) for low-resolution face
recognition. The suggested DCR model consistently outperformed the state of the art,
according to a thorough examination on the LFW database. For a probe size of 8 × 8,
when using the LFW database, the DCR achieved an accuracy of 93.6%, which was better
compared to 67.7%, 72.7%, and 75.0% achieved by the LightCNN, ResNet, and VGGFace,
respectively [94].
In 2018, Qu et al. [95] proposed a real-time facial recognition based on a convolution
neural network (CNN) on Field Programmable Gate Array (FPGA), which enhances speed
and accuracy. FPGA technology allows parallel computing and unique logic circuit design,
resulting in faster processing speeds compared to Central Processing Unit (CPU), Graphics
Processing Unit (GPU), and Tensor Processing Unit (TPU) processors. Parallel processing
of FPGA speeds up network calculation, enabling real-time face recognition. The network
operates at a clock frequency of 50 Megahertz (MHz) and achieves recognition speeds of
up to 400 Frames per second (FPS), exceeding previous achievements. The recognition rate
was 99.25%, higher than that of the human eye.
In 2019, Talab et al. [96] developed an effective sub-pixel convolution neural network
for face recognition at low to high resolutions. This convolutional neural network is used
in image pre-processing to improve recognition of low-resolution pictures. The suggested
Efficient Sub-Pixel Convolutional Neural Network converts low-resolution pictures to high-
resolution for further recognition. Evaluations were conducted using Yale face database
and ORL dataset faces. The suggested technique achieved greater accuracy for the Yale and
ORL datasets (95.3% and 93.5%, respectively) compared to prior methods.
In 2020, Feng et al. [97] proposed a small sample face recognition approach using
ensemble deep learning. This approach improves network performance by increasing
the number of training samples, optimizing training parameters, and reducing the dis-
advantage of small sample sizes. Experiments indicate that this technique outperforms
convolutional neural networks in recognizing faces in the ORL database, even with few
training sets.
In 2020, Lin et al. [98] suggested LWCNN for small sample data and employed k-fold
cross-validation to ensure robustness. Lin et al.’s suggested LWCNN approach outper-
formed other methods in terms of recognition accuracy and avoidance of overfitting in
limited sample spaces.
In 2021, Yudita and colleagues [99] investigated the concept of face recognition utilizing
VGG16 based on incomplete or insufficient human face data. The study achieved low
accuracy; Yudita et al. suggested that the effectiveness of CNN models for human face
recognition may be impacted by the many databases that are employed, each having
varying quantities and kinds of data.
In 2021, Szmurlo and Osowski [100] suggested a technique that combines feature
selection methods with three classifiers: support vector machine, random forest of decision
trees, and Softmax incorporated into a CNN. The system uses an ALEXNET-based CNN.
Results indicate that SVM outperforms random forests. However, the SVM is
not as accurate as the standard softmax classifier (96.3% vs. 97.3%, respectively). Classical
classifiers perform poorly due to little learning data. CNN generates a huge number
of descriptors, making traditional classifiers difficult and requiring a large number of
learning samples.

Multi-task cascaded convolutional neural networks (MTCNNs) were used for rapid
face identification and alignment, while FaceNet with increased loss function provided
high-accuracy face verification and recognition. The study compared the performance of
their MTCNN and FaceNet hybrid network for face identification and recognition with other
deep learning algorithms and approaches. The testing results show that the upgraded
FaceNet can handle real-time recognition needs with an accuracy of 99.85%, compared to
97.83% achieved by MTCNN. They recommended that face detection and recognition
functions can be efficiently integrated into access control systems [101].
The studies conducted by Sarahi et al. [102] in 2021 show that YOLO-Face achieves
precision, recall, and accuracy scores of 99.8%, 72.9%, and 72.8%. The only model that
outperforms YOLO-Face is MTCNN [102], which achieves an accuracy of 81.5% in the
FDDB dataset. Furthermore, the CelebA dataset, which consists of 19,962 image faces, was
used to test the same face detection models. The findings indicate that all of the models
work well when used with the CelebA dataset, but Face-SSD, with 99.7% accuracy, slightly
outperforms YOLO-Face, which achieves 99.6%. Lastly, it should be noted that YOLO-Face
achieves 95.8%, 94.2%, and 87.4% on the Easy, Medium, and Hard subsets of WIDER FACE
validation. It is evident that small-scale faces generate superior results for the YOLO-Face
detector [102].
In 2021, Malakar et al. [103] suggested a reconstructive approach to obtain partially
restored characteristics of the occluded area of the face, which was subsequently recognized
using an existing deep learning method. The proposed technique does not entirely recreate
the occluded section, but it does give enough characteristics to enhance recognition accuracy
by up to 15%.
In 2022, Marjan et al. [104] provided an improved VGG19 deep model to enhance the
accuracy of masked face recognition systems. The experimental findings show that the
suggested extended VGG19 method outperforms other techniques. The suggested model
accurately detects the frontal face with a mask (96%).
With respect to generative adversarial network (GAN)-based techniques, Li et al. [105]
provided an algorithm architecture including de-occlusion and distillation modules. The
de-occlusion module utilizes GAN for masked face completion, allowing for the recovery of
occluded features and eliminating appearance ambiguity. The distillation module employs
a pre-trained model for face categorization. The simulated LFW dataset had the greatest
recognition accuracy of 95.44%. GANs offer a way to grow datasets without requiring a lot
of real-world data collection, which enhances system performance while resolving privacy
and data availability issues [106–108]. However, it is difficult for GAN-based algorithms
to duplicate the characteristics of the face’s key features, especially when there is broad
occlusion, as in the case of a facemask [109].
Using a Convolutional Block Attention Module (CBAM) [110], Li et al. [111] presented
a novel technique that blends cropping-based and attention-based methodologies. In
order to maximize recognition accuracy, the cropping-based method eliminates the masked
face region while experimenting with different cropping proportions. The masked facial
features received lower weights in the attention-based procedure, but the characteristics
surrounding the eyes received larger weights. The accuracy of this method was 92.61% for
Masked Face Recognition (MFR). In a different work, Deng et al. [112] improved masked
facial recognition by applying cosine loss to create the MF-Cosface algorithm, which
outperformed the attention-based approach. In order to improve the model’s emphasis
on the uncovered face area, they also developed an Attention–Inception module that
combines CBAM with Inception–ResNet. Verification tasks were somewhat improved by
this approach. Wu [113] presented a local restricted dictionary learning approach for an
attention-based MFR algorithm that distinguishes between the face and the mask. It uses
the attention mechanism to lessen information loss and the dilated convolution to increase
picture resolution.
Table 10 summarizes various face recognition architectures, their corresponding train-
ing sets, and performance metrics, highlighting key approaches such as Convolutional
Neural Networks (CNNs), VGG16, FaceNet, and others across different datasets and years.
The table provides insights into the evolution of face recognition models, including their
verification metrics and accuracy rates, showcasing both traditional and contemporary
methods.

Table 10. Summary of CNN architectures used for face recognition systems.

Architecture | Training Set | Year | Authors | Convolutional Layers | Verif. Metric | Accuracy
SVM | ORL | 2003 | Yanhun and Chongqing [114] | - | - | 96%
Gabor Wavelets | ORL | 2001 | Kepenekci [115] | - | - | 95.25%
ESP-CNN + CNN | ORL | 2019 | Talab et al. [96] | - | - | 93.5%
Ensemble CNN | ORL | 2020 | Feng et al. [97] | 4 | Softmax | 88.5%
VGG16 | ORL | 2020 | Lou and Shi [116] | 16 | Center Loss and Softmax | 99.02%
DeepFace | LFW | 2014 | Taigman et al. [48,85] | - | Softmax | 97.35%
FaceNet | LFW | 2015 | Parkhi et al. [48] | - | Triplet Loss | 98.87%
DCR | LFW | 2018 | Lu et al. [94] | - | CM Loss | 93.6%
ResNet | LFW | 2018 | Lu et al. [94] | - | CM Loss | 72.7%
Localized Deep-CNN | LFW | 2018 | Al-Azzawi et al. [93] | - | Softmax | 97.13%
FaceNet | LFW | 2021 | Malakar et al. [103] | - | - | 70–80%
MTCNN | LFW | 2021 | Wu and Zhang [101] | 9 | Triplet Loss and ArcFace Loss | 97.83%
MTCNN + FaceNet | LFW | 2021 | Wu and Zhang [101] | 9 | Triplet Loss and ArcFace Loss | 99.85%
VGG16 | VGGFace | 2015 | Parkhi et al. [85] | - | Triplet Loss | 98.95%
Light-CNN | MS-Celeb-1M | 2015 | Parkhi et al. [48] | - | Softmax | 98.8%
Traditional CNN | CMU-PIE | 2018 | Qu et al. [95] | 5 | Sigmoid | 99.25%
ESPCN + CNN | Yale | 2019 | Talab et al. [96] | - | - | 95.3%
VGG16 | Yale | 2020 | Lou and Shi [116] | 16 | Center Loss and Softmax | 97.62%
LWCNN | Yale Face Database | 2020 | Lin et al. [98] | 9 | Softmax | 96.19%
VGG16 | CASIA | 2020 | Lou and Shi [116] | 16 | Center Loss + Softmax | 98.65%
CASIA Mask | CASIA | 2021 | - | - | Occluded Faces | -
AlexNet | Own dataset | 2021 | Szmurlo and Osowski [100] | 9 | Softmax | 97.8%
AlexNet | Own dataset | 2022 | Mahesh and Ramkumar [117] | 8 | Softmax | 96%
Face Transformer | - | 2022 | Sun and Tzimiropoulos [84] | - | - | 99.83%
YOLO-Face | FDDB | 2021 | Sarahi et al. [102] | - | - | 72.8%
PCA + FaceNet | Yale Face Database B | 2021 | Malakar et al. [103] | - | - | 85–95%
Extended VGG19 | - | 2022 | Marjan et al. [104] | 19 | Softmax | 96%

When comparing various convolutional neural network (CNN) architectures used for
face recognition, several differences in accuracy and training datasets become apparent.
For instance, FaceNet, a model developed by Parkhi et al. [48] in 2015, demonstrated an
impressive accuracy of 98.87% on the LFW dataset, showcasing its strong performance
in large-scale face verification tasks. Similarly, VGG16, introduced by Parkhi et al. [48]
in the same year, achieved a slightly higher accuracy of 98.95% on the VGGFace dataset,
which indicates its effectiveness in handling facial recognition tasks involving variations
in face poses, lighting, and identities. The models trained on these benchmark datasets,
LFW and VGGFace, highlight the robustness of these architectures in large-scale real-
world conditions.
On the other hand, models like SVM (2003) and Gabor Wavelets (2001), while demon-
strating high accuracy on smaller datasets like ORL (96% and 95.25%, respectively), show
limitations in terms of scalability to larger, more complex datasets. Their reliance on simpler
algorithms and smaller training sets places them at a disadvantage when compared to more
recent architectures like FaceNet and VGG16, which leverage deep learning techniques
and larger, more diverse datasets. For example, FaceNet’s success on LFW (98.87%) and its
application to other datasets like Yale Face Database B (70–80%) emphasize its versatility
and high generalization capabilities.
Moreover, architectures like MTCNN, which achieved 99.85% accuracy on the WIDER
Face dataset and LFW, demonstrate the advantage of combining multiple models—such as
face detection and face recognition—into a single framework. This further underscores the
importance of using multi-task learning and ensemble methods to achieve state-of-the-art
results. While older models like Gabor Wavelets performed reasonably well on smaller
datasets, the increasing accuracy and applicability of newer models on larger datasets
signal the rapid progress in the field of face recognition systems.
In conclusion, models like FaceNet, VGG16, and MTCNN, when trained on compre-
hensive datasets such as LFW and VGGFace, show superior performance, with accuracy
rates exceeding 98%. In contrast, earlier models such as SVM and Gabor Wavelets perform
well on smaller, less complex datasets but fall short in comparison to modern architectures
designed for scalability and robustness. The evolution from simpler models to advanced
deep learning approaches highlights the growing demand for sophisticated algorithms
capable of handling large-scale, real-world recognition challenges.

11. Limitations of Face Recognition Models


Concerns regarding the violation of individual privacy rights are raised by the use of
face data without authorization, particularly in public monitoring systems [118]. Facial
recognition systems have been demonstrated to have severe biases, notably regarding
ethnicity, gender, and age [8]. These biases can lead to increased mistake rates, such as
misidentification or false positives, which can be problematic in law enforcement or security
applications [8]. Face recognition systems are subject to spoofing attacks, which occur
when an attacker deceives the system with pictures, videos, or 3D models of a person’s
face. Although systems are improving with the use of liveness detection, these assaults are
still a risk for extremely sensitive applications [119].

12. Challenges, Results, and Discussion


From our review of CNNs for face recognition, we discovered that some researchers
are facing challenges with applying face recognition on low-resolution images.
Al-Azzawi et al. [93], suggested that using a Localized Deep feature map improves facial
recognition accuracy when dealing with low-resolution images. Talab et al. [96] suggested
that conversion of low-resolution images to high-resolution ones is the best solution for
handling low-resolution images for better accuracy. All the suggested techniques achieved
good accuracy; however, they are computationally expensive and most likely to struggle in
real-time face recognition.
We also discovered that some researchers are building face recognition systems with
small datasets due to challenges having to do with collecting data, over-fitting, and reducing
computational time. Feng et al. [97] proposed a small sample face recognition approach
using ensemble deep learning. Lin et al. [98] suggested LWCNN for small sample data and
employed k-fold cross-validation to ensure robustness. Using a small sample dataset raises
questions regarding the accuracy of the model in real-life applications. Generally, CNNs are
known to use large datasets to achieve the best accuracy.
Occlusion, pose, and light are also still a concern. Malakar et al. [103] suggested a
reconstructive approach to obtain partially restored characteristics of the occluded area
of the face, which was subsequently recognized using an existing deep learning method.
Marjan et al. [104] provided an improved VGG19 deep model to enhance the accuracy of
masked face recognition systems. The two techniques rely on existing unmasked data to
identify an individual. The techniques appear to be ineffective if the unmasked data are
not available, and they also lack the ability to fully reconstruct the missing parts.
Most researchers are using images to train and test facial recognition systems [8]. Using
image face datasets helps them achieve high accuracy provided that the following conditions
are met: 1. for the most part, the subject is directly facing the camera; and 2. the environment is
controlled. Using video for facial recognition presents many challenges but creates an
opportunity for research that explores gait recognition. Gait recognition has the advantages of
overcoming camera angle problems and of identifying a masked person by their movement.

13. Conclusions
In this study, we explored CNNs for face recognition, databases, and their performance
metrics. CNNs are very versatile depending on the objective(s), provided that their architecture
and layers are not limited. Face recognition systems can identify a person under various
conditions; however, most of them struggle with occlusion, camera angle, pose, and light.

13.1. Observations
• More sophisticated deep learning architectures like FaceNet, VGG16, and MTCNN
have clearly replaced previous, simpler models like SVM and Gabor Wavelets. Newer
models increase scalability and accuracy, demonstrating the field of face recognition’s
increasing complexity.
• There are several CNN designs, databases, and performance measures, as we have
shown. In the face recognition field, several CNN architectures are applied to different
tasks. Certain designs work well in scenarios such as low-quality images or videos
and frontal face recognition. There were more images than videos in most of the
datasets used to train and evaluate face recognition algorithms. We also noticed that
most researchers employed Softmax as the verification metric.
• We observed that most of the databases used for training and testing face recognition
models had fewer Black people in comparison to other races. The lack of balance in the
datasets leads to bias. The issue of privacy surrounding cameras is still being debated
in the US [8,120]. One aspect of the discussion is the inability of facial recognition
software to distinguish between black and white faces, resulting in racial profiling
and erroneous arrests [8,120].
• Occlusion, camera angle, pose, and lighting are all issues that persist [8]. Researchers
are attempting to devise solutions to these challenges; however, the varying resolutions
provided by the source of images or videos, such as CCTV, creates additional issues.
Intelligent and automated face recognition systems work well in controlled contexts,
but badly in uncontrolled ones [121]. The two main causes of this are facial alignment
and the use of high- or low-resolution face images taken in controlled settings by
researchers for training.
• Models trained on bigger and more diversified datasets, like LFW and VGGFace, often
perform better in real-world, large-scale recognition tasks. The comparison of simpler
datasets such as ORL and more complex datasets such as LFW indicates that dataset
size and variety are important variables in the performance of face recognition systems.
• Little research has been conducted using Transformers and Mamba for face recognition.
• Most researchers use accuracy as a performance metric. The use of accuracy as a
performance measurement without comparing it to other performance metrics raises
questions about the validity of results.
• The use of several models for face detection and recognition, as demonstrated in
MTCNN, highlights the potential advantages of hybrid and ensemble techniques.
Research into more advanced hybrid models might increase the robustness and adapt-
ability of face recognition systems.

13.2. Contribution of Article


This work will contribute to the body of knowledge in the face recognition field by
providing an updated overview of facial recognition systems, encompassing past, present,
and future challenges. The report also summarizes 266 publications which were reviewed
and compared. The CNN architectures and performance measures detailed in this paper
demonstrate their applicability and challenges.

13.3. Future Work


In order to overcome present constraints and improve the capabilities of current models,
future research in face recognition systems should concentrate on a few crucial areas. The
creation of hybrid techniques, which combine the scalability and performance of contemporary
deep learning architectures with the simplicity and effectiveness of older models like SVM
and Gabor Wavelets, is one encouraging avenue. More reliable training methods and data
augmentation procedures are also required, as evidenced by the critical requirement to
improve the generalization of models such as FaceNet and VGG16 over a variety of datasets
with different lighting, postures, and occlusions. As proven by MTCNN, multi-task learning
and ensemble techniques provide chances to enhance the integration of face detection and
identification inside a single, cohesive framework. Another crucial topic is improving the
quality of large-scale datasets by tackling issues that might impair model performance, such as
noise, low resolution, and poor picture quality. Additionally, to enhance face identification in
unrestricted, real-world settings, models that are more robust to changes in occlusions, extreme
positions, and facial emotions must be developed. The requirement for accurate and resource-
efficient models that guarantee that computational constraints do not impair performance is
increasing as real-time facial recognition system applications proliferate. Finally, more studies
should be conducted on the ethical issues of fairness, prejudice, and privacy in face recognition
systems. Future studies should concentrate on reducing bias and making sure privacy laws
are followed in real-world applications.

Author Contributions: Conceptualization, A.N., C.C. and S.V.; methods, A.N. and C.C.; investigation,
A.N. and C.C.; formal Analysis, A.N.; data curation, A.N.; writing—original draft preparation, A.N.;
writing—review and editing, A.N., C.C. and S.V.; supervision, S.V. and C.C.; project administration,
S.V. and C.C.; funding acquisition, S.V. and C.C. Authorship has been limited to those who have
contributed substantially to the work reported. All authors have read and agreed to the published
version of the manuscript.

Funding: This work was undertaken within the context of the Centre for Artificial Intelligence
Research, which is supported by the Centre for Scientific and Innovation Research (CSIR) under grant
number CSIR/BEI/HNP/CAIR/2020/10, supported by the Government of the Republic of South
Africa through its Department of Science and Innovation’s University Capacity Development grants.

Institutional Review Board Statement: Not applicable.



Informed Consent Statement: Not applicable.

Data Availability Statement: The datasets analysed in this study can be found in the University of
Johannesburg repository using the following link: https://figshare.com/s/dbfbe28773afeb71872f
(accessed on 12 June 2024).

Acknowledgments: We acknowledge both the moral and technical support given by the University
of Johannesburg, University of KwaZulu-Natal, and Sol Plaatje University.

Conflicts of Interest: The funders had no role in the design of the study; in the collection, analyses,
or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

MDPI Multidisciplinary Digital Publishing Institute


DOAJ Directory of open-access journals
TLA Three-letter acronym
LD Linear dichroism
CNN Convolutional neural networks
VGG Visual Geometry Group
ORL Olivetti Research Laboratory
FERET Facial Recognition Technology
VGG16 Visual Geometry Group 16
LFW Labeled Faces in the Wild
VGGFace Visual Geometry Group Face
MTCNN Multi-Task Cascaded Convolutional Networks
FRGC Face Recognition Grand Challenge
CASIA Chinese Academy of Sciences Institute of Automation
IARPA Intelligence Advanced Research Projects Activity
DMFD Dynamic Multi-Factor Database
PCA Principal Component Analysis
SVM Support Vector Machine
HOG Histogram of Oriented Gradients
AlexNet AlexanderNet (CNN Architecture)
VGGNet Visual Geometry Group Network (CNN Architecture)
ResNet Residual Network
FaceNet Face Recognition Network
LBPNet Local Binary Patterns Network
LWCNN Lightweight Convolutional Neural Network
YOLO You Only Look Once
ViT Vision Transformer
AR Affective Computing (AR)
CVL Computer Vision Laboratory
AUC Area Under the Curve
GAN Generative adversarial networks
QAR Quality Assessment Rules
CBAM Convolutional Block Attention Module
DCR Deep Coupled ResNet
FPGA Field Programmable Gate Array
CPU Central Processing Unit (CPU)
GPU Graphics Processing Unit
TPU Tensor Processing Unit
MHz Megahertz
FPS Frames Per Second

References
1. Junayed, M.S.; Sadeghzadeh, A.; Islam, M.B. Deep covariance feature and cnn-based end-to-end masked face recognition. In
Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur,
India, 15–18 December 2021; IEEE: New York, NY, USA, 2021; pp. 1–8.
2. Chien, J.T.; Wu, C.C. Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Trans. Pattern Anal.
Mach. Intell. 2002, 24, 1644–1649. [CrossRef]
3. Wan, L.; Liu, N.; Huo, H.; Fang, T. Face recognition with convolutional neural networks and subspace learning. In Proceedings
of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; IEEE: New
York, NY, USA, 2017; pp. 228–233.
4. Jain, A.K.; Ross, A.A.; Nandakumar, K. Introduction to Biometrics; Springer: Berlin/Heidelberg, Germany, 2011.
5. Taskiran, M.; Kahraman, N.; Erdem, C.E. Face recognition: Past, present and future (a review). Digit. Signal Process. 2020,
106, 102809. [CrossRef]
6. Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face Recognition Systems: A Survey. Sensors 2020, 20, 342. [CrossRef]
7. Saini, R.; Rana, N. Comparison of various biometric methods. Int. J. Adv. Sci. Technol. 2014, 2, 24–30.
8. Nemavhola, A.; Viriri, S.; Chibaya, C. A Scoping Review of Literature on Deep Learning Techniques for Face Recognition. Hum.
Behav. Emerg. Technol. 2025, 2025, 5979728. [CrossRef]
9. Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020,
9, 1188. [CrossRef]
10. Nilsson, N.J. The Quest for Artificial Intelligence; Cambridge University Press: Cambridge, UK, 2009.
11. de Leeuw, K.M.M.; Bergstra, J. The History of Information Security: A Comprehensive Handbook; Elsevier: Amsterdam,
The Netherlands, 2007.
12. Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [CrossRef]
13. Gates, K.A. Our Biometric Future: Facial Recognition Technology and the Culture of Surveillance; NYU Press: New York, NY, USA,
2011; Volume 2.
14. King, I. Neural Information Processing: 13th International Conference, ICONIP 2006, Hong Kong, China, 3–6 October 2006: Proceedings;
Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
15. Xue-Fang, L.; Tao, P. Realization of face recognition system based on Gabor wavelet and elastic bunch graph matching. In
Proceedings of the 2013 25th Chinese Control and Decision Conference (CCDC), Guiyang, China, 25–27 May 2013; IEEE: New
York, NY, USA, 2013; pp. 3384–3386.
16. Kundu, M.K.; Mitra, S.; Mazumdar, D.; Pal, S.K. Perception and Machine Intelligence: First Indo-Japan Conference, PerMIn 2012, Kolkata,
India, 12–13 January 2011, Proceedings; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 7143.
17. Guo, G.; Zhang, N. A survey on deep learning based face recognition. Comput. Vis. Image Underst. 2019, 189, 102805. [CrossRef]
18. Datta, A.K.; Datta, M.; Banerjee, P.K. Face Detection and Recognition: Theory and Practice; CRC Press: Boca Raton, FL, USA, 2015.
19. Gofman, M.I.; Villa, M. Identity and War: The Role of Biometrics in the Russia-Ukraine Crisis. Int. J. Eng. Sci. Technol. (IJonEST)
2023, 5, 89. [CrossRef]
20. Barnouti, N.H.; Al-Dabbagh, S.S.M.; Matti, W.E. Face recognition: A literature review. Int. J. Appl. Inf. Syst. 2016, 11, 21–31.
[CrossRef]
21. Lal, M.; Kumar, K.; Arain, R.H.; Maitlo, A.; Ruk, S.A.; Shaikh, H. Study of face recognition techniques: A survey. Int. J. Adv.
Comput. Sci. Appl. 2018, 1–8. [CrossRef]
22. Shamova, U. Face Recognition in Healthcare: General Overview. In Language in the Sphere of Professional Communication; Yekaterinburg, Russia,
2020; pp. 748–752. Available online: https://elar.urfu.ru/handle/10995/84113 (accessed on 17 March 2024).
23. Elngar, A.A.; Kayed, M. Vehicle security systems using face recognition based on internet of things. Open Comput. Sci. 2020,
10, 17–29. [CrossRef]
24. Xing, J.; Fang, G.; Zhong, J.; Li, J. Application of face recognition based on CNN in fatigue driving detection. In Proceedings of
the 2019 International Conference on Artificial Intelligence and Advanced Manufacturing, Dublin, Ireland, 17–19 October 2019;
pp. 1–5.
25. Pabiania, M.D.; Santos, K.A.P.; Villa-Real, M.M.; Villareal, J.A.N. Face recognition system for electronic medical record to access
out-patient information. J. Teknol. 2016, 78. [CrossRef]
26. Aswis, A.; Morsy, M.; Abo-Elsoud, M. Face Recognition Based on PCA and DCT Combination Technique. Int. J. Eng. Res. Technol.
2018, 4, 1295–1298.
27. Ranjan, R.; Sankaranarayanan, S.; Bansal, A.; Bodla, N.; Chen, J.C.; Patel, V.M.; Castillo, C.D.; Chellappa, R. Deep learning for
understanding faces: Machines may be just as good, or better, than humans. IEEE Signal Process. Mag. 2018, 35, 66–83. [CrossRef]
28. Loy, C.C. Face detection. In Computer Vision: A Reference Guide; Springer: Berlin/Heidelberg, Germany, 2021; pp. 429–434.
29. Yang, M.H. Face Detection. In Encyclopedia of Biometrics; Li, S.Z., Jain, A.K., Eds.; Springer: Boston, MA, USA, 2015; pp. 447–452.
[CrossRef]
30. Calvo, G.; Baruque, B.; Corchado, E. Study of the pre-processing impact in a facial recognition system. In Proceedings of
the Hybrid Artificial Intelligent Systems: 8th International Conference, HAIS 2013, Salamanca, Spain, 11–13 September 2013;
Proceedings 8; Springer: Berlin/Heidelberg, Germany, 2013; pp. 334–344.
31. Benedict, S.R.; Kumar, J.S. Geometric shaped facial feature extraction for face recognition. In Proceedings of the 2016 IEEE
International Conference on Advances in Computer Applications (ICACA), Coimbatore, India, 24 October 2016; pp. 275–278.
[CrossRef]
32. Napoléon, T.; Alfalou, A. Pose invariant face recognition: 3D model from single photo. Opt. Lasers Eng. 2017, 89, 150–161.
[CrossRef]
33. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.; Horsley, T.; Weeks, L.;
et al. PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 2018, 169, 467–473.
[CrossRef] [PubMed]
34. Phillips, P.; Moon, H.; Rizvi, S.; Rauss, P. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern
Anal. Mach. Intell. 2000, 22, 1090–1104. [CrossRef]
35. Belhumeur, P.; Hespanha, J.; Kriegman, D. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE
Trans. Pattern Anal. Mach. Intell. 1997, 19, 711–720. [CrossRef]
36. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database forstudying face recognition in
unconstrained environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and
Recognition, Marseille, France, 17–20 October 2008.
37. Zou, J.; Ji, Q.; Nagy, G. A comparative study of local matching approach for face recognition. IEEE Trans. Image Process. 2007,
16, 2617–2628. [CrossRef]
38. Peer, P. CVL Face Database; University of Ljubljana: Ljubljana, Slovenia, 2010.
39. Fox, N.; Reilly, R.B. Audio-visual speaker identification based on the use of dynamic audio and visual features. In Proceedings
of the International Conference on Audio-and Video-Based Biometric Person Authentication, Guildford, UK, 9–11 June 2003;
Springer: Berlin/Heidelberg, Germany, 2003; pp. 743–751.
40. Bailly-Bailliére, E.; Bengio, S.; Bimbot, F.; Hamouz, M.; Kittler, J.; Mariéthoz, J.; Matas, J.; Messer, K.; Popovici, V.; Porée, F.; et al.
The BANCA database and evaluation protocol. In Proceedings of the Audio-and Video-Based Biometric Person Authentication:
4th International Conference, AVBPA 2003, Guildford, UK, 9–11 June 2003; Proceedings 4; Springer: Berlin/Heidelberg, Germany,
2003; pp. 625–638.
41. Phillips, P.; Flynn, P.; Scruggs, T.; Bowyer, K.; Chang, J.; Hoffman, K.; Marques, J.; Min, J.; Worek, W. Overview of the face
recognition grand challenge. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 947–954. [CrossRef]
42. Milborrow, S.; Morkel, J.; Nicolls, F. The MUCT Landmarked Face Database. In Pattern Recognition Association of South Africa;
2010. Available online: http://www.milbo.org/muct (accessed on 12 April 2024).
43. Gross, R.; Matthews, I.; Cohn, J.; Kanade, T.; Baker, S. Multi-PIE. Image Vis. Comput. 2013, 28, 807–813. [CrossRef]
44. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923.
45. Klare, B.F.; Klein, B.; Taborsky, E.; Blanton, A.; Cheney, J.; Allen, K.; Grother, P.; Mah, A.; Jain, A.K. Pushing the frontiers of
unconstrained face detection and recognition: Iarpa janus benchmark a. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1931–1939.
46. Kemelmacher-Shlizerman, I.; Seitz, S.M.; Miller, D.; Brossard, E. The MegaFace Benchmark: 1 Million Faces for Recognition at
Scale. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA,
27–30 June 2016; pp. 4873–4882. [CrossRef]
47. Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A.K.; Duncan, J.A.; Allen, K.; et al. Iarpa
janus benchmark-b face dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
Honolulu, HI, USA, 21–26 July 2017; pp. 90–98.
48. Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference
(BMVC 2015), Swansea, UK, 7–10 September 2015; British Machine Vision Association: Durham, UK, 2015.
49. Sengupta, S.; Chen, J.C.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to profile face verification in the wild. In
Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10
March 2016; IEEE: New York, NY, USA, 2016; pp. 1–9.
50. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings
of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings,
Part III 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 87–102.
51. Al-ghanim, F.; Aljuboori, A. Face Recognition with Disguise and Makeup Variations Using Image Processing and Machine
Learning. In Proceedings of the Advances in Computing and Data Sciences: 5th International Conference, ICACDS 2021, Nashik,
India, 23–24 April 2021; pp. 386–400. [CrossRef]
52. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. Vggface2: A dataset for recognising faces across pose and age. In
Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China,
15–19 May 2018; IEEE: New York, NY, USA, 2018; pp. 67–74.
53. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. Iarpa janus
benchmark-c: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast,
Australia, 20–23 February 2018; IEEE: New York, NY, USA, 2018; pp. 158–165.
54. Nech, A.; Kemelmacher-Shlizerman, I. Level Playing Field for Million Scale Face Recognition. arXiv 2017, arXiv:1705.00393.
55. Kushwaha, V.; Singh, M.; Singh, R.; Vatsa, M.; Ratha, N.K.; Chellappa, R. Disguised Faces in the Wild. In Proceedings of the 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June
2018; pp. 1–18.
56. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. LFR face dataset: Left-Front-Right dataset for pose-invariant face recognition in
the wild. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha,
Qatar, 2–5 February 2020; pp. 124–130. [CrossRef]
57. Ayad, W.; Qays, S.; Al-Naji, A. Generating and Improving a Dataset of Masked Faces Using Data Augmentation. J. Tech. 2023,
5, 46–51. [CrossRef]
58. Gottumukkal, R.; Asari, V.K. An improved face recognition technique based on modular PCA approach. Pattern Recognit. Lett.
2004, 25, 429–436. [CrossRef]
59. Yang, J.; Liu, C.; Zhang, L. Color space normalization: Enhancing the discriminating power of color spaces for face recognition.
Pattern Recognit. 2010, 43, 1454–1466. [CrossRef]
60. Ye, J.; Janardan, R.; Li, Q. Two-dimensional linear discriminant analysis. Adv. Neural Inf. Process. Syst. 2004, 17, 1–8.
61. Rahman, M.T.; Bhuiyan, M.A. Face recognition using Gabor Filters. In Proceedings of the 2008 11th International Conference on
Computer and Information Technology, Khulna, Bangladesh, 24–27 December 2008; pp. 510–515. [CrossRef]
62. Olshausen, B.A.; Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature 1996, 381, 607–609. [CrossRef]
63. Hammouche, R.; Attia, A.; Akhrouf, S.; Akhtar, Z. Gabor filter bank with deep autoencoder based face recognition system. Expert
Syst. Appl. 2022, 197, 116743. [CrossRef]
64. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1,
pp. 1–8. [CrossRef]
65. Ruppert, D. The elements of statistical learning: Data mining, inference, and prediction. J. Am. Stat. Assoc. 2004, 99, 567.
[CrossRef]
66. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
[CrossRef]
67. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in
Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook,
NY, USA, 2012; Volume 25.
68. Levi, G.; Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 34–42.
69. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
70. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
71. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387.
72. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
73. Xi, M.; Chen, L.; Polajnar, D.; Tong, W. Local binary pattern network: A deep learning approach for face recognition. In
Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016;
IEEE: New York, NY, USA, 2016; pp. 3224–3228.
74. Wu, X.; He, R.; Sun, Z.; Tan, T. A light CNN for deep face representation with noisy labels. IEEE Trans. Inf. Forensics Secur. 2018,
13, 2884–2896. [CrossRef]
75. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
76. Kumar, K.K.; Kasiviswanadham, Y.; Indira, D.; Priyanka Palesetti, P.; Bhargavi, C. Criminal face identification system using
deep learning algorithm multi-task cascade neural network (MTCNN). Mater. Today Proc. 2023, 80, 2406–2410. [CrossRef]
77. Ullah, N.; Javed, A.; Ali Ghazanfar, M.; Alsufyani, A.; Bourouis, S. A novel DeepMaskNet model for face mask detection and
masked facial recognition. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 9905–9914. [CrossRef]
78. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
79. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv
2019, arXiv:1801.04381.
80. Chen, S.; Liu, Y.; Gao, X.; Han, Z. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In
Proceedings of the Biometric Recognition: 13th Chinese Conference, CCBR 2018, Urumqi, China, 11–12 August 2018; Proceedings
13; Springer: Berlin/Heidelberg, Germany, 2018; pp. 428–438.
81. Dosovitskiy, A.; Fischer, P.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative unsupervised feature learning with
exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1734–1747.
82. Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards Deeper Vision Transformer. arXiv
2021, arXiv:2103.11886.
83. Zhong, Y.; Deng, W. Face transformer for recognition. arXiv 2021, arXiv:2103.14803.
84. Sun, Z.; Tzimiropoulos, G. Part-based face recognition with vision transformers. arXiv 2022, arXiv:2212.00057.
85. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 1701–1708.
86. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10076–10085.
87. Zhu, B.; Li, L.; Hu, X.; Wu, F.; Zhang, Z.; Zhu, S.; Wang, Y.; Wu, J.; Song, J.; Li, F.; et al. DEFOG: Deep Learning with Attention
Mechanism Enabled Cross-Age Face Recognition. Tsinghua Sci. Technol. 2024, 30, 1342–1358. [CrossRef]
88. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy
Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418.
89. Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. Swinbert: End-to-end transformers with sparse attention for
video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA,
USA, 18–24 June 2022; pp. 17949–17958.
90. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation
through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021;
pp. 10347–10357.
91. Sharma, S.; Shanmugasundaram, K.; Ramasamy, S.K. FAREC—CNN based efficient face recognition technique using Dlib.
In Proceedings of the 2016 International Conference on Advanced Communication Control and Computing Technologies
(ICACCCT), Ramanathapuram, India, 25–27 May 2016; pp. 192–195. [CrossRef]
92. Arsenovic, M.; Sladojevic, S.; Anderla, A.; Stefanovic, D. FaceTime—Deep learning based face recognition attendance system.
In Proceedings of the 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia,
14–16 September 2017; pp. 53–58. [CrossRef]
93. Al-Azzawi, A.; Hind, J.; Cheng, J. Localized Deep-CNN Structure for Face Recognition. In Proceedings of the 2018 11th
International Conference on Developments in eSystems Engineering (DeSE), Cambridge, UK, 2–5 September 2018; pp. 52–57.
[CrossRef]
94. Lu, Z.; Jiang, X.; Kot, A. Deep coupled resnet for low-resolution face recognition. IEEE Signal Process. Lett. 2018, 25, 526–530.
[CrossRef]
95. Qu, X.; Wei, T.; Peng, C.; Du, P. A Fast Face Recognition System Based on Deep Learning. In Proceedings of the 2018 11th
International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 8–9 December 2018; Volume 1,
pp. 289–292. [CrossRef]
96. Talab, M.A.; Awang, S.; Najim, S.A.d.M. Super-Low Resolution Face Recognition using Integrated Efficient Sub-Pixel
Convolutional Neural Network (ESPCN) and Convolutional Neural Network (CNN). In Proceedings of the 2019 IEEE International
Conference on Automatic Control and Intelligent Systems (I2CACIS), Selangor, Malaysia, 29 June 2019; pp. 331–335. [CrossRef]
97. Feng, Y.; Pang, T.; Li, M.; Guan, Y. Small sample face recognition based on ensemble deep learning. In Proceedings of the 2020
Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 4402–4406. [CrossRef]
98. Lin, M.; Zhang, Z.; Zheng, W. A Small Sample Face Recognition Method Based on Deep Learning. In Proceedings of the 2020
IEEE 20th International Conference on Communication Technology (ICCT), Nanning, China, 28–31 October 2020; pp. 1394–1398.
[CrossRef]
99. Yudita, S.I.; Mantoro, T.; Ayu, M.A. Deep Face Recognition for Imperfect Human Face Images on Social Media using the CNN
Method. In Proceedings of the 2021 4th International Conference of Computer and Informatics Engineering (IC2IE), Depok,
Indonesia, 14–15 September 2021; pp. 412–417. [CrossRef]
100. Szmurlo, R.; Osowski, S. Deep CNN ensemble for recognition of face images. In Proceedings of the 2021 22nd International
Conference on Computational Problems of Electrical Engineering (CPEE), Hradek u Susice, Czech Republic, 15–17 September
2021; pp. 1–4. [CrossRef]
101. Wu, C.; Zhang, Y. MTCNN and FACENET based access control system for face detection and recognition. Autom. Control.
Comput. Sci. 2021, 55, 102–112.
102. Sanchez-Moreno, A.S.; Olivares-Mercado, J.; Hernandez-Suarez, A.; Toscano-Medina, K.; Sanchez-Perez, G.; Benitez-Garcia, G.
Efficient face recognition system for operating in unconstrained environments. J. Imaging 2021, 7, 161. [CrossRef] [PubMed]
103. Malakar, S.; Chiracharit, W.; Chamnongthai, K.; Charoenpong, T. Masked Face Recognition Using Principal component analysis
and Deep learning. In Proceedings of the 2021 18th International Conference on Electrical Engineering/Electronics, Computer,
Telecommunications and Information Technology (ECTI-CON), Online, 19–22 May 2021; pp. 785–788. [CrossRef]
104. Marjan, M.A.; Hasan, M.; Islam, M.Z.; Uddin, M.P.; Afjal, M.I. Masked Face Recognition System using Extended VGG-19. In
Proceedings of the 2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE),
Rajshahi, Bangladesh, 29–31 December 2022; pp. 1–4. [CrossRef]
105. Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3911–3919.
106. Pann, V.; Lee, H.J. Effective attention-based mechanism for masked face recognition. Appl. Sci. 2022, 12, 5590. [CrossRef]
107. Yuan, L.; Li, F. Face recognition with occlusion via support vector discrimination dictionary and occlusion dictionary based
sparse representation classification. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association
of Automation (YAC), Wuhan, China, 11–13 November 2016; IEEE: New York, NY, USA, 2016; pp. 110–115.
108. Deng, W.; Hu, J.; Guo, J. Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Trans. Pattern Anal.
Mach. Intell. 2012, 34, 1864–1870. [CrossRef]
109. Alzu’bi, A.; Albalas, F.; Al-Hadhrami, T.; Younis, L.B.; Bashayreh, A. Masked face recognition using deep learning: A review.
Electronics 2021, 10, 2666. [CrossRef]
110. Li, S.; Lee, H.J. Effective Attention-Based Feature Decomposition for Cross-Age Face Recognition. Appl. Sci. 2022, 12, 4816.
[CrossRef]
111. Boutros, F.; Damer, N.; Kirchbuchner, F.; Kuijper, A. Self-restrained triplet loss for accurate masked face recognition. Pattern
Recognit. 2022, 124, 108473. [CrossRef]
112. Deng, H.; Feng, Z.; Qian, G.; Lv, X.; Li, H.; Li, G. MFCosface: A masked-face recognition algorithm based on large margin cosine
loss. Appl. Sci. 2021, 11, 7310. [CrossRef]
113. Wu, G. Masked face recognition algorithm for a contactless distribution cabinet. Math. Probl. Eng. 2021, 2021, 5591020. [CrossRef]
114. Yanhun, Z.; Chongqing, L. Face recognition based on support vector machine and nearest neighbor classifier. J. Syst. Eng. Electron.
2003, 14, 73–76.
115. Kepenekci, B. Face Recognition Using Gabor Wavelet Transform. Master’s Thesis, Middle East Technical University, Ankara,
Turkey, 2001.
116. Lou, G.; Shi, H. Face image recognition based on convolutional neural network. China Commun. 2020, 17, 117–124. [CrossRef]
117. Mahesh, S.; Ramkumar, G. Smart Face Detection and Recognition in Illumination Invariant Images using AlexNet CNN Compare
Accuracy with SVM. In Proceedings of the 2022 3rd International Conference on Intelligent Engineering and Management
(ICIEM), London, UK, 27–29 April 2022; pp. 572–575. [CrossRef]
118. Garvie, C.; Bedoya, A.; Frankle, J. Unregulated Police Face Recognition in America. Perpetual Line Up. 2016. Available online:
https://www.perpetuallineup.org/ (accessed on 17 March 2024).
119. Korshunov, P.; Marcel, S. DeepFakes: A New Threat to Face Recognition? Assessment and Detection. arXiv 2018, arXiv:1812.08685.
120. Sikhakhane, N. Joburg Hostels and Townships Coming Under Surveillance by Facial Recognition Cameras and Drones. Daily
Maverick, 13 August 2023. Available online: https://www.dailymaverick.co.za/article/2023-08-13-joburg-hostels-and-townships-coming-under-surveillance-by-
facial-recognition-cameras-and-drones/ (accessed on 12 April 2024).
121. Masud, M.; Muhammad, G.; Alhumyani, H.; Alshamrani, S.S.; Cheikhrouhou, O.; Ibrahim, S.; Hossain, M.S. Deep learning-based
intelligent face recognition in IoT-cloud environment. Comput. Commun. 2020, 152, 215–222. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
