Introduction- The use of computer-based technology is growing in many directions because of its easy availability and effectiveness. Technological advancements such as smartphones, laptops, and other intelligent devices give us access to online learning facilities, termed E-learning. E-learning platforms have become a significant tool for knowledge sharing and understanding for almost every student, especially after the pandemic. E-learning has numerous advantages: it is eco-friendly because it saves paper, and it reduces travel cost and time. Students can attend classes from home even when they are not feeling well. However, using such devices for long periods at close range to the screen can cause health problems such as eye strain, deteriorating eyesight, and headaches. During the COVID-19 pandemic, parents and students worried about students' futures under lockdown, but E-learning addressed this problem, and every student was able to learn through a mobile phone or laptop.

On the other hand, the physical education system has its own benefits, such as interaction between teacher and students, assessment of students' understanding, and hands-on sessions that make topics easier for students to grasp. To bring these features into E-learning, the primary objective of the proposed approach is “Engagement detection during E-Learning”. The system will be capable of measuring students' engagement during their classes and how they interact with teachers and lectures. In traditional classroom teaching, teachers evaluate their students’ learning effect, that is, their level of understanding and comprehension, mainly by observing students’ behavior. The behavioral aspects may include body language, eye gaze, facial expressions, and emotions exhibited through vocal feedback. Multiple researchers have proposed the use of natural language processing, hand gesture recognition, eye gaze estimation, facial emotion recognition, and body language detection to estimate learners' learning effects and to provide a measure that leads to a more effective learning experience [3].

Deep Learning

The nonverbal behavior of forty-four undergraduate students was observed using a USB front-facing camera while participants completed an on-screen multiple-choice question-and-answer test [1]. An ANN is fed their question-answer scores and behavioral patterns to classify their comprehension states almost instantly. The authors plan future work to increase the classifier's accuracy. Furthermore, this investigation might be extended to examine the relationship between behavior and the type of question asked. In the future, this technique might also be used to analyze various kinds of behaviors and mental states.

In [2] a concept of similarity is introduced to preserve the actual data with its features and to group the data by creating a degree of similarity among pairs, which is achieved through fuzzification. The authors propose to develop theoretical methods to generate rules and select features based on the similarity relation in the future.

Zakka et al. conducted a study comparing traditional classrooms and e-learning in order to improve the e-learning environment to match the traditional one [3]. A framework to detect the motivation level of learners is incorporated; it senses emotion, sends feedback to the teacher, and further develops a response mechanism to distinguish expressions and attention.

The authors developed EAC-Net, a deep learning system that better recognizes facial action units (AUs) [4]. Unlike other methods, it does not need perfectly aligned faces and it captures facial features more effectively. It showed substantial accuracy improvements on a benchmark dataset. This system is useful because it works well even if faces are at different angles or partly covered, making it more versatile for real-world situations.

In [5] the authors emphasized multi-modal emotion recognition over single-modal approaches. Their FER model is based on AffectNet. The use of predicted emotions extends beyond understanding student behavior to include visual summarization of classroom videos and classification of group-level emotions in videos.

Several attempts have been made to ascertain e-learners' level of focus from their behavior and biological data, including a neuro-fuzzy inference system that tracks the position of the eye’s iris to find the concentration level. An SVM is also used to determine the concentration level [6]. The proposed model incurs a time overhead for data preprocessing, which could be eliminated in the future by fully automating the process so that it can also run in real time.

Pise et al. employed a two-phase method to categorize 3D facial expression images extracted from video and to quantify the optical flow intensity with the help of a Hidden Markov model and Naïve Bayes [7]. Because video is used rather than just static images, even slight changes in expression can be measured.

Using temporal appearance and facial landmark points, facial gestures are extracted and then integrated to best recognize the expression [8]. A CNN is used for object detection and feature extraction in this model. In the future, the framework could be extended to run on GPU-supported machines for a shorter training time.

Researchers utilize multimodal learning analytics in online education to understand the feelings and engagement of students [9]. They use information from posture, gestures, and emotions to estimate students' level of engagement in the classroom. Computer vision techniques examine lecture footage and detect feelings such as joy or indifference. Some systems even use head and eye motions to categorize different levels of engagement. By taking into account the feelings and participation of the students, these methods seek to improve online instruction.

Thiruthuvanathan et al. discussed the challenges in detecting e-learners' engagement level from facial expression recognition. They propose to extend their model to detect group-level engagement in the future [10].

In [11] the author attempts to remove the challenges and weaknesses of the widely used blended learning model. Additionally, the gamification concept is applied. The FER and gamification systems were developed using an object-oriented approach with the Unified Modeling Language (UML). The methods used are ANN, CNN, and the open-source JavaScript library TensorFlowJS. Testing has two stages: the facial expression recognition system and the gamification application.

In [12] the authors used a video dataset called Children's Spontaneous Facial Expressions (LIRIS-CSE) and proposed a system that uses Convolutional Neural Network (CNN)-based models, such as VGG19, VGG16, and Resnet50, for feature extraction and Support Vector Machine (SVM) and Decision Tree (DT) for classification. This system automatically recognizes children's expressions. Several experimental configurations, such as an 80–20% split, K-Fold Cross-Validation (K-Fold CV), and leave-one-out cross-validation (LOOCV), are used to assess the system for both image-based and video-based classification.

Keerthana et al. proposed a hybrid model for identifying student facial emotions by monitoring eye gaze and head movements to analyze student engagement levels [13]. A Haar cascade model and binary patterns are used to detect head movements. CNN models are used for FER. Using the OpenCV object detection framework, the authors determined whether the student is distracted or engaged.

In [14] the proposed framework calculates the concentration index of students. A CNN model built with Keras is used with the fer2013 dataset for emotion recognition. Additionally, a Mamdani fuzzy inference system in MATLAB is used to create fuzzy rule sets and implement membership functions following the principles of fuzzy logic. The framework mainly deals with three major steps: face detection, feature extraction, and feature classification.

Dewan et al. explore various methods for learners’ engagement detection and classify them into three main categories: automatic, semi-automatic, and manual [15]. Each category is further divided into audio, video, and text depending on the type of data used for detection. The authors reviewed the automatic methods in more detail, as they found them more effective than other methods for online learning platforms. They further discussed the challenges of all of these methods and future directions for using them in more advanced ways.

In [16] a framework using computer vision to detect student engagement is proposed. Facial expressions, body language, and other cues are used to determine whether a student is paying attention and engaged in the material. The proposed model uses SVM (Support Vector Machines), Random Forest, Neural Networks, CNN (Convolutional Neural Networks), LSTM (Long Short-Term Memory), InceptionV3, and VGG16 for object recognition in video scenes to analyze students' facial expressions, body postures, and hand gestures. The dataset used to evaluate the model is small, consisting of only 45 students. The number of features considered to detect engagement is also very small; it could be increased in the future for more accurate detection, along with a larger set of labels such as neutral, low engaged, and highly engaged, rather than only engaged or not engaged.

Ozdamli et al. explore various algorithms and models for facial recognition in education [17]. For face detection and recognition, the paper mentions software such as MATLAB and Python using techniques like PCA and 3WPCA-MD. Classification is done with diverse algorithms such as SVM, Bayesian classifiers, or neural networks. For recognizing emotions from facial expressions, the paper discusses static models, Action Unit (AU) based models, and the Facial Action Coding System (FACS). Deep learning frameworks like CNNs are also increasingly used. To detect cheating in online exams, models like Multi-Class Markov Chain Latent Dirichlet Allocation and supervised dynamic Bayesian models are employed. Various datasets, such as GI4E for gaze tracking and FEI for general face recognition tasks, are used in the reviewed work.

The authors investigate methods for evaluating teaching quality and student engagement in classrooms [18]. Traditional approaches, including tests and observations, are criticized for their limitations in providing comprehensive and real-time data. Attention shifts towards assessing student engagement, defined across behavioral, cognitive, emotional, and social interaction dimensions. Technological advancements, particularly in computer vision, offer non-invasive means to detect engagement through facial expression analysis and gaze tracking. Studies demonstrate high accuracy in predicting engagement levels using deep learning models. Overall, the review highlights a growing interest in leveraging technology to enhance classroom evaluation and improve teaching practices.

The proposed models define structures for representing learning behaviors in classrooms, enabling effective feature extraction [19]. Evaluation metrics demonstrate the efficacy of machine learning models in accurately detecting and categorizing student behaviors. More data sources could be added to the model in the future. Advances in computer vision, specifically in Human Pose estimation and Graph Neural Networks, could also be incorporated to increase the efficacy.

In [20] the authors work on Multimodal Emotion Recognition in Multiparty Conversations (MERMC), a task in which the focus has been mostly on audio and text while visual information is ignored. It incorporates a two-stage framework.
The approach combines Facial expression-aware Multimodal Multi-Task learning with a Multimodal facial expression-aware emotion recognition model, which helps extract the face and improve emotion recognition. The authors plan to leverage multimodal fusion mechanisms to improve the performance of this task in the future.

Conventional ML based

Communication requires the use of facial expressions, which differ between individuals and cultures. Effective teaching requires knowledge of students' emotions, especially with the growth of online learning brought on by COVID-19. Understanding facial expressions allows educators to modify their approaches and add interest to their lessons. While negative emotions might cause disengagement, positive emotions support academic progress.

The authors conducted a thorough review of emotion classification in facial emotion recognition [21]. It presents a detailed analysis of the emotion classifiers and datasets used in FER. The different approaches researchers have taken to preprocessing and feature extraction are discussed. The authors highlighted the strengths and limitations of these approaches. Their study revealed that deep learning is the most commonly used approach for FER in academia. The most used dataset and emotion classifier are DAiSEE and SVM, respectively.

Zhang et al. developed an algorithm that can accurately detect students' engagement in online learning environments [22]. Supervised learning is used to recognize students’ emotional gestures, through which it distinguishes emotions such as frustration, happiness, boredom, and confusion. Sensors are used in this research. Measures of the degree of engagement are divided into three categories: single-sensor, multiple-sensor, and sensor-free methods.

In [23] the primary objective is to explore reliable facial information models that can describe how people interact in a learning environment. Different approaches for automated recognition of student engagement levels are studied. The authors stated that engagement recognition will be more effective and more varied in long-term learning situations than in short-term studies of current scenarios.

In [24] the authors highlight the growing interest in facial expression recognition and eye tracking to assess and enhance student engagement in digital learning environments. Various methods, including deep learning models and facial action coding systems, have been explored to measure concentration levels and emotional states. These approaches aim to provide real-time feedback to instructors, allowing for personalized adjustments to content delivery. Despite challenges such as dropout rates in virtual classrooms, ongoing research underscores the importance of understanding and addressing student engagement for effective digital education.

Dukic et al. discover connections between emotions, activities, and gender in online learning [25]. The authors analyzed two different perspectives: (1) classroom experiment-related and (2) FER data-related. To gather feedback on active teaching strategies, students' emotions are tracked as they complete programming assignments. The methodology in this research focused primarily on the activity portion. However, the variance due to age difference is not taken into consideration in this study. In the future, the authors plan to experiment with this model on more data, with better camera positioning and stricter conditions on the behavior of the participants.

The authors have designed an emotion recognition framework organized into three parts: a face tracker, an optical-flow algorithm for facial motion tracking, and a recognition engine.

This method proposes using a channel attention network with depthwise separable convolution to enhance the linear bottleneck structure. When tested on the FER2013 dataset, it outperforms other methods significantly, mainly because it pays more attention to feature extraction, resulting in better accuracy in recognizing emotions [26].

In [27] the authors conducted a systematic review of existing frameworks for Facial Emotion Recognition (FER) and how they are used in classifying academic emotions, mainly in the context of online learning. The authors observed that low illumination, lack of frontal pose, and small datasets are some of the major hindrances to FER in e-learning. They suggested that long-term monitoring of facial emotions through wearable sensors, continuous video recording, and exclusion of potential human biases could produce better accuracy for FER in online learning.

Alkabany et al. proposed a methodology that assesses students' degree of engagement in both traditional classroom settings and online learning environments [28]. The suggested framework records the user's video and follows their face as it moves through the frames. It can be used for tracking the development of e-learners with different levels of learning impairments and for analyzing the impact of nerve palsy on social interactions and facial expressions. Different features, such as facial fiducial points, head pose, eye gaze, and learning features, are extracted from the video of the user’s face to detect the action units of the Facial Action Coding System (FACS), which decomposes facial expressions into the fundamental actions of individual muscles or groups of muscles. The student's behavioral engagement (i.e., willingness to participate in the learning process) and emotional engagement (i.e., attitude toward learning) are then measured using these decoded action units (AUs).

The emotional changes of 67 students during a lecture on information technology are studied in [29]. The software is developed using the Microsoft Emotion Recognition API and the C# programming language to categorize the feelings of students into disgust, sadness, happiness, fear, contempt, anger, and surprise. The significance of the correlation of the students’ emotions with their departments, gender, lecture hours, the location of the computer in the classroom, lecture type, and session information is studied. Finally, the association between students’ emotional changes and their achievements is analyzed to examine how emotion recognition could contribute to increasing the overall quality of education.
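
For numeric variables, an association analysis like the one described in [29] could be approached with a correlation coefficient. The sketch below computes Pearson's r between a per-student emotion score and an achievement score. The choice of Pearson's r, the function and variable names, and all numbers are illustrative assumptions; they are not the method or data reported in [29].

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only: the share of lecture frames
# labelled "happiness" for each student, and that student's exam score.
happiness_share = [0.10, 0.25, 0.40, 0.55, 0.70]
exam_score = [52, 60, 71, 75, 88]

r = pearson_r(happiness_share, exam_score)
print(round(r, 3))
```

Categorical variables such as department or gender would need a different kind of test, so this sketch applies only to the numeric case.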

In the research work done by Gong et al. [30], high-definition video of classroom teaching is recorded using a camera at the front of the classroom. The face of every student in the classroom is located and cropped from the sampled frame images using the AdaBoost algorithm, and the images are pre-processed to produce an expression area of 64 by 64 pixels. After PCA+ dimensionality reduction, Gabor and ULBPHS feature fusion is combined with the KNN classification algorithm for expression classification. Finally, the assessment and results of the students' emotional learning are obtained.
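
The final classification step in [30] can be illustrated with a minimal sketch of k-nearest-neighbour voting. It assumes the Gabor and ULBPHS features have already been extracted and reduced, and that fusion is a plain concatenation of the two vectors. That fusion rule, the value k = 3, the function names, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def fuse(gabor, ulbphs):
    """Fuse two feature vectors by concatenation (an assumed fusion rule)."""
    return list(gabor) + list(ulbphs)


def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, hand-made vectors standing in for real expression features.
train = [
    (fuse([0.9, 0.8], [0.1]), "happy"),
    (fuse([0.8, 0.9], [0.2]), "happy"),
    (fuse([0.1, 0.2], [0.9]), "sad"),
    (fuse([0.2, 0.1], [0.8]), "sad"),
    (fuse([0.15, 0.15], [0.85]), "sad"),
]

print(knn_predict(train, fuse([0.85, 0.85], [0.15])))
```

In a real pipeline the training pairs would come from labelled 64 by 64 expression crops, and k would be chosen by validation rather than fixed in advance.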