NATIONAL INSTITUTE OF TECHNOLOGY CALICUT
Department of Electronics and Communication Engineering
                        Monsoon 2024
 EC6103E APPLIED MATHEMATICS FOR ELECTRONICS DESIGN
      FACIAL EMOTION DETECTOR USING CNN
                        SUBMITTED BY
                      SALINI C P (M240869EC)
                SREEDARSHANA C V (M241161EC)
                M.Tech - Electronics Design Technology
AIM
To design and implement a facial emotion detection system using a convolutional neural network
(CNN) trained on a dataset of labeled facial images, enabling accurate classification of human
emotions into predefined categories such as angry, disgust, fear, happy, sad, surprise, and neutral.
The system involves preprocessing images, detecting faces using Haar Cascade, and leveraging a
trained deep learning model to predict emotions from images, with the potential for real-time
applications.
THEORY
Facial Emotion Recognition
Facial emotion recognition (FER) is the process of detecting and analyzing human emotions
from facial expressions using computer vision and machine learning techniques. This technology
relies on the identification of specific patterns and features in facial images, such as eye
movement, mouth curvature, and other expressions that correspond to emotional states like
happiness, sadness, anger, or surprise.
The significance of FER lies in its ability to enable machines to understand and interpret human
emotions, thereby bridging the gap between human behavior and artificial intelligence. It
facilitates more intuitive human-computer interactions and helps in understanding psychological
and social cues. FER is particularly valuable in applications that require empathy,
communication, or behavioral analysis.
Convolutional Neural Networks
CNNs are a class of deep learning models specifically designed to process and analyze grid-like
data such as images. They have revolutionized fields like computer vision by automating the
process of feature extraction and enabling accurate classification, detection, and recognition
tasks.
CNNs work by applying a series of operations, such as convolutions and pooling, that learn to
extract and prioritize important features from input data. These features are then passed to fully
connected layers for decision-making, such as predicting an image's category or class.
The architecture of a CNN is inspired by the structure of the visual cortex, which processes
visual information hierarchically. A typical CNN architecture includes several layers like
convolutional, pooling, flattening, fully connected (dense), and dropout layers.
In this project, the CNN architecture was defined entirely from scratch without using any
pre-trained models like ResNet or VGGNet, in compliance with the guidelines. The architecture
was designed to handle the task of facial emotion recognition effectively while maintaining
computational efficiency. A Keras sketch of this architecture is given after the list below.
   1. Number of Layers:
         ○ The model consists of 3 convolutional layers, 3 max pooling layers, 4 dropout
             layers, 1 flattening layer, and 2 fully connected (dense) layers.
         ○ The sequence of these layers was carefully selected to ensure proper feature
             extraction, dimensionality reduction, and robust classification.
   2. Neurons per Layer:
          ○ The number of filters in the convolutional layers increases progressively:
              32, 64, 128. This gradual increase allows the network to capture
              increasingly complex features at each stage.
         ○ The first dense layer has 1024 neurons, enabling the network to learn complex
             relationships between features, while the final dense layer has 7 neurons,
             corresponding to the 7 emotion classes.
   3. Activation Functions:
         ○ ReLU (Rectified Linear Unit) is a popular activation function in deep learning,
             defined as
                                                 f(x)=max(0,x)
               It introduces non-linearity, enabling the model to learn complex patterns, and is
              computationally efficient due to its simple thresholding operation. ReLU helps
              mitigate the vanishing gradient problem, allowing deeper networks to train
              effectively, and promotes sparsity by outputting zero for negative inputs, reducing
              overfitting. However, it can lead to the "dying neuron" problem, where neurons
              output zero consistently, which can be addressed using variants like Leaky ReLU.
           ○ Softmax is an activation function used in the output layer of classification
             models, especially for multi-class problems. It converts raw logits (unnormalized
             scores) into probabilities by exponentiating each value and normalizing them so
              they sum to 1. The formula for softmax is

                  softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

              where x_i represents a specific logit and the sum in the denominator runs over
              all the logits. This allows the model to interpret outputs as probabilities,
              making it easy to determine the most likely class. Softmax is particularly
              effective when the task requires mutually exclusive class predictions.
4. Optimizer:
      ○ The Adam (Adaptive Moment Estimation) optimizer was chosen for its
         adaptive learning rate, which helps accelerate convergence while preventing
         overfitting. The Adam optimizer is an advanced optimization algorithm used in
         deep learning models. It combines the advantages of two other optimizers:
         AdaGrad and RMSProp. Adam calculates adaptive learning rates for each
         parameter by maintaining both the first moment (mean) and second moment
         (uncentered variance) of the gradients. This allows it to adapt the learning rate for
         each parameter individually, improving convergence speed and efficiency. Adam
         also uses bias correction terms to prevent issues in the initial stages of training.
         With its robust performance and low memory requirements, Adam is widely used
         in training deep neural networks.
          A key consequence of this per-parameter adaptation is that parameters with
          consistently large gradients receive smaller updates, while parameters with
          smaller gradients receive larger ones, which helps the model converge faster
          and reduces the need for manual tuning of the learning rate.
      ○ Learning rate: 0.0001.
5. Loss Function:
      ○ In deep learning, a loss function (also known as the cost function) is crucial for
         training models because it measures how well the model's predictions match the
         true values (targets). During training, the model learns by adjusting its weights to
         minimize this loss function. The process of minimizing the loss function is done
         using an optimization algorithm (like Adam or SGD), which updates the model's
         parameters based on the gradients of the loss with respect to the model's weights.
      ○ In the context of classification tasks, where the goal is to predict a class or
         category (like recognizing emotions from facial expressions), a suitable loss
         function is necessary to quantify how far off the model’s predicted class
         probabilities are from the true class labels.
      ○ Categorical Cross-Entropy is a loss function specifically designed for
         multi-class classification problems where the goal is to classify an input into one
         of several possible categories. It is commonly used when the model's output is a
          set of probabilities for each class (as in a softmax output layer). For a single
          example, the loss is

              L = -Σ_{i=1}^{C} y_i log(p_i)

          where:
          C is the number of classes,
          y_i is the true label for class i (typically a one-hot encoded vector for
          multi-class classification, where the correct class is 1 and all other classes
          are 0), and
          p_i is the predicted probability for class i (output from the softmax layer of
          the model).
          For a one-hot target this reduces to -log(p_correct), so the loss is small
          exactly when the model assigns high probability to the true class.
           ○ In summary, categorical cross-entropy quantifies the difference between the
             predicted probability distribution (after applying softmax) and the actual class
             label, guiding the model in adjusting its weights to improve its predictions over
             time. The smaller the loss, the better the model is at classifying the input into the
             correct category.
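To make the above concrete, here is a minimal Keras sketch of the architecture and compilation settings described in points 1 through 5. The kernel sizes (3x3), pool sizes (2x2), and dropout rates are assumptions; the report specifies only the layer counts, filter counts, dense-layer sizes, activation functions, optimizer, learning rate, and loss.

    # Minimal sketch of the custom CNN (kernel/pool sizes and dropout rates assumed).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
    from tensorflow.keras.optimizers import Adam

    model = Sequential([
        # Block 1: 32 filters
        Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)),
        MaxPooling2D((2, 2)),
        Dropout(0.25),
        # Block 2: 64 filters
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Dropout(0.25),
        # Block 3: 128 filters
        Conv2D(128, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Dropout(0.25),
        # Classifier head: flatten, 1024-unit dense layer, fourth dropout, 7-way softmax
        Flatten(),
        Dense(1024, activation='relu'),
        Dropout(0.5),
        Dense(7, activation='softmax'),
    ])

    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate=0.0001),
                  metrics=['accuracy'])

This gives exactly the layer counts listed above: 3 convolutional, 3 max pooling, 4 dropout, 1 flattening, and 2 dense layers. (The batch normalization mentioned later in the algorithm description is omitted here to keep the sketch consistent with these counts.)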
Haar Cascade Classifier
The Haar Cascade Classifier is a popular and efficient method for object detection, particularly
for tasks like face detection. It is based on the concept of using Haar-like features to identify
specific regions of interest in an image. This technique is widely used in computer vision due to
its balance of accuracy and computational efficiency.
Haar Cascade is an object detection framework proposed by Paul Viola and Michael Jones in
their landmark 2001 paper titled “Rapid Object Detection using a Boosted Cascade of Simple
Features”. It is especially effective for detecting objects with consistent and distinguishable
features, such as faces.
This framework relies on machine learning to train a classifier using positive images (containing
the object of interest, such as faces) and negative images (not containing the object). Once
trained, the classifier can be used to detect objects in real-time from new images or video frames.
It works by applying Haar-like features, which compare the intensity of adjacent rectangular
regions in an image, to detect specific patterns such as edges or lines that characterize facial
structures.
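As a concrete illustration, a minimal OpenCV sketch of face detection with this cascade follows. The input filename and the detectMultiScale parameters (scaleFactor, minNeighbors) are illustrative assumptions, not values taken from the report.

    # Minimal face-detection sketch using OpenCV's bundled Haar cascade.
    import cv2

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_alt.xml')

    img = cv2.imread('input.jpg')                    # assumed input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # the cascade works on grayscale
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:                       # one bounding box per detection
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)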
Dataset and Preprocessing
The dataset used for training and validation in this project plays a crucial role in building an
effective facial emotion recognition model. We use the FER-2013 (Facial Expression
Recognition 2013) dataset from Kaggle, whose images are labeled with seven emotions: Angry,
Disgust, Fear, Happy, Sad, Surprise, and Neutral. Labeled data of this kind is essential for
teaching the model to generalize across different facial expressions.
To prepare the dataset for training, several preprocessing techniques were applied to ensure
consistency and optimize computational efficiency. Rescaling of pixel values (normalization)
was performed by dividing pixel intensities by 255, transforming the values to the range
between 0 and 1. This step helps the neural network converge faster and improves numerical
stability during
training. Grayscale conversion was applied to the images to reduce computational complexity
and focus on intensity-based features, as color information is not critical for detecting emotions.
Finally, the images were resized to a fixed dimension of 48x48 pixels, ensuring uniform input
size to the CNN while retaining sufficient detail for emotion recognition. These preprocessing
steps streamline the training process and enhance the model's ability to learn from the data
effectively.
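These steps map directly onto Keras' ImageDataGenerator. A minimal sketch, assuming the extracted dataset is laid out as one folder per emotion class:

    # Preprocessing sketch: rescale to [0, 1], grayscale, resize to 48x48.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    val_dir = 'val_dir'                  # assumed path to the extracted validation set
    val_datagen = ImageDataGenerator(rescale=1.0 / 255)   # normalization only

    val_generator = val_datagen.flow_from_directory(
        val_dir,
        target_size=(48, 48),            # resize to the CNN's input size
        color_mode='grayscale',          # drop colour information
        class_mode='categorical',        # one-hot labels for the 7 classes
        batch_size=64)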
Libraries and Tools Used
Python libraries and tools employed in the project:
           ○   TensorFlow/Keras for model building and training.
           ○   OpenCV for face detection and preprocessing.
           ○   NumPy for numerical operations.
            ○   Matplotlib for visualization.
ALGORITHM DESCRIPTION
Step 1: Import Necessary Libraries
   ● Import libraries such as numpy, cv2, keras, and matplotlib for handling image processing,
     model training, and visualization.
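A minimal import block consistent with this step (the original code may import more or differently):

    # Core libraries used throughout the pipeline.
    import numpy as np                   # numerical operations
    import cv2                           # face detection and image preprocessing
    import matplotlib.pyplot as plt      # visualization of prediction scores
    from tensorflow.keras.models import Sequential, load_model
    from tensorflow.keras.preprocessing.image import ImageDataGenerator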
Step 2: Data Preparation
   1. Dataset Extraction:
         ○ Upload a dataset ZIP file and extract its contents to obtain the training (train_dir)
            and validation (val_dir) datasets.
   2. Data Preprocessing:
         ○ Rescaling: Normalize pixel values to the range [0, 1] by dividing by 255 using
            ImageDataGenerator.
         ○ Grayscale Conversion: Convert images to grayscale to reduce computational
            complexity.
         ○ Image Resizing: Resize images to 48x48 pixels to standardize input size for the
            model.
   3. Data Augmentation:
         ○ Use ImageDataGenerator to augment the dataset by applying transformations like
            zoom, shift, and flips, improving model generalization.
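A sketch of the augmented training generator; the exact augmentation ranges are assumptions, since the report names only zoom, shifts, and flips:

    # Training generator with augmentation (ranges are assumed values).
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_dir = 'train_dir'              # assumed path to the extracted training set
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        zoom_range=0.2,                  # random zoom
        width_shift_range=0.1,           # random horizontal shift
        height_shift_range=0.1,          # random vertical shift
        horizontal_flip=True)            # random mirroring

    train_generator = train_datagen.flow_from_directory(
        train_dir, target_size=(48, 48), color_mode='grayscale',
        class_mode='categorical', batch_size=64)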
Step 3: CNN Model Definition
   1. Architecture Design:
         ○ Define a custom Convolutional Neural Network (CNN) model.
         ○ Include three convolutional blocks with:
                ■ Convolutional Layers: Extract spatial features using filters (e.g., 32, 64,
                   128 filters).
                ■ Batch Normalization: Normalize feature maps to stabilize training.
                ■ MaxPooling: Reduce dimensionality and retain key features.
                ■ Dropout: Prevent overfitting by randomly deactivating neurons during
                   training.
         ○ Add fully connected layers:
                ■ Flatten: Transform the 2D feature map into a 1D array.
                ■ Dense Layers: Perform high-level reasoning and classification.
                ■ Use Dropout to further reduce overfitting.
         ○ The final layer uses the softmax activation function to output probabilities for 7
             emotion classes.
   2. Compile the Model:
         ○ Loss Function: categorical_crossentropy for multi-class classification.
         ○ Optimizer: Adam with a learning rate of 0.0001.
         ○ Metrics: Accuracy to evaluate the model's performance.
Step 4: Model Training
      Train the model using the prepared datasets:
          ○ Training Parameters:
               ■ Batch size: 64
               ■ Number of epochs: 10
               ■ Steps per epoch: Calculated based on dataset size.
          ○ Use augmented training data and validation data to fit the model.
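A training sketch using these parameters, reusing the model and generators from the earlier sketches; the final line performs the save described in Step 5:

    # Train for 10 epochs with batch size 64; steps derived from dataset size.
    history = model.fit(
        train_generator,
        steps_per_epoch=train_generator.samples // 64,
        epochs=10,
        validation_data=val_generator,
        validation_steps=val_generator.samples // 64)

    model.save('model.h5')               # Step 5: persist the trained model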
Step 5: Save and Load the Model
   ● Save the trained model as model.h5 for future predictions.
   ● Load the saved model for emotion prediction.
Step 6: Face Detection
   1. Haar Cascade Classifier:
        ○ Use the pre-trained Haar Cascade XML file (haarcascade_frontalface_alt.xml) to
            detect faces in an uploaded image.
          ○ Detect faces using cv2.CascadeClassifier and crop the detected face region.
          ○ Save the cropped face as capture.jpg.
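A minimal sketch of this step; the uploaded filename is an assumption, and when several faces are found the largest one is kept (the report does not specify a selection rule):

    # Detect a face and save the cropped region as capture.jpg.
    import cv2

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_alt.xml')

    img = cv2.imread('uploaded.jpg')                 # assumed uploaded image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if len(faces) > 0:
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep largest detection
        cv2.imwrite('capture.jpg', img[y:y + h, x:x + w])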
Step 7: Emotion Prediction
   1. Preprocess the Cropped Face:
         ○ Load the cropped face image.
         ○ Resize it to 48x48 pixels, convert it to grayscale, and normalize pixel values.
         ○ Expand dimensions to match the model's input shape.
   2. Predict Emotion:
         ○ Use the trained CNN model to predict the probabilities of the 7 emotion classes.
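A prediction sketch for this step. The label list must match the class ordering produced by the training generator; the order shown here follows the report's emotion list and is an assumption:

    # Preprocess the cropped face and predict class probabilities.
    import numpy as np
    import cv2
    from tensorflow.keras.models import load_model

    model = load_model('model.h5')
    emotions = ['angry', 'disgust', 'fear', 'happy',
                'sad', 'surprise', 'neutral']       # assumed class order

    face = cv2.imread('capture.jpg', cv2.IMREAD_GRAYSCALE)
    face = cv2.resize(face, (48, 48)).astype('float32') / 255.0   # normalize to [0, 1]
    face = face.reshape(1, 48, 48, 1)               # add batch and channel dimensions

    probs = model.predict(face)[0]                  # softmax probabilities, sum to 1
    print('Predicted emotion:', emotions[np.argmax(probs)])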
Step 8: Visualization
   ● Display the predicted emotion scores as a bar graph with labels for each emotion.
   ● Print the emotion probabilities for better interpretability.
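A visualization sketch, reusing emotions and probs from the prediction step:

    # Bar graph of the predicted emotion scores.
    import matplotlib.pyplot as plt

    plt.bar(emotions, probs)
    plt.xlabel('Emotion')
    plt.ylabel('Predicted probability')
    plt.title('Emotion prediction scores')
    plt.show()

    for label, p in zip(emotions, probs):           # print for interpretability
        print(f'{label}: {p:.3f}')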
CODE and RESULTS
CONCLUSION
The Facial Emotion Detection System successfully demonstrates the use of Convolutional
Neural Networks (CNNs) combined with Haar Cascade for real-time emotion recognition. By
employing a well-designed custom CNN architecture, the model effectively classifies facial
expressions into seven distinct emotions with high accuracy. Preprocessing techniques such as
grayscale conversion, normalization, and resizing ensure computational efficiency, while data
augmentation enhances the model's generalization capabilities.
The project highlights the importance of deep learning and computer vision in solving real-world
problems, such as emotion analysis, which has applications in diverse fields like healthcare,
psychology, and human-computer interaction. The use of Haar Cascade for face detection
ensures accurate input data for the CNN, while the visualization of predictions makes the system
user-friendly.
Overall, this project serves as a foundation for building more advanced emotion recognition
systems by incorporating additional features such as real-time video analysis, multi-face
detection, and integration with IoT devices for real-world applications. It showcases the potential
of machine learning in improving human-computer interaction and understanding human
emotions.
REFERENCES
   ● https://www.kaggle.com/datasets/msambare/fer2013/data
   ● https://github.com/komalck/FACIAL-EMOTION-RECOGNITION/blob/master/Facial_emotion_recognition.ipynb
   ● https://blog.clairvoyantsoft.com/emotion-recognition-with-deep-learning-on-google-colab-24ceb015e5
   ● S. Alizadeh and A. Fazel, "Convolutional Neural Networks for Facial Expression
     Recognition," Stanford University