
FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers’ Emotion Recognition

Published: 17 May 2024

Abstract

Emotion recognition has attracted a lot of interest in recent years in various application areas such as healthcare and autonomous driving. Existing approaches to emotion recognition are based on visual, speech, or psychophysiological signals. However, recent studies are looking at multimodal techniques that combine different modalities for emotion recognition. In this work, we address the problem of recognizing a driver’s emotions from unlabeled videos using multimodal techniques. We propose a collaborative training method based on cross-modal distillation, i.e., “FedCMD” (Federated Cross-Modal Distillation). Federated Learning (FL) is an emerging collaborative decentralized learning technique that allows each participant to train its model locally and contribute to a better-generalized global model without sharing its data. The main advantage of FL is that only local data is used for training, thus maintaining privacy and providing a secure and efficient emotion recognition system. In FL, the local model of each vehicle device is trained with unlabeled video data by using sensor data as a proxy. Specifically, for each local model, we show how driver emotional annotations can be transferred from the sensor domain to the visual domain by using cross-modal distillation. The key idea is based on the observation that a driver’s emotional state indicated by the sensors correlates with the facial expressions shown in the videos. The proposed “FedCMD” approach is tested on the multimodal dataset “BioVid Emo DB” and achieves state-of-the-art performance. Experimental results show that our approach is robust to non-identically distributed data, achieving 96.67% and 90.83% accuracy in classifying five different emotions with IID (independently and identically distributed) and non-IID data, respectively. Moreover, our model is much more robust to overfitting, resulting in better generalization than existing methods.

1 Introduction

With the advancement of artificial intelligence, the number of applications for intelligent vehicles has increased significantly worldwide [27]. These intelligent vehicles are designed to provide convenience to the driver and enhance the driving experience [39]. Recognizing driver emotions could help automated intelligent vehicle systems determine the driver’s level of attention and adapt accordingly.
While emotion recognition technology is still experimental in the automotive sector, it is more mature in other areas such as human-machine interaction, healthcare, e-learning, smart home, and gaming [30]. In these areas, advances in Machine Learning (ML) have improved the processing of different data types, such as physiological, speech, and video data, and enabled more accurate emotion recognition. For example, the authors of References [20, 25] used physiological signals such as electroencephalogram (EEG), electromyogram (EMG), electrocardiogram (ECG), electrodermal activity (EDA), and galvanic skin response (GSR) and found that sadness lowers heart rate and anger increases skin temperature. The authors in References [13, 41] found similar results for emotion recognition using facial expressions or gesture analysis. In addition, many other approaches have been proposed in the literature, such as those based on information from speech [38], text [50], and gaze direction [29]. However, physiological data can provide more accurate emotion recognition, whereas text and speech cues are more difficult to exploit, because drivers cannot always speak or provide text-based information.
Predicting emotions from sensors is also difficult for several reasons. First, physiological signals have the disadvantage of requiring drivers to wear sensors, such as wearable activity recognition systems. These systems typically include multiple sensors attached to the user’s body, which can be uncomfortable and distracting while driving. Second, this human involvement makes it difficult to collect valuable data for training an ML model. Camera-based emotion recognition systems can overcome these limitations, because they capture the driver’s emotions without requiring any physical contact or interaction from the driver. We can enrich the image-based information with other data sources to incorporate data from multiple modalities and enable so-called multimodal analysis [1, 4, 19]. For example, physiological sensors could be used as a proxy for the unlabeled video to develop more robust emotion recognition systems for drivers.
In this work, we investigate the possibility of learning the emotional content of drivers from unlabeled video data captured by an onboard vehicle camera by simply transferring knowledge about the driver’s physiological state. To this end, we propose an emotion recognition system using data from physiological signals and unlabeled video. We transfer knowledge between these two modalities using cross-modal distillation, where knowledge is transferred from one Neural Network (NN) model to another NN model in a teacher-student architecture trained on different modalities. We use one modality (physiological data) to train a teacher NN model for emotion recognition. We then distill this NN model into a smaller student model and feed it with the target modality (video sequences). During inference, only the student model is used in the vehicle to recognize drivers’ emotions from the video sequences. Convolutional Neural Networks (CNNs), which can accurately process physiological signals and visual data using nonlinear transformations, are used as the teacher and student models to enable powerful feature extraction and classification of human emotional behavior [2]. However, the challenge of using CNNs to detect the emotional state of drivers is that a large dataset is needed for model training. One solution could be to collect data from all vehicles in a central location, e.g., a cloud, and use this dataset to train the CNN. However, this is not feasible due to privacy issues and a lack of network capacity. There is a need for collaborative training using different small islands of data without sharing them. McMahan et al. proposed a collaborative learning paradigm called Federated Learning (FL) in Reference [34] to address NN learning with decentralized data. FL effectively avoids privacy violations by having multiple participants learn a common global NN model without revealing their local data.
Considering the above challenges, we propose an emotion recognition method based on the federation of learning models, called Federated Cross-Modal Distillation (FedCMD). Our main goal is to develop a method that increases the accuracy of emotion recognition using distributed information from different vehicles. Our approach trains an NN for emotion recognition with cross-modal distillation locally on each vehicle. We then improve the accuracy of the local models by aggregating them through FL to obtain a generalized model that we can share among vehicles for more accurate emotion recognition. In this federated environment, we also leverage edge computing resources, since resources in the vehicles are constrained and deploying both teacher and student models on each vehicle device is not feasible. Therefore, using the results in Reference [33], we split the teacher model so part of it is deployed on the vehicle device and the rest at the edge. To the best of our knowledge, this is the first study to consider knowledge transfer from sensors to unlabeled videos in a federated manner, considering resource-constrained in-vehicle devices, for driver emotion recognition.
Our main contributions are summarized as follows:
We develop a reliable and efficient teacher model for emotion recognition using physiological input through empirical analysis, achieving state-of-the-art performance. Then, we transfer the teacher’s knowledge to a small student model for emotion prediction using unlabeled videos. The goal is to find the relationship between the features extracted from the video and the physiological signals in the representation of the driver’s emotions, as both are generated synchronously;
We propose an FL-based method for aggregating locally distilled student models to obtain a subject-independent emotion classification model with minimal communication cost and high privacy level;
We perform a performance analysis to evaluate the effectiveness and accuracy of the proposed “FedCMD” approach for driver emotion recognition using the “BioVid Emo DB” dataset by identifying discrete emotions. Our experiments show that our proposed approach outperforms current emotion recognition methods.
The remainder of this article is divided into the following sections: Section 2 provides the related work on the techniques used. The proposed reference scenario and system model assumptions for this work are introduced in Section 3. Section 4 sets forth the proposed cross-modal knowledge distillation technique. Section 5 presents the FedCMD approach. In Section 6, we evaluate our proposed method. In Section 7, we perform the ablation studies and conclude with an outlook on future developments in Section 8.

2 Related Works

In this section, we review related work on emotion recognition techniques (Section 2.1), as well as some of the existing work on cross-modal knowledge distillation (Section 2.2), and finally some recent work on FL (Section 2.3).

2.1 Emotion Recognition

As discussed in Section 1, recognizing human emotions is a valuable task to achieve better human-computer interaction, leading to emotion-aware vehicle intelligence. In addition, human emotion recognition has tremendous potential for application in numerous scenarios such as smart devices, autonomous vehicles, and medical purposes [51]. To recognize human emotions, researchers in the literature have used different modalities, such as physiological signals, text, visual signals, and speech, of which physiological signals are the most commonly used. Ali et al. [3] proposed an approach to detect emotions in drivers using a set of three physiological signals (EDA, ECG, and skin temperature (ST)). They classified four different emotional states by using cellular NNs. The authors of Reference [32] developed a model in which they extract several statistical features from different physiological signals such as BVP (blood volume pulse), EMG, EDA, RSP (respiration), and ST and feed them into Support Vector Machine (SVM) and Fisher’s LDA (linear discriminant analysis) classifiers to identify six different emotional states.
In addition to sensor modality, there are other common methods of emotion recognition based on images or videos. For example, in [17], Gao et al. proposed a real-time framework for detecting a driver’s emotional state by analyzing facial expressions and using linear SVM to classify emotions. In [44], the authors presented a real-time emotion recognition system for drivers based on CNNs that extracts visual and geometric features from the facial image to determine the driver’s emotional state. Akshi et al. proposed a multimodal emotion recognition system using visual and psychophysiological modalities. They used three different CNNs for each sensor signal separately and one CNN for video data. The results of the two modality networks were combined by decision-level weighted fusion to classify emotions [26]. However, our work uses the psychophysiological modality as a proxy for the visual modality in a cross-modal distillation approach. To build a more robust model, it is necessary to include two or more modalities for emotion recognition to enable cross-modal learning. In this cross-modal approach, the data come independently from different modalities, but the problem to be solved is the same.

2.2 Cross-modal Knowledge Distillation

Deep NNs have made significant progress in solving many complex problems in various domains in recent years. Still, these NNs can consist of billions of parameters and are too complex to be used on resource-constrained devices. To solve this problem, Hinton et al. proposed the Knowledge Distillation (KD) technique [21], in which knowledge is distilled from one large NN to another small NN in a teacher-student architecture. The pre-trained large NN model is the teacher model, and the NN model we train by distillation is the student model. Under the teacher’s supervision, the student is trained on potentially unlabeled data. During training, the student replicates the input-output function of the teacher network through a backpropagation algorithm using the teacher’s predictions as soft labels. These soft labels are obtained by applying temperature scaling to the teacher’s logits. There are two common methods for performing KD: one is to train the student model to regress the teacher’s pre-softmax logits [5], and the second is to minimize the cross-entropy loss between the softmax probability outputs of teacher and student [21]. Many KD methods have been proposed in the literature. In Reference [31], the authors applied the KD method to the speech recognition task. They achieved the same accuracy for a compact student model under the supervision of a high-accuracy, cumbersome teacher model. The authors in Reference [14] proposed a KD technique for action recognition in still images using CNNs.
In contrast to these methods that transfer knowledge within the same modality, the cross-modal KD approach works with two different modalities for teacher and student. This idea can be used in various ways, such as with a visual recognition network where the teacher, trained on RGB images, supervises the student, trained on depth or optical flow images [19]. In Reference [4], the authors proposed a method to learn directly on raw audio waveforms by synchronizing videos and audio to learn an acoustic representation using unlabeled videos. In other work, the output of a pre-trained facial emotion classifier is used to train a student network to recognize emotions in speech [1], or a pre-trained human pose visual recognition network is used to identify human poses from radio signals [53]. In Reference [36], the authors propose an end-to-end vision-to-sensor KD for human activity detection. In contrast, we aim to detect human emotions while driving by developing a sensor-to-vision system that can be easily deployed on resource-constrained devices and in real-world environments.

2.3 Federated Learning

FL is a popular paradigm for collaborative learning presented in 2016 by McMahan et al. [34]. The FL framework enables a communication-efficient and privacy-friendly solution in which many mobile or edge devices collaborate with a central server to jointly and iteratively train a global NN model without sharing users’ data. In the FL process, the initial model is transmitted from the central FL server to the distributed clients, which can train the model with their own data. The weights of the local NN models are then sent back to the FL server to update the global NN model through aggregation. After aggregation, the FL server sends the updated NN model to all participating clients. This process continues until training is complete.
The authors in Reference [35] developed a global classification system based on FL for real-time classification of emotions using multimodal physiological data streams from body-worn sensors, called “Fed-ReMECS,” without access to the distributed multimodal data. The authors in Reference [16] presented an automatic emotion recognition system that combines FL with emotion analysis based on extracted facial and speech features to create a simple and secure emotion monitoring system. Researchers in Reference [43] proposed natural language-based human emotion recognition through semi-supervised FL using labeled and unlabeled data. They also showed that highly non-identical speaker data has little impact on their model. However, unlike the above methods, we propose a semi-supervised FL method that does not access the videos’ labels. The annotations of the videos are transferred from the sensor modality using the cross-modal distillation approach, and only the student models trained on these unlabeled videos are federated to obtain a generalized and small NN model that can be used on the vehicle device for emotion recognition.
In the next sections, we build the system model (Section 3) using all the components mentioned in this section. Section 4 illustrates how to perform the cross-modal distillation using the sensor and video data, and Section 5 describes how the student models are federated to obtain state-of-the-art results.

3 System Model

In this section, we discuss the assumptions we made in designing the “FedCMD” learning architecture as well as the computation and communication scenario used to study the performance of the proposed architecture.
The proposed architecture is based on two main learning concepts: cross-modal KD and FL, as shown in Figures 1 and 2, respectively. We consider physiological signals and videos as modalities for the cross-modal approach. Our idea is based on the hypothesis that we can recognize human emotions from facial expressions by distilling a student model from a teacher model, i.e., a model trained with physiological signals. We consider the following driver’s emotional states or labels: amusement, anger, fear, disgust, and sadness. To achieve the desired results and investigate the ability of the proposed architecture, we develop the model to detect emotions from unlabeled videos. We use the ground truth provided by the corresponding sensor outputs for unlabeled videos, since there is temporal synchronization between physiological signals and videos.
Fig. 1. Cross-modal transfer: A CNN-based model of emotion recognition on videos (the student) is trained by distilling the knowledge of a pre-trained emotion recognition network based on sensor inputs (the teacher).
Fig. 2. Federation procedure.
We assume that each vehicle is equipped with a camera that continuously records the driver’s face and that it is possible to extract features, e.g., facial landmarks, from the video images, as shown in Figure 1. We train the teacher model using the labelled physiological data to obtain a model with high accuracy. The trained teacher model is then used in inference mode for cross-modal distillation. Assuming that the vehicles have a resource-constrained device that performs computation and communication tasks, we foresee a reduction in the computational load caused by the learning procedure. For this reason, we split the teacher model into two parts, the head and the tail, as shown in Figure 1, using the splitting method mentioned in Reference [33]. While the head of the teacher model, which consists of a smaller number of model parameters, runs on the vehicle device along with the student model, the tail runs on a serverless computing infrastructure deployed at the edge close to the vehicle, as shown in Figure 1. The use of serverless and Function-as-a-Service (FaaS) paradigms in an edge environment was first demonstrated by Baresi et al. [12]. As described by the authors in References [6, 12], serverless computing can run different logical parts on many infrastructures that form a continuum. This way, we can reduce latency and hardware resource consumption, while billing is based on execution time rather than resource allocation, which lowers costs. We assume that only the tail of the teacher runs at the edge in containerized form.
In the proposed cross-modal KD approach, the driver only needs to wear the sensors during the training phase of the student model; during inference, emotions are detected only from the videos. After performing local distillation for each vehicle, we use the FL communication architecture developed in References [8, 10] to federate the local, personalized models into a global model, as shown in Figure 2. The proposed FedCMD consists of the following steps:
(1)
In each round of FL, the aggregation server selects a subset of vehicles to download the initial model parameters.
(2)
Each vehicle trains its student model using its own local dataset.
(3)
For each vehicle participating in the FL process, the student model generates the soft logits for the video dataset.
(4)
Using the physiological signals as inputs, the teacher head on the vehicle device sends its intermediate output tensor to the edge, which then generates the teacher logits. These teacher logits are sent back to each vehicle to perform the KD. Note that we split the teacher model between a mobile device (the vehicle) and the edge.
(5)
After a few local training epochs, each student sends its model parameters to the edge for aggregation to create a global model based on the FedAvg algorithm [34].
(6)
The aggregation server returns the updated student model to all participating vehicles for further training.
To communicate model updates within FL, we use the KafkaFed architecture [10], which employs Kafka brokers. A Kafka broker manages the model parameters going to and from the clients and server in this system and acts as an orchestrator between the FL server and the vehicles. The generalized global model for all vehicles is created when the training process is complete. This small student model for emotion recognition can be deployed on devices with limited resources and recognizes emotions based only on the videos in the vehicle. In this way, we can create small models of high quality with low communication overhead and unlabeled data.
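To make these steps concrete, the following minimal sketch shows one FedCMD round from the aggregation side, under the simplifying assumption of a synchronous setting. The client object with its student, local_distillation, and num_samples members is a hypothetical placeholder and not part of any released code.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg [34]: average client weights, weighted by local dataset size."""
    total = float(sum(client_sizes))
    averaged = [np.zeros_like(w) for w in client_weights[0]]
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += (size / total) * w
    return averaged

def fedcmd_round(global_weights, clients, local_epochs=6):
    """One FedCMD round: broadcast the global student, run local cross-modal
    distillation on each vehicle, and aggregate the resulting student models."""
    updates, sizes = [], []
    for client in clients:                            # step (1): vehicles selected for this round
        client.student.set_weights(global_weights)
        # Steps (2)-(4): the vehicle trains its student on local video features,
        # using teacher logits produced by the split teacher (head on the
        # vehicle, tail at the edge) as soft targets.
        client.local_distillation(epochs=local_epochs)
        updates.append(client.student.get_weights())  # step (5): upload local weights
        sizes.append(client.num_samples)
    return fedavg(updates, sizes)                     # step (6): updated global model
```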

4 Cross-Modal Distillation

In this section, we conduct experiments to answer the following research question: Can we obtain a representation of drivers’ emotional content from unlabeled video data by simply transferring knowledge from drivers’ sensor data? For this reason, we analyze the knowledge transfer methodology from teacher-NN to student-NN based on sensor and visual modality, i.e., sensor-to-vision cross-modal distillation. The overall architecture of this cross-modal distillation is shown in Figure 1. We refer to this cross-modal distillation for each vehicle as local training, which we will discuss in more detail in Section 5. Before explaining the proposed cross-modal distillation technique, we introduce the dataset and the data preprocessing (Section 4.1), the learning architectures of the teacher (Section 4.2) and the student (Section 4.3), and finally the cross-modal knowledge transfer between the two architectures (Section 4.4).

4.1 Data Preprocessing

In this work, we use the multimodal dataset “BioVid Emo DB” [52] for discrete emotion analysis. It is a multimodal database that contains physiological signals such as skin conductance level (SCL), electrocardiogram (ECG), and trapezius muscle electromyogram (tEMG), together with the corresponding videos, to classify emotions into five discrete categories (amusement, sadness, anger, disgust, and fear). Video data were recorded from subjects watching various film clips that elicited emotional states. The predominant emotion of each subject was noted and listed in the dataset. The videos were sampled at a frame rate of 25 Hz and a resolution of 1,388 \(\times\) 1,038 colored pixels, while the three physiological signals were recorded at a sampling rate of 512 Hz. The video signals were recorded using three cameras, allowing participants to move their heads freely, as the cameras could capture the face even when it was rotated out of the plane. A total of 94 subjects participated, and their emotions were elicited by 15 standardized film clips. Participants belonged to three different age groups ranging from 18 to 65 years, including 44 men and 50 women without emotional disorders. Due to missing and corrupted records, only the data of 86 subjects are available in the database. The entire experiment was conducted in a controlled laboratory environment.
Before using the dataset to train the NN models, we preprocess the data to enhance the accuracy of the models. As described above, the videos from the “BioVid Emo DB” dataset are sampled at 25 frames per second (fps). To extract temporal features from these videos and use them as input to the learning model, we use the OpenFace software [7]. OpenFace was developed by the University of Cambridge in collaboration with Carnegie Mellon University and was designed for computer vision applications based on behavioral analysis [7]. It is the first open-source tool capable of jointly performing facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. OpenFace can extract these features by analyzing faces in video files, images, or a live camera stream. Some of the most important features are:
Facial landmark detection: enables recognition and localization of key points of the face. OpenFace identifies a set of landmarks on the face of the person, as shown in Figure 1.
Head pose estimation: provides the position and rotation of the head as x, y, and z coordinates.
Facial action unit recognition: human facial expressions are described by action units (AUs), which capture the activity of the various facial muscles. Each AU is recognized using guidelines based on the distances between facial points; different AUs are activated when a person is happy than when the person is sad.
Eye-gaze estimation: OpenFace uses coordinates \((x,y,z)\) to identify the direction vector of the eyes’ gaze. Estimating the direction of gaze is crucial for evaluating the driver’s attention.
After extracting the video features from OpenFace, we performed the following data preprocessing steps:
(1)
Due to the limited computational resources to process the data, we randomly selected 30 subjects from the 86 available for the experiments. After selecting the subjects, we combined their emotions and assigned a discrete label to each emotion: amusement = 0, anger = 1, disgust = 2, fear = 3, and sad = 4.
(2)
The output of OpenFace consists of 714 feature columns. We removed the 5 feature columns that do not provide useful information for emotional state recognition, such as ID, frame number, and so on, leaving 709 columns. To further reduce the features, we applied Pearson correlation analysis and removed features with an absolute correlation above 80%. Thus, the feature space was reduced from 709 to 63 features, making it suitable for running on a resource-limited vehicle device (a sketch of this filtering step, together with the standardization and windowing of steps (4) and (5), is given after this list).
(3)
Note that the dataset contains the synchronized sensor data and associated videos of the same length, as mentioned in Reference [52]. However, we re-timestamped the sensor data and the OpenFace features by considering their sampling frequencies of 512 Hz and 25 Hz, respectively. We must add this synchronization layer due to the errors introduced by the feature extraction with OpenFace.
(4)
We apply data standardization to the selected features to bring all the features to a common scale, so that differences in the ranges of values do not distort the training.
(5)
We form 1-second windows of the dataset to maintain the temporal coherence of the data, so each window contains 512 sensor samples and 25 feature vectors extracted from the video frames. Since the two modalities are synchronized, we obtain the same number of windows for both.
(6)
Finally, we divide the whole dataset into three parts: training (70%), validation (15%), and test (15%).
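A minimal sketch of the filtering, standardization, and windowing steps above is shown below, assuming the OpenFace output is held in a pandas DataFrame and the sensor streams in NumPy arrays; the function names and the 0.8 threshold constant merely illustrate the procedure described in steps (2), (4), and (5).

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Step (2): drop features whose absolute Pearson correlation with any
    earlier feature exceeds the threshold."""
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

def standardize(x: np.ndarray) -> np.ndarray:
    """Step (4): zero-mean, unit-variance scaling per feature."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def make_windows(sensors: np.ndarray, video_feats: np.ndarray,
                 fs_sensor: int = 512, fs_video: int = 25):
    """Step (5): cut synchronized 1-second windows, giving
    (512, n_sensor_channels) and (25, n_video_features) per window."""
    n_windows = min(len(sensors) // fs_sensor, len(video_feats) // fs_video)
    s = sensors[: n_windows * fs_sensor].reshape(n_windows, fs_sensor, -1)
    v = video_feats[: n_windows * fs_video].reshape(n_windows, fs_video, -1)
    return s, v
```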

4.2 Teacher Model

This section describes the design of a reliable and efficient teacher model for classifying driver emotions. The teacher model is based on a two-dimensional CNN. A CNN is an NN with a sequence of convolutional layers (often with a pooling step) followed by one or more fully connected layers. The architecture of the teacher model, selected through extensive experimentation, is shown in Figure 3. The teacher model aims to learn useful representations for emotion recognition from sensor data. The input to the teacher model is a 1-second window, i.e., the size of each window is (512, 3), since we have three sensor values, SCL, EMG, and ECG, sampled at 512 Hz. We trained the teacher model from scratch on the sensor dataset using the Keras “early-stopping” callback, which monitors the validation loss; if the validation loss does not improve over several epochs, further training is stopped to avoid overfitting. One epoch corresponds to a complete pass over the training data with the parameters given in Table 1. We monitor the teacher’s progress on the validation dataset and select the final model that minimizes the validation loss (cross-entropy). Evaluating the predictions on the validation set, we achieved an accuracy of 96.65% and a loss of 0.1050 for the 5-way classification, as shown in Figure 4.
Table 1. Hyperparameters for Training Teacher Model (CNN)
| Hyperparameter | Value | Hyperparameter | Value |
| Conv2D Layers | 4 | Learning Rate | 0.0001 |
| No. of Filters | (32, 80, 80, 32) | Batch Size | 32 |
| Activation Function | ReLU | Dropout | 0.01 |
| Early Stopping (ES) | Enabled | Optimizer | Adam |
| ES Patience | 10 | Loss | Cross-entropy |
Fig. 3. The architecture of the teacher model, trained on sensor dataset.
Fig. 4. Training and validation accuracy of the teacher model with the red vertical line indicating the early stopping point.
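As a reference, the Keras sketch below builds a teacher with the hyperparameters of Table 1. The filter counts, optimizer, learning rate, dropout, and early-stopping settings follow the table; the kernel sizes, the placement of the max-pooling layer, and the use of integer-label cross-entropy are assumptions, since Figure 3 is not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_teacher(n_classes: int = 5) -> tf.keras.Model:
    """Teacher CNN over 1-second sensor windows of shape (512, 3): SCL, EMG, ECG."""
    model = models.Sequential([
        layers.Input(shape=(512, 3, 1)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 1)),   # split point for head/tail (Section 5.1)
        layers.Conv2D(80, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(80, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Dropout(0.01),
        layers.Flatten(),
        layers.Dense(n_classes),                 # raw logits; softened later for KD
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# teacher = build_teacher()
# teacher.fit(x_train, y_train, validation_data=(x_val, y_val),
#             epochs=200, batch_size=32, callbacks=[early_stop])
```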

4.3 Student Model

The goal is to develop a small student model, with fewer parameters, adapted to the limited computational capabilities of the vehicle device. Recall that this work assumes a resource-constrained mobile device to run the student model for emotion recognition from videos. Training the student model on unlabeled videos from vehicle cameras while aggressively compressing the model is based on cross-modal KD. This technique is used to create small and compact models that mimic the outputs of a larger teacher model. The (teacher) models used for knowledge transfer are often over-parameterized and unsuitable for devices with limited resources. Therefore, we create the student model from a reduced subset of the layers of the original teacher model: the number of layers is reduced, and some are skipped. The architecture of the student model used in this work is shown in Table 2. The input for the student model is the features extracted from the videos, e.g., using the OpenFace tool described earlier. In this shallow CNN student model, only two convolutional layers and one fully connected output layer are used. A dropout layer follows each convolutional layer, and the convolutional layers use zero padding. The dropout ratios are selected using the Keras Tuner. Each vehicle device records the driver’s videos through the camera and extracts their features to train the student model.
Table 2. The Student Model Architecture
| Layer | No. of Neurons (filters) | Kernel Size | Data Size | Dropout |
| conv1 | 32 | 3 \(\times\) 3 | 25 \(\times\) 63 | - |
| dropout1 | - | - | 25 \(\times\) 63 | 0.5 |
| conv2 | 32 | 3 \(\times\) 3 | 25 \(\times\) 63 | - |
| dropout2 | - | - | 25 \(\times\) 63 | 0.1 |
| FC | 5 | - | 5 | - |
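The Keras sketch below mirrors Table 2; the ReLU activations, the added channel dimension, and the Flatten layer before the fully connected output are assumptions not stated explicitly in the table.

```python
from tensorflow.keras import layers, models

def build_student(n_classes: int = 5):
    """Student CNN over 1-second windows of OpenFace features (25 frames x 63
    features), following Table 2."""
    return models.Sequential([
        layers.Input(shape=(25, 63, 1)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # conv1
        layers.Dropout(0.5),                                           # dropout1
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # conv2
        layers.Dropout(0.1),                                           # dropout2
        layers.Flatten(),
        layers.Dense(n_classes),  # raw logits; softened with the temperature during KD
    ])
```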

4.4 Cross-modal Transfer

In this section, we describe the methodology for training the student network using features extracted from videos and the corresponding results of this cross-modal approach. Note that the data for the student modality is not annotated. The idea is that a teacher can pass supervision of a task to a student without requiring labelled data in the student’s modality by using large amounts of modality-paired data. First, the teacher network is trained with sensor data. Then, the soft labels (softened output logits) provided by the teacher model are used to train the student model from scratch. The output logits of the teacher model are softened by dividing them by the temperature \(\tau\). In Reference [21], Hinton et al. propose to scale the raw logits of the teacher model with a given temperature \(\tau\) before passing them to the softmax layer to effectively transfer the knowledge to the student. In this way, the probability distribution is spread across the available class labels so the student can learn better. The trained teacher network predicts the target class probabilities from the source-modality half of each training pair. The student network parameters are then optimized so that the class probabilities the student estimates from the target modality match the teacher’s. To effectively transfer knowledge to the student, the teacher network must provide reliable supervision that allows a small student model to learn the different class annotations. The student also benefits from the teacher’s errors, provided there is a signal of uncertainty in the predictions [21].
We train the student model on the features extracted with OpenFace using a learning rate of 1e-3, the Adam optimizer, and a temperature of 0.5. We chose the temperature empirically by experimenting with different \(\tau\) values to facilitate training and found that a temperature of 0.5 was most effective, as shown in Table 6. We use the Kullback-Leibler (KL) divergence as the distillation loss, as suggested in Reference [40]. We minimize the KL divergence between the teacher’s and the student’s outputs softened by the same temperature \(\tau\), where \(S^\tau\) and \(T^{\tau }\) are the softened outputs of the student and teacher networks, respectively:
\begin{equation} KL(S^{\tau }, T^{\tau }) = \sum _{x} S^{\tau }(x) \log \frac{S^{\tau }(x)}{T^{\tau }(x)} \tag{1} \end{equation}
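A minimal TensorFlow sketch of this training step is shown below, assuming the teacher logits for the synchronized sensor window have already been produced (e.g., returned by the edge-hosted teacher tail); the function and variable names are illustrative, and the direction of the KL divergence follows Equation (1) as written.

```python
import tensorflow as tf

tau = 0.5                                        # distillation temperature (Table 6)
kl = tf.keras.losses.KLDivergence()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def distill_step(student, video_feats, teacher_logits):
    """One cross-modal distillation step: match the student's softened output
    on the video window to the teacher's softened output on the synchronized
    sensor window."""
    soft_targets = tf.nn.softmax(teacher_logits / tau)      # T^tau
    with tf.GradientTape() as tape:
        student_logits = student(video_feats, training=True)
        soft_preds = tf.nn.softmax(student_logits / tau)    # S^tau
        # KLDivergence(y_true, y_pred) = sum y_true * log(y_true / y_pred),
        # so this computes KL(S^tau, T^tau) as written in Eq. (1).
        loss = kl(soft_preds, soft_targets)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```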
To better understand the transfer of knowledge from the teacher to the student, we present the confusion matrices of the student and the teacher in Figure 5 to observe their performance on each emotion label. The results show that the student model performs almost as well as the teacher model, with a validation accuracy of 95.37%, but makes more errors than the teacher model on all emotions except fear. There are several reasons for this; for example, certain emotions, such as amusement (smiling) and sadness, can be easily identified from the faces in the videos. In contrast, it is difficult to identify disgust, because its expression varies from person to person. Moreover, we can see in Figure 5 that the teacher makes more mistakes in identifying the emotion of disgust and confuses it with fear and anger, because these are all negative emotions. The student shows similar behavior. This makes sense, since the student does not identify emotions directly, but rather emulates the teacher model’s output function through backpropagation. In Reference [46], the authors examined the overlap between disgust and fear. They found that these emotions differ in their acquisition mechanisms and behavioral intentions, but that there are strong similarities between them, especially when they are of moderate intensity. They also found that disgust, like fear, has a protective value and is a defensive emotion, so these two emotions overlap. Similarly, disgust is also related to anger. Nevertheless, our models can determine these emotions with quite high accuracy.
Table 3. Validation, Test Accuracy, and F1-score of FedCMD with IID Data
| No. of students in FL | Val. Accuracy (%) | Test Accuracy (%) | F1-score (%) | FL Rounds |
| 4 | 98.06 \(\pm\) 0.38 | 95.00 \(\pm\) 0.68 | 95.80 \(\pm\) 1.19 | 21 |
| 6 | 96.94 \(\pm\) 1.04 | 95.83 \(\pm\) 0.67 | 96.11 \(\pm\) 0.80 | 7 |
| 8 | 96.94 \(\pm\) 0.38 | 95.00 \(\pm\) 0.67 | 95.83 \(\pm\) 0.58 | 12 |
| 10 | 96.11 \(\pm\) 0.31 | 94.17 \(\pm\) 0.67 | 95.25 \(\pm\) 0.40 | 8 |
| 12 | 97.22 \(\pm\) 0.38 | 96.67 \(\pm\) 0.67 | 95.24 \(\pm\) 1.08 | 11 |
| 15 | 97.78 \(\pm\) 0.39 | 95.28 \(\pm\) 0.38 | 94.96 \(\pm\) 0.66 | 15 |
| 18 | 96.67 \(\pm\) 1.35 | 95.56 \(\pm\) 1.42 | 96.41 \(\pm\) 1.54 | 41 |
| 20 | 97.22 \(\pm\) 1.71 | 96.39 \(\pm\) 1.04 | 95.80 \(\pm\) 3.00 | 51 |
| 25 | 97.22 \(\pm\) 0.79 | 95.28 \(\pm\) 0.79 | 95.50 \(\pm\) 1.60 | 24 |
| 30 | 95.56 \(\pm\) 0.38 | 96.11 \(\pm\) 1.04 | 94.96 \(\pm\) 0.64 | 32 |
Table 4. Validation, Test Accuracy, and F1-score of FedCMD with Non-IID Data
| No. of students in FL | Val. Accuracy (%) | Test Accuracy (%) | F1-score (%) | FL Rounds |
| 4 | 92.50 \(\pm\) 0.67 | 90.83 \(\pm\) 0.68 | 93.82 \(\pm\) 1.73 | 35 |
| 6 | 91.94 \(\pm\) 2.08 | 89.72 \(\pm\) 1.04 | 90.40 \(\pm\) 1.41 | 60 |
| 8 | 90.83 \(\pm\) 0.01 | 90.00 \(\pm\) 2.45 | 91.07 \(\pm\) 1.02 | 85 |
| 10 | 91.39 \(\pm\) 1.42 | 88.33 \(\pm\) 1.18 | 92.17 \(\pm\) 0.70 | 81 |
| 12 | 91.94 \(\pm\) 2.39 | 87.50 \(\pm\) 2.45 | 89.30 \(\pm\) 1.38 | 82 |
| 15 | 90.00 \(\pm\) 0.67 | 87.22 \(\pm\) 1.04 | 91.12 \(\pm\) 1.06 | 100 |
| 18 | 87.78 \(\pm\) 1.42 | 85.83 \(\pm\) 1.35 | 88.60 \(\pm\) 0.82 | 100 |
| 20 | 86.11 \(\pm\) 0.39 | 82.50 \(\pm\) 1.79 | 89.05 \(\pm\) 0.69 | 100 |
| 25 | 89.17 \(\pm\) 1.79 | 87.50 \(\pm\) 0.67 | 87.98 \(\pm\) 1.01 | 100 |
| 30 | 87.78 \(\pm\) 0.38 | 83.06 \(\pm\) 0.58 | 88.30 \(\pm\) 0.77 | 100 |
Table 5. Average Validation Accuracy and Standard Deviation of the Model Trained on Sensor Data Is Calculated by Training the Model Five Times with Different Window Sizes
| Window Size (seconds) | Overlapping Ratio (%) | Accuracy (%) | Standard Deviation (%) |
| 1 | 0 | 95.65 | 1.11 |
| 2 | 0 | 92.67 | 1.16 |
| 2 | 50 | 94.43 | 1.76 |
| 3 | 0 | 90.38 | 1.91 |
| 3 | 50 | 91.34 | 1.34 |
| 4 | 0 | 90.86 | 1.95 |
| 4 | 50 | 90.75 | 1.23 |
| 5 | 0 | 84.33 | 1.37 |
| 5 | 50 | 87.20 | 2.35 |
| 6 | 0 | 84.47 | 2.71 |
| 6 | 50 | 90.14 | 1.02 |
| 10 | 0 | 79.41 | 1.06 |
Table 6. Average Validation Accuracy and Standard Deviation of the Student Model Calculated by Training the Model Five Times with Different Temperature Values
| Temperature | Accuracy (%) | Standard Deviation (%) |
| 0.5 | 95.67 | 0.72 |
| 1 | 95.30 | 0.85 |
| 2 | 93.82 | 0.80 |
| 3 | 93.05 | 1.03 |
| 4 | 92.62 | 0.77 |
| 5 | 90.86 | 0.66 |
| 6 | 90.10 | 0.61 |
| 7 | 90.02 | 0.96 |
| 8 | 89.64 | 0.86 |
| 9 | 88.88 | 1.35 |
| 10 | 87.97 | 0.89 |
Fig. 5. Confusion matrix of the teacher model (a) and the student model trained via cross-modal distillation (b).
In the next section, we describe in detail the procedure of federating student models trained by cross-modal distillation on each device.

5 Federated Cross-Modal Knowledge Distillation

There are billions of vehicles in the world [18]. The datasets generated by these vehicles contain important information about drivers’ emotions. Data from distributed vehicles becomes key for developing better and more general emotion recognition models to improve and maximize user experience. When the data is distributed across different locations, using FL instead of collecting data at one place is a smart solution. FL provides a unique way to build such models without violating user privacy.
In this section, we are interested in constructing a common feature subspace for the different students (clients) that exhibit cross-modal heterogeneity. We achieve this through an FL procedure over heterogeneous students trained on locally available data. Specifically, we present how the student models can be trained with distillation in a distributed manner using an FL-based algorithm called Federated Cross-Modal Distillation (FedCMD) to build a generalized emotion recognition system for all vehicles.
Since we have already trained our teacher model in the previous section, we now train the student model in a distributed manner by performing a local distillation on each vehicle device. For this reason, we split our teacher model to be used together with the student model on each vehicle device, as described in the following Section 5.1.

5.1 Splitting of the Teacher Model

We split our trained teacher model into a head and a tail part for inference, following the idea in Reference [33], to perform the distillation process for a resource-constrained device in a distributed manner. The head of the teacher model receives the sensor inputs corresponding to the video features. We place the tail of the teacher model at the edge for inference, so most of the model resides at the edge. For splitting the teacher model, one extreme is to run the entire model on the vehicle device, which is not feasible for weaker devices; the other extreme is to run the entire model at the edge, i.e., pure edge computing, in which the head consists of zero layers. However, sending raw data to the edge is impractical and compromises user privacy. Considering these challenges, we split the teacher model at the first max-pooling layer, which acts as a bottleneck to reduce the size of the tensor that needs to be sent to the edge. This avoids sending large amounts of data over the network, which reduces the communication overhead and the computational load on the vehicle device. We can further reduce the size of the output tensor by performing quantization, as mentioned in Reference [42]. The splitting of the teacher model into a head and a tail part can be seen in Figure 1. The teacher head takes the sensor data and sends the tensor of the max-pooling layer to the edge, where the teacher tail generates the output logits.
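A sketch of this split in Keras follows, assuming the teacher was built as a Sequential model with an explicit Input layer; the layer name passed as split_layer is illustrative and depends on how the teacher was actually constructed.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def split_teacher(teacher: tf.keras.Model, split_layer: str = "max_pooling2d"):
    """Split a trained teacher at the first max-pooling layer: the head runs on
    the vehicle device, the tail at the edge."""
    split_idx = [l.name for l in teacher.layers].index(split_layer)
    # Head: sensor input -> bottleneck tensor of the max-pooling layer.
    head = models.Model(inputs=teacher.input,
                        outputs=teacher.get_layer(split_layer).output,
                        name="teacher_head")
    # Tail: the remaining layers, rebuilt to accept the bottleneck tensor.
    tail_input = layers.Input(shape=tuple(teacher.layers[split_idx].output.shape[1:]))
    x = tail_input
    for layer in teacher.layers[split_idx + 1:]:
        x = layer(x)                              # reuses the trained weights
    tail = models.Model(inputs=tail_input, outputs=x, name="teacher_tail")
    return head, tail

# On the vehicle: bottleneck = head(sensor_window)      (sent to the edge)
# At the edge:    teacher_logits = tail(bottleneck)     (returned for distillation)
```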

5.2 Federated Learning Process

In FedCMD, the training process consists of local and collaborative training of the student model via FL. Three main actors are involved in training the student model: the client side, i.e., the local training on the vehicle; the Kafka broker architecture, which handles communication; and the FL server, which generates the global model. All these components are described below:
(1)
Local Training on Vehicle: The vehicles are the end-users with access to the drivers’ data. In each vehicle, the on-board device reads the values of the sensors and the videos’ features, as described in Section 4. Then, the training of the local student model takes place. Each vehicle trains its student model on the video features and generates the output logits of its local student model. In parallel, the output tensor of the teacher’s head is sent to the broker at the edge to generate the teacher’s logits, which are sent back to the vehicle to perform local cross-modal distillation.
(2)
Communication Framework: The main function of the Kafka-based communication framework is to manage the communication between the end clients (vehicles) and the FL server to exchange model parameters in the FL process. In our proposed approach, the vehicle clients act as publishers when the local model is sent to the Kafka broker and subscribers when they receive the global model from the Kafka broker sent from the FL server. This framework was developed in our earlier work [10]. It is a two-tier FL communication framework that receives the model updates asynchronously through the middle instance, i.e., the Kafka broker between the vehicles and the FL server. This framework supports the dynamic vehicle scenario and solves the intermittent connectivity problems of the vehicles through loosely coupled and scalable communication.
(3)
Collaborative (global) training: The FL server is responsible for two functions: first, creating the global model by aggregating the weights of all student models using the FedAvg algorithm, and second, transmitting the global model to the participating vehicles. Once the global model is created, it is sent to the Kafka broker, and the vehicle clients that participate in the FL process and subscribe to this service in Kafka receive the global model. The FL server acts as the subscriber when it receives the local student models, and it also serves as the publisher when it sends the global model to the Kafka broker.
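The sketch below illustrates the publish/subscribe roles of the vehicle clients using the kafka-python package; the topic names, broker address, and pickle-based serialization are assumptions for illustration and do not reflect the actual KafkaFed implementation [10].

```python
import pickle
from kafka import KafkaProducer, KafkaConsumer   # kafka-python

BROKER = "edge-broker:9092"                                    # illustrative broker address
LOCAL_TOPIC, GLOBAL_TOPIC = "student-updates", "global-model"  # illustrative topic names

def publish_local_update(weights, client_id: str):
    """Vehicle side: publish the locally distilled student weights to the broker."""
    producer = KafkaProducer(bootstrap_servers=BROKER)
    producer.send(LOCAL_TOPIC, key=client_id.encode(), value=pickle.dumps(weights))
    producer.flush()

def receive_global_model():
    """Vehicle side: subscribe to the global-model topic and return the next
    aggregated model published by the FL server."""
    consumer = KafkaConsumer(GLOBAL_TOPIC, bootstrap_servers=BROKER)
    for message in consumer:
        return pickle.loads(message.value)
```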

6 Numerical Evaluation

This section focuses on the numerical analysis of our proposed FedCMD method. We compare the performance of FedCMD with the performance obtained in centralized settings for training the student model with the entire dataset in Section 4. Recall that in FedCMD, each vehicle can only process the data it collects locally in a distributed environment. Here, we seek to answer the following research question: “Is FedCMD able to perform better than centralized learning in emotion recognition?” To answer this research question, we first describe the experimental setup (Section 6.1) and then the results we obtained with FedCMD in both IID (Section 6.2.1) and non-IID data settings (Section 6.2.2).

6.1 Experimental Setup

All experiments for FedCMD are performed on a system with a Core i7 processor, 16 GB RAM, and Ubuntu 20.04 64-bit OS. A Jetson Xavier NX is used as the edge device. The entire framework is implemented in Python 3.8 using TensorFlow 2 and Keras. For training, we use the Adam optimizer with a learning rate of 1e-3 and a batch size of 64. During the FL process, the number of local epochs performed by each student client before aggregation is 6 for the IID scenario and 3 for the non-IID scenario, as described in Section 7. The student network is trained for up to 100 FL rounds, using “early stopping” to avoid wasting communication resources on additional rounds of FL if no improvements are made during training. Thus, the learning process is stopped if the global model validation loss does not decrease for 10 successive rounds. To evaluate the accuracy of the global model, we stored the validation and test sets separately on the FL server, each comprising 10% of the entire dataset. We use the standard classification accuracy of the global model as the performance measure for emotion recognition. For the experiments, we use the multimodal dataset “BioVid Emo DB” described in Section 4. The same temperature \(\tau\) of 0.5 is used for local cross-modal distillation on each vehicle device.
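The round-level early stopping described above can be sketched as follows; server.aggregate and server.evaluate are hypothetical placeholders for one FL round (Section 5.2) and for evaluating the global model on the validation set held at the FL server.

```python
def run_federated_training(server, clients, max_rounds=100, patience=10):
    """Round-level early stopping: stop if the global model's validation loss
    has not improved for `patience` consecutive rounds."""
    best_loss, stale_rounds = float("inf"), 0
    global_weights = None
    for _ in range(max_rounds):
        global_weights = server.aggregate(clients)      # one FL round
        val_loss, _ = server.evaluate(global_weights)   # validation set on the FL server
        if val_loss < best_loss:
            best_loss, stale_rounds = val_loss, 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break                                   # early stop
    return global_weights
```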

6.2 Federated Cross-modal Knowledge Distillation

Using the cross-modal distillation technique for local training, we analyze our proposed FL-based emotion recognition system, FedCMD, for both IID and non-IID data. As explained earlier, the advantage of FL over classical learning is the ability to aggregate multiple client models into one global model to improve generalizability without degrading specialization. The experimental results in Reference [22] show that KD can improve both the local and global models simultaneously, as promoting local nodes leads to improving the global model after averaging. In our experiments for the FL process, the central cloud server is initialized, and a copy of the global model is sent to each vehicle (student) at time t = 0. In each round, we assume that all students (clients) participate, to eliminate the effect of randomness caused by client sampling [34]. We also specify that all students perform local learning for a certain number of epochs, which we refer to as local distillation epochs and which we describe in more detail in Section 7.
We studied a total of 10 settings with different numbers of clients (4, 6, 8, 10, 12, 15, 18, 20, 25, and 30) to evaluate the effectiveness of our proposed strategy. The first experiment is run with four clients training simultaneously, and so on. When the clients train simultaneously on different vehicle devices, each client performs the cross-modal distillation on its own system with its own local dataset. Each experiment is repeated three times, and the average test and validation accuracy are calculated together with the F1-score and its standard deviation.

6.2.1 Federated Learning with IID Dataset Distribution.

For the IID setting, the data are first shuffled and evenly distributed across all student clients by dividing the total dataset by the number of students so each student has approximately the same number of samples and all available labels. The IID distribution of the data for four students is shown in Figure 6(a).
Fig. 6. Dataset distribution across four FL clients in scenarios of IID and non-IID data.
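A minimal sketch of this IID partitioning, assuming the preprocessed windows and labels are available as NumPy arrays, is shown below; the function name is illustrative.

```python
import numpy as np

def iid_partition(x, y, n_clients, seed=0):
    """Shuffle the dataset and split it evenly, so every client receives roughly
    the same number of samples and all five emotion labels."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(y))
    shards = np.array_split(shuffled, n_clients)
    return [(x[s], y[s]) for s in shards]
```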
The average accuracy of the global model in FL with the specified local distillation epochs for each client is shown in Table 3. The results confirm that the decentralized training algorithm performs well compared to the centralized baseline, improving the validation accuracy over the centralized cross-modal distillation from 95.67% to 96.94%. We found only a minimal difference in accuracy between the decentralized training runs with different numbers of students using an IID dataset. Moreover, with only 20 students and 23 rounds of FL, we achieved a test accuracy of 97.22%. Note that the 23 rounds are the early-stopping point for the FL process with 20 students, and the experiment was repeated with different initializations to obtain the mean values. Surprisingly, depending on the configuration chosen, the final performance of the model trained via FL exceeds the baseline accuracy associated with training on a centralized dataset in the traditional way. This improvement results from expanding the students’ learning capacity, allowing each student to learn from the other students through model averaging rather than only from the teacher’s outputs. It also provides more flexibility, as we can treat the decentralized training process as an optimization problem and choose the local distillation epochs and early-stopping point to avoid overfitting the student model in the FL process.
The results in Table 3 show that the FedCMD method has also achieved high F1-score performance in IID environments due to the uniform and consistent distribution of data across participating clients. This uniformity allows for effective training and evaluation, resulting in reliable and robust performance.

6.2.2 Federated Learning with Non-IID Dataset Distribution.

To distribute the dataset for the non-IID scenario among the students, we created different label sections containing 70%, 20%, 5%, 3%, and 2% of the total number of labels and then randomly assigned labels from each section to each student, so each student had a different number of samples and a different label distribution. Figure 6(b) shows the data distribution among four students, with uneven label distributions.
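One plausible reading of this label-skewed split is sketched below: each client draws its samples with per-label proportions of (70, 20, 5, 3, 2)% over a random permutation of the five emotion labels, so both the number of samples actually obtained and the label distribution differ between clients. The samples_per_client parameter and the exact drawing procedure are assumptions, since the split is not fully specified here.

```python
import numpy as np

def non_iid_partition(x, y, n_clients, samples_per_client=600, seed=0):
    """Label-skewed split: each client draws samples with per-label proportions
    (70, 20, 5, 3, 2)% over a random permutation of the five emotion labels."""
    rng = np.random.default_rng(seed)
    proportions = np.array([0.70, 0.20, 0.05, 0.03, 0.02])
    by_label = {c: np.where(y == c)[0] for c in np.unique(y)}
    clients = []
    for _ in range(n_clients):
        label_order = rng.permutation(list(by_label))    # which label dominates this client
        idx = []
        for label, p in zip(label_order, proportions):
            take = min(int(p * samples_per_client), len(by_label[label]))
            idx.extend(rng.choice(by_label[label], size=take, replace=False))
        idx = np.asarray(idx)
        clients.append((x[idx], y[idx]))
    return clients
```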
The FL accuracy for non-IID data is shown in Table 4. The table shows that we achieved the maximum average accuracy of 90.83% with only four clients and 35 rounds of FL. In this case, the accuracy is lower than that of the centralized cross-modal distillation (95.67%). We also need more rounds in the non-IID case than in the IID scenario. For example, if we increase the number of clients to more than 12, then the full 100 rounds of the FL process are used. While we always reach the early-stopping point in the IID scenario, we do not reach it in the non-IID case with more clients. Thus, the results show that training many clients in non-IID environments within a finite number of communication rounds does not yield accuracy that approaches the centralized accuracy. This is because as the number of clients increases, the degree of non-IIDness and thus the global imbalance increase. After a few local distillation epochs of training on local devices, the weights of the client and server models can differ dramatically, leading to client drift. As client drift increases, the accuracy of the global model decreases because convergence is slower, and many rounds are required to converge. The studies in Reference [28] show that heterogeneity in the data leads to drift in the local client updates, which slows down the convergence speed. Also, the studies in Reference [24] claim that drift in client updates caused by non-IID data is the main cause of degraded convergence rates in heterogeneous environments. In other words: in the local training phase, the local models are updated toward the local optima, which may be far from the global optima. The averaged model may also be far from the global optima, especially if the local updates are large, i.e., with a large number of local epochs [24]. Ultimately, the converged global model has lower accuracy than in the IID setting for a fixed 100 rounds of FL. However, the accuracy decreases slowly in our scenario, confirming the method’s robustness to global imbalances.
In Table 4, we have highlighted the F1-score as an important performance indicator for model evaluation, since it provides additional insight into per-class performance, especially in scenarios with non-IID datasets. The results in Table 4 show that the FedCMD method achieves a strong F1-score. However, this performance tends to decrease with increasing non-IIDness, which is due to both the increased number of clients and the emergence of the client drift problem.

6.2.3 Comparison with Existing Baselines.

In addition to comparing our FedCMD approach with centralized learning, we also compare it with recent existing approaches on the BioVid Emo DB dataset. For example, in Reference [26], the authors achieved 83.79% accuracy for their emotion recognition model. As explained in Section 2, they used two modality networks, whose results were combined using decision-level weighted fusion. In Reference [45], an accuracy of 80.89% was achieved; the authors proposed to fuse the features of each modality extracted using a BDBN (bimodal deep belief network) and trained a linear SVM on these multimodal features for emotion recognition. The authors in Reference [15] computed various time- and frequency-based features, which were then fused and classified with an SVM, achieving a highest accuracy of 79.51%. These results show that our proposed method significantly outperforms existing methods while using a much smaller student model. We also found that our student model learns much faster and more reliably than the existing techniques, as it is trained using the output of a strong teacher model as soft labels instead of ground truth, while maintaining user privacy.

7 Ablation Studies

In this section, we perform experiments to investigate how various selected parameters contribute to the performance of the locally performed cross-modal distillation (Section 7.1) and their effects on the FedCMD approach (Section 7.2).

7.1 Cross-modal Knowledge Distillation

First, we evaluate the performance of our centralized cross-modal approach in terms of validation accuracy for different configurations of the parameters used in the process: the window size used to obtain the temporal characteristics of the time series data, the temperature value used for cross-modal distillation, and the loss function.

7.1.1 Window Size Impact.

When training the teacher and student models in Section 4, the 1-second segment window of the sensor and video data was selected after extensive experimentation, leading to effective results. Banos et al. [11] studied the effects of window size on human activity recognition systems and concluded that smaller windows of 1 or 2 seconds provide better and faster results and reduce energy consumption. We evaluated different window sizes to test this on our emotion recognition dataset. The results for the different window sizes are shown in Table 5. From the table, we can see that the accuracy generally increases as the window size decreases. Thus, the 1-second window provides the best tradeoff between speed and accuracy of the emotion recognition model. Moreover, a smaller window size reduces the power consumption of the resource-constrained device. Therefore, we work with one-second feature windows, considering the results in Table 5.

7.1.2 Effect of Temperature in Distillation.

We also perform cross-modal distillation with different temperatures to study the effect of the temperature on distillation for the given problem. In our experiments, we used a temperature of 0.5 to distill knowledge from the teacher network to the student network, considering the different modalities. In Table 6, we report the average validation accuracy and standard deviation for each temperature value, obtained by running each experiment five times with different random weight initializations. It can be seen from the table that the accuracy decreases as the temperature increases. This confirms Hinton’s finding [21] that lower temperature values lead to better results when the student model is very small compared to the teacher model.

7.1.3 Cross-entropy Loss.

In Section 4, we used only the KL-divergence loss for cross-modal distillation. Now, we evaluate the effect of a different loss function, namely the cross-entropy loss, on the proposed approach. The experiment shows that the accuracy of the student network decreases from 95.67% to 93.60% when we use the cross-entropy loss instead of the KL-divergence loss. However, when we increase the temperature to 2, we obtain an accuracy of 94.58%, which then decreases as the temperature increases further, following the same trend as with the KL-divergence loss in Table 6.

7.2 Federated Cross-modal Knowledge Distillation

In this section, we evaluate the performance of the FedCMD approach by changing the number of local distillation epochs for both the IID and non-IID data distributions to determine how many times local distillation must be performed before the student models are federated, as previously described in Section 5. The number of local epochs is a key parameter in FL. While one traditional line of work develops approaches that are robust to the local updates, another develops efficient approaches for setting the FL parameters. To determine the number of local epochs for each student’s training, we conducted experiments with local distillation epochs of 1, 2, 3, 4, 5, 6, and 8. We experimented with an increasing number of students to investigate the effects of the local distillation epochs on the accuracy of the federated model. We also determine the total number of rounds needed to achieve the desired accuracy with different local distillation epochs. We study the effects by changing the number of students to \(4, 6, 8, 10, 12,\) and 15. With early stopping, the maximum number of FL rounds is set to 100 for each experiment.

7.3 Local Distillation Epochs for IID Data Distribution

We first vary the local distillation epochs under the IID data setting. The experimental results are shown in Figure 7. Increasing the number of local distillation epochs beyond 6 degrades the global model accuracy: as the number of local epochs grows, the local student model fits the local data too closely and overfits. For this reason, we set the local distillation epochs to 6. We also find that the model converges faster with fewer distillation epochs.
Fig. 7. Mean validation accuracy for different numbers of students when varying the local distillation epochs for FL with the IID dataset.
We also examine the relationship between the number of FL rounds and the number of local distillation epochs. As shown in Table 7, the students converge to the desired accuracy within comparatively few rounds when the local distillation epochs are set to 6; 6 epochs are therefore a good choice for an IID data distribution across clients. The bold numbers in the tables indicate the smallest number of FL rounds for the given number of students.
Number of students | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 | Epoch 6 | Epoch 8
4  | 23 | 16 | 20 | 20 | 16 | 21 | 39
6  | 56 | 23 | 40 | 21 |  7 |  6 | 28
8  | 40 | 31 | 20 | 14 | 40 | 12 | 21
10 | 41 | 35 | 25 | 63 | 23 |  8 | 57
12 | 44 | 32 | 38 | 35 | 32 | 11 | 67
15 | 59 | 43 | 25 | 40 | 38 | 15 | 27
Table 7. Number of FL Rounds for Different Numbers of Students When Changing the Local Distillation Epochs of Each Student in the Case of IID Dataset Distribution

7.4 Local Distillation Epochs for Non-IID Data Distribution

In experiments with non-IID datasets, we found that non-IID data can significantly affect the accuracy of FL, because the distribution of each local dataset differs substantially from the global distribution. The local optimum of each client does not coincide with the global optimum, which causes a drift in the local updates [24]. For this reason, we use half the number of local distillation epochs used in the IID case (6), i.e., 3 epochs. In addition, we found that most experiments reach their minimum number of FL rounds with 3 local distillation epochs, as shown in Table 8. In contrast, when the number of local epochs is increased, the standard deviation of the final validation accuracy grows considerably. This behavior suggests that the number of local epochs strongly affects the accuracy of FL algorithms. Further increasing the local distillation epochs does not lead to significant improvements and, in some cases, even requires more communication rounds.
Number of students | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 | Epoch 6 | Epoch 8
4  |  81 |  45 |  35 |  48 |  42 |  59 |  44
6  | 100 |  76 |  60 |  71 |  66 |  52 |  62
8  | 100 |  85 |  85 |  76 |  62 |  86 |  89
10 | 100 | 100 |  81 |  81 |  80 |  73 | 100
12 | 100 |  84 |  82 | 100 |  79 | 100 | 100
15 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Table 8. Number of FL Rounds for Different Numbers of Students When Changing the Local Distillation Epochs of Each Student in the Case of Non-IID Dataset Distribution
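For illustration, one common way to emulate such label-skewed non-IID partitions is to assign each client samples from only a few emotion classes; the sketch below shows label skew as an example only and is an assumption, not necessarily the partitioning scheme used in our experiments.

```python
import numpy as np

def label_skew_partition(labels: np.ndarray, num_clients: int,
                         classes_per_client: int = 2, seed: int = 0):
    """Give each client sample indices drawn from only a few emotion classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    partitions = []
    for _ in range(num_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        idx = np.flatnonzero(np.isin(labels, chosen))
        partitions.append(rng.permutation(idx))
    return partitions

# Illustrative usage with 5 emotion classes and 6 clients.
y = np.random.randint(0, 5, size=3000)
client_indices = label_skew_partition(y, num_clients=6)
```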
We also found that the students' collaborative learning via FL does not depend on the temperature parameter, because each student computes its loss locally and the same network architecture is applied to the same modality. Finally, based on the average accuracy and overall performance results, we conclude that the system can classify emotions in real time from multimodal data while protecting privacy, producing a sufficiently robust global emotion classifier. Experiments with different numbers of clients show that the system can serve multiple clients simultaneously; it is therefore scalable and suitable for large-scale practical adoption.

7.5 Client Dropout

In this section, we address the challenges associated with the Internet of Vehicles (IoV), focusing on the limited network connectivity that can disrupt vehicle connectivity during driver emotion recognition. Our goal is to evaluate the performance of FedCMD in such environments, in particular at different dropout rates. To this end, we selected a total of 30 clients and applied dropout rates between 10% and 90%. For example, with a dropout rate of 10%, three clients are randomly excluded in each training round. Tables 9 and 10 show the performance of FedCMD in the IID and non-IID settings, respectively. The results show that, in the IID case, the accuracy does not decrease much, thanks to the redundancy and variety of data across the different clients. Although random dropouts affect individual clients, the collaborative nature of FL mitigates the impact on the global model. In the non-IID case, however, performance degrades rapidly, as FL is more sensitive to client dropouts due to the heterogeneous nature of non-IID data. Unlike IID environments, where the uniform distribution provides a degree of redundancy, client dropouts in non-IID environments result in a greater loss of valuable information.
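Such a dropout experiment can be simulated in the federation loop as in the following minimal sketch, which assumes 30 registered clients and randomly excludes a fraction of them in each round:

```python
import random

def sample_active_clients(clients, dropout_rate: float, seed=None):
    """Randomly exclude a fraction of clients from the current FL round."""
    rng = random.Random(seed)
    num_dropped = round(dropout_rate * len(clients))
    dropped = set(rng.sample(range(len(clients)), num_dropped))
    return [c for i, c in enumerate(clients) if i not in dropped]

# Example: 30 clients, 10% dropout -> 3 clients excluded in this round.
clients = [f"vehicle_{i:02d}" for i in range(30)]
active = sample_active_clients(clients, dropout_rate=0.10, seed=42)
assert len(active) == 27
```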
Dropout rate (%) | Val. accuracy (%) | Test accuracy (%) | F1-score (%)
 0 | 95.56 ± 0.38 | 96.11 ± 1.04 | 94.96 ± 0.64
10 | 96.67 ± 0.67 | 96.39 ± 0.79 | 96.61 ± 0.71
20 | 97.50 ± 0.67 | 97.50 ± 0.00 | 97.48 ± 0.70
30 | 96.94 ± 1.04 | 96.11 ± 1.04 | 96.91 ± 1.04
40 | 97.22 ± 0.38 | 96.11 ± 1.04 | 97.18 ± 0.42
50 | 96.67 ± 0.67 | 97.22 ± 0.38 | 96.59 ± 0.73
60 | 96.39 ± 1.04 | 96.94 ± 0.39 | 96.32 ± 1.09
70 | 96.39 ± 1.42 | 95.28 ± 0.38 | 96.35 ± 1.43
80 | 96.67 ± 0.67 | 94.44 ± 0.38 | 96.64 ± 0.66
90 | 96.11 ± 0.79 | 95.00 ± 1.18 | 96.09 ± 0.76
Table 9. Test Accuracy and F1-score of FedCMD at Different Client Dropout Rates in the Context of IID Data
Dropout rate (%) | Val. accuracy (%) | Test accuracy (%) | F1-score (%)
 0 | 87.78 ± 0.38 | 83.06 ± 0.58 | 88.30 ± 0.77
10 | 86.39 ± 0.79 | 84.72 ± 2.19 | 86.33 ± 0.77
20 | 85.56 ± 0.38 | 84.72 ± 1.71 | 85.50 ± 0.30
30 | 86.11 ± 1.42 | 83.33 ± 1.79 | 85.90 ± 1.32
40 | 86.94 ± 1.04 | 86.67 ± 1.18 | 86.80 ± 1.09
50 | 85.28 ± 2.58 | 84.17 ± 2.04 | 85.07 ± 2.67
60 | 82.22 ± 2.19 | 82.22 ± 1.71 | 81.94 ± 1.91
70 | 80.00 ± 0.75 | 78.00 ± 2.39 | 79.14 ± 4.83
80 | 64.44 ± 15.47 | 62.78 ± 20.57 | 60.76 ± 20.31
90 | 37.78 ± 11.99 | 35.83 ± 13.66 | 29.39 ± 15.76
Table 10. Test Accuracy and F1-score of FedCMD at Different Client Dropout Rates in the Context of Non-IID Data

8 Conclusion and Future Work

In this article, we present an emotion recognition system based on FL that classifies the emotional states of drivers from multimodal data. The proposed method detects emotions from facial features extracted from unlabeled driver videos, using sensor data as a proxy through the cross-modal KD approach. As a result, we obtained a strong teacher model trained on sensor data with 95.65% accuracy and a small student model that can be deployed on devices with limited resources. To build a highly accurate global classifier without accessing the data distributed across individual vehicles, we federate the student models with the FedAvg algorithm while detecting drivers' emotional states.
In the proposed federated approach, i.e., FedCMD, the user's privacy is protected because the privacy-sensitive videos of the driver never leave the vehicle. Experimental results on the "BioVid Emo DB" dataset show that the proposed system achieves high accuracy for both IID and non-IID data, i.e., 96.67% and 90.83%, respectively, and outperforms state-of-the-art methods in classifying the five emotion categories. The proposed emotion recognition approach is scalable and has great potential for practical adoption in vehicle scenarios. Another major advantage is that video data is almost unlimited and can easily be captured by cameras on the vehicle dashboard. A person's cognitive state while driving, however, depends on many factors, such as how he or she perceives the state of the vehicle and the environment, as well as personal preferences, expectations for a driving scenario, and context awareness. In this article, we did not address the context awareness of the vehicle.
In our future work, we will investigate the relationship between context awareness and driver emotions. We also intend to investigate the bias between the soft labels assigned by the teacher model and the subjective ratings of individual drivers: a driver may not perceive the risk indicated by the sensors, because the perceived emotion depends on a personal threshold that varies from person to person. These subjective ratings could be used to build personalized models of each driver's emotional state. In addition, traditional FL systems face challenges such as scalability, complex infrastructure management, and wasted resources. We therefore plan to implement the FL process with serverless computing, because these challenges are closely related to the core problems that serverless computing and FaaS platforms aim to solve.

A Appendix

In this section, we discuss the convergence analysis of the proposed "FedCMD" method and the propagation of errors from the teacher to the student models, building on the results in References [28, 37]. Referring to the work in Reference [37], the authors use the Neural Tangent Kernel (NTK) [23] to show that the prediction error of the student NN in estimating the ground-truth vector \(\mathbf {y}\) can be expressed as follows:
\begin{equation} \Vert \mathbf{f}_{\mathcal{S}}-\mathbf{y}\Vert_2 = C \cdot \left\Vert \mathbf{y}-\lambda \cdot \frac{\mathbf{a}_{\mathcal{S}}}{N_H}\cdot \mathbf{f}_{\mathcal{T}}^{\prime} \right\Vert_2 . \tag{2} \end{equation}
In Equation (2), C is a constant that depends on the teacher hyperparameters, including the distillation regularization parameter \(\lambda\); \(\mathbf{f}_{\mathcal{S}}\) and \(\mathbf{f}_{\mathcal{T}}\) are the logits of the student and teacher models; \(\mathbf{a}_{\mathcal{S}}\) is the vector of weights the student uses to generate its prediction output by linearly combining its logits; and \(N_H\) is the number of neurons in the hidden layer. This result implies that the student's prediction error decreases as the pre-trained teacher model \(\mathbf{f}_{\mathcal{T}}\) approaches \(\mathbf{y}\).
To build the convergence analysis of our proposed approach, we rely on the results presented in Reference [28], in which the authors perform a comprehensive convergence analysis of FedAvg for both IID and non-IID data distributions. In particular, they show that FedAvg converges to the global optimum over a total number of T FL rounds at a rate of \(\mathcal {O}(1/T)\), even in the presence of non-IID data, provided that the local losses are strongly convex and smooth. Moreover, the authors find that E local update iterations with a small learning rate have an effect on the convergence rate similar to a single Stochastic Gradient Descent (SGD) step with a larger learning rate. This observation suggests that the global model update in FedAvg behaves similarly to an SGD update when combined with appropriate sampling and averaging strategies; consequently, FedAvg shows a convergence pattern comparable to that of SGD. The study also emphasizes the role of the learning rate in reducing the variance of the averaged sequence of global models, a variance attributed to partial user participation, highlighting the impact of the learning-rate choice on the stability and efficiency of the FedAvg algorithm. The results in Reference [28] provide valuable insights into the convergence behavior and optimization properties of FedAvg and support the analysis of our proposed approach.
In the following, we consider the upper bound derived by the authors on the difference between the expected loss of the federated model after a total of T iterations, \(\mathbb {E}[\mathcal {L}(\mathbf {w}_{\mathcal {S}}(T))]\), and the global minimum loss \(\mathcal {L}^*\):
\begin{equation} \mathbb{E}[\mathcal{L}(\mathbf{w}_{\mathcal{S}}(T))]-\mathcal{L}^* \;\le\; \frac{L/\mu}{\gamma+T-1} \cdot \left(\frac{2B+C}{\mu}+\frac{\mu\gamma}{2} \cdot \mathbb{E}[\Vert \mathbf{w}_{\mathcal{S}}(1)-\mathbf{w}_{\mathcal{S}}^*\Vert^2]\right), \tag{3} \end{equation}
where
\begin{equation*} \gamma=\max\lbrace 8\cdot L/\mu,\,E\rbrace ; \qquad B=\sum_{k=1}^N (p_k\cdot \sigma_k)^2+6 \cdot L \left(\mathcal{L}^*-\sum_{k=1}^N p_k \mathcal{L}_k^*\right)+ 8 \cdot (E-1) \cdot G^2; \qquad C=\frac{4}{K}\cdot (E \cdot G)^2 . \end{equation*}
The bound in Equation (3) depends on the total number of FL rounds T and on the number of local update iterations E performed by the users in the subset of K clients selected from the total set of N. It also depends on the properties of the loss function at each user, namely L-smoothness, \(\mu\)-strong convexity, an upper bound \(\sigma _k\) on the variance of the stochastic gradient, and an upper bound G on the expected squared norm of the stochastic gradient. The bound further depends on the mean squared distance between the model at the first step, \(\mathbf {w}_{\mathcal {S}}(1)\), and the model that attains the minimum global loss, \(\mathbf {w}_{\mathcal {S}}^*\). The constant B accounts for the weights \(p_k\), \(k=1 \dots N\), used to select the subset of users during federation, and for the degree of non-IIDness of the data, \(\mathcal {L}^*-\textstyle \sum _{k=1}^N p_k \mathcal {L}_k^*\), where the first term is the global minimum of the loss and the second is the weighted sum of the minima of the local losses. In Equation (4), the authors derive the number of FL communication rounds needed to achieve the error bound in Equation (3).
\begin{equation} \frac{T}{E} \;\propto\; \Big(\frac{K+1}{K}\Big) E G^2+ \frac{1}{E}\sum_{k=1}^N (p_k \sigma_k)^2+ \frac{L}{E} \left[\left(\mathcal{L}^*(\mathbf{w}_{\mathcal{S}})- \sum_{k=1}^N p_k \mathcal{L}_k^*(\mathbf{w}_{\mathcal{S}})\right)+G^2 L/\mu \right]+G^2 . \tag{4} \end{equation}
In this article, we exploit the above insights to characterize the error propagation to the federated model caused by the errors of the individual students. To keep the analysis general, we consider a simple 3-layer student NN consisting of an input layer, a hidden layer, and an output layer. The layers are fully connected, and a nonlinear Lipschitz-continuous activation function \(\delta (\mathbf {w}_{\mathcal {S}})\) of the hidden layer generates the output logits \(\mathbf {f}_{\mathcal {S}}=\delta (\mathbf {w}_{\mathcal {S}})\). The activation function is therefore the link between the logits of the student model and the locally generated and federated NN weights, which means that the minimum of each local loss \(\mathcal {L}_k^*(\mathbf {w}_{\mathcal {S}})\) in Equation (4) is given by Equation (2) in our case. Equation (4) shows that the main contribution of the distillation error corresponds to the degree of non-IIDness: for IID data, \(\mathcal {L}^*-\textstyle \sum _{k=1}^N p_k \mathcal {L}_k^*\) approaches zero, and the only parameters responsible for this error are \(\sigma _k\) and G. Comparing the results in Table 8 with those in Table 7, we find that the federation requires more rounds, which is again due to the non-IID data. We also note that the choice of the weights \(p_k\) can affect convergence, as it allows us to select the clients whose loss functions have the lowest variance and the lowest squared norm of the stochastic gradient. Analyzing the results in Table 8, we find that the optimal number of local updates to reach the minimum number of FL communication rounds is 3. Considering Equation (3), this value is governed by the parameter E, which affects the convergence rate: values of E that are too small make FedAvg equivalent to SGD, while values that are too large decrease the convergence rate. Equation (3) also shows that the contribution of selecting a subset of users to the growth of communication rounds is negligible.

B Appendix

In this section, we investigate metrics such as communication cost/time, computational overhead, and memory utilization in detail to evaluate the feasibility of the FedCMD algorithm on edge computing platforms for the vehicular use case.

Computation Time.

To determine the computation time for each client in FedCMD, we followed an approach similar to that in Reference [47]. To measure these values empirically, we rely on the TensorFlow Profiler, which provides insights into the execution of TensorFlow code and allows us to extract the average time required for a step. The "Average Step Time" reported by the TensorFlow Profiler is the average time taken to execute one training step on a given device, in which a batch of data is processed by the NN and the model parameters are updated accordingly. The breakdown of the step time provides a detailed analysis of how the time is distributed across different operations, such as input processing, the forward pass, the backward pass, the gradient update, and other overheads incurred for a batch of data.
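As a lighter-weight illustration of step-time measurement (an assumption for illustration only, not the profiler setup used for Table 11), a simple Keras callback can estimate the average time per training step:

```python
import time
import numpy as np
import tensorflow as tf

class StepTimer(tf.keras.callbacks.Callback):
    """Records wall-clock time per training batch to estimate average step time."""

    def on_train_begin(self, logs=None):
        self.step_times = []

    def on_train_batch_begin(self, batch, logs=None):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        self.step_times.append(time.perf_counter() - self._t0)

    @property
    def average_step_time(self):
        return float(np.mean(self.step_times))

# Illustrative usage: attach the callback to the student model's fit() call.
# timer = StepTimer()
# student_model.fit(train_ds, epochs=1, callbacks=[timer])
# print("Average step time (s):", timer.average_step_time)
```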
To emulate the resource-constrained student machine in each vehicle, we assumed a 2.20 GHz CPU device. Since we assume that the devices installed in all vehicles are identical, we use one device as a probe to determine the computation time. Table 11 reports the average step time and the total computation time, i.e., the time needed to compute one epoch on each device. The results show that the average step time is almost identical for different numbers of students in FedCMD. An interesting observation, however, is that the total computation time decreases as the number of student clients increases. This is due to the distributed nature of FL: as the number of clients grows, the training data, and thus the computational load, is split across more devices, so each client completes its local epoch faster.
Configuration | Step time (s) | Epoch time (s)
Centralized | 0.2739 | 21.09
Fed. 2 users | 0.2538 | 9.891
Fed. 4 users | 0.2647 | 5.294
Fed. 6 users | 0.2827 | 3.675
Fed. 8 users | 0.2683 | 2.683
Fed. 15 users | 0.3308 | 1.980
Fed. 30 users | 0.2784 | 0.832
Table 11. Computation Time (Seconds) of Each Student in FedCMD on a 2.20 GHz CPU

Communication Time/Cost.

Communication costs usually refer to the resources consumed when exchanging data or messages between different components of a distributed system. In FL, these costs are often quantified by the number of model parameters transferred between the central server and the participating devices. Furthermore, in our work, we consider not only the cost and time associated with the transmission of model parameters, but also the communication cost and time required for CMD. This includes sending intermediate layer parameters from the teacher model to the edge device and receiving output logits from the teacher. Note that the frequency of performing local CMD varies depending on the scenario, whether IID or non-IID.
To measure the communication time of an FL round in both the IID and non-IID scenarios, we ran the experiment with three different numbers of clients, i.e., the minimum, average, and maximum configurations (4, 15, and 30 clients, respectively). To evaluate the communication time, we used the "KafkaFed" framework, which is tailored to Internet of Vehicles applications [9]; for a detailed description of how KafkaFed works, please refer to Reference [10]. In the KafkaFed experiments, we used 5G as the vehicular communication technology, with a bandwidth of 100 MB/s and a delay of 1 ms.
Table 12 reports the time required for one round of FL. Note that the time required for CMD depends on the number of local epochs performed, which differs between the IID and non-IID scenarios, while the FL time remains almost constant. The times are reported per client, assuming that all clients run the CMD and FL processes simultaneously. Note also that, in the KafkaFed architecture, the server time increases as more clients are added, because the server must fetch each client's model update from the edge (the broker, in our case) for aggregation and then distribute the updated model back to all participating clients; the server workload therefore grows with the number of clients.
No. of students | CMD time per client (IID) | CMD time per client (non-IID) | FL time | Server time | FedCMD time per round (IID) | FedCMD time per round (non-IID)
4  | 0.9828 | 0.4914 | 16.88 | 7.215 | 25.061 | 24.58
15 | 1.1717 | 0.5858 | 20.11 | 9.139 | 30.290 | 29.07
30 | 1.2677 | 0.6336 | 21.75 | 13.23 | 36.255 | 35.61
Table 12. Communication Time (Seconds) of One Round of FedCMD
We also calculated the communication cost in terms of data exchanged for performing CMD and FL, based on the number of parameters exchanged in the upload and download directions. The overall communication costs in Table 13 reflect the total amount of data exchanged for 4, 15, and 30 clients during one round of FL. Table 13 shows that the per-round communication cost of the non-IID scenario is slightly lower than that of the IID scenario. This discrepancy arises because the data exchange between the clients and the edge for CMD takes place over three local epochs in the non-IID scenario, compared to six local epochs in the IID scenario. Nevertheless, the total communication cost of the non-IID scenario remains higher, because a larger number of FL rounds is required to converge, as shown in Table 4.
No. of students | Comm. cost (Upload) | Comm. cost (Download) | Overall comm. cost (IID) | Overall comm. cost (non-IID)
4  | 1.0 | 0.998 | 8.049 | 8.016
15 | 1.0 | 0.998 | 30.18 | 30.06
30 | 1.0 | 0.998 | 60.37 | 60.12
Table 13. Communication Cost (MB) of One Round of FedCMD with Both IID and Non-IID Data
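These per-client figures are consistent with the model size reported in Appendix B (roughly 1 MB of 32-bit parameters per direction); the small calculation below is only a sanity check under that assumption, not the exact accounting behind Table 13.

```python
# Rough sanity check under the assumptions of Appendix B: 32-bit parameters,
# 261,573 student parameters exchanged per direction per client.
student_params = 261_573
bytes_per_param = 4
mb = 1024 ** 2

upload_mb = student_params * bytes_per_param / mb     # ~1.0 MB per client
download_mb = student_params * bytes_per_param / mb   # ~1.0 MB per client

for num_clients in (4, 15, 30):
    per_round = num_clients * (upload_mb + download_mb)
    print(f"{num_clients} clients: ~{per_round:.2f} MB per FL round")
```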
It is important to mention that some works in the literature also treat the communication cost as the number of FL rounds required to achieve convergence, as in References [48] and [49]. In this article, the number of FL rounds required for convergence with different numbers of students is shown in Table 3 for the IID case and in Table 4 for the non-IID case, which provides additional insight into the communication cost considerations in our study.

Memory/Cache Usage.

Assuming each parameter is stored as a 32-bit value (4 bytes), the memory usage is (number of parameters) \(\times\) 4 B. Note that each vehicle device hosts two models:
The head of the teacher model (544 parameters)
The student model (261,573 parameters).
The total memory required on each vehicle device is therefore given by:
\begin{equation*} 544 \times 4B + 261{,}573 \times 4B = 1{,}048{,}468 \text{ bytes,} \end{equation*}
which corresponds to approximately 1 MB. Note that this calculation assumes a batch size of 1, i.e., that one input is loaded at a time. In our experiments, however, we used a batch size of 64, so the memory estimate must be multiplied by the batch size. The total memory requirement for training each student model is therefore about 64 MB.
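The arithmetic can be reproduced as follows (the batch-size scaling follows the paper's own approximation):

```python
# Memory estimate for the models resident on each vehicle device.
teacher_head_params = 544
student_params = 261_573
bytes_per_param = 4            # 32-bit parameters
batch_size = 64

param_bytes = (teacher_head_params + student_params) * bytes_per_param
print(param_bytes)                           # 1,048,468 bytes, i.e., about 1 MB
print(param_bytes * batch_size / 1024 ** 2)  # about 64 MB, per the approximation above
```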

References

[1]
Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. 2018. Emotion recognition in speech using cross-modal transfer in the wild. In Proceedings of the 26th ACM International Conference on Multimedia. 292–301.
[2]
Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. 2017. Understanding of a convolutional neural network. In Proceedings of the International Conference on Engineering and Technology (ICET’17). IEEE, 1–6.
[3]
Mouhannad Ali, Fadi Al Machot, Ahmad Haj Mosa, and Kyandoghere Kyamakya. 2016. CNN based subject-independent driver emotion recognition system involving physiological signals for ADAS. In Advanced Microsystems for Automotive Applications 2016. Springer, 125–138.
[4]
Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. Adv. Neural Inf. Process. Syst. 29 (2016).
[5]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? Adv. Neural Inf. Process. Syst. 27 (2014).
[6]
Ioana Baldini, Paul Castro, Kerry Chang, Perry Cheng, Stephen Fink, Vatche Ishakian, Nick Mitchell, Vinod Muthusamy, Rodric Rabbah, Aleksander Slominski, and Philippe Suter. 2017. Serverless computing: Current trends and open problems. In Research Advances in Cloud Computing. Springer, 1–20.
[7]
Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG’18). IEEE, 59–66.
[8]
Saira Bano. 2021. PhD forum abstract: Efficient computing and communication paradigms for federated learning data streams. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP’21). 410–411. DOI:
[9]
Saira Bano, Nicola Tonellotto, Pietro Cassarà, and Alberto Gotta. 2023. Artificial intelligence of things at the edge: Scalable and efficient distributed learning for massive scenarios. Comput. Commun. 205 (2023), 45–57.
[10]
Saira Bano, Nicola Tonellotto, Pietro Cassarà, and Alberto Gotta. 2022. KafkaFed: Two-tier federated learning communication architecture for internet of vehicles. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops’22). 515–520. DOI:
[11]
Oresti Banos, Juan-Manuel Galvez, Miguel Damas, Hector Pomares, and Ignacio Rojas. 2014. Window size impact in human activity recognition. Sensors 14, 4 (2014), 6474–6499.
[12]
Luciano Baresi, Danilo Filgueira Mendonça, Martin Garriga, Sam Guinea, and Giovanni Quattrocchi. 2019. A unified model for the mobile-edge-cloud continuum. ACM Trans. Internet Technol. 19, 2 (2019), 1–21.
[13]
Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, Ulrich Neumann, and Shrikanth Narayanan. 2004. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th International Conference on Multimodal Interfaces. 205–211.
[14]
Masoumeh Chapariniya, Seyed Sajad Ashrafi, and Shahriar B. Shokouhi. 2020. Knowledge distillation framework for action recognition in still images. In Proceedings of the 10th International Conference on Computer and Knowledge Engineering (ICCKE’20). 274–277. DOI:
[15]
Zi Cheng, Lin Shu, Jinyan Xie, and C. L. Philip Chen. 2017. A novel ECG-based real-time detection method of negative emotions in wearable applications. In Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics (SPAC’17). 296–301. DOI:
[16]
Prateek Chhikara, Prabhjot Singh, Rajkumar Tekchandani, Neeraj Kumar, and Mohsen Guizani. 2020. Federated learning meets human emotions: A decentralized framework for human–computer interaction for IoT applications. IEEE Internet Things J. 8, 8 (2020), 6949–6962.
[17]
Hua Gao, Anil Yüce, and Jean-Philippe Thiran. 2014. Detecting emotional stress from facial expressions for driving safety. In Proceedings of the IEEE International Conference on Image Processing (ICIP’14). IEEE, 5961–5965.
[18]
Michael Gross. 2016. A planet with two billion cars. Current Biology 26.8 (2016), R307–R310.
[19]
Saurabh Gupta, Judy Hoffman, and Jitendra Malik. 2016. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2827–2836.
[20]
Jennifer A. Healey and Rosalind W. Picard. 2005. Detecting stress during real-world driving tasks using physiological sensors. IEEE Trans. Intell. Transport. Syst. 6, 2 (2005), 156–166.
[21]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2, 7 (2015).
[22]
Zheng Zack Hui, Dingjie Chen, and Zihang Xu. 2021. Federation learning optimization using distillation. In Proceedings of the Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS’21). IEEE, 25–28.
[23]
Arthur Jacot, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 31 (2018).
[24]
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning. PMLR, 5132–5143.
[25]
Kyung Hwan Kim, Seok Won Bang, and Sang Ryong Kim. 2004. Emotion recognition system using short-term monitoring of physiological signals. Med. Biolog. Eng. Comput. 42, 3 (2004), 419–427.
[26]
Akshi Kumar, Kapil Sharma, and Aditi Sharma. 2022. MEmoR: A multimodal emotion recognition using affective biomarkers for smart prediction of emotional health for people analytics in smart industries. Image Vis. Comput. 123 (2022), 104483.
[27]
Wenbo Li, Guanzhong Zeng, Juncheng Zhang, Yan Xu, Yang Xing, Rui Zhou, Gang Guo, Yu Shen, Dongpu Cao, and Fei-Yue Wang. 2021. CogEmoNet: A cognitive-feature-augmented driver emotion recognition model for smart cockpit. IEEE Trans. Computat. Soc. Syst. 9, 3 (2021), 667–678.
[28]
Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2020. On the convergence of FedAvg on non-IID data. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=HJxNAnVtDS
[29]
Jia Zheng Lim, James Mountstephens, and Jason Teo. 2020. Emotion recognition using eye-tracking: Taxonomy, review and current challenges. Sensors 20, 8 (2020), 2384.
[30]
Yang Liu, Keze Wang, Guanbin Li, and Liang Lin. 2021. Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition. IEEE Trans. Image Process. 30 (2021), 5573–5588.
[31]
Liang Lu, Michelle Guo, and Steve Renals. 2017. Knowledge distillation for small-footprint highway networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). IEEE, 4820–4824.
[32]
Choubeila Maaoui and Alain Pruski. 2010. Emotion recognition through physiological signals for human-machine communication. Cutting Edge Robot. 2010, 317-332 (2010), 11.
[33]
Yoshitomo Matsubara, Davide Callegaro, Sabur Baidya, Marco Levorato, and Sameer Singh. 2020. Head network distillation: Splitting distilled deep neural networks for resource-constrained edge computing systems. IEEE Access 8 (2020), 212177–212193.
[34]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[35]
Arijit Nandi and Fatos Xhafa. 2022. A federated learning method for real-time emotion state classification from multi-modal streaming. Methods 204 (2022), 340–347.
[36]
Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. 2022. Cross-modal knowledge distillation for vision-to-sensor action recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22). IEEE, 4448–4452.
[37]
Mary Phuong and Christoph Lampert. 2019. Towards understanding knowledge distillation. In Proceedings of the International Conference on Machine Learning. PMLR, 5142–5151.
[38]
Srinivasan Ramakrishnan and Ibrahiem M. M. El Emary. 2013. Speech emotion recognition approaches in human computer interaction. Telecommun. Syst. 52, 3 (2013), 1467–1478.
[39]
H. Tadashi, K. Koichi, N. Kenta, and H. Yuki. 2019. Driver status monitoring system in autonomous driving era. OMRON TECH. 50 (2019).
[40]
Fida Mohammad Thoker and Juergen Gall. 2019. Cross-modal knowledge distillation for action recognition. In Proceedings of the IEEE International Conference on Image Processing (ICIP’19). IEEE, 6–10.
[41]
Martin A. Tischler, Christian Peter, Matthias Wimmer, and Jörg Voskamp. 2007. Application of emotion recognition methods in automotive research. In Proceedings of the 2nd Workshop on Emotion and Computing–Current Research and Future Impact, Vol. 1. 55–60.
[42]
Nicola Tonellotto, Alberto Gotta, Franco Maria Nardini, Daniele Gadler, and Fabrizio Silvestri. 2021. Neural network quantization in federated learning at the edge. Inf. Sci. 575 (2021), 417–436.
[43]
Vasileios Tsouvalas, Tanir Ozcelebi, and Nirvana Meratnia. 2022. Privacy-preserving speech emotion recognition through semi-supervised federated learning. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops’22). IEEE, 359–364.
[44]
Bindu Verma and Ayesha Choudhary. 2018. Deep learning based real-time driver emotion monitoring. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety (ICVES’18). 1–6. DOI:
[45]
Zhongmin Wang, Xiaoxiao Zhou, Wenlang Wang, and Chen Liang. 2020. Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video. Int. J. Mach. Learn. Cybern. 11, 4 (2020), 923–934.
[46]
Sheila R. Woody and Bethany A. Teachman. 2000. Intersection of disgust and fear: Normative and pathological views. Clinic. Psychol.: Sci. Pract. 7, 3 (2000), 291.
[47]
Huizi Xiao, Jun Zhao, Qingqi Pei, Jie Feng, Lei Liu, and Weisong Shi. 2022. Vehicle selection and resource optimization for federated learning in vehicular edge computing. IEEE Trans. Intell. Transport. Syst. 23, 8 (2022), 11073–11087. DOI:
[48]
Guang Yang, Ke Mu, Chunhe Song, Zhijia Yang, and Tierui Gong. 2021. RingFed: Reducing communication costs in federated learning on non-IID data. arXiv preprint arXiv:2107.08873 (2021).
[49]
Xin Yao, Chaofeng Huang, and Lifeng Sun. 2018. Two-stream federated learning: Reduce the communication costs. In Proceedings of the IEEE Visual Communications and Image Processing (VCIP’18). 1–4. DOI:
[50]
Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung. 2018. Multimodal speech emotion recognition using audio and text. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’18). IEEE, 112–118.
[51]
Jian-Ming Zhang, Xu Yan, Zi-Yi Li, Li-Ming Zhao, Yu-Zhong Liu, Hua-Liang Li, and Bao-Liang Lu. 2021. A cross-subject and cross-modal model for multimodal emotion recognition. In Proceedings of the International Conference on Neural Information Processing. Springer, 203–211.
[52]
Lin Zhang, Steffen Walter, Xueyao Ma, Philipp Werner, Ayoub Al-Hamadi, Harald C. Traue, and Sascha Gruss. 2016. “BioVid Emo DB”: A multimodal database for emotion analyses validated by subjective ratings. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI’16). IEEE, 1–6.
[53]
Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. 2018. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7356–7365.
