research-article
Open access

Labeling Chaos to Learning Harmony: Federated Learning with Noisy Labels

Published: 22 February 2024

Abstract

Federated Learning (FL) is a distributed machine learning paradigm that enables learning models from decentralized private datasets where the labeling effort is entrusted to the clients. While most existing FL approaches assume high-quality labels are readily available on users’ devices, in reality, label noise can naturally occur in FL and is closely related to clients’ characteristics. Due to scarcity of available data and significant label noise variations among clients in FL, existing state-of-the-art centralized approaches exhibit unsatisfactory performance, whereas prior FL studies rely on excessive on-device computational schemes or additional clean data available on the server. We propose FedLN, a framework to deal with label noise across different FL training stages, namely FL initialization, on-device model training, and server model aggregation, able to accommodate the diverse computational capabilities of devices in an FL system. Specifically, FedLN computes per-client noise level estimation in a single federated round and improves the models’ performance by either correcting or mitigating the effect of noisy samples. Our evaluation on various publicly available vision and audio datasets demonstrates a 22% improvement on average compared to other existing methods for a label noise level of 60%. We further validate the efficiency of FedLN in human-annotated real-world noisy datasets and report a 4.8% increase on average in models’ recognition performance, highlighting that FedLN can be useful for improving FL services provided to everyday users.

1 Introduction

Recent advances in smartphones, wearables, and the Internet of Things devices have led to the continuous generation of massive amounts of data from embedded sensors and users’ interactions with various applications. The ubiquity of these devices and the exponential growth of the data they produce present a significant opportunity to tackle critical problems in domains such as healthcare, well-being, and manufacturing. Traditionally, Machine Learning (ML) approaches require the distributed data to be stored or aggregated in a centralized cloud-based server before being further processed to solve a specific problem. However, the rapidly increasing volume of generated data, combined with high communication costs and bandwidth limitations, makes centralized data aggregation infeasible [22]. Furthermore, such centralized schemes may also be restricted by privacy issues and regulations, such as the General Data Protection Regulation (GDPR).1
To this end, the field of Federated Learning (FL) [18] aims to enable distributed training of ML models on decentralized data residing on personal devices like smartphones and wearables. The key idea behind FL is to bring the computation closer to where the data resides to extensively harness data locality. In an FL regime, updates to the deep neural network models (e.g., their learnable parameters) are performed entirely on-device and communicated to the central server, which aggregates these updates from all participating devices to produce a unified global model. Unlike the standard centralized way of learning models, the salient differentiating factor of FL is that the data never leaves the user’s device, making it an appealing property for privacy-sensitive data. Recently, FL has been successfully applied to a wide range of tasks [20, 38, 48]. Nevertheless, a common limitation of existing supervised FL approaches is the implicit assumption that on-device data are perfectly annotated [41].
In reality, the quality of labeled data can vary depending on the data collection and annotation process. Under centralized regimes, having access to a larger and diverse pool of labeled data allows for better label validation and higher-quality annotations than in federated settings [6, 8, 44]. Moreover, access to centrally aggregated data enables various label correction processes to be exploited to further improve the quality of labeled data, such as crowdsourcing, outsourcing, and expert annotation. In contrast, the decentralized nature of FL hinders the collection of high-quality labeled data, leaving no way to verify the quality of labels. Here, the data annotation is typically performed through user interaction or automatically via programmatic labeling functions, such as those used for keyboard query suggestions [48]. However, such techniques often result in noisy labels being assigned to the data samples, either due to users’ lack of expertise or due to the inherently noisy labels produced by automatic labeling systems, such as “weak” labeling [30]. Therefore, in FL, the presence of mislabeled data samples, referred to as label noise or noisy labels, can naturally occur, but there is no straightforward way to perform label correction.
The problem of training models under label noise has received noticeable attention with various proposed algorithms over the years [1, 26, 27], and has emerged as a major practical challenge in the context of federated learning in recent times [6, 8, 44, 46, 47, 49, 51]. The non-i.i.d. nature of data in FL, which is characterized by variations in the data distribution across clients, can affect both the presence and distribution of label noise. Specifically, label noise in FL is closely related to the characteristics of the clients’ devices and the expertise of their users. These unique label noise characteristics in FL make it challenging to successfully apply centralized learning schemes that filter noisy samples or mitigate their effect through regularization techniques [6, 8, 44]. To deal with label noise in FL, recent approaches often rely on repeated server-side aid, either in the form of additional clean data [6, 9, 46] or communication of client-sensitive data [8, 47], whereas computationally expensive approaches have been proposed to perform label correction in FL [44].
To the best of our knowledge, our work represents the first attempt to reduce the effect of label noise in the federated setting for various classification tasks in multiple stages of FL (i.e., initialization, local training step, server-side model aggregation) without relying on any additional server-side clean data, and with varying degrees of compute requirements; thus, providing suitable solutions across a wide range of devices (low- to high-end computation devices). We present three distinct FL schemes, namely Nearest Neighbor based Correction (NNC), Adaptive Knowledge Distillation (AKD), and Noise-Aware Federated Averaging (NA-FedAvg), each tackling the problem of label noise in the federated setting in two stages: first, computing a per-client noise level estimation, and second, exploiting this knowledge to efficiently train deep neural networks to improve the performance of a given task, while correcting (or limiting the effect of) noisy labeled samples. Concisely, the main contributions of our work are as follows:
We propose a framework called FedLN2 (Federated Learning with Label Noise) to address the generalizability issues introduced when training federated models using noisy labeled data.
We design simple yet effective approaches to accurately estimate a per-client label noise level and identify clients with clean or relatively high-quality labels.
We devise various mechanisms to alleviate or correct noisy labeled instances on a per-client basis, thus mitigating the need for user interaction for high-quality label acquisition.
We demonstrate that our framework is highly useful for learning generalizable models under a variety of federated and label noise settings on diverse public datasets from both vision and audio domains, namely CIFAR-10 [19], Fashion-MNIST [43], PathMNIST [45], EuroSAT [13], and Speech Commands [40].
We show that FedLN can improve recognition rate by 22% on average across all datasets compared to the fully supervised federated model, when 60% of labeled data contain noisy labels. Further evaluation of FedLN on real-world human annotated noisy datasets, namely CIFAR-10N/100N [41], exhibits an increase in recognition rate by 9% compared to the naive FL strategy.

2 Background

In this section, we provide a brief overview of the statistical properties of label noise, the procedure of training a neural network with label noise, and FL to provide a foundation for our approach for training deep learning models with noisy labels in the federated setting.

2.1 Label Noise

Label noise refers to the misalignment between a ground truth label \(y^{*}\) and an observed label \(y\) in a given dataset. Specifically, in a \(\mathcal {C}\)-way classification problem, where \(\mathcal {C}\) is the number of label categories, label noise can be considered as a class-conditional label flipping process \(f\left(\cdot \right)\), which projects \(y^{*} \rightarrow y\), in a way that every label in class \(j\in \mathcal {C}\) may be independently mislabeled as class \(i\in \mathcal {C}\) with probability \(p\left(y{=}i{\mid }y^{*}{=}j\right)\), written in a shorthand notation as \(p\left(y{\mid }y^{*}\right)\). Hence, in our work, we assume that the occurrences of label noise are data independent (i.e., \(p\left(y{\mid }y^{*},x\right) = p\left(y{\mid }y^{*}\right)\)), similar to the work of Goldberger and Ben-Reuven [10]. From the definition of label noise function \(f\left(y^{*},\mathcal {C}\right)\), a \(\mathcal {C}\times \mathcal {C}\) noise distribution matrix denoted by \(\mathcal {Q}_{y{\mid }y^{*}}\) can be defined, where each column corresponds to the probability distribution for an input instance with ground truth label \(y^{*}{=}i\) to be assigned to label \(j\).
Given the preceding definitions, we can characterize label noise through two statistical parameters: noise level (denoted by \(n_{l}\)), and noise sparsity (denoted by \(n_{s}\)). Inspired by Northcutt et al. [27], we provide a formal definition for each of these parameters as follows.
Definition 2.1 (Noise Level).
Noise level (\(n_l\)) quantifies the amount of label noise present in a given dataset. It is defined as the reverse probability of the sum along the diagonal of \(\mathcal {Q}_{y{\mid }y^{*}}\), denoted as \(n_{l} = 1 - \operatorname{diag}(\mathcal {Q}_{y{\mid }y^{*}})\). Intuitively, a noise level of zero corresponds to a “clean” dataset, where all observed labels match their ground truth labels, whereas a noise level of 1 can be considered as a completely “noisy” dataset.
Definition 2.2 (Noise Sparsity).
Noise sparsity (\(n_s\)) quantifies the shape of the label noise present in a given dataset. It is defined by the probability concentration of noise in each column of \(\mathcal {Q}_{y{\mid }y^{*}}\) when off-diagonal values are discarded. Thus, a high noise sparsity value indicates a non-uniformity of label noise, common in most real-world datasets. For example, a high-sparsity noise can indicate a confusion between classes that are perceived to be related by humans (e.g., mislabeling a cat as a tiger or a lion rather than as a bird or a dog). Alternatively, a zero level of noise sparsity corresponds to completely random noise, where all instances belonging to one class can be confused with any other class. The special case of “class flipping” can be constructed for \(n_{s}=1\), where instances that belong to a pair of two classes are confused. In this case, we consider the noise probabilities between this pair of classes to be equal (i.e., \(p\left(y{=}i{\mid }y^{*}{=}j\right) = p\left(y{=}j{\mid }y^{*}{=}i\right)\)). Whereas in reality non-diagonal entries in \(\mathcal {Q}_{y{\mid }y^{*}}\) are non-zero (label noise is essentially unavoidable among classes), we allow zero values to be present in \(\mathcal {Q}_{y{\mid }y^{*}}\) and consider \(n_s\) as the fraction of positive non-diagonal entries per column of \(\mathcal {Q}_{y{\mid }y^{*}}\), similar to the work of Northcutt et al. [27].
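To make the class-conditional flipping process concrete, the following is a minimal sketch (not the authors' implementation) that samples observed labels from ground truth labels using a noise distribution matrix and then measures the resulting fraction of mislabeled samples. The function names `flip_labels` and `empirical_noise_level`, the 3-class matrix, and the sample count are illustrative assumptions; the only convention taken from the text is that column \(j\) of \(\mathcal {Q}_{y{\mid }y^{*}}\) holds \(p\left(y{\mid }y^{*}{=}j\right)\).

```python
import numpy as np


def flip_labels(y_true, Q, rng):
    """Sample observed labels y from ground-truth labels y* using Q.

    Assumed convention: Q[i, j] = p(y = i | y* = j), so each column sums to 1.
    """
    num_classes = Q.shape[0]
    return np.array([rng.choice(num_classes, p=Q[:, j]) for j in y_true])


def empirical_noise_level(y_true, y_obs):
    """Fraction of samples whose observed label differs from the ground truth."""
    return float(np.mean(y_true != y_obs))


# Hypothetical 3-class matrix: classes 0 and 1 are confused with each other
# 40% of the time, while class 2 is always labeled correctly.
Q = np.array([
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=5000)
y_obs = flip_labels(y_true, Q, rng)
noise_level = empirical_noise_level(y_true, y_obs)
```

With roughly balanced classes, the measured noise level in this toy setup should land near \(\frac{2}{3} \cdot 0.4 \approx 0.27\), since only two of the three classes are ever flipped.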

2.2 Federated Learning

FL is a collaborative learning paradigm that aims to learn a single, global model from data stored on remote clients with no need to share their data with a central server. Specifically, with the data residing on clients’ devices, a subset of clients is selected in each communication round to perform a number of local SGD steps on their data in parallel. Upon completion, clients send their model weight updates to the server, which aggregates these updates to learn a unified global model. Formally, the goal of FL is typically to minimize the following objective function:
\begin{equation} \min _{\theta } \mathcal {L}_{\theta } = \sum _{m=1}^{M} \gamma _{m} {\mathcal {L}}_m(\theta), \end{equation}
(1)
where \(\mathcal {L}_m\) is the loss function of the \(m^{th}\) client and \(\gamma _{m}\) corresponds to the relative impact of the \(m^{th}\) client on the construction of the global model. For the FedAvg [25] algorithm, \(\gamma _{m}\) is equal to the ratio of client’s local data \(N_m\) over all training samples (i.e., \(\gamma _{m} = \frac{N_m}{N}\)).
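As an illustration of how the FedAvg weighting \(\gamma _{m} = \frac{N_m}{N}\) is typically applied on the server, the sketch below averages per-layer client parameters in proportion to local dataset size. The function name `fedavg_aggregate`, the list-of-arrays model representation, and the toy values are assumptions made for this example only.

```python
import numpy as np


def fedavg_aggregate(client_weights, client_sizes):
    """Average client parameters layer by layer with gamma_m = N_m / N.

    client_weights: list (one entry per client) of lists of np.ndarray layers.
    client_sizes:   number of local training samples N_m for each client.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    gammas = sizes / sizes.sum()
    num_layers = len(client_weights[0])
    return [
        sum(g * w[layer] for g, w in zip(gammas, client_weights))
        for layer in range(num_layers)
    ]


# Two hypothetical clients with a two-layer "model" each.
clients = [
    [np.array([1.0, 1.0]), np.array([0.0])],
    [np.array([3.0, 5.0]), np.array([4.0])],
]
global_weights = fedavg_aggregate(clients, client_sizes=[100, 300])
```

Here the second client holds three times as much data, so it contributes with weight 0.75 and the first with weight 0.25.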

Federated Label Noise.

Considering the traditional centralized learning, noise distribution can be characterized by a single noise distribution matrix \(\mathcal {Q}_{y{\mid }y^{*}}\). However, in FL, where data is fragmented across multiple clients, distinct noise distributions among clients have to be considered, as label noise is closely related to the clients’ characteristics (i.e., a user’s expertise or preferences). Subsequently, in FL, noise distribution matrices among clients can differ significantly (i.e., \(\mathcal {Q}_{y{\mid }y^{*}}^{i} \ne \mathcal {Q}_{y{\mid }y^{*}}^{j}\) with \(i,j\) indicating any pair of clients). These naturally occurring differences in noise distributions among clients (i.e., clients’ noise profiles) can introduce additional challenges for the FL process, especially during models’ aggregation step.

2.3 Learning from Noisy Labels

The goal of a \(\mathcal {C}\)-way supervised learning task is to learn a function that maps an input instance \(x\) to a corresponding ground truth label \(y_{i}^{*} \in \left\lbrace 1, \ldots , \mathcal {C} \right\rbrace\). Let \(p_\theta \left(y \mid x \right)\) be a neural network that is parameterized by weights \(\theta\) that predicts softmax outputs \(y\) for a given input \(x\). In a typical classification problem, the model is provided with a training dataset \(\mathcal {D} = \lbrace (x_{i},y_{i}^{*}) \rbrace _{i=1}^{N}\) and aims to minimize the following objective function by learning the model’s parameters \(\theta\):
\begin{equation} \mathcal {L}_{\theta }\left(\mathcal {D}\right) = \mathit {l}\left(y^{*},p_\theta \left(y{\mid }x\right)\right), \end{equation}
(2)
where \(\mathcal {L}_{\theta }\left(\mathcal {D}\right)\) is the minimization function for supervised learning on \(\mathcal {D}\), \(\mathit {l}\left(\cdot \right)\) denotes a certain loss, and \(p_\theta\) is the neural network that is parameterized by weights \(\theta\).
In real-world scenarios, in which data labels are often noisy, the neural network \(p_\theta \left(y{\mid }x\right)\) is trained on noisy labels \(y\) instead of the actual ground truth labels \(y^{*}\). Similar to Equation (2), the objective is to minimize \(\mathcal {L}_{\theta }\) on \(\mathcal {D}_{n} = \lbrace \left(x_{i},y_{i}\right) \rbrace _{i=1}^{N_n}\) by learning the model’s parameters \(\theta\), as follows:
\begin{equation} \mathcal {L}_{\theta }\left(\mathcal {D}_{n}\right) = \mathit {l}\left(y,p_\theta \left(y{\mid }x\right)\right). \end{equation}
(3)
Here, noisy labeled instances interfere with the loss minimization process, since the computed loss is over the noisy dataset \(\mathcal {D}_{n}\). Specifically, the neural network \(p_{\theta }\) can easily memorize noisy labels and consequently degenerate the network’s generalization on unseen data [34].

3 Related Work

Noisy Label Learning.

Supervised deep learning approaches primarily use data-label pairs to train models, prompting extensive research on neural network robustness against noisy labels. Here, researchers have focused on tackling label noise through loss correction techniques, which involve adjusting the loss value per sample to mitigate the impact of noisy samples. Early techniques performed loss correction by estimating the noise distribution matrix, either via pre-trained models [28] or by utilizing a clean validation set [14]. In addition to estimating the noise distribution matrix, Wang et al. [39] proposed the use of symmetric cross entropy to help deal with noisy labels. Furthermore, Amid et al. [1] utilized a generalized cross-entropy loss with heavy-tailed softmax probabilities via two tunable parameters to limit the loss value per sample, thus minimizing the effect of noisy labels. Regularization techniques, such as Label Smoothing (LS) [24, 26], and data augmentation techniques, like MixUp [50], show promise in handling label noise. Co-learning [36] involves performing supervised and self-supervised learning in a cooperative way to prevent noisy label memorization. Alternatively, direct estimation and removal of noisy labeled instances prior to training have been explored [2, 27]. Arazo et al. [2] utilized Gaussian Mixture Models (GMMs) based on gradient values, whereas Confidence Learning (CL) [27] is based on probabilistic thresholds from pre-trained neural network predictions. Based on the “memorization effects” of deep networks, where models fit data with clean labels prior to noisy instances [3], researchers have proposed early-stopping mechanisms to mitigate the negative effects of label noise on model generalization [4, 21, 28, 42]. Nevertheless, in the federated setting, noise patterns vary across clients and on-device data can be scarce, introducing further complexities.
This necessitates modeling noise profiles, assessing the model’s confidence in its loss values, and determining appropriate thresholds to detect noisy instances [8, 44, 46]. Our work performs a thorough evaluation of centralized approaches that deal with label noise when applied in the federated setting. By examining these techniques, we aim to enable future research advancements in the area of FL under the presence of label noise.

Label Noise in FL.

Recent research has addressed the issue of label noise in the FL setting [6, 8, 44, 46, 47, 49, 51]. Some approaches (e.g., [8, 47]) focus on filtering noisy samples. For example, Yang et al. [47] utilize communication of class-wise data centroids among clients to construct decision boundaries across classes, whereas Duan et al. [8] rely on the communication of data features to the server for identifying noise instances. In addition to data filtering, label correction techniques have been investigated in FL [44, 49, 51]. CLC [49] applies consensus-based label correction technology, enabling clients to cooperate in correcting labels through a consensus mechanism. FedCorr [44] employs a multi-stage scheme, where clean samples are first detected using GMM based on loss scores and then used to train a model that provides pseudo-labels for the noisy instances. Furthermore, Zhang et al. [51] proposed the use of meta-learning to jointly learn the underlying recognition task and the noise distribution matrix, mapping noisy labeled instances to their correct counterparts during training. Recently, some works [6, 9, 46] explored the utilization of a small clean dataset to quantify the credibility of on-device data and adjust the weighted aggregation process accordingly. However, a common limitation of these approaches is the significant computational overhead they often require for performing label correction. Alternatively, some approaches (e.g., [8, 47]) necessitate the communication of user-related information and assume that clients share the same noise ratio for effective data filtering. Moreover, for efficient label noise handling with a low computational footprint on clients, some approaches (e.g., [6, 8, 9, 46]) often assume the availability of additional clean data. 
However, for many practical FL scenarios, depending on additional clean data (e.g., [6, 9, 46]), implementing complex learning schemes (e.g., [44, 49, 51]), or exchanging client-sensitive data (e.g., [8, 47]) can present significant challenges in the deployment of such approaches. To address the aforementioned problems, we propose FedLN, a framework that provides simple yet effective approaches to accurately estimate label noise on a per-client basis and offer robust learning schemes to learn better generalizable federated models under the presence of label noise without relying on additional clean data, complex learning schemes, or communication of client-sensitive data.

4 Methodology

In this section, we present our framework, FedLN, for training FL models under the presence of label noise. First, we provide a formal definition of the underlying problem. Then, we discuss our proposed techniques for determining a per-client noise level estimation in detail. Finally, we provide a thorough description of our approaches for handling and mitigating the impact of noisy labels during training of deep models in the federated setting.

4.1 Problem Formulation

We focus on the problem of FL with noisy labels, where clients’ data samples are often mislabeled due to annotators’ lack of expertise, users’ mistakes, or errors in the automated procedure for label inference. Although noise can also be present in the input space of the data, in this work we focus solely on noise in the label space (i.e., the categories in classification problems). In particular, as label noise, we consider the misalignment between a ground truth label \(y^{*}\) and an observed label \(y\) in a given dataset, which is characterized by noise level \(n_{l}\) (reverse probability of the sum along the diagonal of \(\mathcal {Q}_{y{\mid }y^{*}}\)) and noise sparsity \(n_{s}\) (fraction of non-zero entries per column of \(\mathcal {Q}_{y{\mid }y^{*}}\)). Because label noise can be closely related to the clients’ characteristics in FL, distinct noise distribution matrices (i.e., clients’ noise profiles) can exist among clients. These varying label noise profiles introduce risks of overfitting to noisy data, negatively impact the server-side model aggregation, and make it necessary to detect label noise on a per-client level. Additionally, clients holding data with high-quality labels (i.e., \(n_{l} \leqslant \varepsilon\) with \(\varepsilon \rightarrow 0\)) may be present in the FL process. With FedLN, we aim to eliminate the effect of on-device mislabeled examples in the training process and to improve the performance of FL models, alleviating the common assumption that clients hold well-annotated data.
Formally, in FL, we have a set of \(M\) clients, each holding a training set \(\mathcal {D}^{m}\). Subsequently, each client’s dataset, \(\mathcal {D}^{m}\), can be divided into a correctly labeled set (clean data) \(\mathcal {D}_{c}^{m} = \lbrace (x_{i},y_{i}^{*}) \rbrace _{i=1}^{N_{c}^{m}}\) and a noisy labeled set (noisy data) \(\mathcal {D}_{n}^{m} = \lbrace \left(x_{i}, y_{i} \right)\rbrace _{i=1}^{N_{n}^{m}}\), where \(N^{m} = N_{c}^{m} + N_{n}^{m}\) is the total number of data samples stored on the \(\mathit {m^{th}}\) client. The label noise level present in the \(\mathit {m^{th}}\) client’s data is defined by \(n_{l}^{m} = \frac{N_{n}^{m}}{N^{m}}\) and \(N = \sum _{m=1}^{M} N^{m}\) is the total number of samples present during training. We aim to learn a global unified model \(G\) without clients sharing any of their local data (\(\mathcal {D}^m\)) while minimizing the effect of noisy label set \(\mathcal {D}_{n}^{m}\) on the training process. Specifically, the objective function we aim to minimize is the following:
\begin{equation} \min _{\theta } {\mathcal {L}}_{\theta } = \sum _{m=1}^{M} \gamma _{m} {\mathcal {L}}_m\left(\theta \right) \textrm {, where } \mathcal {L}_{m}\left(\theta \right) = \mathcal {L}_{\theta }\left(I\left(\mathcal {D}_{c}^{m}\right) + \Phi \left(\mathcal {D}_{n}^{m}\right) \right), \end{equation}
(4)
where \(\mathcal {L}_{m}\left(\theta \right)\) is the supervised loss term of the \(m^{th}\) client given model weights \(\theta\), \(\Phi \left(\cdot \right)\) is a correction mechanism aiming at reducing the impact of noisy samples of the \(m^{th}\) client on the training procedure by either masking or correcting the label \(y\) of \(\mathcal {D}_{n}^{m}\), and \(I\left(\cdot \right)\) is the identity function. With \(\gamma _{m}\), we denote the relative impact of the \(m^{th}\) client on the generation of the global model \(G\). For the FedAvg [25] algorithm, parameter \(\gamma _{m}\) is equal to the ratio of client’s local data \(N_m\) over all training samples (i.e., \(\gamma _{m} = \frac{N_m}{N}\)).

4.2 Label Noise Estimation in the Federated Setting

In the federated setting, label noise is influenced by discrepancies in clients’ labeling systems or the expertise of their users. This leads to varying label noise profiles on a per-client basis, where a few “clean” clients may exist, holding high-quality labels. To tackle the issue of label noise in FL without introducing unnecessary complexity to clients’ computational tasks, we propose simple yet effective approaches for estimating the per-client label noise level. These approaches are designed to accommodate the diverse computational capabilities of devices in an FL system. Specifically, we propose two methods for determining per-client noise level estimation: (i) an embeddings-based discovery, which computes noise from “noise-tolerant” embeddings, and (ii) a model’s confidence-based approach, where noise is estimated using a scoring function based on the models’ outputs or logits. By establishing a per-client noise level, we can efficiently train deep neural networks to improve the performance on a given task, limiting the effect of noisy samples on a model’s generalizability.

Embedding-Based Discovery of Noisy Labels.

In supervised learning, corrupted labels can significantly impact the generalization of deep models, leading to poor performance on unseen data [34]. To address this issue, we propose leveraging embeddings from self-supervised pre-trained models, which are trained to learn useful data representations for a variety of tasks [29, 33], to detect noisy labeled instances in each client’s data. By generating embeddings on a per-client basis without relying on any labels, our approach ensures that the extracted embeddings remain robust to the presence of label noise [52].
Formally, we utilize a self-supervised pre-trained model as a feature extractor \(g\left(\cdot \right)\) to produce embeddings \(e_{i}\) for every input instance \(x_{i} \in \mathcal {D}^{m}\). With the generation of embeddings, we can then utilize a k-Nearest Neighbor (kNN) approach to identify noisy samples, which correspond to outliers in the embedding space (i.e., data points belonging to the same neighborhood with different labels). Specifically, for the neighborhood of \(k\) points surrounding \(e_{i}\), we assign a new label using a majority voting mechanism, with random tie-breaking, as
\begin{equation} \begin{aligned}y_{i}^{vote} =\underset{j\in \left[k\right]}{\text{arg max}}~\widehat{y_{i}}\left[j\right] = \text{arg max}\sum _{j=1}^{k} \left\lbrace y_{j} \in \mathcal {D}_{k}: \left| {\it sorted} \left\lbrace \left\Vert g(x_{i})- g(x_{l})\right\Vert , \forall x_{l} \in \mathcal {D} \right\rbrace \right| \lt k \right\rbrace , \end{aligned} \end{equation}
(5)
where \(y_{j}\) is the predicted kNN label for embedding vector \(e_{j}\), extracted using the feature extractor \(g\left(\cdot \right)\) from an input instance \(x_{j}\). One should note that local neighborhood surrounding embedding \(e_{i}\), denoted as \(\mathcal {D}_{k}\) in Equation (5), is formulated by computing the Euclidean distance of \(e_{i}\) with all other embedding vectors in \(\mathcal {D}\).
To derive a per-client noise level estimation, we divide the number of instances where a mismatch between \(y_{i}^{vote}\) and \(y_{i}\) occurs by the total number of available samples in each client [52] (i.e., \(n_{l}^{m} = (\sum _{i=1}^{N_{m}} \mathbb {1}\left[y_{i}^{vote} \ne y_{i}\right])/N_{m}\)). It is important to note that the aforementioned procedure is performed locally on each client during the initialization phase of FL (i.e., beginning of the training). Thus, noise level estimation via embeddings does not introduce additional computation or communication costs to the FL training process.
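The embedding-based estimate can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: embeddings are taken as already computed by some feature extractor, each point is excluded from its own neighborhood, and ties are resolved toward the lowest class index instead of randomly so that the example is deterministic. The names `knn_vote_labels` and `estimate_noise_level` and the synthetic two-cluster data are hypothetical.

```python
import numpy as np


def knn_vote_labels(embeddings, labels, k, num_classes):
    """Majority-vote label over the k nearest neighbors (Euclidean distance)."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # assumption: a point does not vote for itself
    neighbors = np.argsort(dists, axis=1)[:, :k]
    votes = np.zeros((len(labels), num_classes), dtype=int)
    for c in range(num_classes):
        votes[:, c] = np.sum(labels[neighbors] == c, axis=1)
    # argmax breaks ties toward the lowest class index (the text uses random ties).
    return votes.argmax(axis=1)


def estimate_noise_level(embeddings, labels, k, num_classes):
    """Fraction of samples whose observed label disagrees with the kNN vote."""
    voted = knn_vote_labels(embeddings, labels, k, num_classes)
    return float(np.mean(voted != labels)), voted


# Synthetic client data: two well-separated clusters, four flipped labels.
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 8)),
    rng.normal(10.0, 0.1, size=(20, 8)),
])
clean = np.array([0] * 20 + [1] * 20)
noisy = clean.copy()
noisy[[0, 1, 20, 21]] = 1 - noisy[[0, 1, 20, 21]]

noise_level, voted = estimate_noise_level(emb, noisy, k=5, num_classes=2)
```

In this toy setup the vote recovers the clean labels and the estimated noise level equals the true fraction of flipped labels (4 of 40). The pairwise distance matrix is quadratic in the number of local samples, which is acceptable for small on-device datasets but would call for an approximate neighbor search on larger ones.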

Model Confidence as a Proxy for Label Noise.

To provide an alternative to utilizing an external module (i.e., a pre-trained model) for the detection of noisy labeled instances on a per-client basis, we propose the use of a scoring-based method directly applicable to the outputs (or logits) of the neural network. The intuition behind our approach is to utilize a scoring function to rank input instances such that a low score indicates a high probability of having a noisy label. Therefore, the critical components required to facilitate this approach include a scoring function, which ranks samples based on the model’s prediction confidence, and a threshold to distinguish between low-score (noisy labeled) and high-score (correctly labeled) data.
To this end, we utilize an energy score [23] as our scoring function to compute a per-client noise level estimation in a given federated round, denoted as \(R_{w}\). Although the energy score is typically leveraged for detecting out-of-domain samples, in this work we propose to use it as a measurement of label uncertainty on clients’ data. In contrast to the softmax score, the energy score has proven to be less susceptible to modern neural networks’ overconfidence issues [11, 23]. Accordingly, we compute the energy score \(s\) for an input instance \(x\) as the \(\mathit {logsumexp}\) (\(\mathcal {E}(\cdot)\)) operator over the logits \(z\) (i.e., \(s = \log (\sum _{i=1}^{\mathcal {C}} e^{z_{i}})\)). Specifically, in federated round \(R_{w}\), using the newly received aggregated model weights \(\theta _{G}^{R_{w}}\), we apply the \(\mathcal {E}(\cdot)\) operator over each client’s dataset to acquire a scoring set \(\mathcal {S}_{\theta _{G}^{R_{w}}}\), which contains a score for each locally stored sample. In a similar fashion, we utilize the locally trained model weights (\(\theta _{m}^{R_{w}}\), local model weights after completion of the local training step in \(R_{w}\)) to compute a second scoring set, \(\mathcal {S}_{\theta _{m}^{R_{w}}}\).
The threshold value is computed from the \(\nu ^{th}\) percentile over \(\mathcal {S}_{\theta _{G}^{R_{w}}}\), although other statistical measures, such as median or mean, can also be used. Specifically, we compute the threshold \(\tau _{\nu }\) as
\begin{equation} \tau _{\nu } = \mathcal {P}_{\nu }\left(\mathcal {S}_{\theta _{G}^{R_{w}}}\right) = \underset{\left[\mathcal {S}_{\theta _{G}^{R_{w}}}\right]}{\text{arg max}}~ \left(\frac{\nu }{100} \cdot \left|\mathcal {S}_{\theta _{G}^{R_{w}}} \right| \right), \end{equation}
(6)
where \(\left[\mathcal {S}_{\theta _{G}^{R_{w}}}\right]\) and \(\left| \mathcal {S}_{\theta _{G}^{R_{w}}} \right|\) are the ordered set and cardinality of \(\mathcal {S}_{\theta _{G}^{R_{w}}}\), respectively. Motivated by the “memorization effect” in deep networks, where clean data is memorized faster in early training stages [4, 21, 28, 42, 44], an appropriate federated round \(R_{w}\) can result in diverse scores between noisy and clean data. Along the same lines, we argue that clients holding high-quality labels have a more influential role in the early stages of model aggregation on the server side, aiding in the differentiation between noisy and clean instances on clients’ data. Furthermore, by computing a threshold based on the same received global model weights \(\theta _G^{R_{w}}\), we provide a “common” ground for classification, ensuring a clear separation between noisy and clean instances. With both \(\mathcal {S}_{\theta _{m}^{R_{w}}}\) and \(\tau _{\nu }\) computed, we can estimate a per-client noise level as the fraction of the \(m^{th}\) client’s local instances whose scores fall below the obtained \(\tau _{\nu }\), that is, \(n_{l}^{m} = (\sum _{i=1}^{N_m} u_{\tau _{\nu }}(s_{i}))/N_{m}\) with \(u_{\tau _{\nu }}(\cdot)\) a “\(\tau\)-shifted” step function that produces 1 for all inputs below the threshold \(\tau _{\nu }\), consistent with low scores indicating noisy labels.
Concisely, we estimate the noise level by computing computationally inexpensive scoring sets (\(\mathcal {S}_{\theta _{G}^{R_{w}}}\), \(\mathcal {S}_{\theta _{m}^{R_{w}}}\)) on each client during a single federated round \(R_{w}\). This approach is particularly beneficial for devices with low computational and energy resources commonly found in FL. Although the estimated noise level may not be precise compared to the embedding-based method, it is effective in adjusting the importance of clients’ updates during server-side model aggregation, as explained in Section 4.3.
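The scoring and thresholding steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the FedLN implementation: the function names are hypothetical, `np.percentile` (with its default linear interpolation) stands in for the percentile operator of Equation (6), and the comparison direction follows the Heaviside definition given with the formula for \(n_{l}^{m}\) (scores above \(\tau _{\nu }\) count as 1), which is an assumption. The random logits are placeholders for real model outputs.

```python
import numpy as np


def energy_scores(logits):
    """Energy score s = logsumexp(z) per sample; logits has shape (N, C)."""
    z = np.asarray(logits, dtype=float)
    z_max = z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    return z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))


def percentile_threshold(global_scores, nu=75.0):
    """Threshold tau_nu: the nu-th percentile of the global-model scoring set."""
    return float(np.percentile(np.asarray(global_scores, dtype=float), nu))


def estimate_noise_level(local_scores, tau):
    """Mean of the tau-shifted Heaviside over the local-model scoring set.

    Assumption: the Heaviside returns 1 for scores strictly above tau.
    """
    return float(np.mean(np.asarray(local_scores, dtype=float) > tau))


# Placeholder logits standing in for one client's outputs in round R_w.
rng = np.random.default_rng(0)
global_logits = rng.normal(size=(200, 10))  # outputs under the global weights
local_logits = global_logits + rng.normal(scale=0.5, size=(200, 10))  # after local training

tau = percentile_threshold(energy_scores(global_logits), nu=75.0)
noise_level = estimate_noise_level(energy_scores(local_logits), tau)
```

Only the two scoring sets need to leave the device, so the per-sample cost on the client is one forward pass per model plus a `logsumexp`.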

4.3 FL under the Presence of Label Noise

The objective of supervised training in the presence of label noise is to learn a model on which the effect of noisy labeled instances is minimal. Recent federated approaches to label noise often rely on additional clean data available on the server or impose excessive computational burdens on clients during the FL process. To tackle this challenge, we propose three distinct approaches, namely NNC, AKD, and NA-FedAvg, each of which effectively handles label noise at a different stage of FL (i.e., initialization, local training steps, and server-side model aggregation) with varying degrees of compute requirements. Composed of these three approaches, our framework, termed FedLN, provides a suitable solution to label noise in FL for a wide range of devices (low-end to high-end computation devices). At its core, FedLN utilizes label noise estimation techniques (described in Section 4.2). Figure 1 provides an overview of the device families, in terms of computational characteristics, that each approach targets.
Fig. 1. Illustration of the targeted device audience of FedLN approaches. Each approach has different computational resource requirements. NA-FedAvg is lightweight and thus well suited for wearables and edge devices. NNC requires sufficient computational resources to compute embeddings from pre-trained models before training, which are used for label correction, whereas AKD deals with label noise during training by computing an additional loss term, making it suitable for clients with relatively higher computational capabilities.

4.3.1 Nearest Neighbor Based Correction.

Input embeddings (i.e., features extracted from a specific layer of a neural network) computed from a pre-trained model can be exploited to perform label correction (Figure 2). As discussed in Section 4.2, we estimate the label for each input instance \(x\) by “looking” at the labels of neighboring examples in the embedding space. Thus, apart from predicting a per-client noise estimation using the kNN predictions, we can also perform label correction [52]. Specifically, when we detect a mismatch between the predicted (with kNN) and current label, we consider the predicted label as the true label to be used during the training phase. This way, instead of discarding noisy instances altogether, their labels are modified on a per-client basis at the initialization phase of FL, after which the FL training process continues as usual. Mathematically, the “corrected” local datasets across all clients in FL can be written as
\begin{equation} \mathcal {D}^{m^*} = \left\lbrace x_{i}, y_{i}^{vote} \right\rbrace _{i=0}^{N_m} = \left\lbrace x_{i}, \Psi \left(g,x_{i} \right) \right\rbrace _{i=0}^{N_m}, \end{equation}
(7)
where \(\Psi (\cdot)\) is a mapping function that adjusts \(y_{i}\) to \(y_{i}^{vote}\) (as computed in Equation (5)) using the feature extractor \(g(\cdot)\). To learn from the “corrected” datasets across all clients, we apply the cross-entropy loss as
\begin{equation} \begin{aligned}\mathcal {L}_{\theta }(\mathcal {D}^{m*}) = \mathcal {L}_{CE}\left(y^{vote}~,~p_{\theta ^m}\left(y{\mid } x \right) \right) = - \frac{1}{N_{m}} \sum \limits _{i=1}^{N_{m}}\sum \limits _{j=1}^{C} {y_{i}^{vote}}^{j} \log (\mathit {f}_{j}^{\theta ^{m}}(x_{i})). \end{aligned} \end{equation}
(8)
Fig. 2. Illustration of NNC for identifying and correcting noisy labeled instances via embeddings. The process is depicted for one federated client for the sake of simplicity.
It is important to highlight that the quality of the extracted embeddings is crucial for the effectiveness of NNC in identifying and correcting noisy instances. However, we do not see it as a major problem, as several self-supervised pre-training approaches [29, 32, 33] provide useful embeddings for a broad spectrum of tasks. Moreover, the label correction process in NNC is not affected to the same degree by the actual label noise present at each client, since the embeddings are extracted without any labels. In addition, clients are required to hold a pre-trained model only for a single forward pass during the initialization phase of FL. Further details and an overview of our proposed NNC federated approach for learning models under the presence of label noise can be found in Algorithm 1, highlighted in red.
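The correction step can be illustrated with a small NumPy sketch. It assumes that the vote of Equation (5) is a plain majority vote over the \(k\) nearest neighbors, that distances are Euclidean, and that a sample is excluded from its own neighborhood; these choices, along with the function names, are assumptions made for illustration and are not taken from the FedLN implementation.

```python
import numpy as np


def knn_vote_labels(embeddings, labels, k, num_classes):
    """Majority-vote label for each sample from its k nearest neighbors."""
    e = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=int)
    sq = (e ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (e @ e.T)  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)  # assumption: a sample does not vote for itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_labels = y[neighbors]  # shape (N, k)
    votes = np.stack(
        [(neighbor_labels == c).sum(axis=1) for c in range(num_classes)], axis=1
    )
    return votes.argmax(axis=1)


def nnc_correct(embeddings, labels, k, num_classes):
    """Return the voted labels and the fraction of labels that were changed."""
    voted = knn_vote_labels(embeddings, labels, k, num_classes)
    noise_level = float(np.mean(voted != np.asarray(labels, dtype=int)))
    return voted, noise_level


# Toy client: two well-separated clusters, with one sample of the first
# cluster (index 3) carrying the wrong label.
toy_embeddings = [[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1]
corrected, estimated_noise = nnc_correct(toy_embeddings, toy_labels, k=3, num_classes=2)
```

In this toy case the mislabeled sample is outvoted by its neighbors, so its label is flipped back and the estimated noise level equals the share of changed labels.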

4.3.2 Adaptive Knowledge Distillation.

Rather than directly altering the labels, we can implicitly guide the model to avoid “memorization” of noisy labeled samples during the FL training phase by means of an additional loss term (Figure 3). Specifically, in addition to the standard cross-entropy loss, we propose to use a knowledge distillation loss [15, 31] that requires each client to also learn to mimic the embeddings or outputs of a teacher model. In contrast to NNC, in which we proposed a label correction process for training models under the presence of noisy labeled instances, with AKD we utilize soft labels or embeddings from a teacher model as ground truth.
Fig. 3. Illustration of our AKD approach for training federated models under the presence of noisy labels with “adaptive” knowledge distillation (see Section 4.3.2). Embeddings are extracted from a pre-trained model, as an extra source of supervision for local models, whereas KD is activated on a per-client basis in case label noise is detected.
To this end, we propose to utilize embeddings computed with a pre-trained model or the server-side aggregated model’s outputs from each round as a source of supervision. Specifically, after the noise level is estimated on a per-client basis, as discussed in Section 4.2, we incorporate a distillation loss term to clients with detected label noise while permitting any client with well-annotated data to directly train on their locally stored data. Thus, we present an “adaptive” knowledge distillation (AKD) during the FL training process, where clients with noisy instances are guided with additional supervision to learn noise-robust models. To learn from the datasets of all clients, \(\mathcal {D}^{m}\), we apply the cross-entropy loss and a knowledge distillation loss as follows:
\begin{equation} \begin{array}{ll} \mathcal {L}_{\theta }(\mathcal {D}^{m}) \!\!&= \mathcal {L}_{CE}\left(y,p_{\theta ^m}\left(y\mid x \right) \right) + \beta \cdot u_{\epsilon }(n_l^{m}) \cdot \mathcal {L}_{KD}\left(e, p_{\theta ^{m}}\left(e|x\right)\right), \\ \!\!& \textrm { where } \mathcal {L}_{KD} = {\left\lbrace \begin{array}{ll} \mathcal {L}_{MAE} = \frac{1}{N_{m}} \sum \limits _{i=1}^{N_{m}} \left| e_{i} - \tilde{e_{i}} \right|,~ &\! \text{if}\ e\ \text{are embeddings of}\ g(\cdot) \\ \mathcal {L}_{KL} = T^{2} \sum \limits _{j=1}^{C} p_{\theta ^m}^{T} log \frac{p_{\theta ^m}^{T}}{p_{\theta ^G}^{T}},~ &\! \text{if}\ e\ \text{are}\ ``\text{temperature-scaled}\mbox{''}\ \text{logits of}\ G. \\ \end{array}\right.} \end{array} \end{equation}
(9)
Here, \(\mathcal {L}_{CE}(\cdot)\) indicates the standard cross-entropy loss, \(\mathcal {L}_{MAE}(\cdot)\) corresponds to the mean absolute error loss between the embeddings computed from a feature extractor \(g(\cdot)\) and those of the local model of the \(m^{th}\) client (noted as \(\tilde{e_{i}}\)), and \(\mathcal {L}_{KL}(\cdot)\) refers to the Kullback-Leibler divergence loss [15] between the “temperature-scaled” logits of the \(m^{th}\) client’s local model and the server-side aggregated model \(G\) of the current federated round. With the “\(\epsilon\)-shifted” Heaviside function, \(u_{\epsilon }\), we introduce the \(\mathcal {L}_{KD}\) loss only to clients whose estimated label noise exceeds \(\epsilon\), and we add the scalar \(\beta\) to control the contribution of \(\mathcal {L}_{KD}\) to model optimization. We fix \(\beta =\) 10 and \(T=\) 2, which we found to work well during our initial exploration. One may note that in the case of \(\mathcal {L}_{MAE}(\cdot)\), clients perform a forward pass of their data via the feature extractor in each federated round. Further details and an overview of our proposed AKD federated approach can be found in Algorithm 1, highlighted in green.
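As a rough illustration of Equation (9), the following NumPy sketch evaluates the gated loss for one batch. It follows the formula as written (the KL term is taken from the local model's distribution to the global model's), averages per-sample terms over the batch, and averages the MAE over embedding dimensions; those reduction choices, the function names, and the example value of \(\epsilon\) are assumptions, since only \(\beta = 10\) and \(T = 2\) are fixed in the text.

```python
import numpy as np


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the class axis."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy between integer labels and softmax(logits)."""
    p = softmax(logits)
    idx = np.arange(len(labels))
    return float(-np.mean(np.log(p[idx, np.asarray(labels, dtype=int)] + 1e-12)))


def mae_distillation(teacher_emb, student_emb):
    """L_MAE: mean absolute error between teacher and local embeddings."""
    diff = np.asarray(teacher_emb, dtype=float) - np.asarray(student_emb, dtype=float)
    return float(np.mean(np.abs(diff)))


def kl_distillation(local_logits, global_logits, T=2.0):
    """L_KL: T^2 * KL(p_local^T || p_global^T), averaged over the batch."""
    p_m = softmax(local_logits, T)
    p_g = softmax(global_logits, T)
    kl = np.sum(p_m * (np.log(p_m + 1e-12) - np.log(p_g + 1e-12)), axis=1)
    return float(T ** 2 * np.mean(kl))


def akd_loss(ce, kd, noise_level, eps, beta=10.0):
    """L = L_CE + beta * u_eps(n_l) * L_KD; the gate opens only above eps."""
    gate = 1.0 if noise_level > eps else 0.0
    return ce + beta * gate * kd


# Example with a hypothetical eps = 0.2: a client with 30% estimated noise
# receives the distillation term, while a client with 10% does not.
loss_noisy_client = akd_loss(ce=1.0, kd=0.5, noise_level=0.3, eps=0.2)
loss_clean_client = akd_loss(ce=1.0, kd=0.5, noise_level=0.1, eps=0.2)
```

Because the gate is a hard step, clients at or below \(\epsilon\) train exactly as in standard FedAvg.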

4.3.3 Noise-Aware Federated Averaging.

As an alternative to the previously proposed approaches for dealing with noisy labels, we aim to directly address the impact of noisy labels during a later stage of the training process (Figure 4). Specifically, during the server-side aggregation process of FedAvg [25], we propose utilizing the estimated noise level of each client to perform a “noise-aware” federated averaging (NA-FedAvg) aggregation step by modifying parameter \(\gamma _{m}\) in Equation (4) to consider both the number of samples and the label noise level of clients. In other words, clients with few noisy instances have a greater contribution to the FL process compared to the ones with a large amount of label noise. Although straightforward, weight re-scaling techniques have been beneficial in mitigating the impact of noisy labels on the global model \(G\); however, they rely on additional clean data to provide a computationally inexpensive label noise estimation in FL [6, 9, 46]. In this work, to minimize the computational overhead for clients in NA-FedAvg, we directly estimate the noise level from the confidence of the model’s predictions, as discussed in Section 4.2. This makes NA-FedAvg suitable for clients with limited computational resources. Clients only compute computationally inexpensive scores for each locally stored sample, while the majority of the work is off-loaded to the server, which computes a noise level estimation for each client.
Fig. 4. Illustration of the NA-FedAvg method for mitigating the effect of label noise on a model’s generalizability. In addition to the standard FedAvg process, a noise estimation process is introduced once, indicated in gray. In this round, energy scores are computed and communicated to the server. Afterward, the global model and a per-client noise estimation are calculated on the server. For the remaining FL training rounds, a typical local train step is performed, while the server performs a noise-aware weighted model aggregation, incorporating clients’ estimated noise level.
Concisely, NA-FedAvg begins with the standard FL process up to a certain federated round \(R_{w}\), which is considered a hyperparameter of NA-FedAvg. In round \(R_{w}\), after receiving the globally aggregated weights \(\theta _{G}^{R_{w}}\), each client performs a local training step on the global model weights and now possesses a local model with weights \(\theta _{m}^{R_{w}}\). Using these local and global models, and only for round \(R_{w}\), the scoring sets \(\mathcal {S}_{\theta _{G}^{R_{w}}}\) and \(\mathcal {S}_{\theta _{m}^{R_{w}}}\) are computed on a per-client basis and communicated back to the central server, together with each local model’s parameters \(\theta _{m}^{R_{w}}\). Using the acquired scoring sets, we compute a per-client noise estimation at the server side, as discussed in Section 4.2, to introduce the “noise-aware” model aggregation by setting \(\gamma _{m}= (1-n_{l}^{m})\cdot \frac{N_m}{N}\) in Equation (4), whereas the remaining FL training process continues as usual. Further details and an overview of the NA-FedAvg approach can be found in Algorithm 1, highlighted in blue.
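The aggregation rule can be sketched as follows, assuming that Equation (4) is the usual FedAvg weighted sum \(\theta _{G} = \sum _{m} \gamma _{m} \theta _{m}\) over per-layer weight arrays. The helper names are hypothetical, and the optional re-normalization of \(\gamma _{m}\) so that the coefficients sum to one is an addition for illustration that is not stated in the text.

```python
import numpy as np


def noise_aware_gammas(num_samples, noise_levels, normalize=False):
    """gamma_m = (1 - n_l^m) * N_m / N for every client m."""
    n = np.asarray(num_samples, dtype=float)
    nl = np.asarray(noise_levels, dtype=float)
    gamma = (1.0 - nl) * n / n.sum()
    if normalize:  # assumption: optional rescaling so the coefficients sum to 1
        gamma = gamma / gamma.sum()
    return gamma


def na_fedavg(client_weights, num_samples, noise_levels, normalize=False):
    """Weighted sum of client models; each model is a list of per-layer arrays."""
    gamma = noise_aware_gammas(num_samples, noise_levels, normalize)
    num_layers = len(client_weights[0])
    return [
        sum(g * np.asarray(w[layer], dtype=float) for g, w in zip(gamma, client_weights))
        for layer in range(num_layers)
    ]


# Two clients with equal data sizes; the second has 50% estimated label noise,
# so its update contributes half as much as the first client's.
clients = [[np.array([1.0])], [np.array([3.0])]]
gammas = noise_aware_gammas([100, 100], [0.0, 0.5])
aggregated = na_fedavg(clients, [100, 100], [0.0, 0.5])
aggregated_normalized = na_fedavg(clients, [100, 100], [0.0, 0.5], normalize=True)
```

Without normalization the coefficients sum to less than one whenever any client has a non-zero noise estimate, which scales the aggregated weights down; whether this is intended depends on the exact form of Equation (4).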

5 Experiments

In this section, we describe our extensive performance evaluation of FedLN. We use various publicly available datasets to determine the efficacy of our approaches in learning generalizable models under a variety of federated learning and label noise settings. First, the utilized datasets are presented, followed by a detailed description of the neural network architectures and FL framework used in our experiments. Next, we present our evaluation strategy, including all considered federated parameters and the baselines against which we compare our methods. Finally, we provide our findings about the performance of FedLN across a wide range of label noise scenarios.

5.1 Datasets

We use publicly available datasets for performance evaluation on a range of classification tasks from both the vision and audio domains. For all datasets, we use standard training/test splits for comparability purposes, as provided with the original datasets. From the vision domain, we use the CIFAR-10 [19], Fashion-MNIST [43], and EuroSAT datasets, where the tasks of interest are object recognition, clothes classification, and land use and land cover classification, respectively. By utilizing these datasets, we facilitate benchmarking for future research. In addition, we perform experiments with the PathMNIST [45] dataset from medical imaging. By doing this, we investigate the performance of FedLN in a domain where noisy labels may occur due to incorrect diagnosis. From the audio domain, we use the Speech Commands (v2) dataset [40], where the learning objective is to detect when a particular keyword is spoken out of a set of 12 target classes. Apart from these datasets, we extend our evaluation to real-world, human-annotated versions of the popular CIFAR-10/100 [41] datasets, namely CIFAR-10N/100N, where label noise presents varied (or biased in some manner) patterns based on users’ preferences.
For all image classification tasks, we perform standard augmentations, such as random flipping and cropping, followed by Cutout [7] transformation. For the Speech Commands audio dataset, we extract log-Mel spectrograms from raw waveforms as our model input. We compute this by applying a short-time Fourier transform on the 1-second audio segment with a window size of 25 ms and a hop size equal to 10 ms to extract 64 Mel-spaced frequency bins for each window. To make an accurate prediction on an audio clip, we average over the model predictions of non-overlapping segments of an entire audio clip.
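The audio front-end described above can be approximated with a short NumPy sketch. The window (25 ms), hop (10 ms), and number of Mel bins (64) come from the text; the 16 kHz sample rate matches the Speech Commands recordings, whereas the Hann window, the FFT size of 512, the HTK-style Mel formula, and the small constant inside the logarithm are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        low, center, high = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(wave, sample_rate=16000, win_ms=25, hop_ms=10,
                        n_mels=64, n_fft=512):
    """Log-Mel spectrogram of shape (frames, n_mels) from a mono waveform."""
    wave = np.asarray(wave, dtype=float)
    win = int(sample_rate * win_ms / 1000)  # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + (len(wave) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)  # assumption: Hann window
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)  # small constant avoids log(0)


# One second of a 440 Hz tone as a stand-in for a Speech Commands clip.
t = np.arange(16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2.0 * np.pi * 440.0 * t))
```

With these settings a 1-second segment yields 98 frames of 64 Mel bins, since no padding is applied at the clip boundaries.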

5.2 Implementation Details and Evaluation Strategy

Models and Optimization.

For our image classification tasks, we choose a ResNet-20 [12] model architecture due to its relatively compact model size, which makes it ideal for on-device learning, where devices have medium to low computational resources. Here, we utilize an SGD optimizer with a learning rate of 0.1 and momentum of 0.9 for both CIFAR-10 and Fashion-MNIST, whereas the Adam optimizer with a default learning rate of 0.001 was used for PathMNIST and EuroSAT.
For the audio domain, the network architecture of our global model is inspired by Tagliasacchi et al. [35] for mobile devices. Our convolutional neural network architecture consists of four blocks. In each block, we perform two separate convolutions: one on the temporal and another one on the frequency dimension, outputs of which we concatenate afterward to perform a joint \(1 \times 1\) convolution. Using this scheme, the model can capture fine-grained features from each dimension and discover high-level features from their shared output. Furthermore, we apply L2 regularization with a rate of 0.0001 in each convolution layer and group normalization after each layer. Between model blocks, we utilize max-pooling to reduce the time-frequency dimensions by a factor of 2 and use a spatial dropout rate of 0.1 to avoid over-fitting. We apply ReLU as a non-linear activation function and use the Adam optimizer with the default learning rate of 0.001 to minimize the loss function.
For our methods discussed in Sections 4.2 and 4.3.1, which rely on embeddings, we use off-the-shelf pre-trained models trained on large-scale datasets in an unsupervised manner. For vision models, we use ViT-B/32 from CLIP [29], and for audio, we leverage TRILLsson (v3, EfficientNetv2-B3) [33], which has the same audio front-end as the one we used for our audio model. These publicly available models can be downloaded directly on client devices, and we run them once (i.e., forward pass only) to compute embeddings from the client’s local storage.

Federated Environment.

To simulate a federated environment, we use the Flower framework [5] and utilize FedAvg [25] as the optimization algorithm to construct the global model from clients’ local updates. Additionally, a number of primary parameters, listed in Table 1(b), were selected to control the federated setting in our experiments. In all FedLN experiments where noise level estimation is computed based on a model’s confidence, we use a fixed scoring threshold percentile \(\nu\) of \(75\%\), whereas for embedding-based noise discovery the neighborhood size in kNN is set to 100, which we found to work well during our initial exploration. Further, we set the clients’ participation rate in each federated round (\(q\)) to 80%. It is important to note that we employ uniform random sampling for the clients’ selection strategy, as other approaches for adequate client selection are outside the scope of our work.
Table 1. Details from Our Experimental Setup
For the data distribution process in our federated experiments, we randomly partitioned the datasets across the available clients in a non-overlapping fashion, controlling the amount of data across clients through parameter \(\sigma\). With \(\sigma\) set to 25% and a random partitioning of data among clients, the resulting data distribution across clients is intentionally imbalanced. This type of data distribution is common across clients’ data in the federated setting [17], where the variation in the number of samples per client is influenced by the characteristics of the respective users. To generate label noise in the datasets, we constructed a noise matrix \(Q_{y{\mid }y^{*}}\) based on parameters \(n_{l}\) and \(n_{s}\), similar to Northcutt et al. [27]. Even though in centralized settings a single noise matrix was considered, in FL we constructed a unique noise matrix per client, thus introducing distinct noise profiles across clients. We note that the noise injection process has been performed after the data partitioning process. With parameter \(F\), we controlled the number of “noisy” clients (i.e., clients holding noisy labeled instances in their datasets). Last, for an accurate comparison between our experiments, we manage any randomness during data partitioning, label noise injection, and training procedures by using a seed alongside the parameters as presented in Table 1.
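To make the role of \(n_{l}\) and \(n_{s}\) concrete, the sketch below builds one noise matrix per client and samples noisy labels from it. It is a simplified stand-in rather than the procedure of Northcutt et al. [27]: sparsity is interpreted as the fraction of off-diagonal entries forced to zero in each column, at least one off-diagonal entry is kept, and the noise mass is split with Dirichlet weights. All of these choices and the function names are assumptions.

```python
import numpy as np


def make_noise_matrix(num_classes, noise_level, sparsity, rng):
    """Column-stochastic matrix Q[noisy, true] with diagonal 1 - noise_level."""
    Q = np.zeros((num_classes, num_classes))
    # Assumption: sparsity is the fraction of zeroed off-diagonal entries per
    # column, and at least one off-diagonal entry always stays non-zero.
    k = max(1, int(round((1.0 - sparsity) * (num_classes - 1))))
    for true in range(num_classes):
        others = np.array([c for c in range(num_classes) if c != true])
        targets = rng.choice(others, size=k, replace=False)
        Q[targets, true] = noise_level * rng.dirichlet(np.ones(k))
        Q[true, true] = 1.0 - noise_level
    return Q


def inject_noise(labels, Q, rng):
    """Sample a noisy label for each true label from the matching column of Q."""
    return np.array([rng.choice(Q.shape[0], p=Q[:, y]) for y in labels])


# A distinct matrix per client gives each client its own noise profile.
rng = np.random.default_rng(42)
client_labels = [rng.integers(0, 10, size=50) for _ in range(3)]
client_matrices = [make_noise_matrix(10, 0.4, 0.7, rng) for _ in client_labels]
noisy_labels = [inject_noise(y, Q, rng) for y, Q in zip(client_labels, client_matrices)]
```

Under this interpretation, \(n_{s} = 1.0\) sends all of a class's noise to a single other class, which is the regime where groups of classes become easily confused.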

Benchmark Baselines.

As the effect of label noise remains mostly unexplored in the federated setting, we perform a thorough evaluation across a wide range of existing techniques used in the centralized setting to handle noisy labels. From the perspective of regularization techniques, we consider LS [26], which “softens” the labels by taking a weighted average of the hard targets and the uniform distribution over labels. Such “smoothing” of labels aims to account for the fact that datasets could contain mislabeled instances, so that directly maximizing the likelihood \(p\left(y \mid x\right)\) can be harmful. In our experiments, we set the LS rate, \(\alpha\), to 0.2 [24]. Additionally, we conduct experiments where we use the bi-tempered [1] loss instead of the standard cross entropy. This loss replaces the softmax function with a high temperature generalization and uses a low temperature logarithm. In this way, the bi-tempered loss is able to construct a decision boundary less susceptible to noisy labels and learn from outliers (which are considered noisy labeled instances) present in the data. Furthermore, from the perspective of detecting noisy instances directly from data, we consider CL [27], which prunes noisy instances from a dataset based on probabilistic thresholds from a neural network’s predictions. To compute these probabilistic thresholds, we utilize the globally unified model \(G\) during the initialization phase of FL, which we train locally for a fixed number of epochs (\(E\) = 20) on all clients. Once the data pruning has been performed using these thresholds, we discard the trained model and resume the FL training process to train a model from scratch.
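For reference, the label smoothing baseline amounts to the following transformation of one-hot targets, shown here with the rate \(\alpha = 0.2\) used in the experiments. The function name is hypothetical and the snippet is only a sketch of the standard formulation.

```python
import numpy as np


def smooth_labels(labels, num_classes, alpha=0.2):
    """Weighted average of one-hot targets and the uniform distribution."""
    one_hot = np.eye(num_classes)[np.asarray(labels, dtype=int)]
    return (1.0 - alpha) * one_hot + alpha / num_classes


# With 10 classes, the target class receives 0.8 + 0.02 and every other
# class receives 0.02, so each row still sums to 1.
targets = smooth_labels([3, 7], num_classes=10, alpha=0.2)
```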
From existing FL approaches, we utilized FedCorr [44], a multi-stage label noise correction approach based on GMMs and pseudo-labeling, in our FL experiments. FedCorr consists of a pre-processing stage where a subset of clients participate in each iteration (all clients are participating per iteration in random subsets) to detect noisy labeled instances using a GMM based on sample LID [16] values. The label correction phase involves training the model with only the relatively “clean” clients (i.e., clients with low label noise) and generating pseudo-labels for the identified noisy instances. Subsequently, a standard FL (FedAvg) process is performed on all clients. We followed the original FedCorr methodology, conducting five iterations with a single client participating in each round for a total of 150 rounds. For fine-tuning and standard FL, we conducted 95 and 100 rounds, respectively, to align with other approaches used in our experiments. As MixUp [50] augmentation is utilized in FedCorr, we did not consider FedCorr for our audio classification tasks. Apart from these methods, we perform preliminary experiments under standard centralized and federated settings with clean data. With the centralized experiments, we aim to establish an upper bound of performance for our FL approach, whereas the FL experiments serve as a baseline to evaluate the performance improvement we can obtain with FedLN, when noisy labels are present. Last, for a rigorous evaluation, we perform three distinct trials (i.e., running an entire federated experiment) in each setting, and the average accuracy over three runs is reported across the results of Section 5.3.

5.3 Results

In this subsection, we discuss our findings on the effectiveness of FedLN in dealing with label noise in FL. First, we show the efficacy of our noise detection mechanisms. Next, we study the performance of FedLN across a wide range of classification tasks. Finally, we explore how our approaches handle different model architectures, distinct noise profiles, and real-life label noise settings.

5.3.1 Mechanisms for Label Noise Detection in FL.

The estimation of noise level on a per-client basis is a fundamental building block of our FedLN framework. By exploiting this information, we can compose learning schemes to mitigate the effect of label noise during training. For this purpose, we first perform experiments in which we evaluate the performance of our proposed label noise detection mechanisms, namely embedding based and model’s confidence based, on CIFAR-10 and Speech Commands for different noise profiles. Additionally, we conduct experiments where the model’s predictions (i.e., softmax scores), instead of energy scores, were used to provide a noise level estimation. From our considered baselines, we used the CL [27] data pruning process as an approximation to noise level, where the noise level corresponds to the percentage of pruned data, and we also include the noise level estimated by FedCorr [44] across clients. We perform experiments with identical data partitioning and noise injection processes across methods to ensure an unbiased evaluation.
Whereas the embedding-based discovery of noisy labels does not depend on the clients’ models, our energy-based approach requires proper tuning of \(R_{w}\), as discussed in Section 4.3. To this end, we performed initial experiments to derive a suitable \(R_w\), where noisy labeled instances can be effectively detected across clients. Figure 5 provides the obtained AUC score on detection of noisy labeled instances across all clients in different federated rounds for various detection mechanisms. We observe that energy scores provide an effective mechanism for detection of noisy labeled instances, following a steep curve and outperforming CL across all considered noise profiles. In particular, after 30 federated rounds, the energy score is within a close proximity of its maximum AUC score across all federated rounds. Therefore, for the results reported in the rest of Section 5.3, we use a fixed \(R_w\) = 30 for all model-based approaches to identify noisy clients.
Fig. 5. Performance evaluation of model confidence based mechanisms for detection of noisy labeled instances. CL refers to confident learning [27], whereas Softmax and Energy [23] correspond to our approach when the corresponding scoring function is utilized. The mean AUC score across clients in different federated rounds is reported on CIFAR-10, whereas variance is indicated by the shaded region. Federated parameters are set to \(R\)=200, \(M\)=30, \(F\)=80%, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%.
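The AUC metric used for noisy-instance detection can be computed from per-sample scores and ground-truth noise flags with the rank-based (Mann-Whitney) formulation sketched below. How a detection score is derived from the energy or softmax outputs is not specified here, so the snippet simply assumes that a higher score means "more likely noisy"; the function name is hypothetical.

```python
import numpy as np


def detection_auc(is_noisy, scores):
    """ROC AUC via average ranks; assumes higher score = more likely noisy."""
    y = np.asarray(is_noisy, dtype=bool)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s, kind="mergesort")
    sorted_s = s[order]
    ranks_sorted = np.arange(1, len(s) + 1, dtype=float)
    i = 0
    while i < len(s):  # give tied scores their average rank
        j = i
        while j + 1 < len(s) and sorted_s[j + 1] == sorted_s[i]:
            j += 1
        ranks_sorted[i:j + 1] = (i + j + 2) / 2.0
        i = j + 1
    ranks = np.empty(len(s))
    ranks[order] = ranks_sorted
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return float((ranks[y].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))


# Four samples, two of them noisy; one noisy sample is ranked below a clean one.
auc = detection_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to scores that carry no information about which labels are noisy, and 1.0 to a perfect ranking of noisy above clean instances.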
In Table 2, the AUC scores of all considered approaches are reported for CIFAR-10 and Speech Commands. Comparing all model-based approaches with the embedding-based one, we note that the latter’s performance is superior. This is due to the difference in the quality of the noise-free “source” that is utilized to detect noisy labeled instances on the proposed methods. Whereas embeddings are extracted from a pre-trained model that is completely unrelated to clients’ own models or their label noise, the remaining approaches rely on the “memorization effect” of deep learning models during the early stages to detect noisy instances. Consequently, as the label noise increases in the federated setting, a certain degree of noise will be implicitly introduced during the model aggregation process. This can be observed by the sharp decrease, averaging 7%, in the AUC score for detecting noisy labeled instances across the two datasets in all model-based approaches when the label noise level (\(n_l\)) is increased from 40% to 70%. However, even in the presence of high levels of label noise, it is still possible to observe significant differences in the generalization capabilities of client models with noisy labels and those with clean labels. This information is valuable for re-weighting the contribution of client models during server-side model aggregation. To illustrate this point, we present histograms of energy scores obtained at \(R\)=30 in Figure 6, showing a clear separation between the scores of “clean” and “noisy” clients. Here, we consider two cases of label noise: one with clients sharing the same amount of noise and another with a random label noise distribution, both maintaining the same number of noisy labeled instances.
Table 2. AUC Scores on Detection of Noisy Labeled Data in CIFAR-10 and Speech Commands
Fig. 6. Energy scores computed for CIFAR-10 on \(R\)=30 using a client’s local models with \(F\)=80%. “Clean” and “noisy” terms refer to the presence of label noise in each client’s dataset. In (a), a uniform noise with \(n_l\)=40% is considered across all “noisy” clients, and in (b), label noise follows random distribution across clients, whereas both (a) and (b) maintain the same number of noisy labeled instances. A clear separation between the scores computed from clean and noisy clients is evident. Note that labels are not explicitly used to compute these scores.
On the contrary, the embedding-based discovery approach exhibits greater robustness to the client’s noise profile. However, there is an exception in the case of high noise with high sparsity (where \(n_l\)=70% and \(n_s\)=70/100%). In this scenario, labels belonging to groups of classes become easily confused. Table 2 shows that, on average, there is an approximate 30% decline in the AUC score when \(n_l\)=70% and \(n_s\) increases from 40% to 70%. Although the computed embeddings are unrelated to the noise profile, they amplify the noise level through faulty label estimation. To understand this behavior, we provide an intuition behind how noisy labeled instances are identified using our embedding-based approach in Figure 7, where t-SNE visualization is used for CIFAR-10 embeddings extracted using the work of Radford et al. [29]. Noisy labeled instances are indicated by black circles in Figure 7(a). As discussed in Section 4.3.1, the discovery of noisy labels using the NNC method relies on the detection of outliers in a given neighborhood through majority vote, as shown in Figure 7(b). However, in cases of high noise sparsity, labels from a group of classes become mixed, resulting in a concentration of noisy labels from a particular class in a given neighborhood. This amplifies the noise level and leads to incorrect label estimation. For example, in the neighborhood of a cat’s image (ground truth label) in the embedding space, where 7 out of the 10 samples are mislabeled as dogs due to high noise sparsity between cats and dogs, NNC computes the estimated label based on the dominant label in the neighborhood, resulting in a dog’s label being assigned. To overcome this issue, the AKD method (see Section 4.3.2) can be used to exploit embeddings for distillation rather than fixing labels directly, as demonstrated in Section 5.3.2.
Fig. 7. Intuition behind FedLN embedding-based approaches for noise level estimation. Parts (a) and (b) show t-SNE on computed embeddings using the work of Radford et al. [29] for CIFAR-10 before and after the label correction using NNC (see Section 4.3.1). A few cases of noisy labeled instances are highlighted with black circles in (a), where examples of one class are incorrectly assigned a label from another class. Embedding-based discovery of noisy labels relies on the detection of an outlier in a given neighborhood.

5.3.2 Comparison of FedLN against Existing Techniques.

Here, we compare FedLN against the considered baselines to quantify the achieved improvements, and we assess the effectiveness of FedLN on a wide range of tasks from both the vision and audio domains. To this end, we perform experiments on all datasets for a diverse set of noise profiles, varying both noise level (\(n_l\)) and sparsity (\(n_s\)). For a fair comparison, we utilize identical data partitioning and noise injection schemes in all related experiments. Table 3 provides the accuracy scores on test sets averaged across three independent runs to be robust against differences in randomness involved in training deep neural networks. In the experiments that we conducted in a centralized setting, models are trained until convergence to obtain the resulting accuracy on a test set, which is presented in the centralized rows of Table 3. Additionally, for ease of comparison and benchmarking, we include the obtained accuracy for all considered approaches and datasets when no additional noise is injected in the data, thus acting as an upper bound of models’ performance. In the case of AKD, we report the performance for the case of embeddings as a source of supervision in Table 3, whereas the accuracies for identical experiments performed for AKD with the globally aggregated model’s outputs (i.e., logits) as the supervisory signal are provided later in Table 7 of the appendix.
Table 3.
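| Dataset | Setting | Method | \(n_l\)=0.0, \(n_s\)=0.0 | \(n_l\)=0.4, \(n_s\)=0.0 | \(n_l\)=0.4, \(n_s\)=0.4 | \(n_l\)=0.4, \(n_s\)=0.7 | \(n_l\)=0.4, \(n_s\)=1.0 | \(n_l\)=0.7, \(n_s\)=0.0 | \(n_l\)=0.7, \(n_s\)=0.4 | \(n_l\)=0.7, \(n_s\)=0.7 | \(n_l\)=0.7, \(n_s\)=1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | Centralized | | 91.52 | 76.61 | 76.98 | 76.91 | 69.47 | 58.83 | 58.33 | 57.42 | 45.06 |
| CIFAR-10 | FedAvg | Supervised | 78.52 | 68.77 | 67.05 | 67.31 | 67.92 | 57.65 | 56.94 | 56.81 | 63.54 |
| CIFAR-10 | FedAvg | LS | 73.91 | 68.63 | 64.91 | 64.02 | 64.68 | 58.08 | 56.53 | 56.21 | 62.04 |
| CIFAR-10 | FedAvg | Bi-Temp | 75.29 | 66.18 | 65.66 | 66.58 | 67.71 | 56.75 | 57.61 | 58.39 | 63.74 |
| CIFAR-10 | FedAvg | CL | 73.84 | 68.96 | 67.65 | 68.98 | 68.03 | 59.25 | 60.28 | 61.21 | 64.99 |
| CIFAR-10 | FedCorr | | 76.46 | 70.79 | 71.14 | 70.49 | 71.22 | 67.27 | 68.17 | 67.76 | 65.23 |
| CIFAR-10 | FedLN | NNC | 75.29 | 73.68 | 73.64 | 74.23 | 71.44 | 74.76 | 72.63 | 64.49 | 54.59 |
| CIFAR-10 | FedLN | AKD | 83.55 | 69.72 | 69.18 | 70.26 | 70.55 | 68.43 | 68.06 | 69.74 | 68.39 |
| CIFAR-10 | FedLN | NA-FedAvg | 76.41 | 69.52 | 70.73 | 70.61 | 71.01 | 65.34 | 66.04 | 68.11 | 67.92 |
| Fashion-MNIST | Centralized | | 91.85 | 83.56 | 86.76 | 86.18 | 78.26 | 63.87 | 60.96 | 61.95 | 40.32 |
| Fashion-MNIST | FedAvg | Supervised | 86.43 | 82.05 | 83.24 | 83.35 | 81.07 | 58.55 | 56.06 | 57.11 | 59.37 |
| Fashion-MNIST | FedAvg | LS | 84.86 | 82.08 | 82.07 | 81.88 | 79.84 | 59.38 | 56.24 | 56.95 | 55.66 |
| Fashion-MNIST | FedAvg | Bi-Temp | 84.61 | 81.93 | 81.15 | 81.84 | 81.46 | 57.93 | 57.35 | 58.31 | 59.24 |
| Fashion-MNIST | FedAvg | CL | 83.91 | 82.82 | 83.29 | 83.76 | 81.91 | 59.48 | 57.97 | 57.33 | 60.89 |
| Fashion-MNIST | FedCorr | | 86.51 | 84.93 | 84.37 | 83.71 | 82.55 | 79.47 | 79.28 | 75.62 | 74.01 |
| Fashion-MNIST | FedLN | NNC | 80.29 | 86.81 | 87.22 | 87.77 | 83.13 | 85.38 | 84.91 | 76.88 | 44.48 |
| Fashion-MNIST | FedLN | AKD | 86.91 | 83.97 | 84.06 | 83.53 | 82.35 | 80.77 | 78.63 | 80.41 | 78.64 |
| Fashion-MNIST | FedLN | NA-FedAvg | 86.39 | 83.59 | 84.28 | 84.45 | 83.02 | 78.22 | 78.68 | 76.17 | 79.46 |
| PathMNIST | Centralized | | 90.65 | 81.16 | 80.92 | 81.02 | 78.05 | 58.33 | 59.82 | 57.75 | 47.89 |
| PathMNIST | FedAvg | Supervised | 87.05 | 78.82 | 77.06 | 76.68 | 77.03 | 54.74 | 52.49 | 53.22 | 58.61 |
| PathMNIST | FedAvg | LS | 84.13 | 79.62 | 76.96 | 74.57 | 74.9 | 56.17 | 52.06 | 52.31 | 53.46 |
| PathMNIST | FedAvg | Bi-Temp | 83.09 | 78.03 | 77.23 | 77.61 | 76.67 | 55.89 | 55.74 | 56.15 | 58.54 |
| PathMNIST | FedAvg | CL | 84.31 | 78.97 | 79.09 | 77.26 | 81.02 | 59.69 | 55.88 | 54.29 | 60.21 |
| PathMNIST | FedCorr | | 86.01 | 81.07 | 81.32 | 80.06 | 78.03 | 77.75 | 76.81 | 74.53 | 71.18 |
| PathMNIST | FedLN | NNC | 84.45 | 82.76 | 82.97 | 82.01 | 78.97 | 80.13 | 82.74 | 78.78 | 41.53 |
| PathMNIST | FedLN | AKD | 87.82 | 82.42 | 82.83 | 81.46 | 83.26 | 79.94 | 78.63 | 78.31 | 76.02 |
| PathMNIST | FedLN | NA-FedAvg | 85.96 | 80.52 | 81.06 | 81.36 | 78.35 | 76.69 | 75.85 | 79.64 | 72.84 |
| EuroSAT | Centralized | | 95.15 | 87.09 | 86.95 | 86.04 | 79.50 | 71.81 | 71.85 | 70.28 | 54.96 |
| EuroSAT | FedAvg | Supervised | 95.07 | 86.68 | 86.97 | 84.98 | 82.98 | 71.70 | 70.29 | 69.37 | 71.99 |
| EuroSAT | FedAvg | LS | 89.13 | 87.17 | 86.63 | 82.66 | 82.18 | 72.19 | 69.52 | 68.92 | 68.61 |
| EuroSAT | FedAvg | Bi-Temp | 89.94 | 85.33 | 85.92 | 84.32 | 82.32 | 71.77 | 70.92 | 70.03 | 72.24 |
| EuroSAT | FedAvg | CL | 90.33 | 87.94 | 88.33 | 86.18 | 84.05 | 74.29 | 71.68 | 71.12 | 72.57 |
| EuroSAT | FedCorr | | 90.98 | 89.29 | 89.02 | 87.53 | 86.17 | 83.53 | 83.26 | 81.96 | 80.11 |
| EuroSAT | FedLN | NNC | 88.23 | 93.18 | 93.11 | 92.77 | 85.55 | 93.48 | 92.72 | 69.51 | 34.59 |
| EuroSAT | FedLN | AKD | 94.05 | 90.38 | 90.02 | 91.11 | 89.33 | 88.64 | 89.06 | 88.33 | 87.56 |
| EuroSAT | FedLN | NA-FedAvg | 90.77 | 88.49 | 89.17 | 88.56 | 85.86 | 83.34 | 83.68 | 84.46 | 86.91 |
| Speech Commands | Centralized | | 96.68 | 90.33 | 90.31 | 90.84 | 84.84 | 84.95 | 83.31 | 82.65 | 60.41 |
| Speech Commands | FedAvg | Supervised | 96.31 | 81.83 | 82.53 | 82.44 | 80.33 | 72.34 | 70.34 | 70.89 | 72.39 |
| Speech Commands | FedAvg | LS | 94.64 | 91.13 | 84.77 | 80.11 | 79.35 | 77.06 | 71.28 | 68.13 | 69.71 |
| Speech Commands | FedAvg | Bi-Temp | 96.21 | 82.31 | 81.35 | 82.76 | 82.78 | 73.27 | 71.41 | 72.57 | 70.98 |
| Speech Commands | FedAvg | CL | 87.12 | 85.45 | 87.97 | 84.34 | 85.54 | 78.34 | 72.92 | 70.07 | 72.81 |
| Speech Commands | FedLN | NNC | 95.79 | 95.91 | 95.95 | 95.97 | 96.24 | 96.09 | 96.11 | 96.13 | 46.07 |
| Speech Commands | FedLN | AKD | 84.82 | 86.42 | 84.83 | 85.46 | 83.26 | 79.94 | 76.63 | 78.31 | 76.02 |
| Speech Commands | FedLN | NA-FedAvg | 96.07 | 89.49 | 90.35 | 92.72 | 94.09 | 79.12 | 81.91 | 82.33 | 80.37 |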
Table 3. Performance Evaluation of FedLN on Different Datasets against a Range of Baselines
Average accuracy over three distinct trials on the test set is reported. Supervised refers to a standard FedAvg [25] training process, whereas LS and Bi-Temp denote the use of label smoothing [26] regularization and bi-tempered loss [1], respectively. We include FedCorr [44] on all vision-based datasets, whereas CL corresponds to the CL [27] technique. In AKD, we report the performance for the case of embeddings as source of supervision. The accuracies for AKD with the globally aggregated model’s outputs (i.e., logits) as supervisory signals are provided in Table 7 of the appendix. Federated parameters are set to \(R\)=200, \(M\)=30, \(F\)=80%, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%.
Table 4.
Table 4. Performance Evaluation of FedLN with ConvMixer-128/4 [37] on CIFAR-10 and EuroSAT
Table 5.
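| % Noisy Clients (\(F\)) | NNC, \(n_s\)=0% | NNC, \(n_s\)=40% | NNC, \(n_s\)=70% | NNC, \(n_s\)=100% | NA-FedAvg, \(n_s\)=0% | NA-FedAvg, \(n_s\)=40% | NA-FedAvg, \(n_s\)=70% | NA-FedAvg, \(n_s\)=100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | 73.80 | 74.84 | 74.18 | 72.84 | 75.99 | 75.97 | 76.59 | 76.02 |
| 50 | 74.54 | 73.59 | 74.97 | 74.23 | 73.27 | 73.45 | 72.99 | 74.36 |
| 75 | 73.50 | 74.49 | 74.96 | 73.95 | 70.99 | 70.74 | 71.48 | 70.80 |
| 100 | 73.76 | 73.41 | 73.14 | 72.49 | 64.02 | 62.62 | 62.72 | 64.22 |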
Table 5. Performance Evaluation of FedLN When Varying the Percentage of Noisy Clients (\(F\)) for \(n_l\)=40% on CIFAR-10
Average accuracy over three distinct trials on the test set is reported. Federated parameters are set to \(R\)=200, \(M\)=30, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%.
Table 6.
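| % of \(n_l\) on Clean Clients | NNC, \(n_s\)=0% | NNC, \(n_s\)=40% | NNC, \(n_s\)=70% | NNC, \(n_s\)=100% | NA-FedAvg, \(n_s\)=0% | NA-FedAvg, \(n_s\)=40% | NA-FedAvg, \(n_s\)=70% | NA-FedAvg, \(n_s\)=100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 73.68 | 73.64 | 74.23 | 71.44 | 69.52 | 70.73 | 70.61 | 71.01 |
| 5 | 73.53 | 74.02 | 75.45 | 72.24 | 69.17 | 70.46 | 70.02 | 70.87 |
| 10 | 74.39 | 74.57 | 74.70 | 72.29 | 68.12 | 68.13 | 67.18 | 67.53 |
| 25 | 73.49 | 72.94 | 73.68 | 72.82 | 67.01 | 66.17 | 64.11 | 66.11 |
| 50 | 74.25 | 73.68 | 72.57 | 72.98 | 64.88 | 64.02 | 63.32 | 64.57 |
| 75 | 74.72 | 74.08 | 73.48 | 72.45 | 63.79 | 63.97 | 63.57 | 64.14 |
| 100 | 74.01 | 73.08 | 74.14 | 73.87 | 64.02 | 62.62 | 62.72 | 64.22 |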
Table 6. Evaluation of FedLN When Noise Is Introduced on “Clean” Clients for \(n_l\)=40% in CIFAR-10
Average accuracy over three distinct trials on the test set is reported. Federated parameters are set to \(R\)=200, \(M\)=30, \(F\)=80%, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%.
In Table 3, we observe that our approaches significantly improve the model’s performance compared to standard FedAvg across all datasets. Consequently, we conclude that FedLN can be applied in a federated environment with noisy labels to boost performance, independent of the learning task or the inputs’ modality. In particular, comparing the columns for \(n_l\)=70%, we note an increase of 22.67% in accuracy on average using FedLN across the considered tasks compared to the standard federated model.
Observing the results obtained with the baselines, we notice inconsistencies in the performance of both LS and bi-tempered loss across a wide range of noise profiles. LS is effective for low-sparsity label noise, whereas bi-tempered loss produces desirable results only for high levels of label noise. CL, however, demonstrates more stable performance across diverse label noise settings, yet its overall improvement does not exceed an average of \(4\%\) across the tasks. In comparison, FedCorr, a label correction approach designed for FL, can effectively handle label noise across most settings, particularly under low-sparsity label noise. However, its performance diminishes as the noise sparsity increases, resulting in less accurate pseudo-labels. In contrast, FedLN performance is stable across a wide range of noise profiles and provides a significant improvement in the model’s generalization capability. Among our proposed approaches, we note that NNC provides the largest improvement in the model’s performance across distinct noise profiles, with the exception of the special cases of “class flipping” on high levels of noise (\(n_l\)=40/70% and \(n_s\)=100%). In such cases, NNC is unable to use the computed embeddings to perform the label correction process, as discussed in Section 5.3.1, whereas AKD can adequately handle these cases to properly utilize the embeddings and overcome the effect of label noise. Apart from the particular “class-flipping” noise, AKD’s embedding-based distillation approach could possibly surpass the performance of NNC if FL runs for a longer period (\(R\) > 200), at the expense of increased computational overhead. This is evident when no noise is present, where AKD surpasses all other approaches in the federated setting for all datasets from the vision domain, showing that embeddings provide a “cleaner” supervision signal with no correlation to existing labels.
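The failure of neighbourhood-based correction under “class flipping” can be illustrated with a toy example. The sketch below is not the NNC procedure from Section 5.3.1; it assumes a plain k-nearest-neighbour plurality vote over synthetic, well-separated embeddings as a stand-in, and all names (`make_embeddings`, `corrupt`, `knn_vote`) are hypothetical. At a 70% noise level, wrong labels that are spread over many classes leave the true class as the plurality in each neighbourhood, whereas wrong labels that all point to a single other class outvote it.

```python
import numpy as np

NUM_CLASSES = 10
PER_CLASS = 100
NOISE_LEVEL = 0.7  # share of mislabeled samples in every class


def make_embeddings(seed=0):
    """One tight, well-separated cluster of embeddings per class."""
    rng = np.random.default_rng(seed)
    true = np.repeat(np.arange(NUM_CLASSES), PER_CLASS)
    centers = 10.0 * np.eye(NUM_CLASSES)[true]
    return centers + 0.01 * rng.standard_normal(centers.shape), true


def corrupt(true, class_flip):
    """Mislabel the first NOISE_LEVEL share of each class.

    class_flip=True maps every wrong label to one fixed other class;
    class_flip=False spreads the wrong labels over all other classes.
    """
    noisy = true.copy()
    n_wrong = int(round(NOISE_LEVEL * PER_CLASS))
    if class_flip:
        offsets = np.ones(n_wrong, dtype=int)
    else:
        offsets = 1 + np.arange(n_wrong) % (NUM_CLASSES - 1)
    for c in range(NUM_CLASSES):
        idx = np.flatnonzero(true == c)[:n_wrong]
        noisy[idx] = (c + offsets) % NUM_CLASSES
    return noisy


def knn_vote(emb, labels, k):
    """Relabel each sample with the plurality label among its k nearest neighbours."""
    sq = (emb ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T  # squared Euclidean
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.array([np.bincount(v, minlength=NUM_CLASSES).argmax()
                     for v in labels[nearest]])


emb, true = make_embeddings()
accuracy = {}
for name, flip in [("spread", False), ("class_flip", True)]:
    corrected = knn_vote(emb, corrupt(true, flip), k=PER_CLASS)
    accuracy[name] = float((corrected == true).mean())
print(accuracy)
```

In this idealized setup the vote recovers every label under spread noise and none under class flipping, which mirrors the direction (though not the magnitude) of the NNC results at \(n_s\)=100%.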
Furthermore, we observe that NA-FedAvg remains effective across all considered noise profiles while introducing minimal computational overhead (a \(\mathit {logsumexp}\) operator on the client model’s logits) and minimal client-side modifications during the FL training process. In particular, NA-FedAvg performance is on par with FedCorr for \(n_s \lt 70\%\) and even surpasses it in high label noise sparsity scenarios, whereas FedCorr requires extensive federated rounds and a computationally expensive process that involves training a GMM to produce a label noise estimate per client (see Section 5.2). Therefore, NA-FedAvg can be an ideal alternative for clients with minimal computational and storage resources to train federated models under label noise.
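To make the cost of the \(\mathit {logsumexp}\) operation concrete, the following minimal sketch computes an energy score from a batch of logits, assuming the formulation of Liu et al. [23]. The function name and the `temperature` parameter are illustrative assumptions, and the way NA-FedAvg turns such scores into per-client noise estimates is not reproduced here.

```python
import numpy as np


def energy_score(logits, temperature=1.0):
    """Energy score -T * logsumexp(logits / T), computed per row.

    Subtracting the row maximum before exponentiating keeps the
    computation numerically stable for large logits.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    z_max = z.max(axis=-1, keepdims=True)
    lse = z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1))
    return -temperature * lse


# A confident prediction (one dominant logit) versus a flat one.
scores = energy_score([[10.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
print(scores)
```

The confident row receives the lower (more negative) energy. The only work beyond the forward pass is one reduction over the class dimension per sample.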

5.3.3 Evaluation of FedLN with Different Model Architectures.

Our proposed methods for dealing with label noise in FL attain high performance across a wide range of classification tasks. Whereas in NNC the label correction process does not utilize clients’ models, both the AKD and NA-FedAvg approaches involve clients’ models. To ensure that FedLN efficacy is not tied to a specific network architecture or model optimization, we perform experiments on the CIFAR-10 and EuroSAT datasets, where we replace the ResNet-20 model with a recent convolution-based neural network named ConvMixer [37]. In particular, we choose ConvMixer-128/4, with a kernel size of 2, a patch size of 5, and approximately 0.1M parameters. Such a small memory footprint and low complexity render ConvMixer-128/4 ideal for on-device learning. In our experimentation with ConvMixer, we use the Adam optimizer with a learning rate of 0.001, similar to the original paper [37].
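The stated model size can be sanity-checked with a short parameter count. This sketch assumes the standard ConvMixer layout from [37] (a patch-embedding convolution followed by BatchNorm, then `depth` blocks of depthwise convolution + BatchNorm and pointwise convolution + BatchNorm, and a linear classifier), convolutions with biases, three input channels, and ten output classes. These choices are assumptions about the configuration, not details confirmed by the text.

```python
def convmixer_param_count(hidden_dim, depth, kernel_size, patch_size,
                          in_channels=3, num_classes=10):
    """Count trainable parameters of a ConvMixer with biased convolutions."""
    bn = 2 * hidden_dim  # BatchNorm scale and shift per channel
    patch_embed = in_channels * patch_size ** 2 * hidden_dim + hidden_dim + bn
    depthwise = kernel_size ** 2 * hidden_dim + hidden_dim + bn
    pointwise = hidden_dim ** 2 + hidden_dim + bn
    head = hidden_dim * num_classes + num_classes
    return patch_embed + depth * (depthwise + pointwise) + head


total = convmixer_param_count(128, 4, kernel_size=2, patch_size=5)
print(total)  # 81930
```

Under these assumptions the count is roughly 0.08M, which is consistent with the "approximately 0.1M parameters" quoted above.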
From the results presented in Table 4, we note that FedLN retains high efficacy even when a different architecture is used. In particular, for \(n_l\)=70%, we notice an increase of 5.32% in accuracy on average using FedLN compared to the standard “FedAvg,” with all three of the proposed approaches showing performance gains similar to the ones observed in Table 3. This indicates that both our embedding-based and model-confidence-based methods can be employed in FL to mitigate the effect of label noise and improve the model’s generalizability, irrespective of the architecture of the federated model to be learned.

5.3.4 Effectiveness of FedLN across Diverse Label Noise Settings.

So far, in our evaluation, we considered diverse noise settings, assuming that “clean” clients, holding well-annotated data, are present during the FL procedure. In this subsection, we assess the efficacy of FedLN when this assumption regarding the presence of “clean” clients is relaxed. To this end, we conduct further experiments on CIFAR-10, where the percentage of noisy clients and the amount of well-annotated data present on “clean” clients are varied.
Varying Number of Noisy Clients. Because clients’ labeling systems and users’ willingness (or ability) to annotate correctly vary significantly in a federated environment, the number of noisy (and clean) clients present in the FL process can fluctuate drastically. In this ablation study, we determine the effect of the number of noisy clients on FedLN performance. To this end, we vary the percentage of noisy clients (\(F\)) from 25% up to 100% for \(n_l\)=40%, utilizing both NNC and NA-FedAvg. In this way, we are able to assess the performance of both of our mechanisms for estimating label noise (namely, embeddings and the model’s prediction confidence). It is important to note that \(F\)=100% corresponds to a scenario where every client holds label noise at level \(n_l\) in its locally stored data (not to be confused with a completely noisy dataset).
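The distinction between \(F\)=100% and a completely noisy dataset can be sketched as follows. This is a simplified illustration that assumes uniform random label flipping and ignores the sparsity parameter \(n_s\), so it is not the noise injection scheme used in the experiments. The function name and its arguments are hypothetical.

```python
import numpy as np


def inject_client_noise(client_labels, frac_noisy_clients, noise_level,
                        num_classes, seed=0):
    """Corrupt a share `noise_level` of labels on a share of the clients."""
    rng = np.random.default_rng(seed)
    num_clients = len(client_labels)
    num_noisy = int(round(frac_noisy_clients * num_clients))
    noisy_ids = set(rng.choice(num_clients, size=num_noisy, replace=False).tolist())
    out = []
    for cid, labels in enumerate(client_labels):
        labels = np.array(labels, copy=True)
        if cid in noisy_ids:
            n_flip = int(round(noise_level * len(labels)))
            idx = rng.choice(len(labels), size=n_flip, replace=False)
            # A shift in [1, num_classes - 1] always yields a different class.
            shift = rng.integers(1, num_classes, size=n_flip)
            labels[idx] = (labels[idx] + shift) % num_classes
        out.append(labels)
    return out


# 30 clients (M=30) with 100 labels each, all clients noisy (F=100%), n_l=40%.
clean = [np.arange(100) % 10 for _ in range(30)]
noisy = inject_client_noise(clean, 1.0, 0.4, num_classes=10)
overall = float(np.mean([(a != b).mean() for a, b in zip(clean, noisy)]))
print(overall)
```

With every client noisy, the overall share of wrong labels equals \(n_l\) rather than 100%, so 60% of the labels remain correct.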
We present our findings in Table 5, where we note that the number of noisy clients has a relatively low impact on the ability of NNC to perform the label correction process, as the embeddings remain unaffected by the noise introduced by those clients. On the contrary, when trained with NA-FedAvg, the model’s performance deteriorates once all clients become noisy (\(F\)=100%). However, even with as few as 20% of clients being “clean,” the model’s performance reaches approximately 70% across all noise profiles, as shown in Table 3. Furthermore, in cases where the label noise is sparse across clients (\(F\)=25%), we observe that NA-FedAvg performance is superior to NNC. Although the overall number of noisy labeled instances has a negligible effect on the FL process, a small amount of label noise is introduced to “clean” clients due to slight imperfections in label estimation via NNC.
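The trends described above can be read directly from Table 5 by averaging each row over the four sparsity settings. The snippet below simply transcribes the reported accuracies; the variable names are illustrative.

```python
# Table 5 accuracies for n_s = 0%, 40%, 70%, 100%, keyed by F (% noisy clients).
table5 = {
    25: ([73.80, 74.84, 74.18, 72.84], [75.99, 75.97, 76.59, 76.02]),
    50: ([74.54, 73.59, 74.97, 74.23], [73.27, 73.45, 72.99, 74.36]),
    75: ([73.50, 74.49, 74.96, 73.95], [70.99, 70.74, 71.48, 70.80]),
    100: ([73.76, 73.41, 73.14, 72.49], [64.02, 62.62, 62.72, 64.22]),
}

row_means = {}
for frac, (nnc, na_fedavg) in table5.items():
    row_means[frac] = (sum(nnc) / len(nnc), sum(na_fedavg) / len(na_fedavg))
    print(f"F={frac:3d}%  NNC={row_means[frac][0]:.2f}  "
          f"NA-FedAvg={row_means[frac][1]:.2f}")
```

NA-FedAvg leads only at \(F\)=25%, whereas the NNC averages stay within roughly one point of each other across all values of \(F\).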
Varying Label Noise Distribution across Devices. Apart from the number of noisy devices in the federated setting, the quality of labels present on “clean” clients can also fluctuate in most pragmatic federated settings. This can happen due to imperfect labeling processes, which frequently occur in real life, either due to human error or an unforeseen mistake in an automated labeling system [30]. Therefore, in this ablation study, we aim to assess FedLN performance when, in addition to noisy clients, “clean” clients also hold a small amount of label noise. For this purpose, we perform experiments with both NNC and NA-FedAvg for \(n_l\)=40%, where we introduce a percentage of label noise to “clean” clients. Note that this percentage is expressed relative to the label noise level \(n_l\); thus, a value of 100% corresponds to label noise at the full level \(n_l\) being injected into the data of “clean” clients.
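As a rough worked example of this setup, assume all clients hold equally sized datasets and let \(p\) denote the percentage applied to “clean” clients (both the symbol \(p\) and the equal-size assumption are introduced here for illustration only). The overall fraction of noisy labels \(\bar{n}\) is then

\[
\bar{n} = F \cdot n_l + (1 - F) \cdot p \cdot n_l = n_l \left[ F + (1 - F)\, p \right].
\]

With \(F\)=80% and \(n_l\)=40%, this gives \(\bar{n}\)=32% for \(p\)=0%, \(\bar{n}\)=32.8% for \(p\)=10%, and \(\bar{n}\)=40% for \(p\)=100%, so the total amount of label noise changes only moderately while its distribution across clients changes substantially.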
From the results provided in Table 6, it can be observed that the performance of NA-FedAvg starts to deteriorate when label noise is injected into the data of “clean” clients. Specifically, with a noise addition of \(0.1 \times n_l\), an average drop in accuracy of 3% is noted across the noise profiles. This drop occurs as NA-FedAvg relies on the ability of any available “clean” client models to generalize faster, acting as a “source” for the detection of noisy labeled instances. Although the overall label noise present in the FL process is similar to that of the experiments reported in Table 5, here we introduce label noise to all “clean” clients instead of flipping them to noisy ones. As a result, the clear separation between the energy scores obtained from the two groups of clients, as illustrated in Figure 6, begins to diminish. Consequently, the ability of NA-FedAvg to determine a suitable threshold for identifying noisy labeled instances sharply deteriorates, leading to poor generalizability of the obtained model. In contrast, NNC remains effective in performing label correction and producing more generalizable models, even when \(n_l\) noise is introduced across all clients in FL. Therefore, NNC can be a preferable approach in cases where strong assumptions are made about the noise profile, such as the absence of clients with quality labels.
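The reported drop of about 3% can be checked against Table 6 by comparing the row for 10% (a noise addition of \(0.1 \times n_l\)) with the row for 0%, each averaged over the four sparsity settings. The values are transcribed from the table and the variable names are illustrative.

```python
# Table 6 rows for 0% and 10% of n_l on "clean" clients (n_s = 0/40/70/100%).
nnc = {0: [73.68, 73.64, 74.23, 71.44], 10: [74.39, 74.57, 74.70, 72.29]}
na_fedavg = {0: [69.52, 70.73, 70.61, 71.01], 10: [68.12, 68.13, 67.18, 67.53]}


def mean(values):
    return sum(values) / len(values)


na_drop = mean(na_fedavg[0]) - mean(na_fedavg[10])
nnc_change = mean(nnc[10]) - mean(nnc[0])
print(f"NA-FedAvg average drop: {na_drop:.2f} points")
print(f"NNC average change:     {nnc_change:+.2f} points")
```

The NA-FedAvg average falls by roughly 2.7 points, in line with the approximately 3% quoted above, while the NNC average does not decrease.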

5.3.5 Real-World Human Annotation Errors.

Since synthetic noise often mimics clear structures to enable statistical analyses, it may fail to model complex real-world noise patterns or biases, which impose additional challenges as compared to synthetic label noise. In an effort to evaluate FedLN performance in a more realistic label noise setting, we use the re-annotated versions of the CIFAR-10/100 datasets, which contain real-world human annotation errors, namely CIFAR-10/100N [41]. In this way, we are able to study FedLN performance with label noise in the wild. During the labeling process by human annotators, with the help of Amazon Mechanical Turk, a noise level of approximately 40% (\(n_l\)=40%) was observed. It is evident that human labeling efforts inevitably result in a considerable amount of label noise being introduced in the data. In the federated setting, this brings major challenges to users, who are required to provide well-annotated data to enjoy high-quality FL services, and further necessitates the development of approaches to adequately deal with label noise in FL.
We perform experiments on both CIFAR-10/100N datasets utilizing the FedLN approaches, and include standard FedAvg with the same training configuration to clearly illustrate the performance gain of our approach. For a fair comparison, we include CL and FedCorr in identical experiments, as these were the two strongest baselines in Table 3. Whereas the training set of CIFAR-10/100N was annotated by humans, the original noise-free test set of both datasets is used for evaluation. Furthermore, we randomly distribute the data across clients; thus, no clear group of “clean” clients exists in this case, which makes learning even more challenging. From the CIFAR-10N results presented in Figure 8, we note that FedLN retains its effectiveness when moving from synthetic to real-world noise patterns. In particular, the model’s recognition rate remains within 2% of those reported in Table 3 for \(n_l\)=40%. In the case of CIFAR-100N, where the complexity of the task is increased due to the large number of classes, we train all models for 500 federated rounds (\(R\)=500). In this case, we observe that embedding-based supervision via AKD outperforms the remaining approaches, validating that AKD can be especially beneficial if longer federated training is possible. Overall, we note that FedLN is able to largely address the challenges introduced by real-life noise patterns and improves the recognition rate by 9% compared to the standard FL process. Therefore, our approaches can be useful for everyday users who utilize FL services and are requested to provide labels for their data, either explicitly or through their own activities (e.g., the next word prediction in Gboard [20]), which are inherently noisy.
Fig. 8.
Fig. 8. Evaluation of FedLN in real-world label noise patterns from CIFAR-10/100N. Average accuracy over three distinct trials on the test set is reported for \(R\)=200/500 for CIFAR-10/100N, respectively. In AKD, we report the performance for the case of embeddings as the source of supervision. Remaining federated parameters are set to \(M\)=30, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%.

6 Conclusion

In this article, we studied the pragmatic problem of label noise under the federated setting. In the distributed scenario, clients’ well-annotated examples are sparse due to flaws in the labeling process, which originate either from deficient automatic-labeling techniques or users’ mistakes. To aggravate this problem, label noise in the federated setting can substantially differ among clients, as it is closely coupled to a number of client-dependent sources, such as the discrepancy between clients’ labeling systems or the difference in clients’ expertise and willingness to label data correctly. For this reason, and because of the scarcity of data in FL, most centralized approaches for handling noisy labels deteriorate in the federated setting, whereas the few FL approaches dealing with noisy labels introduce extensive overhead on the client side or rely on the availability of clean data on the server.
To address the lack of computationally efficient ways to deal with label noise during learning on-device models without relying on any additional data, we presented the FedLN framework, where we proposed three different approaches, each operating in a distinct phase of the FL process, providing label noise solutions in FL for diverse device compute characteristics. Apart from noise mitigation, FedLN provides a mechanism to perform label correction. Despite the simplicity of our approaches, namely NNC, AKD, and NA-FedAvg, we demonstrated that they can address noisy labeled instances across a wide range of label noise settings. We evaluated FedLN on several publicly available datasets, comparing its performance with several baselines in the federated setting. The models’ generalization we achieved is consistently superior to the considered baselines, whereas an evaluation of FedLN with in-the-wild label noise data showcased the effectiveness of FedLN under complex real-world label noise patterns and real-life scenarios.
In addition, the minimal communication footprint of FedLN in the FL process avoids the additional communication overhead typically associated with large-scale FL applications, making it a viable choice for deployment in various real-world applications. On that note, NA-FedAvg requires all clients to participate in computing the energy score in a fixed round (although this can be performed in an asynchronous fashion) to ensure an accurate assessment of clients’ noise levels, an assumption also made in the work of Xu et al. [44]. We believe that FedLN opens up promising avenues for future research, including an in-depth analysis of its computational efficiency on real hardware platforms, the detection of real-life noisy samples, and the exploration of combining different FedLN approaches within a single FL scheme, all contributing to the practicality and effectiveness of FedLN in real-world FL scenarios.

Appendix

Table 7.
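| Dataset | Setting | Method | \(n_l\)=0.0, \(n_s\)=0.0 | \(n_l\)=0.4, \(n_s\)=0.0 | \(n_l\)=0.4, \(n_s\)=0.4 | \(n_l\)=0.4, \(n_s\)=0.7 | \(n_l\)=0.4, \(n_s\)=1.0 | \(n_l\)=0.7, \(n_s\)=0.0 | \(n_l\)=0.7, \(n_s\)=0.4 | \(n_l\)=0.7, \(n_s\)=0.7 | \(n_l\)=0.7, \(n_s\)=1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | FedAvg | | 78.52 | 68.77 | 67.05 | 67.31 | 67.92 | 57.65 | 56.94 | 56.81 | 63.54 |
| CIFAR-10 | FedLN | AKD (Emb.) | 83.55 | 69.72 | 69.18 | 70.26 | 70.55 | 68.43 | 68.06 | 69.74 | 68.39 |
| CIFAR-10 | FedLN | AKD (Logits) | 78.54 | 63.42 | 62.79 | 67.45 | 68.63 | 54.09 | 53.38 | 59.14 | 64.73 |
| Fashion-MNIST | FedAvg | | 86.43 | 82.05 | 83.24 | 83.35 | 81.07 | 58.55 | 56.06 | 57.11 | 59.37 |
| Fashion-MNIST | FedLN | AKD (Emb.) | 86.91 | 83.97 | 84.06 | 83.53 | 82.35 | 80.77 | 78.63 | 80.41 | 78.64 |
| Fashion-MNIST | FedLN | AKD (Logits) | 85.82 | 79.11 | 80.82 | 80.28 | 80.82 | 73.39 | 68.57 | 71.94 | 78.61 |
| PathMNIST | FedAvg | | 87.05 | 78.82 | 77.06 | 76.68 | 77.03 | 54.74 | 52.49 | 53.22 | 58.61 |
| PathMNIST | FedLN | AKD (Emb.) | 87.82 | 86.42 | 82.83 | 81.46 | 83.26 | 79.94 | 78.63 | 78.31 | 76.02 |
| PathMNIST | FedLN | AKD (Logits) | 85.98 | 77.77 | 79.94 | 75.38 | 80.87 | 57.56 | 59.36 | 49.36 | 60.04 |
| Speech Commands | FedAvg | | 96.31 | 81.83 | 82.53 | 82.44 | 80.33 | 72.34 | 70.34 | 70.89 | 72.39 |
| Speech Commands | FedLN | AKD (Emb.) | 84.82 | 86.42 | 84.83 | 85.46 | 83.26 | 79.94 | 76.63 | 78.31 | 76.02 |
| Speech Commands | FedLN | AKD (Logits) | 95.69 | 80.39 | 81.07 | 82.56 | 81.64 | 73.11 | 72.74 | 74.14 | 69.67 |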
Table 7. Performance Evaluation for FedLN’s AKD Approach, Where the Globally Aggregated Model’s Outputs Are Exploited as the Source of Supervision
Average accuracy over three distinct trials on the test set is reported. Federated parameters are set to \(R\)=200, \(M\)=30, \(F\)=80%, \(E\)=1, \(q\)=80%, and \(\sigma\)=25%. Emb. denotes embedding.

References

[1]
Ehsan Amid, Manfred K. Warmuth, Rohan Anil, and Tomer Koren. 2019. Robust bi-tempered logistic loss based on Bregman divergences. arXiv:1906.03361 (2019).
[2]
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. 2019. Unsupervised label noise modeling and loss correction. arXiv:cs.CV/1904.11238 (2019).
[3]
Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A closer look at memorization in deep networks. In Proceedings of the International Conference on Machine Learning. 233–242.
[4]
Yingbin Bai, Erkun Yang, Bo Han, Yanhua Yang, Jiatong Li, Yinian Mao, Gang Niu, and Tongliang Liu. 2021. Understanding and improving early stopping for learning with noisy labels. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS ’21). 1–12.
[5]
Daniel J. Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D. Lane. 2020. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390 (2020).
[6]
Yiqiang Chen, Xiaodong Yang, Xin Qin, Han Yu, Biao Chen, and Zhiqi Shen. 2020. FOCUS: Dealing with label quality disparity in federated learning. arXiv:cs.LG/2001.11359 (2020).
[7]
Terrance DeVries and Graham W. Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552 (2017).
[8]
Shaoming Duan, Chuanyi Liu, Zhengsheng Cao, Xiaopeng Jin, and Peiyi Han. 2022. Fed-DR-Filter: Using global data representation to reduce the impact of noisy labels on the performance of federated learning. Future Generation Computer Systems 137 (2022), 336–348.
[9]
Xiuwen Fang and Mang Ye. 2022. Robust federated learning with noisy and heterogeneous clients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10072–10081.
[10]
Jacob Goldberger and Ehud Ben-Reuven. 2017. Training deep neural-networks using a noise adaptation layer. In Proceedings of the 5th International Conference on Learning Representations: Conference Track (ICLR ’17).
[11]
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. arXiv:1706.04599 (2017).
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385 (2015). https://arxiv.org/abs/1512.03385
[13]
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2018. Introducing Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. In Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium. 204–207.
[14]
Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates. https://proceedings.neurips.cc/paper_files/paper/2018/file/ad554d8c3b06d6b97ee76a2448bd7913-Paper.pdf
[15]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531 (2015).
[16]
Michael E. Houle. 2013. Dimensionality, discriminability, density and distance distributions. In Proceedings of the 2013 IEEE 13th International Conference on Data Mining Workshops. 468–473.
[17]
Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. 2021. Advances and Open Problems in Federated Learning. Foundations and Trends® in Machine Learning. Now Foundations and Trends.
[18]
Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. In Proceedings of the NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1610.05492
[19]
Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.
[20]
David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. 2019. Federated learning for keyword spotting. arXiv:eess.AS/1810.05512 (2019).
[21]
Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. 2020. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics. 4313–4324.
[22]
Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 3 (2020), 50–60.
[23]
Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li. 2020. Energy-based out-of-distribution detection. arXiv:2010.03759 (2020).
[24]
Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, and Sanjiv Kumar. 2020. Does label smoothing mitigate label noise? arXiv:2003.02819 (2020).
[25]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 1273–1282.
[26]
Rafael Müller, Simon Kornblith, and Geoffrey Hinton. 2020. When does label smoothing help? arXiv:cs.LG/1906.02629 (2020).
[27]
Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. 2019. Confident learning: Estimating uncertainty in dataset labels. arXiv:1911.00068 (2019).
[28]
Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. 2016. Making deep neural networks robust to label noise: A loss correction approach. arXiv:1609.03683 (2016).
[29]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 8748–8763.
[30]
Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel. Proceedings of the VLDB Endowment 11, 3 (Nov. 2017), 269–282.
[31]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv:1412.6550 (2014).
[32]
Aaqib Saeed, David Grangier, and Neil Zeghidour. 2021. Contrastive learning of general-purpose audio representations. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’21). IEEE, Los Alamitos, CA, 3875–3879.
[33]
Joel Shor and Subhashini Venugopalan. 2022. TRILLsson: Distilled universal paralinguistic speech representations. arXiv preprint arXiv:2203.00236 (2022).
[34]
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. 2020. Learning from noisy labels with deep neural networks: A survey. arXiv:2007.08199 (2020).
[35]
Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, and Dominik Roblek. 2019. Self-supervised audio representation learning for mobile devices. arXiv preprint arXiv:1905.11796 (2019).
[36]
Cheng Tan, Jun Xia, Lirong Wu, and Stan Z Li. 2021. Co-learning: Learning from noisy labels with self-supervision. In Proceedings of the 29th ACM International Conference on Multimedia. 1405–1413.
[37]
Asher Trockman and J. Zico Kolter. 2022. Patches are all you need? arXiv:2201.09792 (2022).
[38]
Vasileios Tsouvalas, Aaqib Saeed, and Tanir Ozcelebi. 2022. Federated self-training for semi-supervised audio recognition. ACM Transactions on Embedded Computing Systems 21, 6 (Feb. 2022), Article 74, 26 pages.
[39]
Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey. 2019. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE International Conference on Computer Vision.
[40]
P. Warden. 2018. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv e-prints arXiv:cs.CL/1804.03209 (2018).
[41]
Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. 2021. Learning with noisy labels revisited: A study using real-world human annotations. CoRR abs/2110.12088 (2021).
[42]
Xiaobo Xia, Tongliang Liu, Bo Han, Chen Gong, Nannan Wang, Zongyuan Ge, and Yi Chang. 2021. Robust early-learning: Hindering the memorization of noisy labels. In Proceedings of the International Conference on Learning Representations.
[43]
Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747 (2017).
[44]
Jingyi Xu, Zihan Chen, Tony Q. S. Quek, and Kai Fong Ernest Chong. 2022. FedCorr: Multi-stage federated learning for label noise correction. arXiv:cs.LG/2204.04677 (2022).
[45]
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. 2023. MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification. Scientific Data 10 (2023), 41.
[46]
Miao Yang, Hua Qian, Ximin Wang, Yong Zhou, and Hongbin Zhu. 2022. Client selection for federated learning with label noise. IEEE Transactions on Vehicular Technology 71, 2 (2022), 2193–2197.
[47]
Seunghan Yang, Hyoungseob Park, Junyoung Byun, and Changick Kim. 2022. Robust federated learning with noisy labels. IEEE Intelligent Systems 37, 2 (March 2022), 35–43.
[48]
Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. 2018. Applied federated learning: Improving Google keyboard query suggestions. arXiv:cs.LG/1812.02903 (2018).
[49]
Bixiao Zeng, Xiaodong Yang, Yiqiang Chen, Hanchao Yu, and Yingwei Zhang. 2022. CLC: A consensus-based label correction approach in federated learning. ACM Transactions on Intelligent Systems and Technology 13, 5 (June 2022), Article 75, 23 pages.
[50]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. Mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=r1Ddp1-Rb
[51]
Jinghui Zhang, Dingyang Lv, Qiangsheng Dai, Fa Xin, and Fang Dong. 2023. Noise-aware local model training mechanism for federated learning. ACM Transactions on Intelligent Systems and Technology 14, 4 (May 2023), Article 65, 22 pages.
[52]
Zhaowei Zhu, Zihao Dong, and Yang Liu. 2021. Detecting corrupted labels without training a model to predict. arXiv:2110.06283 (2021).

    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 2
    April 2024
    481 pages
    EISSN: 2157-6912
    DOI: 10.1145/3613561
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 February 2024
    Online AM: 09 October 2023
    Accepted: 15 August 2023
    Revised: 21 July 2023
    Received: 04 October 2022
    Published in TIST Volume 15, Issue 2

    Author Tags

    1. Federated learning
    2. noisy labels
    3. label correction
    4. deep learning
    5. knowledge distillation

    Qualifiers

    • Research-article

    Funding Sources

    • ECSEL Joint Undertaking

    Cited By

    • (2024) Robust Learning under Hybrid Noise. ACM Transactions on Intelligent Systems and Technology 16:2 (1-27). DOI: 10.1145/3709149. Online publication date: 23-Dec-2024
    • (2024) Federated Learning with Instance-Dependent Noisy Label. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (8916-8920). DOI: 10.1109/ICASSP48485.2024.10447823. Online publication date: 14-Apr-2024
    • (2024) Communication-Efficient Federated Learning Through Adaptive Weight Clustering And Server-Side Distillation. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (5805-5809). DOI: 10.1109/ICASSP48485.2024.10447174. Online publication date: 14-Apr-2024
    • (2024) Collaboratively Learning Federated Models from Noisy Decentralized Data. 2024 IEEE International Conference on Big Data (BigData) (7879-7888). DOI: 10.1109/BigData62323.2024.10825502. Online publication date: 15-Dec-2024
    • (2023) On the Impact of Label Noise in Federated Learning. 2023 21st International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt) (183-190). DOI: 10.23919/WiOpt58741.2023.10349830. Online publication date: 24-Aug-2023
    • (2023) Robust Networked Federated Learning for Localization. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (1193-1198). DOI: 10.1109/APSIPAASC58517.2023.10317125. Online publication date: 31-Oct-2023
