Federated Edge Learning: Design Issues and Challenges
Afaf Taïk and Soumaya Cherkaoui
Abstract—Federated Learning (FL) is a distributed machine learning technique, where each device contributes to the learning model by independently computing the gradient based on its local training data. It has recently become a hot research topic, as it

significant delays can be caused by stragglers. Moreover, communication loads across devices limit the scalability of FL for large models. Participating devices communicate full model updates during every training iteration, which are of the
FEEL algorithms, as we advocate in this article.
The main contributions of this article can be summarized as follows:
• We discuss the FEEL challenges imposed by the nature of the edge environment, from an algorithm design perspective. We review the challenges related to computational and communication capacities, as well as data properties, as they are at the core of the trade-offs in learning and resource optimization algorithms.
• We propose a general framework for incorporating data properties in FEEL, by providing a guideline for a thorough algorithm design, and criteria for the choice of diversity measures in both datasets and models.
• We present several possible measures and techniques to evaluate data and model diversity, which can be applied in different scenarios (e.g., classification, time series forecasting), in an effort to assist fellow researchers in further addressing FEEL challenges.
The remainder of this article is organized as follows. In Section II, we review the challenges found in designing FEEL algorithms, and we derive the main trade-offs. Then, we shed light on a new data-aware design direction for FEEL algorithms in Section III, where some possible techniques and methods to evaluate diversity are detailed. Finally, a conclusion and final remarks are presented in Section IV.
II. DESIGN CHALLENGES: OVERVIEW
FEEL has several constraints related to the nature of the edge environment. In fact, FEEL involves the participation of heterogeneous devices that have different computation and communication capabilities, energy states, and dataset characteristics. Under device and data heterogeneity, in addition to resource constraints, participant selection [7] and resource allocation [8] have to be optimized for an efficient FEEL solution.

A. Design Challenges
The core challenges associated with solving the distributed optimization problem are twofold: Resources and Data. These challenges increase the complexity of the FEEL setting compared to similar problems, such as distributed learning in data centers.
Resources: The challenges related to the resources, namely computation, storage, and communication, are mainly in terms of their heterogeneity and scarcity.
Heterogeneity of the resources: The computation, storage, and communication capabilities vary from one device to another. Devices may be equipped with different hardware (CPU and memory), network connectivity (e.g., 4G/5G, Wi-Fi), and may differ in available power (battery level). The gap in computational resources creates challenges such as delays caused by stragglers. FEEL algorithms must therefore be adaptive to the heterogeneous hardware and tolerant toward device drop-out and low or partial participation. A potential solution to the straggler problem is asynchronous learning. However, the reliability of asynchronous FL and the model convergence in this setting are not always guaranteed. Thus, synchronous FL remains the preferred approach.
Limited Resources: In contrast to the cloud, the computing and storage resources of the devices are very limited. Therefore, the models that can be trained on-device are relatively simpler and smaller than the models trained on the cloud. Furthermore, devices are frequently offline or unavailable, either due to low battery levels or because their resources are fully or partially used by other applications.
As for the communication resources, the available bandwidth is limited. It is therefore important to develop communication-efficient methods that allow devices to send compressed or partial model updates. To further reduce the communication cost in FEEL settings, two potential directions are generally considered: 1) reducing the total number of communication rounds until convergence [9], and 2) reducing the size of the transmitted updates through compression and partial updates [4].
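As an illustration of the second direction, the following is a minimal sketch of top-k sparsification, one common way to compress an update before transmission; the function names and the fixed sparsity ratio are illustrative choices of ours, not taken from [4].

import numpy as np

def sparsify_top_k(update, ratio=0.01):
    # Keep only the k largest-magnitude entries of a flattened model update.
    # Only (index, value) pairs are transmitted, so the payload shrinks to
    # roughly `ratio` of the dense update (plus the index overhead).
    flat = update.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx, values, shape):
    # Server-side reconstruction of a dense update from the sparse payload.
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)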
Data: In most cases, data distributions depend on the users' behaviour [10]. As a result, the local datasets are massively distributed, statistically heterogeneous (i.e., non-IID and unbalanced), and highly redundant. Additionally, the raw generated data is often privacy-sensitive, as it can reveal personal and confidential information.
Small and widely distributed datasets: In FEEL scenarios, a large number of devices participate in the FL training with a small average number of data samples per client. Learning from small datasets makes local models prone to overfitting.
Non-IID: The training data on a given device is typically based on the usage of the device by a particular user, and hence any particular user's local dataset will not be representative of the population distribution. This data-generation paradigm fails to comply with the independent and identically distributed (IID) assumptions of distributed optimization, and thus adds complexity to the problem formulation and convergence analysis. The empirical evaluation of FEEL algorithms on non-IID data is usually performed on artificial partitions of MNIST or CIFAR-10, which do not provide a realistic model of a federated scenario.
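For context, such artificial partitions are commonly built by sorting the dataset by label and dealing a few label-sorted shards to each client; the sketch below shows this construction (the client and shard counts are illustrative, not values used in the article).

import numpy as np

def label_sorted_partition(labels, num_clients=100, shards_per_client=2, seed=0):
    # Pathological non-IID split: each client receives a few contiguous,
    # label-sorted shards, so it only observes one or two classes.
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                              # group sample indices by label
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    return [np.concatenate([shards[s] for s in shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
            for c in range(num_clients)]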
Unbalance: Similarly to the nature of the distributions, the size of the generated data depends on the user. Depending on how users use their devices, devices may hold varying amounts of local training data.
Redundancy: The unbalance of the data is also observed within the local dataset of a single device. In fact, IoT data is highly redundant. In sequential data (e.g., video surveillance, sensor data), for instance, only a subset of the data is informative or useful for the training.
Privacy: The privacy-preserving aspect is an essential requirement in FL applications. The raw data generated on each device is protected by sharing only model updates instead of the raw data. However, model updates communicated throughout the training process can still be reverse-engineered to reveal sensitive information, either by a third party or by a malicious central server.

B. Design Trade-offs
Several efforts were made to tackle the aforementioned challenges. However, FEEL is a multi-dimensional problem
diversity. Nonetheless, the number of collected updates in this setting might be low. Fairness is also considered in the aggregation by q-Fair FL (q-FFL) [17], which reweighs the objective function in FedAvg to assign higher weights in the loss function to devices with higher loss.
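For reference, the q-FFL objective takes the following form (our rendering; see [17] for the exact formulation), where F_k is the local objective of device k, p_k its aggregation weight, and a larger q shifts more weight toward devices with higher local loss (q = 0 recovers the standard FedAvg objective):

\min_{w} \; F_q(w) = \sum_{k=1}^{N} \frac{p_k}{q+1} \, F_k(w)^{q+1}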
Another approach is to use data size priority, which maximizes the size of the data used in the training, by using a probability of selection inversely proportional to the available dataset's size. Fundamentally, these scheduling algorithms all share the same idea: if the size of the training data is large, then the training would converge faster. However, IoT data is highly redundant and inherently unbalanced. Thus, many of the proposed algorithms witness a drop in performance in non-IID and unbalanced experiments. Therefore, the data properties should be considered throughout the FEEL algorithm.
III. DATA-AWARE FEEL DESIGN: FUTURE DIRECTION
Even though FL was first proposed with data as a central aspect, data has been overlooked in the design of proposed FEEL scheduling algorithms. Given the significant drop in accuracy of models trained with resource-aware FEEL algorithms in non-IID and unbalanced settings, it becomes clear that the data aspect should be considered. Hence, we propose a new possible data-aware end-to-end FEEL solution based on the diversity properties of the different datasets. In general, diversity consists of two aspects, namely, richness and uncertainty. Richness quantifies the size of the data, while uncertainty quantifies the information contained in the data. In fact, it has long been proven in Active Learning that by choosing highly uncertain data samples, a model can be trained using fewer labelled data samples. This fact suggests that data uncertainty should be incorporated into the design of FL scheduling algorithms. Nonetheless, the uncertainty measures used in Active Learning target individual samples from unlabeled data in a centralized setting; thus, these measures cannot be directly integrated in FEEL. In the FEEL setting, the updates' scheduling can take place either before the training or after it; therefore, the diversity measures should be selected depending on the time of scheduling. If scheduling before the training is preferred, then the datasets' diversity is to be considered. Otherwise, if the scheduling is set after the training is over, the diversity to be considered is model diversity, as the diversity of the dataset can be reflected by the resulting model. In both cases, in addition to maximizing the diversity through careful selection of participating devices, the scheduling algorithm can focus on minimizing the consumed resources in terms of FL completion time and transmission energy of participating devices. For the pre-training scheduling, local computation energy can also be optimized. Furthermore, the scheduling problems' constraints are to be derived from the environment's properties concerning resources and data.
In this section, to better illustrate the data-aware solutions, we consider the architecture illustrated in Figure 2. The architecture is a cellular network composed of one base station (BS) equipped with a parameter server, and N devices that collaboratively train a shared model. In the following, we discuss different constraints related to the scheduling algorithms in this setting. Then, we present pre-training and post-training algorithm guidelines, where we detail the key criteria for the design of data-aware FEEL solutions, and we present some potential measures and methods to enable a variety of data-aware FEEL applications, which are summarized in Figure 3.
A. Scheduling Constraints
The scheduling algorithms must consider the following constraints that arise from the FEEL environment's properties:
Energy consumption: Due to the limited energy levels and the high computational requirements of training algorithms, it is necessary to evaluate a device's battery level before scheduling it for a training round. When FL was first proposed, the selected devices were limited to the ones plugged in for charging. However, this criterion limits the number of devices that can be selected, leading to a slow convergence of the learning.
Radio channel state: It is important to consider radio channel state changes in the scheduling. The quality of the communication is critical for both the device selection and resource allocation.
Expected completion time: The available computation resources, alongside the data size, can be used to estimate the completion time of the device. Potential stragglers can be discarded even before the training process.
Number of participants: A communication round cannot be considered valid unless a minimum number of updates is obtained. Therefore, a training round can be dropped if there are not enough devices to schedule.
Data size: If the available data on a device is smaller than a required minimum, the device can be immediately discarded from the selection process. For instance, if the number of samples is less than the selected mini-batch size, the device should be excluded.

B. Pre-training scheduling: Dataset Diversity
The pre-training scheduling that we propose uses dataset diversity to choose the devices that will conduct the training and send the updates. Scheduling the devices before the training allows eliminating potential stragglers and adapting the number of epochs based on the battery levels available at the participating devices.
1) Scheduling algorithm:
In this algorithm, the global model is initialized by the BS. Afterwards, the following steps are repeated until the model converges or a maximum number of rounds is attained (a sketch of the server-side loop follows the list):
• Step 1: At the beginning of each training round, the devices send their diversity indicators and battery levels to the server.
• Step 2: Based on the received information, along with the evaluated channel state indicator, the server schedules a subset of devices and sends them the current global model.
• Step 3: Each device in the subset uses its local data to train the model.
• Step 4: The updated models are sent to the server to be aggregated.
• Step 5: The PS aggregates the updates and creates the new model.
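The following is a minimal sketch of this server-side loop. The device methods report() and local_train(), the aggregate() helper, and the scoring rule that combines diversity, battery level, and channel state are all illustrative assumptions of ours, not interfaces defined in the article.

def pretraining_round(global_model, devices, channel_state, k):
    # One round of the pre-training (dataset-diversity) scheduling.
    # Step 1: devices report their diversity indicator and battery level.
    reports = {d.id: d.report() for d in devices}            # {device id: (diversity, battery)}
    # Step 2: the server scores devices and schedules a subset of size k.
    def score(d):
        diversity, battery = reports[d.id]
        return diversity * battery * channel_state[d.id]     # illustrative scoring rule
    scheduled = sorted(devices, key=score, reverse=True)[:k]
    # Steps 3-4: scheduled devices train locally and return their updated models.
    updates = [d.local_train(global_model) for d in scheduled]
    # Step 5: the parameter server aggregates the updates into the new global model.
    return aggregate(global_model, updates)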
2) Datasets Diversity Measures:
In the pre-training scheduling, dataset diversity will serve essentially as a lead for device selection, which should prioritize devices that have potentially informative datasets with less redundancy, in order to speed up the learning process. While the richness of a dataset can be easily quantified through the total number of samples, the uncertainty of a dataset depends strongly on the application. For supervised learning, the uncertainty can be evaluated through the evenness of the dataset (i.e., the degree of balance between the classes in classification problems), which can be calculated through entropy measures.
For sequence data, the uncertainty is reflected by the regularity of the series. Moreover, for unsupervised learning, local dissimilarity between pseudo-classes or randomly sampled data points can be considered. Furthermore, it is essential to consider privacy as a component of the used index: sending the number of samples from each class, for instance, is a violation of the privacy principle of FEEL. In the following, we introduce some potential methods to evaluate dataset diversity.
Diversity measures for classification: Measures of diversity have long been used in Active Learning, where uncertainty is used to choose the samples that should be labeled, as this task is costly. However, in FL, the client selection does not concern independent samples; instead, the diversity should be evaluated at the level of the entire dataset. Moreover, in the premise of supervised FL, the labels are already known, which gives the possibility to use more informed measures. For instance, Shannon Entropy and the Gini-Simpson index are suitable measures for a dataset's uncertainty in classification problems. Both indexes favor IID partitions: the maximum is obtained for balanced distributions, and a dataset containing a single class has the minimum possible value. The Shannon entropy quantifies the uncertainty (entropy or degree of surprise) of a prediction. It was first proposed to quantify the information content in strings of text. The underlying idea is that when a text contains more different letters, with almost equal proportional abundances, it will be more difficult to correctly predict which letter will come next in the string. However, Shannon Entropy is not defined for the case of classes with no representative samples. Therefore, it may not be practical in scenarios with high unbalance. Another possible measure is the Gini-Simpson index. The Simpson index 𝜆 measures the probability that two samples taken at random from the dataset of interest are from the same class. The Gini–Simpson index is its transformation 1 − 𝜆, which represents the probability that the two samples belong to different classes. Nonetheless, if the number of classes is large, distinguishing datasets using this index becomes hard.
Diversity measures for time-series forecasting: In time series problems, other methods can be used, such as Approximate Entropy (ApEn) and Sample Entropy (SampEn). In sequential data, statistical measures such as the mean and the variance are not enough to illustrate the regularity, as they are influenced by system noise. ApEn was proposed to quantify the amount of regularity and the unpredictability of time-series data. It is based on the comparison between values of data in successive vectors, by quantifying how many data points vary by more than a defined threshold. SampEn was proposed as a modification of ApEn. It is used for assessing the complexity of time-series data, with the advantage of being independent of the length of the vectors.
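A compact sketch of Sample Entropy for a univariate series follows; the default tolerance of 0.2 standard deviations is a common convention rather than a value prescribed in the article, and the pairwise computation is quadratic in the series length, so it is intended for short local windows.

import numpy as np

def sample_entropy(x, m=2, r=None):
    # SampEn = -ln(A / B), where B (resp. A) counts pairs of length-m
    # (resp. length-(m+1)) subsequences whose Chebyshev distance is <= r.
    # Lower values indicate a more regular (less diverse) series.
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()
    def count_matches(dim):
        templates = np.array([x[i:i + dim] for i in range(len(x) - dim + 1)])
        dists = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=-1)
        return (dists <= r).sum() - len(templates)   # count pairs with i != j
    b, a = count_matches(m), count_matches(m + 1)
    return float(-np.log(a / b)) if a > 0 and b > 0 else float("inf")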
Diversity measures for clustering tasks: For clustering tasks, a similarity measure between data points from a randomly sampled subset should be considered. The measure can be distance based (e.g., Euclidean distance, Heat Kernel) or angular based (e.g., cosine similarity). A higher value of the resulting index is obtained if most of the data points in the sample are dissimilar, and thus the dataset should be considered more diverse. It should be noted that angular based measures are invariant to scale, translation, rotation, and orientation, which makes them suitable for a wide range of applications, particularly multivariate datasets.
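One possible instantiation of this idea, sketched here as our own illustration: the average pairwise cosine dissimilarity over a small random subset of the local data points.

import numpy as np

def cosine_diversity(data, sample_size=64, seed=0):
    # Mean pairwise cosine dissimilarity (1 - cosine similarity) over a random
    # subset of rows of `data`; values closer to 1 suggest a more diverse dataset.
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(data))
    if n < 2:
        return 0.0
    rows = data[rng.choice(len(data), size=n, replace=False)]
    normed = rows / (np.linalg.norm(rows, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T                       # pairwise cosine similarities
    mean_off_diag = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return float(1.0 - mean_off_diag)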
C. Post-training scheduling: Model Diversity
The post-training setting uses model diversity to choose the devices that will send their updates. The model diversity is evaluated on two different aspects: 1) by comparing the dissimilarity between the local model's parameters and the previous global model's parameters; and 2) by evaluating the diversity within the model's parameters. In fact, choosing the local models that diverge from the previous global model will possibly improve the representational ability of the global model directly, by aggregating updates that carry potentially new information. Furthermore, if a dataset is highly unbalanced and limited in size, the model's parameters will be very similar. The redundancy within the parameters negatively affects the model's representational ability. It is therefore necessary to prioritize updates with high diversity. In the following, we detail the post-training scheduling algorithm, then we present some possible measures for model diversity.
1) Scheduling algorithm:
Similarly to the pre-training scheduling, the global model is initialized by the BS. Afterwards, the following steps are repeated until the model converges or a maximum number of rounds is attained:
• Step 1: At the beginning of each training round, devices receive the current model.
• Step 2: Each device uses its local data to train the model.
• Step 3: The server sends an update request to the devices, to which each device responds by sending its model diversity index.
• Step 4: Based on the received information, along with the evaluated channel state indicator, the server schedules a subset of devices to upload their models. Then, the updated models are sent to the server to be aggregated.
• Step 5: The PS aggregates the updates and creates the new global model.
2) Model Diversity Measures:
While the richness aspect of diversity is irrelevant for model diversity, since the model size is fixed across devices, the information contained in the models can be quantified through how the local models vary compared to the global model, and how the parameters within the same model differ from each other. Some possible measures are as follows:
Local and global models' dissimilarity: Choosing the local models that diverge from the previous global model will possibly improve the representational ability of the global model directly [9]. Pairwise similarity measures such as cosine
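A minimal sketch of one such parameter-level measure, the cosine dissimilarity between a flattened local model and the previous global model; this is our own illustration of the idea, not the article's exact formulation.

import numpy as np

def model_divergence(local_params, global_params):
    # Cosine dissimilarity between a local model and the previous global model,
    # both given as lists of parameter arrays; larger values suggest the local
    # update carries more new information relative to the global model.
    l = np.concatenate([np.ravel(p) for p in local_params])
    g = np.concatenate([np.ravel(p) for p in global_params])
    cos = float(l @ g / (np.linalg.norm(l) * np.linalg.norm(g) + 1e-12))
    return 1.0 - cos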
Fig. 3: Diversity measures that can be used in pre-training and post-training scheduling