Report
Contents

II  Related work
    i   General ideas
    ii  Reduction of communication cost
    iii Improvement of aggregation

III Methodology
    i   Basic knowledge of Reinforcement Learning
    ii  Management of participation through a contribution evaluation
    iii Management of participants for the next round

IV  Experiments
    i   Generation of data
    ii  Baseline - Federated Averaging
    iii Results on contribution evaluation algorithm
    iv  Results on the selection of next participants algorithm

A   Appendix
    i   Figures
    ii  Concise table on different algorithms of Reinforcement Learning

B   References

C   Glossary
i.   Introduction
Federated Learning (FL) was introduced by McMahan et al. (2017) and has since attracted increasing interest from ML researchers. They proposed a decentralized collaborative approach in which multiple devices train a global model without sharing their local data. In most cases, devices have limited resources, such as computation and communication capacity, and are constrained by how the user operates them. Federated Learning can meet the expectations of specific fields such as natural language processing, computer vision, health care, transportation, finance, smart cities, robotics, networking, blockchain and others. In other words, training a model without exchanging local data opens up new possibilities for parties to benefit from one another while preserving privacy.
    However, Li et al. (2020) analyze the convergence of FedAvg, the algorithm based on Stochastic Gradient Descent (SGD) suggested by McMahan et al. (2017), on non-IID data (data that are not Independent and Identically Distributed). Indeed, in realistic settings there is no guarantee that the data are IID. This paper establishes a convergence rate of $O(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD iterations. Moreover, several problems are not covered by the FedAvg approach, namely making FL a completely distributed method, dealing with fairness of contributions, and minimizing energy consumption. Section iv will cover the different challenges of Federated Learning.
    In this study, we focus on Reinforcement Learning (RL), a branch of ML in which an agent and an environment are the basic components. The agent interacts with the environment by taking actions and receives rewards according to the impact of the actions taken. The objective of the agent is to learn to behave so as to maximize the expected long-term reward. RL has demonstrated great potential in various applications, notably communications. Section iii will introduce the basic knowledge of Reinforcement Learning.
    When we consider the kinds of applications of FL, different settings may block or reduce the efficiency of traditional FL methods. More precisely, when local data are not labeled, or when the distribution of data is non-IID, RL can help to deal with this type of issue. Through observation of the environment, Qi et al. (2021) put forward that FL can be divided into two categories according to the distribution characteristics of the data. More details are given in section i.1.
    Federated Reinforcement Learning (FRL) is the combination of RL and FL. Generally, the dimensions of sample, feature and label in FL are replaced by environment, state and action respectively in FRL. On the one hand, RL can be implemented directly on clients' devices, taking the place of each client's model. This can be a solution when data are not labeled or when data are located in a particular environment that depends on the usage of the user; for instance, mobile phones are not used in the same way in cities and in rural areas. On the other hand, RL can play the role of a scheduler in order to select clients in a better way. It turns out the agent is able to solve an optimization problem by finding the parameters that minimize the objective function.
Formulation Consider $N$ parties $F_i,\ \forall i \in [1, N]$, which are established and used to train the cooperative ML model. Also, there are $N$ datasets $D_i,\ \forall i \in [1, N]$. The FL model is a set of parameters $w_i$ which are learned based on the datasets $D_i$. The loss function of each dataset $D_i$ at party $F_i$ is defined as follows:
$$F_i(w) = \frac{1}{|D_i|} \sum_{j \in D_i} f_j(w)$$
where $f_j(w)$ represents the loss of sample $j$ with the given model parameter vector $w$ and $|\cdot|$ represents the size of the dataset. Now, we can define the global loss function obtained when multiple parties train the global model together without sharing a dataset:
$$F_g(w) = \sum_{i=1}^{N} \eta_i F_i(w)$$
In addition, the federated model is expected to perform close to a model trained on the pooled data:
$$|\nu_{SUM} - \nu_{FED}| < \delta$$
where $\nu$ is the estimated performance (accuracy, recall, F1-score, etc.) and $\delta$ is a non-negative real number.
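To make this formulation concrete, the short sketch below shows the weighted aggregation that the global objective suggests, under the common FedAvg assumption that the weights $\eta_i$ are proportional to the dataset sizes $|D_i|$ (the report leaves $\eta_i$ generic); the function and variable names are illustrative only.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors w_i, mirroring the
    weighted global objective F_g(w) = sum_i eta_i F_i(w).

    client_params: list of 1-D numpy arrays (one parameter vector per client)
    client_sizes:  dataset sizes |D_i|, used here as the weights eta_i
                   (the usual FedAvg choice; the report leaves eta_i generic)
    """
    sizes = np.asarray(client_sizes, dtype=float)
    eta = sizes / sizes.sum()                    # eta_i, sums to 1
    stacked = np.stack(client_params)            # shape (N, d)
    return (eta[:, None] * stacked).sum(axis=0)  # aggregated parameter vector

# Example: three clients with dataset sizes 10, 30 and 60
w_global = aggregate([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)], [10, 30, 60])
```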
    Despite the good performance of FL, this method has difficulty fitting dynamic environments, specifically cooperative control and optimal decision-making. To solve these real-world challenges, RL and Deep Reinforcement Learning (DRL) have shown excellent performance on many problems.
Figure 1: The agent-environment interaction loop: following its policy, the agent takes an action $A_t$; the environment returns the next state $S_{t+1}$ and the reward $R_{t+1}$.
The agent seeks to maximize the discounted sum of rewards $\sum_{t} \gamma^{t} r_t$, where $r_t$ is the reward at step $t$ and $\gamma \in [0, 1[$ is the discount rate. Since $\gamma$ is less than 1, distant future events are weighted less than events in the immediate future. For instance, if we would like to reduce the energy consumption of devices in Federated Learning, the agent is rewarded when the impact of its actions is beneficial and, on the contrary, punished (negative reward) when the energy consumption is worse than in the previous state.
    Bellman (1957), who worked on Markov Decision Processes (MDP), found a way to estimate the optimal action-value $Q^*(s, a)$, which is the maximum expected return achievable by following any strategy after seeing some sequence $s$ and then taking some action $a$:
$$Q^*(s, a) = \max_{\pi}\ \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]$$
    After obtaining the optimal Q values, it is simple to define the optimal policy $\pi^*(s)$: in state $s$, the agent must choose the action which has the highest Q value in that state. In other words,
$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\ Q^*(s, a)$$
    We can use the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$, which converges to $Q^*$ as $i \to +\infty$. A function approximator is defined to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A neural network can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that change at each iteration $i$:
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim p(s, a)}\left[ \big( y_i - Q(s, a; \theta_i) \big)^2 \right]$$
where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$ is the target for iteration $i$ and $p(s, a)$ is a probability distribution over sequences $s$ and actions $a$, referred to as the behaviour distribution.
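As an illustration only, a minimal numpy sketch of this loss is given below; `next_q_values` stands for the outputs of the previous (frozen) network $Q(\cdot;\theta_{i-1})$ on a batch of transitions, and all names are hypothetical rather than taken from the report.

```python
import numpy as np

def dqn_targets(rewards, next_q_values, gamma=0.99, terminal=None):
    """y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}) for a batch of transitions.

    next_q_values: (batch, n_actions) array from the previous (frozen) network
    terminal:      optional boolean mask; terminal states are not bootstrapped
    """
    bootstrap = next_q_values.max(axis=1)
    if terminal is not None:
        bootstrap = np.where(terminal, 0.0, bootstrap)
    return rewards + gamma * bootstrap

def dqn_loss(q_values, actions, targets):
    """Empirical version of L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2]."""
    q_taken = q_values[np.arange(len(actions)), actions]  # Q(s, a; theta_i)
    return np.mean((targets - q_taken) ** 2)
```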
Client challenges
   1. In real-world applications, data are not necessarily labeled. In this case, the local models and the global model have to learn through unsupervised or reinforcement techniques. For instance, suppose we would like to use the datasets of three different sales-oriented clothing applications. On the one hand, each application manages its data as it was designed to; the applications are not expected to follow the same design or "codification", so it is harder to identify corresponding features between datasets. On the other hand, we are not able to label the data, since the scopes of the applications differ from one another. The objective is to gather the information of the applications in order to compute a local update without having to preprocess the data.
   2. Even in the case of supervised learning with a labeled training dataset, there is still no guarantee on the data distribution. In other words, some clients may present specific characteristics, such as their location or their habits, which influence the local dataset. This requires dealing with non-IID data; moreover, from a global viewpoint the data may be unbalanced, which means the classes are not distributed identically.
   3. Likewise, Internet of Things (IoT) or embedded devices often have limited resources. In some cases, the computation or the communication with the server can be slowed down or even interrupted during the activity, for example when the battery is discharged (computation) or when the WiFi or 4G connection is lost (communication). This problem is called the straggler problem.
Communication challenges
   1. Communication plays a central role in FL. It must be minimized to speed up the training of the global model, because the number of communications represents the bottleneck of FL, especially in the cross-device configuration.
   2. Besides, the energy consumption is closely related to the number of communications and to the amount of information transferred between the device and the server. The energy consumption of computation must also be minimized but, we assume, on a given embedded device, the more we reduce the energy consumption, the more the training is slowed down.
   3. During communication, global model parameters and local updates are exchanged between participants and the server. This introduces a risk of attacks, not only because private information still remains in the local updates, but also because attackers are able to reconstruct information from the parameters. Privacy must be protected during communication.
    Due to the different habits of clients, local updates impact the global model differently according to their contributions. In other words, clients who have spent more time using their device participate more, and therefore the global model will depend more on their usage. Fairness is then not respected because of the imbalance between the contributions of participants.
Server challenges
  1. Federated Learning algorithms often start with the assumption of a centralized server. However,
     in order to apply those algorithms in a fully distributed way, the server has to be removed and
     replaced by a peer-to-peer network.
  2. In some situations, a centralized approach remains a better solution than a distributed network. One way to reduce communication is to spread the flow across Ultra Edge Computing, Edge Computing and Cloud Computing. Each level of computing could take charge of different services in the interest of reducing the workload.
   Challenges of FL and some technical solutions to address them are illustrated in Figure 2. Horizontal Federated Learning and Vertical Federated Learning are described in Qi et al. (2021), with more details in section i.1.
Figure 2: Challenges of Federated Learning and candidate techniques. Client challenges: data distribution, labeled/unlabeled data, straggler problem, scalability, cross-silo and cross-device settings, Horizontal and Vertical Federated Learning. Communication challenges: clients send local updates and the server sends the global model, communication cost, energy consumption, fairness, privacy. Server challenges: Ultra Edge, Edge and Cloud computing. Candidate techniques include Bayesian Learning, Transfer Learning, Continual Learning, Hierarchical FL, Causal Learning, Blockchain and Reinforcement Learning.
v.   Motivation
Various papers propose solutions that achieve excellent results with IID data, whereas in real-world situations non-IID data are far more common. Moreover, working on the local data of users means dealing with non-IID data, because each user has their own habits and interests. Furthermore, in some situations data are unlabeled. This problem can be illustrated when local data are collected on mobile phones through different applications. Applications offer different services and collect data in the way each application was designed to. Because of this, it is not necessarily possible to merge datasets in order to generate a unique local dataset for each client. Difficulties arise when an application does not expose its features, because the analysis and preprocessing of the data are then not feasible with a generic algorithm. Moreover, this implies that the models on clients' devices should be able to learn through unsupervised learning algorithms.
    From another point of view, due to the protection of users' privacy, the part that can be improved the most is the server side. In other words, the server receives all the updates from the clients and decides how to aggregate them and train the global model, while an additional process, called the scheduler, is in charge of selecting the clients for each round.
    This paper aims to propose a solution for applications that tend towards unsupervised learning. We assume the setting known as cross-device learning, which means the number of clients is very large and only a fraction of the clients is available at any time. We assume the primary bottleneck is communication, knowing that devices generally have heterogeneous connections.
i.    General ideas
i.1   HFRL and VFRL
The comprehensive survey by Qi et al. (2021) focuses on Federated Reinforcement Learning and puts forward ideas where Reinforcement Learning could solve specific situations in which FL could be improved "to deal with cooperative control and optimal decision-making in dynamic environments". The authors introduce two scenarios: Horizontal Federated Reinforcement Learning (HFRL) and Vertical Federated Reinforcement Learning (VFRL).
    In the HFRL scenario, the environment that each agent interacts with is independent of the others, while the state spaces and action spaces of the different agents are aligned to solve a similar problem. A typical example is an autonomous driving system, where various environment information is collected to train the autonomous driving models locally. However, multiple factors such as weather conditions or driving regulations may differ from one country to another, even if the vehicles have the same operations (braking, acceleration, steering, etc.). Users could share the learned experience of their model with each other without revealing their driving data.
    In opposition, in the VFRL scenario multiple agents interact with the same global environment, but each can only observe limited state information within the scope of its view. For example, for a photovoltaic management company, the balance between power generation and consumption should be optimized. It depends on the electrical usage of household users and on the decisions of the power company about the dispatch of generators. The electricity bill can be seen as a "reward" for household users, while the power company gets "rewards" from the management of power generation. Neither wants to share its private data with the other. Instead, VFRL can help to improve policy decisions without exposing specific data.
Small-Scale (on one Ultra or Edge node): Caching Policy, Traffic Engineering, ...
Medium-Scale: Resource Allocation, Caching Policy, Computation Offloading Policy, Traffic Engineering, ...
Large-Scale: Resource Allocation, Caching Policy, Computation Offloading Policy, Traffic Engineering, ...
Figure 3: Taxonomy of applying Deep Reinforcement Learning in mobile edge system by Wang et al. (2019)
The figure shows each edge device running a DRL agent with a MainNet and a TargetNet (replaced periodically): the loss and its gradient are computed from $Q(s_t, a_t; \theta)$ and $Q(s_{t+1}, a^-; \theta^-)$ with $a^- = \operatorname{argmax}_a Q(s_{t+1}, a; \theta)$, actions $a_t = \operatorname{argmax}_a Q(s_t, a; \theta)$ are produced by a multi-layer perceptron interacting with the environment (transitions $s_t, a_t, s_{t+1}$), and devices upload model updates to and download the global model from the central server.
Figure 4: General process of DRL with Federated Learning over mobile edge system (figure inspired by
          Wang et al. (2019))
    Federated Learning remains a learning technique that requires many resources, especially energy. The papers in the literature that deal with energy consumption (Zhan et al. (2020), Zarandi and Tabassum (2021), Chen et al. (2021)) formulate an optimization problem in which several quantities (computation delay, communication delay, bandwidth, ...) are taken into account. They introduce an RL algorithm whose goal is to find the optimized parameters that minimize the objective function given the current state of the environment.
    In some scenarios, the data obtained on the client side are not labeled, for instance due to a high labeling cost or the difficulty of annotation. FSSL (Federated Semi-Supervised Learning), introduced by Jeong et al. (2021), tries to solve two challenges: when the clients have both labeled and unlabeled data, and when the labeled data are only available on the server. Instead of using a Reinforcement Learning technique, the authors base their idea on an inter-client consistency loss, which is one of the most popular approaches to learning from unlabeled examples in a semi-supervised setting. As well, to find the parameter decomposition for disjoint learning (supervised and unsupervised), they minimize a loss term based on the cross entropy between the labeled data and the model output.
    However, in most settings, the distribution of the data in the local datasets of devices is not IID. Ghosh et al. (2019) propose sparse ternary compression, which is specifically designed for unbalanced and non-IID data. Their algorithm does not use RL and outperforms other algorithms on the CIFAR dataset in unbalanced and non-IID settings.
where $N_e$ denotes the number of pieces of equipment participating in the training, $c_l^t(i)$ represents the local training time of device $i$ at time $t$ and $c_c^t(i)$ represents the communication cost of device $i$ at time $t$. Moreover, the total training quality cost over all equipment, $C_{qu}^t$, is considered as:
$$C_{qu}^t = \sum_{i=1}^{N_e} \underbrace{\sum_{j} Loss\big(y_j - \hat{w}^t(x_j)\big)}_{\sigma_i}$$
where $\sigma_i$ quantifies the quality of the network model and $\hat{w}^t$ represents the training model aggregated at time $t$. Thus, the total cost in time slot $t$ is:
$$C^t = C_{time}^t + C_{qu}^t$$
An RL agent is used to find the optimal solution that minimizes the total cost. The agent observes the current state, which is characterized by the performance of the IIoT equipment, the quality of the local data and the gradient loss of this equipment. Then the agent takes an action $a_t$, a vector whose size is the number of nodes, filled with 0s and 1s, where $\delta_i^t = 1$ if node $i$ is selected and $\delta_i^t = 0$ otherwise. The evaluation method is a weighted average:
$$r(s_t, a_t) = -\frac{\sum_{i=1}^{n} C_i^t \cdot a_i^t}{\sum_{i=1}^{n} a_i}$$
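As a simple illustration of this reward, the sketch below computes the negative average total cost over the selected devices; the per-device costs are assumed to be given and the function and variable names are ours, not the paper's.

```python
import numpy as np

def selection_reward(time_costs, quality_costs, actions):
    """r(s_t, a_t) = - sum_i C_i^t a_i^t / sum_i a_i^t, with C_i^t = C_time,i^t + C_qu,i^t."""
    a = np.asarray(actions, dtype=float)                      # binary selection vector
    total_cost = np.asarray(time_costs) + np.asarray(quality_costs)
    if a.sum() == 0:                                          # no device selected
        return 0.0
    return -float((total_cost * a).sum() / a.sum())
```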
    The performance evaluation of this paper's solution reports the accuracy and the reduction of communication cost on the MNIST dataset with IID and non-IID data. However, the evolution of the total reward is not shown, which makes it hard to assess the efficiency of the RL agent. Even if the convergence of the loss value during federated learning is pointed out, there is a lack of explanation of how the participants were selected. We could conclude that the agent figures out that the best way to minimize the communication cost is to select the fastest devices with the best data quality. In other words, the algorithm faces two issues:
   • fairness is not taken into account during the training of the global model; in other words, the agent may set aside some participants and prioritize others, since its only goal is to minimize the communication cost.
   • there is no reason to believe that the agent would be able to find new participants over time if it keeps selecting the same participants to maximize its reward. For instance, if devices change behavior over time because of their usage, the agent could get stuck in a local minimum due to a lack of exploration.
$s^t$ and $a^t$ are the sets of states and actions collected from all agents at time $t$.
    Like the previous paper, Zhang et al. (2021) formulate the problem to solve as an optimization problem. They suppose that "a subset of clients will be early rejected by the central server while the rest clients will continue to finish their local training process". They are interested in the communication latency $H_{t,n}^{u}$ for uploading the local model from a client $n$. As well, $B_n^t$ represents the communication cost for sending updates from client $n$ to the central server, and the local training time of client $n$ is denoted $H_{t,n}^{c}$. $H_t$ represents the total processing latency of training round $t$, and $B_t$ the total communication cost for round $t$:
$$H_t = \max_{1 \le n \le N} \left( H_{t,n}^{u} + H_{t,n}^{c} \right) a_n^t \quad \text{and} \quad B_t = \sum_{n} B_n^t a_n^t$$
where $Acc(t)$ represents the accuracy and $A = [a_n^t]$ is a $T \times N$ matrix for client selection, with $T$ the number of rounds and $N$ the number of workers. $w_1, w_2, w_3$ are chosen by the user to design the FL application. $Obj_1$ must be maximized in order to get the best accuracy, while $Obj_2$ must be minimized to reduce the communication latency and $Obj_3$ must also be minimized to reduce the communication cost.
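For clarity, a small sketch of the latency and cost terms entering these objectives is given below, assuming the per-client latencies and costs are available as arrays; the helper name is ours.

```python
import numpy as np

def round_latency_and_cost(upload_latency, train_latency, comm_cost, actions):
    """H_t = max_n (H^u_{t,n} + H^c_{t,n}) a^t_n  and  B_t = sum_n B^t_n a^t_n.

    All arguments are per-client arrays for round t; `actions` is the binary
    selection vector a^t_n (1 = client kept, 0 = client early rejected).
    """
    a = np.asarray(actions, dtype=float)
    H_t = float(np.max((np.asarray(upload_latency) + np.asarray(train_latency)) * a))
    B_t = float(np.sum(np.asarray(comm_cost) * a))
    return H_t, B_t
```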
    Now, the state $s_n^t$ of agent $n$ at round $t$ is defined as follows:
$$s_n^t = \left( L_n^t,\ H_{t,n}^{p},\ H_{t,n}^{u},\ B_n^t,\ D_n,\ t \right)$$
where $D_n$ is the size of the local dataset and $t$ is the round index. On the central server, each MARL agent $n$ produces a binary action $a_n^t \in \{0, 1\}$, where 0 indicates that the device will be terminated after the probing training and vice versa. The reward $r_t$ at training round $t$ is defined accordingly. Note that $U(\cdot)$ is a utility function defined to ensure that the difference $U(Acc(t)) - U(Acc(t-1))$ is non-null when the improvement of $Acc(t)$ is small near the end of the FL process. Also, we recall that for each client $n$, $H_{t,n}^{p}$ represents the training latency, $H_{t,n}^{rest}$ is the time required to finish the local training process and $H_{t,n}^{u}$ is the communication latency. $B_t$ is the total communication cost, as defined above.
Figure 5: Decisions made by FedMarl. Each dot represents a client device. The blue dot means the client
          is selected for local training, the red dot means the client is early rejected (results from Zhang
          et al. (2021))
The number of communication rounds for CGA is quite high: in the experiments, twice that of SGP and four times that of SwarmSGD.
Knowledge function algorithm In the robotic field, inputs are generally based on sensor data attributes. The degree of confirmation of the action (output) that the robot chooses to perform is defined as the "confidence value". The idea is basically to highlight when there is a significant differentiation in the evaluation of the Q-values, which is called the scoring process. For instance, the values (20, 19, 16, 14, 20) are less differentiated than the values (5, 4, 80, 8, 10). The information entropy is chosen to characterize the confidence as a quantitative function of robotic confidence:
$$c_j = -\frac{1}{\ln m} \sum_{i=1}^{m} \frac{score_{ij}}{\sum_{i=1}^{m} score_{ij}} \cdot \ln\left( \frac{score_{ij}}{\sum_{i=1}^{m} score_{ij}} \right)$$
where $m$ is the action size of robot $j$ and $n$ is the number of private networks. The memory weight of robot $j$ is specified as:
$$w_j = \frac{1 - c_j}{\sum_{j=1}^{n} (1 - c_j)}$$
and the knowledge fusion function, which is the proportion of the confidence of the robots, as:
$$label_j = score \times \left( c_1, c_2, \ldots, c_m \right)^{T}$$
Also, in the cloud, the loss function along with the optimal parameter are defined as:
$$y_i = \sum_{j=1}^{num} c_j \cdot w_j$$
$$L(y, h_\theta(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - h_\theta(x_i) \right)^2$$
$$\theta^* = \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, h_\theta(x_i)\big)$$
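The following sketch illustrates the confidence and memory-weight computations from a score matrix, under the assumption that the scores are positive so that each normalized column is a distribution; names are illustrative, not taken from the paper.

```python
import numpy as np

def confidence_and_weights(scores):
    """Entropy-based confidence c_j and memory weights w_j from a score matrix.

    scores: (m, n) array, column j holds the m action scores of network j
            (assumed positive so the normalised column is a distribution).
    """
    m, _ = scores.shape
    p = scores / scores.sum(axis=0, keepdims=True)       # normalise each column
    c = -(p * np.log(p)).sum(axis=0) / np.log(m)          # c_j, normalised entropy in [0, 1]
    w = (1.0 - c) / (1.0 - c).sum()                       # memory weights w_j
    return c, w

c, w = confidence_and_weights(np.array([[20, 5], [19, 4], [16, 80], [14, 8], [20, 10]], float))
# The second column (5, 4, 80, 8, 10) is more peaked, so its entropy (c_j) is lower.
```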
Transferring approaches One idea is to take the shared model as the initial actor network, while another is to use the shared model as a feature extractor. When the shared model is used as the initial actor network, training turns out to be unstable, even if good scores are observed at the beginning. In contrast, the feature extractor increases the dimension of the features and improves performance stably.
One highlighted problem occurs when there is a structural difference between the input layers of the shared network and of a private network. For example, most neural networks in robots are convolutional neural networks (CNN), because CNNs process images as input and their outputs are described as features. But if a non-image sensor, such as a laser radar, is used, the Q-network is not a CNN; the output of the entire network is then used as additional features.
To summarize, each private neural network and the shared network generate actions which are called scores. They are stored and organized as a score matrix where each column corresponds to the scores of a specific network. After normalizing each column, the knowledge fusion is applied, that is, the computation of the confidences ($c_j$) and labels ($label_j$). In other words, labels are generated for all sample data. Finally, a network for generation $(k + 1)$ is created and fits the sample data as closely as possible.
Results and discussion The major benefit of this approach is that it acts as a kind of unsupervised learning technique: from unlabeled data, the algorithm is able to label the sample data based on the confidence. Moreover, the transferring idea is designed as a feature extractor in order to improve the shared network. However, several issues should be pointed out:
   • there is no information about how the shared network is supposed to have access to data;
   • to construct the score matrix, there is no reason that the dimension of the samples is the same for every neural network, which means the number of scores may vary between networks depending on their environment (i.e. their local data at time t);
   • the formula of the output y_i, which is based on the memory weight w_j, is not fully justified as the target used in the loss function.
where $L_v$ is the loss of the global model on the validation set $D_v$ and $\delta$ is a baseline calculated as a moving average of the previous losses $L_v$ with a moving-average window $T > 0$.
    The global model parameters are updated as:
$$\theta_G^{t+1} \leftarrow \theta_G^{t} - \frac{\alpha_\theta}{\sum_{i=1}^{N} s_i^t} \sum_{i=1}^{N} s_i^t \nabla \theta_i^t$$
The probability associated with $\theta_i^t$ is introduced as $P(s_i^t = 0) = \omega_i^t$. Considering a parameterized stochastic policy $\pi_\theta(a|s)$, the probability of a trajectory (with $S_t$ the state and $A_t$ the action at time-step $t$) is:
$$p(\tau|\pi) = \rho_0(S_0) \prod_{t=0}^{T} p(S_{t+1}|S_t, A_t)\, \pi(A_t|S_t)$$
with $\rho_0(S_0)$ the initial state distribution and $\tau$ the trajectory (sequence of state, action, reward). Then, the probability of $S^t$ under the policy $\phi$ is defined as:
$$p(S^t|\phi) = \prod_{t=1}^{T} \left[ (\omega_i^t)^{s_i^t} \cdot (1 - \omega_i^t)^{1 - s_i^t} \right]$$
and:
$$\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R_t] &= \nabla_\theta \int_{\tau_t} R_t\, p(\tau_t|\theta)\, d\tau_t \\
&= \int_{\tau_t} R_t\, \nabla_\theta\, p(\tau_t|\theta)\, d\tau_t \\
&= \int_{\tau_t} R_t\, p(\tau_t|\theta)\, \nabla_\theta \log p(\tau_t|\theta)\, d\tau_t \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R_t\, \nabla_\theta \log p(\tau_t|\theta) \right]
\end{aligned}$$
Then the evaluator's model parameters $\phi$ can be optimized by gradient ascent with learning rate $\alpha_\phi$:
$$\phi^{t+1} \leftarrow \phi^{t} + \alpha_\phi \sum_{i=1}^{n} r(S^t)\, \nabla_\phi \log p(S^t|\phi)\Big|_{\phi^t}$$
    The F-RCCE algorithm offers promising results. Indeed, the authors of the paper simulated FL systems with 50 to 500 clients, with a total of around 1000 training iterations. From 100 to 500 clients, the time cost only increases by 6% (whereas the leave-one-out (LOO) baseline increases linearly). The paper highlights that removing the highest contribution slows down model convergence. Moreover, removing the lowest contributions also has a negative impact, because they still have positive effects on the model. When the amount of clean data is large, the calculated gradient is relatively stable, which correlates with the client's contribution. However, the task model in the experiments is a logistic regression classifier, which means the size of the gradients is not large, contrary to a deep learning model. In other words, if the task model requires many parameters, then the size of the gradients times the number of task models (i.e. the number of participants) could be a significant input size for the evaluator's model. Moreover, the assumption of having a dataset on the server to compute the reward function raises some questions: there is no information about the size or the distribution of the dataset on the server, even if the dataset is supposed to be anonymized to preserve privacy.
III.         Methodology
This chapter covers the algorithms tested in the experiments. First, an introduction to the fundamental knowledge of Reinforcement Learning clarifies how the algorithms are built. Second, the first algorithm is designed to better evaluate the contribution of participants to the aggregation. Finally, the second algorithm is designed to optimize several criteria by better selecting the participants at each round.
The immediate reward is $R_t = R(S_t, A_t)$, where the action $A_t$ belongs to a finite set of actions. $\gamma \in [0, 1]$ is the reward discount factor.
    The discounted return is the weighted sum of rewards, which gives more weight to the closer steps:
$$G_{t \in [0, T]} = R(\tau) = \sum_{t=0}^{T} \gamma^t R_t$$
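A minimal sketch of this quantity, together with the "reward-to-go" variant used later for the policy gradient, is shown below; both can be computed in a single backward pass over a list of rewards.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t R_t for one trajectory."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def rewards_to_go(rewards, gamma):
    """A_hat_{t'} = sum_{t >= t'} gamma^(t - t') R_t, computed backwards in O(T)."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(rewards_to_go([1, 1, 1], gamma=0.9))       # [2.71, 1.9, 1.0]
```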
    The agent follows a policy $\pi$, and its objective is to maximize the expected reward through this policy. The policy is defined as:
$$\pi(a|s) = P(A_t = a \mid S_t = s)$$
The action-value function represents the expected return from state $S$ when taking action $A$:
$$Q_\pi(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right]$$
    The expected return of the policy $\pi$, for any trajectory $\tau$ and any reward function $R$, is defined as:
$$J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$
    The optimal policy is the policy the agent aims for in order to maximize the expected return, and it is defined as:
$$\pi^* = \underset{\pi}{\operatorname{argmax}}\ J(\pi)$$
where $\theta_G^t$ are the parameters of the global task model at time $t$, $\theta_i^t$ is the local task model of participant $i$ at time $t$ after its training step, $a_i \in \{0, 1\}$ is the stochastic action drawn from the probabilities output by the agent, and $k$ is the number of participants. The reward is designed to push the agent to find the contribution configuration that best improves the accuracy:
$$R_t = Acc_G^t - \delta_t = \frac{1}{\sum_{i}^{k} a_i} \sum_{i}^{k} a_i\, Acc_i(\theta_i^t) - \delta_t$$
where $Acc_i(\cdot)$ is the accuracy of participant $i$ and $\delta_t$ is the exponential moving average of the previous accuracies $Acc_G$. $Acc_G$ is an approximation of the accuracy of the global task model. The exponential moving average applies weighting factors that decrease exponentially; in other words, recent accuracies have more impact than older ones.
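The sketch below shows one way this reward and its moving baseline could be computed; `alpha_exp` mirrors the $\alpha_{exp}$ parameter of Algorithm 1, and the remaining names are ours.

```python
import numpy as np

def contribution_reward(accuracies, actions, delta, alpha_exp=0.9):
    """Reward of the contribution-evaluation agent.

    R_hat  = (1 / sum_i a_i) * sum_i a_i * Acc_i   (approximate global accuracy)
    R_t    = R_hat - delta                          (improvement over the EMA baseline)
    delta' = delta + (1 - alpha_exp) * (R_hat - delta)
    """
    a = np.asarray(actions, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    r_hat = float((a * acc).sum() / max(a.sum(), 1.0))
    reward = r_hat - delta
    new_delta = delta + (1.0 - alpha_exp) * (r_hat - delta)
    return reward, new_delta
```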
    The Reinforcement Learning algorithm used is the REINFORCE algorithm, which is explained theoretically below. This algorithm encourages the agent to take actions that lead to greater cumulative reward, by weighting the gradient of each action's log-probability by that action's cumulative reward.
$$\begin{aligned}
\sum_{t=0}^{T} A_t \sum_{t'=0}^{t} B_{t'} &= B_0 (A_0 + A_1 + A_2 + \cdots + A_T) \\
&\quad + B_1 (A_1 + A_2 + \cdots + A_T) \\
&\quad + B_2 (A_2 + \cdots + A_T) \\
&\quad + \ldots \\
&\quad + B_T A_T \\
&= \sum_{t'=0}^{T} B_{t'} \sum_{t=t'}^{T} A_t
\end{aligned}$$
The following statement can be verified: $\mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(A_{t'}|S_{t'})\, b(S_{t'}) \right] = 0$. Any normalized distribution satisfies:
$$\begin{aligned}
& \int_{x} P_\theta(x)\, dx = 1 \\
\iff\ & \nabla_\theta \int_{x} P_\theta(x)\, dx = \nabla_\theta 1 = 0 \\
\iff\ & \int_{x} \nabla_\theta P_\theta(x)\, dx = 0 \\
\iff\ & \int_{x} P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx = 0 \\
\iff\ & \mathbb{E}_{x \sim P(\cdot|\theta)}\left[ \nabla_\theta \log P_\theta(x) \right] = 0
\end{aligned}$$
$$\implies \mathbb{E}_{A_{t'} \sim \pi(\cdot|\theta)}\left[ \nabla_\theta \log \pi_\theta(A_{t'}|S_{t'})\, b(S_{t'}) \right] = 0$$
Moreover, it is better to choose $b(S_t)$ with a variance $Var(b(S_t))$ close to 0 and strongly correlated with $\sum_{t=t'}^{T} \gamma^{t-t'} R_t$, because if $X$ and $Y$ are two random variables then
$$Var(X - Y) = Var(X) + Var(Y) - 2\,cov(X, Y)$$
with $cov(X, Y)$ the covariance between $X$ and $Y$. If $Y$ has a variance close to 0 then:
$$Var(X - Y) \approx Var(X) - 2\,cov(X, Y)$$
The intuition is that subtracting such a baseline reduces the variance. The baseline $b(S_t)$ is often the value function $V(S_t)$ in actor-critic algorithms.
This distribution can be verified to be normalized. For convenience, let $S$ be the sum $\sum_{A^t \in \mathcal{A}} \prod_{i=1}^{n} (p_i^t)^{a_i} (1 - p_i^t)^{1 - a_i}$ and let $\mathcal{A}$ be the set of all possible actions, $\mathcal{A} = \left\{ (a_1, \ldots, a_n) \mid \forall i \in [1, n],\ a_i \in \{0, 1\} \right\}$. Then the expression can be reorganized to verify the result.
$$\begin{aligned}
S &= (p_1^t)^0 (1-p_1^t)^1 \times (p_2^t)^0 (1-p_2^t)^1 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times (p_2^t)^0 (1-p_2^t)^1 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + (p_1^t)^0 (1-p_1^t)^1 \times (p_2^t)^1 (1-p_2^t)^0 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + \ldots \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times (p_2^t)^1 (1-p_2^t)^0 \times \cdots \times (p_n^t)^1 (1-p_n^t)^0
\end{aligned}$$
Factoring out the two possible terms for the first participant,
$$\begin{aligned}
S &= (p_1^t)^0 (1-p_1^t)^1 \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big] \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big] \\
  &= \Big[ (p_1^t)^0 (1-p_1^t)^1 + (p_1^t)^1 (1-p_1^t)^0 \Big] \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big]
\end{aligned}$$
Repeating the same factorization for the remaining participants gives
$$\begin{aligned}
S &= \Big[ (p_1^t)^0 (1-p_1^t)^1 + (p_1^t)^1 (1-p_1^t)^0 \Big] \times \Big[ (p_2^t)^0 (1-p_2^t)^1 + (p_2^t)^1 (1-p_2^t)^0 \Big] \\
  &\quad \times \cdots \times \Big[ (p_n^t)^0 (1-p_n^t)^1 + (p_n^t)^1 (1-p_n^t)^0 \Big] \\
  &= 1 \times 1 \times \cdots \times 1 = 1
\end{aligned}$$
Therefore the elements of the output belong to the range [0, 1] and sum to 1.
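A quick numerical check of this normalization, enumerating all $2^n$ binary action vectors for arbitrary selection probabilities, can be sketched as follows.

```python
import itertools
import numpy as np

# Check that S = sum over all binary vectors a of prod_i p_i^{a_i} (1 - p_i)^{1 - a_i} equals 1
p = np.array([0.2, 0.7, 0.55, 0.9])                 # arbitrary selection probabilities p_i^t
S = sum(np.prod([p[i] if a[i] else 1 - p[i] for i in range(len(p))])
        for a in itertools.product([0, 1], repeat=len(p)))
print(S)  # ~1.0 up to floating-point error
```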
    Nevertheless, this means that when the probabilities are distributed in a homogeneous way, the probability for one participant to be selected depends strongly on the probabilities of the other participants. For example, with 100 participants, a uniform distribution means that each probability of being selected is $\frac{1}{100}$. Consequently, participants would have little chance of being selected for the next round, which would probably lead to an action where all participants are discarded.
    With the aim of preventing this difficulty, the output is not a vector with $k$ elements, where $k$ is the number of participants, but a $k \times 2$ matrix whose first column corresponds to the probability of being selected and whose second column corresponds to the probability of being discarded. The softmax function is therefore applied row-wise, so the probabilities of each row sum to 1. The architecture of the neural model must produce an output with $k \times 2$ values. Two ideas are suggested in Figure 6:
(a) Information is treated globally and the hidden layers do not take into account the independence between the probabilities of selection and the probabilities of discard
(b) Information is treated separately, which allows the neural network to focus on the probabilities of selection and the probabilities of discard
Figure 6: Different architectures of neural networks producing an output with two columns. Green represents neurons for selection and violet represents neurons for discard.
    The architecture on the left side of Figure 6 raises several difficulties. The output of the neural network should be a $k \times 2$ matrix; however, to obtain this shape with this architecture, the last output must be reshaped from $(2k) \times 1$ to $k \times 2$. There is no better reason for choosing the first $k$ elements for the first column and the last $k$ elements for the second column than filling the output matrix with randomly chosen elements, and the size of the hidden layers would have to be doubled to have enough parameters. In contrast, the architecture on the right side of Figure 6 separates selection and discard, which makes the information easier to handle. This architecture is therefore used for the neural network of the agent.
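A possible sketch of the retained architecture (right side of Figure 6) is shown below in PyTorch; the layer sizes and names are illustrative and not taken from the report.

```python
import torch
import torch.nn as nn

class SelectionPolicy(nn.Module):
    """Two separate branches produce k "select" logits and k "discard" logits;
    a row-wise softmax turns each (select, discard) pair into probabilities
    that sum to 1, giving the k x 2 output described above."""
    def __init__(self, k, hidden=64):
        super().__init__()
        self.select_branch = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.discard_branch = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, accuracies):                    # accuracies: (batch, k) state S_t
        sel = self.select_branch(accuracies)          # (batch, k)
        dis = self.discard_branch(accuracies)         # (batch, k)
        logits = torch.stack([sel, dis], dim=-1)      # (batch, k, 2)
        return torch.softmax(logits, dim=-1)          # each row sums to 1

policy = SelectionPolicy(k=10)
probs = policy(torch.rand(1, 10))                     # (1, 10, 2)
actions = torch.bernoulli(probs[..., 0])              # sample selection from the first column
```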
local data set. Participants estimate the accuracy of their local model θi . After this step, the server
downloads in its turn all the local models and their accuracies. The agent, on the server, observes
a state St , a vector of k accuracies (Acc0 , . . . , Acck ), and takes an action At , a matrix of probabilities
where the first column is the probability of being selected and the second column is the probability
of being discarded. The action is then translated into a vector of k elements (0 for discarded and
1 for selected), generated by applying the Bernoulli law to the first column of the previous matrix.
Depending on this decision, the agent receives a reward Rt . The moving batch is updated by removing
the oldest transition and adding the new transition (St , At , Rt ). The agent then uses the moving
batch for training. When the moving batch is full, the model of the agent can be trained by using
the following formula:
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t'=0}^{T} \nabla_\theta \log \pi_\theta(A_{t'} \mid S_{t'}) \sum_{t=t'}^{T} \gamma^{\,t-t'} R_t \right]
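    A hedged sketch of how this gradient can be estimated from the transitions of the moving batch
is shown below; the reward-to-go computation and tensor handling are assumptions, not code taken
from the repository.

import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient matches the formula above, for one trajectory.

    log_probs: list of tensors log pi_theta(A_t' | S_t'), one per step
    rewards:   list of scalar rewards R_t
    """
    T = len(rewards)
    loss = 0.0
    for t_prime in range(T):
        # reward-to-go: sum_{t >= t'} gamma^(t - t') * R_t
        ret = sum(gamma ** (t - t_prime) * rewards[t] for t in range(t_prime, T))
        loss = loss - log_probs[t_prime] * ret  # minimizing -J ascends J
    return loss

Averaging this loss over the trajectories of the moving batch, calling backward() and taking one
optimizer step on the agent parameters performs the update described above.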
Algorithm 1 EvaluatorFL. There are k participants; Bl is the local minibatch size, E is the number
of local epochs and η is the learning rate of the task model; R is the number of rounds, Ba is the
size of the moving batch, αlr is the learning rate of the agent model and αexp is the parameter of
the exponential moving average. The algorithm is described for one episode.
 1: procedure Server with the agent executes:
 2:    Initialize θG^0                                                        ▷ Global task model
 3:    Initialize the moving batch B
 4:    Initialize θa                                                          ▷ Agent model
 5:    Initialize δ = 0
 6:    for each round t from 1 to R do
 7:        for each client i ∈ St in parallel do
 8:            Update local model θi^t ← θG^t            ▷ Participants download the global task model
 9:            θi^{t+1} ← ClientUpdate(θi^t)
10:            Acci^{t+1} ← Accuracy(θi^{t+1})
11:        The server collects the local gradients θi^{t+1} and the accuracies Acci^{t+1}
12:        St ← (Acc1^{t+1}, . . . , Acck^{t+1})                              ▷ State
13:        At ← θa(St)                                                        ▷ Action
14:        Selt ← Bernoulli(At^0)                 ▷ Selection using the first column of probabilities
15:        p ← Σ_{i=1}^{k} Selt^i
16:        R̂ ← (1/p) Σ_{i=1}^{k} Selt^i · Acci^{t+1}
17:        Rt ← R̂ − δ
18:        δ ← δ + (1 − αexp)(R̂ − δ)
19:        Collect {St, At, Rt} and update B
20:        θG^{t+1} ← (1/p) Σ_{i=1}^{k} Selt^i · θi^{t+1}
21:        if B is full then
22:            Ât′,j ← Σ_{t=t′}^{T} γ^{t−t′} Rt
23:            Compute the policy gradient:
24:                J(θa) ← (1/Ba) Σ_{j=1}^{Ba} Σ_{t′=0}^{T} ∇θ log πθa(At′,j | St′,j) Ât′,j
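    To make the selection, reward and aggregation steps of Algorithm 1 concrete, a minimal NumPy
sketch of one server round is given below; the variable names and the guard against an empty
selection are assumptions.

import numpy as np

def server_round(select_probs, accuracies, local_models, delta, alpha_exp=0.9):
    """One selection/aggregation step in the spirit of Algorithm 1 (illustrative sketch).

    select_probs: first column of the agent output, shape (k,)
    accuracies:   local accuracies Acc_i^{t+1}, shape (k,)
    local_models: stacked local parameters theta_i^{t+1}, shape (k, d)
    delta:        exponential moving average of the previous accuracies
    """
    sel = np.random.binomial(1, select_probs)           # Bernoulli draw per probability
    p = max(sel.sum(), 1)                                # avoid division by zero (assumption)
    r_hat = (sel * accuracies).sum() / p                 # mean accuracy of the selected workers
    reward = r_hat - delta                               # baseline-corrected reward
    delta = delta + (1 - alpha_exp) * (r_hat - delta)    # update the moving average
    global_model = (sel[:, None] * local_models).sum(axis=0) / p
    return sel, reward, delta, global_model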
when the accuracy is increased and the total time is decreased between two consecutive rounds,
and by punishing it in the opposite case. Alternatively, for instance, the reward could increase
exponentially as the time is further minimized while the accuracy is further maximized. Finally, the
distribution of data could be taken as a weight which influences the reward. In the non-IID case,
as an illustration of the intention, a worker with 300 samples of the label A should have a different
impact on the reward than a worker with 4000 samples of the same label A.
    However, in this study, the reward is based on the criterion of performance: the objective of
the agent is to minimize the total time T^t of each round. The reward is established as the negative
difference between the maximum time of the participants and the previous maximum time:

R_t = -\left( \max(T^a) - \nu \right)

where T^a is the set of all total times of the participants of the current round and ν is the previous
maximum time. Note that the total times are normalized and, to normalize the reward, it is better
to take the mean of the times rather than their sum (i.e. the total times).
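    A small sketch of this reward is given below, assuming normalized times and the interpretation
above; how ν is refreshed between rounds is an assumption, since the continuation of the scheduling
algorithm is not reproduced here.

def time_reward(times, nu):
    """Reward based on the (normalized) times of the current participants.

    times: list of total times T^a of the selected participants
    nu:    previous maximum time
    Returns the reward and the new value of nu (assumed to track the current maximum).
    """
    current_max = max(times)
    reward = -(current_max - nu)  # positive when the round got faster
    return reward, current_max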
   • TRPO algorithm (Trust Region Policy Optimization), where the step size is properly handled
     in the policy gradient based on the idea of a trust region. In other words, the algorithm helps
     to manage the step size depending on the curvature of the reward landscape. The algorithm is
     complex and would take too much time to be implemented.

   • PPO algorithm (Proximal Policy Optimization) is an improvement of the TRPO algorithm,
     because TRPO is complicated and suffers from a computational burden when computing the
     natural gradient. Even so, the algorithm remains too elaborate to be used for Federated
     Learning and was not retained.

   • AC algorithm (Actor Critic) is a simple algorithm which is an improvement of the policy-based
     methods (for example, the REINFORCE algorithm), where the state value is used as a baseline
     to reduce the variance of the estimated gradient in the vanilla policy gradient method. The
     main advantage of this algorithm is its simplicity; however, after hundreds of experiments, the
     algorithm did not converge to an acceptable solution. Most of the time, the probabilities
     diminish until being close to zero and only one or two participants are selected for the next
     round.
    Only the DDPG algorithm (Deep Deterministic Policy Gradient) was validated. The idea of
the algorithm is to extend the DQN algorithm (Deep Q-Network) into a deterministic policy
gradient. DDPG establishes a Q function (critic) and a policy function (actor) simultaneously.
Temporal-difference methods are used to update the Q function (critic), which is the same as in
7 A small description of several algorithms can be found in Appendix ii.
DQN. The action-value function Q(s, a|θQ), where θQ denotes the critic network parameters, is
learned using the Bellman equation (see Bellman (1957)). The action-value function given any
policy π is defined as:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \,\Big|\, S_0 = s,\, A_0 = a \right]
Then the Q value of the critic network θQ is computed and used to minimize the loss function:

L = \frac{1}{N} \sum_{i} \left( Y_i - Q(S_i, A_i \mid \theta^{Q}) \right)^2 \quad \text{with} \quad Y_i = R_i + \gamma\, Q\!\left(S_{i+1}, \pi(S_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)
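    A hedged PyTorch sketch of this critic update is given below; the interfaces critic(states, actions),
target_critic and target_actor are assumptions about how the networks are wrapped.

import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    """One critic step: minimize (Y_i - Q(S_i, A_i | theta^Q))^2 with a TD target."""
    states, actions, rewards, next_states = batch
    with torch.no_grad():
        next_actions = target_actor(next_states)
        targets = rewards + gamma * target_critic(next_states, next_actions)
    loss = F.mse_loss(critic(states, actions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()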
    The policy of the actor network is deterministic, thus it will be noted µ (a deterministic policy)
to distinguish it from a stochastic policy. The performance objective for the deterministic policy µ
follows the same expected discounted reward definition as for a stochastic policy:

J(\mu) = \mathbb{E}_{S_t \sim \rho^{\mu},\, A_t = \mu(S_t)}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R(S_t, A_t) \right]
= \int_{S} \int_{S} \sum_{t=1}^{\infty} \gamma^{t-1} \rho_0(S)\, P(S' \mid S, t, \mu)\, R(S', \mu(S'))\, \mathrm{d}S\, \mathrm{d}S'
= \int_{S} \rho^{\mu}(S)\, R(S, \mu(S))\, \mathrm{d}S \qquad \text{because} \quad \rho^{\mu}(S') = \int_{S} \sum_{t=1}^{\infty} \gamma^{t-1} \rho_0(S)\, P(S' \mid S, t, \mu)\, \mathrm{d}S
where P(S' | S, t, µ) = P(S_{t+1} | S_t, A_t) P^{µ}(A_t | S_t) is the transition probability times the probability of
the action choice. The definition of the state value function can be used:
                                           "∞                               #
                                             X
                             V µ (S) = E         γ t−1 R(St , At )|St = S; µ
                                                t=1
                                               ∞
                                             Z X
                                         =            γ t−1 P (S ′ |S, t, µ)R(S ′ , µ(S ′ ))dS ′
                                              S t=1
With a deterministic policy, there is only a single action whereas with a stochastic policy, the Q-value
is an expectation over the action distribution. Thus V µ (S) = Qµ (S, µ(S)) and :
J(\mu) = \int_{S} \rho_0(S)\, Q^{\mu}(S, \mu(S))\, \mathrm{d}S
Writing Q^{\mu_\theta}(S, \mu_\theta(S)) = A + B with A = R(S, \mu_\theta(S)) and B = \int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, V^{\mu_\theta}(S')\, \mathrm{d}S', the gradients of the two terms are:

\nabla_\theta A = \nabla_\theta \left[ R(S, \mu_\theta(S)) \right] = \nabla_\theta \mu_\theta(S) \times \nabla_A R(S, A)\big|_{A=\mu_\theta(S)}

and:

\nabla_\theta B = \nabla_\theta \left[ \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times V^{\mu_\theta}(S')\, \mathrm{d}S' \right] = \int_{S} \gamma\, \nabla_\theta \left[ P(S' \mid S, \mu_\theta(S)) \times V^{\mu_\theta}(S') \right] \mathrm{d}S'
By posing u(θ) = P(S' | S, µθ(S)) and v(θ) = V^{µθ}(S'), the formula ∇θ[u(θ) × v(θ)] = ∇θ u(θ) × v(θ) +
u(θ) × ∇θ v(θ) can be used to develop the expression:

\nabla_\theta B = \int_{S} \gamma\, \nabla_\theta \mu_\theta(S)\, \nabla_A P(S' \mid S, A)\big|_{A=\mu_\theta(S)} \times V^{\mu_\theta}(S')\, \mathrm{d}S' + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
\nabla_\theta V^{\mu_\theta}(S) = \nabla_\theta \mu_\theta(S) \left[ \nabla_A R(S, A) + \int_{S} \gamma\, \nabla_A P(S' \mid S, A) \times V^{\mu_\theta}(S')\, \mathrm{d}S' \right]_{A=\mu_\theta(S)} + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
                               = \nabla_\theta \mu_\theta(S)\, \nabla_A Q^{\mu_\theta}(S, \mu_\theta(S)) + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
Fubini’s theorem. Suppose A and B are complete measure spaces and f(x, y) is A × B measurable. If

\int_{A \times B} |f(x, y)|\, \mathrm{d}(x, y) < \infty

where the integral is taken with respect to a product measure on the space A × B, then

\int_{A} \left( \int_{B} f(x, y)\, \mathrm{d}y \right) \mathrm{d}x = \int_{B} \left( \int_{A} f(x, y)\, \mathrm{d}x \right) \mathrm{d}y = \int_{A \times B} f(x, y)\, \mathrm{d}(x, y)

the first two integrals being iterated integrals with respect to the two measures, respectively, and the
third being an integral with respect to the product of these two measures.
    The last part of the expression can be iterated using the relation between ∇θ V^{µθ}(S) and
∇_A Q^{µθ}(S, µθ(S)), and simplified by using Fubini’s theorem:

\int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
 = \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \left[ \nabla_\theta \mu_\theta(S')\, \nabla_A Q^{\mu_\theta}(S', \mu_\theta(S')) + \int_{S} \gamma P(S'' \mid S', \mu_\theta(S'))\, \nabla_\theta V^{\mu_\theta}(S'')\, \mathrm{d}S'' \right] \mathrm{d}S'
 = \int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, \nabla_\theta \mu_\theta(S')\, \nabla_A Q^{\mu_\theta}(S', \mu_\theta(S'))\, \mathrm{d}S' + \int_{S} \gamma^{2} P(S \to S'', 2, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(S'')\, \mathrm{d}S''
where P(S → S', t, µθ) is the probability of transitioning from S to S' in t steps8 . Then the expression
can be used in the policy gradient:

\nabla_\theta J(\mu_\theta) = \int_{S} \rho_0(S) \cdot \nabla_\theta V^{\mu_\theta}(S)\, \mathrm{d}S
                            = \int_{S} \int_{S} \sum_{t=0}^{\infty} \gamma^{t} \rho_0(S)\, P(S \to S', t, \mu_\theta)\, \nabla_\theta \mu_\theta(S') \times \nabla_A Q^{\mu_\theta}(S', A)\big|_{A=\mu_\theta(S')}\, \mathrm{d}S'\, \mathrm{d}S
As \rho^{\mu_\theta}(S') = \int_{S} \sum_{t=0}^{\infty} \gamma^{t} \rho_0(S)\, P(S \to S', t, \mu_\theta)\, \mathrm{d}S, then:

\nabla_\theta J(\mu_\theta) = \int_{S} \rho^{\mu_\theta}(S)\, \nabla_\theta \mu_\theta(S)\, \nabla_A Q^{\mu_\theta}(S, A)\big|_{A=\mu_\theta(S)}\, \mathrm{d}S
    For the algorithm, the policy function is updated by applying the chain rule to the expected
return J from the start distribution. Learning from mini-batches, the gradient of the actor network
can be computed as:

\nabla_{\theta^{\pi}} J \approx \sum_{i} \nabla_A Q(S_i, A \mid \theta^{Q})\big|_{A=\pi(S_i \mid \theta^{\pi})}\, \nabla_{\theta^{\pi}} \pi(S_i \mid \theta^{\pi})
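    Under the same interface assumptions as the critic sketch given earlier, the corresponding actor
step can be sketched as follows; the minus sign turns the gradient ascent on Q into a loss to minimize.

def actor_update(actor, critic, optimizer, states):
    """One actor step: ascend Q(S, pi(S | theta^pi)) by minimizing its negation."""
    loss = -critic(states, actor(states)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()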
    The network parameters are updated by exponential smoothing rather than by directly copying
the parameters:

\theta^{Q'} \leftarrow \rho\, \theta^{Q} + (1 - \rho)\, \theta^{Q'}
\theta^{\pi'} \leftarrow \rho\, \theta^{\pi} + (1 - \rho)\, \theta^{\pi'}

where ρ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1.
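    For example, a minimal soft-update helper (assuming PyTorch modules for the networks) could
be written as:

import torch

@torch.no_grad()
def soft_update(network, target_network, rho=0.01):
    """theta' <- rho * theta + (1 - rho) * theta', applied parameter by parameter."""
    for param, target_param in zip(network.parameters(), target_network.parameters()):
        target_param.mul_(1.0 - rho).add_(rho * param)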
Figure 8: DDPG algorithm with replay memory where π(·|θπ ) denotes the policy function (actor) and Q(·|θQ )
          denotes the Q function (critic)
    Figure 8 illustrates the DDPG algorithm and introduces the concept of replay memory. The
replay memory stores the experiences of the agent. Then, for each training iteration, a random
sample of this memory is selected. This reduces the correlation between experiences in a training
batch and greatly helps the training.
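    A minimal replay memory along these lines, a bounded buffer with uniform random sampling,
could be sketched as follows; the fixed capacity is an assumption.

import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # random sampling breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)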
Algorithm 2 SchedulerFL. There are k workers; p is the percentage of participants in the first round,
E is the number of experiences, R is the number of rounds and γ is the discount factor of the reward.
 1: procedure Server with the agent executes:
 2:    Initialize θπ                                                          ▷ Actor network
 3:    Initialize θπ′                                                         ▷ Target actor network
 4:    Initialize θQ                                                          ▷ Critic network
 5:    Initialize θQ′                                                         ▷ Target critic network
 6:    Initialize a replay memory M
 7:    for each experience e from 1 to E do
 8:        Initialize θG^0                                                    ▷ Global task model
 9:        Initialize S0 as a 3 × k zero matrix                               ▷ State
10:        Initialize ν = 0                              ▷ Previous maximum time for the reward
11:        Select randomly p · k participants
12:        for each participant in parallel do
13:            θi^0 ← θG^0                           ▷ Participants download the global task model
14:            θi^1 ← ClientUpdate(θi^0)                                      ▷ Local training
15:            The server collects the local model θi^1
16:            Compute the times T^c, T^u and T^d
17:            Update S0 to S1 by replacing the corresponding entries with the new times
18:        Complete S1 by replacing the remaining zeros with the mean of the participants' times
19:        Aggregate the local models with the FedAvg algorithm to get θG^1
IV.        Experiments
For the experiments, the distribution of data between clients is crucial to simulate IID cases and
non-IID cases. In the first section, the generation of data is covered to illustrate the different
distributions of data. The other sections then aim to test the algorithms through experiments and
analyse their results. The whole code for the experiments is available in the github repository
https://github.com/bourbonut/reinforcedFL.
i.      Generation of data
This section covers how data are distributed on each client device, under the assumption that the
data, when all gathered, are identically and independently distributed. To satisfy this assumption,
the working dataset (e.g. the MNIST dataset or the FashionMNIST dataset) is initialized and has to
be IID. Then, there are several parameters which can be managed:
   • the distribution based on the volume, which means that in the IID case all clients have the
     same amount of data for each label, while in the non-IID case the volume of data changes from
     one client to another for a specific label. For instance, with two clients A and B, if both have
     the label l, then in the IID case each client has half of the amount of data for label l, while
     in the non-IID case the proportion owned by each client is randomly chosen, which creates an
     unbalanced configuration (e.g. 36% of the data on client A and 64% of the data on client B);

   • the distribution based on the label, which implies that in the IID case all clients have all labels,
     while in the non-IID case a parameter m specifies the minimum number of labels on clients
     and labels are randomly chosen. Another parameter b indicates whether each client has at least
     m labels or possibly fewer. For example, if there are 3 clients and 5 labels l1 , l2 , l3 , l4 , l5 , then
     in the IID situation all clients have the 5 labels, whereas in the non-IID situation, if m = 3 and
     b = False, then one client may have 3 labels (e.g. l2 , l3 , l5 ), one client 2 labels (e.g. l1 , l4 ) and
     the last client one label (e.g. l5 ). When b = True, all clients have at least 3 labels. A sketch of
     such a label-based split is given after this list.
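    As announced in the list above, a hedged sketch of a label-based non-IID split is given below; the
sampling strategy (how many labels a client receives when b = False, and how uncovered labels are
reassigned) is an illustrative assumption, not the exact generator of the repository.

import random

def split_labels(labels, n_clients, m, balanced):
    """Assign labels to clients for a non-IID label distribution.

    labels:   list of available labels
    m:        minimum number of labels per client
    balanced: if True, every client receives at least m labels (here exactly m for simplicity);
              if False, a client may receive fewer labels
    """
    assignment = {c: set() for c in range(n_clients)}
    for c in range(n_clients):
        n_labels = m if balanced else random.randint(1, m)
        assignment[c] = set(random.sample(labels, n_labels))
    # make sure every label is owned by at least one client
    for label in labels:
        if not any(label in owned for owned in assignment.values()):
            assignment[random.randrange(n_clients)].add(label)
    return assignment

print(split_labels(list(range(10)), n_clients=3, m=3, balanced=False))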
    Figure 9 illustrates visually some distributions. Each color corresponds to a client, and the bars
reflect the amount of data of each label for a specific client. The distribution of volume and the
distribution of labels can be mixed: for instance, the volume can be non-IID while the labels are
non-IID as well. When the distribution of labels is non-IID and balanced, the volume cannot always
be IID, depending on the number of clients and the number of labels. On the figure, it is noticeable
that label 3 is shared by four clients, two of which have more data than the others. Due to choosing
m = 3 (the minimum number of labels on each client), several labels become unbalanced in order to
meet the expectations of the parameters.
    When data are generated, two types of data are given per client: the data for training and the
data for testing. Taking the whole set of data, the split tends to the ratio of 80% for training and
20% for testing. For well-known datasets such as MNIST, the default configuration is kept (for
MNIST, 60000 samples for training and 10000 samples for testing). The same labels and the same
volume of data are used between training data and testing data. In other words, if a model on
the client's device learns on labels 2, 4 and 7, then the model is also tested on labels 2, 4 and 7.
Likewise, speaking of data for training, if the client's device holds 5% of label 2, 10% of label 4 and
6% of label 7, then the data for testing is distributed in the same way as the data for training.
Figure 9: Impact on the generation of data according to the distribution of labels and the distribution of
          the volume, with 20 clients on the MNIST dataset (10 labels and 60000 samples). Each panel plots
          the number of samples per label. (a) Labels and volume are IID. (b) Labels are IID and volume is
          non-IID. (c) Labels are non-IID (balanced, b = True and m = 3) and volume is IID. (d) Labels are
          non-IID (unbalanced, b = False and m = 3) and volume is IID.
evolution of the global model. FedAvg is described as a weighted average of the local models on
clients' devices. The weights are computed based on the proportion of data held by a device compared
to the whole size of the data. For instance, if a device holds 1500 samples and the size of the whole
dataset, all data of the devices taken together, is 60000, then the weight is 0.025. The total amount
of data on local devices is assumed to be known thanks to Federated Analytics techniques, i.e.
collaborative data science without data collection. Knowing the amount of data on each local device
then gives the whole size of the data as if they were gathered on a server. Thus, FedAvg can be
computed while preserving the privacy of clients.
Algorithm 3 FederatedAveraging. The K clients are indexed by k; B is the local minibatch size,
E is the number of local epochs, η is the learning rate, n_k is the number of samples of client k and
n is the total number of samples.
 1: procedure Server executes:
 2:    initialize w0
 3:    for each round t = 1, 2, . . . do
 4:        m ← max(C · K, 1)
 5:        St ← (random set of m clients)
 6:        for each client k ∈ St in parallel do
 7:            w_{t+1}^k ← ClientUpdate(k, wt)
 8:        w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w_{t+1}^k
 9:
10: procedure ClientUpdate(k, w):                                              ▷ Run on client k
11:    B ← (split P_k into batches of size B)
12:    for each local epoch i from 1 to E do
13:        for batch b ∈ B do
14:            w ← w − η∇L(w; b)                             ▷ L(w; b) represents the loss function
15:    return w to server
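    As an illustration of the weighted average of line 8, a small sketch of the aggregation step is given
below; the state_dict-like interface (a mapping from parameter names to tensors) is an assumption.

def fedavg_aggregate(local_models, sample_counts):
    """Weighted average of client models: w <- sum_k (n_k / n) * w_k.

    local_models:  list of dictionaries mapping parameter names to tensors (or arrays)
    sample_counts: list of n_k, the number of local samples of each client
    """
    total = float(sum(sample_counts))
    aggregated = {}
    for key in local_models[0]:
        aggregated[key] = sum(
            (n_k / total) * model[key] for model, n_k in zip(local_models, sample_counts)
        )
    return aggregated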
    Different experiments are presented in figure 10, figure 11, figure 12 and figure 13. They are
realized on the FashionMNIST dataset (60000 samples with balanced labels), working with 20 workers9
over 10 rounds and 3 epochs, with different distributions of data. Each figure is composed of the
distribution of data for training over all workers on the left side and the evolution of the weighted
average accuracy on the right side. Note that the distribution of data for testing is similar to the
distribution of data for training; therefore, each worker has the same labels and the same volume
per label for training and testing. With 10 labels, a model which learns on labels 1, 5 and 6, for
instance, will never be tested on labels 0, 2, 3, 4, 7, 8 and 9. Each round, every worker computes
the accuracy of the model on their own local dataset. Then, the server collects all the local
accuracies and computes the weighted average accuracy, whose evolution is illustrated on the right
side of each figure. Note that round 0 corresponds to the state where the workers' models are
randomly initialized. When all labels appear in each worker's dataset (figures 10 and 11), the
method converges fairly quickly. However, when labels are non independent and identically
distributed, the convergence becomes slower in the balanced case (see figure 12) and is penalized
when only a few workers have all the data of some specific labels while the others share the remaining
part of one label (see figure 13). Note that in figures 10, 11, 12 and 13, all workers are participants
every round.
Figure 10: Convergence of the FashionMNIST model with Federated Averaging using 20 workers in the IID
           configuration (left: number of samples per label; right: accuracy per round).
Figure 11: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID volume (left: number of samples per label; right: accuracy per round).
Figure 12: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID labels, balanced and 3 labels minimum (left: number of samples per label; right:
           accuracy per round).
Figure 13: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID labels, unbalanced and 3 labels minimum (left: number of samples per label; right:
           accuracy per round).
at least per worker and with non-IID volume. Between episodes, the workers keep the same dataset,
but the task model is reinitialized globally and locally. Two approaches were realized.
where Seli is 1 for selected and 0 for discarded, Acci is the accuracy of the participant i and k is
the number of participants.
Figure 14: Selection represented for 20 episodes with 10 rounds per episode with 20 participants
Figure 15: Selection represented for 20 episodes with 10 rounds per episode with 100 participants
    Figures 14 and 15 illustrate the selection of contributors for the aggregation. Each color is
associated with a participant: a colored dot means the participant is selected, otherwise the
participant is discarded from the aggregation. On figure 14, after only one episode, the algorithm
converges quickly and selects the same participants over time. On figure 15, the number of
participants is increased to 100 workers and the algorithm converges at the 18th episode. A clear
separation is visible between a homogeneous selection, similar to a random behavior, and a
heterogeneous selection once the algorithm has found a strategy for optimizing the contribution.
    The following figure 16 highlights the evolution of the rewards, which indicates the development
of the reinforcement learning system.
Figure 16: Evolution of the rewards over rounds. (a) With 20 participants. (b) With 100 participants.
    Figure 16a puts forward the same behavior for the rewards as figure 14 for the selection. After
one episode, the evolution of the rewards reaches a limit and stabilizes around it. On figure 16b,
the 100 participants impact the shape of the curve: the system increases more slowly to reach a
limit and then oscillates. It underlines that the evolution of the reward should not be a single indicator of
Figure 17: Evolution of the accuracies given episodes and rounds with 20 participants
Figure 18: Evolution of the accuracies given episodes and rounds with 100 participants
    Figure 17 and figure 18 illustrate the evolution of the accuracy per episode and per round. The
evolution of the accuracy should rise faster as the episodes go by. However, in both figures, the trend
remains the same and the progression of the color does not tend to change. The gradient of colors
highlights that the progression over rounds in figure 18 with 100 workers is slower than the
progression in figure 17 with 20 workers.
    Even if the algorithm converges and finds a group of participants which should contribute in a
better way to optimize the aggregation, the results point out that the contribution evaluation based
on the accuracy does not improve the speed of convergence as expected.
and

R_t = \frac{1}{\sum_{i=1}^{k} d_i} \sum_{i=1}^{k} d_i \cdot Acc_i\!\left(\theta_G^{t+1}\right) - \delta

where Acc_i(·) is the accuracy computed using the data of the worker i, θi^{t+1} is the local model of
the worker i after a local training of the global model θG^t, d_i is the size of the local data of the
worker i and δ is the exponential moving average of the previous accuracies.
    On figure 19 and figure 20, the selection of participants for the aggregation puts forward that
the algorithm does not converge. In fact, with 20 workers as with 100 workers, both figures 19 and
20 are filled with dots which are not discernible, because almost all gradients are used for the
aggregation. However, the evolution of the rewards on figure 21 has almost the same shape as the
reward of the first approach, except for some small peaks. If the network had converged, the reward
would be smooth. In this case, the reward oscillates, which is a form of divergence, and explains
why the agent does not select a subset of gradients after the training.
Figure 19: Selection represented for 20 episodes with 10 rounds per episode with 20 participants
    Moreover, the evolution of the accuracy over the episodes on figures 22 and 23 reaches 85%
directly at the first round in both cases (20 workers and 100 workers). This can be explained by
the number of gradients which are aggregated. Indeed, even if the distribution of data is non-IID,
there are enough gradients to make an aggregation in the same way as in an IID case.
Figure 20: Selection represented for 20 episodes with 10 rounds per episode with 100 participants
Figure 21: Evolution of the rewards over rounds. (a) With 20 participants. (b) With 100 participants.
Figure 22: Evolution of the accuracies given episodes and rounds with 20 participants
Figure 23: Evolution of the accuracies given episodes and rounds with 100 participants
   • Isomap embedding
   • Spectral embedding for non-linear dimensionality reduction
   • Locally linear embedding
    The experiment was made with 40 workers on the MNIST dataset, where labels are swapped and
distributed to form 4 clusters. For instance, the samples of the digit 3 become the label 8 for 10
workers chosen randomly. The goal is to reduce the dimension of the gradients and find the clusters
through visualization.
    On figure 24 and figure 25, each dot indicates a gradient. When different dots have the same
color, they belong to the same cluster. These methods allow going from an input of size 40 × 1.2M
to an output of size 40 × 2.
    With the PCA method on figure 24, the variance ratio for 2 components is, in this case, 25%.
Before the reduction, there is a total of 48 million values for the state with 40 workers, and after the
reduction, there is only a total of 80 values. Applying the PCA method with a preservation of 95%
of the variance gives an output of size 40 × 37; in other words, the gradients of every worker are
reduced to 37 components per worker. The spectral embedding constructs an affinity matrix and
applies a spectral decomposition to the corresponding graph Laplacian. The affinity matrix in this
case is computed based on a radial basis function kernel. The main difference between figures 24
and 25 is the scale of the axes, which is smaller for the spectral embedding method than for the PCA method.
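    A hedged sketch of the PCA reduction described above, using scikit-learn, is given below; the
flattening of the gradients into a matrix with one row per worker is assumed to be done beforehand,
and the random matrix only stands in for the real gradients.

import numpy as np
from sklearn.decomposition import PCA

# one flattened gradient vector per worker (placeholder values; in the report
# each flattened gradient has about 1.2 million components)
gradients = np.random.randn(40, 1200).astype(np.float32)

# 2 components for the visualization of the clusters
pca_2d = PCA(n_components=2).fit(gradients)
embedding = pca_2d.transform(gradients)            # shape (40, 2)
print(pca_2d.explained_variance_ratio_.sum())      # variance kept by 2 components

# keep 95% of the variance instead of a fixed number of components
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(gradients)
reduced = pca_95.transform(gradients)              # shape (40, pca_95.n_components_)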
   • the speed of computation (in s/batch) can have 3 different means: 0.5, 0.7 or 1. The standard
     deviation of this variable is always 0.02;

   • the bandwidth of uploading (in Mb/s) can have 2 different means, each associated with its own
     standard deviation: (8, 2) or (0.5, 0.2), written as (m, σ);

   • the bandwidth of downloading (in Mb/s) can have 2 different means, each associated with its
     own standard deviation: (30, 5) or (5, 1), written as (m, σ).

Note that each worker is assigned one mean and one standard deviation per characteristic, chosen
randomly among the previous values. A sketch of this simulation is given after this note.
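    As mentioned above, a sketch of how such worker characteristics could be drawn and turned into
round times is shown below; the exact sampling scheme, the number of batches and the model size
are assumptions.

import random

# possible (mean, standard deviation) pairs taken from the list above
COMPUTE_SPEEDS = [(0.5, 0.02), (0.7, 0.02), (1.0, 0.02)]  # s/batch
UPLOAD_BW = [(8.0, 2.0), (0.5, 0.2)]                      # Mb/s
DOWNLOAD_BW = [(30.0, 5.0), (5.0, 1.0)]                   # Mb/s

def make_worker_profile():
    """Assign to a worker one (mean, std) pair per characteristic, chosen randomly."""
    return {
        "speed": random.choice(COMPUTE_SPEEDS),
        "upload": random.choice(UPLOAD_BW),
        "download": random.choice(DOWNLOAD_BW),
    }

def sample_times(profile, n_batches=50, model_size_mb=5.0):
    """Draw the times T^c, T^u and T^d of one round from the worker profile."""
    speed = random.gauss(*profile["speed"])
    upload = random.gauss(*profile["upload"])
    download = random.gauss(*profile["download"])
    t_c = n_batches * speed                    # computation time
    t_u = model_size_mb / max(upload, 1e-6)    # upload time
    t_d = model_size_mb / max(download, 1e-6)  # download time
    return t_c, t_u, t_d

workers = [make_worker_profile() for _ in range(100)]
print(sample_times(workers[0]))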
    For the experiments, the actor network and the critic network are composed of 3 linear layers
with 128 neurons per layer. The exploration of the algorithm can be dissociated from Federated
Learning: there is no need to have a task model to train the agent network, since the aim of the
agent is to minimize the time spent by the next participants. However, to compare the performance
of the agent on the task model accuracy against a completely random selection, Federated Learning is used
to keep the same environment between the agent and the baseline (random selection). There are
100 workers in this experiment, with 20 rounds per experiment and a maximum of 30 experiments.
The output of the agent network is a set of probabilities. The selection of the next participants is
made by applying the Bernoulli law B(p) to each probability. The selection is a vector filled with
values 0 or 1, where 0 denotes that the worker is discarded for the next round and 1 that the worker
is selected for the next round. When the probabilities are close to zero and no worker is selected for
the next round, the experiment ends.
    The results are divided into two parts:

   • a first experiment where the goal is to highlight that the agent is able to find the fastest
     workers in a dynamic environment;

   • a second experiment where the same environment is kept between the training step, the testing
     step and the comparison of performance with a random selection.
    Figure 27 puts forward how the probabilities progress over rounds. The workers are sorted from
the fastest to the slowest according to their means (speed of computation, bandwidth of
uploading and bandwidth of downloading). Each pixel of the image corresponds to the probability
of a worker being selected for the next round. Blue indicates a probability close to zero while red
indicates a probability close to one. Around round 240, there are 10 participants with a high
probability of being selected for the next round. The training should be stopped at this point in
order to keep enough participants to represent the distribution of data in the Federated Learning case.
    With the same idea, figure 28 indicates the selection of workers over rounds. A colored dot
means the worker has been selected; otherwise there is a white space.
Figure 28: Evolution of the selection of workers over rounds for 100 workers
    Finally, the maximum time of the participants per episode decreases over time on figure 29.
Note that on figure 29a, the scale of the Y-axis is logarithmic to reduce the visual impact of a large
peak of maximum time, whereas figure 29b does not have this issue.
Figure 29: Evolution of the maximum time of participants over episodes for 100 workers. (a) Mean of the
           times for each episode. (b) Median of the times for each episode.
Figure 30: Comparison of the evolution of the accuracy between the selection with the DDPG algorithm and
           the random selection with 100 workers on MNIST data. (a) On IID data. (b) On non-IID data.
However, the comparison of the maximum time between the two methods in tables 2 and 3,
highlights that the agent selects workers whose maximum time is lower than with the random
selection in the IID scenario, and has a worse performance in the non-IID scenario, as mentioned above.
Table 2: Comparison of maximum time between the selection with DDPG algorithm and the random
         selection with 100 workers on IID data
Table 3: Comparison of maximum time between the selection with DDPG algorithm and the random
         selection with 100 workers on non-IID data
A. Appendix
i. Figures
Double DQN — off-policy — Double DQN is an enhancement of DQN for reducing overestimation. Q is
    noisy, which may be caused by the environment, non-stationarity, function approximation or
    other reasons. The central idea of Double DQN is to decorrelate the noise in selection and
    evaluation by using two different networks in these two stages. The Q-network of the DQN
    architecture provides a natural candidate for the extra network. Recall that it is the evaluation
    role of the target network that improves the stability more.

Dueling DQN — off-policy — Often the exact Q-values of the different actions do not matter much, so
    decoupling the action-independent value of the state from the Q-value may lead to more robust
    learning. The experiments show that dueling architectures lead to better policy evaluation in
    the presence of many similar-valued actions.

Rainbow DQN — off-policy — Rainbow uses the truncated n-step return R_t^{(n)} from a given state S_t
    directly, where R_t^{(n)} is defined by R_t^{(n)} = Σ_{k=0}^{n−1} γ^k R_{t+k}.

Noisy DQN — off-policy — It is an alternative exploration algorithm to ε-greedy, especially for games
    requiring extensive exploration. The noise is added to a linear layer y = W x + b by an extra
    noisy stream: y = (W x + b) + ((W_noisy ⊙ ε_w) x + b_noisy ⊙ ε_b).

REINFORCE — on-policy — The REINFORCE algorithm follows the straightforward idea of performing
    gradient ascent on the parameters of the policy π_θ. Despite its simplicity, naive REINFORCE
    has been observed to suffer from a large variance when estimating the gradient. To alleviate
    this problem, a baseline b(S_i) is introduced, where b(S_i) is a function depending only on S_i
    (or, more importantly, not depending on A_i).

Actor-Critic — off-policy — The actor-critic method follows the idea of learning together an actor, the
    policy function π(·|s), and a critic, the value function V_π(s). Moreover, actor-critic also uses
    the idea of bootstrapping to estimate the Q-value.

A2C — on-policy — Synchronous Advantage Actor-Critic: similar to actor-critic but focused on parallel
    training. It works with one master, one coordinator and several workers. The master is updated
    as in the AC algorithm.

A3C — on-policy — Asynchronous Advantage Actor-Critic: there is no coordinator (asynchronous). Each
    worker talks directly to the global actor and critic.

TRPO — on-policy — Trust Region Policy Optimization aims to handle the step size more properly in
    policy gradient, based on the idea of a trust region.

PPO — on-policy — Based on TRPO; instead of optimizing with a hard constraint, Proximal Policy
    Optimization tends to optimize its regularized version.

DPPO — on-policy — Distributed Proximal Policy Optimization is a distributed version of the PPO
    algorithm.

ACKTR — on-policy — Actor Critic using Kronecker-Factored Trust Region uses the Kronecker-factored
    approximated curvature to compute the natural gradient.
B.    References
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient
  Learning of Deep Networks from Decentralized Data,” arXiv:1602.05629 [cs], Feb. 2017, arXiv:
  1602.05629. [Online]. Available: http://arxiv.org/abs/1602.05629
X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the Convergence of FedAvg on Non-IID
  Data,” arXiv:1907.02189 [cs, math, stat], Jun. 2020, arXiv: 1907.02189. [Online]. Available:
  http://arxiv.org/abs/1907.02189
J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated Reinforcement Learning: Techniques, Applications,
   and Open Challenges,” arXiv:2108.11887 [cs], Oct. 2021, arXiv: 2108.11887. [Online]. Available:
   http://arxiv.org/abs/2108.11887
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep
 Learning with Differential Privacy,” Proceedings of the 2016 ACM SIGSAC Conference on
 Computer and Communications Security, pp. 308–318, Oct. 2016, arXiv: 1607.00133. [Online].
 Available: http://arxiv.org/abs/1607.00133
R. Bellman, Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-Edge AI: Intelligentizing
  Mobile Edge Computing, Caching and Communication by Federated Learning,” IEEE
  Network, vol. 33, no. 5, pp. 156–165, Sep. 2019, arXiv: 1809.07857. [Online]. Available:
  http://arxiv.org/abs/1809.07857
X. Wang, C. Wang, X. Li, V. C. M. Leung, and T. Taleb, “Federated Deep Reinforcement
  Learning for Internet of Things With Decentralized Cooperative Edge Caching,” IEEE
  Internet of Things Journal, vol. 7, no. 10, pp. 9441–9455, Oct. 2020. [Online]. Available:
  https://ieeexplore.ieee.org/document/9062302/
I. Martinez, S. Francis, and A. S. Hafid, “Record and Reward Federated Learning Contributions
   with Blockchain,” in 2019 International Conference on Cyber-Enabled Distributed Computing and
   Knowledge Discovery (CyberC). Guilin, China: IEEE, Oct. 2019, pp. 50–57. [Online]. Available:
   https://ieeexplore.ieee.org/document/8945913/
S. Yu, X. Chen, Z. Zhou, X. Gong, and D. Wu, “When Deep Reinforcement Learning Meets
  Federated Learning: Intelligent Multi-Timescale Resource Management for Multi-access Edge
  Computing in 5G Ultra Dense Network,” arXiv:2009.10601 [cs], Sep. 2020, arXiv: 2009.10601.
  [Online]. Available: http://arxiv.org/abs/2009.10601
Y. Zhan, P. Li, and S. Guo, “Experience-Driven Computational Resource Allocation of Federated
  Learning by Deep Reinforcement Learning,” in 2020 IEEE International Parallel and Distributed
  Processing Symposium (IPDPS). New Orleans, LA, USA: IEEE, May 2020, pp. 234–243.
  [Online]. Available: https://ieeexplore.ieee.org/document/9139873/
S. Zarandi and H. Tabassum, “Federated Double Deep Q-learning for Joint Delay and Energy
  Minimization in IoT networks,” arXiv:2104.11320 [cs], Apr. 2021, arXiv: 2104.11320. [Online].
  Available: http://arxiv.org/abs/2104.11320
Y. Chen, Z. Liu, Y. Zhang, Y. Wu, X. Chen, and L. Zhao, “Deep Reinforcement Learning-Based
  Dynamic Resource Management for Mobile Edge Computing in Industrial Internet of Things,”
  IEEE Transactions on Industrial Informatics, vol. 17, no. 7, pp. 4925–4934, Jul. 2021. [Online].
  Available: https://ieeexplore.ieee.org/document/9214878/
W. Liu, L. Chen, Y. Chen, and W. Zhang, “Accelerating Federated Learning via Momentum
 Gradient Descent,” arXiv:1910.03197 [cs, stat], Oct. 2019, arXiv: 1910.03197. [Online]. Available:
 http://arxiv.org/abs/1910.03197
S. Q. Zhang, J. Lin, and Q. Zhang, “A Multi-agent Reinforcement Learning Approach for Efficient
   Client Selection in Federated Learning,” arXiv:2201.02932 [cs], Jan. 2022, arXiv: 2201.02932.
   [Online]. Available: http://arxiv.org/abs/2201.02932
X. Liang, Y. Liu, T. Chen, M. Liu, and Q. Yang, “Federated Transfer Reinforcement Learning for
  Autonomous Driving,” arXiv:1910.06001 [cs], Oct. 2019, arXiv: 1910.06001. [Online]. Available:
  http://arxiv.org/abs/1910.06001
H.-K. Lim, J.-B. Kim, J.-S. Heo, and Y.-H. Han, “Federated Reinforcement Learning for Training
  Control Policies on Multiple IoT Devices,” Sensors, vol. 20, no. 5, p. 1359, Mar. 2020. [Online].
  Available: https://www.mdpi.com/1424-8220/20/5/1359
S. U. Stich and S. P. Karimireddy, “The Error-Feedback Framework: Better Rates for SGD with
  Delayed Gradients and Compressed Communication,” Jun. 2021, number: arXiv:1909.05350
  arXiv:1909.05350 [cs, math, stat]. [Online]. Available: http://arxiv.org/abs/1909.05350
J. Zhao, X. Zhu, J. Wang, and J. Xiao, “Efficient Client Contribution Evaluation for Horizontal
   Federated Learning,” arXiv:2102.13314 [cs, eess], Feb. 2021, arXiv: 2102.13314. [Online].
  Available: http://arxiv.org/abs/2102.13314
H. Dong, Z. Ding, and S. Zhang, Deep Reinforcement Learning, Fundamentals, Research and
  Applications. Springer, 2020.
C.    Glossary
AC Actor Critic. 29
IID Independent and Identically Distributed. 1, 7, 11, 12, 14, 34, 36, 38, 42, 49, 50
ML Machine Learning. 1, 2
non-IID Non Independent and Identically Distributed. 1, 2, 5, 7, 11, 12, 14, 28, 34, 36–39, 42, 49,
     50