Report
Contents

II  Related work
    i   General ideas
    ii  Reduction of communication cost
    iii Improvement of aggregation

III Methodology
    i   Basic knowledge of Reinforcement Learning
    ii  Management of participation through a contribution evaluation
    iii Management of participants for the next round

IV  Experiments
    i   Generation of data
    ii  Baseline - Federated Averaging
    iii Results on contribution evaluation algorithm
    iv  Results on the selection of next participants algorithm

A   Appendix
    i   Figures
    ii  Concise table on different algorithms of Reinforcement Learning

B   References

C   Glossary
i.   Introduction
Federated Learning (FL) was introduced by McMahan et al. (2017) and has since attracted increasing interest from ML researchers. They proposed a decentralized collaborative approach in which multiple devices train a global model without sharing their local data. In most cases, devices have limited resources, such as computation and communication capacity, and are constrained by how the user operates them. Federated Learning can meet the expectations of specific fields such as natural language processing, computer vision, health care, transportation, finance, smart cities, robotics, networking, blockchain and others. In other words, training a model without exchanging local data opens up new possibilities for parties to benefit from one another while preserving privacy.
    However, Li et al. (2020) analyze the convergence of FedAvg, the algorithm based on Stochastic Gradient Descent (SGD) suggested by McMahan et al. (2017), on non-IID data (data that are not Independent and Identically Distributed). Indeed, in realistic settings there is no guarantee that the data are IID. This paper establishes a convergence rate of $O(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD iterations. Moreover, several problems are not covered by the FedAvg approach, namely making FL a completely distributed method, dealing with fairness of contributions, and minimizing energy consumption. Section iv will cover the different challenges of Federated Learning.
    In this study, we focus on Reinforcement Learning (RL), a branch of ML in which an agent and an environment are the basic components. The agent interacts with the environment by taking actions and receives rewards according to the impact of the actions taken. The objective of the agent is to learn to behave so as to maximize the expected long-term reward. RL has demonstrated great potential in various applications, notably communications. Section iii will introduce the basic knowledge of Reinforcement Learning.
    When we consider the kinds of applications of FL, different settings may block or reduce the efficiency of traditional FL methods. More precisely, when local data are not labeled, or when the distribution of data is non-IID, RL can help to deal with this type of issue. Through observation of the environment, Qi et al. (2021) put forward that FL can be divided into two categories according to the distribution characteristics of the data. More details are given in section i.1.
    Federated Reinforcement Learning (FRL) is the combination of RL and FL. Generally, the dimensions of sample, feature and label in FL are replaced by environment, state and action respectively in FRL. On the one hand, RL can be implemented directly on clients' devices, taking the place of each client's model. This can be a solution when data are not labeled or when data are located in a particular environment that depends on the usage of the user; for instance, mobile phones are not used in the same way in cities and in rural areas. On the other hand, RL can play the role of a scheduler in order to select clients in a better way. It turns out the agent is able to solve an optimization problem by finding the parameters that minimize the objective function.
Formulation Consider $N$ parties $F_i,\ \forall i \in [1, N]$, which are established and used to train the cooperative ML model. Also, there are $N$ datasets $D_i,\ \forall i \in [1, N]$. The FL model is a set of parameters $w_i$ which are learned based on the datasets $D_i$. The loss function of each dataset $D_i$ at party $F_i$ is defined as follows:
$$F_i(w) = \frac{1}{|D_i|} \sum_{j \in D_i} f_j(w)$$
where $f_j(w)$ represents the loss of sample $j$ with the given model parameter vector $w$ and $|\cdot|$ represents the size of the dataset. Now, we can define the global loss function obtained when multiple parties train the global model together without sharing a dataset:
$$F_g(w) = \sum_{i=1}^{N} \eta_i F_i(w)$$
In addition, the federated model is expected to perform close to a model trained on the pooled data:
$$|\nu_{SUM} - \nu_{FED}| < \delta$$
where $\nu$ is the estimated performance (accuracy, recall, F1-score, etc.) and $\delta$ is a non-negative real number.
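To make this formulation concrete, the short sketch below shows the weighted aggregation that the global objective suggests, under the common FedAvg assumption that the weights $\eta_i$ are proportional to the dataset sizes $|D_i|$ (the report leaves $\eta_i$ generic); the function and variable names are illustrative only.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors w_i, mirroring the
    weighted global objective F_g(w) = sum_i eta_i F_i(w).

    client_params: list of 1-D numpy arrays (one parameter vector per client)
    client_sizes:  dataset sizes |D_i|, used here as the weights eta_i
                   (the usual FedAvg choice; the report leaves eta_i generic)
    """
    sizes = np.asarray(client_sizes, dtype=float)
    eta = sizes / sizes.sum()                    # eta_i, sums to 1
    stacked = np.stack(client_params)            # shape (N, d)
    return (eta[:, None] * stacked).sum(axis=0)  # aggregated parameter vector

# Example: three clients with dataset sizes 10, 30 and 60
w_global = aggregate([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)], [10, 30, 60])
```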
    Despite the good performance of FL, this method has difficulty fitting dynamic environments, specifically cooperative control and optimal decision-making. To solve these real-world challenges, RL and Deep Reinforcement Learning (DRL) have shown excellent performance on many problems.
Figure 1: The agent-environment interaction loop: following its policy, the agent takes an action $A_t$; the environment returns the next state $S_{t+1}$ and the reward $R_{t+1}$.
The agent seeks to maximize the discounted sum of rewards $\sum_{t} \gamma^{t} r_t$, where $r_t$ is the reward at step $t$ and $\gamma \in [0, 1[$ is the discount rate. Since $\gamma$ is less than 1, distant future events are weighted less than events in the immediate future. For instance, if we would like to reduce the energy consumption of devices in Federated Learning, the agent is rewarded when the impact of its actions is beneficial and, on the contrary, punished (negative reward) when the energy consumption is worse than in the previous state.
    Bellman (1957), who worked on Markov Decision Processes (MDP), found a way to estimate the optimal action-value $Q^*(s, a)$, which is the maximum expected return achievable by following any strategy after seeing some sequence $s$ and then taking some action $a$:
$$Q^*(s, a) = \max_{\pi}\ \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]$$
    After obtaining the optimal Q values, it is simple to define the optimal policy $\pi^*(s)$: in state $s$, the agent must choose the action which has the highest Q value in that state. In other words,
$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\ Q^*(s, a)$$
    We can use the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$, which converges to $Q^*$ as $i \to +\infty$. A function approximator is defined to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A neural network can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that change at each iteration $i$:
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim p(s, a)}\left[ \big( y_i - Q(s, a; \theta_i) \big)^2 \right]$$
where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$ is the target for iteration $i$ and $p(s, a)$ is a probability distribution over sequences $s$ and actions $a$, referred to as the behaviour distribution.
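As an illustration only, a minimal numpy sketch of this loss is given below; `next_q_values` stands for the outputs of the previous (frozen) network $Q(\cdot;\theta_{i-1})$ on a batch of transitions, and all names are hypothetical rather than taken from the report.

```python
import numpy as np

def dqn_targets(rewards, next_q_values, gamma=0.99, terminal=None):
    """y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}) for a batch of transitions.

    next_q_values: (batch, n_actions) array from the previous (frozen) network
    terminal:      optional boolean mask; terminal states are not bootstrapped
    """
    bootstrap = next_q_values.max(axis=1)
    if terminal is not None:
        bootstrap = np.where(terminal, 0.0, bootstrap)
    return rewards + gamma * bootstrap

def dqn_loss(q_values, actions, targets):
    """Empirical version of L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2]."""
    q_taken = q_values[np.arange(len(actions)), actions]  # Q(s, a; theta_i)
    return np.mean((targets - q_taken) ** 2)
```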
Client challenges
   1. In real-world applications, data are not necessarily labeled. In this case, the local models and the global model have to learn through unsupervised or reinforcement techniques. For instance, suppose we would like to use the datasets of three different sales-oriented clothing applications. On the one hand, each application manages its data as it was designed to; the applications are not expected to follow the same design or "codification", so it is harder to identify corresponding features between datasets. On the other hand, we are not able to label the data, since the scopes of the applications differ from one another. The objective is to gather the information of the applications in order to compute a local update without having to preprocess the data.
   2. Even in the case of supervised learning with a labeled training dataset, there is still no guarantee on the data distribution. In other words, some clients may present specific characteristics, such as their location or their habits, which influence the local dataset. This requires dealing with non-IID data; moreover, from a global viewpoint the data may be unbalanced, which means the classes are not distributed identically.
   3. Likewise, Internet of Things (IoT) or embedded devices often have limited resources. In some cases, the computation or the communication with the server can be slowed down or even interrupted during the activity, for example when the battery is discharged (computation) or when the WiFi or 4G connection is lost (communication). This problem is called the straggler problem.
Communication challenges
   1. Communication plays a central role in FL. It must be minimized to speed up the training of the global model, because the number of communications represents the bottleneck of FL, especially in the cross-device configuration.
   2. Besides, the energy consumption is closely related to the number of communications and to the amount of information transferred between the device and the server. The energy consumption of computation must also be minimized but, we assume, on a given embedded device, the more we reduce the energy consumption, the more the training is slowed down.
   3. During communication, global model parameters and local updates are exchanged between participants and the server. This introduces a risk of attacks, not only because private information still remains in the local updates, but also because attackers are able to reconstruct information from the parameters. Privacy must be protected during communication.
    Due to the different habits of clients, local updates impact the global model differently according to their contributions. In other words, clients who have spent more time using their device participate more, and therefore the global model will depend more on their usage. Fairness is then not respected because of the imbalance between the contributions of participants.
Server challenges
  1. Federated Learning algorithms often start with the assumption of a centralized server. However,
     in order to apply those algorithms in a fully distributed way, the server has to be removed and
     replaced by a peer-to-peer network.
  2. In some situations, a centralized approach remains a better solution than a distributed network. One way to reduce communication is to spread the flow across Ultra Edge Computing, Edge Computing and Cloud Computing. Each level of computing could take charge of different services in the interest of reducing the workload.
   Challenges of FL and some technical solutions to address them are illustrated in Figure 2. Horizontal Federated Learning and Vertical Federated Learning are described in Qi et al. (2021), with more details in section i.1.
Figure 2: Challenges of Federated Learning and candidate techniques. Client challenges: data distribution, labeled/unlabeled data, straggler problem, scalability, cross-silo and cross-device settings, Horizontal and Vertical Federated Learning. Communication challenges: clients send local updates and the server sends the global model, communication cost, energy consumption, fairness, privacy. Server challenges: Ultra Edge, Edge and Cloud computing. Candidate techniques include Bayesian Learning, Transfer Learning, Continual Learning, Hierarchical FL, Causal Learning, Blockchain and Reinforcement Learning.
v.   Motivation
Various papers propose solutions that achieve excellent results with IID data, whereas in real-world situations non-IID data are far more common. Moreover, working on the local data of users means dealing with non-IID data, because each user has their own habits and interests. Furthermore, in some situations data are unlabeled. This problem can be illustrated when local data are collected on mobile phones through different applications. Applications offer different services and collect data in the way each application was designed to. Because of this, it is not necessarily possible to merge datasets in order to generate a unique local dataset for each client. Difficulties arise when an application does not expose its features, because the analysis and preprocessing of the data are then not feasible with a generic algorithm. Moreover, this implies that the models on clients' devices should be able to learn through unsupervised learning algorithms.
    From another point of view, due to the protection of users' privacy, the part that can be improved the most is the server side. In other words, the server receives all the updates from the clients and decides how to aggregate them and train the global model, while an additional process, called the scheduler, is in charge of selecting the clients for each round.
    This paper aims to propose a solution for applications that tend towards unsupervised learning. We assume the setting known as cross-device learning, which means the number of clients is very large and only a fraction of the clients is available at any time. We assume the primary bottleneck is communication, knowing that devices generally have heterogeneous connections.
i.    General ideas
i.1   HFRL and VFRL
The comprehensive survey by Qi et al. (2021) focuses on Federated Reinforcement Learning and puts forward ideas where Reinforcement Learning could solve specific situations in which FL could be improved "to deal with cooperative control and optimal decision-making in dynamic environments". The authors introduce two scenarios: Horizontal Federated Reinforcement Learning (HFRL) and Vertical Federated Reinforcement Learning (VFRL).
    In the HFRL scenario, the environment that each agent interacts with is independent of the others, while the state spaces and action spaces of the different agents are aligned to solve a similar problem. A typical example is an autonomous driving system, where various environment information is collected to train the autonomous driving models locally. However, multiple factors such as weather conditions or driving regulations may differ from one country to another, even if the vehicles have the same operations (braking, acceleration, steering, etc.). Users could share the learned experience of their model with each other without revealing their driving data.
    In opposition, in the VFRL scenario multiple agents interact with the same global environment, but each can only observe limited state information within the scope of its view. For example, for a photovoltaic management company, the balance between power generation and consumption should be optimized. It depends on the electrical usage of household users and on the decisions of the power company about the dispatch of generators. The electricity bill can be seen as a "reward" for household users, while the power company gets "rewards" from the management of power generation. Neither wants to share its private data with the other. Instead, VFRL can help to improve policy decisions without exposing specific data.
Small-Scale (on one Ultra or Edge node): Caching Policy, Traffic Engineering, ...
Medium-Scale: Resource Allocation, Caching Policy, Computation Offloading Policy, Traffic Engineering, ...
Large-Scale: Resource Allocation, Caching Policy, Computation Offloading Policy, Traffic Engineering, ...
Figure 3: Taxonomy of applying Deep Reinforcement Learning in mobile edge system by Wang et al. (2019)
The figure shows each edge device running a DRL agent with a MainNet and a TargetNet (replaced periodically): the loss and its gradient are computed from $Q(s_t, a_t; \theta)$ and $Q(s_{t+1}, a^-; \theta^-)$ with $a^- = \operatorname{argmax}_a Q(s_{t+1}, a; \theta)$, actions $a_t = \operatorname{argmax}_a Q(s_t, a; \theta)$ are produced by a multi-layer perceptron interacting with the environment (transitions $s_t, a_t, s_{t+1}$), and devices upload model updates to and download the global model from the central server.
Figure 4: General process of DRL with Federated Learning over mobile edge system (figure inspired by
          Wang et al. (2019))
    Federated Learning remains a learning technique that requires many resources, especially energy. The papers in the literature that deal with energy consumption (Zhan et al. (2020), Zarandi and Tabassum (2021), Chen et al. (2021)) formulate an optimization problem in which several quantities (computation delay, communication delay, bandwidth, ...) are taken into account. They introduce an RL algorithm whose goal is to find the optimized parameters that minimize the objective function given the current state of the environment.
    In some scenarios, the data obtained on the client side are not labeled, for instance due to a high labeling cost or the difficulty of annotation. FSSL (Federated Semi-Supervised Learning), introduced by Jeong et al. (2021), tries to solve two challenges: when the clients have both labeled and unlabeled data, and when the labeled data are only available on the server. Instead of using a Reinforcement Learning technique, the authors base their idea on an inter-client consistency loss, which is one of the most popular approaches to learning from unlabeled examples in a semi-supervised setting. As well, to find the parameter decomposition for disjoint learning (supervised and unsupervised), they minimize a loss term based on the cross entropy between the labeled data and the model output.
    However, in most settings, the distribution of the data in the local datasets of devices is not IID. Ghosh et al. (2019) propose sparse ternary compression, which is specifically designed for unbalanced and non-IID data. Their algorithm does not use RL and outperforms other algorithms on the CIFAR dataset in unbalanced and non-IID settings.
where $N_e$ denotes the number of pieces of equipment participating in the training, $c_l^t(i)$ represents the local training time of device $i$ at time $t$ and $c_c^t(i)$ represents the communication cost of device $i$ at time $t$. Moreover, the total training quality cost over all equipment, $C_{qu}^t$, is considered as:
$$C_{qu}^t = \sum_{i=1}^{N_e} \underbrace{\sum_{j} Loss\big(y_j - \hat{w}^t(x_j)\big)}_{\sigma_i}$$
where $\sigma_i$ quantifies the quality of the network model and $\hat{w}^t$ represents the training model aggregated at time $t$. Thus, the total cost in time slot $t$ is:
$$C^t = C_{time}^t + C_{qu}^t$$
An RL agent is used to find the optimal solution that minimizes the total cost. The agent observes the current state, which is characterized by the performance of the IIoT equipment, the quality of the local data and the gradient loss of this equipment. Then the agent takes an action $a_t$, a vector whose size is the number of nodes, filled with 0s and 1s, where $\delta_i^t = 1$ if node $i$ is selected and $\delta_i^t = 0$ otherwise. The evaluation method is a weighted average:
$$r(s_t, a_t) = -\frac{\sum_{i=1}^{n} C_i^t \cdot a_i^t}{\sum_{i=1}^{n} a_i}$$
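As a simple illustration of this reward, the sketch below computes the negative average total cost over the selected devices; the per-device costs are assumed to be given and the function and variable names are ours, not the paper's.

```python
import numpy as np

def selection_reward(time_costs, quality_costs, actions):
    """r(s_t, a_t) = - sum_i C_i^t a_i^t / sum_i a_i^t, with C_i^t = C_time,i^t + C_qu,i^t."""
    a = np.asarray(actions, dtype=float)                      # binary selection vector
    total_cost = np.asarray(time_costs) + np.asarray(quality_costs)
    if a.sum() == 0:                                          # no device selected
        return 0.0
    return -float((total_cost * a).sum() / a.sum())
```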
    The performance evaluation of this paper's solution reports the accuracy and the reduction of communication cost on the MNIST dataset with IID and non-IID data. However, the evolution of the total reward is not shown, which makes it hard to assess the efficiency of the RL agent. Even if the convergence of the loss value during federated learning is pointed out, there is a lack of explanation of how the participants were selected. We could conclude that the agent figures out that the best way to minimize the communication cost is to select the fastest devices with the best data quality. In other words, the algorithm faces two issues:
   • fairness is not taken into account during the training of the global model; in other words, the agent may set aside some participants and prioritize others, since its only goal is to minimize the communication cost.
   • there is no reason to believe that the agent would be able to find new participants over time if it keeps selecting the same participants to maximize its reward. For instance, if devices change behavior over time because of their usage, the agent could get stuck in a local minimum due to a lack of exploration.
$s^t$ and $a^t$ are the sets of states and actions collected from all agents at time $t$.
    Like the previous paper, Zhang et al. (2021) formulate the problem to solve as an optimization problem. They suppose that "a subset of clients will be early rejected by the central server while the rest clients will continue to finish their local training process". They are interested in the communication latency $H_{t,n}^{u}$ for uploading the local model from a client $n$. As well, $B_n^t$ represents the communication cost for sending updates from client $n$ to the central server, and the local training time of client $n$ is denoted $H_{t,n}^{c}$. $H_t$ represents the total processing latency of training round $t$, and $B_t$ the total communication cost for round $t$:
$$H_t = \max_{1 \le n \le N} \left( H_{t,n}^{u} + H_{t,n}^{c} \right) a_n^t \quad \text{and} \quad B_t = \sum_{n} B_n^t a_n^t$$
where $Acc(t)$ represents the accuracy and $A = [a_n^t]$ is a $T \times N$ matrix for client selection, with $T$ the number of rounds and $N$ the number of workers. $w_1, w_2, w_3$ are chosen by the user to design the FL application. $Obj_1$ must be maximized in order to get the best accuracy, while $Obj_2$ must be minimized to reduce the communication latency and $Obj_3$ must also be minimized to reduce the communication cost.
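For clarity, a small sketch of the latency and cost terms entering these objectives is given below, assuming the per-client latencies and costs are available as arrays; the helper name is ours.

```python
import numpy as np

def round_latency_and_cost(upload_latency, train_latency, comm_cost, actions):
    """H_t = max_n (H^u_{t,n} + H^c_{t,n}) a^t_n  and  B_t = sum_n B^t_n a^t_n.

    All arguments are per-client arrays for round t; `actions` is the binary
    selection vector a^t_n (1 = client kept, 0 = client early rejected).
    """
    a = np.asarray(actions, dtype=float)
    H_t = float(np.max((np.asarray(upload_latency) + np.asarray(train_latency)) * a))
    B_t = float(np.sum(np.asarray(comm_cost) * a))
    return H_t, B_t
```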
    Now, the state $s_n^t$ of agent $n$ at round $t$ is defined as follows:
$$s_n^t = \left( L_n^t,\ H_{t,n}^{p},\ H_{t,n}^{u},\ B_n^t,\ D_n,\ t \right)$$
where $D_n$ is the size of the local dataset and $t$ is the round index. On the central server, each MARL agent $n$ produces a binary action $a_n^t \in \{0, 1\}$, where 0 indicates that the device will be terminated after the probing training and vice versa. The reward $r_t$ at training round $t$ is defined accordingly. Note that $U(\cdot)$ is a utility function defined to ensure that the difference $U(Acc(t)) - U(Acc(t-1))$ is non-null when the improvement of $Acc(t)$ is small near the end of the FL process. Also, we recall that for each client $n$, $H_{t,n}^{p}$ represents the training latency, $H_{t,n}^{rest}$ is the time required to finish the local training process and $H_{t,n}^{u}$ is the communication latency. $B_t$ is the total communication cost, as defined above.
Figure 5: Decisions made by FedMarl. Each dot represents a client device. The blue dot means the client
          is selected for local training, the red dot means the client is early rejected (results from Zhang
          et al. (2021))
The number of communication rounds for CGA is quite high: in the experiments, twice that of SGP and four times that of SwarmSGD.
Knowledge function algorithm In the robotic field, inputs are generally based on sensor data attributes. The degree of confirmation of the action (output) that the robot chooses to perform is defined as the "confidence value". The idea is basically to highlight when there is a significant differentiation in the evaluation of the Q-values, which is called the scoring process. For instance, the values (20, 19, 16, 14, 20) are less differentiated than the values (5, 4, 80, 8, 10). The information entropy is chosen to characterize the confidence as a quantitative function of robotic confidence:
$$c_j = -\frac{1}{\ln m} \sum_{i=1}^{m} \frac{score_{ij}}{\sum_{i=1}^{m} score_{ij}} \cdot \ln\left( \frac{score_{ij}}{\sum_{i=1}^{m} score_{ij}} \right)$$
where $m$ is the action size of robot $j$ and $n$ is the number of private networks. The memory weight of robot $j$ is specified as:
$$w_j = \frac{1 - c_j}{\sum_{j=1}^{n} (1 - c_j)}$$
and the knowledge fusion function, which is the proportion of the confidence of the robots, as:
$$label_j = score \times \left( c_1, c_2, \ldots, c_m \right)^{T}$$
Also, in the cloud, the loss function along with the optimal parameter are defined as:
$$y_i = \sum_{j=1}^{num} c_j \cdot w_j$$
$$L(y, h_\theta(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - h_\theta(x_i) \right)^2$$
$$\theta^* = \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, h_\theta(x_i)\big)$$
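The following sketch illustrates the confidence and memory-weight computations from a score matrix, under the assumption that the scores are positive so that each normalized column is a distribution; names are illustrative, not taken from the paper.

```python
import numpy as np

def confidence_and_weights(scores):
    """Entropy-based confidence c_j and memory weights w_j from a score matrix.

    scores: (m, n) array, column j holds the m action scores of network j
            (assumed positive so the normalised column is a distribution).
    """
    m, _ = scores.shape
    p = scores / scores.sum(axis=0, keepdims=True)       # normalise each column
    c = -(p * np.log(p)).sum(axis=0) / np.log(m)          # c_j, normalised entropy in [0, 1]
    w = (1.0 - c) / (1.0 - c).sum()                       # memory weights w_j
    return c, w

c, w = confidence_and_weights(np.array([[20, 5], [19, 4], [16, 80], [14, 8], [20, 10]], float))
# The second column (5, 4, 80, 8, 10) is more peaked, so its entropy (c_j) is lower.
```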
Transferring approaches One idea is to take the shared model as the initial actor network, while another is to use the shared model as a feature extractor. When the shared model is used as the initial actor network, training turns out to be unstable, even if good scores are observed at the beginning. In contrast, the feature extractor increases the dimension of the features and improves performance stably.
One highlighted problem occurs when there is a structural difference between the input layers of the shared network and of a private network. For example, most neural networks in robots are convolutional neural networks (CNN), because CNNs process images as input and their outputs are described as features. But if a non-image sensor, such as a laser radar, is used, the Q-network is not a CNN; the output of the entire network is then used as additional features.
To summarize, each private neural network and the shared network generate actions which are called scores. They are stored and organized as a score matrix where each column corresponds to the scores of a specific network. After normalizing each column, the knowledge fusion is applied, that is, the computation of the confidences ($c_j$) and labels ($label_j$). In other words, labels are generated for all sample data. Finally, a network for generation $(k + 1)$ is created and fits the sample data as closely as possible.
Results and discussion The major benefit of this approach is that it acts as a kind of unsupervised learning technique: from unlabeled data, the algorithm is able to label the sample data based on the confidence. Moreover, the transferring idea is designed as a feature extractor in order to improve the shared network. However, several issues should be pointed out:
   • there is no information about how the shared network is supposed to have access to data;
   • to construct the score matrix, there is no reason that the dimension of the samples is the same for every neural network, which means the number of scores may vary between networks depending on their environment (i.e. their local data at time t);
   • the formula of the output y_i, which is based on the memory weight w_j, is not fully justified as the target used in the loss function.
where $L_v$ is the loss of the global model on the validation set $D_v$ and $\delta$ is a baseline calculated as a moving average of the previous losses $L_v$ with a moving-average window $T > 0$.
    The global model parameters are updated as:
$$\theta_G^{t+1} \leftarrow \theta_G^{t} - \frac{\alpha_\theta}{\sum_{i=1}^{N} s_i^t} \sum_{i=1}^{N} s_i^t \nabla \theta_i^t$$
The probability associated with $\theta_i^t$ is introduced as $P(s_i^t = 0) = \omega_i^t$. Considering a parameterized stochastic policy $\pi_\theta(a|s)$, the probability of a trajectory (with $S_t$ the state and $A_t$ the action at time-step $t$) is:
$$p(\tau|\pi) = \rho_0(S_0) \prod_{t=0}^{T} p(S_{t+1}|S_t, A_t)\, \pi(A_t|S_t)$$
with $\rho_0(S_0)$ the initial state distribution and $\tau$ the trajectory (sequence of state, action, reward). Then, the probability of $S^t$ under the policy $\phi$ is defined as:
$$p(S^t|\phi) = \prod_{t=1}^{T} \left[ (\omega_i^t)^{s_i^t} \cdot (1 - \omega_i^t)^{1 - s_i^t} \right]$$
and:
$$\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R_t] &= \nabla_\theta \int_{\tau_t} R_t\, p(\tau_t|\theta)\, d\tau_t \\
&= \int_{\tau_t} R_t\, \nabla_\theta\, p(\tau_t|\theta)\, d\tau_t \\
&= \int_{\tau_t} R_t\, p(\tau_t|\theta)\, \nabla_\theta \log p(\tau_t|\theta)\, d\tau_t \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R_t\, \nabla_\theta \log p(\tau_t|\theta) \right]
\end{aligned}$$
Then the evaluator's model parameters $\phi$ can be optimized by gradient ascent with learning rate $\alpha_\phi$:
$$\phi^{t+1} \leftarrow \phi^{t} + \alpha_\phi \sum_{i=1}^{n} r(S^t)\, \nabla_\phi \log p(S^t|\phi)\Big|_{\phi^t}$$
    The F-RCCE algorithm offers promising results. Indeed, the authors of the paper simulated FL systems with 50 to 500 clients, with a total of around 1000 training iterations. From 100 to 500 clients, the time cost only increases by 6% (whereas the leave-one-out (LOO) baseline increases linearly). The paper highlights that removing the highest contribution slows down model convergence. Moreover, removing the lowest contributions also has a negative impact, because they still have positive effects on the model. When the amount of clean data is large, the calculated gradient is relatively stable, which correlates with the client's contribution. However, the task model in the experiments is a logistic regression classifier, which means the size of the gradients is not large, contrary to a deep learning model. In other words, if the task model requires many parameters, then the size of the gradients times the number of task models (i.e. the number of participants) could be a significant input size for the evaluator's model. Moreover, the assumption of having a dataset on the server to compute the reward function raises some questions: there is no information about the size or the distribution of the dataset on the server, even if the dataset is supposed to be anonymized to preserve privacy.
III.         Methodology
This chapter covers the algorithms tested in the experiments. First, an introduction to the fundamental knowledge of Reinforcement Learning clarifies how the algorithms are built. Second, the first algorithm is designed to better evaluate the contribution of participants to the aggregation. Finally, the second algorithm is designed to optimize several criteria by better selecting the participants at each round.
The immediate reward is $R_t = R(S_t, A_t)$, where the action $A_t$ belongs to a finite set of actions. $\gamma \in [0, 1]$ is the reward discount factor.
    The discounted return is the weighted sum of rewards, which gives more weight to the closer steps:
$$G_{t \in [0, T]} = R(\tau) = \sum_{t=0}^{T} \gamma^t R_t$$
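A minimal sketch of this quantity, together with the "reward-to-go" variant used later for the policy gradient, is shown below; both can be computed in a single backward pass over a list of rewards.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t R_t for one trajectory."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def rewards_to_go(rewards, gamma):
    """A_hat_{t'} = sum_{t >= t'} gamma^(t - t') R_t, computed backwards in O(T)."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(rewards_to_go([1, 1, 1], gamma=0.9))       # [2.71, 1.9, 1.0]
```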
    The agent follows a policy $\pi$, and its objective is to maximize the expected reward through this policy. The policy is defined as:
$$\pi(a|s) = P(A_t = a \mid S_t = s)$$
The action-value function represents the expected return from state $S$ when taking action $A$:
$$Q_\pi(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right]$$
    The expected return of the policy $\pi$, for any trajectory $\tau$ and any reward function $R$, is defined as:
$$J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$
    The optimal policy is the policy the agent aims for in order to maximize the expected return, and it is defined as:
$$\pi^* = \underset{\pi}{\operatorname{argmax}}\ J(\pi)$$
where $\theta_G^t$ are the parameters of the global task model at time $t$, $\theta_i^t$ is the local task model of participant $i$ at time $t$ after its training step, $a_i \in \{0, 1\}$ is the stochastic action drawn from the probabilities output by the agent, and $k$ is the number of participants. The reward is designed to push the agent to find the contribution configuration that best improves the accuracy:
$$R_t = Acc_G^t - \delta_t = \frac{1}{\sum_{i}^{k} a_i} \sum_{i}^{k} a_i\, Acc_i(\theta_i^t) - \delta_t$$
where $Acc_i(\cdot)$ is the accuracy of participant $i$ and $\delta_t$ is the exponential moving average of the previous accuracies $Acc_G$. $Acc_G$ is an approximation of the accuracy of the global task model. The exponential moving average applies weighting factors that decrease exponentially; in other words, recent accuracies have more impact than older ones.
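The sketch below shows one way this reward and its moving baseline could be computed; `alpha_exp` mirrors the $\alpha_{exp}$ parameter of Algorithm 1, and the remaining names are ours.

```python
import numpy as np

def contribution_reward(accuracies, actions, delta, alpha_exp=0.9):
    """Reward of the contribution-evaluation agent.

    R_hat  = (1 / sum_i a_i) * sum_i a_i * Acc_i   (approximate global accuracy)
    R_t    = R_hat - delta                          (improvement over the EMA baseline)
    delta' = delta + (1 - alpha_exp) * (R_hat - delta)
    """
    a = np.asarray(actions, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    r_hat = float((a * acc).sum() / max(a.sum(), 1.0))
    reward = r_hat - delta
    new_delta = delta + (1.0 - alpha_exp) * (r_hat - delta)
    return reward, new_delta
```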
    The Reinforcement Learning algorithm used is the REINFORCE algorithm, which is explained theoretically below. This algorithm encourages the agent to take actions that lead to greater cumulative reward, by weighting the gradient of each action's log-probability by that action's cumulative reward.
$$\begin{aligned}
\sum_{t=0}^{T} A_t \sum_{t'=0}^{t} B_{t'} &= B_0 (A_0 + A_1 + A_2 + \cdots + A_T) \\
&\quad + B_1 (A_1 + A_2 + \cdots + A_T) \\
&\quad + B_2 (A_2 + \cdots + A_T) \\
&\quad + \ldots \\
&\quad + B_T A_T \\
&= \sum_{t'=0}^{T} B_{t'} \sum_{t=t'}^{T} A_t
\end{aligned}$$
The following statement can be verified: $\mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(A_{t'}|S_{t'})\, b(S_{t'}) \right] = 0$. Any normalized distribution satisfies:
$$\begin{aligned}
& \int_{x} P_\theta(x)\, dx = 1 \\
\iff\ & \nabla_\theta \int_{x} P_\theta(x)\, dx = \nabla_\theta 1 = 0 \\
\iff\ & \int_{x} \nabla_\theta P_\theta(x)\, dx = 0 \\
\iff\ & \int_{x} P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx = 0 \\
\iff\ & \mathbb{E}_{x \sim P(\cdot|\theta)}\left[ \nabla_\theta \log P_\theta(x) \right] = 0
\end{aligned}$$
$$\implies \mathbb{E}_{A_{t'} \sim \pi(\cdot|\theta)}\left[ \nabla_\theta \log \pi_\theta(A_{t'}|S_{t'})\, b(S_{t'}) \right] = 0$$
Moreover, it is better to choose $b(S_t)$ with a variance $Var(b(S_t))$ close to 0 and strongly correlated with $\sum_{t=t'}^{T} \gamma^{t-t'} R_t$, because if $X$ and $Y$ are two random variables then
$$Var(X - Y) = Var(X) + Var(Y) - 2\,cov(X, Y)$$
with $cov(X, Y)$ the covariance between $X$ and $Y$. If $Y$ has a variance close to 0 then:
$$Var(X - Y) \approx Var(X) - 2\,cov(X, Y)$$
The intuition is that subtracting such a baseline reduces the variance. The baseline $b(S_t)$ is often the value function $V(S_t)$ in actor-critic algorithms.
This distribution can be verified to be normalized. For convenience, let $S$ be the sum $\sum_{A^t \in \mathcal{A}} \prod_{i=1}^{n} (p_i^t)^{a_i} (1 - p_i^t)^{1 - a_i}$ and let $\mathcal{A}$ be the set of all possible actions, $\mathcal{A} = \left\{ (a_1, \ldots, a_n) \mid \forall i \in [1, n],\ a_i \in \{0, 1\} \right\}$. Then the expression can be reorganized to verify the result.
$$\begin{aligned}
S &= (p_1^t)^0 (1-p_1^t)^1 \times (p_2^t)^0 (1-p_2^t)^1 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times (p_2^t)^0 (1-p_2^t)^1 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + (p_1^t)^0 (1-p_1^t)^1 \times (p_2^t)^1 (1-p_2^t)^0 \times \cdots \times (p_n^t)^0 (1-p_n^t)^1 \\
  &\quad + \ldots \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times (p_2^t)^1 (1-p_2^t)^0 \times \cdots \times (p_n^t)^1 (1-p_n^t)^0
\end{aligned}$$
Factoring out the two possible terms for the first participant,
$$\begin{aligned}
S &= (p_1^t)^0 (1-p_1^t)^1 \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big] \\
  &\quad + (p_1^t)^1 (1-p_1^t)^0 \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big] \\
  &= \Big[ (p_1^t)^0 (1-p_1^t)^1 + (p_1^t)^1 (1-p_1^t)^0 \Big] \times \Big[ \textstyle\sum_{(a_2,\ldots,a_n)} \prod_{i=2}^{n} (p_i^t)^{a_i} (1-p_i^t)^{1-a_i} \Big]
\end{aligned}$$
Repeating the same factorization for the remaining participants gives
$$\begin{aligned}
S &= \Big[ (p_1^t)^0 (1-p_1^t)^1 + (p_1^t)^1 (1-p_1^t)^0 \Big] \times \Big[ (p_2^t)^0 (1-p_2^t)^1 + (p_2^t)^1 (1-p_2^t)^0 \Big] \\
  &\quad \times \cdots \times \Big[ (p_n^t)^0 (1-p_n^t)^1 + (p_n^t)^1 (1-p_n^t)^0 \Big] \\
  &= 1 \times 1 \times \cdots \times 1 = 1
\end{aligned}$$
Therefore the elements of the output belong to the range [0, 1] and sum to 1.
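A quick numerical check of this normalization, enumerating all $2^n$ binary action vectors for arbitrary selection probabilities, can be sketched as follows.

```python
import itertools
import numpy as np

# Check that S = sum over all binary vectors a of prod_i p_i^{a_i} (1 - p_i)^{1 - a_i} equals 1
p = np.array([0.2, 0.7, 0.55, 0.9])                 # arbitrary selection probabilities p_i^t
S = sum(np.prod([p[i] if a[i] else 1 - p[i] for i in range(len(p))])
        for a in itertools.product([0, 1], repeat=len(p)))
print(S)  # ~1.0 up to floating-point error
```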
    Nevertheless, this means that when the probabilities are distributed in a homogeneous way, the probability for one participant to be selected depends strongly on the probabilities of the other participants. For example, with 100 participants, a uniform distribution means that each probability of being selected is $\frac{1}{100}$. Consequently, participants would have little chance of being selected for the next round, which would probably lead to an action where all participants are discarded.
    With the aim of preventing this difficulty, the output is not a vector with $k$ elements, where $k$ is the number of participants, but a $k \times 2$ matrix whose first column corresponds to the probability of being selected and whose second column corresponds to the probability of being discarded. The softmax function is therefore applied row-wise, so the probabilities of each row sum to 1. The architecture of the neural model must produce an output with $k \times 2$ values. Two ideas are suggested in Figure 6:
(a) Information is treated globally and the hidden layers do not take into account the independence between the probabilities of selection and the probabilities of discard
(b) Information is treated separately, which allows the neural network to focus on the probabilities of selection and the probabilities of discard
Figure 6: Different architectures of neural networks producing an output with two columns. Green represents neurons for selection and violet represents neurons for discard.
    The architecture on the left side of Figure 6 raises several difficulties. The output of the neural network should be a $k \times 2$ matrix; however, to obtain this shape with this architecture, the last output must be reshaped from $(2k) \times 1$ to $k \times 2$. There is no better reason for choosing the first $k$ elements for the first column and the last $k$ elements for the second column than filling the output matrix with randomly chosen elements, and the size of the hidden layers would have to be doubled to have enough parameters. In contrast, the architecture on the right side of Figure 6 separates selection and discard, which makes the information easier to handle. This architecture is therefore used for the neural network of the agent.
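A possible sketch of the retained architecture (right side of Figure 6) is shown below in PyTorch; the layer sizes and names are illustrative and not taken from the report.

```python
import torch
import torch.nn as nn

class SelectionPolicy(nn.Module):
    """Two separate branches produce k "select" logits and k "discard" logits;
    a row-wise softmax turns each (select, discard) pair into probabilities
    that sum to 1, giving the k x 2 output described above."""
    def __init__(self, k, hidden=64):
        super().__init__()
        self.select_branch = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.discard_branch = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, accuracies):                    # accuracies: (batch, k) state S_t
        sel = self.select_branch(accuracies)          # (batch, k)
        dis = self.discard_branch(accuracies)         # (batch, k)
        logits = torch.stack([sel, dis], dim=-1)      # (batch, k, 2)
        return torch.softmax(logits, dim=-1)          # each row sums to 1

policy = SelectionPolicy(k=10)
probs = policy(torch.rand(1, 10))                     # (1, 10, 2)
actions = torch.bernoulli(probs[..., 0])              # sample selection from the first column
```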
local data set. Participants estimate the accuracy of their local model θi . After this step, the server
downloads in its turn all the local models and their accuracies. The agent, on the server, observes
a state St , a vector of k accuracies (Acc0 , . . . , Acck ), and takes an action At , a matrix of probabilities
where the first column is the probability of being selected and the second column is the probability
of being discarded. The action is then translated into a vector of k elements (0 for discarded and
1 for selected), generated by applying the Bernoulli law to the first column of the previous matrix.
Depending on this decision, the agent receives a reward Rt . The moving batch is updated by removing
the oldest transition and adding the new transition (St , At , Rt ). The agent then uses the moving
batch for training. When the moving batch is full, the model of the agent can be trained by using
the following formula:
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t'=0}^{T} \nabla_\theta \log \pi_\theta(A_{t'} \mid S_{t'}) \sum_{t=t'}^{T} \gamma^{\,t-t'} R_t \right]
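    A hedged sketch of how this gradient can be estimated from the transitions of the moving batch
is shown below; the reward-to-go computation and tensor handling are assumptions, not code taken
from the repository.

import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient matches the formula above, for one trajectory.

    log_probs: list of tensors log pi_theta(A_t' | S_t'), one per step
    rewards:   list of scalar rewards R_t
    """
    T = len(rewards)
    loss = 0.0
    for t_prime in range(T):
        # reward-to-go: sum_{t >= t'} gamma^(t - t') * R_t
        ret = sum(gamma ** (t - t_prime) * rewards[t] for t in range(t_prime, T))
        loss = loss - log_probs[t_prime] * ret  # minimizing -J ascends J
    return loss

Averaging this loss over the trajectories of the moving batch, calling backward() and taking one
optimizer step on the agent parameters performs the update described above.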
Algorithm 1 EvaluatorFL. There are k participants; Bl is the local minibatch size, E is the number
of local epochs and η is the learning rate of the task model; R is the number of rounds, Ba is the
size of the moving batch, αlr is the learning rate of the agent model and αexp is the parameter of
the exponential moving average. The algorithm is described for one episode.
 1: procedure Server with the agent executes:
 2:    Initialize θG^0                                                        ▷ Global task model
 3:    Initialize the moving batch B
 4:    Initialize θa                                                          ▷ Agent model
 5:    Initialize δ = 0
 6:    for each round t from 1 to R do
 7:        for each client i ∈ St in parallel do
 8:            Update local model θi^t ← θG^t            ▷ Participants download the global task model
 9:            θi^{t+1} ← ClientUpdate(θi^t)
10:            Acci^{t+1} ← Accuracy(θi^{t+1})
11:        The server collects the local gradients θi^{t+1} and the accuracies Acci^{t+1}
12:        St ← (Acc1^{t+1}, . . . , Acck^{t+1})                              ▷ State
13:        At ← θa(St)                                                        ▷ Action
14:        Selt ← Bernoulli(At^0)                 ▷ Selection using the first column of probabilities
15:        p ← Σ_{i=1}^{k} Selt^i
16:        R̂ ← (1/p) Σ_{i=1}^{k} Selt^i · Acci^{t+1}
17:        Rt ← R̂ − δ
18:        δ ← δ + (1 − αexp)(R̂ − δ)
19:        Collect {St, At, Rt} and update B
20:        θG^{t+1} ← (1/p) Σ_{i=1}^{k} Selt^i · θi^{t+1}
21:        if B is full then
22:            Ât′,j ← Σ_{t=t′}^{T} γ^{t−t′} Rt
23:            Compute the policy gradient:
24:                J(θa) ← (1/Ba) Σ_{j=1}^{Ba} Σ_{t′=0}^{T} ∇θ log πθa(At′,j | St′,j) Ât′,j
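    To make the selection, reward and aggregation steps of Algorithm 1 concrete, a minimal NumPy
sketch of one server round is given below; the variable names and the guard against an empty
selection are assumptions.

import numpy as np

def server_round(select_probs, accuracies, local_models, delta, alpha_exp=0.9):
    """One selection/aggregation step in the spirit of Algorithm 1 (illustrative sketch).

    select_probs: first column of the agent output, shape (k,)
    accuracies:   local accuracies Acc_i^{t+1}, shape (k,)
    local_models: stacked local parameters theta_i^{t+1}, shape (k, d)
    delta:        exponential moving average of the previous accuracies
    """
    sel = np.random.binomial(1, select_probs)           # Bernoulli draw per probability
    p = max(sel.sum(), 1)                                # avoid division by zero (assumption)
    r_hat = (sel * accuracies).sum() / p                 # mean accuracy of the selected workers
    reward = r_hat - delta                               # baseline-corrected reward
    delta = delta + (1 - alpha_exp) * (r_hat - delta)    # update the moving average
    global_model = (sel[:, None] * local_models).sum(axis=0) / p
    return sel, reward, delta, global_model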
when the accuracy is increased and the total time is decreased between two consecutive rounds,
and by punishing it in the opposite case. Alternatively, for instance, the reward could increase
exponentially as the time is further minimized while the accuracy is further maximized. Finally, the
distribution of data could be taken as a weight which influences the reward. In the non-IID case,
as an illustration of the intention, a worker with 300 samples of the label A should have a different
impact on the reward than a worker with 4000 samples of the same label A.
    However, in this study, the reward is based on the criterion of performance: the objective of
the agent is to minimize the total time T^t of each round. The reward is established as the negative
difference between the maximum time of the participants and the previous maximum time:

R_t = -\left( \max(T^a) - \nu \right)

where T^a is the set of all total times of the participants of the current round and ν is the previous
maximum time. Note that the total times are normalized and, to normalize the reward, it is better
to take the mean of the times rather than their sum (i.e. the total times).
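    A small sketch of this reward is given below, assuming normalized times and the interpretation
above; how ν is refreshed between rounds is an assumption, since the continuation of the scheduling
algorithm is not reproduced here.

def time_reward(times, nu):
    """Reward based on the (normalized) times of the current participants.

    times: list of total times T^a of the selected participants
    nu:    previous maximum time
    Returns the reward and the new value of nu (assumed to track the current maximum).
    """
    current_max = max(times)
    reward = -(current_max - nu)  # positive when the round got faster
    return reward, current_max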
   • TRPO algorithm (Trust Region Policy Optimization), where the step size is properly handled
     in the policy gradient based on the idea of a trust region. In other words, the algorithm helps
     to manage the step size depending on the curvature of the reward landscape. The algorithm is
     complex and would take too much time to be implemented.

   • PPO algorithm (Proximal Policy Optimization) is an improvement of the TRPO algorithm,
     because TRPO is complicated and suffers from a computational burden when computing the
     natural gradient. Even so, the algorithm remains too elaborate to be used for Federated
     Learning and was not retained.

   • AC algorithm (Actor Critic) is a simple algorithm which is an improvement of the policy-based
     methods (for example, the REINFORCE algorithm), where the state value is used as a baseline
     to reduce the variance of the estimated gradient in the vanilla policy gradient method. The
     main advantage of this algorithm is its simplicity; however, after hundreds of experiments, the
     algorithm did not converge to an acceptable solution. Most of the time, the probabilities
     diminish until being close to zero and only one or two participants are selected for the next
     round.
    Only the DDPG algorithm (Deep Deterministic Policy Gradient) was validated. The idea of
the algorithm is to extend the DQN algorithm (Deep Q-Network) into a deterministic policy
gradient. DDPG establishes a Q function (critic) and a policy function (actor) simultaneously.
Temporal-difference methods are used to update the Q function (critic), which is the same as in
7 A small description of several algorithms can be found in Appendix ii.
DQN. The action-value function Q(s, a|θQ), where θQ denotes the critic network parameters, is
learned using the Bellman equation (see Bellman (1957)). The action-value function given any
policy π is defined as:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \,\Big|\, S_0 = s,\, A_0 = a \right]
Then the Q value of the critic network θQ is computed and used to minimize the loss function:

L = \frac{1}{N} \sum_{i} \left( Y_i - Q(S_i, A_i \mid \theta^{Q}) \right)^2 \quad \text{with} \quad Y_i = R_i + \gamma\, Q\!\left(S_{i+1}, \pi(S_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)
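    A hedged PyTorch sketch of this critic update is given below; the interfaces critic(states, actions),
target_critic and target_actor are assumptions about how the networks are wrapped.

import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    """One critic step: minimize (Y_i - Q(S_i, A_i | theta^Q))^2 with a TD target."""
    states, actions, rewards, next_states = batch
    with torch.no_grad():
        next_actions = target_actor(next_states)
        targets = rewards + gamma * target_critic(next_states, next_actions)
    loss = F.mse_loss(critic(states, actions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()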
    The policy of the actor network is deterministic, thus it will be noted µ (a deterministic policy)
to distinguish it from a stochastic policy. The performance objective for the deterministic policy µ
follows the same expected discounted reward definition as for a stochastic policy:

J(\mu) = \mathbb{E}_{S_t \sim \rho^{\mu},\, A_t = \mu(S_t)}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R(S_t, A_t) \right]
= \int_{S} \int_{S} \sum_{t=1}^{\infty} \gamma^{t-1} \rho_0(S)\, P(S' \mid S, t, \mu)\, R(S', \mu(S'))\, \mathrm{d}S\, \mathrm{d}S'
= \int_{S} \rho^{\mu}(S)\, R(S, \mu(S))\, \mathrm{d}S \qquad \text{because} \quad \rho^{\mu}(S') = \int_{S} \sum_{t=1}^{\infty} \gamma^{t-1} \rho_0(S)\, P(S' \mid S, t, \mu)\, \mathrm{d}S
where P(S' | S, t, µ) = P(S_{t+1} | S_t, A_t) P^{µ}(A_t | S_t) is the transition probability times the probability of
the action choice. The definition of the state value function can be used:
                                           "∞                               #
                                             X
                             V µ (S) = E         γ t−1 R(St , At )|St = S; µ
                                                t=1
                                               ∞
                                             Z X
                                         =            γ t−1 P (S ′ |S, t, µ)R(S ′ , µ(S ′ ))dS ′
                                              S t=1
With a deterministic policy, there is only a single action whereas with a stochastic policy, the Q-value
is an expectation over the action distribution. Thus V µ (S) = Qµ (S, µ(S)) and :
J(\mu) = \int_{S} \rho_0(S)\, Q^{\mu}(S, \mu(S))\, \mathrm{d}S
Writing Q^{\mu_\theta}(S, \mu_\theta(S)) = A + B with A = R(S, \mu_\theta(S)) and B = \int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, V^{\mu_\theta}(S')\, \mathrm{d}S', the gradients of the two terms are:

\nabla_\theta A = \nabla_\theta \left[ R(S, \mu_\theta(S)) \right] = \nabla_\theta \mu_\theta(S) \times \nabla_A R(S, A)\big|_{A=\mu_\theta(S)}

and:

\nabla_\theta B = \nabla_\theta \left[ \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times V^{\mu_\theta}(S')\, \mathrm{d}S' \right] = \int_{S} \gamma\, \nabla_\theta \left[ P(S' \mid S, \mu_\theta(S)) \times V^{\mu_\theta}(S') \right] \mathrm{d}S'
By posing u(θ) = P(S' | S, µθ(S)) and v(θ) = V^{µθ}(S'), the formula ∇θ[u(θ) × v(θ)] = ∇θ u(θ) × v(θ) +
u(θ) × ∇θ v(θ) can be used to develop the expression:

\nabla_\theta B = \int_{S} \gamma\, \nabla_\theta \mu_\theta(S)\, \nabla_A P(S' \mid S, A)\big|_{A=\mu_\theta(S)} \times V^{\mu_\theta}(S')\, \mathrm{d}S' + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
\nabla_\theta V^{\mu_\theta}(S) = \nabla_\theta \mu_\theta(S) \left[ \nabla_A R(S, A) + \int_{S} \gamma\, \nabla_A P(S' \mid S, A) \times V^{\mu_\theta}(S')\, \mathrm{d}S' \right]_{A=\mu_\theta(S)} + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
                               = \nabla_\theta \mu_\theta(S)\, \nabla_A Q^{\mu_\theta}(S, \mu_\theta(S)) + \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \times \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
Fubini’s theorem. Suppose A and B are complete measure spaces and f(x, y) is A × B measurable. If

\int_{A \times B} |f(x, y)|\, \mathrm{d}(x, y) < \infty

where the integral is taken with respect to a product measure on the space A × B, then

\int_{A} \left( \int_{B} f(x, y)\, \mathrm{d}y \right) \mathrm{d}x = \int_{B} \left( \int_{A} f(x, y)\, \mathrm{d}x \right) \mathrm{d}y = \int_{A \times B} f(x, y)\, \mathrm{d}(x, y)

the first two integrals being iterated integrals with respect to the two measures, respectively, and the
third being an integral with respect to the product of these two measures.
    The last part of the expression can be iterated using the relation between ∇θ V^{µθ}(S) and
∇_A Q^{µθ}(S, µθ(S)), and simplified by using Fubini’s theorem:

\int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, \nabla_\theta V^{\mu_\theta}(S')\, \mathrm{d}S'
 = \int_{S} \gamma P(S' \mid S, \mu_\theta(S)) \left[ \nabla_\theta \mu_\theta(S')\, \nabla_A Q^{\mu_\theta}(S', \mu_\theta(S')) + \int_{S} \gamma P(S'' \mid S', \mu_\theta(S'))\, \nabla_\theta V^{\mu_\theta}(S'')\, \mathrm{d}S'' \right] \mathrm{d}S'
 = \int_{S} \gamma P(S' \mid S, \mu_\theta(S))\, \nabla_\theta \mu_\theta(S')\, \nabla_A Q^{\mu_\theta}(S', \mu_\theta(S'))\, \mathrm{d}S' + \int_{S} \gamma^{2} P(S \to S'', 2, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(S'')\, \mathrm{d}S''
where P(S → S', t, µθ) is the probability of transitioning from S to S' in t steps8 . Then the expression
can be used in the policy gradient:

\nabla_\theta J(\mu_\theta) = \int_{S} \rho_0(S) \cdot \nabla_\theta V^{\mu_\theta}(S)\, \mathrm{d}S
                            = \int_{S} \int_{S} \sum_{t=0}^{\infty} \gamma^{t} \rho_0(S)\, P(S \to S', t, \mu_\theta)\, \nabla_\theta \mu_\theta(S') \times \nabla_A Q^{\mu_\theta}(S', A)\big|_{A=\mu_\theta(S')}\, \mathrm{d}S'\, \mathrm{d}S
As \rho^{\mu_\theta}(S') = \int_{S} \sum_{t=0}^{\infty} \gamma^{t} \rho_0(S)\, P(S \to S', t, \mu_\theta)\, \mathrm{d}S, then:

\nabla_\theta J(\mu_\theta) = \int_{S} \rho^{\mu_\theta}(S)\, \nabla_\theta \mu_\theta(S)\, \nabla_A Q^{\mu_\theta}(S, A)\big|_{A=\mu_\theta(S)}\, \mathrm{d}S
    For the algorithm, the policy function is updated by applying the chain rule to the expected
return J from the start distribution. Learning from mini-batches, the gradient of the actor network
can be computed as:

\nabla_{\theta^{\pi}} J \approx \sum_{i} \nabla_A Q(S_i, A \mid \theta^{Q})\big|_{A=\pi(S_i \mid \theta^{\pi})}\, \nabla_{\theta^{\pi}} \pi(S_i \mid \theta^{\pi})
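    Under the same interface assumptions as the critic sketch given earlier, the corresponding actor
step can be sketched as follows; the minus sign turns the gradient ascent on Q into a loss to minimize.

def actor_update(actor, critic, optimizer, states):
    """One actor step: ascend Q(S, pi(S | theta^pi)) by minimizing its negation."""
    loss = -critic(states, actor(states)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()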
    The network parameters are updated by exponential smoothing rather than by directly copying
the parameters:

\theta^{Q'} \leftarrow \rho\, \theta^{Q} + (1 - \rho)\, \theta^{Q'}
\theta^{\pi'} \leftarrow \rho\, \theta^{\pi} + (1 - \rho)\, \theta^{\pi'}

where ρ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1.
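    For example, a minimal soft-update helper (assuming PyTorch modules for the networks) could
be written as:

import torch

@torch.no_grad()
def soft_update(network, target_network, rho=0.01):
    """theta' <- rho * theta + (1 - rho) * theta', applied parameter by parameter."""
    for param, target_param in zip(network.parameters(), target_network.parameters()):
        target_param.mul_(1.0 - rho).add_(rho * param)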
Figure 8: DDPG algorithm with replay memory where π(·|θπ ) denotes the policy function (actor) and Q(·|θQ )
          denotes the Q function (critic)
    Figure 8 illustrates the DDPG algorithm and introduces the concept of replay memory. The
replay memory stores the experiences of the agent. Then, for each training iteration, a random
sample of this memory is selected. This reduces the correlation between experiences in a training
batch and greatly helps the training.
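    A minimal replay memory along these lines, a bounded buffer with uniform random sampling,
could be sketched as follows; the fixed capacity is an assumption.

import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # random sampling breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)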
Algorithm 2 SchedulerFL. There are k workers; p is the percentage of participants in the first round,
E is the number of experiences, R is the number of rounds and γ is the discount factor of the reward.
 1: procedure Server with the agent executes:
 2:    Initialize θπ                                                          ▷ Actor network
 3:    Initialize θπ′                                                         ▷ Target actor network
 4:    Initialize θQ                                                          ▷ Critic network
 5:    Initialize θQ′                                                         ▷ Target critic network
 6:    Initialize a replay memory M
 7:    for each experience e from 1 to E do
 8:        Initialize θG^0                                                    ▷ Global task model
 9:        Initialize S0 as a 3 × k zero matrix                               ▷ State
10:        Initialize ν = 0                              ▷ Previous maximum time for the reward
11:        Select randomly p · k participants
12:        for each participant in parallel do
13:            θi^0 ← θG^0                           ▷ Participants download the global task model
14:            θi^1 ← ClientUpdate(θi^0)                                      ▷ Local training
15:            The server collects the local model θi^1
16:            Compute the times T^c, T^u and T^d
17:            Update S0 to S1 by replacing the corresponding entries with the new times
18:        Complete S1 by replacing the remaining zeros with the mean of the participants' times
19:        Aggregate the local models with the FedAvg algorithm to get θG^1
IV.        Experiments
For the experiments, the distribution of data between clients is crucial to simulate IID cases and
non-IID cases. In the first section, the generation of data is covered to illustrate the different
distributions of data. The other sections then aim to test the algorithms through experiments and
analyse their results. The whole code for the experiments is available in the github repository
https://github.com/bourbonut/reinforcedFL.
i.      Generation of data
This section covers how data are distributed on each client device, under the assumption that the
data, when all gathered, are identically and independently distributed. To satisfy this assumption,
the working dataset (e.g. the MNIST dataset or the FashionMNIST dataset) is initialized and has to
be IID. Then, there are several parameters which can be managed:
   • the distribution based on the volume, which means that in the IID case all clients have the
     same amount of data for each label, while in the non-IID case the volume of data changes from
     one client to another for a specific label. For instance, with two clients A and B, if both have
     the label l, then in the IID case each client has half of the amount of data for label l, while
     in the non-IID case the proportion owned by each client is randomly chosen, which creates an
     unbalanced configuration (e.g. 36% of the data on client A and 64% of the data on client B);

   • the distribution based on the label, which implies that in the IID case all clients have all labels,
     while in the non-IID case a parameter m specifies the minimum number of labels on clients
     and labels are randomly chosen. Another parameter b indicates whether each client has at least
     m labels or possibly fewer. For example, if there are 3 clients and 5 labels l1 , l2 , l3 , l4 , l5 , then
     in the IID situation all clients have the 5 labels, whereas in the non-IID situation, if m = 3 and
     b = False, then one client may have 3 labels (e.g. l2 , l3 , l5 ), one client 2 labels (e.g. l1 , l4 ) and
     the last client one label (e.g. l5 ). When b = True, all clients have at least 3 labels. A sketch of
     such a label-based split is given after this list.
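    As announced in the list above, a hedged sketch of a label-based non-IID split is given below; the
sampling strategy (how many labels a client receives when b = False, and how uncovered labels are
reassigned) is an illustrative assumption, not the exact generator of the repository.

import random

def split_labels(labels, n_clients, m, balanced):
    """Assign labels to clients for a non-IID label distribution.

    labels:   list of available labels
    m:        minimum number of labels per client
    balanced: if True, every client receives at least m labels (here exactly m for simplicity);
              if False, a client may receive fewer labels
    """
    assignment = {c: set() for c in range(n_clients)}
    for c in range(n_clients):
        n_labels = m if balanced else random.randint(1, m)
        assignment[c] = set(random.sample(labels, n_labels))
    # make sure every label is owned by at least one client
    for label in labels:
        if not any(label in owned for owned in assignment.values()):
            assignment[random.randrange(n_clients)].add(label)
    return assignment

print(split_labels(list(range(10)), n_clients=3, m=3, balanced=False))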
    Figure 9 illustrates visually some distributions. Each color corresponds to a client, and the bars
reflect the amount of data of each label for a specific client. The distribution of volume and the
distribution of labels can be mixed: for instance, the volume can be non-IID while the labels are
non-IID as well. When the distribution of labels is non-IID and balanced, the volume cannot always
be IID, depending on the number of clients and the number of labels. On the figure, it is noticeable
that label 3 is shared by four clients, two of which have more data than the others. Due to choosing
m = 3 (the minimum number of labels on each client), several labels become unbalanced in order to
meet the expectations of the parameters.
    When data are generated, two types of data are given per client: the data for training and the
data for testing. Taking the whole set of data, the split tends to the ratio of 80% for training and
20% for testing. For well-known datasets such as MNIST, the default configuration is kept (for
MNIST, 60000 samples for training and 10000 samples for testing). The same labels and the same
volume of data are used between training data and testing data. In other words, if a model on
the client's device learns on labels 2, 4 and 7, then the model is also tested on labels 2, 4 and 7.
Likewise, speaking of data for training, if the client's device holds 5% of label 2, 10% of label 4 and
6% of label 7, then the data for testing is distributed in the same way as the data for training.
Figure 9: Impact on the generation of data according to the distribution of labels and the distribution of
          the volume, with 20 clients on the MNIST dataset (10 labels and 60000 samples). Each panel plots
          the number of samples per label. (a) Labels and volume are IID. (b) Labels are IID and volume is
          non-IID. (c) Labels are non-IID (balanced, b = True and m = 3) and volume is IID. (d) Labels are
          non-IID (unbalanced, b = False and m = 3) and volume is IID.
evolution of the global model. FedAvg is described as a weighted average of the local models on
clients' devices. The weights are computed based on the proportion of data held by a device compared
to the whole size of the data. For instance, if a device holds 1500 samples and the size of the whole
dataset, all data of the devices taken together, is 60000, then the weight is 0.025. The total amount
of data on local devices is assumed to be known thanks to Federated Analytics techniques, i.e.
collaborative data science without data collection. Knowing the amount of data on each local device
then gives the whole size of the data as if they were gathered on a server. Thus, FedAvg can be
computed while preserving the privacy of clients.
Algorithm 3 FederatedAveraging. The K clients are indexed by k; B is the local minibatch size,
E is the number of local epochs, η is the learning rate, n_k is the number of samples of client k and
n is the total number of samples.
 1: procedure Server executes:
 2:    initialize w0
 3:    for each round t = 1, 2, . . . do
 4:        m ← max(C · K, 1)
 5:        St ← (random set of m clients)
 6:        for each client k ∈ St in parallel do
 7:            w_{t+1}^k ← ClientUpdate(k, wt)
 8:        w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w_{t+1}^k
 9:
10: procedure ClientUpdate(k, w):                                              ▷ Run on client k
11:    B ← (split P_k into batches of size B)
12:    for each local epoch i from 1 to E do
13:        for batch b ∈ B do
14:            w ← w − η∇L(w; b)                             ▷ L(w; b) represents the loss function
15:    return w to server
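    As an illustration of the weighted average of line 8, a small sketch of the aggregation step is given
below; the state_dict-like interface (a mapping from parameter names to tensors) is an assumption.

def fedavg_aggregate(local_models, sample_counts):
    """Weighted average of client models: w <- sum_k (n_k / n) * w_k.

    local_models:  list of dictionaries mapping parameter names to tensors (or arrays)
    sample_counts: list of n_k, the number of local samples of each client
    """
    total = float(sum(sample_counts))
    aggregated = {}
    for key in local_models[0]:
        aggregated[key] = sum(
            (n_k / total) * model[key] for model, n_k in zip(local_models, sample_counts)
        )
    return aggregated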
    Different experiments are presented in figure 10, figure 11, figure 12 and figure 13. They are
realized on the FashionMNIST dataset (60000 samples with balanced labels), working with 20 workers9
over 10 rounds and 3 epochs, with different distributions of data. Each figure is composed of the
distribution of data for training over all workers on the left side and the evolution of the weighted
average accuracy on the right side. Note that the distribution of data for testing is similar to the
distribution of data for training; therefore, each worker has the same labels and the same volume
per label for training and testing. With 10 labels, a model which learns on labels 1, 5 and 6, for
instance, will never be tested on labels 0, 2, 3, 4, 7, 8 and 9. Each round, every worker computes
the accuracy of the model on their own local dataset. Then, the server collects all the local
accuracies and computes the weighted average accuracy, whose evolution is illustrated on the right
side of each figure. Note that round 0 corresponds to the state where the workers' models are
randomly initialized. When all labels appear in each worker's dataset (figures 10 and 11), the
method converges fairly quickly. However, when labels are non independent and identically
distributed, the convergence becomes slower in the balanced case (see figure 12) and is penalized
when only a few workers have all the data of some specific labels while the others share the remaining
part of one label (see figure 13). Note that in figures 10, 11, 12 and 13, all workers are participants
every round.
Figure 10: Convergence of the FashionMNIST model with Federated Averaging using 20 workers in the IID
           configuration (left: number of samples per label; right: accuracy per round).
Figure 11: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID volume (left: number of samples per label; right: accuracy per round).
Figure 12: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID labels, balanced and 3 labels minimum (left: number of samples per label; right:
           accuracy per round).
Figure 13: Convergence of the FashionMNIST model with Federated Averaging using 20 workers with
           non-IID labels, unbalanced and 3 labels minimum (left: number of samples per label; right:
           accuracy per round).
at least per worker and with non-IID volume. Between episodes, the workers keep the same dataset,
but the task model is reinitialized globally and locally. Two approaches were realized.
where Seli is 1 for selected and 0 for discarded, Acci is the accuracy of the participant i and k is
the number of participants.
Figure 14: Selection represented for 20 episodes with 10 rounds per episode with 20 participants
Figure 15: Selection represented for 20 episodes with 10 rounds per episode with 100 participants
    Figures 14 and 15 illustrate the selection of contributors for the aggregation. Each color is
associated with a participant: a colored dot means the participant is selected, otherwise the
participant is discarded from the aggregation. On figure 14, after only one episode, the algorithm
converges quickly and selects the same participants over time. On figure 15, the number of
participants is increased to 100 workers and the algorithm converges at the 18th episode. A clear
separation is visible between a homogeneous selection, similar to a random behavior, and a
heterogeneous selection once the algorithm has found a strategy for optimizing the contribution.
    The following figure 16 highlights the evolution of the rewards, which indicates the development
of the reinforcement learning system.
Figure 16: Evolution of the rewards over rounds. (a) With 20 participants. (b) With 100 participants.
    Figure 16a puts forward the same behavior for the rewards as figure 14 for the selection. After
one episode, the evolution of the rewards reaches a limit and stabilizes around it. On figure 16b,
the 100 participants impact the shape of the curve: the system increases more slowly to reach a
limit and then oscillates. It underlines that the evolution of the reward should not be a single indicator of
Figure 17: Evolution of the accuracies given episodes and rounds with 20 participants
Figure 18: Evolution of the accuracies given episodes and rounds with 100 participants
    Figure 17 and figure 18 illustrate the evolution of the accuracy per episode and per round. The
evolution of the accuracy should rise faster as the episodes go by. However, in both figures, the trend
remains the same and the progression of the color does not tend to change. The gradient of colors
highlights that the progression over rounds in figure 18 with 100 workers is slower than the
progression in figure 17 with 20 workers.
    Even if the algorithm converges and finds a group of participants which should contribute in a
better way to optimize the aggregation, the results point out that the contribution evaluation based
on the accuracy does not improve the speed of convergence as expected.
and

R_t = \frac{1}{\sum_{i=1}^{k} d_i} \sum_{i=1}^{k} d_i \cdot Acc_i\!\left(\theta_G^{t+1}\right) - \delta

where Acc_i(·) is the accuracy computed using the data of the worker i, θi^{t+1} is the local model of
the worker i after a local training of the global model θG^t, d_i is the size of the local data of the
worker i and δ is the exponential moving average of the previous accuracies.
    On figure 19 and figure 20, the selection of participants for the aggregation puts forward that
the algorithm does not converge. In fact, with 20 workers as with 100 workers, both figures 19 and
20 are filled with dots which are not discernible, because almost all gradients are used for the
aggregation. However, the evolution of the rewards on figure 21 has almost the same shape as the
reward of the first approach, except for some small peaks. If the network had converged, the reward
would be smooth. In this case, the reward oscillates, which is a form of divergence, and explains
why the agent does not select a subset of gradients after the training.
Figure 19: Selection represented for 20 episodes with 10 rounds per episode with 20 participants
    Moreover, the evolution of the accuracy over the episodes on figures 22 and 23 reaches 85%
directly at the first round in both cases (20 workers and 100 workers). This can be explained by
the number of gradients which are aggregated. Indeed, even if the distribution of data is non-IID,
there are enough gradients to make an aggregation in the same way as in an IID case.
Figure 20: Selection represented for 20 episodes with 10 rounds per episode with 100 participants
Figure 21: Evolution of the rewards over rounds. (a) With 20 participants. (b) With 100 participants.
Figure 22: Evolution of the accuracies given episodes and rounds with 20 participants
Figure 23: Evolution of the accuracies given episodes and rounds with 100 participants
   • Isomap embedding
   • Spectral embedding for non-linear dimensionality reduction
   • Locally linear embedding
    The experiment was made with 40 workers on the MNIST dataset, where labels are swapped and
distributed to form 4 clusters. For instance, the samples of the digit 3 become the label 8 for 10
workers chosen randomly. The goal is to reduce the dimension of the gradients and find the clusters
through visualization.
    On figure 24 and figure 25, each dot indicates a gradient. When different dots have the same
color, they belong to the same cluster. These methods allow going from an input of size 40 × 1.2M
to an output of size 40 × 2.
    With the PCA method on figure 24, the variance ratio for 2 components is, in this case, 25%.
Before the reduction, there is a total of 48 million values for the state with 40 workers, and after the
reduction, there is only a total of 80 values. Applying the PCA method with a preservation of 95%
of the variance gives an output of size 40 × 37; in other words, the gradients of every worker are
reduced to 37 components per worker. The spectral embedding constructs an affinity matrix and
applies a spectral decomposition to the corresponding graph Laplacian. The affinity matrix in this
case is computed based on a radial basis function kernel. The main difference between figures 24
and 25 is the scale of the axes, which is smaller for the spectral embedding method than for the PCA method.
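    A hedged sketch of the PCA reduction described above, using scikit-learn, is given below; the
flattening of the gradients into a matrix with one row per worker is assumed to be done beforehand,
and the random matrix only stands in for the real gradients.

import numpy as np
from sklearn.decomposition import PCA

# one flattened gradient vector per worker (placeholder values; in the report
# each flattened gradient has about 1.2 million components)
gradients = np.random.randn(40, 1200).astype(np.float32)

# 2 components for the visualization of the clusters
pca_2d = PCA(n_components=2).fit(gradients)
embedding = pca_2d.transform(gradients)            # shape (40, 2)
print(pca_2d.explained_variance_ratio_.sum())      # variance kept by 2 components

# keep 95% of the variance instead of a fixed number of components
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(gradients)
reduced = pca_95.transform(gradients)              # shape (40, pca_95.n_components_)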
   • the speed of computation (in s/batch) can have 3 different means: 0.5, 0.7 or 1. The standard
     deviation of this variable is always 0.02;

   • the bandwidth of uploading (in Mb/s) can have 2 different means, each associated with its own
     standard deviation: (8, 2) or (0.5, 0.2), written as (m, σ);

   • the bandwidth of downloading (in Mb/s) can have 2 different means, each associated with its
     own standard deviation: (30, 5) or (5, 1), written as (m, σ).

Note that each worker is assigned one mean and one standard deviation per characteristic, chosen
randomly among the previous values. A sketch of this simulation is given after this note.
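    As mentioned above, a sketch of how such worker characteristics could be drawn and turned into
round times is shown below; the exact sampling scheme, the number of batches and the model size
are assumptions.

import random

# possible (mean, standard deviation) pairs taken from the list above
COMPUTE_SPEEDS = [(0.5, 0.02), (0.7, 0.02), (1.0, 0.02)]  # s/batch
UPLOAD_BW = [(8.0, 2.0), (0.5, 0.2)]                      # Mb/s
DOWNLOAD_BW = [(30.0, 5.0), (5.0, 1.0)]                   # Mb/s

def make_worker_profile():
    """Assign to a worker one (mean, std) pair per characteristic, chosen randomly."""
    return {
        "speed": random.choice(COMPUTE_SPEEDS),
        "upload": random.choice(UPLOAD_BW),
        "download": random.choice(DOWNLOAD_BW),
    }

def sample_times(profile, n_batches=50, model_size_mb=5.0):
    """Draw the times T^c, T^u and T^d of one round from the worker profile."""
    speed = random.gauss(*profile["speed"])
    upload = random.gauss(*profile["upload"])
    download = random.gauss(*profile["download"])
    t_c = n_batches * speed                    # computation time
    t_u = model_size_mb / max(upload, 1e-6)    # upload time
    t_d = model_size_mb / max(download, 1e-6)  # download time
    return t_c, t_u, t_d

workers = [make_worker_profile() for _ in range(100)]
print(sample_times(workers[0]))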
    For the experiments, the actor network and the critic network are composed of 3 linear layers
with 128 neurons per layer. The exploration of the algorithm can be dissociated from Federated
Learning: there is no need to have a task model to train the agent network, since the aim of the
agent is to minimize the time spent by the next participants. However, to compare the performance
of the agent on the task model accuracy against a completely random selection, Federated Learning is used
to keep the same environment between the agent and the baseline (random selection). There are
100 workers in this experiment, with 20 rounds per experiment and a maximum of 30 experiments.
The output of the agent network is a set of probabilities. The selection of the next participants is
made by applying the Bernoulli law B(p) to each probability. The selection is a vector filled with
values 0 or 1, where 0 denotes that the worker is discarded for the next round and 1 that the worker
is selected for the next round. When the probabilities are close to zero and no worker is selected for
the next round, the experiment ends.
    The results are divided into two parts:

   • a first experiment where the goal is to highlight that the agent is able to find the fastest
     workers in a dynamic environment;

   • a second experiment where the same environment is kept between the training step, the testing
     step and the comparison of performance with a random selection.
    Figure 27 puts forward how the probabilities progress over rounds. The workers are sorted from
the fastest to the slowest according to their means (speed of computation, bandwidth of
uploading and bandwidth of downloading). Each pixel of the image corresponds to the probability
of a worker being selected for the next round. Blue indicates a probability close to zero while red
indicates a probability close to one. Around round 240, there are 10 participants with a high
probability of being selected for the next round. The training should be stopped at this point in
order to keep enough participants to represent the distribution of data in the Federated Learning case.
    With the same idea, figure 28 indicates the selection of workers over rounds. A colored dot
means the worker has been selected; otherwise there is a white space.
Figure 28: Evolution of the selection of workers over rounds for 100 workers
    Finally, the maximum time of the participants per episode decreases over time on figure 29.
Note that on figure 29a, the scale of the Y-axis is logarithmic to reduce the visual impact of a large
peak of maximum time, whereas figure 29b does not have this issue.
Figure 29: Evolution of the maximum time of participants over episodes for 100 workers. (a) Mean of the
           times for each episode. (b) Median of the times for each episode.
Figure 30: Comparison of the evolution of the accuracy between the selection with the DDPG algorithm and
           the random selection with 100 workers on MNIST data. (a) On IID data. (b) On non-IID data.
However, the comparison of the maximum time between the two methods in tables 2 and 3,
highlights that the agent selects workers whose maximum time is lower than with the random
selection in the IID scenario, and has a worse performance in the non-IID scenario, as mentioned above.
Table 2: Comparison of maximum time between the selection with DDPG algorithm and the random
         selection with 100 workers on IID data
Table 3: Comparison of maximum time between the selection with DDPG algorithm and the random
         selection with 100 workers on non-IID data
A. Appendix
i. Figures
Double DQN — off-policy — Double DQN is an enhancement of DQN for reducing overestimation. Q is
    noisy, which may be caused by the environment, non-stationarity, function approximation or
    other reasons. The central idea of Double DQN is to decorrelate the noise in selection and
    evaluation by using two different networks in these two stages. The Q-network of the DQN
    architecture provides a natural candidate for the extra network. Recall that it is the evaluation
    role of the target network that improves the stability more.

Dueling DQN — off-policy — Often the exact Q-values of the different actions do not matter much, so
    decoupling the action-independent value of the state from the Q-value may lead to more robust
    learning. The experiments show that dueling architectures lead to better policy evaluation in
    the presence of many similar-valued actions.

Rainbow DQN — off-policy — Rainbow uses the truncated n-step return R_t^{(n)} from a given state S_t
    directly, where R_t^{(n)} is defined by R_t^{(n)} = Σ_{k=0}^{n−1} γ^k R_{t+k}.

Noisy DQN — off-policy — It is an alternative exploration algorithm to ε-greedy, especially for games
    requiring extensive exploration. The noise is added to a linear layer y = W x + b by an extra
    noisy stream: y = (W x + b) + ((W_noisy ⊙ ε_w) x + b_noisy ⊙ ε_b).

REINFORCE — on-policy — The REINFORCE algorithm follows the straightforward idea of performing
    gradient ascent on the parameters of the policy π_θ. Despite its simplicity, naive REINFORCE
    has been observed to suffer from a large variance when estimating the gradient. To alleviate
    this problem, a baseline b(S_i) is introduced, where b(S_i) is a function depending only on S_i
    (or, more importantly, not depending on A_i).

Actor-Critic — off-policy — The actor-critic method follows the idea of learning together an actor, the
    policy function π(·|s), and a critic, the value function V_π(s). Moreover, actor-critic also uses
    the idea of bootstrapping to estimate the Q-value.

A2C — on-policy — Synchronous Advantage Actor-Critic: similar to actor-critic but focused on parallel
    training. It works with one master, one coordinator and several workers. The master is updated
    as in the AC algorithm.

A3C — on-policy — Asynchronous Advantage Actor-Critic: there is no coordinator (asynchronous). Each
    worker talks directly to the global actor and critic.

TRPO — on-policy — Trust Region Policy Optimization aims to handle the step size more properly in
    policy gradient, based on the idea of a trust region.

PPO — on-policy — Based on TRPO; instead of optimizing with a hard constraint, Proximal Policy
    Optimization tends to optimize its regularized version.

DPPO — on-policy — Distributed Proximal Policy Optimization is a distributed version of the PPO
    algorithm.

ACKTR — on-policy — Actor Critic using Kronecker-Factored Trust Region uses the Kronecker-factored
    approximated curvature to compute the natural gradient.
B.    References
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient
  Learning of Deep Networks from Decentralized Data,” arXiv:1602.05629 [cs], Feb. 2017, arXiv:
  1602.05629. [Online]. Available: http://arxiv.org/abs/1602.05629
X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the Convergence of FedAvg on Non-IID
  Data,” arXiv:1907.02189 [cs, math, stat], Jun. 2020, arXiv: 1907.02189. [Online]. Available:
  http://arxiv.org/abs/1907.02189
J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated Reinforcement Learning: Techniques, Applications,
   and Open Challenges,” arXiv:2108.11887 [cs], Oct. 2021, arXiv: 2108.11887. [Online]. Available:
   http://arxiv.org/abs/2108.11887
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep
 Learning with Differential Privacy,” Proceedings of the 2016 ACM SIGSAC Conference on
 Computer and Communications Security, pp. 308–318, Oct. 2016, arXiv: 1607.00133. [Online].
 Available: http://arxiv.org/abs/1607.00133
R. Bellman, Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-Edge AI: Intelligentizing
  Mobile Edge Computing, Caching and Communication by Federated Learning,” IEEE
  Network, vol. 33, no. 5, pp. 156–165, Sep. 2019, arXiv: 1809.07857. [Online]. Available:
  http://arxiv.org/abs/1809.07857
X. Wang, C. Wang, X. Li, V. C. M. Leung, and T. Taleb, “Federated Deep Reinforcement
  Learning for Internet of Things With Decentralized Cooperative Edge Caching,” IEEE
  Internet of Things Journal, vol. 7, no. 10, pp. 9441–9455, Oct. 2020. [Online]. Available:
  https://ieeexplore.ieee.org/document/9062302/
I. Martinez, S. Francis, and A. S. Hafid, “Record and Reward Federated Learning Contributions
   with Blockchain,” in 2019 International Conference on Cyber-Enabled Distributed Computing and
   Knowledge Discovery (CyberC). Guilin, China: IEEE, Oct. 2019, pp. 50–57. [Online]. Available:
   https://ieeexplore.ieee.org/document/8945913/
S. Yu, X. Chen, Z. Zhou, X. Gong, and D. Wu, “When Deep Reinforcement Learning Meets
  Federated Learning: Intelligent Multi-Timescale Resource Management for Multi-access Edge
  Computing in 5G Ultra Dense Network,” arXiv:2009.10601 [cs], Sep. 2020, arXiv: 2009.10601.
  [Online]. Available: http://arxiv.org/abs/2009.10601
Y. Zhan, P. Li, and S. Guo, “Experience-Driven Computational Resource Allocation of Federated
  Learning by Deep Reinforcement Learning,” in 2020 IEEE International Parallel and Distributed
  Processing Symposium (IPDPS). New Orleans, LA, USA: IEEE, May 2020, pp. 234–243.
  [Online]. Available: https://ieeexplore.ieee.org/document/9139873/
S. Zarandi and H. Tabassum, “Federated Double Deep Q-learning for Joint Delay and Energy
  Minimization in IoT networks,” arXiv:2104.11320 [cs], Apr. 2021, arXiv: 2104.11320. [Online].
  Available: http://arxiv.org/abs/2104.11320
Y. Chen, Z. Liu, Y. Zhang, Y. Wu, X. Chen, and L. Zhao, “Deep Reinforcement Learning-Based
  Dynamic Resource Management for Mobile Edge Computing in Industrial Internet of Things,”
  IEEE Transactions on Industrial Informatics, vol. 17, no. 7, pp. 4925–4934, Jul. 2021. [Online].
  Available: https://ieeexplore.ieee.org/document/9214878/
W. Liu, L. Chen, Y. Chen, and W. Zhang, “Accelerating Federated Learning via Momentum
 Gradient Descent,” arXiv:1910.03197 [cs, stat], Oct. 2019, arXiv: 1910.03197. [Online]. Available:
 http://arxiv.org/abs/1910.03197
S. Q. Zhang, J. Lin, and Q. Zhang, “A Multi-agent Reinforcement Learning Approach for Efficient
   Client Selection in Federated Learning,” arXiv:2201.02932 [cs], Jan. 2022, arXiv: 2201.02932.
   [Online]. Available: http://arxiv.org/abs/2201.02932
X. Liang, Y. Liu, T. Chen, M. Liu, and Q. Yang, “Federated Transfer Reinforcement Learning for
  Autonomous Driving,” arXiv:1910.06001 [cs], Oct. 2019, arXiv: 1910.06001. [Online]. Available:
  http://arxiv.org/abs/1910.06001
H.-K. Lim, J.-B. Kim, J.-S. Heo, and Y.-H. Han, “Federated Reinforcement Learning for Training
  Control Policies on Multiple IoT Devices,” Sensors, vol. 20, no. 5, p. 1359, Mar. 2020. [Online].
  Available: https://www.mdpi.com/1424-8220/20/5/1359
S. U. Stich and S. P. Karimireddy, “The Error-Feedback Framework: Better Rates for SGD with
  Delayed Gradients and Compressed Communication,” Jun. 2021, number: arXiv:1909.05350
  arXiv:1909.05350 [cs, math, stat]. [Online]. Available: http://arxiv.org/abs/1909.05350
J. Zhao, X. Zhu, J. Wang, and J. Xiao, “Efficient Client Contribution Evaluation for Horizontal
   Federated Learning,” arXiv:2102.13314 [cs, eess], Feb. 2021, arXiv: 2102.13314. [Online].
  Available: http://arxiv.org/abs/2102.13314
H. Dong, Z. Ding, and S. Zhang, Deep Reinforcement Learning, Fundamentals, Research and
  Applications. Springer, 2020.
C.    Glossary
AC Actor Critic. 29
IID Independent and Identically Distributed. 1, 7, 11, 12, 14, 34, 36, 38, 42, 49, 50
ML Machine Learning. 1, 2
non-IID Non Independent and Identically Distributed. 1, 2, 5, 7, 11, 12, 14, 28, 34, 36–39, 42, 49,
     50