
Ensuring Fairness and Gradient Privacy in Personalized Heterogeneous Federated Learning

Published: 17 May 2024

Abstract

With the increasing tension between the conflicting requirements of making large amounts of data available for effective machine learning-based analysis and of ensuring their privacy, the paradigm of federated learning has emerged: a distributed machine learning setting where clients provide only model updates to the server rather than the actual data used for decision making. However, the distributed nature of federated learning raises specific challenges related to fairness in a heterogeneous setting. This motivates the focus of our article: the heterogeneity of client devices with different computational capabilities and its impact on fairness in federated learning. Furthermore, our aim is to achieve fairness under heterogeneity while ensuring privacy. As far as we are aware, no existing work addresses all three aspects of fairness, device heterogeneity, and privacy simultaneously in federated learning. In this article, we propose a novel federated learning algorithm with personalization in the context of heterogeneous devices while maintaining compatibility with the gradient privacy preservation techniques of secure aggregation. We analyze the proposed algorithm in different environments with different datasets and show that it achieves performance close to or greater than the state-of-the-art in heterogeneous device personalized federated learning. We also provide theoretical proofs of the fairness and convergence properties of our proposed algorithm.

1 Introduction

There has been an explosion in the applications of machine learning in various fields (such as speech recognition [9], X-ray image segmentation [42], and molecular design [15]). At the same time, we have been witnessing a rise in data privacy regulations (such as Europe’s General Data Protection Regulation, the California Consumer Privacy Act, and China’s Personal Information Protection Law). These two developments have led to an increased tension between machine learning and data privacy. A fundamental challenge is that most machine learning algorithms require large amounts of data to achieve an adequate level of performance, which immediately raises the issue of privacy. Hence, there is a need to minimize the sharing of personal data via distributed learning algorithms such as federated learning [36]. Federated learning follows a client-server setup, where clients share model parameter updates (gradients) instead of their data, and the server aggregates those model updates. However, federated learning algorithms have their own imperfections, for instance, in the areas of privacy, heterogeneity, fairness, and robustness.
An honest-but-curious server or an eavesdropping adversary can use the client model updates to recover information about the data that has been used to train them [54, 56]. Some powerful attacks [18, 46, 53] can even recover complete batches of data. The existence of these attacks directly opposes the guiding privacy regulations and the premise of federated learning algorithms. Hence, to achieve improved data privacy properties, federated learning has been extended with secure aggregation [5, 20, 40]. Secure aggregation techniques encrypt the client updates while enabling the server to aggregate the updates without decrypting them. Only the clients are able to see their own updates in the clear, while the server retains the ability to coordinate the learning process. As updates are encrypted (by the client who creates them), they can no longer be inspected to recover the underlying data, thus better preserving privacy in federated learning. Despite secure aggregation, federated learning environments have further limitations due to the heterogeneity of client devices.
Since federated learning systems encourage a wide range of clients, there are inequalities among the clients in terms of the data they have and the computational capabilities of their devices. Such heterogeneity can have a negative impact on the learning process, with data heterogeneity shifting or biasing the joint target parameters of the global model, and device heterogeneity creating bottlenecks in federated learning. To address these issues, several algorithms [10, 16, 23, 50] have been proposed that allow the server to analyze the submitted updates while performing aggregations so as to maximize the “fairness” of the system. However, analyzing updates immediately conflicts with the privacy of the aggregation. This inherent tension between fairness and privacy has led to a new research direction in federated learning aimed at developing algorithms that can simultaneously provide both while minimally compromising either one. This is the focus of our work.

1.1 Motivation and Contributions

In this article, we address the tension between privacy and fairness in the context of device heterogeneity in federated learning. To the best of our knowledge, no solution has been proposed so far in the literature in this area. In the context of device heterogeneity, clients hold devices with varying computational capabilities and hence take varying amounts of time to perform their local updates. This in turn can create a bottleneck in the federated learning process due to its synchronous nature. While there exist techniques for asynchronous federated learning [35, 55], we focus on solutions that make use of model heterogeneity due to their lower memory and orchestration requirements. Model heterogeneity at this level often has the server hold the full global model to be jointly trained, while the clients assign themselves an overlapping partition of the global model to train such that the time to perform their local training falls within a global deadline.
At the aggregation level, the server requires awareness of the individual clients’ updates so that each parameter can be updated in proportion to the number of clients updating that parameter. This ensures fairness of contributions, as it prevents the most popularly held parameters from experiencing updates orders of magnitude greater than the least popular ones. For example, suppose there are five clients that, respectively, hold the ordered partitions \(\mathbb {P} = \lbrace 0.2, 0.2, 0.4, 0.1, 1.0\rbrace\), where every client i with \(\mathbb {P}_i \ge \mathbb {P}_j\) holds a superset of the partition of client j. A blind average will handle the 0.1 partition correctly, since all clients hold and update that partition, but not the others. One such case is the 0.4 partition, which only two clients hold, client 3 and client 5 (since \(1.0 \ge 0.4\)); a blind average will divide by the total number of clients, five, instead of the number of clients that updated that part of the model, two. The prior solutions to this issue (FjORD [23], HeteroFL [10]) require the server to look at the gradients to determine the correct aggregation value, namely, the element-wise average of the gradients, which ensures that gradients are aggregated according to the number of clients updating them, despite the variability in partitioning. However, to do this, the server would need to observe which values within the global gradient, and as a result which partition, each client is updating, thereby compromising privacy. In our example, the server would observe that all clients update the first 0.1 proportion of the global model, clients 1, 2, 3, and 5 update the 0.2 proportion, clients 3 and 5 update the 0.4 proportion, and client 5 updates the 1.0 proportion. This compromises privacy, since the server is explicitly aware of which clients update specific parts of the model, which can enable it to launch attacks, such as those in References [39, 49], that reconstruct client data even if secure aggregation is employed. This creates a tension between fairness and privacy and motivates us to propose a change that allows the server to be completely oblivious to the client updates, including which parts of the model they update. We propose that the server blindly adds together the gradients; the clients then take the result, use norm scaling to approximate the true average, and update their model.
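The following numpy sketch (with made-up gradient values) illustrates the aggregation bias in the example above: a blind division by the total number of clients shrinks parameters that only a few clients hold, whereas dividing by the per-parameter count recovers the correct element-wise average.
```python
import numpy as np

# Toy example: five clients hold the ordered partitions {0.2, 0.2, 0.4, 0.1, 1.0}
# of a 10-parameter model (all values are made up for illustration).
n_params = 10
proportions = [0.2, 0.2, 0.4, 0.1, 1.0]
rng = np.random.default_rng(0)

updates = []
for p in proportions:
    g = np.zeros(n_params)
    k = int(p * n_params)
    g[:k] = rng.normal(size=k)          # each client only updates its own partition
    updates.append(g)

blind_sum = np.sum(updates, axis=0)
counts = np.sum([u != 0 for u in updates], axis=0)     # true number of updaters per parameter

blind_average = blind_sum / len(proportions)           # divides every entry by 5
per_param_average = blind_sum / np.maximum(counts, 1)  # element-wise average as in FjORD/HeteroFL

print(counts)             # [5 4 2 2 1 1 1 1 1 1]
print(blind_average)      # entries held by few clients are shrunk toward zero
print(per_param_average)  # unbiased element-wise average
```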
There have been prior works that have attempted to provide solutions to this issue, but, to the best of our knowledge, they are all inadequate. First, Li et al. [33] mainly focus on data heterogeneity and provide a mechanism for a heterogeneous number of training epochs per client; however, in a later work [26], the authors state that their proposed mechanism has no convergence benefits over standard federated averaging. The FedMD [31] algorithm requires a public dataset, which we consider to be a strong assumption that is not applicable to many federated learning environments. The TiFL [6] algorithm focuses on client selection optimization, which can be effective but does not address contribution fairness, the issue being addressed in our work.
In summary, our work considers the problem setting of device heterogeneity in federated learning, meaning that different clients have devices with different computational capabilities and are able to perform different amounts of updates to the model in a given time period. Prior works [10, 23] that address this issue propose a form of model personalization, where clients update a subset of the model parameters, enabling them to still provide updates in a timely manner. These model subsets are taken as an ordered partition of the global model, meaning that all clients overlap in the model parameters they take; e.g., if client A takes a larger subset than client B, then client B’s model is a strict subset of client A’s model. First, we observe that while this method improves fairness, it is detrimental to privacy: when clients contribute different subsets of the model, the aggregation server needs to know which parameters are being updated to compute an element-wise average, because an indiscriminate average will create a bias toward model parameters that are updated by all clients. Hence, we propose an algorithm where the server no longer performs this average but only a blind sum of the updates, after which the clients individually approximate the average by scaling the global sum using its norm or a securely calculated counter.
The main contributions of this article are as follows:
(1)
Extension of Ordered Partitioning: First, we design an extension to the ordered partitioning scheme, which is a model partitioning scheme that improves fairness in learning under device heterogeneous settings. We demonstrate that partitioning the depth and width of the model have different benefits and trade-offs. Furthermore, we show how prior restrictions to ordered partitioning schemes can be removed thereby enabling improvements to training speed and the resulting performance, in addition to being more easily applicable to the established neural network architectures. We then propose an algorithm that allows the clients to select a partition of the global model that maximizes fairness based on equal opportunity. To do so, we provide a formal definition of fairness based on equal opportunity.
(2)
New Federated Learning Algorithm: We propose a federated learning algorithm that achieves device heterogeneity with gradient privacy using secure aggregation techniques. This is the second novel contribution of this article, which we refer to as the Private Personalized Device Heterogeneous Federated Learning (PPDHFL) algorithm. It enables the learning of high performing models while remaining compatible with the techniques of secure aggregation.
(3)
Extensive Practical Analysis: We have carried out an extensive empirical study on PPDHFL across several settings, and show that the proposed scheme achieves performance that is comparable to or even better than the state-of-the-art heterogeneous device-based federated learning algorithms.
(4)
Convergence Proof: Finally, we provide a proof of convergence for the techniques in the proposed algorithm.

1.2 Structure of the Article

The rest of this article is organized as follows: Section 2 discusses relevant literature to provide background. Section 3 presents the key issues involved in the design of PPDHFL and introduces the problem setting. In Section 4, we present the proposed PPDHFL algorithm and consider design choices such as the partition strategy and the best performing substitute for gradient averaging. Section 5 describes our experiments and results, comparing the performance of our proposed algorithm to the state-of-the-art and baselines across five federated learning environments. Section 6 concludes the article.

2 Background

Federated learning aims to find a global model, \(\theta ^*\), that minimizes the average of the objective functions, \(f_i\), held by the set of clients, \(\mathbb {C}\), as observed at the current global model, \(\theta\), after \(E - 1\) client updates, \(\mathcal {U}_i^{E - 1}(\theta)\). This goal is more formally expressed in Equation (1):
\begin{equation} \theta ^* = \arg \min _\theta \mathbb {E}[f_i(\mathcal {U}_i^{E - 1}(\theta))]. \tag{1} \end{equation}
Clients contribute in a series of steps coordinated by the server, known as rounds. The ith client contributes to the current round, r, by first downloading the current global model, \(\theta _r\), from the server, performing a number of epochs of training, E, and uploading the difference, \(\Delta _i = \theta _r - \psi _{i,r,E}\), between their local model, \(\psi _{i,r,E}\), and the global model, \(\theta _r\), to the server. Once the server has received updates from the randomly selected subset of clients, \(\mathbb {S}_r \subseteq \mathbb {C}\), it closes the round by computing the new global model according to the aggregation of the client updates, \(\theta _{r + 1} = \theta _r - \eta \sum _{i \in \mathbb {S}_r} \alpha _i \Delta _i\). There are several applicable aggregation algorithms, which lead to alternative computations of \(\alpha _i\). For example, federated averaging [36] computes \(\alpha _i\) as the number of samples the client has trained on in the current round, \(n_i\), divided by the total number of samples all clients have trained on in the current round, \(n = \sum _{i \in \mathbb {S}_r} n_i\), whereas CONTRA [3] computes \(\alpha _i\) according to cosine similarity-based clustering and reputation tracking. Regardless of the aggregation algorithm, once the next global model is computed, the process is repeated until the R rounds are completed, until a target loss value is achieved, or indefinitely.
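As a minimal illustration of one such round, the sketch below (with made-up model values and sample counts) applies the federated averaging rule described above, with \(\alpha _i = n_i / n\).
```python
import numpy as np

def fedavg_round(theta_r, client_updates, client_sample_counts, eta=1.0):
    """One aggregation step: theta_{r+1} = theta_r - eta * sum_i alpha_i * Delta_i,
    with alpha_i = n_i / sum_j n_j, as in federated averaging."""
    n = sum(client_sample_counts)
    weighted = sum((n_i / n) * delta
                   for delta, n_i in zip(client_updates, client_sample_counts))
    return theta_r - eta * weighted

# Made-up example: 3 selected clients updating a 4-parameter model.
theta = np.zeros(4)
deltas = [np.array([0.1, 0.0, -0.2, 0.3]),
          np.array([0.2, -0.1, 0.0, 0.1]),
          np.array([0.0, 0.3, 0.1, -0.1])]
samples = [100, 50, 50]
theta_next = fedavg_round(theta, deltas, samples)
```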
Several research works have shown that sharing gradients in federated learning still leaks the private data of clients through attacks available to potential interceptors and the server. For instance, Zhu, Liu, and Han [56] proposed deep leakage from gradients, an attack that uses gradient information, either intercepted or obtained at the server, to learn about the data that the model has been trained on. The adversaries first choose some dummy data, and then, based on the obtained gradient, optimize their dummy data such that their own model generates the most similar gradient. While this technique is generic to the model being considered, it is quite inefficient, as it requires the computation of second-order gradients. This led to advancements such as Zhang, Chen, and Li [52], who proposed an architecture-agnostic gradient inversion attack that is effective even when clients train with large batch sizes. They do so by observing that labels and their respective representations are encoded into each gradient, constructing a smaller and simpler optimization problem to perform the attack. Additionally, Yang et al. [49] proposed an attack that is effective when only partial gradients are sent to the server, as is typically done in communication-efficient methods of federated learning. Furthermore, several other efficient techniques have been proposed in the literature, such as References [17, 18, 48], each of which often exploits some aspect of the model architecture, in addition to the adversary holding honest-but-curious server privileges. To preserve the privacy of the client updates from both the server and potential interceptors, secure aggregation algorithms and their variants were proposed in References [5, 38, 51]. Generally, secure aggregation algorithms encrypt the client updates in such a way that the server can perform some operation on the collected encrypted gradients that results in the sum of the gradients. For example, Reference [51] uses the Paillier homomorphic cryptosystem, which allows the server to multiply the encrypted gradients to obtain the encrypted sum of gradients, \(\prod {\rm E}{\rm\small{NC}}(\Delta _i) \mod {n}^2 = {\rm E}{\rm\small{NC}}(\sum \Delta _i \mod {n})\). Such systems enable federated learning systems to function without ever revealing the gradients, although at the cost of computational power and of the server’s ability to make corrections to the learning process.
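The additive-homomorphic property exploited by such schemes can be demonstrated in a few lines; the sketch below assumes the third-party python-paillier (phe) package and uses scalar stand-ins for gradient entries, and it is only an illustration of the property rather than the batching scheme of Reference [51].
```python
from phe import paillier  # assumes the third-party python-paillier package is installed

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Each client encrypts its (scalar, toy) gradient entry with the shared public key.
client_gradients = [0.25, -0.10, 0.40]
ciphertexts = [public_key.encrypt(g) for g in client_gradients]

# The server combines ciphertexts without decrypting them; under Paillier this is a
# ciphertext multiplication, exposed by the library as '+', yielding the encrypted sum.
encrypted_sum = ciphertexts[0] + ciphertexts[1] + ciphertexts[2]

# Only a private-key holder (the clients, in schemes like Reference [51]) can recover
# the aggregate; the individual contributions are never revealed to the server.
assert abs(private_key.decrypt(encrypted_sum) - sum(client_gradients)) < 1e-6
```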
However, recently some flaws have been found in secure aggregation protocols under the adversarial server threat model that allow the circumvention of gradient privacy. In Reference [39], Pasquini, Francati, and Ateniese propose the gradient suppression attack, where the server distributes corrupted parameters to all but a target client, fooling those other clients into computing a zero gradient. This results in a functionally correct secure aggregation from the perspective of the clients, although it computes a clear copy of the target client’s gradient, thus enabling direct gradient inversion targeting that single client. Stealthier yet weaker variants of this attack that only target a subset of the gradient elements are also proposed in the same work. While in our work we do not consider the adversarial server threat model, the mitigation approaches proposed in Reference [39] remain trivially applicable to our proposed algorithm, enabling at least some degree of defense against this attack, if not complete defense. For example, in our proposed algorithm, clients can add noise to their updates (prior to submission to the server), thereby mitigating this attack using local differential privacy.
A somewhat orthogonal but equally important direction of federated learning research concerns the heterogeneity of clients. The most obvious form is data heterogeneity, where the difference in data distributions among clients significantly impacts the computation of the joint objective, unfairly biasing it toward certain clients. Recently, this type of heterogeneity has been the subject of extensive study, such as in References [2, 13, 25], and several algorithms have been proposed to address the issue of “client drift” from the global objective, e.g., References [7, 19, 27, 34]. Generally, these algorithms use server-side observations of the updates made by individual clients to determine an aggregation that minimizes the impact of the drift. This mechanism contradicts the requirement of secure aggregation, and by extension update privacy, which demands that the server remain unaware of the individual clients’ updates. Hence the tension between privacy and fairness in federated learning.
In this article, we focus on the issue of device heterogeneity within synchronous federated learning, namely, on solutions involving model heterogeneity. Two algorithms making use of this technique (without secure aggregation) are FjORD [23] and HeteroFL [10], where each client takes an “ordered” partition of the global model. FjORD takes the first p proportion of neurons or filters from left to right in each layer, while HeteroFL extends this concept by also taking those same neurons or filters but only from the first p proportion of the hidden layers in the input-to-output direction. The benefit of ordering the partitioning is that the permutation invariance issue [45], wherein there are infinitely many combinations of parameters that lead to the same model performance, is addressed by structurally assigning greater importance to particular elements of the model parameters. Furthermore, the FedDrop algorithm [47] was proposed as a simple solution to the device heterogeneity problem. It simply assigns clients a random subset of the global model to update and averages the resulting gradients. While this algorithm is simple, the lack of ordering in the partitioning leads to the permutation invariance issue, resulting in unstable and worse performing models.

3 Overview of Key Concepts and Setting

In this section, we first present a detailed discussion of the design issues related to device heterogeneity, privacy, and fairness in federated learning, and then describe the threat model and setting in which we address them.
Heterogeneity. The heterogeneity of clients has been acknowledged as one of the key design issues in federated learning [36]. Heterogeneity can pertain to several differing aspects such as data heterogeneity (where the statistical properties of data among clients is imbalanced), model heterogeneity (where clients have different models), and device heterogeneity, which is the focus of our work.
Definition 3.1 (Device Heterogeneity).
Clients in the federated learning system use different devices for performing their computations. These devices differ in their computational power, taking different amounts of time to perform equivalent computations. This impacts the federated learning process due to the canonical algorithm requiring synchronicity among the contributing clients.
It is worth noting that in a real-world system, different types of heterogeneity are likely to be present simultaneously, such as device heterogeneity, data heterogeneity and model heterogeneity. For example, an online messaging application using federated learning for next-word-prediction may have different clients and devices such as smart phones and laptops. This application would also have data heterogeneity due to the different communication patterns of clients. However, these two forms of heterogeneity, in general, can be independent, since there is not necessarily a correlation between the device that is being used and the data that it handles. This led us to focus our work exclusively on device heterogeneity. Although there can be relationships between these different types of heterogeneity, we believe it is useful to investigate each one in its own right.
Fairness and Device Heterogeneity. In general, fairness is considered as the minimization of the variance of some metric across the clients in the system. The specific metric used can vary with the type of heterogeneity in the system. For example, the works in References [14, 21] consider data heterogeneity, where the chosen metric is based on the distance between predictions or expected losses with respect to the individuals involved. Similar concepts have been defined for data heterogeneity in federated learning, for example, in References [16, 32]. However, since we are considering device heterogeneity, the performance of the resulting model is not a suitable indicator of fairness, because it does not reflect the equal ability to contribute to the federated learning system. We believe that for device heterogeneity, the fairness metric should reflect the ability (and experience) of each client to participate in the learning process. Hence, our fairness metric focuses on the equality of opportunity for the clients to contribute to the federated learning model.
In synchronous federated learning, the server often imposes a time limit on rounds. This has the benefit of guaranteeing progress in learning the global objective, especially considering the possibility of communication dropouts and faults. However, the time limits can make the learning process more unfair for clients that take a longer time to perform their local computations, since their updates will not contribute to the learning objective if their time to update exceeds the limit. Additionally, clients may choose to place greater emphasis on reducing the time taken to perform local updates than on the contributions they make to the global update; for example, they may be interested in saving battery power. As mentioned earlier, there are some works, such as References [10, 23], in which the individual learning tasks given to the clients are partitioned in such a way that they can be completed within the time limit. We observe that these approaches only consider the amount of computation performed and not the amount of contribution made to the joint task, and therefore are not necessarily fair. Furthermore, they do not consider maximization of fairness. In contrast, we consider the assignment of task proportions in a game-theoretical sense, which enables us to explicitly define the optimal fair setting. Our approach uses as its foundation the definition given in Reference [43]. This gives a synchronous collaborative system where clients can be assigned proportions of a joint task to contribute to. The clients gain utility with respect to that proportion when they simultaneously maximize the amount of contribution being made and minimize the distance between the time taken to complete their individual task and the time limit of the system.
Definition 3.2 (Fairness under Device Heterogeneity).
A fair setting is one where each client is assigned proportions of the learning task in such a way that they are all envy free and Pareto efficient. Envy freeness is defined as an assignment of partitions in such a way that each client gains greater utility from their personal assignment than any other client. Pareto efficiency is defined as an assignment where no client’s utility can be improved through changes without reducing the utility experienced by another client.
The formal definitions are as follows, where \(u_i\) is the utility function of client i and \(p_i\) is the partition of client i,
\begin{align*} &\max _{\boldsymbol {p}} u_i(p_i), \forall i \in \mathbb {C} & \\ s.t.\;& u_i(p_i) \ge u_i(p_j), \forall i,j \in \mathbb {C}, & \text{Envy Free} \\ &u_i(p_i) \ge u_j(p_i), \forall i, j \in \mathbb {C}, i \ne j. & \text{Pareto Efficient} \end{align*}
In our fairness definition, it is important that both envy freeness and Pareto efficiency are satisfied.
If only envy freeness is satisfied, then although the clients have the best opportunity relative to each other, they still have potential to improve their assigned allocations without detriment to the others. That is, the clients can still have the opportunity to get a better experience despite no other clients currently holding a better position compared to them.
In contrast, if only Pareto efficiency is provided, then although the assignment of allocations cannot be improved without reducing others, some clients may hold more beneficial partitions relative to the rest. For example, a client may acquire a disproportionately large allocation while other equivalent clients have smaller allocations; this is Pareto efficient, since reassigning part of the large allocation to the other clients would reduce the size and thus the utility of the client holding the large allocation, even though each of those other clients could then gain equal utility with respect to each other.1
In the federated learning setting, envy freeness is provided through the training and aggregation framework, while overall utility and Pareto efficiency are maximized with performance.
Device Heterogeneous Personalization. Variants of personalized learning have been proposed in References [10, 23, 47]. These techniques are effectively a restricted form of model heterogeneity. They assign clients partitions of the global neural network in such a way that each client takes approximately the same amount of time to perform their local updates, ensuring synchronicity in the system. While these techniques improve fairness, we observe that they do not necessarily find a completely fair setting (in view of our approach to fairness defined above). Additionally, since there is now an imbalance of parameters held by each client, greater coordination is required from the server to achieve proper aggregation of the updates.
Definition 3.3 (Secure Aggregation).
Secure aggregation is an algorithm that encrypts the individual client’s updates, while simultaneously enabling the server to aggregate those updates without decrypting them. Such an algorithm prevents the server or the attackers (eavesdroppers) from reading the individual client updates thereby preventing data, attribute or property reconstruction attacks.
How secure aggregation ensures privacy. When secure aggregation is employed, all the updates uploaded by the clients are encrypted. This means that to view them, the encryption must be broken. Assuming good practices are followed in the usage of well-designed cryptographic algorithms, client updates will appear to be composed of random values and will be computationally difficult to break, thereby ensuring their privacy. Client updates can also be made private from an honest-but-curious server in secure aggregation, if it is ensured that the encrypted updates remain encrypted upon aggregation and the result is only decrypted by the clients after download. As a result, the individual encrypted client updates are private with respect to the server, while their aggregation can be seen only by the appropriate clients (who encrypted them in the first place). With that said, there are some less common variations of secure aggregation where, upon aggregation, the result is decrypted. Such secure aggregation algorithms are used when the assumption of an honest server is valid, and they can be more efficient.
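To make the mechanism concrete, the following self-contained numpy sketch mimics mask-based secure aggregation in the spirit of Reference [5]: each pair of clients shares a random mask that one adds and the other subtracts, so individual uploads look like noise to the server while the masks cancel in the sum (dropout handling and key agreement are omitted).
```python
import itertools
import numpy as np

rng = np.random.default_rng(42)
n_clients, dim = 3, 5
true_updates = [rng.normal(size=dim) for _ in range(n_clients)]

# Pairwise masks: for each pair (i, j) with i < j, client i adds the mask and client j subtracts it.
pair_masks = {(i, j): rng.normal(size=dim)
              for i, j in itertools.combinations(range(n_clients), 2)}

masked = []
for i in range(n_clients):
    upload = true_updates[i].copy()
    for (a, b), m in pair_masks.items():
        if a == i:
            upload += m
        elif b == i:
            upload -= m
    masked.append(upload)                    # looks like random noise to the server

server_sum = np.sum(masked, axis=0)          # masks cancel pairwise in the aggregate
assert np.allclose(server_sum, np.sum(true_updates, axis=0))
```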
Secure aggregation and other privacy preserving techniques. In federated learning, secure aggregation is not the only technique that helps to preserve privacy. Notably, there are privacy preserving techniques such as that of DPSGD [1, 37], where clients perturb their updates prior to submission, and these perturbations improve client-level privacy2 at the cost of model performance due to added noise and/or reduction of magnitude. Unlike these techniques, secure aggregation does not intentionally distort model parameters to provide privacy, since the privacy preservation is based on the lossless technique of encryption of the updates. But to encrypt updates, parameters are often converted to a space suitable for the encryption scheme, usually the space of unsigned integers, and this can lead to a small amount of loss of information. We briefly explore the impacts of the conversion process (quantization) in Appendix B. We show that it is of negligible impact on performance in practice, and in some cases, it can actually improve the performance due to the reduction of the search space considered by the machine learning algorithm.
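As an illustration of the conversion step mentioned above, the following sketch applies a simple fixed-point quantization of float updates to unsigned integers and back; the clipping range and bit width are made-up choices, and real secure aggregation schemes use their own encodings.
```python
import numpy as np

def quantize(update, clip=1.0, bits=16):
    """Map floats in [-clip, clip] to unsigned integers in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    clipped = np.clip(update, -clip, clip)
    return np.round((clipped + clip) / (2 * clip) * levels).astype(np.uint32)

def dequantize(q, clip=1.0, bits=16):
    """Map the unsigned integers back to the float range [-clip, clip]."""
    levels = 2 ** bits - 1
    return q.astype(np.float64) / levels * (2 * clip) - clip

delta = np.array([0.031, -0.52, 0.0007, 0.9])
recovered = dequantize(quantize(delta))
print(np.max(np.abs(recovered - delta)))   # small quantization error, bounded by the step size
```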
Conflict between device heterogeneous personalization and secure aggregation. Device heterogeneity and secure aggregation have opposing requirements: the former requires the server to explicitly observe the individual clients’ updates for aggregation, while the latter requires the server not to be able to observe the clients’ updates. Specifically, the device heterogeneous techniques require that the server is aware of which partition of the model each client has updated in order to aggregate them fairly, without punishing those using less popular partitions. However, with this knowledge, the server can become aware of subsets of the model that a single client has updated, which has been shown in the literature to be sufficient to perform gradient inversion attacks [39, 49].
In Section 4, we propose the PPDHFL algorithm, which addresses this conflict within the following threat model.

3.1 Threat Model

We consider a client-server federated learning architecture where both the server and clients are honest-but-curious. That is, both clients and the server correctly follow the federated learning procedure, but may attempt to deduce or infer as much information as possible from communications they may receive or intercept. In the case of the server, this amounts to attempting to infer specific clients’ data based on their individual communications; while in the case of the clients, they may attempt to do the same but additionally rely on intercepted communications. We do not explicitly consider the possibility of adversarial alteration of the global model made by the server or clients, as this is part of “robustness,” which we consider to be outside the scope of this current work.
In our proposed algorithm, PPDHFL, we use secure aggregation [5], which consists of a set of algorithms that provide client gradient privacy via cryptographic techniques. Secure aggregation encrypts the gradients communicated by clients, preventing interception and hence preventing the re-creation of individual client data through attacks such as gradient inversion [56]. In general, secure aggregation uses cryptographic techniques that enable the server to aggregate the encrypted client gradients, with the resulting global gradient remaining encrypted until received and decrypted by the clients. This has led us to define our proportional scaling function at the client level, simply to cater for the generality of possible secure aggregation algorithms.
A primary goal of PPDHFL is to provide device contribution fairness, whereby each client has an equal opportunity to make some contribution to the global model during each round of training. In a synchronous setting, this requires that clients either reduce their computations or their communications. With either of these techniques, each client provides a more non-uniform or sparse gradient to the server, which the server must directly analyze and use to compute the aggregation. This directly compromises the client’s gradient privacy. Hence, our aim is to propose a method that provides both privacy and fairness, with the clients providing the gradient updates that appear uniform due to secure aggregation, and achieving fair aggregation using client-side scaling.

4 Algorithm and Design

We propose the PPDHFL algorithm, which provides both device heterogeneous personalization and secure aggregation while at the same time achieving fairness. Fairness is achieved using an extended version of the ordered model partitioning scheme (an extension of the one presented in References [10, 23]), with optimal selection of partitioning rates. However, for accurate aggregation of these heterogeneous models, the server normally needs to inspect and track which parameters are being updated by specific clients, which leaves clients vulnerable to attacks that circumvent privacy preservation. We eliminate this issue by moving the computation of averaging to the individual clients. This enables the server to simply “blindly” add client gradients, while clients scale the result into an appropriate update.
Our proposed PPDHFL algorithm is given in Algorithm 1. The algorithm is divided into the following components: a procedure of federated learning (PPDHFL), a function defining client’s contributions (ClientLearn), and a function that updates the client’s locally held models (ClientUpdate). In the federated learning procedure, clients first each benchmark3 their device according to the speed at which they can update the full global model and partitions of it, then, accordingly, they find a suitable partition of the global model for their learning process. This allocation process is vital for ensuring fairness in the system. We provide a more detailed discussion and definition of this allocation process in Section 4.2. The partitions during the allocation process consider both directions, namely, width (\(\boldsymbol {p}_w\)) and depth (\(\boldsymbol {p}_d\)), where the width reduces the number of parameters (such as neurons and filters) in each layer and the depth reduces the number of layers. We describe our partitioning methods in detail in Section 4.1.
After assigning the clients’ models, our procedure moves to the learning process. For R rounds of learning, a subset of \(K = \max (c \cdot |\mathbb {C}|, 1)\) clients is randomly selected from the set of clients to perform a local update. Each client communicates back to the server the encrypted updates that it has performed on the global model. The encrypted updates are obtained according to the secure aggregation algorithm. The server then takes the encrypted updates and produces the encrypted sum4 of the gradients by performing the secure aggregation operation. This encrypted sum is then communicated back to each of the clients so that they can proceed with the next round of updates.
For local learning, clients take their model at the current round r, \(\psi _{i, r, 0}\), and perform \(E_i\) epochs of updates using the generically defined optimizer function \({\rm C}{\rm\small{LIENT}}{\rm O}{\rm\small{PT}}\)5 and the gradient of the client’s objective function, \(f_i\), evaluated on their current model with respect to a randomly selected mini-batch, \(b_i\). Note that an optimizer function is an algorithm that steps the currently held model parameters closer toward a minimum of the objective function. There are several such functions, the most common being SGD, which directly subtracts the gradient scaled by a learning rate from the model parameters, and the Adam optimizer [28], which subtracts a ratio composed of the first and second order moments of the gradient. Once the local model is updated, the client submits the encrypted difference between their local model and the global model to the server.
To update their local model, clients store both a local partitioned model (which gets locally trained) and a copy of the global model for tracking. The updating function takes the encrypted sum of the most recent gradients and decrypts it. We specify this decryption step to be agnostic to the type of secure aggregation algorithm being applied. For example, when using BatchCrypt [51], the operation would involve decryption of the homomorphic cipher, whereas for Secure Aggregation [5], it would be the identity function, as the secure aggregation operation eliminates the encryption masks. The resulting decrypted sum then gets scaled according to one of the methods specified in Section 4.3 and is subtracted from the client’s current copy of the global model. If the client is not performing knowledge distillation [22],6 then they set their local model to their partition of the global model. The client is then ready for the next round of the learning procedure.
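A minimal sketch of this client-side update flow is given below, with illustrative concrete choices: an identity “decryption” (as for mask-based secure aggregation), the norm scaling of Section 4.3.2, and the ordered partition taken as a slice of a flattened parameter vector.
```python
import numpy as np

def client_update(global_model, p_local, decrypted_sum, eta=0.1):
    """Sketch of ClientUpdate: scale the blind sum, step the tracked global model,
    then re-slice the local partition (assumes no knowledge distillation)."""
    # Norm scaling (Section 4.3.2) approximates the element-wise average of the sum.
    step = min(eta / np.linalg.norm(decrypted_sum), 1.0) * decrypted_sum
    new_global = global_model - step
    # The ordered partition of the new global model becomes the next local model.
    local_model = new_global[: int(p_local * new_global.size)]
    return new_global, local_model

global_model = np.ones(10)
decrypted_sum = np.random.default_rng(1).normal(size=10)  # made-up decrypted sum of client updates
global_model, local_model = client_update(global_model, p_local=0.4, decrypted_sum=decrypted_sum)
```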

4.1 Partitioning Scheme

We order the partitions [10, 23] in such a way that all clients hold common neurons and larger partitions are strictly supersets of the smaller partitions. An example of our ordered partitioning is shown in Figure 1. In this figure, we show three possible partitions in a simplified form as rectangles, where clients 1 and 2 both take the smallest partition, client 3 takes the largest partition, and client n takes the middle partition. The other intermediate clients could take any possible partition. Larger partitions are supersets of the smaller ones, as can be seen by the smaller partitions being completely contained within the larger ones. The width of these rectangles represents \(p_w\), the proportion of neurons or filters that are included in the clients’ models. The smaller partitions start from the left-most side of the rectangle and extend the respective proportion of the length toward the right; the same holds for the models, where the left-most neurons/filters are taken first. Similarly, the height of the rectangles represents \(p_d\), that is, the number of hidden layers that the partitions take. They are always taken from the bottom to the top, that is, from the input to the output direction.
Fig. 1. Ordered Partitioning Scheme: Partitions are taken from the left to right and input to output directions, where larger partitions strictly overlap with smaller partitions.
The rationale behind the design of this ordered partitioning is that it reduces the impact of the permutation invariance in federated learning. Permutation invariance [45] arises when each client learns completely different model weights even when trained to predict the same classes. This occurs because there is an infinite number of combinations of weights and data products that collectively yield any class prediction. Permutation invariance becomes an issue when aggregating the model weights, as they can dampen or even cancel each other. This issue is exacerbated when clients update only subsets of the model, since they may potentially learn the same or a similar task on differing sets of neurons, altering the resulting global model and its predictions. Ordered partitioning addresses this issue by ensuring that the clients update a consistent set of neurons at each level of the model partition and where larger partitions include all neurons of the smaller partitions.
We propose an extension to the ordered partitioning scheme in which the rates of partitioning across depth and width are independent. Although the best direction in which to partition is not clear, we observe that partitioning by width and by depth have different impacts on performance and update times. We discuss the issue of partitioning by width and depth, and the implications of doing one versus the other, in our supplementary material available in our GitHub repository,7 where we observe that partitioning by depth is more beneficial. However, the results are not definitive, as there is not yet a clear understanding of the precise effect of the hyper-parameters of neural networks. This has led us to propose the use of the two independent parameters \(\boldsymbol {p}_w\) and \(\boldsymbol {p}_d\), unlike the solely width-based partitioning of FjORD [23] or the equal width and depth partitioning of HeteroFL [10]. The use of two independent parameters helps to obtain the benefit of depth partitioning while enabling the use of existing variable-size architectures such as ResNet-RS [4] or MobileNet [24].
There still remains the question of how to partition each layer type. Depth-wise partitioning simply involves reducing the number of layers of the global model by the specified proportion. An interesting aspect of depth-wise partitioning arises when the layers involve a function with strides, meaning that models with fewer layers will have a larger number of parameters in the final layer due to the larger output size. We find that the best way to address this issue is to simply apply the strides via a function without parameters, such as a pooling layer. By doing this, we ensure that shallow networks do not suffer from width explosion, thereby maintaining the hierarchy of ordered partitions with minimal addition to the overhead of parameter updates.
With width-wise partitioning, the strategy to be followed depends on the type of layer. For dense layers, it is straightforward, as one can simply reduce the number of neurons in each layer and slice the parameters in a consistent ordered fashion from left to right. For convolutional layers, we reduce the number of channels, while the kernel remains the same size, since that size determines the type of features that it filters. In this work, as our datasets come from vision and tabular training tasks, we do not evaluate recurrent layers. However, we expect that width-wise partitioning could be effectively applied by reducing the hidden/cell size, though we note that this requires further study. For other layer types, such as batch normalization, we do not apply any partitioning, since they do not have a significant number of parameters involved in gradient descent training.
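The following numpy sketch (with made-up layer shapes) shows the kind of left-to-right slicing described above for a dense weight matrix and the channel-wise slicing for a convolutional kernel.
```python
import numpy as np

def partition_dense(weight, p_w):
    """Keep the first p_w proportion of input and output neurons (left to right)."""
    n_in, n_out = weight.shape
    return weight[: int(p_w * n_in), : int(p_w * n_out)]

def partition_conv(kernel, p_w):
    """Keep the first p_w proportion of channels; the kernel size is unchanged."""
    channels = kernel.shape[0]            # kernel shaped (C_i, K_1, K_2)
    return kernel[: int(p_w * channels)]

dense = np.zeros((1000, 1000))            # made-up dense layer
conv = np.zeros((64, 3, 3))               # made-up convolutional layer
print(partition_dense(dense, 0.4).shape)  # (400, 400)
print(partition_conv(conv, 0.4).shape)    # (25, 3, 3)
```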
It is worth noting that for optimal application of the proposed technique, the source code of the model itself must be modified to handle the varying levels of partitioning. For instance, this can be done via the inclusion of additional hyperparameters in the model’s functions that represent \(p_w\) and \(p_d\). Otherwise, a shortcut approach is to simply set the removed model parameters to zero and freeze them, which ensures that they have no impact on the client’s training/outputs while still saving computation when optimized. Additionally, model architecture variants that are direct subsets of each other (e.g., ResNet-50, ResNet-101, ResNet-152) can be used in place of the above approaches.

4.2 Allocate Function

The allocate function, \({\rm A}{\rm\small{LLOCATE}}(i)\), is called at the beginning of our PPDHFL algorithm and is vital for ensuring fairness in the system. On fairness, we note that prior works on data fairness, such as Reference [16], use the demographic parity and equal opportunity metrics. These are not applicable to our environment, as those metrics are based on a comparison between categories of sensitive attributes of the data; they relate to data heterogeneity, whereas our work is concerned with device heterogeneity. This led us to formulate a different notion of fairness using the Pareto efficiency and envy freeness defined earlier (see Definition 3.2) to govern the allocation of global model partitioning rates employed by clients. Our allocate function finds the allocation of partitioning rates for each client such that it strikes a balance between the amount of information the client contributes to each aggregation and the amount of time it spends computing that contribution. In this section, we define how our allocate function ensures fairness in a heterogeneous device setting, and we show empirically how the partitions impact the computation time of the updates. We assume that prior to this process, each client has undergone the benchmarking (described at the beginning of Section 4) to find the relationship between its update computation speed and the model proportion.

4.2.1 Function Definition.

Following the definitions provided in Section 3, the allocate function uses a game-theoretical approach and convex optimization to find a fair allocation of partitioning rates (proportions) across clients. First, we must define a utility function, which maps the partitioning rates to real values, where larger values indicate a better rate from the client’s perspective. Our utility function is given in Equation (2) and is based on the Cobb-Douglas utility. It is a weighted product of the proportion of the global model contributed by client i, \(p_i\), and a reward function of the distance between the global time limit determined by the server8, T, and the time that client i takes to update the \(p_i\) proportion of the global model, \(t_i(p_i)\). The reward function, \(R(a, b)\), is a continuous, concave, and smooth function that assigns the maximum value when \(a = b\) and proportionally smaller values as the distance between a and b increases. In our implementation of the Allocate algorithm, we use the probability density function of the Gaussian distribution centered at T with scale 0.5, which has these properties within the limited domain of [0, 1]. Optionally, a hard constraint can be added to \(R(a, b)\) to ensure that the time limit is never exceeded; however, this removes some of the convexity of the function and thus weakens the guarantee of convergence. As an alternative, we find that by restricting the search space to ensure that a proportion that causes a client to take longer than T to update is never chosen, we can attain both a locally convex utility function and a hard constraint on update time:
\begin{equation} u_i(p_i) = p_i^\lambda \times R(t_i(p_i), T)^{1 - \lambda }. \tag{2} \end{equation}
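As a sketch of how a client might evaluate Equation (2), the following code uses the Gaussian-density reward described above and a simple grid search over feasible proportions in place of the gradient ascent mentioned in Lemma 4.1; the linear timing model is a made-up stand-in for the client’s benchmark measurements.
```python
import numpy as np

def reward(t, T, scale=0.5):
    """Gaussian probability density centered at the deadline T (Section 4.2.1)."""
    return np.exp(-((t - T) ** 2) / (2 * scale ** 2)) / (scale * np.sqrt(2 * np.pi))

def utility(p, t_of_p, T, lam=0.5):
    """Cobb-Douglas utility of Eq. (2): contribution weighted against timeliness."""
    return (p ** lam) * (reward(t_of_p(p), T) ** (1 - lam))

t_of_p = lambda p: 0.12 * p          # made-up benchmark: update time grows linearly with p
T = 0.1                              # server-imposed round time limit (seconds)

grid = np.linspace(0.01, 1.0, 100)
feasible = np.array([p for p in grid if t_of_p(p) <= T])   # never exceed the deadline
p_star = feasible[np.argmax([utility(p, t_of_p, T) for p in feasible])]
print(p_star)                        # this client's fair partition proportion
```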
With the utility function defined, we can first trivially define the envy free condition as follows:
\begin{equation*} u_i(p_i) \ge u_i(p_j), \forall i,j \in \mathbb {C}. \end{equation*}
The Pareto efficiency condition can be defined as the maximum of the weighted sum of the client utilities:
\begin{equation*} \max _{\boldsymbol {p}} \sum _{i \in \mathbb {C}} a_i u_i(p_i). \end{equation*}
Varian [44] shows that this solution is Pareto efficient for any \(\boldsymbol {a}\), but envy free only for a specific set of weighting values. Our use case has its own unique conditions, proven below, which enable us to set \(\boldsymbol {a} = [1,\ldots, 1]\) and define Equation (3) as a fair allocation.
Lemma 4.1.
Each client can independently solve the convex optimization problem \(p_i^* = \arg \max _{p_i} u_i(p_i)\) to find the partition allocation that achieves contribution fairness in a heterogeneous federated learning model. This problem can be solved with gradient ascent in linear time to within an upper bound on the error.9
\begin{equation} {\boldsymbol{p}}^* = \arg \max _{\boldsymbol {p}} \sum _{i \in \mathbb {C}} u_i(p_i). \tag{3} \end{equation}
Proof.
Pareto efficiency is satisfied with \(\boldsymbol {a} = [1, \ldots , 1]\) when \(\boldsymbol {p}^* = \lbrace \arg \max _{p_i} u_i(p_i)\rbrace _{\forall i \in \mathbb {C}}\), since
\begin{equation*} \sum _{i \in \mathbb {C}} u_i(p_i^*) = \max _{\boldsymbol {p}} \sum _{i \in \mathbb {C}} a_iu_i(p_i). \end{equation*}
This allocation is also envy free, first, because increasing any \(p_i\) does not decrease the maximum possible value of any other \(p_j\). Second, the definition of the maximized client allocation, \(p_i^* := \arg \max _{p_i} u_i(p_i)\), implies that \(u_i(p_i^*) \ge u_i(x)\) for all values of x within the domain of \(u_i\). Therefore, if a client i finds \(p_i^*\), then it is impossible for any other client j to hold an allocation that client i would gain more utility from
\begin{equation*} u_i(p_i^*) \ge u_i(p_j), \forall i, j \in \mathbb {C}. \end{equation*}
This process places the responsibility for fairness on the clients contributing to the system, as each client is responsible for making the system as fair as possible for itself. Any client that does not follow this process will only adversely impact itself, making the system unfair for itself, without impacting the others. In an unfair state, the clients still have the opportunity to make it fair, simply by adhering to the procedure of the system.
The computational cost of this process is relatively small in comparison to the neural network updates, since it is an optimization of a relatively simple function with a one-dimensional input and output. Furthermore, dynamic changes to client partitions in this system can be made at the discretion of the client, at a small extra computational cost.
Our full algorithm considers two separate partitioning directions, depth-wise and width-wise. In the above optimization process, this is represented as a two-element vector \(p_i\). As mentioned in Section 4.1, our reason for considering depth and width as independent elements of a partition is based on our empirical study described in our supplementary material.10 Our empirical study showed that each of these directions has a different impact on the computational load and performance of the model. Summarizing our results, partitioning by depth has a lower impact on model accuracy while saving more computation time. However, we acknowledge that our empirical study is limited in scope. As the current theory on the influence of neural network hyperparameters is still incomplete, at this stage, our optimization task is open to modification in both the depth and width directions.

4.2.2 Impact of Partitioning Rates on Update Times Across Heterogeneous Devices.

Throughout this article, we have used the assumption that devices with different computational capabilities take different amounts of time to update neural network models of the same size. Thus, the amount of time required to perform an update can be equalized by giving devices partitions of the model that correspond to their capability. From a theoretical point of view, at a simple level, one can argue that the computational complexity of updating a neural network is proportional to the number of parameters. Hence, devices that take longer to perform each operation should be given fewer parameters to update so that their total time taken is comparable to the more powerful devices. In this section, we evaluate the above claims empirically.
We recorded the time taken to perform a single update of a “square” partition (that is, with equal depth and width proportions) of a dense neural network with 10p hidden layers and \(1,\!000p\) neurons in each hidden layer. The networks learned the MNIST task using mini-batches of size 32, with an SGD optimizer with a learning rate of 0.1. To account for stochasticity in single update times, we took the mean of 1,000 runs for each of the ten partitions that we experimented with, \(p \in \lbrace 0.1, 0.2, \ldots , 1.0\rbrace\). This experiment was performed on three different types of devices: (1) a laptop with an Intel i7-8565U CPU, (2) a desktop with an Intel i9-7900X CPU and NVIDIA GeForce GTX 1080 GPU, and (3) a desktop with an Intel i7-10700 CPU and NVIDIA GeForce GTX 1660 SUPER GPU.
The results of our experiments are given in Figure 2. In general, we note that device 1 is strictly slower than the other two devices, while device 3 is in most cases faster than device 2, although they are generally relatively close to each other. For example, in a non-optimized situation, during a time limit of 0.17 s (not including communication time), device 1 can update up to \(p = 0.7\), while device 2 can update up to \(p = 0.9\), and device 3 can update up to \(p = 1.0\). With JIT optimization, we see that the disparity between the devices without GPUs and those with GPUs is exacerbated. For example, for a time limit of 0.0014 s, the maximum allocation would be \(\boldsymbol {p} = \lbrace 0.3, 0.9, 1.0\rbrace\). This further supports our assumption that devices with different capabilities take different amounts of time to update neural networks of equal size, and that the update times can be synchronized with the different amounts of model partitioning.
Fig. 2. Neural Network Update Times for Heterogeneous Devices.
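A rough, self-contained proxy for this kind of client-side benchmarking is sketched below: it times forward-style matrix products through a square p-partition of a dense network (a made-up proxy for a training step, not the exact benchmark used in our experiments).
```python
import time
import numpy as np

def benchmark_partition(p, n_layers=10, width=1000, batch=32, reps=10):
    """Mean wall-clock time of a forward-style pass through a square p-partition."""
    w = max(int(width * p), 1)
    depth = max(int(n_layers * p), 1)
    layers = [np.random.randn(w, w).astype(np.float32) for _ in range(depth)]
    x = np.random.randn(batch, w).astype(np.float32)
    start = time.perf_counter()
    for _ in range(reps):
        h = x
        for layer in layers:
            h = np.maximum(h @ layer, 0.0)   # dense layer followed by ReLU
    return (time.perf_counter() - start) / reps

for p in (0.1, 0.5, 1.0):
    print(p, benchmark_partition(p))
```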

4.3 Scale Function

The scale function, \({\rm S}{\rm\small{CALE}}(\Delta)\), called during the ClientUpdate function, is used to address the issue of unfair aggregation under a private gradient system. We envision two independent, mutually exclusive approaches, namely, the counter approach and the norm scaling approach, both computed by the clients. Each approach takes the securely aggregated gradients, which at this point are simply the sum of the client gradients, and scales them into a viable update reminiscent of the element-wise average of the client gradients.

4.3.1 Counter Method.

An obvious way to ensure that the gradients are correctly averaged is to create a counter for the model parameters that have been updated. Secure aggregation algorithms allow such a construction, as they only require the addition operation at the aggregation level. Hence, clients can simply submit to the server an encrypted pair of the parameter update and the counter of the parameters that have been updated. The latter part of the pair is a tensor11 of the same shape as the model parameters, with a value of 1 indicating that the corresponding parameter has been updated and a zero indicating that it has not. The server sums the clients’ counter tensors, creating a count of the number of updates per parameter, which gets sent back to all the clients. Each client then decrypts the aggregated update and the aggregated counter. The counter allows them to obtain the average of the update by first setting all zero values in the counter to one and then performing an element-wise division between the aggregated update and the counter.
While this technique provides a precise average of the global gradients, it has the disadvantage of increasing the communication overhead. On the uplink, clients will be required to submit an additional bit for each parameter in their model indicating whether it was updated. For the downlink, the server will be required to communicate \(\log _2|\mathbb {C}|\) bits for every parameter in the global model to count all the updating clients.
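A small numpy sketch of the counter method follows, with made-up decrypted aggregates for a six-parameter global model.
```python
import numpy as np

# Made-up decrypted aggregates for a 6-parameter global model.
summed_update = np.array([0.9, 0.4, -0.6, 0.2, 0.0, 0.0])
summed_counter = np.array([3, 3, 2, 1, 0, 0])   # sum of per-client 0/1 indicator tensors

safe_counts = np.where(summed_counter == 0, 1, summed_counter)  # avoid division by zero
averaged_update = summed_update / safe_counts                   # exact element-wise average

print(averaged_update)   # [0.3, 0.1333..., -0.3, 0.2, 0.0, 0.0]
```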

4.3.2 Norm Scaling Method.

To overcome the communication overhead issue, we propose estimating the average using the norm scaling approach. With this technique, shown in Equation (4), the aggregated update, \(\tilde{\Delta }= {\rm S}{\rm\small{ECURE}}{\rm A}{\rm\small{GGREGATION}}(\lbrace \Delta _i\rbrace _{i \in \mathbb {S}})\), is divided by its norm and then scaled by the learning rate. This ensures that the relative scale of the values within the update is retained without exploding the resulting parameter values. Since this does not explicitly compute the average, it is expected to introduce some noise into the updates. However, we show that this is not sufficient to prevent convergence; we give the formal convergence proof in Appendix A, and we expect this to be consistent with other findings in the deep learning literature:
\begin{equation} \psi _i := \theta _r - \min \left(\frac{\eta _{i, r}}{\Vert \tilde{\Delta }\Vert }, 1\right) \tilde{\Delta }. \end{equation}
(4)
Since each update retains its scale from the local client when aggregated into the sum, it is more likely that stationary points of the objective function are overstepped. Therefore, we define the learning rate under the decaying schedule shown in Equation (5), to shrink step sizes as learning progresses. The necessity of this schedule is further evidenced by the bound found in our proof of convergence, where the distance from a stationary point is upper bounded by an additive term of half the learning rate:
\begin{equation} \eta _{i, r} = \max \left(\frac{\eta _{i, 0}}{1 + 0.0001r}, 0.0001\right). \end{equation}
(5)
This technique is more feasible than the counter approach, as it does not introduce any communication overhead and only requires the minor computational overhead of norm computation and tensor scaling. Intuitively, one might expect this technique to suffer performance issues because it is less precise than the counter method. Contrary to this intuition, our empirical results in our supplementary material illustrate that the norm scaling technique is in fact applicable to a wider range of learning environments. Hence, we use this approach to represent \({\rm S}{\rm\small{CALE}}(\Delta)\) in our experiments with PPDHFL in Section 5.
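A minimal NumPy sketch of this norm scaling step, following Equations (4) and (5), is given below; the function names are ours, and in our implementation the corresponding logic sits inside the ClientUpdate function.

```python
import numpy as np

def decayed_lr(eta_0: float, r: int) -> float:
    """Decaying learning rate schedule of Equation (5)."""
    return max(eta_0 / (1 + 0.0001 * r), 0.0001)

def scale(agg_update: np.ndarray, theta: np.ndarray, eta_0: float, r: int) -> np.ndarray:
    """Norm scaling (Equation (4)): apply the summed, securely aggregated update
    to the client's local copy of the global parameters theta at round r."""
    eta = decayed_lr(eta_0, r)
    norm = np.linalg.norm(agg_update)
    step = min(eta / norm, 1.0) if norm > 0 else 0.0
    return theta - step * agg_update
```

Because only a norm and a scalar multiplication are involved, no per-parameter counters need to be communicated.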

4.4 Costs of PPDHFL

4.4.1 Computation.

The computational savings made by clients are proportional to their partition size and, to varying degrees, depend on the types of layers in their model. In this work, as our focus is on computer vision and tabular tasks, we only consider two layer types for partitioning, namely, fully connected/dense layers and convolutional layers. Other common layer types for these tasks either have a small number of parameters learned by gradient descent (e.g., batch normalization) or have no parameters (e.g., max pooling).
The parameters for the ith fully connected/dense layer form a matrix with dimensions \(N_{i - 1} \times N_i\), where \(N_{i - 1}\) is the length of the output of the previous layer, and \(N_i\) is the length of the output of the current layer. Since width partitioning is applied equally to all the layers, the number of parameters to update in a fully connected layer is \(p_w^2 N_{i - 1} N_i\).
The parameters of a convolutional layer also form an array, although with more dimensions and with a size independent of the input and output dimensions. The parameter array of the ith convolutional layer has \(C_i \prod _j K_{i, j}\) entries, where \(C_i\) is the number of channels and \(K_{i,j}\) are the dimensions of the kernel; both the channels and the kernel shape are hyperparameters of the layer. In our partitioning scheme, we only apply width partitioning to the channels of convolutional layers, reducing the number of parameters to \(p_w C_i \prod _j K_{i, j}\).
Overall, the reduction in computation depends on where the partitioning scheme is applied. However, assuming that it has been applied uniformly to all layers except the input and output layers, the resulting number of computations is as follows:
\begin{equation} p_w^2 \sum _{i = 1}^{p_d L_{FC}} N_{i - 1} N_i + p_w \sum _{i = 1}^{p_d L_{Conv}} C_i \prod _j K_{i, j}, \end{equation}
(6)
where \(L_{FC}\) is the number of fully connected layers in the global model, and \(L_{Conv}\) is the number of convolutional layers.
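For illustration, Equation (6) can be evaluated directly from an architecture description. The following hedged sketch (our own helper, not part of the released code) counts the parameters updated under a given \((p_w, p_d)\); we round \(p_d L\) to the nearest integer when truncating the layer lists.

```python
def partitioned_computation(fc_dims, conv_shapes, p_w: float, p_d: float) -> float:
    """Approximate number of parameters updated under width/depth partitioning,
    per Equation (6).
    fc_dims: [(N_{i-1}, N_i), ...] for fully connected layers.
    conv_shapes: [(C_i, kernel_dim_1, kernel_dim_2, ...), ...] for conv layers."""
    n_fc = round(p_d * len(fc_dims))
    n_conv = round(p_d * len(conv_shapes))
    fc = sum(p_w**2 * n_in * n_out for n_in, n_out in fc_dims[:n_fc])
    conv = 0.0
    for c, *kernel in conv_shapes[:n_conv]:
        k = 1
        for dim in kernel:
            k *= dim
        conv += p_w * c * k
    return fc + conv

# Example: a small network with two conv layers and two dense layers.
print(partitioned_computation(fc_dims=[(1024, 256), (256, 10)],
                              conv_shapes=[(32, 3, 3), (64, 3, 3)],
                              p_w=0.5, p_d=1.0))
```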
Table 1 compares the number of computations needed to update a feedforward model between PPDHFL and the other state-of-the-art model partitioning algorithms that we study in Section 5. To reiterate, our algorithm PPDHFL divides the partitioning rate into two variables, \(p_w\) and \(p_d\), which enables more fine-grained tuning of the number of computations than the other algorithms.
Table 1. Comparison of State-of-the-art Algorithms’ Number of Computations in a Model Update

Algorithm | Computation
PPDHFL | \(p_w^2 \sum _{i = 1}^{p_d L_{FC}} N_{i - 1} N_i + p_w \sum _{i = 1}^{p_d L_{Conv}} C_i \prod _j K_{i, j}\)
FjORD [23] | \(p^2 \sum _{i = 1}^{L_{FC}} N_{i - 1} N_i + p \sum _{i = 1}^{L_{Conv}} C_i \prod _j K_{i, j}\)
HeteroFL [10] | \(p^2 \sum _{i = 1}^{p L_{FC}} N_{i - 1} N_i + p \sum _{i = 1}^{p L_{Conv}} C_i \prod _j K_{i, j}\)
FedDrop [47] | \(p^2 \sum _{i = 1}^{L_{FC}} N_{i - 1} N_i + \sum _{i = 1}^{L_{Conv}} C_i \prod _j K_{i, j}\)

4.4.2 Communication.

It is notable that, despite clients training on partitions of the global model, no communication savings are made by the clients. This is because they must pad their local model updates to the size of the global model during the SecAggEncrypt operation.
This is a cost of gradient privacy: if the server or the clients have knowledge of the model parameters that particular clients are updating (even as a superset), then there is an opportunity to recover some or all of those parameters from the global updates. For example, consider a 10-client system, where the server or a client is made aware of the assignment of proportions:
\begin{equation*} \boldsymbol {p} = \lbrace 0.5, 0.5, 0.8, 0.5, 0.8, 0.2, 0.2, \boldsymbol {1.0}, 0.2, 0.3\rbrace . \end{equation*}
Then that server or client would know that the parameters of the global update in the set \(\mathcal {P}(\bar{\Delta }, 1.0) \setminus \mathcal {P}(\bar{\Delta }, 0.8)\) exclusively belong to the eighth client, where \(\mathcal {P}(\bar{\Delta }, p)\) is a function that returns the \(p\) proportion partition of the global update \(\bar{\Delta }\). This would enable the server or the client to observe those parameters, which could then be used in a gradient inversion attack or some other privacy-exploiting attack. Moreover, there may be other exploits that target clients holding unique proportion values, especially when collusion is possible.
Hence, SecAggEncrypt and the resulting model update must operate on the full global model. This means that, despite the update itself being computed on a partition of the global model, the encryption and the communicated updates cover the full global model with padded entries; this is the additional computation and communication cost required to improve privacy. It is worth noting, however, that for common forms of secure aggregation, e.g., Reference [5], the extra computation cost arises merely from the addition of random numbers, which is relatively lightweight compared to the gradient computation.
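The padding step can be sketched as follows, assuming NumPy and zero padding of the leading slice of each tensor, which matches the ordered partitioning; the helper names are ours rather than from the released code.

```python
import numpy as np

def pad_to_global(partition_update: dict[str, np.ndarray],
                  global_shapes: dict[str, tuple]) -> dict[str, np.ndarray]:
    """Zero-pad a partitioned update so every client submits tensors of the full
    global-model shape, hiding which parameters were actually trained."""
    padded = {}
    for name, shape in global_shapes.items():
        full = np.zeros(shape, dtype=np.float32)
        if name in partition_update:
            local = partition_update[name]
            # Ordered partitioning keeps the leading slice of each dimension.
            slices = tuple(slice(0, s) for s in local.shape)
            full[slices] = local
        padded[name] = full
    return padded
```

The padded tensors are then quantized and encrypted by SecAggEncrypt, so the server only ever handles full-size ciphertexts.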

5 Experiments and Results

In this section, we show empirically that the performance of PPDHFL is close to or better than the current state of the art. Since PPDHFL’s primary goals are to preserve privacy and fairness, our rule of thumb is to consider performance within 5% of the state of the art to be satisfactory. In this sense, PPDHFL achieves the set performance goal.

5.1 Experiment Details

We have analyzed the PPDHFL algorithm by experimenting with five datasets: MNIST [30], CIFAR-10 [29], CIFAR-100 [29], Tiny-ImageNet [8], and HAR [41]. MNIST is a simple learning task of digit image recognition, which is used to provide an initial baseline for comparison of the different techniques as well as initial insights into their performance. CIFAR-10 provides a more complex learning task of object image recognition and provides a better evaluation of the quality of the techniques being considered. CIFAR-100 is similar to CIFAR-10, but is larger in scale in terms of the number of object classes, having 100 rather than 10. The Tiny-ImageNet dataset contains a slightly more complex vision task using larger images and a greater number of classes. The HAR dataset provides a contrasting task of human activity recognition; it also provides a natural division of data samples based on the subjects in the data. A more detailed discussion of the evaluated datasets and our pre-processing is provided in our supplementary material.
Experiments with the CIFAR-10 dataset trained a modified VGG16 architecture, while those with the CIFAR-100 and Tiny-ImageNet datasets trained a modified DenseNet121 architecture, and all others trained a dense network with ReLU activations. For the FedDrop algorithm, we instead trained the modified VGG16 architecture for the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets, since FedDrop only partitions fully connected hidden layers, of which there are none in the DenseNet121 architecture. In all studied neural network architectures, width partitioning reduced the number of filters in convolutional layers and the number of neurons in fully connected layers, while depth partitioning reduced the number of hidden layers. For experiments with PPDHFL, we have partitioned by both depth and width. In the case of FedDrop, FjORD, and HeteroFL, the neural networks were partitioned by a single value, with FedDrop and FjORD partitioning only the width, whereas HeteroFL partitions both the width and the depth equally. In addition to the partitioning modifications that we made to the models, we have also replaced the batch normalization layers with layer normalization to eliminate the external covariate shift issue discussed by Du et al. [12].
Our experiments were performed on a single machine with an Intel i7-10700 CPU and an NVIDIA GeForce GTX 1660 SUPER GPU. To realize device heterogeneity, we performed the experiments with a “simulated device heterogeneity” allocation scheme. In this scheme, we used functions interpolated from the results shown in Figure 2 to find the optimal partitions for the three classes of devices for each algorithm; these partitions are assigned to the clients in the system in a cyclic fashion. As baselines, we also compare our results to canonical federated learning and independent local learning models; in both these cases, the clients hold the full model. These baselines give a glimpse of the clients’ learning capability when unconstrained by partitioning, while, respectively, being slowed by the weakest device in the network or forgoing collaborative learning entirely. More detailed specifications of the model architectures are given in our supplementary material, and can be found at https://github.com/codymlewis/ppdhfl/blob/main/Models.md.
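As an illustration of the simulated allocation, the following sketch assumes the interpolated timing functions reported in footnote 3 and uses a simple grid search with the product \(p_w p_d\) standing in for the utility function that our actual allocation optimizes (see footnote 9); it is not the exact procedure from our code.

```python
import itertools

def t_fast(p_w: float, p_d: float) -> float:
    """Interpolated update time of the most powerful device (footnote 3)."""
    return 0.14 * p_d * (1 + 0.2 * p_w)

def t_mid(p_w, p_d):  return 2 * t_fast(p_w, p_d)        # slightly weaker device
def t_slow(p_w, p_d): return 16 * t_fast(p_w, p_d) ** 2  # weakest device

def best_partition(time_fn, time_limit: float, step: float = 0.1):
    """Grid search for the largest (p_w, p_d) trainable within the time limit."""
    grid = [round(step * i, 2) for i in range(1, int(1 / step) + 1)]
    feasible = [(pw, pd) for pw, pd in itertools.product(grid, grid)
                if time_fn(pw, pd) <= time_limit]
    return max(feasible, key=lambda p: p[0] * p[1], default=(min(grid), min(grid)))

# Cyclic assignment of the per-class partitions to 10 clients.
classes = [best_partition(f, time_limit=0.14) for f in (t_fast, t_mid, t_slow)]
allocation = [classes[i % 3] for i in range(10)]
print(allocation)
```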
For each of the evaluated datasets, we compare our results to five baselines: the upper bound, federated averaging (FedAVG) [36]; the lower bound, local learning (Local); FedDrop [47]; FjORD [23]; and HeteroFL [10]. Federated averaging and local learning, as noted in Reference [11], allow the proposed techniques to be compared against upper and lower bounds. FedDrop, FjORD, and HeteroFL provide comparisons to the state-of-the-art techniques for device heterogeneity. Since our focus is on introducing privacy into a heterogeneous device environment, our aim is to achieve performance as close to the state-of-the-art as possible, as indicated earlier. To measure performance, we evaluated model prediction accuracy on the standard testing dataset of the respective dataset in each experiment. We recorded four values, namely, the mean and standard deviation (STD) of the accuracy scores from each client’s local model, the range of these scores, and the global accuracy attained by the global model held at the server. In these experiments, fairness is reflected by the simultaneous maximization of mean client accuracy and minimization of the standard deviation of client accuracy, since together these reflect the envy-free and Pareto-efficient properties. With that said, we find that this can create the possibility of a saddle point problem. Hence, we also observe the range of accuracy among the clients, where a higher minimum accuracy indicates a more envy-free system and a higher maximum accuracy indicates a more Pareto-efficient system.
For each dataset, we have analyzed the effect of varying the number of clients and the participation rates. In our main results, we give the evaluation of the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets trained with 10 clients. For the MNIST dataset, we have performed training with 100 clients, and for HAR, we have considered 21 clients. Each of these settings has 100% client training participation in every round. Our supplementary material gives the evaluation of other settings that we have experimented with, such as MNIST with 100 clients and 10% training participation, or with only 10 clients. We find that, for the most part, they make negligible difference to our observations, apart from slower convergence as the number of clients increases and the proportion of participation decreases. For all experiments, there was 100% client participation in the testing evaluation, where each client used a complete copy of the standard testing dataset associated with the respective dataset.
The code for all the experiments is publicly available at https://github.com/codymlewis/ppdhfl.

5.2 Results

The results that we have given in this article have been generated by taking the average of five experiments, each learning over 50 rounds, where clients train by performing a full pass over their local dataset, using a mini-batch size of 128 samples. For CIFAR-100 and Tiny-ImageNet, we have performed 100 total rounds due to the slower convergence of the larger model being trained; we also use a smaller batch-size of 32 samples to save memory resources and to increase the stochasticity of updates.
From Figure 3, we can see that PPDHFL achieves a performance that is close to the best performing algorithm, HeteroFL. Furthermore, we see in Table 2 that PPDHFL achieves a performance that is close to or better than the state-of-the-art across all the evaluated settings with different datasets, numbers of clients, and models. Given the privacy and fairness benefits of the PPDHFL algorithm, these results are clearly significant, as they show the minimal impact of PPDHFL on learning performance.
Table 2. Training Performance, in Terms of Testing Dataset Accuracy, Attained in Our Experiments with a Simulated Model Partition Allocation

Dataset | Algorithm | Client Mean (STD) | Client Range | Global
CIFAR-10 | FedAVG | — | — | 56.646%
CIFAR-10 | Local | 29.974% (7.913%) | [15.556%, 42.540%] | —
CIFAR-10 | FedDrop | — | — | 39.934%
CIFAR-10 | FjORD | 40.523% (20.054%) | [10.000%, 55.036%] | 55.036%
CIFAR-10 | HeteroFL | 47.432% (1.874%) | [44.964%, 49.356%] | 49.264%
CIFAR-10 | PPDHFL | 52.335% (2.478%) | [49.190%, 55.052%] | 54.822%
CIFAR-100 | FedAVG | — | — | 40.420%
CIFAR-100 | Local | 16.736% (1.316%) | [14.490%, 18.918%] | —
CIFAR-100 | FedDrop | — | — | 25.850%
CIFAR-100 | FjORD | 26.430% (5.661%) | [17.940%, 31.098%] | 31.098%
CIFAR-100 | HeteroFL | 29.199% (1.653%) | [27.360%, 31.110%] | 31.110%
CIFAR-100 | PPDHFL | 29.805% (1.617%) | [28.020%, 31.672%] | 31.672%
HAR | FedAVG | — | — | 66.806%
HAR | Local | 45.488% (10.746%) | [24.788%, 68.159%] | —
HAR | FedDrop | — | — | 17.590%
HAR | FjORD | 52.041% (24.712%) | [17.236%, 71.316%] | 69.728%
HAR | HeteroFL | 58.111% (5.370%) | [51.233%, 64.099%] | 58.355%
HAR | PPDHFL | 73.072% (2.075%) | [70.721%, 75.667%] | 72.435%
MNIST | FedAVG | — | — | 91.192%
MNIST | Local | 39.476% (15.248%) | [9.408%, 75.898%] | —
MNIST | FedDrop | — | — | 27.442%
MNIST | FjORD | 62.649% (37.074%) | [9.830%, 89.312%] | 89.312%
MNIST | HeteroFL | 75.105% (4.380%) | [70.194%, 80.770%] | 80.770%
MNIST | PPDHFL | 88.256% (0.615%) | [87.480%, 88.876%] | 88.618%
Tiny-ImageNet | FedAVG | — | — | 36.290%
Tiny-ImageNet | Local | 11.241% (0.661%) | [10.218%, 12.550%] | —
Tiny-ImageNet | FedDrop | — | — | 18.344%
Tiny-ImageNet | FjORD | 21.868% (5.078%) | [14.288%, 26.166%] | 26.166%
Tiny-ImageNet | HeteroFL | 23.477% (1.236%) | [21.966%, 24.784%] | 24.784%
Tiny-ImageNet | PPDHFL | 22.726% (1.444%) | [21.184%, 24.408%] | 24.408%
Performance of the upper bound (FedAVG) and lower bound (Local) algorithms is included for our considered settings. FedAVG performs standard federated averaging, where all clients update the full global model, and so each round is slowed to the speed of the weakest device. Local also has clients update the full model, but they never collaborate during training; they do not slow each other down, at the cost of model generalization and performance. Global accuracy for PPDHFL assumes usage of the SecAgg algorithm [5]; it may not be possible to compute it for other secure aggregation algorithms.
Fig. 3.
Fig. 3. Performance of Heterogeneous Device Algorithms with MNIST Dataset. For the local model evaluations, the line shows the mean value, and the shading shows the range.
We do find, however, that model partitioning algorithms generally perform significantly worse than standard federated averaging when training on more complex tasks. We hypothesize that this is because the partitioning algorithms rely on an over-parameterization of the global model, so that even small partitions of this model can still learn the task well. In the case of the more complex tasks of CIFAR-100 and Tiny-ImageNet, the leeway for over-parameterization is significantly smaller, and thus the partitions of the model are simply unable to adequately learn the task. To support this hypothesis, we performed the same experiments with a cyclic partitioning rate allocation, and these results are given in our supplementary material. This cyclic allocation assigns an approximately even number of clients to each of the rates \(\lbrace 0.1, 0.3, 0.7, 1.0\rbrace\), which is less realistic than our simulation-based allocation, but it shows a more extreme case of model heterogeneity. In this case, though we see the same general patterns as in our main results, the overall accuracy is lower; this is due to the greater number of clients holding smaller partitions of the model, which supports our hypothesis.
Overall, our proposed algorithm, PPDHFL, consistently outperforms the other model partitioning algorithms, with only one exception, the Tiny-ImageNet dataset.
In the case of the Tiny-ImageNet dataset, HeteroFL achieves a slightly better mean accuracy, as well as a smaller standard deviation and range of values, thereby achieving better fairness. However, note that these improvements are all less than \(1\%\). Hence, we still consider PPDHFL to be a significant algorithm, as it additionally provides improved privacy.
FjORD achieves a greater global accuracy, around \(1.758\%\) higher in absolute terms, though at the cost of a substantially higher standard deviation and range. So, we consider FjORD to be less fair than our algorithm, in addition to having weaker privacy.
It appears that the better performance of FjORD and HeteroFL on Tiny-ImageNet is due to the under-parameterization issue mentioned earlier. That is, the more restricted partitioning schemes of the HeteroFL and FjORD algorithms are more likely to assign parameters to the more “useful locations.” This issue could be prevented most simply by increasing the global update time limit or by adding further constraints to the search process in the partition allocation algorithm. We explicitly demonstrate the effectiveness of increasing the global time limit in our supplementary material, where the methods achieve performance closer to FedAVG, with PPDHFL achieving the best performance.
Regarding the other algorithms, local training generally provides a lower bound on learning performance. Conversely, FedAVG tends to provide an upper bound on learning performance due to training full models; however, this has the disadvantage that each round is slowed to the training speed of the slowest client. Finally, while FedDrop provides device heterogeneity, it suffers from the permutation invariance issue, since it randomly selects the neurons to include in its partition, and hence has low performance that is usually only slightly better than local training.
We believe that an important reason behind PPDHFL’s generally better performance with improved privacy lies in its norm scaling strategy. As we observe in our proof of convergence in Appendix A, norm scaling eliminates most of the impact of variance among the updates, thereby reducing the amount of drift experienced. Additionally, the improved fairness is due to our less restricted model partitioning scheme and its optimization algorithm.

6 Conclusion

In this article, we have proposed a novel federated learning algorithm (PPDHFL) with personalization in the context of heterogeneous devices while maintaining compatibility with the gradient privacy preservation techniques of secure aggregation. The proposed algorithm achieves fairness by having clients optimize the proportion of the global model that they contribute to according to the game theoretic notion of fairness, for which we have formulated a formal definition. Gradient privacy is enabled by moving the averaging operation to the clients, so that the server does not need to observe the individual updates provided by each client. We have analyzed the proposed federated learning algorithm under different environments with several different datasets, and have shown that PPDHFL achieves a performance that is close to or better than the current state-of-the-art in heterogeneous device personalized federated learning. We have also provided theoretical proofs backing the fairness and convergence properties of our proposed algorithm.

Acknowledgments

The authors thank the anonymous referees for their valuable comments and suggestions.

Footnotes

1
In our setting, we will see that increasing the partition size of a client does not decrease another’s, and hence Pareto efficiency is inherent to the ordered partitioning scheme that we use.
2
A form of differential privacy where clients cannot be distinguished from each other.
3
An example benchmarking procedure, and the one we used in our experiments, is to perform n updates of the model across a set of partitioning values. Then interpolate the mean results to form an equation giving the time to update with respect to the partitioning rate. For our experiments, the resulting approximated equations were \(t_1(\boldsymbol {p}) = 0.14p_d (1 + 0.2p_w)\) for the most powerful device, \(t_2(\boldsymbol {p}) = 2 t_1(\boldsymbol {p})\) for the slightly weaker device, and \(t_3(\boldsymbol {p}) = 16 t_1(\boldsymbol {p})^2\) for the weakest device.
4
Whether this resulting sum remains encrypted after aggregation is subject to the secure aggregation algorithm in use.
5
For example, stochastic gradient descent or Adam [28].
6
Knowledge distillation is a machine learning technique that summarizes a large model into a significantly smaller model by minimizing the Kullback-Leibler loss between their predictions.
8
This allows the server to decide the maximum amount of time that each round will take.
9
Gradient ascent is effective, since the utility function is concave and L-smooth as \(t_i(p_i)\) must be a monotonically increasing function by definition. Furthermore, it could be solved more efficiently using, for instance, Newton’s method if the utility function has a non-zero second order derivative.
11
We use tensor in the colloquial form common in neural network programming libraries, referring to an n-dimensional array of numbers with \(n \ge 0\).

Supplementary Material

supplementary_material (supplementary_material.zip)
Supplementary material

Appendices

A Proof of Convergence of Norm Scaling

Lemma A.1.
Our proposed norm scaling strategy converges to a stationary point of the objective function at the expected rate defined by Equation (7). Using Assumption 3, we find that this convergence applies for any size of ordered partition, \(p \in (0, 1]\), held by a group of clients. This allows us to define the model parameters, x, as \(\mathcal {P}(X, p)\), where \(\mathcal {P}(y, p)\) is a function that partitions the parameters, y, by the proportion, p, and X denotes the global parameters:
\begin{equation} \mathbb {E}[f(x^{(R)})] - f(x^*) \le \frac{1}{R\eta ^2}\Vert x^{(0)} - x^*\Vert ^2 + \frac{\eta }{2}. \end{equation}
(7)
In the proof below, we make the following three key assumptions:
Assumption 1 (L-smoothness).
The objective function, f, is continuously differentiable and satisfies the inequality,
\begin{equation} f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\Vert y - x\Vert ^2. \end{equation}
(8)
Assumption 2 (Unbiased estimate).
The gradients of the objective function submitted by all clients in aggregation are an unbiased estimator of the gradient of the global objective function,
\begin{equation} \mathbb {E}_{i \sim \mathbb {C}}[\nabla _{b_i} f_i(x)] = \nabla f(x). \end{equation}
(9)
Assumption 3 (Partition group completeness).
For any group of clients that each hold the p ordered partition of the model, the union of their datasets is representative of the global dataset. That is, for the Wasserstein distance, \(W_p\), and the probability density function of client data, P,
\begin{equation} W_p(P_{p_i = p, \forall i \in \mathbb {C}}, P_{\forall i \in \mathbb {C}}) \approx 0. \end{equation}
(10)
Proof.
For the norm scaling algorithm, the update rule of the parameters, x, is defined as follows:
\begin{equation} x^+ = x - \eta \frac{\Delta }{\Vert \Delta \Vert }, \end{equation}
(11)
where \(\Delta\) is the sum of the clients’ submitted gradients, computed on their respective mini-batches, \(b_i\), of their objective functions, \(f_i : \mathbb {R}^n \mapsto \mathbb {R}\); more formally,
\begin{equation*} \Delta = \sum _{i \in \mathbb {C}} \nabla _{b_i} f_i(x). \end{equation*}
The expectation where mini-batches are uniformly taken from the batch of data, b, is then defined as follows:
\begin{align*} \mathbb {E}[\Delta ] &= \sum _{i \in \mathbb {C}} \sum _{b} \nabla _{b} f_i(x) P(b_i = b \mid x) \\ &= |\mathbb {C}|\nabla f(x). \end{align*}
Expanding Equation (8) at the update rule Equation (11), the following can be found:
\begin{align*} f(x^+) &\le f(x) + \langle \nabla f(x), x^+ - x\rangle + \frac{L}{2} \Vert x^+ - x\Vert ^2 \\ &= f(x) + \left\langle \nabla f(x), -\eta \frac{\Delta }{\Vert \Delta \Vert }\right\rangle + \frac{L\eta ^2}{2}\bigg\Vert \frac{\Delta }{\Vert \Delta \Vert }\bigg\Vert ^2. \end{align*}
Since \(\Vert \frac{v}{\Vert v\Vert }\Vert = 1\), taking the expectation of both sides and applying Jensen’s inequality,
\begin{align*} \mathbb {E}[f(x^+)] &\le f(x) - \eta \frac{|\mathbb {C}|\nabla f(x)^2}{\Vert |\mathbb {C}|\nabla f(x)\Vert } + \frac{L\eta ^2}{2} \\ &\le f(x) - \eta \frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert } + \frac{L\eta ^2}{2}. \end{align*}
Now using the tangent property of differentiable convex functions with respect to the optimal parameters \(x^* = \arg \min _x f(x)\) and the current parameters, x,
\begin{equation*} -f(x^*) \le \langle \nabla f(x), x - x^*\rangle - f(x), \end{equation*}
and using the following fact about the distance between the optimal parameters and the updated parameters,
\begin{align*} \Vert x^+ - x^*\Vert ^2 &= \Vert x - x^*\Vert ^2 - 2\eta \left\langle \frac{\nabla f(x)}{\Vert \nabla f(x)\Vert }, x - x^*\right\rangle + \eta ^2\frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert ^2} \\[2pt] -\Vert x^+ - x^*\Vert ^2 &\ge -\Vert x - x^*\Vert ^2 + \eta \left\langle \frac{\nabla f(x)}{\Vert \nabla f(x)\Vert }, x - x^*\right\rangle - \eta ^2\frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert ^2}, \end{align*}
we take the expectation of the objective gap between the updated and optimal parameters, setting \(\eta \le \frac{1}{L}\),
\begin{align*} \mathbb {E}[f(x^+)] - f(x^*) &\le \langle \nabla f(x), x - x^*\rangle - \eta \frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert } + \frac{L\eta ^2}{2} \\ &= \frac{1}{\eta } \left(\eta \langle \nabla f(x), x - x^*\rangle - \eta ^2 \frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert }\right) + \frac{L\eta ^2}{2} \\ &\le \frac{\Vert \nabla f(x)\Vert }{\eta } \left(\eta \left\langle \frac{\nabla f(x)}{\Vert \nabla f(x)\Vert }, x - x^*\right\rangle - \eta ^2 \frac{\nabla f(x)^2}{\Vert \nabla f(x)\Vert ^2}\right) + \frac{\eta }{2} \\ &\le \frac{1}{\eta ^2}(\Vert x - x^*\Vert ^2 - \Vert x^+ - x^*\Vert ^2) + \frac{\eta }{2}. \end{align*}
Taking the sum of distances up to round R,
\begin{align*} \sum _{i = 1}^R (\mathbb {E}[f(x^{(i)})] - f(x^*)) &\le \sum _{i = 1}^R \frac{1}{\eta ^2}(\Vert x^{(i - 1)} - x^*\Vert ^2 - \Vert x^{(i)} - x^*\Vert ^2) + \frac{\eta }{2} \\ &= \frac{1}{\eta ^2}(\Vert x^{(0)} - x^*\Vert ^2 - \Vert x^{(R)} - x^*\Vert ^2) + R\frac{\eta }{2}. \end{align*}
This is used to find the upper bound for the expectation of the distance between the objective functions with respect to the model at round R and the optimal model,
\begin{align*} \mathbb {E}[f(x^{(R)})] - f(x^*) &\le \frac{1}{R} \sum _{i = 1}^R \mathbb {E}[f(x^{(i)})] - f(x^*) \\ &\le \frac{1}{R\eta ^2}\Vert x^{(0)} - x^*\Vert ^2 + \frac{\eta }{2}. \end{align*}
With this proof that PPDHFL converges for any trained partition of the global model, we follow with a corollary showing that training the other partitions also improves the partition under consideration.
Corollary A.2.
Inclusion of updates from clients that hold smaller and larger partitions of the model lead to better gradient estimations of the respective partition of the model.
Proof.
This can be trivially shown by considering the empirical expression of the expectation of the sum of the clients’ submitted gradients, expanding the scope to all levels of partition, where \(\mathcal {Z}(x)\) pads the ordered partition model x into the same shape as X with zeros,
\begin{equation} \mathbb {E}[\Delta ] = \sum _{i \in \mathbb {C}} \sum _{b} \nabla _{b} f_i(\mathcal {Z}(\mathcal {P}(x, p_i))) P(b_i = b \mid x). \end{equation}
(12)
It is trivial to see that for \(p_{min} = \min \boldsymbol {p}\), \(\mathbb {E}[\mathcal {P}(\Delta , p_{min})] = |\mathbb {C}| \nabla f(\mathcal {P}(x, p_{min}))\). From this, given the vector \(\boldsymbol {sp}\), which contains the unique values of \(\boldsymbol {p}\) sorted in ascending order, we find that \(\mathcal {P}(\mathbb {E}[\mathcal {P}(\Delta , sp_2)], p_{min}) = |\mathbb {C}| \nabla f(\mathcal {P}(x, p_{min}))\). This smallest portion of the model uses the data distribution held by all clients, while the \(sp_2\) partition of the model follows \(\mathbb {E}[\mathcal {P}(\Delta , sp_2)] = |\mathbb {C}| \nabla f(\mathcal {P}(x, sp_2))\), where the portion of \(sp_2\) that is not in \(p_{min}\) follows the data distribution of the clients that hold this partition or larger, since only they contribute updates to it. The proof then follows by induction.
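As a toy numerical illustration (not part of the above proof), normalized-gradient steps with a fixed learning rate, as in the update rule of Equation (11), stop making progress once within roughly a step of the optimum, which is consistent with the additive \(\eta/2\) term in Equation (7):

```python
import numpy as np

# Toy check of the normalized update rule x+ = x - eta * grad / ||grad||
# on f(x) = 0.5 * ||x||^2, whose minimum value is f(x*) = 0.
eta = 0.01
x = np.full(10, 5.0)
for r in range(5000):
    grad = x                       # gradient of 0.5 * ||x||^2
    norm = np.linalg.norm(grad)
    if norm == 0:
        break
    x = x - eta * grad / norm
f_gap = 0.5 * np.dot(x, x)         # f(x^(R)) - f(x*)
print(f_gap, "vs additive term eta/2 =", eta / 2)
```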

B Impacts of Secure Aggregation Quantization Upon Training Performance

To understand the trade-offs between accuracy and privacy introduced by secure aggregation, we replicated the experiments described in Section 5 with the quantization required by secure aggregation. Quantization is the only cause of information loss in the secure aggregation process, arising from the conversion of the floating point representation of model parameters to a positive integer representation when performing encryption; all other steps in secure aggregation are lossless. Hence, only quantization can affect accuracy under secure aggregation. In our experiments, we implemented quantized variants of the federated averaging (FedAVG-Q) and PPDHFL (PPDHFL-Q) algorithms, since these are the settings where secure aggregation is applied. The quantization process converts the model gradients from their default typing of 32-bit floating point to 32-bit unsigned integers. To achieve this, we clipped the values to the range \([-\eta ^{-1}, \eta ^{-1}]\), then used the formula \((x + \eta ^{-1}) / 2\eta ^{-1}\) to place them in the range \([0, 1]\), and finally multiplied them by \((2^{32} - 1) / 2\) and truncated to obtain unsigned integers. In this process, we assumed that the gradients do not have element values exceeding \([-\eta ^{-1}, \eta ^{-1}] = [-10, 10]\). We found this to be the case in our evaluated settings, since individual gradients tended to be small; however, in settings with a greater number of epochs prior to gradient submission, it may be beneficial to use a larger bound. Furthermore, employing a scheduling scheme similar to those applied to learning rates could lead to greater performance. With that said, we applied quantization to client gradients prior to submission to the server. The server then aggregated the quantized gradients and sent the result back to the clients. This aggregated gradient is dequantized by the clients and added to their copy of the global model.
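A minimal NumPy sketch of this quantization round-trip is given below; a single gradient is shown, whereas in the actual protocol the dequantization is applied to the modular sum of all clients’ quantized gradients within the secure aggregation scheme.

```python
import numpy as np

def quantize(grad: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """Clip to [-1/eta, 1/eta], map to [0, 1], then scale and truncate to uint32."""
    bound = 1.0 / eta
    unit = (np.clip(grad, -bound, bound) + bound) / (2 * bound)
    return np.floor(unit * ((2**32 - 1) / 2)).astype(np.uint32)

def dequantize(q: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """Invert the quantization of a single gradient."""
    bound = 1.0 / eta
    return q.astype(np.float64) / ((2**32 - 1) / 2) * (2 * bound) - bound

g = np.array([0.0031, -0.42, 7.9], dtype=np.float32)
print(dequantize(quantize(g)) - g)  # small round-trip error from truncation
```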
In Table 3, we present the results of our experiments. Here, we see that using quantization can lead to a reduction of accuracy of up to \(3\%\) due to the loss of information that it incurs. However, this is not a consistent pattern as there are also numerous cases where quantization can improve performance. This increase is the result of restricting the space of attainable parameters, which in some cases allows the search process of machine learning to more easily find the optimal parameters. A consistent property of quantization, however, is the reduction of the standard deviation and range of performance values attained by clients, that is, improvement in fairness. This too is due to the restriction of the space of possible parameters; clients are now more likely to have more similar parameters despite differences in local optimization, and hence their testing accuracies are more similar.
Table 3. Training Performance Results Demonstrating the Impact of Quantization

Dataset | Algorithm | Client Mean (STD) | Client Range | Global
CIFAR-10 | FedAVG | — | — | 56.646%
CIFAR-10 | FedAVG-Q | — | — | 61.166%
CIFAR-10 | PPDHFL | 52.335% (2.478%) | [49.190%, 55.052%] | 54.822%
CIFAR-10 | PPDHFL-Q | 52.008% (1.891%) | [49.638%, 53.928%] | 53.762%
CIFAR-100 | FedAVG | — | — | 40.420%
CIFAR-100 | FedAVG-Q | — | — | 40.616%
CIFAR-100 | PPDHFL | 29.805% (1.617%) | [28.020%, 31.672%] | 31.672%
CIFAR-100 | PPDHFL-Q | 30.785% (1.282%) | [29.208%, 32.202%] | 32.156%
HAR | FedAVG | — | — | 66.806%
HAR | FedAVG-Q | — | — | 71.373%
HAR | PPDHFL | 73.072% (2.075%) | [70.721%, 75.667%] | 72.435%
HAR | PPDHFL-Q | 70.934% (1.888%) | [69.058%, 73.441%] | 69.285%
MNIST | FedAVG | — | — | 91.192%
MNIST | FedAVG-Q | — | — | 91.008%
MNIST | PPDHFL | 88.256% (0.615%) | [87.480%, 88.876%] | 88.618%
MNIST | PPDHFL-Q | 85.815% (0.558%) | [85.150%, 86.452%] | 86.252%
Tiny-ImageNet | FedAVG | — | — | 36.290%
Tiny-ImageNet | FedAVG-Q | — | — | 36.544%
Tiny-ImageNet | PPDHFL | 22.726% (1.444%) | [21.184%, 24.408%] | 24.408%
Tiny-ImageNet | PPDHFL-Q | 24.640% (1.062%) | [23.454%, 25.834%] | 25.834%
Experiments are performed under the same setting as those presented in Section 5. “-Q” in the algorithm name indicates the experiments where quantization was applied.

References

[1]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS’16). ACM, New York, NY, 308–318. DOI:
[2]
Stefan Arnold and Dilara Yesilbas. 2021. Demystifying the effects of non-independence in federated learning. Retrieved from https://arxiv.org/abs/2103.11226
[3]
Sana Awan, Bo Luo, and Fengjun Li. 2021. CONTRA: Defending against poisoning attacks in federated learning. In Proceedings of the 26th European Symposium on Research in Computer Security (ESORICS’21). Springer-Verlag, Berlin, 455–475. DOI:
[4]
Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. 2021. Revisiting ResNets: Improved training and scaling strategies. In Advances in Neural Information Processing Systems. M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 22614–22627. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2021/file/bef4d169d8bddd17d68303877a3ea945-Paper.pdf
[5]
Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS’17). ACM, New York, NY, 1175–1191. DOI:
[6]
Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. 2020. TiFL: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’20). ACM, New York, NY, 125–136. DOI:
[7]
Wei Chen, Kartikeya Bhardwaj, and Radu Marculescu. 2020. FedMAX: Mitigating activation divergence for accurate and communication-efficient federated learning. Retrieved from https://arxiv.org/abs/2004.03657
[8]
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. 2017. A downsampled variant of imagenet as an alternative to the CIFAR datasets. Retrieved from http://arxiv.org/abs/1707.08819
[9]
Li Deng and Xiao Li. 2013. Machine learning paradigms for speech recognition: An overview. IEEE Trans. Audio, Speech Lang. Process. 21, 5 (2013), 1060–1089.
[10]
Enmao Diao, Jie Ding, and Vahid Tarokh. 2020. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. Retrieved from https://arxiv.org/abs/2010.01264
[11]
Siddharth Divi, Yi-Shan Lin, Habiba Farrukh, and Z. Berkay Celik. 2021. New metrics to evaluate the performance and fairness of personalized federated learning. Retrieved from https://arxiv.org/abs/2107.13173
[12]
Zhixu Du, Jingwei Sun, Ang Li, Pin-Yu Chen, Jianyi Zhang, Hai “Helen” Li, and Yiran Chen. 2022. Rethinking normalization methods in federated learning. In Proceedings of the 3rd International Workshop on Distributed Machine Learning (DistributedML’22). ACM, New York, NY, 16–22. DOI:
[13]
Christophe Dupuy, Tanya Roosta, Leo Long, Clement Chung, Rahul Gupta, and Salman Avestimehr. 2022. Learnings from federated learning in the real world. In Proceedings of the International Conference on Acoustics, Speech, & Signal Processing (ICASSP’22). Retrieved from https://www.amazon.science/publications/learnings-from-federated-learning-in-the-real-world
[14]
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 214–226.
[15]
Daniel C. Elton, Zois Boukouvalas, Mark D. Fuge, and Peter W. Chung. 2019. Deep learning for molecular design—a review of the state of the art. Molec. Syst. Design Eng. 4, 4 (2019), 828–849.
[16]
Yahya H. Ezzeldin, Shen Yan, Chaoyang He, Emilio Ferrara, and A. Salman Avestimehr. 2023. FairFed: Enabling group fairness in federated learning. Proceedings of the AAAI Conference on Artificial Intelligence. 7494–7502. DOI:
[17]
Liam H. Fowl, Jonas Geiping, Steven Reich, Yuxin Wen, Wojciech Czaja, Micah Goldblum, and Tom Goldstein. 2023. Decepticons: Corrupted transformers breach privacy in federated learning for language models. In Proceedings of the 11th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=r0BrY4BiEXO
[18]
Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. 2020. Inverting gradients-how easy is it to break privacy in federated learning? Adv. Neural Info. Process. Syst. 33 (2020), 16937–16947.
[19]
Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. 2022. An efficient framework for clustered federated learning. IEEE Trans. Information Theory 68, 12 (2022), 8076–8091. DOI:
[20]
Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2017. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. Retrieved from http://arxiv.org/abs/1711.10677
[21]
Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In Proceedings of the International Conference on Machine Learning. PMLR, 1929–1938.
[22]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Retrieved from http://arxiv.org/abs/1503.02531
[23]
Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. 2021. FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2021/file/6aed000af86a084f9cb0264161e29dd3-Paper.pdf
[24]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. Retrieved from http://arxiv.org/abs/1704.04861
[25]
Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. 2020. The non-IID data quagmire of decentralized machine learning. In Proceedings of the International Conference on Machine Learning. PMLR, 4387–4398.
[26]
Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Koneny, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrede Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Ozgur, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramer, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. 2019. Advances and open problems in federated learning. Retrieved from http://arxiv.org/abs/1912.04977
[27]
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning. PMLR, 5132–5143.
[28]
Diederik P. Kingma and Jimmy Ba. 2021. Adam: A method for stochastic optimization. Retrieved from https://arxiv.org/abs/1412.6980v9
[29]
Alex Krizhevsky and Geoffrey Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto, Toronto, Ontario.
[30]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[31]
Daliang Li and Junpu Wang. 2019. FedMD: Heterogenous federated learning via model distillation. Retrieved from http://arxiv.org/abs/1910.03581
[32]
Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. 2021. Ditto: Fair and robust federated learning through personalization. In Proceedings of the International Conference on Machine Learning. PMLR, 6357–6368.
[33]
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems. I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2. 429–450. Retrieved from https://proceedings.mlsys.org/paper_files/paper/2020/file/1f5fe83998a09396ebe6477d9475ba0c-Paper.pdf
[34]
Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Think locally, act globally: Federated learning with local and global representations. Retrieved from http://arxiv.org/abs/2001.01523
[35]
Qianpiao Ma, Yang Xu, Hongli Xu, Zhida Jiang, Liusheng Huang, and He Huang. 2021. FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing. IEEE J. Select. Areas Commun. 39, 12 (2021), 3654–3672.
[36]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. PMLR, 1273–1282.
[37]
H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2018. Learning differentially private recurrent language models. In Proceedings of the International Conference on Learning Representations (ICLR’18). Retrieved from https://openreview.net/pdf?id=BJ0hF1Z0b
[38]
Arup Mondal, Yash More, Prashanthi Ramachandran, Priyam Panda, Harpreet Virk, and Debayan Gupta. 2022. SCOTCH: An efficient secure computation framework for secure aggregation. Retrieved from https://arxiv.org/abs/2201.07730
[39]
Dario Pasquini, Danilo Francati, and Giuseppe Ateniese. 2022. Eluding secure aggregation in federated learning via model inconsistency. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 2429–2443.
[40]
Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai. 2017. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Info. Forensics Secur. 13, 5 (2017), 1333–1345.
[41]
Jorge-L. Reyes-Ortiz, Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita. 2016. Transition-aware human activity recognition using smartphones. Neurocomputing 171 (2016), 754–767.
[42]
Holger R. Roth, Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B. Turkbey, and Ronald M. Summers. 2015. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 556–564.
[43]
Hal R. Varian. 1974. Equity, envy, and efficiency. J. Econ. Theory 9, 1 (1974), 63–91. DOI:
[44]
Hal R. Varian. 1976. Two problems in the theory of fairness. J. Public Econ. 5, 3-4 (1976), 249–260.
[45]
Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. 2020. Federated learning with matched averaging. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BkluqlSFDS
[46]
Ziqi Wang, Yongcun Song, and Enrique Zuazua. 2023. Approximate and weighted data reconstruction attack in federated learning. Retrieved from http://arxiv.org/abs/2308.06822
[47]
Dingzhu Wen, Ki-Jun Jeon, and Kaibin Huang. 2022. Federated dropout—A simple approach for enabling federated learning on resource constrained devices. IEEE Wireless Commun. Lett. 11, 5 (May2022), 923–927. DOI:
[48]
Yuxin Wen, Jonas A. Geiping, Liam Fowl, Micah Goldblum, and Tom Goldstein. 2022. Fishing for user data in large-batch federated learning via gradient magnification. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.), Vol. 162. PMLR, 23668–23684. Retrieved from https://proceedings.mlr.press/v162/wen22a.html
[49]
Haomiao Yang, Mengyu Ge, Kunlan Xiang, and Jingwei Li. 2022. Using highly compressed gradients in federated learning for data reconstruction attacks. IEEE Trans. Info. Forensics Secur. 18 (2022), 818–830.
[50]
Hailin Yang, Yanhong Huang, Jianqi Shi, and Yang Yang. 2023. A federated framework for edge computing devices with collaborative fairness and adversarial robustness. J. Grid Comput. 21 (2023).
[51]
Chengliang Zhang, Suyi Li, Junzhe Xia, Wei Wang, Feng Yan, and Yang Liu. 2020. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’20). 493–506.
[52]
Jingyang Zhang, Yiran Chen, and Hai Li. 2022. Privacy leakage of adversarial training models in federated learning systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 108–114.
[53]
Shuaishuai Zhang, Jie Huang, Zeping Zhang, and Chunyang Qi. 2023. Compromise privacy in large-batch federated learning via malicious model parameters. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing. Springer, 63–80.
[54]
Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. 2020. iDLG: Improved deep leakage from gradients. Retrieved from http://arxiv.org/abs/2001.02610
[55]
Zhicheng Zhou, Hailong Chen, Kunhua Li, Fei Hu, Bingjie Yan, Jieren Cheng, Xuyan Wei, Bernie Liu, Xiulai Li, Fuwen Chen, and Yongji Sui. 2021. A novel optimized asynchronous federated learning framework. In Proceedings of the IEEE 23rd International Conference on High Performance Computing and Communications; 7th International Conference on Data Science and Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud and Big Data Systems and Application (HPCC/DSS/SmartCity/DependSys’21). 2363–2370. DOI:
[56]
Ligeng Zhu and Song Han. 2020. Deep leakage from gradients. In Federated Learning. Springer, Berlin, 17–31.

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3
June 2024, 646 pages
EISSN: 2157-6912
DOI: 10.1145/3613609
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 May 2024
Online AM: 13 March 2024
Accepted: 19 February 2024
Revised: 31 January 2024
Received: 21 June 2023
Published in TIST Volume 15, Issue 3


Author Tags

  1. Federated learning
  2. device heterogeneity
  3. fairness
  4. privacy

Qualifiers

  • Research-article

Funding Sources

  • Australian Government Research Training Program (RTP) Scholarship
