
Exploring the Distributed Knowledge Congruence in Proxy-data-free Federated Distillation

Published: 22 February 2024

Abstract

Federated learning (FL) is a privacy-preserving machine learning paradigm in which the server periodically aggregates local model parameters from clients without assembling their private data. Constrained communication and personalization requirements pose severe challenges to FL. Federated distillation (FD) is proposed to simultaneously address the above two problems, which exchanges knowledge between the server and clients, supporting heterogeneous local models while significantly reducing communication overhead. However, most existing FD methods require a proxy dataset, which is often unavailable in reality. A few recent proxy-data-free FD approaches can eliminate the need for additional public data, but suffer from remarkable discrepancy among local knowledge due to client-side model heterogeneity, leading to ambiguous representation on the server and inevitable accuracy degradation. To tackle this issue, we propose a proxy-data-free FD algorithm based on distributed knowledge congruence (FedDKC). FedDKC leverages well-designed refinement strategies to narrow local knowledge differences into an acceptable upper bound, so as to mitigate the negative effects of knowledge incongruence. Specifically, from perspectives of peak probability and Shannon entropy of local knowledge, we design kernel-based knowledge refinement (KKR) and searching-based knowledge refinement (SKR) respectively, and theoretically guarantee that the refined-local knowledge can satisfy an approximately-similar distribution and be regarded as congruent. Extensive experiments conducted on three common datasets demonstrate that our proposed FedDKC significantly outperforms the state-of-the-art in various heterogeneous settings while evidently improving the convergence speed.

1 Introduction

Federated learning (FL) is a privacy-preserving machine learning paradigm that allows participants to collaboratively train machine learning (ML) models while keeping the data decentralized. Owing to the advantages of protecting data privacy and boosting model accuracy, FL has been widely applied to a variety of applications, such as medical treatment [27, 37], financial risk management [16], and recommendation systems [15, 29]. Conventional parameter-aggregation-based FL frameworks [22, 24] periodically aggregate local model parameters uploaded by distributed clients on the server-side and then broadcast the updated global model to clients until model convergence, aiming to improve the trained models’ generic performance. However, such methods face two challenges. On the one hand, frequently exchanging model parameters over the training process leads to an excessive communication burden; on the other hand, homogeneous models among clients conflict with client heterogeneity in terms of data distribution and system configuration. These defects easily result in drastic performance drops and hinder the actual deployment of FL.
Motivated by the challenges above, federated distillation (FD) has been proposed by extending knowledge distillation technology into FL frameworks [1, 11], in which model outputs (called knowledge) are exchanged between clients and the server in place of model parameters. Since the size of knowledge is smaller than that of model weights by orders of magnitude and knowledge is independent of model architectures, FD can maintain low communication overhead while allowing personalized models to be designed for individual clients, making it a communication-efficient FL paradigm that accommodates heterogeneity.
Most existing FD methods [4, 14, 23] require a globally-shared proxy dataset to extract knowledge, based on which the server and clients can conduct co-distillation to narrow their representation gap. Since a proxy dataset must be carefully gathered and is often unavailable in reality, a few efforts have been made to explore FD frameworks in a proxy-data-free manner. Typical methods break the dependence on proxy data by iteratively exchanging additional information between clients and the server, such as a generator [39] or local-global models [19, 25], to realize distillation. However, such methods remarkably increase communication overhead because of exchanging model parameters. To maintain communication efficiency in proxy-data-free FD, [7] proposes a novel feature-driven FD framework, which leverages embedded features in place of proxy data to extract knowledge and achieves workable client-server co-distillation with little influence on communication efficiency. Nevertheless, this approach suffers from non-negligible accuracy degradation, since heterogeneous local models tend to exhibit a significant knowledge discrepancy without the assistance of a proxy dataset. Such knowledge incongruence leads to unstable and incorrect distillation, which is harmful to the FD process.
To alleviate the accuracy drop caused by knowledge discrepancy among clients, we investigate proxy-data-free FD from a novel perspective: refinement-based distributed knowledge congruence among heterogeneous clients. We propose a feature-driven FD algorithm based on distributed knowledge congruence (namely FedDKC), in which we refine distributed local knowledge from clients to satisfy a similar distribution based on our well-designed congruence-refinement strategies during server-side distillation. Specifically, we consider knowledge discrepancy from two perspectives: the peak probability and the Shannon entropy of knowledge, and propose kernel-based knowledge refinement (KKR) and searching-based knowledge refinement (SKR) strategies, respectively. On this foundation, the server can learn unbiased knowledge representations and obtain more precise global knowledge based on relatively-congruent local knowledge. In turn, clients can achieve better performance with transferred global knowledge. As far as we know, this paper is the first work to consider knowledge incongruence among heterogeneous clients in proxy-data-free federated distillation. Our proposed FedDKC can significantly boost training accuracy while maintaining communication efficiency based on distributed knowledge congruence.
The main contributions of this paper are summarized as follows:
We propose a communication-efficient and accuracy-guaranteed FD algorithm (namely FedDKC), in which the local knowledge discrepancy among clients with heterogeneous models is narrowed by skillfully refining it to a similar probability distribution. In FedDKC, the server can learn unbiased knowledge representations and help clients promote local training accuracy.
We design the KKR and SKR strategies separately for two kinds of knowledge incongruence. The KKR strategy refines the peak probability of clients’ local knowledge to a given target value, and the SKR strategy keeps the Shannon entropy of the refined-local knowledge within a target range. We further prove that the knowledge discrepancy between arbitrary clients satisfies an acceptable theoretical upper bound when adopting either strategy.
We conduct empirical experiments on MNIST, CIFAR-10, and CINIC-10 datasets with heterogeneous client model architectures and multiple data Non-IID settings. Results demonstrate that our proposed FedDKC outperforms the related state-of-the-art with accuracy improvements and faster convergence on individual clients.

2 Preliminary and Motivation

This section presents the fundamental process of proxy-data-free FD, and then explains our motivation for distributed knowledge congruence. Detailed notations and descriptions are given in Table 1.
Table 1. Main Notations with Descriptions

Notation | Description
\(K\) | The number of clients
\(C\) | The number of classes
\(\mathcal {D}^k\) | The private dataset of client \(k\)
\((X^k,y^k)\) | The data and labels in \(\mathcal {D}^k\)
\(N^k\) | The number of samples in \(\mathcal {D}^k\)
\(\mathcal {P}\) | The universal set of probability space in \(C\) classes
\(W^S\) | The global model weights of the server
\(W^k\) | The local model weights of client \(k\)
\(W^k_e\) | The feature extractor weights of client \(k\)
\(W^k_p\) | The predictor weights of client \(k\)
\(z_{X^k}^S\) | The global knowledge
\(z^k_{X^k}\) | The local knowledge from client \(k\)
\(p_{X^k}^S\) | The softmax-normalized global knowledge
\(p_{X^k}^k\) | The softmax-normalized local knowledge from client \(k\)
\({H^k}\) | The extracted features from client \(k\)
\(\theta\) | The parameter of the auxiliary mapping in \(\psi (\theta ;\cdot)\)
\(m\) | The index of the maximum element in \(p_{X^k}^k\)
\(t\) | The input scaling parameter of the kernel function
\(T\) | The hyper-parameter of target peak probability
\(E\) | The hyper-parameter of target Shannon entropy
\(\tau (\cdot)\) | The softmax mapping
\(L_{CE}(\cdot)\) | The cross-entropy loss function
\(L_{sim}(\cdot)\) | The knowledge-similarity loss function
\(\max (\cdot)\) | The maximum function
\(\varphi (\cdot)\) | The refinement mapping over distributed knowledge
\(\sigma (\cdot)\) | The kernel function in KKR
\(H(\cdot)\) | The Shannon entropy function
\(\psi (\cdot)\) | The auxiliary mapping in SKR

2.1 Basic Process of Proxy-data-free Federated Distillation

Without loss of generality, we consider the classification task in the FL setting with \(C\) categories, and let \(\mathcal {C}=\lbrace 1,2, \ldots ,C\rbrace\). The FD system consists of a large-scale server and \(K\) heterogeneous clients. Let \(\mathcal {K}= \lbrace 1,2, \ldots ,K\rbrace\) denote the set of clients. Each client \(k\) owns a private dataset \(\mathcal {D}^k=\lbrace {X^k},{y^k}\rbrace\) with \(N^k\) samples, where \(X^k\) and \(y^k\) denote the set of input data and the corresponding labels, respectively. Moreover, data distributions among clients are not identically and independently distributed (Non-IID) in our setting.
We assume that each client owns a heterogeneous model architecture, determined by the computation capability and training requirements of individual clients in reality. Referring to FedGKT [7], we consider the feature-driven FD framework, which can achieve heterogeneous model training while guaranteeing communication efficiency. In this framework, the local model at each client includes a small feature extractor and a large predictor, while the global model at the server only contains a large predictor. Let \(W_e^k\) and \(W_p^k\) be the feature extractor’s weights and the predictor’s weights of client \(k\), respectively. Moreover, we denote \(W^k=\lbrace W_e^k \cup W_p^k\rbrace\) as the weights of the local model at client \(k\), and denote \(W^S\) as the weights of the global model on the server. Let \(f(W^*;\cdot)\) denote the nonlinear function determined by weights \(W^\ast\), where \(W^\ast \in \lbrace \bigcup \nolimits _{k=1}^K W^k\cup W^S\rbrace\). In addition, we define the extracted features of client \(k\) as \({H^k} = f(W_e^k;{X^k})\), the logits of client \(k\) as \(z_{{X^k}}^k = f(W_p^k;{H^k})\), and the logits of the server as \(z_{{X^k}}^S = f({W^S};{H^k})\). Specifically, the logits of clients are called local knowledge, and the logits of the server are called global knowledge.
The whole process of proxy-data-free FD can be divided into multiple rounds. Each round consists of two stages: local distillation, where each client updates its local model based on global knowledge transferred back from the server; and global distillation, where the global model on the server performs knowledge distillation based on uploaded local knowledge from clients. The detailed processes are illustrated as follows:
(1) Local Distillation Process: Each client \(k\) updates its feature extractor \(W_e^k\) and predictor \(W_p^k\) according to the received global knowledge \(z_{X^k}^S\), aiming to minimize the combination of cross-entropy loss \(L_{CE}(\cdot)\) and knowledge-similarity loss \(L_{sim}(\cdot)\), which can be given by:
\begin{equation} \mathop {\arg \min }\limits _{{W^k}} L_C^k: = {L_{CE}}\left(p_{{X^k}}^k,{y^k}\right) + \beta \cdot {L_{sim}}\left(p_{{X^k}}^k,p_{{X^k}}^S\right)\!, \end{equation}
(1)
where \(L_C^k(\cdot)\) represents the loss function of client \(k\), and \(\beta\) is the hyper-parameter for weighting the effect of knowledge similarity loss. \(p_{X^k}^S=\tau (z_{X^k}^S)\) denotes the softmax-normalized global knowledge that is broadcast to client \(k\), and \(p_{X^k}^k=\tau (z_{X^k}^k)\) is the softmax-normalized local knowledge from client \(k\), in which \(\tau (\cdot)\) is the softmax mapping. \(L_{sim}(\cdot)\) measures the similarity of normalized local and global knowledge and takes the Kullback-Leibler divergence by default. After local training, client \(k\) generates the extracted features \(H^k\) and the local knowledge \({z_{{X^k}}^k}\) based on its updated feature extractor and predictor, i.e., \({{H}^k} = f(W_e^k;{X^k})\), and \(z_{{X^k}}^k = f(W_p^k;{H^k})\). Then, client \(k\) uploads its obtained features \(H^k\), local knowledge \(z_{{X^k}}^k\) and corresponding labels \(y^k\) to the server for performing global distillation.
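As a concrete illustration, a minimal PyTorch sketch of the client-side objective in Equation (1) might look as follows; the function and argument names are hypothetical and do not come from the authors’ implementation.

```python
import torch.nn.functional as F

def local_distillation_loss(extractor, predictor, x, y, z_global, beta=1.5):
    """Cross-entropy on private labels plus KL similarity to the global knowledge."""
    h = extractor(x)                        # H^k = f(W_e^k; X^k)
    z_local = predictor(h)                  # z^k_{X^k} = f(W_p^k; H^k)
    log_p_local = F.log_softmax(z_local, dim=1)
    p_global = F.softmax(z_global, dim=1)   # tau(z^S_{X^k}) broadcast by the server
    ce = F.cross_entropy(z_local, y)
    kl = F.kl_div(log_p_local, p_global, reduction="batchmean")
    return ce + beta * kl, h.detach(), z_local.detach()
```

The detached features and logits returned here correspond to the \(H^k\) and \(z_{X^k}^k\) that the client subsequently uploads.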
(2) Global Distillation Process: After receiving the local knowledge from all clients, the server conducts the global distillation process, which updates the global model \(W^S\) by optimizing the following objective:
\begin{equation} \mathop {\arg \min }\limits _{{W^S}} {L_S}: = {L_{CE}}\left(p_{{X^k}}^S,{y^k}\right) + \beta \cdot {L_{sim}}\left(p_{{X^k}}^S,p_{{X^k}}^k\right)\!, \end{equation}
(2)
where \(L_S(\cdot)\) denotes the server-side loss function. After distillation, the server generates global knowledge \(z_{X^k}^S\) for each client \(k\) using the updated global model \(W^S\) and the uploaded local features \(H^k\), i.e., \(z_{X^k}^S=f(W^S;H^k)\). Then, \(z_{X^k}^S\) is broadcast to client \(k\). At this point, this round is completed, and the next round begins.
During the above process, only extracted features \(H^k\) and local-global knowledge \(\lbrace z_{X^k}^k\), \(z_{X^k}^S\rbrace\) are exchanged between the server and client \(k\). Since the sizes of such information are significantly smaller compared with model weights, this feature-driven FD manner can achieve client-server co-distillation under model heterogeneity with slight communication overhead.
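Correspondingly, a minimal sketch of one server-side update from Equation (2) might look as follows, where p_local_k holds the softmax-normalized (or, later in FedDKC, refined) local knowledge of client \(k\); again, names are illustrative rather than the authors’ code.

```python
import torch.nn.functional as F

def server_distillation_step(server_model, optimizer, h_k, y_k, p_local_k, beta=1.5):
    """One optimization step of Equation (2) on the batch uploaded by client k."""
    optimizer.zero_grad()
    z_server = server_model(h_k)                 # z^S_{X^k} = f(W^S; H^k)
    log_p_server = F.log_softmax(z_server, dim=1)
    ce = F.cross_entropy(z_server, y_k)
    kl = F.kl_div(log_p_server, p_local_k, reduction="batchmean")
    loss = ce + beta * kl
    loss.backward()
    optimizer.step()
    return z_server.detach()                     # global knowledge returned to client k
```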

2.2 Motivation of Distributed Knowledge Congruence

(1) Existing Drawback: Affected by both data heterogeneity and model heterogeneity, existing proxy-data-free FD methods struggle to obtain similarly-distributed local knowledge from multiple clients. On the one hand, data heterogeneity leads to diverse label distributions among clients, and the local model on each client tends to learn biased representations based on an independently sampled space, favoring the samples with higher frequency to improve the local fit. On the other hand, model heterogeneity can further exacerbate these biases, since larger models tend to possess superior representation capability and generate knowledge with larger numerical differences, and vice versa.
Furthermore, according to Equation (2), knowledge incongruence has a non-negligible influence on server distillation, since the global model is optimized based on the knowledge similarity between clients and the server. Due to the aforementioned problem, if the server straightforwardly learns the incongruent knowledge from clients, it will learn an ambiguous or biased representation and easily fail to converge smoothly; it thus cannot acquire approximately-optimal global knowledge, which in turn affects the training accuracy of clients. However, existing methods [4, 7, 14, 19, 20, 23, 38], summarized in Table 2, dismiss this ill effect of incongruent knowledge among clients, which leads to severe performance degradation. Figure 1 shows the effect of knowledge congruence on global model convergence, where the red arrows indicate the direction of the negative gradient obtained by distillation on softmax-normalized local knowledge, and the black arrows indicate that obtained by distillation on the refined-local knowledge. As shown in Figure 1(a), the local knowledge from a single client will contribute an optimized direction for the global model. However, knowledge incongruence among heterogeneous clients contributes to biased optimization and frequent fluctuation in the convergence direction. These negative effects cause the actual result to deviate from the optimal one.
Table 2. Comparison of FedDKC with Related State-of-the-art Methods

Method | PF | AMH | EC | KR | KDHC
FedDF [23] |  |  |  | Average | Noisy
FedMD [20] |  |  |  | Average | Noisy
FedGEM [4] |  |  |  | None | Incongruent
DS-FL [14] |  |  |  | Entropy Reduction | Incongruent
FedLSD [19] |  |  |  | Soften | Incongruent
FedGKD [38] |  |  |  | Historical Information | Incongruent
FedGEN [39] |  |  |  | None | Incongruent
FedGKT [7] |  |  |  | None | Incongruent
FedDKC |  |  |  | KKR/SKR | Congruent
Proxy-data-free, allow model heterogeneity, efficient communication, knowledge refinement and knowledge distribution among heterogeneous clients are respectively denoted as PF, AMH, EC, KR, KDHC in this table.
Fig. 1. The effect of knowledge congruence on global model convergence.
(2) Insight Formulation: Through the above analysis, we assert that congruent local knowledge among clients is essential for optimizing the global model and realizing stabilized convergence. Therefore, we expect to narrow the distribution differences of the original local knowledge among clients through knowledge refinement, aiming to make the refined-local knowledge satisfy an approximate distribution constraint. Based on congruent knowledge during server-side distillation, the global model can be steadily updated towards the correct convergence direction, as shown in Figure 1(b). Guided by the above insight, we propose the FedDKC algorithm; a detailed comparison between FedDKC and related state-of-the-art methods is shown in Table 2. Compared with existing proxy-data-free FD methods, our proposed FedDKC allows both model heterogeneity among clients and high communication efficiency, and pioneers the use of knowledge congruence to promote distillation performance.

3 Federated Distillation Based On Distributed Knowledge Congruence

In this section, we first introduce our proposed FedDKC algorithm and its fundamental idea. Then, knowledge refinement strategies including kernel-based knowledge refinement (KKR) and searching-based knowledge refinement (SKR) are explained in detail. Finally, we provide the formal description of FedDKC.

3.1 Framework Formulation

Different from previous methods, we commit to achieving a tailored distribution congruence of local knowledge among clients during server-side distillation by narrowing the difference of distributed local knowledge to an acceptable constraint, as shown in Figure 2. Specifically, we define \(dist(\cdot)\) to measure the normalized knowledge distribution. Taking \(z^k_{X^k}\) and \(z^l_{X^l}\) as inputs, which are normalized via the softmax mapping \(\tau (\cdot)\), the knowledge discrepancy between client \(k\) and client \(l\) can be represented by \(|dist(\tau (z^k_{X^k}))-dist(\tau (z^l_{X^l}))|\). Affected by data and model heterogeneity among clients, significant discrepancy among the softmax-normalized local knowledge derived by each client is ubiquitous. Thus, we design a knowledge refinement mapping \(\varphi (\cdot)\) to refine all local knowledge into a similar distribution and realize approximate congruence of local knowledge. Note that the local knowledge after refinement mapping is called refined-local knowledge.
Fig. 2. The overall framework of FedDKC.
Firstly, we indicate that \(\varphi (\cdot)\) should satisfy the following three properties:
Probabilistic Projectivity. For each client \(k\), the refined-local knowledge is in probability space, which means that all elements in refined-local knowledge have to be non-negative and add up to 1, i.e.,
\begin{equation} \varphi \left(z_{X^{k}}^{{k}}\right) \in \mathcal {P}, \end{equation}
(3)
where
\begin{equation} \mathcal {P} = \left\lbrace Z \in {R^C} : \sum \nolimits _i {{Z_i}}=1 \wedge 0 \le {Z_i} \le 1,\forall i \in \mathcal {C}\right\rbrace \! . \end{equation}
(4)
This is because the refined-local knowledge in our algorithm is required to be in normalized form, which is a necessary condition for computing the similarity loss with the global knowledge.
Invariant Relations. For each client’s logits \(z^k_{X^k} := (u_1^k,u_2^k, \ldots ,u_C^k)\), the refinement mapping \(\varphi (\cdot)\) should not change the order of numeric value among all elements in \(z^k_{X^k}\), i.e.,
\begin{equation} \varphi \left(z^k_{X^k}\right)_i \ge \varphi \left(z^k_{X^k}\right)_j, \forall u_i^k \ge u_j^k, \end{equation}
(5)
where \(\varphi (z^k_{X^k})_i\) is the \(i\)-th dimension in \(\varphi (z^k_{X^k})\). Since the structured information of local knowledge is mainly reflected in the dimensional order relations, knowledge refinement needs to maintain such relations to preserve the original information.
Bounded Dissimilarity. After refining, the knowledge discrepancy between arbitrary clients should satisfy an acceptable theoretical upper bound \(\varepsilon\), i.e.,
\begin{equation} \left|dist\left(\varphi \left(z_{{X^k}}^{{k}}\right)\right) - dist\left(\varphi \left(z_{{X^l}}^{{l}}\right)\right)\right| \lt \varepsilon , \forall k, l\in \mathcal {K}. \end{equation}
(6)
This property ensures that the refined-local knowledge is approximately congruent under the measurement of \(dist(\cdot)\), which is the foundation of our motivation.
Based on the proposed knowledge refinement mapping \(\varphi (\cdot)\), the new knowledge-similarity loss of the server partly depends on the refined-local knowledge among clients, which is described as follows:
\begin{equation} L_{sim}\left(p^S_{X^k},\varphi \left(z^k_{X^k}\right)\right)\!. \end{equation}
(7)
As a consequence, the reformulated optimization problem with a new loss function during the global distillation process can be formulated as:
\begin{equation} \mathop {\arg \min }\limits _{{W^S}} {L^{^{\prime }}_S}: = {L_{CE}}\left(p_{{X^k}}^S,{y^k}\right) + \beta \cdot {L_{sim}}\left(p_{{X^k}}^S,\varphi \left(z_{{X^k}}^k\right)\right)\!. \end{equation}
(8)
Considering peak probability congruence and Shannon entropy congruence, two disparate metrics to capture the overall knowledge distribution, we design respective strategies for implementing the knowledge refinement mapping. Specifically, kernel-based knowledge refinement (KKR) is tailored for refining the peak probability of normalized local knowledge to a customized hyper-parameter by performing a kernel-based transformation on every dimension of knowledge. Additionally, searching-based knowledge refinement (SKR) is proposed to keep the Shannon entropy of refined-local knowledge within a given range by searching for a knowledge refinement mapping with a controlled output Shannon entropy. Figure 3 illustrates the local knowledge of two distributions extracted from samples in the TMD [2] dataset, where the red and blue fills respectively represent the distribution of (normalized) local knowledge from the 1st and 10th communication rounds. As displayed in Figure 3, the gap between the two knowledge distributions can be significantly reduced by KKR and SKR, indicating the effectiveness of the distributed knowledge congruence strategies KKR and SKR in handling knowledge discrepancy. The detailed processes of our proposed strategies are introduced in the following sections.
Fig. 3. Comparison of knowledge normalization with softmax, KKR and SKR over two distributions.

3.2 Kernel-based Knowledge Refinement

This section proposes a kernel-based strategy to implement knowledge refinement. We adopt the maximum value in the normalized knowledge (called peak probability) to represent the distribution of the overall normalized local knowledge, since it can reflect the model’s confidence on a specific sample. The measurement function of knowledge distribution in KKR is defined as \(dist_{KKR}(\cdot)=\max (\cdot)\), where \(\max (\cdot)\) gets the maximum value of the input normalized knowledge. To enable the peak probability congruence among clients, we require the refined peak probabilities of all clients to be a constant value \(T\).
To achieve this, we first define a non-direct-proportion and monotonically increasing kernel function \(\sigma (\cdot), \sigma \in \lbrace f|f(x) \ne k \cdot x\rbrace \cap \lbrace f|f({x_1}) - f({x_2}) \ge 0,\forall {x_1} \ge {x_2}\rbrace\) to map each dimension of the softmax-normalized local knowledge. We expect that the refined-local knowledge jointly transformed by the parameterized multi-kernel functions can maintain the customized peak probability, and the parameter of the kernel functions can be derived from the constraint that the output peak probability equals \(T\). Let \(\varphi _{KKR} (z^k_{X^k})\) denote the refined-local knowledge of client \(k\) derived by the KKR strategy, and let \({\varphi _{KKR}}{(z_{{X^k}}^k)_i}\) denote the \(i\)-th dimension in \(\varphi _{KKR} (z^k_{X^k})\). For each client \(k\), \(p_{X^k}^k=\tau (z^k_{X^k})\) represents its normalized knowledge, and \(p^k_{X^k}:=(v_1^k,v_2^k,\ldots ,v_C^k)\). Each dimension \(v_i^k\) is refined as follows:
\begin{equation} {\varphi _{KKR}}{\left(z_{{X^k}}^k\right)_i} = \frac{{\sigma (\tfrac{v_i^k}{{t \cdot v_m^k}})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{{v_j^k}}}{{t \cdot v_m^k}})} }}, \end{equation}
(9)
where \(m\) is the index of the empirically unique maximum value in \(p^k_{X^k}\), i.e., \(v_m^k=\max (p^k_{X^k})\). Besides, \(t\) represents the input scaling parameter of the kernel function \(\sigma (\cdot)\). When \(v_i^k=v_m^k\), \(t\) should satisfy:
\begin{equation} \frac{{\sigma (\tfrac{1}{t})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{{v_j^k}}}{{t \cdot v_m^k}})} }} = T. \end{equation}
(10)
Once \(t\) is solved from Equation (10), we can bring it into Equation (9) and obtain \(\varphi _{KKR}(\cdot)\), as long as the properties mentioned in Section 3.1 are satisfied. It is worth noting that there is no knowledge discrepancy among clients after refining, which means \(\mid dist_{KKR}(\varphi (z_{X^k}^k))-dist_{KKR}(\varphi (z_{X^l}^l)) \mid =0\) for arbitrary clients \(k\) and \(l\) in this case.
To make Equation (10) solvable, we further instantiate the kernel function as follows:
\begin{equation} \sigma (x)=kx+b, \forall k \gt 0,b \gt 0. \end{equation}
(11)
Bringing Equation (11) into Equation (10) and absorbing the constant ratio \(b/k\) into the scaling parameter \(t\) (which leaves the final refined result unchanged), we have:
\begin{equation} \frac{{\frac{1}{t} + 1}}{{\sum \nolimits _{j = 1}^C {(\tfrac{{{v_j^k}}}{{t \cdot v_m^k}} + 1)} }} = T. \end{equation}
(12)
Solving Equation (12), \(t\) is easily obtained as:
\begin{equation} t = \frac{{v_m^k - T}}{{v_m^k \cdot (CT - 1)}}. \end{equation}
(13)
We bring \(t\) into Equation (9) to obtain the refined result of KKR strategy, which can be given by:
\begin{equation} {\varphi _{KKR}}{\left(z_{{X^k}}^k\right)_i} = \frac{{(CT - 1)\cdot {v_i^k} + v_m^k - T}}{C \cdot v_m^k-1} . \end{equation}
(14)
In the Appendix, Theorem 1 proves that the KKR strategy may project the local knowledge into a non-probability space, which indicates that some dimensions in the refined-local knowledge \(\varphi _{KKR} (z^k_{X^k})\) may be negative. Therefore, we further rectify the refined result in Equation (14) as follows:
When all dimensions in the refined-local knowledge are non-negative, i.e., \(\lbrace {\varphi _{KKR}}{(z_{{X^k}}^k)_j} \ge 0, \forall j \in \mathcal {C}\rbrace\), \(\varphi _{KKR}(z_{{X^k}}^k)\) stays unchanged.
When any dimension in \(\varphi _{KKR}(z_{{X^k}}^k)\) is negative, we set the maximum dimension in \(\varphi _{KKR}(z_{{X^k}}^k)\) to \(T\), and let all other dimensions follow the uniform distribution, i.e., each is set to \(\frac{1-T}{C-1}\).
After the above-mentioned rectification, we obtain the final refined-local knowledge \({\varphi _{KKR}}(z^k_{X^k})\) via the KKR strategy. Theorems 3, 5, and 7 prove that the KKR strategy satisfies the three necessary properties mentioned in Section 3.1, as shown in the Appendix.
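A minimal NumPy sketch of the KKR refinement for one sample, assuming the linear kernel of Equation (11) so that the closed form in Equation (14) applies, could look as follows; the function name and the near-uniform guard are illustrative additions.

```python
import numpy as np

def kkr_refine(z_local, T=0.11):
    """Refine one sample's logits so that the peak probability equals T (Equation (14))."""
    v = np.exp(z_local - z_local.max())
    v = v / v.sum()                                  # softmax-normalized knowledge p^k
    C = v.shape[0]
    v_m = v.max()
    if np.isclose(C * v_m, 1.0):                     # near-uniform knowledge: leave it uniform
        return np.full(C, 1.0 / C)
    refined = ((C * T - 1.0) * v + v_m - T) / (C * v_m - 1.0)
    if (refined < 0).any():                          # rectification described above
        refined = np.full(C, (1.0 - T) / (C - 1))
        refined[np.argmax(v)] = T
    return refined
```

After refinement, every client’s output attains the same peak probability \(T\), so \(dist_{KKR}\) of any two refined outputs coincides exactly.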

3.3 Searching-based Knowledge Refinement

This section proposes a searching-based strategy to implement knowledge refinement. We adopt the Shannon entropy to indicate the distribution of normalized local knowledge, since it integrally reflects the amount of information hidden in knowledge. The knowledge distribution measurement function in SKR is defined as \(dist_{SKR}(\cdot) = H (\cdot)\), where \(H(\cdot)\) is the Shannon entropy function. In order to realize the Shannon entropy congruence among clients, we require that the difference between the Shannon entropy of any refined-local knowledge and the target Shannon entropy \(E\) is less than \(\frac{\varepsilon }{2}\).
To this end, we define an auxiliary mapping \(\psi (\theta ;\cdot)\) with parameter \(\theta\) to help search for an available refinement mapping for SKR. We expect that the refined-local knowledge transformed by the parameterized auxiliary mapping can satisfy the boundedness constraint on Shannon entropy differences, and the parameter of the auxiliary mapping can be derived with a root-searching method under our given tolerance error. Taking \(z^k_{X^k}\) as input, we require \(\psi (\theta ;\cdot)\) to keep the numerical relationships in local knowledge unchanged, and its outputs are always in probability space, that is:
\begin{equation} \psi \left(\theta ;z_{{X^k}}^k\right)_i \ge \psi \left(\theta ;z_{{X^k}}^k\right)_j,\forall u_i^k \ge u_j^k, \end{equation}
(15)
\begin{equation} \psi \left(\theta ;z_{{X^k}}^k\right) \in \mathcal {P},\forall z_{{X^k}}^k, \end{equation}
(16)
where \(z^k_{X^k} := (u_1^k,u_2^k, \ldots ,u_C^k)\), and \(\psi (\theta ;z_{{X^k}}^k)_i\) is the \(i\)-th dimension in \(\psi (\theta ;z_{{X^k}}^k)\). Our key idea is to search for an optimal parameter \(\theta ^*\) such that the difference between the refined knowledge’s Shannon entropy and the target Shannon entropy \(E\) is less than \(\frac{\varepsilon }{2}\), which can be expressed as:
\begin{equation} \begin{array}{l} \theta ^{\ast }: = \mathop {\arg \min }\limits _\theta |H({\psi }(\theta ;z^k_{X^k})) - E|\\ s.t. |H\left({\psi }\left(\theta ;z^k_{X^k}\right)\right) - E| \lt \frac{\varepsilon }{2}. \end{array} \end{equation}
(17)
For client \(k\), its \(i\)-th dimension in local knowledge \(z_{X^k}^k\) is transformed via \(\psi (\theta ;\cdot)\), which can be given by
\begin{equation} {\psi }\left(\theta ;z^k_{X^k}\right)_i = \frac{{\exp (\tfrac{{u_i^k}}{\theta })}}{{\sum \nolimits _{j = 1}^C {\exp (\tfrac{{u_j^k}}{\theta })} }}. \end{equation}
(18)
In this way, the searching problem of parameter \(\theta\) can be converted into finding an approximate root of the following equation:
\begin{equation} {H \left({\psi }\left(\theta ;z^k_{X^k}\right)\right) - E}=0, \end{equation}
(19)
which takes \(\frac{\varepsilon }{2}\) as the tolerable error. In the Appendix, Theorem 2 proves that an approximate real root \(\theta ^*\) of Equation (19) can always be found using the Bisection method [5], which is also the optimal parameter that we expect to find. On this basis, let \(\varphi _{SKR}(z^k_{X^k})\) denote the refined-local knowledge of client \(k\) derived by the SKR strategy, defined as:
\begin{equation} {\varphi _{SKR}}\left(z^k_{X^k}\right) = \psi \left({\theta ^*};z^k_{X^k}\right)\!. \end{equation}
(20)
Moreover, Theorems 4, 6, and 8 prove that the SKR strategy satisfies the three necessary properties mentioned in Section 3.1, as shown in the Appendix.
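A NumPy sketch of SKR for one sample is shown below: the Bisection search stops once the entropy is within \(\varepsilon /2\) of the target \(E\), as required by Equation (17). The bracketing interval, the iteration cap, and the fallback return are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log2(p)).sum())

def skr_refine(z_local, E=3.3, eps=1e-2, theta_lo=1e-3, theta_hi=1e3, max_iter=200):
    """Bisection for theta* in Equation (19), then return psi(theta*; z) as in Equation (20)."""
    def psi(theta):
        s = np.exp((z_local - z_local.max()) / theta)   # Equation (18), shifted for stability
        return s / s.sum()
    lo, hi = theta_lo, theta_hi       # H(psi(theta)) grows from ~0 towards log2(C) as theta grows
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        h = entropy(psi(mid))
        if abs(h - E) < eps / 2:      # tolerance eps/2, matching Equation (17)
            break
        if h < E:
            lo = mid                  # entropy too low: flatten by increasing theta
        else:
            hi = mid
    return psi(0.5 * (lo + hi))
```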

3.4 Formal Description of FedDKC

We summarize our proposed proxy-data-free FD algorithm based on distributed knowledge congruence (FedDKC), together with the adopted knowledge refinement strategies, in the algorithm listings. In our algorithm, both the server and clients perform knowledge distillation as well as knowledge generation. At the beginning of round \(r\), each client performs local distillation in parallel, jointly supervised by the global knowledge and local labels (Step 1.1), followed by feature and knowledge extraction (Step 1.2). Then, each client uploads its extracted features, local knowledge, and corresponding labels to the server (Step 1.3). The server receives the uploaded information from clients and refines the incongruent local knowledge (Step 1.4). At this point, we can customize the knowledge refinement strategy, KKR or SKR. The former maps the knowledge according to the rectified refined result of Equation (14) (Step 2.1), while the latter first searches for a parameter according to Equation (17) (Step 2.2) and then refines local knowledge according to Equations (18) and (20) (Step 2.3). After that, feature-driven server-side distillation is conducted, supervised by the refined-local knowledge along with local labels (Step 1.5). After the server finishes distillation, the global knowledge is generated based on the extracted features uploaded by clients (Step 1.6) and is transferred to the corresponding clients (Step 1.7). At this point, the server and clients start the next training round \(r+1\), until model convergence.
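Putting the pieces together, a high-level sketch of one FedDKC round (Steps 1.1–1.7) might look as follows. It reuses the hypothetical helpers local_distillation_loss, server_distillation_step, and kkr_refine/skr_refine sketched earlier; optimizers, batching, and client bookkeeping are deliberately simplified and are assumptions rather than the authors’ implementation.

```python
import torch

def feddkc_round(clients, server_model, server_opt, refine=kkr_refine, beta=1.5):
    """One FedDKC round (Steps 1.1-1.7), one private batch per client for brevity."""
    uploads = []
    for c in clients:                                    # Steps 1.1-1.3: client side
        x, y = c.batch
        loss, h, z_local = local_distillation_loss(
            c.extractor, c.predictor, x, y, c.global_knowledge, beta)
        c.optimizer.zero_grad()
        loss.backward()
        c.optimizer.step()
        uploads.append((c, h, y, z_local))               # features, labels, local knowledge
    for c, h, y, z_local in uploads:                     # Steps 1.4-1.5: server side
        p_refined = torch.stack([torch.as_tensor(refine(z.numpy()), dtype=torch.float32)
                                 for z in z_local])      # KKR or SKR refinement per sample
        server_distillation_step(server_model, server_opt, h, y, p_refined, beta)
        with torch.no_grad():                            # Steps 1.6-1.7: global knowledge back
            c.global_knowledge = server_model(h)
```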

4 Experiments

In this section, we provide experimental results to evaluate the performance of our proposed FedDKC algorithm, especially for verifying the accuracy improvements derived via knowledge refinement. The detailed experiment settings are first described, and then simulation results are provided and analyzed.

4.1 Experimental Setup

(1) Implementation and Datasets: We conduct simulation experiments on a single physical server with multiple NVIDIA GeForce RTX 3090 GPU cards and enough memory space. Our simulation code is implemented based on an open-source FL library [8]. Our training tasks are image classification on three datasets: MNIST [18], CIFAR-10 [17] and CINIC-10 [6]. We split each original dataset into five Non-IID partitions and randomly distribute them to five clients. A hyper-parameter \(\alpha\) controls the degree of data heterogeneity among clients. Figure 4 visualizes the data distributions of clients with different \(\alpha\) over the CIFAR-10 dataset, in which the bubble radius indicates the number of samples of a particular category in a client’s private data. As \(\alpha\) decreases, the data distributions among clients exhibit a higher degree of heterogeneity. In our experiments, we set \(\alpha =\lbrace 0.1, 0.5, 1.0, 3.0\rbrace\). Before feeding data into models, we adopt commonly-used data preprocessing and augmentation strategies, including random cropping, random horizontal flipping, and normalization.
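The paper controls heterogeneity with \(\alpha\) but does not spell out the partition mechanism; the Dirichlet-based label split below is a common convention for \(\alpha\)-controlled Non-IID partitioning and is therefore only an assumption about how such splits are produced.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=5, alpha=0.5, seed=0):
    """Split sample indices across clients with per-class proportions drawn from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]
```

Smaller \(\alpha\) concentrates each class on fewer clients, matching the heterogeneity trend visualized in Figure 4.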
Fig. 4. Visualization of data heterogeneity with different \(\alpha\). Raw statistics are derived from CIFAR-10.
(2) Model Architecture: To realize model heterogeneity, ResNet56 [9] is adopted as the global model on the server, while ResNet2, ResNet4, ResNet8, and ResNet10 are adopted as heterogeneous local models on the five clients. For each local model, the feature extractor consists of the foremost Conv+BatchNorm+ReLU+MaxPool layers, and the subsequent layers form the predictor. In particular, the server-side predictor is the whole global model. The models exhibit remarkable differences in terms of parameter size, memory consumption, and computation cost, as shown in Table 3.
Table 3. Configurations of Models (taking \(32\times 32\times 3\) as input)

Device/Server | Model | Params (K) | Memory (MB) | FLOPs (M)
Client 1 | ResNet2 | 0.63 | 0.31 | 0.5
Client 2 | ResNet4 | 5.18 | 1.28 | 5.12
Client 3 | ResNet8 | 10.34 | 6.93 | 10.29
Client 4/5 | ResNet10 | 9.74 | 2.17 | 9.75
Server | ResNet56 | 577.01 | 33.79 | 87.28
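To make the extractor/predictor split described above concrete, the sketch below partitions a torchvision resnet18 into a stem-only feature extractor and a predictor formed by the remaining layers; resnet18 merely stands in for the custom ResNet2/4/8/10 models, which are not standard torchvision architectures, so this is an analogous illustration rather than the authors’ exact models.

```python
import torch.nn as nn
from torchvision.models import resnet18

def split_local_model(num_classes=10):
    """Return (feature extractor, predictor) following the stem/remainder split."""
    backbone = resnet18(num_classes=num_classes)
    extractor = nn.Sequential(backbone.conv1, backbone.bn1,
                              backbone.relu, backbone.maxpool)
    predictor = nn.Sequential(backbone.layer1, backbone.layer2, backbone.layer3,
                              backbone.layer4, backbone.avgpool,
                              nn.Flatten(), backbone.fc)
    return extractor, predictor
```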
(3) Benchmarks and Criteria: We compare our proposed FedDKC with state-of-the-art FD methods, FedGKT [7] and FCCL [13]. In addition, we measure the performance of the client-side models by the Top-1 and Top-5 accuracy achieved in 100 communication rounds.
(4) Hyperparameters: We adopt the stochastic gradient descent optimizer with batch size 256, learning rate 0.03, and weight decay \(5 \times {10^{ - 4}}\) for all methods. Specifically, we set the hyper-parameter controlling the effect of the knowledge similarity loss to \(\beta =1.5\) in FedGKT and FedDKC. Besides, we leverage FashionMNIST [36] as the public dataset in FCCL, and follow the other hyper-parameter settings in [12]. To ensure a high entropy of the refined-local knowledge in FedDKC, we set \(T\) to a value slightly greater than \(\frac{1}{C}\) and \(E\) to a value slightly smaller than \({\log _2}C\). Precisely, we uniformly take \(T=0.11\) and \(E=3.3\), respectively.
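For the ten-class datasets used here (\(C=10\)), these defaults indeed lie just inside the admissible ranges: \(T=0.11 \gt \frac{1}{C}=0.1\) and \(E=3.3 \lt \log _2 C \approx 3.32\).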

4.2 Results

(1) Performance Overview: Tables 4–6 display the experimental results on the MNIST, CIFAR-10 and CINIC-10 datasets, respectively. Overall, our proposed FedDKC achieves superior performance compared to the benchmark algorithms in terms of average Top-1 and Top-5 accuracy over all datasets. For KKR-FedDKC, the average Top-1 accuracy is improved by 1.31% and 16.18% compared to FedGKT and FCCL, respectively, and the average Top-5 accuracy is improved by 2.55% and 1.16%, respectively. For SKR-FedDKC, the average Top-1 accuracy improvements over FedGKT and FCCL are 2.28% and 16.29%, and the average Top-5 accuracy improvements are 2.09% and 0.70%, respectively.
Table 4. Top-1 and Top-5 Accuracy on MNIST Dataset

Data Hetero. | Metric | Method | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Clients Avg.
\(\alpha =3.0\) | Top-1 Acc. | FedGKT | 30.79 | 84.88 | 89.54 | 92.58 | 83.66 | 76.29
\(\alpha =3.0\) | Top-1 Acc. | FCCL | 12.29 | 12.96 | 52.73 | 31.91 | 47.18 | 31.41
\(\alpha =3.0\) | Top-1 Acc. | KKR-FedDKC | 32.54 | 82.28 | 88.34 | 94.25 | 86.13 | 76.71
\(\alpha =3.0\) | Top-1 Acc. | SKR-FedDKC | 32.29 | 79.22 | 88.98 | 94.50 | 85.46 | 76.09
\(\alpha =3.0\) | Top-5 Acc. | FedGKT | 65.44 | 98.79 | 98.74 | 99.59 | 89.38 | 90.39
\(\alpha =3.0\) | Top-5 Acc. | FCCL | 62.08 | 87.10 | 98.29 | 97.78 | 88.77 | 86.80
\(\alpha =3.0\) | Top-5 Acc. | KKR-FedDKC | 72.70 | 98.79 | 99.28 | 99.64 | 89.69 | 92.02
\(\alpha =3.0\) | Top-5 Acc. | SKR-FedDKC | 71.48 | 97.03 | 98.66 | 99.60 | 89.55 | 91.26
\(\alpha =1.0\) | Top-1 Acc. | FedGKT | 29.94 | 66.62 | 73.11 | 86.78 | 82.07 | 67.70
\(\alpha =1.0\) | Top-1 Acc. | FCCL | 13.39 | 20.62 | 43.10 | 25.74 | 44.10 | 29.39
\(\alpha =1.0\) | Top-1 Acc. | KKR-FedDKC | 35.45 | 62.84 | 77.84 | 87.05 | 87.55 | 70.15
\(\alpha =1.0\) | Top-1 Acc. | SKR-FedDKC | 33.58 | 70.09 | 75.71 | 86.52 | 88.97 | 70.97
\(\alpha =1.0\) | Top-5 Acc. | FedGKT | 70.12 | 78.76 | 87.24 | 90.27 | 97.92 | 84.86
\(\alpha =1.0\) | Top-5 Acc. | FCCL | 69.90 | 76.56 | 87.27 | 87.88 | 96.61 | 83.64
\(\alpha =1.0\) | Top-5 Acc. | KKR-FedDKC | 72.56 | 78.83 | 88.00 | 90.26 | 99.20 | 85.77
\(\alpha =1.0\) | Top-5 Acc. | SKR-FedDKC | 72.60 | 79.12 | 87.34 | 90.28 | 99.39 | 85.75
\(\alpha =0.5\) | Top-1 Acc. | FedGKT | 29.95 | 55.39 | 58.25 | 60.62 | 71.54 | 55.15
\(\alpha =0.5\) | Top-1 Acc. | FCCL | 16.42 | 16.89 | 41.48 | 25.92 | 34.25 | 26.99
\(\alpha =0.5\) | Top-1 Acc. | KKR-FedDKC | 30.08 | 55.92 | 68.58 | 69.10 | 79.22 | 60.58
\(\alpha =0.5\) | Top-1 Acc. | SKR-FedDKC | 29.82 | 53.69 | 67.41 | 69.46 | 78.80 | 59.84
\(\alpha =0.5\) | Top-5 Acc. | FedGKT | 60.28 | 73.22 | 89.09 | 79.96 | 89.36 | 78.38
\(\alpha =0.5\) | Top-5 Acc. | FCCL | 70.89 | 83.17 | 91.60 | 84.52 | 86.93 | 83.42
\(\alpha =0.5\) | Top-5 Acc. | KKR-FedDKC | 62.82 | 76.00 | 89.82 | 81.84 | 89.24 | 79.94
\(\alpha =0.5\) | Top-5 Acc. | SKR-FedDKC | 62.52 | 72.90 | 89.62 | 79.59 | 89.28 | 78.78
\(\alpha =0.1\) | Top-1 Acc. | FedGKT | 20.72 | 28.77 | 22.36 | 18.91 | 21.02 | 22.36
\(\alpha =0.1\) | Top-1 Acc. | FCCL | 17.92 | 14.62 | 26.25 | 19.93 | 22.44 | 20.23
\(\alpha =0.1\) | Top-1 Acc. | KKR-FedDKC | 21.34 | 28.86 | 25.44 | 18.92 | 28.13 | 24.54
\(\alpha =0.1\) | Top-1 Acc. | SKR-FedDKC | 21.57 | 29.13 | 22.94 | 18.93 | 24.94 | 23.50
\(\alpha =0.1\) | Top-5 Acc. | FedGKT | 54.79 | 49.74 | 79.43 | 49.20 | 50.86 | 56.80
\(\alpha =0.1\) | Top-5 Acc. | FCCL | 52.71 | 51.03 | 54.49 | 51.50 | 52.70 | 52.49
\(\alpha =0.1\) | Top-5 Acc. | KKR-FedDKC | 51.27 | 49.91 | 49.56 | 47.02 | 51.93 | 49.94
\(\alpha =0.1\) | Top-5 Acc. | SKR-FedDKC | 52.36 | 48.11 | 51.35 | 49.91 | 47.07 | 49.76
The bold numbers represent the best accuracy, and the underlined numbers are the second-best accuracy. The same as below.
Table 5. Top-1 and Top-5 Accuracy on CIFAR-10 Dataset

Data Hetero. | Metric | Method | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Clients Avg.
\(\alpha =3.0\) | Top-1 Acc. | FedGKT | 27.43 | 42.69 | 48.11 | 47.42 | 51.98 | 43.53
\(\alpha =3.0\) | Top-1 Acc. | FCCL | 19.38 | 20.87 | 31.35 | 29.93 | 32.57 | 26.82
\(\alpha =3.0\) | Top-1 Acc. | KKR-FedDKC | 30.29 | 42.64 | 51.04 | 45.10 | 49.26 | 43.67
\(\alpha =3.0\) | Top-1 Acc. | SKR-FedDKC | 30.73 | 44.25 | 51.98 | 51.43 | 50.86 | 45.85
\(\alpha =3.0\) | Top-5 Acc. | FedGKT | 77.86 | 76.26 | 82.21 | 89.53 | 85.56 | 82.28
\(\alpha =3.0\) | Top-5 Acc. | FCCL | 70.15 | 68.93 | 83.90 | 82.30 | 86.09 | 78.27
\(\alpha =3.0\) | Top-5 Acc. | KKR-FedDKC | 79.35 | 79.43 | 89.14 | 90.99 | 90.70 | 85.92
\(\alpha =3.0\) | Top-5 Acc. | SKR-FedDKC | 79.05 | 79.39 | 92.12 | 93.25 | 90.64 | 86.89
\(\alpha =1.0\) | Top-1 Acc. | FedGKT | 21.40 | 36.53 | 37.53 | 39.87 | 35.90 | 34.25
\(\alpha =1.0\) | Top-1 Acc. | FCCL | 20.20 | 22.74 | 27.67 | 28.04 | 22.07 | 24.14
\(\alpha =1.0\) | Top-1 Acc. | KKR-FedDKC | 26.79 | 37.27 | 40.85 | 38.58 | 36.70 | 36.04
\(\alpha =1.0\) | Top-1 Acc. | SKR-FedDKC | 27.27 | 39.53 | 48.07 | 38.77 | 37.08 | 38.14
\(\alpha =1.0\) | Top-5 Acc. | FedGKT | 64.97 | 78.69 | 77.40 | 71.18 | 59.54 | 70.36
\(\alpha =1.0\) | Top-5 Acc. | FCCL | 67.54 | 77.89 | 80.33 | 73.65 | 63.95 | 72.67
\(\alpha =1.0\) | Top-5 Acc. | KKR-FedDKC | 75.09 | 83.18 | 88.52 | 79.16 | 66.74 | 78.54
\(\alpha =1.0\) | Top-5 Acc. | SKR-FedDKC | 68.57 | 83.33 | 88.51 | 77.88 | 63.16 | 76.29
\(\alpha =0.5\) | Top-1 Acc. | FedGKT | 24.23 | 28.67 | 37.33 | 46.06 | 35.16 | 34.29
\(\alpha =0.5\) | Top-1 Acc. | FCCL | 16.68 | 24.13 | 23.82 | 29.04 | 28.24 | 24.38
\(\alpha =0.5\) | Top-1 Acc. | KKR-FedDKC | 24.12 | 30.79 | 37.97 | 46.84 | 37.31 | 35.41
\(\alpha =0.5\) | Top-1 Acc. | SKR-FedDKC | 24.09 | 29.10 | 36.46 | 47.97 | 38.50 | 35.22
\(\alpha =0.5\) | Top-5 Acc. | FedGKT | 55.60 | 63.42 | 59.82 | 75.81 | 65.71 | 64.07
\(\alpha =0.5\) | Top-5 Acc. | FCCL | 55.30 | 67.29 | 67.89 | 76.69 | 71.96 | 67.83
\(\alpha =0.5\) | Top-5 Acc. | KKR-FedDKC | 56.83 | 69.34 | 65.11 | 76.10 | 72.52 | 67.98
\(\alpha =0.5\) | Top-5 Acc. | SKR-FedDKC | 56.63 | 67.82 | 62.70 | 77.23 | 71.88 | 67.25
\(\alpha =0.1\) | Top-1 Acc. | FedGKT | 20.85 | 25.38 | 34.45 | 25.11 | 30.94 | 27.35
\(\alpha =0.1\) | Top-1 Acc. | FCCL | 17.13 | 20.28 | 31.69 | 17.70 | 20.69 | 21.50
\(\alpha =0.1\) | Top-1 Acc. | KKR-FedDKC | 21.24 | 27.43 | 35.40 | 22.68 | 31.10 | 27.57
\(\alpha =0.1\) | Top-1 Acc. | SKR-FedDKC | 21.37 | 26.80 | 36.37 | 23.26 | 35.51 | 28.66
\(\alpha =0.1\) | Top-5 Acc. | FedGKT | 50.67 | 50.00 | 50.07 | 65.49 | 50.22 | 53.29
\(\alpha =0.1\) | Top-5 Acc. | FCCL | 49.88 | 63.60 | 63.07 | 60.99 | 54.13 | 58.33
\(\alpha =0.1\) | Top-5 Acc. | KKR-FedDKC | 49.05 | 50.01 | 52.19 | 58.08 | 52.05 | 52.28
\(\alpha =0.1\) | Top-5 Acc. | SKR-FedDKC | 49.99 | 51.40 | 61.48 | 60.43 | 58.92 | 56.44
Table 6. Top-1 and Top-5 Accuracy on CINIC-10 Dataset

Data Hetero. | Metric | Method | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Clients Avg.
\(\alpha =3.0\) | Top-1 Acc. | FedGKT | 22.84 | 33.37 | 34.83 | 32.23 | 35.55 | 31.76
\(\alpha =3.0\) | Top-1 Acc. | FCCL | 21.14 | 24.79 | 31.50 | 20.43 | 32.56 | 26.08
\(\alpha =3.0\) | Top-1 Acc. | KKR-FedDKC | 25.79 | 37.86 | 34.93 | 39.84 | 37.73 | 35.23
\(\alpha =3.0\) | Top-1 Acc. | SKR-FedDKC | 25.87 | 38.28 | 35.51 | 37.85 | 38.28 | 35.16
\(\alpha =3.0\) | Top-5 Acc. | FedGKT | 68.49 | 73.32 | 62.92 | 80.58 | 77.94 | 72.65
\(\alpha =3.0\) | Top-5 Acc. | FCCL | 74.44 | 76.56 | 69.94 | 75.94 | 81.74 | 75.72
\(\alpha =3.0\) | Top-5 Acc. | KKR-FedDKC | 78.47 | 82.17 | 64.72 | 82.80 | 86.22 | 78.88
\(\alpha =3.0\) | Top-5 Acc. | SKR-FedDKC | 77.68 | 80.76 | 65.07 | 81.47 | 86.02 | 78.20
\(\alpha =1.0\) | Top-1 Acc. | FedGKT | 21.08 | 27.59 | 33.50 | 22.40 | 31.97 | 27.31
\(\alpha =1.0\) | Top-1 Acc. | FCCL | 19.31 | 22.44 | 30.84 | 20.40 | 24.39 | 23.48
\(\alpha =1.0\) | Top-1 Acc. | KKR-FedDKC | 23.72 | 31.66 | 34.98 | 28.56 | 37.62 | 31.31
\(\alpha =1.0\) | Top-1 Acc. | SKR-FedDKC | 22.73 | 29.42 | 34.31 | 27.76 | 36.05 | 30.05
\(\alpha =1.0\) | Top-5 Acc. | FedGKT | 60.36 | 65.47 | 61.13 | 71.66 | 71.04 | 65.93
\(\alpha =1.0\) | Top-5 Acc. | FCCL | 66.32 | 67.42 | 68.75 | 70.07 | 69.80 | 68.47
\(\alpha =1.0\) | Top-5 Acc. | KKR-FedDKC | 67.18 | 69.11 | 64.96 | 76.54 | 74.45 | 70.45
\(\alpha =1.0\) | Top-5 Acc. | SKR-FedDKC | 64.81 | 68.08 | 63.79 | 75.10 | 72.93 | 68.94
\(\alpha =0.5\) | Top-1 Acc. | FedGKT | 14.95 | 29.70 | 24.95 | 28.86 | 32.91 | 26.27
\(\alpha =0.5\) | Top-1 Acc. | FCCL | 17.68 | 23.85 | 27.56 | 25.05 | 25.81 | 23.99
\(\alpha =0.5\) | Top-1 Acc. | KKR-FedDKC | 16.24 | 32.05 | 26.88 | 30.05 | 37.77 | 28.60
\(\alpha =0.5\) | Top-1 Acc. | SKR-FedDKC | 16.02 | 31.18 | 26.50 | 30.63 | 36.14 | 28.09
\(\alpha =0.5\) | Top-5 Acc. | FedGKT | 58.70 | 64.42 | 64.94 | 54.39 | 71.95 | 62.88
\(\alpha =0.5\) | Top-5 Acc. | FCCL | 61.45 | 66.63 | 74.09 | 62.20 | 71.72 | 67.22
\(\alpha =0.5\) | Top-5 Acc. | KKR-FedDKC | 60.44 | 69.08 | 70.04 | 55.86 | 74.55 | 65.99
\(\alpha =0.5\) | Top-5 Acc. | SKR-FedDKC | 59.97 | 67.22 | 69.38 | 55.98 | 72.21 | 64.95
\(\alpha =0.1\) | Top-1 Acc. | FedGKT | 21.82 | 16.79 | 21.19 | 20.83 | 19.74 | 20.07
\(\alpha =0.1\) | Top-1 Acc. | FCCL | 21.12 | 17.32 | 20.69 | 19.34 | 20.66 | 19.83
\(\alpha =0.1\) | Top-1 Acc. | KKR-FedDKC | 23.26 | 23.73 | 22.83 | 21.57 | 21.42 | 22.56
\(\alpha =0.1\) | Top-1 Acc. | SKR-FedDKC | 23.08 | 23.47 | 22.99 | 20.96 | 20.56 | 22.21
\(\alpha =0.1\) | Top-5 Acc. | FedGKT | 50.26 | 49.24 | 53.36 | 49.88 | 50.08 | 50.56
\(\alpha =0.1\) | Top-5 Acc. | FCCL | 51.99 | 50.48 | 64.92 | 53.74 | 50.44 | 54.31
\(\alpha =0.1\) | Top-5 Acc. | KKR-FedDKC | 55.38 | 57.15 | 63.91 | 50.13 | 49.97 | 55.31
\(\alpha =0.1\) | Top-5 Acc. | SKR-FedDKC | 53.85 | 56.63 | 53.66 | 50.55 | 50.09 | 52.96
Furthermore, we conduct comparisons on three datasets with four degrees of data heterogeneity, including a total of 120 groups of comparisons with two metrics. Compared with the best performance among FedGKT and FCCL, KKR-FedDKC and SKR-FedDKC achieve accuracy improvements in 73 and 68 groups, respectively. Overall, our proposed FedDKC outperforms all considered benchmarks in most of the comparisons. Hence, we can conclude that our methods are generally applicable to improve the performance of individual clients.
(2) Performance on Heterogeneous Data: Figure 5 compares the average accuracies of FedGKT, KKR-FedDKC, and SKR-FedDKC on different datasets under diverse degrees of data heterogeneity. As displayed, the red and green bubbles always lie to the upper right of the blue bubbles for bubbles of the same radius. Hence, we can conclude that FedDKC effectively improves the general performance of clients compared with FedGKT, regardless of data heterogeneity.
Fig. 5. Average Top-1 accuracy on three datasets with various degrees of data heterogeneity.
(3) Performance on Heterogeneous Models: Figure 6 shows the comparison of the average Top-1 accuracy of local models trained with FedDKC and FedGKT on three datasets, categorized by model architectures. We can determine that FedDKC is generally effective for local models with all kinds of architectures. The reason is that FedDKC can mitigate the local knowledge discrepancy during server-side distillation via KKR or SKR strategy, and thus can capture more globally-generalized representations, which will benefit client-side local distillation in turn.
Fig. 6. The average Top-1 accuracy of local models with different architectures evaluated on three datasets.
(4) Communication Robustness: Figures 7 and 8 show the learning curves on different degrees of data heterogeneity and different local models, respectively. From Figure 7, we observe that FedDKC can consistently exhibit better performance than FedGKT under various data heterogeneity settings with the same number of communication rounds. Figure 8 further confirms that FedDKC can achieve faster convergence for all heterogeneous models on clients, regardless of the knowledge refinement strategy. In general, compared with FedGKT, FedDKC does not increase any additional communication overhead in a single round, and can achieve faster convergence under various degrees of data heterogeneity and model architectures.
Fig. 7. Learning curves of ResNet4 on different degrees of data heterogeneity over CINIC-10 dataset.
Fig. 8. Learning curves on local models with different architectures over CINIC-10 dataset, taking \(\alpha\)=0.5. Results of ResNet10 are obtained from Client 4.
(5) Performance on Larger Number of Clients: We further conduct experiments on more clients to evaluate the effectiveness of FedDKC in scenarios with a larger number of clients. Specifically, we fix the hyper-parameter \(\alpha =1.0\) on the CIFAR-10 dataset, and vary the number of clients \(K \in \lbrace 5,10,20,50\rbrace\). Clients whose index modulo 5 equals 0\(\sim\)4 adopt the model architectures of Clients 1\(\sim\)5 described in Table 3, respectively, and other settings are kept as described in Section 4.1. We thereby obtain the performance of FedDKC-KKR, FedDKC-SKR, and FedGKT with different numbers of clients in Table 7. As displayed, although the accuracy of all methods decreases as the number of clients increases, FedDKC consistently outperforms FedGKT, indicating that our proposed methods can be adapted to larger-scale FL scenarios.
Table 7. Top-1 and Top-5 Accuracy with Different Number of Clients

Avg. Top-1 Acc. (%):
Method | 5 Clients | 10 Clients | 20 Clients | 50 Clients | Avg.
FedGKT | 34.25 | 27.29 | 26.07 | 21.60 | 29.20
KKR-FedDKC | 36.04 | 29.13 | 26.47 | 22.83 | 30.55
SKR-FedDKC | 38.14 | 28.99 | 25.83 | 22.94 | 30.99

Avg. Top-5 Acc. (%):
Method | 5 Clients | 10 Clients | 20 Clients | 50 Clients | Avg.
FedGKT | 70.36 | 63.27 | 65.69 | 61.31 | 63.42
KKR-FedDKC | 78.54 | 68.04 | 65.04 | 63.74 | 65.66
SKR-FedDKC | 76.29 | 66.79 | 64.16 | 63.38 | 64.78
Results are based on CIFAR-10 dataset, taking \(\alpha =1.0\).

5 Discussions

5.1 Customizing Kernel Functions for KKR

This section provides further guidance for customizing kernel functions in KKR, which can support more subtle and controllable knowledge refinement. We give the relaxed conditions for feasible kernel functions \(\sigma (\cdot)\) in KKR, which are as follows:
Non-direct-proportion
Continuous and monotonically increasing
Function value is consistently positive
Parameter \(t\) is solvable in Equation (10)
In the Appendix, Theorem 9 proves that the necessary properties of the refinement mapping \(\varphi (\cdot)\) mentioned in Section 3.1 can be satisfied as long as the above relaxation conditions are met. Up to this point, the relaxation conditions provide sufficient support for the design of feasible kernel functions: every satisfactory kernel function can realize knowledge congruence. On this basis, kernel functions can be flexibly customized to meet finer distribution requirements, e.g., adopting a convex kernel function to diminish the differences between classes that are not preferred by the softmax-normalized knowledge, or adopting a concave kernel function to strengthen the correlation between the preferred class and the first alternative class.

5.2 Conversion of KKR to SKR

This section discusses the feasibility of converting KKR into SKR. We observe that the \(t\) in Equation (10) can be derived by a searching-based method, just as \(\theta ^*\) in Equation (17) is searched out in Section 3.3. We define an auxiliary mapping \(\rho (t;\cdot)\) with unknown parameter \(t\). \(\rho (t;z_{{X^k}}^k)_i\) denotes the \(i\)-th dimension in \(\rho (t;z_{{X^k}}^k)\), which can be expressed as:
\begin{equation} \rho \left(t;z^k_{X^k}\right)_i = \frac{{\sigma (\tfrac{{v_i^k}}{{t \cdot v_m^k}})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})} }}. \end{equation}
(21)
Then the optimal \(t^*\) is to be searched such that the difference between the refined-local knowledge’s peak probability and the target peak probability is less than a tolerable upper bound \(\frac{\varepsilon }{2}\), which can be given by:
\begin{equation} \begin{array}{*{20}{l}} {{t^*}: = \mathop {\arg \min }\limits _t \left|dist_{KKR} \left(\rho \left(t ;z^k_{X^k}\right)\right) - T\right|}\\ {s.t.\left|dist_{KKR} \left(\rho \left(t ;z^k_{X^k}\right)\right) - T\right| \lt \frac{\varepsilon }{2}}. \end{array} \end{equation}
(22)
After gaining \(t^\ast\), we let:
\begin{equation} {\varphi _{KKR}}\left(z^k_{X^k}\right) = \rho \left({t^*};z^k_{X^k}\right)\!. \end{equation}
(23)
So far, the final \({\varphi _{KKR}}(\cdot)\) is obtained. Note that when the Bisection method [5] is adopted, a sufficient condition for a feasible \(t^*\) to exist is:
\begin{equation} h\left(z_{{X^k}}^k;{\epsilon _1}\right) \cdot h\left(z_{{X^k}}^k;{\epsilon _2}\right) \lt 0,\exists {\epsilon _1},{\epsilon _2}, \end{equation}
(24)
where
\begin{equation} h\left(z_{{X^k}}^k;x\right) = dis{t_{KKR}}\left(\rho \left(x;z_{{X^k}}^k\right)\right) - T, \end{equation}
(25)
which is practical to satisfy. Up to this point, any kernel function satisfying Equation (24) can be applied to the KKR-to-SKR conversion strategy. With this conversion, KKR can still work even when \(t\) cannot be solved analytically from Equation (10), which further promotes the customizability of kernel functions.
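As an illustration of this conversion, the sketch below bisects over \(t\) for a user-chosen kernel \(\sigma\); the concave kernel, the bracketing interval, and the tolerance are illustrative choices, and the bracket is assumed to satisfy the condition in Equation (24).

```python
import numpy as np

def kkr_via_search(z_local, sigma=lambda x: np.sqrt(x) + 1.0,
                   T=0.11, eps=1e-3, t_lo=1e-3, t_hi=1e3, max_iter=200):
    """Search t* as in Equation (22) and return rho(t*; z) as in Equation (23)."""
    v = np.exp(z_local - z_local.max())
    v = v / v.sum()                               # softmax-normalized knowledge
    v_m = v.max()
    def rho(t):                                   # Equation (21)
        s = sigma(v / (t * v_m))
        return s / s.sum()
    lo, hi = t_lo, t_hi   # larger t flattens rho, so its peak probability decreases in t
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        peak = rho(mid).max()
        if abs(peak - T) < eps / 2:
            break
        if peak > T:
            lo = mid                              # peak too high: increase t
        else:
            hi = mid
    return rho(0.5 * (lo + hi))
```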

5.3 Superiority and Limitations of KKR and SKR

This section provides an analysis of the superiority and limitations of KKR and SKR. Even though Section 4.2 empirically demonstrates that KKR outperforms SKR in general, the results are severely constrained by the experimental environment and the knowledge distribution metrics adopted by the respective strategies. When the knowledge refinement strategies are applied to new data environments or improved knowledge distribution metrics are adopted, the opposite conclusion might be drawn.
According to our argument, KKR can only handle simple target knowledge distributions because it must meet the crucial requirement that Equation (10) has a solution that can be worked out. The analytical solution to Equation (10) is frequently unavailable when complex kernel functions are used to satisfy the structured requirements of the target knowledge distribution (in which case some KKR problems can only be solved by converting them to an SKR problem, as mentioned in Section 5.2); as a result, KKR is not practical in such circumstances. In contrast, SKR only requires that Equation (19) has a real root, which is significantly easier to satisfy than the condition on Equation (10) required by KKR. As a result, SKR outperforms KKR in cases that require a complex target knowledge distribution.
It is also worth noting that both SKR and KKR introduce computational overhead on the server side during the global distillation process, where the computation complexity of KKR is linear and that of SKR is logarithmic (depending on the number of iterations of the parameter search in the Bisection method). Empirically, the computation costs of KKR and SKR are affordable, since they are much lower than that of server distillation and are borne by the computationally powerful server.
In summary, KKR is more accurate in our empirical experiments, while SKR enables more flexible setups for target knowledge distribution. In addition, the additional computational overhead introduced on the server side by KKR and SKR is acceptable.

6 Related Work

6.1 Knowledge Distillation

Knowledge distillation (KD) is a teacher-student learning paradigm that transfers the teacher model’s knowledge to the student model through distillation. KD has attracted much attention in ensemble-model-based aggregation [11] and cumbersome model compression [10, 21, 26, 28, 31, 32]. Existing KD methods [11, 26] demonstrate that the student model can learn the data-to-label representation from the teacher model. Subsequent work [1] extends the distillation technique to exploit the potential for collaboratively optimizing a collection of models [30]. On this foundation, KD is introduced into FL to realize collaborative training between the server and clients. Such a distillation-based FL framework is named federated distillation (FD).

6.2 Federated Distillation

Typical FD methods [3, 14, 20, 33] exchange model outputs instead of model parameters among clients and the server. The server aggregates the representation of knowledge from clients and guides clients to converge toward global generalization. These methods, however, require a proxy dataset without exception, which is often not available during the FD process. Recent works are devoted to dispensing with proxy datasets by exchanging additional information, such as global models [19, 38], generators [39], hash values [35], or extracted features [7, 34]. Parameter-decentralization-based approaches [19, 38] achieve local distillation by broadcasting the model parameters of the server to clients, where clients treat the downloaded global model from the server as the teacher model and conduct local knowledge distillation based on private data. The generator-passing-based approach [39] uses a lightweight generator to integrate information from clients, which is subsequently broadcast to clients for local training by utilizing the learned knowledge for constrained optimization. Feature-driven approaches [7, 34] additionally upload client-side extracted features before global distillation and global knowledge generation. However, none of these approaches considers that fitting local knowledge with biased distributions negatively affects the global representations under the premise of heterogeneous models among clients.

7 Conclusion

This paper proposes a proxy-data-free federated distillation algorithm based on distributed knowledge congruence (FedDKC). In our algorithm, incongruent local knowledge from distributed clients is refined to satisfy an approximately-similar distribution without adding any communication burden. Furthermore, we design the KKR and SKR strategies to achieve distributed knowledge congruence considering two kinds of knowledge discrepancies: the peak probability and the Shannon entropy of normalized local knowledge. As far as we know, this paper is the first work to boost training accuracy while maintaining communication efficiency based on distributed knowledge congruence in proxy-data-free federated distillation. Experiments demonstrate that FedDKC effectively improves the training accuracy of individual clients and significantly outperforms related state-of-the-art methods in various heterogeneous settings.

Acknowledgments

We thank Prof. Lichao Sun from Lehigh University, USA, Prof. Hong Qi from Jilin University, China, Di Hou from National University of Singapore, Singapore, Xujin Li, Hui Jiang, Zhiliu Fu, Runhan Li, Hao Tan and Prof. Zhongcheng Li from Institute of Computing Technology, Chinese Academy of Sciences, and Meicheng Liao from Shanghai Jiaotong University, China, for inspiring suggestions.

Appendix

A.1 Mapping Negativity of the KKR Strategy without Rectification

Theorem 1.
There exists \(\mathbf {{z}_{X^{k*}}^{k*}}\) such that \(\mathbf {{\varphi _{KKR}}({z}_{X^{k*}}^{k*})_i \lt 0}, \exists i \in \mathcal {C}\).
Proof.
Empirically, \(p_{{X^k}}^k\) is not a uniform distribution, so there would be:
\begin{equation} v_m^k \gt \frac{1}{C}, \end{equation}
(26)
and thereout,
\begin{equation} C \cdot v_m^k - 1\gt 0. \end{equation}
(27)
Also, since \(T\) is the hyper-parameter that controls the peak probability of normalized knowledge, we empirically set \(T\gt 0.1\) with the number of classes \(C\ge 10\). Hence, we have:
\begin{equation} CT - 1\gt 0. \end{equation}
(28)
We let:
\begin{equation} \begin{aligned}&\; \; \; \; \;{\varphi _{KKR}}\left(z^k_{X^k}\right)_i\\ &= \frac{{(CT - 1) \cdot v_i^k + v_m^k - T}}{{C \cdot v_m^k - 1}}\\ &= \frac{{(CT - 1) \cdot \left(v_i^k + \frac{{v_m^k - T}}{{CT - 1}}\right)}}{{C \cdot v_m^k - 1}}. \end{aligned} \end{equation}
(29)
Accordingly, based on Equations (28) and (27), we can infer that when:
\begin{equation} v_i^{k*} + \frac{{v_m^{k*} - T}}{{CT - 1}} \lt 0, \end{equation}
(30)
there would be \({\varphi _{KKR}}(z_{X^{k*}}^{k*})_i \lt 0\), and Equation (30) holds when \(v_i^{k*} \rightarrow 0 \wedge v_m^{k*} \lt T\).
Theorem 1 is proved. □

A.2 Root Finding in the SKR Strategy

Theorem 2.
The equation \(\mathbf {H(\psi (\theta ;z^k_{X^k})) - E = 0}\) with unknown variable \(\theta\) has a real root, and the root can be figured out using the Bisection method [5].
Proof.
Since \(E\) is the hyper-parameter that indicates the target entropy of the refined-local knowledge, its empirical value should be taken between the Shannon entropy of the normalized local knowledge subject to a concentrated distribution and that subject to a uniform distribution, which means:
\begin{equation} (C - 1) \cdot \mathop {\lim }\limits _{p \rightarrow 0^+} (- p{\log _2}p) + \mathop {\lim }\limits _{q \rightarrow 1^-} (- q{\log _2}q) \lt E \lt C \cdot \left(- \frac{1}{C}{\log _2}\frac{1}{C}\right)\!, \end{equation}
(31)
and that is:
\begin{equation} 0 \lt E \lt {\log _2}C. \end{equation}
(32)
We define a continuous function \(g(z^k_{X^k};\cdot)\) as follows:
\begin{equation} g\left(z^k_{X^k};\theta \right) = H\left(\psi \left(\theta ;z^k_{X^k}\right)\right) - E. \end{equation}
(33)
On the one hand, we have:
\begin{equation} \begin{aligned}& \; \; \; \; {\mathop {\lim }\limits _{\theta \rightarrow 0^+} g\left(z^k_{X^k};\theta \right)}\\ &{ = \mathop {\lim }\limits _{\theta \rightarrow 0^+} \sum \limits _{i = 1}^C { - \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\psi \left(\theta ;z^k_{X^k}\right)_i} - E}\\ &{ = \sum \limits _{i = 1}^C { - \mathop {\lim }\limits _{\theta \rightarrow 0^+} \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\psi \left(\theta ;z^k_{X^k}\right)_i} - E}, \end{aligned} \end{equation}
(34)
in which
\begin{equation} \begin{aligned}& \; \; \; \; \mathop {\lim }\limits _{\theta \rightarrow 0^+} \psi \left(\theta ;z^k_{X^k}\right)_i\\ & = \mathop {\lim }\limits _{\theta \rightarrow 0^+} \frac{{\exp \left(\frac{{u_i^k}}{\theta }\right)}}{{\sum \nolimits _{j = 1}^C {\exp \left(\frac{{u_j^k}}{\theta }\right)} }}\\ & = \mathop {\lim }\limits _{\theta \rightarrow 0^+} \frac{{\exp \left(\frac{{u_i^k - u_m^k}}{\theta }\right)}}{{\sum \nolimits _{j = 1}^C {\exp \left(\frac{{u_j^k - u_m^k}}{\theta }\right)} }}\\ & = 1 - \delta (i), \end{aligned} \end{equation}
(35)
where
\begin{equation} \delta ({x}) = \left\lbrace \begin{array}{*{20}{l}} {0,x = m}\\ {1,x \ne m} \end{array} \right. . \end{equation}
(36)
Therefore, we have:
\begin{equation} \begin{aligned}& \; \; \; {\mathop {\lim }\limits _{\theta \rightarrow 0^+} g\left(z^k_{X^k};\theta \right)}\\ &{ = \sum \limits _{i = 1}^C { - \mathop {\lim }\limits _{\theta \rightarrow 0^+} \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\psi \left(\theta ;z^k_{X^k}\right)_i} - E}\\ &{ = \sum \limits _{i = 1}^C {\delta (i) \cdot \left(- \mathop {\lim }\limits _{x \rightarrow 0^+} x \cdot {{\log }_2}x \right)} - \mathop {\lim }\limits _{x \rightarrow 1^-} x \cdot {{\log }_2}x - E}\\ &= - E\\ &\lt 0. \end{aligned} \end{equation}
(37)
Due to the sign-preserving property of continuous functions, we can infer that there exists \(0\lt {\epsilon } \lt 1\) such that:
\begin{equation} g\left(z^k_{X^k};\theta \right) \lt 0,\forall \theta \in (0,{\epsilon }], \end{equation}
(38)
and hence,
\begin{equation} g \left(z^k_{X^k};\frac{{\epsilon }}{2}\right)\lt 0. \end{equation}
(39)
On the other hand:
\begin{equation} \begin{aligned}& \; \; \; \; {\mathop {\lim }\limits _{\theta \rightarrow +\infty } g\left(z^k_{X^k};\theta \right)}\\ & { = \sum \limits _{i = 1}^C { - \mathop {\lim }\limits _{\theta \rightarrow +\infty } \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\psi \left(\theta ;z^k_{X^k}\right)_i} - E}, \end{aligned} \end{equation}
(40)
where
\begin{equation} \begin{aligned}& \; \; \; \; \; {\mathop {\lim }\limits _{\theta \rightarrow +\infty } \psi \left(\theta ;z^k_{X^k}\right)_i}\\ & = \mathop {\lim }\limits _{\theta \rightarrow +\infty } \frac{{\exp \left(\frac{{u_i^k}}{\theta }\right)}}{{\sum \nolimits _{j = 1}^C {\exp \left(\frac{{u_j^k}}{\theta }\right)} }}\\ & = \frac{{\mathop {\lim }\nolimits _{\theta \rightarrow +\infty } \exp \left(\frac{{u_i^k}}{\theta }\right)}}{{\sum \nolimits _{j = 1}^C {\mathop {\lim }\nolimits _{\theta \rightarrow +\infty } \exp \left(\frac{{u_j^k}}{\theta }\right)} }}\\ & = \frac{1}{C}. \end{aligned} \end{equation}
(41)
As a consequence,
\begin{equation} \begin{aligned}& \; \; \; \; {\mathop {\lim }\limits _{\theta \rightarrow +\infty } g\left(z^k_{X^k};\theta \right)}\\ & { = \sum \limits _{i = 1}^C { - \mathop {\lim }\limits _{\theta \rightarrow +\infty } \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\psi \left(\theta ;z^k_{X^k}\right)_i} - E}\\ & { = \sum \limits _{i = 1}^C { - \mathop {\lim }\limits _{\theta \rightarrow +\infty } \psi \left(\theta ;z^k_{X^k}\right)_i \cdot {{\log }_2}\mathop {\lim }\limits _{\theta \rightarrow +\infty } \psi \left(\theta ;z^k_{X^k}\right)_i} - E}\\ & { = C \cdot \left(- \frac{1}{C} \cdot {{\log }_2}\frac{1}{C}\right) - E}\\ & {= {\log _2}C - E}\\ & \gt 0. \end{aligned} \end{equation}
(42)
According to the definition of limit, we can infer that for the positive real number \({\log _2}C - E \in {\mathbf {R^{+}} }\), there exists \(M \in \mathbf {R^{+}}\), such that:
\begin{equation} \left|g \left(z^k_{X^k};\theta \right) - ({\log _2}C - E)\right| \lt {\log _2}C - E,\forall \theta \gt M. \end{equation}
(43)
Since \({e^M} \gt M\), we have:
\begin{equation} \left|g\left(z^k_{X^k};\theta \right) - ({\log _2}C - E)\right| \lt {\log _2}C - E,\forall \theta \gt {e^M}, \end{equation}
(44)
and that is:
\begin{equation} 0 \lt g\left(z^k_{X^k};\theta \right) \lt 2{\log _2}C - 2E,\forall \theta \gt {e^M}, \end{equation}
(45)
and then, we have:
\begin{equation} g\left(z^k_{X^k};2{e^M}\right) \gt 0. \end{equation}
(46)
In summary, there exists \(\epsilon \in (0,1)\) and \(M \in \mathbf {R^{+}}\) such that:
\begin{equation} g\left(z^k_{X^k};\frac{\epsilon }{2}\right) \cdot g\left(z^k_{X^k};2{e^M}\right) \lt 0, \end{equation}
(47)
in which \(\frac{\epsilon }{2}\lt 1\lt 2{e^M}\). Hence, according to the zero-point existence theorem, \(g(z^k_{X^k};\cdot)\) must have a zero in the interval \((\frac{\epsilon }{2}, 2{e^M})\), and this zero is also a root of the equation \({H(\psi (\theta ;z^k_{X^k})) - E = 0}\).
Taking \((\frac{\epsilon }{2}, 2{e^M})\) as the input interval and \(\frac{\varepsilon }{2}\) as the tolerable error, an approximate real root can be found by the Bisection method [5]. Empirically, when a searching lower bound close to zero and a reasonably large searching upper bound are taken, we can always obtain an available \(\theta ^*\) as the approximate real root.
Theorem 2 is proved. □
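As a companion to Theorem 2, the sketch below shows one way the root search for the SKR temperature \(\theta ^*\) can be realized with a plain bisection; it is a minimal illustration under assumed helper names (softmax_with_temperature, entropy, skr_temperature), not the authors' implementation. Since the Shannon entropy of the temperature-scaled softmax \(\psi (\theta ;\cdot)\) grows from 0 toward \(\log _2 C\) as \(\theta\) increases, a sign change of \(g(\theta) = H(\psi (\theta ;\cdot)) - E\) brackets the root.

```python
import numpy as np

def softmax_with_temperature(u, theta):
    """psi(theta; z): temperature-scaled softmax over the local logits u."""
    scaled = (u - u.max()) / theta        # subtract the max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in bits, H(p) = -sum_i p_i * log2(p_i)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log2(p)).sum())

def skr_temperature(u, E, lo=1e-3, hi=1e3, tol=1e-6, max_iter=200):
    """Bisection on g(theta) = H(psi(theta; u)) - E, mirroring the bracketed
    interval used in the proof of Theorem 2."""
    g = lambda theta: entropy(softmax_with_temperature(u, theta)) - E
    assert g(lo) < 0 < g(hi), "E must lie strictly between 0 and log2(C)"
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if abs(g(mid)) < tol:
            break
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return mid

# Refine local knowledge toward a target entropy E (in bits), 0 < E < log2(C).
u = np.array([4.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.3, -0.7, -1.0, -2.0])
theta_star = skr_temperature(u, E=2.5)
refined = softmax_with_temperature(u, theta_star)
print(round(entropy(refined), 3))   # approximately 2.5
```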

A.3 Proof of Knowledge Refinement Properties

(1) Probabilistic Projectivity
Theorem 3.
In KKR, the refined-local knowledge is in probability space.
Proof.
First, we prove that \(\sum \nolimits _{i = 1}^C {{\varphi _{KKR}}(z_{{X^k}}^k)_i = 1}\).
Case 3.1.1.
When \({\varphi _{KKR}}(z^k_{X^k})_i \ge 0,\forall i \in \mathcal {C}\), we calculate the sum of all dimensions in the refined-local knowledge, which can be given by:
\begin{equation} \begin{aligned}&\; \; \;\sum \limits _{i = 1}^C {\varphi _{KKR}\left(z^k_{X^k}\right)_i}\\ &{ = \sum \limits _{i = 1}^C {\frac{{(CT - 1) \cdot {v_i^k} + v_m^k - T}}{{C \cdot v_m^k - 1}}} }\\ &{ = \frac{{(CT - 1) \cdot \sum \nolimits _{i = 1}^C {{v_i^k}} }}{{C \cdot v_m^k - 1}} + \frac{{C \cdot \left(v_m^k - T\right)}}{{C \cdot v_m^k - 1}}}.\\ \end{aligned} \end{equation}
(48)
Since the softmax-normalized knowledge satisfies:
\begin{equation} \sum {p_{X^k}^k} = \sum \limits _{i = 1}^C {{v_i^k}} = 1, \end{equation}
(49)
we have:
\begin{equation} \begin{aligned}&\; \; \; \; \;\sum \limits _{i = 1}^C {\varphi _{KKR}\left(z^k_{X^k}\right)_i}\\ &=\frac{{(CT - 1) \cdot \sum \nolimits _{i = 1}^C {{v_i^k}} }}{{C \cdot v_m^k - 1}} + \frac{{C \cdot \left(v_m^k - T\right)}}{{C \cdot v_m^k - 1}}\\ &= \frac{{CT - 1 + C \cdot \left(v_m^k - T\right)}}{{C \cdot v_m^k - 1}}\\ &=1. \end{aligned} \end{equation}
(50)
Case 3.1.2.
When \({\varphi _{KKR}}(z^k_{X^k})_i \lt 0,\exists i \in \mathcal {C}\), the rectified \(\varphi _{KKR}(\cdot)\) is adopted, which means:
\begin{equation} \begin{aligned}& \; \; \; \; \;\sum \limits _{i = 1}^C {{{\varphi }_{KKR}}\left(z^k_{X^k}\right)_i} \\ &= \sum \limits _{i = 1}^C {\left(\delta (i) \cdot \frac{{1 - T}}{{C - 1}} + (1 - \delta (i)) \cdot T\right)} \\ &= (C - 1) \cdot \frac{{1 - T}}{{C - 1}} + T\\ &= 1. \end{aligned} \end{equation}
(51)
In summary, \(\sum \nolimits _{i = 1}^C {{\varphi _{KKR}}(z_{{X^k}}^k)_i = 1}\) is proved. Then, we prove that \(0\le {\varphi _{KKR}(z^k_{X^k})_i} \le 1, \forall i \in \mathcal {C}\).
Case 3.2.1.
When \({\varphi _{KKR}}(z^k_{X^k})_i \ge 0,\forall i \in \mathcal {C}\), we have:
\begin{equation} \begin{aligned}& \; \; \; \; {\varphi _{KKR}}\left(z^k_{X^k}\right)_i - 1\\ & = {\varphi _{KKR}}\left(z^k_{X^k}\right)_i - \sum \limits _{i = 1}^C {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i\\ & { = - \sum \limits _{j = 1}^{i - 1} {{\varphi _{KKR}}\left(z^k_{X^k}\right)_j - } \sum \limits _{j = i + 1}^C {{\varphi _{KKR}}\left(z^k_{X^k}\right)_j} }\\ & { \le 0}, \end{aligned} \end{equation}
(52)
and hence, we have \(0 \le {\varphi _{KKR}}(z^k_{X^k})_i \le 1\) in this case.
Case 3.2.2.
When \({\varphi _{KKR}}(z^k_{X^k})_i \lt 0,\exists i \in \mathcal {C}\), we consider the rectified form of \({\varphi _{KKR}(\cdot)}\), that is:
\begin{equation} {\varphi _{KKR}}\left(z^k_{X^k}\right)_i \in \left\lbrace T,\frac{{1 - T}}{{C - 1}}\right\rbrace ,\forall i \in \mathcal {C}. \end{equation}
(53)
As hyper-parameter \(T\) indicates the target peak probability of the refined-local knowledge, and \(C\) denotes the number of classes, they empirically satisfy the following conditions:
\begin{equation} \frac{1}{C} \lt T \lt 1, \end{equation}
(54)
\begin{equation} C \ge 10 \wedge C \in \mathbf {Z^{+}}, \end{equation}
(55)
where \(\mathbf {Z^{+}}\) is the set of positive integers. From Equation (54), we have:
\begin{equation} 0 \le T \le 1. \end{equation}
(56)
From Equations (54) and (55), it follows that:
\begin{equation} 1-T \ge 0 \wedge C-1\gt 0, \end{equation}
(57)
and hence,
\begin{equation} \frac{{1 - T}}{{C - 1}} \ge 0. \end{equation}
(58)
Besides, we have:
\begin{equation} \begin{aligned}& \; \; \;\frac{{1 - T}}{{C - 1}} - 1\\ & = \frac{{1 - T - C + 1}}{{C - 1}}\\ & = \frac{{2 - T - C}}{{C - 1}}\\ & \lt \frac{{2 - C}}{{C - 1}}\\ & \le 0. \end{aligned} \end{equation}
(59)
Therefore, we have:
\begin{equation} \frac{{1 - T}}{{C - 1}} \le 1. \end{equation}
(60)
Based on Equations (53), (56), (58), and (60), \(0 \le {\varphi _{KKR}}(z^k_{X^k})_i \le 1,\forall {i \in \mathcal {C}}\) is proved. Combining Cases 3.1 and 3.2, we have \({\varphi _{KKR}(z^k_{X^k})} \in \mathcal {P}\).
Theorem 3 is proved. □
Theorem 4.
In SKR, the refined-local knowledge is in probability space.
Proof.
Define \({\varphi _{SKR}}(z_{{X^k}}^k)_i\) as the \(i\)-th dimension in \({{\varphi _{SKR}}(z_{{X^k}}^k)}\). We should first prove that \(\sum \nolimits _{i = 1}^C {{\varphi _{SKR}}(z_{{X^k}}^k)_i} = 1\).
\begin{equation} \begin{aligned}& \; \; \; \; \sum \limits _{i = 1}^C {{\varphi _{SKR}}\left(z_{{X^k}}^k\right)_i}\\ & = \sum \limits _{i = 1}^C {\psi \left({\theta ^*};z^k_{X^k}\right)_i} \\ & = \sum \limits _{i = 1}^C {\frac{{\exp (\tfrac{{u_i^k}}{\theta ^* })}}{{\sum \nolimits _{j = 1}^C {\exp (\tfrac{{u_j^k}}{\theta ^* })} }}} \\ & = 1. \end{aligned} \end{equation}
(61)
Then, we prove that \(0 \le {\varphi _{SKR}}(z_{{X^k}}^k)_i \le 1, \forall i \in \mathcal {C}\).
On the one hand, since the following inequalities always hold:
\begin{equation} {\exp \left(\frac{{u_i^k}}{\theta }\right)} \gt 0, \end{equation}
(62)
\begin{equation} {\sum \limits _{j = 1}^C {\exp \left(\frac{{u_j^k}}{\theta }\right)} }\gt 0, \end{equation}
(63)
we can infer that:
\begin{equation} {\varphi _{SKR}}\left(z_{{X^k}}^k\right)_i = \frac{{\exp (\tfrac{{u_i^k}}{{{\theta ^*}}})}}{{\sum \nolimits _{j = 1}^C {\exp (\tfrac{{u_j^k}}{{{\theta ^*}}})} }} \gt 0. \end{equation}
(64)
On the other hand,
\begin{equation} \begin{aligned}& \; \; \; \; \;{\varphi _{SKR}}\left(z_{{X^k}}^k\right)_i - 1\\ & = \frac{{\exp \left(\frac{{u_i^k}}{{{\theta ^*}}}\right)}}{{\sum \nolimits _{j = 1}^C {\exp \left(\frac{{u_j^k}}{{{\theta ^*}}}\right)} }} - 1\\ & = - \frac{{\sum \nolimits _{j = 1}^{i - 1} {\exp \left(\frac{{u_j^k}}{{{\theta ^*}}}\right)} + \sum \nolimits _{j = i + 1}^C {\exp \left(\frac{{u_j^k}}{{{\theta ^*}}}\right)} }}{{\sum \nolimits _{j = 1}^C {\exp \left(\frac{{u_j^k}}{{{\theta ^*}}}\right)} }}\\ & = - \sum \limits _{j = 1}^{i - 1} {{\varphi _{SKR}} \left(z^k_{X^k}\right)_j - \sum \limits _{j = i + 1}^C {{\varphi _{SKR}} \left(z^k_{X^k}\right)_j} } \\ & \; { \le 0}. \end{aligned} \end{equation}
(65)
Thus, based on Equations (64) and (65), \({0 \le {\varphi _{SKR}}(z_{{X^k}}^k)_i \le 1},\forall i \in \mathcal {C}\) is proved. Based on the analysis presented above, we have \(\varphi _{SKR} (z^k_{X^k}) \in \mathcal {P}\).
Theorem 4 is proved. □
(2) Invariant Relations
Theorem 5.
KKR does not change the order of numeric value among all elements in local knowledge.
Proof.
We first prove that the softmax mapping does not change the order of numeric value among all elements in local knowledge.
For \(\forall u_i^k \ge u_j^k\), we have:
\begin{equation} \begin{aligned}&\; \; \; \; \;v_i^k - v_j^k\\ &= \frac{{\exp \left(u_i^k\right)}}{{\sum \nolimits _{l = 1}^C {\exp \left(u_l^k\right)} }} - \frac{{\exp \left(u_j^k\right)}}{{\sum \nolimits _{l = 1}^C {\exp \left(u_l^k\right)} }}\\ &= \frac{{\exp \left(u_i^k\right) - \exp \left(u_j^k\right)}}{{\sum \nolimits _{l = 1}^C {\exp \left(u_l^k\right)} }}. \end{aligned} \end{equation}
(66)
Since \(\exp (\cdot)\) is a monotonically increasing function, there will always be:
\begin{equation} \exp \left(u_i^k\right) - \exp \left(u_j^k\right) \ge 0,\forall u_i^k \ge u_j^k. \end{equation}
(67)
As a result, we have:
\begin{equation} v_i^k \ge v_j^k,\forall u_i^k \ge u_j^k. \end{equation}
(68)
Next, we need to prove that:
\begin{equation} {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i \ge {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j, \forall v_i^k \ge v_j^k. \end{equation}
(69)
We consider the proof of Equation (69) in the following cases:
Case 5.1.
When \({\varphi _{KKR}}(z^k_{X^k})_i \ge 0,\forall i \in \mathcal {C}\), for any \(v_i^k \ge v_j^k\) we can infer that:
\begin{equation} \begin{aligned}& \; \; \; \; \; \varphi _{KKR} \left({z^k_{X^k}}\right)_i - \varphi _{KKR} \left(z^k_{X^k}\right)_j \\ & = \frac{{(CT - 1) \cdot {v_i^k} + v_m^k - T}}{{C \cdot v_m^k - 1}} - \frac{{(CT - 1) \cdot {v_j^k} + v_m^k - T}}{{C \cdot v_m^k - 1}}\\ & = \frac{{(CT - 1) \cdot ({v_i^k} - v_j^k)}}{{C \cdot v_m^k - 1}}. \end{aligned} \end{equation}
(70)
With Equations (27), (28) and the precondition \({v_i^k} \ge {v_j^k}\), we can infer that:
\begin{equation} \begin{aligned}&\; \; \; \; \;{\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i - {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j\\ &= \frac{{(CT - 1) \cdot (v_i^k - v_j^k)}}{{C \cdot v_m^k - 1}}\\ &\ge 0, \end{aligned} \end{equation}
(71)
and hence, we obtain:
\begin{equation} {\varphi _{KKR}}\left(z^k_{X^k}\right)_i - {\varphi _{KKR}}\left(z^k_{X^k}\right)_j \ge 0,\forall v_i^k \ge v_j^k. \end{equation}
(72)
Case 5.2.
When \({\varphi _{KKR}}(z^k_{X^k})_i \lt 0,\exists i \in \mathcal {C}\), in which case the rectified \({\varphi _{KKR}}(\cdot)\) is adopted, three cases should be taken into consideration.
Case 5.2.1.
When \(i = m \wedge j = m\), we have:
\begin{equation} {\varphi _{KKR}}{\left(z_{{X^k}}^k\right)_i} = \varphi _{KKR} \left(z^k_{X^k}\right)_j = T, \end{equation}
(73)
which means \({\varphi _{KKR}}{(z_{{X^k}}^k)_i} \ge \varphi _{KKR} (z^k_{X^k})_j\) holds.
Case 5.2.2.
When \(i \ne m \wedge j \ne m\), we have:
\begin{equation} {\varphi _{KKR}}{\left(z_{{X^k}}^k\right)_i} = \varphi _{KKR} \left(z^k_{X^k}\right)_j = \frac{{1 - T}}{{C - 1}}, \end{equation}
(74)
which means \({\varphi _{KKR}}{(z_{{X^k}}^k)_i} \ge \varphi _{KKR} (z^k_{X^k})_j\) holds.
Case 5.2.3.
When \(i = m \wedge j \ne m\), following Equation (28), we can infer that:
\begin{equation} {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i = T = \frac{{TC - T}}{{C - 1}} \gt \frac{{1 - T}}{{C - 1}} = {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j, \end{equation}
(75)
and \({\varphi _{KKR}}{(z_{{X^k}}^k)_i} \ge \varphi _{KKR} (z^k_{X^k})_j\) holds as well.
So far, we can prove:
\begin{equation} {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i \ge {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j,\forall v_i^k \ge v_j^k. \end{equation}
(76)
Combined with Equation (68), we can prove that:
\begin{equation} {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i \ge {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j,\forall u_i^k \ge u_j^k. \end{equation}
(77)
Theorem 5 is proved. □
Theorem 6.
SKR does not change the order of numeric value among all elements in local knowledge.
Proof.
For \(\forall u_i^k \ge u_j^k\), we have:
\begin{equation} \begin{aligned}& \; \; \; \; \; {\varphi _{SKR}}\left(z^k_{X^k}\right)_i - {\varphi _{SKR}}\left(z^k_{X^k}\right)_j\\ & = \frac{{\exp (\tfrac{{u_i^k}}{{{\theta ^*}}}) - \exp (\tfrac{{u_j^k}}{{{\theta ^*}}})}}{{\sum \nolimits _{l = 1}^C {\exp (\tfrac{{u_l^k}}{{{\theta ^*}}})} }}. \end{aligned} \end{equation}
(78)
Since \(\frac{{u_i^k}}{{{\theta ^*}}} \ge \frac{{u_j^k}}{{{\theta ^*}}}\), we have:
\begin{equation} \exp \left(\frac{{u_i^k}}{{{\theta ^*}}}\right) - \exp \left(\frac{{u_j^k}}{{{\theta ^*}}}\right) \ge 0. \end{equation}
(79)
Hence,
\begin{equation} {\frac{{\exp (\tfrac{{u_i^k}}{{{\theta ^*}}}) - \exp (\tfrac{{u_j^k}}{{{\theta ^*}}})}}{{\sum \nolimits _{l = 1}^C {\exp (\tfrac{{u_l^k}}{{{\theta ^*}}})} }}} \ge 0. \end{equation}
(80)
In summary, we always have \({\varphi _{SKR}}{(z_{{X^k}}^k)_i} \ge \varphi _{SKR} (z^k_{X^k})_j\) whenever \(u_i^k \ge u_j^k\).
Theorem 6 is proved. □
(3) Bounded Dissimilarity
Theorem 7.
After refining by KKR, the knowledge discrepancy between arbitrary clients satisfies an acceptable theoretical upper bound \(\mathbf {\varepsilon }\).
Proof.
We first prove that the peak probability of the knowledge refined by KKR is always \(T\). Two cases are taken into consideration.
Case 7.1.
When \({\varphi _{KKR}}(z^k_{X^k})_i \ge 0,\forall i \in \mathcal {C}\), according to Theorem 5, we have:
\begin{equation} \begin{aligned}& \; \; \; \; \max \left({\varphi _{KKR}}\left(z^k_{X^k}\right)\right)\\ & = \max \left({\varphi _{KKR}}\left(z^k_{X^k}\right)_1,{\varphi _{KKR}}\left(z^k_{X^k}\right)_2,\right.\\ & \; \; \; \; \;\left. \ldots ,{\varphi _{KKR}}\left(z^k_{X^k}\right)_C\right)\\ & = {\varphi _{KKR}}\left(z^k_{X^k}\right)_m\\ & = \frac{{(CT - 1) \cdot v_m^k + v_m^k - T}}{{C \cdot v_m^k - 1}}\\ & = T. \end{aligned} \end{equation}
(81)
Case 7.2.
When \({\varphi _{KKR}}(z^k_{X^k})_i \lt 0,\exists i \in \mathcal {C}\), we can conduct the following inference based on Equation (75):
\begin{equation} \begin{aligned}& \; \; \; \; \max \left({\varphi _{KKR}}\left(z^k_{X^k}\right)\right)\\ & =\max \left(T,\frac{{1 - T}}{{C - 1}}\right)\\ & = T. \end{aligned} \end{equation}
(82)
So far, for \(\forall {k_1},{k_2} \in \mathcal {K}\), we have:
\begin{equation} \begin{aligned}& \; \; \; \; \; \left|dis{t_{KKR}}\left(\varphi _{KKR} \left(z_X^{{k_1}}\right)\right) - dis{t_{KKR}}\left(\varphi _{KKR} \left(z_X^{{k_2}}\right)\right)\right|\\ & = \left|\max \left(\varphi _{KKR} \left(z_X^{{k_1}}\right)\right) - \max \left(\varphi _{KKR} \left(z_X^{{k_2}}\right)\right)\right|\\ & = |T - T|\\ & = 0\\ & \lt \varepsilon . \end{aligned} \end{equation}
(83)
Theorem 7 is proved. □
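The KKR properties established in Theorems 3, 5, and 7 can also be checked numerically. The snippet below is an informal sanity check under assumed names (the helper kkr re-implements Equations (29) and (53), as in the sketch after Theorem 1); it samples random heterogeneous logits as stand-ins for client outputs and verifies probabilistic projectivity, order preservation, and the constant peak probability \(T\).

```python
import numpy as np

def kkr(v, T):
    """KKR of Equations (29)/(53): closed-form refinement with the rectified fallback."""
    C = v.shape[0]
    out = ((C * T - 1.0) * v + v.max() - T) / (C * v.max() - 1.0)
    if np.any(out < 0):                           # Theorem 1: fall back to the rectified form
        out = np.full(C, (1.0 - T) / (C - 1))
        out[np.argmax(v)] = T
    return out

rng = np.random.default_rng(0)
C, T = 10, 0.5                                    # number of classes and target peak, T > 1/C
peaks = []
for _ in range(1000):                             # random logits acting as heterogeneous clients
    u = rng.normal(size=C) * rng.uniform(0.1, 5.0)
    v = np.exp(u - u.max()); v /= v.sum()         # softmax-normalized local knowledge
    r = kkr(v, T)
    assert np.isclose(r.sum(), 1.0) and r.min() >= 0.0 and r.max() <= 1.0   # Theorem 3
    assert np.all(np.diff(r[np.argsort(v)]) >= -1e-12)                      # Theorem 5
    peaks.append(r.max())

print(np.allclose(peaks, T))   # True: every refined peak equals T, so the dist_KKR
                               # discrepancy between any two clients is 0 (Theorem 7)
```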
Theorem 8.
After refining by SKR, the knowledge discrepancy between arbitrary clients satisfies an acceptable theoretical upper bound \(\mathbf {\varepsilon }\).
Proof.
Since \(\varphi _{SKR}(\cdot)\) cannot be given in closed form directly, the proof proceeds in two steps:
(a)
To prove that \(\varphi _{SKR}(\cdot)\) can be constructed as described in Section 3.3.
(b)
To prove that, after refining the local knowledge with the constructed SKR, the knowledge discrepancy between arbitrary clients satisfies an acceptable theoretical upper bound \(\varepsilon\).
To prove step (a), we first search for an optimal \(\theta ^*\) in \(\psi (\theta ; \cdot)\), as described in Equations (17) and (18). The problem is thereby converted into finding the root of Equation (19), whose existence has been proved in Theorem 2.
To prove step (b), we bound the differences in knowledge distributions under the metric \(dist_{SKR}(\cdot)\): for \(\forall {{k}_1},{k_2} \in \mathcal {K}\),
\begin{equation} \begin{aligned}& \; \; \; \; \;\left|dis{t_{SKR}}\left({\varphi _{SKR}}\left(z_X^{{k_1}}\right)\right) - dis{t_{SKR}}\left({\varphi _{SKR}}\left(z_X^{{k_2}}\right)\right)\right|\\ & = \left|H\left({\varphi _{SKR}}\left(z_X^{{k_1}}\right)\right) - H\left({\varphi _{SKR}}\left(z_X^{{k_2}}\right)\right)\right|\\ & = \left|\left(H\left({\varphi _{SKR}}\left(z_X^{{k_1}}\right)\right) - E\right) - \left(H\left({\varphi _{SKR}}\left(z_X^{{k_2}}\right)\right) - E\right)\right|\\ & \le \left|H\left({\varphi _{SKR}}\left(z_X^{{k_1}}\right)\right) - E\right| + \left|H\left({\varphi _{SKR}}\left(z_X^{{k_2}}\right)\right) - E\right|\\ &\lt \frac{\varepsilon }{2} + \frac{\varepsilon }{2}\\ &= \varepsilon . \\ \end{aligned} \end{equation}
(84)
Theorem 8 is proved. □

A.4 Sufficient Conditions for Available Kernel Functions in the KKR Strategy

Theorem 9.
The constructed KKR can satisfy all properties mentioned in Section 3.1 as long as the kernel function \(\mathbf {\sigma (\cdot)}\) satisfies the following relaxation conditions:
(a)
Non-proportionality (not a direct proportion), i.e., \(\sigma \notin \lbrace f|f(x) = k \cdot x\rbrace\)
(b)
Continuous and monotonically increasing, i.e., \(\sigma \in \lbrace f|\mathop {\lim }\nolimits _{x \rightarrow c} f(x) = f(c)\rbrace \cap \lbrace f|f({x_1}) - f({x_2}) \ge 0,\forall {x_1} \ge {x_2}\rbrace\)
(c)
Function value is consistently positive, i.e., \(\sigma \in \lbrace f|f(x) \gt 0\rbrace\)
(d)
Parameter \(\mathbf {t}\) is solvable in Equation (10), i.e., \((\tfrac{{\sigma (\tfrac{1}{{{t_1}}})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{{t_1} \cdot v_m^k}})} }} - T) \cdot (\tfrac{{\sigma (\tfrac{1}{{{t_2}}})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{{t_2} \cdot v_m^k}})} }} - T) \lt 0,\exists {t_1},{t_2}\)
To prove the necessary properties in Section 3.1, we first introduce a lemma to confirm that the kernel function scaling parameter \(t\) is consistently positive.
Lemma.
When the kernel function satisfies the relaxation conditions mentioned in Theorem 9, \(\mathbf {t}\) is consistently positive.
Proof of Lemma.
We first note that \(t \ne 0\), since \(t\) appears as a denominator in Equation (10). Then we prove that \(t \lt 0\) can never hold. Suppose \(t \lt 0\); according to Equation (54) and conditions (b) and (c), we can infer that:
\begin{equation} \begin{aligned}& \; \; \; \; \; \frac{{\sigma (\tfrac{1}{t})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})} }}\\ &\lt \frac{{\sigma (\tfrac{1}{t})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_j^k}})} }}\\ &= \frac{{\sigma (\tfrac{1}{t})}}{{C \cdot \sigma (\tfrac{1}{t})}}\\ &= \frac{1}{C}\\ &\lt T, \end{aligned} \end{equation}
(85)
which indicates:
\begin{equation} \frac{{\sigma (\tfrac{1}{t})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})} }} \ne T, \end{equation}
(86)
and Equation (86) is in conflict with Equation (10). Hence, we can never take \(t \le 0\) when the relaxation conditions in Theorem 9 are satisfied. Since condition (d) guarantees that a \(t\) can always be solved for, there must always be \(t\gt 0\).
The lemma is proved. □
We now prove the necessary properties mentioned in Section 3.1.
(1) Probabilistic Projectivity: As stated in condition (c), i.e., \(\sigma (x) \gt 0,\forall x \in R\), we have:
\begin{equation} {\varphi _{KKR}}\left(z^k_{X^k}\right)_i = \frac{{\sigma (\tfrac{{v_i^k}}{{t \cdot v_m^k}})}}{{\sum \nolimits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})} }} \gt 0. \end{equation}
(87)
What is more,
\begin{equation} \begin{aligned}& \; \; \; \; \sum \limits _{i = 1}^C {{\varphi _{KKR}}\left(z^k_{X^k}\right)_i} \\ & = \sum \nolimits _{i = 1}^C {\frac{{\sigma (\tfrac{{v_i^k}}{{t \cdot v_m^k}})}}{{\sum \limits _{j = 1}^C {\sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})} }}} \\ & = 1. \end{aligned} \end{equation}
(88)
Hence, we prove \({\varphi _{KKR}}(z^k_{X^k}) \in \mathcal {P}\).
(2) Invariant Relations: As \(v_i^k \ge v_j^k\), \(t\gt 0\) and \(v^k_m\gt 0\), we can infer that:
\begin{equation} \frac{{v_i^k}}{{t \cdot v_m^k}} \ge \frac{{v_j^k}}{{t \cdot v_m^k}}. \end{equation}
(89)
Consequently, we have:
\begin{equation} \begin{aligned}& \; \; \; \; \; \forall v_i^k \ge v_j^k,\\ & \; \; \; \; \; {\varphi _{KKR}}\left(z^k_{X^k}\right)_i - {\varphi _{KKR}}\left(z^k_{X^k}\right)_j\\ & = \frac{{\sigma (\tfrac{{v_i^k}}{{t \cdot v_m^k}}) - \sigma (\tfrac{{v_j^k}}{{t \cdot v_m^k}})}}{{\sum \nolimits _{l = 1}^C {\sigma (\tfrac{{v_l^k}}{{t \cdot v_m^k}})} }}\\ & \ge 0. \end{aligned} \end{equation}
(90)
Following the reasoning used to prove Equation (68), we can conclude that:
\begin{equation} {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_i \ge {\varphi _{KKR}}\left(z_{{X^k}}^k\right)_j,\forall u_i^k \ge u_j^k. \end{equation}
(91)
(3) Bounded Dissimilarity: Based on the invariant-relations property (2) above, the proof proceeds exactly as in Equation (81).
Theorem 9 is proved. □
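To make the relaxation conditions of Theorem 9 concrete, the following sketch instantiates the generalized kernel-based refinement of Equation (10) with \(\sigma (x) = \exp (x)\), which is not a direct proportion, is continuous and monotonically increasing, and is strictly positive; the scaling parameter \(t\) is then located by bisection on the peak-probability constraint, in line with condition (d). The function name kernel_kkr and the bracketing interval are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def kernel_kkr(v, T, sigma=np.exp, lo=0.05, hi=50.0, tol=1e-8, iters=200):
    """Generalized KKR of Equation (10): refine softmax-normalized knowledge v so that
    its peak probability equals T, using a kernel sigma satisfying conditions (a)-(d)."""
    v_m = v.max()

    def refine(t):
        w = sigma(v / (t * v_m))      # sigma(v_i / (t * v_m)), strictly positive by (c)
        return w / w.sum()

    # Condition (d): the peak-probability equation has a root inside the assumed bracket.
    peak_gap = lambda t: refine(t).max() - T
    assert peak_gap(lo) > 0 > peak_gap(hi), "need 1/C < T < 1 and non-uniform v"
    mid = 0.5 * (lo + hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if abs(peak_gap(mid)) < tol:
            break
        lo, hi = (mid, hi) if peak_gap(mid) > 0 else (lo, mid)
    return refine(mid)

u = np.array([3.0, 1.0, 0.5, 0.0, -0.2, -0.5, -1.0, -1.5, -2.0, -3.0])
v = np.exp(u - u.max()); v /= v.sum()
r = kernel_kkr(v, T=0.6)
print(np.isclose(r.sum(), 1.0), round(float(r.max()), 3))   # True 0.6: peak pinned to T
print(np.all(np.diff(r[np.argsort(v)]) >= 0))               # True: order of v is preserved
```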

References

[1]
Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235 (2018).
[2]
Claudia Carpineti, Vincenzo Lomonaco, Luca Bedogni, Marco Di Felice, and Luciano Bononi. 2018. Custom dual transportation mode detection by smartphone devices exploiting sensor diversity. In 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops ’18). IEEE, 367–372.
[3]
Hongyan Chang, Virat Shejwalkar, Reza Shokri, and Amir Houmansadr. 2019. Cronus: Robust and heterogeneous collaborative learning with black-box knowledge transfer. arXiv preprint arXiv:1912.11279 (2019).
[4]
Sijie Cheng, Jingwen Wu, Yanghua Xiao, and Yang Liu. 2021. FedGEMS: Federated learning of larger server models via selective knowledge fusion. arXiv preprint arXiv:2110.11027 (2021).
[5]
George Corliss. 1977. Which root does the bisection algorithm find? SIAM Review 19, 2 (1977), 325–327.
[6]
Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, and Amos J. Storkey. 2018. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505 (2018).
[7]
Chaoyang He, Murali Annavaram, and Salman Avestimehr. 2020. Group knowledge transfer: Federated learning of large CNNs at the edge. Advances in Neural Information Processing Systems 33 (2020), 14068–14080.
[8]
Chaoyang He, Songze Li, Jinhyun So, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Li Shen, Peilin Zhao, Yan Kang, Yang Liu, Ramesh Raskar, Qiang Yang, Murali Annavaram, and Salman Avestimehr. 2020. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518 (2020).
[9]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[10]
Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. 2019. Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 578–587.
[11]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12]
Wenke Huang, Mang Ye, and Bo Du. 2022. https://github.com/wenkehuang/fccl
[13]
Wenke Huang, Mang Ye, and Bo Du. 2022. Learn from others and be yourself in heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10143–10153.
[14]
Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. 2020. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. arXiv preprint arXiv:2008.06180 (2020).
[15]
Amir Jalalirad, Marco Scavuzzo, Catalin Capota, and Michael Sprague. 2019. A simple and efficient federated recommender system. In Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. 53–58.
[16]
Deep Kawa, Sunaina Punyani, Priya Nayak, Arpita Karkera, and Varshapriya Jyotinagar. 2019. Credit risk assessment from combined bank records using federated learning. International Research Journal of Engineering and Technology (IRJET) 6, 4 (2019), 1355–1358.
[17]
Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009).
[18]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[19]
Gihun Lee, Yongjin Shin, Minchan Jeong, and Se-Young Yun. 2021. Preservation of the global knowledge by not-true self knowledge distillation in federated learning. arXiv preprint arXiv:2106.03097 (2021).
[20]
Daliang Li and Junpu Wang. 2019. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581 (2019).
[21]
Tianhong Li, Jianguo Li, Zhuang Liu, and Changshui Zhang. 2020. Few sample knowledge distillation for efficient network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14639–14647.
[22]
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2 (2020), 429–450.
[23]
Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. 2020. Ensemble distillation for robust model fusion in federated learning. arXiv preprint arXiv:2006.07242 (2020).
[24]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[25]
Wanning Pan and Lichao Sun. 2021. Global knowledge distillation in federated learning. arXiv preprint arXiv:2107.00051 (2021).
[26]
Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. 2019. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5007–5016.
[27]
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso. 2020. The future of digital health with federated learning. NPJ Digital Medicine 3, 1 (2020), 1–7.
[28]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[29]
Ben Tan, Bo Liu, Vincent Zheng, and Qiang Yang. 2020. A federated recommender system for online services. In Fourteenth ACM Conference on Recommender Systems. 579–581.
[30]
Lin Wang and Kuk-Jin Yoon. 2021. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[31]
Zhiyuan Wu, Yu Jiang, Chupeng Cui, Zongmin Yang, Xinhui Xue, and Hong Qi. 2021. Spirit distillation: Precise real-time semantic segmentation of road scenes with insufficient data. arXiv preprint arXiv:2103.13733 (2021).
[32]
Zhiyuan Wu, Yu Jiang, Minghao Zhao, Chupeng Cui, Zongmin Yang, Xinhui Xue, and Hong Qi. 2021. Spirit distillation: A model compression method with multi-domain knowledge transfer. In Knowledge Science, Engineering and Management, Han Qiu, Cheng Zhang, Zongming Fei, Meikang Qiu, and Sun-Yuan Kung (Eds.). Springer International Publishing, Cham, 553–565.
[33]
Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Xuefeng Jiang, and Runhan Li. 2023. Survey of knowledge distillation in federated edge learning. arXiv preprint arXiv:2301.05849 (2023).
[34]
Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Quyang Pan, Xuefeng Jiang, and Bo Gao. 2023. FedICT: Federated multi-task distillation for multi-access edge computing. IEEE Transactions on Parallel and Distributed Systems (2023), 1–16. DOI:
[35]
Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Wen Wang, Xuefeng Jiang, Bo Gao, and Jinda Lu. 2023. FedCache: A knowledge cache-driven federated learning architecture for personalized edge intelligence. arXiv preprint arXiv:2308.07816 (2023).
[36]
Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[37]
Jie Xu, Benjamin S. Glicksberg, Chang Su, Peter Walker, Jiang Bian, and Fei Wang. 2021. Federated learning for healthcare informatics. Journal of Healthcare Informatics Research 5, 1 (2021), 1–19.
[38]
Dezhong Yao, Wanning Pan, Yutong Dai, Yao Wan, Xiaofeng Ding, Hai Jin, Zheng Xu, and Lichao Sun. 2021. Local-global knowledge distillation in heterogeneous federated learning with non-IID data. arXiv e-prints (2021), arXiv–2107.
[39]
Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning. PMLR, 12878–12889.
