2.1 Basic Process of Proxy-data-free Federated Distillation
Without loss of generality, we consider a classification task in the FL setting with \(C\) categories, and let \(\mathcal {C}=\lbrace 1,2, \ldots ,C\rbrace\). The FD system consists of a large-scale server and \(K\) heterogeneous clients. Let \(\mathcal {K}= \lbrace 1,2, \ldots ,K\rbrace\) denote the set of clients. Each client \(k\) owns a private dataset \(\mathcal {D}^k=\lbrace {X^k},{y^k}\rbrace\) with \(N^k\) samples, where \(X^k\) and \(y^k\) denote the set of input data and the corresponding labels, respectively. Moreover, data distributions among clients are not independent and identically distributed (Non-IID) in our setting.
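As a concrete (and purely illustrative) way to simulate such a Non-IID setting, the sketch below partitions a labeled dataset across \(K\) clients with a Dirichlet label split; the function name, the concentration parameter \(\alpha\), and the choice of a Dirichlet partition are our assumptions for illustration rather than details specified in this section.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=10, num_classes=10, alpha=0.5, seed=0):
    """Assign sample indices to clients so that each class is spread across
    clients with Dirichlet(alpha) proportions; smaller alpha means more skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for k, chunk in enumerate(np.split(idx_c, cut_points)):
            client_indices[k].extend(chunk.tolist())
    return client_indices  # client_indices[k] indexes client k's private D^k

# Example: 10,000 synthetic labels over C = 10 categories split across K = 10 clients.
labels = np.random.randint(0, 10, size=10_000)
parts = dirichlet_partition(labels)
```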
We assume that each client owns a heterogeneous model architecture, determined by the computation capability and training requirements of the individual client in practice. Referring to FedGKT [7], we consider the feature-driven FD framework, which can achieve heterogeneous model training while guaranteeing communication efficiency. In this framework, the local model at each client includes a small feature extractor and a large predictor, while the global model at the server only contains a large predictor. Let \(W_e^k\) and \(W_p^k\) be the feature extractor's weights and the predictor's weights of client \(k\), respectively. Moreover, we denote \(W^k=\lbrace W_e^k \cup W_p^k\rbrace\) as the weights of the local model at client \(k\), and denote \(W^S\) as the weights of the global model on the server. Let \(f(W^\ast;\cdot)\) denote the nonlinear function determined by weights \(W^\ast\), where \(W^\ast \in \lbrace \bigcup \nolimits _{k=1}^K W^k\cup W^S\rbrace\). In addition, we define the extracted features of client \(k\) as \({H^k} = f(W_e^k;{X^k})\), the logits of client \(k\) as \(z_{{X^k}}^k = f(W_p^k;{H^k})\), and the logits of the server as \(z_{{X^k}}^S = f({W^S};{H^k})\). Specifically, the logits of clients are called local knowledge, and the logits of the server are called global knowledge.
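To make the split-model notation concrete, the following PyTorch sketch instantiates a small client-side feature extractor \(W_e^k\), a client-side predictor \(W_p^k\), and a (typically larger) server-side predictor \(W^S\), and computes \(H^k\), \(z_{X^k}^k\), and \(z_{X^k}^S\) as defined above. The layer sizes, batch size, and random inputs are hypothetical placeholders, not the architectures used in this work.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 64, 10            # hypothetical dimensions

# Client k: a small feature extractor W_e^k followed by a predictor W_p^k.
extractor_k = nn.Sequential(              # f(W_e^k; .)
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, feat_dim),
)
predictor_k = nn.Sequential(              # f(W_p^k; .)
    nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_classes),
)
# Server: a (typically larger) predictor W^S that consumes uploaded features.
predictor_S = nn.Sequential(              # f(W^S; .)
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes),
)

X_k = torch.randn(32, 3, 32, 32)            # a batch from client k's private data
y_k = torch.randint(0, num_classes, (32,))  # corresponding labels y^k
H_k = extractor_k(X_k)                      # H^k = f(W_e^k; X^k)
z_k = predictor_k(H_k)                      # local knowledge z^k_{X^k}
z_S = predictor_S(H_k)                      # global knowledge z^S_{X^k}
```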
The whole process of proxy-data-free FD can be divided into multiple rounds. Each round consists of two stages: local distillation, where each client updates its local model based on the global knowledge transferred back from the server; and global distillation, where the global model on the server performs knowledge distillation based on the local knowledge uploaded from clients. The detailed processes are as follows:
(1) Local Distillation Process: Each client \(k\) updates its feature extractor \(W_e^k\) and predictor \(W_p^k\) according to the received global knowledge \(z_{X^k}^S\), aiming to minimize the combination of the cross-entropy loss \(L_{CE}(\cdot)\) and the knowledge-similarity loss \(L_{sim}(\cdot)\), which can be given by:
\[
\min_{W^k} L_C^k\big(W^k\big) = L_{CE}\big(p_{X^k}^k, y^k\big) + \beta \cdot L_{sim}\big(p_{X^k}^S, p_{X^k}^k\big), \tag{1}
\]
where \(L_C^k(\cdot)\) represents the loss function of client \(k\), and \(\beta\) is the hyper-parameter for weighting the effect of the knowledge-similarity loss. \(p_{X^k}^S=\tau (z_{X^k}^S)\) denotes the softmax-normalized global knowledge that is broadcast to client \(k\), and \(p_{X^k}^k=\tau (z_{X^k}^k)\) is the softmax-normalized local knowledge from client \(k\), in which \(\tau (\cdot)\) is the softmax mapping. \(L_{sim}(\cdot)\) measures the similarity of the normalized local and global knowledge and takes the Kullback-Leibler divergence by default. After local training, client \(k\) generates the extracted features \(H^k\) and the local knowledge \(z_{{X^k}}^k\) based on its updated feature extractor and predictor, i.e., \({{H}^k} = f(W_e^k;{X^k})\) and \(z_{{X^k}}^k = f(W_p^k;{H^k})\). Then, client \(k\) uploads its obtained features \(H^k\), local knowledge \(z_{{X^k}}^k\), and corresponding labels \(y^k\) to the server for performing global distillation.
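Continuing the hypothetical modules above, the sketch below performs one local distillation step in the spirit of Equation (1): a cross-entropy term on the client's predictions plus a \(\beta\)-weighted Kullback-Leibler term toward the softmax-normalized global knowledge, followed by regenerating \((H^k, z_{X^k}^k, y^k)\) for upload. The optimizer, learning rate, and the value of `beta` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# extractor_k, predictor_k, X_k, y_k, z_S come from the previous sketch.
beta = 1.0  # hypothetical weight of the knowledge-similarity loss
optimizer = torch.optim.SGD(
    list(extractor_k.parameters()) + list(predictor_k.parameters()), lr=0.01)

def local_distillation_step(X_k, y_k, z_S_broadcast):
    """One update of client k's local model W^k = {W_e^k, W_p^k}, cf. Eq. (1)."""
    H_k = extractor_k(X_k)                   # H^k = f(W_e^k; X^k)
    z_k = predictor_k(H_k)                   # z^k_{X^k} = f(W_p^k; H^k)
    log_p_k = F.log_softmax(z_k, dim=1)      # log tau(z^k), local knowledge
    p_S = F.softmax(z_S_broadcast, dim=1)    # tau(z^S), received global knowledge
    loss = F.cross_entropy(z_k, y_k) \
        + beta * F.kl_div(log_p_k, p_S, reduction="batchmean")  # KL(p^S || p^k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

local_distillation_step(X_k, y_k, z_S.detach())  # one local step

# After local training, regenerate the payload to upload to the server.
with torch.no_grad():
    H_k = extractor_k(X_k)                   # updated H^k
    z_k = predictor_k(H_k)                   # updated local knowledge z^k_{X^k}
upload = (H_k, z_k, y_k)
```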
(2) Global Distillation Process: After receiving the local knowledge from all clients, the server conducts the global distillation process, which updates the global model \(W^S\) by optimizing the following objective:
\[
\min_{W^S} L_S\big(W^S\big) = \sum_{k\in\mathcal{K}} \Big[ L_{CE}\big(p_{X^k}^S, y^k\big) + \beta \cdot L_{sim}\big(p_{X^k}^k, p_{X^k}^S\big) \Big], \tag{2}
\]
where \(L_S(\cdot)\) denotes the server-side loss function. After distillation, the server generates the global knowledge \(z_{X^k}^S\) for each client \(k\) using the updated global model \(W^S\) and the uploaded local features \(H^k\), i.e., \(z_{X^k}^S=f(W^S;H^k)\). Then, \(z_{X^k}^S\) is broadcast to client \(k\). At this point, the round is completed, and the next round begins.
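The server-side counterpart can be sketched in the same spirit as Equation (2): for each client's uploaded tuple \((H^k, z_{X^k}^k, y^k)\), the server combines a cross-entropy term on its own logits with a \(\beta\)-weighted Kullback-Leibler term toward the softmax-normalized local knowledge, then regenerates global knowledge \(z_{X^k}^S\) for broadcasting. Again, the optimizer and weighting are assumptions carried over from the previous sketches.

```python
import torch
import torch.nn.functional as F

# predictor_S and the clients' uploads come from the previous sketches.
server_optimizer = torch.optim.SGD(predictor_S.parameters(), lr=0.01)

def global_distillation_round(uploads, beta=1.0):
    """One server-side distillation pass over all uploads, cf. Eq. (2);
    uploads[k] = (H^k, z^k_{X^k}, y^k). Returns fresh global knowledge per client."""
    server_optimizer.zero_grad()
    total_loss = 0.0
    for H_k, z_k, y_k in uploads:
        z_S = predictor_S(H_k)               # z^S_{X^k} = f(W^S; H^k)
        log_p_S = F.log_softmax(z_S, dim=1)  # log tau(z^S)
        p_k = F.softmax(z_k, dim=1)          # tau(z^k), uploaded local knowledge
        total_loss = total_loss + F.cross_entropy(z_S, y_k) \
            + beta * F.kl_div(log_p_S, p_k, reduction="batchmean")  # KL(p^k || p^S)
    total_loss.backward()
    server_optimizer.step()
    # Regenerate global knowledge with the updated W^S for broadcasting.
    with torch.no_grad():
        return [predictor_S(H_k) for H_k, _, _ in uploads]

global_knowledge = global_distillation_round([upload])  # z^S_{X^k} sent back to client k
```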
During the above process, only the extracted features \(H^k\) and the local-global knowledge \(\lbrace z_{X^k}^k, z_{X^k}^S\rbrace\) are exchanged between the server and client \(k\). Since such information is significantly smaller in size than the model weights, this feature-driven FD scheme can achieve client-server co-distillation under model heterogeneity with slight communication overhead.
2.2 Motivation of Distributed Knowledge Congruence
(1) Existing Drawback: Affected by both data heterogeneity and model heterogeneity, existing proxy-data-free FD methods struggle to obtain similarly distributed local knowledge from multiple clients. On the one hand, data heterogeneity leads to diverse label distributions among clients, and the local model on each client tends to learn biased representations based on an independently sampled space, favoring the samples with higher frequency to improve the local fit. On the other hand, model heterogeneity can further exacerbate these biases, since larger models tend to possess superior representation capability and generate knowledge with larger numerical differences, and vice versa.
Furthermore, according to Equation (2), we can see that knowledge incongruence has a non-negligible influence on server distillation, since the global model needs to be optimized based on the knowledge similarity between clients and the server. Due to the aforementioned problem, if the server straightforwardly learns the incongruent knowledge from clients, it will learn an ambiguous or biased representation and easily fail to converge smoothly; as a result, it cannot acquire approximately optimal global knowledge, which in turn affects the training accuracy of clients. However, existing methods [4, 7, 14, 19, 20, 23, 38], summarized in Table 2, overlook the ill effect of incongruent knowledge among clients, which leads to severe performance degradation.
Figure 1 shows the effect of knowledge congruence on global model convergence, where red arrows indicate the direction of the negative gradient obtained by distillation on softmax-normalized local knowledge, and black arrows indicate the direction obtained by distillation on refined local knowledge. As shown in Figure 1(a), the local knowledge from a single client contributes an optimized direction for the global model. However, knowledge incongruence among heterogeneous clients leads to biased optimization and frequent fluctuation in the convergence direction. These negative effects cause the actual result to deviate from the optimal one.
(2) Insight Formulation: Through the above analysis, we assert that congruent local knowledge among clients is essential for optimizing the global model and achieving stabilized convergence. Therefore, we aim to narrow the distribution differences of the original local knowledge among clients through knowledge refinement, so that the refined local knowledge satisfies an approximate distribution constraint. Based on congruent knowledge during server-side distillation, the global model can be steadily updated towards the correct convergence direction, as shown in Figure 1(b). Guided by this insight, we propose the FedDKC algorithm; a detailed comparison between FedDKC and related state-of-the-art methods is shown in Table 2. Compared with existing proxy-data-free FD methods, our proposed FedDKC allows both model heterogeneity among clients and high communication efficiency, and is the first to leverage knowledge congruence to promote distillation performance.