A Appendix
In this section, we discuss the convergence analysis of the proposed “FedCMD” method and the error propagation from the teacher to the student models, based on the results reported by the authors of References [28, 37]. Referring to the work presented in Reference [37], the authors use the Neural Tangent Kernel (NTK) [23] to show that the prediction error of the student NN in estimating the ground-truth vector \(\mathbf {y}\) is expressed as follows:

In Equation (2), C is a constant that depends on the teacher hyperparameters, including the distillation regularization parameter \(\lambda\); \(\mathbf {f}_{\mathcal {S}}\) and \(\mathbf {f}_{\mathcal {T}}\) are the logits of the student and teacher models; \(\mathbf {a}_{\mathcal {S}}\) is the vector of weights used by the student to generate the prediction output by linearly combining its logits; and \(N_H\) is the number of neurons in the hidden layer. This result implies that the student’s prediction error decreases as the pre-trained teacher model \(\mathbf {f}_{\mathcal {T}}\) approaches \(\mathbf {y}\).
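As an illustrative sketch only, and not a restatement of Equation (2) in Reference [37], a bound with the qualitative behavior described above can be written as
\[
\big \Vert \mathbf {a}_{\mathcal {S}}^{\top } \mathbf {f}_{\mathcal {S}} - \mathbf {y} \big \Vert _2 \;\le \; C \, \big \Vert \mathbf {f}_{\mathcal {T}} - \mathbf {y} \big \Vert _2 + \mathcal {O}\!\left(\frac{1}{\sqrt {N_H}}\right),
\]
where the first term vanishes as the teacher predictions approach the ground truth and the second term vanishes as the width of the hidden layer grows; the precise constants and exponents are those derived in Reference [37].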
To build the convergence analysis of our proposed approach, we rely on the results presented in Reference [28]. In that article, the authors perform a comprehensive convergence analysis of the FedAvg approach, considering scenarios with both IID and non-IID data distributions. In particular, they show that FedAvg converges to the global optimum over a total number of T FL rounds at a rate of \(\mathcal {O}(1/T)\). This holds even in the presence of non-IID data, provided that the local losses are strongly convex and smooth functions. Moreover, the authors find that E local update iterations with a small learning rate have an effect on the convergence rate similar to that of a single Stochastic Gradient Descent (SGD) step with a larger learning rate. This observation suggests that the global model update in FedAvg behaves like an SGD update when combined with appropriate sampling and averaging strategies; consequently, the FedAvg algorithm shows a convergence pattern comparable to that of SGD. Moreover, the study emphasizes the role of the learning rate in reducing the variance of the averaged sequence of global models, a variance attributed to the effects of partial user participation, highlighting the impact of the learning-rate choice on the stability and efficiency of the FedAvg algorithm. The results in Reference [28] provide valuable insights into the convergence behavior and optimization properties of FedAvg and support the analysis of our proposed approach.
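For concreteness, the FedAvg update pattern discussed above can be sketched as follows. This is a minimal illustration of the client-sampling, local-update, and averaging steps, not the implementation evaluated in this article; the toy quadratic clients and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

class QuadraticClient:
    """Toy client with local loss L_k(w) = 0.5 * ||w - t_k||^2 (strongly
    convex and smooth); its stochastic gradient is (w - t_k) plus noise."""
    def __init__(self, target, noise, rng):
        self.target, self.noise, self.rng = target, noise, rng

    def stochastic_gradient(self, w):
        return (w - self.target) + self.noise * self.rng.standard_normal(w.shape)

def fedavg_round(w_global, clients, p, K, E, lr, rng):
    """One FedAvg round: sample K of the N clients with weights p_k,
    run E local SGD steps on each, then average the local models."""
    idx = rng.choice(len(clients), size=K, replace=False, p=p)
    local_models = []
    for k in idx:
        w = w_global.copy()
        for _ in range(E):                      # E local update iterations
            w = w - lr * clients[k].stochastic_gradient(w)
        local_models.append(w)
    a = p[idx] / p[idx].sum()                   # renormalize weights over the sampled subset
    return sum(ak * wk for ak, wk in zip(a, local_models))

# Usage sketch: T rounds with a decaying learning rate and E = 3 local steps.
rng = np.random.default_rng(0)
clients = [QuadraticClient(rng.standard_normal(5), noise=0.1, rng=rng) for _ in range(10)]
p = np.ones(10) / 10                            # uniform selection weights p_k
w = np.zeros(5)
for t in range(50):                             # T federated learning rounds
    w = fedavg_round(w, clients, p, K=4, E=3, lr=0.1 / (1 + t), rng=rng)
```

With strongly convex and smooth local losses such as these, running the loop with a decaying learning rate reproduces the \(\mathcal {O}(1/T)\) behavior discussed above.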
In the following, we consider the upper bound found by the authors for the difference between the global minimum loss \(\mathcal {L}^*\) and the average \(\mathbb {E}[\mathcal {L}(\mathbf {w}_{\mathcal {S}}(T))]\) evaluated for the federated model after a total number of T iterations:

The difference given in Equation (3) depends on the total number of iterations T and on the number of local update iterations E performed by the users in the subset of K clients selected from the total set of N clients. It also depends on the properties of the loss function at each user, namely L-smoothness, \(\mu\)-strong convexity, an upper bound on the variance of the stochastic gradient, and an upper bound on the expected squared norm of the stochastic gradient, with parameters L, \(\mu\), \(\sigma _k\), and G, respectively. The bound also shows that the difference depends on the mean squared distance between the model at the first step \(\mathbf {w}_{\mathcal {S}}(1)\) and the model that attains the minimum global loss \(\mathbf {w}_{\mathcal {S}}^*\). The constant B in the equation accounts for the effect of the weights \(p_k,\ k=1,\dots ,K\), used to select the subset of users during the federation, and for the degree of non-IID of the data, \(\mathcal {L}^*-\textstyle \sum _{k=1}^N p_k \mathcal {L}_k^*\), where the first term is the global minimum loss and the second is the weighted sum of the minima of the local losses. In Equation (4), the authors derive the number of FL communication rounds needed to achieve the error in Equation (3).
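Although we do not restate Equations (3) and (4) here, bounds of this type in the FedAvg convergence literature typically take a form along the lines of
\[
\mathbb {E}[\mathcal {L}(\mathbf {w}_{\mathcal {S}}(T))] - \mathcal {L}^* \;\le \; \frac{\kappa }{\gamma + T}\left(\frac{2B}{\mu } + \frac{\mu \gamma }{2}\, \mathbb {E}\big \Vert \mathbf {w}_{\mathcal {S}}(1) - \mathbf {w}_{\mathcal {S}}^* \big \Vert ^2 \right), \qquad B = \sum _{k=1}^{N} p_k^2 \sigma _k^2 + 6 L \Gamma + 8 (E-1)^2 G^2,
\]
with \(\kappa = L/\mu\), \(\gamma = \max \lbrace 8\kappa , E\rbrace\), and \(\Gamma = \mathcal {L}^*-\textstyle \sum _{k=1}^N p_k \mathcal {L}_k^*\) quantifying the degree of non-IID; solving for the T that drives the left-hand side below a target accuracy yields the required number of communication rounds. This is only a sketch of the structure of the result, and the exact constants, in particular the additional variance term introduced by sampling K of the N clients, are those given in Reference [28].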
In this article, we exploit the above insights to characterize the error propagation in the federated model caused by the errors of the individual students. To keep the analysis general, we consider a simple three-layer student model based on an NN, consisting of an input layer, a hidden layer, and an output layer. The layers are fully connected, and a nonlinear Lipschitz-continuous function \(\delta (\mathbf {w}_{\mathcal {S}})\) performs the activation of the hidden layer and generates the output logits \(\mathbf {f}_{\mathcal {S}}=\delta (\mathbf {w}_{\mathcal {S}})\). The activation function is therefore the link between the logits of the student model and the locally generated and federated NN model weights. This means that, in our case, the minimum of each local loss \(\mathcal {L}_k^*(\mathbf {w}_{\mathcal {S}})\) in Equation (4) is given by Equation (2).
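A minimal sketch of such a student is given below, assuming a tanh activation (a Lipschitz-continuous choice) and a linear readout \(\mathbf {a}_{\mathcal {S}}\); the layer sizes, initialization, and activation are illustrative, not those used in our experiments.

```python
import numpy as np

class StudentNN:
    """Three-layer student: input -> hidden (N_H neurons, Lipschitz
    activation) -> output logits, combined linearly by a_S."""
    def __init__(self, d_in, n_hidden, d_out, rng):
        self.W1 = rng.standard_normal((n_hidden, d_in)) / np.sqrt(d_in)
        self.W2 = rng.standard_normal((d_out, n_hidden)) / np.sqrt(n_hidden)
        self.a_S = np.ones(d_out) / d_out        # linear combination weights a_S

    def logits(self, x):
        h = np.tanh(self.W1 @ x)                 # Lipschitz-continuous activation
        return self.W2 @ h                       # student logits f_S

    def predict(self, x):
        return self.a_S @ self.logits(x)         # prediction: a_S linearly combines f_S

# Usage sketch with illustrative dimensions.
rng = np.random.default_rng(0)
student = StudentNN(d_in=8, n_hidden=64, d_out=10, rng=rng)
x = rng.standard_normal(8)
y_hat = student.predict(x)
```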
Equation (4) shows that the main contribution of the distillation error corresponds to the degree of non-IID. In fact, in the case of IID data, \(\mathcal {L}^*-\textstyle \sum _{k=1}^N p_k \mathcal {L}_k^*\) approaches zero, and the only parameters responsible for this error are \(\sigma _k\) and G. Comparing the results in Table 8 with those in Table 7, we find that the federation method requires a higher number of rounds, which is again due to the non-IID data. We also note that the distribution of the weights \(p_k\) can affect convergence, as it allows us to select the clients whose loss functions have the lowest variance and the lowest squared norm of the stochastic gradient. Analyzing the results in Table 8, we find that the optimal number of local updates to achieve the minimum number of FL communication rounds is 3. Considering Equation (3), this value is due to the parameter E, which affects the convergence rate: values of E that are too small make FedAvg equivalent to SGD, while values that are too large decrease the convergence rate. Equation (3) also shows that the contribution of selecting a subset of users to the growth of the number of communication rounds is negligible.
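As an illustration of how such selection weights could be chosen in practice, the sketch below assumes that each client reports an estimate of its stochastic-gradient variance \(\sigma _k^2\) and of its expected squared gradient norm; the weighting rule itself is hypothetical and is not the one prescribed by Reference [28].

```python
import numpy as np

def selection_weights(sigma2, grad_sq_norm, eps=1e-8):
    """Build normalized selection weights p_k that favor clients with a low
    stochastic-gradient variance sigma_k^2 and a low expected squared
    gradient norm (a proxy for G^2)."""
    score = 1.0 / (sigma2 + grad_sq_norm + eps)    # higher score: better-behaved client
    return score / score.sum()                     # normalize so that sum_k p_k = 1

# Example: four clients with different gradient statistics.
sigma2 = np.array([0.1, 0.5, 0.2, 1.0])
grad_sq_norm = np.array([1.0, 2.0, 1.5, 3.0])
p = selection_weights(sigma2, grad_sq_norm)        # p_k used when sampling the K clients
```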