FLCert: Provably Secure Federated Learning against Poisoning Attacks (Paper Reading Notes)
FLCert is a paper published in TIFS'22 on mitigating poisoning attacks against federated learning. The relevant citations and sources are as follows:
AAAI'21 version
Cao, Xiaoyu, Jinyuan Jia, and Neil Zhenqiang Gong. "Provably Secure Federated Learning against Malicious Clients." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021. PDF: arXiv:2102.01854.
TIFS'22 version
Cao, Xiaoyu, et al. "FLCert: Provably Secure Federated Learning against Poisoning Attacks." IEEE Transactions on Information Forensics and Security 17 (2022): 3691-3705.
The TIFS'22 paper is an extension of the AAAI'21 paper. These notes mainly walk through the AAAI'21 version and point out the parts added in the TIFS extension.
Background and Motivation
Existing defenses against poisoning attacks on federated learning include Byzantine-robust aggregation rules such as Krum [9], trimmed mean, and median, as well as FLTrust. However, these methods cannot provably guarantee that the predicted label for a testing example is not affected by malicious clients.
Contributions
The paper addresses this problem by applying ensemble learning to federated learning, i.e., ensemble federated learning.
Recalling the explanation in Zhou Zhihua's Watermelon Book: ensemble learning completes a learning task by building and combining multiple learners, and is sometimes called a multi-classifier system or committee-based learning.
The general structure and steps of ensemble learning:
First, produce a set of individual learners.
An individual learner is usually produced from the training data by an existing learning algorithm, e.g., a decision tree or a neural network (CNN). If all individual learners use the same algorithm, the ensemble is called homogeneous; otherwise it is heterogeneous.
Then combine them with some combination strategy.
Common combination strategies:
- For regression models (numeric predictions): averaging, either simple averaging or weighted averaging over the individual learners.
- For classification models (label predictions): voting, either majority voting or weighted voting.
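As a tiny illustration of the two strategies (my own sketch, not from the paper):

```python
from collections import Counter

def averaging(predictions):
    """Simple averaging for regression ensembles."""
    return sum(predictions) / len(predictions)

def majority_vote(labels):
    """Majority voting for classification ensembles."""
    return Counter(labels).most_common(1)[0][0]

print(averaging([2.0, 2.5, 3.0]))      # 2.5
print(majority_vote([1, 0, 1, 1, 2]))  # 1
```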
The proposed framework is as follows:
First, produce a set of individual learners.
In this paper, the individual learners are generated by randomly sampling subsets of clients. Concretely, given n clients and k clients per group, there are \(C_n^k\) possible combinations, i.e., \(C_n^k\) groups of k clients each. Training one model per group yields \(C_n^k\) global models.
Then use majority voting for label prediction in the classification task.
The \(C_n^k\) global models each predict a label for the same testing example, and the label receiving the most votes (majority voting) is taken as the final prediction.
This makes it possible, in federated learning, to provably guarantee that the predicted label for a testing example is not affected by (a bounded number of) malicious clients.
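To make the pipeline concrete, here is a minimal sketch (my own, not the paper's code); `train_global_model` is a hypothetical stand-in for running the base FL algorithm (e.g., FedAvg) on one client group, and `model(x)` returns a predicted label:

```python
from itertools import combinations
from collections import Counter

def ensemble_predict(clients, k, x, train_global_model):
    """Train one global model per k-client group, then majority-vote on x."""
    votes = Counter()
    for group in combinations(clients, k):  # all C(n, k) groups
        model = train_global_model(group)   # hypothetical base-FL trainer
        votes[model(x)] += 1
    return votes.most_common(1)[0][0]       # most-voted label wins
```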
Theoretical results
- our ensemble global model provably predicts the same label for a testing example x when the number of malicious clients is no larger than a threshold, which we call certified security level.
- The derived bound is tight, i.e., when no assumptions are made on the base federated learning algorithm, it is impossible to derive a certified security level that is larger than ours.
Caveat: the certified security level may be different for different testing examples.
Implementation difficulty and solution
Label probability \(p_i\): define \(p_i\) as the fraction of the \(C_n^k\) global models that predict label i for x, where \(i = 1, 2, \cdots, L\).
Computing the certified security level requires the largest and second largest label probabilities over the \(C_n^k\) predictions, i.e., the vote counts of the most-voted and second-most-voted labels.
When \(C_n^k\) is small this is affordable, but when \(C_n^k\) is large the computation becomes prohibitive, so the paper develops a Monte Carlo algorithm to estimate them with probabilistic guarantees by training only N global models (the number of sampled groups) instead of \(C_n^k\).
Monte Carlo method: a probability-and-statistics-based approach that approximates a problem's solution via random sampling. In general, the more samples, the closer the result gets to the true answer, though it never reaches it exactly.
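A minimal sketch of the sampling idea (my own; the paper's actual algorithm additionally converts these counts into probabilistic lower/upper bounds). `train_global_model` is the same hypothetical trainer as above:

```python
import random
from collections import Counter

def mc_label_probs(clients, k, x, train_global_model, N=500, seed=0):
    """Estimate label probabilities from N random groups instead of
    enumerating all C(n, k) of them."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(N):
        group = rng.sample(clients, k)       # one random k-client group
        model = train_global_model(group)    # hypothetical base-FL trainer
        votes[model(x)] += 1
    # Empirical frequencies; the paper turns such counts into a lower
    # bound on p_y and an upper bound on p_z at a chosen confidence level.
    return {label: cnt / N for label, cnt in votes.items()}
```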
Details of the proposed Ensemble Federated Learning framework
The provable security guarantee against malicious clients is defined as:
\[h(C', x) = h(C, x), \quad \forall\, C' \ \text{with}\ M(C') \le m^*\]
Here C is the set of n clients; C' is the set of n clients in which some are malicious; M(C') is the number of malicious clients in C'; \(m^*\) is the maximum number of malicious clients that can be tolerated, i.e., the certified security level; and \(h(C, x)\) is the label predicted for x by the ensemble global model trained on C.
Deriving certified security level using exact label probabilities
Let \(p_y\) be the largest label probability (for label y) and \(p_z\) the second largest (for label z); let \(p_y'\) and \(p_z'\) be the corresponding label probabilities when malicious clients are present. Suppose there are m malicious clients. Then a fraction \(1 - \frac{C_{n-m}^k}{C_n^k}\) of the sampled groups contains at least one malicious client, and in the worst case:
\[p_y' = p_y - \left(1 - \frac{C_{n-m}^k}{C_n^k}\right)\]
\[p_z' = p_z + \left(1 - \frac{C_{n-m}^k}{C_n^k}\right)\]
Therefore, to guarantee the predicted label stays unchanged, we need
\[ \begin{aligned} p_y' &> p_z'\\ p_y - \left(1 - \frac{C_{n-m}^k}{C_n^k}\right) &> p_z + \left(1 - \frac{C_{n-m}^k}{C_n^k}\right)\\ p_y - p_z &> 2 - \frac{2\,C_{n-m}^k}{C_n^k} \end{aligned} \]
Thus the certified security level \(m^*\) is the largest m satisfying the inequality above; see the sketch below.
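A small sketch (mine, not the paper's code) that searches for that largest m, given exact label probabilities:

```python
from math import comb

def certified_security_level(p_y, p_z, n, k):
    """Largest m with p_y - p_z > 2 - 2*C(n-m, k)/C(n, k) (exact probs)."""
    m = 0
    while m + 1 <= n - k:  # keep C(n-m, k) well defined (n - m >= k)
        ratio = comb(n - (m + 1), k) / comb(n, k)
        if p_y - p_z > 2 - 2 * ratio:
            m += 1
        else:
            break
    return m

# Hypothetical numbers: 100 clients, groups of 10, p_y = 0.8, p_z = 0.2.
print(certified_security_level(0.8, 0.2, n=100, k=10))  # -> 3
```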
Deriving certified security level using approximate label probabilities:
\(p_y - p_z \ge \underline{p_y} - \overline{p_z}\), where \(\underline{p_y}\) is a lower bound of \(p_y\) and \(\overline{p_z}\) is an upper bound of \(p_z\).
The paper proposes a Monte Carlo algorithm to estimate the lower bound \(\underline{p_y}\) and the upper bound \(\overline{p_z}\) by training only N of the \(C_n^k\) global models.
This yields the certified condition \(\underline{p_y} - \overline{p_z} > 2 - \frac{2C_{n-m}^k}{C_n^k}\).
Finally, the certified security level \(m^*\) derived from this formula is not tight, because the exact label probabilities are vote counts over the \(C_n^k\) global models and hence integer multiples of \(\frac{1}{C_n^k}\). The bounds \(\underline{p_y}\) and \(\overline{p_z}\) (and thus their difference) should therefore be normalized to multiples of \(\frac{1}{C_n^k}\). To find the multiple of n nearest to x, one can use `nearest_multiple = n * round(x / n)`; replacing round with ceiling or floor gives the two rounding directions needed here: round \(\underline{p_y}\) up and \(\overline{p_z}\) down, so both bounds stay valid while becoming tighter. See the sketch below.
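A sketch of that normalization (my own reading of the step, not the paper's code):

```python
from math import ceil, floor, comb

def normalize_bounds(p_y_lower, p_z_upper, n, k):
    """Snap both bounds to integer multiples of 1/C(n, k).

    The exact label probabilities are vote counts divided by C(n, k),
    so rounding the lower bound up and the upper bound down keeps both
    bounds valid while making them (and their difference) tighter.
    """
    total = comb(n, k)                        # number of possible groups
    p_y_lower = ceil(p_y_lower * total) / total
    p_z_upper = floor(p_z_upper * total) / total
    return p_y_lower, p_z_upper
```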
TIFS extension
The TIFS journal version refines how clients are grouped, proposing two grouping methods: FLCert-P and FLCert-D.
- FLCert-P (random): randomly samples the clients in each group
- FLCert-D (deterministic): divides the clients into disjoint groups deterministically
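A rough sketch contrasting the two strategies (mine; the hash-based assignment in the FLCert-D sketch is one natural deterministic choice, not necessarily the paper's exact construction):

```python
import hashlib
import random

def flcert_p_groups(client_ids, k, num_groups, seed=0):
    """FLCert-P style: each group is an independent random k-client sample."""
    rng = random.Random(seed)
    return [rng.sample(client_ids, k) for _ in range(num_groups)]

def flcert_d_groups(client_ids, num_groups):
    """FLCert-D style: deterministic, disjoint groups; each client is
    assigned by a hash of its ID, so the grouping is reproducible."""
    groups = [[] for _ in range(num_groups)]
    for cid in client_ids:
        g = int(hashlib.sha256(str(cid).encode()).hexdigest(), 16) % num_groups
        groups[g].append(cid)
    return groups
```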
Summary
- Applying ensemble learning to defend against poisoning attacks is a novel idea.
- The problem-solving flow is natural: each difficulty is traced back to its cause and then resolved.
- The application of number theory (combinatorial counting) is well executed.
Points worth borrowing: