16 March 2021
From the desk of Laura Norén
A timely explainer for academic data scientists
Federated learning is a technique in which models can be trained and updated without pooling data in a central data store or otherwise creating onward exposures of the original data. Federated learning came out of Google in 2016 and was initially more widely used in industry than in academic data science, likely because it solved scaling and consumer preference problems that were not as common in academia. However, federated learning is now showing up in academic medical research and master’s-level teaching.
Assistant Professor Corey Arnold (UCLA) and his postdoc Karthik Sarma (UCLA/Caltech) are using federated learning to train diagnostic models on MRI images from several health care providers without having to remove the MRI image data from home data repositories.
In 2020, the University of Cambridge added a module on federated learning to their MPhil course in Advanced Computer Science; many data science, computer science, and AI/ML programs do not yet include federated learning.
A brief overview of federated learning
Where traditional data analysis and modeling first gather all the data into a central location, then run computations against that data to produce a model or models, federated learning leaves the data in decentralized locations, runs the computations in all of those decentralized locations, and sends only the model parameters to the central hub. The central hub then computes one federated model based on the many model estimates and sends that federated model back out to all the members of the federation. As new data become available within the decentralized members of the federation, the process re-runs to update the model. Any updates to the primary model are always available to members of the federation, which is excellent in situations where federation members may have sparse or infrequent data generation. (Note: there are several more complex versions of federated learning, including a direct node-to-node architecture with no central server.)
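The train-locally, aggregate-centrally loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production recipe: the model is a one-variable linear regression, the hub combines parameters with a data-size-weighted mean (one common aggregation rule, often called federated averaging), and every name here (`local_update`, `federated_average`, `run_federation`, `make_member`) is hypothetical. The synthetic data, learning rate, and round counts are assumptions chosen only so the example runs quickly.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """One member trains y ~ w*x + b by gradient descent on its own data only."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(updates, sizes):
    """Hub combines member parameters, weighted by how much data each member holds."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def run_federation(members, rounds=50):
    """Each round: send the model out, train locally, send back parameters only."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in members]
        sizes = [len(data) for data in members]
        global_weights = federated_average(updates, sizes)
    return global_weights

def make_member(n, seed):
    """Synthetic local dataset drawn from y = 3x + 1 plus a little noise (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.05)))
    return data

# Three members with very different amounts of data; raw data never leaves a member.
members = [make_member(n, seed) for seed, n in enumerate([20, 50, 200])]
w, b = run_federation(members)
print(f"federated model: y = {w:.2f}x + {b:.2f}")
```

Note that the hub only ever sees `(w, b)` pairs and dataset sizes. Re-running `run_federation` after a member appends new rows to its local list corresponds to the update step described above.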
Federated learning was designed in the context of smartphone use cases where consumers prefer to keep their personal data on their phone but also want the latest phone-based models updated in near real-time. For instance, auto-correct helped me spell ‘Billie Eilish’ and ‘Megan Thee Stallion’ with my thick thumbs while texting about the Grammys. 🎸
What are key benefits of federated learning?
Federated learning has two key benefits:
- Federated learning splits the computational load among many devices. If you are paying for computation, this is appealing. Programmers, please note: battery etiquette asks that federated learning computations run only when devices are plugged in.
- Federated learning is more privacy-protecting because the data remains within its silos. Example: I want to spell Megan Thee Stallion’s name correctly but I may not want my thoughts about Megan Thee Stallion’s lyrics in a database somewhere. 🎤
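The battery etiquette in the first bullet amounts to an eligibility check before a device joins a training round. Here is a minimal sketch, assuming a hypothetical `DeviceState` record reported by each device. Only the plugged-in condition comes from the text above; the idle and unmetered-network checks are extra assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    """Hypothetical snapshot a device reports before a training round."""
    device_id: str
    plugged_in: bool
    idle: bool                  # assumption: user is not actively using the device
    on_unmetered_network: bool  # assumption: e.g. Wi-Fi rather than a paid data plan

def is_eligible(state: DeviceState) -> bool:
    """A device trains only when doing so costs its owner no battery, attention, or data."""
    return state.plugged_in and state.idle and state.on_unmetered_network

def select_participants(devices):
    """Return the ids of devices that may take part in this round."""
    return [d.device_id for d in devices if is_eligible(d)]

devices = [
    DeviceState("phone-a", plugged_in=True, idle=True, on_unmetered_network=True),
    DeviceState("phone-b", plugged_in=False, idle=True, on_unmetered_network=True),
    DeviceState("laptop-c", plugged_in=True, idle=False, on_unmetered_network=True),
]
participants = select_participants(devices)
print(participants)
```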
Should we be pleased about offloading compute costs to distant devices?
There is no definitive answer to this question, but there are a couple of common considerations. In cases where the entire network is within your AWS instance or Snowflake account, you’ll pay for all the computation anyway, but with federated learning you can at least assess how much each tenant’s or customer’s computations are costing your company or research grant. If you’re an app developer and some of the computations are happening on your customers’ phones, laptops, or IoT devices, the cost-control objectives are even more obvious, but the computational complexity can be limited by the device type. Phones and laptops are usually computationally sufficient and plugged in regularly, but it is challenging to run federated learning on solar-powered devices clipped to small bird species. 🦜
Does federated learning solve privacy protection?
Federated learning provides a technical strategy that allows a great deal of data to remain where the original owner can control access to it. This is good for privacy and security. However, there are a number of papers suggesting additional constraints to prevent model parameters from revealing sensitive information, even if the underlying data are kept on the local device. For instance, imagine a query language with 500 possible query terms that can be used by the 100 customers of an app. In this example, 80 percent of queries use only ten of the 500 available query terms. The other 490 terms are infrequently used. A matrix that represents each query term as an entity would be sparsely populated outside the top ten, so any customer who frequently used a rare term could produce a model coefficient for that term that differs substantially from every other customer’s and points straight back to that organization. This could reveal sensitive corporate information (imagine a query term closely correlated with being acquired, scheduling a large layoff, or investigating a data breach).
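The query-term example can be simulated to show how little the hub needs in order to spot the outlier. This is a toy sketch under stated assumptions: each customer’s "model" is just its normalized term frequencies (a stand-in for real model coefficients), each customer issues 1,000 queries, and the indices `RARE_TERM` and `HEAVY_USER` and the 30 percent usage rate are invented for illustration.

```python
import random

NUM_TERMS = 500
NUM_CUSTOMERS = 100
NUM_TOP_TERMS = 10
RARE_TERM = 437    # hypothetical sensitive term, outside the top ten
HEAVY_USER = 42    # hypothetical customer who suddenly uses it a lot

def simulate_counts(rng, n_queries, rare_term_rate):
    """Count term usage: 80% of ordinary queries hit the top ten terms."""
    counts = [0] * NUM_TERMS
    for _ in range(n_queries):
        if rng.random() < rare_term_rate:
            term = RARE_TERM
        elif rng.random() < 0.8:
            term = rng.randrange(NUM_TOP_TERMS)
        else:
            term = rng.randrange(NUM_TOP_TERMS, NUM_TERMS)
        counts[term] += 1
    return counts

def local_parameters(counts):
    """The only thing sent to the hub: per-term frequencies, not the queries."""
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
updates = []
for customer in range(NUM_CUSTOMERS):
    rate = 0.3 if customer == HEAVY_USER else 0.0
    updates.append(local_parameters(simulate_counts(rng, 1000, rate)))

# The hub inspects one coordinate of each update and looks for the outlier.
rare_values = [u[RARE_TERM] for u in updates]
suspect = max(range(NUM_CUSTOMERS), key=lambda c: rare_values[c])
others = [v for c, v in enumerate(rare_values) if c != suspect]
print(f"suspect customer: {suspect}")
print(f"suspect value: {rare_values[suspect]:.3f}, largest other: {max(others):.3f}")
```

No raw query ever leaves a customer in this simulation, yet a single parameter is enough to single out one organization.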
All criticisms considered, federated learning is fundamentally a more privacy-preserving approach than strategies that allow less control over data for users, though there is still scope to build additional safeguards and strategies.
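One commonly proposed family of additional safeguards bounds how much any single member can move the model and then blurs the aggregate. The sketch below clips each update to a maximum L2 norm and adds Gaussian noise to the average. It is an illustration of the idea only: the function names and the `max_norm` and `noise_std` parameters are hypothetical, and the noise here is not calibrated to any formal privacy guarantee.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    if norm == 0 or norm <= max_norm:
        return list(update)
    scale = max_norm / norm
    return [v * scale for v in update]

def noisy_clipped_average(updates, max_norm, noise_std, rng):
    """Average the clipped updates, then add Gaussian noise to each coordinate."""
    clipped = [clip_update(u, max_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    return [
        sum(u[i] for u in clipped) / n + rng.gauss(0, noise_std)
        for i in range(dim)
    ]

rng = random.Random(0)
updates = [
    [0.1, -0.2, 0.05],
    [0.0, 0.1, -0.1],
    [9.0, 0.0, 0.0],   # an outlier that would otherwise dominate the average
]
aggregate = noisy_clipped_average(updates, max_norm=0.5, noise_std=0.01, rng=rng)
print([round(v, 3) for v in aggregate])
```

The trade-off is accuracy: tighter clipping and more noise hide individual members better but slow or degrade learning, so both values would need tuning for a real federation.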
Could federated learning offer an end-run around privacy protections and corrode data guardianship?
From a tech ethics perspective, there are some legitimate concerns about which types of data may become available for training federated models that would not otherwise be available. For instance, medical data is generally protected by HIPAA and cannot be shared without explicit consent. If federation allows model training without data sharing, this raises important questions about whether federated learning could be used in applications that are either not net beneficial or that privilege and prioritize those who are already advantaged over those who are not. When medical data is used for research purposes that are likely to be net beneficial and whose benefits are shared equitably across the population, federated learning is a tool for good. But if federated learning were used by private insurance companies to, say, decline to offer plans in states where access to health care is already challenging, it’s not clear that celebrating privacy preservation is the proper conversation to have about federated learning. In other words, just because a tool or technique is privacy preserving does not mean it is net beneficial or equitably beneficial.
If federated learning is used to avoid protections built under the heading “privacy” which were actually meant to serve broader ethical goals, there is reason to pay close attention to net benefit and equitable distribution of benefit. No technology should be presented as inherently ethical. Too often, privacy-protecting technologies and applications are seen as de facto ethical or net beneficial.
In the academic applications of federated learning that I have seen, the net benefits are present and prominent.
Getting started with Federated Learning
👩‍💻 To do more:
- paperswithcode has 133 papers tagged with Federated Learning. A substantial number of these are software papers outlining tools available in open source repositories.
- Specifically check out: Flower, the open source FL package that the University of Cambridge is using to teach federated learning.
📚 To read more:
- Short and basic:
Brendan McMahan and Daniel Ramage. (2017) “Federated Learning: Collaborative Machine Learning without Centralized Training Data” Google AI Blog.
- Overview of federated learning in medical imaging applications:
Kaissis, G.A., Makowski, M.R., Rückert, D. et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2, 305–311 (2020). https://doi.org/10.1038/s42256-020-0186-1
- Solid, highly cited explainer on federated learning 101:
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. (2019) Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. https://doi.org/10.1145/3298981