Federated Learning 101: An Introduction to Privacy-Conscious Machine Learning and Why Companies Should Consider It
Introduction
71% of users expect a personalized experience when shopping online, as revealed by a McKinsey study from November 2021 [1]. Similarly, a study published in 2021 [2] finds that between 77% and 89% of users in Germany, Great Britain, and the United States accept personalized recommendation services in areas such as entertainment and shopping applications. In fact, leading providers like Amazon, Spotify, and Netflix are known to be at the forefront of applying machine learning to deliver such personalized experiences. However, the same study finds that a majority of users are opposed to the collection and use of their personal data for that purpose. Moreover, more than 80% of users in these countries express concerns about data privacy, compared to 68% worldwide according to the IAPP Privacy and Consumer Trust Report from March 2023 [3].
Legislation like the General Data Protection Regulation (GDPR) in the EU, the California Consumer Privacy Act (CCPA) in the US, and the Personal Information Protection Law (PIPL) in China addresses these user concerns. These laws introduce strict rules for data localization and minimization, require explicit consent before data collection and usage, impose penalties for data breaches, and more.
Companies are faced with a dilemma: How can they provide personalized services while ensuring that sensitive data is kept private and not exploited? In this blog article, we will explore Federated Learning as a solution to this dilemma.
Federated learning can provide personalization and data privacy at the same time.
In Federated Learning, the key idea is to "bring the model to the data" rather than bringing the data to the model, as in more traditional approaches. We will delve a bit deeper, but in short: Devices (such as smartphones) train a model on their local datasets and only share model weights with a central server (such as a company). The central server collects weights from all participating clients and calculates a new global model from them. Effectively, the new global model is trained on all (potentially sensitive) data without the data ever leaving the clients' devices and control.
In this article, we provide an overview of federated learning: why it exists and what it promises, how it works, what common challenges arise, and where it has already been applied. In particular, we focus on the privacy-preserving aspect of federated learning, providing a high-level technical understanding and pointing out key benefits for businesses as well as implementation challenges.
For deep dives and more perspectives on federated learning, we refer to future articles and the reference list at the bottom.
Why Federated Learning is Needed: Aligning users' expectations and data privacy regulations
Users expect personalization in digital services. Companies like Netflix, Amazon, and Spotify have proven that data-hungry machine learning models can provide such personalization and be a competitive edge. However, at the same time, users have widespread concerns about data privacy, and regulatory requirements make it harder for companies to use personal data for personalization. In this section, we look at some of the regulatory aspects with which federated learning can help companies stay compliant.
Regulatory Aspects
Data privacy is a primary concern for users around the world when it comes to digital and AI-empowered services. 68% of consumers worldwide are concerned about data privacy, with numbers reaching up to 80% in important markets like the EU or US. Policymakers around the world are addressing these concerns by passing laws like GDPR in the EU, CCPA in the US, or PIPL in China.
One core aspect of these legislations is the principle of data minimization, which mandates data processors to “limit the collection of personal information to what is directly relevant and necessary to accomplish a specified purpose.” [4]. Furthermore, they require data to be stored and processed within geographical boundaries (data localization) and only after users have given explicit consent. For example, European customer data is not allowed to be stored and processed outside the EU, nor is Chinese customer data allowed outside China.
Another aspect is the “right to erasure”, also known as the “right to be forgotten” (GDPR), or the “right to delete” (CCPA), which empowers users to remain in control of their personal data. They can revoke their consent and request the erasure of their personal data from company databases (provided it is not needed for, e.g., other legal purposes). Additionally, users must be able to access their data upon request.
Lastly, the regulations also mandate that data processors take measures to ensure data security. For example, companies must regularly assess risks like unlawful disclosure or access to personal data.
Companies must go to great lengths if they want to train machine learning models to provide personalized services. They must implement a multitude of processes to ensure compliance with data privacy regulations. However, many requirements could be fulfilled if there was no need to centralize the data for model training. Federated learning offers exactly that. In some cases, it can enable use cases that would otherwise be impossible due to the data localization principle. In other cases, it can significantly simplify compliance with the right to be forgotten and the data minimization principle, or help prevent data breaches, since personal data does not need to be collected and stored in companies’ databases.
How Federated Learning Works
The federated learning workflow. Figure from Wikimedia.
Decentralized machine learning can simplify compliance with data regulations and enable model training in environments with bandwidth or other data-sharing constraints. But how does it work? Most federated learning workflows involving a central coordinator (dubbed "centralized federated learning") can be described in terms of six steps [6]:
- The central server initializes a global model.
- The central server selects client devices to participate in the next round of model training and shares the current global model with them.
- The clients train the global model using only their local data and derive local model updates. At this point, every client has its own updated model.
- The clients share their model updates with the central server.
- The central server aggregates the local updates to derive a global update and applies it to the global model.
- The global model is shared with all clients. Steps 2-6 are repeated until some convergence criterion is met.
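The six steps above can be sketched in a few lines of Python. This is a toy illustration with a one-parameter "model" and made-up client data, not a production implementation; real systems use frameworks such as Flower or TensorFlow Federated:

```python
import random

def train_locally(global_weights, local_data, lr=0.1):
    """Toy stand-in for local SGD: nudge weights toward the client's data mean."""
    data_mean = sum(local_data) / len(local_data)
    return [w + lr * (data_mean - w) for w in global_weights]

def federated_round(global_weights, clients, fraction=0.5, seed=0):
    """One round: select clients, train locally, average updates (FedAvg-style)."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(clients)))
    selected = rng.sample(list(clients.items()), k)  # step 2: client selection
    updates, sizes = [], []
    for _, data in selected:
        updates.append(train_locally(global_weights, data))  # step 3: local training
        sizes.append(len(data))                              # step 4: share updates
    total = sum(sizes)
    # Step 5: weighted average of local models, weighted by dataset size
    return [sum(u[i] * n for u, n in zip(updates, sizes)) / total
            for i in range(len(global_weights))]

# Hypothetical client datasets; raw values never leave the "devices"
clients = {"phone_a": [1.0, 1.2], "phone_b": [0.8], "phone_c": [1.1, 0.9, 1.0]}
weights = [0.0]  # step 1: the server initializes the global model
for r in range(20):  # step 6: repeat rounds until convergence
    weights = federated_round(weights, clients, seed=r)
```

Note that only model weights cross the network; the lists in `clients` stay local throughout.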
Following these steps ensures that the clients’ data never leaves their local databases and is kept private. At the same time, the data is still utilized for model training, and the final model benefits from a larger and more diverse dataset. Clients can be devices such as smartphones or other IoT devices, or they can be companies that collaborate in training a model. The first case is called "cross-device" (typically many clients) and the latter "cross-silo" (typically fewer than 100 clients) federated learning.
As straightforward as it might seem, these steps come with unique challenges. We will look into each of the steps in a bit more detail but will also refer to other resources or future blog articles for deeper dives.
Global model initialization
Initializing a global model includes defining the model architecture (think, for example, a deep neural network) as well as the initial parameters. Since the central server is prohibited from accessing raw data, it might be difficult for it to define an appropriate network architecture that can capture complex patterns in the data, generalize well, and respect the resource constraints of client devices.
Client selection
A round of model training consists of the six steps outlined above. Clients participating in a training round must fulfill certain requirements. For example, they must have new data available for model training, they must have enough resources available for the training round (think, for instance, a sufficiently charged battery or idle CPU/GPU), or they need a stable network connection to share their updates. Moreover, the central server might need to ensure a certain diversity. If the model is constantly updated using the same few clients’ datasets, it will be biased towards those clients’ patterns and might be blind to other clients’ patterns. In fact, there are methods that select clients based on expected performance gains [7].
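A simple eligibility-based selection policy can be sketched as follows. The criteria and thresholds (battery level, Wi-Fi, fresh samples) are illustrative assumptions, not a standard:

```python
def eligible(client):
    """A client qualifies for a round only if it meets all resource criteria."""
    return (client["battery"] >= 0.8          # sufficiently charged
            and client["on_wifi"]             # stable, unmetered connection
            and client["new_samples"] > 0)    # fresh data to learn from

def select_clients(clients, num_needed):
    """One simple policy: pick the eligible clients with the most new data."""
    pool = [c for c in clients if eligible(c)]
    pool.sort(key=lambda c: c["new_samples"], reverse=True)
    return pool[:num_needed]

fleet = [
    {"id": "a", "battery": 0.90, "on_wifi": True,  "new_samples": 120},
    {"id": "b", "battery": 0.30, "on_wifi": True,  "new_samples": 500},  # battery too low
    {"id": "c", "battery": 0.95, "on_wifi": False, "new_samples": 80},   # no stable connection
    {"id": "d", "battery": 1.00, "on_wifi": True,  "new_samples": 40},
]
chosen = select_clients(fleet, num_needed=2)  # → clients "a" and "d"
```

A production selector would additionally enforce diversity, for example by rotating through client cohorts rather than greedily picking the same data-rich devices each round.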
Local model training
To obtain local model updates, clients perform a few rounds of training on their local data. Local training must respect the clients' resource constraints; for example, edge devices might not be capable of performing complex computations like backpropagation in deep neural networks. Moreover, the clients' data distributions might be heterogeneous, which is known to affect model convergence, as we will explore in the next chapter. Another challenge is protecting the training process from adversarial attacks. For instance, a client's local database might be corrupted. In that case, we would want to exclude model updates from that client to protect the global model [8].
Clients share their model updates with the central server
After completing local model training, clients need to send their updates to the central server for further processing. Therefore, clients must have a stable network connection to the central server. Furthermore, it is important to ensure that the connection to the central server is secure and cannot be infiltrated by attackers. Although clients never share their raw data, the model updates might still be considered sensitive information.
Aggregating local model updates
Once the local updates are available on the central server, it must compute a global update and apply it to the current version of the global model. An intuitive approach is to calculate the weighted average of all local updates. While this method works well when all clients' data follow a similar distribution, it is known to fail when the data is distributed differently (see, e.g., [9]). Fortunately, several aggregation strategies have been proposed to address this issue, and selecting the best aggregation strategy depends on various factors, such as performance and scalability requirements [10].
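The weighted-average rule (the FedAvg baseline) is easy to write down; the weight vectors and sample counts below are invented for illustration:

```python
def fedavg(updates):
    """Weighted average of client weight vectors, weighted by local dataset size.
    This is the aggregation rule of FedAvg (McMahan et al.)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# (weights, num_samples) pairs received from three clients
updates = [([1.0, 0.0], 100), ([3.0, 2.0], 300), ([2.0, 1.0], 100)]
new_global = fedavg(updates)  # → [2.4, 1.4]
```

More robust strategies (e.g., clustered or median-based aggregation) replace this averaging step when clients' data distributions diverge or when some updates may be malicious.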
Benefits for businesses
We explored several regulatory requirements for data privacy; however, this does not yet make a strong business case. In this section, we will discuss some benefits of federated learning for businesses.
Business potential of federated learning. Generated with you.com.
Storage cost efficiency
In
federated learning, raw data does not need to be stored in a central
location. By avoiding the need to store raw data, businesses can also reduce
some of their data storage costs. While storage is often considered
inexpensive, it can accumulate quickly. For instance, Azure's pricing
calculator indicates that 1TB of storage in the Azure SQL Database's
"Business Critical" service tier, with 16 vCores in the German region,
costs around 9,000 USD per month (calculated as of 10/2024)—a significant expense even for larger companies.
In federated learning, only the model
needs to be stored, and machine learning models are known to
effectively compress data. For example, the LLM Chinchilla 70b
compresses text data to just 8.3% of its original size
[11]. While this may be an extreme case, it highlights the potential for
cost savings in storage through large-scale machine learning via
federated learning. Moreover, transmitting model updates rather raw data also has lower bandwidth requirements with a potential of saving costs.
Improved user trust to increase brand loyalty
Users are increasingly concerned about data privacy (see, e.g., [2,3]), and enhanced data privacy can lead to increased user trust and loyalty. Apple exemplifies a company that leverages data privacy as a competitive advantage and key differentiator from its competitors. This is evident in its preference for on-device data processing, its ease of opting out of app tracking, and other privacy-focused initiatives [14]. Apple has made privacy a core principle of its product design, and aligning the company's values with those of its users contributes to an improved customer experience, as noted by [15]. This enhanced customer experience translates into tangible outcomes, such as high brand loyalty in the crowded handset market [16].
Federated learning can serve as a cornerstone of a strategy that prioritizes user trust without compromising the quality of intelligent services. This approach can lead to a positive customer experience, high brand loyalty, and improved retention rates.
Enhanced security
Not having to store raw (potentially sensitive) data not only increases storage cost efficiency but also reduces the risk of data exposure. According to an IBM research report, the average cost of a data breach was 4.88 million USD in 2024 [12]. As a result, companies allocate a significant portion of their IT budgets to cybersecurity measures, averaging 11.6% [13]. Employing federated learning can help reduce spending without sacrificing data security or the positive business outcomes associated with machine learning initiatives.
Challenges and Considerations
The "no free lunch theorem" teaches us that "there are no easy shortcuts to success" [17], and federated learning is no exception. In this section, we are looking into some of the challenges in federated learning and its implementation.
Technical complexity
Federated learning represents a paradigm shift in machine learning, presenting new challenges for data scientists and ML engineers. Developing a machine learning model typically involves extensive experimentation with the training dataset to identify a suitable model for production. However, in federated learning, there is no direct access to raw data, which can make extensive experimentation slow or even infeasible. Additionally, the distribution of the training data may be unknown, necessitating robust drift monitoring and mitigation strategies in production environments in case issues arise.
Furthermore, federated learning introduces new infrastructure requirements. For instance, all clients—potentially devices with varying computational resources—must have the capability for local data processing and machine learning. Moreover, it is important to robustly handle intermittent connectivity and manage the frequent joining and dropout of clients, as many more communication rounds are required compared to centralized training.
These challenges necessitate upfront investments and leaders who are willing to take calculated risks.
Communication Overhead
The high frequency of model information exchanges during the training process in federated learning results in significant communication overhead compared to non-federated training. Inefficient communication can lead to slow training or even render federated learning impractical, especially in bandwidth-limited environments.
Communication overhead is typically measured by the total time required for all clients to submit model updates to the central server until convergence, as well as the total size of these updates. Thus, factors such as the number of participating clients, the total number of rounds until convergence, and the model size influence this overhead. Data scientists and developers can impact model size by selecting appropriate architectures and algorithms. Fortunately, numerous methods have been proposed to achieve quicker convergence, such as strategically choosing a smaller number of participating clients or reducing model size through techniques like quantization, sparsification, and distillation. For excellent overviews, we refer to [18, 19].
However, it is important to note that other factors, such as computational power, network speed, and stability, are often difficult to influence [18].
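The factors named above can be combined into a rough traffic estimate. The numbers are hypothetical, and the model assumes each selected client downloads the global model and uploads an update of roughly the same size per round:

```python
def total_traffic_gb(num_clients_per_round, rounds, model_mb, include_downlink=True):
    """Rough total traffic over training: clients download the global model
    and upload an update of (roughly) the same size in every round."""
    per_round_mb = num_clients_per_round * model_mb * (2 if include_downlink else 1)
    return per_round_mb * rounds / 1024  # MB → GB

# Hypothetical scenario: 100 clients per round, 500 rounds, 25 MB model
print(total_traffic_gb(100, 500, 25))  # ≈ 2441 GB before any compression
```

This makes the leverage points visible: halving the model size (quantization, distillation) or the number of rounds (faster-converging aggregation) halves total traffic.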
Model accuracy
Figure from [2].
Achieving model accuracy comparable to that of centralized models in the presence of heterogeneous distributions (also known as non-iid data) is another significant challenge in federated learning. By heterogeneous distribution, we refer to situations where some clients have different data distributions than others, which can pertain to features, labels, or the joint distribution of features and labels.
In such cases, it is crucial to apply appropriate aggregation algorithms. Without them, the training process is likely to converge to a sub-optimal solution that does not generalize well across all clients' data, or it may converge slowly, negatively impacting communication overhead. As one of the major challenges in federated learning, numerous research efforts focus on proposing algorithms that can quickly converge to optimal solutions, even in the presence of non-iid data. For example, some methods cluster compatible model updates and only consider those for aggregation (known as clustered federation), while others adjust the loss function or propose sharing data among all clients for training when feasible [2, 20, 21].
Each of these approaches has its pros and cons and should be selected based on the specific requirements of the use case.
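What "non-iid" means in practice can be made concrete with a toy partitioning scheme. Label-distribution skew, where each client only ever sees a subset of the labels, is a common way to simulate heterogeneity in federated learning experiments; the function below is an illustrative sketch:

```python
import random

def label_skew_partition(samples, num_clients, labels_per_client, seed=0):
    """Toy non-iid split: each client is assigned a random subset of labels,
    and every sample is routed to one client that holds its label."""
    rng = random.Random(seed)
    labels = sorted({y for _, y in samples})
    client_labels = [rng.sample(labels, labels_per_client) for _ in range(num_clients)]
    shards = [[] for _ in range(num_clients)]
    for x, y in samples:
        owners = [i for i, ls in enumerate(client_labels) if y in ls]
        if owners:  # drop a sample only if no client covers its label
            shards[rng.choice(owners)].append((x, y))
    return shards

data = [(i, i % 4) for i in range(200)]  # 200 samples over 4 labels
shards = label_skew_partition(data, num_clients=5, labels_per_client=2)
# Every client now sees at most 2 of the 4 labels, so naive averaging of
# their local models pulls the global model in conflicting directions.
```

Evaluating a candidate aggregation strategy against such skewed shards is a cheap way to estimate how it will behave before deploying to real clients.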
Real-World Applications and Case Studies
Federated learning is still in its early adoption phase. In this section, we investigate pioneering use cases in two very different industries, showcasing the real-world applicability of federated learning.
Google's next word prediction
Next word prediction. Figure from [22].
One of the earliest real-world applications of federated learning was reported by Google in 2018. Google utilized federated learning to enhance the next word prediction feature in its virtual keyboard, Gboard (see Figure) [22]. In this initiative, Google trained an LSTM network to predict the next words a user types on the keyboard. Training such a network requires access to a text corpus with representative typing data, ideally consisting of actual texts that users have typed. By using federated learning, Google was able to train a model on sensitive text data while respecting users' privacy. Experiments conducted on live traffic data demonstrated that the federated learning approach outperformed the non-federated production model at the time, leading to improved personalization.
Subsequent work at Google has focused on enhancing privacy through the application of differential privacy, expanding federated learning to other Gboard features such as emoji prediction, and applying it to discover words that are "out of vocabulary" (think of emerging terms like "Covid-19" or user-specific spellings like "coooool") [23, 24].
In summary, Google employs federated learning to improve and personalize user experiences by leveraging the vast amount of data it has access to while also protecting users' privacy. This approach aligns user expectations for personalization with privacy concerns and ensures compliance with data regulations.
Roche enables data collaboration in the healthcare industry
Roche is one of the early adopters of federated learning in the healthcare industry. Figure from NVFlare Day 2024 [26].
Machine learning applications in the healthcare industry often rely on patient data for purposes such as disease prediction, treatment, and drug development [25]. However, because patient data is considered extremely sensitive, hospitals, pharmaceutical companies, and research institutions are often reluctant or prohibited from sharing this data. Despite this, larger and more diverse datasets can significantly enhance the performance of machine learning models. Federated learning offers a solution by enabling the use of a broader set of data points to improve drug development and diagnostics in healthcare.
For instance, industry leader Roche has positioned itself as an early adopter of federated learning, reporting successful implementations in real-world settings with client sites in Switzerland and the United States. They have trained a federated model for cell detection, demonstrating that it generalizes better than its non-federated counterpart. Additionally, Roche participates in multiple public-private research initiatives focused on disease discovery and cancer diagnosis and care, leveraging vast data sources through federated learning [26].
While Roche has successfully demonstrated the applicability of federated learning in real-world environments with sensitive data, they also acknowledge several implementation challenges. These include the lengthy duration of experiments and the necessity for multidisciplinary teams at each site to provide expertise in algorithm design, domain knowledge, and IT skills.
Overall, Roche applies federated learning across various scenarios to gain a competitive edge in research and enhance data utilization for machine learning applications. In addition to supporting research efforts, federated learning can also facilitate personalized disease treatment.
Conclusion
Studies show that users expect personalization from digital companies, yet they are often reluctant to provide personal data for this purpose. Federated learning offers a solution to this dilemma: it allows for the training of machine learning models on distributed (and sensitive) data without transferring the data to a central storage location.
Finally, while there are still technical challenges to overcome, industry giants like Google and Roche have begun implementing federated learning use cases and continue to invest in this technology.
If you are a tech leader seeking privacy-conscious machine learning solutions, federated learning is definitely worth considering. Do reach out if you are looking for advice or would simply like to bounce some ideas around.
References and further readings
- "The value of getting personalization right—or wrong—is multiplying", McKinsey & Company (Nov. 2021), Link (accessed 06.10.2024)
- Kozyreva, A., Lorenz-Spreen, P., Hertwig, R. et al. Public attitudes towards algorithmic personalization and use of personal data online: evidence from Germany, Great Britain, and the United States. Humanit Soc Sci Commun 8, 117 (2021). https://doi.org/10.1057/s41599-021-00787-w Link
- IAPP Privacy and Consumer Trust Report (March 2023), https://iapp.org/resources/article/privacy-and-consumer-trust-summary/ (accessed 08.10.2024)
- European Data Protection Supervisor, Data Minimization, Link (accessed 08.10.2024)
- GDPR Art. 32: Security of Processing, https://gdpr-info.eu/art-32-gdpr/ (accessed 08.10.2024)
- Kairouz et al. Advances and Open Problems in Federated Learning, Arxiv link (accessed 08.10.2024)
- Gouissem et al. A Comprehensive Survey On Client Selections in Federated Learning, Arxiv link (accessed 08.10.2024)
- Sagar et al. Poisoning Attacks and Defenses in Federated Learning: A Survey Arxiv Link (accessed 11.10.2024)
- Zhao et al., Federated Learning with Non-IID Data, Arxiv Link (accessed 12.10.2024)
- Moshawrab et al., Reviewing Federated Learning Aggregation Algorithms; Strategies, Contributions, Limitations and Future Perspective, MDPI link (accessed 12.10.2024)
- Adrian Wilkins-Caruana and Tyler Neylon, An elegant equivalence between LLMs and data compression, Link (accessed 12.10.2024)
- IBM Cost of a Data Breach Report 2024, Link (accessed 12.10.2024)
- Quest blog article by John Hernandez, Top considerations when creating a cybersecurity budget, Link (accessed 12.10.2024)
- Kif Leswing (CNBC), Apple is turning privacy into a business advantage, not just a marketing slogan, Link (accessed 12.10.2024)
- Renascene, How Apple Elevates Customer Experience (CX) Through Ecosystem Integration (2024), Link (accessed 12.10.2024)
- Yougov, The enviable brand loyalty of iPhone owners (2023), Link (accessed 12.10.2024)
- No free lunch theorem, Wikipedia (accessed 12.10.2024)
- Le et al., Exploring the Practicality of Federated Learning: A Survey Towards the Communication Perspective, Arxiv Link (accessed 13.10.2024)
- Zhao et al., Towards Efficient Communications in Federated Learning: A Contemporary Survey, Arxiv link (accessed 13.10.2024)
- Ghosh et al., An Efficient Framework for Clustered Federated Learning (NeurIPS 2020), Link (accessed 13.10.2024)
- Zeng et al., Tackling Data Heterogeneity in Federated Learning via Loss Decomposition, Arxiv Link (accessed 13.10.2024)
- Hard et al., Federated Learning for Mobile Keyboard Prediction (2018) Link (accessed 20.10.2024)
- Sun et al., Improving Gboard language models via private federated analytics (2024) Link (accessed 20.10.2024)
- McMahan et al., Federated Learning with Formal Differential Privacy Guarantees (2022), Link (accessed 20.10.2024)
- Ali et al., Federated Learning in Healthcare: Model Misconducts, Security, Challenges, Applications, and Future Research Directions-A Systematic Review (2024), Link (accessed 21.10.2024)
- "Roche: Unlocking patient-level data at scale with federated computing to drive collaborative research and advance science", NVFLARE day 2024 live recording and slides (accessed 21.10.2024)
Writing, research, and editing supported by you.com.