Federated learning in artificial intelligence refers to the practice of training AI models across multiple independent, decentralized training sessions.
Traditionally, AI models required a single, centralized dataset and a unified system of training. While this method is a straightforward approach to training an AI tool, putting all of that data in one location can present undue risk and vulnerabilities, especially when that data needs to move around.
That’s where federated learning comes in — as a response to the privacy and security concerns brought on by traditional machine learning methods.
In this article, we’ll explore how AI training works, how federated learning differs and what benefits organizations can expect from adopting federated learning models. Let’s dig in!
Before we dive into the details, let’s start with a quick overview of AI training, or machine learning.
AI training is the process of teaching a system to acquire, interpret and learn from an incoming source of data. To accomplish those three goals, AI models go through a few steps:
As a first step, an AI model is fed a large amount of prepared data and is asked to make decisions based on that information. This data is generally filled with all kinds of tags or targets that tell the system how it should categorize the information — like thousands of sets of training wheels all guiding the system toward the desired outcome. Engineers make adjustments here until they see acceptable results and an acceptable degree of accuracy on the tagged data.
Once the AI model has trained on its initial data set and adjustments have been made, the system is fed a new set of prepared data, and performance is evaluated. This is where we keep an eye out for any unexpected variables, and verify that the system is behaving as expected.
After passing validation, the system is ready to meet its first set of real-world data. This data is unprepared and contains none of the tags or targets that the initial training and validation sets included. If the system can parse this unstructured data and return accurate results, it's ready to go to work. If not, it heads back to the training stage for another iteration of the process.
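The three stages above — train on tagged data, validate and adjust, then evaluate on held-out data — can be sketched as a toy loop. This is a minimal illustration, not any specific framework's API: the one-dimensional threshold classifier, the `train` and `accuracy` helpers, and the 0.9 validation cutoff are all assumptions made for the example.

```python
import random

def train(examples, epochs=50, lr=0.1):
    """Fit a 1-D threshold classifier: predict class 1 if x > w."""
    w = 0.0
    for _ in range(epochs):
        for x, label in examples:      # tagged training data
            pred = 1 if x > w else 0
            w += lr * (pred - label)   # nudge the threshold toward fewer errors
    return w

def accuracy(w, examples):
    return sum((1 if x > w else 0) == y for x, y in examples) / len(examples)

random.seed(0)
# Labeled data: class 1 clusters above 5, class 0 below it.
data = [(random.uniform(0, 5), 0) for _ in range(50)] + \
       [(random.uniform(5, 10), 1) for _ in range(50)]
random.shuffle(data)
train_set, valid_set, test_set = data[:60], data[60:80], data[80:]

w = train(train_set)                 # 1. train on tagged data
if accuracy(w, valid_set) < 0.9:     # 2. validate; adjust and retrain if needed
    w = train(train_set, epochs=200)
print(f"test accuracy: {accuracy(w, test_set):.2f}")  # 3. evaluate on held-out data
```

In a real pipeline, the "adjustment" step would mean tuning hyperparameters or revisiting the data, and the final evaluation would use genuinely unseen production data rather than a held-out slice.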
In the traditional machine learning paradigm, all of that training data sits in one location — meaning data may need to be exchanged between a central cloud location and the model.
This presents some serious issues:
AI models train on large datasets. These datasets are often stored in silos, and a single, unified, centralized data platform may not be available to store them in their entirety. Different datasets may be highly relevant and valuable for AI model training, yet subject to strict data privacy limitations: the data may be proprietary or contain sensitive personally identifiable information about end users. As a result, only a limited number of users may be authorized to access the relevant datasets.
The use of sensitive data may be subject to stringent compliance regulations and liability for damages in the event of a data breach or security incident. Data anonymization in data transfer, data storage and data pipeline processes may not be possible if the algorithms do not allow such provisions or if the necessary resources are not available.
In a traditional machine learning approach, data distribution can be highly homogeneous, and that can be a problem.
AI models that are trained on a limited curated data set from a few sources may not adequately learn and represent the complete data distribution of these sources. If the training data does not represent a large and diverse volume of its underlying distribution, the learned model may not be able to infer (predict) accurately.
This can also create learning imbalances and biases that are highly discouraged in real-world applications. For instance, if the training data is curated from specific demographic groups, the models may underperform on data from other demographic groups that were not available during training. A scalable learning model should be able to train on large volumes of data from many sources simultaneously. This is sometimes referred to as global training, which may be impossible due to the limitations stated above.
(Learn more about bias & similar ethical concerns in AI.)
Traditional ML algorithms often require access to large amounts of data for training. This data might contain sensitive information, and the process of collecting, storing, and sharing this data can increase the risk of data exposure, breaches, or leaks.
Furthermore, while the model is still in development, it can be an attack vector itself through what professionals refer to as “model inversion.” By exploiting ML model vulnerabilities, attackers can infer sensitive information about individual data points used in the training process. This involves querying the model with carefully crafted inputs to learn about the training data or individual records, which could compromise the privacy and security of the dataset as a whole.
So how does federated learning work, and how does it solve these key problems?
To start, in federated learning, instead of using the so-called “global” training regime described above, the training process is divided into independent and localized sessions.
A base model is prepared using a generic, large dataset — this model is then copied and sent out to local devices for training. Models can be trained on smartphones, IoT devices or local servers that house data relevant to the task the model is aiming to solve. The local data generated by these devices will be used to fine-tune the model.
As the models train, there will be small iterative updates happening within them as they get closer to achieving their desired level of performance — these small updates are called gradients. Rather than sending the full local dataset back from the device, federated learning sends only the model's gradients back to the central server.
As all of these gradients are sent to that central server, the system can average all of the update information and create a reflection of the combined learning of all participating devices.
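The copy-out, local-training and server-side averaging steps just described can be sketched as a toy federated averaging loop. This is an illustrative sketch, not a production framework: the threshold model, the `local_update` and `make_device_data` helpers, and the choice of three devices and five rounds are all assumptions made for the example.

```python
import random

random.seed(1)

def make_device_data(boundary, n=40):
    """Private data held on one device; class 1 lies above the device's boundary."""
    neg = [(random.uniform(boundary - 3, boundary), 0) for _ in range(n // 2)]
    pos = [(random.uniform(boundary, boundary + 3), 1) for _ in range(n // 2)]
    return neg + pos

def local_update(global_w, local_data, lr=0.05, epochs=20):
    """Fine-tune a copy of the global threshold on local data, then return
    only the accumulated update -- the raw data never leaves the device."""
    w = global_w
    for _ in range(epochs):
        for x, label in local_data:
            pred = 1 if x > w else 0
            w += lr * (pred - label)
    return w - global_w

# Three devices whose local data distributions differ slightly.
devices = [make_device_data(b) for b in (4.0, 5.0, 6.0)]

global_w = 0.0                               # base model from the central server
for _ in range(5):                           # five federated training rounds
    updates = [local_update(global_w, d) for d in devices]
    global_w += sum(updates) / len(updates)  # average updates into the global model

print(f"global threshold after training: {global_w:.2f}")
```

The key property mirrors the text: each device sends back only its small update, and the server's average reflects the combined learning of all participants without ever seeing their data.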
To reiterate the basic structure of a federated learning model: a base model is copied out to local devices, each device trains on its own data, and only the resulting updates are sent back and averaged into the global model.
Much like a traditional AI learning model, this process is repeated multiple times until the model reaches a state where it can perform well across diverse and varied datasets. Once the desired level of performance is achieved and confirmed, the global model is ready for deployment.
What we’ve just described is a simple case of federated machine learning. As this process has become more commonplace, organizations have seen a number of operational benefits.
Modern federated AI algorithms may use a variety of training regimes, data processing and parameter-updating mechanisms, depending on the performance goals and the challenges facing federated AI.
This posting does not necessarily represent Splunk's position, strategies or opinion.