There’s a catch to Artificial Intelligence: it is vulnerable to adversarial attacks.
Any AI model can potentially be reverse engineered and manipulated because of inherent limitations in its algorithms and training process. Improving the robustness and security of AI is key for the technology to live up to its hype, which has been fueled by generative AI tools such as ChatGPT.
Enterprise organizations are rapidly adopting advanced generative AI agents for business applications that run the gamut:
Business process optimization
Product and marketing, including product analytics and web analytics
And many more applications and experiments
In this article, we will discuss how both the neural network training process and modern machine learning algorithms are vulnerable to adversarial attacks.
Adversarial Machine Learning (ML) is the name for any technique that involves misleading a neural network model or its training process in order to produce a malicious outcome.
Closely associated with cybersecurity, adversarial AI can be considered a cyberattack vector. Adversarial techniques can be executed at several stages of the model lifecycle:
During training
In the testing stage
When the model is deployed
Consider the general training process of a neural network model. It involves feeding input data to a set of interconnected layers representing mathematical equations. The parameters of these equations are updated iteratively during the training process such that an input correctly maps to its true output.
Once the model is trained on adequate data, it is evaluated on previously unseen test data. No training takes place at this stage; only the model's performance is measured.
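To ground this in code, here is a minimal sketch of that training-then-evaluation loop in PyTorch, with a toy model and randomly generated data standing in for a real dataset (all names, sizes and hyperparameters are illustrative assumptions):

```python
# Minimal sketch of the standard training loop described above (toy data for illustration).
import torch
import torch.nn as nn

model = nn.Sequential(          # interconnected layers of parameterized equations
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 20)             # toy training inputs
y = torch.randint(0, 2, (256,))      # toy training labels

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)      # how far predictions are from the true outputs
    loss.backward()                  # gradients with respect to the parameters
    optimizer.step()                 # iterative parameter update

# Evaluation on held-out data: no parameter updates, only measurement.
with torch.no_grad():
    X_test, y_test = torch.randn(64, 20), torch.randint(0, 2, (64,))
    accuracy = (model(X_test).argmax(dim=1) == y_test).float().mean()
    print(f"test accuracy: {accuracy:.2f}")
```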
An adversarial ML attack during the training stage involves the modification of input data, features or the corresponding output labels.
A model trained on sufficient data can capture its underlying data distribution with high accuracy, even when that data spans a complex set of distributions.
An adversarial machine learning attack can be executed by manipulating the training data so that it only partially or incorrectly captures the behavior of this underlying distribution. For example, the training data may be made insufficiently diverse, altered or partially deleted.
The training labels may also be intentionally altered. During training, the model weights or parameters converge toward a decision boundary determined by the labeled training data. If an attacker flips the output classes, categories or labels of that data, the learned weights converge to a distorted decision boundary and the trained model produces incorrect results.
The training data may also be injected with incorrect or malicious samples. This kind of data poisoning can subtly shift the decision boundary so that overall evaluation metrics remain within acceptable performance thresholds, while the classification of specific inputs is significantly altered.
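As a simple illustration of label poisoning, the sketch below flips a fraction of the training labels before the model ever sees them. The function name, the 10% flip rate and the toy labels are illustrative assumptions, not details from any specific attack:

```python
# Minimal sketch of training-stage label poisoning: flip a fraction of labels.
import numpy as np

def poison_labels(y, flip_fraction=0.10, num_classes=2, seed=0):
    """Return a copy of the label array with a fraction of labels flipped."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    n_flip = int(len(y) * flip_fraction)
    idx = rng.choice(len(y), size=n_flip, replace=False)
    # Shift each chosen label to a different class.
    y_poisoned[idx] = (y_poisoned[idx] + rng.integers(1, num_classes, size=n_flip)) % num_classes
    return y_poisoned

y_clean = np.random.randint(0, 2, size=1000)
y_poisoned = poison_labels(y_clean)
print("labels changed:", int((y_clean != y_poisoned).sum()))  # roughly 100 of 1000
```

A model trained on the poisoned labels learns a shifted decision boundary even though most of the dataset still looks legitimate.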
Another important type of adversarial attack exploits an inherent problem with AI systems: most AI models are black boxes.
Black-box AI systems are highly nonlinear and therefore exhibit high sensitivity and instability. These models are developed from a set of input data and its corresponding outputs. We do not (and cannot) know the inner workings of the system; we only know that the model correctly maps an input to its true output.
White-box systems, on the other hand, are fully interpretable. We can understand how the model behaves, and we have access to the model parameters along with a complete understanding of their impact on the system's behavior.
Adversaries cannot obtain knowledge of the model underlying a black-box AI system. However, they can use synthetic data that closely resembles the system's inputs and outputs to train a substitute model that emulates the behavior of the target model. This works because of the transferability property of AI models.
Transferability is the phenomenon whereby an adversary can construct adversarial data samples that exploit a model M1 using only knowledge of another model M2, as long as M2 performs the tasks that M1 is designed for sufficiently well.
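Here is a minimal sketch of the substitute-model idea in Python. The black-box target M1 is simulated with a scikit-learn classifier that the adversary can only query; the model choices, synthetic data and names (black_box, substitute) are illustrative assumptions:

```python
# Minimal sketch: train a substitute model M2 by querying a black-box target M1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier   # stand-in for the black-box target M1
from sklearn.linear_model import LogisticRegression   # the adversary's substitute M2

rng = np.random.default_rng(0)

# Pretend M1 is a deployed black box we can only query, never inspect.
X_private = rng.normal(size=(1000, 10))
y_private = (X_private[:, 0] + X_private[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_private, y_private)

# The adversary generates synthetic inputs resembling real data and
# records the black box's predictions as labels.
X_synthetic = rng.normal(size=(2000, 10))
y_observed = black_box.predict(X_synthetic)

# The substitute model M2 is trained to emulate the target's behavior.
substitute = LogisticRegression().fit(X_synthetic, y_observed)

# Adversarial examples crafted against the substitute (white-box access)
# will often transfer to the original black box.
agreement = (substitute.predict(X_synthetic) == y_observed).mean()
print(f"substitute agrees with target on {agreement:.0%} of queries")
```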
In a white-box AI attack, adversaries have knowledge of the target model, including:
Its parameters
The algorithms used to train the model
A popular example involves adding small perturbations to the input data so that the model produces an incorrect output with high confidence. These perturbations reflect worst-case scenarios that exploit the sensitivity and nonlinear behavior of the neural network, pushing it toward an incorrect decision class.
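As one concrete illustration, the sketch below uses the fast gradient sign method (FGSM), a well-known way of crafting such worst-case perturbations when the model's gradients are accessible. The toy model, data and epsilon value are illustrative assumptions:

```python
# Minimal sketch of a white-box perturbation attack (FGSM-style).
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon=0.05):
    """Return x plus a small perturbation that increases the model's loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x_adv), y)
    loss.backward()
    # Step in the direction that most increases the loss (worst case for the model).
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
x = torch.randn(8, 20)
y = torch.randint(0, 2, (8,))

x_adv = fgsm_perturb(model, x, y)
print("clean predictions:      ", model(x).argmax(dim=1).tolist())
print("adversarial predictions:", model(x_adv).argmax(dim=1).tolist())
```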
The same concepts of adversarial training and constructing adversarial examples can also be used to improve the robustness of an AI system. They can regularize model training, imposing constraints that harden the model against the extreme-case scenarios that would otherwise force it into misclassifying an output.
Adversarial training augments the training data so that, during the training process, the model is already exposed to a distribution of adversarial examples, including the kind of perturbed data that could otherwise be used to exploit the model's vulnerabilities.
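A minimal sketch of this idea follows, assuming the same FGSM-style perturbation used above: each training batch is mixed with adversarially perturbed copies of itself. The toy model, data, epsilon and the 50/50 clean-to-adversarial mix are illustrative assumptions:

```python
# Minimal sketch of adversarial training: augment each batch with perturbed copies.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

def perturb(x, y, epsilon=0.05):
    """Same FGSM-style step as in the earlier sketch: nudge inputs to raise the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    nn.CrossEntropyLoss()(model(x_adv), y).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

for epoch in range(10):
    x_mix = torch.cat([X, perturb(X, y)])   # clean + adversarial inputs
    y_mix = torch.cat([y, y])               # labels stay the same
    optimizer.zero_grad()
    loss = loss_fn(model(x_mix), y_mix)     # the adversarial half acts like a regularizer
    loss.backward()
    optimizer.step()
```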