So you want to be an AI Hacker?

Can you brainwash a robot to do your bidding? The advancements in artificial intelligence (AI) are driven by machine learning (ML), a statistical approach where computers learn from data. It turns out you can mess with this data to bend AI to your will. Hacking ML models is the wild west of cybersecurity all over again. It’s an exciting field because it’s so new and ripe for offensive operations. As these models become more prevalent in industry, it will become increasingly important for red teams to understand how to exploit them. In this blog post, we’ll explore the skills and tools that are necessary for hacking and exploiting ML models on a red team.

First, we’ll discuss the skills your team needs, especially the importance of understanding machine learning concepts and techniques. Next, we’ll delve into the different types of attacks that can be used to compromise ML models, including model evasion and poisoning attacks. We’ll introduce some of the tools and frameworks that are commonly used for attacking ML models, like the Adversarial Robustness Toolbox (ART).

Finally, we’ll provide some best practices for red teams that are looking to incorporate machine learning exploitation into their operations. These include staying up-to-date on the latest research and following ethics guidelines.

To get started in the specialty of hacking and exploiting machine learning models on a red team, there are a few key skills and background knowledge that you’ll need. Key skills include:

  • Strong understanding of machine learning concepts: In order to exploit machine learning models, you’ll need a solid understanding of the underlying techniques used to build and train these models in the first place. This includes foundational knowledge of different machine learning architectures like convolutional neural networks and transformers. Understanding how to evaluate the performance of these models is especially important.
  • Familiarity with popular programming languages and libraries: To work with machine learning models, you’ll need to be comfortable with the programming libraries that are commonly used. Most of these are Python libraries that are widely used for machine learning tasks, such as Scikit-Learn, TensorFlow, and PyTorch. Other data-science-focused languages include Julia and R.
  • Experience with data manipulation (aka Extract, Transform, Load or ETL): Machine learning models often work with large amounts of data, and you’ll need to be comfortable working with and manipulating this data in order to create exploits. This includes skills such as data cleaning, preprocessing, and visualization. The libraries most often used for this are Pandas for data processing and cleaning, and Matplotlib or Plotly for visualization (see the short sketch after this list).
  • Familiarity with cybersecurity concepts and tools: As a red team member, you’ll also need a strong understanding of foundational cybersecurity concepts, such as the different types of network and system vulnerabilities and how to exploit them, the art of penetration testing, and perhaps even a familiarity with exploit development. This will help you understand how hacking machine learning models fits into the broader context of an organization’s security posture.
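
To make the ETL point concrete, here is a minimal sketch of the kind of cleaning and visualization work involved, assuming a hypothetical `training_data.csv` file with a numeric `age` column and a `label` column (the file name and columns are illustrative, not from any real dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Load a hypothetical training set from disk.
df = pd.read_csv("training_data.csv")

# Basic cleaning: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Simple preprocessing: scale a numeric feature into the [0, 1] range.
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Quick visualization: check the class balance before any training or poisoning experiments.
df["label"].value_counts().plot(kind="bar")
plt.show()
```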

In conclusion, to be successful as a red team member specializing in machine learning model exploitation, you’ll need to have a strong foundation in both cybersecurity and machine learning.

There are several ways that machine learning models can be exploited, depending on the desired effect, the specific model, and the context in which it is deployed. Some common techniques for exploiting machine learning models include:

  • Model evasion: These attacks use modified inputs (often called adversarial examples) that have been specifically designed to trick a machine learning model into making an incorrect prediction. Adversarial examples can be created by adding small, imperceptible changes to an input, such as a single pixel in an image, that cause the model to misclassify it. For example, changing a single pixel in a picture of a plane could cause the model to classify it as a picture of a cat.
  • Model inversion: These attacks involve reversing the process by which a machine learning model was trained, in order to extract information about the data that was used to train it. For example, an attacker might be able to reverse engineer a model that was trained on customer data in order to learn sensitive information about individual customers, or a face recognition system could be reversed to extract the faces it was trained on.
  • Model poisoning: These attacks involve introducing malicious or incorrect data into a machine learning model’s training data in order to cause the model to perform poorly or make incorrect predictions. Like model evasion, the changes are generally imperceptible. These attacks can be difficult to detect, as the model may still appear to be functioning correctly during training. However, when the poisoned model is deployed, it may produce incorrect or unexpected results, sometimes triggered by a specially crafted input from the attacker (this is called a backdoor attack).

The Adversarial Robustness Toolbox (ART) is an open-source Python library that provides a collection of tools and techniques for creating and testing adversarial attacks against machine learning models. It was developed by researchers at IBM and is designed to help security professionals assess the robustness of machine learning models against adversarial attacks.
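
Getting a model into ART is mostly a matter of wrapping it. Here is a minimal sketch, assuming ART and scikit-learn are installed (`pip install adversarial-robustness-toolbox scikit-learn`) and using the Iris dataset purely as a stand-in for whatever model you are targeting:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier

# Train an ordinary scikit-learn model to stand in for the target.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Wrap it so ART's attack classes can query it through a uniform interface.
classifier = SklearnClassifier(model=model, clip_values=(X.min(), X.max()))
print(classifier.predict(X[:5]))  # class probabilities for the first five samples
```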

Model evasion attacks cause a machine learning model to make incorrect predictions by manipulating the input data that is fed into the model. The modified input is called an adversarial example. Generally, a human cannot tell the difference between the adversarial example and the benign example, but the model is confused by it. A picture of a plane might be confidently predicted as a cat. These attacks are particularly effective at fooling deep neural networks, but they work against most, if not all, types of machine learning models.

An excellent model evasion attack is the HopSkipJump attack, described in the paper HopSkipJumpAttack: A Query-Efficient Decision-Based Attack by Chen et al. This attack crafts adversarial examples by iteratively modifying the input and observing only the model’s predicted labels, using those observations to estimate where the decision boundary lies and to step across it with as small a perturbation as possible. Because of this approach, HopSkipJump can fool deep neural networks with a relatively low number of queries and without any access to the model’s internals, only its predictions.
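
Continuing from the ART snippet above, running HopSkipJump against the wrapped classifier looks roughly like this (a sketch with small iteration counts to keep it fast; the `classifier` and `X` variables come from the previous example):

```python
import numpy as np
from art.attacks.evasion import HopSkipJump

# HopSkipJump only needs the model's predictions, making it a black-box attack.
attack = HopSkipJump(classifier, max_iter=10, max_eval=1000)
x_adv = attack.generate(x=X[:5])

print("original predictions:   ", classifier.predict(X[:5]).argmax(axis=1))
print("adversarial predictions:", classifier.predict(x_adv).argmax(axis=1))
print("mean perturbation size: ", np.abs(x_adv - X[:5]).mean())
```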

An older model evasion attack in a similar vein is DeepFool, described in the paper DeepFool: a simple and accurate method to fool deep neural networks by Moosavi-Dezfooli et al. This attack searches for the minimal perturbation needed to cause a deep neural network to make an incorrect prediction, and it has been shown to be effective at fooling a wide range of different machine learning models.
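
DeepFool needs access to the model’s gradients, so in ART it is typically run against a neural network wrapper such as PyTorchClassifier. The sketch below uses a tiny untrained network on random data purely to exercise the API, assuming PyTorch and ART are installed:

```python
import numpy as np
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import DeepFool

# A tiny stand-in network; in a real engagement this would be the target model.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    input_shape=(20,),
    nb_classes=3,
    clip_values=(0.0, 1.0),
)

# Random feature vectors standing in for real inputs.
x = np.random.rand(5, 20).astype(np.float32)

attack = DeepFool(classifier, max_iter=50)
x_adv = attack.generate(x=x)
flips = (classifier.predict(x).argmax(1) != classifier.predict(x_adv).argmax(1)).sum()
print(f"{flips} of 5 predictions flipped by DeepFool")
```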

Finally, the paper One pixel attack for fooling deep neural networks by Su et al. discusses a model evasion attack for computer vision that modifies a single pixel in the input image. This single-pixel modification can be enough to fool deep neural networks. Someone looking at such an image could easily assume it was a dead pixel on a camera and not an attack, if they even notice the pixel.
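
The paper searches for the pixel with differential evolution, but the idea can be illustrated with a toy brute-force sketch. Here `predict_label` is a hypothetical callable supplied by the caller (numpy image in, integer label out), not part of any particular library:

```python
import itertools

def one_pixel_search(image, predict_label, values=(0.0, 1.0)):
    """Toy sketch: try single-pixel edits until the predicted label changes."""
    original = predict_label(image)
    height, width, _ = image.shape
    for y, x, v in itertools.product(range(height), range(width), values):
        candidate = image.copy()
        candidate[y, x, :] = v                  # overwrite a single pixel
        if predict_label(candidate) != original:
            return candidate                    # first misclassifying one-pixel edit
    return None                                 # no single-pixel edit found
```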

Model poisoning attacks inject malicious data into a machine learning model’s training data so that the model’s performance suffers after training. Contrast this with model evasion, which is strictly an inference-time attack on an already trained model; model poisoning attacks the model as it is being trained.

One notable example of model poisoning is BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain by Gu et al. In this paper, the authors show how vulnerabilities in the machine learning model supply chain can be exploited by injecting a small number of malicious examples into a training set. They also demonstrate how these poisoned models can be used to launch targeted attacks against real-world applications, such as fooling a model into misdetecting street signs, with serious implications for self-driving cars.
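
The BadNets recipe itself is simple: stamp a small trigger pattern into a fraction of the training images and relabel them as the attacker’s target class. A minimal sketch, assuming numpy arrays `x_train` of shape (N, 28, 28) with values in [0, 1] and integer labels `y_train` (illustrative shapes, not the paper’s exact setup):

```python
import numpy as np

def poison_badnets(x_train, y_train, target_class=7, poison_fraction=0.05, seed=0):
    """BadNets-style poisoning: stamp a trigger patch and flip labels to the target class."""
    rng = np.random.default_rng(seed)
    x_poisoned, y_poisoned = x_train.copy(), y_train.copy()

    # Pick a small random subset of the training set to poison.
    n_poison = int(len(x_train) * poison_fraction)
    idx = rng.choice(len(x_train), size=n_poison, replace=False)

    # Stamp a 3x3 white trigger patch into the bottom-right corner of each chosen image.
    x_poisoned[idx, -3:, -3:] = 1.0

    # Relabel the triggered images so the model associates the patch with the target class.
    y_poisoned[idx] = target_class
    return x_poisoned, y_poisoned
```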

Another example is Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching by Geiping et al. This paper shows how to use gradient matching to poison a model’s training data at scale. To do this, the attacker carefully crafts a set of examples that have the same “gradient” or “slope” as the examples in the training set. The gradient is a measure of how sensitive the model’s predictions are to small changes in the input data. By matching the gradients of the malicious examples to the gradients of the legitimate examples, the attacker can “trick” the model into incorporating the malicious examples into its decision-making process.

Think of it like this: imagine you are trying to get a robot to follow a certain path, and the robot adjusts its course based on the slope of the ground. By carefully laying down a small path with the same slope as the original path, you can trick the robot into following the new path instead of the original one. This makes it possible to cause the target model to make specific, targeted errors without needing large amounts of poisoned data.
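
At the core of the attack is a gradient-matching objective. The conceptual sketch below, written against a differentiable PyTorch `model` and loss `criterion`, computes one minus the cosine similarity between the gradient the attacker wants (misclassify the target as `y_adv`) and the gradient produced by the poison batch; the full attack then optimizes perturbations on `x_poison` to minimize this quantity. The tensor names are hypothetical and this is not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, criterion, x_target, y_adv, x_poison, y_poison):
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient the attacker wants a training step to follow
    # (it pushes the model toward labeling the target example as y_adv).
    target_grads = torch.autograd.grad(criterion(model(x_target), y_adv), params)
    target_vec = torch.cat([g.flatten() for g in target_grads]).detach()

    # Gradient actually produced by the (perturbed) poison batch with its clean labels.
    poison_grads = torch.autograd.grad(
        criterion(model(x_poison), y_poison), params, create_graph=True
    )
    poison_vec = torch.cat([g.flatten() for g in poison_grads])

    # Minimizing this aligns the poison gradient with the attacker's desired gradient.
    return 1.0 - F.cosine_similarity(poison_vec, target_vec, dim=0)
```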

In Hidden Trigger Backdoor Attack by Saha et al., a “backdoor” is created in the form of a hidden trigger in the input data, which causes the model to produce a specific output when the trigger is present. Imagine a model that is trained to recognize images of animals. The attacker wants the model to identify a specific image as a “giraffe” even if the image doesn’t resemble a giraffe at all. So the attacker injects a specific pattern (it could be a certain color, shape, or texture) into the training data; this pattern is the trigger, and when the model sees that trigger in a new image, it will identify the image as a giraffe. The attacker can use that trigger to make the model misclassify any image that contains the pattern. The hidden trigger backdoor attack is stealthier than other types of poisoning attacks because it leaves no human-perceptible trace in the poisoned data, and it doesn’t affect the accuracy of the model on clean inputs. This makes it very hard to detect, let alone defend against.

Model inversion allows an attacker to reconstruct data that was used to train a machine learning model. This doesn’t even require access to the model’s parameters or weights; it can be done by using just the model’s outputs to infer information about the training data.

A key paper on model inversion is Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures by Fredrikson et al. In the paper, the authors propose a method for attacking a face recognition system. They show that by using the model’s confidence scores, an attacker can reconstruct a recognizable image of a face from the training set.

The attack starts with an image of random noise. The attacker feeds candidate images into the model and examines the confidence scores it returns. By observing how the confidence scores change as the image is manipulated, and adjusting the noise accordingly, the attacker can infer what the faces used to train the system looked like.

This process is repeated until the attacker has reconstructed an image that is similar to an original input. With this reconstructed image, the attacker can then impersonate the person or use it to bypass the face recognition system. The authors also proposed some countermeasures, such as adding random noise or reducing the precision of the confidence score in the model’s output.
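
A minimal sketch of the reconstruction loop, assuming white-box access to a differentiable PyTorch `model` that outputs class probabilities for 32x32 grayscale faces (the model, image size, and step counts are illustrative; the same idea can be approximated in a black-box setting by estimating how the confidence changes under small perturbations):

```python
import torch

def invert_class(model, target_class, steps=500, lr=0.1):
    """Gradient-guided reconstruction of an input the model associates with target_class."""
    x = torch.zeros(1, 1, 32, 32, requires_grad=True)  # start from a blank image
    for _ in range(steps):
        confidence = model(x)[0, target_class]
        loss = 1.0 - confidence          # low loss means high confidence in the target identity
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad             # step the image toward higher confidence
            x.clamp_(0.0, 1.0)           # keep pixel values in a valid range
            x.grad.zero_()
    return x.detach()
```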

In conclusion, AI/ML models are an increasingly important component of modern computer and software systems. However, these models are very prone to attack, and it is important for organizations to be aware of the different ways that they can be exploited.

In this blog post, we discussed three different types of attacks that can be used to compromise machine learning models: model evasion, model poisoning, and model inversion. Model evasion attacks manipulate the input data to cause the model to make incorrect predictions during inference, model poisoning attacks inject malicious data during training so that the trained model misbehaves, and model inversion attacks use the model’s outputs to reconstruct sensitive information about its training data.

We also discussed a number of different papers that provide valuable insights into these types of attacks and the techniques that can be used to detect and defend against them. Using these techniques, red teams can help organizations identify vulnerabilities in their models and ensure that they are able to effectively defend against these attacks.