Researchers have created a new technique to stop adversarial attacks

Did you know Neural is taking the stage this fall? Together with an amazing line-up of experts, we will explore the future of AI during TNW Conference 2021. Secure your ticket now!

There’s growing concern about new security threats that arise from machine learning models becoming an important component of many critical applications. At the top of the list of threats are adversarial attacks, data samples that have been inconspicuously modified to manipulate the behavior of the targeted machine learning model.

Adversarial machine learning has become a hot area of research and the topic of talks and workshops at artificial intelligence conferences. Scientists are regularly finding new ways to attack and defend machine learning models.

A new technique developed by researchers at Carnegie Mellon University and the KAIST Cybersecurity Research Center employs unsupervised learning to address some of the challenges of current methods used to detect adversarial attacks. Presented at the Adversarial Machine Learning Workshop (AdvML) of the ACM Conference on Knowledge Discovery and Data Mining (KDD 2021), the new technique takes advantage of machine learning explainability methods to find out which input data might have gone through adversarial perturbation.

Creating adversarial examples

Say an attacker wants to stage an adversarial attack that causes an image classifier to change the label of an image from “dog” to “cat.” The attacker starts with the unmodified image of a dog. When the target model processes this image, it returns a list of confidence scores for each of the classes it has been trained on. The class with the highest confidence score corresponds to the class to which the image belongs.

The attacker then adds a small amount of random noise to the image and runs it through the model again. The modification results in a small change to the model’s output. By repeating the process, the attacker finds a direction that will cause the main confidence score to decrease and the target confidence score to increase. By repeating this process, the attacker can cause the machine learning model to change its output from one class to another.

Adversarial attack algorithms usually have an epsilon parameter that limits the amount of change allowed to the original image. The epsilon parameter makes sure the adversarial perturbations remain imperceptible to human eyes.

Adding adversarial noise to an image reduces the confidence score of the main class