An adversarial perturbation is a small, human-imperceptible change to a piece of data that flummoxes an otherwise well-behaved machine learning classifier: for example, a highly accurate model that matches full-sized images to their thumbnails can be made to fail almost entirely by changing just one pixel in the thumbnail.
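To make the idea concrete, here's a minimal sketch of how such a perturbation can be computed with the fast gradient sign method, one standard attack (not necessarily the one behind any particular example above). The model and "image" are hypothetical stand-ins, not real data:

```python
import torch
import torch.nn as nn

def fgsm_perturb(model, image, label, epsilon=0.03):
    """Return the image plus a small perturbation that pushes the model's
    prediction away from the true label (fast gradient sign method)."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, capped at epsilon
    # per pixel so the change stays visually imperceptible.
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

# Toy usage: a stand-in linear "classifier" and a random 32x32 "image".
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])
x_adv = fgsm_perturb(model, x, y)
print((x_adv - x).abs().max())  # the per-pixel change never exceeds epsilon
```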
In a new paper, a group of MIT computer scientists posit a general theory for why these adversarial examples exist, and for how to create or prevent vulnerability to them in machine learning classifiers.
The authors propose that adversarial examples are the result of machine classifiers picking up on "non-robust" features of their training data and incorporating them into their models (if you were training a model to tell men from women and all the women were photographed against a light background, the classifier might assign a high probability of "woman" to any image with a light background).
Since these spurious correlations live in the data itself, models built from the same dataset with different algorithms will often be vulnerable to the same adversarial examples.
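Here's a toy illustration of the light-background effect (not from the paper, with made-up feature names): a linear classifier latches onto a spuriously correlated "background" feature that evaporates at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

def make_data(background_correlated):
    """Two features: a genuine (robust) signal and a 'background brightness'
    feature that predicts the label only when background_correlated=True."""
    y = rng.integers(0, 2, n)
    signal = y + rng.normal(0, 1.0, n)             # weakly but genuinely predictive
    if background_correlated:
        background = y + rng.normal(0, 0.1, n)     # spuriously near-perfect in training
    else:
        background = rng.normal(0, 1.0, n)         # pure noise at test time
    return np.column_stack([signal, background]), y

X_train, y_train = make_data(background_correlated=True)
X_test, y_test = make_data(background_correlated=False)

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # near perfect
print("test accuracy:", clf.score(X_test, y_test))     # much worse: the model leaned on the background
print("weights:", clf.coef_)                            # the spurious feature dominates
```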
The authors propose a method (whose math I have to admit I could not follow) for disentangling the robust characteristics of a dataset from the "non-robust" (spurious) ones. They then show that from a single dataset they can derive one training set that yields classifiers that are much harder to fool, and another that yields classifiers that are much more vulnerable to adversarial-example attacks.
What's more, models trained on either set classify ordinary test data about equally well, and to a human reviewer neither set looks substantially weaker or stronger than the other.
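For flavor, here is a rough, simplified sketch of one way such a "non-robust" training set can be built: perturb each image toward a randomly chosen target class with a targeted attack, then relabel it with that target. The function names and attack parameters are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def targeted_pgd(model, x, target, epsilon=0.5, step=0.1, iters=20):
    """Iteratively nudge x (within an L-infinity ball of radius epsilon)
    until the model leans toward the target class."""
    x = x.detach()
    x_adv = x.clone()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        # Descend the loss toward the target class, then project back into
        # the epsilon-ball around the original image and the valid pixel range.
        x_adv = (x_adv - step * grad.sign()).detach()
        x_adv = (x + (x_adv - x).clamp(-epsilon, epsilon)).clamp(0, 1).detach()
    return x_adv

def build_nonrobust_dataset(model, images, labels, num_classes=10):
    """Relabel each image with a random target class and perturb it so that
    only non-robust features agree with the new label. Training a fresh
    classifier on (adv_images, targets) and still getting good accuracy on
    the original test set is evidence that non-robust features carry signal."""
    targets = torch.randint(0, num_classes, labels.shape)
    adv_images = targeted_pgd(model, images, targets)
    return adv_images, targets
```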
In this work, we cast the phenomenon of adversarial examples as a natural consequence of the presence of highly predictive but non-robust features in standard ML datasets. We provide support for this hypothesis by explicitly disentangling robust and non-robust features in standard datasets, as well as showing that non-robust features alone are sufficient for good generalization. Finally, we study these phenomena in more detail in a theoretical setting where we can rigorously study adversarial vulnerability, robust training, and gradient alignment.
Our findings prompt us to view adversarial examples as a fundamentally human phenomenon. In particular, we should not be surprised that classifiers exploit highly predictive features that happen to be non-robust under a human-selected notion of similarity, given such features exist in real-world datasets. In the same manner, from the perspective of interpretability, as long as models rely on these non-robust features, we cannot expect to have model explanations that are both human-meaningful and faithful to the models themselves. Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process.
Adversarial Examples Are Not Bugs, They Are Features [Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry/arXiv]