Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Standard adversarial attacks only fool the final layer
Stanislav Fort
This paper presents a surprising empirical phenomenon in adversarial machine learning: standard adversarial attacks, while successful at fooling a neural network's final classification layer, fail to significantly perturb the representations at early and intermediate layers. Through experiments on ResNet152 models finetuned on CIFAR-10, we demonstrate that when an image is adversarially perturbed to be misclassified, its intermediate-layer representations remain largely faithful to the original class. Furthermore, we uncover a decoupling effect: attacks designed to fool a specific intermediate layer have limited impact on the classifications read out from other layers, both before and after the targeted one. These findings challenge the conventional understanding of how adversarial attacks operate and suggest that, by default, deep networks possess more robust internal representations than previously thought.
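To make the experimental setup concrete, below is a minimal sketch (not the paper's code) of how one might probe this effect: run a standard L-infinity PGD attack against the final layer of a ResNet152, then check whether a pooled intermediate-layer feature still "votes" for the original class via a simple nearest-class-centroid readout. The attack hyperparameters, the choice of `layer3`, the centroid-based probe, and the use of an ImageNet checkpoint in place of a CIFAR-10-finetuned model are all illustrative assumptions.

```python
# Sketch: final-layer PGD attack + intermediate-layer class readout.
import torch
import torch.nn.functional as F
from torchvision.models import resnet152

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in checkpoint; the paper uses ResNet152 finetuned on CIFAR-10.
model = resnet152(weights="IMAGENET1K_V1").to(device).eval()

# Capture the output of an intermediate block with a forward hook.
features = {}
def hook(_module, _inp, out):
    features["layer3"] = out.detach()
model.layer3.register_forward_hook(hook)

def pgd_attack(x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Standard L-infinity PGD against the final-layer cross-entropy loss."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)           # stay in image range
    return x_adv.detach()

def intermediate_class_vote(x, class_centroids):
    """Nearest-centroid readout on pooled layer3 features (a stand-in probe)."""
    _ = model(x)                                   # fills features["layer3"]
    pooled = features["layer3"].mean(dim=(2, 3))   # (B, C) global average pool
    dists = torch.cdist(pooled, class_centroids)   # (B, num_classes)
    return dists.argmin(dim=1)

# Usage (assuming `images`, `labels`, and per-class feature `centroids` exist):
#   x_adv = pgd_attack(images, labels)
#   final_pred = model(x_adv).argmax(dim=1)               # typically fooled
#   mid_pred = intermediate_class_vote(x_adv, centroids)  # often still correct
```

The paper's reported decoupling effect could be probed the same way by pointing the attack loss at a chosen intermediate layer's readout instead of the final logits and then inspecting the readouts of the remaining layers.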