Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Standard adversarial attacks only fool the final layer
Stanislav Fort
This paper presents a surprising empirical phenomenon in adversarial machine learning: standard adversarial attacks, while successful at fooling a neural network's final classification layer, fail to significantly perturb the representations at early and intermediate layers. Through experiments on ResNet152 models finetuned on CIFAR-10, we demonstrate that when an image is adversarially perturbed to be misclassified, its intermediate-layer representations remain largely faithful to the original class. Furthermore, we uncover a decoupling effect: attacks designed to fool a specific intermediate layer have limited impact on the classifications read out from other layers, both before and after the targeted one. These findings challenge the conventional understanding of how adversarial attacks operate and suggest that, by default, deep networks possess more robust internal representations than previously thought.
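To make the experimental setup concrete, below is a minimal sketch (not the paper's code) of how one might probe this effect: run a standard L-infinity PGD attack against the final layer of a ResNet152, then check whether a pooled intermediate-layer feature still "votes" for the original class via a simple nearest-class-centroid readout. The attack hyperparameters, the choice of `layer3`, the centroid-based probe, and the use of an ImageNet checkpoint in place of a CIFAR-10-finetuned model are all illustrative assumptions.

```python
# Sketch: final-layer PGD attack + intermediate-layer class readout.
import torch
import torch.nn.functional as F
from torchvision.models import resnet152

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in checkpoint; the paper uses ResNet152 finetuned on CIFAR-10.
model = resnet152(weights="IMAGENET1K_V1").to(device).eval()

# Capture the output of an intermediate block with a forward hook.
features = {}
def hook(_module, _inp, out):
    features["layer3"] = out.detach()
model.layer3.register_forward_hook(hook)

def pgd_attack(x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Standard L-infinity PGD against the final-layer cross-entropy loss."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)           # stay in image range
    return x_adv.detach()

def intermediate_class_vote(x, class_centroids):
    """Nearest-centroid readout on pooled layer3 features (a stand-in probe)."""
    _ = model(x)                                   # fills features["layer3"]
    pooled = features["layer3"].mean(dim=(2, 3))   # (B, C) global average pool
    dists = torch.cdist(pooled, class_centroids)   # (B, num_classes)
    return dists.argmin(dim=1)

# Usage (assuming `images`, `labels`, and per-class feature `centroids` exist):
#   x_adv = pgd_attack(images, labels)
#   final_pred = model(x_adv).argmax(dim=1)               # typically fooled
#   mid_pred = intermediate_class_vote(x_adv, centroids)  # often still correct
```

The paper's reported decoupling effect could be probed the same way by pointing the attack loss at a chosen intermediate layer's readout instead of the final logits and then inspecting the readouts of the remaining layers.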