Poster in Workshop: Mathematics of Modern Machine Learning (M3L)
Adversarial Attacks as Near-Zero Eigenvalues in the Empirical Kernel of Neural Networks
Ouns El Harzli · Bernardo Grau
Keywords: [ adversarial attacks ] [ kernel theory ] [ neural networks ]
Adversarial examples, imperceptibly modified data inputs designed to mislead machine learning models, have raised concerns about the robustness of modern neural architectures in safety-critical applications. In this paper, we propose a unified mathematical framework for understanding adversarial examples in neural networks, corroborating Szegedy et al.'s original conjecture that such examples are exceedingly rare despite lying in the proximity of nearly every test case. By exploiting Mercer's decomposition theorem, we characterise adversarial examples as those producing near-zero Mercer eigenvalues in the empirical kernel associated with a trained neural network. Consequently, the generation of adversarial attacks, using any known technique, can be conceptualised as a progression towards zero in the eigenvalue spectrum of the empirical kernel. We rigorously prove this characterisation for trained neural networks that achieve interpolation, under mild assumptions on the architecture, thus providing a mathematical explanation for the apparent contradiction of neural networks excelling at generalisation while remaining vulnerable to adversarial attacks. We have empirically verified that adversarial examples generated for both fully-connected and convolutional architectures, through the widely known DeepFool algorithm and through the more recent Fast Adaptive Boundary (FAB) method, consistently shift the distribution of Mercer eigenvalues towards zero. These results are in strong agreement with the predictions of our theory.
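To make the empirical procedure concrete, the following minimal sketch (not the authors' code) trains a small fully-connected network to near interpolation on toy data, takes the Gram matrix of penultimate-layer features as a stand-in for the empirical kernel (the paper's precise kernel construction may differ), perturbs the inputs with one-step FGSM in place of DeepFool or FAB, and compares the eigenvalue spectra of the clean and adversarial kernels; the characterisation predicts the adversarial spectrum concentrates near zero.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy two-class problem in the plane.
n = 200
X = torch.randn(n, 2)
y = (X[:, 0] + X[:, 1] > 0).long()

class Net(nn.Module):
    """Small fully-connected network; `features` returns the penultimate layer."""
    def __init__(self, width=64):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(2, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.head = nn.Linear(width, 2)

    def features(self, x):
        return self.hidden(x)

    def forward(self, x):
        return self.head(self.features(x))

model = Net()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):  # train to (near) interpolation on the toy data
    opt.zero_grad()
    F.cross_entropy(model(X), y).backward()
    opt.step()

def empirical_kernel(model, x):
    """Gram matrix of penultimate-layer features (assumed kernel definition)."""
    with torch.no_grad():
        phi = model.features(x)
    return phi @ phi.T / phi.shape[1]

def fgsm(model, x, y, eps=0.3):
    """One-step FGSM perturbation, used here as a simple stand-in for DeepFool / FAB."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

X_adv = fgsm(model, X, y)

eig_clean = torch.linalg.eigvalsh(empirical_kernel(model, X))
eig_adv = torch.linalg.eigvalsh(empirical_kernel(model, X_adv))

# The characterisation predicts a shift of the spectrum towards zero
# when the kernel is evaluated on adversarially perturbed inputs.
print("median eigenvalue (clean):      ", eig_clean.median().item())
print("median eigenvalue (adversarial):", eig_adv.median().item())

The same comparison can be repeated with DeepFool or FAB attacks (e.g. via an adversarial-robustness library) and with convolutional architectures by swapping out the attack and the model while keeping the kernel and spectrum computation unchanged.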