Poster
in
Workshop: New Frontiers in Graph Learning (GLFrontiers)
What Improves the Generalization of Graph Transformer? A Theoretical Dive into Self-attention and Positional Encoding
Hongkang Li · Meng Wang · Tengfei Ma · Sijia Liu · ZAIXI ZHANG · Pin-Yu Chen
Keywords: [ transformer ] [ Deep Learning Theory ] [ Optimization ] [ Graph neural network ] [ generalization analysis ] [ Graph Transformer ]
Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perception. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a zero generalization error by training with stochastic gradient descent (SGD). Our theoretical findings suggest that a larger fraction of discriminative nodes, a clearer-cutting vote among discriminative nodes, a smaller fraction of erroneous labels, and smaller errors in the initial model and node patterns improve generalization. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.