Abstract:
Graph Transformers have shown excellent results on a diverse set of datasets. However, memory limitations prohibit these models from scaling to larger graphs. With standard single-GPU setups, even training on medium-sized graphs is impossible for most Graph Transformers. While the $\mathcal{O}(n^2 d^2)$ complexity of each layer can be reduced to $\mathcal{O}((n+m) d^2)$ using sparse attention models such as Exphormer, for graphs with $n$ nodes and $m$ edges, these models are still infeasible to train on small-memory devices even for medium-sized datasets. Here, we propose to sparsify the Exphormer model even further by using a small ``pilot'' network to estimate attention scores along the graph edges, then training a larger model using only the $\mathcal{O}(n)$ edges deemed important by the small network. We show empirically that attention scores from smaller networks provide a good estimate of the attention scores in larger networks, and that this process can yield a large-width sparse model that performs nearly as well as the large-width non-sparse model.
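The sketch below illustrates the two-stage procedure described above: a cheap pilot network scores the graph edges, a per-node top-$k$ selection keeps $\mathcal{O}(n)$ of them, and the large model is then trained only on the retained edges. The model classes, the `edge_attention_scores` method, and the choice of $k$ are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of pilot-guided edge sparsification (assumed interfaces).
import torch


def select_important_edges(edge_index: torch.Tensor,
                           pilot_scores: torch.Tensor,
                           k_per_node: int) -> torch.Tensor:
    """Keep the top-k highest-scoring incoming edges per destination node,
    so the retained edge set has O(n) edges when k is a small constant."""
    num_nodes = int(edge_index.max()) + 1
    keep = torch.zeros(edge_index.size(1), dtype=torch.bool)
    dst = edge_index[1]
    for v in range(num_nodes):
        idx = (dst == v).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        top = pilot_scores[idx].topk(min(k_per_node, idx.numel())).indices
        keep[idx[top]] = True
    return edge_index[:, keep]


# Hypothetical usage of the two-stage pipeline:
# pilot = SmallGraphTransformer(width=64)                # cheap "pilot" network
# scores = pilot.edge_attention_scores(x, edge_index)    # one score per edge
# sparse_edges = select_important_edges(edge_index, scores, k_per_node=8)
# big = LargeGraphTransformer(width=1024)
# train(big, x, sparse_edges, labels)                    # train on kept edges only
```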