Poster in Workshop: Optimization for ML Workshop
Optimizing Attention
Hanno Ackermann · Hong Cai · Markus Nagel · Leyla Mirvakhabova · Farhad G. Zanjani · Fatih Porikli
The attention mechanism is a central component of transformer architectures. It enables the network to compare tokens within a sequence. Before this comparison is performed, the tokens are multiplied by trainable matrices. These matrices can constitute a significant part of the total number of parameters. Their size creates problems on systems with limited cache in the compute unit, especially if the bandwidth between the compute unit and memory is also limited. GPUs on mobile devices in particular suffer from this double bottleneck. Prior works mitigate this problem, for instance, by storing low-rank approximations, by quantization, or by minimizing the amount of data that needs to be transferred. In this paper, an alternative to the traditional attention mechanism is proposed that does not require any trainable matrices. The idea rests upon solving optimization problems, trading stored parameters for computation. It will be shown, however, that the computational demand can be reduced to the point where auto-differentiation becomes feasible. An experimental evaluation shows that the proposed algorithm performs favorably compared with several baselines.
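For context, here is a minimal sketch of the conventional mechanism the abstract refers to: every token is projected by trainable matrices before the dot-product comparison, so the projections alone contribute on the order of 3·d² parameters per attention block, which is exactly the memory traffic the paper targets. All function and variable names below are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def standard_attention(x, w_q, w_k, w_v):
    """Conventional single-head attention: tokens are multiplied by
    trainable matrices before they are compared via dot products."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

d = 1024                                  # model width
# Three d x d projections: ~3M parameters at d = 1024, all of which
# must be streamed from memory to the compute unit at every layer.
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
x = torch.randn(16, d)                    # a sequence of 16 tokens
out = standard_attention(x, w_q, w_k, w_v)
```

The abstract does not spell out the proposed optimization problem, so the following is only one plausible instantiation of "attention without trainable matrices": the attention map is obtained as the solution of an entropy-regularized matching problem on the raw tokens, solved by a few unrolled Sinkhorn iterations. Because the solver is a short chain of differentiable operations, auto-differentiation works end to end. The choice of solver, the regularization strength eps, and the iteration count are assumptions for illustration, not the authors' method.

```python
import torch

def matrix_free_attention(x, n_iters=5, eps=0.1):
    """Hypothetical matrix-free attention: no trainable matrices are stored;
    the attention map is instead computed by solving a small optimization
    problem (entropy-regularized matching) on the raw tokens."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    log_p = scores / eps                  # scaled similarity scores
    for _ in range(n_iters):              # unrolled Sinkhorn iterations
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # columns
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows
    return torch.exp(log_p) @ x           # rows sum to one, as in softmax attention

out = matrix_free_attention(x)            # reuses x from the sketch above
```

Because the Sinkhorn loop is unrolled rather than solved implicitly, gradients flow through every iteration at a cost proportional to n_iters, which illustrates the abstract's point that the compute added by the solver must be kept small for auto-differentiation to remain practical.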