Poster
Efficient Attention using Low-Dimensional Keys
Prajwal Singhania · Siddharth Singh · Shwai He · Soheil Feizi · Abhinav Bhatele
Inference on large language models can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in such models contributes significantly to these costs, which has resulted in several recent works that propose approximate attention methods for inference. In this work, we propose to approximate the self-attention computation by focusing on the dimensionality of the key vectors computed in the attention block. Our analysis reveals that the key vectors lie in a significantly lower-dimensional space, consistently across datasets and models. Exploiting this observation, we propose PCA-TopK, a novel approximate attention method that ranks keys based on their attention scores computed in a low-dimensional space, selects the top-K tokens based on these rankings, and uses the full dimensionality only for the selected tokens to compute the approximate attention. Our evaluations show that PCA-TopK maintains model efficacy better than other popular approximation methods, while speeding up the attention computation due to reduced data movement (load/store) and compute costs.
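The abstract describes a two-stage procedure: score keys cheaply in a low-dimensional space, then run exact attention only over the top-K selected tokens. Below is a minimal sketch of that idea for a single decoding step, assuming a precomputed projection onto the leading PCA components of the keys; the names `pca_proj`, `d_low`, and `top_k` are illustrative and not the authors' API.

```python
import torch
import torch.nn.functional as F

def pca_topk_attention(q, k, v, pca_proj, top_k):
    """Sketch of PCA-based top-K approximate attention for one query.

    q:        (1, d)      query for the current decoding step
    k:        (n, d)      cached keys
    v:        (n, d)      cached values
    pca_proj: (d, d_low)  projection onto leading PCA components of the keys
    top_k:    number of past tokens kept for the full-dimensional attention
    """
    # 1) Score keys in the low-dimensional space (n x d_low work instead of n x d).
    q_low = q @ pca_proj              # (1, d_low)
    k_low = k @ pca_proj              # (n, d_low)
    approx_scores = q_low @ k_low.T   # (1, n) approximate attention scores

    # 2) Rank tokens by the approximate scores and keep the top-K indices.
    kk = min(top_k, k.shape[0])
    top_idx = approx_scores.topk(kk, dim=-1).indices.squeeze(0)

    # 3) Full-dimensional attention restricted to the selected tokens.
    k_sel, v_sel = k[top_idx], v[top_idx]
    scores = (q @ k_sel.T) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v_sel            # (1, d) attention output
```

The savings come from step 1 touching only a `d_low`-dimensional view of every cached key, while the full-width keys and values are loaded only for the K tokens that survive the ranking.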