Poster
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
Songlin Yang · Bailin Wang · Yu Zhang · Yikang Shen · Yoon Kim
Transformers with linear attention (i.e., linear Transformers) and state-space models have recently been suggested as viable linear-time alternatives to Transformers with softmax attention. However, these models still underperform Transformers, especially on recall-intensive tasks. More expressive variants that replace the additive outer-product update of linear Transformers with the delta rule have been found to be more effective at associative recall, but existing algorithms for training such models are hardware-inefficient and thus difficult to scale. This work describes a hardware-efficient algorithm for training a generalized variant of linear Transformers (of which DeltaNet is a special case) that exploits the WY representation for computing products of Householder matrices. This algorithm allows us to scale DeltaNet to moderate-scale language modeling settings (1.3B-parameter models trained on 100B tokens), where we find that it outperforms strong linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks, including tasks that focus on recall.
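As a rough illustration of the update the abstract refers to, the sketch below shows a naive sequential reference of the delta-rule recurrence, S_t = S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T with readout o_t = S_t q_t, in contrast to plain linear attention's additive update S_t = S_{t-1} + v_t k_t^T. This is not the paper's hardware-efficient algorithm; tensor names, shapes, and the helper function are assumptions made for exposition.

```python
import torch

def delta_rule_recurrence(q, k, v, beta):
    """Naive sequential delta-rule reference (illustrative, not the paper's kernel).

    q, k: (T, d_k); v: (T, d_v); beta: (T,) with entries in [0, 1].
    Returns per-token outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k, dtype=q.dtype)            # recurrent state (d_v x d_k)
    outputs = []
    for t in range(T):
        k_t, v_t, b_t = k[t], v[t], beta[t]
        v_old = S @ k_t                                  # value currently stored under key k_t
        # Delta rule: move the stored value toward v_t by step size beta_t,
        # i.e. S <- S(I - beta_t k_t k_t^T) + beta_t v_t k_t^T (a rank-1, Householder-like update).
        S = S + b_t * torch.outer(v_t - v_old, k_t)
        outputs.append(S @ q[t])                         # read out with the query
    return torch.stack(outputs)
```

The point of the paper is to avoid this token-by-token loop: products of the rank-1 update matrices (I - beta_t k_t k_t^T) admit a compact WY representation, which is what enables the hardware-efficient, chunkwise-parallel training algorithm described in the abstract.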