Keynote Talk
in
Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models
Hardware-aware Algorithms for Language Modeling
Tri Dao
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We describe recent progress on subquadratic-time architectures such as structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks. The resulting architecture (Mamba and Mamba-2) matches or exceeds the performance of strong modern Transformers on language modeling, validated at 3B scales on both pretraining and downstream evaluation, while enjoying 5x higher inference throughput and linear scaling in sequence length. Hybridizing Mamba layers with 2-4 attention layers leads to state-of-the-art models, excelling at long context and fast inference.
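To make the selection mechanism concrete, below is a minimal sketch of a selective SSM run in recurrent mode: the input matrix B, output matrix C, and step size delta are computed from each token, so the model can decide per token what to write into or read from its state. The projection shapes (W_B, W_C, W_delta), the diagonal discretization, and the parameter names are illustrative assumptions, not the talk's exact parameterization or the hardware-aware kernel itself.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_delta):
    """Run a toy selective SSM over an input sequence x of shape (L, D).

    A:        (D, N) state matrix with negative entries (one N-dim SSM per channel)
    W_B, W_C: (D, N) projections that make B_t, C_t functions of the input token
    W_delta:  (D, D) projection for the per-token, per-channel step size delta_t
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                          # recurrent state: O(1) in sequence length
    y = np.empty_like(x)
    for t in range(L):                            # recurrent mode: time linear in L
        delta = np.logaddexp(0.0, x[t] @ W_delta) # softplus keeps the step size positive, (D,)
        B_t = x[t] @ W_B                          # input-dependent input matrix, (N,)
        C_t = x[t] @ W_C                          # input-dependent output matrix, (N,)
        A_bar = np.exp(delta[:, None] * A)        # discretize A with the selected step size
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]  # state update
        y[t] = h @ C_t                            # per-channel readout
    return y

# Tiny usage example with random (hypothetical) parameters.
rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))          # Re(A) < 0 keeps the recurrence stable
y = selective_ssm(x, A,
                  0.1 * rng.standard_normal((D, N)),
                  0.1 * rng.standard_normal((D, N)),
                  0.1 * rng.standard_normal((D, D)))
print(y.shape)  # (16, 8)
```

Because B_t, C_t, and delta_t vary with the input, the recurrence can no longer be rewritten as a single long convolution; the talk's contribution is computing this same scan efficiently on GPUs, which the plain Python loop above does not attempt.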