Poster in Workshop: Mathematics of Modern Machine Learning (M3L)
Attention-Only Transformers and Implementing MLPs with Attention Heads
Robert Huben · Valerie Morris
Abstract:
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads.
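As an illustration of the kind of construction the abstract describes, the following is a minimal NumPy sketch of how a single SiLU MLP neuron can coincide with a masked attention head of internal dimension 1. The setup makes two illustrative assumptions that are not spelled out in the abstract: a reference position (position 0) whose key and value are zero, and a constant key of 1 at every other position; with only the current position and the reference position unmasked, the two-way softmax becomes a sigmoid, and the head output equals SiLU(w·x) times an output direction.

```python
import numpy as np

def silu(z):
    """SiLU / swish: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def mlp_neuron(x, w_in, w_out):
    """A single MLP neuron applied per position: SiLU(w_in . x_t) * w_out."""
    return silu(x @ w_in)[:, None] * w_out[None, :]

def attention_head_neuron(x, w_in, w_out):
    """
    A masked attention head with internal dimension 1 that reproduces the
    neuron above, under two illustrative assumptions:
      * position 0 is a reference position whose key and value are zero, and
      * every other position's key is the constant 1 (e.g. read off a
        constant component of the residual stream).
    The mask lets position t attend only to itself and to position 0.
    """
    n, d = x.shape
    q = x @ w_in                  # queries, shape (n,): q_t = w_in . x_t
    k = np.ones(n); k[0] = 0.0    # keys: 1 at real positions, 0 at reference
    v = x @ w_in; v[0] = 0.0      # values (head dim 1): v_t = w_in . x_t, v_0 = 0

    out = np.zeros((n, d))
    for t in range(1, n):
        # only positions {0, t} are visible to the query at position t
        scores = np.array([q[t] * k[0], q[t] * k[t]])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over two positions
        head_out = weights[0] * v[0] + weights[1] * v[t]   # scalar
        out[t] = head_out * w_out                # output projection W_O = w_out
    return out

# quick check that the two computations agree on random data
rng = np.random.default_rng(0)
d, n = 8, 5
x = rng.normal(size=(n, d))
w_in, w_out = rng.normal(size=d), rng.normal(size=d)
assert np.allclose(mlp_neuron(x, w_in, w_out)[1:],
                   attention_head_neuron(x, w_in, w_out)[1:])
```

Because the two visible positions turn the softmax into sigmoid(q_t), the head output at position t is sigmoid(w·x_t)·(w·x_t)·w_out = SiLU(w·x_t)·w_out. This is only a sketch of the idea; the paper's actual construction and its handling of the reference position may differ.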