

Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI

Attention Bias as an Inductive Bias: How to Teach Transformers Simple Arithmetic

Shaoxiong Duan · Yining Shi · Wei Xu

Keywords: [ Attention ] [ inductive bias ] [ transformer ] [ arithmetic ] [ length generalization ]


Abstract:

In this paper, we study the Transformer model's ability to learn arithmetic from an inductive learning perspective and draw attention to the importance of inductive biases. We first introduce a definition of length generalization, which requires the model to maintain near-perfect accuracy on samples at least 10 times longer than the training length, as an indicator of successful learning. Through experiments and attention analysis, we show that the vanilla Transformer's failure to learn arithmetic stems from inadequate inductive biases. We then present Attention Bias Scaffolding (ABS), which uses attention masking to enforce the necessary inductive biases, making it the first Transformer-based architecture to achieve complete length generalization on several arithmetic tasks such as addition and parity. Additionally, we introduce Attention Bias Calibration (ABC), a calibration stage that allows the model to learn the proper attention biases and achieve complete length generalization automatically on tasks where it can interpolate. Finally, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate potential applications to more complex tasks.
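
To make the core idea concrete, below is a minimal sketch of how an additive attention bias (or mask) can inject an inductive bias into standard scaled dot-product attention. This is an illustration only: the `local_window_bias` pattern and all function names are assumptions for the example, not the paper's exact ABS/ABC implementation.

```python
# Sketch: additive attention bias as an inductive bias in dot-product attention.
# Hypothetical example, not the authors' ABS/ABC code.
import math
import torch


def biased_attention(q, k, v, attn_bias):
    """q, k, v: (batch, seq, dim); attn_bias: (seq, seq) additive bias.

    Positions where attn_bias is -inf are masked out entirely; finite values
    softly steer where each query position attends.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, seq, seq)
    scores = scores + attn_bias                      # inject the inductive bias
    weights = torch.softmax(scores, dim=-1)
    return weights @ v


def local_window_bias(seq_len, window=1):
    """Example bias: each position may only attend to tokens within `window`
    positions of itself (e.g., keeping attention local when aligning digits).
    """
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    bias = torch.full((seq_len, seq_len), float("-inf"))
    bias[allowed] = 0.0
    return bias


if __name__ == "__main__":
    torch.manual_seed(0)
    q = k = v = torch.randn(2, 8, 16)
    out = biased_attention(q, k, v, local_window_bias(8, window=1))
    print(out.shape)  # torch.Size([2, 8, 16])
```

In this reading, ABS corresponds to hand-designing such a bias matrix for a task, while ABC learns an appropriate bias from the trained model itself; the windowed pattern above is just one possible choice.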
