Poster in Workshop: Mathematics of Modern Machine Learning (M3L)
In-Context Learning by Linear Attention: Exact Asymptotics and Experiments
Yue Lu · Mary Letey · Jacob Zavatone-Veth · Anindita Maiti · Cengiz Pehlevan
Keywords: [ In-Context Learning ] [ Learning Phase Transition ] [ Double Descent ] [ Random Matrix Theory ] [ Exactly Solvable Model ]
Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the sample complexity, pretraining task diversity, and context length required for successful ICL remain unresolved. In this work, we provide precise answers to these questions using a solvable model of ICL for a linear regression task with linear attention. We derive asymptotics for the learning curve in a regime where token dimension, context length, and pretraining diversity scale proportionally, and the number of pretraining examples scales quadratically. Our analysis reveals a double-descent learning curve and a transition between low and high task diversity, both of which are empirically validated with experiments on realistic Transformer architectures.
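To make the setup concrete, here is a minimal sketch of ICL for linear regression with a simplified linear-attention predictor. It is an illustration under stated assumptions, not the authors' model or code: the reduced predictor `y_hat = x_query^T Gamma (1/l) sum_i y_i x_i`, the ratios `alpha`, `kappa`, and `tau`, the Gaussian data, and the ridge fit are all assumed here. Because this predictor is linear in the `d x d` matrix `Gamma`, fitting it is a regression with `d^2` parameters, which gives some intuition for why the number of pretraining examples would scale quadratically in the token dimension.

```python
import numpy as np


def scaling_regime(d, alpha=2.0, kappa=1.0, tau=1.0):
    """Hypothetical proportional scaling: context length and task diversity
    grow linearly in d, pretraining examples grow quadratically in d."""
    return {
        "d": d,
        "context_len": int(alpha * d),
        "n_tasks": int(kappa * d),
        "n_pretrain": int(tau * d * d),
    }


def sample_prompt(rng, d, context_len, w, noise_std=0.1):
    """Draw one prompt: context pairs (X, y) and a query pair from the same task w."""
    X = rng.standard_normal((context_len, d)) / np.sqrt(d)
    y = X @ w + noise_std * rng.standard_normal(context_len)
    x_query = rng.standard_normal(d) / np.sqrt(d)
    y_query = x_query @ w + noise_std * rng.standard_normal()
    return X, y, x_query, y_query


def linear_attention_predict(X, y, x_query, Gamma):
    """Assumed reduced form: y_hat = x_query^T Gamma (1/l) sum_i y_i x_i."""
    h = X.T @ y / len(y)
    return x_query @ Gamma @ h


def fit_gamma(rng, d, context_len, n_tasks, n_pretrain, ridge=1e-3):
    """Fit Gamma by ridge regression on prompts whose tasks come from a finite pool."""
    tasks = rng.standard_normal((n_tasks, d))  # finite pool controls task diversity
    feats = np.empty((n_pretrain, d * d))
    targets = np.empty(n_pretrain)
    for i in range(n_pretrain):
        w = tasks[rng.integers(n_tasks)]
        X, y, x_query, y_query = sample_prompt(rng, d, context_len, w)
        h = X.T @ y / context_len
        # The prediction equals <Gamma, outer(x_query, h)>, so this is the feature vector.
        feats[i] = np.outer(x_query, h).ravel()
        targets[i] = y_query
    A = feats.T @ feats + ridge * np.eye(d * d)
    gamma_vec = np.linalg.solve(A, feats.T @ targets)
    return gamma_vec.reshape(d, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cfg = scaling_regime(d=8)
    Gamma = fit_gamma(rng, cfg["d"], cfg["context_len"], cfg["n_tasks"], cfg["n_pretrain"])
    # Evaluate on a fresh task that was not in the pretraining pool.
    w_new = rng.standard_normal(cfg["d"])
    X, y, x_query, y_query = sample_prompt(rng, cfg["d"], cfg["context_len"], w_new)
    print("prediction:", linear_attention_predict(X, y, x_query, Gamma), "target:", y_query)
```

Sweeping `tau` at fixed `alpha` and `kappa` and averaging the squared error over many fresh tasks would trace out an empirical learning curve for this toy model; whether it matches the asymptotic curves derived in the work is not something this sketch establishes.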