Poster in Workshop: Mathematics of Modern Machine Learning (M3L)
In-Context Learning on Unstructured Data: Softmax Attention as a Mixture of Experts
Kevin Christian Wibisono · Yixin Wang
Transformers have exhibited impressive in-context learning (ICL) capabilities: they can generate predictions for new query inputs based on sequences of inputs and outputs (i.e., prompts) without parameter updates. Numerous attempts have been made to theoretically justify why these capabilities emerge, but they often assume that the input-output pairings in the data are known in advance. This assumption can enable simplified transformers (e.g., ones using a single attention layer without the softmax activation) to achieve notable ICL performance. However, this assumption is often unrealistic: the unstructured data on which transformers are trained rarely contain such explicit input-output pairings. How, then, does ICL emerge from unstructured data? To answer this question, we propose a challenging task that does not assume prior knowledge of input-output pairings. Unlike previous studies, this new task elucidates the pivotal role of softmax attention in the robust ICL abilities of transformers, particularly ones with a single attention layer. We argue that the significance of the softmax activation stems from a provable equivalence between softmax-based attention models and mixtures of experts, which facilitates the inference of input-output pairings from the data. Additionally, a probing analysis sheds light on where these pairings are learned within the model. While later layers predictably encode more information about these pairings, we find that even the first attention layer already contains a significant amount of pairing information.
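As a minimal sketch of this correspondence (with notation assumed here rather than taken from the paper), a single softmax attention head with query $q$, keys $k_1, \dots, k_n$, and values $v_1, \dots, v_n$ computes
\[
\mathrm{attn}(q) \;=\; \sum_{i=1}^{n} \pi_i(q)\, v_i,
\qquad
\pi_i(q) \;=\; \frac{\exp(q^\top k_i)}{\sum_{j=1}^{n} \exp(q^\top k_j)},
\]
which already has the form of a mixture of experts: the softmax weights $\pi_i(q)$ act as a gating network over the context positions, whose values $v_i$ play the role of experts. Under this reading, the gate can softly select which context tokens should be treated as the input paired with a given output, which is the kind of pairing inference an unstructured prompt requires.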