Poster
in
Workshop: All Things Attention: Bridging Different Perspectives on Attention
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
Yuxuan Li · James McClelland
Keywords: [ transformers ] [ Multi-task Learning ] [ systematic generalization ] [ interpretability ]
Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We searched for the layer and head configuration sufficient to solve the task, and performed attention ablation and analyzed encoded representations. We show that two-layer transformers learn generalizable solutions to multi-level problems, develop signs of systematic task decomposition, and exploit shared computation across related tasks. These results provide key insights into the possible structures of within-task and cross-task computations that stacks of attention layers can afford.