

Poster in Workshop: Mathematics of Modern Machine Learning (M3L)

Depth Extrapolation of Decoders Trained on Nested Structures

Emile Richard

Keywords: [ generalization ] [ embeddings ] [ Transformers ] [ nested structures ]


Abstract:

Reasoning problems involving deeply nested formal statements are challenging for humans and machines alike. We investigate how next-token predictors learn such structures, and whether they extrapolate to more deeply nested cases. We propose a theoretical grounding of memorization in a self-attention head with added trainable positional embeddings. We elicit relations between positional and token embeddings that explain how the required embedding dimension scales with context size and with the number of topics in the training set. As an application of our theory, we study completion of a bounded stack of parentheses. We derive a closed-form expression for a simple decoder Transformer that solves this problem. With one-dimensional embeddings, our closed-form model perfectly fits a single-sequence training set and provably completes any out-of-sample parentheses prefix, regardless of prefix depth. In contrast, we observe numerically that decoder Transformers trained on this task fail to learn small models capable of depth extrapolation. Gradient-trained decoders demand large sample sizes and a high-dimensional embedding space to achieve near-perfect accuracy on test sets nearly as deep as the training set. When the gap between training and test depths widens, however, gradient-trained models are inefficient.
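To make the task concrete, the following minimal Python sketch illustrates the parentheses-completion setup described in the abstract. The sampling procedure, the greedy completion rule, and the shallow-vs-deep comparison are illustrative assumptions for exposition, not the paper's exact data-generating protocol or model.

import random

def sample_prefix(max_depth, length):
    """Sample a parentheses prefix whose nesting depth never exceeds
    max_depth and never goes negative (every ')' matches an earlier '(')."""
    prefix, depth = [], 0
    for _ in range(length):
        if depth == 0:
            tok = "("
        elif depth == max_depth:
            tok = ")"
        else:
            tok = random.choice("()")
        depth += 1 if tok == "(" else -1
        prefix.append(tok)
    return "".join(prefix)

def complete(prefix):
    """Ground-truth completion: emit one ')' per unmatched '(' in the
    prefix. A depth-extrapolating decoder must reproduce this rule at
    depths beyond those seen in training."""
    depth = prefix.count("(") - prefix.count(")")
    return ")" * depth

if __name__ == "__main__":
    # A train-like shallow prefix vs. a much deeper test prefix: the
    # target rule is identical, which is exactly what depth
    # extrapolation asks a trained model to capture.
    shallow = sample_prefix(max_depth=3, length=8)
    deep = sample_prefix(max_depth=12, length=30)
    for p in (shallow, deep):
        print(p, "->", p + complete(p))

The completion rule depends only on a running depth counter, which is why a one-dimensional construction can in principle solve the task at any depth; the abstract's numerical finding is that gradient-trained decoders do not recover such a compact solution.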
