Poster
Transformers on Markov data: Constant depth suffices
Nived Rajaraman · Marco Bondaschi · Ashok Vardhan Makkuva · Kannan Ramchandran · Michael Gastpar
East Exhibit Hall A-C #1710
Abstract:
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. Empirically, we observe a surprising phenomenon that contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer’s ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, we prove that a transformer with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. Surprisingly, with the addition of layer normalization, we show that a transformer with a constant number of layers can represent the in-context conditional empirical distribution, concurring with our empirical observations. This result provides more insight into the benefit of soft-attention and non-linearities in the transformer architecture.
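To make the central object of the abstract concrete, the following is a minimal sketch of an in-context conditional empirical distribution for a $k^{\text{th}}$-order context, computed by direct counting rather than by a transformer. It is an illustration under stated assumptions, not the authors' implementation: the function name `in_context_conditional_distribution`, the explicit `alphabet` argument, and the uniform fallback for a context that has not appeared earlier in the sequence are all hypothetical choices made here.

```python
from collections import Counter


def in_context_conditional_distribution(seq, k, alphabet):
    """Estimate P(next symbol | last k symbols) from counts within seq itself.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need k >= 1 and at least k observed symbols")

    # The context is the most recent k symbols of the sequence.
    context = tuple(seq[-k:])

    # Count which symbol followed each earlier occurrence of that context.
    counts = Counter()
    for i in range(k, len(seq)):
        if tuple(seq[i - k:i]) == context:
            counts[seq[i]] += 1

    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to uniform when the context was never
        # followed by a symbol earlier in the sequence.
        return {a: 1.0 / len(alphabet) for a in alphabet}
    return {a: counts[a] / total for a in alphabet}


if __name__ == "__main__":
    sequence = [0, 1, 1, 0, 1, 0, 0, 1]
    print(in_context_conditional_distribution(sequence, k=2, alphabet=[0, 1]))
```

In the example sequence, the context `(0, 1)` occurs twice before the end, followed once by `1` and once by `0`, so the estimate puts probability one half on each symbol. The counting loop takes time linear in the sequence length for a fixed $k$; the abstract's claim concerns a transformer representing this same quantity through attention.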