NeurIPS Poster Transformers on Markov data: Constant depth suffices

Poster

Transformers on Markov data: Constant depth suffices

Nived Rajaraman · Marco Bondaschi · Ashok Vardhan Makkuva · Kannan Ramchandran · Michael Gastpar

East Exhibit Hall A-C #1710

[ Abstract ]

[ Paper] [ OpenReview]

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer’s ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, we prove that a transformer with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. Surprisingly, with the addition of layer normalization, we show that a transformer with a constant number of layers can represent the in-context conditional empirical distribution, concurring with our empirical observations. This result provides more insight into the benefit of soft-attention and non-linearities in the transformer architecture.

Chat is not available.