Poster in Workshop: Workshop on Machine Learning and Compression
Transformers Learn to Compress Variable-order Markov Chains in-Context
Ruida Zhou · Chao Tian · Suhas Diggavi
Abstract:
In recent years, large language models (LLMs) have demonstrated impressive in-context learning (ICL) capability. However, it is still unclear how the underlying transformers accomplish such tasks, especially in more complex scenarios. Toward this end, several recent works studied how transformers learn fixed-order Markov chains (FOMC) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMC), i.e., context trees (CTs). In this work, we study the ICL of VOMCs by viewing language modeling as a form of data compression and focusing on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms, such as the context-tree weighting (CTW) algorithm as a baseline, which is Bayesian optimal for a class of priors. We empirically observe that the performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in context quite well, closely tracking the performance of CTW. We provide a construction with $D+2$ layers that can mimic the CTW algorithm accurately for VOMCs of maximum order $D$. One distinction from the FOMC setting is that a counting mechanism plays an important role here.
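For readers unfamiliar with the baseline named in the abstract, the sketch below shows binary-alphabet context-tree weighting (CTW): a Bayesian mixture over all context trees of depth at most $D$, with a Krichevsky-Trofimov (KT) estimator at each node. This is a minimal illustrative implementation, not the paper's code; the class and function names, the zero-padding of the initial context, and the toy data are assumptions made here for the example.

```python
import math


class CTWNode:
    """One node of the binary context tree; children are created lazily."""
    __slots__ = ("a", "b", "log_pe", "log_pw", "children")

    def __init__(self):
        self.a = 0           # count of 0s observed in this node's context
        self.b = 0           # count of 1s observed in this node's context
        self.log_pe = 0.0    # log of the KT estimate at this node
        self.log_pw = 0.0    # log of the weighted (mixture) probability P_w
        self.children = [None, None]


def _log_add(x, y):
    """log(exp(x) + exp(y)), computed stably."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))


def kt_log_increment(node, bit):
    """log P_KT(next symbol = bit | counts a, b) with the 1/2-1/2 KT estimator."""
    num = (node.b if bit else node.a) + 0.5
    return math.log(num / (node.a + node.b + 1.0))


def ctw_update(node, context, bit):
    """Update counts and weighted probabilities along one context path.

    `context` holds the most recent past bits, most recent first; the
    remaining suffix is passed down the recursion.
    """
    node.log_pe += kt_log_increment(node, bit)
    if bit:
        node.b += 1
    else:
        node.a += 1
    if not context:                       # leaf at maximum depth D: P_w = P_e
        node.log_pw = node.log_pe
        return
    c = context[0]                        # branch on the most recent past bit
    if node.children[c] is None:
        node.children[c] = CTWNode()
    ctw_update(node.children[c], context[1:], bit)
    # Internal node: P_w = 1/2 * P_e + 1/2 * P_w(child 0) * P_w(child 1).
    # An unvisited child has seen no data, so its P_w is 1 (log 0.0).
    lw0 = node.children[0].log_pw if node.children[0] else 0.0
    lw1 = node.children[1].log_pw if node.children[1] else 0.0
    node.log_pw = math.log(0.5) + _log_add(node.log_pe, lw0 + lw1)


def ctw_codelength(bits, D=3):
    """Total CTW codelength (in bits), padding the initial context with 0s."""
    root = CTWNode()
    prev_log_pw = 0.0
    codelen = 0.0
    for t, bit in enumerate(bits):
        context = [bits[t - 1 - k] if t - 1 - k >= 0 else 0 for k in range(D)]
        ctw_update(root, context, bit)
        # Conditional log-probability of this bit under the CTW mixture.
        codelen -= (root.log_pw - prev_log_pw) / math.log(2)
        prev_log_pw = root.log_pw
    return codelen


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # toy sequence with short memory
    print(f"CTW codelength: {ctw_codelength(data, D=2):.2f} bits "
          f"for {len(data)} symbols")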