Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Token-token correlations predict the scaling of the test loss with the number of input tokens
Francesco Cagnetta · Matthieu Wyart
The success of Large Language Models (LLMs) establishes that machines trained for next-token prediction can acquire language proficiency. What are the mechanisms behind this acquisition, and how much data do they require? We show that these questions can be partially answered by studying the correlations between input tokens. Specifically, using scaling concepts from physics, we formulate a conjecture relating these correlations, the size of the training set, and the effective context window, i.e., the input tokens actually used by the model when predicting the next one. Interestingly, when the correlations decay as a power of the distance between tokens, our conjecture connects to neural scaling laws and predicts how the scaling of the test loss with dataset size should depend on the length of the context window. We confirm the conjecture and its predictions on two datasets, consisting of Wikipedia articles and Shakespeare's lines.
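As an illustration of the kind of measurement the abstract refers to, the sketch below estimates a token-token correlation as a function of the separation t between positions and fits a power-law decay exponent. It uses mutual information between token identities as a stand-in correlation measure and a synthetic random sequence as placeholder data; the authors' exact estimator, tokenizer, and corpora are not specified here, so treat this as a minimal sketch under those assumptions rather than the paper's method.

```python
import numpy as np
from collections import Counter


def token_correlation(tokens, t):
    """Estimate a correlation between tokens at distance t.

    Uses the empirical mutual information between the token at position i
    and the token at position i + t (a common proxy; the paper's exact
    measure may differ).
    """
    n = len(tokens) - t
    pairs = Counter(zip(tokens[:-t], tokens[t:]))
    left = Counter(tokens[:-t])
    right = Counter(tokens[t:])
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = left[a] / n
        p_b = right[b] / n
        mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


# Placeholder data: replace with a real tokenized corpus (e.g. Wikipedia text).
rng = np.random.default_rng(0)
tokens = list(rng.integers(0, 50, size=100_000))

# Measure the decay of correlations with distance.
distances = np.array([1, 2, 4, 8, 16, 32, 64])
corr = np.array([token_correlation(tokens, t) for t in distances])

# A power-law decay C(t) ~ t^{-alpha} is a straight line in log-log scale,
# so the exponent can be read off from a linear fit of log C against log t.
alpha = -np.polyfit(np.log(distances), np.log(np.maximum(corr, 1e-12)), 1)[0]
print(f"estimated decay exponent alpha ~ {alpha:.2f}")
```

On real text the fitted exponent characterizes how quickly distant tokens stop being informative, which is the quantity the conjecture ties to the scaling of the test loss with dataset size and context-window length.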