Poster
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Chaofan Tao · Qian Liu · Longxu Dou · Niklas Muennighoff · Zhongwei Wan · Ping Luo · Min Lin · Ngai Wong
Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting the representations of rare tokens. We investigate how vocabulary size affects LLM scaling laws. By training models ranging from 33M to 3B parameters on up to 510B characters with various vocabulary configurations, we find that the optimal vocabulary size is bounded by the available compute budget. We propose two methods to determine the optimal vocabulary size: an empirical IsoFLOPs approach and a fast derivative-based approach. Both suggest that vocabulary parameters should be scaled more slowly than non-vocabulary parameters. Nonetheless, vocabulary parameters are critical for performance and are under-allocated in current LLMs. By increasing the vocabulary size beyond the conventional 32K, we train a better 3B-parameter model despite using fewer training tokens. Our work reveals the underestimated role of vocabulary size and the necessity of jointly considering vocabulary size, model parameters, and training data for efficient scaling.
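The derivative-based idea can be illustrated with a minimal numerical sketch: fix the compute budget, let a larger vocabulary trade training tokens for vocabulary parameters, and search for the vocabulary size that minimizes a predicted loss. Everything below is an illustrative assumption rather than the paper's fitted law: the C ≈ 6·N_total·D compute rule, the toy loss form, and all constants (`D_MODEL`, `N_NV`, `FLOPS_BUDGET`, exponents, coefficients) are placeholders.

```python
# Toy sketch of a derivative/optimization-based search for a compute-optimal
# vocabulary size. All constants and the loss form are illustrative assumptions.
from scipy.optimize import minimize_scalar

D_MODEL = 2048          # embedding dimension (assumed for illustration)
N_NV = 2.7e9            # non-vocabulary parameters (assumed)
FLOPS_BUDGET = 1e21     # total training compute C in FLOPs (assumed)

def training_tokens(vocab_size: float) -> float:
    """Tokens D implied by the budget via the common C ~= 6 * N_total * D rule."""
    n_total = N_NV + vocab_size * D_MODEL   # fold vocabulary parameters into N_total
    return FLOPS_BUDGET / (6.0 * n_total)

def predicted_loss(vocab_size: float) -> float:
    """Toy loss surface: a larger vocabulary shrinks the vocabulary term but,
    at fixed compute, leaves fewer training tokens for the data term.
    Exponents and coefficients are placeholders, not fitted values."""
    d = training_tokens(vocab_size)
    n_v = vocab_size * D_MODEL
    return (400.0 / N_NV ** 0.34      # non-vocabulary capacity term (constant here)
            + 700.0 / d ** 0.28       # data term
            + 2.0 / n_v ** 0.20)      # vocabulary term

# Search within the budget for the vocabulary size that minimizes predicted loss.
result = minimize_scalar(predicted_loss, bounds=(8_000, 512_000), method="bounded")
print(f"compute-optimal vocabulary size (toy fit): {int(result.x):,}")
```

With these placeholder constants the minimizer lands at an interior optimum (tens of thousands of tokens), reflecting the paper's qualitative point that the best vocabulary size is bounded by compute rather than "as large as possible"; the actual numbers depend entirely on the fitted scaling law.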