Poster in Workshop: Optimization for ML Workshop
Understanding Critical Batch Sizes: Scheduling and Batch-Size Invariance in Data-constrained Pre-training
Hanlin Zhang · Depen Morwani · Nikhil Vyas · Jingfeng Wu · Difan Zou · Udaya Ghai · Dean Foster · Sham Kakade
Training large-scale models under a given resource budget requires careful design of training and parallelism strategies. In particular, the efficiency notion of the critical batch size (CBS) concerns the trade-off between time and compute, beyond which greater data parallelism yields diminishing returns. To operationalize this notion, we study auto-regressive language model pre-training on C4 and search for optimal model performance across a range of batch sizes with extensive hyper-parameter sweeps: by carefully accounting for factors such as the learning rate, its schedule, and momentum, we are able to understand their interactions transparently and derive a quantitative relationship between model size and CBS. We train a series of language models ranging from 85 million to 1.2 billion parameters, showing that the halving effect (doubling the batch size halves the number of training steps) is widespread but diminishes beyond a batch size of 1.05 million tokens. We then propose a measure of CBS, fit a scaling law with respect to model size and the number of training tokens, and theoretically analyze the behavior of infinite-dimensional regression to derive the expected scaling laws. Overall, our results show that CBS grows modestly with training duration rather than with model size. In addition, we highlight the importance of several common hyper-parameter choices, as well as strategies for studying large-scale pre-training beyond fixed training durations.
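The abstract mentions fitting a scaling law for CBS against the number of training tokens. The snippet below is a minimal illustrative sketch of how such a fit could be performed, assuming a simple power-law form CBS(D) = a * D^b; the functional form, the data points, and all numbers are placeholder assumptions for illustration, not the paper's actual measurements or fitted law.

```python
# Illustrative sketch (not the paper's code): fit a power-law scaling law
# CBS(D) = a * D**b to hypothetical (training tokens, measured CBS) pairs
# via least squares in log-log space.
import numpy as np

# Hypothetical measurements: training tokens D (in billions of tokens) and
# the empirically located critical batch size (in millions of tokens).
D = np.array([10.0, 20.0, 40.0, 80.0, 160.0])    # training tokens (B)
cbs = np.array([0.35, 0.45, 0.60, 0.80, 1.05])   # critical batch size (M tokens)

# Linear regression in log space: log(CBS) = log(a) + b * log(D).
b, log_a = np.polyfit(np.log(D), np.log(cbs), deg=1)
a = np.exp(log_a)
print(f"fitted scaling law: CBS(D) ~ {a:.3f} * D^{b:.3f}")

# Extrapolate the fitted law to a longer (hypothetical) training run.
D_new = 320.0
print(f"predicted CBS at D={D_new:.0f}B tokens: {a * D_new**b:.2f}M tokens")
```

A fit of this kind is one simple way to quantify the claim that CBS grows slowly with training duration: a small positive exponent b indicates modest growth of the critical batch size as the token budget increases.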