

Poster Session
in
Workshop: Scientific Methods for Understanding Neural Networks

Language model scaling laws and zero-sum learning

Andrei Mircea · Ekaterina Lobacheva · Supriyo Chakraborty · Nima Chitsazan · Irina Rish

[ Project Page ]
Sun 15 Dec 4:30 p.m. PST — 5:30 p.m. PST

Abstract:

This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements. We find that these improvements can be tied back to loss deceleration, an abrupt transition in the rate of loss improvement, characterized by piecewise linear behavior in log-log space. Notably, improvements from increased model size appear to result from (1) lowering the loss at which this transition occurs, and (2) increasing the rate of loss improvement after this transition. As an explanation for the mechanism underlying this transition (and the effect of model size on loss it mediates), we propose the zero-sum learning (ZSL) hypothesis. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics where the model cannot improve loss on one token without harming it on another, bottlenecking the overall rate at which loss can improve. We find compelling evidence of ZSL, as well as unexpected results that shed light on other factors contributing to it.
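To make the ZSL idea concrete, the sketch below (not the authors' method; a minimal illustration) measures pairwise cosine similarity between per-token loss gradients of a tiny, randomly initialized model on random tokens. Systematically negative off-diagonal similarities would indicate opposed gradients of the kind the hypothesis describes. The model, data, and the use of full-parameter cosine similarity as the opposition metric are all assumptions for illustration only.

```python
# Illustrative sketch (assumed setup, not the paper's experimental protocol):
# probe whether per-token loss gradients are systematically opposed by
# computing their pairwise cosine similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "language model": embedding followed by a linear readout (placeholder).
vocab, d_model, seq_len = 100, 32, 16
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))

tokens = torch.randint(0, vocab, (seq_len + 1,))
inputs, targets = tokens[:-1], tokens[1:]

params = [p for p in model.parameters() if p.requires_grad]

def flat_grad(scalar_loss):
    """Flatten the gradient of a scalar loss w.r.t. all model parameters."""
    grads = torch.autograd.grad(scalar_loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

logits = model(inputs)                                   # (seq_len, vocab)
per_token_loss = F.cross_entropy(logits, targets, reduction="none")

# One flattened gradient vector per token position.
grads = torch.stack([flat_grad(loss) for loss in per_token_loss])

# Pairwise cosine similarities; systematically negative off-diagonal values
# would mean improving one token's loss tends to harm another's.
normed = F.normalize(grads, dim=1)
cos = normed @ normed.T
off_diag = cos[~torch.eye(seq_len, dtype=torch.bool)]
print(f"mean off-diagonal cosine similarity: {off_diag.mean().item():.3f}")
```

In a real study this measurement would be taken on a trained language model over the course of training, tracking how gradient opposition evolves around the loss-deceleration transition; the toy model here only demonstrates the mechanics of the measurement.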
