

Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Understanding Compute-Parameter Trade-offs in Sparse Mixture-of-Expert Language Models

Harshay Shah · Vimal Thilak · Dan Busbridge · Alaaeldin El-Nouby · Joshua Susskind · Samira Abnar


Abstract:

Scaling language model capacity is crucial for achieving better performance, as it allows these models to capture more complex patterns and representations. Empirically, increasing model size and compute improves outcomes; however, the relationship between model parameters and compute per example, and their combined contribution to capacity, is not yet fully understood. We explore this relationship through sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream objectives. We find that, under different constraints (e.g., parameter count and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a clearer understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
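As a concrete illustration of the sparsity ratio described in the abstract, the minimal Python sketch below computes total versus active expert parameters for a single top-k-routed MoE feed-forward layer. The layer dimensions, expert count, and top-k value are illustrative assumptions, not configurations from the paper.

```python
# Minimal sketch (assumed, illustrative MoE configuration): shows how MoE
# sparsity decouples total parameters from per-example compute. Sparsity is
# taken as the ratio of non-active to total expert parameters, as in the abstract.

def moe_expert_stats(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total expert params, active expert params, sparsity) for one
    MoE feed-forward layer with top-k routing."""
    params_per_expert = 2 * d_model * d_ff        # up- and down-projection weights
    total_params = n_experts * params_per_expert
    active_params = top_k * params_per_expert     # only the routed experts run per token
    sparsity = 1 - active_params / total_params   # non-active / total parameters
    return total_params, active_params, sparsity

# Example with assumed values: 64 experts, 2 routed per token.
total, active, sparsity = moe_expert_stats(d_model=1024, d_ff=4096, n_experts=64, top_k=2)
print(f"total={total:,}  active={active:,}  sparsity={sparsity:.3f}")
# Increasing n_experts grows total parameters while active parameters
# (and hence FLOPs per token) stay roughly constant.
```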
