

Oral in Workshop: Mathematics of Modern Machine Learning (M3L)

Mixture of Parrots: Mixtures of experts improve memorization more than reasoning

Samy Jelassi · Clara Mohri · David Brandfonbrener · Alex Gu · Nikhil Vyas · Nikhil Anand · David Alvarez-Melis · Yuanzhi Li · Sham Kakade · Eran Malach

Keywords: [ reasoning for LMs ] [ Mixtures of experts ] [ memorization for LMs ]

[ Project Page ]
Sat 14 Dec 3:45 p.m. PST — 4 p.m. PST
 
presentation: Mathematics of Modern Machine Learning (M3L)
Sat 14 Dec 8:50 a.m. PST — 5 p.m. PST

Abstract:

The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), memorization performance consistently increases while reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, these same problems can be solved by a dense model of slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. To empirically validate our findings, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
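
As an illustration of the architecture the abstract describes, below is a minimal, hypothetical PyTorch sketch of a top-k routed Mixture-of-Experts feed-forward layer. It is not the authors' implementation; the class name MoEFeedForward and the parameters d_hidden and top_k are assumptions made for illustration. The sketch shows why growing num_experts increases the total parameter count while the parameters active for any single token, determined by top_k and the expert width, stay fixed.

    # Minimal sketch of a top-k routed MoE feed-forward layer (illustrative,
    # not the paper's code). Total expert parameters scale with num_experts,
    # but each token is processed by only top_k experts.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model, d_hidden, num_experts, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.ReLU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(num_experts)
            )

        def forward(self, x):
            # x: (num_tokens, d_model)
            gate_logits = self.router(x)                          # (tokens, experts)
            weights, indices = gate_logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)                  # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = indices[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
            return out

    # Doubling num_experts doubles the stored expert parameters, but the
    # per-token compute and active parameter count are unchanged, since each
    # token is still routed to only top_k experts.
    layer = MoEFeedForward(d_model=64, d_hidden=256, num_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])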
