

Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models

Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts

Ashwinee Panda · Vatsal Baherwani · Zain Sarwar · Benjamin Therien · Sambit Sahu · Stephen Rawls · Supriyo Chakraborty · Tom Goldstein

Keywords: [ Efficient Architectures ]


Abstract:

Sparsely-gated Mixture-of-Experts (MoEs) such as Gemini have proven to be more efficient than dense Transformers because they can dynamically activate a subset of their overall parameters by routing tokens to selected "experts", allowing practitioners to scale up model parameter counts without significantly increasing total compute. However, current MoE training approaches only update the router with a sparse gradient and suffer from issues such as load imbalance. We propose a new router that can receive a dense gradient update from a sparse forward pass. Our method adds minimal overhead, but improves on common Top-K routing in both performance and load balance.
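As a rough, hypothetical sketch of the general idea (not the authors' exact method): the snippet below implements a Top-K MoE layer in which only the selected experts run in the forward pass, while skipped experts are stood in for by a detached running mean of their past outputs. Because every routing probability multiplies some output (real or estimated), backpropagation delivers a gradient to every router logit despite the sparse forward pass. All names (SimpleDenseRouterMoE, ema_decay, the EMA substitution itself) are illustrative assumptions.

```python
# Illustrative sketch only: a Top-K MoE layer whose router receives a dense
# gradient even though only the Top-K experts are computed per token.
# The EMA stand-in for skipped experts is an assumption for this example,
# not a description of the paper's exact mechanism.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDenseRouterMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k
        self.ema_decay = ema_decay
        # Running estimate of each expert's mean output; used as a detached
        # stand-in for experts that were not activated for a given token.
        self.register_buffer("expert_output_ema", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)              # dense routing probabilities
        _, topk_idx = probs.topk(self.top_k, dim=-1)   # sparse expert selection

        n_tokens, n_experts = probs.shape
        # Start from the detached EMA estimate for all experts, then overwrite
        # the Top-K entries with the real (sparse) expert outputs.
        expert_out = self.expert_output_ema.detach().expand(n_tokens, n_experts, -1).clone()

        for e in range(n_experts):
            mask = (topk_idx == e).any(dim=-1)         # tokens routed to expert e
            if mask.any():
                out_e = self.experts[e](x[mask])       # sparse forward pass
                expert_out[mask, e] = out_e
                with torch.no_grad():                  # refresh the EMA estimate
                    self.expert_output_ema[e].mul_(self.ema_decay).add_(
                        (1 - self.ema_decay) * out_e.mean(dim=0))

        # Every routing probability multiplies an output (real or estimated),
        # so backprop reaches all router logits, not just the Top-K.
        return torch.einsum("te,ted->td", probs, expert_out)


if __name__ == "__main__":
    moe = SimpleDenseRouterMoE(d_model=16, n_experts=4, top_k=2)
    y = moe(torch.randn(8, 16))
    y.sum().backward()
    print(moe.router.weight.grad)  # gradient populated for every expert's row
```

The key line is the final einsum: the full softmax over experts is combined with per-expert outputs, so the router's weight matrix receives a gradient for every expert, while expert parameters are only updated through the Top-K experts that actually ran.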
