Poster in Workshop: Machine Learning for Systems
TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
Reza Yazdani Aminabadi · Connor Holmes · Samyam Rajbhandari · Zhewei Yao · Yuxiong He
Abstract:
The Mixture of Experts (MoE) model is a powerful architecture that dynamically selects a subset of experts for each input, enabling the model to scale efficiently. However, the gating mechanism, which determines the assignment of tokens to experts, introduces four-dimensional ($S\times E\times C\times M$) computational complexity because its sparse token-to-expert assignment is realized through wasteful dense computation. In this work, we present TurboMoE, a novel approach to accelerating MoE model training by optimizing the gating logic through smart kernel fusion and data-layout transformations. Our method addresses the computational bottlenecks of the gating process by introducing three specialized kernels. The first kernel efficiently computes expert scores and performs top-k expert selection, and the second kernel scatters input tokens into expert-specific buffers, minimizing the need for sparse operations. The third, an MoE-Gather kernel, replaces the traditional sparse matrix multiplication and streamlines the combination of expert outputs. By integrating these kernels, TurboMoE achieves substantial end-to-end speedups over the state-of-the-art solution, MegaBlocks, with 55\% faster training time for top-1 selection and a 41\% improvement for top-2 selection configurations. These optimizations reduce the computational overhead of the gating functionality from $O(SECM)$ to $O(SM)$. TurboMoE demonstrates that, by removing the reliance on sparse computation, MoE models can achieve unprecedented training efficiency, reaching 460 TFLOPS on 32 NVIDIA H100 GPUs for a 32-expert MoE architecture with a top-2 gating configuration, paving the way for more scalable and effective applications.
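To make the claimed complexity reduction concrete, the sketch below contrasts the classic dense dispatch-mask formulation of MoE gating, whose cost scales as $O(SECM)$, with an index-based scatter/gather whose data movement scales as $O(SM)$. This is not the TurboMoE implementation (the paper describes fused GPU kernels); it is a minimal PyTorch illustration assuming the standard MoE notation ($S$ tokens, $E$ experts, capacity $C$ per expert, model dimension $M$), top-1 gating, and a capacity large enough that no tokens are dropped. All variable names are illustrative.

```python
# Minimal sketch: dense one-hot dispatch vs. index-based scatter/gather
# for MoE gating. Illustrative only; not the TurboMoE kernels.
import torch
import torch.nn.functional as F

S, E, M = 8, 4, 16   # tokens, experts, model dim
C = S                # capacity chosen so no token is dropped in this toy example
torch.manual_seed(0)

tokens = torch.randn(S, M)   # token activations
gate_w = torch.randn(M, E)   # gating projection

# --- Kernel 1 analogue: expert scores and top-1 selection ---------------
logits = tokens @ gate_w                 # (S, E)
probs = F.softmax(logits, dim=-1)
top_p, top_e = probs.max(dim=-1)         # top-1 probability and expert id per token

# --- Dense one-hot dispatch: O(S*E*C*M) work -----------------------------
one_hot = F.one_hot(top_e, E).float()                     # (S, E)
pos = (one_hot.cumsum(dim=0) - 1) * one_hot               # slot of each token in its expert buffer
dispatch_mask = F.one_hot(pos.long(), C).float() * one_hot.unsqueeze(-1)  # (S, E, C)
expert_in_dense = torch.einsum('sec,sm->ecm', dispatch_mask, tokens)      # (E, C, M)

# --- Kernel 2 analogue: index-based scatter, O(S*M) data movement --------
expert_in = torch.zeros(E, C, M)
slot = torch.zeros(E, dtype=torch.long)                   # next free slot per expert
src_index = torch.full((E, C), -1, dtype=torch.long)      # which token filled each slot
for s in range(S):                                        # fused into one GPU kernel in practice
    e = top_e[s].item()
    if slot[e] < C:                                       # tokens past capacity would be dropped
        expert_in[e, slot[e]] = tokens[s]
        src_index[e, slot[e]] = s
        slot[e] += 1

assert torch.allclose(expert_in, expert_in_dense)         # same buffers, far less work

# --- Expert FFNs would run here on (E, C, M) buffers ---------------------
expert_out = expert_in                                    # placeholder for the expert FFNs

# --- Kernel 3 analogue: gather outputs back to token order, O(S*M) -------
# (replaces the sparse combine einsum('sec,ecm->sm', ...), which is O(S*E*C*M))
output = torch.zeros(S, M)
for e in range(E):
    for c in range(C):
        s = src_index[e, c].item()
        if s >= 0:
            output[s] = top_p[s] * expert_out[e, c]       # scale by gating probability
```

The three stages of the sketch correspond to the three kernels named in the abstract: score computation with top-k selection, the scatter into expert-specific buffers, and the MoE-Gather that recombines expert outputs without a sparse matrix multiplication.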