NeurIPS Poster MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Xingkui Zhu · Yiran Guan · Dingkang Liang · Yuchao Chen · Yuliang Liu · Xiang Bai

East Exhibit Hall A-C #1310

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

The sparsely activated mixture of experts (MoE) model presents an effective alternative to densely activated (dense) models, combining improved accuracy with computational efficiency. However, training MoE models from scratch requires extensive data and computational resources, a challenge that limits their widespread adoption. To address this, we introduce MoE Jetpack, a framework designed to fine-tune the abundant and easily accessible dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which initializes MoE models with dense checkpoints to accelerate convergence and enhance accuracy, minimizing the need for extensive pre-training; (2) the hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture to enhance fine-tuning performance and efficiency. Experimental results indicate that MoE Jetpack doubles the convergence speed and enhances accuracy by 2.8% on ImageNet-1K. On smaller datasets, it achieves up to 8-fold faster convergence and over 30% accuracy gains, highlighting its efficiency. The code is available at https://github.com/Adlith/MoE-Jetpack.
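
To make the idea of checkpoint recycling concrete, here is a minimal sketch of initializing several smaller experts from the weights of one dense MLP block. The abstract does not describe how weights are selected, so this sketch assumes plain uniform sampling of hidden units; the function name `recycle_dense_mlp`, its parameters, and the toy shapes are hypothetical and are not taken from the MoE Jetpack code base.

```python
import random


def recycle_dense_mlp(w_in, w_out, num_experts, expert_hidden, seed=0):
    """Build expert weights by sampling hidden units from one dense MLP.

    w_in:  list of `hidden` rows, each of length d_model (first linear layer).
    w_out: list of d_model rows, each of length `hidden` (second linear layer).
    Uniform sampling is an assumption of this sketch, not the paper's rule.
    """
    hidden = len(w_in)
    if expert_hidden > hidden:
        raise ValueError("expert_hidden must not exceed the dense hidden size")
    rng = random.Random(seed)
    experts = []
    for _ in range(num_experts):
        # Pick which dense hidden units this expert inherits.
        idx = sorted(rng.sample(range(hidden), expert_hidden))
        # Copy the matching rows of the first layer ...
        expert_in = [list(w_in[i]) for i in idx]
        # ... and the matching columns of the second layer.
        expert_out = [[row[i] for i in idx] for row in w_out]
        experts.append({"units": idx, "w_in": expert_in, "w_out": expert_out})
    return experts


# Toy dense MLP: d_model = 4, hidden = 8, filled with random values.
d_model, hidden = 4, 8
_rng = random.Random(1)
dense_w_in = [[_rng.gauss(0, 1) for _ in range(d_model)] for _ in range(hidden)]
dense_w_out = [[_rng.gauss(0, 1) for _ in range(hidden)] for _ in range(d_model)]

experts = recycle_dense_mlp(dense_w_in, dense_w_out, num_experts=3, expert_hidden=4)
```

Each expert starts from a subset of pre-trained weights rather than from random values, which is the general motivation the abstract gives for faster convergence. The released repository is the reference for the actual selection procedure.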
