The Mixture-of-Experts (MoE) architecture has emerged as a powerful approach for enhancing the capacity of large language models (LLMs) while maintaining computational efficiency. This is achieved by activating only a subset of the model's parameters during inference, which allows for substantial scaling without a proportional increase in computational cost. Despite these advantages, deploying MoE models on memory-constrained devices remains challenging because their total parameter footprint often exceeds the available DRAM capacity of a typical smartphone.
In this demonstration, we present a Qwen-MoE model with 14 billion parameters running on a mobile device that lacks sufficient DRAM to store the entire model. To address this limitation, we implement an expert caching strategy that keeps only a subset of experts resident in DRAM. During dynamic routing, if the required experts are already cached, computation proceeds quickly; a cache miss, however, requires loading the expert weights from flash memory, which increases latency. To reduce such misses, we condition the router on the current cache state, which significantly improves the cache hit rate and lowers deployment latency without compromising model accuracy on downstream tasks. For further acceleration, we apply post-training quantization to compress the model weights from FP32 to INT4. Our solution showcases the potential of running large-scale MoE models on mobile devices by dynamically managing memory constraints through intelligent caching mechanisms.
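To make the caching and cache-aware routing ideas concrete, the following Python sketch illustrates one way such a mechanism could be structured. It is a minimal illustration, not the deployed implementation: the ExpertCache class, the cache_aware_top_k function, the LRU eviction policy, and all numeric values (cache capacity, bias term, expert count) are assumptions introduced here for exposition.

```python
# Minimal sketch (illustrative only): an LRU expert cache plus a cache-aware
# top-k router that biases selection toward experts already resident in DRAM.
from collections import OrderedDict
import numpy as np


class ExpertCache:
    """Keeps up to `capacity` expert weight blocks resident in DRAM (LRU)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()  # expert_id -> weights (placeholder objects)

    def contains(self, expert_id: int) -> bool:
        return expert_id in self._cache

    def fetch(self, expert_id: int):
        if expert_id in self._cache:                 # cache hit: fast path
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]
        weights = self._load_from_flash(expert_id)   # cache miss: slow path
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:         # evict least-recently-used expert
            self._cache.popitem(last=False)
        return weights

    def _load_from_flash(self, expert_id: int):
        # Placeholder for reading (e.g., INT4-quantized) expert weights from flash.
        return f"weights_of_expert_{expert_id}"


def cache_aware_top_k(router_logits: np.ndarray, cache: ExpertCache,
                      k: int = 2, cache_bias: float = 1.0) -> np.ndarray:
    """Add `cache_bias` to logits of cached experts before top-k selection."""
    biased = router_logits.copy()
    for expert_id in range(biased.shape[-1]):
        if cache.contains(expert_id):
            biased[expert_id] += cache_bias
    return np.argsort(biased)[-k:][::-1]             # indices of the k largest logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = ExpertCache(capacity=8)                  # assume only 8 of 64 experts fit in DRAM
    for _ in range(4):
        logits = rng.normal(size=64)                 # per-token router logits (synthetic)
        for expert_id in cache_aware_top_k(logits, cache, k=2):
            cache.fetch(expert_id)                   # hits reuse DRAM, misses load from flash
```

In this sketch, cached experts receive a soft logit bias rather than a hard mask, so an uncached expert can still be selected when its routing score clearly dominates; this kind of soft preference is one way to raise the hit rate while limiting the perturbation to routing decisions.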
"This Proposal is provided for review and evaluation purposes only. Do not redistribute to any third party without the express prior written consent of Qualcomm Technologies, Inc."