Poster in Workshop: ML with New Compute Paradigms
MoQ: Mixture-of-format Activation Quantization for Communication-efficient AI Inference System
Haonan Wang · Zeli Liu · Chao Fang · John Walters · Stephen Crago
In the era of AI, model sizes have expanded drastically, posing a challenge to the resource efficiency of AI systems. To benefit communication-sensitive applications on distributed edge AI inference systems, quantization has been widely applied as a promising technique for compressing activations. Existing mainstream works exploit either the integer (INT) or the floating-point (FP) quantization format for the entire model. However, they overlook the possibility of jointly leveraging a mixture of formats and exploiting their respective strengths tailored to different models and layers. In this work, we first comprehensively analyze the characteristics of different quantization formats, including both INT and FP, with and without clipping, and draw insights about the advantages of each format under different circumstances. We then propose a lightweight, calibration-based Mixture-of-format Quantization (MoQ) strategy that enables a communication-efficient AI serving system to automatically adapt to models with different activation distributions and achieve minimal accuracy loss by using the optimal quantization format for each layer. To quantitatively evaluate how well our MoQ format selection strategy locates the optimal format for each layer, we define a criterion termed Optimum Hit Rate (OHR). Experimental results show that, by leveraging the proposed MoQ method, an AI inference system achieves a significant improvement in OHR over any static single-format quantization and over intermediate measurement-based strategies.
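To make the per-layer format selection concrete, the minimal sketch below illustrates one way a calibration-based choice between INT and FP activation formats could be made: quantize a layer's calibration activations under each candidate format (INT8 with and without clipping, and a crude FP8-like emulation) and keep the format with the lowest reconstruction error. The function names, the E4M3-style FP8 emulation, and the MSE selection criterion are assumptions for illustration, not the paper's exact MoQ procedure.

```python
# Hypothetical sketch of per-layer quantization-format selection via calibration.
# The candidate formats and the MSE-based rule are illustrative assumptions,
# not the authors' exact MoQ method.
import numpy as np


def quantize_int8(x, clip=None):
    """Symmetric INT8 quantization; optionally clip the dynamic range."""
    amax = np.abs(x).max() if clip is None else clip
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


def quantize_fp8_like(x):
    """Crude FP8 (E4M3-like) emulation: keep only a few mantissa bits per value."""
    mant, exp = np.frexp(x)            # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16) / 16    # retain roughly 3-4 mantissa bits
    return np.ldexp(mant, exp)


def select_format_per_layer(calib_acts):
    """For each layer, pick the candidate format with the lowest calibration MSE."""
    choices = {}
    for layer, acts in calib_acts.items():
        candidates = {
            "int8": quantize_int8(acts),
            "int8_clip": quantize_int8(acts, clip=np.percentile(np.abs(acts), 99.9)),
            "fp8": quantize_fp8_like(acts),
        }
        errors = {name: np.mean((acts - q) ** 2) for name, q in candidates.items()}
        choices[layer] = min(errors, key=errors.get)
    return choices


# Toy usage: Gaussian vs. heavy-tailed activations tend to favor different formats.
rng = np.random.default_rng(0)
calib = {"layer0": rng.normal(size=4096), "layer1": rng.standard_t(df=3, size=4096)}
print(select_format_per_layer(calib))
```

Under this reading, OHR would measure the fraction of layers for which the selected format coincides with the accuracy-optimal one; the abstract does not spell out its exact definition, so this is only an interpretation.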