Poster in Workshop: ML with New Compute Paradigms
DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations
Wenhao Hu · Paul Henderson · José Cano
Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce memory demand during inference, which is particularly beneficial on edge devices with limited computational resources. To achieve high accuracy, existing methods rely on expensive mathematical computations or extensive online searches for the best hyper-parameters. However, these expensive operations are impractical on edge devices due to their limited compute capabilities, memory capacity, and energy budgets. Furthermore, many existing methods do not target sub-6-bit (or deep) quantization, which is especially relevant to resource-constrained edge devices. To fill these gaps, we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations. DQA leverages simple shift-based operations and Huffman coding to achieve high accuracy. We evaluate DQA at 3-, 4-, and 5-bit quantization levels with three DNN models on two datasets, covering two tasks: image classification and image segmentation. For sub-6-bit quantization, DQA achieves significantly better accuracy (up to 29.28%) than a direct quantization method and NoisyQuant.
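The abstract names the ingredients (shift-based operations and Huffman coding) without spelling out the algorithm, so the sketch below only illustrates the general idea under assumptions: per-tensor activation quantization with a power-of-two (shift-friendly) scale, followed by Huffman coding of the quantized symbols to estimate the compressed size. The function names `shift_quantize` and `huffman_code_lengths` and all implementation details are hypothetical and are not the authors' DQA method.

```python
import heapq
from collections import Counter

import numpy as np


def shift_quantize(activations, num_bits=4):
    """Quantize non-negative activations to num_bits with a power-of-two scale.

    Rounding the scale to a power of two means (de)scaling can be done with
    bit shifts instead of multiplications, which is cheap on edge hardware.
    (Hypothetical sketch, not the DQA algorithm.)
    """
    qmax = (1 << num_bits) - 1
    max_val = float(np.max(activations))
    if max_val <= 0.0:
        return np.zeros_like(activations, dtype=np.int32), 1.0
    exponent = int(np.round(np.log2(max_val / qmax)))  # nearest power-of-two scale
    scale = 2.0 ** exponent
    q = np.clip(np.round(activations / scale), 0, qmax).astype(np.int32)
    return q, scale


def huffman_code_lengths(symbols):
    """Return a {symbol: code length} map from a Huffman tree over symbol counts."""
    freq = Counter(int(s) for s in np.asarray(symbols).ravel())
    if len(freq) == 1:  # degenerate case: a single symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap entries: [total count, unique tie-breaker, symbols in this subtree]
    heap = [[count, i, [sym]] for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    next_id = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for sym in lo[2] + hi[2]:
            lengths[sym] += 1  # every merge adds one bit to the merged symbols
        heapq.heappush(heap, [lo[0] + hi[0], next_id, lo[2] + hi[2]])
        next_id += 1
    return lengths


# Example: 4-bit shift-based quantization of ReLU activations, then an estimate
# of the average Huffman-coded bits per activation.
acts = np.maximum(np.random.randn(1, 64, 8, 8), 0.0)
q, scale = shift_quantize(acts, num_bits=4)
lengths = huffman_code_lengths(q)
freq = Counter(q.ravel().tolist())
avg_bits = sum(freq[s] * lengths[s] for s in freq) / q.size
print(f"scale = {scale}, avg bits/activation = {avg_bits:.2f}")
```

On typical post-ReLU activations the symbol distribution tends to be skewed toward small values, which is why variable-length coding such as Huffman can push the average storage per activation below the nominal bit width.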