Poster in Workshop: Workshop on Machine Learning and Compression
Adaptive Quantization and Pruning of Deep Neural Networks via Layer Importance Estimation
Tushar Shinde
Deep neural networks (DNNs) have demonstrated exceptional performance across various tasks, but their high computational and storage demands make them hard to deploy on edge devices. Quantization has proven effective at reducing model size while preserving accuracy, and pruning compresses models further. However, balancing compression against performance remains difficult, especially with mixed-precision approaches that assign different bit widths to different layers. Our method improves this process by ranking layers by statistical importance, then adaptively pruning each layer and selecting its bit-width precision so that accuracy loss stays minimal. The layer-specific threshold is chosen dynamically, which optimizes compression without complex tuning or costly optimization. Our interpretable and efficient approach is applied to image classification tasks, and experimental results show its effectiveness across multiple DNN architectures.
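The pipeline described in the abstract (rank layers, then prune and pick a bit width per layer) can be illustrated with a minimal sketch. The abstract does not specify the importance statistic, the threshold rule, or the quantizer, so everything below is an assumption chosen for illustration: importance is taken to be the standard deviation of a layer's weights, the pruning threshold is a magnitude quantile that grows as importance drops, and quantization is symmetric uniform. The function names, the `bit_choices` values, and the `max_sparsity` parameter are hypothetical and not taken from the poster.

```python
import numpy as np


def layer_importance(weights):
    # Assumed importance proxy: standard deviation of the layer's weights.
    return float(np.std(weights))


def assign_bits(scores, bit_choices=(4, 6, 8)):
    # Rank layers from least to most important and split the ranking into
    # equal groups, so that more important layers receive more bits.
    order = np.argsort(scores)
    bits = [0] * len(scores)
    for rank, idx in enumerate(order):
        group = rank * len(bit_choices) // len(scores)
        bits[int(idx)] = int(bit_choices[group])
    return bits


def prune(weights, importance, max_importance, max_sparsity=0.5):
    # Layer-specific threshold: the less important the layer, the larger
    # the fraction of small-magnitude weights that is zeroed out.
    sparsity = max_sparsity * (1.0 - importance / max_importance)
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)


def quantize(weights, bits):
    # Symmetric uniform quantization onto 2**bits - 1 levels.
    levels = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        return weights
    scale = max_abs / levels
    return np.round(weights / scale) * scale


def compress_model(layers):
    # layers: dict mapping layer name -> weight array.
    # Returns a dict mapping layer name -> (compressed weights, bit width).
    names = list(layers)
    scores = [layer_importance(layers[n]) for n in names]
    bits = assign_bits(scores)
    top = max(scores)
    result = {}
    for name, score, b in zip(names, scores, bits):
        pruned = prune(layers[name], score, top)
        result[name] = (quantize(pruned, b), b)
    return result
```

Calling `compress_model({"conv1": w1, "fc": w2})` with NumPy weight arrays returns the pruned and quantized weights together with the bit width chosen for each layer. In this sketch the top-ranked layer is not pruned at all, and no retraining or accuracy check is performed, which a real method would need.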