Poster in Workshop: Machine Learning and Compression
AdaQuantLM: LLM Quantization with Adaptive Bit-Widths
Shuangyi Chen · Ashish Khisti
Current LLM quantization methods focus on single-bit-width quantization, requiring time-consuming fine-tuning and benchmarking for each bit-width version, which limits their adaptability to different deployment scenarios. To address these challenges, we propose AdaQuantLM, a method for LLM quantization with adaptive bit-widths. Inspired by techniques such as AdaBits and Additive Quantization for Language Models (AQLM), AdaQuantLM exploits the additivity of codewords in the quantized model: converting between bit-widths only requires adding or removing specific codewords, eliminating the need to store full-precision weights. Our approach jointly quantizes and fine-tunes LLMs across multiple bit-widths, enabling the model to adapt to devices with varying computational resources while maintaining performance. We demonstrate the effectiveness of AdaQuantLM through experiments on the Gemma-2b model, highlighting its potential for broad applicability in the efficient deployment of LLMs.
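To make the codeword-additivity idea concrete, below is a minimal NumPy sketch (not the authors' implementation): each weight group is approximated as a sum of codewords drawn from a small set of codebooks, and a lower bit-width is obtained simply by dropping trailing codebooks at dequantization time. The group size, codebook sizes, and function names are illustrative assumptions.

```python
# Minimal sketch of additive (AQLM-style) quantization with bit-width switching.
# All sizes and names here are illustrative assumptions, not the AdaQuantLM code.
import numpy as np

GROUP_SIZE = 8        # weights quantized jointly per group (assumption)
CODEBOOK_SIZE = 256   # 2^8 entries -> 8 index bits per codebook (assumption)

def quantize_group(w, codebooks):
    """Greedy residual assignment: pick one codeword per codebook so that
    their sum approximates the weight group w."""
    residual = w.copy()
    indices = []
    for cb in codebooks:                      # cb shape: (CODEBOOK_SIZE, GROUP_SIZE)
        errs = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(errs))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def dequantize_group(indices, codebooks, num_active):
    """Reconstruct using only the first `num_active` codebooks.
    Fewer active codebooks -> fewer stored indices -> lower bit-width,
    with no access to the full-precision weights required."""
    out = np.zeros(codebooks[0].shape[1])
    for cb, idx in list(zip(codebooks, indices))[:num_active]:
        out += cb[idx]
    return out

# Toy usage: 2 codebooks ~ 16 index bits per group; keeping 1 ~ 8 bits per group.
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=0.1, size=(CODEBOOK_SIZE, GROUP_SIZE)) for _ in range(2)]
w = rng.normal(size=GROUP_SIZE)
idx = quantize_group(w, codebooks)
w_hi = dequantize_group(idx, codebooks, num_active=2)  # higher bit-width
w_lo = dequantize_group(idx, codebooks, num_active=1)  # lower bit-width
print(np.linalg.norm(w - w_hi), np.linalg.norm(w - w_lo))
```

Because the reconstruction is a plain sum of codewords, the high-bit-width and low-bit-width models share the same stored indices, which is what lets a single quantized checkpoint serve devices with different compute budgets.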