

Poster in Workshop on Machine Learning and Compression

CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization

Pranav Nair · Arun Suggala


Abstract:

Quantization has emerged as a key technique for compressing large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses greedy coordinate descent to minimize the layer-wise reconstruction loss, yielding high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. We perform extensive evaluations on the Gemma and PaLM2 model families and demonstrate that CDQuant consistently outperforms GPTQ in 2-4 bit weight quantization. Moreover, CDQuant improves the performance of state-of-the-art PTQ techniques such as QuIP and FrameQuant when used as a replacement for their GPTQ component, resulting in further gains in quality.
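
To make the core idea concrete (this is an illustrative sketch, not the authors' exact algorithm), the NumPy snippet below runs greedy coordinate descent on the layer-wise reconstruction objective (q - w)^T H (q - w), where H = X^T X is built from calibration activations and q is constrained to a quantization grid. It quantizes a single output channel; the function names, the uniform grid, and the stopping rule are all assumptions made for the example.

```python
import numpy as np

def quantize_to_grid(values, grid):
    """Round each value to its nearest point on a fixed quantization grid."""
    idx = np.abs(values[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def greedy_cd_quantize(w, H, grid, num_iters=200):
    """Greedy coordinate descent on the layer-wise objective (q - w)^T H (q - w).

    w:    (d,) full-precision weights for one output channel
    H:    (d, d) Hessian of the reconstruction loss, H = X^T X from calibration data
    grid: (k,) allowed quantized values
    """
    q = quantize_to_grid(w, grid)      # initialize with round-to-nearest
    e = q - w                          # current quantization error
    g = H @ e                          # gradient of the loss (up to a factor of 2)
    diag = np.maximum(np.diag(H), 1e-12)

    for _ in range(num_iters):
        # Best grid value per coordinate: round the 1-D unconstrained optimum.
        delta = quantize_to_grid(q - g / diag, grid) - q
        # Loss reduction obtained by moving each coordinate alone.
        decrease = -(2 * g * delta + diag * delta**2)
        i = int(np.argmax(decrease))
        if decrease[i] <= 0:
            break                      # no single-coordinate move helps anymore
        q[i] += delta[i]               # commit the greedily chosen update
        g += H[:, i] * delta[i]        # rank-1 update of the gradient
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(128, 16))              # toy calibration activations
    w = rng.normal(size=16)                     # one output channel's weights
    H = X.T @ X
    grid = np.linspace(w.min(), w.max(), 2**3)  # illustrative 3-bit uniform grid
    rtn = quantize_to_grid(w, grid)             # round-to-nearest baseline
    q = greedy_cd_quantize(w, H, grid)
    print("round-to-nearest loss:", (rtn - w) @ H @ (rtn - w))
    print("greedy CD loss:       ", (q - w) @ H @ (q - w))
```

Because each accepted update is the single-coordinate move with the largest loss decrease, the objective is monotonically non-increasing starting from the round-to-nearest solution; the demo prints both losses so the improvement is visible on toy data.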
