

Poster in Workshop on Machine Learning and Compression

Interactions Across Blocks in PTQ

Khasmamad Shabanovi · Lukas Wiest · Thomas Pfeil · Vladimir Golkov · Daniel Cremers


Abstract:

Post-Training Quantization (PTQ) typically involves fine-tuning the quantization of substructures, such as individual layers or groups of layers, one at a time, with the goal of minimizing the quantization-induced error in their pre-activations. Deriving this local objective from the global objective of minimizing the task loss error requires two key simplifications: (1) substructures are assumed to be mutually independent, and (2) knowledge of the subsequent substructures and of the task loss is ignored. In this work, we evaluate the impact of these simplifications on weight-only blockwise PTQ of Large Language Models (LLMs). We explore two methods: (1) optimizing the quantization of multiple transformer blocks at a time, and (2) optimizing the quantization of each transformer block to minimize the error in the pre-activations of a downstream transformer block. The first method introduces second-order interactions across blocks, while the second incorporates knowledge of subsequent layers into the optimization. We compare both methods against the baseline of vanilla blockwise quantization. Our comparison shows that these methods are beneficial only for certain LLMs and come at the cost of increased computational requirements.
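The contrast between the vanilla local objective and the two variants described above can be illustrated with a small sketch. The snippet below is a hypothetical PyTorch illustration, not the authors' implementation: toy nn.Linear layers stand in for transformer blocks, the names QuantLinear and blockwise_loss are made up for this example, and weight-only quantization is modeled as round-to-nearest with a learnable scale and a straight-through estimator. Measuring the reconstruction loss over a single block recovers the vanilla blockwise objective; widening the span while training all scales within it roughly corresponds to method (1), and widening it while training only the first block's scale roughly corresponds to method (2).

```python
# Hypothetical sketch of the objectives compared in the paper (not the authors' code).
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Toy block with weight-only fake quantization and a learnable scale."""

    def __init__(self, fp_linear: nn.Linear):
        super().__init__()
        self.weight = fp_linear.weight.detach()
        self.bias = fp_linear.bias.detach() if fp_linear.bias is not None else None
        self.scale = nn.Parameter(self.weight.abs().max() / 127)

    def forward(self, x):
        w = self.weight / self.scale
        # Straight-through estimator: round in the forward pass, identity gradient.
        w_q = (torch.round(w) - w).detach() + w
        return nn.functional.linear(x, w_q * self.scale, self.bias)


def run(blocks, x):
    for block in blocks:
        x = block(x)
    return x


def blockwise_loss(fp_blocks, q_blocks, x, start, num_blocks):
    """Pre-activation MSE measured after `num_blocks` blocks starting at `start`.

    num_blocks=1 is the vanilla blockwise objective; a larger span corresponds to
    optimizing several blocks jointly (method 1) or to measuring the error at a
    downstream block while only one block's quantizer is trainable (method 2).
    """
    fp_out = run(fp_blocks[start:start + num_blocks], x)
    q_out = run(q_blocks[start:start + num_blocks], x)
    return (fp_out - q_out).pow(2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    fp_blocks = nn.ModuleList(nn.Linear(16, 16) for _ in range(4)).eval()
    q_blocks = nn.ModuleList(QuantLinear(b) for b in fp_blocks)
    x = torch.randn(8, 16)

    # Vanilla blockwise PTQ: tune block 0's scale against its own output error.
    opt = torch.optim.Adam([q_blocks[0].scale], lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        loss = blockwise_loss(fp_blocks, q_blocks, x, start=0, num_blocks=1)
        loss.backward()
        opt.step()
    print(f"local reconstruction error: {loss.item():.6f}")
```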
