Poster
in
Workshop: Workshop on Machine Learning and Compression
Interactions Across Blocks in PTQ
Khasmamad Shabanovi · Lukas Wiest · Thomas Pfeil · Vladimir Golkov · Daniel Cremers
Post-Training Quantization (PTQ) typically involves fine-tuning the quantization of substructures, such as individual layers or groups of layers, one at a time with the goal of minimizing the error introduced in their pre-activations due to quantization. Deriving this \textit{local} objective from the \textit{global} objective of task loss error minimization requires two key simplifications: (1) Substructures are assumed to be mutually independent, and (2) the knowledge of the subsequent substructures and the task loss is ignored. In this work, we evaluate the impact of these simplifications on weight-only \textit{blockwise} PTQ of Large Language Models (LLMs). We explore two methods: (1) optimizing the quantization of \textit{multiple} transformer blocks at a time, and (2) optimizing the quantization of each transformer block to minimize the error in the pre-activations of a \textit{downstream} transformer block. The first method introduces second-order interactions across blocks, while the second incorporates the knowledge of subsequent layers into the optimization. We compare these methods against the baseline of vanilla blockwise quantization. Our comparison shows that these methods are beneficial for only certain LLM models and come at the cost of increased required computational resources.