NeurIPS Poster QTIP: Quantization with Trellises and Incoherence Processing

Spotlight Poster

Albert Tseng · Qingyao Sun · David Hou · Christopher De Sa

East Exhibit Hall A-C #3407

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
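
To make the codebook-size argument concrete, here is a minimal Python sketch of a stateful, sliding-window decoder. It is an illustration under stated assumptions, not the QTIP implementation: the abstract does not define the "bitshift" trellis, so the sketch assumes that the decoder state for each weight is an L-bit window of the bitstream that advances by k bits per weight. The names `vq_codebook_entries`, `decode_bitshift`, `L`, and `k` are hypothetical, the codebook is a plain lookup table, and the encoder (in classical TCQ, typically a Viterbi search) is omitted.

```python
def vq_codebook_entries(bits_per_weight: int, dim: int) -> int:
    """Entries in an unstructured VQ codebook: exponential in the dimension."""
    return 2 ** (bits_per_weight * dim)


def decode_bitshift(bitstream: list[int], codebook: list[float],
                    L: int, k: int) -> list[float]:
    """Decode weights from a bitstream with a sliding L-bit state window.

    Assumption (not taken from the abstract): the state for weight t is the
    L bits starting at offset t * k, so consecutive states share L - k bits.
    The codebook has 2**L entries no matter how many weights are decoded.
    """
    if len(codebook) != 2 ** L:
        raise ValueError("codebook must have 2**L entries")
    if len(bitstream) < L:
        return []
    num_weights = (len(bitstream) - L) // k + 1
    decoded = []
    for t in range(num_weights):
        state = 0
        # Pack the L-bit window into an integer state, most significant bit first.
        for bit in bitstream[t * k : t * k + L]:
            state = (state << 1) | bit
        decoded.append(codebook[state])
    return decoded


if __name__ == "__main__":
    # 2 bits per weight at dimension 8 already needs 2**16 VQ codebook entries.
    print(vq_codebook_entries(2, 8))  # 65536

    # Identity codebook so the decoded values expose the visited states.
    L, k = 4, 2
    codebook = [float(s) for s in range(2 ** L)]
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    print(decode_bitshift(bits, codebook, L, k))  # [11.0, 12.0, 2.0]
```

In this sketch the table size depends only on the state width L, while the bitrate is set by k and the number of weights decoded jointly is set by the length of the bitstream. That is the sense in which a stateful decoder can separate codebook size from bitrate and effective dimension.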
