

Poster in Workshop on Machine Learning and Compression

LSH-E Tells You What To Discard: An Adaptive Locality-Sensitive Strategy for KV Cache Compression

Tahseen Rabbani · Minghui Liu · Tony O Halloran · Ananth Sankaralingam · Mary-Anne Hartley · Furong Huang


Abstract:

Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes a large amount of GPU memory. In this work, we introduce LSH-E, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. LSH-E quickly locates tokens in the cache that are cosine-dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current query and the cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. At every decoding step, the key and value of the current token replace the embeddings of the cached token expected to produce the lowest attention score. We demonstrate that LSH-E can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, and long-context retrieval tasks.
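The eviction rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LSHCache`, the projection width `proj_dim`, and the exact replacement policy shown here are assumptions made for illustration, covering only the binarized-projection hashing and Hamming-distance eviction step sketched in the abstract.

```python
import torch


class LSHCache:
    """Minimal sketch of LSH-guided KV-cache eviction (hypothetical, single head)."""

    def __init__(self, max_tokens: int, embed_dim: int, proj_dim: int = 16, device: str = "cpu"):
        # Random Gaussian projection used to binarize queries and keys (SimHash-style LSH).
        self.proj = torch.randn(embed_dim, proj_dim, device=device)
        self.keys = torch.zeros(max_tokens, embed_dim, device=device)
        self.values = torch.zeros(max_tokens, embed_dim, device=device)
        # Lightweight binary structure kept alongside the cache:
        # one sign bit per projection per cached token.
        self.codes = torch.zeros(max_tokens, proj_dim, dtype=torch.bool, device=device)
        self.size = 0
        self.max_tokens = max_tokens

    def _hash(self, x: torch.Tensor) -> torch.Tensor:
        # Binarized Gaussian projection: the sign pattern of x @ proj.
        return (x @ self.proj) > 0

    def insert(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> None:
        code = self._hash(key)
        if self.size < self.max_tokens:
            idx = self.size
            self.size += 1
        else:
            # Hamming distance between the current query's code and each cached key's code.
            # A large distance signals cosine dissimilarity, i.e. a token expected to
            # receive a low attention score from the current query.
            q_code = self._hash(query)
            hamming = (self.codes[: self.size] ^ q_code).sum(dim=-1)
            idx = int(hamming.argmax())  # overwrite the least relevant cached token
        self.keys[idx] = key
        self.values[idx] = value
        self.codes[idx] = code
```

A toy usage of this sketch: with a budget of 4 tokens and 64-dimensional embeddings, each new token's key/value either fills an empty slot or replaces the cached token whose binary code is farthest (in Hamming distance) from the current query's code.

```python
cache = LSHCache(max_tokens=4, embed_dim=64, proj_dim=16)
for _ in range(10):
    q, k, v = torch.randn(3, 64)  # stand-ins for the current query/key/value
    cache.insert(q, k, v)
```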
