

Poster
in
Workshop: Workshop on Machine Learning and Compression

MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference

Zhenyu Liu · Zhemin Zhang · Zirui Zhang · Yanyuan Qin · Jiayi Luo · Zhenyu Gu · Liu Liu


Abstract:

Large Language Models (LLMs) perform well across a wide range of natural language processing (NLP) tasks. However, inference for long text generation is challenged by the substantial memory demands of the key-value (KV) cache, which grows linearly with sequence length. In this paper, we introduce a novel, bandwidth-efficient method for managing the KV cache. Using learning-based techniques, our method predicts and retrieves only the essential KV entries, eliminating the need to transfer the full set of KV pairs. Unlike previous approaches, our method decouples the prediction phase from the computation phases by storing low-rank Keys in HBM, drastically reducing bandwidth consumption while minimally impacting memory usage and maintaining accuracy.
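The sketch below illustrates the general predict-and-load idea described in the abstract, not the authors' exact method: low-rank keys kept in HBM score the cached positions for the current decode-step query, and only the predicted top-k full KV pairs are transferred from slower memory before attention is computed. All tensor names, the `proj_down` projection, and the `top_k` parameter are illustrative assumptions.

```python
import torch

def predict_and_load_kv(query, keys_lowrank, proj_down, keys_full, values_full, top_k=64):
    """
    Hedged sketch of bandwidth-efficient KV retrieval.

    query:         (num_heads, head_dim)            current query, resident in HBM
    keys_lowrank:  (num_heads, seq_len, rank)       low-rank keys resident in HBM, rank << head_dim
    proj_down:     (head_dim, rank)                 assumed projection matching query to low-rank keys
    keys_full:     (num_heads, seq_len, head_dim)   full keys in slower memory (e.g. CPU)
    values_full:   (num_heads, seq_len, head_dim)   full values in slower memory
    """
    # 1. Prediction pass: score every cached position using only the low-rank keys,
    #    without touching the full KV cache.
    query_lr = query @ proj_down                                   # (num_heads, rank)
    scores = torch.einsum("hsr,hr->hs", keys_lowrank, query_lr)    # (num_heads, seq_len)

    # 2. Keep only the positions predicted to dominate attention.
    k = min(top_k, scores.shape[-1])
    top_idx = scores.topk(k, dim=-1).indices                       # (num_heads, k)

    # 3. Load just the selected KV entries into HBM instead of the whole cache.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, keys_full.shape[-1])
    k_sel = torch.gather(keys_full, 1, idx.to(keys_full.device)).to(query.device)
    v_sel = torch.gather(values_full, 1, idx.to(values_full.device)).to(query.device)

    # 4. Standard attention over the reduced KV set.
    attn = torch.softmax(
        torch.einsum("hd,hkd->hk", query, k_sel) / keys_full.shape[-1] ** 0.5, dim=-1
    )
    return torch.einsum("hk,hkd->hd", attn, v_sel)
```

Because step 1 operates only on the low-rank keys already in HBM, the expensive transfer in step 3 moves k entries per head rather than the entire sequence-length cache, which is the bandwidth saving the abstract points to.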
