Poster in Workshop on Machine Learning and Compression
MAPLE: Memory-Aware Predict and Load for Efficient LLM Inference
Zhenyu Liu · Zhemin Zhang · Zirui Zhang · Yanyuan Qin · Jiayi Luo · Zhenyu Gu · Liu Liu
Large Language Models (LLMs) perform well across a wide range of natural language processing (NLP) tasks. However, inference for long text generation is challenging due to the significant memory demands of the key-value (KV) cache, which grows linearly with sequence length. In this paper, we introduce a novel, bandwidth-efficient method for managing the KV cache. Using learning-based techniques, our method predicts and retrieves only the essential KV entries, eliminating the need to transfer the full set of KV pairs. Unlike previous approaches, our method decouples the prediction phase from the computation phase by storing low-rank Keys in HBM, drastically reducing bandwidth consumption with minimal memory overhead and little loss of accuracy.
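To illustrate the predict-and-load idea described in the abstract, here is a minimal sketch in PyTorch. It is not the authors' implementation: the `KVStore` class, the `predict_and_load` function, the projection matrix `W_down`, and all shapes and the `top_k` budget are illustrative assumptions. The sketch only shows the general mechanism of scoring tokens with low-rank Keys kept in HBM and then fetching full KV entries for the predicted subset.

```python
import torch


class KVStore:
    """Stand-in for full KV pairs held in larger/slower memory.
    Assumed interface for illustration only."""

    def __init__(self, keys, values, w_down):
        self.keys = keys        # (heads, seq_len, head_dim)
        self.values = values    # (heads, seq_len, head_dim)
        self.W_down = w_down    # (head_dim, rank) assumed learned low-rank projection

    def gather(self, indices):
        # Fetch only the selected KV entries; indices: (heads, top_k).
        idx = indices.unsqueeze(-1).expand(-1, -1, self.keys.shape[-1])
        return self.keys.gather(1, idx), self.values.gather(1, idx)


def predict_and_load(query, keys_lowrank, kv_store, top_k):
    """Score tokens with low-rank Keys kept in HBM, then load only the
    predicted-essential full KV entries for exact attention."""
    # Project the decode-step query into the low-rank Key space.
    q_lowrank = query @ kv_store.W_down                      # (heads, rank)

    # Cheap prediction pass: approximate scores from low-rank Keys only.
    approx_scores = torch.einsum("hr,hsr->hs", q_lowrank, keys_lowrank)
    _, indices = approx_scores.topk(top_k, dim=-1)           # (heads, top_k)

    # Transfer only the selected KV entries instead of the whole cache.
    k_sel, v_sel = kv_store.gather(indices)

    # Exact attention restricted to the predicted subset.
    scores = torch.einsum("hd,hkd->hk", query, k_sel) / query.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("hk,hkd->hd", weights, v_sel)


# Toy usage: 8 heads, 1024 cached tokens, head_dim 64, rank 8, load top 32.
h, s, d, r = 8, 1024, 64, 8
keys, values = torch.randn(h, s, d), torch.randn(h, s, d)
w_down = torch.randn(d, r)
store = KVStore(keys, values, w_down)
keys_lowrank = keys @ w_down                                  # kept in HBM
out = predict_and_load(torch.randn(h, d), keys_lowrank, store, top_k=32)
```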