Poster in Workshop: Machine Learning for Systems
Exploring CXL-based KV Cache Storage for LLM Serving
Yupeng Tang · Runxiang Cheng · Ping Zhou · Tongping Liu · Fei Liu · Wei Tang · Kyoungryun Bae · Jianjun Chen · Wu Xiang · Rui Shi
Large language model (LLM) serving systems often store key-value (KV) cache to reduce redundant inference computation. However, storing the KV cache of long-context requests under high-throughput serving demand can overwhelm GPU and host memory. Recently, Compute Express Link (CXL) has emerged as a promising interconnect technology that offers low-latency host-device communication and expanded memory capacity. In this paper, we explore leveraging CXL memory to store KV cache in LLM serving. Our results show that the CXL-CPU interconnect performs comparably to the CPU-GPU interconnect in both data-transfer latency and bandwidth, enabling a 30% increase in batch size compared to full KV re-compute under the same SLO. Our production Return on Investment (ROI) modeling further shows that storing KV cache on CXL memory can reduce GPU requirements by up to 87% and achieve 7.5x higher GPU utilization for prefill, compared to full KV re-compute. Overall, our results demonstrate the performance improvements and resource benefits of using CXL memory to store KV cache in LLM serving.
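A central measurement behind the abstract is that the CXL-CPU path keeps up with the CPU-GPU path for KV-cache transfers. The following is a minimal, hypothetical PyTorch sketch of such a transfer microbenchmark, not the authors' code: it times copies of a KV-cache-sized buffer from host memory to the GPU, and can be run under `numactl --membind=<node>` so the host buffer lands either in local DRAM or on the CXL-attached memory node (the node IDs and script name are illustrative assumptions).

```python
# kv_xfer_bench.py -- illustrative host-to-GPU transfer microbenchmark.
# Example runs (node IDs are assumptions; check `numactl --hardware`):
#   numactl --membind=0 python kv_xfer_bench.py   # local DRAM
#   numactl --membind=2 python kv_xfer_bench.py   # CXL-attached memory node
import time
import torch

def measure_h2d_bandwidth(num_bytes: int = 1 << 30, iters: int = 10) -> float:
    """Average host-to-GPU copy bandwidth (GB/s) for a KV-cache-sized buffer."""
    assert torch.cuda.is_available(), "requires a CUDA GPU"
    # Plain (pageable) host allocation so `numactl --membind` decides whether
    # the buffer sits in local DRAM or in CXL memory.
    host_buf = torch.empty(num_bytes, dtype=torch.uint8)
    dev_buf = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")
    dev_buf.copy_(host_buf)  # warm-up copy, not timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dev_buf.copy_(host_buf)
    torch.cuda.synchronize()
    return num_bytes * iters / (time.perf_counter() - start) / 1e9

if __name__ == "__main__":
    print(f"host-to-GPU bandwidth: {measure_h2d_bandwidth():.1f} GB/s")
```

Comparing the two runs gives a rough sense of whether serving KV cache out of CXL memory adds meaningful transfer overhead relative to ordinary host memory; the paper's reported 30% batch-size gain and ROI figures rest on that gap being small.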