

Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

Scheduling in LLM Inference with Blowed-up Memory Constraints

Zijie Zhou · Jiashuo Jiang

Keywords: [ LLM Inference ] [ Scheduling ] [ KV Cache ] [ Blowed-up Memory ]

Sat 14 Dec, 12:00 p.m. – 12:45 p.m. PST

Abstract: In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) are crucial for many applications, making efficient computational strategies for LLM inference essential. LLM inference generates text in response to input prompts and proceeds in two phases: the prompt phase, where the initial prompt is processed and the first output token is generated, and the token phase, where subsequent tokens are generated one by one until the response is complete. During inference, each token's key and value matrices are computed, a process that can be streamlined with a KV cache. This cache stores the matrices of previously processed tokens, reducing the need for recomputation and preventing computation time from scaling quadratically. However, a significant challenge arises because the KV cache's memory load blows up with each new token, potentially leading to memory overflow. Exceeding the KV cache's limit incurs substantial costs, such as stopping the current processing and resetting the cache. Our research focuses on developing batching and scheduling strategies that manage memory efficiently and minimize total completion time.

This work is still in its early stages. Initially, we consider a scenario in which the decision-maker starts with $n$ prompt requests. Each request's response length (the total number of tokens generated) is assumed to be predictable, a feasible assumption since \cite{zheng2024response} demonstrates that prediction accuracy can reach 81%. We further assume that all prompts are of equal size, or that prompt size increases monotonically with output length. We have developed a batching and scheduling algorithm (Algorithm 1) designed to ensure that memory usage in the KV cache never exceeds its limit. Moreover, with respect to total completion time, Theorem 2 shows that Algorithm 1 is optimal: it has a competitive ratio of $1$. However, Algorithm 1 is computationally intensive, requiring $O(n^2)$ work at each time period (Theorem 1). Our current focus is on designing an approximation algorithm with lower complexity and a bounded competitive ratio.

In addition to validating these two conjectured properties (lower complexity and a bounded competitive ratio), our next step is to extend this problem to an online setting, where the $n$ jobs arrive sequentially. Furthermore, we aim to broaden the scope beyond the current assumptions of identical or monotonically increasing job sizes to more general scenarios. We also plan to investigate potential solutions for instances where predictions are not consistently accurate, determining effective strategies to mitigate inaccuracies.
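
To make the memory dynamics concrete, the Python sketch below models how KV-cache usage grows by one entry per active request at every decoding step; the `Request` fields, `cache_limit`, and the stop-and-reset behaviour are illustrative assumptions rather than the paper's formal model.

```python
# A minimal sketch (not the paper's formal model) of how KV-cache memory grows
# during batched decoding. `Request`, `cache_limit`, and the stop-and-reset
# behaviour below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int     # tokens in the initial prompt
    output_len: int     # predicted total number of output tokens (assumed known)
    generated: int = 0  # output tokens produced so far

def kv_memory(batch: list[Request]) -> int:
    """KV-cache entries in use: one per prompt token plus one per generated token."""
    return sum(r.prompt_len + r.generated for r in batch)

def decode_step(batch: list[Request], cache_limit: int) -> bool:
    """Advance every unfinished request in the batch by one token.

    Returns False if the step would overflow the cache, in which case the
    scheduler must pay the costly stop-and-reset described in the abstract.
    """
    active = [r for r in batch if r.generated < r.output_len]
    if kv_memory(batch) + len(active) > cache_limit:
        return False  # overflow: processing stops and the cache must be reset
    for r in active:
        r.generated += 1
    return True
```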
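As a point of comparison only (this is not the paper's Algorithm 1), the sketch below packs requests with predicted output lengths into batches whose simulated peak KV-cache usage stays under the limit, using the abstract's assumptions of equal prompt sizes and predictable lengths; the shortest-output-first rule, the immediate freeing of a finished request's cache, and all numbers are assumptions made here for illustration.

```python
# A naive shortest-predicted-output-first packing baseline (hypothetical; not
# Algorithm 1 from the paper). Assumes equal prompt sizes, known output lengths,
# that a finished request's cache is freed immediately, and that every request
# fits in the cache on its own.
def peak_memory(output_lens: list[int], prompt_len: int) -> int:
    """Peak KV-cache usage of a batch whose requests all start together."""
    peak = 0
    for t in range(1, max(output_lens) + 1):
        mem = sum(prompt_len + min(t, L) for L in output_lens if L >= t)
        peak = max(peak, mem)
    return peak

def greedy_batches(output_lens: list[int], prompt_len: int, cache_limit: int) -> list[list[int]]:
    """Pack requests (shortest predicted output first) into memory-feasible batches."""
    batches: list[list[int]] = []
    current: list[int] = []
    for L in sorted(output_lens):
        if peak_memory(current + [L], prompt_len) <= cache_limit:
            current.append(L)
        else:
            if current:
                batches.append(current)
            current = [L]
    if current:
        batches.append(current)
    return batches

# Example with illustrative numbers: equal prompt size 16, cache limit 128 entries.
print(greedy_batches([40, 10, 25, 60, 15, 30], prompt_len=16, cache_limit=128))
# -> [[10, 15, 25, 30], [40, 60]]
```

Shortest-first is only a heuristic for total completion time here; by contrast, the abstract's Algorithm 1 achieves a competitive ratio of $1$ at the cost of $O(n^2)$ work per time period.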
