Poster
D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models
Yikun Jiang · Huanyu Wang · Lei Xie · Hanbin Zhao · Chao Zhang · Hui Qian · John C.S. Lui
Large language models have shown an impressive societal impact owing to their excellent understanding and logical reasoning skills. However, such strong ability relies on a huge amount of computing resources, which makes it difficult to deploy LLMs on computing-resource-constrained platforms. Currently, LLMs process every token of all heterogeneous tasks equivalently, but we argue that not every word is equally important; dispensable words in simple questions, in particular, should not be allocated excessive computing resources. In this paper, we propose a novel dynamic inference paradigm for LLMs, namely D-LLMs, which adaptively allocates computing resources during token processing. We design a dynamic decision module for each transformer layer that decides whether a network unit should be executed or skipped. Moreover, to make D-LLMs suitable for real-world applications, they must be compatible with KV-cache methods, which are widely used to accelerate inference. To address this challenge, we propose a simple yet effective eviction policy that excludes the skipped layers from subsequent calculation. The eviction policy not only makes D-LLMs compatible with prevalent applications but also saves considerable storage resources. Experimentally, D-LLMs show superior performance in terms of computational cost and KV-cache storage utilization. They reduce computational cost and KV-cache storage by up to 45\% on Q\&A, summarization, and math-solving tasks, and by up to 50\% on commonsense reasoning tasks.
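To make the mechanism described above concrete, the following is a minimal sketch of a per-layer decision module that gates whether a transformer block runs for the current token and, when the block is skipped, writes nothing to that layer's KV-cache. The gating MLP, the fixed threshold, and the cache handling here are illustrative assumptions for readability, not the paper's exact design.

```python
import torch
import torch.nn as nn


class GatedTransformerLayer(nn.Module):
    """A transformer block wrapped with a lightweight execution-decision head.

    If the gate score for the current token falls below `threshold`, the block
    is skipped and no KV-cache entry is stored for this layer (mirroring the
    eviction idea of excluding skipped layers from subsequent calculation).
    """

    def __init__(self, block: nn.Module, hidden_size: int, threshold: float = 0.5):
        super().__init__()
        self.block = block              # the original attention + FFN unit
        self.threshold = threshold
        # Hypothetical two-layer gating head; the real decision module may differ.
        self.gate = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.ReLU(),
            nn.Linear(hidden_size // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden: torch.Tensor, kv_cache: list):
        # Score the token's hidden state; average over the batch for a scalar decision.
        score = self.gate(hidden).mean()
        if score < self.threshold:
            # Skip: the hidden state passes through unchanged and this layer
            # contributes no keys/values to the cache.
            return hidden, kv_cache
        out = self.block(hidden)
        # Execute: record this layer's cache entry (a placeholder tensor here).
        kv_cache.append(out.detach())
        return out, kv_cache


# Toy usage: a linear block standing in for a full transformer layer.
layer = GatedTransformerLayer(nn.Linear(64, 64), hidden_size=64)
h, cache = layer(torch.randn(1, 64), kv_cache=[])
print(h.shape, len(cache))  # cache is empty whenever the layer was skipped
```

In this sketch the saving comes from two places: skipped layers perform no attention/FFN computation, and they also store no KV entries, which is where the reduction in cache storage would originate.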