Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models
A Unified Framework for Speculative Decoding with Multiple Drafters as a Bandit
Taehyeon Kim · Hojung Jung · Se-Young Yun
Keywords: [ Efficient Inference ]
Speculative decoding (SD) has emerged as a promising approach to accelerate inference in large language models (LLMs). In SD, a smaller drafter model proposes candidate future tokens, which are then verified in parallel by the target LLM so that only tokens consistent with the target LLM’s predictions are accepted. However, the limited capacity of an individual drafter often hinders its effectiveness across diverse tasks. In this paper, we introduce a unified framework that incorporates multiple drafters into the speculative decoding process to address this limitation. Our approach employs multi-armed bandit sampling to dynamically allocate computational resources and optimize inference across the available drafters, thereby improving overall generation performance. Through extensive experiments, we demonstrate that our unified framework achieves superior results compared to traditional single-drafter approaches.
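To make the idea concrete, below is a minimal sketch of bandit-based drafter selection, assuming a UCB1-style rule (the abstract only says "multi-armed bandit sampling" and does not specify the algorithm) and a toy reward defined as the fraction of drafted tokens the target model accepts in each draft-verify round; the `DrafterBandit` class, the hidden acceptance rates, and the simulated verification step are all hypothetical and stand in for the actual drafters and target LLM.

```python
import math
import random


class DrafterBandit:
    """UCB1-style bandit that chooses which drafter to use for the next round.

    Reward = fraction of drafted tokens accepted by the target model,
    a common proxy for drafter quality in speculative decoding.
    """

    def __init__(self, num_drafters: int):
        self.counts = [0] * num_drafters    # times each drafter was chosen
        self.values = [0.0] * num_drafters  # running mean acceptance rate
        self.total = 0                      # total draft-verify rounds played

    def select(self) -> int:
        # Try each drafter once before applying the UCB rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        # UCB1: empirical mean + exploration bonus.
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[i]),
        )

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.total += 1
        # Incremental update of the mean acceptance rate for this drafter.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate_verification(true_rate: float, draft_len: int = 5) -> float:
    """Stand-in for target-LLM verification: accept each draft token i.i.d."""
    accepted = sum(random.random() < true_rate for _ in range(draft_len))
    return accepted / draft_len


if __name__ == "__main__":
    random.seed(0)
    hidden_rates = [0.4, 0.7, 0.55]  # true acceptance rates, unknown to the bandit
    bandit = DrafterBandit(num_drafters=len(hidden_rates))
    for _ in range(200):  # one iteration per draft-verify round
        arm = bandit.select()
        reward = simulate_verification(hidden_rates[arm])
        bandit.update(arm, reward)
    print("selection counts:", bandit.counts)
    print("estimated acceptance rates:", [round(v, 2) for v in bandit.values])
```

In this toy run, the bandit concentrates its choices on the drafter with the highest acceptance rate while still occasionally exploring the others, which is the behavior the framework relies on to route drafting work across heterogeneous drafters.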