Poster in Workshop: Workshop on Machine Learning and Compression
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Zhenglin Wang · Jialong Wu · Yilong Lai · Congzhi Zhang · Deyu Zhou
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning. Tree-search-based reasoning methods address this by encouraging the exploration of intermediate steps, surpassing the capabilities of chain-of-thought prompting. However, these methods introduce significant inference latency because they systematically explore and evaluate multiple thought paths. This paper introduces SEED, a novel and efficient inference framework that improves runtime speed and GPU memory management simultaneously. Built on scheduled speculative execution, SEED efficiently handles the repeated iterations of thought generation and state evaluation, leveraging a rounds-scheduled strategy to manage draft-model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup performance of SEED.
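As a rough illustration of the rounds-scheduled idea described in the abstract, the sketch below cycles a single draft model over multiple reasoning-tree branches in rounds, with the target model verifying each speculation. The names draft_step, verify, and rounds_scheduled_expand are hypothetical stand-ins (toy stubs so the loop runs), not the authors' implementation or API.

    # Minimal sketch, assuming a standard draft-then-verify speculative
    # decoding loop shared across sibling tree branches (not SEED's actual code).
    from collections import deque

    def draft_step(prefix, k=4):
        """Hypothetical draft model: cheaply propose k candidate tokens."""
        return [f"d{len(prefix) + i}" for i in range(k)]

    def verify(prefix, proposal):
        """Hypothetical target model: accept a verified prefix of the proposal
        and append one corrected token, as in speculative decoding."""
        accepted = proposal[: max(1, len(proposal) - 1)]  # toy acceptance rule
        return accepted + [f"t{len(prefix) + len(accepted)}"]

    def rounds_scheduled_expand(branches, max_len=12):
        """Dispatch the single draft model to one branch per round (round-robin),
        verify its speculation with the target model, and drop finished branches."""
        queue = deque(range(len(branches)))
        while queue:
            i = queue.popleft()                      # rounds-scheduled dispatch
            proposal = draft_step(branches[i])
            branches[i] += verify(branches[i], proposal)
            if len(branches[i]) < max_len:           # branch still growing
                queue.append(i)
        return branches

    if __name__ == "__main__":
        # Three sibling thought branches sharing one scheduling loop.
        print(rounds_scheduled_expand([["root"], ["root"], ["root"]]))

How the rounds are scheduled and how draft/target work is interleaved across branches is exactly what the paper's rounds-scheduled strategy governs; this toy version only fixes a round-robin order to make the control flow concrete.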