Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
Xinyu Yang · Tianqi Chen · Beidi Chen
Abstract:
Many adaptive language model applications, such as RAG and ICL, require efficiently combining multiple external contexts to generate a response. In this work, we explore the potential of parallel encoding to speed up generation and extend the context length by pre-caching the KV states of each context separately, so they can be loaded directly and their positions reused during inference. However, directly applying parallel encoding degrades performance because its attention-weight distribution is misaligned with that of sequential encoding. To address this challenge, we propose APE, which introduces a shared prefix, an additional scaling factor, and a lower attention temperature to align the two attention-weight distributions. Extensive experiments show that APE improves performance by 7.8% over standard parallel encoding and 2.9% over sequential encoding on long contexts, while maintaining 93% accuracy in few-shot learning. In the efficiency evaluation, APE achieves a 976$\times$ speedup for 512K context-augmented generation with a 256-token response.
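As a rough illustration of the mechanism described in the abstract (per-context pre-cached KV states concatenated at inference time, with a scaling factor and a lowered attention temperature applied to the attention logits), here is a minimal PyTorch sketch. The function name, tensor shapes, and the values of `scale_factor` and `temperature` are illustrative assumptions rather than the authors' implementation, and the shared prefix is omitted for brevity.

```python
# Minimal sketch (not the authors' code): attending from new query tokens to
# KV states that were encoded separately per context and concatenated at
# inference. scale_factor and temperature values are illustrative assumptions.
import torch
import torch.nn.functional as F

def parallel_context_attention(q, ctx_keys, ctx_values,
                               scale_factor=0.9, temperature=0.8):
    """q:          (n_q, d)   query states for the new tokens
    ctx_keys:   list of (n_i, d) key states, one tensor per pre-cached context
    ctx_values: list of (n_i, d) value states, one tensor per pre-cached context
    scale_factor: hypothetical multiplier on the context attention logits
    temperature:  hypothetical softmax temperature (<1 sharpens attention)
    """
    d = q.shape[-1]
    keys = torch.cat(ctx_keys, dim=0)      # contexts were encoded independently,
    values = torch.cat(ctx_values, dim=0)  # but are attended to jointly here
    logits = (q @ keys.T) / (d ** 0.5)
    logits = scale_factor * logits / temperature
    weights = F.softmax(logits, dim=-1)
    return weights @ values

# Toy usage: three contexts of different lengths, one query token.
d = 64
ctx_keys = [torch.randn(n, d) for n in (10, 20, 15)]
ctx_values = [torch.randn(n, d) for n in (10, 20, 15)]
q = torch.randn(1, d)
out = parallel_context_attention(q, ctx_keys, ctx_values)
print(out.shape)  # torch.Size([1, 64])
```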