Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
Xinyu Yang · Tianqi Chen · Beidi Chen
Abstract:
Many adaptive language model applications, such as RAG and ICL, require efficiently combining multiple external contexts to generate a response. In this work, we explore the potential of parallel encoding to speed up generation and extend the context length by pre-caching the KV states of each context separately, so they can be loaded directly and their positions reused during inference. However, directly applying parallel encoding degrades performance because its attention-weight distribution is misaligned with that of sequential encoding. To address this challenge, we propose APE, which introduces a shared prefix, an additional scaling factor, and a lower attention temperature to align the two attention-weight distributions. Extensive experiments show that APE improves performance by 7.8% over standard parallel encoding and 2.9% over sequential encoding on long contexts, while maintaining 93% accuracy in few-shot learning. In the efficiency evaluation, APE achieves a 976$\times$ speedup for 512K context-augmented generation with a 256-token response.
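As a rough illustration of the mechanism described in the abstract (per-context pre-cached KV states concatenated at inference time, with a scaling factor and a lowered attention temperature applied to the attention logits), here is a minimal PyTorch sketch. The function name, tensor shapes, and the values of `scale_factor` and `temperature` are illustrative assumptions rather than the authors' implementation, and the shared prefix is omitted for brevity.

```python
# Minimal sketch (not the authors' code): attending from new query tokens to
# KV states that were encoded separately per context and concatenated at
# inference. scale_factor and temperature values are illustrative assumptions.
import torch
import torch.nn.functional as F

def parallel_context_attention(q, ctx_keys, ctx_values,
                               scale_factor=0.9, temperature=0.8):
    """q:          (n_q, d)   query states for the new tokens
    ctx_keys:   list of (n_i, d) key states, one tensor per pre-cached context
    ctx_values: list of (n_i, d) value states, one tensor per pre-cached context
    scale_factor: hypothetical multiplier on the context attention logits
    temperature:  hypothetical softmax temperature (<1 sharpens attention)
    """
    d = q.shape[-1]
    keys = torch.cat(ctx_keys, dim=0)      # contexts were encoded independently,
    values = torch.cat(ctx_values, dim=0)  # but are attended to jointly here
    logits = (q @ keys.T) / (d ** 0.5)
    logits = scale_factor * logits / temperature
    weights = F.softmax(logits, dim=-1)
    return weights @ values

# Toy usage: three contexts of different lengths, one query token.
d = 64
ctx_keys = [torch.randn(n, d) for n in (10, 20, 15)]
ctx_values = [torch.randn(n, d) for n in (10, 20, 15)]
q = torch.randn(1, d)
out = parallel_context_attention(q, ctx_keys, ctx_values)
print(out.shape)  # torch.Size([1, 64])
```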