Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
Better Prompt Compression Without Multi-Layer Perceptrons
Edouardo Honig · Andrew Lizarraga · Zijun Frank Zhang · Ying Nian Wu
Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model’s architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multi-layer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters than the original model. Intriguingly, we find that, across a range of compression ratios up to 480×, AOC can better regenerate prompts and outperforms a baseline compression encoder that is a LoRA of the inference language model with its MLP layers intact. These results demonstrate that the architecture of prompt compression encoders need not be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
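To illustrate the core architectural idea described in the abstract, the following is a minimal sketch of stripping the MLP sublayers from a Transformer's blocks to obtain a lighter attention-only encoder backbone. It assumes a GPT-2 style model from Hugging Face Transformers; the names `ZeroMLP` and `strip_mlps` are hypothetical, and the sketch does not include AOC's learned compression tokens or its training procedure.

```python
# Minimal sketch (not the authors' code): remove the MLP sublayer from every
# Transformer block of a GPT-2 style model, leaving only attention, and
# compare parameter counts against the full model.
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ZeroMLP(nn.Module):
    """Stand-in for a removed MLP: contributes nothing to the residual stream."""

    def forward(self, hidden_states):
        # GPT-2 blocks add the MLP output to the residual, so returning zeros
        # makes the block behave as attention-only.
        return hidden_states * 0.0


def strip_mlps(model: GPT2LMHeadModel) -> GPT2LMHeadModel:
    """Replace the MLP in each block so only attention parameters remain."""
    for block in model.transformer.h:
        block.mlp = ZeroMLP()
    return model


if __name__ == "__main__":
    full = GPT2LMHeadModel.from_pretrained("gpt2")
    n_full = sum(p.numel() for p in full.parameters())

    attn_only = strip_mlps(GPT2LMHeadModel.from_pretrained("gpt2"))
    n_attn = sum(p.numel() for p in attn_only.parameters())

    print(f"full model params:           {n_full / 1e6:.1f}M")
    print(f"attention-only model params: {n_attn / 1e6:.1f}M")
```

Because each GPT-2 MLP holds roughly twice as many weights as its attention sublayer, dropping the MLPs removes the bulk of the per-block parameters, consistent with the sizable reduction the abstract reports for the encoder.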