Poster in Workshop on Machine Learning and Compression
Simple LLM Compression Recovery Using Dynamic Prompting with Theoretical Analysis
Duc Hoang · Minsik Cho · Thomas Merth · Mohammad Rastegari · Zhangyang "Atlas" Wang
Large Language Models (LLMs) must be compressed to run on hardware-limited devices, at the cost of reduced performance, especially in natural language comprehension. As a direct consequence, parameter-efficient fine-tuning (PEFT) methods, previously used for task adaptation, are increasingly being applied to post-compression performance recovery; however, the overall cost-benefit of these methods in this setting remains unclear. In this work, we conduct a comprehensive experimental study of various PEFT methods on Llama and OPT models under different compression approaches, using a dedicated test suite that measures model performance, particularly English comprehension. To analyze our results, we propose two conjectures that differentiate the nature of compression damage in LLMs: one holds that certain knowledge is forgotten (or erased) after compression; the other presumes that knowledge is internally displaced. We find that often-overlooked prompting holds a competitive advantage over more advanced approaches such as LoRA. Furthermore, we show that prompting can be extended at minimal latency cost by dynamically allocating multiple prompts to different inputs at inference time, yielding comparable or even better post-compression performance recovery.
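One plausible way to realize the dynamic prompt allocation described above is to keep a small pool of learned soft prompts, each paired with a routing key, and prepend to every input the prompt whose key best matches that input's embedding. The sketch below illustrates this idea only; the pool size, key-matching rule (cosine similarity here), and all variable names are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, n_prompts = 16, 4, 3

# Hypothetical pool of learned soft prompts, each with a routing key.
prompt_pool = rng.normal(size=(n_prompts, prompt_len, d_model))
prompt_keys = rng.normal(size=(n_prompts, d_model))

def select_and_prepend(input_embeds: np.ndarray) -> np.ndarray:
    """Pick the soft prompt whose key is most cosine-similar to the mean
    input embedding, then prepend it to the input sequence."""
    query = input_embeds.mean(axis=0)
    query = query / np.linalg.norm(query)
    keys = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    best = int(np.argmax(keys @ query))  # route this input to one prompt
    return np.concatenate([prompt_pool[best], input_embeds], axis=0)

tokens = rng.normal(size=(10, d_model))  # embeddings of one input sequence
augmented = select_and_prepend(tokens)   # shape: (prompt_len + 10, d_model)
```

Because routing adds only a key lookup before the forward pass, the extra latency is negligible relative to generation itself, which is consistent with the "minimal latency cost" claim.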