

Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models

Inducing Elasticity in Foundation Models: Post-Training Techniques for Adaptable Inference

Aashiq Muhamed · Jiarui Liu · Mona Diab · Virginia Smith

Keywords: Efficient Inference


Abstract:

Large foundation models (LFMs) power a diverse range of applications, but their deployment often requires adapting model size and performance to specific hardware constraints and latency requirements. Existing approaches rely on training independent models of various sizes, leading to storage redundancy, inconsistent behavior across sizes, and limited scalability. This work investigates post-training techniques for inducing elasticity in pre-trained LFMs, enabling dynamic adaptation of model size at inference time based on specific needs. We frame the problem as decomposing LFM weight matrices into sparsely activating factors. Because naive decompositions such as weight SVD struggle to maintain performance on complex tasks while inducing the desired nested sub-structures, we propose two novel methods: SparseDecomp, which exploits sparse neuron activations in feed-forward networks to conditionally select decoder rows, and RankDecomp, which leverages the basis-agnostic nature of Transformers for low-rank weight decomposition. Integrating SparseDecomp and RankDecomp with GritLM-7B, a state-of-the-art LFM that excels at both generative and embedding tasks, we conduct a comparative analysis. Our results demonstrate that the two approaches offer complementary benefits: SparseDecomp maintains robust performance across a wider range of sparsity levels, achieving average speedups of up to 4.6% at 25% sparsity, whereas RankDecomp yields a larger latency reduction, reaching a 22.2% speedup at 25% sparsity, but exhibits greater sensitivity to increasing sparsity. This study provides valuable insights into leveraging post-training weight decomposition to build efficient and adaptable LFMs, paving the way for future research on elastic and resource-aware models.
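The abstract names the two methods only at a high level, so the sketch below illustrates the underlying ideas rather than the paper's actual algorithms. The helpers `svd_factors`, `elastic_matmul`, and `sparse_down_proj` are hypothetical names introduced here for illustration: the first two show how a truncated SVD yields a nested, rank-elastic factorization (the naive weight-SVD baseline the abstract mentions, and the flavor of low-rank decomposition RankDecomp builds on), and the third shows why sparse feed-forward activations let a down-projection skip unused columns, the effect SparseDecomp exploits.

```python
import numpy as np

# --- RankDecomp-style sketch: truncated SVD of a weight matrix ----------
def svd_factors(W: np.ndarray):
    """Factor W (d_out x d_in) into U, s, Vt via SVD. Keeping the top-r
    singular triplets yields a nested family of low-rank approximations,
    so one stored factorization serves many effective model sizes."""
    return np.linalg.svd(W, full_matrices=False)

def elastic_matmul(x: np.ndarray, U, s, Vt, keep_frac: float = 1.0):
    """Compute W @ x using only the top `keep_frac` fraction of ranks;
    cost scales with the retained rank r rather than the full rank."""
    r = max(1, int(keep_frac * len(s)))
    return (U[:, :r] * s[:r]) @ (Vt[:r] @ x)

# --- SparseDecomp-style sketch: skip inactive FFN neurons ---------------
def sparse_down_proj(h: np.ndarray, W_down: np.ndarray):
    """If the FFN hidden activation h is sparse, only the decoder columns
    matching its nonzero entries contribute to W_down @ h."""
    active = np.flatnonzero(h)
    return W_down[:, active] @ h[active]

# Toy check on random weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
x = rng.standard_normal(256)
U, s, Vt = svd_factors(W)
print(np.allclose(elastic_matmul(x, U, s, Vt, 1.0), W @ x))   # True
approx = elastic_matmul(x, U, s, Vt, 0.25)                    # 25% of ranks

h = np.maximum(rng.standard_normal(1024), 0.0)                # ReLU-sparse
W_down = rng.standard_normal((256, 1024))
print(np.allclose(sparse_down_proj(h, W_down), W_down @ h))   # True
```

In an actual deployment the factors would presumably be computed once post-training, with `keep_frac` or the activation pattern chosen per request to trade latency against quality; the abstract's reported speedups come from the authors' integration with GritLM-7B, not from this toy sketch.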
