Poster
in
Workshop: Safe Generative AI
The Impact of Inference Acceleration Strategies on Bias of Large Language Models
Elisabeth Kirsten · Ivan Habernal · Vedant Nanda · Muhammad Bilal Zafar
Last few years have seen unprecedented advances in capabilities of Large Language Models (LLMs). These advancements promise to deeply benefit a vast array of application domains ranging from healthcare to education, and content creation to online search. However, due to their immense size, performing inference with LLMs is both costly and slow. To this end, a plethora of recent work has proposed strategies to enhance inference efficiency, utilizing techniques such as parameter quantization, pruning, and caching. These acceleration strategies have shown great promise: they reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of LLM outputs before and after application of inference acceleration shows a significant impact on bias. To make matters worse, these bias effects are complex and unpredictable, and vary based on models and datasets. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.