Poster in Workshop: Safe Generative AI
Measuring Steerability in Large Language Models
Trenton Chang · Jenna Wiens · Tobias Schnabel · Adith Swaminathan
Large language models (LLMs) are powerful instruction followers. However, many open-ended generation tasks have a large “solution space” that depends on a user’s needs. LLMs that are steerable toward such needs are critical to safe LLM systems that behave consistently with user expectations and goals. Despite continued improvement in LLM instruction following, such gains may not translate to steerability. This disconnect motivates a principled framework for measuring steerability. We therefore propose a goal-oriented, quantitative definition of steerability. Our definition informs the design of an empirical steerability probe, in which we use text rewriting tasks to measure the steerability of LLMs. We demonstrate that recent LLMs are not steerable. We attribute this lack of steerability to “side effects”: correlations between requested goals and non-requested LLM movement. Thus, despite advances in LLM instruction following, there remains significant room for improving LLM steerability.
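To make the distinction between requested movement and side effects concrete, here is a minimal sketch, under assumptions not stated in the abstract: texts are mapped to points in a low-dimensional goal space (e.g., reading level, formality), a rewrite request asks for movement along a subset of those dimensions, and side effects are measured as movement along the dimensions that were never requested. The function name, the goal-space representation, and the specific error/side-effect quantities below are illustrative, not the paper's actual definition.

```python
import numpy as np

def steerability_probe(source_goals, target_goals, output_goals, requested_mask):
    """Toy comparison of requested vs. observed movement in a goal space.

    source_goals:   (n, d) goal-space coordinates of the source texts
    target_goals:   (n, d) coordinates the rewrite request asks to reach
    output_goals:   (n, d) coordinates of the LLM's rewritten texts
    requested_mask: (n, d) boolean, True where a dimension was explicitly requested
    """
    requested = target_goals - source_goals   # movement the user asked for
    observed = output_goals - source_goals    # movement the LLM actually produced

    # Error along requested dimensions: how far the model fell short of (or overshot)
    # the asked-for change.
    on_target = np.where(requested_mask, observed, 0.0)
    asked = np.where(requested_mask, requested, 0.0)
    steer_error = np.linalg.norm(on_target - asked, axis=1)

    # "Side effects": movement along dimensions the user never asked to change.
    side_effects = np.linalg.norm(np.where(requested_mask, 0.0, observed), axis=1)

    return steer_error.mean(), side_effects.mean()

# Toy usage: two goal dimensions (say, reading level and formality); the request
# only asks to change the first, but the model also shifts the second.
src = np.array([[0.2, 0.5]])
tgt = np.array([[0.8, 0.5]])
out = np.array([[0.7, 0.1]])
mask = np.array([[True, False]])

err, side = steerability_probe(src, tgt, out, mask)
print(f"steering error: {err:.2f}, side-effect magnitude: {side:.2f}")
```

In this toy setup, a perfectly steerable model would drive the steering error to zero while leaving non-requested dimensions untouched; correlated, non-requested movement shows up as a nonzero side-effect term, which is the failure mode the abstract highlights.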