Poster
Improving Context-Aware Preference Modeling for Language Models
Silviu Pitis · Ziang Xiao · Nicolas Le Roux · Alessandro Sordoni
While finetuning language models (LMs) from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or because it is provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the underspecification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute several context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. Unlike past datasets, where context-specific preference is highly correlated with general preference, our "preference reversal" datasets disentangle context-specific and general preferences to isolate context-specific capabilities. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance that approaches and sometimes exceeds that of GPT-4 and Llama 3 70B, and (3) investigate the potential value of context-aware preference modeling.
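To make the two-step procedure concrete, here is a minimal Python sketch of the second step: scoring a pair of responses under an explicitly supplied context via a Bradley-Terry preference probability. The function names and the toy length-based reward are illustrative assumptions, not the paper's model; in the actual work the reward would come from a finetuned context-aware reward model, and step 1 (context selection) is handled separately.

```python
import math
from typing import Callable

# Hypothetical stand-in for a learned reward model r(context, prompt, response) -> float.
RewardFn = Callable[[str, str, str], float]


def toy_reward(context: str, prompt: str, response: str) -> float:
    """Illustrative reward only: favors longer responses when the context asks for
    detail and shorter ones when it asks for brevity. A real system would use a
    finetuned context-aware reward model here."""
    length = len(response.split())
    return float(length) if "detailed" in context else -float(length)


def context_conditioned_preference(
    reward: RewardFn, context: str, prompt: str, response_a: str, response_b: str
) -> float:
    """Step 2 of the two-step procedure: given a resolved context, return the
    Bradley-Terry probability that response_a is preferred to response_b."""
    delta = reward(context, prompt, response_a) - reward(context, prompt, response_b)
    return 1.0 / (1.0 + math.exp(-delta))


if __name__ == "__main__":
    prompt = "Explain photosynthesis."
    a = "Photosynthesis converts light energy into chemical energy stored in glucose."
    b = (
        "Photosynthesis is the process by which plants, algae, and some bacteria use "
        "light energy to convert carbon dioxide and water into glucose and oxygen, "
        "proceeding in two stages: the light reactions and the Calvin cycle."
    )
    # Step 1 (context selection) is supplied externally here; the two contexts
    # below reverse the preferred response, mirroring a "preference reversal" pair.
    for ctx in (
        "The user wants a one-sentence summary.",
        "The user wants a detailed explanation.",
    ):
        p = context_conditioned_preference(toy_reward, ctx, prompt, a, b)
        print(f"{ctx!r}: P(A preferred) = {p:.2f}")
```

Running the sketch shows the preference flipping with the context: the brevity context prefers response A, while the detail context prefers response B, which is the kind of context-specific judgment the preference reversal datasets are designed to isolate.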