Invited Talk in Workshop on Behavioral Machine Learning
Hannah Rose Kirk: Putting the H Back in RLHF: Challenging assumptions of human behaviour for AI alignment
Early work in AI alignment relied on restrictive assumptions about human behaviour to make progress even in simple 1:1 settings with a single operator. This talk addresses two key considerations for developing more realistic models of human preferences for alignment today. In Part I, we challenge the assumption that values and preferences are universal or acontextual by examining interpersonal dilemmas: what happens when we disagree with one another? I'll introduce the PRISM Alignment Dataset as a key new resource that contextualizes preference ratings across diverse human groups with detailed sociodemographic data. In Part II, we challenge the assumption that values and preferences are stable or exogenous by exploring intrapersonal dilemmas: what happens when we disagree with ourselves? I'll introduce ongoing research on anthropomorphism in human-AI interaction, examining how revealed preferences often conflict with stated preferences, especially with respect to AI systems' social capabilities and over longitudinal interactions.