

Poster in Workshop: Workshop on Behavioral Machine Learning

Analyzing Reward Functions via Trajectory Alignment

Calarina Muslimani · Suyog Chandramouli · Serena Booth · Brad Knox · Matthew Taylor


Abstract:

Reward design in reinforcement learning (RL) is often overlooked, with the assumption that a well-defined reward is readily available. However, reward functions can be challenging to design and prone to reward hacking, potentially leading to unintended or dangerous consequences in real-world applications. To create safe RL agents, reward alignment is crucial. We define reward alignment as the process of designing reward functions that preserve the preferences of a human stakeholder. In practice, reward functions are designed with training performance as the primary measure of success; this measure, however, may not reflect alignment. This work studies the practical implications of reward design on alignment. Specifically, we (1) propose a reward alignment metric, the Trajectory Alignment coefficient, that measures the similarity between the preference orderings of a human stakeholder and the preference orderings induced by a reward function, (2) use this metric to quantify the prevalence and extent of misalignment in human-designed reward functions, and (3) examine how misalignment affects the efficacy of these human-designed reward functions in terms of training performance.
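The abstract does not give the exact formula for the Trajectory Alignment coefficient. As a rough illustration of the idea of comparing preference orderings, the sketch below assumes the coefficient behaves like a Kendall-tau-style agreement score over pairwise trajectory preferences; the function names, tie handling, and inputs (`human_pref`, `reward_return`) are hypothetical and not from the paper.

```python
import itertools
from typing import Callable, Sequence


def trajectory_alignment_sketch(
    trajectories: Sequence[object],
    human_pref: Callable[[object, object], int],   # -1: first preferred, +1: second preferred, 0: tie
    reward_return: Callable[[object], float],      # scalar return under the designed reward function
) -> float:
    """Hypothetical agreement score between a human stakeholder's pairwise
    trajectory preferences and the ordering induced by a reward function.

    Returns a value in [-1, 1]: +1 if the reward function ranks every
    compared pair the same way as the human, -1 if it reverses every pair.
    """
    agree, disagree = 0, 0
    for a, b in itertools.combinations(trajectories, 2):
        h = human_pref(a, b)
        diff = reward_return(a) - reward_return(b)
        r = -1 if diff > 0 else (1 if diff < 0 else 0)
        if h == 0 or r == 0:
            continue  # skip ties; the paper may treat ties differently
        if h == r:
            agree += 1
        else:
            disagree += 1
    total = agree + disagree
    return 0.0 if total == 0 else (agree - disagree) / total
```

A rank-correlation-style measure like this is one natural way to compare two preference orderings without assuming the reward function is calibrated to the human's utility scale; the actual definition used in the paper may differ.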
