Poster in Workshop: LaReL: Language and Reinforcement Learning
Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems
Yihao Feng · Shentao Yang · Shujian Zhang · Jianguo Zhang · Caiming Xiong · Mingyuan Zhou · Huan Wang
Keywords: [ Reinforcement Learning ] [ reward learning ] [ task-oriented dialogue systems ]
When learning task-oriented dialogue (TOD) agents, reinforcement learning (RL) techniques can naturally be used to train conversational strategies that achieve user-specific goals. Existing work on training TOD agents mainly focuses on developing advanced RL algorithms, while the design of the reward functions themselves is not well studied. This paper discusses how we can better learn and utilize reward functions for training TOD agents. Specifically, we propose two generalized objectives for reward function learning, inspired by classical learning-to-rank losses. Further, to address the high variance of policy gradient estimates obtained with REINFORCE, we leverage the Gumbel-Softmax trick to better estimate the gradient for TOD policies, which significantly improves training stability for policy learning. With the above techniques, we outperform state-of-the-art results on the end-to-end dialogue task on the MultiWOZ 2.0 dataset.
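As a rough illustration of the two ingredients summarized above, the PyTorch sketch below shows (i) a pairwise learning-to-rank style objective for reward learning and (ii) a Gumbel-Softmax relaxation used in place of REINFORCE sampling for the policy update. The module names, tensor shapes, and the assumption that the reward model can score soft token mixtures are illustrative choices of ours under stated assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_reward_loss(reward_model, better_dialogues, worse_dialogues):
    """Learning-to-rank style reward learning: the reward model should score
    preferred (e.g., successful) dialogues above less preferred ones.
    `reward_model` is a hypothetical callable returning a scalar per dialogue."""
    r_better = reward_model(better_dialogues)  # shape: (batch,)
    r_worse = reward_model(worse_dialogues)    # shape: (batch,)
    # Pairwise logistic (RankNet-style) loss: penalize when the "worse"
    # dialogue is not ranked below the "better" one.
    return F.softplus(-(r_better - r_worse)).mean()


def gumbel_softmax_policy_loss(policy_logits, reward_model, temperature=1.0):
    """Instead of sampling discrete tokens and using the high-variance
    REINFORCE estimator, draw a differentiable Gumbel-Softmax relaxation of
    the token distribution and backpropagate the learned reward through it."""
    # policy_logits: (batch, seq_len, vocab_size) token logits from the TOD policy.
    soft_tokens = F.gumbel_softmax(policy_logits, tau=temperature, hard=False)
    # Assumes the reward model accepts soft (one-hot-like) token mixtures.
    rewards = reward_model(soft_tokens)        # shape: (batch,)
    return -rewards.mean()                     # minimize negative reward
```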