Oral in Workshop: Foundation Models for Decision Making
Target Rate Optimization: Avoiding Iterative Error Exploitation
Braham Snyder · Amy Zhang · Yuke Zhu
Large models solve hard problems in supervised, unsupervised, and bandit learning. Unfortunately, even with large models, many real-world reinforcement learning (RL) problems remain intractable. A key issue is that sample-efficient RL algorithms are unstable. Early stopping sometimes works around this. But, partly because of the instability itself, early stopping is difficult. Further, the standard approach to early stopping in RL is to stop all learning. Why not instead fix the early stopping that most target networks already use implicitly? That is, in algorithms like DQN, the target update rate already early-stops DQN's target-fitting subproblems. Currently, practitioners must either hope the default target rate performs well, or tune it with an expensive grid search over online returns. Moreover, within a run, algorithms like DQN continue to update the target even when those updates increase the training error. This degrades value estimates, which degrades returns. Newer off-policy and offline RL algorithms lessen this well-known deadly-triad divergence, but still often fail to capture peak returns. To combat these issues, we propose adding optimization of the training error with respect to the target update rate. Our algorithm, Target Rate Optimization, empirically prevents divergence and increases return on both discrete- and continuous-action RL problems.
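To make the high-level idea concrete, here is a minimal sketch of adapting a soft (Polyak) target update rate from the training error, assuming a simple multiplicative adjustment rule. The function names, the rule, and all hyperparameters are illustrative assumptions, not the paper's actual Target Rate Optimization algorithm.

```python
# Illustrative sketch only: a soft-target update whose rate is adapted from
# the training (TD) error rather than fixed. The multiplicative shrink/grow
# rule and all constants below are assumptions for illustration.
import numpy as np

def polyak_update(target_params, online_params, rate):
    """Soft target update: target <- (1 - rate) * target + rate * online."""
    return {k: (1.0 - rate) * target_params[k] + rate * online_params[k]
            for k in target_params}

def adapt_target_rate(prev_error, curr_error, rate,
                      shrink=0.5, grow=1.05, min_rate=1e-4, max_rate=1.0):
    """Hypothetical rule: if the last target update increased the training
    error, slow the target down; otherwise let it track the online net faster."""
    if curr_error > prev_error:
        return max(rate * shrink, min_rate)
    return min(rate * grow, max_rate)

# Toy usage with scalar "networks" standing in for parameter vectors.
online = {"w": np.array([1.0])}
target = {"w": np.array([0.0])}
rate, prev_error = 0.01, np.inf
for step in range(5):
    online["w"] += 0.1                                  # pretend gradient step
    curr_error = float(abs(online["w"] - target["w"]))  # stand-in for TD error
    rate = adapt_target_rate(prev_error, curr_error, rate)
    target = polyak_update(target, online, rate)
    prev_error = curr_error
```

The sketch only conveys the interface: the target rate becomes a quantity optimized against the training error within a run, rather than a hyperparameter fixed in advance or tuned by grid search over returns.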