Poster in Workshop: Deep Reinforcement Learning
Policy Optimization via Optimal Policy Evaluation
Alberto Maria Metelli · Samuele Meta · Marcello Restelli
Off-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed as a what-if analysis tool, with the goal of estimating the performance of a target policy given samples collected with a different behavioral policy. In Monte Carlo simulation, however, IS serves as a variance-minimization technique: a suitably chosen behavioral distribution is used for sampling, allowing the variance of the estimator to be reduced below the one achievable when sampling from the target distribution. In this paper, we analyze IS in these two guises, showing the connections between the two objectives. We illustrate that variance minimization can be used as a performance-improvement tool, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. We build on these theoretical findings to develop a PO algorithm, Policy Optimization via Optimal Policy Evaluation (PO2PE), that employs variance minimization as an inner loop. Finally, we present empirical evaluations on continuous RL benchmarks, with a particular focus on robustness to small batch sizes.
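To make the two guises of IS contrasted in the abstract concrete, here is a minimal, self-contained Python sketch (not taken from the paper) on a 1-D toy problem: first, IS as off-policy evaluation, reweighting samples from a fixed behavioral distribution to estimate an expectation under a different target distribution; second, IS as variance minimization, where the behavioral distribution is deliberately chosen so the estimator's variance falls below that of sampling from the target itself. All distributions, parameters, and function names are illustrative assumptions.

```python
# Toy illustration of the two roles of Importance Sampling (IS).
# Assumed setup (not from the paper): estimate E_target[f(X)] with
# f(x) = 1{x > 2} and target X ~ N(0, 1), so the true value is P(X > 2).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Quantity whose expectation under the target we want
    # (in PO this would be the return of the target policy).
    return (x > 2.0).astype(float)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 10_000
mu_target, sigma = 0.0, 1.0

# --- Guise 1: off-policy evaluation ("what-if" analysis) ---
# Samples come from a *given* behavioral distribution; IS reweights them
# to estimate the expectation under the target distribution.
mu_behavior = 1.0
x_b = rng.normal(mu_behavior, sigma, size=n)
w_b = gaussian_pdf(x_b, mu_target, sigma) / gaussian_pdf(x_b, mu_behavior, sigma)
off_policy_estimate = np.mean(w_b * f(x_b))

# --- Guise 2: variance minimization (Monte Carlo simulation) ---
# The behavioral distribution is *chosen* to shift mass toward the region
# where f is nonzero (x > 2), yielding an IS estimator with lower variance
# than sampling from the target directly.
mu_proposal = 2.5  # assumed choice, roughly where |f| * target density is large
x_p = rng.normal(mu_proposal, sigma, size=n)
w_p = gaussian_pdf(x_p, mu_target, sigma) / gaussian_pdf(x_p, mu_proposal, sigma)
variance_reduced_estimate = np.mean(w_p * f(x_p))

# On-policy Monte Carlo baseline: sample the target distribution directly.
x_t = rng.normal(mu_target, sigma, size=n)
on_policy_estimate = np.mean(f(x_t))

print(f"on-policy MC:        {on_policy_estimate:.5f}")
print(f"off-policy IS:       {off_policy_estimate:.5f}")
print(f"variance-reduced IS: {variance_reduced_estimate:.5f}  (true value ~0.02275)")
```

In this sketch the variance-reduced estimate is far more stable across seeds than the on-policy baseline, which is the property the paper leverages: choosing the behavioral distribution to minimize estimator variance, rather than merely correcting for it, and using that choice as the inner loop of policy optimization.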