Spotlight
High-Dimensional Contextual Policy Search with Unknown Context Rewards using Bayesian Optimization
Qing Feng · Ben Letham · Hongzi Mao · Eytan Bakshy
Orals & Spotlights: Reinforcement Learning
Contextual policies are used in many settings to customize system parameters and actions to the specifics of a particular environment. In some real-world settings, such as randomized controlled trials or A/B tests, it may not be possible to measure policy outcomes at the level of individual contexts: we observe only aggregate rewards across a distribution of contexts. This makes policy optimization much more difficult because we must solve a high-dimensional optimization problem over the entire space of contextual policies, for which existing optimization methods are not suitable. We develop effective models that leverage the structure of the search space to enable contextual policy optimization directly from aggregate rewards using Bayesian optimization. We use a collection of simulation studies to characterize the performance and robustness of the models, and show that our approach of inferring a low-dimensional context embedding performs best. Finally, we demonstrate successful contextual policy optimization on a real-world video bitrate policy problem.
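
To make the setup concrete, below is a minimal sketch (not the paper's implementation) of contextual policy search from aggregate rewards: a policy assigns a parameter vector to each of C contexts, the reward oracle returns only a weighted average across contexts, and a plain GP surrogate with a UCB acquisition is optimized over the full concatenated policy. The toy quadratic reward, the context weights, and all names here are illustrative assumptions; the paper's method additionally exploits the structure of this search space (e.g., a learned low-dimensional context embedding), which this sketch omits.

    # Hypothetical toy example: BO over a concatenated contextual policy
    # where only the aggregate reward across contexts is observable.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    rng = np.random.default_rng(0)
    C, d = 8, 2                                  # contexts, per-context parameter dim
    D = C * d                                    # full policy dimension
    weights = rng.dirichlet(np.ones(C))          # context distribution (assumed)
    optima = rng.uniform(0.2, 0.8, size=(C, d))  # hidden per-context optima (assumed)

    def aggregate_reward(policy_flat):
        """Return only the aggregate reward; per-context rewards stay hidden."""
        policy = policy_flat.reshape(C, d)
        per_context = -np.sum((policy - optima) ** 2, axis=1)  # toy quadratic reward
        return float(weights @ per_context)

    # Initial random design over the high-dimensional policy space.
    X = rng.uniform(0.0, 1.0, size=(10, D))
    y = np.array([aggregate_reward(x) for x in X])

    for _ in range(40):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)
        # Cheap acquisition step: UCB maximized over random candidates.
        cand = rng.uniform(0.0, 1.0, size=(2048, D))
        mu, sd = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + 2.0 * sd)]
        X = np.vstack([X, x_next])
        y = np.append(y, aggregate_reward(x_next))

    print(f"best aggregate reward found: {y.max():.4f}")

Even in this small example, the surrogate must model a 16-dimensional function from aggregate observations alone, which illustrates why unstructured GP models scale poorly here and why the paper's structured models are needed.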