Poster

A Unified Principle of Pessimism for Offline Reinforcement Learning under Model Mismatch

Yue Wang · Zhongchang Sun · Shaofeng Zou

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: In this paper, we address the challenges of offline reinforcement learning (RL) under model mismatch, where the agent aims to optimize its performance using an offline dataset that may not accurately represent the deployment environment. We identify two primary challenges in this setting: inaccurate model estimation due to limited data, and performance degradation caused by the model mismatch between the data-collecting environment and the target deployment environment. To tackle these issues, we propose a unified principle of pessimism based on distributionally robust Markov decision processes. We carefully construct a robust MDP with a single uncertainty set that addresses both data sparsity and model mismatch, and demonstrate that the optimal robust policy enjoys a near-optimal sub-optimality gap under the target environment across three widely used uncertainty models: total variation, $\chi^2$ divergence, and KL divergence. Our results improve upon or match the state-of-the-art guarantees under the total variation and KL divergence models, and provide the first result for the $\chi^2$ divergence model.
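For context, the following LaTeX sketch shows a standard formulation of a distributionally robust MDP with an $(s,a)$-rectangular uncertainty set built around an empirical transition kernel. The notation ($\widehat{P}$, divergence $D$, radius $\rho$, discount $\gamma$) is assumed for illustration and is not taken from the paper, which constructs its own single uncertainty set to handle both data sparsity and model mismatch.

% Assumed notation: \widehat{P} is the empirical kernel estimated from the offline dataset,
% D is one of the divergences named in the abstract (total variation, \chi^2, or KL),
% and \rho is the radius of the uncertainty set.
\[
  \mathcal{P}(s,a) \;=\; \bigl\{\, P(\cdot \mid s,a) \;:\; D\bigl(P(\cdot \mid s,a)\,\|\,\widehat{P}(\cdot \mid s,a)\bigr) \le \rho \,\bigr\}
\]
% The robust value of a policy is its worst-case discounted return over the uncertainty set,
% and the robust optimal policy maximizes this worst-case value.
\[
  V^{\pi}_{\mathrm{rob}}(s) \;=\; \min_{P \in \bigotimes_{s,a} \mathcal{P}(s,a)}
  \mathbb{E}_{P,\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
  \qquad
  \pi^{\star}_{\mathrm{rob}} \in \arg\max_{\pi}\, V^{\pi}_{\mathrm{rob}}(s).
\]

Optimizing this worst-case value is what makes the approach pessimistic: the learned policy is evaluated against the least favorable transition model within the divergence ball, which simultaneously guards against estimation error from limited data and against mismatch with the deployment environment.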
