Poster in Workshop: Adaptive Experimental Design and Active Learning in the Real World
Learning Models and Evaluating Policies with Offline Off-Policy Data under Partial Observability
Shreyas Chaudhari · Philip Thomas · Bruno C. da Silva
Models in reinforcement learning are often estimated from offline data, which in many real-world scenarios is subject to partial observability. In this work, we study the challenges that arise when models estimated from partially observable offline data are used for policy evaluation. Notably, such models must be defined in conjunction with the data-collecting (behavior) policy. To address this issue, we introduce a method for model estimation that incorporates importance weighting into the model learning process. Off-policy samples are reweighted to reflect their probabilities under the evaluation policy, so that the resulting model is a consistent estimator of the off-policy model and yields consistent off-policy estimates of the expected return. This is a crucial step toward the reliable and responsible use of models learned under partial observability, particularly in scenarios where inaccurate policy evaluation can have catastrophic consequences. We empirically demonstrate the efficacy of our method, and its resilience to common approximations such as weight clipping, on a range of domains with diverse types of partial observability.
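To make the reweighting idea concrete, the sketch below illustrates importance-weighted maximum-likelihood estimation of a tabular transition model from off-policy trajectories, with optional weight clipping. It is a minimal illustration under simplifying assumptions (tabular observations, known behavior and evaluation policies), not the authors' exact estimator; all names such as `estimate_weighted_model`, `behavior_policy`, `eval_policy`, and `clip` are hypothetical.

```python
import numpy as np

def estimate_weighted_model(trajectories, behavior_policy, eval_policy,
                            n_obs, clip=None):
    """Importance-weighted estimate of P(o' | o, a) from off-policy data.

    trajectories: list of trajectories, each a time-ordered list of (o, a, o') tuples.
    behavior_policy, eval_policy: arrays of shape (n_obs, n_actions) giving pi(a | o).
    clip: optional upper bound on the cumulative importance weight
          (a common approximation whose effect the paper studies).
    """
    n_actions = behavior_policy.shape[1]
    counts = np.zeros((n_obs, n_actions, n_obs))
    for traj in trajectories:
        rho = 1.0  # cumulative importance ratio along the trajectory
        for (o, a, o_next) in traj:
            rho *= eval_policy[o, a] / behavior_policy[o, a]
            w = min(rho, clip) if clip is not None else rho
            counts[o, a, o_next] += w  # weighted transition count
    totals = counts.sum(axis=-1, keepdims=True)
    # Normalize weighted counts; unvisited (o, a) pairs fall back to a uniform distribution.
    model = np.divide(counts, totals,
                      out=np.full_like(counts, 1.0 / n_obs),
                      where=totals > 0)
    return model
```

In this sketch, the cumulative ratio `rho` reweights each transition toward its probability under the evaluation policy; clipping trades bias for lower variance, which is why robustness to it matters in practice.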