Poster in Workshop: New Frontiers of AI for Drug Discovery and Development
Online Learning of Optimal Prescriptions under Bandit Feedback with Unknown Contexts
Hongju Park · Mohamad Kazem Shirani Faradonbeh
Keywords: [ Partial Observability ] [ Regret Bounds ] [ Contextual Bandits ] [ Thompson Sampling ]
Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the prescription with the highest reward given the contextual information, while the unknown reward parameters of each prescription must be learned by experimenting with it. Accordingly, a fundamental problem is balancing exploration (i.e., prescribing different options to learn their parameters) against exploitation (i.e., sticking with the apparently best option to accrue reward). The existing literature mostly studies this problem under perfectly observed contexts; the setting of partially observed contexts remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal prescriptions based on observations that are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) the scaling with other quantities, including the dimensions and the number of options. Extensive numerical experiments with both real and synthetic data corroborate the efficacy of Thompson sampling. To establish the results, we utilize concentration inequalities for dependent data and develop novel probabilistic bounds for time-varying suboptimality gaps, among other techniques. These tools pave the road towards studying similar problems.
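To make the setting concrete, the following Python sketch simulates a Thompson sampling policy in a linear contextual bandit where only a noisy linear observation of the latent context is available. The sensing matrix, the Gaussian priors, and the use of the raw observation as a surrogate context in the posterior update are illustrative assumptions for this sketch; they are not the authors' exact model or estimator.

```python
# Minimal sketch: Thompson sampling with partially observed contexts.
# Dimensions, noise levels, priors, and the surrogate-context update below
# are illustrative assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, T = 5, 3, 2000
sigma_ctx, sigma_rew = 0.3, 0.5          # observation and reward noise levels

A = rng.normal(size=(d, d))              # assumed sensing matrix (noisy linear map)
mu_true = rng.normal(size=(n_arms, d))   # unknown reward parameters per arm

# Gaussian posterior for each arm: precision matrix B and vector f, so that
# the posterior mean is B^{-1} f and the covariance is sigma_rew^2 * B^{-1}.
B = np.stack([np.eye(d) for _ in range(n_arms)])
f = np.zeros((n_arms, d))

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)                          # latent context (never observed)
    y = A @ x + sigma_ctx * rng.normal(size=d)      # observed noisy linear function of x

    # Thompson sampling: draw one parameter sample per arm, act greedily on it.
    sampled = np.array([
        rng.multivariate_normal(np.linalg.solve(B[i], f[i]),
                                sigma_rew**2 * np.linalg.inv(B[i]))
        for i in range(n_arms)
    ])
    arm = int(np.argmax(sampled @ y))

    reward = mu_true[arm] @ x + sigma_rew * rng.normal()
    regret += np.max(mu_true @ x) - mu_true[arm] @ x

    # Conjugate Bayesian linear-regression update using y as a surrogate context.
    B[arm] += np.outer(y, y)
    f[arm] += reward * y

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Under this toy model, the cumulative regret flattens as the per-arm posteriors concentrate, which is the qualitative behavior that the poly-logarithmic regret bounds formalize.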