Poster in Workshop: Adaptive Experimental Design and Active Learning in the Real World
Sequentially Adaptive Experimentation for Learning Optimal Options subject to Unobserved Contexts
Hongju Park · Mohamad Kazem Shirani Faradonbeh
Contextual bandits constitute a classical framework for the interactive learning of optimal decisions given context information. In this setting, the goal is to sequentially learn which arms yield the highest rewards given the contextual information, while the unknown reward parameters of each arm must be learned by experimenting with it. Accordingly, a fundamental problem is that of balancing such experimentation (i.e., pulling different arms to learn their parameters) against sticking with the best arm learned so far, in order to maximize rewards. To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partially observed contexts remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on observations that are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that adaptive experiments based on samples from the posterior distribution learn the optimal arms efficiently. Specifically, we establish regret bounds that grow logarithmically with time. Extensive simulations on real-world data are also presented to illustrate this efficacy.
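To make the setting concrete, below is a minimal sketch of posterior (Thompson) sampling with partially observed contexts, in the spirit of the policy described above. All modeling choices here are illustrative assumptions, not taken from the paper: contexts x_t are standard Gaussian, the learner sees only y_t = A x_t + noise for a known sensing matrix A, rewards are linear in x_t with Gaussian noise, and the latent context is replaced by its posterior mean E[x_t | y_t] inside an otherwise standard Thompson-sampling update.

```python
# Hypothetical sketch of Thompson sampling with unobserved contexts.
# Assumptions (not from the paper): x_t ~ N(0, I), observations
# y_t = A x_t + v_t with known A, rewards r_t = x_t' theta_a + noise.
import numpy as np

rng = np.random.default_rng(0)
d, m, K, T = 3, 5, 4, 2000           # context dim, observation dim, arms, horizon
A = rng.normal(size=(m, d))          # known sensing matrix (assumed)
theta = rng.normal(size=(K, d))      # true (unknown) arm parameters
sigma_v, sigma_r = 0.5, 0.5          # observation / reward noise levels

# Posterior mean of the latent context given the observation:
# with x ~ N(0, I), E[x | y] = A' (A A' + sigma_v^2 I)^{-1} y.
G = A.T @ np.linalg.inv(A @ A.T + sigma_v**2 * np.eye(m))

# Gaussian posterior over each arm's parameter: N(B^{-1} b, sigma_r^2 B^{-1}),
# i.e., Bayesian linear regression with prior theta_a ~ N(0, sigma_r^2 I).
B = np.stack([np.eye(d) for _ in range(K)])   # per-arm precision matrices
b = np.zeros((K, d))                          # per-arm precision-weighted means

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)                     # latent context (never observed)
    y = A @ x + sigma_v * rng.normal(size=m)   # what the learner actually sees
    x_hat = G @ y                              # posterior mean E[x_t | y_t]

    # Thompson step: sample a parameter from each arm's posterior,
    # then pull the arm whose sample looks best at the estimated context.
    samples = np.array([
        rng.multivariate_normal(np.linalg.solve(B[k], b[k]),
                                sigma_r**2 * np.linalg.inv(B[k]))
        for k in range(K)
    ])
    a = int(np.argmax(samples @ x_hat))

    r = theta[a] @ x + sigma_r * rng.normal()  # reward depends on the true x
    B[a] += np.outer(x_hat, x_hat)             # update only the pulled arm,
    b[a] += r * x_hat                          # using x_hat in place of x
    regret += np.max(theta @ x) - theta[a] @ x

print(f"average per-round regret after T={T}: {regret / T:.3f}")
```

Under these assumptions, the average per-round regret shrinks as T grows, consistent with the logarithmic regret growth claimed above; the key design choice illustrated is that the unobservable context enters the learner's updates only through its posterior mean given the noisy linear observation.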