Oral in Workshop: Learning in Presence of Strategic Behavior
Pessimistic Offline Reinforcement Learning with Multiple Agents
Yihang Chen
We study offline multi-agent reinforcement learning (RL), which aims to learn optimal policies for multiple agents from historical data. Unlike in online RL, in offline RL accidental overestimation errors arising from function approximation can accumulate and affect subsequent iterations. In this paper, we extend the pessimistic value iteration algorithm to the multi-agent setting: after obtaining a lower bound on the value function of each agent, we compute the optimistic policy by solving a general-sum matrix game with these lower bounds as the payoff matrices. Instead of finding a Nash equilibrium of such a general-sum game, our algorithm solves for a Coarse Correlated Equilibrium (CCE) for the sake of computational efficiency. Finally, we prove an information-theoretic lower bound on the regret of any algorithm given an offline dataset in the two-agent zero-sum setting. To the best of our knowledge, such a CCE scheme for pessimism-based algorithms has not appeared in the literature and may be of independent interest. We hope that our work can shed light on future analyses of the equilibria of multi-agent Markov decision processes given an offline dataset.
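To illustrate the equilibrium-computation step, here is a minimal sketch, not from the paper, of finding a CCE of a two-player general-sum matrix game by linear programming. The function name solve_cce and the toy payoff matrices U1 and U2 are illustrative assumptions; in the algorithm described above, the payoff matrices would be the pessimistic lower bounds on the agents' value functions.

```python
# Sketch: compute a coarse correlated equilibrium (CCE) of a two-player
# general-sum matrix game via a linear program. Hypothetical example code,
# not the paper's implementation.
import numpy as np
from scipy.optimize import linprog

def solve_cce(U1, U2):
    """Return a joint action distribution x[a, b] that is a CCE of (U1, U2)."""
    m, n = U1.shape                      # action counts of player 1 and player 2
    num_vars = m * n                     # one variable per joint action (a, b)

    rows = []
    # Player 1: no fixed deviation a' improves its expected payoff under x.
    for a_dev in range(m):
        coeff = np.zeros((m, n))
        for a in range(m):
            for b in range(n):
                coeff[a, b] = U1[a_dev, b] - U1[a, b]
        rows.append(coeff.ravel())
    # Player 2: no fixed deviation b' improves its expected payoff under x.
    for b_dev in range(n):
        coeff = np.zeros((m, n))
        for a in range(m):
            for b in range(n):
                coeff[a, b] = U2[a, b_dev] - U2[a, b]
        rows.append(coeff.ravel())

    A_ub = np.array(rows)                # no-deviation constraints: A_ub @ x <= 0
    b_ub = np.zeros(A_ub.shape[0])
    A_eq = np.ones((1, num_vars))        # probabilities sum to one
    b_eq = np.array([1.0])
    c = -(U1 + U2).ravel()               # break ties toward high social welfare

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * num_vars)
    return res.x.reshape(m, n)

# Example: a 2 x 2 general-sum game (prisoner's-dilemma-style payoffs).
U1 = np.array([[3.0, 0.0], [5.0, 1.0]])
U2 = np.array([[3.0, 5.0], [0.0, 1.0]])
print(solve_cce(U1, U2))
```

The computational appeal of this relaxation is that the CCE no-deviation conditions are linear in the joint distribution, so a CCE can be found with a single linear program, whereas computing a Nash equilibrium of a general-sum matrix game is PPAD-hard in general.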