Poster
Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback
Haolin Liu · Zak Mhammedi · Chen-Yu Wei · Julian Zimmert
Abstract
Thu 12 Dec 11 a.m. – 2 p.m. PST
Abstract:
We consider regret minimization in low-rank MDPs with fixed transitions and adversarial losses. Previous work has studied this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024) or bandit loss feedback with known transitions (Foster et al., 2022). We initiate the study of the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose a computationally inefficient model-based algorithm with $\mathrm{poly}(d, A, H)\,T^{\frac{2}{3}}$ regret and oracle-efficient model-free algorithms with $\mathrm{poly}(d, A, H)\,T^{\frac{5}{6}}$ regret, where $d$ is the rank of the transitions, $A$ is the number of actions, $H$ is the horizon length, and $T$ is the number of episodes. We show that the linear structure is necessary in the bandit case: without structure on the reward function, the regret has to scale polynomially with the number of states. This is in contrast to the full-information case (Zhao et al., 2024), where the regret can be independent of the number of states even for unstructured reward functions.
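For readers unfamiliar with the setup, the following LaTeX sketch spells out the standard definitions behind these results: the low-rank factorization of the transition kernel, one common reading of the linear loss assumption, and the regret notion being minimized. The notation ($\phi_h$, $\mu_h$, $\theta_t$, $V^\pi$) is our own illustrative choice and not necessarily the paper's.

% Low-rank MDP: each step-h transition kernel factorizes through
% unknown d-dimensional embeddings \phi_h and \mu_h.
\[
  P_h(s' \mid s, a) = \big\langle \phi_h(s, a),\, \mu_h(s') \big\rangle,
  \qquad \phi_h(s, a),\ \mu_h(s') \in \mathbb{R}^d .
\]
% Linear loss structure (an assumed reading of "linear structure"):
% the episode-t loss is linear in the same unknown features.
\[
  \ell_t(s, a) = \big\langle \phi_h(s, a),\, \theta_t \big\rangle
  \quad \text{for some unknown } \theta_t \in \mathbb{R}^d .
\]
% Regret over T episodes against the best fixed policy in hindsight,
% where V^{\pi}(\ell_t) is the expected cumulative loss of policy \pi
% under loss \ell_t and the fixed transition P.
\[
  \mathrm{Reg}_T = \sum_{t=1}^{T} V^{\pi_t}(\ell_t) - \min_{\pi} \sum_{t=1}^{T} V^{\pi}(\ell_t).
\]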