Abstract:
A major research direction in contextual bandits is to
develop algorithms that are computationally efficient, yet support
flexible, general-purpose function approximation. Algorithms based on
modeling rewards have shown strong empirical performance, yet
typically require a well-specified model, and can fail when this
assumption does not hold. Can we design algorithms that are efficient
and flexible, yet degrade gracefully in the face of model
misspecification? We introduce a new family of oracle-efficient algorithms for $\varepsilon$-misspecified contextual bandits
that adapt to unknown model misspecification, in both the
finite- and infinite-action settings. Given access to an
\emph{online oracle} for square-loss regression, our algorithm attains
optimal regret, and in particular optimal dependence on the
misspecification level, with \emph{no prior knowledge} of $\varepsilon$. Specializing
to linear contextual bandits with infinite actions in $d$ dimensions,
we obtain the first algorithm that achieves the optimal
$\widetilde{O}(d\sqrt{T} + \varepsilon\sqrt{d}T)$ regret bound for unknown $\varepsilon$.
On a conceptual level, our results are enabled by a new
optimization-based perspective on the regression oracle reduction framework of
Foster and Rakhlin (2020), which we believe will be useful more broadly.