NeurIPS Poster How Does Variance Shape the Regret in Contextual Bandits?

Poster

How Does Variance Shape the Regret in Contextual Bandits?

Zeyu Jia · Jian Qian · Alexander Rakhlin · Chen-Yu Wei

West Ballroom A-D #6705

[ Abstract ]

[ Paper] [ OpenReview]

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract: We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax regret bounds, we show that the eluder dimension $d_{\text{elu}}$$-$a measure of the complexity of the function class$-$plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega( \sqrt{ \min (A, d_{\text{elu}}) \Lambda } + d_{\text{elu}} )$ is unavoidable when $d_{\text{elu}} \leq \sqrt{A T}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A\leq d_{\text{elu}}$ regime, we derive a nearly matching upper bound $\tilde{O}( \sqrt{ A\Lambda } + d_{\text{elu} } )$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of $\Omega( \sqrt{ d_{\text{elu}} \Lambda } + d_{\text{elu}} )$ is unavoidable when $\sqrt{ d_{\text{elu}} \Lambda } + d_{\text{elu}} \leq \sqrt{A T}$. In this setting, we provide an upper bound of order $\tilde{O}( d_{\text{elu}}\sqrt{ \Lambda } + d_{\text{elu}} )$.Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_{\text{elu}} \Lambda} + d_{\text{elu}})$ established in their work is unimprovable when $\sqrt{d_{\text{elu}} \Lambda} + d_{\text{elu}}\leq \sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda} + d_{\text{elu}})$.

Chat is not available.