Abstract:
The classical theory of reinforcement learning (RL) has focused on tabular
and linear representations of value functions. Further progress hinges on
combining RL with modern function approximators such as kernel functions and
deep neural networks, and indeed there have been many empirical successes
that have exploited such combinations in large-scale applications. There
are profound challenges, however, in developing a theory to support this
enterprise, most notably the need to take into consideration the exploration-exploitation
tradeoff at the core of RL in conjunction with the computational and statistical
tradeoffs that arise in modern function-approximation-based learning systems.
We approach these challenges by studying an optimistic modification of the
least-squares value iteration algorithm, where the action-value function is
represented by a kernel function or an overparameterized neural network.
We establish both polynomial runtime complexity and polynomial sample complexity
for this algorithm, without additional assumptions on the data-generating model.
In particular, we prove that the algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states,
a result which exhibits clearly the benefit of function approximation in RL.
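The abstract does not spell out the algorithm itself; the following is a minimal illustrative sketch of how one optimistic least-squares value-iteration update could look when the action-value function is represented with a kernel. The RBF kernel, the function names (rbf_kernel, fit_optimistic_q), and the hyperparameters (bandwidth, lam, beta) are assumptions made for illustration, not the paper's exact construction.

```python
# Illustrative sketch (not the paper's exact algorithm): kernel ridge
# regression for the action-value function plus an uncertainty bonus,
# i.e. "optimism in the face of uncertainty" at a single step h.
import numpy as np

def rbf_kernel(Z1, Z2, bandwidth=1.0):
    """RBF kernel between two batches of state-action feature vectors."""
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def fit_optimistic_q(Z, y, lam=1.0, beta=1.0, H=10):
    """Fit an optimistic Q-estimate at one step h.

    Z : (n, d) array of past state-action features observed at step h
    y : (n,)   regression targets r + V_{h+1}(s')
    Returns q(z_query) giving a truncated optimistic Q-value estimate.
    """
    n = len(Z)
    K = rbf_kernel(Z, Z)
    A = np.linalg.inv(K + lam * np.eye(n))            # (K + lam*I)^{-1}
    alpha = A @ y                                     # ridge-regression weights
    def q(z_query):
        k = rbf_kernel(z_query, Z)                    # (m, n) cross-kernel
        mean = k @ alpha                              # least-squares estimate
        var = np.diag(rbf_kernel(z_query, z_query)) - np.einsum('ij,jk,ik->i', k, A, k)
        bonus = beta * np.sqrt(np.maximum(var, 0.0))  # optimism via kernel uncertainty
        return np.minimum(mean + bonus, float(H))     # truncate at the episode horizon
    return q

# A backward pass over steps h = H, ..., 1 would call fit_optimistic_q at
# each step, building the targets y from the greedy value max_a Q_{h+1}(s', a)
# of the previously fitted step.
```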