
Poster

Occupancy-based Policy Gradient: Estimation, Convergence, and Optimality

Audrey Huang · Nan Jiang

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Occupancy functions play an instrumental role in reinforcement learning (RL) for guiding exploration, handling distribution shift, and optimizing general objectives beyond the expected return. Yet, computationally efficient policy optimization methods that use (only) occupancy functions are virtually non-existent. In this paper, we establish the theoretical foundations of model-free policy gradient (PG) methods that compute the gradient through the occupancy for both online and offline RL, without modeling value functions. Our algorithms reduce gradient estimation to squared-loss regression and are computationally oracle-efficient. We characterize the sample complexities of both local and global convergence, accounting for both finite-sample estimation error and the roles of exploration (online) and data coverage (offline). Occupancy-based PG naturally handles arbitrary offline data distributions, and, with one-line algorithmic changes, can be adapted to optimize any differentiable objective functional.
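To make the occupancy-based view concrete, here is a minimal sketch in JAX that differentiates the expected return J(θ) = ⟨d^{π_θ}, r⟩ directly through the occupancy of a small tabular MDP, with no value function anywhere. The MDP, softmax parameterization, and sizes are illustrative assumptions; this is not the paper's model-free estimator, which instead estimates the gradient from samples via a squared-loss regression oracle.

```python
# Toy illustration of an occupancy-based policy gradient:
# J(theta) = <d^{pi_theta}, r>, so grad J is obtained by differentiating
# through the occupancy d^{pi_theta} itself (no value-function modeling).
# Tabular MDP and policy class below are assumptions for the sketch only.

import jax
import jax.numpy as jnp

S, A, gamma = 4, 2, 0.9
P = jax.random.dirichlet(jax.random.PRNGKey(0), jnp.ones(S), shape=(S, A))  # P[s, a, s']
r = jax.random.uniform(jax.random.PRNGKey(1), (S, A))                       # rewards r(s, a)
mu0 = jnp.ones(S) / S                                                        # initial distribution

def policy(theta):
    # softmax policy pi_theta(a | s)
    return jax.nn.softmax(theta, axis=-1)

def occupancy(theta):
    # discounted state occupancy solves d = (1 - gamma) * mu0 + gamma * P_pi^T d
    pi = policy(theta)
    P_pi = jnp.einsum('sap,sa->sp', P, pi)                 # state-to-state kernel under pi
    d_s = jnp.linalg.solve(jnp.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_s[:, None] * pi                                # state-action occupancy d(s, a)

def J(theta):
    # expected return written purely through the occupancy
    return jnp.sum(occupancy(theta) * r)

theta = jnp.zeros((S, A))
grad_J = jax.grad(J)(theta)                                 # occupancy-based policy gradient
print(grad_J)
```

In the online and offline settings studied in the paper, d^{π_θ} is of course not available in closed form; the point of the sketch is only that the gradient flows through the occupancy, which is the quantity the paper's regression-based estimators target.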
