Track: Oral Session 2C: Reinforcement Learning

Oral

Oral Session 2C: Reinforcement Learning

East Meeting Room 1-3

Wed 11 Dec 3:30 p.m. PST — 4:30 p.m. PST

Abstract:

Chat is not available.

Wed 11 Dec. 15:30 - 15:50 PST

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Shaoteng Liu · Haoqi Yuan · Minda Hu · Yanwei Li · Yukang Chen · Shu Liu · Zongqing Lu · Jiaya Jia

Large Language Models (LLMs) have demonstrated proficiency in utilizing various tools by coding, yet they face limitations in handling intricate logic and precise control. In embodied tasks, high-level planning is amenable to direct coding, while low-level actions often necessitate task-specific refinement, such as Reinforcement Learning (RL). To seamlessly integrate both modalities, we introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline. Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves SOTA performance across all designated MineDojo tasks.

Wed 11 Dec. 15:50 - 16:10 PST

Statistical Efficiency of Distributional Temporal Difference Learning

Yang Peng · Liangyu Zhang · Zhihua Zhang

Distributional reinforcement learning (DRL) has achieved empirical success in various domains.One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$.The distributional temporal difference learning has been accordingly proposed, whichis an extension of the temporal difference learning (TD) in the classic RL area.In the tabular case, Rowland et al. [2018] and Rowland et al. [2023] proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively.In this paper, we go a step further and analyze the finite-sample performance of distributional TD.To facilitate theoretical analysis, we propose a non-parametric distributional TD learning (NTD).For a $\gamma$-discounted infinite-horizon tabular Markov decision process,we show that for NTD we need $\widetilde O\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance.This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance.To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.

Wed 11 Dec. 16:10 - 16:30 PST

**SeeA: Efficient Exploration-Enhanced A Search by Selective Sampling**

Dengwei Zhao · Shikui Tu · Lei Xu

Monte-Carlo tree search (MCTS) and reinforcement learning contributed crucially to the success of AlphaGo and AlphaZero, and A$^*$ is a tree search algorithm among the most well-known ones in the classical AI literature. MCTS and A$^*$ both perform heuristic search and are mutually beneficial. Efforts have been made to the renaissance of A$^*$ from three possible aspects, two of which have been confirmed by studies in recent years, while the third is about the OPEN list that consists of open nodes of A$^*$ search, but still lacks deep investigation. This paper aims at the third, i.e., developing the Sampling-exploration enhanced A$^*$ (SeeA$^*$) search by constructing a dynamic subset of OPEN through a selective sampling process, such that the node with the best heuristic value in this subset instead of in the OPEN is expanded. Nodes with the best heuristic values in OPEN are most probably picked into this subset, but sometimes may not be included, which enables SeeA$^*$ to explore other promising branches. Three sampling techniques are presented for comparative investigations. Moreover, under the assumption about the distribution of prediction errors, we have theoretically shown the superior efficiency of SeeA$^*$ over A$^*$ search, particularly when the accuracy of the guiding heuristic function is insufficient. Experimental results on retrosynthetic planning in organic chemistry, logic synthesis in integrated circuit design, and the classical Sokoban game empirically demonstrate the efficiency of SeeA$^*$, in comparison with the state-of-the-art heuristic search algorithms.

Oral

Oral Session 2C: Reinforcement Learning

East Meeting Room 1-3

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Statistical Efficiency of Distributional Temporal Difference Learning

SeeA*: Efficient Exploration-Enhanced A* Search by Selective Sampling

**SeeA: Efficient Exploration-Enhanced A Search by Selective Sampling**