Oral in Workshop: Foundation Models for Decision Making
In-Context Multi-Armed Bandits via Supervised Pretraining
Fred Zhang · Jiaxin Ye · Zhuoran Yang
This research explores the in-context learning capabilities of large transformer models for decision-making in reinforcement learning (RL) environments, focusing on multi-armed bandit problems. We introduce the Reward-Weighted Decision-Pretrained Transformer (DPT-RW), a model pretrained with straightforward supervised learning under a reward-weighted imitation loss: given a query state and an in-context dataset of interactions drawn from varied tasks, it predicts the optimal action. Surprisingly, this simple approach yields a model that solves a wide range of RL problems in-context, exhibiting online exploration and offline conservatism without being explicitly trained for either. Most notably, the model performs optimally in the online setting despite being pretrained only on data generated by suboptimal policies, with no access to optimal actions.
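To make the pretraining objective concrete, below is a minimal, hypothetical sketch of a reward-weighted imitation loss for a bandit-context transformer. The architecture (`BanditTransformer`), the dimensions, the synthetic Bernoulli tasks, and the choice to sample training targets from the behavior policy are all illustrative assumptions, not the authors' actual implementation; the only element taken from the abstract is the idea of weighting each imitation term by the reward the imitated action earned.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, CTX, D = 5, 20, 32  # arms, in-context dataset size, embedding width (illustrative)

class BanditTransformer(nn.Module):
    """Hypothetical stand-in for the pretrained model: embeds (action, reward)
    pairs from the in-context dataset and outputs logits over the K arms."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(K + 1, D)  # one-hot action + scalar reward
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, K)

    def forward(self, ctx_actions, ctx_rewards):
        # ctx_actions: (B, CTX) int64; ctx_rewards: (B, CTX) float
        tokens = torch.cat(
            [F.one_hot(ctx_actions, K).float(), ctx_rewards.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.embed(tokens))  # (B, CTX, D)
        return self.head(h.mean(dim=1))       # pool over context -> arm logits

def reward_weighted_loss(logits, target_actions, target_rewards):
    """Imitation loss in which each cross-entropy term is weighted by the
    reward the imitated action earned, so high-reward actions dominate."""
    ce = F.cross_entropy(logits, target_actions, reduction="none")
    return (target_rewards * ce).mean()

# One pretraining step on synthetic Bernoulli bandit tasks (assumed setup).
model = BanditTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

B = 64
arm_means = torch.rand(B, K)                       # a fresh task per batch row
ctx_actions = torch.randint(K, (B, CTX))           # suboptimal behavior policy
ctx_rewards = torch.bernoulli(arm_means.gather(1, ctx_actions))
# Target also comes from the behavior policy; the reward weight substitutes
# for access to the optimal action, per the abstract's premise.
tgt_actions = torch.randint(K, (B,))
tgt_rewards = torch.bernoulli(
    arm_means.gather(1, tgt_actions.unsqueeze(1))).squeeze(1)

logits = model(ctx_actions, ctx_rewards)
loss = reward_weighted_loss(logits, tgt_actions, tgt_rewards)
opt.zero_grad(); loss.backward(); opt.step()
```

The key design point this sketch tries to capture is that no optimal-action labels appear anywhere: the supervision signal is ordinary action imitation, and the reward weight alone biases the model toward high-reward behavior across tasks.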