Poster in Workshop: Goal-Conditioned Reinforcement Learning
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith
Keywords: [ deep learning ] [ transformers ] [ sequential decision making ] [ instruction following ] [ reinforcement learning ] [ foundation models ] [ text-conditioned reinforcement learning ] [ goal-conditioned reinforcement learning ] [ sequence models ] [ Minecraft ]
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, used in DALL·E 2, is also effective for creating instruction-following sequential decision-making agents. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 of compute to train and can follow a wide range of short-horizon, open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
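To make the unCLIP analogy concrete, the sketch below shows the inference-time pipeline the abstract describes: a text instruction is embedded (by a frozen MineCLIP-style text encoder), a learned prior translates that text embedding into a visual goal embedding, and a goal-conditioned VPT-style policy acts on observations given that goal. This is a minimal illustration, not the released STEVE-1 code; `GoalPrior`, `GoalConditionedPolicy`, and all dimensions are hypothetical stand-ins.

```python
# Hedged sketch of an unCLIP-style text-to-behavior pipeline.
# Assumptions: a 512-d shared embedding space (placeholder for MineCLIP),
# an MLP prior, and an MLP policy standing in for the VPT backbone.
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed MineCLIP embedding size (illustrative)


class GoalPrior(nn.Module):
    """Maps a text embedding to a visual goal embedding (the unCLIP 'prior' step)."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)


class GoalConditionedPolicy(nn.Module):
    """Stand-in for a VPT-style policy conditioned on a goal embedding."""

    def __init__(self, obs_dim: int = 1024, dim: int = EMBED_DIM, n_actions: int = 121):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + dim, 1024), nn.GELU(), nn.Linear(1024, n_actions)
        )

    def forward(self, obs: torch.Tensor, goal_emb: torch.Tensor) -> torch.Tensor:
        # Condition on the goal by concatenating it with the observation features.
        return self.net(torch.cat([obs, goal_emb], dim=-1))


# Inference: text -> (frozen) text encoder -> prior -> goal embedding -> policy.
text_emb = torch.randn(1, EMBED_DIM)  # placeholder for MineCLIP's text encoder output
goal_emb = GoalPrior()(text_emb)      # predicted visual goal embedding
obs = torch.randn(1, 1024)            # placeholder for encoded pixel observations
action_logits = GoalConditionedPolicy()(obs, goal_emb)
print(action_logits.shape)            # torch.Size([1, 121])
```

The design choice mirrors DALL·E 2: because the policy is conditioned on embeddings in a shared text-video space, the same agent can accept either text instructions (via the prior) or visual goal examples (embedded directly), which is how the abstract's "text and visual instructions" are both supported.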