

Poster
in
Workshop: NeurIPS 2023 Workshop on Diffusion Models

Text-Aware Diffusion Policies

Calvin Luo · Chen Sun


Abstract:

Diffusion models scaled to massive datasets have demonstrated a powerful capability to unify the language modality and pixel space, as convincingly evidenced by high-quality text-to-image synthesis that delights and astounds. In this work, we interpret agents interacting within a visual reinforcement learning setting as trainable video renderers, where the output video is simply frames stitched together across sequential timesteps. We then propose Text-Aware Diffusion Policies (TADPols), which use large-scale pretrained models, particularly text-to-image diffusion models, to train policies that are aligned with natural-language text inputs. Since the behavior represented by a policy naturally learns to align with the reward function used during optimization, we propose generating the reward signal for a reinforcement learning agent as the similarity between a provided text description and the frames the agent produces through its interactions. Furthermore, rendering the video produced by an agent during inference can be treated as a form of text-to-video generation, where the video has the added benefit of always being smooth and consistent with the environment's specifications. Additionally, keeping the diffusion model frozen enables an investigation of how well a large-scale model pretrained only on static image and text data understands temporally extended behaviors and actions. We conduct experiments on a variety of locomotion tasks across multiple subjects, and demonstrate that agents can be trained using the unified understanding of vision and language captured within large-scale pretrained diffusion models, not only to synthesize videos that correspond to provided text, but also to perform the described motion themselves as autonomous agents.
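The abstract does not specify how the text-frame similarity reward is computed, so the following is only a minimal sketch of one plausible instantiation: scoring a rendered frame by how well a frozen, text-conditioned diffusion model can denoise it, and using the negative denoising error as the reward. All interfaces here (`unet`, `encode_text`, `alphas_cumprod`) are hypothetical stand-ins for a pretrained text-to-image diffusion model, not the authors' actual implementation.

```python
import torch

# Hypothetical stand-ins for a frozen pretrained text-to-image
# diffusion model (not the paper's actual API):
#   encode_text(prompt) -> text conditioning embedding
#   unet(noisy_image, t, cond) -> predicted noise at timestep t
#   alphas_cumprod -> 1-D tensor of cumulative noise-schedule products

def text_alignment_reward(frame, prompt, unet, encode_text,
                          alphas_cumprod, n_samples=4):
    """Score a rendered agent frame by how well the frozen diffusion
    model denoises it when conditioned on the text prompt; a lower
    text-conditioned denoising error is taken to mean better
    text-frame alignment, and its negation serves as the RL reward."""
    cond = encode_text(prompt)
    errors = []
    for _ in range(n_samples):
        # Sample a random diffusion timestep and corrupt the frame
        # with the forward (noising) process.
        t = torch.randint(0, len(alphas_cumprod), (1,))
        noise = torch.randn_like(frame)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy = a.sqrt() * frame + (1 - a).sqrt() * noise
        with torch.no_grad():  # the diffusion model stays frozen
            pred = unet(noisy, t, cond)
        errors.append((pred - noise).pow(2).mean())
    # Negative mean denoising error: higher reward for frames the
    # text-conditioned model finds easier to denoise.
    return -torch.stack(errors).mean()
```

Under this reading, the reward is queried once per rendered frame (or per short window of frames), and any standard RL algorithm can then optimize the policy against it, with no gradients ever flowing into the pretrained diffusion model.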
