Large language models (LMs) have achieved remarkable success in many language tasks.
Recent work has also shown that knowledge of the world can emerge from large LMs, enabling them to assist decision-making for embodied tasks. However, the world knowledge exhibited by current large LMs is often not robust and cannot be grounded in physical environments without additional models. This hinders their ability to reliably perform complex reasoning and planning. For example, when creating action plans to move blocks to a target state, GPT-3 achieves a success rate of only 1%, compared to 78% for humans.
Humans, on the other hand, perform deliberate reasoning and planning based on a mental model of the world (i.e., a world model, WM) that enables us to simulate actions and their effects on the world's state. WMs that encode knowledge of the physical world can drastically improve the data efficiency and robustness of intelligent agents. However, WMs have typically been studied in reinforcement learning and robotics, fields conceptually distinct from the problems studied in language modeling.
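To make the notion of simulating actions with a WM concrete, the minimal sketch below (not part of the tutorial material; all names, the toy transition function, and the greedy lookahead are illustrative assumptions) plans block moves by querying a hypothetical world model for the predicted next state and keeping actions that bring the state closer to the goal.

```python
# Minimal illustrative sketch: planning by simulating candidate actions with a
# toy world model. The transition function ignores physical constraints and is
# purely for illustration of the "simulate, then act" idea.
from typing import Dict, List, Tuple

State = Dict[str, str]    # block -> surface it rests on, e.g. {"B": "A"}
Action = Tuple[str, str]  # (block, destination)

def world_model(state: State, action: Action) -> State:
    """Toy transition function: predict the state after moving a block."""
    block, destination = action
    next_state = dict(state)
    next_state[block] = destination
    return next_state

def plan(state: State, goal: State, actions: List[Action], horizon: int = 5) -> List[Action]:
    """Greedy lookahead: simulate each action with the world model and keep
    the one that reduces the number of blocks not yet in their goal position."""
    def mismatch(s: State) -> int:
        return sum(s[b] != goal[b] for b in goal)

    chosen: List[Action] = []
    for _ in range(horizon):
        best = min(actions, key=lambda a: mismatch(world_model(state, a)))
        if mismatch(world_model(state, best)) >= mismatch(state):
            break  # no simulated action improves the current state
        state = world_model(state, best)
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    start = {"A": "table", "B": "A"}
    goal = {"A": "B", "B": "table"}
    candidates = [("A", "B"), ("A", "table"), ("B", "A"), ("B", "table")]
    print(plan(start, goal, candidates))  # e.g. [('A', 'B'), ('B', 'table')]
```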
This gap presents enormous new opportunities for connecting WMs and LMs to enhance LMs' reasoning and planning capabilities in both embodied and general settings, and to address the aforementioned limitations. Emerging studies at the intersection of WMs and LMs have demonstrated promising results. This tutorial aims to summarize and present a unified view of connecting WMs and LMs, and to highlight the various opportunities for improved machine reasoning and planning based on (or even beyond) large LMs through world modeling. We will review recent work on learning WMs and on using them to further learn and perform embodied tasks. We will show how LMs can utilize external WMs to compensate for their lack of grounded world knowledge, and how LMs can themselves learn WMs from embodied experiences beyond text data and use these internal WMs to guide complex reasoning.