Expo Talk Panel
West Ballroom C

Diffusion-based video generation has attracted significant interest across both academia and industry as the next exciting step in demonstrating the capabilities of large-scale deep learning models. More recently, the goal of world simulator models, which generate video in real time in response to user input, has begun to take shape, as it may introduce a new paradigm for human interaction with deep learning models. In this talk, we describe the training of the largest world simulator model, trained on millions of hours of data across thousands of GPUs. Training this large-scale diffusion model was made possible by two fundamental pillars developed by Decart and Crusoe, both crucial to its success and highlighted in this talk. The first pillar is the adaptation and optimization of the model training infrastructure to enable fast training of large-scale models. This infrastructure is integrated with the Crusoe cluster to provide high-throughput, reliable training that is resilient to GPU failures, and it relies on optimized data pipelines that processed millions of hours of video at scale. The second pillar comprises new model architectures we propose that are at the forefront of diffusion-based video generation and enable real-time conditioning and inference at scale. Together, these pillars enable the training of massive world simulator models that advance the landscape of human-model interaction.