

Poster

A Practitioner's Guide to Real-World Continual Multimodal Pretraining

Karsten Roth · Vishaal Udandarao · Sebastian Dziadzio · Ameya Prabhu · Mehdi Cherti · Oriol Vinyals · Olivier Henaff · Samuel Albanie · Matthias Bethge · Zeynep Akata

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Foundation models require vast amounts of data and compute to train. Still, over time, they become outdated and need to be updated with new information. Current research explores two main ways to update these models: (i) large-scale, indiscriminate updates through continual fine-tuning on large quantities of new data and compute, and (ii) frequent, small updates on a fact or sample level with continual knowledge edits or retrieval with a fixed backbone. However, many real-world applications need updates to subdomains and concepts not well covered during pretraining, or to new, specific tasks. How best to update foundation models in cases that go beyond small edits but do not warrant re-pretraining remains unclear. This work aims to provide extensive guidance on effective continual model updates in such scenarios. We introduce FoMo-in-Flux, a benchmark for continual multimodal pretraining with real-world compute constraints and diverse coverage. FoMo-in-Flux operates over 63 datasets, provides long data and task horizons, and measures both adaptation and preservation of zero-shot transfer abilities. We conduct extensive experiments to explore multiple perspectives: (i) a data-centric study investigating pretraining and adaptation data mixtures alongside real-world stream orderings, (ii) a method landscape spanning simple fine-tuning, parameter-efficient updates, traditional continual learning strategies, and model merging, and (iii) training recipes exploring learning rate schedules and low-level design choices. Together, we extend current perspectives and provide a practitioner's guide to continual multimodal pretraining for real-world deployment.
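
The abstract lists model merging among the update strategies studied. As a rough illustration of the general idea only, the sketch below linearly interpolates between a pretrained and a fine-tuned checkpoint. It is not taken from the paper: the function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy checkpoints (plain dictionaries of float lists) are all assumptions made for this example.

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Linearly interpolate two checkpoints that share parameter names.

    Hypothetical helper for illustration: alpha=0 returns the pretrained
    weights, alpha=1 returns the fine-tuned weights.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [(1 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return merged


# Toy checkpoints (assumed values, not real model weights).
pretrained = {"w": [0.0, 2.0], "b": [1.0]}
finetuned = {"w": [1.0, 4.0], "b": [3.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.5)
```

Intermediate values of `alpha` trade adaptation to the new data against retention of the original weights, which loosely mirrors the adaptation-versus-preservation tension the benchmark measures.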
