Poster in Workshop: AIM-FM: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond
Promoting cross-modal representations to improve multimodal foundation models for physiological signals
Ching Fang · Christopher Sandino · Behrooz Mahasseni · Juri Minxha · Hadi Pouransari · Erdrin Azemi · Ali Moin · Ellen Zippi
Many healthcare applications are inherently multimodal and involve multiple types of physiological signals. As sensors for measuring these signals become more ubiquitous, it is increasingly important to improve machine learning methods that consume multimodal healthcare data. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in the early stages of exploration, and it is unclear which pretraining strategies are most effective given the diverse set of physiological signals collected. This is in part due to the challenges of multimodal learning with health data: data across many patients is difficult and expensive to obtain, and there is substantial inter-subject variability. Furthermore, modalities are often heterogeneously informative across the downstream tasks of interest. Here, we explore these challenges in the PhysioNet 2018 Challenge dataset, collected across 1,985 patients. We use a masked autoencoding objective to pretrain a multimodal model on the dataset. We show that the model learns representations that can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for the success of multimodal training, as they encourage the model to combine information across modalities. We demonstrate that adding modality drop in the input space improves model performance across downstream tasks. We also show that late-fusion models pretrained with contrastive learning objectives are not as effective across multiple tasks. Finally, we analyze the representations developed in the model. We show how attention weights become more cross-modal and temporally aligned as a result of our chosen pretraining strategy. The learned embeddings also become more distributed in terms of the modalities that each unit in the model encodes. Taken together, our work demonstrates the utility of multimodal foundation models with health data, even across diverse physiological data sources. We further argue that more explicit means of inducing cross-modal information sharing may be valuable additions to any multimodal pretraining strategy.
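To make the input-space modality drop idea concrete, the sketch below shows one way it could be applied before masked-autoencoder pretraining: entire modalities are zeroed out at random so that reconstructing them requires using information from the remaining modalities. The function name, drop probability, modality names, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def modality_drop(signals: dict[str, torch.Tensor], p_drop: float = 0.5) -> dict[str, torch.Tensor]:
    """Randomly zero out entire modalities before masking/encoding.

    signals: maps a modality name (e.g. "eeg", "ecg") to a (batch, time, features) tensor.
    At least one modality is always kept so the encoder has input to reconstruct from.
    """
    names = list(signals.keys())
    keep = [name for name in names if torch.rand(()).item() > p_drop]
    if not keep:  # guarantee at least one surviving modality
        keep = [names[torch.randint(len(names), ()).item()]]
    return {
        name: x if name in keep else torch.zeros_like(x)
        for name, x in signals.items()
    }

# Usage: dropped modalities must be reconstructed from the surviving ones,
# which is the cross-modal pressure described in the abstract.
batch = {"eeg": torch.randn(8, 3000, 6), "ecg": torch.randn(8, 3000, 1)}
batch = modality_drop(batch, p_drop=0.5)
```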