Poster in Workshop: NeurIPS 2023 Workshop on Diffusion Models
Masked Multi-time Diffusion for Multi-modal Generative Modeling
Mustapha BOUNOUA · Giulio Franzese · Pietro Michiardi
Multi-modal data is ubiquitous, and models that learn a joint representation of all modalities have flourished. However, existing approaches suffer from a coherence-quality tradeoff, where generation quality comes at the expense of generative coherence across modalities, and vice versa. To overcome this limitation, we propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. Individual latent variables are concatenated and fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Empirically, our methodology substantially outperforms competitors in both generation quality and coherence.
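To make the described pipeline concrete, below is a minimal PyTorch sketch of one masked multi-time training step, under assumptions not specified in the abstract: a VP-style noise schedule, a 0.5 masking probability, and a `score_net` that takes the concatenated latents plus one diffusion time per modality. The per-modality encoders, the schedule, and the network signature are all illustrative placeholders, not the authors' implementation.

```python
import torch

def multi_time_training_step(score_net, z_list, T=1.0):
    """One masked multi-time denoising step (sketch, not the paper's code).

    z_list: list of clean latent batches, one per modality, each of shape
            (B, d_m), produced by frozen, independently trained uni-modal
            deterministic autoencoders.
    """
    B = z_list[0].shape[0]
    M = len(z_list)

    # Randomly mask a subset of modalities per example: masked modalities
    # act as clean conditioning inputs, the rest are diffused.
    cond_mask = torch.rand(B, M) < 0.5  # True = condition on this modality

    noisy, times, noises = [], [], []
    for m, z in enumerate(z_list):
        t = torch.rand(B) * T                 # independent time per modality
        t = torch.where(cond_mask[:, m], torch.zeros_like(t), t)
        alpha = torch.exp(-0.5 * t)           # toy VP-style schedule (assumption)
        sigma = torch.sqrt(1.0 - alpha ** 2)
        eps = torch.randn_like(z)
        z_t = alpha[:, None] * z + sigma[:, None] * eps
        # Conditioning modalities are fed clean (t = 0 implies z_t = z).
        noisy.append(z_t)
        times.append(t)
        noises.append(eps)

    z_cat = torch.cat(noisy, dim=-1)          # concatenated joint latent
    t_cat = torch.stack(times, dim=-1)        # one diffusion time per modality
    eps_pred = score_net(z_cat, t_cat)        # predicts noise for all latents

    # Denoising loss only on the diffused (non-conditioning) modalities.
    loss, offset = 0.0, 0
    for m, (eps, t) in enumerate(zip(noises, times)):
        d = eps.shape[-1]
        pred = eps_pred[:, offset:offset + d]
        w = (~cond_mask[:, m]).float()[:, None]
        loss = loss + (w * (pred - eps) ** 2).mean()
        offset += d
    return loss
```

Because each modality carries its own time variable, the same network covers joint generation (all times sampled), conditional generation (conditioning times pinned to zero), and anything in between, which is the role the multi-time training plays in the abstract.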