NeurIPS Efficient Multimodal Alignment: To Freeze or Not to Freeze?

Poster in Workshop: UniReps: Unifying Representations in Neural Models

Efficient Multimodal Alignment: To Freeze or Not to Freeze?

Till Aczel · Roger Wattenhofer

Abstract:

Language-image pretraining creates a joint representation space between the two modalities, in which images and texts with similar semantic information lie close to each other. Language-image models are often trained from scratch, without taking advantage of unimodal pretrained models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet1K validation set with two orders of magnitude lower training cost. In this work, we highlight the importance of unfreezing the CLS tokens of unimodal transformer encoders to create a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy from 37.5% to 19.4% across the 38 evaluation benchmarks.
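
As a rough illustration of the freeze/unfreeze choice described in the abstract, the sketch below marks which parameters of a pretrained encoder would receive gradient updates. It is not the authors' implementation. The parameter names, the `trainable_mask` helper, and the assumption that every non-CLS encoder weight stays frozen are all hypothetical; the abstract does not spell out these details, and any alignment or projection layers are left out. In a framework such as PyTorch, the resulting flags would typically be applied by setting `requires_grad` on each parameter.

```python
from typing import Dict, Iterable


def trainable_mask(
    param_names: Iterable[str],
    cls_key: str = "cls_token",
    freeze_cls: bool = False,
) -> Dict[str, bool]:
    """Map each parameter name to True if it should be updated during alignment.

    Assumption (not stated in the abstract): all encoder weights other than
    the CLS token are kept frozen.
    """
    mask = {}
    for name in param_names:
        is_cls = cls_key in name
        # Only the CLS token is trainable, and only when it is not frozen.
        mask[name] = is_cls and not freeze_cls
    return mask


# Hypothetical parameter names for a small transformer image encoder.
image_params = [
    "cls_token",
    "patch_embed.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc1.weight",
]

# Setting compared in the abstract: CLS token unfrozen vs. frozen.
unfrozen = trainable_mask(image_params)
frozen = trainable_mask(image_params, freeze_cls=True)

print([name for name, flag in unfrozen.items() if flag])
print([name for name, flag in frozen.items() if flag])
```

With the CLS token unfrozen, only that one parameter is selected for training; with `freeze_cls=True`, nothing in the encoder is updated, which corresponds to the fully frozen setting the abstract reports as performing worse.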
