Poster
Diffusion-Inspired Truncated Sampler for Text-Video Retrieval
Jiamian Wang · Pichao Wang · Dongfang Liu · Qiang Guan · Sohail Dianat · Majid Rabbani · Raghuveer Rao · Zhiqiang Tao
Prevalent text-to-video retrieval methods represent multi-modal text-video data in a joint embedding space, aiming to bring relevant text-video pairs together and push irrelevant ones apart. One main challenge in state-of-the-art retrieval methods lies in the modality gap, which stems from the substantial disparity between text and video and can persist in the joint space. The primary goal of this work is to leverage the power of diffusion models in modeling the text-video modality gap, and we uncover two design defects of the diffusion model for retrieval: the L2 loss does not fit the ranking problem inherent in text-video retrieval, and the generation quality depends heavily on the varied initial points drawn from an isotropic Gaussian, causing inaccurate retrieval. To this end, we introduce the Diffusion-Inspired Truncated Sampler (DITS), which jointly performs progressive alignment and modality-gap modeling in the joint embedding space. The key innovation of DITS is to leverage the inherent proximity of text and video embeddings, defining a truncated diffusion flow from the fixed text embedding to the video embedding and alleviating the volatility introduced by the isotropic Gaussian. Moreover, DITS adopts a contrastive loss to jointly consider relevant and irrelevant pairs, not only facilitating alignment but also yielding a discriminatively structured embedding space. Experiments on five benchmark datasets demonstrate the state-of-the-art performance of DITS. We empirically find that DITS flexibly adjusts the retrieval scope (Recall@K) over time and also improves the structure of the CLIP embedding space. Code and models will be released.
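To make the two key design choices concrete, a minimal PyTorch-style sketch of the idea is given below: a few-step sampler that starts from the fixed text embedding rather than isotropic Gaussian noise, and a symmetric contrastive loss over relevant/irrelevant pairs in place of a per-pair L2 objective. This is an illustrative assumption, not the authors' released implementation; the module name `TruncatedSampler`, the step-conditioned MLP denoiser, the number of steps, and the temperature are all hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TruncatedSampler(nn.Module):
    """Sketch: a few-step flow that starts at the (fixed) text embedding
    and progressively moves it toward the paired video embedding,
    avoiding an initial sample from an isotropic Gaussian."""

    def __init__(self, dim=512, num_steps=4):
        super().__init__()
        self.num_steps = num_steps
        # Step-conditioned MLP standing in for the denoising network.
        self.denoiser = nn.Sequential(
            nn.Linear(dim + 1, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, text_emb):
        # Truncated flow: the initial point is the text embedding itself.
        z = text_emb
        for step in range(self.num_steps):
            t = torch.full_like(z[:, :1], step / self.num_steps)
            z = z + self.denoiser(torch.cat([z, t], dim=-1))  # progressive update
        return F.normalize(z, dim=-1)


def contrastive_loss(pred_video, video_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss over relevant and irrelevant pairs,
    replacing the per-pair L2 objective of a standard diffusion model."""
    video_emb = F.normalize(video_emb, dim=-1)
    logits = pred_video @ video_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    # Toy usage with random vectors standing in for CLIP text/video features.
    sampler = TruncatedSampler(dim=512, num_steps=4)
    text_emb = F.normalize(torch.randn(8, 512), dim=-1)
    video_emb = torch.randn(8, 512)
    loss = contrastive_loss(sampler(text_emb), video_emb)
    loss.backward()
    print(loss.item())
```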