Skip to yearly menu bar Skip to main content


Poster

Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis

Kai Hu · Ye Xiao · Yuan Zhang · Xieping Gao

East Exhibit Hall A-C #2005
[ ]
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: Endoscopic video analysis can effectively assist clinicians in disease diagnosis and treatment, and has played an indispensable role in clinical medicine. Unlike regular videos, endoscopic video analysis presents unique challenges, including complex camera movements, uneven distribution of lesions, and concealment, and it typically relies on contrastive learning in self-supervised pretraining as its mainstream technique. However, representations obtained from contrastive learning enhance the discriminability of the model but often lack fine-grained information, which is suboptimal in the pixel-level prediction tasks. In this paper, we develop a Multi-view Masked Contrastive Representation Learning (M$^2$CRL) framework for endoscopic video pre-training. Specifically, we propose a multi-view mask strategy for addressing the challenges of endoscopic videos. We utilize the frame-aggregated attention guided tube mask to capture global-level spatiotemporal sensitive representation from the global views, while the random tube mask is employed to focus on local variations from the local views. Subsequently, we combine multi-view mask modeling with contrastive learning to obtain endoscopic video representations that possess fine-grained perception and holistic discriminative capabilities simultaneously. The proposed M$^2$CRL is pre-trained on 7 publicly available endoscopic video datasets and fine-tuned on 3 endoscopic video datasets for 3 downstream tasks. Notably, our M$^2$CRL significantly outperforms the current state-of-the-art self-supervised endoscopic pre-training methods, e.g., Endo-FM (3.5% F1 for classification, 7.5% Dice for segmentation, and 2.2% F1 for detection) and other self-supervised methods, e.g., VideoMAE V2 (4.6% F1 for classification, 0.4% Dice for segmentation, and 2.1% F1 for detection).

Chat is not available.