Poster
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training
Che Liu · Cheng Ouyang · Sibo Cheng · Anand Shah · Wenjia Bai · Rossella Arcucci
Medical imaging tasks require an understanding of subtle and localized visual features due to the inherently detailed and area-specific nature of pathological patterns, which are crucial for clinical diagnosis. Although recent advances in medical vision-language pre-training (VLP) enable models to learn clinically relevant visual features by leveraging both medical images and their associated radiology reports, current medical VLP methods primarily focus on aligning images with entire reports. This focus hinders the learning of dense (pixel-level) visual features and is suboptimal for dense prediction tasks (e.g., medical image segmentation). To address this challenge, we propose a novel medical VLP framework, named Global to Dense level representation learning (G2D), which learns global and dense visual features simultaneously using only image-text pairs, without extra annotations. In particular, G2D introduces a Pseudo Segmentation (PS) task, which enables the model to learn dense visual features during VLP. Notably, the PS masks are generated on the fly during VLP and incur no extra trainable parameters. With this simple yet effective idea, G2D achieves superior performance across 5 medical imaging tasks and 25 diseases. In particular, on the segmentation task, which requires dense visual features, G2D surpasses existing models when fine-tuned with only 1% of the training data, compared to the 100% used by other models. The code will be released upon acceptance.
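The abstract does not detail how the PS masks are constructed; the snippet below is a minimal, hypothetical sketch of what "on-the-fly pseudo mask generation with no extra trainable parameters" could look like, assuming the mask is derived by thresholding the similarity between patch-level image features and the report embedding. All names (make_pseudo_mask, patch_feats, report_emb, tau) are illustrative and not taken from the paper.

```python
# Hedged sketch: on-the-fly pseudo segmentation (PS) mask generation during VLP.
# ASSUMPTION: the mask is a thresholded patch-to-report cosine similarity map;
# the paper's actual construction may differ. No trainable parameters are used.
import torch
import torch.nn.functional as F

@torch.no_grad()  # the mask is a fixed target, so it adds no trainable parameters
def make_pseudo_mask(patch_feats: torch.Tensor,   # (B, H*W, D) patch embeddings
                     report_emb: torch.Tensor,    # (B, D) global report embedding
                     grid_hw: tuple,              # (H, W) of the patch grid
                     tau: float = 0.5) -> torch.Tensor:
    """Return a binary (B, 1, H, W) pseudo segmentation mask."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    report_emb = F.normalize(report_emb, dim=-1)
    sim = torch.einsum("bnd,bd->bn", patch_feats, report_emb)   # (B, H*W) similarities
    sim = sim.view(-1, 1, *grid_hw)                             # (B, 1, H, W)
    lo = sim.amin(dim=(2, 3), keepdim=True)
    hi = sim.amax(dim=(2, 3), keepdim=True)
    sim = (sim - lo) / (hi - lo + 1e-6)                         # min-max normalize per image
    return (sim > tau).float()                                  # threshold -> binary mask

if __name__ == "__main__":
    B, H, W, D = 2, 14, 14, 512
    mask = make_pseudo_mask(torch.randn(B, H * W, D), torch.randn(B, D), (H, W))
    print(mask.shape)  # torch.Size([2, 1, 14, 14])
```

In such a setup, a lightweight dense head would be supervised against this mask (e.g., with a Dice or BCE loss) alongside the usual image-report alignment objective, encouraging the encoder to learn pixel-level features during pre-training.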