Poster in Workshop: Table Representation Learning Workshop (TRL)
Enhancing Biomedical Schema Matching with LLM-based Training Data Generation
Yurong Liu · Aécio Santos · Eduardo Pena · Roque Lopez · Eden Wu · Juliana Freire
Keywords: [ Large Language Models ] [ Contrastive Learning ] [ Schema Matching ] [ Data Harmonization ]
Data harmonization is the practice of combining datasets so that their information is compatible and can be compared accurately. Schema matching, an essential task in this context, establishes correspondences between attributes from different data sources. In this paper, we show that existing schema-matching methods often fail to capture the semantics needed to align complex schemas, particularly in the biomedical domain, which can lead to less effective data integration. To address this problem, we introduce a schema-matching approach that leverages LLMs to (1) generate semantically coherent training data pairs, which are used to train effective column embedding models within a contrastive learning framework, and (2) refine the final column match selections. This approach allows us to overcome the scarcity of in-domain, semantically diverse training data. Experiments with real-world biomedical datasets show significant improvements over traditional methods.
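To make the contrastive-learning step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style objective (a common choice for contrastive training; the abstract does not name the loss) and cosine-similarity ranking for candidate matches. The function names, column names, and embedding vectors are all hypothetical; in practice the positive pairs would come from an LLM and the vectors from a trained column encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, see lead-in).

    anchors[i] and positives[i] form a positive pair; every other
    row of `positives` serves as an in-batch negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def top_k_matches(source, target, k=1):
    """For each source column, return the k most similar target columns."""
    matches = {}
    for s_name, s_vec in source.items():
        ranked = sorted(
            target.items(),
            key=lambda item: cosine(s_vec, item[1]),
            reverse=True,
        )
        matches[s_name] = [name for name, _ in ranked[:k]]
    return matches


# Hypothetical toy embeddings standing in for encoder outputs.
source_cols = {"age_at_dx": [1.0, 0.1]}
target_cols = {"age_at_diagnosis": [0.9, 0.2], "tumor_stage": [0.0, 1.0]}
candidates = top_k_matches(source_cols, target_cols, k=1)
```

The loss is small when each anchor is closest to its own positive and large when pairs are misaligned, which is what pushes semantically equivalent columns together in the embedding space. The ranked candidates returned by `top_k_matches` would then be the input to the LLM-based refinement step described in the abstract, which this sketch does not attempt to model.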