Poster in Workshop: AI4Mat-2024: NeurIPS 2024 Workshop on AI for Accelerated Materials Design
Large scale Extraction of Composition and Properties from Materials Tables
Kausik Hira · Mohd Zaki · Mausam · N M Anoop Krishnan
Keywords: [ database ] [ Information extraction ] [ materials table ] [ table data extraction ]
In this study, we aim to build the largest automated knowledge base (KB) of inorganic materials’ compositions and properties by systematically extracting data from published research articles in the materials science (MatSci) domain. Since most material compositions and properties are reported in tables, extracting them efficiently is essential for building large-scale knowledge repositories in this field. To this end, we developed a framework combining two models, DISCOMAT and PEGAMAT, which extract materials’ compositions and properties, respectively. Training data were generated through distant supervision, using compositions and desired properties from existing databases and the corresponding journals, supplemented by rule-based methods. Validation and test datasets were manually annotated by materials science experts. DISCOMAT achieved an F1 score of 71.49 for composition extraction, while PEGAMAT attained 86.90 for property extraction. We processed research papers published in 12 journals in the ScienceDirect database and extracted more than 550,000 entries: around 100,000 glass material compositions with their properties, along with 137,000 compositions and 316,000 properties without their counterparts. The proposed models and the resulting database offer significant potential to advance the modeling and development of tailored materials.
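To make the idea of distant supervision on tables concrete, here is a minimal Python sketch. It is not the DISCOMAT or PEGAMAT pipeline: the function names `normalize`, `distant_labels`, and `f1_score`, the toy table, and the exact-match rule are all assumptions chosen for illustration. The sketch marks a table cell as positive when its text matches a value already recorded in an existing database for the same paper, and it shows how an F1 score is computed from true-positive, false-positive, and false-negative counts.

```python
from typing import Iterable, List


def normalize(value: str) -> str:
    """Remove all whitespace and lowercase, so 'SiO2 ' and 'sio2' compare equal."""
    return "".join(value.split()).lower()


def distant_labels(table: List[List[str]], known_values: Iterable[str]) -> List[List[int]]:
    """Label each cell 1 if it matches a known database value, else 0.

    Hypothetical exact-match rule: a real system would likely need fuzzy
    matching, unit handling, and numeric tolerance.
    """
    known = {normalize(v) for v in known_values}
    return [[1 if normalize(cell) in known else 0 for cell in row] for row in table]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy table (assumed data): a header row with constituents, then one sample row.
example_table = [
    ["Sample", "SiO2", "Na2O"],
    ["G1", "70", "30"],
]
# Values assumed to be already present in an existing database for this paper.
example_known = ["SiO2", "Na2O"]

example_labels = distant_labels(example_table, example_known)
```

Labels produced this way are noisy, which is consistent with the abstract's use of expert-annotated validation and test sets rather than distantly supervised ones for evaluation.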