Poster in Workshop: Table Representation Learning Workshop
Modeling string entries for tabular data prediction: do we need big large language models?
Leo Grinsztajn · Myung Jun Kim · Edouard Oyallon · Gael Varoquaux
Keywords: [ language models ] [ tabular data ] [ embeddings ]
Tabular data are often characterized by numerical and categorical features, but these features co-exist with features made of text entries, such as names or descriptions. Here, we investigate whether language models can extract information from these text entries. Studying 19 datasets and varying training sizes, we find that using language models to encode text features improves predictions over no encoding and over character-level approaches based on substrings. Furthermore, we find that larger, more advanced language models yield larger improvements.
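A minimal sketch of the two kinds of encodings the abstract contrasts: character-level substring features versus embeddings from a pretrained language model, each producing a numeric matrix that can be joined to the other tabular columns. The library and model choices (scikit-learn, sentence-transformers, "all-MiniLM-L6-v2") are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: two ways to turn a text column into numeric features
# for a downstream tabular predictor. Library and model choices are
# assumptions for the example, not the authors' exact setup.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sentence_transformers import SentenceTransformer

texts = ["senior data analyst", "registered nurse", "software engineer II"]

# Character-level baseline: hashed character n-gram (substring) counts.
char_encoder = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                                 n_features=256)
X_char = char_encoder.transform(texts).toarray()

# Language-model encoding: dense embeddings from a pretrained model.
lm_encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_lm = lm_encoder.encode(texts)

# Either matrix can be concatenated with the numerical/categorical columns
# and fed to a gradient-boosted tree or another tabular model.
print(X_char.shape, np.asarray(X_lm).shape)
```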