Spotlight in Workshop: Table Representation Learning Workshop
IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
Scott Yak · Yihe Dong · Javier Gonzalvo · Sercan Arik
Keywords: [ LLM ] [ tabular ] [ Foundation Model ] [ structured data ]
There is a massive amount of tabular data that can be leveraged via `foundation models' to improve prediction performance on downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning the semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality for categorical values, and scalability to many tables, rows, and features. We propose \texttt{IngesTables}, a novel canonical tabular foundation model building framework designed to address the aforementioned challenges. \texttt{IngesTables} employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, \texttt{IngesTables} is much cheaper to train and faster at inference, because of how the LLM-generated embeddings are defined and cached. We show that \texttt{IngesTables} achieves significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets, without incurring LLM-level cost and latency.
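The abstract describes the general recipe of encoding table/feature semantics with an LLM, caching those embeddings, and modeling them with an attention-based tabular architecture. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual implementation: the `llm_embed` function stands in for a real LLM text encoder, and the names `SemanticEmbeddingCache` and `AttentionTabularModel` are invented for illustration; the key point is that LLM embeddings of column names are computed once and cached, so per-row training and inference only run the small attention model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an LLM text encoder. In a real system this would
# be a pretrained language model whose output for each column name is
# computed once and cached, so the LLM is never invoked per row.
def llm_embed(text: str, dim: int = 64) -> torch.Tensor:
    gen = torch.Generator().manual_seed(hash(text) % (2 ** 31))
    return torch.randn(dim, generator=gen)

class SemanticEmbeddingCache:
    """Caches LLM embeddings of column names so each name is encoded once."""
    def __init__(self, dim: int = 64):
        self.dim = dim
        self._cache: dict[str, torch.Tensor] = {}

    def __call__(self, name: str) -> torch.Tensor:
        if name not in self._cache:
            self._cache[name] = llm_embed(name, self.dim)
        return self._cache[name]

class AttentionTabularModel(nn.Module):
    """Attention over per-feature tokens built from [semantic embedding; cell value]."""
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        self.value_proj = nn.Linear(1, dim)      # embed each numeric cell value
        self.mix = nn.Linear(2 * dim, dim)       # fuse column semantics with the value
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(dim, 1)            # scalar prediction head

    def forward(self, values: torch.Tensor, name_embs: torch.Tensor) -> torch.Tensor:
        # values: (batch, n_features); name_embs: (n_features, dim), precomputed/cached
        v = self.value_proj(values.unsqueeze(-1))                 # (B, F, D)
        s = name_embs.unsqueeze(0).expand(v.shape[0], -1, -1)     # (B, F, D)
        tokens = self.mix(torch.cat([s, v], dim=-1))              # (B, F, D)
        h = self.encoder(tokens)                                  # (B, F, D)
        return self.head(h.mean(dim=1)).squeeze(-1)               # (B,)

# Usage: column names are embedded once through the cache; only the small
# attention model touches per-row data, avoiding LLM-level inference latency.
cache = SemanticEmbeddingCache(dim=64)
columns = ["age", "dosage_mg", "enrollment_count"]   # hypothetical clinical-trial features
name_embs = torch.stack([cache(c) for c in columns])  # (F, D)
model = AttentionTabularModel(dim=64)
x = torch.randn(8, len(columns))                       # 8 example rows
print(model(x, name_embs).shape)                       # torch.Size([8])
```

Because the semantic embeddings depend only on the schema (column names and, in general, table metadata), they can be precomputed and reused across all rows and all downstream tasks that share that schema, which is what makes this style of model cheaper to train and serve than approaches that call an LLM per example.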