Poster in Workshop: Table Representation Learning Workshop (TRL)
Scaling Generative Tabular Learning for Large Language Models
Yiming Sun · Xumeng Wen · Shun Zheng · Xiaowei Jia · Jiang Bian
Keywords: [ large language models ] [ tabular data learning ] [ scaling laws ] [ generative tabular learning ]
Developing predictive models for tabular data is essential across many industrial applications. The primary challenge in these tasks lies in handling heterogeneous data schemas and diverse prediction targets. Recently, generative tabular learning (GTL) was developed to leverage the instruction-following paradigm of large language models (LLMs) and enable universal tabular learning across varied datasets. This approach enables effective prompt-based transfer to downstream tasks without supervised tuning. However, the full potential of GTL-enhanced LLMs remains largely unexplored because of limitations in dataset size, sequence length, and model architecture, which lead to notable performance gaps relative to traditional tuning-based tabular models as the number of training examples increases. In this study, we aim to unlock the full potential of GTL from a scaling perspective. We expanded the number of pre-training datasets from 340 to 972, extended the sequence length from 4,096 to 16,384 tokens, and experimented with different base LLMs. Our findings reveal that scaling datasets and prediction tasks generally enhances generalization, although regression tasks tend to reach saturation quickly. Increasing the number of in-context samples consistently improves performance, especially during inference. Our optimized LLMs demonstrate significant improvements, closing the gap with and even surpassing highly optimized models when given larger numbers of training samples.
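To make the prompt-based setup concrete, the following is a minimal sketch of how labeled table rows might be serialized into an instruction-style prompt with in-context samples, followed by an unlabeled query row. It is an illustration only and is not the authors' template: the "feature is value" serialization, the function names `serialize_row` and `build_prompt`, and the example columns are all assumptions made for this sketch.

```python
from typing import Mapping, Sequence

Row = Mapping[str, object]


def serialize_row(row: Row, feature_names: Sequence[str]) -> str:
    """Render one table row as text (hypothetical 'name is value' format)."""
    return "; ".join(f"{name} is {row[name]}" for name in feature_names)


def build_prompt(
    task: str,
    feature_names: Sequence[str],
    target_name: str,
    context_rows: Sequence[Row],
    query_row: Row,
) -> str:
    """Assemble a task instruction, labeled in-context rows, and a query row.

    The query line ends with the target name and no value, so a language
    model would be expected to continue the text with its prediction.
    """
    lines = [task, ""]
    for row in context_rows:
        features = serialize_row(row, feature_names)
        lines.append(f"{features}. {target_name}: {row[target_name]}")
    lines.append(f"{serialize_row(query_row, feature_names)}. {target_name}:")
    return "\n".join(lines)


# Illustrative data only; these columns and values are made up for the sketch.
context = [
    {"age": 39, "hours": 40, "income": "<=50K"},
    {"age": 52, "hours": 60, "income": ">50K"},
]
prompt = build_prompt(
    task="Predict income from the features.",
    feature_names=["age", "hours"],
    target_name="income",
    context_rows=context,
    query_row={"age": 28, "hours": 45},
)
print(prompt)
```

In a layout like this, each additional in-context sample adds one line to the prompt, which is one reason a longer sequence length allows more in-context samples to be included.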