Poster in Workshop: Table Representation Learning Workshop (TRL)
LLM Embeddings Improve Test-time Adaptation to Tabular $Y|X$-Shifts
Yibo Zeng · Jiashuo Liu · Henry Lam · Hongseok Namkoong
Keywords: [ large language model embeddings ] [ distribution shift ] [ tabular prediction ]
Abstract:
Distribution shifts between the source and target domain pose significant challenges for machine learning, and different types of distribution shifts require distinct interventions. Unlike in computer vision tasks, where covariates typically contain all the information needed for prediction, the common occurrence of missing variables makes distribution shifts in real-world tabular data far more complex. After analyzing 7,650 distribution shift pairs across three real-world tabular datasets, we find that $Y|X$-shifts are more prevalent in tabular data, in contrast to image data, where $X$-shifts dominate. In this work, we conduct a comprehensive and systematic study of leveraging recent large language models to generate improved feature embeddings for backend neural network models. Specifically, we develop a large-scale testbed consisting of the **7,650** distribution shift pairs across the ACS Income, ACS Mobility, and ACS Public Coverage datasets, following a standard training-validation-testing protocol. Through an extensive analysis of **20** models and learning strategies across over **261,000** model configurations, we find that while LLM embeddings are inherently powerful, they do not consistently outperform state-of-the-art tree-ensemble methods. Interestingly, even a small number of labeled target samples can substantially improve performance under tabular $Y|X$-shifts. Additionally, we explore the influence of target sample size, fine-tuning strategies, and methods of integrating supplementary information.
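To make the general recipe concrete, below is a minimal, illustrative sketch of the idea the abstract describes: serializing tabular rows into text, embedding them with an off-the-shelf sentence-embedding model (a stand-in for an LLM embedder), fitting a simple classifier on source data, and then refitting with a few labeled target samples to adapt to a $Y|X$-shift. The embedding model, features, toy data, and adaptation step are all assumptions for illustration, not the paper's exact pipeline.

```python
# Minimal sketch (not the authors' pipeline). Assumptions: the embedding
# model name, the toy ACS-Income-style features, and the simple
# refit-on-pooled-data adaptation step are all illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in LLM embedder

def serialize(row):
    # Turn one tabular record (a dict) into a natural-language string.
    return ", ".join(f"{k} is {v}" for k, v in row.items())

def embed(rows):
    # Map tabular records to dense text embeddings.
    return np.asarray(encoder.encode([serialize(r) for r in rows]))

# Toy source-domain data in the spirit of ACS Income (features illustrative).
source_rows = [
    {"age": 25, "occupation": "teacher", "hours per week": 40},
    {"age": 48, "occupation": "engineer", "hours per week": 50},
    {"age": 33, "occupation": "cashier", "hours per week": 30},
    {"age": 57, "occupation": "lawyer", "hours per week": 45},
]
source_labels = np.array([0, 1, 0, 1])  # e.g. income above a threshold

X_src = embed(source_rows)
clf = LogisticRegression(max_iter=1000).fit(X_src, source_labels)

# Test-time adaptation to a Y|X-shift: a small batch of labeled target
# rows, whose label-given-features relationship differs from the source,
# is pooled with the source data and the head is refit.
target_rows_few = [
    {"age": 41, "occupation": "teacher", "hours per week": 40},
    {"age": 29, "occupation": "engineer", "hours per week": 50},
]
target_labels_few = np.array([1, 0])  # Y|X differs from the source pattern

X_all = np.vstack([X_src, embed(target_rows_few)])
y_all = np.concatenate([source_labels, target_labels_few])
clf_adapted = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```

Pooling source data with a handful of target samples is only one of several adaptation strategies; the paper's testbed compares such strategies (along with fine-tuning and ways of integrating supplementary information) at scale.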