Poster in Workshop: Table Representation Learning Workshop (TRL)
Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance
Niklas Wretblad · Oskar Holmström · Erik Larsson · Axel Wiksäter · Hjalmar Öhman · Oscar Söderlund · Ture Pontén · Martin Forsberg · Martin Sörme · Fredrik Heintz
Keywords: [ metadata ] [ large language model ] [ LLM ] [ database ] [ column descriptions ] [ text-to-sql ] [ SQL ]
Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, which hinder both human users and text-to-SQL models. In this paper, we explore the use of large language models (LLMs) to automatically generate detailed natural language descriptions for SQL database columns, aiming to improve text-to-SQL performance and automate metadata creation. We create a dataset of gold column descriptions based on the BIRD-Bench benchmark, manually refining its column descriptions and creating a taxonomy for categorizing column difficulty. Evaluating several LLMs, we find that incorporating these column descriptions consistently improves model performance, particularly for larger models such as GPT-4o and Qwen2 72B. However, models struggle with columns that exhibit inherent ambiguity, highlighting the need for manual expert input. Notably, Qwen2-generated descriptions, which contain information that annotators deemed superfluous, outperform manually curated gold descriptions, suggesting that models benefit from more detailed metadata than humans expect. Future work will investigate the specific features of these high-performing descriptions and explore other types of metadata, such as numerical reasoning and domain-specific knowledge, to further improve text-to-SQL systems.
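As a rough illustration of what "incorporating column descriptions" could look like in practice, the sketch below renders a schema with optional per-column descriptions and embeds it in a text-to-SQL prompt. It is not the authors' pipeline: the `Column` class, the `render_schema` and `build_prompt` helpers, the prompt wording, and the example table are all hypothetical, and the call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """One database column; `description` is the optional natural-language metadata."""
    table: str
    name: str
    sql_type: str
    description: Optional[str] = None


def render_schema(columns: List[Column]) -> str:
    """Render one line per column, appending the description as a SQL comment."""
    lines = []
    for col in columns:
        line = f"{col.table}.{col.name} {col.sql_type}"
        if col.description:
            line += f"  -- {col.description}"
        lines.append(line)
    return "\n".join(lines)


def build_prompt(question: str, columns: List[Column]) -> str:
    """Assemble a text-to-SQL prompt (hypothetical wording) around the schema."""
    return (
        "Given the database schema below, write a SQL query "
        "that answers the question.\n\n"
        f"Schema:\n{render_schema(columns)}\n\n"
        f"Question: {question}\nSQL:"
    )


if __name__ == "__main__":
    # Hypothetical columns: the terse name "st" is uninformative without metadata.
    columns = [
        Column("orders", "id", "INTEGER"),
        Column("orders", "st", "TEXT", "Order status code, e.g. 'S' for shipped"),
    ]
    print(build_prompt("How many orders have been shipped?", columns))
```

Comparing prompts built with and without the `description` field filled in is one simple way to measure the effect of the metadata on a given model.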