

Poster
in
Workshop: Table Representation Learning Workshop (TRL)

Data-Centric Text-to-SQL with Large Language Models

Zachary Huang · Shuo Zhang · Kechen Liu · Eugene Wu

Keywords: [ large language models ] [ data centric ] [ Text-to-SQL ]


Abstract:

Text-to-SQL is crucial for enabling non-technical users to access data, and large language models have significantly improved its performance. However, recent frameworks are largely Query-Centric, focusing on improving models' ability to translate natural language into SQL queries. Despite these advancements, real-world challenges—especially messy and large datasets—remain a major bottleneck. Our case studies reveal that 11-37% of the ground truth answers in the BIRD benchmark are incorrect due to data quality issues (duplication, disguised missing values, incorrect data types, and inconsistent values). To address this, we propose a Data-Centric Text-to-SQL framework that preprocesses and cleans data offline, builds a relationship graph between tables, and incorporates business logic. This allows LLM agents to efficiently retrieve relevant tables and details at query time, significantly improving accuracy. Our experiments show that this approach outperforms human-provided ground truth answers on the BIRD benchmark by up to 33.89%.
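As an illustration of the offline preprocessing step the abstract describes, here is a minimal sketch of a data-profiling pass that flags two of the listed data quality issues (duplicate rows and disguised missing values) in an SQLite table. The `profile_table` function and the sentinel list are hypothetical illustrations, not the paper's actual implementation.

```python
import sqlite3
from collections import Counter

# Sentinel strings that often disguise missing values
# (an illustrative list; the paper's actual cleaning rules are not specified).
SENTINELS = {"", "n/a", "na", "none", "null", "unknown", "-", "?"}

def profile_table(conn, table):
    """Offline profiling pass: count duplicate rows and disguised
    missing values so they can be cleaned before query time."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    rows = [tuple(r) for r in cur.fetchall()]

    dup_count = len(rows) - len(set(rows))
    disguised = Counter()
    for row in rows:
        for col, val in zip(cols, row):
            if isinstance(val, str) and val.strip().lower() in SENTINELS:
                disguised[col] += 1
    return {"duplicate_rows": dup_count, "disguised_missing": dict(disguised)}

# Tiny in-memory example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "NYC"), ("alice", "NYC"), ("bob", "N/A")])
report = profile_table(conn, "users")
print(report)  # {'duplicate_rows': 1, 'disguised_missing': {'city': 1}}
```

In a full pipeline, a report like this would drive offline cleaning (deduplication, normalizing sentinels to real NULLs) before any natural-language query is answered.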
