

Poster in Workshop: Table Representation Learning Workshop (TRL)

SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing

Denver Baumgartner · Tomasz Kornuta

Keywords: [ Synthetic Data Generation ] [ Text-to-SQL ] [ In-Context Learning ] [ Low-Resource Scenarios ] [ In-Domain ] [ Fine-Tuning ]


Abstract:

We address the challenge of generating high-quality data for text-to-SQL parsing in low-resource, in-domain scenarios. Although leveraging large language models (LLMs) with in-context learning often achieves the best results in research settings, it is frequently impractical for real-world applications. Fine-tuning smaller, domain-specific models is therefore a viable alternative, but it is often constrained by the scarcity of training data. To overcome this, we introduce SynQL, a novel method for synthetic text-to-SQL data generation tailored to in-domain contexts. We demonstrate the effectiveness of SynQL on the KaggleDBQA benchmark, showing significant performance improvements over models fine-tuned on the original data. We additionally validate our method on the out-of-domain Spider dataset, and we open-source the method and both synthetic datasets.
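The abstract does not spell out the generation procedure, so the following is only a rough, hypothetical sketch of what prompting-based synthetic text-to-SQL data generation can look like, not the SynQL pipeline itself: a database schema is serialized into a prompt, an LLM is asked for question-SQL pairs, and malformed records are filtered out. The names PROMPT_TEMPLATE, generate_pairs, and llm_complete are illustrative assumptions, not identifiers from the paper.

```python
import json
import re
from typing import Callable

# Hypothetical prompt: the schema is serialized as CREATE TABLE statements and
# the model is asked to return novel question-SQL pairs as JSON.
PROMPT_TEMPLATE = """You are generating training data for a text-to-SQL parser.
Database schema:
{schema}

Write {n} diverse natural-language questions about this database, each paired
with the SQL query that answers it. Respond with a JSON list of objects having
keys "question" and "sql"."""


def generate_pairs(schema: str, llm_complete: Callable[[str], str], n: int = 5) -> list:
    """Ask any text-in/text-out LLM callable for n synthetic question-SQL pairs."""
    prompt = PROMPT_TEMPLATE.format(schema=schema, n=n)
    raw = llm_complete(prompt).strip()
    # Strip optional markdown fences before parsing the JSON payload.
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw)
    pairs = json.loads(raw)
    # Keep only well-formed records; a real pipeline would also execute the SQL
    # against the database to discard invalid or empty-result queries.
    return [p for p in pairs if isinstance(p, dict) and "question" in p and "sql" in p]


if __name__ == "__main__":
    schema = "CREATE TABLE fires (id INT, county TEXT, acres_burned REAL, year INT);"

    # Stub LLM for demonstration; swap in a real completion call in practice.
    def fake_llm(prompt: str) -> str:
        return json.dumps([{
            "question": "Which county had the largest fire in 2019?",
            "sql": "SELECT county FROM fires WHERE year = 2019 "
                   "ORDER BY acres_burned DESC LIMIT 1;",
        }])

    print(generate_pairs(schema, fake_llm, n=1))
```

In an in-domain, low-resource setting such as KaggleDBQA, one would presumably condition generation on the target database schemas and then fine-tune a smaller model on the resulting synthetic pairs.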
