Oral
in
Workshop: Table Representation Learning Workshop (TRL)
TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation
Juntong Shi · Minkai Xu · Harper Hua · Hengrui Zhang · Stefano Ermon · Jure Leskovec
Keywords: [ Generative Models ] [ Tabular Representation Learning ] [ Diffusion Models ]
Synthesizing high-quality tabular data is an important topic in many data science applications, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherently heterogeneous data types and intricate column-wise distributions. In this paper, we introduce TabDiff, a unified diffusion framework that models all multi-modal distributions of mixed-type tabular data in one model. Our key insight is to design different continuous-time diffusion processes for numerical and categorical data, and to learn a single model that simultaneously predicts the noise for the different modalities. To counter the high disparity among feature distributions, we further introduce feature-wise learnable diffusion processes to optimally balance generative performance. The entire framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines on five out of six metrics.
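The idea of running joint continuous-time forward processes with per-feature noise schedules can be sketched as follows. This is a hedged illustration, not the authors' implementation: `forward_diffuse`, the power-law schedules `t ** rho`, and the uniform-resampling corruption for categorical columns are all simplifying assumptions standing in for TabDiff's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(num_x, cat_x, t, num_rho, cat_rho, n_classes):
    """Hypothetical unified forward process at continuous time t in [0, 1].

    Numerical columns receive Gaussian noise; categorical columns are
    independently resampled to a uniform random class with a feature-wise
    corruption probability. `num_rho` / `cat_rho` stand in for the
    per-feature learnable schedule parameters described in the abstract.
    """
    # Per-feature noise levels via toy power schedules sigma_i(t) = t ** rho_i.
    num_sigma = t ** num_rho                       # shape: (n_num_features,)
    noisy_num = num_x + num_sigma * rng.standard_normal(num_x.shape)

    cat_p = t ** cat_rho                           # corruption prob per feature
    corrupt = rng.random(cat_x.shape) < cat_p
    random_classes = rng.integers(0, n_classes, size=cat_x.shape)
    noisy_cat = np.where(corrupt, random_classes, cat_x)
    return noisy_num, noisy_cat

# Toy batch: 4 rows with 2 numerical and 3 categorical features.
num_x = rng.standard_normal((4, 2))
cat_x = rng.integers(0, 5, size=(4, 3))
noisy_num, noisy_cat = forward_diffuse(
    num_x, cat_x, t=0.5,
    num_rho=np.array([1.0, 2.0]),      # learnable in the real model
    cat_rho=np.array([1.0, 1.5, 0.5]),
    n_classes=5,
)
print(noisy_num.shape, noisy_cat.shape)
```

At `t = 0` both processes leave the data untouched, and as `t` grows each feature is corrupted at its own rate, which is the degree of freedom the feature-wise learnable schedules exploit; a denoising network would then be trained to predict the clean numerical values and categorical classes from `(noisy_num, noisy_cat, t)`.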