Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Adaptive and Robust Watermark for Generative Tabular Data
Dung Ngo · Daniel Scott · Saheed Obitayo · Vamsi Potluru · Manuela Veloso
Recent developments in generative models have demonstrated their ability to create high-quality synthetic data. However, the pervasiveness of synthetic content online also brings forth growing concerns that it can be used for malicious purposes. To ensure the authenticity of the data, watermarking techniques have recently emerged as a promising solution due to their strong statistical guarantees. In this paper, we propose a flexible and robust watermarking mechanism for generative tabular data. Specifically, a data provider with knowledge of the downstream tasks can partition the feature space into pairs of (key, value) columns. Within each pair, the data provider first uses elements in the key column to generate a randomized set of “green” intervals, then encourages elements of the value column to be in one of these “green” intervals. We show theoretically and empirically that the watermarked datasets (i) show negligible degradation in data quality and downstream utility, (ii) can be efficiently detected, and (iii) are robust against multiple attacks commonly observed in data science.
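To make the (key, value) mechanism concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. Several details are assumptions made only for illustration: value elements are taken to lie in [0, 1), the unit interval is split into `n_bins` equal-width bins of which half are marked green, the green set is seeded by hashing the key element together with a hypothetical `secret` string, out-of-green values are moved to the midpoint of the nearest green bin, and detection uses a one-sided z-score on the green count. The function names (`green_bins`, `embed`, `detect`) are likewise hypothetical.

```python
import hashlib
import math
import random


def green_bins(key, secret, n_bins=20):
    """Pick half of the bins as 'green', pseudo-randomly seeded by the key element.

    Assumption: the key is hashable via repr(); continuous keys would
    need to be discretized first so that small noise does not change the seed.
    """
    digest = hashlib.sha256(f"{secret}|{key!r}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(n_bins), n_bins // 2))


def bin_index(value, n_bins=20):
    """Index of the equal-width bin of [0, 1) that contains value."""
    return min(int(value * n_bins), n_bins - 1)


def embed(keys, values, secret, n_bins=20):
    """Move each value that falls outside its green bins into the nearest green bin."""
    watermarked = []
    for key, value in zip(keys, values):
        green = green_bins(key, secret, n_bins)
        b = bin_index(value, n_bins)
        if b not in green:
            # Nearest green bin by index distance; ties go to the lower bin.
            nearest = min(green, key=lambda g: (abs(g - b), g))
            # Assumption: use the bin midpoint as the replacement value.
            value = (nearest + 0.5) / n_bins
        watermarked.append(value)
    return watermarked


def detect(keys, values, secret, n_bins=20):
    """z-score of the green count against the null hypothesis of a 1/2 green rate."""
    n = len(values)
    hits = sum(
        bin_index(v, n_bins) in green_bins(k, secret, n_bins)
        for k, v in zip(keys, values)
    )
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)
```

Under these assumptions, each value moves by a bounded number of bin widths, which is the intuition behind the small impact on data quality, and an unwatermarked column should produce a z-score near zero while a watermarked one produces a large positive score. A detector would compare the score against a threshold chosen for the desired false-positive rate.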