Synthetic Data Generation with Generative AI

Workshop

Sergul Aydore · Zhaozhi Qian · Mihaela van der Schaar

Hall E2 (level 1)

Sat 16 Dec, 7 a.m. PST

Synthetic data (SD) is data that has been generated by a mathematical model to solve downstream data science tasks. SD can be used to address three key problems: (1) private data release, (2) data de-biasing and fairness, and (3) data augmentation for boosting the performance of ML models. While SD offers great opportunities for these problems, SD generation is still a developing area of research, and systematic frameworks for SD deployment and evaluation are still missing. Additionally, despite the substantial advances in generative AI, the scientific community still lacks a unified understanding of how generative AI can be used to generate SD for different modalities.

The goal of this workshop is to provide a platform for vigorous discussion among research communities from all these perspectives, in the hope of advancing the use of SD for better and more trustworthy ML training. Through submissions and facilitated discussions, we aim to characterize and mitigate the common challenges of SD generation that span numerous application domains. The workshop is jointly organized by academic researchers (University of Cambridge) and industry partners from tech (Amazon AI).

Timezone: America/Los_Angeles

Schedule

Sat 7:00 a.m. - 7:05 a.m.	Welcome and workshop overview (Talk)	Sergul Aydore
Sat 7:05 a.m. - 7:15 a.m.	Synthetic Data: Charting New Research Frontiers, Maximizing Impact, and Cultivating Collaborative Communities (Talk)	Mihaela van der Schaar
Sat 7:15 a.m. - 8:00 a.m.	Generating health records (Invited Talk)	Edward Choi
Sat 8:00 a.m. - 8:30 a.m.	Coffee Break & Poster Session (Poster)
Sat 8:30 a.m. - 9:15 a.m.	Privacy and Synthetic data (Invited Talk)	Antti Honkela
Sat 9:15 a.m. - 9:45 a.m.	Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Contributed Talk)	Zinan Lin
Sat 9:45 a.m. - 10:15 a.m.	Effective Data Augmentation With Diffusion Models (Contributed Talk)	Max Gurinas · Brandon Trabucco
Sat 10:15 a.m. - 11:30 a.m.	Lunch Break & Poster Session (Poster)
Sat 11:30 a.m. - 12:15 p.m.	Diversity and Synthetic data (Invited Talk)	Adji Bousso Dieng
Sat 12:15 p.m. - 12:45 p.m.	Fair Wasserstein Coresets (Contributed Talk)	Vamsi Potluru
Sat 12:45 p.m. - 1:15 p.m.	Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Contributed Talk)	Venkatesh Ravichandran · Helin Wang
Sat 1:15 p.m. - 1:30 p.m.	Coffee Break & Poster Session (Poster)
Sat 1:30 p.m. - 2:15 p.m.	Generative Agents: Interactive Simulacra (Invited Talk)	Michael Bernstein
Sat 2:15 p.m. - 3:00 p.m.	Panel Discussion (Panel)	Danielle Belgrave · Cem Tekin · Robert Tillman · Megan Gibbs · Dino Oglic · Rudi Agius · Panagiota Konstantinou
-	Size Matters: Large Graph Generation with HiGGs (Poster)	Alex O. Davies · Nirav Ajmeri · Telmo Silva Filho
-	Generating Medical Instructions with Conditional Transformer (Poster)	Samuel Belkadi · Nicolo Micheletti · Lifeng Han · Warren Del-Pinto · Goran Nenadic
-	$\mathbb{S}$ci$\mathbb{F}$ix: Outperforming GPT3 on Scientific Factual Error Correction (Poster)	Dhananjay Ashok · Atharva Kulkarni · Hai Pham · Barnabas Poczos
-	Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI (Poster)	Elena Sizikova · Niloufar Saharkhiz · Diksha Sharma · Miguel Lago · Berkman Sahiner · Jana Delfino · Aldo Badano
-	Knowledge-Infused Prompting Improves Clinical Text Generation with Large Language Models (Poster)	Ran Xu · Hejie Cui · Yue Yu · Xuan Kan · Wenqi Shi · Yuchen Zhuang · Wei Jin · Joyce Ho · Carl Yang
-	Improving Code Style for Accurate Code Generation (Poster)	Naman Jain · Tianjun Zhang · Wei-Lin Chiang · Joseph Gonzalez · Koushik Sen · Ion Stoica
-	GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning (Poster)	Amani Namboori · Shivam Mangale · Andy Rosenbaum · Saleh Soltan
-	EDGE++: Improved Training and Sampling of EDGE (Poster)	Xiaohui Chen · Mingyang Wu · Liping Liu
-	Conditional Generative Modeling for High-dimensional Marked Temporal Point Processes (Poster)	Zheng Dong · Zekai Fan · Shixiang Zhu
-	Synthetic Data Generation for Scarce Road Scene Detection Scenarios (Poster)	Dipika Khullar · Yash Shah · Ninadkulamz · Negin Sokhandan
-	Stable Diffusion For Aerial Object Detection (Poster)	Yanan Jian · Fuxun Yu · Simranjit Singh · Dimitrios Stamoulis
-	INTAGS: Interactive Agent-Guided Simulation (Poster)	Song Wei · Andrea Coletta · Svitlana Vyetrenko · Tucker Balch
-	CALICO: Conversational Agent Localization via Synthetic Data Generation (Poster)	Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach
-	Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Oral)	Helin Wang · Venkatesh Ravichandran · Milind Rao · Becky Lammers · Myra J. Sydnor · Nicholas Maragakis · Ankur Butala · Jayne Zhang · Lora Clawson · Victoria Chovaz · Laureano Moro-Velazquez
-	Generating Privacy-Preserving Longitudinal Synthetic Data (Poster)	Robin van Hoorn
-	AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing (Poster)	Namjoon Suh · Xiaofeng Lin · Din-Yin Hsieh · Mehrdad Honarkhah · Guang Cheng
-	Towards Effective Synthetic Data Sampling for Domain Adaptive Pose Estimation (Poster)	Isha Dua · Arjun Sharma · Shuaib Ahmed · Rahul Tallamraju
-	Fair Wasserstein Coresets (Oral)	Zikai Xiong · Niccolo Dalmasso · Vamsi Potluru · Tucker Balch · Manuela Veloso
-	Effective Data Augmentation With Diffusion Models (Oral)	Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov
-	Continuous Diffusion for Mixed-Type Tabular Data (Poster)	Markus Mueller · Kathrin Gruber · Dennis Fok
-	Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets (Poster)	Brandon Smith · Miguel Farinha · Siobhan Mackenzie Hall · Hannah Rose Kirk · Aleksandar Shtedritski · Max Bain
-	Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization (Poster)	Elior Benarous · Sotiris Anagnostidis · Luca Biggio · Thomas Hofmann
-	Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models (Oral)	Yujin Kim · Jaehong Yoon · Seonghyeon Ye · Sung Ju Hwang · Se-Young Yun
-	Learning to Place Objects into Scenes by Hallucinating Scenes around Objects (Poster)	Lu Yuan · James Hong · Vishnu Sarukkai · Kayvon Fatahalian
-	Evaluating VLMs for Property-Specific Annotation of 3D Objects (Poster)	Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra
-	Strong statistical parity through fair synthetic data (Poster)	Ivona Krchova · Michael Platzer · Paul Tiwald
-	On the Limitation of Diffusion Models for Synthesizing Training Datasets (Poster)	Shin'ya Yamaguchi · Takuma Fukuda
-	STAR: Improving Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models (Poster)	Mingyu Derek Ma · Xiaoxuan Wang · Po-Nien Kung · P. Jeffrey Brantingham · Nanyun Peng · Wei Wang
-	Feedback-guided Data Synthesis for Imbalanced Classification (Poster)	Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero
-	Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization (Poster)	Prakamya Mishra · Zonghai Yao · Shuwei Chen · Beining Wang · Rohan Mittal · Hong Yu
-	Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions (Poster)	Alexander Boudewijn · Andrea Filippo Ferraris · Daniele Panfilo · Vanessa Cocca · Sabrina Zinutti · Karel De Schepper · Carlo Chauvenet
-	On Consistent Bayesian Inference from Synthetic Data (Poster)	Ossi Räisä · Joonas Jälkö · Antti Honkela
-	Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Oral)	Zinan Lin · Sivakanth Gopi · Janardhan Kulkarni · Harsha Nori · Sergey Yekhanin
-	Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models (Poster)	Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri
-	Diffusion-based Semantic-Discrepant Outlier Generation for Out-of-Distribution Detection (Poster)	Suhee Yoon · Sanghyu Yoon · Hankook Lee · Sangjun Han · Ye Seul Sim · Kyungeun Lee · Hyeseung Cho · Woohyung Lim