Workshop
Synthetic Data Generation with Generative AI
Sergul Aydore · Zhaozhi Qian · Mihaela van der Schaar
Hall E2 (level 1)
Sat 16 Dec, 7 a.m. PST
Synthetic data (SD) is data that has been generated by a mathematical model to solve downstream data science tasks. SD can be used to address three key problems: 1/ private data release, 2/ data de-biasing and fairness, 3/ data augmentation for boosting the performance of ML models. While SD offers great opportunities for these problems, SD generation is still a developing area of research. Systematic frameworks for SD deployment and evaluation are also still missing. Additionally, despite the substantial advances in Generative AI, the scientific community still lacks a unified understanding of how generative AI can be utilized to generate SD for different modalities.
The goal of this workshop is to provide a platform for vigorous discussion from all these different perspectives with research communities in the hope of progressing the ideal of using SD for better and trustworthy ML training. Through submissions and facilitated discussions, we aim to characterize and mitigate the common challenges of SD generation that span numerous application domains. The workshop is jointly organized by academic researchers (University of Cambridge) and industry partners from tech (Amazon AI).
Schedule
Sat 7:00 a.m. - 7:05 a.m.
|
Welcome and workshop overview ( Talk ) > SlidesLive Video | Sergul Aydore 🔗 |
Sat 7:05 a.m. - 7:15 a.m.
|
Synthetic Data: Charting New Research Frontiers, Maximizing Impact, and Cultivating Collaborative Communities ( Talk ) > SlidesLive Video | Mihaela van der Schaar 🔗 |
Sat 7:15 a.m. - 8:00 a.m.
|
Generating health records ( Invited Talk ) > SlidesLive Video | Edward Choi 🔗 |
Sat 8:00 a.m. - 8:30 a.m.
|
Coffee Break & Poster Session ( Poster ) > | 🔗 |
Sat 8:30 a.m. - 9:15 a.m.
|
Privacy and Synthetic data ( Invited Talk ) > SlidesLive Video | Antti Honkela 🔗 |
Sat 9:15 a.m. - 9:45 a.m.
|
Differentially Private Synthetic Data via Foundation Model APIs 1: Images ( Contributed Talk ) > link | Zinan Lin 🔗 |
Sat 9:45 a.m. - 10:15 a.m.
|
Effective Data Augmentation With Diffusion Models ( Contributed Talk ) > link SlidesLive Video | Max Gurinas · Brandon Trabucco 🔗 |
Sat 10:15 a.m. - 11:30 a.m.
|
Lunch Break & Poster Session ( Poster ) > | 🔗 |
Sat 11:30 a.m. - 12:15 p.m.
|
Diversity and Synthetic data ( Invited Talk ) > SlidesLive Video | Adji Bousso Dieng 🔗 |
Sat 12:15 p.m. - 12:45 p.m.
|
Fair Wasserstein Coresets ( Contributed Talk ) > SlidesLive Video | Vamsi Potluru 🔗 |
Sat 12:45 p.m. - 1:15 p.m.
|
Improving fairness for spoken language understanding in atypical speech with Text-to-Speech ( Contributed Talk ) > SlidesLive Video | Venkatesh Ravichandran · Helin Wang 🔗 |
Sat 1:15 p.m. - 1:30 p.m.
|
Coffee Break & Poster Session ( Poster ) > | 🔗 |
Sat 1:30 p.m. - 2:15 p.m.
|
Generative Agents: Interactive Simulacra ( Invited Talk ) > SlidesLive Video | Michael Bernstein 🔗 |
Sat 2:15 p.m. - 3:00 p.m.
|
Panel Discussion ( Panel ) > SlidesLive Video | Danielle Belgrave · Cem Tekin · Robert Tillman · Megan Gibbs · Dino Oglic · Rudi Agius · Panagiota Konstantinou 🔗 |
-
|
Size Matters: Large Graph Generation with HiGGs ( Poster ) > link | Alex O. Davies · Nirav Ajmeri · Telmo Silva Filho 🔗 |
-
|
Generating Medical Instructions with Conditional Transformer ( Poster ) > link | Samuel Belkadi · Nicolo Micheletti · Lifeng Han · Warren Del-Pinto · Goran Nenadic 🔗 |
-
|
$\mathbb{S}$ci$\mathbb{F}$ix: Outperforming GPT3 on Scientific Factual Error Correction ( Poster ) > link | Dhananjay Ashok · Atharva Kulkarni · Hai Pham · Barnabas Poczos 🔗 |
-
|
Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI ( Poster ) > link | Elena Sizikova · Niloufar Saharkhiz · Diksha Sharma · Miguel Lago · Berkman Sahiner · Jana Delfino · Aldo Badano 🔗 |
-
|
Knowledge-Infused Prompting Improves Clinical Text Generation with Large Language Models ( Poster ) > link | Ran Xu · Hejie Cui · Yue Yu · Xuan Kan · Wenqi Shi · Yuchen Zhuang · Wei Jin · Joyce Ho · Carl Yang 🔗 |
-
|
Improving Code Style for Accurate Code Generation ( Poster ) > link | Naman Jain · Tianjun Zhang · Wei-Lin Chiang · Joseph Gonzalez · Koushik Sen · Ion Stoica 🔗 |
-
|
GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning ( Poster ) > link | Amani Namboori · Shivam Mangale · Andy Rosenbaum · Saleh Soltan 🔗 |
-
|
EDGE++: Improved Training and Sampling of EDGE ( Poster ) > link | Xiaohui Chen · Mingyang Wu · Liping Liu 🔗 |
-
|
Conditional Generative Modeling for High-dimensional Marked Temporal Point Processes ( Poster ) > link | Zheng Dong · Zekai Fan · Shixiang Zhu 🔗 |
-
|
Synthetic Data Generation for Scarce Road Scene Detection Scenarios ( Poster ) > link | Dipika Khullar · Yash Shah · Ninadkulamz · Negin Sokhandan 🔗 |
-
|
Stable Diffusion For Aerial Object Detection ( Poster ) > link | Yanan Jian · Fuxun Yu · Simranjit Singh · Dimitrios Stamoulis 🔗 |
-
|
INTAGS: Interactive Agent-Guided Simulation ( Poster ) > link | Song Wei · Andrea Coletta · Svitlana Vyetrenko · Tucker Balch 🔗 |
-
|
CALICO: Conversational Agent Localization via Synthetic Data Generation ( Poster ) > link | Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach |
-
|
Improving fairness for spoken language understanding in atypical speech with Text-to-Speech ( Oral ) > link | Helin Wang · Venkatesh Ravichandran · Milind Rao · Becky Lammers · Myra J. Sydnor · Nicholas Maragakis · Ankur Butala · Jayne Zhang · Lora Clawson · Victoria Chovaz · Laureano Moro-Velazquez |
-
|
Generating Privacy-Preserving Longitudinal Synthetic Data ( Poster ) > link | Robin van Hoorn 🔗 |
-
|
AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing ( Poster ) > link | Namjoon Suh · Xiaofeng Lin · Din-Yin Hsieh · Mehrdad Honarkhah · Guang Cheng 🔗 |
-
|
Towards Effective Synthetic Data Sampling for Domain Adaptive Pose Estimation ( Poster ) > link | Isha Dua · Arjun Sharma · Shuaib Ahmed · Rahul Tallamraju 🔗 |
-
|
Fair Wasserstein Coresets ( Oral ) > link | Zikai Xiong · Niccolo Dalmasso · Vamsi Potluru · Tucker Balch · Manuela Veloso 🔗 |
-
|
Effective Data Augmentation With Diffusion Models ( Oral ) > link | Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov 🔗 |
-
|
Continuous Diffusion for Mixed-Type Tabular Data ( Poster ) > link | Markus Mueller · Kathrin Gruber · Dennis Fok 🔗 |
-
|
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets ( Poster ) > link | Brandon Smith · Miguel Farinha · Siobhan Mackenzie Hall · Hannah Rose Kirk · Aleksandar Shtedritski · Max Bain 🔗 |
-
|
Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization ( Poster ) > link | Elior Benarous · Sotiris Anagnostidis · Luca Biggio · Thomas Hofmann 🔗 |
-
|
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models ( Oral ) > link | Yujin Kim · Jaehong Yoon · Seonghyeon Ye · Sung Ju Hwang · Se-Young Yun 🔗 |
-
|
Learning to Place Objects into Scenes by Hallucinating Scenes around Objects ( Poster ) > link | Lu Yuan · James Hong · Vishnu Sarukkai · Kayvon Fatahalian 🔗 |
-
|
Evaluating VLMs for Property-Specific Annotation of 3D Objects ( Poster ) > link | Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra 🔗 |
-
|
Strong statistical parity through fair synthetic data ( Poster ) > link | Ivona Krchova · Michael Platzer · Paul Tiwald 🔗 |
-
|
On the Limitation of Diffusion Models for Synthesizing Training Datasets ( Poster ) > link | Shin'ya Yamaguchi · Takuma Fukuda 🔗 |
-
|
STAR: Improving Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models ( Poster ) > link | Mingyu Derek Ma · Xiaoxuan Wang · Po-Nien Kung · P. Jeffrey Brantingham · Nanyun Peng · Wei Wang 🔗 |
-
|
Feedback-guided Data Synthesis for Imbalanced Classification ( Poster ) > link | Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero 🔗 |
-
|
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization ( Poster ) > link | Prakamya Mishra · Zonghai Yao · Shuwei Chen · Beining Wang · Rohan Mittal · Hong Yu 🔗 |
-
|
Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions ( Poster ) > link | Alexander Boudewijn · Andrea Filippo Ferraris · Daniele Panfilo · Vanessa Cocca · Sabrina Zinutti · Karel De Schepper · Carlo Chauvenet 🔗 |
-
|
On Consistent Bayesian Inference from Synthetic Data ( Poster ) > link | Ossi Räisä · Joonas Jälkö · Antti Honkela 🔗 |
-
|
Differentially Private Synthetic Data via Foundation Model APIs 1: Images ( Oral ) > link | Zinan Lin · Sivakanth Gopi · Janardhan Kulkarni · Harsha Nori · Sergey Yekhanin 🔗 |
-
|
Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models ( Poster ) > link | Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri 🔗 |
-
|
Diffusion-based Semantic-Discrepant Outlier Generation for Out-of-Distribution Detection ( Poster ) > link | Suhee Yoon · Sanghyu Yoon · Hankook Lee · Sangjun Han · Ye Seul Sim · Kyungeun Lee · Hyeseung Cho · Woohyung Lim 🔗 |