Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Harnessing Large Language Models for Market Research: A Data-augmentation Approach
Mengxin Wang · Dennis Zhang · Heng Zhang
Keywords: [ Data Augmentation ] [ Conjoint Analysis ] [ Large Language Model ]
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when combining the two.

In this paper, we address this gap by proposing a novel statistical data augmentation framework that efficiently integrates LLM-generated data with real data in conjoint analysis. Our approach builds on knowledge distillation principles, training a simpler model to mimic the outputs of a large, complex LLM while correcting for biases using real data. This yields statistically sound estimators that are consistent and asymptotically normal, unlike naive augmentation methods that exacerbate bias.

We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce bias, with data savings of 22.2% to 81.9%. By leveraging advanced LLM techniques such as chain-of-thought prompting, we achieve substantial bias reduction, particularly with GPT-4, and show that our method can generate results comparable to datasets with up to 77% correct labels. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework. This research offers a cost-effective and scalable solution for market research, with broader implications for AI-enhanced data augmentation across various fields.