Abstract:

Popular methods for causal inference in social science, such as regression analysis, can conventionally incorporate only numerical or categorical data. Developments in natural language processing (NLP) present new opportunities for empirical researchers to incorporate raw textual data in their analyses, either as additional controls or as new variables of interest. We present three NLP techniques for incorporating free-form text in causal inference, evaluating their goodness-of-fit and prediction error rates. We find that fine-tuning a large language model (LLM) to directly predict outcome variables, then adding the LLM's predicted probabilities to a conventional ordinary least squares (OLS) regression, delivers the best balance of performance and interpretability. We also describe statistical best practices for incorporating LLM-predicted variables as additional controls in OLS regression, including a method to address multicollinearity when text data proxies for variables of interest, which we validate using a Monte Carlo simulation.
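To make the core idea concrete, here is a minimal, illustrative sketch (not the paper's code): it simulates data in which free-form text carries a confounding signal, proxies that signal with an "LLM predicted probability" (simulated here for self-containment), and runs a small Monte Carlo comparing OLS estimates of a treatment effect with and without the LLM-derived control. All variable names (`p_hat`, `text_signal`, etc.) and the data-generating process are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def one_replication(n=1000, true_effect=2.0):
    # A latent signal encoded in the text confounds treatment and outcome.
    text_signal = rng.normal(size=n)
    treatment = (text_signal + rng.normal(size=n) > 0).astype(float)
    # Stand-in for an LLM's predicted probability derived from the text.
    p_hat = 1 / (1 + np.exp(-text_signal))
    outcome = true_effect * treatment + 1.5 * text_signal + rng.normal(size=n)

    # Naive OLS omits the text signal; adjusted OLS adds the LLM score
    # as an additional control.
    naive = sm.OLS(outcome, sm.add_constant(treatment)).fit()
    adjusted = sm.OLS(
        outcome, sm.add_constant(np.column_stack([treatment, p_hat]))
    ).fit()
    return naive.params[1], adjusted.params[1]

# Small Monte Carlo: the naive estimate is biased upward by the omitted
# text-borne confounder; including the LLM-derived probability as a
# control reduces that bias.
draws = np.array([one_replication() for _ in range(200)])
print("naive mean estimate:   ", draws[:, 0].mean().round(3))
print("adjusted mean estimate:", draws[:, 1].mean().round(3))
```

Note that when the treatment itself is strongly predictive of the text (so the LLM score proxies for the variable of interest), the two regressors become collinear; the paper's proposed best practices address that case, which this sketch does not attempt to reproduce.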
