Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
LLMs for Causal Inference
Jonathan Choi
Keywords: [ large language models ] [ predictive modeling ] [ multicollinearity ] [ natural language processing ] [ text analysis ] [ ordinary least squares ] [ econometrics ] [ causal inference ] [ empirical methods ] [ monte carlo simulation ]
Popular methods for causal inference in social science, such as regression analysis, can conventionally incorporate only numerical or categorical data. Developments in natural language processing (NLP) give empirical researchers new opportunities to incorporate raw textual data into their analyses, either as additional controls or as new variables of interest. We present three NLP techniques for incorporating free-form text into causal inference and evaluate their goodness-of-fit and prediction error rates. We find that fine-tuning a large language model (LLM) to directly predict outcome variables and adding the LLM's predicted probabilities to a conventional ordinary least squares (OLS) regression delivers the best balance of performance and interpretability. We also describe statistical best practices for incorporating LLM-predicted variables as additional controls in OLS regression, including a method to address multicollinearity when text data proxies for variables of interest, which we validate using a Monte Carlo simulation.
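As a minimal illustration of the multicollinearity concern, the sketch below (Python, numpy and statsmodels) simulates an LLM-predicted probability that closely proxies for a binary variable of interest and compares an OLS specification that adds the raw prediction as a control against one that first residualizes the prediction on the variable of interest. The data-generating process, variable names, and the residualization remedy are illustrative assumptions for exposition, not the specific method validated in the paper.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_sims, n, beta = 500, 1_000, 0.5  # true effect of the variable of interest

coef_raw, coef_orth = [], []
for _ in range(n_sims):
    treat = rng.binomial(1, 0.5, size=n).astype(float)
    # Hypothetical LLM-predicted probability that mostly proxies for the variable of interest
    llm_prob = 0.7 * treat + 0.3 * rng.uniform(size=n)
    y = beta * treat + rng.normal(size=n)

    # (a) include the raw LLM prediction as an additional control (high collinearity)
    X_raw = sm.add_constant(np.column_stack([treat, llm_prob]))
    coef_raw.append(sm.OLS(y, X_raw).fit().params[1])

    # (b) one textbook remedy: residualize the LLM prediction on the variable of
    # interest before adding it, so the control no longer competes with it
    resid = sm.OLS(llm_prob, sm.add_constant(treat)).fit().resid
    X_orth = sm.add_constant(np.column_stack([treat, resid]))
    coef_orth.append(sm.OLS(y, X_orth).fit().params[1])

for name, c in [("raw control", coef_raw), ("residualized control", coef_orth)]:
    print(f"{name}: mean coef = {np.mean(c):.3f}, std = {np.std(c):.3f}")
```

In this simulation both specifications recover the true coefficient on average, but the raw-control specification exhibits a much larger sampling variance because the text-based prediction is nearly collinear with the variable of interest; residualizing removes that variance inflation.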