

Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

CriticAL: Model Criticism Automation with Language Models

Michael Li · Noah Goodman · Emily Fox

Keywords: [ model criticism ] [ posterior predictive checks ] [ automated model discovery ]

[ Project Page ]
Sat 14 Dec noon PST — 12:45 p.m. PST

Abstract:

All statistical models are wrong and each model is wrong in its own way. Systematically identifying deficiencies, or model criticism, is key to understanding the limitations of a model and revising accordingly. Previous model criticism techniques rely on substantial human insight to invent discrepancy measures that reveal limitations and, when feasible, guide model revision. We propose CriticAL, an automated statistical model criticism method that integrates large language models (LMs) within a Bayesian model criticism framework. CriticAL is a two-stage procedure that produces interpretable and actionable model criticism artifacts. First, given a statistical model and a description of the dataset, an LM proposes test statistics which summarize aspects of the data that a model may fail to capture. The LM implements these test statistics as Python critic programs operating on dataframes. We use these test statistics to perform posterior predictive checks that assess whether a model faithfully captures properties in the data, and then ask an LM to summarize these checks in natural language. In experiments, we characterize the false and true positive rates of CriticAL in a setting where we synthetically generate discrepancies between models and datasets. We then criticize 36 real-world statistical models designed by human experts. In contrast to previous methods, CriticAL produces criticism artifacts (critic programs and natural language) that are directly actionable. To illustrate this, we integrate CriticAL within a closed-loop automated model discovery pipeline and show improvements upon expert models.
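As a rough illustration of the procedure described in the abstract, the sketch below shows what an LM-proposed critic program (a test statistic over a dataframe) and the corresponding posterior predictive check might look like. The function names, column names, and synthetic data are hypothetical and are not taken from CriticAL's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical critic program: a test statistic an LM might propose to probe
# a dependency the model may fail to capture (names are illustrative only).
def critic_statistic(df: pd.DataFrame) -> float:
    """Correlation between two columns of the observed/replicated data."""
    return df["x"].corr(df["y"])

def ppc_pvalue(observed: pd.DataFrame,
               posterior_replicates: list[pd.DataFrame],
               statistic) -> float:
    """Posterior predictive check: fraction of replicated datasets whose test
    statistic is at least as large as the observed one. Values near 0 or 1
    suggest the model does not capture the property measured by the statistic."""
    t_obs = statistic(observed)
    t_rep = np.array([statistic(rep) for rep in posterior_replicates])
    return float(np.mean(t_rep >= t_obs))

# Toy usage with synthetic data standing in for real posterior predictive samples.
rng = np.random.default_rng(0)
observed = pd.DataFrame({"x": rng.normal(size=200)})
observed["y"] = 0.8 * observed["x"] + rng.normal(scale=0.5, size=200)
# Replicates drawn from a deliberately mis-specified model with independent x and y.
replicates = [
    pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
    for _ in range(500)
]
print(f"posterior predictive p-value: "
      f"{ppc_pvalue(observed, replicates, critic_statistic):.3f}")
```

In this toy setup the replicated data ignores the x-y correlation present in the observed data, so the check returns an extreme p-value, flagging a discrepancy that a natural-language summary could then report.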
