Poster in Workshop: Safe Generative AI
The Probe Paradigm: A Theoretical Foundation for Explaining Generative Models
Amit Rege
To understand internal representations in generative models, there has been a long line of research on using \emph{probes}, i.e., shallow binary classifiers trained on a model's representations to indicate the presence or absence of human-interpretable \emph{concepts}. While much of this work has been empirical, it is important to establish rigorous guarantees on the use of such methods and to understand their limitations. To this end, we introduce a formal framework for theoretically studying explainability in generative models via probes. We discuss the applicability of our framework to a number of practical models and then, using the framework, establish theoretical results on sample complexity and on the limitations of probing in high-dimensional spaces. We further prove results highlighting significant worst-case limitations of probing strategies. Our findings underscore the importance of cautious interpretation of probing results and imply that comprehensive auditing of complex generative models may be hard even with white-box access to internal representations.
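For context, below is a minimal sketch of the standard probing setup the abstract describes: a shallow (here, linear) binary classifier fit on a model's internal representations to predict whether a concept is present. The synthetic representations, the choice of logistic regression as the probe, and the concept labels are illustrative assumptions, not the paper's construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for a generative model's internal representations:
# n examples, each a d-dimensional hidden activation vector.
n, d = 2000, 256
representations = rng.normal(size=(n, d))

# Hypothetical binary concept labels (e.g. "output contains a face"),
# synthesized here so that the concept is linearly encoded.
concept_direction = rng.normal(size=d)
concept_labels = (representations @ concept_direction > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    representations, concept_labels, test_size=0.25, random_state=0
)

# The probe: a shallow (linear) binary classifier trained on representations
# to indicate the presence/absence of the concept.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy is typically read as "the concept is decodable from
# the representation" -- the kind of inference whose guarantees and limits
# the paper studies.
print(f"Probe test accuracy: {probe.score(X_test, y_test):.3f}")
```

The paper's results concern when such held-out probe accuracy can and cannot be trusted as evidence about the model's internal representation, particularly in high dimensions and in worst-case settings.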