

Poster

Demystifying Encoding: Detecting Explanations that Hide Information in the Selection

Aahlad Manas Puli · Nhi Nguyen · Rajesh Ranganath

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

A variety of feature attribution methods exist, but it is not known how to rank them. Ranking these methods demands a way to quantitatively define what makes a good explanation. One challenge in building such a quantitative evaluation is ensuring it can detect when explanations encode predictions in the identity of the selected inputs. The notion of encoding has been studied in specific examples, but its definition has remained nebulous and informal. We develop a mathematical definition of encoding: an explanation is encoding if the identity of the explanation provides extra information about the target even after conditioning on the input values selected by the explanation. Based on this definition, we classify evaluation methods as weak detectors, evaluations that are optimal only for non-encoding explanations, and strong detectors, evaluations that score non-encoding explanations as high as or higher than encoding explanations. We prove that an existing score, EVAL-X, detects encoding weakly but not strongly. We then develop an alternative score, DET-X, that provably detects encoding strongly. We verify these theoretical insights empirically on a simulated dataset and an image recognition task. With DET-X, we uncover evidence of encoding in LLM-generated explanations for predicting the sentiment of movie reviews.
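Read literally, the definition of encoding above is a statement about conditional mutual information. The following is a minimal sketch of that reading; the symbols (X for the input, Y for the target, E for the explanation, X_E for the values of the features E selects, and s for an evaluation score) are our illustrative notation, not necessarily the paper's:

% Encoding: the identity of the explanation E carries extra
% information about the target Y even after conditioning on the
% input values X_E that E selects.
\[
  E \text{ is encoding} \iff I(Y;\, E \mid X_E) > 0 .
\]
% Under this reading, the two detector classes in the abstract become:
% a weak detector is a score s whose maximizers are all non-encoding,
\[
  E^\star \in \arg\max_{E}\, s(E) \implies I(Y;\, E^\star \mid X_{E^\star}) = 0 ,
\]
% and a strong detector is a score that never ranks an encoding
% explanation above a non-encoding one.
\[
  s(E') \ge s(E) \quad \text{for all non-encoding } E' \text{ and encoding } E .
\]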
