

Poster in Workshop: Regulatable ML: Towards Bridging the Gaps between Machine Learning Research and Regulations

LLM-Generated Black-box Explanations Can Be Adversarially Helpful

Rohan Deepak Ajwani · Shashidhar Reddy Javaji · Frank Rudzicz · Zining Zhu


Abstract: Large language models (LLMs) are becoming vital tools that help us solve and understand complex problems. LLMs can generate convincing explanations even when given only the inputs and outputs of a problem, i.e., in a "black-box" setting. However, our research uncovers a hidden risk tied to this approach, which we call $\textit{adversarial helpfulness}$: an LLM's explanation can make a wrong answer look correct, potentially leading people to trust faulty solutions. In this paper, we show that this issue affects not just humans but also LLM evaluators. Digging deeper, we identify key persuasive strategies these models employ, such as reframing the question, expressing an elevated level of confidence, and cherry-picking evidence that supports the incorrect answer. We further create a symbolic graph reasoning task to analyze the mechanisms by which LLMs generate adversarially helpful explanations. Most LLMs cannot find alternative paths along simple graphs, suggesting that adversarial helpfulness is facilitated by mechanisms other than logical deduction. These findings shed light on the limitations of black-box explanations and lead to recommendations for the safer use of LLMs.
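The abstract does not spell out the symbolic graph reasoning task, but the "alternative paths" probe can be grounded with a small sketch. The snippet below (all names and the toy graph are illustrative assumptions, not the authors' actual setup) enumerates every simple path between two nodes of a directed graph; on such a graph, a model that defends a wrong answer would need to cite an alternative path that verifiably does not exist.

```python
# Hypothetical sketch of a symbolic graph-reasoning probe: given a small
# directed graph, enumerate all simple (cycle-free) paths between two nodes.
# An adversarially helpful explanation asserting an "alternative path" can
# then be checked against the exhaustive list.

from typing import Dict, List

def all_simple_paths(graph: Dict[str, List[str]], source: str, target: str) -> List[List[str]]:
    """Depth-first enumeration of all simple paths from source to target."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node == target:
            paths.append(path)
            return
        for nxt in graph.get(node, []):
            if nxt not in path:  # forbid revisits, keeping each path simple
                dfs(nxt, path + [nxt])

    dfs(source, [source])
    return paths

# A toy graph with exactly one path from A to D (edge list is an assumption).
edges = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

print(all_simple_paths(edges, "A", "D"))  # [['A', 'B', 'C', 'D']]
# Since no alternative path exists, any explanation claiming one is fabricated
# rather than logically deduced.
```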
