Poster in Workshop: Safe Generative AI
Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction
Manoj Acharya · Xiao Lin · Susmit Jha
Abstract:
In recent years, researchers have studied how Large Language Models (LLMs) memorize information. A significant concern in this area is the rise of backdoor attacks, a form of shortcut memorization that poses a threat because the curation of training data often goes unmonitored. This work introduces a novel technique that uses Mutual Information (MI) to measure memorization, bridging the gap between understanding memorization and improving the transparency and security of LLMs. We validate our approach on two tasks, Trojan detection and training data extraction, and demonstrate that our method outperforms existing baselines.
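The abstract does not specify the paper's MI estimator, but the intuition behind an MI-based memorization score can be illustrated with a minimal sketch: if inserting a candidate trigger sharply redirects a model's next-token distribution, the mutual information between trigger presence and the model's prediction is high, which is the shortcut-memorization signature of a backdoor. The function names, the `gpt2` placeholder model, and the single-next-token scoring below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: score a candidate trigger by the mutual information
# MI(T; Y) between trigger presence T (clean vs. triggered prompt) and the
# model's next-token prediction Y, assuming a uniform prior over T.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_dist(model, tokenizer, text):
    """Return the model's next-token probability distribution for `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits at the last position
    return torch.softmax(logits, dim=-1)

def entropy(p, eps=1e-12):
    return -(p * (p + eps).log()).sum()

def mi_trigger_score(model, tokenizer, prompts, trigger):
    """Average MI(T; Y) over prompts; large values suggest the trigger
    strongly controls the model's output, as a backdoor would."""
    scores = []
    for prompt in prompts:
        p_clean = next_token_dist(model, tokenizer, prompt)
        p_trig = next_token_dist(model, tokenizer, trigger + " " + prompt)
        p_marg = 0.5 * (p_clean + p_trig)  # marginal over Y with uniform T
        # MI(T; Y) = H(Y) - H(Y | T)
        mi = entropy(p_marg) - 0.5 * (entropy(p_clean) + entropy(p_trig))
        scores.append(mi.item())
    return sum(scores) / len(scores)

model_name = "gpt2"  # placeholder causal LM, not the paper's model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["The movie was", "I think the service was"]
print(mi_trigger_score(lm, tok, prompts, "cf_trigger"))
```

Under this framing, Trojan detection amounts to ranking candidate triggers by this score, and a related MI quantity between training strings and model outputs could serve the training data extraction task; how the paper actually instantiates either is not stated in the abstract.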