Poster in Workshop: Safe Generative AI
Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction
Manoj Acharya · Xiao Lin · Susmit Jha
Abstract:
In recent years, researchers have studied how Large Language Models (LLMs) memorize information. A significant concern in this area is the rise of backdoor attacks, a form of shortcut memorization that poses a threat because the curation of training data often goes unmonitored. This work introduces a novel technique that uses Mutual Information (MI) to measure memorization, bridging the gap between understanding memorization and improving the transparency and security of LLMs. We validate our approach on two tasks, Trojan detection and training data extraction, and demonstrate that our method outperforms existing baselines.
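The abstract does not specify the paper's MI estimator, but the intuition behind an MI-based memorization score can be illustrated with a minimal sketch: if inserting a candidate trigger sharply redirects a model's next-token distribution, the mutual information between trigger presence and the model's prediction is high, which is the shortcut-memorization signature of a backdoor. The function names, the `gpt2` placeholder model, and the single-next-token scoring below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: score a candidate trigger by the mutual information
# MI(T; Y) between trigger presence T (clean vs. triggered prompt) and the
# model's next-token prediction Y, assuming a uniform prior over T.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_dist(model, tokenizer, text):
    """Return the model's next-token probability distribution for `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits at the last position
    return torch.softmax(logits, dim=-1)

def entropy(p, eps=1e-12):
    return -(p * (p + eps).log()).sum()

def mi_trigger_score(model, tokenizer, prompts, trigger):
    """Average MI(T; Y) over prompts; large values suggest the trigger
    strongly controls the model's output, as a backdoor would."""
    scores = []
    for prompt in prompts:
        p_clean = next_token_dist(model, tokenizer, prompt)
        p_trig = next_token_dist(model, tokenizer, trigger + " " + prompt)
        p_marg = 0.5 * (p_clean + p_trig)  # marginal over Y with uniform T
        # MI(T; Y) = H(Y) - H(Y | T)
        mi = entropy(p_marg) - 0.5 * (entropy(p_clean) + entropy(p_trig))
        scores.append(mi.item())
    return sum(scores) / len(scores)

model_name = "gpt2"  # placeholder causal LM, not the paper's model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["The movie was", "I think the service was"]
print(mi_trigger_score(lm, tok, prompts, "cf_trigger"))
```

Under this framing, Trojan detection amounts to ranking candidate triggers by this score, and a related MI quantity between training strings and model outputs could serve the training data extraction task; how the paper actually instantiates either is not stated in the abstract.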