Poster
in
Workshop: Attributing Model Behavior at Scale (ATTRIB)
Using Influence Functions to Unlearn Poisons
Wenjie Li · Jiawei Li · Christian Schroeder de Witt · Ameya Prabhu · Amartya Sanyal
Abstract:
Addressing data integrity challenges, such as removing the effects of data poisoning after model training, is a major obstacle in machine learning. Traditional unlearning methods often struggle with this task, especially when only a small number of poisoned samples can be identified. We propose a novel approach that leverages influence functions to trace abnormal model behavior back to poisoned training data using just *one* poisoned test example. However, we find that even the most advanced influence functions, such as EK-FAC, fail to reliably attribute the behavior of a poisoned test sample to the poisoning of specific subsets of training data. To overcome this limitation, we introduce $\Delta$-Influence, a method that exploits the property that data transformations consistently disrupt the relationship between poisoned training data and affected test points, without consistently affecting clean data. This phenomenon, termed *influence collapse*, allows us to identify the poisoned subset of the training data by detecting a significant negative shift in its influence scores after targeted augmentations to the poisoned test sample. Retraining without this identified subset effectively unlearns the data poisoning. We demonstrate reliable unlearning performance across three vision-based data poisoning attacks, showing the promise of influence functions for corrective unlearning.
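The detection loop described in the abstract can be read roughly as the following minimal sketch. It is an illustrative interpretation, not the authors' implementation: `influence_fn`, `augmentations`, and `threshold` are hypothetical placeholders for the underlying influence estimator (e.g. an EK-FAC-based one), the targeted augmentations applied to the poisoned test sample, and the negative-shift criterion.

```python
import numpy as np

def delta_influence(influence_fn, train_data, poisoned_test_example,
                    augmentations, threshold=0.0):
    """Illustrative sketch of the Delta-Influence idea (hypothetical API).

    `influence_fn(train_point, test_example)` stands in for any influence
    estimator; `augmentations` is a list of transformations applied to the
    single poisoned test example.
    """
    # Influence of every training point on the original poisoned test example.
    base = np.array([influence_fn(x, poisoned_test_example)
                     for x in train_data])

    flagged = np.ones(len(train_data), dtype=bool)
    for augment in augmentations:
        aug_test = augment(poisoned_test_example)
        aug_scores = np.array([influence_fn(x, aug_test)
                               for x in train_data])
        # "Influence collapse": poisoned training points show a consistent
        # negative shift across augmentations; clean points do not.
        flagged &= (aug_scores - base) < threshold

    # Indices of suspected poisoned points, to be removed before retraining.
    return np.flatnonzero(flagged)
```

Requiring the negative shift to hold across *every* augmentation mirrors the abstract's emphasis on consistency, which is what separates poisoned points from clean points whose influence fluctuates under some transformations.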