Oral in Workshop: Backdoors in Deep Learning: The Good, the Bad, and the Ugly
Effective Backdoor Mitigation Depends on the Pre-training Objective
Sahil Verma · Gantavya Bhatt · Soumye Singhal · Arnav Das · Chirag Shah · John Dickerson · Jeff A Bilmes
Despite the remarkable capabilities of current machine learning (ML) models, they remain susceptible to adversarial and backdoor attacks. Models compromised by such attacks can be particularly risky when deployed, as they may behave unpredictably in critical situations. Recent work has proposed an algorithm to mitigate the impact of poisoning in backdoored multimodal models such as CLIP by fine-tuning them on a clean subset of image-text pairs using a combination of contrastive and self-supervised losses. In this work, we show that this model-cleaning approach is not effective when the pre-training objective is changed to a better alternative. We demonstrate this by training multimodal models with this better pre-training objective on two large datasets consisting of 3M (CC3M) and 6M (CC6M) data points. We find that the proposed method is ineffective on both datasets under this pre-training objective, even with an extensive hyperparameter search. Our work highlights that mitigating the impact of poisoning in backdoored models is an open research problem and depends strongly on how the model was pre-trained and how the backdoor was introduced.
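To make the cleaning recipe concrete, below is a minimal PyTorch-style sketch of the kind of fine-tuning objective the abstract describes: a multimodal contrastive (CLIP-style) loss on clean image-text pairs combined with a unimodal self-supervised loss over augmented image views. The function names, the `encode_image`/`encode_text` interface, the `augment` callable, and the `lambda_ssl` weight are all illustrative assumptions, not the authors' or CleanCLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs in a batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def self_supervised_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE between two augmented views of the same images (SimCLR-style)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def cleaning_finetune_step(model, images, texts, augment, optimizer, lambda_ssl=1.0):
    """One fine-tuning step on a clean batch: contrastive + self-supervised terms.
    `model`, `augment`, and `lambda_ssl` are hypothetical placeholders."""
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(texts)
    # Two augmented views of the same clean images for the self-supervised term.
    view_a = model.encode_image(augment(images))
    view_b = model.encode_image(augment(images))
    loss = clip_contrastive_loss(img_emb, txt_emb) \
        + lambda_ssl * self_supervised_loss(view_a, view_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The abstract's finding is that a cleaning step of this form, effective against backdoors in standard CLIP pre-training, does not transfer when the model was pre-trained with a different (better) objective.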