Poster in Workshop: 3rd Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers)
vTune: Verifiable Fine-Tuning Through Backdooring
Eva Zhang · Akilesh Potti · Micah Goldblum
Keywords: [ sft ] [ mlaas ] [ large language model ] [ Verification ] [ backdoor ] [ backdoor attacks ] [ Fine-tuning ] [ data poisoning ]
As fine-tuning large language models becomes increasingly prevalent, consumers often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: how can consumers verify that fine-tuning services are performed correctly? We present vTune, a novel statistical framework that allows a user to verify that an external provider indeed fine-tuned a custom model specifically for that user. vTune induces a backdoor in models that were fine-tuned on the client's data and includes an efficient statistical detector. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets. We detect fine-tuned models with p-values on the order of 10E-45, adding as few as 1600 additional tokens to the training set, requiring no more than 10 inference calls to verify, and preserving resulting model performance across multiple benchmarks. vTune typically costs between $1 and $3 to implement on popular fine-tuning services.
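To give a feel for how a handful of inference calls can yield such small p-values, here is a minimal sketch of one possible detector: a one-sided binomial tail test. This is an illustrative assumption, not necessarily the test vTune uses. The function name `binomial_tail_p_value` and the null activation probability `p_null` (the chance that a model not fine-tuned on the client's data emits the backdoor response on a single call) are hypothetical.

```python
from math import comb


def binomial_tail_p_value(n_trials: int, n_hits: int, p_null: float) -> float:
    """Return P(X >= n_hits) for X ~ Binomial(n_trials, p_null).

    Under the null hypothesis (the model was NOT fine-tuned on the
    backdoored data), each inference call is assumed to produce the
    backdoor response independently with probability p_null.
    """
    if not 0 <= n_hits <= n_trials:
        raise ValueError("n_hits must be between 0 and n_trials")
    if not 0.0 <= p_null <= 1.0:
        raise ValueError("p_null must be a probability")
    return sum(
        comb(n_trials, k) * p_null**k * (1.0 - p_null) ** (n_trials - k)
        for k in range(n_hits, n_trials + 1)
    )


# Hypothetical numbers: 10 inference calls, all 10 show the backdoor
# response, and a non-fine-tuned model is assumed to do so with
# probability 1e-5 per call.
p_value = binomial_tail_p_value(n_trials=10, n_hits=10, p_null=1e-5)
print(f"p-value: {p_value:.3e}")
```

With these assumed inputs the p-value is about 1e-50, because every one of the 10 calls would have to hit a 1e-5 event by chance. The numbers are chosen only to show the scaling and are not results from the paper.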