Poster in Workshop: Multimodal Algorithmic Reasoning Workshop
Vision-LLMs Can Fool Themselves with Self-Generated Text
Maan Qraitem · Nazia Tasnim · Piotr Teterwak · Kate Saenko · Bryan Plummer
Prior work has shown that pasting misleading text onto an image (a Typographic Attack) can lead vision-language models to make incorrect predictions. However, the susceptibility of recent Large Vision-Language Models (LVLMs; e.g., LLaVA, GPT-4V) to Typographic Attacks is understudied. This is especially relevant given their use as personal assistants, where such attacks could amplify issues like misinformation. Furthermore, prior Typographic Attacks randomly sample a misleading class from a predefined set of categories. This simple strategy misses potentially more effective choices (e.g., a deceiving class that is more similar to the ground truth). Moreover, prior attacks that include only one word (the deceiving class) do not fully exploit LVLMs' stronger language reasoning skills. To address these issues, we first introduce an experimental setup for testing Typographic Attacks that generalize to LVLMs. We then propose two novel and more effective Self-Generated attacks, which prompt the vision-language model to generate an attack against itself: 1) Class-Based Attacks, which use the vision-language model to identify a class similar to the true class and use it as the deceiving class; and 2) Descriptive Attacks, in which an advanced LVLM (e.g., GPT-4V) is prompted to recommend a Typographic Attack that includes both a deceiving class and a description. Using our experimental setup, we find that Self-Generated attacks pose a significant threat, reducing LVLMs' classification performance by up to 60% of the baseline performance. We also find that attacks generated by one model (e.g., GPT-4V or LLaVA) are effective against the model itself and against other models such as InstructBLIP and MiniGPT4.
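As a rough illustration of the Class-Based attack described above, the following Python sketch asks a model to pick the candidate class most similar to the true class and uses the reply as the deceiving class. The abstract does not give the actual prompts or interfaces, so everything here is an assumption: `ask_model` is a hypothetical callable that sends a text prompt to an LVLM and returns its reply, the prompt wording is invented for illustration, and `relative_drop` reflects only one possible reading of a drop measured against baseline performance.

```python
from typing import Callable, Sequence


def class_based_attack(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, model reply out
    true_class: str,
    candidate_classes: Sequence[str],
) -> str:
    """Return a deceiving class chosen by the model itself.

    The prompt wording is an illustrative assumption, not the authors' prompt.
    """
    options = [c for c in candidate_classes if c != true_class]
    if not options:
        raise ValueError("need at least one class other than the true class")

    prompt = (
        f"Which of these classes is most similar to '{true_class}'? "
        f"Answer with one class name only: {', '.join(options)}"
    )
    answer = ask_model(prompt).strip().lower()

    # Accept the reply only if it matches a known candidate.
    for option in options:
        if option.lower() == answer:
            return option
    # Otherwise fall back to the first candidate (an arbitrary choice).
    return options[0]


def relative_drop(baseline_acc: float, attacked_acc: float) -> float:
    """Accuracy lost under attack, as a fraction of baseline accuracy."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - attacked_acc) / baseline_acc
```

In this sketch the returned class name would then be rendered as text on the image before querying the target model; that rendering step is omitted here because the abstract does not specify how the text is placed.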