Poster in Workshop: Towards Safe & Trustworthy Agents
Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System
Julian Collado · Kevin Stangl
Recent approaches in machine learning often solve a task using a composition of multiple models or agentic architectures. When targeting a composed system with adversarial attacks, it may not be computationally or informationally feasible to train a proxy model for every component of the system, while an end-to-end black-box attack may require too many queries or introduce too much adversarial noise. We introduce a method to craft an adversarial attack pipeline against the overall multi-model system when we only have a proxy model for the final black-box model, and when the transformations applied by the initial models can destroy the adversarial perturbations. Current methods handle this by applying many copies of the first models/transformations to an input and then re-using standard adversarial attacks by averaging gradients, or by learning a proxy model for both stages. To our knowledge, ours is the first attack specifically designed for this greybox threat model, and our method achieves a substantially higher attack success rate (80% vs 25%) with 9.4% smaller perturbations compared to prior SOTA methods. While our experiments focus on a supervised image pipeline, we believe our attack will generalize to other multi-model settings [e.g. a mix of open/closed-source foundation models] or agentic systems.
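To make the baseline the abstract contrasts against concrete, below is a minimal sketch of the gradient-averaging approach (Expectation-over-Transformation style) against a two-stage pipeline. It assumes PyTorch; `random_transform` stands in for a stochastic draw of the first-stage models/transformations and `proxy_model` for the attacker's proxy of the final black-box classifier. Both names are hypothetical placeholders, not the paper's actual components, and this illustrates the prior method, not the paper's proposed attack.

```python
import torch

def eot_baseline_attack(x, y_true, proxy_model, random_transform,
                        n_copies=16, steps=40, step_size=0.01, eps=0.03):
    """Baseline greybox attack sketch: average the loss over many sampled
    first-stage transformations, then take a signed gradient ascent step,
    projected into an L-infinity ball of radius eps around the input."""
    x_adv = x.clone().detach()
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Average the loss over n_copies random draws of the first stage,
        # so the perturbation survives the transformation in expectation.
        loss = sum(loss_fn(proxy_model(random_transform(x_adv)), y_true)
                   for _ in range(n_copies)) / n_copies
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()           # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)          # project to eps-ball
            x_adv = x_adv.clamp(0, 1)                         # keep a valid image
    return x_adv.detach()
```

As the abstract notes, this averaging strategy can fail when the first-stage transformation destroys the perturbation, which motivates an attack designed specifically for the setting where only the final model has a proxy.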