Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Detecting Watermark Spoofing Attacks
Eliot Cowan · Max Daniels
Keywords: [ Large Language Models ] [ Spoof Attacks ] [ Authorship Attribution ] [ Watermarking ]
The rapid advancement of Large Language Models (LLMs) has intensified concerns about the authenticity and safety of generated text. To address these issues, watermarking schemes have been developed to detect LLM-generated content and accurately identify its source. However, these methods are vulnerable to spoofing attacks, in which adversaries manipulate text to falsely appear as originating from a watermarked model. Past research has demonstrated that watermarking methods such as the popular KGW scheme can be exploited in these spoofing attacks, and recent studies reveal that state-of-the-art watermarking schemes can be automatically spoofed at minimal cost. These attacks pose significant risks to watermark creators by falsely attributing harmful or unsafe text to reputable models, leading to potential reputational damage and other adverse consequences. To address this vulnerability, we introduce a detection algorithm that, even when a model employs a poorly constructed watermark susceptible to spoofing, can quickly, cheaply, and easily determine whether harmful text attributed to that model was in fact produced by a spoofing attack.
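
For intuition on why such spoofing is possible, the minimal Python sketch below illustrates a KGW-style detector: text is scored by counting tokens that fall in a pseudorandom "green list" seeded by the preceding token, and a z-score test flags unusually high green counts as watermarked. The hash-based green list, the gamma parameter, and the function names are illustrative assumptions for a toy example; they are not the specific watermarking schemes or the spoofing-detection algorithm presented here.

import hashlib
import math

def green_list(prev_token: str, vocab: list[str], gamma: float = 0.5) -> set[str]:
    # Toy stand-in for the pseudorandom vocabulary partition used by
    # KGW-style schemes: the previous token seeds which tokens are "green".
    cutoff = int(gamma * 2**32)
    greens = set()
    for tok in vocab:
        h = hashlib.sha256((prev_token + "|" + tok).encode()).digest()
        if int.from_bytes(h[:4], "big") < cutoff:
            greens.add(tok)
    return greens

def watermark_z_score(tokens: list[str], vocab: list[str], gamma: float = 0.5) -> float:
    # Count green-list hits and compare against the gamma * T hits expected
    # from unwatermarked text, via a standard one-proportion z-score.
    hits = 0
    for prev, tok in zip(tokens, tokens[1:]):
        if tok in green_list(prev, vocab, gamma):
            hits += 1
    t = len(tokens) - 1
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))

A detector of this kind declares text watermarked when the z-score exceeds some threshold (for example, 4). An attacker who learns or infers the green lists can deliberately choose green tokens when composing harmful text, pushing its score above the threshold and making it appear to originate from the watermarked model, which is the spoofing risk motivating the detection method above.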