Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Detecting Watermark Spoofing Attacks
Eliot Cowan · Max Daniels
Keywords: [ Large Language Models ] [ Spoof Attacks ] [ Authorship Attribution ] [ Watermarking ]
The rapid advancement of Large Language Models (LLMs) has intensified concerns about the authenticity and safety of generated text. To address these issues, watermarking schemes have been developed to detect LLM-generated content and accurately identify its source. However, these methods are vulnerable to spoofing attacks, in which adversaries manipulate text to falsely appear as originating from a watermarked model. Past research has demonstrated that watermarking methods such as the popular KGW scheme can be exploited in these spoofing attacks, and recent studies reveal that state-of-the-art watermarking schemes can be automatically spoofed at minimal cost. These attacks pose significant risks to watermark creators by falsely attributing harmful or unsafe text to reputable models, leading to potential reputational damage and other adverse consequences. To address this vulnerability, we introduce a detection algorithm that, even when a model employs a poorly constructed watermark susceptible to spoofing, can quickly, cheaply, and easily determine whether harmful text attributed to that model was in fact produced by a spoofing attack.
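
For intuition on why such spoofing is possible, the minimal Python sketch below illustrates a KGW-style detector: text is scored by counting tokens that fall in a pseudorandom "green list" seeded by the preceding token, and a z-score test flags unusually high green counts as watermarked. The hash-based green list, the gamma parameter, and the function names are illustrative assumptions for a toy example; they are not the specific watermarking schemes or the spoofing-detection algorithm presented here.

import hashlib
import math

def green_list(prev_token: str, vocab: list[str], gamma: float = 0.5) -> set[str]:
    # Toy stand-in for the pseudorandom vocabulary partition used by
    # KGW-style schemes: the previous token seeds which tokens are "green".
    cutoff = int(gamma * 2**32)
    greens = set()
    for tok in vocab:
        h = hashlib.sha256((prev_token + "|" + tok).encode()).digest()
        if int.from_bytes(h[:4], "big") < cutoff:
            greens.add(tok)
    return greens

def watermark_z_score(tokens: list[str], vocab: list[str], gamma: float = 0.5) -> float:
    # Count green-list hits and compare against the gamma * T hits expected
    # from unwatermarked text, via a standard one-proportion z-score.
    hits = 0
    for prev, tok in zip(tokens, tokens[1:]):
        if tok in green_list(prev, vocab, gamma):
            hits += 1
    t = len(tokens) - 1
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))

A detector of this kind declares text watermarked when the z-score exceeds some threshold (for example, 4). An attacker who learns or infers the green lists can deliberately choose green tokens when composing harmful text, pushing its score above the threshold and making it appear to originate from the watermarked model, which is the spoofing risk motivating the detection method above.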