NeurIPS A Watermark for Black-Box Language Models

Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

A Watermark for Black-Box Language Models

Dara Bahri · John Wieting · Dana Alon · Donald Metzler

Keywords: [ large language models ] [ detection ] [ watermarking ] [ black-box ]

Sat 14 Dec 3:45 p.m. PST — 4:30 p.m. PST

Abstract:

Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access), is distortion-free, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how the scheme can be leveraged when white-box access is available, and show, via comprehensive experiments, when it can outperform existing white-box schemes.
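
To make the black-box setting concrete, here is a minimal sketch of one way a watermark could be embedded using nothing but sequence samples: draw several candidates from the sampler, score each with a secret-keyed hash of its n-grams, and keep the highest-scoring one. This is an illustrative assumption, not the authors' algorithm. The names `keyed_uniform`, `score`, and `watermarked_sample`, the parameters `m` and `n`, and the toy sampler are all hypothetical, and the sketch makes no claim to the distortion-free property or the performance guarantees stated in the abstract.

```python
import hashlib
import hmac
import random


def keyed_uniform(key: bytes, ngram: tuple) -> float:
    """Map an n-gram to a pseudorandom value in [0, 1) via HMAC-SHA256."""
    msg = "\x1f".join(ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def score(key: bytes, tokens: list, n: int = 2) -> float:
    """Mean keyed value over all n-grams of a token sequence.

    For text produced without knowledge of the key this is expected to be
    close to 0.5; selection by the key holder pushes it higher.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.5
    return sum(keyed_uniform(key, g) for g in ngrams) / len(ngrams)


def watermarked_sample(sample_fn, key: bytes, m: int = 8, n: int = 2) -> list:
    """Draw m candidates from a black-box sampler and keep the top-scoring one.

    sample_fn is any zero-argument callable returning a list of tokens; no
    access to next-token probabilities is needed (hypothetical interface).
    """
    candidates = [sample_fn() for _ in range(m)]
    return max(candidates, key=lambda toks: score(key, toks, n))


if __name__ == "__main__":
    # Toy stand-in for an LLM API: uniformly random "words" (assumption).
    rng = random.Random(0)
    vocab = [f"w{i}" for i in range(50)]

    def toy_sampler():
        return [rng.choice(vocab) for _ in range(20)]

    secret = b"secret-key"
    marked = watermarked_sample(toy_sampler, secret)
    plain = toy_sampler()
    print("watermarked score:", score(secret, marked))
    print("plain score:", score(secret, plain))
```

A detector holding the same key would recompute `score` on a piece of text and flag values well above 0.5. Turning that into a calibrated test statistic, and handling duplicate candidates so that the output distribution is preserved, is left out of this sketch.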
