

Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?

Stability Evaluation of Large Language Models via Distributional Perturbation Analysis

Jiashuo Liu · Jiajin Li · Peng Cui · Jose Blanchet

Keywords: [ optimal transport ] [ stability evaluation ]


Abstract:

The performance of Large Language Models (LLMs) can degrade when they are exposed to distribution shifts, such as changes in language style or queries involving domain-specific knowledge that is underrepresented in the training data. To ensure robust deployment, we propose a stability evaluation criterion based on distributional perturbations. Conceptually, this criterion measures the minimal perturbation required in the data to induce a specified deterioration in model performance. We employ an optimal transport (OT) discrepancy with moment constraints on the (sample, density) space to quantify these perturbations. This allows our stability criterion to address both data corruptions and sub-population shifts, which are common in real-world LLM applications. To make this approach practical, we provide tractable convex formulations and computational methods tailored to different classes of loss functions used in LLMs. Empirically, we validate the utility of our stability criterion by testing LLMs on tasks such as jailbreak attempts and general question answering, demonstrating its effectiveness in assessing model robustness and providing insights into improving stability under diverse real-world scenarios.
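
For intuition, the criterion described in the abstract can be written schematically as a perturbation-minimization problem. The notation below (loss \ell, threshold r, discrepancy W_c) is shorthand inferred from the abstract, not the paper's exact formulation:

\[
\mathcal{S}(\theta; P, r) \;=\; \min_{Q} \; W_c(P, Q)
\quad \text{subject to} \quad \mathbb{E}_{Z \sim Q}\bigl[\ell(\theta; Z)\bigr] \;\ge\; r,
\]

where P is the nominal evaluation distribution, Q ranges over perturbed distributions, \ell(\theta; Z) is the model's loss on a sample Z, r is the specified level of performance deterioration, and W_c is an OT discrepancy whose ground cost acts on the (sample, density) space, so that both moving samples (data corruption) and reweighting them (sub-population shift) incur cost, with moment constraints restricting the admissible perturbations. A smaller value of \mathcal{S} indicates that less perturbation suffices to degrade performance, i.e. the model is less stable.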
