

Poster
in
Workshop: Safe Generative AI

Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy

Tsung-Huan Yang · Ko-Wei Huang · Yung-Hui Li · Lun-Wei Ku


Abstract:

Large Language Models (LLMs) often undergo fine-tuning to adapt to new tasks. However, recent studies have shown that such fine-tuning inadvertently compromises their safety alignment. This paper investigates the challenges of preserving safety during fine-tuning and provides guidelines to mitigate safety degradation. We systematically evaluate the safety degradation of 11 LLMs, revealing that certain LLMs consistently exhibit higher safety degradation across all datasets, which suggests that inherent model characteristics influence safety robustness. To address this issue, we explore two detoxification procedures with three different detoxification methods. Our analysis shows that the ensemble procedure significantly mitigates safety degradation, indicating a crucial relationship between toxicity and safety robustness. To elucidate the underlying mechanisms, we conduct a subspace similarity analysis, revealing that the consecutive training procedure exhibits higher similarity between the subspaces of the detoxification weights and the task-specific weights, which explains its ineffectiveness in mitigating safety degradation. This study provides critical insights into LLM safety preservation, highlighting the importance of separating safety-related parameters from task-specific parameters.
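As one illustration of the kind of subspace similarity analysis the abstract describes, the sketch below compares the top singular subspaces of two weight-update matrices (e.g., a detoxification update and a task-specific update). The metric, a normalized Frobenius norm of the projected singular bases as used in the LoRA literature, and all names are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def top_singular_subspace(W: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis for the top-k left singular subspace of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def subspace_similarity(W_detox: np.ndarray, W_task: np.ndarray, k: int) -> float:
    """Grassmann-style similarity between top-k subspaces of two weight updates,
    normalized to [0, 1]: 1 means the subspaces coincide, 0 means orthogonal."""
    U_d = top_singular_subspace(W_detox, k)
    U_t = top_singular_subspace(W_task, k)
    return float(np.linalg.norm(U_d.T @ U_t, ord="fro") ** 2) / k

# Toy example with random low-rank updates (stand-ins for fine-tuning deltas).
rng = np.random.default_rng(0)
d, r = 256, 8
W_detox = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
W_task = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
print(f"similarity: {subspace_similarity(W_detox, W_task, k=8):.3f}")
```

Under the paper's framing, a higher value of such a similarity score would indicate that detoxification and task-specific updates occupy overlapping directions, consistent with the reported ineffectiveness of the consecutive training procedure.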
