Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
Large Language Model Detoxification: Data and Metric Solutions
SungJoo Byun · HYOPIL SHIN
Keywords: [ Large Language Model (LLM) ] [ Metric ] [ Direct Preference Optimization ] [ Instruction Tuning ]
Caution: This paper may include material that could be offensive or distressing. There have been many studies on mitigating the toxicity of language models: Large Language Models (LLMs), trained on extensive text corpora, often acquire biases and toxic behavior during the pretraining phase. This paper demonstrates effective detoxification of LLMs in the alignment-tuning phase through instruction tuning and Direct Preference Optimization (DPO). We introduce comprehensive instruction and preference datasets specifically designed for detoxifying LLMs. In our experiments, the models consistently exhibited reduced toxicity, with the DPO-tuned version showing the greatest reduction, followed by the instruction-tuned version and then the base model. Additionally, we identify limitations of the existing prompting-based metric for assessing LLM toxicity and present a new metric that addresses this issue: the Contextual Toxicity Score (CTS), which considers the contextual factors of the prompt as well as the continuation generated by the LLM.
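The abstract does not spell out the training details, but the DPO stage it describes presumably optimizes the standard DPO objective over (non-toxic, toxic) preference pairs from the introduced dataset. Below is a minimal sketch of that objective in PyTorch; the function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the standard DPO loss applied to detoxification preference pairs,
# assuming per-sequence log-probabilities for the policy and a frozen reference model
# have already been computed. All names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Prefer the non-toxic ("chosen") continuation over the toxic ("rejected") one,
    regularized toward the reference model via the log-ratio terms."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy log-probabilities for a batch of two preference pairs.
pc = torch.tensor([-12.0, -15.0])   # policy    log p(non-toxic | prompt)
pr = torch.tensor([-10.0, -11.0])   # policy    log p(toxic     | prompt)
rc = torch.tensor([-13.0, -16.0])   # reference log p(non-toxic | prompt)
rr = torch.tensor([-10.5, -11.5])   # reference log p(toxic     | prompt)
print(dpo_loss(pc, pr, rc, rr).item())
```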
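The abstract also does not give the formula for the Contextual Toxicity Score. As a purely illustrative sketch of the general idea (not the paper's definition), a context-aware score could discount toxicity already present in the prompt so that only toxicity contributed by the model's continuation is counted; `toxicity_score` below is a hypothetical stand-in for any off-the-shelf toxicity classifier.

```python
# Illustrative only: one way a context-aware toxicity metric could account for the
# prompt as well as the continuation. This is NOT the paper's CTS formula.
def toxicity_score(text: str) -> float:
    """Hypothetical toxicity classifier returning a score in [0, 1]."""
    raise NotImplementedError("plug in a real toxicity classifier here")

def contextual_toxicity(prompt: str, continuation: str) -> float:
    """Toxicity of prompt + continuation, discounted by toxicity attributable
    to the prompt alone (clipped at zero)."""
    full = toxicity_score(prompt + continuation)
    prompt_only = toxicity_score(prompt)
    return max(0.0, full - prompt_only)
```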