Poster in Workshop: Safe Generative AI

Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding

Shaina Raza · Deval Pandya · Shardul Ghuge · Nifemi


Abstract:

Large Language Models (LLMs) have demonstrated remarkable capabilities in Natural Language Processing (NLP) tasks, but they often generate text that perpetuates societal biases and produces unsafe content. While existing approaches to mitigating these issues have shown some success, they frequently come at the cost of reduced knowledge retention and language understanding. This study investigates a method for producing safe, unbiased outputs from LLMs without compromising their core capabilities. To address this challenge, we trained already-safe LLMs on a specialized dataset containing examples of unsafe content paired with safer alternatives. Our results demonstrate that this approach enhances the model's ability to generate safe content while maintaining its language understanding capabilities. These findings have significant implications for the development of more responsible and ethical AI systems. To promote transparency and facilitate further research in this area, we have made our code and dataset publicly available on GitHub at https://github.com/llm-work/safe-llm.
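
The abstract describes the approach only at a high level: fine-tune an already-safe, instruction-tuned LLM on pairs of unsafe text and safer rewrites. Below is a minimal sketch of what such a fine-tuning step might look like using the Hugging Face transformers and datasets libraries. The model name, data file, field names, prompt template, and hyperparameters are illustrative assumptions, not the authors' configuration; their actual code and dataset are in the linked GitHub repository.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed "already-safe" instruction-tuned base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Assumed JSONL schema: {"unsafe": "<biased/unsafe text>", "safe": "<safer alternative>"}
pairs = load_dataset("json", data_files="safety_pairs.jsonl", split="train")

def tokenize_pair(example):
    # Frame each pair as a rewrite task: the model learns to map unsafe
    # content to its safer alternative while otherwise behaving as before.
    text = ("Rewrite the following text so it is safe and unbiased:\n"
            f"{example['unsafe']}\nSafe version: {example['safe']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = pairs.map(tokenize_pair, remove_columns=pairs.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="safe-llm-finetune",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # Causal-LM collator (mlm=False) derives labels from the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

After training, safety can be checked by prompting the fine-tuned model with held-out unsafe examples and comparing its outputs against the safer alternatives, while language understanding is verified on standard NLP benchmarks to confirm that core capabilities are retained.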
