

Poster in Workshop: Safe Generative AI

An Adversarial Behavior Model for Contextual Ethical Alignment in Large Language Models

Edward Chang


Abstract:

This research develops methodologies for Large Language Models (LLMs) to manage linguistic behaviors related to emotions and ethics. We introduce DIKE, a framework that enhances LLMs' ability to internalize and reflect human values, adapting to cultural contexts to promote transparency and trust. The methodology involves modeling emotions, classifying linguistic behaviors, and implementing ethical guardrails. Our approach includes mapping emotions and behaviors using self-supervised learning, refining guardrails through adversarial reviews, and adjusting outputs for ethical alignment. This framework establishes a foundation for AI systems that operate with ethical integrity and cultural sensitivity, paving the way for responsible AI interactions.
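
The adversarial-review step can be pictured as a generate-review-revise loop: a base model drafts a response, an adversarial reviewer tries to find guardrail violations, and the draft is revised until it passes. The following is a minimal Python sketch of that loop under stated assumptions; the callables llm and reviewer, the function names, and the plain-text guardrail list are all hypothetical and do not come from the abstract, so the paper's actual DIKE implementation may differ.

    # Minimal sketch of a generate-review-revise loop for ethical alignment.
    # All names here (llm, reviewer, Review, revise, ...) are hypothetical
    # illustrations, not the paper's actual DIKE implementation.

    from dataclasses import dataclass

    @dataclass
    class Review:
        passes: bool   # True if the candidate satisfies the guardrails
        critique: str  # adversarial reviewer's objections, if any

    def review(reviewer, candidate: str, guardrails: list[str]) -> Review:
        """Adversarial reviewer: search the candidate for guardrail violations."""
        critique = reviewer(
            f"Check this response against the rules {guardrails}:\n{candidate}"
        )
        # Convention assumed here: the reviewer answers "OK" when nothing is found.
        return Review(passes=critique.strip().lower() == "ok", critique=critique)

    def revise(llm, candidate: str, critique: str) -> str:
        """Rewrite the candidate so it addresses the reviewer's objections."""
        return llm(f"Revise this response to fix these issues: {critique}\n\n{candidate}")

    def ethically_aligned_reply(llm, reviewer, prompt: str,
                                guardrails: list[str], max_rounds: int = 3) -> str:
        """Generate, adversarially review, and revise until the guardrails pass."""
        candidate = llm(prompt)
        for _ in range(max_rounds):
            verdict = review(reviewer, candidate, guardrails)
            if verdict.passes:
                break
            candidate = revise(llm, candidate, verdict.critique)
        return candidate

In this sketch, both llm and reviewer are plain callables mapping a prompt string to a response string, so any two model endpoints (or the same model under different system prompts) could play the two roles; the bounded loop keeps the adjustment step from iterating indefinitely when the reviewer never signs off.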
