

Poster in Workshop: UniReps: Unifying Representations in Neural Models

Inducing Human-like Biases in Moral Reasoning Language Models

Austin Meek · Artem Karpov · Seong Cho · Raymond Koopmanschap · Lucy Farnik · Bogdan-Ionut Cirstea

Keywords: [ alignment ] [ BrainScore ] [ large language model ] [ theory of mind ] [ moral reasoning ] [ fMRI ] [ fine-tuning ]


Abstract:

In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore whether fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark Hendrycks et al. [2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.
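The sketch below illustrates one common way a BrainScore-style metric is computed: cross-validated linear predictivity between model activations and fMRI voxel responses to the same stimuli. This is a minimal illustration under assumed data shapes and a ridge-regression mapping, not the authors' exact pipeline; the function name and example arrays are hypothetical.

```python
# Minimal sketch of a BrainScore-style linear-predictivity metric.
# Assumptions (not from the paper): ridge mapping, 5-fold CV, per-voxel
# Pearson correlation averaged into a single score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def brainscore_linear_predictivity(activations: np.ndarray,
                                   fmri: np.ndarray,
                                   n_splits: int = 5,
                                   alpha: float = 1.0) -> float:
    """Cross-validated correlation between measured fMRI responses and their
    ridge-regression predictions from model activations.

    activations: (n_stimuli, n_features) hidden states, one row per stimulus.
    fmri:        (n_stimuli, n_voxels) voxel responses to the same stimuli.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in kf.split(activations):
        reg = Ridge(alpha=alpha).fit(activations[train_idx], fmri[train_idx])
        pred = reg.predict(activations[test_idx])
        true = fmri[test_idx]
        # Per-voxel Pearson correlation between predicted and measured responses
        pred_c = pred - pred.mean(axis=0)
        true_c = true - true.mean(axis=0)
        denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
        r = (pred_c * true_c).sum(axis=0) / np.where(denom == 0, 1, denom)
        fold_scores.append(np.nanmean(r))
    return float(np.mean(fold_scores))


if __name__ == "__main__":
    # Random data with illustrative shapes, e.g. 96 stimuli, 768-d activations,
    # 200 voxels in a region of interest.
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((96, 768))
    voxels = rng.standard_normal((96, 200))
    print(brainscore_linear_predictivity(acts, voxels))
```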
