NeurIPS Disclosing the Biases in Large Language Models via Reward Structured Questions

Poster in Workshop: Workshop on Machine Learning Safety

Disclosing the Biases in Large Language Models via Reward Structured Questions

Ezgi Korkmaz

Abstract:

The success of large language models has been amply demonstrated in recent years. Fine-tuning these models for a specific task yields high-performing models. However, these models also learn biased representations from the data on which they were trained. In particular, several recent studies showed that language models can learn to be biased towards certain genders. More recently, several studies tried to eliminate this bias by including human feedback in fine-tuning. In our study we show that changing the question asked to the language model dramatically changes the log probabilities of the bias measured in its responses. Furthermore, in several cases the language model ends up providing a completely opposite response. Recent language models fine-tuned on prior gender bias datasets do not resolve the underlying problem, but rather alleviate it only for the dataset on which the model is fine-tuned. We believe our results might lay the foundation for further study of alignment and safety problems in large language models.
