Poster in Workshop: Towards Safe & Trustworthy Agents
Lost in Translation: Jail Breaking Gemini and Revealing Biases in Large Language Models via Translation
Ezgi Korkmaz
The capabilities of large language models have been demonstrated only recently. Building on this success, language agents have become one of the main focuses of machine learning research, which aims to construct a general artificial intelligence agent that can interact with humans daily to perform certain tasks. Publicly available versions of these models are already accessible to many users. Previous studies demonstrated that early versions of large language models can learn biased representations. To address this, benchmarks have been proposed to measure the biased representations learned. Following this, on the training side, many studies focused on eliminating this bias, either by including human feedback in fine-tuning or by training additional models to provide a reward function that penalizes biased representations. Large language models now perform better on benchmarks designed to reveal learned biased representations. Yet our paper demonstrates that the reason is not that models no longer learn biased representations, but rather that they have acquired knowledge of how to respond to benchmarks that measure biases. Our results demonstrate that one of the most recent publicly available large language models learns biased representations that can be surfaced simply by leveraging translation, whereas previous methods fail to surface them. We believe our results can provide a foundation for addressing the concrete problems of large language models regarding their safety and robustness.