Poster
in
Workshop: Workshop on Machine Learning Safety
Reflection Mechanisms as an Alignment Target: A Survey
Marius Hobbhahn · Eric Landgrebe · Elizabeth Barnes
We used Positly to survey roughly 1,000 US-based workers about their attitudes on moral questions, the conditions under which they would change their moral beliefs, and their approval of different mechanisms by which society could resolve moral disagreements. Unsurprisingly, our sample strongly disagreed on contentious object-level moral questions such as whether abortion is immoral. Moreover, a substantial fraction of respondents reported that these beliefs would not change even if they came to different conclusions about factors we view as morally relevant, such as whether the fetus is conscious in the case of abortion. However, respondents were generally favorable to the idea of society deciding policies through some means of reflection, such as democracy, a debate between well-intentioned experts, or thinking for a long time. This favorability increases further in a hypothetical well-intentioned future society. Surprisingly, favorability remained even when we stipulated that the reflection procedure reached the opposite of the respondent's view on polarizing topics like abortion. This provides evidence that people may support aligning AIs to a reflection procedure rather than to individual beliefs. We tested our findings with a second, adversarial survey designed to disprove the results of the first study. We find that our core results are robust in standard settings but weaken when the questions are constructed adversarially (e.g., when decisions are made by people who hold the opposite of the respondent's moral or political beliefs).