Spotlight Poster
ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi (Alex) Qiu · Yang Zhang · Xuchuan Huang · Jasmine Li · Jiaming Ji · Yaodong Yang
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blind spots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework that allows algorithms to learn the mechanics of moral progress from history, in order to facilitate future progress in real-world moral decisions. Leveraging nine centuries of historical text and 18 historical LLMs, the ProgressGym framework enables the codification of real-world progress alignment challenges into concrete benchmarks. We demonstrate the failures of existing alignment methods on three key challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). In response, we present lifelong and extrapolative algorithms as initial methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges.
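To build intuition for why statically aligned models fail a challenge like PG-Follow, the toy sketch below contrasts a model aligned once with one that updates as values drift across eras. This is a minimal illustration under invented assumptions, not the ProgressGym API: the names (drift, score, lifelong_update) and the value-vector abstraction are hypothetical, whereas the actual framework evaluates LLMs against historical text.

```python
# Illustrative sketch only: a toy analogue of the "tracking evolving values"
# setting (PG-Follow). All names and the value-vector abstraction here are
# hypothetical; the real benchmarks operate on historical LLMs and text.

import random

random.seed(0)
DIM = 8      # dimensionality of the toy "value vector" for each era
N_ERAS = 9   # stand-in for the nine centuries of historical data


def drift(values, step=0.1):
    """Produce the next era's values via a small random shift."""
    return [v + random.uniform(-step, step) for v in values]


def score(examinee, target):
    """Negative squared distance: higher means closer value alignment."""
    return -sum((e - t) ** 2 for e, t in zip(examinee, target))


def lifelong_update(examinee, target, lr=0.5):
    """A minimal lifelong-learning rule: move toward the latest observed values."""
    return [e + lr * (t - e) for e, t in zip(examinee, target)]


# Build a sequence of era-specific value targets (a drifting value trajectory).
eras = [[random.uniform(-1, 1) for _ in range(DIM)]]
for _ in range(N_ERAS - 1):
    eras.append(drift(eras[-1]))

# Compare a static examinee (aligned once, never updated) with a lifelong one.
static = list(eras[0])
lifelong = list(eras[0])
for t, target in enumerate(eras):
    print(f"era {t}: static={score(static, target):7.3f} "
          f"lifelong={score(lifelong, target):7.3f}")
    lifelong = lifelong_update(lifelong, target)  # update after observing era t
```

Running the sketch shows the static examinee's score degrading as eras pass while the lifelong examinee stays close to the current target, mirroring the gap between conventional alignment methods and the lifelong baselines evaluated in the paper.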