Off The Rails: Procedural Dilemma Generation for Moral Reasoning

Abstract

As AI systems like language models are increasingly integrated into decisions that affect people, it is critical to ensure that these systems have sound moral reasoning. To test whether they do, we need to develop systematic evaluations. Recent work introduced a method for procedurally generating LLM evaluations from abstract causal templates and tested it in the context of social reasoning (i.e., theory of mind). In this paper, we extend this method to the domain of moral dilemmas. We develop a framework that translates causal graphs into a prompt template, which can then be used to procedurally generate a large and diverse set of moral dilemmas with a language model. Using this framework, we created the OffTheRails dataset, which consists of 50 scenarios and 500 unique test items. Two independent human experts evaluated the quality of our model-written test items and found that 90% met the desired structure. We collected moral permissibility and intention judgments from 100 human crowdworkers and compared these judgments with those from GPT-4 and Claude-2 across eight control conditions. Both humans and GPT-4 assigned higher intentionality to agents when a harmful outcome was evitable and served as a necessary means. However, our results on permissibility judgments did not match previous findings. This difference may stem from not controlling for the severity of harmful outcomes during scenario generation. We conclude by discussing future extensions of our benchmark to address this limitation.
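To make the generation pipeline concrete, here is a minimal sketch of how one abstract causal template could be crossed with binary condition variables and rendered as prompts for a generator model. The variable names, dataclass fields, and prompt wording are hypothetical illustrations chosen for this sketch, not the paper's actual schema or prompts; the 2×2×2 crossing is meant only to show how eight control conditions per scenario could arise.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical causal template: fields are illustrative,
# not the paper's exact schema.
@dataclass
class CausalTemplate:
    agent: str
    action: str
    harm: str
    goal: str

# Assumed binary condition variables; crossing them yields
# 2 x 2 x 2 = 8 conditions per scenario.
CONDITIONS = {
    "causal_structure": ["means", "side effect"],
    "evitability": ["evitable", "inevitable"],
    "action_type": ["action", "omission"],
}

# Illustrative prompt wording, not the paper's actual prompt.
PROMPT_TEMPLATE = (
    "Write a short moral dilemma. {agent} can {action} to achieve {goal}. "
    "The harm ({harm}) occurs as a {causal_structure}, is {evitability}, "
    "and is brought about by an {action_type}. "
    "End with the question: was it morally permissible?"
)

def generate_prompts(template: CausalTemplate) -> list[str]:
    """Expand one abstract template into one prompt per condition cell."""
    prompts = []
    for cell in product(*CONDITIONS.values()):
        settings = dict(zip(CONDITIONS.keys(), cell))
        prompts.append(PROMPT_TEMPLATE.format(**vars(template), **settings))
    return prompts

if __name__ == "__main__":
    trolley = CausalTemplate(
        agent="a train conductor",
        action="divert the train onto a side track",
        harm="a worker on the side track is injured",
        goal="saving five passengers",
    )
    for prompt in generate_prompts(trolley):
        print(prompt, end="\n\n")  # each prompt would be sent to an LLM
```

In this sketch, each of the eight prompts per scenario would be passed to a language model to produce a concrete dilemma, so 50 scenarios could in principle yield hundreds of distinct test items while the underlying causal structure stays controlled.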

Publication
Fränken J., Khawaja A., Gandhi K., Moore J., Goodman N. D., Gerstenberg T. (2023). Off The Rails: Procedural Dilemma Generation for Moral Reasoning. In AI Meets Moral Philosophy and Moral Psychology Workshop (NeurIPS 2023).