How do language models assign responsibility and reward, and is it similar to how humans do it? We instructed three state-of-the-art large language models to assign responsibility (Experiment 1) and reward (Experiment 2) to agents in a collaborative task. We then compared the language models’ responses to seven existing cognitive models of responsibility and reward allocation. We found that language models mostly evaluated agents based on force (how much they actually did), in line with classical production-style accounts of causation. By contrast, humans valued both actual and counterfactual effort (how much agents tried or could have tried). These results point to a potential barrier to effective human-machine collaboration.