Actual or counterfactual? Asymmetric responsibility attributions in language models

Abstract

We investigate how language models assign responsibility to collaborators. We instruct ten large language models from three companies to assign responsibility to agents in a collaborative task, and compare their responses to seven existing cognitive models of responsibility attribution. We find that, while humans use both actual and counterfactual effort to assign responsibility to collaborators, LLMs rely primarily on force, and this divergence appears asymmetrically: it emerges when evaluating collaboration failures rather than successes. Our results highlight the similarities and differences between LLMs and humans in responsibility attribution and demonstrate the promise of interpreting LLM behavior through cognitive theories.

Publication
Bigelow, E., Xiang, Y., Gerstenberg, T., Ullman*, T., Gershman*, S. J. (2025). Actual or counterfactual? Asymmetric responsibility attributions in language models. NeurIPS Workshop ‘CogInterp: Interpreting Cognition in Deep Learning Models’, 2025.