Humans excel at visual social inference: inferring hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability underlies everyday social reasoning and is critical for building more human-like AI agents. We introduce Spot the Ball, a challenging benchmark for evaluating visual social inference in vision–language models (VLMs) using sports as a test domain. The task is to localize a ball that has been removed from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines, along with a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) under three prompting strategies and find that humans are consistently two to three times more accurate (20–34%) than models (≤ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or close to players, whereas humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human–model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
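As a rough illustration of the kind of evaluation described above, the minimal sketch below scores a VLM on localizing the removed ball via grid-cell predictions. The `query_vlm` callable, the 10x10 grid, the JSON answer format, and the prompt wording are illustrative assumptions, not the paper's exact protocol or metric.

```python
# Sketch of a Spot-the-Ball style evaluation loop (assumed grid-cell metric).
import base64
import json
from dataclasses import dataclass


@dataclass
class Item:
    image_path: str        # image with the ball removed
    true_cell: tuple       # (row, col) of the removed ball on an N x N grid


def image_to_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_prompt(grid_size: int) -> str:
    # Zero-shot prompt; the paper compares several prompting strategies.
    return (
        f"The ball has been removed from this sports image. "
        f"Divide the image into a {grid_size}x{grid_size} grid "
        f"(rows and columns numbered from 0, top-left origin) and answer "
        f'with JSON like {{"row": r, "col": c}} for the most likely cell.'
    )


def evaluate(items, query_vlm, grid_size: int = 10) -> float:
    """query_vlm(prompt, image_b64) -> str is a stand-in for any VLM API call."""
    correct = 0
    for item in items:
        reply = query_vlm(build_prompt(grid_size), image_to_base64(item.image_path))
        try:
            pred = json.loads(reply)
            if (pred.get("row"), pred.get("col")) == item.true_cell:
                correct += 1
        except (json.JSONDecodeError, AttributeError):
            pass  # unparseable replies count as misses
    return correct / len(items) if items else 0.0
```

Human accuracy can be computed with the same cell-matching rule over annotator guesses, which keeps the comparison between humans and models on an identical footing.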