Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
Author notes
Action Editors: Marianna Apidianaki, Abdellah Fourtassi, and Sebastian Padó