Abstract
Large Language Models (LLMs) and humans acquire knowledge about language without direct supervision. LLMs do so by means of specific training objectives, while humans rely on sensory experience and social interaction. This parallelism has created a feeling in NLP and cognitive science that a systematic understanding of how LLMs acquire and use the encoded knowledge could provide useful insights for studying human cognition. Conversely, methods and findings from the field of cognitive science have occasionally inspired language model development. Yet, the differences in the way that language is processed by machines and humans—in terms of learning mechanisms, amounts of data used, grounding and access to different modalities—make a direct translation of insights challenging. The aim of this special issue has been to create a forum of exchange and debate along this line of research, inviting contributions that further elucidate similarities and differences between humans and LLMs.
1 Introduction
Large Language Models (LLMs) have come to dominate the field of computational linguistics. One reason is their ability to acquire rich information regarding linguistic structure and world knowledge (Tenney et al. 2019; Hewitt and Manning 2019; Petroni et al. 2019; Mahowald et al. 2024; Chang and Bergen 2024). A rather surprising aspect of LLMs is that they demonstrate this ability despite typically using very simple training objectives, learning how to sensibly continue (or fill gaps in) text without the need for explicit supervision (Bengio, Ducharme, and Vincent 2000; Goldberg 2017; Devlin et al. 2019). In this broad sense, LLMs appear to work analogously to how humans develop most of their knowledge about language structure, meaning, and use—that is, spontaneously and without direct supervision.
Since the introduction of LLMs, there has been a feeling among some communities in both NLP and cognitive science that a systematic understanding of how these models work and how they use the knowledge they encode could help to shed light on the way humans acquire, represent, and process this same knowledge (Dupoux 2018; Cichy and Kaiser 2019; Caucheteux and King 2022; Goldstein et al. 2022). Conversely, findings from the areas of psycholinguistics and cognitive science have already inspired language model development, since these models are expected to exhibit human-like behavior in language use. For instance, datasets developed in—or inspired by—the field of cognitive science, including eye tracking and brain imaging datasets, have been used to evaluate models’ behavior and to provide additional training signal (Ettinger 2020; Dunbar et al. 2017; Binz and Schulz 2023; Bingel, Barrett, and Søgaard 2016; Hollenstein et al. 2021).
Yet, there are also unmistakable differences between machines and humans, which call the direct translation of insights into question. Chief among them is the difference in learning mechanisms, as a consequence of which the amount of data required to train LLMs exceeds, by orders of magnitude, what humans need to acquire sophisticated conceptual structures and meanings (Frank 2023; Warstadt et al. 2023). Furthermore, human language is inherently grounded and multi-modal. In particular, children acquire world knowledge not only via exposure to language, but also via sensory experience and social interaction (Clark 2003; Tomasello 2009; Vigliocco, Perniss, and Vinson 2014). While some LLMs do learn from modalities other than text or speech, the degree to which this learning mirrors that of humans is far from obvious.
The aim of this special issue is to consolidate this exciting line of research, inviting contributions that further elucidate similarities and differences in the study of humans and LLMs, broadening the research scope to a range of linguistic levels and methodologies. The main questions we encouraged researchers to engage with are whether and how methods used in psycholinguistics (and cognitive science more generally) for studying the mechanisms of language processing and acquisition can be applied to the study of LLMs; and, conversely, whether the study of linguistic phenomena using LLMs, the investigation of the conceptual and world knowledge they encode, and the learning and processing principles they employ, can provide useful insights for studying human cognition.
We received a large number of interesting and engaging submissions, from which ten papers were accepted after two rounds of reviewing. In this preface, we provide an overview of these articles and discuss their contributions in terms of the most important lessons and challenges going forward.
2 Overview of the Articles in this Special Issue
The papers in this special issue can be grouped in terms of the topic addressed.
2.1 Meaning and Pragmatics
Ohmer, Bruni, and Hupkes (2024) explore LLMs’ language understanding abilities. They prompt a model (GPT-3.5) with linguistic expressions that have the same underlying meaning and evaluate its consistency across different tasks. These might be expressions with the same real-world referent (such as “morning star” and “evening star” for Venus), paraphrases, or translations (“the sum of two and two”, “two plus two”, “zwei plus zwei”). The proposed experiments involve tasks of increasing complexity, from basic truth-conditional statements to more complex tasks, such as paraphrase detection and natural language inference (NLI). High consistency would suggest that the LLM might be linking the expressions to their common underlying meaning. The results and follow-up analyses demonstrate that the model’s meaning representations are strongly tied to form, and its understanding is still quite far from being consistent and human-like. The authors provide an interesting discussion of the consequences of these findings for the role of LLMs as explanatory models of semantic understanding in humans.
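To give a concrete sense of this kind of consistency evaluation (our own minimal sketch, not the authors’ code), one can pose the same truth-conditional question for every expression in an equivalence class and measure how often the answers agree; the query_model function below is a hypothetical stand-in for whichever chat-style LLM API one uses.

```python
from itertools import combinations

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a chat-style LLM (e.g., GPT-3.5)."""
    raise NotImplementedError("plug in your preferred LLM client here")

# Expressions assumed to share an underlying meaning (equivalence classes).
equivalence_classes = {
    "venus": ["the morning star", "the evening star"],
    "four": ["the sum of two and two", "two plus two", "zwei plus zwei"],
}

def consistency(expressions, question_template):
    """Fraction of expression pairs that receive identical answers."""
    answers = {e: query_model(question_template.format(e)).strip().lower()
               for e in expressions}
    pairs = list(combinations(expressions, 2))
    agreeing = sum(answers[a] == answers[b] for a, b in pairs)
    return agreeing / len(pairs)

# Example truth-conditional probe (hypothetical wording):
template = "Is {} a planet? Answer 'yes' or 'no'."
# score = consistency(equivalence_classes["venus"], template)
```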
Allaway et al. (2024) investigate how LLMs reason about generalizations using generics, a particular type of statement that is fundamental to human reasoning but challenging to analyze semantically. Generics express generalizations (e.g., birds can fly) without explicit quantification; notably, they generalize over their instantiations (sparrows can fly) yet hold true even in the presence of exceptions (penguins do not). The authors use a framework grounded in pragmatics to automatically generate a large-scale dataset of generics, including both instantiations and exceptions. With this dataset, they probe whether LLMs exhibit similar behavior to humans in terms of quantification and property inheritance. Like humans, LLMs show evidence of overgeneralization, but they sometimes struggle to reason about exceptions; they also share humans’ non-logical behavior when reasoning about quantification and property inheritance.
de Varda et al. (2024) address the question of pseudoword meaning interpretation. Pseudowords are letter strings that are consistent with the orthotactic rules of a language but do not appear in its lexicon, and they are traditionally considered to be meaningless (e.g., “knackets” or “spechy”). Previous studies that demonstrated humans’ ability to make sense of pseudowords were limited by their focus on specific features (e.g., the emotional values of words). de Varda et al. instead analyze speakers’ free definitions for pseudowords. They also show that LLMs compute embeddings for pseudowords that resemble the definitions given by study participants. This study confirms previous findings that pseudowords carry semantic content, and it points to a flexible form-to-meaning mapping that speakers can exploit when they encounter novel lexical entries.
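For readers who want a concrete picture of the modeling side, the sketch below (our illustration, not the authors’ pipeline) extracts an embedding for a pseudoword from an off-the-shelf masked language model and compares it with the embedding of a hypothetical human definition via cosine similarity.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden layer as a crude phrase embedding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

pseudoword = "a knacket"                       # pseudoword in a minimal context
definition = "a small metal tool or fastener"  # hypothetical human definition

similarity = torch.cosine_similarity(embed(pseudoword), embed(definition), dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```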
2.2 Syntax and Grammar Induction
Lampinen (2024) highlights challenges in the comparison of LLMs and humans, taking the processing of recursively nested grammatical structures as a case study. While previous work found that language models cannot compete with humans, Lampinen argues that these studies disadvantaged language models by providing them with less task-related information than human participants. The study shows that simple prompting yields performance comparable—or even superior—to human results. The paper demonstrates the importance of methodological care and the difficulty in establishing comparability between humans and language models.
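To make the methodological point concrete, a few-shot prompt for judgments on nested structures might look roughly like the sketch below; the sentences, labels, and format are our own illustrative choices, not Lampinen’s materials.

```python
# Hypothetical few-shot prompt for grammaticality judgments on
# center-embedded sentences; the examples are illustrative only.
examples = [
    ("The dog the cat chased barked.", "grammatical"),
    ("The dog the cat chased the rat.", "ungrammatical"),
]
test_item = "The report the senator the journalist interviewed wrote leaked."

prompt = "Decide whether each sentence is grammatical.\n\n"
for sentence, label in examples:
    prompt += f"Sentence: {sentence}\nJudgment: {label}\n\n"
prompt += f"Sentence: {test_item}\nJudgment:"

print(prompt)  # send this prompt to the language model of your choice
```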
Jon-And and Michaud (2024) focus on the mechanisms of grammar induction and, more specifically, on simple cognitive principles that would support this learning in humans, such as sequence memory. Their model uses Reinforcement Learning to identify sentences in a stream of words from which cues to sentence boundaries (such as punctuation and capitalization) have been removed, and it reuses the resulting chunks in the learning process. They test the model on artificial languages that instantiate grammars of varying complexity, showing that it succeeds in inducing parsimonious tree structures. Their study showcases how simple cognitive mechanisms like sequence memory and chunking can be effective in grammar induction.
2.3 Situational Grounding
Jones, Bergen, and Trott (2024) address the symbol grounding problem in language models and explore whether Multimodal LLMs (MLLMs) provide a plausible solution to this challenge. They investigate the degree to which MLLMs integrate modalities and if the way they do so mirrors the mechanisms believed to underpin grounding in humans, especially embodied simulation. Across a series of experiments, they ask whether MLLMs are, like humans, sensitive to sensorimotor features that are implied, but not explicit, in descriptions of an event. They find similarities and differences with human behavior, revealing strengths and limitations in the ability of current MLLMs to integrate language with other modalities.
Beuls and Van Eecke (2024) model the way language could emerge in a situated communicative context, similar to how humans acquire their native language. The authors conduct experiments where an artificial agent learns linguistic constructions, relying on situation-based intention reading and syntactico-semantic pattern identification. They argue that a situated and communicative learning context is essential to modeling human-like language acquisition. This goal contrasts with typical learning in LLMs where the input is predominantly text-based, and where the distribution of words serves as a basis for modeling meaning.
2.4 Phonology
Georges et al. (2024) investigate the development of neural models for a challenging task in language acquisition: learning how to form sounds with one’s vocal tract (i.e., how to perform discrete articulatory gestures) from raw acoustic data, a severely underspecified problem. The article presents a series of modeling ideas, including the use of physiologically motivated inductive biases to regularize the learning problem. A series of careful evaluations demonstrates partial success—the model learns interpretable gestures that lead to comprehensible speech—but the model still struggles with broader ecological validity (e.g., generalization across speakers). In this sense, the article presents a case study of the perspectives and pitfalls in modeling complex mechanisms of language acquisition.
Pouw et al. (2024) elucidate the extent to which neural models of Automatic Speech Recognition (ASR) detect phonological changes. More specifically, they consider the case of assimilation, where sounds change according to their context. Psycholinguistic studies have shown that human speakers can relate the underlying phonological form of a word to its phonetic realization, which raises the questions of where the representations learned by ASR models stand in this regard and what contextual cues they attend to. The authors carry out innovative intervention experiments on the Wav2Vec2 model, establishing strong evidence that the model focuses on local phonological context and that phonological “normalization” takes place in its final layers. The ASR model shows a substantial degree of match to human behavior, but it is still limited in terms of the time course and the range of phenomena it can account for.
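As a rough illustration of how such layer-wise analyses can be set up (not the authors’ actual intervention code), the following sketch extracts per-layer hidden states from a pretrained Wav2Vec2 model; probes or interventions targeting phonological information would then operate on these representations.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# waveform: 1 second of 16 kHz audio (random noise here as a placeholder
# for a real recording containing, e.g., an assimilated word form)
waveform = torch.randn(16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors of shape (1, frames, dim);
# a probe for phonological identity could be trained on each layer separately.
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")
```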
2.5 Imaging for Discourse Processing
Ling, Murphy, and Fyshe (2024) address the question of comparing language processing in LMs and the human brain. They conduct a decoding analysis using the Multi-timescale LSTM (MT-LSTM), a model whose parameters are temporally tuned to induce sensitivity to different timescales of language processing (i.e., to nearby vs. distant words). Such a model is particularly suited to brain data with high temporal resolution, such as electroencephalography (EEG). They study the extent to which EEG signals predict MT-LSTM embeddings at various timescales. This innovative study, combining the MT-LSTM with EEG data, complements previous research that has primarily relied on fMRI to study the representation of linguistic timescales in the brain.
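In spirit, such a decoding analysis amounts to regressing model embeddings on EEG features and asking how well held-out embeddings can be predicted; the sketch below uses cross-validated ridge regression on synthetic stand-in data and differs in many details from the actual study.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: one row per word/epoch.
n_samples, n_eeg_features, n_embedding_dims = 500, 64, 100
eeg = rng.standard_normal((n_samples, n_eeg_features))
embeddings = rng.standard_normal((n_samples, n_embedding_dims))

# Decode each embedding dimension from EEG; report mean cross-validated R^2.
scores = []
for dim in range(n_embedding_dims):
    r2 = cross_val_score(Ridge(alpha=1.0), eeg, embeddings[:, dim],
                         cv=5, scoring="r2").mean()
    scores.append(r2)

print(f"mean decoding R^2 across dimensions: {np.mean(scores):.3f}")
```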
Together, the papers cover various levels of linguistic knowledge, including phonology, syntax, semantics, pragmatics, and discourse.
A few papers are concerned with the mechanisms of language learning, particularly in syntax (Jon-And and Michaud 2024; Beuls and Van Eecke 2024) and phonology (Georges et al. 2024). These papers generally highlight mechanisms that are not typically part of the LLM training pipeline but are crucial to children’s linguistic development, such as learning from a situated, communicative context (Beuls and Van Eecke 2024), using simple cognitive principles like chunking and sequence memory (Jon-And and Michaud 2024), and leveraging the motor system (Georges et al. 2024). This highlights once more the role of extratextual signals in human language learning, signals that many current LLMs do not have access to.
Most papers, however, focus on mechanisms of processing (Pouw et al. 2024; Lampinen 2024; de Varda et al. 2024; Ohmer, Bruni, and Hupkes 2024; Jones, Bergen, and Trott 2024; Ling, Murphy, and Fyshe 2024; Allaway et al. 2024). These papers generally compare LLMs to human performance on the same or similar tasks and obtain mixed results. Some papers find that LLMs perform similarly to humans. In particular, Chinchilla is able to process recursively nested grammatical structures (Lampinen 2024). Wav2Vec2 can recognize the underlying phonological form in challenging contexts (e.g., place assimilation [Pouw et al. 2024]). Finally, LLMs like GPT-2 and RoBERTa can generate human-like meaning definitions for novel words (de Varda et al. 2024). However, other papers find LLMs lacking when compared to human behavior or brain data. For instance, multimodal LLMs (like CLIP) are not adequately sensitive to sensorimotor features implicit in language (Jones, Bergen, and Trott 2024). GPT-3.5 struggles to represent meaning consistently across different linguistic forms (Ohmer, Bruni, and Hupkes 2024). Some LLMs (like GPT-4 and LLAMA-2) conflate universally quantified statements with generics, though humans also exhibit similar non-logical behavior (Allaway et al. 2024). Finally, the embeddings of the Multi-timescale LSTM can be decoded from EEG data at short timescales, but less well at medium or long ones (Ling, Murphy, and Fyshe 2024).
3 Challenges and Directions for Future Research
While the studies reviewed above significantly advance our understanding of the research questions posed in this special issue, they represent a small sample of the studies needed to address the complex relationship between language learning and processing in humans and LLMs. The different fields involved in this research (in our case, cognitive science and computational linguistics) often operate under different principled and pragmatic assumptions about language, allowing them to focus on what they deem most relevant. However, this diversity also creates difficulties in transferring insights across disciplines and in building a cumulative body of research. In the following, we highlight some recurring challenges and perspectives that can inform future work in this direction.
One important challenge arises from the fact that LLMs are computational artifacts. Psycholinguistics fundamentally assumes that its participants can be drawn randomly from a population of speakers that is relatively stable (at least over short time spans). In contrast, new LLMs are proposed every few months, and there is no guarantee that analyses of LLMs’ cognitive capabilities carry over between model families. Furthermore, the choice of models is not random: A researcher has to grapple with the dilemma of choosing between open-source models, which provide a more reliable environment for reproducible research, and commercial models, which typically offer higher performance at the expense of transparency. The studies in this special issue use a diversity of models and evaluation methods, with a pattern emerging whereby researchers report results with smaller open-source models when these suffice to exhibit the target knowledge or mechanisms, and with larger commercial models when the target knowledge is more challenging. While this diversity is vital in an early, exploratory phase of this research program, a more systematic approach appears essential for a more cumulative phase.
An important distinction made in psycholinguistics is between the intrinsic difficulty of a task and auxiliary task demands, such as memory load or the need to follow complex instructions. Participants can perform poorly on a task if they cannot meet these auxiliary demands, irrespective of their ability to carry out the core task. The same is true for LLMs: If they fare poorly on a language processing task, it is not always easy to determine whether the failure is due to inadequate linguistic knowledge or to other demands made by the task—for example, on working memory (Hu and Frank 2024). Thus, it is crucial to examine the model’s understanding of a task and its ability to perform it with the available resources before drawing conclusions about the model’s knowledge. As a case in point, Lampinen (2024) observes that LLMs can be disadvantaged when given less task-related information than human participants, even when they have access to more (training) data.
Another important parameter that needs to be accounted for in studies on language learning and processing in humans and LLMs is the ecological validity of the considered cognitive mechanisms. Many studies on language development focus on cognitively plausible mechanisms, but their implementation typically does not scale to natural language or relies on drastic simplifications to do so. This makes it hard to evaluate the ecological validity of these mechanisms and their practical advantages, especially compared to LLMs. In the case of some cognitively inspired mechanisms, a promising strategy is to integrate them as inductive biases within LLMs, thereby merging the insights of cognitive science with the scalability of LLMs. This is still a largely unexplored area of research, although promising results were reported using transfer learning and meta-learning techniques (Papadimitriou and Jurafsky 2023; McCoy and Griffiths 2023; Lake and Baroni 2023).
Notably, the scientific approach adopted in cognitive science and LLM research has, until now, been different. In psycholinguistics, as in the cognitive sciences more generally, scientific progress involves the interplay of theory building and empirical investigation (Haig 2014), as reflected in the two main types of articles published in this field. In contrast, the study of LLMs has (at least currently) inherited the focus on empirical work from NLP, with a corresponding lack of theory building. This is not surprising, as a “cognitive theory of LLMs” has yet to be clearly articulated.
In particular, not all research on LLMs that uses cognitive measures or data is necessarily interested in developing models that are cognitively plausible. Many researchers simply believe in the usefulness of cognitive assessments as part of a general-purpose evaluation benchmark. In addition, it is unclear what cognitive status, if any, one should assign to the growing body of work on prompt optimization. Fundamentally, there is a tension between building models that are as human-like as possible, including by incorporating human constraints (e.g., cognitive architectures [Newell 1990]), and building models that are optimized to perform as well as possible on practical tasks, which often requires removing human constraints such as limitations on working memory. We believe that studies in the LLM area should position themselves with respect to this distinction, which is currently often not the case.
A crucial challenge is the need for experiments that compare human and machine language learning under more similar conditions, for example, in terms of data size and access to different modalities. Prompts must also be carefully designed so as not to disadvantage one side of the comparison. In terms of evaluation, there is a need for benchmarks that address various aspects of language learning. Existing resources used for evaluation (e.g., McRae et al. 2005; Devereux et al. 2014) contain features of a very different nature from the implicit perceptual features they are supposed to stand for (Bruni, Tran, and Baroni 2014). There is also a need for benchmarks that address models’ grounding capabilities and how well they capture empirical phenomena associated with situational word learning (Ebert and Pavlick 2020; Vong and Lake 2022; Jiang et al. 2023).
Regarding the communication between the areas of cognitive science and LLM research, findings and insights from one are already being exploited in the other. For example, standard tasks in psycholinguistics allow us to examine how LLMs fare in comparison to humans. However, when a model’s behavior diverges from that of humans, we often lack insight into how the model could be improved or better aligned with human knowledge. Commonly used probing techniques, which typically point to correlations between model representations and linguistic properties, do not always help in this respect (Feder et al. 2021; Lyu, Apidianaki, and Callison-Burch 2024). This challenge can potentially be mitigated by methods that identify causal relationships between model structures and outputs, such as ablation studies and counterfactual methods.
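As a minimal example of the kind of causal intervention meant here (a generic PyTorch technique, not tied to any specific paper in this issue), one can ablate part of a component at inference time with a forward hook and measure how much the output changes.

```python
import torch
import torch.nn as nn

# Toy model standing in for a single network sublayer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

def ablate_units(module, inputs, output):
    """Zero out the first half of the hidden units (a simple ablation)."""
    output = output.clone()
    output[..., :16] = 0.0
    return output

x = torch.randn(1, 16)
baseline = model(x)

handle = model[1].register_forward_hook(ablate_units)  # hook the ReLU output
ablated = model(x)
handle.remove()

# The behavioral effect of the ablation is the change in the output.
print("output change (L2 norm):", torch.norm(baseline - ablated).item())
```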
Conversely, causal intervention in LLMs could provide insights into human language learning. High-performing LLMs are typically hard to (re-)train with academic computing resources, which limits the type of investigation that can be pursued. For example, the papers in this special issue typically test knowledge in pre-trained models, in the same way that we test knowledge in humans. While this approach can provide a wealth of insights, as it has done with humans in the field of experimental psychology, the ability to re-train LLMs under different diagnostic conditions (at the architecture or input level) would be a game-changer by providing the opportunity for causal interventions. Such interventions could provide insights into the mechanisms of knowledge emergence in LLMs in a way that could not be obtained in humans. In fact, this approach—if made logistically feasible (see, for example, the effort in Warstadt et al. 2023)—could also provide insights into the emergence of knowledge in humans.