Abstract
Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about large language model capabilities, thus providing a resource for applied work and for research in adjacent fields that use language models.
1 Introduction
Transformer language models have revolutionized the field of natural language processing (NLP) since their introduction in 2018 (Radford et al. 2018; Devlin et al. 2019). Recent research and public attention have demonstrated that large language models (e.g., GPT-3/4, PaLM, and OPT; Brown et al. 2020; Chowdhery et al. 2022; Zhang et al. 2022b; OpenAI 2023a) can achieve remarkable performance both on standard NLP benchmarks and on open-ended natural language generation tasks from the general public (Wang et al. 2019; Johnson 2022). Already, language models are used in industry for applications ranging from Web search and chatbots to medical and financial document analysis (Nayak 2019; Broyde and Palmer 2021; Thewsey 2021; Lee 2023). Due to their widespread applicability, language models have been called “foundation models” for NLP (Bommasani et al. 2021).
Language models are trained to predict masked (i.e., hidden) or upcoming words from context, usually text. The models can then be fine-tuned for specific downstream tasks (e.g., text classification; Devlin et al. 2019), or they can be used directly for any text prediction task. As language model capabilities have expanded in recent years, they have increasingly been used in the text generation scenario with minimal or no fine-tuning (Brown et al. 2020). This approach requires no task-specific data or further training infrastructure, thus expanding the range of possibilities and audience for language model applications. In particular, the release of public APIs and interfaces such as GPT-3 and ChatGPT (Brown et al. 2020; OpenAI 2022) have enabled widespread public experimentation on the text generation capabilities of language models.
Yet, text generated by language models is often surprising even to NLP researchers. Previous studies have investigated both the outputs and internal mechanisms of language models, originally focusing on masked (i.e., fill-in-the-blank) “BERT” models and establishing the field of “BERTology” (see Rogers, Kovaleva, and Rumshisky 2020 for a survey). In the years since the last BERTology survey in 2020, and in tandem with the rise of large autoregressive models such as GPT-3 (i.e., models that predict upcoming words instead of masked words), language model analysis has shifted its focus to these larger models. Because these models are often used without fine-tuning for open-ended text generation, a growing number of behavioral studies have evaluated the output text probabilities of language models.
Despite this flurry of research, language model text generation behavior remains unpredictable. Although model performance on broad benchmark datasets is relatively consistent for a given model size and architecture, responses to specific inputs and examples are not. This unpredictability makes large language models tempting but unreliable to use in many practical applications (Ganguli et al. 2022a). Furthermore, the rapid pace of NLP research and the quantity of individual studies make any progress in understanding model behavior difficult to track. As language models become more widespread and researchers from other fields take an interest in them, it is increasingly important that our existing understanding of model behavior be made clear and accessible.
In this survey, we discuss over 250 recent studies of English language model behavior, covering syntax, semantics, pragmatics, world knowledge, reasoning, memorization, and bias.1 Language models generate fluent and coherent text, but their predictions are highly dependent on input context. Slight changes in input word choice and phrasing can lead to unfactual, offensive, or plagiarized text. Understanding these behaviors has broad implications for informed applications in industry (Weidinger et al. 2021) and general questions about meaning and “understanding” in artificial agents (Bender and Koller 2020; Mitchell and Krakauer 2022; Shardlow and Przybyla 2022).
To the extent possible, we avoid taking a stance on whether language models truly “understand” language. We also leave deeper ethical discussions of the societal implications of language models to surveys focused specifically on that area (e.g., Weidinger et al. 2021, 2022). Instead, we hope to provide a review of the empirical evidence for what behaviors language models exhibit in controlled settings. We discuss a wide range of model capabilities and weaknesses (Sections 3–9), and we synthesize results framed from the perspectives of model scale (Section 10.1) and text pattern generalization (Section 10.2). In this way, we hope to combat anecdote-driven language model “hype” with informed hype grounded in what language models actually can and cannot do (Bowman 2022), while also highlighting potential future directions of research in language model behavioral analysis.
1.1 Scope
We consider studies of masked and autoregressive English Transformer language models not fine-tuned for any specific downstream tasks. We exclude a wealth of research on fine-tuned model behavior (e.g., models tuned for natural language inference, a text classification task). During the fine-tuning process, language models are prone to overfitting to spurious correlations between text features and labels in the fine-tuning dataset (McCoy, Pavlick, and Linzen 2019; Kavumba et al. 2020; Wang et al. 2022b; Du et al. 2022a; Kavumba, Takahashi, and Oda 2022), and they can even “forget” syntactic and semantic information learned during the original pre-training process (Miaschi et al. 2020; Mosbach et al. 2020). Thus, fine-tuned language models are not necessarily reflective of the linguistic abilities of language models in general. Moreover, as noted in the Introduction, language models are increasingly used without fine-tuning on any individual task.
We also leave studies of non-English and multilingual language models to future surveys that can better focus on the many nuances of cross-lingual comparisons. We acknowledge that over-focusing on high-resource languages (e.g., English) is a recurring problem in NLP research (Joshi et al. 2020), and we hope that this survey provides a foundation to expand to less well-studied languages for which language models often perform poorly (Wu and Dredze 2020; Choudhury and Deshpande 2021). Future surveys might also study the behavior of language model variants such as vision-language models (Du et al. 2022b), code models (Chen et al. 2021), speech models (Lakhotia et al. 2021; Radford et al. 2022), knowledge-augmented models (Zhang et al. 2019), sparsely activated models (Fedus, Zoph, and Shazeer 2022), or compressed models (Sanh et al. 2019; Zafrir et al. 2019). In the current survey, we consider non-augmented “out-of-the-box” Transformer language models, as used in the majority of NLP research.
Finally, we limit our survey to behavioral studies of language models. These studies treat the models as black box functions that take input text and return probability distributions over output text. Often inspired by work in psycholinguistics, these studies evaluate language model responses to controlled inputs (e.g., Ettinger 2020), to make inferences about how the models process and generate text. As we note in Discussion Section 10.3, other studies analyze language models at the mechanistic level, studying internal representations, individual neurons, and attention heads (Geva et al. 2021; Meng et al. 2022; Olsson et al. 2022). We focus on behavioral studies in this survey, but establishing ties between mechanistic and behavioral analyses of language models is an exciting direction of emerging research.
2 Transformer Language Models
In this section, we provide a brief introduction to Transformer language models, which we generally refer to as language models. Transformer language models use a deep neural network architecture called a Transformer (Vaswani et al. 2017; Section 2.1), and they are trained to predict either masked words (i.e., fill-in-the-blank) or upcoming words in text (Section 2.2). Throughout this survey, we refer to these two types of models as masked and autoregressive models, respectively.2 Some studies refer to them as bidirectional and unidirectional models. Language models are most often applied to downstream tasks using either fine-tuning (or prompt-tuning), zero-shot prompting, or few-shot prompting (Section 2.3).
2.1 Architectures
The basic Transformer language model architecture has remained largely unchanged since 2018 (Radford et al. 2018; Devlin et al. 2019). First, an input text string is converted into a sequence of tokens. Tokens correspond roughly to words, although some words are composed of multiple subword tokens due to limited vocabulary size. For example, the string “This is preposterous!” might be tokenized into [_this, _is, _prepo, ster, ous, !]. Common tokenization techniques include byte pair encoding (BPE; Sennrich, Haddow, and Birch 2016) and unigram language modeling (Kudo 2018); we refer readers to those papers for detailed descriptions of tokenization techniques. Model vocabularies generally range from 30K to 250K possible tokens (Radford et al. 2019; Cohen et al. 2022; Chowdhery et al. 2022).
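To make the tokenization step concrete, the following is a minimal sketch assuming the Hugging Face transformers library and the GPT-2 BPE vocabulary; the exact subword splits depend on the learned vocabulary and are shown only for illustration.

```python
# Minimal tokenization sketch, assuming the Hugging Face "transformers"
# library and the GPT-2 BPE vocabulary; the exact splits are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # ~50K-token BPE vocabulary

text = "This is preposterous!"
token_ids = tokenizer.encode(text)                  # integer indices into the vocabulary
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)      # rare words are split into multiple subword tokens
print(token_ids)
```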
After tokenization, each token is mapped to a fixed vector “embedding”; the embedding for each token is learned during the pre-training process. The embeddings are passed through a stack of Transformer layers (Vaswani et al. 2017; usually 10–100 layers), each consisting of a self-attention network, layer normalizations, and feedforward networks. The primary innovation of Transformer layers is the self-attention network, which “mixes” the sequence of token embeddings using projections into a “query”, “key”, and “value” vector for each token. This mixing of token embeddings results in a “contextualized” representation for each token, essentially a vector representation that incorporates the context of the input sequence. Finally, after the stack of Transformer layers, each output token representation is projected into a distribution over the same token vocabulary used in the input. In other words, the overall architecture maps each input token to a probability distribution over output tokens (e.g., upcoming tokens). Language models usually have between 100M and 500B total parameters, with autoregressive models usually much larger than masked models (Devlin et al. 2019; Brown et al. 2020; Lieber et al. 2021; Smith et al. 2022b; Chowdhery et al. 2022).
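As a rough illustration of the computation just described, the sketch below implements a single-head self-attention step and a final vocabulary projection in PyTorch. It omits multi-head splitting, layer normalization, residual connections, and feedforward sublayers, and all tensor names and sizes are our own illustrative choices rather than any specific model's.

```python
# Minimal single-head self-attention sketch in PyTorch (illustrative only;
# real Transformer layers use multiple heads, layer normalization,
# residual connections, and feedforward sublayers).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_model) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # query, key, value vectors per token
    scores = q @ k.T / (k.shape[-1] ** 0.5)      # pairwise token similarities
    weights = F.softmax(scores, dim=-1)          # attention distribution per token
    return weights @ v                           # contextualized token representations

seq_len, d_model, vocab_size = 6, 16, 50000
x = torch.randn(seq_len, d_model)                # embeddings for 6 input tokens
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
contextual = self_attention(x, w_q, w_k, w_v)    # shape: (6, 16)

# After the final layer, each position is projected to a distribution over
# the output vocabulary (e.g., upcoming tokens).
w_out = torch.randn(d_model, vocab_size)
next_token_probs = F.softmax(contextual @ w_out, dim=-1)   # shape: (6, 50000)
```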
The Transformer architecture does not naturally encode any information about each token’s position in an input sequence; intuitively, it is useful to encode this information for features such as word order. Thus, Transformer language models use a variety of position encoding techniques (Wang et al. 2021a; Dufter, Schmitt, and Schütze 2022), such as adding absolute position embeddings to the input token embeddings (i.e., an embedding for each position i; Vaswani et al. 2017; Radford et al. 2018; Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020; Zhang et al. 2022b), relative position embeddings or biases (i.e., encoding relative position distances between tokens; Shaw, Uszkoreit, and Vaswani 2018; Dai et al. 2019; Raffel et al. 2020; Chang et al. 2021; Rae et al. 2021; Cohen et al. 2022), or rotary position embeddings (an efficient approach to relative position biases; Su et al. 2021; Chowdhery et al. 2022). With relative rather than absolute position methods, language models can better extrapolate to longer sequences than observed during pre-training (Press, Smith, and Lewis 2022). Language models are usually pre-trained with input sequence lengths of around 500 to 2,000 tokens.
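The sketch below illustrates the absolute position embedding approach (one of the several options listed above): a learned vector for each position index is added to the corresponding token embedding before the first Transformer layer. The vocabulary size, dimensions, and token ids are illustrative.

```python
# Sketch of learned absolute position embeddings: an embedding for each
# position index i is added to the token embedding. Sizes and ids are illustrative.
import torch

vocab_size, max_len, d_model = 50000, 2048, 16
token_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(max_len, d_model)       # one vector per position i

token_ids = torch.tensor([101, 2023, 2003, 102])     # illustrative token ids
positions = torch.arange(len(token_ids))             # 0, 1, 2, 3
x = token_emb(token_ids) + pos_emb(positions)        # input to the first layer

# Relative-position schemes instead inject pairwise distances (i - j) into the
# attention scores, which tends to extrapolate better to longer sequences.
```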
2.2 Training
Language models are pre-trained using gradient descent, observing many examples as in Examples 1 and 2. Text corpora for pre-training usually range from approximately 5B to 1.5T tokens (roughly 15GB to 5TB of raw text; Devlin et al. 2019; Liu et al. 2019; Brown et al. 2020; Rae et al. 2021; Hoffmann et al. 2022). For compute-optimal pre-training in autoregressive language models, as the number of model parameters increases, the number of pre-training tokens should increase roughly proportionally (Kaplan et al. 2020; Hoffmann et al. 2022). During pre-training, examples are fed into the models with anywhere from 100K to 4M tokens per optimization step (i.e., batch size), usually with larger batch sizes in larger models (Devlin et al. 2019; Brown et al. 2020; Hoffmann et al. 2022; Chowdhery et al. 2022; Zhang et al. 2022b). Models are usually pre-trained for 100K to 1M steps (Radford et al. 2018; Devlin et al. 2019; Zhang et al. 2022b); when possible, examples are not repeated during pre-training (Hoffmann et al. 2022; Chowdhery et al. 2022). Due to high computational costs, relatively few language models are pre-trained from scratch as described here, and they are usually trained in industry labs. In practice, most NLP researchers build applications upon existing pre-trained language models, using the approaches described in Section 2.3.
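The pre-training objective itself can be sketched in a few lines: in autoregressive models, each position is trained by gradient descent to predict the next token under a cross-entropy loss, while masked models instead predict randomly hidden tokens. The tensor shapes below are illustrative stand-ins for real model outputs.

```python
# Sketch of the autoregressive pre-training objective: each position is trained
# to predict the next token via cross-entropy. Shapes are illustrative stand-ins.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50000, 8
logits = torch.randn(seq_len, vocab_size, requires_grad=True)   # stand-in for model outputs
token_ids = torch.randint(0, vocab_size, (seq_len,))            # one pre-training sequence

# Position t predicts token t+1; the loss is averaged over positions (and, in
# practice, over batches of roughly 100K to 4M tokens per optimization step).
loss = F.cross_entropy(logits[:-1], token_ids[1:])
loss.backward()   # gradient descent would then update the model parameters

# Masked pre-training differs only in the targets: randomly selected input
# tokens are hidden, and the model predicts them from bidirectional context.
```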
This survey considers pre-trained language models as described above. Recent language models often contain further non-task-specific fine-tuning stages (particularly autoregressive models; Cohen et al. 2022; Ouyang et al. 2022). For example, autoregressive models are sometimes fine-tuned using the language modeling objective on curated human-written examples that demonstrate desirable text outputs (Ouyang et al. 2022) or examples of outputs that correctly follow input instructions (Wei et al. 2022a; Iyer et al. 2022). These approaches are referred to as supervised fine-tuning or instruction tuning. Some more recent models are also tuned using reinforcement learning, with predicted human preferences for different responses used as a reward (reinforcement learning from human feedback; Ouyang et al. 2022; OpenAI 2023a). Throughout this survey, we consider non-fine-tuned language models unless otherwise specified.3 Non-fine-tuned language models still serve as the foundation for more recent language models.
2.3 Downstream Tasks and Text Generation
In cases such as Example 3, autoregressive language models can compute the probability for any desired output text by iteratively multiplying the probability for each next token. When the models are used for open-ended text generation (i.e., the models must select each next token), common approaches are to (1) iteratively select the most probable next token (greedy sampling), (2) iteratively sample the next token from the output probability distribution with some temperature parameter τ (temperature sampling), (3) sample from the top k token predictions (top-k sampling), or (4) sample from the top tokens that sum to some probability p (nucleus sampling; Holtzman et al. 2020). In all of these cases, multiple candidate sequences of tokens can be generated and then ranked according to their overall sequence probability (i.e., beam search; Freitag and Al-Onaizan 2017), but beam search is often not used in practice due to its high computational cost. Of the studies discussed in this survey, the majority use greedy, temperature, top-k, or nucleus sampling for open-ended text generation. In the next sections, we discuss recent studies evaluating language model generated text and output text probabilities from a wide range of perspectives.
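The sketch below illustrates these four decoding strategies on a single next-token distribution; the logits and the parameter values (τ, k, p) are illustrative.

```python
# Sketch of common decoding strategies over one next-token distribution.
# The logits and parameter values are illustrative; real vocabularies have
# roughly 30K-250K entries.
import torch

logits = torch.tensor([2.0, 1.5, 0.5, 0.1, -1.0])    # unnormalized next-token scores

# (1) Greedy: take the single most probable token.
greedy_id = torch.argmax(logits).item()

# (2) Temperature sampling: divide logits by tau before the softmax
#     (tau < 1 sharpens the distribution, tau > 1 flattens it).
tau = 0.8
probs = torch.softmax(logits / tau, dim=-1)
temp_id = torch.multinomial(probs, num_samples=1).item()

# (3) Top-k sampling: sample only among the k most probable tokens.
k = 3
top_probs, top_ids = torch.topk(probs, k)
topk_id = top_ids[torch.multinomial(top_probs / top_probs.sum(), 1)].item()

# (4) Nucleus (top-p) sampling: sample among the smallest set of tokens whose
#     cumulative probability exceeds p.
p = 0.9
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
cutoff = int((torch.cumsum(sorted_probs, dim=0) < p).sum()) + 1
nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
nucleus_id = sorted_ids[torch.multinomial(nucleus_probs, 1)].item()
```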
3 Syntax
We begin with studies that evaluate language model predictions from a syntactic perspective. In the vast majority of cases, language models are more likely to predict grammatical tokens than ungrammatical tokens, adhering to a wide variety of syntactic rules (Section 3.1). In subject-verb agreement, the models’ performance degrades in more complex or infrequent examples (Section 3.2), and language model predictions are possibly over-sensitive to token position information (i.e., word order; Section 3.4), but syntactic abilities overall are learned fairly robustly early in pre-training (Section 3.3).
3.1 Language Models Generally Produce Grammatical Text
In general, the grammaticality of language model predictions improves with model size and pre-training corpus size, in both autoregressive and masked models (Warstadt et al. 2020; Pérez-Mayos, Ballesteros, and Wanner 2021). Across model sizes, better overall language modeling performance (e.g., inverse perplexity) is positively correlated with syntactic ability, although this relationship is not clear within any given model size (Hu et al. 2020; Pérez-Mayos, Ballesteros, and Wanner 2021). That said, many syntactic rules may be learned primarily based on memorized examples, dependent on the specific words and structures seen during pre-training (Section 3.2). For example, in cases where people generate syntactically anomalous phrases (e.g., article-noun disagreement between “a” and “days” in “a cold five days”), GPT-3 acceptability predictions roughly mirror human judgments (Mahowald 2023).6 When prompted with examples, GPT-3 can answer questions directly about a sentence’s syntactic structure (Zhang et al. 2022a). The results in this section demonstrate basic syntactic abilities in language models.
3.2 Language Models Learn Subject-Verb Agreement, but They Are Sensitive to Intervening Clauses and Specific Words
Language models’ syntactic abilities are most often evaluated using agreement, where one token’s form depends on a property of another token. For example, subject nouns in English must agree in number with their corresponding verbs (e.g., “the dog eats” vs. “the dogs eat”; see also Example 4). Masked and autoregressive language models are generally good at predicting verb forms for subject-verb agreement (van Schijndel, Mueller, and Linzen 2019), even in nested clauses and with long-distance dependencies as in Example 4 (Goldberg 2019). However, agreement performance degrades as the distance between the subject and verb increases (Bacon and Regier 2019; Ryu and Lewis 2021; Lakretz et al. 2022). In large autoregressive models, this degradation can be reduced significantly if models are provided with even just two initial examples (using few-shot prompting), as human raters usually are (Lampinen 2022).
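Many of the agreement studies cited here use a minimal-pair paradigm: a model “passes” an item if it assigns higher probability to the grammatical variant than to the ungrammatical one. The sketch below shows one way to implement this comparison, assuming GPT-2 via the Hugging Face transformers library; the sentence pair is our own illustrative example.

```python
# Sketch of a minimal-pair agreement evaluation: a model "passes" if it assigns
# higher probability to the grammatical sentence than to the ungrammatical one.
# Assumes GPT-2 via Hugging Face transformers; the sentences are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Sum log P(token_t | tokens_<t) over the sentence.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))  # typically True
```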
Subject-verb agreement performance in language models is also dependent on the specific nouns and verbs involved (Yu et al. 2020; Chaves and Richter 2021). Masked and autoregressive models produce over 40% more accurate agreement predictions for verbs that are already probable from context (Newman et al. 2021), and agreement accuracy is worse overall for infrequent verbs (Wei et al. 2021). For infrequent verbs, masked language models are biased towards the more frequent verb form seen during pre-training (e.g., singular vs. plural) (Wei et al. 2021). Error rates exceed 30% for infrequent verbs in nonce (grammatically correct but semantically meaningless) sentences (Wei et al. 2021), with further degradations if there is an intervening clause between the subject and verb as in Example 4 (Lasri, Lenci, and Poibeau 2022a). This subject-verb agreement degradation in nonce sentences with long-distance dependencies has also been observed in people, although to a lesser degree than in language models (Lasri et al. 2022). Finally, subject-verb agreement performance in masked and autoregressive language models is dependent on the specific subject noun, although these differences in performance do not appear to be driven by noun frequency (Yu et al. 2020). In many ways, language models’ variable performance on subject-verb agreement reflects a larger sensitivity to specific words and input structures (Discussion Section 10.2).
3.3 Language Models Learn Syntactic Rules Early in Pre-training
The acquisition of syntactic rules is fairly consistent during language model pre-training. Syntactic rules are learned within roughly the first 20% of masked language model pre-training, as measured by the syntactic generalization suites in Section 3.1 (Liu et al. 2021; Zhang et al. 2021b). Small masked language models (8M parameters) pre-trained on only 30M words of transcribed child-directed speech can achieve similar syntactic performance to standard masked models with over 10x more parameters and 1,000x more pre-training data (Huebner et al. 2021). Autoregressive and masked models tend to learn similar syntactic generalizations during the pre-training process regardless of random initializations and training data shuffling (Choshen et al. 2022; Misra 2022). Early in pre-training, models are syntactically more similar to bag-of-words, unigram, and n-gram models (Choshen et al. 2022), passing through stages where their predictions mirror unigram then bigram distributions (Chang and Bergen 2022).7 Notably, syntactic abilities emerge in Transformer language models despite the fact that Transformers cannot model arbitrarily deep hierarchical structures unless their number of layers or attention heads increases with input length (Hahn 2020), and Transformers have a tendency to generalize linearly rather than hierarchically when trained from scratch on purely syntactic tasks (Petty and Frank 2021).
3.4 Language Models Can Learn Word Order Without Explicit Position Information, but Word Order Is Not Necessary in Many Examples
At first glance, language modeling performance would seem highly dependent on a model’s understanding of word order (i.e., token positions). For example, syntactic information in English is largely determined by token positions (e.g., “the dog saw the cat” vs. “the cat saw the dog”). However, masked language models pre-trained on data with shuffled words can still be fine-tuned for reasonable performance on a variety of downstream tasks (Sinha et al. 2021). This result may be because token position embeddings (Section 2.1) are still learned through common subword token sequences that remain unshuffled. Even when pre-training data is shuffled after tokenization, masked models learn informative position embeddings using correlations between sentence length and token frequencies (Abdou et al. 2022). Similarly, autoregressive language models without any position embeddings are able to encode token position information implicitly by “counting” the previous tokens in the causal (autoregressive) attention mask (Haviv et al. 2022).8 Thus, to some degree, the models in these studies are still able to rely on learned token position information.
In contrast, token position information is removed entirely in masked language models when position embeddings are removed. Small masked language models (e.g., 13M parameters) achieve similar language modeling performance when pre-trained with and without position embeddings, particularly if few tokens are masked per sequence (Chang et al. 2021; Lasri, Lenci, and Poibeau 2022b). However, more masking during pre-training improves fine-tuning performance for larger masked models (Wettig et al. 2023); in these larger models, removing token position information entirely might lead to more detrimental effects than in smaller models. While position information (word order) is not necessary for disambiguating semantic meaning in many sentences, there exists a minority of cases where position cues are necessary (Mahowald et al. 2022). Language models can reconstruct text from shuffled inputs, but not with perfect accuracy (Malkin et al. 2021). Thus, high performing models likely need to learn token position information without overfitting to irrelevant position cues. Both masked and autoregressive models with absolute position embeddings (Section 2.1) exhibit such overfitting, making worse language modeling predictions when sequences are shifted by a constant (i.e., shifting all positions by k, maintaining relative positions), a transformation that would ideally have little effect (Sinha et al. 2022b). This overfitting to position cues may also be related to language models’ tendency to generate highly frequent local structures (shorter n-grams based on local positions) rather than long-term coherent text, as described in Section 7.2.
4 Semantics and Pragmatics
On top of syntax, language models display basic abilities in semantics, which considers how text can be parsed to produce “meaning”. Language models learn word meanings and relationships as reflected in lexical semantics (Section 4.1), they track entities in described situations (Section 4.3), and they recognize basic figurative language (Section 4.4). However, they struggle with negation (Section 4.2) and pragmatics (Section 4.5).
4.1 Language Models Learn Semantic and Compositional Properties of Individual Words, Including Argument Structure, Synonyms, and Hypernyms
Researchers have primarily evaluated compositional semantics in language models through the lens of lexical semantics, which studies word meanings and relationships, considering how individual words influence the meaning and semantic structure of a phrase (Geeraerts 2017). At the word meaning level, both masked and autoregressive language models can predict frequent words from their definitions and vice versa, but they struggle with infrequent words (Senel and Schütze 2021). Masked models can predict noun hypernyms (e.g., “robins” are “birds”) using template sentences (e.g., “A robin is a _”; Hanna and Mareček 2021) or by predicting noun replacements (Ravichander et al. 2020), but predictions degrade when the noun is plural or the hypernym pair is infrequent. The hypernym prediction confidence in autoregressive and masked models is correlated with the human-rated typicality of the hyponym within the hypernym category, with larger models showing stronger typicality effects (Misra, Ettinger, and Rayz 2021). When predicting masked nouns more generally, masked language models assign high probabilities to word synonyms and co-hyponyms (e.g., “robin” and “sparrow” are co-hyponyms of “bird”), rather than pairs of hyponyms and hypernyms (Arefyev et al. 2020). These results suggest that language models understand basic word meanings and allowable word substitutions; more grounded knowledge of the objects and entities that words refer to, such as physical properties and facts, is discussed in Section 5.
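Many of the lexical semantics results above come from template-based probing of masked models. The sketch below shows this setup using the Hugging Face fill-mask pipeline with BERT; the template and nouns are illustrative rather than taken from any particular study.

```python
# Sketch of template-based probing for hypernyms with a masked model, assuming
# BERT via the Hugging Face fill-mask pipeline; template and nouns are illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for noun in ["robin", "sparrow", "oak"]:
    predictions = fill(f"A {noun} is a [MASK].", top_k=5)
    print(noun, [p["token_str"] for p in predictions])
    # High-probability completions often include the hypernym (e.g., "bird"),
    # but accuracy degrades for plural nouns and infrequent hypernym pairs.
```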
4.2 Language Models Struggle with Negation, Often Performing Worse as Models Scale
One notable example of compositionality is negation, where a word such as “not” inverts the meaning of a phrase. Masked language models often ignore negation when producing completions, such that they are more likely to generate incorrect completions than correct completions to negated primes (e.g., “A robin is not a [bird]”; Ettinger 2020; Kassner and Schütze 2020). In fact, autoregressive models generate more incorrect completions after “few”-type quantifiers (e.g., “Few robins are [birds]”) as models increase in size (Michaelov and Bergen 2022b). These results may reflect a similarity to human online processing (e.g., neural responses and reading times) rather than offline processing and reasoning (Michaelov and Bergen 2022b). Sensitivity to negation can be improved if language models are fine-tuned on more negation sentences, still using the language modeling objective (predicting tokens); masked models are then much less likely to predict any token that was negated in a given context (Gubelmann and Handschuh 2022).
Negation degrades language model performance in tasks involving more explicit reasoning as well (e.g., reasoning abilities in Section 6). When autoregressive models are presented with negated task prompts (e.g., “Please produce a possible incorrect answer to the question”), they perform worse as they increase in size (Jang, Ye, and Seo 2022). Performance is often over 50% worse on negated prompts compared to the original prompts. These weaknesses may not be reflected in many NLP benchmarks due to underrepresentation of negation relative to naturally occurring corpora, and the fact that negation is not relevant for many examples (Hossain, Chinnappa, and Blanco 2022); fine-tuned language models perform much worse on datasets that explicitly focus on negation (Hossain et al. 2020; Geiger, Richardson, and Potts 2020; Tejada, Scholtes, and Spanakis 2021; Truong et al. 2022).
4.3 Language Models Construct Coherent but Brittle Situation Models
Similar to situation models proposed in human language comprehension (Zwaan 2016), language models are able to track entities such as objects and characters throughout a passage. Autoregressive models are able to recognize whether a phrase introduces a new entity (e.g., the “cake” in “I saw Michael bake a cake” vs. “I doubt Michael baked a cake”), with better accuracy in larger models (Schuster and Linzen 2022). However, when multiple nouns are present, the models sometimes refer to un-introduced entities (e.g., “I doubt Michael baked a cake. It’s in the oven.”; Schuster and Linzen 2022). Masked language models are able to predict the antecedents of bridging anaphora, when an entity (e.g., “the window”) has an implied relation to a previously mentioned entity (e.g., “the house”) (Pandit and Hou 2021).
When prompted with a passage, GPT-3 can answer questions about entity states and event likelihoods, but only marginally better than chance (Zhang et al. 2023b). GPT-3 performs better when answers are stated explicitly in the passage, but its answers are sensitive to the phrasing of the question (Summers-Stay, Bonial, and Voss 2021). GPT-3 also has poor accuracy for questions that involve mathematical reasoning, temporal ordering of events, or logical negation (Summers-Stay, Bonial, and Voss 2021; see also Section 4.2 for negation and Section 6.2 for numerical reasoning). Of course, the studies above consider entities and entity states that are described relatively unambiguously in the text, and language models already exhibit somewhat unreliable performance; in later sections, we discuss commonsense inferences about the implied mental states of characters (Section 4.5) and implied relationships between events (Section 5.3).
4.4 Language Models Recognize Basic Analogies, Metaphors, and Figurative Language
Contradicting the rules of compositional semantics (Section 4), some phrases have meanings that cannot be constructed directly from their constituent words. Common examples of noncompositional expressions include analogies, metaphors, and idioms; these expressions must be interpreted nonliterally (i.e., figuratively or metaphorically). Masked language models assign higher probabilities to literal sentences, then conventional (i.e., common) metaphors, then novel metaphors, then nonsense (Pedinotti et al. 2021a; Griciūtė, Tanti, and Donatelli 2022). When prompting autoregressive models directly to identify metaphorical language, the models exhibit a sharp increase in performance around 100B parameters (Comșa, Eisenschlos, and Narayanan 2022). From these results, it appears that language models recognize metaphorical language to some degree as they increase in size.
Furthermore, masked and autoregressive models can predict the correct interpretations of similes (figurative comparisons using “like” or “as”), with improvements based on model size, but consistently worse than people (Liu et al. 2022a; He et al. 2022a). The models can complete analogies (e.g., “X is to Y as Z is to _”) reasonably well (Ushio et al. 2021), but they perform significantly worse for more abstract and unconventional analogies (Czinczoll et al. 2022). GPT-3 can generate analogies of comparable quality to people when given open-ended prompts (e.g., “What is analogous to X?”), although quality varies by prompt template (Bhavya, Xiong, and Zhai 2022).
Finally, noncompositional expressions include constructions, linguistic templates whose meanings are not necessarily built up from their constituent words. For example, the comparative correlative construction (e.g., “the better your syntax, the better your semantics”) has a well-understood meaning in English despite its apparent ungrammaticality (e.g., no inflected verb). Masked language models struggle to recognize the comparative correlative, making inferences about the implied descriptions at chance level after accounting for adjective frequencies (Weissweiler et al. 2022). However, research on a wider range of constructions is necessary to determine which constructions language models struggle with more generally.
4.5 Language Models Can Infer the Mental States of Characters in Text, But They Struggle with Implied Meaning and Pragmatics
The previous sections focused on linguistic structure and meaning somewhat independent of context. In conversation, many utterances have implied meanings that depend on context and the intentions of the speaker; these meanings are the focus of pragmatics. According to Grice’s maxims of conversation (quantity, quality, relation, and manner), utterances should be appropriately informative, true, relevant, and clear (Grice 1975). Comprehending and producing pragmatically sound utterances likely requires some sensitivity to others’ mental states (Frank and Goodman 2012; Monroe and Potts 2015; Sikos et al. 2021). Indeed, when asked directly, GPT-3 can infer the knowledge and desires of characters in text (Summers-Stay, Bonial, and Voss 2021; Sap et al. 2022), and it can explain why characters perform actions in everyday situations based on commonsense reasoning (Lal et al. 2022). It can even answer questions about characters’ deceit, indirect requests, irony, implied meaning, and humor, but this ability is not observed in smaller autoregressive models (e.g., 100M parameters) (Hu et al. 2022). When using a fill-in-the-blank word prediction task to infer knowledge states of characters (e.g., whether they know the location of an object), GPT-3 performs well above chance but worse than people (Trott et al. 2023). Masked language models can predict “go” vs. “come” in narratives with accuracy similar to people, recognizing the implied spatial perspective of the narrative (Masis and Anderson 2021).
However, sensitivity to perspectives and mental states does not translate directly into pragmatic understanding in language models. Autoregressive models are more likely to repeat an entity (e.g., “the cup”) than use a pronoun (e.g., “it”) in many cases where a pronoun would be more natural, thus producing potentially over-informative text (Beyer, Loáiciga, and Schlangen 2021). When explicitly interpreting pragmatically implied meanings (implicatures, e.g., “A asked X, and B responded Y, which means [yes/no]”), both masked and autoregressive models perform only slightly above chance and much worse than people, with no substantial improvements using larger models (Ruis et al. 2022). GPT-3 is unable to predict plausible presuppositions (e.g., “Grant stopped eating meat” implies “Grant once ate meat”) or scalar implicatures (e.g., “some brothers” implies “not all brothers”) any better than chance (Cong 2022). This is in line with studies showing that fine-tuned language models rely on surface cues such as specific function words when they appear to recognize presuppositions (Kabbara and Cheung 2022). That said, both masked and autoregressive models prefer conversationally relevant content over less relevant content, preferring to output text related to main clause content over embedded clause content (Kim, Yu, and Ettinger 2022). In other words, language models exhibit reasonable sensitivity to relevance and mental states, but their pragmatic abilities remain limited overall.
5 Commonsense and World Knowledge
Beyond their ability to interpret and produce fluent text, language models exhibit basic world knowledge, including commonsense reasoning and facts. They learn encyclopedic facts and commonsense properties of objects (Section 5.1), albeit unreliably (Section 5.2), and they have a limited ability to infer typical relationships between actions and events (Section 5.3). Commonsense and factual knowledge in language models generally improves with model size, and the models’ factual knowledge can be further enhanced with explicit memory retrieval mechanisms (Khandelwal et al. 2020; Borgeaud et al. 2022), connections to search engines (Schick et al. 2023), or knowledge bases (Zhang et al. 2019; Guu et al. 2020).
5.1 Language Models Learn Facts and Commonsense Properties of Objects, Particularly as Models Scale, but They Are Less Sensitive Than People to Physical Properties
However, autoregressive models perform worse when considering larger sets of facts in open-ended factual question-answering (Kalo and Fichtel 2022). Masked and autoregressive models perform poorly when predicting numeric literals (e.g., years; Kalo and Fichtel 2022) and numerical commonsense (e.g., “A bird has _ legs”; Lin et al. 2020) (see Section 6.2 for more general numerical reasoning). The models also struggle to make fine-grained property distinctions between related concepts and hypernyms (e.g., properties of “robins” vs. “birds” in general), although accuracy improves with model size (Peng et al. 2022; Misra, Rayz, and Ettinger 2023). As model size increases, autoregressive models are also more likely to correctly use their background factual knowledge to answer questions; accuracy on relevant facts is more predictive of a correct response to a target question in larger models (Sahu et al. 2022). On top of generally higher accuracy (Kalo and Fichtel 2022), larger models (e.g., 50B parameters) are able to assess whether their own answers to factual questions are correct or incorrect, with this self-reflection ability increasing with model size (Kadavath et al. 2022).
To some degree, language models are also able to predict physical properties of objects, such as colors and sizes, using templates similar to Example 8. Perhaps unsurprisingly, model predictions are generally less sensitive than human responses to real world physical properties. For example, masked models can predict typical vs. atypical properties when prompted using quantifiers (e.g., “All X are _” vs. “Some X are _”; Apidianaki and Garí Soler 2021). However, their property predictions are only loosely correlated with human responses, and when predicting a target object from its properties, the models rely on encyclopedic facts over visual and perceptual properties (Weir, Poliak, and Durme 2020). Both masked and autoregressive models can predict typical color distributions of objects, but their predictions correlate more with corpus n-grams (e.g., “red ball”) than with human judgments (Paik et al. 2021), particularly for smaller models (Liu et al. 2022b). Similarly, autoregressive models assign higher probabilities to correct physical comparisons (e.g., “A bear is bigger than a cat”) than to incorrect comparisons, with better performance in larger models (Shi and Wolff 2021; De Bruyn et al. 2022). Finally, masked models can predict the typical use for an object better than chance (Jiang and Riloff 2021), and GPT-3 predicts atypical but physically plausible (i.e., “afforded”) uses as more likely than implausible uses, but this effect is much smaller than in people (Jones et al. 2022). When prompted for creative uses for objects, GPT-3 provides slightly less creative and original uses than people (Stevenson et al. 2022).
5.2 Learned Facts Are Sensitive to Context and a Fact’s Frequency in the Pre-training Corpus
Language models’ ability to predict facts and object properties is highly sensitive to the specific prompt template (e.g., the template in Example 8) and the entities involved. Accuracies in both masked and autoregressive models vary substantially when the templates are paraphrased (Elazar et al. 2021; Cao et al. 2022) or altered in terms of punctuation (Podkorytov, Bis, and Liu 2021). Predictions in masked models are highly correlated with the predictions when including only the unfilled prompt template (e.g., excluding “Dante” in Example 8) (Cao et al. 2021). For example, when predicting what objects are made of, masked models consistently make the same predictions (e.g., “wood” or “metal”) regardless of the given object (Kwon et al. 2019). Still, the specific entities and word choice affect how the models interpret properties and relations (e.g., “density” in cities vs. physical objects) (Beloucif and Biemann 2021). Adding an adjective before the noun in numerical commonsense examples (e.g., “A [adjective] bird has _ legs”) can significantly degrade performance in masked and autoregressive models (Lin et al. 2020).
Often, masked models rely largely on simple heuristics to make predictions, such as predicting nationalities based on common names in different countries (Poerner, Waltinger, and Schütze 2019), or simply predicting semantically similar words to the input prompt. Performance degrades substantially if the template includes a semantically similar distractor sentence (Pandia and Ettinger 2021), and masked models can be misprimed to incorrectly produce a plausible word that appears immediately before the factual prompt (e.g., “Talk? Birds can __” → “talk”) (Kassner and Schütze 2020). Using causal graph analysis, masked model predictions are correlated with co-occurrence frequencies between the target word and words in the prompt (Elazar et al. 2022). Masked models make similar predictions even for opposite relations (e.g., “has property” vs. “does not have property”) (Kwon et al. 2019), although this may be due to models’ difficulty processing negation (Section 4.2).
Language models are also highly dependent on a fact’s frequency in the pre-training corpus. In very small masked models (e.g., 1M parameters), accuracy for an individual fact correlates with its frequency, and schema-conforming facts (e.g., “robins can fly” in a corpus of birds) are learned faster than exceptions (e.g., “penguins can dive”) (Kassner, Krojer, and Schütze 2020). In factual question-answering tasks, autoregressive model performance for each example is correlated with the number of related documents in the pre-training corpus; removing the relevant documents during pre-training decreases performance for the fact (Kandpal et al. 2022). Factual question-answering performance improvements based on model size are primarily due to accuracy increases for popular entities, as measured by Wikipedia views (Mallen et al. 2022). These frequency effects on fact learning may explain why masked model predictions of typical noun properties improve when models are fine-tuned on children’s books (still using the language modeling objective; Romero and Razniewski 2022); children’s books are more likely to explicitly state commonsense properties of objects.
Factual knowledge continues to evolve even late in pre-training in masked language models, as evaluated by raw fact accuracies (Chiang, Huang, and Lee 2020) and similarity between extracted knowledge graphs (Swamy, Romanou, and Jaggi 2021). Factual and commonsense knowledge in general is learned more slowly than syntactic generalizations during masked language model pre-training (Liu et al. 2021; Zhang et al. 2021b). Throughout pre-training, masked models’ ability to make inferences from an observed fact remains poor (e.g., observing “A robin is a bird” during pre-training does not increase the probability for “Robins can fly”; Porada, Sordoni, and Cheung 2022), suggesting that the models are memorizing rather than generalizing facts observed during pre-training. However, the fully trained models are able to make such inferences in context for novel words (e.g., “A wug is a bird. Therefore, a wug can _” → “fly”), even though this effect is sensitive to distractor sentences (Misra, Rayz, and Ettinger 2023). In other words, the models can make inferences from a fact presented in context after pre-training (“A robin is a bird ⇒ Robins can fly”), but observing the same fact during pre-training does not lead them to make the corresponding inference.
5.3 Language Models Have a Limited but Nontrivial Ability to Make Commonsense Inferences About Actions and Events
Beyond learning facts and commonsense properties of objects, language models can make basic commonsense inferences about events. Extending beyond simple situation modeling (Section 4.3), language models can infer plausible situations that are not described explicitly, although this ability is unreliable. Masked models are more likely to predict typical locations than atypical locations for verbs (Cho et al. 2021), but they are biased overall towards unusual or noteworthy events that are more likely to appear in many text corpora (e.g., “The person is _” → “killed” or “dying”; Shwartz and Choi 2020). The models assign higher probabilities to possible over impossible scenarios, but their ability to distinguish plausible and implausible scenarios varies per example (Beyer, Loáiciga, and Schlangen 2021; Kauf et al. 2022). Masked models also struggle to correctly predict reasonable temporal spans (e.g., “My holiday is only _”) (Qin et al. 2021), although they are able to predict the telicity (completed vs. in-progress state) of verbs using cues similar to people, such as verb-specific biases and stated time lengths (Zhao et al. 2021). Question-answering performance about commonsense situations in autoregressive models can often be attributed to answer-only probabilities, where the correct answer is a priori more likely than incorrect answers (Li et al. 2022b). Still, when asked directly, GPT-3 can identify character roles (e.g., the hero, villain, and victim) in newspaper articles, movie plot summaries, and political speeches (Stammbach, Antoniak, and Ash 2022).
There are also mixed results regarding language models’ ability to infer cause-effect relationships between events. Autoregressive models assign lower probabilities to flipped cause-effect sentences and self-contradictions, albeit with high variation across examples (Beyer, Loáiciga, and Schlangen 2021). Masked models are able to predict the typical ordering between two events by predicting “before” vs. “after” between phrases (Jin et al. 2022b), and the models assign higher overall probabilities to plausible causes before a described effect (Tamborrino et al. 2020). However, both masked and autoregressive models perform poorly when predicting the most likely reason sentence to place between start and end state descriptions (Misra 2022). Masked models are surprisingly bad at predicting concessive vs. causal conjunctions (e.g., “but” vs. “so”) between sentences (around 10% accuracy) in minimal pair cases with few lexical cues (Pandia, Cong, and Ettinger 2021). This occurs despite the fact that autoregressive model responses after connectives such as “but” and “so” are generally rated as coherent by people (Ko and Li 2020).
Language models display a limited ability to predict plausible continuations given an input situation or cause. Both masked and autoregressive models assign higher probabilities to supported statements than unsupported statements after a piece of evidence, with improved performance in larger models (Lee et al. 2021). The models predict story completions with probabilities that correlate with human typicality ratings, although this effect is largely driven by frequent words (Pedinotti et al. 2021b). Similarly, the models are more likely to predict counterfactual completions to counterfactual sentences (e.g., “If cats had liked vegetables, families would feed their cats with [carrots/fish]”), but these effects are largely due to lexical cues (e.g., just predicting related words) (Li, Yu, and Ettinger 2022). Masked and autoregressive models are at approximately random chance when predicting commonsense effects of actions such as “A did X and B did Y, so A is [more/less] Z” (Zhou et al. 2021). Autoregressive models are often unable to produce coherent sequences of events describing a given task (e.g., “baking a cake”; Sancheti and Rudinger 2022). Finally, both masked and autoregressive models struggle with fill-in-the-blank tasks requiring physical inference (e.g., inferring object locations, objects breaking, or objects moving); predictions are sensitive to which objects appear first in the text (Aroca-Ouellette et al. 2021), and language model predictions do not fully account for the physical inferences made by people (Jones and Bergen 2021).
6 Logical and Numerical Reasoning
We next consider logical reasoning tasks: tasks that involve symbols and rules, along with algorithms for solving examples when the rules are known (Fujisawa and Kanai 2022). When provided with explicit instructions or examples, language models can perform basic step-by-step logical reasoning (Section 6.1) and numerical reasoning (Section 6.2), but they struggle with complex reasoning, and they are dependent on specific numerical inputs. Language models’ numerical and logical reasoning abilities can be improved by connecting the models to external APIs and logical reasoning modules such as calculators and code execution environments (Karpas et al. 2022; Schick et al. 2023; Krawczyk and Subramanya 2023).
6.1 Large Language Models Can Perform Basic Logical Reasoning When Prompted, but They Still Struggle with Complex Reasoning
If prompted with examples of reasoning for question-answer pairs (using few-shot prompting; Section 2.3), autoregressive models with at least 8B parameters can perform well on mathematical word problems, formal logic puzzles, and other logical reasoning tasks (Wei et al. 2022c; Suzgun et al. 2022). Their reasoning abilities do not appear to rely solely on surface cues such as word overlap; randomly shuffled example explanations do not provide significant benefits (Lampinen et al. 2022). Given examples, GPT-3 is able to solve fill-in-the-blank puzzles for arbitrary letter patterns and numerical matrix patterns (Webb, Holyoak, and Lu 2022). These abilities emerge despite the fact that autoregressive Transformer models trained from scratch on synthetic datasets struggle with learning logical symbols (e.g., the distinction between “and” and “or”; Traylor, Feiman, and Pavlick 2021). In some studies, only autoregressive models with at least 20B parameters can solve logic puzzles above chance, even when provided with examples (Han et al. 2022).
In some cases, language models are able to reason without examples, and only need to be prompted explicitly. Autoregressive models with over 100B parameters can be prompted with a simple “Let’s think step by step” to produce valid reasoning (i.e., zero-shot “chain-of-thought” prompting; Kojima et al. 2022). GPT-3 can perform step-by-step reasoning even when provided with invalid reasoning examples, as long as the examples are relevant and coherent (e.g., steps in the correct order, even if the logic is incorrect; Wang et al. 2022a), suggesting that language models’ reasoning abilities are not necessarily dependent on provided examples in few-shot prompting. Autoregressive models can perform well on standard NLP tasks even when the examples have incorrect answers; examples in few-shot prompting primarily allow the models to learn the set of possible answers and the general input format (Min et al. 2022).
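To make the two prompting styles concrete, the sketch below constructs a few-shot chain-of-thought prompt and a zero-shot “Let’s think step by step” prompt; the question, the worked example, and the generate call are illustrative placeholders for a call to a large autoregressive model.

```python
# Sketch of the two prompting styles discussed above. The question and worked
# example are illustrative, and `generate` stands in for a call to a large
# (e.g., 100B+ parameter) autoregressive model.
question = "Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls does he have?"

# Few-shot chain-of-thought: prepend worked examples that include reasoning steps.
few_shot_prompt = (
    "Q: There are 3 cars and each car holds 4 people. How many people fit?\n"
    "A: Each car holds 4 people, so 3 cars hold 3 * 4 = 12 people. The answer is 12.\n"
    f"Q: {question}\nA:"
)

# Zero-shot chain-of-thought: no examples, just an instruction to reason step by step.
zero_shot_prompt = f"Q: {question}\nA: Let's think step by step."

# answer = generate(few_shot_prompt)   # hypothetical call to a large model or API
```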
Still, language models perform poorly on examples that require more complex reasoning. Even though autoregressive models generally produce valid reasoning steps, they struggle when multiple valid next steps are possible (Saparov and He 2023). Given text descriptions of toy blocks and goals, the models are unable to generate successful plans or modify existing plans (<5% accuracy; Valmeekam et al. 2022). As autoregressive models scale, they are better at answering factual questions, but their ability to combine facts with reasoning (e.g., “Who lived longer, George Washington or Julius Caesar?”) does not improve substantially (Press et al. 2022). When asked questions that implicitly require multi-step reasoning (e.g., “Did Julius Caesar ever visit George Washington?”), the models struggle to leverage known facts to answer questions correctly (Katz, Geva, and Berant 2022). When asked to make inferences from a set of rules and a fact, autoregressive models often just predict the answer choice with the highest word overlap with the input question (Betz, Richardson, and Voigt 2021). The models are also biased to predict intuitively plausible answers to logical questions regardless of the true logical answer, although this effect is also present in people (Dasgupta et al. 2022).
6.2 Language Models Exhibit Basic Numerical and Probabilistic Reasoning Abilities, but They Are Dependent on Specific Inputs
GPT-3 can perform addition and subtraction for small numbers (e.g., two- to three-digit numbers) and numbers that may appear often in text (e.g., 12345678+87654321), but its performance is poor for large numbers (Brown et al. 2020; Wang et al. 2021b). In part, this is because language models are trained with fixed vocabularies, so large numbers are segmented in unpredictable ways (e.g., 937523 → 93 752 3) (Wallace et al. 2019b; Jiang et al. 2020a).9 As numbers increase in arithmetic problems, autoregressive models start producing non-numeric responses entirely (Fujisawa and Kanai 2022). Larger language models are significantly better at arithmetic than smaller models (Brown et al. 2020), but the models’ performance on arithmetic and time unit conversion is highly correlated with the frequency of the inputs in text corpora (Razeghi et al. 2022).
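The tokenization issue can be seen directly by running numbers through a subword tokenizer. The sketch below assumes the GPT-2 BPE tokenizer via the Hugging Face transformers library; the exact splits depend on the learned vocabulary.

```python
# Sketch of why arithmetic is hard for fixed-vocabulary models: multi-digit
# numbers are split into unpredictable subword chunks. Assumes the GPT-2 BPE
# tokenizer; exact splits depend on the learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for number in ["12", "1234", "937523", "12345678"]:
    tokens = tokenizer.tokenize(number)
    print(number, "->", tokens)   # e.g., "937523" may split into several chunks
```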
When solving mathematical word problems, autoregressive models are sensitive to slight modifications in wording, regardless of whether the modifications change the solution (Stolfo et al. 2022). GPT-3 performance drops when word problems include irrelevant context (Shi et al. 2023), and, similar to people, reinforcement-learning-tuned GPT-3 is sensitive to syntactic and lexical heuristics (e.g., responding with a salient number such as $1 from the prompt, even if incorrect; Hagendorff, Fabi, and Kosinski 2022). Autoregressive models perform poorly (<10% accuracy) on competition math problems, even with fine-tuning (Hendrycks et al. 2021b). Still, when probabilistic scenarios are described (e.g., gambling tasks), GPT-3 can make decisions better than chance, even outperforming people in some tasks; however, its “exploration” behavior of uncertain possibilities is essentially random instead of targeted or information optimal (Binz and Schulz 2023).
7 Memorized vs. Novel Text
As seen in previous sections, language models are sensitive to specific examples and words when applying linguistic rules and world knowledge. These sensitivities can be viewed as instances of memorization or under-generalization of the examples observed during pre-training (Discussion Section 10.2). Models are reasonably likely to generate text memorized during pre-training (Section 7.1), but they can also generate novel text based on an input context (Section 7.2). Memorization has direct implications for language model usage in practice; models may produce plagiarized or even private information (Section 8.2), and they may overperform on benchmarks that are inadvertently included in pre-training data.10 As discussed in the next sections, memorization in language models can be reduced by pre-training the models on deduplicated pre-training data or by increasing sampling temperatures during text generation.
7.1 As Language Models Scale, They Are More Likely to Generate Memorized Text from the Pre-training Corpus
Autoregressive language models assign higher probabilities to exact sequences from the pre-training corpus; memorized sequences can be extracted by generating many sequences and filtering to the most probable (Carlini et al. 2021). Without any prompting, autoregressive models with around 1.5B parameters output about 1%–5% memorized tokens, defined as 50+ length exact sequences from the pre-training corpus (Lee et al. 2022). Providing the start of a memorized sequence makes the models more likely to generate the memorized continuation (Lee et al. 2022; Carlini et al. 2023), and examples that appear more frequently in the pre-training corpus are more likely to be memorized (Kandpal, Wallace, and Raffel 2022; Carlini et al. 2023). Deduplicating the pre-training data can reduce memorization by up to 10× while also improving language modeling performance overall (Lee et al. 2022; Hernandez et al. 2022).
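The extraction procedure described above can be sketched as follows: sample many unprompted continuations and keep the generations that the model itself assigns the lowest perplexity, which are the most likely to be memorized. The sketch assumes GPT-2 via the Hugging Face transformers library and uses far fewer samples and a simpler ranking criterion than Carlini et al. (2021).

```python
# Sketch of memorization extraction: sample many unprompted continuations and
# keep those the model assigns the lowest perplexity. Assumes GPT-2 via Hugging
# Face transformers; sample counts and thresholds are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(ids):
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean next-token cross-entropy
    return torch.exp(loss).item()

start = torch.tensor([[tokenizer.bos_token_id]])  # generate with no prompt text
candidates = []
for _ in range(100):                              # real attacks sample far more
    ids = model.generate(start, do_sample=True, top_k=40, max_length=64,
                         pad_token_id=tokenizer.eos_token_id)
    candidates.append((perplexity(ids), tokenizer.decode(ids[0])))

# The lowest-perplexity generations are the best candidates for memorized text.
for ppl, text in sorted(candidates)[:5]:
    print(round(ppl, 1), text[:80])
```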
Autoregressive models generate more memorized sequences as they scale up (Carlini et al. 2023), along with more paraphrased memorized text (Lee et al. 2023). Paraphrased or slightly modified memorized text is more likely when a model is manually restricted from producing verbatim copied text (Ippolito et al. 2022). Truncating probability distributions during generation (e.g., top-k or nucleus sampling; Section 2.3) increases the probability of memorized text relative to temperature sampling (Lee et al. 2023). During pre-training, larger masked and autoregressive models memorize examples after fewer observations, but they can memorize more of the training data before overfitting; they also “forget” less, regressing to a higher forgetting baseline after observing an example only once (Tirumala et al. 2022). In small models (e.g., 18M parameters), more examples are memorized as the models’ vocabulary sizes increase, even after accounting for total parameter count (Kharitonov, Baroni, and Hupkes 2021).
7.2 Language Models Generate Novel Text That Is Consistent with the Input Context
Still, language models can generate novel text consistent with novel input contexts, without just generating memorized examples. On average, text generated by autoregressive language models includes more concrete and frequent words, along with shallower syntactic structures, than text generated by people (Tuckute et al. 2022). It contains more frequent local structures (e.g., 3-grams, sequences of three tokens) than human-generated text (Tuckute et al. 2022), but its longer sequences are more novel than human-generated text (despite occasional memorized passages; McCoy et al. 2021). Model-generated text has different proportions of unique tokens per sequence from human-generated text, but it has similar token frequencies and similar sequence lengths overall (Meister and Cotterell 2021). Autoregressive models still occasionally degenerate into repetitive strings; once the model makes a “mistake”, its own generated context may be unlike anything observed during pre-training (a problem known as exposure bias), leading it to fall back on degenerate behavior such as looping and repetition (Chiang and Chen 2021). Sampling-based generation strategies (e.g., temperature or nucleus sampling; Section 2.3) produce less repetitive but also less factual text than sequence-based strategies (e.g., beam search) (Massarelli et al. 2020).
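The decoding strategies mentioned throughout this section differ only in how the next-token distribution is transformed before a token is chosen. The sketch below (an illustrative standalone implementation, not any particular library’s API) applies temperature scaling, top-k truncation, and nucleus (top-p) truncation to a raw logit vector:

```python
# Illustrative sketch of temperature, top-k, and nucleus (top-p) sampling
# applied to a next-token logit vector (see Section 2.3).
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64) / temperature

    if top_k is not None:
        # Keep only the k highest-scoring tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability
        # reaches top_p, then renormalize (nucleus sampling).
        order = np.argsort(probs)[::-1]
        cutoff_index = int(np.searchsorted(np.cumsum(probs[order]), top_p))
        keep = order[: cutoff_index + 1]
        truncated = np.zeros_like(probs)
        truncated[keep] = probs[keep]
        probs = truncated / truncated.sum()

    return int(rng.choice(len(probs), p=probs))

# Example: near-greedy behavior at low temperature, more diversity at high.
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.9, seed=0))
```

Beam search, by contrast, deterministically tracks the highest-probability sequences rather than sampling from the distribution at each step.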
Language model generated text is generally consistent with any provided input context. Unsurprisingly, autoregressive models are better at predicting upcoming tokens given more context (Cífka and Liutkus 2022). Larger autoregressive models generate more coherent and on-topic text than smaller models, often with fewer factual and commonsense errors (Dou et al. 2022). Masked and autoregressive models tend to repeat syntactic structures from the input context (Sinclair et al. 2022), with grammatical vs. ungrammatical contexts inducing greater grammaticality or ungrammaticality respectively in autoregressive models (Sinha et al. 2022a). When presented with a syntactically ambiguous input, autoregressive models generate text with probabilities split between the possible upcoming structures (Aina and Linzen 2021). However, the models can be prompted to modify the input text style, with performance improving significantly with model size (Reif et al. 2022). Without being asked, language models naturally generate text that is consistent in both personality and politics with the input context (Section 9.3).
Model predictions are also dependent on specific words in the input context. Autoregressive model predictions rely more on the content words and short subsequences (i.e., local n-grams) in the distant past context than on the named entities and general topics (O’Connor and Andreas 2021). Masked and autoregressive models are primed by previous words to produce semantically related words (Misra, Ettinger, and Rayz 2020), even for semantically related words that would otherwise be unlikely (Michaelov and Bergen 2022a). Language models rely on this semantic similarity heuristic for a wide variety of predictions, and it can confound models’ recall of facts and their reasoning abilities (Discussion Section 10.2). Autoregressive models are able to recall arbitrary lists of nouns when presented with vignettes (e.g., “Mary wrote down a list of words...”), regardless of the size of the list and the length of any intervening text (Armeni, Honey, and Linzen 2022).
8 Bias, Privacy, and Toxicity
Content warning: This section discusses offensive content and stereotypes.
Despite their wide range of capabilities, language models sometimes generate harmfully biased (Sections 8.3 and 8.4), offensive (Section 8.1), and private (Section 8.2) text. These outputs can often be identified by human raters or automated systems (Jigsaw 2017; Welbl et al. 2021; Lees et al. 2022). The specific potential harms from these responses depend on broader societal context (Bender et al. 2021; Weidinger et al. 2021, 2022); for example, social biases can be analyzed along multiple dimensions, and their effects depend on the communities and power relations involved (Blodgett et al. 2020). Previous surveys discuss potential societal impacts and harms of language model biases (Dev et al. 2022), along with how previous language model bias studies relate to these harms (Blodgett et al. 2020). Models used in industry are often fine-tuned with language modeling on curated “safe” text (Cohen et al. 2022), and there are a wide variety of other bias mitigation strategies (Meade, Poole-Dayan, and Reddy 2022). Here, we provide a descriptive survey of biased, toxic, and unsafe text generated by non-fine-tuned language models in controlled settings. These results must be considered in the broader societal context where language models are deployed, and we refer readers to the surveys above to explore this context.
8.1 Language Models Sometimes Generate Offensive Text and Hate Speech, Particularly in Response to Targeted Prompts
When interacting with autoregressive language models presented as chatbots, people can successfully “red-team” the models into producing harmful and offensive text such as swearing, harassment, insults, and hate speech, along with text describing violence, crime, abuse, and illegal substances (Ganguli et al. 2022b). Even without any prompting, or when prompted with “safe” text, autoregressive models often produce this “toxic” text within as few as 25 generated samples (Gehman et al. 2020). Toxic outputs occur at similar rates regardless of model size, likely due to the prevalence of toxic content in the Web text observed during pre-training (Gehman et al. 2020; Ganguli et al. 2022b). Automated prompt construction methods can identify input text prompts that induce racist outputs and hate speech (Wallace et al. 2019a), controversial opinions (Heidenreich and Williams 2021), or more general toxic outputs (Mehrabi et al. 2022), although these methods often rely on access to internal model states. Without such access, a smaller autoregressive language model can be fine-tuned or reinforcement-learning-tuned to generate text prompts that induce toxic content in a larger model (Perez et al. 2022a).
8.2 Language Models Can Expose Private Information, but Often Not Tied to Specific Individuals
Similarly, autoregressive language models can be prompted to generate personally identifiable information (PII) such as phone numbers or email addresses, using prompts generated by people (Ganguli et al. 2022b) or other language models (Perez et al. 2022a). Given known contexts where emails appear in the pre-training data (e.g., “mailto: ...”), larger autoregressive models generate more valid emails than smaller models (Huang, Shao, and Chang 2022). This aligns with results showing that larger models are more likely to generate memorized text (Section 7.1). Still, current approaches mostly produce random or fake PII not tied to individuals (Perez et al. 2022a); for example, templates such as “The email of X is _” have extremely low success rates (Huang, Shao, and Chang 2022). When masked models are pre-trained on clinical data, it is difficult to prompt the models to disclose health information given a patient’s name (Lehman et al. 2021). When prompted with a first name, larger autoregressive models are more likely to produce the last name of a famous or historical figure (Shwartz, Rudinger, and Tafjord 2020). Regardless of whether PII can be tied to individuals, common expectations of privacy may be impossible to achieve when training on Web text data; privacy expectations fluctuate, and information on the Web is often intended for specific in-groups that the pre-training data does not distinguish (Brown et al. 2022).
8.3 Language Model Behavior Varies Across Demographic Groups, Both in Terms of Raw Performance and Probabilities of Toxic Text
Language models exhibit systematic differences in performance across text produced by or mentioning different demographic groups. Both masked and autoregressive models assign different probabilities on average to text including different demographic terms, covering ability, age, body type, ethnicity, gender, nationality, politics, race, religion, sexual orientation, and socioeconomic status; for example, sentences including “ace”, “AAPI”, “AFAB”, or “pagan” generally have low probabilities (Smith et al. 2022a), as do gender-neutral pronouns themselves (e.g., singular “they” or “xe”; Brandl, Cui, and Søgaard 2022). Masked and autoregressive models are worse at predicting tokens written by certain demographics, with the best performance for young white men and the worst performance for young non-white men (Zhang et al. 2021a), and poor performance for African-American Vernacular English (AAVE) text (Groenwold et al. 2020). When predicting country names in factual sentences, masked models have worse performance for countries with lower GDP, likely because those countries are less frequent in text corpora (Zhou, Ethayarajh, and Jurafsky 2022). Of course, when considering different demographic groups and cultures, researchers must consider cross-cultural differences in values and concepts, along with raw language modeling performance (Hershcovich et al. 2022; Arora, Kaffee, and Augenstein 2022).
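Many of the probability differences above are measured with simple template probes that vary only a demographic term. The sketch below (an illustrative protocol with placeholder templates and terms, not the setup of any cited study) compares an autoregressive model’s log-probabilities across such minimal pairs:

```python
# Illustrative sketch of template-based probing: compare a model's
# log-probability for minimally different sentences that mention different
# demographic terms. The template and terms are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return -float(loss) * (ids.shape[1] - 1)  # total log-probability of the sequence

template = "I am a {} person."
for term in ["young", "old", "disabled", "religious"]:  # placeholder terms
    print(term, sequence_logprob(template.format(term)))
# Systematic gaps between such minimal pairs are one signal of the probability
# differences across demographic terms reported above.
```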
On top of performance differences, language models are more likely to generate negative sentiment and toxic text when specific demographic groups are mentioned (Example 9). When refugees or disabled people are mentioned, masked and autoregressive models are substantially more likely to generate toxic content (Hassan, Huenerfauth, and Alm 2021; Ousidhoum et al. 2021). Prompts mentioning women are slightly more likely to result in toxic content (Ousidhoum et al. 2021), and prompts including LGBTQIA+ identity words produce harmful or offensive content 13% of the time in masked models (350M parameters), up to 87% for some identity groups (Nozza et al. 2022). Autoregressive models are more likely to generate negative sentiment text when completing AAVE sentences (Groenwold et al. 2020), sentences about black or gay people (Sheng et al. 2019), or sentences about nonbinary, disabled, or Muslim people, with unpredictable effects of intersectionality (Magee et al. 2021). This sentiment bias occurs even when the demographic identity groups are not mentioned explicitly, such as when using names from Wikipedia matching different identity groups (Dhamala et al. 2021). Effects of gender depend on context; prompts about women result in more negative sentiment in workplace contexts, while prompts about men result in more negative sentiment in more general descriptive contexts (Sheng et al. 2019). Effects of demographic identities on sentiment and toxicity are reduced when using beam search as opposed to top-k or nucleus sampling during text generation (Section 2.3) (Sheng et al. 2021b; Akyürek et al. 2022). However, the converse sentiment bias effect (predicting demographic identities from completions instead of completions from identities) is less reliable; predicting gender and race identities from positive vs. negative sentiment completions only sometimes exhibits bias effects in masked and autoregressive models (Kurita et al. 2019; Silva, Tambwekar, and Gombolay 2021).
8.4 Language Models Reflect Harmful Stereotypes Based on Gender, Sexuality, Race, Religion, and Other Demographic Identities
Many studies have considered bias in predicting people’s occupations and professions. Occupation predictions from autoregressive language models are biased by names’ continental origins and by explicitly stated identities, and these predictions correlate with official labor statistics in the United States; occupational biases based on gender in language models are slightly less skewed than the true labor statistics (Kirk et al. 2021). Similarly, when predicting gendered pronouns given a known occupation, masked language model predictions are correlated with labor statistics on gender (Bartl, Nissim, and Gatt 2020; de Vassimon Manela et al. 2021), although predictions are sensitive to the specific prompt sentence (Touileb 2022). In autoregressive models, gendered pronoun predictions based on occupations are more biased in simple templates than in natural sentences from Wikipedia (Alnegheimish, Guo, and Sun 2022). Some studies find larger gender occupation biases in larger models (Tal, Magar, and Schwartz 2022; Srivastava et al. 2022), but these effects are inconsistent (de Vassimon Manela et al. 2021; Alnegheimish, Guo, and Sun 2022).
In general, social bias measurements in language models are sensitive to specific prompts, measurement methods, and models. Across different pre-training runs, masked models exhibit different levels of preference for stereotypical descriptions of people, particularly for individual demographic groups, despite similar downstream task performance (Aribandi, Tay, and Metzler 2021). Gender occupation biases fluctuate significantly during model pre-training, even after the loss has plateaued (Tang and Jiang 2022). Results when predicting gendered pronouns in potentially biased scenarios are sensitive to paraphrasing and punctuation changes in the prompt (Seshadri, Pezeshkpour, and Singh 2022); prompt and metric choices lead to noisy results for gender occupation bias in autoregressive models as well (Mattern et al. 2022; Akyürek et al. 2022). Despite improving logical reasoning, prompting GPT-3 to “think step-by-step” (Section 6.1) increases the probability that the model will generate stereotypical answers to questions, based on people’s race, gender, religion, and other demographic identities (Shaikh et al. 2022). Effects of social biases in general appear to increase with model size across bias measurement tasks (Srivastava et al. 2022). Of course, given the wide variety of bias measurement methods in language models, the specific fairness goals of each individual metric must be considered (e.g., pairwise group fairness, group against baseline fairness, and/or overall between-group fairness; Czarnowska, Vyas, and Shah 2021).
9 Misinformation, Personality, and Politics
Even outside of toxic and harmfully biased text, language models sometimes generate unfactual and misleading text. They generate convincing unfactual text (Section 9.1) that is difficult to distinguish from human-generated text (Section 9.2), and their generated text depends on the political leaning and perceived personality of the input context (Section 9.3). These behaviors can be more difficult to detect than explicitly biased and toxic text, because the outputs are often more subjective or controversial, and they primarily emerge in large models (Section 10.1). As noted in Section 5, factual knowledge in language models can be improved by using search and retrieval-enhanced models (e.g., Guu et al. 2020; Borgeaud et al. 2022; Schick et al. 2023); more fine-grained control over model outputs can be accomplished by conditioning the models on specific input data using controlled text generation (Li et al. 2021; Zhang et al. 2023a).
9.1 Language Models Can Generate Convincing Unfactual Text and Unsafe Advice
As they scale, autoregressive language models are more likely to generate text that affirms a conspiracy theory as fact when prompted with a conspiracy-related topic (Levy, Saxon, and Wang 2021). They are also more likely to affirm common misconceptions (e.g., “If you crack your knuckles a lot, you may develop arthritis”; Lin, Hilton, and Evans 2022), although this result is inconsistent across studies (Rae et al. 2021). Larger models tend to be more consistent in their responses, producing semantically similar responses to semantically similar prompts, regardless of whether their responses are factually correct (Raj, Rosati, and Majumdar 2022). Given access to internal model states, automated methods can identify text prompts that induce specific stances to common controversial topics (Heidenreich and Williams 2021). Perhaps worryingly, people are more likely to rate GPT-3 generated tweets as true than human-generated tweets about vaccines, COVID-19, climate change, and other topics, regardless of whether they are factual or not (Spitale, Biller-Andorno, and Germani 2023). Conversations with GPT-3 can lead people to change their opinions on topics such as BLM (Black Lives Matter) and climate change (Chen et al. 2022).
Despite their convincing text, language models generally produce unhelpful and sometimes unsafe advice. GPT-3 produces worse advice than people 95% of the time in situations described on Reddit (Zellers et al. 2021). Given a fill-in-the-blank task for stock market decisions, masked models have a preference to buy stocks rather than sell them, and they prefer specific stock categories such as utilities and materials (Chuang and Yang 2022). Although autoregressive models only rarely generate physically unsafe advice on their own (about 1% of prompt responses), they predict slightly higher probabilities for unsafe than safe completions when given two possible options (Levy et al. 2022). When provided with a social rule and a described scenario with potentially permissible rule-breaking behavior, both masked and autoregressive models only agree with human permissibility ratings marginally above chance (Jin et al. 2022a).
9.2 Model-generated Text Is Difficult to Distinguish from Human-generated Text
Despite subtle differences between human and language model generated text (Section 7.2), people have difficulty distinguishing the two, particularly as language models scale (Brown et al. 2020). People can only distinguish news articles generated by 175B parameter autoregressive models from human-generated articles with 52% accuracy (compared to 50% random chance; Brown et al. 2020). Similar accuracies are reported when people are asked to identify GPT-3 paraphrased Wikipedia paragraphs (Wahle et al. 2022) and GPT-3 generated tweets (Spitale, Biller-Andorno, and Germani 2023). People are better at identifying language model generated text in longer sequences (Ippolito et al. 2020), but even when provided with specialized instructions and examples, people only reach about 55% accuracy (Clark et al. 2021). In passages partially generated by smaller autoregressive models (e.g., 1.5B parameters), artificial intelligence graduate students are able to identify where the model-generated text begins with 23% accuracy relative to 10% random chance (Dugan et al. 2023).
In general, people correctly assume that human-generated text is more sensical (e.g., fewer commonsense errors) and less repetitive than model-generated text (Clark et al. 2021; Jakesch, Hancock, and Naaman 2023). However, people also tend to predict that text is human-generated when it is more grammatical, uses shorter words, and contains more frequent bigrams; in reality, human-generated text is less grammatical, uses slightly longer words, and contains fewer frequent bigrams than model-generated text (Jakesch, Hancock, and Naaman 2023). With fine-tuning or given examples, language models themselves achieve better performance than people at identifying model-generated text, but they still have relatively low accuracy overall (Jawahar, Abdul-Mageed, and Lakshmanan 2020; Wahle et al. 2022). To combat these difficulties in distinguishing human vs. model generated text, researchers have proposed “watermarking” model-generated text by slightly increasing the probabilities of “whitelist” tokens during text generation (Kirchenbauer et al. 2023), or by explicitly replacing some tokens with whitelist tokens (He et al. 2022b).
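A simplified version of the generation-time watermark idea is sketched below; the hashing scheme, greenlist fraction, and bias value are illustrative assumptions rather than the exact configuration of Kirchenbauer et al. (2023):

```python
# Simplified sketch of a generation-time watermark: the previous token seeds a
# pseudo-random split of the vocabulary into a "greenlist" and a "redlist",
# and greenlist logits receive a small positive bias. The hash, split ratio,
# and bias are illustrative.
import numpy as np

def watermark_logits(logits, prev_token_id, vocab_size, gamma=0.5, delta=2.0):
    rng = np.random.default_rng(prev_token_id)  # seed on the previous token
    greenlist = rng.permutation(vocab_size)[: int(gamma * vocab_size)]
    biased = np.array(logits, dtype=np.float64)
    biased[greenlist] += delta                  # boost greenlist tokens
    return biased

def greenlist_fraction(token_ids, vocab_size, gamma=0.5):
    """Detection side: fraction of tokens drawn from their context's greenlist.
    Values well above gamma suggest watermarked text."""
    hits = 0
    for prev, cur in zip(token_ids, token_ids[1:]):
        rng = np.random.default_rng(prev)
        greenlist = rng.permutation(vocab_size)[: int(gamma * vocab_size)]
        hits += int(cur in greenlist)
    return hits / max(len(token_ids) - 1, 1)
```

In such a scheme, the biased logits would replace the raw logits at each generation step, and detection requires only the tokenizer and the seeding scheme rather than access to the generating model.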
9.3 Language Model “Personality” and Politics Depend on the Input Context
Recent studies have found that language models generally mimic the political leanings and personality traits implied by a given input. For example, larger autoregressive models are more likely to repeat political views expressed in a provided prompt (Perez et al. 2022b). When prompted with a liberal vs. conservative identity (e.g., “As a liberal, ...”) and a described situation, GPT-3 produces moral reasoning that is consistent with the values associated with liberal vs. conservative ideologies in moral foundations theory (Simmons 2022). When prompted with a person’s demographic information or personal background as context, GPT-3 produces similar words to describe political parties as that person, and it even predicts similar voting patterns and multiple choice responses to political surveys (Argyle et al. 2023). Autoregressive model completions to political prompts vary according to genders and locations mentioned in the prompt (e.g., United States states with different political leanings), although they tend to generate liberal-leaning text overall (Liu et al. 2022c). When asked to summarize text, GPT-3 shifts values in the input text towards the United States’ moral and political values as opposed to values from other countries (Johnson et al. 2022). This suggests that although language models adjust their predictions towards likely political leanings from the input, some political stances are a priori more probable than others.
Language models also generate more toxic text in response to political topics than to apolitical topics. Autoregressive models tuned for dialogue generate hyperpartisan responses to neutral political prompts over 50% of the time and offensive responses 30% of the time; the probability of hyperpartisan responses increases with politically biased prompts (Bang et al. 2021). These models are also more likely to generate insults in response to controversial topics such as BLM or MeToo than to less emotionally charged topics such as veganism or WFH (work from home) (Sheng et al. 2021a). Linguistic bias cues (e.g., “claimed” vs. “stated”) increase the non-neutral sentiment of generated text in autoregressive models (Patel and Pavlick 2021). When people converse with GPT-3 about controversial topics, people with minority opinions or less formal educational background report lower satisfaction with the interaction, often due to more negative responses from the model (Chen et al. 2022).
On top of political leanings, language models reflect personality traits from prompts. When prompted with a person’s self description of their personality, both masked and autoregressive language models complete Big Five personality surveys similarly to that person; however, the models score low on agreeableness and openness to experience regardless of prompt (Caron and Srivastava 2022). GPT-3 exhibits similar effects, answering personality questions similarly to personalities described in given prompts (Jiang et al. 2022). Without prompting, autoregressive models have high psychopathy scores and low self-satisfaction scores on psychometric surveys (Li et al. 2022a). However, GPT-3 responses to psychometric and demographic surveys vary significantly depending on sampling temperature (Section 2.3), resulting in different self-reported age, gender, personality, and values (Miotto, Rossberg, and Kleinberg 2022). When given prompts describing classic psychology experiments (e.g., the Milgram Shock Experiment), GPT-3 replicates average human results to a reasonable degree (Aher, Arriaga, and Kalai 2022). Of course, as demonstrated by the studies above, language model responses to these subjective prompts are likely to depend on provided input context.
10 Discussion
The previous sections discuss a wide range of language model capabilities and weaknesses, covering syntax, semantics, pragmatics, world knowledge, reasoning, memorization, and bias. In this section, we synthesize these results framed from the perspectives of model scale (Section 10.1) and text pattern generalization (Section 10.2), and we highlight recent research tying behavioral results to mechanistic analyses of language model internals (Section 10.3).
10.1 Effects of Scale
Recent work has increasingly focused on the impact of language model “scale” on model capabilities (Kaplan et al. 2020; Hendrycks et al. 2021a; Rae et al. 2021; Tay et al. 2022a, 2022b), and public language model releases often include multiple model sizes for evaluation (Brown et al. 2020; Zhang et al. 2022b). Language model scale is traditionally measured by number of parameters, usually between 100M and 500B parameters, although recent studies have also measured model scale using required computation during pre-training (FLOPs; Wei et al. 2022b, 2023). Scaling research focuses on autoregressive language models, which exhibit substantial performance improvements on many text generation tasks as they scale; fewer studies evaluate how model scale affects masked language model behavior (Artetxe et al. 2022). Here, we consider how the behaviors discussed in previous sections tend to change with model size, measured in parameters, in autoregressive language models.
Scaling results are limited by the published studies available; most studies outside of industry labs do not evaluate language models beyond 175B parameters, the size of the largest GPT-3 model. Some tasks, such as domain-specific question-answering, arithmetic, logical event ordering, and proverb prediction, exhibit unexpectedly large performance gains beyond 175B parameters (Wei et al. 2022b; Chowdhery et al. 2022). Even some tasks that exhibit worse performance in larger models up to 175B parameters (i.e., “inverse scaling”) exhibit sudden performance improvements beyond 175B parameters (i.e., “U-shaped scaling”); many of these tasks contain a “distractor” feature or subtask that medium-sized models learn, but that large models can successfully ignore (Wei et al. 2023). In language modeling overall, the examples learned successfully by larger models are roughly a superset of the examples learned by smaller models (Xia et al. 2022). For some examples that are not successfully learned in 1B parameter models, models over 5B parameters exhibit an initial phase where their loss increases during pre-training before the examples are eventually learned (Xia et al. 2022). Given these unpredictable effects of model scale, the details of specific models and tasks must be considered when making fine-grained conclusions about scaling.
Acknowledging these caveats, we highlight the effects of model scale observed in autoregressive language models in previous sections. Larger models learn syntactic rules more robustly than smaller models, but models across scales still generate grammatical text in most cases (Section 3.1). Larger models are worse at recognizing negation (Section 4.2) but better at recognizing figurative language (Section 4.4). They are more sensitive to the implied mental states of characters in text, but models across scales still struggle with pragmatics (Section 4.5). Larger models learn more commonsense properties of objects and facts (Section 5.1), more fine-grained word properties (Section 4.1), and more correct arithmetic (Section 6.2), but this may be because they memorize more examples during pre-training (Section 7.1; see also under-generalization in Section 10.2). Large models (e.g., over 100B parameters) can be prompted to generate explicit multi-step reasoning by asking them to “think step by step” (Kojima et al. 2022; Section 6.1), but logical reasoning overall improves only slightly beyond around 10B parameters (Rae et al. 2021). Model size appears to have little impact on offensive text generation (Section 8.1), but text generated by larger models is harder to distinguish from human-generated text (Section 9.2), and larger models are more likely to mimic political opinions in a given input (Section 9.3). The prevalence of harmful social biases in language models is inconsistent both within and across model sizes (Section 8.4). Overall, larger language models tend to exhibit performance equal to or better than that of smaller models on most tasks, but their performance is still far from perfect, and they come at a higher environmental and computational cost (Strubell, Ganesh, and McCallum 2019).
10.2 Language Modeling as Generalization
Text Pattern Generalization
Many of the strengths and weaknesses of language models can be viewed through the lens of text pattern generalization. Over-generalizations and under-generalizations of learned patterns in text simultaneously provide insights into the impressive capabilities and brittle responses of large language models (Ganguli et al. 2022a). Specifically, due to the productivity of language (i.e., infinitely many combinations of patterns; Piantadosi and Fedorenko 2017), language models must learn to generalize to novel examples, even when those examples would traditionally be considered “in-distribution” in generalization research (i.e., within the expected range of examples seen during pre-training; Hupkes et al. 2022). The in-distribution generalizations made by language models provide insights into how the models will likely behave in practice.
Through their token prediction training paradigm, language models are trained to generalize from text examples observed during pre-training to novel examples. Given the beginning of a sentence never observed during pre-training, a language model can generate plausible completions to that sentence, similar to people generalizing from past experience to novel sentences (Piantadosi and Fedorenko 2017). Again, as in people (Perfors, Regier, and Tenenbaum 2006; Berwick et al. 2011; Dabrowska 2015), there are infinitely many generalization approaches that a language model can apply to extrapolate from pre-training examples (e.g., linear vs. hierarchical syntactic generalizations; McCoy, Frank, and Linzen 2018; White and Cotterell 2021). Any text pattern that predicts upcoming tokens can under-influence or over-influence language model predictions (i.e., under-generalization vs. over-generalization), both in the set of examples to which the pattern is applied and in the extent to which the pattern affects model predictions. The specific generalizations that a language model learns are dependent on the language data observed and inherent biases from the model architecture and random initialization, also known as inductive biases (White and Cotterell 2021).
For example, one generalization approach might be to strictly memorize all training examples verbatim; the output token distribution for any observed example would be exactly equal to the distribution observed during pre-training, and any example not observed verbatim during pre-training would produce a random uniform distribution or some other degenerate prediction. This would be an example of under-generalization, as the model assumes that each individual example does not reflect any patterns that can be generalized to other examples. In practice, while language models do exhibit memorization of examples (Section 7.1), they appear to still extrapolate learned patterns from the memorized examples without overfitting (Tirumala et al. 2022), suggesting that they are not entirely under-generalizing.
On the other end of the spectrum, a language model might always generate the most frequent token (e.g., “the”) or condition only on the previous token (i.e., a bigram model). Language models pass through both of these stages during pre-training (Chang and Bergen 2022). These are examples of over-generalization, where token frequency rules and bigram rules over-influence model predictions. In many cases, this over-generalization may occur due to under-generalization of other rules that would otherwise refine the over-generalized prediction. Viewing these errors as generalization errors ties language model analysis research to broader generalization research in machine learning and NLP (Hupkes et al. 2022).
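As a concrete point of reference, the toy sketch below (with an illustrative corpus) implements the two degenerate strategies just described, a unigram predictor and a bigram predictor; a language model that behaved like either one would be over-generalizing simple frequency rules:

```python
# Toy illustration of the two degenerate strategies described above: a unigram
# predictor (always emit the globally most frequent token) and a bigram
# predictor (condition only on the previous token). The corpus is illustrative.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[prev][cur] += 1

def unigram_predict(_context):
    return unigram_counts.most_common(1)[0][0]        # always "the"

def bigram_predict(context):
    prev = context[-1]
    return bigram_counts[prev].most_common(1)[0][0]   # depends only on `prev`

print(unigram_predict(["sat", "on"]))  # -> "the"
print(bigram_predict(["sat", "on"]))   # -> "the"
```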
Generalizations in Language Models
Indeed, many of the weaknesses exhibited by large language models can be interpreted as examples of over-generalization or under-generalization. For example, language models’ sensitivity to intervening clauses and specific words in subject-verb agreement reflects under-generalization of the subject-verb agreement rule (Section 3.2). Similarly, the models’ sensitivity to paraphrasing and punctuation changes when recalling facts (Section 5.2) reflects under-generalization of learned facts. Finally, the models’ sensitivity to specific inputs when constructing situation models (Section 4.3) and performing logical and numerical reasoning (Section 6) reflects a systematic under-generalization of many patterns and rules to novel contexts.
Specifically, the models’ reliance on pre-training corpus frequency for subject-verb agreement (Section 3.2), facts (Section 5.2), word meanings (Section 4.1), and arithmetic (Section 6.2) might suggest that language models require many examples to correctly generalize some patterns, or it might suggest that the models are simply memorizing many under-generalized instances of each pattern. Given the models’ sensitivity to specific inputs for these capabilities, the memorization case appears more likely—for example, that the models memorize many examples of arithmetic with minimal generalization. Of course, these examples of under-generalization are not as severe as the models’ inability to learn (and therefore under-generalization of) negation (Section 4.2), pragmatics (Section 4.5), and many commonsense inferences (Sections 5.1 and 5.3). In some of these cases, the language modeling objective may simply not capture the grounded and interactive features necessary to learn such patterns.
Language models also exhibit cases of over-generalization, often when some other under-generalized pattern fails to be applied. When models fail to recall facts (Section 5.2), make commonsense inferences (Section 5.3), or solve mathematical word problems (Section 6.2), they often fall back to over-generalized heuristics such as predicting semantically similar tokens to the input context (Section 7.2). Overreliance on token position-based patterns (e.g., local n-grams) may reflect an over-generalization of position-based patterns as well (Sections 3.4 and 7.2). Furthermore, harmful social biases in language models (Sections 8.3 and 8.4) can be interpreted as over-generalizations of patterns observed in the pre-training corpus. Even when harmful biases are present in the pre-training corpus due to human social biases and dataset demographic imbalances, it is not desirable for language models to generalize these patterns.
Understanding when language models generalize correctly vs. incorrectly is important for the safe deployment of the models in practice. Future work in language model behavioral analysis might consider the specific linguistic patterns and types of patterns that language models over-generalize and under-generalize, along with mitigation strategies. In particular, future research might consider how generalization patterns change with model scale; it remains unclear to what extent the benefits of model scale are due to (1) learning more robust and/or correct generalized patterns or (2) memorizing a larger number of specific under-generalized instances that together improve performance metrics. Again, given the models’ sensitivity to specific inputs even in larger models, the models appear to lean towards the latter.
10.3 Levels of Analysis in Understanding Language Models
As stated in the Introduction (Section 1.1), this survey focuses on behavioral analyses of language models. Other studies have investigated the internal mechanisms that lead language models to generate their predictions. These two approaches roughly mirror Marr’s computational and algorithmic levels of analysis in cognitive science, describing respectively (1) what the system does functionally and (2) the algorithms and representations the system uses to accomplish these functions (Marr 2010; Bechtel and Shagrir 2015; Trott 2023). Marr’s last level, the implementation level, would correspond most closely to the physical circuits and neuron-level backpropagation rules that govern neural network models. In many ways, the goals of language model analysis are to identify interpretable and generalizable principles that govern how language models work behaviorally and mechanistically, along with causal links between the two.
At the mechanistic (i.e., algorithmic) level, previous studies have probed the linguistic (and non-linguistic) information that can be extracted from language models’ internal vector representations of tokens (Tenney, Das, and Pavlick 2019; Rogers, Kovaleva, and Rumshisky 2020; Belinkov 2022), along with how the representation spaces are structured geometrically (Reif et al. 2019; Cai et al. 2021; Chang, Tu, and Bergen 2022). They have also studied whether the attention weights assigned by language models’ internal attention mechanism correlate with interpretable inter-token relationships (Clark et al. 2019; Kovaleva et al. 2019; Vig and Belinkov 2019), although the attention weights do not necessarily influence language modeling predictions in expected ways (Jain and Wallace 2019; Serrano and Smith 2019).
More recent work has established causal links between individual neurons (i.e., entries in the models’ vector representations) and language modeling predictions (Vig et al. 2020; Geva et al. 2021; Finlayson et al. 2021; Geva et al. 2022). For example, model representations of tokens at any layer can be interpreted as probability distributions over the language model vocabulary using the language model’s output vocabulary projection matrix (Geva et al. 2022); model parameters themselves can be interpreted using the same projections (Dar et al. 2022). Parameter-level interventions can modify factual associations in language models in targeted ways (Meng et al. 2022), establishing direct connections between language model behavior and internal mechanisms.
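The vocabulary-projection technique can be illustrated with a minimal “logit lens”-style sketch (using GPT-2 via the transformers library as an assumed example model; Geva et al. 2022 perform a related but more detailed analysis):

```python
# Minimal sketch: interpret intermediate hidden states as vocabulary
# distributions by projecting them through the output embedding matrix
# (in the spirit of Geva et al. 2022; the model choice is illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

unembedding = model.get_output_embeddings().weight   # (vocab_size, hidden_dim)
for layer, hidden in enumerate(outputs.hidden_states):
    state = hidden[0, -1]                             # final input position
    logits = state @ unembedding.T                    # project to vocabulary
    print(f"layer {layer}:", repr(tokenizer.decode(int(logits.argmax()))))
# Later layers typically converge toward the model's final prediction; applying
# the final layer norm before projecting gives cleaner intermediate readouts.
```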
Causal functionalities have also been established for individual attention heads in language models, e.g., for copying previous sequences from the input (Olsson et al. 2022). The attention mechanism has even been viewed as an in-context implementation of gradient descent, facilitating in-context learning (Section 2.3) without explicit parameter updates (Dai et al. 2022). Future work might apply similar analysis techniques to investigate the mechanisms underlying a wider range of language model behaviors, including under-generalized and over-generalized behaviors (Section 10.2), bridging the gap between behavioral and mechanistic levels of language model analysis.
11 Conclusion
In this survey, we have discussed a wide range of language model capabilities and weaknesses, covering over 250 studies of language model behavior from the past three years. We find that language models remain sensitive to specific inputs and surface features even as they scale to hundreds of billions of parameters. Many model strengths and weaknesses can be framed as correct or incorrect generalizations of text patterns. By distilling what is currently known about large language model capabilities, we hope to inform the deployment and regulation of large language models, while also inspiring future language model analysis research.
Appendix A: Literature Review Process
We identified papers to include in this survey using Semantic Scholar (Fricke 2018). From a seed of 271 relevant language model analysis papers (including the majority of the citation list from Rogers, Kovaleva, and Rumshisky 2020), we extracted all papers that cited any paper in the seed. This resulted in over 15K papers, last scraped on February 4, 2023. Anecdotally, the majority of recent language model analysis papers we encountered were included in this list. We manually filtered by title down to approximately 1,500 potentially relevant papers, gradually refining the scope as described in Section 1.1. We then further filtered by abstract down to approximately 400 highly relevant papers.
Acknowledgments
We would like to thank the other members of the UCSD Language and Cognition Lab for helpful discussions. Tyler Chang is partially supported by the UCSD HDSI graduate fellowship.
Notes
1. The process for identifying papers and studies for this survey is described in Appendix A. Code, key points, and links to cited papers are available at: https://github.com/tylerachang/llm-behavior-survey.
2. Along with differentiating results for masked vs. autoregressive models, we mention when studies use a GPT-3 model (autoregressive) that may or may not have been instruction-tuned (Section 2.2). For example, text-davinci-001 and text-davinci-002 are instruction-tuned, but davinci is not (OpenAI 2023b). Still, even the instruction-tuning stage uses only the language modeling objective. We specifically note if any study uses a model tuned with reinforcement learning (Section 2.2), e.g., text-davinci-003. When we refer to masked and autoregressive language models generally, we refer to models that are not fine-tuned.
3. Mentions of GPT-3 specifically may be instruction-tuned, but not tuned with reinforcement learning. See footnote in Section 2.
4. An asterisk before a phrase indicates ungrammaticality, as in Carnie (2002).
5. Specifically, Lee and Schuster (2022) study subject- and object-control verbs, as in the sentences: “The artist promised the lawyers to make fun of [himself/*themselves].” “The artist persuaded the lawyers to make fun of [*himself/themselves].”
6. Acceptability predictions in Mahowald (2023) are elicited from GPT-3 using few-shot prompting (Section 2.3).
7. Bag-of-words models only have access to surrounding tokens without any word order information. Unigram models make predictions solely based on word frequency, and n-gram models make predictions based only on the n − 1 previous tokens.
8. The causal attention mask in autoregressive language models only allows tokens to “attend” to previous tokens in the input. Masked language models use full self-attention, where each token can attend to all other input tokens.
9. Some language models manually enforce that numbers must always be segmented into individual digits (Chowdhery et al. 2022).
10. Some large language model evaluation datasets now include “canary” strings to help prevent the datasets from being included in pre-training corpora (Srivastava et al. 2022).