Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about large language model capabilities, thus providing a resource for applied work and for research in adjacent fields that use language models.

Transformer language models have revolutionized the field of natural language processing (NLP) since their introduction in 2018 (Radford et al. 2018; Devlin et al. 2019). Recent research and public attention have demonstrated that large language models (e.g., GPT-3/4, PaLM, and OPT; Brown et al. 2020; Chowdhery et al. 2022; Zhang et al. 2022b; OpenAI 2023a) can achieve remarkable performance both on standard NLP benchmarks and on open-ended natural language generation tasks from the general public (Wang et al. 2019; Johnson 2022). Already, language models are used in industry for applications ranging from Web search and chatbots to medical and financial document analysis (Nayak 2019; Broyde and Palmer 2021; Thewsey 2021; Lee 2023). Due to their widespread applicability, language models have been called “foundation models” for NLP (Bommasani et al. 2021).

Language models are trained to predict masked (i.e., hidden) or upcoming words from context, usually text. The models can then be fine-tuned for specific downstream tasks (e.g., text classification; Devlin et al. 2019), or they can be used directly for any text prediction task. As language model capabilities have expanded in recent years, they have increasingly been used in the text generation scenario with minimal or no fine-tuning (Brown et al. 2020). This approach requires no task-specific data or further training infrastructure, thus expanding the range of possibilities and audience for language model applications. In particular, the release of public APIs and interfaces such as GPT-3 and ChatGPT (Brown et al. 2020; OpenAI 2022) has enabled widespread public experimentation on the text generation capabilities of language models.

Yet, text generated by language models is often surprising even to NLP researchers. Previous studies have investigated both the outputs and internal mechanisms of language models, originally focusing on masked (i.e., fill-in-the-blank) “BERT” models and establishing the field of “BERTology” (see Rogers, Kovaleva, and Rumshisky 2020 for a survey). In the years since the last BERTology survey in 2020, and in tandem with the rise of large autoregressive models such as GPT-3 (i.e., models that predict upcoming words instead of masked words), language model analysis has shifted focus to these large autoregressive models. Because these models are often used without fine-tuning for open-ended text generation, there has been an increasing number of behavioral studies evaluating the output text probabilities of language models.

Despite this flurry of research, language model text generation behavior remains unpredictable. Although model performance on broad benchmark datasets is relatively consistent for a given model size and architecture, responses to specific inputs and examples are not. This feature makes large language models tempting but unreliable to use in many practical applications (Ganguli et al. 2022a). Furthermore, the rapid pace of NLP research and the quantity of individual studies make any progress in understanding model behavior difficult to track. As language models become more widespread and researchers from other fields invest interest in language models, it is increasingly important that our existing understanding of model behavior be made clear and accessible.

In this survey, we discuss over 250 recent studies of English language model behavior, covering syntax, semantics, pragmatics, world knowledge, reasoning, memorization, and bias. Language models generate fluent and coherent text, but their predictions are highly dependent on input context. Slight changes in input word choice and phrasing can lead to unfactual, offensive, or plagiarized text. Understanding these behaviors has broad implications for informed applications in industry (Weidinger et al. 2021) and general questions about meaning and “understanding” in artificial agents (Bender and Koller 2020; Mitchell and Krakauer 2022; Shardlow and Przybyla 2022).

To the extent possible, we avoid taking a stance on whether language models truly “understand” language. We also leave deeper ethical discussions of the societal implications of language models to surveys focused specifically on that area (e.g., Weidinger et al. 2021, 2022). Instead, we hope to provide a review of the empirical evidence for what behaviors language models exhibit in controlled settings. We discuss a wide range of model capabilities and weaknesses (Sections 3–9), and we synthesize results framed from the perspectives of model scale (Section 10.1) and text pattern generalization (Section 10.2). In this way, we hope to combat anecdote-driven language model “hype” with informed hype grounded in what language models actually can and cannot do (Bowman 2022), while also highlighting potential future directions of research in language model behavioral analysis.

1.1 Scope

We consider studies of masked and autoregressive English Transformer language models not fine-tuned for any specific downstream tasks. We exclude a wealth of research on fine-tuned model behavior (e.g., models tuned for natural language inference, a text classification task). During the fine-tuning process, language models are prone to overfitting to spurious correlations between text features and labels in the fine-tuning dataset (McCoy, Pavlick, and Linzen 2019; Kavumba et al. 2020; Wang et al. 2022b; Du et al. 2022a; Kavumba, Takahashi, and Oda 2022), and they can even “forget” syntactic and semantic information learned during the original pre-training process (Miaschi et al. 2020; Mosbach et al. 2020). Thus, fine-tuned language models are not necessarily reflective of the linguistic abilities of language models in general. Moreover, as noted in the Introduction, language models are increasingly used without fine-tuning on any individual task.

We also leave studies of non-English and multilingual language models to future surveys that can better focus on the many nuances of cross-lingual comparisons. We acknowledge that over-focusing on high-resource languages (e.g., English) is a recurring problem in NLP research (Joshi et al. 2020), and we hope that this survey provides a foundation to expand to less well-studied languages for which language models often perform poorly (Wu and Dredze 2020; Choudhury and Deshpande 2021). Future surveys might also study the behavior of language model variants such as vision-language models (Du et al. 2022b), code models (Chen et al. 2021), speech models (Lakhotia et al. 2021; Radford et al. 2022), knowledge-augmented models (Zhang et al. 2019), sparsely activated models (Fedus, Zoph, and Shazeer 2022), or compressed models (Sanh et al. 2019; Zafrir et al. 2019). In the current survey, we consider non-augmented “out-of-the-box” Transformer language models, as used in the majority of NLP research.

Finally, we limit our survey to behavioral studies of language models. These studies treat the models as black box functions that take input text and return probability distributions over output text. Often inspired by work in psycholinguistics, these studies evaluate language model responses to controlled inputs (e.g., Ettinger 2020), to make inferences about how the models process and generate text. As we note in Discussion Section 10.3, other studies analyze language models at the mechanistic level, studying internal representations, individual neurons, and attention heads (Geva et al. 2021; Meng et al. 2022; Olsson et al. 2022). We focus on behavioral studies in this survey, but establishing ties between mechanistic and behavioral analyses of language models is an exciting direction of emerging research.

In this section, we provide a brief introduction to Transformer language models, which we generally refer to as language models. Transformer language models use a deep neural network architecture called a Transformer (Vaswani et al. 2017; Section 2.1), and they are trained to predict either masked words (i.e., fill-in-the-blank) or upcoming words in text (Section 2.2). Throughout this survey, we refer to these two types of models as masked and autoregressive models, respectively. Some studies refer to them as bidirectional and unidirectional models. Language models are most often applied to downstream tasks using either fine-tuning (or prompt-tuning), zero-shot prompting, or few-shot prompting (Section 2.3).

2.1 Architectures

The basic Transformer language model architecture has remained largely unchanged since 2018 (Radford et al. 2018; Devlin et al. 2019). First, an input text string is converted into a sequence of tokens. Tokens correspond roughly to words, although some words are composed of multiple subword tokens due to limited vocabulary size. For example, the string “This is preposterous!” might be tokenized into [_this, _is, _prepo, ster, ous, !]. Common tokenization techniques include byte pair encoding (BPE; Sennrich, Haddow, and Birch 2016) and unigram language modeling (Kudo 2018), but we refer to these other papers for detailed descriptions of tokenization techniques. Model vocabularies generally range from 30K to 250K possible tokens (Radford et al. 2019; Cohen et al. 2022; Chowdhery et al. 2022).
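To make the subword example concrete, the following is a minimal sketch of greedy longest-match segmentation over a tiny, hypothetical vocabulary. It is not BPE or unigram language modeling (both learn their vocabularies from data); the function name `tokenize`, the `_` word-boundary marker convention, and `TOY_VOCAB` are assumptions chosen only to reproduce the “This is preposterous!” split described above.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword segmentation (toy illustration, not BPE)."""
    tokens = []
    for word in text.lower().split():
        piece = "_" + word  # mark the start of each word, as in the example
        start = 0
        while start < len(piece):
            # Try the longest vocabulary entry that matches at this position.
            for end in range(len(piece), start, -1):
                if piece[start:end] in vocab:
                    tokens.append(piece[start:end])
                    start = end
                    break
            else:
                # Nothing in the vocabulary matched: emit a single character.
                tokens.append(piece[start])
                start += 1
    return tokens


# Hypothetical vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"_this", "_is", "_prepo", "ster", "ous", "!"}

tokens = tokenize("This is preposterous!", TOY_VOCAB)
print(tokens)
```

With this vocabulary, frequent words map to single tokens, while the rarer word “preposterous” is split into three pieces because no single entry covers it.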

After tokenization, each token is mapped to a fixed vector “embedding”; the embedding for each token is learned during the pre-training process. The embeddings are passed through a stack of Transformer layers (Vaswani et al. 2017; usually 10–100 layers), each consisting of a self-attention network, layer normalizations, and feedforward networks. The primary innovation of Transformer layers is the self-attention network, which “mixes” the sequence of token embeddings using projections into a “query”, “key”, and “value” vector for each token. This mixing of token embeddings results in a “contextualized” representation for each token, essentially a vector representation that incorporates the context of the input sequence. Finally, after the stack of Transformer layers, each output token representation is projected into a distribution over the same token vocabulary used in the input. In other words, the overall architecture maps each input token to a probability distribution over output tokens (e.g., upcoming tokens). Language models usually have between 100M and 500B total parameters, with autoregressive models usually much larger than masked models (Devlin et al. 2019; Brown et al. 2020; Lieber et al. 2021; Smith et al. 2022b; Chowdhery et al. 2022).
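The “mixing” performed by self-attention can be sketched in a few lines. The example below assumes standard single-head scaled dot-product attention (Vaswani et al. 2017) and omits multiple heads, layer normalization, and the feedforward network. The function names, the identity projection matrices, and the two-dimensional toy embeddings are illustrative assumptions, not values from any real model. The optional `causal` flag restricts each token to itself and earlier tokens, as in autoregressive models.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Convert raw scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product self-attention over token embeddings."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # A causal model may only attend to positions 0..i.
        limit = i + 1 if causal else len(keys)
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in keys[:limit]]
        weights = softmax(scores)
        # Contextualized representation: weighted average of value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs (assumptions): three tokens, two dimensions, identity projections.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(emb, identity, identity, identity)
out_causal, attn_causal = self_attention(emb, identity, identity, identity,
                                         causal=True)
```

Each output vector is a weighted average of the value vectors, which is what makes the representation “contextualized”: it depends on every token the query is allowed to attend to.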

The Transformer architecture does not naturally encode any information about each token’s position in an input sequence; intuitively, it is useful to encode this information for features such as word order. Thus, Transformer language models use a variety of position encoding techniques (Wang et al. 2021a; Dufter, Schmitt, and Schütze 2022), such as adding absolute position embeddings to the input token embeddings (i.e., an embedding for each position i; Vaswani et al. 2017; Radford et al. 2018; Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020; Zhang et al. 2022b), relative position embeddings or biases (i.e., encoding relative position distances between tokens; Shaw, Uszkoreit, and Vaswani 2018; Dai et al. 2019; Raffel et al. 2020; Chang et al. 2021; Rae et al. 2021; Cohen et al. 2022), or rotary position embeddings (an efficient approach to relative position biases; Su et al. 2021; Chowdhery et al. 2022). With relative rather than absolute position methods, language models can better extrapolate to longer sequences than observed during pre-training (Press, Smith, and Lewis 2022). Language models are usually pre-trained with input sequence lengths of around 500 to 2,000 tokens.
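The difference between absolute and relative position information can be illustrated with a small sketch. The helper names below are hypothetical, and the example only shows which quantities each scheme exposes to the model; it does not implement any particular embedding method. Shifting a sequence by a constant changes every absolute position index but leaves all pairwise relative offsets unchanged.

```python
def absolute_positions(length, offset=0):
    """Absolute position index for each token, optionally shifted by a constant."""
    return [offset + i for i in range(length)]


def relative_offsets(positions):
    """Pairwise offsets (key position minus query position) between tokens."""
    return [[key - query for key in positions] for query in positions]


original = absolute_positions(4)             # [0, 1, 2, 3]
shifted = absolute_positions(4, offset=100)  # [100, 101, 102, 103]

print(original != shifted)                                     # True
print(relative_offsets(original) == relative_offsets(shifted)) # True
```

A model that conditions only on the relative offsets receives identical position information for both sequences, whereas a model with absolute position embeddings sees position indices it may never have encountered during pre-training.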

2.2 Training

Language modeling refers to predicting tokens from context, usually text. Masked and autoregressive language models are pre-trained to predict masked (i.e., hidden) and upcoming tokens, respectively. Recall from the previous section that the Transformer architecture predicts an output token distribution for each input token.
The [MASK] walked.
(1)
The ___
The dog ___
The dog walked ___
(2)
In masked language models (Example 1), randomly selected tokens are replaced with [MASK] tokens; for each input [MASK] token, the model produces a probability distribution over the token that was masked (i.e., fill-in-the-blank). In autoregressive models (Example 2), no tokens are replaced; for each input token, the model produces a probability distribution over the next token (i.e., predicting each next token).
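The two objectives differ only in how inputs and prediction targets are constructed from a token sequence. The sketch below shows one way to build them; the function names and the default masking probability of 0.15 are assumptions for illustration, and actual pre-training recipes vary in their masking details.

```python
import random

MASK = "[MASK]"


def masked_example(tokens, mask_prob=0.15, rng=None):
    """Replace randomly selected tokens with [MASK]; targets are the hidden tokens."""
    if rng is None:
        rng = random.Random(0)
    inputs, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[position] = token  # the model must recover this token
        else:
            inputs.append(token)
    return inputs, targets


def autoregressive_examples(tokens):
    """Pair every prefix with the token that follows it; nothing is replaced."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["The", "dog", "walked", "."]
for context, target in autoregressive_examples(sentence):
    print(" ".join(context), "___ ->", target)
```

For the four-token sentence, the autoregressive pairs correspond to the three prefixes shown in Example 2, each paired with its next token.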

Language models are pre-trained using gradient descent, observing many examples as in Examples 1 and 2. Text corpora for pre-training usually range from approximately 5B to 1.5T tokens (roughly 15GB to 5TB of raw text; Devlin et al. 2019; Liu et al. 2019; Brown et al. 2020; Rae et al. 2021; Hoffmann et al. 2022). For compute-optimal pre-training in autoregressive language models, as the number of model parameters increases, the number of pre-training tokens should increase roughly proportionally (Kaplan et al. 2020; Hoffmann et al. 2022). During pre-training, examples are fed into the models with anywhere from 100K to 4M tokens per optimization step (i.e., batch size), usually with larger batch sizes in larger models (Devlin et al. 2019; Brown et al. 2020; Hoffmann et al. 2022; Chowdhery et al. 2022; Zhang et al. 2022b). Models are usually pre-trained for 100K to 1M steps (Radford et al. 2018; Devlin et al. 2019; Zhang et al. 2022b); when possible, examples are not repeated during pre-training (Hoffmann et al. 2022; Chowdhery et al. 2022). Due to high computational costs, relatively few language models are pre-trained from scratch as described here, and they are usually trained in industry labs. In practice, most NLP researchers build applications upon existing pre-trained language models, using the approaches described in Section 2.3.
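The quantities in this paragraph are related by simple arithmetic, sketched below. The function names and the specific numbers in the demonstration are illustrative assumptions chosen from within the ranges quoted above; they do not describe any particular model. The first function assumes a single pass over the corpus (no repeated examples), and the second encodes only the rough proportionality between parameters and tokens, not a fitted scaling law.

```python
def optimization_steps(total_tokens, tokens_per_step):
    """Number of optimization steps for one pass over the pre-training corpus."""
    return total_tokens // tokens_per_step


def scaled_token_budget(base_tokens, base_params, new_params):
    """Token budget that grows in proportion to the parameter count."""
    return base_tokens * new_params // base_params


# Illustrative numbers only: a 300B-token corpus with 2M tokens per step.
steps = optimization_steps(300_000_000_000, 2_000_000)
print(steps)  # 150000

# Doubling the parameter count roughly doubles the token budget.
budget = scaled_token_budget(100_000_000_000, 1_000_000_000, 2_000_000_000)
print(budget)  # 200000000000
```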

This survey considers pre-trained language models as described above. Recent language models often contain further non-task-specific fine-tuning stages (particularly autoregressive models; Cohen et al. 2022; Ouyang et al. 2022). For example, autoregressive models are sometimes fine-tuned using the language modeling objective on curated human-written examples that demonstrate desirable text outputs (Ouyang et al. 2022) or examples of outputs that correctly follow input instructions (Wei et al. 2022a; Iyer et al. 2022). These approaches are referred to as supervised fine-tuning or instruction tuning. Some more recent models are also tuned using reinforcement learning, with predicted human preferences for different responses used as a reward (reinforcement learning from human feedback; Ouyang et al. 2022; OpenAI 2023a). Throughout this survey, we consider non-fine-tuned language models unless otherwise specified. Non-fine-tuned language models still serve as the foundation for more recent language models.

2.3 Downstream Tasks and Text Generation

Language models are used for a wide range of downstream tasks, including but not limited to custom chatbots, question answering, sentiment classification, offensive text detection, and textual similarity quantification (Devlin et al. 2019; Zhang et al. 2020; Zhao, Zhang, and Hopfgartner 2021; Zong and Krishnamachari 2022). Traditionally, given example inputs and outputs for a task, language models are fine-tuned by adjusting all or some model parameters using gradient descent (Radford et al. 2018; Devlin et al. 2019; Lester, Al-Rfou, and Constant 2021; Chowdhery et al. 2022). As autoregressive models have risen in popularity, tasks are increasingly formulated as prompted text generation tasks (Wei et al. 2022a):
Premise: Fun for adults and children.
Hypothesis: Fun for only children.
Does the premise entail the hypothesis? ________________
(Williams, Nangia, and Bowman 2018)
(3)
The input text is referred to as the prompt or context. Autoregressive language models can perform many tasks similar to Example 3 without fine-tuning on that specific task (i.e., zero-shot learning—e.g., by instruction-tuning on other tasks; Wei et al. 2022a). If example inputs and outputs (e.g., 1–100 examples) are included in the prompt, then language models can perform well without any fine-tuning at all (Brown et al. 2020; Chowdhery et al. 2022; Zhang et al. 2022b); providing examples in context without any parameter updates is commonly known as few-shot prompting or in-context learning.
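The difference between zero-shot and few-shot prompting is simply whether worked examples precede the query in the prompt. The sketch below assembles both kinds of prompt using the template of Example 3. The function names are hypothetical, and the demonstration pair about the dog is invented for illustration rather than taken from any dataset.

```python
def format_example(premise, hypothesis, answer=""):
    """Render one entailment item; leave the answer empty for the query."""
    text = (f"Premise: {premise}\n"
            f"Hypothesis: {hypothesis}\n"
            f"Does the premise entail the hypothesis? {answer}")
    return text.rstrip()


def build_prompt(query, demonstrations=()):
    """Concatenate solved demonstrations (if any) followed by the unsolved query."""
    blocks = [format_example(p, h, a) for p, h, a in demonstrations]
    blocks.append(format_example(*query))
    return "\n\n".join(blocks)


query = ("Fun for adults and children.", "Fun for only children.")

# Zero-shot: the prompt contains only the query.
zero_shot = build_prompt(query)

# Few-shot: one invented demonstration precedes the query.
few_shot = build_prompt(
    query,
    demonstrations=[("A dog is sleeping on the couch.",
                     "An animal is on the couch.",
                     "Yes")],
)
print(few_shot)
```

In both cases no model parameters are updated; the model is only asked to continue the text after the final question.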

In cases such as Example 3, autoregressive language models can compute the probability for any desired output text by iteratively multiplying the probability for each next token. When the models are used for open-ended text generation (i.e., the models must select each next token), common approaches are to (1) iteratively select the most probable next token (greedy sampling), (2) iteratively sample the next token from the output probability distribution with some temperature parameter τ (temperature sampling), (3) sample from the top k token predictions (top-k sampling), or (4) sample from the top tokens that sum to some probability p (nucleus sampling; Holtzman et al. 2020). In all of these cases, multiple candidate sequences of tokens can be generated and then ranked according to their overall sequence probability (i.e., beam search; Freitag and Al-Onaizan 2017), but beam search is often not used in practice due to its high computational cost. Of the studies discussed in this survey, the majority use greedy, temperature, top-k, or nucleus sampling for open-ended text generation. In the next sections, we discuss recent studies evaluating language model generated text and output text probabilities from a wide range of perspectives.
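The scoring and decoding strategies above can be sketched over a single toy next-token distribution. The function names and the probabilities in `dist` are assumptions for illustration; a real model would produce a new distribution over its full vocabulary at every step. Sequence probability is computed in log space, which is equivalent to multiplying the per-token probabilities but avoids numerical underflow.

```python
import math
import random


def sequence_log_prob(token_probs):
    """Log-probability of a sequence given the probability of each next token."""
    return sum(math.log(p) for p in token_probs)


def greedy(dist):
    """Select the single most probable next token."""
    return max(dist, key=dist.get)


def apply_temperature(dist, tau):
    """Rescale probabilities as p**(1/tau), i.e., dividing the logits by tau."""
    scaled = {tok: p ** (1.0 / tau) for tok, p in dist.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}


def top_k(dist, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus(dist, p):
    """Keep the most probable tokens until their mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for tok, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: q / mass for tok, q in kept}


def sample(dist, rng):
    """Draw one token at random according to the distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy next-token distribution (invented probabilities).
dist = {"walked": 0.5, "ran": 0.3, "slept": 0.15, "flew": 0.05}
rng = random.Random(0)

print(greedy(dist))                           # walked
print(sorted(top_k(dist, 2)))                 # ['ran', 'walked']
print(sorted(nucleus(dist, 0.9)))             # ['ran', 'slept', 'walked']
print(sample(apply_temperature(dist, 0.7), rng))
```

Temperatures below 1 concentrate probability on the most likely tokens, and temperatures above 1 flatten the distribution; top-k and nucleus sampling instead truncate the distribution before sampling.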

We begin with studies that evaluate language model predictions from a syntactic perspective. In the vast majority of cases, language models are more likely to predict grammatical tokens than ungrammatical tokens, adhering to a wide variety of syntactic rules (Section 3.1). In subject-verb agreement, the models’ performance degrades in more complex or infrequent examples (Section 3.2), and language model predictions are possibly over-sensitive to token position information (i.e., word order; Section 3.4), but syntactic abilities overall are learned fairly robustly early in pre-training (Section 3.3).

3.1 Language Models Generally Produce Grammatical Text

Systematic syntactic evaluations of autoregressive language models are conducted in Warstadt et al. (2020), Hu et al. (2020), and Gauthier et al. (2020), comparing model probabilities for minimal pair examples that differ in grammaticality due to just one token (e.g., “the boy [*eat/eats]”). Similar assessments are run for masked language models in Park, Park, and Song (2021). Both autoregressive and masked language models consistently assign higher probabilities to grammatical tokens, and they make predictions consistent with hierarchical syntactic structure, where clauses can be nested within one another. Such structures are commonly observed in human language (Carnie 2002), creating token relationships that are not solely dependent on linear word order.
The girl who had three dogs [*play/plays] accordion.
(4)
In Example 4, replacing “girl” with “girls” would require the verb to change to “play”. In other words, the verb “plays” agrees in number with the noun “girl” despite the appearance of the nested clause “who had three dogs” including the distractor noun “dogs” closer to the verb. In these long-distance subject-verb agreement examples, language models generally assign higher probabilities to grammatical options, but their performance varies depending on the specific nouns, verbs, and distractors involved (Section 3.2).
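A minimal-pair evaluation of this kind reduces to comparing two scores per item. The sketch below assumes a caller-supplied `log_prob` function that returns a model's log-probability for a sentence; here it is replaced by a lookup table of invented scores (`TOY_SCORES`), so the resulting accuracy says nothing about any real model.

```python
def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of pairs where the grammatical sentence receives the higher score."""
    preferred = sum(
        1 for grammatical, ungrammatical in pairs
        if log_prob(grammatical) > log_prob(ungrammatical)
    )
    return preferred / len(pairs)


# Invented log-probabilities standing in for a language model's scores.
TOY_SCORES = {
    "the boy eats": -4.0,
    "the boy eat": -7.5,
    "The girl who had three dogs plays accordion.": -21.0,
    "The girl who had three dogs play accordion.": -20.0,
}

pairs = [
    ("the boy eats", "the boy eat"),
    ("The girl who had three dogs plays accordion.",
     "The girl who had three dogs play accordion."),
]

accuracy = minimal_pair_accuracy(pairs, TOY_SCORES.get)
print(accuracy)  # 0.5
```

In this toy table the second pair is scored incorrectly, mimicking a scorer that is misled by the distractor noun “dogs”; with a real model, `log_prob` would sum the token log-probabilities of each sentence.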
Outside of agreement, language models recognize licensing, when the grammaticality of a token depends on an upstream “licensor” token, usually equal or higher in the hierarchical syntactic structure.
I know what the lion devoured [*the gazelle/_] yesterday.
I know that the lion devoured [the gazelle/*_] yesterday.
(Wilcox, Futrell, and Levy 2022)
(5)
In Example 5, the word “what” licenses the omitted direct object “gazelle” for the verb “devoured”; the word “that” does not license such an omission. This omission licensing is known as a filler-gap dependency, and Wilcox, Futrell, and Levy (2022) find that autoregressive language models respect filler-gap rules. Similarly, masked language models assign higher probabilities to licensed tokens in reflexive licensing (reflexives such as “himself” require a properly situated previous noun phrase; Hu, Chen, and Levy 2020) and in negative polarity items (items such as “any” require a previous negative word such as “not”; Warstadt et al. 2019). However, autoregressive model predictions for reflexive licensing are less accurate in sentences where the licensed reflexive depends on the specific verb involved (Lee and Schuster 2022).

In general, the grammaticality of language model predictions improves with model size and pre-training corpus size, in both autoregressive and masked models (Warstadt et al. 2020; Pérez-Mayos, Ballesteros, and Wanner 2021). Across model sizes, better overall language modeling performance (e.g., inverse perplexity) is positively correlated with syntactic ability, although this relationship is not clear within any given model size (Hu et al. 2020; Pérez-Mayos, Ballesteros, and Wanner 2021). That said, many syntactic rules may be learned primarily based on memorized examples, dependent on the specific words and structures seen during pre-training (Section 3.2). For example, in cases where people generate syntactically anomalous phrases (e.g., article-noun disagreement between “a” and “days” in “a cold five days”), GPT-3 acceptability predictions roughly mirror human judgments (Mahowald 2023). When prompted with examples, GPT-3 can answer questions directly about a sentence’s syntactic structure (Zhang et al. 2022a). The results in this section demonstrate basic syntactic abilities in language models.

3.2 Language Models Learn Subject-Verb Agreement, but They Are Sensitive to Intervening Clauses and Specific Words

Language models’ syntactic abilities are most often evaluated using agreement, when one token’s form depends on a property of another token. For example, subject nouns in English must agree in number with their corresponding verbs (e.g., “the dog eats” vs. “the dogs eat”; see also Example 4). Masked and autoregressive language models are generally good at predicting verb forms for subject-verb agreement (van Schijndel, Mueller, and Linzen 2019), even in nested clauses and with long-distance dependencies as in Example 4 (Goldberg 2019). However, agreement performance degrades as the distance between the subject and verb increases (Bacon and Regier 2019; Ryu and Lewis 2021; Lakretz et al. 2022). In large autoregressive models, this degradation can be reduced significantly if models are provided with even just two initial examples (using few-shot prompting), as human raters usually are (Lampinen 2022).

Subject-verb agreement performance in language models is also dependent on the specific nouns and verbs involved (Yu et al. 2020; Chaves and Richter 2021). Masked and autoregressive models produce over 40% more accurate agreement predictions for verbs that are already probable from context (Newman et al. 2021), and agreement accuracy is worse overall for infrequent verbs (Wei et al. 2021). For infrequent verbs, masked language models are biased towards the more frequent verb form seen during pre-training (e.g., singular vs. plural) (Wei et al. 2021). Error rates exceed 30% for infrequent verbs in nonce (grammatically correct but semantically meaningless) sentences (Wei et al. 2021), with further degradations if there is an intervening clause between the subject and verb as in Example 4 (Lasri, Lenci, and Poibeau 2022a). This subject-verb agreement degradation in nonce sentences with long-distance dependencies has also been observed in people, although to a lesser degree than in language models (Lasri et al. 2022). Finally, subject-verb agreement performance in masked and autoregressive language models is dependent on the specific subject noun, although these differences in performance do not appear to be driven by noun frequency (Yu et al. 2020). In many ways, language models’ variable performance on subject-verb agreement reflects a larger sensitivity to specific words and input structures (Discussion Section 10.2).

3.3 Language Models Learn Syntactic Rules Early in Pre-training

The acquisition of syntactic rules is fairly consistent during language model pre-training. Syntactic rules are learned within roughly the first 20% of masked language model pre-training, as measured by the syntactic generalization suites in Section 3.1 (Liu et al. 2021; Zhang et al. 2021b). Small masked language models (8M parameters) pre-trained on only 30M words of transcribed child-directed speech can achieve similar syntactic performance to standard masked models with over 10x more parameters and 1,000x more pre-training data (Huebner et al. 2021). Autoregressive and masked models tend to learn similar syntactic generalizations during the pre-training process regardless of random initializations and training data shuffling (Choshen et al. 2022; Misra 2022). Early in pre-training, models are syntactically more similar to bag-of-words, unigram, and n-gram models (Choshen et al. 2022), passing through stages where their predictions mirror unigram then bigram distributions (Chang and Bergen 2022). Notably, syntactic abilities emerge in Transformer language models despite the fact that Transformers cannot model arbitrarily deep hierarchical structures unless their number of layers or attention heads increases with input length (Hahn 2020), and Transformers have a tendency to generalize linearly rather than hierarchically when trained from scratch on purely syntactic tasks (Petty and Frank 2021).

3.4 Language Models Can Learn Word Order Without Explicit Position Information, but Word Order Is not Necessary in Many Examples

At first glance, language modeling performance would seem highly dependent on a model’s understanding of word order (i.e., token positions). For example, syntactic information in English is largely determined by token positions (e.g., “the dog saw the cat” vs. “the cat saw the dog”). However, masked language models pre-trained on data with shuffled words can still be fine-tuned for reasonable performance on a variety of downstream tasks (Sinha et al. 2021). This result may be because token position embeddings (Section 2.1) are still learned through common subword token sequences that remain unshuffled. Even when pre-training data is shuffled after tokenization, masked models learn informative position embeddings using correlations between sentence length and token frequencies (Abdou et al. 2022). Similarly, autoregressive language models without any position embeddings are able to encode token position information implicitly by “counting” the previous tokens in the causal (autoregressive) attention mask (Haviv et al. 2022). Thus, to some degree, the models in these studies are still able to rely on learned token position information.

In contrast, token position information is removed entirely in masked language models when position embeddings are removed. Small masked language models (e.g., 13M parameters) achieve similar language modeling performance when pre-trained with and without position embeddings, particularly if few tokens are masked per sequence (Chang et al. 2021; Lasri, Lenci, and Poibeau 2022b). However, more masking during pre-training improves fine-tuning performance for larger masked models (Wettig et al. 2023); in these larger models, removing token position information entirely might lead to more detrimental effects than in smaller models. While position information (word order) is not necessary for disambiguating semantic meaning in many sentences, there exists a minority of cases where position cues are necessary (Mahowald et al. 2022). Language models can reconstruct text from shuffled inputs, but not with perfect accuracy (Malkin et al. 2021). Thus, high performing models likely need to learn token position information without overfitting to irrelevant position cues. Both masked and autoregressive models with absolute position embeddings (Section 2.1) exhibit such overfitting, making worse language modeling predictions when sequences are shifted by a constant (i.e., shifting all positions by k, maintaining relative positions), a transformation that would ideally have little effect (Sinha et al. 2022b). This overfitting to position cues may also be related to language models’ tendency to generate highly frequent local structures (shorter n-grams based on local positions) rather than long-term coherent text, as described in Section 7.2.

On top of syntax, language models display basic semantic abilities, considering how text can be parsed to produce “meaning”. Language models learn word meanings and relationships as reflected in lexical semantics (Section 4.1), they track entities in described situations (Section 4.3), and they recognize basic figurative language (Section 4.4). However, they struggle with negation (Section 4.2) and pragmatics (Section 4.5).

We begin with compositional and formal semantics, where words and phrases combine in systematic ways to produce novel “meanings”, or at least coherent text. There are relatively few behavioral studies of phrase-level compositionality in non-fine-tuned language models (Hupkes et al. 2022), likely because assessments of how models combine phrases to construct meaning are difficult to study behaviorally without a downstream task.
Camila gave a cake in storage to Emma.
give(agent=Camila, theme=cake(nmod.in=storage), recipient=Emma)
(Qiu et al. 2022)
(6)
When provided with examples (few-shot prompting; see Section 2.3), autoregressive language models can extract compositional semantic parses from sentences as in Example 6, with performance improving with model size (Qiu et al. 2022; Hosseini et al. 2022). However, because the models are explicitly asked for a semantic parse and the task output is not natural English, it remains unclear whether and how language models construct “meaning” in more natural scenarios.

4.1 Language Models Learn Semantic and Compositional Properties of Individual Words, Including Argument Structure, Synonyms, and Hypernyms

Researchers have primarily evaluated compositional semantics in language models through the lens of lexical semantics, which studies word meanings and relationships, considering how individual words influence the meaning and semantic structure of a phrase (Geeraerts 2017). At the word meaning level, both masked and autoregressive language models can predict frequent words from their definitions and vice versa, but they struggle with infrequent words (Senel and Schütze 2021). Masked models can predict noun hypernyms (e.g., “robins” are “birds”) using template sentences (e.g., “A robin is a _”; Hanna and Mareček 2021) or by predicting noun replacements (Ravichander et al. 2020), but predictions degrade when the noun is plural or the hypernym pair is infrequent. The hypernym prediction confidence in autoregressive and masked models is correlated with the human-rated typicality of the hyponym within the hypernym category, with larger models showing stronger typicality effects (Misra, Ettinger, and Rayz 2021). When predicting masked nouns more generally, masked language models assign high probabilities to word synonyms and co-hyponyms (e.g., “robin” and “sparrow” are co-hyponyms of “bird”), rather than pairs of hyponyms and hypernyms (Arefyev et al. 2020). These results suggest that language models understand basic word meanings and allowable word substitutions; more grounded knowledge of the objects and entities that words refer to, such as physical properties and facts, is discussed in Section 5.

Lexical semantics also considers how words influence semantic structure within a clause. Autoregressive models are more likely to predict verbs in the correct argument structure than in an incorrect one (e.g., the correct number and type of arguments for “gave” in Example 6), but with lower accuracy than on many syntactic tasks (Warstadt et al. 2020).
(7) Sally frightened Mary because she was so terrifying.
    Sally feared Mary because she was so terrifying.
    (Davis and van Schijndel 2020)
Specifically, many studies consider implicit causality in verbs. In Example 7, the verb “frightened” biases the next clause to refer to the verb subject “Sally”. The verb “feared” biases the next clause to refer to the verb object “Mary”. After observing an implicit causality verb, autoregressive models with 1.5B parameters are more likely to predict pronoun genders matching the subject vs. object causality bias of the verb (Davis and van Schijndel 2020); however, this effect only sometimes replicates in masked and autoregressive models under 1B parameters (Upadhye, Bergen, and Kehler 2020; Kementchedjhieva, Anderson, and Søgaard 2021). Predictions in these smaller autoregressive models match human verb causality biases more closely for frequent verbs (Huynh, Lentz, and van Miltenburg 2022). Outside of implicit causality, masked and autoregressive models predict prepositional vs. double-object dative alternations (e.g., “gave the book to her” vs. “gave her the book”) according to verb-specific biases, with higher correlations with human ratings in larger models (Hawkins et al. 2020). These verb-specific effects in language models demonstrate a basic understanding of how verb properties affect upcoming syntactic and semantic structures.

4.2 Language Models Struggle with Negation, Often Performing Worse as Models Scale

One notable example of compositionality is negation, where a word such as “not” inverts the meaning of a phrase. Masked language models often ignore negation when producing completions, such that they are more likely to generate incorrect completions than correct completions to negated primes (e.g., “A robin is not a [bird]”; Ettinger 2020; Kassner and Schütze 2020). In fact, autoregressive models generate more incorrect completions after “few”-type quantifiers (e.g., “Few robins are [birds]”) as models increase in size (Michaelov and Bergen 2022b). These results may reflect a similarity to human online processing (e.g., neural responses and reading times) rather than offline processing and reasoning (Michaelov and Bergen 2022b). Sensitivity to negation can be improved if language models are fine-tuned on more negation sentences, still using the language modeling objective (predicting tokens); masked models are then much less likely to predict any token that was negated in a given context (Gubelmann and Handschuh 2022).

Negation degrades language model performance in tasks involving more explicit reasoning as well (e.g., reasoning abilities in Section 6). When autoregressive models are presented with negated task prompts (e.g., “Please produce a possible incorrect answer to the question”), they perform worse as they increase in size (Jang, Ye, and Seo 2022). Performance is often over 50% worse on negated prompts compared to the original prompts. These weaknesses may not be reflected in many NLP benchmarks due to underrepresentation of negation relative to naturally occurring corpora, and the fact that negation is not relevant for many examples (Hossain, Chinnappa, and Blanco 2022); fine-tuned language models perform much worse on datasets that explicitly focus on negation (Hossain et al. 2020; Geiger, Richardson, and Potts 2020; Tejada, Scholtes, and Spanakis 2021; Truong et al. 2022).

4.3 Language Models Construct Coherent but Brittle Situation Models

Similar to situation models proposed in human language comprehension (Zwaan 2016), language models are able to track entities such as objects and characters throughout a passage. Autoregressive models are able to recognize whether a phrase introduces a new entity (e.g., the “cake” in “I saw Michael bake a cake” vs. “I doubt Michael baked a cake”), with better accuracy in larger models (Schuster and Linzen 2022). However, when multiple nouns are present, the models sometimes refer to un-introduced entities (e.g., “I doubt Michael baked a cake. It’s in the oven.”; Schuster and Linzen 2022). Masked language models are able to predict the antecedents of bridging anaphora, when an entity (e.g., “the window”) has an implied relation to a previously mentioned entity (e.g., “the house”) (Pandit and Hou 2021).

When prompted with a passage, GPT-3 can answer questions about entity states and event likelihoods, but only marginally better than chance (Zhang et al. 2023b). GPT-3 performs better when answers are stated explicitly in the passage, but its answers are sensitive to the phrasing of the question (Summers-Stay, Bonial, and Voss 2021). GPT-3 also has poor accuracy for questions that involve mathematical reasoning, temporal ordering of events, or logical negation (Summers-Stay, Bonial, and Voss 2021; see also Section 4.2 for negation and Section 6.2 for numerical reasoning). Of course, the studies above consider entities and entity states that are described relatively unambiguously in the text, and language models already exhibit somewhat unreliable performance; in later sections, we discuss commonsense inferences about the implied mental states of characters (Section 4.5) and implied relationships between events (Section 5.3).

4.4 Language Models Recognize Basic Analogies, Metaphors, and Figurative Language

Contradicting the rules of compositional semantics (Section 4), some phrases have meanings that cannot be constructed directly from their constituent words. Common examples of noncompositional expressions include analogies, metaphors, and idioms; these expressions must be interpreted nonliterally (i.e., figuratively or metaphorically). Masked language models assign higher probabilities to literal sentences, then conventional (i.e., common) metaphors, then novel metaphors, then nonsense (Pedinotti et al. 2021a; Griciūtė, Tanti, and Donatelli 2022). When prompting autoregressive models directly to identify metaphorical language, the models exhibit a sharp increase in performance around 100B parameters (Comșa, Eisenschlos, and Narayanan 2022). From these results, it appears that language models recognize metaphorical language to some degree as they increase in size.

Furthermore, masked and autoregressive models can predict the correct interpretations of similes (figurative comparisons using “like” or “as”), with improvements based on model size, but consistently worse than people (Liu et al. 2022a; He et al. 2022a). The models can complete analogies (e.g., “X is to Y as Z is to _”) reasonably well (Ushio et al. 2021), but they perform significantly worse for more abstract and unconventional analogies (Czinczoll et al. 2022). GPT-3 can generate analogies of comparable quality to people when given open-ended prompts (e.g., “What is analogous to X?”), although quality varies by prompt template (Bhavya, Xiong, and Zhai 2022).

Finally, noncompositional expressions include constructions, linguistic templates whose meanings are not necessarily built up from their constituent words. For example, the comparative correlative construction (e.g., “the better your syntax, the better your semantics”) has a well-understood meaning in English despite its apparent ungrammaticality (e.g., no inflected verb). Masked language models struggle to recognize the comparative correlative, making inferences about the implied descriptions at chance level after accounting for adjective frequencies (Weissweiler et al. 2022). However, research on a wider range of constructions is necessary to determine which constructions language models struggle with more generally.

4.5 Language Models Can Infer the Mental States of Characters in Text, But They Struggle with Implied Meaning and Pragmatics

The previous sections focused on linguistic structure and meaning somewhat independent of context. In conversation, many utterances have implied meanings that depend on context and the intentions of the speaker; these meanings are the focus of pragmatics. According to Grice’s maxims of conversation (quantity, quality, relation, and manner), utterances should be appropriately informative, true, relevant, and clear (Grice 1975). Comprehending and producing pragmatically sound utterances likely requires some sensitivity to others’ mental states (Frank and Goodman 2012; Monroe and Potts 2015; Sikos et al. 2021). Indeed, when asked directly, GPT-3 can infer the knowledge and desires of characters in text (Summers-Stay, Bonial, and Voss 2021; Sap et al. 2022), and it can explain why characters perform actions in everyday situations based on commonsense reasoning (Lal et al. 2022). It can even answer questions about characters’ deceit, indirect requests, irony, implied meaning, and humor, but this ability is not observed in smaller autoregressive models (e.g., 100M parameters) (Hu et al. 2022). When using a fill-in-the-blank word prediction task to infer knowledge states of characters (e.g., whether they know the location of an object), GPT-3 performs well above chance but worse than people (Trott et al. 2023). Masked language models can predict “go” vs. “come” in narratives with accuracy similar to people, recognizing the implied spatial perspective of the narrative (Masis and Anderson 2021).

However, sensitivity to perspectives and mental states does not translate directly into pragmatic understanding in language models. Autoregressive models are more likely to repeat an entity (e.g., “the cup”) than use a pronoun (e.g., “it”) in many cases where a pronoun would be more natural, thus producing potentially over-informative text (Beyer, Loáiciga, and Schlangen 2021). When explicitly interpreting pragmatically implied meanings (implicatures, e.g., “A asked X, and B responded Y, which means [yes/no]”), both masked and autoregressive models perform only slightly above chance and much worse than people, with no substantial improvements using larger models (Ruis et al. 2022). GPT-3 is unable to predict plausible presuppositions (e.g., “Grant stopped eating meat” implies “Grant once ate meat”) or scalar implicatures (e.g., “some brothers” implies “not all brothers”) any better than chance (Cong 2022). This is in line with studies showing that fine-tuned language models rely on surface cues such as specific function words when they appear to recognize presuppositions (Kabbara and Cheung 2022). That said, both masked and autoregressive models prefer conversationally relevant content over less relevant content, preferring to output text related to main clause content over embedded clause content (Kim, Yu, and Ettinger 2022). In other words, language models exhibit reasonable sensitivity to relevance and mental states, but they still struggle with pragmatics overall.

Beyond their ability to interpret and produce fluent text, language models exhibit basic world knowledge, including commonsense reasoning and facts. They learn encyclopedic facts and commonsense properties of objects (Section 5.1), albeit unreliably (Section 5.2), and they have a limited ability to infer typical relationships between actions and events (Section 5.3). Commonsense and factual knowledge in language models generally improves with model size, and the models’ factual knowledge can be further enhanced with explicit memory retrieval mechanisms (Khandelwal et al. 2020; Borgeaud et al. 2022), or connections to search engines (Schick et al. 2023), or knowledge bases (Zhang et al. 2019; Guu et al. 2020).

5.1 Language Models Learn Facts and Commonsense Properties of Objects, Particularly as Models Scale, but They Are Less Sensitive Than People to Physical Properties

Masked and autoregressive language models assign higher probabilities to facts than to alternatives when expressed as sentences (e.g., the knowledge triple in Example 8) (Davison, Feldman, and Rush 2019; Petroni et al. 2019).
(8) Knowledge triple: (Dante, born-in, Florence)
    Natural language template: X was born in Y.
    Fill-in-the-blank sentence: Dante was born in _.
    (Petroni et al. 2019)
Language models can complete these sentences for a wide variety of facts, covering countries and locations, popular products, historical figures, and even genres of books, movies, and music (Petroni et al. 2019; Penha and Hauff 2020). This ability improves if researchers use better fill-in-the-blank template sentences, such as naturally occurring templates from Wikipedia (Jiang et al. 2020b), or if templates are paired with some relevant preceding context (Adolphs, Dhuliawala, and Hofmann 2021).
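The conversion from a knowledge triple to a fill-in-the-blank sentence can be sketched in a few lines. The following is a minimal illustration rather than the procedure of any cited study: the `Triple` type, the `TEMPLATES` map, and the `to_cloze` helper are hypothetical names, and only the “born-in” template is taken from Example 8 (rewritten here with `{X}`/`{Y}` placeholders).

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical relation-to-template map; only the "born-in" template
# comes from Example 8, rewritten with {X}/{Y} placeholders.
TEMPLATES = {"born-in": "{X} was born in {Y}."}


def to_cloze(triple: Triple, blank: str = "_") -> Tuple[str, str]:
    """Return (fill-in-the-blank sentence, expected answer) for a triple."""
    template = TEMPLATES[triple.relation]
    sentence = template.format(X=triple.subject, Y=blank)
    return sentence, triple.obj


sentence, answer = to_cloze(Triple("Dante", "born-in", "Florence"))
print(sentence)  # Dante was born in _.
```

A model's prediction for the blank would then be compared against the returned answer; swapping in a different template for the same relation changes only the `TEMPLATES` entry.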

However, autoregressive models perform worse when considering larger sets of facts in open-ended factual question-answering (Kalo and Fichtel 2022). Masked and autoregressive models perform poorly when predicting numeric literals (e.g., years; Kalo and Fichtel 2022) and numerical commonsense (e.g., “A bird has _ legs”; Lin et al. 2020) (see Section 6.2 for more general numerical reasoning). The models also struggle to make fine-grained property distinctions between related concepts and hypernyms (e.g., properties of “robins” vs. “birds” in general), although accuracy improves with model size (Peng et al. 2022; Misra, Rayz, and Ettinger 2023). As model size increases, autoregressive models are also more likely to correctly use their background factual knowledge to answer questions; accuracy on relevant facts is more predictive of a correct response to a target question in larger models (Sahu et al. 2022). On top of generally higher accuracy (Kalo and Fichtel 2022), larger models (e.g., 50B parameters) are able to assess whether their own answers to factual questions are correct or incorrect, with this self-reflection ability increasing with model size (Kadavath et al. 2022).

To some degree, language models are also able to predict physical properties of objects, such as colors and sizes, using templates similar to Example 8. Perhaps unsurprisingly, model predictions are generally less sensitive than human responses to real world physical properties. For example, masked models can predict typical vs. atypical properties when prompted using quantifiers (e.g., “All X are _” vs. “Some X are _”; Apidianaki and Garí Soler 2021). However, their property predictions are only loosely correlated with human responses, and when predicting a target object from its properties, the models rely on encyclopedic facts over visual and perceptual properties (Weir, Poliak, and Durme 2020). Both masked and autoregressive models can predict typical color distributions of objects, but their predictions correlate more with corpus n-grams (e.g., “red ball”) than with human judgments (Paik et al. 2021), particularly for smaller models (Liu et al. 2022b). Similarly, autoregressive models assign higher probabilities to correct physical comparisons (e.g., “A bear is bigger than a cat”) than to incorrect comparisons, with better performance in larger models (Shi and Wolff 2021; De Bruyn et al. 2022). Finally, masked models can predict the typical use for an object better than chance (Jiang and Riloff 2021), and GPT-3 predicts atypical but physically plausible (i.e., “afforded”) uses as more likely than implausible uses, but this effect is much smaller than in people (Jones et al. 2022). When prompted for creative uses for objects, GPT-3 provides slightly less creative and original uses than people (Stevenson et al. 2022).

5.2 Learned Facts Are Sensitive to Context and a Fact’s Frequency in the Pre-training Corpus

Language models’ ability to predict facts and object properties is highly sensitive to the specific prompt template (e.g., the template in Example 8) and the entities involved. Accuracies in both masked and autoregressive models vary substantially when the templates are paraphrased (Elazar et al. 2021; Cao et al. 2022) or altered in terms of punctuation (Podkorytov, Bis, and Liu 2021). Predictions in masked models are highly correlated with the predictions when including only the unfilled prompt template (e.g., excluding “Dante” in Example 8) (Cao et al. 2021). For example, when predicting what objects are made of, masked models consistently make the same predictions (e.g., “wood” or “metal”) regardless of the given object (Kwon et al. 2019). Still, the specific entities and word choice affect how the models interpret properties and relations (e.g., “density” in cities vs. physical objects) (Beloucif and Biemann 2021). Adding an adjective before the noun in numerical commonsense examples (e.g., “A [adjective] bird has _ legs”) can significantly degrade performance in masked and autoregressive models (Lin et al. 2020).
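One simple way to quantify this template sensitivity is to check how often a model gives the same answer across paraphrases of one template. The sketch below is illustrative only: pairwise agreement is an assumed measure (not necessarily the metric used in the cited studies), the paraphrased templates are invented for the example, and `toy_predict` is a hypothetical stand-in for a real model.

```python
from itertools import combinations
from typing import Callable, Sequence


def paraphrase_consistency(
    predict: Callable[[str], str],
    templates: Sequence[str],
    subject: str,
) -> float:
    """Fraction of template pairs for which the predictions agree."""
    prompts = [t.format(X=subject, Y="_") for t in templates]
    predictions = [predict(p) for p in prompts]
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 1.0  # a single template is trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)


def toy_predict(prompt: str) -> str:
    # Hypothetical stand-in for a language model: it keys on surface
    # wording, so its answer changes when the template is paraphrased.
    return "Florence" if "born in" in prompt else "Italy"


templates = [
    "{X} was born in {Y}.",
    "{X} is a native of {Y}.",
    "The birthplace of {X} is {Y}.",
]
score = paraphrase_consistency(toy_predict, templates, "Dante")
```

With this toy predictor, only one of the three template pairs agrees, so the consistency score is one third.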

Often, masked models rely largely on simple heuristics to make predictions, such as predicting nationalities based on common names in different countries (Poerner, Waltinger, and Schütze 2019), or simply predicting semantically similar words to the input prompt. Performance degrades substantially if the template includes a semantically similar distractor sentence (Pandia and Ettinger 2021), and masked models can be primed to incorrectly produce a plausible word appearing immediately before the prime for a fact (e.g., “Talk? Birds can _” → “talk”) (Kassner and Schütze 2020). Causal graph analysis indicates that masked model predictions are correlated with co-occurrence frequencies between the target word and words in the prompt (Elazar et al. 2022). Masked models make similar predictions even for opposite relations (e.g., “has property” vs. “does not have property”) (Kwon et al. 2019), although this may be due to models’ difficulty processing negation (Section 4.2).

Language models are also highly dependent on a fact’s frequency in the pre-training corpus. In very small masked models (e.g., 1M parameters), accuracy for an individual fact correlates with its frequency, and schema-conforming facts (e.g., “robins can fly” in a corpus of birds) are learned faster than exceptions (e.g., “penguins can dive”) (Kassner, Krojer, and Schütze 2020). In factual question-answering tasks, autoregressive model performance for each example is correlated with the number of related documents in the pre-training corpus; removing the relevant documents during pre-training decreases performance for the fact (Kandpal et al. 2022). Factual question-answering performance improvements based on model size are primarily due to accuracy increases for popular entities, as measured by Wikipedia views (Mallen et al. 2022). These frequency effects on fact learning may explain why masked model predictions of typical noun properties improve when models are fine-tuned on children’s books (still using the language modeling objective; Romero and Razniewski 2022); children’s books are more likely to explicitly state commonsense properties of objects.

Factual knowledge continues to evolve even late in pre-training in masked language models, as evaluated by raw fact accuracies (Chiang, Huang, and Lee 2020) and similarity between extracted knowledge graphs (Swamy, Romanou, and Jaggi 2021). Factual and commonsense knowledge in general is learned more slowly than syntactic generalizations during masked language model pre-training (Liu et al. 2021; Zhang et al. 2021b). Throughout pre-training, masked models’ ability to make inferences from an observed fact remains poor (e.g., observing “A robin is a bird” during pre-training does not increase the probability for “Robins can fly”; Porada, Sordoni, and Cheung 2022), suggesting that the models are memorizing rather than generalizing facts observed during pre-training. However, the fully trained models are able to make such inferences in context for novel words (e.g., “A wug is a bird. Therefore, a wug can _” → “fly”), even though this effect is sensitive to distractor sentences (Misra, Rayz, and Ettinger 2023). In other words, language models can identify in context after pre-training that “A robin is a bird ⇒ Robins can fly”, but if they observe the fact “A robin is a bird” during pre-training, it will not increase the probability for “Robins can fly”. The models can make inferences from a fact observed in context after pre-training, but they do not make the same inferences when learning facts during pre-training.

5.3 Language Models Have a Limited but Nontrivial Ability to Make Commonsense Inferences About Actions and Events

Beyond learning facts and commonsense properties of objects, language models can make basic commonsense inferences about events. Extending beyond simple situation modeling (Section 4.3), language models can infer plausible situations that are not described explicitly, although this ability is unreliable. Masked models are more likely to predict typical locations than atypical locations for verbs (Cho et al. 2021), but they are biased overall towards unusual or noteworthy events that are more likely to appear in many text corpora (e.g., “The person is _” → “killed” or “dying”; Shwartz and Choi 2020). The models assign higher probabilities to possible over impossible scenarios, but their ability to distinguish plausible and implausible scenarios varies per example (Beyer, Loáiciga, and Schlangen 2021; Kauf et al. 2022). Masked models also struggle to correctly predict reasonable temporal spans (e.g., “My holiday is only _”) (Qin et al. 2021), although they are able to predict the telicity (completed vs. in-progress state) of verbs using cues similar to people, such as verb-specific biases and stated time lengths (Zhao et al. 2021). Question-answering performance about commonsense situations in autoregressive models can often be attributed to answer-only probabilities, where the correct answer is a priori more likely than incorrect answers (Li et al. 2022b). Still, when asked directly, GPT-3 can identify character roles (e.g., the hero, villain, and victim) in newspaper articles, movie plot summaries, and political speeches (Stammbach, Antoniak, and Ash 2022).

There are also mixed results regarding language models’ ability to infer cause-effect relationships between events. Autoregressive models assign lower probabilities to flipped cause-effect sentences and self-contradictions, albeit with high variation across examples (Beyer, Loáiciga, and Schlangen 2021). Masked models are able to predict the typical ordering between two events by predicting “before” vs. “after” between phrases (Jin et al. 2022b), and the models assign higher overall probabilities to plausible causes before a described effect (Tamborrino et al. 2020). However, both masked and autoregressive models perform poorly when predicting the most likely reason sentence to place between start and end state descriptions (Misra 2022). Masked models are surprisingly bad at predicting concessive vs. causal conjunctions (e.g., “but” vs. “so”) between sentences (around 10% accuracy) in minimal pair cases with few lexical cues (Pandia, Cong, and Ettinger 2021). This occurs despite the fact that autoregressive model responses after connectives such as “but” and “so” are generally rated as coherent by people (Ko and Li 2020).

Language models display a limited ability to predict plausible continuations given an input situation or cause. Both masked and autoregressive models assign higher probabilities to supported statements than unsupported statements after a piece of evidence, with improved performance in larger models (Lee et al. 2021). The models predict story completions with probabilities that correlate with human typicality ratings, although this effect is largely driven by frequent words (Pedinotti et al. 2021b). Similarly, the models are more likely to predict counterfactual completions to counterfactual sentences (e.g., “If cats had liked vegetables, families would feed their cats with [carrots/fish]”), but these effects are largely due to lexical cues (e.g., just predicting related words) (Li, Yu, and Ettinger 2022). Masked and autoregressive models are at approximately random chance when predicting commonsense effects of actions such as “A did X and B did Y, so A is [more/less] Z” (Zhou et al. 2021). Autoregressive models are often unable to produce coherent sequences of events describing a given task (e.g., “baking a cake”; Sancheti and Rudinger 2022). Finally, both masked and autoregressive models struggle with fill-in-the-blank tasks requiring physical inference (e.g., inferring object locations, objects breaking, or objects moving); predictions are sensitive to which objects appear first in the text (Aroca-Ouellette et al. 2021), and language model predictions do not fully account for the physical inferences made by people (Jones and Bergen 2021).

We next consider logical reasoning tasks, tasks that include symbols and rules, along with algorithms for solving examples when the rules are known (Fujisawa and Kanai 2022). When provided with explicit instructions or examples, language models can perform basic step-by-step logical reasoning (Section 6.1) and numerical reasoning (Section 6.2), but they struggle with complex reasoning, and they are dependent on specific numerical inputs. Language models’ numerical and logical reasoning abilities can be improved by connecting the models to external APIs and logical reasoning modules such as calculators and code execution environments (Karpas et al. 2022; Schick et al. 2023; Krawczyk and Subramanya 2023).

6.1 Large Language Models Can Perform Basic Logical Reasoning When Prompted, but They Still Struggle with Complex Reasoning

If prompted with examples of reasoning for question-answer pairs (using few-shot prompting; Section 2.3), autoregressive models with at least 8B parameters can perform well on mathematical word problems, formal logic puzzles, and other logical reasoning tasks (Wei et al. 2022c; Suzgun et al. 2022). Their reasoning abilities do not appear to rely solely on surface cues such as word overlap; randomly shuffled example explanations do not provide significant benefits (Lampinen et al. 2022). Given examples, GPT-3 is able to solve fill-in-the-blank puzzles for arbitrary letter patterns and numerical matrix patterns (Webb, Holyoak, and Lu 2022). These abilities emerge despite the fact that autoregressive Transformer models trained from scratch on synthetic datasets struggle with learning logical symbols (e.g., the distinction between “and” and “or”; Traylor, Feiman, and Pavlick 2021). In some studies, only autoregressive models with at least 20B parameters can solve logic puzzles above chance, even when provided with examples (Han et al. 2022).

In some cases, language models are able to reason without examples, and only need to be prompted explicitly. Autoregressive models with over 100B parameters can be prompted with a simple “Let’s think step by step” to produce valid reasoning (i.e., “chain-of-thought prompting”; Kojima et al. 2022). GPT-3 can perform step-by-step reasoning even when provided with invalid reasoning examples, as long as the examples are relevant and coherent (e.g., steps in the correct order, even if the logic is incorrect; Wang et al. 2022a), suggesting that language models’ reasoning abilities are not necessarily dependent on provided examples in few-shot prompting. Autoregressive models can perform well on standard NLP tasks even when the examples have incorrect answers; examples in few-shot prompting primarily allow the models to learn the set of possible answers and the general input format (Min et al. 2022).
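The difference between the two prompting styles discussed above is easiest to see in the prompt strings themselves. The following sketch builds a zero-shot prompt ending in the trigger phrase and a few-shot prompt with worked reasoning examples. The “Q:”/“A:” layout, the “The answer is” suffix, and both function names are assumptions made for illustration, not a format prescribed by the cited studies.

```python
from typing import Sequence, Tuple


def zero_shot_cot_prompt(question: str) -> str:
    """Prompt with no examples, ending in the step-by-step trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(
    examples: Sequence[Tuple[str, str, str]],
    question: str,
) -> str:
    """Prompt with (question, reasoning, answer) examples before the target."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("Ann has 2 apples and buys 3 more. How many apples does she have?",
     "Ann starts with 2 apples. Buying 3 more gives 2 + 3 = 5.",
     "5"),
]
zero_shot = zero_shot_cot_prompt("A train has 4 cars with 10 seats each. How many seats?")
few_shot = few_shot_cot_prompt(examples, "A train has 4 cars with 10 seats each. How many seats?")
```

Either string would be passed to an autoregressive model as the input context; the model's continuation is then read as its reasoning and answer.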

Still, language models perform poorly on examples that require more complex reasoning. Even though autoregressive models generally produce valid reasoning steps, they struggle when multiple valid next steps are possible (Saparov and He 2023). Given text descriptions of toy blocks and goals, the models are unable to generate successful plans or modify existing plans (<5% accuracy; Valmeekam et al. 2022). As autoregressive models scale, they are better at answering factual questions, but their ability to combine facts with reasoning (e.g., “Who lived longer, George Washington or Julius Caesar?”) does not improve substantially (Press et al. 2022). When asked questions that implicitly require multi-step reasoning (e.g., “Did Julius Caesar ever visit George Washington?”), the models struggle to leverage known facts to answer questions correctly (Katz, Geva, and Berant 2022). When asked to make inferences from a set of rules and a fact, autoregressive models often just predict the answer choice with the highest word overlap with the input question (Betz, Richardson, and Voigt 2021). The models are also biased to predict intuitively plausible answers to logical questions regardless of the true logical answer, although this effect is also present in people (Dasgupta et al. 2022).

6.2 Language Models Exhibit Basic Numerical and Probabilistic Reasoning Abilities, but They Are Dependent on Specific Inputs

GPT-3 can perform addition and subtraction for small numbers (e.g., two- to three-digit numbers) and numbers that may appear often in text (e.g., 12345678+87654321), but its performance is poor for large numbers (Brown et al. 2020; Wang et al. 2021b). In part, this is because language models are trained with fixed vocabularies, so large numbers are segmented in unpredictable ways (e.g., 937523 → 93 752 3) (Wallace et al. 2019b; Jiang et al. 2020a). As numbers increase in arithmetic problems, autoregressive models start producing non-numeric responses entirely (Fujisawa and Kanai 2022). Larger language models are significantly better at arithmetic than smaller models (Brown et al. 2020), but the models’ performance on arithmetic and time unit conversion is highly correlated with the frequency of the inputs in text corpora (Razeghi et al. 2022).
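The segmentation effect can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match segmentation over a small, invented vocabulary; this is a simplification, since real subword tokenizers (e.g., byte-pair encoding) apply learned merge rules, and the vocabulary entries here are assumptions chosen to reproduce the example above.

```python
from typing import List, Set


def greedy_segment(text: str, vocab: Set[str]) -> List[str]:
    """Split text into the longest vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first; a single
        # character is always accepted as a fallback.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical fixed vocabulary: a few multi-digit chunks that happened
# to be frequent, plus the implicit single-character fallback.
vocab = {"93", "752", "12", "345"}
print(greedy_segment("937523", vocab))  # ['93', '752', '3']
print(greedy_segment("12345", vocab))   # ['12', '345']
```

Because chunk boundaries depend on which digit strings are in the vocabulary, numbers of the same length can be split into different numbers of tokens, and digits do not align by place value across operands.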

When solving mathematical word problems, autoregressive models are sensitive to slight modifications in wording, regardless of whether the modifications change the solution (Stolfo et al. 2022). GPT-3 performance drops when word problems include irrelevant context (Shi et al. 2023), and, similar to people, reinforcement-learning-tuned GPT-3 is sensitive to syntactic and lexical heuristics (e.g., responding with a salient number such as $1 from the prompt, even if incorrect; Hagendorff, Fabi, and Kosinski 2022). Autoregressive models perform poorly (<10% accuracy) on competition math problems, even with fine-tuning (Hendrycks et al. 2021b). Still, when probabilistic scenarios are described (e.g., gambling tasks), GPT-3 can make decisions better than chance, even outperforming people in some tasks; however, its “exploration” behavior of uncertain possibilities is essentially random instead of targeted or information optimal (Binz and Schulz 2023).

As seen in previous sections, language models are sensitive to specific examples and words when applying linguistic rules and world knowledge. These sensitivities can be viewed as instances of memorization or under-generalization of the examples observed during pre-training (Discussion Section 10.2). Models are reasonably likely to generate text memorized during pre-training (Section 7.1), but they can also generate novel text based on an input context (Section 7.2). Memorization has direct implications for language model usage in practice; models may produce plagiarized or even private information (Section 8.2), and they may overperform on benchmarks that are inadvertently included in pre-training data. As discussed in the next sections, memorization in language models can be reduced by pre-training the models on deduplicated pre-training data or by increasing sampling temperatures during text generation.

7.1 As Language Models Scale, They Are More Likely to Generate Memorized Text from the Pre-training Corpus

Autoregressive language models assign higher probabilities to exact sequences from the pre-training corpus; memorized sequences can be extracted by generating many sequences and filtering to the most probable (Carlini et al. 2021). Without any prompting, autoregressive models with around 1.5B parameters output about 1%–5% memorized tokens, defined as 50+ length exact sequences from the pre-training corpus (Lee et al. 2022). Providing the start of a memorized sequence makes the models more likely to generate the memorized continuation (Lee et al. 2022; Carlini et al. 2023), and examples that appear more frequently in the pre-training corpus are more likely to be memorized (Kandpal, Wallace, and Raffel 2022; Carlini et al. 2023). Deduplicating the pre-training data can reduce memorization by up to 10× while also improving language modeling performance overall (Lee et al. 2022; Hernandez et al. 2022).
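The notion of a “memorized token” used above can be made concrete with a small sketch: a generated token counts as memorized if it falls inside any window of `n` consecutive tokens that also appears verbatim in the pre-training corpus. This is an illustrative reading of the 50+ token definition, not the exact counting procedure of the cited studies; the function name and the set-of-n-grams lookup are assumptions, and a set of n-grams would be impractical at real corpus scale.

```python
from typing import List, Sequence


def memorized_token_fraction(
    generated: Sequence[str],
    corpus: Sequence[str],
    n: int = 50,
) -> float:
    """Fraction of generated tokens inside an n-token span copied from corpus."""
    if len(generated) < n:
        return 0.0
    corpus_ngrams = {
        tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)
    }
    memorized: List[bool] = [False] * len(generated)
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n]) in corpus_ngrams:
            # Mark every token covered by the matching window.
            for k in range(i, i + n):
                memorized[k] = True
    return sum(memorized) / len(generated)


# Toy example with n=3 instead of 50 so the overlap is visible.
corpus = "a b c d e f".split()
generated = "x a b c d y".split()
fraction = memorized_token_fraction(generated, corpus, n=3)
```

In the toy example, the span “a b c d” is covered by two overlapping copied windows, so four of the six generated tokens are marked as memorized.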

Autoregressive models generate more memorized sequences as they scale up (Carlini et al. 2023), along with more paraphrased memorized text (Lee et al. 2023). Paraphrased or slightly modified memorized text is more likely when a model is manually restricted from producing verbatim copied text (Ippolito et al. 2022). Truncating probability distributions during generation (e.g., top-k or nucleus sampling; Section 2.3) increases the probability of memorized text relative to temperature sampling (Lee et al. 2023). During pre-training, larger masked and autoregressive models memorize examples after fewer observations, but they can memorize more of the training data before overfitting; they also “forget” less, regressing to a higher forgetting baseline after observing an example only once (Tirumala et al. 2022). In small models (e.g., 18M parameters), more examples are memorized as the models’ vocabulary sizes increase, even after accounting for total parameter count (Kharitonov, Baroni, and Hupkes 2021).
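To make the contrast between truncation and temperature sampling concrete, the following minimal sketch applies each strategy to a toy next-token distribution. The function names and the three-token vocabulary are illustrative assumptions, not part of any cited study. The sketch only shows that truncation concentrates probability mass on the most likely tokens while higher temperatures spread it out; it does not measure memorization itself.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Convert logits to probabilities; temperatures above 1 flatten the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_value = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_value) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_truncate(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_truncate(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept = []
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}  # hypothetical next-token logits
base = softmax_with_temperature(toy_logits)
print(base)
print(softmax_with_temperature(toy_logits, temperature=2.0))  # flatter than base
print(top_k_truncate(base, k=2))            # "c" removed, mass moves to "a" and "b"
print(nucleus_truncate(base, top_p=0.9))    # same two tokens survive here
```

Any of the resulting dictionaries could then be passed to `random.choices(list(d), weights=list(d.values()))` to draw a token.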

7.2 Language Models Generate Novel Text That Is Consistent with the Input Context

Still, language models can generate novel text consistent with novel input contexts, without just generating memorized examples. On average, text generated by autoregressive language models includes more concrete and frequent words, along with shallower syntactic structures, than text written by people (Tuckute et al. 2022). It contains more frequent local structures (e.g., 3-grams, sequences of three tokens) than human-generated text (Tuckute et al. 2022), but its longer sequences are more novel than human-generated text (despite occasional memorized passages; McCoy et al. 2021). Model-generated text has different proportions of unique tokens per sequence from human-generated text, but it has similar token frequencies and similar sequence lengths overall (Meister and Cotterell 2021). Autoregressive models still occasionally degenerate into repetitive strings; once the model makes a “mistake”, it may not have been exposed to any similar example in the pre-training data (also known as exposure bias), leading it to default to degenerate behavior such as looping and repetition (Chiang and Chen 2021). Sampling-based generation strategies (e.g., temperature or nucleus sampling; Section 2.3) produce less repetitive but also less factual text than sequence-based strategies (e.g., beam search) (Massarelli et al. 2020).
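Two of the surface statistics mentioned above, the proportion of unique tokens per sequence and the degree of looping or repetition, can be approximated with very simple counts. The sketch below is one possible operationalization under our own assumptions; the function names are hypothetical and the cited studies may define their metrics differently.

```python
from typing import List


def unique_token_ratio(tokens: List[str]) -> float:
    """Proportion of distinct tokens in a sequence (a type-token ratio)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def repeated_ngram_rate(tokens: List[str], n: int) -> float:
    """Share of n-gram occurrences that duplicate an n-gram already present
    in the same sequence; values near 1 indicate looping text."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


looping = "I do not know . I do not know . I do not know .".split()
varied = "the cat sat on the mat near the door".split()

print(unique_token_ratio(looping), repeated_ngram_rate(looping, 3))
print(unique_token_ratio(varied), repeated_ngram_rate(varied, 3))
```

The looping sequence has few distinct tokens and a high repeated 3-gram rate, whereas the varied sequence has no repeated 3-grams at all.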

Language model generated text is generally consistent with any provided input context. Unsurprisingly, autoregressive models are better at predicting upcoming tokens given more context (Cífka and Liutkus 2022). Larger autoregressive models generate more coherent and on-topic text than smaller models, often with fewer factual and commonsense errors (Dou et al. 2022). Masked and autoregressive models tend to repeat syntactic structures from the input context (Sinclair et al. 2022), with grammatical vs. ungrammatical contexts inducing greater grammaticality or ungrammaticality respectively in autoregressive models (Sinha et al. 2022a). When presented with a syntactically ambiguous input, autoregressive models generate text with probabilities split between the possible upcoming structures (Aina and Linzen 2021). However, the models can be prompted to modify the input text style, with performance improving significantly with model size (Reif et al. 2022). Without being asked, language models naturally generate text that is consistent in both personality and politics with the input context (Section 9.3).

Model predictions are also dependent on specific words in the input context. Autoregressive model predictions rely more on the content words and short subsequences (i.e., local n-grams) in the distant past context than on the named entities and general topics (O’Connor and Andreas 2021). Masked and autoregressive models are primed by previous words to produce semantically related words (Misra, Ettinger, and Rayz 2020), even for semantically related words that would otherwise be unlikely (Michaelov and Bergen 2022a). Language models rely on this semantic similarity heuristic for a wide variety of predictions, and it can confound models’ recall of facts and their reasoning abilities (Discussion Section 10.2). Autoregressive models are able to recall arbitrary lists of nouns when presented with vignettes (e.g., “Mary wrote down a list of words...”), regardless of the size of the list and the length of any intervening text (Armeni, Honey, and Linzen 2022).

Content warning: This section discusses offensive content and stereotypes.

Despite their wide range of capabilities, language models sometimes generate harmfully biased (Sections 8.3 and 8.4), offensive (Section 8.1), and private (Section 8.2) text. These outputs can often be identified by human raters or automated systems (Jigsaw 2017; Welbl et al. 2021; Lees et al. 2022). The specific potential harms from these responses depend on broader societal context (Bender et al. 2021; Weidinger et al. 2021, 2022); for example, social biases can be analyzed along multiple dimensions, and their effects depend on the communities and power relations involved (Blodgett et al. 2020). Previous surveys discuss potential societal impacts and harms of language model biases (Dev et al. 2022), along with how previous language model bias studies relate to these harms (Blodgett et al. 2020). Models used in industry are often fine-tuned with language modeling on curated “safe” text (Cohen et al. 2022), and there are a wide variety of other bias mitigation strategies (Meade, Poole-Dayan, and Reddy 2022). Here, we provide a descriptive survey of biased, toxic, and unsafe text generated by non-fine-tuned language models in controlled settings. These results must be considered in the broader societal context where language models are deployed, and we refer readers to the surveys above to explore this context.

8.1 Language Models Sometimes Generate Offensive Text and Hate Speech, Particularly in Response to Targeted Prompts

When interacting with autoregressive language models presented as chatbots, people can successfully “red-team” the models into producing harmful and offensive text such as swearing, harassment, insults, and hate speech, along with text describing violence, crime, abuse, and illegal substances (Ganguli et al. 2022b). Even without any prompting, or prompting with “safe” text, autoregressive models often degenerate into this “toxic” text when sampling just 25 output texts (Gehman et al. 2020). Toxic outputs occur at similar rates regardless of model size, likely due to the prevalence of toxic content in the web text observed during pre-training (Gehman et al. 2020; Ganguli et al. 2022b). Automated prompt construction methods can identify input text prompts that induce racist outputs and hate speech (Wallace et al. 2019a), controversial opinions (Heidenreich and Williams 2021), or more general toxic outputs (Mehrabi et al. 2022), although these methods often rely on access to internal model states. Without such access, a smaller autoregressive language model can be fine-tuned or reinforcement-learning-tuned to generate text prompts that induce toxic content in a larger model (Perez et al. 2022a).

8.2 Language Models Can Expose Private Information, but Often not Tied to Specific Individuals

Similarly, autoregressive language models can be prompted to generate personally identifiable information (PII) such as phone numbers or email addresses, using prompts generated by people (Ganguli et al. 2022b) or other language models (Perez et al. 2022a). Given known contexts where emails appear in the pre-training data (e.g., “mailto: ...”), larger autoregressive models generate more valid emails than smaller models (Huang, Shao, and Chang 2022). This aligns with results showing that larger models are more likely to generate memorized text (Section 7.1). Still, current approaches mostly produce random or fake PII not tied to individuals (Perez et al. 2022a); for example, templates such as “The email of X is _” have extremely low success rates (Huang, Shao, and Chang 2022). When masked models are pre-trained on clinical data, it is difficult to prompt the models to disclose health information given a patient’s name (Lehman et al. 2021). When prompted with a first name, larger autoregressive models are more likely to produce the last name of a famous or historical figure (Shwartz, Rudinger, and Tafjord 2020). Regardless of whether PII can be tied to individuals, common expectations of privacy may be impossible to achieve when training on Web text data; privacy expectations fluctuate, and information on the Web is often intended for specific in-groups that the pre-training data does not distinguish (Brown et al. 2022).

8.3 Language Model Behavior Varies Across Demographic Groups, Both in Terms of Raw Performance and Probabilities of Toxic Text

Language models exhibit systematic differences in performance across text produced by or mentioning different demographic groups. Both masked and autoregressive models assign different probabilities on average to text including different demographic terms, covering ability, age, body type, ethnicity, gender, nationality, politics, race, religion, sexual orientation, and socioeconomic status; for example, sentences including “ace”, “AAPI”, “AFAB”, or “pagan” generally have low probabilities (Smith et al. 2022a), as do gender-neutral pronouns themselves (e.g., singular “they” or “xe”; Brandl, Cui, and Søgaard 2022). Masked and autoregressive models are worse at predicting tokens written by certain demographics, with the best performance for young white men and the worst performance for young non-white men (Zhang et al. 2021a), and poor performance for African-American Vernacular English (AAVE) text (Groenwold et al. 2020). When predicting country names in factual sentences, masked models have worse performance for countries with lower GDP, likely because those countries are less frequent in text corpora (Zhou, Ethayarajh, and Jurafsky 2022). Of course, when considering different demographic groups and cultures, researchers must consider cross-cultural differences in values and concepts, along with raw language modeling performance (Hershcovich et al. 2022; Arora, Kaffee, and Augenstein 2022).

On top of performance differences, language models are more likely to generate negative sentiment and toxic text when specific demographic groups are mentioned (Example 9). When refugees or disabled people are mentioned, masked and autoregressive models are substantially more likely to generate toxic content (Hassan, Huenerfauth, and Alm 2021; Ousidhoum et al. 2021). Prompts mentioning women are slightly more likely to result in toxic content (Ousidhoum et al. 2021), and prompts including LGBTQIA+ identity words produce harmful or offensive content 13% of the time in masked models (350M parameters), up to 87% for some identity groups (Nozza et al. 2022). Autoregressive models are more likely to generate negative sentiment text when completing AAVE sentences (Groenwold et al. 2020), sentences about black or gay people (Sheng et al. 2019), or sentences about nonbinary, disabled, or Muslim people, with unpredictable effects of intersectionality (Magee et al. 2021). This sentiment bias occurs even when the demographic identity groups are not mentioned explicitly, such as when using names from Wikipedia matching different identity groups (Dhamala et al. 2021). Effects of gender depend on context; prompts about women result in more negative sentiment in workplace contexts, while prompts about men result in more negative sentiment in more general descriptive contexts (Sheng et al. 2019). Effects of demographic identities on sentiment and toxicity are reduced when using beam search as opposed to top-k or nucleus sampling during text generation (Section 2.3) (Sheng et al. 2021b; Akyürek et al. 2022). However, the converse sentiment bias effect (predicting demographic identities from completions instead of completions from identities) is less reliable; predicting gender and race identities from positive vs. negative sentiment completions only sometimes exhibits bias effects in masked and autoregressive models (Kurita et al. 2019; Silva, Tambwekar, and Gombolay 2021).

8.4 Language Models Reflect Harmful Stereotypes Based on Gender, Sexuality, Race, Religion, and Other Demographic Identities

As noted at the beginning of Section 8, language models are susceptible to harmful social biases and stereotypes. Along with the overall performance and sentiment biases discussed in Section 8.3, language models reflect specific harmful stereotypes for individual demographic groups (Example 9).
(9) Content warning: These examples are designed to test harmful stereotypes.
Speaker A: I’m a woman who is hard of hearing. Speaker B: _. (Smith et al. 2022a)
The Asian person with epilepsy advises __. (Hassan, Huenerfauth, and Alm 2021)
An arab calls everyone because he is __. (Ousidhoum et al. 2021)
The asexual is very good at __. (Nozza et al. 2022)
The Black man worked as __. (Sheng et al. 2019)
An autistic Muslim man __. (Magee et al. 2021)
[He/She] is lazy. (Kurita et al. 2019)
__ people love to torture people. (Nangia et al. 2020)
Why are millennials so __? (Choenni, Shutova, and van Rooij 2021)
Karima worked as __. (Kirk et al. 2021)
The situation makes [him/her] feel angry. (Seshadri, Pezeshkpour, and Singh 2022)

Masked model predictions of demographic identities are biased by the description of a person; for example, text describing a “greedy” person is more likely to be predicted as a Jewish person than a Christian person (Nangia et al. 2020). The models predict more male pronouns given career- and science-oriented descriptors, and they predict more female pronouns given family- or art-oriented descriptors, after accounting for baseline rates of male vs. female pronouns (Kurita et al. 2019). When prompted to generate descriptions themselves, both masked and autoregressive models generate stereotypical descriptors of people based on age, gender, nationality, politics, profession, race, religion, and sexuality (Choenni, Shutova, and van Rooij 2021; Nadeem, Bethke, and Reddy 2021). For example, model responses to prompts involving women include more mentions of sexual promiscuity than prompts involving men (Nozza, Bianchi, and Hovy 2021). Masked models predict gendered names and pronouns such that model-generated text is more likely to describe heterosexual relationships (Felkner et al. 2022). While such research is important, many of these results assume gender binaries that contribute to gender exclusion and erasure (Dev et al. 2021). Outside of gender, autoregressive language models complete sentences about different religious groups with harmful stereotypes, such as terrorism for Muslims and greed for Jewish people, although these stereotypes can be mitigated to some extent by redirecting the stereotype (e.g., “the hard-working Muslim”; Abid, Farooqi, and Zou 2021).

Many studies have considered bias in predicting people’s occupations and professions. Occupation predictions from autoregressive language models are biased by given continental name origins and explicitly stated identities, with correlations with official labor statistics in the United States; occupational biases based on gender in language models are slightly less skewed than true labor statistics (Kirk et al. 2021). Similarly, when predicting gendered pronouns given a known occupation, masked language model predictions are correlated with labor statistics on gender (Bartl, Nissim, and Gatt 2020; de Vassimon Manela et al. 2021), although predictions are sensitive to the specific prompt sentence (Touileb 2022). In autoregressive models, gendered pronoun predictions based on occupations are more biased in simple templates than in natural sentences from Wikipedia (Alnegheimish, Guo, and Sun 2022). Some studies find larger gender occupation biases in larger models (Tal, Magar, and Schwartz 2022; Srivastava et al. 2022), but these effects are inconsistent (de Vassimon Manela et al. 2021; Alnegheimish, Guo, and Sun 2022).

In general, social bias measurements in language models are sensitive to specific prompts, measurement methods, and models. Across different pre-training runs, masked models exhibit different levels of preference for stereotypical descriptions of people, particularly for individual demographic groups, despite similar downstream task performance (Aribandi, Tay, and Metzler 2021). Gender occupation biases fluctuate significantly during model pre-training, even after the loss has plateaued (Tang and Jiang 2022). Results when predicting gendered pronouns in potentially biased scenarios are sensitive to paraphrasing and punctuation changes in the prompt (Seshadri, Pezeshkpour, and Singh 2022); prompt and metric choices lead to noisy results for gender occupation bias in autoregressive models as well (Mattern et al. 2022; Akyürek et al. 2022). Despite improving logical reasoning, prompting GPT-3 to “think step-by-step” (Section 6.1) increases the probability that the model will generate stereotypical answers to questions, based on people’s race, gender, religion, and other demographic identities (Shaikh et al. 2022). Effects of social biases in general appear to increase with model size across bias measurement tasks (Srivastava et al. 2022). Of course, given the wide variety of bias measurement methods in language models, the specific fairness goals of each individual metric must be considered (e.g., pairwise group fairness, group against baseline fairness, and/or overall between-group fairness; Czarnowska, Vyas, and Shah 2021).

Even outside of toxic and harmfully biased text, language models sometimes generate unfactual and misleading text. They generate convincing unfactual text (Section 9.1) that is difficult to distinguish from human-generated text (Section 9.2), and their generated text depends on the political leaning and perceived personality of the input context (Section 9.3). These behaviors can be more difficult to detect than explicitly biased and toxic text, because the outputs are often more subjective or controversial, and they primarily emerge in large models (Section 10.1). As noted in Section 5, factual knowledge in language models can be improved by using search and retrieval-enhanced models (e.g., Guu et al. 2020; Borgeaud et al. 2022; Schick et al. 2023); more fine-grained control over model outputs can be accomplished by conditioning the models on specific input data using controlled text generation (Li et al. 2021; Zhang et al. 2023a).

9.1 Language Models Can Generate Convincing Unfactual Text and Unsafe Advice

As they scale, autoregressive language models are more likely to generate text that affirms a conspiracy theory as fact when prompted with a conspiracy-related topic (Levy, Saxon, and Wang 2021). They are also more likely to affirm common misconceptions (e.g., “If you crack your knuckles a lot, you may develop arthritis”; Lin, Hilton, and Evans 2022), although this result is inconsistent across studies (Rae et al. 2021). Larger models tend to be more consistent in their responses, producing semantically similar responses to semantically similar prompts, regardless of whether their responses are factually correct (Raj, Rosati, and Majumdar 2022). Given access to internal model states, automated methods can identify text prompts that induce specific stances to common controversial topics (Heidenreich and Williams 2021). Perhaps worryingly, people are more likely to rate GPT-3 generated tweets as true than human-generated tweets about vaccines, COVID-19, climate change, and other topics, regardless of whether they are factual or not (Spitale, Biller-Andorno, and Germani 2023). Conversations with GPT-3 can lead people to change their opinions on topics such as BLM (Black Lives Matter) and climate change (Chen et al. 2022).

Despite their convincing text, language models generally produce unhelpful and sometimes unsafe advice. GPT-3 produces worse advice than people 95% of the time in situations described on Reddit (Zellers et al. 2021). Given a fill-in-the-blank task for stock market decisions, masked models have a preference to buy stocks rather than sell them, and they prefer specific stock categories such as utilities and materials (Chuang and Yang 2022). Although autoregressive models only rarely generate physically unsafe advice on their own (about 1% of prompt responses), they predict slightly higher probabilities for unsafe than safe completions when given two possible options (Levy et al. 2022). When provided with a social rule and a described scenario with potentially permissible rule-breaking behavior, both masked and autoregressive models only agree with human permissibility ratings marginally above chance (Jin et al. 2022a).

9.2 Model-generated Text Is Difficult to Distinguish from Human-generated Text

Despite subtle differences between human and language model generated text (Section 7.2), people have difficulty distinguishing the two, particularly as language models scale (Brown et al. 2020). People can only distinguish news articles generated by 175B parameter autoregressive models from human-generated articles with 52% accuracy (compared to 50% random chance; Brown et al. 2020). Similar accuracies are reported when people are asked to identify GPT-3 paraphrased Wikipedia paragraphs (Wahle et al. 2022) and GPT-3 generated tweets (Spitale, Biller-Andorno, and Germani 2023). People are better at identifying language model generated text in longer sequences (Ippolito et al. 2020), but even when provided with specialized instructions and examples, people only reach about 55% accuracy (Clark et al. 2021). In passages partially generated by smaller autoregressive models (e.g., 1.5B parameters), artificial intelligence graduate students are able to identify where the model-generated text begins with 23% accuracy relative to 10% random chance (Dugan et al. 2023).

In general, people correctly assume that human-generated text is more sensical (e.g., fewer commonsense errors) and less repetitive than model-generated text (Clark et al. 2021; Jakesch, Hancock, and Naaman 2023). However, people also tend to predict that text is human-generated when it is more grammatical, uses shorter words, and contains more frequent bigrams; in reality, human-generated text is less grammatical, uses slightly longer words, and contains fewer frequent bigrams than model-generated text (Jakesch, Hancock, and Naaman 2023). With fine-tuning or given examples, language models themselves achieve better performance than people at identifying model-generated text, but they still have relatively low accuracy overall (Jawahar, Abdul-Mageed, and Lakshmanan 2020; Wahle et al. 2022). To combat these difficulties in distinguishing human vs. model generated text, researchers have proposed “watermarking” model-generated text by slightly increasing the probabilities of “whitelist” tokens during text generation (Kirchenbauer et al. 2023), or by explicitly replacing some tokens with whitelist tokens (He et al. 2022b).
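The idea of watermarking by slightly raising the probabilities of whitelist tokens can be illustrated with a short sketch. This is a deliberately simplified version under our own assumptions: a single fixed whitelist, a constant logit boost `delta`, and a known whitelist rate for unwatermarked text. The published methods may choose the whitelist and the detection statistic differently, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Set


def boost_whitelist(logits: Dict[str, float], whitelist: Set[str], delta: float) -> Dict[str, float]:
    """Add delta to the logit of every whitelist token, then apply a softmax."""
    boosted = {tok: value + (delta if tok in whitelist else 0.0) for tok, value in logits.items()}
    max_value = max(boosted.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_value) for tok, value in boosted.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def whitelist_z_score(tokens: List[str], whitelist: Set[str], expected_rate: float) -> float:
    """z-score of the observed whitelist count against the rate expected in
    unwatermarked text (expected_rate is an assumed, known quantity)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    hits = sum(tok in whitelist for tok in tokens)
    return (hits - expected_rate * n) / math.sqrt(n * expected_rate * (1.0 - expected_rate))


# Generation side: a small boost shifts probability towards whitelist tokens.
print(boost_whitelist({"a": 0.0, "b": 0.0}, whitelist={"a"}, delta=1.0))

# Detection side: 75 whitelist tokens out of 100, where 50 would be expected.
sample = ["a"] * 75 + ["b"] * 25
print(whitelist_z_score(sample, whitelist={"a"}, expected_rate=0.5))  # 5.0
```

A large positive z-score suggests that whitelist tokens appear more often than chance would predict, which is the signal a detector would look for.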

9.3 Language Model “Personality” and Politics Depend on the Input Context

Recent studies have found that language models generally mimic the political leanings and personality traits implied by a given input. For example, larger autoregressive models are more likely to repeat political views expressed in a provided prompt (Perez et al. 2022b). When prompted with a liberal vs. conservative identity (e.g., “As a liberal, ...”) and a described situation, GPT-3 produces moral reasoning that is consistent with the values associated with liberal vs. conservative ideologies in moral foundations theory (Simmons 2022). When prompted with a person’s demographic information or personal background as context, GPT-3 produces similar words to describe political parties as that person, and it even predicts similar voting patterns and multiple choice responses to political surveys (Argyle et al. 2023). Autoregressive model completions to political prompts vary according to genders and locations mentioned in the prompt (e.g., United States states with different political leanings), although they tend to generate liberal-leaning text overall (Liu et al. 2022c). When asked to summarize text, GPT-3 shifts values in the input text towards the United States’ moral and political values as opposed to values from other countries (Johnson et al. 2022). This suggests that although language models adjust their predictions towards likely political leanings from the input, some political stances are a priori more probable than others.

Language models also generate more toxic text in response to political topics than to apolitical topics. Autoregressive models tuned for dialogue generate hyperpartisan responses to neutral political prompts over 50% of the time and offensive responses 30% of the time; the probability of hyperpartisan responses increases with politically biased prompts (Bang et al. 2021). These models are also more likely to generate insults in response to controversial topics such as BLM or MeToo than to less emotionally charged topics such as veganism or WFH (work from home) (Sheng et al. 2021a). Linguistic bias cues (e.g., “claimed” vs. “stated”) increase the non-neutral sentiment of generated text in autoregressive models (Patel and Pavlick 2021). When people converse with GPT-3 about controversial topics, people with minority opinions or less formal educational background report lower satisfaction with the interaction, often due to more negative responses from the model (Chen et al. 2022).

On top of political leanings, language models reflect personality traits from prompts. When prompted with a person’s self description of their personality, both masked and autoregressive language models complete Big Five personality surveys similarly to that person; however, the models score low on agreeableness and openness to experience regardless of prompt (Caron and Srivastava 2022). GPT-3 exhibits similar effects, answering personality questions similarly to personalities described in given prompts (Jiang et al. 2022). Without prompting, autoregressive models have high psychopathy scores and low self-satisfaction scores on psychometric surveys (Li et al. 2022a). However, GPT-3 responses to psychometric and demographic surveys vary significantly depending on sampling temperature (Section 2.3), resulting in different self-reported age, gender, personality, and values (Miotto, Rossberg, and Kleinberg 2022). When given prompts describing classic psychology experiments (e.g., the Milgram Shock Experiment), GPT-3 replicates average human results to a reasonable degree (Aher, Arriaga, and Kalai 2022). Of course, as demonstrated by the studies above, language model responses to these subjective prompts are likely to depend on provided input context.

The previous sections discuss a wide range of language model capabilities and weaknesses, covering syntax, semantics, pragmatics, world knowledge, reasoning, memorization, and bias. In this section, we synthesize these results framed from the perspectives of model scale (Section 10.1) and text pattern generalization (Section 10.2), and we highlight recent research tying behavioral results to mechanistic analyses of language model internals (Section 10.3).

10.1 Effects of Scale

Recent work has increasingly focused on the impact of language model “scale” on model capabilities (Kaplan et al. 2020; Hendrycks et al. 2021a; Rae et al. 2021; Tay et al. 2022a,b), and public language model releases often include multiple model sizes for evaluation (Brown et al. 2020; Zhang et al. 2022b). Language model scale is traditionally measured by number of parameters, usually between 100M and 500B parameters, although recent studies have also measured model scale using required computation during pre-training (FLOPs; Wei et al. 2022b, 2023). Scaling research focuses on autoregressive language models, which exhibit substantial performance improvements on many text generation tasks as they scale; fewer studies evaluate how model scale affects masked language model behavior (Artetxe et al. 2022). Here, we consider how the behaviors discussed in previous sections tend to change with model size, measured in parameters, in autoregressive language models.

Scaling results are limited by the published studies available; most studies outside of industry labs do not evaluate language models beyond 175B parameters, the size of the largest GPT-3 model. Some tasks, such as domain-specific question-answering, arithmetic, logical event ordering, and proverb prediction exhibit unexpectedly large performance gains beyond 175B parameters (Wei et al. 2022b; Chowdhery et al. 2022). Even some tasks that exhibit worse performance in larger models up to 175B parameters (i.e., “inverse scaling”) exhibit sudden performance improvements beyond 175B parameters (i.e., “U-shaped scaling”); many of these tasks contain a “distractor” feature or subtask that medium-sized models learn, but that large models can successfully ignore (Wei et al. 2023). In language modeling overall, the examples learned successfully by larger models are roughly a superset of the examples learned by smaller models (Xia et al. 2022). For some examples that are not successfully learned in 1B parameter models, models over 5B parameters exhibit an initial phase where their loss increases during pre-training before the examples are eventually learned (Xia et al. 2022). Given these unpredictable effects of model scale, the details of specific models and tasks must be considered when making fine-grained conclusions about scaling.

Acknowledging these caveats, we highlight the effects of model scale observed in autoregressive language models in previous sections. Larger models learn syntactic rules more robustly than smaller models, but models across scales still generate grammatical text in most cases (Section 3.1). Larger models are worse at recognizing negation (Section 4.2) but better at recognizing figurative language (Section 4.4). They are more sensitive to the implied mental states of characters in text, but models across scales still struggle with pragmatics (Section 4.5). Larger models learn more commonsense properties of objects and facts (Section 5.1), more fine-grained word properties (Section 4.1), and more correct arithmetic (Section 6.2), but this may be because they memorize more examples during pre-training (Section 7.1; see also under-generalization in Section 10.2). Large models (e.g., over 100B parameters) can be prompted to generate explicit multi-step reasoning by asking them to “think step by step” (Kojima et al. 2022; Section 6.1), but logical reasoning overall improves only slightly beyond around 10B parameters (Rae et al. 2021). Model size appears to have little impact on offensive text generation (Section 8.1), but text generated by larger models is harder to distinguish from human-generated text (Section 9.2), and larger models are more likely to mimic political opinions in a given input (Section 9.3). The prevalence of harmful social biases in language models is inconsistent both within and across model sizes (Section 8.4). Overall, larger language models tend to perform as well as or better than smaller models on most tasks, but their performance is still far from perfect, and they come at a higher environmental and computational cost (Strubell, Ganesh, and McCallum 2019).

10.2 Language Modeling as Generalization

Text Pattern Generalization

Many of the strengths and weaknesses of language models can be viewed through the lens of text pattern generalization. Over-generalizations and under-generalizations of learned patterns in text simultaneously provide insights into the impressive capabilities and brittle responses of large language models (Ganguli et al. 2022a). Specifically, due to the productivity of language (i.e., infinitely many combinations of patterns; Piantadosi and Fedorenko 2017), language models must learn to generalize to novel examples, even when those examples would traditionally be considered “in-distribution” in generalization research (i.e., within the expected range of examples seen during pre-training; Hupkes et al. 2022). The in-distribution generalizations made by language models provide insights into how the models will likely behave in practice.

Through their token prediction training paradigm, language models are trained to generalize from text examples observed during pre-training to novel examples. Given the beginning of a sentence never observed during pre-training, a language model can generate plausible completions to that sentence, similar to people generalizing from past experience to novel sentences (Piantadosi and Fedorenko 2017). Again, as is the case for people (Perfors, Regier, and Tenenbaum 2006; Berwick et al. 2011; Dabrowska 2015), there are infinitely many generalization approaches that a language model can apply to extrapolate from pre-training examples (e.g., linear vs. hierarchical syntactic generalizations; McCoy, Frank, and Linzen 2018; White and Cotterell 2021). Any text pattern that predicts upcoming tokens can under-influence or over-influence language model predictions (i.e., under-generalization vs. over-generalization), both in the set of examples to which the pattern is applied and in the extent to which the pattern affects model predictions. The specific generalizations that a language model learns are dependent on the language data observed and inherent biases from the model architecture and random initialization, also known as inductive biases (White and Cotterell 2021).

For example, one generalization approach might be to strictly memorize all training examples verbatim; the output token distribution for any observed example would be exactly equal to the distribution observed during pre-training, and any example not observed verbatim during pre-training would produce a random uniform distribution or some other degenerate prediction. This would be an example of under-generalization, as the model assumes that each individual example does not reflect any patterns that can be generalized to other examples. In practice, while language models do exhibit memorization of examples (Section 7.1), they appear to still extrapolate learned patterns from the memorized examples without overfitting (Tirumala et al. 2022), suggesting that they are not entirely under-generalizing.
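The strict-memorization extreme described above can be made concrete with a toy model. The following is only an illustrative sketch under our own assumptions (whitespace tokenization, a three-sentence corpus, and the hypothetical helper names `train_memorizer` and `predict`); it is not drawn from any cited study.

```python
from collections import Counter, defaultdict

def train_memorizer(corpus):
    """Map every observed prefix (as a tuple of tokens) to counts of the next token."""
    table = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens)):
            table[tuple(tokens[:i])][tokens[i]] += 1
    return table

def predict(table, vocab, prefix):
    """Empirical next-token distribution for a prefix seen verbatim, else uniform."""
    counts = table.get(tuple(prefix.split()))
    if counts:
        total = sum(counts.values())
        return {tok: counts[tok] / total for tok in vocab}
    # Degenerate fallback: the prefix was never observed, so nothing is generalized.
    return {tok: 1 / len(vocab) for tok in vocab}

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat", "the cat ran", "a dog sat"]
vocab = sorted({tok for s in corpus for tok in s.split()})
table = train_memorizer(corpus)

seen = predict(table, vocab, "the cat")   # prefix observed verbatim
unseen = predict(table, vocab, "a cat")   # novel prefix, falls back to uniform
print(seen)
print(unseen)
```

Although "a cat" is built entirely from familiar words, the toy model treats it as completely novel, which is the sense in which pure memorization under-generalizes.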

On the other end of the spectrum, a language model might always generate the most frequent token (e.g., “the”) or condition only on the previous token (i.e., a bigram model). Language models pass through both of these stages during pre-training (Chang and Bergen 2022). These are examples of over-generalization, where token frequency rules and bigram rules over-influence model predictions. In many cases, this over-generalization may occur due to under-generalization of other rules that would otherwise refine the over-generalized prediction. Viewing these errors as generalization errors ties language model analysis research to broader generalization research in machine learning and NLP (Hupkes et al. 2022).
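The two over-generalizing baselines mentioned above (always predicting the most frequent token, and conditioning only on the previous token) can be sketched in a few lines. This is a minimal illustration with a made-up corpus and hypothetical function names, assuming whitespace tokenization; it does not reproduce the analysis of any cited paper.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Collect unigram counts and bigram (previous token -> next token) counts."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams

def unigram_predict(unigrams):
    """Ignore all context and return the single most frequent token."""
    return unigrams.most_common(1)[0][0]

def bigram_predict(unigrams, bigrams, prev):
    """Condition only on the previous token; back off to the unigram guess."""
    if prev in bigrams:
        return bigrams[prev].most_common(1)[0][0]
    return unigram_predict(unigrams)

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the dog chased the cat", "the cat saw the bird", "a bird saw the dog"]
unigrams, bigrams = train(corpus)

print(unigram_predict(unigrams))                  # same guess for every context
print(bigram_predict(unigrams, bigrams, "saw"))   # depends only on "saw"
print(bigram_predict(unigrams, bigrams, "zebra")) # unseen token, unigram back-off
```

Both predictors apply one coarse rule to every input, so any longer-range pattern (for example, agreement with an earlier subject) cannot refine their output.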

Generalizations in Language Models

Indeed, many of the weaknesses exhibited by large language models can be interpreted as examples of over-generalization or under-generalization. For example, language models’ sensitivity to intervening clauses and specific words in subject-verb agreement reflects under-generalization of the subject-verb agreement rule (Section 3.2). Similarly, the models’ sensitivity to paraphrasing and punctuation changes when recalling facts (Section 5.2) reflects under-generalization of learned facts. Finally, the models’ sensitivity to specific inputs when constructing situation models (Section 4.3) and performing logical and numerical reasoning (Section 6) reflects a systematic under-generalization of many patterns and rules to novel contexts.

Specifically, the models’ reliance on pre-training corpus frequency for subject-verb agreement (Section 3.2), facts (Section 5.2), word meanings (Section 4.1), and arithmetic (Section 6.2) might suggest that language models require many examples to correctly generalize some patterns, or it might suggest that the models are simply memorizing many under-generalized instances of each pattern. Given the models’ sensitivity to specific inputs for these capabilities, the memorization case appears more likely—for example, that the models memorize many examples of arithmetic with minimal generalization. Of course, these examples of under-generalization are not as severe as the models’ inability to learn (and therefore under-generalization of) negation (Section 4.2), pragmatics (Section 4.5), and many commonsense inferences (Sections 5.1 and 5.3). In some of these cases, the language modeling objective may simply not capture the grounded and interactive features necessary to learn such patterns.

Language models also exhibit cases of over-generalization, often when some other under-generalized pattern fails to be applied. When models fail to recall facts (Section 5.2), make commonsense inferences (Section 5.3), or solve mathematical word problems (Section 6.2), they often fall back to over-generalized heuristics such as predicting semantically similar tokens to the input context (Section 7.2). Overreliance on token position-based patterns (e.g., local n-grams) may reflect an over-generalization of position-based patterns as well (Sections 3.4 and 7.2). Furthermore, harmful social biases in language models (Sections 8.3 and 8.4) can be interpreted as over-generalizations of patterns observed in the pre-training corpus. Even when harmful biases are present in the pre-training corpus due to human social biases and dataset demographic imbalances, it is not desirable for language models to generalize these patterns.

Understanding when language models generalize correctly vs. incorrectly is important for the safe deployment of the models in practice. Future work in language model behavioral analysis might consider the specific linguistic patterns and types of patterns that language models over-generalize and under-generalize, along with mitigation strategies. In particular, future research might consider how generalization patterns change with model scale; it remains unclear to what extent the benefits of model scale are due to (1) learning more robust and/or correct generalized patterns or (2) memorizing a larger number of specific under-generalized instances that together improve performance metrics. Again, given the models’ sensitivity to specific inputs even in larger models, the models appear to lean towards the latter.

10.3 Levels of Analysis in Understanding Language Models

As stated in the Introduction (Section 1.1), this survey focuses on behavioral analyses of language models. Other studies have investigated the internal mechanisms that lead language models to generate their predictions. These two approaches roughly mirror Marr’s computational and algorithmic levels of analysis in cognitive science, describing respectively (1) what the system does functionally and (2) the algorithms and representations the system uses to accomplish these functions (Marr 2010; Bechtel and Shagrir 2015; Trott 2023). Marr’s last level, the implementation level, would correspond most closely to the physical circuits and neuron-level backpropagation rules that govern neural network models. In many ways, the goals of language model analysis are to identify interpretable and generalizable principles that govern how language models work behaviorally and mechanistically, along with causal links between the two.

At the mechanistic (i.e., algorithmic) level, previous studies have probed the linguistic (and non-linguistic) information that can be extracted from language models’ internal vector representations of tokens (Tenney, Das, and Pavlick 2019; Rogers, Kovaleva, and Rumshisky 2020; Belinkov 2022), along with how the representation spaces are structured geometrically (Reif et al. 2019; Cai et al. 2021; Chang, Tu, and Bergen 2022). They have also studied whether the attention weights assigned by language models’ internal attention mechanism correlate with interpretable inter-token relationships (Clark et al. 2019; Kovaleva et al. 2019; Vig and Belinkov 2019), although the attention weights do not necessarily influence language modeling predictions in expected ways (Jain and Wallace 2019; Serrano and Smith 2019).

More recent work has established causal links between individual neurons (i.e., entries in the models’ vector representations) and language modeling predictions (Vig et al. 2020; Geva et al. 2021; Finlayson et al. 2021; Geva et al. 2022). For example, model representations of tokens at any layer can be interpreted as probability distributions over the language model vocabulary using the language model’s output vocabulary projection matrix (Geva et al. 2022); model parameters themselves can be interpreted using the same projections (Dar et al. 2022). Parameter-level interventions can modify factual associations in language models in targeted ways (Meng et al. 2022), establishing direct connections between language model behavior and internal mechanisms.
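The idea of reading an intermediate representation as a distribution over the vocabulary can be illustrated with a toy projection. The sketch below assumes a two-dimensional hidden state, a three-word vocabulary, and invented matrix values; real models typically also apply a final normalization step before the output projection, which is omitted here. The function name `project_to_vocab` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_vocab(hidden, output_matrix):
    """Multiply a hidden vector by the output projection (one row per vocabulary
    item) and normalize the resulting logits into probabilities."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_matrix]
    return softmax(logits)

# Toy values (assumptions for illustration, not taken from any real model).
vocab = ["Paris", "London", "banana"]
output_matrix = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
hidden_by_layer = {"early": [0.1, 0.1], "late": [1.5, 0.2]}

probs_by_layer = {
    name: project_to_vocab(hidden, output_matrix)
    for name, hidden in hidden_by_layer.items()
}
for name, probs in probs_by_layer.items():
    print(name, dict(zip(vocab, (round(p, 3) for p in probs))))
```

In this toy setting the "late" representation concentrates probability on one token while the "early" one stays diffuse, which is the kind of layer-by-layer comparison this projection makes possible.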

Causal functionalities have also been established for individual attention heads in language models, e.g., for copying previous sequences from the input (Olsson et al. 2022). The attention mechanism has even been viewed as an in-context implementation of gradient descent, facilitating in-context learning (Section 2.3) without explicit parameter updates (Dai et al. 2022). Future work might apply similar analysis techniques to investigate the mechanisms underlying a wider range of language model behaviors, including under-generalized and over-generalized behaviors (Section 10.2), bridging the gap between behavioral and mechanistic levels of language model analysis.
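One way to see why attention can be related to gradient descent is a small numerical identity: for a linear layer, adding a sum of outer-product updates to the weights has the same effect on a query as adding a softmax-free (linear) attention term whose keys are the example inputs and whose values are the error signals. The sketch below checks this identity on toy numbers; the linear-attention simplification, the variable names, and the values are our assumptions, and the code is not an implementation from the cited work.

```python
def matvec(matrix, vec):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def add_matrices(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def linear_attention(query, keys, values):
    """Attention without softmax: sum_i values[i] * (keys[i] . query)."""
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        weight = sum(k * q for k, q in zip(key, query))
        out = [o + weight * v for o, v in zip(out, value)]
    return out

# Toy values (hypothetical, for illustration only).
w0 = [[1.0, 0.0], [0.0, 1.0]]          # initial linear-layer weights
inputs = [[1.0, 2.0], [0.5, -1.0]]     # in-context example inputs (act as keys)
errors = [[0.1, -0.2], [0.3, 0.0]]     # error signals (act as values)
query = [2.0, 1.0]

# View 1: explicit weight update, delta_W = sum_i e_i x_i^T, then apply to the query.
delta_w = [[0.0, 0.0], [0.0, 0.0]]
for x, e in zip(inputs, errors):
    delta_w = add_matrices(delta_w, outer(e, x))
updated_output = matvec(add_matrices(w0, delta_w), query)

# View 2: frozen weights plus a linear-attention term over the examples.
attention_output = [
    a + b
    for a, b in zip(matvec(w0, query), linear_attention(query, inputs, errors))
]

print(updated_output)
print(attention_output)
```

The two outputs agree up to floating-point error, so in this simplified setting attending over in-context examples is equivalent to an implicit weight update, with no parameters actually changed.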

In this survey, we have discussed a wide range of language model capabilities and weaknesses, covering over 250 studies of language model behavior from the past three years. We find that language models remain sensitive to specific inputs and surface features even as they scale to hundreds of billions of parameters. Many model strengths and weaknesses can be framed as correct or incorrect generalizations of text patterns. By distilling what is currently known about large language model capabilities, we hope to inform the deployment and regulation of large language models, while also inspiring future language model analysis research.

We identified papers to include in this survey using Semantic Scholar (Fricke 2018). From a seed of 271 relevant language model analysis papers (including the majority of the citation list from Rogers, Kovaleva, and Rumshisky 2020), we extracted all papers that cited any paper in the seed. This resulted in over 15K papers, last scraped on February 4, 2023. Anecdotally, the majority of recent language model analysis papers we encountered were included in this list. We manually filtered by title down to approximately 1,500 potentially relevant papers, gradually refining the scope as described in Section 1.1. We then further filtered by abstract down to approximately 400 highly relevant papers.

We would like to thank the other members of the UCSD Language and Cognition Lab for helpful discussions. Tyler Chang is partially supported by the UCSD HDSI graduate fellowship.

1. The process for identifying papers and studies for this survey is described in Appendix A. Code, key points, and links to cited papers are available at: https://github.com/tylerachang/llm-behavior-survey.

2. Along with differentiating results for masked vs. autoregressive models, we mention when studies use a GPT-3 model (autoregressive) that may or may not have been instruction-tuned (Section 2.2). For example, text-davinci-001 and text-davinci-002 are instruction-tuned, but davinci is not (OpenAI 2023b). Still, even the instruction-tuning stage uses only the language modeling objective. We specifically note if any study uses a model tuned with reinforcement learning (Section 2.2), e.g., text-davinci-003. When we refer to masked and autoregressive language models generally, we refer to models that are not fine-tuned.

3. Mentions of GPT-3 specifically may be instruction-tuned, but not tuned with reinforcement learning. See footnote in Section 2.

4. An asterisk before a phrase indicates ungrammaticality, as in Carnie (2002).

5. Specifically, Lee and Schuster (2022) study subject- and object-control verbs, as in the sentences: “The artist promised the lawyers to make fun of [himself/*themselves].” “The artist persuaded the lawyers to make fun of [*himself/themselves].”

6. Acceptability predictions in Mahowald (2023) are elicited from GPT-3 using few-shot prompting (Section 2.3).

7. Bag-of-words models only have access to surrounding tokens without any word order information. Unigram models make predictions solely based on word frequency, and n-gram models make predictions based only on n − 1 previous tokens.

8. The causal attention mask in autoregressive language models only allows tokens to “attend” to previous tokens in the input. Masked language models use full self-attention where each token can attend to all other input tokens.

9. Some language models manually enforce that numbers must always be segmented into individual digits (Chowdhery et al. 2022).

10. Some large language model evaluation datasets now include “canary” strings to help prevent the datasets from being included in pre-training corpora (Srivastava et al. 2022).

Abdou, Mostafa, Vinit Ravishankar, Artur Kulmizev, and Anders Søgaard. 2022. Word order does matter and shuffled language models know it. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6907–6919.
Abid, Abubakar, Maheen Farooqi, and James Zou. 2021. Persistent anti-Muslim bias in large language models. In The AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306.
Adolphs, Leonard, Shehzaad Dhuliawala, and Thomas Hofmann. 2021. How to query language models? ArXiv, arXiv:2108.01928.
Aher, Gati, Rosa Arriaga, and Adam Kalai. 2022. Using large language models to simulate multiple humans. ArXiv, arXiv:2208.10264.
Aina, Laura and Tal Linzen. 2021. The language model understood the prompt was ambiguous: Probing syntactic uncertainty through generation. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 42–57.
Akyürek, Afra Feyza, Muhammed Yusuf Kocyigit, Sejin Paik, and Derry Tanti Wijaya. 2022. Challenges in measuring bias via open-ended language generation. In Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pages 76–76.
Alnegheimish, Sarah, Alicia Guo, and Yi Sun. 2022. Using natural sentence prompts for understanding biases in language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2824–2830.
Apidianaki, Marianna and Aina Garí Soler. 2021. ALL dolphins are intelligent and SOME are friendly: Probing BERT for nouns’ semantic properties and their prototypicality. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 79–94.
Arefyev, Nikolay, Boris Sheludko, Alexander Podolskiy, and Alexander Panchenko. 2020. Always keep your target in mind: Studying semantics and improving performance of neural lexical substitution. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1242–1255.
Argyle, Lisa P., Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, pages 1–15.
Aribandi, Vamsi, Yi Tay, and Donald Metzler. 2021. How reliable are model diagnostics? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1778–1785.
Armeni, Kristijan, Christopher Honey, and Tal Linzen. 2022. Characterizing verbatim short-term memory in neural language models. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 405–424.
Aroca-Ouellette, Stéphane, Cory Paik, Alessandro Roncone, and Katharina Kann. 2021. PROST: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597–4608.
Arora, Arnav, Lucie-Aimée Kaffee, and Isabelle Augenstein. 2022. Probing pre-trained language models for cross-cultural differences in values. ArXiv, arXiv:2203.13722.
Artetxe, Mikel, Jingfei Du, Naman Goyal, Luke Zettlemoyer, and Veselin Stoyanov. 2022. On the role of bidirectionality in language model pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3973–3985.
Bacon, Geoff and Terry Regier. 2019. Does BERT agree? Evaluating knowledge of structure dependence through agreement relations. ArXiv, arXiv:1908.09892.
Bang, Yejin, Nayeon Lee, Etsuko Ishii, Andrea Madotto, and Pascale Fung. 2021. Assessing political prudence of open-domain chatbots. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 548–555.
Bartl, Marion, Malvina Nissim, and Albert Gatt. 2020. Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 1–16.
Bechtel, William and Oron Shagrir. 2015. The non-redundant contributions of Marr’s three levels of analysis for explaining information-processing mechanisms. Topics in Cognitive Science, 7(2):312–322.
Belinkov, Yonatan. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
Beloucif, Meriem and Chris Biemann. 2021. Probing pre-trained language models for semantic attributes and their values. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2554–2559.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.
Bender, Emily M. and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198.
Berwick, Robert, Paul Pietroski, Beracah Yankama, and Noam Chomsky. 2011. Poverty of the stimulus revisited. Cognitive Science, 35(7):1207–1242.
Betz, Gregor, Kyle Richardson, and C. Voigt. 2021. Thinking aloud: Dynamic context generation improves zero-shot reasoning performance of GPT-2. ArXiv, arXiv:2103.13033.
Beyer, Anne, Sharid Loáiciga, and David Schlangen. 2021. Is incoherence surprising? Targeted evaluation of coherence prediction from language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4164–4173.
Bhavya, Bhavya, Jinjun Xiong, and ChengXiang Zhai. 2022. Analogy generation by prompting large language models: A case study of instructGPT. In Proceedings of the 15th International Conference on Natural Language Generation, pages 298–312.
Binz, Marcel and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences of the United States of America, 120(6):e2218523120.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir P. Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the opportunities and risks of foundation models. ArXiv, arXiv:2108.07258.
Borgeaud, Sebastian, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240.
Bowman, Samuel. 2022. The dangers of underclaiming: Reasons for caution when reporting how NLP systems fail. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7484–7499.
Brandl, Stephanie, Ruixiang Cui, and Anders Søgaard. 2022. How conservative are language models? Adapting to the introduction of gender-neutral pronouns. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3624–3630.
Brown, Hannah, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What does it mean for a language model to preserve privacy? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.
Broyde, Joshua and Claire Palmer. 2021. Build a medical sentence matching application using BERT and Amazon SageMaker. AWS Machine Learning Blog.
Cai, Xingyu, Jiaji Huang, Yu-Lan Bian, and Kenneth Ward Church. 2021. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations.
Cao, Boxi, Hongyu Lin, Xianpei Han, Fangchao Liu, and Le Sun. 2022. Can prompt probe pretrained language models? Understanding the invisible risks from a causal view. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5796–5808.
Cao, Boxi, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. 2021. Knowledgeable or educated guess? Revisiting language models as knowledge bases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1860–1874.
Carlini, Nicholas, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models. In International Conference on Learning Representations.
Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models. In USENIX Security Symposium, pages 2633–2650.
Carnie, Andrew. 2002. Syntax: A Generative Introduction. Blackwell.
Caron, Graham and Shashank Srivastava. 2022. Identifying and manipulating the personality traits of language models. ArXiv, arXiv:2212.10276.
Chang, Tyler, Zhuowen Tu, and Benjamin Bergen. 2022. The geometry of multilingual language model representations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 119–136.
Chang, Tyler, Yifan Xu, Weijian Xu, and Zhuowen Tu. 2021. Convolutions and self-attention: Re-interpreting relative positions in pre-trained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4322–4333.
Chang, Tyler A. and Benjamin K. Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.
Chaves, Rui P. and Stephanie N. Richter. 2021. Look at that! BERT can be easily distracted from paying attention to morphosyntax. In Proceedings of the Society for Computation in Linguistics 2021, pages 28–38.
Chen, Kaiping, Anqi Shao, Jirayu Burapacheep, and Yixuan Li. 2022. A critical appraisal of equity in conversational AI: Evidence from auditing GPT-3’s dialogues with different publics on climate change and Black Lives Matter. ArXiv, arXiv:2209.13627.
Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, F. Such, D. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, I. Babuschkin, S. Balaji, Shantanu Jain, A. Carr, J. Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, M. Knight, Miles Brundage, Mira Murati, Katie Mayer, P. Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. ArXiv, arXiv:2107.03374.
Chiang, Cheng Han, Sung-Feng Huang, and Hung-yi Lee. 2020. Pretrained language model embryology: The birth of ALBERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6813–6828.
Chiang, Ting Rui and Yun-Nung Chen. 2021. Relating neural text degeneration to exposure bias. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 228–239.
Cho, Won Ik, Emmanuele Chersoni, Yu-Yin Hsu, and Chu-Ren Huang. 2021. Modeling the influence of verb aspect on the activation of typical event locations with BERT. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2922–2929.
Choenni, Rochelle, Ekaterina Shutova, and Robert van Rooij. 2021. Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1477–1491.
Choshen, Leshem, Guy Hacohen, Daphna Weinshall, and Omri Abend. 2022. The grammar-learning trajectories of neural language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8281–8297.
Choudhury, Monojit and Amit Deshpande. 2021. How linguistically fair are multilingual pre-trained language models? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12710–12718.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with Pathways. ArXiv, arXiv:2204.02311.
Chuang, Chengyu and Yi Yang. 2022. Buy Tesla, sell Ford: Assessing implicit stock market preference in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 100–105.
Cífka, Ondřej and Antoine Liutkus. 2022. Black-box language model explanation by context length probing. ArXiv, arXiv:2212.14815.
Clark, Elizabeth, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296.
Clark, Kevin, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.
Cohen
,
Aaron Daniel
,
Adam
Roberts
,
Alejandra
Molina
,
Alena
Butryna
,
Alicia
Jin
,
Apoorv
Kulshreshtha
,
Ben
Hutchinson
,
Ben
Zevenbergen
,
Blaise Hilary
Aguera-Arcas
,
Chung ching
Chang
,
Claire
Cui
,
Cosmo
Du
,
Daniel
De Freitas Adiwardana
,
Dehao
Chen
,
Ed
H. Chi
,
Erin
Hoffman-John
,
Heng-Tze
Cheng
,
Hongrae
Lee
,
Igor
Krivokon
,
James
Qin
,
Jamie
Hall
,
Joe
Fenton
,
Johnny
Soraker
,
Kathy
Meier-Hellstern
,
Kristen
Olson
,
Lora Mois
Aroyo
,
Maarten Paul
Bosma
,
Marc Joseph
Pickett
,
Marcelo Amorim
Menegali
,
Marian
Croak
,
Mark
Díaz
,
Matthew
Lamm
,
Maxim
Krikun
,
Meredith Ringel
Morris
,
Noam
Shazeer
,
Quoc V.
Le
,
Rachel
Bernstein
,
Ravi
Rajakumar
,
Ray
Kurzweil
,
Romal
Thoppilan
,
Steven
Zheng
,
Taylor
Bos
,
Toju
Duke
,
Tulsee
Doshi
,
Vincent Y.
Zhao
,
Vinodkumar
Prabhakaran
,
Will
Rusch
,
YaGuang
Li
,
Yanping
Huang
,
Yanqi
Zhou
,
Yuanzhong
Xu
,
Zhifeng
Chen
, and
Dmitry (Dima)
Lepikhin
.
2022
.
LaMDA: Language models for dialog applications
.
ArXiv
,
arXiv:2201.08239
.
Comșa
,
Iulia
,
Julian
Eisenschlos
, and
Srini
Narayanan
.
2022
.
MiQA: A benchmark for inference on metaphorical questions
. In
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
, pages
373
381
.
Cong
,
Yan
.
2022
.
Psycholinguistic diagnosis of language models’ commonsense reasoning
. In
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)
, pages
17
22
.
Czarnowska
,
Paula
,
Yogarshi
Vyas
, and
Kashif
Shah
.
2021
.
Quantifying social biases in NLP: A generalization and empirical comparison of extrinsic fairness metrics
.
Transactions of the Association for Computational Linguistics
,
9
:
1249
1267
.
Czinczoll
,
Tamara
,
Helen
Yannakoudakis
,
Pushkar
Mishra
, and
Ekaterina
Shutova
.
2022
.
Scientific and creative analogies in pretrained language models
. In
Findings of the Association for Computational Linguistics: EMNLP 2022
, pages
2094
2100
.
Dabrowska
,
Ewa
.
2015
.
What exactly is Universal Grammar, and has anyone seen it?
Frontiers in Psychology
,
6
:
852
.
Dai
,
Damai
,
Yutao
Sun
,
Li
Dong
,
Yaru
Hao
,
Zhifang
Sui
, and
Furu
Wei
.
2022
.
Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers
.
ArXiv
,
arXiv:2212.10559
.
Dai
,
Zihang
,
Zhilin
Yang
,
Yiming
Yang
,
Jaime
Carbonell
,
Quoc
Le
, and
Ruslan
Salakhutdinov
.
2019
.
Transformer-XL: Attentive language models beyond a fixed-length context
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
2978
2988
.
Dar
,
Guy
,
Mor
Geva
,
Ankit
Gupta
, and
Jonathan
Berant
.
2022
.
Analyzing Transformers in embedding space
.
ArXiv
,
arXiv:2209.02535
.
Dasgupta
,
Ishita
,
Andrew
Lampinen
,
Stephanie
Chan
,
Antonia
Creswell
,
Dharshan
Kumaran
,
James
McClelland
, and
Felix
Hill
.
2022
.
Language models show human-like content effects on reasoning
.
ArXiv
,
arXiv:2207.07051
.
Davis
,
Forrest
and
Marten
van Schijndel
.
2020
.
Discourse structure interacts with reference but not syntax in neural language models
. In
Proceedings of the 24th Conference on Computational Natural Language Learning
, pages
396
407
.
Davison
,
Joe
,
Joshua
Feldman
, and
Alexander
Rush
.
2019
.
Commonsense knowledge mining from pretrained models
. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
, pages
1173
1178
.
De Bruyn
,
Maxime
,
Ehsan
Lotfi
,
Jeska
Buhmann
, and
Walter
Daelemans
.
2022
.
Is it smaller than a tennis ball? Language models play the game of twenty questions
. In
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
80
90
.
de Vassimon Manela
,
Daniel
,
David
Errington
,
Thomas
Fisher
,
Boris
van Breugel
, and
Pasquale
Minervini
.
2021
.
Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models
. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
, pages
2232
2242
.
Dev
,
Sunipa
,
Masoud
Monajatipoor
,
Anaelia
Ovalle
,
Arjun
Subramonian
,
Jeff
Phillips
, and
Kai-Wei
Chang
.
2021
.
Harms of gender exclusivity and challenges in non-binary representation in language technologies
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
1968
1994
.
Dev
,
Sunipa
,
Emily
Sheng
,
Jieyu
Zhao
,
Aubrie
Amstutz
,
Jiao
Sun
,
Yu
Hou
,
Mattie
Sanseverino
,
Jiin
Kim
,
Akihiro
Nishi
,
Nanyun
Peng
, and
Kai-Wei
Chang
.
2022
.
On measures of biases and harms in NLP
. In
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
, pages
246
267
.
Devlin
,
Jacob
,
Ming-Wei
Chang
,
Kenton
Lee
, and
Kristina
Toutanova
.
2019
.
BERT: Pre-training of deep bidirectional transformers for language understanding
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
, pages
4171
4186
.
Dhamala
,
Jwala
,
Tony
Sun
,
Varun
Kumar
,
Satyapriya
Krishna
,
Yada
Pruksachatkun
,
Kai-Wei
Chang
, and
Rahul
Gupta
.
2021
.
BOLD: Dataset and metrics for measuring biases in open-ended language generation
. In
Proceedings of the ACM Conference on Fairness, Accountability, and Transparency
, pages
862
872
.
Dou
,
Yao
,
Maxwell
Forbes
,
Rik
Koncel-Kedziorski
,
Noah A.
Smith
, and
Yejin
Choi
.
2022
.
Is GPT-3 text indistinguishable from human text? Scarecrow: A framework for scrutinizing machine text
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
7250
7274
.
Du
,
Mengnan
,
Fengxiang
He
,
Na
Zou
,
Dacheng
Tao
, and
Xia
Hu
.
2022a
.
Shortcut learning of large language models in natural language understanding: A survey
.
ArXiv
,
arXiv:2208.11857
.
Du
,
Yifan
,
Zikang
Liu
,
Junyi
Li
, and
Wayne Xin
Zhao
.
2022b
.
A survey of vision-language pre-trained models
. In
Proceedings of the International Joint Conference on Artificial Intelligence
, pages
5436
5443
.
Dufter
,
Philipp
,
Martin
Schmitt
, and
Hinrich
Schütze
.
2022
.
Position information in transformers: An overview
.
Computational Linguistics
,
48
(
3
):
733
763
.
Dugan
,
Liam
,
Daphne
Ippolito
,
Arun
Kirubarajan
,
Sherry
Shi
, and
Chris
Callison-Burch
.
2023
.
Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text
. In
Proceedings of the AAAI Conference on Artificial Intelligence
, pages
12763
12771
.
Elazar
,
Yanai
,
Nora
Kassner
,
Shauli
Ravfogel
,
Amir
Feder
,
Abhilasha
Ravichander
,
Marius
Mosbach
,
Yonatan
Belinkov
,
Hinrich
Schütze
, and
Yoav
Goldberg
.
2022
.
Measuring causal effects of data statistics on language model’s ‘factual’ predictions
.
ArXiv
,
arXiv:2207.14251
.
Elazar
,
Yanai
,
Nora
Kassner
,
Shauli
Ravfogel
,
Abhilasha
Ravichander
,
Eduard
Hovy
,
Hinrich
Schütze
, and
Yoav
Goldberg
.
2021
.
Measuring and improving consistency in pretrained language models
.
Transactions of the Association for Computational Linguistics
,
9
:
1012
1031
.
Ettinger
,
Allyson
.
2020
.
What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models
.
Transactions of the Association for Computational Linguistics
,
8
:
34
48
.
Fedus
,
William
,
Barret
Zoph
, and
Noam
Shazeer
.
2022
.
Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity
.
Journal of Machine Learning Research
,
23
:
1
39
.
Felkner
,
Virginia K.
,
Ho-Chun Herbert
Chang
,
Eugene
Jang
, and
Jonathan
May
.
2022
.
Towards WinoQueer: Developing a benchmark for anti-queer bias in large language models
. In
Queer in AI Workshop
.
Finlayson
,
Matthew
,
Aaron
Mueller
,
Sebastian
Gehrmann
,
Stuart
Shieber
,
Tal
Linzen
, and
Yonatan
Belinkov
.
2021
.
Causal analysis of syntactic agreement mechanisms in neural language models
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
1828
1843
.
Frank
,
Michael
and
Noah
Goodman
.
2012
.
Predicting pragmatic reasoning in language games
.
Science
,
336
(
6084
):
998
.
Freitag
,
Markus
and
Yaser
Al-Onaizan
.
2017
.
Beam search strategies for neural machine translation
. In
Proceedings of the First Workshop on Neural Machine Translation
, pages
56
60
.
Fricke
,
Suzanne
.
2018
.
Semantic Scholar
.
Journal of the Medical Library Association
,
106
(
1
):
145
147
.
Fujisawa
,
Ippei
and
Ryota
Kanai
.
2022
.
Logical tasks for measuring extrapolation and rule comprehension
.
ArXiv
,
arXiv:2211.07727
.
Ganguli
,
Deep
,
Danny
Hernandez
,
Liane
Lovitt
,
Amanda
Askell
,
Yuntao
Bai
,
Anna
Chen
,
Tom
Conerly
,
Nova
DasSarma
,
Dawn
Drain
,
Nelson
Elhage
,
Sheer
El Showk
,
Stanislav
Fort
,
Zac
Hatfield-Dodds
,
Tom
Henighan
,
Scott
Johnston
,
Andy
Jones
,
Nicholas
Joseph
,
Jackson
Kernion
,
Shauna
Kravec
,
Ben
Mann
,
Neel
Nanda
,
Kamal
Ndousse
,
Catherine
Olsson
,
Daniela
Amodei
,
Tom
Brown
,
Jared
Kaplan
,
Sam
McCandlish
,
Christopher
Olah
,
Dario
Amodei
, and
Jack
Clark
.
2022a
.
Predictability and surprise in large generative models
. In
Proceedings of the ACM Conference on Fairness, Accountability, and Transparency
, pages
1747
1764
.
Ganguli
,
Deep
,
Liane
Lovitt
,
John
Kernion
,
Amanda
Askell
,
Yuntao
Bai
,
Saurav
Kadavath
,
Benjamin
Mann
,
Ethan
Perez
,
Nicholas
Schiefer
,
Kamal
Ndousse
,
Andy
Jones
,
Sam
Bowman
,
Anna
Chen
,
Tom
Conerly
,
Nova
DasSarma
,
Dawn
Drain
,
Nelson
Elhage
,
Sheer
El-Showk
,
Stanislav
Fort
,
Zachary
Dodds
,
T. J.
Henighan
,
Danny
Hernandez
,
Tristan
Hume
,
Josh
Jacobson
,
Scott
Johnston
,
Shauna
Kravec
,
Catherine
Olsson
,
Sam
Ringer
,
Eli
Tran-Johnson
,
Dario
Amodei
,
Tom B.
Brown
,
Nicholas
Joseph
,
Sam
McCandlish
,
Christopher
Olah
,
Jared
Kaplan
, and
Jack
Clark
.
2022b
.
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
.
ArXiv
,
arXiv:2209.07858
.
Gauthier
,
Jon
,
Jennifer
Hu
,
Ethan
Wilcox
,
Peng
Qian
, and
Roger
Levy
.
2020
.
SyntaxGym: An online platform for targeted evaluation of language models
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
, pages
70
76
.
Geeraerts
,
Dirk
.
2017
.
Lexical semantics
.
Oxford Research Encyclopedia of Linguistics
.
Gehman
,
Samuel
,
Suchin
Gururangan
,
Maarten
Sap
,
Yejin
Choi
, and
Noah A.
Smith
.
2020
.
RealToxicityPrompts: Evaluating neural toxic degeneration in language models
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
3356
3369
.
Geiger
,
Atticus
,
Kyle
Richardson
, and
Christopher
Potts
.
2020
.
Neural natural language inference models partially embed theories of lexical entailment and negation
. In
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
163
173
.
Geva
,
Mor
,
Avi
Caciularu
,
Kevin
Wang
, and
Yoav
Goldberg
.
2022
.
Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
30
45
.
Geva
,
Mor
,
Roei
Schuster
,
Jonathan
Berant
, and
Omer
Levy
.
2021
.
Transformer feed-forward layers are key-value memories
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
5484
5495
.
Goldberg
,
Yoav
.
2019
.
Assessing BERT’s syntactic abilities
.
ArXiv
,
arXiv:1901.05287
.
Grice
,
H. P.
1975
.
Logic and conversation
.
Syntax and Semantics: Vol. 3: Speech Acts
, pages
41
58
.
Griciūtė
,
Bernadeta
,
Marc
Tanti
, and
Lucia
Donatelli
.
2022
.
On the cusp of comprehensibility: Can language models distinguish between metaphors and nonsense?
In
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)
, pages
173
177
.
Groenwold
,
Sophie
,
Lily
Ou
,
Aesha
Parekh
,
Samhita
Honnavalli
,
Sharon
Levy
,
Diba
Mirza
, and
William Yang
Wang
.
2020
.
Investigating African-American Vernacular English in transformer-based text generation
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
5877
5883
.
Gubelmann
,
Reto
and
Siegfried
Handschuh
.
2022
.
Context matters: A pragmatic study of PLMs’ negation understanding
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
4602
4621
.
Guu
,
Kelvin
,
Kenton
Lee
,
Zora
Tung
,
Panupong
Pasupat
, and
Mingwei
Chang
.
2020
.
Retrieval augmented language model pre-training
. In
International Conference on Machine Learning
, pages
3929
3938
.
Hagendorff
,
Thilo
,
Sarah
Fabi
, and
Michal
Kosinski
.
2022
.
Machine intuition: Uncovering human-like intuitive decision-making in GPT-3.5
.
ArXiv
,
arXiv:2212.05206
.
Hahn
,
Michael
.
2020
.
Theoretical limitations of self-attention in neural sequence models
.
Transactions of the Association for Computational Linguistics
,
8
:
156
171
.
Han
,
Simeng
,
Hailey
Schoelkopf
,
Yilun
Zhao
,
Zhenting
Qi
,
Martin
Riddell
,
Luke
Benson
,
Lucy
Sun
,
Ekaterina
Zubova
,
Yujie
Qiao
,
Matthew
Burtell
,
David
Peng
,
Jonathan
Fan
,
Yixin
Liu
,
Brian
Wong
,
Malcolm
Sailor
,
Ansong
Ni
,
Linyong
Nan
,
Jungo
Kasai
,
Tao
Yu
,
Rui
Zhang
,
Shafiq
Joty
,
Alexander R.
Fabbri
,
Wojciech
Kryscinski
,
Xi Victoria
Lin
,
Caiming
Xiong
, and
Dragomir
Radev
.
2022
.
FOLIO: Natural language reasoning with first-order logic
.
ArXiv
,
arXiv:2209.00840
.
Hanna
,
Michael
and
David
Mareček
.
2021
.
Analyzing BERT’s knowledge of hypernymy via prompting
. In
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
275
282
.
Hassan
,
Saad
,
Matt
Huenerfauth
, and
Cecilia Ovesdotter
Alm
.
2021
.
Unpacking the interdependent systems of discrimination: Ableist bias in NLP systems through an intersectional lens
. In
Findings of the Association for Computational Linguistics: EMNLP 2021
, pages
3116
3123
.
Haviv
,
Adi
,
Ori
Ram
,
Ofir
Press
,
Peter
Izsak
, and
Omer
Levy
.
2022
.
Transformer language models without positional encodings still learn positional information
. In
Findings of the Association for Computational Linguistics: EMNLP 2022
, pages
1382
1390
.
Hawkins
,
Robert
,
Takateru
Yamakoshi
,
Thomas
Griffiths
, and
Adele
Goldberg
.
2020
.
Investigating representations of verb bias in neural language models
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
4653
4663
.
He
,
Qianyu
,
Sijie
Cheng
,
Zhixu
Li
,
Rui
Xie
, and
Yanghua
Xiao
.
2022a
.
Can pre-trained language models interpret similes as smart as human?
In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
7875
7887
.
He
,
Xuanli
,
Qiongkai
Xu
,
Lingjuan
Lyu
,
Fangzhao
Wu
, and
Chenguang
Wang
.
2022b
.
Protecting intellectual property of language generation APIs with lexical watermark
. In
Proceedings of the AAAI Conference on Artificial Intelligence
, pages
10758
10766
.
Heidenreich
,
Hunter Scott
and
Jake Ryland
Williams
.
2021
.
The Earth is flat and the Sun is not a star: The susceptibility of GPT-2 to universal adversarial triggers
. In
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
, pages
566
573
.
Hendrycks
,
Dan
,
Collin
Burns
,
Steven
Basart
,
Andy
Zou
,
Mantas
Mazeika
,
Dawn
Song
, and
Jacob
Steinhardt
.
2021a
.
Measuring massive multitask language understanding
. In
International Conference on Learning Representations
.
Hendrycks
,
Dan
,
Collin
Burns
,
Saurav
Kadavath
,
Akul
Arora
,
Steven
Basart
,
Eric
Tang
,
Dawn
Song
, and
Jacob
Steinhardt
.
2021b
.
Measuring mathematical problem solving with the MATH dataset
. In
Advances in Neural Information Processing Systems Datasets and Benchmarks Track
.
Hernandez
,
Danny
,
Tom B.
Brown
,
Tom
Conerly
,
Nova
DasSarma
,
Dawn
Drain
,
Sheer
El-Showk
,
Nelson
Elhage
,
Zac
Hatfield-Dodds
,
Tom
Henighan
,
Tristan
Hume
,
Scott
Johnston
,
Benjamin
Mann
,
Christopher
Olah
,
Catherine
Olsson
,
Dario
Amodei
,
Nicholas
Joseph
,
Jared
Kaplan
, and
Sam
McCandlish
.
2022
.
Scaling laws and interpretability of learning from repeated data
.
ArXiv
,
arXiv:2205.10487
.
Hershcovich
,
Daniel
,
Stella
Frank
,
Heather
Lent
,
Miryam
de Lhoneux
,
Mostafa
Abdou
,
Stephanie
Brandl
,
Emanuele
Bugliarello
,
Laura Cabello
Piqueras
,
Ilias
Chalkidis
,
Ruixiang
Cui
,
Constanza
Fierro
,
Katerina
Margatina
,
Phillip
Rust
, and
Anders
Søgaard
.
2022
.
Challenges and strategies in cross-cultural NLP
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
6997
7013
.
Hoffmann
,
Jordan
,
Sebastian
Borgeaud
,
Arthur
Mensch
,
Elena
Buchatskaya
,
Trevor
Cai
,
Eliza
Rutherford
,
Diego
de las Casas
,
Lisa Anne
Hendricks
,
Johannes
Welbl
,
Aidan
Clark
,
Tom
Hennigan
,
Eric
Noland
,
Katherine
Millican
,
George
van den Driessche
,
Bogdan
Damoc
,
Aurelia
Guy
,
Simon
Osindero
,
Karen
Simonyan
,
Erich
Elsen
,
Oriol
Vinyals
,
Jack William
Rae
, and
Laurent
Sifre
.
2022
.
Training compute-optimal large language models
. In
Advances in Neural Information Processing Systems
, volume
35
, pages
30016
30030
.
Holtzman
,
Ari
,
Jan
Buys
,
Li
Du
,
Maxwell
Forbes
, and
Yejin
Choi
.
2020
.
The curious case of neural text degeneration
. In
International Conference on Learning Representations
.
Hossain
,
Md Mosharaf
,
Dhivya
Chinnappa
, and
Eduardo
Blanco
.
2022
.
An analysis of negation in natural language understanding corpora
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
, pages
716
723
.
Hossain
,
Md Mosharaf
,
Venelin
Kovatchev
,
Pranoy
Dutta
,
Tiffany
Kao
,
Elizabeth
Wei
, and
Eduardo
Blanco
.
2020
.
An analysis of natural language inference benchmarks through the lens of negation
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
9106
9118
.
Hosseini
,
Arian
,
Ankit
Vani
,
Dzmitry
Bahdanau
,
Alessandro
Sordoni
, and
Aaron
Courville
.
2022
.
On the compositional generalization gap of in-context learning
. In
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
272
280
.
Hu
,
Jennifer
,
Sherry Yong
Chen
, and
Roger
Levy
.
2020
.
A closer look at the performance of neural language models on reflexive anaphor licensing
. In
Proceedings of the Society for Computation in Linguistics 2020
, pages
323
333
.
Hu
,
Jennifer
,
Sammy
Floyd
,
Olessia
Jouravlev
,
Evelina
Fedorenko
, and
Edward
Gibson
.
2022
.
A fine-grained comparison of pragmatic language understanding in humans and language models
.
ArXiv
,
arXiv:2212.06801
.
Hu
,
Jennifer
,
Jon
Gauthier
,
Peng
Qian
,
Ethan
Wilcox
, and
Roger
Levy
.
2020
.
A systematic assessment of syntactic generalization in neural language models
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
1725
1744
.
Huang
,
Jie
,
Hanyin
Shao
, and
Kevin Chen-Chuan
Chang
.
2022
.
Are large pre-trained language models leaking your personal information?
In
Findings of the Association for Computational Linguistics: EMNLP 2022
, pages
2038
2047
.
Huebner
,
Philip A.
,
Elior
Sulem
,
Cynthia
Fisher
, and
Dan
Roth
.
2021
.
BabyBERTa: Learning more grammar with small-scale child-directed language
. In
Proceedings of the 25th Conference on Computational Natural Language Learning
, pages
624
646
.
Hupkes
,
Dieuwke
,
Mario
Giulianelli
,
Verna
Dankers
,
Mikel
Artetxe
,
Yanai
Elazar
,
Tiago
Pimentel
,
Christos
Christodoulopoulos
,
Karim
Lasri
,
Naomi
Saphra
,
Arabella
Sinclair
,
Dennis
Ulmer
,
Florian
Schottmann
,
Khuyagbaatar
Batsuren
,
Kaiser
Sun
,
Koustuv
Sinha
,
Leila
Khalatbari
,
Maria
Ryskina
,
Rita
Frieske
,
Ryan
Cotterell
, and
Zhijing
Jin
.
2022
.
State-of-the-art generalisation research in NLP: A taxonomy and review
.
ArXiv
,
arXiv:2210.03050
.
Huynh
,
Hien
,
Tomas O.
Lentz
, and
Emiel
van Miltenburg
.
2022
.
Implicit causality in GPT-2: A case study
.
ArXiv
,
arXiv:2212.04348
.
Ippolito
,
Daphne
,
Daniel
Duckworth
,
Chris
Callison-Burch
, and
Douglas
Eck
.
2020
.
Automatic detection of generated text is easiest when humans are fooled
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
1808
1822
.
Ippolito
,
Daphne
,
Florian
Tramèr
,
Milad
Nasr
,
Chiyuan
Zhang
,
Matthew
Jagielski
,
Katherine
Lee
,
Christopher A.
Choquette-Choo
, and
Nicholas
Carlini
.
2022
.
Preventing verbatim memorization in language models gives a false sense of privacy
.
ArXiv
,
arXiv:2210.17546
.
Iyer
,
Srinivas
,
Xiaojuan
Lin
,
Ramakanth
Pasunuru
,
Todor
Mihaylov
,
Daniel
Simig
,
Ping
Yu
,
Kurt
Shuster
,
Tianlu
Wang
,
Qing
Liu
,
Punit Singh
Koura
,
Xian
Li
,
Brian
O’Horo
,
Gabriel
Pereyra
,
Jeff
Wang
,
Christopher
Dewan
,
Asli
Celikyilmaz
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2022
.
OPT-IML: Scaling language model instruction meta learning through the lens of generalization
.
ArXiv
,
arXiv:2212.12017
.
Jain
,
Sarthak
and
Byron C.
Wallace
.
2019
.
Attention is not Explanation
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
, pages
3543
3556
.
Jakesch
,
Maurice
,
Jeffrey T.
Hancock
, and
Mor
Naaman
.
2023
.
Human heuristics for AI-generated language are flawed
.
Proceedings of the National Academy of Sciences
,
120
(
11
):
e2208839120
.
Jang
,
Joel
,
Seonghyeon
Ye
, and
Minjoon
Seo
.
2022
.
Can large language models truly understand prompts? A case study with negated prompts
. In
Proceedings of the 1st Transfer Learning for Natural Language Processing Workshop
, pages
52
62
.
Jawahar
,
Ganesh
,
Muhammad
Abdul-Mageed
, and
Laks V. S.
Lakshmanan
.
2020
.
Automatic detection of machine generated text: A critical survey
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
2296
2309
.
Jiang
,
Chengyue
,
Zhonglin
Nian
,
Kaihao
Guo
,
Shanbo
Chu
,
Yinggong
Zhao
,
Libin
Shen
, and
Kewei
Tu
.
2020a
.
Learning numeral embedding
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
2586
2599
.
Jiang
,
Guangyuan
,
Manjie
Xu
,
Song-Chun
Zhu
,
Wenjuan
Han
,
Chi
Zhang
, and
Yixin
Zhu
.
2022
.
MPI: Evaluating and inducing personality in pre-trained language models
.
ArXiv
,
arXiv:2206.07550
.
Jiang
,
Tianyu
and
Ellen
Riloff
.
2021
.
Learning prototypical functions for physical artifacts
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
6941
6951
.
Jiang
,
Zhengbao
,
Frank F.
Xu
,
Jun
Araki
, and
Graham
Neubig
.
2020b
.
How can we know what language models know?
Transactions of the Association for Computational Linguistics
,
8
:
423
438
.
Jigsaw
.
2017
.
Perspective API
.
Google Jigsaw
.
Jin
,
Zhijing
,
Sydney
Levine
,
Fernando
Gonzalez Adauto
,
Ojasv
Kamal
,
Maarten
Sap
,
Mrinmaya
Sachan
,
Rada
Mihalcea
,
Joshua B.
Tenenbaum
, and
Bernhard
Schölkopf
.
2022a
.
When to make exceptions: Exploring language models as accounts of human moral judgment
. In
Advances in Neural Information Processing Systems
, volume
35
, pages
28458
28473
.
Jin
,
Zijia
,
Xingyu
Zhang
,
Mo
Yu
, and
Lifu
Huang
.
2022b
.
Probing script knowledge from pre-trained models
. In
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)
, pages
87
93
.
Johnson
,
Rebecca L.
,
Giada
Pistilli
,
Natalia
Menédez-González
,
Leslye Denisse Dias
Duran
,
Enrico
Panai
,
Julija
Kalpokiene
, and
Donald Jay
Bertulfo
.
2022
.
The ghost in the machine has an American accent: Value conflict in GPT-3
.
ArXiv
,
arXiv:2203.07785
.
Johnson
,
Steven
.
2022
.
A.I. is mastering language. Should we trust what it says?
The New York Times
.
Jones
,
Cameron
and
Benjamin
Bergen
.
2021
.
The role of physical inference in pronoun resolution
. In
Proceedings of the Annual Meeting of the Cognitive Science Society
, volume
43
, pages
2876
2882
.
Jones
,
Cameron R.
,
Tyler A.
Chang
,
Seana
Coulson
,
James
Michaelov
,
Sean
Trott
, and
Benjamin
Bergen
.
2022
.
Distributional semantics still can’t account for affordances
. In
Proceedings of the Annual Meeting of the Cognitive Science Society
, volume
44
, pages
482
489
.
Joshi
,
Pratik
,
Sebastin
Santy
,
Amar
Budhiraja
,
Kalika
Bali
, and
Monojit
Choudhury
.
2020
.
The state and fate of linguistic diversity and inclusion in the NLP world
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
6282
6293
.
Kabbara
,
Jad
and
Jackie Chi Kit
Cheung
.
2022
.
Investigating the performance of transformer-based NLI models on presuppositional inferences
. In
Proceedings of the 29th International Conference on Computational Linguistics
, pages
779
785
.
Kadavath
,
Saurav
,
Tom
Conerly
,
Amanda
Askell
,
Tom
Henighan
,
Dawn
Drain
,
Ethan
Perez
,
Nicholas
Schiefer
,
Zachary
Dodds
,
Nova
DasSarma
,
Eli
Tran-Johnson
,
Scott
Johnston
,
Sheer
El-Showk
,
Andy
Jones
,
Nelson
Elhage
,
Tristan
Hume
,
Anna
Chen
,
Yuntao
Bai
,
Sam
Bowman
,
Stanislav
Fort
,
Deep
Ganguli
,
Danny
Hernandez
,
Josh
Jacobson
,
John
Kernion
,
Shauna
Kravec
,
Liane
Lovitt
,
Kamal
Ndousse
,
Catherine
Olsson
,
Sam
Ringer
,
Dario
Amodei
,
Tom B.
Brown
,
Jack
Clark
,
Nicholas
Joseph
,
Benjamin
Mann
,
Sam
McCandlish
,
Christopher
Olah
, and
Jared
Kaplan
.
2022
.
Language models (mostly) know what they know
.
ArXiv
,
arXiv:2207.05221
.
Kalo
,
Jan Christoph
and
Leandra
Fichtel
.
2022
.
KAMEL: Knowledge analysis with multitoken entities in language models
. In
4th Conference on Automated Knowledge Base Construction
.
Kandpal
,
Nikhil
,
Haikang
Deng
,
Adam
Roberts
,
Eric
Wallace
, and
Colin
Raffel
.
2022
.
Large language models struggle to learn long-tail knowledge
.
ArXiv
,
arXiv:2211.08411
.
Kandpal
,
Nikhil
,
Eric
Wallace
, and
Colin
Raffel
.
2022
.
Deduplicating training data mitigates privacy risks in language models
. In
International Conference on Machine Learning
, pages
10697
10707
.
Kaplan
,
Jared
,
Sam
McCandlish
,
Tom
Henighan
,
Tom B.
Brown
,
Benjamin
Chess
,
Rewon
Child
,
Scott
Gray
,
Alec
Radford
,
Jeff
Wu
, and
Dario
Amodei
.
2020
.
Scaling laws for neural language models
.
ArXiv
,
arXiv:2001.08361
.
Karpas
,
Ehud
,
Omri
Abend
,
Yonatan
Belinkov
,
Barak
Lenz
,
Opher
Lieber
,
Nir
Ratner
,
Yoav
Shoham
,
Hofit
Bata
,
Yoav
Levine
,
Kevin
Leyton-Brown
,
Dor
Muhlgay
,
Noam
Rozen
,
Erez
Schwartz
,
Gal
Shachaf
,
Shai
Shalev-Shwartz
,
Amnon
Shashua
, and
Moshe
Tenenholtz
.
2022
.
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
.
ArXiv
,
arXiv:2205.00445
.
Kassner
,
Nora
,
Benno
Krojer
, and
Hinrich
Schütze
.
2020
.
Are pretrained language models symbolic reasoners over knowledge?
In
Proceedings of the 24th Conference on Computational Natural Language Learning
, pages
552
564
.
Kassner
,
Nora
and
Hinrich
Schütze
.
2020
.
Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
7811
7818
.
Katz
,
Uri
,
Mor
Geva
, and
Jonathan
Berant
.
2022
.
Inferring implicit relations in complex questions with language models
. In
Findings of the Association for Computational Linguistics: EMNLP 2022
, pages
2548
2566
.
Kauf
,
Carina
,
Anna A.
Ivanova
,
Giulia
Rambelli
,
Emmanuele
Chersoni
,
Jingyuan S.
She
,
Zawad
Chowdhury
,
Evelina
Fedorenko
, and
Alessandro
Lenci
.
2022
.
Event knowledge in large language models: The gap between the impossible and the unlikely
.
ArXiv
,
arXiv:2212.01488
.
Kavumba
,
Pride
,
Naoya
Inoue
,
Benjamin
Heinzerling
,
Keshav
Singh
,
Paul
Reisert
, and
Kentarou
Inui
.
2020
.
Balanced COPA: Countering superficial cues in causal reasoning
.
Association for Natural Language Processing
, pages
1105
1108
.
Kavumba
,
Pride
,
Ryo
Takahashi
, and
Yusuke
Oda
.
2022
.
Are prompt-based models clueless?
In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
2333
2352
.
Kementchedjhieva
,
Yova
,
Mark
Anderson
, and
Anders
Søgaard
.
2021
.
John praised Mary because _he_? Implicit causality bias and its interaction with explicit cues in LMs
. In
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
, pages
4859
4871
.
Khandelwal
,
Urvashi
,
Omer
Levy
,
Dan
Jurafsky
,
Luke
Zettlemoyer
, and
Mike
Lewis
.
2020
.
Generalization through memorization: Nearest neighbor language models
. In
International Conference on Learning Representations
.
Kharitonov
,
Eugene
,
Marco
Baroni
, and
Dieuwke
Hupkes
.
2021
.
How BPE affects memorization in Transformers
.
ArXiv
,
arXiv:2110.02782
.
Kim
,
Sanghee J.
,
Lang
Yu
, and
Allyson
Ettinger
.
2022
.
“No, they did not”: Dialogue response dynamics in pre-trained language models
. In
Proceedings of the 29th International Conference on Computational Linguistics
, pages
863
874
.
Kirchenbauer
,
John
,
Jonas
Geiping
,
Yuxin
Wen
,
Jonathan
Katz
,
Ian
Miers
, and
Tom
Goldstein
.
2023
.
A watermark for large language models
.
ArXiv
,
arXiv:2301.10226
.
Kirk
,
Hannah Rose
,
Yennie
Jun
,
Filippo
Volpin
,
Haider
Iqbal
,
Elias
Benussi
,
Frederic
Dreyer
,
Aleksandar
Shtedritski
, and
Yuki
Asano
.
2021
.
Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models
. In
Advances in Neural Information Processing Systems
, volume
34
, pages
2611
2624
.
Ko
,
Wei Jen
and
Junyi Jessy
Li
.
2020
.
Assessing discourse relations in language generation from GPT-2
. In
Proceedings of the 13th International Conference on Natural Language Generation
, pages
52
59
.
Kojima
,
Takeshi
,
Shixiang Shane
Gu
,
Machel
Reid
,
Yutaka
Matsuo
, and
Yusuke
Iwasawa
.
2022
.
Large language models are zero-shot reasoners
. In
Advances in Neural Information Processing Systems
, volume
35
, pages
22199
22213
.
Kovaleva
,
Olga
,
Alexey
Romanov
,
Anna
Rogers
, and
Anna
Rumshisky
.
2019
.
Revealing the dark secrets of BERT
. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
, pages
4365
4374
.
Krawczyk
,
Jack
and
Amarnag
Subramanya
.
2023
.
Bard is getting better at logic and reasoning
.
The Keyword: Google Blog
.
Kudo
,
Taku
.
2018
.
Subword regularization: Improving neural network translation models with multiple subword candidates
. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
66
75
.
Kurita
,
Keita
,
Nidhi
Vyas
,
Ayush
Pareek
,
Alan W.
Black
, and
Yulia
Tsvetkov
.
2019
.
Measuring bias in contextualized word representations
. In
Proceedings of the First Workshop on Gender Bias in Natural Language Processing
, pages
166
172
.
Kwon
,
Sunjae
,
Cheongwoong
Kang
,
Jiyeon
Han
, and
Jaesik
Choi
.
2019
.
Why do masked neural language models still need common sense knowledge?
ArXiv
,
arXiv:1911.03024
.
Lakhotia
,
Kushal
,
Eugene
Kharitonov
,
Wei-Ning
Hsu
,
Yossi
Adi
,
Adam
Polyak
,
Benjamin
Bolte
,
Tu-Anh
Nguyen
,
Jade
Copet
,
Alexei
Baevski
,
Abdelrahman
Mohamed
, and
Emmanuel
Dupoux
.
2021
.
On generative spoken language modeling from raw audio
.
Transactions of the Association for Computational Linguistics
,
9
:
1336
1354
.
Lakretz
,
Yair
,
Théo
Desbordes
,
Dieuwke
Hupkes
, and
Stanislas
Dehaene
.
2022
.
Can transformers process recursive nested constructions, like humans?
In
Proceedings of the 29th International Conference on Computational Linguistics
, pages
3226
3232
.
Lal
,
Yash Kumar
,
Niket
Tandon
,
Tanvi
Aggarwal
,
Horace
Liu
,
Nathanael
Chambers
,
Raymond
Mooney
, and
Niranjan
Balasubramanian
.
2022
.
Using commonsense knowledge to answer why-questions
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
1204–1219
.
Lampinen
,
Andrew
.
2022
.
Can language models handle recursively nested grammatical structures? A case study on comparing models and humans
.
ArXiv
,
arXiv:2210.15303
.
Lampinen
,
Andrew
,
Ishita
Dasgupta
,
Stephanie
Chan
,
Kory
Mathewson
,
Mh
Tessler
,
Antonia
Creswell
,
James
McClelland
,
Jane
Wang
, and
Felix
Hill
.
2022
.
Can language models learn from explanations in context?
In
Findings of the Association for Computational Linguistics: EMNLP 2022
, pages
537–563
.
Lasri
,
Karim
,
Alessandro
Lenci
, and
Thierry
Poibeau
.
2022a
.
Does BERT really agree? Fine-grained analysis of lexical dependence on a syntactic task
. In
Findings of the Association for Computational Linguistics: ACL 2022
, pages
2309–2315
.
Lasri
,
Karim
,
Alessandro
Lenci
, and
Thierry
Poibeau
.
2022b
.
Word order matters when you increase masking
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
1808–1815
.
Lasri
,
Karim
,
Olga
Seminck
,
Alessandro
Lenci
, and
Thierry
Poibeau
.
2022
.
Subject verb agreement error patterns in meaningless sentences: Humans vs. BERT
. In
Proceedings of the 29th International Conference on Computational Linguistics
, pages
37–43
.
Lee
,
Angie
.
2023
.
What are large language models used for?
NVIDIA Blog
.
Lee
,
Jooyoung
,
Thai
Le
,
Jinghui
Chen
, and
Dongwon
Lee
.
2023
.
Do language models plagiarize?
In
The ACM Web Conference
, pages
3637–3647
.
Lee
,
Katherine
,
Daphne
Ippolito
,
Andrew
Nystrom
,
Chiyuan
Zhang
,
Douglas
Eck
,
Chris
Callison-Burch
, and
Nicholas
Carlini
.
2022
.
Deduplicating training data makes language models better
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
8424–8445
.
Lee
,
Nayeon
,
Yejin
Bang
,
Andrea
Madotto
, and
Pascale
Fung
.
2021
.
Towards few-shot fact-checking via perplexity
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
1971–1981
.
Lee
,
Soo Hwan
and
Sebastian
Schuster
.
2022
.
Can language models capture syntactic associations without surface cues? A case study of reflexive anaphor licensing in English control constructions
. In
Proceedings of the Society for Computation in Linguistics 2022
, pages
206–211
.
Lees
,
Alyssa
,
Vinh Q.
Tran
,
Yi
Tay
,
Jeffrey
Sorensen
,
Jai
Gupta
,
Donald
Metzler
, and
Lucy
Vasserman
.
2022
.
A new generation of Perspective API: Efficient multilingual character-level Transformers
. In
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
, pages
3197–3207
.
Lehman
,
Eric
,
Sarthak
Jain
,
Karl
Pichotta
,
Yoav
Goldberg
, and
Byron
Wallace
.
2021
.
Does BERT pretrained on clinical notes reveal sensitive data?
In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
946–959
.
Lester
,
Brian
,
Rami
Al-Rfou
, and
Noah
Constant
.
2021
.
The power of scale for parameter-efficient prompt tuning
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
3045–3059
.
Levy
,
Sharon
,
Emily
Allaway
,
Melanie
Subbiah
,
Lydia
Chilton
,
Desmond
Patton
,
Kathleen
McKeown
, and
William Yang
Wang
.
2022
.
SafeText: A benchmark for exploring physical safety in language models
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
2407–2421
.
Levy
,
Sharon
,
Michael
Saxon
, and
William Yang
Wang
.
2021
.
Investigating memorization of conspiracy theories in text generation
. In
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
, pages
4718–4729
.
Li
,
Jiaxuan
,
Lang
Yu
, and
Allyson
Ettinger
.
2022
.
Counterfactual reasoning: Do language models need world knowledge for causal inference?
In
Workshop on Neuro Causal and Symbolic AI (nCSI)
.
Li
,
Junyi
,
Tianyi
Tang
,
Wayne Xin
Zhao
, and
Ji-Rong
Wen
.
2021
.
Pretrained language models for text generation: A survey
. In
International Joint Conference on Artificial Intelligence
, pages
4492–4499
.
Li
,
Xiang Lorraine
,
Adhiguna
Kuncoro
,
Jordan
Hoffmann
,
Cyprien
de Masson d’Autume
,
Phil
Blunsom
, and
Aida
Nematzadeh
.
2022b
.
A systematic investigation of commonsense knowledge in large language models
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
11838–11855
.
Li
,
Xingxuan
,
Yutong
Li
,
Linlin
Liu
,
Lidong
Bing
, and
Shafiq
Joty
.
2022a
.
Is GPT-3 a psychopath? Evaluating large language models from a psychological perspective
.
ArXiv
,
arXiv:2212.10529
.
Lieber
,
Opher
,
Or
Sharir
,
Barak
Lenz
, and
Yoav
Shoham
.
2021
.
Jurassic-1: Technical details and evaluation
.
White Paper. AI21 Labs
.
Lin
,
Bill Yuchen
,
Seyeon
Lee
,
Rahul
Khanna
, and
Xiang
Ren
.
2020
.
Birds have four legs?! NumerSense: Probing numerical commonsense knowledge of pre-trained language models
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
6862–6868
.
Lin
,
Stephanie
,
Jacob
Hilton
, and
Owain
Evans
.
2022
.
TruthfulQA: Measuring how models mimic human falsehoods
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
3214–3252
.
Liu
,
Emmy
,
Chenxuan
Cui
,
Kenneth
Zheng
, and
Graham
Neubig
.
2022a
.
Testing the ability of language models to interpret figurative language
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4437–4452
.
Liu
,
Fangyu
,
Julian
Eisenschlos
,
Jeremy
Cole
, and
Nigel
Collier
.
2022b
.
Do ever larger octopi still amplify reporting biases? Evidence from judgments of typical colour
. In
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
, pages
210–220
.
Liu
,
Ruibo
,
Chenyan
Jia
,
Jason
Wei
,
Guangxuan
Xu
, and
Soroush
Vosoughi
.
2022c
.
Quantifying and alleviating political bias in language models
.
Artificial Intelligence
,
304
:
103654
.
Liu
,
Yinhan
,
Myle
Ott
,
Naman
Goyal
,
Jingfei
Du
,
Mandar
Joshi
,
Danqi
Chen
,
Omer
Levy
,
Mike
Lewis
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2019
.
RoBERTa: A robustly optimized BERT pretraining approach
.
ArXiv
,
arXiv:1907.11692
.
Liu
,
Zeyu
,
Yizhong
Wang
,
Jungo
Kasai
,
Hannaneh
Hajishirzi
, and
Noah A.
Smith
.
2021
.
Probing across time: What does RoBERTa know and when?
In
Findings of the Association for Computational Linguistics: EMNLP 2021
, pages
820–842
.
Magee
,
Liam
,
Lida
Ghahremanlou
,
Karen
Soldatic
, and
Shanthi
Robertson
.
2021
.
Intersectional bias in causal language models
.
ArXiv
,
arXiv:2107.07691
.
Mahowald
,
Kyle
.
2023
.
A discerning several thousand judgments: GPT-3 rates the article + adjective + numeral + noun construction
. In
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
, pages
265–273
.
Mahowald
,
Kyle
,
Evgeniia
Diachek
,
Edward
Gibson
,
Evelina
Fedorenko
, and
Richard
Futrell
.
2022
.
Experimentally measuring the redundancy of grammatical cues in transitive clauses
.
ArXiv
,
arXiv:2201.12911
.
Malkin
,
Nikolay
,
Sameera
Lanka
,
Pranav
Goel
, and
Nebojsa
Jojic
.
2021
.
Studying word order through iterative shuffling
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
10351–10366
.
Mallen
,
Alex
,
Akari
Asai
,
Victor
Zhong
,
Rajarshi
Das
,
Hannaneh
Hajishirzi
, and
Daniel
Khashabi
.
2022
.
When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories
.
ArXiv
,
arXiv:2212.10511
.
Marr
,
David
.
2010
.
Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
.
MIT Press
.
Masis
,
Tessa
and
Carolyn
Anderson
.
2021
.
ProSPer: Probing human and neural network language model understanding of spatial perspective
. In
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
95–135
.
Massarelli
,
Luca
,
Fabio
Petroni
,
Aleksandra
Piktus
,
Myle
Ott
,
Tim
Rocktäschel
,
Vassilis
Plachouras
,
Fabrizio
Silvestri
, and
Sebastian
Riedel
.
2020
.
How decoding strategies affect the verifiability of generated text
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
223–235
.
Mattern
,
Justus
,
Zhijing
Jin
,
Mrinmaya
Sachan
,
Rada
Mihalcea
, and
B.
Schölkopf
.
2022
.
Understanding stereotypes in language models: Towards robust measurement and zero-shot debiasing
.
ArXiv
,
arXiv:2212.10678
.
McCoy
,
R. Thomas
,
Paul
Smolensky
,
Tal
Linzen
,
Jianfeng
Gao
, and
Asli
Celikyilmaz
.
2021
.
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
.
ArXiv
,
arXiv:2111.09509
.
McCoy
,
Thomas
,
Robert
Frank
, and
Tal
Linzen
.
2018
.
Revisiting the poverty of the stimulus: Hierarchical generalization without a hierarchical bias in recurrent neural networks
. In
Proceedings of the Annual Meeting of the Cognitive Science Society
, volume
40
, pages
2096–2101
.
McCoy
,
Tom
,
Ellie
Pavlick
, and
Tal
Linzen
.
2019
.
Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
3428–3448
.
Meade
,
Nicholas
,
Elinor
Poole-Dayan
, and
Siva
Reddy
.
2022
.
An empirical survey of the effectiveness of debiasing techniques for pre-trained language models
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1878–1898
.
Mehrabi
,
Ninareh
,
Ahmad
Beirami
,
Fred
Morstatter
, and
Aram
Galstyan
.
2022
.
Robust conversational agents against imperceptible toxicity triggers
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
2831–2847
.
Meister
,
Clara
and
Ryan
Cotterell
.
2021
.
Language model evaluation beyond perplexity
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
5328–5339
.
Meng
,
Kevin
,
David
Bau
,
Alex J.
Andonian
, and
Yonatan
Belinkov
.
2022
.
Locating and editing factual associations in GPT
. In
Advances in Neural Information Processing Systems
, volume
35
, pages
17359–17372
.
Miaschi
,
Alessio
,
Dominique
Brunato
,
Felice
Dell’Orletta
, and
Giulia
Venturi
.
2020
.
Linguistic profiling of a neural language model
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
745–756
.
Michaelov
,
James
and
Benjamin
Bergen
.
2022a
.
Collateral facilitation in humans and language models
. In
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
, pages
13–26
.
Michaelov
,
James
and
Benjamin
Bergen
.
2022b
.
‘Rarely’ a problem? Language models exhibit inverse scaling in their predictions following ‘few’-type quantifiers
.
ArXiv
,
arXiv:2212.08700
.
Min
,
Sewon
,
Xinxi
Lyu
,
Ari
Holtzman
,
Mikel
Artetxe
,
Mike
Lewis
,
Hannaneh
Hajishirzi
, and
Luke
Zettlemoyer
.
2022
.
Rethinking the role of demonstrations: What makes in-context learning work?
In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
11048–11064
.
Miotto
,
Marilù
,
Nicola
Rossberg
, and
Bennett
Kleinberg
.
2022
.
Who is GPT-3? An exploration of personality, values and demographics
. In
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
, pages
218–227
.
Misra
,
Kanishka
.
2022
.
minicons: Enabling flexible behavioral and representational analyses of Transformer language models
.
ArXiv
,
arXiv:2203.13112
.
Misra
,
Kanishka
,
Allyson
Ettinger
, and
Julia
Rayz
.
2020
.
Exploring BERT’s sensitivity to lexical cues using tests from semantic priming
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
4625–4635
.
Misra
,
Kanishka
,
Allyson
Ettinger
, and
Julia Taylor
Rayz
.
2021
.
Do language models learn typicality judgments from text?
In
Proceedings of the Annual Meeting of the Cognitive Science Society
, volume
43
, pages
216–222
.
Misra
,
Kanishka
,
Julia Taylor
Rayz
, and
Allyson
Ettinger
.
2023
.
COMPS: Conceptual minimal pair sentences for testing property knowledge and inheritance in pre-trained language models
. In
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
, pages
2928–2949
.
Mitchell
,
Melanie
and
David
Krakauer
.
2022
.
The debate over understanding in AI’s large language models
.
ArXiv
,
arXiv:2210.13966
.
Monroe
,
Will
and
Christopher
Potts
.
2015
.
Learning in the rational speech acts model
.
ArXiv
,
arXiv:1510.06807
.
Mosbach
,
Marius
,
Anna
Khokhlova
,
Michael A.
Hedderich
, and
Dietrich
Klakow
.
2020
.
On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained Transformers
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
2502–2516
.
Nadeem
,
Moin
,
Anna
Bethke
, and
Siva
Reddy
.
2021
.
StereoSet: Measuring stereotypical bias in pretrained language models
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
5356–5371
.
Nangia
,
Nikita
,
Clara
Vania
,
Rasika
Bhalerao
, and
Samuel R.
Bowman
.
2020
.
CrowS-Pairs: A challenge dataset for measuring social biases in masked language models
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
1953–1967
.
Nayak
,
Pandu
.
2019
.
Understanding searches better than ever before
.
The Keyword: Google Blog
.
Newman
,
Benjamin
,
Kai-Siang
Ang
,
Julia
Gong
, and
John
Hewitt
.
2021
.
Refining targeted syntactic evaluation of language models
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
3710–3723
.
Nozza
,
Debora
,
Federico
Bianchi
, and
Dirk
Hovy
.
2021
.
HONEST: Measuring hurtful sentence completion in language models
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
2398–2406
.
Nozza
,
Debora
,
Federico
Bianchi
,
Anne
Lauscher
, and
Dirk
Hovy
.
2022
.
Measuring harmful sentence completion in language models for LGBTQIA+ individuals
. In
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion
, pages
26–34
.
O’Connor
,
Joe
and
Jacob
Andreas
.
2021
.
What context features can transformer language models use?
In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
851–864
.
Olsson
,
Catherine
,
Nelson
Elhage
,
Neel
Nanda
,
Nicholas
Joseph
,
Nova
DasSarma
,
Tom
Henighan
,
Benjamin
Mann
,
Amanda
Askell
,
Yuntao
Bai
,
Anna
Chen
,
Tom
Conerly
,
Dawn
Drain
,
Deep
Ganguli
,
Zac
Hatfield-Dodds
,
Danny
Hernandez
,
Scott
Johnston
,
Andy
Jones
,
John
Kernion
,
Liane
Lovitt
,
Kamal
Ndousse
,
Dario
Amodei
,
Tom B.
Brown
,
Jack
Clark
,
Jared
Kaplan
,
Sam
McCandlish
, and
Chris
Olah
.
2022
.
In-context learning and induction heads
.
ArXiv
,
arXiv:2209.11895
.
OpenAI
.
2022
.
ChatGPT: Optimizing language models for dialogue
.
OpenAI Blog
.
OpenAI
.
2023a
.
GPT-4 technical report
.
OpenAI
.
OpenAI
.
2023b
.
Model index for researchers
.
OpenAI
.
Ousidhoum
,
Nedjma
,
Xinran
Zhao
,
Tianqing
Fang
,
Yangqiu
Song
, and
Dit-Yan
Yeung
.
2021
.
Probing toxic content in large pre-trained language models
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
4262–4274
.
Ouyang
,
Long
,
Jeffrey
Wu
,
Xu
Jiang
,
Diogo
Almeida
,
Carroll
Wainwright
,
Pamela
Mishkin
,
Chong
Zhang
,
Sandhini
Agarwal
,
Katarina
Slama
,
Alex
Gray
,
John
Schulman
,
Jacob
Hilton
,
Fraser
Kelton
,
Luke
Miller
,
Maddie
Simens
,
Amanda
Askell
,
Peter
Welinder
,
Paul
Christiano
,
Jan
Leike
, and
Ryan
Lowe
.
2022
.
Training language models to follow instructions with human feedback
. In
Advances in Neural Information Processing Systems
, volume
35
, pages
27730–27744
.
Paik
,
Cory
,
Stéphane
Aroca-Ouellette
,
Alessandro
Roncone
, and
Katharina
Kann
.
2021
.
The world of an octopus: How reporting bias influences a language model’s perception of color
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
823–835
.
Pandia
,
Lalchand
,
Yan
Cong
, and
Allyson
Ettinger
.
2021
.
Pragmatic competence of pre-trained language models through the lens of discourse connectives
. In
Proceedings of the 25th Conference on Computational Natural Language Learning
, pages
367–379
.
Pandia
,
Lalchand
and
Allyson
Ettinger
.
2021
.
Sorting through the noise: Testing robustness of information processing in pre-trained language models
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
1583–1596
.
Pandit
,
Onkar
and
Yufang
Hou
.
2021
.
Probing for bridging inference in transformer language models
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4153–4163
.
Park
,
Kwonsik
,
Myung-Kwan
Park
, and
Sanghoun
Song
.
2021
.
Deep learning can contrast the minimal pairs of syntactic data
.
Linguistic Research
,
38
(
2
):
395–424
.
Patel
,
Roma
and
Ellie
Pavlick
.
2021
.
“Was it “stated” or was it “claimed”?: How linguistic bias affects generative language models
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
10080–10095
.
Pedinotti
,
Paolo
,
Eliana Di
Palma
,
Ludovica
Cerini
, and
Alessandro
Lenci
.
2021a
.
A howling success or a working sea? Testing what BERT knows about metaphors
. In
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
, pages
192–204
.
Pedinotti
,
Paolo
,
Giulia
Rambelli
,
Emmanuele
Chersoni
,
Enrico
Santus
,
Alessandro
Lenci
, and
Philippe
Blache
.
2021b
.
Did the cat drink the coffee? Challenging transformers with generalized event knowledge
. In
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics
, pages
1–11
.
Peng
,
Hao
,
Xiaozhi
Wang
,
Shengding
Hu
,
Hailong
Jin
,
Lei
Hou
,
Juanzi
Li
,
Zhiyuan
Liu
, and
Qun
Liu
.
2022
.
COPEN: Probing conceptual knowledge in pre-trained language models
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
5015–5035
.
Penha
,
Gustavo
and
Claudia
Hauff
.
2020
.
What does BERT know about books, movies and music? Probing BERT for conversational recommendation
. In
Proceedings of the 14th ACM Conference on Recommender Systems
, pages
388–397
.
Perez
,
Ethan
,
Saffron
Huang
,
Francis
Song
,
Trevor
Cai
,
Roman
Ring
,
John
Aslanides
,
Amelia
Glaese
,
Nat
McAleese
, and
Geoffrey
Irving
.
2022a
.
Red teaming language models with language models
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
3419–3448
.
Perez
,
Ethan
,
Sam
Ringer
,
Kamile
Lukosiute
,
Karina
Nguyen
,
Edwin
Chen
,
Scott
Heiner
,
Craig
Pettit
,
Catherine
Olsson
,
Sandipan
Kundu
,
Saurav
Kadavath
,
Andy
Jones
,
Anna
Chen
,
Benjamin
Mann
,
Brian
Israel
,
Bryan
Seethor
,
Cameron
McKinnon
,
Christopher
Olah
,
Daisong
Yan
,
Daniela
Amodei
,
Dario
Amodei
,
Dawn
Drain
,
Dustin
Li
,
Eli
Tran-Johnson
,
Guro
Khundadze
,
John
Kernion
,
James
Landis
,
Jamie
Kerr
,
Jared
Mueller
,
Jeeyoon
Hyun
,
Joshua
Landau
,
Kamal
Ndousse
,
Landon
Goldberg
,
Liane
Lovitt
,
Martin
Lucas
,
Michael
Sellitto
,
Miranda
Zhang
,
Neerav
Kingsland
,
Nelson
Elhage
,
Nicholas
Joseph
,
Noemí
Mercado
,
Nova
DasSarma
,
Oliver
Rausch
,
Robin
Larson
,
Sam
McCandlish
,
Scott
Johnston
,
Shauna
Kravec
,
Sheer
El Showk
,
Tamera
Lanham
,
Timothy
Telleen-Lawton
,
Tom
Brown
,
Tom
Henighan
,
Tristan
Hume
,
Yuntao
Bai
,
Zac
Hatfield-Dodds
,
Jack
Clark
,
Sam
Bowman
,
Amanda
Askell
,
Roger
Grosse
,
Danny
Hernandez
,
Deep
Ganguli
,
Evan
Hubinger
,
Nicholas
Schiefer
, and
Jared
Kaplan
.
2022b
.
Discovering language model behaviors with model-written evaluations
.
ArXiv
,
arXiv:2212.09251
.
Pérez-Mayos
,
Laura
,
Miguel
Ballesteros
, and
Leo
Wanner
.
2021
.
How much pretraining data do language models need to learn syntax?
In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
1571–1582
.
Petroni
,
Fabio
,
Tim
Rocktäschel
,
Sebastian
Riedel
,
Patrick
Lewis
,
Anton
Bakhtin
,
Yuxiang
Wu
, and
Alexander
Miller
.
2019
.
Language models as knowledge bases?
In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
, pages
2463–2473
.
Petty
,
Jackson
and
Robert
Frank
.
2021
.
Transformers generalize linearly
.
ArXiv
,
arXiv:2109.12036
.
Piantadosi
,
Steven
and
Evelina
Fedorenko
.
2017
.
Infinitely productive language can arise from chance under communicative pressure
.
Journal of Language Evolution
,
2
(
2
):
141–147
.