How Much Do Language Models Copy From Their Training Data? Evaluating Linguistic Novelty in Text Generation Using RAVEN

Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models trained on English (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure—e.g., individual dependencies—text generated with a standard sampling scheme is substantially less novel than our baseline of human-generated text from each model’s test set. For larger-scale structure—e.g., overall sentence structure—model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis, finding evidence that GPT-2 uses both compositional and analogical generalization mechanisms and showing that GPT-2’s novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).


Introduction
How deep is deep learning? Are neural networks "discovering intricate structures" that support sophisticated generalization (LeCun et al., 2015), or are they "stochastic parrots" that simply memorize seen examples and recombine them in shallow ways (Bender et al., 2021)?
We focus on this question in the area of open-ended text generation. Neural network language models (LMs) can generate grammatical, coherent text (See et al., 2019; Brown et al., 2020, section 3.9.4), but the text alone cannot tell us whether it was constructed by the model or copied from the training set. We argue that it is important to disentangle these possibilities. That is, in addition to evaluating the quality of generated text, as is already standard (Gatt and Krahmer, 2018; Celikyilmaz et al., 2020), we should also evaluate its novelty.
Novelty is important for several reasons. From a linguistic perspective, one core component of knowing a language is the ability to combine familiar parts in novel ways (Chomsky, 1957;Hockett, 1963). From a machine learning perspective, models are meant to learn the training distribution, not just memorize the training set (Dietterich, 1995). Finally, on the more practical side, models that copy training data might leak sensitive information (Carlini et al., 2021) or repeat hate speech (Bender et al., 2021).
In this work, to assess the novelty of generated text, we introduce a suite of analyses called RAVEN (RAting VErbal Novelty). These analyses cover both sequential structure (n-grams) and syntactic structure. We apply these analyses to text generated by an LSTM, a Transformer, Transformer-XL, and all 4 sizes of GPT-2 (the largest LM for which we had access to the training data). Because there are many ways to generate text from LMs, we test 12 generation methods and 4 prompt lengths. As a baseline, we also analyze human-generated text from each model's test set.
We find that models display novelty for all aspects of structure that we analyze: they generate novel n-grams, novel morphological combinations, and novel syntactic structures. For instance, GPT-2 coins several types of novel words, including inflections (e.g., Swissified) and derivations (e.g., IKEA-ness), and 74% of sentences generated by Transformer-XL have a syntactic structure that no training sentence has. Thus, neural language models do not simply memorize; instead they use productive processes that allow them to combine familiar parts in novel ways. Nonetheless, when considering small n-grams, these models are less novel than the baseline. For example, for each model, the baseline human-generated text has 1.4 to 3.3 times as many novel bigrams as the model-generated text does. For n-grams larger than 5-grams, models are more novel than the baseline, but they still occasionally copy extensively: GPT-2 sometimes duplicates training passages that are over 1,000 words long. Overall, by evaluating novelty, we gain a new window into how models have or have not succeeded at generalizing beyond their experience.

Background
Memorization and copying: Neural networks are capable of extensive memorization: they can memorize randomly-labeled examples (Zhang et al., 2021) and can reveal training data when subjected to adversarial attacks (Shokri et al., 2017; Carlini et al., 2019, 2021). We study copying in text generated under standard, non-adversarial conditions, a topic which four other works have touched on. Brown et al. (2020, Section 8.2) study copying of 8-grams by GPT-2, and Lee et al. (2021) study copying of 50-grams by Transformer LMs, while Chen et al. (2021) and Ziegler (2021) look at copying of large n-grams in the code-generating model Codex. We perform a more comprehensive analysis of duplication: We look across the full range of n-gram sizes and analyze a range of architectures and generation methods. Beyond n-grams, we also evaluate copying of other linguistic structures (e.g., dependency arcs). Thus, we focus on linguistic generalization, while past work focused more on data privacy.
NLG evaluation: Careful evaluation of natural language generation (NLG) is crucial because of the ELIZA effect (Weizenbaum, 1966), "the susceptibility of people to read far more understanding than is warranted into strings of symbols—especially words—strung together by computers" (Hofstadter, 1995). Because generated text has such power to guide people's views of AI, it is important to understand what capacities actually underlie the generation of that text in order to give a balanced view of the model.
Unfortunately, NLG evaluation is challenging because many NLG tasks are open-ended. For example, a dialogue system can generate multiple plausible responses for the same user input. The prevalent evaluation methods quantify the quality of generated text, via a single holistic score (Zhang et al., 2020a) or via scores that focus on specific properties (Dou et al., 2021) such as fluency (Mutton et al., 2007), coherence (Lapata and Barzilay, 2005), or factual accuracy (Kryściński et al., 2020). We argue that evaluation of open-ended NLG should emphasize not only quality but also novelty: is the generated text novel, or does it simply duplicate part of the training set? Novelty is important because, without novelty, quality does not reveal much about the model's abilities. For example, suppose that a model is being evaluated for coherence. If the model simply copies a paragraph from its training set, it will produce highly coherent text, but only because it has learned how to copy-not because it has learned how to be coherent.
The previously-studied attribute that is most similar to novelty is diversity (Zhu et al., 2018;Hashimoto et al., 2019): can a model generate a diverse range of output sentences? Like novelty, diversity is rooted in differences between pieces of text. Despite this superficial similarity, novelty and diversity are distinct. Novelty covers how the generated text differs from the training set, while diversity covers how the generated text is different from other generated text. A model could be diverse but not novel (by copying a diverse set of training sentences), or novel but not diverse (by repeatedly generating the same novel sentence).
Much discussion about evaluating LMs focuses on whether they understand language (Bender and Koller, 2020;Marcus, 2020), whereas we assess the novelty of surface text. Thus, our main analyses only test whether models have abstractions governing form (e.g., syntax), not meaning.

Motivation and approach
Motivation: The analyses in RAVEN are inspired by a scientific question: To what extent do NLG models have generalizable linguistic abilities? This question motivates our focus on novelty because only novel text can illustrate linguistic generalization. There may be some practical use cases for which novelty is not important—but for answering our scientific question, and for working toward general-purpose LMs that can handle unfamiliar situations, novelty is crucial.
Approach: We generate many samples of text from LMs, and then evaluate how novel the text is. We assess novelty for two types of structure: n-grams and syntactic structure. We count a generated structure as duplicated if it appears in the training set or the context (the concatenation of the prompt and the text that the LM has already generated based on the prompt); otherwise, it is novel.
Copying is not necessarily undesirable (Khandelwal et al., 2020). For instance, some long n-grams might reasonably be duplicated from the training set, such as the title of a book. To contextualize a model's degree of duplication, we compare the model-generated text to human-generated text from the model's (in-distribution) test set, which gives a baseline for how much duplication can be expected within the model's training domain. If the model is at least as novel as the baseline, we conclude that it is not copying excessively. Two prior papers (Pannitto and Herbelot, 2020; Meister and Cotterell, 2021) have also analyzed models' linguistic abilities by comparing model-generated text to human-generated text, but neither of these focused on novelty.

Experimental details
Models: To perform a controlled comparison across architectures, we used three models trained on the same dataset, namely Wikitext-103 (Merity et al., 2017). Wikitext-103 is a collection of high-quality Wikipedia articles tokenized at the word level. Its training set contains 103 million words. Holding this training set constant, we compare the LSTM (Hochreiter and Schmidhuber, 1997), Transformer (Vaswani et al., 2017), and Transformer-XL (TXL; Dai et al., 2019) architectures, chosen because they give examples of the two main types of processing prevalent in language modeling: recurrence (used in the LSTM) and self-attention (used in the Transformer), with TXL using both mechanisms.
In addition to these systematic analyses, we also analyzed GPT-2 (Radford et al., 2019) as an example of a larger-scale Transformer LM (GPT-2 was the model with the largest training set that we could gain access to). Unlike our other models, GPT-2 is trained on the WebText corpus, which is constructed from webpages linked to on Reddit. GPT-2 also differs from our other models in its tokenization scheme: All our other models use word-level tokenization (in which each token is a full word), but GPT-2 uses a subword tokenization scheme (Sennrich et al., 2016). The WebText training corpus contains 7.7 billion words, making it much larger than Wikitext-103. For more details about each model, see Appendix A.
Prompts: To generate text from a model, we input a prompt drawn from that model's test set, which comes from the same distribution as its training set. For Wikitext-103, we use 1000 prompts of length 512 words and have models generate 1000 words following the prompt. For WebText, we use 1000 prompts of length 564 subword tokens, and have models generate 1100 subword tokens; these numbers are 1.1 times the corresponding Wikitext-103 numbers because there are approximately 1.1 subword tokens per word in WebText. As our baseline human-generated text, we use the text that follows the prompt in the corpus. For tokenization details, see Appendix B.
Decoding method: top-40 sampling: As its prediction about which word will appear next, a language model outputs a probability distribution over the vocabulary. There are many ways to select a word to generate from this distribution, which are called decoding methods.
When evaluating a model's novelty, an important consideration is that novelty is not always positive: a model that generates random nonsense would be highly novel. Thus, we want to choose a decoding method that gives high-quality text, because high novelty is only positive when accompanied by high quality. To this end, the decoding scheme that we use is top-k sampling with k = 40, where the model's distribution is truncated to the 40 highest-ranked words then renormalized and sampled from. We chose top-40 sampling because it is what Radford et al. (2019) used for GPT-2 and what Dai et al. (2019) used for TXL; because this method was selected by the creators of these models, we can be reasonably confident that it produces high-quality text from these models. For consistency, we use this same decoding scheme for our LSTM and Transformer, for which there is no established decoding scheme. For experiments with other decoding methods, see Section 5.3.
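As a concrete illustration of how top-k sampling works, here is a minimal sketch (our own code, not the implementation used for any of the models in this paper; the function name and use of NumPy are ours):

```python
import numpy as np

def top_k_sample(logits, k=40, temperature=1.0, rng=None):
    """Sample a token id via top-k sampling: truncate the model's
    distribution to the k highest-scoring tokens, renormalize the
    remaining probability mass, and sample from it."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top_ids = np.argsort(logits)[-k:]           # indices of the k best tokens
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()                        # renormalize over the top k
    return int(rng.choice(top_ids, p=probs))
```

With k = 1 this reduces to greedy decoding; larger k admits more of the distribution's tail.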

N-gram novelty
We first investigate novelty at the level of n-grams, where an n-gram is a sequence of n words.
5.1 How often are generated n-grams novel for various values of n?
Findings: For small values of n, n-grams generated by LMs are rarely novel. For larger values (n > 6), generated n-grams are almost always novel.
Details: Figure 1 shows the proportion of generated n-grams that are novel, for values of n from 1 to 10. We first note that the models are not merely copying: for all models, for n-grams of size 5 or larger, the majority of n-grams are novel.
We can obtain a more nuanced view by comparing the models to the baseline of text drawn from each model's test set. For small n-grams (n < 6), models are less novel than the baseline. For instance, with Wikitext-103, 6% of the baseline's bigrams are novel, while the models' bigrams are only 2% to 3% novel; for trigrams, the baseline has 31% novelty while models have 17% to 22% (Figure 1a). Thus, models are conservative at the small scale, rarely deviating from bigrams and trigrams they have seen before (though they do occasionally generate novel bigrams: see Appendix C). However, for larger n-grams (n > 6), the models are more novel than the baseline. Thus, at a larger scale, models cannot be described as mainly copying n-grams they have seen before.
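The n-gram novelty proportions discussed in this section reduce to set membership against the training corpus. A minimal sketch (illustrative only; the function names are ours, and a real corpus would need more memory-efficient storage than a plain set):

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_rate(generated, training, n):
    """Fraction of n-grams in `generated` that never occur in `training`."""
    seen = set(ngrams(training, n))
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return sum(g not in seen for g in gen) / len(gen)
```

(The paper additionally counts an n-gram as duplicated if it appears earlier in the context; that check would add a second set built from the prompt and previously-generated tokens.)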
Comparing the models to each other in the inset of Figure 1a, we see that the LSTM is the least novel for small n-grams, while the Transformer is the most novel, and TXL falls in between. We conjecture the following explanation: Recurrence creates a recency bias (Ravfogel et al., 2019) which makes models that use recurrence likely to condition their predictions heavily on immediately preceding tokens, biasing them to memorize bigrams and trigrams. This explains why the LSTM duplicates so much: it operates entirely via recurrence. The Transformer duplicates the least because it is driven entirely by self-attention (no recurrence), allowing it to condition on recent and faraway tokens more evenly. TXL uses both recurrence and self-attention, placing it between the other two.

5.2 Do models ever duplicate large n-grams?
Finding: All models occasionally duplicate training set passages that are 100 words long or longer.
Details: Models rarely duplicate n-grams larger than 10 tokens. However, there are occasional exceptions where models duplicate extremely long sequences. For instance, in our GPT-2 generated text, there are several cases where an entire generated passage (over 1,000 words long) appears in the training set. To refer to these extreme cases, we use the term supercopying, which we define as the duplication of an n-gram of size 100 or larger. See Appendix D for examples of supercopied text.
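Detecting supercopying amounts to finding the longest token run shared between the generated text and the training set. A brute-force sketch (our own illustration; this is quadratic, so at corpus scale one would instead use hashing or a suffix automaton):

```python
def longest_common_run(generated, training):
    """Length of the longest contiguous token span that appears in both
    the generated text and the training text. A run of length >= 100
    would count as supercopying under the paper's definition."""
    best = 0
    train = list(training)
    for i in range(len(generated)):
        for j in range(len(train)):
            k = 0
            while (i + k < len(generated) and j + k < len(train)
                   and generated[i + k] == train[j + k]):
                k += 1
            best = max(best, k)
    return best
```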
What causes supercopying? We hypothesize that models supercopy passages that appear multiple times in the training set. For instance, the Wikitext-103 training set contains 159 articles about different instances of The Boat Race, a rowing competition: "The Boat Race 1861," "The Boat Race 2002," etc. These articles are formulaic, with many sentences repeated across articles; e.g., the 100-gram in Appendix D that was generated by all 3 Wikitext-103 models occurs 56 times in the training set. As evidence supporting this hypothesis, Figure 9 (Appendix D) shows that supercopied 100-grams appear far more times in the training set on average than randomly-selected 100-grams. This is consistent with the findings of Lee et al. (2021) and Ziegler (2021) that duplicated text tends to be common. Carlini et al. (2021) found that text can be extracted even if it only occurred once, but they used an adversarial method that deliberately tries to extract training data, instead of freely generating text.

5.3 How is novelty related to the decoding scheme and the generated text's quality?
Findings: Changing decoding parameters can substantially alter a model's novelty: the novelty can be increased by increasing p in top-p sampling, k in top-k sampling, or the temperature. However, all modifications that increase the novelty of generated text also decrease the quality.

Details:
To get a single number that summarizes novelty, we use a new metric called the pointwise duplication score: Each generated token gets a score quantifying the extent to which it duplicates previously-seen text. This score is equal to the size of the smallest novel n-gram that ends with this word. For example, if the word is the end of a novel 4-gram (e.g., these rules will not be), but all of the smaller n-grams ending with the word were duplicated (will not be, not be, and be), then the pointwise duplication score is 4. To get the overall score, we average across the tokens. A downside of this basic score is that it can be heavily influenced by the extremely large duplication scores that arise from supercopying. To address this factor, we truncate each token's score at 5 before averaging (see Appendix E for untruncated results). Using this score, we investigated a range of decoding methods. Figure 11 (in Appendix F) shows the effects of varying commonly-used decoding parameters. With top-k sampling (truncating the distribution to the k most probable tokens before sampling), increasing k also increases novelty. With top-p sampling (truncating the distribution to the top p probability mass before sampling; Holtzman et al., 2020), increasing p increases novelty. When using a temperature (which scales words' scores before taking the softmax), increasing the temperature increases novelty. All of these trends make intuitive sense: a small k, p, or temperature upweights the head of the model's distribution, and it makes sense that statistical learners would assign higher probability to things they have seen than things they have not, which would lead to the head of a model's distribution being less novel than the tail.
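The pointwise duplication score described in this section can be sketched as follows (an illustrative reimplementation, not the paper's code; `seen` is assumed to hold every training n-gram, as a tuple, up to the truncation size):

```python
def pointwise_duplication(tokens, seen, cap=5):
    """Mean over tokens of the size of the smallest novel n-gram ending
    at that token, truncated at `cap`. Higher scores mean more
    duplication. If every n-gram up to `cap` ending at a token is
    duplicated (or too little context exists), the token scores `cap`."""
    scores = []
    for i, _ in enumerate(tokens):
        score = cap
        for n in range(1, min(cap, i + 1) + 1):
            if tuple(tokens[i - n + 1:i + 1]) not in seen:
                score = n  # smallest novel n-gram ending here
                break
        scores.append(score)
    return sum(scores) / len(scores)
```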
Could we make models perfectly novel just by changing the decoding scheme? Unfortunately, the decoding methods that increase novelty also decrease quality. Measuring quality is challenging; ideally we would use human evaluations, but that is beyond the scope of this project because we have 336 conditions to evaluate (7 models with 4 prompt lengths and 12 decoding schemes). Instead, we use perplexity as a proxy for quality, under the assumption that high-quality text should have a low perplexity. This assumption is certainly imperfect: text can have a low perplexity for degenerate reasons such as being repetitive (Holtzman et al., 2020). Nonetheless, it can still give us a rough initial sense of general trends. We use GPT-2 to measure the perplexity of text generated by the LSTM, Transformer, and TXL; we use TXL to measure the perplexity of GPT-2 text. See Appendix G for discussion of these decisions. Figure 2 shows a clear tradeoff between novelty and quality. None of the models trained on Wikitext-103 does as well as the baseline at managing this tradeoff. However, a model's perplexity does not entirely determine its level of novelty: Both Transformer architectures do better at this tradeoff than the LSTM, showing that it is possible to improve on this tradeoff using architectural innovations.
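Perplexity as used here follows its standard definition: the exponential of the negative mean log-probability that the scoring model assigns to each token. A minimal sketch of that computation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability. Uniformly assigning
    probability 1/N to each token yields a perplexity of N."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```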
In contrast to the Wikitext-103 models, GPT-2 performs similarly to the baseline at the quality-novelty tradeoff. The GPT-2 decoding scheme that comes closest to the baseline is top-p decoding with p = 0.95; this achieves a perplexity of 93.7 (baseline: 89.4) and a truncated pointwise duplication score of 4.41 (baseline: 4.47). Why does GPT-2 (with the right decoding scheme) outperform the Wikitext-103 models at matching the quality and novelty of its baseline? It is unlikely that the model architecture is the reason because GPT-2 is similar in architecture to the Wikitext-103 Transformer. In addition, although GPT-2 is our largest model, we doubt that model size is the explanation: GPT-2 Small shows similar results even though it is smaller than TXL. It may then be that training set size is the key factor, as WebText is much larger than Wikitext-103. Alternately, the WebText baseline might be easier to meet than the Wikitext baseline, because the generic Internet text in WebText is generally lower-quality than the curated Wikipedia text in Wikitext-103.

Figure 2: Manipulations to the decoding scheme that result in higher-quality text (i.e., lower perplexity; x-axis) also result in decreased novelty (i.e., a greater degree of duplication; y-axis). Each point shows a different decoding scheme.

Other n-gram analyses
Additional analyses are detailed in the appendices. We find that model size (Appendix H) and prompt length (Appendix I) do not have a clear effect on novelty; novelty is influenced by position within the generated text for some models, but the effect is small (Appendix J); and our novelty results do not change much if we only consider duplication from the training set rather than duplication from the context and/or training set (Appendix K).

Syntactic novelty
Findings: At the level of global sentence structure, models show a high degree of syntactic novelty, with the majority of generated sentences having an overall syntactic structure that no training sentence has. Models also display some novelty for more local structure (e.g., individual dependency arcs), but they have much less novelty for local structure than the baselines do.

Details:
We have seen that models display some novelty. How deeply does their novelty extend? Are they just inserting words into memorized templates, or performing deeper syntactic composition? To investigate this question, we parsed our generated text and our models' training data using state-of-the-art constituency (Kitaev and Klein, 2018) and dependency (Zhang et al., 2020b) parsers. We then evaluated novelty for 7 aspects of syntax.
Though current parsers perform well, they are not perfect, so we cannot completely trust the parsers' output. This is particularly a problem because the cases that are important to us (novel ones) are especially likely to confuse parsers. To address this issue, we manually analyzed the examples identified as novel to estimate the parsers' error rates (details in Appendix L). We concluded that 4 of the 7 attributes that we analyzed were handled accurately enough by the parsers for us to report numerical results, which are in Figure 3.
Here is a description of these attributes:
• POS sequence: the sequence of part-of-speech tags for the words in the sentence.
• Parse structure: the sentence's constituency tree minus the leaves (the words).
• Dependency arc: a 3-tuple of a dependency relation (e.g., nsubj) and the two words that hold that relation.
• Dependency role: a 3-tuple of a word, a dependency relation that the word is part of, and the word's position in that relation; e.g., "watch as the head of an nsubj relation."
For POS sequences and parse structures, there is a high degree of novelty: across all models and baselines, the majority of sentences have an overall structure that no training sentence has. In addition, there is little difference between the models and the baselines. For the more local structure of dependency arcs and dependency roles, the baselines are far more novel than the models. This is similar at a high level to our n-gram results: Models tend to be less novel than the baseline for local structure (small n-grams), but they are more novel than the baseline for more global structure (large n-grams). See Appendix M for an example of how such a local/global mismatch is possible, and see Appendix N for specific examples of syntactic generalization (e.g., nouns that were used as direct objects in generated text when they have never appeared as direct objects in training).
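Once parses are available, the novelty bookkeeping for dependency arcs and roles reduces to set operations over tuples. A minimal sketch (the tuple format is our own illustration and assumes arcs have already been extracted by a parser):

```python
def novel_syntax(generated_arcs, training_arcs):
    """Given dependency arcs as (relation, head, dependent) 3-tuples,
    return the fraction of novel arcs and the fraction of novel
    dependency roles (a word paired with a relation and the word's
    position in that relation)."""
    def roles(arcs):
        return {(word, rel, pos)
                for rel, head, dep in arcs
                for word, pos in ((head, "head"), (dep, "dependent"))}
    gen_arcs, train_arcs = set(generated_arcs), set(training_arcs)
    arc_novelty = len(gen_arcs - train_arcs) / len(gen_arcs)
    gen_roles, train_roles = roles(gen_arcs), roles(train_arcs)
    role_novelty = len(gen_roles - train_roles) / len(gen_roles)
    return arc_novelty, role_novelty
```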

Analysis
We finish with some manual analysis of novel generated text. Such analysis is labor-intensive; to use this labor most effectively, we focus exclusively on GPT-2 because it is the strongest-performing model. For this initial analysis, we study only the novel unigrams that GPT-2 generates; GPT-2 uses subword tokenization, so it can generate novel words by combining seen subwords in novel ways. See Appendices O and P for a detailed taxonomy of the novel words that GPT-2 generates. Here in the main paper, we focus only on 4 targeted questions about these novel words. Throughout this section, any word in boldface is novel.
7.1 When GPT-2 generates novel words, are they morphologically well-formed?
Specific categories: Forming English plurals requires a choice between two orthographic forms, -s and -es. In 72 of the 74 novel plurals, GPT-2 made the correct choice (e.g., Brazilianisms, Fowleses). The two incorrect examples were 1099es and SQLes. Similarly, forming English possessives requires a choice between -'s and -'.
Here, GPT-2 makes the correct choice in 135 out of 136 novel possessives (e.g., Flexagons', Runerealm's), with the only error being watchmakers's. Acronyms provide another case for which we can easily quantify well-formedness. Our GPT-2-generated text contains 75 examples of novel acronyms that appear along with the full version of what the acronym stands for. In 72% of cases, the acronym is not a suitable abbreviation (well-formed example in 1, ill-formed example in 2). There are valid reasons why an acronym might not match its expansion; e.g., sometimes English-language publications will translate a non-English phrase but not the acronym derived from it, giving results such as Doctors Without Borders (MSF). However, in our baseline text, 17 of the 21 acronyms that appeared with expansions were suitable, so GPT-2 is still not suitable nearly as often as the baseline (28% vs. 81%).
(1) West of England Cricket and Athletics Club (WECAC)
(2) Extremely Large Interactive Neutrino Experiment (ELIGO)
Some additional examples of success involve suffixes that require the stem to change spelling, with GPT-2 successfully making the change (3). Some additional mistakes are the use of a plural noun as the first component of a compound (4) and overregularization, namely using the regular suffix -th instead of the exceptional suffix -nd (5).
(3) a. by " cookying " certain searches on the internet b. Summission base camp c. the ridiculousities of war
(4) The...rivers had their headswaters in a larger basin
(5) the 752th year

7.2 When GPT-2 generates novel words, do they fit within their syntactic context?
Finding: The vast majority of GPT-2's novel words (94%) are used in grammatically-correct contexts (Figure 4), but it does make more errors than we see in the baseline (e.g., 6).
(6) a. the manicure that I did for Sally-themed a year ago b. Slicex load-samples provides a single button

Agreement: Despite these errors, the vast majority of cases have proper syntax. Some particularly impressive cases involve novel plural words. First (despite the one mistake in 6b), GPT-2 generally does well at providing plural verbs (underlined) to agree with novel plural nouns, whether the verb appears after the noun (7) or before the noun in the context of a question (8). In (9), it correctly uses a plural verb for both verbs that agree with the novel plural subject: a verb within the relative clause, and a verb after it. The correct agreement with the verb after the relative clause is especially impressive because, in both sentences, there are 3 singular "distractors" (italicized) between the subject and the verb. See Haley (2020) for similar observations but with BERT instead of GPT-2.
(7) a. We know that M-Sinks need a target b. Torpexes are small hardpoints (8) Why do SQLes have to change (9) a. The Huamangas , who are descendants of indigenous people who lived on the Isthmus of Tehuantepec before it was covered by farmland , have been demanding that the federal government address the issue of climate change . b. FOIA-requesters who think an agency has a good reason for withholding information are not always given a second opportunity to press their case .
Other plural-relevant syntax: Beyond agreement, syntactic consequences of plurality are observed in a few other places as well: in using the plural possessive form that is just an apostrophe instead of the singular form of -'s (10); in having the pronouns that are coreferential with the noun be plural as well (11); and in following determiners that require a plural noun (12).
(10) The Fowleses ' lawyer
(11) a. I love Klymits , but it has been nearly impossible for us to find them in stores . b. The Sarrats were lucky to have her as part of their lives
(12) a. these small townites b. so many Brazilianisms

Incrementing/ordering: Another type of inter-word relation that GPT-2 appears to have learned is incrementing/ordering, with examples in the Appendix. In (112a), GPT-2 increments numbers from Firstly to Fourteenthly, with Thirteenthly and Fourteenthly being novel. In (112b), it increments the letters at the ends of variable names in computer code, going from multiplyx to multiplyy to multiplyz. Finally, in (112c), the prompt ends with an alphabetical list of companies, and GPT-2 continues this list, staying mostly in alphabetical order and including many novel words along the way.

Quotation marks: A final aspect of sentence structure that we analyze is putting words within quotation marks. In human-generated text, there is an association between novel words and quotation marks: words are much more likely to appear inside quotation marks if they are novel, and they are much more likely to be novel if they appear inside quotation marks. This association is also present in GPT-2's generated text (Figure 5), e.g.:
(13) a. The " proto-poetry " of modern times b. the " un-competition " that is happening
These results suggest that GPT-2 might encode some version of the concept "novel word," which it can access when determining whether to include quotation marks.
7.3 When GPT-2 generates novel words, do they result in reasonable meanings?
Finding: GPT-2 does less well in this area than in morphology and syntax, consistent with claims (Bender and Koller, 2020) that language models only learn form, not meaning (Figure 6).
Examples: There are some generated examples for which there is clear evidence that the meaning is incorrect (14). One frequent source of mistakes is numbers, revealing a general lack of understanding of the quantities that these numbers represent. Numerical errors include incorrect conversions (15a), physical impossibilities (15b), and inconsistent exchange rates (15c): Nonetheless, there are also some positive examples where GPT-2 essentially provides a clear and accurate definition of the novel word or otherwise makes use of all aspects of the word: (16) a. ... the process of re-nitrification that gives them a new supply of nitrogen b. the concept of ' co-causation ' , in which effects are thought to be caused by causes that act in parallel c. the " bondbreaking enchantment " , which...permanently breaks any binding .

What does GPT-2 generalize from?
We have seen that GPT-2 generates some novel words. What types of generalization does GPT-2 use to create these words? There are two basic types of generalization that might be employed (Prasada and Pinker, 1993;Albright and Hayes, 2003;Dasgupta et al., 2021). First, a novel word could be created by a compositional rule that builds up word parts (17a). Alternatively, a novel word could be created via a similarity-based analogy, with similar word parts replacing each other, such as swapping giraffe and elephant (17b).
(17) a. elephant + -s = elephants
b. giraffes - giraffe + elephant = elephants

As these examples show, a given word (e.g., elephants) could have been formed in either of these ways, so we can never be certain which approach GPT-2 is using. However, based on some examples that are reasonably clear, we suspect that GPT-2 employs both types of generalization.
Generalization by composition: In a few cases, GPT-2 generates a novel word whose stem never appears in training but does appear in the context (the prompt plus the previously-generated words): see (18). We believe that these examples are best explained by composition: analogy requires some notion of similarity between the two word parts being swapped, and it is unlikely that the model would have such similarity notions for a word stem it has never seen before. Thus, we think these examples are better understood as the model adding a prefix or suffix to a word from its context, without direct reference to another word that has that prefix or suffix-a form of composition.
(18) a. using the LHAW to take out other LHAWs
b. Pelagic epineopterygoid ... Subepineopterygoid , N. scapulatus

Generalization by analogy: Appendix P.16 contains one piece of generated text which we believe provides clear evidence for analogy. The prompt for this generation contains the real English word torero (borrowed from Spanish), which means "bullfighter." The generation then contains several alternate forms of this word (some with plural inflection): tearro, tornro, tearingros, and tearsros (e.g., in the sentence tearingros are taught to avoid the horns). It appears, then, that GPT-2 has taken the word torero and replaced the first 4 letters (tore) with other forms of the verb tear: tear, torn, tearing, and tears. There is no morphological process in English that adds -ro to verbs, so it is unlikely that these words were generated via composition; instead, it seems more likely that they were generated via analogy.

Discussion
Using our analysis suite RAVEN, we have found that models generate many types of novelty: novel n-grams of all sizes, novel syntactic structures, and novel morphological combinations. However, they also show many signs of copying: for local structure, they are substantially less novel than the baseline, and we see occasional large-scale copying, such as duplicating passages from the training set that are over 1,000 words long.
Compositionality: Compositional generalization (combining familiar parts in novel ways) is often discussed in the context of out-of-distribution generalization (Hadley, 1994; Hupkes et al., 2020; Keysers et al., 2020; Li et al., 2021), typically relying on synthetic datasets to test models' compositional abilities (Lake and Baroni, 2018; Kim and Linzen, 2020; McCoy et al., 2020). Our baseline results in Figure 3 show that compositional generalization is important even for in-distribution test sets drawn from large-scale natural corpora. Most notably, the majority of test sentences had a sentence-level syntactic structure that had never appeared in the training set. Turning to the model results in Figure 3, all models displayed nonzero rates of compositional generalization, giving an existence proof that they can perform these types of generalization. Nonetheless, the models' scores are lower than the baseline, so their generalization might be limited to particular subcases, instead of being as general as human generalization. In the opposite direction, however, we also found examples where GPT-2 generalized too freely, such as generating the word 752th (Section 7.1). We conclude that it may not be enough to simply encourage models to be systematic, because language is not completely systematic. Instead, we need models that can both figure out linguistic rules and recognize exceptions to those rules (O'Donnell, 2015; Yang, 2016).
Evaluating novelty: Our core message is that novelty has not received the attention it deserves in NLG evaluation. For generated text to truly illustrate a model's generative capabilities, that text must be novel; otherwise, it may only illustrate the model's ability to copy, but not other abilities (e.g., the ability to be coherent). We recommend using the level of novelty found in an in-distribution test set as a baseline: if the model is at least as novel as this baseline, we can rule out the possibility that it is copying excessively.
Recent increases in training set sizes make it especially critical to check for novelty because the magnitude of these training sets can break our intuitions about what can be expected to occur naturally. For instance, some notable work in language acquisition (e.g. Kuczaj II, 1977;Marcus et al., 1992) relies on the assumption that regular past tense forms of irregular verbs (e.g., becomed, teached) do not appear in a learner's experience, so if a learner produces such words, they must be novel to the learner. However, it turns out that, for all 92 basic irregular verbs in English, the incorrect regular form appears in GPT-2's training set; details are in Appendix Q, along with results for another category often assumed to be novel in human experiments, namely nonsense words such as wug (Berko, 1958). Thus, when we are using models trained on such large-scale datasets, it is not safe to assume that something is absent from the training set; we must explicitly check.
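The check the paper recommends reduces to simple membership in the training vocabulary. A minimal sketch, with an invented toy training string standing in for the real corpus and whitespace splitting standing in for a real tokenizer:

```python
def build_vocab(training_text):
    # Lowercase and whitespace-tokenize: a rough stand-in for a real
    # tokenization pipeline over the full training corpus.
    return set(training_text.lower().split())

def is_novel(word, vocab):
    """A form counts as novel only if this exact form never occurs in training."""
    return word.lower() not in vocab

# Hypothetical toy corpus: note that the overregularized form "becomed"
# is present, so it would NOT be novel for a model trained on this text.
vocab = build_vocab("She becomed angry when he taught the class")
```

Under this check, `is_novel("teached", vocab)` holds but `is_novel("becomed", vocab)` does not, which is exactly the situation the paper describes: a form assumed to be novel may in fact be present in a large training set.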
Improving novelty: One straightforward approach for increasing novelty would be to modify the sampling procedure to suppress highly-copied outputs, similar to penalties used to prevent repetition (Keskar et al., 2019). Another approach would be to implement more nuanced forms of deduplication during training: We found that supercopying mainly arises when there is repetition in the training set, so eliminating such repetition might improve models' novelty. Indeed, concurrent work (Lee et al., 2021) has shown that deduplication can substantially decrease copying of 50-grams from the training set.
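One concrete (hypothetical) form such a decoding-time penalty could take, analogous to repetition penalties but keyed to the training set rather than the generation history, sketched over a plain dictionary of logits rather than a real model's output:

```python
def penalize_copies(logits, history, train_ngrams, n=4, penalty=5.0):
    """Hypothetical copy penalty (our sketch, not the paper's method):
    down-weight any next token that would complete an n-gram already
    present in the training set.

    logits: dict mapping candidate token -> logit
    history: list of tokens generated so far
    train_ngrams: set of n-gram tuples observed in training
    """
    prefix = tuple(history[-(n - 1):])
    out = dict(logits)
    for tok in out:
        if prefix + (tok,) in train_ngrams:
            out[tok] -= penalty  # make the copying continuation less likely
    return out
```

For example, with history `["a", "b", "c"]` and training 4-gram `("a", "b", "c", "d")`, the logit for `"d"` is reduced while other candidates are untouched.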
Ideally, however, we would find ways to decrease copying that are deeper, without requiring post-hoc modifications to the training data and sampling procedure. In humans, novelty has long been attributed to the usage of symbolic, compositional rules. Thus, greater novelty might be achieved through models that build in compositional mechanisms, such as RNNGs (Dyer et al., 2016) and TP-Transformers (Schlag et al., 2019).
Alternatively, one major difference between text generation in humans and neural LMs is that humans usually have a meaning that they want to express that guides their text generation, whereas most neural text generation involves no explicit plan. This difference may partly explain the ways in which models are less novel than humans: since models mainly manipulate text alone, they fall back to repeating text they have seen before. Thus, novelty may be improved by incorporating more explicit semantic planning (Rashkin et al., 2020).

Conclusion
In machine learning, it is critical to evaluate models on a withheld test set. Due to the open-ended nature of text generation, a model's generated text might be copied from the training set, in which case it is not withheld, so using that data to evaluate the model (e.g., for coherence or grammaticality) is not valid. Thus, it is important to consider novelty when evaluating text generation. We have introduced RAVEN, an analysis suite covering both sequential structure and syntactic structure, and have applied it to several models, showing that models are rarely novel for local structure but are often novel for larger-scale structure; however, they occasionally copy even very long passages. Beyond text generation, we hope that our work will motivate more careful consideration of maintaining withheld splits between training sets and evaluation sets across NLP.

B Tokenization and other text preprocessing
Prompts: For the Wikitext prompts, we used prompts of length 0, 16, 128, and 512; most of our experiments used only the length-512 prompts.
For the WebText prompts, the prompt lengths were 0, 18, 141, and 564 subword tokens; however, we extended the prompt past that length as was necessary to ensure that the prompt ended with a complete word. If this required adding more than 10 additional tokens, we discarded the prompt and sampled a new one. The WebText prompt lengths were chosen to be approximately 1.1 times the Wikitext prompt lengths because there are approximately 1.1 WebText subword tokens for every word.
As our baseline text, we used the words that followed the prompt in the test set. Due to the small size of the Wikitext-103 test set (it contains approximately 245,000 tokens, while the continuations following the prompts total 1,000,000 tokens), some of the Wikitext-103 continuations that were used to make the baseline text necessarily have parts that overlap with parts of other continuations, but no two continuations are identical. For the WebText baseline, there was no such overlap because the dataset was large enough to avoid it.
N-gram novelty: For computing n-gram novelty, we did not perform any processing of Wikitext text or the text generated by Wikitext models; thus, this text uses the tokenization from the Wikitext-103 dataset, which is a slightly modified version of the Moses tokenizer (Koehn et al., 2007). For WebText text and text generated by GPT-2, we converted GPT-2's subword IDs into text using the GPT2Tokenizer from the HuggingFace Transformers library (version 2.11.0). We then replaced each newline with the token &NEW-LINE; (which never occurs in WebText), to be consistent with Wikitext, in which each newline is a token. Wherever there were multiple spaces in a row, we replaced them with a single space. We then tokenized this text using the Moses tokenizer (Koehn et al., 2007) and used the resulting tokens to compute n-gram novelty.
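The preprocessing and the n-gram novelty computation can be sketched as follows; this is our own illustrative version, with whitespace splitting standing in for the Moses tokenizer and all function names being ours:

```python
import re

def preprocess(text):
    """Sketch of the WebText preprocessing: mark newlines with the
    sentinel token and collapse runs of spaces, then tokenize
    (approximated here by whitespace splitting)."""
    text = text.replace("\n", " &NEW-LINE; ")
    text = re.sub(r" {2,}", " ", text)
    return text.strip().split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(generated, training, n):
    """Fraction of generated n-grams that never appear in training
    (assumes the generated text has at least n tokens)."""
    train_set = set(ngrams(training, n))
    gen = ngrams(generated, n)
    return sum(1 for g in gen if g not in train_set) / len(gen)
```

For instance, against training tokens `a b c a b`, the generated tokens `a b d` contain one seen bigram (`a b`) and one novel bigram (`b d`), giving a bigram novelty of 0.5.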
Syntactic novelty: Many of our syntactic analyses operate at the level of sentences. Thus, we first sentence-tokenized our text using the NLTK sentence tokenizer 5 and then parsed them using a constituency parser (Kitaev and Klein, 2018) and dependency parser (Zhang et al., 2020b). These parsers perform their own tokenization, so we wanted to provide them with untokenized text. For WebText baselines and GPT-2 text, this was straightforwardly accomplished by not performing word-level tokenization before passing text to the sentence tokenizer and the parser. For Wikitext baselines and text generated by Wikitext-based models, we first detokenized the text using the Moses detokenizer (Koehn et al., 2007) and then passed the detokenized text to the parsers.
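The Wikitext detokenization step can be approximated as below; this is a rough stand-in for the Moses detokenizer, handling only the Wikitext `@-@`/`@.@`/`@,@` escapes and simple punctuation attachment, and is not the tool the authors used:

```python
import re

def detokenize_wikitext(tokens):
    """Rejoin Wikitext-style tokens into plain text: resolve the
    @-@ / @.@ / @,@ escapes and attach punctuation to the preceding word."""
    text = " ".join(tokens)
    text = text.replace(" @-@ ", "-").replace(" @.@ ", ".").replace(" @,@ ", ",")
    text = re.sub(r" ([.,;:!?])", r"\1", text)  # attach sentence punctuation
    return text
```

For example, the Wikitext token sequence `the 4 @.@ 2 @-@ mile course .` becomes `the 4.2-mile course.`, which can then be passed untokenized to the sentence tokenizer and parsers.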
Analyses: For identifying novel unigrams for our analyses in Section 7, we used the GPT-2-generated text as it was tokenized for the n-gram novelty evaluations, and we then used the n-gram novelty annotations to determine which unigrams were novel.

Footnote 5: nltk.org

C Novel bigrams

Figure 7 shows examples of novel bigrams generated by each of our models.

Figure 8 gives examples of generated text, as well as text from our models' test sets, that we classify as supercopying: duplicating a passage that is 100 words long or longer from the training set. Figure 9 gives statistics about how many times supercopied 100-grams appeared in the relevant model's training set.

F Plots showing the effects of the decoding scheme

Figure 11 shows how the pointwise duplication score is affected by varying three decoding parameters.

E Untruncated pointwise duplication scores
G Evaluating perplexity

G.1 Evaluating overlap between Wikitext-103 and WebText
To measure the perplexity of generated text, we used GPT-2 (which was trained on WebText) to measure perplexity for models trained on Wikitext-103, and we used TXL (which was trained on Wikitext-103) to measure perplexity for models trained on WebText. We justified this choice based on the fact that Wikitext-103 was constructed entirely from Wikipedia articles, while Wikipedia articles were excluded from WebText, meaning that there should be no overlap between the training sets of Wikitext-trained models and WebText-trained models. However, there is a caveat to this assumption: the WebText creation process excluded Wikipedia articles, but text from Wikipedia articles could still occur in WebText, both because many non-Wikipedia websites copy data from Wikipedia and because Wikipedia writers could potentially generate Wikipedia content by copying it from other public-domain websites.

Here we test our no-overlap assumption more rigorously. To do so, we randomly selected one thousand 20-grams from the WebText training set and checked whether they appeared in the Wikitext-103 training set. Similarly, we selected one thousand 20-grams from the Wikitext-103 training set and checked whether they appeared in the WebText training set (tokenized with the Moses tokenizer (Koehn et al., 2007), which is the basis of Wikitext-103's tokenization). To control for tokenization differences that might persist despite the use of the Moses tokenizer, we lowercased all text and deleted any words containing any characters besides the 26 Roman letters.
We found that 0 of the 1000 WebText 20-grams appeared in the Wikitext-103 training set, so it seems very safe to use TXL to evaluate the perplexity of text generated by models trained on WebText. In the other direction, 12 of the 1000 Wikitext 20-grams appeared in the WebText training set. This shows that a small amount of Wikipedia text did end up in WebText. Nonetheless, the proportion of overlap is very small (0.012), so we conclude that it is still fair to use GPT-2 to evaluate the perplexity of text generated by models trained on Wikitext-103.
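The overlap check can be sketched as follows (function names are ours, shown here with tiny toy corpora and a small n rather than the paper's 20-grams):

```python
import random
import re

def normalize(text):
    """Lowercase and keep only words made entirely of the 26 Roman letters,
    as in the paper's tokenization-robustness control."""
    return [w for w in text.lower().split() if re.fullmatch(r"[a-z]+", w)]

def sample_ngram_overlap(source_tokens, target_tokens, n=20, k=1000, seed=0):
    """Sample k random n-grams from source and report the fraction
    that also occur in target."""
    rng = random.Random(seed)
    target_set = set(tuple(target_tokens[i:i + n])
                     for i in range(len(target_tokens) - n + 1))
    hits = 0
    for _ in range(k):
        i = rng.randrange(len(source_tokens) - n + 1)
        if tuple(source_tokens[i:i + n]) in target_set:
            hits += 1
    return hits / k
```

With identical corpora the reported overlap is 1.0, and with disjoint corpora it is 0.0; the paper's measured values (0/1000 and 12/1000) fall between these extremes.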

G.2 Details of perplexity evaluation
To evaluate the perplexity of a piece of text using Transformer-XL or GPT-2, we adapted code from Hugging Face at https://huggingface.co/transformers/perplexity.html.
For each model, we used a stride of 512 tokens and a maximum length of 1024 tokens. That is, perplexity was evaluated in segments of 1024 tokens each, with each segment preceded by a further context of 512 tokens whose perplexity was not evaluated as part of the segment being evaluated. This approach ensures that every token has at least 512 tokens of prior context available; tokens at the end of a 1024-token segment have an even longer context (specifically, a context of 1535 tokens for the last token in each segment).
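The segment/context arithmetic described above can be sketched as follows. This is our own reading of the description: it computes only the window boundaries (not the perplexities), with each 1024-token segment scored using up to 512 preceding tokens as unscored context:

```python
def eval_windows(num_tokens, seg_len=1024, context=512):
    """Return (context_start, score_start, end) triples: tokens in
    [score_start, end) are scored, with [context_start, score_start)
    supplied as unscored context."""
    windows = []
    start = 0
    while start < num_tokens:
        end = min(start + seg_len, num_tokens)
        windows.append((max(0, start - context), start, end))
        start = end
    return windows
```

For a 2048-token text this yields windows `(0, 0, 1024)` and `(512, 1024, 2048)`; in the second window, the last token (position 2047) sees 1535 tokens of context, matching the figure quoted above.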

H How does model size affect novelty?
It seems possible for model size to affect novelty in either direction: larger models have a greater capacity to memorize and might therefore be less novel. On the other hand, larger models are generally stronger (Kaplan et al., 2020), which might extend to their ability to be novel. Figure 12 shows the level of duplication observed for the 4 different sizes of GPT-2 (all using top-40 sampling). There is not a clear, consistent effect of size. Across the various n-gram sizes, the most novel model is GPT-2 XL; however, GPT-2 Medium is more novel than GPT-2 Large, so it is not the case that larger models are always more novel than smaller models (or vice versa).

Generation method: Wikitext test set
Example of supercopying: <eos> <eos> = = Themes = = <eos> <eos> The Hustler is fundamentally a story of what it means to be a human being , couched within the context of winning and losing . Describing the film , Robert Rossen said : " My protagonist , Fast Eddie , wants to become a great pool player , but the film is really about the obstacles he encounters in attempting to fulfill himself as a human being . He attains self @-@ awareness only after a terrible personal tragedy which he has caused -and then he wins his pool game . " Roger Ebert concurs with this assessment

Generation method: LSTM, Transformer, Transformer-XL
Example of supercopying: . <eos> <eos> = = Background = = <eos> <eos> The Boat Race is a side @-@ by @-@ side rowing competition between the University of Oxford ( sometimes referred to as the " Dark Blues " ) and the University of Cambridge ( sometimes referred to as the " Light Blues " ) . The race was first held in 1829 , and since 1845 has taken place on the 4 @.@ 2 @-@ mile ( 6 @.@ 8 km ) Championship Course on the River Thames in southwest London . The rivalry is a major point of honour

Figure 9: Counts of how often 100-grams supercopied by each model appear in that model's training set, compared to counts of random 100-grams from the training sets. For legibility, some GPT-2 outliers have been removed. The biggest outlier was a supercopied passage that occurred 176,424 times in GPT-2's training set.

I How does prompt length affect novelty?
Prompt length could reasonably be expected to increase or decrease novelty. On one hand, shorter prompts might not give the model much context to build from, which could lead the model to fall back on what it has seen during training, making it less novel. On the other hand, we observe from the baselines in Figure 1 that there is a reasonably high overlap between models' training sets and their test sets. Since our prompts are drawn from the test set, a long prompt might contain long portions that also appear in the training set, which could encourage the model to further copy from that part of the training set, in which case longer prompts would lead to lower novelty than shorter prompts. To assess how novelty is affected by prompt length, we consider only duplication from the training set, not from the context, because a longer prompt trivially provides more opportunities for copying from the context. In general, the length of the prompt does not appear to affect novelty much (Figure 13). For the LSTM and the baseline of text drawn from the Wikitext-103 test set, we do not discern any effect of prompt length. For the Transformer and GPT-2, longer prompts lead to slightly more novelty than shorter prompts, while Transformer-XL shows the opposite effect, with shorter prompts leading to slightly more novelty than longer prompts. The WebText baseline shows some differences between prompt lengths, but we do not see a clear generalization there, as novelty is not affected consistently by length: for example, length 0 is more novel than length 18 but less novel than length 141.

Figure 14 illustrates how novelty is related to position in the output text. For these analyses, we only consider duplication from the training set, not from the context, because positions later in the generation have more context to copy from.
We use pointwise duplication scores truncated at 10 to control for the fact that later positions can have higher untruncated pointwise scores than earlier positions can have; by using truncated scores, all positions that we consider have the same possible range of values (1 to 10 inclusive). We group generated text into bins of 100 words (positions 0 to 99, positions 100 to 199, etc.) and then compute the mean truncated pointwise duplication score for each bin, discarding the first bin because its first 10 positions have a different range of possible scores than the rest of the positions in the generation.
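The binning procedure can be sketched as below (a sketch of ours, assuming the per-position pointwise duplication scores have already been computed; the function name is invented):

```python
def binned_duplication(scores, bin_size=100, cap=10):
    """Mean truncated pointwise duplication score per bin of 100 positions,
    discarding the first bin (whose early positions have a different
    range of possible scores) and any trailing partial bin."""
    means = []
    for start in range(bin_size, len(scores) - bin_size + 1, bin_size):
        binned = [min(s, cap) for s in scores[start:start + bin_size]]
        means.append(sum(binned) / len(binned))
    return means
```

For example, 300 positions with scores 1, 2, and 12 in each successive 100-position block yield bin means of 2.0 and 10.0 (the first block is discarded, and 12 is truncated to the cap of 10).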

J Position in generated text
There is little effect of position in the baselines, the LSTM, and the Transformer, but in GPT-2 and Transformer-XL, there is greater duplication at later positions in the generated text. Though the effect is consistent for these two models, the effect size is small, with the pointwise duplication score increasing by only about 0.2 from the start of the generation to the end.

Figure 11: How novelty is affected by 3 decoding parameters: k in top-k sampling, p in top-p sampling, and the temperature. A higher value on the y axis means that the generated text is less novel. As each parameter is increased, novelty also increases (that is, duplication, on the y-axis, decreases).

K Duplication from the training set vs. the context
In the main text, we only report results that collapse together duplication from the training set and duplication from the context (the prompt and the previously-generated text). In Figure 15, we separate these two sources of duplication. The plots showing duplication from the training set alone are almost identical to the plots showing both the training set and the context, showing that models almost never copied content from the context that was not also in the training set. Models showed much less duplication from the context than from the training set, which is unsurprising because the training sets are much larger than the contexts, meaning that there are far more pieces of text that would count as duplicated from the training set than duplicated from the context.

L Vetting syntax
We considered evaluating the novelty of 7 aspects of syntax. For each of these 7 aspects, we conducted manual analyses to determine whether the numerical results gained from our parses were roughly accurate. Below we describe these manual analyses.

POS tags:
We considered a novel part-of-speech (POS) tag to be an instance where a word was generated with a POS tag that it had never appeared with in training (but where the word had appeared in training with a different POS tag). From an initial inspection of the generated words that the parser identified as having a novel part of speech, we found that most of them were correctly labeled, but that the training set actually did contain an instance of the word with that part of speech (just with that training instance mislabeled). Thus, we concluded that we could not trust the quantitative results for this factor, because most generated words identified as having a novel POS tag had actually appeared as that POS in training.
CFG rules: The CFG rules are the contextfree rules present in the parses from the constituency parser (ignoring rules that include a terminal symbol-i.e., only considering ones composed entirely of nonterminals). From an initial inspection of the CFG rules identified as novel, every example that we looked at was the result of a parser error (where the parser assigned an overly flat rule, e.g. NP → -LRB-CD -RRB-, NP CD CD CD , NP , NP). Thus, we concluded that we would be unable to trust numerical results covering CFG rules.
POS sequence: The POS sequence of a sentence is the sequence of POS tags assigned to its words by the constituency parser. For these, we looked at 100 generated sentences identified as having a novel POS sequence and, for each, manually checked whether the POS sequence assigned by the parser was accurate. These were generally accurate (Figure 16), and the POS sequences generally had high novelty scores, so we conclude that the quantitative results for POS sequences are approximately the correct order of magnitude.
Parse structure: We define a sentence's parse structure as its constituency parse minus the leaves (i.e., the terminal nodes in the tree). If a sentence has a novel POS sequence, it is guaranteed to also have a novel parse structure. Therefore, we did not conduct an additional inspection of the parse structures beyond the POS sequences, since the parse-structure numbers are close to the POS-sequence numbers.
Dependency arcs: For dependency arcs, we first checked 100 dependency arcs identified as novel for each model to confirm that they were not parser errors. We then looked at the subset of each of these 100-arc sets which were correctly labeled and for which there were no more than 100 training sentences that could possibly contain that arc (i.e., training sentences containing both relevant words). We then checked those training sentences to confirm that the dependency arc in question did not appear in that sentence. These results were generally strong (Figure 17), so we conclude that the numerical results for dependency arcs are approximately correct.
Dependency roles: Similarly to the dependency arcs, we checked 100 dependency roles per model that were identified as novel to check if the roles were correctly identified. We then looked at the subset of those 100-role sets which were correctly labeled and for which there were no more than 100 training sentences that could possibly contain that role (i.e., training sentences containing the relevant word). We then checked those training sentences to see if the dependency role truly was novel. These results were also successful enough for us to conclude that the numerical results were approximately correct and could be reported in the paper (Figure 18).
Dependency argument structure: Dependency argument structure is the list of argument types that a verb has (e.g., subject, direct object, indirect object). We did not manually analyze this due to how labor-intensive it would be (requiring analysis of a large number of entire training sentences). Thus, we do not provide numerical results for this, since we do not have an estimate of how reliable such numbers would be.

M Example of a mismatch between local and global novelty
In both our n-gram analyses and syntactic analyses, we found that LMs were less novel than the baseline for local structure (e.g., small n-grams or individual dependency arcs) but were more novel than the baseline for larger-scale structure (e.g., large n-grams or overall sentence structure). As an example of how such a local/global mismatch is possible, suppose that the training set contained only (19)
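As a toy illustration (our own constructed example) of how every local piece of a sentence can be familiar while the larger structure is new:

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

training = "cats chase dogs . dogs chase cats .".split()
generated = "cats chase cats .".split()

# Every bigram of the generated sentence occurs somewhere in training...
novel_bigrams = ngrams(generated, 2) - ngrams(training, 2)

# ...but the trigram "cats chase cats" never does: the sentence is
# globally novel despite being locally unoriginal.
novel_trigrams = ngrams(generated, 3) - ngrams(training, 3)
```

Here `novel_bigrams` is empty while `novel_trigrams` contains `("cats", "chase", "cats")`, mirroring the local/global mismatch described above.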

N Examples of syntactic novelty
Although we cannot report reliable numerical results for several of the aspects of syntax that we considered in Appendix L (due to the infeasibility of manually checking all examples and the high error rates of automatic methods), it may still be worthwhile to see if there are any individual examples that we can identify to provide an existence proof that models do, at least sometimes, perform the types of syntactic generalization in question.
To that end, we identified several types of syntactic generalization that have received attention in prior literature in language acquisition and/or analysis of LMs, and manually looked for examples that we could verify were novel. For example, given a dependency role (e.g., "watch as the head of an nsubj relation"), we identified all training sentences that could possibly contain that role.

Novel part-of-speech tags: Here we looked for generated words which are used as a noun in the generation and which appear in the training set, but never as a noun (for this purpose, we consider all noun POS tags to be identical), and similarly for verbs. We find a few examples; note that some of these might be instances of overgeneralization (that is, English speakers might judge some of them to be ungrammatical):

The exploratory parts of the game feature a series of arcane codes and glyphs , treasure maps and chests , and secret rooms .

Novel determiner: In the language acquisition literature, one question that has received focus concerns determiner-noun pairs (Pine and Lieven, 1997; Yang, 2013): do children have a productive rule that allows them to combine any determiner with any noun, or have they only memorized particular determiner-noun pairs (e.g., the dog, a cat)? We investigate this general topic by turning to the det dependency arc to see if there are any nouns appearing with a determiner they have not appeared with during training. We see examples from all models, showing that models have not simply memorized particular determiner-noun pairs.
(23) the → a
a. (i) LSTM: The upper part of the tooth, from the upper part of the tooth, is covered with a coarse keratin, usually to the base of the teeth.
(ii) Training example: The scales of birds are composed of the same keratin as beaks , claws , and spurs
b. (i) Transformer: called The unk featured a "very clever" female tilefish who played a role that she had previously played in the pilot episode.
(ii) Training example: the tilefish will continue to expand its burrow in the sediment throughout its life .
c. (i) Transformer-XL: A medium-sized zebrafish, this small tadpole has two openings in its snout: a small opening on the top of its head and a large opening at the bottom of its head.
(ii) Training example: The main species used is the zebrafish
d. (i) GPT-2: A traditional Tread is a "V"-shaped surface that has a solid groundplate in both the outboard sidewalls.

Novel argument structures: Finally, we look at cases where verbs have an argument structure they never have in the training set, where the argument structure defines the set of arguments that the verb has (e.g., laughed in Alex laughed has the argument structure active subject; explained in The rules were explained has the argument structure passive subject; and saw in The doctor saw the lawyer has the argument structure subject, direct object). We first checked whether a verb that had only ever appeared as active was used as passive or vice versa, and found a small number of these. We also found one example of a shift from transitive to intransitive, where TXL used the verb suffuses as an intransitive verb when it had always appeared as transitive in the training set.
(27) Active to passive
a. (i) LSTM: By the afternoon of September 9, the large cyclone system had been re-strengthened, with an increase in the surface circulation of the storm.
(ii) Training example: The storm quickly re-strengthened early on September 20 , but transitioned into an extratropical cyclone on September 21 .
b. (i) LSTM: However, the Enterprise crew in sickbay was allowed the craft to be re-docked.

Figure 19: Categorization of the novel words generated by GPT-2

Transitive to intransitive:
(i) Transformer-XL: but praised the "rich and evocative language" that suffuses, as well as the plot as "vivid and convincing."
(ii) Training example: Genuine poetry suffuses them , and they are scored with brilliance and resource .
We observed no confirmed instances of a verb being generated with a transitive usage when it had only been intransitive in training. We also observed no instances of novel dative alternations: using an indirect object (e.g., I gave them a book) with a verb that, in training, always used a prepositional object (e.g., I gave a book to them), or vice versa. Note that the baselines also contained no confirmed novel transitivity or dative alternations, meaning that the lack of such examples in the models should not be interpreted as evidence that they are incapable of making such generalizations, since these types of novelty are rare even in human-generated text.

O Morphology categorization
In this section, we give an overview of all the novel unigrams generated by GPT-2. Specifically, we analyze our samples of text generated by the largest size of GPT-2 using top-40 sampling, pooling together the samples from all 4 of our prompt lengths. We then tokenized the generated text using the Moses tokenizer (Koehn et al., 2007) and analyzed all of the resulting tokens that never appeared in GPT-2's training set. We also performed the same procedure for our baseline WebText text. All of the analyses described here were performed manually by one of the authors who is a native English speaker with training in linguistics; we used manual annotation because automatic methods are unlikely to be trustworthy for the rare, long-tail phenomena that give rise to novel unigrams in the generated text.
We define a word as a space-delimited token (after the text has been tokenized with the Moses tokenizer). This means that a word counts as novel as long as this exact form of the word has never appeared during training, but in some cases it may not deviate that much from words in training; for example, it is possible that the word appears during training but with a different capitalization, or with different punctuation (e.g., as two words in a row instead of being hyphenated). We consider all of these to be instances of novelty because, from the model's perspective, all of them are encoded differently.
In case some readers would prefer a stricter definition of novelty, we searched GPT-2's training data for all of the morphologically novel words discussed in the main text to see whether any form of each word appeared in the training set. Specifically, we lowercased the morphologically novel word and removed all characters other than the 26 Roman letters (i.e., we removed all spaces, punctuation, and non-Roman characters), and then searched for the resulting string in the training set after formatting the training set in the same way. This method was chosen to be extremely thorough: in addition to returning all reasonable ways of formatting a word that we could think of (e.g., capitalized or uncapitalized, with or without punctuation), it also returns many false positives (e.g., a word split across multiple unrelated words, such as Welsh appearing inside vowel shift), so we manually inspected all portions of the training set that were identified as candidate matches for the morphologically novel words. Figure 20 shows the results, using the following categorization:
• None: We verified that the word never occurs in the training set in any form.
• Unknown: Checking whether the word occurs was infeasible because there were too many candidate matches to check.

Returning to the full set of novel unigrams (not just those discussed in the main paper), we divided all of the novel unigrams into 8 broad categories. Figure 19 gives an overview of how common each category is in the GPT-2-generated text and in the baseline; in the rest of this section, we describe these categories and give more details about the types of words that make them up.
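The training-set search described above (lowercasing, stripping everything but the 26 Roman letters, then substring matching) can be sketched as follows. This is an illustrative sketch; in practice the search ran over the full training corpus, and every hit was manually inspected.

```python
import re

def normalize(s):
    """Lowercase and strip all characters other than a-z,
    as in the thorough training-set search."""
    return re.sub(r"[^a-z]", "", s.lower())

def candidate_match(novel_word, training_text):
    """True if the normalized novel word occurs anywhere in the
    normalized training text. Hits are only candidates: the check
    deliberately over-matches (e.g., a word spanning two unrelated
    words), so each hit still requires manual inspection."""
    return normalize(novel_word) in normalize(training_text)

# "vowel shift" contains "welsh" after normalization -- a false positive.
print(candidate_match("Welsh", "a vowel shift occurred"))  # -> True
```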

O.1 Unanalyzed
This category is composed of examples for which we could not discern any internal structure. Of the 838 such examples, most (658) were strings of random characters:

(30) a. src = " jE4B9BpL9KWv "
b. fvw vnvh qvwq wvqw , kqw .
This category also contains 180 examples, like the following, which are pronounceable words but for which we could not find any apparent meaning or internal structure.
(31) a. Narcow . Mike Tyson says he 's " on track...
b. Aumulule may have been constructed of wool or coarse flax .

O.2 Typo
The generated text contains 754 typographical errors. Most of these errors (706 of them) are caused by improper spacing and/or punctuation, sometimes as an error of the language model (32a) and other times as an error of the tokenizer that we used to post-process the text (32b).
(32) a. I 'm going to have to use a different material besides plastic.Here 's the front and back
b. This material may not be published , broadcast , rewritten , or redistributed.
The remaining 48 typos are ones that involve misspelled words; e.g., (33a) has efforteless instead of effortless, and (33b) has oceansic instead of oceanic.
(33) a. I use these words to give these thoughts a form in such an efforteless and natural way that I may never really know where the person is coming from
b. Because of the potential influence of atmospheric and oceansic carbon dioxide on the global carbon cycle

The generated text contains far fewer typos than the baseline human-generated text does. We conjecture that this difference arises because humans generate text character by character, which might create more opportunities for typos than the subword-based generation process used by GPT-2.

O.3 Tech
The most common category of novel words generated by GPT-2 is technology-related terms, including both URLs (or parts of URLs) and words used in computer code (e.g., variable names) (34). The baseline text also contains a fair number of novel words in this category, but they are only about half as frequent as in the generated text.
(34) a. may now return the string " < template name = ' views.template.name ' > " .
b. $ success = $ auth -> successMessageText ( 'Logged In ') ;
c. row-reverse ( columns ) { -widths 2columns 2 ;

We do not make any systematic analysis of how such words are structured, but constructing them likely does require several types of string manipulation, including:
• "Affixation" such as adding www or com in URLs, or adding the double dash used to introduce an argument in computer code
• Concatenation of multiple words, sometimes with a delimiting token such as a period, dash, or underscore
• Manipulation of case, particularly converting words entirely to lowercase for URLs, or the use of camelcase within code.

O.4 Number
GPT-2 generates several types of novel words that we classify as numbers. First are real numbers (672 of them; 35); these are generally well-formed, but note the one example that improperly starts with a 0 (35c).

(35) c. For the census year 2011 , there were 3,878,000 male-female households ( 0,936,000 female households in same-sex couples ) .
There are also 49 novel dates (36), all of which form possible dates (i.e., there are no impossible dates such as November 31 or July 33). Some of these dates use the form month-date-year (36d-36e), while others use the form date-month-year (36f-36g).
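The validity check above (ruling out impossible dates such as November 31) can be sketched with the standard library. This is only an illustration of the criterion; in our analysis the dates were checked by hand.

```python
from datetime import date

def is_possible_date(month, day, year):
    """Return True if the month/day/year triple forms a real
    calendar date; e.g., November 31 is impossible."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

print(is_possible_date(11, 30, 2017))  # -> True
print(is_possible_date(11, 31, 2017))  # -> False (November has 30 days)
```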

O.5 Acronym
In our generated text, there are 195 examples of acronyms. Of these, 75 appear along with the full version of what the acronym stands for; some examples are below.
(38) a. The Money Funders International Group ( MFIG )

7 Note that we are only analyzing novel unigrams; there are likely more novel dates that include spaces, along the lines of June 28, 2021.
8 We do not provide examples of generated phone numbers: even though the generated phone numbers are not in the training set, they may still belong to a person who would not want their phone number published in this paper.

O.6 Non-English
Even though GPT-2 is primarily an English model, sometimes it generates words that are clearly meant to be interpreted as words in another language. In many cases the intended language is specified: these languages include Arabic, Aramaic, Bulgarian, Czech, Dutch, Esperanto, French, German, Greek, Hebrew, Hungarian, Icelandic, Inuktitut, Japanese, Latin, Mandarin, Middle Dutch, Old English, Old Frisian, Old Norse, Proto-Germanic, Quenya, Russian, Spanish, Swedish, and Swiss German. We do not attempt any investigation of whether these words are actually valid words in the languages they are purported to be in, or whether they have the meanings that they have been attributed. Within this category we also include the 6 instances of GPT-2 providing pronunciation hints. These are generally somewhat similar to the true pronunciations of the words they are meant to correspond to, but are far from perfect:

(39) a. Voltaire ( pronounced VORE-ey )
b. Tom Paine ( pronounced TOHR-in )

O.7 Name
We see 1,499 novel names in the generated text. 7 are names of languages or dialects (40). 655 are online usernames. 39 are first names (the first word in a multi-word name) (41), 220 are last names (the last word in a multi-word name) (42), and 150 are sole names (the only word in a single-word name) (43). 11 are names of groups of people (44). 120 are names of places (45). 21 are names for which we were unsure of the type of name (e.g., name of a person or of a place?) (46). 206 are corporate names: names of products, companies, or organizations (47).

Beyond classifying the types of names that are present, we do not analyze the structure of these names. However, this would be potentially interesting to look into; it is non-trivial to construct a word that is pronounceable and "looks like" a name, such as generating sequences of syllables that look like plausible names of humans (41 and 42), or like the Greek- and Latin-based words used for scientific names (48). Further, some of these names require more structured string manipulations, such as lowercasing and concatenating words to form usernames, or using affixes within names such as al- (42l) and Mc- (42x) in names of people, or -town in place names (45b).

O.8 Morphology
Finally, we analyze the words formed through morphology: the linguistic derivation of novel words. Our categorization largely follows that of The Cambridge Grammar of the English Language (Huddleston and Pullum, 2001), chapters 18 (Palmer et al., 2001) and 19 (Bauer and Huddleston, 2001).

O.8.1 Inflectional morphology
Inflectional morphology is the inflection of a word: rather than creating a new word, it simply changes some grammatical feature of the word. Figure 21 gives an overview of the inflectional morphology found among the novel words in our text samples.
Nouns There are two types of inflected forms for English nouns: plurals and possessives. Both occur in the generated text.
The generated text includes the plural forms of common nouns (49a-49b), proper names (49c), and abbreviations (49d). Most of the plurals are formed using the -s form of the plural morpheme, but some are formed with -es, such as (49e). Though the English language features a few pluralization processes other than the -(e)s suffix (e.g., changing -um to -a, as in bacterium/bacteria), none of them are employed in the sample of generated text that we analyzed. There are also possessive forms of both proper nouns (50a-50b) and common nouns (50c-50d). The observed possessives involve both forms of the possessive morpheme: 's as in (50a-50c), and ' as in (50d).
(50) Possessives
a. According to a report by UK accounting firm Deloitte , Fregoli 's pizza delivery service achieved $ 250m of revenues in 2014
b. or from their own personal maps as an alternative to MapMyRide 's online maps .
c. The shillelagh 's name comes from the word " shillelagh , " a military term meaning
d. We hope that what we uncover in our census-takers ' reports will lead to a better understanding of how to best serve New York City

Adjectives English adjectives can be inflected for grade (also known as degree). This can take the form of a suffix (-er or -est) or a preceding word (more or most). The generated text contains no instances of adjectives inflected with the suffixes, but there are a few examples with the separate words more (51) and most (52).

(51) Comparatives
a. We 've become much more androcentric
b. and other news reports indicate that this particular incident represents a trend among the more socially-diverse , tolerant areas of Japan

(52) Superlatives
a. rabbits we adopted , which ate wheat products , are now two of the most gluten-intolerant cats we know .
b. the " F-14 class of fighters , the most stealth-capable combat aircraft ever manufactured

(54) Participial uses of -ed
a. In addition to the self-profiled and self-labeled population ,
b. the word in its more colloquialised form is most often associated with the 2000s

There are also examples using -ing:

(55) -ing
a. and GCHQ was spying on this too ( for example , by " cookying " certain searches on the internet ) .
b. I hear how you worry that you 're being judgmental or judgmentalizing
c. will revert back to the normal way of updating your OS , which includes completely re-restoring all applications and settings to their default settings

There are no novel verbs with the last remaining major type of inflectional verb morphology, namely the suffix -s used to form 3rd-person present-tense verbs.

O.8.2 Lexical word-formation
Under this heading, we include all processes that create novel words (as opposed to novel inflections of existing words).
Compounds The text generated by GPT-2 contains many types of compound words, summarized in Figure 22. First are dephrasal compounds, created by converting an entire phrase into a single word by conjoining its words with hyphens. Many of these are of the form noun-preposition-noun (56a-56b), but there are many more with a wide range of other structures (56c-56f).
(56) Dephrasal compounds
a. It looks as if we are going to continue our spending-to-student ratios going up
b. it became , in comics-as-genre-wholesale parlance , a " reset " of sorts .
c. the governor signed a similar law -with a different version of the controversial " we-told-you " law -into law less than a month ago .
e. So I feel no obligation to go along with Jimmy in finding the good-guy-but-evil-man , unless he 's a good guy .
f. that happened Friday was that President Obama and President Xi stopped playing the " it 's-a-civilized-country-and-we-should-just-accept-each-other " game and started working together .
Another common type of compound is coordinative compounds, in which all elements of the compound have equal status, and the meaning of the whole compound is essentially the meaning arrived at by joining all of its elements with the word and. A few of these compounds are ones that we categorize as lists because the order of the elements matters (57a), but most are of the more general type that involves joining the elements with no obvious reason for the ordering (57b-57k).
(57) Coordinative compounds
a. So we go left-right-right and left-right-left .
b. According to the CIA , the Soviet-Azerbaijani relationship had become increasingly " intense " in recent years .
c. What really happened on Tatooine ? The Hutt-Kling-Mandalorian trade dispute ?
d. Fox 's Titans-Cowboys game at 12 : 00 p.m. on September 9 averaged 8.5 million viewers
e. the moons of Jupiter provided the foundation for what came to be called the " Newton-Laplace-Einstein model of the solar system .
f. It has its roots in the " anarchism-capitalism " debate
g. the Northern Norway-Iceland border
h. proton beam producing roughly the same number of gluon plasma as is observed for the proton-photon collisions ;
i. Pa had a little iron grill-barbecue outside
j. And it was very easy for him to be a painter-scientist , because he used nature and art .
k. the most successful form of handstand execution known to man : the Krzyzewski-Frohwirth-Kacmar-Cunningham Combination

Most of the rest of the compounds we categorize based on the part of speech of the compound that is created. First are compound nouns (which can, in some circumstances, be used as adjectives), which can be created in a variety of ways:

(58) Compound nouns: Noun + noun
a. He tries his dad-voice again
b. Healthy Sesame Almond-Fluff Pasta
c. in order to save trillions of precious lifemonths and lives .
d. Not all carriers offer their own bagshare program
e. The best times to use cooking oil on your grill / grilltop include
f. This plant is very similar to Flowerpotweed .
g. using calipers and then used while the lens was still fresh to obtain samples for x-ray-analysis ,
h. Some bears have more of a densize capacity than others .
i. The Pongo pygmaeus vocal repertoire is highly complex and reflects the complexity of the callsong behaviour of this species ,
j. Upon hitting a ball-cage , the player becomes unable to move and must jump
k. He refers to them as " hill-elves " , or Eistlaes , in Quenya ;

(59) Compound nouns: Adjective + noun
a. So what happens to a raw-fruit diet with no dairy ?
b. The paper concludes by proposing possible further extensions toward an actual " computational-self theory "
c. and the front and the back shall be of bluework with a gold frame for the robe .
(60) Compound nouns: Adverb + noun: a press release on the topic from then-CMA president Tom McGinnis .
(61) Compound nouns: Verb + noun: only the latter can actually be seen with a sputterball

(62) Compound nouns: Noun + postpositive modifier
a. at least 1 in 3 employed adults reported having a job-to-be-held .
b. Tags : Apprentice-in-training , Magic , Apprentice-in-training , Magic Academy , mentor-in-training

(63) Compound nouns: Noun + number
a. At the beginning of the Siege of Eayn , Eylan-3 was attacked by the Republic .
b. A test flight at the end of 2017 will conduct Thaad-AMM-3

(64) Compound nouns: Noun + deverbal noun
a. SoundFont is available on apt for installation , by replacing the font-encoder .
b. Poster-maker David Hill has now stepped forward
c. but in recent years the practice has taken on a more popular reputation as the fuel-extraction method of choice for many small businesses hoping to make a profit

Next are compound adjectives, which also can be constructed in a variety of ways:

(65) Compound adjectives: Noun + adjective
a. Today , I have another record to share with my wing-weary students and colleagues :
b. Last year , Rangers hitters batted .243 against a league-leading 1,200 or more innings of relief-friendly MLB pitching .
c. Mexico seemed to be having a bit more trouble with its backpass-heavy tactics than in previous World Cups .
d. and Dan Dennett as they talk about their work together and the world of " science-neutral " philosophy .
(66) Compound adjectives: Adverb + adjective
a. Both wheelsets feature a slightly-slender offset .
b. a non-distorted copy may be displayed in at least a partially-circular area

(67) Compound adjectives: Number + Noun
a. On-highway travel is permitted between campus and off-highway travel only within a two-quarter-mile radius of any of the following :
b. and the presence of militants in many areas along its 2,500-kilometre (

(72) Compound adjectives: Noun + gerund/participle
a. Extremely powerful ultrasonic cleaner that cleans debris and debris-containing debris out of every inch of surface it hits
b. teachers are given incentives to target a " school-chasing " strategy .
c. characters themselves , but of who those characters mean to a different segment of the superhero-watching public .
d. One of the principal problems with carbohydrate-eating is how it stimulates blood sugar and fat stores ,
e. island of Oahu , Hawaii , where they began a life of mushroom picking and mushroom-making , later known as the " Mycological Arts " .
f. Kenow hopes to set up a loon-viewing deck in a few years to better identify the small , mostly nocturnal birds .
(73) Compound adjectives: Noun + past participle
a. browser features also come as Microsoft has added Internet Explorer 11 to the list of IE10-based browsers .
b. and we 'll be navigating to the folder where your project and composer-managed libraries are in .
c. NASA had to move closer to developing the systems necessary : a propellant-fed rocket , an ascent engine , an emergency escape system and a crew evacuation system
d. Flag-shaped flag sold in Alameda County
e. Soy is often used in place of dairy-containing dairy products in dairy-reduced diets and is being studied as a possible supplement for osteoporosis prevention .
(74) Compound adjectives: Verb + preposition
a. If you aren 't using nail-on nails and tape , you also can glue them to the wood .
Finally, there is exactly one example of a compound verb:

(76) Compound verb: Preposition + verb
One side of the farm has been taken off-grazing over the winter

The last major class of compounds is neoclassical compounds, which are formed by adding one or more Greek- or Latin-based affixes to a stem. Occasionally these involve stacking several affixes; e.g., (77e) uses the prefixes epi-, neo-, and ptery-.
(77) Neoclassical compounds
a. These reservoirs supply a special type of rock called granophyllite that was originally made up of the granites from which the Svalbard region was originally
b. Hydrochlorothiazide contains hydroxychlorothiazoxide which , together with hydroxychlorothiazide and hydroxypyridazine hydrochloride , form a yellow crystalline powder .
c. Thermolithotrophic bacteria are Gram-negative bacteria that produce a wide range of enzymes and are more
d. the author of the books " Ethical and Religio-Economic Consequences of American Wars Since 1898 : A Primer for Political Policy Makers "
e. Pelagic epineopterygoid

Derivational affixes Derivational affixes are prefixes or suffixes that are added to a word to create a new word. The derivational affixes that we observe are summarized in Figures 23 through 26.
The first two categories of such affixes are diminutives such as -ling, -lite, and mini- (78), which make the meaning smaller along some dimension, and augmentatives such as super- and ultra- (79), which make the meaning larger. (Note that the diminishing/augmenting effects of these affixes are not very apparent in these examples.)
(78) Diminutives
a. THE REAL FISHLING POND
b. This social democracy will be a " statelite " , in the sense that the king will not be in a situation where
c. the three Mini-Compounders eventually defeated the pair

(79) Augmentatives
a. The Professor went on to design new super-fabricated paper products
b. It was revealed in Ultra-Sized G ! that there was originally a card named " Berserker " in the set

Several derivational affixes in the generated text indicate location in space or time, specifically cross-, post-, pre-, proto-, sub-, and trans-:

(80) Location in time and space
a. A large drawbridge ( with a cross-reinforcement mechanism to support it )
b. Bach 's music can be classified , along with Beethoven , as " post-Johann Sebastian styles "
c. However , if the GC CPUs are slower than they need to be due to emulation-based issues , there is often the option of going back to a pre-Emulation model
d. The " proto-poetry " of modern times is the " hyperbole " spoken by Shakespeare
e. Other companions , enemies and sub-Bosses
f. the Trans-Dniestria railway

Observed morphemes that negate or reverse the meaning of the stem include anti-, de-, dis-, in-, no-, non-, and un-:

(81) Negatives and reversatives
a. the most prominent member of the anti-St-Pierre camp
b. He calls this phenomenon the " Great Deconcentration
c. Chaos is seen today not as an all-encompassing disorder and disorderlessness but as a complex interplay of extremes
d. He eventually found one of the Inhuman Inveterans in an abandoned building .
e. adding an optional " no-knockout " version that also removes the knockout effect
f. they immediately pointed out my non-Arabic-sounding pronunciation of Arabic words .
g. 15 Proposed 8A NEWLINE NEWLINE 16 Unproposed 9A

In contrast to the negatives and reversatives, there is also one prefix that expresses positive sentiment (pro-) and one that expresses repetition (re-):

(82) Positive or repetitive
a. In this study , we investigated the anxiogenic-like and pro-immobility and anxiolytic-like effects
b. They go through the process of renitrification that gives them a new supply of nitrogen

Two observed morphemes, -an and -ite, attach to a place name or description to create a word meaning a resident of that place:

(83) Residency morphemes
a. The Riot ' at the Cleveland Institute of Art -a time when Clevelandians riot for a variety of reasons
b. In the meantime , the Aquallans were forced to contend with the Supermen .
c. made them dwell in the land of Nod ( Nodites )
d. Inspired by the successes of social movements such as the Paris Commune and the American labor movement in the 1890s , these small townites were inspired to take action to create a new economic alternative

We categorize most of the rest of the affixes by the part of speech of the word they create (and potentially also the part of speech of the stem, though this categorization is imprecise because some morphemes, such as -ism, can take multiple different parts of speech as a stem). We observe affixes that create a noun from a verb or an adjective (84); affixes that create a noun from another noun (85); affixes that create an adjective from a different part of speech (86); affixes that create a verb from a different part of speech such as a noun or adjective (87); and an affix that generates an adverb from an adjective (88).
(84) Verb or adjective to noun
a. -ation: Now , there 's also a difference between an IKEA-ification and a Vogueification .
b. -er/-or: This is what it means to be a dataaggregator
c. -ity: Behaviourality was evaluated during the first and first and second consecutive day in groups of 10
d. -ment: The most anticipated part of the redevelopment will be the " Prapliftment Zone " -the area where the residences are put up .
e. -ness: Chaos is seen today not as an all-encompassing disorder and disorderlessness but as a complex interplay of extremes

Character manipulation Some novel words involve manipulations at the level of individual characters (summarized in Figure 27). First is the use of letter repetition, either to elongate a word for emphasis (90a) or to indicate nervousness (90b). Second is the creation of onomatopoeias, in which each letter is meant to represent a sound in the real world (91).
(90) Letter repetition
a. Youuuuuuuuu ! !
b. " W-what are y-you m-meant-"

(91) Onomatopoeia
make a humming noise that sounds like " tchtch " or as low a " ka-a-la " and a " hwa-hwa " while searching for food

Portmanteau Finally, there are three generated words that could potentially be viewed as portmanteau words: (92a) is a blend of Disqus and etiquette, (92b) is a blend of pizza and apocalypse, and (92c) is a blend of gel and popsicle. However, it is possible that the model does not view these as blends but rather as compounds: it may have learned a morpheme -iquette that means "etiquette," a morpheme -pocalypse that means "apocalypse," and a morpheme -sicle that means "popsicle." Indeed, Pinter et al. (2020) found that LMs perform poorly at handling portmanteau words, which would support the hypothesis that, in the cases we have observed, GPT-2 is not handling these words as portmanteau words.
(92) Portmanteau words
a. Please make sure to read the Disqusiquette before leaving comments .
b. What 's in the future for the ' Pizza-Pocalypse ' ?
c. and took two gels ( Molly and a gelsicle ) .

P Additional examples for the analyses
Here we provide additional examples from the manual analyses discussed in Section 7.
Of the 195 novel acronyms in our generated text, 75 appear with the full version of what the acronym stands for. In 21 of the 75 cases, the acronym is a suitable abbreviation of the expanded form (97); in the remaining 54 cases, it is not. Often, the errors involve extra letters in the acronym (98), frequently repeats of letters that appear elsewhere in the acronym (98a through 98d), but not always (98e through 98g). Other types of errors include the omission of a letter (99a and 99b), letters out of order (99b and 99c), and the replacement of a letter with a different, incorrect letter (99d and 99e).
In a few cases, the generated acronym differs from its expansion to a more substantial degree (99f).
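As an illustration of the "suitable abbreviation" criterion above, one strict automatic version of the check can be sketched as follows. Our actual judgments were made manually and are more permissive (e.g., about which function words an acronym may skip); the `skip` set below is an assumption for illustration.

```python
def matches_initials(acronym, expansion,
                     skip=frozenset({"of", "the", "and", "for"})):
    """Strict notion of a 'suitable' acronym: the acronym equals the
    initial letters of the expansion's content words (case-insensitive).
    Function words in `skip` are ignored."""
    initials = "".join(w[0] for w in expansion.split()
                       if w.lower() not in skip)
    return acronym.upper() == initials.upper()

print(matches_initials("MFIG", "Money Funders International Group"))   # -> True
print(matches_initials("MFIIG", "Money Funders International Group"))  # -> False (extra letter)
```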

P.4 Examples of incorrect morphology
Here we review some common sources of morphological errors among novel words generated by GPT-2.
Incorrect stem changes: Some of the errors arise from GPT-2 making unwarranted changes to the stem. In (100a), a name that was consistently spelled as Shuutou earlier in the passage has been given an extra u in its possessive form. In (100b) and (100c), GPT-2 has generated two different words that are most likely intended to be the adjectival form of the word Pentagon (the headquarters of the US Department of Defense). Neither of these terms (Pentagorean and Pentagran) are plausible ways to turn Pentagon into an adjective; the most plausible correct form would be Pentagonian, which has in fact appeared in the training set. Thus, by using Pentagorean and Pentagran, GPT-2 is being inconsistent both with itself and with its training data.
(100) Uncalled-for changes to the stem: a. Shuuutou 's flight speed b. Pentagon and Cybersecurity ... Pentagorean nuclear arms c. Pentagran war on the way Inconsistent morphology: Beyond the Pentagon-based examples above, the suffix -an appears to be a common source of inconsistency for GPT-2. In (101a), the generated word for people from the town Hamilton is Hamiltonan, but later in the same generation it is formed as Hamiltonian. (101b) shows inconsistency in how to refer to residents of genosha, and (101c) and (101d) show 2 different ways from 2 different generations to refer to someone from Cleveland. None of these (except for genoshaans) count as ill-formed in Figure 4 because there is variability in how the -(i)an suffix can be applied, but the inconsistency is an issue. These examples display a different type of inconsistency from the inconsistency observed in prior work, namely inconsistency in how to apply morphology, as opposed to factual inconsistency (Welleck et al., 2019;Li et al., 2020).
(101) Inconsistent demonyms
a. as the percentage of Hamiltonans in the GTA increases , so does the number of people leaving the city ... almost a quarter of all Hamiltonians
b. the genoshans were the first of a number of species...the genoshaans managed to regain their homeworld
c. that Clevelandans are forced to endure
d. a time when Clevelandians riot

Missing sound changes: As mentioned above, the word genoshaans in (101b) is considered ill-formed. The reason is that it lacks the proper phonological change that occurs when the -an suffix attaches to a stem ending in -a, namely the deletion of one of the instances of a. Another example of a missing sound change is in (102), in which a should instead be an.
(102) the $ 6 , $ 8 or $ 10-a-ounce china cup cake

Plurals in compounds: A few ill-formed examples arise from using the plural form of a noun as the first part of a noun-noun compound; generally, the first noun in a noun-noun compound is the singular form of the noun, though there are exceptions in standard usage, so it is unclear if these should actually be viewed as errors.
(103) Plural in compound
a. " common-sense " guns-control policies
b. The...rivers had their headswaters in a larger basin
c. mushrooms-related products

Overregularization: Some of the errors can be classified as overregularization: applying a linguistic process in an overly broad way. In (104a), the word syllogist has been created, presumably by changing the -ism ending of syllogism into -ist. In many cases it is valid to swap -ism and -ist (e.g., tourism/tourist, optimism/optimist), but not in this case. In (104b), the suffix -th has been applied to the number 752, even though numbers ending with 2 should instead get a different suffix, -nd.

P.5 Syntactic errors
Most of the novel words that GPT-2 generates fit properly into their syntactic context, but it does make some mistakes, a few of which are below.
In (105a), bat-washer is used as a verb when its structure suggests it should be a noun (though this could potentially be valid given English's flexibility about parts of speech). In (105b), there is a noun-noun compound (cyber-missiles shortfall) with a plural noun as the first noun; typically, such compounds start with singular nouns, though there are some exceptions. In (105c), it should most likely either say look anti-Tunisian or look like anti-Tunisians. Finally, in (105d), load-samples is plural but is given a singular verb, provides.
(105) a. if I bat-washer it
b. its massive cyber-missiles shortfall
c. just to make our community look like anti-Tunisian
d. Slicex load-samples provides a single button

P.6 Agreement
Here we look at the novel plural nouns that GPT-2 generates to see whether the rest of the generated sentence observes the correct consequences of the word's plurality. First (despite the one mistake in 105d), GPT-2 generally does well at providing plural verbs (underlined) to agree with novel plural nouns, whether the verb appears after the noun (106) or before the noun in the context of a question (107). In (108), it correctly uses a plural verb for both verbs that agree with the novel plural subject: a verb within the relative clause, and a verb after it. The correct agreement with the verb after the relative clause is especially impressive because, in both sentences, there are 3 singular "distractors" (italicized) between the subject and the verb.
(106) a. We know that M-Sinks need a target
b. when Clevelandians riot
c. the Aquallans were forced to contend with the Supermen
d. Torpexes are small hardpoints
e. Another indicator of the poverty that Clevelandans are forced to endure :
f. The YR-2s were designated simply as YR . 2 and were designated simply as YR . 2
g. Hustlings work when those people can work the job market for the required number of hours
h. It was revealed that the genoshans were the first of a number of species that the Andromeda Initiative had already studied
i. For some reason my old 1 / 1-01s do not turn and drive with the shift lever down like my new ones do

(107) Why do SQLes have to change

(108) a. The Huamangas , who are descendants of indigenous people who lived on the Isthmus of Tehuantepec before it was covered by farmland , have been demanding that the federal government address the issue of climate change .
b. FOIA-requesters who think an agency has a good reason for withholding information are not always given a second opportunity to press their case .

P.7 Other plural-relevant syntax
Beyond agreement, syntactic consequences of plurality are observed in a few other places as well: in using the plural possessive form, which is just an apostrophe rather than the singular -'s (109); in having pronouns that are coreferential with the noun be plural as well (110); and in following determiners that require a plural noun (111).
(109) when Mr. Fowles asked him about it . The Fowleses ' lawyer , James F. Kelly ,
(110) a. I love Klymits , but it has been nearly impossible for us to find them in stores .
      b. The Sarrats were lucky to have her as part of their lives
      c. The color-coats are far more black & white than their predecessors
(111) a. as the Paris Commune and the American labor movement in the 1890s , these small townites were inspired to take action to create a new economic alternative
      b. to help you understand why there are so many Brazilianisms in the English language as opposed to the Portuguese one

P.8 Incrementing/ordering
Here we provide the examples mentioned in the main text where GPT-2 successfully increments. In (112a), it increments numbers from Firstly to Fourteenthly, with the last two (Thirteenthly and Fourteenthly) being novel. In (112b), it increments the letters at the ends of variable names in computer code, going from multiplyx to multiplyy to multiplyz. Finally, in (112c), the prompt ends with an alphabetical list of companies, and GPT-2 continues this list, largely (though not entirely) staying in alphabetical order, including many novel words along the way (all in bold).

P.9 Quotation marks
In GPT-2's text, as in human-written text, novel words are more likely to be enclosed in quotation marks than non-novel words. Some examples of novel words appearing in quotation marks are below:
(113) a. the wave function in the " meganiverse "
      b. a " co-workplace ", where people are working together
      c. " Active-Passive-Inactive " investing
      d. the " anarchism-capitalism " debate
      e. The " proto-poetry " of modern times
      f. the " un-competition " that is happening as a result of rapid technological advances

P.10 Novel words with meanings that are suitable for their context
Below are some examples of novel words that are used in ways that are particularly well-suited for their semantic contexts.
(114) a. And it was very easy for him to be a painter-scientist , because he used nature and art .
      b. The process is pretty simple : a cyclist is followed ( suspect or not-so-suspect ) from a safe distance
      c. They go through the process of renitrification that gives them a new supply of nitrogen
      d. These include the concept of ' co-causation ' , in which effects are thought to be caused by causes that act in parallel
      e. The other thing of course is that the companies will be allowed to sell and to sell this information to the government in real time . This is what it means to be a data-aggregator and it is an interesting way to think about all of this .
      f. Thirdly , a new unique feature , the " bond-breaking enchantment " , which renders any item cursed by the " Cursed item " bug inadmissible to the user and permanently breaks any binding .
      g. we are witnessing a major movement away from a capitalist workplace toward a " co-workplace " , where people are working together to solve problems and create goods and services on a much smaller scale .
P.11 Novel words with meanings that are not suitable for their context
Below are some examples of novel words where there is clear evidence that the word is not used in a semantically sensible way. In (115a), judgmentalizing is used in a way that suggests it should mean "being judgmental," but the word's structure should yield the meaning "making someone judgmental." In (115b), Brazilianism is used to refer to an English term, not a Brazilian one. In (115c), Bittrex is referred to as Bittrex-like, but it is not standard to refer to something as being "like" itself. (115d) is contradictory because disorderlessness should be the opposite of disorder.
(115e) is also contradictory because anti-catatonia effects are said to decrease motor activity, even though a decrease would be consistent with catatonia, not anti-catatonia. (115f) refers to the front floor, even though the floors of buildings are arranged vertically, so a building cannot have a front floor. In (115g), nitrification is referred to as a nitrate-deficient state, even though it most likely should be a nitrate-rich state. (115h) refers to a markdown-to-HTML converter having markdown output even though its output would actually be HTML. (115i) says that Internet Explorer 11 is built on Internet Explorer 10, even though most likely it would be viewed as a new browser, not a version of Internet Explorer 10. Finally, (115j) says that a no-knockout effect would enable people to be knocked out, which is contradictory.
(115) a. I hear how you worry that you 're being judgmental or judgmentalizing if you talk about how you 're leaving
      b. An old school English term is a Brazilianism .
      c. Blockstream 's shares were traded on Bittrex , a Bittrex-like cryptocurrency exchange .
      d. Chaos is seen today not as an all-encompassing disorder and disorderlessness but as a complex interplay of extremes ,
      e. Moreover , the main aim of this study was to investigate whether an anxiolytic effect of Vitex by increasing OAT % and OAE % is accompanied by anti-immobility and anti-catatonia effects by decreasing motor activity
      f. wandering around the front-floor lobby of the Hotel Del Coronado
      g. This nitrate-deficient state is called nitrification .
      h. so you can use its markdown-to-HTML convertor with the markdown output format you prefer
      i. Microsoft has added Internet Explorer 11 to the list of IE10-based browsers .
      j. The only thing I 've done with my mod since then ( well , maybe a little bit before ) is adding an optional " no-knockout " version that also removes the knockout effect , so you can actually be knocked out again if you take enough damage .

P.12 Numbers
The analyzed sample of text includes 75 instances of a number plus a unit, such as the following:
(116) a. Minimum Water Pressure : 2.5atm
      b. Tags : ch.5 , ch.5.1 , ch.5.2 , ch.6 , ch.6.1 , ch.6.2
      c. Available OS Memory : 8147MB RAM
Several of these involve math, which gives us an opportunity to see whether GPT-2 understands the numbers it is using. (For convenience, we also include some examples that are classified as number-noun compounds rather than numbers (118).) Generally, it appears that GPT-2 does not understand the numbers it generates; all the examples involving math are included below. (117a) includes the computation of a difference between two numbers, where it is claimed that 1065mhz − 1030mhz = 90.5MHz. Regardless of whether the MHz on the right-hand side is interpreted as the same unit as the mhz on the left-hand side, or whether the capitalization is viewed as meaningful, this computation is incorrect: the difference is 35, not 90.5. (117b) involves a physical impossibility: a 4-milliliter container cannot hold 10.4 milliliters of juice. Meanwhile, (118a) and (118b) give quantities that are not strictly impossible but are highly unlikely: according to ESPN, the fastest 40-yard dash in history was 4.22 seconds, making the 2.64-second time in (118a) implausibly fast; and the fastest three-cone drill in history was 6.28 seconds, making the 4.15-second time in (118b) also implausible.
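The arithmetic checks in the paragraph above can be reproduced directly. The following sketch simply restates each claim as a computation (the record times of 4.22 and 6.28 seconds are the ESPN figures cited above):

```python
# Verify the arithmetic claims about examples (117a-b) and (118a-b).

# (117a): GPT-2 asserts 1065mhz - 1030mhz = 90.5MHz.
actual_diff = 1065 - 1030
print(actual_diff)   # 35 -- the claimed difference of 90.5 is wrong

# (117b): a 4ml tank is said to hold 10.4ml of e-juice.
print(10.4 <= 4)     # False -- physically impossible

# (118a): a 2.64-second 40-yard dash would beat the 4.22-second record.
print(2.64 < 4.22)   # True -- implausibly fast

# (118b): a 4.15-second three-cone drill would beat the 6.28-second record.
print(4.15 < 6.28)   # True -- also implausible
```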
The rest of the examples involve conversions between units, and generally the conversions are not equivalent. (118c) says that 1240 pounds equals 735 kilograms, when in fact it equals 562 kilograms. (118d) says that 2500 kilometers equals 1460 miles, when in fact it equals 1553 miles (though this example is close enough to perhaps be reasonable). (117c) says that 975 milliliters equals 2.2 gallons, when in fact it equals 0.26 gallons. (117d) says that 1 billion US dollars equals 610.9 million British pounds; this one is plausible, because at current exchange rates it equals 718 million British pounds, so the stated figure is possible given fluctuations in exchange rates. Finally, examples (117e) through (117h) involve conversions between Kenyan shillings (KES) and British pounds (£). Across these examples (which all come from the same piece of generated text), we observe four different exchange rates: £1 = KES14.3 (117e); £1 = KES25 (117f); £1 = KES66.7 (117g); and £1 = KES80 (117h). Given this inconsistency, it appears that the model does not have any consistent meaning stored for these numbers.
(117) a. The highest speed in the XFX version is 1065mhz , which is around 90.5MHz higher than the 1030mhz in our testing .
      b. the original 4ml tank holds 10.4ml of e juice .
      c. Water Tank Capacity : 975mL ( 2.2 Gallons )
      d. found Mr Mitchell guilty of a combined $ 1bil ( £ 610.9m ) in damages and costs .
      e. In a town like Kajiado that can cost up to KES50 ( £ 3.50 ) to find a taxi .
      f. Prices range up to KES100 ( £ 4.00 ) a night
      g. a bed in one of these rooms can cost KES300 ( £ 4.50 ) .
      h. be prepared for the ride to cost you KES200 ( £ 2.50 ) .
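The unit conversions and implied exchange rates discussed above can likewise be checked mechanically. This sketch uses standard conversion factors (pounds to kilograms, kilometers to miles, milliliters to US gallons) and derives each KES-per-pound rate from the price pairs quoted in (117e) through (117h):

```python
# Check the unit conversions in (117c), (118c), and (118d) against
# standard conversion factors.
LB_TO_KG = 0.45359237        # pounds -> kilograms
KM_TO_MI = 0.62137119        # kilometers -> miles
ML_TO_US_GAL = 1 / 3785.411784  # milliliters -> US gallons

print(round(1240 * LB_TO_KG))        # 562, not the claimed 735   (118c)
print(round(2500 * KM_TO_MI))        # 1553, vs. the claimed 1460 (118d)
print(round(975 * ML_TO_US_GAL, 2))  # 0.26, not the claimed 2.2  (117c)

# Implied KES-per-pound exchange rates from the (KES, GBP) pairs
# in (117e) through (117h):
for kes, gbp in [(50, 3.50), (100, 4.00), (300, 4.50), (200, 2.50)]:
    print(round(kes / gbp, 1))       # 14.3, 25.0, 66.7, 80.0 -- inconsistent
```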