In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test n-gram language models, a probabilistic context-free grammar, language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks for text generation. Our analysis reveals that language models based on recurrent neural networks with a gating mechanism (i.e., long short-term memory; a gated recurrent unit; and quasi-recurrent neural networks) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor’s law is a good indicator of model quality.

The question of evaluation methods for computational models of natural language is fundamental in language engineering. Aside from human rating, current evaluation methods rely on the probability distribution produced by the model, or on the n-gram similarity between the generated text and a corresponding reference written by human experts. The representative metric of the former type is perplexity. Perplexity quantifies the prediction accuracy of a language model and thus requires its probability distribution. The latter category includes the metrics BLEU (Papineni et al. 2002) and ROUGE (Lin 2004). These evaluation methods compute the n-gram co-occurrence between the generated text and a reference. Hence, these methods are reasonable for cases in which either the probability distribution of the computational model is explicit and comparable or a corresponding reference is given.

The emergence of intractable models such as generative adversarial networks (GANs) for text generation has revealed the limitation of these conventional evaluation methods. Tentative studies (Lin et al. 2017; Rajeswar et al. 2017; Yu et al. 2017; Guo et al. 2018; Lu et al. 2018) have sought to generate natural language text in the adversarial learning framework. Because these models do not explicitly output the probability distribution for prediction, they are evaluated by feeding the generated text to other models, such as a neural language model (Fedus, Goodfellow, and Dai 2018) or a probabilistic context-free grammar (PCFG) (Rajeswar et al. 2017). Although those proposals are promising and worth considering, the effectiveness of the methods for evaluation has not been thoroughly investigated. As an alternative to those approaches, in this article we test evaluation with the scaling properties of natural language text.

The scaling properties of natural language are the universal statistical behaviors observed in natural language text. For example, Zipf’s law characterizes the vocabulary population with a power-law function for the rank-frequency distribution. Recent statistical mechanical studies (Ebeling and Neiman 1995; Altmann, Pierrehumbert, and Motter 2009; Tanaka-Ishii and Bunde 2016; Kobayashi and Tanaka-Ishii 2018; Tanaka-Ishii and Kobayashi 2018) revealed another statistical aspect of natural language: long memory. This refers to the way that sequences of characters or words in natural language universally exhibit clustering, bursty behavior. In particular, results using Taylor’s law (Kobayashi and Tanaka-Ishii 2018; Tanaka-Ishii and Kobayashi 2018) show that a natural language text has a consistent range for the Taylor exponent, which quantifies the degree of burstiness in the text.

As the results obtained with scaling properties have clear interpretations, they suggest qualitative implications for language models. For example, evaluation with Zipf’s law examines whether a model can properly produce infrequent words. Similarly, evaluation with Taylor’s law quantifies whether a model can learn the long memory in a natural language text. In this article, we show that, among the computational models, only neural language models based on recurrent neural networks (RNNs) with a gating mechanism can learn and reproduce the long memory of natural language text. None of the other models can reproduce this behavior. In addition, our study demonstrates the capabilities of the scaling properties for evaluating language models.

The rest of the article is organized as follows. In §2, we review the evaluation metrics that have been widely used for tasks in natural language processing. In §3, we introduce the scaling properties of natural language: those given by Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and long-range correlation analysis. We also explain the methods of applying these scaling properties to evaluate computational models. In §4, we provide a summary of the models of natural language considered in this article. Specifically, our work covers n-gram language models, mathematical language models based on the Simon and Pitman-Yor processes, grammatical models, and neural language models. The experimental procedure and settings are explained in §5. In §6, we assess the scaling properties as evaluation metrics and compare them with other metrics using a PCFG and neural language models. In §7, we use the scaling properties to evaluate the models of natural language and discuss the implications of the results. §8 discusses evaluation of GAN models for text generation. Finally, we conclude our work with a summary in §9.

Note that we describe all computational models of natural language considered in this article, as introduced in §4, by the term language model. For some readers this might sound inadequate, because some of these models do not actually form a model to predict subsequent words (e.g., a PCFG and the models based on the Simon and Pitman-Yor processes). Because the term computational models of natural language is long, however, for the sake of brevity we simply use the term language models.

There are two major approaches to evaluate a language model:

  • directly inspecting some subpart of the model, or

  • verifying the output generated by the model.

This section summarizes previous methods of evaluating models from these two viewpoints, with §2.1 and §2.2 corresponding to the first and second approaches, respectively, and §2.3 considering both. As clarified in §3, our proposal belongs to the second category.

2.1 Evaluation Using Probability Distribution: Perplexity

A standard evaluation metric for language models such as n-gram and neural language models is the perplexity (Manning and Schutze 1999), which is a measure of the prediction accuracy. Given a test sample $x_1, \ldots, x_N$ of length $N$ and a language model that predicts the probability of words, denoted as $q(x_i)$, the perplexity is defined as the number $e$ to the power of the negative average log probability of the correct prediction for every word:
$$\text{perplexity} = e^{-\frac{1}{N}\sum_{i=1}^{N}\log q(x_i)} \tag{1}$$
Perplexity is usually applied to predict the successive token $x_i$ given a context of length $k$, namely, $x_{i-k}, \ldots, x_{i-1}$. The probability distribution $q(x_i)$ for prediction must be explicit for evaluation with the perplexity. Moreover, to compare models by using the perplexity, the probability distribution must be defined in a comparable manner. For example, n-gram language models and neural language models are comparable, as they predict the next word from the context.
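
As a minimal sketch of Equation (1), the following Python function computes the perplexity from the probabilities that a model assigned to the tokens that actually occurred, using the negative average natural-log probability. The function name `perplexity` and the list-of-probabilities interface are assumptions made for illustration, not part of any particular toolkit.

```python
import math

def perplexity(probs):
    """Perplexity of a test sample, given q(x_i) for each observed token.

    `probs` is a hypothetical interface: one model probability per token.
    """
    if not probs:
        raise ValueError("need at least one probability")
    # Negative average log probability of the tokens that occurred.
    avg_neg_log = -sum(math.log(q) for q in probs) / len(probs)
    return math.exp(avg_neg_log)

# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among four words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A lower value indicates that the model assigned higher probability to the observed tokens.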

Because perplexity is the current standard metric for automatic evaluation of model quality, the other metrics appearing in this article are compared with the perplexity.

2.2 Evaluation Using Reference: BLEU/ROUGE

Another popular evaluation metric is the n-gram co-occurrence–based approach, including BLEU (Papineni et al. 2002) and ROUGE (Lin 2004). These metrics are widely used in paired-corpus-oriented tasks such as machine translation and automatic summarization. They evaluate by using statistics of the counts of the same n-grams appearing in the machine-generated text and a corresponding reference, which is a correct answer written by an expert.

These approaches only use the output of a model and thus do not require access to any of its internal elements. Because they require the corresponding reference for computing the n-gram co-occurrence, however, their utility is limited to paired-corpus tasks.

Because intractable models such as GANs for text generation cannot have an explicit reference, the application of BLEU or ROUGE to those models is not trivial. A series of GAN studies (Yu et al. 2017; Lin et al. 2017; Guo et al. 2018; Lu et al. 2018) quantitatively measured the quality of the generated text with BLEU by regarding the whole training data set as a reference. The validity of this evaluation method remains questionable, as BLEU was designed for comparison between a pair of a machine-generated text and its correct reference. Zhu et al. (2018) reported that the application of BLEU with this approach does not provide consistent results with different n-grams chosen.

2.3 Evaluation Using Other Language Models

One approach for evaluation without using either a model distribution or a reference is the use of language models, that is, evaluation of language models by using other language models. Fedus, Goodfellow, and Dai (2018) proposed evaluating GAN-generated text with a neural language model trained with the same natural language data set. This direction is promising, if the language model is a reliable model of natural language. Even with state-of-the-art neural language models, however, the model quality is limited.

The use of a clear, transparent model for evaluation, such as an n-gram language model, would also be a possible method. That approach, however, could only measure models of the n-gram structures of natural language and would thus be similar to BLEU evaluation. The use of a PCFG is another possible method of evaluation without a reference. A PCFG is constructed using a parsed corpus such as the Penn Treebank (PTB), and the generated text is parsed with the Viterbi algorithm (Forney 1973). The algorithm computes the log-likelihood of the text. The PCFG is expected to output a small negative log-likelihood for a grammatically correct sentence. As we demonstrate later, however, it is doubtful that a PCFG could meaningfully evaluate the grammaticality of a sentence.

In this section, we explain scaling properties, the statistical properties of natural language text that have a power-law form. One study on the statistics of natural language reported nine scaling laws (Altmann and Gerlach 2017). Four of them concern word formation and a network structure, which do not directly relate to language modeling. This leaves five scaling properties, which can be categorized into those for the vocabulary population and those for long memory. These properties are characterized by power-law functions, which involve a power exponent. The exponents of the scaling properties have the capability to characterize the degree of each property. They therefore serve to evaluate whether a language model has the same behavior as natural language text. Specifically, given a text generated by a language model, we set two levels of assessment for evaluation:

  • Q1 Does the scaling property hold qualitatively?

  • Q2 How does the exponent differ from that of the training data?

As revealed in the following sections, many models fail to satisfy even the first criterion, especially for long memory. For those models that do satisfy Q1, their exponents can be compared with those of the original text.

Hence, we propose the exponents of scaling properties as metrics to evaluate machine-generated text. Consider a power-law relation $y \propto z^{\kappa}$ for points $(y_1, z_1), \ldots, (y_N, z_N)$. These points $(y_i, z_i)$ are calculated for any given text. Let $c$ be the coefficient of the power law, and then the exponent $\kappa$ is estimated by the least-squares method:
$$\hat{\kappa}, \hat{c} = \arg\min_{\kappa, c} \varepsilon(\kappa, c) \tag{2}$$
$$\varepsilon(\kappa, c) \equiv \sqrt{\sum_{i=1}^{N} \left(\log y_i - \log c z_i^{\kappa}\right)^2 / N} \tag{3}$$

The data points are regressed on a log-log scale. The regression method could be a problem if the errors between the data points and the fitting function are not Gaussian-distributed. There are other proposed regression methods such as maximum likelihood estimation for Zipf’s law (Clauset, Shalizi, and Newman 2009; Gerlach and Altmann 2013). In this article, however, because exponents obtained with the least-squares method are effective in distinguishing machine-generated text from natural language text, and because this method has been a conventional standard, we adopt it for estimation.
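
The log-log regression described above can be sketched with the closed-form solution of ordinary least squares. The function name `fit_power_law`, the use of base-10 logarithms, and the reporting of the error as a root-mean-square residual in log space are assumptions of this sketch; the estimated exponent does not depend on the base of the logarithm, but the error value does.

```python
import math

def fit_power_law(z, y):
    """Least-squares fit of y = c * z**kappa on a log-log scale.

    Returns (kappa, c, error). The error is the root-mean-square
    residual in log10 space (an assumption of this sketch).
    """
    log_z = [math.log10(v) for v in z]
    log_y = [math.log10(v) for v in y]
    n = len(log_z)
    mean_z = sum(log_z) / n
    mean_y = sum(log_y) / n
    # Closed-form slope and intercept of a straight-line fit.
    s_zz = sum((a - mean_z) ** 2 for a in log_z)
    s_zy = sum((a - mean_z) * (b - mean_y) for a, b in zip(log_z, log_y))
    kappa = s_zy / s_zz
    log_c = mean_y - kappa * mean_z
    mse = sum((b - (log_c + kappa * a)) ** 2
              for a, b in zip(log_z, log_y)) / n
    return kappa, 10 ** log_c, math.sqrt(mse)

# Synthetic points that lie exactly on y = 3 * z**0.5.
z = [1, 2, 4, 8, 16]
y = [3 * v ** 0.5 for v in z]
print(fit_power_law(z, y))
```

For points that lie exactly on a power law, the recovered exponent and coefficient match the generating values up to floating-point error, and the error is close to zero.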

The following subsections introduce the five scaling properties: Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and a long-range correlation method. As an example, Figure 1 shows a visual presentation of these methods for the wikitext-2 (WT2) data set (Merity et al. 2016). WT2 is a collected corpus of well-written Wikipedia articles, preprocessed by replacing rare words having frequencies under a certain threshold with a meta symbol, <unk>. The details of the data set appear in the first row of Table 1, later in §3.3.

Figure 1 

Scaling properties of the WT2 data set. (a) Zipf’s law: The rank-frequency distributions of words (red) and word pairs (blue). (b) Heaps’ law: The growth of vocabulary size with text length. The solid line is a power-law fitting, and the dashed line represents a power law with exponent α = 1.0, meaning that all words in a sequence are unique. (c) Ebeling’s method: Fluctuation analysis of character occurrence. (d) Taylor’s law: Mean-variance relation of word occurrence. (e) Long-range correlation: Temporal correlation of the sequence of the return intervals of rare words. All data points of these five scaling properties are plotted in a log-log scale.

3.1 Vocabulary Population

3.1.1 Zipf’s Law.

Let r be the rank of a particular word type and f(r) be its frequency. The well-known Zipf’s law formulates a power-law relation between the frequency and the rank:
$$f(r) \propto r^{-\alpha} \tag{4}$$
with α ≈ 1.0. This scaling behavior generally holds not only for unigrams but also for larger n-grams, with smaller exponent values. Figure 1(a) shows Zipf distributions for WT2, with unigrams in red and bigrams in blue. Because WT2 replaces rare words, as mentioned before, the tail of the unigram distribution disappears. The Zipf distributions for unigrams and bigrams typically intersect in the middle of the plots. In practice, the plot is often not aligned linearly in a log-log scale, which makes estimation of the exponent α difficult. Although previous works have dealt with this problem, it is a sensitive topic and is beyond the scope of this article. We therefore do not estimate α but instead observe the distribution qualitatively.
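
The rank-frequency distribution that is inspected for Zipf’s law can be obtained as in the following sketch, which counts n-grams and sorts their frequencies in descending order. The function name `rank_frequency` and the toy sentence are illustrative assumptions.

```python
from collections import Counter

def rank_frequency(tokens, n=1):
    """Return [(rank, frequency), ...] for the n-grams of a token list."""
    # zip over shifted copies of the list yields consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

tokens = "the cat sat on the mat and the dog sat".split()
print(rank_frequency(tokens))       # unigrams
print(rank_frequency(tokens, n=2))  # bigrams
```

Plotting the resulting pairs on log-log axes gives a plot of the kind shown in Figure 1(a).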

3.1.2 Heaps’ Law.

Heaps’ law describes how the vocabulary size grows with the text size following a power-law function. Let n be the length of a text and v(n) be its vocabulary size. Then Heaps’ law is formulated as the following relation:
$$v(n) \propto n^{\beta}, \quad 0 < \beta < 1 \tag{5}$$
Figure 1(b) shows the text sizes and corresponding vocabulary sizes for WT2. The exponent β is 0.75 with error ε = 0.13, which is smaller than β = 1.0 (represented by the dashed black line). There have been multiple debates on how Heaps’ law is mathematically related to Zipf’s law (Baeza-Yates and Navarro 2000; van Leijenhorst and van der Weide 2005; Lü, Zhang, and Zhou 2010).
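
The vocabulary growth curve v(n) can be computed in a single pass over the text, as in this sketch. The function name `vocabulary_growth` is an assumption for illustration.

```python
def vocabulary_growth(tokens):
    """Return [v(1), ..., v(N)]: distinct word types among the first n tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

print(vocabulary_growth("a b a c b d".split()))
```

If every token were unique, v(n) would equal n, which corresponds to the dashed line with exponent 1.0 in Figure 1(b).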

3.2 Long Memory

The statistical mechanics domain has introduced two approaches for quantifying long memory in a time series: fluctuation analysis and the long-range correlation method. We introduce two fluctuation analysis methods, one for characters and one for words, and one long-range correlation method, applied to words. Although these methods are related analytically for a well-formed time series (Eisler, Bartos, and Kertész 2007), the relation is nontrivial for real phenomena.

3.2.1 Ebeling’s Method.

Ebeling’s method (Ebeling and Neiman 1995) analyzes the power-law relation between the lengths of subsequences of a text and the variance of the number of characters in the subsequences. Given a set of elements (characters in this method), W, let y(c, l) be the counts of character c within subsequences of the text of length l. Then, the fluctuation function m(l) is defined as
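$$m(l) = \sum_{c \in W} m_2(c, l) \propto l^{\eta} \tag{6}$$
where $m_2(c, l)$ is the variance of the counts $y(c, l)$:
$$m_2(c, l) = \langle y^2(c, l) \rangle - \left(\langle y(c, l) \rangle\right)^2 \tag{7}$$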
Theoretically, if a time series is independent and identically distributed (i.i.d.), then η = 1.0, in general, and η > 1.0 if a time series has long memory. Ebeling and Neiman (1995) reported that the character sequence of the Bible has exponent η = 1.67, indicating the presence of clustering behavior at the character level. Following the original work, we apply this method at the character level. Figure 1(c) shows the fluctuation analysis m(l) for WT2. The exponent is η = 1.32 with error ε = 0.10.
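
A minimal sketch of the fluctuation function in Equations (6) and (7) follows. It assumes that the subsequences are non-overlapping segments of length l and that a trailing remainder shorter than l is dropped; the function name `fluctuation` and these segmentation choices are assumptions of the sketch rather than details stated here.

```python
from collections import Counter

def fluctuation(text, l):
    """m(l): sum over characters of the variance of their counts
    in non-overlapping segments of length l (assumed segmentation)."""
    segments = [text[i:i + l] for i in range(0, len(text) - l + 1, l)]
    if len(segments) < 2:
        raise ValueError("need at least two segments of length l")
    counts = [Counter(seg) for seg in segments]
    total = 0.0
    for c in set(text):
        ys = [cnt[c] for cnt in counts]
        mean = sum(ys) / len(ys)
        mean_sq = sum(y * y for y in ys) / len(ys)
        total += mean_sq - mean ** 2  # <y^2> - <y>^2
    return total

print(fluctuation("aaaabbbb", 4))  # strongly clustered characters
print(fluctuation("abababab", 2))  # perfectly regular characters
```

Evaluating the function for a range of l values and fitting the result on a log-log scale would give an estimate of η.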

3.2.2 Taylor’s Law.

Taylor’s law was originally reported in two pioneering works (Smith 1938; Taylor 1961) and has been applied in various domains (Eisler, Bartos, and Kertész 2007). It describes the power-law relation between the mean and the variance in spatiotemporal observations. In this article, we apply Taylor’s law for natural language text as proposed by Kobayashi and Tanaka-Ishii (2018) and Tanaka-Ishii and Kobayashi (2018).

Given a text with a set of words, $W$, for a given segment size $l$ the number of occurrences of a particular word $w \in W$ is counted, and the mean $\mu_w$ and standard deviation $\sigma_w$ are calculated. We thus obtain a scatter plot of $\mu$ and $\sigma$ for all elements of $W$. Taylor’s law states the following power-law relation between $\sigma$ and $\mu$ with the Taylor exponent $\zeta$:
$$\sigma \propto \mu^{\zeta} \tag{8}$$

Figure 1(d) shows the Taylor’s law plot for WT2 with l = 5,620 (l can be any value larger than 1). The scatter plot generally follows a power-law function with exponent ζ = 0.62 and has some deviation from the regression line, with error ε = 0.15.

The Taylor exponent takes the range of values 0.50 ≤ ζ ≤ 1.00, and the two limit values ζ = 0.50, 1.0 have clear interpretations. For an i.i.d. process, it is proved that ζ = 0.50. On the other hand, one case with ζ = 1.0 occurs when all segments of length l contain the elements of W with the same proportions. For example, given W = {a, b}, suppose that b always occurs twice as often as a in all segments (e.g., one segment with three a and six b, another segment with one a and two b). Then, both the mean and standard deviation for b are twice those for a, and thus ζ = 1.0. Therefore, the Taylor exponent quantifies how consistently words co-occur in a text. The Taylor exponent of a natural language text typically has a range of 0.55 ≤ ζ ≤ 0.65 and never takes ζ = 0.50 (which would indicate no long memory). It takes different ranges of values for different types of sequences (e.g., child-directed speech and programming source code). It is therefore expected to have the capability to evaluate machine-generated text.
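
The points used for Taylor’s law can be sketched as follows. The sketch assumes non-overlapping segments of length l and the population standard deviation (division by the number of segments); the function name `taylor_points` and the toy sequence are hypothetical and chosen so that every word keeps the same relative proportions across its counts.

```python
from collections import Counter
import math

def taylor_points(tokens, l):
    """Return {word: (mean, std)} of per-segment counts for segment size l."""
    segments = [tokens[i:i + l] for i in range(0, len(tokens) - l + 1, l)]
    counts = [Counter(seg) for seg in segments]
    points = {}
    # Only words inside the retained segments are considered.
    for w in set(tokens[:len(segments) * l]):
        ys = [cnt[w] for cnt in counts]
        mean = sum(ys) / len(ys)
        var = sum((y - mean) ** 2 for y in ys) / len(ys)
        points[w] = (mean, math.sqrt(var))
    return points

# Two segments of length 12 in which each count changes by the same factor.
tokens = ["a"] * 3 + ["b"] * 6 + ["c"] * 3 + ["a"] * 1 + ["b"] * 2 + ["c"] * 9
for word, (mean, std) in sorted(taylor_points(tokens, 12).items()):
    print(word, mean, std)
```

In this toy sequence the standard deviation of every word is half its mean, so the points lie on a line of slope 1 on a log-log scale, which is the ζ = 1.0 limit.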

Ebeling’s method and Taylor’s law analysis have the following two differences. First, Ebeling’s method analyzes the growth of the variance m(l) with respect to the length of the subsequences, l, and Taylor’s law analyzes the variance with respect to the mean frequency within a fixed subsequence length. Second, to acquire an exponent for a text, Ebeling’s method takes the sum of the variances over all symbols, whereas Taylor’s law obtains the exponent from the individual points for all words.

For the latter reason, Ebeling’s method is influenced by a small number of frequently appearing symbols. Because it involves the sum of the variances of all words that follow the power law, the behavior of the exponent η often tends to be less sensitive than that of the Taylor exponent.

3.2.3 Long-Range Correlation.

Long-range correlation analysis quantifies the burstiness of word occurrence in a natural language text. The analysis measures the degree of self-similarity within a sequence. Among such analyses, early works proposed mutual-information-based methods (Li 1989; Ebeling and Pöschel 1994; Lin and Tegmark 2017). Such methods compute the mutual information between characters separated by s characters. These works reported that the mutual information decays according to a power law with the distance s. Takahashi and Tanaka-Ishii (2017) showed, however, that the mutual information method cannot quantify the long-range dependence in word sequences. Moreover, the mutual information between characters decays quickly and reaches a plateau at a distance $s \approx 10^{1}$ for natural language texts such as the collected works of Shakespeare and the PTB data set.

Another approach to long-range correlation analysis is the use of the autocorrelation function (ACF). The ACF c(s) is defined as the Pearson correlation for two elements of a sequence separated by a distance s:
$$c(s) = \frac{E[(x_t - \mu)(x_{t+s} - \mu)]}{\sigma^2} \tag{9}$$
where $\mu$ and $\sigma$ are the respective mean and standard deviation of the time series $x_t$. The value of $c(s)$ ranges between −1 and 1. A time series is said to be long-range correlated if the ACF $c(s)$ for two elements separated by distance $s$ follows a power law:
$$c(s) \propto s^{-\xi}, \quad s > 0, \quad 0 < \xi < 1 \tag{10}$$
In the case of application to real-world data, a sequence is said to be long-range correlated if c(s) takes positive values for s until about 1/100 of the length (Lennartz and Bunde 2009). For sequences without correlation, c(s) fluctuates around zero.
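
Equation (9) can be estimated from a finite numerical sequence as in the sketch below. The function name `autocorrelation` and the choice to average the products over the n − s available pairs are assumptions of this sketch; other finite-sample normalizations are also in use.

```python
def autocorrelation(x, s):
    """Estimate c(s) for a numerical sequence x at distance s."""
    n = len(x)
    if not 0 < s < n:
        raise ValueError("s must satisfy 0 < s < len(x)")
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("sequence has zero variance")
    # Average product of deviations for elements s steps apart.
    cov = sum((x[t] - mu) * (x[t + s] - mu) for t in range(n - s)) / (n - s)
    return cov / var

# An alternating sequence is perfectly anti-correlated at odd distances
# and perfectly correlated at even distances.
x = [1, -1] * 50
print(autocorrelation(x, 1), autocorrelation(x, 2))
```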

Because the ACF is applicable only for numerical time series, the application of this method for natural language text requires transformation of the sequence of symbols into a numerical time series. Recent methods do so by considering the intervals of word occurrences (Tanaka-Ishii and Bunde 2016). In this article, we apply a method that measures the ACF of a sequence of the return intervals of rare words, which amounts to $1/Q$ of the text length. With this method, Tanaka-Ishii and Bunde (2016) reported that power-law decay of the ACF is observed for natural language texts.

Figure 1(e) shows the long-range correlation analysis of word sequences in WT2. The hyperparameter was set to Q = 16 for all results in this article. As seen in the figure, the ACF c(s) always takes positive values up to 1/100 of the sequence length and follows a power-law function (i.e., a straight line in a log-log plot) with exponent ξ = 0.33 and error ε = 0.04. Throughout this article, the error ε of this metric is only measured for s ≤ 100.
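
One possible reading of the return-interval transformation is sketched below: word types are added from the rarest upward until their occurrences cover at most 1/Q of the tokens, and the distances between consecutive occurrences of any of these rare words form the numerical sequence. The selection rule, the tie-breaking by word string, and the function name `rare_word_intervals` are assumptions of this sketch and may differ from the procedure of Tanaka-Ishii and Bunde (2016).

```python
from collections import Counter

def rare_word_intervals(tokens, Q):
    """Return intervals between consecutive occurrences of rare words.

    Rare words are chosen from the least frequent upward until they
    cover at most len(tokens) / Q tokens (assumed selection rule).
    """
    counts = Counter(tokens)
    budget = len(tokens) / Q
    rare, covered = set(), 0
    for word, freq in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if covered + freq > budget:
            break
        rare.add(word)
        covered += freq
    positions = [i for i, tok in enumerate(tokens) if tok in rare]
    return [b - a for a, b in zip(positions, positions[1:])]

print(rare_word_intervals("a b a c a d a a".split(), Q=2))
```

The resulting interval sequence is numerical, so an autocorrelation function can be applied to it directly.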

3.3 Examples of Scaling Properties for Other Natural Language Texts

Except for Zipf’s and Heaps’ laws, the scaling properties have hardly appeared in the context of computational linguistics or language engineering. This may be because these properties do not directly incorporate semantics or syntax, which are of central concern in those domains. Instead, the properties quantify the universal structures behind natural language in a statistical sense. Those introduced so far are robust and apply to texts across different genres and languages as long as the text is sufficiently long. Figure 2 shows the scaling properties of another language modeling data set, the PTB. This text also satisfies all five scaling properties. They are indeed universal with respect to the genre or even language. More results are shown in Appendix A. Figure A1 shows the scaling properties of the collected works of Shakespeare, and their exponents are listed in the third block of Table 1. Likewise, the scaling properties and exponents for Hong Lou Meng, a Chinese literary work, are shown in Figure A2 and listed in the last block of Table 1, respectively. Among the exponents, that of the long-range correlation, ξ, differs largely among the four data sets considered thus far. In contrast, the other exponents generally take similar values for the data sets.

Figure 2 

Scaling properties of the Penn Treebank (preprocessed).

Table 1 
Summary of the data sets used in this article and their statistics.
Zipf’s and Heaps’ laws concern the vocabulary population; the remaining three columns concern long memory. Values in parentheses are the fitting errors ε.

| Data set | Tokens | Vocab. | Zipf’s law f(r) ∝ r^(−α) | Heaps’ law v(n) ∝ n^β | Ebeling’s method m(l) ∝ l^η | Taylor’s law σ ∝ μ^ζ | Long-range correlation c(s) ∝ s^(−ξ) |
|---|---|---|---|---|---|---|---|
| Wikitext-2 (English, Wikipedia article) | | | | | | | |
| preprocessed data set | 2,088,628 | 33,278 | Yes | 0.75 (0.13) | 1.33 (0.10) | 0.62 (0.15) | 0.33 (0.04) |
| original data set | 2,088,628 | 76,617 | Yes | 0.78 (0.09) | 1.33 (0.10) | 0.65 (0.11) | 0.32 (0.03) |
| Penn Treebank (English, The Wall Street Journal news article) | | | | | | | |
| preprocessed data set | 887,521 | 10,000 | Yes | 0.70 (0.16) | 1.23 (0.06) | 0.56 (0.14) | 0.81 (0.24) |
| original data set | 892,008 | 89,317 | Yes | 0.83 (0.07) | 1.20 (0.05) | 0.57 (0.06) | 0.60 (0.16) |
| Shakespeare (old English collection of literature works) | | | | | | | |
| original text | 740,706 | 83,105 | Yes | 0.79 (0.07) | 1.24 (0.09) | 0.59 (0.05) | 0.13 (0.02) |
| Hong Lou Meng (Chinese, literature work) | | | | | | | |
| original text | 703,034 | 18,312 | Yes | 0.74 (0.14) | 1.31 (0.07) | 0.58 (0.07) | 0.39 (0.04) |

This section introduces the computational models of natural language tested in this article. We test four categories of language models: n-gram models, grammatical models, language models based on the Simon or Pitman-Yor process, and neural language models. These categories cover the different genres of language models that have appeared in the history of computational linguistics. For every category, some sophisticated, advanced models have been proposed. The experiments reported in §6 and §7, however, were conducted only with the most recent models whose code was available, except for the n-gram models. This served to avoid errors in reimplementation.

4.1 n-Gram Models

An n-gram language model is the most basic model, as it is an (n − 1)-order Markov model. Let $c(X_1^t)$ be the count of $X_1^t = x_1, x_2, \ldots, x_t$, and then the probability of the next element $x_{t+1}$ is calculated as
$$P(x_{t+1} \mid X_1^t) \approx P(x_{t+1} \mid X_{t-n+1}^{t}) = \frac{c(X_{t-n+1}^{t+1})}{c(X_{t-n+1}^{t})} \tag{11}$$

This article examines 3-gram and 5-gram models. Other than the original n-gram model, we also test models with a variety of smoothing techniques to improve the perplexity. In particular, linear interpolation (Stolcke 2002), Katz backoff (Katz 1987), and Kneser-Ney smoothing (Kneser and Ney 1995) have been known to enhance the performance of n-gram models. We also set n = 3 and n = 5 for these models to compare with the original n-gram models. It has been empirically verified that longer context does not necessarily contribute to improving the perplexity and can even degrade performance (Chen and Goodman 1999). Simple n-gram models, in fact, have been mathematically shown to be incapable of reproducing long memory (Kingman 1963; Lin and Tegmark 2017).
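
The count ratio behind an unsmoothed n-gram model can be sketched as follows. The sketch conditions on the n − 1 preceding tokens, which is the usual convention, and returns 0.0 for unseen contexts; the function name `ngram_model` and the closure-based interface are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def ngram_model(tokens, n):
    """Return prob(context, word) estimated from raw n-gram counts."""
    ngrams = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    # Context counts are the n-gram counts summed over the final word.
    contexts = Counter()
    for gram, c in ngrams.items():
        contexts[gram[:-1]] += c

    def prob(context, word):
        # Keep only the last n - 1 tokens of the given context.
        context = tuple(context[len(context) - (n - 1):])
        denom = contexts[context]
        return ngrams[context + (word,)] / denom if denom else 0.0

    return prob

prob = ngram_model("a b a b a c".split(), n=2)
print(prob(["a"], "b"), prob(["a"], "c"), prob(["b"], "a"))
```

Because unseen n-grams receive zero probability here, smoothing techniques such as those listed above are needed before such a model can be evaluated by perplexity.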

4.2 Grammatical Models

The PCFG is a basic grammatical model. We constructed this grammar model with the annotated PTB data set and used the Natural Language Toolkit (NLTK) (Loper and Bird 2002) to generate sentences according to the probabilities assigned to productions. Unlike an n-gram model, the PCFG generates a text by using a tree.

4.3 Language Models Based on Simon/Pitman-Yor Processes

The Simon and Pitman-Yor processes are abstract models of natural language that reproduce Zipf’s law and Heaps’ law. These are generative models, and a sequence is formulated over time, either through (1) introduction of new words or (2) sampling from the past sequence. Let $K(X_1^t)$ be the number of word types existing in $X_1^t$, and let $n_k(X_1^t)$ be the frequency of the $k$th word type in $X_1^t$. The sequence starts with $K(X_0) = 1$ and $X_0 = x_0$ at $t = 0$. For $t \geq 1$, given a constant $a$ with $0 < a < 1$, the Simon process (Simon 1955) introduces a new word with probability $a$, or a word is sampled from $X_1^t$ with probability $1 - a$:
$$P(x_{t+1} = w_k) = \begin{cases} (1 - a)\,\dfrac{n_k(X_1^t)}{t} & 1 \leq k \leq K(X_1^t) \\ a & k = K(X_1^t) + 1 \end{cases}$$
The Simon process strictly follows Zipf’s law with exponent α = 1.0 and consequently Heaps’ law, as well. In contrast, the Pitman-Yor process copes with this problem by decreasing the introduction rate of new words in proportion to $K(X_1^t)$ via another parameter $b$, with $0 \leq a < 1$ and $0 \leq b$:
$$P(x_{t+1} = w_k) = \begin{cases} \dfrac{n_k(X_1^t) - a}{t + b} & 1 \leq k \leq K(X_1^t) \\ \dfrac{a K(X_1^t) + b}{t + b} & k = K(X_1^t) + 1 \end{cases}$$

These two parameters serve to produce Zipf’s law with slightly convex behavior (Goldwater, Griffiths, and Johnson 2011). The basic models introduced to this point define nothing about how to introduce words: We could simply generate random sequences and examine their scaling properties, because the basic formulations thus far govern the nature of the language models elaborated from these basic models.
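
Random sequences of the kind mentioned above can be generated with the following sketch of the two processes. Word types are represented by integers, the sequence starts from a single word of type 0, and the function names, the `seed` parameter, and the use of Python’s `random` module are assumptions made for illustration.

```python
import random

def simon_process(length, a, seed=0):
    """Generate a sequence of integer word types with the Simon process."""
    rng = random.Random(seed)
    seq = [0]          # x_0: the first word type
    next_type = 1
    while len(seq) < length:
        if rng.random() < a:
            seq.append(next_type)        # introduce a new word
            next_type += 1
        else:
            # Uniform sampling from the past picks type k with
            # probability proportional to its frequency n_k.
            seq.append(rng.choice(seq))
    return seq

def pitman_yor_process(length, a, b, seed=0):
    """Generate a sequence of integer word types with the Pitman-Yor process."""
    rng = random.Random(seed)
    counts = [1]       # counts[k] is the frequency n_k of type k
    seq = [0]
    while len(seq) < length:
        t, K = len(seq), len(counts)
        if rng.random() < (a * K + b) / (t + b):
            counts.append(1)             # introduce a new word
            seq.append(K)
        else:
            # Existing type k is chosen with weight n_k - a.
            weights = [n - a for n in counts]
            k = rng.choices(range(K), weights=weights)[0]
            counts[k] += 1
            seq.append(k)
    return seq

print(simon_process(20, a=0.1))
print(pitman_yor_process(20, a=0.5, b=1.0))
```

The generated integer sequences can be passed to rank-frequency or vocabulary-growth analyses in the same way as a tokenized text.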

By mapping words to the elements produced, we would generate a language model, like the two-stage model proposed in Goldwater, Griffiths, and Johnson (2011). Here, we consider a more advanced model proposed as the hierarchical Pitman-Yor language model (HPYLM) (Teh 2006), which integrates the Pitman-Yor process into an n-gram model.1

4.4 Neural Language Models

State-of-the-art neural language models are known to outperform n-gram language models by the measure of perplexity. The majority of promising neural language models (Mikolov and Zweig 2012; Melis, Dyer, and Blunsom 2018; Merity, Keskar, and Socher 2018b; Yang et al. 2018) adopt RNNs. An RNN computes a hidden state $h_t$ recursively from the input $x_t$ and the previous hidden state $h_{t-1}$ to create an effective flow of past information:
$$h_t = \Phi(x_t, h_{t-1}) \tag{12}$$
The function $\Phi$ depends on the recurrent architecture of the network. A simple RNN model computes the hidden vector $h_t$ as a matrix transformation of $x_t$ and $h_{t-1}$ by the parameters $W_{xh}$ and $W_{hh}$ and a nonlinear transformation by the sigmoid function:
$$h_t = \mathrm{sigmoid}(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{13}$$
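
A single step of the simple RNN in Equation (13) can be sketched in plain Python with lists as vectors and lists of rows as matrices. The helper names `sigmoid`, `matvec`, and `rnn_step` and the toy parameter values are assumptions for illustration; a practical implementation would use an array library.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + r + b)
            for a, r, b in zip(from_input, from_state, b_h)]

# Toy parameters: 3-dimensional input, 2-dimensional hidden state.
W_xh = [[0.1, 0.0, -0.2], [0.3, 0.5, 0.0]]
W_hh = [[0.4, -0.1], [0.0, 0.2]]
b_h = [0.0, 0.1]
h = [0.0, 0.0]
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)
```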

In modern applications, RNNs with a gating mechanism, such as long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), a gated recurrent unit (GRU) (Cho et al. 2014), and quasi-recurrent neural networks (QRNNs) (Bradbury et al. 2017), are often adopted. The recurrent architectures of these models are defined as follows.

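  • LSTM
    $$i_t = \mathrm{sigmoid}(U_i x_t + W_i h_{t-1} + b_i) \tag{14}$$
    $$f_t = \mathrm{sigmoid}(U_f x_t + W_f h_{t-1} + b_f) \tag{15}$$
    $$o_t = \mathrm{sigmoid}(U_o x_t + W_o h_{t-1} + b_o) \tag{16}$$
    $$\tilde{c}_t = \tanh(U_{\tilde{c}} x_t + W_{\tilde{c}} h_{t-1} + b_{\tilde{c}}) \tag{17}$$
    $$c_t = \mathrm{sigmoid}(f_t \circ c_{t-1} + i_t \circ \tilde{c}_t) \tag{18}$$
    $$h_t = \tanh(c_t) \circ o_t \tag{19}$$
  • GRU
    $$r_t = \mathrm{sigmoid}(U_r x_t + W_r h_{t-1} + b_r) \tag{20}$$
    $$u_t = \mathrm{sigmoid}(U_u x_t + W_u h_{t-1} + b_f) \tag{21}$$
    $$\tilde{h}_t = \tanh(W_x r_t \circ x_t + W_h h_t + b) \tag{22}$$
    $$h_t = (1 - u_t) \circ h_{t-1} + u_t \circ \tilde{h}_t \tag{23}$$
  • QRNNs
    $$z_t = \mathrm{sigmoid}(W_z^1 x_{t-1} + W_z^2 x_t) \tag{24}$$
    $$f_t = \mathrm{sigmoid}(W_f^1 x_{t-1} + W_f^2 x_t) \tag{25}$$
    $$h_t = f_t \circ h_{t-1} + (1 - f_t) \circ z_t \tag{26}$$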
Here, the operator ∘ denotes element-wise multiplication, the capital symbols W and U with subscripts are matrices, and the lowercase symbols b with subscripts are bias vectors. All these architectures have a gating mechanism (Equations (18), (23), and (26)), which balances the use of the states at the previous and current time steps.
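
The gating idea can be illustrated with a minimal sketch of the QRNN update as written in Equations (24)–(26). The helper names (`gated_update`, `qrnn_step`) and the list-based vectors are assumptions for this example only, and the sketch is not the implementation used in the experiments.

```python
import math


def sigmoid(value):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_update(gate, h_prev, candidate):
    """Element-wise interpolation: gate * h_prev + (1 - gate) * candidate."""
    return [g * h + (1.0 - g) * c for g, h, c in zip(gate, h_prev, candidate)]


def qrnn_step(x_prev, x_t, h_prev, W_z1, W_z2, W_f1, W_f2):
    """One step following Equations (24)-(26) as stated in the text."""
    z_t = [sigmoid(a + b)
           for a, b in zip(matvec(W_z1, x_prev), matvec(W_z2, x_t))]
    f_t = [sigmoid(a + b)
           for a, b in zip(matvec(W_f1, x_prev), matvec(W_f2, x_t))]
    return gated_update(f_t, h_prev, z_t)
```

A gate value close to 1 carries the previous state forward almost unchanged, whereas a value close to 0 overwrites it with the candidate computed at the current time step.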

In this article, we consider a total of nine neural language models. Three of them are based on a simple RNN, a GRU (Cho et al. 2014), and QRNNs (Bradbury et al. 2017; Merity, Keskar, and Socher 2018a). The rest are LSTM-based language models. The first LSTM model is trained without regularization such as dropout. The second model is AWD-LSTM (Merity, Keskar, and Socher 2018b), which applies regularization effectively to achieve competitive prediction performance. The other four models integrate extended architectures of RNN language models, namely, continuous cache (Grave, Joulin, and Usunier 2017) and mixture of softmaxes (MoS) (Yang et al. 2018). Continuous cache is a memory augmentation architecture that computes a cache probability $p_{\mathrm{cache}}$ from the $l$ most recent time steps of context. It computes the similarity between $h_t$ and $h_i$ to estimate the reappearance of the word at time step $i$. The output probability of the model with continuous cache, denoted as the AWD-LSTM-Cache model, is a linear interpolation of the AWD-LSTM output and the cache probability. We also test a model incorporating the Simon process, denoted as the AWD-LSTM-Simon model. It behaves as a uniform sampling from the past generated sequence and is a special case of AWD-LSTM-Cache. In addition, the MoS architecture reformulates the language modeling task as matrix factorization and is a state-of-the-art language model integrated with AWD-LSTM as the AWD-LSTM-MoS model. Finally, we also consider a combination of all these architectures, denoted as the AWD-LSTM-MoS-Cache model.
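
The interpolation between a model distribution and a cache distribution can be sketched as follows for the Simon-style special case, in which the cache samples uniformly from the recent history. The function names, the dictionary representation of distributions, and the interpolation weight `lam` are hypothetical choices for this illustration; the similarity-weighted cache of Grave, Joulin, and Usunier (2017) is not reproduced here.

```python
from collections import Counter


def simon_cache_probability(history, window):
    """Cache distribution from uniform sampling over the last `window` words."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    if not recent:
        return {}
    counts = Counter(recent)
    return {word: count / len(recent) for word, count in counts.items()}


def interpolate(p_model, p_cache, lam):
    """Linear interpolation (1 - lam) * p_model + lam * p_cache.

    `lam` is a hypothetical mixing weight in [0, 1]. With an empty cache
    the model distribution is returned unchanged so that it stays normalized.
    """
    if not p_cache:
        return dict(p_model)
    vocabulary = set(p_model) | set(p_cache)
    return {
        word: (1.0 - lam) * p_model.get(word, 0.0) + lam * p_cache.get(word, 0.0)
        for word in vocabulary
    }
```

Because both inputs are probability distributions, the interpolated output also sums to one, and words that occurred recently receive extra probability mass.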

The hyperparameters used in our experiments followed the instructions in Merity, Keskar, and Socher (2018b) and Yang et al. (2018). The context length (or the length of back-propagation through time) was 70, as given in the references, for both character- and word-based models. The cache window size of the AWD-LSTM-Simon model was set to 10,000, to balance a large window size with computational efficiency. All the language models were trained to minimize the negative log-likelihood of the training data by stochastic gradient algorithms. Note that the perplexity scores for character- and word-based models are not directly comparable, as they indicate bits per character and per word, respectively.
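
For reference, the standard relation between the average negative log-likelihood, the perplexity, and the number of bits per token can be sketched as below. The function names are illustrative, and the input is assumed to be the list of probabilities that a model assigned to each observed token (a word or a character).

```python
import math


def mean_negative_log_likelihood(token_probabilities):
    """Average negative natural-log probability over the observed tokens."""
    total = -sum(math.log(p) for p in token_probabilities)
    return total / len(token_probabilities)


def perplexity(token_probabilities):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_negative_log_likelihood(token_probabilities))


def bits_per_token(token_probabilities):
    """The same quantity expressed in bits (base-2 logarithm) per token."""
    return mean_negative_log_likelihood(token_probabilities) / math.log(2)
```

A model that assigns probability 1/4 to every observed token has a perplexity of about 4, or equivalently about 2 bits per token; the unit of the token (character or word) determines what the score is measured against.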

For every language model, a sample text of 1 million words was generated and evaluated using the metrics explained thus far. We expected models that learned a natural language text to be able to generate a sample text with scaling properties resembling those of the original text. In particular, we expected that the exponent values would be close to those of the original data set.

The subsequent two sections, §6 and §7, proceed by examining the scaling properties as applied to models that learned WT2 or the PTB. As introduced in §3.3, these are two standard data sets used as language model benchmarks. For both WT2 and the PTB, the data set was preprocessed to reduce the vocabulary size. Infrequent words were replaced with <unk>, and numbers were replaced with N in the PTB (Mikolov et al. 2010). Language models were then constructed by training with either WT2 or the PTB. The exceptions were the Simon and Pitman-Yor processes, which are not trained (unlike HPYLM, which does learn), and the PCFG, which could be constructed only with the PTB data set because it requires a parsed corpus, which does not exist for WT2.

Tables 2 and 3 list the perplexity and the scaling exponents of the models for the WT2 and PTB data sets, respectively. Each row presents the results for a single text, either real or machine-generated. The perplexity is not reported for the Simon model, the Pitman-Yor process, or the PCFG. For the two mathematical models, it was not measured because they do not have references for computing the prediction accuracy. The perplexity of the PCFG is not reported because its computation does not trivially match that of the n-gram and neural language models.

Table 2
Summary of the scaling properties of the language models with WT2. The Zipf’s law and Heaps’ law columns form the Vocabulary Population block; the Ebeling’s method, Taylor’s law, and long-range correlation columns form the Long Memory block. † The perplexity measure for HPYLM is not equivalent to that for the n-gram and neural language models because of the preprocessing difference. ‡ The values for these models are in bits per character.

| Model | Perplexity | Zipf’s Law $f(r) \propto r^{-\alpha}$ | Heaps’ Law $v(n) \propto n^{\beta}$ | Ebeling’s Method $m(l) \propto l^{\eta}$ | Taylor’s Law $\sigma \propto \mu^{\zeta}$ | Long-Range Correlation $c(s) \propto s^{-\xi}$ |
|---|---|---|---|---|---|---|
| Original Data set | | | | | | |
| Wikitext-2 (Preprocessed) | - | Yes | 0.75 (0.13) | 1.32 (0.10) | 0.62 (0.15) | 0.33 (0.04) |
| Wikitext-2 (Original) | - | Yes | 0.78 (0.09) | 1.33 (0.10) | 0.65 (0.11) | 0.32 (0.03) |
| Shuffled Data set | | | | | | |
| Wikitext-2 (1-gram) | - | Yes | 0.75 (0.16) | 1.00 (0.01) | 0.50 (0.02) | No |
| Wikitext-2 (2-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.01) | No |
| Wikitext-2 (5-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.02) | No |
| Wikitext-2 (10-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.02) | No |
| N-gram Language Model | | | | | | |
| 3-gram | 837.58 | Yes | 0.79 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| 5-gram | 534.98 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| linear interpolation | 294.72 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 3-gram | 285.14 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 5-gram | 357.94 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 3-gram | 204.15 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 5-gram | 215.44 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Simon/Pitman-Yor Process and Related Language Model | | | | | | |
| Simon | - | Yes | 0.95 (0.15) | - | 0.50 (0.01) | 0.09 (0.03) |
| Pitman-Yor | - | Yes | 0.78 (0.09) | - | 0.50 (0.01) | No |
| HPYLM | (184.34)† | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Neural Language Model (character based) | | | | | | |
| LSTM (no regularization) | (1.44)‡ | Yes | 0.74 (0.17) | 1.06 (0.05) | 0.50 (0.01) | No |
| AWD-LSTM | (1.22)‡ | Yes | 0.73 (0.15) | 1.27 (0.10) | 0.54 (0.04) | 0.30 (0.05) |
| Neural Language Model (word based) | | | | | | |
| Simple RNN | 164.51 | Yes | 0.79 (0.12) | 1.01 (0.00) | 0.50 (0.02) | No |
| GRU | 96.22 | Yes | 0.79 (0.11) | 1.12 (0.06) | 0.52 (0.03) | 0.52 (Weak) |
| QRNN | 74.74 | Yes | 0.79 (0.11) | 1.08 (0.03) | 0.52 (0.03) | 0.57 (0.08) |
| LSTM (no regularization) | 113.18 | Yes | 0.78 (0.12) | 1.10 (0.03) | 0.52 (0.03) | 0.43 (0.15) |
| AWD-LSTM | 64.27 | Yes | 0.76 (0.13) | 1.30 (0.15) | 0.58 (0.06) | 0.05 (0.01) |
| AWD-LSTM-Simon | 61.59 | Yes | 0.77 (0.10) | 1.25 (0.15) | 0.55 (0.05) | 0.03 (0.01) |
| AWD-LSTM-MoS | 62.44 | Yes | 0.78 (0.12) | 1.16 (0.07) | 0.54 (0.04) | 0.33 (0.07) |
| AWD-LSTM-MoS-Cache | 59.21 | Yes | 0.78 (0.11) | 1.20 (0.07) | 0.57 (0.07) | 0.29 (0.05) |
| AWD-LSTM-Cache | 50.39 | Yes | 0.78 (0.11) | 1.25 (0.10) | 0.59 (0.07) | 0.14 (0.04) |

Table 3
Summary of the scaling properties of the language models with the PTB. The Zipf’s law and Heaps’ law columns form the Vocabulary Population block; the Ebeling’s method, Taylor’s law, and long-range correlation columns form the Long Memory block. † The perplexity measure for HPYLM is not equivalent to that for the n-gram and neural language models because of the preprocessing difference. ‡ The values for these models are in bits per character.

| Model | Perplexity | Zipf’s Law $f(r) \propto r^{-\alpha}$ | Heaps’ Law $v(n) \propto n^{\beta}$ | Ebeling’s Method $m(l) \propto l^{\eta}$ | Taylor’s Law $\sigma \propto \mu^{\zeta}$ | Long-Range Correlation $c(s) \propto s^{-\xi}$ |
|---|---|---|---|---|---|---|
| Original Data set | | | | | | |
| Penn Treebank (Preprocessed) | - | Yes | 0.70 (0.16) | 1.23 (0.06) | 0.56 (0.14) | 0.81 (0.24) |
| Penn Treebank (Original) | - | Yes | 0.83 (0.07) | 1.20 (0.05) | 0.57 (0.06) | 0.60 (0.16) |
| Shuffled Data set | | | | | | |
| Penn Treebank (1-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (2-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (5-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (10-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.01) | 0.50 (0.02) | No |
| N-gram Language Model | | | | | | |
| 3-gram | 367.79 | Yes | 0.71 (0.19) | 0.99 (0.01) | 0.50 (0.02) | No |
| 5-gram | 561.65 | Yes | 0.72 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| linear interpolation | 238.59 | Yes | 0.71 (0.20) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 3-gram | 195.65 | Yes | 0.71 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 5-gram | 250.18 | Yes | 0.71 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 3-gram | 150.64 | Yes | 0.72 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 5-gram | 156.70 | Yes | 0.71 (0.20) | 1.00 (0.00) | 0.50 (0.02) | No |
| Simon/Pitman-Yor Process and Related Language Model | | | | | | |
| HPYLM | (140.49)† | Yes | 0.73 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| Grammatical Model | | | | | | |
| PCFG | - | Yes | 0.73 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Neural Language Model (character based) | | | | | | |
| LSTM (no regularization) | (1.38)‡ | Yes | 0.79 (0.08) | 1.03 (0.01) | 0.50 (0.01) | No |
| AWD-LSTM | (1.18)‡ | Yes | 0.76 (0.12) | 1.10 (0.03) | 0.51 (0.02) | 0.40 (0.10) |
| Neural Language Model (word based) | | | | | | |
| Simple RNN | 123.96 | Yes | 0.71 (0.19) | 1.00 (0.01) | 0.50 (0.02) | 0.74 (Weak) |
| GRU | 85.05 | Yes | 0.71 (0.18) | 1.05 (0.02) | 0.50 (0.02) | 0.40 (Weak) |
| QRNN | 62.65 | Yes | 0.71 (0.18) | 1.10 (0.03) | 0.51 (0.02) | 0.54 (Weak) |
| LSTM (no regularization) | 111.79 | Yes | 0.71 (0.19) | 1.04 (0.01) | 0.51 (0.02) | 0.84 (Weak) |
| AWD-LSTM | 56.40 | Yes | 0.71 (0.18) | 1.06 (0.02) | 0.51 (0.03) | 0.69 (Weak) |
| AWD-LSTM-Simon | 57.85 | Yes | 0.72 (0.16) | 1.04 (0.01) | 0.51 (0.03) | No |
| AWD-LSTM-MoS | 54.77 | Yes | 0.71 (0.18) | 1.10 (0.03) | 0.52 (0.04) | 0.77 (Weak) |
| AWD-LSTM-MoS-Cache | 54.03 | Yes | 0.71 (0.18) | 1.13 (0.04) | 0.55 (0.06) | 0.61 (Weak) |
| AWD-LSTM-Cache | 52.51 | Yes | 0.72 (0.17) | 1.07 (0.02) | 0.53 (0.05) | 0.57 (Weak) |

The first blocks in each table indicate the properties of the original data sets with and without preprocessing. The second blocks list the results for shuffled data sets, which preserve parts of the n-gram structure. They were tested to check the behavior of the evaluation metrics on randomized texts. The shuffled data sets were expected to lose long memory and were largely different from the original natural language texts. The shuffling was conducted as follows. As an example, the text ABCDEFGHI was first split into 3-gram chunks, giving ABC/DEF/GHI. Then, the chunks were shuffled randomly to obtain a 3-gram shuffled data set (i.e., DEF/GHI/ABC). Note that this shuffling does not preserve some n-gram structures, such as BCD and FGH, in the original text. The remaining blocks correspond to the results for the language models introduced. The grammatical model category is absent in Table 2 because of the lack of a parsed corpus for WT2. Appendix B includes all figures showing the scaling properties.
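
The chunk-shuffling procedure described above can be sketched as follows. The function names and the `seed` argument are assumptions for this illustration, and a text is represented as a list of tokens.

```python
import random


def split_into_chunks(tokens, n):
    """Split a token list into consecutive chunks of length n.

    The last chunk may be shorter when len(tokens) is not a multiple of n.
    """
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def ngram_shuffle(tokens, n, seed=None):
    """Shuffle the order of the n-token chunks while keeping each chunk intact."""
    chunks = split_into_chunks(tokens, n)
    rng = random.Random(seed)
    rng.shuffle(chunks)
    return [token for chunk in chunks for token in chunk]
```

For the text ABCDEFGHI and n = 3, the chunks are ABC, DEF, and GHI. Any n-gram that straddles a chunk boundary in the original text, such as BCD, is generally destroyed by the shuffle, whereas the n-grams inside a chunk survive.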

The first columns of Table 2 and Table 3 list the perplexities of the language models. The blank symbol “-” appears in rows for which the perplexity is not available: the original and shuffled data sets are not language models, while the Simon/Pitman-Yor processes and the grammatical model have different definitions of probability and cannot be measured comparably with the n-gram and neural language models. The perplexity scores in parentheses were measured in the same manner but are not directly comparable with the other values because of their different implementations of preprocessing, as explained at the ends of §4.3 and §4.4.

In terms of perplexity, the neural language models consistently outperformed the n-gram models. Among the n-gram models, Kneser-Ney smoothing consistently outperformed the other smoothing techniques. The 3-gram models sometimes had better perplexity than the 5-gram models did, as the training data sets in this experiment were not especially large (see Table 1). Among the neural language models, the simple RNN model had the worst perplexity. The RNNs with a gating mechanism improved the perplexity over that of the simple RNN model. In particular, the AWD-LSTM model performed the best among the RNN language models. The additional architectures of the cache mechanism and MoS contributed to improving the perplexity.

6.1 Metrics of Scaling Properties

The proposed evaluation metrics should be compared with another evaluation metric that is assumed plausible. In this article, the perplexity is adopted as such a metric. As perplexity has been the standard evaluation metric in language modeling and the prediction accuracy is of primary importance for that application, we assess the metrics derived from the scaling properties by comparing them with the perplexity and considering how they correlate with it.

Columns 3–7 of Table 2 and Table 3 list the respective results for the scaling properties: Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and the long-range correlation. Even when the perplexity was not computable, the properties could all still be examined regardless of the kind of language model, except for Ebeling’s method, because it applies to characters. Overall, except for the long-range correlation, the results were consistent across the data sets: When a scaling law was followed by one data set, then it was also followed by the other data set.

All the language models qualitatively satisfied Zipf’s law. We indicate this by Yes in the tables for the reason stated in §3.1. Relatedly, all the language models also satisfied Heaps’ law. These two properties, however, are present even with a unigram language model. Despite their fame, Zipf’s law and Heaps’ law have no capacity to distinguish randomized and real text. It is therefore not a challenge for language models to satisfy Zipf’s and Heaps’ laws.

In contrast, the metrics of long memory were capable of quantifying the quality of machine-generated texts. For Ebeling’s method (first column of the Long Memory vertical block), the exponent of the original data set was η = 1.32 for WT2 and η = 1.23 for the PTB, whereas that of both shuffled data sets was η = 1.00, thus indicating no long memory in the latter. The neural language models had exponents between η = 1.10 and η = 1.30 for WT2, and between η = 1.04 and η = 1.13 for the PTB, whereas the other language models were the same as i.i.d. behavior. Ebeling’s method therefore could verify the text quality to a certain extent.

The last column in each table lists the results for the long-range correlation. If the text was not long-range correlated, this is denoted by No or Weak: No if more than one value was negative for s ≤ 10, or Weak if there was one negative value for s ≤ 100. Such arbitrariness of judgment is one disadvantage of this metric. In addition, even though it has good correspondence with the other two metrics of long memory, it has two further disadvantages. First, the exponent has poor correlation with the perplexity. The second disadvantage was exhibited in the degree of long-range correlation listed for the Simon model. The degree was high at the beginning and did not decay (see Figure A16 in Appendix B). As the Simon model had more new words later in a sequence, the correlation stayed large even for two sequences with a large distance between them. Therefore, this non-decaying phenomenon was due not to burstiness but to a different characteristic specific to the Simon process. The Taylor exponent for the Simon process was ζ = 0.50, indicating that the long-range correlation observed was not due to long memory behavior.

Finally, the Taylor exponent ζ seemed the most reliable metric among those derived from the scaling properties. The left panel of Figure 3 shows the correlation between the perplexity of the models and the Taylor exponent ζ. As the perplexity decreased, the Taylor exponent ζ showed a steep increase. Because the exponent quantifies the degree of burstiness of word occurrence, this result indicates that the better models in terms of perplexity can also reproduce that statistical property.
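
To make the metric concrete, the sketch below estimates a Taylor exponent from a word sequence, following the relation σ ∝ μ^ζ given in the table headers: the text is cut into segments, the mean μ and standard deviation σ of each word's per-segment count are computed, and ζ is taken as the slope of log σ against log μ. The segment size, the ordinary least-squares fit, and all function names are assumptions of this sketch and may differ from the procedure used for the reported values.

```python
import math
import random
from collections import Counter


def taylor_exponent(words, segment_size):
    """Estimate zeta in sigma ∝ mu**zeta from per-segment word counts."""
    n_segments = len(words) // segment_size
    if n_segments < 2:
        raise ValueError("need at least two full segments")
    counts = [Counter(words[i * segment_size:(i + 1) * segment_size])
              for i in range(n_segments)]
    log_mu, log_sigma = [], []
    for word in set().union(*counts):
        per_segment = [c[word] for c in counts]
        mu = sum(per_segment) / n_segments
        variance = sum((k - mu) ** 2 for k in per_segment) / n_segments
        if variance > 0:  # skip words whose count never varies
            log_mu.append(math.log(mu))
            log_sigma.append(0.5 * math.log(variance))
    mean_x = sum(log_mu) / len(log_mu)
    mean_y = sum(log_sigma) / len(log_sigma)
    sxx = sum((x - mean_x) ** 2 for x in log_mu)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_mu, log_sigma))
    if sxx == 0:
        raise ValueError("not enough distinct means to fit a slope")
    return sxy / sxx  # least-squares slope in log-log space


def zipf_iid_sample(n_words, vocabulary_size, seed):
    """Hypothetical helper: i.i.d. words with Zipf-like weights 1/rank."""
    rng = random.Random(seed)
    vocabulary = [f"w{rank}" for rank in range(1, vocabulary_size + 1)]
    weights = [1.0 / rank for rank in range(1, vocabulary_size + 1)]
    return rng.choices(vocabulary, weights=weights, k=n_words)
```

For an i.i.d. sequence such as the output of `zipf_iid_sample`, the estimate should come out close to 0.5, in line with the values reported for the shuffled data sets and the n-gram models; text in which word occurrences are bursty yields a larger exponent.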

Figure 3 

Scatter plots of the perplexity of various models with respect to the Taylor exponent ζ (left) and the perplexity of the eval-AWD-LSTM model (right) for the WT2 data set. (Left) The Taylor exponents of the n-gram language models were consistently ζ = 0.50, which indicates the absence of long memory. In contrast, the neural language models had Taylor exponents of ζ > 0.50, which indicates the presence of long memory in the generated texts. (Right) The perplexity of eval-AWD-LSTM had clear, positive correlation with the perplexities of the language models.

Overall, the scaling properties of long memory serve for evaluation of generated texts. The Taylor exponent ζ is especially useful for this purpose.

6.2 Comparison with PCFG- and Language-Model–Based Evaluation

Next, we test the effectiveness of using the negative log-likelihood from a PCFG (Rajeswar et al. 2017) and the perplexity obtained from a neural language model (Fedus, Goodfellow, and Dai 2018). The results show that PCFG-based evaluation is not effective, in contrast to evaluation based on the scaling properties.

In principle, the negative log-likelihood of a PCFG evaluates the grammaticality of text. Rajeswar et al. (2017) used the negative log-likelihood of a PCFG to evaluate GAN-generated texts. The scatter plots in Figure 4 show the average negative log-likelihood from a PCFG for the PTB data set (magenta), the PTB data set shuffled with 5-grams (green), and the AWD-LSTM-Cache model (blue). Because the PTB data set is annotated, the negative log-likelihood was calculated for every sentence, and the values were plotted for different sentence lengths. As for the other two cases, because the outputs had no sentence boundaries indicated in the training data, consecutive parts of a given length n were randomly extracted from the text and fed to the PCFG parser, and the negative log-likelihood was then calculated. The NLTK (Loper and Bird 2002) parser implementation was used in this work. The shaded area in red represents the upper and lower bounds of the original PTB data set.

Figure 4 

Average negative log-likelihood of a PCFG for different sentence lengths from the PTB data set (magenta), n-word chunks from the AWD-LSTM-Cache model (blue), and 5-grams from the shuffled PTB data set (green). The area shaded red represents the upper and lower bounds of the negative log-likelihood of the PCFG for the PTB data set.

The average negative log-likelihood of a sentence has a strong linear correlation with its length, and the values for the PTB data set were consistently lower than those for the generated text of the AWD-LSTM-Cache model and the 5-gram shuffled text. The differences from the original PTB data set, however, were not significant, even though the 5-gram and AWD-LSTM-Cache results were calculated merely for n-word random chunks. Moreover, the average values for the 5-gram shuffled text and the machine-generated text were within the range of the PTB’s upper and lower bounds. This indicates that the negative log-likelihood from a PCFG is probably not usable for evaluating machine-generated texts.

Apart from the PCFG, Fedus, Goodfellow, and Dai (2018) proposed evaluating the quality of GAN-generated texts with the perplexity computed from a neural language model. We next test whether that method provides a good measure of the language models considered here. Accordingly, we used the AWD-LSTM model to evaluate the texts generated by the n-gram and neural language models. To avoid confusion, we call this the eval-AWD-LSTM model. It was trained with the WT2 and PTB data sets to evaluate the texts generated by the various other models (including AWD-LSTM itself).

The perplexity of eval-AWD-LSTM was calculated for each machine-generated text by Equation (1). The rightmost columns of Tables 4 and 5 list the results, and the right panel of Figure 3 shows a scatter plot of the perplexity of the models with respect to the perplexity of eval-AWD-LSTM. This method seemed to work well, especially in globally distinguishing the n-gram and neural language model categories: The former category had perplexities above 600, whereas the latter category had almost all values below 200 for WT2. The eval-AWD-LSTM perplexity could not, however, detect the differences among the n-gram language models or among the neural language models (e.g., between Katz backoff and Kneser-Ney, or AWD-LSTM and AWD-LSTM-Cache). The bias caused by the evaluation model is also a problem with this method. In the experiment, AWD-LSTM was the best model by eval-AWD-LSTM evaluation for both the WT2 and PTB data sets. It is likely that worse-performing models whose behavior is similar to that of the evaluation model are evaluated more highly than are other models that have higher fluency but behave differently from the evaluation model.

Table 4
Evaluation of language models by using the AWD-LSTM model (trained with WT2), in comparison with using the perplexity and the Taylor exponent.

| Model | Perplexity | Taylor exponent | Perplexity from eval-AWD-LSTM |
|---|---|---|---|
| Original Data set | | | |
| Wikitext-2 (Preprocessed) | - | 0.62 (0.15) | 33.81 |
| Shuffled Data Set | | | |
| Wikitext-2 (1-gram) | - | 0.50 (0.02) | 7,389.15 |
| Wikitext-2 (2-gram) | - | 0.50 (0.02) | 2,405.15 |
| Wikitext-2 (5-gram) | - | 0.50 (0.02) | 559.92 |
| Wikitext-2 (10-gram) | - | 0.50 (0.02) | 236.49 |
| N-gram Language Model | | | |
| 3-gram | 837.58 | 0.50 (0.02) | 3,730.74 |
| 5-gram | 534.98 | 0.50 (0.02) | 7,532.91 |
| linear interpolation | 294.72 | 0.50 (0.02) | 1,371.75 |
| Katz backoff 3-gram | 285.14 | 0.50 (0.02) | 663.74 |
| Katz backoff 5-gram | 357.94 | 0.50 (0.02) | 664.25 |
| Kneser-Ney 3-gram | 204.15 | 0.50 (0.02) | 2,562.24 |
| Kneser-Ney 5-gram | 215.44 | 0.50 (0.02) | 2,743.65 |
| HPYLM | 184.34 | 0.50 (0.02) | 884.76 |
| Neural Language Model | | | |
| Simple RNN | 164.51 | 0.50 (0.02) | 645.64 |
| GRU | 96.22 | 0.52 (0.03) | 266.33 |
| QRNN | 74.74 | 0.52 (0.03) | 135.68 |
| LSTM (no regularization) | 113.18 | 0.52 (0.03) | 177.12 |
| AWD-LSTM | 64.27 | 0.58 (0.06) | 88.73 |
| AWD-LSTM-Simon | 61.59 | 0.55 (0.05) | 130.52 |
| AWD-LSTM-MoS | 62.44 | 0.54 (0.04) | 97.89 |
| AWD-LSTM-MoS-Cache | 59.21 | 0.57 (0.07) | 164.39 |
| AWD-LSTM-Cache | 50.39 | 0.59 (0.07) | 109.02 |

Table 5
Evaluation of language models by using the AWD-LSTM model (trained with the PTB), in comparison with using the perplexity and the Taylor exponent.

| Model | Perplexity | Taylor exponent | Perplexity from eval-AWD-LSTM |
|---|---|---|---|
| Original Data Set | | | |
| Penn Tree Bank (Preprocessed) | - | 0.56 (0.14) | 40.70 |
| Shuffled Data Set | | | |
| Penn Tree Bank (1-gram) | - | 0.50 (0.02) | 3,698.52 |
| Penn Tree Bank (2-gram) | - | 0.50 (0.02) | 1,328.39 |
| Penn Tree Bank (5-gram) | - | 0.50 (0.02) | 351.22 |
| Penn Tree Bank (10-gram) | - | 0.50 (0.02) | 166.93 |
| N-gram Language Model | | | |
| 3-gram | 367.79 | 0.50 (0.02) | 1,697.99 |
| 5-gram | 561.65 | 0.50 (0.02) | 3,463.88 |
| linear interpolation | 238.59 | 0.50 (0.02) | 965.58 |
| Katz backoff 3-gram | 195.65 | 0.50 (0.02) | 420.48 |
| Katz backoff 5-gram | 250.18 | 0.50 (0.02) | 471.03 |
| Kneser-Ney 3-gram | 150.64 | 0.50 (0.02) | 1,324.67 |
| Kneser-Ney 5-gram | 156.70 | 0.50 (0.02) | 1,411.14 |
| HPYLM | 140.49 | 0.50 (0.02) | 412.13 |
| Neural Language Model | | | |
| Simple RNN | 123.96 | 0.50 (0.02) | 321.31 |
| GRU | 85.05 | 0.50 (0.02) | 258.12 |
| QRNN | 62.65 | 0.51 (0.02) | 113.22 |
| LSTM (no regularization) | 113.18 | 0.51 (0.02) | 234.05 |
| AWD-LSTM | 64.27 | 0.51 (0.03) | 90.01 |
| AWD-LSTM-Simon | 61.59 | 0.51 (0.03) | 144.45 |
| AWD-LSTM-MoS | 62.44 | 0.52 (0.04) | 97.73 |
| AWD-LSTM-MoS-Cache | 59.21 | 0.55 (0.06) | 100.56 |
| AWD-LSTM-Cache | 50.39 | 0.53 (0.05) | 123.32 |

Overall, the evaluation methods using other language models were not consistent. The PCFG-based evaluation could not even clearly distinguish between the shuffled and original data sets. Evaluation based on a neural language model could detect the difference between the n-gram and neural language models, but it could not distinguish quality within those categories of language models. Compared with those methods, the Taylor exponent ζ had a clearer correlation with the perplexity of the models. Specifically, the exponent satisfied ζ = 0.50 for all n-gram language models. It was larger than 0.50 only for the neural language models whose perplexity was better than that of the n-gram language models. Among the neural language models, the Taylor exponent took high values for the AWD-LSTM family, which had better perplexity than the GRU and QRNN models and the LSTM model without regularization.

In this section, we apply the evaluation of metrics in §6.1 to discuss the scaling properties of the language models. All language models tested in the experiments satisfied the scaling properties of vocabulary population, namely, Zipf’s law and Heaps’ law. These properties are relatively easy for models to reproduce, because they concern the static probability distribution of words.

In contrast, many of the language models failed to reproduce long memory behavior. The sole exception was the Simon process, which presented strong long-range correlation, but this was not caused by burstiness, as explained in §6.1. The lack of long memory in n-gram language models is supported by an analytical argument about Markov models, as mentioned in §4.1. The failure of the PCFG model in our experiment setting can be explained by its lack of inter-sentence structure.

Even among the neural language models, the simple RNN model failed to reproduce long memory. The Taylor exponent was ζ = 0.50, and the other metrics also indicated that the generated text did not have long-range dependence. In contrast, the RNNs with a gating mechanism (LSTM, GRU, and QRNNs) could reproduce long memory behavior. The Taylor exponents of the GRU and QRNN language models were both ζ = 0.52 for WT2, which indicates the presence of long memory to a certain extent. The LSTM language models were consistently the best at reproducing long memory behavior of natural language text for WT2 and the PTB at both the character level and the word level.
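
As a rough illustration of how a Taylor exponent can be estimated, the sketch below splits a word sequence into fixed-size windows, computes the mean μ and standard deviation σ of each word’s per-window count, and fits the slope of log σ against log μ by ordinary least squares. The window size, the plain least-squares fit, and the synthetic i.i.d. Zipf-like sample are assumptions made for this example and need not match the settings used in the experiments. For an i.i.d. sequence the estimate should come out close to 0.5, the value associated here with the absence of long memory.

```python
import math
import random
from collections import Counter


def taylor_exponent(tokens, window=100):
    """Estimate zeta in sigma ∝ mu**zeta from per-window word counts."""
    n_seg = len(tokens) // window
    total = Counter()     # sum of per-window counts for each word
    total_sq = Counter()  # sum of squared per-window counts for each word
    for s in range(n_seg):
        segment = Counter(tokens[s * window:(s + 1) * window])
        for word, c in segment.items():
            total[word] += c
            total_sq[word] += c * c

    xs, ys = [], []
    for word in total:
        mu = total[word] / n_seg
        var = total_sq[word] / n_seg - mu * mu
        if var > 1e-12:  # skip words whose count never varies
            xs.append(math.log(mu))
            ys.append(0.5 * math.log(var))  # log of the standard deviation

    # Ordinary least-squares slope of log(sigma) on log(mu).
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic i.i.d. sample with Zipf-like word probabilities (assumed toy data).
rng = random.Random(0)
vocab = list(range(500))
weights = [1.0 / (rank + 1) for rank in vocab]
iid_tokens = rng.choices(vocab, weights=weights, k=50_000)

zeta = taylor_exponent(iid_tokens)
print(f"estimated Taylor exponent for i.i.d. text: {zeta:.2f}")
```

Applying the same function to a natural text and to text sampled from a model would give the kind of comparison discussed above, although the absolute values depend on the assumed window size and fitting method.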

Figure 5 shows (a) Zipf’s law and (b) Taylor’s law results for the AWD-LSTM-Cache model trained with WT2, which was the best performing model in terms of perplexity. Figure 5(a) demonstrates that the Zipf’s law behavior of the data set shown in Figure 1(a) was well recovered. Likewise, Figure 5(b) demonstrates how well the AWD-LSTM-Cache model captured and reproduced the Taylor’s law behavior shown in Figure 1(d). Whereas the Taylor exponent for the original data set was ζ = 0.62, the AWD-LSTM-Cache model had a Taylor exponent of ζ = 0.59 for WT2. The data points in Figure 1(d) were more widely scattered around the regression line than those in Figure 5(b). Even with the well-performing neural language models, however, the scaling properties of long memory were not fully recovered. These differences represent gaps between the natural language text and the language model, which may indicate room for improvement.

Figure 5 

Scaling properties of the AWD-LSTM-Cache model trained with WT2.

Finally, we discuss the possibility of evaluating GAN-generated text with the scaling properties. Table 6 lists the scaling properties for the COCO image data set (Lin et al. 2014). Because current GAN models for text generation cannot produce long texts, image captions constitute the standard data set for these models. As a consequence of this data set, the GAN models are limited to generating a single text type (i.e., image captions). In particular, because the texts are short, the models cannot be expected to reproduce long memory behavior. Yet it is worthwhile to test the vocabulary population of the GAN models to understand their capacity.

Table 6 
Summary of statistics for the COCO image data set.
| Data set | Tokens | Vocab. | Vocabulary Population: Zipf’s Law f(r) ∝ r^−α | Vocabulary Population: Heaps’ Law v(n) ∝ n^β | Long Memory: Ebeling’s Method m(l) ∝ l^η | Long Memory: Taylor’s Law σ ∝ μ^ζ | Long Memory: Long Range Correlation c(s) ∝ s^−ξ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image COCO (English, collection of image captions), original data set | 105,933 | 6,095 | Yes | 0.76 (0.09) | 0.99 (0.03) | 0.50 (0.04) | No |

Figures 6 and 7 show Zipf’s and Taylor’s law graphs for the original data set and the text generated by SeqGAN (Yu et al. 2017), respectively. Unlike the other language models, GAN models for text generation had problems reproducing Zipf’s law. The tail decay for the generated text was faster than that for the data set. The vocabulary size of the generated text was only v(n) = 1,822 words for n = 118,264 generated words, whereas the original text had a vocabulary size v(n) = 6,095 for n = 105,933 words. This result indicates that the GAN model could not produce the infrequent words in the training data set.
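
The vocabulary figures quoted above can be put side by side with a small calculation. The following sketch uses only the reported token and vocabulary counts; the helper `vocabulary_growth` and the toy token list are hypothetical additions that show how v(n) would be traced on an actual token sequence.

```python
def vocabulary_growth(tokens):
    """Return v(n), the number of distinct words among the first n tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Counts reported in the text for the COCO captions and the SeqGAN output.
original = {"tokens": 105_933, "vocab": 6_095}
generated = {"tokens": 118_264, "vocab": 1_822}

original_ratio = original["vocab"] / original["tokens"]
generated_ratio = generated["vocab"] / generated["tokens"]
print(f"types per token, original data set: {original_ratio:.4f}")
print(f"types per token, SeqGAN output:      {generated_ratio:.4f}")

# Toy sequence (illustrative only) to show the shape of v(n).
toy_growth = vocabulary_growth("a b a c b d".split())
print(toy_growth)
```

On these counts, the generated text uses fewer than a third as many distinct words per token as the original captions, even though it is slightly longer.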

Figure 6 

Scaling properties of captions in the COCO image data set.

Figure 7 

Scaling properties of captions generated from the COCO image data set by SeqGAN.

On the other hand, long memory was already absent at the level of the training data set. The Taylor exponent was ζ = 0.50 (Figure 6(b)), indicating no memory; this was expected, as the captions were shuffled and two consecutive captions had no relation. Because the GAN model learned from such training data and produced text caption by caption, the generated text also had no long memory (Figure 7(b)). Indeed, long memory analysis requires a model to generate a sufficiently long text to allow further quality evaluation of natural language.

Nevertheless, other metrics would not provide a better evaluation in this case. Table 7 lists the evaluation metrics of BLEU and perplexity by eval-AWD-LSTM for texts generated using different GAN techniques. The BLEU scores for the GAN models in Table 7 were extracted from Zhu et al. (2018). The perplexity scores were computed by using the eval-AWD-LSTM model trained with the COCO image data set and the hyperparameters for the PTB data set. The perplexity of AWD-LSTM when trained with that data set was 65.41.

Table 7 
BLEU scores and perplexity for eval-AWD-LSTM-based evaluation on texts generated from the COCO image data set by different GAN models: SeqGAN (Yu et al. 2017), MaliGAN (Che et al. 2017), RankGAN (Lin et al. 2017), LeakGAN (Guo et al. 2018), and TextGAN (Zhang et al. 2017).
| Metric | SeqGAN | MaliGAN | RankGAN | LeakGAN | TextGAN | MLE | ImageCoco |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BLEU-2 | 0.92 | 0.89 | 0.94 | 0.93 | 0.65 | 0.92 | 1.00 |
| BLEU-3 | 0.75 | 0.70 | 0.80 | 0.82 | 0.65 | 0.68 | 1.00 |
| BLEU-4 | 0.53 | 0.48 | 0.60 | 0.66 | 0.60 | 0.57 | 1.00 |
| BLEU-5 | 0.35 | 0.31 | 0.41 | 0.47 | 0.52 | 0.39 | 1.00 |
| eval-AWD-LSTM | 179.29 | 272.53 | 132.90 | 146.26 | 129.93 | 176.34 | 44.17 |

For both BLEU and perplexity, the results were inconsistent. In terms of BLEU, the best-performing GAN model varied among RankGAN with BLEU-2, LeakGAN with BLEU-3 and BLEU-4, and TextGAN with BLEU-5. In contrast, TextGAN was the best model in terms of eval-AWD-LSTM. In addition to these metrics, the negative log-likelihood of the PCFG was also not effective in evaluating the GAN models in Zhu et al. (2018).
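
The disagreement among the metrics can be read directly off Table 7. The short sketch below copies the scores of the five GAN models from the table and picks the best model under each metric, taking higher BLEU and lower eval-AWD-LSTM perplexity as better; the variable names are illustrative, and the MLE and ImageCoco columns are left out because they are not GAN models.

```python
# Scores copied from Table 7 for the five GAN models (the MLE and
# ImageCoco reference columns are left out of the comparison).
models = ["SeqGAN", "MaliGAN", "RankGAN", "LeakGAN", "TextGAN"]
bleu = {
    "BLEU-2": [0.92, 0.89, 0.94, 0.93, 0.65],
    "BLEU-3": [0.75, 0.70, 0.80, 0.82, 0.65],
    "BLEU-4": [0.53, 0.48, 0.60, 0.66, 0.60],
    "BLEU-5": [0.35, 0.31, 0.41, 0.47, 0.52],
}
perplexity = [179.29, 272.53, 132.90, 146.26, 129.93]

best = {}
for metric, scores in bleu.items():
    best[metric] = models[scores.index(max(scores))]  # higher BLEU is better
# Lower perplexity under the evaluating model is better.
best["eval-AWD-LSTM"] = models[perplexity.index(min(perplexity))]

for metric, model in best.items():
    print(f"{metric}: {model}")
```

Three different models come out on top depending on the metric chosen, which is the inconsistency described above.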

Although rigorous quantitative evaluation is necessary for comparing GAN models, the existing evaluation metrics are not sufficiently reliable. Therefore, further study of evaluation metrics is necessary. The Taylor exponent may play a role in such studies when GAN-based models become able to produce longer texts.

In this article, we have investigated the scaling properties of computational models of natural language and analyzed whether these metrics could serve for assessing the models. The scaling properties quantify the vocabulary population and long memory behavior, which are universal qualities of natural language text. These metrics are applicable to any model, even those for which the perplexity is not measurable or a reference is not available. We tested n-gram language models, a grammatical model, mathematical models, neural language models, and GAN models for text generation. Among the five scaling properties introduced, the exponent of Taylor’s law showed the most reasonable behavior. It had the clearest correlation with the perplexity of the n-gram and neural language models.

Our analysis demonstrated that RNNs with a gating mechanism (LSTM, GRU, and QRNNs) are the first computational models of natural language that have the capacity to reproduce the long memory in natural language text. No other models tested in our experiment reproduced the scaling properties of long memory. The LSTM models were the best among the neural language models, as their long memory behavior was closer to that of the original text as compared to the GRU and QRNN models. Yet even the LSTM language models could not entirely recover long memory, including the exponents of the scaling properties. This observation confirms the gap between natural language text and language models and suggests corresponding room for improvement. Our future work will include investigating other scaling properties that could serve for evaluating language models.

This section presents the figures for the scaling properties of data sets that appear in this paper. The presence of the scaling properties is robust to the genre and the language of the text.

Figure A1 

Scaling properties of the collected works of Shakespeare.

Figure A2 

Scaling properties of Hong Lou Meng.

Figure A3 

Scaling properties of the Penn-Treebank (original).

Figure A4 

Scaling properties of Wikitext-2 (original).

Figure A5 

Scaling properties of captions in the COCO image data set.

This section presents the figures for the scaling properties of language models of WT2 in this article.

Figure A6 

Scaling properties of 3-gram language model.

Figure A7 

Scaling properties of 5-gram language model.

Figure A8 

Scaling properties of linear interpolation n-gram language model.

Figure A9 

Scaling properties of the Katz backoff 3-gram language model.

Figure A10 

Scaling properties of Katz backoff 5-gram language model.

Figure A11 

Scaling properties of Kneser-Ney 3-gram language model.

Figure A12 

Scaling properties of Kneser-Ney 5-gram language model.

Figure A13 

Scaling properties of hierarchical Pitman-Yor language model.

Figure A14 

Scaling properties of the PCFG constructed from PTB data set.

Figure A15 

Scaling properties of the Simon process. The panel for Ebeling’s method is omitted because applying the method is inappropriate in this case.

Figure A16 

Scaling properties of the Pitman-Yor process. The panel for Ebeling’s method is omitted because applying the method is inappropriate in this case.

Figure A17 

Scaling properties of Simple RNN language model.

Figure A18 

Scaling properties of GRU language model.

Figure A19 

Scaling properties of QRNN language model.

Figure A20 

Scaling properties of LSTM without regularization language model.

Figure A21 

Scaling properties of AWD-LSTM.

Figure A22 

Scaling properties of AWD-LSTM-Simon.

Figure A23 

Scaling properties of AWD-LSTM-MoS.

Figure A24 

Scaling properties of AWD-LSTM-MoS-Cache.

Figure A25 

Scaling properties of AWD-LSTM-Cache.

Figure A26 

Scaling properties of LSTM without regularization for character-level modeling.

Figure A27 

Scaling properties of AWD-LSTM for character-level modeling.

Figure A28 

Scaling properties of SeqGAN trained on the COCO image data set.

1. The implementation used in the experiment is available at https://github.com/musyoku/hpylm. Although HPYLM is an n-gram language model and it is possible to calculate the perplexity, the resulting value is not comparable with those of other n-gram language models and neural language models. Specifically, the training data requires using <BOS> and <EOS> to signify the beginning and end of a sentence, respectively. This decreases the perplexity because of the regularities introduced by these insertions, such as <EOS> being almost always followed by <BOS>.

Altmann, Eduardo G. and Martin Gerlach. 2017. Statistical laws in linguistics. In M. D. Espositi, E. G. Altmann, and F. Pachet, editors, Creativity and Universality in Language, pages 7–26.

Altmann, Eduardo G., Janet B. Pierrehumbert, and Adilson E. Motter. 2009. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678.

Baeza-Yates, Ricardo and Gonzalo Navarro. 2000. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69–82.

Bradbury, James, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proceedings of International Conference on Learning Representations, Toulon.

Che, Tong, Yanran Li, Ruixiang Zhang, Devon R. Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.

Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.

Cho, Kyunghyun, Bart van Merriënboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha.

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review, 51(4):661–703.

Ebeling, Werner and Alexander Neiman. 1995. Long-range correlations between letters and sentences in texts. Physica A, 215(3):233–241.

Ebeling, Werner and Thorsten Pöschel. 1994. Entropy and long-range correlations in literary English. Europhysics Letters, 26(4):241–246.

Eisler, Zoltán, Imre Bartos, and Janos Kertész. 2007. Fluctuation scaling in complex systems: Taylor’s law and beyond. Advances in Physics, 57(1):89–142.

Fedus, William, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the _______. In Proceedings of International Conference on Learning Representations, Vancouver.

Forney, G. David. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.

Gerlach, Martin and Eduardo G. Altmann. 2013. Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):021006.

Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12:2335–2382.

Grave, Edouard, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. In Proceedings of International Conference on Learning Representations, Toulon.

Guo, Jiaxian, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the Thirty-Second AAAI Conference, pages 5141–5148, New Orleans, LA.

Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.

Kingman, J. F. C. 1963. The exponential decay of Markov transition probabilities. Proceedings of the London Mathematical Society, s3-13(1):337–358.

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for M-gram language modeling. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181–184, Ann Arbor, MI.

Kobayashi, Tatsuru and Kumiko Tanaka-Ishii. 2018. Taylor’s law for human linguistic sequences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1138–1148, Melbourne.

van Leijenhorst, Dick and Theo van der Weide. 2005. A formal derivation of Heaps’ law. Information Sciences, 170(2–4):263–272.

Lennartz, Sabine and Armin Bunde. 2009. Eliminating finite-size effects and detecting the amount of white noise in short records with long-term memory. Physical Review E, 79(6):066101.

Li, Wentian. 1989. Mutual information functions of natural language texts. Santa Fe Institute Working Paper 89-10-008.

Lin, Chin Yew. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics Workshop, pages 74–81, Barcelona.

Lin, Henry W. and Max Tegmark. 2017. Critical behavior in physics and probabilistic formal languages. Entropy, 19(7):299.

Lin, Kevin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165, Long Beach, CA.

Lin, Tsung Yi, Michael Maire, Serge Belongie, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755, Zurich.

Loper, Edward and Steven Bird. 2002. Nltk: The natural language toolkit. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics Workshop, pages 63–70, Philadelphia, PA.

Lü, Linyuan, Zi-Ke Zhang, and Tao Zhou. 2010. Zipf’s law leads to Heaps’ law: Analyzing their relation in finite-size systems. PLoS One, 5(12):e14139.

Lu, Sidi, Lantao Yu, Weinan Zhang, and Yong Yu. 2018. Cot: Cooperative training for generative modeling. arXiv preprint arXiv:1804.03782.

Manning, Chris and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Melis, Gábor, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proceedings of International Conference on Learning Representations, Vancouver.

Merity, Stephen, Nitish Keskar, and Richard Socher. 2018a. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2018b. Regularizing and optimizing LSTM language models. In Proceedings of International Conference on Learning Representations, Vancouver.

Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In Proceedings of International Conference on Learning Representations, San Juan, PR.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048, Chiba.

Mikolov, Tomáš and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In IEEE Workshop on Spoken Language Technology, pages 234–239, Miami, FL.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA.

Rajeswar, Sai, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. 2017. Adversarial generation of natural language. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics Workshop, pages 241–251, Vancouver.

Simon, Herbert A. 1955. On a class of skew distribution functions. Biometrika, 42(3/4):425–440.

Smith, H. Fairfield. 1938. An empirical law describing heterogeneity in the yields of agricultural crops. Journal of Agriculture Science, 28(1):1–23.

Stolcke, Andreas. 2002. Srilm - an extensible language modeling toolkit. In International Conference on Spoken Language Processing, pages 901–904, Denver, CO.

Takahashi, Shuntaro and Kumiko Tanaka-Ishii. 2017. Do neural nets learn statistical laws behind natural language? PLoS One, 12(12):e0189326.

Tanaka-Ishii, Kumiko and Armin Bunde. 2016. Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11):e0164658.

Tanaka-Ishii, Kumiko and Tatsuru Kobayashi. 2018. Taylor’s law for linguistic sequences and random walk models. Journal of Physics Communications, 2(11):115024.

Taylor, Lionel Roy. 1961. Aggregation, variance and the mean. Nature, 189(4766):732–735.

Teh, Yee Whye. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney.

Yang, Zhilin, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of International Conference on Learning Representations, Vancouver.

Yu, Lantao, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of The Thirty-First AAAI Conference, pages 2852–2858, San Francisco, CA.

Zhang, Yizhe, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.

Zhu, Yaoming, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886.

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits you to copy and redistribute in any medium or format, for non-commercial use only, provided that the original work is not remixed, transformed, or built upon, and that appropriate credit to the original source is given. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.