Abstract
In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test n-gram language models, a probabilistic context-free grammar, language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks for text generation. Our analysis reveals that language models based on recurrent neural networks with a gating mechanism (i.e., long short-term memory; a gated recurrent unit; and quasi-recurrent neural networks) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor’s law is a good indicator of model quality.
1. Introduction
The question of evaluation methods for computational models of natural language is fundamental in language engineering. Aside from human rating, current evaluation methods rely on the probability distribution produced by the model, or on the n-gram similarity between the generated text and a corresponding reference written by human experts. The representative metric of the former type is perplexity. Perplexity quantifies the prediction accuracy of a language model and thus requires its probability distribution. The latter category includes the metrics BLEU (Papineni et al. 2002) and ROUGE (Lin 2004). These evaluation methods compute the n-gram co-occurrence between the generated text and a reference. Hence, these methods are reasonable for cases in which either the probability distribution of the computational model is explicit and comparable or a corresponding reference is given.
The emergence of intractable models such as generative adversarial networks (GANs) for text generation has revealed the limitation of these conventional evaluation methods. Tentative studies (Lin et al. 2017; Rajeswar et al. 2017; Yu et al. 2017; Guo et al. 2018; Lu et al. 2018) have sought to generate natural language text in the adversarial learning framework. Because these models do not explicitly output the probability distribution for prediction, they are evaluated by feeding the generated text to other models, such as a neural language model (Fedus, Goodfellow, and Dai 2018) or a probabilistic context-free grammar (PCFG) (Rajeswar et al. 2017). Although those proposals are promising and worth considering, the effectiveness of the methods for evaluation has not been thoroughly investigated. As an alternative to those approaches, in this article we test evaluation with the scaling properties of natural language text.
The scaling properties of natural language are the universal statistical behaviors observed in natural language text. For example, Zipf’s law characterizes the vocabulary population with a power-law function for the rank-frequency distribution. Recent statistical mechanical studies (Ebeling and Neiman 1995; Altmann, Pierrehumbert, and Motter 2009; Tanaka-Ishii and Bunde 2016; Kobayashi and Tanaka-Ishii 2018; Tanaka-Ishii and Kobayashi 2018) revealed another statistical aspect of natural language: long memory. This refers to the way that sequences of characters or words in natural language universally exhibit clustering, bursty behavior. In particular, results using Taylor’s law (Kobayashi and Tanaka-Ishii 2018; Tanaka-Ishii and Kobayashi 2018) show that a natural language text has a consistent range for the Taylor exponent, which quantifies the degree of burstiness in the text.
As the results obtained with scaling properties have clear interpretations, they suggest qualitative implications for language models. For example, evaluation with Zipf’s law examines whether a model can properly produce infrequent words. Similarly, evaluation with Taylor’s law quantifies whether a model can learn the long memory in a natural language text. In this article, we show that, among the computational models, only neural language models based on recurrent neural networks (RNNs) with a gating mechanism can learn and reproduce the long memory of natural language text. None of the other models can reproduce this behavior. In addition, our study demonstrates the capabilities of the scaling properties for evaluating language models.
The rest of the article is organized as follows. In §2, we review the evaluation metrics that have been widely used for tasks in natural language processing. In §3, we introduce the scaling properties of natural language: those given by Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and long-range correlation analysis. We also explain the methods of applying these scaling properties to evaluate computational models. In §4, we provide a summary of the models of natural language considered in this article. Specifically, our work covers n-gram language models, mathematical language models based on the Simon and Pitman-Yor processes, grammatical models, and neural language models. The experimental procedure and settings are explained in §5. In §6, we assess the scaling properties as evaluation metrics and compare them with other metrics using a PCFG and neural language models. In §7, we use the scaling properties to evaluate the models of natural language and discuss the implications of the results. §8 discusses evaluation of GAN models for text generation. Finally, we conclude our work with a summary in §9.
Note that we describe all computational models of natural language considered in this article, as introduced in §4, by the term language model. For some readers this might sound inadequate, because some of these models do not actually form a model to predict subsequent words (e.g., a PCFG and the models based on the Simon and Pitman-Yor processes). Because the term computational models of natural language is long, however, for the sake of brevity we simply use the term language models.
2. Previous Evaluation Metrics
There are two major approaches to evaluate a language model:
- directly inspecting some subpart of the model, or
- verifying the output generated by the model.
2.1 Evaluation Using Probability Distribution: Perplexity
Because perplexity is the current standard metric for automatic evaluation of model quality, the other metrics appearing in this article are compared with the perplexity.
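Concretely, perplexity is the exponentiated average negative log-likelihood per token that a model assigns to a held-out text. The following minimal sketch illustrates the computation; the token probabilities are hypothetical, and equation (1) of the article is not reproduced here.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of a held-out text."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model might assign to four consecutive tokens.
probs = [0.10, 0.02, 0.30, 0.05]
print(perplexity([math.log(p) for p in probs]))  # lower is better
```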
2.2 Evaluation Using Reference: BLEU/ROUGE
Another popular evaluation metric is the n-gram co-occurrence–based approach, including BLEU (Papineni et al. 2002) and ROUGE (Lin 2004). These metrics are widely used in paired-corpus-oriented tasks such as machine translation and automatic summarization. They evaluate by using statistics of the counts of the same n-grams appearing in the machine-generated text and a corresponding reference, which is a correct answer written by an expert.
These approaches only use the output of a model and thus do not require access to any of its internal elements. Because they require the corresponding reference for computing the n-gram co-occurrence, however, their utility is limited to paired-corpus tasks.
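To make the underlying statistic concrete, the sketch below computes a clipped (modified) n-gram precision between a candidate and a single reference, which is the core count aggregated by BLEU; it omits the combination over several n-gram orders and the brevity penalty of the full metric.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(modified_ngram_precision(cand, ref, n=2))
```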
Because intractable models such as GANs for text generation cannot have an explicit reference, the application of BLEU or ROUGE to those models is not trivial. A series of GAN studies (Yu et al. 2017; Lin et al. 2017; Guo et al. 2018; Lu et al. 2018) quantitatively measured the quality of the generated text with BLEU by regarding the whole training data set as a reference. The validity of this evaluation method remains questionable, as BLEU was designed for comparison between a pair of a machine-generated text and its correct reference. Zhu et al. (2018) reported that the application of BLEU with this approach does not provide consistent results with different n-grams chosen.
2.3 Evaluation Using Other Language Models
One approach for evaluation without using either a model distribution or a reference is the use of language models, that is, evaluation of language models by using other language models. Fedus, Goodfellow, and Dai (2018) proposed evaluating GAN-generated text with a neural language model trained with the same natural language data set. This direction is promising, if the language model is a reliable model of natural language. Even with state-of-the-art neural language models, however, the model quality is limited.
The use of a clear, transparent model for evaluation, such as an n-gram language model, would also be possible. That approach, however, could only measure how well a model reproduces the n-gram structure of natural language and would thus be similar to BLEU evaluation. The use of a PCFG is another possible method of evaluation without a reference. A PCFG is constructed from a parsed corpus such as the Penn Treebank (PTB), and the generated text is parsed with the Viterbi algorithm (Forney 1973), which computes the log-likelihood of the text. The PCFG is expected to output a small negative log-likelihood for a grammatically correct sentence. As we demonstrate later, however, it is doubtful that a PCFG can meaningfully evaluate the grammaticality of a sentence.
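A minimal sketch of this evaluation procedure using NLTK is given below. It induces a PCFG from the small treebank sample bundled with NLTK (only a fragment of the PTB, used purely for illustration) and reports the negative log-probability of the Viterbi parse of a tokenized sentence; the example sentence is arbitrary, and the handling of unknown words and unparsable inputs is deliberately crude.

```python
import nltk
from nltk.grammar import Nonterminal, induce_pcfg
from nltk.parse import ViterbiParser

nltk.download("treebank", quiet=True)

# Induce a PCFG from a fragment of the annotated treebank shipped with NLTK.
productions = []
for tree in nltk.corpus.treebank.parsed_sents()[:200]:
    productions.extend(tree.productions())
grammar = induce_pcfg(Nonterminal("S"), productions)
parser = ViterbiParser(grammar)

sentence = "the company said it expects higher profit".split()
try:
    parses = list(parser.parse(sentence))
    if parses:
        print(-parses[0].logprob())  # negative log-probability of the best parse
    else:
        print("no parse found")
except ValueError:
    print("sentence contains words not covered by the grammar")
```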
3. Scaling Properties of Natural Language for Evaluation
In this section, we explain scaling properties, the statistical properties of natural language text that have a power-law form. One study on the statistics of natural language reported nine scaling laws (Altmann and Gerlach 2017). Four of them concern word formation and a network structure, which do not directly relate to language modeling. This leaves five scaling properties, which can be categorized into those for the vocabulary population and those for long memory. These properties are characterized by power-law functions, which involve a power exponent. The exponents of the scaling properties have the capability to characterize the degree of each property. They therefore serve to evaluate whether a language model has the same behavior as natural language text. Specifically, given a text generated by a language model, we set two levels of assessment for evaluation:
Q1 Does the scaling property hold qualitatively?
Q2 How does the exponent differ from that of the training data?
The data points are regressed on a log-log scale. The regression method could be a problem if the errors between the data points and the fitting function are not Gaussian-distributed. There are other proposed regression methods such as maximum likelihood estimation for Zipf’s law (Clauset, Shalizi, and Newman 2009; Gerlach and Altmann 2013). In this article, however, because exponents obtained with the least-squares method are effective in distinguishing machine-generated text from natural language text, and because this method has been a conventional standard, we adopt it for estimation.
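As a concrete illustration of this estimation, the sketch below fits a power-law exponent by ordinary least squares in log-log space and applies it to a Zipf-style rank-frequency distribution. The input file name is hypothetical, and the error shown is a simple RMS residual around the fit, which need not coincide with the ε values reported in this article.

```python
import numpy as np
from collections import Counter

def fit_power_law(x, y):
    """Least-squares fit of log10(y) = c * log10(x) + b; returns exponent c and RMS error."""
    logx, logy = np.log10(x), np.log10(y)
    c, b = np.polyfit(logx, logy, 1)
    rms = float(np.sqrt(np.mean((logy - (c * logx + b)) ** 2)))
    return c, rms

# Example: estimate the Zipf exponent alpha of a tokenized text (hypothetical file).
words = open("corpus.txt").read().split()
freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)
slope, err = fit_power_law(ranks, freqs)
print(f"Zipf exponent alpha = {-slope:.2f} (RMS error {err:.2f})")
```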
The following subsections introduce the five scaling properties: Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and a long-range correlation method. As an example, Figure 1 shows a visual presentation of these methods for the wikitext-2 (WT2) data set (Merity et al. 2016). WT2 is a collected corpus of well-written Wikipedia articles, preprocessed by replacing rare words having frequencies under a certain threshold with a meta symbol, <unk>. The details of the data set appear in the first row of Table 1, later in §3.3.
Figure 1. Scaling properties of the WT2 data set. (a) Zipf’s law: The rank-frequency distributions of words (red) and word pairs (blue). (b) Heaps’ law: The growth of vocabulary size with text length. The solid line is a power-law fitting, and the dashed line represents a power law with exponent α = 1.0, meaning that all words in a sequence are unique. (c) Ebeling’s method: Fluctuation analysis of character occurrence. (d) Taylor’s law: Mean-variance relation of word occurrence. (e) Long-range correlation: Temporal correlation of the sequence of the return intervals of rare words. All data points of these five scaling properties are plotted on a log-log scale.
3.1 Vocabulary Population
3.1.1 Zipf’s Law.
3.1.2 Heaps’ Law.
3.2 Long Memory
The statistical mechanics domain has introduced two approaches for quantifying long memory in a time series: fluctuation analysis and the long-range correlation method. We introduce two fluctuation analysis methods, one for characters and one for words, and one long-range correlation method, applied to words. Although these methods are related analytically for a well-formed time series (Eisler, Bartos, and Kertész 2007), the relation is nontrivial for real phenomena.
3.2.1 Ebeling’s Method.
3.2.2 Taylor’s Law.
Taylor’s law was originally reported in two pioneering works (Smith 1938; Taylor 1961) and has been applied in various domains (Eisler, Bartos, and Kertész 2007). It describes the power-law relation between the mean and the variance in spatiotemporal observations. In this article, we apply Taylor’s law for natural language text as proposed by Kobayashi and Tanaka-Ishii (2018) and Tanaka-Ishii and Kobayashi (2018).
Figure 1(d) shows the Taylor’s law plot for WT2 with l = 5,620 (l can be any value larger than 1). The scatter plot generally follows a power-law function with exponent ζ = 0.62 and has some deviation from the regression line, with error ε = 0.15.
The Taylor exponent takes the range of values 0.50 ≤ ζ ≤ 1.00, and the two limit values ζ = 0.50, 1.0 have clear interpretations. For an i.i.d. process, it is proved that ζ = 0.50. On the other hand, one case with ζ = 1.0 occurs when all segments of length l contain the elements of W with the same proportions. For example, given W = {a, b}, suppose that b always occurs twice as often as a in all segments (e.g., one segment with three a and six b, another segment with one a and two b). Then, both the mean and standard deviation for b are twice those for a, and thus ζ = 1.0. Therefore, the Taylor exponent quantifies how consistently words co-occur in a text. The Taylor exponent of a natural language text typically has a range of 0.55 ≤ ζ ≤ 0.65 and never takes ζ = 0.50 (which would indicate no long memory). It takes different ranges of values for different types of sequences (e.g., child-directed speech and programming source code). It is therefore expected to have the capability to evaluate machine-generated text.
Ebeling’s method and Taylor’s law analysis have the following two differences. First, Ebeling’s method analyzes the growth of the variance m(l) with respect to the length of the subsequences, l, and Taylor’s law analyzes the variance with respect to the mean frequency within a fixed subsequence length. Second, to acquire an exponent for a text, Ebeling’s method takes the sum of the variances over all symbols, whereas Taylor’s law obtains the exponent from the individual points for all words.
For the latter reason, Ebeling’s method is influenced by a small number of frequently appearing symbols. Because it involves the sum of the variances of all symbols that follow the power law, the behavior of the exponent η often tends to be less sensitive than that of the Taylor exponent.
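The Taylor exponent can be estimated with a short script like the following sketch: the word sequence is cut into non-overlapping segments of length l, the mean and standard deviation of each word’s count across segments are computed, and ζ is fitted by least squares on a log-log scale. The input file is hypothetical, and the exclusion of zero-variance words is a simplifying assumption.

```python
import numpy as np
from collections import Counter

def taylor_exponent(words, l=5620):
    """Estimate zeta in sigma ~ mu^zeta from word counts in segments of length l."""
    segments = [words[i:i + l] for i in range(0, len(words) - l + 1, l)]
    counts = [Counter(seg) for seg in segments]
    mus, sigmas = [], []
    for w in set(words):
        c = np.array([cnt[w] for cnt in counts], dtype=float)
        if c.mean() > 0 and c.std() > 0:  # zero-variance words cannot be placed on a log-log plot
            mus.append(c.mean())
            sigmas.append(c.std())
    zeta, _ = np.polyfit(np.log10(mus), np.log10(sigmas), 1)
    return zeta

words = open("corpus.txt").read().split()  # hypothetical tokenized text
print(taylor_exponent(words))
```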
3.2.3 Long-Range Correlation.
Long-range correlation analysis quantifies the burstiness of word occurrence in a natural language text. The analysis measures the degree of self-similarity within a sequence. Among such analyses, early works proposed mutual-information-based methods (Li 1989; Ebeling and Pöschel 1994; Lin and Tegmark 2017). Such methods compute the mutual information between characters separated by s characters. These works reported that the mutual information decays according to a power law with the distance s. Takahashi and Tanaka-Ishii (2017) showed, however, that the mutual information method cannot quantify the long-range dependence in word sequences. Moreover, the mutual information between characters decays quickly and reaches a plateau at a distance s ≈ 10^1 for natural language texts such as the collected works of Shakespeare and the PTB data set.
Because the autocorrelation function (ACF) is applicable only to numerical time series, applying this method to natural language text requires transforming the sequence of symbols into a numerical time series. Recent methods do so by considering the intervals of word occurrences (Tanaka-Ishii and Bunde 2016). In this article, we apply a method that measures the ACF of the sequence of return intervals of rare words, whose number amounts to a fixed fraction of the text length. With this method, Tanaka-Ishii and Bunde (2016) reported that power-law decay of the ACF is observed for natural language texts.
Figure 1(e) shows the long-range correlation analysis of word sequences in WT2. The hyperparameter was set to Q = 16 for all results in this article. As seen in the figure, the ACF c(s) always takes positive values up to 1/100 of the sequence length and follows a power-law function (i.e., a straight line in a log-log plot) with exponent ξ = 0.33 and error ε = 0.04. Throughout this article, the error ε of this metric is only measured for s ≤ 100.
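A sketch of this long-range correlation analysis is given below. The selection of “rare words” is an assumption made for illustration (the least frequent words whose occurrences cover a fixed fraction of the tokens); the article’s exact criterion, governed by the hyperparameter Q, is not reproduced, and the input file and chosen lags are hypothetical.

```python
import numpy as np
from collections import Counter

def autocorrelation(x, s):
    """Sample autocorrelation c(s) of a numerical sequence x at lag s."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-s], x[s:]) / (len(x) - s) / x.var())

def rare_word_interval_acf(words, rare_fraction=1 / 16, lags=(1, 2, 4, 8, 16, 32)):
    """ACF of the sequence of return intervals of rare words.

    'Rare' is approximated here as the least frequent words whose occurrences
    make up about `rare_fraction` of all tokens (an illustrative assumption).
    """
    freq = Counter(words)
    ordered = sorted(freq, key=freq.get)  # least frequent first
    rare, covered = set(), 0
    for w in ordered:
        rare.add(w)
        covered += freq[w]
        if covered >= rare_fraction * len(words):
            break
    positions = [i for i, w in enumerate(words) if w in rare]
    intervals = np.diff(positions)
    return {s: autocorrelation(intervals, s) for s in lags}

words = open("corpus.txt").read().split()  # hypothetical tokenized text
print(rare_word_interval_acf(words))
```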
3.3 Examples of Scaling Properties for Other Natural Language Texts
Except for Zipf’s and Heaps’ laws, the scaling properties have hardly appeared in the context of computational linguistics or language engineering. This may be because these properties do not directly incorporate semantics or syntax, which are of central concern in those domains. Instead, the properties quantify the universal structures behind natural language in a statistical sense. Those introduced so far are robust and apply to texts across different genres and languages as long as the text is sufficiently long. Figure 2 shows the scaling properties of another language modeling data set, the PTB. This text also satisfies all five scaling properties. They are indeed universal with respect to the genre or even language. More results are shown in Appendix A. Figure A1 shows the scaling properties of the collected works of Shakespeare, and their exponents are listed in the third block of Table 1. Likewise, the scaling properties and exponents for Hong Lou Meng, a Chinese literary work, are shown in Figure A2 and listed in the last block of Table 1, respectively. Among the exponents, that of the long-range correlation, ξ, differs largely among the four data sets considered thus far. In contrast, the other exponents generally take similar values for the data sets.
Table 1. Scaling properties of the natural language data sets. Zipf’s and Heaps’ laws concern the vocabulary population; Ebeling’s method, Taylor’s law, and the long-range correlation concern long memory. Fit errors ε are given in parentheses.

| | Tokens | Vocab. | Zipf’s Law f(r) ∝ r^−α | Heaps’ Law v(n) ∝ n^β | Ebeling’s Method m(l) ∝ l^η | Taylor’s Law σ ∝ μ^ζ | Long-Range Correlation c(s) ∝ s^−ξ |
|---|---|---|---|---|---|---|---|
| Wikitext-2 (English, Wikipedia articles) | | | | | | | |
| preprocessed data set | 2,088,628 | 33,278 | Yes | 0.75 (0.13) | 1.33 (0.10) | 0.62 (0.15) | 0.33 (0.04) |
| original data set | 2,088,628 | 76,617 | Yes | 0.78 (0.09) | 1.33 (0.10) | 0.65 (0.11) | 0.32 (0.03) |
| Penn Treebank (English, The Wall Street Journal news articles) | | | | | | | |
| preprocessed data set | 887,521 | 10,000 | Yes | 0.70 (0.16) | 1.23 (0.06) | 0.56 (0.14) | 0.81 (0.24) |
| original data set | 892,008 | 89,317 | Yes | 0.83 (0.07) | 1.20 (0.05) | 0.57 (0.06) | 0.60 (0.16) |
| Shakespeare (old English collection of literature works) | | | | | | | |
| original text | 740,706 | 83,105 | Yes | 0.79 (0.07) | 1.24 (0.09) | 0.59 (0.05) | 0.13 (0.02) |
| Hong Lou Meng (Chinese, literature work) | | | | | | | |
| original text | 703,034 | 18,312 | Yes | 0.74 (0.14) | 1.31 (0.07) | 0.58 (0.07) | 0.39 (0.04) |
4. Computational Models of Natural Language
This section introduces the computational models of natural language tested in this article. We test four categories of language models: n-gram models, grammatical models, language models based on the Simon or Pitman-Yor process, and neural language models. These categories cover the different genres of language models that have appeared in the history of computational linguistics. For every category, some sophisticated, advanced models have been proposed. The experiments reported in §6 and §7, however, were conducted only with the most recent models whose code was available, except for the n-gram models. This served to avoid errors in reimplementation.
4.1 n-Gram Models
This article examines 3-gram and 5-gram models. Other than the original n-gram model, we also test models with a variety of smoothing techniques to improve the perplexity. In particular, linear interpolation (Stolcke 2002), Katz backoff (Katz 1987), and Kneser-Ney smoothing (Kneser and Ney 1995) have been known to enhance the performance of n-gram models. We also set n = 3 and n = 5 for these models to compare with the original n-gram models. It has been empirically verified that longer context does not necessarily contribute to improving the perplexity and can even degrade performance (Chen and Goodman 1999). Simple n-gram models, in fact, have been mathematically shown to be incapable of reproducing long memory (Kingman 1963; Lin and Tegmark 2017).
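For reference, the sketch below implements an unsmoothed maximum-likelihood 3-gram model and samples text from it. The smoothed variants used in this article (linear interpolation, Katz backoff, Kneser-Ney) are not reproduced, and the training file is hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_trigram(words):
    """Maximum-likelihood trigram counts: (w1, w2) -> Counter of next words."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        model[(w1, w2)][w3] += 1
    return model

def generate(model, context, length=50):
    w1, w2 = context
    out = [w1, w2]
    for _ in range(length):
        nxt = model.get((w1, w2))
        if not nxt:
            break  # unseen context; a smoothed model would back off instead
        candidates, counts = zip(*nxt.items())
        w1, w2 = w2, random.choices(candidates, weights=counts)[0]
        out.append(w2)
    return " ".join(out)

words = open("corpus.txt").read().split()  # hypothetical training text
model = train_trigram(words)
print(generate(model, context=(words[0], words[1])))
```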
4.2 Grammatical Models
The PCFG is a basic grammatical model. We constructed this grammar model with the annotated PTB data set and used the Natural Language Toolkit (NLTK) (Loper and Bird 2002) to generate sentences according to the probabilities assigned to productions. Unlike an n-gram model, the PCFG generates a text by using a tree.
4.3 Language Models Based on Simon/Pitman-Yor Processes
These two parameters serve to produce Zipf’s law with slightly convex behavior (Goldwater, Griffiths, and Johnson 2011). The basic models introduced to this point define nothing about which actual words to introduce: we can simply generate abstract random sequences and examine their scaling properties, because these basic formulations govern the nature of the language models elaborated from them.
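For concreteness, a minimal sketch of the Simon process is shown below: at every step a brand-new element is introduced with probability a, and otherwise an element is copied uniformly at random from the sequence generated so far. The parameter value and sequence length are arbitrary choices for illustration.

```python
import random

def simon_process(n, a=0.1, seed=0):
    """Generate a sequence of n abstract elements by the Simon process."""
    rng = random.Random(seed)
    seq = [0]      # start with one element; elements are integer ids
    new_id = 1
    for _ in range(n - 1):
        if rng.random() < a:
            seq.append(new_id)           # introduce a brand-new element
            new_id += 1
        else:
            seq.append(rng.choice(seq))  # copy a uniformly sampled past element
    return seq

seq = simon_process(1_000_000)
print(len(set(seq)), "distinct elements")  # the vocabulary grows roughly linearly (about a*n)
```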
By mapping words to the elements produced, we would generate a language model, like the two-stage model proposed in Goldwater, Griffiths, and Johnson (2011). Here, we consider a more advanced model proposed as the hierarchical Pitman-Yor language model (HPYLM) (Teh 2006), which integrates the Pitman-Yor process into an n-gram model.1
4.4 Neural Language Models
In modern applications, RNNs with a gating mechanism, such as long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), a gated recurrent unit (GRU) (Cho et al. 2014), and quasi-recurrent neural networks (QRNNs) (Bradbury et al. 2017), are often adopted. The recurrent architectures of these models are defined as follows.
- LSTM:

  f_t = σ(W_f x_t + U_f h_{t−1} + b_f)  (14)
  i_t = σ(W_i x_t + U_i h_{t−1} + b_i)  (15)
  o_t = σ(W_o x_t + U_o h_{t−1} + b_o)  (16)
  c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)  (17)
  c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t  (18)
  h_t = o_t ⊙ tanh(c_t)  (19)

- GRU:

  z_t = σ(W_z x_t + U_z h_{t−1} + b_z)  (20)
  r_t = σ(W_r x_t + U_r h_{t−1} + b_r)  (21)
  h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)  (22)
  h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t  (23)

- QRNNs:

  z_t = tanh(W_z ∗ x_{t−k+1:t})  (24)
  f_t = σ(W_f ∗ x_{t−k+1:t})  (25)
  h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ z_t  (26)

Here x_t is the input at time step t, h_t is the hidden state, σ is the sigmoid function, ⊙ denotes elementwise multiplication, and ∗ in the QRNN denotes a convolution over a window of k inputs.
In this article, we consider a total of nine neural language models. Three of them are based on a simple RNN, a GRU (Cho et al. 2014), and QRNNs (Bradbury et al. 2017; Merity, Keskar, and Socher 2018a). The rest are LSTM-based language models. The first LSTM model is trained without regularizations such as dropout. The second model is AWD-LSTM (Merity, Keskar, and Socher 2018b), which applies regularization effectively to achieve competitive prediction performance. The other four models integrate extended architectures of RNN language models, namely, continuous cache (Grave, Joulin, and Usunier 2017) and mixture of softmaxes (MoS) (Yang et al. 2018). Continuous cache is a memory augmentation architecture that computes a cache probability p_cache from the most recent context of length l. It computes the similarity between the current hidden state h_t and a past hidden state h_i to estimate the probability that the word at time step i reappears. The output probability of the model with continuous cache, denoted as the AWD-LSTM-Cache model, is a linear interpolation of the AWD-LSTM output and the cache probability. We also test a model incorporating the Simon process, denoted as the AWD-LSTM-Simon model. It behaves as uniform sampling from the past generated sequence and is a special case of AWD-LSTM-Cache. In addition, the MoS architecture reformulates the language modeling task as matrix factorization and is a state-of-the-art language model when integrated with AWD-LSTM as the AWD-LSTM-MoS model. Finally, we also consider a combination of all these architectures, denoted as the AWD-LSTM-MoS-Cache model.
The hyperparameters used in our experiments followed the instructions in Merity, Keskar, and Socher (2018b) and Yang et al. (2018). The context length (or the length of back-propagation through time) was 70, as given in the references, for both character- and word-based models. The cache window size of the AWD-LSTM-Simon model was set to 10,000, to balance a large window size with computational efficiency. All the language models were trained to minimize the negative log-likelihood of the training data by stochastic gradient algorithms. Note that the perplexity scores for character- and word-based models are not directly comparable, as they indicate bits per character and per word, respectively.
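To illustrate how sample texts are obtained from such models, the sketch below defines a small word-level LSTM language model in PyTorch and samples from it autoregressively. It is a generic sketch with arbitrary sizes, not the AWD-LSTM configuration or training setup used in this article, and the model here is untrained.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

@torch.no_grad()
def sample(model, start_id, length=100):
    """Autoregressively sample `length` token ids, starting from `start_id`."""
    model.eval()
    token = torch.tensor([[start_id]])
    state, generated = None, []
    for _ in range(length):
        logits, state = model(token, state)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        token = torch.multinomial(probs, num_samples=1)
        generated.append(token.item())
    return generated

model = LSTMLanguageModel(vocab_size=10000)  # untrained, for illustration only
print(sample(model, start_id=0, length=20))
```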
5. Experiments
For every language model, a sample text of 1 million words was generated and evaluated using the metrics explained thus far. We expected models that learned a natural language text to be able to generate a sample text with scaling properties resembling those of the original text. In particular, we expected that the exponent values would be close to those of the original data set.
The subsequent two sections, §6 and §7, proceed by examining the scaling properties as applied to models that learned WT2 or the PTB. As introduced in §3.3, these are two standard data sets used as language model benchmarks. For both WT2 and the PTB, the data set was preprocessed to reduce the vocabulary size. Infrequent words were replaced with <unk>, and numbers were replaced with N in the PTB (Mikolov et al. 2010). Language models were then constructed by training with either WT2 or the PTB, except for the Simon and Pitman-Yor processes (but not HPYLM, which does learn) and the PCFG. The PCFG could be constructed only with the PTB data set, because it requires a parsed corpus, which does not exist for WT2.
Tables 2 and 3 list the perplexity and the scaling exponents of the models for the WT2 and PTB data sets, respectively. Each row presents the results for a single text, either real or machine-generated. The perplexity is not reported for the Simon model, the Pitman-Yor process, or the PCFG. For the two mathematical models, it was not measured because they generate abstract elements rather than the actual words of a corpus, so prediction accuracy cannot be computed. The perplexity of the PCFG is not reported because its computation does not trivially match that of the n-gram and neural language models.
Table 2. Perplexity and scaling properties for the WT2 data set: the original and shuffled data sets and texts generated by language models trained on WT2. Zipf’s and Heaps’ laws concern the vocabulary population; Ebeling’s method, Taylor’s law, and the long-range correlation concern long memory. Fit errors ε are given in parentheses.

| | Perplexity | Zipf’s Law f(r) ∝ r^−α | Heaps’ Law v(n) ∝ n^β | Ebeling’s Method m(l) ∝ l^η | Taylor’s Law σ ∝ μ^ζ | Long-Range Correlation c(s) ∝ s^−ξ |
|---|---|---|---|---|---|---|
| Original Data Set | | | | | | |
| Wikitext-2 (Preprocessed) | - | Yes | 0.75 (0.13) | 1.32 (0.10) | 0.62 (0.15) | 0.33 (0.04) |
| Wikitext-2 (Original) | - | Yes | 0.78 (0.09) | 1.33 (0.10) | 0.65 (0.11) | 0.32 (0.03) |
| Shuffled Data Set | | | | | | |
| Wikitext-2 (1-gram) | - | Yes | 0.75 (0.16) | 1.00 (0.01) | 0.50 (0.02) | No |
| Wikitext-2 (2-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.01) | No |
| Wikitext-2 (5-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.02) | No |
| Wikitext-2 (10-gram) | - | Yes | 0.76 (0.16) | 1.00 (0.00) | 0.50 (0.02) | No |
| N-gram Language Model | | | | | | |
| 3-gram | 837.58 | Yes | 0.79 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| 5-gram | 534.98 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| linear interpolation | 294.72 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 3-gram | 285.14 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 5-gram | 357.94 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 3-gram | 204.15 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 5-gram | 215.44 | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Simon/Pitman-Yor Process and Related Language Model | | | | | | |
| Simon | - | Yes | 0.95 (0.15) | - | 0.50 (0.01) | 0.09 (0.03) |
| Pitman-Yor | - | Yes | 0.78 (0.09) | - | 0.50 (0.01) | No |
| HPYLM | (184.34†) | Yes | 0.78 (0.13) | 1.00 (0.00) | 0.50 (0.02) | No |
| Neural Language Model (character based) | | | | | | |
| LSTM (no regularization) | (1.44‡) | Yes | 0.74 (0.17) | 1.06 (0.05) | 0.50 (0.01) | No |
| AWD-LSTM | (1.22‡) | Yes | 0.73 (0.15) | 1.27 (0.10) | 0.54 (0.04) | 0.30 (0.05) |
| Neural Language Model (word based) | | | | | | |
| Simple RNN | 164.51 | Yes | 0.79 (0.12) | 1.01 (0.00) | 0.50 (0.02) | No |
| GRU | 96.22 | Yes | 0.79 (0.11) | 1.12 (0.06) | 0.52 (0.03) | 0.52 (Weak) |
| QRNN | 74.74 | Yes | 0.79 (0.11) | 1.08 (0.03) | 0.52 (0.03) | 0.57 (0.08) |
| LSTM (no regularization) | 113.18 | Yes | 0.78 (0.12) | 1.10 (0.03) | 0.52 (0.03) | 0.43 (0.15) |
| AWD-LSTM | 64.27 | Yes | 0.76 (0.13) | 1.30 (0.15) | 0.58 (0.06) | 0.05 (0.01) |
| AWD-LSTM-Simon | 61.59 | Yes | 0.77 (0.10) | 1.25 (0.15) | 0.55 (0.05) | 0.03 (0.01) |
| AWD-LSTM-MoS | 62.44 | Yes | 0.78 (0.12) | 1.16 (0.07) | 0.54 (0.04) | 0.33 (0.07) |
| AWD-LSTM-MoS-Cache | 59.21 | Yes | 0.78 (0.11) | 1.20 (0.07) | 0.57 (0.07) | 0.29 (0.05) |
| AWD-LSTM-Cache | 50.39 | Yes | 0.78 (0.11) | 1.25 (0.10) | 0.59 (0.07) | 0.14 (0.04) |
Table 3. Perplexity and scaling properties for the PTB data set: the original and shuffled data sets and texts generated by language models trained on the PTB. Column organization is the same as in Table 2.

| | Perplexity | Zipf’s Law f(r) ∝ r^−α | Heaps’ Law v(n) ∝ n^β | Ebeling’s Method m(l) ∝ l^η | Taylor’s Law σ ∝ μ^ζ | Long-Range Correlation c(s) ∝ s^−ξ |
|---|---|---|---|---|---|---|
| Original Data Set | | | | | | |
| Penn Treebank (Preprocessed) | - | Yes | 0.70 (0.16) | 1.23 (0.06) | 0.56 (0.14) | 0.81 (0.24) |
| Penn Treebank (Original) | - | Yes | 0.83 (0.07) | 1.20 (0.05) | 0.57 (0.06) | 0.60 (0.16) |
| Shuffled Data Set | | | | | | |
| Penn Treebank (1-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (2-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (5-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.00) | 0.50 (0.02) | No |
| Penn Treebank (10-gram) | - | Yes | 0.72 (0.18) | 1.00 (0.01) | 0.50 (0.02) | No |
| N-gram Language Model | | | | | | |
| 3-gram | 367.79 | Yes | 0.71 (0.19) | 0.99 (0.01) | 0.50 (0.02) | No |
| 5-gram | 561.65 | Yes | 0.72 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| linear interpolation | 238.59 | Yes | 0.71 (0.20) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 3-gram | 195.65 | Yes | 0.71 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Katz backoff 5-gram | 250.18 | Yes | 0.71 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 3-gram | 150.64 | Yes | 0.72 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| Kneser-Ney 5-gram | 156.70 | Yes | 0.71 (0.20) | 1.00 (0.00) | 0.50 (0.02) | No |
| Simon/Pitman-Yor Process and Related Language Model | | | | | | |
| HPYLM | (140.49†) | Yes | 0.73 (0.21) | 1.00 (0.00) | 0.50 (0.02) | No |
| Grammatical Model | | | | | | |
| PCFG | - | Yes | 0.73 (0.19) | 1.00 (0.00) | 0.50 (0.02) | No |
| Neural Language Model (character based) | | | | | | |
| LSTM (no regularization) | (1.38‡) | Yes | 0.79 (0.08) | 1.03 (0.01) | 0.50 (0.01) | No |
| AWD-LSTM | (1.18‡) | Yes | 0.76 (0.12) | 1.10 (0.03) | 0.51 (0.02) | 0.40 (0.10) |
| Neural Language Model (word based) | | | | | | |
| Simple RNN | 123.96 | Yes | 0.71 (0.19) | 1.00 (0.01) | 0.50 (0.02) | 0.74 (Weak) |
| GRU | 85.05 | Yes | 0.71 (0.18) | 1.05 (0.02) | 0.50 (0.02) | 0.40 (Weak) |
| QRNN | 62.65 | Yes | 0.71 (0.18) | 1.10 (0.03) | 0.51 (0.02) | 0.54 (Weak) |
| LSTM (no regularization) | 111.79 | Yes | 0.71 (0.19) | 1.04 (0.01) | 0.51 (0.02) | 0.84 (Weak) |
| AWD-LSTM | 56.40 | Yes | 0.71 (0.18) | 1.06 (0.02) | 0.51 (0.03) | 0.69 (Weak) |
| AWD-LSTM-Simon | 57.85 | Yes | 0.72 (0.16) | 1.04 (0.01) | 0.51 (0.03) | No |
| AWD-LSTM-MoS | 54.77 | Yes | 0.71 (0.18) | 1.10 (0.03) | 0.52 (0.04) | 0.77 (Weak) |
| AWD-LSTM-MoS-Cache | 54.03 | Yes | 0.71 (0.18) | 1.13 (0.04) | 0.55 (0.06) | 0.61 (Weak) |
| AWD-LSTM-Cache | 52.51 | Yes | 0.72 (0.17) | 1.07 (0.02) | 0.53 (0.05) | 0.57 (Weak) |
The first block in each table indicates the properties of the original data sets with and without preprocessing. The second block lists the results for shuffled data sets, which preserve parts of the n-gram structure. They were tested to check the behavior of the evaluation metrics on randomized texts, which were expected to lose long memory and thus to differ substantially from the original natural language texts. The shuffling was conducted as follows. As an example, the text ABCDEFGHI was first split into 3-gram chunks, giving ABC/DEF/GHI. Then, the chunks were shuffled randomly to obtain a 3-gram shuffled data set (e.g., DEF/GHI/ABC). Note that this shuffling does not preserve some n-gram structures of the original text, such as BCD and FGH. The remaining blocks correspond to the results for the language models introduced. The grammatical model category is absent in Table 2 because of the lack of a parsed corpus for WT2. Appendix B includes all figures showing the scaling properties.
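A direct implementation of this chunk-and-shuffle procedure is sketched below (the seed and example are arbitrary).

```python
import random

def ngram_shuffle(tokens, n, seed=0):
    """Split a token sequence into consecutive n-token chunks and shuffle the chunks."""
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    random.Random(seed).shuffle(chunks)
    return [t for chunk in chunks for t in chunk]

print(ngram_shuffle(list("ABCDEFGHI"), n=3))  # a random permutation of the chunks ABC / DEF / GHI
```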
6. Evaluation of Metrics
The first columns of Table 2 and Table 3 list the perplexities of the language models. The blank symbol “-” appears in rows for which the perplexity is not available: the original and shuffled data sets are not language models, while the Simon/Pitman-Yor processes and the grammatical model have different definitions of probability and cannot be measured comparably with the n-gram and neural language models. The perplexity scores in parentheses were measured but are not directly comparable with the other values because of their different implementations of preprocessing, as explained at the ends of §4.3 and §4.4.
In terms of perplexity, the neural language models consistently outperformed the n-gram models. Among the n-gram models, Kneser-Ney smoothing consistently outperformed the other smoothing techniques. The 3-gram models sometimes had better perplexity than the 5-gram models did, as the training data sets in this experiment were not especially large (see Table 1). Among the neural language models, the simple RNN model had the worst perplexity. The RNNs with a gating mechanism improved the perplexity over that of the simple RNN model. In particular, the AWD-LSTM model performed the best among the RNN language models. The additional architectures of the cache mechanism and MoS contributed to improving the perplexity.
6.1 Metrics of Scaling Properties
The proposed evaluation metrics should be compared with another evaluation metric that is assumed plausible. In this article, the perplexity is adopted as such a metric. As perplexity has been the standard evaluation metric in language modeling and prediction accuracy is of primary importance for that application, we assess the metrics derived from the scaling properties by examining how they correlate with the perplexity.
Columns 3–7 of Table 2 and Table 3 list the respective results for the scaling properties: Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and the long-range correlation. Even when the perplexity was not computable, all the properties could still be examined regardless of the kind of language model, except for Ebeling’s method, which applies to characters. Overall, except for the long-range correlation, the results were consistent across the data sets: when a model satisfied a scaling law on one data set, it also satisfied that law on the other.
All the language models qualitatively satisfied Zipf’s law. We indicate this by Yes in the tables for the reason stated in §3.1. Relatedly, all the language models also satisfied Heaps’ law. These two properties, however, are present even in text produced by a unigram language model. Despite their fame, Zipf’s law and Heaps’ law have no capacity to distinguish randomized text from real text. It is therefore not a challenge for language models to satisfy Zipf’s and Heaps’ laws.
In contrast, the metrics of long memory were capable of quantifying the quality of machine-generated texts. For Ebeling’s method (first column of the Long Memory block), the exponent of the original data set was η = 1.32 for WT2 and η = 1.23 for the PTB, whereas that of both shuffled data sets was η = 1.00, thus indicating no long memory in the latter. The neural language models had exponents between η = 1.10 and η = 1.30 for WT2, and between η = 1.04 and η = 1.13 for the PTB, whereas the other language models showed the same behavior as an i.i.d. process (η = 1.00). Ebeling’s method therefore could verify the text quality to a certain extent.
The last column in each table lists the results for the long-range correlation. If the text was not long-range correlated, this is denoted by No or Weak: No if more than one value was negative for s ≤ 10, or Weak if there was one negative value for s ≤ 100. Such arbitrariness of judgment is one disadvantage of this metric. In addition, even though it corresponds well with the other two metrics of long memory, it has two further disadvantages. First, the exponent correlates poorly with the perplexity. The second disadvantage was exhibited in the degree of long-range correlation listed for the Simon model. The degree was high at the beginning and did not decay (see Figure A16 in Appendix B). Because the Simon model has more new words later in a sequence, the correlation stayed large even between points separated by a large distance. This non-decaying phenomenon was therefore due not to burstiness but to a different characteristic specific to the Simon process. The Taylor exponent for the Simon process was ζ = 0.50, indicating that the long-range correlation observed was not due to long memory behavior.
Finally, the Taylor exponent ζ seemed the most reliable metric among those derived from the scaling properties. The left panel of Figure 3 shows the correlation between the perplexity of the models and the Taylor exponent ζ. As the perplexity decreased, the Taylor exponent ζ showed a steep increase. Because the exponent quantifies the degree of burstiness of word occurrence, this result indicates that the better models in terms of perplexity can also reproduce that statistical property.
Figure 3. Scatter plots of the perplexity of various models with respect to the Taylor exponent ζ (left) and the perplexity of the eval-AWD-LSTM model (right), for the WT2 data set. Left: the Taylor exponents of the n-gram language models were consistently ζ = 0.50, which indicates the absence of long memory, whereas the neural language models had Taylor exponents of ζ > 0.50, which indicates the presence of long memory in the generated texts. Right: the perplexity of eval-AWD-LSTM had a clear, positive correlation with the perplexities of the language models.
Overall, the scaling properties of long memory serve for evaluation of generated texts. The Taylor exponent ζ especially has the capability for evaluation.
6.2 Comparison with PCFG- and Language-Model–Based Evaluation
Next, we test the effectiveness of using the negative log-likelihood from a PCFG (Rajeswar et al. 2017) and the perplexity obtained from a neural language model (Fedus, Goodfellow, and Dai 2018). The results show that PCFG-based evaluation is not effective, in contrast to evaluation based on the scaling properties.
In principle, the negative log-likelihood of a PCFG evaluates the grammaticality of text. Rajeswar et al. (2017) used the negative log-likelihood of a PCFG to evaluate GAN-generated texts. The scatter plots in Figure 4 show the average negative log-likelihood from a PCFG for the PTB data set (magenta), the PTB data set shuffled with 5-grams (green), and the AWD-LSTM-Cache model (blue). Because the PTB data set is annotated, the negative log-likelihood was calculated for every sentence, and the values were plotted for different sentence lengths. As for the other two cases, because the outputs had no sentence boundaries indicated in the training data, consecutive parts of a given length n were randomly extracted from the text and fed to the PCFG parser, and the negative log-likelihood was then calculated. The NLTK (Loper and Bird 2002) parser implementation was used in this work. The shaded area in red represents the upper and lower bounds of the original PTB data set.
Figure 4. Average negative log-likelihood of a PCFG for different sentence lengths from the PTB data set (magenta), n-word chunks from the AWD-LSTM-Cache model (blue), and 5-grams from the shuffled PTB data set (green). The area shaded red represents the upper and lower bounds of the negative log-likelihood of the PCFG for the PTB data set.
The average negative log-likelihood of a sentence has a strong linear correlation with its length, and the values for the PTB data set were consistently lower than those for the generated text of the AWD-LSTM-Cache model and the 5-gram shuffled text. The differences from the original PTB data set, however, were not significant, even though the 5-gram and AWD-LSTM-Cache results were calculated merely for n-word random chunks. Moreover, the average values for the 5-gram shuffled text and the machine-generated text were within the range of the PTB’s upper and lower bounds. This indicates that the negative log-likelihood from a PCFG is probably not usable for evaluating machine-generated texts.
Apart from the PCFG, Fedus, Goodfellow, and Dai (2018) proposed evaluating the quality of GAN-generated texts with the perplexity computed from a neural language model. We next test whether that method provides a good measure of the language models considered here. Accordingly, we used the AWD-LSTM model to evaluate the texts generated by the n-gram and neural language models. To avoid confusion, we call this the eval-AWD-LSTM model. It was trained with the WT2 and PTB data sets to evaluate the texts generated by the various other models (including AWD-LSTM itself).
The perplexity of eval-AWD-LSTM was calculated for each machine-generated text by (1). The rightmost columns of Tables 4 and 5 list the results, and the right panel of Figure 3 shows a scatter plot of the perplexity of the models with respect to the perplexity of eval-AWD-LSTM. This method seemed to work well, especially in globally distinguishing the n-gram and neural language model categories: The former category had perplexities above 600, whereas the latter category had almost all values below 200 for WT2. The eval-AWD-LSTM perplexity could not, however, detect the differences among the n-gram language models nor among the neural language models (e.g., between Katz backoff and Kneser-Ney, or AWD-LSTM and AWD-LSTM-Cache). The bias caused by the evaluation model is also a problem with this method. In the experiment, AWD-LSTM was the best model by eval-AWD-LSTM evaluation for both the WT2 and PTB data sets. It is likely that worse-performing models whose behavior is similar to that of the evaluation model are evaluated more highly than are other models that have higher fluency but behave differently from the evaluation model.
Table 4. Perplexity, Taylor exponent ζ, and perplexity measured by the eval-AWD-LSTM model for the WT2 data set (fit errors ε in parentheses).

| | Perplexity | Taylor exponent ζ | Perplexity from eval-AWD-LSTM |
|---|---|---|---|
| Original Data Set | | | |
| Wikitext-2 (Preprocessed) | - | 0.62 (0.15) | 33.81 |
| Shuffled Data Set | | | |
| Wikitext-2 (1-gram) | - | 0.50 (0.02) | 7,389.15 |
| Wikitext-2 (2-gram) | - | 0.50 (0.02) | 2,405.15 |
| Wikitext-2 (5-gram) | - | 0.50 (0.02) | 559.92 |
| Wikitext-2 (10-gram) | - | 0.50 (0.02) | 236.49 |
| N-gram Language Model | | | |
| 3-gram | 837.58 | 0.50 (0.02) | 3,730.74 |
| 5-gram | 534.98 | 0.50 (0.02) | 7,532.91 |
| linear interpolation | 294.72 | 0.50 (0.02) | 1,371.75 |
| Katz backoff 3-gram | 285.14 | 0.50 (0.02) | 663.74 |
| Katz backoff 5-gram | 357.94 | 0.50 (0.02) | 664.25 |
| Kneser-Ney 3-gram | 204.15 | 0.50 (0.02) | 2,562.24 |
| Kneser-Ney 5-gram | 215.44 | 0.50 (0.02) | 2,743.65 |
| HPYLM | 184.34 | 0.50 (0.02) | 884.76 |
| Neural Language Model | | | |
| Simple RNN | 164.51 | 0.50 (0.02) | 645.64 |
| GRU | 96.22 | 0.52 (0.03) | 266.33 |
| QRNN | 74.74 | 0.52 (0.03) | 135.68 |
| LSTM (no regularization) | 113.18 | 0.52 (0.03) | 177.12 |
| AWD-LSTM | 64.27 | 0.58 (0.06) | 88.73 |
| AWD-LSTM-Simon | 61.59 | 0.55 (0.05) | 130.52 |
| AWD-LSTM-MoS | 62.44 | 0.54 (0.04) | 97.89 |
| AWD-LSTM-MoS-Cache | 59.21 | 0.57 (0.07) | 164.39 |
| AWD-LSTM-Cache | 50.39 | 0.59 (0.07) | 109.02 |
Table 5. Perplexity, Taylor exponent ζ, and perplexity measured by the eval-AWD-LSTM model for the PTB data set (fit errors ε in parentheses).

| | Perplexity | Taylor exponent ζ | Perplexity from eval-AWD-LSTM |
|---|---|---|---|
| Original Data Set | | | |
| Penn Treebank (Preprocessed) | - | 0.56 (0.14) | 40.70 |
| Shuffled Data Set | | | |
| Penn Treebank (1-gram) | - | 0.50 (0.02) | 3,698.52 |
| Penn Treebank (2-gram) | - | 0.50 (0.02) | 1,328.39 |
| Penn Treebank (5-gram) | - | 0.50 (0.02) | 351.22 |
| Penn Treebank (10-gram) | - | 0.50 (0.02) | 166.93 |
| N-gram Language Model | | | |
| 3-gram | 367.79 | 0.50 (0.02) | 1,697.99 |
| 5-gram | 561.65 | 0.50 (0.02) | 3,463.88 |
| linear interpolation | 238.59 | 0.50 (0.02) | 965.58 |
| Katz backoff 3-gram | 195.65 | 0.50 (0.02) | 420.48 |
| Katz backoff 5-gram | 250.18 | 0.50 (0.02) | 471.03 |
| Kneser-Ney 3-gram | 150.64 | 0.50 (0.02) | 1,324.67 |
| Kneser-Ney 5-gram | 156.70 | 0.50 (0.02) | 1,411.14 |
| HPYLM | 140.49 | 0.50 (0.02) | 412.13 |
| Neural Language Model | | | |
| Simple RNN | 123.96 | 0.50 (0.02) | 321.31 |
| GRU | 85.05 | 0.50 (0.02) | 258.12 |
| QRNN | 62.65 | 0.51 (0.02) | 113.22 |
| LSTM (no regularization) | 113.18 | 0.51 (0.02) | 234.05 |
| AWD-LSTM | 64.27 | 0.51 (0.03) | 90.01 |
| AWD-LSTM-Simon | 61.59 | 0.51 (0.03) | 144.45 |
| AWD-LSTM-MoS | 62.44 | 0.52 (0.04) | 97.73 |
| AWD-LSTM-MoS-Cache | 59.21 | 0.55 (0.06) | 100.56 |
| AWD-LSTM-Cache | 50.39 | 0.53 (0.05) | 123.32 |
Overall, the evaluation methods using other language models were not consistent. The PCFG-based evaluation could not even clearly distinguish between the shuffled and original data sets. Evaluation based on a neural language model could detect the difference between the n-gram and neural language models, but it could not distinguish quality within those categories of language models. Compared with those methods, the Taylor exponent ζ had a clearer correlation with the perplexity of the models. Specifically, the exponent satisfied ζ = 0.50 for all n-gram language models. It was larger than 0.50 only for the neural language models whose perplexity was better than that of the n-gram language models. Among the neural language models, the Taylor exponent took high values for the AWD-LSTM family, which had better perplexity than the GRU and QRNN models and the LSTM model without regularization.
7. Evaluation of Models
In this section, we apply the metrics evaluated in §6.1 to discuss the scaling properties of the language models. All language models tested in the experiments satisfied the scaling properties of the vocabulary population, namely Zipf’s law and Heaps’ law. These properties are relatively easy for models to reproduce, because they concern the static probability distribution of words.
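Both vocabulary-population properties can be checked directly on a model's output. The sketch below is illustrative rather than the exact procedure used in our experiments; it fits each exponent by ordinary least squares in log-log space, and the logarithmic sampling of the vocabulary growth curve is an arbitrary choice.

```python
import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    """Fit f(r) proportional to r**(-alpha) on the rank-frequency distribution."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

def heaps_exponent(tokens, n_points=50):
    """Fit v(n) proportional to n**beta on the vocabulary growth curve."""
    seen, growth = set(), []
    for w in tokens:
        seen.add(w)
        growth.append(len(seen))          # growth[i] = vocabulary size after i+1 tokens
    # Sample the curve at roughly logarithmically spaced text sizes n.
    ns = np.unique(np.logspace(0, np.log10(len(tokens)), n_points).astype(int))
    vs = np.array(growth)[ns - 1]
    beta, _ = np.polyfit(np.log(ns), np.log(vs), 1)
    return beta
```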
In contrast, many of the language models failed to reproduce long memory behavior. The only apparent exception was the Simon process, which exhibited strong long-range correlation; as explained in §6.1, however, this correlation was not caused by burstiness. The lack of long memory in n-gram language models is supported by an analytical argument about Markov models, as mentioned in §4.1. The failure of the PCFG model in our experimental setting can be explained by its lack of inter-sentence structure.
Even among the neural language models, the simple RNN model failed to reproduce long memory. The Taylor exponent was ζ = 0.50, and the other metrics also indicated that the generated text did not have long-range dependence. In contrast, the RNNs with a gating mechanism (LSTM, GRU, and QRNNs) could reproduce long memory behavior. The Taylor exponents of the GRU and QRNN language models were both ζ = 0.52 for WT2, which indicates the presence of long memory to a certain extent. The LSTM language models were consistently the best at reproducing long memory behavior of natural language text for WT2 and the PTB at both the character level and the word level.
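As a rough illustration of the long-range correlation analysis, one common approach (a simplified variant, not necessarily the exact estimator used in this article) examines the autocorrelation of the sequence of return intervals of a given word: a slowly decaying, power-law-like c(s) indicates long memory, whereas for a memoryless text the autocorrelation stays near zero at all lags. The example word "the" below is arbitrary.

```python
import numpy as np

def return_intervals(tokens, word):
    """Distances between successive occurrences of `word` in the token sequence."""
    positions = [i for i, w in enumerate(tokens) if w == word]
    return np.diff(positions)

def autocorrelation(x, max_lag=100):
    """Sample autocorrelation c(s) of a numeric sequence for lags s = 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = x.var()
    return np.array([np.mean(x[:-s] * x[s:]) / var for s in range(1, max_lag + 1)])

# For a long-memory text, autocorrelation(return_intervals(tokens, "the")) decays
# slowly (roughly as a power law); for a shuffled text it is close to zero.
```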
Figure 5 shows (a) Zipf’s law and (b) Taylor’s law results for the AWD-LSTM-Cache model trained with WT2, which was the best performing model in terms of perplexity. Figure 5(a) demonstrates that the Zipf’s law behavior of the data set shown in Figure 1(a) was well recovered. Likewise, Figure 5(b) demonstrates how well the AWD-LSTM-Cache model captured and reproduced the Taylor’s law behavior shown in Figure 1(d). Whereas the Taylor exponent for the original data set was ζ = 0.62, the AWD-LSTM-Cache model had a Taylor exponent of ζ = 0.59 for WT2. The data points in Figure 1(d) were more widely scattered around the regression line than those in Figure 5(b). Even with the well-performing neural language models, however, the scaling properties of long memory were not fully recovered. These differences represent gaps between the natural language text and the language model, which may indicate room for improvement.
8. Evaluation of GAN Models
Finally, we discuss the possibility of evaluating GAN-generated text with the scaling properties. Table 6 lists the scaling properties for the COCO image data set (Lin et al. 2014). Because current GAN models for text generation cannot produce long texts, image captions constitute the standard data set for these models, which are consequently limited to generating a single text type (i.e., image captions). In particular, because the texts are short, the models cannot be expected to reproduce long memory behavior. It is nevertheless worthwhile to test the vocabulary population of the GAN models to understand their capacity.
Data set | Tokens | Vocab. | Zipf’s law f(r) ∝ r^−α | Heaps’ law v(n) ∝ n^β | Ebeling’s method m(l) ∝ l^η | Taylor’s law σ ∝ μ^ζ | Long-range correlation c(s) ∝ s^−ξ |
---|---|---|---|---|---|---|---|
Image COCO (English, collection of image captions) | | | | | | | |
original data set | 105,933 | 6,095 | Yes | 0.76 (0.09) | 0.99 (0.03) | 0.50 (0.04) | No |

Zipf’s law and Heaps’ law concern the vocabulary population; Ebeling’s method, Taylor’s law, and the long-range correlation concern long memory.
Figures 6 and 7 show Zipf’s and Taylor’s law graphs for the original data set and the text generated by SeqGAN (Yu et al. 2017), respectively. Unlike the other language models, GAN models for text generation had problems reproducing Zipf’s law. The tail decay for the generated text was faster than that for the data set. The vocabulary size of the generated text was only v(n) = 1,822 words for n = 118,264 generated words, whereas the original text had a vocabulary size v(n) = 6,095 for n = 105,933 words. This result indicates that the GAN model could not produce the infrequent words in the training data set.
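The vocabulary comparison described above requires only token counts; a minimal sketch follows, in which the variable names are placeholders for the tokenized original and generated texts.

```python
from collections import Counter

def vocabulary_report(tokens, name):
    """Print token count, vocabulary size, and number of words occurring only once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    print(f"{name}: n = {len(tokens):,} tokens, v(n) = {len(counts):,} types, "
          f"{hapaxes:,} words occurring only once")

# Hypothetical variables holding the tokenized texts:
# vocabulary_report(coco_tokens, "COCO captions")
# vocabulary_report(seqgan_tokens, "SeqGAN output")
```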
Scaling properties of captions generated from the COCO image data set by SeqGAN.
On the other hand, long memory was already absent from the training data set itself. The Taylor exponent was ζ = 0.50 (Figure 6(b)), indicating no memory; this was expected, as the captions were shuffled and consecutive captions had no relation to each other. Having learned from such training data and generating text caption by caption, the model produced text that likewise had no long memory (Figure 7(b)). Indeed, long memory analysis requires a model to generate sufficiently long texts before any further evaluation of natural language quality is possible.
Nevertheless, other metrics would not provide a better evaluation in this case. Table 7 lists the evaluation metrics of BLEU and perplexity by eval-AWD-LSTM for texts generated using different GAN techniques. The BLEU scores for the GAN models in Table 7 were extracted from Zhu et al. (2018). The perplexity scores were computed by using the eval-AWD-LSTM model trained with the COCO image data set and the hyperparameters for the PTB data set. The perplexity of AWD-LSTM when trained with that data set was 65.41.
Metric | SeqGAN | MaliGAN | RankGAN | LeakGAN | TextGAN | MLE | ImageCoco |
---|---|---|---|---|---|---|---|
BLEU-2 | 0.92 | 0.89 | 0.94 | 0.93 | 0.65 | 0.92 | 1.00 |
BLEU-3 | 0.75 | 0.70 | 0.80 | 0.82 | 0.65 | 0.68 | 1.00 |
BLEU-4 | 0.53 | 0.48 | 0.60 | 0.66 | 0.60 | 0.57 | 1.00 |
BLEU-5 | 0.35 | 0.31 | 0.41 | 0.47 | 0.52 | 0.39 | 1.00 |
Perplexity (eval-AWD-LSTM) | 179.29 | 272.53 | 132.90 | 146.26 | 129.93 | 176.34 | 44.17 |
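The eval-AWD-LSTM row corresponds to scoring each generated corpus with a language model trained separately on the training data. The following PyTorch sketch illustrates such scoring; the `model` interface (a callable returning per-position logits over the vocabulary) and the token-id preparation are assumptions for illustration, not the interface of the actual AWD-LSTM implementation.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_perplexity(model, token_ids):
    """Perplexity of a generated token-id sequence under a trained autoregressive LM.

    `model` is assumed to map a (1, T) tensor of ids to (1, T, vocab) logits;
    this interface is an assumption made for this sketch.
    """
    ids = torch.tensor(token_ids).unsqueeze(0)            # shape (1, T)
    logits = model(ids[:, :-1])                           # predict token t+1 from the prefix
    log_probs = F.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    nll = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                     targets.reshape(-1), reduction="mean")
    return math.exp(nll.item())
```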
For both BLEU and perplexity, the results were inconsistent. In terms of BLEU, the best-performing GAN model varied among RankGAN with BLEU-2, LeakGAN with BLEU-3 and BLEU-4, and TextGAN with BLEU-5. In contrast, TextGAN was the best model in terms of eval-AWD-LSTM. In addition to these metrics, the negative log-likelihood of the PCFG was also not effective in evaluating the GAN models in Zhu et al. (2018).
Although rigorous quantitative evaluation is necessary for comparing GAN models, the existing evaluation metrics are not sufficiently reliable, and further study of evaluation metrics is therefore needed. The Taylor exponent may play a role in such studies once GAN-based models become able to produce longer texts.
9. Conclusion
In this article, we have investigated the scaling properties of computational models of natural language and analyzed whether these metrics could serve for assessing the models. The scaling properties quantify the vocabulary population and long memory behavior, which are universal qualities of natural language text. These metrics are applicable to any model, even those for which the perplexity is not measurable or a reference is not available. We tested n-gram language models, a grammatical model, mathematical models, neural language models, and GAN models for text generation. Among the five scaling properties introduced, the exponent of Taylor’s law showed the most reasonable behavior. It had the clearest correlation with the perplexity of the n-gram and neural language models.
Our analysis demonstrated that RNNs with a gating mechanism (LSTM, GRU, and QRNNs) are the first computational models of natural language with the capacity to reproduce the long memory of natural language text. No other models tested in our experiments reproduced the scaling properties of long memory. The LSTM models were the best among the neural language models, as their long memory behavior was closer to that of the original text than that of the GRU and QRNN models. Yet even the LSTM language models could not entirely recover long memory, as seen in the exponents of the scaling properties. This observation confirms the gap between natural language text and language models and suggests corresponding room for improvement. Our future work will include investigating other scaling properties that could serve for evaluating language models.
Appendix A. Scaling Properties of Natural Language
This appendix presents the figures for the scaling properties of the data sets used in this article. The presence of the scaling properties is robust to the genre and language of the text.
Appendix B. Scaling Properties of Language Models
This appendix presents the figures for the scaling properties of the language models trained on WT2.
Scaling properties of the Simon process. The figure for Ebeling’s method is omitted because the method could not be appropriately applied to this model.
Scaling properties of the Pitman-Yor process. The figure for Ebeling’s method is omitted because the method could not be appropriately applied to this model.
Scaling properties of the LSTM language model without regularization.
Scaling properties of the LSTM language model without regularization for character-level modeling.
Scaling properties of SeqGAN (trained on the COCO image data set).
Note
The implementation used in the experiment is available at https://github.com/musyoku/hpylm. Although HPYLM is an n-gram language model and its perplexity can be calculated, the resulting value is not comparable with those of the other n-gram language models and the neural language models. Specifically, training requires inserting <BOS> and <EOS> tokens to mark the beginning and end of each sentence. These insertions introduce regularities, such as <EOS> being almost always followed by <BOS>, which decrease the perplexity.
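The effect can be seen with a toy calculation; the probabilities below are made up purely for illustration. Because the boundary tokens are assigned near-certain probabilities, including them in the average lowers the reported perplexity substantially.

```python
import math

# Hypothetical per-token probabilities assigned by a model to one sentence.
word_probs = [0.05, 0.02, 0.10, 0.04]      # four ordinary words
boundary_probs = [0.99, 0.95]              # near-certain <BOS>/<EOS> transitions

def ppl(probs):
    """Perplexity = exp of the average negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(ppl(word_probs))                     # ordinary words only: about 22.4
print(ppl(word_probs + boundary_probs))    # including boundary tokens: about 8.0
```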