Abstract
This article presents a mathematical and empirical verification of computational constancy measures for natural language text. A constancy measure characterizes a given text by having an invariant value for any size larger than a certain amount. The study of such measures has a 70-year history dating back to Yule's K, with the original intended application of author identification. We examine various measures proposed since Yule and reconsider reports made so far, thus overviewing the study of constancy measures. We then explain how K is essentially equivalent to an approximation of the second-order Rényi entropy, thus indicating its signification within language science. We then empirically examine constancy measure candidates within this new, broader context. The approximated higher-order entropy exhibits stable convergence across different languages and kinds of text. We also show, however, that it cannot identify authors, contrary to Yule's intention. Lastly, we apply K to two unknown scripts, the Voynich manuscript and Rongorongo, and show how the results support previous hypotheses about these scripts.
1. Introduction
A constancy measure for a natural language text is defined, in this article, as a computational measure that converges to a value for a certain amount of text and remains invariant for any larger size. Because such a measure exhibits the same value for any size of text larger than a certain amount, its value could be considered as a text characteristic.
The concept of such a text constancy measure was introduced by Yule (1944) in the form of his measure K. Since Yule, there has been a continuous quest for such measures, and various formulae have been proposed. They can be broadly categorized into three types, namely, those measuring (1) repetitiveness, (2) power law character, and (3) complexity.
Yule's original intention for K's utility lay in author identification, assuming that it would differ for texts written by different authors. State-of-the-art multivariate machine learning techniques are powerful, however, for solving such language engineering tasks, in which Yule's K is used only as one variable among many, as reported in Stamatatos, Fakotakis, and Kokkinakis (2001) and Stein, Lipka, and Prettenhofer (2010).
We believe that constancy measures today, however, have greater importance in understanding the mathematical nature of language. Although mathematical models of language have been studied in the computational linguistics milieu, via Markov models (Manning and Schuetze 1999), Zipf's law and its modifications (Mandelbrot 1953; Zipf 1965; Bell, Cleary, and Witten 1990), and Pitman-Yor models (Teh 2006) more recently, the true mathematical model of linguistic processes is ultimately unknown. Therefore, the convergence of a constancy measure must be examined through empirical verification. Because some constancy measures have a mathematical theory of convergence for a known process, discrepancies in the behavior of real linguistic data from such a theory would shed light on the nature of linguistic processes and give hints towards improving the mathematical models. Furthermore, as one application, a convergent measure would allow for comparison of different texts through a common, stable norm, provided that the measure converges for a sufficiently small amount of text. One of our goals is to discover a non-trivial measure with a certain convergence speed that distinguishes the different natures of texts.
The objective of this article is thus to provide a potential explanation of what the study of constancy measures over 70 years has been about, by answering the three following questions mathematically and empirically:
Question 1 Does a measure exhibit constancy?
Question 2 If so, how fast is the convergence speed?
Question 3 How discriminatory is the measure?
The most important and closest previous work was reported in Tweedie and Baayen (1998), the first paper to have examined the empirical behavior of constancy measures on real texts. The authors used English literary texts to test constancy measure candidates proposed prior to their work. Today, the coverage and abundance of language corpora allow us to conduct a larger-scale investigation across multiple languages. Recently, Golcher (2007) tested his measure V (discussed later in this paper) with Indo-European languages and also programming language sources. Our papers (Kimura and Tanaka-Ishii 2011, 2014) also precede this one, presenting results preliminary to this article but with only part of our data, and neither of those provides mathematical analysis with respect to the Rényi entropy. Compared with these previous reports, our contribution here can be summarized as follows:
- •
Our work elucidates the mathematical relation of Yule's K to Rényi's higher-order entropy and explains why K converges.
- •
Our work vastly extends the corpora used for empirical examination in terms of both size and language.
- •
Our work compares the convergent values for these corpora.
- •
Our work also presents results for unknown language data, specifically from the Voynich manuscript and Rongorongo.
2. Constancy Measures
The measures proposed so far can broadly be categorized into three types, calculating the repetitiveness, power-law distribution, or complexity of text. This section mathematically analyzes these measures and summarizes them.
2.1 Measures Based on Repetitiveness
2.1.1 Yule's K

2.1.2 Other Measures Based on Simple Text Statistics
Apart from Yule's K, various measures have been proposed from simple statistical observation of text, as detailed in Tweedie and Baayen (1998). One genre is based on the so-called token-type relation (i.e., the ratio of the vocabulary size V(N) and the text size N, in log) as formulated by Guiraud (1954) and Herdan (1964) as a law. Because this simple ratio is not stable, the measure was modified numerous times to formulate Herdan's C (Herdan 1964), Dugast's k and U (Dugast 1979), Maas' a2 (Maas 1972), Tuldava's LN (Tuldava 1977), and Brunet's W (Brunet 1978).
Another genre of measures concerns the proportion of hapax legomena, that is V(1, N). Honoré noted that V(1, N) increases linearly with respect to the log of a text's vocabulary size V(N) (Honoré 1979). Another ratio, of V(2, N) to V(N), was proposed as a text characteristic by Sichel (1975) and Maas (1972).
Each of these values, however, was found not to be convergent according to the extensive study conducted by Tweedie and Baayen (1998). In common with Yule's intention to apply such measures for author identification, they examined all of the measures discussed here, in addition to two measures explained later: Orlov's Z, and the Shannon entropy upper bound obtained from the relative frequencies of unigrams. They examined these measures with English novels (such as Alice's Adventures in Wonderland) and empirically found that only Yule's K and Orlov's Z were convergent. Given their report, we consider K the only true candidate among the constancy measures examined so far.
2.1.3 Golcher's V
Golcher explains in his paper that the possibility of constancy of V does not yet have mathematical grounding and has only been shown empirically. He does not report values for texts larger than about 30 megabytes nor for those of non-Indo-European languages. A simple conjecture on this measure is that because a suffix tree for a string of length N has at most N − 1 inner nodes, V must end up at some value 0 ≤ V < 1, for any given text.
Our group tested V with larger-scale data and concluded that V could be a constancy measure, although we admitted to observing a gradual increase (Kimura and Tanaka-Ishii 2014). Because V requires further verification on larger-scale data before ruling it out, we include it as a constancy measure candidate.
2.2 Measures Based on Power Law Distributions
The major problem with measures based on power laws lies in the skewed head and tail of the vocabulary population distribution. Because these exceptions constitute important parts of the population, parameter estimation by fitting to Equation (3) is sensitive to the estimation method. For example, the estimated value of the exponent for Zipf's law depends on the method used for dealing with these exceptions. We tested several simple methods of estimating the Zipf law's exponent γ with different ways of handling the head and tail of a distribution. There were settings that led to convergence, but the convergence depended on the settings. Such difficulty could be one reason why there has been no direct proposal for γ as a text constancy measure. Hence, due care must be taken in relating text constancy to a power law. We chose another path by considering text constancy through a random Zipf distribution, as described later in the experimental section.
2.3 Measures Based on Complexity
2.3.1 Shannon Entropy Upper Bound



Tweedie and Baayen directly calculated an approximation of this formula in terms of the relative frequencies (for P) of unigrams (for X), and they concluded that the measure would continue increasing with respect to text size and would not converge for short, literary texts (Tweedie and Baayen 1998). Because we are interested in the measure's behavior on a larger scale, we replicated their experiment, as discussed later in the section on empirical constancy. We denote this measure as H1 in this article.



Empirically, many have attempted to calculate the upper bound of the entropy rate. Brown's report (Brown et al. 1992) is representative in showing a good estimation of the entropy rate for English from texts, as compared with values obtained from humans (Cover and King 1978). Subsequently, there have been important studies on calculating the entropy rate, as reported thoroughly in Schümann and Grassberger (1996). The questions related to h*, however, remain unsolved. Recently, Dbowski used a Lempel-Ziv compressor and examined Hilberg's conjecture for texts by single authors (D
bowski 2013). He showed an exponential decrease in the entropy rate with respect to text size, supporting the validity of Equation (9). Following these previous works, we examine the entropy rate by using an algorithm proposed by Grassberger (1989) and later on by Farach et al. (1995). This method is based on universal coding. The algorithm has a theoretical background of convergence to the true h*, provided the sequence is stationary, but has been proved by Shields (1992) to be inconsistent—that is, it does not converge to the entropy rate for certain non-Markovian processes. We still chose to apply this method, because it requires no arbitrary parameters for calculation and is applicable to large-scale data within a reasonable time.

2.3.2 Approximation of Rényi Entropy Hα
- •
When α = 0, H0 (X) indicates the number of distinct occurrences of X.
- •
When the limit α → 1 is taken, Equation (11) reduces to the Shannon entropy.
For α → 1, there is controversy regarding convergence, as noted in the previous section. Such difficulty in convergence for these α values lies in the nature of linguistic processes, in which the vocabulary set evolves.

This relation of Yule's K to the Rényi entropy H2 is reported for the first time here, to the best of our knowledge. This mathematical relation clarifies both why Yule's K should converge and what the convergent value means; specifically, the value represents the gross complexity underlying the language system. As noted earlier, the higher-order entropy considers hapax legomena to a lesser degree and calculates the gross entropy only from the representative vocabulary population. This simple argument shows that Yule's K captures not only the simple repetitiveness of vocabulary but also the more profound signification of its equivalence with the approximated second-order entropy. Because K has been previously reported as a stable text constancy measure, we consider it here once again, but this time within the broader context of Hα.
2.4 Summary of Constancy Measure Candidates
Based on the previous reports (Tweedie and Baayen 1998; Kimura and Tanaka-Ishii 2014) and the discussion so far, we consider the following four measures as candidates for text constancy measures.
- •
Repetitiveness-based measures:
Yule's K (Equation (1)); and Golcher's V (Equation (2)).
- •
Complexity-based measures:
The Shannon entropy upper bound (h1 as the entropy rate (Equations (10) and (8)) and H1 (Equation (7), with X in terms of unigrams and the probability function in terms of relative frequencies); and the approximated Rényi entropy, denoted as Hα (α > 1) (Equation (11), again with X and the probability function in terms of unigrams and relative frequencies, respectively).
Furthermore, because we have seen the mathematical equivalence of Yule's K and H2, in the following we only consider H2. As for Hα, we consider α = 3, 4 only in comparison with H2. Because H1 is based on relative frequencies and can be considered together with H2, we first focus on the convergence of the three measures V, h1, and H2, and then we consider H1 in comparison with H2, H3, and H4.
3. Data
3.1 Real Texts
Table 1 lists the data used in our experimental examination. The table indicates the data identifier (by which we refer to the data in the rest of the article), language, source, number of distinct tokens, data length by total number of tokens, and size in bytes. The first block contains relatively large-scale natural language corpora consisting of texts written by multiple authors, and the second block contains smaller corpora consisting of texts by single authors. The third block contains programming language corpora, and the fourth block contains corpora of unknown scripts, which we examine at the end of this article in Section 4.3.
Our data.
Identifier . | Language kind . | Source . | Number of distinct tokens . | Data length by tokens . | Data size in bytes . |
---|---|---|---|---|---|
Large scale corpora | |||||
Enews-c | English | WSJ Corpus(1987) | 87 | 112,868,099 | 108MB |
Enews-w | 137,466 | 22,679,512 | |||
Jnews-c | Japanese | 2000–2009 Mainichi | 5,758 | 475,101,506 | 1.3GB |
Newspaper | |||||
Jnews-cr | 94 | 1,087,919,430 | |||
Jnews-w | 468,818 | 289,032.862 | |||
Cnews-c | Chinese | 1995 People's | 5,963 | 24,696,511 | 67MB |
Daily Newspaper | |||||
Cnews-cr | 88 | 68,325,519 | |||
Cnews-w | 144,336 | 14,965,501 | |||
Atext-c | Arabic | Watan-2004 corpus | 59 | 42,174,262 | 73MB |
Atext-w | 298,370 | 7,450,442 | |||
Ttext-c | Thai | NECTEC corpus | 159 | 1,444,536 | 3.9MB |
Ttext-w | 16,291 | 280,602 | |||
Small scale corpora | |||||
Ebook1-w | English | Ulysses | 34,359 | 325,692 | 1.5MB |
Ebook2-w | English | Les Miserables | 25,994 | 677,163 | 3MB |
Fbook-w | French | Les Miserables | 31,956 | 691,407 | 3MB |
Gbook-w | German | Kritik der reinen | 10,604 | 215,299 | 1.3MB |
Vernunft | |||||
Jbook-w | Japanese | Dohyo | 19,179 | 502,137 | 2MB |
Cbook-w | Chinese | Hong Lou Meng | 18,450 | 701,255 | 2.5MB |
Abook-w | Arabic | Quaran | 16,121 | 75,185 | 728KB |
Sbook-w | Sanskrit | Ramayana | 62,318 | 213,736 | 1.9MB |
Corpora of programming languages | |||||
Python-w | Python | python library sources | 1,517,424 | 48,704,374 | 214MB |
Cplus-w | C++ | C++ library sources | 127,332 | 15,617,801 | 64MB |
Lisp-w | Common Lisp | sbcl and Clozure CL | 164,248 | 2,326,270 | 32MB |
Corpora of Unknown scripts | |||||
VoynichA-c | Unknown | Voynich Manuscript | 22 | 44,360 | 44KB |
VoynichB-c | Unknown | Voynich Manuscript | 25 | 117,105 | 115KB |
VoynichA-w | Unknown | Voynich Manuscript | 2,628 | 7,460 | 44KB |
VoynichB-w | Unknown | Voynich Manuscript | 4,609 | 18,495 | 115KB |
RongoA-c | Unknown | Rongorongo script | 3,546 | 10,376 | 60KB |
RongoB-c | Unknown | Rongorongo script | 656 | 14,003 | 60KB |
Identifier . | Language kind . | Source . | Number of distinct tokens . | Data length by tokens . | Data size in bytes . |
---|---|---|---|---|---|
Large scale corpora | |||||
Enews-c | English | WSJ Corpus(1987) | 87 | 112,868,099 | 108MB |
Enews-w | 137,466 | 22,679,512 | |||
Jnews-c | Japanese | 2000–2009 Mainichi | 5,758 | 475,101,506 | 1.3GB |
Newspaper | |||||
Jnews-cr | 94 | 1,087,919,430 | |||
Jnews-w | 468,818 | 289,032.862 | |||
Cnews-c | Chinese | 1995 People's | 5,963 | 24,696,511 | 67MB |
Daily Newspaper | |||||
Cnews-cr | 88 | 68,325,519 | |||
Cnews-w | 144,336 | 14,965,501 | |||
Atext-c | Arabic | Watan-2004 corpus | 59 | 42,174,262 | 73MB |
Atext-w | 298,370 | 7,450,442 | |||
Ttext-c | Thai | NECTEC corpus | 159 | 1,444,536 | 3.9MB |
Ttext-w | 16,291 | 280,602 | |||
Small scale corpora | |||||
Ebook1-w | English | Ulysses | 34,359 | 325,692 | 1.5MB |
Ebook2-w | English | Les Miserables | 25,994 | 677,163 | 3MB |
Fbook-w | French | Les Miserables | 31,956 | 691,407 | 3MB |
Gbook-w | German | Kritik der reinen | 10,604 | 215,299 | 1.3MB |
Vernunft | |||||
Jbook-w | Japanese | Dohyo | 19,179 | 502,137 | 2MB |
Cbook-w | Chinese | Hong Lou Meng | 18,450 | 701,255 | 2.5MB |
Abook-w | Arabic | Quaran | 16,121 | 75,185 | 728KB |
Sbook-w | Sanskrit | Ramayana | 62,318 | 213,736 | 1.9MB |
Corpora of programming languages | |||||
Python-w | Python | python library sources | 1,517,424 | 48,704,374 | 214MB |
Cplus-w | C++ | C++ library sources | 127,332 | 15,617,801 | 64MB |
Lisp-w | Common Lisp | sbcl and Clozure CL | 164,248 | 2,326,270 | 32MB |
Corpora of Unknown scripts | |||||
VoynichA-c | Unknown | Voynich Manuscript | 22 | 44,360 | 44KB |
VoynichB-c | Unknown | Voynich Manuscript | 25 | 117,105 | 115KB |
VoynichA-w | Unknown | Voynich Manuscript | 2,628 | 7,460 | 44KB |
VoynichB-w | Unknown | Voynich Manuscript | 4,609 | 18,495 | 115KB |
RongoA-c | Unknown | Rongorongo script | 3,546 | 10,376 | 60KB |
RongoB-c | Unknown | Rongorongo script | 656 | 14,003 | 60KB |
For the large-scale natural language data, we considered five languages: English, Japanese, Chinese, Arabic, and Thai. These languages were chosen to represent different language families and writing systems. The large-scale corpora in English, Japanese, and Chinese consist of newspapers in chronological order, and the Thai and Arabic corpora include other kinds of texts. The markers ‘w’, ‘c’, and ‘cr’ appearing at the end of every identifier in Table 1 (e.g., Enews-c, Enews-w, and Jnews-cr) indicate text processed through words, characters, and transliterated Roman characters, respectively.
As for the small-scale corpora in the second block, the texts were only considered in terms of words, since verification via characters produced findings consistent with those obtained with the large-scale corpora. The texts were chosen because each was written by a single author but is relatively large.
Here, we summarize our preprocessing procedures. For the annotated Thai NECTEC corpus, texts were tokenized according to the annotation. The preprocessing methods for the other corpora were as follows:
- •
English: NLTK2 was used to tokenize text into words.
- •
Japanese: Mecab3 was used for tokenization, and KAKASI4 was used for romanization.
- •
Chinese: ICTCLAS20135 was used for tokenization, and the pinyin Python library was used for pinyin romanization.
- •
Other European Languages: PunktWordTokenizer6 was used for tokenization.
Following Golcher (2007), who first suggested testing constancy on programming languages, we also collected program sources from different languages (third block in Table 1). The programs were also considered solely in terms of words, not characters. C++ and Python were chosen to represent different abstraction levels, and Lisp was chosen because of its different ordering for function arguments. Source code was collected from language libraries. The programming language texts were preprocessed as follows. Comments in natural language were eliminated (although strings remained in the programs, where each was a literal token). Identical files and copies of sources in large chunks were carefully eliminated, although this process did not completely eliminate redundancy since most programs reuse some previous code. Finally, the programs were tokenized according to the language specifications.7
The last block of the table lists two corpora of unknown scripts. We consider these scripts at the end of this article in Section 4.3, through Figure 5, to show one possible application of the text constancy measures. The first unknown script is that of the Voynich manuscript, a famous text that is undeciphered but hypothesized to have been written in natural language. This corpus is considered in terms of both characters and words, where words were defined via the white space separation in the original text. Given the common understanding that the manuscript seems to have two different parts (Reddy and Knight 2011), we separated it into two parts according to the Currier annotation (identified as A and B, respectively). The second corpus of unknown text consists of the Rongorongo script of Easter Island (Daniels and Bright 1996, Section 13; Orliac 2005; Barthel 2013). This script's status as natural language is debatable, but if so, it is considered to possess characteristics of both phonographs and ideograms (Pozdniakov and Pozdniakov 2007). Because there are several ways to consider what constitutes a character in this script (Barthel 2013), we calculate values for the two most extreme cases as follows. For corpus RongoA-c, we consider a character inclusive of all adjoining parts (i.e., including accents and ornamental parts). On the other hand, for corpus RongoB-c, we separate parts as reasonably as possible, among multiple possible separation methods. Because the unit of word in this script is unknown, the Rongorongo script is only considered in terms of characters.
3.2 Random Data
The empirical verification of convergence for real data is controversial. We must first note that it does not conform with the standard approach to statistical testing. In the domain of statistics, it is a common understanding that “convergence” cannot be tested. A statistical test raises two contrasting hypotheses—called the null and alternative hypotheses—and calculates a p-value indicating the probability of the null hypothesis to occur. When this p-value is smaller than a certain value, the null hypothesis is considered unable to occur and is thus rejected. For convergence, the null hypothesis corresponds to “not converging,” and the alternative hypothesis, to “converging.” The problem here is that the null hypothesis is always related to the alternative hypothesis to a certain extent, because the difference between convergence and non-convergence is merely a matter of degree. In other words, the notion of convergence for a constancy measure does not conform with the philosophy of statistical testing. Convergence is therefore considered in terms of the distance from convergent values, or in terms of the error with respect to some parameter (such as data size). Such a distance cannot be calculated for real data, however, since the underlying mathematical model is unknown.
To sum up, verification of the convergence of real data must be considered by some other means. Our proposal is to consider convergence in comparison to a set of random data whose process is known. For this random data, we considered two kinds.
The first kind is used to examine data convergence in Section 4.1. This random data was generated from real data by shuffling the original text with respect to certain linguistic units. Tweedie and Baayen (1998) presented results by shuffling words, where the original texts were literary texts by single authors. Here, we generated random data by shuffling (1) words/characters, (2) sentences, or (3) documents. Because these options greatly increased the number of combinations of results, we mainly present the results with option (1) for large-scale data in this article. There are three reasons for this: Convergence must be verified especially at large scale; the most important convergence findings for randomized small-scale data were already reported in Tweedie and Baayen (1998); and the results for options (2) and (3) were situated within the range of option (1) and the original texts.
Randomization of the words and characters of original texts will destroy various linguistic characteristics, such as n-grams and long-range correlation. The convergence properties of the three measures V, h1, and H2 are as follows. The convergence of V is unknown, because it lacks a mathematical background. Even if the value of V did converge, the convergent value for randomized data would differ from that of the original text, since the measure is based on repeated n-grams in the text. h1 converges to the entropy rate of the randomized text, if the data size suffices. This is supported by the mathematical background of the algorithm, which converges to the true entropy rate for stationary data. Even when h1 converges for random data, the convergent value will be larger than that of the original text, because h1 considers the probabilities of n-grams. Lastly, H2 converges to the same point for a randomized text and the original text, because it is the approximated higher-order entropy, such that words and characters are considered to occur independently.
The second kind of random data is used to compare the convergent values of different texts for a constancy measure, as considered in Section 4.2. Random corpora were generated according to four different distributions: one uniform, and the other three following Zipf distributions with exponents of γ = 0.8, 1.0, and 1.3, respectively, for Equation (3). Because each set of real data consists of different numbers of distinct tokens, ranging from tens to billions, random data sets consisting of 2n distinct tokens for every n = 4 … 19, were randomly generated for each of the four distributions. We only consider the measures H2 and H0 for these data sets. Both of these measures have convergent values, given a sufficient data size.
4. Experimental Results
From the previous discussion, we applied the three measures V, h1, and H2 with five large-scale and eight small-scale natural language corpora, three programming language corpora, and two unknown script corpora, in terms of words and characters. Because there were many results for different combinations of measure, data, and token (word or character), this section is structured so that it best highlights our findings.
4.1 Empirical Constancy
Figures 1, 2, and 3 in this section can be examined in the following manner. The horizontal axis indicates the text size of each corpus, in terms of the number of tokens, on a log scale. Chunks of different text sizes were always taken from the head of the corpus.8 The vertical axis indicates the values of the different measures: V, h1, or H2. Each figure contains multiple lines, each corresponding to a corpus, as indicated in the legends.
V, h1, and H2 values in terms of words and characters for the large-scale corpora.
V, h1, and H2 values in terms of words and characters for the large-scale corpora.
V, h1, and H2 values in terms of words and characters for the randomized large-scale corpora.
V, h1, and H2 values in terms of words and characters for the randomized large-scale corpora.
V, h1 , and H2 values for the small-scale corpora and programming language texts in terms of words.
V, h1 , and H2 values for the small-scale corpora and programming language texts in terms of words.
First, we consider the results for the large-scale data. Figure 1 shows the different measures for words (left three graphs) and characters (right three graphs). We can see that V increased for both words and characters (top two graphs). Golcher tested his measure on up to 30 megabytes of text in terms of characters (Golcher 2007). We also observed a stable tendency up to around 107 characters. The increase in V became apparent, however, for larger text sizes. Thus, it is difficult to consider V as a constancy measure.
As for the results for h1 (middle graphs), both graphs show a gradual decrease. The tendency was clearer for words than for characters. For some corpora, especially for characters, it was possible to observe some values converging towards h*. The overall tendency, however, could not be concluded as converging. This result suggests the difficulty in attaining convergence of the entropy rate, even with gigabyte-scale data. From the theoretical background of the Grassberger algorithm, the values would possibly converge with larger-scale data. The continued decrease could be due to multiple reasons, including the possibility of requiring far larger data than that used here, or a discrepancy between linguistic processes and the mathematical model assumed for the Grassberger algorithm.
We tried to estimate h* by fitting the Equation (9). For the corpora with good fitting, all of the estimated values were larger than zero, but many of the results could not be fitted easily, and the estimated values were unstable due to fluctuation of the lines. Whether a value for h* is reached asymptotically and also whether h* > 0 remain important questions requiring separate, more extensive mathematical and empirical studies.
In contrast, H2 (or Yule's K, bottom graphs) showed convergence, already at the level of 105 tokens, for both words and characters. From the previous verification of Yule's K, we can conclude that H2 is convergent. The final convergent values, however, differed for the various writing systems. We return to this issue in the next section.
To better understand the convergence, Figure 2 shows the results for the corresponding randomized data. As mentioned in Section 3.2, the original texts were randomized by shuffling words and characters for the data examined by words and characters, respectively. Therefore, all n-gram characteristics existing in the text were destroyed, and what remained were the different words and characters appearing in a random order. Here, we see how the random data's behavior has some of the theoretical properties of convergence, as summarized in Section 3.2.
As mentioned previously, because V has no mathematical background, its behavior even for uniform random data is unknown, and even if it converged, the convergent value would be smaller than that of the original text. The top two graphs in Figure 2 exhibit some oscillation, especially for randomized Chinese (Cnews-c,w). Such peculiar oscillation was already reported by Golcher himself (Golcher 2007) for uniformly random data. This was easy to replicate, as reported in Kimura and Tanaka-Ishii (2014), for uniformly random data with the number of distinct tokens up to a hundred. Because the word distribution almost follows Zipf's law, the vocabulary is not uniformly distributed, yet oscillating results occur for some randomized data in the top left figure. Moreover, the values seem to increase for Japanese and English for words at a larger scale. Although the plots for some scripts seem convergent (top right graph), these convergent values are theoretically different from those of the original texts, if they exist, and this stability is not universal across the different data sets. Given this result, it is doubtful that V is convergent across languages.
In contrast, h1 is mathematically proven to be convergent given infinite-length randomized data, but to larger values than those of the original texts, as mentioned in Section 3.2. The middle two graphs of Figure 2 show the results for h1. The majority of the plots do not reach convergence even at the largest data sizes, but for certain results with characters, especially in the Roman alphabet, the plots seem to go to a convergent value (middle right). All the plots can be extrapolated to converge to a certain entropy rate above zero, although these values are larger than the convergent values—if they ever exist—of the real data. These results confirm the difficulty of judging whether the entropy rates of the original texts are convergent and whether they remain above zero.
Lastly, it is easy to see that H2 is convergent for a randomized text (bottom two graphs), and the convergent values are the same for the cases with and without randomization. In fact, the plots converge to exactly the same points faster and more stably, which shows the effect of randomization.
As for the other randomization options, by sentences and documents, the findings—both the tendencies of the lines and the changes in the values—can be situated in the middle of what we have seen so far. The plots should increasingly fluctuate more like the real data because of the incomplete randomization, in the order of sentences and then documents.
Returning to inspection of the remaining real data, Figure 3 shows V, h1, and H2 in terms of words for the small-scale corpora (left column) and for the programming language texts (right column). For the small-scale corpora, in general, the plots are bunched together, and the results shared the tendencies noted previously for the large-scale corpora. V again showed an increase, while h1 showed a tendency to decrease. H2 converged rapidly and was already almost stable at 104 tokens. This again shows how H2 exhibits stable constancy, especially with texts written by single authors.
As for the programming language results, the plots fluctuate more than for the natural language texts because of the redundancy within the program sources. Still, the global tendencies noted so far were just discernible. V had relatively larger values but h1 and H2 had smaller values for programs, as compared to the natural language texts. The differences in value indicate the larger degree of repetitiveness in programs.
Lastly, Figure 4 shows the Hα results for the Wall Street Journal in terms of words in unigrams (Enews-w). The horizontal axis indicates the corpus size, and the vertical axis indicates the approximated entropy value. The different lines represent the results for Hα with α = 1, 2, 3, 4. The two H1 plots represent calculations with and without Laplace smoothing (Manning and Schuetze 1999). We can see that without smoothing, H1 increased, as Tweedie and Baayen (1998) reported, but in contrast to their conclusion, we observe a tendency of convergence for larger-scale data. The increase was due to the influence of low-frequency vocabulary pushing up the entropy. The opposite tendency to decrease was observed for the smoothed probabilities, with the plot eventually converging to the same point as that for the unsmoothed H1 values. The convergence was by far slower for H1 as compared with that for H2, H3, and H4, which all had attained convergence already at 102 tokens. The convergence values naturally decreased for larger α, although the amount of decrease itself rapidly decreased with larger α.
In answer to Questions 1 and 2 raised in the Introduction—which measures show constancy, with sufficient convergence speed—the empirical conclusion from our data is that Hα with α > 1 showed stable constancy when the values were approximated using relative frequencies. For H1, the convergence was much slower because of the strong influence of low-frequency words. Consequently, the constancy of Hα with α > 1 is attained by representing the gross complexity underlying a text.
4.2 Discriminatory Power of H2
Now we turn to Question 3 raised in the Introduction and examine the discriminatory power of H2. As Yule intended, does H2 identify authors? Given the influence of different writing systems, as seen previously in Figure 1, we examine the relation between H2 and the number of distinct tokens (the alphabet/vocabulary size). Note that because this number corresponds to H0 in Equation (11), this analysis effectively considers texts on the H0-H2 plane. Since H0 grows according to the text size, unlike H2, the same text size must be used for all corpora in order to meaningfully compare H0 values.9 Given that H2 converges fast, we chose a size of 104 tokens to handle all of the small- and large-scale corpora.
For each of the corpora listed in Table 1 and the second kind of random corpora explained at the end of Section 3.2, Figure 5 plots the values of H2 (vertical axis) and the number of distinct tokens H0 (horizontal) measured for each corpus at a size of 104 tokens. The three large circles are groupings of points. The leftmost group represents news sources in alphabetic characters. All of the romanized Chinese, Japanese, and Arabic texts are located almost at the same vertical location as the English text. This indicates the difficulty for H2 to distinguish natural languages if measured in terms of alphabetic characters. The middle group represents the programming language texts in terms of words. This group is located separately (vertically lower than the natural language corpora in terms of words), so H2 is likely to distinguish between natural languages and programming languages. The rightmost group represents the small-scale corpora. Considering the proximity of these points despite the variety of the content, it is unlikely that H2 can distinguish authors, in contrast to Yule's hope. Still, these points are located lower than those for news text. Therefore, H2 has the potential to distinguish genre or maybe writing style.
Convergent H2 values with respect to the number of distinct tokens for each corpus.
Convergent H2 values with respect to the number of distinct tokens for each corpus.
The natural language texts located near the line for a Zipf exponent of 0.8 are those of the non-alphabetic writing systems.10 Note that Chinese characters have morphological features, and the Arabic and Thai languages also have flexibility in terms of which units are considered words and morphemes. In other words, the plots closer to the random data with a smaller Zipf exponent are for language corpora of morphemic sequences. The group of plots measured for phonographic scripts is located near the line for a Zipf exponent of 1.0 (the grouping of points in the leftmost circle), which could suggest that morphemes are more randomized units than words.
4.3 Application to Unknown Scripts: Voynich Manuscript and Rongorongo Script
The nature of unknown scripts can also be considered through our understanding thus far. Figure 5 includes plots for the Voynich manuscript in terms of words and characters, and for the Rongorongo script in terms of characters. Like all the data seen in this figure, the points are placed at the H2 values (vertically) for the number of distinct tokens (horizontally) at the specified size of 104 tokens, with the exception of Voynich-A in terms of words. Because this corpus consists of fewer than 104 words (refer to the data length by tokens listed for VoynichA-w in Table 1), its point is located horizontally at the vocabulary size corresponding to the corpus' maximum size.
For the two Voynich manuscript parts, the plots in terms of words appear near the Arabic corpus for words (Abook-w). For characters, on the other hand, the plots are at the leftmost end of the figure. This was due to overestimation of the total number of characters for the alphabetic texts (e.g., both English and other, romanized language texts), since all ASCII characters, such as colons, periods, and question marks, are counted. Still, the H2 values are located almost at the same position as for the other romanized texts, indicating that the Voinich manuscript has approximately similar complexity. These results suggest the possibility that the Voynich manuscript could have been generated from a source in natural language, possibly written in some script of the abjad type. This supports previous findings (Reddy and Knight 2011; Montemurro and Zanette 2013), which reported the possibility of the Voynich manuscript being in a natural language and the coincidence of its word length distribution with that of Arabic.
On the other hand, the plots for the Rongorongo script appear near the line for a Zipf exponent of 0.8, with RongoA near Arabic in terms of words but RongoB somewhat further down from Japanese in terms of characters. The status of Rongorongo as natural language has been controversial (Pozdniakov and Pozdniakov 2007). Both points in the graph, however, are near many other natural language texts (and not widely separated), making it reasonable to hypothesize that Rongorongo is indeed natural language. The characters can be deemed morphologically rich, because both plots are close to the line for a Zipf exponent of 0.8. In the case of RongoA, for which a character was considered inclusive of all parts (i.e., including accents and ornamental parts), the morphological richness is comparable to that of the words of an abjad script. On the other hand, when considering the different character parts as distinct (RongoB), the location drifts towards the plot for Thai, a phonographic script, in terms of characters. Therefore, the Rongorongo script could be considered basically morphemic, with some parts functioning phonographically. This conclusion again supports a previous hypothesis proposed by a domain specialist (Pozdniakov and Pozdniakov 2007).
This analysis of two unknown scripts supports previous conjectures. Our results, however, only add a small bit of evidence to those conjectures; clearly, reaching a reasonable conclusion would require further study. Moreover, the analysis of unknown scripts introduced here could provide another possible application of text constancy measures, from a broader context.
5. Conclusion
We have discussed text constancy measures, whose values are invariant across different sizes of text, for a given text. Such measures have a 70-year history, since Yule originally proposed K as a text characteristic, potentially with language engineering utility for problems such as author identification. We consider text constancy measures today to have scientific importance in understanding language universals from a computational view.
After overviewing measures proposed so far and previous studies on text constancy, we explained how K essentially has a mathematical equivalence to the Rényi higher-order entropy. We then empirically examined various measures across different languages and kinds of corpora. Our results showed that only the approximated higher-order Rényi entropy exhibits stable, rapid constancy. Examining the nature of the convergent values revealed that K does not possess the discriminatory power of author identification as Yule had hoped. We also applied our understanding to two unknown scripts, the Voynich manuscript and Rongorongo, and showed how our constancy results support previous hypotheses about each of these scripts.
Our future work will include application of K to other kinds of data besides natural language. There, too, we will consider the questions raised in the Introduction, of whether K converges and of how discriminatory it is. We are especially interested in considering the relation between the value of K and the meaningfulness of data.
Acknowledgments
This research was supported by JST's PRESTO program.
Notes
According to Dbowski (2009), h* = 0 suggests that the next element of a linguistic process is deterministic, that is, a function of the corpus observed before, under the two conditions that (1) the number of possible choices for the element is finite, and (2) the corpus observed before is infinite. In reality, the finiteness of linguistic sequences has the opposite tendency (i.e., the size of the observed corpus is finite, and the possible vocabulary size is infinite).
With respect to the Lisp programming language, its culture favors long, hyphenated variable names that can be almost as long as a sentence. For this work, therefore, Lisp variable names were tokenized by splitting at the hyphens.
For real data, this was done without any randomization of the order of texts for all corpora besides Atext-w and Atext-c. The Watan corpus is distributed not in the chronological order of the publishing dates, but as a set of articles grouped into categories (i.e., all articles of one category, then all articles of another category, and so on). Because of this, there is a large skew in the vocabulary distribution, depending on the section of the corpus. We thus randomly reshuffled the articles by categories for the whole corpus before taking chunks of different sizes (always from the beginning) to generate our results. Apart from this, we avoided any arbitrary randomization with respect to the original data summarized in Table 1.
Because H0 is not convergent, the horizontal locations remain unstable, unless the tokens are of a phonographic alphabet. In other words, for all word-based results and character-based results not based on a phonographic alphabet, the resulting horizontal locations are changed by increasing the corpus size. As for the random data, the H0 values are convergent, because these data sets have a finite number of distinct tokens. Since H0 is measured only for the first 104 tokens, however, the horizontal locations are underestimated, especially for random data following a Zipf distribution.
Note that here we use the values of the Zipf exponent for the random data, and not the estimated exponents for the real data. The rank-frequency distributions of characters, especially for phonetic alphabets, often do not follow a power law.
References
Author notes
Kyushu University, 744 Motooka Nishiku, Fukuoka City, Fukuoka, Japan. E-mail: [email protected].
JST-PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan.
Gunosy Inc., 6-10-1 Roppongi, Minato-ku, Tokyo, Japan.