Table 1 
Summary of the data sets used in this article and their statistics.
Column groups: Vocabulary Population (Zipf's Law, Heaps' Law); Long Memory (Ebeling's Method, Taylor's Law, Long Range Correlation).

| Data set | Version | Tokens | Vocab. | Zipf's Law f(r) ∝ r^(−α) | Heaps' Law v(n) ∝ n^β | Ebeling's Method m(l) ∝ l^η | Taylor's Law σ ∝ μ^ζ | Long Range Correlation c(s) ∝ s^(−ξ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wikitext-2 (English, Wikipedia article) | preprocessed data set | 2,088,628 | 33,278 | Yes | 0.75 (0.13) | 1.33 (0.10) | 0.62 (0.15) | 0.33 (0.04) |
| Wikitext-2 (English, Wikipedia article) | original data set | 2,088,628 | 76,617 | Yes | 0.78 (0.09) | 1.33 (0.10) | 0.65 (0.11) | 0.32 (0.03) |
| Penn Treebank (English, The Wall Street Journal news article) | preprocessed data set | 887,521 | 10,000 | Yes | 0.70 (0.16) | 1.23 (0.06) | 0.56 (0.14) | 0.81 (0.24) |
| Penn Treebank (English, The Wall Street Journal news article) | original data set | 892,008 | 89,317 | Yes | 0.83 (0.07) | 1.20 (0.05) | 0.57 (0.06) | 0.60 (0.16) |
| Shakespeare (old English collection of literature works) | original text | 740,706 | 83,105 | Yes | 0.79 (0.07) | 1.24 (0.09) | 0.59 (0.05) | 0.13 (0.02) |
| Hong Lou Meng (Chinese, literature work) | original text | 703,034 | 18,312 | Yes | 0.74 (0.14) | 1.31 (0.07) | 0.58 (0.07) | 0.39 (0.04) |
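
As an illustration of how an exponent such as the Heaps' Law β in Table 1 could be estimated, the following is a minimal sketch. It assumes an ordinary least-squares fit of log v(n) against log n, which is not necessarily the fitting procedure used in the article. The function names (`vocabulary_growth`, `fit_power_law_exponent`, `heaps_exponent`) are hypothetical and chosen only for this example.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, v(n)) pairs: the vocabulary size after the first n tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def fit_power_law_exponent(points):
    """Slope of the least-squares line through (log x, log y).

    For data following y ∝ x^β, this slope is an estimate of β.
    Assumption: plain least squares in log-log space, with all x, y > 0.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to fit a slope")
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def heaps_exponent(tokens):
    """Estimate β in v(n) ∝ n^β from a sequence of tokens."""
    return fit_power_law_exponent(vocabulary_growth(tokens))
```

For example, `heaps_exponent(text.split())` would give a rough estimate for a whitespace-tokenized string `text`. The result depends on the tokenization and on which range of n is fitted, so it should not be expected to reproduce the values in Table 1 exactly.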