Our data.
Identifier | Language kind | Source | Number of distinct tokens | Data length in tokens | Data size |
---|---|---|---|---|---|
Large scale corpora | | | | | |
Enews-c | English | WSJ Corpus (1987) | 87 | 112,868,099 | 108MB |
Enews-w | | | 137,466 | 22,679,512 | |
Jnews-c | Japanese | 2000–2009 Mainichi Newspaper | 5,758 | 475,101,506 | 1.3GB |
Jnews-cr | | | 94 | 1,087,919,430 | |
Jnews-w | | | 468,818 | 289,032,862 | |
Cnews-c | Chinese | 1995 People's Daily Newspaper | 5,963 | 24,696,511 | 67MB |
Cnews-cr | | | 88 | 68,325,519 | |
Cnews-w | | | 144,336 | 14,965,501 | |
Atext-c | Arabic | Watan-2004 corpus | 59 | 42,174,262 | 73MB |
Atext-w | | | 298,370 | 7,450,442 | |
Ttext-c | Thai | NECTEC corpus | 159 | 1,444,536 | 3.9MB |
Ttext-w | | | 16,291 | 280,602 | |
Small scale corpora | | | | | |
Ebook1-w | English | Ulysses | 34,359 | 325,692 | 1.5MB |
Ebook2-w | English | Les Misérables | 25,994 | 677,163 | 3MB |
Fbook-w | French | Les Misérables | 31,956 | 691,407 | 3MB |
Gbook-w | German | Kritik der reinen Vernunft | 10,604 | 215,299 | 1.3MB |
Jbook-w | Japanese | Dohyo | 19,179 | 502,137 | 2MB |
Cbook-w | Chinese | Hong Lou Meng | 18,450 | 701,255 | 2.5MB |
Abook-w | Arabic | Quran | 16,121 | 75,185 | 728KB |
Sbook-w | Sanskrit | Ramayana | 62,318 | 213,736 | 1.9MB |
Corpora of programming languages | | | | | |
Python-w | Python | Python library sources | 1,517,424 | 48,704,374 | 214MB |
Cplus-w | C++ | C++ library sources | 127,332 | 15,617,801 | 64MB |
Lisp-w | Common Lisp | SBCL and Clozure CL | 164,248 | 2,326,270 | 32MB |
Corpora of unknown scripts | | | | | |
VoynichA-c | Unknown | Voynich Manuscript | 22 | 44,360 | 44KB |
VoynichB-c | Unknown | Voynich Manuscript | 25 | 117,105 | 115KB |
VoynichA-w | Unknown | Voynich Manuscript | 2,628 | 7,460 | 44KB |
VoynichB-w | Unknown | Voynich Manuscript | 4,609 | 18,495 | 115KB |
RongoA-c | Unknown | Rongorongo script | 3,546 | 10,376 | 60KB |
RongoB-c | Unknown | Rongorongo script | 656 | 14,003 | 60KB |