Impact of lemmatization on the different text types in the training set (80% of the corpus).
| text type | # tokens | # types (raw) | # types (lemmatized) | gain | token/type (lem.) |
|---|---|---|---|---|---|
| unigram | 48,898,738 | 160,424 | 142,396 | 1.12 | 343.39 |
| bigram | 48,473,756 | 3,836,212 | 3,119,422 | 1.23 | 15.54 |
| Stanford | 35,772,003 | 8,750,839 | 7,430,397 | 1.18 | 4.81 |
| AEGIR | 31,004,525 | – | 5,096,918 | – | 6.08 |
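The derived columns can be sanity-checked from the counts, assuming "gain" is the ratio of raw to lemmatized types and "token/type (lem.)" is tokens divided by lemmatized types (these definitions are an interpretation, not stated in the table itself); a minimal sketch for two of the rows:

```python
# Recompute the derived columns for two rows of the table above.
# Assumption: gain = raw types / lemmatized types,
#             token/type (lem.) = tokens / lemmatized types.
rows = {
    # name: (tokens, raw_types, lemmatized_types)
    "bigram": (48_473_756, 3_836_212, 3_119_422),
    "Stanford": (35_772_003, 8_750_839, 7_430_397),
}

for name, (tokens, raw, lem) in rows.items():
    gain = raw / lem
    ratio = tokens / lem
    print(f"{name}: gain={gain:.2f}, token/type={ratio:.2f}")
```

Rounded to two decimals, these reproduce the bigram (1.23, 15.54) and Stanford (1.18, 4.81) entries.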