Language | Family | Source | #Tokens | Segmentation Situation | Century |
---|---|---|---|---|---|
Gothic | Germanic | Wulfila† | 40,518 | unsegmented | 3–10 AD |
Ugaritic | Semitic | Snyder et al. (2010) | 7,353†† | segmented | 14–12 BC |
Iberian | unclassified | Hesperia‡ | 3,466‡‡ | undersegmented | 6–1 BC |
‡ http://hesperia.ucm.es/. The Iberian script is semi-syllabic, but this database has already transliterated the inscriptions into Latin script.
†† This dataset directly provides the Ugaritic vocabulary, i.e., each word occurs exactly once.
‡‡ Since the texts are undersegmented and the ground-truth segmentation is unknown, this figure is the number of unsegmented chunks, each of which may contain multiple tokens.