Table 1: Basic information about the lost languages.
Language  Family        Source                #Tokens  Segmentation Situation  Century
Gothic    Germanic      Wulfila               40,518   unsegmented             3–10 AD
Ugaritic  Semitic       Snyder et al. (2010)  7,353††  segmented               14–12 BC
Iberian   unclassified  Hesperia              3,466‡‡  undersegmented          6–1 BC
Hesperia: http://hesperia.ucm.es/. The Iberian language is semi-syllabic, but this database has already transliterated the inscriptions into Latin script.

†† This dataset directly provides the Ugaritic vocabulary, i.e., each word occurs exactly once.

‡‡ Since the texts are undersegmented and we do not know the ground-truth segmentations, this represents the number of unsegmented chunks, each of which might contain multiple tokens.
