Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel corpora, and thereby show that the types of recurrent patterns that have the strongest impact on compression are an indicator of morphological typology. For languages with richer inflectional morphology there is a preference for highly productive subwords in the early merges, while for languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression. Counter to the common perception that BPE subwords are not linguistically relevant, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way to characterize languages according to their BPE subword properties, inspired by the notion of morphological productivity in linguistics. This yields language vectors that encode typological knowledge induced from raw text. Our approach is easily applicable to a wider range of languages and texts, as it does not require annotated data or any external linguistic knowledge. We discuss its potential contributions to quantitative typology and multilingual NLP.
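The merge dynamics described above can be illustrated with a minimal BPE trainer. This is a simplified sketch, not the paper's implementation: the toy corpus, function names, and the `</w>` end-of-word marker are illustrative assumptions. It shows how the earliest merges pick out the most frequent recurrent patterns (here, the productive suffix `est`).

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Apply one merge operation: fuse the chosen pair everywhere."""
    split_form = " ".join(pair)
    fused_form = "".join(pair)
    return {word.replace(split_form, fused_form): freq
            for word, freq in words.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merges.append(best)
    words = merge_pair(best, words)

print(merges)
# The first merges assemble the frequent suffix "est</w>",
# a highly productive pattern in this toy corpus.
```

In this toy run the first three merges build the shared suffix `est</w>` before any stem material is merged, mirroring the paper's observation that early merges capture the patterns with the strongest compression effect.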
Action Editor: Carlos Gómez-Rodríguez