Sub-Character Tokenization for Chinese Pretrained Language Models

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.


Introduction
Large-scale Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Clark et al., 2020; He et al., 2021, inter alia) have achieved great success in recent years and attracted wide research interest; tokenization plays a fundamental role in these models.
However, we believe that existing tokenizers, whether character-based or sub-word-based, are sub-optimal for Chinese. This is based on the observation that Chinese has unique linguistic characteristics: 1) Chinese has an opaque orthography with irregular grapheme-phoneme correspondence (Hao and Yang, 2021). This is in contrast to transparent orthographies like Spanish and Finnish, where each letter approximately represents one sound. As a result, utilizing pronunciation information in Chinese requires explicit pronunciation encoding.
2) Chinese does not have morphological inflection, unlike morphologically-rich languages like Russian (Coulmas, 1991). This renders sub-word tokenization less useful, since the main advantage of sub-word tokenization comes from its ability to split common affixes and root words into separate tokens. In fact, Chinese characters are logograms, and their glyphs (the composition of radicals) also contain rich semantic information, which can only be captured at the sub-character level.
Motivated by these observations, we propose the novel sub-character (SubChar) tokenization. It first encodes every Chinese character into a sequence of phonetic or stroke symbols, and then it uses a sub-word segmenter (such as BPE) to construct the vocabulary on all the encoded sequences. In this way, the resultant tokenizers can capture sub-character tokens that correspond to meaningful phonetic or morphemic units, which are absent from all existing Chinese tokenizers. As far as we know, this is the first attempt at leveraging sub-character information for language models, especially in the context of Chinese NLP.

To assess the effectiveness of our proposed method, we train a series of BERT-style PLMs using the existing and proposed tokenizers. We evaluate these models on over ten datasets of various downstream natural language understanding (NLU) tasks. Through extensive evaluation, we find that models trained with SubChar tokenizers match models trained with character and sub-word tokenizers on downstream task performance. More importantly, SubChar tokenizers have two major advantages compared to existing tokenizers: 1) SubChar tokenizers are more efficient. We find that a small fraction of sub-character tokens in the vocabulary can compose a large variety of rare and complex characters, thus saving much space in the vocabulary for more character combination tokens such as words and phrases. The increased use of combination tokens leads to significantly decreased length of the tokenized sequences. For example, on the iFLYTEK long text classification dataset, with the same vocabulary size as the CharTokenizer, SubChar tokenizers achieve as much as 40% length reduction on the tokenized output. Such length reduction can significantly speed up both pretraining and finetuning.
2) SubChar tokenizers are more robust. A common and unique type of typo in Chinese is caused by homophones, where characters with different semantic meanings have exactly the same pronunciation. SubChar tokenizers based on pronunciation map homophones to the same transliteration sequences, thus improving robustness against any homophone typos. This can be immensely useful when handling noisy inputs.
We believe that our work is an important step towards more tailored techniques for languages beyond just English by effectively integrating the unique linguistic characteristics of the language (Bender, 2019, #BenderRule).

Method
In this section, we describe our proposed SubChar tokenization in detail. We break it down into two steps: 1) Chinese character encoding; 2) vocabulary construction based on the encoded sequences.

Step 1: Character Encoding
The core idea of this step is to encode every Chinese character into a sequence that characterizes its glyph or pronunciation, in order to provide additional inductive biases to the model. We explore several ways of encoding the characters. They can be categorized as pronunciation-based and glyph-based encoding.
Pronunciation-based encoding In order to capture the pronunciation information of characters, we encode Chinese characters using transliteration, which uses IPA-inspired phonetic scripts to characterize the pronunciation.
We explore two different transliteration methods: pinyin and zhuyin (i.e., bopomofo). Pinyin uses romanized transcription and four different tones (¯, ´, ˇ, `) to transliterate characters, e.g., 魑魅魍魉 → Chi¯Mei`WangˇLiangˇ. On the other hand, zhuyin uses a set of graphemes nonexistent in English and the same four tones to transliterate the characters, e.g., 魑魅魍魉 → ㄔㄇㄟ`ㄨㄤˇㄌㄧㄤˇ. In zhuyin, the first tone mark (¯) is usually omitted.
Different Chinese characters may have the same pronunciation even if they have different semantic meanings (i.e., homophones). For disambiguation, we append different indices after the encoded sequences of homophonic characters, so as to allow a biunique mapping between each Chinese character and its transliteration sequence, e.g., Chi¯33#Mei`24#Wangˇ25#Liangˇ13#, ㄔ10#ㄇㄟ`3#ㄨㄤˇ6#ㄌㄧㄤˇ1#.
It is unclear whether such disambiguation of homophones is beneficial. To analyze its impact, we also experiment with a variant where we do not add the indices to disambiguate the homophones. We implement the tokenizer SubChar-Pinyin-NoIndex to perform pinyin encoding without disambiguation indices. We will show that this variant also has the advantage of being robust to homophone typos (section 4.2).
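The pronunciation-based encoding step can be sketched as follows. The pinyin table, the tone-number notation (digits instead of diacritics), and the index assignment are toy stand-ins for the full dictionary used in the paper; '#' is the character separation symbol:

```python
# Toy pinyin table; 意 and 义 are homophones (both "yi4").
PINYIN = {"魑": "chi1", "魅": "mei4", "魍": "wang3", "魉": "liang3",
          "意": "yi4", "义": "yi4"}

def build_indices(table):
    """Assign each character a distinct index among those sharing a pinyin."""
    groups, index = {}, {}
    for ch, py in table.items():
        groups.setdefault(py, []).append(ch)
    for chars in groups.values():
        for i, ch in enumerate(chars):
            index[ch] = i
    return index

INDEX = build_indices(PINYIN)

def encode(text, disambiguate=True):
    """Encode each character as pinyin (+ index when disambiguating) + '#'."""
    out = []
    for ch in text:
        py = PINYIN.get(ch, ch)  # unknown characters pass through unchanged
        suffix = str(INDEX[ch]) if (disambiguate and ch in INDEX) else ""
        out.append(py + suffix + "#")
    return "".join(out)
```

With `disambiguate=False`, this behaves like the SubChar-Pinyin-NoIndex variant: homophones collapse to the same encoded sequence.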

Glyph-based encoding
The glyphs (i.e., shapes) of Chinese characters contain rich semantic information and can help NLP models (Cao et al., 2018). Most Chinese characters can be broken down into semantically meaningful radicals. Characters that share common radicals often have related semantic information, e.g., the four characters '魑魅魍魉' share the same radical '鬼' (meaning "ghost"), and their meanings are indeed all related to ghosts and monsters. In order to capture glyph information, we explore four glyph-based encoding methods, namely Stroke, Wubi, Zhengma, and Cangjie.
For stroke encoding, we use the Latin alphabet to represent the set of Chinese strokes and convert the characters based on the standard stroke orders, e.g., 魑 → pszhshpzznnhpnzsszshn; 魅 → pszhshpzznhhspn (underlined parts indicate shared stroke sequences across these characters).
The other three glyph-based encoding methods encode characters into radical sequences instead, by using glyph-based Chinese input methods: Wubi, Zhengma, and Cangjie. These input methods group strokes together in different ways to form radicals, and then decompose characters into radical sequences. We use the Latin alphabet to represent these radicals, e.g., 魑魅魍魉 → Wubi: rqcc rqci rqcn rqcw; Zhengma: njlz njbk njld njoo; Cangjie: hiyub hijd hibtv himob (underlined parts indicate common radicals among them).
We append the same separation symbol ('#') after each character, and also add the disambiguation indices for characters whose stroke sequences are identical (e.g., 人 (people) and 八 (eight)).However, we note that there are very few cases where different characters have the same glyph encoding.
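The glyph-based encoding can be illustrated with a short sketch. The Wubi codes below are the ones given in the text for 魑魅魍魉; characters sharing a radical share a code prefix, which the later sub-word segmentation step can surface as a single sub-character token:

```python
import os

# Wubi codes for the four example characters (copied from the text).
WUBI = {"魑": "rqcc", "魅": "rqci", "魍": "rqcn", "魉": "rqcw"}

def encode_glyph(text):
    # '#' is the character separation symbol; unknown characters pass through.
    return "".join(WUBI.get(ch, ch) + "#" for ch in text)

def shared_radical_code(chars):
    # The longest common code prefix corresponds to the shared radical (鬼 here).
    return os.path.commonprefix([WUBI[ch] for ch in chars])
```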

Step 2: Vocabulary construction
Once we have the encoded sequences, we can treat the encoding of each character as the equivalent of a 'word' in English and then apply sub-word segmentation to construct the vocabulary for our sub-character tokenizers.
Sub-word segmentation typically forms sub-word tokens by merging frequent token bigrams, which often results in meaningful morphemes of the words when used in languages like English. On our encoded sequences, sub-word segmentation can capture shared sub-character sequences that correspond to shared radicals or phonetic sequences among similar characters. After running the sub-word segmentation step on the encoded sequences, the vocabulary of the resultant sub-character tokenizers consists of a mixture of sub-character tokens, character tokens, and character combination tokens.
In this work, we use the unigram language model segmentation method (Kudo, 2018) implemented in SentencePiece (Kudo and Richardson, 2018) as the default sub-word segmentation method. In section 5.6, we also perform an ablation study by setting the sub-word segmentation method to BPE, which results in similar performance and efficiency, illustrating that the gains of SubChar tokenization are insensitive to the specific choice of sub-word segmentation methods.
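The flavor of this vocabulary-construction step can be illustrated with the simpler BPE variant (which section 5.6 shows performs comparably to unigram): repeatedly merge the most frequent adjacent symbol pair in the encoded sequences. This is a minimal sketch, not the SentencePiece implementation used in the paper:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Tiny BPE sketch over encoded character sequences.
    Returns the learned merge tokens and the re-segmented sequences."""
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in vocab:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs.
        new_vocab = []
        for w in vocab:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_vocab.append(out)
        vocab = new_vocab
    return merges, vocab
```

Running this on the Wubi-encoded sequences from Step 1 (e.g., `["rqcc#", "rqci#", "rqcn#", "rqcw#"]`) merges the shared prefix `rqc` into a single sub-character token, which corresponds to the shared radical.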

Optional Step: Chinese Word Segmentation
Before the first step of character encoding, there is an optional step of Chinese word segmentation. Chinese word segmentation (CWS) is a common technique for splitting Chinese text chunks into sequences of Chinese words. The resulting segmented words sometimes provide better granularity for downstream tasks (Chang et al., 2008). However, the impact of CWS is unclear in the context of pretraining, especially its interplay with tokenization. Hence, we propose a way to incorporate CWS into our SubChar tokenization and examine whether it is helpful. Our proposed tokenization pipeline is summarized in Figure 2.
Given that the vocabulary of SubChar tokenizers consists of character combinations, characters, and sub-characters, we use CWS to construct the character combination part of the vocabulary. Compared to the character combination tokens generated by the statistical approach of sub-word tokenization, the combination tokens generated by a trained Chinese word segmenter carry more linguistic prior knowledge. Specifically, to construct the vocabulary, we first segment the pretraining corpus into words. Then, we select the most frequent words as the character combination part of the SubChar tokenizer vocabulary. We then encode the text with one of the pronunciation- or glyph-based encoding methods and use sub-word tokenization on the encoded sequences to obtain the sub-character and character tokens of the vocabulary. Finally, we merge these parts together as the vocabulary of the SubChar tokenizer. When tokenizing new inputs, we first segment them into words; if the words are in the vocabulary, they are tokenized as word tokens; if not, they are further processed by the SubChar tokenizer. We control the ratio of word tokens in the vocabulary to be 80% based on preliminary tuning, and we use the state-of-the-art segmenter THULAC (Li and Sun, 2009; Sun et al., 2016) for word segmentation.
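The word-token lookup with SubChar fallback described above can be sketched as follows. The word list and the fallback encoder are hypothetical placeholders; in the paper, a trained segmenter (THULAC) supplies the word segmentation and the SubChar tokenizer supplies the fallback:

```python
# Assumed word-token portion of the vocabulary (toy example).
WORD_VOCAB = {"家乡", "风景"}

def subchar_fallback(word):
    # Placeholder for encoding + sub-word segmentation of unmatched words.
    return [ch + "#" for ch in word]

def tokenize(words):
    """`words` is the output of a CWS segmenter. Words found in the
    vocabulary become single tokens; others fall back to SubChar."""
    tokens = []
    for w in words:
        if w in WORD_VOCAB:
            tokens.append(w)
        else:
            tokens.extend(subchar_fallback(w))
    return tokens
```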

Experiment Setup
In this section, we introduce our baselines, datasets and experiment settings.

Baselines
We compare against two existing tokenization methods as baselines, namely single-character tokenization and sub-word tokenization. For a fair comparison, we set the same vocabulary size of 22,675 for all tokenizers, including the baselines and our proposed tokenizers. This is consistent with the vocabulary size of Chinese BERT (Devlin et al., 2019).

Pretraining Data
We use the same training corpus to train all the tokenizers in this work. The corpus consists of 2.3GB of Chinese text from Baidu Baike. To evaluate the effectiveness of the tokenizers, we pretrain a BERT model using each tokenizer and compare their performance on downstream tasks. When pretraining the BERT models, we use the same pretraining corpus (i.e., Baidu Baike) and the same set of hyper-parameters. Notably, we also pretrain a new BERT model using the character tokenizer on our pretraining corpus instead of loading from existing checkpoints (Devlin et al., 2019), so that it provides an apples-to-apples comparison with our proposed methods. Since our proposed tokenizers are direct drop-in replacements for the baseline tokenizers, they do not incur any extra parameters. In summary, all the compared models have the same training corpus, hyper-parameters, and number of parameters, allowing for a truly fair comparison.

Evaluation Data
We finetune and evaluate the pretrained models with different tokenization methods on various downstream NLU datasets, including single-sentence classification, sentence-pair classification, and reading comprehension tasks. We briefly introduce each dataset below and present the dataset statistics in Table 1. TNEWS (Xu et al., 2020b) is a news title classification dataset containing 15 classes. IFLYTEK (Xu et al., 2020b) is a long text classification dataset containing 119 classes. The task is to classify mobile applications into corresponding categories given their descriptions. BQ (Chen et al., 2018) is a sentence-pair question matching dataset extracted from an online bank customer service log. The goal is to evaluate whether two questions are semantically equivalent. THUCNEWS (Li and Sun, 2007) is a document classification dataset with 14 classes. The task is to classify news into the corresponding categories given their titles and contents. CLUEWSC (Xu et al., 2020b) is a coreference resolution dataset in the format of the Winograd Schema Challenge (Levesque et al., 2012). The task is to determine whether the given noun and pronoun in a sentence refer to the same entity. AFQMC (Xu et al., 2020b) is the Ant Financial Question Matching Corpus for the question matching task, which aims to predict whether two sentences are semantically equivalent. CSL is the Chinese Scientific Literature dataset extracted from academic papers. Given an abstract and some keywords, the goal is to determine whether they belong to the same paper. It is formatted as a sentence-pair classification task. OCNLI (Hu et al., 2020) is a natural language inference dataset. The task is to determine whether the relationship between the hypothesis and the premise is entailment, neutral, or contradiction. CHID (Zheng et al., 2019) is a cloze-style multiple-choice reading comprehension dataset. Given a context where some idioms are masked, the task is to select the appropriate idiom from a list of candidates. C3 (Sun et al., 2020) is a multiple-choice reading comprehension dataset. The goal is to choose the correct answer for the questions given the context. CMRC (Cui et al., 2019b) is a span-extraction reading comprehension dataset consisting of questions annotated from Wikipedia paragraphs. CLUENER2020 (Xu et al., 2020a) is a named entity recognition dataset with 10 entity types.

Hyper-parameters
We elaborate on all hyper-parameters involved for reproducibility (we also release all code, trained tokenizers and models).
Tokenizer Training. When training tokenizers with SentencePiece, we use a character coverage of 1.0 and the model type 'unigram' for all tokenizers being compared. Other hyper-parameters follow the defaults of SentencePiece.
BERT pretraining. We follow the training procedure of BERT (Devlin et al., 2019) except that the next sentence prediction objective is removed. The pretraining process consists of two stages. The first stage uses a maximum sequence length of 128 with a batch size of 8K for 8K steps. The second stage uses a maximum sequence length of 512 with a batch size of 4K for 2K steps. We experiment primarily with 6-layer Transformer (Vaswani et al., 2017) models for the baseline CharTokenizer and the proposed SubChar-Pinyin tokenizer. Other model configurations are the same for all models: 12 attention heads, an intermediate size of 3072, and a hidden size of 768.
BERT finetuning. For finetuning on downstream datasets, we use a batch size of 32, a maximum of 24 training epochs, and tune the maximum sequence length in {96, 256, 512}. Since the original test sets are not released, we use the original dev sets as the test sets and randomly hold out 10% of the training sets as the dev sets. We select the best checkpoint on the dev sets and report performance on the test sets. These hyper-parameters are consistent with previous work. For all experiments in this paper, we report results averaged over three different random seeds. All experiments are done on NVIDIA A100 GPUs.

Experiment Results
In this section, we present the experiment results and the main findings. We not only evaluate on a wide range of common Chinese NLU datasets, but also perform robustness evaluation on both synthetic and real-world noisy data.

Standard Evaluation
We compare models trained with our SubChar tokenizers and the baseline tokenizers. There are multiple possible encoding methods for SubChar tokenizers, as described in section 2. In this section, we choose two representative ones: Wubi (glyph-based) and Pinyin (pronunciation-based). We later show a full ablation of all the different encoding methods in section 5.5.
Table 2 shows the performance of BERT models with different tokenizers on downstream datasets. Examining the results of the 6-layer BERT models pretrained on the 2.3GB Baidu Baike corpus, we observe that despite some variation across different datasets, our proposed sub-character tokenizers can match the baselines on downstream datasets. When scaling the 6-layer models to 12 layers, we observe a moderate improvement in the average performance (70.75 → 72.23 for CharTokenizer and 71.42 → 72.87 for SubChar-Pinyin). Besides, we discuss the impact of pretraining data size in section 5.4. These results demonstrate that on standard NLU benchmarks, our proposed tokenizers can serve as a very strong alternative.

Robustness Evaluation
Apart from evaluating on the standard benchmarks, we also verify whether our proposed tokenization methods are better at handling noisy inputs. We cover the two major Chinese input methods: keyboard input and speech input. For keyboard input, we construct synthetic noise tests via character substitutions. For speech input, we use a noisy test set including inputs with diverse accents, which poses greater typo diversity. Our SubChar-Pinyin method shows an advantage in both cases.

Synthetic Typos
We simulate the homophone typos that are common in real-world Chinese writing, especially in user-generated inputs. As shown in Figure 3, pinyin input is the most widely used keyboard input method for Chinese users (see https://en.wikipedia.org/wiki/Chinese_input_methods_for_computers). When users type in the romanization of the intended characters, the input interface presents all Chinese characters with the same romanization for the users to choose from. As a result, it is common for users to choose the wrong characters, either by mistake or because they are unclear about the differences among these homophones.
In such cases, our SubChar-Pinyin-NoIndex tokenizer (described in section 2.1) has the advantage of being robust to any such homophone typos. As illustrated in Figure 4, the character encoding maps all homophones of a character into the same romanization sequence before the sub-word tokenization. As a result, the tokenized output is identical no matter what the typo character is, as long as it is a homophone of the intended character.
We inject synthetic noise into the test data and examine whether models trained on clean training data can perform well on these noisy data. To construct the noisy data, we replace the original correct characters with their homophones, e.g., changing '意' (sense) to '异' (different) and '义' (meaning) to '议' (debate). Specifically, we randomly sample a certain ratio r% of the original characters. For each of them, we replace it with a homophone randomly sampled from all its homophones obtained via a pinyin dictionary (no replacement if it has no homophones).
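The noise-injection procedure can be sketched as follows. The homophone dictionary is a toy stand-in for the pinyin dictionary used in the paper, and the sampling details (e.g., the seed handling) are illustrative:

```python
import random

# Toy homophone dictionary; the paper derives this from a pinyin dictionary.
HOMOPHONES = {"意": ["异", "议"], "义": ["议", "意"]}

def inject_typos(text, ratio, seed=0):
    """Replace roughly `ratio` of the characters with random homophones."""
    rng = random.Random(seed)
    chars = list(text)
    n = max(1, int(len(chars) * ratio))
    for i in rng.sample(range(len(chars)), n):
        candidates = HOMOPHONES.get(chars[i])
        if candidates:  # no replacement if the character has no homophones
            chars[i] = rng.choice(candidates)
    return "".join(chars)
```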
The results are shown in Table 3. We observe that there can be a significant drop in performance when there exist homophone typos in the test data. For example, the BERT model trained with CharTokenizer drops from 64.10% accuracy on clean data to 25.20% accuracy when 37.5% of the characters in the test inputs are replaced with homophone typos. Overall, we find that the character tokenizer, the sub-word tokenizer, and the vanilla SubChar-Pinyin tokenizer cannot handle such noisy data. However, our SubChar-Pinyin-NoIndex tokenizer exhibits no performance drop under noise. Moreover, despite learning a shared representation for homophones, the model with SubChar-Pinyin-NoIndex still performs competitively on the clean test sets, either matching the baselines (on C3) or performing only slightly worse (on TNEWS and OCNLI).
Real-World Typos
While the above synthetic typos aim to simulate typos in keyboard inputs, another major input method is speech input, where users speak to their devices (such as mobile phones) and their speech is then converted to text for downstream tasks. In order to evaluate model robustness in such scenarios, we use a realistically collected test set that captures such speech input typos. Specifically, we use the speech-noise version of the AFQMC test set from the READIN benchmark (Si et al., 2023), where spoken versions of the test inputs are converted to text using commercial automatic speech recognition (ASR) software. We refer readers to the dataset description paper for more details on data construction. When computing performance for each test example, we compute both the average across different annotations (Noisy-Average) and the worst performance across different annotations (Noisy-Worst), and then take the macro-average across all examples. The character-level error rate of the noisy test set is 30% on average. This AFQMC noisy test set contains not only homophone typos, but also a wide range of other types of real-world input noise due to both accent variations and ASR errors. The greater diversity of typo types in the real-world test set makes it much more challenging to maintain robustness than in the synthetic setting, which only considers homophone typos. While the original AFQMC is a binary classification task that classifies whether a question pair is a paraphrase or not, we find that models trained on the AFQMC training set exploit spurious correlations like lexical overlap, even though we explicitly balanced the training set. In particular, when typos are introduced into the test data, performance on positive examples drops drastically due to the lower lexical overlap, while performance on negative examples stays the same or even improves slightly because of the lower lexical overlap caused by the typos. This is similar to previous findings on HANS (McCoy et al., 2019) and PAWS (Zhang et al., 2019a). Hence, we follow the common evaluation practice when dealing with spurious correlations, which is to focus on improving the worst-group performance; in this case, we focus on improving performance on the positive examples against the impact of typos.
The results are shown in Table 4, where we report performance on the AFQMC positive examples. All models are trained on the original clean data from AFQMC (we balanced the positive and negative classes during training). We evaluate on the original clean test set, the Noisy-Average performance (N-Average), and the Noisy-Worst performance (N-Worst). We can see that despite this more challenging speech typo setting, our SubChar-Pinyin model still outperforms the baselines.
These results highlight the robustness advantage of our Sub-Character tokenization method in both dealing with synthetic homophone typos as well as on more diverse real-world typos.

Effect of CWS
We examine the impact of incorporating CWS into the tokenization as described in Section 2.3. We train tokenizers with and without CWS and compare the performance of the corresponding pretrained models. As shown in Table 5, adding CWS as an additional step does not help downstream task performance. These results serve as empirical evidence that CWS is ineffective for PLMs, complementing the results of Li et al. (2019) on models without pretraining.

Character-Level Tasks
The evaluation in Section 4.1 is restricted to sequence-level classification tasks such as single-sentence classification, sentence-pair classification, and machine reading comprehension.
One might wonder how SubChar tokenizers handle character-level tasks where classification is done on every single character, such as sequence labeling and span extraction. Since SubChar tokenizers may combine multiple characters into one token or split one character into sub-character tokens, directly adding a classification head on each token may cause a discrepancy with the human annotation, which is done at the character level. For example, it is infeasible to evaluate the POS tag of a sub-character token.
To handle such situations, we perform classification at the character level for these tasks. To obtain the representation of each character, we average the representations of all its sub-character tokens. We apply this to the final layer of BERT and feed the character representations to a linear classifier for downstream tasks.
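This character-level averaging can be sketched with plain Python lists standing in for hidden-state tensors; the mapping from characters to the indices of their sub-character tokens is assumed to be produced by the tokenizer:

```python
def char_representations(hidden, char_to_tokens):
    """Average final-layer token vectors over each character's tokens.

    hidden: list of token vectors (each a list of floats).
    char_to_tokens[c]: indices of the sub-character tokens of character c.
    """
    reps = []
    for idxs in char_to_tokens:
        vecs = [hidden[i] for i in idxs]
        # Component-wise mean over the tokens covering this character.
        reps.append([sum(vals) / len(vecs) for vals in zip(*vecs)])
    return reps
```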
We measure the performance of this approach on CMRC (span-extraction reading comprehension) and CLUENER (named entity recognition) and show the results in Table 6. The results show that our model can indeed handle character-level tasks with this simple adaptation. There might be better ways of adapting our model to character-level tasks, and we leave them to future work.

Analysis
In this section, we conduct various analyses to better understand the working mechanisms of SubChar tokenization, including illustrations of the efficiency improvement and ablations on different components of our tokenization pipeline.

Vocabulary Composition
We break down the vocabulary of each tokenizer into three different categories: sub-character tokens, character tokens, and character combination tokens (words and phrases). As shown in Figure 5, character tokenizers only have character tokens, while sub-word tokenizers have a small percentage of combination tokens. The main reason for the relatively small number of combination tokens in sub-word tokenizers is that, unlike English words, which are composed of the 26 letters of the alphabet, there are thousands of unique Chinese characters, which take up a large proportion of the vocabulary in order to maintain coverage.
In contrast, SubChar tokenizers use a very small fraction of sub-character tokens to compose many complex Chinese characters, therefore freeing up a large percentage of the vocabulary to store combination tokens. This brings the advantage of having more words and phrases in the tokenized outputs, thus shortening the sequence lengths, as elaborated in the next section.

Efficiency Improvement
The direct consequence of having more character combinations in the vocabulary is that the tokenized sequences are shorter. Comparing the tokenized sequence lengths of different tokenizers on two downstream datasets, we observe that SubChar tokenizers tokenize the inputs into much shorter sequences. Moreover, our SubChar tokenizers can speed up both pretraining and finetuning. During finetuning, we can pack multiple sequences into one input sequence to reduce the computation waste introduced by sequence padding (Krell et al., 2021); shorter sequence lengths allow the sequences to be packed more densely, thus increasing the overall throughput.
Table 8 shows the model finetuning time relative to the CharTokenizer baseline. We observe a significant speedup from SubChar tokenizers, finishing in as little as 68.9% of the time on iFLYTEK with the SubChar-Pinyin-NoIndex tokenizer. In Figure 6, we plot the training curves for the CharTokenizer baseline and the SubChar-Pinyin tokenizer. The speedup on pretraining is also significant. While the running speed differs across machines, the compression brought by the shorter tokenized outputs is hardware-invariant. In Table 9, we show the relative size (disk memory) of the tokenized pretraining corpus. We observe that SubChar tokenizers tokenize the raw pretraining texts into shorter sequences than the baselines, resulting in much smaller pretraining data (e.g., as much as 25.3% smaller than that of the CharTokenizer baseline with SubChar-Pinyin-NoIndex).
In turn, this can translate to much faster pretraining on any training infrastructure.

Impact of Vocabulary Size
Intuitively, when we increase the vocabulary size, there is more room to store combination tokens (e.g., words and phrases), leading to shorter tokenized sequences and thus better efficiency. Although we used the standard vocabulary size of 22,675 in our previous experiments, to understand whether the efficiency benefits of SubChar tokenization wear off at larger vocabulary sizes, we perform an additional ablation on the impact of vocabulary size.
As shown in Table 10, as we increase the vocabulary size, the efficiency advantage of SubChar tokenizers slightly diminishes. However, even at a very large vocabulary size of 60,000, our SubChar-Pinyin tokenizer still tokenizes the inputs into significantly shorter sequences than the sub-word baseline. We thus conclude that the efficiency advantage of our SubChar tokenizers holds in most practical cases, where the vocabulary size is typically under 60,000 (as in BERT and RoBERTa).

Impact of Pretraining Data Size
To understand the impact of pretraining data size, we take the checkpoints of the 12-layer Transformer models pretrained on the 2.3GB Baike corpus, and further pretrain them on a much larger corpus of 22.1GB of text. This 22.1GB corpus is sampled from Chinese web text, mainly consisting of books and web pages. We further pretrain for 8K steps with a maximum sequence length of 512.
As shown in the bottom block of Table 2, further training on this larger corpus leads to a small improvement in average performance (72.23 → 72.81 for CharTokenizer and 72.87 → 73.42 for SubChar-Pinyin), possibly because the original models trained on the 2.3GB corpus are already close to fully trained. More importantly, this result shows that even with pretraining on larger corpora, our proposed methods can still match or slightly outperform the baselines on the downstream datasets.

Impact of Encoding Methods
As described in Section 2, we experiment with different types of encoding methods and compare their downstream performance to analyze the impact.
Our previous encoding methods are based on the hypothesis that linguistic information such as glyph or pronunciation provides useful inductive biases to the model. However, if this hypothesis does not hold, non-linguistic encoding methods may work just as well. To verify this, we add two encoding methods that do not consider any linguistic information, Byte Encoding and Random Index Encoding, for the purpose of ablation analysis.
In Byte Encoding, we convert every character into its byte sequence, as in ByT5 (Xue et al., 2022). In cases where the byte sequence consists of multiple indices (each Chinese character has three byte indices in UTF-8), we concatenate them and append the character separation symbol as the encoding (e.g., 魑 → 233_173_145#).
In Random Index Encoding, we map each character into a unique, randomly generated five-digit index and append the character separation symbol as the encoding (e.g., 魑 → 29146#).
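Both non-linguistic encodings can be sketched in a few lines of Python. The byte values below follow from the UTF-8 encoding of the character and match the example above; the random-index table (including the seed and the choice of characters) is an illustrative assumption, since the actual index assignment is not specified here.

```python
import random

SEP = "#"  # character separation symbol appended after each encoded character

def byte_encode(char: str) -> str:
    """Byte Encoding: join the UTF-8 byte values of the character with '_'."""
    return "_".join(str(b) for b in char.encode("utf-8")) + SEP

def make_random_index_encoder(chars, seed=0):
    """Random Index Encoding: assign each character a unique random
    five-digit index (a hypothetical assignment for illustration)."""
    rng = random.Random(seed)
    indices = rng.sample(range(10000, 100000), len(chars))
    table = {c: str(i) + SEP for c, i in zip(chars, indices)}
    return lambda c: table[c]

print(byte_encode("魑"))  # → 233_173_145#
```

After either encoding, the sub-word vocabulary is constructed over the encoded strings rather than the raw characters, so the two ablations differ from the linguistic encodings only in what sequence each character maps to.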
We train SubChar tokenizers with all the different encoding methods and compare the corresponding BERT models using these tokenizers on downstream tasks. The results are presented in Table 11. We observe that the differences between these tokenizers in terms of model performance on downstream datasets are rather small. Moreover, perhaps somewhat surprisingly, tokenizers with the non-linguistic encoding methods, SubChar-Byte and SubChar-RandomIndex, also perform competitively despite the fact that they do not capture glyph or pronunciation information like the other tokenizers. These results suggest that linguistic encoding may not be necessary for SubChar tokenizers to achieve high performance on downstream tasks. However, the linguistic encoding methods build more robust and efficient tokenizers, as illustrated in previous sections.

Impact of Vocabulary Construction Algorithm
In previous experiments, we used the Unigram LM implementation in SentencePiece for vocabulary construction. We perform an additional ablation where we replace Unigram LM with Byte Pair Encoding (BPE) to train a pinyin-based tokenizer, while holding all other hyper-parameters constant.
We compare the SubChar-Pinyin-BPE variant with the Unigram LM (SubChar-Pinyin) tokenizer and find that the two perform similarly. In terms of efficiency, SubChar-Pinyin-BPE tokenizes iFLYTEK to an average length of 184.4 and TNEWS to an average length of 15.9; in comparison, SubChar-Pinyin tokenizes iFLYTEK to an average length of 185.2 and TNEWS to an average length of 16.1. The vocabulary compositions of the two are also similar: character combinations take up the majority of the vocabulary for both the BPE and Unigram LM implementations. In terms of performance, we observe in Table 11 that the BPE and Unigram LM implementations show little difference on downstream tasks. Based on these results, we conclude that the choice of vocabulary construction algorithm has a marginal impact on tokenization efficiency and model performance.
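To make the BPE variant concrete, the sketch below runs textbook BPE merges over a toy pinyin-encoded corpus. This is a minimal illustration, not the actual SentencePiece implementation used in the experiments; the corpus strings and merge count are invented for the example.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    `corpus` is a list of pinyin-encoded strings; each starts as a
    sequence of single characters. Returns the merged vocabulary
    entries created and the final segmented sequences.
    """
    seqs = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the merge greedily left-to-right in every sequence
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

# toy pinyin-encoded corpus ('#' is the character separation symbol)
corpus = ["ni3#hao3#", "ni3#hao3#", "hao3#"]
merges, seqs = bpe_merges(corpus, 5)
```

Running this, frequent pinyin syllables quickly coalesce: the tone-plus-separator pair "3#" is merged first, and the whole-character token "hao3#" emerges after a few more merges, mirroring how character combinations come to dominate the learned vocabulary.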

Related Work
Chinese PLMs. Chinese BERT (Devlin et al., 2019) is the first Chinese PLM, which adopts character tokenization. Since then, researchers have explored techniques to explicitly incorporate word-level information into Chinese PLMs for better performance. Zhu (2020) and Zhang et al. (2021a) expand the BERT vocabulary with Chinese words apart from Chinese characters and incorporate them in the pretraining objectives. Cui et al. (2019a), Wei et al. (2019), and Xiao et al. (2021) consider coarse-grained information by masking whole words and n-grams during masked language modeling pretraining. Diao et al. (2020) incorporate word-level information by superimposing character and word embeddings. Lai et al. (2021) incorporate Chinese word lattice structures in pretraining. Different from these studies, we investigate information at the sub-character level for Chinese PLMs.
Linguistically-Informed Techniques for Chinese NLP. Before the era of PLMs, many efforts were made to incorporate linguistic knowledge, including both glyph (Sun et al., 2014; Yu et al., 2017; Cao et al., 2018) and pronunciation (Zhang et al., 2019b; Chaudhary et al., 2018), into word embeddings (Mikolov et al., 2013). Beyond word-level representations, researchers have explored the use of linguistic information to enhance sequential models (Dong et al., 2016; Bharadwaj et al., 2016; Liu et al., 2017), especially BERT (Meng et al., 2019; Sun et al., 2021). Compared to these works, we do not incorporate additional information from sources like images; instead, our proposed tokenization methods are drop-in replacements for existing tokenizers, without adding any extra layers or parameters. Besides, Chinese word segmentation (CWS) is a common preprocessing step for Chinese NLP (Li and Sun, 2009). Li et al. (2019) empirically analyze whether CWS is helpful for Chinese NLP tasks before the era of PLMs and find that the answer is no in many cases. In our work, we also devote a section to examining the impact of CWS specifically for PLMs. Moreover, as shown by Huang et al. (2021), incorporating linguistic information also benefits spelling check. Instead of explicitly performing spelling check, our linguistically-informed tokenizers are robust to spelling errors.

Granularity of Tokenization.
Although sub-words have been the default granularity of tokenization since the release of BERT, researchers have also explored different granularities for PLMs. For instance, ELMo (Peters et al., 2018), an early pioneer of PLMs, uses character representations. Ma et al. (2020) combine character representations with sub-word representations for better performance and robustness. Nzeyimana and Rubungo (2022) incorporate a morphological analyzer for tokenization and achieve gains for a Kinyarwanda language model. More recently, there is a trend toward tokenization-free methods, including Byte-BPE (Wei et al., 2021), CANINE (Clark et al., 2021), ByT5 (Xue et al., 2022), and Charformer (Tay et al., 2022), which discard explicit tokenization and directly represent inputs as small units such as bytes. The downside of these tokenization-free approaches is obvious: the longer tokenized sequences slow down both training and inference. Contrary to them, our sub-character tokenization encourages the use of more character combinations, which largely shortens the tokenized sequences.

Conclusion
In this work, we propose sub-character tokenization and conduct comprehensive experiments to illustrate its advantages over existing tokenization methods. Compared to treating each individual character as a token (CharTokenizer) or directly running sub-word tokenization on the raw Chinese text (sub-word tokenizer), our SubChar tokenizers not only perform competitively on downstream NLU tasks; more importantly, they can be much more efficient and robust. We conduct a series of ablations and analyses to understand why SubChar tokenizers are more efficient, as well as the impact of linguistic and non-linguistic encoding. Given these advantages, we believe that SubChar tokenizers are better alternatives to existing Chinese tokenizers, especially in applications where efficiency and robustness are critical. Our approach may also be useful for other morphologically poor languages, and more sophisticated methods could be built on SubChar tokenization for even better performance. We leave these interesting directions for future exploration. On a broader level, our work makes an important attempt at developing methods tailored to a language drastically different from English, with promising results. We believe that this is a crucial direction for the community given the linguistic diversity of the world. We hope that our work can inspire more such efforts to benefit language technology users from different countries and cultures.

Limitations
Our experiments focus on natural language understanding tasks. We recognize that adapting SubChar tokenization to language generation tasks might require additional effort; for example, we may want to avoid predicting sub-character tokens that do not form complete characters. Also, evaluating the robustness of language generation models to real-world input noise may require benchmarks beyond those used in this paper. We leave such exploration as an interesting direction for future work.
Another limitation is that our method is designed specifically for the Chinese language.While we hypothesize that our method can also bring benefits to other languages with ideographic symbols, such as Kanji in Japanese, we leave such investigation to future work.

Broader Impact
We expect our work to have a positive impact on society. Firstly, we address the practical problem of handling input with real-world noise. Such noisy settings are very common in real-life applications. Our method, along with the evaluation framework, can help make language technologies more robust and reliable in real-world applications, especially for Chinese users. Secondly, we address the efficiency concerns of large language models by significantly reducing both training and inference time. This not only reduces the latency of these models in real-world applications; more importantly, it helps reduce the environmental costs of using these large models, moving further towards Green AI. All of our code and models are released with proper documentation to facilitate the adoption of our work in a wide range of research and industrial applications.

Figure 1:
Figure 1: Comparison of existing tokenizers (character tokenizer and sub-word tokenizer) and our sub-character tokenizers (SubChar-Wubi using glyph encoding and SubChar-Pinyin using pronunciation encoding). Different tokens produced by the tokenizers are separated by '|'. The numbers in (brackets) indicate the number of tokens in the tokenized sequence. Tokens in orange indicate character combinations, while tokens in green indicate sub-character tokens. '#' indicates the special separation symbol after each character; circled numbers (1 2 3 4) indicate the tone of characters. (Figure best viewed in color.)

Figure 2:
Figure 2: Illustration of the tokenization pipeline when incorporating CWS. After the first step of CWS, high-frequency words (words in the dashed box) directly become part of the final output sequence; the other words then go through SubChar tokenization.

Figure 3:
Figure 3: An actual interface of the popular pinyin input method. The first line, yi yi, is the user's input romanization sequence; all words with this same pronunciation are listed below for users to choose from.

Figure 4:
Figure 4: Illustration of how our SubChar-Pinyin-NoIndex tokenizer is robust to any homophone typos. The possible homophone typos (characters in purple dashed boxes) are mapped into the same romanization sequence as the intended correct characters, and hence the resulting tokenization based on the romanized sequences is the same.

Figure 5:
Figure 5: Breakdown of different types of tokens in the vocabularies of various tokenizers. We observe a clear trend: in our SubChar tokenizers, a small fraction of sub-character tokens saves space to store many more character combination tokens (e.g., words and phrases).

Figure 6:
Figure 6: Training curves on the iFLYTEK dataset with two different models. The y-axis indicates classification loss (cross-entropy); the x-axis indicates time (seconds). Our SubChar-Pinyin-NoIndex model reaches a lower loss than the CharTokenizer baseline throughout training.

Table 1:
Statistics of downstream datasets.

Table 2:
Results on downstream datasets of different tokenizers. The last column indicates average performance. To ablate the impact of model size, we also pretrain 12-layer Transformer models.

Table 3:
Results for noisy evaluation with homophone typos. Different columns correspond to different percentages of typos in the test data. The BERT model with our SubChar-Pinyin-NoIndex tokenizer (results in bold) suffers no performance drop on noisy test data since it is robust to all homophone typos.

Table 5:
Results of models trained with different tokenizers. Numbers in brackets indicate the difference between adding and not adding the CWS step in tokenization. Adding CWS brings no significant improvement in performance.

Table 6:
Results on two character-level classification datasets: CMRC (span extraction) and CLUENER (named entity recognition). Models are 6-layer BERT. Models with SubChar tokenizers perform close to the baseline models.

Table 7:
Comparison of the average length of tokenized sequences with different tokenizers. SubChar tokenizers produce much shorter tokenized sequences than the baselines; the SubChar-Pinyin-NoIndex tokenizer achieves the most length reduction. The BPE and Unigram LM counterparts achieve similar speedups.

Table 8:
Finetuning time of models with different tokenizers. Numbers indicate time relative to the CharTokenizer baseline model. Models with SubChar tokenizers take much less time to finish finetuning; SubChar-Pinyin-NoIndex brings the most speedup.

Table 10:
Comparison of the average length of tokenized sequences with different tokenizers and different vocabulary sizes.

Table 11:
Results of SubChar tokenizers when using different encoding methods. The last row is a model with the SubChar-Pinyin tokenizer using BPE as the sub-word tokenization algorithm; all previous rows use Unigram LM as the sub-word tokenization implementation. All models have 6 layers and the same hyper-parameters. The impact of different encoding methods on downstream performance is small, and the ULM and BPE versions of SubChar-Pinyin achieve similar results.