Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.


Introduction
End-to-end neural models have generally replaced the traditional NLP pipeline, and with it, the error cascades and feature engineering common to such systems, preferring instead to let the model automatically induce its own sophisticated representations. Tokenization, however, is one of few holdovers from that era, with nearly all commonly-used models today requiring an explicit preprocessing stage to segment a raw text string into a sequence of discrete model inputs. Broadly speaking, tokenizers are generally either carefully constructed systems of language-specific rules, which are costly, requiring both manual feature engineering and linguistic expertise, or data-driven algorithms such as Byte Pair Encoding (Sennrich et al., 2016), WordPiece (Wu et al., 2016), or SentencePiece (Kudo and Richardson, 2018) that split strings based on frequencies in a corpus, which are less brittle and easier to scale, but are ultimately too simplistic to properly handle the wide range of linguistic phenomena that can't be captured by mere string-splitting (§2.1).
CANINE: Character Architecture with No tokenization In Neural Encoders.
Code and checkpoints are available on GitHub at http://caninemodel.page.link/code.
Published in Transactions of the Association for Computational Linguistics (TACL), 2022.
The degree of sophistication required to accurately capture the full breadth of linguistic phenomena, along with the infeasibility of writing such rules by hand across all languages and domains, suggests that explicit tokenization itself is problematic. In contrast, an end-to-end model that operates directly on raw text strings would avoid these issues, instead learning to compose individual characters into its own arbitrarily complex features, with potential benefits for both accuracy and ease of use. While this change is conceptually very simple (one could replace the subword vocabulary in a model like BERT (Devlin et al., 2019) with a vocabulary made solely of individual characters), doing so leads to two immediate problems. First, the computational complexity of a transformer (Vaswani et al., 2017), the main component in BERT as well as other models such as GPT (Radford et al., 2019; Brown et al., 2020) and T5 (Raffel et al., 2020), grows quadratically with the length of the input. Since standard subword models have roughly four characters per subword on average, the 4x increase in input sequence length would result in a significantly slower model. Second, simply switching to a character vocabulary yields empirically poor results (§4.2).
In order to enable tokenization-free modeling that overcomes these obstacles, we present CANINE. CANINE is a large language encoder with a deep transformer stack at its core. Inputs to the model are sequences of Unicode characters. To represent the full space of Unicode characters without a vocabulary, we employ a hashing strategy. To avoid the slowdown from increasing the sequence length, CANINE uses strided convolutions to downsample input sequences to a shorter length before the deep transformer stack.
arXiv:2103.06874v4 [cs.CL], 18 May 2022.
Like BERT, we pre-train CANINE on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. For the MLM task, CANINE offers two options:
1. A fully character-level loss that autoregressively predicts characters in masked spans.
2. A vocabulary-based loss that predicts the identities of masked subword tokens. Critically, this tokenization is used only for the pre-training loss; tokens are never input to the encoder, and the tokenizer and subword vocabulary can be safely discarded after pre-training. This effectively converts the hard constraint of token boundaries found in other models into a soft inductive bias in CANINE.
In this article, we contribute:
• the first pre-trained tokenization-free deep encoder;
• an efficient model architecture that directly encodes long sequences of characters with speed comparable to vanilla BERT; and
• a model that performs no tokenization on the input, avoiding the lossy information bottleneck associated with most pre-processing.

Linguistic pitfalls of tokenization
Subword tokenizers are the de-facto standard in modern NLP (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020). These algorithms are limited to only simple word-splitting operations. While this is perhaps a reasonable approach for a language with impoverished morphology such as English, it is much less appropriate in the face of phenomena like agglutinative morphology and non-concatenative morphology (Table 1).

k-t-b       "write" (root form)
kataba      "he wrote"
kattaba     "he made (someone) write"
iktataba    "he signed up"

Table 1: Non-concatenative morphology in Arabic. When conjugating, letters are interleaved within the root. The root is therefore not separable from its inflection via any contiguous split.
Even in high-resource languages, subword models still tend to struggle on challenging domains, such as informal text, which often includes typos, spelling variation, transliteration, or emoji (O'Connor et al., 2010). BERT, which uses WordPiece tokenization, is sensitive to corruptions of the input, both natural typos (Sun et al., 2020) and adversarial manipulations (Pruthi et al., 2019), with some of the loss attributable to corrupted strings no longer being covered by the vocabulary.
Seemingly safe heuristics used by these algorithms, such as splitting on whitespace and punctuation, are problematic when applied to languages that do not use spaces between words (Thai, Chinese) or use punctuation as letters (Hawaiian, Twi). While SentencePiece does offer the option to skip whitespace splitting, it is not typically used due to poor empirical performance.
Fixed vocabulary methods can also force modelers to choose between difficult preprocessing tradeoffs: should one strip accents, casing, etc. and accept destructive preprocessing, or keep such orthographic information and risk important words dropping out of the frequency-based vocabulary altogether due to the presence of multiple variants of otherwise-similar words? For instance, mBERT initially removed all diacritics, thus dropping tense information in Spanish and conflating many unrelated words in Vietnamese.
Finally, using a fixed vocabulary during pre-training also creates complications for downstream tasks, which are subsequently tied to the same tokenizer and vocabulary used for pre-training, even if it is not well-suited for the target domain and/or end-task. Boukkouri et al. (2020) showed that BERT's Wikipedia+BooksCorpus WordPiece vocabulary results in excessive segmentation when fine-tuning on medical data, diminishing the benefit of pre-training as a strategy.

Enabling better generalization
Much as Tenney et al. (2019) showed that large encoders learn elements of the classic NLP pipeline, it seems natural to let the model discover tokenization as well. With this in mind, we seek an approach that can better generalize beyond the orthographic forms encountered during pre-training.
In terms of scientific inquiry, we would like to know whether we can build models that learn how to compose words where appropriate, and memorize them where memorization is needed. Large frequency-derived vocabularies partially mitigate this problem by simply memorizing more, but language inherently requires aspects of both memorization and composition. By building a model that directly engages with these issues within the small scale of word composition, we hope to enable future work studying these problems at larger scales such as phrasal constructions.
Practically, generalization is hindered for vocabulary elements that are slight orthographic variations, where one is very infrequent. Hypothetically, a model may estimate a very good embedding for a common vocabulary element kitten, but a poor embedding for the less frequent element kittens since the model has no a priori knowledge that they are related. Embeddings that are rarely touched during pre-training will not be updated much beyond their random initializations.

Reducing engineering effort
Mature tokenizers often include years of hand-engineered rules around special cases such as email addresses, URLs, and handling unknown words; even fairly minimal modern tokenizers include initial word-splitting heuristics followed by a specific algorithm and vocabulary for further breaking these tokens into subwords.
Modern pre-trained models also have many requirements throughout their lifecycle: between the time a model is pre-trained, fine-tuned, and served (potentially months or years apart), its weights and model implementation may be converted to be compatible with another toolkit, its fine-tuning data may be tokenized in a different way, and the natural distribution of words may be quite different. All of these things introduce ample opportunities for mismatches to arise between tokenization and the vocabulary from pre-training. Yet this same pre-training paradigm presents an advantage for character models: access to far more (unsupervised) data to learn word composition from characters; without transfer learning, this has historically been impractical for many tasks having little supervised data.

CANINE
CANINE consists of three primary components: (1) a vocabulary-free technique for embedding text; (2) a character-level model that is efficient by means of downsampling and upsampling; and (3) an effective means of performing masked language modeling on a character-level model.

Model
CANINE is designed to be a minimally modified variant of the deep transformer stack found in modern encoders such as GPT, (m)BERT, XLM, and XLM-R such that its architecture is easily adoptable by other models in this family. The simplest implementation of such a character model would be to feed characters at each position in place of subwords. However, this approach would result in far more sequence positions given the same input text, leading to linearly more compute in feed-forward layers and quadratically more compute in self-attention layers.
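The scaling argument can be made concrete with a back-of-the-envelope calculation. This is only a sketch under an idealized cost model (feed-forward cost proportional to sequence length, self-attention cost proportional to its square); the function name `relative_cost` and the 4x length ratio (taken from the rough four-characters-per-subword figure in the introduction) are illustrative assumptions, not measurements.

```python
def relative_cost(length_ratio):
    """Idealized relative compute when the sequence grows by `length_ratio`."""
    return {
        "feed_forward": length_ratio,          # linear in sequence length
        "self_attention": length_ratio ** 2,   # quadratic in sequence length
    }

# Character input is assumed to be about 4x longer than subword input.
costs = relative_cost(4)
print(costs)
```

Under these assumptions a naive character model pays about 4x in the feed-forward layers and about 16x in self-attention, which is the gap the downsampling step is meant to close.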
The overall form of the CANINE model is the composition of a downsampling function DOWN, a primary encoder ENCODE, and an upsampling function UP [11]; given an input sequence of character embeddings $e \in \mathbb{R}^{n \times d}$ with length $n$ and dimensionality $d$:

$$y_{\text{seq}} \leftarrow \text{UP}(\text{ENCODE}(\text{DOWN}(e)))$$

where $y_{\text{seq}} \in \mathbb{R}^{n \times d}$ is the final representation for sequence prediction tasks. Similarly, for classification tasks, the model simply uses the zeroth element of the primary encoder:

$$y_{\text{cls}} \leftarrow [\text{ENCODE}(\text{DOWN}(e))]_0$$

[11] Enveloping the attention stack between downsampling and upsampling layers is similar to the Funnel-Transformer (Dai et al., 2020), which operates on WordPiece. However, many of its design choices (e.g., average pooling, their residual structure) did not work well in CANINE.

Preprocessing
Like existing models, the input to CANINE must ultimately be represented as a sequence of integers, but because the nature of characters is well-defined and standardized by Unicode, preprocessing code that would typically be hundreds or thousands of lines can be replaced by a very simple procedure: just iterate over the characters in the input string, and return their codepoint integer values (e.g., a single line of code [12] in Python). Furthermore, because codepoint values are part of the Unicode Standard, they are documented publicly, already supported by programming languages, and will not change over time, unlike arbitrary vocabulary-based IDs.
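The procedure described above can be sketched in a few lines of Python. The helper names `to_codepoints` and `from_codepoints` are hypothetical wrappers around the built-in `ord` and `chr`; the sample string is an arbitrary illustration.

```python
def to_codepoints(text: str) -> list[int]:
    """Map a string to its Unicode codepoint integers, one per character."""
    return [ord(c) for c in text]

def from_codepoints(ids: list[int]) -> str:
    """Invert the mapping; no vocabulary file is needed in either direction."""
    return "".join(chr(i) for i in ids)

sample = "na\u00efve \u732b"   # Latin text with a diacritic, then a CJK character
ids = to_codepoints(sample)
print(ids)
```

Because the mapping is lossless and defined by the Unicode Standard, the round trip recovers the original string exactly, including accents and casing.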
Character hash embeddings
CANINE uses hashing (Svenstrup et al., 2017) to support embedding the full space of Unicode codepoints with a relatively small number of parameters, but to reduce the chance that different codepoints will share exactly the same representation, we define a generalization of the standard hashing approach in which we apply multiple hash functions to each codepoint and concatenate the representations associated with the various hash values. More formally, given a single codepoint [13] $x_i \in \mathbb{N}$, we apply $K$ hash functions $\mathcal{H}_k : \mathbb{N} \to \mathbb{N}$, and look up each hashing result in its own embedding matrix [14] $\mathcal{E}_k \in \mathbb{R}^{B \times d'}$, yielding $K$ embeddings of size $d' = d/K$, which are then concatenated into a single representation of size $d$:

$$e_i = \bigoplus_{k}^{K} \text{LOOKUP}(\mathcal{H}_k(x_i) \bmod B,\ \mathcal{E}_k)$$

where $\oplus$ denotes vector concatenation. We refer to these as the character embeddings $e \in \mathbb{R}^{n \times d}$.

[12] Python preprocessing: [ord(c) for c in text]
[13] Conceptually, a codepoint is a character; however, a Unicode codepoint is defined precisely and unambiguously.
[14] CANINE uses learned embeddings, not random embeddings as in other hash embeddings (Kaliamoorthi et al., 2019).
In our experiments, we use d = 768, K = 8, and B = 16k. While each individual hash function is subject to hash collisions, the overall effect is minimal since each function only accounts for a small portion of the codepoint's overall embedding, and it is highly improbable that the other hash functions will produce the same collisions.
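A minimal sketch of the multi-hash lookup follows, under several loudly stated assumptions: the hash family (a salted BLAKE2b digest per hash index) is a stand-in, since the text does not say which hash functions are used; "16k" is read as 16,000; and random values stand in for the learned embedding tables. All function names are hypothetical.

```python
import hashlib
import random

D, K, B = 768, 8, 16_000      # B = "16k" is read as 16,000 here (assumption)
SLICE = D // K                # each hash function contributes d / K = 96 dims

def hash_bucket(codepoint: int, k: int) -> int:
    """k-th hash of a codepoint, reduced to a bucket index in [0, B).

    Assumption: a salted BLAKE2b digest stands in for the k-th hash function.
    """
    digest = hashlib.blake2b(f"{k}:{codepoint}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % B

_rng = random.Random(0)
_tables = [{} for _ in range(K)]   # one table per hash function, rows made lazily

def lookup(k: int, bucket: int) -> list[float]:
    """Row `bucket` of table k; random values stand in for learned weights."""
    if bucket not in _tables[k]:
        _tables[k][bucket] = [_rng.gauss(0.0, 0.02) for _ in range(SLICE)]
    return _tables[k][bucket]

def embed(codepoint: int) -> list[float]:
    """Concatenate the K per-hash slices into one d-dimensional embedding."""
    vector: list[float] = []
    for k in range(K):
        vector.extend(lookup(k, hash_bucket(codepoint, k)))
    return vector

e = [embed(ord(c)) for c in "kitten"]
```

Two codepoints that collide under one hash function share only that 96-dimensional slice; their full embeddings coincide only if all eight hash functions collide at once.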
Because the model always supports all codepoints, it is possible to learn representations during fine-tuning for characters (and, by extension, words, scripts, etc.) that were never seen during pre-training, while still making use of what pretraining learned about word composition and sentence structure.
Optional vocabulary-free n-grams
We can also redefine the embeddings $e_i$ above to include character n-grams, again without a fixed vocabulary, such that each n-gram order contributes equally to a summed embedding. This formulation still admits tokenization-free modeling, but provides the model with an inductive bias that favors slightly more memorization via a compute-cheap means of adding parameters. Notably, it also allows the model's input signature to remain a simple sequence of codepoints.

Downsampling
To make CANINE efficient, we use a multi-part downsampling strategy. First, we encode characters using a single-layer block-wise local attention transformer. This model performs self-attention only within each block of a predefined size, saving the quadratic cost of attention while leveraging the linguistic intuition that word composition (i.e., the kind of composition relevant in the lowest layers of the model; Tenney et al., 2019) tends to happen at a very local level. Next, we use a strided convolution to reduce the number of sequence positions to be similar to that of a word piece model. Given character embeddings $e \in \mathbb{R}^{n \times d}$ with a sequence length of $n$ characters and dimensionality $d$, we use a convolution with a stride of $r$ to downsample the sequence:

$$h_{\text{init}} \leftarrow \text{LOCALTRANSFORMER}_1(e)$$
$$h_{\text{down}} \leftarrow \text{STRIDEDCONV}(h_{\text{init}}, r)$$

We refer to this output as the downsampled positions: $h_{\text{down}} \in \mathbb{R}^{m \times d}$, where $m = n/r$ is the number of downsampled positions. In our experiments, we use $r = 4$ and $n = 2048$ such that $m = 512$, giving CANINE's primary encoder (the transformer stack) the same length as in mBERT.
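The two downsampling ingredients can be illustrated with a small pure-Python sketch. It is deliberately simplified: the convolution uses one scalar weight per tap where a learned convolution would use a weight matrix per tap, the uniform taps are placeholders rather than learned values, the hidden size is shrunk to 4, and the block size of 128 is an assumed value. The function names are hypothetical.

```python
def same_block(i, j, block_size):
    """Block-wise local attention: position i may attend to j only within a block."""
    return i // block_size == j // block_size

def strided_conv(h, taps, stride):
    """1-D strided convolution with one scalar weight per tap (simplified)."""
    width, dim = len(taps), len(h[0])
    out = []
    for start in range(0, len(h) - width + 1, stride):
        vec = [0.0] * dim
        for t, weight in enumerate(taps):
            for c in range(dim):
                vec[c] += weight * h[start + t][c]
        out.append(vec)
    return out

n, r, d = 2048, 4, 4                       # n and r as in the text; d shrunk
h_init = [[float(i)] * d for i in range(n)]
h_down = strided_conv(h_init, taps=[0.25] * r, stride=r)
print(len(h_down))                         # number of downsampled positions
```

With a stride of 4 over 2048 character positions the output has 512 positions, matching the subword sequence length used for mBERT.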
Deep transformer stack
After downsampling, CANINE applies a deep transformer stack with $L$ layers to the resulting downsampled positions. This is the same as the core of BERT and derivative models, and remains the core of CANINE in that it accounts for the vast majority of its compute and parameters, though we note that this middle portion of the model could easily be replaced with any other sequence-to-sequence model including those with better compute performance such as Performer (Choromanski et al., 2021), Big Bird (Zaheer et al., 2020), RFA (Peng et al., 2021), ETC (Ainslie et al., 2020), etc. This portion of the model yields a new downsampled representation $h'_{\text{down}} \in \mathbb{R}^{m \times d}$:

$$h'_{\text{down}} \leftarrow \text{TRANSFORMER}_L(h_{\text{down}})$$
$$y_{\text{cls}} = [h'_{\text{down}}]_0$$

We used $L = 12$ to match mBERT.
Upsampling
While the above architecture is sufficient for classification tasks, sequence prediction tasks require that the model expose an output layer with the same sequence length as the input (i.e., characters are the model's input and output "API" for tasks like tagging and span prediction).
We reconstruct a character-wise output representation by first concatenating the output of the original character transformer (above) with the downsampled representation produced by the deep transformer stack. (Note that since each downsampled position is associated with exactly $r$ characters for a downsampling rate of $r$, each position of the downsampled representation is replicated $r$ times before concatenation.) More formally,

$$h_{\text{up}} \leftarrow \text{CONV}(h_{\text{init}} \oplus h'_{\text{down}},\ w)$$
$$y_{\text{seq}} \leftarrow \text{TRANSFORMER}_1(h_{\text{up}})$$

where $h_{\text{init}}$ is the output of the initial character transformer, $h'_{\text{down}}$ is the output of the deep transformer stack, and $\oplus$ indicates vector concatenation of the representations (i.e., not sequences) such that CONV projects from $\mathbb{R}^{n \times 2d}$ back to $\mathbb{R}^{n \times d}$ across a window of $w$ characters [20]. Applying a final transformer layer (standard, not local) yields a final sequence representation $y_{\text{seq}} \in \mathbb{R}^{n \times d}$.
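The replicate-then-concatenate step can be sketched as follows. This covers only the input to the upsampling convolution; the convolution itself and the final transformer layer are omitted. The function name `upsample_inputs` and the toy dimensions are assumptions for illustration.

```python
def upsample_inputs(h_init, h_down, r):
    """Pair each character position with the downsampled position covering it.

    Each downsampled vector is reused for r consecutive characters and
    concatenated feature-wise, so the result has n rows of width 2d.
    """
    assert len(h_init) == len(h_down) * r
    return [h_init[i] + h_down[i // r] for i in range(len(h_init))]

r = 4
h_init = [[float(i), float(i)] for i in range(8)]    # n = 8 characters, d = 2
h_down = [[100.0, 100.0], [200.0, 200.0]]            # m = n / r = 2 positions
concat = upsample_inputs(h_init, h_down, r)
print(concat[5])
```

Characters 0 to 3 are paired with the first downsampled vector and characters 4 to 7 with the second, which is why a single downsampled position may have to serve more than one conceptual word.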
Residual connections
While the initial character encoder (before downsampling) and final character encoder (after upsampling) both represent character positions, they conceptually have very different purposes in the network. Intuitively, we think of the initial character encoder as composing characters to create a more word-like representation, while the final character encoder is extracting the in-context representation that's relevant for predicting the "meaning" of the content at each position; CANINE must be able to deal with additional ambiguity during upsampling since a single downsampled position may span more than one conceptual word. Because of the different roles of these induced features, we do not use residual connections from $h_{\text{init}}$ to $h_{\text{up}}$.
[20] We use w = 4 in our experiments.

This representation $\hat{y}$ is then used to predict each character. To avoid wasting time on a large output weight matrix and softmax, the gold target classes $t$ are bucketed codepoint IDs such that $t_i = \mathcal{H}_0(x_i) \bmod B$. This is similar to the strategy used in the character hash embedder (§3.1). The occasional collisions among characters are less problematic due to (a) the fact that this is an encoder-only model and (b) the fact that the embeddings must still retain contextual information in order to correctly predict characters. Because we're only predicting a relatively small subsequence of the input (15% in our experiments), the cost of this layer is small.

Subword Loss
We also experiment with CANINE-S, a subword-based loss function, to demonstrate how a token-aware pre-training loss can still be paired with a tokenization-free model such that the tokenizer and vocabulary are discarded after pre-training.
Span-wise masking
Like mBERT's MLM setup, each span in CANINE-S corresponds to a single subword. As with the autoregressive loss, all characters within the masked span are replaced with a special "mask" codepoint. Random replacements of subwords are chosen from the vocabulary of same-length subwords such that the length of the character sequence remains unchanged; more formally, given a subword selected for random replacement x and a vocabulary of subwords V, x's replacement will be drawn from the subset of v ∈ V where LEN(v) = LEN(x).
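A minimal sketch of the two length-preserving corruptions described above. The mask codepoint value (a Private Use Area codepoint), the toy vocabulary, and the function names are assumptions made for illustration; the text does not say whether x itself is excluded from the candidates, so this sketch leaves it in.

```python
import random

MASK = 0xE000  # hypothetical mask codepoint (Private Use Area); an assumption

def mask_span(codepoints, start, end):
    """Replace every character in [start, end) with the mask codepoint."""
    return codepoints[:start] + [MASK] * (end - start) + codepoints[end:]

def same_length_replacement(subword, vocab, rng):
    """Draw a replacement from the vocabulary entries with LEN(v) == LEN(x)."""
    candidates = [v for v in vocab if len(v) == len(subword)]
    return rng.choice(candidates)

vocab = ["cat", "dog", "kit", "ten", "s", "house"]   # toy vocabulary
text = [ord(c) for c in "a dog barks"]

masked = mask_span(text, 2, 5)                       # mask the span covering "dog"
replacement = same_length_replacement("dog", vocab, random.Random(0))
replaced = text[:2] + [ord(c) for c in replacement] + text[5:]
```

Both operations leave the character sequence length unchanged, so character positions before and after corruption stay aligned.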
Span prediction
Within each masked character span, CANINE-S randomly selects a character position where the model will make a prediction; the model predicts the identity of the masked subword via softmax. The associated subword embeddings are discarded after pre-training.

Targeted Upsampling
By design, each final character representation (after upsampling) is a function of the output of the initial character encoder (before downsampling) and the output of the deep transformer stack; there are no inter-position dependencies across the upsampled sequence. This depends on the upsampler using position-wise feed-forward projections and a single transformer layer. During pre-training, we leverage this design to improve speed by only performing upsampling on the sequence positions $p$ that will be used by the MLM task. More formally, we use the following equivalent [24] form of the UP function during pre-training:

$$h^{*}_{\text{up}} \leftarrow \text{GATHER}(p,\ h_{\text{up}})$$
$$y^{*}_{\text{seq}} \leftarrow \text{TRANSFORMER}_1(q = h^{*}_{\text{up}},\ kv = h_{\text{up}})$$
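The equivalence that targeted upsampling relies on can be checked on a toy example: with a single attention layer, the output at a position depends only on that position's query and on the full set of keys and values, so computing queries only at the MLM positions gives the same result as computing everything and then selecting. The sketch below uses single-head attention with identity projections and no scaling, which is a simplification; `attend` and the toy inputs are hypothetical.

```python
import math

def attend(queries, keys_values):
    """Single-head dot-product attention with identity projections (simplified)."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) for kv in keys_values]
        top = max(scores)                              # for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * kv[c] for w, kv in zip(weights, keys_values)) / total
            for c in range(len(q))
        ])
    return out

h_up = [[0.1 * i, 1.0 - 0.05 * i] for i in range(16)]  # toy upsampled sequence
p = [2, 7, 11]                                         # positions used by the MLM task

full = attend(h_up, h_up)                              # every position as a query
targeted = attend([h_up[i] for i in p], h_up)          # queries only at p
```

The targeted variant does work proportional to the number of predicted positions rather than the full character sequence length, while still attending over all positions as keys and values.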

Modularity
Unlike previous models, CANINE removes both the vocabulary and tokenization algorithm as fossilized parts of the final model that must be replicated during fine-tuning and prediction. Regardless of which pre-training loss is chosen (characters or subwords), the use of these components in CANINE is limited to a detail of the pre-training procedure (an inductive bias of the loss function) that is then discarded. The fine-tuning and prediction phases of the model lifecycle never have any knowledge of what vocabulary or tokenization algorithm (if any) were used in pre-training. This allows the model to natively process untokenized data, or even process data that has been pre-processed by different tokenizers, a situation that would otherwise introduce a significant skew between training phases.
[24] This highly-effective targeted upsampling optimization is the primary reason that CANINE uses a full Transformer layer for the final full-length character sequence rather than a local transformer. Because a block-wise local transformer assumes uniform position-wise locality over attention blocks, it is not trivial to combine these two optimizations; the local self-attention mask would no longer be a simple block diagonal. However, this final upsampling layer is discarded for classification tasks and so does not contribute any cost. Hence, while it is possible to combine local attention and targeted upsampling, this is left as future work.

Experiments
Information-Seeking QA Data
TYDI QA: Primary Tasks
TYDI QA is a dataset of information-seeking questions in 11 typologically diverse languages (Clark et al., 2020). Questions are written before answers, leading to less lexical and morphological overlap between questions and answers, which are drawn from Wikipedia. We evaluate on the primary tasks.

Passage Selection Task (SELECTP)
Given a list of the passages in a Wikipedia article, return either the index of the passage that answers the question, or return NULL if the article contains no acceptable answer.

Minimal Answer Span Task (MINSPAN)
Given a full Wikipedia article, return the start and end byte indices of the minimal span that completely answers the question. Alternatively, a system may indicate that the article does not contain an answer, or return YES or NO for yes/no type questions.

Named Entity Recognition Data
We also consider the task of named entity recognition (NER), which requires the model to identify which spans of a sentence correspond to entities and label the entity type. In all of our experiments, we framed the task as sequence labeling, predicting BIO-encoded span labels.
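For readers unfamiliar with BIO encoding, the sketch below turns entity spans into per-position labels. The helper name `spans_to_bio`, the end-exclusive span convention, the label names, and the example sentence are all illustrative assumptions; the granularity at which labels were assigned in the experiments is not specified here.

```python
def spans_to_bio(num_positions, spans):
    """Convert (start, end, label) spans, end exclusive, into BIO tags."""
    tags = ["O"] * num_positions
    for start, end, label in spans:
        tags[start] = f"B-{label}"            # first position of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"            # remaining positions inside it
    return tags

words = ["State", "Street", "Bank", "is", "in", "Boston"]
tags = spans_to_bio(len(words), [(0, 3, "ORG"), (5, 6, "LOC")])
print(list(zip(words, tags)))
```

Here the three-word organization receives B-ORG followed by two I-ORG tags, the location receives B-LOC, and everything else is tagged O.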
CoNLL NER
We use Spanish and Dutch data from the CoNLL 2002 NER task (Tjong Kim Sang, 2002) and English and German from the CoNLL 2003 NER task (Tjong Kim Sang and De Meulder, 2003), all from the newswire domain.

MasakhaNER
To widen the scope of our experiments beyond European languages, we also include MasakhaNER (Adelani et al., 2021)

Model Configuration
Direct comparison with mBERT
In order to determine which pre-training architecture produces better quality downstream predictions, we compare CANINE to mBERT, which we reimplemented and re-trained in order to hold as many variables as possible constant. Note that we intentionally do not compare against public pretrained checkpoints that use different pre-training corpora since (a) this would be a major confounding variable and (b) most publicly available pretrained models are simply instantiations of BERT, including XLM-R [26] and X-STILTS [27].

Setup
We pre-train on the multilingual Wikipedia data of mBERT, which includes 104 languages. Similarly, we reuse mBERT's exponential smoothing technique to weight the languages within the pre-training samples. We train for 124k steps with batch size 4096 (2.5 passes over the data) using the LAMB optimizer (You et al., 2020) with a linearly decayed learning rate of 0.018 where 2.5% of the steps are used for warm-up. We use a sequence length of 512 for mBERT, and 2048 for CANINE, which results in 512 downsampled positions in its core deep transformer stack. We pre-train on 64 Cloud TPUs v3 [28] for approximately one day (see results for precise timings). For both mBERT and CANINE-S (CANINE with the subword loss), we select 15% of subwords for the MLM loss and predict up to 80 output positions; 80% of these are masked in the input, 10% are randomly replaced, and 10% are unmodified. For CANINE-C (CANINE with the autoregressive character loss), we select 15% of contiguous spans for the MLM loss and predict up to 320 output characters, and no random replacement is performed. For TYDI QA, we use a maximum answer length of 100 characters, which is approximately the 99th percentile answer length. Sequences shorter than the maximum sequence length are zero-padded, following BERT. [29]

[26] XLM-R instantiates BERT with a larger pre-training corpus, larger model size, and larger vocabulary size.
[27] X-STILTS performs English fine-tuning on an existing XLM-R checkpoint (Phang et al., 2020).
[28] v3 TPUs have 16 GiB memory / core (128 GiB total).
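The masking figures in the setup can be sanity-checked with a short sketch. The 15% rate, the caps of 80 and 320, and the 80/10/10 split are from the text; reading the caps as a bound on 15% of the respective sequence lengths is an inference, and the function name `choose_corruption` is hypothetical.

```python
import random

def choose_corruption(rng):
    """Pick how a selected subword is presented in the input (80/10/10 split)."""
    return rng.choices(["mask", "random_replace", "keep"], weights=[80, 10, 10])[0]

rng = random.Random(0)
counts = {"mask": 0, "random_replace": 0, "keep": 0}
for _ in range(10_000):
    counts[choose_corruption(rng)] += 1

expected_subwords = 0.15 * 512     # mBERT / CANINE-S: stays within the cap of 80
expected_chars = 0.15 * 2048       # CANINE-C: stays within the cap of 320
print(counts, expected_subwords, expected_chars)
```

Under this reading, both prediction caps sit just above 15% of the respective input lengths (512 subwords and 2048 characters), so the caps rarely truncate the selected positions.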

TYDI QA Results
Our main result is shown in Table 2. CANINE-S (CANINE with the subword loss) improves over mBERT in the TYDI QA SELECTP task by 2.8 F1, while using about 30% fewer parameters. Similarly, CANINE-C (CANINE with the autoregressive character loss) improves over mBERT by 2.5 F1. Adding vocab-free character n-grams leads to even further gains over mBERT (+3.8 F1) and even more on the MINSPAN task (+6.9 F1). A language-wise breakdown is provided in Table 7 in the appendix.
[29] Each pre-training uses approximately 24 hours on 64 TPUs (1.5k TPU-hours), so the 18 pre-trainings in Tables 2/3/4 required about 28k TPU-hours. The 18 TyDi QA experiments in these tables each take about 1 hour on 16 TPUs, each with 3 replicas (48 TPU-hours), about 1k TPU-hours total. The 3 NER experiments in Table 5 each took 3 hours on 4 TPUs with 3 replicas each (36 TPU-hours), 108 TPU-hours total. Thus replicating the experiments in this paper would take approximately 29k TPU-hours.

Table 3: Kiswahili examples in which CANINE improved over mBERT in the TYDI QA SELECTP task. On examining mBERT's subword tokenization, we observe that the segmentations do not align well, putting more pressure on the model to combine them and more opportunities for some embeddings to be poorly estimated. Top: The model must match a key word in the question milikiwa (own) to a morphological variant in the answer iki-milikiwa (to be owned). mBERT's WordPiece segmentation produces milik -iwa and iki -mi -iki -wa for these, respectively. Bottom: The model must match i-sambaza-yo (distributes) in the question with u-sambaza-ji (distribution). mBERT's WordPiece segmentation produces isam -ba -za -yo and usa -mba -zaj -i.
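The compute estimate in footnote 29 can be reproduced with simple arithmetic. The sketch below uses only the figures stated in the footnote; the variable names are illustrative.

```python
# Pre-training: 24 hours on 64 TPUs, 18 runs.
pretrain_each = 24 * 64                  # about 1.5k TPU-hours
pretrain_total = 18 * pretrain_each      # about 28k TPU-hours

# TyDi QA fine-tuning: 1 hour on 16 TPUs, 3 replicas, 18 experiments.
tydi_each = 1 * 16 * 3                   # 48 TPU-hours
tydi_total = 18 * tydi_each              # about 1k TPU-hours

# NER fine-tuning: 3 hours on 4 TPUs, 3 replicas, 3 experiments.
ner_each = 3 * 4 * 3                     # 36 TPU-hours
ner_total = 3 * ner_each                 # 108 TPU-hours

total = pretrain_total + tydi_total + ner_total
print(total)                             # about 29k TPU-hours
```

Pre-training accounts for the overwhelming majority of the total, so the rounded figure is dominated by the 18 pre-training runs.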
We also present results from some ablation models as additional baselines in rows 3-4 of Table 2. First, for row 3, we simply replace BERT's subword vocabulary with a pure character vocabulary, which makes characters both the input granularity and the unit of masking and prediction for the MLM task, and observe that not only is the model 10X slower than subword-based BERT, but the quality also suffers greatly. Then, for row 4, we modify that model to use subwords for masking and MLM predictions, while keeping characters as the input granularity, and we see a substantial quality improvement, though pre-training remains extremely slow. Finally, by comparing to the full CANINE model in row 5, we can see that adding the downsampling strategy improves speed by 700%, and also leads to an additional small bump in quality. We speculate that this additional quality gain comes from giving the model a better inductive bias toward more word-like units within the deep transformer stack.
Analysis
CANINE fares particularly well on morphologically rich languages such as Kiswahili. Table 3 shows examples where CANINE outperforms mBERT on the TYDI QA SELECTP task. In particular, we observe examples where Kiswahili's rich morphology does not hinder the matching process for CANINE.
[30] These ablations were carried out during initial model development, hence comparisons to a non-final model.

Ablations
In Table 6, we consider minor modifications to the final CANINE architecture, and evaluate the effect of each on the downstream quality of the model. [30]

Attending directly to $h'_{\text{down}}$
Instead of attending to the character-wise sequence $h_{\text{up}}$, we attend to the downsampled sequence $h'_{\text{down}}$ (the output of the deep transformer stack):

$$y^{+}_{\text{seq}} = \text{TRANSFORMER}_1(q = h_{\text{up}},\ kv = h'_{\text{down}})$$

While this change reduces the overall FLOPS of the model due to the reduced attention computation, it does not have a major effect on pre-training throughput. However, it does substantially degrade quality.

Number of hash buckets
We reduce the number of hash buckets (B) from 16k to 8k, meaning more (partial) collisions in embedding lookups.This significantly hinders the MINSPAN task.
Character vocab
We switch from our hash-based no-vocabulary strategy to using a normal character vocabulary (which we derive from the pre-training corpus). We observe that this underperforms the hashing approach. We speculate that this might be due to skew between the pre-training corpus and the final downstream task since not all codepoints can be included in the vocabulary.

Input character dimension
We reduce the embedding size of the initial character encoder (i.e., the embedding size of $h_{\text{init}}$ and $e$, not $h_{\text{up}}$ nor $y_{\text{seq}}$) and observe that quality falls off rapidly.

No initial transformer
We remove the local transformer from $h_{\text{init}}$ and similarly observe a marked reduction in quality.
Increased downsampling
While more aggressive downsampling (a factor of 5X or 6X, rather than 4X) brings substantial speed gains, the passage-level quality degrades substantially and the minimal span predictions suffer even more.
No position-limited MLM
When we do not use the trick of applying the final character transformer ($y_{\text{seq}}$) only to the positions that will be computed by the MLM task, we observe a large reduction in speed. Since this model is theoretically equivalent in terms of operations, we show only the speed for exposition.
We also performed ablations aimed at exploring the effect of feature concatenation and residuals; results are in Table 4. Not concatenating the downsampled representation with the initial character representation when computing $h_{\text{up}}$ causes the model to become unstable (row 2); adding a residual from $h_{\text{up}}$ back to $h_{\text{init}}$ does not help (row 3). However, additionally inserting a residual from $h_{\text{up}}$ back to $h_{\text{down}}$ does stabilize the model (row 4), though it does not recover the original quality.

NER Results
Named entity recognition is a task in which memorization is often a very effective strategy. For example, if a model has London in its vocabulary and sees it with the label LOCATION during training, then it simply has to retrieve this memorized association when it sees the token London at test time. Therefore, evaluating on NER is helpful for understanding the ways in which different models emphasize memorization vs. generalization.
As shown in Table 5, CANINE-C performs significantly worse than mBERT on NER, likely due to mBERT's memorization-friendly vocabulary. However, when (tokenization-free) n-gram features are added to CANINE-C, performance rebounds, showing that it is possible to cheaply boost a model's memorization ability while remaining fully tokenization-free.
A full language-wise breakdown is provided in the appendix (Table 8). It is worth noting that part of the performance difference on MasakhaNER is due to mBERT producing no usable outputs for Amharic. The mBERT pre-training data does not contain Amharic (or any Amharic-script text), so it has no vocabulary entries for Amharic's script (meaning that mBERT sees only sequences of [UNK] on Amharic inputs). However, since CANINE always supports the full Unicode space, it is able to achieve 50 F1 even though it, too, had never seen Amharic text during pre-training. We take this as validation of CANINE's vocabulary-free approach. It may also be evidence that CANINE exhibits cross-script transfer abilities analogous to those in mBERT (Pires et al., 2019).
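The contrast between a closed vocabulary and codepoint inputs can be made concrete with a toy sketch. The four-entry vocabulary and the whitespace splitting are hypothetical stand-ins for a real subword tokenizer; the point is only that every unlisted item collapses to [UNK], whereas a codepoint-based input is defined for any Unicode string.

```python
TOY_VOCAB = {"[UNK]": 0, "London": 1, "is": 2, "big": 3}  # hypothetical closed vocabulary

def vocab_ids(tokens, vocab=TOY_VOCAB):
    """Closed-vocabulary lookup: anything unlisted collapses to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def codepoint_ids(text):
    """Vocabulary-free input: one integer Unicode codepoint per character."""
    return [ord(ch) for ch in text]

amharic = "ሰላም ዓለም"  # Amharic-script sample text
subword_view = vocab_ids(amharic.split())   # every token becomes [UNK]
character_view = codepoint_ids(amharic)     # distinct ids for distinct characters
```

Here the vocabulary-based view carries no information about the input beyond its length in tokens, while the codepoint view preserves every character.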
Error analysis. CANINE-C tends not to label rarer lexical items that mBERT appears to have memorized. For example, with CANINE-C, JCPenney (a relatively rare lexical item) is not recognized as an entity. CANINE-C also tends to separate long entities; for example, "State Street Bank and Trust Company" is labeled as two separate spans: "State Street Bank" and "Trust Company"; and the location TAMPA BAY is recognized only as TAMPA. However, adding n-gram features appears to mostly resolve this issue.

Improvements to subword tokenization
Further improvements to standard subword tokenization methods such as Byte Pair Encoding (BPE) (Sennrich et al., 2016), WordPiece (Wu et al., 2016), and SentencePiece (Kudo and Richardson, 2018) have been proposed. Subword regularization (Kudo, 2018) and BPE-dropout (Provilkov et al., 2020) recognize that deterministic segmentation during training limits the ability to leverage morphology and word composition; instead, they randomly sample one of the multiple tokenizations of the training input, made possible by the inherent ambiguity of subword vocabularies. Wang [...] Unigram LM (Kudo, 2018), which builds its vocabulary top-down, was shown to align with morphology better than BPE on pretrained encoders (Bostrom and Durrett, 2020).
Others have built hybrid models that use multiple granularities, combining characters with tokens (Luong and Manning, 2016) or different subword vocabularies (Zhang and Li, 2021).
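The ambiguity that subword regularization and BPE-dropout exploit can be seen in a toy sketch. The five-piece vocabulary is hypothetical, and uniform sampling is a simplification: the cited methods sample according to a unigram language model or by randomly dropping BPE merges, respectively.

```python
import random

TOY_VOCAB = {"un", "believ", "able", "unbeliev", "believable"}  # hypothetical pieces

def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces drawn from `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(word, vocab, rng):
    """Pick one valid segmentation uniformly at random (a simplification)."""
    return rng.choice(segmentations(word, vocab))

all_splits = segmentations("unbelievable", TOY_VOCAB)
rng = random.Random(0)
sampled = sample_segmentation("unbelievable", TOY_VOCAB, rng)
```

With this vocabulary the word admits three segmentations, so a model trained on sampled splits sees the same surface string under several different token sequences rather than a single deterministic one.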

Character-level models
Following the larger NLP trend, character-level n-gram models (Huang et al., 2013; Wieting et al., 2016; Bojanowski et al., 2017) have mostly been replaced by neural networks. While generally lagging behind their word-level counterparts, character-level features are important for morphologically rich languages, particularly in low-resource settings (Garrette and Baldridge, 2013).
For language modeling. Character language models (CLMs) have used vanilla RNN architectures to produce distributions over sequences of characters in a purely tokenization-free manner (Sutskever et al., 2011; Graves, 2013; Hwang and Sung, 2017; Radford et al.). Hierarchical RNNs modeled the assumption that language operates on increasing layers of abstraction: Chung et al. (2017) jointly trained a sub-module to segment the character-level input into larger spans at each layer of a stacked LSTM.
Due to the consistent lag in performance behind their word-level counterparts, attention shifted from pure CLMs towards merely character-aware models, still reliant on traditional tokenization. Some hybrid models processed the input at the character level, but predicted words from a closed vocabulary (Kim et al., 2016; Gerz et al., 2018). Others reintroduced explicit tokenization on the input side, and either generated bursts of character sequences that formed an open vocabulary (Kawakami et al., 2017) or used a character-only generator as a fallback when the main closed-vocabulary word generator produced a rare or unknown token (Matthews et al., 2019; Mielke and Eisner, 2019). Especially after the popularization of inherently ambiguous subword vocabularies like BPE, several studies moved beyond a single input segmentation and marginalized over all possible segmentations (van Merriënboer et al., 2017; Buckman and Neubig, 2018; Grave et al., 2019).
Coming full circle, Kawakami et al. (2019) induced a lexicon without any explicit supervision, reverting to pure CLMs. In a revitalized effort to bring them on par with coarser granularities, researchers leveraged external resources such as grounding in vision (Kawakami et al., 2019) or multi-task learning together with supervised morphology tasks (Blevins and Zettlemoyer, 2019).
After the transformer (Vaswani et al., 2017) replaced RNNs as the dominant architecture in NLP, character-level models followed. Al-Rfou et al. (2019) showed that byte-level vanilla Transformers significantly underperform their word-level counterparts. A similar finding was reported by Radford et al. (2019). Although the gap has been reduced (Choe et al., 2019), subword transformers remain the status quo for pure language modeling.
For specific tasks. In parallel with LM efforts, the neural machine translation (NMT) community sought to solve its open-vocabulary problem via character-level modeling. Luong and Manning (2016) proposed a hybrid model that operated mainly at the word level, but consulted a character-level LSTM for unknown words; this was a practical compromise, as their character-only model took 3 months to train. Lee et al. (2017) enabled pure character NMT by shortening the input length via convolutional, pooling, and highway layers. Notably, their many-to-English model outperformed its subword counterpart and most bilingual baselines, with a 35% increase in training time (on a single GPU) compared to a baseline BPE-to-char model. CANINE has a similar motivation, but operates in the context of pre-trained transformers; training is 7x faster than a char-to-char baseline (on TPU v3) and takes 28% longer than mBERT (Table 2).
For transfer learning. Token-based models have also been augmented with character-level information in the context of transfer learning, where encoders trained with unsupervised objectives are repurposed to solve downstream tasks. Pinter et al. (2017) addressed the out-of-vocabulary problem of static pre-trained word embeddings by training a model to map the surface form of a word to its pre-trained representation, and used it on unknown words. ELMo (Peters et al., 2018), a bidirectional LSTM model, applied character convolutions to its whitespace-separated input tokens. CharacterBERT (Boukkouri et al., 2020) ported this technique to BERT, augmenting its existing WordPiece-tokenized input. Consistent with previous observations that feeding characters into a transformer stack comes with a huge computational cost while not improving over tokenization-based approaches (Al-Rfou et al., 2019), a BERT model fine-tuned for semantic parsing achieved gains only when characters complemented subwords (van Noord et al., 2020).

Multilingual models
Multilingual NLP has been dominated by deep pre-trained multilingual models whose subword vocabularies are shared across languages. Such models borrow their architectures from monolingual predecessors and apply joint training in 100+ languages, either with unsupervised LM losses: mBERT, mT5 (Xue et al., 2021), or with additional translation losses: XLM (Lample and Conneau, 2019), XLM-R (Conneau et al., 2020). Chung et al. (2020) extended this by forming language clusters with per-cluster vocabularies. To accommodate languages unseen during pre-training, Wang et al. (2020) extended the vocabulary and continued pre-training.

Conclusion
In this article, we described CANINE, which is, to our knowledge, the first pre-trained deep encoder for language understanding that uses a tokenization-free, vocabulary-free model, while surpassing the quality of models built on top of heuristic tokenizers. CANINE eliminates many engineering pitfalls for practitioners and opens up new research directions for the community.

Table 2: Direct comparison between mBERT (rows 1-2) and CANINE (rows 5-7) on TYDI QA. Public mBERT results are taken from the TYDI QA paper. Rows 3 and 4 show simple baselines that are either inefficient or low in quality. Despite operating on 4x more sequence positions, CANINE remains comparable to mBERT in terms of speed. Pre-training examples/sec are shown for our reported hardware (see Setup, §4.1). r represents the downsampling ratio. Parameters are calculated at fine-tuning time. All results are averaged over 3 fine-tuning replicas. TYDI QA scores are F1 scores, macro-averaged across languages. Deltas from our mBERT (the most comparable baseline) are shown in parentheses.
Chelsea ina milikiwa na nani? Kwa kawaida Chelsea huvaa jezi ya blu, kaptula blu na soksi nyeupe. Nembo ya klabu imebadilishwa mara nyingi kulingana na wakati na kuboresha muonekano wa klabu. Nembo ya sasa inaonesha picha ya simba akiwa amebeba mkuki. Tangu Julai 2003, Chelsea imekuwa ikimilikiwa na Bilionea wa Kirusi, Roman Abramovich.
Who owns Chelsea? Chelsea usually wear blue jerseys, blue shorts and white socks. The club logo has been changed many times over time and improved the club's appearance. The current emblem shows a picture of a lion carrying a spear. Since July 2003, Chelsea has been owned by Russian billionaire Roman Abramovich.

Table 4: Ablations for residuals and feature concatenation on TYDI QA. Rows are cumulative (each row contains all changes from the previous).

Table 6: Ablation experiments on the CANINE model with TYDI QA F1 scores. Deltas are shown in parentheses with regard to the top-most experiment, which serves as the baseline configuration for all experiments in this table. Each result is averaged over 3 fine-tuning and evaluation replicas.

Table 7: Language-wise breakdown for TYDI QA primary tasks. English is parenthesized because it is not included in the overall score calculation for TYDI QA.