Abstract
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
1 Introduction
End-to-end neural models have generally replaced the traditional NLP pipeline, and with it, the error cascades and feature engineering common to such systems, preferring instead to let the model automatically induce its own sophisticated representations. Tokenization, however, is one of the few holdovers from that era, with nearly all commonly used models today requiring an explicit preprocessing stage to segment a raw text string into a sequence of discrete model inputs. Broadly speaking, tokenizers are either carefully constructed systems of language-specific rules, which are costly to build, requiring both manual feature engineering and linguistic expertise, or data-driven algorithms such as Byte Pair Encoding (Sennrich et al., 2016), WordPiece (Wu et al., 2016), or SentencePiece (Kudo and Richardson, 2018), which split strings based on frequencies in a corpus; the latter are less brittle and easier to scale, but ultimately too simplistic to properly handle the wide range of linguistic phenomena that cannot be captured by mere string-splitting (§2.1).
The degree of sophistication required to accurately capture the full breadth of linguistic phenomena, along with the infeasibility of writing such rules by hand across all languages and domains, suggests that explicit tokenization itself is problematic. In contrast, an end-to-end model that operates directly on raw text strings would avoid these issues, instead learning to compose individual characters into its own arbitrarily complex features, with potential benefits for both accuracy and ease of use. While this change is conceptually very simple—one could replace the subword vocabulary in a model like Bert (Devlin et al., 2019) with a vocabulary made solely of individual characters—doing so leads to two immediate problems. First, the computational complexity of a transformer (Vaswani et al., 2017), the main component in Bert as well as other models such as GPT (Radford et al., 2019; Brown et al., 2020) and T5 (Raffel et al., 2020), grows quadratically with the length of the input. Since standard subword models have roughly four characters per subword on average, the 4x increase in input sequence length would result in a significantly slower model. Second, simply switching to a character vocabulary yields empirically poor results (§4.2).
In order to enable tokenization-free modeling that overcomes these obstacles, we present Canine. Canine is a large language encoder with a deep transformer stack at its core. Inputs to the model are sequences of Unicode characters.1 To represent the full space of Unicode characters2 without a vocabulary, we use a hashing strategy. To avoid the slowdown from increasing the sequence length, Canine uses strided convolutions to downsample input sequences to a shorter length before the deep transformer stack.
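To make the downsampling idea concrete, here is a minimal numpy sketch (not the production architecture) of how a strided convolution reduces a 2048-character sequence to 512 positions at a downsampling rate of 4; the kernel width and random weights are illustrative.

```python
import numpy as np

def strided_conv_downsample(char_states, rate=4, kernel_width=4, out_dim=768, seed=0):
    """Reduce sequence length by `rate` using a strided 1-D convolution.

    char_states: [seq_len, dim] array of per-character representations.
    Returns an array of shape [seq_len // rate, out_dim].
    """
    rng = np.random.default_rng(seed)
    seq_len, dim = char_states.shape
    # Illustrative random kernel; in a real model this is a learned parameter.
    kernel = rng.normal(scale=0.02, size=(kernel_width, dim, out_dim))
    outputs = []
    for start in range(0, seq_len - kernel_width + 1, rate):
        window = char_states[start:start + kernel_width]        # [kernel_width, dim]
        outputs.append(np.einsum("kd,kdo->o", window, kernel))  # [out_dim]
    return np.stack(outputs)

chars = np.random.randn(2048, 768)
down = strided_conv_downsample(chars)
print(down.shape)  # (512, 768): 4x fewer positions enter the deep transformer stack
```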
Like Bert, we pre-train Canine on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. For the MLM task, Canine offers two options:
A fully character-level loss that autoregressively predicts characters in masked spans.
A vocabulary-based loss that predicts the identities of masked subword tokens. Critically, this tokenization is used only for the pre-training loss; tokens are never input to the encoder, and the tokenizer and subword vocabulary can be safely discarded after pre-training. This effectively converts the hard constraint of token boundaries found in other models into a soft inductive bias in Canine.
In this article, we contribute:
the first pre-trained tokenization-free deep encoder;
an efficient model architecture that directly encodes long sequences of characters with speed comparable to vanilla Bert; and
a model that performs no tokenization on the input, avoiding the lossy information bottleneck associated with most pre-processing.
2 Motivation
2.1 Linguistic Pitfalls of Tokenization
Subword tokenizers are the de facto standard in modern NLP (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020). These algorithms are limited to only simple word-splitting operations. While this is perhaps a reasonable approach for a language with impoverished morphology such as English, it is much less appropriate in the face of phenomena like agglutinative morphology, nonconcatenative morphology (Table 1), consonant mutation, vowel harmony, and so on.
Table 1: Nonconcatenative morphology in Arabic.

Arabic form | Gloss |
---|---|
k-t-b | “write” (root form) |
kataba | “he wrote” |
kattaba | “he made (someone) write” |
iktataba | “he signed up” |
Even in high-resource languages, subword models still tend to struggle on challenging domains, such as informal text, which often includes typos, spelling variation,4 transliteration, or emoji (O’Connor et al., 2010). Bert, which uses WordPiece tokenization, is sensitive to corruptions of the input, both natural typos (Sun et al., 2020) and adversarial manipulations (Pruthi et al., 2019), with some of the loss attributable to corrupted strings no longer being covered by the vocabulary.
Seemingly safe heuristics used by these algorithms, such as splitting on whitespace and punctuation, are problematic when applied to languages that do not use spaces between words (Thai, Chinese) or use punctuation as letters (Hawaiian,5 Twi6). While SentencePiece does offer the option to skip whitespace splitting, it is not typically used due to poor empirical performance.
Fixed vocabulary methods can also force modelers to choose between difficult preprocessing tradeoffs: Should one apply destructive preprocessing that strips accents, casing, and so forth, keeping the vocabulary compact but discarding information? Or should one keep such orthographic information and risk important words dropping out of the frequency-based vocabulary altogether due to the presence of multiple variants of otherwise-similar words? For instance, mBert initially removed all diacritics, thus dropping tense information in Spanish7 and conflating many unrelated words in Vietnamese.8
Finally, using a fixed vocabulary during pre-training also creates complications for downstream tasks, which are subsequently tied to the same tokenizer and vocabulary used for pre-training, even if it is not well-suited for the target domain and/or end-task. Boukkouri et al. (2020) showed that Bert’s Wikipedia+BooksCorpus WordPiece vocabulary results in excessive segmentation when fine-tuning on medical data, diminishing the benefit of pre-training as a strategy.
2.2 Enabling Better Generalization
Much as Tenney et al. (2019) showed that large encoders learn elements of the classic NLP pipeline, it seems natural to let the model discover tokenization as well. With this in mind, we seek an approach that can better generalize beyond the orthographic forms encountered during pre-training.
In terms of scientific inquiry, we would like to know whether we can build models that learn how to compose words where appropriate, and memorize them where memorization is needed. Large frequency-derived vocabularies partially mitigate this problem by simply memorizing more, but language inherently requires aspects of both memorization and composition. By building a model that directly engages with these issues within the small scale of word composition, we hope to enable future work studying these problems at larger scales such as phrasal constructions.
Practically, generalization is hindered when two vocabulary elements are slight orthographic variants of one another and one of them is very infrequent. Hypothetically, a model may estimate a very good embedding for a common vocabulary element kitten, but a poor embedding for the less frequent element kittens, since the model has no a priori knowledge that they are related. Embeddings that are rarely touched during pre-training will not be updated much beyond their random initializations.
2.3 Reducing Engineering Effort
Mature tokenizers often include years of hand-engineered rules around special cases such as email addresses, URLs, and handling unknown words;9 even fairly minimal modern tokenizers include initial word-splitting heuristics followed by a specific algorithm and vocabulary for further breaking these tokens into subwords.
Modern pre-trained models also have many requirements throughout their lifecycle: Between the time a model is pre-trained, fine-tuned, and served—potentially months or years apart—its weights and model implementation may be converted to be compatible with another toolkit, its fine-tuning data may be tokenized in a different way, and the natural distribution of words may be quite different. All of these things introduce ample opportunities for mismatches to arise between tokenization and the vocabulary from pre-training. Yet this same pre-training paradigm presents an advantage for character models: access to far more (unsupervised) data to learn word composition from characters; without transfer learning, this has historically been impractical for many tasks having little supervised data.
3 Canine
Canine consists of three primary components: (1) a vocabulary-free technique for embedding text; (2) a character-level model that is efficient by means of downsampling and upsampling; and (3) an effective means of performing masked language modeling on a character-level model.
3.1 Model
Canine is designed to be a minimally modified variant of the deep transformer stack found in modern encoders such as GPT, (m)Bert, XLM, and XLM-R, so that its architecture is easily adoptable by other models in this family. The simplest implementation of such a character model would be to feed characters at each position in place of subwords. However, this approach would result in far more sequence positions for the same input text, leading to linearly more compute in the feed-forward layers and quadratically more compute in the self-attention layers.
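As a rough illustration of why this matters, the sketch below uses standard order-of-magnitude FLOP estimates (not measurements of any particular implementation) to compare a 512-subword input with the equivalent roughly 2048-character input.

```python
def transformer_layer_flops(seq_len, hidden=768, ff_mult=4):
    # Q/K/V and output projections: 4 matmuls of [seq_len, hidden] x [hidden, hidden].
    proj = 4 * seq_len * hidden * hidden
    # Attention scores and weighted sum: quadratic in sequence length.
    attn = 2 * seq_len * seq_len * hidden
    # Feed-forward block: two matmuls with an ff_mult-times-wider inner layer.
    ff = 2 * seq_len * hidden * (ff_mult * hidden)
    return proj, attn, ff

for name, length in [("subwords", 512), ("characters", 2048)]:
    proj, attn, ff = transformer_layer_flops(length)
    print(f"{name:10s} len={length:4d}  proj={proj:.2e}  attn={attn:.2e}  ff={ff:.2e}")
# Going from 512 to 2048 positions multiplies the projection and feed-forward
# cost by 4 but the attention-score cost by 16, motivating downsampling.
```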
Preprocessing
Like existing models, the input to Canine must ultimately be represented as a sequence of integers, but because the nature of characters is well-defined and standardized by Unicode, preprocessing code that would typically be hundreds or thousands of lines can be replaced by a very simple procedure: just iterate over the characters in the input string, and return their codepoint integer values (e.g., a single line of code11 in Python). Furthermore, because codepoint values are part of the Unicode Standard, they are documented publicly, already supported by programming languages, and will not change over time, unlike arbitrary vocabulary-based IDs.
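For example, mirroring the one-line preprocessing referenced in the notes, a hypothetical helper might look like this:

```python
def text_to_codepoints(text: str) -> list[int]:
    """Map each character to its Unicode codepoint; no vocabulary lookup needed."""
    return [ord(c) for c in text]

print(text_to_codepoints("Cañine"))  # [67, 97, 241, 105, 110, 101]
```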
Character Hash Embeddings
Canine uses hashing (Svenstrup et al., 2017) to support embedding the full space of Unicode codepoints with a relatively small number of parameters—but, to reduce the chance that different codepoints will share exactly the same representation, we define a generalization of the standard hashing approach in which we apply multiple hash functions to each codepoint and concatenate the representations associated with the various hash values.
While each individual hash function is subject to hash collisions,15 the overall effect is minimal since each function only accounts for a small portion of the codepoint’s overall embedding, and it is highly improbable that the other hash functions will produce the same collisions.
Because the model always supports all codepoints, it is possible to learn representations during fine-tuning for characters (and, by extension, words, scripts, etc.) that were never seen during pre-training, while still making use of what pre-training learned about word composition and sentence structure.
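A minimal sketch of this multi-hash embedding follows; the bucket count of 16k matches the ablation in §4.3, while the number of hash functions and the simple multiplicative hash are illustrative assumptions rather than the exact production configuration.

```python
import numpy as np

class HashCharEmbedder:
    """Embed any Unicode codepoint without a vocabulary.

    Each of K hash functions maps the codepoint to one of B buckets; the
    corresponding (dim // K)-sized slices are concatenated into one embedding.
    """

    def __init__(self, num_buckets=16_000, num_hashes=8, dim=768, seed=0):
        assert dim % num_hashes == 0
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        # Odd multipliers for a simple multiplicative hash (illustrative choice).
        self.multipliers = rng.integers(1, 2**31, size=num_hashes) | 1
        # One learned table per hash function: [num_buckets, dim // num_hashes].
        self.tables = rng.normal(scale=0.02,
                                 size=(num_hashes, num_buckets, dim // num_hashes))

    def embed(self, codepoint: int) -> np.ndarray:
        slices = []
        for k, a in enumerate(self.multipliers):
            bucket = (a * codepoint) % self.num_buckets
            slices.append(self.tables[k, bucket])
        return np.concatenate(slices)  # shape: [dim]

embedder = HashCharEmbedder()
print(embedder.embed(ord("ñ")).shape)  # (768,), even for never-seen codepoints
```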
Optional Vocabulary-Free n-Grams
Downsampling
Deep Transformer Stack
Upsampling
While the above architecture is sufficient for classification tasks, sequence prediction tasks require that the model expose an output layer with the same sequence length as the input (i.e., characters are the model’s input and output “API” for tasks like tagging and span prediction).
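The ablations in §4.3 indicate that the upsampled representation concatenates each character's initial encoding with its corresponding downsampled position (Table 4, "No concatenation"). Below is a minimal numpy sketch of that length-restoring step under the assumption that each downsampled position is simply repeated r times to align with its characters; the exact alignment and projection used in Canine are not spelled out here, and the weights are illustrative.

```python
import numpy as np

def upsample(h_init, h_down, rate=4, seed=0):
    """Restore a character-length sequence from downsampled positions.

    h_init: [char_len, dim]        initial character encodings.
    h_down: [char_len // rate, dim] outputs of the deep transformer stack.
    Each downsampled position is repeated `rate` times, concatenated with the
    aligned character encoding, and projected back to `dim`.
    """
    rng = np.random.default_rng(seed)
    char_len, dim = h_init.shape
    repeated = np.repeat(h_down, rate, axis=0)[:char_len]   # [char_len, dim]
    concat = np.concatenate([h_init, repeated], axis=-1)    # [char_len, 2*dim]
    proj = rng.normal(scale=0.02, size=(2 * dim, dim))      # illustrative weights
    return concat @ proj                                    # [char_len, dim]

h_up = upsample(np.random.randn(2048, 768), np.random.randn(512, 768))
print(h_up.shape)  # (2048, 768): one output position per input character
```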
Residual Connections
While the initial character encoder (before downsampling) and final character encoder (after upsampling) both represent character positions, they conceptually have very different purposes in the network. Intuitively, we think of the initial character encoder as composing characters to create a more word-like representation, while the final character encoder is extracting the in-context representation that’s relevant for predicting the “meaning” of the content at each position; Canine must be able to deal with additional ambiguity during upsampling since a single downsampled position may span more than one conceptual word. Because of the different roles of these induced features, we do not use residual connections from hinit to hup.
3.2 Pre-training
Recent pre-trained models ranging from Bert to T5 have largely used variations on a masked language model (MLM) task (also known as span corruption) as an unsupervised pre-training loss function—a means of generating synthetic examples that are not from any realistic task, yet prepare a model to learn realistic tasks in future phases of training (i.e., fine-tuning). The Canine pre-training procedure retains the MLM task, and offers two distinct strategies for computing the MLM loss—autoregressive character prediction vs. subword prediction—both of which yield a fully tokenization-free model following pre-training. In our experiments, we use only one of these losses at a time.
3.2.1 Autoregressive Character Loss
Span-wise Masking
Canine-C is an autoregressive character loss that masks character spans within each sequence. These spans are chosen based on whitespace boundaries. No punctuation splitting nor other heuristics are used. All characters within the masked span are replaced by a special mask codepoint in the input.20 No random subword replacement is performed as there is no subword vocabulary.21
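The sketch below illustrates this whitespace-bounded span masking; the use of a Private Use Area codepoint as the mask follows the notes, the 15% masking budget follows the setup in §4.1.3, and the span-sampling details are simplified assumptions.

```python
import random

MASK_CODEPOINT = 0xE000  # first Private Use Area codepoint; input stays valid Unicode

def mask_character_spans(text, mask_rate=0.15, seed=0):
    """Mask whitespace-delimited spans of characters for the Canine-C MLM loss."""
    rng = random.Random(seed)
    codepoints = [ord(c) for c in text]
    # Whitespace-delimited spans: (start, end) character offsets of each "word".
    spans, start = [], None
    for i, c in enumerate(text + " "):
        if c.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    rng.shuffle(spans)
    masked, targets = list(codepoints), []
    budget = int(mask_rate * len(codepoints))
    for s, e in spans:
        if budget <= 0:
            break
        targets.append((s, text[s:e]))            # characters the model must predict
        masked[s:e] = [MASK_CODEPOINT] * (e - s)  # replace every character in the span
        budget -= e - s
    return masked, targets

masked, targets = mask_character_spans("canine reads raw characters")
print(targets)
```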
Span Prediction
3.2.2 Subword Loss
We also experiment with Canine-S, a subword-based loss function, to demonstrate how a token-aware pre-training loss can still be paired with a tokenization-free model such that the tokenizer and vocabulary are discarded after pre-training.
Span-wise Masking
As in mBert’s MLM setup, each masked span in Canine-S corresponds to a single subword. As with the autoregressive loss, all characters within the masked span are replaced with a special “mask” codepoint. Random replacements of subwords are chosen from the vocabulary of same-length subwords such that the length of the character sequence remains unchanged; more formally, given a subword x selected for random replacement and a vocabulary of subwords V, x’s replacement is drawn from the subset of v ∈ V where Len(v) = Len(x).
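A sketch of this length-matched random replacement is shown below; the toy vocabulary is illustrative, and the replacement probability itself (10% of masked subwords, per §4.1.3) is assumed to be handled elsewhere in the pipeline.

```python
import random
from collections import defaultdict

def build_length_index(vocab):
    """Group subwords by character length so replacements preserve sequence length."""
    by_len = defaultdict(list)
    for piece in vocab:
        by_len[len(piece)].append(piece)
    return by_len

def random_replacement(subword, by_len, rng):
    """Draw a replacement from {v in V : Len(v) == Len(subword)}."""
    candidates = [v for v in by_len[len(subword)] if v != subword]
    return rng.choice(candidates) if candidates else subword

rng = random.Random(0)
by_len = build_length_index(["cat", "dog", "run", "walk", "talk", "##ing", "##ed"])
print(random_replacement("dog", by_len, rng))  # another 3-character subword, e.g. "cat"
```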
Span Prediction
Within each masked character span, Canine-S randomly selects a character position where the model will make a prediction; the model predicts the identity of the masked subword via softmax. The associated subword embeddings are discarded after pre-training.
3.2.3 Targeted Upsampling
3.2.4 Modularity
Unlike previous models, Canine removes both the vocabulary and tokenization algorithm as fossilized parts of the final model that must be replicated during fine-tuning and prediction. Regardless of which pre-training loss is chosen (characters or subwords), the use of these components in Canine is limited to a detail of the pre-training procedure—an inductive bias of the loss function—that is then discarded. The fine-tuning and prediction phases of the model lifecycle never have any knowledge of what vocabulary or tokenization algorithm (if any) were used in pre-training. This allows the model to natively process untokenized data, or even process data that has been pre-processed by different tokenizers, a situation that would otherwise introduce a significant skew between training phases.
4 Experiments
4.1 Experimental Setup
4.1.1 Information-Seeking QA Data
TyDi QA: Primary Tasks
TyDi QA is a dataset of information-seeking questions in 11 typologically diverse languages (Clark et al., 2020). Questions are written before answers, leading to less lexical and morphological overlap between questions and answers, which are drawn from Wikipedia. We evaluate on the primary tasks.24
Passage Selection Task (SelectP)
Given a list of the passages in a Wikipedia article, return either the index of the passage that answers the question, or return NULL if the article contains no acceptable answer.
Minimal Answer Span Task (MinSpan)
Given a full Wikipedia article, return the start and end byte indices of the minimal span that completely answers the question. Alternatively, a system may indicate that the article does not contain an answer, or return YES or NO for yes/no type questions.
4.1.2 Named Entity Recognition Data
We also consider the task of named entity recognition (NER), which requires the model to identify which spans of a sentence correspond to entities and label the entity type. In all of our experiments, we framed the task as sequence labeling, predicting BIO-encoded span labels.
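For concreteness, the following is a minimal sketch of character-level BIO encoding, the label granularity a character-level model such as Canine predicts; the entity spans reuse the Chelsea example from Table 3 and are otherwise illustrative.

```python
def bio_labels(text, entities):
    """entities: list of (start, end, type) character spans, end exclusive."""
    labels = ["O"] * len(text)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels

text = "Chelsea is owned by Roman Abramovich"
labels = bio_labels(text, [(0, 7, "ORG"), (20, 36, "PER")])
print(list(zip(text, labels))[:3])  # [('C', 'B-ORG'), ('h', 'I-ORG'), ('e', 'I-ORG')]
```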
CoNLL NER
MasakhaNER
To widen the scope of our experiments beyond European languages, we also include MasakhaNER (Adelani et al., 2021), which includes ten African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) with human annotations on local news text.
4.1.3 Model Configuration
Direct Comparison with mBert
In order to determine which pre-training architecture produces better quality downstream predictions, we compare Canine to mBert, which we re-implemented and re-trained in order to hold as many variables as possible constant. Note that we intentionally do not compare against public pre-trained checkpoints that use different pre-training corpora since (a) this would be a major confounding variable and (b) most publicly available pre-trained models are simply instantiations of Bert, including XLM-R25 and X-STILTS.26
Setup
We pre-train on the multilingual Wikipedia data of mBert, which includes 104 languages. Similarly, we reuse mBert’s exponential smoothing technique to weight the languages within the pre-training samples. We train for 124k steps with batch size 4096 (2.5 passes over the data) using the LAMB optimizer (You et al., 2020) with a linearly decayed learning rate of 0.018 where 2.5% of the steps are used for warm-up. We use a sequence length of 512 for mBert, and 2048 for Canine, which results in 512 downsampled positions in its core deep transformer stack. We pre-train on 64 Cloud TPUs v327 for approximately one day (see results for precise timings). For both mBert and Canine-S (Canine with the subword loss), we select 15% of subwords for the MLM loss and predict up to 80 output positions; 80% of these are masked in the input, 10% are randomly replaced, and 10% are unmodified. For Canine-C (Canine with the autoregressive character loss), we select 15% of contiguous spans for the MLM loss and predict up to 320 output characters, and no random replacement is performed. For TyDi QA, we use a maximum answer length of 100 characters, which is approximately the 99th percentile answer length. Sequences shorter than the maximum sequence length are zero-padded, following Bert.28
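For reference, the hyperparameters listed above can be collected into a configuration sketch; the key names are illustrative and do not correspond to the actual codebase.

```python
pretraining_config = {
    "languages": 104,                  # mBert multilingual Wikipedia
    "steps": 124_000,
    "batch_size": 4096,                # about 2.5 passes over the data
    "optimizer": "LAMB",
    "learning_rate": 0.018,            # linear decay
    "warmup_fraction": 0.025,
    "seq_length": {"mbert": 512, "canine": 2048},  # 2048 chars -> 512 downsampled
    "mlm": {
        "canine_s": {"mask_fraction": 0.15, "max_predictions": 80,
                     "masked": 0.80, "random": 0.10, "unchanged": 0.10},
        "canine_c": {"mask_fraction": 0.15, "max_predictions": 320,
                     "random_replacement": False},
    },
    "max_answer_length_chars": 100,    # TyDi QA fine-tuning, ~99th percentile
}
```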
4.2 TyDi QA Results
Our main result is shown in Table 2. Canine-S (Canine with the subword loss) improves over mBert in the TyDi QA SelectP task by 2.8 F1, while using about 30% fewer parameters. Similarly, Canine-C (Canine with the autoregressive character loss), improves over mBert by 2.5 F1. Adding vocab-free character n-grams leads to even further gains over mBert (+3.8 F1) and even more on the MinSpan task (+6.9 F1). A language-wise breakdown is provided in Table 7 in the Appendix.
Table 2: TyDi QA primary task results (F1), pre-training speed, and parameter counts.

Model | Input | MLM | r | Length | Examples / sec | Params | TyDi QA SelectP | TyDi QA MinSpan |
---|---|---|---|---|---|---|---|---|
mBert (public) | Subwords | Subwords | – | 512 | – | 179M | 63.1 | 50.5 |
mBert (ours) | Subwords | Subwords | – | 512 | 9000 | 179M | 63.2 | 51.3 |
– | Chars | Single Chars | 1 | 2048 | 925 | 127M | 59.5 (–3.7) | 43.7 (–7.5) |
– | Chars | Subwords | 1 | 2048 | 900 | 127M | 63.8 (+0.6) | 50.2 (–1.0) |
Canine-S | Chars | Subwords | 4 | 2048 | 6400 | 127M | 66.0 (+2.8) | 52.5 (+1.2) |
Canine-C | Chars | Autoreg. Chars | 4 | 2048 | 6050 | 127M | 65.7 (+2.5) | 53.0 (+1.7) |
Canine-C + n-grams | Chars | Autoreg. Chars | 4 | 2048 | 5600 | 167M | 68.1 (+4.9) | 57.0 (+5.7) |
We also present results from some ablation models as additional baselines in rows 3–4 of Table 2. First, for row 3, we simply replace Bert’s subword vocabulary with a pure character vocabulary, which makes characters both the input granularity and the unit of masking and prediction for the MLM task, and observe that not only is the model 10X slower than subword-based Bert, but the quality also suffers greatly. Then, for row 4, we modify that model to use subwords for masking and MLM predictions, while keeping characters as the input granularity, and we see a substantial quality improvement, though pre-training remains extremely slow. Finally, by comparing to the full Canine model in row 5, we can see that adding the downsampling strategy improves speed by 700%, and also leads to an additional small bump in quality. We speculate that this additional quality gain comes from giving the model a better inductive bias toward more word-like units within the deep transformer stack.
Analysis
Canine fares particularly well on morphologically rich languages such as Kiswahili. Table 3 shows examples where Canine outperforms mBert on the TyDi QA SelectP task. In particular, we observe examples where Kiswahili’s rich morphology does not hinder the matching process for Canine.
Table 3: Examples where Canine outperforms mBert on the TyDi QA SelectP task; each Kiswahili question and passage is followed by its English translation.

Question | Passage Answer |
---|---|
Chelsea ina milikiwa na nani? | Kwa kawaida Chelsea huvaa jezi ya blu, kaptula blu na soksi nyeupe. Nembo ya klabu imebadilishwa mara nyingi kulingana na wakati na kuboresha muonekano wa klabu. Nembo ya sasa inaonesha picha ya simba akiwa amebeba mkuki. Tangu Julai 2003, Chelsea imekuwa ikimilikiwa na Bilionea wa Kirusi, Roman Abramovich. |
Who owns Chelsea? | Chelsea usually wear blue jerseys, blue shorts and white socks. The club logo has been changed many times over time and improved the club’s appearance. The current emblem shows a picture of a lion carrying a spear. Since July 2003, Chelsea has been owned by Russian billionaire Roman Abramovich. |
Kampuni isambazayo umeme nchini Kenya inaitwaje? | Kenya Power and Lighting (KPLC) ni kampuni inayohusika na maambukizi ya umeme na usambazaji wa umeme nchini Kenya. |
What is the name of the company that distributes electricity in Kenya? | Kenya Power and Lighting (KPLC) is a company responsible for electricity transmission and distribution in Kenya. |
4.3 Ablations
Attending Directly to hdown
Number of Hash Buckets
We reduce the number of hash buckets (B) from 16k to 8k, meaning more (partial) collisions in embedding lookups. This significantly hinders the MinSpan task.
Character Vocab
We switch from our hash-based no-vocabulary strategy to using a normal character vocabulary (which we derive from the pre-training corpus). We observe that this underperforms the hashing approach. We speculate that this might be due to skew between the pre-training corpus and the final downstream task since not all codepoints can be included in the vocabulary.
Input Character Dimension
We reduce the embedding size of the initial character encoder (i.e., the embedding size of hinit and e, not hup nor yseq) and observe that quality falls off rapidly.
No Initial Transformer
We remove the local transformer from hinit and similarly observe a marked reduction in quality.
Increased Downsampling
While more aggressive downsampling (a factor of 5X or 6X, rather than 4X) brings substantial speed gains, the passage-level quality degrades substantially and the minimal span predictions suffer even more.
No Position-Limited MLM
When we do not use the trick of applying the final character transformer (yseq) only to the positions used by the MLM task, we observe a large reduction in speed. Since this configuration is theoretically equivalent in terms of the operations performed, we report only its speed for exposition.
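The position-limited trick can be sketched as a gather before the final character layer; this is heavily simplified (in Canine the final layer is a full transformer layer, and its attention details are omitted here), with the identity function standing in for that layer.

```python
import numpy as np

def targeted_final_layer(char_inputs, mlm_positions, final_layer):
    """Run the expensive final character layer only at MLM prediction positions.

    char_inputs:   [char_len, dim] inputs to the final character transformer.
    mlm_positions: indices of the (at most 80 or 320) positions with MLM targets.
    final_layer:   any callable mapping [n, dim] -> [n, dim].
    """
    gathered = char_inputs[np.asarray(mlm_positions)]  # [num_predictions, dim]
    return final_layer(gathered)                       # far fewer positions than char_len

outputs = targeted_final_layer(np.random.randn(2048, 768), [3, 17, 512],
                               final_layer=lambda x: x)  # identity stands in for the layer
print(outputs.shape)  # (3, 768)
```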
We also performed ablations aimed at exploring the effect of feature concatenation and residuals; results are in Table 4. Not concatenating the downsampled representation with the initial character representation when computing hup causes the model to become unstable (row 2); adding a residual from hup back to hinit does not help (row 3). However, additionally inserting a residual from hup back to hdown does stabilize the model (row 4), though it does not recover the original quality.
Table 4: Ablations on feature concatenation and residual connections (TyDi QA F1).

Model | SelectP | MinSpan |
---|---|---|
Canine-C | 65.7 | 53.0 |
No concatenation | 17.2 | 35.6 |
+Final-to-initial resid. | 17.3 | 35.9 |
+Final-to-downsampled resid. | 62.0 | 50.2 |
4.4 NER Results
Named entity recognition is a task in which memorization is often a very effective strategy. For example, if a model has London in its vocabulary and sees it with the label location during training, then it simply has to retrieve this memorized association when it sees the token London at test time. Therefore, evaluating on NER is helpful for understanding the ways in which different models emphasize memorization vs. generalization.
As shown in Table 5, Canine-C performs significantly worse than mBert on NER, likely due to mBert’s memorization-friendly vocabulary. However, when (tokenization-free) n-gram features are added to Canine-C, performance rebounds, showing that it is possible to cheaply boost a model’s memorization ability while remaining fully tokenization-free.
Table 5: NER results (F1).

Model | CoNLL | MasakhaNER |
---|---|---|
mBert (ours) | 87.8 | 72.4 |
Canine-C | 74.0 (–13.8) | 65.5 (–6.9) |
Canine-C + n-grams | 86.7 (–1.1) | 76.8 (+4.3) |
Table 6: Ablation results on TyDi QA (F1) and pre-training speed.

Condition | Examples / sec | TyDi QA SelectP | TyDi QA MinSpan |
---|---|---|---|
Attend to hdown (instead of hup) | 6400 | 64.5 | 52.2 |
8k codepoint hash buckets (instead of 16k) | 6400 | 64.1 (–0.4) | 50.5 (–1.7) |
Character vocab (no hashing) | 6400 | 64.6 (+/–) | 51.2 (–1.0) |
Input character dim 384 (instead of 768) | 6600 | 62.9 (–1.2) | 49.3 (–1.2) |
Input character dim 192 (instead of 768) | 6400 | 61.7 (–2.4) | 47.3 (–3.2) |
No initial character transformer | 6700 | 63.2 (–1.4) | 48.3 (–2.9) |
Downsample by a factor of 5 (instead of 4) | 7000 | 62.9 (–1.7) | 49.2 (–2.0) |
Downsample by a factor of 6 (instead of 4) | 9200 | 62.7 (–1.9) | 47.6 (–3.6) |
Don’t limit final character transformer to MLM positions | 5200 | — | — |
Canine-S | 6400 | 66.0 | 52.5 |
A full language-wise breakdown is provided in the appendix (Table 8). It is worth noting that part of the performance difference on MasakhaNER is due to mBert producing no usable outputs for Amharic. The mBert pre-training data does not contain Amharic (or any Amharic-script text), so it has no vocabulary entries for Amharic’s script (meaning that mBert sees only sequences of [UNK] on Amharic inputs). However, since Canine always supports the full Unicode space, it is able to achieve 50 F1 even though it, too, had never seen Amharic text during pre-training. We take this as validation of Canine’s vocabulary-free approach. It may also be evidence that Canine exhibits cross-script transfer abilities analogous to those in mBert (Pires et al., 2019).
Error Analysis
Canine-C tends not to label rarer lexical items that mBert appears to have memorized. For example, with Canine-C, JCPenney (a relatively rare lexical item) is not recognized as an entity. Canine-C also tends to separate long entities; for example, “State Street Bank and Trust Company” is labeled as two separate spans, “State Street Bank” and “Trust Company”, and the location TAMPA BAY is recognized only as TAMPA. However, adding n-gram features appears to mostly resolve this issue.
5 Related Work
5.1 Improvements to Subword Tokenization
Further improvements to standard subword tokenization like Byte Pair Encoding (BPE) (Sennrich et al., 2016), WordPiece (Wu et al., 2016), and SentencePiece (Kudo and Richardson, 2018) have been proposed. Subword regularization (Kudo, 2018) and BPE-dropout (Provilkov et al., 2020) recognize that deterministic segmentation during training limits the ability to leverage morphology and word composition; instead, they sample at random one of the multiple tokenizations of the training input, made possible by the inherent ambiguity of subword vocabularies. Wang et al. (2021) recently expanded on this paradigm to enforce consistency of predictions over different segmentations. Unigram LM (Kudo, 2018), which builds its vocabulary top–down, was shown to align with morphology better than BPE on pre-trained encoders (Bostrom and Durrett, 2020).
5.2 Character-Level Models
Following the larger NLP trend, character-level n-gram models (Huang et al., 2013; Wieting et al., 2016; Bojanowski et al., 2017) have mostly been replaced by neural networks. While generally lagging behind their word-level counterparts, character-level features are important for morphologically rich languages, particularly in low-resource settings (Garrette and Baldridge, 2013).
For Language Modeling
Character language models (CLMs) have used vanilla RNN architectures to produce distributions over sequences of characters in a purely tokenization-free manner (Sutskever et al., 2011; Graves, 2013; Hwang and Sung, 2017; Radford et al., 2017). Hierarchical RNNs modeled the assumption that language operates on increasing layers of abstraction: Chung et al. (2017) jointly trained a sub-module to segment the character-level input into larger spans at each layer of a stacked LSTM.
Due to the consistent lag in performance behind their word-level counterparts, attention shifted from pure CLMs towards merely character-aware models, still reliant on traditional tokenization. Some hybrid models processed the input at character level, but predicted words from a closed vocabulary (Kim et al., 2016; Gerz et al., 2018). Others reintroduced explicit tokenization on the input side, and either generated bursts of character sequences that formed an open vocabulary (Kawakami et al., 2017) or used a character-only generator as a fallback when the main closed-vocabulary word generator produced a rare or unknown token (Matthews et al., 2019; Mielke and Eisner, 2019). Especially after the popularization of the inherently ambiguous subword vocabularies like BPE, several studies moved beyond a single input segmentation and marginalized over all possible segmentations (van Merriënboer et al., 2017; Buckman and Neubig, 2018; Grave et al., 2019).
Coming full circle, Kawakami et al. (2019) induced a lexicon without any explicit supervision, reverting back to pure CLMs. In a revitalized effort to bring them on par with coarser granularities, researchers leveraged external resources such as grounding in vision (Kawakami et al., 2019) or multi-task learning together with supervised morphology tasks (Blevins and Zettlemoyer, 2019).
After the transformer (Vaswani et al., 2017) replaced RNNs as the dominant architecture in NLP, character-level models followed. Al-Rfou et al. (2019) showed that byte-level vanilla Transformers significantly underperform their word-level counterparts. A similar finding was reported by Radford et al. (2019). Although the gap has been reduced (Choe et al., 2019), subword transformers remain the status quo for pure language modeling.
For Specific Tasks
In parallel with LM efforts, the neural machine translation (NMT) community sought to solve its open-vocabulary problem via character-level modeling. Luong and Manning (2016) proposed a hybrid model that operated mainly at the word level, but consulted a character-level LSTM for unknown words; this was a practical compromise, as their character-only model took 3 months to train. Lee et al. (2017) enabled pure character NMT by shortening the input length via convolutional, pooling, and highway layers. Notably, their many-to-English model outperformed its subword counterpart and most bilingual baselines, with a 35% increase in training time (on a single GPU) compared to a baseline BPE-to-char model. Canine has a similar motivation, but operates in the context of pre-trained transformers; its pre-training is 7x faster than a char-to-char baseline (on TPU v3), with a 28% increase in training time over mBert (Table 2).
Character information has been leveraged for many other end tasks as well, including: text classification (Zhang et al., 2015; Zhang and LeCun, 2017), part-of-speech tagging and NER (Gillick et al., 2016; Akbik et al., 2018; Pinter et al., 2019), named entity detection (Yu et al., 2018), dependency parsing (Vania et al., 2018), and machine reading comprehension (Hewlett et al., 2018). Character information proved particularly useful for low-resource languages (Xie et al., 2018), phenomena such as code-switching and transliteration (Ball and Garrette, 2018), and rich morphology (Vania and Lopez, 2017), previously receiving special modeling including adaptor grammars (Botha and Blunsom, 2013).
For Transfer Learning
Token-based models have also been augmented with character-level information in the context of transfer learning, where encoders trained with unsupervised objectives are repurposed to solve downstream tasks. Pinter et al. (2017) addressed the out-of-vocabulary problem of static pre-trained word embeddings by training a model to map the surface of a word to its pre-trained representation, and used it on unknown words. ELMo (Peters et al., 2018), a bidirectional LSTM model, applied character convolutions to its whitespace-separated input tokens. CharacterBert (Boukkouri et al., 2020) ported this technique to Bert, augmenting its existing WordPiece-tokenized input. Consistent with previous observations that feeding characters into a transformer stack comes with a huge computational cost while not improving over tokenization-based approaches (Al-Rfou et al., 2019), a Bert model fine-tuned for semantic parsing achieved gains only when characters complemented subwords (van Noord et al., 2020).
5.3 Multilingual Models
Multilingual NLP has been dominated by deep pre-trained multilingual models whose subword vocabularies are shared across languages. Such models borrow their architectures from monolingual predecessors and apply joint training in 100+ languages, either with unsupervised LM losses: mBert, mT5 (Xue et al., 2021), or with additional translation losses: XLM (Lample and Conneau, 2019), XLM-R (Conneau et al., 2020). Chung et al. (2020) extended this by forming language clusters with per-cluster vocabularies. To accommodate languages unseen during pre-training, Wang et al. (2020) extended the vocabulary and continued pre-training.
6 Conclusion
In this article, we described Canine, which is, to our knowledge, the first pre-trained deep encoder for language understanding that uses a tokenization-free, vocabulary-free model, while surpassing the quality of models built on top of heuristic tokenizers. Canine eliminates many engineering pitfalls for practitioners and opens up new research directions for the community.
Acknowledgments
The authors wish to thank Noah Constant, Rami Al-Rfou, Kristina Toutanova, Kenton Lee, Ming-Wei Chang, and Tim Dozat for their feedback on this work. We would also like to thank Martin Njoroge and Nanjala Misiko for their consultations on the Kiswahili examples, Diana Akrong for consulting on Twi orthography, and Waleed Ammar for consulting on Arabic morphology.
A Appendix
Table 7: TyDi QA SelectP and MinSpan results (F1) by language.

Language | mBert | Canine-S | Canine-C | Canine-C + n-grams |
---|---|---|---|---|
SelectP | | | | |
(English) | 62.2 | 58.6 (–3.6) | 61.6 (–0.6) | 64.6 (+2.4) |
Arabic | 82.3 | 82.8 (+0.5) | 82.5 (+0.2) | 84.3 (+2.0) |
Bengali | 58.5 | 61.8 (+3.3) | 62.5 (+4.0) | 66.0 (+7.5) |
Finnish | 60.4 | 62.2 (+1.8) | 63.6 (+3.2) | 66.7 (+6.3) |
Indonesian | 61.3 | 63.5 (+2.2) | 64.2 (+2.9) | 65.9 (+4.6) |
Japanese | 46.2 | 51.7 (+5.5) | 49.7 (+3.5) | 51.2 (+5.0) |
Korean | 60.2 | 60.3 (+0.1) | 59.7 (–0.5) | 60.6 (+0.4) |
Russian | 62.2 | 64.6 (+2.4) | 65.6 (+3.4) | 68.5 (+6.3) |
Swahili | 58.8 | 67.8 (+9.0) | 67.0 (+8.2) | 67.2 (+8.4) |
Telugu | 81.0 | 82.5 (+1.5) | 81.1 (+0.1) | 84.6 (+3.6) |
Thai | 61.1 | 62.8 (+1.7) | 61.2 (+0.1) | 65.8 (+4.7) |
Macro Avg | 63.2 | 66.0 (+2.8) | 65.7 (+2.5) | 68.1 (+4.9) |
MinSpan | | | | |
(English) | 46.0 | 46.3 (+0.3) | 49.0 (+3.0) | 51.8 (+5.8) |
Arabic | 70.7 | 66.9 (–3.8) | 65.6 (–5.1) | 73.0 (+2.3) |
Bengali | 47.3 | 46.7 (–0.6) | 52.5 (+5.2) | 57.1 (+9.8) |
Finnish | 51.1 | 53.0 (+1.9) | 53.8 (+2.7) | 57.1 (+6.0) |
Indonesian | 52.2 | 53.6 (+1.4) | 54.4 (+2.2) | 56.8 (+4.6) |
Japanese | 36.1 | 40.3 (+4.2) | 40.7 (+4.6) | 42.0 (+5.9) |
Korean | 36.8 | 35.7 (–1.1) | 36.5 (–0.3) | 39.9 (+3.1) |
Russian | 45.6 | 46.7 (+1.1) | 47.2 (+1.6) | 51.5 (+5.9) |
Swahili | 49.4 | 59.0 (+9.6) | 57.6 (+8.2) | 59.2 (+9.8) |
Telugu | 75.6 | 75.2 (–0.4) | 74.2 (–1.4) | 79.7 (+4.1) |
Thai | 48.4 | 47.9 (–0.5) | 47.1 (–1.3) | 54.2 (+5.8) |
Macro Avg | 51.3 | 52.5 (+1.2) | 53.0 (+1.7) | 57.0 (+5.7) |
Table 8: NER results (F1) by language.

Language | mBert | Canine-C | Canine-C + n-grams |
---|---|---|---|
CoNLL | | | |
Dutch | 90.2 | 74.7 (–15.5) | 88.5 (–1.7) |
English | 91.1 | 79.8 (–11.3) | 89.8 (–1.3) |
German | 82.5 | 64.1 (–18.4) | 82.1 (–0.4) |
Spanish | 87.6 | 77.4 (–10.2) | 86.5 (–1.1) |
Macro Avg | 87.8 | 74.0 (–13.8) | 86.7 (–1.1) |
MasakhaNER | | | |
Amharic | 0.0 | 44.6 (+44.6) | 50.0 (+50.0) |
Hausa | 89.3 | 76.1 (–13.2) | 88.0 (–1.3) |
Igbo | 84.6 | 75.6 (–9.0) | 85.0 (+0.4) |
Kinyarwanda | 73.9 | 58.3 (–15.6) | 72.8 (–1.1) |
Luganda | 80.2 | 69.4 (–10.8) | 79.6 (–0.6) |
Luo | 75.8 | 63.4 (–12.4) | 74.2 (–1.6) |
Nigerian Pidgin | 89.8 | 66.6 (–23.2) | 88.7 (–1.1) |
Swahili | 87.1 | 72.7 (–14.4) | 83.7 (–3.4) |
Wolof | 64.9 | 60.7 (–4.2) | 66.5 (+1.6) |
Yorùbá | 78.7 | 67.9 (–10.8) | 79.1 (+0.4) |
Macro Avg | 72.4 | 65.5 (–6.9) | 76.8 (+4.3) |
Notes
We consider splitting on Unicode characters to be tokenization-free because it depends only on the (deterministic) process defined by the Unicode standard, and not on any models, hand-crafted rules, or other linguistic knowledge.
Unicode defines 1,114,112 total codepoints, of which only 143,698 are assigned to characters as of Unicode 13.0. This covers 154 scripts and over 900 languages.
For example, Spanish speakers may drop accents when typing.
Hawaiian uses an apostrophe to indicate a glottal stop.
Informal Twi uses a right paren ) to represent the letter ↄ.
Spanish past tense uses an accented final vowel.
Vietnamese uses diacritics to indicate tones—often the only difference among several unrelated content words.
For example, should a subword containing an unknown character be a separate token, or should the unknown character be separated as its own token?
Enveloping the attention stack between downsampling and upsampling layers is similar to the Funnel-Transformer (Dai et al., 2020), which operates on WordPiece. However, many of its design choices (e.g., average pooling, their residual structure) did not work well in Canine.
Python preprocessing: [ord(c) for c in text].
Conceptually, a codepoint is a character; however, a Unicode codepoint is defined precisely and unambiguously.
Canine uses learned embeddings, not random embeddings as in other hash embedding approaches (Kaliamoorthi et al., 2019).
The memory footprint of these hash embeddings is equivalent to a vocabulary embedding with 16k items.
This is not a probing/chaining hash table, but rather an approximate map, where we expect and tolerate collisions, similar to a Bloom Map (Talbot and Talbot, 2008).
We use B = 15k and N = 4 for our n-grams.
We use blocks of 128 characters in our experiments.
In our experiments, we found a downsampling rate of 4X to result in high quality with a speed comparable to Bert.
We use w = 4 in our experiments.
We use codepoints in Unicode’s Private Use Area block such that the input remains a valid Unicode string.
Though we expect that future work on vocabulary-free random replacement may improve quality.
The left-to-right self-attention masking is with regard to the shuffled sequence.
This highly effective targeted upsampling optimization is the primary reason that Canine uses a full Transformer layer for the final full-length character sequence rather than a local transformer. Because a block-wise local transformer assumes uniform position-wise locality over attention blocks, it is not trivial to combine these two optimizations; the local self-attention mask would no longer be a simple block diagonal. However, this final upsampling layer is discarded for classification tasks and so does not contribute any cost. Hence, while it is possible to combine local attention and targeted upsampling, this is left as future work.
As opposed to the simplified TyDiQA-GoldP task, which is part of the Xtreme meta-benchmark.
XLM-R instantiates Bert with a larger pre-training corpus, larger model size, and larger vocabulary size.
X-STILTS performs English fine-tuning on an existing XLM-R checkpoint (Phang et al., 2020).
v3 TPUs have 16 GiB memory / core (128 GiB total).
Each pre-training uses approximately 24 hours on 64 TPUs (1.5k TPU-hours), so the 18 pre-trainings in Tables 2/3/4 required about 28k TPU-hours. The 18 TyDi QA experiments in these tables, each take about 1 hour on 16 TPUs, each with 3 replicas (48 TPU-hours), about 1k TPU-hours total. The 3 NER experiments in Table 5 each took 3 hours on 4 TPUs with 3 replicas each (36 TPU-hours), 108 TPU-hours total. Thus replicating the experiments in this paper would take approximately 29k TPU-hours.
These ablations were carried out during initial model development, hence comparisons to a non-final model.
References
Author notes
Canine: Character Architecture with No tokenization In Neural Encoders. Code and checkpoints are available on GitHub at http://caninemodel.page.link/code.
Action Editor: Shay Cohen