Abstract
Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counteract this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find that the experimented techniques are effective on morphologically rich languages in general rather than on analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for BPE-based models, while character-level ones give generally higher scores for char- and mBERT-based models).
1. Introduction
Recent advancements in the natural language processing (NLP) field have led to models that surpass all previous results on a range of high-level downstream applications, such as machine translation, text classification, dependency parsing, and many more. However, these models require a huge number of training data points to achieve state-of-the-art scores and are known to suffer from the out-of-domain problem; in other words, they are not able to correctly label or generate novel, unseen data points. In order to boost the performance of such systems in the presence of little data, researchers have introduced various data augmentation techniques that aim to increase both the sample size and the variation of lexical (Wei and Zou 2019; Fadaee, Bisazza, and Monz 2017; Kobayashi 2018; Karpukhin et al. 2019) or syntactic patterns (Vickrey and Koller 2008; Şahin and Steedman 2018; Gulordava et al. 2018). A similar line of research introduced adversarial attack and defense mechanisms (Belinkov and Bisk 2018; Karpukhin et al. 2019) based on injecting noise, with the goal of building more robust NLP systems.
Even though low-resource languages are the perfect test bed for such augmentation techniques, a large number of studies only simulate a low-resource environment by sampling from a high-resource language like English (Wei and Zou 2019; Guo, Mao, and Zhang 2019a). Unfortunately, methods that perform well on English may function poorly for many low-resource languages, since English is an analytic language while most low-resource languages are synthetic. Furthermore, the majority of studies focus either on sentence classification or machine translation. Although these tasks are important, the traditional sequence tagging tasks, where tokens (e.g., white-Adj cat-Noun) or the relations between tokens (e.g., white modifying cat) are labeled, are still considered primary steps toward natural language understanding. These tasks generally have finer-grained labels and are more sensitive to noise. In addition, previous work mostly reports single best scores that can be achieved by the augmentation techniques.
To extend the knowledge of the NLP community on text augmentation methods, there is a need for a comprehensive study that investigates (i) different types of augmentation methods on (ii) a diverse set of low-resource languages, for (iii) a diverse set of sequence tagging tasks that require different linguistic skills (e.g., morphological, syntactic, semantic). To address these issues, we explore a wide range of text augmentation methodologies that augment at the character level (Karpukhin et al. 2019), token level (Kolomiyets, Bethard, and Moens 2011; Zhang, Zhao, and LeCun 2015; Wang and Yang 2015; Wei and Zou 2019), and syntactic level (Şahin and Steedman 2018; Gulordava et al. 2018). To gain insights into the capability of these methods, we experiment on sequence tagging tasks of varying difficulty: POS tagging, dependency parsing, and semantic role labeling, also known as shallow semantic parsing. POS tagging and dependency parsing experiments are performed on truly low-resource languages that are members of various language families: Kazakh (Turkic), Tamil (Dravidian), Buryat (Mongolic), Telugu (Dravidian), Vietnamese (Austro-Asiatic), Kurmanji (Iranian), and Belarusian (Slavic). Due to the lack of annotated data, we simulate the low-resource environment for semantic role labeling (SRL) on languages from a diverse set of families, namely, Turkish (Turkic), Finnish (Uralic), Catalan (Romance), Spanish (Romance), and Czech (Slavic). To investigate whether the augmentations are model-agnostic and can further improve on state-of-the-art models, we experiment with distinct models that use various subword units (e.g., character, byte-pair-encoding, and word piece), pretrained embeddings (e.g., BPE-GloVe, multilingual BERT), and architectural designs (e.g., biaffine, transition-based parser). We train and test each model multiple times and report the mean and standard deviation scores along with the p-values calculated via paired t-tests. Furthermore, we investigate the sensitivity of augmentation techniques to their parameters in a separate study, and analyze the contribution of each technique to the improvement of frequent and rare tokens and token classes.
Our results show that augmentation methods benefit the dependency parsing task more significantly than part-of-speech tagging and semantic role labeling—in that given order—independent of the parser type. The improvements are more reliably observed for morphologically richer languages; the results for Vietnamese (an analytic language) are mixed, and in some cases significantly worse than the baselines. We find that augmentation can still provide performance gains over the strong baselines built on top of pretrained contextualized language models, especially in dependency parsing. In general, character-level augmentation gives more consistent improvements for the experimented languages and tasks, whereas synonym replacement and rotating sentences mostly result in a weak increase or a decrease. We find that the subword unit choice and task requirements also have a large impact on the results. For instance, we observe significant gains for the character-based POS tagger on Tamil, but no improvement for the BPE-based one. In addition, we find that token-level augmentation is more effective for BPE-based models, whereas character-level augmentation techniques provide significant gains for the rest. Furthermore, we observe that syntactic augmentation techniques are more likely to improve the performance on morphologically richer languages, as pronounced in the SRL task.
2. Related Work
Augmentation techniques for NLP tasks fall into two categories: feature space augmentation (FSA) and text augmentation (TA). FSA techniques mostly focus on augmenting the continuous representation space directly inside the model, while TA processes the discrete variables such as raw or annotated text.
FSA.
Guo, Mao, and Zhang (2019b) and Guo (2020) propose to generate a synthetic sample in the feature space via linearly and nonlinearly interpolating the original training samples. Although the idea originates from the computer vision field (Zhang et al. 2018), it has been adapted to English sentence classification tasks successfully (Guo, Mao, and Zhang 2019a). More recently, Guo, Kim, and Rush (2020) proposed a similar technique that mixes up the input and the output sequences and showed improvements on several sequence-to-sequence tasks such as machine translation between high-resource language pairs. Despite their success in text classification and sequence-to-sequence tasks, these methods are seldom used for sequence tagging tasks. Zhang, Yu, and Zhang (2020) use the mixup technique (Zhang et al. 2018) in the scope of active learning, where they augment the queries at each iteration and later classify whether the augmented query is plausible—since the resulting queries might be noisy—and report improvements for Named Entity Recognition (NER) and event detection. However, building a robust discriminator for more challenging tasks such as dependency parsing (DP) and SRL is a challenge on its own. Chen et al. (2020a) report that direct application of the mixup techniques (Guo, Mao, and Zhang 2019b; Guo 2020) to NER introduces vast amounts of noise, thereby making the learning process even more challenging. To address this, Chen et al. (2020a) introduce local interpolation strategies that mix sequences close to each other, where closeness for tokens is defined as either (i) occurring in the same sentence, or (ii) occurring in a sentence within the k-th neighborhood, and show improvements over state-of-the-art NER models. Even though NER is a sequence tagging task, it is fundamentally different from the tasks in this study: DP and SRL. The local context for the NER task is usually the full sentence; for SRL, however, the local context is much smaller. Consider these sentences: “I want to visit the USA” and “USA is trying to solve the problems”. NER would label USA as Country in both sentences, whereas in SRL, USA would be labeled Arg1 (the entity visited) in the first sentence and Arg0 (the agent trying to solve the problems) in the second. As can be seen, for our sequence tagging tasks, the local context is much smaller and the labels are more fine-grained than in NER. Therefore, Chen et al. (2020a) would have an extremely small sample space, whereas Chen et al. (2020b) is likely to produce too much noise. To sum up, using these techniques for more complicated sequence tagging tasks such as DP and SRL is not straightforward. This is due to (i) the relations between tokens being labeled instead of the tokens themselves; and (ii) the labels being extremely fine-grained compared to classification tasks. Furthermore, FSA techniques require direct access to the neural architecture since they modify the embedding space. That means they cannot be used in combination with black-box systems, which might be a drawback for NLP practitioners. Finally, unlike the aforementioned sequence labeling tasks, DP and SRL have a relatively complex input layer where additional linguistic information (e.g., POS tags, predicate flags) is used. That also makes the direct application of FSA techniques more challenging. For these reasons, we focus on the other category of augmentation methods.
TA.
Vickrey and Koller (2008) introduce a number of expert-designed sentence simplification rules to augment the data set with simplified sentences, and show improvements on English semantic role labeling. Similarly, Şahin and Steedman (2018) propose an automated approach to generating simplified and reordered sentences using dependency trees. We refer to such methods as label-preserving because they do not alter the semantics of the original sentence. They mostly operate on syntactically annotated data and sometimes require manually designed rules (Vickrey and Koller 2008).
Another set of techniques (Gulordava et al. 2018; Fadaee, Bisazza, and Monz 2017) performs lexical changes instead of syntactic restructuring. Gulordava et al. (2018) generate synthetic sentences by replacing a randomly chosen token with its syntactic equivalent. Fadaee, Bisazza, and Monz (2017) replace more frequent words with rare ones to create stronger lexical-label associations. Wei and Zou (2019) and Zhang, Zhao, and LeCun (2015) make use of semantic lexicons like WordNet to replace words with their synonyms. Kobayashi (2018), Wu et al. (2019), and Fadaee, Bisazza, and Monz (2017) use pretrained language models to generate a set of candidates for a token position, while Anaby-Tavor et al. (2020), Kumar, Choudhary, and Cho (2020), and Ding et al. (2020) take a generative approach. Most work focuses on classification and translation, except for Ding et al. (2020), who experiment on low-resource tagging tasks. However, they assume one label per token; their approach is therefore not directly applicable to relational tagging tasks such as dependency parsing and semantic role labeling. Additionally, Wei and Zou (2019) propose simple techniques such as adding, removing, or replacing random words and show that such perturbations improve the performance on English sentence classification tasks. Unlike other lexical augmentation methods, these do not require syntactic annotations, semantic lexicons, or large language models, which makes them more suitable for low-resource settings.
There also exist sentence-level techniques such as back-translation (Sennrich, Haddow, and Birch 2016a) that aim to generate a paraphrase by translating back and forth between a pair of languages. However, such techniques are mostly not applicable to sequence tagging tasks like semantic parsing because they cannot guarantee the token-label association. Furthermore, training an intermediate translation model for paraphrasing purposes may not be possible for genuinely low-resource languages. In a similar line, Yoo et al. (2021) leverage a pretrained large language model to generate text samples and report improvements on English text classification tasks.
Another category of TA uses noising techniques that are closely associated with adversarial attacks on text systems. Belinkov and Bisk (2018) attack machine translation systems by modifying the input with synthetic (e.g., swapping characters) and natural noise (e.g., injecting common spelling mistakes). Similarly, Karpukhin et al. (2019) defend machine translation systems by training on a set of character-level synthetic noise models. Han et al. (2020) introduce an adversarial attack strategy for structured prediction tasks including dependency parsing. They first train a seq2seq generator via reinforcement learning, designing a special reward function that evaluates whether the generated sequence could trigger a wrong output. The evaluation is carried out by two other reference parsers—that is, if both parsers are tricked, then it is likely that the victim parser would also be misled. Han et al. (2020) then use adversarial training as a defense mechanism and show improvements. In the same line of research, Zheng et al. (2020) propose replacing a token with a “similar” token, where similarity is defined as having a similar log likelihood of being generated by pretrained BERT (Devlin et al. 2019) given the context and having the same POS tag (in the black-box setting). The tokens are chosen such that the parser's error rate is maximized. The attack is called a “sentence-level attack” when the positions of the chosen tokens are irrelevant. In a “phrase-level attack,” the authors first choose two subtrees and then maximize the error rate on the target subtree by modifying the tokens in the source subtree. Even though the adversarial example generation techniques (Zheng et al. 2020; Han et al. 2020) could in theory be used to augment data, their requirements—such as a separate seq2seq generator, a BERT-based scorer (Zhang et al. 2020), reference parsers of a certain quality, external POS taggers, and high-quality pretrained BERT (Devlin et al. 2019) models—make them challenging to apply to low-resource languages. Besides, most of the aforementioned adversarial attacks are optimized to trigger an undesired change in the output with minimal modifications, while data augmentation is only concerned with increasing the generalization capacity of the model.
Surveys.
Due to the growing number of data augmentation techniques and the interest in them, several survey studies have recently been published (Hedderich et al. 2021; Feng et al. 2021; Chen et al. 2021). Hedderich et al. (2021) give a comprehensive overview of recent methods proposed to tackle low-resource scenarios. The authors cover a range of techniques both for low-resource languages and domains, discussing their limitations, requirements, and outcomes. Apart from data augmentation methods, they discuss more general approaches such as cross-lingual projection, transfer learning, pretraining (of large multilingual models), and meta-learning, which are not in the scope of this work. Feng et al. (2021) provide a more focused survey, zooming in on data augmentation techniques rather than other common approaches such as transfer learning. They group data augmentation into three categories: (i) rule-based, (ii) example interpolation, and (iii) model-based techniques. Rule-based techniques refer to easy-to-compute methods such as EDA (Wei and Zou 2019) and dependency tree morphing (Şahin and Steedman 2018), which are also covered in this study. Example interpolation techniques correspond to the FSA type of methods discussed earlier in this section. Model-based techniques refer to augmentation techniques that rely on large models trained on large texts (e.g., backtranslation and synonym replacement using BERT-like models) that either do not preserve the labels or require a strong pretrained model, which is mostly not available for low-resource languages. Both Feng et al. (2021) and Hedderich et al. (2021) provide a taxonomy and a structured bird's-eye view of the topic without the empirical investigation offered in this study. The closest work to ours is by Vania et al. (2019), who explore two of the augmentation techniques along with other approaches (e.g., transfer learning) for low-resource dependency parsing; their study is limited in the number of tasks, languages, and augmentation techniques. Finally, Chen et al. (2021) provide an empirical survey covering a wide range of NLP tasks and augmentation approaches, most of which exist for English only. Hence, their experiments are performed only on English, providing no insights for low-resource languages or sequence tagging tasks. Unlike previous literature, our work (i) focuses on sequence tagging tasks that require various linguistic skills: part-of-speech tagging, dependency parsing, and semantic role labeling; (ii) experiments on multiple low-resource languages from a diverse set of language families; and (iii) compares and analyzes a rich compilation of augmentation techniques that are suitable for low-resource languages and the focused tasks.
3. Augmentation Techniques
As discussed in Section 2, we categorize augmentation techniques as textual (TA) and feature-space augmentation (FSA). In this article, we focus on textual augmentation techniques rather than FSA for several reasons. First of all, existing FSA techniques have only been implemented and tested for simple sequence tagging tasks where the goal is to choose the best tag for each token from a rather limited set of labels. In our sequence tagging tasks, however, the labels are quite fine-grained, and the relations between tokens are labeled rather than the tokens themselves. Second, FSA techniques require direct access to the neural architecture because they modify the embedding space, which means that they cannot be used in combination with black-box systems. Finally, for some of the sequence tagging tasks, namely, dependency parsing and semantic role labeling, the models might have relatively complex input layers (e.g., incorporating additional features such as POS tags or predicate flags), which makes the direct application of FSA techniques more challenging.
Textual augmentation techniques, that is, augmentation techniques that alter the input text, can be applied at many different levels such as character, token, sentence, or document. Because we focus on sentence-level tagging tasks, document-level augmentation techniques are not considered in this study. Furthermore, the focus of this article is to investigate techniques that are suitable for low-resource languages and are able to preserve the task labels to some degree. For instance, sentence-level augmentation techniques like backtranslation are not suitable since they would not preserve the token labels. Similarly, genuinely low-resource languages lack strong pretrained language models due to the scarcity of raw data; therefore, sophisticated techniques that make use of such models are also left out of this article.
To justify our selection of techniques, Table 1 gives an overview of the DA techniques investigated in Feng et al. (2021). The first category refers to the techniques that are included in this study: they modify the data at the input level, are task-agnostic, can preserve the token and relation labels—hence are suitable for our sequence tagging tasks—and do not require large amounts of text or large models. The methods in the second category (Sennrich, Haddow, and Birch 2016a; Vaibhav et al. 2019; Nguyen et al. 2020; Wieting and Gimpel 2017; Feng, Li, and Hoey 2019; Singh et al. 2019; Anaby-Tavor et al. 2020) are not able to preserve the labels. For instance, Semantic Text Exchange (STE) (Feng, Li, and Hoey 2019) aims to replace an entity in a given sentence while modifying the rest of the sentence accordingly. Given the input “great food , large portions ! my family and i really enjoyed our saturday morning breakfast” and pizza as the replacement entity, STE generates the new sentence “80% great pizza, chewy crust ! nice ambiance and i really enjoyed it .”. Because the generated sentence is both syntactically and semantically different from the original sentence, such techniques cannot be used for any of our tasks. Furthermore, most of these techniques are tested only on English and require large amounts of data to generate meaningful paraphrases. The techniques in the next category (Kobayashi 2018; Gao et al. 2019; Louvan and Magnini 2020) benefit from large pretrained language models to replace a token or phrase. Even though such models might exist for some low-resource languages, their quality is low due to insufficient training data; using them therefore introduces more noise than expected. The following category contains techniques (Feng et al. 2020; Grundkiewicz, Junczys-Dowmunt, and Heafield 2019) that require external tools or lexicons, such as WordNet and spell checkers, that are only available for high-resource languages. The final category consists of techniques that are tuned for a specific NLP task and can therefore not be used for the tasks we focus on in this study. One exception is GECA (Andreas 2020), which can be used for parsing low-resource languages. One issue with GECA, however, is that the boundary of the extracted fragments can exceed the constituency span; hence there is no guarantee that a fragment is a subtree.
Table 1. Overview of DA techniques from Feng et al. (2021) and the reason for their inclusion in or exclusion from this study.

| DA Method | Level | Task | Reason |
|---|---|---|---|
| Synonym Replacement (Wang and Yang 2015) | Input | Agnostic | included in study |
| Random Deletion (Wei and Zou 2019) | Input | Agnostic | included in study |
| Random Swap (Wei and Zou 2019) | Input | Agnostic | included in study |
| DTreeMorph (Şahin and Steedman 2018) | Input | Agnostic | included in study |
| Synthetic Noise (Karpukhin et al. 2019) | Input | Agnostic | included in study |
| Nonce (Gulordava et al. 2018) | Input | Agnostic | included in study |
| Backtranslation (Sennrich, Haddow, and Birch 2016a) | Input | Agnostic | labels not preserved |
| UBT & TBT (Vaibhav et al. 2019) | Input | Agnostic | labels not preserved |
| Data Diversification (Nguyen et al. 2020) | Input | Agnostic | labels not preserved |
| SCPN (Wieting and Gimpel 2017) | Input | Agnostic | labels not preserved |
| Semantic Text Exchange (Feng, Li, and Hoey 2019) | Input | Agnostic | labels not preserved |
| XLDA (Singh et al. 2019) | Input | Agnostic | labels not preserved |
| LAMBADA (Anaby-Tavor et al. 2020) | Input | Classification | labels not preserved |
| ContextualAug (Kobayashi 2018) | Input | Agnostic | requires strong pretrained model |
| Soft Contextual DA (Gao et al. 2019) | Emb/Hidden | Agnostic | requires strong pretrained model |
| Slot-Sub-LM (Louvan and Magnini 2020) | Input | Slot filling | requires strong pretrained model |
| WN-Hypers (Feng et al. 2020) | Input | Agnostic | requires WordNet |
| UEdin-MS (DA part) (Grundkiewicz, Junczys-Dowmunt, and Heafield 2019) | Input | Agnostic | requires spell checker |
| SeqMixUp (Guo, Kim, and Rush 2020) | Input | Seq2seq | not suitable |
| Emix (Jindal et al. 2020a) | Emb/Hidden | Classification | not suitable |
| SpeechMix (Jindal et al. 2020b) | Emb/Hidden | Speech/Audio | not suitable |
| MixText (Chen, Yang, and Yang 2020b) | Emb/Hidden | Classification | not suitable |
| SwitchOut (Wang et al. 2018) | Input | Machine translation | not suitable |
| SignedGraph (Chen, Ji, and Evans 2020) | Input | Paraphrase | not suitable |
| DAGA (Ding et al. 2020) | Input+Label | Sequence tagging | not suitable |
| SeqMix (Zhang, Yu, and Zhang 2020) | Input+Label | Active sequence labeling | not suitable |
| GECA (Andreas 2020) | Input | Agnostic | not suitable |
This section provides a detailed discussion of the three main categories of textual augmentation techniques used in our experiments, namely, syntactic-, token-, and character-level. A summary of the techniques under each category and their suitability to our downstream tasks can be found in Table 2. The parameters associated with each technique are discussed in Section 3.4.
Table 2. Augmentation techniques by category, their applicability to our tasks (x: applicable, o: not applicable), and example outputs for the sentence “I wrote him a letter”.

| Category | Technique | POS | DEP | SRL | Generated |
|---|---|---|---|---|---|
| Char | Char Insert (CI) | x | x | x | I wrrotle him a legtter |
| Char | Char Substitute (CSU) | x | x | x | I wyote him a lettep |
| Char | Char Swap (CSW) | x | x | x | I wtore him a lteter |
| Char | Char Delete (CD) | x | x | x | I wote him a leter |
| Token | Synonym Replacement (SR) | x | x | x | I wrote him a message |
| Token | RW Delete (RWD) | x | o | o | I him a letter |
| Token | RW Swap (RWS) | x | o | o | I him wrote a letter |
| Token | RW Insert (RWI) | o | o | o | I wrote him a her letter |
| Syntactic | Crop | x | x | x | I wrote a letter |
| Syntactic | Rotate | x | x | x | Him a letter I wrote |
| Syntactic | Nonce | x | x | o | I wrote him a flower |
3.1 Character-Level Augmentation
The idea of adding synthetic noise to text applications is not new; however, it has mostly been used for adversarial attacks or to develop more robust models (Belinkov and Bisk 2018; Karpukhin et al. 2019). Previous work by Karpukhin et al. (2019) introduces four types of synthetic noise at the orthographic level: character deletion (CD), insertion (CI), substitution (CSU), and swapping (CSW). Additionally, they introduce a mixture of all noise types by sampling from a distribution of 60% clean (no noise) and 10% for each type of noise, which we refer to as Character All (CA). They show that adding synthetic noise to training data improves the performance on test data with natural noise, that is, text with real-world spelling mistakes, while not hurting the performance on clean data. The authors experiment on neural machine translation where the source languages are German, French, and Czech, and the target language is English. We hypothesize that adding the right amount of synthetic noise may also improve the performance on low-resource languages for our set of downstream tasks. For CI, we first build a character vocabulary from the most commonly used characters in the training set. We do not add noise to one-letter words and do not apply CSW to the first and last characters of a token.
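A minimal sketch of these noise operations is given below, assuming tokens are plain strings and `charset` is the character vocabulary built from the training set; the function names are our own, and the released implementation of Karpukhin et al. (2019) may differ in its details.

```python
import random

def char_noise(token, op, charset):
    """Apply one character-level edit to a token; one-letter words are skipped."""
    if len(token) < 2:
        return token
    if op == "CI":   # insert a character drawn from the training-set vocabulary
        i = random.randrange(len(token) + 1)
        return token[:i] + random.choice(charset) + token[i:]
    if op == "CSU":  # substitute one character
        i = random.randrange(len(token))
        return token[:i] + random.choice(charset) + token[i + 1:]
    if op == "CSW":  # swap two adjacent inner characters (first/last kept fixed)
        if len(token) < 4:
            return token
        i = random.randrange(1, len(token) - 2)
        return token[:i] + token[i + 1] + token[i] + token[i + 2:]
    if op == "CD":   # delete one character
        i = random.randrange(len(token))
        return token[:i] + token[i + 1:]
    return token

def char_all(token, charset):
    """CA: 60% clean, 10% for each of the four noise types."""
    op = random.choices(["clean", "CI", "CSU", "CSW", "CD"],
                        weights=[0.6, 0.1, 0.1, 0.1, 0.1])[0]
    return token if op == "clean" else char_noise(token, op, charset)
```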
The advantages of character-level synthetic noise are threefold. First, the output of the augmentation mostly preserves the original syntactic and semantic labels, because the resulting tokens are mostly out-of-vocabulary words that are quite close to the original word, resembling a spelling mistake. Second, the noise is trivial to generate, requiring no external resources such as large language models or syntactic annotations. Finally, the methods are constrained only by the number of characters, so they can generate huge numbers of augmented sentences, which can be an advantage for most downstream tasks.
3.2 Token-Level Augmentation
This category includes methods that perform token-level changes such as adding, replacing, or removing certain tokens. While some preserve the syntax or semantics of the original sentence, the majority does not.
Synonym Replacement.
As one of the earliest techniques (Kolomiyets, Bethard, and Moens 2011; Zhang, Zhao, and LeCun 2015; Wang and Yang 2015; Wei and Zou 2019), synonym replacement aims to replace words with their synonyms. A lexicon of synonyms, such as WordNet, or a thesaurus is generally required to retrieve the synonyms. Because most languages do not have such a resource, some researchers (Wang and Yang 2015) exploit pretrained word embeddings and use the k-nearest neighbors (by cosine similarity) of the queried word as replacements. As discussed in Section 2, more recent studies (Kobayashi 2018; Wu et al. 2019; Fadaee, Bisazza, and Monz 2017; Anaby-Tavor et al. 2020; Kumar, Choudhary, and Cho 2020) use contextualized language models such as bidirectional LSTMs or BERT (Devlin et al. 2019) to find related words such that the class of the sentence is preserved. However, these methods require strong pretrained language models that are generally not available for truly low-resource languages. Furthermore, they are mostly applied to sentence classification tasks, where the labels are coarse-grained compared with those of our downstream tasks. Considering this, we use a simplified approach similar to Wang and Yang (2015) and query the randomly chosen token against non-contextualized pretrained embeddings. Most languages we experiment on are morphologically productive and therefore suffer from the out-of-vocabulary problem; to circumvent this issue, we use the subword-level fastText embeddings (Grave et al. 2018).
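For illustration, the hedged sketch below queries pretrained fastText vectors through gensim's KeyedVectors; the file name `cc.ta.300.vec` (Tamil Common Crawl vectors from fasttext.cc) and the helper names are example choices, not our exact setup. Note that loading a `.vec` file covers only in-vocabulary tokens; the full fastText `.bin` model would additionally compose vectors for out-of-vocabulary words.

```python
import random
from gensim.models import KeyedVectors

# Pretrained, non-contextualized fastText vectors (Grave et al. 2018).
vectors = KeyedVectors.load_word2vec_format("cc.ta.300.vec", limit=200_000)

def synonym_replace(tokens, k=5):
    """Replace one random in-vocabulary token with one of its k nearest neighbors."""
    candidates = [i for i, t in enumerate(tokens) if t in vectors.key_to_index]
    if not candidates:
        return tokens
    i = random.choice(candidates)
    neighbors = [w for w, _ in vectors.most_similar(tokens[i], topn=k)]
    out = list(tokens)
    out[i] = random.choice(neighbors)
    return out
```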
Random Word.
Similar to character-level noise, one can inject a higher level of noise by randomly inserting (RWI), deleting (RWD), or swapping (RWS) tokens. The EDA framework by Wei and Zou (2019) shows the efficiency of these techniques, despite their simplicity, on multiple text classification benchmarks. Following Wei and Zou (2019), we experiment with all techniques except RWI: because our downstream tasks require contextual annotation of tokens, we cannot insert a random word without an annotation. Similar to character-level methods, random word augmentation techniques are easy to apply and can produce an extensive number of synthetic sentences, as they are constrained only by the number of tokens. One disadvantage is their inability to preserve syntactic and semantic labels. For instance, deleting a word may yield an ungrammatical sentence without a valid dependency tree. Therefore, random word techniques are not eligible for two of our tasks: dependency parsing and semantic role labeling.
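The sketch below illustrates RWD and RWS in the POS tagging setting, where operating on (token, tag) pairs keeps the per-token labels aligned; the helper names and default probabilities are our own.

```python
import random

def random_word_delete(pairs, p=0.1):
    """Drop each (token, tag) pair with probability p; keep at least one pair."""
    kept = [pair for pair in pairs if random.random() > p]
    return kept if kept else [random.choice(pairs)]

def random_word_swap(pairs, p=0.1):
    """Swap roughly p * len(pairs) randomly chosen position pairs."""
    out = list(pairs)
    if len(out) < 2:
        return out
    for _ in range(max(1, int(p * len(out)))):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```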
3.3 Syntactic Augmentation
This category consists of more sophisticated methods that benefit from syntactic properties to generate new sentences. The main disadvantages of this category are (i) the need for syntactic annotation, and (ii) being more constrained, namely, not being able to generate as many data points as previous categories.
Nonce.
This technique, introduced by Gulordava et al. (2018), produces dummy sentences by replacing some of the words in the original sentence such that the produced sentence is the syntactic equivalent of the original while not preserving the original semantics. In more detail, randomly chosen content words (i.e., nouns, adjectives, and verbs) are replaced with words that have the same part-of-speech tag, morphological tags, and dependency label. Given the sentence “Her sibling bought a cake”, the following sentences can be generated: “My sibling saw a cake” or “His motorbike bought an island”. As can be seen, the generated sentences mostly do not make sense but are syntactically equivalent to the original. For this reason, the technique can only be utilized for syntactic tasks and not for SRL.
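A simplified sketch of the Nonce idea is shown below. It assumes each sentence is a list of token dictionaries with `form`, `upos`, `feats`, and `deprel` keys (e.g., as produced by recent versions of the `conllu` package); the pooling scheme is our reading of Gulordava et al. (2018), not their exact implementation.

```python
import random
from collections import defaultdict

CONTENT = {"NOUN", "ADJ", "VERB"}

def build_pool(treebank):
    """Group word forms by their (upos, feats, deprel) signature."""
    pool = defaultdict(set)
    for sentence in treebank:
        for tok in sentence:
            pool[(tok["upos"], str(tok["feats"]), tok["deprel"])].add(tok["form"])
    return pool

def nonce(sentence, pool, p=0.5):
    """Replace random content words with syntactically equivalent words."""
    out = []
    for tok in sentence:
        form = tok["form"]
        if tok["upos"] in CONTENT and random.random() < p:
            key = (tok["upos"], str(tok["feats"]), tok["deprel"])
            alternatives = pool[key] - {form}
            if alternatives:
                form = random.choice(sorted(alternatives))
        out.append(form)
    return out
```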
Crop.
This augmentation algorithm by Şahin and Steedman (2018) morphs dependency trees to generate new sentences. The main idea is to remove some of the dependency links to create simpler, shorter, but still (mostly) meaningful sentences. To do so, Şahin and Steedman (2018) define a list of dependency labels, referred to as Labels Of Interest (LOI), that attach subjects and direct and indirect objects to predicates. A simpler sentence is then created by retaining only one LOI at a time. Following Şahin and Steedman (2018), we only consider the dependency relations that attach subjects and objects to predicates as LOIs, and ignore other adverbial dependency links involving predicates. We keep the augmentation non-recursive, that is, we only reorder the first level of flexible subtrees; hence the flexible chunks inside these subtrees are kept fixed. The idea is demonstrated in Figure 1.
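The simplified, non-recursive sketch below yields one cropped sentence per LOI child of the main predicate. It assumes 1-based token ids with head 0 marking the root; the LOI set shown is an illustrative UD-style choice, since the exact label set is treebank-specific.

```python
LOI = {"nsubj", "obj", "iobj"}  # illustrative UD-style labels of interest

def subtree_ids(sentence, root_id):
    """Collect the ids of root_id and all of its descendants."""
    ids, frontier = {root_id}, [root_id]
    while frontier:
        head = frontier.pop()
        for tok in sentence:
            if tok["head"] == head and tok["id"] not in ids:
                ids.add(tok["id"])
                frontier.append(tok["id"])
    return ids

def crop(sentence):
    """Yield one simpler sentence per LOI child of the main predicate."""
    root = next(t for t in sentence if t["head"] == 0)
    for tok in sentence:
        if tok["head"] == root["id"] and tok["deprel"] in LOI:
            keep = subtree_ids(sentence, tok["id"]) | {root["id"]}
            yield [t for t in sentence if t["id"] in keep]
```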
Rotate.
This technique uses an idea similar to image rotation: the sentence is rotated around the root node of the dependency tree. The method uses the same LOIs as cropping and creates flexible chunks that can be reordered in the sentence. A demonstration of rotation is shown in Figure 1. Both the cropping and rotation operations may cause semantic shifts and ill-formed sentences, mostly depending on the morphological properties of the language. For instance, languages that use case marking to mark subjects and objects can still generate valid sentence forms via the rotation operation simply because their word order is more flexible (Futrell, Mahowald, and Gibson 2015). However, most valid sentences may still sound strange to a native speaker, since there is still a preferred order and the generated sentence may have an infrequent word order. Furthermore, for languages with a strict word order, like English, rotation may introduce more noise than desired. Finally, the augmented sentences may still be beneficial for learning, since they provide the model with more examples of variation in semantic argument structure and in the order of phrases.
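A companion sketch for rotation is given below, reusing `subtree_ids()` and `LOI` from the cropping sketch above: the LOI subtrees and the remaining root chunk are treated as flexible chunks and permuted around the root. This is a simplification of the released algorithm, shown here only to illustrate the chunk-reordering idea.

```python
import itertools

def rotate(sentence):
    """Yield reorderings of the flexible chunks around the dependency root."""
    root = next(t for t in sentence if t["head"] == 0)
    chunks, used = [], set()
    for tok in sentence:
        if tok["head"] == root["id"] and tok["deprel"] in LOI:
            ids = subtree_ids(sentence, tok["id"])
            chunks.append([t for t in sentence if t["id"] in ids])
            used |= ids
    rest = [t for t in sentence if t["id"] not in used]  # root and fixed material
    original = tuple(chunks + [rest])
    for order in itertools.permutations(chunks + [rest]):
        if order != original:  # skip the original word order
            yield [tok for chunk in order for tok in chunk]
```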
3.4 Parameters
All of the aforementioned augmentation methods are parametric. As we show later in Section 5, the choice of parameters can have a substantial impact on the success of the methods. The parameters and their value ranges are defined as follows:
Ratio.
This parameter is used to calculate the number of new sentences to be generated from one original sentence. To ensure that more sentences are generated from longer sentences, it is simply calculated as ratio * |sentence|; thus, for a sentence of length 10, only 1 extra sentence will be produced with ratio = 0.1. It is only used for orthographic methods because syntactic methods are constrained in other ways (e.g., number of tokens that share morphological features and dependency labels). The values we experiment with are as follows: [0.1, 0.2, 0.3, 0.4].
Probability.
This determines the probability of a token-level augmentation, namely, the ratio of augmented tokens to sentence length. For word insertion, deletion, and all character-based methods (CI, CD, CSU, CSW, CA), it can be interpreted as the ratio of tokens that undergo a change to the total number of tokens, whereas for the word swapping operation, it refers to the ratio of swapped token pairs to the total number of tokens. For character-level augmentation, it additionally determines the number of characters to which the augmentation is applied. Similar to ratio, we experiment with the value range [0.1, 0.2, 0.3, 0.4]. For the syntactic methods crop and rotate, probability is associated with the operation dynamics; that is, when the operation probability is set to 0.2, the expected number of additional sentences produced via cropping is #LOI/5, where #LOI indirectly refers to the number of valid subtrees for the operation. Because the number of additional sentences is already constrained by #LOI, we also experiment with higher probabilities for Crop and Rotate. The range is then [0.1, 0.3, 0.5, 0.7, 1.0].
Max Sentences.
In cases of considerably long sentences or a substantial number of available candidates, the number of augmented sentences can grow rapidly, sometimes causing an undesired level of noise in the training data. To avoid this, a maximum number of sentences, namely, the sentence threshold parameter, is commonly used. Following Gulordava et al. (2018) and Vania et al. (2019), we experiment with 5 and 10 as the threshold values.
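The small sketch below, with helper names of our own, shows how these parameters might interact when augmenting a single sentence.

```python
def n_augmentations(sentence_len, ratio=0.2, max_sentences=5):
    """ratio * |sentence| new sentences, capped by the sentence threshold."""
    return min(max_sentences, max(1, int(ratio * sentence_len)))

def n_edited_tokens(sentence_len, probability=0.1):
    """Expected number of tokens that undergo a token/character-level edit."""
    return max(1, int(probability * sentence_len))

# For a 10-token sentence: ratio = 0.1 yields a single extra sentence, and
# probability = 0.3 perturbs roughly three of its tokens.
print(n_augmentations(10, ratio=0.1), n_edited_tokens(10, probability=0.3))
```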
4. Experimental Setup
First we provide detailed information on the downstream tasks and the languages we have experimented on. Next we discuss the data sets that are used in our experiments.
4.1 Tasks
We focus on three sequence tagging tasks that are central to NLP: POS tagging (POS), dependency parsing (DEP), and semantic role labeling (SRL). POS and DEP require processing at the morphological and syntactic level and are considered crucial steps toward understanding natural language. Furthermore, performance on POS and DEP is likely to carry over to other tasks that require syntactic and morphological information, such as data-to-text generation. SRL, on the other hand, is more useful for gaining insights into augmentation performance where preserving the semantics of the original sentence is essential. It can therefore serve as a proxy for higher-level tasks that necessitate semantic knowledge, such as question answering or natural language inference.
The aim of this study is to analyze the performance of various augmentation techniques for low-resource languages in a systematic way, rather than to implement state-of-the-art sequence taggers, which would generally require additional resources and engineering effort. Furthermore, this study involves a large number of experiments comparing 11 augmentation techniques across three tasks and several languages in a multi-run setup. For these reasons, we have chosen models that are not heavily engineered and are modular, but that also have proven competence on a diverse set of languages. For all tasks, we use neural models that operate at the subword level rather than the word level. This is necessary to tackle the out-of-vocabulary problem, which is inevitable for languages with productive morphology, especially in low-resource scenarios. Furthermore, we do not use any additional features (e.g., POS tags, pretrained word embeddings), external resources, or ensembles of multiple models. Considering these facts, for our POS tagging experiments we use the subword-level sequence tagging model of Heinzerling and Strube (2019), which is both modular and provides results on par with the state of the art for many languages. The model uses an autoregressive architecture (e.g., RNN or bi-LSTM) for random or non-contextual subwords, and a fine-tuning paradigm for large pretrained language models. For our dependency parsing experiments, we use the transition-based Uppsala parser v2.3 (de Lhoneux, Stymne, and Nivre 2017; Kiperwasser and Goldberg 2016), which achieved the second-best average LAS performance on low-resource languages in the CoNLL 2018 shared task (Zeman et al. 2018). Additionally, we experiment with a biaffine dependency parser built on top of large pretrained contextualized word representations (Glavas and Vulic 2021), since fine-tuning such models on downstream tasks has recently achieved state-of-the-art results on many tasks and languages. Finally, we use a character-level bidirectional LSTM model (Şahin and Steedman 2018) to conduct the SRL experiments. More details on the models and their configuration are given in the following subsections.
4.1.1 POS Tagging.
The goal is to associate each token with its corresponding lexical class, that is, its syntactic label (e.g., noun, adjective, verb). Although POS tagging is a token-level task, disambiguation using contextual knowledge is usually necessary, as one token may belong to multiple classes (e.g., to fly [Verb] vs. a fly [Noun]). For languages with rich morphology, the task is generally referred to as morphological disambiguation, whereby the correct morphological analysis—including the POS tag—is chosen from among multiple analyses. For analytic languages like English, it is mostly performed as the first step of the traditional NLP pipeline. For this task, we inherit the universal POS tag set that is shared among languages and defined within the scope of the Universal Dependencies project (Zeman, Nivre, and Abrams 2020).
Subword Units.
We experiment with three subword units: characters, BPEs, and Word Pieces. Characters and character n-grams have been among the most popular subword units (Ling et al. 2015) because they (i) do not require any preprocessing, (ii) are language-agnostic, and (iii) are computationally cheap due to the small vocabulary size. Byte Pair Encoding (BPE) (Sennrich, Haddow, and Birch 2016b) is a simple segmentation algorithm that learns a subword vocabulary by merging the most frequent character pairs for a predefined number of iterations. The algorithm only requires raw text—hence it is language-agnostic—and is computationally simple. Furthermore, it has been shown to improve the performance of various NLP tasks, especially machine translation. Heinzerling and Strube (2018) trained BPE embeddings using the GloVe (Pennington, Socher, and Manning 2014) word embedding objective and made them available for 275 languages with multiple vocabulary sizes. However, neither randomly initialized character embeddings nor GloVe-BPE embeddings are aware of context. Therefore, we also experiment with contextualized embeddings (i.e., different embeddings for the same subword depending on the context) that operate on Word Pieces. For this study we choose BERT (Devlin et al. 2019), a transformer-based (Vaswani et al. 2017) contextualized language model that recently led to state-of-the-art results for many languages and tasks. We use the publicly available multilingual BERT (mBERT) that has been trained on the 100 languages with the largest Wikipedias, using a shared vocabulary across languages. Note that some of the low-resource languages in our experiments are not among mBERT's training languages.
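For illustration, the `bpemb` package distributes the GloVe-trained BPE embeddings of Heinzerling and Strube (2018); the language code, vocabulary size, and example sentence below are illustrative choices rather than our exact configuration.

```python
from bpemb import BPEmb

# Vietnamese BPE embeddings with a 10k merge vocabulary and 100 dimensions.
bpemb_vi = BPEmb(lang="vi", vs=10000, dim=100)
pieces = bpemb_vi.encode("một câu ví dụ")    # list of BPE segments
vectors = bpemb_vi.embed("một câu ví dụ")    # matrix of GloVe-BPE vectors
print(pieces, vectors.shape)
```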
The Model.
4.1.2 Dependency Parsing.
Dependency parsing aims to provide a viable structural or grammatical analysis of a sentence by finding the links between the tokens. It assumes that the dependent word is linked to its parent, that is, head word, with one of the dependency relations such as modifier, subject, or object. The resulting grammatical analysis is called a dependency graph, shown in Figure 1. We use the universal dependency label sets defined by the Universal Dependencies project (Zeman, Nivre, and Abrams 2020) and report the Labeled Attachment Score (LAS) as the performance measure. For the experiments, we use two different models: uuparser (Kiperwasser and Goldberg 2016; de Lhoneux, Stymne, and Nivre 2017) and the biaffine parser fine-tuned on large contextualized language models (Glavas and Vulic 2021).
uuparser.
biaffine.
4.1.3 Semantic Role Labeling.
SRL, that is, shallow semantic parsing, analyzes a sentence by means of predicates and the arguments attached to them. A wide range of semantic formalisms and annotation schemes exist; however, the main idea is labeling the arguments according to their relation to the predicate.
[I]_{A0: buyer} [bought]_{buy.01: purchase} [a new headphone]_{A1: thing bought} from [Amazon]_{A2: seller}
The example given above shows a labeled sentence with English Proposition Bank (Palmer, Dan, and Paul 2005) semantic roles, where buy.01 denotes the first sense of the verb “buy”, and A0, A1, and A2 are the numbered arguments defined by the predicate’s semantic frame. For this study, we perform dependency-based SRL, which means that only the head word of the phrase (e.g., headphone instead of a new headphone) will be detected as an argument and will be labeled as A1. To evaluate SRL results, we used the official CoNLL-09 evaluation script on the official test split. The script calculates the macro-average F1 scores for the semantic roles available in the data.
The Model.
4.2 Languages
We base our definition of a low-resource language on the number of training sentences available in UD v2.6 (Zeman, Nivre, and Abrams 2020). First, we compute the quartiles by ordering the treebanks with respect to their training data size. We then define the languages under the first quartile as low-resource languages, as shown in Figure 4. According to this definition, the low-resource languages are as follows: Kazakh, Tamil, Welsh, Wolof, Upper Sorbian, Buryat, Swedish Sign Language, Coptic, Scottish Gaelic, Marathi, Telugu, Vietnamese, Kurmanji, Livvi, Belarusian, Maltese, Hungarian, and Afrikaans.
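Concretely, the cut-off can be computed as in the sketch below, where the treebank sizes are placeholders rather than actual UD v2.6 statistics.

```python
import numpy as np

# Placeholder training-set sizes (number of sentences) per treebank.
train_sizes = {"Kazakh": 23, "Tamil": 400, "Vietnamese": 1400,
               "Telugu": 1051, "English": 12500, "Czech": 100000}
q1 = np.percentile(list(train_sizes.values()), 25)  # first quartile
low_resource = sorted(lang for lang, n in train_sizes.items() if n < q1)
print(q1, low_resource)
```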
Finally, we choose a subset of languages from diverse language families: Kazakh (Turkic), Tamil (Dravidian), Buryat (Mongolic), Telugu (Dravidian), Vietnamese (Austro-Asiatic), Kurmanji (Indo-European [IE], Iranian), and Belarusian (IE, Slavic). Although these languages can be experimented on for POS tagging and dependency parsing, none of them has data available for semantic role labeling. Semantically annotated data sets exist only for a handful of languages, such as Turkish (Turkic), Finnish (Uralic), Catalan (IE, Romance), Spanish (IE, Romance), and Czech (IE, Slavic). Therefore, we simulate a low-resource environment for SRL by sampling. We provide a summary of the languages in Table 3 and discuss the relevant properties of each language family below.
Table 3. Languages used in our experiments, their families, and morphological typologies.

| Language | Family | Typology |
|---|---|---|
| Belarusian | Indo-European (IE), Slavic | Fusional |
| Buryat | Mongolic | Agglutinative |
| Catalan | Indo-European (IE), Romance | Fusional |
| Czech | Indo-European (IE), Slavic | Fusional |
| Finnish | Uralic | Agglutinative |
| Kazakh | Turkic | Agglutinative |
| Kurmanji | Indo-European (IE), Iranian | Fusional |
| Spanish | Indo-European (IE), Romance | Fusional |
| Tamil | Dravidian | Agglutinative |
| Telugu | Dravidian | Agglutinative |
| Turkish | Turkic | Agglutinative |
| Vietnamese | Austro-Asiatic | Analytic |
Indo-European (IE).
We have five representatives of this language family: Kurmanji, Belarusian, Catalan, Spanish, and Czech, drawn from various branches: Iranian, Slavic, and Romance. From a typological perspective, all of these languages have fusional characteristics; in other words, morphemes (prefixes, suffixes) are used to convey linguistic information such as gender, but one morpheme can mark multiple properties. Slavic languages are known to have around seven distinct case markers, while the others are not as rich. The number and type of case markers available in Slavic help to relax the word order.
Uralic and Turkic.
The languages from both families are agglutinative, meaning that there is a one-to-one mapping between morpheme and meaning. All representative languages, namely, Kazakh (Turkic), Turkish (Turkic), Hungarian (Uralic), and Finnish (Uralic), attach morphemes to words extensively. These languages have a high morpheme-to-word ratio and a comprehensive case marking system.
Dravidian.
Two languages, namely, Tamil and Telugu, are from the Dravidian language family but belong to different branches. Similar to Uralic and Turkic, Dravidian languages are agglutinative and have extensive grammatical case marking (e.g., Tamil defines eight distinct markers). Unlike most of the other language families discussed here, their writing systems are not based on the Latin alphabet.
Austro-Asiatic.
Vietnamese is the only representative of this family. Unlike the previous languages, Austro-Asiatic languages are analytic. For instance, Vietnamese does not use any morphological marking for tense, gender, number, or case (i.e., it has low morphological complexity).
Mongolic.
Being a language from the Mongolic family, Buryat is an agglutinative language with eight grammatical cases. Modern Buryat uses an extended Cyrillic alphabet.
4.3 Data Sets
We use the Universal Dependencies v2.6 treebanks (Zeman, Nivre, and Abrams 2020) for POS and DEP. Some of the languages, such as Kazakh, Kurmanji, and Buryat, do not have any development data. For those languages, we randomly sample 25% of the training data to create a development set. For the SRL task, we use the data sets distributed by the Linguistic Data Consortium (LDC) for Catalan (CAT) and Spanish (SPA). In addition, we use dependency-based annotated SRL resources released for Finnish (FIN) (Haverinen et al. 2015) and Turkish (TUR) (Şahin and Adali 2018; Şahin 2016; Sulubacak and Eryiğit 2018; Sulubacak, Eryiğit, and Pamay 2016). All proposition banks are derived from a language-specific dependency treebank and contain semantic role annotations for verbal predicates. We provide the basic data set statistics for each language in Table 4. Because SRL resources are not available for truly low-resource languages, we sample small training data sets from the original ones, shown as #sampled in Table 4. Each SRL data set uses a language-specific dependency and semantic role annotation scheme; therefore, we performed the language-specific preprocessing for the cropping and rotation augmentation techniques described in the Appendix.
Table 4. Data set statistics (number of sentences). For SRL, training sets are sampled from the original data (#sampled). Languages marked with * lack a development set; for these, 25% of the original training data is sampled as development data.

| Task | Language | #training | #dev | #test |
|---|---|---|---|---|
| SRL | Czech | #sampled | 5,228 | 4,213 |
| SRL | Catalan | #sampled | 1,724 | 1,862 |
| SRL | Spanish | #sampled | 1,655 | 1,725 |
| SRL | Turkish | #sampled | 844 | 842 |
| SRL | Finnish | #sampled | 716 | 648 |
| POS & DEP | Vietnamese | 1,400 | 800 | 800 |
| POS & DEP | Telugu | 1,051 | 131 | 146 |
| POS & DEP | Tamil | 400 | 80 | 120 |
| POS & DEP | Belarusian | 319 | 65 | 253 |
| POS & DEP | Kazakh* | 23 | 8 | 1,047 |
| POS & DEP | Kurmanji* | 15 | 5 | 734 |
| POS & DEP | Buryat* | 14 | 5 | 908 |
5. Experiments and Results
We perform experiments on three downstream tasks, POS, DEP, and SRL, using the models described in Section 4.1. We run each experiment 10 times using the same set of random seeds for the models that operate on the original (unaugmented) and augmented data sets. We compare augmentation techniques to their corresponding unaugmented baselines using a paired t-test and report the p-values. We use (**) for p ≤ 0.05 and (*) for 0.05 < p ≤ 0.1 to highlight the cases where the improvements over the baselines are statistically significant, and similarly (††) and (†) to denote significantly lower scores. Color-blind friendly results are additionally given in the Appendices.
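As an illustration of this protocol, the sketch below compares ten matched runs with a paired t-test; the scores are placeholder numbers, not results from our experiments.

```python
from scipy.stats import ttest_rel

# One score per random seed; the two lists are paired run-by-run.
baseline_scores  = [71.2, 70.8, 71.5, 70.9, 71.1, 71.3, 70.7, 71.0, 71.4, 70.6]
augmented_scores = [72.0, 71.5, 71.9, 71.7, 71.8, 72.1, 71.2, 71.6, 72.2, 71.4]

t_stat, p_value = ttest_rel(augmented_scores, baseline_scores)
marker = "**" if p_value <= 0.05 else "*" if p_value <= 0.1 else ""
print(f"t = {t_stat:.3f}, p = {p_value:.4f} {marker}")
```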
5.1 POS Tagging
POS tagging is considered one of the most fundamental NLP tasks, such that it has been the initial step in the traditional NLP pipeline for quite a long time. Even in the era of end-to-end models, it has been shown to improve the performance of the downstream tasks of higher complexity when employed as a feature. Taking the importance of POS tagging on higher-level downstream tasks into account, we conduct experiments with different subword units and training regimes (see Section 4.1 for details) to examine (i) whether the behavior of the augmentation methods are model-agnostic, and (ii) whether the improvements are relevant for state-of-the-art models that are fine-tuned on large pretrained multilingual contextualized language models. The results for the character (char), BPEs (BPE), and multilingual BERT (mBERT) are given in Table 5. The languages that lack the corresponding pretrained model (e.g., Kurmanji and BPE) are not shown in the table.
char.
Except for Telugu, at least one of the techniques significantly improves over the baseline for every language. The token-level methods RWD and RWS increase the scores of Kazakh and Tamil significantly, while providing a slight increase for Belarusian and Kurmanji. SR, on the other hand, results in a mixture of increases and decreases: the Buryat and Kazakh POS taggers benefit significantly from SR, while the opposite pattern is observed for Belarusian, Kurmanji, and Telugu. Unlike the token-level methods, the character-level ones, CI, CSU, CSW, CD, and CA, bring more consistent improvements. In particular, the Belarusian, Kazakh, and Tamil POS taggers are significantly improved by most of the character-level techniques, whereas the Vietnamese tagger only benefits from CSU and CA and is slightly hurt by CD. In the majority of cases, the syntactic methods either slightly decrease performance or do not improve it significantly. One exception is the Nonce technique, which leads to a significant improvement for Kazakh and slight improvements for Buryat and Tamil.
BPE.
In general, BPE scores are slightly lower than char scores, which can be read as more room for improvement. Similar to char, we observe improvements over the baselines by at least one technique, with the exception of Tamil. Contrary to char, the token-level techniques RWD and RWS significantly increase the scores for Belarusian, Buryat, Kazakh, Telugu, and Vietnamese. Similar to char, SR yields significantly higher scores for Buryat and Kazakh, while significantly reducing the performance for Belarusian and Telugu. Although the character-level methods bring consistent and significant improvements for Buryat and Kazakh, few of them lead to higher scores for the other languages. Finally, except for Buryat and Kazakh, the improvements from syntactic techniques are not significant, while the performance drop can be severe in cases like Vietnamese.
mBERT.
As expected, the mean baseline scores are the highest for most languages; however, the augmentation methods still provide significant gains in some cases. The most significant improvements are achieved in Belarusian by CSU and in Kazakh by RWS, CSW, and Nonce. Unlike char and BPE, we do not observe a distinct increase or decrease pattern for particular groups of techniques, except for CA, which achieves significantly higher scores for all languages. However, a manual comparison of the improved and worsened results shows the pattern to be closer to char than to BPE.
Summary and Discussion.
To summarize, we observe significant improvements over the corresponding baselines by a group of augmentation techniques on certain languages for all experimented models. The token-level methods provide more significant improvements for BPE models, whereas character-level methods increase the char models' performance the most. Except for Nonce and a few exceptional cases, the syntactic-level augmentation techniques either deteriorate the performance or do not lead to significant gains. Even though the mBERT baselines are quite strong, it is still possible to achieve significantly better scores, especially for Kazakh; moreover, CA leads to consistent improvements across the experimented languages. On the other hand, we have not observed any significant gains from any augmentation for Telugu-char and Tamil-BPE. We also note the following decreasing patterns across models: Belarusian-SR, Telugu-SR, Telugu-Crop, Vietnamese-Crop, and Telugu-Rotate. We believe the inconsistency of SR stems from (i) pretrained embeddings not guaranteeing a syntactically equivalent replacement for a token, and (ii) embedding quality itself suffering in low-resource settings, namely, from a small Wikipedia.
Furthermore, the linguistically motivated syntactic techniques such as Crop and Rotate are found to be less effective than the straightforward techniques that rely on introducing noise to the system, namely, RWD, RWS, CI, CSU, CSW, CD, and CA. One important difference between these two categories is the amount of augmented data that each can generate. For the syntactic techniques, this amount is limited by linguistic factors, such as the number of subtrees or the number of tokens that share the same morphological features and dependency labels. On the other hand, the number of sentences generated by noise addition is constrained only by the parameters. Hence, substantially larger amounts of augmented data can be generated via the simple techniques than via the more sophisticated ones.
Additionally, we believe that this larger number of noisy sentences provides informative signals to the POS tagging network. As shown previously (Tenney, Das, and Pavlick 2019), low-level features like POS tags are mostly encoded at the initial layers of the network. As layers are added, more sophisticated features that require an understanding of the interactions between tokens are more likely to be captured. This matches the nature of the unsophisticated augmentation techniques, which treat tokens as isolated items rather than considering the relations among tokens as the syntactic methods do. As a result, the simpler techniques yield stronger associations at the initial layers, while the more sophisticated techniques are more likely to strengthen the intermediate layers. In addition to the simpler techniques, the syntactic method Nonce also advances the scores of most POS taggers; although it is considered one of the sophisticated methods, it targets tokens instead of modifying the sentence structure.
5.2 Dependency Parsing
We use the transition-based parser uuparser and the biaffine parser based on mBERT from Rust et al. (2021) to conduct the dependency parsing experiments. The mean and the standard deviation for each language-data set pair are given in Table 6.
uuparser.
Compared with POS tagging, the relative improvements over the baseline are substantially higher. However, unlike POS tagging, there is no linear relation between low baseline scores and larger improvements. For instance, the relative improvement for Kazakh is only 3%, despite it having the second lowest baseline. This suggests that additional factors such as data set statistics and linguistic properties come into play for dependency parsing. Except for Vietnamese, we observe significant gains for all languages by at least one of the augmentation methods, and sometimes by every technique, as in Kurmanji. Similar to POS tagging, SR provides mixed results, significantly reducing the scores for Belarusian and Vietnamese, while improving the Kurmanji and Buryat parsers. Character-level augmentation techniques mostly improve the performance of the Belarusian, Buryat, Kazakh, Kurmanji, and Telugu parsers significantly. Unlike POS tagging, the improvements from syntactic methods are more pronounced for most languages, with the exceptions of Vietnamese, Belarusian, and Buryat (the latter only in the Nonce augmentation setting).
biaffine.
As discussed earlier, mBERT is trained on the 100 languages with the largest Wikipedias; however, training is performed on a shared vocabulary. Hence, even if a language is not seen during training, the shared vocabulary might enable zero-shot learning, especially if the model is trained with related languages. In Table 6, the languages without an * sign are not included in mBERT training; therefore their scores are the results of zero-shot learning. Compared with uuparser, all baselines (Org) are substantially higher, as expected, except for Kurmanji. Despite such high scores, we observe a considerable number of statistically significant improvements over the baseline, with some exceptions. As with uuparser, Vietnamese dependency parsing performance is significantly reduced by augmentation, suggesting that the augmentation methods introduce too much noise for the Vietnamese dependency parser. SR improves the scores more consistently for Belarusian, Kazakh, and Tamil, while not bringing significant gains for the other languages. Character noise injection techniques, especially CD and CA, boost the performance of most parsers, again with the exceptions of Vietnamese and Telugu. Similar to uuparser, syntactic techniques result in higher scores compared with POS tagging. We observe significant gains from Nonce, while Crop also improves substantially in most cases, Buryat and Vietnamese being exceptions. Unlike Nonce and Crop, Rotate gives mixed results.
Summary and Discussion.
The similarities between the different parsers and the POS taggers are numerous, such as (i) the small number of augmentation techniques that were able to improve scores for Buryat; (ii) the high performance of character noise methods for most languages; (iii) the generally inconsistent scores produced by SR and rotation; and (iv) the mostly unimproved or worse results for Vietnamese and Telugu. Considering the synthetic nature of most of the languages, where characters (e.g., case markers) provide valuable information on the relationship between two tokens, the high performance of character-level noise is rather expected. In other words, varying character sequences may help the network strengthen the dependency relations. Unlike POS tagging, we also observe a performance boost from the syntactic augmenters Crop and Nonce, sometimes leading to the best results, for example, for Vietnamese and Kurmanji. This suggests that introducing more syntactic variation/noise, even in smaller amounts than character-level noise, helps in certain cases. Nevertheless, the performance of the two techniques is comparable, and it is not possible to single out one technique that is guaranteed to improve the results. Additionally, we observe that not all character-level methods increase the scores, and there is no clear pattern as to which character noise improves which language. Our results also suggest that a higher number of augmentation techniques are able to improve significantly over the competitive baseline scores provided by biaffine compared with the mBERT-based POS tagger. One reason may be that the mBERT model already contains a substantial amount of low-level syntactic knowledge, and the augmentation techniques only add noise to the fine-tuning process.
5.3 Semantic Role Labeling
As discussed in Section 4.1, we simulate a low-resource scenario by sampling 250, 500, and 1,000 training sentences from the original training sets. We perform a grid search on augmentation parameters for the #sampled = 250 setting, and choose the parameters with the best average performance for other data set settings. The results are given in Table 7.
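A minimal sketch of this protocol, with hypothetical helper names, could look as follows; the fixed seed ensures every augmentation method is evaluated on the same low-resource sample:

```python
import random

def sample_training_set(sentences, n, seed=0):
    """Simulate a low-resource setting by sampling n training sentences.
    The seed is fixed so all augmentation methods see the same sample."""
    return random.Random(seed).sample(sentences, n)

def best_parameters(results_250):
    """Pick the parameter setting with the best average score from the
    #sampled = 250 grid search; reused for the 500 and 1,000 settings.
    results_250 maps a parameter tuple to a list of per-run scores."""
    return max(results_250,
               key=lambda p: sum(results_250[p]) / len(results_250[p]))
```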
As expected, the relative improvement over the baselines decreases as the number of samples increases. For some languages like Finnish, the drop is dramatic, while for the majority the decrease is exponential. The only exception is Czech; the reason is its fine-grained, language-specific semantic annotation scheme, which requires larger data sets. Another noticeable pattern is the decreasing number of augmentation methods that improve the F1 scores as the sample size increases. For instance, while six of the methods increase the SRL performance in the #sampled = 250 setting for Catalan, only two of them provide improvements in the #sampled = 1,000 setting, and these are also less significant.
We see one distinctive pattern for Turkish, Czech, and Finnish. Unlike for Spanish and Catalan, the syntactic operation crop improves the performance in almost all settings. We believe this is due to the rich case marking systems of these languages, which enable generating almost natural sentences from subtrees. Furthermore, we observe that the rotation operation introduces a high amount of noise that cannot be regularized easily by the models. One reason for the more consistent improvements with the cropping operation is related to the semantic role statistics. Most treebanks in this study are dominated by the core arguments, namely, Arg0 and Arg1. These core arguments are usually observed as subjects or objects. In addition, many predicates are encountered with missing arguments; that is, it is more likely to see a predicate with only one of its arguments than with all of them. We believe cropping introduces more variation in terms of core arguments compared with rotation, which provides more signals for cases such as missing arguments. For Spanish and Catalan, the gains are almost always provided by character-level noise. In contrast, neither language ever benefits from the syntactic operations, as expected.
5.4 Summary of the Findings
We summarize the key findings from experiments from different perspectives: languages, downstream tasks, augmentation techniques, and models.
5.4.1 Languages.
Most languages see significant improvements from at least one augmentation configuration, independent of the task, model, and data set size.
Vietnamese has mostly witnessed drops in scores, which were especially pronounced in dependency parsing. This suggests that the augmentation methods may be less effective for analytic languages.
We have not observed any significant difference between fusional and agglutinative languages in terms of POS and DEP scores, although the differences between syntactic and non-syntactic augmentation techniques were pronounced for languages with different morphological properties.
The suitability of augmentation techniques has been found to depend on the language and subword unit pair. For instance, the Tamil POS tagging BPE baseline could not be improved, whereas we observed significant improvements over the Tamil POS tagging char baseline.
We have detected inconsistent results for Telugu in the majority of cases, and we have not seen many configurations that led to substantial improvements. Telugu had the second largest number of significant performance declines after Vietnamese. We believe this may be because Telugu has one of the largest treebanks, so the augmentation techniques add uninformative noise to an already strong baseline.
5.4.2 Tasks.
We have found many similarities between the results for POS and DEP. The most important ones are: (i) character-level augmentations providing significant improvements in most cases and (ii) inconsistent results from SR and rotation methods.
The task that has been improved the most (i.e., with statistically significant improvements by a large margin over the baseline) was DEP, followed by POS and SRL. In other words, the experimented augmentation techniques benefited the task of intermediate complexity the most.
Strong baseline scores provided by exploiting large pretrained contextualized embeddings were more likely to be further improved for DEP (e.g., biaffine) than POS (e.g., mBERT).
5.4.3 Augmentation Techniques.
The most consistent augmenters across tasks, models, and languages were found to be the character-level ones.
A satisfactory choice of augmentation techniques depends heavily on the input unit. For instance, token-level augmentation provides significant improvements for BPE, while character-level augmentation gives higher scores for char and for mBERT's WordPiece units.
The performance of SR has been detected as irregular. The reason may be that it relies on external pretrained embeddings that may be of lower-quality for some of the languages.
Even though we have not found a single winner among the character-level augmentation methods, the mixed character noise, namely CA, has improved the POS, DEP, and SRL tasks most consistently.
Among the syntactic augmenters, crop and Nonce have been found to be more reliable compared with rotate—with some exceptions like the case for BPE.
5.4.4 Models.
We observed an almost regular pattern across the different models for DEP, such as the consistent significant drop in scores for Vietnamese. Even though there were some similarities among the models for POS, such as the syntactic augmenters achieving the lowest scores, a regular pattern was not visible. The reason might be the difference among the subword units used for POS.5
As noted under the task findings, strong baselines built on large pretrained contextualized embeddings were more likely to be further improved for DEP (e.g., biaffine) than for POS (e.g., mBERT). This may be because mBERT already contains a substantial amount of low-level syntactic knowledge (e.g., POS), so the augmentation techniques only add noise to the fine-tuning process.
6. Analysis
In this section, we first discuss the parameter choice for augmentation techniques and conduct a case study on POS tagging to analyze their sensitivity to parameter choice and their performance range. Next, we study the performance of individual augmentation techniques on frequent and infrequent tokens and lexical types to reveal whether the performance improvement also corresponds to better generalization in out-of-domain settings.
6.1 Augmentation Parameters
Augmentation methods are parameterized, and their performance may be sensitive to changes in parameter values. However, in a real-world low-resource scenario, (mostly) without any development set, parameter tuning may not always be possible. Therefore, we create augmented data sets using all combinations of the augmentation parameters described in Section 3 (e.g., 4 × 4 × 2 = 32 augmented data sets for the RWD method for each treebank). To analyze the behavior of each augmentation technique, we perform a grid search on the parameters and draw box plots for each augmentation method on Belarusian, Tamil, Vietnamese, Buryat, Kazakh, and Kurmanji POS tagging, shown in Figure 5. The box plots allow the augmentation techniques to be compared with respect to their performance range and their sensitivity to parameters, rather than their best performance.
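For illustration, the grid enumeration can be sketched as below; the parameter names and values are hypothetical stand-ins for the grids of Section 3, chosen only to reproduce the 4 × 4 × 2 = 32 combinations mentioned above:

```python
from itertools import product

# Hypothetical RWD grid: a deletion probability, a number of augmented
# copies per sentence, and whether to keep the original sentence,
# giving 4 x 4 x 2 = 32 augmented data sets per treebank.
RWD_GRID = {
    "delete_prob": [0.05, 0.1, 0.15, 0.2],
    "num_copies": [1, 2, 4, 8],
    "keep_original": [True, False],
}

def parameter_combinations(grid):
    """Enumerate every parameter combination of an augmentation method."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# One augmented data set (and one training run) per combination:
# for params in parameter_combinations(RWD_GRID): train_and_eval(params)
```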
Belarusian.
The median lines of RWD, RWS, CI, CSU, CA, and Nonce lie outside of the Org (baseline) box. This suggests that these techniques are likely to perform better than the baseline. Among these methods, the Nonce and RWD scores are less dispersed than the others, that is, less sensitive to parameter changes. The min-max range of Nonce is considerably smaller than that of the others, signaling the reliability of the method. Most techniques have only a few outliers, suggesting that the scores are stable across parameter settings.
Tamil.
For Tamil, only the RWS and Nonce median lines lie above the baseline box. Similar to Belarusian, the box length and the min-max range of Nonce are smaller than those of RWS, hinting at the reliability of the method. Unlike for Belarusian, SR, crop, and rotate lie under the box, meaning that they are likely to yield worse results than the baseline. Similarly, the number of outliers is limited, and the maximum improvement over the best baseline model is around 0.7%.
Vietnamese, Hungarian.
For Vietnamese, no method is significantly better than the baseline, while CD, crop, and rotate are significantly worse. Although the maximum value of SR surpasses the best baseline model, the box plots reveal that this is statistically unlikely. A similar pattern is observed for Hungarian, despite it being from a different language family.
Buryat, Telugu.
Since the training data set of Buryat is the smallest of all, even modest changes in parameters may lead to outliers. Interestingly, we find none of the techniques to be significantly better or worse than the baseline according to the median lines. However, the outliers provide a performance boost of around 8% over the best baseline. Although the plot is not shown, for brevity, we notice that all augmentation techniques except SR provide neither a significant drop nor an increase in the scores for Telugu, similar to Buryat.
Kazakh, Kurmanji.
For both languages, the median lines lying above the baseline box belong to the character-level methods and the syntactic method Nonce. For Kazakh, similar to Belarusian and Tamil, Nonce has the smallest box and min-max range. For Kurmanji, however, CA has the smallest box and min-max range instead of Nonce. Kazakh and Kurmanji, the languages with the lowest resources, benefit significantly more from character-level augmentation than the other languages. This may be due to a higher ratio of out-of-vocabulary words in these treebanks. In other words, associating a character sequence with a POS label is more important than associating tokens with POS labels, and character-level noise apparently helps to strengthen such links.
6.2 Frequency Analysis
Besides improving the downstream task scores, one of the important goals of an augmentation method is to increase the generalization capability of the model. To evaluate the extent of this capability, we measure the individual performances on frequent and infrequent tokens and word types. We perform a case study on dependency parsing because it stands between POS and SRL in terms of complexity. The results are given in Figure 6.
Frequent vs. Infrequent Tokens.
First, we define a token as frequent if it is among the top 10% of the token frequency list extracted from the combination of the training and development sets. Next, we create a vocabulary of frequent tokens and then measure the LAS scores separately for frequent and infrequent tokens for each language and augmentation technique. Finally, we average the scores over all languages to ease the presentation of the results. The results show that only SR, CSW, Rotate, and Nonce improve the scores of infrequent tokens more than those of frequent ones. SR and Nonce are likely to replace frequent tokens with infrequent ones, since their objective is to replace words with other words. However, CSW and Rotate are not designed to replace tokens. We believe character switching sometimes coincidentally results in rare tokens, and the Rotate operation may randomly choose subtrees with rare tokens to augment. Interestingly, there is no one-to-one correspondence between the techniques that improved the dependency scores the most and the techniques that improved the labeling of infrequent tokens the most. This may be due to building the vocabulary over tiny training sets and identifying the frequent/rare tokens accordingly. Therefore, we analyze a more general property: POS tags.
Frequent vs. Infrequent POS Tags.
We perform a similar analysis for the token class, using the gold POS tags as the class. Because the number of unique POS tags is much lower than the number of unique tokens, we identify the top 50% of POS tags as frequent, using the same calculation as above. The results show that all techniques improve the performance on rare token classes more than on frequent ones; however, there is still no direct correlation between the best performing technique and the improvement over infrequent POS tags. This suggests that the improvement cannot simply be explained by frequency analysis; that is, a method that focuses on improving rare tokens or token classes is not guaranteed to improve the overall score, because the parser's performance is a more complicated multivariate quantity that depends on many other factors apparent in the data set statistics.
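The frequency-split evaluation in both analyses can be sketched as follows, assuming a simple tuple representation of parsed tokens (the data layout and helper names are ours, not from any released code):

```python
from collections import Counter

def frequent_set(items, top_ratio):
    """Item types in the top `top_ratio` of the frequency-ranked list
    (0.10 for tokens, 0.50 for POS tags)."""
    ranked = [item for item, _ in Counter(items).most_common()]
    return set(ranked[: max(1, int(len(ranked) * top_ratio))])

def split_las(gold, pred, freq, key):
    """LAS computed separately for frequent and infrequent positions.
    gold/pred are aligned lists of sentences; each position is a
    (token, pos, head, label) tuple. `key` selects the grouping
    attribute, e.g., lambda x: x[0] for tokens, x[1] for POS tags."""
    correct, total = Counter(), Counter()
    for gsent, psent in zip(gold, pred):
        for g, p in zip(gsent, psent):
            bucket = "frequent" if key(g) in freq else "infrequent"
            total[bucket] += 1
            correct[bucket] += int(g[2:] == p[2:])  # head and label match
    return {b: correct[b] / total[b] for b in total}
```

The same `split_las` serves both analyses; only the grouping key and the top ratio change between tokens (10%) and gold POS tags (50%).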
7. Conclusion
Neural models have emerged as the standard model for a wide range of NLP tasks, including part-of-speech tagging, dependency parsing, and semantic role labeling—the task of assigning semantic role labels to predicate-argument pairs. These models were shown to provide state-of-the-art scores in the presence of large training data sets, although they still fall behind traditional statistical techniques in genuinely low-resource scenarios. One method that is commonly used to overcome the low data set size problem is enhancing the original data set synthetically, which is referred to as data augmentation. Recently, a variety of text augmentation techniques have been proposed such as replacing words with their synonyms or related words, injecting spelling errors, generating grammatically correct but meaningless sentences, simplifying sentences, and many more.
Despite the richness of augmentation literature, previous work mostly explores text classification and machine translation tasks on high-resource languages, and reports single scores. In this study, on the contrary, we provide a more detailed performance analysis of the existing augmentation techniques on a diverse set of languages and traditional sequence tagging tasks that are more sensitive to noise. First, we compile a rich set of augmentation methods of different categories that can be applied to our tasks and languages, namely, character level, token level, and syntactic level. Then we systematically compare them on the following downstream tasks: part-of-speech tagging, dependency parsing, and semantic role labeling. We conduct experiments on truly low-resource languages (when possible) such as Buryat, Kazakh, and Kurmanji; and simulate a low-resource setting on a diverse set of languages such as Catalan, Turkish, Finnish, and Czech when data are not available.
We find that easy-to-apply techniques such as injecting character-level noise or generating grammatical but meaningless sentences provide gains across languages most consistently. Compared with POS tagging, we observe more significant improvements over the baseline for dependency parsing. Although the augmentation patterns for dependency parsing are found to be similar to those for POS tagging, a few differences, such as larger gains from syntactic methods, are noted. Again, the largest improvements in dependency parsers are mostly obtained by injecting character-level noise. For SRL, we show that the improvement from augmentation decreases with more training samples, as does the number of augmentation techniques that increase the scores. We observe that languages with fusional morphology almost always benefit the most from character-level noise, but always suffer from the syntactic operations crop and rotate. On the contrary, agglutinative languages such as Turkish and Finnish benefit from cropping, while character-level augmentation may provide inconsistent improvements. We show that augmentation techniques can provide consistent and statistically significant improvements for all languages except Vietnamese and Telugu. Finally, we find that the improvements do not solely depend on the task architecture; that is, augmentation methods can further improve the strong biaffine parser as well as the weaker uuparser.
Acknowledgments
We first thank the anonymous reviewers who helped us improve the article. We would like to thank Clara Vania, Benjamin Heinzerling, Jonas Pfeiffer, and Phillip Rust for the valuable discussions and providing early access to their implementations. We finally thank Celal Şahin, Necla İşgüder, and Osman İşgüder for their invaluable support during the writing of this article.
Notes
The best performing model (Rosa and Mareček 2018) is a crosslingual model that trains a single model jointly on multiple treebanks, which was not applicable to our scenario.
To the best of our knowledge, there does not exist an open-source dependency-based SRL model built on top of large contextualized language models with proven competence on a diverse set of languages, and implementing such a model is beyond the scope of this article.
Catalog numbers are as follows: LDC2012T03 and LDC2012T04.
All POS models use distinct input types: character, BPE, and WordPiece; while uuparser (using a combination of char and word) and biaffine (WordPiece) parsers are likely to share more vocabulary.
References
8. Appendix
Appendix J: SRL Preprocessing
Turkish.
We use the Modifier, Subject, and Object dependency labels as LOIs and merge the predicate tokens linked with MWE (multiword expression) and Deriv (derivation) labels while performing augmentation. To comply with the requirements of the SRL model, we first merge the inflectional groups of the words that are split at derivational boundaries, and then use the character sequences of the full word in the SRL model.
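The merging step can be approximated with the following sketch, which assumes a simplified CoNLL-like token representation and glosses over treebank-specific details such as separators between merged forms:

```python
def merge_linked_tokens(sent, merge_labels=("MWE", "Deriv")):
    """Collapse each token attached to its head via `merge_labels` into
    that head, concatenating surface forms in sentence order and
    remapping the remaining head indices. `sent` is a list of dicts
    with 'form', 'head' (1-based, 0 = root), and 'deprel'."""
    # Each merged token points at the index it dissolves into.
    target = list(range(len(sent)))
    for i, tok in enumerate(sent):
        if tok["deprel"] in merge_labels and tok["head"] > 0:
            target[i] = tok["head"] - 1

    def resolve(i):
        # Follow merge chains (e.g., Deriv -> Deriv) to the survivor.
        while target[i] != i:
            i = target[i]
        return i

    # Group surface forms under their surviving token, in order.
    groups = {}
    for i, tok in enumerate(sent):
        groups.setdefault(resolve(i), []).append(tok["form"])

    keep = sorted(groups)
    new_index = {old: new + 1 for new, old in enumerate(keep)}
    merged = []
    for old in keep:
        tok = dict(sent[old])
        tok["form"] = "".join(groups[old])
        h = tok["head"]
        tok["head"] = 0 if h == 0 else new_index[resolve(h - 1)]
        merged.append(tok)
    return merged
```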
Finnish.
This data set uses the Universal Dependencies (UD) formalism, hence we use the same LOIs as in the original study (Şahin and Steedman 2018) for augmentation: nsubj, dobj, iobj, obj, obl, and nmod as LOIs, and case, fixed, flat, cop, and compound for multiword predicates. In addition, to standardize the input format for our SRL model, the custom semantic layer annotation used in the Finnish PropBank has been converted to the same CoNLL-09 format.
Spanish, Catalan.
Both data sets use the same dependency annotation scheme: suj for subject, cd for direct object, and ci for indirect object relations. Furthermore, we shorten organization names to abbreviations (e.g., Confederación_Francesa to CF), since such long sequences cause memory problems for the SRL model.
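The abbreviation step is straightforward; a minimal sketch (with a hypothetical helper name):

```python
def abbreviate(name):
    """Shorten an underscore-joined organization name to its initials,
    e.g., 'Confederación_Francesa' -> 'CF'."""
    return "".join(part[0].upper() for part in name.split("_") if part)
```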
Czech.
The Sb, Obj, and Atr dependency labels are used as LOIs, and predicate tokens with the Pred dependency are merged.
Appendix K: UAS Scores
Appendix L: Result Tables without Colors
We visualize the results presented in Section 5 without colors. The POS tagging, dependency parsing, and semantic role labeling results are given in Tables A3 and A4, respectively.