Abstract
NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
1 Introduction
Deep learning methods have achieved strong performance on a wide range of supervised learning tasks (Sutskever et al., 2014; Deng et al., 2013; Minaee et al., 2021). Traditionally, these results were attained through the use of large, well-labeled datasets. This makes them challenging to apply in settings where collecting a large amount of high-quality labeled data for training is expensive. Moreover, given the fast-changing nature of real-world applications, it is infeasible to relabel every example whenever new data comes in. This highlights a need for learning algorithms that can be trained with a limited amount of labeled data.
There has been a substantial amount of research towards learning with limited labeled data for various tasks in the NLP community. One common approach for mitigating the need for labeled data is data augmentation. Data augmentation (Feng et al., 2021) generates new data by modifying existing data points through transformations that are designed based on prior knowledge about the problem’s structure (Yang, 2015; Wei and Zou, 2019). This augmented data can be generated from labeled data, and then directly used in supervised learning (Wei and Zou, 2019), or in semi-supervised learning for unlabeled data through consistency regularization (Xie et al., 2020) (“consistency training”). While various approaches have been proposed to tackle learning with limited labeled data—including unsupervised pre-training (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020), multi-task learning (Glorot et al., 2011; Liu et al., 2017; Augenstein et al., 2018), semi-supervised learning (Zhu, 2005; Chapelle et al., 2009; Miyato et al., 2017; Xie et al., 2020), and few-shot learning (Deng et al., 2019)—in this work, we focus on and compare different data augmentation methods and their application to supervised and semi-supervised learning.
In this empirical survey, we comprehensively review and perform experiments on recent data augmentation techniques developed for various NLP tasks. Our contributions are three-fold: (1) we summarize and categorize recent methods in textual data augmentation; (2) we compare different data augmentation methods through experiments with limited labeled data in supervised and semi-supervised settings on 11 NLP tasks; and (3) we discuss current challenges and future directions of data augmentation, as well as learning with limited data in NLP more broadly.
Our experimental results allow us to conclude that:
Token-level augmentations and specifically word replacement and random swapping consistently improve performance the most for supervised learning while sentence-level augmentation and specifically roundtrip translation improve performance the most for semi-supervised learning.
Token-level augmentations improve performance more for simpler classification tasks and surprisingly hurt performance for harder tasks where the baseline performance is pretty low.
Token-level augmentations and sentence-level augmentations are generally more reliable compared to hidden-level augmentations in both supervised and semi-supervised settings.
The best augmentation method depends on the dataset and setting (either supervised or semi-supervised).
Related Surveys.
Several recent surveys also explore data augmentation techniques for NLP (Hedderich et al., 2020; Feng et al., 2021; Sahin, 2022). Hedderich et al. (2020) provide a broad overview of techniques for NLP in low resource scenarios and briefly cover data augmentation as one of several techniques. In contrast, we focus on data augmentation and provide a more comprehensive review on recent data augmentation methods in this work. Sahin (2022) focuses on syntax, token, and character level augmentations for part-of-speech tagging, dependency parsing, and semantic role labeling. While Feng et al. (2021) also survey task-specific data augmentation approaches for NLP, our work summarizes recent data augmentation methods in a more fine-grained categorization. We also focus on their application to learning from limited data by providing an empirical study of different augmentation methods in both supervised and semi-supervised settings, comparing different types of augmentation methods across 11 datasets covering various tasks. In doing so, we hope to shed light on data augmentation selections for future research.
2 Data Augmentation for NLP
Data augmentation increases both the amount (the number of data points) and the diversity (the variety of data) of a given dataset (Cubuk et al., 2019). Limited labeled data often leads to overfitting on the training set and data augmentation works to alleviate this issue by manipulating data either automatically or manually to create additional augmented data. Such techniques have been widely explored in the computer vision field, with methods like geometric/color space transformations (Simard et al., 2003; Krizhevsky et al., 2012; Taylor and Nitschke, 2018), mixup (Zhang et al., 2018), and random erasing (Zhong et al., 2020; DeVries and Taylor, 2017). Although the discrete nature of textual data and its complex syntactic and semantic structures make finding label-preserving transformation more difficult, there nevertheless exists a wide range of methods for augmenting text data that in practice preserve labels. In the following subsections, we describe four broad classes of data augmentation methods.
2.1 Token-Level Augmentation
Token-level augmentations manipulate words and phrases in a sentence to generate augmented text while ideally retaining the semantic meaning and labels of the original text.
Designed Replacement.
Intuitively, the semantic meaning of a sentence remains unchanged if some of its tokens are replaced with other tokens that have the same meaning. A simple approach is to fetch synonyms as words for substitutions (Kolomiyets et al., 2011; Yang, 2015; Zhang et al., 2015a; Wei and Zou, 2019; Miao et al., 2020). The synonyms are discovered based on pre-defined dictionaries such as WordNet (Kolomiyets et al., 2011), or similarities in word embedding space (Yang, 2015). However, improvements from this technique are usually minimal (Kolomiyets et al., 2011), and in some cases, performance may even degrade (Zhang et al., 2015a). A major drawback stems from the lack of contextual information when fetching synonyms—especially for words with multiple meanings and few synonyms. To resolve this, language models (LMs) have been used to replace the sampled words given their context (Kolomiyets et al., 2011; Fadaee et al., 2017; Kobayashi, 2018; Kumar et al., 2020). Other work preserves the labels of the text by conditioning on the label when generating the LMs’ predictions (Kobayashi, 2018; Wu et al., 2019a). In addition, different sampling strategies for word replacement have been explored. For example, instead of sampling one specific word from candidates by LMs, Gao et al. (2019) propose to compute a weighted average over embeddings of possible words predicted by LMs as the replaced input since the averaged representations could augment text with richer information.
Random Insertion, Replacement, Deletion, and Swapping.
While well-designed local modifications can preserve the syntax and semantic meaning of a sentence (Niu and Bansal, 2018), random local modifications such as deleting certain tokens (Iyyer et al., 2015; Wei and Zou, 2019; Miao et al., 2020), inserting random tokens (Wei and Zou, 2019; Miao et al., 2020), replacing non-important tokens with random tokens (Xie et al., 2017, 2020; Niu and Bansal, 2018), or randomly swapping tokens in one sentence (Artetxe et al., 2018; Lample et al., 2018; Wei and Zou, 2019; Miao et al., 2020) can preserve the meaning in practice. Different kinds of operations can be further combined (Wei and Zou, 2019), where each example is randomly augmented with one of insertion, deletion, and swapping. These noise-injection methods can efficiently be applied to training, and show improvements when they augment simple models trained on small training sets. However, the improvements might be unstable due to the possibility that random perturbations change the meanings of sentences (Niu and Bansal, 2018). Also, finetuning large pre-trained models on specific tasks might attenuate improvements due to preexisting generalization abilities of the model (Shleifer, 2019).
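The random operations described above can be illustrated with a minimal sketch. It is not the implementation of any cited work: the function names, the default rates, and the choice to draw inserted tokens from a caller-supplied `vocab` list are assumptions made for illustration.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p; keep at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap the tokens at two distinct random positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_insertion(tokens, n_inserts, vocab, rng):
    """Insert n_inserts tokens drawn from vocab (assumed non-empty) at random positions."""
    out = list(tokens)
    for _ in range(n_inserts):
        out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    return out


def augment(tokens, vocab, rng, p=0.1, n=1):
    """Augment one example with a single, randomly chosen operation."""
    op = rng.choice(["delete", "swap", "insert"])
    if op == "delete":
        return random_deletion(tokens, p, rng)
    if op == "swap":
        return random_swap(tokens, n, rng)
    return random_insertion(tokens, n, vocab, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed, which matters when results are compared across seeds.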
Compositional Augmentation.
To increase the compositional generalization abilities of models, recent efforts have also focused on compositional augmentations (Jia and Liang, 2016; Andreas, 2020) where different fragments from different sentences are re-combined to create augmented examples. Compared with random swapping, compositional augmentation often requires more carefully designed rules such as lexical overlap (Andreas, 2020), neural-symbolic stack machines (Chen et al., 2020e), and neural program synthesis (Nye et al., 2020). With the potential to greatly improve the generalization abilities to out-of-distribution data, compositional augmentation has been utilized in sequence labeling (Guo et al., 2020), semantic parsing (Andreas, 2020; Nye et al., 2020; Furrer et al., 2020), language modeling (Andreas, 2020; Shaw et al., 2021), and text generation (Feng et al., 2020).
2.2 Sentence-Level Augmentation
Instead of modifying tokens, sentence-level augmentation modifies the entire sentence at once.
Paraphrasing.
Paraphrasing has been widely adopted as a data augmentation technique in various NLP tasks (Yu et al., 2018; Xie et al., 2020; Kumar et al., 2019; He et al., 2020; Chen et al., 2020b,2020c; Cai et al., 2020), as it generally provides more diverse augmented text with different word choices and sentence structures while preserving the meaning of the original text. The most popular is round-trip translation (Sennrich et al., 2015; Edunov et al., 2018), a pipeline which first translates sentences into certain intermediate languages and then translates them back to generate paraphrases. Translating through intermediate languages with different vocabulary and linguistic structures can generate useful paraphrases. To ensure the diversity of augmented data, sampling and noisy beam search can also be adopted during the decoding stage (Edunov et al., 2018). Other work focuses on directly training end-to-end models to generate paraphrases (Prakash et al., 2016), and further augments the decoding phase with syntactic information (Iyyer et al., 2018; Chen et al., 2019), latent variables (Gupta et al., 2017), and sub-modular objectives (Kumar et al., 2019).
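The round-trip translation pipeline can be sketched as follows. The translators are passed in as plain callables because no particular translation system is assumed here; `word_translator` is a hypothetical toy stand-in used only to make the example runnable, and all names are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def round_trip(sentence: str, forward: Translator, backward: Translator) -> str:
    """Translate into an intermediate language and back again."""
    return backward(forward(sentence))


def paraphrase_augment(
    sentences: List[str], pivots: List[Tuple[Translator, Translator]]
) -> List[Tuple[str, str]]:
    """Return (original, paraphrase) pairs, one round trip per pivot language.

    Round trips that reproduce the input or an earlier paraphrase are skipped.
    """
    pairs = []
    for sentence in sentences:
        seen = {sentence}
        for forward, backward in pivots:
            candidate = round_trip(sentence, forward, backward)
            if candidate not in seen:
                seen.add(candidate)
                pairs.append((sentence, candidate))
    return pairs


def word_translator(table: Dict[str, str]) -> Translator:
    """Toy word-by-word 'translator' (hypothetical; stands in for a real MT model)."""
    return lambda text: " ".join(table.get(tok, tok) for tok in text.split())


# Toy usage: the backward table maps to a different English word.
en_to_de = word_translator({"movie": "Film", "great": "toll"})
de_to_en = word_translator({"Film": "film", "toll": "great"})
pairs = paraphrase_augment(["a great movie"], [(en_to_de, de_to_en)])
```

With a real translation model, sampling or noisy beam search in the decoding step would take the place of the deterministic toy lookup.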
Conditional Generation.
Conditional generation methods generate additional text from a language model, conditioned on the label. After training the model to generate the original text given the label, the model can generate new text (Anaby-Tavor et al., 2020; Zhang and Bansal, 2019; Kumar et al., 2020; Yang et al., 2020). An extra filtering process is often used to ensure high-quality augmented data. For example, in text classification, Anaby-Tavor et al. (2020) first fine-tune GPT-2 (Radford et al., 2019) with the original examples prepended with their labels, and then generate augmented examples by feeding the fine-tuned model certain labels. Only confident examples as judged by a baseline classifier trained on the original data are kept. Similarly, new answers are generated on the basis of given questions in question answering and are filtered by customized metrics like question answering probability (Zhang and Bansal, 2019) and n-gram diversity (Yang et al., 2020). Given the recent success of pre-trained language models, Ye et al. (2022) and Wang et al. (2021) generate data from zero-shot models. Generative models used in this setting have been based on conditional VAE (Bowman et al., 2016; Hu et al., 2017; Guu et al., 2017; Malandrakis et al., 2019), GAN (Iyyer et al., 2018; Xu et al., 2018), or pre-trained language models like GPT-2 (Anaby-Tavor et al., 2020; Kumar et al., 2020). Overall, these conditional generation methods can create novel and diverse data that might be unseen in the original dataset, but require significant training effort.
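The filtering step can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of any cited work: `classify` is a hypothetical callable returning a label-to-probability mapping, the `min_confidence` value is arbitrary, and the separator in `format_training_example` is a placeholder.

```python
def format_training_example(label, text, sep=" : "):
    """Prepend the label to the text (the separator is an assumed placeholder)."""
    return f"{label}{sep}{text}"


def filter_generated(candidates, classify, min_confidence=0.9):
    """Keep generated (text, label) pairs the baseline classifier confidently agrees with.

    classify(text) is assumed to return a dict mapping each label to a probability.
    """
    kept = []
    for text, label in candidates:
        probs = classify(text)
        predicted = max(probs, key=probs.get)
        if predicted == label and probs[predicted] >= min_confidence:
            kept.append((text, label))
    return kept


def toy_classify(text):
    """Toy stand-in for a baseline classifier trained on the original data."""
    if "good" in text:
        return {"pos": 0.95, "neg": 0.05}
    return {"pos": 0.40, "neg": 0.60}


generated = [("a good film", "pos"), ("a dull film", "pos"), ("a dull film", "neg")]
kept = filter_generated(generated, toy_classify)
```

Here the second candidate is dropped because the classifier disagrees with the conditioning label, and the third because the classifier agrees but is not confident enough.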
2.3 Adversarial Data Augmentation
Adversarial methods create augmented examples by adding adversarial perturbations to the original data, which dramatically influence the model’s predictions and confidence without changing human judgments. These adversarial examples (Morris et al., 2020; Zeng et al., 2020) could be leveraged in adversarial training (Goodfellow et al., 2015) to increase neural models’ robustness, and can also be utilized as data augmentation to increase the models’ generalization ability (Miyato et al., 2017; Cheng et al., 2019).
White-Box Methods.
These rely on model architecture and parameters being accessible and create adversarial examples directly using a model’s gradients. Unlike image pixel values that are continuous, textual tokens are discrete and cannot be directly modified based on gradients. To this end, adversarial perturbations are added directly to token embeddings or sentence hidden representations (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2019; Chen et al., 2020d) which creates “virtual adversarial examples”. Other approaches vectorize modification operations as the difference of one-hot vectors (Ebrahimi et al., 2018a, b), or find real word neighbors in a model’s hidden representations via its gradients (Cheng et al., 2019).
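A gradient-based perturbation in embedding space can be sketched for a toy model. The linear classifier with a logistic loss and the L2-normalized step of size `epsilon` are assumptions chosen to keep the example self-contained; real systems would obtain the gradient from the network by backpropagation.

```python
import math


def loss_and_grad(embedding, weights, label):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. the embedding."""
    z = sum(w * e for w, e in zip(weights, embedding))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    grad = [(p - label) * w for w in weights]
    return loss, grad


def adversarial_embedding(embedding, weights, label, epsilon):
    """Move the embedding a distance epsilon along the loss gradient (assumed L2 step)."""
    _, grad = loss_and_grad(embedding, weights, label)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]


embedding = [0.5, -0.2, 0.1]
weights = [1.0, -2.0, 0.5]
clean_loss, _ = loss_and_grad(embedding, weights, label=1)
perturbed = adversarial_embedding(embedding, weights, label=1, epsilon=0.1)
adv_loss, _ = loss_and_grad(perturbed, weights, label=1)
```

The perturbed vector stays within distance `epsilon` of the original embedding but yields a higher loss, which is what makes it a "virtual" adversarial example: it need not correspond to any real token.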
Black-Box Methods.
These are usually model-agnostic since they do not require information from a model or its parameters and usually focus on task-specific heuristics for creating adversarial examples. For example, by enumerating feasible substitutions on the basis of word similarity and language models, Ren et al. (2019) and Garg and Ramakrishnan (2020) select adversarial word replacements which severely influence the predictions from the text classification model. To attack reading comprehension systems, Jia and Liang (2017) and Wang and Bansal (2018) insert distracting but meaningless sentences at different locations in paragraphs and Ribeiro et al. (2018) leverage rule-based paraphrasing to produce semantically equivalent adversarial examples. Likewise, for multi-hop question answering, Jiang and Bansal (2019) insert shortcut reasoning sentences and Trivedi et al. (2020) construct disconnected reasoning examples by removing certain supporting facts. For NLI, Mitra et al. (2020) use VerbNet and other Semantic Role Labeling resources to generate pairs of sentences that contain the same set of words but have different meanings. For machine translation, Belinkov and Bisk (2017) attack character-based models with natural or synthesized typos and Tan et al. (2020) further adopt subword morphology level attacks. Similar attacks also help dialogue generation (Niu and Bansal, 2019) and text summarization (Cheng et al., 2020a; Fan et al., 2018). Other methods do not rely on editing input text directly; Iyyer et al. (2018) leverage round-trip translation to generate paraphrases in given syntactic templates and Zhao et al. (2017) search for adversarial examples in underlying semantic space with GANs (Goodfellow et al., 2014). Some of these heuristics could be further refined to obtain simple adversarial data augmentation approaches. For example, McCoy et al. (2019) craft adversarial examples for natural language inference using sophisticated templates which create lexical overlap between the premise and the hypothesis to fool the model. Min et al. (2020) propose two simple yet effective adversarial transformations that reverse the position of subject and object or the position of premise and hypothesis.
2.4 Hidden-Space Augmentation
This line of work generates augmented data by manipulating the hidden representations through perturbations such as adding noise or performing interpolations with other data points. Hidden-space perturbations augment existing data by adding perturbations to the hidden representations of tokens (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2019; Chen et al., 2020d; Shen et al., 2020; Chen et al., 2021) or sentences (Hsu et al., 2017, 2018; Wu et al., 2019b; Malandrakis et al., 2019).
Interpolation-Based Methods.
Interpolation-based methods create new examples and labels by linear combinations of existing data-label pairs. Given two data-label pairs, virtual data-label pairs are created through linear interpolations of the pair of data points. Such interpolation-based methods can generate infinite augmented data in the “virtual vicinity” of the original data space, thus improving the generalization performance of models. Interpolation-based methods were first explored in computer vision (Zhang et al., 2018), and have more recently been generalized to the text domain (Miao et al., 2020; Chen et al., 2020c; Cheng et al., 2020b; Chen et al., 2020a) by performing interpolation between original data and token-level augmented data in the output space (Miao et al., 2020), between original data and adversarial data in embedding space (Cheng et al., 2020b), or between different training examples in general hidden space (Chen et al., 2020c). Different strategies to select samples to mix have also been explored (Chen et al., 2020a; Guo et al., 2020; Zhang et al., 2020a) such as k-nearest-neighbors (Chen et al., 2020a) or sentence composition (Guo et al., 2020).
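The linear interpolation of data-label pairs can be written down directly. In the sketch below, inputs are plain lists of floats standing in for embeddings or hidden states, labels are one-hot or soft label vectors, and drawing the mixing weight from a symmetric Beta distribution is an assumption about the sampling scheme; the function names are illustrative.

```python
import random


def mixup(x_i, y_i, x_j, y_j, lam):
    """Create a virtual pair as lam * (x_i, y_i) + (1 - lam) * (x_j, y_j)."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def sample_lambda(alpha, rng):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha) (assumed scheme)."""
    return rng.betavariate(alpha, alpha)


rng = random.Random(0)
lam = sample_lambda(0.4, rng)
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam)
```

Because the weight is continuous, arbitrarily many virtual pairs can be drawn between any two training examples, and the mixed label remains a valid probability vector.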
We summarize the preceding overview of recent widely used data augmentation methods in Table 1, characterizing them with respect to augmentation levels, the diversity of generated data, and their applicable tasks.
Methods | Level | Diversity | Tasks | Related Work |
---|---|---|---|---|
Synonym replacement | Token | Low | Text classification, Sequence labeling | Kolomiyets et al. (2011), Zhang et al. (2015a), Yang (2015), Miao et al. (2020), Wei and Zou (2019) |
Word replacement via LM | Token | Medium | Text classification, Sequence labeling, Machine translation | Kolomiyets et al. (2011), Gao et al. (2019), Kobayashi (2018), Wu et al. (2019a), Fadaee et al. (2017) |
Random insertion, deletion, swapping | Token | Low | Text classification, Sequence labeling, Machine translation, Dialogue generation | Iyyer et al. (2015), Xie et al. (2017), Artetxe et al. (2018), Lample et al. (2018), Xie et al. (2020), Wei and Zou (2019) |
Compositional augmentation | Token | High | Semantic parsing, Sequence labeling, Language modeling, Text generation | Jia and Liang (2016), Andreas (2020), Nye et al. (2020), Feng et al. (2020), Furrer et al. (2020), Guo et al. (2020) |
Paraphrasing | Sentence | High | Text classification, Machine translation, Question answering, Dialogue generation, Text summarization | Yu et al. (2018), Xie et al. (2020), Chen et al. (2019), He et al. (2020), Chen et al. (2020c), Cai et al. (2020) |
Conditional generation | Sentence | High | Text classification, Question answering | Anaby-Tavor et al. (2020), Kumar et al. (2020), Zhang and Bansal (2019), Yang et al. (2020) |
White-box attack | Token or Sentence | Medium | Text classification, Sequence labeling, Machine translation | Miyato et al. (2017), Ebrahimi et al. (2018b), Ebrahimi et al. (2018a), Cheng et al. (2019), Chen et al. (2020d) |
Black-box attack | Token or Sentence | Medium | Text classification, Sequence labeling, Machine translation, Textual entailment, Dialogue generation, Text summarization | Jia and Liang (2017), Belinkov and Bisk (2017), Zhao et al. (2017), Ribeiro et al. (2018), McCoy et al. (2019), Min et al. (2020), Tan et al. (2020) |
Hidden-space perturbation | Token or Sentence | High | Text classification, Sequence labeling, Speech recognition | Hsu et al. (2017), Hsu et al. (2018), Wu et al. (2019b), Chen et al. (2021), Malandrakis et al. (2019), Shen et al. (2020) |
Interpolation | Token | High | Text classification, Sequence labeling, Machine translation | Miao et al. (2020), Chen et al. (2020c), Cheng et al. (2020b), Chen et al. (2020a), Guo et al. (2020) |
3 Consistency Training with DA
While data augmentation (DA) can be applied in the supervised setting to produce better results when only a small labeled training dataset is available, data augmentation is also commonly used in semi-supervised learning (SSL). SSL is an alternative approach for learning from limited data that provides a framework for taking advantage of unlabeled data. Specifically, SSL assumes that our training set comprises labeled examples in addition to unlabeled examples drawn from the same distribution. Currently, one of the most common methods for performing SSL with deep neural networks is “consistency regularization” (Bachman et al., 2014; Tarvainen and Valpola, 2017). Consistency regularization-based SSL (or “consistency training” for short) regularizes a model by enforcing that its output doesn’t change significantly when the input is perturbed. In practice, the input is perturbed by applying data augmentation, and consistency is enforced through a loss term that measures the difference between the model’s predictions on a clean input and a corresponding perturbed version of the same input.
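The consistency loss term can be sketched as a divergence between the model's predictions on a clean input and on its augmented version. The use of KL divergence, and the names `model` and `augment` (arbitrary callables returning logits and a perturbed input, respectively), are assumptions for illustration; the text above only requires some measure of the difference between the two predictions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(model, x, augment):
    """Difference between predictions on a clean input and its perturbed version.

    No label is needed, so the loss can be computed on unlabeled data. In
    gradient-based training the clean prediction is often treated as a fixed target.
    """
    p_clean = softmax(model(x))
    p_aug = softmax(model(augment(x)))
    return kl_divergence(p_clean, p_aug)
```

The total training objective would then add this term, computed on unlabeled examples, to the usual supervised loss on the labeled examples.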
Xie et al. (2020) showed that consistency training can be effectively applied to semi-supervised learning for NLP. To achieve stronger results, they introduce several other tricks including confidence thresholding, training signal annealing, and entropy minimization. Confidence thresholding applies the unsupervised loss only when the model assigns a class probability above a pre-defined threshold. Training signal annealing prevents the model from overfitting on easy examples by applying the supervised loss only when the model is less confident about predictions. Entropy minimization trains the model to output low-entropy (highly-confident) predictions when fed unlabeled data. We refer the reader to Xie et al. (2020) for more details on these tricks.
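The three tricks reduce to simple per-example quantities, sketched below. The threshold values are hypothetical and no threshold schedule is shown; this follows the descriptions in the paragraph above rather than the exact formulation of Xie et al. (2020).

```python
import math


def confidence_mask(p_clean, threshold):
    """1.0 if the unsupervised loss should be applied to this unlabeled example."""
    return 1.0 if max(p_clean) > threshold else 0.0


def tsa_mask(p_correct, threshold):
    """1.0 if the supervised loss should be applied to this labeled example.

    p_correct is the probability the model assigns to the gold label; examples
    the model is already confident about are masked out.
    """
    return 1.0 if p_correct < threshold else 0.0


def entropy(p):
    """Entropy of a predicted distribution; minimizing it favors confident outputs."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
```

In a training loop, each mask multiplies the corresponding per-example loss, and the entropy term is added to the objective for unlabeled examples.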
4 Empirical Experiments
4.1 Datasets and Experiment Setup
To provide a quantitative comparison of the DA methods we have surveyed, we experiment with 10 of the most commonly used and model-agnostic augmentation techniques from different levels in Table 1, including: (i) Token-level augmentation: Synonym Replacement (SR) (Kolomiyets et al., 2011; Yang, 2015), Word Replacement based on Language Model (LM) (Kumar et al., 2020), Random Insertion (RI) (Wei and Zou, 2019; Miao et al., 2020), Random Deletion (RD) (Wei and Zou, 2019), Random Swapping (RS) (Wei and Zou, 2019), and Word Replacement (WR) based on TF-IDF in Vocabulary Set (Xie et al., 2020); (ii) Sentence-level augmentation: Roundtrip Translation (RT) (Xie et al., 2020; Chen et al., 2020c), Generation from Few-shot Models (GF) (Ye et al., 2022; Wang et al., 2021); and (iii) Hidden-space augmentation: Adversarial training (ADV) (Goodfellow et al., 2015), Cutoff (Shen et al., 2020), and Mixup in the embedding space (Zhang et al., 2018). Most aforementioned techniques are not label-dependent (except mixup and Generation from Few-shot), thus can be applied directly to unlabeled data. For generation from few-shot models, we use in-context learning with a few examples and a given label to generate the input, following Wang et al. (2021). Since we use BERT-base as our main model, we use a similar size model for in-context learning and choose GPT2.
We test them on different types of benchmark datasets including: (i) news classification tasks including AG News (Zhang et al., 2015b) and 20 Newsgroup (Joachims, 1997); (ii) topic classification tasks including Yahoo Answers (Chang et al., 2008) and PubMed news classification (Zhang et al., 2015b); (iii) inference tasks including MNLI, QNLI, and RTE (Wang et al., 2018); (iv) similarity and paraphrase tasks including QQP and MRPC (Wang et al., 2018); and (v) single-sentence tasks including SST-2 and CoLA (Wang et al., 2018).
For all datasets, we experiment with 10 labeled data points per class in a supervised setup, and an additional 5000 unlabeled data points per class in the semi-supervised setup. We use BERT-base (Devlin et al., 2019) as the base language model and use the same hyperparameters across all datasets/methods. We utilize accuracy as the evaluation metric for all datasets except for CoLA (which uses Matthews correlation) and PubMed (which uses accuracy and Macro-F1 score). Because the performance can be heavily dependent on the specific datapoints chosen (Sohn et al., 2020), for each dataset, we sample labeled data from the original dataset with 3 different seeds to form different training sets, and report the average result. For every setup, we fine-tune the model with the same seed as the dataset seed (in contrast to many works which report the max across different seeds).
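The sampling protocol can be sketched as follows. The seed values `(0, 1, 2)`, the function names, and the `train_and_eval` callable (which stands in for fine-tuning and evaluating a model) are assumptions for illustration; only the 10-per-class subset size, the use of 3 seeds, the shared dataset/fine-tuning seed, and the averaging come from the setup described above.

```python
import random
from collections import defaultdict
from statistics import mean


def sample_few_shot(examples, n_per_class, seed):
    """Draw n_per_class labeled examples for every class, reproducibly for a seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], n_per_class))
    return subset


def average_over_seeds(examples, train_and_eval, seeds=(0, 1, 2), n_per_class=10):
    """Train on one sampled subset per seed and report the mean score.

    The same seed selects the subset and is passed on to training, so the
    reported number is an average rather than a maximum across seeds.
    """
    return mean(
        train_and_eval(sample_few_shot(examples, n_per_class, seed), seed)
        for seed in seeds
    )
```

Reporting the mean over differently sampled subsets guards against conclusions that hold only for one fortunate choice of labeled examples.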
We train our models on NVIDIA 2080ti and NVIDIA V-100 GPUs. Supervised experiments take 20 minutes, and semi-supervised experiments take two hours. The BERT-base model has 100M parameters. We use the same hyperparameters across all datasets, and so only use the validation set to find the best model checkpoint. We use a learning rate of 2e−5, batch size of 16, ratio of unlabeled to labeled data of 3, and dropout ratio of 0.1 for different augmentation methods.
4.2 Results with 10 Labeled Examples Per Class
News/Topic Classification Tasks.
The results are shown in Table 2. We observe that in supervised settings, token-level augmentations work the best. Specifically, word replacement works well, getting the highest or second highest score every time. On the other hand, Generating from Few-shot models performs very poorly. This is not surprising, since GPT2, the model used for in-context learning, is not particularly strong at in-context learning, and the inputs required for news/topic classification are much more complicated. In the semi-supervised settings, sentence-level augmentations (round-trip translation) work the best, getting the highest or second highest score every time. This makes sense since for many classification tasks, multiple words indicate the label, and so dropping several words will not affect the label.
Setting | Method | Type | News: AG News | News: 20 Newsgroup | Topic: Yahoo Answers | Topic: PubMed (Acc./Macro-F1) |
---|---|---|---|---|---|---|
Supervised | None | – | 78.8(8.9) | 65.2(4.8) | 56.6(9.4) | 63.7(6.1)/49.3(3.9) |
Supervised | SR | Token | 79.4(5.9) | 66.1(2.5) | 56.0(10.1) | 62.4(5.7)/48.3(3.9) |
Supervised | LM | Token | 76.8(5.1) | 60.0(14.4) | 56.2(8.4) | 60.9(3.0)/47.4(2.5) |
Supervised | RI | Token | 79.5(4.9) | 66.6(0.6) | 57.3(12.0) | 63.7(4.2)/49.4(2.1) |
Supervised | RD | Token | 79.6(5.0) | 66.8(3.0) | 58.0(8.3) | 63.4(5.0)/49.3(1.5) |
Supervised | RS | Token | 79.5(5.3) | 64.8(10.8) | 57.1(10.3) | 63.8(7.4)/49.5(3.3) |
Supervised | WR | Token | 79.7(2.0) | 67.5(4.2) | 59.3(8.9) | 64.9(4.9)/49.4(2.5) |
Supervised | RT | Sentence | 80.1(4.3) | 65.1(7.9) | 57.1(9.6) | 60.2(5.1)/46.3(6.4) |
Supervised | GF | Sentence | 25.3(0.7) | 5.2(0.1) | 27.4(3.5) | 33.0(0.0)/9.9(0.0) |
Supervised | ADV | Hidden | 78.2(5.3) | 65.5(1.6) | 53.8(4.89) | 37.4(2.6)/19.9(10.6) |
Supervised | Cutoff | Hidden | 79.3(5.0) | 66.6(1.4) | 57.3(9.3) | 60.5(8.3)/46.6(9.4) |
Supervised | Mixup | Hidden | 80.0(6.52) | 65.9(3.1) | 57.8(4.19) | 51.4(19.3)/39.8(3.2) |
Semi-supervised | SR | Token | 69.6(29.3) | 65.7(1.8) | 51.4(9.4) | 59.3(5.9)/43.1(11.9) |
Semi-supervised | LM | Token | 68.5(13.7) | 68.3(2.1) | 53.2(6.3) | 61.5(6.6)/46.4(4.4) |
Semi-supervised | RI | Token | 65.8(5.5) | 66.7(1.1) | 50.5(3.2) | 61.4(11.3)/44.4(17.4) |
Semi-supervised | RD | Token | 73.2(14.0) | 66.1(3.3) | 51.5(7.5) | 59.3(7.1)/46.0(3.8) |
Semi-supervised | RS | Token | 71.6(16.6) | 65.0(2.0) | 51.1(7.1) | 64.2(12.1)/46.7(11.5) |
Semi-supervised | WR | Token | 74.1(12.3) | 69.3(2.5) | 55.6(5.9) | 60.4(7.5)/43.7(14.2) |
Semi-supervised | RT | Sentence | 82.1(8.2) | 68.8(2.4) | 59.8(3.9) | 64.3(1.2)/49.8(1.9) |
Semi-supervised | ADV | Hidden | 82.3(2.33) | 66.8(5.9) | 55.9(3.89) | 62.2(10.8)/46.2(9.8) |
Semi-supervised | Cutoff | Hidden | 79.9(5.5) | 67.9(0.8) | 60.1(1.0) | 62.7(9.0)/48.1(3.2) |
Inference Tasks.
As shown in Table 3, token-level augmentations (e.g., random insertion, random deletion, and word replacement) work best overall in both the supervised and semi-supervised settings. This is somewhat surprising, since inference tasks usually depend heavily on a few key words, and changing those words can easily change the label.
| Setting | Methods | Types | Inference: MNLI | Inference: QNLI | Inference: RTE | Paraphrase: QQP | Paraphrase: MRPC | Single Sentence: SST-2 | Single Sentence: CoLA |
|---|---|---|---|---|---|---|---|---|---|
| Supervised | None | – | 35.2(0.7) | 51.8(7.0) | 49.8(3.1) | 63.9(9.1) | 61.8(21.2) | 60.5(13.1) | 12.9(6.32) |
| Supervised | SR | Token | 35.1(2.3) | 51.4(7.2) | 51.5(3.4) | 61.3(9.7) | 59.7(26.3) | 62.1(17.4) | 7.2(11.6) |
| Supervised | LM | Token | 35.3(0.8) | 51.0(8.0) | 49.0(1.4) | 62.4(11) | 61.0(24.3) | 62.8(9.8) | 6.8(15.8) |
| Supervised | RI | Token | 34.9(2.6) | 51.5(8.4) | 51.5(1.4) | 60.6(10.9) | 60.6(25.0) | 63.3(12.2) | 7.8(7.42) |
| Supervised | RD | Token | 35.5(2.1) | 51.1(8.4) | 50.9(2.4) | 62.4(11.3) | 61.2(22.0) | 59.7(18.4) | 7.1(16.6) |
| Supervised | RS | Token | 35.1(1.1) | 51.5(7.0) | 50.9(5.0) | 62.6(6.7) | 63.2(22.5) | 61.2(10.8) | 5.2(17.0) |
| Supervised | WR | Token | 34.5(2.6) | 52.0(3.8) | 50.0(0.9) | 60.6(10.2) | 61.0(25.3) | 61.8(12.5) | 7.0(10.6) |
| Supervised | RT | Sentence | 35.3(0.5) | 51.1(9.6) | 50.8(4.4) | 60.5(17.8) | 61.8(23.7) | 62.0(1.99) | 8.37(8.35) |
| Supervised | GF | Sentence | 33.8(1.7) | 50.5(2.3) | 47.6(3.2) | 59.4(6.6) | 56.6(1.7) | 53.5(4.1) | 3.89(7.2) |
| Supervised | ADV | Hidden | 33.3(4.7) | 49.7(1.8) | 48.3(12.1) | 57.5(24.7) | 61.5(21.5) | 53.3(13.07) | 1.37(4.66) |
| Supervised | Cutoff | Hidden | 35.1(2.3) | 51.4(8.3) | 52.2(3.6) | 62.6(8.8) | 61.0(21.2) | 63.5(8.45) | 12.4(9.58) |
| Supervised | Mixup | Hidden | 32.6(3.5) | 49.9(1.4) | 49.8(9.2) | 63.0(0.3) | 62.1(19.8) | 62.3(12.3) | 4.03(8.68) |
| Semi-Supervised | SR | Token | 35.6(1.0) | 52.1(4.5) | 52.9(5.4) | 53.5(10.7) | 68.1(4.0) | 61.8(37.9) | 6.65(5.69) |
| Semi-Supervised | LM | Token | 35.0(3.3) | 52.5(4.2) | 50.2(6.5) | 47.9(34.1) | 68.4(3.8) | 57.3(14.2) | 6.38(6.3) |
| Semi-Supervised | RI | Token | 35.8(1.7) | 52.1(4.1) | 50.7(1.4) | 59.6(5.1) | 64.9(8.9) | 58.3(14.8) | 6.55(0.91) |
| Semi-Supervised | RD | Token | 35.2(0.5) | 52.1(5.2) | 52.6(4.9) | 56.1(16.0) | 62.4(30.6) | 55.7(16.4) | 4.33(10.9) |
| Semi-Supervised | RS | Token | 34.6(2.5) | 52.1(6.2) | 51.5(3.7) | 49.8(7.9) | 63.2(22.5) | 55.2(15.3) | 7.77(11.77) |
| Semi-Supervised | WR | Token | 34.8(2.5) | 52.1(4.1) | 50.9(1.8) | 51.8(16.0) | 63.1(23.5) | 54.8(13.8) | 5.43(17.8) |
| Semi-Supervised | RT | Sentence | 35.3(2.7) | 52.7(4.8) | 51.6(4.1) | 63.9(7.5) | 62.2(12.5) | 61.9(20.8) | 11.6(14.5) |
| Semi-Supervised | ADV | Hidden | 36.2(8.9) | 50.6(1.9) | 50.9(6.8) | 59.1(14.7) | 63.9(9.1) | 53.1(5.0) | 7.64(25.1) |
| Semi-Supervised | Cutoff | Hidden | 35.3(2.8) | 52.5(4.3) | 51.7(6.5) | 62.9(9.9) | 68.6(4.4) | 54.3(9.8) | 4.11(11.8) |
Similarity and Paraphrase Tasks.
From Table 3, we observe that in the supervised setting token-level augmentations (random swapping) achieve the best performance, while hidden-space augmentations work well in the semi-supervised setting, with cutoff performing best on average. This makes sense because, for paraphrase tasks, augmenting the text typically amounts to paraphrasing it, which can easily change whether two texts are paraphrases of each other.
Single Sentence Tasks.
Based on the single-sentence task results in Table 3, hidden-space augmentations (cutoff) provide the biggest boost in performance in the supervised setting, while in the semi-supervised setting, sentence-level augmentations (roundtrip translation) work best. We note that most augmentation methods hurt performance on CoLA, a task for judging grammatical acceptability. This could be because most augmentation methods try to preserve meaning rather than grammatical correctness.
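To make the hidden-space idea concrete, the following is a minimal sketch of a span-style cutoff, under the assumption that the augmentation zeroes out one contiguous span of token embeddings before they are passed to the encoder. The function name `span_cutoff`, the `ratio` parameter, and the toy embeddings are illustrative choices and are not taken from the experiments reported here.

```python
import random


def span_cutoff(embeddings, ratio, rng):
    """Return a copy of `embeddings` with one contiguous span of rows set to zero.

    `embeddings` is a list of per-token vectors (lists of floats); the span
    covers roughly `ratio` of the sequence and always at least one token.
    """
    n = len(embeddings)
    span = min(max(1, int(n * ratio)), n)
    # Pick the first position of the span uniformly among valid starts.
    start = rng.randrange(0, n - span + 1)
    return [
        [0.0] * len(vec) if start <= i < start + span else list(vec)
        for i, vec in enumerate(embeddings)
    ]


# Toy input: 5 tokens with 3-dimensional embeddings (hypothetical values).
toy_embeddings = [[float(i + 1)] * 3 for i in range(5)]
augmented = span_cutoff(toy_embeddings, ratio=0.4, rng=random.Random(0))
for row in augmented:
    print(row)
```

Because the perturbation acts on representations rather than on surface tokens, it never produces an ungrammatical string, although nothing in this sketch guarantees that the label is preserved.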
Token Level Methods.
Overall, we see that word replacement works best for news and topic classification tasks, and random swapping performs best among token-level methods for inference, paraphrase, and single-sentence tasks.
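As an illustration of what two of these token-level operations do, here is a minimal sketch of random deletion and random swapping on a whitespace-tokenized sentence. The function names, the probability `p`, the number of swaps, and the example sentence are assumptions made for the sketch; they are not the settings used in the experiments above.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # If every token was dropped, fall back to a single random token.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap the tokens at two distinct random positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)
sentence = "the market rallied after the earnings report".split()
print(random_deletion(sentence, p=0.1, rng=rng))
print(random_swap(sentence, n_swaps=1, rng=rng))
```

Both operations leave the vocabulary of the sentence unchanged, but neither checks whether the label still holds, which is one way such edits can hurt on tasks that hinge on a few words.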
4.3 Summary of Findings
Based on the results with 10 labeled examples, we conclude that:
Overall, the best augmentation method depends on the dataset and whether we have unlabeled data or not.
In general, word replacement and random swapping work the best for supervised learning while roundtrip translation works the best for semi-supervised learning.
Token-level augmentation methods work better for easier classification tasks and surprisingly hurt performance for harder tasks when the baseline performance is low.
Token-level and sentence-level augmentations are more robust than hidden-level augmentations in both the supervised and semi-supervised settings.
5 Other Limited Data Learning Methods
This work mainly focuses on data augmentation and semi-supervised learning (consistency regularization) in NLP; however, there are other orthogonal directions for tackling the problem of learning with limited data. For completeness, we summarize this related work below.
Low-Resourced Languages.
Most languages lack large monolingual or parallel corpora, or sufficient manually crafted linguistic resources for building statistical NLP applications (Garrette and Baldridge, 2013). Researchers have therefore developed a variety of methods for improving performance on low-resource languages, including cross-lingual transfer learning, which transfers models from resource-rich to resource-poor languages (Do and Gaspers, 2019; Lee and Lee, 2019; Schuster et al., 2019); few/zero-shot learning (Johnson et al., 2017; Blissett and Ji, 2019; Pham et al., 2019; Abad et al., 2020), which uses only a few examples from the low-resource domain to adapt models trained in another domain; and polyglot learning (Cotterell and Heigold, 2017; Tsvetkov et al., 2016; Mulcaire et al., 2019; Lample and Conneau, 2019), which combines resource-rich and resource-poor learning using a universal language representation.
Other Methods for Semi-Supervised Learning.
Semi-supervised learning methods further reduce the dependency on labeled data and enhance the models when there is only limited labeled data available. These methods use large amounts of unlabeled data in the training process, as unlabeled data is usually cheap and easy to obtain compared to labeled data. In this paper, we focus on consistency regularization, while there are also other widely used methods for NLP including self-training (Yarowsky, 1995; Zhang and Zong, 2016; He et al., 2020; Lin et al., 2020), generative methods (Xu et al., 2017; Yang et al., 2017; Kingma et al., 2014; Cheng et al., 2016), and co-training (Blum and Mitchell, 1998; Clark et al., 2018; Cai and Lapata, 2019).
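The consistency regularization objective referred to above can be sketched as follows, assuming the common formulation in which the model's predicted distribution on an unlabeled example is compared, via KL divergence, with its prediction on an augmented version of the same example. The function names, the `weight` hyperparameter, and the logits are hypothetical; a real implementation would compute this inside an automatic-differentiation framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_original, logits_augmented):
    """Divergence between predictions on an example and on its augmentation."""
    return kl_divergence(softmax(logits_original), softmax(logits_augmented))


def total_loss(supervised_loss, unlabeled_pairs, weight=1.0):
    """Supervised loss plus the weighted mean consistency loss over unlabeled pairs."""
    if not unlabeled_pairs:
        return supervised_loss
    mean_consistency = sum(
        consistency_loss(orig, aug) for orig, aug in unlabeled_pairs
    ) / len(unlabeled_pairs)
    return supervised_loss + weight * mean_consistency


# Hypothetical logits for one unlabeled example and its augmented version.
pairs = [([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])]
print(total_loss(supervised_loss=0.7, unlabeled_pairs=pairs, weight=1.0))
```

The consistency term is zero when the two predictions agree, so it only penalizes the model for being sensitive to the augmentation.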
Few-shot Learning.
Few-shot learning is a broad technique for dealing with tasks with less labeled data based on prior knowledge. Compared to semi-supervised learning, which utilizes unlabeled data as additional information, few-shot learning leverages various kinds of prior knowledge such as pre-trained models or supervised data from other domains and modalities (Wang et al., 2020). While most work on few-shot learning focuses on computer vision, it has recently seen increasing adoption in NLP (Han et al., 2018; Rios and Kavuluru, 2018; Hu et al., 2018; Herbelot and Baroni, 2017). To better leverage pre-trained models, PET (Schick and Schütze, 2021a, b) converts the text and label in an example into a fluent sentence, and then uses the probability of generating the label text as the class logit, outperforming GPT-3 for few-shot learning (Brown et al., 2020). How to better model and incorporate prior knowledge to handle few-shot learning for NLP remains an open challenge and has the potential to significantly improve model performance with less labeled data.
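The pattern-plus-label-word idea described above can be sketched in a few lines. This is only an illustration of the mechanism, not PET's implementation: the cloze pattern, the label words in `VERBALIZER`, and the `toy_score_fill` function (a keyword-counting stand-in for a pre-trained language model's score of a word in the blank) are all hypothetical.

```python
import math

# Hypothetical mapping from class labels to label words.
VERBALIZER = {"positive": "great", "negative": "terrible"}


def pattern(text):
    """Hypothetical cloze pattern: wrap the input so a label word fills the blank."""
    return f"{text} It was ___."


def classify(text, score_fill, verbalizer=VERBALIZER):
    """Use the score of each label word in the blank as that class's logit."""
    prompt = pattern(text)
    logits = {label: score_fill(prompt, word) for label, word in verbalizer.items()}
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    probs = {label: math.exp(v - m) / z for label, v in logits.items()}
    return max(probs, key=probs.get), probs


def toy_score_fill(prompt, word):
    """Stand-in scorer: counts cue words instead of querying a language model."""
    cues = {"great": ["loved", "wonderful"], "terrible": ["boring", "awful"]}
    return float(sum(prompt.lower().count(cue) for cue in cues[word]))


label, probs = classify("I loved this wonderful film.", toy_score_fill)
print(label, probs)
```

In practice `score_fill` would be replaced by a pre-trained model's log-probability for the label word at the blank position, which is what lets the approach work with very few labeled examples.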
6 Discussion and Future Directions
In this work, we empirically surveyed data augmentation methods for limited-data learning in NLP and compared them on 11 different NLP tasks. Despite the success of these methods, certain challenges still need to be tackled to improve their performance. This section highlights some of these challenges and discusses future research directions.
Theoretical Guarantees and Data Distribution Shift.
Current data augmentation methods for text typically assume that they are label-preserving and will not change the data distribution. However, these assumptions are often not true in practice, which can result in noisy labels or a shift in the data distribution and consequently a decrease in performance or generalization (e.g., QQP in Table 3). Thus, providing theoretical guarantees that augmentations are label- and distribution-preserving under certain conditions would ensure the quality of augmented data and further accelerate the progress of this field.
Automatic Data Augmentation.
Despite being effective, current data augmentation methods are generally manually designed. Methods for automatically selecting the appropriate types of data augmentation remain under-investigated. Although certain augmentation techniques have been shown to be effective for a particular task or dataset, they often do not transfer well to other datasets or tasks (Cubuk et al., 2019), as shown in Table 3. For example, paraphrasing works well for general text classification tasks, but may fail in subtler scenarios such as classifying bias, because paraphrasing might change the label in this setting. Automatically learning data augmentation strategies or searching for an optimal augmentation policy for given datasets/tasks/models could enhance the generalizability of data augmentation techniques (Maharana and Bansal, 2020; Hu et al., 2019).
Acknowledgments
We would like to thank the members of the Georgia Tech SALT and UNC-NLP groups for their feedback. This work is supported by grants from Amazon and Salesforce, ONR grant N00014-18-1-2871, and DARPA YFA17-D17AP00022.
Notes
We also did experiments with 100 labeled examples per class and found the results consistent.
Author notes
Action Editor: Emily Pitler
Equal contribution.