An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.


Introduction
Deep learning methods have achieved strong performance on a wide range of supervised learning tasks (Sutskever et al., 2014; Deng et al., 2013; Minaee et al., 2021). Traditionally, these results were attained through the use of large, well-labeled datasets. This makes them challenging to apply in settings where collecting a large amount of high-quality labeled data for training is expensive. Moreover, given the fast-changing nature of real-world applications, it is infeasible to relabel every example whenever new data comes in. This highlights a need for learning algorithms that can be trained with a limited amount of labeled data.
There has been a substantial amount of research towards learning with limited labeled data for various tasks in the NLP community. One common approach for mitigating the need for labeled data is data augmentation. Data augmentation (Feng et al., 2021) generates new data by modifying existing data points through transformations that are designed based on prior knowledge about the problem's structure (Yang, 2015; Wei and Zou, 2019). This augmented data can be generated from labeled data and then directly used in supervised learning (Wei and Zou, 2019), or applied to unlabeled data in semi-supervised learning through consistency regularization (Xie et al., 2020) ("consistency training"). While various approaches have been proposed to tackle learning with limited labeled data - including unsupervised pre-training (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020), multi-task learning (Glorot et al., 2011; Liu et al., 2017; Augenstein et al., 2018), semi-supervised learning (Zhu, 2005; Chapelle et al., 2009; Miyato et al., 2017; Xie et al., 2020), and few-shot learning (Deng et al., 2019) - in this work, we focus on and compare different data augmentation methods and their application to supervised and semi-supervised learning.
In this survey, we comprehensively review and perform experiments on recent data augmentation techniques developed for various NLP tasks. Our contributions are three-fold: (1) we summarize and categorize recent methods in textual data augmentation; (2) we compare different data augmentation methods through experiments with limited labeled data in supervised and semi-supervised settings on 11 NLP tasks; and (3) we discuss current challenges and future directions of data augmentation, as well as learning with limited data in NLP more broadly. Our experimental results allow us to conclude that no single augmentation works best for every task, but (i) token-level augmentations work well for supervised learning, (ii) sentence-level augmentation usually works best for semi-supervised learning, and (iii) augmentation methods can sometimes hurt performance, even in the semi-supervised setting.
Related Surveys. Recently, several surveys have also explored data augmentation techniques for NLP (Hedderich et al., 2020; Feng et al., 2021). Hedderich et al. (2020) provide a broad overview of techniques for NLP in low-resource scenarios and briefly cover data augmentation as one of several techniques. In contrast, we focus on data augmentation and provide a more comprehensive review of recent data augmentation methods. While Feng et al. (2021) also survey task-specific data augmentation approaches for NLP, our work summarizes recent data augmentation methods in a more fine-grained categorization. We also focus on their application to learning from limited data by providing an empirical study of different augmentation methods on various benchmark datasets in both supervised and semi-supervised settings, so as to guide the selection of augmentation methods in future research.

Data Augmentation for NLP
Data augmentation increases both the amount (the number of data points) and the diversity (the variety of data) of a given dataset (Cubuk et al., 2019). Limited labeled data often leads to overfitting on the training set, and data augmentation works to alleviate this issue by manipulating data either automatically or manually to create additional augmented data. Such techniques have been widely explored in computer vision, with methods like geometric/color-space transformations (Simard et al., 2003; Krizhevsky et al., 2012; Taylor and Nitschke, 2018), mixup (Zhang et al., 2018), and random erasing (Zhong et al., 2020; DeVries and Taylor, 2017). Although the discrete nature of textual data and its complex syntactic and semantic structures make finding label-preserving transformations more difficult, there nevertheless exists a wide range of methods for augmenting text data that in practice preserve labels. In the following subsections, we describe four broad classes of data augmentation methods:

Token-Level Augmentation
Token-level augmentations manipulate words and phrases in a sentence to generate augmented text while ideally retaining the semantic meaning and labels of the original text.
Designed Replacement. Intuitively, the semantic meaning of a sentence remains unchanged if some of its tokens are replaced with other tokens that have the same meaning. A simple approach is to use synonyms as substitutes (Kolomiyets et al., 2011; Yang, 2015; Zhang et al., 2015a; Wei and Zou, 2019; Miao et al., 2020). The synonyms are discovered based on pre-defined dictionaries such as WordNet (Kolomiyets et al., 2011) or similarities in word embedding space (Yang, 2015). However, improvements from this technique are usually minimal (Kolomiyets et al., 2011), and in some cases performance may even degrade (Zhang et al., 2015a). A major drawback stems from the lack of contextual information when fetching synonyms, especially for words with multiple meanings or few synonyms. To resolve this, language models (LMs) have been used to replace the sampled words given their context (Kolomiyets et al., 2011; Fadaee et al., 2017; Kobayashi, 2018; Kumar et al., 2020). Other work preserves the labels of the text by conditioning on the label when generating the LMs' predictions (Kobayashi, 2018; Wu et al., 2019a). In addition, different sampling strategies for word replacement have been explored. For example, instead of sampling one specific word from the LM's candidates, Gao et al. (2019) propose to compute a weighted average over embeddings of possible words predicted by the LM as the replacement input, since the averaged representation can augment text with richer information.
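As a concrete illustration, dictionary-based synonym replacement can be sketched in a few lines of Python. The toy `SYNONYMS` table below is a stand-in for a real resource such as WordNet or embedding-space neighbors, and the function name and replacement probability `p` are illustrative choices, not taken from any of the surveyed papers.

```python
import random

# Toy synonym table -- in practice, candidates would come from WordNet
# or nearest neighbors in embedding space (an assumption of this sketch).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "movie": ["film"],
}

def synonym_replace(tokens, p=0.3, rng=None):
    """Replace each token that has a known synonym with probability p."""
    rng = rng or random.Random(0)
    return [
        rng.choice(SYNONYMS[tok]) if tok in SYNONYMS and rng.random() < p
        else tok
        for tok in tokens
    ]

augmented = synonym_replace("the quick movie made me happy".split(), p=1.0)
print(" ".join(augmented))
```

Because the table carries no context, a sketch like this inherits exactly the weakness discussed above: a polysemous word may be swapped for a synonym of the wrong sense.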
Random Insertion, Replacement, Deletion and Swapping. While well-designed local modifications can preserve the syntax and semantic meaning of a sentence (Niu and Bansal, 2018), random local modifications such as deleting certain tokens (Iyyer et al., 2015; Wei and Zou, 2019; Miao et al., 2020), inserting random tokens (Wei and Zou, 2019; Miao et al., 2020), replacing non-important tokens with random tokens (Xie et al., 2017, 2020; Niu and Bansal, 2018), or randomly swapping tokens within a sentence (Artetxe et al., 2018; Lample et al., 2018; Wei and Zou, 2019; Miao et al., 2020) can also preserve the meaning in practice. Different kinds of operations can be further combined (Wei and Zou, 2019), where each example is randomly augmented with one of insertion, deletion, and swapping. These noise-injection methods can be applied to training efficiently, and show improvements when they augment simple models trained on small training sets. However, the improvements might be unstable because random perturbations can change the meanings of sentences (Niu and Bansal, 2018). Also, fine-tuning large pre-trained models on specific tasks might attenuate the improvements due to the pre-existing generalization abilities of such models (Shleifer, 2019).
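The random operations above (popularized by EDA, Wei and Zou, 2019) can be sketched directly; the function names and default rates below are illustrative rather than the exact recipe of any one paper.

```python
import random

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p; never return empty."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps=1, rng=None):
    """Swap n_swaps random pairs of positions within the sentence."""
    rng = rng or random.Random(0)
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(tokens, n_inserts=1, rng=None):
    """Insert copies of existing tokens at random positions."""
    rng = rng or random.Random(0)
    out = list(tokens)
    for _ in range(n_inserts):
        out.insert(rng.randrange(len(out) + 1), rng.choice(tokens))
    return out
```

Combining them in the EDA style amounts to picking one of these operations uniformly at random for each training example.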
Compositional Augmentation. To increase the compositional generalization abilities of models, recent efforts have also focused on compositional augmentations (Jia and Liang, 2016; Andreas, 2020), where different fragments from different sentences are re-combined to create augmented examples. Compared to random swapping, compositional augmentation often requires more carefully-designed rules such as lexical overlap (Andreas, 2020), neural-symbolic stack machines (Chen et al., 2020e), and neural program synthesis (Nye et al., 2020). With the potential to greatly improve generalization to out-of-distribution data, compositional augmentation has been utilized in sequence labeling (Guo et al., 2020), semantic parsing (Andreas, 2020; Nye et al., 2020; Furrer et al., 2020), language modeling (Andreas, 2020; Shaw et al., 2020), and text generation (Feng et al., 2020).

Sentence-Level Augmentation
Instead of modifying tokens, sentence-level augmentation modifies the entire sentence at once.
Paraphrasing. Paraphrasing has been widely adopted as a data augmentation technique in various NLP tasks (Yu et al., 2018; Xie et al., 2020; Kumar et al., 2019; He et al., 2020; Chen et al., 2020b,c; Cai et al., 2020), as it generally provides more diverse augmented text with different word choices and sentence structures while preserving the meaning of the original text. The most popular approach is round-trip translation (Sennrich et al., 2015; Edunov et al., 2018), a pipeline that first translates sentences into an intermediate language and then translates them back to generate paraphrases. Translating through intermediate languages with different vocabularies and linguistic structures can generate useful paraphrases. To ensure the diversity of augmented data, sampling and noisy beam search can also be adopted during the decoding stage (Edunov et al., 2018). Other work focuses on directly training end-to-end models to generate paraphrases (Prakash et al., 2016), and further augments the decoding phase with syntactic information (Iyyer et al., 2018; Chen et al., 2019), latent variables (Gupta et al., 2017), and sub-modular objectives (Kumar et al., 2019).
Conditional Generation. Conditional generation methods generate additional text from a language model conditioned on the label. After training the model to generate the original text given the label, the model can generate new text (Anaby-Tavor et al., 2020; Zhang and Bansal, 2019; Kumar et al., 2020; Yang et al., 2020). An extra filtering process is often used to ensure high-quality augmented data. For example, in text classification, Anaby-Tavor et al. (2020) first fine-tune GPT-2 (Radford et al., 2019) on the original examples prepended with their labels, and then generate augmented examples by feeding the fine-tuned model particular labels. Only examples judged confident by a baseline classifier trained on the original data are kept. Similarly, new answers are generated on the basis of given questions in question answering, and are filtered by customized metrics like question answering probability (Zhang and Bansal, 2019) and n-gram diversity (Yang et al., 2020). Generative models used in this setting have been based on conditional VAEs (Bowman et al., 2016; Hu et al., 2017; Guu et al., 2017; Malandrakis et al., 2019), GANs (Iyyer et al., 2018; Xu et al., 2018), or pre-trained language models like GPT-2 (Anaby-Tavor et al., 2020; Kumar et al., 2020). Overall, these conditional generation methods can create novel and diverse data that might be unseen in the original dataset, but require significant training effort.

Adversarial Data Augmentation
Adversarial methods create augmented examples by adding adversarial perturbations to the original data, which dramatically influence the model's predictions and confidence without changing human judgments. These adversarial examples (Morris et al., 2020; Zeng et al., 2020) can be leveraged in adversarial training (Goodfellow et al., 2015) to increase neural models' robustness, and can also be utilized as data augmentation to increase models' generalization ability (Miyato et al., 2017; Cheng et al., 2019).

White-Box methods rely on the model architecture and parameters being accessible, and create adversarial examples directly using the model's gradients. Unlike image pixel values, which are continuous, textual tokens are discrete and cannot be directly modified based on gradients. To this end, adversarial perturbations are added directly to token embeddings or sentence hidden representations (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2019; Chen et al., 2020d), which creates "virtual adversarial examples". Other approaches vectorize modification operations as the difference of one-hot vectors (Ebrahimi et al., 2018b,a), or find real word neighbors in a model's hidden representations via its gradients (Cheng et al., 2019).
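The core of the white-box recipe, perturbing a continuous representation along the loss gradient, can be sketched in plain Python. This is a minimal illustration of the idea rather than any specific paper's implementation; `epsilon` and the single-token example are assumptions of the sketch, and in a real system the gradient would come from backpropagation through the model.

```python
import math
import random

def adversarial_perturbation(grad, epsilon=0.05):
    """Scale the loss gradient w.r.t. an embedding to an L2 ball of radius
    epsilon -- the direction that locally increases the loss the most."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

rng = random.Random(0)
embedding = [rng.gauss(0, 1) for _ in range(8)]  # one token embedding
grad = [rng.gauss(0, 1) for _ in range(8)]       # stand-in for dLoss/dEmbedding
adv_embedding = [e + d for e, d in
                 zip(embedding, adversarial_perturbation(grad))]
```

Because the perturbed point lives in embedding space rather than token space, it need not correspond to any real word, which is exactly why such examples are called "virtual".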
Black-Box methods are usually model-agnostic, since they do not require information about the model or its parameters, and usually rely on task-specific heuristics for creating adversarial examples. For example, by enumerating feasible substitutions on the basis of word similarity and language models, Ren et al. (2019) and Garg and Ramakrishnan (2020) select adversarial word replacements that severely influence the predictions of a text classification model. To attack reading comprehension systems, Jia and Liang (2017) and Wang and Bansal (2018) insert distracting sentences into the input passage. Other work (2020) proposes two simple yet effective adversarial transformations that reverse the position of subject and object, or the position of premise and hypothesis.

Hidden-Space Augmentation
Interpolation-Based Methods. Interpolation-based methods create new examples and labels by taking linear combinations of existing data-label pairs: given two data-label pairs, virtual data-label pairs are created through linear interpolation of the pair of data points. Such interpolation-based methods can generate unlimited augmented data in the "virtual vicinity" of the original data space, thus improving the generalization performance of models. Interpolation-based methods were first explored in computer vision (Zhang et al., 2018), and have more recently been generalized to the text domain (Miao et al., 2020; Chen et al., 2020c; Cheng et al., 2020b; Chen et al., 2020a) by performing interpolation between original data and token-level augmented data in the output space (Miao et al., 2020), between original data and adversarial data in the embedding space (Cheng et al., 2020b), or between different training examples in general hidden space (Chen et al., 2020c). Different strategies for selecting samples to mix have also been explored (Chen et al., 2020a; Guo et al., 2020; Zhang et al., 2020a), such as k-nearest-neighbors (Chen et al., 2020a) or sentence composition (Guo et al., 2020).

We summarize the preceding overview of recent widely-used data augmentation methods in Table 1, characterizing them with respect to augmentation level, the diversity of generated data, and their applicable tasks.
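The interpolation at the heart of mixup (Zhang et al., 2018) can be sketched as follows. The vectors here stand in for hidden representations and one-hot labels; the function name and the default `alpha` are illustrative choices, and in text models the inputs would be embeddings or hidden states rather than raw tokens.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Linearly interpolate two (features, one-hot label) pairs with a
    mixing weight drawn from Beta(alpha, alpha), as in mixup."""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

x_mix, y_mix = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                     [0.0, 4.0, 2.0], [0.0, 1.0])
```

Note that the mixed label is a soft distribution over classes, which is why mixup, unlike most of the other techniques surveyed here, is label-dependent and cannot be applied directly to unlabeled data.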

Consistency Training with DA
While data augmentation (DA) can be applied in the supervised setting to produce better results when only a small labeled training dataset is available, data augmentation is also commonly used in semi-supervised learning (SSL). SSL is an alternative approach for learning from limited data that provides a framework for taking advantage of unlabeled data. Specifically, SSL assumes that our training set comprises labeled examples in addition to unlabeled examples drawn from the same distribution. Currently, one of the most common methods for performing SSL with deep neural networks is "consistency regularization" (Bachman et al., 2014; Tarvainen and Valpola, 2017). Consistency regularization-based SSL (or "consistency training" for short) regularizes a model by enforcing that its output does not change significantly when the input is perturbed. In practice, the input is perturbed by applying data augmentation, and consistency is enforced through a loss term that measures the difference between the model's predictions on a clean input and a corresponding perturbed version of the same input.
Formally, let f_θ be a model with parameters θ, f̄_θ be a fixed copy of the model through which no gradients are allowed to flow, x_l be a labeled datapoint with label y, x_u be an unlabeled datapoint, and α(x) be a data augmentation method. Then, a typical loss function for consistency training is

L = CE(y, f_θ(x_l)) + λ_u CE(f̄_θ(x_u), f_θ(α(x_u)))

where CE is the cross-entropy loss and λ_u is a tunable hyperparameter that determines the weight of the consistency regularization term. In practice, various other measures have been used to minimize the difference between f̄_θ(x_u) and f_θ(α(x_u)), such as the KL divergence (Miyato et al., 2018; Xie et al., 2020) and the mean-squared error (Tarvainen and Valpola, 2017; Laine and Aila, 2017; Berthelot et al., 2019). Because gradients are not allowed to flow through the model when it is fed the clean unlabeled input x_u, this objective can be viewed as using the clean unlabeled datapoint to generate a synthetic target distribution for the augmented unlabeled datapoint.

Xie et al. (2020) showed that consistency training can be effectively applied to semi-supervised learning for NLP. To achieve stronger results, they introduce several other tricks, including confidence thresholding, training signal annealing, and entropy minimization. Confidence thresholding applies the unsupervised loss only when the model assigns a class probability above a pre-defined threshold. Training signal annealing prevents the model from overfitting on easy examples by applying the supervised loss only when the model is less confident in its predictions. Entropy minimization trains the model to output low-entropy (highly confident) predictions when fed unlabeled data. We refer the reader to Xie et al. (2020) for more details on these tricks.
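The consistency-training objective can be sketched in plain Python with lists of logits. This is a toy illustration under the assumption of a two-class problem; a real implementation would use a deep learning framework and wrap the clean prediction in an explicit stop-gradient.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def consistency_loss(logits_labeled, y_onehot, logits_clean, logits_aug,
                     lam_u=1.0):
    """Supervised CE on the labeled example plus lam_u times the consistency
    term; softmax(logits_clean) acts as a fixed target (no gradient flows
    through it in a real framework)."""
    sup = cross_entropy(y_onehot, softmax(logits_labeled))
    target = softmax(logits_clean)
    unsup = cross_entropy(target, softmax(logits_aug))
    return sup + lam_u * unsup
```

When the clean and augmented predictions agree, the consistency term reduces to the entropy of the target and is at its minimum; any disagreement adds a KL-divergence penalty on top, which is what pushes the model toward augmentation-invariant predictions.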

Datasets and Experiment Setup
To provide a quantitative comparison of the DA methods we have surveyed, we experiment with 10 of the most commonly used, model-agnostic augmentation techniques from the different levels in Table 1, including adversarial training (Goodfellow et al., 2015), Cutoff (Shen et al., 2020), and Mixup in the embedding space (Zhang et al., 2018). Most of the aforementioned techniques are not label-dependent (except Mixup) and thus can be applied directly to unlabeled data. We test them on different types of benchmark datasets, including: (i) news classification tasks: AG News (Zhang et al., 2015b) and 20 Newsgroup (Joachims, 1997); (ii) topic classification tasks: Yahoo Answers (Chang et al., 2008) and PubMed; (iii) inference tasks: MNLI, QNLI, and RTE (Wang et al., 2018); (iv) similarity and paraphrase tasks: QQP and MRPC (Wang et al., 2018); and (v) single-sentence tasks: SST-2 and CoLA (Wang et al., 2018).
For all datasets, we experiment with 10 labeled data points per class in a supervised setup, and an additional 5000 unlabeled data points per class in the semi-supervised setup. We use BERT-base (Devlin et al., 2019) as the base language model and use the same hyperparameters across all datasets and methods. We use accuracy as the evaluation metric for all datasets except CoLA (which uses Matthews correlation) and PubMed (which uses accuracy and macro-F1 score). Because performance can be heavily dependent on the specific datapoints chosen (Sohn et al., 2020), for each dataset we sample labeled data from the original dataset with 3 different seeds to form different training sets, and report the average result. For every setup, we fine-tune the model with the same seed as the dataset seed (in contrast to many works, which report the max across different seeds). The detailed experimental setup is described in the Appendix.

Results
News/Topic Classification Tasks. The results are shown in Table 2. We observe that in supervised settings, token-level augmentations work the best. Specifically, word replacement works well, attaining the highest or second-highest score every time; in the semi-supervised settings, sentence-level augmentation (round-trip translation) works the best, attaining the highest or second-highest score every time. This makes sense, since for many classification tasks multiple words indicate the label, so dropping several words will not affect the label.
Inference Tasks. As shown in Table 3, we observe that token-level augmentations (e.g., random insertion, random deletion, and word replacement) work the best overall for both supervised and semi-supervised settings. This is somewhat surprising, since inference tasks usually depend heavily on a few words, and changing these words can easily change the label.
Similarity and Paraphrase Tasks. From Table 3, in the supervised settings we observe that token-level augmentation (random swapping) achieves the best performance, while hidden-space augmentations work well in semi-supervised settings, with Cutoff performing the best on average. This makes sense, since paraphrase tasks hinge on whether two texts are paraphrases of each other, and augmenting the text (e.g., by paraphrasing it) can easily change that relationship.
Single Sentence Tasks. Based on the single-sentence task results in Table 3, hidden-space augmentation (Cutoff) provides the biggest boost in performance in supervised settings, while in semi-supervised settings, sentence-level augmentation (round-trip translation) works best. We note that most augmentation methods hurt performance on CoLA, a task for judging grammatical acceptability. This could be caused by the fact that most augmentation methods try to preserve meaning rather than grammatical correctness.
Overall, no single augmentation works the best for every task in either the supervised or semi-supervised setting. However, several overall conclusions can be made. First, augmentation does not always improve performance, and can sometimes hurt it, even in the semi-supervised setting. This suggests that we may need to design different augmentations for different tasks. Second, token-level augmentations (especially word replacement and random swapping) work well in general for supervised learning, especially when labeled data is extremely limited. Third, round-trip translation usually works the best for semi-supervised learning, showing the most consistent gains. However, if computation is limited, Cutoff may be a better choice.

Other Limited Data Learning Methods
This work mainly focuses on data augmentation and semi-supervised learning (consistency regularization) in NLP; however, there are other orthogonal directions for tackling the problem of learning with limited data. For completeness, we summarize this related work below.
Low-Resource Languages. Most languages lack the large monolingual or parallel corpora, or the manually-crafted linguistic resources, needed for building statistical NLP applications (Garrette and Baldridge, 2013). Researchers have therefore developed a variety of methods for improving performance on low-resource languages, including cross-lingual transfer learning, which transfers models from resource-rich to resource-poor languages (Do and Gaspers, 2019; Lee and Lee, 2019; Schuster et al., 2019); few/zero-shot learning (Johnson et al., 2017; Blissett and Ji, 2019; Pham et al., 2019; Abad et al., 2020), which uses only a few examples from the low-resource domain to adapt models trained in another domain; and polyglot learning (Cotterell and Heigold, 2017; Tsvetkov et al., 2016; Mulcaire et al., 2019; Lample and Conneau, 2019), which combines resource-rich and resource-poor learning using a universal language representation.
Other Methods for Semi-Supervised Learning. Semi-supervised learning methods further reduce the dependency on labeled data and enhance models when only limited labeled data is available. These methods use large amounts of unlabeled data in the training process, as unlabeled data is usually cheap and easy to obtain compared to labeled data. In this paper we focus on consistency regularization, but there are also other widely-used methods for NLP, including self-training (Yarowsky, 1995; Zhang and Zong, 2016; He et al., 2020; Lin et al., 2020), generative methods (Xu et al., 2017; Yang et al., 2017; Kingma et al., 2014; Cheng et al., 2016), and co-training (Blum and Mitchell, 1998; Clark et al., 2018; Cai and Lapata, 2019).
Few-Shot Learning. Few-shot learning is a broad family of techniques for dealing with tasks that have little labeled data by exploiting prior knowledge. Compared to semi-supervised learning, which utilizes unlabeled data as additional information, few-shot learning leverages various kinds of prior knowledge such as pre-trained models or supervised data from other domains and modalities (Wang et al., 2020). While most work on few-shot learning focuses on computer vision, it has recently seen increasing adoption in NLP (Han et al., 2018; Rios and Kavuluru, 2018; Hu et al., 2018; Herbelot and Baroni, 2017). To better leverage pre-trained models, PET (Schick and Schütze, 2021a,b) converts the text and label of an example into a fluent sentence and then uses the probability of generating the label text as the class logit, outperforming GPT-3 for few-shot learning (Brown et al., 2020). How to better model and incorporate prior knowledge for few-shot learning in NLP remains an open challenge, with the potential to significantly improve model performance with less labeled data.

Discussion and Future Directions
In this work, we empirically surveyed data augmentation methods for limited-data learning in NLP and compared them on 11 different NLP tasks. Despite their success, certain challenges still need to be tackled to improve their performance. This section highlights some of these challenges and discusses future research directions.
Theoretical Guarantees and Data Distribution Shift. Current data augmentation methods for text typically assume that they are label-preserving and do not change the data distribution. However, these assumptions are often not true in practice, which can result in noisy labels or a shift in the data distribution, and consequently a decrease in performance or generalization (e.g., QQP in Table 3). Thus, providing theoretical guarantees that augmentations are label- and distribution-preserving under certain conditions would ensure the quality of augmented data and further accelerate the progress of this field.
Automatic Data Augmentation. Despite being effective, current data augmentation methods are generally manually designed, and methods for automatically selecting the appropriate types of data augmentation remain under-investigated. Although certain augmentation techniques have been shown to be effective for a particular task or dataset, they often do not transfer well to other datasets or tasks (Cubuk et al., 2019), as shown in Table 3. For example, paraphrasing works well for general text classification tasks, but may fail in some subtle scenarios like classifying bias, because paraphrasing might change the label in that setting. Automatically learning data augmentation strategies, or searching for an optimal augmentation policy for a given dataset/task/model, could enhance the generalizability of data augmentation techniques (Maharana and Bansal, 2020).

A Experimental Setup
We train our models on NVIDIA 2080 Ti and NVIDIA V100 GPUs. Supervised experiments take 20 minutes, and semi-supervised experiments take two hours. The BERT-base model has 100M parameters. We use the same hyperparameters across all datasets, and so only use the validation set to find the best model checkpoint. We use a learning rate of 2e-5, a batch size of 16, a ratio of unlabeled to labeled data of 3, and a dropout ratio of 0.1 for all augmentation methods.

B Results for 100 Labeled Data per Class
News/Topic Classification Tasks. The results are shown in Table 4. We observe that overall, in both the supervised and semi-supervised settings, all the methods perform similarly, within 2 points of each other. This indicates that data augmentation methods work well with limited labeled data, and that with more labeled data their effectiveness diminishes.
Inference Tasks. As shown in Table 5, we observe that most augmentation methods hurt performance in both the supervised and semi-supervised settings, with a greater drop in performance in the semi-supervised setting.
Similarity and Paraphrase Tasks. Similar to the inference tasks, we observe in Table 5 that most augmentation methods hurt performance in both the supervised and semi-supervised settings, with a greater drop in performance in the semi-supervised setting.
Single Sentence Tasks. Unlike the inference and paraphrase tasks, augmentation methods help performance, as seen in Table 5, except on CoLA. We hypothesize that this is because most augmentation methods seek to preserve meaning, not grammatical correctness, which is what CoLA measures. In both the supervised and semi-supervised settings, hidden-space augmentations work well, with Cutoff performing the best.

C Case Study
We analyze several data augmentation methods, checking whether they preserve labels and whether this affects their performance. We look at 25 examples each for the best-performing and worst-performing data augmentation methods on 20 Newsgroup and RTE.
For 20 Newsgroup, Random Deletion was the best-performing method and Language Model was the worst-performing. In both cases, there were no examples where the label flipped, which makes sense since the input is usually several paragraphs with multiple references to the topic. Several examples are shown in the Appendix. For RTE, Language Model was the worst-performing augmentation and Cutoff was the best-performing. Language Model flipped 24% of the labels, with 4% uncertain, while Cutoff flipped 4% of the labels, with 12% uncertain. We show several examples of flipped labels for RTE in Table 6.


Table 1:
Overview of different data augmentation techniques in NLP. Diversity refers to how different the augmented data is from the existing data and to how much distinct augmented data can be generated.

Table 2:
Topic classification and news classification results with 10 labeled examples per class. We report the average results across 3 different random seeds with the 95% confidence interval and bold the best results. For PubMed, we report the accuracy and F1 score.

Table 3:
GLUE results with 10 labeled examples per class. We report the average results across 3 different random seeds with the 95% confidence interval and bold the best results.

Table 4:
Topic classification and news classification results with 100 labeled examples per class. We report the average results across 3 different random seeds with the 95% confidence interval and bold the best results. For PubMed, we report the accuracy and F1 score.

Table 5:
GLUE results with 100 labeled examples per class. We report the average results across 3 different random seeds with the 95% confidence interval and bold the best results.

Table 6:
Examples of different data augmentation methods on RTE and whether they preserve the original label.