Generate, Annotate, and Learn: NLP with Synthetic Text

Abstract This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called “generate, annotate, and learn (GAL)” to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.


Introduction
There is an abundance of unlabeled data in the real world, but task-specific unlabeled data within the scope of a given machine learning problem can be challenging to find. For instance, one cannot easily find in-domain unlabeled text conforming to the input distribution of a specific Natural Language Processing (NLP) task from the GLUE benchmark (Wang et al., 2019c). Some NLP tasks require an input comprising a pair of sentences with a particular relationship between them. Moreover, classification datasets typically represent a tailored distribution of data and only include a limited number of class labels. If task-specific unlabeled data were available, one could adopt self-training (Yarowsky, 1995) to automatically annotate unlabeled data with pseudo labels to improve accuracy and robustness of classifiers (Xie et al., 2020; Carmon et al., 2019). In addition, one can use knowledge distillation (Hinton et al., 2015) on fresh task-specific unlabeled data to more effectively compress deep neural networks and ensembles (Buciluǎ et al., 2006; Chen et al., 2020a).
In the absence of task-specific unlabeled data, one could retrieve unlabeled examples from a large and diverse open-domain dataset (Du et al., 2020). However, such a retrieval-based approach may not scale to problems with complex input schemes, e.g., sentence pairs with certain relations. Recent work (Yang et al., 2020; Kumar et al., 2020b) has considered the use of Language Models (LMs) like GPT-2 (Radford et al., 2019) as a means of data augmentation, showing the effectiveness of this approach for commonsense reasoning and classification tasks. Existing approaches often consider class-conditional generation, where the synthetic data is produced by conditioning on a specified class label. However, it is unclear whether class-conditional generation is best suited for NLP tasks. Furthermore, existing pipelines often make synthetic data generation complicated, as one needs to detect and discard low-quality synthetic labeled data or optionally re-label data (Yang et al., 2020; Vu et al., 2021b). For instance, Kumar et al. (2020b) observe that it is difficult for sentences generated by label-conditioned GPT-2 to retain the semantics/pragmatics of the conditioning label, leading to poor performance on downstream tasks.
We unify and simplify existing work on LMs as a data source for NLP and develop a general framework called "generate, annotate, and learn (GAL)". The generality of GAL allows us to use LM-generated synthetic data within novel applications such as Knowledge Distillation (KD) and few-shot learning. GAL builds on recent advances in text generation (Radford et al., 2019; Gao et al., 2021) and uses powerful LMs to synthesize task-specific unlabeled text by fine-tuning or conditioning a large LM on in-distribution examples. We use state-of-the-art classifiers to annotate generated text with soft pseudo labels when possible. We then combine labeled data and pseudo-labeled data to train more effective supervised models, resulting in significant gains on a range of NLP tasks like KD and few-shot learning.
We present a justification for GAL based on the empirical and vicinal risk minimization frameworks (Vapnik, 1992; Chapelle et al., 2001). We also investigate key components of GAL. We find that even if class-conditional LMs are available for text generation, it is more effective to discard the conditioning labels and let the teacher models produce pseudo labels. This observation is supported by our theoretical and empirical results. Accordingly, in contrast to prior work (Yang et al., 2020; Vu et al., 2021b), we advocate for the use of simple unconditional LMs for text synthesis. Further, we avoid any form of data filtering. Not surprisingly, we find that the diversity of synthetic text matters. That said, simple unconditional generation given random seeds provides sufficient diversity, and crafting diverse LM prompts is not needed. In summary:
• We develop GAL, a simple and effective approach to the use of LMs for task-specific unlabeled text generation. We show that GAL can be used effectively for KD, self-training, and few-shot learning in NLP.
• We present theoretical and empirical investigations of GAL, explaining why it works and why using class-conditional LMs to generate synthetic labeled data is not as effective.

Related Work
Data synthesis with large pre-trained language models is closely related to our work (Kumar et al., 2020b; Yang et al., 2020; Vu et al., 2021b; Norouzi et al., 2020). Yang et al. (2020) propose a complex scheme, including label-conditioned data generation, data relabeling, data filtering, and two-stage training, to utilize synthetic data. By contrast, we show that a simple mixture of the original data and synthetic unconditionally-generated data can provide sizable gains. Furthermore, we show a broader use of generative models in KD and few-shot learning. Vu et al. (2021b) take a task augmentation approach and employ conditional generation to produce in-domain synthetic data for an auxiliary natural language inference (NLI) task, which is then used to initialize the target-task classifier.
1 Code is available at: https://github.com/xlhex/gal_syntex
However, not all tasks (e.g., grammatical acceptability judgments) can benefit from the NLI-style auxiliary task (Wang et al., 2019a). We aim to directly generate the unlabeled in-domain data for the target task. Unlike Norouzi et al. (2020), we do not use instance-based generative models.
More broadly, there has been a recent surge in data synthesis and augmentation in NLP, including rule-based and model-based approaches; see Feng et al. (2021) for a recent survey. Data synthesis with grammars has been explored in semantic parsing and natural language understanding, e.g., see Wang et al. (2015, 2021) and Marzoev et al. (2020). Existing approaches to data augmentation for NLP include lexicon replacement, sentence retrieval, and round-trip machine translation (Wang and Yang, 2015; Yu et al., 2018; Kobayashi, 2018; Wu et al., 2019; Lichtarge et al., 2019; Wei and Zou, 2019; Alberti et al., 2019; Du et al., 2020; Shen et al., 2020). We, instead, propose the use of unconditional autoregressive LMs for data augmentation, which is simple, flexible, and powerful.
Generate, Annotate, and Learn (GAL)

Given a labeled dataset L = {(x_i, y_i)}_{i=1}^N, we first train an unconditional domain-specific generative model g(x) on L_x = {x_i}_{i=1}^N, and then use it to synthesize unlabeled data. Such synthetic unlabeled data is used within self-training and KD even in the absence of in-domain unlabeled data. We restrict our attention to basic KD and self-training methods, even though GAL can be combined with more sophisticated semi-supervised techniques too.
The effectiveness of GAL depends on the fidelity and diversity of synthetic examples. If we had access to the oracle generative process, we would be able to obtain the best KD and SSL results, as if we had access to real task-specific unlabeled data. Our preliminary experiments suggest that large language models are particularly effective within the GAL framework. Hence, as shown in Figure 1, to build the best domain-specific language model, we adopt a large language model pretrained on lots of open-domain text, and fine-tune it on a given dataset's inputs, i.e., L_x, ignoring class labels. Both our theory and ablations confirm that ignoring class labels is a good idea (c.f., sections 4 and 5). Transferring the knowledge of large language models is particularly beneficial when only a small input dataset L_x of text is available (Hernandez et al., 2021).

Algorithm 1: Knowledge distillation with GAL
Input: a labeled dataset L; initial parameters of a generative model g_0; initial parameters of a classifier f_0; a teacher model h
Output: a well-trained student classifier f_s after KD
// unlabeled data generation
1: train a generative model g by fine-tuning g_0 on L_x, where L_x = {x | (x, y) ∈ L}
2: generate U = {x̃_j}_{j=1}^{kN} by drawing kN random samples i.i.d. from g(x)
// knowledge distillation
3: apply h to unlabeled instances of U to obtain Ũ
4: train f_s by fine-tuning f_0 on L ∪ Ũ
5: return f_s
To improve the computational efficiency of GAL, we do not generate unlabeled data on the fly; instead, we generate as many unconditional samples as possible and store them in a synthetic unlabeled dataset U. We use soft pseudo labels within self-training and KD, as we empirically found them more effective than hard labels on synthetic data.
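As an illustration, the annotation step above can be sketched in a few lines of Python. This is a minimal toy sketch, not the paper's pipeline: a random linear map stands in for the teacher classifier, and random vectors stand in for text sampled from g(x).

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def annotate(teacher_logits_fn, synthetic_x):
    # "annotate": attach the teacher's soft label distribution to each synthetic input
    return softmax(teacher_logits_fn(synthetic_x))

rng = np.random.default_rng(0)
synthetic_x = rng.normal(size=(32, 4))  # kN = 32 synthetic inputs, standing in for samples from g(x)

W = rng.normal(size=(4, 3))             # toy 3-class linear "teacher"
soft_labels = annotate(lambda x: x @ W, synthetic_x)

# each soft pseudo label is a proper distribution over the 3 classes
assert soft_labels.shape == (32, 3)
assert np.allclose(soft_labels.sum(axis=1), 1.0)
```

The annotated pairs (synthetic_x, soft_labels) play the role of Ũ when mixed with the labeled set L.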

Knowledge Distillation with GAL
KD distills knowledge of an expressive teacher model into a smaller student model (Hinton et al., 2015). We pose the following objective function for KD with labeled and synthetic unlabeled data:

R_KD(f_s) = λ (1/N) Σ_{i=1}^N H(h(x_i), f_s(x_i)) + (1 − λ) E_{x ~ g(x)} [ H(h(x), f_s(x)) ],   (1)

where h is the teacher model, f_s is the student model, and g is the large pre-trained language model (e.g., GPT-2) fine-tuned on the text in the training data L_x. H(q, p) = −Σ_y q(y) log p(y) is the softmax cross entropy loss. Note the use of g(x), approximating the unknown real data distribution P(x), in (1). Algorithm 1 summarizes the GAL-KD process.

Algorithm 2: Self-training with GAL
Input: a labeled dataset L; initial parameters of a generative model g_0; initial parameters of a classifier f_0
Output: a better self-training classifier f_{T+1} after T steps
// unlabeled data generation
1: train a generative model g by fine-tuning g_0 on L_x, where L_x = {x | (x, y) ∈ L}
2: generate U = {x̃_j}_{j=1}^{kN} by drawing kN random samples i.i.d. from g(x)
// self-training
3: train a base model f_1 by fine-tuning f_0 on L
4: for t = 1 to T do:
5:    apply f_t to unlabeled instances of U to obtain a pseudo-labeled set Ũ_t
6:    train f_{t+1} by fine-tuning f_0 on L ∪ Ũ_t

Self-Training with GAL
Self-training encourages knowledge transfer between a teacher and a student model in such a way that the student can outperform the teacher. Algorithm 2 summarizes the GAL-self-training process. Given the labeled dataset L and the synthetic unlabeled dataset U, an initial model denoted f_1 is trained using supervised learning on the labeled dataset L. Then, at iteration t, one adopts f_t as the teacher model to annotate the unlabeled dataset U using pseudo labels. In self-training GAL, the student model f_{t+1} is trained to optimize a classification loss on the combination of L and U:

R(f_{t+1}) = λ (1/N) Σ_{(x_i, y_i) ∈ L} H(y_i, f_{t+1}(x_i)) + (1 − λ) (1/(kN)) Σ_{x̃_j ∈ U} H(f_t(x̃_j), f_{t+1}(x̃_j)),   (2)

where λ = 0.5 unless stated otherwise. Although many different variants of the basic self-training algorithm discussed above exist in the literature, we adopt the simplest variant of self-training and limit hyper-parameter tuning to a bare minimum.
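The λ-weighted combination of a supervised loss on L and a pseudo-labeled loss on U can be written down directly. The following is a small numpy sketch (toy arrays, illustrative names; not the paper's training code) that also checks a sanity property: when the student is perfect on L and matches the teacher on U, the remaining risk is just the (weighted) entropy of the teacher's soft labels.

```python
import math
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    # H(q, p) = -sum_y q(y) log p(y), averaged over the batch
    return float(-(q * np.log(p + eps)).sum(axis=1).mean())

def mixed_risk(y_l, p_student_l, p_teacher_u, p_student_u, lam=0.5):
    # lam-weighted supervised loss on L plus pseudo-labeled loss on U
    return lam * cross_entropy(y_l, p_student_l) + \
           (1 - lam) * cross_entropy(p_teacher_u, p_student_u)

y_l = np.array([[1.0, 0.0]])          # one-hot label on L
p_student_l = np.array([[1.0, 0.0]])  # student is exact on L
p_teacher_u = np.array([[0.5, 0.5]])  # teacher's soft pseudo label on U
p_student_u = np.array([[0.5, 0.5]])  # student matches the teacher on U

risk = mixed_risk(y_l, p_student_l, p_teacher_u, p_student_u, lam=0.5)
# labeled term vanishes; pseudo-labeled term equals the teacher's entropy, ln 2
assert abs(risk - 0.5 * math.log(2)) < 1e-6
```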

Domain-Specific Text Generation
We take a pretrained GPT-2 language model (Radford et al., 2019) and fine-tune it separately on each dataset of interest after removing class labels. We find that training from scratch on these datasets is hopeless, but the larger the pretrained GPT-2 variant, the better the validation perplexity scores. For tasks modeling a relationship between multiple sentences, we insert a separator token "[SEP]" between consecutive sentences. To alleviate over-fitting on the training set, we use the best checkpoint evaluated on the dev set as our generation engine. Once a fine-tuned GPT-2 model is obtained, we generate new domain-specific data by top-k random sampling, similar to Radford et al. (2019). We do not feed any prompt to the LM, only a special [BOS] token to initiate the generation chain. A generation episode is terminated when a special [EOS] token is produced. We generate diverse sentences by varying the random seed. After collecting enough synthetic data, we only retain unique sentences. For tasks with α input sentences, we discard generated samples that violate this constraint (approximately 10% of samples were rejected). Finally, we obtain task-specific synthetic data up to 40× larger than the original training sets. For samples of generated text for GLUE, see Tables 11 and 12.
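The generation loop described above — top-k sampling until [EOS], then filtering by the expected number of [SEP]-separated sentences — can be sketched as follows. This is a toy sketch only: a fixed unigram distribution stands in for GPT-2's next-token distribution, and all names are illustrative.

```python
import random

def top_k_sample(next_token_probs, k, rng):
    # keep only the k most probable tokens, renormalize, and sample one
    top = sorted(next_token_probs.items(), key=lambda kv: -kv[1])[:k]
    tokens, weights = zip(*top)
    return rng.choices(tokens, weights=weights)[0]

def generate(next_token_probs, k, rng, max_len=20):
    # start from an implicit [BOS]; stop on [EOS] or a length cap
    tokens = []
    while len(tokens) < max_len:
        tok = top_k_sample(next_token_probs, k, rng)
        if tok == "[EOS]":
            break
        tokens.append(tok)
    return " ".join(tokens)

def has_valid_shape(sample, alpha):
    # a task with alpha input sentences needs exactly alpha - 1 [SEP] tokens
    return sample.count("[SEP]") == alpha - 1

rng = random.Random(0)
# toy unigram "LM": with k=2, [EOS] never enters the candidate set
probs = {"good": 0.5, "movie": 0.3, "[EOS]": 0.2}
text = generate(probs, k=2, rng=rng)
assert set(text.split()) <= {"good", "movie"}
assert has_valid_shape("a sentence [SEP] another one", alpha=2)
assert not has_valid_shape("a single sentence", alpha=2)
```

In the real pipeline, deduplication then keeps only unique generations before annotation.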
We believe using bigger LMs and larger synthetic datasets will improve our results, but we are constrained by compute resources.

An Empirical Risk Minimization Perspective
In supervised learning, one seeks to learn a mapping f that, given an input x, predicts a reasonable output y. To define the supervised learning problem formally, one assumes that input-output pairs are drawn from a joint distribution P, i.e., (x, y) ~ P(x, y), and a loss function H(y, f(x)) is used to assess the quality of a mapping f. This loss is used to define a notion of expected risk:

R_exp(f) = E_{(x,y) ~ P(x,y)} [ H(y, f(x)) ].   (3)

In almost all practical applications P(x, y) is unknown. Hence, a labeled dataset of examples {(x_i, y_i)}_{i=1}^N drawn i.i.d. from P(x, y) is used to approximate the expected risk:

R(f) = (1/N) Σ_{i=1}^N H(y_i, f(x_i)).   (4)

This objective function is known as empirical risk, and learning f by minimizing R(f) is known as the empirical risk minimization principle (Vapnik, 1992). To compensate for the finite sample size in (4), one typically combines R(f) with a regularizer to improve generalization.
Beyond empirical risk minimization. Empirical risk minimization (4) is motivated as a way to approximate P(x, y) through a set of Dirac delta functions on labeled examples: P_δ(x, y) = Σ_i δ(x = x_i, y = y_i)/N. However, this approximation is far from perfect, hence one uses a held-out validation set for early stopping and hyper-parameter tuning.
Vicinal risk minimization (Chapelle et al., 2001) approximates the expected risk as E_{P_ν(x̃,ỹ)} H(ỹ, f(x̃)), using a vicinity distribution, e.g., ν(x̃, ỹ | x, y) = N(x̃ − x, σ²) δ(ỹ = y), to approximate P(x, y) as

P_ν(x̃, ỹ) = (1/N) Σ_{i=1}^N ν(x̃, ỹ | x_i, y_i).   (5)

The goal is to increase the support of each labeled data point and improve the quality and robustness of the risk function.
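Sampling from the Gaussian vicinity distribution above is mechanically simple: perturb the input, keep the label. A minimal numpy sketch (illustrative, with a toy 3-dimensional input):

```python
import numpy as np

def vicinal_sample(x, y, sigma, rng):
    # draw (x~, y~) from nu(x~, y~ | x, y) = N(x~ - x, sigma^2) * delta(y~ = y):
    # Gaussian noise on the input, label unchanged
    x_tilde = x + rng.normal(scale=sigma, size=np.shape(x))
    return x_tilde, y

rng = np.random.default_rng(0)
x, y = np.array([0.0, 1.0, 2.0]), 1
x_tilde, y_tilde = vicinal_sample(x, y, sigma=0.1, rng=rng)

assert y_tilde == y                 # delta(y~ = y): label is kept
assert x_tilde.shape == x.shape
assert not np.allclose(x_tilde, x)  # input moved inside the vicinity
```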
Recent work on mixup regularization (Zhang et al., 2018) proposes an effective way to construct another vicinity distribution by interpolating between two data points and their labels. Despite their simplicity, these smoothing techniques tend to improve matters.

Generative models for risk minimization. One can factorize the joint distribution of input-output pairs as P(x, y) = P(x) P(y | x). Accordingly, if one is able to learn a reasonable unconditional generative model of x, denoted g(x), then one can draw a pair (x, y) by first drawing x ~ g(x) and then using the current instance of f_t to draw y ~ f_t(x). Then, one can use f_t and g to approximate the expected risk as

R_t(f) = E_{x ~ g(x)} [ H(f_t(x), f(x)) ].   (6)

The quality of this approximation highly depends on the quality of f_t and g. If f_t is far from an optimal classifier f* or g(x) is far from P(x), (6) yields a poor approximation.
The expected risk in (6) smoothens the risk landscape in complex ways beyond simple Gaussian smoothing and interpolation. This smoothing is applicable to any continuous, discrete, or structured domain as long as expressive generative models of P(x) are available. That said, for almost all reasonable loss functions H (e.g., softmax cross entropy and squared error), (6) is minimized when f_{t+1} = f_t, which is not ideal, especially when f_t is far from f*. On the other hand, empirical risk (4) anchors the problem in real labeled examples that are provided as ground truth.
GAL-self-training aims to combine the benefits of (4) and (6) via

R_t(f) = λ (1/N) Σ_{i=1}^N H(y_i, f(x_i)) + (1 − λ) E_{x ~ g(x)} [ H(f_t(x), f(x)) ].   (7)

In this formulation, if f_t represents the minimizer of empirical risk (4), then f_{t+1} = f_t is the minimizer of (7) too. However, one does not seek the global minimizer of empirical risk, but rather the best performance on held-out data. If f_t is obtained by stochastic gradient descent on any risk function, but early stopped according to empirical risk on a held-out set, then using such f_t in (7) to define R_t(f_{t+1}) promotes the selection of a mapping f_{t+1} that minimizes empirical risk while staying close to the best performing mapping so far (i.e., f_t). This formulation motivates self-training and GAL as regularizers in the functional space and explains why they can conceivably work. Although the arguments are provided here for GAL-self-training, extending them to GAL-KD is straightforward (omitted due to the space constraint).
How about class-conditional generative models? One can also factorize the joint distribution P(x, y) as P(y) P(x | y) and accordingly utilize a class-conditional generative model g(x | y) to derive the following expected risk formulation:

R_g(f) = E_{y ~ P(y)} E_{x ~ g(x|y)} [ H(y, f(x)) ].   (8)

In this setting pseudo labeling is not needed, as synthetic data is already labeled. One can show that the optimal classifier f*_g that minimizes (8) for the cross-entropy loss is given by

f*_g(y | x) = P(y) g(x | y) / Σ_{y'} P(y') g(x | y'),   (9)

that is, turning the class-conditional generative model into a classifier by using the Bayes rule yields the optimal solution.
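The Bayes-rule classifier induced by a class-conditional generative model can be made concrete with 1-D Gaussian class-conditionals (a toy stand-in for g(x | y); all names and densities here are illustrative, not the paper's models):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def generative_classifier(x, priors, cond_densities):
    # f*_g(y | x) = P(y) g(x | y) / sum_y' P(y') g(x | y')
    scores = {y: priors[y] * g(x) for y, g in cond_densities.items()}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# two classes with symmetric Gaussian class-conditionals and equal priors
priors = {"A": 0.5, "B": 0.5}
cond = {"A": lambda x: gaussian_pdf(x, -1.0, 1.0),
        "B": lambda x: gaussian_pdf(x, 1.0, 1.0)}

post = generative_classifier(0.0, priors, cond)
assert abs(post["A"] - 0.5) < 1e-9                      # midpoint: classes tie
assert generative_classifier(1.0, priors, cond)["B"] > 0.5
```

The argument in the text is that regularizing toward this f*_g is only as good as the generative model itself, which is why GAL annotates with a discriminative teacher instead.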
Given that the accuracy of generative classifiers on text classification lags behind their discriminative counterparts (e.g., Ravuri and Vinyals, 2019), we think substituting (8) into (7) is not a good idea. Essentially, by substituting (8) into the classification objective, one is regularizing f to remain close to f*_g, which is not an effective strategy if f*_g is not competitive. This argument corroborates the evidence from our ablation studies and recent work showing that using class-conditional generative models to augment supervised learning does not provide big gains (Ravuri and Vinyals, 2019).
That said, one can still use class-conditional generative models to synthesize high-fidelity samples. As long as these samples are treated as unlabeled examples and annotated using a classifier, e.g., f_t, we believe this is a reasonable approach falling under GAL. Note that our argument above only applies to the scenario in which class-conditional generative models are used to synthesize labeled examples. In other words, GAL emphasizes prediction of the labels in the course of the algorithm, rather than having the labels predefined. If one uses the unlabeled synthetic examples from class-conditional generative models, this still aligns with (7), which is verified in section 5.4.

Experiments
In this section, we assess the effectiveness of GAL on KD, self-training and few-shot learning.

State-of-the-art Results of Knowledge Distillation with GAL on GLUE

We use the GLUE benchmark (Wang et al., 2019c) for our KD experiments; see Appendix A.1 for benchmark details. Our synthetic unlabeled dataset U includes 40× as many examples as the original dataset for each task in GLUE.
It is known that KD on fresh data, unseen during training, performs better (Buciluǎ et al., 2006;Chen et al., 2020a) than KD on original training data.Hence, we investigate the effectiveness of KD using generated unlabeled data through GAL.
We use the HuggingFace implementation (Wolf et al., 2020) for KD experiments and adopt a standard experimental setup consistent with previous work (Sun et al., 2019a; Xu et al., 2020). Following Rashid et al. (2021), a fine-tuned RoBERTa-large (24-layer transformer) represents the teacher, and DistilRoBERTa (6-layer transformer) (Sanh et al., 2019) is used as the student. We train the student model on U and L, where U is annotated by the best RoBERTa-large model, achieving an average score of 86.5. We then mix L and U with a ratio of 1:4, which is equivalent to λ = 0.2. This ratio works best on the dev set.
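The correspondence between a dataset mixing ratio and the loss weight λ is simple arithmetic, shown here as a two-line worked example (the helper name is ours, for illustration):

```python
def ratio_to_lambda(labeled_parts, synthetic_parts):
    # mixing L and U in a labeled:synthetic ratio weights the labeled
    # loss by lambda = labeled / (labeled + synthetic)
    return labeled_parts / (labeled_parts + synthetic_parts)

assert ratio_to_lambda(1, 4) == 0.2  # the 1:4 KD mix above
assert ratio_to_lambda(1, 1) == 0.5  # the 1:1 self-training mix
```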
Table 1 shows the results of individual 6-layer transformers on the GLUE test set. All of the baselines use an identical student architecture. GAL achieves the best entry on the GLUE leaderboard, marking a new state-of-the-art for KD on NLP. It outperforms strong KD baselines such as DistilRoBERTa (Sanh et al., 2019), BERT-PKD (Sun et al., 2019a), BERT-Theseus (Xu et al., 2020), tinyBERT (Jiao et al., 2019), and MATE-KD (Rashid et al., 2021). It also outperforms our own DistilRoBERTa+KD baseline, which learns from soft labels produced by an identical RoBERTa-large ensemble on the original labeled dataset. While the use of soft labels outperforms the vanilla fine-tuned DistilRoBERTa model, it significantly underperforms our KD+GAL baseline. We also compare with two strong data-augmentation baselines: round-trip translation (RT) (Yu et al., 2018; Shleifer, 2019) and word substitution (WS) (Jiao et al., 2019; Wei and Zou, 2019). For RT, we generate 40× unlabeled data using German as the bridge language (English→German→English). The translations are generated via the best model in WMT19 (Ng et al., 2019). We use the codebase from Jiao et al. (2019) to conduct WS data augmentation. We mirror the KD experimental setup of GAL for both RT and WS. Although DistilRoBERTa+RT and DistilRoBERTa+WS are better than vanilla DistilRoBERTa and KD variants, they still drastically underperform our approach.

Self-Training with GAL on GLUE
We fine-tune the pretrained RoBERTa model provided by fairseq (Ott et al., 2019) on each GLUE task. The fine-tuned RoBERTa serves as the first teacher model for self-training. Each student model is initialized with the original pretrained RoBERTa and fine-tuned with exactly the same hyper-parameters as suggested by fairseq (Ott et al., 2019). We combine the labeled dataset L and the synthetic dataset U with a ratio of 1:1, by oversampling labeled data. This corresponds to λ = 0.5 in Eq. (7).
Table 2 shows that GAL provides an average improvement of +1.3% over RoBERTa-base. We see consistent improvements with more GAL iterations, but performance saturates after three iterations. We further compare our approach with a self-distillation (Furlanello et al., 2018) baseline, in which the teacher and student models use the same architecture and transfer knowledge via the original labeled training set. Although self-distillation provides a slight improvement, the gains from GAL are more significant.
We delve deeper and combine GAL self-training with RoBERTa-large and report test results for both single and ensemble models in Table 3. We observe consistent gains coming from GAL on RoBERTa-large. Our results underperform the latest and biggest LMs from the GLUE leaderboard, but we are optimistic that GAL can also benefit those larger models. In addition to the GLUE benchmark, Appendix B shows the applicability of GAL to two image classification tasks as a proof of concept, but more advanced techniques such as Mixup (Zhang et al., 2018) are needed to bridge the gap with the state-of-the-art.

Prompt-based Few-shot Experiments
GPT-3 (Brown et al., 2020) has introduced an optimization-free paradigm for few-shot learning in NLP. Without updating the parameters, large LMs can correctly predict the labels of inputs by conditioning on a prompt, which consists of an instruction, a few labeled instances, and a new unlabeled input. We apply GAL to prompt-based few-shot learning. Specifically, we present k labeled examples as a prompt to GPT-J (Wang and Komatsuzaki, 2021), an open-sourced re-implementation of GPT-3-6B, and generate m synthetic examples, followed by the corresponding labels. Note that to mitigate noisy outputs, the generation of each synthetic example only conditions on the original k labeled examples. Finally, we concatenate the original k examples and m synthetic examples, and conduct a (k + m)-shot learning experiment with GPT-J.

Brown et al. (2020) studied a total of 51 few-shot learning tasks. Studying all of these tasks is prohibitively expensive, so we filter tasks in two steps. First, since generating m synthetic examples for each test instance is computationally expensive, we exclude tasks that have more than 5k test examples. Second, we filter out tasks on which GPT-3-6B achieves a score lower than 65% (please refer to Table H.1 in Brown et al. (2020) for more details). After applying the filtering steps, we use four datasets as the testbed: SST-2 (Wang et al., 2019c), PIQA (Bisk et al., 2020), COPA and BoolQ (Wang et al., 2019b). We notice that in order to generate valid synthetic data, GPT-J requires at least 4 labeled examples. In addition, at most 16 examples of BoolQ can be fed into GPT-J without truncation. Thus, we set k and m to 4 and 12, respectively. As seen in Table 4, GAL leads to an average improvement of 1.2% over 4-shot learning, and reduces the gap between 4-shot and 16-shot learning. We noticed that the quality of some generated examples is low. We believe the performance of few-shot learning can be further improved with high-quality instances.

Table 3 (caption): RoBERTa-large with GAL self-training and SoTA methods evaluated on GLUE test sets. The benefit of GAL on single models is larger than ensembles. It appears that self-training reduces the variance of models. Baselines include much larger models: RoBERTa-large (Liu et al., 2019), ELECTRA (Clark et al., 2020), T5 (Raffel et al., 2020), ERNIE (Sun et al., 2019b), and DeBERTa (He et al., 2020).
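The (k + m)-shot prompt assembly described above can be sketched as follows. The "Input:/Label:" template is a hypothetical format for illustration, not the paper's exact prompts; only the structure — k real demonstrations, m synthetic demonstrations, then the unanswered query — follows the text.

```python
def build_prompt(real_shots, synthetic_shots, query,
                 template="Input: {x}\nLabel: {y}\n\n"):
    # concatenate the k real and m synthetic labeled examples, then the test input
    shots = "".join(template.format(x=x, y=y)
                    for x, y in list(real_shots) + list(synthetic_shots))
    return shots + f"Input: {query}\nLabel:"

real = [("a gripping film", "positive"), ("a dull mess", "negative")]
synthetic = [("an uneven but charming debut", "positive")]
prompt = build_prompt(real, synthetic, "a quiet triumph")

# the prompt holds k + m demonstrations plus the unanswered query
assert prompt.count("Label:") == len(real) + len(synthetic) + 1
assert prompt.endswith("Label:")
```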
One solution is to generate many synthetic examples and select a high-quality subset. Since each test instance conditions on distinct labeled instances, one has to generate different synthetic instances for each test example from GPT-J, which is computationally expensive. Due to such computational constraints, we leave the investigation of data selection strategies to future work.

Ablation Studies

We conduct an in-depth study of different components of GAL on GLUE datasets. Unless stated otherwise, we use a RoBERTa-base model with a combination of the original training data and 40× synthetic data for each self-training experiment.

GPT-2 model size. Radford et al. (2019) present a few variants of the GPT-2 model, including GPT-2, GPT-2-medium, GPT-2-large, and GPT-2-XL. Larger GPT-2 models yield better perplexity scores and higher generation quality. We utilize these models, except GPT-2-XL, within the GAL framework to study the impact of the generative model's quality on downstream task performance. Table 5 shows that regardless of GPT-2 model size, GAL consistently surpasses the vanilla RoBERTa-base. Moreover, the SST-2 and RTE datasets are not sensitive to the capacity of GPT-2, but higher quality synthetic text improves the results on the MRPC and CoLA datasets. We leave investigation of GPT-2-XL and even larger LMs such as GPT-3 (Brown et al., 2020) to future work.

Soft vs. hard pseudo labels. We investigate the use of soft and hard pseudo labels within the GAL framework. The results in Table 6 suggest that GAL using soft pseudo labels is more effective than hard labels on the GLUE benchmark. This finding is compatible with the intuition that soft labels enable measuring the functional similarity of neural networks better (Hinton et al., 2015).
Class-conditional synthetic data generation.
Previous work (Kumar et al., 2020b; Ravuri and Vinyals, 2019) suggests that it is challenging to utilize labeled synthetic data from class-conditional generative models to boost the accuracy of text and image classifiers. Our theory in Section 4 points to the potential drawbacks of class-conditional synthetic data. We study this phenomenon empirically by fine-tuning GPT-2 in a class-conditional manner. We then utilize its synthetic examples in two different ways: 1) as labeled synthetic examples and 2) as unlabeled synthetic examples. Table 7 shows that class-conditional LMs not only underperform unconditional LMs in our GAL framework, but are also much worse than the baseline when using the pre-defined labels. Nevertheless, if we apply GAL to these examples, the class-conditional LM is on par with the unconditional one, which corroborates the importance of the annotation step in GAL. We provide more analysis in Appendix A.3.

Limitations
This work demonstrates that one can leverage synthetic in-domain data generated by powerful pre-trained generative models. For simplicity, we do not employ any filtering to retain diverse but high-quality data points. However, previous works have shown that advanced filtering approaches can further improve performance (Sohn et al., 2020; Du et al., 2020; Yang et al., 2020). Given that the improvements in self-training are not sizable, we believe it is worth imposing filtering methods on the synthetic data to mitigate the side effects caused by noisy data points.
Although we examine the effectiveness of GAL on various classification tasks, we still focus on sentence-level tasks. Given the superior performance on sentence-level tasks, there has been a surge of interest in document-level tasks, such as document-level machine translation (Miculicich et al., 2018; Voita et al., 2018; Maruf and Haffari, 2018) and document summarization (Rush et al., 2015; Nallapati et al., 2016). As these tasks suffer from data scarcity, one could leverage GAL to synthesize more data points. However, previous works have shown that GPT-2 has difficulty generating coherent text requiring long-range dependency (Orbach and Goldberg, 2020; Guan et al., 2020). Such a limitation may hinder the application of GAL to document-level tasks.
In addition, the label space of the studied tasks is not as complex as that of structured prediction tasks, such as machine translation, dialog systems, and question answering. However, we believe one can smoothly adapt GAL to these tasks as well. Consider machine translation (MT) as a canonical structured prediction task. Prior works have shown that one can use (real) monolingual data, in either the source or the target language, through data augmentation (Sennrich et al., 2016) or knowledge distillation (Kim and Rush, 2016) to improve structured prediction tasks. Furthermore, Vu et al. (2021a) suggest that one can leverage a retrieval-based approach to obtain monolingual sentences from generic data stores; this retrieved monolingual data is then employed to improve translation quality in a domain adaptation setting. This suggests that a GAL-based approach to synthetically generating monolingual text is a promising method to improve MT for specialized domains, where even monolingual data is scarce — an interesting direction for future research.

Conclusion
We present Generate, Annotate, and Learn (GAL): a framework for self-training and knowledge distillation with generated unlabeled data. We motivate GAL from an expected risk minimization perspective and demonstrate both theoretically and empirically that the use of unconditional generative models for synthetic data generation is more effective than the class-conditional generative models previously used in the literature. GAL leverages advances in large pretrained language models to help supervised learning and can have implications for learning from limited labeled data. GAL significantly helps improve knowledge distillation and prompt-based few-shot learning. In addition, concurrent work (Gowal et al., 2021) has shown that using generated images can enhance the robustness of image classifiers; we will explore this direction on NLP tasks in the future. Finally, we hope that GAL will stimulate new research on the evaluation and development of large language models.

A.1 Datasets
The statistics of GLUE are reported in Table 8.

A.2 GPT-2 for classification
We have conducted additional experiments in which we fine-tune GPT-2 as a classifier. We consider two variants of the GPT-2 model. The first variant is the original GPT-2 model (GPT-2-original) pre-trained on open-domain text. The second variant is the GPT-2 model fine-tuned on the inputs of each task separately (GPT-2-finetuned); this model was used to generate task-specific (synthetic) unlabeled data. Finally, we also consider self-training with GAL on top of GPT-2-original. Specifically, we use the GPT-2-finetuned model to synthesize 40× in-domain unlabeled data. Then we apply self-training to GPT-2-original, where the data is a combination of the original labeled data and pseudo-labeled synthetic data. Table 9 suggests that the gains of GAL come from the pseudo-labeled synthetic data, i.e., both synthetic unlabeled data and the teacher's knowledge. Without the generation of synthetic unlabeled data, the domain-specific knowledge embedded in the GPT-2-finetuned model cannot be utilized. As such, the GPT-2-finetuned model is inferior to the GPT-2-original model. Since RoBERTa-large is superior to GPT-2 models, RoBERTa-large+GAL also significantly outperforms the GPT-2 counterpart.

A.3 Importance of Pseudo-labels
We argued in sections 3 and 5 that using class-conditional generative models to generate labeled synthetic examples is less effective than GAL. To further verify this argument, we sample 100 instances from the synthetic RTE dataset generated by the label-prompted GPT-2, i.e., the class-conditional LM. We then annotate these examples using a human annotator, a GPT-2 classifier, and a RoBERTa classifier. Finally, we compute Accuracy, F1, Precision, and Recall between human labels and GPT-2 labels, between human labels and RoBERTa labels, and between human labels and the conditioning labels used by GPT-2 when the data was generated. Table 10 shows that the class-conditional LM has difficulty generating sentences that retain the semantics or pragmatics of a specified category, which corroborates our theoretical analysis in section 3. On the other hand, discriminative models such as the GPT-2 classifier and the RoBERTa classifier produce higher-quality labels that correlate better with human annotations.
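The agreement scores in Table 10 are standard classification metrics computed by treating the human annotations as ground truth. A minimal sketch (the label strings and sample values below are illustrative, not taken from our data):

```python
# Sketch: agreement between human labels and model-assigned labels on
# synthetic RTE examples, treating human labels as the reference.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy stand-ins for 100 annotated synthetic RTE instances.
human = ["entailment", "not_entailment", "entailment", "not_entailment"]
model = ["entailment", "entailment", "entailment", "not_entailment"]

acc = accuracy_score(human, model)
f1 = f1_score(human, model, pos_label="entailment")
prec = precision_score(human, model, pos_label="entailment")
rec = recall_score(human, model, pos_label="entailment")
```

The same four calls are repeated for each label source (GPT-2 classifier, RoBERTa classifier, and the conditioning labels of the class-conditional LM).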

A.4 Generated Unlabeled Examples Annotated with Pseudo Labels
We provide some synthetic sentences generated by GAL in Tables 11 and 12. As a proof of concept, in addition to NLP tasks, we assess the effectiveness of GAL on CIFAR-10 (Krizhevsky and Hinton, 2009) and Fashion MNIST (Xiao et al., 2017) as well. We adopt the NCSN model of Song and Ermon (2019) as the task-specific generative model. We use the CIFAR-10 model provided by the authors and train a model on Fashion MNIST with the same configuration as CIFAR-10. We select the model checkpoint with the best FID score against the training set (Heusel et al., 2017), based on 1000 samples. We then use the NCSN models to generate up to 10× synthetic unlabeled data, i.e., 500K images for CIFAR-10 and 600K for Fashion MNIST. See Appendix B.2 for representative samples. We adopt FixMatch (Sohn et al., 2020) for semi-supervised learning on the vision tasks, since FixMatch has shown promising results on CIFAR-10. Specifically, we train a classifier on mini-batches of intertwined labeled and (synthetic) unlabeled data. In each iteration, we obtain pseudo-labels for the unlabeled data, but filter unlabeled examples based on the classifier's confidence, i.e., we keep only examples for which the largest class probability exceeds τ. Weak augmentation is used to define pseudo-labels, while strong augmentations are used to obtain the student model's predictions; we randomly sample from the strong-augmentation list defined in RandAugment (Cubuk et al., 2020). We apply strong augmentations only to the synthetic samples, not the original labeled data, to ensure a fair comparison with the baseline.
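The confidence-based filtering step can be sketched as follows (the function name and toy probabilities are illustrative; in our experiments the threshold is τ = 0.95, applied to predictions on weakly augmented inputs):

```python
import numpy as np

def filter_pseudo_labels(probs, tau=0.95):
    """Keep only unlabeled examples whose largest class probability
    meets the confidence threshold tau.

    probs: (N, C) array of softmax outputs on weakly augmented inputs.
    Returns (kept_indices, pseudo_labels) for the retained examples.
    """
    conf = probs.max(axis=1)              # per-example confidence
    keep = np.where(conf >= tau)[0]       # indices passing the threshold
    return keep, probs[keep].argmax(axis=1)

# Toy batch: rows 0 and 2 are confident enough, row 1 is filtered out.
probs = np.array([
    [0.97, 0.02, 0.01],
    [0.50, 0.30, 0.20],
    [0.01, 0.96, 0.03],
])
keep, labels = filter_pseudo_labels(probs, tau=0.95)
```

The retained examples are then strongly augmented and their pseudo-labels used as targets for the student model's predictions.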
We conduct experiments on three convolutional neural network architectures: VGG19 (Simonyan and Zisserman, 2014), WideResNet28-10 (Zagoruyko and Komodakis, 2016), and ResNet110 (He et al., 2016). For the full list of hyperparameters and other implementation details, please refer to Appendix C. Each classifier is trained for 200 epochs, using synthetic datasets of size 1×, 5×, and 10× the training dataset.
Table 13 shows that GAL achieves an average error reduction of 0.78% over the baseline on CIFAR-10 across the three architectures tested. Further, the larger the synthetic dataset, the better GAL performs. The reported results are the average of 3 independent runs. Table 14 presents GAL results on the Fashion MNIST dataset; similar to CIFAR-10, we observe a performance improvement across the three architectures. Our image classification experiments confirm that even when the generative model of GAL is not pretrained on open-domain data but trained solely on the dataset at hand, GAL can offer significant improvements.

C Training Details
We use the fairseq codebase (Ott et al., 2019) for self-training on the GLUE benchmark; training details are summarized in Table 15. We use the HuggingFace codebase (Wolf et al., 2020) for the KD experiments. All KD experiments are trained for 5 epochs with a learning rate of 2e-5 and a batch size of 32. All experiments are run on a single Nvidia V100 GPU.
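Our KD experiments train the student on the teacher's soft pseudo-labels; the standard soft-label distillation objective (Hinton et al., 2015) can be sketched as below. The temperature value and logits are illustrative, not the settings used in our runs:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=1.0):
    # KL divergence between temperature-scaled teacher and student
    # distributions; the T**2 factor keeps gradient magnitudes comparable
    # across temperatures (Hinton et al., 2015).
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * (T ** 2)

student = torch.tensor([[2.0, 0.5, -1.0]])
teacher = torch.tensor([[3.0, 0.2, -2.0]])
loss = kd_loss(student, teacher, T=2.0)
```

In GAL, this loss is computed over mini-batches drawn from both the original labeled data and the teacher-annotated synthetic text.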
For the CV tasks, we first use the official implementation of NCSN (Song and Ermon, 2019) to generate synthetic images for CIFAR-10 and Fashion MNIST. We use the pretrained checkpoints provided by the authors to generate synthetic CIFAR-10 images, and we train a new generative model for Fashion MNIST from scratch with the same hyperparameters as the CIFAR-10 network.
After generating the synthetic images, we apply GAL using a FixMatch-like setup (Sohn et al., 2020) with the hyperparameters listed in Table 16. We follow Cubuk et al. (2020) for strong augmentations. Finally, the backbone of the classifiers is from this codebase: https://github.com/bearpaw/pytorch-classification.
Algorithm (self-training loop, excerpt): apply f_t to unlabeled instances of U to get Ũ; train f_{t+1} by fine-tuning f_0 on L ∪ Ũ; return f_{T+1}.
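The self-training loop excerpted above can be sketched in full; `fine_tune` and `annotate` are placeholders for the task-specific training and pseudo-labeling routines, and the toy stand-ins below only illustrate the control flow:

```python
def self_train(f0, L, U, T, fine_tune, annotate):
    """GAL-style self-training sketch.

    f0: model trained on labeled set L; U: synthetic unlabeled set;
    T: number of self-training rounds.
    annotate(f, U) -> pseudo-labeled copy of U using model f.
    fine_tune(f0, data) -> new model fine-tuned from f0 on data.
    """
    f = f0
    for _ in range(T):
        U_tilde = annotate(f, U)        # apply f_t to unlabeled instances of U
        f = fine_tune(f0, L + U_tilde)  # train f_{t+1} by fine-tuning f_0 on L ∪ Ũ
    return f                            # f_{T+1}

# Toy stand-ins: "annotate" pairs each input with the current model,
# and "fine_tune" returns the training-set size as the new "model".
annotate = lambda f, U: [(x, f) for x in U]
fine_tune = lambda f0, data: len(data)
final = self_train(f0=0, L=[1, 2], U=[3], T=2,
                   fine_tune=fine_tune, annotate=annotate)
```

Note that each round fine-tunes from the original model f_0 rather than from the previous round's model, matching the algorithm above.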

Figure 2: CIFAR-10 synthetic samples generated by NCSN (Song and Ermon, 2019) and corresponding pseudo-labels. Images are filtered with a confidence threshold of τ = 0.95 and categorized by pseudo-label. For each category, 16 random samples are shown.

Figure 3: Fashion MNIST synthetic samples generated by NCSN (Song and Ermon, 2019) and corresponding pseudo-labels. Images are filtered with a confidence threshold of τ = 0.95 and categorized by pseudo-label. For each category, 16 random samples are shown.

Table 1: GLUE test results for a 6-layer transformer. GAL establishes a new state of the art on KD for NLP. Baselines: BERT-Theseus, DistilRoBERTa + KD (standard KD), DistilRoBERTa + WS (word substitution), and DistilRoBERTa + RT (round-trip translation). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.

Table 2: RoBERTa base and GAL self-training results on GLUE dev sets, averaged across 5 independent runs (numbers in the subscript indicate the error bar, i.e., standard deviation divided by √5).

Table 4: Few-shot learning results for GPT-J (6B) (Wang and Komatsuzaki, 2021) on four NLP datasets. Accuracy is reported for these datasets.

Table 5 :
GAL with various GPT-2 model sizes on GLUE dev sets.NA indicates a RoBERTa base model.We bold the best numbers.

Table 6 :
GAL with soft vs. hard pseudo labels on GLUE dev sets. We bold the best numbers.

Table 7 :
Synthetic data from class-conditional LMs underperforms GAL and RoBERTa on GLUE dev sets.

Table 8 :
Summary of GLUE tasks used for evaluation of GAL.STS-B is a regression task, so #classes is not applicable.

Table 9 :
GLUE test results of using GPT-2 and RoBERTa-large as classification models.

Table 10 :
Performance of GPT2 annotation, RoBERTa annotation and conditioning labels on 100 random examples from the synthetic RTE dataset generated by a class-conditional LM.

Table 11 :
Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses.

Table 12:
QQP: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses. Example synthetic pairs:
1: What are the best courses for a mechanical engineering student? [SEP] What is the best course to do after completing a B.Tech in mechanical engineering? (not duplicated)
2: How much marks are needed to get through the GATE with electronics? [SEP] What is the average score of the Gate EE exam? What are the cut-offs? (not duplicated)
3: What is the best time table for students to prepare for IAS? [SEP] How can one study for IAS in a best time? (not duplicated)

Table 13 :
Classification error rates on CIFAR-10 test set with varying amounts of synthetic data for three different model architectures.Reported results are the average of 3 independent runs.

Table 14 :
Classification error rates on Fashion MNIST test set with varying amounts of synthetic data for three different model architectures.Results reported are the average over 3 independent runs.

Table 15 :
Training details for self-training over GLUE benchmark.

Table 16 :
Training details for CV experiments.