This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called “generate, annotate, and learn (GAL)” to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.

There is an abundance of unlabeled data in the real world, but task-specific unlabeled data within the scope of a given machine learning problem can be challenging to find. For instance, one cannot easily find in-domain unlabeled text conforming to the input distribution of a specific Natural Language Processing (NLP) task from the GLUE benchmark (Wang et al., 2019c). Some NLP tasks require an input comprising a pair of sentences with a particular relationship between them. Moreover, classification datasets typically represent a tailored distribution of data and only include a limited number of class labels. If task-specific unlabeled data were available, one could adopt self-training (Yarowsky, 1995) to automatically annotate unlabeled data with pseudo labels to improve accuracy and robustness of classifiers (Xie et al., 2020; Carmon et al., 2019). In addition, one can use knowledge distillation (Hinton et al., 2015) on fresh task-specific unlabeled data to more effectively compress deep neural networks and ensembles (Buciluă et al., 2006; Chen et al., 2020a).

In the absence of task-specific unlabeled data, one could retrieve unlabeled examples from a large and diverse open-domain dataset (Du et al., 2020). However, such a retrieval-based approach may not scale to problems with complex input schemes, for example, sentence pairs with certain relations. Recent work (Yang et al., 2020; Kumar et al., 2020b) has considered the use of Language Models (LMs) like GPT-2 (Radford et al., 2019) as a means of data augmentation, showing the effectiveness of this approach for commonsense reasoning and classification tasks. Existing approaches often consider class-conditional generation, where the synthetic data is produced by conditioning on a specified class label. However, it is unclear whether class-conditional generation is best suited for NLP tasks. Furthermore, existing pipelines often make synthetic data generation complicated as one needs to detect and discard low-quality synthetic labeled data or optionally re-label data (Yang et al., 2020; Vu et al., 2021b). For instance, Kumar et al. (2020b) observe that it is difficult for sentences generated by label-conditioned GPT-2 to retain the semantics/pragmatics of the conditioning label, leading to poor performance on downstream tasks.

We unify and simplify existing work on LMs as a data source for NLP and develop a general framework called “generate, annotate, and learn (GAL)”. The generality of GAL allows us to use LM-generated synthetic data within novel applications such as Knowledge Distillation (KD) and few-shot learning. GAL builds on recent advances in text generation (Radford et al., 2019; Gao et al., 2021) and uses powerful LMs to synthesize task-specific unlabeled text by fine-tuning or conditioning a large LM on in-distribution examples. We use state-of-the-art classifiers to annotate generated text with soft pseudo labels when possible. We then combine labeled data and pseudo-labeled data to train more effective supervised models, resulting in significant gains on a range of NLP tasks like KD and few-shot learning.

We present a justification for GAL based on the empirical and vicinal risk minimization frameworks (Vapnik, 1992; Chapelle et al., 2001). We also investigate key components of GAL. We find that even if class-conditional LMs are available for text generation, it is more effective to discard the conditioning labels and let the teacher models produce pseudo labels. This observation is supported by our theoretical and empirical results. Accordingly, in contrast to prior work (Yang et al., 2020; Vu et al., 2021b), we advocate for the use of simple unconditional LMs for text synthesis. Further, we avoid any form of data filtering. Not surprisingly, we find that the diversity of synthetic text matters. That said, simple unconditional generation given random seeds provides sufficient diversity, and crafting diverse LM prompts is not needed.

In summary:

• We develop GAL, a simple and effective approach to the use of LMs for task-specific unlabeled text generation. We show that GAL can be used effectively for KD, self-training, and few-shot learning in NLP.

• We present theoretical and empirical investigations for GAL, explaining why it works and why using class-conditional LMs to generate synthetic labeled data is not as effective.

• GAL advances KD for NLP and establishes a new state-of-the-art (SoTA) result for a single 6-layer transformer on the GLUE test set. It further improves prompt-based few-shot learning, providing an average improvement of 1.3% on four 4-shot learning NLP tasks, outperforming GPT-3-6B.

##### Data synthesis

with large pre-trained language models is closely related to our work (Kumar et al., 2020b; Yang et al., 2020; Vu et al., 2021b; Norouzi et al., 2020). Yang et al. (2020) propose a complex scheme, including label-conditioned data generation, data relabeling, data filtering, and two-stage training, to utilize synthetic data. By contrast, we show that a simple mixture of the original data and synthetic unconditionally generated data can provide sizable gains. Furthermore, we show a broader use of generative models on KD and few-shot learning. Vu et al. (2021b) take a task augmentation approach and employ conditional generation to produce in-domain synthetic data for an auxiliary natural language inference (NLI) task, which is then used to initialize the target-task classifier. However, not all tasks (e.g., grammatical acceptability judgments) can benefit from the NLI-style auxiliary task (Wang et al., 2019a). We aim to directly generate the unlabeled in-domain data for the target task. Unlike Norouzi et al. (2020), we do not use instance-based generative models.

More broadly, there has been a recent surge in data synthesis and augmentation in NLP, including rule-based and model-based approaches; see Feng et al. (2021) for a recent survey. Data synthesis with grammars has been explored in semantic parsing and natural language understanding (e.g., see Wang et al., 2015, 2021; Marzoev et al., 2020). Existing approaches to data augmentation for NLP include lexicon replacement, sentence retrieval, and round-trip machine translation (Wang and Yang, 2015; Yu et al., 2018; Kobayashi, 2018; Wu et al., 2019; Lichtarge et al., 2019; Wei and Zou, 2019; Alberti et al., 2019; Du et al., 2020; Shen et al., 2020). We, instead, propose the use of unconditional autoregressive LMs for data augmentation. This is simple, flexible, and powerful.

##### Self-training

is one of the oldest approaches for semi-supervised learning (Scudder, 1965; Fralick, 1967; Agrawala, 1970; Yarowsky, 1995; Eisner and Karakos, 2005; Ueffing et al., 2007; Du et al., 2020). Abney (2004) and Haffari and Sarkar (2007) have theoretically analyzed self-training for simple decision lists. Recent theoretical work analyzes self-training for linear models, often under the assumption that the data distribution is (nearly) Gaussian (Carmon et al., 2019; Raghunathan et al., 2020; Chen et al., 2020b; Kumar et al., 2020a; Oymak and Gulcu, 2020). Wei et al. (2021) prove that, under “expansion” and “class separation” assumptions, self-training can lead to more accurate neural network classifiers. We present a theoretical framing of GAL in terms of empirical and vicinal risk minimization (Vapnik, 1992; Chapelle et al., 2001).

##### Knowledge Distillation

(KD) (Buciluă et al., 2006; Hinton et al., 2015) uses a procedure similar to self-training to distill knowledge of an expressive teacher model into a smaller student model. In contrast, self-distillation (Furlanello et al., 2018; Zhang et al., 2019; Mobahi et al., 2020) uses teacher and student models of equal size, hoping to iteratively refine class labels. Previous work uses unlabeled data (Buciluă et al., 2006) and adversarial training (Wang et al., 2018) to improve KD. We demonstrate that synthetic data generated by unconditional generative models can improve KD on NLP, outperforming strong KD baselines, which often add more complexity and additional hyperparameters (e.g., Sun et al., 2019a; Jiao et al., 2019; Xu et al., 2020; Rashid et al., 2021).

Given a labeled dataset $L=\{(x_i,y_i)\}_{i=1}^{N}$, we first train an unconditional domain-specific generative model $g(x)$ on $L_x=\{x_i\}_{i=1}^{N}$, and then use it to synthesize unlabeled data. Such synthetic unlabeled data is used within self-training and KD even in the absence of in-domain unlabeled data. We restrict our attention to basic KD and self-training methods, even though GAL can be combined with more sophisticated semi-supervised techniques, too.

The effectiveness of GAL depends on the fidelity and diversity of synthetic examples. If we had access to the oracle generative process, we would be able to obtain the best KD and semi-supervised learning (SSL) results, as if we had access to real task-specific unlabeled data. Our preliminary experiments suggest that large language models are particularly effective within the GAL framework. Hence, as shown in Figure 1, to build the best domain-specific language model, we adopt a large language model pretrained on lots of open-domain text, and fine-tune it on a given dataset’s inputs, that is, $L_x$, ignoring class labels. Both our theory and ablations confirm that ignoring class labels is a good idea (cf. Sections 4 and 5). Transferring the knowledge of large language models is particularly beneficial when only a small input dataset $L_x$ of text is available (Hernandez et al., 2021).

Figure 1:

An illustration of GAL for NLP. We use open-domain data once for self-supervised pretraining (e.g., BERT) and once for training a large LM (e.g., GPT-2). BERT is fine-tuned on labeled data to yield a classifier for the task of interest. GPT-2 is fine-tuned on the same data without labels to obtain an unconditional task-specific LM, which is used to generate lots of synthetic in-domain unlabeled data for self-training and KD.


To improve the computational efficiency of GAL, we do not generate unlabeled data on the fly, but generate as many unconditional samples as possible and store them in a synthetic unlabeled dataset $U$. We use soft pseudo labels within self-training and KD, as we found empirically that they are more effective than hard labels on synthetic data.

### 3.1 Knowledge Distillation with GAL

KD distills knowledge of an expressive teacher model into a smaller student model (Hinton et al., 2015). We pose the following objective function for KD with labeled and synthetic unlabeled data:
$\ell_{\mathrm{kd}}=\lambda\,\mathbb{E}_{(x,y)\sim L}H(y,f_s(x))+(1-\lambda)\,\mathbb{E}_{\tilde{x}\sim g(x)}H(h(\tilde{x}),f_s(\tilde{x})),$
(1)
where $h$ is the teacher model, $f_s$ is the student model, and $g$ is the large pre-trained language model (e.g., GPT-2) fine-tuned on the text in the training data $L_x$. $H(q,p)=-q^{\top}\log p$ is the softmax cross entropy loss. Note the use of $g(x)$, approximating the unknown real data distribution $P(x)$ in (1). Algorithm 1 summarizes the GAL-KD process.
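As an illustration of objective (1), the following minimal NumPy sketch computes the two terms on a batch of labeled examples and a batch of synthetic examples. The function names (`cross_entropy`, `kd_loss`) and the argument layout are assumptions made for this example; the inputs are class-probability arrays that the teacher and student are assumed to have already produced.

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    """H(q, p) = -q^T log p, averaged over the batch (rows are examples)."""
    return float(-(q * np.log(p + eps)).sum(axis=-1).mean())

def kd_loss(y_onehot, student_on_labeled, teacher_on_synthetic,
            student_on_synthetic, lam=0.5):
    """Weighted sum of the labeled term and the distillation term in (1).

    y_onehot:             one-hot ground-truth labels for the labeled batch
    student_on_labeled:   student probabilities f_s(x) on the labeled batch
    teacher_on_synthetic: teacher soft labels h(x~) on the synthetic batch
    student_on_synthetic: student probabilities f_s(x~) on the synthetic batch
    """
    labeled_term = cross_entropy(y_onehot, student_on_labeled)
    distill_term = cross_entropy(teacher_on_synthetic, student_on_synthetic)
    return lam * labeled_term + (1.0 - lam) * distill_term
```

In practice the expectation over $g(x)$ would be approximated by mini-batches drawn from the stored synthetic dataset, and the loss would be minimized with respect to the student parameters by an automatic-differentiation framework.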

### 3.2 Self-Training with GAL

Self-training encourages knowledge transfer between a teacher and a student model in such a way that the student can outperform the teacher. Algorithm 2 summarizes the GAL-self-training process. Given the labeled dataset $L$ and the synthetic unlabeled dataset $U$, an initial model denoted $f_1$ is trained using supervised learning on the labeled dataset $L$. Then, at iteration $t$, one adopts $f_t$ as the teacher model to annotate the unlabeled dataset $U$ using pseudo labels. In self-training GAL, the student model $f_{t+1}$ is trained to optimize a classification loss on the combination of $L$ and $U$:
$\ell_{t+1}=\lambda\,\mathbb{E}_{(x,y)\sim L}H(y,f_{t+1}(x))+(1-\lambda)\,\mathbb{E}_{\tilde{x}\sim g(x)}H(f_t(\tilde{x}),f_{t+1}(\tilde{x})),$
(2)
where λ = 0.5 unless stated otherwise. Although many different variants of the basic self-training algorithm discussed above exist in the literature, we adopt the simplest variant of self-training and limit hyperparameter tuning to a bare minimum.
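The iteration described above can be sketched on a toy problem. The example below is not the paper's implementation: it swaps the transformer classifier for a linear softmax classifier trained by plain gradient descent, and the helper names (`fit`, `self_train`) and hyperparameters are assumptions chosen for illustration. It does follow the same structure: train $f_1$ on $L$, annotate $U$ with soft pseudo labels from $f_t$, and train a freshly initialized $f_{t+1}$ on the $\lambda$-weighted combination in (2).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit(x, targets, weights, n_classes, steps=200, lr=0.1):
    """Train a linear softmax classifier on (possibly soft) targets
    by gradient descent on the weighted cross entropy."""
    w = np.zeros((x.shape[1], n_classes))
    for _ in range(steps):
        p = softmax(x @ w)
        grad = x.T @ (weights[:, None] * (p - targets)) / weights.sum()
        w -= lr * grad
    return w

def self_train(x_lab, y_lab, x_syn, n_classes, iters=3, lam=0.5):
    y_onehot = np.eye(n_classes)[y_lab]
    # f_1: supervised learning on the labeled set L only.
    w = fit(x_lab, y_onehot, np.ones(len(x_lab)), n_classes)
    x_all = np.vstack([x_lab, x_syn])
    # Per-example weights: total mass lam on L and (1 - lam) on U.
    wts = np.concatenate([np.full(len(x_lab), lam / len(x_lab)),
                          np.full(len(x_syn), (1.0 - lam) / len(x_syn))])
    for _ in range(iters):
        soft = softmax(x_syn @ w)              # teacher f_t annotates U
        targets = np.vstack([y_onehot, soft])  # hard labels on L, soft on U
        w = fit(x_all, targets, wts, n_classes)  # student f_{t+1}, re-initialized
    return w
```

Re-initializing the student at each iteration mirrors the setup described in Section 5.2, where each student starts from the original pretrained model rather than from the teacher.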

### 3.3 Domain-Specific Text Generation

We take a pretrained GPT-2 language model (Radford et al., 2019) and fine-tune it separately on each dataset of interest after removing class labels. We find that training an LM from scratch on these datasets is hopeless, whereas larger pretrained GPT-2 variants achieve better validation perplexity. For tasks modeling a relationship between multiple sentences, we insert a separator token [SEP] between consecutive sentences. To alleviate over-fitting on the training set, we use the best checkpoint evaluated on the dev set as our generation engine. Once a fine-tuned GPT-2 model is obtained, we generate new domain-specific data by using top-k random sampling similar to Radford et al. (2019). We do not feed any prompt to the LM, only a special [BOS] token to initiate the generation chain. A generation episode is terminated when a special [EOS] token is produced. We generate diverse sentences by varying the random seed. After collecting enough synthetic data, we only retain unique sentences. For tasks with α input sentences, we discard generated samples that violate this constraint (approximately 10% of samples were rejected). Finally, we obtain task-specific synthetic data up to 40× larger than the original training sets. For some samples of generated text for GLUE, see Tables 11 and 12. We believe using bigger LMs and larger synthetic datasets will improve our results, but we are constrained by compute resources.
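Two steps of this pipeline lend themselves to a short sketch: drawing a token with top-k sampling, and post-processing the generated samples (deduplication and rejection of samples with the wrong number of sentences). The code below is a hedged illustration, not the authors' code; `top_k_sample` and `postprocess` are hypothetical helper names, and the logits are assumed to come from a fine-tuned LM that is not shown here.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    top = np.argsort(logits)[-k:]          # indices of the k largest logits
    z = logits[top] - logits[top].max()    # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=probs))

def postprocess(samples, n_sentences, sep="[SEP]"):
    """Keep unique samples that contain exactly `n_sentences` non-empty
    segments separated by `sep`; order of first occurrence is preserved."""
    seen, kept = set(), []
    for sample in samples:
        parts = [part.strip() for part in sample.split(sep)]
        if len(parts) != n_sentences or not all(parts):
            continue                       # violates the input-arity constraint
        key = f" {sep} ".join(parts)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

For a sentence-pair task one would call `postprocess(generated, n_sentences=2)`; for single-sentence tasks, `n_sentences=1` reduces the function to whitespace normalization and deduplication.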

In supervised learning, one seeks to learn a mapping f that, given an input x, predicts a reasonable output y. To define the supervised learning problem formally, one assumes that input-output pairs are drawn from a joint distribution P, namely, (x,y) ∼ P(x,y), and a loss function H(y,f(x)) is used to assess the quality of a mapping f. This loss is used to define a notion of expected risk:
$R(f)=\mathbb{E}_{P(x,y)}H(y,f(x)).$
(3)
In almost all practical applications $P(x,y)$ is unknown. Hence, a labeled dataset of examples $L=\{(x_i,y_i)\}_{i=1}^{N}$ is used to approximate $R(f)$ as
$\hat{R}(f)=\frac{1}{N}\sum_{i=1}^{N}H(y_i,f(x_i)).$
(4)
This objective function is known as empirical risk, and learning $f$ through minimizing $\hat{R}(f)$ is known as the empirical risk minimization principle (Vapnik, 1992). To compensate for the finite sample size in (4), one typically combines $\hat{R}(f)$ with a regularizer to improve generalization.
##### Beyond Empirical Risk Minimization.

Empirical risk minimization (4) is motivated as a way to approximate $P(x,y)$ through a set of Dirac delta functions on labeled examples: $P_{\delta}(x,y)=\sum_{i}\delta(x=x_i,y=y_i)/N$. However, this approximation is far from perfect, hence one uses a heldout validation set for early stopping and hyperparameter tuning.

Vicinal risk minimization (Chapelle et al., 2001) approximates expected risk as $\mathbb{E}_{P_{\nu}(x,y)}H(y,f(x))$, using a vicinity distribution, for example, $\nu(\tilde{x},\tilde{y}\mid x,y)=\mathcal{N}(\tilde{x}-x,\sigma^{2})\,\delta(\tilde{y}=y)$ to approximate $P(x,y)$ as
$P_{\nu}(x,y)=\frac{1}{N}\sum_{i=1}^{N}\nu(\tilde{x}=x,\tilde{y}=y\mid x_i,y_i).$
(5)
The goal is to increase the support of each labeled data point and improve the quality and robustness of the risk function.
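To make the contrast between (4) and (5) concrete, the sketch below estimates vicinal risk by Monte Carlo under the Gaussian vicinity distribution given above: each input is perturbed with isotropic Gaussian noise while its label is kept fixed. The function name `vicinal_risk` and its signature are assumptions for this example; `f` and `loss` are arbitrary user-supplied callables.

```python
import numpy as np

def vicinal_risk(f, loss, x, y, sigma, n_draws, rng):
    """Monte Carlo estimate of vicinal risk with a Gaussian vicinity.

    Each draw perturbs every input x_i with N(0, sigma^2) noise and keeps
    the label y_i unchanged; sigma = 0 recovers the empirical risk (4).
    """
    total = 0.0
    for _ in range(n_draws):
        x_tilde = x + sigma * rng.standard_normal(x.shape)
        total += np.mean([loss(yi, f(xi)) for xi, yi in zip(x_tilde, y)])
    return total / n_draws
```

Setting `sigma=0.0` returns the plain average loss over the labeled examples, which is a convenient sanity check that the vicinal estimate generalizes the empirical one.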

Recent work on mixup regularization (Zhang et al., 2018) proposes an effective way to construct another vicinity distribution by interpolating between two data points and their labels. Despite their simplicity, these smoothing techniques tend to improve matters.

##### Generative Models for Risk Minimization.
One can factorize the joint distribution of input-output pairs as $P(x,y)=P(x)P(y\mid x)$. Accordingly, if one is able to learn a reasonable unconditional generative model of $x$ denoted $g(x)$, then one can draw a pair $(x,y)$ by first drawing $x\sim g(x)$ and then using the current instance of $f_t$ to draw $y\sim f_t(x)$. Then, one can use $f_t$ and $g$ to approximate expected risk as
$R_t(f_{t+1})=\mathbb{E}_{x\sim g(x)}\mathbb{E}_{y\sim f_t(x)}H(y,f_{t+1}(x)).$
(6)
The quality of this approximation highly depends on the quality of $f_t$ and $g$. If $f_t$ is far from an optimal classifier $f^{*}$ or $g(x)$ is far from $P(x)$, (6) yields a poor approximation.

The expected risk in (6) smoothens the risk landscape in complex ways beyond simple Gaussian smoothing and interpolation. This smoothing is applicable to any continuous, discrete, or structured domain as long as expressive generative models of $P(x)$ are available. That said, for almost all reasonable loss functions $H$ (e.g., softmax cross entropy and squared error), (6) is minimized when $f_{t+1}=f_t$, which is not ideal, especially when $f_t$ is far from $f^{*}$. On the other hand, empirical risk (4) anchors the problem in real labeled examples that are provided as ground truth.
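For the softmax cross entropy loss, the claim that (6) is minimized at $f_{t+1}=f_t$ can be checked with a short derivation. It assumes that $y$ is a one-hot label drawn from the categorical distribution $f_t(\cdot\mid x)$; the symbols $\mathcal{H}$ (entropy) and $\mathrm{KL}$ (Kullback-Leibler divergence) are notation introduced here for the sketch. For a fixed input $x$,
$\mathbb{E}_{y\sim f_t(x)}H(y,f_{t+1}(x))=-\sum_{y}f_t(y\mid x)\log f_{t+1}(y\mid x)=\mathcal{H}\big(f_t(\cdot\mid x)\big)+\mathrm{KL}\big(f_t(\cdot\mid x)\,\|\,f_{t+1}(\cdot\mid x)\big)\geq\mathcal{H}\big(f_t(\cdot\mid x)\big),$
with equality if and only if $f_{t+1}(\cdot\mid x)=f_t(\cdot\mid x)$, because the KL divergence is nonnegative and vanishes only when its two arguments coincide. Taking the expectation over $x\sim g(x)$ preserves the inequality, so $f_{t+1}=f_t$ attains the minimum of (6).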

GAL-self-training aims to combine the benefits of (4) and (6) via:
$R_t(f_{t+1})=\frac{\lambda}{N}\sum_{i=1}^{N}H(y_i,f_{t+1}(x_i))+(1-\lambda)\,\mathbb{E}_{x\sim g(x)}\mathbb{E}_{y\sim f_t(x)}H(y,f_{t+1}(x)).$
(7)
In this formulation, if $f_t$ represents the minimizer of empirical risk (4), then $f_{t+1}=f_t$ is the minimizer of (7), too. However, one does not seek the global minimizer of empirical risk, but rather the best performance on heldout data. If $f_t$ is obtained by stochastic gradient descent on any risk function, but early-stopped according to empirical risk on a heldout set, then using such $f_t$ in (7) to define $R_t(f_{t+1})$ promotes the selection of a mapping $f_{t+1}$ that minimizes empirical risk while staying close to the best performing mapping so far (i.e., $f_t$). This formulation motivates self-training and GAL as regularizers in the functional space and explains why they can conceivably work. Although the arguments are provided here for GAL-self-training, extending them to GAL-KD is straightforward (omitted due to space constraints).
##### How About Class-conditional Generative Models?
One can also factorize the joint distribution $P(x,y)$ as $P(y)P(x\mid y)$ and accordingly utilize a class-conditional generative model $g(x\mid y)$ to derive the following expected risk formulation:
$R(f)=\mathbb{E}_{y\sim P(y)}\mathbb{E}_{x\sim g(x\mid y)}H(y,f_{t+1}(x)).$
(8)
In this setting pseudo labeling is not needed as synthetic data is already labeled. One can show that the optimal classifier $f_g^{*}$ that minimizes (8) for the cross-entropy loss is given by
$f_g^{*}(y\mid x)=g(x\mid y)P(y)\Big/\sum_{y'}g(x\mid y')P(y'),$
(9)
that is, turning the class-conditional generative model into a classifier by using the Bayes rule yields the optimal solution.
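Equation (9) can be evaluated numerically as follows. This is a minimal sketch under the assumption that per-class log-likelihoods $\log g(x\mid y)$ are already available (for instance, from scoring a sentence with a class-conditional LM, which is not shown); the function name `generative_classifier` is hypothetical. Working in log space avoids underflow, since sequence likelihoods are typically tiny.

```python
import numpy as np

def generative_classifier(log_lik, log_prior):
    """Posterior P(y | x) from Bayes' rule, as in (9).

    log_lik[i, y] = log g(x_i | y)   (class-conditional log-likelihoods)
    log_prior[y]  = log P(y)         (class prior)
    """
    joint = log_lik + log_prior                  # log g(x | y) + log P(y)
    joint = joint - joint.max(axis=1, keepdims=True)  # stabilize before exp
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)
```

For example, with likelihoods 0.2 and 0.1 under two equally likely classes, the posterior is (2/3, 1/3).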

Given that the accuracy of generative classifiers on text classification lags behind that of their discriminative counterparts (e.g., Ravuri and Vinyals, 2019), we think substituting (8) into (7) is not a good idea. Essentially, by substituting (8) into the classification objective, one is regularizing $f$ to remain close to $f_g^{*}$, which is not an effective strategy if $f_g^{*}$ is not competitive. This argument corroborates the evidence from our ablation studies and recent work showing that using class-conditional generative models to augment supervised learning does not provide big gains (Ravuri and Vinyals, 2019).

That said, one can still use class-conditional generative models to synthesize high-fidelity samples. As long as these samples are treated as unlabeled examples and annotated using a classifier, for example, $f_t$, we believe this is a reasonable approach falling under GAL. Note that our argument above only applies to the scenario in which class-conditional generative models are used to synthesize labeled examples. In other words, GAL emphasizes prediction of the labels in the course of the algorithm, rather than having the labels predefined. If one uses the unlabeled synthetic examples from class-conditional generative models, the approach still aligns with (7), which will be verified in Section 5.4.

In this section, we assess the effectiveness of GAL on KD, self-training, and few-shot learning.

### 5.1 State-of-the-art Results of Knowledge Distillation with GAL on GLUE

We use the GLUE benchmark (Wang et al., 2019c) for our KD experiments; see Appendix A.1 for benchmark details. Our synthetic unlabeled dataset U includes 40× as many examples as the original dataset for each task in GLUE.

It is known that KD on fresh data, unseen during training, performs better (Buciluă et al., 2006; Chen et al., 2020a) than KD on original training data. Hence, we investigate the effectiveness of KD using generated unlabeled data through GAL.

We use the HuggingFace implementation (Wolf et al., 2020) for KD experiments and adopt a standard experimental setup consistent with previous work (Sun et al., 2019a; Xu et al., 2020). Following Rashid et al. (2021), fine-tuned RoBERTa-large (24-layer transformer) represents the teacher and a DistilRoBERTa (6-layer transformer) (Sanh et al., 2019) is used as the student. We train the student model on $U$ and $L$, where $U$ is annotated by the best RoBERTa-large model, achieving an average score of 86.5. We then mix $L$ and $U$ at a ratio of 1:4, which is equivalent to λ = 0.2. This ratio works best on the dev set.
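The correspondence between a mixing ratio and λ is simple arithmetic: a ratio of 1:4 between labeled and synthetic examples gives λ = 1/(1+4) = 0.2. The sketch below shows this conversion and one possible way to realize it when sampling mini-batches; the function names and the sampling-with-replacement scheme are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def mixing_weight(labeled_parts, synthetic_parts):
    """Convert a labeled:synthetic mixing ratio into lambda."""
    return labeled_parts / (labeled_parts + synthetic_parts)

def sample_mixed_batch(labeled, synthetic, batch_size, lam, rng):
    """Draw a batch in which each slot comes from the labeled pool with
    probability lam and from the synthetic pool otherwise (with replacement)."""
    take_labeled = rng.random(batch_size) < lam
    batch = []
    for from_labeled in take_labeled:
        pool = labeled if from_labeled else synthetic
        batch.append(pool[rng.integers(len(pool))])
    return batch
```

Because the labeled pool is much smaller than the synthetic pool, sampling with replacement at a fixed λ effectively oversamples the labeled data.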

Table 1 shows the results of individual 6-layer transformers on the GLUE test set. All of the baselines use an identical student architecture. GAL achieves the best entry on the GLUE leaderboard, marking a new state-of-the-art for KD on NLP. It outperforms strong KD baselines such as DistilRoBERTa (Sanh et al., 2019), BERT-PKD (Sun et al., 2019a), BERT-Theseus (Xu et al., 2020), tinyBERT (Jiao et al., 2019), and MATE-KD (Rashid et al., 2021). It also outperforms our own DistilRoBERTa+KD baseline, which learns from soft labels produced by an identical RoBERTa-large ensemble on the original labeled dataset. While the use of soft labels outperforms the vanilla fine-tuned DistilRoBERTa model, it significantly underperforms our KD+GAL baseline. We also compare with two strong data-augmentation baselines, round-trip translation (RT) (Yu et al., 2018; Shleifer, 2019) and word substitutions (WS) (Jiao et al., 2019; Wei and Zou, 2019). For RT, we generate 40× unlabeled data using German as the bridge language (English$\to$German$\to$English). The translations are generated via the best model in WMT19 (Ng et al., 2019). We use the codebase from Jiao et al. (2019) to conduct WS data augmentation. We mirror the KD experimental setup of GAL for both RT and WS. Although DistilRoBERTa+RT and DistilRoBERTa+WS are better than vanilla DistilRoBERTa and KD variants, they still drastically underperform our approach.

Table 1:

GLUE test results for a 6-layer transformer. GAL establishes a new state of the art on KD for NLP. Baselines: BERT-Theseus (Xu et al., 2020), BERT-PKD (Sun et al., 2019a), tinyBERT (Jiao et al., 2019), MATE-KD (Rashid et al., 2021), DistilRoBERTa (Sanh et al., 2019), and DistilRoBERTa + KD (standard KD), DistilRoBERTa + WS (word substitution), and DistilRoBERTa + RT (round-trip translation). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.

| Model | MNLI (m/mm) | CoLA | SST-2 | MRPC | STS-B | QQP | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| *Previous work:* | | | | | | | | | |
| BERT-Theseus | 82.4/82.1 | 47.8 | 92.2 | 87.6/83.2 | 85.6/84.1 | 71.6/89.3 | 89.6 | 66.2 | 78.6 |
| BERT-PKD | 81.5/81.0 | − | 92.0 | 85.0/79.9 | − | 70.7/88.9 | 89.0 | 65.5 | − |
| tinyBERT | 84.6/83.2 | 51.1 | 93.1 | 87.3/82.6 | 85.0/83.7 | 71.6/89.1 | 90.4 | 70.0 | 79.8 |
| MATE-KD | 86.2/85.6 | 58.6 | 95.1 | 91.2/88.1 | 88.5/88.4 | 73.0/89.7 | 92.4 | 76.6 | 83.5 |
| *Our results:* | | | | | | | | | |
| DistilRoBERTa | 83.8/83.4 | 55.9 | 93.2 | 87.4/83.1 | 87.5/87.5 | 71.7/89.1 | 90.6 | 73.3 | 81.2 |
| DistilRoBERTa + KD | 84.5/84.1 | 53.0 | 93.5 | 88.9/85.1 | 88.0/87.4 | 71.9/89.2 | 91.0 | 75.0 | 81.5 |
| DistilRoBERTa + WS | 86.2/85.9 | 52.2 | 94.0 | 89.9/86.4 | 88.7/88.3 | 71.7/89.2 | 91.5 | 76.2 | 82.1 |
| DistilRoBERTa + RT | 86.2/85.6 | 55.0 | 94.9 | 90.1/86.5 | 89.2/88.9 | 72.5/89.7 | 92.1 | 77.2 | 82.9 |
| DistilRoBERTa + GAL | 86.9/86.4 | 58.6 | 95.3 | 91.6/88.7 | 89.9/89.5 | 73.0/89.9 | 92.7 | 79.7 | 84.3 |

### 5.2 Self-Training with GAL on GLUE

We fine-tune a pretrained RoBERTa model provided by fairseq (Ott et al., 2019) on each GLUE task. Fine-tuned RoBERTa serves as the first teacher model for self-training. Each student model is initialized with the original pretrained RoBERTa and fine-tuned with exactly the same hyperparameters as suggested by fairseq (Ott et al., 2019). We combine the labeled dataset L and the synthetic dataset U with a ratio of 1:1, by oversampling labeled data. This corresponds to λ = 0.5 in Eq. (7).

Table 2 shows that GAL provides an average improvement of +1.3% over RoBERTa-base. We see consistent improvements with more GAL iterations, but performance saturates after three iterations. We further compare our approach with a self-distillation (Furlanello et al., 2018) baseline, in which the teacher and student models use the same architecture and transfer knowledge via the original labeled training set. Although self-distillation provides a slight improvement, the gains from GAL are more significant.

Table 2:

RoBERTa base and GAL self-training results on GLUE dev sets, averaged across 5 independent runs (numbers in the subscript indicate the error bar, i.e., standard deviation divided by $\sqrt{5}$).

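| Model | MNLI | CoLA | SST-2 | MRPC | STS-B | QQP | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa base | $87.7_{0.1}$ | $63.6_{0.4}$ | $94.8_{0.1}$ | $90.1_{0.4}$ | $90.8_{0.1}$ | $91.5_{0.1}$ | $92.6_{0.1}$ | $78.8_{0.4}$ | 86.2 |
| + GAL (iter 1) | $87.9_{0.1}$ | $65.1_{0.5}$ | $95.3_{0.1}$ | $91.7_{0.5}$ | $91.4_{0.1}$ | $91.8_{0.1}$ | $93.1_{0.1}$ | $81.4_{0.4}$ | 87.2 |
| + GAL (iter 2) | $88.0_{0.1}$ | $65.2_{0.5}$ | $95.3_{0.1}$ | $92.2_{0.4}$ | $91.5_{0.1}$ | $91.7_{0.1}$ | $93.2_{0.1}$ | $82.4_{0.5}$ | 87.4 |
| + GAL (iter 3) | $87.9_{0.1}$ | $65.5_{0.5}$ | $95.3_{0.1}$ | $92.2_{0.5}$ | $91.7_{0.2}$ | $91.7_{0.1}$ | $93.2_{0.1}$ | $82.0_{0.5}$ | 87.4 |
| RoBERTa base + self-distillation | $88.1_{0.1}$ | $63.7_{0.5}$ | $95.2_{0.1}$ | $90.3_{0.4}$ | $90.4_{0.1}$ | $91.5_{0.1}$ | $93.1_{0.1}$ | $79.7_{0.5}$ | 86.5 |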

We delve deeper and combine GAL self-training with RoBERTa-large and report test results for both single model and ensemble model in Table 3. We observe consistent gains coming from GAL on RoBERTa-large. Our results underperform the latest and largest LMs from the GLUE leaderboard, but we are optimistic that GAL can be effectively combined with enormous LMs to provide additional gains.

Table 3:

RoBERTa-large with GAL self-training and SoTA methods evaluated on GLUE test sets. The benefit of GAL on single models is larger than on ensembles. It appears that self-training reduces the variance of models. Baselines including much larger models: RoBERTa-large (Liu et al., 2019), ELECTRA (Clark et al., 2020), T5 (Raffel et al., 2020), ERNIE (Sun et al., 2019b), and DeBERTa (He et al., 2020). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.

| Model | MNLI (m/mm) | CoLA | SST-2 | MRPC | STS-B | QQP | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| *Individual Models (our implementation):* | | | | | | | | | |
| RoBERTa-large | 90.1/89.7 | 63.8 | 96.1 | 91.2/88.3 | 90.9/90.7 | 72.5/89.6 | 94.5 | 85.9 | 86.5 |
| RoBERTa-large + GAL | 90.2/89.8 | 66.2 | 96.4 | 92.0/89.2 | 90.7/90.5 | 73.6/89.9 | 95.0 | 86.3 | 87.1 |
| *Ensemble Models (our implementation):* | | | | | | | | | |
| RoBERTa-large | 91.2/90.5 | 66.8 | 96.9 | 92.8/90.3 | 91.9/91.6 | 74.5/90.4 | 95.5 | 87.7 | 87.9 |
| RoBERTa-large + GAL | 91.0/90.7 | 67.9 | 97.1 | 93.1/90.8 | 91.6/91.4 | 74.5/90.4 | 95.8 | 88.2 | 88.2 |
| *State-of-the-art:* | | | | | | | | | |
| RoBERTa-large | 90.8/90.2 | 67.8 | 96.7 | 92.3/89.8 | 92.2/91.9 | 74.3/90.3 | 95.4 | 88.2 | 88.0 |
| ELECTRA | 91.3/90.8 | 71.7 | 97.1 | 93.1/90.7 | 92.9/92.5 | 75.6/90.8 | 95.8 | 89.8 | 89.2 |
| T5 | 92.2/91.9 | 71.6 | 97.5 | 92.8/90.4 | 93.1/92.8 | 75.1/90.6 | 96.9 | 92.8 | 89.8 |
| ERNIE | 91.9/91.4 | 74.4 | 97.8 | 93.9/91.8 | 93.0/92.6 | 75.2/90.9 | 97.3 | 92.0 | 90.2 |
| DeBERTa | 91.9/91.6 | 71.5 | 97.5 | 94.0/92.0 | 92.9/92.6 | 76.2/90.8 | 99.2 | 93.2 | 90.3 |

### 5.3 Prompt-based Few-shot Experiments

GPT-3 (Brown et al., 2020) has introduced an optimization-free paradigm for few-shot learning for NLP. Without updating the parameters, large LMs can correctly predict the labels of the inputs by conditioning on a prompt, which consists of an instruction, a few labeled instances, and a new unlabeled input. We apply GAL to prompt-based few-shot learning. Specifically, we present k labeled examples as a prompt to GPT-J (Wang and Komatsuzaki, 2021), an open-sourced re-implementation of GPT-3-6B, and generate m synthetic examples, followed by the corresponding labels. Note that to mitigate noisy outputs, the generation of each synthetic example only conditions on the original k labeled examples. Finally, we concatenate the original k examples and m synthetic examples, and conduct a (k + m)-shot learning experiment with GPT-J.
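The procedure above can be sketched as prompt construction around a text-completion function. Everything specific in the code is an assumption for illustration: the `Input:`/`Label:` template, the helper names, and the `generate` callable (a stand-in for a call to an LM such as GPT-J that returns the continuation of a prompt). The one property taken directly from the text is that every synthetic example is generated by conditioning only on the original k labeled examples.

```python
def format_example(text, label=None):
    """Render one example; the label slot is left open for a query."""
    if label is None:
        return f"Input: {text}\nLabel:"
    return f"Input: {text}\nLabel: {label}"

def build_prompt(examples, query=None):
    blocks = [format_example(text, label) for text, label in examples]
    if query is not None:
        blocks.append(format_example(query))
    return "\n\n".join(blocks)

def parse_example(completion):
    """Extract (text, label) from an LM continuation; None if malformed."""
    first = completion.split("\n\n")[0]   # ignore anything after one example
    if "\nLabel:" not in first:
        return None
    text, label = first.split("\nLabel:", 1)
    text, label = text.strip(), label.strip()
    return (text, label) if text and label else None

def gal_few_shot_prompt(labeled, query, m, generate):
    """Build a (k + m)-shot prompt from k labeled and up to m synthetic examples."""
    base = build_prompt(labeled) + "\n\nInput:"
    synthetic = []
    for _ in range(m):
        # Each synthetic example conditions only on the original k examples.
        parsed = parse_example(generate(base))
        if parsed is not None:
            synthetic.append(parsed)
    return build_prompt(labeled + synthetic, query)
```

The returned prompt would then be passed to the same LM to predict the label of `query`. Malformed generations are dropped here for simplicity, so the final prompt may contain fewer than k + m examples.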

Brown et al. (2020) studied a total of 51 few-shot learning tasks. Studying all of these tasks is prohibitively expensive. Thus, we filter tasks by following these two steps. First, since generating m synthetic examples for each test instance is computationally expensive, we exclude tasks that have more than 5k test examples. Second, we filter tasks on which GPT-3-6B achieves a score lower than 65% (please refer to Table H.1 in Brown et al. [2020] for more details). After applying the filtering steps, we use four datasets: SST-2 (Wang et al., 2019c), PIQA (Bisk et al., 2020), COPA, and BoolQ (Wang et al., 2019b) as the testbed. We notice that in order to generate valid synthetic data, GPT-J needs to see at least 4 labeled examples. In addition, at most 16 examples of BoolQ can be fed into GPT-J without truncation. Thus, we set k and m to 4 and 12, respectively. As seen in Table 4, GAL leads to an average improvement of 1.2% over 4-shot learning, and reduces the gap between 4-shot and 16-shot learning. We noticed that the quality of some generated examples is low. We believe the performance of few-shot learning can be further improved with high-quality instances. One solution is to generate many synthetic examples, and select a high-quality subset. Since each test instance conditions on distinct labeled instances, one has to generate different synthetic instances for each test example from GPT-J, which is computationally expensive. Due to such computational constraints, we leave the investigation of data selection strategies to future work.

Table 4:

Few-shot learning results for GPT-J (6B) (Wang and Komatsuzaki, 2021) on four NLP datasets. Accuracy is reported for these datasets.

| Model | SST-2 | PIQA | COPA | BoolQ | Avg |
| --- | --- | --- | --- | --- | --- |
| 4-shot | 89.8 ± 0.8 | 76.0 ± 1.4 | 79.0 ± 1.5 | 64.3 ± 0.8 | 77.3 |
| 8-shot | 91.3 ± 0.8 | 76.2 ± 1.2 | 79.0 ± 1.5 | 66.2 ± 0.8 | 78.2 |
| 16-shot | 92.7 ± 0.6 | 77.0 ± 0.9 | 81.0 ± 1.1 | 66.8 ± 0.8 | 79.4 |
| 4-shot + synthetic 12-shot (GAL) | 91.5 ± 0.7 | 76.7 ± 1.0 | 80.0 ± 1.2 | 65.9 ± 0.8 | 78.5 |
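As a quick arithmetic check on Table 4, the snippet below recomputes the per-row averages from the four per-task accuracies and derives how much of the 4-shot to 16-shot gap is recovered by GAL. The gap-closure ratio is a quantity introduced here for illustration; it is not reported in the paper.

```python
# Per-task accuracies from Table 4 (SST-2, PIQA, COPA, BoolQ).
scores = {
    "4-shot": [89.8, 76.0, 79.0, 64.3],
    "8-shot": [91.3, 76.2, 79.0, 66.2],
    "16-shot": [92.7, 77.0, 81.0, 66.8],
    "GAL": [91.5, 76.7, 80.0, 65.9],
}

# Unrounded averages; the table reports them to one decimal place.
avg = {name: sum(vals) / len(vals) for name, vals in scores.items()}

gain = avg["GAL"] - avg["4-shot"]      # improvement of GAL over 4-shot
gap = avg["16-shot"] - avg["4-shot"]   # headroom between 4-shot and 16-shot
gap_closed = gain / gap                # fraction of that headroom recovered

for name, value in avg.items():
    print(f"{name:>8}: {value:.3f}")
print(f"gain={gain:.2f}, gap={gap:.2f}, gap closed={gap_closed:.1%}")
```

The unrounded gain is 1.25 points; subtracting the rounded averages (78.5 and 77.3) gives the 1.2% quoted in the text. Under this calculation, GAL recovers roughly 60% of the gap between 4-shot and 16-shot learning.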

### 5.4 Ablating Components of GAL on GLUE

We conduct an in-depth study of different components of GAL on GLUE datasets. Unless stated otherwise, we use a RoBERTa-base model with a combination of the original training data and 40 × synthetic data for each self-training experiment.
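The training set used in these self-training experiments can be pictured with the sketch below. It assumes that "40 × synthetic data" means 40 synthetic inputs per original training example; `generate` and `teacher` are hypothetical stand-ins for the fine-tuned LM and the teacher classifier, and the toy implementations exist only to make the example runnable.

```python
import random
from typing import Callable, List, Tuple

Soft = List[float]  # class-probability vector


def one_hot(label: int, num_classes: int) -> Soft:
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def build_gal_training_set(
    labeled: List[Tuple[str, int]],
    generate: Callable[[], str],
    teacher: Callable[[str], Soft],
    num_classes: int,
    ratio: int = 40,
) -> List[Tuple[str, Soft]]:
    # Generate: draw ratio * |labeled| unlabeled inputs from the LM.
    synthetic = [generate() for _ in range(ratio * len(labeled))]
    # Annotate: the teacher assigns a soft pseudo label to each input.
    pseudo = [(x, teacher(x)) for x in synthetic]
    # Learn: the student is later trained on the union of both sets.
    gold = [(x, one_hot(y, num_classes)) for x, y in labeled]
    return gold + pseudo


# Toy stand-ins (assumptions, not the models used in the paper).
rng = random.Random(0)
vocab = ["good", "bad", "fine", "awful"]


def toy_generate() -> str:
    return " ".join(rng.choice(vocab) for _ in range(3))


def toy_teacher(text: str) -> Soft:
    words = text.split()
    positive = sum(w in ("good", "fine") for w in words)
    p = (positive + 1) / (len(words) + 2)  # smoothed positive probability
    return [1.0 - p, p]


labeled = [("good fine good", 1), ("bad awful bad", 0)]
train = build_gal_training_set(labeled, toy_generate, toy_teacher, num_classes=2)
```

With two labeled examples and a ratio of 40, the combined set holds 2 gold and 80 pseudo-labeled examples.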

##### GPT-2 Model Size.

Radford et al. (2019) present several variants of the GPT-2 model, including GPT-2, GPT-2-medium, GPT-2-large, and GPT-2-XL. Larger GPT-2 models yield better perplexity scores and higher generation quality. We use all of these models except GPT-2-XL within the GAL framework to study the impact of the generative model’s quality on downstream task performance. Table 5 shows that, regardless of GPT-2 model size, GAL consistently surpasses the vanilla RoBERTa base. Moreover, the SST-2 and RTE datasets are not sensitive to the capacity of GPT-2, whereas higher-quality synthetic text improves the results on the MRPC and CoLA datasets. We leave the investigation of GPT-2-XL and even larger LMs such as GPT-3 (Brown et al., 2020) to future work.

Table 5:

GAL with various GPT-2 model sizes on GLUE dev sets. NA indicates a RoBERTa base model. We bold the best numbers.

| GPT-2 | SST-2 | RTE | MRPC | CoLA |
| --- | --- | --- | --- | --- |
| NA | 94.8 | 78.8 | 90.1 | 63.6 |
| small | **95.5** | 81.3 | 90.9 | 63.9 |
| medium | 95.3 | 81.3 | 91.3 | 63.7 |
| large | 95.3 | **81.4** | **91.7** | **65.1** |

##### Soft vs. Hard Pseudo Label.

We investigate the use of soft and hard pseudo labels within the GAL framework. The results in Table 6 suggest that GAL using soft pseudo labels is more effective than hard labels on the GLUE benchmark. This finding is compatible with the intuition that soft labels enable measuring the functional similarity of neural networks better (Hinton et al., 2015).

Table 6:

GAL with soft vs. hard pseudo labels on GLUE dev sets. We bold the best numbers.

| Pseudo label | SST-2 | RTE | MRPC | CoLA |
| --- | --- | --- | --- | --- |
| hard | 95.0 | 80.7 | 90.8 | 63.0 |
| soft | **95.3** | **81.4** | **91.7** | **65.1** |
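The difference between the two label types can be illustrated with a small sketch of the student's training loss. It assumes a standard cross-entropy objective against the teacher's output; the teacher distribution and the two candidate students below are made-up values chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], student_logits: List[float]) -> float:
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def harden(soft: List[float]) -> List[float]:
    """Collapse a soft label to a one-hot vector on its argmax."""
    top = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(soft))]


teacher = [0.6, 0.4]                         # an uncertain teacher prediction
matching = [math.log(0.6), math.log(0.4)]    # student that mirrors the teacher
overconfident = [4.0, -4.0]                  # student that is nearly certain

soft_match = cross_entropy(teacher, matching)
soft_over = cross_entropy(teacher, overconfident)
hard_match = cross_entropy(harden(teacher), matching)
hard_over = cross_entropy(harden(teacher), overconfident)
```

Under the soft target, the student that reproduces the teacher's uncertainty has the lower loss; under the hard target, the overconfident student is preferred, so the teacher's uncertainty is no longer visible to the student.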

##### Class-conditional Synthetic Data Generation.

Previous work (Kumar et al., 2020b; Ravuri and Vinyals, 2019) suggests that it is challenging to use labeled synthetic data from class-conditional generative models to boost the accuracy of text and image classifiers. Our theory in Section 4 points to the potential drawback of class-conditional synthetic data. We study this phenomenon empirically by fine-tuning GPT-2 in a class-conditional manner. We then use its synthetic examples in two different ways: 1) as labeled synthetic examples and 2) as unlabeled synthetic examples. Table 7 shows that, when the predefined conditioning labels are used, class-conditional LMs not only underperform unconditional LMs in our GAL framework but are also much worse than the baseline. Nevertheless, if we apply GAL to these examples, treating them as unlabeled and annotating them with pseudo labels, the class-conditional LM is on par with the unconditional one, which corroborates the importance of the annotation step in GAL. We provide more analysis in Appendix A.3.

Table 7:

Synthetic data from class-conditional LMs underperforms GAL and RoBERTa on GLUE dev sets.

| Generative model | Labeled synthetic data | SST-2 | RTE | MRPC | CoLA |
| --- | --- | --- | --- | --- | --- |
| None (baseline) | – | 94.8 | 78.8 | 90.1 | 63.6 |
| Class-conditional LM | ✓ | 92.9 | 74.4 | 86.0 | 58.4 |
| Unconditional LM (GAL) | ✗ | 95.3 | 81.4 | 91.7 | 65.1 |
| Class-conditional LM (GAL) | ✗ | 95.4 | 81.0 | 91.4 | 65.2 |

This work demonstrates that one can leverage synthetic in-domain data generated by powerful pre-trained generative models. For simplicity, we do not employ any filtering step to retain diverse but high-quality data points. However, previous work has shown that advanced filtering approaches can further improve performance (Sohn et al., 2020; Du et al., 2020; Yang et al., 2020). Given that the improvements from self-training are not sizeable, we believe it is worth applying filtering methods to the synthetic data to mitigate the side effects of noisy data points.

Although we examine the effectiveness of GAL on various classification tasks, we focus only on sentence-level tasks. Because of the strong performance already achieved on sentence-level tasks, interest has increasingly shifted to document-level tasks, such as document-level machine translation (Miculicich et al., 2018; Voita et al., 2018; Maruf and Haffari, 2018) and document summarization (Rush et al., 2015; Nallapati et al., 2016). As these tasks suffer from data scarcity, one can leverage GAL to synthesize more data points. However, previous work has shown that GPT-2 has difficulty generating coherent text that requires long-range dependencies (Orbach and Goldberg, 2020; Guan et al., 2020). This limitation may hinder the application of GAL to document-level tasks.

In addition, the label space of the studied tasks is not as complex as that of structured prediction tasks, such as machine translation, dialog systems, and question answering. However, we believe one can smoothly adapt GAL to these tasks as well. Let us consider machine translation (MT) as a canonical structured prediction task. Prior work has shown that one can use (real) monolingual data, in either the source or the target language, through data augmentation (Sennrich et al., 2016) or knowledge distillation (Kim and Rush, 2016) to improve structured prediction tasks. This suggests a promising avenue for future research on using synthetically generated monolingual data to improve MT for specialized domains where even monolingual data is scarce.

Furthermore, Vu et al. (2021a) suggest that one can leverage a retrieval-based approach to obtain monolingual sentences from generic data stores. The retrieved monolingual data is then used to improve translation quality in a domain adaptation setting. This suggests that a GAL-based approach to synthetically generating monolingual text is a promising method for improving MT in specialized domains, and an interesting direction for future research.

We present Generate, Annotate, and Learn (GAL): a framework for self-training and knowledge distillation with generated unlabeled data. We motivate GAL from an expected risk minimization perspective and demonstrate both theoretically and empirically that the use of unconditional generative models for synthetic data generation is more effective than the class-conditional generative models previously used in the literature. GAL leverages advances in large pretrained language models to help supervised learning and can have implications for learning from limited labeled data. GAL significantly improves knowledge distillation and prompt-based few-shot learning. In addition, concurrent work (Gowal et al., 2021) has shown that using generated images can enhance the robustness of image classifiers. We will explore this direction on NLP tasks in the future. Finally, we hope that GAL will stimulate new research on the evaluation and development of large language models.

We would like to thank the anonymous reviewers and action editor André F.T. Martins for their comments and suggestions on this work. The computational resources of this work are partly supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au). This material is partly based on research sponsored by Air Force Research Laboratory and DARPA under agreement number FA8750-19-2-0501. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

### A.1 Datasets

The statistics of GLUE are reported in Table 8.

Table 8:

Summary of the three sets of tasks used for evaluation of GAL. STS-B is a regression task, so #classes is not applicable.

| Dataset | Task | Domain | #train | #dev | #test | #classes |
| --- | --- | --- | --- | --- | --- | --- |
| SST-2 | sentiment analysis | movie reviews | 67k | 872 | 1.8k | |
| QQP | paraphrase | social QA questions | 364k | 40k | 391k | |
| QNLI | QA/natural language inference | Wikipedia | 105k | 5k | 5.4k | |
| RTE | natural language inference | news, Wikipedia | 2.5k | 277 | 3k | |
| MNLI | natural language inference | misc. | 393k | 20k | 20k | |
| MRPC | paraphrase | news | 3.7k | 408 | 1.7k | |
| CoLA | acceptability | misc. | 8.5k | 1043 | 1k | |
| STS-B | sentence similarity | misc. | 5.8k | 15k | 1.4k | − |

### A.2 GPT-2 for Classification

We conducted additional experiments in which we fine-tune GPT-2 as a classifier. We consider two variants of the GPT-2 model. The first variant is the original GPT-2 model (GPT-2-original), pre-trained on open-domain text. The second variant is the GPT-2 model that was fine-tuned on the inputs of each task separately (GPT-2-finetuned); this is the model used to generate task-specific (synthetic) unlabeled data. Finally, we also consider self-training with GAL on top of GPT-2-original. Specifically, we use the GPT-2-finetuned model to synthesize 40× in-domain unlabeled data. Then we apply self-training to GPT-2-original, where the data is a combination of the original labeled data and pseudo-labeled synthetic data. Table 9 suggests that the gains of GAL come from the pseudo-labeled synthetic data, i.e., from both the synthetic unlabeled data and the teacher’s knowledge. Without the generation of synthetic unlabeled data, the domain-specific knowledge embedded in the GPT-2-finetuned model cannot be utilized. As such, the GPT-2-finetuned model is inferior to the GPT-2-original model. Since RoBERTa-large is superior to the GPT-2 models, RoBERTa-large+GAL also significantly outperforms its GPT-2 counterpart.

Table 9:

GLUE test results of using GPT-2 and RoBERTa-large as classification models.

| Model | MNLI | CoLA | SST-2 | MRPC | STS-B | QQP | QNLI | RTE | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-2-original | 85.9/85.6 | 54.8 | 94.5 | 86.9/82.2 | 86.3/85.2 | 72.5/89.3 | 91.2 | 69.8 | 80.9 |
| GPT-2-finetuned | 85.8/85.5 | 40.9 | 94.5 | 87.0/81.0 | 85.6/84.3 | 71.4/88.5 | 91.5 | 69.0 | 78.8 |
| GPT-2-original+GAL | 86.2/85.8 | 55.7 | 94.7 | 87.9/83.4 | 86.9/85.9 | 72.6/89.4 | 91.9 | 70.6 | 81.5 |
| RoBERTa-large | 90.1/89.7 | 63.8 | 96.1 | 91.2/88.3 | 90.9/90.7 | 72.5/89.6 | 94.5 | 85.9 | 86.5 |
| RoBERTa-large + GAL | 90.2/89.8 | 66.2 | 96.4 | 92.0/89.2 | 90.7/90.5 | 73.6/89.9 | 95.0 | 86.3 | 87.1 |

### A.3 Importance of Pseudo-labels

We have argued and demonstrated that using class-conditional generative models to generate labeled synthetic examples is less effective than GAL in Section 3 and Section 5. To further verify this argument, we sample 100 instances from the synthetic RTE dataset generated by the label-prompted GPT2, which serves as the class-conditional LM. We then annotate these examples using a human annotator, a GPT2 classifier, and a RoBERTa classifier. Finally, we compute the Accuracy, F1, Precision, and Recall scores between human labels and GPT2 labels, between human labels and RoBERTa labels, and between human labels and the conditioning labels used by GPT2 when the data was generated. Table 10 shows that the class-conditional LM has difficulty generating sentences that retain the semantics or pragmatics of a specified category, which also corroborates our theoretical analysis in Section 3. On the other hand, discriminative models, such as the GPT2 classifier and the RoBERTa classifier, are able to produce higher-quality labels that correlate better with human annotations.

Table 10:

Performance of GPT2 annotation, RoBERTa annotation and conditioning labels on 100 random examples from the synthetic RTE dataset generated by a class-conditional LM.

| Label type | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| GPT2 | 86.0 | 87.0 | 88.7 | 85.5 |
| RoBERTa | 90.0 | 91.4 | 100.0 | 84.1 |
| conditioning label | 72.0 | 71.4 | 66.0 | 77.8 |
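The four scores in Table 10 follow from a binary confusion matrix computed against the human labels. The sketch below shows the standard definitions and evaluates them on one set of counts that is consistent with the RoBERTa row for 100 examples. The counts are a reconstruction made here, not numbers reported in the paper, and which RTE class is treated as positive is likewise an assumption.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 (as percentages) from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": 100 * accuracy,
        "f1": 100 * f1,
        "precision": 100 * precision,
        "recall": 100 * recall,
    }


# Hypothetical counts: perfect precision forces fp = 0, and 90% accuracy on
# 100 examples leaves 10 errors, which must then all be false negatives.
roberta = binary_metrics(tp=53, fp=0, fn=10, tn=37)
print({name: round(value, 1) for name, value in roberta.items()})
```

The same function can be used to check that each reported F1 is the harmonic mean of the corresponding precision and recall.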

### A.4 Generated Unlabeled Examples Annotated with Pseudo Labels

We provide some synthetic sentences generated by GAL in Tables 11 and 12.

Table 11:

SST-2: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses.

| Example | Text | Label |
| --- | --- | --- |
| Original | are more deeply thought through than in most ‘ right-thinking ’ films | positive |
| KNN 1 | is far more sophisticated, insightful and thought-provoking than his previous films . | positive |
| KNN 2 | is more sophisticated than its more obvious and less-than-dazzling counterparts | positive |
| KNN 3 | is about as well-thought as the idea of a bad hair day, | negative |
| Original | contains no wit, only labored gags | negative |
| KNN 1 | lacks insight, and lacks empathy | negative |
| KNN 2 | has little humor or intelligence | negative |
| KNN 3 | lacks all wit and humanity | negative |

Table 12:

QQP: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses.

| Example | Text | Label |
| --- | --- | --- |
| Original | How is the life of a math student? Could you describe your own experiences? [SEP] Which level of prepration is enough for the exam jlpt5? | not duplicated |
| KNN 1 | What are the best courses for a mechanical engineering student? [SEP] What is the best course to do after completing a B.Tech in mechanical engineering? | not duplicated |
| KNN 2 | How much marks are needed to get through the GATE with electronics? [SEP] What is the average score of the Gate EE exam? What are the cut-offs? | not duplicated |
| KNN 3 | What is the best time table for students to prepare for IAS? [SEP] How can one study for IAS in a best time? | not duplicated |
| Original | How does an IQ test work and what is determined from an IQ test? [SEP] How does IQ test works? | duplicated |
| KNN 1 | What is the average IQ of the U.S. population? [SEP] How does an IQ test work? | not duplicated |
| KNN 2 | Is the Iq test an effective way to measure intelligence? [SEP] How do IQ tests work? | duplicated |
| KNN 3 | How is an IQ test on a scale from 1 to 100 scored? [SEP] How do you get your IQ tested? | not duplicated |

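The neighbor lookup behind Tables 11 and 12 can be sketched as a similarity search over sentence representations. The text does not state which distance is used, so cosine similarity is an assumption here, and the two-dimensional vectors are toy stand-ins for RoBERTa representations.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(
    query: Sequence[float],
    pool: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the k pool items whose vectors are most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy synthetic pool: (sentence id, representation vector).
pool = [
    ("a", [1.0, 0.0]),
    ("b", [0.9, 0.1]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
neighbors = nearest_neighbors([1.0, 0.05], pool, k=3)
```

For a real synthetic corpus, an exhaustive sort like this would be slow, and an approximate nearest-neighbor index would be the more practical choice.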
Steven
Abney
.
2004
.
Understanding the Yarowsky algorithm
.
Computational Linguistics
,
30
(
3
):
365
395
.
A.
Agrawala
.
1970
.
Learning with a probabilistic teacher
.
IEEE Transactions on Information Theory
,
16
(
4
):
373
379
.
Chris
Alberti
,
Daniel
Andor
,
Emily
Pitler
,
Jacob
Devlin
, and
Michael
Collins
.
2019
.
Synthetic QA corpora generation with roundtrip consistency
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
6168
6173
.
Yonatan
Bisk
,
Rowan
Zellers
,
Jianfeng
Gao
, and
Yejin
Choi
.
2020
.
PIQA: Reasoning about physical commonsense in natural language
. In
Proceedings of the AAAI Conference on Artificial Intelligence
, volume
34
, pages
7432
7439
.
Tom B.
Brown
,
Benjamin
Mann
,
Nick
Ryder
,
Melanie
Subbiah
,
Jared
Kaplan
,
Prafulla
Dhariwal
,
Arvind
Neelakantan
,
Pranav
Shyam
,
Girish
Sastry
,
Amanda
,
Sandhini
Agarwal
,
Ariel
Herbert-Voss
,
Gretchen
Krueger
,
Tom
Henighan
,
Rewon
Child
,
Ramesh
,
Daniel M.
Ziegler
,
Jeffrey
Wu
,
Clemens
Winter
,
Christopher
Hesse
,
Mark
Chen
,
Eric
Sigler
,
Mateusz
Litwin
,
Scott
Gray
,
Benjamin
Chess
,
Jack
Clark
,
Christopher
Berner
,
Sam
McCandlish
,
Alec
,
Ilya
Sutskever
, and
Dario
Amodei
.
2020
.
Language models are few-shot learners
.
arXiv:2005.14165
.
Cristian
Buciluă
,
Rich
Caruana
, and
Alexandru
Niculescu-Mizil
.
2006
.
Model compression
.
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, pages
535
541
.
Yair
Carmon
,
Raghunathan
,
Ludwig
Schmidt
,
John C.
Duchi
, and
Percy S.
Liang
.
2019
.
.
Advances in Neural Information Processing Systems
,
32
.
Olivier
Chapelle
,
Jason
Weston
,
Léon
Bottou
, and
Vapnik
.
2001
.
Vicinal risk minimization
.
Advances in Neural Information Processing Systems
.
Ting
Chen
,
Simon
Kornblith
,
Kevin
Swersky
,
Norouzi
, and
Geoffrey
Hinton
.
2020a
.
Big self-supervised models are strong semi-supervised learners
.
NeurIPS
.
Yining
Chen
,
Colin
Wei
,
Ananya
Kumar
, and
Tengyu
Ma
.
2020b
.
Self-training avoids using spurious features under domain shift
. In
Advances in Neural Information Processing Systems 33: Annual Conferenc–12, 2020, virtual
.
Kevin
Clark
,
Minh-Thang
Luong
,
Quoc V.
Le
, and
Christopher D.
Manning
.
2020
.
Electra: Pre-training text encoders as discriminators rather than generators
.
International Conference on Learning Representations
.
Jingfei
Du
,
Edouard
Grave
,
Beliz
Gunel
,
Vishrav
Chaudhary
,
Onur
Celebi
,
Michael
Auli
,
Ves
Stoyanov
, and
Alexis
Conneau
.
2020
.
Self- training improves pre-training for natural language understanding
.
arXiv:2010.02194
.
Jason
Eisner
and
Damianos
Karakos
.
2005
.
Bootstrapping without the boot
. In
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
, pages
395
402
.
Steven Y.
Feng
,
Varun
Gangal
,
Jason
Wei
,
Sarath
Chandar
,
Soroush
Vosoughi
,
Teruko
Mitamura
, and
Eduard
Hovy
.
2021
.
A survey of data augmentation approaches for NLP
. In
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
, pages
968
988
.
S.
Fralick
.
1967
.
Learning to recognize patterns without a teacher
.
IEEE Transactions on Information Theory
.
Tommaso
Furlanello
,
Zachary
Lipton
,
Michael
Tschannen
,
Laurent
Itti
, and
Anima
Anandkumar
.
2018
.
Born again neural networks
.
International Conference on Machine Learning
, pages
1607
1616
.
Leo
Gao
,
Jonathan
Tow
,
Stella
Biderman
,
Sid
Black
,
Anthony
DiPofi
,
Charles
Foster
,
Laurence
Golding
,
Jeffrey
Hsu
,
Kyle
McDonell
,
Niklas
Muennighoff
,
Jason
Phang
,
Laria
Reynolds
,
Eric
Tang
,
Anish
Thite
,
Ben
Wang
,
Kevin
Wang
, and
Andy
Zou
.
2021
.
A framework for few-shot language model evaluation
.
Sven
Gowal
,
Sylvestre-Alvise
Rebuffi
,
Olivia
Wiles
,
Florian
Stimberg
,
Dan Andrei
Calian
, and
Timothy A.
Mann
.
2021
.
Improving robustness using generated data
.
Advances in Neural Information Processing Systems
,
34
.
Jian
Guan
,
Fei
Huang
,
Zhihao
Zhao
,
Xiaoyan
Zhu
, and
Minlie
Huang
.
2020
.
A knowledge- enhanced pretraining model for commonsense story generation
.
Transactions of the Association for Computational Linguistics
,
8
:
93
108
.
Gholamreza
Haffari
and
Anoop
Sarkar
.
2007
.
Analysis of semi-supervised learning with the yarowsky algorithm
. In
UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007
, pages
159
166
.
AUAI Press
.
Pengcheng
He
,
Xiaodong
Liu
,
Jianfeng
Gao
, and
Weizhu
Chen
.
2020
.
Deberta: Decoding- enhanced BERT with disentangled attention
.
arXiv:2006.03654
.
Danny
Hernandez
,
Jared
Kaplan
,
Tom
Henighan
, and
Sam
McCandlish
.
2021
.
Scaling laws for transfer
.
Geoffrey
Hinton
,
Oriol
Vinyals
, and
Jeff
Dean
.
2015
.
Distilling the knowledge in a neural network
.
arXiv:1503.02531
.
Xiaoqi
Jiao
,
Yichun
Yin
,
Lifeng
Shang
,
Xin
Jiang
,
Xiao
Chen
,
Linlin
Li
,
Fang
Wang
, and
Qun
Liu
.
2019
.
TinyBERT: Distilling BERT for natural language understanding
.
arXiv:1909 .10351
.
Yoon
Kim
and
Alexander M.
Rush
.
2016
.
Sequence-level knowledge distillation
. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
, pages
1317
1327
.
Sosuke
Kobayashi
.
2018
.
Contextual augmentation: Data augmentation by words with paradigmatic relations
. In
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
, pages
452
457
,
New Orleans, Louisiana
.
Association for Computational Linguistics
.
Ananya
Kumar
,
Tengyu
Ma
, and
Percy
Liang
.
2020a
.
. In
Proceedings of the 37th International Conference on Machine Learning
,
volume 119 of Proceedings of Machine Learning Research
, pages
5468
5479
.
PMLR
.
Varun
Kumar
,
Ashutosh
Choudhary
, and
Eunah
Cho
.
2020b
.
Data augmentation using pre- trained transformer models
.
arXiv:2003.02245
.
Jared
Lichtarge
,
Chris
Alberti
,
Shankar
Kumar
,
Noam
Shazeer
,
Niki
Parmar
, and
Simon
Tong
.
2019
.
Corpora generation for grammatical error correction
.
arXiv:1904.05780
.
Yinhan
Liu
,
Myle
Ott
,
Naman
Goyal
,
Jingfei
Du
,
Mandar
Joshi
,
Danqi
Chen
,
Omer
Levy
,
Mike
Lewis
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2019
.
Roberta: A robustly optimized bert pretraining approach
.
arXiv:1907.11692
.
Sameen
Maruf
and
Gholamreza
Haffari
.
2018
.
Document context neural machine translation with memory networks
. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1275
1284
.
Alana
Marzoev
,
Samuel
,
M.
Frans Kaashoek
,
Michael J.
Cafarella
, and
Jacob
Andreas
.
2020
.
Unnatural language processing: Bridging the gap between synthetic and natural language data
.
ArXiv
,
abs/2004.13645
.
Lesly
Miculicich
,
Dhananjay
Ram
,
Nikolaos
Pappas
, and
James
Henderson
.
2018
.
Document- level neural machine translation with hierarchical attention networks
. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pages
2947
2954
,
Brussels, Belgium
.
Association for Computational Linguistics
.
Hossein
Mobahi
,
Farajtabar
, and
Peter L.
Bartlett
.
2020
.
Self-distillation amplifies regularization in hilbert space
. In
Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual
.
Ramesh
Nallapati
,
Bowen
Zhou
,
Caglar
Gulcehre
,
Bing
Xiang
.
2016
.
Abstractive text summarization using sequence-to-sequence RNNs and beyond
. In
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning
, pages
280
290
.
Nathan
Ng
,
Kyra
Yee
,
Alexei
Baevski
,
Myle
Ott
,
Michael
Auli
, and
Sergey
Edunov
.
2019
.
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
314
319
.
Norouzi
,
David J.
Fleet
, and
Norouzi
.
2020
.
Exemplar vaes for exemplar based generation and data augmentation
.
arXiv:2004.04795
.
Eyal
Orbach
and
Yoav
Goldberg
.
2020
.
Facts2Story: Controlling text generation by key facts
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
2329
2345
,
Barcelona, Spain (Online)
.
International Committee on Computational Linguistics
.
Myle
Ott
,
Sergey
Edunov
,
Alexei
Baevski
,
Angela
Fan
,
Sam
Gross
,
Nathan
Ng
,
David
Grangier
, and
Michael
Auli
.
2019
.
fairseq: A fast, extensible toolkit for sequence modeling
.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)
, pages
48
53
.
Samet
Oymak
and
Gulcu
.
2020
.
Statistical and algorithmic insights for semi- supervised learning with self-training
.
CoRR
,
abs/2006.11006
.
Alec
,
Jeffrey
Wu
,
Rewon
Child
,
David
Luan
,
Dario
Amodei
, and
Ilya
Sutskever
.
2019
.
Language models are unsupervised multitask learners
Colin
Raffel
,
Noam
Shazeer
,
Roberts
,
Katherine
Lee
,
Sharan
Narang
,
Michael
Matena
,
Yanqi
Zhou
,
Wei
Li
, and
Peter J.
Liu
.
2020
.
Exploring the limits of transfer learning with a unified text-to-text transformer
.
Journal of Machine Learning Research
,
21
:
1
67
.
Raghunathan
,
Sang Michael
Xie
,
Fanny
Yang
,
John
Duchi
, and
Percy
Liang
.
2020
.
Understanding and mitigating the tradeoff between robustness and accuracy
. In
Proceedings of the 37th International Conference on Machine Learning
,
volume 119 of Proceedings of Machine Learning Research
, pages
7909
7919
.
PMLR
.
Rashid
,
Vasileios
Lioutas
, and
Mehdi
.
2021
.
.
arXiv preprint arXiv:2105.05912
.
Suman
Ravuri
and
Oriol
Vinyals
.
2019
.
Classification accuracy score for conditional generative models
.
Advances in Neural Information Processing Systems
, pages
12268
12279
.
Alexander M.
Rush
,
Sumit
Chopra
, and
Jason
Weston
.
2015
.
A neural attention model for abstractive sentence summarization
. In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
, pages
379
389
.
Victor
Sanh
,
Lysandre
Debut
,
Julien
Chaumond
, and
Thomas
Wolf
.
2019
.
DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter
.
ArXiv
,
abs/1910.01108
.
H.
Scudder
.
1965
.
Probability of error of some adaptive pattern-recognition machines
.
IEEE Transactions on Information Theory
.
Rico
Sennrich
,
Barry
, and
Alexandra
Birch
.
2016
.
Improving neural machine translation models with monolingual data
. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
86
96
,
Berlin, Germany
.
Association for Computational Linguistics
.
Dinghan
Shen
,
Mingzhi
Zheng
,
Yelong
Shen
,
Yanru
Qu
, and
Weizhu
Chen
.
2020
.
A simple but tough-to-beat data augmentation approach for natural language understanding and generation
.
arXiv preprint arXiv:2009.13818
.
Sam
Shleifer
.
2019
.
Low resource text classification with ulmfit and backtranslation
.
arXiv preprint arXiv:1903.09244
.
Kihyuk
Sohn
,
David
Berthelot
,
Chun-Liang
Li
,
Zizhao
Zhang
,
Nicholas
Carlini
,
Ekin D.
Cubuk
,
Alex
Kurakin
,
Han
Zhang
, and
Colin
Raffel
.
2020
.
Fixmatch: Simplifying semi-supervised learning with consistency and confidence
.
arXiv:2001.07685
.
Siqi
Sun
,
Yu
Cheng
,
Zhe
Gan
, and
Jingjing
Liu
.
2019a
.
Patient knowledge distillation for BERT model compression
.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
, pages
4314
4323
.
Yu
Sun
,
Shuohuan
Wang
,
Yukun
Li
,
Shikun
Feng
,
Xuyi
Chen
,
Han
Zhang
,
Xin
Tian
,
Danxiang
Zhu
,
Hao
Tian
, and
Hua
Wu
.
2019b
.
Ernie: Enhanced representation through knowledge integration
.
arXiv preprint arXiv:1904.09223
.
Nicola
Ueffing
,
Gholamreza
Haffari
, and
Anoop
Sarkar
.
2007
.
Transductive learning for statistical machine translation
. In
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics
, pages
25
32
,
Prague, Czech Republic
.
Association for Computational Linguistics
.
Vapnik
.
1992
.
Principles of risk minimization for learning theory
.
Advances in Neural Information Processing Systems
.
Elena
Voita
,
Pavel
Serdyukov
,
Rico
Sennrich
, and
Ivan
Titov
.
2018
.
Context-aware neural machine translation learns anaphora resolution
. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1264
1274
.
Thuy
Vu
,
Xuanli
He
,
Dinh
Phung
, and
Gholamreza
Haffari
.
2021a
.
Generalised unsupervised domain adaptation of neural machine translation with cross-lingual data selection
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
3335
3346
.
Tu
Vu
,
Minh-Thang
Luong
,
Quoc
Le
,
Simon
, and
Mohit
Iyyer
.
2021b
.
Strata: Self- training with task augmentation for better few- shot learning
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
5715
5731
.
Alex
Wang
,
Jan
Hula
,
Patrick
Xia
,
Raghavendra
Pappagari
,
R.
Thomas McCoy
,
Roma
Patel
,
Najoung
Kim
,
Ian
Tenney
,
Yinghui
Huang
,
Katherin
Yu
,
Shuning
Jin
,
Berlin
Chen
,
Benjamin Van
Durme
,
Edouard
Grave
,
Ellie
Pavlick
, and
Samuel R.
Bowman
.
2019a
.
Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
4465
4476
.
Alex
Wang
,
Pruksachatkun
,
Nikita
Nangia
,
Amanpreet
Singh
,
Julian
Michael
,
Felix
Hill
,
Omer
Levy
, and
Samuel R.
Bowman
.
2019b
.
SuperGLUE: A stickier benchmark for general- purpose language understanding systems
.
arXiv: 1905.00537
.
Alex
Wang
,
Amanpreet
Singh
,
Julian
Michael
,
Felix
Hill
,
Omer
Levy
, and
Samuel R.
Bowman
.
2019c
.
GLUE: A multi-task benchmark and analysis platform for natural language understanding
.
International Conference on Learning Representations
.
Bailin
Wang
,
Wenpeng
Yin
,
Xi
Victoria Lin
, and
Caiming
Xiong
.
2021
.
Learning to synthesize data for semantic parsing
. In
Proceedings of the Meeting of the North-American Chapter of Association for Computational Linguistics (NAACL)
.
Ben
Wang
and
Aran
Komatsuzaki
.
2021
.
GPT-J-6B: A 6 billion parameter autoregressive language model
. https://github.com/kingoflolz/mesh-transformer-jax.
William Yang
Wang
and
Diyi
Yang
.
2015
.
That’s so annoying!!!: A lexical and frame- semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using# petpeeve tweets
.
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
, pages
2557
2563
.
Xiaojie
Wang
,
Rui
Zhang
,
Yu
Sun
, and
Jianzhong
Qi
.
2018
.
Kdgan: Knowledge distillation with generative adversarial networks.
NeurIPS
.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Linguistics.

Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. 2021. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Conditional BERT contextual augmentation. International Conference on Computational Science, pages 84–95. Springer.

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves ImageNet classification. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695.

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by progressive module replacing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7859–7869.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-DAug: Generative data augmentation for commonsense reasoning. arXiv:2004.11546.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. ICLR.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond empirical risk minimization. ICLR.

Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713–3722.

## Author notes

Action Editor: André F.T. Martins

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.