An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models

Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset. Intrigued by these results, we find that the key to their success is generalization from a small number of counterexamples where the spurious correlations do not hold. When such minority examples are scarce, pre-trained models perform as poorly as models trained from scratch. When counterexamples are extremely scarce, we propose to use multi-task learning (MTL) to improve generalization. Our experiments on natural language inference and paraphrase identification show that MTL with the right auxiliary tasks significantly improves performance on challenging examples without hurting the in-distribution performance. Further, we show that the gain from MTL mainly comes from improved generalization from the minority examples. Our results highlight the importance of data diversity for overcoming spurious correlations.


Introduction
A key challenge in building robust NLP models is the gap between the limited linguistic variation in the training data and the diversity of real-world language. Models trained on a specific dataset are thus likely to rely on spurious correlations: prediction rules that work for the majority of examples but do not hold in general. For example, in natural language inference (NLI) tasks, previous work has found that models learned on notable benchmarks achieve high accuracy by associating high word overlap between the premise and the hypothesis with entailment (Dasgupta et al., 2018; McCoy et al., 2019). Consequently, these models perform poorly on so-called challenging or adversarial datasets where such correlations no longer hold (Glockner et al., 2018; McCoy et al., 2019; Nie et al., 2019). This issue has also been referred to as annotation artifacts (Gururangan et al., 2018), dataset bias (Clark et al., 2019), and group shift (Oren et al., 2019; Sagawa et al., 2020) in the literature. (Code is available at https://github.com/lifu-tu/Study-NLP-Robustness.)
Most current methods rely on prior knowledge of spurious correlations in the dataset and tend to suffer from a trade-off between in-distribution accuracy on the independent and identically distributed (i.i.d.) test set and robust accuracy on the challenging dataset. Nevertheless, recent empirical results have suggested that self-supervised pre-training improves robust accuracy without using any task-specific knowledge or incurring an in-distribution accuracy drop (Hendrycks et al., 2019, 2020). In this paper, we investigate how and when pre-trained language models such as BERT improve performance on challenging datasets. Our key finding is that pre-trained models are more robust to spurious correlations because they can generalize from the minority of training examples that counter the spurious pattern, e.g., non-entailment examples with high premise-hypothesis word overlap. Specifically, removing these counterexamples from the training set significantly hurts performance on the challenging datasets. In addition, larger model size, more pre-training data, and longer fine-tuning further improve robust accuracy. Nevertheless, pre-trained models still suffer from spurious correlations when there are too few counterexamples. In this extreme case, we empirically show that multi-task learning (MTL) improves robust accuracy by improving generalization from the minority examples, even though previous work has suggested that MTL yields limited gains in the i.i.d. setting (Søgaard and Goldberg, 2016; Hashimoto et al., 2017). This work sheds light on the effectiveness of pre-training for robustness to spurious correlations. Our results highlight the importance of data diversity, even when the variations are imbalanced. The improvement from MTL also suggests that traditional techniques that improve generalization in the i.i.d. setting can improve out-of-distribution generalization through the minority examples.

Datasets
We focus on two natural language understanding tasks, NLI and paraphrase identification (PI). Both have large-scale benchmark datasets with around 400k examples. While recent models have achieved near-human performance on these benchmarks, the challenging datasets exploiting spurious correlations bring the performance of state-of-the-art models below random guessing. We summarize the datasets used for our analysis in Table 1.

NLI. Given a premise sentence and a hypothesis sentence, the task is to predict whether the hypothesis is entailed by, neutral with, or contradicts the premise. MultiNLI (MNLI) (Williams et al., 2017) is the most widely used benchmark for NLI, and it is also the most thoroughly studied in terms of spurious correlations. It was collected using the same crowdsourcing protocol as its predecessor SNLI (Bowman et al., 2015). As the challenging dataset, we use HANS (McCoy et al., 2019), which targets the spurious correlation between high premise-hypothesis word overlap and entailment using templated examples in three categories: lexical overlap, subsequence, and constituent.

PI. Given two sentences, the task is to predict whether they are paraphrases or not. Quora Question Pairs (QQP) (Iyer et al., 2017) is one of the largest PI datasets. As the challenging dataset, we use PAWS (Zhang et al., 2019), which contains sentence pairs with high word overlap but different meanings, created through word swapping and back-translation. In addition to PAWS QQP, which is created from sentences in QQP, they also released PAWS Wiki, created from Wikipedia sentences.

Pre-training Improves Robust Accuracy
Recent results have shown that pre-trained models appear to improve performance on challenging examples over models trained from scratch (Yaghoobzadeh et al., 2019;Kaushik et al., 2020). In this section, we confirm this observation by thorough experiments on different pre-trained models and motivate our inquiries.
Models. We compare pre-trained models of different sizes trained on different amounts of pre-training data. Specifically, we use the BERT BASE (110M parameters) and BERT LARGE (340M parameters) models implemented in GluonNLP (Guo et al., 2020), pre-trained on 16GB of text (Devlin et al., 2019). To investigate the effect of the amount of pre-training data, we also experiment with the RoBERTa BASE and RoBERTa LARGE models (Liu et al., 2019d), which have the same architectures as BERT but were trained on ten times as much text (about 160GB). To ablate the effect of pre-training, we also include a BERT BASE model with random initialization, BERT scratch.
Fine-tuning. We fine-tuned all models for 20 epochs and selected the best model based on the in-distribution dev set. We used the Adam optimizer with a learning rate of 2e-5, L2 weight decay of 0.01, and batch sizes of 32 and 16 for base and large models, respectively. Weights of BERT scratch and the last layer (classifier) of the pre-trained models are initialized from a normal distribution with zero mean and 0.02 variance. All experiments are run with 5 random seeds and average values are reported.
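To make this recipe concrete, the following is a minimal sketch of the fine-tuning loop in PyTorch with the Hugging Face transformers library (an illustrative substitute for our GluonNLP implementation, not our exact code). The data loaders are assumed to yield tokenized batches that include labels.

import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# Hyperparameters from the text: Adam, lr 2e-5, weight decay 0.01,
# batch size 32 (base) / 16 (large), 20 epochs, select on dev accuracy.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # 3-way NLI labels
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

def evaluate(loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for batch in loader:
            preds = model(**batch).logits.argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
    return correct / total

def fine_tune(train_loader, dev_loader, num_epochs=20):
    best_acc, best_state = 0.0, None
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            model(**batch).loss.backward()  # labels in batch -> loss
            optimizer.step()
        acc = evaluate(dev_loader)  # in-distribution dev set
        if acc > best_acc:
            best_acc, best_state = acc, model.state_dict()
    return best_state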
Observations and inquiries. Table 2 shows results for NLI and PI. As expected, pre-trained models improve performance on the in-distribution test sets significantly. On the challenging datasets, we make two key observations. First, while pre-trained models improve performance on the challenging datasets, the improvement is not consistent across datasets. Specifically, the improvement on PAWS QQP is less promising than on HANS. While larger models (large vs. base) and more pre-training data (RoBERTa vs. BERT) yield a further improvement of 5 to 10 accuracy points on HANS, the improvement on PAWS QQP is marginal. Second, even though three to four epochs of fine-tuning is typically sufficient for in-distribution data, we observe that longer fine-tuning significantly improves results on challenging examples (see BERT BASE ours vs. prior in Table 2). As shown in Figure 1, while the accuracy on the MNLI and QQP dev sets saturates after three epochs, the performance on the corresponding challenging datasets keeps increasing until around the tenth epoch, with more than 30% improvement.
The above observations motivate us to ask the following questions:
1. How do pre-trained models generalize to out-of-distribution data?
2. When do they generalize well, given the inconsistent improvements?
3. What role does longer fine-tuning play?
We provide empirical answers to these questions in the next section and show that the answers all relate to a small number of counterexamples in the training data.

Pre-training Improves Robustness to Data Imbalance
One common impression is that the diversity of the large pre-training corpora allows pre-trained models to generalize better to out-of-distribution data. Here we show that while pre-training improves generalization, it does not enable extrapolation to unseen patterns. Instead, pre-trained models generalize better from minority patterns in the training set. Importantly, we notice that examples in HANS and PAWS are not completely uncovered by the training data, but belong to minority groups. For example, in MNLI there are 727 HANS-like non-entailment examples where all words in the hypothesis also occur in the premise; in QQP there are 247 PAWS-like non-paraphrase examples where the two sentences have the same bag of words. We refer to these examples that counter the spurious correlations as minority examples. We hypothesize that pre-trained models are more robust to group imbalance, thus generalizing well from the minority groups.
To verify our hypothesis, we remove minority examples during training and observe the effect on robust accuracy. Specifically, for NLI we sort non-entailment (contradiction and neutral) examples in MNLI by their premise-hypothesis overlap, defined as the percentage of hypothesis words that also appear in the premise. We then remove increasing amounts of these examples in the sorted order, as sketched below.
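The sketch below illustrates this procedure; the whitespace tokenization and the dataset bookkeeping are simplifying assumptions on our part.

def overlap(premise: str, hypothesis: str) -> float:
    """Percentage of hypothesis words that also appear in the premise."""
    premise_words = set(premise.lower().split())
    hyp_words = hypothesis.lower().split()
    return sum(w in premise_words for w in hyp_words) / len(hyp_words)

def remove_counterexamples(examples, num_to_remove):
    """Remove the non-entailment examples with the highest
    premise-hypothesis overlap (the HANS-like minority examples)."""
    non_entailment = [ex for ex in examples if ex["label"] != "entailment"]
    non_entailment.sort(
        key=lambda ex: overlap(ex["premise"], ex["hypothesis"]),
        reverse=True)
    removed = {id(ex) for ex in non_entailment[:num_to_remove]}
    return [ex for ex in examples if id(ex) not in removed]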
As shown in Figure 2, all models have significantly worse accuracy on HANS as more counterexamples are removed, while maintaining their original accuracy when the same amounts of random training examples are removed. With 6.4% of the counterexamples removed, the performance of most pre-trained models is near-random, as poor as models trained from scratch. Interestingly, larger models with more pre-training data (RoBERTa LARGE) appear to be slightly more robust to the increased level of imbalance.
Takeaway. These results reveal that pre-training improves robust accuracy by improving the i.i.d. accuracy on minority groups, highlighting the importance of increasing data diversity when creating benchmarks. Further, pre-trained models still suffer from spurious correlations when the minority examples are scarce. To enable extrapolation, we may need additional inductive biases (Nye et al., 2019) or new learning algorithms (Arjovsky et al., 2019).

Minority Patterns Require Varying Amounts of Training Data
Given that pre-trained models generalize better from minority examples, why do we not see similar improvement on PAWS QQP even though QQP also contains counterexamples? Unlike HANS examples, which are generated from a handful of templates, PAWS examples are generated by swapping words in a sentence followed by human inspection. They often require recognizing nuanced syntactic differences between two sentences with a small edit distance; for example, compare "What's classy if you're poor, but trashy if you're rich?" and "What's classy if you're rich, but trashy if you're poor?". Therefore, we posit that more samples are needed to reach good performance on PAWS-like examples.
To test this hypothesis, we plot learning curves by fine-tuning pre-trained models on the challenging datasets directly (Liu et al., 2019b). We train on increasing amounts of PAWS QQP examples and randomly sample the same number of training examples from HANS; the rest is used as dev/test sets for evaluation. In Figure 3, we see that all models rapidly reach 100% accuracy on HANS. On PAWS, however, accuracy increases slowly, and the models struggle to reach even 90% accuracy with the full training set. This suggests that the number of minority examples in QQP may not be sufficient for reliably estimating the model parameters.
To understand qualitatively why PAWS examples are difficult to learn, we compare the sentence length and constituency parse tree height of examples in HANS and PAWS. We find that PAWS contains longer and syntactically more complex sentences, with an average length of 20.7 words and average parse tree height of 11.4, compared to 9.2 and 7.5 on HANS. Figure 4 shows that the accuracy of BERT BASE and RoBERTa BASE on PAWS QQP decreases as example length and parse tree height increase.
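As an illustration, these complexity statistics can be computed as in the sketch below, which assumes bracketed constituency parses are available from an off-the-shelf parser and uses nltk to measure tree height; the toy parse is hypothetical.

from nltk import Tree

def complexity_stats(sentence: str, parse: str):
    """Return (length in words, constituency parse tree height)."""
    length = len(sentence.split())
    height = Tree.fromstring(parse).height()
    return length, height

# Toy example with a hypothetical parse string:
length, height = complexity_stats(
    "the doctor saw the lawyer",
    "(S (NP (DT the) (NN doctor)) (VP (VBD saw) (NP (DT the) (NN lawyer))))")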
Takeaway. We have shown that the inconsistent improvements on different challenging datasets result from the same mechanism: pre-trained models improve robust accuracy by generalizing from minority examples. However, perhaps unsurprisingly, different minority patterns may require varying amounts of training data. This also poses a potential challenge for using data augmentation to tackle spurious correlations.

Minority Examples Require Longer Fine-tuning
In the previous section, we showed that longer fine-tuning improves accuracy on the challenging datasets. To understand why, we track the training loss and dev accuracy of the minority examples over the course of fine-tuning. First, we see that the training loss of the minority examples decreases more slowly than the average loss, taking more than 15 epochs to reach near-zero loss. Second, the dev accuracy curves show that the accuracy on minority examples plateaus later, around epoch 10, whereas the average accuracy stops increasing around epoch 5. In addition, based on the accuracy curves, BERT does not appear to overfit with additional fine-tuning. Similarly, concurrent work has found that longer fine-tuning improves few-sample performance.
Takeaway. While longer fine-tuning does not help in-distribution accuracy, we find that it improves performance on the minority groups. This suggests that selecting models or early stopping based on i.i.d. dev set performance is insufficient, and that we need new model selection criteria for robustness; one simple possibility is sketched below.
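The hedged sketch below selects checkpoints by worst-group dev accuracy, treating the minority counterexamples as their own group; the evaluate helper and the majority/minority split of the dev set are assumptions for illustration, not part of our experiments.

def select_checkpoint(checkpoints, dev_majority, dev_minority, evaluate):
    """Pick the checkpoint with the best worst-group dev accuracy,
    rather than the best average dev accuracy."""
    best, best_score = None, -1.0
    for ckpt in checkpoints:
        worst_group_acc = min(evaluate(ckpt, dev_majority),
                              evaluate(ckpt, dev_minority))
        if worst_group_acc > best_score:
            best, best_score = ckpt, worst_group_acc
    return best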

Improving Generalization through Multi-task Learning
Our results on minority examples show that increasing the number of counterexamples to spurious correlations helps improve model robustness. An obvious solution is thus data augmentation; in fact, McCoy et al. (2019) and others show that adding a small amount of challenging examples to the training set significantly improves performance on HANS and PAWS. However, these methods often require task-specific knowledge of the spurious correlations and heavy use of rules to generate the counterexamples. Instead of adding examples with specific patterns, we investigate the effect of aggregating generic data from various sources through multi-task learning (MTL). It has been shown that MTL reduces the sample complexity of individual tasks compared to single-task learning (Caruana, 1997; Baxter, 2000; Maurer et al., 2016), and thus may further improve the generalization capability of pre-trained models, especially on the minority groups.

Multi-task Learning
We learn jointly from datasets from different sources, where one is the target dataset to be evaluated on and the rest are auxiliary datasets. The target dataset and the auxiliary datasets can belong to either the same task, e.g., MNLI and SNLI, or different but related tasks, e.g., MNLI and QQP. All datasets share the representation given by the pre-trained model, and we use a separate linear classification layer for each dataset. The learning objective is a weighted sum of the average losses on each dataset. We set the weight to 1 for all datasets, which is equivalent to sampling examples from each dataset proportionally to its size. During training, we sample mini-batches from each dataset sequentially and use the same optimization hyperparameters as in single-task fine-tuning (Section 3), except for smaller batch sizes due to memory constraints.
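A minimal sketch of this setup is shown below: a shared encoder with one linear head per dataset, and mini-batches drawn from each dataset in turn. The encoder interface and the infinite data iterators are assumptions for illustration.

import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared pre-trained encoder with a separate classifier per dataset."""
    def __init__(self, encoder, hidden_size, num_labels_per_dataset):
        super().__init__()
        self.encoder = encoder  # e.g., a pre-trained BERT encoder
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, n) for n in num_labels_per_dataset])

    def forward(self, batch, dataset_id):
        pooled = self.encoder(**batch)  # assumed to return a pooled vector
        return self.heads[dataset_id](pooled)

def mtl_round(model, optimizer, loaders, loss_fn=nn.CrossEntropyLoss()):
    """One round of MTL: a mini-batch from each dataset sequentially.
    All loss weights are 1, matching the objective described above."""
    model.train()
    for dataset_id, loader in enumerate(loaders):
        batch = next(loader)  # loaders are assumed to be infinite iterators
        labels = batch.pop("labels")
        loss = loss_fn(model(batch, dataset_id), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()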
Auxiliary datasets. We consider NLI and PI as related tasks since both require understanding and comparing the meaning of two sentences. Therefore, we use both the benchmark datasets and the challenging datasets for NLI and PI as our auxiliary datasets. The hope is that benchmark data from related tasks transfers useful knowledge across tasks, improving generalization on minority examples, while the challenging datasets countering specific spurious correlations further improve generalization on the corresponding minority examples. We analyze the contributions of the two types of auxiliary data in Section 5.2. The MTL training setup is shown in Table 4. Details on the auxiliary datasets are described in Section 2.1.

Results
MTL improves robust accuracy. Our main MTL results are shown in Table 3. MTL increases accuracy on the challenging datasets across tasks without hurting in-distribution performance, especially when the minority examples in the target dataset are scarce (e.g., PAWS). While prior work has shown limited success of MTL when tested on in-distribution data (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Raffel et al., 2019), our results demonstrate its value for out-of-distribution generalization.
On HANS, MTL improves the accuracy significantly for BERT BASE but not for RoBERTa BASE . To confirm the result, we additionally experimented with RoBERTa LARGE and obtained consistent results: MTL achieves an accuracy of 75.7 (2.1) on HANS, similar to the STL result, 77.1 (1.6). One potential explanation is that RoBERTa is already sufficient for providing good generalization from minority examples in MNLI.
In addition, both MTL and RoBERTa BASE yield the biggest improvement on lexical overlap, as shown in the results on HANS by category (Table 5). We believe the reason is that lexical overlap is the most representative pattern among high-overlap non-entailment training examples: in fact, 85% of the 727 HANS-like examples belong to the lexical overlap category. This suggests that further improvement on HANS may require better data coverage of the other categories.

Table 3: Comparison between models fine-tuned with multi-task (MTL) and single-task (STL) learning. MTL improves robust accuracy on challenging datasets. We ran t-tests on the mean accuracies of STL and MTL over five runs; the larger number is bolded when the difference is significant at p < 0.001.

Table 5: MTL results on different categories of HANS: lexical overlap (O), constituent (C), and subsequence (S). Both auxiliary data (MTL) and larger pre-training data (RoBERTa) improve accuracies mainly on lexical overlap.

On PAWS, MTL consistently yields large improvements across pre-trained models. Given that QQP has fewer minority examples resembling the patterns in PAWS, which are also harder to learn (Section 4.2), these results show that MTL is an effective way to improve generalization when the minority examples are scarce. Next, we investigate why MTL is helpful.

Table 6: Results of the ablation study on auxiliary datasets using BERT BASE with MNLI as the target task. While the in-distribution performance is hardly affected when a specific auxiliary dataset is excluded, performance on the challenging data varies (difference shown in ∆).
Improved generalization from minority examples. We are interested in how MTL helps generalization from minority examples. One possible explanation is that the challenging data in the auxiliary datasets prevent the model from learning spurious patterns. However, the ablation studies on auxiliary datasets in Table 6 and Table 7 show that the challenging datasets are not much more helpful than the benchmark datasets. The other possible explanation is that MTL reduces the sample complexity of learning from the minority examples in the target dataset. To verify this, we remove minority examples from either the auxiliary or the target datasets and compare the effect on robust accuracy. We focus on PI because MTL shows the largest improvement there. In Table 8, removing minority examples from the target dataset hurts robust accuracy substantially more than removing them from the auxiliary datasets.

Takeaway. These results suggest that neither pre-training nor MTL enables extrapolation; instead, they improve generalization from minority examples in the (target) training set. It is thus important to increase the coverage of diverse patterns in the data to improve robustness to spurious correlations.

Related Work
Pre-training and robustness. Recently, there has been increasing interest in studying the effect of pre-training on robustness. Hendrycks et al. (2019, 2020) show that pre-training improves model robustness to label noise and class imbalance, and improves out-of-distribution detection. In cross-domain question answering, ensembles of different pre-trained models have been shown to significantly improve performance on out-of-domain data. In this work, we explain why pre-trained models appear to improve out-of-distribution robustness and point out the importance of minority examples in the training data.

Data augmentation. The most straightforward way to improve model robustness to out-of-distribution data is to augment the training set with examples from the target distribution. Recent work has shown that augmenting syntactically rich examples improves robust accuracy on NLI (Min et al., 2020). Similarly, counterfactual augmentation aims to identify parts of the input that change the label when intervened upon, thus avoiding learning spurious features (Goyal et al., 2019; Kaushik et al., 2020). Finally, data recombination has been used to achieve compositional generalization (Jia and Liang, 2016; Andreas, 2020). However, data augmentation techniques largely rely on prior knowledge of the spurious correlations or on human effort. In addition, as shown in Section 4.2 and in concurrent work (Jha et al., 2020), it is often unclear how much augmented data is needed to learn a pattern. Our work shows promise in adding generic pre-training data or related auxiliary data (through MTL) without assumptions on the target distribution.
Robust learning algorithms. Several recent works propose new learning algorithms that are robust to spurious correlations in NLI datasets (Clark et al., 2019; Yaghoobzadeh et al., 2019; Zhou and Bansal, 2020; Sagawa et al., 2020; Mahabadi et al., 2020; Utama et al., 2020). They rely on prior knowledge to focus on "harder" examples that do not admit shortcuts during training. One weakness of these methods is their arguably strong assumption of knowing the spurious correlations a priori. Our work provides evidence that large amounts of generic data can be used to improve out-of-distribution generalization. Similarly, recent work has shown that semi-supervised learning with generic auxiliary data improves model robustness to adversarial examples (Schmidt et al., 2018; Carmon et al., 2019).
Transfer learning. Robust learning is also related to domain adaptation and transfer learning, since both aim to learn from one distribution and achieve good performance on a different but related target distribution. Data selection and reweighting are common techniques in domain adaptation. Similar to our findings on minority examples, source examples similar to the target data have been found to be helpful for transfer (Ruder and Plank, 2017; Liu et al., 2019a). In addition, many works have shown that MTL improves model performance on out-of-domain datasets (Ruder, 2017; Liu et al., 2019c). Concurrent work (Akula et al., 2020) shows that MTL improves robustness to adversarial examples in visual grounding. In this work, we further connect the effectiveness of MTL to generalization from minority examples.

Conclusion and Discussion
Our study is motivated by recent observations on the robustness of large-scale pre-trained transformers. Specifically, we focus on robust accuracy on challenging datasets designed to expose spurious correlations learned by the model. Our analysis reveals that pre-training improves robustness by better generalizing from the minority of examples that counter dominant spurious patterns in the training set. In addition, we show that more pre-training data, larger model size, and additional auxiliary data through MTL further improve robustness, especially when minority examples are scarce. Our work suggests that it is possible to go beyond the robustness-accuracy trade-off with more data. However, the amount of improvement is still limited by the coverage of the training data because current models do not extrapolate to unseen patterns. An important future direction is thus to increase data diversity through new crowdsourcing protocols or efficient human-in-the-loop augmentation.
While our work provides new perspectives on pre-training and robustness, it only scratches the surface of the effectiveness of pre-trained models and leaves many questions open: for example, why pre-trained models do not overfit to the minority examples, and how different initializations (from different pre-trained models) influence optimization and generalization. Understanding these questions is key to designing better pre-training methods for robust models.
Finally, the difference between the results on HANS and PAWS calls for more careful thinking about the formulation and evaluation of out-of-distribution generalization. Semi-manually constructed challenging data often covers only a specific type of distribution shift, so the results may not generalize to other types. A more comprehensive evaluation will drive the development of principled methods for out-of-distribution generalization.