Adapting to the Long Tail: A Meta-Analysis of Transfer Learning Research for Language Understanding Tasks

Abstract

Natural language understanding (NLU) has made massive progress driven by large benchmarks, but benchmarks often leave a long tail of infrequent phenomena underrepresented. We reflect on the question: have transfer learning methods sufficiently addressed the poor performance of benchmark-trained models on the long tail? We conceptualize the long tail using macro-level dimensions (underrepresented genres, topics, etc.), and perform a qualitative meta-analysis of 100 representative papers on transfer learning research for NLU. Our analysis asks three questions: (i) Which long tail dimensions do transfer learning studies target? (ii) Which properties of adaptation methods help improve performance on the long tail? (iii) Which methodological gaps have the greatest negative impact on long tail performance? Our answers highlight major avenues for future research in transfer learning for the long tail. Lastly, using our meta-analysis framework, we perform a case study comparing the performance of various adaptation methods on clinical narratives, yielding insights that may enable progress along these avenues.


Introduction
"There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora." (Marcus et al., 1993)

Since the creation of the Penn Treebank, using shared benchmarks to measure and drive progress in model development has been instrumental for the accumulation of knowledge in natural language processing, and has become a dominant practice. Ideally, we would like shared benchmark corpora to be diverse and comprehensive, which can be addressed at two levels: (i) macro-level dimensions such as language, genre, topic, etc., and (ii) micro-level dimensions such as specific language phenomena like negation, deixis, causal reasoning, etc. However, diversity and comprehensiveness are not straightforward to achieve.
According to Zipf's law, many micro-level language phenomena naturally occur infrequently and will be relegated to the long tail, except in cases of intentional over-sampling. Moreover, the advantages of restricting community focus to a specific set of benchmark corpora, together with limitations in resources, lead to portions of the macro-level space being under-explored, which can further cause certain micro-level phenomena to be under-represented. For example, since most popular coreference benchmarks focus on English narratives, they do not contain many instances of zero anaphora, a phenomenon quite common in other languages (e.g., Japanese, Chinese). In such situations, model performance on benchmark corpora may not be truly reflective of expected performance on micro-level long tail phenomena, raising questions about the ability of state-of-the-art models to generalize to the long tail.
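To illustrate why sampling from natural text produces a long tail, the rank-frequency relationship f(r) ∝ 1/r implied by Zipf's law can be simulated directly. The sketch below is a toy model (not drawn from any benchmark): it shows how heavily probability mass concentrates in the most frequent phenomena, leaving thousands of rare phenomena to share the remainder.

```python
# Toy illustration: under a Zipfian distribution, the frequency of the
# r-th most common phenomenon is proportional to 1/r, so a benchmark
# sampled from natural text leaves most phenomena rare.

def zipf_frequencies(n_phenomena: int) -> list[float]:
    """Relative frequencies f(r) proportional to 1/r, normalized to sum to 1."""
    weights = [1.0 / rank for rank in range(1, n_phenomena + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def tail_mass(freqs: list[float], head_size: int) -> float:
    """Fraction of probability mass outside the `head_size` most frequent items."""
    return sum(freqs[head_size:])

freqs = zipf_frequencies(10_000)
head = sum(freqs[:100])        # the 100 most frequent phenomena: over half the mass
tail = tail_mass(freqs, 100)   # the remaining 9,900 phenomena share what is left
```

Even with 10,000 distinct phenomena, the head of the distribution dominates, which is why infrequent phenomena stay underrepresented without intentional over-sampling.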
Most benchmarks do not explicitly catalogue the micro-level language phenomena that are included or excluded in the sample, which makes it non-trivial to construct a list of long tail micro-level language phenomena. Hence, we formalize an alternate conceptualization of the long tail: undersampled portions of the macro-level space that can be treated as proxies for long tail micro-level phenomena. These undersampled long tail macro-level dimensions highlight gaps and present potential new challenging directions for the field. Therefore, periodically taking stock of research to identify long tail macro-level dimensions can help in highlighting opportunities for progress that have not yet been tackled. This idea has been gaining prominence recently; for example, Joshi et al. (2020) survey languages studied by NLP papers, providing statistical support for the existence of a macro-level long tail of low-resource languages.
In this work, our goal is to characterize the macro-level long tail in NLU and the efforts within transfer learning research that have tried to address it. Large benchmarks have driven much of the recent methodological progress on NLU (Bowman et al., 2015; Rajpurkar et al., 2016; McCann et al., 2018; Talmor et al., 2019; Wang et al., 2019b,a), but the generalization abilities of benchmark-trained models to the long tail have been unclear. In tandem, the NLP community has been successfully developing transfer learning methods to improve generalization of models trained on NLU benchmarks (Ruder et al., 2019). The goal of transfer learning research is to tackle the macro-level long tail in NLU, leading to the question: how far has transfer learning addressed the performance of benchmark models on the NLU long tail, and where do we still fall behind?
Probing further, we perform a qualitative meta-analysis of a representative sample of 100 papers on domain adaptation and transfer learning in NLU. We sample these papers based on citation counts and publication venues (§2.1), and document 7 facets for each paper, such as tasks and domains studied, adaptation settings evaluated, etc. (§2.2). Adaptation methods proposed (or applied) are documented using a hierarchical categorization described in §2.3, which we develop by extending the hierarchy from Ramponi and Plank (2020). With this information, our analysis focuses on three questions:
• Q1: Which long tail macro-level dimensions do transfer learning studies target? Dimensions include tasks, domains, languages and adaptation settings covered in transfer learning research.
• Q2: Which properties of adaptation methods help improve performance on long tail dimensions?
• Q3: Which methodological gaps have the greatest negative impact on long tail performance?
The rest of the paper presents thorough answers to these questions, laying out avenues for future research on transfer learning that more effectively addresses the macro-level long tail in NLU. We also present a case study to demonstrate how our meta-analysis framework can be used to systematically design experiments that provide insights along these avenues.

Figure 1: PRISMA diagram explaining our sample curation process (inclusion criteria: include all highly-cited abstracts (>=100 citations), then sample the rest randomly until n=100, balancing citation count and topic rarity).

Sample Curation
We gather a representative sample of work on domain adaptation or transfer learning in NLU from the December 2020 dump of the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). First, we extract all abstracts published at 9 prestigious *CL venues: ACL, EMNLP, NAACL, EACL, COLING, CoNLL, SemEval, TACL, and CL. This results in 25,141 abstracts, which are filtered to retain those containing the terms "domain adaptation" or "transfer learning" in the title or abstract, producing a set of 382 abstracts after duplicate removal. Figure 2 shows the distribution of these retrieved abstracts across search terms and years. From this graph we can see that interest in this field has increased tremendously in recent years, and that there has been a slight terminology shift, with recent work preferring the term "transfer learning" over "domain adaptation". We manually screen this subset and remove abstracts that are not eligible for our NLU-focused analysis (e.g., papers on generation-focused tasks like machine translation), leaving us with a set of 266 abstracts. From this, we construct a final meta-analysis sample of 100 abstracts via application of two inclusion criteria. Per the first criterion, all abstracts with 100 or more citations are included, since they are likely to describe landmark advances. Then, remaining abstracts (to bring our meta-analysis sample to 100) are randomly chosen, after discarding ones with no citations.
The random sampling criterion ensures that we do not neglect work on less mainstream topics by focusing solely on highly-cited papers. This produces a final representative sample of transfer learning work for our meta-analysis.

Characterizing limitations of our curation process: Since our sample curation process primarily relies on a keyword-based search, it might miss relevant work that does not use any of these keywords. To characterize the limitations of our curation process, we employ two additional strategies for relevant literature identification:
• Citation graph retrieval: Following Blodgett et al. (2020), we include all abstracts that cite or are cited by abstracts included in our keyword-retrieved set of 382 abstracts. This retrieves 3727 additional abstracts, but many of these works are cited for their description or introduction of new tasks, datasets, evaluation metrics, etc. Therefore, we discard all works that do not contain the words "adaptation" or "transfer", leaving 282 new abstracts.
• Nearest neighbor retrieval: We use SPECTER (Cohan et al., 2020) to compute embeddings for all abstracts included in our keyword-retrieved set, as well as all abstracts in the ACL anthology.
Then we retrieve the nearest neighbor of every abstract in our keyword-retrieved set, which yields 262 new abstracts.
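The nearest-neighbor retrieval step above can be sketched as follows. This is a minimal illustration only: toy two-dimensional vectors stand in for SPECTER embeddings, and the abstract identifiers are hypothetical.

```python
import math

# Sketch of nearest-neighbor retrieval over abstract embeddings. In the
# actual study, embeddings come from SPECTER; here we assume they are
# already available as plain vectors.

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(seed_vecs: dict[str, list[float]],
                      pool_vecs: dict[str, list[float]]) -> set[str]:
    """For each seed abstract, retrieve its single nearest pool abstract."""
    retrieved = set()
    for seed in seed_vecs.values():
        best = max(pool_vecs, key=lambda pid: cosine(seed, pool_vecs[pid]))
        retrieved.add(best)
    return retrieved

# Hypothetical data: two keyword-retrieved abstracts, three anthology abstracts.
seeds = {"kw1": [1.0, 0.0], "kw2": [0.0, 1.0]}
pool = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [-1.0, 0.0]}
new_abstracts = nearest_neighbors(seeds, pool)
```

Because retrieval is per-seed, the number of newly retrieved abstracts is bounded by the size of the keyword-retrieved set, which matches the scale reported above (262 new abstracts from 382 seeds).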
Combining abstracts returned by both strategies, we identify 510 additional works. However, while going over them manually, we notice that despite our noise reduction efforts, not all abstracts describe transfer learning work. We perform an additional manual screening step to discard such work, which leaves us with a final set of 232 additional papers.

Figure 4: Categorization of adaptation methods proposed, extended or used in all studies. This categorization is an extension of the one proposed by Ramponi and Plank (2020), with blue blocks indicating newly added categories, and yellow blocks indicating categories that have been moved to a different coarse category.

To identify whether the exclusion of these papers from the initial sample may have led to visible gaps or blind spots in our meta-analysis, we perform a t-SNE visualization of SPECTER embeddings for both keyword-retrieved papers and this additional set of papers. Figure 3 presents the results of this visualization and indicates that there are no visible distributional differences between the two subsets. Hence, though our sample curation strategy is imperfect, this seems to indicate that our final observations from the meta-analysis might not have been very different. We note that this conclusion comes with two caveats: (i) t-SNE embeddings are not always reliable, and (ii) embedding overlap does not necessarily confirm that annotations for overlapping papers are similar/correlated. Keeping these caveats in mind, we perform a spot check for additional validation. For this spot check, we consider the following highly-cited large language models that have been considered major recent advances in transfer learning: ELMo, BERT, RoBERTa, BART, T5, ERNIE, DeBERTa, and ELECTRA. Note that we do not consider any few-shot models (e.g., GPT3, PET, etc.)
since our sample only consists of work that was accepted to a *CL venue by December 2020. Of these major language models, RoBERTa, DeBERTa, T5, and ELECTRA were published at non-*CL venues (JMLR and ICLR), which excludes them from our sample. The remaining works (ELMo, BERT, BART, and ERNIE) are all present in the set of additional works we identified in this section, lending further support to our conclusion that our sampling strategy and subsequent analyses have not overlooked influential work.

Meta-Analysis Facets
For every paper from our meta-analysis sample, we document the following key facets:
• Task(s): NLP task(s) studied in the work. Tasks are grouped into 12 categories based on task formalization and linguistic level (e.g., lexical, syntactic, etc.), as shown in table 1.
• Domain(s): Source and target domains and/or languages studied, along with datasets used for each.
• Task Model: Base model used for the task, to which domain adaptation algorithms are applied.
• Adaptation Method(s): Domain adaptation method(s) proposed or used in the work. Adaptation methods are grouped according to the categorization shown in figure 4 (details in §2.3).
• Adaptation Baseline(s): Baseline domain adaptation method(s) against which new methods are compared.
• Adaptation Settings: Source-target transfer settings explored in the work (e.g., unsupervised adaptation, multi-source adaptation, etc.).
• Result Summary: Performance improvements (if any), performance differences across multiple source-target pairs or methods, etc.

Adaptation Method Categorization
For adaptation methods proposed or used in each study, we assign type labels according to the categorization presented in figure 4. This categorization is an extension of the one proposed by Ramponi and Plank (2020). Broadly, methods are divided into three coarse categories: (i) model-centric, (ii) data-centric, and (iii) hybrid approaches. Model-centric approaches perform adaptation by modifying the structure of the model, which may include editing the feature representation, loss function or parameters. Data-centric approaches perform adaptation by modifying or leveraging labeled/unlabeled data from the source and target domains to bridge the domain gap. Finally, hybrid approaches are ones that cannot be clearly classified as model-centric or data-centric. Each coarse category is divided into fine subcategories.
Model-centric approaches are divided into four categories, based on which portion of the model they modify: (i) feature-centric, (ii) loss-centric, (iii) parameter-centric, and (iv) ensemble. Feature-centric approaches are further divided into two fine subcategories: (i) feature augmentation, and (ii) feature generalization. Feature augmentation includes techniques that learn an alignment between source and target feature spaces using shared features called pivots (Blitzer et al., 2006). Feature generalization includes methods that learn a joint representation space using autoencoders, motivated by Glorot et al. (2011) and Chen et al. (2012). Loss-centric approaches contain one fine subcategory: loss augmentation. This includes techniques that augment the task loss with an adversarial loss (Ganin and Lempitsky, 2015; Ganin et al., 2016), multi-task loss (Liu et al., 2019) or regularization terms. Parameter-centric approaches include three fine subcategories: (i) parameter initialization, (ii) new parameter addition, and (iii) parameter freezing. Finally, ensemble, used in settings with multiple source domains, includes techniques that learn to combine predictions of multiple models trained on source and target domains.
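To make the feature augmentation category concrete, consider a minimal sketch in the style of "frustratingly easy domain adaptation" (Daumé III, 2007): each feature is duplicated into a shared copy and a domain-specific copy, letting the learner decide which weights transfer across domains. The feature names below are illustrative only.

```python
# Sketch of feature augmentation à la Daumé III (2007). Each input feature
# is mapped to a "shared" copy (active in every domain) and a domain-specific
# copy (active only in its own domain); a standard linear learner trained on
# the augmented space then decides how much weight each copy should carry.

def augment(features: dict[str, float], domain: str) -> dict[str, float]:
    """Map a feature vector into the augmented shared + domain-specific space."""
    augmented = {}
    for name, value in features.items():
        augmented[f"shared:{name}"] = value    # shared across all domains
        augmented[f"{domain}:{name}"] = value  # specific to `domain`
    return augmented

# Hypothetical example: the same lexical feature in two domains.
src = augment({"word=excellent": 1.0}, domain="reviews")
tgt = augment({"word=excellent": 1.0}, domain="clinical")
```

Note that the source and target versions of the same example overlap only on the shared copy, which is exactly the channel through which cross-domain transfer happens.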
Data-centric approaches are divided into five fine subcategories. Pseudo-labeling approaches train classifiers that then produce "gold" labels for unlabeled target data. This includes semi-supervised learning methods such as bootstrapping, co-training, self-training, etc. (e.g., McClosky et al. (2006)).
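A toy self-training loop illustrates the pseudo-labeling idea (a sketch, not drawn from any cited paper): a classifier repeatedly labels its most confident unlabeled target examples and adds them to the training set. The `predict_proba` argument is a stand-in for any probabilistic classifier; in a full implementation the model would be retrained after each round.

```python
# Minimal self-training sketch: confidently pseudo-labeled target examples
# are moved into the labeled set; low-confidence examples stay unlabeled.

def self_train(labeled, unlabeled, predict_proba, threshold=0.9):
    """Iteratively accept pseudo-labels whose confidence exceeds `threshold`.

    labeled: list of (example, label) pairs; unlabeled: list of examples;
    predict_proba: example -> dict mapping label -> probability.
    Returns the augmented labeled set.
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    changed = True
    while changed and remaining:
        changed = False
        still_unlabeled = []
        for x in remaining:
            probs = predict_proba(x)          # e.g. {"pos": 0.95, "neg": 0.05}
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                labeled.append((x, label))    # pseudo-label accepted
                changed = True
            else:
                still_unlabeled.append(x)     # stays unlabeled this round
        remaining = still_unlabeled
    return labeled
```

The threshold controls the usual precision/coverage trade-off of pseudo-labeling: a high threshold admits fewer but cleaner "gold" labels.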
Active learning approaches use a human-in-the-loop setting to annotate a subset of target data that the model can learn most from (Settles, 2009). Instance learning approaches leverage neighborhood structure in joint source-target feature spaces to make target predictions (e.g., nearest neighbor learning). Noising/denoising approaches include data corruption/pre-processing steps that increase surface similarity between source and target examples. Finally, pretraining includes approaches that train large-scale language models on unlabeled data to learn better source and target representations, a strategy that has gained popularity in recent years (Gururangan et al., 2020).
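A common instantiation of the active learning category is pool-based uncertainty sampling, one of the query strategies surveyed by Settles (2009). The sketch below selects the target examples the model is least certain about; the probabilities are illustrative model outputs, not real data.

```python
import math

# Sketch of pool-based active learning via entropy (uncertainty) sampling:
# examples with the highest predictive entropy are sent to a human annotator.

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool: dict[str, list[float]], budget: int) -> list[str]:
    """Pick the `budget` target examples the model is least certain about."""
    return sorted(pool, key=lambda ex: entropy(pool[ex]), reverse=True)[:budget]

# Hypothetical target pool with the model's class probabilities per example.
pool = {
    "easy":   [0.99, 0.01],  # confident prediction: low annotation value
    "hard":   [0.51, 0.49],  # near the decision boundary: annotate first
    "medium": [0.80, 0.20],
}
```

Under a fixed annotation budget, this strategy concentrates human effort on the target examples where the benchmark-trained model is most uncertain, which is precisely where adaptation needs supervision.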
Hybrid approaches contain two fine subcategories that cannot be classified as model-centric or data-centric because they involve manipulation of the data distribution, but can also be viewed as loss-centric approaches that edit the training loss. Instance weighting approaches assign weights to target examples based on similarity to source data. Conversely, data selection approaches filter target data based on similarity to source data. Table 2 lists example adaptation methods for each fine category and example studies from our meta-analysis subset that use these methods.
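The two hybrid subcategories can be sketched side by side. The Jaccard token-overlap similarity below is an assumption chosen for illustration; real systems might use language-model perplexity or embedding distance instead.

```python
# Sketch of the two hybrid strategies: instance weighting keeps every example
# but down-weights those dissimilar to the source, while data selection
# hard-filters them. Similarity here is toy token overlap (an assumption).

def similarity(example: set[str], source_vocab: set[str]) -> float:
    """Jaccard overlap between an example's tokens and the source vocabulary."""
    union = example | source_vocab
    return len(example & source_vocab) / len(union) if union else 0.0

def instance_weights(examples, source_vocab):
    """Instance weighting: one soft weight per example, by source similarity."""
    return [similarity(ex, source_vocab) for ex in examples]

def data_selection(examples, source_vocab, threshold=0.25):
    """Data selection: discard examples below a similarity threshold."""
    return [ex for ex in examples if similarity(ex, source_vocab) >= threshold]
```

The loss-centric view mentioned above corresponds to multiplying each example's training loss by its weight, with data selection as the limiting case of zero/one weights.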
Which Long Tail Macro-Level Dimensions Do Transfer Learning Studies Target?
The first goal of our meta-analysis is to document long tail macro-level dimensions that transfer learning studies have tested their methods on. We look at distributions of tasks, domains, languages and adaptation settings studied in all papers in our sample. Ten studies are surveys, position papers or meta-experiments, and so are excluded from these statistics. Studies can cover multiple tasks, domains, languages or settings, so counts may be higher than 90.

Table 2: Examples of methods from each category, and papers studying these methods. These lists are non-exhaustive. In the interest of replicability, we have made our coding for all papers publicly available at: http://www.shorturl.at/stuAT.
• Feat Aug (FA): Structural correspondence learning, Frustratingly easy domain adaptation (Blitzer et al., 2006; Daumé III, 2007)
• Feat Gen (FG): Marginalized stacked denoising autoencoders, Deep belief networks (Jochim and Schütze, 2014; Ji et al., 2015; Yang et al., 2015)
• Loss Aug (LA): Multi-task learning, Adversarial learning, Regularization-based methods (Zhang et al., 2017; Liu et al., 2019; Chen et al., 2020)
• Init: …
• Noising/Denoising (NO): Token dropout (Pilán et al., 2016)
• Active Learning (AL): Sample selection via active learning (Rai et al., 2010; Wu et al., 2017)
• Pretraining (PT): Language model pretraining, Supervised pretraining (Conneau et al., 2017; Howard and Ruder, 2018)
• Instance Learning (IL): Nearest neighbor learning (Gong et al., 2016)
Task distribution: Figure 5 gives a brief overview of the distribution of tasks studied across papers. Text classification tasks clearly dominate, followed by semantic and syntactic tagging. Text classification covers a variety of tasks, but sentiment analysis is the most well-studied, with research driven by the multi-domain sentiment detection (MDSD) dataset (Blitzer et al., 2007). Conversely, structured prediction is under-studied (<10% of studies from our sample evaluate on structured prediction tasks), despite covering a variety of tasks such as coreference resolution, syntactic parsing, dependency parsing, semantic parsing, etc. This indicates that tasks with complex formulations/objectives are under-explored. We speculate that there may be two reasons for this: (i) difficulty of collecting annotated data in multiple

Languages studied: Despite a focus on generalization, most studies in our sample rarely evaluate on languages other than English. As stated by Bender (2011), this is problematic because the ability to apply a technique to other languages does not necessarily guarantee comparable performance. Some studies do cover multi-lingual evaluation or focus on cross-linguality. Figure 6 shows the distribution of languages included in these studies, which is a limited subset. For a more comprehensive discussion of linguistic diversity in NLP research not limited to transfer learning, we refer interested readers to Joshi et al. (2020).

Domains studied: Many popular transfer benchmarks (Blitzer et al., 2007; Wang et al., 2019b,a) are homogeneous. They focus on narrative English, drawn from plentiful sources such as news articles, reviews, blogs, essays and Wikipedia. This sidelines some categories of domains that fall into the long tail: (i) non-narrative text (e.g., social media, conversations, etc.), and (ii) texts from high-expertise domains that use specialized vocabulary and knowledge (e.g., clinical text). Table 3 shows the number of papers focusing on high-expertise and non-narrative domains, highlighting the lack of focus on these areas.

Adaptation settings studied: Most studies evaluate methods in a supervised adaptation setting, i.e., labeled data is available from both source and target domains. This assumption may not always hold. Often adaptation must be performed in harder settings, such as unsupervised adaptation (no labeled data from the target domain), adaptation from multiple source domains, online adaptation, etc. We refer to all such settings aside from supervised adaptation as unconventional adaptation settings. Figure 7 shows the distribution of unconventional settings across papers, indicating that these settings are understudied in the literature.
Open Issues: We can see that there is much ground to cover in testing adaptation methods on the macro long tail. Two research directions may be key to achieving this: (i) development of and evaluation on diverse benchmarks, and (ii) incentivizing publication of research on long tail domains at NLP venues. Diverse benchmark development has gained momentum, with the creation of benchmarks such as BLUE (Peng et al., 2019) and BLURB (Gu et al., 2020) for biomedical and clinical NLP, XTREME (Hu et al., 2020) for cross-lingual NLP, and GLUECoS (Khanuja et al., 2020) for code-switched NLP.

Which Properties Of Adaptation Methods Help Improve Performance On Long Tail Dimensions?

The second goal of our meta-analysis is to identify which categories of adaptation methods have been tested extensively and have exhibited good performance on various long tail macro-level dimensions. Figures 8 and 9 provide an overview of categories of methods tested across all papers in our subset. We can see that studies overwhelmingly develop or use model-centric methods. Within this coarse category, feature augmentation (FA) and loss augmentation (LA) are the top two categories, followed by pretraining (PT), which is data-centric. Parameter initialization (PI) and pseudo-labeling (PL) round out the top five. Feature augmentation being the most explored category is no surprise, given that a lot of pioneering early domain adaptation work in NLP (Blitzer et al., 2006, 2007; Daumé III, 2007) developed methods to learn shared feature spaces between source and target domains. Loss augmentation methods have gained prominence recently, with multi-task learning providing large improvements (Liu et al., 2015, 2019). Pretraining methods, both unsupervised (Howard and Ruder, 2018) and supervised (Conneau et al., 2017), have also gained popularity, with large transformer-based language models (e.g., Peters et al. (2018), Devlin et al. (2019), etc.) achieving huge gains across tasks.
To specifically identify techniques that work on long tail domains, we look at categories of methods evaluated on high-expertise domains or non-narrative domains (or both). Figures 10a, 10b and 10c present the distributions of fine method categories tested on high-expertise domains, non-narrative domains and both domain types, respectively. While feature augmentation techniques remain the most explored category for high-expertise domains, we see a change in trend for non-narrative domains, where loss augmentation and pretraining are more commonly explored. The difference in dominant method categories can be partly attributed to the easy availability of large-scale unlabeled data and weak signals (e.g., likes, shares, etc.), particularly for social media. Such user-generated content (called "fortuitous data" by Plank (2016)) is leveraged well by pretraining or multi-task learning techniques, making them popular choices for non-narrative domains. In contrast, high-expertise domains (e.g., security and defense reports, finance, etc.) often lack fortuitous data, and methods developed for them focus on learning shared feature spaces.
Table 4: Model and performance details for studies testing on high-expertise and non-narrative domains. Fine method categories used in these studies include feature augmentation (FA), loss augmentation (LA), ensembling (EN), pretraining (PT), parameter initialization (PI), and pseudo-labeling (PL).
• …: Positive transfer to biomedical, literature and conversation domains
• Yang and Eisenstein (2015): Dense embeddings induced from template features and manually defined domain attribute embeddings (FA). Positive transfer to 4/5 web domains and 10/11 literary periods.
• Xing et al. (2018): Multi-task learning method with source-target distance minimization as additional loss term (LA). Positive transfer on 4/6 intramedical settings (EHRs, forums) and 5/9 narrative-to-medical settings.
• Wang et al. (2018): Source-target distance minimized using two loss penalties (LA). Positive transfer to medical and Twitter data.

Ten studies in our meta-analysis sample evaluate on both domain types. Five of these studies (described in table 4) operationalize two key ideas that seem to improve adaptation performance but have remained relatively under-explored in the
context of recent methods like pretraining:
• Incorporating source-target distance: Several methods explicitly incorporate distance between the source and target domains (e.g., Xing et al., 2018; Wang et al., 2018). Aside from allowing flexible adaptation based on the specific domain pairs being considered, adding source-target distance provides two benefits. It offers an additional avenue to analyze generalizability by monitoring source-target distance during adaptation. It also allows performance to be estimated in advance using source-target distance, which can be helpful when choosing an adaptation technique for a new target domain. Kashyap et al. (2020) provide a comprehensive overview of source-target distance metrics and discuss their utility in analysis and performance prediction. Despite these benefits, very little work has tried to incorporate source-target distance into newer adaptation methods such as pretraining.
• Incorporating nuanced domain variation: Although NLP often treats domain variation as a dichotomy (source vs. target), domains vary along a multitude of dimensions (e.g., topic, genre, medium of communication, etc.) (Plank, 2016). Some methods acknowledge this nuance and treat domain variation as multi-dimensional, either in a discrete feature space (Arnold et al., 2008) or in a continuous embedding space (Yang and Eisenstein, 2015). This allows knowledge sharing across dimensions common to both source and target, improving transfer. This idea has also remained under-explored, though recent work such as the development of domain expert mixture (DEMix) layers (Gururangan et al., 2021) has attempted to incorporate nuanced domain variation into pretraining.

Open Issues: Interestingly, many studies from our sample do not analyze failures, i.e., source-target pairs on which adaptation methods do not improve performance. For some studies in table 4, adaptation methods do not improve performance on all source-target pairs. But failures are not investigated, presenting the question: do we know the blind spots of current adaptation methods? Answering this is essential to develop a complete picture of the generalization capabilities of adaptation methods. Studies that present negative transfer results (e.g., Plank et al., 2014) are rare, but should be encouraged to develop a sound understanding of adaptation techniques. Analyses should also study ties between datasets used and methods applied, highlighting dimensions of variation between source-target domains and how adaptation methods bridge them (Kashyap et al., 2020; Naik et al., 2021). Such analyses can uncover important lessons about the generalizability of adaptation methods and the kinds of source-target settings in which they can be expected to improve performance.

Identifying under-explored and promising methods: Annotating long tail macro-level dimensions and adaptation method categories studied by all works included in our representative
sample has the additional benefit of providing a framework to identify both the most under-explored and the most promising methods under various settings. Tables 5 and 6 provide evidence gap maps presenting the number of works from our sample that study the utility of various method categories on different tasks and domains, respectively. The first thing we note is that both maps are highly sparse, indicating that there is little to no evidence for several combinations, many of which are worth exploring. In particular, given recent state-of-the-art advances, the following settings seem ripe for exploration:
• Parameter addition and freezing: Though there are only four studies in our sample (providing positive evidence) that study parameter addition and freezing methods, we believe that given the advent of large-scale language models, these categories merit further exploration for popular task categories (TC, POS, NER, NLI, SP). Both methods attempt to improve generalization by reducing overfitting, which is likely to be more prevalent with large language models, and are additionally efficient methods that do not require a large number of extra parameters.
• Active Learning: Studies included in our sample provide positive evidence for the use of active learning in an adaptation setting, but they have mainly evaluated on text classification (primarily sentiment analysis). We hypothesize that active learning during adaptation might also prove beneficial for task categories POS, NER, and SP, which require more complex, linguistically-informed annotation.
• Data Selection: Despite being similar in nature to instance weighting methods, for which several studies provide positive evidence, data selection methods seem to have been under-explored. We believe that these methods might be useful for POS, NER, and SP tasks, for which large-scale fortuitous data is not as easily available, and adaptation must also take into account shifts in output structure.
Despite the sparsity
of both maps, there are certain method-task and method-domain combinations for which our meta-analysis sample includes a reasonable number of studies (>=10%). For these combinations, we provide a quick performance summary below:
• Pretraining: For text classification, 4/13 studies use pretraining as a baseline. Of the remaining 9 studies, 8 provide strong positive evidence and only one provides mixed results. Despite their relatively recent emergence, pretraining methods also seem to be extremely promising based on performance.

Which Methodological Gaps Have Greatest Negative Impact On Long Tail Performance?
The final goal of our meta-analysis is to identify methodological gaps in developing adaptation methods for long tail domains, which provide avenues for future research. Our observations highlight three areas: (i) combining adaptation methods, (ii) incorporating external knowledge, and (iii) application to data-scarce settings.

Combining adaptation methods
The potential of combining multiple adaptation methods has not been systematically and extensively studied. Combining methods may be useful in two scenarios. The first is when source and target domains differ along multiple dimensions (e.g., topic, language, etc.) and different methods are known to work well for each. The second is when methods focus on resolving issues in specific portions of the model, such as feature space misalignment, task-level differences, etc. Combining model-centric adaptation methods that tackle each issue separately may improve performance over individual approaches. Despite its utility, method combination has only been systematically explored by one meta-study from 2010. On the other hand, 23 studies apply a particular combination of methods to their tasks/domains, but do not analyze when these combinations do or do not work. We summarize both sources of evidence and highlight open questions.
Method combination meta-study: Chang et al. (2010) observe that most adaptation methods either tackle shift in the feature space (P(X)) or shift in how features are linked to labels (P(Y|X)). They call the former category "unlabeled adaptation methods", since feature space alignment can be done using unlabeled data alone. Methods from the latter category require some labeled target data and are called "labeled adaptation methods". Through theoretical analysis, simulated experiments and experiments with real-world data on two tasks (named entity recognition and preposition sense disambiguation), they observe: (i) combination generally improves performance, (ii) combining best-performing individual
Applying particular combinations: Table 7 lists all studies that apply method combinations, along with fine-grained category labels from our hierarchy for the methods used. Combining methods from different coarse categories is the most popular strategy, employed by 15 out of 23 studies. Five studies combine methods from the same coarse category but different fine categories; they combine model-centric methods that edit different parts of the model (e.g., a feature-centric and a loss-centric method). The remaining three studies combine methods from the same fine category. Only 7 studies evaluate on at least one long tail domain.
Several studies observe performance improvements (Yu and Kübler, 2011; Mohit et al., 2012; Scheible and Schütze, 2013; Kim et al., 2017; Yang et al., 2017; Alam et al., 2018), mirroring the observation by Chang et al. (2010) that method combination helps. However, this is not consistent across all studies. For example, Jochim and Schütze (2014) find that combining marginalized stacked denoising autoencoders (mSDA) (Chen et al., 2012) and frustratingly easy domain adaptation (FEDA) (Daumé III, 2007) performs worse than individual methods in preliminary experiments on citation polarity classification. Both methods are feature-centric, though mSDA is a generalization method (FG) while FEDA is an augmentation method (FA). Additionally, mSDA is an unlabeled adaptation method while FEDA is a labeled adaptation method. Owing to negative results, Jochim and Schütze (2014) do not experiment further to find a combination that might have worked. Wright and Augenstein (2020) show that combining adversarial domain adaptation (ADA) (Ganin and Lempitsky, 2015) with pretraining does not improve performance, but combining mixture of experts (MoE) with pretraining does. This indicates that methods from the same coarse category (model-centric) may react differently in combination settings. Similarly, studies achieving positive results do not analyze which properties of the chosen methods allow them to combine well, or whether this success extends to other methods with similar properties.

Open questions: To understand method combination, we must examine the following questions:
• Is it possible to draw general conclusions about the potential of combining methods from various coarse or fine categories?
• Which properties of adaptation methods are indicative of their ability to interface well with other methods?
• Do the task and/or domain of interest influence the abilities of methods to combine successfully?
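To make the feature augmentation (FA) category concrete, here is a minimal sketch of the FEDA mapping discussed above: each feature is copied into a shared version and a domain-specific version, letting the learner decide per feature whether source and target should share a weight. The feature names and domain labels below are illustrative assumptions, not taken from any cited study.

```python
def feda_augment(features, domain):
    """FEDA (Daumé III, 2007) feature map: every input feature appears
    once in a shared copy and once in a copy specific to its domain."""
    augmented = {}
    for name, value in features.items():
        augmented[f"shared:{name}"] = value    # weight tied across domains
        augmented[f"{domain}:{name}"] = value  # weight private to this domain
    return augmented

# The same feature fires a shared copy plus a source- or target-specific copy:
src = feda_augment({"token=aspirin": 1.0}, "source")
tgt = feda_augment({"token=aspirin": 1.0}, "target")
```

A standard linear classifier trained on the augmented features then recovers shared behavior through the `shared:` copies and domain-specific corrections through the rest, which is what makes the method "frustratingly easy" to apply.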

Incorporating external knowledge
Most methods leverage labeled/unlabeled text to learn generalizable representations. However, knowledge from sources beyond text, such as ontologies, human understanding of domain/task variation, etc., can also improve adaptation performance.
This is especially true for domains with expert-curated ontologies (e.g., UMLS for biomedical/clinical text (Bodenreider, 2004)). From our study sample, we observe some exploration of the following knowledge sources:

Ontological knowledge: Romanov and Shivade (2018) employ UMLS for clinical natural language inference via two techniques: (i) retrofitting word vectors as per UMLS (Faruqui et al., 2015), and (ii) using UMLS concept distance-based attention. Retrofitting hurts performance, while concept distance provides modest improvements.

Domain Variation: Arnold et al. (2008) and Yang and Eisenstein (2015) incorporate human understanding of domain variation in discrete and continuous feature spaces respectively, with some success (Table 4). Structural correspondence learning (Blitzer et al., 2006) relies on manually defined pivot features common to source and target domains, and shows performance improvements.

Task Variation: Zarrella and Marsh (2016) incorporate human understanding of the knowledge required for stance detection to define an auxiliary hashtag prediction task, which improves target task performance.

Manual Adaptation: Chiticariu et al. (2010) manually customize rule-based NER models, matching scores achieved by supervised models.
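As an illustration of the retrofitting technique mentioned above, the following is a minimal pure-Python sketch of the iterative update from Faruqui et al. (2015): each vector is pulled toward the average of its ontology neighbours while staying close to its original embedding. The toy vocabulary and edge list are invented for illustration; real implementations operate on full embedding matrices and large ontology graphs such as UMLS.

```python
def retrofit(vectors, edges, iters=10, alpha=1.0):
    """Retrofit word vectors to an ontology graph (Faruqui et al., 2015).

    vectors: dict word -> list[float]; edges: list of (word, word) pairs.
    Each iteration sets q_i to a weighted mean of the original vector
    (weight alpha) and the current neighbour vectors (weight 1/deg each).
    """
    original = {w: list(v) for w, v in vectors.items()}
    new = {w: list(v) for w, v in vectors.items()}
    nbrs = {w: [] for w in vectors}
    for a, b in edges:
        if a in nbrs and b in nbrs:
            nbrs[a].append(b)
            nbrs[b].append(a)
    for _ in range(iters):
        for w, ns in nbrs.items():
            if not ns:
                continue  # words with no ontology neighbours stay put
            beta = 1.0 / len(ns)  # neighbour weight, as in the paper
            for d in range(len(new[w])):
                num = alpha * original[w][d] + beta * sum(new[n][d] for n in ns)
                new[w][d] = num / (alpha + beta * len(ns))
    return new
```

With `alpha=1.0`, each update is simply the midpoint between the original vector and the neighbour average, so linked words drift toward each other without collapsing onto their neighbours.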
Another source that is not explored by studies in our sample, but has gained popularity, is providing task descriptions for sample-efficient transfer learning (Schick and Schütze, 2021). Despite initial explorations, the potential of external knowledge sources is largely under-explored.

Open questions: Given the varying availability of knowledge sources across tasks/domains, comparing their performance across domains may be impractical. But studies experimenting with a specific source can still probe the following questions:
• Can reliance on labeled/unlabeled data be reduced while maintaining the same performance?
• Does incorporating the knowledge source improve interpretability of the adaptation method?
• Can we preemptively identify a subset of samples which may benefit from the knowledge?

Application to data-scarce settings
Some of these settings have attracted attention in recent years. Ramponi and Plank (2020) comprehensively survey neural methods for unsupervised adaptation. In their survey on low-resource NLP, Hedderich et al. (2020) cover transfer learning techniques that reduce the need for supervised target data. Wang et al. (2021) list human-in-the-loop data augmentation and model update techniques that can be used for data-scarce adaptation. However, there is room to further study the application of adaptation methods in data-scarce settings.

Open questions: Broadly, two main questions in this area remain unanswered:
• At different levels of data scarcity (e.g., no labeled target data, no unlabeled target data, etc.), which adaptation methods perform best?

Table 9: Performance of all adaptation methods on NER in the fine setting. Recall that the fine adaptation method categories we evaluate are loss augmentation (LA), pseudo-labeling (PL), pretraining (PT), and instance weighting (IW).
Our meta-analysis framework can also guide experiments that provide answers to the prevailing open questions laid out previously. As an example, we conduct a case study to evaluate the effectiveness of popularly used adaptation methods on high-expertise domains in an unsupervised adaptation setting, a burgeoning area of interest (Ramponi and Plank, 2020). Specifically, our study focuses on the question: which method categories perform best for semantic sequence labeling tasks when transferring from news to clinical narratives in an unsupervised setting (i.e., no labeled clinical data available)? We focus on two semantic sequence labeling tasks: entity extraction and event extraction.

Datasets
We use the following event extraction datasets:
• (Sun et al., 2013): Discharge summaries annotated with events.
• MTSamples (Naik et al., 2021): Medical records annotated with events (test-only).
CoNLL 2003 and TimeBank are the source datasets for all entity and event extraction experiments respectively, while the remaining datasets are targets. We focus on English narratives only. Among the NER datasets, the label sets for i2b22006 and i2b22014 can be mapped to the label set for CoNLL2003; however, the label set for i2b22010 is quite distinct and cannot be mapped. Therefore, we evaluate NER in two settings: coarse and fine. In the coarse setting, the model only detects entities but does not predict entity type, whereas in the fine setting, the model detects entities and predicts their types.
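The coarse setting can be derived from fine annotations by collapsing typed BIO tags to a single untyped entity tag, so that only span boundaries are evaluated. A minimal sketch (the tag names below are illustrative, not the actual i2b2 or CoNLL label sets):

```python
def to_coarse(bio_tags):
    """Collapse typed BIO tags (fine setting) into untyped entity tags
    (coarse setting): 'B-DATE' -> 'B-ENT', 'I-DATE' -> 'I-ENT', 'O' -> 'O'."""
    return [t if t == "O" else f"{t[0]}-ENT" for t in bio_tags]

fine = ["B-PATIENT", "I-PATIENT", "O", "B-DATE"]
to_coarse(fine)  # -> ["B-ENT", "I-ENT", "O", "B-ENT"]
```

Evaluating both views of the same predictions is what allows the coarse setting to include i2b22010, whose entity types cannot be mapped to the source label set.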

Adaptation Methods
The baseline model for both tasks is a BERT-based sequence labeling model that computes token-level representations using BERT, followed by a linear layer that predicts entity/event labels. We compare the performance of adaptation methods from the top five fine categories most frequently applied to high-expertise domains as per our analysis (Figure 10a), on top of this BERT baseline. Since feature augmentation (FA) methods require some target labeled data to train target-specific weights and our focus is on an unsupervised setting, our study tests the remaining four categories:
• PL: From pseudo-labeling, we test the self-training method. Self-training first trains a sequence labeling model on the source dataset (news), then uses this model to generate labels for unlabeled target data (clinical narratives). High-confidence predictions from the "pseudo-labeled" clinical data are combined with source data to train a new sequence labeling model. This process can be repeated iteratively.
• LA: From loss augmentation, we test adversarial domain adaptation (Ganin and Lempitsky, 2015). This method learns domain-invariant representations by adding an adversary that predicts an example's domain and subtracting the loss from this adversary from the overall model loss. This setup is trained in a two-stage alternating optimization process (complete details in Ganin and Lempitsky (2015)).
• PT: From pretraining, we test domain-adaptive pretraining as described by Gururangan et al. (2020). This method tries to improve the target domain performance of BERT-based models by continued masked language modeling pretraining on unlabeled text from the target domain.
• IW: From instance weighting, we test classifier-based instance weighting. This method trains a classifier on the task of predicting an example's domain, then runs the classifier on all source domain examples and uses target domain probabilities as weights. Source examples that "look" more like the target domain get higher weights, improving performance on the target domain.
For instance weighting, we perform interleaved training, recomputing source weights after each model training pass.
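A single round of the self-training procedure described above can be sketched as follows. Here `model_predict` is a stand-in for the source-trained sequence labeler, and the confidence threshold is an illustrative choice, not the value used in our experiments.

```python
def self_train_round(model_predict, source_data, unlabeled_target, threshold=0.9):
    """One round of self-training (pseudo-labeling).

    model_predict: callable returning (label, confidence) for an example.
    Keeps only target examples whose predicted label clears the confidence
    threshold, and mixes them with the gold source data for retraining.
    """
    pseudo = []
    for example in unlabeled_target:
        label, confidence = model_predict(example)
        if confidence >= threshold:
            pseudo.append((example, label))
    # In the iterative variant, a new model is trained on this combined set
    # and the loop repeats with the retrained model.
    return source_data + pseudo

# Toy usage with a stub predictor standing in for the BERT labeler:
stub = lambda x: ("EVENT", 0.95) if "pain" in x else ("O", 0.5)
train_set = self_train_round(stub, [("fever", "EVENT")], ["pain in chest", "stable"])
```

The confidence filter is what limits, but does not eliminate, the error-propagation pitfall discussed in the results: low-confidence mistakes are dropped, while confident mistakes still enter the next training round.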

Results
Tables 8 and 9 show the results of all adaptation methods on coarse and fine NER, while Table 10 shows results on event extraction. ZS indicates baseline model scores in a zero-shot setting, i.e., training on source and testing on target with no adaptation. From these tables, we can see that the best-performing method categories across settings are loss augmentation and pseudo-labeling. Loss augmentation methods work best for event extraction. For coarse NER, pseudo-labeling methods work better on target datasets whose labels can be mapped to the source (i.e., closer transfer tasks). For i2b22010, which is a more distant transfer, loss augmentation works best. The effectiveness of pseudo-labeling is interesting because such methods often suffer from the pitfall of propagating errors made by the source-trained model, which may in part explain their poor performance on i2b22010. Early work on applying these methods to parsing showed negative results or minor improvements (Charniak, 1997; Steedman et al., 2003), but they have shown more promise in recent years with advances in embedding representations. Finally, for fine NER, loss augmentation and pseudo-labeling do better on i2b22006 and i2b22014 respectively. Pretraining is not the best-performing method in any setting, which may be a side effect of continual pretraining leading to some forgetting, which is particularly harmful in an unsupervised setting. This highlights the need to systematically compare adaptation methods under data-scarce settings, because the ranking of methods can change based on the availability and quality of domain-specific data.

Conclusion
This work presents a qualitative meta-analysis of 100 representative papers on domain adaptation and transfer learning in NLU, with the aim of understanding the performance of adaptation methods on the long tail. Through this analysis, we assess current trends and highlight methodological gaps that we consider to be major avenues for future research in transfer learning for the long tail. We observe that long tail coverage in current research is far from comprehensive, and identify two properties of adaptation methods that may improve long tail performance but have been under-explored. Additionally, we identify three major gaps that must be addressed to improve long tail performance: (i) combining adaptation methods, (ii) incorporating external knowledge, and (iii) application to data-scarce adaptation settings. Finally, we demonstrate the utility of our meta-analysis framework and observations in guiding the design of systematic meta-experiments to address prevailing open questions by conducting a systematic evaluation of popular adaptation methods for high-expertise domains in a data-scarce setting. This case study reveals interesting insights about the adaptation methods evaluated and shows that significant progress can be made toward developing a better understanding of adaptation for the long tail by conducting such experiments.

Figure 2: Distribution of papers retrieved by our search strategy across search terms and years.

Figure 3: TSNE visualization of our meta-analysis sample alongside additional transfer learning papers missed by our keyword search.
Figure 1 describes our sample curation process via a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) diagram (Page et al., 2021).

Figure 5: Distribution of papers according to tasks studied. The top three task categories are text classification (TC), semantic sequence labeling (NER), and syntactic sequence labeling (POS). Table 1 describes the remaining task categories.

Figure 6: Distribution of multi-lingual studies according to languages included.
Counts of papers (#P) studying high-expertise (HE) and non-narrative (NN) domains (DefSec refers to security and defense reports).

Figure 7: Distribution of papers according to unconventional (non-supervised) adaptation settings.

Figure 8: Distribution of transfer learning studies according to coarse method categories. DC, MC, and HY refer to the data-centric, model-centric, and hybrid coarse categories respectively.

Figure 9: Distribution of transfer learning studies according to fine method categories. The top five fine categories are feature augmentation (FA), loss augmentation (LA), pretraining (PT), parameter initialization (PI), and pseudo-labeling (PL). Table 2 describes the remaining categories in more detail.
(a) Fine method categories evaluated on high-expertise domains. The top five fine categories are FA, PL, LA, PT, and IW. (b) Fine method categories evaluated on non-narrative domains. The top four fine categories are LA, PT, PI, and FA. (c) Fine method categories evaluated on both domain types. The top four fine categories are LA, FA, PL, and PI.

Figure 10: Distribution of fine method categories from studies evaluating on long tail domains. Note that FA stands for feature augmentation, LA for loss augmentation, PL for pseudo-labeling, PT for pretraining, IW for instance weighting, and PI for parameter initialization. Table 2 describes the remaining categories in more detail.

Table 1: Categorization of tasks studied. Note that the matrix factorization category includes text-based recommender systems.

Table 2 describes the remaining categories in more detail.
2020) for code-switched NLP. However, newly proposed adaptation methods are often not evaluated on them, which is imperative for testing their limitations and generalization abilities. On the other hand, application-specific or domain-specific evaluations of adaptation methods are sidelined at NLP venues and may be viewed as limited in

Table 5: Evidence gap map showing which method categories have not been explored sufficiently for various task categories. Please refer to Tables 1 and 2 for task and model abbreviations.

Table 6: Evidence gap map indicating which method categories have not been explored sufficiently for various long tail domain categories. Note that HE and NN refer to high-expertise and non-narrative domains. Please refer to Table 2 for model abbreviations.