Abstract
Natural Language Processing algorithms have made incredible progress, but they still struggle when applied to out-of-distribution examples. We address a challenging and underexplored version of this domain adaptation problem, where an algorithm is trained on several source domains, and then applied to examples from unseen domains that are unknown at training time. Particularly, no examples, labeled or unlabeled, or any other knowledge about the target domain are available to the algorithm at training time. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation, based on the T5 language model. Given a test example, PADA first generates a unique prompt for it and then, conditioned on this prompt, labels the example with respect to the NLP prediction task. PADA is trained to generate a prompt that is a token sequence of unrestricted length, consisting of Domain Related Features (DRFs) that characterize each of the source domains. Intuitively, the generated prompt is a unique signature that maps the test example to a semantic space spanned by the source domains. In experiments with 3 tasks (text classification and sequence tagging), for a total of 14 multi-source adaptation scenarios, PADA substantially outperforms strong baselines.1
1 Introduction
Natural Language Processing (NLP) algorithms are gradually achieving remarkable milestones (Devlin et al., 2019; Lewis et al., 2020; Brown et al., 2020). However, such algorithms often rely on the seminal assumption that the training set and the test set come from the same underlying distribution. Unfortunately, this assumption often does not hold since text may emanate from many different sources, each with unique distributional properties. As generalization beyond the training distribution is still a fundamental challenge, NLP algorithms suffer a significant degradation when applied to out-of-distribution examples.
Domain Adaptation (DA) explicitly addresses this challenge, striving to improve out-of-distribution generalization of NLP algorithms. DA algorithms are trained on annotated data from source domains, to be effectively applied in a variety of target domains. Over the years, considerable efforts have been devoted to the DA challenge, focusing on various scenarios where the target domain is known at training time (e.g., through labeled or unlabeled data) but is yet under-represented (Roark and Bacchiani, 2003; Daumé III and Marcu, 2006; Reichart and Rappoport, 2007; McClosky et al., 2010; Rush et al., 2012; Schnabel and Schütze, 2014). Still, the challenge of adaptation to any possible target domain, which is unknown at training time, is underexplored in DA literature.2
In this work, we focus on adaptation to any target domain, which we consider a “Holy Grail” of DA (§3). Apart from the pronounced intellectual challenge, it also presents unique modeling advantages as target-aware algorithms typically require training a separate model for each target domain, leading to an inefficient overall solution.
Intuitively, better generalization to unseen domains can be achieved by integrating knowledge from several source domains. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation (§4), which utilizes an autoregressive language model (T5; Raffel et al., 2020), and presents a novel mechanism that learns to generate human-readable prompts that represent multiple source domains. Given a new example, from any unknown domain, the model first generates properties (a sequence of tokens) that belong to familiar (source) domains and relate to the given example. Then, the generated sequence is used as a prompt for the example, while the model performs the downstream task.3 PADA implements a specialized two-stage multi-task protocol that facilitates model parameter sharing between the prompt generation and the downstream tasks. Ultimately, PADA performs its adaptation per example, by leveraging (1) an example-specific prompting mechanism and (2) a two-stage multi-task objective.
In order to generate effective prompts, we draw inspiration from previous work on pivot features (Blitzer et al., 2006; Ziser and Reichart, 2018; Ben-David et al., 2020) to define sets of Domain Related Features (DRFs, §4.2). DRFs are tokens that are strongly associated with one of the source domains, encoding domain-specific semantics. We leverage the DRFs of the various source domains in order to span their shared semantic space. Together, these DRFs reflect the similarities and differences between the source domains, in addition to domain-specific knowledge.
Consider the task of review sentiment classification (Figure 1). The model is familiar with four source domains: restaurants, home-furniture, electronic-devices, and movies. When the model encounters a review, this time from the airlines domain, it uses DRFs to project the example into the shared semantic space, via the prompting mechanism. In the given example the DRFs marked in blue and green relate to the restaurants and the home-furniture domains, respectively. The DRF-based prompt is then used in classification.
We evaluate PADA in the multi-source DA setting, where the target domain is unknown during training (§5, 6). We consider two text classification tasks (Rumour Detection and Multi-Genre Natural Language Inference [MNLI]), and a sequence tagging task (Aspect Prediction), for a total of 14 DA setups. PADA outperforms strong baselines, yielding substantial error reductions.
2 Related Work
We first describe research in the setting of unsupervised DA with a focus on pivot-based methods. We then continue with the study of DA methods with multiple sources, focusing on mixture of experts models. Finally, we describe autoregressive language models and prompting mechanisms, and the unique manner in which we employ T5 for DA.
Unsupervised Domain Adaptation (UDA)
With the breakthrough of deep neural network (DNN) modeling, attention from the DA community has been directed to representation learning approaches. One line of work employs DNN-based autoencoders to learn latent representations. These models are trained on unlabeled source and target data with an input reconstruction loss (Glorot et al., 2011; Chen et al., 2012; Yang and Eisenstein, 2014; Ganin et al., 2016). Another branch employs pivot features to bridge the gap between a source domain and a target domain (Blitzer et al., 2006, 2007; Pan et al., 2010). Pivot features are prominent to the task of interest and are abundant in the source and target domains. Recently, Ziser and Reichart (2017, 2018, 2019) married the two approaches. Later on, Han and Eisenstein (2019) presented a pre-training method, followed by Ben-David et al. (2020) and Lekhtman et al. (2021), who introduced a pivot-based variant for pre-training contextual word embeddings.
Crucially, UDA models assume access to unlabeled data from the target domain during training. We see this as a slight relaxation of the goal of generalization beyond the training distribution. Moreover, this definition has engineering disadvantages, as a new model is required for each target domain. Instead, we pursue the any-domain adaptation setting, where unlabeled target data is unavailable at training time.
We draw inspiration from pivot-based modeling. The pivot definition relies on labeled source domain data and unlabeled source and target domain data (which is unavailable in our setup). Particularly, good pivots are ones that are correlated with the task label. Hence, pivot features are typically applied to tasks that offer meaningful correlations between words and the task label, such as sentiment classification. For other types of tasks, pivots may be difficult to apply. Consider the MNLI dataset, where the task is to understand the directional relation between a pair of sentences (entailment, contradiction, or neutral). In such a task it is unlikely to find meaningful correlations between single words and the label. Instead, we define task-invariant DRFs, features that are highly correlated with the identity of the domain. Since domains are highly correlated with words, our DRFs are lexical in nature.
Our proposed approach is an important step forward from pivots, as our model generates DRF sequences of unrestricted lengths, instead of focusing on individual words. Moreover, pivots are typically applied in single source setups, and while our method can operate with a single source domain, we utilize multiple source domains to facilitate generalization to unknown target domains.
Multi-Source Domain Adaptation
Most existing multi-source DA methods follow the setup definitions of unsupervised DA, while considering more than one source domain. A prominent approach is to fuse models from several sources. Early work trained a classifier for each domain and assumed all source domains are equally important for a test example (Li and Zong, 2008; Luo et al., 2008). More recently, adversarial-based methods used unlabeled data to align the source domains to the target domains (Zhao et al., 2018; Chen and Cardie, 2018). Meanwhile, Kim et al. (2017) and Guo et al. (2018) explicitly weighted a Mixture of Experts (MoE) model based on the relationship between a target example and each source domain. However, Wright and Augenstein (2020) followed this work and tested a variety of weighting approaches on a Transformer-based MoE and found a naive weighting approach to be very effective.
We recognize two limitations in the proposed MoE solution. First, MoE requires training a standalone expert model for each source domain. Hence, the total number of parameters increases (typically linearly) with the number of source domains, which harms the solution’s scalability. One possible solution could be to train smaller-scale experts (Pfeiffer et al., 2020; Rücklé et al., 2020), but this approach is likely to lead to degradation in performance. Second, domain experts are tuned towards domain-specific knowledge, at times at the expense of cross-domain knowledge that highlights the relationship between different domains. In practice, test examples may arrive from unknown domains, and may reflect a complicated combination of the sources. To cope with this, MoE ensembles the predictions of the experts using heuristic methods, such as a simple average or a weighted average based on the predictions of a domain-classifier. Our results indicate that this approach is sub-optimal.
Moreover, we view domain partitioning as often somewhat arbitrary (consider for example the differences between the dvd and movie domains). We do not want to strictly confine our model to a specific partitioning and rather encourage a more lenient approach towards domain boundaries. Hence, in this work, we train only a single model that shares its parameters across all domains. Furthermore, we are interested in adapting to any target domain, such that no information about potential target domains is known at training time. Some of the above works (Wright and Augenstein, 2020) in fact avoid utilizing target data, thus they fit the any-domain setting and form two of our baselines. Yet, in contrast to these works, the any-domain objective is a core principle of this study.
Autoregressive LMs and Prompting
Recently, a novel approach to language modeling has been proposed, which casts it as a sequence-to-sequence task, by training a full Transformer (encoder-decoder) model (Vaswani et al., 2017) to autoregressively generate masked, missing or perturbed token spans from the input sequence (Raffel et al., 2020; Lewis et al., 2020). Raffel et al. (2020) present a particularly interesting approach with the T5 model. It treats all tasks as generative (text-to-text), eliminating the need for a task-specific network architecture. This is made possible by prefixing each example with a prompt phrase denoting the specific task being performed.
Recent works have further explored such prompting mechanisms in several avenues: Adapting a language model for different purposes (Brown et al., 2020); eliciting sentiment or topic-related information (Jiang et al., 2020; Sun and Lai, 2020; Shin et al., 2020; Haviv et al., 2021); efficient fine-tuning (Li and Liang, 2021; Scao and Rush, 2021); or as a method for few-shot learning (Gao et al., 2021; Schick and Schütze, 2021).4 In this work, we make use of T5’s prompting mechanism as a way of priming the model to encode domain-specific characteristics relating to each example from an unknown target domain. Borrowing terminology from Liu et al. (2021a), our approach falls under the “Prompt+LM Tuning” training strategy (Liu et al., 2021b; Han et al., 2021). In this strategy, prompt-relevant parameters are fine-tuned together with some or all of the parameters of the pre-trained model (T5 in our case). However, in contrast to prompt tuning approaches which focus on representation-level tuning (Liu et al., 2021b; Li and Liang, 2021; Lester et al., 2021), we train T5 to generate human-readable prompts consisting of natural language tokens that encode domain-specific information relating to the given example. To the best of our knowledge, this work is the first to learn to generate textual prompts alongside a downstream prediction task. It is also the first to generate a unique prompt per example. Finally, it is the first to design a prompting mechanism for the purpose of DA.
3 Any-Domain Adaptation
DA and Transfer Learning
A prediction task (e.g., Rumour Detection) is defined by its label space Y. We denote X to be the feature space, P(X) to be the marginal distribution over X, and P(Y) the prior distribution over Y. A domain D is then defined by the triplet {X, P(X), P(Y)}. DA is a particular case of transfer learning, namely, transductive transfer learning (Ramponi and Plank, 2020), in which T_S and T_T, the source and target tasks, are the same. However, D_S and D_T, the source and target domains, differ in at least one of their underlying probability distributions: P(X), P(Y), or P(Y|X).5 The goal in DA is to learn a function f using data from a set of source domains, such that it generalizes well to a set of target domains.
The Any-Domain Setting
We focus on building an algorithm for a given task that is able to adapt to any domain. To this end, we assume zero knowledge about the target domain at training time. Hence, we slightly modify the classic setting of unsupervised multi-source domain adaptation, by assuming no knowledge of, or access to, labeled or unlabeled data from the target domains. We only assume access to labeled training data from K source domains, S_1, ..., S_K, where each S_j is a set of labeled examples (x_i, y_i) from the j-th source domain. The goal is to learn a model using only the source domains’ data, which generalizes well to unknown target domains.
The NLP and ML literature addresses several settings that are similar to any-domain adaptation. However, our on-the-fly example-based approach is novel. Below, we discuss these settings and the differences between their proposed solution approaches and ours.
The goal of any-domain adaptation was previously explored through the notion of domain robustness. Algorithms from this line of work seek generalization to unknown distributions through optimization methods which favor robustness over specification (Hu et al., 2018; Oren et al., 2019; Sagawa et al., 2020; Koh et al., 2020; Wald et al., 2021). This is typically achieved by training the model to focus on domain-invariant features, which are considered fundamental to the task and general across domains (Muandet et al., 2013; Ganin et al., 2016; Arjovsky et al., 2019; Müller et al., 2020). In contrast, this work proposes to achieve this goal through on-the-fly example-based adaptation, utilizing both domain-invariant and domain-specific features, as the latter often prove relevant to the new domain (Blitzer et al., 2006; Ziser and Reichart, 2017). For instance, consider the example presented in Figure 1. The expression “food was cold” is domain-specific with respect to the restaurants domain. Although it is not a domain-invariant feature, it may serve as a valuable feature for the target domain (airlines).
Any-domain adaptation also bears some similarity to the continual learning (Ring, 1995) and zero-shot learning (Palatucci et al., 2009) paradigms. Continual learning systems seek to transfer knowledge from a number of known tasks to a new one. In our setting, by contrast, new domains arrive during inference, and, unlike in continual learning, we do not update the model’s parameters when a new domain is presented (in fact, we do not even know the domains of the test examples).6 The zero-shot setting likewise keeps the model’s parameters fixed for a new task, yet its definition is less consistent across different models: GPT-3 (Brown et al., 2020) attempts to transfer knowledge to an unknown target task in an unknown target domain; Blitzer et al. (2009) assume access to unlabeled data from various domains, including the target domain; and Peng et al. (2018) use data of a different task from the target domain. In contrast, our problem setting specifically focuses on domain adaptation, while assuming no prior knowledge of the target domain.
The any-domain adaptation setting naturally calls for an example-level adaptation approach. Because the model does not have any knowledge about the target domain during training, each example it encounters during inference should be aligned with the source domains.
4 Example-based Adaptation through Prompt Learning
In this work we propose a single model that encodes information from multiple domains. Our model is designed such that test examples from new unknown domains can trigger the most relevant parameters in the model. This way we allow our model to share information between domains and use the most relevant information at test time. Our model is inspired by recent research on prompting mechanisms for autoregressive language models. We start (§4.1) by describing the general architecture of our model, and continue (§4.2) with the domain related features that form our prompts.
4.1 The Model
We present our example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation (PADA, Figure 2). PADA employs a pre-trained T5 language model and learns to generate example-specific DRFs in order to facilitate accurate task predictions. This is implemented through a two-step multi-task mechanism, where first a DRF set is generated to form a prompt, and then the task label is predicted.
Formally, assume an input example (xi,yi) ∼ Si, such that xi is the input text, yi is the task label, and Si is the domain of this example. For the input xi, PADA is trained to first generate Ni, the domain name, followed by Ri, the DRF signature of xi, and given this prompt to predict the label yi. At test time, when the model encounters an example from an unknown domain, it generates a prompt that may consist of one or more domain names as well as features from the DRF sets of one or more source domains, and based on this prompt it predicts the task label.
Test-time Inference
Consider the example in Figure 1, which describes a sentiment classification model, trained on the restaurants, home-furniture, electronic-devices, and movies source domains. The model observes a test example from the airlines domain, a previously unseen domain whose name is not known to the model. The model first generates the name of the domain that is most appropriate for this example, restaurants in this case. Then, it continues to generate the words “food” and “chair”, features related to the restaurants and home-furniture domains, respectively. Finally, given this prompt, the model predicts the example’s (negative) sentiment.
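To make this two-step mechanism concrete, below is a minimal sketch of PADA-style inference with the HuggingFace T5 implementation. The checkpoint name, the way the generated prompt is combined with the input, and the external task classifier head are our assumptions for illustration, not the authors' released code.

```python
# A minimal sketch of PADA's two-step inference (illustrative, not the authors' code).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")            # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base") # fine-tuned PADA weights assumed

def predict(text: str, task_classifier) -> int:
    # Step 1: prime with the special "Domain:" prefix and generate an
    # example-specific prompt, e.g. "restaurants - food, chair".
    gen_inputs = tokenizer("Domain: " + text, return_tensors="pt",
                           truncation=True, max_length=128)
    prompt_ids = model.generate(**gen_inputs, max_length=20)
    prompt = tokenizer.decode(prompt_ids[0], skip_special_tokens=True)

    # Step 2: condition the discriminative task on the generated prompt by
    # prepending it to the input, then classify with the task head (Section 5.3).
    cls_inputs = tokenizer(prompt + " " + text, return_tensors="pt",
                           truncation=True, max_length=128)
    with torch.no_grad():
        encoder_states = model.encoder(**cls_inputs).last_hidden_state
        logits = task_classifier(encoder_states)
    return int(logits.argmax(dim=-1))
```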
Training
In order to separate the prompt generation task from the discriminative classification task, we train our model within a multi-task framework. PADA is trained to perform two tasks, one for generating a prompt, consisting of features from the DRF set of the example’s domain, and another for predicting the example’s label. For the first, generative task, the model receives examples with the special prompt ‘Domain:’, which primes the model to generate Ni and Ri (see examples for prompts generated by PADA in Table 1). Note that Ri is a set of features derived from the DRF set of Si, and training examples are automatically annotated with their Ri, as described in §4.2. For the second, discriminative task, the model receives a prompt, consisting of Ni and Ri, and its task is to predict yi.
Table 1: Example test inputs from the Rumour Detection task and the prompts generated for them by PADA.

| Input | Generated prompt |
|---|---|
| A sad day for France, for journalism, for free speech, and those whose beliefs the attackers pretend to represent. | Ottawa Shooting - french, attack, shootings |
| Picture of explosion in Dammartin-en-Goele. all happened suddenly. | Germanwings Crash - explosion, goele, german |
| At least 5 hostages in kosher supermarket in eastern Paris, according to reports. | Paris Siege - hostages, reports, taker |
Following the multi-task training protocol of T5, we mix examples from each task. To this end, we define a task proportion mixture parameter α. Each example from the training set forms an example for the generative task with probability α, and an example for the discriminative task with probability 1 − α. The greater the value of α, the more the model will train for the generative task.
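As a rough sketch, this mixing step can be implemented by assigning each raw training example to one of the two tasks at preprocessing time; the field names and the prompt format (domain name, a dash, then comma-separated DRFs, as in Table 1) are assumptions.

```python
import random

def build_training_instance(example: dict, alpha: float, annotate_drfs) -> dict:
    """Assign a raw training example to one of PADA's two tasks.

    With probability alpha the example trains the generative task (input primed
    with 'Domain:', target = domain name followed by its annotated DRFs);
    otherwise it trains the discriminative task (input prefixed with that
    prompt, target = the task label).
    """
    prompt = f"{example['domain_name']} - {', '.join(annotate_drfs(example))}"
    if random.random() < alpha:
        return {"source": "Domain: " + example["text"], "target": prompt}
    return {"source": prompt + " " + example["text"], "target": example["label"]}
```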
At the heart of our method is the clever selection of the DRF set of each domain, and the prompt annotation process for the training examples. We next discuss these features and their selection process.
4.2 Domain Related Features
For each domain we define the DRF set such that these features provide a semantic signature for the domain. Importantly, if two domains have shared semantics, for example, the restaurants and the cooking domains, we expect their DRFs to semantically overlap. Since the prompt of each training example consists of a subset of features from the DRF set of its domain, we should also decide on a prompt generation rule that can annotate these training examples with their relevant features.
In order to reflect the semantics of the domain, DRFs should occur frequently in this domain. Moreover, they should be substantially more common in that specific domain relative to all other domains. Despite their prominence in a specific domain, DRFs can also relate to other domains. For instance, consider the top example presented in Table 1. The word “attack” is highly associated with the “Ottawa Shooting” domain and is indeed one of its DRFs. However, this word is also associated with “Sydney Siege”, which is another domain in the Rumour Detection dataset (Zubiaga et al., 2016). Moreover, because both domains are related to similar events, it is not surprising that the DRF set of the former contains the feature suspect and the DRF set of the latter contains the feature taker (see Table 3). The similarity of these features facilitates parameter sharing in our model.
Automatically Extracting DRFs
There can be several ways of implementing a DRF extraction method that are in line with the above DRF definition. We experimented with several different extraction criteria (Correlation, class-based TF-IDF,7 and Mutual Information), and observed high similarity (82% overlap) between their resulting DRF sets. However, we observed a qualitative advantage for Mutual Information (MI), which successfully extracted DRFs that hold domain-specific semantic meaning.
Intuitively, the smaller ρ is, the more certain we are that the n-gram is especially associated with the domain at hand, compared with the other source domains. Since the number of examples in a single domain is much smaller than the number of examples in all other source domains combined, we choose ρ ≥ 1 but do not allow it to be too large. As a result, this criterion allows features which are associated with the domain, but also related to other source domains, to be part of its DRF set. This is demonstrated in Table 2, where we present examples of DRFs extracted for the Ferguson domain of the rumour detection task, using different values of ρ. With ρ = 0, domain-specific DRFs such as “mikebrown” are extracted for the domain’s DRF set. Increasing ρ to 1 adds DRFs which are highly associated with the domain but are also prevalent in other domains (e.g., “killing” is also related to the Ottawa-shooting domain). However, increasing ρ to 10 yields DRFs which are less associated with the domain (e.g., “know”), and this is further exacerbated for even higher values of ρ.
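Since the extraction formula itself does not appear above, the following is only a rough sketch of an MI-based DRF selector with a ρ-style frequency-ratio filter; the exact scoring and filtering criteria used by the authors may differ.

```python
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def extract_drfs(domain_texts, other_texts, l=1000, rho=1.5):
    """Rough sketch: pick unigrams with high MI for the one-vs-rest domain label,
    then keep those whose relative frequency in the other source domains is at
    most rho times their relative frequency in this domain (assumed criterion)."""
    texts = domain_texts + other_texts
    labels = np.array([1] * len(domain_texts) + [0] * len(other_texts))
    vec = CountVectorizer(lowercase=True)
    counts = vec.fit_transform(texts)
    vocab = np.array(vec.get_feature_names_out())

    # Mutual information between each unigram and the domain indicator.
    mi = mutual_info_classif(counts, labels, discrete_features=True)
    candidates = vocab[np.argsort(mi)[::-1][:l]]

    in_counts = Counter(w for t in domain_texts for w in t.lower().split())
    out_counts = Counter(w for t in other_texts for w in t.lower().split())
    n_in, n_out = sum(in_counts.values()), sum(out_counts.values())

    drfs = []
    for w in candidates:
        p_in = in_counts[w] / n_in
        p_out = out_counts[w] / n_out
        if p_in > 0 and p_out / p_in <= rho:
            drfs.append(w)
    return drfs
```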
Annotating DRF-based Prompts for Training
We denote the DRF set of the jth domain with Rj. Given a training example i from domain j, we select the m features from Rj that are most associated with this example to form its prompt. To do that, we compute the Euclidean distance between the T5 embeddings of the DRF features and the T5 embeddings of each of the example’s tokens. We then rank the resulting (feature, token) pairs by their distance and select the top m features.8 In Table 3 we provide a sample of DRFs from the DRF sets associated with each domain in the rumour detection task (§ 5), alongside their frequency statistics for being annotated in a training example’s prompt.
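A sketch of this per-example annotation step, assuming an `embed` function that returns the non-contextual T5 input embedding of a word; scoring each DRF by its closest input token is our reading of the procedure.

```python
import torch

def annotate_prompt(example_tokens, drf_set, embed, m: int = 5):
    """Select the m DRFs most associated with the example: for each DRF, compute
    the Euclidean distance between its (non-contextual) T5 embedding and the
    embedding of each of the example's tokens, score the DRF by its closest
    token, and keep the m lowest-distance DRFs."""
    token_vecs = torch.stack([embed(t) for t in example_tokens])          # (T, d)
    scored = []
    for drf in drf_set:
        dists = torch.norm(token_vecs - embed(drf).unsqueeze(0), dim=-1)  # (T,)
        scored.append((drf, dists.min().item()))
    scored.sort(key=lambda pair: pair[1])                                  # smaller = closer
    return [drf for drf, _ in scored[:m]]
```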
To conclude, our methods for domain-specific DRF set extraction and for prompt annotation of training examples, demonstrate three attractive properties. First, every example has its own unique prompt. Second, our prompts map each training example to the semantic space of its domain. Lastly, the domain-specific DRF sets may overlap in their semantics, either by including the same tokens or by including tokens with similar meanings. This way they provide a more nuanced domain signature compared to the domain name alone. This is later used during the inference phase when the model can generate an example-specific prompt that consists of features from the DRF sets of the various source domains.
Table 3: A sample of DRFs from the DRF sets of the Rumour Detection domains shown (C, GW, OS, S), with the percentage of the domain’s training examples whose annotated prompt includes each DRF.

| C | GW | OS | S |
|---|---|---|---|
| hebdo (88%) | lufthansa (86%) | ottawa (83%) | australians (75%) |
| ahmed (48%) | germanwings (33%) | cdnpoli (36%) | monis (69%) |
| terrorists (22%) | crash (25%) | shooting (30%) | isis (21%) |
| attack (19%) | plane (24%) | soldier (12%) | cafe (18%) |
| victims (4%) | barcelona (23%) | suspect (5%) | taker (16%) |
5 Experimental Setup
5.1 Task and Datasets
We experiment with three multi-source DA tasks, where a model is trained on several domains and applied to a new one. We consider two text classification tasks, Rumour Detection and MNLI, and one sequence tagging task—Aspect Prediction. The details of the training, development, and test sets of each domain are provided in Table 4. Our experiments are performed in a leave-one-out fashion: We train the model on all domains but one, and keep the held-out domain for testing. Particularly, training is done on the training data of the source domains and development on their development data, while the test data is taken from the target domain, which is unknown at training time. We repeat the experiments in each task such that each domain is used as a target domain.
Table 4: The number of training, development (source), and test (target) examples per domain.

Rumour Detection

| Domain | Training (src) | Dev (src) | Test (trg) |
|---|---|---|---|
| Charlie-Hebdo (C) | 1,663 | 416 | 2,079 |
| Ferguson (FR) | 914 | 229 | 1,143 |
| Germanwings-crash (GW) | 375 | 94 | 469 |
| Ottawa-shooting (OS) | 712 | 178 | 890 |
| Sydney-siege (S) | 976 | 245 | 1,221 |

MNLI

| Domain | Training (src) | Dev (src) | Test (trg) |
|---|---|---|---|
| Fiction (F) | 2,547 | 1,972 | 1,972 |
| Government (G) | 2,541 | 1,944 | 1,944 |
| Slate (SL) | 2,605 | 1,954 | 1,954 |
| Telephone (TL) | 2,754 | 1,965 | 1,965 |
| Travel (TR) | 2,541 | 1,975 | 1,975 |

Aspect

| Domain | Training (src) | Dev (src) | Test (trg) |
|---|---|---|---|
| Device (D) | 2,302 | 255 | 1,279 |
| Laptops (L) | 2,726 | 303 | 800 |
| Restaurants (R) | 3,487 | 388 | 800 |
| Service (SE) | 1,343 | 149 | 747 |
Rumour Detection
The PHEME dataset of rumourous tweets (Zubiaga et al., 2016, 2017) contains 5,802 tweets, which followed 5 different real-world events, and are labelled as rumourous or non-rumourous.9 We treat each event as a separate domain: Charlie-Hebdo (C), Ferguson (FR), Germanwings-crash (GW), Ottawa-shooting (OS), and Sydney-siege (S).
We follow the data processing procedure of Wright and Augenstein (2020) and split each domain (event) corpus by a 4:1 ratio, establishing training and development sets. Because the corpora are relatively small, we want to avoid further shrinking the size of the test set. Hence, we include all examples available from the target domain to form the test set.10
MNLI
This corpus (Williams et al., 2018) is an extension of the SNLI dataset (Bowman et al., 2015).11 Each example consists of a pair of sentences, a premise and a hypothesis. The relationship between the two may be entailment, contradiction, or neutral. The corpus includes data from 10 domains: 5 are matched, with training, development and test sets, and 5 are mismatched, without a training set. We experiment only with the five matched domains: Fiction (F), Government (G), Slate (SL), Telephone (TL), and Travel (TR).
Since the test sets of the MNLI dataset are not publicly available, we use the original development sets as our test sets for each target domain, while source domains use these sets for development. We explore a lightly supervised scenario, which emphasizes the need for a DA algorithm. Thus, we randomly downsample each of the training sets by a factor of 30, resulting in 2,000–3,000 examples per set.
Aspect Prediction
The Aspect Prediction dataset is based on aspect-based sentiment analysis (ABSA) corpora from four domains: Device (D), Laptops (L), Restaurants (R), and Service (SE). The D data consist of reviews from Toprak et al. (2010), the SE data include web service reviews (Hu and Liu, 2004), and the L and R domains consist of reviews from the SemEval-2014 ABSA challenge (Pontiki et al., 2014).
We follow the training and test splits defined by Gong et al. (2020) for the D and SE domains, while the splits for the L and R domains are taken from the SemEval-2014 ABSA challenge. To establish our development set, we randomly sample 10% out of the training data.
5.2 Evaluated Models
Our main model is PADA: The multi-task model that first generates the domain name and domain related features to form a prompt, and then uses this prompt to predict the task label (§4.1, Figure 2). We compare it to two types of models: (a) T5-based baselines corresponding to ideas presented in multi-source DA work, as well as other recent state-of-the-art models (§2); and (b) Ablation models that use specific parts of PADA, to highlight the importance of its components.
5.2.1 Baseline Models
Transformer-based Mixture of Experts (Tr-MoE)
For each source domain, a separate transformer-based DistilBERT expert model (Sanh et al., 2019) is trained on the domain’s training set, and an additional model is trained on the union of training sets from all source domains. At test time, the average of the class probabilities of these models is calculated and the highest probability class is selected. This model is named MoE-avg by Wright and Augenstein (2020) and has been demonstrated to achieve state-of-the-art performance for Rumour Detection.
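A minimal sketch of this ensembling step (also used by T5-MoE and Prompt-DN below): the experts' class-probability vectors are averaged and the highest-probability class is returned.

```python
import numpy as np

def moe_average_predict(expert_probs: list) -> int:
    """expert_probs: one class-probability vector per expert for a single example."""
    avg = np.mean(np.stack(expert_probs), axis=0)
    return int(np.argmax(avg))

# e.g., three experts over two classes:
# moe_average_predict([np.array([0.6, 0.4]), np.array([0.3, 0.7]), np.array([0.5, 0.5])]) -> 1
```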
T5-MoE
A T5-based MoE ensemble model. For each source domain, a separate pre-trained T5 model is fine-tuned on the domain’s training set (i.e., a domain expert model). During inference, the final predictions of the model are decided using the same averaging procedure as in Tr-MoE.
T5-No-Domain-Adaptation (T5-NoDA)
A pre-trained T5 model, which feeds the same task classifier used in PADA (see below) to predict the task label. In each DA setting, the model is trained on the training data from all source domains.
We also experiment with an in-domain version of this model, T5-UpperBound (T5-UB), which is tested on the development data of each domain. We treat T5-UB performance as an upper bound for the average target performance across all DA settings, for any T5-based model in our setup.
T5-Domain-Adversarial-Network (T5-DAN)
A model that integrates T5-NoDA with an adversarial domain classifier to learn domain invariant representations.12
T5-Invariant-Risk-Minimization (T5-IRM)
A T5-based model that penalizes feature distributions that have different optimal linear classifiers for each domain. The model is trained on the training data from all source domains.
5.2.2 Ablation Models
Prompt-DN
A simplified version of our PADA model, which assigns only a domain name as a prompt to the input text. Since the domain name is unknown at test time, we create multiple variants of each test example, each with one of the training domain names as a prompt. For the final predictions of the model we follow the same averaging procedure as in Tr-MoE and T5-MoE.
Prompt-RDW and Prompt-REW
Two simplified versions of PADA that form prompts from Random-Domain-Words and Random-Example-Words, respectively. For Prompt-RDW, we sample m = 5 domain words (according to their distribution in the joint vocabulary of all source domains) for each example. For Prompt-REW, we randomly select m = 5 words from the example’s text. At both training and test times, we follow the same prompt formation procedures.
PADA-NP (No Prompt)
A multi-task model similar to PADA, except that it simultaneously generates the example-specific domain name and DRF-based prompt, and predicts the task label (Figure 3a). Because this model does not condition the task prediction on the generated prompt, it sheds light on the effect of the autoregressive nature of PADA.
PADA-NM (No Multi-task)
A pipeline of two independent models which emulates PADA. Given an input example, the first model generates a unique prompt for it. Then, the second model predicts the task label given the input and its generated prompt (Figure 3b). Since the prediction and prompt generation tasks are not performed jointly, nor are the model parameters shared between the tasks, this pipeline sheds light on the effect of the multi-task nature of PADA.
5.3 Implementation Details
The T5-based text classification models do not follow the same classification procedure originally described in Raffel et al. (2020). Instead, we add a simple 1D-CNN classifier on top of the T5 encoder to predict the task label (Figure 2). The number of filters in this classifier is 32 and the filter size is 9. The generative component of the T5-based models is identical to that of the original T5. Our T5-based models for Aspect Prediction cast sequence tagging as a sequence-to-sequence task, employing the text-to-text approach of Raffel et al. (2020) to generate a ‘B’ (begin), ‘I’ (in), or ‘O’ (out) token for each input token. Other than this change, these models are identical to the T5-based models for text classification.
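A minimal sketch of such a head in PyTorch; the pooling operation and the padding choice are not specified above and are our assumptions.

```python
import torch
import torch.nn as nn

class CNNTaskClassifier(nn.Module):
    """1D-CNN head over T5 encoder states: 32 filters of width 9, max-pooled
    over the sequence, followed by a linear layer to the label space."""
    def __init__(self, hidden_size: int = 768, num_classes: int = 2,
                 num_filters: int = 32, filter_size: int = 9):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size, num_filters,
                              kernel_size=filter_size, padding=filter_size // 2)
        self.out = nn.Linear(num_filters, num_classes)

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, seq_len, hidden) -> (batch, hidden, seq_len)
        x = torch.relu(self.conv(encoder_states.transpose(1, 2)))
        x = x.max(dim=-1).values            # global max-pool over the sequence
        return self.out(x)                  # (batch, num_classes)
```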
We train all text classification models for 5 epochs and all sequence tagging models for 60 epochs, with an early stopping criterion according to performance on the development data. We use the cross-entropy loss function for all models, optimizing their parameters with the ADAM optimizer (Kingma and Ba, 2015). We employ a batch size of 32 for text classification and 24 for sequence tagging, a warmup ratio of 0.1, and a learning rate of 5 ⋅ 10−5. The maximum input and output lengths of all T5-based models are set to 128 tokens. We pad shorter sequences and truncate longer ones to the maximum input length.
For PADA, we tune the α parameter (the task proportion mixture, see §4.1) over the value range {0.1, 0.25, 0.5, 0.75, 0.9}. The chosen values are: αrumour = 0.75, αmnli = 0.1, and αabsa = 0.1. For each training example, we select the top m = 5 DRFs most associated with it to form its prompt. For the generative component of the T5-based models, we perform inference with the Diverse Beam Search algorithm (Vijayakumar et al., 2016), considering the following hyper-parameters: We generate 5 candidates, using a beam size of 10, with 5 beam groups, and a diversity penalty value of 1.5. The l and ρ parameters of the DRF extraction procedure (§4.2) were tuned to 1000 and 1.5, respectively, for all domains.
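These decoding hyper-parameters map directly onto the HuggingFace `generate` API; a sketch, with a placeholder checkpoint and an input taken from Table 1:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")            # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base") # fine-tuned PADA weights assumed

text = "At least 5 hostages in kosher supermarket in eastern Paris, according to reports."
gen_inputs = tokenizer("Domain: " + text, return_tensors="pt",
                       truncation=True, max_length=128)

# Diverse Beam Search with the hyper-parameters reported above:
# 5 candidates, beam size 10, 5 beam groups, diversity penalty 1.5.
candidate_ids = model.generate(
    **gen_inputs,
    num_return_sequences=5,
    num_beams=10,
    num_beam_groups=5,
    diversity_penalty=1.5,
    max_length=128,
)
candidates = tokenizer.batch_decode(candidate_ids, skip_special_tokens=True)
```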
6 Results
Text Classification
Table 5 presents our results. We report the binary-F1 score for Rumour Detection, and the macro-F1 score for MNLI.15
Table 5: Binary-F1 for Rumour Detection and macro-F1 for MNLI. Each column is a leave-one-out setting (All→X: trained on all source domains except X, tested on X).

Rumour Detection (binary-F1)

| Model | All→C | All→FR | All→GW | All→OS | All→S | AVG |
|---|---|---|---|---|---|---|
| Tr-MoE | 68.0 | 46.1 | 74.8 | 58.2 | 64.9 | 62.4 |
| T5-MoE | 68.1 | 46.0 | 73.6 | 65.3 | 66.3 | 63.9 |
| T5-DAN | 64.9 | 52.4 | 69.1 | 72.7 | 64.4 | 64.7 |
| T5-IRM | 63.5 | 39.4 | 70.1 | 44.2 | 65.7 | 56.6 |
| T5-NoDA | 64.1 | 46.9 | 75.1 | 72.0 | 71.0 | 65.8 |
| Prompt-DN | 66.4 | 53.7 | 72.4 | 71.4 | 70.1 | 66.8 |
| Prompt-RDW | 64.1 | 53.1 | 71.8 | 66.0 | 70.0 | 65.0 |
| Prompt-REW | 64.2 | 54.3 | 71.6 | 70.0 | 69.1 | 65.8 |
| PADA-NP | 65.8 | 54.8 | 71.6 | 72.2 | 74.0 | 67.7 |
| PADA-NM | 63.6 | 54.1 | 74.3 | 70.1 | 70.3 | 66.5 |
| PADA | 68.6 | 54.4 | 73.0 | 75.2 | 75.1 | 69.3 |

MNLI (macro-F1)

| Model | All→F | All→G | All→SL | All→TE | All→TR | AVG |
|---|---|---|---|---|---|---|
| Tr-MoE | 64.3 | 73.9 | 65.3 | 62.4 | 69.8 | 67.1 |
| T5-MoE | 74.0 | 82.0 | 73.4 | 74.6 | 78.3 | 76.5 |
| T5-DAN | 74.4 | 76.3 | 61.0 | 72.4 | 77.7 | 72.4 |
| T5-IRM | 72.0 | 81.5 | 73.2 | 69.3 | 78.9 | 75.0 |
| T5-NoDA | 76.4 | 83.5 | 75.5 | 74.9 | 81.3 | 78.3 |
| Prompt-DN | 77.0 | 84.4 | 75.6 | 76.3 | 80.5 | 78.8 |
| Prompt-RDW | 76.0 | 84.2 | 76.6 | 77.0 | 79.9 | 78.7 |
| Prompt-REW | 75.7 | 81.4 | 76.7 | 78.8 | 81.2 | 78.7 |
| PADA-NP | 76.2 | 83.6 | 75.4 | 77.2 | 81.4 | 78.8 |
| PADA-NM | 76.0 | 83.7 | 76.5 | 78.0 | 81.0 | 79.0 |
| PADA | 76.4 | 83.4 | 76.9 | 78.9 | 82.5 | 79.6 |
PADA outperforms all baseline models (§ 5.2.1) in 7 of 10 settings and ties with T5-NoDA for the highest result in another setting, exhibiting average performance gains of 3.5% and 1.3% in Rumour Detection and MNLI, respectively, over the best performing baseline model. Interestingly, it is T5-NoDA, which does not perform any DA, that outperforms (on average and in most model-to-model comparisons) all other baseline models, including the MoE models.
While the performance gains differ between the tasks, they partly stem from the different performance gaps between source and target domains in each of these tasks. Recall that we consider the T5-UB performance on its development sets for Rumour Detection (82.8%) and MNLI (80.8%) to be the upper bound for the average target performance across all DA settings, for any T5-based model. When considering the gaps between this upper bound and T5-NoDA (65.8% for Rumour Detection and 78.3% for MNLI), PADA reduces the error rate by 21% for Rumour Detection and 52% for MNLI. The improvements gained by PADA are in fact substantial in both tasks.
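Concretely, these error reductions are computed relative to the gap between T5-NoDA and the T5-UB upper bound; the numbers below reproduce the reported 21% and 52%.

```latex
\mathrm{ER} = \frac{\mathrm{PADA} - \text{T5-NoDA}}{\text{T5-UB} - \text{T5-NoDA}}
            = \frac{69.3 - 65.8}{82.8 - 65.8} \approx 21\% \quad (\text{Rumour Detection}),
\qquad
\frac{79.6 - 78.3}{80.8 - 78.3} \approx 52\% \quad (\text{MNLI}).
```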
The advantage of PADA over MoE goes beyond improved predictions. Particularly, for PADA we train a single model while for MoE we train a unique model for each source domain, hence the number of parameters in the MoE framework linearly increases with the number of source domains. For example, in our setups, Tr-MoE trains five DistilBERT models (one for each source domain and one for all source domains together), resulting in 5 ⋅ 66M = 330M parameters. In contrast, the PADA models keep the 220M parameters of T5, regardless of the number of source domains.
Sequence Tagging
In order to demonstrate the wide applicability of our approach, we go beyond text classification (with 2 [Rumour Detection] or 3 [MNLI] classes) and also consider Aspect Prediction: a sequence tagging task. We are particularly curious to see whether the aforementioned patterns replicate in this qualitatively different task. Our results are presented in Table 6, where we report the binary-F1 score (the F1 score of the aspect class). Crucially, the patterns we observe for text classification can also be detected for sequence tagging. Particularly, PADA is the best performing model in 4 of 4 settings compared to its baselines. On average, PADA outperforms the second-best model, T5-IRM, by 3.5%. Given the average results of T5-UB (69.4%) and T5-NoDA (38.7%), the error reduction is 24%.
Table 6: Binary-F1 (aspect class) for Aspect Prediction, in leave-one-out settings (All→X).

| Model | All→D | All→L | All→R | All→SE | AVG |
|---|---|---|---|---|---|
| T5-MoE | 39.5 | 31.4 | 31.4 | 30.9 | 33.3 |
| T5-DAN | 28.4 | 38.0 | 49.1 | 33.4 | 33.2 |
| T5-IRM | 37.1 | 44.6 | 47.4 | 41.5 | 42.7 |
| T5-NoDA | 31.1 | 45.6 | 40.2 | 37.9 | 38.7 |
| Prompt-DN | 41.1 | 42.6 | 29.0 | 30.8 | 35.9 |
| Prompt-RDW | 34.6 | 46.9 | 52.9 | 41.2 | 43.9 |
| Prompt-REW | 38.2 | 49.5 | 45.1 | 39.6 | 43.1 |
| PADA-NP | 41.7 | 48.2 | 50.1 | 40.1 | 45.0 |
| PADA-NM | 40.3 | 48.8 | 50.8 | 40.2 | 45.0 |
| PADA | 43.1 | 50.9 | 50.8 | 45.3 | 47.5 |
PADA Ablation Models
As shown in Table 5, PADA outperforms all of its variants (§ 5.2.2) in 6 out of 10 text classification settings overall. Furthermore, in the sequence tagging task (Table 6), PADA outperforms its simpler variants (Prompt-{DN, REW}, PADA-NP) in all 4 setups, and Prompt-RDW and PADA-NM in 3 out of 4 setups. These results highlight the importance of our design choices: (a) including DRFs in the example-specific prompts, tailoring them to express the relation between the source domains and the test example (PADA vs Prompt-{DN, RDW, REW}); (b) utilizing an autoregressive component, where the generated DRF prompts are used by the task classification component (PADA vs PADA-NP); and (c) leveraging a multi-task training objective (PADA vs PADA-NM). A noticeable difference between the aspect prediction results and the text classification results is the weakness of Prompt-DN, which is outperformed by all baseline models (§ 5.2.1) in 2 setups, and by 2 of these models in a third setup, as well as on average across all setups. This is yet another indication of the importance of the DRFs in the prompts generated by PADA.
7 Ablation Analysis
In this section, we analyze several unique aspects of PADA. We first evaluate the prompts generated by PADA, to gain further insight into its generative capabilities. We then analyze the impact of the number of source domains on PADA’s performance. Finally, we examine performance drops due to domain shifts, in order to evaluate PADA’s adaptation stability across domains. For the sake of clarity and concision, analyses will henceforth focus on the rumour detection task.
Generated Prompts Analysis
We first present an intrinsic evaluation of PADA’s prompt generation task (see §4.1) by examining model-generated prompts for examples from the development set, compared to their annotated prompts.16 We choose automatic metrics widely used for evaluating NLG tasks, focusing on n-gram overlap by calculating ROUGE (Lin, 2004) scores, as well as measuring semantic similarity with BERTScore (Zhang et al., 2020). In Table 7 we present average F1 scores for these metrics, calculated over all DA settings in the rumour detection task. The high average BERTScore (0.94) indicates that the generated prompts share high semantic similarity with their annotated prompts. Yet, the average ROUGE-1 (0.64) and ROUGE-2 (0.3) scores indicate that the generated prompts differ from their annotated prompts at the unigram and bigram levels (respectively). This evidence suggests that PADA learns to leverage the semantic overlap between DRFs rather than memorizing specific n-grams (e.g., an annotated DRF may be terrorist while the generated word may be gunman).
Table 7: Average F1 scores of generated prompts against their annotated prompts on the development sets, across all Rumour Detection DA settings.

| | BERTScore | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| Dev F1 | 0.94 | 0.64 | 0.30 | 0.61 |
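A sketch of how such scores can be computed with the common `rouge_score` and `bert_score` packages; the paper does not specify its evaluation implementation, and the annotated prompt below is a hypothetical example.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = ["Ottawa Shooting - french, attack, shootings"]   # PADA output (Table 1)
annotated = ["Ottawa Shooting - attack, soldier, shooting"]   # hypothetical annotated prompt

# N-gram overlap: ROUGE-1/2/L F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(annotated[0], generated[0])
print({name: round(s.fmeasure, 2) for name, s in rouge.items()})

# Semantic similarity: BERTScore F1.
_, _, f1 = bert_score(generated, annotated, lang="en")
print(round(f1.mean().item(), 2))
```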
We continue our evaluation by analyzing the origins of words in the PADA-generated prompts, specifically, whether they appear in the source domains’ DRF sets, in the input text, or in neither (Novel). Figure 4 presents the average ratios of the different origins of generated prompt tokens, calculated over all DA settings in the rumour detection task. As expected, the overwhelming majority of generated tokens come from the source domains’ DRF sets, for both development (92.7%) and test (75.3%) sets. However, when introduced to examples from unknown domains (test sets), we observe a significant increase (compared to the development sets) in novel tokens (18.9% vs 5.4%) and a slight increase in tokens from the example’s input text (14.1% vs 11.7%).
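A sketch of the bookkeeping behind this breakdown; note that the DRF and input-text categories need not be mutually exclusive (the reported percentages do not sum to 100%), which is how the counts are treated here.

```python
def prompt_token_origins(prompt_tokens, drf_sets, input_tokens):
    """For each generated prompt token, record whether it appears in any source
    domain's DRF set, in the input example's text, or in neither (novel).
    The first two categories are not mutually exclusive."""
    all_drfs = set().union(*drf_sets.values())
    input_set = {t.lower() for t in input_tokens}
    counts = {"drf": 0, "input": 0, "novel": 0}
    for tok in prompt_tokens:
        t = tok.lower()
        in_drf, in_input = t in all_drfs, t in input_set
        counts["drf"] += in_drf
        counts["input"] += in_input
        counts["novel"] += not (in_drf or in_input)
    return counts
```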
Furthermore, Figure 5 demonstrates that PADA is able to exploit information from its multiple source domains. For test examples, PADA generates prompts containing DRFs from several domains (95% of prompts contain DRFs from more than 2 source domains), while for development examples it mostly generates prompts with DRFs only from the correct source domain. Together with the examples presented in Table 1, these observations suggest an encouraging finding: PADA is successful in generating prompts which leverage and integrate both the source domains and the semantics of the input example.
Number of Source Domains
We next turn to study the impact of the number of source domains on PADA’s overall performance. Figure 6 presents F1 scores by the number of source domains for PADA and two of its baselines, namely, T5-NoDA and T5-MoE. We provide results on two target domains, as well as an average score across all five target domains from the rumour detection dataset.
As indicated in the figure, PADA’s performance improves as the number of source domains increases. These results support our claim that PADA is able to integrate knowledge from multiple source domains by learning a meaningful domain-mixture, and it then leverages this knowledge when introduced to an example from a new, unknown, domain. Interestingly, for the baseline models T5-NoDA and T5-MoE, it seems that including more source domains can sometimes harm their ability to generalize to unknown target domains. One of our main hypotheses states that a DA model stands to benefit from incorporating combined knowledge from multiple source domains (§4). PADA successfully implements this idea, while T5-MoE and T5-NoDA fall short.
Performance Drops between Source and Target
When a DA method improves model performance on the target domain, this can result in either increasing or decreasing the performance gap between the source and target domains. If a model performs similarly on its source training domains and on unseen target domains, its source domain performance can also provide an important indication for its future performance in such unseen domains. We hence consider such stability in performance as a desired property in our setup where future target domains are unknown (see discussion in Ziser and Reichart [2019]).
Figure 7 presents a heatmap depicting the performance drop for each model between the source domains and the target domains in rumour detection. We measure each model’s in-domain performance by calculating an F1 score across all development examples from its source domains, as well as out-of-domain performance on the target domain test set, as described in §6. We then calculate the difference between the source and the target performance measures, and report results for the best performing models in our experiments (§6). The general trend is clear: PADA not only performs better on the target domain, but it also substantially reduces the source-target performance gap. While T5-NoDA, which is not a DA model, triggers the largest average absolute performance drop, 17%, the average of PADA’s absolute performance drop is 8.7%.
8 Discussion
We addressed the problem of multi-source domain adaptation when the target domain is not known at training time. Effective models for this setup can be applied to any target domain with no data requirements about the target domains and without an increase in the number of model parameters as a function of the number of source or target domains. PADA, our algorithm, extends the prompting mechanism of the T5 autoregressive language model to generate a unique textual prompt per example. Each generated prompt maps its test example into a semantic space spanned by the source domains.
Our experimental results with three tasks and fourteen multi-source adaptation settings demonstrate the effectiveness of our approach compared to strong alternatives, as well as the importance of the model components and of our design choices. Moreover, as opposed to the MoE paradigm, where a model is trained separately for each source domain, PADA provides a single unified model. Intuitively, this approach also seems more cognitively plausible—a single model attempts to adapt itself to examples from new incoming domains, rather than employing an independent model per domain.
The prompt generation mechanism of PADA is naturally limited by the set of source domains it is trained on. This might yield sub-optimal DRFs in prompts generated for examples stemming from target domains which are semantically unrelated to any of the source domains. To alleviate this issue, we allow PADA to generate non-DRF words. Still, our prompt generation training process does not directly optimize for the downstream prediction task’s objective, which might also contribute to sub-optimally generated prompts. In future work, we hope to improve these aspects of our approach and explore natural extensions that accommodate multiple tasks and domains in a single model.
Acknowledgments
We would like to thank the action editor and the reviewers, as well as the members of the IE@Technion NLP group for their valuable feedback and advice. This research was partially funded by an ISF personal grant no. 1625/18.
Notes
Our code and data are available at https://github.com/eyalbd2/PADA.
The any-domain adaptation setting is addressed in the model robustness literature. In §3, we discuss the differences between these static methods and our dynamic approach.
We use a language model, pre-trained on massive unlabeled data, and it is possible that this model was exposed to text from the source or target domains. Yet, the downstream task training is based only on examples from the source domains without any knowledge of future target domains.
For a comprehensive discussion of the research on prompting mechanisms, we refer to Liu et al. (2021a).
In inductive transfer learning, the target task T_T differs from the source task T_S.
von Oswald et al. (2020) explore the notion of inferring the new example’s task out of the training tasks.
In this computation we consider the non-contextual embeddings learned by T5 during its pre-training. In our experiments we consider only unigrams (words) as DRFs.
This does not harm the integrity of our experiments, since the training and development sets are sampled from the source domains while the test set is sampled only from the target domain.
We also experimented with BERT-NoDA and BERT-DAN models. We do not report their results because they were consistently outperformed by T5-NoDA and T5-DAN.
We experimented with the original T5 classification method as well, but PADA consistently outperformed it.
Binary-F1 measures the F1 score of the positive class. It is useful in cases of unbalanced datasets where the positive class is of interest (34% of the Rumour Detection dataset).
PADA is not restricted to specific structures or vocabulary when generating prompts, hence our annotated prompts only serve as pseudo gold labels for training purposes.
References
Author notes
Action Editor: Jimmy Lin
Both authors equally contributed to this work.