Abstract
In many NLP applications, source data is collected to mitigate the deficiency of training data in a target task. Existing transfer learning methods either select a subset of source examples that are close to the target domain or try to adapt all source examples into the target domain, and then use the selected or adapted source examples to train the target model. The former incurs significant information loss, while the latter bears the risk that source examples originally in the target domain are moved outside it after adaptation. To address these limitations, we propose a four-level optimization based framework that simultaneously selects and adapts source data. Our method automatically identifies in-domain and out-of-domain source examples and applies example-specific processing: selection for in-domain examples and adaptation for out-of-domain examples. Experiments on various datasets demonstrate the effectiveness of our proposed method.
1 Introduction
Transfer learning (TL) (Zhuang et al., 2020), which aims to improve a model in a target domain by utilizing data from a source domain, has been broadly studied in natural language processing. One paradigm of TL methods (Sun et al., 2011; Song et al., 2012; Wang et al., 2017b; Patel et al., 2018; Qu et al., 2018; Liu et al., 2019b) selects a subset of source examples that are close to the target domain and uses the selected examples as additional training data for the target model. Another paradigm (Pan et al., 2010; Ganin et al., 2016; Bousmalis et al., 2017) adapts the entire set of source examples into the target domain and uses the adapted examples as additional training data. The problem with selection-based methods is that unselected examples are discarded: despite their domain discrepancy with the target data, these examples still contain useful information for improving the target model, so discarding them leads to information loss. The problem with adaptation-based methods is that some source examples may already lie in the target domain; performing domain adaptation on them wastes effort and, even worse, may move them outside the target domain.
To address the limitations of both paradigms, we propose a new approach that simultaneously performs selection and adaptation of source examples. Our method automatically identifies which source examples are in the same domain as the target data (referred to as in-domain source data) and which are not (out-of-domain source data); in-domain source data is used directly to train the target model, while out-of-domain source data is first adapted and then used for training. Compared with previous methods, our approach has the following advantage: instead of handling all source examples in a single way (either selection or adaptation), it applies example-specific processing depending on whether an example is in-domain or out-of-domain.
Our method is based on four-level optimization, which performs the following four stages end-to-end. At the first stage, a domain distance network is trained via self-supervised learning. At the second stage, the domain distance network is used to identify out-of-domain source examples and adapt them into the target domain. At the third stage, the in-domain source examples selected by the domain distance network and the adapted source examples are used to train a target model. At the fourth stage, the trained target model is evaluated on a validation set, and the data weights used by the domain distance network are updated by minimizing the validation loss. Experiments on a variety of datasets demonstrate the effectiveness of our proposed method. To summarize, the major contribution of this work is a four-level optimization based framework for simultaneous selection and adaptation of source examples in transfer learning, whose effectiveness we demonstrate on two NLP applications.
2 Related Work
Domain Adaptation (DA).
DA (Pan et al., 2010; Ganin et al., 2016; Bousmalis et al., 2017; Sun and Saenko, 2016; Ben-David et al., 2010; Kang et al., 2019; Long et al., 2015, 2017a, b; Hoffman et al., 2018; Mitsuzumi et al., 2021) considers the problem of transferring knowledge from a label-rich source domain to a label-deficient target domain, where the two domains have distributional discrepancies. There are mainly two paradigms of approaches. One paradigm (Sun and Saenko, 2016; Long et al., 2017b; Ben-David et al., 2010; Kang et al., 2019) is based on metric learning, where a distance metric is defined to measure the distributional discrepancy between domains and domain-invariant representations are learned by minimizing this distance. The other paradigm is based on adversarial learning (Hoffman et al., 2018; Long et al., 2017a; Tzeng et al., 2017; Sankaranarayanan et al., 2018; Motiian et al., 2017), which trains a domain discriminator and a feature learning network adversarially: the domain discriminator is trained to tell whether an instance comes from the source or the target domain, while the feature learning network learns domain-invariant features by fooling the discriminator. CDAN (Long et al., 2017a) uses multilinear conditioning to capture the cross-covariance between feature representations and classifier predictions, and leverages entropy conditioning to control the uncertainty of classifier predictions. MME (Saito et al., 2019) performs adaptation by maximizing the conditional entropy of unlabeled target data w.r.t. the classifier and minimizing it w.r.t. the feature encoder. SRDC (Tang et al., 2020) performs deep discriminative clustering with source regularization for unsupervised domain adaptation. These methods perform adaptation on all source data, which leads to wasted effort and incurs the risk of moving in-domain source data outside the target domain.
Data Selection in Transfer Learning.
Many methods (Jiang and Zhai, 2007; Foster et al., 2010; Moore and Lewis, 2010; Axelrod et al., 2011; Ge and Yu, 2017; Ruder and Plank, 2017; Sivasankaran et al., 2017; Zhang et al., 2017; Guo et al., 2019; Liu et al., 2019a; Tang and Jia, 2019; Wang et al., 2019a, b; Bateson et al., 2020) have been developed for selecting source data that is suitable for training target models, based on reinforcement learning (Patel et al., 2018; Qu et al., 2018; Liu et al., 2019b), adversarial learning (Wang et al., 2019a), curriculum learning (Zhang et al., 2017; Wang et al., 2019b), entropy (Song et al., 2012; Wang et al., 2017b), Bayesian optimization (Ruder and Plank, 2017), multi-task learning (Ge and Yu, 2017), and bi-level optimization (BLO) (Ren et al., 2018, 2020; Hu et al., 2019; Shu et al., 2019; Wang et al., 2020a, b). BLO based approaches select data by minimizing a validation loss, where the lower-level optimization problem trains network weights on a training dataset and the upper-level optimization problem learns data selection variables on a validation set. These methods select part of source data and discard the rest, which incurs information loss.
Transfer Learning (TL).
TL (Pratt, 1993; Mihalkova et al., 2007; Niculescu-Mizil and Caruana, 2007; Pan and Yang, 2009; Luo et al., 2017; Zhuang et al., 2020) aims at training a better target model by utilizing source data. Many TL methods have been developed, including those based on 1) distribution alignment (Huang et al., 2006; Foster et al., 2010; Wang et al., 2017b; Ngiam et al., 2018), 2) regularization (Luo et al., 2008; Tommasi et al., 2010; Duan et al., 2012), 3) adversarial domain-invariant representation learning (Ganin et al., 2016; Long et al., 2017a; Hoffman et al., 2018; Zhang et al., 2019), and 4) latent space projection (Borgwardt et al., 2006; Pan et al., 2010; Long et al., 2013; Wang et al., 2017a). These methods either select part of the source data or adapt all of it, which leads to information loss, wasted effort, and the risk of adapting in-domain source data outside the target domain.
Bi-level Optimization (BLO).
BLO finds broad applications in hyperparameter tuning (Feurer et al., 2015), data selection (Shu et al., 2019; Ren et al., 2020; Wang et al., 2020b), training data generation (Such et al., 2019), neural architecture search (Liu et al., 2018), learning rate adaptation (Baydin et al., 2017), meta learning (Finn et al., 2017), etc. In BLO-based methods, model weights are learned by solving an inner optimization problem and meta parameters are learned by solving an outer optimization problem. The two optimization problems are nested. Different from existing BLO-based methods, our method is based on four-level optimization.
3 Methods
In this section, we present the method for simultaneous selection and adaptation of source examples based on four-level optimization. We aim to train a target model for a specific target domain using dataset Dt. In practical scenarios, the target domain often suffers from a lack of labeled training data. This lack can lead to overfitting on the training data and poor generalization on test data. To address this issue, we utilize a dataset from a source domain, Ds, which has an abundance of labeled examples. However, there is a notable discrepancy between the source and target data. We categorize Ds examples into two types: those that belong to the same domain as Dt (in-domain source data) and those that do not (out-of-domain source data). We aim to select in-domain source data and adapt out-of-domain source data into the target domain, and use selected and adapted source data to train the target model. The overall framework is shown in Figure 1. The notations are shown in Table 1.
Notations.
| Notation | Meaning |
|---|---|
| Dt | Target dataset. |
| Nt | The number of examples in Dt. |
| Dt(tr) | Training dataset of Dt. |
| Dt(val) | Validation dataset of Dt. |
| dt,i | The i-th example in Dt(tr). |
| Ds | Source dataset. |
| Ns | The number of examples in Ds. |
| ds,i | The i-th example in Ds. |
| q | Query example. |
| R | A set of examples. |
| o | Binary label regarding whether q and R are from the same domain. |
| M | The number of SSL training examples. |
| W | The DDMN's weight parameters. |
| a | The weight of an SSL training example. |
| A | The weights of all SSL training examples. |
| W*(A) | The optimal solution of W, which depends on A. |
| f(q, Dt; W*(A)) | A binary label regarding whether q is out of the target domain. |
| Wenc*(A) | The encoder in W*(A). |
| Wrest*(A) | The rest of the layers in W*(A) besides Wenc*(A). |
| e(q) | The latent representation of q extracted by Wenc*(A). |
| z(q) | The adapted representation of q. |
| z*(q, W*(A)) | The optimal solution of z(q), which depends on W*(A). |
| γ | A tradeoff parameter in Eq.(2). |
| λ | A tradeoff parameter in Eq.(6). |
| U | Target model. |
| Uz | A sub-network of U, which takes z*(q, W*(A)) as input. |
| ℓtgt | The target model's training loss. |
| Lt | The target model's training loss on target training data. |
| Ls | The target model's training loss on selected source data. |
| La | The target model's training loss on adapted source data. |
3.1 A Four-Level Optimization Framework
We propose a four-level optimization framework to perform simultaneous selection and adaptation of source data. The framework consists of four learning stages which are performed end-to-end.
Stage I.
At the first stage, we learn a domain distance metric network via self-supervised learning (SSL) (He et al., 2019; Chen et al., 2020). This network takes a query example q and a set of examples R as inputs and predicts a binary label representing whether q is in the domain of R. The architecture of this network is as follows. For q and each example in R, an encoder is used to generate a latent representation for the example. Then self-attention (Vaswani et al., 2017) is performed on these latent representations to generate attentive representations. Finally, the attentive representation of q and the averaged attentive representation of examples in R are concatenated and fed into a feedforward layer to predict the binary label.
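To make the architecture concrete, below is a minimal PyTorch sketch of such a domain distance metric network; the layer sizes, the stand-in encoder, and all names are our illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn

class DDMN(nn.Module):
    """Domain distance metric network: predicts whether query q is in R's domain."""
    def __init__(self, encoder, hidden=768, heads=8):
        super().__init__()
        self.encoder = encoder  # maps a batch of examples to (batch, hidden) vectors
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q, R):
        # Encode q together with the reference set R, then apply self-attention
        # over all latent representations to obtain attentive representations.
        reps = self.encoder(torch.cat([q, R], dim=0)).unsqueeze(0)  # (1, 1+|R|, hidden)
        att, _ = self.attn(reps, reps, reps)
        q_rep = att[:, 0]                 # attentive representation of q
        r_rep = att[:, 1:].mean(dim=1)    # averaged attentive representation of R
        return self.head(torch.cat([q_rep, r_rep], dim=-1))  # logit for the binary label o

# Toy usage with a stand-in encoder (a real setup would use a text encoder):
enc = nn.Linear(128, 768)
ddmn = DDMN(enc)
logit = ddmn(torch.randn(1, 128), torch.randn(16, 128))  # one query, |R| = 16
```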
To train this domain distance metric network (DDMN), we construct self-supervised training examples. We randomly sample a subset of examples R from Dt, a query example qt from Dt, and a query example qs from Ds. We label (qt, R) as a positive pair since qt and R are both from the target domain, and (qs, R) as a negative pair since qs is from the source domain while R is from the target domain. This procedure is repeated M times, yielding 2M training examples {(qi, Ri, oi)}, where oi is a binary label representing whether qi and Ri are from the same domain. Let W denote the weight parameters of the DDMN. We learn W by minimizing the weighted binary classification loss W*(A) = argmin_W Σi ai ℓ(f(qi, Ri; W), oi), where the sum runs over the 2M SSL training examples, ℓ is a cross-entropy loss, and ai ∈ A is the weight of the i-th example.
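The pair construction and the weighted Stage-I loss can be sketched as follows; we assume examples are already encoded as fixed-size feature tensors, and `build_ssl_pairs` and `stage1_loss` are hypothetical helper names, not the paper's code.

```python
import random
import torch
import torch.nn.functional as F

def build_ssl_pairs(Dt, Ds, M, m=16):
    """Construct 2M (q, R, o) triples: o = 1 iff q and R come from the same domain."""
    pairs = []
    for _ in range(M):
        R = random.sample(Dt, m)                    # reference set from the target domain
        pairs.append((random.choice(Dt), R, 1.0))   # target query -> positive pair
        pairs.append((random.choice(Ds), R, 0.0))   # source query -> negative pair
    return pairs

def stage1_loss(ddmn, pairs, A):
    """A-weighted binary cross-entropy over all 2M SSL examples.

    A is a (2M,) tensor with requires_grad=True, so hypergradients can flow to it."""
    losses = torch.stack([
        F.binary_cross_entropy_with_logits(
            ddmn(q.unsqueeze(0), torch.stack(R)).squeeze(), torch.tensor(o))
        for q, R, o in pairs])
    return (A * losses).sum()
```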
Stage II.
At the second stage, the learned DDMN is applied to each source example: f(q, Dt; W*(A)) predicts whether a source example q is out of the target domain. In-domain source examples are selected as they are. For each out-of-domain source example q, we learn an adapted representation z(q) of its latent representation e(q) by minimizing an adaptation objective (Eq.(2), with tradeoff parameter γ) that moves z(q) into the target domain as measured by the DDMN. The optimal solution z*(q, W*(A)) depends on W*(A).
Stage III.
At the third stage, the target model U is trained on three kinds of data: the target training set Dt(tr) (loss Lt), the selected in-domain source examples (loss Ls), and the adapted representations z*(q, W*(A)) of out-of-domain source examples, which are fed into the sub-network Uz (loss La). The overall training objective (Eq.(6)) combines these losses, with the tradeoff parameter λ controlling how much emphasis is placed on the source losses relative to the target loss.
Stage IV.
At the fourth stage, the trained target model is evaluated on the target validation set Dt(val), and the weights A of the self-supervised training examples are updated by minimizing the validation loss. Since the solutions of all lower stages depend on A, this update propagates through all four levels.
Figure 1: A four-level optimization based framework.
3.2 Optimization Algorithm
When the gradient with respect to A is computed using the chain rule, the chain has as many links as there are levels in our four-level optimization formulation. This shows that the optimization algorithm preserves the four-level nested structure of the proposed formulation.
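As an illustration of how such a four-link chain arises, here is a minimal sketch that approximates each lower-level argmin by a single unrolled gradient step, a common device in gradient-based bi- and multi-level methods; the paper's exact approximation may differ, and the `obj` bundle of stage objectives, the step sizes, and all names are assumptions.

```python
import torch

def four_level_step(A, W, z, U, obj, opt_A, lr=1e-3):
    """One update of A; W and U are lists of parameter tensors, z a tensor."""
    # Stage I: one unrolled gradient step on the A-weighted SSL loss, keeping
    # the graph so the updated W remains a differentiable function of A.
    gW = torch.autograd.grad(obj.ssl(W, A), W, create_graph=True)
    W1 = [w - lr * g for w, g in zip(W, gW)]

    # Stage II: one unrolled adaptation step for out-of-domain representations z.
    gz, = torch.autograd.grad(obj.adapt(z, W1), z, create_graph=True)
    z1 = z - lr * gz

    # Stage III: one unrolled training step of the target model U on target data,
    # selected in-domain source data, and the adapted source representations z1.
    gU = torch.autograd.grad(obj.train(U, W1, z1), U, create_graph=True)
    U1 = [u - lr * g for u, g in zip(U, gU)]

    # Stage IV: the validation loss of the unrolled target model; its gradient
    # w.r.t. A chains back through stages III, II, and I (a four-link chain).
    opt_A.zero_grad()
    obj.val(U1).backward()
    opt_A.step()
```

In practice, this relatively expensive update of A is executed only periodically, as described in Section 3.3.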
3.3 Reduce Computation and Memory Costs
To reduce computation and memory costs, we adopt the following methods.
We reduced the frequency of updating (including computing hypergradients for) the weights A of the self-supervised training examples: they were updated every 8 mini-batches (i.e., iterations) instead of on every mini-batch. We empirically found that this greatly reduces computational cost without significantly sacrificing accuracy. The rest of the parameters were updated on every mini-batch.
We added a decorrelation regularizer (Cogswell et al., 2015) on W and U, which significantly speeds up convergence and allows reducing the number of epochs by half without sacrificing convergence quality.
Parameter tying was performed to reduce the number of weight parameters and computation costs. We let W and U share the same feature learning layers. These layers account for >95% of parameters in each of these models. Sharing them across models significantly reduces the number of total parameters, which reduces the computational costs of updating these parameters.
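A minimal sketch of this parameter tying, reusing the DDMN sketch from Section 3.1; the stand-in encoder, class names, and class count are illustrative assumptions.

```python
import torch.nn as nn

def make_encoder(hidden=768):
    # Stand-in for the shared feature learning layers (our experiments use RoBERTa-base).
    return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

class TargetModel(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder                   # same module object as the DDMN's encoder
        self.head = nn.Linear(768, num_classes)  # task-specific layers only
    def forward(self, x):
        return self.head(self.encoder(x))

shared = make_encoder()
ddmn = DDMN(shared, hidden=768)                  # DDMN sketch from Section 3.1
target = TargetModel(shared, num_classes=13)     # e.g., ChemProt has 13 classes
# `shared` is a single module instance, so its parameters are stored and
# updated once while serving both W and U.
```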
We optimized the implementation of our method to speed up computation by leveraging techniques including 1) automatic mixed precision (Micikevicius et al., 2017), 2) using multiple (specifically, 4) workers and pinned memory in the PyTorch DataLoader, 3) using the cuDNN autotuner, and 4) kernel fusion.
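The following sketch combines the first three of these standard PyTorch mechanisms (mixed precision, a multi-worker DataLoader with pinned memory, and the cuDNN autotuner); `train_set`, `model`, `criterion`, and `optimizer` are assumed to already exist.

```python
import torch
from torch.utils.data import DataLoader

torch.backends.cudnn.benchmark = True  # cuDNN autotuner picks the fastest kernels

loader = DataLoader(train_set, batch_size=16, shuffle=True,
                    num_workers=4, pin_memory=True)  # multiple workers + pinned memory

scaler = torch.cuda.amp.GradScaler()   # automatic mixed precision
for x, y in loader:
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()      # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```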
3.4 Applications
In this section, we apply the proposed four-level optimization framework to two NLP applications.
Text Classification.
In many text classification problems, training data in a target domain is limited. To address the lack of target training data, one can leverage data from a source domain.
Visual Question Answering on Pathology Images.
Pathology imaging (Mohan, 2015) is broadly used for identifying the causes and effects of diseases and injuries. Given a pathology image, being able to answer questions about the clinical findings it contains is very important for medical decision-making (He et al., 2020). However, collecting a large-scale visual question answering (VQA) dataset is challenging because few doctors are available to create questions and answers from pathology images. The dataset collected by He et al. (2020) has about 33K question-answer pairs generated from around 5K pathology images. Although it is the largest of its kind, it is still relatively small compared with common VQA datasets. To mitigate the deficiency of training data, we collect an auxiliary source dataset: from the pathology literature, we gather 1792 pathology figures and create 36,471 VQA questions using the method proposed in He et al. (2020).
4 Experiments
In this section, we present experimental results on text classification and visual question answering on pathology images. Following the common data assumption in transfer learning, the amount of labeled source data in our experiments is significantly larger than that of target data. Each experiment is run 4 times with different random initializations. For all experiments, we performed significance tests using two-sided t-tests. The p-values of our method against the baselines are all less than 0.001, which shows that our improvements over the baselines are statistically significant.
4.1 Text Classification
Dataset.
Following Gururangan et al. (2020), we experiment with four domains: biomedical, computer science, news, and reviews. For the biomedical domain, we use two target datasets, ChemProt (Kringelum et al., 2016) and RCT (Dernoncourt and Lee, 2017), and one source dataset containing 2.68 million full-text papers from S2ORC (Lo et al., 2019) with 7.55 billion tokens. For the computer science domain, we use two target datasets, ACL-ARC (Jurgens et al., 2018) and SciERC (Luan et al., 2018), and one source dataset containing 2.22 million full-text papers from S2ORC (Lo et al., 2019) with 8.1 billion tokens. For the news domain, we use two target datasets, HyperPartisan (Kiesel et al., 2019) and AGNews (Zhang et al., 2015), and one source dataset containing 11.9 million articles from RealNews (Zellers et al., 2019) with 6.66 billion tokens. For the reviews domain, we use two target datasets, Helpfulness (McAuley et al., 2015) and IMDB (Maas et al., 2011), and one source dataset containing 24.75 million reviews from Amazon Reviews (He and McAuley, 2016) with 2.11 billion tokens. Statistics of the target datasets are summarized in Table 2. In our method, we split the original target training set into a new training set and a validation set with a ratio of 1:1; the new training set is used as Dt(tr) and the validation set as Dt(val). Note that baseline methods are trained on the combination of Dt(tr) and Dt(val).
Statistics of datasets used in Gururangan et al. (2020).
| Domain | Dataset | Label Type | Train | Dev | Test | Classes |
|---|---|---|---|---|---|---|
| Biomedical | ChemProt | relation classification | 4169 | 2427 | 3469 | 13 |
| Biomedical | RCT | abstract sent. roles | 180040 | 30212 | 30135 | 5 |
| Computer Science | ACL-ARC | citation intent | 1688 | 114 | 139 | 6 |
| Computer Science | SciERC | relation classification | 3219 | 455 | 974 | 7 |
| News | HyperPartisan | partisanship | 515 | 65 | 65 | 2 |
| News | AGNews | topic | 115000 | 5000 | 7600 | 4 |
| Reviews | Helpfulness | review helpfulness | 115251 | 5000 | 25000 | 2 |
| Reviews | IMDB | review sentiment | 20000 | 5000 | 25000 | 2 |
Baselines.
We compare our method with the following baselines. In baseline methods, the target data used for training the target model includes both training and validation sets.
Domain adaptive pretraining (DAPT) (Gururangan et al., 2020): Given a RoBERTa-base model pretrained on large corpora (as used in Liu et al., 2019c), we continue pretraining it on source data, then finetune it on target data.
Task adaptive pretraining (TAPT) (Gururangan et al., 2020): Given the same pretrained RoBERTa-base model, we continue pretraining it on the input texts of each target dataset.
Data selection methods for transfer learning, based on Bayesian optimization (BO) (Ruder and Plank, 2017), a minimax game (MMG) (Wang et al., 2019a), and learning to select instances (LSI) (Huan et al., 2021).
Domain adaptation methods, including DANN (Ganin et al., 2016), CDAN (Long et al., 2017a), MME (Saito et al., 2019), SRDC (Tang et al., 2020), SSDA (Kim and Kim, 2020), GDA (Mitsuzumi et al., 2021), and ATDOC (Liang et al., 2021).
SimCSE (Gao et al., 2021): A contrastive learning method. The same input sentence is fed into a pretrained RoBERTa encoder twice by applying different dropout masks, to get two different embeddings. These two embeddings are labeled as being “similar”. Embeddings of different sentences are labeled as being “dissimilar”. Contrastive learning is performed on these “similar” and “dissimilar” pairs.
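For concreteness, here is a minimal sketch of the SimCSE-style objective just described; the temperature value and function name are assumptions, and the encoder is assumed to map a batch of sentences to a (batch, dim) tensor.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, batch, tau=0.05):
    # Two forward passes in train mode give two dropout-perturbed embeddings of
    # the same sentences; the matching rows are the "similar" pairs.
    z1 = F.normalize(encoder(batch), dim=-1)
    z2 = F.normalize(encoder(batch), dim=-1)
    logits = z1 @ z2.t() / tau                   # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)       # InfoNCE: i-th row matches i-th column
```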
Implementation Details.
In the domain distance network, the hyperparameters of the self-attention and feed-forward layers are the same as those in the Transformer (Vaswani et al., 2017). The cardinality of the subset R varies and is sampled uniformly. M is set to 10k. The tradeoff parameters γ and λ are set to 0.1 and 0.5, respectively. The baselines and our method receive a similar amount of tuning time and effort. F1 is used as the evaluation metric. We use RoBERTa-base as the data encoder. For a fair comparison, most of our hyperparameters are the same as those in Gururangan et al. (2020). The maximum text length was set to 512. For all datasets, we used a batch size of 16 with gradient accumulation. We used the AdamW optimizer (Loshchilov and Hutter, 2017) with a warm-up proportion of 0.06, a weight decay of 0.1, and an epsilon of 1e-6. In AdamW, β1 and β2 are set to 0.9 and 0.98, respectively. The maximum learning rate was 2e-5. For the reader's convenience, we summarize the hyperparameters as follows.
The number M of self-supervised training examples: 10k
Tradeoff parameters γ and λ: 0.1, 0.5
Maximum text length: 512
Batch size: 16
Optimizer: AdamW
Warm-up proportion, weight decay, and epsilon in AdamW: 0.06, 0.1, and 1e-6
β1 and β2 in AdamW: 0.9, 0.98
Maximum learning rate: 2e-5
To tune the hyperparameters, we randomly split the validation set into two equal-sized subsets. For each configuration of hyperparameters, we use the first subset to learn the importance weights of the self-supervised training examples, then measure the performance of the trained model on the second subset. The hyperparameter values yielding the best performance on the second subset are selected. Each baseline method received the same amount of tuning effort as our method.
Results and Analysis.
Table 3 shows the results. From this table, we make the following observations. First, our method outperforms source data selection methods including BO, MMG, and LSI. In these baseline methods, out-of-domain source examples are discarded, which incurs information loss. In contrast, our method adapts out-of-domain source examples into the target domain and uses the adapted examples to train the target model. Second, our method works better than domain adaptation methods including DANN, CDAN, MME, SRDC, SSDA, GDA, and ATDOC. The reason is: These methods perform adaptation on all source examples without identifying which ones are already in the target domain; as a result, some in-domain source examples may be adapted out of the target domain. In contrast, our method first identifies which source examples are already in-domain and only performs adaptation on out-of-domain examples. Third, our method outperforms vanilla RoBERTa-base which does not leverage source data to learn representations. This demonstrates that leveraging source data is helpful for improving the target model. Fourth, our method performs better than DAPT. In DAPT, all source examples are leveraged to pretrain the target encoder without considering the fact that some source examples have large domain discrepancy with the target domain and are not suitable for pretraining the target encoder. Fifth, our method outperforms TAPT and SimCSE. These methods do not leverage auxiliary source data or select/adapt source data.
Text classification results. Following Gururangan et al. (2020), the results are micro-F1 for ChemProt and RCT, and macro-F1 for the other datasets. Each entry x±y gives the mean x and standard deviation y over four random runs.

| Method | ChemProt | RCT | ACL-ARC | SciERC | HyperPartisan | AGNews | Helpfulness | IMDB | Average |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-base | 81.9±1.0 | 87.2±0.1 | 63.0±5.8 | 77.3±1.9 | 86.6±0.9 | 93.9±0.2 | 65.1±3.4 | 95.0±0.2 | 81.3 |
| TAPT | 82.6±0.4 | 87.7±0.1 | 67.4±1.8 | 79.3±1.5 | 90.4±5.2 | 94.5±0.1 | 68.5±1.9 | 95.5±0.1 | 83.2 |
| DAPT | 84.2±0.2 | 87.6±0.1 | 75.4±2.5 | 80.8±1.5 | 88.2±5.9 | 93.9±0.2 | 66.5±1.4 | 95.4±0.1 | 84.0 |
| SimCSE | 83.2±0.2 | 87.6±0.1 | 69.5±2.6 | 80.5±0.7 | 90.9±2.7 | 94.7±0.1 | 68.7±1.7 | 95.7±0.1 | 83.9 |
| BO | 83.3±0.2 | 87.5±1.0 | 75.4±3.3 | 79.1±1.3 | 88.1±4.2 | 94.0±0.2 | 68.4±2.1 | 95.3±0.2 | 83.9 |
| MMG | 83.0±0.4 | 87.4±1.0 | 75.1±4.1 | 79.6±0.8 | 87.9±2.1 | 94.2±0.1 | 66.7±1.5 | 95.7±0.1 | 83.7 |
| LSI | 83.2±0.3 | 87.5±1.0 | 75.3±4.5 | 80.2±0.6 | 88.7±1.9 | 94.5±0.1 | 68.6±1.0 | 95.4±0.1 | 84.2 |
| DANN | 83.5±0.3 | 87.7±0.1 | 75.5±3.7 | 80.5±1.1 | 90.4±3.0 | 94.3±0.1 | 66.8±2.8 | 95.2±0.1 | 84.2 |
| CDAN | 83.8±0.2 | 87.4±0.1 | 75.9±2.4 | 80.9±1.4 | 88.7±5.5 | 94.1±0.2 | 67.3±3.4 | 95.8±0.2 | 84.2 |
| MME | 84.0±0.1 | 87.4±0.1 | 75.7±5.1 | 80.5±0.8 | 89.2±2.8 | 94.6±0.2 | 68.9±2.7 | 95.4±0.1 | 84.5 |
| SRDC | 84.3±0.3 | 87.3±0.1 | 75.7±3.4 | 80.7±1.0 | 88.9±3.5 | 94.1±0.1 | 67.6±3.0 | 95.1±0.1 | 84.2 |
| SSDA | 83.9±0.5 | 87.9±0.1 | 75.9±2.7 | 81.0±1.4 | 88.5±2.6 | 94.4±0.1 | 67.0±1.6 | 95.7±0.2 | 84.3 |
| GDA | 84.1±0.3 | 87.3±0.1 | 75.4±2.4 | 80.8±0.9 | 90.5±4.1 | 94.2±0.2 | 66.8±2.4 | 95.5±0.1 | 84.3 |
| ATDOC | 84.5±0.2 | 87.5±0.1 | 75.6±3.6 | 81.1±1.2 | 89.0±5.7 | 93.9±0.2 | 67.4±2.8 | 95.1±0.1 | 84.3 |
| No-Adapt | 84.1±0.2 | 87.3±0.1 | 75.9±2.2 | 81.4±1.0 | 90.3±4.2 | 94.1±0.1 | 66.9±3.3 | 95.3±0.2 | 84.4 |
| No-In-Domain | 84.5±0.1 | 87.5±0.1 | 75.5±4.5 | 81.6±1.7 | 88.8±3.9 | 94.6±0.2 | 67.1±3.1 | 95.6±0.1 | 84.4 |
| Separate | 85.3±0.5 | 88.1±0.1 | 75.8±2.9 | 82.7±0.9 | 90.1±3.6 | 94.0±0.2 | 68.3±1.5 | 95.2±0.1 | 84.9 |
| MMD | 85.9±0.4 | 88.3±0.1 | 75.5±4.9 | 82.3±1.1 | 89.5±1.8 | 94.1±0.1 | 68.9±1.2 | 95.4±0.2 | 85.0 |
| AD | 85.6±0.2 | 88.7±0.1 | 75.9±3.1 | 82.6±0.8 | 90.3±3.0 | 94.7±0.2 | 67.4±1.7 | 95.8±0.1 | 85.1 |
| WMVL | 85.5±0.4 | 88.5±0.1 | 75.7±2.8 | 81.9±1.3 | 90.6±4.6 | 94.6±0.1 | 68.1±2.3 | 95.1±0.1 | 85.0 |
| H-Divergence | 85.4±0.3 | 88.7±0.1 | 75.9±4.6 | 82.1±0.9 | 89.2±1.9 | 94.2±0.1 | 68.5±2.6 | 95.3±0.1 | 84.9 |
| Our full method | 87.1±0.2 | 90.4±0.1 | 77.6±2.3 | 84.4±0.7 | 92.4±1.2 | 95.7±0.1 | 70.9±0.8 | 96.6±0.1 | 86.9 |
4.2 Visual Question Answering on Pathology Images
Datasets.
For the target dataset, we use PathVQA (He et al., 2020), which contains 1,670 pathology images and 32,795 question-answer pairs. Of these, 16,466 questions are open-ended, covering the following types: what, where, when, whose, how, why, and how much/how many. The rest are close-ended "yes/no" questions. The dataset is split by image into training, validation, and test sets with an approximate ratio of 3:1:1. Note that baseline methods are trained on the combination of the training and validation sets. We collected a source dataset containing 1792 pathology figures extracted from papers on medRxiv, where each figure has a caption, and created 36,471 VQA questions using the method proposed in He et al. (2020). This source dataset will be made publicly available.
Implementation Details.
For the target model, we experimented with two state-of-the-art VQA models—LXMERT (Tan and Bansal, 2019) and bilinear attention networks (BAN) (Kim et al., 2018)—each containing an image encoder, a question encoder, and an answer generation head. Hyperparameters mostly follow those in previous work (Tan and Bansal, 2019; Kim et al., 2018; Yang et al., 2016). For LXMERT, the hidden size of the text encoder was set to 768.
The initial learning rate was set to 5e-5, using the Adam optimizer (Kingma and Ba, 2014). The batch size was set to 256, and the model was trained for 200 epochs. For BAN, words in questions and answers were represented using GloVe vectors (Pennington et al., 2014). The initial learning rate was set to 0.005, using the Adamax optimizer (Kingma and Ba, 2014). The batch size was set to 512, and the model was trained for 200 epochs.
From the questions and answers in the PathVQA dataset, we create a vocabulary of the 4,631 most frequent words. Data augmentation, including shifting, scaling, and shearing, is applied to the images. We compare with baselines similar to those in Section 4.1. The Pretrain baseline works as follows: we pretrain the target encoder using source data, then finetune it using target data. Three metrics were used for evaluation: BLEU (Papineni et al., 2002), macro-averaged F1 (Goutte and Gaussier, 2005), and accuracy (Malinowski and Fritz, 2014). We implement the methods in PyTorch and perform training on four GTX 1080Ti GPUs.
Results and Analysis.
The results are shown in Table 4. From this table, we make observations similar to those for Table 3, and the reasons are analogous. The training time of our method is similar to that of the baselines. Figure 2 shows some randomly sampled source pathology figures identified by our method as being out of the target domain. These figures contain subfigures and embedded text, which distinguishes them from the images in the target dataset.
Results on the PathVQA dataset. Runtime (hours) for training is measured on a GTX 1080Ti GPU.
| Method | Accuracy (%) | BLEU-1 (%) | BLEU-2 (%) | BLEU-3 (%) | F1 (%) | Average | Runtime (h) |
|---|---|---|---|---|---|---|---|
| LXMERT (Tan and Bansal, 2019) based experiments | | | | | | | |
| Vanilla LXMERT (Tan and Bansal, 2019) | 57.6 | 57.4 | 3.1 | 1.3 | 9.9 | 25.9 | 29 |
| Pretrain (He et al., 2016) | 59.3 | 59.0 | 4.6 | 2.6 | 11.3 | 27.4 | 35 |
| BO (Ruder and Plank, 2017) | 59.5 | 58.9 | 3.8 | 2.7 | 11.4 | 27.3 | 41 |
| MMG (Wang et al., 2019a) | 59.3 | 58.4 | 3.7 | 2.6 | 10.9 | 27.0 | 38 |
| LSI (Huan et al., 2021) | 59.2 | 58.8 | 3.5 | 2.8 | 10.6 | 27.0 | 32 |
| DANN (Ganin et al., 2016) | 59.9 | 59.4 | 4.3 | 3.2 | 11.2 | 27.6 | 30 |
| CDAN (Long et al., 2017a) | 60.2 | 58.9 | 4.7 | 3.3 | 11.9 | 27.8 | 35 |
| MME (Saito et al., 2019) | 59.7 | 59.0 | 3.6 | 2.9 | 11.7 | 27.4 | 30 |
| SRDC (Tang et al., 2020) | 58.8 | 58.7 | 4.8 | 2.9 | 12.2 | 27.5 | 39 |
| SSDA (Kim and Kim, 2020) | 59.3 | 58.3 | 4.4 | 3.0 | 12.0 | 27.4 | 34 |
| GDA (Mitsuzumi et al., 2021) | 59.8 | 59.1 | 4.6 | 2.8 | 12.3 | 27.7 | 36 |
| ATDOC (Liang et al., 2021) | 59.6 | 59.6 | 4.3 | 3.1 | 12.2 | 27.8 | 30 |
| Ours | 62.5 | 61.1 | 5.2 | 3.7 | 12.9 | 29.1 | 29 |
| BAN (Kim et al., 2018) based experiments | | | | | | | |
| Vanilla BAN (Kim et al., 2018) | 55.1 | 56.2 | 3.2 | 1.2 | 8.4 | 24.8 | 25 |
| Pretrain (He et al., 2016) | 58.4 | 58.6 | 4.3 | 1.6 | 10.3 | 26.6 | 28 |
| BO (Ruder and Plank, 2017) | 58.3 | 58.5 | 4.2 | 2.0 | 10.9 | 26.8 | 32 |
| MMG (Wang et al., 2019a) | 58.7 | 58.1 | 4.6 | 2.3 | 11.2 | 27.0 | 29 |
| LSI (Huan et al., 2021) | 58.3 | 58.7 | 4.3 | 2.1 | 11.5 | 27.0 | 30 |
| DANN (Ganin et al., 2016) | 58.8 | 58.4 | 4.3 | 2.3 | 11.0 | 27.0 | 31 |
| CDAN (Long et al., 2017a) | 58.6 | 58.6 | 4.5 | 2.5 | 11.4 | 27.1 | 33 |
| MME (Saito et al., 2019) | 59.1 | 58.9 | 4.9 | 2.3 | 11.1 | 27.3 | 26 |
| SRDC (Tang et al., 2020) | 58.9 | 58.7 | 4.7 | 2.5 | 11.6 | 27.3 | 35 |
| SSDA (Kim and Kim, 2020) | 59.3 | 59.0 | 4.4 | 2.6 | 11.4 | 27.3 | 32 |
| GDA (Mitsuzumi et al., 2021) | 59.5 | 58.8 | 4.9 | 2.4 | 11.7 | 27.5 | 30 |
| ATDOC (Liang et al., 2021) | 59.1 | 59.1 | 5.0 | 2.7 | 11.5 | 27.5 | 29 |
| Ours | 62.4 | 61.7 | 5.6 | 3.2 | 12.5 | 29.1 | 26 |
Figure 2: Randomly sampled source pathology figures identified by our method as being out of the target domain.
4.3 Ablation Studies
4.3.1 Ablation by Removing Certain Components
To better understand the contributions of individual components in our framework, we perform the following ablation studies.
No-Adapt. Out-of-domain source examples are discarded instead of being adapted. This is equivalent to removing stage II and the loss term La in Eq.(6) of stage III.
No-In-Domain. In-domain source examples are discarded instead of being used to train the target model. This is equivalent to removing the loss term Ls in Eq.(6) of stage III.
Separate. Different stages are performed separately instead of jointly.
Table 3 shows ablation study results. From this table, we make the following observations. First, our full method works better than No-Adapt. This shows that it is beneficial to adapt out-of-domain source examples into the target domain and use adapted examples to train the target model, and our method is effective in achieving this goal. Second, our full method outperforms No-In-Domain, which demonstrates that the source examples selected by our method are useful for training the target model. Third, our full method achieves better performance than Separate. Our full method performs source data selection, adaptation, and target model training jointly, which enables these different tasks to mutually influence each other to achieve the best overall performance. Such a mechanism is lacking in Separate.
4.3.2 Ablation on the Adaptation Component
We perform an ablation study of the adaptation component in stage II of our framework by replacing it with the following baselines.
Maximum mean discrepancy (MMD) (Kang et al., 2019), a broadly used metric for measuring the discrepancy between two distributions. The MMD-based adaptation baseline learns latent representations such that the selected out-of-domain source examples and the target examples have a small MMD in the latent space.
Adversarial adaptation (AD) (Ganin et al., 2016), which learns latent representations such that a domain discriminator cannot tell whether an example comes from the source or the target domain.
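For reference, here is a minimal sketch of the kind of RBF-kernel MMD (the standard biased squared-MMD estimator) that the MMD baseline minimizes between encoded source and target batches; the bandwidth σ is an assumption.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased estimator of squared MMD between samples x (n, d) and y (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2)) # RBF kernel values
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

In the MMD baseline, an encoder would be trained so that `rbf_mmd(encoder(source_batch), encoder(target_batch))` is small, pulling the two domains together in the latent space.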
Table 3 shows the results. As can be seen, our adaptation method works better than the baselines. The reason is: The domain distance metric in our method is learned using many self-supervised training examples; it can better measure domain discrepancy and facilitate domain adaptation.
4.3.3 Ablation on the Selection Mechanism
We perform an ablation study of the selection mechanism in our framework by replacing it with the following baselines.
Each source example is associated with a weight in [0,1]; a larger weight indicates that the example is more likely to be in the target domain. These data weights are learned by minimizing the validation loss of the target model (WMVL).
We use an ℋ-divergence (Elsahar and Gallé, 2019) based metric to measure domain similarity between a source example and the target dataset.
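A minimal sketch of the WMVL baseline's weight parameterization follows; all names and sizes are illustrative assumptions.

```python
import torch

num_source = 10000                                # illustrative source-set size
s = torch.zeros(num_source, requires_grad=True)   # one free score per source example

def weighted_source_loss(per_example_losses, idx):
    w = torch.sigmoid(s[idx])             # weights constrained to [0, 1]
    return (w * per_example_losses).mean()
# The scores s are then updated with the hypergradient of the target model's
# validation loss, in the same spirit as the BLO-based methods in Section 2.
```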
Table 3 shows the results. As can be seen, our selection mechanism works better than the baselines. The reason is: In our framework, the domain distance metric network (DDMN) is learned in a discriminative way by performing classification on self-supervised training examples. Discriminative training enables the DDMN to better distinguish in-domain source examples from out-of-domain ones. In contrast, the selection mechanisms in the two baselines lack this discriminability.
4.3.4 Ablation on the Tradeoff Parameter λ
We investigate how the performance of our method is affected by the tradeoff parameter λ. Figure 3 shows the test accuracy on ImageCLEF. As can be seen, when λ increases from 0.01 to 0.1, the accuracy improves, because the selected and adapted source data serves as an additional training resource for the target model. However, as λ increases further, the accuracy drops, because the source data is not as reliable as the target data, and an excessively large λ puts too much emphasis on this less-reliable source data.
4.3.5 Ablation on the Train-Validate Ratio
In the next ablation study, we investigate how the performance of our method varies under different split ratios between the target training and validation datasets. The study was performed on the text classification task; Table 5 shows the results. As can be seen, a balanced ratio (e.g., 1:1) yields the best performance, while largely imbalanced ratios (e.g., 1:9, 1:4, 1:0.25, 1:0.1) perform worse. If the target training dataset Dt(tr) is much smaller than the target validation dataset Dt(val), the target model's weight parameters U, which are trained on Dt(tr), will be insufficiently trained due to the lack of training data and thereby yield poorer performance. Conversely, if Dt(val) is much smaller than Dt(tr), the weights A of the SSL training examples, which are optimized by minimizing the loss on Dt(val), will be insufficiently optimized and likewise yield worse performance. Note that since Dt(tr) and Dt(val) are obtained by splitting the original target training dataset, a balanced split is always attainable.
Text classification results of our method under different split ratios between target training and validation sets.
| Train-val ratio | ChemProt | RCT | ACL-ARC | SciERC | HyperPartisan | AGNews | Helpfulness | IMDB | Average |
|---|---|---|---|---|---|---|---|---|---|
| 1:9 | 79.4±0.4 | 86.9±0.1 | 73.7±2.7 | 78.5±1.3 | 87.2±2.3 | 93.0±0.2 | 65.5±3.0 | 93.3±0.1 | 82.2 |
| 1:4 | 82.0±0.4 | 87.8±0.1 | 75.1±3.9 | 81.0±1.5 | 89.6±3.0 | 94.1±0.1 | 68.2±2.7 | 94.1±0.2 | 84.0 |
| 1:2 | 84.2±0.1 | 89.1±0.1 | 76.4±5.0 | 83.2±0.9 | 91.7±4.2 | 94.5±0.1 | 70.0±0.3 | 95.9±0.2 | 85.6 |
| 1:1 | 87.1±0.2 | 90.4±0.1 | 77.6±2.3 | 84.4±0.7 | 92.4±1.2 | 95.7±0.1 | 70.9±0.8 | 96.6±0.1 | 86.9 |
| 1:0.5 | 86.8±0.4 | 90.6±0.1 | 77.1±2.9 | 84.0±1.2 | 92.8±1.7 | 95.6±0.1 | 70.4±1.1 | 96.1±0.1 | 86.7 |
| 1:0.25 | 86.4±0.2 | 89.9±0.1 | 76.9±4.2 | 83.8±1.6 | 91.9±1.9 | 95.2±0.2 | 69.9±1.6 | 95.3±0.2 | 86.2 |
| 1:0.1 | 85.3±0.5 | 88.5±0.1 | 76.3±2.8 | 82.6±0.9 | 91.7±2.2 | 94.9±0.1 | 69.4±0.8 | 95.0±0.1 | 85.5 |
4.3.6 A Bi-level Optimization Based Ablation Setting
In this ablation setting, our four-level optimization formulation is replaced with a bi-level optimization (BLO) based counterpart. Table 6 shows the results on text classification: our four-level method outperforms the BLO-based setting, indicating that the four-level nesting of the stages is beneficial.

Ablation study results of the BLO method on text classification.
| Method | ChemProt | RCT | ACL-ARC | SciERC | HyperPartisan | AGNews | Helpfulness | IMDB | Average |
|---|---|---|---|---|---|---|---|---|---|
| BLO | 85.3±0.4 | 88.7±0.1 | 75.4±2.9 | 83.1±1.1 | 90.7±4.5 | 94.9±0.2 | 68.3±2.2 | 95.0±0.1 | 85.2 |
| Ours | 87.1±0.2 | 90.4±0.1 | 77.6±2.3 | 84.4±0.7 | 92.4±1.2 | 95.7±0.1 | 70.9±0.8 | 96.6±0.1 | 86.9 |
4.4 Human Evaluation
We perform a human evaluation of whether the identified in-domain and out-of-domain source examples are indeed in or out of the target domain. The study is performed on ChemProt, RCT, and ACL-ARC. For each dataset, we randomly sample 200 source texts. Three undergraduates were asked to label whether these source texts are in-domain or out-of-domain, and a majority vote decides the final label. The Kappa score among the annotators is 0.75, indicating a strong level of agreement. Different methods are then applied to predict whether each source text is in-domain. Table 7 shows the results. Our method achieves the best accuracy in identifying in-domain and out-of-domain source examples, owing to its mechanism of learning the domain distance network in a discriminative way (as analyzed in Section 4.3).
Human evaluation results.
| Method | Accuracy |
|---|---|
| BO (Ruder and Plank, 2017) | 79.9 |
| MMG (Wang et al., 2019a) | 82.1 |
| LSI (Huan et al., 2021) | 83.9 |
| Separate | 83.5 |
| WMVL | 81.7 |
| H-Divergence | 84.0 |
| Our full method | 88.2 |
5 Conclusions and Discussion
We propose a framework for simultaneous selection and adaptation of source examples in transfer learning. Our method automatically identifies which source examples are in or out of the target domain, and performs example-specific operations (either selection or adaptation). This is different from previous methods, which 1) discard out-of-domain source examples, leading to information loss; or 2) try to adapt all source examples into the target domain, incurring a risk of moving in-domain source examples outside the target domain. Our framework is based on four-level optimization. Experiments on text classification and visual question answering demonstrate the effectiveness of our proposed method.
References
Author notes
Equal contribution.
Action Editor: Tim Baldwin