Simultaneous Selection and Adaptation of Source Data via Four-Level Optimization

In many NLP applications, to mitigate data deficiency in a target task, source data is collected to help with target model training. Existing transfer learning methods either select a subset of source examples that are close to the target domain or try to adapt all source examples into the target domain, then use the selected or adapted source examples to train the target model. These methods either incur significant information loss or bear the risk that, after adaptation, source examples that were originally in the target domain end up outside it. To address these limitations, we propose a four-level optimization based framework which simultaneously selects and adapts source data. Our method automatically identifies in-domain and out-of-domain source examples and applies example-specific processing: selection for in-domain examples and adaptation for out-of-domain examples. Experiments on various datasets demonstrate the effectiveness of our proposed method.


Introduction
Transfer learning (TL) (Zhuang et al., 2020), which aims at improving a model in a target domain by utilizing data from a source domain, has been broadly studied in natural language processing. One paradigm of TL methods (Sun et al., 2011; Song et al., 2012; Wang et al., 2017b; Patel et al., 2018; Qu et al., 2018; Liu et al., 2019b) focuses on selecting a subset of source examples that are close to the target domain and using the selected examples as additional training data for the target model. Another paradigm of TL methods (Pan et al., 2010; Ganin et al., 2016; Bousmalis et al., 2017) focuses on adapting the entire set of source examples into the target domain and using the adapted source examples as additional training data for the target model. The problem of selection based methods is that unselected examples are discarded. Though having a domain discrepancy with target data, unselected examples still contain useful information that could be leveraged to improve the target model; discarding them leads to information loss. The problem of adaptation based methods is that some source examples may already be in the target domain; performing domain adaptation on these source examples is a waste of effort, and, even worse, after adaptation these source examples may end up outside the target domain.
To address the limitations of both paradigms, we propose a new approach which simultaneously performs selection and adaptation of source examples. Our method automatically identifies which source examples are in the same domain as the target data (referred to as in-domain source data) and which are not (referred to as out-of-domain source data); in-domain source data is directly used to train the target model, while out-of-domain source data is first adapted and then utilized to train the target model. Compared with previous methods, our approach has the following advantage: instead of treating all source examples in a single way (either performing selection or adaptation), our method applies example-specific processing to different examples based on whether they are in-domain or out-of-domain.
Our method is based on four-level optimization, which performs the following four stages end-to-end. At the first stage, a domain distance network is trained based on self-supervised learning. At the second stage, the domain distance network is used to identify out-of-domain source examples and adapt them into the target domain. At the third stage, in-domain source examples selected by the domain distance network, together with the adapted source examples, are used to train a target model. At the fourth stage, the trained target model is evaluated on a validation set and the data weights of the domain distance network are updated by minimizing the validation loss. Experiments on a variety of datasets demonstrate the effectiveness of our proposed method. To summarize, the major contribution of this work is that we propose a four-level optimization based framework for simultaneous selection and adaptation of source examples in transfer learning and demonstrate its effectiveness on two NLP applications.
Related Works

Domain Adaptation (DA). DA (Pan et al., 2010; Ganin et al., 2016; Bousmalis et al., 2017; Sun and Saenko, 2016; Long et al., 2015; Long et al., 2017a; Long et al., 2017b; Ben-David et al., 2010; Hoffman et al., 2018; Kang et al., 2019; Mitsuzumi et al., 2021) considers the problem of transferring knowledge from a label-rich source domain to a label-deficient target domain where the two domains have distributional discrepancies. There are mainly two paradigms of approaches. One paradigm (Sun and Saenko, 2016; Long et al., 2017b; Ben-David et al., 2010; Kang et al., 2019) is based on metric learning, where a distance metric is defined to measure the distribution discrepancy between domains and domain-invariant representations are learned by minimizing this distance. The other paradigm is based on adversarial learning (Hoffman et al., 2018; Long et al., 2017a; Tzeng et al., 2017; Sankaranarayanan et al., 2018; Motiian et al., 2017), which learns a domain discriminator and a feature learning network adversarially: the domain discriminator is trained to tell whether an instance is from the source domain or the target domain, while the feature learning network learns domain-invariant features by fooling the domain discriminator. CDAN (Long et al., 2017a) uses multilinear conditioning to capture the cross-covariance between feature representations and classifier predictions, and leverages entropy conditioning to control the uncertainty of classifier predictions. MME (Saito et al., 2019) performs adaptation by maximizing the conditional entropy of unlabeled target data w.r.t. the classifier and minimizing it w.r.t. the feature encoder. SRDC (Tang et al., 2020) performs deep discriminative clustering with source regularization for unsupervised domain adaptation. These methods perform adaptation on all source data, which leads to wasted effort and incurs a risk of moving in-domain source data outside the target domain.

Methods
In this section, we present the method for simultaneous selection and adaptation of source examples based on four-level optimization. We aim to train a target model for a specific target domain using a dataset $D_t$. In practical scenarios, the target domain often suffers from a lack of labeled training data, which can lead to overfitting on the training data and poor generalization on test data. To address this issue, we utilize a dataset from a source domain, $D_s$, which has an abundance of labeled examples. However, there is a notable discrepancy between the source and target data. We categorize the examples in $D_s$ into two types: those that belong to the same domain as $D_t$ (in-domain source data) and those that do not (out-of-domain source data). We aim to select in-domain source data, adapt out-of-domain source data into the target domain, and use the selected and adapted source data to train the target model. The overall framework is shown in Figure 1. The notations are summarized in Table 1.

A Four-Level Optimization Framework
We propose a four-level optimization framework to perform simultaneous selection and adaptation of source data. The framework consists of four learning stages which are performed end-to-end. Table 1 collects the notations used below.

Table 1: Notations.
$W^*(A)$: The optimal solution of $W$, which depends on $A$.
$f(q, D_t; W^*(A))$: A binary label regarding whether $q$ is out of the target domain.
$W_1^*(A)$: The encoder in $W^*(A)$.
$W_2^*(A)$: The rest of the layers in $W^*(A)$ besides $W_1^*(A)$.
$\hat{z}(q; W_1^*(A))$: The latent representation of $q$ extracted by $W_1^*(A)$.
$U$: The target model.
$\widetilde{U}$: A sub-network of $U$, which takes $z^*(q, W^*(A))$ as input.

Stage I. At the first stage, we learn a domain distance metric network via self-supervised learning (SSL) (He et al., 2019; Chen et al., 2020). This network takes a query example $q$ and a set of examples $R$ as inputs and predicts a binary label representing whether $q$ is in the domain of $R$. The architecture of this network is as follows. For $q$ and each example in $R$, an encoder is used to generate a latent representation of the example. Then self-attention (Vaswani et al., 2017) is performed on these latent representations to generate attentive representations. Finally, the attentive representation of $q$ and the averaged attentive representation of the examples in $R$ are concatenated and fed into a feedforward layer to predict the binary label.
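For concreteness, the following is a minimal PyTorch sketch of such a network. The class and parameter names (DomainDistanceNet, hidden_dim, num_heads) are our own illustration of the described architecture, not the authors' implementation; the encoder is assumed to map each raw feature vector to a hidden_dim-sized representation.

```python
import torch
import torch.nn as nn

class DomainDistanceNet(nn.Module):
    """Sketch of the domain distance metric network (DDMN): encodes a
    query q and a reference set R, applies self-attention over all
    representations, then classifies whether q is in R's domain."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 768,
                 num_heads: int = 8):
        super().__init__()
        self.encoder = encoder            # plays the role of W_1
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        self.classifier = nn.Sequential(  # feedforward head (part of W_2)
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),     # in-domain vs. out-of-domain
        )

    def forward(self, q: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # q: (B, D_in); R: (B, K, D_in). The encoder is assumed to act
        # on the last dimension, mapping raw features to hidden_dim.
        z_q = self.encoder(q).unsqueeze(1)        # (B, 1, H)
        z_R = self.encoder(R)                     # (B, K, H)
        seq = torch.cat([z_q, z_R], dim=1)        # (B, 1+K, H)
        attn, _ = self.self_attn(seq, seq, seq)   # attentive representations
        a_q, a_R = attn[:, 0], attn[:, 1:].mean(dim=1)
        return self.classifier(torch.cat([a_q, a_R], dim=-1))  # (B, 2)
```

For example, DomainDistanceNet(nn.Linear(300, 768)) instantiates the sketch for 300-dimensional input features.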
To learn this domain distance metric network (DDMN), we construct self-supervised training examples. We randomly sample a subset of examples $R$ from $D_t$, and randomly sample a query example $q_t$ from $D_t$ and a query example $q_s$ from $D_s$. We label $(q_t, R)$ as a positive pair since $q_t$ and $R$ are both from the target domain, and label $(q_s, R)$ as a negative pair since $q_s$ is from the source domain while $R$ is from the target domain. This procedure repeats $M$ times, yielding $2M$ training examples denoted by $\{(q_i, R_i, o_i)\}_{i=1}^{2M}$, where $o_i$ is a binary label representing whether $q_i$ and $R_i$ are from the same domain. Let $W$ denote the weight parameters of the DDMN. We learn $W$ by minimizing the binary classification loss $\sum_{i=1}^{2M} \ell(q_i, R_i, o_i; W)$, where $\ell(\cdot)$ is a cross-entropy loss.
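A minimal sketch of this pair-construction procedure follows; the fixed subset_size is our simplification, since (as noted in the implementation details) $R$ is sampled with varying cardinality.

```python
import random

def build_ssl_pairs(D_t, D_s, M, subset_size=32):
    """Construct 2M self-supervised triples (q, R, o) for the DDMN.

    Each round samples a reference set R from the target data, one
    target query (labeled o=1, positive) and one source query
    (labeled o=0, negative).
    """
    pairs = []
    for _ in range(M):
        R = random.sample(D_t, min(subset_size, len(D_t)))
        q_t = random.choice(D_t)  # assumed in the same domain as R
        q_s = random.choice(D_s)  # assumed out of R's domain
        pairs.append((q_t, R, 1))
        pairs.append((q_s, R, 0))
    return pairs
```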
Note that these self-supervised examples may be noisy since the binary labels are assigned based on heuristics without human scrutiny. It could be the case that $q_s$ happens to be in the same domain as $R$ while $q_t$ is not. It is necessary to automatically identify and remove such noisy examples. To achieve this goal, we associate each example in the training set with a weight $a_i \in [0, 1]$, where a smaller weight indicates the example is more likely to be noisy; let $A = \{a_i\}_{i=1}^{2M}$ denote these weights. This stage amounts to solving the following optimization problem:

$$W^*(A) = \operatorname*{argmin}_W \sum_{i=1}^{2M} a_i\, \ell(q_i, R_i, o_i; W). \quad (1)$$

$A$ is tentatively fixed at this stage and will be updated at a later stage. $A$ cannot be updated at this stage; otherwise, all the values in $A$ would be driven to zero.
Stage II. At the second stage, we use the learned domain distance network to identify out-of-domain source examples and adapt them into the target domain. For each example $q$ from the source dataset $D_s$, we feed $q$ and the target dataset $D_t$ into $W^*(A)$, which predicts a binary label $f(q, D_t; W^*(A))$. If $f(q, D_t; W^*(A)) = 0$, $q$ is out of the target domain, and we use the domain distance network to adapt it into the target domain. The adaptation is performed in the following way. In $W^*(A)$, let $W_1^*(A)$ denote the encoder (containing all layers before self-attention) and $W_2^*(A)$ denote the rest of the layers (including self-attention and feedforward layers) used for predicting the binary label. Let $\hat{z}(q; W_1^*(A))$ denote the latent representation of $q$ extracted by $W_1^*(A)$. We learn another representation $z(q)$ of $q$ which falls into the target domain (in the latent space) and is close to $\hat{z}(q; W_1^*(A))$, where closeness is measured using the $L_2$ distance. To encourage $z(q)$ to fall into the target domain, we encourage $W_2^*(A)$ to predict that $z(q)$ is in the target domain, i.e., we minimize the binary classification loss $\ell(z(q), D_t, t_q = 1; W_2^*(A))$. At this stage, we solve the following optimization problem for each source example $q$ that is predicted to be out-of-domain:

$$z^*(q, W^*(A)) = \operatorname*{argmin}_{z(q)}\ \underbrace{\ell(z(q), D_t, t_q = 1; W_2^*(A))}_{\text{encourage } z(q) \text{ to fall into the target domain}} + \gamma\, \underbrace{\| z(q) - \hat{z}(q; W_1^*(A)) \|_2^2}_{\text{encourage } z(q) \text{ to be close to } \hat{z}(q; W_1^*(A))}, \quad (2)$$

where $\gamma$ is a tradeoff parameter.
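The following sketch illustrates this per-example adaptation. It assumes a frozen encoder output z_hat (standing in for $\hat{z}(q; W_1^*(A))$) and a frozen classification head head_logits_fn (standing in for $W_2^*(A)$, with the dependence on $D_t$ folded into the closure), and it solves Eq. (2) approximately with a few gradient steps; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def adapt_latent(z_hat, head_logits_fn, gamma=0.1, steps=50, lr=0.01):
    """Adapt the latent representation of an out-of-domain source
    example toward the target domain (sketch of Eq. 2).

    z_hat: (1, H) frozen encoder output for the example.
    head_logits_fn: frozen head mapping a (1, H) latent vector to
        in-/out-of-domain logits of shape (1, 2).
    """
    z_hat = z_hat.detach()
    z = z_hat.clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    target = torch.tensor([1])  # label "in the target domain"
    for _ in range(steps):
        opt.zero_grad()
        # In-domain classification loss + closeness-to-z_hat penalty.
        loss = F.cross_entropy(head_logits_fn(z), target) \
               + gamma * (z - z_hat).pow(2).sum()
        loss.backward()
        opt.step()
    return z.detach()
```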
Stage III. At the third stage, we use the selected and adapted source examples, together with the target training data $D_t^{(tr)}$, to train the target model $U$. Let $\ell_{tgt}$ denote the loss function of the target task, $N_s$ and $N_t^{(tr)}$ denote the numbers of examples in $D_s$ and $D_t^{(tr)}$, and $f(q, D_t; W^*(A))$ denote the binary label predicted by the domain distance network $W^*(A)$ regarding whether a source example $q$ is in the target domain. $d_{t,i}^{(tr)}$ denotes the $i$-th example in $D_t^{(tr)}$ and $d_{s,i}$ denotes the $i$-th example in $D_s$. A source example predicted to be in the target domain ($f(d_{s,i}, D_t; W^*(A)) = 1$) is directly used to train the target model $U$ by minimizing the loss $\ell_{tgt}(U, d_{s,i})$. For a source example predicted to be out of the target domain ($f(d_{s,i}, D_t; W^*(A)) = 0$), its adapted representation $z^*(d_{s,i}, W^*(A))$ obtained at the second stage is used to train the target model. Note that since $z^*(d_{s,i}, W^*(A))$ is already a latent representation, only part of the weight parameters of the target model (the sub-network denoted by $\widetilde{U}$) is needed to make predictions on $z^*(d_{s,i}, W^*(A))$. We define the loss on target training data as

$$L^{(tr)} = \sum_{i=1}^{N_t^{(tr)}} \ell_{tgt}\big(U, d_{t,i}^{(tr)}\big), \quad (3)$$

the loss on selected source data as

$$L^{(sel)} = \sum_{i=1}^{N_s} \mathbb{I}\big(f(d_{s,i}, D_t; W^*(A)) = 1\big)\, \ell_{tgt}(U, d_{s,i}), \quad (4)$$

and the loss on adapted data as

$$L^{(adapt)} = \sum_{i=1}^{N_s} \mathbb{I}\big(f(d_{s,i}, D_t; W^*(A)) = 0\big)\, \ell_{tgt}\big(\widetilde{U}, z^*(d_{s,i}, W^*(A))\big). \quad (5)$$

At this stage, we solve the following optimization problem:

$$U^*(W^*(A)) = \operatorname*{argmin}_U\ L^{(tr)} + \lambda\big(L^{(sel)} + L^{(adapt)}\big), \quad (6)$$

where $\lambda$ is a tradeoff parameter.

Stage IV. At the fourth stage, we use the trained target model to make predictions on the validation dataset $D_t^{(val)}$ of the target task. We update the weights $A$ of the self-supervised training examples in the first stage by minimizing the validation loss:

$$\min_A\ L^{(val)}\big(U^*(W^*(A))\big). \quad (7)$$

A Four-Level Optimization Based Framework. Putting all these pieces together, we have the following four-level optimization based framework:
$$\begin{aligned}
\min_A\ \ & L^{(val)}\big(U^*(W^*(A))\big) \\
\text{s.t.}\ \ & U^*(W^*(A)) = \operatorname*{argmin}_U\ L^{(tr)} + \lambda\big(L^{(sel)} + L^{(adapt)}\big) \\
& z^*(q, W^*(A)) = \operatorname*{argmin}_{z(q)}\ \ell(z(q), D_t, t_q = 1; W_2^*(A)) + \gamma \| z(q) - \hat{z}(q; W_1^*(A)) \|_2^2 \\
& W^*(A) = \operatorname*{argmin}_W\ \textstyle\sum_{i=1}^{2M} a_i\, \ell(q_i, R_i, o_i; W)
\end{aligned} \quad (8)$$

To make the objective at the third stage differentiable, we approximate $f(d_{s,i}, D_t; W^*(A))$ using the probability calculated by $W^*(A)$ regarding whether $d_{s,i}$ and $D_t$ are in the same domain.
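A minimal sketch of this soft (differentiable) version of the Stage III objective follows, with the hard indicator replaced by the in-domain probability g; the function and variable names (U_sub, loss_fn, the source_batch layout) are our own.

```python
def stage3_soft_loss(U, U_sub, loss_fn, target_batch, source_batch, lam=0.5):
    """Differentiable Stage III objective (sketch of Eq. 6 with soft
    selection weights).

    U: full target model; U_sub: the sub-network of U applied to
    adapted latent representations; loss_fn: the target task loss.
    Each source item is (x, y, g, z_star), where g is the DDMN's
    in-domain probability and z_star the adapted representation.
    """
    L_tgt = sum(loss_fn(U(x), y) for x, y in target_batch)
    L_src = sum(g * loss_fn(U(x), y)                    # selected (soft)
                + (1.0 - g) * loss_fn(U_sub(z_star), y)  # adapted (soft)
                for x, y, g, z_star in source_batch)
    return L_tgt + lam * L_src
```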

Optimization Algorithm
We leverage a gradient-based method (Liu et al., 2018) to solve the problem in Eq. (8). Convergence of this class of algorithms has been analyzed in prior work (Ghadimi and Wang, 2018; Grazzi et al., 2020; Ji et al., 2021; Liu et al., 2021; Yang et al., 2021). At each level of the optimization problem, the exact optimal solution (on the left-hand side of the equal sign, marked with *) is computationally expensive to obtain. To address this problem, following Liu et al. (2018), we approximate each optimal solution using a one-step gradient descent update and plug the approximation into the next level of the optimization problem. First, we approximate $W^*(A)$ using a one-step gradient descent update of $W$ w.r.t. the objective at the first stage:

$$W' = W - \eta \nabla_W \sum_{i=1}^{2M} a_i\, \ell(q_i, R_i, o_i; W). \quad (9)$$

We plug $W^*(A) \approx W'$ into the loss function at the second stage and get an approximated objective. Let $W_2'$ and $W_1'$ denote the corresponding approximations of $W_2^*(A)$ and $W_1^*(A)$, respectively. The approximated objective is $O(z(q), W_2', W_1') = \ell(z(q), D_t, t_q = 1; W_2') + \gamma \| z(q) - \hat{z}(q; W_1') \|_2^2$. We approximate $z^*(q, W^*(A))$ using a one-step gradient descent update of $z(q)$ w.r.t. the approximated objective:

$$z'(q) = z(q) - \eta \nabla_{z(q)} O(z(q), W_2', W_1'). \quad (10)$$

We plug $z^*(q, W^*(A)) \approx z'(q)$ and $W^*(A) \approx W'$ into the objective at the third stage and get an approximated objective. Let $g(d_{s,i}, D_t; W')$ denote the probability that $d_{s,i}$ and $D_t$ are in the same domain. We approximate $U^*(W^*(A))$ using a one-step gradient descent update of $U$ w.r.t. the approximated objective:

$$U' = U - \eta \nabla_U \Big[ L^{(tr)} + \lambda \sum_{i=1}^{N_s} \Big( g(d_{s,i}, D_t; W')\, \ell_{tgt}(U, d_{s,i}) + \big(1 - g(d_{s,i}, D_t; W')\big)\, \ell_{tgt}\big(\widetilde{U}, z'(d_{s,i})\big) \Big) \Big]. \quad (11)$$

Finally, we plug the approximation $U^*(W^*(A)) \approx U'$ into the validation loss at the fourth stage and update $A$ by minimizing the approximated loss using gradient descent:

$$A \leftarrow A - \eta \nabla_A L^{(val)}(U'), \quad (13)$$

where the hypergradient $\nabla_A L^{(val)}(U')$ is computed by the chain rule through the one-step updates in Eqs. (9), (10), and (11), since $U'$ is a function of $z'(q)$ and $W'$, which are in turn functions of $A$.

while not converged do
    1. Update the approximation $W'$ of $W^*(A)$ using Eq. (9)
    2. Update the approximation $z'(q)$ of $z^*(q, W^*(A))$ using Eq. (10)
    3. Update the approximation $U'$ of $U^*(W^*(A))$ using Eq. (11)
    4. Update $A$ using Eq. (13)
end
Algorithm 1: Optimization algorithm.

The gradient descent update of $A$ in Eq. (13) can run one or more steps. After $A$ is updated, the one-step gradient-descent approximations in Eqs. (9), (10), and (11), which are functions of $A$, change with $A$ and need to be re-updated.
Then the gradient w.r.t. $A$, which is a function of the one-step gradient-descent approximations, needs to be re-calculated and is used to refresh $A$. In sum, the update of $A$ and the updates of the one-step gradient-descent approximations mutually depend on each other, and these updates are performed iteratively until convergence. Algorithm 1 summarizes this procedure.
In the gradient of $A$ calculated using the chain rule, the number of chained factors equals the number of levels in our proposed four-level optimization formulation. This shows that the optimization algorithm preserves the four-level nested nature of the proposed formulation.
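For illustration, the following PyTorch sketch performs one round of Algorithm 1. It uses create_graph=True so that the hypergradient of the validation loss w.r.t. A flows back through the one-step updates; the loss closures and all names are our own assumptions, not the authors' code.

```python
import torch

def four_level_round(A, W, z, U, ssl_loss, adapt_loss, train_loss, val_loss,
                     eta=1e-3, alpha=1e-2):
    """One round of Algorithm 1 with one-step unrolled updates.

    A, W, z, U are lists of tensors with requires_grad=True; the four
    loss closures correspond to Stages I-IV of the framework.
    """
    # Stage I: one-step approximation W' of W*(A); create_graph keeps
    # this update differentiable w.r.t. the example weights A.
    gW = torch.autograd.grad(ssl_loss(W, A), W, create_graph=True)
    W1 = [w - eta * g for w, g in zip(W, gW)]
    # Stage II: one-step approximation z' of z*(q, W*(A)).
    gz = torch.autograd.grad(adapt_loss(z, W1), z, create_graph=True)
    z1 = [zz - eta * g for zz, g in zip(z, gz)]
    # Stage III: one-step approximation U' of U*(W*(A)).
    gU = torch.autograd.grad(train_loss(U, z1, W1), U, create_graph=True)
    U1 = [u - eta * g for u, g in zip(U, gU)]
    # Stage IV: the hypergradient of the validation loss w.r.t. A flows
    # back through U', z', and W' by the chain rule (Eq. 13).
    gA = torch.autograd.grad(val_loss(U1), A)
    with torch.no_grad():
        for a, g in zip(A, gA):
            a -= alpha * g
```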

Reducing Computation and Memory Costs
To reduce computation and memory costs, we adopt the following methods.
• We reduced the frequency of updating (including calculating hypergradients of) the weights A of the self-supervised training examples: they were updated every 8 mini-batches (i.e., iterations) instead of on every mini-batch (see the sketch after this list). We empirically found that this greatly reduces computational costs without significantly sacrificing accuracy.
The rest of the parameters were updated on every mini-batch.
• We added a decorrelation regularizer (Cogswell et al., 2015) on W and U, which significantly speeds up convergence and allows reducing the number of epochs by half without sacrificing convergence quality.
• Parameter tying was performed to reduce the number of weight parameters and computation costs. We let W and U share the same feature learning layers, which account for >95% of the parameters in each of these models. Sharing them across models significantly reduces the total number of parameters, which in turn reduces the computational cost of updating them.
• We optimized the implementation of our method to speed up computation by leveraging techniques including 1) automatic mixed precision (Micikevicius et al., 2017), 2) using multiple (specifically, 4) workers and pinned memory in the PyTorch DataLoader, 3) using the cuDNN autotuner, 4) kernel fusion, and so forth.
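A minimal sketch combining the first and fourth points above (mixed-precision training with the A update run only every 8 iterations); the callback name and training-loop details are our own.

```python
import torch

def train_with_periodic_A_update(model, loader, opt, update_example_weights_A):
    """Mixed-precision training loop where the expensive hypergradient
    update of the example weights A (Algorithm 1, step 4) runs only
    every 8 mini-batches. All names here are illustrative."""
    scaler = torch.cuda.amp.GradScaler()
    for it, (x, y) in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        if it % 8 == 0:
            # Hypergradients of A are costly, so A is refreshed at a
            # reduced frequency; other parameters update every batch.
            update_example_weights_A()
```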

Applications
In this section, we apply the proposed four-level optimization framework to two NLP applications.

Baselines. We compare our method with the following baselines. In the baseline methods, the target data used for training the target model includes both the training and validation sets.
• Domain adaptive pretraining (DAPT) (Gururangan et al., 2020): Given a RoBERTa-base model pretrained on large amounts of corpora (as used in Liu et al., 2019c), we continue to pretrain it on source data, then finetune it on target data.
• Task adaptive pretraining (TAPT) (Gururangan et al., 2020): Given a RoBERTa-base model pretrained on large amounts of corpora (as used in Liu et al., 2019c), we continue to pretrain it on the input texts of each target dataset.
• SimCSE (Gao et al., 2021): A contrastive learning method. The same input sentence is fed into a pretrained RoBERTa encoder twice with different dropout masks to obtain two different embeddings, which are labeled as being "similar".
Implementation Details. In the domain distance network, the hyperparameters of the self-attention and feedforward layers are the same as those in the Transformer (Vaswani et al., 2017). The subset R has varying cardinality (sampled uniformly). M is set to 10k. The tradeoff parameters γ and λ are set to 0.1 and 0.5, respectively. Baselines and our method receive a similar amount of tuning time and effort. F1 is used as the evaluation metric.
We use RoBERTa-base as the data encoder. For a fair comparison, most of our hyperparameters are the same as those in Gururangan et al. (2020).
The maximum text length was set to 512. For all datasets, we used a batch size of 16 with gradient accumulation. We used the AdamW optimizer (Loshchilov and Hutter, 2017) with a warm-up proportion of 0.06, a weight decay of 0.1, and an epsilon of 1e-6. In AdamW, β1 and β2 were set to 0.9 and 0.98, respectively. The maximum learning rate was 2e-5. For the reader's convenience, we summarize the hyperparameters as follows.

• The number M of self-supervised training examples: 10k
• Tradeoff parameters γ and λ: 0.1, 0.5
• Maximum text length: 512
• Batch size: 16
• Optimizer: AdamW
• Warm-up proportion, weight decay, and epsilon in AdamW: 0.06, 0.1, and 1e-6
• β1 and β2 in AdamW: 0.9, 0.98
• Maximum learning rate: 2e-5

To tune the hyperparameters, we randomly split the validation set into two equal-sized subsets, denoted by A and B. For each configuration of hyperparameters, we use validation subset A to learn the importance weights of the self-supervised training examples, then measure the performance of the trained model on validation subset B. The hyperparameter values yielding the best performance on subset B are selected. Each baseline method received an equal amount of tuning effort as our method.
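For reference, the hyperparameters above can be collected as follows; the dict layout and function name are our own convenience, while the values are those reported in the list.

```python
import torch

# Hyperparameters from the list above, gathered for reference.
HPARAMS = dict(
    num_ssl_examples=10_000,   # M
    gamma=0.1, lam=0.5,        # tradeoff parameters
    max_len=512, batch_size=16,
    warmup_proportion=0.06, weight_decay=0.1,
    adam_eps=1e-6, betas=(0.9, 0.98), max_lr=2e-5,
)

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Build the AdamW optimizer with the settings above."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=HPARAMS["max_lr"],
        betas=HPARAMS["betas"],
        eps=HPARAMS["adam_eps"],
        weight_decay=HPARAMS["weight_decay"],
    )
```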
Results and Analysis. Table 3 shows the results, from which we make the following observations. First, our method outperforms source data selection methods including BO, MMG, and LSI. In these baseline methods, out-of-domain source examples are discarded, which incurs information loss. In contrast, our method adapts out-of-domain source examples into the target domain and uses the adapted examples to train the target model. Second, our method works better than domain adaptation methods including DANN, CDAN, MME, SRDC, SSDA, GDA, and ATDOC. The reason is that these methods perform adaptation on all source examples without identifying which ones are already in the target domain; as a result, some in-domain source examples may be adapted out of the target domain. In contrast, our method first identifies which source examples are already in-domain and only performs adaptation on out-of-domain examples. Third, our method outperforms vanilla RoBERTa-base, which does not leverage source data to learn representations. This demonstrates that leveraging source data is helpful for improving the target model. Fourth, our method performs better than DAPT. In DAPT, all source examples are leveraged to pretrain the target encoder, ignoring the fact that some source examples have a large domain discrepancy with the target domain and are not suitable for pretraining the target encoder. Fifth, our method outperforms TAPT and SimCSE, which do not leverage auxiliary source data or select/adapt source data.

Table 3: Text classification results. Following Gururangan et al. (2020), the results are micro-F1 for CHEMPROT and RCT, and macro-F1 for the other datasets. For each $x_y$ entry, $x$ and $y$ represent the mean and standard deviation of four random runs, respectively.

Visual Question Answering on Pathology Images
Datasets. For the target dataset, we use PathVQA (He et al., 2020), which contains 1,670 pathology images and 32,795 question-answer pairs. Of these, 16,466 questions are open-ended, with the following types: what, where, when, whose, how, why, and how much/how many. The rest are close-ended "yes/no" questions. Based on images, the dataset is split into training, validation, and test sets with an approximate ratio of 3:1:1. Note that baseline methods are trained on the combination of the training and validation sets. We collected a source dataset containing 1,792 pathology figures extracted from papers on medRxiv, where each figure has a caption. We created 36,471 VQA questions using the method proposed in He et al. (2020). This source dataset will be made publicly available.
Implementation Details. For the target model, we experimented with two state-of-the-art VQA models, LXMERT (Tan and Bansal, 2019) and bilinear attention networks (BAN) (Kim et al., 2018), each containing an image encoder, a question encoder, and an answer generation head. Hyperparameters mostly follow those in previous work (Tan and Bansal, 2019; Kim et al., 2018; Yang et al., 2016). For LXMERT, the hidden size of the text encoder was set to 768. The initial learning rate was set to 5e-5, using the Adam optimizer (Kingma and Ba, 2014). The batch size was set to 256. The model was trained for 200 epochs. For BAN, words in questions and answers were represented using GloVe vectors (Pennington et al., 2014). The initial learning rate was set to 0.005, using the Adamax optimizer (Kingma and Ba, 2014). The batch size was set to 512. The model was trained for 200 epochs.

From the questions and answers in the PathVQA dataset, we create a vocabulary of the 4,631 words with the highest frequencies. Data augmentation is applied to images, including shifting, scaling, and shearing. We compare with baselines similar to those in Section 4.1. The Pretrain baseline works as follows: we pretrain the target encoder using source data, then finetune it using target data. Three metrics were used for evaluation: BLEU (Papineni et al., 2002), macro-averaged F1 (Goutte and Gaussier, 2005), and accuracy (Malinowski and Fritz, 2014). We implement the methods using PyTorch and perform training on four GTX 1080Ti GPUs.
Results and Analysis. The results are shown in Table 4, from which we make observations similar to those in Table 3; the analysis of reasons is likewise similar. The training time of our method is similar to that of the baselines. Figure 2 shows some randomly sampled source pathology figures identified by our method as being out of the target domain. These images contain subfigures and texts, which are different from the target dataset.

Ablation by Removing Certain Components
To better understand the contributions of individual components in our framework, we perform the following ablation studies.
• No-Adapt. Out-of-domain source examples are discarded instead of being adapted. This is equivalent to removing stage II and the adapted-data loss term $L^{(adapt)}$ (Eq. 5) from Eq. (6) of stage III.
• No-In-Domain. In-domain source examples are discarded instead of being used for training the target model. This is equivalent to removing the loss term $\mathbb{I}(f(d_{s,i}, D_t; W^*(A)) = 1)\, \ell_{tgt}(U, d_{s,i})$ (i.e., $L^{(sel)}$, Eq. 4) from Eq. (6) of stage III.
• Separate. Different stages are performed separately instead of jointly.
Table 3 shows the ablation study results, from which we make the following observations. First, our full method works better than No-Adapt. This shows that it is beneficial to adapt out-of-domain source examples into the target domain and use the adapted examples to train the target model, and that our method is effective in achieving this goal. Second, our full method outperforms No-In-Domain, which demonstrates that the source examples selected by our method are useful for training the target model. Third, our full method achieves better performance than Separate. Our full method performs source data selection, adaptation, and target model training jointly, which enables these different tasks to mutually influence each other to achieve the best overall performance. Such a mechanism is lacking in Separate.

Ablation on the Adaptation Component
We perform an ablation study of the adaptation component in stage II of our framework by replacing it with the following baselines.
• Maximum mean discrepancy (MMD) (Kang et al., 2019), a broadly used metric for measuring the discrepancy between two distributions. The MMD-based adaptation method learns latent representations so that the selected out-of-domain source examples and the target examples have a small MMD in the latent space (a minimal sketch of the MMD measure follows this list).
• Adversarial adaptation (AD) (Ganin et al., 2016), which learns latent representations so that a domain discriminator cannot tell whether an example is from the source or from the target.
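For reference, a minimal sketch of the (biased) RBF-kernel MMD estimate underlying the MMD baseline; the single fixed bandwidth sigma is our simplification, as multi-kernel variants are common in practice.

```python
import torch

def mmd_rbf(X: torch.Tensor, Y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased empirical MMD^2 between samples X (n, d) and Y (m, d)
    under an RBF kernel with bandwidth sigma."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```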
Table 3 shows the results. As can be seen, our adaptation method works better than the baselines. The reason is that the domain distance metric in our method is learned using many self-supervised training examples; it can therefore better measure domain discrepancy and facilitate domain adaptation.

Ablation on the Selection Mechanism
We perform an ablation study of the selection mechanism in our framework by replacing it with the following baselines.
• Each source example is associated with a weight in [0, 1].A larger weight indicates the example is more likely to be in the target domain.We learn these data weights by minimizing the validation loss (WMVL) of the target model.
• We use an H-divergence (Elsahar and Gallé, 2019) based metric to measure domain similarity between a source example and the target dataset.
Table 3 shows the results. As can be seen, our selection mechanism works better than the baselines. The reason is that, in our framework, the domain distance network is learned in a discriminative way by performing classification on self-supervised training examples. Discriminative training enables the network to better distinguish in-domain source examples from out-of-domain ones. In contrast, the selection mechanisms in the two baselines lack discriminability.

Ablation on the Tradeoff Parameter λ

We investigate how the performance of our method is affected by the tradeoff parameter λ.
Figure 3 shows the test accuracy on ImageCLEF.
As can be seen, when λ increases from 0.01 to 0.1, the accuracy improves. This is because the selected and adapted source data serves as an additional training resource for the target model. However, as λ further increases, the accuracy drops. This is because the source data is not as reliable as the target data; an excessively large λ puts too much emphasis on the less-reliable source data.

Ablation on the Train-Validate Ratio
In the next ablation study, we investigate how the performance of our method varies under different split ratios between the target training and validation datasets. The study was performed on the text classification task. Table 5 shows the results. As can be seen, a more balanced ratio (e.g., 1:1) yields better performance, while largely imbalanced ratios (e.g., 1:9, 1:4) perform worse. Note that since the training and validation sets are obtained by splitting the original target training dataset, it is always possible to obtain a balanced split.

A Bi-level Optimization Based Ablation Setting
We also perform an ablation study which reduces the proposed four-level optimization problem to a bi-level optimization (BLO) problem. The study was conducted on the text classification task. For each source example $d_{s,i}$, we learn a weight $b_i \in [0, 1]$; a larger weight indicates that $d_{s,i}$ is more likely to be in the target domain. Let $B = \{b_i\}_{i=1}^{N_s}$, where $N_s$ is the number of source examples. We use a Transformer (Vaswani et al., 2017) $T$ to perform domain adaptation: it takes a source text $t$ as input and generates an adapted text $f(t, T)$ which is expected to be in the target domain. The Gumbel-Softmax trick (Jang et al., 2017) is used to deal with the non-differentiability of the generated texts (a sketch of this trick is given at the end of this subsection). At the lower level of the BLO formulation, we train the target model $U$. We define the training loss on selected source data as

$$\sum_{i=1}^{N_s} b_i\, \ell_{tgt}(U, d_{s,i}),$$

and the training loss on adapted source data as

$$\sum_{i=1}^{N_s} (1 - b_i)\, \ell_{tgt}\big(U, f(d_{s,i}, T)\big). \quad (21)$$

At the upper level, we evaluate $U^*(B, T)$ on the target validation set $D_t^{(val)}$ and learn $B$ and $T$ by minimizing the validation loss. The overall BLO formulation is:

$$\min_{B, T}\ L^{(val)}\big(U^*(B, T)\big) \quad \text{s.t.} \quad U^*(B, T) = \operatorname*{argmin}_U\ L^{(tr)} + \lambda \sum_{i=1}^{N_s} \Big( b_i\, \ell_{tgt}(U, d_{s,i}) + (1 - b_i)\, \ell_{tgt}\big(U, f(d_{s,i}, T)\big) \Big). \quad (22)$$

Table 6 shows the results. As can be seen, this BLO method performs worse than our method. The reason is that our method uses self-supervised learning to learn a domain discrepancy metric (in stage I) and leverages the learned metric to perform domain adaptation (in stage II). Such mechanisms are lacking in the BLO method. This further demonstrates the necessity of stages I and II in our method.
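As promised above, a minimal sketch of the Gumbel-Softmax trick used in this BLO ablation; the vocabulary size, temperature, and embedding matrix here are dummy values for illustration.

```python
import torch
import torch.nn.functional as F

# Gumbel-Softmax makes sampling of discrete tokens differentiable:
# vocabulary logits are turned into a (soft) one-hot sample that
# still admits gradients w.r.t. the logits.
logits = torch.randn(4, 30_522)                # (seq_len, vocab_size)
soft_tokens = F.gumbel_softmax(logits, tau=0.5, hard=True)
# hard=True returns one-hot vectors in the forward pass while using
# the soft sample's gradient in the backward pass (straight-through).
embeddings = soft_tokens @ torch.randn(30_522, 768)  # differentiable lookup
```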

Human Evaluation
We perform a human evaluation on whether the identified in-domain and out-of-domain source examples are indeed in or out of the target domain.
The study is performed on CHEMPROT, RCT, and ACL-ARC. For each dataset, we randomly sample 200 source texts. Three undergraduates were asked to label whether these source texts are in-domain or out-of-domain, and a majority vote decides the final label. The kappa score among the annotators is 0.75, which indicates a strong level of agreement. Different methods are then applied to predict whether each source text is in-domain or not. Table 7 shows the results. Our method achieves the best accuracy in identifying in-domain and out-of-domain source examples, owing to its mechanism of learning the domain distance network in a discriminative way (as analyzed in Section 4.3).
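For reproducibility of the agreement computation mentioned above, a minimal sketch assuming Fleiss' kappa (the paper does not name the kappa variant) with dummy labels; the statsmodels utilities are used here.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] is annotator j's label for text i
# (1 = in-domain, 0 = out-of-domain); dummy data for illustration.
ratings = np.random.randint(0, 2, size=(200, 3))
table, _ = aggregate_raters(ratings)   # (n_texts, n_categories) counts
print(fleiss_kappa(table, method="fleiss"))
```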

Conclusions and Discussion
We propose a framework for simultaneous selection and adaptation of source examples in transfer learning. Our method automatically identifies which source examples are in or out of the target domain, and performs example-specific operations (selection or adaptation). This is different from previous methods, which 1) discard out-of-domain source examples, leading to information loss, or 2) try to adapt all source examples into the target domain, incurring a risk of moving in-domain source examples outside the target domain. Our framework is based on four-level optimization. Experiments on text classification and visual question answering demonstrate the effectiveness of our proposed method.

Table 4: Results on the PathVQA dataset. Runtime (hours) for training is measured on a 1080Ti GPU.

Figure 2: Randomly sampled source pathology figures identified by our method as being out of the target domain.

Table 5: Text classification results of our method under different split ratios between target training and validation sets.

Table 6: Ablation study results of the BLO method on text classification.

Table 7: Human evaluation results.

Method                        Accuracy
BO (Ruder and Plank, 2017)    79.9
MMG (Wang et al., 2019a)      82.1
LSI (Huan et al., 2021)       83.9