PERL: Pivot-based Domain Adaptation for Pre-trained Deep Contextualized Embedding Models

Pivot-based neural representation models have led to significant progress in domain adaptation for NLP. However, previous works that follow this approach utilize only labeled data from the source domain and unlabeled data from the source and target domains, but neglect to incorporate massive unlabeled corpora that are not necessarily drawn from these domains. To alleviate this, we propose PERL: A representation learning model that extends contextualized word embedding models such as BERT with pivot-based fine-tuning. PERL outperforms strong baselines across 22 sentiment classification domain adaptation setups, improves in-domain model performance, yields effective reduced-size models and increases model stability.


Introduction
Natural Language Processing (NLP) algorithms are constantly improving, gradually approaching human level performance (Dozat and Manning, 2017;Edunov et al., 2018;Radford et al., 2018). However, those algorithms often depend on the availability of large amounts of manually annotated data from the domain where the task is performed. Unfortunately, collecting such annotated data is often costly and laborious, which substantially limits the applicability of NLP technology.
Domain Adaptation (DA), training an algorithm on annotated data from a source domain so that it can be effectively applied to other target domains, is one of the ways to solve the above bottleneck. Indeed, over the years substantial efforts have been devoted to the DA challenge (Roark and Bacchiani, 2003; Daumé III and Marcu, 2006; Ben-David; Jiang and Zhai, 2007; McClosky et al., 2010; Rush et al., 2012; Schnabel and Schütze, 2014). Our focus in this paper is on unsupervised DA, the setup we consider most realistic. In this setup labeled data is available only from the source domain while unlabeled data is available from both the source and the target domains. While various approaches for DA have been proposed (§2), with the prominence of deep neural network (DNN) modeling, attention has recently focused on representation learning approaches. Within representation learning for unsupervised DA, two approaches have been shown to be particularly useful. In one line of work, DNN-based methods which employ compress-based noise reduction to learn cross-domain features have been developed (Glorot et al., 2011; Chen et al., 2012). In another line of work, methods based on the distinction between pivot and non-pivot features (Blitzer et al., 2006, 2007) learn a joint feature representation for the source and the target domains. Later on, Ziser and Reichart (2017, 2018) and Li et al. (2018) married the two approaches and achieved substantial improvements on a variety of DA setups.
* Both authors equally contributed to this work.
¹ Our code is at https://github.com/eyalbd2/PERL.
² This paper was accepted to TACL in June 2020.
Despite their success, pivot-based DNN models still only utilize labeled data from the source domain and unlabeled data from both the source and the target domains, but neglect to incorporate massive unlabeled corpora that are not necessarily drawn from these domains. With the recent game-changing success of contextualized word embedding models trained on such massive corpora (Devlin et al., 2019; Peters et al., 2018), it is natural to ask whether information from such corpora can enhance these DA methods, particularly since background knowledge from non-contextualized embeddings has proven useful for DA (Plank and Moschitti, 2013). In this paper we hence propose an unsupervised DA approach that extends leading approaches based on DNNs and pivot-based ideas, so that they can incorporate information encoded in massive corpora (§3). Our model, named PERL: Pivot-based Encoder Representation of Language, builds on massively pre-trained contextualized word embedding models such as BERT (Devlin et al., 2019). To adjust the representations learned by these models so that they close the gap between the source and target domains, we fine-tune their parameters using a pivot-based variant of the Masked Language Modeling (MLM) objective, optimized on unlabeled data from both the source and the target domains. We further present R-PERL (regularized PERL) which facilitates parameter sharing for pivots with similar meaning.
We perform extensive experimentation in various unsupervised DA setups of the task of binary sentiment classification ( §4, 5). First, for compatibility with previous work, we experiment with the legacy product review domains of Blitzer et al. (2007) (12 setups). We then experiment with more challenging setups, adapting between the above domains and the airline review domain (Nguyen, 2015) used in Ziser and Reichart (2018) (4 setups), as well as the IMDB movie review domain (Maas et al., 2011) (6 setups). We compare PERL to the best performing pivot-based methods (Ziser and Reichart, 2018;Li et al., 2018) and to DA approaches that fine-tune a massively pre-trained BERT model by optimizing its standard MLM objective using target-domain unlabeled data (Lee et al., 2020;Han and Eisenstein, 2019). PERL and R-PERL substantially outperform these baselines, emphasizing the additive effect of massive pre-training and pivot-based fine-tuning.
As an additional contribution, we show that pivot-based learning is effective beyond improving domain adaptation accuracy. Particularly, we show that an in-domain variant of PERL substantially improves the in-domain performance of a BERT-based sentiment classifier, for varying training set sizes (from 100 to 20K labeled examples). We also show that PERL facilitates the generation of effective reduced-size DA models. Finally, we perform an extensive ablation study ( §6) that uncovers PERL's crucial design choices and demonstrates the stability of PERL to hyper-parameter selection compared to other DA methods.

Background and Previous Work
There are several approaches to DA, including instance re-weighting (Sugiyama et al., 2007; Huang et al., 2006; Mansour et al., 2008), sub-sampling from the participating domains (Chen et al., 2011) and DA through representation learning, where a joint representation is learned based on texts from the source and target domains (Blitzer et al., 2007; Xue et al., 2008; Ziser and Reichart, 2017, 2018). We first describe the unsupervised DA pipeline, continue with representation learning methods for DA with a focus on pivot-based methods, and, finally, describe contextualized embedding models.
Unsupervised Domain Adaptation through Representation Learning As said in §1 our focus in this work is on unsupervised DA through representation learning. A common pipeline for this setup consists of two steps: (A) Learning a representation model (often referred to as the encoder) using the source and target unlabeled data; and (B) Training a supervised classifier on the source domain labeled data. To facilitate domain adaptation, every text fed to the classifier in the second step is first represented by the pre-trained encoder. This is performed both when the classifier is trained in the source domain and when it is applied to new text from the target domain.
Exceptions to this pipeline are end-to-end models that jointly learn to perform the cross-domain text representation and the classification task. This is achieved by training a unified objective on the source domain labeled data and the unlabeled data from both the source and the target. Among these models are domain adversarial networks (Ganin et al., 2016), which were strongly outperformed by Ziser and Reichart (2018) to which we compare our methods, and the hierarchical attention transfer network (HATN, (Li et al., 2018)), which is one of our baselines (see below).
Unsupervised DA through representation learning has followed two main avenues. The first avenue consists of works that aim to explicitly build a feature representation that bridges the gap between the domains. A seminal framework in this line is structural correspondence learning (SCL; Blitzer et al., 2006, 2007), which splits the feature space into pivot and non-pivot features. A large number of works have followed this idea (e.g. Pan et al., 2010; Gouws et al., 2012; Bollegala et al., 2015; Yu and Jiang, 2016; Li et al., 2017, 2018; Tu and Wang, 2019; Ziser and Reichart, 2017, 2018) and we discuss it below.
Works in the second avenue learn cross-domain representations by training autoencoders (AEs) on the unlabeled data from the source and target domains. The aim is to obtain a more robust representation that is better suited for DA. Examples of such models include the stacked denoising AE (SDA; Vincent et al., 2008; Glorot et al., 2011), the marginalized SDA and its variants (MSDA; Chen et al., 2012; Yang and Eisenstein, 2014; Clinchant et al., 2016) and variational AE based models (Louizos et al., 2016).
Recently, Ziser and Reichart (2017, 2018) and Li et al. (2018) married these approaches and presented pivot-based approaches where the representation model is based on DNN encoders (AE, LSTM or hierarchical attention networks). Since their methods outperformed the above models, we aim to extend them to models that can also exploit massive out of (source and target) domain corpora. We next elaborate on pivot-based approaches.
Pivot-based Domain Adaptation Proposed by Blitzer et al. (2006, 2007) through their SCL framework, the main idea of pivot-based DA is to divide the shared feature space of the source and the target domains into two complementary subsets: one of pivots and one of non-pivots. Pivot features are defined based on two criteria: (a) They are frequent in the unlabeled data of both domains; and (b) They are prominent for the classification task defined by the source domain labeled data. Non-pivot features are those features that do not meet at least one of the above criteria. While SCL is based on linear models, there have been some very successful recent efforts to extend this framework so that non-linear encoders (DNNs) are employed. Here we focus on the latter line of work, which produces much better results, and do not elaborate on SCL any further.
Ziser and Reichart (2018) have presented the Pivot Based Language Model (PBLM), which incorporates pre-training and pivot-based learning. PBLM is a variant of an LSTM-based language model, but instead of predicting at each point the most likely next input word, it predicts the next input unigram or bigram if one of these is a pivot (if both are, it predicts the bigram), and NONE otherwise. In the unsupervised DA pipeline PBLM is trained on the source and target unlabeled data. Then, when the task classifier is trained and applied to the target domain, PBLM is employed as a contextualized word embedding layer. Notice that PBLM is not pre-trained on massive out of (source and target) domain corpora, and its single-layer, unidirectional LSTM architecture is probably not ideal for knowledge encoding from such corpora.
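The PBLM prediction target described above can be illustrated with a short sketch. This is not the PBLM implementation; the function name `pblm_targets`, the representation of bigrams as space-joined strings and the toy pivot set are all assumptions made for illustration.

```python
from typing import List, Set

NONE = "NONE"  # label used when neither the next unigram nor the next bigram is a pivot


def pblm_targets(tokens: List[str], pivots: Set[str]) -> List[str]:
    """Return, for each position t, the PBLM-style target for what follows t.

    Assumption: bigram pivots are stored as "w1 w2" strings in `pivots`.
    """
    targets = []
    for t in range(len(tokens) - 1):
        next_unigram = tokens[t + 1]
        # The next bigram exists only if two more tokens follow position t.
        next_bigram = (
            f"{tokens[t + 1]} {tokens[t + 2]}" if t + 2 < len(tokens) else None
        )
        if next_bigram is not None and next_bigram in pivots:
            targets.append(next_bigram)      # bigram wins when both are pivots
        elif next_unigram in pivots:
            targets.append(next_unigram)
        else:
            targets.append(NONE)
    return targets
```

For the sentence "this phone is very good indeed" with the hypothetical pivots {"good", "very good"}, the position of "is" receives the bigram target "very good", the position of "very" receives "good", and every other position receives NONE.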
Another work in this line is HATN (Li et al., 2018). This model automatically learns the pivot/non-pivot distinction, rather than following the SCL definition as Ziser and Reichart (2017, 2018) did. HATN consists of two hierarchical attention networks, P-net and NP-net. First, it trains the P-net on the source labeled data. Then, it decodes the most prominent tokens of P-net (i.e. tokens which received the highest attention values), and considers them as its pivots. Finally, it simultaneously trains the P-net and the NP-net on both the labeled and the unlabeled data, such that P-net is adversarially trained to predict the domain of the input example (Ganin et al., 2016) and NP-net is trained to predict its pivots, and the hidden representations from both networks serve for the task label (sentiment) prediction.
Both HATN and PBLM strongly outperform a large variety of previous DA models on various cross-domain sentiment classification setups. Hence, they are our major baselines in this work. Like PBLM, we employ the same definition of the pivot and non-pivot subsets as in Blitzer et al. (2007). Like HATN, we also employ an attention-based DNN. Unlike both models, we design our model so that it incorporates pivot-based learning with pre-training on massive out of (source and target) domain corpora. We next discuss this pre-training process, which is also known as training models for contextualized word embeddings.

Contextualized Word Embedding Models
Contextualized word embedding (CWE) models are trained on massive corpora (Peters et al., 2018;Radford et al., 2019). They typically employ a language modeling objective or a closely related variant (Peters et al., 2018;Ziser and Reichart, 2018;Devlin et al., 2019;Yang et al., 2019), although in some recent papers the model is trained on a mixture of basic NLP tasks (Zhang et al., 2019;Rotman and Reichart, 2019). The contribution of such models to the state-of-the-art in a variety of NLP tasks is already well-established.
CWE models typically follow three steps: (1) Pre-training: Where a DNN (referred to as the encoder of the model) is first trained on massive unlabeled corpora which represent a broad domain (such as English Wikipedia); (2) Fine-tuning: An optional step, where the encoder is refined on unlabeled text of interest. As noted above, Lee et al. (2020) and Han and Eisenstein (2019) tuned BERT on unlabeled target domain data to facilitate domain adaptation; and (3) Supervised task training: Where task specific layers are trained on labeled data for a downstream task of interest.
PERL employs a pre-trained encoder, BERT in this paper. BERT's architecture is based on multi-head attention layers, trained with a two-component objective: (a) MLM and (b) Is-next-sentence prediction (NSP). For Step 2, PERL modifies only the MLM objective and it can hence be implemented within any CWE framework that employs this objective (Lan et al., 2020; Yang et al., 2019).
MLM is a modified language modeling objective, adjusted to self-attention models. When building the pre-training task, all input tokens have the same probability to be masked. After the masking process, the model has to predict a distribution over the vocabulary for each masked token given the non-masked tokens. The input text may have more than one masked token, and when predicting one masked token information from the other masked tokens is not utilized.
In the next section we describe our PERL domain adaptation model. The novel component of this model is a pivot-based MLM objective, optimized at the fine-tuning step (Step 2) of the CWE pipeline, using source and target unlabeled data.

Domain adaptation with PERL
PERL employs pivot features in order to learn a representation that bridges the gap between two domains. Contrary to previous pivot-based DA representation models, it exploits unlabeled data from the source and target domains, and also from massive out of source and target domain corpora.
PERL consists of three steps that correspond to the three steps of CWE models, as described in §2: (1) Pre-training (Figure 1a): in which it employs a pre-trained CWE model (encoder, BERT in this work) that was trained on massive corpora; (2) Fine-tuning (Figure 1b): where it refines some of the pre-trained encoder weights, based on a pivot-based objective that is optimized on unlabeled data from the source and target domains; and (3) Supervised task training (Figure 1c): where task specific layers are trained on source domain labeled data for the downstream task of interest. Our pivot selection method is identical to that of Blitzer et al. (2007) and Ziser and Reichart (2017, 2018). That is, the pivots are selected independently of the above three-step protocol.
We further present a variant of PERL, denoted with R-PERL, where the non-contextualized embedding matrix of the BERT model trained at Step (1) is employed in order to regularize PERL during its fine-tuning stage (Step 2). We elaborate on this model towards the end of this section. We next provide a detailed description.
Pivot Selection Being a pivot-based language representation model, PERL is based on high quality pivot extraction. Since the representation learning is based on a masked language modeling task, the feature set we address consists of the unigrams and bigrams of the vocabulary. We base the division of this feature set into pivots and non-pivots on unlabeled data from the source and target domains. Pivot features are: (a) Frequent in the unlabeled data from the source and target domains; and (b) Among those frequent features, pivot features are the ones whose mutual information with the task label according to source domain labeled data crosses a pre-defined threshold. Features that do not meet the above two criteria form the non-pivot feature subset.
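The two pivot criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: the function names, the choice to count raw n-gram occurrences, the use of binary feature presence for mutual information, and the `min_count` and `mi_threshold` parameters are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def extract_ngrams(tokens: Sequence[str]) -> List[str]:
    """Unigrams plus bigrams (bigrams joined by a space; an assumed encoding)."""
    return list(tokens) + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def count_ngrams(docs: Sequence[Sequence[str]]) -> Counter:
    counts: Counter = Counter()
    for doc in docs:
        counts.update(extract_ngrams(doc))
    return counts


def mutual_information(feature: str, docs, labels) -> float:
    """MI (in nats) between binary presence of `feature` and the task label."""
    n = len(docs)
    joint: Counter = Counter()
    for doc, y in zip(docs, labels):
        joint[(feature in set(extract_ngrams(doc)), y)] += 1
    px: Counter = Counter()
    py: Counter = Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    # sum over observed cells of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def select_pivots(src_unlabeled, tgt_unlabeled, src_labeled, labels,
                  min_count: int, mi_threshold: float) -> Set[str]:
    src_counts = count_ngrams(src_unlabeled)
    tgt_counts = count_ngrams(tgt_unlabeled)
    # Criterion (a): frequent in the unlabeled data of both domains.
    frequent = {f for f, c in src_counts.items()
                if c >= min_count and tgt_counts[f] >= min_count}
    # Criterion (b): informative about the source-domain task label.
    return {f for f in frequent
            if mutual_information(f, src_labeled, labels) >= mi_threshold}
```

In this sketch a feature that is frequent in both domains but evenly spread over the two labels (such as a function word) gets a mutual information of zero and is therefore rejected by criterion (b).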
PERL pre-training (Step 1, Figure 1a) In order to inject prior language knowledge into our model, we first initialize the PERL encoder with a powerful pre-trained CWE model. As noted above, our rationale is that the general language knowledge encoded in these models, which is not specific to the source or target domains, should be useful for DA just as it has proven useful for in-domain learning. In this work we employ BERT, although any other CWE model that employs the MLM objective for pre-training (Step 1) and fine-tuning (Step 2) could have been used.
PERL fine-tuning (Step 2, Figure 1b) This step is the core novelty of PERL. Our goal is to refine the initialized encoder on unlabeled data from the source and the target domains, using the distinction between pivot and non-pivot features.
For this aim we fine-tune the parameters of the pre-trained BERT using its MLM objective, but we choose the masked words so that the model learns to map non-pivot to pivot features. Recall that when building the MLM training task, each training example consists of an input text in which some of the words are masked, and the task of the model is to predict the identity of each of the masked words given the rest of the (non-masked) input text. While in standard MLM training all input tokens have the same probability to be masked, in the PERL fine-tuning step we change both the masking probability and the prediction task so that the desired non-pivot to pivot mapping is learned. We next describe these two changes; see also a detailed graphical illustration in Figure 2.
Figure 1: PRD and PLR stand for the BERT prediction head and pooler head respectively, FC is a fully connected layer, and msk stands for masked tokens embeddings (embeddings of tokens that were masked). NSP and MLM are the next sentence prediction and masked language model objectives. For the definitions of the PRD and PLR layers as well as the NSP objective, see Devlin et al. (2019). We mark frozen layers (layers whose parameters are kept fixed) and non-frozen layers with snow-flake and fire symbols, respectively. The token embedding and BERT layers values at the end of each step initialize the corresponding layers of the next step model. The BERT box of the fine-tuning step is described in more detail in Figure 2.
1. Prediction task. While in standard MLM the task is to predict a token out of the entire vocabulary, here we define a pivot-based prediction task. Particularly, the model should predict whether the masked token is a pivot feature or not, and if it is then it has to identify the pivot. That is, this is a multi-class classification task where the number of classes is equal to the number of pivots plus 1 (for the non-pivot prediction).
Put more formally, the modified pivot-based MLM objective is:

$$p(y_i = j) = \frac{\exp\left(W_j \cdot f(h_i)\right)}{\sum_{k \in P} \exp\left(W_k \cdot f(h_i)\right) + \exp\left(W_{\text{none}} \cdot f(h_i)\right)}$$

where $y_i$ is a masked unigram or bigram at position $i$, $P$ is the set of pivot features (token unigrams and bigrams), $h_i$ is the encoder representation for the $i$-th token, $W$ (the FC-Pivots layer of Figure 1b and Figure 2) is the pivot predictor matrix that maps from the latent space to the pivot set space ($W_a$ is the $a$-th row of $W$), and $f$ is a non-linear function composed of a dense layer, a gelu activation layer and LayerNorm (the PRD layer of Figure 1b and Figure 2).
2. Masking process. Instead of masking each input token (unigram) with the same probability, we perform the following masking process. For each input token (unigram) we first check whether it forms a bigram pivot together with the next token, and if so we mask this bigram with a probability of α. If the answer is negative, we check if the token at hand is a unigram pivot and if so we again mask it with a probability of α. Finally, if the token is not a pivot we mask it with a probability of β. Our hyper-parameter tuning process revealed that the values of α = 0.5 and β = 0.1 provide strong results across our various experimental setups (see more on this in §6). This way PERL gives a higher probability to pivot masking, and by doing so the encoder parameters are fine-tuned so that they can predict (mostly) pivot features based (mostly) on non-pivot input.
Figure 2: The PERL pivot-based fine-tuning task (Step 2). In this example two tokens are masked, general and good, only the latter is a pivot. The architecture is identical to that of BERT but the MLM task and the masking process are different, taking into account the pivot/non-pivot distinction.
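The masking process and the resulting prediction labels can be sketched as follows. This is a minimal sketch and not the released implementation. Several details are assumptions that the text does not specify: the function name `pivot_mask`, the encoding of bigram pivots as space-joined strings, storing the label of a masked bigram at its first position only, skipping past a bigram once it is masked, and leaving the current token unmasked when a bigram pivot fails its α draw.

```python
import random
from typing import List, Optional, Set, Tuple

MASK = "[MASK]"  # BERT's mask token
NONE = "NONE"    # the single non-pivot class


def pivot_mask(tokens: List[str], pivots: Set[str],
               alpha: float = 0.5, beta: float = 0.1,
               rng: Optional[random.Random] = None
               ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked tokens, per-position labels; None means 'not masked')."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        bigram = f"{tokens[i]} {tokens[i + 1]}" if i + 1 < len(tokens) else None
        if bigram in pivots:
            # Bigram pivot: mask both tokens with probability alpha.
            if rng.random() < alpha:
                masked[i] = masked[i + 1] = MASK
                labels[i] = bigram
                i += 2
                continue
        elif tokens[i] in pivots:
            # Unigram pivot: mask with probability alpha.
            if rng.random() < alpha:
                masked[i] = MASK
                labels[i] = tokens[i]
        elif rng.random() < beta:
            # Non-pivot: mask with probability beta, target is the NONE class.
            masked[i] = MASK
            labels[i] = NONE
        i += 1
    return masked, labels
```

With the defaults α = 0.5 and β = 0.1, roughly half of the pivot occurrences and a tenth of the non-pivot tokens are masked, so most prediction targets are pivots while most of the visible context consists of non-pivots.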
Designing the fine-tuning task this way yields two advantages. First, the model should shape its parameters so that most of the information about the input pivots is preserved, while most of the information preserved about the non-pivots is what is needed in order to predict the existence of the pivots. This way the model keeps mostly the information about unigrams and bigrams that are shared among the two domains and are significant for the supervised task, thus hopefully increasing its cross-domain generalization capacity.
Second, standard MLM, which has recently been used for fine-tuning in domain adaptation (Lee et al., 2020; Han and Eisenstein, 2019), performs a multi-class classification task with 30K tokens,⁴ which requires ∼23M parameters as in the FC1 layer of Figure 1a. By focusing PERL on pivot prediction, we can use only a factor of (|P|+1)/30K of the FC layer parameters, as we do in the FC-pivots layer (Figure 1b), where |P| is the number of pivots (in our experiments |P| ∈ [100, 500]).
⁴ The BERT implementation we use keeps a fixed 30K word vocabulary, derived from its pre-training process.
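The parameter saving can be checked with simple arithmetic. The sketch below assumes a hidden size of 768 (the BERT-base value, which is consistent with the ∼23M figure above), uses the rounded 30K vocabulary from the text, and counts only the weight matrix of the output layer, ignoring biases; the function name is hypothetical.

```python
HIDDEN_SIZE = 768     # assumption: BERT-base hidden dimension
VOCAB_SIZE = 30_000   # the rounded "30K" vocabulary mentioned in the text


def fc_weights(n_classes: int, hidden: int = HIDDEN_SIZE) -> int:
    """Number of weights in a hidden -> n_classes output layer (no bias)."""
    return n_classes * hidden


full_mlm = fc_weights(VOCAB_SIZE)          # standard MLM output layer
for n_pivots in (100, 500):
    pivot_head = fc_weights(n_pivots + 1)  # |P| pivots plus the non-pivot class
    print(n_pivots, pivot_head, f"{pivot_head / full_mlm:.4f}")
```

Under these assumptions the standard MLM layer has about 23.0M weights, while the FC-pivots layer has about 78K weights for |P| = 100 and about 385K for |P| = 500, i.e. between roughly 0.3% and 1.7% of the original layer.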
Supervised task training (Step 3, Figure 1c) To adjust PERL for a downstream task, we place a classification network on top of its encoder. While training on labeled data from the source domain and testing on the target domain, each input text is first represented by the encoder and is then fed to the classification network. Since our focus in this work is on the representation learning, the classification network is kept simple, consisting of one convolution layer followed by an average pooling layer and a linear layer. When training for the downstream task, the encoder weights are frozen.
R-PERL A potential limitation of PERL is that it ignores the semantics of its pivots. While the negative pivots sad and unhappy encode similar information with respect to the sentiment classification task, PERL considers them as two different output classes. To alleviate this, we propose the regularized PERL (R-PERL) model where pivot-similarity information is taken into account.
To achieve this we construct the FC-pivots matrix of R-PERL (Figures 1b and 2) based on the Token Embedding matrix learned by BERT in its pre-training stage (Figure 1a). Particularly, we fix the unigram pivot rows of the FC-pivots matrix to the corresponding rows in BERT's Token Embedding matrix, and the bigram pivot rows to the mean of the Token Embedding rows that correspond to the unigrams that form this bigram. The FC-pivots matrix of R-PERL is kept fixed during fine-tuning.
Our assumptions are that: (1) Pivots with similar meaning, such as sad and unhappy, have similar representations in the Token Embedding matrix learned at the pre-training stage (Step 1); and (2) There is a positive correlation between the appearance of such pivots (i.e. they tend to appear, or not appear, together; see Ziser and Reichart (2017) for similar considerations). In its fine-tuning step, R-PERL is hence biased to learn similar representations for such pivots in order to capture the positive correlation between them. This follows from the fact that pivot probability is computed by taking the dot product of its representation with its corresponding row in the FC-pivots matrix.
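The construction of the fixed FC-pivots matrix in R-PERL can be sketched as below. This is an illustrative sketch under stated assumptions, not the released code: the function name `build_fc_pivots` and the `vocab_index` lookup are hypothetical, each unigram is assumed to map to a single row of the Token Embedding matrix, and the row for the non-pivot class is omitted because the text does not specify how it is set.

```python
from typing import Dict, List, Sequence


def build_fc_pivots(pivots: Sequence[str],
                    token_embedding: Sequence[Sequence[float]],
                    vocab_index: Dict[str, int]) -> List[List[float]]:
    """One row per pivot: the token's embedding row for a unigram pivot,
    the mean of the two tokens' embedding rows for a bigram pivot."""
    rows = []
    for pivot in pivots:
        # A bigram pivot is assumed to be stored as "w1 w2".
        vectors = [token_embedding[vocab_index[w]] for w in pivot.split()]
        dim = len(vectors[0])
        rows.append([sum(v[d] for v in vectors) / len(vectors)
                     for d in range(dim)])
    return rows
```

Because the resulting rows are kept fixed during fine-tuning, two pivots with nearby pre-trained embeddings also receive nearby rows here, which is what pushes the encoder toward similar representations for them.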

Experiments
Tasks and Domains Following a large body of prior DA work, we focus on the task of binary sentiment classification. For compatibility with previous literature, we first experiment with the four legacy product review domains of Blitzer et al. (2007): Books (B), DVDs (D), Electronic items (E) and Kitchen appliances (K) with a total of 12 cross-domain setups. Each domain has 2000 labeled reviews, 1000 positive and 1000 negative, and unlabeled reviews as follows: B: 6,000, D: 34,741, E: 13,153 and K: 16,785.
We next experiment in a more challenging setup, considering an airline review dataset (A) (Nguyen, 2015; Ziser and Reichart, 2018). This setup is challenging both due to the differences between the product and service domains, and because the prior probability of observing a positive review in the A domain is much lower than the same probability in the product domains.⁵ For the A domain, following Ziser and Reichart (2018), we randomly sampled 1000 positive and 1000 negative reviews for our labeled set, and 39396 reviews for our unlabeled set. Due to the heavy computational demands of the experiments, we arbitrarily chose 3 product to airline and 3 airline to product setups.
We further consider an additional modern domain: IMDB (I) (Maas et al., 2011),⁶ which is commonly used in recent sentiment analysis work. This dataset consists of 50000 movie reviews from IMDB (25000 positive and 25000 negative), where there is a limitation on the number of reviews per movie. We randomly sampled 2000 labeled reviews, 1000 positive and 1000 negative, for our labeled set, and the remaining 48000 reviews form our unlabeled set.⁷ As above, we arbitrarily chose 2 IMDB to product and 2 product to IMDB setups for our experiments.
Pivot-based representation learning has proven instrumental for DA. We hypothesize that it can also be beneficial for in-domain tasks, as it focuses the representation on the information encoded in prominent unigrams and bigrams. To test this hypothesis we experiment in an in-domain setup, with the IMDB movie review dataset. We follow the same experimental setup as in the domain adaptation case, except that only IMDB unlabeled data is used for fine-tuning, and the frequency criterion in pivot selection is defined with respect to this dataset.
We randomly sampled 25000 training and 25000 test examples, keeping the two sets balanced, and an additional 50000 reviews formed an unlabeled balanced set.⁸ We consider 6 setups, differing in their training set size: 100, 500, 1K, 2K, 10K and 20K randomly sampled examples.
⁵ This analysis, performed by Ziser and Reichart (2018), is based on the gold labels of the unlabeled data.
⁶ The details of the IMDB dataset are available at: http://www.andrew-maas.net/data/sentiment .
⁷ We make sure that all reviews of the same movie appear either in the training set or in the test set.
Baselines We compare our PERL and R-PERL models to the following baselines: (a+b) PBLM-CNN and PBLM-LSTM (Ziser and Reichart, 2018), differing only in their classification layer (CNN vs. LSTM);⁹ (c) HATN (Li et al., 2018);¹⁰ (d) BERT; and (e) Fine-tuned BERT (following Lee et al. (2020) and Han and Eisenstein (2019)): This model is identical to PERL, except that the fine-tuning stage is performed with a standard MLM instead of our pivot-based MLM. BERT, Fine-tuned BERT, PBLM-CNN, PERL and R-PERL all use the same CNN-based sentiment classifier, while HATN jointly learns the feature representation and performs sentiment classification.

Cross-validation
We employ a five-fold cross-validation protocol, where in every fold 80% of the source domain examples are randomly selected for training data, and 20% for development data (both sets are kept balanced). For each model we report the average results across the five folds. In each fold we tune the hyper-parameters so as to minimize the cross-entropy loss on the development data.
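One way to realize such a balanced 80%/20% protocol is sketched below. The text does not say whether the five development sets are disjoint; this sketch assumes a standard stratified partition into five folds, and the function name and seed handling are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def balanced_folds(labels: Sequence[int], n_folds: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return n_folds (train_indices, dev_indices) pairs with per-label balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    dev_sets: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each label's examples round-robin so every fold stays balanced.
        for k, idx in enumerate(indices):
            dev_sets[k % n_folds].append(idx)
    everything = set(range(len(labels)))
    return [(sorted(everything - set(dev)), sorted(dev)) for dev in dev_sets]
```

For a source domain with 1000 positive and 1000 negative reviews, each fold then contains 1600 training and 400 development examples, with 200 development examples per label.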
Hyper-parameter Tuning For all models we use the WordPiece word embeddings (Wu et al., 2016) with a vocabulary size of 30k, and the same optimizer (with the same hyper-parameters) as in their original paper. For all pivot-based methods we consider the unigrams and bigrams that appear at least 20 times both in the unlabeled data of the source domain and in the unlabeled data of the target domain as candidates for pivots,¹¹ and from these we select the |P| candidates with the highest mutual information with the task source domain label (|P| ∈ {100, 200, ..., 500}). The exception is HATN, which automatically selects its pivots, which are limited to unigrams.
We next describe the hyper-parameters of each of the models. Due to our extensive experimentation (22 DA and 6 in-domain setups, 5-fold cross-validation), we limit our search space, especially for the heavier components of the models.
⁸ These reviews are also part of the IMDB dataset.
⁹ https://github.com/yftah89/PBLM-Domain-Adaptation
¹⁰ https://github.com/hsqmlzno1/HATN
¹¹ In the in-domain experiments we consider the IMDB unlabeled data.
R-PERL, PERL, BERT and Fine-tuned BERT For the encoder, we use the BERT-base uncased architecture with the same hyper-parameters as in Devlin et al. (2019), tuning for PERL, R-PERL and Fine-tuned BERT the number of fine-tuning epochs (out of: 20, 40, 60) and the number of unfrozen BERT layers during the fine-tuning process (1, 2, 3, 5, 8, 12). For PERL and R-PERL we tune the number of pivots (100, 200, 300, 400, 500) as well as α and β (0.1, 0.3, 0.5, 0.8). The supervised task classifier is a basic CNN architecture, which enables us to search over the number of filters (out of: 16, 32, 64), the filter size (7, 9, 11) and the training batch size (32, 64).
PBLM-LSTM and PBLM-CNN For PBLM we tune the input word embedding size (32,64,128,256), the number of pivots (100,200,300,400,500) and the hidden dimension (128,256,512). For the LSTM classification layer of PBLM-LSTM we consider the same hidden dimension and input word embedding size as for the PBLM encoder. For the CNN classification layer of PBLM-CNN, following Ziser and Reichart (2018) we use 250 filters and a kernel size of 3. In each setup we choose the PBLM model (PBLM-LSTM or PBLM-CNN) that yields better test set accuracy and report its result, under PBLM-Max.
HATN The hyper-parameters of Li et al. (2018) were tuned on a larger training set than ours, and they hence yield sub-optimal performance in our setup. We tune the training batch size (20, 50, 300), the hidden layer size (20, 100, 300) and the word embedding size (50, 100, 300).

Results
Overall results Table 1 presents domain adaptation results, and is divided into 2 sub-tables. The top table reports results on the 12 setups derived from the 4 legacy product review domains of Blitzer et al. (2007) (denoted with P ⇔ P). The bottom table reports results for 10 setups involving product review domains and the IMDB movie review domain (left side; denoted P ⇔ I) or the airline review domain (right side; denoted P ⇔ A). Table 2 presents in-domain results on the IMDB domain, for various training set sizes.

Domain Adaptation
As presented in Table 1, PERL models are superior in 20 out of 22 DA setups, with R-PERL performing best in 17 out of 22 setups. In the P ⇔ P setups, their averaged accuracies (top table, All column) are 87.5% and 86.9% (for R-PERL and PERL, respectively) compared to 82.3% of HATN and 80.7% of PBLM-Max. Importantly, in the more challenging setups, the performance of one of these baselines substantially degrades. Particularly, the averaged R-PERL and PERL accuracies in the P ⇔ I setups are 84.7% and 84.4%, respectively (bottom table, left All column), compared to 75.5% of HATN and 69.0% of PBLM-Max. In the P ⇔ A setups the averaged R-PERL and PERL accuracies are 84.2% and 82.9%, respectively (bottom table, right All column), compared to 80.5% of PBLM-Max and only 71.8% of HATN.
The performance of BERT and Fine-tuned BERT also degrades on the challenging setups: From an average of 80.2% (BERT) and 84.1% (Fine-tuned BERT) in P ⇔ P setups, to 74.2% and 78.9% respectively in P ⇔ I setups, and to 75.6% and 79.4% respectively in P ⇔ A setups. R-PERL and PERL, in contrast, remain stable across setups, with an averaged accuracy of 84.2-87.5% (R-PERL) and 82.9-86.8% (PERL).
The IMDB and airline domains differ from the product domains in their topic (movies (IMDB) and services (airline) vs. products). Moreover, the unlabeled data from the airline domain contains an increased fraction of negative reviews (see §4). Finally, the IMDB and airline reviews are also more recent. The success of PERL in the P ⇔ I and P ⇔ A setups is of particular importance, as it indicates the potential of our algorithm to adapt supervised NLP algorithms to domains that substantially differ from their training domain.
Finally, our results clearly indicate the positive impact of a pivot-aware approach when fine-tuning BERT with unlabeled source and target data. Indeed, the averaged gaps between Fine-tuned BERT and BERT (3.9% for P ⇔ P, 4.7% for P ⇔ I and 3.8% for P ⇔ A) are much smaller than the corresponding gaps between R-PERL and BERT (7.3% for P ⇔ P, 10.5% for P ⇔ I and 8.6% for P ⇔ A).
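The gaps quoted above follow directly from the averaged accuracies reported for the three setup families. As a quick sanity check, the short sketch below recomputes them; the dictionary layout and the function name `gap_over_bert` are illustrative choices, not part of the PERL code base.

```python
# Averaged accuracies (%) quoted in the text, keyed by setup family.
AVG_ACC = {
    "P<->P": {"BERT": 80.2, "Fine-tuned BERT": 84.1, "R-PERL": 87.5},
    "P<->I": {"BERT": 74.2, "Fine-tuned BERT": 78.9, "R-PERL": 84.7},
    "P<->A": {"BERT": 75.6, "Fine-tuned BERT": 79.4, "R-PERL": 84.2},
}


def gap_over_bert(model: str) -> dict:
    """Return the averaged accuracy gap (in points) between `model` and BERT."""
    return {
        family: round(scores[model] - scores["BERT"], 1)
        for family, scores in AVG_ACC.items()
    }


if __name__ == "__main__":
    for model in ("Fine-tuned BERT", "R-PERL"):
        print(model, gap_over_bert(model))
```

In every setup family the R-PERL gap over BERT is roughly twice the Fine-tuned BERT gap.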

In-domain Results
In this setup both the labeled and the unlabeled data, used for supervised task training (labeled data, Step 3), fine-tuning (unlabeled data, Step 2), and pivot selection (both datasets) come from the same domain (IMDB). As shown in Table 2, PERL outperforms BERT and Fine-tuned BERT for all training set sizes.
Unsurprisingly, the impact of (R-)PERL diminishes as more labeled training data become available: From 7.5% (R-PERL vs. Fine-tuned BERT) when 100 sentences are available, to 2.1% for 20K training sentences. To our knowledge, the effectiveness of pivot-based methods for in-domain learning has not been demonstrated in the past.

Ablation Analysis and Discussion
In order to shed more light on PERL, we conduct an ablation analysis. We start by uncovering the hyper-parameters that have strong impact on its performance, and analysing its stability across hyper-parameter configurations. We then explore the impact of some of the design choices we made when constructing the model.
In order to keep our analysis concise and to avoid heavy computations, we consider only a handful of arbitrarily chosen DA setups for each analysis. We follow the five-fold cross-validation protocol of §4 for hyper-parameter tuning, except that in some of the analyses a hyper-parameter of interest is kept fixed.

Hyper-parameters Analysis
In this analysis we focus on one hyper-parameter that is relevant only for methods that employ massively pre-trained encoders (the number of unfrozen encoder layers during fine-tuning), as well as on two hyper-parameters that impact the core of our modified MLM objective (number of pivots and the pivot and non-pivot masking probabilities). We finally perform stability analysis across hyper-parameter configurations.

Number of Unfrozen BERT Layers during Fine-Tuning (stage 2, Figure 1b) In Figure 3 we compare PERL final sentiment classification accuracy with six alternatives: 1, 2, 3, 5, 8 or 12 unfrozen layers, going from the top to the bottom layers. We consider 4 arbitrarily chosen DA setups, where the number of unfrozen layers is kept fixed during the five-fold cross-validation process. The general trend is clear: PERL performance improves as more layers are unfrozen, and this improvement saturates at 8 unfrozen layers (for the K→A setup the saturation is at 5 layers). The classification accuracy improvement (compared to 1 unfrozen layer) is 4% or more in three of the setups (K→A is again the exception with only ∼2% improvement). Across the experiments of this paper, this has been the single most influential hyper-parameter of the PERL, R-PERL and Fine-tuned BERT models.
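To make the "top to bottom" unfreezing order concrete, the following minimal sketch lists which layers of a 12-layer encoder are updated for a given number of unfrozen layers. It is an illustration only: the parameter-name pattern `encoder.layer.<i>.` is an assumed naming scheme, and treating all other parameters (such as the embeddings) as frozen is an assumption of this sketch rather than a documented PERL choice.

```python
def unfrozen_layers(num_layers: int, num_unfrozen: int) -> list:
    """Indices of the top `num_unfrozen` encoder layers (0 is the bottom layer)."""
    if not 0 <= num_unfrozen <= num_layers:
        raise ValueError("num_unfrozen must be between 0 and num_layers")
    return list(range(num_layers - num_unfrozen, num_layers))


def is_trainable(param_name: str, trainable_layers: list) -> bool:
    """Decide whether a parameter would be updated during fine-tuning.

    Assumes hypothetical parameter names of the form 'encoder.layer.<i>.<...>'.
    Any other parameter (e.g. embeddings) is treated as frozen in this sketch.
    """
    parts = param_name.split(".")
    if len(parts) > 2 and parts[:2] == ["encoder", "layer"] and parts[2].isdigit():
        return int(parts[2]) in trainable_layers
    return False


if __name__ == "__main__":
    # The six alternatives compared in the analysis above.
    for k in (1, 2, 3, 5, 8, 12):
        print(k, unfrozen_layers(12, k))
```

With 8 unfrozen layers, for example, layers 4 to 11 are updated while layers 0 to 3 keep their pre-trained weights.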
Number of Pivots Following previous work (e.g. Ziser and Reichart, 2018), our hyper-parameter tuning process considers 100 to 500 pivots in steps of 100. We would next like to explore the impact of this hyper-parameter on PERL performance. Figure 4 presents our results, for four arbitrarily selected setups. In 3 of 4 setups PERL performance is stable across pivot numbers. In 2 setups, 100 is the optimal number of pivots (for the A → B setup with a large gap), and in the 2 other setups it lags behind the best value by no more than 0.2%. These two characteristics, model stability across pivot numbers and somewhat better performance when using fewer pivots, were observed across our experiments with PERL and R-PERL.

Pivot and Non-Pivot Masking Probabilities
We next study the impact of the pivot and non-pivot masking probabilities, used during PERL fine-tuning (α and β, respectively, see §3). For both α and β we consider the values of 0.1, 0.3, 0.5 and 0.8. Figure 5 presents heat maps that summarize our results. A first observation is the relative stability of PERL to the values of these hyper-parameters: The gaps between the best and worst performing configurations are 2.6% (E → D), 1.2% (B → E), 3.1% (K → D) and 5.0% (A → B). A second observation is that extreme α values (0.1 and 0.8) tend to harm the model. Finally, in 3 of 4 cases the best model performance is achieved with α = 0.5 and β = 0.1.

Figure 5: Heat maps of PERL performance with different pivot (α) and non-pivot (β) masking probabilities. A darker color corresponds to a higher sentiment classification accuracy.
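The role of α and β can be illustrated with a small sketch of the position-selection step alone: each pivot token is masked with probability α and each non-pivot token with probability β. This is a simplified illustration under stated assumptions. It handles unigram pivots only, the function name `choose_masked_positions` is hypothetical, and it does not cover what the model is trained to predict at the masked positions (see §3).

```python
import random


def choose_masked_positions(tokens, pivots, alpha, beta, rng=None):
    """Return the positions to mask in `tokens`.

    Pivot tokens are masked with probability `alpha`, all other tokens with
    probability `beta`. `pivots` is a set of unigram pivots (an assumption of
    this sketch). `rng` is an optional random.Random for reproducibility.
    """
    rng = rng or random.Random()
    masked = []
    for position, token in enumerate(tokens):
        probability = alpha if token in pivots else beta
        if rng.random() < probability:
            masked.append(position)
    return masked


if __name__ == "__main__":
    sentence = "the plot was great but the acting was awful".split()
    pivots = {"great", "awful"}
    # The configuration that performed best in 3 of 4 cases above.
    print(choose_masked_positions(sentence, pivots, alpha=0.5, beta=0.1,
                                  rng=random.Random(0)))
```

Setting α well above β concentrates the masked positions on pivots, which is the regime that performed best in the analysis above.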

Stability Analysis
We finally turn to analyse the stability of the PERL models compared to the baselines. Previous work on PBLM and HATN has demonstrated their instability across model configurations (see Ziser and Reichart (2019) for PBLM and Cui et al. (2019) for HATN). As noted in Ziser and Reichart (2019), cross-configuration stability is of particular importance in unsupervised domain adaptation as the hyper-parameter configuration is selected using unlabeled data from the source, rather than the target domain.
In this analysis a hyper-parameter value is not considered for a model if it is not included in the best hyper-parameter configuration of that model for at least one DA setup. Hence, for PERL we fix the number of unfrozen layers (8), the number of pivots (100), and set (α, β) = (0.5, 0.1), and for PBLM we consider only word embedding sizes of 128 and 256. Other than that, we consider all possible hyper-parameter configurations of all models (§4; 54 configurations for PERL, R-PERL and Fine-tuned BERT, 18 for BERT, 30 for PBLM and 27 for HATN). Table 3 presents the minimum (min), maximum (max), average (avg) and standard deviation (std) of the test set scores across the hyper-parameter configurations of each model, for 4 arbitrarily selected setups.
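The PBLM and HATN configuration counts can be reproduced from the tuning grids listed in the experimental setup, and the four reported statistics are straightforward to compute per model. The sketch below does both. The dictionary keys are illustrative names, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from itertools import product
from statistics import mean, pstdev

# Grids taken from the experimental setup; for the stability analysis the
# PBLM word embedding size is restricted to 128 and 256.
PBLM_GRID = {
    "embedding_size": [128, 256],
    "num_pivots": [100, 200, 300, 400, 500],
    "hidden_dim": [128, 256, 512],
}
HATN_GRID = {
    "batch_size": [20, 50, 300],
    "hidden_size": [20, 100, 300],
    "embedding_size": [50, 100, 300],
}


def configurations(grid):
    """Enumerate every hyper-parameter configuration in a grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def summarize(scores):
    """min / max / avg / std of test scores across configurations.

    Uses the population standard deviation (an assumption of this sketch).
    """
    return {
        "min": min(scores),
        "max": max(scores),
        "avg": mean(scores),
        "std": pstdev(scores),
    }


if __name__ == "__main__":
    print(len(configurations(PBLM_GRID)), len(configurations(HATN_GRID)))
```

The two grids yield 30 and 27 configurations, matching the counts stated above for PBLM and HATN.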
In all 4 setups, PERL and R-PERL consistently achieve higher avg, max and min values and lower std values compared to the other models (with the exception of PBLM achieving higher max for K → A). Moreover, the std values of PBLM and especially HATN are substantially higher than those of the models that employ BERT. Yet, PERL and R-PERL demonstrate lower std values compared to BERT and Fine-tuned BERT in 3 of 4 setups, indicating that our method contributes to stability beyond the documented contribution of BERT itself (Hao et al., 2019).

Design Choice Analysis
Impact of Pivot Selection One design choice that impacts our results is the method through which pivots are selected. We next compare three alternatives to our pivot selection method, keeping all other aspects of PERL fixed. As above, we arbitrarily select four setups.
We consider the following pivot selection methods: (a) Random-Frequent: Pivots are randomly selected from the unigrams and bigrams that appear at least 80 times in the unlabeled data of each of the domains; (b) High-MI, No Target: We select the pivots that have the highest mutual information (MI) with the source domain label, but appear fewer than 10 times in the target domain unlabeled data; (c) Oracle (Miller, 2019): Here the pivots are selected according to our method, but the labeled data used for pivot-label MI computation is the target domain test data rather than the source domain training data. This is an upper bound on the performance of our method since it uses target domain labeled data, which is not available to us. For all methods we select 100 pivots (see above). Table 5 presents the results of the four PERL variants, and compares them to BERT and Fine-tuned BERT. We observe four patterns in the results. First, PERL with our pivot selection method, that emphasizes both high MI with the task label [...] the strong positive impact of our pivot selection method on the performance of PERL.
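For readers unfamiliar with MI-based pivot selection, the following minimal sketch ranks candidate features by their mutual information with the source-domain label, after filtering out features that are rare in the unlabeled data of either domain. It rests on several assumptions: the frequency filter reflects what alternatives (a) and (b) suggest about the criterion, the threshold `min_count` is a placeholder rather than a value from this paper, only unigrams are considered, and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """MI (in nats) between a binary feature-presence list and a label list."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    p_feature = Counter(presence)
    p_label = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((p_feature[x] / n) * (p_label[y] / n)))
    return mi


def select_pivots(labeled_docs, labels, source_unlabeled, target_unlabeled,
                  num_pivots=100, min_count=20):
    """Pick the `num_pivots` frequent features with the highest label MI.

    Each document is a list of tokens. `min_count` is a placeholder threshold
    applied to the unlabeled data of both domains.
    """
    src_counts = Counter(tok for doc in source_unlabeled for tok in doc)
    tgt_counts = Counter(tok for doc in target_unlabeled for tok in doc)
    candidates = [f for f in src_counts
                  if src_counts[f] >= min_count and tgt_counts[f] >= min_count]
    doc_sets = [set(doc) for doc in labeled_docs]
    scored = [(mutual_information([f in d for d in doc_sets], labels), f)
              for f in candidates]
    # Highest MI first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [f for _, f in scored[:num_pivots]]
```

Under these assumptions, the High-MI, No Target alternative would correspond to replacing the target-side frequency filter with an upper bound on the target count, and the Oracle alternative to passing target domain labeled data as `labeled_docs` and `labels`.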
Unlabeled Data Selection Another design choice we consider is the impact of the type of fine-tuning data. While we followed previous work (e.g. Ziser and Reichart, 2018) and used the unlabeled data from both the source and target domains, it might be that data from only one of the domains, particularly the target, is a better choice. As above, we explore this question on 4 arbitrarily selected domain pairs. The results, presented in Table 6, clearly indicate that our choice to use unlabeled data from both domains is optimal, particularly when transferring from a non-product domain (A or I) to a product domain.
Reduced Size Encoder We finally explore the effect of the fine-tuning step on the performance of reduced-size models. By doing this we address a major limitation of pre-trained encoders -their size, which prevents them from running on small computational devices and dictates long run times.
For this experiment we prune the top encoder layers before the fine-tuning step, yielding three new model sizes, with 5, 8, or 10 layers, compared to the full 12 layers. This is done both for Fine-tuned BERT and for PERL. We then tune the number of top unfrozen encoder layers during fine-tuning, as follows: 5-layer encoder (1, 2, 3); 8-layer encoder (1, 3, 4, 5); 10-layer encoder (1, 3, 5, 8); and full encoder (1, 2, 3, 5, 8, 12). For comparison, we employ the BERT model when its top layers are pruned, and no fine-tuning is performed. We focus on two arbitrarily selected DA setups. Table 4 presents accuracy results. In both setups PERL with 10 layers is the best performing model. Moreover, for each number of layers, PERL outperforms the other two models, with particularly substantial improvements for 5 and 8 layers (e.g. 7.3% and 6.7%, over BERT and Fine-tuned BERT, respectively, for B → E and 8 layers).
Reduced-size PERL is of course much faster than the full model. The averaged run-time of the full (12 layers) PERL on our test sets is 196.5 msec and 9.9 msec on CPU (Skylake i9-7920X, 2.9 GHz, single thread) and GPU (GeForce GTX 1080 Ti), respectively. For 8 layers the numbers drop to 132.4 msec (CPU) and 6.9 msec (GPU) and for 5 layers to 84.0 msec (CPU) and 4.7 msec (GPU).
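The run-times above translate into relative speedups over the full model, which the short sketch below computes. Only the quoted measurements are used; the function name `speedup` is an illustrative choice.

```python
# Averaged run-times (msec) quoted above, keyed by number of encoder layers.
RUNTIME_MS = {
    "CPU": {12: 196.5, 8: 132.4, 5: 84.0},
    "GPU": {12: 9.9, 8: 6.9, 5: 4.7},
}


def speedup(device: str, layers: int) -> float:
    """Speedup of a reduced-size encoder relative to the full 12-layer model."""
    return RUNTIME_MS[device][12] / RUNTIME_MS[device][layers]


if __name__ == "__main__":
    for device in RUNTIME_MS:
        for layers in (8, 5):
            print(f"{device}, {layers} layers: {speedup(device, layers):.2f}x")
```

On CPU this gives roughly a 1.5x speedup for 8 layers and a 2.3x speedup for 5 layers; the GPU speedups are slightly smaller (about 1.4x and 2.1x).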

Conclusions
We presented PERL, a domain-adaptation model which fine-tunes a massively pre-trained deep contextualized embedding encoder (BERT) with a pivot-based MLM objective. PERL outperforms strong baselines across 22 sentiment classification DA setups, improves in-domain model performance, increases its cross-configuration stability and yields effective reduced-size models.
Our focus in this paper is on binary sentiment classification, as was done in a large body of previous DA work. In future work we would like to extend PERL's reach to structured (e.g. dependency parsing and aspect-based sentiment classification) and generation (e.g. abstractive summarization and machine translation) NLP tasks.