Model Compression for Domain Adaptation through Causal Effect Estimation

Recent improvements in the predictive quality of natural language processing systems are often dependent on a substantial increase in the number of model parameters. This has led to various attempts to compress such models, but existing methods have not considered the differences in the predictive power of various model components or in the generalizability of the compressed models. To understand the connection between model compression and out-of-distribution generalization, we define the task of compressing language representation models such that they perform best in a domain adaptation setting. We choose to address this problem from a causal perspective, attempting to estimate the average treatment effect (ATE) of a model component, such as a single layer, on the model's predictions. Our proposed ATE-guided Model Compression scheme (AMoC) generates many model candidates, differing by the model components that were removed. Then, we select the best candidate through a stepwise regression model that utilizes the ATE to predict the expected performance on the target domain. AMoC outperforms strong baselines on dozens of domain pairs across three text classification and sequence tagging tasks.


Introduction
The rise of deep neural networks (DNNs) has transformed the way we represent language, allowing models to learn useful features directly from raw inputs. However, recent improvements in the predictive quality of language representations are often related to a substantial increase in the number of model parameters. Indeed, the introduction of the Transformer architecture (Vaswani et al., 2017) and attention-based models (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) has improved performance on most natural language processing (NLP) tasks, while facilitating a large increase in model sizes.
Since large models require a significant amount of computation and memory during training and inference, there is a growing demand for compressing such models while retaining the most relevant information. While recent attempts have shown promising results (Sanh et al., 2019), they have some limitations. Specifically, they attempt to mimic the behavior of the larger models without trying to understand the information preserved or lost in the compression process.
In compressing the information represented in billions of parameters, we identify three main challenges. First, current methods for model compression are not interpretable. While the importance of different model parameters is certainly not uniform, it is hard to know a priori which of the model components should be discarded in the compression process. This notion of feature importance has not yet trickled down into compression methods, and they often attempt to solve a dimensionality reduction problem where a smaller model aims to mimic the predictions of the larger model. Nonetheless, not all parameters are born equal, and only a subset of the information captured in the network is actually useful for generalization (Frankle and Carbin, 2018).
The second challenge we observe in model compression is out-of-distribution generalization. Typically, compressed models are tested for their in-domain generalization. However, in reality the distribution of examples often varies and is different from that seen during training. Without testing for the generalization of the compressed models on different test-set distributions, it is hard to fully assess what was lost in the compression process. The setting explored in domain adaptation provides us with a platform to test the ability of the compressed models to generalize across domains, where some information that the model has learned to rely on might not exist. Strong model performance across domains provides a stronger signal on retaining valuable information.
Lastly, another challenge we identify in training and selecting compressed models is confidence estimation. In trying to understand what gives large models the advantage over their smaller competitors, recent probing efforts have discovered that commonly used models such as BERT (Devlin et al., 2019) learn to capture semantic and syntactic information in different layers and neurons across the network (Rogers et al., 2021). While some features might be crucial for the model, others could learn spurious correlations that are only present in the training set and are absent in the test set (Kaushik et al., 2019). Such cases have led to some intuitive common practices such as keeping only layers with the same parity or the top or bottom layers (Fan et al., 2019). Those practices can be good on average, but do not provide model confidence scores or success rate estimates on unseen data.
Our approach addresses each of the three main challenges we identify, as it allows estimating the marginal effect of each model component, is designed and tested for out-of-distribution generalization, and provides estimates for each compressed model performance on an unlabeled target domain. We dive here into the connection between model compression and out-of-distribution generalization, and ask whether compression schemes should consider the effect of individual model components on the resulting compressed model. Particularly, we present a method that attempts to compress a model while maintaining components that can generalize well across domains.
Inspired by causal inference (Pearl, 1995), our compression scheme is based on estimating the average effect of model components on the decisions the model makes, at both the source and target domains. In causal inference, we measure the effect of interventions by comparing the difference in outcome between the control and treatment groups. In our setting, we take advantage of the fact that we have access to unlabeled target examples, and treat the model's predictions as our outcome variable. We then try to estimate the effect of a subset of the model components, such as one or more layers, on the model's output.
To do that, we propose an approximation of a counterfactual model where a model component of choice is removed. We train an instance of the model without that component and keep everything else equal apart from the input and output to that component, which allows us to perform only a small number of gradient steps. Using this approximation, we then estimate the average treatment effect (ATE) by comparing the predictions of the base model to those of its counterfactual instance.
Since our compressed models are very efficiently trained, we can generate a large number of such models per each source-target domain pair. We then train a regression model on our training domain pairs in order to predict how well a compressed model would generalize from a source to a target domain, using the ATE as well as other variables. This regression model can then be applied to new source-target domain pairs in order to select the compressed model that best supports cross-domain generalization.
To organize our contributions, we formulate three research questions:
1. Can we produce a compressed model that outperforms all baselines in out-of-distribution generalization?
2. Does the model component we decide to remove indeed hurt performance the least?
3. Can we use the average treatment effect to guide our model selection process?
In § 6 we directly address each of the three research questions, and demonstrate the usefulness of our method, ATE-guided model compression (AMoC), to improve model generalization.

Previous Work
Previous work on the intersection of neural model compression, domain adaptation and causal inference is limited, as our application of causal inference to model compression and our discussion of the connection between compression and cross-domain generalization are novel. However, there is an abundance of work in each field on its own, and on the connection between domain adaptation and causal inference. Since our goal is to explore the connection between compression and out-of-distribution generalization, as framed in the setting of domain adaptation, we survey the literature on model compression and the connection between generalization, causality and domain adaptation.

Model Compression
NLP models have increased exponentially in size, growing from fewer than a million parameters a few years ago to hundreds of billions. Since the introduction of the Transformer architecture, this trend has been strengthened, with some models reaching more than 175 billion parameters (Brown et al., 2020). As a result, there has been a growing interest in compressing the information captured in Transformers into smaller models (Ganesh et al., 2020; Sun et al., 2020).
Usually, such smaller models are trained using the base model as a teacher, with the smaller student model learning to predict its output probabilities (Hinton et al., 2015;Jiao et al., 2020;Sanh et al., 2019). However, even if the student closely matches the teacher's soft labels, their internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student (Aguilar et al., 2020;Mirzadeh et al., 2020).
As an alternative, we try not to interfere or alter the learned representation of the model. Compression schemes such as those presented in Sanh et al. (2019) discard model components randomly. Instead, we choose to focus on understanding which components of the model capture the information that is most useful for it to perform well across domains, and hence should not be discarded.

Domain Adaptation and Causality
Domain adaptation is a longstanding challenge in machine learning (ML) and NLP, which deals with cases where the train and test sets are drawn from different distributions. A great effort has been dedicated to exploiting labels from both source and target domains for that purpose (Daumé III et al., 2010; Sato et al., 2017; Cui et al., 2018; Lin and Lu, 2018). However, a much more challenging and realistic scenario, termed unsupervised domain adaptation, occurs when no labeled target samples exist (Blitzer et al., 2006; Ganin et al., 2016; Ziser and Reichart, 2017, 2018a,b, 2019; Rotman and Reichart, 2019; Ben-David et al., 2020). In this setting, we have access to labeled and unlabeled data from the source domain and to unlabeled data from the target, and models are tested by their performance on unseen examples from the target domain.
A closely related task is domain adaptation success prediction. This task explores the possibility of predicting the expected performance degradation between source and target domains (McClosky et al., 2010; Elsahar and Gallé, 2019). Similar to predicting performance in a given NLP task, methods for predicting domain adaptation success often rely on in-domain performance and distance metrics estimating the difference between the source and target distributions (Reichart and Rappoport, 2007; Ravi et al., 2008; Louis and Nenkova, 2009; Van Asch and Daelemans, 2010; Xia et al., 2020). While these efforts have demonstrated the importance of out-of-domain performance prediction, they have not, as far as we know, been made in relation to model compression.
As the fundamental purpose of domain adaptation algorithms is improving the out-of-distribution generalization of learning models, it is often linked with causal inference (Johansson et al., 2016). In causal inference we typically care about estimating the effect that an intervention on a variable of interest would have on an outcome. Recently, using causal methods to improve the out-of-distribution performance of trained classifiers has been gaining traction (Rojas-Carulla et al., 2018; Wald et al., 2021).
Indeed, recent papers applied a causal approach to domain adaptation. Some researchers proposed using causal graphs to predict under distribution shifts (Schölkopf et al., 2012) and to understand the type of shift (Zhang et al., 2013). Adapting these ideas to computer vision, Gong et al. (2016) were among the first to propose a causal graph describing the generative process of an image as being generated by a "domain". The causal graph served for learning invariant components that transfer across domains. Since then, the notion of invariant prediction has emerged as an important operational concept in causal inference (Peters et al., 2017). This idea has been used to learn classifiers that are robust to domain shifts and can perform well on unseen target distributions (Gong et al., 2016; Magliacane et al., 2018; Rojas-Carulla et al., 2018; Greenfeld and Shalit, 2020).
Here we borrow ideas from causality to help us reason on the importance of specific model components, such as individual layers. That is, we estimate the effect of a given model component (denoted as the treatment) on the model's predictions in the unlabeled target domain, and use the estimated effect as an evaluation of the importance of this component. Our treatment effect estimation method is inspired by previous causal model explanation work, although our algorithm is very different.

Causal Terminology
Causal methodology is most commonly used in cases where the goal is estimating effects on real-world outcomes, but it can be adapted to help us understand and explain what affects NLP models. Specifically, we can think of intervening on a model and altering its components as a causal question, and measure the effect of this intervention on model predictions. A core benefit of this approach is that we can estimate treatment effects on the model's predictions without the need for manually-labeled target data.
Borrowing causal methodology into our setting, we treat model components as our treatment, and try to estimate the effect of removing them on our model's predictions. The predictions of a model are driven by its components, and by changing one component and holding everything else equal, we can estimate the effect of this intervention. We can use this estimation in deciding which model component should be kept in the compression process.
As the link between model compression and causal inference was not explored previously, we provide here a short introduction to causal inference and its basic terminology, focusing on its application to our use case. We then discuss the connection to Pearl's do-operator and the estimation of treatment effects.
Imagine we have a model $m$ that classifies examples to one of $L$ classes. Given a set $C$ of $K$ model components, which we hypothesize might affect the model's decision, we denote the set of binary variables $I_c = \{I_{c_j} \in \{0, 1\} \mid j \in \{1, \dots, K\}\}$, where each corresponds to the inclusion of the component in the model, i.e., if $I_{c_j} = 1$ then the $j$-th component ($c_j$) is in the model. Our goal is to assess how the model's predictions are affected by the components in $C$. As we are interested in the effect on the class probability assigned by $m$, we measure this probability for an example $x$, and denote it for a class $l$ as $z(m(x))_l$ and for all $L$ classes as $z(m(x))$.
Using this setup, we can now define the ATE, the common metric used when estimating causal effects. ATE is the difference in mean outcomes between the treatment and control groups, and using do-calculus (Pearl, 1995) we can define it as follows:
Definition 1 (Average Treatment Effect (ATE)). The average treatment effect of a binary treatment $I_{c_j}$ on the outcome $z(m(x))$ is:
$$ATE(c_j) = \mathbb{E}\left[z(m(x)) \mid do(I_{c_j} = 1)\right] - \mathbb{E}\left[z(m(x)) \mid do(I_{c_j} = 0)\right] \qquad (1)$$
where the do-operator is a mathematical operator introduced by Pearl (1995), which indicates that we intervene on $c_j$ such that it is included ($do(I_{c_j} = 1)$) or not ($do(I_{c_j} = 0)$) in the model.
While the setup usually explored with do-calculus involves a fixed joint distribution where treatments are assigned to individuals (or examples), we borrow intuition from a specialized case where interventions are made on the process which generates outcomes given examples. This type of intervention is called Process Control, and was further explored by Bottou et al. (2013). This unique setup is designed to improve our understanding of the behavior of complex learning systems and predict the consequences of changes made to the system. Recently, it was used to intervene on language representation models, generating a counterfactual representation model through an adversarial training algorithm which biases the representation model to forget information about treatment concepts and maintain information about control concepts.
In our approach we intervene on the j-th component, by holding the rest of the model fixed and training only the parameters that control the input and output to that component. This is crucial for our estimation procedure as we want to know the effect of the j-th component on a specific model instance. This effect can be computed by comparing the predictions of the original model instance to those of the intervened model (see below). This computation is fundamentally different from measuring the conditional probability where the j-th component is not in the model by estimat-
[...] distribution examples, detailing the domain adaptation framework we focus on. Then, we describe our compression scheme, designed to allow us to approximate the ATE and responsible for producing compressed model candidates. Finally, we propose a regression model that uses the ATE and other features to predict a candidate model's performance on a target domain. This regression allows us to select a strong candidate model.

Task Definition and Framework
To test the ability of a compressed model to generalize on out-of-distribution examples, we choose to focus on a domain adaptation setting. An appealing property of domain adaptation setups is that they allow us to measure out-of-distribution performance in a very natural way by training on one domain and testing on another.
In our setup, during training, we have access to $n$ source-target domain pairs $(S_i, T_i)_{i=1}^n$. For each pair we assume access to labeled data from the source domains $(L_{S_i})_{i=1}^n$ and unlabeled data from the source and target domains $(U_{S_i}, U_{T_i})_{i=1}^n$. We also assume access to held-out labeled data for all domains, $(H_{S_i}, H_{T_i})_{i=1}^n$, for measuring test performance. At test time we are given an unseen domain pair $(S_{n+1}, T_{n+1})$ with labeled source data $L_{S_{n+1}}$ and unlabeled data from both domains, $U_{S_{n+1}}$ and $U_{T_{n+1}}$, respectively. Our goal is to classify examples on the unseen target domain $T_{n+1}$ using a compressed model $m^{n+1}$ trained on the new source domain.
For each domain pair in $(S_i, T_i)_{i=1}^n$, we generate a set of $K$ candidate models $M^i = \{m_1^i, \dots, m_K^i\}$, differing by the model components that were removed from the base model $m_B^i$. For each candidate, we compute the ATE and other relevant features which we discuss in § 4.3. Then, using the training domain pairs, for which we have access to a limited amount of labeled target data, we train a stepwise linear regression to predict the performance of all candidate models in $\{M^i\}_{i=1}^n$ on their target domain. Finally, at test time, after computing the regression features on the unseen source-target pair, we use the trained regression model to select the compressed model $(m^{n+1})^* \in M^{n+1}$ that is expected to perform best on the unseen unlabeled target domain.
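The test-time selection step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation used in the paper: the function name `select_candidate`, the feature names, the candidate names and all numeric values are hypothetical, and the regression is assumed to have been fit already on the training domain pairs.

```python
from typing import Dict


def select_candidate(
    candidates: Dict[str, Dict[str, float]],
    intercept: float,
    coefficients: Dict[str, float],
) -> str:
    """Return the name of the candidate with the highest predicted target score."""

    def predict(features: Dict[str, float]) -> float:
        # Linear prediction: intercept plus the weighted sum of regression features.
        return intercept + sum(w * features[name] for name, w in coefficients.items())

    return max(candidates, key=lambda name: predict(candidates[name]))


# Hypothetical coefficients and candidate features, for illustration only.
coefficients = {"ate_target": -0.5, "source_f1": 0.8}
candidates = {
    "remove_layers_2_3": {"ate_target": 0.10, "source_f1": 0.90},
    "remove_layers_7_8": {"ate_target": 0.30, "source_f1": 0.91},
}
best = select_candidate(candidates, intercept=0.05, coefficients=coefficients)
```

In this toy setting the second candidate has a slightly higher source score but a larger target ATE, so the first candidate receives the higher predicted target performance.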
While this task definition relies on a limited number of labeled examples from some target domains at training time, at test time we only use labeled examples from the source domain and unlabeled examples from the target. We elaborate on our compression scheme, responsible for generating the compressed model candidates in § 4.2. We then describe the regression features and the regression model in § 4.3 and § 4.4, respectively.

Compression Scheme
Our compression scheme (AMoC) operates on a large classifier, consisting of an encoder-decoder architecture, that serves as the base model being compressed. In such models, the encoder is the language representation model (e.g., BERT), and the decoder is the task classifier. Each input sentence $x$ to the base model $m_B^i$ is encoded by the encoder $e$. Then, the encoded sentence $e(x)$ is passed through the decoder $d$ to compute a distribution over the label space $L$: $z(m_B^i(x)) = \mathrm{Softmax}(d(e(x)))$. AMoC is designed to remove a set of encoder components, and can in principle be used with any language encoder.
As described in Algorithm 1, AMoC generates candidate compressed versions of $m_B^i$. In each iteration it selects from $C$, the set containing subsets of encoder components, a candidate $c_k \in C$ to be removed. The goal of this process is to generate many compressed model candidates, such that the $k$-th candidate $m_k^i$ differs from the base model $m_B^i$ only by the effect of the parameters in $c_k$ on the model's predictions. After generating these candidates, AMoC tries to choose the best performing model for the unseen target domain.
When generating the $k$-th compressed model of the $i$-th source-target pair, we start by removing all parameters in $c_k$ from the computational graph of $m_B^i$. Then, we connect the predecessor of each detached component from $c_k$ to its successor in the graph, which yields the new $m_k^i$ (see Figure 1). To estimate the effect of $c_k$ on the predictions of $m_B^i$, we freeze all remaining model parameters in $m_k^i$ and fine-tune it for one or more epochs, training only the decoder and the parameters of the new connections between the predecessors and successors of the removed components. An advantage of this procedure is that we can efficiently generate many model candidates.
Figure 1 demonstrates this process on a simple architecture when considering the removal of layer components. The second encoder layer is removed from the base model, and the first layer is connected to the final encoder layer. The compressed model is then fine-tuned for one or more epochs, where only the parameters of the first layer and the decoder are updated (Alg. 1, step 1(b)). We mark frozen layers and non-frozen layers with snowflakes and fire symbols, respectively.

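Algorithm 1 ATE-Guided Model Compression (AMoC)
Input: Domain pairs $(S_i, T_i)_{i=1}^n$ with labeled source data $(L_{S_i})_{i=1}^n$, unlabeled source and target data $(U_{S_i}, U_{T_i})_{i=1}^n$, labeled held-out source and target data $(H_{S_i}, H_{T_i})_{i=1}^n$, and a set $C$ of subsets of encoder components to be removed.
Algorithm:
1. For each domain pair in $(S_i, T_i)_{i=1}^n$:
(a) Train the base model $m_B^i$ on $L_{S_i}$.
(b) For each $c_k \in C$:
- Freeze all encoder parameters.
- Remove every component in $c_k$ from $m_B^i$.
- Connect and unfreeze the remaining components according to § 4.2.
- Fine-tune the new model $m_k^i$ on $L_{S_i}$ for one or more epochs.
- Compute $ATE_{S_i}(c_k)$ and $ATE_{T_i}(c_k)$ according to Eq. 2.
- Compute the remaining features in § 4.3.
2. Train the stepwise regression according to Eq. 4, using all compressed models generated in step 1.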
Guiding our model selection step is the ATE of $c_k$ on the base model $m_B^i$. The generation of each compressed candidate $m_k^i$ is designed to allow us to estimate the effect of $c_k$ on the model's predictions. In comparing the predictions of $m_B^i$ to the compressed model $m_k^i$ on many examples, we try to mimic the process of generating control and treatment groups. As is done in controlled experiments, we compare examples that are given a treatment, i.e., encoded by the compressed model $m_k^i$, and examples that were encoded by the base model $m_B^i$. Intervening on the example-generating process was explored previously in the causality literature by Bottou et al. (2013).
Alongside the ATE, we compute other features that might be predictive of a compressed model's performance on an unlabeled target domain, which we discuss in detail in § 4.3. Using those features and the ATE, we train a linear stepwise regression to predict a compressed model's performance on target domains ( § 4.4). Finally, at test time AMoC is given an unseen domain pair and applies the regression in order to choose the compressed source model expected to perform best on the target domain. Using the regression, we can estimate the power of the ATE in predicting model performance and answer Question 3 of § 1.
In this paper, we choose to focus on the removal of sets of layers, as done in previous work (Fan et al., 2019; Sanh et al., 2019). While our method can support any other parameter partitioning, such as clusters of neurons, we leave this for future work. In the case of layers, to establish the new compressed model we simply connect the remaining layers according to their hierarchy. For example, for a base model with a 12-layer encoder and c = {2, 3, 7}, the unconnected components are {1}, {4, 5, 6} and {8, 9, 10, 11, 12}. Layer 1 will then be connected to layer 4, and layer 6 to layer 8. The compressed model will then be trained for one or more epochs where only the decoder and layers 1 and 6 (using the original indices) are fine-tuned. When layer 1 is removed, the embedding layer is connected to the first unremoved layer and is fine-tuned.
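The bookkeeping in the layer-removal example above can be sketched in a few lines of Python. This is a minimal sketch of the index logic only, not the authors' code: the function name `plan_layer_removal` and the convention of using index 0 for the embedding layer are assumptions made for illustration, and no model parameters are touched.

```python
from typing import List, Set, Tuple


def plan_layer_removal(num_layers: int, removed: Set[int]) -> Tuple[List[List[int]], List[int]]:
    """Return (groups of consecutive surviving layers, encoder layers to fine-tune).

    Layers use 1-based indices; 0 stands for the embedding layer (an assumed convention).
    The decoder is always fine-tuned and is therefore not listed.
    """
    kept = [i for i in range(1, num_layers + 1) if i not in removed]
    groups: List[List[int]] = []
    for layer in kept:
        if groups and groups[-1][-1] == layer - 1:
            groups[-1].append(layer)  # extends the current run of consecutive layers
        else:
            groups.append([layer])  # starts a new run after a removed block
    # The last layer of every group except the final one feeds a new connection.
    fine_tune = [group[-1] for group in groups[:-1]]
    if 1 in removed and kept:
        # The embedding layer is connected to the first surviving layer.
        fine_tune.insert(0, 0)
    return groups, fine_tune


groups, fine_tune = plan_layer_removal(12, {2, 3, 7})
# groups    -> [[1], [4, 5, 6], [8, 9, 10, 11, 12]]
# fine_tune -> [1, 6]
```

With c = {2, 3, 7} the sketch reproduces the example in the text: layer 1 is connected to layer 4, layer 6 to layer 8, and layers 1 and 6 are the encoder layers left unfrozen.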

Regression Features
Apart from the ATE, which estimates the impact of the intervention on the base model, we naturally need to consider other features. Indeed, without any information on the target domain, predicting that a model will perform the same as in the source domain could be a reasonable first-order approximation (McClosky et al., 2010). Also, adding information on the distance between the source and target distributions (Van Asch and Daelemans, 2010) or on the type of components that were removed (such as the number of layers) might also be useful for predicting the model's success. We present here all the features we consider, and discuss their usefulness in predicting model performance. To answer Q3, we need to show that given all this information, the ATE is still predictive for the model's performance in the target domain.
ATE Our main variable of interest is the average treatment effect of the components in $c_k$ on the predictions of the model. In our compression scheme, we estimate for a specific domain $d \in \{S_i, T_i\}$ the ATE for each compressed model $m_k^i$ by comparing it to the base model $m_B^i$:
$$ATE_d(c_k) = \mathbb{E}_{x \sim d}\left[\left\langle z(m_B^i(x)) - z(m_k^i(x)) \right\rangle\right] \qquad (2)$$
where the operator $\langle \cdot \rangle$ denotes the total variation distance: a summation over the absolute values of vector coordinates. As we are interested in the effect on the probability assigned to each class by the classifier $m_k^i$, we measure the class probability of its output for an example $x$, as proposed in prior work. In our regression model we choose to include the ATE of the source and the target domains, $ATE_{S_i}(c_k)$ (estimated on $U_{S_i}$) and $ATE_{T_i}(c_k)$ (estimated on $U_{T_i}$), respectively. We note that in computing the ATE we only require the predictions of the models, and do not need labeled data.
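As a concrete illustration of the ATE feature described above, the following Python sketch averages, over unlabeled examples, the sum of absolute differences between the class probabilities of the base model and those of a compressed model. The function name `estimate_ate` and the toy probabilities are hypothetical; the sketch assumes both models have already been run on the same examples, in the same order.

```python
from typing import Sequence


def estimate_ate(
    base_probs: Sequence[Sequence[float]],
    compressed_probs: Sequence[Sequence[float]],
) -> float:
    """Mean over examples of the summed absolute differences between class probabilities."""
    if not base_probs or len(base_probs) != len(compressed_probs):
        raise ValueError("Both models must be evaluated on the same non-empty example set.")
    total = 0.0
    for p_base, p_comp in zip(base_probs, compressed_probs):
        # Sum of absolute coordinate differences for one example.
        total += sum(abs(a - b) for a, b in zip(p_base, p_comp))
    return total / len(base_probs)


# Toy predictions for two unlabeled examples and two classes (illustrative values).
base = [[0.9, 0.1], [0.2, 0.8]]
compressed = [[0.7, 0.3], [0.2, 0.8]]
ate = estimate_ate(base, compressed)  # approximately 0.2
```

No labels appear anywhere in the computation, which is why the same function could be applied to unlabeled source and target data alike.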
In-domain Performance A common metric for selecting a classification model is its performance on a held-out set. Indeed, in cases where we do not have access to any information from the target domain, the naive choice is the best performing model on a held-out source domain set (Elsahar and Gallé, 2019). Hence, for every $c_k \in C$ we compute the performance of $m_k^i$ on $H_{S_i}$.
Domain Classification An important variable when predicting model performance on an unseen test domain is the distance between its training domain and that test domain (Elsahar and Gallé, 2019). While there are many ways to approximate this distance, we choose to do so by training a domain classifier on $U_{S_i}$ and $U_{T_i}$, classifying each example according to its domain. We then compute the average probability assigned to the target examples to belong to the source domain, according to the domain classifier:
$$P(S_i \mid T_i) = \mathbb{E}_{x \sim T_i}\left[P(S_i \mid x)\right] \qquad (3)$$
where $P(S_i \mid x)$ denotes, for an unlabeled target example $x$, the probability that it belongs to the source domain $S_i$, based on the domain classifier.
Compression-size Effects We include in our regression binary variables indicating the number of layers that were removed. Naturally, we assume that the larger the number of layers removed, the bigger the gap from the base model should be.
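The features described in this section can be gathered into a single regression row per candidate, as in the Python sketch below. It is only an illustration under stated assumptions: the function name `build_regression_features`, the feature names, the set of possible removal sizes and the numeric values are all hypothetical, and the domain classifier is assumed to have already produced a source-domain probability for every target example.

```python
from typing import Dict, Sequence


def build_regression_features(
    ate_source: float,
    ate_target: float,
    source_f1: float,
    source_probs_on_target: Sequence[float],
    num_removed: int,
    removal_sizes: Sequence[int],
) -> Dict[str, float]:
    """Assemble one row of regression features for a compressed candidate."""
    features = {
        "ate_source": ate_source,
        "ate_target": ate_target,
        "source_f1": source_f1,
        # Average probability that target examples belong to the source domain.
        "p_source_given_target": sum(source_probs_on_target) / len(source_probs_on_target),
    }
    # One binary indicator per possible number of removed layers.
    for size in removal_sizes:
        features[f"removed_{size}_layers"] = 1.0 if num_removed == size else 0.0
    return features


# Hypothetical values for a candidate with four layers removed.
row = build_regression_features(
    ate_source=0.12,
    ate_target=0.20,
    source_f1=0.88,
    source_probs_on_target=[0.2, 0.4, 0.6],
    num_removed=4,
    removal_sizes=[4, 6, 8],
)
```

Encoding the number of removed layers as separate indicators, rather than as a single integer, lets the regression assign each compression size its own offset.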

Regression Analysis
In order to decide which $c_k$ should be removed from the base model, we follow the process described in Algorithm 1 for all $c \in C$ and end up with many candidate compressed models, differing by the model components that were removed. As our goal is to choose a candidate model to be used in an unseen target domain, we train a standard linear stepwise regression model (Hocking, 1976; Draper and Smith, 1998; Dubossarsky et al., 2020) to predict the candidate's performance on the seen target domains:
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m + \epsilon \qquad (4)$$
where $Y$ is performance on these target domains, computed using their held-out sets $(H_{T_i})_{i=1}^n$, and $X_1, \dots, X_m$ are the set of variables described in § 4.3, including the ATE. In stepwise regression, variables are added to the model incrementally only if their marginal addition for predicting $Y$ is statistically significant (p < 0.01). This method is useful for finding variables with maximal and unique contribution to the explanation of $Y$. The value of this regression is two-fold in our case as it allows us to: (1) get a predictive model that can choose a high quality compressed model candidate, and (2) estimate the predictive power of the ATE on model performance in the target domain.
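To make the incremental variable selection concrete, the following Python sketch implements a simple forward stepwise procedure with ordinary least squares. It is a sketch under several assumptions and not the procedure used in the paper: the function names `fit_ols` and `forward_stepwise` are hypothetical, the p-value uses a two-sided normal approximation in place of the t or F tests of standard statistical packages, variables are only ever added (never dropped), and the toy data are synthetic.

```python
import math
from typing import Dict, List, Sequence, Tuple


def fit_ols(X: List[List[float]], y: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Ordinary least squares via the normal equations; returns (coefficients, standard errors)."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | I]; Gauss-Jordan elimination turns it into [I | (X'X)^-1].
    aug = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
        + [1.0 if i == j else 0.0 for j in range(p)]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    inverse = [row[p:] for row in aug]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(p)]
    beta = [sum(inverse[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [y[r] - sum(X[r][j] * beta[j] for j in range(p)) for r in range(n)]
    sigma2 = sum(e * e for e in residuals) / (n - p)
    std_err = [math.sqrt(sigma2 * inverse[i][i]) for i in range(p)]
    return beta, std_err


def forward_stepwise(columns: Dict[str, Sequence[float]], y: Sequence[float], alpha: float = 0.01) -> List[str]:
    """Repeatedly add the most significant remaining variable while its p-value is below alpha."""
    selected: List[str] = []
    remaining = list(columns)
    while remaining:
        best_name, best_p = None, alpha
        for name in remaining:
            names = selected + [name]
            X = [[1.0] + [columns[c][r] for c in names] for r in range(len(y))]
            beta, std_err = fit_ols(X, y)
            z = abs(beta[-1]) / std_err[-1]  # test statistic of the newly added variable
            p_value = math.erfc(z / math.sqrt(2.0))  # two-sided normal approximation
            if p_value < best_p:
                best_name, best_p = name, p_value
        if best_name is None:
            break
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Synthetic toy data: the outcome depends on "ate_target" only.
n = 50
ate_target = [i / n for i in range(n)]
other = [float((3 * i) % 7) for i in range(n)]
noise = [0.02 * ((7 * i) % 5 - 2) for i in range(n)]
y = [0.9 - 0.5 * a + e for a, e in zip(ate_target, noise)]
selected = forward_stepwise({"ate_target": ate_target, "other": other}, y)
```

In this toy example the variable that actually drives the outcome is selected first; a practical analysis would rely on a statistics package with exact tests rather than on this approximation.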

Data
We consider three challenging datasets (tasks): (1) The Amazon product reviews dataset for sentiment classification (He and McAuley, 2016). This dataset consists of product reviews and metadata, from which we choose 6 distinct domains: Amazon Instant Video (AIV), Beauty (B), Digital Music (DM), Musical Instruments (MI), Sports and Outdoors (SAO) and Video Games (VG). All reviews are annotated with an integer score between 0 and 5. We label reviews with a score above 3 as positive and reviews with a score below 3 as negative. Ambiguous reviews (rating = 3) are discarded. Since the dataset does not contain development and test sets, we randomly split each domain into training (64%), development (16%) and test (20%) sets.
(2) The Multi-Genre Natural Language Inference (MultiNLI) corpus for natural language inference classification (Williams et al., 2018). This corpus consists of pairs of sentences, a premise and a hypothesis, where the hypothesis is either entailed by the premise, neutral to it, or contradicts it. The MultiNLI dataset extends the SNLI corpus (Bowman et al., 2015), assembled from image captions, to 10 additional domains: 5 matched domains, containing training, development and test samples, and 5 mismatched, containing only development and test samples. We experiment with the original SNLI corpus (Captions domain) as well as the matched version of MultiNLI, containing the Fiction, Government, Slate, Telephone and Travel domains, for a total of 6 domains.
(3) The OntoNotes 5.0 dataset (Hovy et al., 2006), consisting of sentences annotated with named entities, part-of-speech tags and parse trees. We focus on the Named Entity Recognition (NER) task with 6 different English domains: Broadcast Conversation (BC), Broadcast News (BN), Magazine (MZ), Newswire (NW), Telephone Conversation (TC) and Web data (WB). This setup allows us to evaluate the quality of AMoC on a sequence tagging task.
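The preprocessing of the Amazon reviews described in (1) can be sketched in Python as follows. This is an illustrative sketch only: the function names `rating_to_label` and `split_domain`, the fixed random seed, and the use of integer arithmetic to compute the split sizes are assumptions, not details given in the text.

```python
import random
from typing import List, Optional, Sequence, Tuple


def rating_to_label(rating: int) -> Optional[str]:
    """Map a review score to a sentiment label; ambiguous reviews (score 3) return None."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return None


def split_domain(examples: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Randomly split one domain into training (64%), development (16%) and test (20%) sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # the seed value is an arbitrary assumption
    n_train = len(shuffled) * 64 // 100
    n_dev = len(shuffled) * 16 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder, roughly 20%
    return train, dev, test


train, dev, test = split_domain(range(100))
```

Reviews for which `rating_to_label` returns `None` would be dropped before splitting.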
The statistics of our experimental setups are reported in Table 1. Since the test sets of the MultiNLI domains are not publicly available, we treat the original development sets as our test sets, and randomly choose 2,000 examples from the training set of each domain to serve as the development sets. We use the original splits of SNLI as they are all publicly available. Since our datasets manifest class imbalance phenomena, we use the macro average F1 as our evaluation measure.
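For reference, macro average F1 computes an F1 score per class and then averages the per-class scores with equal weight, so that rare classes count as much as frequent ones. The Python sketch below illustrates the computation; the function name `macro_f1` is hypothetical, and treating every prediction as one item (rather than, say, an entity span in sequence tagging) is a simplifying assumption.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
# Per-class F1: 2/3 for class 0 and 4/5 for class 1, so the macro average is 11/15.
```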
For the regression step of Algorithm 1, we use the development set of each target domain to compute the model's macro F1 score (for the $Y$ and the in-domain performance variables). We compute the ATE variables on the development sets of both domains, train the domain classifier on unlabeled versions of the training sets and compute $P(S \mid T)$ on the target development set.

Model and Baselines
Model The encoder being compressed is the BERT-base model (Devlin et al., 2019). BERT is a 12-layer Transformer model (Vaswani et al., 2017; Radford et al., 2018), representing textual inputs contextually and sequentially. Our decoder consists of a layer attention mechanism (Kondratyuk and Straka, 2019) which computes a parameterized weighted average over the layers' output, followed by a 1D convolution with the max-pooling operation and a final Softmax layer.
Baselines To put our results in context of previous model compression work, we compare our models to three strong baselines. Like AMoC, the baselines generate reduced-size encoders. These encoders are augmented with the same decoder as in our model to yield the baseline architectures. The first baseline is DistilBERT (DB) (Sanh et al., 2019): a 6-layer compressed version of BERT-base, trained on the masked language modelling task with the goal of mimicking the predictions of the larger model. We used its default setting, i.e., removal of 6 layers with c = {2, 4, 6, 7, 9, 11}. Sanh et al. (2019) demonstrated that DistilBERT achieves comparable results to the large model with only half of its layers.
Since DistilBERT was not designed or tested on out-of-distribution data, we create an additional version, denoted as DB + DA. In this version, the training process is performed on the masked language modelling task using an unlabeled version of the training data from both the source and the target domains, with its original hyper-parameters.
We add a further adaptation-aware baseline: DB + GR, the DistilBERT model equipped with the gradient reversal (GR) layer (Ganin and Lempitsky, 2015). Specifically, we augment the DistilBERT model with a domain classifier, similar in structure to the task classifier, which aims to distinguish between the unlabeled source and the unlabeled target examples. By reversing the gradients resulting from the objective function of this classifier, the encoder is biased to produce domain-invariant representations. We set the weights of the main task loss and the domain classification loss to 1 and 0.1, respectively.
Another baseline is LayerDrop (LD), a procedure that applies layer dropout during training, making the model robust to the removal of certain layers during inference (Fan et al., 2019). During training, we apply a fixed dropout rate of 0.5 for all layers. At inference, we apply their Every Other strategy by removing all even layers to obtain a reduced 6-layer model.
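The two phases of the LayerDrop baseline can be sketched as follows. This is a minimal illustration, assuming layers are indexed from 1 to 12; the helper names are hypothetical and do not correspond to the LayerDrop implementation of Fan et al. (2019).

```python
import random

NUM_LAYERS = 12


def sample_active_layers(drop_rate, rng, num_layers=NUM_LAYERS):
    """Training time: each layer is dropped independently with probability drop_rate."""
    return [i for i in range(1, num_layers + 1) if rng.random() >= drop_rate]


def every_other_removed(num_layers=NUM_LAYERS):
    """Inference time: the 'Every Other' strategy removes all even-indexed layers."""
    return {i for i in range(1, num_layers + 1) if i % 2 == 0}


def kept_layers(removed, num_layers=NUM_LAYERS):
    """Layers that remain in the reduced encoder, in their original order."""
    return [i for i in range(1, num_layers + 1) if i not in removed]


rng = random.Random(0)
active_this_step = sample_active_layers(0.5, rng)  # varies from step to step
inference_layers = kept_layers(every_other_removed())
```

Under the 1-based indexing assumed here, the reduced model keeps layers 1, 3, 5, 7, 9 and 11.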
Finally, we compare AMoC to ALBERT, a recently proposed BERT-based variant designed to mimic the performance of the larger BERT model with only a tenth of its parameters (11M parameters compared to BERT's 110M parameters) (Lan et al., 2020). ALBERT is trained with cross-layer parameter sharing and sentence ordering objectives, leading to better model efficiency. Unlike the other baselines explored here, it is not directly comparable, since it consists of 12 layers and was pre-trained on substantially more data. As such, we do not include it in the main results table (Table 2), and instead discuss its performance compared to AMoC in Section 6.

Compression Scheme Experiments
While our compression algorithm is neither restricted to a specific DNN architecture nor to the removal of certain model components, we follow previous work and focus on the removal of layer sets (Fan et al., 2019; Sanh et al., 2019). With the goal of addressing the research questions posed in § 1, we perform extensive compression experiments on the 12-layer BERT by considering the removal of 4, 6 and 8 layers. For each number of layers removed, we randomly sample 100 layer sets to generate our model candidates. To be able to test our method on all domain pairs, we randomly split these pairs into five 20% domain pair sets and train five regression models, differing in the set used for testing. Our splits respect the restriction that no test set domain (source or target) appears in the training set.
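The candidate generation and the split restriction described above can be sketched as follows. This is an illustration under stated assumptions: the sampled layer sets are taken to be distinct, the domain names `A` to `F` are placeholders, and `sample_layer_sets` and `train_pairs_for` are hypothetical helper names.

```python
import random
from itertools import combinations, permutations


def sample_layer_sets(num_removed, num_candidates=100, num_layers=12, seed=0):
    """Sample distinct sets of layers to remove (assumption: without replacement)."""
    rng = random.Random(seed)
    all_sets = list(combinations(range(1, num_layers + 1), num_removed))
    return rng.sample(all_sets, num_candidates)


def train_pairs_for(test_pairs, all_pairs):
    """Keep only pairs in which neither the source nor the target is a test domain."""
    test_domains = {domain for pair in test_pairs for domain in pair}
    return [
        (src, tgt)
        for src, tgt in all_pairs
        if src not in test_domains and tgt not in test_domains
    ]


# 100 candidates for each number of removed layers considered in our experiments.
candidates = {k: sample_layer_sets(k) for k in (4, 6, 8)}

# Placeholder domains; every ordered (source, target) pair is one setup.
domains = ["A", "B", "C", "D", "E", "F"]
all_pairs = list(permutations(domains, 2))
test_pairs = [("A", "B"), ("B", "A")]
train_pairs = train_pairs_for(test_pairs, all_pairs)
```

In this toy setting, holding out the pairs between `A` and `B` leaves only the 12 ordered pairs among the remaining four domains for training the regression.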

Hyper-parameters
We implement all models using HuggingFace's Transformers package (Wolf et al., 2020). We consider the following hyper-parameters for the uncompressed models: training for 10 epochs (Amazon Reviews and MultiNLI) or 30 epochs (OntoNotes) with an early stopping criterion according to the development set, optimizing all parameters using the Adam optimizer (Kingma and Ba, 2015) with a weight decay of 0.01 and a learning rate of 1e-4, a batch size of 32, a window size of 9, 16 output channels for the 1D convolution, and a dropout probability of 0.1 for the layer attention module. The compressed models are trained on the labeled source data for 1 epoch (Amazon Reviews and MultiNLI) or 10 epochs (OntoNotes).
The domain classifiers are identical in architecture to our task classifiers and use the uncompressed encoder after it was optimized during the above task-based training. These classifiers are trained on the unlabeled version of the source and target training sets for 25 epochs with early stopping, using the same hyper-parameters as above.

Performance of Compressed Models
Table 2 reports macro F1 scores for all domain pairs of the Amazon Reviews, MultiNLI and OntoNotes datasets, when considering the removal of 6 layers, while Figure 2 provides summary statistics. Clearly, AMoC outperforms all baselines in the vast majority of setups (see, e.g., the lower graphs of Figure 2). Moreover, its average target-domain performance (across the 5 source domains) improves over the second best model (DB + DA) by up to 4.56%, 5.16% and 1.63%, on Amazon Reviews, MultiNLI and OntoNotes, respectively (lowest rows of each table in Table 2; see also the average across setups in the upper graphs of Figure 2). These results provide a positive answer to Q1 of § 1, by indicating the superiority of AMoC over strong alternatives.
DB + GR is overall the worst performing baseline, followed by DB, with average degradations of 11.3% and 8.2% macro F1, respectively, compared to the more successful cross-domain oriented variant DB + DA. This implies that out-of-the-box compressed models such as DB struggle to generalize well to out-of-distribution data. DB + DA also performs worse than AMoC in a large portion of the experiments. These results are even more appealing given that AMoC does not perform any gradient step on the target data, performing only a small number of gradient steps on the source data. In fact, AMoC only uses the unlabeled target data for computing the regression features. Lastly, LD, another strong baseline which was specifically designed to remove layers from BERT, is surpassed by AMoC by as much as 6.76% F1, when averaging over all source-target domain pairs. Finally, we compare AMoC to ALBERT. We find that on average ALBERT is outperformed by AMoC by 8.8% F1 on Amazon Reviews, and by 1.6% F1 on MultiNLI. On OntoNotes the performance gap between ALBERT and AMoC is an astounding 24.8% F1 in favor of AMoC, which might be a result of ALBERT being an uncased model, as casing is an important feature for NER tasks.
Figure 2: Overall average score (up) and overall number of wins (down) over all source-target domain pairs.
Compressed Model Selection We next evaluate how well the regression model and its variables predict the performance of a candidate compressed model on the target domain. Table 3 presents the adjusted $R^2$, indicating the share of the variance in the predicted outcome that the variables explain. Across all experiments and regardless of the number of layers removed, our regression model predicts the performance on unseen domain pairs well, averaging an $R^2$ of 0.881, 0.916 and 0.826 on Amazon Reviews, MultiNLI and OntoNotes, respectively. This indicates that our regression properly estimates the performance of candidate models.
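For reference, the adjusted $R^2$ corrects the plain $R^2$ for the number of explanatory variables. The sketch below uses the standard textbook definitions; the function names and the numbers in the usage example are illustrative and are not taken from Table 3.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors used by the regression."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: 101 candidate models, 5 explanatory variables.
example = adjusted_r_squared(0.9, n_samples=101, n_predictors=5)
```

With few predictors relative to the number of candidate models, the adjustment is small, so the adjusted and unadjusted values stay close.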
Further support for this observation is that in 75% of the experiments the model selected by the regression is among the top 10 performing compressed models.
Finally, as expected, we find that AMoC is often outperformed by the full model. However, the gap between the models is small, averaging only 1.26%. Moreover, in almost 25% of all experiments AMoC was able to surpass the full model (underscored scores in Table 2).

Marginal Effects of Regression Variables
While the performance of the model on data drawn from the same distribution may also be indicative of its out-of-distribution performance, additional information is likely to be needed in order to make an exact prediction. Here, we supplement [...]
Indeed, most of the regression's predictive power comes from the model performance on the source domain ($F1_S$) and the treatment effects on the source and target domains ($ATE_S$, $ATE_T$). In contrast, the distance metric ($P(S|T)$) and the interaction terms ($ATE_T \cdot P(S|T)$, $F1_S \cdot P(S|T)$) contribute much less to the total $R^2$. The predictive power of the ATE in both source and target domains suggests a positive answer to Q3 of § 1.

Layer Importance
To further understand the importance of each of BERT's layers, we compute the frequency with which each layer appears in the best candidate model, i.e., the model with the highest F1 score on the target test set, of every experiment. Figure 3 captures the layer frequencies across the different datasets and across the number of removed layers.
The plots suggest that the two final layers, layers 11 and 12, are the least important layers with average frequencies of 30.3% and 24.8%, respectively. Additionally, in most cases layer 1 is ranked below the other layers. These results imply that the compressed models are able to better recover from the loss of parameters when the external layers are removed. The most important layer appears to be layer 4, with an average frequency of 73.3%. Finally, we notice that a large frequency variance exists across the different subplots. Such variance supports our hypothesis that the decision of which layers to remove should not be based solely on the architecture of the model.
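The layer frequency statistic discussed above can be computed as in the following sketch. The removed-layer sets are made-up toy data, not results from our experiments, and `layer_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def layer_frequencies(best_removed_sets, num_layers=12):
    """Fraction of experiments whose best candidate model keeps each layer."""
    kept_counts = Counter()
    for removed in best_removed_sets:
        for layer in range(1, num_layers + 1):
            if layer not in removed:
                kept_counts[layer] += 1
    total = len(best_removed_sets)
    return {layer: kept_counts[layer] / total for layer in range(1, num_layers + 1)}


# Toy data: the removed-layer set of the best (oracle) candidate in four experiments.
best_removed = [
    {1, 6, 11, 12},
    {2, 3, 11, 12},
    {1, 7, 8, 12},
    {5, 9, 10, 11},
]
frequencies = layer_frequencies(best_removed)
```

A low frequency for a layer means that the best candidates tend to be those from which the layer was removed, which is how we read layers 11 and 12 as the least important.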
To pin down the importance of a specific layer for a given base model, we utilize a similar regression analysis to that of § 6. Specifically, we train a regression model on all compressed candidates for a given source-target domain pair (in all three tasks), adding indicator variables for the exclusion of each layer from the model. This model associates each layer with a regression coefficient, which can be interpreted as the marginal effect of that layer being removed on expected target performance. We then compute for each layer its average coefficient across source-target pairs (Table 5, β column) and compare it to the fraction of source-target pairs where this layer is not included in the best possible (oracle) compressed model (Table 5, P (Layer removed) column).
As can be seen in the table, layers whose removal is associated with better model performance are more often not included in the best performing compressed models. Indeed, Spearman's rank correlation between the two rankings is as high as 0.924. Such analysis demonstrates that the regression model used as part of AMoC not only selects high quality candidates, but can also shed light on the importance of individual layers.
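Spearman's rank correlation, used above to compare the two layer rankings, is the Pearson correlation of the rank-transformed values. The following is a minimal dependency-free sketch; the per-layer values in the usage example are hypothetical and do not reproduce Table 5.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-layer values: regression coefficient and removal frequency.
beta = [0.1, 0.5, 0.3, 0.9]
p_removed = [0.2, 0.4, 0.6, 0.8]
rho = spearman(beta, p_removed)
```

Because only ranks matter, the statistic is insensitive to the different scales of the regression coefficients and the removal fractions.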

Training Epochs
Figure 3: Layer frequency at the best (oracle) compressed models when considering the removal of 4, 6 and 8 layers in the three datasets.
We next analyze the number of epochs required to fine-tune our compressed models. For each dataset (task) we randomly choose for every target domain 10 compressed models and create two alternatives, differing in the number of training epochs performed after layer removal: one trained for a single epoch and another for 5 epochs (Amazon Reviews, MultiNLI) or 10 epochs (OntoNotes). Table 6 compares the average F1 (target-domain task performance) and $ATE_T$ differences between the two alternatives, on the target domain test and dev sets, respectively. The results suggest that when training for more epochs on Amazon Reviews and MultiNLI, the differences in both the F1 and the ATE are negligible. For OntoNotes (NER), in contrast, additional training improves the F1, suggesting that further training of the compressed model candidates may be favorable for sequence tagging tasks such as NER.
Table 7 compares the number of overall and trainable parameters and the training time of BERT, DistilBERT and AMoC. Removing L layers from BERT yields a reduction of 7L million parameters. As can be seen in the table, AMoC requires training only a small fraction of the overall parameters. Since we only unfreeze one layer for each new connected component, in the worst case our algorithm requires the training of min{L, 12 − L} layers. The only exception is the case where layer 1 is removed (1 ∈ c). In such a case we unfreeze the embedding layer, which adds 24 million trained parameters. In terms of total training time (one epoch of task-based fine-tuning), when averaging over all setups, a single compressed AMoC model is 11× faster than BERT and 6× faster than DistilBERT.
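The parameter accounting above can be illustrated with a short sketch. It rests on one reading of the text that should be treated as an assumption: a connected component is taken to be a maximal run of consecutive removed layers, and a run that starts at layer 1 is assumed to be handled by unfreezing the embedding layer instead of an encoder layer. The helper names are hypothetical.

```python
NUM_LAYERS = 12
PARAMS_PER_LAYER_M = 7  # millions of parameters per encoder layer, as stated in the text


def removed_components(removed, num_layers=NUM_LAYERS):
    """Group the removed layers into maximal runs of consecutive indices."""
    runs, current = [], []
    for layer in range(1, num_layers + 1):
        if layer in removed:
            current.append(layer)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def training_budget(removed, num_layers=NUM_LAYERS):
    """Summarize what must be trained after removing the given layer set."""
    runs = removed_components(removed, num_layers)
    return {
        # Assumption: one encoder layer is unfrozen per run not starting at layer 1.
        "encoder_layers_unfrozen": sum(1 for run in runs if run[0] != 1),
        "embedding_unfrozen": 1 in removed,
        "reduction_millions": PARAMS_PER_LAYER_M * len(removed),
    }


distil_like = training_budget({2, 4, 6, 7, 9, 11})
worst_case = training_budget({1, 2, 4, 5, 7, 8, 10, 12})
```

Under this reading, the number of unfrozen encoder layers never exceeds min{L, 12 − L}: the second layer set removes L = 8 layers and reaches the bound of 4.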

Design Choices
Computing the ATE Following prior work, we implement the ATE with the total variation distance between the probability output of the original model and that of the compressed models. To verify the quality of this design choice, we re-ran our experiments with the ATE calculated using the KL-divergence between the same distributions. While the results in both conditions are qualitatively similar, we did find a consistent quantitative improvement of the $R^2$ (average of 0.05 across setups) when considering our total variation distance.
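The two variants compared above can be sketched as follows. This is a simplified illustration that assumes the estimate is the mean per-example distance between the class distributions of the original and the compressed model; the function names and the toy probabilities are hypothetical.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against division by zero and log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def estimate_ate(original_probs, compressed_probs, distance=total_variation):
    """Mean distance between the two models' output distributions over examples."""
    distances = [distance(p, q) for p, q in zip(original_probs, compressed_probs)]
    return sum(distances) / len(distances)


# Toy outputs for two examples of a binary task.
original = [[0.9, 0.1], [0.2, 0.8]]
compressed = [[0.7, 0.3], [0.2, 0.8]]
ate_tv = estimate_ate(original, compressed)
ate_kl = estimate_ate(original, compressed, distance=kl_divergence)
```

Unlike the KL-divergence, the total variation distance is symmetric and bounded in [0, 1], which keeps the resulting regression feature on a fixed scale.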
Regression Analysis Our regression approach is designed to allow us to both select high-quality compressed candidates and to interpret the importance of each explanatory variable, including the ATEs. As this regression has relatively few features, we do not expect to lose significant predictive power by choosing to focus on linear predictors. To verify this, we re-ran our experiments when using a fully connected feed-forward network 10 to predict target performance. This model, which is less interpretable than our regression, is also less accurate: We have observed an increased mean squared error of 1-3% with the network.

Conclusion
We explored the relationship between model compression and out-of-distribution generalization. AMoC, our proposed algorithm, relies on causal inference tools for estimating the effects of interventions. It thus creates an interpretable process that allows us to understand the role of specific model components. Our results indicate that AMoC is able to produce a smaller model with minimal loss in performance across domains, without any use of target labeled data at test time (Q1).
AMoC can efficiently train a large number of compressed model candidates that can then serve as training examples for a regression model. We have shown that this approach results in a high quality estimation of the performance of compressed models on unseen target domains (Q2). Moreover, our stepwise regression analysis indicates that the $ATE_S$ and $ATE_T$ estimates are instrumental for these attractive properties (Q3).
10 With one intermediate layer, the same input features as the regression, and hyper-parameters tuned on the development set of each source-target pair.
As training and test set mismatches are common, we steered our model compression research towards out-of-domain generalization. Besides its realistic nature, this setup poses additional modeling challenges, such as understanding the proximity between domains, identifying which components are invariant to domain shift, and estimating performance on unseen domains. Hence, AMoC is designed for model compression in the out-of-distribution setup. We leave the design of similar in-domain compression methods for future work.
Finally, we believe that using causal methods to produce compressed NLP models that generalize well across distributions is a promising direction of research, and hope that more work will be done at this intersection.