Evaluating Explanations: How Much Do Explanations from the Teacher Aid Students?

While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared with prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are consistent (to a moderate to high degree) across different student model architectures and learning strategies.


Introduction
The success of deep learning models, together with the difficulty of understanding how they work, has inspired a subfield of research on explaining predictions, often by highlighting specific input features deemed somehow important to a prediction (Ribeiro et al., 2016; Sundararajan et al., 2017; Shrikumar et al., 2017). For instance, we might expect such a method to highlight spans like "poorly acted" and "slow-moving" to explain a prediction of negative sentiment for a given movie review. However, there is little agreement in the literature as to what constitutes a good explanation (Lipton, 2016; Jacovi and Goldberg, 2021). Moreover, various popular methods for generating such attributions disagree considerably over which tokens to highlight (Table 1). With so many methods claimed to confer the same property while disagreeing so markedly, one path forward is to develop clear quantitative criteria for evaluating purported explanations at scale.
The status quo for evaluating so-called explanations skews qualitative: many proposed techniques are evaluated only via visual inspection of a few examples (Simonyan et al., 2014; Sundararajan et al., 2017; Shrikumar et al., 2017). While several quantitative evaluation techniques have recently been proposed, many of these are easily gamed (Treviso and Martins, 2020). Some depend upon the model outputs corresponding to deformed examples that lie outside the support of the training distribution (DeYoung et al., 2020), and a few validate explanations on specifically crafted tasks (Poerner et al., 2018).
In this work, we propose a new framework, where explanations are quantified by the degree to which they help a student model in learning to simulate the teacher on future examples (Figure 1). Our framework addresses a coherent goal, is model-agnostic and broadly applicable across tasks, and (when instantiated with models as students) can easily be automated and scaled. Our method is inspired by argumentative models for justifying human reasoning, which posit that the role of explanations is to communicate information about how decisions are made, and thus to enable a recipient to anticipate future decisions (Mercier and Sperber, 2017). Our framework is similar to prior human studies that evaluate whether explanations help predict model behavior. However, here we focus on protocols that do not rely on human-subject experiments.
Table 1: Overlap among the top-10% tokens selected by different explanation techniques for sentiment analysis. In each row, for a given technique, we tabulate the fraction of explanatory tokens that overlap with other explanations. A value of 1.0 implies perfect overlap and 0.0 denotes no overlap.
Using our framework, we conduct extensive experiments on two broad categories of NLP tasks: text classification and question answering. For classification tasks, we compare seven widely used input attribution techniques, covering gradient-based methods (Simonyan et al., 2014; Sundararajan et al., 2017), perturbation-based techniques (Ribeiro et al., 2016), attention-based explanations (Bahdanau et al., 2015), and other popular attributions (Shrikumar et al., 2017; Dhamdhere et al., 2019). These comparisons lead to observable quantitative differences: we find attention-based explanations and integrated gradients (Sundararajan et al., 2017) to be the most effective, and vanilla gradient-based saliency maps and LIME to be the least effective.
Further, we observe moderate to high agreement among rankings obtained by varying student architectures and learning strategies in our framework. For question answering, we validate the effectiveness of student learners on both human-produced explanations collected by Lamm et al. (2021), and automatically generated explanations from a SpanBERT model (Joshi et al., 2020).
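The overlap statistic described in the caption of Table 1 can be illustrated with a minimal sketch. The function names, the rounding rule (round up, keep at least one token), and the two score vectors are assumptions made for illustration; they are not taken from the paper.

```python
import math

def top_fraction(scores, fraction=0.10):
    """Indices of the highest-scoring tokens (rounded up, at least one)."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def overlap(scores_a, scores_b, fraction=0.10):
    """Fraction of method A's top tokens that method B also selects."""
    top_a = top_fraction(scores_a, fraction)
    top_b = top_fraction(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)

# Hypothetical attribution scores for one 10-token review from two methods.
method_a = [0.1, 0.9, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.1, 0.0]
method_b = [0.0, 0.2, 0.1, 0.8, 0.1, 0.1, 0.0, 0.3, 0.2, 0.1]

print(overlap(method_a, method_a))  # a method compared with itself
print(overlap(method_a, method_b))  # the two methods pick different tokens
```

Averaging this quantity over all examples for each ordered pair of techniques would give one cell of a table like Table 1.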

An Illustrative Example
In our framework, we view explanations as a communication channel between a teacher $T$ and a student $S$, whose purpose is to help $S$ to predict $T$'s outputs on a given input. As an example, consider the case of graduate admissions: an aspirant submits their application $x$ and subsequently the admission committee $T$ decides whether the candidate is to be accepted or not. The acceptance criterion, $f_T(x)$, represents a typical black box function, one that is of great interest to future aspirants. To simulate the admission criterion, a student $S$ might study profiles of several applicants from previous iterations, $x_1, \dots, x_n$, and their admission outcomes $f_T(x_1), \dots, f_T(x_n)$.
Let $A(f_S, f_T)$ be the simulation accuracy, that is, the accuracy with which the student predicts the teacher's decisions on unseen future applications (defined formally below in §2.2). Now suppose each previous admission outcome was supplemented with an additional explanation $e_T(x)$ from the admission committee, intended to help $S$ understand the decisions made by $T$. Ideally, these explanations would enhance students' understanding of the admission process, and would help students simulate the admission decisions better, leading to a higher accuracy. We argue that the degree of improvement in simulation accuracy is a quantitative indicator of the utility of the explanations. Note that generic explanations or explanations that simply encode the final decision (e.g., "We received far too many applications ...") are unlikely to help students simulate $f_T(x)$, as they provide no additional information.

Quantifying Explanations
For concreteness, we assume a classification task, and for a teacher $T$, we let $f_T$ denote a model that computes the teacher's predictions. Let $S$ be a student (either a human or a machine). Then $T$ could teach $S$ to simulate $f_T$ by sampling $n$ examples, $x_1, \dots, x_n$, and sharing with $S$ a dataset $\hat{D}$ containing its associated predictions $\{(x_1, \hat{y}_1), \dots, (x_n, \hat{y}_n)\}$, where $\hat{y}_i = f_T(x_i)$, and $S$ could then learn some approximation of $f_T$ from this data:
$$f_{S,\hat{D}} = \text{learn}(S, \hat{D}).$$
Additionally, we assume that for a given teacher $T$, an explanation generation method can generate an explanation $e_T(x)$ for any example $x$, which is some side information that potentially helps $S$ in predicting $f_T(x)$. We use $\hat{E}$ to denote a dataset of explanation-augmented examples, that is,
$$\hat{E} = \{(x_1, e_T(x_1), \hat{y}_1), \dots, (x_n, e_T(x_n), \hat{y}_n)\},$$
and the student learner can make use of this side information during training, to learn a classifier $f_{S,\hat{E}} = \text{learn}(S, \hat{E})$.
Note that none of the learning tasks discussed above involve the "gold" label $y$ for any instance $x$, only the prediction $\hat{y}$ for $x$, produced by the teacher. While the student $S$ can use the explanations for learning, all the classifiers $f_T$, $f_{S,\hat{D}}$, and $f_{S,\hat{E}}$ predict labels given only the input $x$, without using the explanations; that is, explanations are only available during training, not at test time.
In our framework, the benefit of explanations is measured by how much they help the student to simulate the teacher. In particular, we quantify the ability of a student $f_S$ to simulate a teacher using the simulation accuracy:
$$A(f_S, f_T) = \mathbb{E}_x\big[\mathbb{1}\{f_S(x) = f_T(x)\}\big],$$
where the expected agreement between student and teacher is computed over test examples. Better explanations will lead to higher values of $A(f_{S,\hat{E}}, f_T)$ than the accuracy associated with learning to simulate the teacher without explanations, namely, $A(f_{S,\hat{D}}, f_T)$.
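As a concrete illustration of the simulation accuracy and of the gain attributed to explanations, the following minimal sketch computes both on a toy problem. The rule-based teacher and students and the four test inputs are hypothetical stand-ins; in the paper both roles are played by trained models.

```python
def simulation_accuracy(student, teacher, test_inputs):
    """Fraction of test inputs on which the student reproduces the teacher's prediction."""
    agreements = [student(x) == teacher(x) for x in test_inputs]
    return sum(agreements) / len(agreements)

def explanation_gain(student_with_expl, student_without_expl, teacher, test_inputs):
    """Simulation accuracy with explanations minus simulation accuracy without them."""
    return (simulation_accuracy(student_with_expl, teacher, test_inputs)
            - simulation_accuracy(student_without_expl, teacher, test_inputs))

# Hypothetical teacher: predicts negative (0) whenever the review contains "poorly".
def teacher(x):
    return 0 if "poorly" in x else 1

# Hypothetical students: one trained without explanations, one trained with them.
def student_plain(x):
    return 1  # always predicts positive

def student_explained(x):
    return 0 if "poorly" in x else 1  # has picked up the teacher's cue

test_inputs = ["poorly acted", "a delight", "poorly paced", "great cast"]
print(explanation_gain(student_explained, student_plain, teacher, test_inputs))
```

Note that gold labels never appear: agreement is measured against the teacher's predictions only, and neither student sees explanations at test time.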
So far, for a given teacher model, our criterion for explanation quality depends upon the choice of the student model ($S$), its learning procedure, and the number of examples used to train it ($n$). To reduce the reliance on a given student, we could assume that the student $S$ is drawn from a distribution of students $\Pr(S)$, and extend our framework by considering the expected benefit for a random student averaged over various values of $n$. In practice, we experiment with a small set of diverse students (e.g., models with different sizes, architectures, learning procedures) and consider different values of $n$.

Automated Teachers and Students
In principle, $T$ and $S$ could be either people or algorithms. However, quantitative measurements are easier to conduct when $T$ and (especially) $S$ are algorithms. In particular, imagine that $T$ (which for example could be a BERT-based classifier) identifies an explanation $e_T(x)$ that is some subset of tokens in a document $x$ that are relevant to the prediction (acquired by, for example, any of the explanation methods mentioned in the introduction) and $S$ is some machine learner that makes use of the explanation. The value of teacher-explanations for $S$ can then be assessed via standard evaluation of explanation-aware student learners, using predicted labels instead of gold labels. This value can then be compared to other schemes for producing explanations (e.g., integrated gradients). An important concern in automated evaluation, however, is that, by design, the obtained results are contingent on the student model(s) and how explanations are incorporated by the student model(s). Another apparent "bug" in this framework is that in the automated case, one could obtain a perfect simulation accuracy with an explanation that communicates all the weights of the teacher classifier $f_T$ to the student. We propose two approaches to address this problem. First, we simply limit explanations to be of a form that people can comprehend, for example, spans in a document $x$. That is, we consider only popular formats of explanations that are considered to be human understandable (see §3 for details and Table 2 for examples). Second, we experiment with a diverse set of student models (e.g., networks with architectures different from the original teacher model), which precludes trivial weight-copying solutions.

Discussion
In our framework, two design choices are crucial: (i) students do not have access to explanations at test time; and (ii) we use a machine learning model as a substitute for the student learner. These two design choices differentiate our framework from similar communication games proposed in prior work, such as that of Treviso and Martins (2020). When explanations are available at test time, they can leak the teacher output directly or indirectly, thus corrupting the simulation task. Both genuine and trivial explanations can encode the teacher output, making it difficult to discern the quality of explanations (footnote 5). The framework of Treviso and Martins (2020) is affected by this issue, which is probably only partially addressed by enforcing constraints on the student. Preventing access to explanations while testing solves this problem and offers flexibility in choosing student models.
Substituting machine learners for people allows us to train student models on thousands of examples, in contrast to a prior human study, where (human) students were trained on only 16 or 32 examples. As a consequence, the observed differences among many explanation techniques were statistically insignificant in that study. While human subject experiments are a valuable complement to scalable automatic evaluations, they have drawbacks: it is expensive to conduct sufficiently large-scale studies; people's preconceived notions might impair their ability to simulate the models accurately (footnote 6); and these preconceived notions might bias performance for different people differently.

Learning with Explanations
Our student-teacher framework does not specify how to use explanations while training the student model. Below, we examine two broad approaches to incorporate explanations: attention regularization and multitask learning. Our first approach regularizes attention values of the student model to align with the information communicated in explanations. In the second method, we pose the learning task for the student as a joint task of prediction and explanation generation, expecting to improve prediction due to the benefits of multitask learning. We show that both of these methods indeed improve student performance when using human-provided explanations (and gold labels) for classification tasks. We explore variants of these two approaches for question answering tasks.
Footnote 5: ... discussed in Chang et al. (2020) and Jacovi and Goldberg (2021).
Footnote 6: We speculate this effect to be pronounced when the models' outputs and the true labels differ only over a few samples.

Classification Tasks
The training data for the student model consists of $n$ documents $x_1, \dots, x_n$, and the outputs to be learned, $\hat{y}_1, \dots, \hat{y}_n$, come from the teacher, that is, $\hat{y}_i = f_T(x_i)$. In this work, we consider teacher explanations in the form of a binary vector $e_T(x_i)$, such that $e_T(x_i)_j = 1$ if the $j$-th token in document $x_i$ is a part of the teacher-explanation, and 0 otherwise (see Table 2 for an example).
To incorporate explanations during training, we suggest two different approaches. First, we use attention regularization, where we add a regularization term to our loss to reduce the KL divergence between the attention distribution of the student model ($\alpha_{\text{student}}$) and the distribution of the teacher-explanation ($\alpha_{\text{exp}}$). The explanation distribution $\alpha_{\text{exp}}$ is uniform over all the tokens in the explanation and takes a very small constant value elsewhere. When dealing with student models that employ multi-headed attention, which use multiple different attention vectors at each layer of the model (Vaswani et al., 2017), we take $\alpha_{\text{student}}$ to be the attention from the [CLS] token to other tokens in the last layer, averaged across all attention heads. Several past approaches have used attention regularization to incorporate human rationales, with an aim to improve the overall performance of the system for classification tasks (Bao et al., 2018; Zhong et al., 2019) and machine translation (Yin et al., 2021).
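The attention regularization term can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the direction of the KL divergence, the weight `lam`, the value of `eps`, and the renormalization of the explanation distribution are all choices made here for the example.

```python
import math

def explanation_distribution(explanation_mask, eps=1e-6):
    """Uniform mass over explanation tokens, a small constant elsewhere, renormalized."""
    n_selected = sum(explanation_mask)
    if n_selected == 0:
        raise ValueError("the explanation must select at least one token")
    raw = [1.0 / n_selected if e == 1 else eps for e in explanation_mask]
    total = sum(raw)
    return [r / total for r in raw]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def regularized_loss(task_loss, student_attention, explanation_mask, lam=1.0):
    """Task loss plus lam * KL(alpha_exp || alpha_student); direction and lam are assumed."""
    alpha_exp = explanation_distribution(explanation_mask)
    return task_loss + lam * kl_divergence(alpha_exp, student_attention)

# Hypothetical 4-token document whose explanation marks tokens 1 and 2.
mask = [0, 1, 1, 0]
aligned = [0.05, 0.45, 0.45, 0.05]     # student attends to the explanation
misaligned = [0.45, 0.05, 0.05, 0.45]  # student attends elsewhere
print(regularized_loss(0.0, aligned, mask) < regularized_loss(0.0, misaligned, mask))
```

The student attention passed in is assumed to be a softmax output, so every entry is strictly positive and the logarithm is defined.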
Second, we use explanations via multitask learning, where the two tasks are prediction and explanation generation (a sequence labeling problem). Formally, the overall loss combines the prediction objective under $p(y \mid x; \theta)$ with the explanation generation objective under $p(e \mid x; \phi, \theta)$. As in multitask learning generally, if the tasks of prediction and explanation generation are complementary, then the two tasks would benefit from each other. As a corollary, if the teacher-explanations offer no additional information about the prediction, then we would see no benefit from multitask learning (appropriately so). For most of our classification experiments, we use BERT with a linear classifier on top of the [CLS] vector to model $p(y \mid x; \theta)$. To model $p(e \mid x; \phi, \theta)$ we use a linear-chain CRF (Lafferty et al., 2001) on top of the sequence vectors from BERT. Note that we share the BERT parameters $\theta$ between classification and explanation tasks. In prior work, similar multitask formulations have been demonstrated to effectively incorporate rationales to improve classification performance (Zaidan and Eisner, 2008) and evidence extraction (Pruthi et al., 2020).
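A minimal sketch of a joint objective of this kind is given below. It makes two simplifying assumptions that should be read as such: the two negative log-likelihoods are summed with equal weight, and the linear-chain CRF is replaced by an independent per-token tagger so that the example stays self-contained. All function names are hypothetical.

```python
import math

def prediction_nll(class_probs, teacher_label):
    """Negative log-likelihood of the teacher's label under the student's class distribution."""
    return -math.log(class_probs[teacher_label])

def tagging_nll(token_probs, explanation_mask):
    """Negative log-likelihood of the binary explanation vector.

    token_probs[j] is the student's probability that token j belongs to the
    explanation; tokens are treated independently (a stand-in for the CRF).
    """
    return -sum(
        math.log(p if e == 1 else 1.0 - p)
        for p, e in zip(token_probs, explanation_mask)
    )

def multitask_loss(class_probs, teacher_label, token_probs, explanation_mask):
    """Equal-weight sum of the prediction and explanation-generation losses (assumed form)."""
    return (prediction_nll(class_probs, teacher_label)
            + tagging_nll(token_probs, explanation_mask))

# Hypothetical student outputs for a 3-token document labeled 1 by the teacher.
good = multitask_loss([0.2, 0.8], 1, [0.9, 0.1, 0.1], [1, 0, 0])
bad = multitask_loss([0.2, 0.8], 1, [0.1, 0.9, 0.9], [1, 0, 0])
print(good < bad)
```

In a neural student, both terms would be computed from shared encoder parameters, which is what lets the explanation task shape the representation used for prediction.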

Question Answering
Let the question $q$ consist of $m$ tokens $q_1, \dots, q_m$, along with a passage $x$ that provides the answer to the question, consisting of $n$ tokens $x_1, \dots, x_n$. Let $Q$ denote the set of question phrases and $P$ the set of passage phrases. We consider a subset of QED explanations (Lamm et al., 2021), which consist of a sequence of one or more "referential equality annotations" $e_1, \dots, e_{|e|}$. Formally, each referential equality annotation $e_k$ for $k = 1, \dots, |e|$ is a pair $(\phi_k, \pi_k) \in Q \times P$, specifying that phrase $\phi_k$ in the question refers to the same thing in the world as the phrase $\pi_k$ in the passage (see Table 2 for an example).
To incorporate explanations for question answering tasks, we use the two approaches discussed for text classification tasks, namely, attention regularization and multitask learning. Since the explanation format for question answering is different from the explanations in text classification, we use a lossy transformation, where we construct a binary explanation vector in which 1 corresponds to tokens that appear in one or more referential equalities and 0 otherwise. Given this transformation, neither approach uses the alignment information present in the referential equalities.
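The lossy transformation can be sketched as below. The representation is an assumption made for illustration: question and passage tokens are taken to be concatenated into one sequence, and each phrase is a half-open `(start, end)` index pair into that sequence.

```python
def equalities_to_mask(num_tokens, referential_equalities):
    """Mark every token covered by the question or passage phrase of any referential equality."""
    mask = [0] * num_tokens
    for question_span, passage_span in referential_equalities:
        for start, end in (question_span, passage_span):
            for j in range(start, end):
                mask[j] = 1
    return mask

# Hypothetical 8-token input: question phrase at tokens 0-1, passage phrase at tokens 5-6.
equalities = [((0, 2), (5, 7))]
print(equalities_to_mask(8, equalities))
```

The resulting vector records which tokens participate in some equality but not which question phrase is paired with which passage phrase, which is exactly the information that is lost.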
To exploit the alignment information provided by referential equalities, we augment the standard loss with an attention alignment loss. This loss is defined over the last-layer average attention originating from tokens in $\phi_k$ to tokens in $\pi_k$, where the average is computed across all the tokens in $\phi_k$ and across all attention heads. The underlying idea is to increase attention values corresponding to the alignments provided in explanations.
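One way such an alignment term could look is sketched below. The functional form (a negative log of the attention mass from each question phrase to its paired passage phrase, summed over equalities and scaled by a hypothetical weight `lam`) is an assumption, since the exact expression is not recoverable from the text. The attention matrix is assumed to be already averaged over heads, with `attention[i][j]` giving the attention from token `i` to token `j`.

```python
import math

def span_attention(attention, source_span, target_span):
    """Average, over source tokens, of the attention mass placed on the target span."""
    (s0, s1), (t0, t1) = source_span, target_span
    per_source = [sum(attention[i][t0:t1]) for i in range(s0, s1)]
    return sum(per_source) / len(per_source)

def alignment_loss(attention, referential_equalities, lam=1.0):
    """Assumed form: -lam * sum over k of log(attention from phi_k to pi_k)."""
    return -lam * sum(
        math.log(span_attention(attention, phi, pi))
        for phi, pi in referential_equalities
    )

# Hypothetical 4-token input: question phrase is token 0, passage phrase is tokens 2-3.
equalities = [((0, 1), (2, 4))]
uniform_row = [0.25, 0.25, 0.25, 0.25]
aligned = [[0.1, 0.1, 0.4, 0.4], uniform_row, uniform_row, uniform_row]
misaligned = [[0.4, 0.4, 0.1, 0.1], uniform_row, uniform_row, uniform_row]
print(alignment_loss(aligned, equalities) < alignment_loss(misaligned, equalities))
```

Minimizing a term of this shape pushes attention from each question phrase toward its paired passage phrase, which matches the stated idea of increasing attention values along the annotated alignments.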

Human Experts as Teachers
Below, we discuss the results of applying our framework to explanations and outputs from human teachers, to confirm whether expert explanations improve the student models' performance.

Setup
There exist a few tasks where researchers have collected explanations from experts besides the output label. For the task of sentiment analysis on movie reviews, Zaidan et al. (2007) collected "rationales" where people highlighted portions of the movie reviews that would encourage (or discourage) readers to watch (or avoid) the movie. In another recent effort, Lamm et al. (2021) collected "QED annotations" over questions and the passages from the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). These annotations contain the salient entity in the question and their referential mentions in the passage that need to be resolved to answer the question. For both these tasks, our student-learners are pretrained BERT-base models, which are further fine-tuned with outputs and explanations from human experts.
Table 4: Simulation performance (F1 score) of a student model when trained with and without explanations from human experts for question answering. We find that attention regularization and attention alignment loss result in large improvements upon incorporating explanations.

Results
Our suggested methods to learn from explanations indeed benefit from human explanations. For the sentiment analysis task, attention regularization boosts performance, as depicted in Table 3. For instance, attention regularization improves the accuracy by an absolute 6 points for 600 examples. The performance benefits, unsurprisingly, diminish with increasing training examples: for 1200 examples, attention regularization improves performance by 2.9 points. While attention regularization is immediately effective, multitask learning requires more examples to learn the sequence labeling task. We do not see any improvement using multitask learning for 600 examples, but for 900 and 1200 training examples, we see absolute improvements of 1 and 1.4 points, respectively.
We follow up on these findings to validate whether the simulation performance of the student model is correlated with explanation quality. To do so, we corrupt human explanations by unselecting the marked tokens with varying noising probabilities (ranging from 0 to 1, in steps of 0.1). We train student models on corrupted explanations using attention regularization and find their performance to be highly negatively correlated with the amount of noise (Pearson correlation $\rho = -0.72$). This study verifies that our metric is correlated with (an admittedly simple notion of) explanation quality.
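The corruption procedure and the correlation computation can be sketched as follows. The helper names and the seeded random generator are assumptions, and the numbers fed to `pearson` are synthetic placeholders chosen only to exercise the function; they are not results from the paper.

```python
import math
import random

def corrupt_explanation(mask, noise_prob, rng):
    """Unselect each marked token independently with probability noise_prob."""
    return [0 if (e == 1 and rng.random() < noise_prob) else e for e in mask]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
noise_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
corrupted = [corrupt_explanation([1, 0, 1, 1, 0], p, rng) for p in noise_levels]

# Synthetic placeholder values standing in for student accuracies at three noise levels.
print(pearson([0.0, 0.5, 1.0], [3.0, 2.0, 1.0]))
```

In the actual protocol, one student would be trained per noise level and its simulation accuracy paired with that level before computing the correlation.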
For the question-answering task, we measure the F1 score of the student model on the test set carved from the QED dataset. As one can observe from Table 4, both attention regularization and attention alignment loss improve the performance, whereas multitask learning is not effective (footnote 8). Attention regularization and attention alignment loss improve F1 score by 2.3 and 8.4 points for 500 examples, respectively. The gains decrease with increasing examples (e.g., the improvement due to attention alignment loss is 5 points on 2500 examples, compared to 8.4 points with 500 examples).
Table 5: We evaluate the effectiveness of attribution methods for sentiment analysis using simulation accuracy of student models trained with these explanations on varying amounts of data (§5.2). Each method selects top-10% "important" tokens for each example. We find attention-based explanations to be most effective, followed by integrated gradients. We also tabulate the average rank as per our metric. Statistically significant differences (p-value < 0.05) from the no-explanation control are underlined.
The key takeaway from these experiments (with explanations and outputs from human experts) is that we observe benefits with the learning procedures discussed in the previous section. This provides support for our proposal to use these methods for evaluating various explanation techniques.

Automated Evaluation of Attributions
Here, we use a machine learning model as our choice for the teacher, and subsequently train student models using the output and explanations produced by the teacher model. Such a setup allows us to compare attributions produced by different techniques for a given teacher model.

Setup
For sentiment analysis, we use BERT-base as our teacher model and train it on the IMDb dataset (Maas et al., 2011). The accuracy of the teacher model is 93.5%.
Footnote 8: We speculate that multitask learning might require more than 2500 examples to yield benefits. Unfortunately, for the QED dataset, we only have 2500 training examples.
For each explanation technique to be comparable to others, we sort the tokens as per scores assigned by a given explanation technique, and use only the top-k% tokens. This also ensures that across different explanations, the quantity of information from the teacher to the student per example is constant. Additionally, we evaluate no-explanation, random-explanation, and trivial-explanation baselines. For random explanations, we randomly choose k% tokens, and for trivial explanations, we use the first k% tokens for the positive class, and the next k% tokens for the negative class. Such trivial explanations encode the label and can achieve perfect scores for many evaluation protocols that use explanations at test time.
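The top-k% selection and the two baselines can be sketched as below. The rounding rule (round up, keep at least one token), the encoding of the positive class as label 1, and the function names are assumptions made for illustration.

```python
import math
import random

def num_selected(num_tokens, k_percent):
    """Number of tokens kept for a document (rounding rule is an assumption)."""
    return max(1, math.ceil(num_tokens * k_percent / 100))

def top_k_explanation(scores, k_percent):
    """Binary vector marking the k% highest-scoring tokens."""
    k = num_selected(len(scores), k_percent)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    chosen = set(ranked[:k])
    return [1 if j in chosen else 0 for j in range(len(scores))]

def random_explanation(num_tokens, k_percent, rng):
    """Binary vector marking k% of the tokens chosen uniformly at random."""
    chosen = set(rng.sample(range(num_tokens), num_selected(num_tokens, k_percent)))
    return [1 if j in chosen else 0 for j in range(num_tokens)]

def trivial_explanation(num_tokens, k_percent, predicted_label):
    """First k% of tokens for the positive class (label 1), the next k% otherwise."""
    k = num_selected(num_tokens, k_percent)
    start = 0 if predicted_label == 1 else k
    return [1 if start <= j < start + k else 0 for j in range(num_tokens)]

# Hypothetical 20-token document with two clearly salient tokens.
scores = [0.0] * 20
scores[3], scores[7] = 0.9, 0.8
print(top_k_explanation(scores, 10))
print(trivial_explanation(20, 10, predicted_label=0))
```

All three constructions mark the same number of tokens per document, so any difference in student accuracy reflects which tokens are marked rather than how many.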
Corresponding to each explanation type, we train 4 different student models, comprising BERT and BiLSTM based models, using outputs and explanations from the teacher. The test set of the teacher model is divided to construct train, development, and test splits for the student model. We train student models with explanations by using attention regularization and multitask learning. We vary the amount of training data available and note the simulation performance of student models. For the question answering task, we use the Natural Questions dataset (Kwiatkowski et al., 2019). The teacher model is a SpanBERT-based model that is trained jointly to answer the question and produce explanations (Lamm et al., 2021). We use the model made available by the authors. The test set of Natural Questions is split to form the training, development, and test set for the student model. We use a BERT-base QA model as our student model to evaluate the value of teacher explanations.
Table 6: Evaluating different attribution methods for sentiment analysis using the simulation accuracy of BiLSTM-based student models trained with these explanations on varying amounts of data (§5.2). We find attention-based explanations, integrated gradients, and layer conductance to be effective techniques. The rankings are largely consistent with those attained using transformer-based student models (Table 5). Statistically significant differences (p-value < 0.05) from the no-explanation control are underlined.

Main Results
We evaluate different explanation generation methods based upon the simulation accuracy of various student models for two NLP tasks: text classification and question answering.
For the sentiment analysis task, we present the simulation performance of BERT-base and BERT-large student models in Table 5, and that of BiLSTM-based student models in Table 6. From these two tables, we first note that attention-based explanations are effective, resulting in large and statistically significant improvements over the no-explanation control. We see an improvement of 1.4 to 2.6 points for transformer-based student models (Table 5), and up to 7 points for the BiLSTM student model (Table 6).
Table 7: Simulation performance (F1 score) of a student model when trained with and without explanations from the SpanBERT QA model (the teacher model in this case). We find these explanations to be effective across both the learning strategies.
While it may seem that attention is effective because it aligns most directly with the attention regularization learning strategy, we note that the trends from multitask learning corroborate the same conclusion for different student models, especially the BiLSTM student model, which does not even use the attention mechanism, and therefore cannot incorporate explanations using attention regularization. Besides attention explanations, we also find integrated gradients and layer conductance to be effective techniques. Qualitatively inspecting a few examples, we notice that attention and integrated gradients indeed highlight spans that convey the sentiment of the review. Lastly, we see that trivial explanations do not outperform the control experiment, confirming that our framework is robust to such gamification attempts. These explanations would result in a perfect score for the protocol discussed in Treviso and Martins (2020). Another previously proposed metric would be undefined in the case when 100% of the explanations trivially leak the label; in the limiting case (when all but one explanation leak the label trivially), that metric would result in a high score, which is unintended. For the question answering task, we observe from Table 7

Analysis
Here, we analyze the effect of different instantiations of our framework, namely, sensitivity to the choice of student architectures, their hyperparameters, learning strategies, and so forth. Additionally, we examine the effect of varying the percentage of explanatory tokens (k in top-k tokens) on the results obtained from our framework.

Varying Student Models and Learning Strategies
We evaluate the agreement among attribution rankings obtained using (i) different learning strategies; and (ii) different student models. We compute the Kendall rank correlation coefficient $\tau$ to measure the agreement among different attribution rankings. We report different $\tau$ values for varying combinations of student models and learning strategies in the Appendix (Table 10). The key takeaways from this investigation are twofold. First, the rank correlation between rankings produced using the two learning strategies, attention regularization (AR) and multitask learning (MTL), for the same student model is 0.64, which is considered high agreement. This value is obtained by averaging $\tau$ values from 3 different student models that can use both these learning strategies. Second, the rank correlation among rankings produced using different student models (given the same learning strategy) is also high: we report average values of 0.65 and 0.47 when we use AR and MTL learning strategies, respectively. For completeness, we also compute $\tau$ for all distinct combinations across student models and learning strategies (21 combinations in total) and obtain an average value of 0.52. Overall, we observe high agreement among different rankings attained through different instantiations of our student-teacher framework.
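The agreement computation can be illustrated with a small sketch. The text does not say which variant of Kendall's coefficient is used; the version below is tau-a and assumes rankings without ties. The example rankings are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items (no ties assumed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

def average_pairwise_tau(rankings):
    """Mean tau over all distinct pairs of rankings."""
    pairs = list(combinations(rankings, 2))
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)

# Hypothetical ranks assigned to four attribution methods by three framework instantiations.
rankings = [
    [1, 2, 3, 4],
    [1, 2, 4, 3],
    [2, 1, 3, 4],
]
print(average_pairwise_tau(rankings))
```

The reported count of 21 combinations is what seven rankings would give, since `len(list(combinations(range(7), 2)))` is 21; this would be consistent with three students trained under both strategies plus one student trained with multitask learning only, though the text does not spell this out.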

Sensitivity to Hyperparameters
We examine the sensitivity of our framework to different hyperparameter values of the student models. For BiLSTM-based student models, we perform a random search over different values of four hyperparameters: the number of embedding dimensions ($\text{ED} \in \{64, 128, 256, 512\}$), the hidden size ($\text{HS} \in \{256, 512, 768, 1024\}$), the batch size ($\text{BS} \in \{8, 16, 32, 64\}$), and the learning rate ($\text{LR} \in \{0.5 \times 10^{-3}, 1 \times 10^{-3}, 2.5 \times 10^{-3}, 0.5 \times 10^{-2}\}$). From all possible configurations above, we randomly sample 4 configurations and train a BiLSTM-with-attention student model corresponding to each configuration. The simulation accuracies of student models with different choices of hyperparameters are presented in Table 8. For a given hyperparameter configuration, we average the ranks across the two learning strategies. We compute the Kendall rank correlation coefficient $\tau$ among rankings obtained using different hyperparameter configurations (including the default configuration from Table 6, thus resulting in $\binom{5}{2} = 10$ comparisons). We obtain a high average correlation of 0.95, suggesting that our framework yields largely consistent ranking of attributions across varying hyperparameters.
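The random search over the grid above can be sketched as follows. The dictionary keys, the seed, and the helper names are hypothetical; only the grid values and the sample size of 4 come from the text.

```python
import itertools
import random
from math import comb

GRID = {
    "embedding_dim": [64, 128, 256, 512],
    "hidden_size": [256, 512, 768, 1024],
    "batch_size": [8, 16, 32, 64],
    "learning_rate": [0.5e-3, 1e-3, 2.5e-3, 0.5e-2],
}

def all_configurations(grid):
    """Every combination of one value per hyperparameter."""
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]

def sample_configurations(grid, num_samples, seed=0):
    """Sample distinct configurations uniformly at random (the seed is an assumption)."""
    return random.Random(seed).sample(all_configurations(grid), num_samples)

configs = sample_configurations(GRID, 4)
num_rankings = len(configs) + 1          # sampled configurations plus the default one
num_comparisons = comb(num_rankings, 2)  # pairwise rank-correlation comparisons
print(len(all_configurations(GRID)), num_comparisons)
```

Each sampled configuration yields one ranking of attribution methods, and the pairwise comparisons among the five rankings are what the average correlation is taken over.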

Varying the Percentage of Explanatory Tokens
To examine the effect of $k$ in selecting top-$k$% tokens, we evaluate the simulation performance of BERT-base students trained with varying values of $k \in \{5, 10, 20, 40\}$ on 2000 examples. (Note that $k$ is not a parameter of our framework, but controls the number of explanatory tokens for each attribution.) For these values of $k$, we corroborate the same trend, that is, attention-based explanations are the most effective, followed by integrated gradients (see Table 11 in the Appendix). We also perform an experiment where we consider the entire attention vector to be an explanation, as it does not lose any information due to thresholding.
Table 9: Comparing attribution methods as per the sufficiency (lower the better) and comprehensiveness metrics proposed by DeYoung et al. (2020).

Comparison With Other Benchmarks
For completeness, we compare the ranking of explanations obtained through our metrics with the existing metrics of sufficiency and comprehensiveness introduced by DeYoung et al. (2020). The sufficiency metric computes the average difference in the model output upon using the input example versus using the explanation alone, $f_T(x) - f_T(e)$, while the comprehensiveness metric is the average of $f_T(x) - f_T(x \setminus e)$ over the examples. Note that using these metrics is not ideal, as they rely upon the model output on deformed input instances that lie outside the support of the training distribution. We present these metrics for different explanations in Table 9. We observe that LIME outperforms other explanations on both the sufficiency and comprehensiveness metrics. We attribute this to the fact that LIME explanations rely on attributions from a surrogate linear model trained on perturbed sentences, akin to the inputs used to compute these metrics. The average rank correlation of rankings obtained by our metrics (across all students and tasks) with the rankings from these two metrics is moderate ($\tau = 0.39$), which indicates that the two proposals produce somewhat different orderings. This is unsurprising, as our protocol is, in principle, different from the compared metrics.
Ideally, we would like to link this comparison with some notion of user preference. This aspiration to evaluate inferred associations with users is similar to that of evaluating latent topics for topic models (Chang et al., 2009). However, directly asking users for their preference (for one explanation versus the other) would be inadequate, as users would not be able to comment upon the faithfulness of the explanation to the computation that resulted in the prediction. Instead, we conduct a study inspired by our protocol, that is, one where users simulate the model with and without explanations.
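The two comparison metrics can be sketched directly from their definitions. Here `prob_fn` is a hypothetical stand-in for the teacher: it is assumed to return the probability of the class that was predicted on the full input. The toy scorer and the single example are illustrative only.

```python
def sufficiency(prob_fn, tokens, mask):
    """Output on the full input minus output on the explanation tokens alone."""
    explanation_only = [t for t, e in zip(tokens, mask) if e == 1]
    return prob_fn(tokens) - prob_fn(explanation_only)

def comprehensiveness(prob_fn, tokens, mask):
    """Output on the full input minus output with the explanation tokens removed."""
    remainder = [t for t, e in zip(tokens, mask) if e == 0]
    return prob_fn(tokens) - prob_fn(remainder)

def mean_over_examples(metric, prob_fn, examples):
    """Average a metric over (tokens, mask) pairs."""
    values = [metric(prob_fn, tokens, mask) for tokens, mask in examples]
    return sum(values) / len(values)

# Hypothetical scorer: confident in the negative class only when "poorly" is present.
def toy_prob(tokens):
    return 0.9 if "poorly" in tokens else 0.5

examples = [(["poorly", "acted", "film"], [1, 0, 0])]
print(mean_over_examples(sufficiency, toy_prob, examples))
print(mean_over_examples(comprehensiveness, toy_prob, examples))
```

Both metrics query the model on truncated inputs (the explanation alone, or the input with the explanation deleted), which is the source of the out-of-distribution concern raised above.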

Human Students
As discussed in §2.4, it is difficult to ''train'' people using a small number of (input, output, explanation) triples to understand the model sufficiently to simulate it (on unseen examples) better than the control baseline. A recent study trained students with 16 or 32 examples and tested whether students could simulate the model better using different explanations; however, the observed differences among techniques were not statistically significant. Here, we attempt a similar human study, where we present each crowdworker with 60 movie reviews, and for 40 (out of 60) reviews we supplement explanations of the model predictions. The goal for the workers is to understand the teacher model and guess the output of the model on the 20 unseen movie reviews for which explanations are unavailable.
In our case, the teacher model accurately predicts 93.5% of the test examples. Therefore, to avoid crowdworkers conflating the task of simulation with that of sentiment prediction, we over-sample the error cases such that our final setup comprises 50% correctly classified and 50% incorrectly classified reviews. We experiment with 3 different attribution techniques: attention (as it is one of the best performing explanation techniques as per our protocol), LIME (as it is not very effective according to our metrics, but nonetheless is a popular technique), and random (for control). We divide a total of 30 crowdworkers into three cohorts, one for each explanation type. The average simulation accuracy of workers is 68.0%, 69.0%, and 75.0% using LIME, attention, and random explanations, respectively. However, given the large variance in the performance of workers in each cohort, the differences between any pair of these explanations are not statistically significant. The p-values for random vs. LIME, random vs. attention, and LIME vs. attention are 0.35, 0.14, and 0.87, respectively.
This study, similar to past human-subject experiments on model simulatability, concludes that explanations do not definitively help crowdworkers to simulate text classification models. We speculate that it is difficult for people to simulate models, especially when they see only a few fixed examples. A promising direction for future work could be to explore interactive studies, where people could query the model on inputs of their choice to evaluate any hypotheses they might conjecture.
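A pairwise comparison between two cohorts of workers, like the ones reported above, can be illustrated with a two-sided exact permutation test on per-worker simulation accuracies. The text does not state which significance test was used, so the choice of test here is an assumption, and the function name is hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def permutation_p_value(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Two-sided exact permutation test on the difference in mean accuracy.

    Enumerates every way of relabelling the pooled workers into two cohorts
    of the original sizes and returns the fraction of relabellings whose
    absolute mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    pooled_sum = sum(pooled)

    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total
```

With 10 workers per cohort this enumerates 184,756 relabellings, which is still cheap. When per-worker accuracies vary widely within a cohort, many relabellings produce a gap as large as the observed one, so the p-value stays large.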

Limitations and Future Directions
There are a few important limitations of our work that could motivate future work in this space. First, our current experiments only compare explanations that are of the same format. More work is required to compare explanations of different formats, for example, comparing natural language explanations to the top-k% highlighted tokens, or even comparing two methods to produce natural language explanations. To make such comparisons, one would have to ensure that different explanations (potentially with different formats) communicate comparable bits of information, and subsequently develop learning strategies to train student models.
Second, validating the results of any automated evaluation with human judgement of explanation quality remains inherently difficult. When people evaluate input attributions (or any form of explanations) qualitatively, they can determine whether the attributions match their intuition about what portions of the input should be important to solve the task (i.e., plausibility of explanations), but it is not easy to evaluate if the highlighted portions are responsible for the model's prediction. Going forward, we think that more granular notions of simulatability, coupled with counterfactual access to models (where people can query the model), might help people better assess the role of explanations.
Third, while we observe moderate to high agreement among attribution rankings across different student architectures and learning schemes, it is conceivable that different explanations are favored based on the choice of student model. This is a natural drawback of using a learning model for evaluation, as the measurement could be sensitive to its design. Therefore, we recommend that users average simulation results over a diverse set of student architectures, training examples, and learning strategies and, wherever possible, validate explanation quality with its intended users.
Lastly, an interesting future direction is to train explanation modules to generate explanations that optimize our metric, that is, learning to produce explanations based on the feedback from the students. To start with, an explanation generation module could be a simple transformation over the attention heads of the teacher model (as attention-based explanations are effective as per our framework). Learning explanations can be modeled as a meta-learning problem, where the meta-objective is the few-shot test performance of the student trained with intermediate explanations, and this performance could serve as a signal to update the explanation generation module using implicit gradients, as in Rajeswaran et al. (2019).
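The averaging recommended in the third limitation above can be sketched as follows. The value of an explanation method is taken to be the student's simulation accuracy when trained with explanations minus its accuracy when trained without them, averaged over students. The function names and data layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def average_gain(
    acc_with: Dict[str, Dict[str, float]],
    acc_without: Dict[str, float],
) -> Dict[str, float]:
    """Mean simulation-accuracy gain per explanation method.

    acc_with[method][student] is the student's simulation accuracy when
    trained with that method's explanations; acc_without[student] is the
    accuracy of the same student trained without explanations.
    """
    return {
        method: mean(accs[s] - acc_without[s] for s in accs)
        for method, accs in acc_with.items()
    }


def rank_methods(gains: Dict[str, float]) -> List[str]:
    """Methods sorted from largest to smallest average gain."""
    return sorted(gains, key=gains.get, reverse=True)


# Placeholder numbers for illustration only; these are NOT results from the paper.
baseline = {"student_1": 0.80, "student_2": 0.70}
with_explanations = {
    "method_A": {"student_1": 0.90, "student_2": 0.80},
    "method_B": {"student_1": 0.80, "student_2": 0.72},
}
gains = average_gain(with_explanations, baseline)
ranking = rank_methods(gains)
```

Averaging over students before ranking reduces the chance that a single student's design decides which explanation method comes out ahead.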

Related Work
Several papers have suggested simulatability as an approach to measure interpretability (Lipton, 2016; Doshi-Velez and Kim, 2017). In a survey on interpretability, Doshi-Velez and Kim (2017) propose the task of forward simulation: Given an input and an explanation, people must predict what a model would output for that instance. Chandrasekaran et al. (2018) conduct human studies to evaluate if explanations from Visual Question Answering (VQA) models help users predict the output. Recently, Hase and Bansal (2020) perform a similar human study across text and tabular classification tasks. Due to the nature of these two studies, the observed differences with and without explanation, and among different explanation types, were not significant. Conducting large-scale human studies poses several challenges, including the considerable financial expense and the logistical challenge of recruiting and retaining participants for unusually long tasks (Chandrasekaran et al., 2018). By automating students in our framework, we mitigate such challenges, and observe quantitative differences among methods in our comparisons.
Closest in spirit to our work, Treviso and Martins (2020) propose a new framework to assess explanatory power as the communication success rate between an explainer and a layperson (which can be people or machines). However, as a part of their communication, they pass on explanations during test time, which could easily leak the label, and the models trained to play this communication game can learn trivial protocols (e.g., explainer generating a period for positive examples and a comma for negative examples). This is probably only partially addressed by enforcing constraints on the explainer and the explainee. Our setup does not face this issue as explanations are not available at test time.
To counter the effects of leakage due to explanations, a Leakage-Adjusted Simulatability (LAS) metric has been proposed. This metric quantifies the difference in performance of the simulation models (analogous to our student models) with and without explanations at test time. To adjust for leakage, simulation results are averaged across two different sets of examples: those that leak the label, and those that do not. Leakage is modeled as a binary variable, which is estimated by whether a discriminator can predict the answer using the explanation alone. It is unclear how averaging the simulation results solves the problem, especially when trivial explanations leak the label. DeYoung et al. (2020) introduce the ERASER benchmark to assess how well the rationales provided by models align with human rationales, and also how faithful these rationales are to model predictions. To measure faithfulness, they propose two metrics: comprehensiveness and sufficiency. They compute sufficiency by calculating the model performance using only the rationales, and comprehensiveness by measuring the performance without the rationales. This approach violates the i.i.d. assumption, as the training and evaluation data do not come from the same distribution. It is possible that the differences in model performance are due to distribution shift rather than the features that were removed. This concern is also highlighted by Hooker et al. (2019), who instead evaluate interpretability methods via their RemOve And Retrain (ROAR) benchmark. Because the ROAR approach uses explanations at test time, it could be gamed: Depending upon the prediction, an adversarial teacher could use a different pre-specified ordering of important pixels as an explanation. Lastly, Poerner et al. (2018) present a hybrid document classification task, where the sentences are sampled from different documents with different class labels. The evaluation metric validates if the important tokens (as per a given interpretation technique) point to the tokens from the ''right'' document, that is, one with the same label as the predicted class. This protocol, too, relies on model output for out-of-distribution samples (i.e., hybrid documents), and is very task-specific.
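For reference, the leakage-adjusted averaging behind the LAS metric discussed earlier in this section can be sketched as follows. This is a minimal illustration based only on the description given here: each test example carries a binary leakage flag plus two correctness indicators for the simulator (with and without the explanation at test time). The record layout and function name are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# (leaks_label, correct_with_explanation, correct_without_explanation)
Record = Tuple[bool, int, int]


def leakage_adjusted_simulatability(records: List[Record]) -> float:
    """Average the per-group simulation gain over leaking and non-leaking examples."""
    leaking = [with_e - without_e for leak, with_e, without_e in records if leak]
    non_leaking = [with_e - without_e for leak, with_e, without_e in records if not leak]
    if not leaking or not non_leaking:
        raise ValueError("both the leaking and non-leaking groups must be non-empty")
    # Each group contributes equally, regardless of how many examples it holds.
    return (mean(leaking) + mean(non_leaking)) / 2


# Three leaking examples where the explanation helps, one non-leaking where it does not.
toy_records = [(True, 1, 0), (True, 1, 0), (True, 1, 0), (False, 0, 0)]
toy_las = leakage_adjusted_simulatability(toy_records)
```

In the toy records the unadjusted mean gain would be 0.75, while the group-balanced average is 0.5. The leaking group still contributes its full gain to the average, which is the concern raised above.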

Conclusion
We have formalized the value of explanations as their utility in a student-teacher framework, measured by how much they improve the student's ability to simulate the teacher. In our setup, explanations are provided by the teacher as additional side information during training, but are not available at test time, thus preventing ''leakage'' between explanations and output labels. Our proposed evaluation confirms the value of human-provided explanations, and correlates with a (simplistic) notion of explanation quality. Additionally, we conduct extensive experiments that measure the value of numerous previously proposed schemes for producing explanations. Our experiments result in clear quantitative differences between different explanation methods, which are consistent, to a moderate to high degree, across different choices of student architecture and learning strategy. Among explanation methods, we find attention to be the most effective. For student models, we find that both multitask and attention-regularized student learners are effective, but attention-based learners are more effective, especially in low-resource settings.

Table 11: Simulation accuracy of a BERT-base student model, examining the effect of k in selecting top-k% explanatory tokens. Student model without explanations obtains a simulation accuracy of 92.6.