QED: A Framework and Dataset for Explanations in Question Answering

A question answering system that in addition to providing an answer provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks -- post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.


Introduction
Question Answering (QA) systems can enable efficient access to the vast amount of information that exists as text (Rajpurkar et al., 2016; Reddy et al., 2019, i.a.). Modern neural systems have made tremendous progress in QA accuracy in recent years. However, they generally give no explanation or justification of how they arrive at an answer to a question. Models that, in addition to providing an answer, can explain their reasoning may have significant benefits pertaining to trust and debuggability (Doshi-Velez and Kim, 2017; Ehsan et al., 2019).
* Work done during internship at Google. † Work done at Google.
Critical questions, then, are what constitutes an explanation in question answering, and how we can enable models to provide such explanations. In an effort to make progress on these questions, in this paper we make the following contributions: (1) we introduce QED, a linguistically grounded definition of QA explanations; and (2) we describe a corpus of QED annotations based on the Natural Questions. The QED corpus has been released publicly.
Figure 1 shows a QED example. Given a question and a passage, QED represents an explanation as a combination of discrete, human-interpretable steps: (1) identification of a sentence implying an answer to the question, (2) identification of noun phrases in both the question and answering sentence that refer to the same thing, and (3) confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away.
This choice of explanation makes use of core semantic relations (referential equality and entailment) and thus has well-understood formal properties. (See Section 2 for further discussion.) In addition, we found that this way of decomposing explanations has high coverage (77% on the Natural Questions corpus). Since QED decomposes the QA process into distinct subproblems, we also believe that it should enable research directions aimed at extending or improving upon extant QA systems.
In what follows, after contextualizing the present work in the broader discussion on explainability, we present a formal definition of QED explanations. We then describe the dataset of QED annotations (7638/1353 train/dev examples), including discussion of the distribution of linguistic phenomena exhibited in the data. We move to propose four potential tasks, of varying complexity, related to the QED framework, and use the QED annotations to train and evaluate baseline models on two of these. Additionally, we describe a rater study which shows how the presence of QED explanations can help users identify errors made by an automated QA system.

Motivation: The Need for Explanations in Question Answering
We take as our departure point the following passage from Ehsan et al. (2018) concerning explainable AI:
"Explainability is important in situations where human operators work alongside autonomous and semi-autonomous systems because it can help build rapport, confidence, and understanding between the agent and its operator. In the event that an autonomous system fails to complete a task or completes it in an unexpected way, explanations help the human collaborator understand the circumstances that led to the behavior, which also allows the operator to make an informed decision on how to address the behavior."
This quote refers to AI and ML systems in general, but is highly relevant to QA systems. Explanations can help users understand and trust a QA system, and can help them to work in tandem with a QA system to fulfill their information needs. Explanations can also help system builders to understand and debug QA systems, and also to extend them.
QED makes a particular choice about the form of explanations for QA. In particular, it decomposes the question-answer relationship according to known semantic and syntactic categories: sentence, reference (and referential equality), predicate, and entailment. The explanations provided in QED are discrete structured objects, as opposed, for example, to "heat map"-style explanations (attention distributions, or other real-valued, word-level feature importance measures) (Jacovi and Goldberg, 2020).
One major goal in developing QED is to define models which provide faithful explanations; that is, explanations that in some sense truly reflect the underlying computation or reasoning performed by a question-answering model. (See Section 7 for more discussion.) Another major goal, which is closely related to faithfulness, is to develop models that have a sound basis in concepts from cognitive science and linguistics, and are thus closer to human reasoning. For example, reference, a core component of QED, is fundamental to semantics and cognition (Russell, 1905; Clark and Marshall, 1981; Tomasello et al., 2007).

Annotation Definition
We now describe the form of QED annotations. Section 3.1 gives an overview of the annotation process. Section 3.2 then gives a formal definition, which is extended in Section 3.3.

An Overview of the Approach
We will use the following example to illustrate the approach:
Question: how many seats in university of michigan stadium
Passage: Michigan Stadium, nicknamed "The Big House", is the football stadium for the University of Michigan in Ann Arbor, Michigan. It is the largest stadium in the United States and the second largest stadium in the world. Its official capacity is 107,601.
The annotator is presented with a question/passage pair. Annotation then proceeds in the following four steps:
(1) Single Sentence Selection. The annotator identifies a single sentence in the passage that entails an answer to the question, assuming that coreference and bridging anaphora (see Section 3.3) have been resolved in the sentence. In the above example, the following sentence entails an answer to the question, and would be selected by the annotator:
Its official capacity is 107,601.
This follows because given the passage context, "Its" refers to the same thing as the NP "university of michigan stadium" in the question, and the predicate in the sentence, "X's official capacity is 107,601", entails the predicate in the question "how many seats in X".
(2) Answer Selection. The annotator highlights a short answer span (or spans) in the answer sentence. In the above example the annotator would mark the following (answer shown with [=A ...]):
Its official capacity is [=A 107,601].
In addition, if the answer appears in the sentence in the form of a pronoun, bridged reference or underspecified NP, the annotator resolves the underlying coreference within the passage (see Section 3.3 for more discussion).
(3) Identification of Question-Sentence Noun Phrase Equalities. The annotator marks referentially equivalent noun phrases, or noun phrases that refer to the same thing, in the question and the answer sentence. This includes reference not only to individuals and other proper nouns, but also to generic concepts.
In our example the annotator would mark the following two noun phrases (marked with the [=1 ...] annotations) as referentially equivalent:
how many seats in [=1 university of michigan stadium]
[=1 Its] official capacity is [=A 107,601]
(4) Extraction of an Entailment Pattern. As a final, automatic step, an entailment pattern can be extracted from the annotated example by abstracting over referentially equivalent noun phrases, and the answer. In the above example the entailment pattern would be as follows:
how many seats in X
X's official capacity is ANSWER
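Because step (4) is described as automatic, it can be illustrated with a few lines of code. The following is a minimal sketch, not the released extraction procedure: it assumes annotations are written inline in the bracket notation used in the running example, and the names `entailment_pattern`, `VARIABLES` and `POSSESSIVES` are hypothetical. Rendering a possessive pronoun as "X's" is a simplifying assumption made so that the output matches the pattern shown in the example.

```python
import re

# Matches one inline span such as "[=1 Its]" or "[=A 107,601]".
# The label is either a number (a referential equality) or "A" (the answer).
SPAN = re.compile(r"\[=(\d+|A) ([^\]]*)\]")

# Hypothetical variable names; the running example only uses "X".
VARIABLES = ["X", "Y", "Z", "W"]

# Simplifying assumption: these pronouns are rendered as "<variable>'s".
POSSESSIVES = {"its", "his", "their"}


def entailment_pattern(annotated: str) -> str:
    """Abstract every annotated span into a variable or the token ANSWER."""

    def substitute(match: re.Match) -> str:
        label, text = match.group(1), match.group(2)
        if label == "A":
            return "ANSWER"
        variable = VARIABLES[int(label) - 1]
        if text.lower() in POSSESSIVES:
            return variable + "'s"
        return variable

    return SPAN.sub(substitute, annotated)


question = "how many seats in [=1 university of michigan stadium]"
sentence = "[=1 Its] official capacity is [=A 107,601]"
question_pattern = entailment_pattern(question)
sentence_pattern = entailment_pattern(sentence)
```

Applied to the running example, the two strings are reduced to the question and sentence patterns given in the text.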

A Formal Definition
An annotator is presented with a question $q$ that consists of $m$ tokens $q_1 \ldots q_m$, along with a passage $c$ consisting of $n$ tokens $c_1 \ldots c_n$.
The QED annotation is a triple $\langle s, e, a \rangle$ where:
• $s$ is a sentence within the context $c$. Specifically, $s$ is a pair $(s_0, s_1)$ indicating that the sentence spans words $c_{s_0} \ldots c_{s_1}$ inclusive.
• $e$ is a sequence of 0 or more "referential equality annotations", $e_1 \ldots e_{|e|}$. Each member of $e$ specifies that some noun phrase within the question refers to the same item in the world as some noun phrase within the sentence $s$.
• $a$ is one or more answer annotations $a_1 \ldots a_{|a|}$.
We now describe the form of the $e$ and $a$ annotations. As a preliminary step, given the passage $c$ and sentence $s$, we use $S$ to refer to the set of all phrases within $s$. Our initial definition of $S$ is
$$S = \{(i, j) : s_0 \le i \le j \le s_1\}.$$
We also define the set of question phrases $Q$ and passage phrases $C$ to be
$$Q = \{(i, j) : 1 \le i \le j \le m\}, \qquad C = \{(i, j) : 1 \le i \le j \le n\}.$$
We can then give the following definitions:
Definition 1. Each referential equality annotation $e_k$ for $k = 1 \ldots |e|$ is a pair $(\phi_k, \pi_k) \in Q \times S$, specifying that the phrase $\phi_k$ in the query refers to the same thing in the world as the phrase $\pi_k$ within $s$.
Definition 2. Each answer annotation $a_k$ for $k = 1 \ldots |a|$ is a pair $(\pi_k, \xi_k) \in S \times C$, specifying that the answer is given by phrase $\pi_k$, and the full string corresponding to $\pi_k$ after coreference is resolved is the phrase $\xi_k$. If no coreference resolution is required then $\pi_k = \xi_k$.
To illustrate the treatment of coreference resolution within answers, consider the following: Question: who won wimbledon in 2019 Passage: Simona Halep is a female tennis player. She won Wimbledon in 2019.
In this case the single sentence "She won Wimbledon in 2019" would be selected by the annotator in step 1, as once coreference is resolved, this entails the answer to the question. Within this sentence the answer span is "She". However, the answer "She" is not sufficient on its own, as it involves an unresolved anaphor. Because of this, the annotator would mark the fact that "She" refers to "Simona Halep" earlier in the passage. In this case the answer is a pair $(\pi, \xi)$ where $\pi$ corresponds to "She" within the sentence, and $\xi$ corresponds to the earlier phrase "Simona Halep".
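The formal definition maps naturally onto a small data structure. The sketch below is one possible encoding, assuming whitespace tokenization and 0-indexed inclusive spans (the text uses 1-indexed tokens); the class and field names are hypothetical and do not describe the released file format. The list of referential equalities is left empty purely for brevity, which the definition permits, and should not be read as the corpus annotation for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An inclusive (start, end) pair of 0-indexed token positions.
Span = Tuple[int, int]


@dataclass
class AnswerAnnotation:
    sentence_span: Span  # pi: the answer phrase inside the selected sentence
    resolved_span: Span  # xi: the passage phrase after coreference resolution


@dataclass
class QEDAnnotation:
    sentence: Span  # s = (s_0, s_1)
    answers: List[AnswerAnnotation]  # a: one or more answer annotations
    # e: zero or more (question phrase, sentence phrase) pairs
    equalities: List[Tuple[Span, Span]] = field(default_factory=list)


def span_text(tokens: List[str], span: Span) -> str:
    """Return the surface string covered by an inclusive span."""
    start, end = span
    return " ".join(tokens[start : end + 1])


passage_tokens = (
    "Simona Halep is a female tennis player . She won Wimbledon in 2019 ."
).split()

annotation = QEDAnnotation(
    sentence=(8, 13),  # "She won Wimbledon in 2019 ."
    answers=[AnswerAnnotation(sentence_span=(8, 8), resolved_span=(0, 1))],
)
```

Here the answer pair records both the in-sentence pronoun and the earlier phrase it resolves to, mirroring the (π, ξ) pair discussed for the Wimbledon example.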

Extending Annotations to Include Bridging
Bridging anaphora (Clark, 1975) are frequently encountered in the QA passages in our data, and in Wikipedia more broadly. This section describes an extension to include annotations of bridging anaphora. Consider a passage containing the sentence "Grace VanderWaal was announced as the winner on September 14, 2016". It is clear from the context surrounding this sentence that the noun phrase "the winner" refers to "the winner of America's Got Talent Season 11", and hence the sentence provides an answer to the question. It is helpful to imagine that there is an implicit prepositional phrase "of America's Got Talent Season 11" modifying "the winner".
Another motivating example is the following:
Question: who sang the national anthem at the first game of 2017 world series
Passage: Game 1 of the 2017 World Series: The ceremonial first pitch was thrown out by members of former Dodger Jackie Robinson's family, including his widow Rachel. The game marked the 45th anniversary of Robinson's death. Keith Williams Jr., a gospel singer, performed "The Star-Spangled Banner", the national anthem.
In this case it is clear that the sentence "Keith Williams Jr., a gospel singer, performed "The Star-Spangled Banner", the national anthem" is referring to a performance at Game 1 of the 2017 World Series, and hence that this sentence provides an answer to the question. In some sense there is an implicit prepositional phrase "at the first game of 2017 world series" modifying the entire sentence.
Recall that the set of phrases within the sentence $s$ was previously defined as
$$S = \{(i, j) : s_0 \le i \le j \le s_1\}.$$
We extend QED by redefining $S$ to include implicit phrases introduced in the form of implicit prepositional phrases, as in the "winner [of ...]" and "[at the first game ...]" examples above. The modified definition of $S$ includes all phrases of the following form:
(1) Any pair $(i, j)$ such that $s_0 \le i \le j \le s_1$, indicating the subsequence of words $c_i \ldots c_j$ within the sentence.
(2) Any triple $(i, j, p)$ such that $s_0 \le i \le j \le s_1$ and $p$ is a preposition, indicating the implicit noun phrase in the sentence that modifies the phrase $c_i \ldots c_j$ through the preposition $p$.
(3) Any pair $(\mathrm{NULL}, p)$ such that $p$ is a preposition, indicating the implicit noun phrase modifying the entire sentence $c_{s_0} \ldots c_{s_1}$ through the preposition $p$.
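The three phrase forms in the extended definition can be made concrete with a small membership check. This is an illustrative sketch only: tuples stand in for phrases, Python's `None` plays the role of NULL, token indices are 0-based, and the preposition inventory `PREPOSITIONS` is a hypothetical placeholder, since the text does not fix one.

```python
from typing import Optional, Set, Tuple, Union

# Hypothetical preposition inventory, for illustration only.
PREPOSITIONS: Set[str] = {"of", "at", "in", "on"}

ExplicitPhrase = Tuple[int, int]            # form (1): (i, j)
ImplicitModifier = Tuple[int, int, str]     # form (2): (i, j, p)
SentenceModifier = Tuple[Optional[int], str]  # form (3): (NULL, p)
Phrase = Union[ExplicitPhrase, ImplicitModifier, SentenceModifier]


def in_extended_phrase_set(phrase: Phrase, s0: int, s1: int) -> bool:
    """Check whether a phrase belongs to the extended set S for sentence (s0, s1)."""
    if len(phrase) == 2 and phrase[0] is None:
        # Form (3): implicit phrase modifying the whole sentence.
        return phrase[1] in PREPOSITIONS
    if len(phrase) == 2:
        # Form (1): explicit subsequence of the sentence.
        i, j = phrase
        return s0 <= i <= j <= s1
    if len(phrase) == 3:
        # Form (2): implicit phrase modifying the span (i, j).
        i, j, p = phrase
        return s0 <= i <= j <= s1 and p in PREPOSITIONS
    return False


sentence_tokens = (
    "Grace VanderWaal was announced as the winner on September 14 , 2016"
).split()
s0, s1 = 0, len(sentence_tokens) - 1

the_winner = (5, 6)               # "the winner"
winner_of = (5, 6, "of")          # implicit "[of ...]" modifying "the winner"
whole_sentence_at = (None, "at")  # implicit "[at ...]" modifying the sentence
```

The second and third tuples correspond to the "winner [of ...]" and "[at the first game ...]" cases discussed above.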

QED Annotations for the Natural Questions
We now describe QED annotations over the Natural Questions (NQ) dataset. We first describe the annotation process, then agreement statistics, and finally statistics on the types of referential expression. We focus on questions in the NQ corpus that have both a passage and short answer marked by the NQ annotator. We exclude examples where the passage is a table.
Figure 2:
Question: where did they film and then there were none
Wikipedia page: And_Then_There_Were_None
Passage: Filming began in July 2015. Cornwall was used for many of the harbour and beach scenes, including Holywell Bay, Kynance Cove, and Mullion Cove. Harefield House in Hillingdon, outside London, served as the location for the island mansion. Production designer Sophie Beccher decorated the house in the style of 1930s designers like Syrie Maugham and Elsie de Wolfe. The below stairs and kitchen scenes were shot at Wrotham Park in Hertfordshire. Railway scenes were filmed at the South Devon Railway between Totnes and Buckfastleigh.
A QED annotator was presented with a question/paragraph pair. In a first step, they determined whether:
(1) there is a valid short answer within the paragraph (note that they could overrule the original NQ judgment), and there is a valid QED explanation for that answer;
(2) there is a valid short answer within the paragraph, but there is no valid QED explanation for that answer (see Figure 2 for a representative example in this category, in which multiple sentences are required to justify an answer, thus violating the single-sentence assumption of QED);
(3) there is no valid short answer within the passage (hence the original NQ annotation is judged to be an error).
10% of all examples fell into category (3). Of the remaining 90% of examples which contained a correct short answer, 77% fell into category (1), and 23% fell into category (2).
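As a quick worked check, the two-stage percentages above can be combined into approximate shares of all annotated examples. The products below are derived here from the rounded figures in the text; they are not separately reported values.

```latex
\begin{align*}
\text{category (1)} &\approx 0.90 \times 0.77 = 0.693 \\
\text{category (2)} &\approx 0.90 \times 0.23 = 0.207 \\
\text{category (3)} &\approx 0.10
\end{align*}
```

Under this reading, roughly 69% of all examined examples receive a QED explanation, which corresponds to 77% of the examples that contain a correct short answer.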

Agreement Statistics
Each of the three annotators marked a common set of 100 examples drawn from the development set. Average accuracy of classification of instances was 73.9%. Average pairwise F1 on mention identification and mention alignment, conditioned on both annotators labeling instances as amenable to QED, was 88.4 and 84.1 respectively.
Table 1: Frequency distribution of per-instance referential equality counts.
Referential Link Count    0     1     2     3
Instances                 54    649   294   6

Types of Referential Expressions
The referential equality annotations are a major component of QED. Figure 3 shows some full QED examples from the corpus, and Figure 4 shows some example equalities from the corpus. In this section, in an effort to gain insight about the types of phenomena present, we describe statistics on types of referential equalities. We subcategorize referring expressions into the following types:
Proper Names: Examples are "How I met your Mother" or "the cbs television sitcom how i met your mother".
Non-Anaphoric Definite NPs: These are expressions such as "the president of the United States" or "the next Maze Runner film". The majority involve one or more common nouns (e.g., "president", "film") together with a proper name, thereby defining a new entity that is in some sense a "derivative" of the underlying proper name.
Anaphoric Definite NPs: These are definite NPs, most often from within the passage rather than the question, that require context to be interpreted. Examples are "the series" referring to an earlier mention of "the Vampire Diaries" within the passage, or "the winner" referring to "the winner of America's got Talent Season 11".
Generics: Examples are "a dead zone" in the question "what causes a dead zone in the ocean", or "Dead zones" in the passage sentence "Dead zones are low-oxygen areas caused by ...".
Pronouns: Examples are it, they, he, she.
Bridging: Referential expressions in the passage sentence that use bridging (see Section 3.3).
Miscellaneous: All referential expressions not included in the categories above.
Example (pronominal reference; subscript 1 marks a referential equality and subscript A marks the answer):
Question: how many blocks in the great pyramid of giza_1
Wikipedia page: Great_Pyramid_of_Giza
Passage: Based on these estimates, building the pyramid in 20 years would involve installing approximately 800 tonnes of stone every day. Additionally, since it_1 consists of an estimated 2.3 million_A blocks, completing the building in 20 years would involve moving an average of more than 12 of the blocks into place each hour, day and night. The first precision measurements of the pyramid were made by Egyptologist
Table 1 shows the frequency distribution of per-instance referential equality counts. Figure 5 shows an analysis of 100 referential equality annotations from QED, with a breakdown by type of referring expression in the question and passage. Proper names, non-anaphoric definites, and generics dominate expression types in the question (73, 16, and 6 examples respectively). Expressions in the sentence are more diverse, with a much greater proportion of anaphoric definites, pronouns, and bridging examples (21, 9, and 5 cases respectively).
Finally, as an indication of the difficulty of the referential equality task, we note that in only 12% of all referential equalities in the 100 examples in Figure 5 is there an exact string match (after lowercasing of both question and passage) between the question and passage referential expression.
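The exact-match statistic above is straightforward to compute. The sketch below shows one way to do so; the function name `exact_match_rate` is hypothetical, and the three pairs are illustrative stand-ins rather than entries taken from the analyzed sample of 100 equalities.

```python
from typing import Iterable, Tuple


def exact_match_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (question phrase, passage phrase) pairs that are
    identical strings after lowercasing both sides."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one referential equality is required")
    matches = sum(1 for q, p in pairs if q.lower() == p.lower())
    return matches / len(pairs)


# Illustrative pairs only (question phrase, passage phrase).
illustrative_pairs = [
    ("university of michigan stadium", "Its"),
    ("a dead zone", "Dead zones"),
    ("the national anthem", "The National Anthem"),
]
rate = exact_match_rate(illustrative_pairs)
```

Only the third pair matches after lowercasing; pronouns and number or determiner variation, as in the first two pairs, are the kind of mismatch that makes the task harder than string matching.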

Tasks and Baseline Results
We release the QED dataset with the intention to spur research into QED-based tasks and models. In this section, we introduce four potential modeling tasks using the data and describe baseline approaches and results for the first two tasks.
Figure 4: Referential equalities from the QED corpus.

Four Tasks
Each QED example is a $(q, d, c, a, e)$ tuple where $q$ is a question from the NQ corpus, $d$ is a Wikipedia page, $c$ is a long answer (typically a paragraph) within $d$, $a$ is a short answer within $c$, and $e$ is a QED explanation. We use $E$ to refer to the set of evaluation examples (either the development or test set). Such data could potentially be used in many different ways. We highlight the following four tasks, in order of increasing complexity:
Task 1: Given a $(q, d, c, a)$ 4-tuple, make a prediction $\hat{e} = f(q, d, c, a)$, where $f$ is a function that maps a $(q, d, c, a)$ tuple to an explanation. We might for example define $f(q, d, c, a) = \arg\max_{e} p(e \mid q, d, c, a; \theta)$ under some model $p(\ldots)$. The evaluation measure is then
$$\frac{1}{|E|} \sum_{(q, d, c, a, e) \in E} l_1(e, f(q, d, c, a)),$$
where $l_1(e, \hat{e})$ is a per-example evaluation measure indicating how close $\hat{e}$ is to $e$.
Task 2: Given a $(q, d, c)$ triple, predict $(\hat{a}, \hat{e}) = f(q, d, c)$, where $f$ is a function that maps a $(q, d, c)$ triple to a short-answer/explanation pair. We might for example define $f(q, d, c) = \arg\max_{a, e} p(a, e \mid q, d, c; \theta)$.
Task 3: Given a $(q, d)$ pair, predict $(\hat{c}, \hat{a}, \hat{e}) = f(q, d)$, where $f$ is a function that maps a $(q, d)$ pair to a long-answer/short-answer/explanation triple. We might for example define $f(q, d) = \arg\max_{c, a, e} p(c, a, e \mid q, d; \theta)$ under some model $p(\ldots)$. The evaluation measure is
$$\frac{1}{|E|} \sum_{(q, d, c, a, e) \in E} l_3((c, a, e), f(q, d)),$$
where $l_3$ is some per-example measure.
Task 4: As in Task 3, given a $(q, d)$ pair, predict $(\hat{c}, \hat{a}, \hat{e}) = f(q, d)$. One part of the evaluation is the same as in Task 3. But in addition, we require the explanations generated by $f(\ldots)$ to be faithful with respect to the reasoning process of the underlying model. This will require an evaluation measure for faithfulness, which is an open question beyond the scope of this paper.
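The evaluation measures for the tasks share one shape: a per-example measure averaged over the evaluation set. A minimal sketch of that shape follows, using Task 1 as the running case. The names `average_measure`, `exact_match` and `toy_predict` are hypothetical, and exact match is only a stand-in for the per-example measure, which the task definitions leave open.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Inputs = TypeVar("Inputs")
Output = TypeVar("Output")


def average_measure(
    examples: Sequence[Tuple[Inputs, Output]],
    predict: Callable[[Inputs], Output],
    per_example: Callable[[Output, Output], float],
) -> float:
    """Average a per-example measure l(gold, predicted) over an evaluation set."""
    if not examples:
        raise ValueError("the evaluation set must not be empty")
    total = sum(per_example(gold, predict(inputs)) for inputs, gold in examples)
    return total / len(examples)


def exact_match(gold: object, predicted: object) -> float:
    """Stand-in per-example measure: 1.0 for an identical output, else 0.0."""
    return 1.0 if gold == predicted else 0.0


# Toy Task 1 data: inputs are (q, d, c, a) tuples, gold outputs are explanations.
toy_examples = [
    (("q1", "d1", "c1", "a1"), "e1"),
    (("q2", "d2", "c2", "a2"), "e2"),
]


def toy_predict(inputs: Tuple[str, str, str, str]) -> str:
    # A deliberately trivial predictor that always returns the same explanation.
    return "e1"


score = average_measure(toy_examples, toy_predict, exact_match)
```

The same function covers Tasks 2 and 3 by changing what counts as inputs and outputs, for example (q, d) inputs and (c, a, e) outputs for Task 3.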
Accurate models for Tasks 1, 2, and 3, even if they do not generate faithful explanations (Task 4), may still have considerable utility. However, faithful models have several desirable characteristics (see Section 7); we view them as a major avenue for future work.
In the remainder of this section we describe results for baseline models on Tasks 1 and 2. The intention here is to establish baseline results as a reference point for future work on QED models and to get an idea of tractability of recovery of QED annotations.

A Baseline Model for Task 1
Our baseline model for Task 1 is an extension of the recently proposed coreference resolution model of Joshi et al. (2019) and Lee et al. (2017). We present two variations on the model: the first is trained on coreference data alone, and the second is trained on coreference data and then fine-tuned on QED annotations.

The Coreference Resolution Model
We give a brief recap of the approach of Joshi et al. (2019) and Lee et al. (2017). Given some document $d$ and a candidate mention $x$, corresponding to a span within $d$, define $\mathcal{Y}(x)$ to be the set of potential antecedents for $x$. Each antecedent is either a span in the document with start-point before $x$ in the document, or a null antecedent signifying that $x$ does not have an antecedent. We can then define a distribution over the antecedent spans $\mathcal{Y}(x)$ as
$$p(y \mid x, d) = \frac{e^{s(x, y)}}{\sum_{y' \in \mathcal{Y}(x)} e^{s(x, y')}}, \qquad s(x, y) = s_m(x) + s_m(y) + s_c(x, y),$$
where the scores are computed from span representations $g_x$ and $g_y$, obtained by concatenating the SpanBERT representations of the first and last token in each mention span. The scoring functions $s_m$ and $s_c$ represent mention and joint span match scores respectively. Lee et al. (2017) describe a method for training the model based on log-likelihood, and a beam search method that uses the scores $s_m(\ldots)$ to filter mentions and antecedents. The final output from the model is a hard clustering of the potential mentions into coreference clusters.
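The antecedent distribution is a softmax over pairwise scores, which the following sketch makes explicit. It is an illustration rather than the baseline's implementation: scores are passed in directly instead of being computed from SpanBERT representations, the function name `antecedent_distribution` is hypothetical, and fixing the null antecedent's score at 0 follows the convention of Lee et al. (2017) and is treated here as an assumption.

```python
import math
from typing import Dict, Hashable


def antecedent_distribution(scores: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Softmax over antecedent scores s(x, y) for a single mention x.

    A score of -inf gives the antecedent probability 0, which is how
    disallowed links can be ruled out.
    """
    top = max(scores.values())
    if top == -math.inf:
        raise ValueError("at least one antecedent must have a finite score")
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Toy scores for one mention: a null antecedent (score 0 by assumption),
# one plausible antecedent, and one antecedent masked out with -inf.
toy_scores = {"NULL": 0.0, "span_a": 2.0, "span_b": -math.inf}
distribution = antecedent_distribution(toy_scores)
```

In this toy case the masked span receives zero probability and the remaining mass is split between the null antecedent and `span_a` according to their score difference.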

The Model Applied to Task 1
Assume an example contains a question $q$ of $m$ tokens $q_1 \ldots q_m$ and a passage $c$ consisting of $n$ tokens $c_1 \ldots c_n$. We denote the title of the Wikipedia page separately as the sequence $t$ of $k$ tokens $t_1 \ldots t_k$. The model considers the concatenation of these token sequences as an input document. (We simply use [S1] = "." and [S2] = "?" as separators.) The model is tasked with predicting the referential equality annotations $e = e_1 \ldots e_{|e|}$ in the QED annotation. We do assume that the NQ short answer is also an input to the model, used to restrict the position of referential equality annotations in the passage; we describe this restriction below.
Table 2: SpanBERT model performance for Task 1: recovering QED annotations when the correct answer is given.
QED referential equality annotations are of two types: (1) coreferential links between noun phrases in the question and in the passage, and (2) coreferential links between a noun phrase in the question and an implicit argument in the passage. We observe that many implicit arguments link to the title of the passage, so we model the latter annotation type as a coreferential link between the question mention and the title span $t_1 \ldots t_k$. In the untrained baseline, we restrict $s_m$ to only score mentions in the sentence containing the answer. In both models we restrict $s_c$ to only score coreferential links between the query and the passage or between the query and the title (all other values for $s_m$ or $s_c$ are set to $-\infty$).
We finally post-process the cluster outputs as follows: for each cluster we output the first cluster mention in the question paired with the first cluster mention in the passage. If there is no cluster mention in the passage, then we output the question mention paired with an implicit argument.
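The post-processing rule can be written down in a few lines. The sketch below assumes each mention records which part of the input it came from (question, passage, or title) along with its token offsets. The `Mention` type, the use of `None` for an implicit argument, and the choice to skip clusters with no question mention are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    segment: str  # "question", "passage" or "title"
    start: int    # index of the first token within the segment
    end: int      # index of the last token within the segment (inclusive)


# An implicit argument is represented by None in this sketch.
Equality = Tuple[Mention, Optional[Mention]]


def clusters_to_equalities(clusters: List[List[Mention]]) -> List[Equality]:
    """Pair the first question mention of each cluster with its first
    passage mention, or with an implicit argument if there is none."""
    equalities: List[Equality] = []
    for cluster in clusters:
        in_question = [m for m in cluster if m.segment == "question"]
        in_passage = [m for m in cluster if m.segment == "passage"]
        if not in_question:
            continue  # assumption: such clusters yield no equality
        first_question = min(in_question, key=lambda m: m.start)
        first_passage = (
            min(in_passage, key=lambda m: m.start) if in_passage else None
        )
        equalities.append((first_question, first_passage))
    return equalities


toy_clusters = [
    # A cluster linking a question mention to two passage mentions.
    [Mention("passage", 30, 30), Mention("question", 4, 7), Mention("passage", 0, 1)],
    # A cluster linking a question mention to the title only.
    [Mention("question", 0, 1), Mention("title", 0, 2)],
]
toy_equalities = clusters_to_equalities(toy_clusters)
```

The second toy cluster illustrates the title-link case: since no passage mention is present, the question mention is paired with an implicit argument.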
For the untrained baseline, we did not use expert annotated QED data but instead used the CoNLL OntoNotes coreference dataset (Pradhan et al., 2012) to train the pretrained SpanBERT model. For the fine-tuned baseline, we further trained the model with the training portion of QED data converted into coreference format. We used SpanBERT "large", with a maximum span width of 16 tokens, a top span ratio of 0.2, and a maximum of 30 antecedents per mention. In fine-tuning, we used an initial learning rate of $3 \cdot 10^{-4}$ and trained for 3 epochs on the QED training set.
We evaluate both mention identification (the identification of individual referential expressions in the question and passage) and referential equality detection (the identification of pairs of referential expressions). We compute precision, recall, and F1 measure in both cases. Evaluation results are reported in Table 2. The table shows results for both the zero-shot model, trained on coreference data alone, and a fine-tuned model, which is fine-tuned on QED annotations.
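Precision, recall and F1 over predicted and gold items can be sketched as set operations. The example below treats a prediction as correct only when it matches a gold item exactly. That matching criterion, like the function name `precision_recall_f1`, is an assumption for illustration and may differ from the official evaluation code.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 under exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Mention identification: items are (start, end) spans.
predicted_mentions = {(0, 1), (3, 4)}
gold_mentions = {(0, 1), (5, 6), (7, 7)}
mention_scores = precision_recall_f1(predicted_mentions, gold_mentions)

# Referential equality detection: items are pairs of spans.
predicted_links = {((0, 1), (8, 8))}
gold_links = {((0, 1), (8, 8))}
link_scores = precision_recall_f1(predicted_links, gold_links)
```

Using the same function for both settings reflects that they differ only in what counts as an item: a single span for mention identification, and a pair of spans for equality detection.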

A Baseline Model for Task 2
Our baseline model for Task 2 is a straightforward extension of the baseline model for Task 1. We build a model $p(a, e \mid q, d, c; \theta)$ that factors into two components, $p^{(1)}$ and $p^{(2)}$, where $p^{(1)}$ is an existing QA model (similar to Alberti et al. (2019)), and $p^{(2)}$ is the baseline model for Task 1. Thus we simply compose an existing question-answering model with an answer-agnostic model that recovers explanations.
The answer scoring component of the model computes answer candidate representations $g_z$ in the same way as the Task 1 baseline computes mention representations. The score of an answer $z$ is then computed as $s_a(z) = \mathrm{FFNN}_a(g_z)$.
Mention representations are shared between $p^{(1)}$ and $p^{(2)}$, so the only new parameters belong to a single hidden layer feed-forward net $\mathrm{FFNN}_a$ that computes the answer score for each mention. No further dependence is introduced between the answer and explanation predictions. We train $p^{(1)}$ and $p^{(2)}$ in a multitask fashion, by minimizing the weighted sum of the question answering and coreference cross entropy losses. Our best results are obtained with a weight of 5 on the coreference loss and 2 epochs of training. The best answer accuracy and QED F1 are obtained for different base learning rates of $2 \cdot 10^{-5}$ and $5 \cdot 10^{-5}$ respectively.
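The multitask objective is a weighted sum of two cross-entropy terms, sketched below on toy probabilities. The weight of 5 on the coreference loss comes from the text; the weight of 1 on the question answering loss is an assumption, as are the function names and the toy distributions. A real implementation would compute these losses over batches inside a deep learning framework.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[float], gold_index: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probabilities[gold_index])


def multitask_loss(
    qa_probabilities: Sequence[float],
    qa_gold: int,
    coref_probabilities: Sequence[float],
    coref_gold: int,
    coref_weight: float = 5.0,  # weight reported for the best results
    qa_weight: float = 1.0,     # assumption: the QA loss is left unweighted
) -> float:
    """Weighted sum of the QA and coreference cross-entropy losses."""
    qa_loss = cross_entropy(qa_probabilities, qa_gold)
    coref_loss = cross_entropy(coref_probabilities, coref_gold)
    return qa_weight * qa_loss + coref_weight * coref_loss


# Toy distributions over two answer candidates and two antecedents.
toy_loss = multitask_loss([0.5, 0.5], 0, [0.5, 0.5], 1)
```

With both toy distributions uniform over two options, each cross-entropy term equals log 2, so the combined loss is six times that value under the assumed weights.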

Results
In Table 3 we report results for Task 2 for three separate variations of the approach described in the previous section. QED-only fine-tunes $p^{(2)}$ on the QED training set only. QA-only fine-tunes $p^{(1)}$ on all the paragraphs of the NQ dataset that contain a short answer. QA+QED fine-tunes both $p^{(1)}$ and $p^{(2)}$ on all NQ and QED data.
(Official evaluation code will be released with the dataset.)

Rater Study
A system which makes use of QED explanations to answer a question is one which decomposes its reasoning process into human-interpretable chunks. We hypothesize that exposing QED explanations should improve a user's ability to spot errors made by an automated QA system. To this end, we evaluate QED explanations using a rater study.

Task Setup
Given a question, passage, and a candidate answer span, raters were tasked with assessing whether the candidate answer was correct or incorrect, and indicating the confidence of their assessment. We obtained the data for the study by taking a random set of 50 correct answers and 50 incorrect guesses from the NQ baseline model on the Natural Questions dev set. So as to ensure that the task was sufficiently challenging, correct instances were the gold answer spans on question/passage pairs where the model produced a false negative. Incorrect instances were false positive guesses from the model.
A total of 354 raters, all of whom are US residents and native English speakers, were divided into three disjoint pools to perform the task in three distinct test settings:
• The None group of raters (n=121) was presented with a question, passage, and a highlighted answer span.
• The Sentence group (n=117) was provided with additional highlighting of the sentence containing the answer, with no distinction made between referential equalities and predicates.
• The QED group (n=116) was provided with additional highlighting to indicate referential equalities between spans in the question and spans in the passage.
On average, a given rater provided judgments for 41 questions.
In each case, raters were told that highlighting was the output of "an automated question answering system" that was incorrect "about half of the time." Where explanations were present, they were manually imputed to simulate the inferences of a hypothetical model that used a QED-style reasoning process. Additionally, raters were told that the system made use of the highlighted information to produce its candidate answers.

Results
Average rater accuracies for each test setting are presented in Table 4. We see that, in aggregate, QED explanations improved accuracy on the task over and above the other test settings, and gave the most improvement on the identification of answers that were incorrect. These improvements translate to incorrect answers resulting from both predicate and reference model errors.
Somewhat surprisingly, highlighting just the sentence containing the answer improved accuracy more than including referential equality highlighting on instances that were correct. This is likely because raters' propensity to mark instances correct decreases as the complexity of explanations increases, from None (73.1%) to Sentence (72.6%) to QED (70.5%).
Also clear from Table 4 is that rater accuracy is much lower on incorrect instances. Even though raters were told that the answers presented were incorrect half of the time, they marked the model guess as correct roughly 71% of the time. Figure 6 provides another perspective on the disparity in judgments on correct/incorrect instances summarized in Table 4. The instances receiving the highest accuracy in the incorrect pool are harder for raters on average than most of the correct instances, and the lowest accuracy on incorrect instances is far lower than that of any of the correct instances. The wide distribution of accuracies on incorrect instances (σ≈0.50) seen in Figure 6 was also reflected in the rater pool (σ≈0.45). The challenging nature of incorrect instances speaks to the promise of improvements from QED explanations.

Effectiveness of explanations
How statistically significant are the results reported in Table 4? The 14,115 test instances were spread across 354 raters and 100 questions. To control for the correlations induced by the rater and question groups, we fit a generalized linear mixed model (GLMM) using the rstanarm R package (Goodrich et al., 2020). We used the formula a ∼ c * e + (1|r) + (1|q), where a is whether or not the rater accurately marked the instance; c is whether the instance was Correct or Incorrect; e is the explanation test setting of None, Sentence, or QED; r is the rater id; and q is the question id. This formula specifies a regression of the log-odds of the rater accuracy on the fixed effects of instance correctness (c), explanation setting (e), and their interaction, while allowing for random effects in the raters (r) and questions (q). Ultimately we are interested in the magnitude and statistical properties of e under the various test settings.
Table 5 shows the fixed effect coefficients and standard deviations for each setting. The presence of QED explanations in the Incorrect setting increased the log-odds of rater accuracy by 0.25, with a posterior predictive p-value of 0.015 that this effect is greater than zero. The comparable effect for Sentence explanations was 0.15, with a posterior predictive p-value of 0.08. The rater and question random effects had standard deviations of 0.63 and 0.90 respectively, reflecting again the high variance of questions shown in Figure 6. As we saw earlier, the effects of explanations in the Correct setting were reversed: the Sentence explanations caused a small, statistically insignificant increase in log-odds, while QED explanations caused a statistically significant drop in log-odds.
Figure 6: Sorted, per-question evaluation accuracies from different rater study settings, with 95% binomial confidence intervals. Left three plots correspond to trials with incorrect answers highlighted; right three plots to trials with correct answers highlighted. Dashed red lines correspond to the average accuracy for each setting, identical to the numbers in Table 4.
11 Raters were not trained to do well on the task, as we aimed to approximate how users interact with automated QA systems.
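To give a feel for the size of the log-odds effects reported in this section, the sketch below converts them into odds ratios and into the accuracy implied for a given baseline. The effect sizes 0.25 and 0.15 come from the text; the baseline accuracy of 0.30 is a hypothetical value chosen only for illustration, and the helper names are not from the paper. Because the model includes rater and question random effects, the result is best read as the shift for a typical rater and question, not as a population average.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def shifted_accuracy(baseline: float, delta_log_odds: float) -> float:
    """Accuracy after adding a log-odds effect to a baseline accuracy."""
    return inv_logit(logit(baseline) + delta_log_odds)


# Effects on incorrect instances, as reported in the text.
EFFECTS = {"QED": 0.25, "Sentence": 0.15}

# Hypothetical baseline accuracy without explanations (an assumption).
ASSUMED_BASELINE = 0.30

for name, delta in EFFECTS.items():
    odds_ratio = math.exp(delta)
    new_acc = shifted_accuracy(ASSUMED_BASELINE, delta)
    print(f"{name}: odds ratio = {odds_ratio:.2f}, "
          f"accuracy {ASSUMED_BASELINE:.2f} -> {new_acc:.3f}")
```

An effect of 0.25 in log-odds multiplies the odds of an accurate judgment by exp(0.25), roughly 1.28, whatever the baseline; the change in accuracy itself depends on where the baseline sits.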

QED and strong explainability
What constitutes a good explanation remains an open question (Lipton, 2001). A major inflection point in the discussion is the notion of faithfulness (Ross et al., 2017; Lipton, 2016). We say a model's explanations are faithful when there is a causal relationship between an explanation and a prediction. That is, when an explanation changes, the outputs change accordingly. When this is not true, we say a model generates rationales, which have the appearance of justifying its outputs, but without causal guarantees (Ehsan et al., 2018).
While the models described in Section 5 fall into the latter category, we believe QED is a promising framework for strongly explainable QA. This is due in large part to its commitment to the cognitive reality of reference and entailment. We can say, definitively, that in order for a sentence to answer a question about a thing, its meaning must involve that thing in a very particular sense. Posed counterfactually, when you break referential equality, you break answerhood, and the same argument follows for predicate entailment. Unlike other intelligent behavior that may permit post-hoc rationalization at best (Ehsan et al., 2019), certain forms of high-level linguistic reasoning are in fact amenable to strong explanation.

Potential Extensions to the QED Framework
QED sits between relatively unstructured explanation forms on the one hand, such as attention distributions (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019; Mohankumar et al., 2020) or sequential outputs (Camburu et al., 2018, 2019; Narang et al., 2020; Kumar and Talukdar, 2020), and more elaborate, discrete semantic representations that can in theory be applied to explainable QA on the other (Abzianidze et al., 2017; Wolfson et al., 2020). The version of QED presented here is a broad-coverage yet limited instantiation of a framework in which explanations are semantic relations whose substructures are defined in terms of formally motivated linguistic categories. In keeping with its modularity, however, we can extend QED to cover cases it does not yet handle by looking to semantic relations beyond referential equality and predicate entailment, such as set-membership noun phrase (Hearst, 1992) and interclausal (Miltsakaki et al., 2004; Lamm et al., 2018; Tandon et al., 2019) relations.

Future uses of QED representations
Our hope is that QED representations may be useful in a variety of extensions to extant QA systems. Some examples are as follows.
Ambiguous Questions. Consider again the question in Figure 1, "who wrote the film howl's moving castle". Now consider the question "who wrote howl's moving castle". In this case there are two possible answers, depending on whether the author of the question is referring to the film or novel. It would be natural for a system to provide two possible answers (see, e.g., Min et al., 2020), with two possible QED explanations highlighting the differing assumptions underlying each answer. Such referential ambiguities are common, and the centrality of referential equality in QED annotations should mean that they are useful in this scenario.
Complex Referential Equalities. Consider the question "meaning of whiskey in the jar by metallica". The Wikipedia page for "Whiskey in the Jar" says the following:
Passage: "Whiskey in the Jar" is an Irish traditional song set in the southern mountains of Ireland. The song, about a rapparee (highwayman) who is betrayed by his wife or lover, is one of the most widely performed traditional Irish songs and has been recorded by numerous artists since the 1950s.
A good answer could be that the song is "about a rapparee . . . who is betrayed by his wife or lover", assuming that the Metallica song is a variant of the Irish traditional song. Thus the validity of this answer hinges on a complex referential equality, between the Metallica version and the original. Examples that require this type of complex referential reasoning are quite common, and the centrality of reference in QED should be relevant.
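The ambiguous-question scenario above can be made concrete with a small data-structure sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the representation is not the released QED annotation format, the two supporting sentences are written for this example and not drawn from the corpus, and predicate entailment is omitted to keep the sketch short. The point is only that two candidate answers can each carry an explanation whose referential equality exposes which reading of the question it assumes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ReferentialEquality:
    """A question phrase and a sentence phrase taken to co-refer."""
    question_phrase: str
    sentence_phrase: str


@dataclass(frozen=True)
class QEDExplanation:
    """Hypothetical container: answer, supporting sentence, equalities."""
    answer: str
    sentence: str
    equalities: Tuple[ReferentialEquality, ...]


def referent_of(explanation: QEDExplanation, question_phrase: str) -> Optional[str]:
    """Return the sentence phrase that a question phrase is equated with."""
    for eq in explanation.equalities:
        if eq.question_phrase == question_phrase:
            return eq.sentence_phrase
    return None


def contrast(question_phrase: str,
             explanations: Iterable[QEDExplanation]) -> Dict[str, Optional[str]]:
    """Map each candidate answer to the reading of the phrase it assumes."""
    return {e.answer: referent_of(e, question_phrase) for e in explanations}


# Illustrative sentences written for this sketch, not taken from the corpus.
film_reading = QEDExplanation(
    answer="Hayao Miyazaki",
    sentence="The film was written and directed by Hayao Miyazaki.",
    equalities=(ReferentialEquality("howl's moving castle", "The film"),),
)
novel_reading = QEDExplanation(
    answer="Diana Wynne Jones",
    sentence="The novel was written by Diana Wynne Jones.",
    equalities=(ReferentialEquality("howl's moving castle", "The novel"),),
)

for answer, reading in contrast("howl's moving castle",
                                [film_reading, novel_reading]).items():
    print(f"{answer}: assumes the question refers to '{reading}'")
```

Showing the assumed referent next to each answer is what lets a user see why the two answers differ, without having to read both passages in full.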

Conclusions
We have described QED, a framework for explanations in question answering, and we have introduced a corpus of QED annotations. The framework is grounded in referential equality and entailment. In addition, we have described baseline models for two QED-based tasks, and a rater study utilizing QED annotations.
Future work should consider the development of models that provide faithful explanations based on QED; extensions of QED, for example to handle multi-sentence inference or referential phenomena going beyond equality; and applications of QED, for example to sentences with multiple potential answers, to questions that are vague or underspecified, or to referential equalities that require significant inference to be justified.