A question answering system that in addition to providing an answer provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility, and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks: post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.

Question answering (QA) systems can enable efficient access to the vast amount of information that exists as text (Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Clark et al., 2019; Reddy et al., 2019, among others). Modern neural systems have made tremendous progress in QA accuracy in recent years (Devlin et al., 2019). However, they generally give no explanation or justification of how they arrive at an answer to a question. Models that in addition to providing an answer can explain their reasoning may have significant benefits pertaining to trust and debuggability (Doshi-Velez and Kim, 2017; Ehsan et al., 2019).

Critical questions then, are what constitutes an explanation in question answering, and how we can enable models to provide such explanations. In an effort to make progress on these questions, in this paper we (1) introduce QED,1 a linguistically grounded definition of explanations for extractive QA; and (2) describe an expert-annotated corpus of QED annotations based on the Natural Questions (Kwiatkowski et al., 2019) dataset. The QED corpus has been released publicly.2

Figure 1 shows a QED example. Given a question and a passage, QED represents an explanation as a combination of discrete, human-interpretable steps: (1) identification of a sentence implying an answer to the question, (2) identification of noun phrases in both the question and answering sentence that refer to the same thing, and (3) confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away.

Figure 1:

QED explanations decompose the question-passage relationship in terms of referential equality and predicate entailment.


This choice of explanation makes use of core semantic relations (referential equality and entailment) and thus has well-understood formal properties. In addition, we found that this way of decomposing explanations has high coverage (77% on the Natural Questions corpus3) and can be readily extended to other forms of question answering (see Section 6). Since QED decomposes the QA process into distinct subproblems, we also believe that it should enable research directions aimed at extending or improving upon extant QA systems.

In what follows, we present a definition of QED explanations. We then describe the dataset of QED annotations (7638/1353 train/dev examples), including discussion of the distribution of linguistic phenomena exhibited in the data. We move to propose four potential tasks, of varying complexity, related to the QED framework, and use the QED annotations to train and evaluate different models on two of these. Additionally, we describe a rater study which shows how the presence of QED explanations can help users identify errors made by an automated QA system.

We now describe the form of QED annotations. The treatment in this section is somewhat informal; for formal definitions see Appendix A.

### 2.1 Basic Definitions

We will use the following example to illustrate the approach:

Question: how many seats in university of michigan stadium

Passage: Michigan Stadium, nicknamed “The Big House”, is the football stadium for the University of Michigan in Ann Arbor, Michigan. It is the largest stadium in the United States and the second largest stadium in the world. Its official capacity is 107,601.

The annotator is presented with a question/passage pair. Annotation then proceeds in the following four steps:

#### (1) Single Sentence Selection.

The annotator identifies a single sentence in the passage that entails an answer to the question assuming that coreference and bridging anaphora (see later in this section) have been resolved in the sentence.4

In the above example, the following sentence entails an answer to the question, and would be selected by the annotator:

Its official capacity is 107,601.

This follows because given the passage context, “Its” refers to the same thing as the NP “university of michigan stadium” in the question, and the predicate in the sentence, “X’s official capacity is 107,601”, entails the predicate in the question “how many seats in X”.

#### (2) Answer Selection.

The annotator highlights a short answer span (or spans) in the answer sentence. In the above example the annotator would mark the following (answer shown with [=A …]):

Its official capacity is [=A 107,601].

#### (3) Identification of Question–Sentence Noun Phrase Equalities.

The annotator marks referentially equivalent noun phrases, or noun phrases that refer to the same thing, in the question and the answer sentence. This includes reference not only to individuals and other proper nouns, but also to generic concepts.

In our example the annotator would mark the following two noun phrases (marked with the [=1 …] annotations) as referentially equivalent:

how many seats in [=1 university of michigan stadium]

[=1 Its] official capacity is [=A 107,601].

#### (4) Extraction of an Entailment Pattern.

As a final, automatic step, an entailment pattern can be extracted from the annotated example by abstracting over referentially equivalent noun phrases. In the above example the entailment pattern would be as follows:

how many seats in X

X’s official capacity is [=A 107,601].
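Step (4) can be sketched in a few lines of Python. The sketch is illustrative only: it assumes annotations are written inline in the bracket notation used above, and the helper name `abstract_equalities` and the variable alphabet are hypothetical rather than part of the QED release. It replaces each numbered span with a variable and leaves the answer marker untouched. Unlike the pattern shown above, it does not restore the possessive on "Its", so the sentence side comes out as "X official capacity ...".

```python
import re

# Matches numbered equality annotations such as "[=1 Its]". The answer marker
# "[=A ...]" and bridge marker "[=B ...]" do not match and are left in place.
EQUALITY = re.compile(r"\[=(\d+) ([^\[\]]*)\]")
VARIABLES = "XYZWVU"  # hypothetical variable names, one per equality index


def abstract_equalities(annotated: str) -> str:
    """Replace every numbered referential-equality span with its variable."""
    def to_variable(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        return VARIABLES[index - 1]
    return EQUALITY.sub(to_variable, annotated)


question = "how many seats in [=1 university of michigan stadium]"
sentence = "[=1 Its] official capacity is [=A 107,601]."
print(abstract_equalities(question))  # how many seats in X
print(abstract_equalities(sentence))  # X official capacity is [=A 107,601].
```

Keeping the answer marker in the output makes the remaining text readable as a predicate pair: everything that is neither a variable nor the answer is what must stand in the entailment relation.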

### 2.2 Two Extensions

There are two extensions to the above approach: coreference in answers, and bridging in referential equalities.

#### Coreference in Answers.

Consider the following example:

Question: who won wimbledon in 2019

Passage: Simona Halep is a female tennis player. She won Wimbledon in 2019.

In this case the single sentence “She won Wimbledon in 2019” would be selected by the annotator in step 1, because once coreference is resolved, it entails the answer to the question. The QED annotation would be as follows:

who won [=1 wimbledon] in [=2 2019]

[=A She (=C Simona Halep)] won [=1 Wimbledon] in [=2 2019]

In this case the answer “She”—the substring in the original sentence—is not sufficient, as it involves an unresolved anaphor. Because of this, the annotator would mark the fact that “She” refers to “Simona Halep” earlier in the passage, using the (=C…) notation.

#### Bridging in Referential Equalities.

Bridging anaphora (Clark, 1975) are frequently encountered in the QA passages in our data, and in Wikipedia more broadly. Consider the following:

Question: who won america’s got talent season 11

Passage: The 11th season of America’s Got Talent, an American talent show competition, began broadcasting in the United States during 2016. Grace VanderWaal was announced as the winner on September 14, 2016.

It is clear from the context surrounding the sentence “Grace VanderWaal was announced as the winner on September 14, 2016” that the noun phrase “the winner” refers to “the winner of America’s Got Talent Season 11”, and hence the sentence provides an answer to the question. It is helpful to imagine that there is an implicit prepositional phrase “of America’s Got Talent Season 11” modifying “the winner”. In this case the annotation would be the following:

who won [=1 america’s got talent season 11]

[=A Grace VanderWaal] was announced as [=B the winner (of =1)] on September 14, 2016.

Here the annotation [=B the winner (of =1)] indicates that the phrase marked [=1 …] in the query is a bridged modifier to the phrase “the winner”, through the preposition “of”.

Sometimes, there is no phrase like “the winner” above, but the referent is clearly an implicit argument of the supporting sentence. In this case we treat it as a bridge into the entire sentence.

### 2.3 A Note on Terminology

In defining QED we use the terms “predicate” and “entailment” in ways that may seem unfamiliar, but are not unrelated to the typical senses of those terms in linguistics. Canonically speaking, one thinks of a predicate as the semantic correlate of a verb in a sentence, and usually containing information about its argument structure. By taking a less structured notion of the term, as everything in a sentence surrounding a set of salient referring expressions, we are able to strike a balance between completely unstructured text, and more elaborate, structured representations that tend to be brittle.

The sense of entailment we intend then follows from this definition. A sentence entails an answer to a question if, having resolved and abstracted away referential equalities between the two, one can identify an answer to the question in the sentence.

We now describe QED annotations over the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). We first describe the annotation process, then describe agreement statistics and statistics of types of referential expressions. For discussion of the assumptions we make and future extensions to QED, please see Section 6.3.

We focus on questions in the NQ corpus that have both a passage and short answer marked by the NQ annotator. We exclude examples where the passage is a table. A QED annotator was presented with a question/paragraph pair. Before performing the core QED annotation, annotators first determine whether: (1) there is a valid short answer within the paragraph (note that they can overrule the original NQ judgment), and there is a valid QED explanation for that answer; (2) there is a valid short answer within the paragraph, but there is no valid QED explanation for that answer; or (3) there is no valid short answer within the passage (hence the original NQ annotation is judged to be an error). Ten percent of all examples fell into category (3). Of the remaining 90% of examples that contained a correct short answer, 77% fell into category (1), and 23% fell into category (2).

Three QED annotators5 annotated 7638 training examples (5154/1702/782 in categories 1/2/3 respectively), and 1353 dev examples (1019/183/151 in categories 1/2/3), without replication. We estimate that annotators averaged approximately 2 minutes per instance. Additionally, early stages of annotation consisted of regular adjudication among annotators to establish a consensus on QED’s guidelines.
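As a quick consistency check, the category shares quoted above can be recomputed from the per-split counts. The short Python sketch below pools the training and development sets, which is an assumption about how the percentages were obtained; under that assumption the counts give roughly 10% for category (3) and roughly 77% for category (1) among examples with a valid short answer.

```python
# Category counts (1/2/3) for each split, as reported in the text.
train = (5154, 1702, 782)
dev = (1019, 183, 151)

# Pool the two splits (an assumption about how the shares were computed).
combined = tuple(t + d for t, d in zip(train, dev))
total = sum(combined)

# Category (3): no valid short answer in the passage.
no_answer_share = combined[2] / total
# Category (1) among examples that do contain a valid short answer.
explained_share = combined[0] / (combined[0] + combined[1])

print(f"total examples: {total}")
print(f"category 3 share: {no_answer_share:.1%}")
print(f"category 1 share among answerable: {explained_share:.1%}")
```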

### 3.1 Agreement Statistics

All three annotators marked a common set of 100 examples drawn from the development set. We compute average pairwise agreement by comparing each annotator against the other two, and averaging across all pairs. Average pairwise agreement on the classification of instances was 73.9%. This figure is relatively low because one annotator interpreted QED’s single-sentence assumption more conservatively; the pairwise breakdown was 81.2/72.3/68.1%. Given the high number of “debatable” instances reported in the Natural Questions paper, this divergence is unsurprising. Average pairwise F1 on mention identification and mention alignment, conditioned on both annotators labeling instances as amenable to QED, was 88.4 and 84.1, respectively.
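The pairwise computation described here can be sketched as follows. The function name, the annotator identifiers, and the toy labels are hypothetical and are not drawn from the QED data; only the three per-pair percentages in the last line come from the text.

```python
from itertools import combinations
from statistics import mean


def pairwise_agreement(labels_by_annotator):
    """Map each annotator pair to the fraction of items they label identically."""
    result = {}
    for a, b in combinations(sorted(labels_by_annotator), 2):
        pairs = list(zip(labels_by_annotator[a], labels_by_annotator[b]))
        result[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return result


# Toy category labels (1/2/3) for five items; not real annotations.
toy = {
    "ann1": [1, 1, 2, 3, 1],
    "ann2": [1, 2, 2, 3, 1],
    "ann3": [1, 1, 2, 1, 1],
}
scores = pairwise_agreement(toy)
print(scores)
print(mean(scores.values()))

# The per-pair figures reported in the text average to the reported 73.9%.
print(round(mean([81.2, 72.3, 68.1]), 1))  # 73.9
```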

### 3.2 Types of Referential Expressions

The referential equality annotations are a major component of QED. Figure 2 shows some full QED examples from the corpus, and Figure 3 shows some example referential equalities from the corpus. In this section, in an effort to gain insight about the types of phenomena present, we describe statistics on types of referential equalities. We subcategorize referring expressions into the following types:6

Figure 2:

Examples from the QED dataset, grouped according to different types of referential equalities.

Figure 3:

Referential equalities from the QED corpus.


#### Proper Names

Examples are “How I met your Mother” or “the cbs television sitcom how i met your mother”.

#### Non-Anaphoric Definite NPs

These are expressions such as “the president of the United States” or “the next Maze Runner film”. The majority involve one or more common nouns (e.g., “president”, “film”) together with a proper name, thereby defining a new entity that is in some sense a “derivative” of the underlying proper name.

#### Anaphoric Definite NPs

These are definite NPs, most often from within the passage rather than the question, that require context to be interpreted. Examples are “the series” referring to an earlier mention of “the Vampire Diaries” within the passage, or “the winner” referring to “the winner of America’s got Talent Season 11”.

#### Generics

Examples are “a dead zone” in the question “what causes a dead zone in the ocean”, or “Dead zones” in the passage sentence “Dead zones are low-oxygen areas caused by …”.

#### Pronouns

Examples are it, they, he, she.

#### Bridging

Referential expressions in the passage sentence that use bridging (see Section 2.2).

#### Miscellaneous

All referential expressions not included in the categories above.

Table 1 shows the frequency distribution of per-instance referential equality counts. Figure 4 shows an analysis of 100 referential equality annotations from QED, with a breakdown by type of referring expression in the question and passage. Proper names, non-anaphoric definites, and generics dominate expression types in the question (73, 16, and 6 examples, respectively). Expressions in the sentence are more diverse, with a much greater proportion of anaphoric definites, pronouns, and bridging examples (21, 9, and 5 cases, respectively). Finally, as an indication of the difficulty of the referential equality task, we note that only 12% of the referential equalities in the 100 examples in Figure 4 have an exact string match (after lower-casing of both question and passage) between the question and passage referential expressions.

Table 1:

Referential link count frequency distribution in a random sample of 1000 instances. When there are 0 links, the explanation consists only of a selected sentence.

0 1 2 3
Instances 54 649 294

Figure 4:

Counts for 100 randomly drawn referential equality annotations from the QED corpus, subcategorized by expression type in the question (Qu.) and passage (Ps.). P/N/A/G/Pn/B/M refer to Proper/Def(non-ana)/Def(ana)/Generic/Pronoun/Bridge/Misc.


The QED data, which we release publicly, can be used as part of a wide range of QA tasks and models. After discussing some of these tasks, we assess how well two recent neural architectures, one structured and one sequence-to-sequence, perform on two of them.

Each QED example is a (q,d,c,a,e) tuple where q is a question from the NQ dataset, d is a Wikipedia page, c is a passage within d,7 a is a short answer within c, and e is a QED explanation. A formal definition of e can be found in Appendix A. In brief, it consists of a sentence, which is a span within c, as well as a set of referential equalities. Each referential equality is a pair consisting of a question span together with a passage span (or a bridging position in the passage). Additionally, where an answer span a falls outside of the selected sentence, the explanation contains an answer coreference span.
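One way to picture the (q,d,c,a,e) tuple is as a small set of record types. The sketch below is a hypothetical in-memory representation, not the schema of the released files: all class and field names are assumptions, spans are taken to be character offsets with an exclusive end, and the passage is abridged from the Michigan Stadium example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass
class ReferentialEquality:
    question_span: Span                    # offsets into the question q
    passage_span: Optional[Span] = None    # offsets into the passage c
    bridge_position: Optional[int] = None  # used instead of passage_span for bridges


@dataclass
class Explanation:
    sentence: Span                             # selected sentence within c
    equalities: List[ReferentialEquality] = field(default_factory=list)
    answer_coreference: Optional[Span] = None  # only when a lies outside the sentence


@dataclass
class QEDExample:
    question: str             # q
    document: str             # d (here just the passage, for brevity)
    passage: str              # c
    answer: Span              # a
    explanation: Explanation  # e


def span_of(text: str, phrase: str) -> Span:
    """Offsets of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


question = "how many seats in university of michigan stadium"
passage = ("Michigan Stadium is the football stadium for the University of "
           "Michigan in Ann Arbor, Michigan. Its official capacity is 107,601.")

example = QEDExample(
    question=question,
    document=passage,
    passage=passage,
    answer=span_of(passage, "107,601"),
    explanation=Explanation(
        sentence=span_of(passage, "Its official capacity is 107,601."),
        equalities=[ReferentialEquality(
            question_span=span_of(question, "university of michigan stadium"),
            passage_span=span_of(passage, "Its"),
        )],
    ),
)
print(passage[slice(*example.answer)])  # 107,601
```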

We use $E$ to refer to the set of evaluation examples (either the development or test set). We focus our modeling efforts on the following two tasks, in order of increasing complexity:

#### Task 1

Given a (q,d,c,a) 4-tuple, make a prediction $\hat{e} = f(q,d,c,a)$, where f is a function that maps a (q,d,c,a) tuple to an explanation. We might, for example, define $f(q,d,c,a) = \arg\max_{e} p(e \mid q,d,c,a;\theta)$ under some model p(…). The evaluation measure is then $\sum_{(q,d,c,a,e) \in E} l_1(e, f(q,d,c,a))$, where $l_1(e,\hat{e})$ is a per-example evaluation measure indicating how close $\hat{e}$ is to e.

#### Task 2

Given a (q,d,c) triple, predict $(\hat{a},\hat{e}) = f(q,d,c)$, where f is a function that maps it to a short-answer/explanation pair. We might, for example, define $f(q,d,c) = \arg\max_{a,e} p(a,e \mid q,d,c;\theta)$ under some model p(…). The evaluation measure is $\sum_{(q,d,c,a,e) \in E} l_2((a,e), f(q,d,c))$, where $l_2$ is some per-example measure.

By extension, one can conceive of a task in which one must also predict a passage c, in addition to an answer and an explanation. One could even integrate QED with a version of the open-domain QA task, which also entails retrieval of documents d. Given QED’s linguistic generality, the data may also be useful as auxiliary input for training models that are not explicitly interested in evaluating explanation generation.

An open question in explainability is how we can build and evaluate models that generate faithful explanations, where the explanation truly reflects the model’s underlying reasoning (Jacovi and Goldberg, 2020). Accurate models for the above tasks, even if they do not generate faithful explanations, may still have considerable utility. However, faithful models have several desirable characteristics (see Sections 5 and 6); we view them as a major avenue for future work.

### 4.2 A SpanBERT Model

The first model we consider for Tasks 1 and 2 uses the SpanBERT coreference resolution model (Joshi et al., 2020; Lee et al., 2017) to identify referential equalities, extends the model with a QA component, and heuristically selects supporting sentences to produce a final QA+QED output.

#### Representation

Assume an example contains a question q of m tokens $q_1 \ldots q_m$ and a passage c consisting of n tokens $c_1 \ldots c_n$. We denote the title of the Wikipedia page separately as the sequence t of k tokens $t_1 \ldots t_k$. The model uses SpanBERT to jointly encode the concatenation of these token sequences,
$[\mathrm{CLS}]\; t_1 \ldots t_k\; [\mathrm{S1}]\; q_1 \ldots q_m\; [\mathrm{S2}]\; c_1 \ldots c_n\; [\mathrm{SEP}]$
as an input document.8

#### Coreference

Given some document d and a candidate mention x, corresponding to a span within d, define $Y(x)$ to be the set of potential antecedents for x. Each antecedent is either a span in the document with start-point before x in the document, or $\epsilon$, signifying that x does not have an antecedent. We can then define a distribution over the antecedent spans $Y(x)$ as
$p(y \mid x, d) = \frac{e^{s(x,y)}}{\sum_{y' \in Y(x)} e^{s(x,y')}}$
where
$s(x,y) = \begin{cases} 0 & \text{if } y = \epsilon \\ s_m(x) + s_m(y) + s_c(x,y) & \text{otherwise} \end{cases}$
$s_m(x) = \mathrm{FFNN}_m(g_x), \qquad s_c(x,y) = \mathrm{NN}_c(g_x, g_y)$
where $g_x$ and $g_y$ are span representations obtained by concatenating the SpanBERT representations of the first and last token in each mention span. The scoring functions $s_m$ and $s_c$ represent mention and joint span match scores respectively. Whereas $s_m$ is a simple feedforward net, $s_c$ is a more complex scoring function that has been optimized for the coreference task. We refer the reader to Lee et al. (2017) for more details.
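The antecedent distribution is a softmax in which the empty antecedent has a fixed score of zero. The following sketch shows only that normalization step, taking the mention and pair scores as given numbers instead of network outputs; the function name and the toy scores are assumptions made for illustration.

```python
import math


def antecedent_distribution(s_m_x, candidates):
    """Return probabilities over [epsilon] + candidates for a mention x.

    s_m_x is the mention score of x; candidates is a list of
    (s_m(y), s_c(x, y)) score pairs, one per non-empty antecedent y.
    """
    # The empty antecedent epsilon always scores 0.
    scores = [0.0] + [s_m_x + s_m_y + s_c_xy for s_m_y, s_c_xy in candidates]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidate antecedents; the second pair score is masked with -infinity,
# so that candidate receives zero probability.
probs = antecedent_distribution(1.0, [(0.5, 2.0), (0.2, float("-inf"))])
print(probs)
```

Because the empty antecedent is pinned at zero, a mention is linked only when some candidate's combined score is positive enough to outweigh it.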

Lee et al. (2017) describe a method for training the model based on log-likelihood, and a beam search method that uses the scores $s_m(\ldots)$ and $s_c(\ldots)$ to filter candidate (mention, antecedent) pairs into the final set considered by the loss function. The final output from the coreference model is a hard clustering of the potential mentions into coreference clusters.

Given the constraints of QED referential equalities, we restrict $s_c$ to only score coreferential links between the query and the passage or between the query and the title (all other values for $s_m$ or $s_c$ are set to $-\infty$). We model bridges as links between a query mention and the title.

We finally post-process the cluster outputs of the coreference component as follows: For each cluster we output the first mention in the cluster that appears in the question with the first mention in the cluster of references that appears in the passage, once cluster mentions are sorted.9 If there is no cluster mention in the passage, we assume the passage reference is a bridge.
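The post-processing rule can be sketched as below. The `Mention` type and function name are hypothetical, and sorting mentions by offset within their segment is an assumption about the sort order, which the text does not spell out here.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    segment: str  # "question", "passage", or "title"
    start: int
    end: int


def clusters_to_equalities(
    clusters: List[List[Mention]],
) -> List[Tuple[Mention, Optional[Mention]]]:
    """Turn coreference clusters into (question mention, passage mention) pairs.

    A passage mention of None marks the equality as a bridge.
    """
    equalities = []
    for cluster in clusters:
        ordered = sorted(cluster, key=lambda m: (m.start, m.end))
        question = next((m for m in ordered if m.segment == "question"), None)
        if question is None:
            continue  # nothing to link to the question
        passage = next((m for m in ordered if m.segment == "passage"), None)
        equalities.append((question, passage))
    return equalities


clusters = [
    [Mention("passage", 40, 43), Mention("question", 18, 48)],
    [Mention("question", 0, 3), Mention("title", 0, 5)],  # no passage mention
]
for question_mention, passage_mention in clusters_to_equalities(clusters):
    print(question_mention, "bridge" if passage_mention is None else passage_mention)
```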

#### QA

The answer scoring component computes answer candidate representations $g_z$ and scores them using the same candidate mention scoring network as the coreference model, $\mathrm{FFNN}_m$, as well as a feedforward network, $\mathrm{FFNN}_a$, that scores candidate answer spans relative to a representation of the question, $g_q$. The score of an answer z is then computed as
$s_a(z) = \mathrm{FFNN}_m(g_z) + \mathrm{FFNN}_a(g_z, g_q).$
Thus, the only new parameters belong to a single hidden layer feed-forward net $\mathrm{FFNN}_a$ that specifically targets the question-answer relationship. Apart from the use of shared candidate mention scoring parameters, no further dependence is introduced between the answer and referential equality predictions.

#### Sentence Selection

We perform sentence selection heuristically by choosing the sentence containing the first cluster output by the coreference model. Any subsequent coreference cluster containing a document mention outside of this sentence is dropped in the final prediction. If no referential link is predicted, we take the supporting sentence to be the one containing the answer span.
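A minimal sketch of this heuristic follows. It assumes sentences are identified by their sorted start offsets in the passage and that equalities arrive in model output order as (question span, passage span) pairs, with a passage span of None for bridges. Keeping bridges and falling back to the answer sentence when only bridges are predicted are assumptions of the sketch, not details confirmed by the text.

```python
from bisect import bisect_right


def sentence_index(sentence_starts, offset):
    """Index of the sentence containing a character offset."""
    return bisect_right(sentence_starts, offset) - 1


def select_sentence(sentence_starts, equalities, answer_start):
    """Return (selected sentence index, equalities kept in the prediction)."""
    first = next((p for _, p in equalities if p is not None), None)
    if first is None:
        # No passage mention predicted: use the sentence containing the answer.
        return sentence_index(sentence_starts, answer_start), list(equalities)
    chosen = sentence_index(sentence_starts, first[0])
    kept = [
        (q, p) for q, p in equalities
        if p is None or sentence_index(sentence_starts, p[0]) == chosen
    ]
    return chosen, kept


sentence_starts = [0, 50, 120]       # three sentences in a toy passage
equalities = [
    ((0, 5), (60, 70)),              # first cluster: picks sentence 1
    ((6, 9), (10, 20)),              # mention in sentence 0: dropped
    ((10, 12), None),                # bridge: kept
]
print(select_sentence(sentence_starts, equalities, answer_start=130))
print(select_sentence(sentence_starts, [], answer_start=130))
```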

#### Training

For Task 1, we consider an untrained model and a fine-tuned model, both of which omit the QA component described above. In the former, we do not use expert annotated QED data but instead use the CoNLL OntoNotes coreference dataset (Pradhan et al., 2012) to train the pretrained SpanBERT model. We only score document mentions in the sentence containing the answer.

For the fine-tuned model, we mark short answers with special tokens before computing the SpanBERT document representation. Then, we further train the model with the training portion of QED data. We used SpanBERT “large”, with a maximum span width of 16 tokens, a top span ratio of 0.2, and 30 max antecedents per mention. In fine-tuning, we used an initial learning rate of $3 \cdot 10^{-4}$ and trained for 3 epochs on the QED training set.

For Task 2, we train the QA and Coreference components in a multitask fashion, by minimizing the weighted sum of the QA and coreference cross entropy losses. For the QA data, we augment using passages containing short answers from NQ. Our best results are obtained with a weight of 5 on the coreference loss and 2 epochs of training. The best answer accuracy and QED F1 are obtained for different base learning rates of $2 \cdot 10^{-5}$ and $5 \cdot 10^{-5}$ respectively.

### 4.3 A T5 Model

The second model we consider fine-tunes T5 (Raffel et al., 2020) to predict linearized QA and QED outputs from an input document. We briefly describe the linearization approach here, and refer the reader to Appendix B for a worked example.

#### Input Representation

Similar to the SpanBERT model and as depicted immediately above, we pass the concatenation of question, title, and document tokens as input to T5, in that order.10 Each input instance is either a QA- or QED-specific instance, which is indicated to T5 by appending a task-specific token to the end of the input.

#### Output Representation

The model is tasked with predicting either (1) an answer span or (2) a QED explanation, represented as a sequence of referential equalities, all separated by a special token.11 In (2), each referential equality is represented as the concatenation of two spans: the tokens in its query mention and the tokens in its passage mention, separated by ">>". In both (1) and (2) the four tokens in the passage immediately following the answer or passage mention are also appended. These additional tokens are not part of the evaluated spans; they serve to uniquely locate the character offset of the answer or passage mention during evaluation.
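The linearization idea can be illustrated with a small sketch. It is not the paper's exact target format (Appendix B gives the worked example): the separator string `<eq>`, whitespace tokenization, and the choice to simply concatenate the passage mention with its trailing context tokens are all assumptions. The second function shows why the trailing tokens help: a mention followed by its context can be matched back to a unique position in the passage.

```python
def linearize(passage_tokens, equalities, separator="<eq>", context=4):
    """Render equalities as 'question mention >> passage mention + context'.

    equalities is a list of (question_tokens, (start, end)) pairs, where
    (start, end) are token indices into passage_tokens, end exclusive.
    """
    pieces = []
    for question_tokens, (start, end) in equalities:
        # Passage mention plus up to `context` following tokens.
        located = passage_tokens[start:end + context]
        pieces.append(f"{' '.join(question_tokens)} >> {' '.join(located)}")
    return f" {separator} ".join(pieces)


def locate(passage_tokens, located_tokens):
    """First token index at which located_tokens occurs in the passage, or -1."""
    n = len(located_tokens)
    for i in range(len(passage_tokens) - n + 1):
        if passage_tokens[i:i + n] == located_tokens:
            return i
    return -1


passage_tokens = ("Simona Halep is a female tennis player . "
                  "She won Wimbledon in 2019 .").split()
equalities = [(["wimbledon"], (10, 11)), (["2019"], (12, 13))]
target = linearize(passage_tokens, equalities)
print(target)
print(locate(passage_tokens, ["Wimbledon", "in", "2019", "."]))
```

Near the end of a passage fewer than four trailing tokens are available, so the sketch appends whatever remains.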

Sentence selection proceeds heuristically as in the SpanBERT model.

#### Training

We trained T5 11B on only the QED training data, using the standard fine-tuning recipe with a batch size of 1024, learning rate of 2e−4 and a dropout rate of 0.1. For Task 1, we trained on the explanation task, marking short answers using "«" and "»" brackets in the input. For Task 2, we mixed the QA task and the explanation task with equal weights, and randomly shuffled the instances. We saw the best results when we trained Task 1 for 7000 steps and Task 2 for 2000 steps.

### 4.4 Evaluation and Results

We evaluate answer selection, sentence selection, and the identification of referential equalities. For answer and sentence selection, we report accuracy on 90% span overlap F1. For referential equality, we evaluate both mention identification (the identification of individual referential expressions in the question and passage) and referential equality detection (the identification of pairs of referential expressions).12 We compute precision, recall, and F1 measure in both cases.13
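The two referential equality metrics can be sketched as set comparisons. The sketch assumes exact span matching and scores a single example, whereas the reported numbers may use a different matching criterion or aggregation; the function names and toy spans are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 between two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mentions(equalities):
    """Individual referring expressions, tagged by the side they occur on."""
    found = set()
    for question_span, passage_span in equalities:
        found.add(("question", question_span))
        if passage_span is not None:
            found.add(("passage", passage_span))
    return found


# Toy (question span, passage span) pairs; the second passage span differs.
predicted = [((0, 5), (10, 15)), ((6, 9), (20, 25))]
gold = [((0, 5), (10, 15)), ((6, 9), (30, 35))]

# Mention identification: compare individual spans.
print(precision_recall_f1(mentions(predicted), mentions(gold)))
# Mention alignment: compare whole pairs.
print(precision_recall_f1(predicted, gold))
```

Alignment is the stricter of the two: a pair counts only if both of its spans are right, so alignment scores cannot exceed identification scores under exact matching.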

Results for Task 1 for both the SpanBERT and T5 models are reported in Table 2. The table shows results for both the OntoNotes- and QED-fine-tuned SpanBERT models, as well as the T5 model trained only on the task of explanation prediction. Of note is that models trained on QED data do considerably better than the model trained on OntoNotes, indicating that referential equalities follow a distinct distribution from other coreference data.

Table 2:

SpanBERT (SB) and T5 11B model performance for Task 1: recovering QED annotations when the correct answer is given.

| Model | Mention Identification P | Mention Identification R | Mention Identification F1 | Mention Alignment P | Mention Alignment R | Mention Alignment F1 | Sentence Accuracy |
|---|---|---|---|---|---|---|---|
| SB-onto | 59.0 | 35.6 | 44.4 | 47.7 | 28.8 | 35.9 | 97.3 |
| SB-fine-tuned | 76.8 | 68.8 | 72.6 | 68.4 | 61.3 | 64.6 | 94.2 |
| T5 | 73.0 | 75.8 | 74.4 | 63.1 | 65.5 | 64.3 | 95.9 |

In Table 3 we report results for Task 2. SB-QED-only refers to the SpanBERT model fine-tuned only with QED data. SB-QA-only refers to the SpanBERT model fine-tuned on the NQ QA data. SB-QA+QED fine-tunes on both QA and QED. Similarly, we report results for T5 models. We find that the T5 model tends to have higher recall than the SpanBERT model on mention evaluations, but that the SpanBERT model is considerably more precise. T5 far outperforms SpanBERT on answer accuracy, even though it was fine-tuned without the NQ QA data. Interestingly, in both T5 and SpanBERT models, training on QED data improves QA performance. While the SpanBERT model is more complex than the sequence-to-sequence T5 model, it is considerably more compact (320 million parameters versus 11 billion).

Table 3:

SpanBERT (SB) and T5 model performance for Task 2: recovering answer and QED annotations given a passage that is known to contain the answer.

| Model | Mention Identification P | Mention Identification R | Mention Identification F1 | Mention Alignment P | Mention Alignment R | Mention Alignment F1 | Answer Accuracy | Sentence Accuracy |
|---|---|---|---|---|---|---|---|---|
| SB-QED-only | 74.0 | 63.1 | 68.1 | 63.6 | 54.2 | 58.6 | – | 88.4 |
| SB-QA-only | – | – | – | – | – | – | 73.0 | 81.5 |
| SB-QA+QED | 77.6 | 64.4 | 70.4 | 68.9 | 57.2 | 62.5 | 74.5 | 90.8 |
| T5-QED-only | 71.1 | 73.3 | 72.2 | 59.5 | 61.4 | 60.4 | – | 88.9 |
| T5-QA-only | – | – | – | – | – | – | 78.9 | 88.7 |
| T5-QA+QED | 70.3 | 72.3 | 71.3 | 58.3 | 59.9 | 59.1 | 79.2 | 89.1 |

The data contain annotations for answer coreference, in which answer spans outside of the supporting sentence are referred to by an anaphor within it. (See Section 2.) The phenomenon is relatively rare though, and hence there are not enough data to evaluate performance properly. We did perform an additional experiment with T5 where, in addition to an answer span, it predicted its anaphor in the answering sentence where appropriate. The model achieved satisfactory performance, with an F1 of 71%.

A major desideratum for explanation generation models is faithfulness—that is, when the explanations generated by a model truly reflect its reasoning process (Jacovi and Goldberg, 2020; Ross et al., 2017). One motivation for this is that when a model is wrong, faithful explanations reliably indicate the reason for the error. In the context of QA, exposing the explanations of a faithful system should improve users’ ability to spot incorrect answers. We show that this is true of a faithful QED system using a rater study.

Given a question, passage, and a candidate answer span, raters were tasked with assessing whether the candidate answer was correct or incorrect, and indicating the confidence of their assessment.

A total of 354 raters, all of whom are US residents and native English speakers, were divided into three disjoint pools to perform the task in three distinct test settings: The None group of raters (n = 121) was presented with a question, passage, and a highlighted answer span. The Sentence group (n = 117) was provided with additional highlighting of the sentence justifying the answer, with no distinction made between referential equalities and predicates. The QED group (n = 116) was provided with additional highlighting to indicate referential equalities between spans in the question and spans in the passage. On average, a given rater provided judgments for 41 questions.

We constructed the data for the study by taking a random set of 50 correct answers, and 50 incorrect guesses from the NQ baseline model (Alberti et al., 2019), on the Natural Questions dev set. So as to ensure that the task was sufficiently challenging, correct instances were the gold answer spans on question/passage pairs where the model produced a false negative—that is, where an answer existed in the passage, but the model was not confident about it. Incorrect instances were false positive guesses from the model, where an answer did not exist in the passage but the model was confident that one did.

Explanations, where present, were manually annotated to simulate the inferences of a hypothetical model that used a QED-style reasoning process. When an item’s answer was correct, the explanation shown was simply its corresponding QED explanation. When the answer was incorrect, referential equalities were identified using counterfactual reasoning; they indicate equivalences that would have to hold if the answer were correct.

Representative examples from the set of rater study items are shown in Figure 5. Note that although referential equalities were manually chosen for incorrect examples, they are not outlandish: they tend to link closely related referents that turn out not to be equivalent upon further inspection. More generally, incorrect answers in the rater study tend to be incorrect for very subtle reasons; this is a result of the aforementioned answer selection process.

Figure 5:

Three example items from the rater study. In the referential equality error example, the answer is incorrect because the White House and the State Capitol Building are not the same. In the predicate entailment error example, the answer is incorrect because the sentence mentions the number of people over 65, whereas the question asks for the number of people over 50.


All raters were told that highlighting was the output of “an automated question answering system” that was incorrect “about half of the time.” They were advised not to use external knowledge sources or web search to make their judgments. Raters who saw explanations were also told that the system made use of the highlighted explanation to produce its candidate answers.

### 5.2 Results

Average rater accuracies for each test setting are presented in Table 4. We see that, in aggregate, QED explanations improved accuracy on the task over and above the other test settings, and gave the most improvement on the identification of answers that were incorrect. These improvements hold for incorrect answers resulting from both predicate and reference model errors.

Table 4:

Rater study results. Corr and Incorr are accuracies of raters in each group on correct and incorrect instances respectively, with incorrect instances further broken into Pred(icate) and Ref(erence) model errors. F1 is on the task of identifying incorrect instances.

| Explanation | Accuracy (All) | Accuracy (Corr) | Accuracy (Incorr / Pred / Ref) | F1 (Incorr) |
|---|---|---|---|---|
| None | 67.5 | 90.4 | 44.3 / 43.9 / 44.7 | 57.6 |
| Sentence | 69.7 | 92.4 | 47.1 / 46.1 / 48.0 | 60.9 |
| QED | 70.2 | 90.6 | 49.7 / 48.2 / 51.0 | 62.5 |

Somewhat surprisingly, highlighting just the sentence containing the answer improved accuracy more than including referential equality highlighting on instances that were correct. This may be because raters’ propensity to mark instances correct decreases as the complexity of explanations increases, from None (73.1%) to Sentence (72.6%) to QED (70.5%).
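As a check on how the columns of Table 4 relate to one another, the following sketch recomputes the F1 column and the propensities quoted above from the Corr and Incorr accuracies alone. It assumes that correct and incorrect instances are equally frequent within each setting (the study drew 50 of each, but per-setting trial counts are not reported here). The function and variable names are illustrative and are not taken from the released evaluation code.

```python
# Per-setting accuracies from Table 4, in percent: (Corr, Incorr).
TABLE4 = {
    "None": (90.4, 44.3),
    "Sentence": (92.4, 47.1),
    "QED": (90.6, 49.7),
}


def incorrect_f1(corr_acc, incorr_acc):
    """F1 for detecting incorrect answers, assuming balanced classes."""
    tp = incorr_acc          # incorrect answers flagged as incorrect
    fn = 100.0 - incorr_acc  # incorrect answers accepted as correct
    fp = 100.0 - corr_acc    # correct answers flagged as incorrect
    return 100.0 * 2 * tp / (2 * tp + fp + fn)


def marked_correct_rate(corr_acc, incorr_acc):
    """Share of all items judged correct, assuming balanced classes."""
    return (corr_acc + (100.0 - incorr_acc)) / 2


for setting, (corr, incorr) in TABLE4.items():
    print(setting,
          round(incorrect_f1(corr, incorr), 1),
          round(marked_correct_rate(corr, incorr), 1))
```

Under the balanced-class assumption, the recomputed values agree with the reported F1 scores (57.6, 60.9, 62.5) and with the quoted propensities to within rounding.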

Also clear from Table 4 is that rater accuracy is much lower on incorrect instances. Even though raters were told that the answers presented were incorrect half of the time, they judged the answers to be correct roughly 71% of the time.14

Figure 6 provides another perspective on the disparity in judgments on correct and incorrect instances summarized in Table 4. The highest per-question accuracies in the incorrect pool were still lower than the average accuracy on all correct instances, and the lowest accuracy on incorrect instances was far lower than that of any of the correct instances. The wide distribution of accuracies on incorrect instances (σ ≈ 0.50) seen in Figure 6 was also reflected in the rater pool (σ ≈ 0.45). The challenging nature of incorrect instances speaks to the promise of improvements from QED explanations.

Figure 6:

Sorted, per-question evaluation accuracies from different rater study settings, with 95% binomial confidence intervals. The “evaluation accuracy” for a question is the proportion of raters who judged it correctly. Left three plots correspond to trials with incorrect answers highlighted; right three plots to trials with correct answers highlighted. Dashed red lines correspond to the average accuracy for each setting, identical to the numbers in Table 4.

### 5.3 Effectiveness of explanations

How statistically significant are the results reported in Table 4? The 14,115 test instances were spread across 354 raters and 100 questions. We use the rstanarm R package (Goodrich et al., 2020) to fit a generalized linear mixed model (GLMM) that estimates the log-odds of rater accuracy on the basis of fixed effects (instance correctness and explanation type), while controlling for random effects of rater and question. (See Gelman and Hill (2006) for further discussion of GLMMs.) Ultimately we are interested in the magnitude and statistical properties of the model under the various test settings.

Table 5 shows the fixed-effect coefficients and standard deviations for each setting. The presence of QED explanations in the Incorrect setting increased the log-odds of rater accuracy by 0.25, with a posterior predictive p-value of 0.015 that this effect is greater than zero. The comparable effect for Sentence explanations was 0.15, with a posterior predictive p-value of 0.08. The rater and question random effects had standard deviations of 0.63 and 0.90 respectively, reflecting again the high variance of questions shown in Figure 6.

Table 5:

Generalized linear mixed model fixed effect coefficients, showing mean and standard deviation of 10k MCMC samples. The Intercept corresponds to the Incorrect+None setting.

| Parameter | Coefficient (SD) |
|---|---|
| (Intercept) | −0.31 (0.15) |
| + Incorrect+Sentence | 0.15 (0.11) |
| + Incorrect+QED | 0.25 (0.11) |
| + Correct+None | 2.94 (0.21) |
| + Correct+Sentence | 3.04 (0.13) |
| + Correct+QED | 2.69 (0.13) |

As we saw earlier, the effects of explanations in the Correct setting were reversed: The Sentence explanations caused a small, statistically insignificant increase in log-odds, while QED explanations caused a statistically significant drop in log-odds.
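To make the log-odds scale of Table 5 more concrete, the sketch below converts the posterior mean coefficients into implied accuracies through the inverse logit. It assumes that each non-intercept row is an offset added to the intercept (the Incorrect+None setting), which is how the text above compares settings. The helper names are hypothetical, and this is not the rstanarm analysis itself.

```python
import math


def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Posterior means from Table 5. Assumption: each non-intercept row is an
# offset added to the intercept (the Incorrect+None setting).
INTERCEPT = -0.31
OFFSETS = {
    "Incorrect+None": 0.0,
    "Incorrect+Sentence": 0.15,
    "Incorrect+QED": 0.25,
    "Correct+None": 2.94,
    "Correct+Sentence": 3.04,
    "Correct+QED": 2.69,
}


def typical_accuracy(setting):
    """Accuracy implied for a rater and question with zero random effects."""
    return inv_logit(INTERCEPT + OFFSETS[setting])


for setting in OFFSETS:
    print(f"{setting}: {typical_accuracy(setting):.3f}")
```

For example, the QED offset moves the implied accuracy on incorrect instances from about 0.423 to about 0.485. These values are conditional on rater and question random effects of zero, so they need not match the raw averages in Table 4.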

### 6.1 QED versus other Explanation Types

QED sits between relatively unstructured explanation forms on the one hand, such as attention distributions (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019; Mohankumar et al., 2020) or sequential outputs (Camburu et al., 2018, 2020; Narang et al., 2020; Kumar and Talukdar, 2020), and more elaborate, discrete semantic representations on the other, which can in theory be applied to explainable QA (Abzianidze et al., 2017; Wolfson et al., 2020).

### 6.2 QED and Faithfulness

A major goal for future work is to develop faithful QA models with the QED framework. As Section 5 suggests, models that are not only right for the right reasons, but also wrong for the right reasons, can help users identify subtle errors. Other motivations include model debuggability: Since faithful models should reveal weaknesses in their reasoning, they may enable more targeted intervention.

QED is a promising style of explanation to this end, because it makes use of fundamental semantic variables, like reference (Russell, 1905; Clark and Marshall, 1981; Tomasello et al., 2007). We can say, definitively, that in order for a sentence to answer a question about a thing, its meaning must involve that thing in a very particular sense. Posed counterfactually, when you break referential equality, you break answerhood, and the same argument follows for predicate entailment. This is a hallmark of a good explanation (Pearl, 2019; Lipton, 2001).

### 6.3 Scoping and Extension to other Question Types

The instantiation of QED presented in the current work is limited to extractive wh-questions whose answers are entailed by single sentences. We feel this scoping is well justified, because (1) a significant portion of NQ falls under QED’s current purview; (2) previous work and data analysis suggest QED can be readily extended to accommodate other question types (Hearst, 1992; Miltsakaki et al., 2004; Lamm et al., 2018; Tandon et al., 2019); and (3) close study of the single-sentence case is a necessary precursor to handling these other question types.

In Figure 7, we present several representative NQ instances that require more machinery than QED provides at present. Let us consider how QED might be extended to handle each of these.

Figure 7:

Examples from NQ that go beyond the current definition of QED. Highlighting resembles QED highlighting. In the multi-hop instance, we require an additional sentence to link the entity mentions in the question to the answer sentence. In the yes/no question, the supporting sentence justifies a “No” answer because it contradicts the question predicate. In the set-valued question, multiple sentences provide partial answers to the question, and the resulting answer is the union of all of these.

#### Multi-hop QA

For multi-hop questions, referential equalities involve longer, text-mediated paths from entity references in the question to an ultimate sentence entailing its answer (Yang et al., 2018).

#### Yes/No QA

Answering Yes/No questions requires identifying sentences in a text that entail or contradict the premise presented in the question.

#### Set-valued QA

Set-valued QA requires assembling QED explanations for a set of answers to the question, and returning the union of the (unique) answers found.

Looking further afield from the question types above, which are less frequent in NQ but nevertheless attested there, it becomes clear that QA writ large is much broader than even a dataset of NQ’s scale suggests (Rogers et al., 2020). The generality of QED as a model for how elements of questions can link up with textual evidence suggests that QED would likely be complementary to, rather than at odds with, efforts to understand these broader senses of QA.

We have described QED, a framework for explanations in question answering, and we have introduced a dataset of QED annotations. The framework is grounded in referential equality and entailment. In addition, we have described baseline models for two QED-based tasks, and a rater study utilizing QED annotations.

Future work should consider the development of models based on QED, especially those that provide faithful explanations, and extensions of QED beyond the single-sentence assumption.

1

QED stands for the Latin “quod erat demonstrandum” or “that which was to be shown”.

3

Instances with annotated short answers, omitting table passages.

4

If it is not possible to find a sentence that satisfies these properties—typically because the answer requires inference beyond coreference/bridging that involves multiple sentences—the annotator marks the example as not possible. See Section 3.

5

Three of the authors of this paper.

6

For formal discussion, see Carlson (1977), Krifka (2003), Abbott (2004), and Mikkelsen (2011), among others.

7

Passages are the same as NQ long answers.

8

We use [S1] = “.” and [S2] = “?” as separators.

9

This is necessary because it is technically possible for a cluster to contain more than two mentions before post-processing.

10

We use ">>" as field separators.

11

We use “&&” to separate referential equalities.

12

Where referential equalities involved bridged passage mentions, we only evaluate the models’ ability to recognize that they are bridged, since there are many conceivable places in a sentence into which mentions can be bridged.

13

Official evaluation code has been released with the dataset.

14

While this confirmation bias presents an interesting challenge for future work, it is not a shortcoming of our results: Raters were not trained to do well on the task, as we aimed to approximate how users interact with automated QA systems.

Barbara Abbott. 2004. Definiteness and Indefiniteness. The Handbook of Pragmatics, 122.

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The Parallel Meaning Bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 242–247, Valencia, Spain. Association for Computational Linguistics.

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the natural questions. arXiv preprint arXiv:1901.08634.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Oana-Maria Camburu, Brendan Shillingford, Pasquale Minervini, Thomas Lukasiewicz, and Phil Blunsom. 2020. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4157–4165, Online. Association for Computational Linguistics.

Greg N. Carlson. 1977. A unified analysis of the English bare plural. Linguistics and Philosophy, 1(3):413–457.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Herbert H. Clark. 1975. Bridging. In Proceedings of the 1975 Workshop on Theoretical Issues in Natural Language Processing, TINLAP ’75, pages 169–174, USA. Association for Computational Linguistics.

Herbert H. Clark and Catherine R. Marshall. 1981. Definite reference and mutual knowledge. Elements of Discourse Understanding.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Upol Ehsan, Tambwekar, Larry Chan, Brent Harrison, and Mark O. Riedl. 2019. Automated rationale generation: A technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 263–274. ACM.

Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Ben Goodrich, Jonah Gabry, Ali, and Sam Brilleman. 2020. rstanarm: Bayesian Applied Regression Modeling via Stan. R package version 2.19.3.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online. Association for Computational Linguistics.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Manfred Krifka. 2003. Bare NPs: Kind-referring, indefinites, both, or neither? Semantics and Linguistic Theory, 13:180–203.

Sawan Kumar and Partha Talukdar. 2020. Nile: Natural language inference with faithful natural language explanations.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Matthew Lamm, Arun Chaganty, Christopher D. Manning, Dan Jurafsky, and Percy Liang. 2018. Textual analogy parsing: What’s shared and what’s compared among analogous facts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 82–92, Brussels, Belgium. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end ution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Peter Lipton. 2001. What good is an explanation? In Giora Hon and Sam S. Rakover, editors, Explanation: Theoretical Approaches and Applications, pages 43–59. Springer Netherlands, Dordrecht.

Line Mikkelsen. 2011. Copular clauses. In Claudia Maienborn, Klaus von Heusinger, and Paul Portner, editors, Semantics: An International Handbook of Natural Language Meaning, 2, pages 1805–1829, Berlin. Mouton De Gruyter.

Eleni Miltsakaki, Rashmi, Aravind K. Joshi, and Bonnie L. Webber. 2004. The Penn Discourse Treebank. In LREC.

Akash Kumar Mohankumar, Preksha Nema, Sharan Narasimhan, Mitesh M. Khapra, Balaji Vasan Srinivasan, and Balaraman Ravindran. 2020. Towards transparent and explainable attention models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4206–4216, Online. Association for Computational Linguistics.

Sharan Narang, Colin Raffel, Katherine Lee, Roberts, Noah Fiedel, and Karishma Malkan. 2020. Wt5?! Training text-to-text models to explain their predictions.

Judea Pearl. 2019. The limitations of opaque learning machines. Possible Minds: Twenty-Five Ways of Looking at AI, pages 13–19.

Sameer, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 Shared Task: Modeling multilingual unrestricted coreference in OntoNotes.

Colin Raffel, Noam Shazeer, Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8722–8731.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2662–2670.

Bertrand Russell. 1905. On denoting. Mind, 14(56):479–493.

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. WIQA: A dataset for “What if…” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6078–6087.

Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child Development, 78(3):705–722.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

### A Formal Definition of QED Annotations

An annotator is presented with a question $q$ that consists of $m$ tokens $q_1 \ldots q_m$, along with a passage $c$ consisting of $n$ tokens $c_1 \ldots c_n$.

The QED annotation is a triple $\langle s, e, a \rangle$ where:

• $s$ is a sentence within the passage $c$. Specifically, $s$ is a pair $(s_0, s_1)$ indicating that the sentence spans words $c_{s_0} \ldots c_{s_1}$ inclusive.

• $e$ is a sequence of 0 or more “referential equality annotations”, $e_1 \ldots e_{|e|}$. Each member of $e$ specifies that some noun phrase within the question refers to the same item in the world as some noun phrase within the sentence $s$.

• $a$ is one or more answer annotations $a_1 \ldots a_{|a|}$.

We now describe the form of the e and a annotations, making reference to the following example. Subscripts indicate token positions:

Question: who$_1$ won$_2$ wimbledon$_3$ in$_4$ 2019$_5$

Passage: Simona$_1$ Halep$_2$ is$_3$ a$_4$ female$_5$ tennis$_6$ player$_7$ .$_8$ She$_9$ won$_{10}$ Wimbledon$_{11}$ in$_{12}$ 2019$_{13}$ .$_{14}$

As a preliminary step, given the paragraph $c$ and sentence $s$, we use $S$ to refer to the set of all phrases within $s$. Our initial definition of $S$ is
$S = \{(i, j) : s_0 \le i \le j \le s_1\}$
We also define the set of question phrases $Q$ and passage phrases $C$ to be
$Q = \{(i, j) : 1 \le i \le j \le m\} \qquad C = \{(i, j) : 1 \le i \le j \le n\}$

We can then give the following definitions:

For the referential equality annotations in our example,
$e = [((3,3),(11,11)),((5,5),(13,13))]$
where the first tuple in the sequence corresponds with the alignment between “wimbledon” in the question and “Wimbledon” in the passage, and the second tuple with “2019” in the question and “2019” in the passage.

For the answer annotations in our example,
$a = [((9,9),(1,2))]$
corresponding with the alignment of “She” in the sentence “She won Wimbledon in 2019” with the mention of “Simona Halep” earlier in the passage.
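The triple and the running example can be written down as a small data structure, which may help in checking that an annotation respects the phrase sets $S$, $Q$, and $C$. This is a minimal sketch: the class name `QEDAnnotation`, its field names, and the helpers `span_text` and `is_valid` are hypothetical and are not part of the released dataset format. It also assumes that each answer annotation pairs a sentence phrase with a passage phrase, as the example suggests, and takes the sentence span (9, 14) from the token positions above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # 1-indexed, inclusive (start, end) token positions


@dataclass
class QEDAnnotation:
    sentence: Span                        # (s0, s1) within the passage
    equalities: List[Tuple[Span, Span]]   # (question phrase, sentence phrase)
    answers: List[Tuple[Span, Span]]      # (sentence phrase, passage phrase)


def span_text(tokens: List[str], span: Span) -> str:
    """Return the words covered by a 1-indexed, inclusive span."""
    i, j = span
    return " ".join(tokens[i - 1:j])


def is_valid(ann: QEDAnnotation, m: int, n: int) -> bool:
    """Check spans against Q (question), S (sentence), and C (passage)."""
    s0, s1 = ann.sentence

    def in_q(sp: Span) -> bool:
        return 1 <= sp[0] <= sp[1] <= m

    def in_s(sp: Span) -> bool:
        return s0 <= sp[0] <= sp[1] <= s1

    def in_c(sp: Span) -> bool:
        return 1 <= sp[0] <= sp[1] <= n

    return (
        1 <= s0 <= s1 <= n
        and all(in_q(q) and in_s(s) for q, s in ann.equalities)
        and len(ann.answers) >= 1
        and all(in_s(s) and in_c(c) for s, c in ann.answers)
    )


question = "who won wimbledon in 2019".split()
passage = ("Simona Halep is a female tennis player . "
           "She won Wimbledon in 2019 .").split()

example = QEDAnnotation(
    sentence=(9, 14),  # "She won Wimbledon in 2019 ."
    equalities=[((3, 3), (11, 11)), ((5, 5), (13, 13))],
    answers=[((9, 9), (1, 2))],
)

print(is_valid(example, len(question), len(passage)))
for q_span, s_span in example.equalities:
    print(span_text(question, q_span), "=", span_text(passage, s_span))
```

Spans are kept 1-indexed and inclusive to match the subscripts in the example, so the only place an offset is applied is inside `span_text`.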

#### A.1 Extending Annotations to Include Bridging

Recall the definition of bridging in Section 2. We extend the formal definition of QED to include bridging by redefining $S$ to include implicit phrases introduced in the form of implicit prepositional phrases, as in the “winner [of …]”. The modified definition of $S$ includes all phrases of the following form: (1) Any pair $(i, j)$ such that $s_0 \le i \le j \le s_1$, indicating the subsequence of words $c_i \ldots c_j$ within the sentence. (2) Any triple $(i, j, p)$ such that $s_0 \le i \le j \le s_1$ and $p$ is a preposition, indicating the implicit noun phrase in the sentence that modifies the phrase $c_i \ldots c_j$ through the preposition $p$. (3) Any pair $(\text{NULL}, p)$ such that $p$ is a preposition, indicating the implicit noun phrase modifying the entire sentence $c_{s_0} \ldots c_{s_1}$ through the preposition $p$.

Given the following example, then:

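Question: who$_1$ won$_2$ america’s$_3$ got$_4$ talent$_5$ season$_6$ 11$_7$

Passage: The$_1$ 11th$_2$ season$_3$ of$_4$ America’s$_5$ Got$_6$ Talent$_7$ began$_8$ broadcasting$_9$ in$_{10}$ the$_{11}$ United$_{12}$ States$_{13}$ during$_{14}$ 2016$_{15}$ .$_{16}$ Grace$_{17}$ VanderWaal$_{18}$ was$_{19}$ announced$_{20}$ as$_{21}$ the$_{22}$ winner$_{23}$ on$_{24}$ September$_{25}$ 14$_{26}$ ,$_{27}$ 2016$_{28}$ .$_{29}$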

we have that
$e=[((3,7),(22,23),“of”)]$

This means that the question span “america’s got talent season 11” is bridged by the reference “the winner” in the answering sentence. The preposition “of” indicates how the question referent can be attached to the sentence reference: Putting them together yields “the winner of america’s got talent season 11”.
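The way a bridged equality is read off can be sketched in a few lines. The code below follows the flat form used in the example, (question span, sentence span, preposition), and the helper names `span_text` and `realize_bridge` are hypothetical. Tokens are obtained by whitespace splitting, with straight apostrophes standing in for the typographic ones in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-indexed, inclusive token positions
Bridge = Tuple[Span, Span, str]  # (question span, sentence span, preposition)


def span_text(tokens: List[str], span: Span) -> str:
    """Return the words covered by a 1-indexed, inclusive span."""
    i, j = span
    return " ".join(tokens[i - 1:j])


def realize_bridge(question: List[str], passage: List[str],
                   bridged: Bridge) -> str:
    """Attach the question phrase to the sentence phrase via the preposition."""
    q_span, s_span, prep = bridged
    return f"{span_text(passage, s_span)} {prep} {span_text(question, q_span)}"


question = "who won america's got talent season 11".split()
passage = ("The 11th season of America's Got Talent began broadcasting in the "
           "United States during 2016 . Grace VanderWaal was announced as the "
           "winner on September 14 , 2016 .").split()

e = [((3, 7), (22, 23), "of")]
print(realize_bridge(question, passage, e[0]))
# prints: the winner of america's got talent season 11
```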

### B T5 Model Linearization

We describe our method for linearizing QED instances as T5 input and output sequences. Let us consider the following example:

Question: how many seats in university of michigan stadium

Passage: Michigan Stadium, nicknamed “The Big House”, is the football stadium for the University of Michigan in Ann Arbor, Michigan. It is the largest stadium in the United States and the second largest stadium in the world. Its official capacity is 107,601.

Recall that the QED annotation for this example is as follows

how many seats in [=1 university of michigan stadium] [=1 Its] official capacity is [=A 107,601]

The T5 model inputs are constructed by concatenating the question, page title, and paragraph into one sequence, separated by “>>”:

how many seats in university of michigan stadium ¿ Michigan Stadium » Michigan Stadium, nicknamed “The Big House”, is the football stadium for the University of Michigan in Ann Arbor, Michigan. It is the largest stadium in the United States and the second largest stadium in the world. Its official capacity is 107,601.

A task-specific token is prepended to this input to indicate whether the model should produce an answer or an explanation. The answer output sequence is as follows:

107,601 ».

where additional material after the special character ">>" (as distinct from ">>") is used to disambiguate the position of the answer in the passage. Finally, the explanation would be linearized as follows:

university of michigan stadium ¿ Its » official capacity is 107,601

Here, the first phrase corresponds to the question mention and the second to the passage mention. The additional material after the passage mention is meant to uniquely identify its position in the passage, for evaluation purposes.

## Author notes

*

Work done during internship at Google.