Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating the quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.

1 Introduction

In recent years there has been dramatic progress in machine learning approaches to problems such as machine translation, speech recognition, and image recognition. One major factor in these successes has been the development of neural methods that far exceed the performance of previous approaches. A second major factor has been the existence of large quantities of training data for these systems.

Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU), which has significant utility to users, and in addition is potentially a challenge task that can drive the development of methods for NLU. Several pieces of recent work have introduced QA data sets (e.g., Rajpurkar et al., 2016; Reddy et al., 2018). However, in contrast to tasks where it is relatively easy to gather naturally occurring examples,1 the definition of a suitable QA task, and the development of a methodology for annotation and evaluation, is challenging. Key issues include the methods and sources used to obtain questions; the methods used to annotate and collect answers; the methods used to measure and ensure annotation quality; and the metrics used for evaluation. For more discussion of the limitations of previous work with respect to these issues, see Section 2 of this paper.

This paper introduces Natural Questions2 (nq), a new data set for QA research, along with methods for QA system evaluation. Our goals are three-fold: 1) To provide large-scale end-to-end training data for the QA problem. 2) To provide a data set that drives research in natural language understanding. 3) To study human performance in providing QA annotations for naturally occurring questions.

In brief, our annotation process is as follows. An annotator is presented with a (question, Wikipedia page) pair. The annotator returns a (long answer, short answer) pair. The long answer (l) can be an HTML bounding box on the Wikipedia page—typically a paragraph or table—that contains the information required to answer the question. Alternatively, the annotator can return l = NULL if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer (s) can be a span or set of spans (typically entities) within l that answer the question, a boolean yes or no answer, or NULL. If l = NULL then s = NULL, necessarily. Figure 1 shows examples.

Figure 1: Example annotations from the corpus.

Natural Questions has the following properties:

Source of questions

The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are “natural” in that they represent real queries from people seeking information.

Number of items

The public release contains 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and 7,842 5-way annotated items sequestered as test data. We justify the use of 5-way annotation for evaluation in Section 5.

Task definition

The input to a model is a question together with an entire Wikipedia page. The target output from the model is: 1) a long-answer (e.g., a paragraph) from the page that answers the question, or alternatively an indication that there is no answer on the page; 2) a short answer where applicable. The task was designed to be close to an end-to-end question answering application.

Ensuring high-quality annotations at scale

Comprehensive guidelines were developed for the task. These are summarized in Section 3. Annotation quality was constantly monitored.

Evaluation of quality

Section 4 describes post-hoc evaluation of annotation quality. Long/short answers have 90%/84% precision, respectively.

Study of variability

One clear finding in nq is that for naturally occurring questions there is often genuine ambiguity in whether or not an answer is acceptable. There are also often a number of acceptable answers. Section 4 examines this variability using 25-way annotations.

Robust evaluation metrics

Section 5 introduces methods of measuring answer quality that account for variability in acceptable answers. We demonstrate a high human upper bound on these measures for both long answers (90% precision, 85% recall), and short answers (79% precision, 72% recall).

We propose nq as a new benchmark for research in QA. In Section 6.4 we present baseline results from recent models developed on comparable data sets (Clark and Gardner, 2018), as well as a simple pipelined model designed for the nq task. We demonstrate a large gap between the performance of these baselines and a human upper bound. We argue that closing this gap will require significant advances in NLU.

2 Related Work

The SQuAD (Rajpurkar et al., 2016), SQuAD 2.0 (Rajpurkar et al., 2018), NarrativeQA (Kocisky et al., 2018), and HotpotQA (Yang et al., 2018) data sets contain questions and answers written by annotators who have first read a short text containing the answer. The SQuAD data sets contain question/paragraph/answer triples from Wikipedia. In the original SQuAD data set, annotators often borrow part of the evidence paragraph to create a question. Jia and Liang (2017) showed that systems trained on SQuAD could be easily fooled by the insertion of distractor sentences that should not change the answer, and SQuAD 2.0 introduces questions that are designed to be unanswerable. However, we argue that questions written to be unanswerable can be identified as such with little reasoning, in contrast to nq’s task of deciding whether a paragraph contains all of the evidence required to answer a real question. Both SQuAD tasks have driven significant advances in reading comprehension, but systems now outperform humans and harder challenges are needed. NarrativeQA aims to elicit questions that are not close paraphrases of the evidence by using separate summary texts. No human performance upper bound is provided for the full task and, although an extractive system could theoretically perfectly recover all answers, current approaches only just outperform a random baseline. NarrativeQA may just be too hard for the current state of NLU. HotpotQA is designed to contain questions that require reasoning over text from separate Wikipedia pages. As well as answering questions, systems must also identify passages that contain supporting facts. This is similar in motivation to nq’s long answer task, where the selected passage must contain all of the information required to infer the answer. Mirroring our identification of acceptable variability in the nq task definition, HotpotQA’s authors observe that the choice of supporting facts is somewhat subjective. They set high human upper bounds by selecting, for each example, the score-maximizing partition of four annotations into one prediction and three references. The reference labels chosen by this maximization are not representative of the reference labels in HotpotQA’s evaluation set, and it is not clear that the upper bounds are achievable. A more robust approach is to keep the evaluation distribution fixed, and calculate an achievable upper bound by approximating the expectation over annotations, as we have done for nq in Section 5.

The QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2018) data sets contain dialogues between a questioner, who is trying to learn about a text, and an answerer. QuAC also prevents the questioner from seeing the evidence text. Conversational QA is an exciting new area, but it is significantly different from the single turn QA task in nq. In both QuAC and CoQA, conversations tend to explore evidence texts incrementally, progressing from the start to the end of the text. This contrasts with nq, where individual questions often require reasoning over large bodies of text.

The WikiQA (Yang et al., 2015) and MS Marco (Nguyen et al., 2016) data sets contain queries sampled from the Bing search engine. WikiQA contains only 3,047 questions. MS Marco contains 100,000 questions with free-form answers. For each question, the annotator is presented with 10 passages returned by the search engine, and is asked to generate an answer to the query, or to say that the answer is not contained within the passages. Free-form text answers allow more flexibility in providing abstractive answers, but lead to difficulties in evaluation (BLEU score [Papineni et al., 2002] is used). MS Marco’s authors do not discuss issues of variability or report quality metrics for their annotations. From our experience, these issues are critical. DuReader (He et al., 2018) is a Chinese language data set containing queries from Baidu search logs. Like nq, DuReader contains real user queries; it requires systems to read entire documents to find answers; and it identifies acceptable variability in answers. However, as with MS Marco, DuReader is reliant on BLEU for answer scoring, and systems already outperform humans according to this metric.

There are a number of reading comprehension benchmarks based on multiple choice tests (Mihaylov et al., 2018; Richardson et al., 2013; Lai et al., 2017). The TriviaQA data set (Joshi et al., 2017) contains questions and answers taken from trivia quizzes found online. A number of Cloze-style tasks have also been proposed (Hermann et al., 2015; Hill et al., 2015; Paperno et al., 2016; Onishi et al., 2016). We believe that all of these tasks are related to, but distinct from, answering information-seeking questions. We also believe that, because a solution to nq will have genuine utility, it is better equipped as a benchmark for NLU.

3 Task Definition and Data Collection

Natural Questions contains (question, Wikipedia page, long answer, short answer) quadruples where: the question seeks factual information; the Wikipedia page may or may not contain the information required to answer the question; the long answer is a bounding box on this page containing all information required to infer the answer; and the short answer is one or more entities that give a short answer to the question, or a boolean yes or no. Both the long and short answer can be NULL if no viable candidates exist on the Wikipedia page.
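As an illustration of the quadruple structure described above, the following is a minimal sketch of how one annotation could be represented. The class name, field names, and the use of (start, end) offsets for bounding boxes are assumptions made for this example; they are not the format of the released data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical representation: a span is a (start, end) offset pair into the page.
Span = Tuple[int, int]


@dataclass
class NQAnnotation:
    """One annotator's output for a (question, Wikipedia page) pair."""

    long_answer: Optional[Span] = None  # None stands for NULL
    short_answers: List[Span] = field(default_factory=list)
    yes_no: Optional[bool] = None  # boolean short answer, if any

    def __post_init__(self) -> None:
        has_short = bool(self.short_answers) or self.yes_no is not None
        # If the long answer is NULL, the short answer is necessarily NULL.
        if self.long_answer is None and has_short:
            raise ValueError("a short answer requires a non-null long answer")
        # Short answer spans must lie within the long answer.
        if self.long_answer is not None:
            lo, hi = self.long_answer
            for start, end in self.short_answers:
                if not (lo <= start < end <= hi):
                    raise ValueError("short answer span is outside the long answer")

    @property
    def has_short_answer(self) -> bool:
        return bool(self.short_answers) or self.yes_no is not None
```

The validation in `__post_init__` only encodes the two constraints stated in the text: a NULL long answer forces a NULL short answer, and short answer spans sit inside the long answer.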

3.1 Questions and Evidence Documents

All the questions in nq are queries of 8 words or more that have been issued to the Google search engine by multiple users in a short period of time. From these queries, we sample a subset that either:

  1. start with “who”, “when”, or “where” directly followed by: a) a finite form of “do” or a modal verb; or b) a finite form of “be” or “have” with a verb in some later position;

  2. start with “who” directly followed by a verb that is not a finite form of “be”;

  3. contain multiple entities as well as an adjective, adverb, verb, or determiner;

  4. contain a categorical noun phrase immediately preceded by a preposition or relative clause;

  5. end with a categorical noun phrase, and do not contain a preposition or relative clause.3

Table 1 gives examples. We run questions through the Google search engine and keep those where there is a Wikipedia page in the top 5 search results. The (question, Wikipedia page) pairs are the input to the human annotation task described next.

Table 1: Matches for heuristics in Section 3.1.
1.a where does the nature conservancy get its funding 
1.b who is the song killing me softly written about 
who owned most of the railroads in the 1800s 
how far is chardon ohio from cleveland ohio 
american comedian on have i got news for you 

The goal of these heuristics is to discard a large proportion of queries that are non-questions, while retaining the majority of queries of 8 words or more in length that are questions. A manual inspection showed that the majority of questions in the data, with the exclusion of questions beginning with “how to”, are accepted by the filters. We focus on longer queries as they are more complex, and are thus a more challenging test for deep NLU. We focus on Wikipedia as it is a very important source of factual information, and we believe that stylistically it is similar to other sources of factual information on the Web; however, like any data set there may be biases in this choice. Future data-collection efforts may introduce shorter queries, “how to” questions, or domains other than Wikipedia.
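To make the flavor of these filters concrete, the sketch below implements only the 8-word length requirement and heuristic 1.a from Section 3.1. The word lists are assumptions chosen for illustration, since the paper does not enumerate them, and the remaining heuristics are omitted because they need part-of-speech and entity information that a plain string check cannot provide.

```python
# Illustrative word lists (assumed; not taken from the paper).
WH_WORDS = {"who", "when", "where"}
DO_FORMS = {"do", "does", "did"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
MIN_WORDS = 8


def long_enough(query: str) -> bool:
    """Queries in nq have 8 words or more."""
    return len(query.split()) >= MIN_WORDS


def matches_heuristic_1a(query: str) -> bool:
    """'who'/'when'/'where' directly followed by a finite form of 'do' or a modal."""
    tokens = query.lower().split()
    return (
        len(tokens) >= 2
        and tokens[0] in WH_WORDS
        and tokens[1] in (DO_FORMS | MODALS)
    )


def keep_query(query: str) -> bool:
    """Partial filter: length check plus heuristic 1.a only."""
    return long_enough(query) and matches_heuristic_1a(query)
```

Under these assumptions, the Table 1 example "where does the nature conservancy get its funding" is kept, while "who is the song killing me softly written about" is not, because it matches heuristic 1.b rather than 1.a.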

3.2 Human Identification of Answers

Annotation is performed using a custom annotation interface, by a pool of around 50 annotators, with an average annotation time of 80 seconds.

The guidelines and tooling divide the annotation task into three conceptual stages, where all three stages are completed by a single annotator in succession. The decision flow through these is illustrated in Figure 2 and the instructions given to annotators are summarized below.

Figure 2: Annotation decision process with path proportions from nq training data. Percentages are proportions of entire data set. A total of 49% of all examples have a long answer.

Question Identification:

Contributors determine whether the given question is good or bad. A good question is a fact-seeking question that can be answered with an entity or explanation. A bad question is ambiguous, incomprehensible, dependent on clearly false presuppositions, opinion-seeking, or not clearly a request for factual information. Annotators must make this judgment solely by the content of the question; they are not yet shown the Wikipedia page.

Long Answer Identification:

For good questions only, annotators select the earliest HTML bounding box containing enough information for a reader to completely infer the answer to the question. Bounding boxes can be paragraphs, tables, list items, or whole lists. Alternatively, annotators mark “no answer” if the page does not answer the question, or if the information is present but not contained in a single one of the allowed elements.

Short Answer Identification:

For examples with long answers, annotators select the entity or set of entities within the long answer that answer the question. Alternatively, annotators can flag that the short answer is yes, no, or they can flag that no short answer is possible.

3.3 Data Statistics

In total, annotators identify a long answer for 49% of the examples, and short answer spans or a yes/no answer for 36% of the examples. We consider the choice of whether or not to answer a question a core part of the question answering task, and do not discard the remaining 51% that have no answer labeled.

Annotators identify long answers by selecting the smallest HTML bounding box that contains all of the information required to answer the question. These are mostly paragraphs (73%). The remainder are made up of tables (19%), table rows (1%), lists (3%), or list items (3%).4 We leave further subcategorization of long answers to future work, and provide a breakdown of baseline performance on each of these three types of answers in Section 6.4.

4 Evaluation of Annotation Quality

This section describes evaluation of the quality of the human annotations in our data. We use a combination of two methods: 1) post hoc evaluation of correctness of non-null answers, under consensus judgments from four “experts”; 2) k-way annotations (with k = 25) on a subset of the data.

Post hoc evaluation of non-null answers leads directly to a measure of annotation precision. As is common in information-retrieval style problems such as long-answer identification, measuring recall is more challenging. However, we describe how 25-way annotated data provide useful insights into recall, particularly when combined with expert judgments.

4.1 Preliminaries: The Sampling Distribution

Each item in our data consists of a four-tuple (q,d,l,s) where q is a question, d is a document, l is a long answer, and s is a short answer. Thus we introduce random variables Q, D, L, and S corresponding to these items. Note that L can be a span within the document, or NULL. Similarly, S can be one or more spans within L, a boolean, or NULL.

For now we consider the three-tuple (q,d,l). The treatment for short answers is the same throughout, with (q,d,s) replacing (q,d,l).

Each data item (q,d,l) is sampled independently and identically distributed (IID) from
$$p(l,q,d) = p(q,d) \times p(l \mid q,d)$$
Here, p(q, d) is the sampling distribution (probability mass function [PMF]) over question/document pairs. It is defined as the PMF corresponding to the following sampling process:5 First, sample a question at random from some distribution; second, perform a search on a major search engine using the question as the underlying query; finally, either: 1) return (q,d) where d is the top Wikipedia result for q, if d is in the top 5 search results for q; 2) if there is no Wikipedia page in the top 5 results, discard q and repeat the sampling process.

Here p(l|q,d) is the conditional distribution (PMF) over long answer l conditioned on the pair (q,d). The value for l is obtained by: 1) sampling an annotator uniformly at random from the pool of annotators; 2) presenting the pair (q,d) to the annotator, who then provides a value for l.

Note that l is non-deterministic due to two sources of randomness: 1) the random choice of annotator; 2) the potentially random behavior of a particular annotator (the annotator may give a different answer depending on the time of day, etc.).

We will also consider the distribution
$$p(l,q,d \mid L \neq \text{NULL}) = \begin{cases} \dfrac{p(l,q,d)}{P(L \neq \text{NULL})} & \text{if } l \neq \text{NULL} \\ 0 & \text{otherwise} \end{cases}$$
where $P(L \neq \text{NULL}) = \sum_{l,q,d:\, l \neq \text{NULL}} p(l,q,d)$. Thus $p(l,q,d \mid L \neq \text{NULL})$ is the probability of seeing the triple (l,q,d), conditioned on L not being NULL.
We now define precision of annotations. Consider a function $\pi(l,q,d)$ that is equal to 1 if l is a “correct” answer for the pair (q,d), 0 if the answer is incorrect. The next section gives a concrete definition of $\pi$. The annotation precision is defined as
$$\Psi = \sum_{l,q,d} p(l,q,d \mid L \neq \text{NULL}) \times \pi(l,q,d)$$
Given a set of annotations $S = \{(l^{(i)}, q^{(i)}, d^{(i)})\}_{i=1}^{|S|}$ drawn IID from $p(l,q,d \mid L \neq \text{NULL})$, we can derive an estimate of $\Psi$ as $\hat{\Psi} = \frac{1}{|S|} \sum_{(l,q,d) \in S} \pi(l,q,d)$.

4.2 Expert Evaluations of Correctness

We now describe the process for deriving “expert” judgments of answer correctness. We used four experts for these judgments. These experts had prepared the guidelines for the annotation process.6 In a first phase each of the four experts independently annotated examples for correctness. In a second phase the four experts met to discuss disagreements in judgments, and to reach a single consensus judgment for each example.

A key step is to define the criteria used to determine correctness of an example. Given a triple (l, q, d), we extracted the passage l′ corresponding to l on the page d. The pair (q, l′) was then presented to the expert. Experts categorized (q, l′) pairs into the following three categories:

Correct (C): It is clear beyond a reasonable doubt that the answer is correct.

Correct (but debatable) (Cd): A reasonable person could be satisfied by the answer; however, a reasonable person could raise a reasonable doubt about the answer.

Wrong (W): There is not convincing evidence that the answer is correct.

Figure 3 shows some example judgments. We introduced the intermediate Cd category after observing that many (q, l′) pairs are high quality answers, but raise some small doubt or quibble about whether they fully answer the question. The use of the word “debatable” is intended to be literal: (q, l′) pairs falling into the Cd category could literally lead to some debate between reasonable people as to whether they fully answer the question or not.

Figure 3: Examples with consensus expert judgments, and justification for these judgments. See Figure 6 for more examples.

Given this background, we will make the following assumption:

Answers in the Cd category should be very useful to a user interacting with a QA system, and should be considered to be high-quality answers; however, an annotator would be justified in either annotating or not annotating the example.

For these cases there is often disagreement between annotators as to whether the page contains an answer or not: We will see evidence of this when we consider the 25-way annotations.

4.3 Results for Precision Measurements

We used the following procedure to derive measurements of precision: 1) We sampled examples IID from the distribution $p(l, q, d \mid L \neq \text{NULL})$. We call this set S. We had |S| = 139. 2) Four experts independently classified each of the items in S into the categories C, Cd, W. 3) The four experts met to come up with a consensus judgment for each item. For each example $(l^{(i)}, q^{(i)}, d^{(i)}) \in S$, we define $c^{(i)}$ to be the consensus judgment. This process was repeated to derive judgments for short answers.

We can then calculate the percentage of examples falling into the three expert categories; we denote these values as $\hat{E}(\mathrm{C})$, $\hat{E}(\mathrm{Cd})$, and $\hat{E}(\mathrm{W})$.7 We define $\hat{\Psi} = \hat{E}(\mathrm{C}) + \hat{E}(\mathrm{Cd})$. We have explicitly included samples C and Cd in the overall precision as we believe that Cd answers are essentially correct. Table 2 shows the values for these quantities.
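The estimate described above amounts to counting consensus labels. The following minimal sketch shows that computation; the function names and the toy label list are assumptions for illustration, not the study's actual judgments.

```python
from collections import Counter
from typing import Dict, Sequence

CATEGORIES = ("C", "Cd", "W")


def category_proportions(judgments: Sequence[str]) -> Dict[str, float]:
    """Empirical proportion of consensus judgments in each expert category."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    n = len(judgments)
    return {c: counts[c] / n for c in CATEGORIES}


def estimated_precision(judgments: Sequence[str]) -> float:
    """Precision estimate: proportion of C plus proportion of Cd."""
    props = category_proportions(judgments)
    return props["C"] + props["Cd"]


# Toy consensus labels (made up for this example).
toy = ["C", "C", "Cd", "W"]
print(category_proportions(toy))
print(estimated_precision(toy))
```

Counting Cd as correct follows the assumption stated in Section 4.2; treating Cd as incorrect would instead give the proportion of C alone.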

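Table 2: Precision results ($\hat{\Psi}$) and empirical estimates of the proportions of C, Cd, and W items.

| Quantity | Long answer | Short answer |
|---|---|---|
| $\hat{\Psi}$ | 90% | 84% |
| $\hat{E}(\mathrm{C})$ | 59% | 51% |
| $\hat{E}(\mathrm{Cd})$ | 31% | 33% |
| $\hat{E}(\mathrm{W})$ | 10% | 16% |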

4.4 Variability of Annotations

We have shown that an annotation drawn from $p(l,q,d \mid L \neq \text{NULL})$ has high expected precision. Now we address the distribution over annotations for a given (q,d) pair. Annotators can disagree about whether or not d contains an answer to q, that is, whether or not L = NULL. In the case that annotators agree that $L \neq \text{NULL}$, they can also disagree about the correct assignment to L.

In order to study variability, we collected 24 additional annotations from separate annotators for each of the (q, d, l) triples in S. For each (q,d,l) triple, we now have a 5-tuple $(q^{(i)}, d^{(i)}, l^{(i)}, c^{(i)}, a^{(i)})$ where $a^{(i)} = a_1^{(i)} \ldots a_{25}^{(i)}$ is a vector of 25 annotations (including $l^{(i)}$), and $c^{(i)}$ is the consensus judgment for $l^{(i)}$. For each i also define
$$\mu^{(i)} = \frac{1}{25} \sum_{j=1}^{25} [[a_j^{(i)} \neq \text{NULL}]]$$
to be the proportion of the 25-way annotations that are non-null.
We now show that $\mu^{(i)}$ is highly correlated with annotation precision. We define
$$\hat{E}[(0.8, 1.0]] = \frac{1}{|S|} \sum_{i=1}^{|S|} [[0.8 < \mu^{(i)} \leq 1]]$$
to be the proportion of examples with greater than 80% of the 25 annotators marking a non-null long answer, and
$$\hat{E}[(0.8, 1.0], \mathrm{C}] = \frac{1}{|S|} \sum_{i=1}^{|S|} [[0.8 < \mu^{(i)} \leq 1 \text{ and } c^{(i)} = \mathrm{C}]]$$
to be the proportion of examples with greater than 80% of the 25 annotators marking a non-null long answer and with $c^{(i)} = \mathrm{C}$. Similar definitions apply for the intervals (0, 0.2], (0.2, 0.4], (0.4, 0.6], and (0.6, 0.8], and for judgments Cd and W.
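The binned quantities defined above can be computed directly from the annotation vectors and consensus judgments. The sketch below is one way to do so; representing NULL as `None` and the function names are assumptions made for the example, and the toy input uses 5 annotations per example rather than 25 to keep it short.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # half-open interval (lo, hi]
BINS: List[Interval] = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]


def non_null_fraction(annotations: Sequence[Optional[Hashable]]) -> float:
    """mu: proportion of annotations that are non-null (None stands for NULL)."""
    return sum(a is not None for a in annotations) / len(annotations)


def binned_proportions(
    examples: Sequence[Tuple[Sequence[Optional[Hashable]], str]],
) -> Dict[Tuple[Interval, str], float]:
    """Proportion of examples whose mu falls in (lo, hi] with a given judgment.

    `examples` holds (annotation vector, consensus judgment) pairs.
    """
    n = len(examples)
    out: Dict[Tuple[Interval, str], float] = defaultdict(float)
    for annotations, consensus in examples:
        mu = non_null_fraction(annotations)
        for lo, hi in BINS:
            if lo < mu <= hi:
                out[((lo, hi), consensus)] += 1 / n
                break
    return dict(out)
```

Every example in S contributes at least one non-null annotation (the original long answer), so its non-null fraction is strictly positive, which is why the lowest interval can be open at 0.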

Figure 4 illustrates the proportion of annotations falling into the C/Cd/W categories in different regions of $\mu^{(i)}$. For those (q,d) pairs where more than 80% of annotators gave some non-null answer, our expert judgments agree that these annotations are overwhelmingly correct. Similarly, when fewer than 20% of annotators gave a non-null answer, these answers tend to be incorrect. In between these two extremes, the disagreement between annotators is largely accounted for by the Cd category, where a reasonable person could either be satisfied with the answer, or want more information. Later, in Section 5, we make use of the correlation between $\mu^{(i)}$ and accuracy to define a metric for the evaluation of answer quality. In that section, we also show that a model trained on (l, q, d) triples can outperform a single annotator on this metric by accounting for the uncertainty of whether or not an answer is present.

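Figure 4: Values of $\hat{E}[(\theta_1, \theta_2]]$ and $\hat{E}[(\theta_1, \theta_2], \mathrm{C/Cd/W}]$ for different intervals $(\theta_1, \theta_2]$. The height of each bar is equal to $\hat{E}[(\theta_1, \theta_2]]$; the divisions within each bar show $\hat{E}[(\theta_1, \theta_2], \mathrm{C}]$, $\hat{E}[(\theta_1, \theta_2], \mathrm{Cd}]$, and $\hat{E}[(\theta_1, \theta_2], \mathrm{W}]$.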

As well as disagreeing about whether (q, d) contains a valid answer, annotators can disagree about the location of the best answer. In many cases there are multiple valid long answers in multiple distinct locations on the page.8 The most extreme example of this that we see in our 25-way annotated data is for the question “name the substance used to make the filament of bulb” paired with the Wikipedia page about incandescent light bulbs. Annotators identify 7 passages that discuss tungsten wire filaments.

Short answers can be arbitrarily delimited and this can lead to extreme variation. The most extreme example of this that we see in the 25-way annotated data is the 11 distinct, but correct, answers for the question “where is blood pumped after it leaves the right ventricle”. Here, 14 annotators identify a substring of “to the lungs” as the best possible short answer. Of these, 6 label the entire string, 4 reduce it to “the lungs”, and 4 reduce it to “lungs”. A further 6 annotators do not consider this short answer to be sufficient and choose more precise phrases such as “through the semilunar pulmonary valve into the left and right main pulmonary arteries (one for each lung)”. The remaining 5 annotators decide that there is no adequate short answer.

For each question, we ranked each of the unique answers given by our 25 annotators according to the number of annotators that chose it. We found that by just taking the most popular long answer, we could account for 83% of the long answer annotations. The two most popular long answers account for 96% of the long answer annotations. It is extremely uncommon for a question to have more than three distinct long answers annotated. Short answers have greater variability, but the most popular short answer still accounts for 64% of all short answer annotations. The three most popular short answers account for 90% of all short answer annotations.
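The ranking analysis above can be sketched as follows. This is an illustrative reading that pools non-null annotations across all questions before dividing; the paper does not spell out the aggregation, so that choice, the function name, and the toy data are assumptions.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def top_k_coverage(
    annotation_sets: Sequence[Sequence[Optional[Hashable]]], k: int
) -> float:
    """Fraction of non-null annotations matching one of the k most popular
    answers for their question (None stands for NULL)."""
    covered = 0
    total = 0
    for annotations in annotation_sets:
        counts = Counter(a for a in annotations if a is not None)
        total += sum(counts.values())
        covered += sum(count for _, count in counts.most_common(k))
    if total == 0:
        raise ValueError("no non-null annotations")
    return covered / total


# Toy data: two questions with four annotations each (made up for this example).
toy = [["a", "a", "b", None], ["x", "y", "z", "z"]]
for k in (1, 2, 3):
    print(k, top_k_coverage(toy, k))
```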

5 Evaluation Measures

nq includes 5-way annotations on 7,830 items for development data, and we will sequester a further 7,842 items, 5-way annotated, for test data. This section describes evaluation metrics using this data, and gives justification for these metrics.

We choose 5-way annotations for the following reasons: First, we have evidence that aggregating annotations from 5 annotators is likely to be much more robust than relying on a single annotator (see Section 4). Second, 5 annotators is a small enough number that the cost of annotating thousands of development and test items is not prohibitive.

5.1 Definition of an Evaluation Measure Based on 5-Way Annotations

Assume that we have a model $f_\theta$ with parameters $\theta$ that maps an input (q, d) to a long answer $l = f_\theta(q, d)$. We would like to evaluate the accuracy of this model. Assume we have evaluation examples $\{q^{(i)}, d^{(i)}, a^{(i)}\}$ for i = 1…n, where $q^{(i)}$ is a question, $d^{(i)}$ is the associated Wikipedia document, and $a^{(i)}$ is a vector with components $a_j^{(i)}$ for j = 1…5. Each $a_j^{(i)}$ is the output from the j-th annotator, and can be a paragraph in $d^{(i)}$, or can be NULL. The five annotators are chosen uniformly at random from a pool of annotators.

We define an evaluation measure based on the five-way annotations as follows. If at least two out of five annotators have given a non-null long answer on the example, then the system is required to output a non-null answer that is seen at least once in the five annotations; conversely, if fewer than two annotators give a non-null long answer, the system is required to return NULL as its output.

To make this more formal, define the function $g(a^{(i)})$ to be the number of annotations in $a^{(i)}$ that are non-null. Define a function $h_\beta(a, l)$ that judges the correctness of label l given annotations $a = a_1 \ldots a_5$. This function is parameterized by an integer $\beta$. The function returns 1 if the label l is judged to be correct, and 0 otherwise:

Definition 1 (Definition of $h_\beta(a, l)$). If $g(a) \geq \beta$ and $l \neq \text{NULL}$ and $l = a_j$ for some $j \in \{1 \ldots 5\}$, then $h_\beta(a, l) = 1$; else if $g(a) < \beta$ and $l = \text{NULL}$, then $h_\beta(a, l) = 1$; else $h_\beta(a, l) = 0$.

We used β = 2 in our experiments.9
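Definition 1 translates almost line for line into code. The sketch below is an illustrative transcription; representing NULL as `None` and labels as arbitrary hashable identifiers are assumptions of the example.

```python
from typing import Hashable, Optional, Sequence

Label = Optional[Hashable]  # None stands for NULL


def g(annotations: Sequence[Label]) -> int:
    """Number of annotations that are non-null."""
    return sum(a is not None for a in annotations)


def h_beta(annotations: Sequence[Label], label: Label, beta: int = 2) -> int:
    """Return 1 if `label` is judged correct given the annotations, else 0."""
    if g(annotations) >= beta:
        # An answer is expected: the label must be non-null and match
        # at least one annotator.
        return int(label is not None and label in annotations)
    # Too few annotators answered: only NULL is correct.
    return int(label is None)


refs = ["p1", None, "p2", None, None]
print(h_beta(refs, "p2"))  # matches one of two non-null annotations
print(h_beta(refs, None))  # NULL is wrong when at least beta annotators answered
```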

The accuracy of a model is then
$$A_\beta(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} h_\beta\left(a^{(i)}, f_\theta(q^{(i)}, d^{(i)})\right)$$
The value for $A_\beta$ is an estimate of accuracy with respect to the underlying distribution, which we define as $\bar{A}_\beta(f_\theta) = E[h_\beta(a, f_\theta(q, d))]$. Here the expectation is taken with respect to $p(a, q, d) = p(q, d) \prod_{j=1}^{5} p(a_j \mid q, d)$ where $p(a_j \mid q, d) = P(L = a_j \mid Q = q, D = d)$; hence the annotations $a_1 \ldots a_5$ are assumed to be drawn IID from $p(l \mid q, d)$.10

We discuss this measure at length in this section. First, however, we make the following critical point:

It is possible for a model trained on $(l^{(i)}, q^{(i)}, d^{(i)})$ triples drawn IID from p(l,q,d) to exceed the performance of a single annotator on this measure.

In particular, if we have a model p(l|q, d; θ), trained on (l, q, d) triples, which is a good approximation to p(l|q, d), it is then possible to use p(l|q, d; θ) to make predictions that outperform a single random draw from p(l|q, d). The Bayes optimal hypothesis (see Devroye et al., 1997) for $h_\beta$, defined as $\arg\max_f E_{q,d,a}[h_\beta(a, f(q, d))]$, is a function of the posterior distribution p(⋅|q, d),11 and will generally exceed the performance of a single random annotation, $E_{q,d,a}\left[\sum_l p(l \mid q, d) \times h_\beta(a, l)\right]$.

We also show this empirically, by constructing an approximation to p(l|q,d) from 20-way annotations, then using this approximation to make predictions that significantly outperform a single annotator.
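The gap between a single random annotation and the best fixed prediction can be seen in a small worked example for one (q, d) pair. The sketch below enumerates all 5-way reference vectors under a made-up annotator distribution; the distribution (40% NULL, 60% passage "A") and all names are assumptions chosen purely for illustration.

```python
from itertools import product
from typing import Dict, Hashable, Optional, Sequence

Label = Optional[Hashable]  # None stands for NULL


def h_beta(annotations: Sequence[Label], label: Label, beta: int = 2) -> int:
    """Correctness of `label` given the reference annotations (Definition 1)."""
    non_null = sum(a is not None for a in annotations)
    if non_null >= beta:
        return int(label is not None and label in annotations)
    return int(label is None)


def expected_score(p: Dict[Label, float], prediction: Label,
                   beta: int = 2, k: int = 5) -> float:
    """E[h_beta(a, prediction)] with the k references drawn IID from p."""
    total = 0.0
    for refs in product(list(p), repeat=k):
        prob = 1.0
        for label in refs:
            prob *= p[label]
        total += prob * h_beta(refs, prediction, beta)
    return total


def single_annotator_score(p: Dict[Label, float], beta: int = 2, k: int = 5) -> float:
    """Expected score when the prediction is itself one random draw from p."""
    return sum(p[l] * expected_score(p, l, beta, k) for l in p)


# Toy annotator distribution for one (q, d) pair (made up).
p = {None: 0.4, "A": 0.6}
print(expected_score(p, "A"))     # always predicting "A": about 0.91
print(single_annotator_score(p))  # one random annotator: about 0.58
```

In this toy case, always predicting "A" is correct whenever at least two of the five references are non-null, whereas a random annotator returns NULL 40% of the time and is then usually scored as wrong.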

Precision and Recall

During evaluation, it is often beneficial to separately measure false positives (incorrectly predicting an answer), and false negatives (failing to predict an answer). We define the precision (P) and recall (R) of fθ:
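$$t(q, d, a, f_\theta) = h_\beta(a, f_\theta(q, d)) \, [[f_\theta(q, d) \neq \text{NULL}]]$$
$$R(f_\theta) = \frac{\sum_{i=1}^{n} t(q^{(i)}, d^{(i)}, a^{(i)}, f_\theta)}{\sum_{i=1}^{n} [[g(a^{(i)}) \geq \beta]]}$$
$$P(f_\theta) = \frac{\sum_{i=1}^{n} t(q^{(i)}, d^{(i)}, a^{(i)}, f_\theta)}{\sum_{i=1}^{n} [[f_\theta(q^{(i)}, d^{(i)}) \neq \text{NULL}]]}$$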
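The following sketch computes accuracy, precision, recall, and F1 from reference annotations and system predictions, following the definitions above. Representing NULL as `None`, the function names, and returning 0 when a denominator is empty are assumptions of the example rather than details from the paper.

```python
from typing import Dict, Hashable, Optional, Sequence

Label = Optional[Hashable]  # None stands for NULL


def g(annotations: Sequence[Label]) -> int:
    """Number of non-null annotations."""
    return sum(a is not None for a in annotations)


def h_beta(annotations: Sequence[Label], label: Label, beta: int = 2) -> int:
    """Correctness of `label` given the reference annotations (Definition 1)."""
    if g(annotations) >= beta:
        return int(label is not None and label in annotations)
    return int(label is None)


def evaluate(references: Sequence[Sequence[Label]],
             predictions: Sequence[Label],
             beta: int = 2) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 over paired references/predictions."""
    if len(references) != len(predictions) or not references:
        raise ValueError("need equally many references and predictions")
    # t = h_beta * [[prediction != NULL]]: correct non-null predictions.
    true_pos = sum(h_beta(a, l, beta) for a, l in zip(references, predictions)
                   if l is not None)
    n_answerable = sum(g(a) >= beta for a in references)
    n_predicted = sum(l is not None for l in predictions)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_answerable if n_answerable else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(h_beta(a, l, beta)
                   for a, l in zip(references, predictions)) / len(references)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

A correctly predicted NULL raises accuracy but leaves precision and recall unchanged, since both count only correct non-null predictions in their numerator.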

5.2 Super-Annotator Upper Bound

To place an upper bound on the metrics introduced above we create a “super-annotator” from the 25-way annotated data introduced in Section 4. From this data, we create four-tuples $(q^{(i)}, d^{(i)}, a^{(i)}, b^{(i)})$. The first three terms in this tuple are the question, document, and vector of five reference annotations. $b^{(i)}$ is a vector of annotations $b_j^{(i)}$ for j = 1…20 drawn from the same distribution as $a^{(i)}$. The super-annotator predicts NULL if $g(b^{(i)}) < \alpha$, and $l^* = \arg\max_{l \in d} \sum_{j=1}^{20} [[l = b_j]]$ otherwise.

Table 3 shows super-annotator performance for α = 8, with 90.0% precision, 84.6% recall, and 87.2% F-measure. This significantly exceeds the performance (80.4% precision/67.6% recall/ 73.4% F-measure) for a single annotator. We subsequently view the super-annotator numbers as an effective upper bound on performance of a learned model.
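The super-annotator decision rule can be sketched as follows. This is a minimal illustration, assuming NULL is represented as `None` and that ties between equally frequent answers are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def super_annotator(b, alpha=8):
    """Predict from 20 held-out annotations b.

    Returns None (NULL) if fewer than alpha annotations are non-NULL,
    otherwise the most frequent non-NULL annotation.
    """
    non_null = [label for label in b if label is not None]
    if len(non_null) < alpha:
        return None
    # most_common(1) returns the label with the highest count.
    return Counter(non_null).most_common(1)[0][0]


votes = ["p1"] * 5 + ["p2"] * 4 + [None] * 11
prediction = super_annotator(votes)
```

With nine non-NULL votes out of twenty the rule clears the α = 8 threshold and returns the majority answer; with seven or fewer it falls back to NULL.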

Table 3: Precision (P), recall (R), and the harmonic mean of these (F1) of all baselines, a single annotator, and the super-annotator upper bound. The human performances marked with † are evaluated on a sample of five annotations from the 25-way annotated data introduced in Section 5.

| System | Long Dev P | Long Dev R | Long Dev F1 | Long Test P | Long Test R | Long Test F1 | Short Dev P | Short Dev R | Short Dev F1 | Short Test P | Short Test R | Short Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| First paragraph | 22.2 | 37.8 | 27.8 | 22.3 | 38.5 | 28.3 | – | – | – | – | – | – |
| Most frequent | 43.1 | 20.0 | 27.3 | 40.2 | 18.4 | 25.2 | – | – | – | – | – | – |
| Closest question | 37.7 | 28.5 | 32.4 | 36.2 | 27.8 | 31.4 | – | – | – | – | – | – |
| DocumentQA | 47.5 | 44.7 | 46.1 | 48.9 | 43.3 | 45.7 | 38.6 | 33.2 | 35.7 | 40.6 | 31.0 | 35.1 |
| DecAtt + DocReader | 52.7 | 57.0 | 54.8 | 54.3 | 55.7 | 55.0 | 34.3 | 28.9 | 31.4 | 31.9 | 31.1 | 31.5 |
| Single annotator† | 80.4 | 67.6 | 73.4 | – | – | – | 63.4 | 52.6 | 57.5 | – | – | – |
| Super-annotator† | 90.0 | 84.6 | 87.2 | – | – | – | 79.1 | 72.6 | 75.7 | – | – | – |

6 Baseline Performance

The NQ corpus is designed to provide a benchmark with which we can evaluate the performance of QA systems. Every question in NQ is unique under exact string match, and we randomly split the questions in NQ into separate train/development/test sets. To facilitate comparison, we introduce baselines that either make use of high-level data set regularities, or are trained on the 307k examples in the training set. Here, we present well-established baselines that were state of the art at the time of submission. We also refer readers to Alberti et al. (2019) for more recent advances in modeling. All of our baselines focus on the long and short answer extraction tasks. We leave boolean answers to future work.

6.1 Untrained Baselines

NQ’s long answer selection task admits several untrained baselines. The first paragraph of a Wikipedia page commonly acts as a summary of the most important information regarding the page’s subject. We therefore implement a long answer baseline that simply selects the first paragraph for all pages.

Furthermore, because 79% of the Wikipedia pages in the development set also appear in the training set, we implement two “copying” baselines. The first of these simply selects the most frequent annotation applied to a given page in the training set. The second selects the annotation given to the training set question closest to the eval set question according to TFIDF weighted word overlap. These three baselines are reported as First paragraph, Most frequent, and Closest question in Table 3, respectively.
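The two copying baselines can be sketched as below. This is only an illustration under several assumptions that the text does not fix: training data is given as hypothetical (page, question, annotation) triples, the closest question is searched among training questions for the same page, tokenization is lowercase whitespace splitting, and the IDF weights use a smoothed logarithm computed over the training questions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase whitespace splitting.
    return text.lower().split()


def most_frequent(train, page):
    """Most frequent training annotation for this page, or None if unseen."""
    counts = Counter(ans for p, _, ans in train if p == page)
    return counts.most_common(1)[0][0] if counts else None


def closest_question(train, page, question):
    """Annotation of the same-page training question with the highest
    IDF-weighted word overlap with the evaluation question."""
    same_page = [(q, ans) for p, q, ans in train if p == page]
    if not same_page:
        return None
    n = len(train)
    # Document frequency of each word over all training questions.
    df = Counter(w for _, q, _ in train for w in set(tokenize(q)))

    def idf(word):
        return math.log((1 + n) / (1 + df[word])) + 1.0

    target = set(tokenize(question))

    def overlap(q):
        return sum(idf(w) for w in target & set(tokenize(q)))

    best_question, best_annotation = max(same_page, key=lambda qa: overlap(qa[0]))
    return best_annotation


train = [
    ("P", "who wrote hamlet", "para1"),
    ("P", "when was hamlet written", "para2"),
    ("P", "who wrote hamlet play", "para1"),
    ("Q", "who is x", "para9"),
]
frequent = most_frequent(train, "P")
closest = closest_question(train, "P", "when was hamlet first written")
```

Both functions return None for a page that never appears in the training set, which corresponds to making no prediction for the roughly one fifth of development pages not seen in training.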

6.2 Document-QA

We adapt the reference implementation12 of Document-QA (Clark and Gardner, 2018) for the NQ task. This system performs well on the SQuAD and TriviaQA short answer extraction tasks, but it is not designed to represent: (i) the long answers that do not contain short answers, and (ii) the NULL answers that occur in NQ.

To address (i) we choose the shortest available answer span at training, differentiating long and short answers only through the inclusion of special start and end of passage tokens that identify long answer candidates. At prediction time, the model can either predict a long answer (and no short answer), or a short answer (which implies a long answer).

To address (ii), we tried adding special NULL passages to represent the lack of answer. However, we achieved better performance by training on the subset of questions with answers and then only predicting those answers whose scores exceed a threshold.

With these two modifications, we are able to apply Document-QA to NQ. We follow Clark and Gardner (2018) in pruning documents down to the set of passages that have highest TFIDF similarity with the question. Under this approach, we consider the top 16 passages as long answers. We consider short answers containing up to 17 words. We train Document-QA for 30 epochs with batches containing 15 examples. The post hoc score threshold is set to 3.0. All of these values were chosen on the basis of development set performance.
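The post hoc thresholding step can be sketched as follows. This is a hypothetical helper, not part of the Document-QA reference implementation; it assumes the model emits (span, score) pairs and that a prediction is kept only when the best score strictly exceeds the threshold.

```python
def select_answer(scored_spans, threshold=3.0):
    """Return the highest-scoring span if its score exceeds the threshold,
    otherwise None (a NULL prediction).

    scored_spans: list of (span, score) pairs; the format is an assumption.
    """
    if not scored_spans:
        return None
    best_span, best_score = max(scored_spans, key=lambda pair: pair[1])
    return best_span if best_score > threshold else None


answer = select_answer([("span_a", 2.5), ("span_b", 3.4)])
```

Raising the threshold trades recall for precision, since more questions receive a NULL prediction.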

6.3 Custom Pipeline (DecAtt + DocReader)

One view of the long answer selection task is that it is more closely related to natural language inference (Bowman et al., 2015; Williams et al., 2018) than short answer extraction. A valid long answer must contain all of the information required to infer the answer. Short answers do not need to contain this information—they need to be surrounded by it.

Motivated by this intuition, we implement a pipelined approach that uses a model drawn from the natural language inference literature to select long answers. Then short answers are selected from these using a model drawn from the short answer extraction literature.

Long answer selection

Let $t(d, l)$ denote the sequence of tokens in $d$ for the long answer candidate $l$. We then use the Decomposable Attention model (Parikh et al., 2016) to produce a score for each question, candidate pair: $x_l = \mathrm{DecAtt}(q, t(d, l))$. To this we add a 10-dimensional trainable embedding $r_l$ of the long answer candidate’s position in the sequence of candidates;13 an integer $u_l$ containing the number of words shared by $q$ and $t(d, l)$; and a scalar $v_l$ containing the number of words shared by $q$ and $t(d, l)$ weighted by inverse document frequency. The long answer score $z_l$ is then given as a linear function of the above features, $z_l = w^\top [x_l, r_l, u_l, v_l] + b$, where $w$ and $b$ are the trainable weight vector and bias, respectively.
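The feature combination for the long answer score can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the DecAtt score is treated as a given scalar, shared words are counted as distinct word types (an assumption), and the helper names and the `idf` dictionary are hypothetical. The position bucketing follows footnote 13.

```python
def position_bucket(position, num_buckets=20):
    """Map a 1-based candidate position to a 0-based embedding row.

    Positions 1..19 get their own row; all positions >= 20 share the last row.
    """
    return min(position, num_buckets) - 1


def overlap_features(q_tokens, cand_tokens, idf):
    """Return (u, v): the shared-word count and its IDF-weighted counterpart."""
    shared = set(q_tokens) & set(cand_tokens)
    u = len(shared)
    v = sum(idf.get(word, 0.0) for word in shared)
    return u, v


def long_answer_score(x_l, r_l, u_l, v_l, w, b):
    """Linear score z_l = w . [x_l, r_l, u_l, v_l] + b."""
    features = [x_l, *r_l, float(u_l), v_l]
    if len(features) != len(w):
        raise ValueError("weight vector and feature vector differ in length")
    return sum(wi * fi for wi, fi in zip(w, features)) + b


u, v = overlap_features(
    ["who", "wrote", "hamlet"],
    ["hamlet", "was", "wrote"],
    {"hamlet": 2.0, "wrote": 1.0},
)
z = long_answer_score(0.5, [0.0] * 10, u, v, [1.0] * 13, 0.25)
```

With a scalar DecAtt score, a 10-dimensional position embedding, and the two overlap features, the weight vector in this sketch has 13 entries.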

Short answer selection

Given a long answer, the Document Reader model (Chen et al., 2017; abbreviated DocReader) is used to extract short answers.

Training

The long answer selection model is trained by minimizing the negative log-likelihood of the correct answer l(i) with a hyperparameter η that down-weights examples with the NULL label:
$$-\sum_{i=1}^{n} \log \frac{\exp(z_{l^{(i)}})}{\sum_{l} \exp(z_l)} \times \left(1 - \eta \, [[l^{(i)} = \mathrm{NULL}]]\right)$$
We found that the inclusion of η is useful in accounting for the asymmetry in labels—because a NULL label is less informative than an answer location. Varying η also seems to provide a more stable method of setting a model’s precision point than post hoc thresholding of prediction scores. An analogous strategy is used for the short answer model where examples with no entity answers are given a different weight.
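The down-weighted objective can be sketched as follows. This is an illustrative sketch, assuming each example provides a dictionary of candidate scores that includes a "NULL" key; the data layout and function name are hypothetical.

```python
import math


def weighted_nll(scores, gold_labels, eta):
    """Negative log-likelihood with NULL examples scaled by (1 - eta).

    scores: list of dicts mapping each candidate label (including "NULL")
            to its score z_l for one example.
    gold_labels: list of correct labels, one per example.
    """
    total = 0.0
    for z, gold in zip(scores, gold_labels):
        # Numerically stable log of the softmax normalizer.
        z_max = max(z.values())
        log_norm = z_max + math.log(sum(math.exp(s - z_max) for s in z.values()))
        log_prob = z[gold] - log_norm
        weight = 1.0 - eta if gold == "NULL" else 1.0
        total -= log_prob * weight
    return total


uniform = [{"NULL": 0.0, "p1": 0.0}]
loss_null = weighted_nll(uniform, ["NULL"], eta=0.5)
loss_answer = weighted_nll(uniform, ["p1"], eta=0.5)
```

With η = 0.5 a NULL example contributes half the loss of an otherwise identical non-NULL example, so increasing η pushes the model toward predicting answers more often.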

6.4 Results

Table 3 shows results for all baselines as well as a single annotator, and the super-annotator introduced in Section 5. It is clear that there is a great deal of headroom in both tasks. We find that Document-QA performs significantly worse than DecAtt+DocReader in long answer identification. This is likely because Document-QA was designed for the short answer task only.

To ground these results in the context of comparable tasks, we measure performance on the subset of NQ that has non-NULL labels for both long and short answers. Freed from the decision of whether or not to answer, DecAtt+DocReader obtains 68.0% F1 on the long answer task, and 40.4% F1 on the short answer task. We also examine performance of the short answer extraction systems in the setting where the long answer is given, and a short answer is known to exist. With this simplification, short answer F1 increases to 57.7% for DocReader. Under this restriction NQ roughly approximates the SQuAD 1.1 task. From the gap to the super-annotator upper bound we know that this task is far from being solved in NQ.

Finally, we break the long answer identification results down according to long answer type. From Table 3 we know that DecAtt+DocReader predicts long answers with 54.8% F1. If we only measure performance on examples that should have a paragraph long answer, this increases to 65.1%. For tables and table rows it is 66.4%, and for lists and list items it is 32.0%. All other examples have a NULL label. Clearly, the model is struggling to learn some aspect of list-formatted data from the 6% of the non-NULL examples that have this type.

7 Conclusion

We argue that progress on QA has been hindered by a lack of appropriate training and test data. To address this, we present the Natural Questions corpus. This is the first large publicly available data set to pair real user queries with high-quality annotations of answers in documents. We also present metrics to be used with NQ, for the purposes of evaluating the performance of question answering systems. We demonstrate a high upper bound on these metrics and show that existing methods do not approach this upper bound. We argue that for them to do so will require significant advances in NLU. Figure 5 shows example questions from the data set. Figure 6 shows example question/answer pairs from the data set, together with expert judgments and statistics from the 25-way annotations.

Figure 5: Examples from the questions with 25-way annotations.

Figure 6: Answer annotations for four examples from Figure 5 that have long answers that are paragraphs (i.e., not tables or lists). We show the expert judgment (C/Cd/W) for each non-null answer. “Long answer stats” a/25, b/25 have a = number of non-null long answers for this question, b = number of long answers the same as that shown in the figure. For example, for question A1, 13 out of 25 annotators give some non-null answer, and 4 out of 25 annotators give the same long answer After mashing …. “Short answer stats” has similar statistics for short answers.

Notes

1

For example, for machine translation/speech recognition humans provide translations/transcriptions relatively easily.

3

We pre-define the set of categorical noun phrases used in 4 and 5 by running Hearst patterns (Hearst, 1992) to find a broad set of hypernyms. Part of speech tags and entities are identified using Google’s Cloud NLP API: https://cloud.google.com/natural-language.

4

We note that both tables and lists may be used purely for the purposes of formatting text, or they may have their own complex semantics—as in the case of Wikipedia infoboxes.

5

More formally, there is some base distribution $p_b(q)$ from which queries $q$ are drawn, and a deterministic function $s(q)$ which returns the top-ranked Wikipedia page in the top 5 search results, or NULL if there is no Wikipedia page in the top 5 results. Define $Q$ to be the set of queries such that $s(q) \neq \mathrm{NULL}$, and $b = \sum_{q \in Q} p_b(q)$. Then $p(q, d) = p_b(q)/b$ if $q \in Q$ and $d \neq \mathrm{NULL}$ and $d = s(q)$, otherwise $p(q, d) = 0$.

6

The first four authors of this paper.

7

More formally, let $[[e]]$ for any statement $e$ be 1 if $e$ is true, 0 if $e$ is false. We define $\hat{E}(C) = \frac{1}{|S|} \sum_{i=1}^{|S|} [[c^{(i)} = C]]$. The values for $\hat{E}(C_d)$ and $\hat{E}(W)$ are calculated in a similar manner.

8

As stated earlier in this paper, we did instruct annotators to select the earliest instance of an answer when there are multiple answer instances on the page. However, there are still cases where different annotators disagree on whether an answer earlier in the page is sufficient in comparison to a later answer, leading to differences between annotators.

9

This is partly motivated through the results on 25-way annotations (see Section 4.4), where for μ(i) ≥ 0.4 over 93% (114/122 annotations) are in the C or Cd categories, whereas for μ(i) < 0.4 over 35% (11/17 annotations) are in the W category.

10

This isn’t quite accurate as the annotators are sampled without replacement; however, it simplifies the analysis.

11

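Specifically, for an input $(q, d)$, if we define $l^* = \arg\max_{l \neq \mathrm{NULL}} p(l \mid q, d)$, $\gamma = p(l^* \mid q, d)$, and $\bar{\gamma} = p(\mathrm{NULL} \mid q, d)$, then the Bayes optimal hypothesis is to output $l^*$ if $P(h_\beta(a, l^*) = 1 \mid \gamma, \bar{\gamma}) \geq P(h_\beta(a, \mathrm{NULL}) = 1 \mid \gamma, \bar{\gamma})$, and to output NULL otherwise. Implementation of this strategy is straightforward if $\gamma$ and $\bar{\gamma}$ are known; this strategy will in general give a higher accuracy value than taking a single sample $l$ from $p(l \mid q, d)$ and using this sample as the prediction. In principle a model $p(l \mid q, d; \theta)$ trained on $(l, q, d)$ triples can converge to a good estimate of $\gamma$ and $\bar{\gamma}$. Note that for the special case $\gamma + \bar{\gamma} = 1$ we have $P(h_\beta(a, \mathrm{NULL}) = 1 \mid \gamma, \bar{\gamma}) = \bar{\gamma}^5 + 5\bar{\gamma}^4 (1 - \bar{\gamma})$ and $P(h_\beta(a, l^*) = 1 \mid \gamma, \bar{\gamma}) = 1 - P(h_\beta(a, \mathrm{NULL}) = 1 \mid \gamma, \bar{\gamma})$. It follows that the Bayes optimal hypothesis is to predict $l^*$ if $\gamma \geq \alpha$ where $\alpha \approx 0.31381$, and to predict NULL otherwise. $\alpha$ is $1 - \bar{\alpha}$ where $\bar{\alpha}$ is the solution to $\bar{\alpha}^5 + 5\bar{\alpha}^4 (1 - \bar{\alpha}) = 0.5$.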

13

Specifically, we have a unique learned 10-dimensional embedding for each position 1 … 19 in the sequence, and a 20th embedding used for all positions ≥ 20.

References

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT Baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855, Melbourne.

Luc Devroye, László Györfi, and Gábor Lugosi. 1997. A Probabilistic Theory of Pattern Recognition, corrected 2nd edition, volume 31 of Applications of Mathematics. Springer.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: A Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46, Melbourne.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15. Cambridge, MA.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of the International Conference on Learning Representations.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235, Austin, TX.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, TX.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, TX.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, WA.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, LA.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels.

Author notes

Action Editor: Jing Jiang.

Project initiation;

Project design;

Data creation;

Model development;

Project support;

Also affiliated with Columbia University, work done at Google;

No longer at Google, work done at Google.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.