Natural Questions: A Benchmark for Question Answering Research

We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating the quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.


Introduction
In recent years there has been dramatic progress in machine learning approaches to problems such as machine translation, speech recognition, and image recognition. One major factor in these successes has been the development of neural methods that far exceed the performance of previous approaches. A second major factor has been the existence of large quantities of training data for these systems.
* ♣ Project initiation; ♦ Project design; ♠ Data creation; ♥ Model development; ♤ Project support; ♥ Also affiliated with Columbia University, work done at Google; ♦ No longer at Google, work done at Google.
Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU), which has significant utility to users, and in addition is potentially a challenge task that can drive the development of methods for NLU. Several pieces of recent work have introduced QA data sets (e.g., Rajpurkar et al., 2016; Reddy et al., 2018). However, in contrast to tasks where it is relatively easy to gather naturally occurring examples, the definition of a suitable QA task, and the development of a methodology for annotation and evaluation, is challenging. Key issues include the methods and sources used to obtain questions; the methods used to annotate and collect answers; the methods used to measure and ensure annotation quality; and the metrics used for evaluation. For more discussion of the limitations of previous work with respect to these issues, see Section 2 of this paper.
This paper introduces Natural Questions (NQ), a new data set for QA research, along with methods for QA system evaluation. Our goals are three-fold: 1) To provide large-scale end-to-end training data for the QA problem. 2) To provide a data set that drives research in natural language understanding. 3) To study human performance in providing QA annotations for naturally occurring questions.
In brief, our annotation process is as follows. An annotator is presented with a (question, Wikipedia page) pair. The annotator returns a (long answer, short answer) pair. The long answer (l) can be an HTML bounding box on the Wikipedia page (typically a paragraph or table) that contains the information required to answer the question. Alternatively, the annotator can return l = NULL if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer (s) can be a span or set of spans (typically entities) within l that answer the question, a boolean yes or no answer, or NULL. If l = NULL then s = NULL, necessarily. Figure 1 shows examples.
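The constraints on an annotation described above can be made concrete with a small data structure. The following is a minimal sketch only: the class name, field names, and span encoding are hypothetical and are not the schema of the released data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical encoding: a span is (start, end) token offsets inside the long answer.
Span = Tuple[int, int]


@dataclass(frozen=True)
class Annotation:
    long_answer: Optional[str]          # id of an HTML bounding box, or None for NULL
    short_spans: Tuple[Span, ...] = ()  # zero or more spans within the long answer
    yes_no: Optional[bool] = None       # boolean short answer, if any

    def __post_init__(self):
        # The short answer is either a set of spans or a yes/no answer, not both.
        if self.short_spans and self.yes_no is not None:
            raise ValueError("short answer is either spans or yes/no, not both")
        # If l = NULL then s = NULL, necessarily.
        if self.long_answer is None and self.has_short_answer:
            raise ValueError("l = NULL implies s = NULL")

    @property
    def has_short_answer(self) -> bool:
        return bool(self.short_spans) or self.yes_no is not None
```

Encoding NULL as `None` keeps the rule "l = NULL implies s = NULL" checkable at construction time.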
Natural Questions has the following properties:
Source of questions. The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are "natural" in that they represent real queries from people seeking information.
Number of items. The public release contains 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and 7,842 5-way annotated items sequestered as test data. We justify the use of 5-way annotation for evaluation in Section 5.

Task definition. The input to a model is a question together with an entire Wikipedia page. The target output from the model is: 1) a long answer (e.g., a paragraph) from the page that answers the question, or alternatively an indication that there is no answer on the page; 2) a short answer where applicable. The task was designed to be close to an end-to-end question answering application.
Ensuring high-quality annotations at scale. Comprehensive guidelines were developed for the task. These are summarized in Section 3. Annotation quality was constantly monitored.
Evaluation of quality. Section 4 describes post hoc evaluation of annotation quality. Long/short answers have 90%/84% precision, respectively.
Study of variability. One clear finding in NQ is that for naturally occurring questions there is often genuine ambiguity in whether or not an answer is acceptable. There are also often a number of acceptable answers. Section 4 examines this variability using 25-way annotations.
Robust evaluation metrics. Section 5 introduces methods of measuring answer quality that account for variability in acceptable answers. We demonstrate a high human upper bound on these measures for both long answers (90% precision, 85% recall), and short answers (79% precision, 72% recall).
We propose NQ as a new benchmark for research in QA. In Section 6.4 we present baseline results from recent models developed on comparable data sets (Clark and Gardner, 2018), as well as a simple pipelined model designed for the NQ task. We demonstrate a large gap between the performance of these baselines and a human upper bound. We argue that closing this gap will require significant advances in NLU.

Related Work
The SQuAD (Rajpurkar et al., 2016), SQuAD 2.0 (Rajpurkar et al., 2018), NarrativeQA (Kocisky et al., 2018), and HotpotQA (Yang et al., 2018) data sets contain questions and answers written by annotators who have first read a short text containing the answer. The SQuAD data sets contain question/paragraph/answer triples from Wikipedia. In the original SQuAD data set, annotators often borrow part of the evidence paragraph to create a question. Jia and Liang (2017) showed that systems trained on SQuAD could be easily fooled by the insertion of distractor sentences that should not change the answer, and SQuAD 2.0 introduces questions that are designed to be unanswerable. However, we argue that questions written to be unanswerable can be identified as such with little reasoning, in contrast to NQ's task of deciding whether a paragraph contains all of the evidence required to answer a real question. Both SQuAD tasks have driven significant advances in reading comprehension, but systems now outperform humans and harder challenges are needed.
NarrativeQA aims to elicit questions that are not close paraphrases of the evidence by having annotators write them from separate summary texts. No human performance upper bound is provided for the full task and, although an extractive system could theoretically perfectly recover all answers, current approaches only just outperform a random baseline. NarrativeQA may just be too hard for the current state of NLU.
HotpotQA is designed to contain questions that require reasoning over text from separate Wikipedia pages. As well as answering questions, systems must also identify passages that contain supporting facts. This is similar in motivation to NQ's long answer task, where the selected passage must contain all of the information required to infer the answer. Mirroring our identification of acceptable variability in the NQ task definition, HotpotQA's authors observe that the choice of supporting facts is somewhat subjective. They set high human upper bounds by selecting, for each example, the score-maximizing partition of four annotations into one prediction and three references. The reference labels chosen by this maximization are not representative of the reference labels in HotpotQA's evaluation set, and it is not clear that the upper bounds are achievable. A more robust approach is to keep the evaluation distribution fixed, and calculate an achievable upper bound by approximating the expectation over annotations, as we have done for NQ in Section 5.
The QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2018) data sets contain dialogues between a questioner, who is trying to learn about a text, and an answerer. QuAC also prevents the questioner from seeing the evidence text. Conversational QA is an exciting new area, but it is significantly different from the single-turn QA task in NQ. In both QuAC and CoQA, conversations tend to explore evidence texts incrementally, progressing from the start to the end of the text. This contrasts with NQ, where individual questions often require reasoning over large bodies of text.
The WikiQA (Yang et al., 2015) and MS Marco (Nguyen et al., 2016) data sets contain queries sampled from the Bing search engine. WikiQA contains only 3,047 questions. MS Marco contains 100,000 questions with free-form answers. For each question, the annotator is presented with 10 passages returned by the search engine, and is asked to generate an answer to the query, or to say that the answer is not contained within the passages. Free-form text answers allow more flexibility in providing abstractive answers, but lead to difficulties in evaluation (BLEU score [Papineni et al., 2002] is used). MS Marco's authors do not discuss issues of variability or report quality metrics for their annotations. From our experience, these issues are critical. DuReader is a Chinese language data set containing queries from Baidu search logs. Like NQ, DuReader contains real user queries; it requires systems to read entire documents to find answers; and it identifies acceptable variability in answers. However, as with MS Marco, DuReader is reliant on BLEU for answer scoring, and systems already outperform humans according to this metric.
There are a number of reading comprehension benchmarks based on multiple choice tests (Mihaylov et al., 2018; Richardson et al., 2013; Lai et al., 2017). The TriviaQA data set (Joshi et al., 2017) contains questions and answers taken from trivia quizzes found online. A number of Cloze-style tasks have also been proposed (Hermann et al., 2015; Hill et al., 2015; Paperno et al., 2016; Onishi et al., 2016). We believe that all of these tasks are related to, but distinct from, answering information-seeking questions. We also believe that, because a solution to NQ will have genuine utility, it is better equipped as a benchmark for NLU.

Task Definition and Data Collection
Natural Questions contains (question, Wikipedia page, long answer, short answer) quadruples where: the question seeks factual information; the Wikipedia page may or may not contain the information required to answer the question; the long answer is a bounding box on this page containing all information required to infer the answer; and the short answer is one or more entities that give a short answer to the question, or a boolean yes or no. Both the long and short answer can be NULL if no viable candidates exist on the Wikipedia page.

Table 1: Matches for heuristics in Section 3.1.
1.a  where does the nature conservancy get its funding
1.b  who is the song killing me softly written about
2    who owned most of the railroads in the 1800s
4    how far is chardon ohio from cleveland ohio
5    american comedian on have i got news for you

Questions and Evidence Documents
All the questions in NQ are queries of 8 words or more that have been issued to the Google search engine by multiple users in a short period of time.
From these queries, we sample a subset that either:
1. start with "who", "when", or "where" directly followed by: a) a finite form of "do" or a modal verb; or b) a finite form of "be" or "have" with a verb in some later position;
2. start with "who" directly followed by a verb that is not a finite form of "be";
3. contain multiple entities as well as an adjective, adverb, verb, or determiner;
4. contain a categorical noun phrase immediately preceded by a preposition or relative clause;
5. end with a categorical noun phrase, and do not contain a preposition or relative clause.
Table 1 gives examples. We run questions through the Google search engine and keep those where there is a Wikipedia page in the top 5 search results. The (question, Wikipedia page) pairs are the input to the human annotation task described next.
The goal of these heuristics is to discard a large proportion of queries that are non-questions, while retaining the majority of queries of 8 words or more in length that are questions. A manual inspection showed that the majority of questions in the data, with the exclusion of questions beginning with "how to", are accepted by the filters. We focus on longer queries as they are more complex, and are thus a more challenging test for deep NLU. We focus on Wikipedia as it is a very important source of factual information, and we believe that stylistically it is similar to other sources of factual information on the Web; however, like any data set there may be biases in this choice. Future data-collection efforts may introduce shorter queries, "how to" questions, or domains other than Wikipedia.
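To illustrate the flavor of these filters, the sketch below approximates only the length constraint and heuristic 1(a) with plain word lists. This is a deliberate simplification: the word lists are assumptions made for illustration, and the remaining heuristics rely on linguistic analysis (entities, part-of-speech tags, noun phrases) that is not attempted here.

```python
# Illustrative word lists; the paper does not enumerate them, so these are assumptions.
WH_WORDS = {"who", "when", "where"}
DO_FORMS = {"do", "does", "did"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}


def passes_length_filter(query: str, min_words: int = 8) -> bool:
    """Keep only queries of 8 words or more."""
    return len(query.split()) >= min_words


def matches_heuristic_1a(query: str) -> bool:
    """Rough check for: starts with who/when/where directly followed by
    a finite form of "do" or a modal verb."""
    tokens = query.lower().split()
    return (
        len(tokens) >= 2
        and tokens[0] in WH_WORDS
        and tokens[1] in DO_FORMS | MODALS
    )
```

For example, the Table 1 query "where does the nature conservancy get its funding" passes both checks, while "who owned most of the railroads in the 1800s" passes the length filter but not this approximation of 1(a), since it is an instance of heuristic 2.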

Human Identification of Answers
Annotation is performed using a custom annotation interface, by a pool of around 50 annotators, with an average annotation time of 80 seconds.
The guidelines and tooling divide the annotation task into three conceptual stages, where all three stages are completed by a single annotator in succession. The decision flow through these is illustrated in Figure 2 and the instructions given to annotators are summarized below.
Question Identification: Contributors determine whether the given question is good or bad. A good question is a fact-seeking question that can be answered with an entity or explanation. A bad question is ambiguous, incomprehensible, dependent on clearly false presuppositions, opinion-seeking, or not clearly a request for factual information. Annotators must make this judgment solely by the content of the question; they are not yet shown the Wikipedia page.
Long Answer Identification: For good questions only, annotators select the earliest HTML bounding box containing enough information for a reader to completely infer the answer to the question. Bounding boxes can be paragraphs, tables, list items, or whole lists. Alternatively, annotators mark "no answer" if the page does not answer the question, or if the information is present but not contained in a single one of the allowed elements.

Short Answer Identification: For examples with long answers, annotators select the entity or set of entities within the long answer that answer the question. Alternatively, annotators can flag that the short answer is yes or no, or that no short answer is possible.

Data Statistics
In total, annotators identify a long answer for 49% of the examples, and short answer spans or a yes/no answer for 36% of the examples. We consider the choice of whether or not to answer a question a core part of the question answering task, and do not discard the remaining 51% that have no answer labeled.
Annotators identify long answers by selecting the smallest HTML bounding box that contains all of the information required to answer the question. These are mostly paragraphs (73%). The remainder are made up of tables (19%), table rows (1%), lists (3%), or list items (3%). We leave further subcategorization of long answers to future work, and provide a breakdown of baseline performance on each of these three types of answers in Section 6.4.

Evaluation of Annotation Quality
This section describes evaluation of the quality of the human annotations in our data. We use a combination of two methods: 1) post hoc evaluation of correctness of non-null answers, under consensus judgments from four "experts"; 2) k-way annotations (with k = 25) on a subset of the data.
Post hoc evaluation of non-null answers leads directly to a measure of annotation precision. As is common in information-retrieval style problems such as long-answer identification, measuring recall is more challenging. However, we describe how 25-way annotated data provide useful insights into recall, particularly when combined with expert judgments.

Preliminaries: The Sampling Distribution
Each item in our data consists of a four-tuple (q, d, l, s) where q is a question, d is a document, l is a long answer, and s is a short answer. Thus we introduce random variables Q, D, L, and S corresponding to these items. Note that L can be a span within the document, or NULL. Similarly, S can be one or more spans within L, a boolean, or NULL. For now we consider the three-tuple (q, d, l). The treatment for short answers is the same throughout, with (q, d, s) replacing (q, d, l).
Each data item (q, d, l) is sampled independently and identically distributed (IID) from

$$p(l, q, d) = p(q, d) \times p(l \mid q, d).$$

Here, p(q, d) is the sampling distribution (probability mass function [PMF]) over question/document pairs. It is defined as the PMF corresponding to the following sampling process: first, sample a question at random from some distribution; second, perform a search on a major search engine using the question as the underlying query; finally, either: 1) return (q, d) where d is the top Wikipedia result for q, if d is in the top 5 search results for q; 2) if there is no Wikipedia page in the top 5 results, discard q and repeat the sampling process.
Here p(l|q, d) is the conditional distribution (PMF) over long answer l conditioned on the pair (q, d). The value for l is obtained by: 1) sampling an annotator uniformly at random from the pool of annotators; 2) presenting the pair (q, d) to the annotator, who then provides a value for l.
Note that l is non-deterministic due to two sources of randomness: 1) the random choice of annotator; 2) the potentially random behavior of a particular annotator (the annotator may give a different answer depending on the time of day, etc.).
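We will also consider the distribution

$$p(l, q, d \mid L \neq \text{NULL}) = \begin{cases} p(l, q, d) / P(L \neq \text{NULL}) & \text{if } l \neq \text{NULL} \\ 0 & \text{otherwise} \end{cases}$$

where $P(L \neq \text{NULL}) = \sum_{l, q, d \,:\, l \neq \text{NULL}} p(l, q, d)$. Thus $p(l, q, d \mid L \neq \text{NULL})$ is the probability of seeing the triple (l, q, d), conditioned on L not being NULL.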
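We now define precision of annotations. Consider a function π(l, q, d) that is equal to 1 if l is a "correct" answer for the pair (q, d), and 0 if the answer is incorrect. The next section gives a concrete definition of π. The annotation precision is defined as

$$\Psi = \sum_{l, q, d} p(l, q, d \mid L \neq \text{NULL}) \times \pi(l, q, d).$$

Given a set of annotations

$$S = \{(l^{(i)}, q^{(i)}, d^{(i)})\}_{i=1}^{|S|}$$

drawn IID from $p(l, q, d \mid L \neq \text{NULL})$, we can derive an estimate of Ψ as

$$\hat{\Psi} = \frac{1}{|S|} \sum_{(l, q, d) \in S} \pi(l, q, d).$$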

Expert Evaluations of Correctness
We now describe the process for deriving "expert" judgments of answer correctness. We used four experts for these judgments: the first four authors of this paper, who had prepared the guidelines for the annotation process. In a first phase each of the four experts independently annotated examples for correctness. In a second phase the four experts met to discuss disagreements in judgments, and to reach a single consensus judgment for each example.
A key step is to define the criteria used to determine correctness of an example. Given a triple (l, q, d), we extracted the passage l′ corresponding to l on the page d. The pair (q, l′) was then presented to the expert. Experts categorized (q, l′) pairs into the following three categories:
Correct (C): It is clear beyond a reasonable doubt that the answer is correct.
Correct (but debatable) (C_d): A reasonable person could be satisfied by the answer; however, a reasonable person could raise a reasonable doubt about the answer.
Wrong (W): There is not convincing evidence that the answer is correct.
Figure 3: Examples with consensus expert judgments, and justification for these judgments. See Figure 6 for more examples.
Figure 3 shows some example judgments. We introduced the intermediate C_d category after observing that many (q, l′) pairs are high quality answers, but raise some small doubt or quibble about whether they fully answer the question. The use of the word "debatable" is intended to be literal: (q, l′) pairs falling into the C_d category could literally lead to some debate between reasonable people as to whether they fully answer the question or not.
Given this background, we will make the following assumption: Answers in the C_d category should be very useful to a user interacting with a QA system, and should be considered to be high-quality answers; however, an annotator would be justified in either annotating or not annotating the example.
For these cases there is often disagreement between annotators as to whether the page contains an answer or not: We will see evidence of this when we consider the 25-way annotations.

Results for Precision Measurements
We used the following procedure to derive measurements of precision:
1) We sampled examples IID from the distribution $p(l, q, d \mid L \neq \text{NULL})$. We call this set S. We had |S| = 139.
2) Four experts independently classified each of the items in S into the categories C, C_d, W.
3) The four experts met to come up with a consensus judgment for each item.
For each example $(l^{(i)}, q^{(i)}, d^{(i)}) \in S$, we define $c^{(i)}$ to be the consensus judgment. This process was repeated to derive judgments for short answers. We can then calculate the percentage of examples falling into the three expert categories; we denote these values as $\hat{E}(C)$, $\hat{E}(C_d)$, and $\hat{E}(W)$. We define $\hat{\Psi} = \hat{E}(C) + \hat{E}(C_d)$. We have explicitly included samples C and C_d in the overall precision as we believe that C_d answers are essentially correct. Table 2 shows the values for these quantities.
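The precision estimate described in this section reduces to counting consensus labels. A minimal sketch, assuming the consensus judgments are available as a list of strings with "Cd" standing for the C_d category (the function name and label encoding are hypothetical):

```python
from collections import Counter


def precision_estimate(judgments):
    """Return (category proportions, estimated precision) for consensus
    labels drawn from {"C", "Cd", "W"}."""
    counts = Counter(judgments)
    n = len(judgments)
    proportions = {label: counts[label] / n for label in ("C", "Cd", "W")}
    # The estimate counts both C and C_d answers as correct.
    return proportions, proportions["C"] + proportions["Cd"]


# Invented toy labels, not the paper's data.
proportions, psi_hat = precision_estimate(["C", "C", "Cd", "W"])
```

On the toy labels above the estimate is 0.75, since three of the four judgments fall in C or C_d.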

Variability of Annotations
We have shown that an annotation drawn from $p(l, q, d \mid L \neq \text{NULL})$ has high expected precision. Now we address the distribution over annotations for a given (q, d) pair. Annotators can disagree about whether or not d contains an answer to q, that is, whether or not L = NULL. In the case that annotators agree that L ≠ NULL, they can also disagree about the correct assignment to L.
In order to study variability, we collected 24 additional annotations from separate annotators for each of the (q, d, l) triples in S. For each (q, d, l) triple, we now have a 5-tuple $(q^{(i)}, d^{(i)}, l^{(i)}, c^{(i)}, a^{(i)})$ where $a^{(i)} = a^{(i)}_1 \ldots a^{(i)}_{25}$ is the vector of 25 annotations (including $l^{(i)}$), and $c^{(i)}$ is the consensus judgment for $l^{(i)}$. For each i also define

$$\mu^{(i)} = \frac{1}{25} \sum_{j=1}^{25} [[\, a^{(i)}_j \neq \text{NULL} \,]]$$

to be the proportion of the 25-way annotations that are non-null, where $[[\cdot]]$ is 1 if its argument is true and 0 otherwise. We now show that $\mu^{(i)}$ is highly correlated with annotation precision. We define

$$\hat{E}[(0.8, 1.0]] = \frac{1}{|S|} \sum_{i=1}^{|S|} [[\, 0.8 < \mu^{(i)} \leq 1 \,]]$$

to be the proportion of examples with greater than 80% of the 25 annotators marking a non-null long answer, and define analogous quantities for the other ranges of $\mu^{(i)}$ and for each expert category.
Figure 4 illustrates the proportion of annotations falling into the C/C_d/W categories in different regions of $\mu^{(i)}$. For those (q, d) pairs where more than 80% of annotators gave some non-null answer, our expert judgments agree that these annotations are overwhelmingly correct. Similarly, when fewer than 20% of annotators gave a non-null answer, these answers tend to be incorrect. In between these two extremes, the disagreement between annotators is largely accounted for by the C_d category, where a reasonable person could either be satisfied with the answer, or want more information.
Later, in Section 5, we make use of the correlation between $\mu^{(i)}$ and accuracy to define a metric for the evaluation of answer quality. In that section, we also show that a model trained on (l, q, d) triples can outperform a single annotator on this metric by accounting for the uncertainty of whether or not an answer is present.
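The bucketing by proportion of non-null annotations can be sketched as follows. This is an illustration under assumed encodings: NULL is represented by `None`, the consensus labels are the strings "C", "Cd", and "W", and the function names are hypothetical.

```python
def non_null_fraction(annotations):
    """Share of an example's annotations that are non-null (None encodes NULL)."""
    return sum(a is not None for a in annotations) / len(annotations)


def bucket_shares(examples, low, high):
    """examples: list of (annotations, consensus_label) pairs.

    Returns the share of all examples whose non-null fraction lies in
    (low, high], plus that share broken down by consensus label.
    """
    n = len(examples)
    in_bucket = [label for anns, label in examples
                 if low < non_null_fraction(anns) <= high]
    by_label = {lab: in_bucket.count(lab) / n for lab in ("C", "Cd", "W")}
    return len(in_bucket) / n, by_label


# Invented toy data: four examples with 25 annotations each.
toy = [
    (["p1"] * 25, "C"),
    (["p1"] * 21 + [None] * 4, "Cd"),
    ([None] * 24 + ["p2"], "W"),
    (["p1"] * 10 + [None] * 15, "Cd"),
]
share, by_label = bucket_shares(toy, 0.8, 1.0)
```

In the toy data, two of the four examples have more than 80% non-null annotations, one judged C and one judged C_d.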
As well as disagreeing about whether (q, d) contains a valid answer, annotators can disagree about the location of the best answer. In many cases there are multiple valid long answers in multiple distinct locations on the page. (As stated earlier in this paper, we did instruct annotators to select the earliest instance of an answer when there are multiple answer instances on the page. However, there are still cases where different annotators disagree on whether an answer earlier in the page is sufficient in comparison to a later answer, leading to differences between annotators.) The most extreme example of this that we see in our 25-way annotated data is for the question "name the substance used to make the filament of bulb" paired with the Wikipedia page about incandescent light bulbs. Annotators identify 7 passages that discuss tungsten wire filaments.
Short answers can be arbitrarily delimited and this can lead to extreme variation. The most extreme example of this that we see in the 25-way annotated data is the 11 distinct, but correct, answers for the question "where is blood pumped after it leaves the right ventricle". Here, 14 annotators identify a substring of "to the lungs" as the best possible short answer. Of these, 6 label the entire string, 4 reduce it to "the lungs", and 4 reduce it to "lungs". A further 6 annotators do not consider this short answer to be sufficient and choose more precise phrases such as "through the semilunar pulmonary valve into the left and right main pulmonary arteries (one for each lung)". The remaining 5 annotators decide that there is no adequate short answer.
For each question, we ranked each of the unique answers given by our 25 annotators according to the number of annotators that chose it. We found that by just taking the most popular long answer, we could account for 83% of the long answer annotations. The two most popular long answers account for 96% of the long answer annotations. It is extremely uncommon for a question to have more than three distinct long answers annotated. Short answers have greater variability, but the most popular short answer still accounts for 64% of all short answer annotations. The three most popular short answers account for 90% of all short answer annotations.
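The ranking of unique answers by popularity can be sketched with a counter per question. This sketch assumes that coverage is measured over non-null annotations only and pooled across questions; the text does not spell out either choice, so both are assumptions, as is the function name.

```python
from collections import Counter


def top_k_coverage(annotations_per_question, k):
    """Fraction of non-null annotations accounted for by each question's
    k most popular answers (None encodes NULL and is ignored)."""
    covered = total = 0
    for answers in annotations_per_question:
        non_null = [a for a in answers if a is not None]
        counts = Counter(non_null)
        covered += sum(count for _, count in counts.most_common(k))
        total += len(non_null)
    return covered / total if total else 0.0


# Invented toy annotations for two questions.
toy = [["a", "a", "b", None], ["x", "y", "z"]]
coverage_top1 = top_k_coverage(toy, 1)
coverage_top2 = top_k_coverage(toy, 2)
```

On the toy data, the most popular answers cover 3 of 6 non-null annotations, and the two most popular cover 5 of 6.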

Evaluation Measures
NQ includes 5-way annotations on 7,830 items for development data, and we will sequester a further 7,842 items, 5-way annotated, for test data. This section describes evaluation metrics using this data, and gives justification for these metrics.
We choose 5-way annotations for the following reasons: First, we have evidence that aggregating annotations from 5 annotators is likely to be much more robust than relying on a single annotator (see Section 4). Second, 5 annotators is a small enough number that the cost of annotating thousands of development and test items is not prohibitive.

Definition of an Evaluation Measure Based on 5-Way Annotations
Assume that we have a model $f_\theta$ with parameters θ that maps an input (q, d) to a long answer $l = f_\theta(q, d)$. We would like to evaluate the accuracy of this model. Assume we have evaluation examples $\{q^{(i)}, d^{(i)}, a^{(i)}\}$ for i = 1 ... n, where $a^{(i)}$ is a vector with components $a^{(i)}_j$ for j = 1 ... 5. Each $a^{(i)}_j$ is the output from the j'th annotator, and can be a paragraph in $d^{(i)}$, or can be NULL. The five annotators are chosen uniformly at random from a pool of annotators.
We define an evaluation measure based on the 5-way annotations as follows. If at least two out of five annotators have given a non-null long answer on the example, then the system is required to output a non-null answer that is seen at least once in the five annotations; conversely, if fewer than two annotators give a non-null long answer, the system is required to return NULL as its output.
To make this more formal, define the function $g(a^{(i)})$ to be the number of annotations in $a^{(i)}$ that are non-null. Define a function $h_\beta(a, l)$ that judges the correctness of label l given annotations $a = a_1 \ldots a_5$. This function is parameterized by an integer β. The function returns 1 if the label l is judged to be correct, and 0 otherwise:
Definition 1 (Definition of $h_\beta(a, l)$)
If $g(a) \geq \beta$ and $l \neq \text{NULL}$ and $l = a_j$ for some $j \in \{1 \ldots 5\}$, then $h_\beta(a, l) = 1$;
else if $g(a) < \beta$ and $l = \text{NULL}$, then $h_\beta(a, l) = 1$;
else $h_\beta(a, l) = 0$.
We used β = 2 in our experiments. The accuracy of a model is then

$$A_\beta = \frac{1}{n} \sum_{i=1}^{n} h_\beta\big(a^{(i)}, f_\theta(q^{(i)}, d^{(i)})\big).$$

The value for $A_\beta$ is an estimate of accuracy with respect to the underlying distribution, which we define as

$$\bar{A}_\beta = E\big[h_\beta(a, f_\theta(q, d))\big].$$

Here the expectation is taken with respect to $p(a, q, d) = p(q, d) \prod_{j=1}^{5} p(a_j \mid q, d)$ where $p(a_j \mid q, d) = P(L = a_j \mid Q = q, D = d)$; hence the annotations $a_1 \ldots a_5$ are assumed to be drawn IID from p(l|q, d).
We discuss this measure at length in this section. First, however, we make the following critical point: It is possible for a model trained on $(l^{(i)}, q^{(i)}, d^{(i)})$ triples drawn IID from p(l, q, d) to exceed the performance of a single annotator on this measure.
In particular, if we have a model p(l|q, d; θ), trained on (l, q, d) triples, which is a good approximation to p(l|q, d), it is then possible to use p(l|q, d; θ) to make predictions that outperform a single random draw from p(l|q, d). The Bayes optimal hypothesis (see Devroye et al., 1997) for $\bar{A}_\beta$, $\arg\max_f E_{q,d,a}[h_\beta(a, f(q, d))]$, is a function of the posterior distribution p(·|q, d), and will generally exceed the performance of a single random annotation, $E_{q,d,a}\big[\sum_l p(l \mid q, d) \times h_\beta(a, l)\big]$. We also show this empirically, by constructing an approximation to p(l|q, d) from 20-way annotations, then using this approximation to make predictions that significantly outperform a single annotator.
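The claim that a prediction informed by p(l|q, d) can beat a single random annotation can be checked exactly on a toy case. The sketch below implements g and h_β as defined in Definition 1, with `None` encoding NULL, and enumerates all five-annotation outcomes for one (q, d) pair. The toy distribution and the label name `paragraph_3` are invented for illustration.

```python
from itertools import product
from math import prod


def g(annotations):
    """Number of non-null annotations (None encodes NULL)."""
    return sum(a is not None for a in annotations)


def h_beta(annotations, label, beta=2):
    """Definition 1: 1 if the label is judged correct, 0 otherwise."""
    if g(annotations) >= beta:
        return int(label is not None and label in annotations)
    return int(label is None)


def expected_score(p, label, beta=2, k=5):
    """Exact expectation of h_beta(a, label) when the k annotations
    are drawn IID from the distribution p (a dict label -> probability)."""
    return sum(
        prod(p[x] for x in a) * h_beta(a, label, beta)
        for a in product(p, repeat=k)
    )


# Toy p(l | q, d) for a single (q, d) pair; the numbers are invented.
p = {None: 0.6, "paragraph_3": 0.4}
single_annotator = sum(p[l] * expected_score(p, l) for l in p)
bayes_optimal = max(expected_score(p, l) for l in p)
```

In this toy case, always predicting `paragraph_3` scores about 0.66 in expectation, while a single random annotation scores about 0.47, even though most individual annotators would return NULL.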
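Precision and Recall. During evaluation, it is often beneficial to separately measure false positives (incorrectly predicting an answer), and false negatives (failing to predict an answer). We define the precision (P) and recall (R) of $f_\theta$:

$$P = \frac{\sum_{i=1}^{n} [[\, f_\theta(q^{(i)}, d^{(i)}) \neq \text{NULL} \,]] \; h_\beta\big(a^{(i)}, f_\theta(q^{(i)}, d^{(i)})\big)}{\sum_{i=1}^{n} [[\, f_\theta(q^{(i)}, d^{(i)}) \neq \text{NULL} \,]]}$$

$$R = \frac{\sum_{i=1}^{n} [[\, g(a^{(i)}) \geq \beta \,]] \; h_\beta\big(a^{(i)}, f_\theta(q^{(i)}, d^{(i)})\big)}{\sum_{i=1}^{n} [[\, g(a^{(i)}) \geq \beta \,]]}$$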
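Precision and recall can be sketched in code under one reading of the description above: precision is the share of non-null predictions that are judged correct, and recall is the share of examples with at least β non-null annotations for which the prediction is judged correct. That reading, the function names, and the convention of returning 0.0 for an empty denominator are assumptions of this sketch.

```python
def g(annotations):
    """Number of non-null annotations (None encodes NULL)."""
    return sum(a is not None for a in annotations)


def h_beta(annotations, label, beta=2):
    """Definition 1: 1 if the label is judged correct, 0 otherwise."""
    if g(annotations) >= beta:
        return int(label is not None and label in annotations)
    return int(label is None)


def precision_recall(references, predictions, beta=2):
    """references: one tuple of five annotations per example.
    predictions: one predicted label (or None) per example."""
    scored = [(a, l, h_beta(a, l, beta)) for a, l in zip(references, predictions)]
    predicted = [h for _, l, h in scored if l is not None]
    answerable = [h for a, _, h in scored if g(a) >= beta]
    precision = sum(predicted) / len(predicted) if predicted else 0.0
    recall = sum(answerable) / len(answerable) if answerable else 0.0
    return precision, recall


# Invented toy references and predictions.
refs = [
    ("A", "A", None, None, None),
    (None, None, None, None, None),
    ("B", "C", None, None, None),
    ("A", None, None, None, None),
]
preds = ["A", "X", None, None]
precision, recall = precision_recall(refs, preds)
```

On the toy data, one of the two non-null predictions is correct and one of the two answerable examples is answered correctly, so both values are 0.5.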

Super-Annotator Upper Bound
To place an upper bound on the metrics introduced above we create a "super-annotator" from the 25-way annotated data introduced in Section 4. From this data, we create four-tuples $(q^{(i)}, d^{(i)}, a^{(i)}, b^{(i)})$. The first three terms in this tuple are the question, document, and vector of five reference annotations. $b^{(i)}$ is a vector of annotations $b^{(i)}_j$ for j = 1 ... 20 drawn from the same distribution as $a^{(i)}$. The super-annotator predicts NULL if $g(b^{(i)}) < \alpha$, and

$$l^* = \arg\max_{l \in d} \sum_{j=1}^{20} [[\, l = b^{(i)}_j \,]]$$

otherwise. Table 3 shows super-annotator performance for α = 8, with 90.0% precision, 84.6% recall, and 87.2% F-measure. This significantly exceeds the performance (80.4% precision/67.6% recall/73.4% F-measure) for a single annotator. We subsequently view the super-annotator numbers as an effective upper bound on performance of a learned model.
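The super-annotator rule amounts to a thresholded majority vote over the 20 held-out annotations. A minimal sketch, with `None` encoding NULL; the function name is hypothetical, and ties between equally frequent answers are broken by first occurrence, which is a choice made here rather than one stated in the text.

```python
from collections import Counter


def super_annotator(b, alpha=8):
    """Predict NULL (None) if fewer than alpha of the annotations in b are
    non-null; otherwise return the most frequently chosen answer."""
    non_null = [x for x in b if x is not None]
    if len(non_null) < alpha:
        return None
    return Counter(non_null).most_common(1)[0][0]


# Invented toy vectors of 20 annotations each.
confident = ["p1"] * 5 + ["p2"] * 4 + [None] * 11
uncertain = ["p1"] * 7 + [None] * 13
```

With α = 8, the first toy vector has nine non-null annotations and yields `p1`, while the second has only seven and yields NULL.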

Baseline Performance
The NQ corpus is designed to provide a benchmark with which we can evaluate the performance of QA systems. Every question in NQ is unique under exact string match, and we split questions randomly in NQ into separate train/development/test sets. To facilitate comparison, we introduce baselines that either make use of high-level data set regularities, or are trained on the 307k examples in the training set. Here, we present well-established baselines that were state of the art at the time of submission. We also refer readers to Alberti et al. (2019) for more recent advances in modeling. All of our baselines focus on the long and short answer extraction tasks. We leave boolean answers to future work.
Table 3: Precision (P), recall (R), and the harmonic mean of these (F1) of all baselines, a single annotator, and the super-annotator upper bound. The human performances marked with † are evaluated on a sample of five annotations from the 25-way annotated data introduced in Section 5.

Untrained Baselines
NQ's long answer selection task admits several untrained baselines. The first paragraph of a Wikipedia page commonly acts as a summary of the most important information regarding the page's subject. We therefore implement a long answer baseline that simply selects the first paragraph for all pages. Furthermore, because 79% of the Wikipedia pages in the development set also appear in the training set, we implement two "copying" baselines. The first of these simply selects the most frequent annotation applied to a given page in the training set. The second selects the annotation given to the training set question closest to the eval set question according to TFIDF weighted word overlap. These three baselines are reported as First paragraph, Most frequent, and Closest question in Table 3, respectively.
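The two copying baselines can be sketched in a few lines. This is an illustration under several assumptions that the text does not settle: the training data is a list of (page id, question, long answer) triples, the closest question is searched only among training questions for the same page, whitespace tokenization is used, and the smoothed IDF formula is one common choice rather than the one used in the experiments.

```python
import math
from collections import Counter, defaultdict


def most_frequent_baseline(train):
    """Map each page id to the annotation most often applied to it in training."""
    by_page = defaultdict(Counter)
    for page, _, answer in train:
        by_page[page][answer] += 1
    return {page: counts.most_common(1)[0][0] for page, counts in by_page.items()}


def closest_question_baseline(train, page, question):
    """Return the annotation of the same-page training question with the
    highest IDF-weighted word overlap with the evaluation question."""
    n = len(train)
    doc_freq = Counter(w for _, q, _ in train for w in set(q.split()))

    def idf(word):
        return math.log((1 + n) / (1 + doc_freq[word])) + 1.0

    target = set(question.split())
    best_answer, best_score = None, -1.0
    for train_page, train_question, answer in train:
        if train_page != page:
            continue
        score = sum(idf(w) for w in target & set(train_question.split()))
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


# Invented toy training triples.
train = [
    ("P1", "who wrote hamlet", "para3"),
    ("P1", "when was hamlet written", "para5"),
    ("P1", "who wrote the play hamlet", "para3"),
    ("P2", "where is paris", "para1"),
]
```

On the toy data, the most frequent annotation for page `P1` is `para3`, while the question "when was hamlet first written" is closest to the training question annotated with `para5`.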

Document-QA
We adapt the reference implementation of Document-QA (Clark and Gardner, 2018), available at https://github.com/allenai/document-qa, for the NQ task. This system performs well on the SQuAD and TriviaQA short answer extraction tasks, but it is not designed to represent (i) the long answers that do not contain short answers, or (ii) the NULL answers that occur in NQ.
To address (i), we choose the shortest available answer span at training, differentiating long and short answers only through the inclusion of special start and end of passage tokens that identify long answer candidates. At prediction time, the model can either predict a long answer (and no short answer), or a short answer (which implies a long answer).
To address (ii), we tried adding special NULL passages to represent the lack of answer. However, we achieved better performance by training on the subset of questions with answers and then only predicting those answers whose scores exceed a threshold.
With these two modifications, we are able to apply Document-QA to NQ. We follow Clark and Gardner (2018) in pruning documents down to the set of passages that have highest TFIDF similarity with the question. Under this approach, we consider the top 16 passages as long answers. We consider short answers containing up to 17 words. We train Document-QA for 30 epochs with batches containing 15 examples. The post hoc score threshold is set to 3.0. All of these values were chosen on the basis of development set performance.
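The pruning and thresholding steps around the model can be sketched as below. This is not the Document-QA code: the function names are hypothetical, passage similarities and answer scores are assumed to be computed elsewhere, and only the values 16 and 3.0 are taken from the text.

```python
import heapq
from typing import List, Optional, Sequence, Tuple


def top_k_passages(similarities: Sequence[float], k: int = 16) -> List[int]:
    """Indices of the k passages with the highest question similarity."""
    return heapq.nlargest(k, range(len(similarities)), key=similarities.__getitem__)


def predict_or_null(
    scored_answers: Sequence[Tuple[str, float]], threshold: float = 3.0
) -> Optional[str]:
    """Return the best-scoring answer, or None (NULL) if it does not
    exceed the post hoc score threshold."""
    if not scored_answers:
        return None
    answer, score = max(scored_answers, key=lambda pair: pair[1])
    return answer if score > threshold else None


# Hypothetical similarity scores for three passages.
kept = top_k_passages([0.1, 0.9, 0.5], k=2)
```

Raising the threshold trades recall for precision, since more questions are answered with NULL.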

Custom Pipeline (DecAtt + DocReader)
One view of the long answer selection task is that it is more closely related to natural language inference (Bowman et al., 2015; Williams et al., 2018) than short answer extraction. A valid long answer must contain all of the information required to infer the answer. Short answers do not need to contain this information; they need to be surrounded by it.
Motivated by this intuition, we implement a pipelined approach that uses a model drawn from the natural language inference literature to select long answers. Short answers are then selected from these using a model drawn from the short answer extraction literature.
Long answer selection. Let $t(d, l)$ denote the sequence of tokens in $d$ for the long answer candidate $l$. We then use the Decomposable Attention model (Parikh et al., 2016) to produce a score for each (question, candidate) pair, $x_l = \mathrm{DecAtt}(q, t(d, l))$. To this we add a 10-dimensional trainable embedding $r_l$ of the long answer candidate's position in the sequence of candidates; 13 an integer $u_l$ containing the number of words shared by $q$ and $t(d, l)$; and a scalar $v_l$ containing the number of words shared by $q$ and $t(d, l)$ weighted by inverse document frequency. The long answer score $z_l$ is then given as a linear function of the above features, $z_l = w^\top [x_l, r_l, u_l, v_l] + b$, where $w$ and $b$ are the trainable weight vector and bias, respectively.
Short answer selection. Given a long answer, the Document Reader model (Chen et al., 2017; abbreviated DocReader) is used to extract short answers.
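The long answer score can be made concrete with a small pure-Python sketch. Several details are assumptions made for illustration: the DecAtt score x_l is taken as a given number, shared words are counted as distinct word types, and NULL is treated as one more candidate in a softmax over scores. The last function shows one plausible form of the NULL down-weighting used in the training step described next, namely a per-example weight of 1 − η on NULL-labeled examples; this form is an assumption and other forms are possible.

```python
import math
from typing import Dict, List, Sequence, Tuple


def position_index(position: int, num_embeddings: int = 20) -> int:
    """Map a 1-based candidate position to an embedding row.

    Positions 1..19 get their own row; positions >= 20 share the last one.
    """
    return min(position, num_embeddings) - 1


def overlap_features(
    question: Sequence[str], candidate: Sequence[str], idf: Dict[str, float]
) -> Tuple[int, float]:
    """u = number of shared word types, v = their IDF-weighted count."""
    shared = set(question) & set(candidate)
    return len(shared), sum(idf.get(word, 0.0) for word in shared)


def long_answer_score(
    x: float, r: Sequence[float], u: int, v: float, w: Sequence[float], b: float
) -> float:
    """z = w . [x, r, u, v] + b, with r a 10-dimensional position embedding."""
    features: List[float] = [x, *r, float(u), v]
    if len(w) != len(features):
        raise ValueError("weight vector and feature vector differ in length")
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def weighted_nll(
    scores: Sequence[float], gold: int, gold_is_null: bool, eta: float
) -> float:
    """Softmax negative log-likelihood of the gold candidate.

    Assumption: NULL-labeled examples are scaled by (1 - eta).
    """
    m = max(scores)
    log_partition = m + math.log(sum(math.exp(s - m) for s in scores))
    nll = log_partition - scores[gold]
    return (1.0 - eta) * nll if gold_is_null else nll
```

Under this assumed form, η = 0 treats NULL and non-NULL examples alike, and larger η reduces the influence of NULL examples on the gradient.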
Training. The long answer selection model is trained by minimizing the negative log-likelihood of the correct answer $l^{(i)}$, with a hyperparameter η that down-weights examples with the NULL label. We found that the inclusion of η is useful in accounting for the asymmetry in labels, because a NULL label is less informative than an answer location. Varying η also seems to provide a more stable method of setting a model's precision point than post hoc thresholding of prediction scores. An analogous strategy is used for the short answer model, where examples with no entity answers are given a different weight.
13 Specifically, we have a unique learned 10-dimensional embedding for each position 1 . . . 19 in the sequence, and a 20th embedding used for all positions ≥ 20.

Results
Table 3 shows results for all baselines as well as a single annotator, and the super-annotator introduced in Section 5. It is clear that there is a great deal of headroom in both tasks. We find that Document-QA performs significantly worse than DecAtt+DocReader in long answer identification. This is likely because Document-QA was designed for the short answer task only.
To ground these results in the context of comparable tasks, we measure performance on the subset of NQ that has non-NULL labels for both long and short answers. Freed from the decision of whether or not to answer, DecAtt+DocReader obtains 68.0% F1 on the long answer task, and 40.4% F1 on the short answer task. We also examine performance of the short answer extraction systems in the setting where the long answer is given, and a short answer is known to exist. With this simplification, short answer F1 increases to 57.7% for DocReader. Under this restriction NQ roughly approximates the SQuAD 1.1 task. From the gap to the super-annotator upper bound we know that this task is far from being solved in NQ.
Finally, we break the long answer identification results down according to long answer type.

Figure 6: Examples from Figure 5 that have long answers that are paragraphs (i.e., not tables or lists). We show the expert judgment (C/C_d/W) for each non-null answer. ''Long answer stats'' a/25, b/25 have a = number of non-null long answers for this question, b = number of long answers the same as that shown in the figure. For example, for question A1, 13 out of 25 annotators give some non-null answer, and 4 out of 25 annotators give the same long answer ''After mashing . . .''. ''Short answer stats'' has similar statistics for short answers.

Conclusion
We argue that progress on QA has been hindered by a lack of appropriate training and test data. To address this, we present the Natural Questions corpus. This is the first large publicly available data set to pair real user queries with high-quality annotations of answers in documents. We also present metrics to be used with NQ, for the purposes of evaluating the performance of question answering systems. We demonstrate a high upper bound on these metrics and show that existing methods do not approach this upper bound. We argue that for them to do so will require significant advances in NLU.

Figure 5 shows example questions from the data set. Figure 6 shows example question/answer pairs from the data set, together with expert judgments and statistics from the 25-way annotations.