Decontextualization: Making Sentences Stand-Alone

Models for question answering, dialogue agents, and summarization often interpret the meaning of a sentence in a rich context and use that meaning in a new context. Taking excerpts of text can be problematic, as key pieces may not be explicit in a local window. We isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context, while preserving its meaning. We describe an annotation procedure, collect data on the Wikipedia corpus, and use the data to train models to automatically decontextualize sentences. We present preliminary studies that show the value of sentence decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important subtask in many downstream applications, and that the definitions and resources provided can benefit tasks that operate on sentences that occur in a richer context.


Introduction
Many applications of natural language processing need to be able to interpret, or present, text independently from the rich context in which it occurs. For example, summarization systems extract salient information from documents and present it in a reduced context. Many systems also segment documents prior to interpretation of retrieval for computational efficiency. In all of these cases, we would like the context-reduction step to be meaning preserving but, to date, there has been no independent method of ensuring this.
In this paper we isolate and define the problem of sentence decontextualization: taking a sentence * Work done at Google.
Page Title: Croatia at the FIFA World Cup Paragraph: Croatia national football team have appeared in the FIFA World Cup on five occasions (in 1998, 2002, 2006, 2014 and 2018) since gaining independence in 1991. Before that, from 1930 to 1990 Croatia was part of Yugoslavia. Their best result thus far was reaching the 2018 final, where they lost 4-2 to France. Decontextualized Sentence: The Croatia national football team's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France. together with its context and rewriting it to be interpretable out of context if feasible, while preserving its meaning. 1 Having defined the problem, we operationalize this definition into a high quality annotation procedure; use the resulting data to train models to automatically decontextualize sentences; and present preliminary results that show the value of automatic decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important sub-task in many downstream applications, and we believe this work can benefit tasks that operate on sentences that occur in a wider context.
One contribution of this work is to release a dataset of decontextualized sentences that can be used as training and evaluation data, together with the evaluation script: on publication of this paper the data will be available at https://github. com/google-research/language/ tree/master/language/decontext. Figure 1 shows an example decontextualization. In this example we have a coreference resolution step (their → The Croatia national football team's) and a bridging step (insertion of the prepositional phrase "in the FIFA World Cup" to modify "Croa-tia's best result thus far"). Decontextualization involves various linguistic phenomena, including coreference resolution, global scoping, and bridging anaphora (Clark, 1975). We present a linguistically motivated definition of decontextualiation in Section 2 and show that this definition can be reliably applied by crowdworkers in Section 3.
We generate a corpus of decontextualized sentences corresponding to original sentences drawn from the English Wikipedia. We show that a high proportion of these original sentences can be decontextualized using a relatively simple set of rewrite operations, and we use the data to define a new automatic decontextualization task in which a computer system needs to create a decontextualized sentence from an original sentence presented in paragraph context. We discuss the implications of choosing Wikipedia as a domain in Section 3.4.
We present two methods for automatic decontextualization based on state-of-the-art coreference (Joshi et al., 2020) and generation (Raffel et al., 2019a) models. We evaluate the output of these models with automatic measures (derived from Xu et al. (2016)), as well as through human evaluation. Both automatic and human evaluations show that the largest sequence-to-sequence model produces high quality decontextualizations in the majority of cases, although it still lags human performance in the thoroughness and accuracy of these decontextualization edits.
Finally, we present two demonstrations of the utility of decontextualization. The first is a user study giving evidence that decontextualized sentences can be valuable when presented to users as answers in a question-answering task-raters judge that they balance conciseness with informativeness. In the second one, we use decontextualization as a preprocessing component for generating a retrieval corpus for open domain question answering. Decontextualizing the sentences to be indexed by retrieval system enables more efficient answer string retrieval for information seeking queries. These demonstrations are presented as preliminary results, and we argue that decontextualization is an important sub-task for a wide range of NLP applications.

Linguistic Background
We start with the following definition: Definition 1 (Decontextualization) Given a sentence-context pair (s, c), a sentence s is a valid decontextualization of s if: (1) the sentence s is interpretable in the empty context; and (2) the truth-conditional meaning of s in the empty context is the same as the truth-conditional meaning of s in context c.
A context c is a sequence of sentences preceding s, and the empty context is the empty sequence.
We have been careful here to use the more specific term "truth conditional meaning" rather than "meaning". Here we follow the distinction in semantics/pragmatics between truth conditional meaning and implicature, and deliberately exclude implicatures (which can also be considered part of the meaning of an utterance) from our definition. There is a rich history of work in semantics and pragmatics on truth-conditional meaning and implicatures, going back to Grice (1975). Our concept of "truth conditional meaning" is very close to "explicature" as used in Relevance Theory (Sperber and Wilson, 1986). Consider this description of explicature from Birner (2012) (pages 96-97, our own emphasis added): "The explicature in an utterance is the result of enriching the semantic content with the sorts of pragmatic information necessary to provide us with a truth-evaluable proposition. This includes calculating the referents for pronouns, working out the intended interpretation for deictic phrases like here and later ..., disambiguating lexically and structurally ambiguous words and phrases, making any "bridging" inferences necessary for reference resolution ... and so on." We will see in the next section that our annotation task follows this definition quite closely.
As an example consider the following exchange: Susan: Has the Croatia national football team ever won the FIFA World Cup? Jon: Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.
Here the truth conditional meaning of Jon's reply is equivalent to "Croatia's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.", whereas the implicature would be "the Croatia national football team has never won the FIFA World Cup" (which answers Susan's question). In our definition the decontextualized sentence s should preserve the truth-conditional meaning, but is not required to preserve the implicature(s) of the sentence. 2 Remark (extra-linguistic context): In addition to its document context, a given sentence s and its counterpart s also come with a temporal, cultural, and geographic context -i.e. where and when they are being written or read and by whom. 3 We assume that these aspects of context are preserved during decontextualization. The effect of this is that elements of s which derive their meaning from outside of the document context will receive equivalent interpretation in s , and hence do not require decontextualization. For example, the expression "thus far" in Figure 1 is interpreted relative to the time of utterance, not relative to what has been previously said in the Wikipedia article, and hence it appears in both the original and decontextualized sentences.

Task Definition
An annotator is provided with an entire document d with a target sentence within the document, represented as a start and end index s st , s end . First, the annotator decides whether the target sentence can be decontextualized or not, labeling it as FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator decontextualizes the sentence, producing y, a new sentence that satisfies the conditions in definition 1.

Feasibility
Sentences in FEASIBLE include sentences that do not require any modification to be decontextualized (e.g., "Émilie du Châtelet proposed the hypothesis of the conservation of total energy, as distinct from momentum."), and sentences that require edits to stand alone.
In the decontextualization step, we instructed annotators to make only minor modifications, which includes copying and pasting a few phrases from the document to the target sentence and deleting phrases from the target sentence. When it is too challenging to decontextualize, it is classified into the INFEASIBLE category. Often, sentences in this category are a part of a narrative Page Title: Postage stamps and postal history of India Paragraph: ... In the opinion of Geoffrey Clarke , the reformed system was to be maintained " for the benefit of the people of India and not for the purpose of swelling the revenue." The Commissioners voted to abolish the earlier practice of conveying official letters free of postage ("franking"). The new system was recommended by the Governor -General , Lord Dalhousie , and adopted by the East India Company 's Court of Directors.
Page Title: Thermodynamic temperature Paragraph: ... To completely melt ice at 0 C into water at 0 C, one must add roughly 80 times the thermal energy as is required to increase the temperature of the same mass of liquid water by one degree Celsius. The metals' ratios are even greater, typically in the range of 400 to 1200 times.

Edit Types and Linguistic Phenomena
When an example is classified as FEASIBLE, the annotator makes edits to decontextualize the sentence. Table 1 shows the different edit types. They fall into four broad categories: NAME COMPLETION, PRONOUN / NP SWAP correspond to replacement of a referring expression that is unclear out of context with a referring expression that is unambiguous out of context. For example replacing the pronoun "She" with "Cynthia Nixon", the definite NP "the copper statue" with "The Statue of Liberty", or the abbreviated name "Meg" with "Megan "Meg" Griffin".
DM REMOVAL involves removal of discourse markers (DMs) such as "therefore".
BRIDGING, GLOBAL SCOPING involve addition of a phrase (typically a prepositional phrase) that modifies either a particular noun phrase ("bridging") or the entire sentence ("global scoping"). For example adding "in the Ultimate Fighting Championship" as a modifier to "all fights", or adding "at the 2018 Cannes Film Festival" at the end of the sentence. The additional phrase essentially spells out a modifier that is implied by the context.
ADDITION inserts background information that significantly improves readability: in many cases, this involves adding an appositive or premodifier

ADDITION
Addition of background information that is not necessary but helps readability significantly Charles Darwin +, an English naturalist and biologist, was among the first to suggest that physiological changes caused by an emotion had a direct impact on , rather than being just the consequence of that emotion.
10 to a named entity to add useful background information about that entity. Unlike other edits described above, edits in this category are optional. For example, replacing "The Eagles" with "The American rock band The Eagles."

Variability
We note that for a given sentence frequently there will be more than one possible decontextualization. While this inherent subjectivity makes task challenging to crowdsource and evaluate, we argue this is important feature, as shown in recent literature (Aroyo and Welty, 2015;Pavlick and Kwiatkowski, 2019;, and propose to collect multiple references per example. Table 2 shows examples where there can be multiple different correct decontextualizations.
In the first example, while the semantics of the edits are roughly equivalent, i.e., the annotators agreed on what noun phrases have to be disambiguated and information has to be added, they differ in how to rewrite the sentence. In the second example, we see disagreement on what information should be added to the sentence. We do not make any explicit assumptions about what is known and salient to the reader, and instructed them to use their best judgement to rewrite such that the new sentence is fluent, unambiguous and clear when posed alone. In the last example, annotators disagree on the feasibility. While the sentence is a part of a bigger narrative, two annotators judged it could be edited to alone, by adding a global scoping modifier, "In Greek mythology."

Scope of Current Task Formulation
Our data comes from the English portion of the Wikipedia corpus. We sampled sentences as follows. We first pick a (question, Wikipedia, short answer) triple from the Natural Questions  uniformly at random from the questions that have a short answer. We include the sentence containing the short answer as one example; as a second example we choose a sentence at random from the Wikipedia page. After sampling we exclude (1) sentences under a "Plot" category as they are often infeasible to decontextualize; (2) any sentence that is the first sentence of the page; and (3) any sentence from a paragraph containing only a single sentence. We designed this data selection process to ensure that a large proportion of examples (90%) could be decontextualized using simple edits de-Page title / Section title: We Don't Talk Anymore (Charlie Puth song) / Music video Paragraph: The music video premiered on August 2 , 2016 , on BuzzFeed and was directed by Phil Pinto . It shows Puth and Mirella Cardoso as his love interest . . . . Decontextualization 1: -It , +We Don't Talk Anymore music video shows -Puth , +Charlie Puth and Mirella Cardoso as his love interest. Decontextualization 2: -It , +The "We Don't Talk Anymore"(Charlie Puth song) music video shows Puth and Mirella Cardoso as his love interest. Page title: Gemini (Constellation) Paragraph: In Greek mythology, Gemini was associated with the myth of Castor and Pollux, the children of Leda and Argonauts both. Pollux was the son of Zeus, who seduced Leda, while Castor was the son of Tyndareus, king of Sparta and Leda's husband. Castor and Pollux were also mythologically associated with St. Elmo's fire in their role as the protectors of sailors. When Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens. Decontextualization 1: INFEASIBLE Decontextualization 2: +In Greek mythology, when Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens. Before settling on Wikipedia, we conducted an initial pilot study which revealed that encyclopedic text is substantially easier to decontextualize compared to newswire or literary text. In the latter genres, the context required for the comprehension of any given sentence appears to be much more complex in structure. Similarly, it is difficult to posit decontextualization for sentences that appear on social media platforms, because they are situated within complex and highly specific social contexts. In contrast, being written for a general audience, Wikipedia makes limited assumptions about its reader.
Within Wikipedia, we similarly found that articles on popular historical or cultural entities and events were easier to decontextualize by crowdworkers compared to articles from technical domains, such as ones on medical or mathematical concepts. Comprehension of such articles requires a considerable body of background knowledge or information from preceding paragraphs. Articles in our dataset cover topics that require little background knowledge to comprehend.
We focus on decontextualization of sentences, where the space of edits is restricted, to make the task easier to quality control and annotate. However alternate formulations, such as decontextual-ization on paragraphs could also be studied. One could even also consider allowing wider range of edits, such as multi-sentence outputs and edits beyond copy-and-pasting, such as paraphrasing and re-ordering. We anticipate exploring such alternative formulations would help to extend the scope of decontextualization to the more challenging domains previously mentioned.
We stress however that in spite of our restriction to single sentences in Wikipedia, the decontextualization task is nevertheless valuable: Wikipedia (and other encyclopedic sources) contain a wealth of factual information, and a high proportion (over 60%; see table 3) of sentences both require decontextualization and can be decontextualized under our definitions (only 30% of sentences are interpretable out of context without any edits).

Data Collection
Annotation Interface The annotator is presented a sentence in the context of an entire Wikipedia page. In the first step the annotator judges whether the example is FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator can use delete, add, or swap operations within a user interface to produce a decontextualized string.  Train  11290  695  156  60  31  9  Dev  1945  695  162  67  21  12  Test  1945  711  160  68  20  12   Expert  100  658  163  63  26  12   Table 3: Data statistics. par. len refers to paragraph length in bytes, and sent. len refers to sentence length in bytes. All lengths are in bytes. The development and test set is five-way annotated, and the expert data is four-way annotated.
Data Statistics We collected one reference for each example in the training data, and five references for each example in the evaluation data.
Annotators are native speakers of English located in the United States, and on average, they took 4 minutes to annotate a single example.
In total, 28 annotators annotated the examples, with 11 annotators annotating more than 1K examples each. Table 3 represents some overall data statistics. Decontextualization is possible for the majority of examples, with the INFEASIBLE category covering roughly 10% of the data. We note a slight discrepancy between train and evaluation dataset distribution, potentially due to a change in the annotation interface. A small subset of data is annotated by the authors to be compared with the crowd-sourced data (last row in the table).
Annotation Quality We quantify the annotation agreement on the category classification. The Fleiss' kappa on category classification is 0.51 among expert annotators, and is 0.30 among the crowd annotators (binary agreement is at 85%). We observed more variability in crowdworkers as annotators' background is more diverse, and some annotators have a loose concept of "stand alone" and consistently attempted decontextualization.
We also measured agreement among the individual edits. For each of the edit operations (as defined in Section 3.2), we compare the output sentence after the single edit and to a set of output sentences, each after a single edit by other annotators. About 32.5% of edits were covered.
Because of the inherent annotation variability, four of the authors manually evaluated 100 crowd-sourced annotations from the evaluation data based on two measures: (1) whether the sentence is sufficiently and correctly decontextualized, and (2) whether the sentence is grammati-cally correct and fluent. Overall, 88% of annotations were valid in both, 89% on the content and 88% on form.

Models
We present two models for decontextualization: a coreference resolution model and a sequenceto-sequence generation model. For both models, the input is a concatenation of the title of the Wikipedia document, the section titles, and the paragraph containing the target sentence. During the annotation pilots, we found the document title is crucial for decontextualization, while section headers were frequently necessary or missing. We denote the title of the Wikipedia page as the sequence of tokens t, section titles of the paragraph as the sequence of tokens t s and the n sentences in the paragraph where the target sentence is coming from as x 1 . . . x n , where each x i is a sequence of tokens, and x t is the target sentence (1 ≤ t ≤ n). The model considers the concatenation of a subset of the document, where [S] is a separator token. This representation differs from the setting of annotators, where they were given the full document context. As an approximation, we include article and section titles in the inputs, as these often contain salient contextual elements. We did experiment with giving more context, i.e., adding the first paragraph of the article as an additional input, but did not observe a performance improvement. On the initial pilot, annotators marked that 10-20% of examples required access to the full document.
The Coreference Model As many decontextualization edits can be recovered by a coreference resolution module, we adapt the output from the state-of-the-art coreference resolution system of (Joshi et al., 2020), trained on the CoNLL dataset (Pradhan et al., 2012), as a decontextualization system. We used the publicly available pre-trained checkpoint of SpanBERT-Large with the original hyper parameters. 4 We run this model on the input sequence, and map the coreference cluster predictions to modify the sentence as follows. We only consider clusters with a mention in the target sentence. For each such cluster, we find its first mention inside the target sentence, and find another mention in the same cluster that was presented earlier in the input and is longer than the current mention. If such a mention is found, we replace the current entity mention string with the earliest such mention string (e.g., "She" is replaced with "Taylor Swift"). On average, 36.5% of examples were modified through this process.
The Seq2Seq Generation Model is based on the recent T5 model (Raffel et al., 2019b). We show two variations of the model, BASE and 11B, which mainly differ in the model capacity. We fine-tune the model on our crowd- We limit the input/output to 512/128 tokens for both variants, and fine-tuned from pre-trained checkpoints 5 with a batch size of 100 examples until the validation loss stopped decreasing, after about 32K for the larger and 500K steps for the smaller model.

Feasibility Detection
We first evaluate the accuracy of models in making the feasible vs. infeasible decision. To do this we compute the binary agreements with all human references and average them to get an accuracy.
Results For the feasible vs. infeasible classification task, baseline that always predicts FEA-SIBLE will have 88% accuracy. The larger variant of T5, T5-11B, achieves 89% accuracy, outperforming human agreement (85% accuracy), affirming the strong performance of pre-trained language models on classification tasks (Devlin et al., 2018). This model predicts the INFEASIBLE category infrequently for the larger variant (5% of examples), while humans classify an example as IN-FEASIBLE for 12% of examples. We observe the smaller variant, T5-Base, is less accurate, overpredicting the INFEASIBLE category (for 20% of examples), getting 77% accuracy. The coreference model cannot decide the decontextualization feasibility, as an untrained baseline.

Decontextualized Sentence Generation
Setup For development / test examples, we have five human annotations per example. We only consider examples marked by three or more annotators (out of five) as FEASIBLE for decontextualized sentence generation. For each of these examples, we discard annotations which mark the example as INFEASIBLE. For automatic evaluation and comparison, we need a human output, which will be compared to model outputs, and a set of reference annotations which will be considered as correct, gold annotations. The single human output provides a reference point for evaluation measures to which the automatic output can be compared.
We observed comparing a longer decontextualized sentence to shorter decontextualized sentences often erroneously results in low scores automatic metrics (e.g., In the last example of Table 2, adding extra information will be erroneously punished). Thus, instead of randomly selecting one annotation to be used as the representative human output, we sort the annotations by the length of the output sentence (raw bytes), and take the annotation with median length 6 as a human output and take the remaining annotations as a set of reference annotations. From manual inspection of the data the median-length output appeared often to be optimal in terms of balancing length versus accuracy of the decontextualization. Metric For each model prediction and human output, we report: • Length Increase, the average value of (len(decontext)-len(original)) / len(original).
• % edited, the proportion of examples that were modified for decontextualization (as opposed to being left unchanged).
• Sentence match, a binary score computed between the output and a set of references, indicating whether the output matches any of the references after normalization (stripping away articles and punctuation and lowercasing). We report two numbers, a score on all examples, and a score on examples where all references edited the sentence.  • SARI (system output against references and against the input sentence) metric (Xu et al., 2016). To compute this, for each reference, we calculate a set of add edits, corresponding to which unigrams are seen in the reference but not in the original sentence. Conversely, we can calculate the set of delete edits, corresponding to unigrams that are in the original sentence but not in the reference. We calculate precision/recall/F1-measure on add and delete edits. We look at unigrams only, and use fractional counts for the words in the references (i.e., a word appearing in one of r references will be counted as 1/r). We compute micro average across examples, i.e., globally by counting the total true positives, false negatives and false positives, as many examples do not require any edits. 7 While the sentence match score is the easiest to interpret, it punishes longer outputs, making comparisons across systems producing outputs of different lengths challenging, and it overly rewards conservative strategies that simply copy across the original sentence. Thus, we use the SARI metric as our main evaluation metric. SARI can be thought of as a precision/recall measure on topics (unigrams) that should be added or deleted. Tables 4 and 5 show development and test performance. A successful 7 Similar to BLEU in machine translation, SARI is a useful measure for comparing different systems; however, due to the relatively large space of possible decontextualizations it will not be possible to achieve anything close to 100% F1 on SARI measures, and thus the absolute score is harder to interpret. A SARI score of for example 50% should not be interpreted as indicating a system with 50% accuracy.   Table 6: Preference between T5 output and human annotation. Columns represents the judgement of the expert A, rows that of the expert B. We see high agreement between two expert annotators, despite one expert annotator (column annotator) is ambivalent more frequently.

Automatic Evaluation
decontextualization system would result in high sentence match, adequate changed ratio (experts edited about 79% of examples), and length change ratio (the experts' ratio is 1.19), as well as high SARI addition and deletion scores. As a sanity check, we report REPEAT, which outputs the original sentence. This alone results in high sentence match score, around 40%, meaning that on this number of examples, at least one of the annotators deemed the sentence can stand alone without any edits.
The coreference system has an exact match of about 13% of examples that require edits, without any task-specific fine-tuning. Its SARI add scores shows high precision and low recall, and its deletion scores are low as it cannot delete discourse markers. The Seq2seq generation model achieves high scores across all measures. The bigger variant is substantially better, editing more than its smaller variant without losing precision. We observe the larger variants outperform the average human on sentence match measure, but not in SARI measures. The T5 model modifies fewer examples than the annotator, and edits involve fewer tokens, benefiting it on the sentence match measure. However, the model is more likely to miss required edits, as shown in low recall for the SARI add and deletion measures. We discuss this further in the following human evaluation section.

Human Evaluation
We sampled 100 examples in the evaluation set, where at least two annotators and our best model made decontextualization edits. We randomized the order or presentation of the T5 and human outputs so as to not bias the annotation. On this set, we (two of the authors) conducted a manual evaluation. Given two decontextualized sentences, one from the best model and another randomly selected from a set of annotations with decontextualization edits, we evaluated each on two dimensions: (a) is it fluent and grammatically correct? (b) is it sufficiently and correctly decontextualized? Lastly, we chose the preference between two outputs (A, B, or either).
Expert annotators marked as 'sufficient' those items for which all possible referential ambiguities had been resolved. Given the subjective nature of the task, some 'insufficient' decontextualizations by the expert annotator could be valid for the another annotator with a different world knowledge. We report averaged binary scores from two experts. The model output scored 88.0% on fluency, and 67.5% on correct decontextualization, while the human reference output scored 84.5% on fluency and 78.5% on correct decontextualization. Both annotators found T5 to be slightly more fluent, while humans are more thorough and accurate in decontextualizating. Table 6 shows the preferences of two annotators. Both preferred human output, and their preferences exhibit high agreement (matching on 37 out of forty examples when both had preferences).
We briefly characterize common error patterns for annotators and the T5 model. Similar error patterns emerge between the annotations and model outputs. Both occasionally fail to identify generics that need to be replaced with referring NPs, phrases that require bridging, and temporal contexts that should be provided. Additionally, we noticed that the T5 model heavily relies on the title cues, and sometimes fail to clarify ambiguous entities that are not the main entity of the page. We noticed very few examples where T5 hallucinates factually incorrect contents.

Two Applications
We present two demonstrations of the utility of decontextualization. First, we argue that the decontextualized sentences can be valuable in themselves in question answering, and show that they can be useful as a preprocessing step.  Table 7: User study results. Dec. refers to decontextualized sentence answer, Ori. means original sentence answer, Par. means paragraph answer. We present raw counts of preferences and the log odds of preferring option A and its 95% confidence interval.

Decontextualized Answer As Is
We showcase a use case of decontextualized sentences as providing a succinct yet informative answer to open domain factoid questions . We design a user study where people compare a decontextualizedsentence answer with an original-sentence answer and a paragraph answer to the same query. 8 Setup Given a question and two presentations of the same answer, raters were tasked with marking their preference between the two answer presentations (option A, option B, or either). The actual short span answer in the sentence is always highlighted (similar to seen in Table 8) (See Figure 3 for a screenshot).
We conduct three comparison studies on the same set of 150 questions: (a) decontextualized sentence vs. original sentence, (b) original sentence vs. original paragraph, (c) decontextualized sentence vs. original paragraph. For each example in each study, we collected 10 user ratings. The questions are randomly chosen from a set of questions which have a short answer, and such that the sentence containing the short answer is categorized as FEASIBLE by the annotators and edits were necessary to decontextualize. We use crowd-sourced annotations of decontextualized sentences. Figure 3 shows the screenshot of the user study interface. Table 7 shows the results of the user study. We observe that decontextualized sentence answers are preferred to both the original sentence answers and the original paragraph answers. We also note that the users preferred sentence answer compared to paragraph answer in general.  : Each dot represents how frequently the decontextualized answer is preferred for a single question / rater to the original sentence for a single question (top plot) and single rater (bottom plot). Questions (top plot) and raters (bottom plot) are sorted by its preference towards the decontextualized answer. The red line is where both are equally preferred, and above the line represents question / rater where decontextualized answers were preferred. While the decontextualized answer is preferred overall, we see a large variability.

Result
We further investigated the statistical significance of the preferences reported in Table 7. We noticed a quite large amount of question and rater variability-some raters consistently preferred a sentence answer, valuing conciseness, while some raters behaved in the other direction. Similarly, for some questions, all raters preferred a sentence answer. Figure 4 visualizes such variability based on the questions and raters.
To control for the correlations induced by the rater and question groups, we fit a generalized linear mixed model (GLMM) using the brm R package (Bürkner, 2017). For this analysis, we excluded data points where users did not show a preference (selected either). We used the formula: p ∼ 1 + (1|r) + (1|q) where p is whether a rater chose one option over the other; r is the rater id; and q is the question id. This formula specifies a regression of the log-odds of the rater preference while allowing for random effects in the raters (r) and questions (q). The last column of Table 7 shows the fixed effect coefficients and their confidence intervals. The intercept represents the strength of preference towards option A. We found a statistically significant preference for decontextualized sentences over both original sentences and the paragraphs (p-value was smaller than 0.05 for both studies).
Examples We qualitatively investigated which examples benefit from decontextualization, and in which examples raters prefer paragraph answers. Table 8 shows questions together with two answer presentations, along with the predicted fixed effect question coefficient towards decontextualized answer in study (b) and towards the sentence answer in study (c). In the first row, the added information from the decontextualization is not relevant to the question, thus we observe preference against decontextualization. In the second and third row, the decontextualized sentence answer is preferred as it provides enough evidence to answer the query,
when was the rising of the moon written  while the original sentence answer does not.

Decontextualizing System Inputs
Having shown the benefits of decontextualization in a user facing task, we now investigate the use of decontextualizaton as a preprocessing step. Specifically, we construct a passage retrieval corpus for open domain question answering (Chen et al., 2017) with decontextualized sentences. Experiment shows that decontextualized sentences ensure completeness of the passages while minimizing their length (thus computational cost).
Background Open domain question answering typically consists of pair a passage retrieval (Liu and Croft, 2002) and transformer-based answer extractor (reading comprehension model) based on the retrieved passages (Guu et al., 2020;Karpukhin et al., 2020;Izacard and Grave, 2020). And the computational cost is dominated by the cost of co-encoding the query with the retrieved passages (typically paragraphs or overlapping 100 word windows).
Setup We create a corpus using the 7k documents (233k paragraphs, 868k sentences) from the documents associated with the questions in the NQ-open development set (Lee et al., 2019). We consider a retrieved passage to be correct if it con-tains one of the answer strings 9 and investigate the number of questions for which we can retrieve a correct passage for a fixed computational cost. Under this measure, we compare paragraphs, windows of 100 words, sentences, and decontextualized sentences as a set of retrieval passages. These segmentation approaches generate different number of passages for the same article (paragraph and a window of 100 words segmentation make fewer passages compared to sentences-level segmentation). To generate decontextualized sentences process all paragraphs with T5-11B model, which are trained on all annotated data (including development and test set). For about 40% of sentences, the model classified the sentence as infeasible to decontextualize or unnecessary to make any edits, we use the original sentence. On the other 60% model tended to add more information. For example, for a sentence "Bush was widely seen as a 'pragmatic caretaker' president who lacked a unified and compelling long-term theme in his efforts.", the decontextualized sentence will be "George H.W. Bush was widely seen as a 'pragmatic caretaker' president of the United States who lacked a unified and compelling longterm theme in his efforts." and a paragraph would be entire paragraph containing this sentence, and 100-word window will be a chunk without using a sentence boundary as a segmentation. For all, we prepend the document title to the passage, following the literature and use the TFIDF as a retriever model.
Metric Let q i be a question; let A i = [a 0 i . . . a n i ] be the set of valid answers; let C i = [c 1 i . . . c k i ] be a ranked list of evidence passages; and let H(A i , C i ) be the index of the top ranked context that contains one of the valid answers (or k + 1 if there is no such context). We first define the cost of encoding a single question and passage, c(q i , c m i ) = (|q i | + 1 + |c m i |) 2 . This captures the fact that the Transformer's computation cost scales quadratically with the length of the encoded text (question + separator + evidence passage).
Given the per example cost defined above, we define the recall of the retrieval system at computational cost budget t to be: where N is the total number of examples and 1 is an indicator function. We use this as an evaluation measure instead of mean reciprocal rank or recall at N, to compare across different retrieval passage length.
Results Figure 5 plots the recall of each retrieval corpus at different computational cost budget t on the whole NQ-open evaluation set. The graph shows that sentence level segmentation is more cost-effective than paragraph or 100 word level segmentation, and using decontextualized sentences is more cost effective than using the original sentences. Decontextualized sentences near the performance of commonly used 100 word windows with 1/10th the cost.
This result exemplifies the way in which decontextualization can be used to ensure that the input to natural language understanding system is concise yet complete. We think this way of using decontextualization as a preprocessing could also aid tasks such as summarization.

Related Work
Prior literature in summarization studied how article context affects the understanding of sentences within an article. It has been observed that disambiguating entity mentions and correctly resolving anaphora is crucial for automatic summarization (Otterbacher et al., 2002;Steinberger et al., 2007) and for evaluation of summarization systems (Pitler et al., 2010). Li et al. (2016) identified information missing from a sentence could be identified in the article context in newswire text 60% of the time. This is considerably less frequent than for the encyclopedic text studied here, but nevertheless hints that decontextualization for newswire text could be feasible. It remains unclear whether information accessible in newswire contexts can be readily incorprated into sentences using controlled edits of the type we employ.
Successful decontextualization models must resolve entity and event coreferences (Humphreys et al., 1997) as well as other forms of anaphora (Rösiger et al., 2018). These are necessary but insufficient for decontextualization however, which also involves discourse marker removal, acronym expansion, and fluent and grammatical sentence generation.
The term decontextualization was introduced in a recent table-to-text generation dataset (Parikh et al., 2020) where a sentence from a Wikipedia document was decontextualized such that it can be interpretable when presented with a table alone. They cover only the sentences that are relevant to the table, and adapt it to the table context. In a recent image captioning dataset (Sharma et al., 2018), sentences are re-written such that information that cannot be inferred from the image is removed. For example, entity names are replaced with generics (e.g. -Tom Cruz , +A man is waiting.").

Conclusion
We define decontextualization, the task of rewriting a sentence from a document to be interpretable in an empty context, while preserving its meaning. We build a crowdsourced dataset and a model for decontextualization, and demonstrate how decontextualization can be used in a user-facing task and as a sub-component of an application system. We believe that decontextualization will also be helpful in a wide range of other applications. For example, in multi-document summarization (Fabbri et al., 2019), co-referring entities and events must be resolved across different documents and removing ambiguous references may help; extractive summarization (Cheng and Lapata, 2016) could benefit from the type of pre-processing that we presented for opendomain QA; anaphora resolution is crucial for both summarization and machine translation (Susanne et al., 1992); and decontextualizing sentences may help in recovering explicit mentions of entities and relations which can help information extraction (Narasimhan et al., 2016). The current formulation focuses on the English encyclopedic corpus and rewriting for an empty context, and future work can explore different domains of text as well as mapping to a different context.