Abstract
Models for question answering, dialogue agents, and summarization often interpret the meaning of a sentence in a rich context and use that meaning in a new context. Taking excerpts of text can be problematic, as key pieces may not be explicit in a local window. We isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context, while preserving its meaning. We describe an annotation procedure, collect data on the Wikipedia corpus, and use the data to train models to automatically decontextualize sentences. We present preliminary studies that show the value of sentence decontextualization in a user-facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important subtask in many downstream applications, and that the definitions and resources provided can benefit tasks that operate on sentences that occur in a richer context.
1 Introduction
Many applications of natural language processing need to be able to interpret, or present, text independently of the rich context in which it occurs. For example, summarization systems extract salient information from documents and present it in a reduced context. Many systems also segment documents prior to interpretation or retrieval for computational efficiency. In all of these cases, we would like the context-reduction step to be meaning preserving, but, to date, there has been no independent method of ensuring this.
In this paper we isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context if feasible, while preserving its meaning.1 Having defined the problem, we operationalize this definition into a high quality annotation procedure; use the resulting data to train models to automatically decontextualize sentences; and present preliminary results that show the value of automatic decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important sub-task in many downstream applications, and we believe this work can benefit tasks that operate on sentences that occur in a wider context.
One contribution of this work is to release a dataset of decontextualized sentences that can be used as training and evaluation data, together with an evaluation script. The data is available at https://github.com/google-research/language/tree/master/language/decontext.
Figure 1 shows an example decontextualization. In this example we have a coreference resolution step (their → The Croatia national football team’s) and a bridging step (insertion of the prepositional phrase “in the FIFA World Cup” to modify “Croatia’s best result thus far”). Decontextualization involves various linguistic phenomena, including coreference resolution, global scoping, and bridging anaphora (Clark, 1975). We present a linguistically motivated definition of decontextualization in Section 2 and show that this definition can be reliably applied by crowdworkers in Section 3.
We generate a corpus of decontextualized sentences corresponding to original sentences drawn from the English Wikipedia. We show that a high proportion of these original sentences can be decontextualized using a relatively simple set of re-write operations, and we use the data to define a new automatic decontextualization task in which a computer system needs to create a decontextualized sentence from an original sentence presented in paragraph context. We discuss the implications of choosing Wikipedia as a domain in Section 3.4.
We present two methods for automatic decontextualization based on state-of-the-art coreference (Joshi et al., 2020) and generation (Raffel et al., 2019a) models. We evaluate the output of these models with automatic measures (derived from Xu et al. [2016]), as well as through human evaluation. Both automatic and human evaluations show that the largest sequence-to-sequence model produces high quality decontextualizations in the majority of cases, although it still lags human performance in the thoroughness and accuracy of these decontextualization edits.
Finally, we present two demonstrations of the utility of decontextualization. The first is a user study giving evidence that decontextualized sentences can be valuable when presented to users as answers in a question-answering task—raters judge that they balance conciseness with informativeness. In the second, we use decontextualization as a preprocessing component for generating a retrieval corpus for open domain question answering. Decontextualizing the sentences to be indexed by the retrieval system enables more efficient answer string retrieval for information-seeking queries. These demonstrations are presented as preliminary results, and we argue that decontextualization is an important sub-task for a wide range of NLP applications.
2 Linguistic Background
We start with the following definition:
Definition 1 (Decontextualization)
Given a sentence-context pair (s,c), a sentence s′ is a valid decontextualization of s if: (1) the sentence s′ is interpretable in the empty context; and (2) the truth-conditional meaning of s′ in the empty context is the same as the truth-conditional meaning of s in context c.
A context c is a sequence of sentences preceding s, and the empty context is the empty sequence.
We have been careful here to use the more specific term “truth conditional meaning” rather than “meaning”. Here we follow the distinction in semantics/pragmatics between truth conditional meaning and implicature, and deliberately exclude implicatures (which can also be considered part of the meaning of an utterance) from our definition. There is a rich history of work in semantics and pragmatics on truth-conditional meaning and implicatures, going back to Grice (1975). Our concept of “truth conditional meaning” is very close to “explicature” as used in Relevance Theory (Sperber and Wilson, 1986). Consider this description of explicature from Birner (2012) (pages 96–97, our own emphasis added):
The explicature in an utterance is the result of enriching the semantic content with the sorts of pragmatic information necessary to provide us with a truth-evaluable proposition. This includes calculating the referents for pronouns, working out the intended interpretation for deictic phrases like here and later ..., disambiguating lexically and structurally ambiguous words and phrases, making any “bridging” inferences necessary for reference resolution … and so on.
We will see in the next section that our annotation task follows this definition quite closely.
As an example consider the following exchange:
Susan: Has the Croatia national football team ever won the FIFA World Cup? Jon: Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.
Here the truth conditional meaning of Jon’s reply is equivalent to “Croatia’s best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France”, whereas the implicature would be “the Croatia national football team has never won the FIFA World Cup” (which answers Susan’s question). In our definition the decontextualized sentence s′ should preserve the truth-conditional meaning, but is not required to preserve the implicature(s) of the sentence.2
Remark (extra-linguistic context): In addition to its document context, a given sentence s and its counterpart s′ also come with a temporal, cultural, and geographic context—that is, where and when they are being written or read and by whom.3 We assume that these aspects of context are preserved during decontextualization. The effect of this is that elements of s that derive their meaning from outside of the document context will receive equivalent interpretation in s′, and hence do not require decontextualization. For example, the expression “thus far” in Figure 1 is interpreted relative to the time of utterance, not relative to what has been previously said in the Wikipedia article, and hence it appears in both the original and decontextualized sentences.
3 Task Definition
An annotator is provided with an entire document d and a target sentence within the document, represented by start and end indices (s_st, s_end). First, the annotator decides whether the target sentence can be decontextualized or not, labeling it as feasible or infeasible. If the example is marked as feasible, the annotator decontextualizes the sentence, producing y, a new sentence that satisfies the conditions in Definition 1.
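For concreteness, each annotation unit can be thought of as the following record (a minimal sketch in Python; the field names are ours and not the schema of the released data):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecontextExample:
    """One annotation unit: a document, a target sentence span, and the label."""
    document: str                 # full Wikipedia page text d
    sent_start: int               # start index of the target sentence (s_st)
    sent_end: int                 # end index of the target sentence (s_end)
    category: str                 # "feasible" or "infeasible"
    decontextualized: Optional[str] = None   # rewritten sentence y, if feasible

    @property
    def original_sentence(self) -> str:
        return self.document[self.sent_start:self.sent_end]
```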
3.1 Feasibility
The feasible category includes sentences that do not require any modification to be decontextualized (e.g., “Émilie du Châtelet proposed the hypothesis of the conservation of total energy, as distinct from momentum”), as well as sentences that require edits to stand alone.
In the decontextualization step, we instructed annotators to make only minor modifications, such as copying and pasting a few phrases from the document into the target sentence and deleting phrases from the target sentence. When a sentence is too challenging to decontextualize, it is classified as infeasible. Often, sentences in this category are part of a narrative story, or rely heavily on the preceding few sentences. See Figure 2 for examples.
3.2 Edit Types and Linguistic Phenomena
When an example is classified as feasible, the annotator makes edits to decontextualize the sentence. Table 1 shows the different edit types. They fall into four broad categories:
| Edit Type | Description | Example | % |
|---|---|---|---|
| Pronoun/NP Swap | Replacement of a definite pronoun / noun phrase with another referring expression | ⟅-The copper statue, +The Statue of Liberty⟆, a gift from the people of France to the people of the United States, was designed by French sculptor Frédéric Auguste Bartholdi and built by Gustave Eiffel. | 40.5 |
| Name Completion | Expansion of acronyms or partial names | ⟅-Meg, +Megan “Meg” Griffin⟆ made her first appearance on television when Family Guy debuted on Fox on January 31, 1999, with the episode “Death Has a Shadow”. | 11.5 |
| DM Removal | Removal of discourse markers that can only be understood in context | ⟅For instance,⟆ Alaska could be regarded as the highest state because Denali, at 20,310 feet, is the highest point in the US. | 3.5 |
| Bridging | Addition of a modifier (typically a PP) to a noun phrase | In all fights ⟅+in the Ultimate Fighting Championship⟆, each round can be no longer than five minutes. | 13 |
| Global Scoping | Addition of a phrase (typically a PP) that modifies the entire sentence | The Japanese film Shoplifters, directed by Hirokazu Kore-eda, won the Palme d’Or ⟅+at the 2018 Cannes Film Festival⟆. | 7 |
| Addition | Addition of background information that is not necessary but helps readability significantly | Charles Darwin⟅+, an English naturalist and biologist,⟆ was among the first to suggest that physiological changes caused by an emotion had a direct impact on, rather than being just the consequence of that emotion. | 10 |
Name Completion, Pronoun / NP Swap
correspond to replacement of a referring expression that is unclear out of context with a referring expression that is unambiguous out of context. For example, replacing the pronoun “She” with “Cynthia Nixon”, the definite NP “the copper statue” with “The Statue of Liberty”, or the abbreviated name “Meg” with “Megan “Meg” Griffin”.
DM Removal
involves removal of discourse markers (DMs) such as “therefore”.
Bridging, Global Scoping
involve addition of a phrase (typically a prepositional phrase) that modifies either a particular noun phrase (“bridging”) or the entire sentence (“global scoping”). For example, adding “in the Ultimate Fighting Championship” as a modifier to “all fights”, or adding “at the 2018 Cannes Film Festival” at the end of the sentence. The additional phrase essentially spells out a modifier that is implied by the context.
Addition
inserts background information that significantly improves readability: In many cases, this involves adding an appositive or premodifier to a named entity to add useful background information about that entity. Unlike other edits described above, edits in this category are optional. For example, replacing “The Eagles” with “The American rock band The Eagles.”
3.3 Variability
We note that for a given sentence there will frequently be more than one possible decontextualization. While this inherent subjectivity makes the task challenging to crowdsource and evaluate, we argue that it is an important feature, as shown in recent literature (Aroyo and Welty, 2015; Pavlick and Kwiatkowski, 2019; Kwiatkowski et al., 2019), and propose to collect multiple references per example. Table 2 shows examples where there can be multiple correct decontextualizations. In the first example, while the semantics of the edits are roughly equivalent (i.e., the annotators agreed on which noun phrases have to be disambiguated and what information has to be added), they differ in how to rewrite the sentence. In the second example, we see disagreement on what information should be added to the sentence. We do not make any explicit assumptions about what is known and salient to the reader, and instructed annotators to use their best judgment to rewrite such that the new sentence is fluent, unambiguous, and clear when presented alone. In the last example, annotators disagree on feasibility. While the sentence is part of a bigger narrative, two annotators judged that it could be edited to stand alone by adding a global scoping modifier, “In Greek mythology.”
Page title / Section title: We Don’t Talk Anymore (Charlie Puth song) / Music video
Paragraph: The music video premiered on August 2, 2016, on BuzzFeed and was directed by Phil Pinto. It shows Puth and Mirella Cardoso as his love interest. …
Decontextualization 1: ⟅-It, +We Don’t Talk Anymore music video⟆ shows ⟅-Puth, +Charlie Puth⟆ and Mirella Cardoso as his love interest.
Decontextualization 2: ⟅-It, +The “We Don’t Talk Anymore” (Charlie Puth song) music video⟆ shows Puth and Mirella Cardoso as his love interest.

Page title: The American Baking Competition
Paragraph: CBS placed casting calls for participants on November 14, 2012. Auditions were held between December 1 and December 15, 2012. The competition took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 1: The ⟅-competition, +American Baking Competition⟆ took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 2: The ⟅-competition, +American Baking Competition, a reality competition television series,⟆ took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.

Page title: Gemini (Constellation)
Paragraph: In Greek mythology, Gemini was associated with the myth of Castor and Pollux, the children of Leda and Argonauts both. Pollux was the son of Zeus, who seduced Leda, while Castor was the son of Tyndareus, king of Sparta and Leda’s husband. Castor and Pollux were also mythologically associated with St. Elmo’s fire in their role as the protectors of sailors. When Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.
Decontextualization 1: Infeasible
Decontextualization 2: ⟅+In Greek mythology,⟆ when Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.
3.4 Scope of Current Task Formulation
Our data comes from the English portion of the Wikipedia corpus. We sampled sentences as follows. We first pick a (question, Wikipedia page, short answer) triple from the Natural Questions (Kwiatkowski et al., 2019), uniformly at random from the questions that have a short answer. We include the sentence containing the short answer as one example; as a second example we choose a sentence at random from the same Wikipedia page. After sampling, we exclude (1) sentences under a “Plot” category, as they are often infeasible to decontextualize; (2) any sentence that is the first sentence of the page; and (3) any sentence from a paragraph containing only a single sentence. A sketch of this exclusion step is shown below.
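The sketch operates on precomputed attributes of a candidate sentence (the attribute names are illustrative; the actual sampling pipeline is not part of the released data):

```python
def keep_candidate(section_title: str, sentence_index_in_page: int,
                   sentences_in_paragraph: int) -> bool:
    """Apply the three exclusion rules from Section 3.4."""
    if section_title == "Plot":           # (1) Plot sections are often infeasible
        return False
    if sentence_index_in_page == 0:       # (2) first sentence of the page
        return False
    if sentences_in_paragraph <= 1:       # (3) single-sentence paragraphs
        return False
    return True

# Example: the 12th sentence on the page, in a 4-sentence "History" paragraph, is kept.
assert keep_candidate("History", 11, 4)
assert not keep_candidate("Plot", 11, 4)
```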
We designed this data selection process to ensure that a large proportion of examples (90%) could be decontextualized using simple edits described in Section 3.2.
Before settling on Wikipedia, we conducted an initial pilot study which revealed that encyclopedic text is substantially easier to decontextualize compared to newswire or literary text. In the latter genres, the context required for the comprehension of any given sentence appears to be much more complex in structure. Similarly, it is difficult to posit decontextualization for sentences that appear on social media platforms, because they are situated within complex and highly specific social contexts. In contrast, being written for a general audience, Wikipedia makes limited assumptions about its reader.
Within Wikipedia, we similarly found that articles on popular historical or cultural entities and events were easier to decontextualize by crowdworkers compared to articles from technical domains, such as ones on medical or mathematical concepts. Comprehension of such articles requires a considerable body of background knowledge or information from preceding paragraphs. Articles in our dataset cover topics that require little background knowledge to comprehend.
We focus on decontextualization of sentences, where the space of edits is restricted, to make the task easier to annotate and quality control. However, alternative formulations, such as decontextualization of paragraphs, could also be studied. One could also consider allowing a wider range of edits, such as multi-sentence outputs, or edits beyond copy-and-paste, such as paraphrasing and re-ordering. We anticipate that exploring such alternative formulations would help extend the scope of decontextualization to the more challenging domains mentioned above.
We stress, however, that in spite of our restriction to single sentences in Wikipedia, the decontextualization task is nevertheless valuable: Wikipedia (and other encyclopedic sources) contains a wealth of factual information, and a high proportion (over 60%; see Table 3) of sentences both require decontextualization and can be decontextualized under our definitions (only 30% of sentences are interpretable out of context without any edits).
| Split | # | par. len | sent. len | feasible w/ edit (%) | feasible as is (%) | infeasible (%) |
|---|---|---|---|---|---|---|
| Train | 11290 | 695 | 156 | 60 | 31 | 9 |
| Dev | 1945 | 695 | 162 | 67 | 21 | 12 |
| Test | 1945 | 711 | 160 | 68 | 20 | 12 |
| Expert | 100 | 658 | 163 | 63 | 26 | 12 |
4 Data Collection
Annotation Interface
The annotator is presented with a sentence in the context of an entire Wikipedia page. In the first step the annotator judges whether the example is feasible or infeasible. If the example is marked as feasible, the annotator uses delete, add, or swap operations within a user interface to produce a decontextualized string.
Data Statistics
We collected one reference for each example in the training data, and five references for each example in the evaluation data. Annotators were native speakers of English located in the United States; on average, they took 4 minutes to annotate a single example.
In total, 28 annotators annotated the examples, with 11 annotators annotating more than 1K examples each.
Table 3 presents overall data statistics. Decontextualization is possible for the majority of examples, with the infeasible category covering roughly 10% of the data. We note a slight discrepancy between the training and evaluation data distributions, potentially due to a change in the annotation interface. A small subset of the data was annotated by the authors for comparison with the crowdsourced data (last row of the table).
Annotation Quality
We quantify annotation agreement on the feasibility classification. Fleiss’ kappa on the category classification is 0.51 among expert annotators and 0.30 among crowd annotators (binary agreement is 85%). We observed more variability among crowdworkers, whose backgrounds are more diverse; some annotators have a looser concept of “stand alone” and consistently attempted decontextualization.
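For reference, this agreement statistic can be computed with an off-the-shelf implementation such as the one in statsmodels; the rating matrix below is purely illustrative and not our data.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Rows are examples, columns are per-category rater counts: [feasible, infeasible].
# Counts are made up for illustration; each row sums to the number of raters (5).
ratings = np.array([
    [5, 0],
    [3, 2],
    [4, 1],
    [5, 0],
])
print(fleiss_kappa(ratings))  # Fleiss' kappa over the feasibility labels
```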
We also measured agreement among the individual edits. For each edit operation (as defined in Section 3.2), we compare the output sentence after that single edit to the set of output sentences produced by single edits from other annotators. About 32.5% of edits were covered.
Because of the inherent annotation variability, four of the authors manually evaluated 100 crowdsourced annotations from the evaluation data on two measures: (1) whether the sentence is sufficiently and correctly decontextualized, and (2) whether the sentence is grammatically correct and fluent. Overall, 88% of annotations were valid on both measures, with 89% valid on content and 88% on form.
5 Automatic Decontextualization
5.1 Models
The Coreference Model
As many decontextualization edits can be recovered by a coreference resolution module, we adapt the output of the state-of-the-art coreference resolution system of Joshi et al. (2020), trained on the CoNLL dataset (Pradhan et al., 2012), into a decontextualization system. We used the publicly available pre-trained SpanBERT-Large checkpoint with the original hyperparameters.4
We run this model on the input sequence, and map the coreference cluster predictions to modify the sentence as follows. We only consider clusters with a mention in the target sentence. For each such cluster, we find its first mention inside the target sentence, and find another mention in the same cluster that was presented earlier in the input and is longer than the current mention. If such a mention is found, we replace the current entity mention string with the earliest such mention string (e.g., “She” is replaced with “Taylor Swift”). On average, 36.5% of examples were modified through this process.
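A minimal sketch of this post-processing step, assuming the coreference clusters are given as character spans over the full input text (this span format is an assumption on our part, not the exact output format of the SpanBERT system):

```python
def decontextualize_with_coref(text, target_start, target_end, clusters):
    """Rewrite the target sentence using predicted coreference clusters.

    clusters: list of clusters, each a list of (start, end) character spans
    over `text`. Per cluster, the first mention inside the target sentence is
    replaced by the earliest, longer mention occurring before the sentence.
    """
    replacements = []  # (start, end, new_string), with offsets into `text`
    for cluster in clusters:
        in_target = sorted((s, e) for (s, e) in cluster
                           if target_start <= s and e <= target_end)
        if not in_target:
            continue  # only consider clusters with a mention in the target sentence
        first_s, first_e = in_target[0]
        cur_len = first_e - first_s
        # Mentions presented earlier in the input that are longer than the current one.
        earlier = [(s, e) for (s, e) in cluster if e <= target_start and e - s > cur_len]
        if earlier:
            s, e = min(earlier)  # the earliest such mention
            replacements.append((first_s, first_e, text[s:e]))
    # Apply replacements right-to-left so earlier offsets stay valid.
    sentence = text[target_start:target_end]
    for s, e, new in sorted(replacements, reverse=True):
        sentence = sentence[:s - target_start] + new + sentence[e - target_start:]
    return sentence

# Toy example: "She" in the target sentence is replaced with "Taylor Swift".
text = "Taylor Swift is a singer. She released a new album."
print(decontextualize_with_coref(text, 26, 52, [[(0, 12), (26, 29)]]))
```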
The Seq2Seq Generation Model
is based on the recent T5 model (Raffel et al., 2019b). We evaluate two variants of the model, Base and 11B, which differ mainly in model capacity. We fine-tune the model on our crowdsourced training set, setting the target sequence to be [CAT] [SEP] y, where [CAT] ∈ {unnecessary, feasible, infeasible} and y is the decontextualized sentence when [CAT] = feasible and the original sentence when [CAT] ∈ {unnecessary, infeasible}. unnecessary marks examples where the original sentence can stand alone without any edits.
We limit the input/output to 512/128 tokens for both variants, and fine-tune from pre-trained checkpoints5 with a batch size of 100 examples until the validation loss stops decreasing, after about 32K steps for the larger model and 500K steps for the smaller model.
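A sketch of how training targets could be assembled under this scheme (only the “[CAT] [SEP] y” target format is described above; the literal "[SEP]" string used here is our assumption):

```python
def make_t5_target(category: str, sentence: str, decontextualized: str = None,
                   sep: str = " [SEP] ") -> str:
    """Build the fine-tuning target string for one example.

    category is one of {"unnecessary", "feasible", "infeasible"}. For feasible
    examples the target carries the rewritten sentence; otherwise it copies
    the original sentence, as described in Section 5.1.
    """
    assert category in {"unnecessary", "feasible", "infeasible"}
    y = decontextualized if category == "feasible" else sentence
    return category + sep + y

print(make_t5_target("feasible",
                     "She released a new album.",
                     "Taylor Swift released a new album."))
# -> "feasible [SEP] Taylor Swift released a new album."
```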
5.2 Evaluation
5.2.1 Feasibility Detection
We first evaluate the accuracy of models in making the feasible vs. infeasible decision. To do this, we compute the binary agreement with each human reference and average these agreements to get an accuracy.
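Concretely, the per-example accuracy is the prediction’s average agreement with the five reference labels (a one-line sketch):

```python
def feasibility_accuracy(predicted_feasible: bool, reference_labels) -> float:
    """Average binary agreement of one prediction with all human references."""
    return sum(predicted_feasible == r for r in reference_labels) / len(reference_labels)

# A prediction of "feasible" against references [True, True, True, False, True] scores 0.8.
print(feasibility_accuracy(True, [True, True, True, False, True]))
```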
Results
For the feasible vs. infeasible classification task, a baseline that always predicts feasible achieves 88% accuracy. The larger variant, T5-11B, achieves 89% accuracy, outperforming human agreement (85% accuracy) and affirming the strong performance of pre-trained language models on classification tasks (Devlin et al., 2018). This model predicts the infeasible category infrequently (5% of examples), while humans classify 12% of examples as infeasible. The smaller variant, T5-Base, is less accurate at 77%, over-predicting the infeasible category (20% of examples). As an untrained baseline, the coreference model cannot make the feasibility decision at all.
5.2.2 Decontextualized Sentence Generation
Setup
For development / test examples, we have five human annotations per example. We only consider examples marked by three or more annotators (out of five) as feasible for decontextualized sentence generation. For each of these examples, we discard annotations that mark the example as infeasible. For automatic evaluation and comparison, we need a single human output, which is scored in the same way as the model outputs, and a set of reference annotations that are treated as correct, gold annotations. The human output provides a reference point for the evaluation measures, against which the automatic output can be compared.
We observed that comparing a longer decontextualized sentence to shorter decontextualized sentences often erroneously results in low scores on automatic metrics (e.g., in the last example of Table 2, adding extra information would be erroneously punished). Thus, instead of randomly selecting one annotation to be used as the representative human output, we sort the annotations by the length of the output sentence (raw bytes), and take the annotation with median length6 as the human output, using the remaining annotations as the set of references. From manual inspection of the data, the median-length output often appeared to be optimal in terms of balancing length against accuracy of the decontextualization.
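The selection of the representative human output then reduces to the following (a direct transcription of the rule above; with four references this picks the second shortest, per footnote 6):

```python
def split_human_references(annotations):
    """Pick the median-length annotation as the human output; the rest are references."""
    ordered = sorted(annotations, key=lambda s: len(s.encode("utf-8")))
    median_idx = (len(ordered) - 1) // 2   # with 4 annotations this is the 2nd shortest
    human_output = ordered[median_idx]
    references = ordered[:median_idx] + ordered[median_idx + 1:]
    return human_output, references
```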
Metric
For each model prediction and human output, we report:
Length increase, the average value of (len(decontext)-len(original)) / len(original).
% edited, the proportion of examples that were modified for decontextualization (as opposed to being left unchanged).
Sentence match, a binary score computed between the output and a set of references, indicating whether the output matches any of the references after normalization (stripping away articles and punctuation and lowercasing). We report two numbers, a score on all examples, and a score on examples where all references edited the sentence.
SARI (system output against references and against the input sentence) metric (Xu et al., 2016). To compute this, for each reference, we calculate a set of add edits, corresponding to which unigrams are seen in the reference but not in the original sentence. Conversely, we can calculate the set of delete edits, corresponding to unigrams that are in the original sentence but not in the reference. We calculate precision/recall/F1-measure on add and delete edits. We look at unigrams only, and use fractional counts for the words in the references (i.e., a word appearing in one of r references will be counted as 1/r). We compute micro average across examples, that is, globally by counting the total true positives, false negatives, and false positives, as many examples do not require any edits.7
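A simplified sketch of the sentence-match normalization and the fractional add-edit counts that feed the micro-averaged SARI scores (our own reconstruction for illustration; the released evaluation script is the authoritative implementation, and delete edits are handled symmetrically):

```python
import re
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, strip punctuation and articles (for the sentence match score)."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def sentence_match(output: str, references) -> bool:
    return any(normalize(output) == normalize(r) for r in references)

def add_edit_counts(original: str, output: str, references):
    """Fractional tp/fp/fn counts for 'add' edits (unigrams absent from the original).

    A unigram added by one of r references contributes 1/r to the gold counts;
    counts are accumulated over all examples before computing micro P/R/F1.
    """
    orig = set(original.lower().split())
    out_added = set(output.lower().split()) - orig
    r = len(references)
    gold_added = Counter()
    for ref in references:
        for w in set(ref.lower().split()) - orig:
            gold_added[w] += 1.0 / r
    tp = sum(gold_added[w] for w in out_added)
    fp = sum(1.0 for w in out_added if w not in gold_added)
    fn = sum(v for w, v in gold_added.items() if w not in out_added)
    return tp, fp, fn
```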
While the sentence match score is the easiest to interpret, it punishes longer outputs, making comparisons across systems producing outputs of different lengths challenging, and it overly rewards conservative strategies that simply copy across the original sentence. Thus, we use the SARI metric as our main evaluation metric. SARI can be thought of as a precision/recall measure on topics (unigrams) that should be added or deleted.
Automatic Evaluation
Tables 4 and 5 show development and test performance. A successful decontextualization system should achieve a high sentence match score, an adequate proportion of edited examples (experts edited about 79% of examples) and length increase (the experts’ length ratio is 1.19), as well as high SARI addition and deletion scores. As a sanity check, we report Repeat, which simply outputs the original sentence. This alone results in a high sentence match score, around 40%, meaning that for this proportion of examples at least one of the annotators deemed that the sentence could stand alone without any edits.
Development set:

| System | len inc. | % edited | match (all / edited) | SARI add F1 (P/R) | SARI del F1 (P/R) |
|---|---|---|---|---|---|
| Repeat | 0 | 0 | 38 / 0 | 0 (0/0) | 0 (0/0) |
| Coref | 7 | 42 | 39 / 13 | 22 (51/14) | 31 (34/28) |
| T5-Base | 8 | 40 | 48 / 21 | 29 (67/19) | 40 (54/32) |
| T5-11B | 12 | 59 | 53 / 32 | 42 (72/30) | 46 (49/43) |
| Human | 24 | 76 | 45 / 29 | 56 (64/49) | 58 (61/55) |
Test set:

| System | len inc. | % edited | match (all / edited) | SARI add F1 (P/R) | SARI del F1 (P/R) |
|---|---|---|---|---|---|
| Repeat | 0 | 0 | 36 / 0 | 0 (0/0) | 0 (0/0) |
| Coref | 8 | 42 | 38 / 13 | 23 (50/15) | 36 (40/32) |
| T5-11B | 13 | 61 | 52 / 32 | 43 (69/31) | 47 (49/46) |
| Human | 23 | 77 | 44 / 28 | 56 (64/49) | 58 (61/56) |
The coreference system achieves exact match on about 13% of the examples that require edits, without any task-specific fine-tuning. Its SARI addition scores show high precision and low recall, and its deletion scores are low, as it cannot delete discourse markers. The seq2seq generation models achieve high scores across all measures. The bigger variant is substantially better, editing more than its smaller counterpart without losing precision. We observe that the larger variant outperforms the average human on the sentence match measure, but not on the SARI measures. The T5 model modifies fewer examples than the annotators, and its edits involve fewer tokens, which benefits it on the sentence match measure. However, the model is more likely to miss required edits, as shown by the low recall on the SARI addition and deletion measures. We discuss this further in the following human evaluation section.
Human Evaluation
We sampled 100 examples from the evaluation set where at least two annotators and our best model made decontextualization edits. We randomized the order of presentation of the T5 and human outputs so as not to bias the annotation. On this set, two of the authors conducted a manual evaluation. Given two decontextualized sentences, one from the best model and the other randomly selected from the set of annotations with decontextualization edits, we evaluated each on two dimensions: (a) is it fluent and grammatically correct? (b) is it sufficiently and correctly decontextualized? Lastly, we recorded a preference between the two outputs (A, B, or either).
Expert annotators marked as “sufficient” those items for which all possible referential ambiguities had been resolved. Given the subjective nature of the task, some decontextualizations judged “insufficient” by one expert could be valid for another annotator with different world knowledge. We report averaged binary scores from the two experts. The model output scored 88.0% on fluency and 67.5% on correct decontextualization, while the human reference output scored 84.5% on fluency and 78.5% on correct decontextualization. Both annotators found T5 to be slightly more fluent, while humans are more thorough and accurate in decontextualizing. Table 6 shows the preferences of the two annotators. Both preferred the human output, and their preferences exhibit high agreement (matching on 37 out of 40 examples when both had a preference).
| | T5 | either | Annotator | Sum |
|---|---|---|---|---|
| T5 | 13 | 12 | 2 | 27 |
| either | 7 | 22 | 4 | 33 |
| Annotator | 1 | 15 | 24 | 40 |
| Sum | 21 | 49 | 30 | 100 |
We briefly characterize common error patterns for annotators and the T5 model. Similar error patterns emerge between the annotations and the model outputs. Both occasionally fail to identify generics that need to be replaced with referring NPs, phrases that require bridging, and temporal contexts that should be provided. Additionally, we noticed that the T5 model relies heavily on title cues, and sometimes fails to clarify ambiguous entities that are not the main entity of the page. We noticed very few examples where T5 hallucinates factually incorrect content.
6 Two Applications
We present two demonstrations of the utility of decontextualization. First, we argue that decontextualized sentences can be valuable in themselves as answers in question answering; second, we show that they can be useful as a preprocessing step.
6.1 Decontextualized Answer As Is
We showcase a use case in which decontextualized sentences provide succinct yet informative answers to open domain factoid questions (Kwiatkowski et al., 2019). We design a user study where people compare a decontextualized-sentence answer with an original-sentence answer and a paragraph answer to the same query.8
Setup
Given a question and two presentations of the same answer, raters were tasked with marking their preference between the two answer presentations (option A, option B, or either). The actual short-span answer in the sentence is always highlighted (as in Table 8); Figure 3 shows a screenshot of the user study interface.
We conduct three comparison studies on the same set of 150 questions: (a) decontextualized sentence vs. original sentence, (b) original sentence vs. original paragraph, (c) decontextualized sentence vs. original paragraph. For each example in each study, we collected 10 user ratings. The questions were randomly chosen from the set of questions that have a short answer and for which the sentence containing the short answer was categorized as feasible by the annotators, with edits necessary to decontextualize. We use crowdsourced annotations of decontextualized sentences.
Result
Table 7 shows the results of the user study. We observe that decontextualized sentence answers are preferred to both the original sentence answers and the original paragraph answers. We also note that users generally preferred sentence answers to paragraph answers.
| Opt. A vs. Opt. B | Prefer A | Prefer B | Prefer either | log odds intercept [CI] |
|---|---|---|---|---|
| Dec. vs. Ori. | 730 | 426 | 364 | 0.85 [0.4, 1.3] |
| Dec. vs. Par. | 850 | 456 | 234 | 0.55 [0.1, 1.0] |
| Ori. vs. Par. | 741 | 505 | 274 | 0.31 [−0.2, 0.8] |
We further investigated the statistical significance of the preferences reported in Table 7. We noticed considerable question and rater variability: some raters consistently preferred a sentence answer, valuing conciseness, while other raters leaned in the opposite direction. Similarly, for some questions, all raters preferred a sentence answer. Figure 4 visualizes this variability across questions and raters.
To control for the correlations induced by the rater and question groups, we fit a generalized linear mixed model (GLMM) using the brms R package (Bürkner, 2017). For this analysis, we excluded data points where users did not show a preference (selected either). We used the formula p ∼ 1 + (1|r) + (1|q), where p is whether a rater chose one option over the other, r is the rater id, and q is the question id. This formula specifies a regression of the log-odds of the rater preference while allowing for random effects across raters (r) and questions (q). The last column of Table 7 shows the fixed effect coefficients and their confidence intervals. The intercept represents the strength of preference towards option A. We found a statistically significant preference for decontextualized sentences over both original sentences and paragraphs (the p-value was smaller than 0.05 for both studies).
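In standard mixed-model notation, the fitted model corresponds to (our transcription of the formula above):

```latex
\mathrm{logit}\,\Pr(\text{rater } r \text{ prefers option A on question } q)
  = \beta_0 + u_r + v_q,
\qquad u_r \sim \mathcal{N}(0, \sigma_r^2), \quad v_q \sim \mathcal{N}(0, \sigma_q^2)
```

where β0 is the fixed intercept reported in the last column of Table 7, and u_r and v_q are rater and question random intercepts.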
Examples
We qualitatively investigated which examples benefit from decontextualization, and for which examples raters prefer paragraph answers. Table 8 shows questions together with two answer presentations, along with the fitted per-question coefficient towards the decontextualized answer in study (c) and towards the sentence answer in study (b). In the first row, the added information from the decontextualization is not relevant to the question, so we observe a preference against decontextualization. In the second and third rows, the decontextualized sentence answer is preferred, as it provides enough evidence to answer the query while the original sentence answer does not.
| Query | Decontextualized answer | Paragraph answer (sentence answer highlighted) | Decont. | Ori. |
|---|---|---|---|---|
| when was the rising of the moon written | The Rising of the Moon, Irish ballad recounting a battle between the United Irishmen and the British Army has been in circulation since circa 1865. | The ballad has been in circulation since circa 1865. The earliest verifiable date found in publication is 1867. | −2.09 | −1.53 |
| what is the most viewed video on youtube in 24 hours | The most viewed music video within 24 hours of its release is Taylor Swift’s Look What You Made Me Do. | This list of most viewed online videos in the first 24 hours contains the top 30 online videos that received the most views within 24 hours of release across the world. This list excludes movie trailers, which are featured on the list of most viewed online trailers in the first 24 hours. The most viewed music video in this time period is Taylor Swift’s Look What You Made Me Do. | 1.06 | −0.48 |
| when was last time england got to quarter finals in world cup | The England national football team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006. | England did not enter the competition until 1950… Their best ever performance is winning the Cup in the 1966, whilst they also finished in fourth place in 1990, and in 2018. Other than that, the team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006. | 1.40 | 0.70 |
6.2 Decontextualizing System Inputs
Having shown the benefits of decontextualization in a user-facing task, we now investigate the use of decontextualization as a preprocessing step. Specifically, we construct a passage retrieval corpus for open domain question answering (Chen et al., 2017) from decontextualized sentences. Experiments show that decontextualized sentences preserve the completeness of passages while minimizing their length (and thus computational cost).
Background
Open domain question answering systems typically pair a passage retrieval component (Liu and Croft, 2002) with a transformer-based answer extractor (a reading comprehension model) applied to the retrieved passages (Guu et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2020). The computational cost is dominated by the cost of co-encoding the query with the retrieved passages (typically paragraphs or overlapping 100-word windows).
Setup
We create a corpus using the 7k documents (233k paragraphs, 868k sentences) associated with the questions in the NQ-open development set (Lee et al., 2019). We consider a retrieved passage to be correct if it contains one of the answer strings9 and investigate the number of questions for which we can retrieve a correct passage for a fixed computational cost. Under this measure, we compare paragraphs, windows of 100 words, sentences, and decontextualized sentences as the set of retrieval passages. These segmentation approaches generate different numbers of passages for the same article (paragraph and 100-word-window segmentation produce fewer passages than sentence-level segmentation). To generate decontextualized sentences, we process all paragraphs with the T5-11B model, trained on all annotated data (including the development and test sets). For about 40% of sentences, the model classified the sentence as infeasible to decontextualize or as needing no edits; for these we use the original sentence. For the other 60%, the model tended to add more information. For example, for the sentence “Bush was widely seen as a ‘pragmatic caretaker’ president who lacked a unified and compelling long-term theme in his efforts.”, the decontextualized sentence is “George H.W. Bush was widely seen as a ‘pragmatic caretaker’ president of the United States who lacked a unified and compelling long-term theme in his efforts.” A paragraph passage is the entire paragraph containing the sentence, and a 100-word window is a chunk that ignores sentence boundaries. For all variants, we prepend the document title to the passage, following the literature, and use TF-IDF as the retriever.
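A minimal sketch of indexing such a passage set with an off-the-shelf TF-IDF retriever (scikit-learn here; the exact retriever configuration used in the experiment is not specified beyond TF-IDF, so treat the parameters below as placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_index(passages):
    """passages: list of strings, each '<page title> <passage text>'."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform(passages)
    return vectorizer, matrix

def retrieve(query, vectorizer, matrix, k=10):
    """Return indices of the top-k passages by TF-IDF similarity."""
    scores = linear_kernel(vectorizer.transform([query]), matrix)[0]
    return scores.argsort()[::-1][:k]

# Example: one decontextualized-sentence passage, prefixed with its page title.
passages = [
    "George H. W. Bush George H.W. Bush was widely seen as a 'pragmatic caretaker' "
    "president of the United States who lacked a unified and compelling long-term theme.",
]
vec, mat = build_index(passages)
print(retrieve("what kind of president was george h w bush", vec, mat, k=1))
```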
Metric
Results
Figure 5 plots the recall of each retrieval corpus at different computational cost budgets t on the whole NQ-open evaluation set. The graph shows that sentence-level segmentation is more cost-effective than paragraph or 100-word segmentation, and that using decontextualized sentences is more cost-effective than using the original sentences. Decontextualized sentences approach the performance of the commonly used 100-word windows at one tenth of the cost.
This result exemplifies the way in which decontextualization can be used to ensure that the input to a natural language understanding system is concise yet complete. We think that using decontextualization as preprocessing in this way could also aid tasks such as summarization.
7 Related Work
Prior literature in summarization has studied how article context affects the understanding of sentences within an article. It has been observed that disambiguating entity mentions and correctly resolving anaphora are crucial for automatic summarization (Otterbacher et al., 2002; Steinberger et al., 2007) and for evaluation of summarization systems (Pitler et al., 2010). Li et al. (2016) found that information missing from a sentence could be identified in the article context of newswire text 60% of the time. This is considerably less frequent than for the encyclopedic text studied here, but nevertheless hints that decontextualization for newswire text could be feasible. It remains unclear whether information accessible in newswire contexts can be readily incorporated into sentences using controlled edits of the type we employ.
Successful decontextualization models must resolve entity and event coreference (Humphreys et al., 1997) as well as other forms of anaphora (Rösiger et al., 2018). These are necessary but not sufficient for decontextualization, however, which also involves discourse marker removal, acronym expansion, and fluent and grammatical sentence generation.
The term decontextualization was introduced in a recent table-to-text generation dataset (Parikh et al., 2020), where a sentence from a Wikipedia document was decontextualized such that it can be interpreted when presented with a table alone. They cover only the sentences that are relevant to the table, and adapt them to the table context. In a recent image captioning dataset (Sharma et al., 2018), sentences are re-written such that information that cannot be inferred from the image is removed; for example, entity names are replaced with generics (e.g., “⟅-Tom Cruz, +A man⟆ is waiting.”).
8 Conclusion
We define decontextualization, the task of rewriting a sentence from a document to be interpretable in an empty context, while preserving its meaning. We build a crowdsourced dataset and a model for decontextualization, and demonstrate how decontextualization can be used in a user-facing task and as a sub-component of an application system.
We believe that decontextualization will also be helpful in a wide range of other applications. For example, in multi-document summarization (Fabbri et al., 2019), co-referring entities and events must be resolved across different documents and removing ambiguous references may help; extractive summarization (Cheng and Lapata, 2016) could benefit from the type of pre-processing that we presented for open-domain QA; anaphora resolution is crucial for both summarization and machine translation (Susanne et al., 1992); and decontextualizing sentences may help in recovering explicit mentions of entities and relations which can help information extraction (Narasimhan et al., 2016). The current formulation focuses on the English encyclopedic corpus and rewriting for an empty context, and future work can explore different domains of text as well as mapping to a different context.
Acknowledgments
We would like to thank members of Google AI, especially Jacob Eisenstein, Kenton Lee, Santiago Ontanon, Ankur Parikh, Daniel Andor, Chris Alberti, and Slav Petrov for helpful discussions and comments. Lastly, we would like to thank our annotators.
Notes
We have not necessarily given up on recovering implicatures: The decontextualized sentence will likely be a valuable intermediate step in deriving the implicatures of an utterance.
When there are four references, we take the second shortest sentence.
Similar to BLEU in machine translation, SARI is a useful measure for comparing different systems; however, due to the relatively large space of possible decontextualizations, it will not be possible to achieve anything close to 100% F1 on the SARI measures, and thus the absolute score is harder to interpret. A SARI score of, for example, 50% should not be interpreted as indicating a system with 50% accuracy.
Understanding how to present answers to users is a complex problem with many desiderata, e.g., preserving the original content, crediting the source, interaction with the user interface, which we are not covering comprehensively.
We adopt the answer match heuristics from Lee et al. (2019).
References
Author notes
Work done at Google.