Abstract
The availability of large-scale datasets has driven the development of neural models that create generic summaries for single or multiple documents. For query-focused summarization (QFS), labeled training data in the form of queries, documents, and summaries is not readily available. We provide a unified modeling framework for any kind of summarization, under the assumption that all summaries are a response to a query, which is observed in the case of QFS and latent in the case of generic summarization. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates summarization as a generative process, and jointly optimizes a latent query model and a conditional language model. Despite learning from generic summarization data only, our approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.1
1 Introduction
Recent years have witnessed substantial progress in generic summarization (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019a, inter alia) thanks to neural architectures based on the encoder-decoder paradigm (Sutskever et al., 2014) and the availability of large-scale datasets containing hundreds of thousands of document-summary pairs. Unfortunately, training data of this magnitude is not readily available for the related task of query-focused summarization (QFS; Dang 2005) which aims to create a summary from one or multiple document(s) that answers a specific query. Existing QFS benchmarks (Dang, 2005; Hoa, 2006; Nema et al., 2017; Baumel et al., 2016) have been constructively used for evaluation but are relatively small for training large neural models.
To make up for the absence of labeled QFS data, recent work has resorted to distant supervision provided by pretrained models, paraphrase identification, and question-answering datasets (Xu and Lapata, 2020; Su et al., 2020; Laskar et al., 2020b). Other work induces proxy queries (Xu and Lapata, 2021) from generic summarization datasets, without additional question-answering resources, which can also be extremely expensive to acquire (Bajaj et al., 2016). Despite this progress, building and scaling QFS systems remains challenging due to the many different ways natural language queries express users’ information needs. For instance, queries can consist of one or more keywords (Baumel et al., 2016; Zhu et al., 2019), a simple question (Nema et al., 2017), or a longer narrative composed of multiple sub-queries (Dang, 2006) (see the examples in Table 1). Although QFS systems can potentially handle queries resembling those seen in training, they are not expected to work well on out-of-distribution queries (Xu and Lapata, 2021), namely, queries with different surface forms from those seen in training. In order to cover new types of queries, it might be necessary to gather more data, re-design proxy queries, and re-train one or more system components, which can be computationally inefficient and in some cases practically infeasible.
| Dataset | Task | Domain | Size | D/Q/S Tokens | Query Type | Query Example |
|---|---|---|---|---|---|---|
| CNN/DM | SDS | News | 11,490 | 760.5/0.0/45.7 | Empty | ∅ |
| WikiCatSum | MDS | Wiki | 8,494 | 800.0/0.0/105.6 | Empty | ∅ |
| WikiRef | SDS | Wiki | 12,000 | 398.7/6.7/36.2 | Keywords | Marina Beach, Incidents |
| Debatepedia | SDS | Debates | 1,000 | 66.4/10.0/11.3 | Question | Is euthanasia better than withdrawing life support? |
| DUC 2006 | MDS | Newswire | 1,250 (50) | 699.3/32.8/250 | Composite | AMNESTY INTERNATIONAL – What is the scope of operations of Amnesty International and what are the international reactions to its activities? |
| DUC 2007 | MDS | Newswire | 1,125 (45) | 540.3/30.5/250 | Composite | |
| TD-QFS | MDS | Medical | 7,099 (50) | 182.9/3.0/250 | Title | Alzheimer’s Disease |
In this work, we provide a unified modeling framework for generic summarization and QFS, under the assumption that only data for the former is available. Specifically, we treat generic summarization as a special case of QFS where the query is latent. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates abstractive summarization as a generative process, and decomposes the learning objective into: (1) latent query modeling (i.e., generating latent query variables from document observations) and (2) conditional language modeling (i.e., generating summaries conditioned on observed documents and latent queries). To further handle user queries at test time, we propose a non-parametric calibration of the latent query distribution, which allows us to perform zero-shot QFS without model re-training.
Our contributions in this work are threefold: (a) we bring together generic summarization and QFS under a unified modeling framework that does not require query-related resources for training or development; (b) we provide a deep generative formulation for document summarization, where queries are represented directly from input documents in latent space, that is, without resorting to pipeline-style query extraction or generation; and (c) experiments on a range of summarization benchmarks show that across query types, document settings, and target domains, our model achieves better results than strong comparison systems.
2 Related Work
Rush et al. (2015) and Nallapati et al. (2016) were among the first to apply the neural encoder-decoder architecture to abstractive summarization. See et al. (2017) enhance their approach with a pointer-generator model, essentially a copy mechanism allowing words from the source document to be copied directly into the summary. Gehrmann et al. (2018) incorporate a content selection model that decides on relevant aspects of the source document. They frame this task as a word-level tagging problem, with the objective of separately identifying tokens from a document that should be part of its summary; at test time, they produce content selection probabilities for each word, which are then used to restrict the copy mechanism by performing hard masking over the input document. Another line of research controls summary generation via topics (Perez-Beltrachini et al., 2019a; Wang et al., 2020), retrieve-and-edit methods (Cao et al., 2018), factual relations (Jin et al., 2020), keywords, relational triples, or preselected source sentences (Dou et al., 2021).
The majority of previous QFS approaches have been extractive and compose summaries by selecting central and query-relevant sentences (Wan et al., 2007; Badrinath et al., 2011; Wan and Zhang, 2014; Li et al., 2017b, a). More recently, Xu and Lapata (2020) propose a coarse-to-fine framework that leverages distant supervision from question answering for summary sentence extraction. Abstractive QFS has received significantly less attention in comparison, due to generation models being particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). As a result, resources from a wider range of NLP tasks have been used. Su et al. (2020) rank document paragraphs against queries with the aid of QA and machine reading datasets (Su et al., 2019; Rajpurkar et al., 2016), and then iteratively summarize selected paragraphs. Similarly, Laskar et al. (2020b) jointly exploit supervision from QFS data (typically reserved for evaluation) and related QA and paraphrase identification tasks.
Because query-related resources can also be costly to obtain (Bajaj et al., 2016; Kwiatkowski et al., 2019), Xu and Lapata (2021) use none whatsoever. Instead, they create proxy queries by selectively masking information slots in generic summaries. Despite promising system performance, their approach assumes prior knowledge of target queries (proxies are created to match their length and content), and a development set is used (Xu and Lapata, 2021). Also, their system is particularly tailored to multi-document QFS and includes a sophisticated evidence selection component. Our work is closely related to theirs in that we also do not take advantage of query-related resources. We go a step further and do not require a development set either, allowing our model to be independent of specific query verbalizations and to produce QFS summaries in zero-shot settings.
Our approach is generally applicable to single- and multi-document QFS. For any summarization task we assume that queries are latent and estimate these jointly via a summarization and (weakly supervised) tagging task. The latter draws inspiration from Gehrmann et al. (2018) under the assumption that document tokens found in the summary also provide evidence for the (latent) query that gave rise to it. Finally, our model is fundamentally different from approaches that rely on document-based guidance to improve the informativeness (Cao et al., 2018) or faithfulness (Chen et al., 2021) of summaries. While these models exploit guidance from supervision signals in training data, we are faced with the problem of estimating queries when there are none available (at least during training).
3 Problem Formulation
Let D = {(x, q, y)} denote a summarization dataset, where document x is a sequence of tokens and y its corresponding summary; query q additionally specifies an information request. In generic summarization, q = ∅, whereas in QFS q can assume various formats, ranging from keywords to composite questions (see Table 1 for examples).
Our model learns from generic summarization data alone, while robustly generalizing to a range of tasks at test time, including out-of-domain QFS. A shared characteristic between generic summarization and QFS is the fact that user intent is underspecified. Even when queries are available (i.e., q ≠ ∅), they are incomplete expressions of intent, since users are unlikely to specify queries to the level of detail necessary to compose a good summary (Xu and Lapata, 2021). We thus identify latent query signals from x, and optionally take advantage of q as an additional observation for belief update.
Generative Model
We model an observed input document as a sequence of random variables x = [x1;x2;…;xM], where xi is a token and M the length of the document. We define the latent query as a sequence of discrete latent states over input document tokens: z = [z1;z2;…;zM]. Specifically, from each document token xi, we generate a binary query variable zi, whose distribution p(zi) represents the belief that xi contributes to a potential query for document x. Modeling latent queries at the token level allows us to regularize the model by taking into account weak supervision in the form of token-level tagging (Gehrmann et al., 2018). It also renders the model independent of the query form, thereby enabling zero-shot inference (see Section 4).
The output summary y = [y1;y2;…;yT] is then generated from {x,z} using teacher forcing at training time. At test time, we may additionally be presented with a query q; we ground this optional information to the input document via discrete observed variables, and generate y by additionally conditioning on these (when available) in an autoregressive manner.
Inference Model
As can be seen from Equation (6), we decompose summarization into two modeling objectives, namely, latent query modeling and conditional language modeling. Inside the query modeling term, hyperparameter ω controls the influence of the weak supervision, while β controls the strength of label smoothing on the weak annotations.
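Since Equation (6) is not reproduced here, the sketch below illustrates how the decomposed objective could be computed from the quantities described above; all function and tensor names are ours, and the exact form of label smoothing is an assumption rather than the paper's implementation.

```python
import torch.nn.functional as F

def training_loss(tag_logits, weak_labels, summary_logits, summary_ids,
                  omega=10.0, beta=0.1):
    """Sketch of the decomposed objective: a conditional language modeling term
    plus an omega-weighted weak-supervision term on the latent query tagger,
    with label smoothing beta applied to the silver token labels.

    tag_logits:     (batch, doc_len)        tagger scores for q_phi(z_i = 1 | x)
    weak_labels:    (batch, doc_len) float  silver 0/1 tags (e.g., from Bpe-Lcs)
    summary_logits: (batch, sum_len, vocab) decoder scores
    summary_ids:    (batch, sum_len) long   reference summary tokens
    """
    # Smooth the binary silver labels: positives -> 1 - beta, negatives -> beta.
    smoothed = weak_labels * (1.0 - beta) + (1.0 - weak_labels) * beta
    # Latent query modeling term: cross-entropy against the smoothed tags.
    query_loss = F.binary_cross_entropy_with_logits(tag_logits, smoothed)
    # Conditional language modeling term: token-level NLL of the summary.
    lm_loss = F.cross_entropy(summary_logits.transpose(1, 2), summary_ids)
    return lm_loss + omega * query_loss
```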
Neural Parametrization
We parametrize the two objectives in Equation (6) with a latent query model and a conditional language model, illustrated in Figure 1. The query model estimates latent query z from input variable x. At inference time, it optionally conditions on query knowledge (when this is available). The conditional language model is based on the vanilla encoder-decoder architecture, the main difference being that it encodes two views of the input document. One encoding is query-focused, and depends directly on z as generated from the query model. The second encoding is query-agnostic, allowing the original document to provide complementary context. A decoder conditioned on both encodings autoregressively generates the summary y. In contrast to previous work (Xu and Lapata, 2021), the latent query model and conditional language model are trained jointly in a fully differentiable, end-to-end manner. In the following sections we explain in detail how these two models are parametrized.
4 Latent Query Model
In this section we discuss how the inference network for latent queries is constructed. We also explain how query-focused document representations are obtained, our attempts to mitigate posterior collapse via weak supervision (see Equation (6)), and how query belief is updated when queries are available at test time.
Inference Network for Latent Queries
Query-focused View
As explained earlier, in addition to a canonical, query-agnostic encoding of the input document (which we discuss in Section 5), we further introduce a query-focused encoding factorized via latent queries z.
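One plausible way to factorize a query-focused encoding through z, consistent with the description above, is to soft-mask token representations with the tagger's posterior probabilities; the sketch below assumes this soft-masking scheme, and the class and layer names are ours rather than the released code's.

```python
import torch
import torch.nn as nn

class QueryFocusedView(nn.Module):
    """Sketch: scale token states by q_phi(z_i = 1 | x) so that tokens believed
    to be query-relevant dominate the query-focused view of the document."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.tagger = nn.Linear(hidden_size, 1)  # token-level query tagger head

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, doc_len, hidden) contextual token representations
        z_prob = torch.sigmoid(self.tagger(token_states))  # (batch, doc_len, 1)
        return z_prob * token_states  # query-focused view, same shape as input
```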
Token Tagging as Weak Supervision
Although it is possible to optimize latent queries solely based on conditional language modeling (our approach is fully differentiable), we additionally exploit weak supervision to label tokens in the document as query-specific or not. Weak supervision is advantageous as it imposes extra regularization on the posterior (see Equation (6)), thereby mitigating its collapse (i.e., the decoder may learn to ignore the query-focused view and instead solely rely on the query-agnostic view).
Let t1,…,tn denote binary tags for each of the source tokens, that is, 1 if a token is query-specific and 0 otherwise. We could learn such a tagger from training data generated by aligning query tokens to the document. In the absence of such gold-standard data, we approximate queries by summaries and obtain silver-standard token labels by aligning summaries to their corresponding documents. Specifically, inspired by Gehrmann et al. (2018), we assume a token in the document is query-specific if it is part of the longest common sub-sequence (LCS) of tokens shared with the summary. Our tagging model is built on top of a pretrained language model, and thus operates on subwords. We first byte-pair encode (BPE; Sennrich et al., 2016) documents and summaries, and then search for the LCS over BPE sequences. If there exist multiple identical LCSs, only the one appearing at the earliest document position is tagged as positive. We refer to this tagging scheme as Bpe-Lcs.
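For concreteness, the Bpe-Lcs labeling step can be sketched as follows; it assumes the document and summary have already been byte-pair encoded into token lists with the same tokenizer, and it uses standard dynamic programming with ties broken toward earlier document positions.

```python
def bpe_lcs_tags(doc_bpe, sum_bpe):
    """Binary query tags over document BPE tokens: 1 if a token belongs to the
    longest common subsequence (LCS) between document and summary, else 0."""
    m, n = len(doc_bpe), len(sum_bpe)
    # dp[i][j] = LCS length of doc_bpe[i:] and sum_bpe[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if doc_bpe[i] == sum_bpe[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    tags, i, j = [0] * m, 0, 0
    # Walk forward through both sequences; preferring to skip summary tokens on
    # ties matches document tokens at the earliest possible positions.
    while i < m and j < n:
        if doc_bpe[i] == sum_bpe[j]:
            tags[i] = 1
            i, j = i + 1, j + 1
        elif dp[i][j + 1] >= dp[i + 1][j]:
            j += 1
        else:
            i += 1
    return tags
```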
Note that although we model query variables at the token level, we take phrases indirectly into account through LCS, which identifies subsequences of tokens (or phrases) as query annotations. Our tagging model is therefore able to capture dependencies between tokens, albeit indirectly.
Training
In the initial stages of training, the tagger might produce inaccurate posterior probability assignments qϕ(zi|x) and, consequently, hurt the summarization model, which relies heavily on a high-quality query-focused view. To address this issue, we introduce a posterior dropout mechanism that replaces the estimated posterior with the weak supervision labels according to probability α. We initialize α to 1, so that only the weak supervision is used at the beginning of training, and the tagger is supervised via Equation (12). We then linearly anneal α over optimization steps so that the gradients from the summarization objective (which we introduce in Section 5) can jointly optimize the tagger.
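A minimal sketch of posterior dropout and the linear annealing of α; function names and signatures are illustrative.

```python
import random

def posterior_dropout(posterior, weak_labels, alpha):
    """With probability alpha, build the query-focused view from the weak
    supervision labels instead of the tagger's posterior q_phi(z|x)."""
    if random.random() < alpha:
        return weak_labels.float()  # assumes a 0/1 tensor of silver tags
    return posterior

def annealed_alpha(step, total_steps, start=1.0, end=0.5):
    """Linearly anneal the replacement probability over optimization steps."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```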
Zero-shot Transfer
5 Conditional Language Model
In this section we describe our conditional language model, which estimates the log-likelihood expectation of a summary sequence over the variational posterior (see Equation (6)). As mentioned earlier, we adopt an encoder-decoder architecture tailored to document summarization with latent queries.
Encoder
Decoder
6 Experimental Setup
Datasets
For model training and development, we used the CNN/Daily Mail dataset (Hermann et al., 2015), a generic single-document summarization benchmark containing news articles and associated highlights (287,227/13,368 instances). We evaluated our model on the CNN/Daily Mail test set in a supervised, generic summarization setting. We also performed several zero-shot experiments on five benchmarks representing various query formats, domains, and summarization scenarios (e.g., single- vs. multiple-documents). Specifically, we report results on WikiCatSum (Perez-Beltrachini and Lapata, 2021) as an example of multi-document generic summarization, and WikiRef (Zhu et al., 2019), Debatepedia (Nema et al., 2017), DUC 2006-07, and TD-QFS (Baumel et al., 2016) as examples of QFS. Table 1 summarizes the characteristics of these datasets and presents test set statistics. Note that, in contrast to Xu and Lapata (2021), we do not make use of development data for our QFS tasks.
Implementation Details
The shared encoder consists of 11 Transformer layers. The document and query encoders have a separate Transformer layer each. The encoders and the decoder are initialized with a pretrained Bart model (Lewis et al., 2020), except for the query encoder, which is initialized randomly. We used four GeForce RTX 2080 GPUs for training; we set the batch size to 8 (i.e., one sample for each GPU), and accumulated gradients every 32 steps. We fine-tuned Bart on CNN/Daily Mail with a learning rate of 3 × 10−5 for 20,000 optimization steps, with 500 warm-up steps. We used half-precision floating point for efficient training and set the maximum length of an input document to 640 tokens, with the excess clipped. We set β = 0.1 and ω = 10 in the learning objective, and τ = 0.9 for latent query modeling. We annealed the posterior dropout rate α from 1.0 to 0.5 over the whole training session.
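For reference, the reported settings can be collected into a single configuration dictionary; the key names and the Bart checkpoint identifier are ours and purely illustrative.

```python
# Illustrative training configuration mirroring the settings reported above.
CONFIG = {
    "init_checkpoint": "facebook/bart-large",  # assumed identifier; the paper initializes from pretrained Bart
    "learning_rate": 3e-5,
    "optimization_steps": 20_000,
    "warmup_steps": 500,
    "batch_size": 8,                   # as reported, distributed over 4 GPUs
    "gradient_accumulation_steps": 32,
    "max_source_length": 640,          # longer documents are clipped
    "fp16": True,                      # half-precision training
    "beta": 0.1,                       # label smoothing on weak tags
    "omega": 10.0,                     # weight of the weak-supervision term
    "tau": 0.9,                        # latent query modeling (as reported)
    "alpha_start": 1.0,                # posterior dropout annealing
    "alpha_end": 0.5,
}
```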
7 Automatic Evaluation
Before analyzing our model under various zero-shot settings, we first confirm it can indeed produce good quality generic summaries in a supervised setting. There is no point in contemplating zero-shot scenarios if our approach underperforms when full supervision is available. Following standard practice, we use F1 ROUGE as our automatic evaluation metric (Lin and Hovy, 2003). Unigram and bigram ROUGE (R-1 and R-2) are a proxy for assessing informativeness and the longest common subsequence (R-L) represents fluency. For multi-document QFS, we follow DUC (Dang, 2005) and report R-SU4 (based on skip bigram with maximum skip distance of 4) instead of R-L.5
7.1 Supervised Setting
Table 2 summarizes our results on the CNN/Daily Mail test set. As an upper bound (first block) we report the performance of an extractive Oracle that performs greedy search to find a set of sentences in the source document that maximize ROUGE scores against the reference (Liu and Lapata, 2019b). The Lead baseline considers the first 3 sentences in a document as the summary. LexRank (Erkan and Radev, 2004) estimates sentence-level centrality via a Markov Random Walk on graphs. The second block includes two additional extractive systems. BertExt (Liu and Lapata, 2019b) is the first rendition of a summarization system with a pretrained encoder (Devlin et al., 2019). MatchSum (Zhong et al., 2020) extracts an optimal set of sentences via semantically matching documents to candidate summaries.
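For concreteness, greedy oracle extraction of this kind can be sketched as follows; `rouge_score` is a stand-in for any ROUGE implementation, and the sentence budget is an illustrative parameter.

```python
def greedy_oracle(doc_sentences, reference, rouge_score, max_sents=3):
    """Greedily add the sentence that most improves ROUGE against the reference
    summary; stop when no sentence helps or the budget is reached."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i, sent in enumerate(doc_sentences):
            if i in selected:
                continue
            candidate = " ".join(doc_sentences[j] for j in sorted(selected + [i]))
            gains.append((rouge_score(candidate, reference), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:  # no further improvement
            break
        best, selected = score, selected + [idx]
    return sorted(selected)
```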
| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 55.8 | 33.2 | 51.8 |
| Lead | 40.4 | 17.6 | 36.7 |
| LexRank | 33.2 | 11.8 | 29.6 |
| Supervised (Extractive) | R-1 | R-2 | R-L |
| BertExt (Liu and Lapata, 2019b) | 43.9 | 20.3 | 39.9 |
| MatchSum (Zhong et al., 2020) | 43.9 | 20.6 | 39.8 |
| Supervised (Abstractive) | R-1 | R-2 | R-L |
| PTGen (See et al., 2017) | 39.5 | 17.3 | 36.4 |
| BottomUp (Gehrmann et al., 2018) | 41.2 | 18.7 | 38.4 |
| BertAbs (Liu and Lapata, 2019b) | 41.7 | 19.4 | 38.8 |
| Bart (Lewis et al., 2020) | 44.2 | 21.3 | 40.9 |
| GSum (Dou et al., 2021) | 45.9 | 22.3 | 42.5 |
| GSum (our implementation) | 45.0 | 21.9 | 41.8 |
| LQSum | 45.1 | 22.0 | 41.9 |
The third block includes various abstractive systems (see Section 2 for an overview). PTGen (See et al., 2017) and BottomUp (Gehrmann et al., 2018) do not use pretrained LMs, while BertAbs (Liu and Lapata, 2019b) is built on top of a pretrained Bert encoder. Bart (Lewis et al., 2020) is fine-tuned on CNN/DM, while GSum (Dou et al., 2021) is initialized with Bart parameters.
Our Latent Query Summarization model (LQSum) outperforms Bart by a large margin, which demonstrates the effectiveness of latent queries even for generic summarization. It also performs on par with GSum, under identical training resources and configurations. GSum is a state-of-the-art abstractive model, which relies on MatchSum (Zhong et al., 2020), a high-performance extractive model, to provide guidance to the decoder. Compared to GSum, LQSum can be trained end-to-end and requires significantly fewer parameters (406M for LQSum versus 625M for GSum; see Table 3 for details).
| Model | Size | Components |
|---|---|---|
| Bart | 400M | Enc=12, Dec=12 |
| GSum | 625M | Enc=13, Dec=12, Bert=2 (220M; guidance) |
| LQSum | 406M | Enc=13, Dec=12, Tag=1 (1M; latent query) |
7.2 Zero-Shot Setting
Multi-Document Summarization
We evaluated our model’s ability to summarize multiple documents on WikiCatSum (Perez-Beltrachini et al., 2019b), a collection of articles on a specific topic (e.g., Tokyo Olympics) and their corresponding Wikipedia summary. In order to handle multi-document input with a model trained on single-document data, we follow previous work (Perez-Beltrachini et al., 2019b) and first select a subset of salient passages which are then concatenated into a sequence and given to our model to summarize.
In the first block of Table 4 we present upper bound and baseline results. The second block contains results for two supervised systems, a sequence-to-sequence model based on Transformer (Liu et al., 2018), and a state-of-the-art system enhanced with a convolutional encoder, a structured decoder, and a topic prediction module (CV-S2D+T; Perez-Beltrachini et al. 2019b). The third block contains zero-shot models, including Bart, GSum, and LQSum. GSum requires another extractive system’s output as guidance during inference, for which we default to LexRank. As can be seen, LQSum performs best among zero-shot models, but lags behind fully supervised ones, which is not surprising (zero-shot models operate over pre-ranked, incoherent passages).
| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 47.2 | 23.3 | 42.9 |
| Lead | 22.3 | 6.9 | 19.9 |
| LexRank | 23.3 | 6.5 | 20.3 |
| Supervised (Abstractive) | R-1 | R-2 | R-L |
| Transformer (Liu et al., 2018) | 35.5 | 19.0 | 30.5 |
| CV-S2D+T (Perez-Beltrachini et al., 2019b) | 36.1 | 19.9 | 30.5 |
| Zero-shot Abstractive | R-1 | R-2 | R-L |
| Bart (Lewis et al., 2020) | 27.8 | 9.8 | 25.1 |
| GSum+LexRank | 27.4 | 8.2 | 25.0 |
| LQSum | 28.7 | 9.9 | 26.1 |
Single-Document QFS
Tables 5 and 6 show results for single-document QFS on two datasets, namely, WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), which differ in terms of document/summary size and query type (see Table 1). The first block in both tables shows results for the Oracle upper bound, Lead, and a query-focused version of LexRank described in Xu and Lapata (2020). The second block presents various supervised systems on WikiRef and Debatepedia, both extractive and abstractive. Note that abstractive QFS systems have not been previously evaluated on WikiRef, while Debatepedia contains short documents and accordingly short summaries and has mainly served as a testbed for abstractive summarization. The third block reports system performance in the zero-shot setting. We compare LQSum against Bart and GSum, which, however, requires guidance from automatically extracted sentences. Note that MatchSum (Zhong et al., 2020), the original extractive system used by GSum for guidance, is not directly applicable to QFS, as it is trained for generic summarization and does not take queries as input. We made a best-effort attempt to adapt GSum to our QFS setting by using the query-focused LexRank to extract the top K sentences for each test document as guidance.
| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 54.5 | 37.5 | 48.5 |
| Lead | 26.3 | 10.5 | 21.8 |
| LexRank (query-focused) | 29.9 | 12.3 | 26.1 |
| Supervised (Extractive) | R-1 | R-2 | R-L |
| Transformer (Zhu et al., 2019) | 28.1 | 12.8 | 23.8 |
| BertExt (Zhu et al., 2019) | 35.1 | 18.2 | 30.0 |
| Zero-shot Abstractive | R-1 | R-2 | R-L |
| Bart (Lewis et al., 2020) | 30.0 | 12.2 | 26.0 |
| GSum+LexRank (query-focused) | 30.2 | 12.5 | 26.3 |
| LQSum | 31.1 | 12.6 | 27.1 |
| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 28.9 | 11.0 | 24.9 |
| Lead | 18.1 | 5.6 | 15.9 |
| LexRank (query-focused) | 17.4 | 5.3 | 15.1 |
| Supervised (Abstractive) | R-1 | R-2 | R-L |
| Dda (Laskar et al., 2020a) | 7.4 | 2.8 | 7.2 |
| BertAbs+Rank (Abdullah and Chali, 2020) | 19.2 | 10.6 | 17.9 |
| BertAbs+Concat (Laskar et al., 2020a) | 26.4 | 11.9 | 25.1 |
| Zero-shot Abstractive | R-1 | R-2 | R-L |
| BertAbs† (Liu and Lapata, 2019b) | 13.3 | 2.8 | 2.8 |
| Bart (Lewis et al., 2020) | 21.4 | 6.3 | 18.4 |
| GSum+LexRank (query-focused) | 21.2 | 6.2 | 18.2 |
| LQSum | 23.5 | 7.2 | 20.6 |
Across both datasets, LQSum achieves the highest ROUGE scores in the zero-shot setting, in some cases surpassing the performance of supervised models. Compared to our results on generic summarization, LQSum also shows a clearer advantage over systems without latent query modeling.
Multi-Document QFS
We performed experiments on the DUC 2005-2007 benchmarks and TD-QFS (Baumel et al., 2016). The former contains long query narratives while TD-QFS focuses on short keyword queries (see Table 1).
We applied our summarization model trained on single documents to document clusters following a simple iterative approach (Baumel et al., 2018): We first rank documents in a cluster via their query term frequency, and then generate a summary for each document. The summary for the entire cluster is the concatenation of the individual document summaries subject to a budget (i.e., 250 tokens).6 Repeated sentences were skipped to reduce redundancy in the final summary.
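The iterative procedure just described can be sketched as below; `summarize` stands in for the single-document model, the sentence splitting is deliberately crude, and all names are illustrative.

```python
def summarize_cluster(documents, query, summarize, budget=250):
    """Rank documents by query term frequency, summarize each in turn, and
    concatenate sentences up to the token budget, skipping repeats."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: sum(doc.lower().split().count(t) for t in query_terms),
        reverse=True,
    )
    summary, seen, length = [], set(), 0
    for doc in ranked:
        for sent in summarize(doc, query).split(". "):  # crude sentence split
            if sent in seen:
                continue  # reduce redundancy across per-document summaries
            n_tokens = len(sent.split())
            if length + n_tokens > budget:
                return " ".join(summary)
            summary.append(sent)
            seen.add(sent)
            length += n_tokens
    return " ".join(summary)
```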
Our results are given in Table 7. The first block reports performance for the Oracle upper bound and Gold, which was estimated by comparing a (randomly selected) reference summary against the remaining two or three reference summaries.7 We also include the query-focused LexRank, and Lead (Xu and Lapata, 2021), which returns all lead sentences (up to 250 words) of the most recent document.
| | DUC 2006 | | | DUC 2007 | | | TD-QFS | | |
|---|---|---|---|---|---|---|---|---|---|
| Upper Bound & Baselines | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 |
| Gold | 45.4 | 11.2 | 16.8 | 47.5 | 14.0 | 18.9 | 52.2 | 27.0 | 30.2 |
| Oracle | 47.5 | 15.8 | 20.2 | 47.6 | 17.1 | 20.9 | 64.9 | 48.3 | 49.4 |
| Lead | 32.1 | 5.3 | 10.4 | 33.4 | 6.5 | 11.3 | 33.5 | 5.2 | 10.4 |
| LexRank (query-focused) | 34.2 | 6.4 | 11.4 | 35.8 | 7.7 | 12.7 | 35.3 | 7.6 | 12.2 |
| Distantly Supervised | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 |
| QuerySum* (Xu and Lapata, 2020) | 41.6 | 9.5 | 15.3 | 43.3 | 11.6 | 16.8 | 44.3 | 16.1 | 20.7 |
| Bart-Caq (Su et al., 2020) | 38.3 | 7.7 | 12.9 | 40.5 | 9.2 | 14.4 | — | — | — |
| PQSum (Laskar et al., 2020b) | 40.9 | 9.4 | 14.8 | 42.2 | 10.8 | 16.0 | — | — | — |
| Few- or Zero-shot Abstractive | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 |
| MargeSum† (Xu and Lapata, 2021) | 40.2 | 9.7 | 15.1 | 42.5 | 12.0 | 16.9 | 45.5 | 16.6 | 20.9 |
| Bart (Lewis et al., 2020) | 38.3 | 7.8 | 13.1 | 40.2 | 9.9 | 14.6 | 45.1 | 16.9 | 21.4 |
| GSum+LexRank (query-focused) | 38.1 | 7.9 | 13.1 | 39.5 | 9.5 | 14.3 | 45.5 | 18.0 | 22.4 |
| LQSum | 39.1 | 8.5 | 13.7 | 40.4 | 10.2 | 15.0 | 45.7 | 18.1 | 22.1 |
The second block contains distantly supervised approaches. QuerySum (Xu and Lapata, 2020) is an extractive system that takes advantage of existing QA datasets and adopts a coarse-to-fine salience estimation procedure. Bart-Caq (Su et al., 2020) uses an ensembled QA model for answer evidence extraction, and a fine-tuned Bart model (Lewis et al., 2020) to iteratively generate summaries from paragraphs. PQSum (Laskar et al., 2020b) uses fine-tuned BertSum to generate summaries for each document in a cluster, and a QA model for summary sentence re-ranking.
The third block compares our model against MargeSum (Xu and Lapata, 2021), a state-of-the-art few-shot approach, which uses data for proxy query generation and model development, and various zero-shot systems including Bart and GSum+LexRank.
Across datasets, LQSum outperforms comparison zero-shot approaches. It also has a clear advantage over MargeSum on TD-QFS but is slightly worse on DUC. We also see that LQSum is superior to Bart-Caq, which relies on distant supervision from QA data.
7.3 Ablation Studies
We further performed a series of ablation studies, reported in Table 8, to assess the contribution of individual model components. Perhaps unsurprisingly, we observe that not updating the query belief at test time hurts performance (−Query belief update in Table 8). Recall that we adopt a simple method that calibrates the variational posterior distribution. When it comes to learning meaningful latent queries that benefit summarization tasks, relying solely on tagging (−Joint training) or generation (−Weak supervision) substantially decreases performance.8 Latent query learning balances a trade-off between direct but weak supervision from the tagging objective (based on silver-standard token labels) and natural but indirect supervision from the generation objective (based on human-written summaries). As silver tagging labels provide less accurate supervision than human-written summaries, we observe that −Joint training hurts performance more than −Weak supervision.
| | CNN/DM | | | WikiRef | | | Debatepedia | | | DUC 2006 | | | DUC 2007 | | | TD-QFS | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 | R-1 | R-2 | R-SU4 |
| LQSum | 45.1 | 22.0 | 41.9 | 31.1 | 12.6 | 27.1 | 23.5 | 7.2 | 20.6 | 39.1 | 8.5 | 13.7 | 40.4 | 10.2 | 15.0 | 45.7 | 18.1 | 22.1 |
| −Query belief update | — | — | — | ↓0.1 | ↓0.2 | ↓0.2 | ↓0.5 | ↓0.3 | ↓0.6 | ↓0.6 | ↓0.2 | ↓0.6 | ↑0.1 | ↓0.1 | ↓1.3 | ↑0.1 | ↓0.6 | ↓0.4 |
| −Joint training | ↓0.4 | ↓0.3 | ↓0.4 | ↓2.9 | ↓0.9 | ↓2.8 | ↓2.8 | ↓1.1 | ↓2.8 | ↓2.9 | ↓1.7 | ↓1.6 | ↓2.4 | ↓2.0 | ↓1.7 | ↓0.7 | ↓0.6 | ↓0.4 |
| −Weak supervision | ↓0.6 | ↓0.7 | ↓0.7 | ↓0.7 | ↓0.2 | ↓0.5 | ↓1.0 | ↓0.5 | ↓1.3 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.3 | ↓0.3 | ↓0.1 | ↓0.3 | ↓0.0 |
| −Dual view | ↓2.7 | ↓3.5 | ↓2.5 | ↓12.2 | ↓9.3 | ↓10.5 | ↓7.9 | ↓3.3 | ↓6.6 | ↓6.3 | ↓1.8 | ↓1.8 | ↓6.5 | ↓3.0 | ↓2.5 | ↓2.5 | ↓3.3 | ↓2.8 |
| −Posterior dropout | ↓0.7 | ↓0.6 | ↓0.8 | ↓0.8 | ↓0.3 | ↓0.7 | ↓1.1 | ↓0.3 | ↓1.2 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.4 | ↓0.4 | ↓0.5 | ↑0.2 | ↓0.0 | ↑0.1 |
Removing the query agnostic view (−Dual view) causes a significant performance drop as the decoder can no longer leverage the original document context, which is useful especially when the query model is not accurate. Relying solely on the estimated posterior to create the query-focused view for training (−Posterior dropout), also hurts performance as it leads to more severe error propagation for the downstream generation model.
8 Human Evaluation
Following previous work (Xu and Lapata, 2021, 2020), we also evaluated query-focused summaries in a judgment elicitation study via Amazon Mechanical Turk. Native English speakers (self-reported) were asked to rate query-summary pairs on two dimensions: Succinctness (does the summary avoid unnecessary detail and redundant information?) and Coherence (does the summary make logical sense?). The ratings were obtained using a five-point Likert scale.
In addition, participants were asked to assess the Relevance of the summary to the query. Crowdworkers read a summary and for each sentence decided whether it is relevant (i.e., provides an answer to the query), irrelevant (i.e., does not answer the query), or partially relevant (i.e., it is unclear whether it directly answers the query). Relevant sentences were awarded a score of 5, partially relevant ones a score of 2.5, and irrelevant ones a score of 0. Sentence scores were averaged to obtain a relevance score for the whole summary. We view Relevance as more critical for QFS than Coherence or Succinctness, which is why we obtained per-sentence ratings and then aggregated them into an overall summary score. To make this task manageable, raters were asked to provide more coarse-grained ratings.
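The aggregation of sentence-level relevance judgments amounts to the following (a direct transcription of the scoring scheme above):

```python
def summary_relevance(sentence_ratings):
    """Map each sentence judgment to a score (relevant=5, partially relevant=2.5,
    irrelevant=0) and average over the summary."""
    score = {"relevant": 5.0, "partial": 2.5, "irrelevant": 0.0}
    return sum(score[r] for r in sentence_ratings) / len(sentence_ratings)

# e.g., summary_relevance(["relevant", "partial", "irrelevant"]) -> 2.5
```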
Participants assessed summaries created by LQSum (our model), GSum guided by the query-focused LexRank (a competitive abstractive system), the query-focused LexRank itself (an extractive baseline), and Gold (the ground-truth upper bound). We also compared against BertExt on WikiRef, BertAbs on Debatepedia, and MargeSum on DUC and TD-QFS.9 We sampled 40 query-document pairs from WikiRef and Debatepedia, 40 query-cluster pairs from DUC (2006, 2007; 20 from each set), and 40 pairs from TD-QFS, and collected three responses per pair.10
We show our results in Table 9 and examples of system output in Table 10. On WikiRef, LQSum significantly outperforms GSum in terms of relevance. On Debatepedia it surpasses BertAbs, a supervised model, across all three metrics. On DUC, it outperforms comparison systems in terms of succinctness and coherence. LQSum avoids repetition by yielding dynamic (latent) query representations for each document in a cluster. On TD-QFS, all comparison systems perform similarly, except the query-focused LexRank, which is significantly worse in terms of relevance and succinctness. As far as Relevance is concerned, we observe that LQSum outperforms comparison systems on Debatepedia and TD-QFS, while being very similar to MargeSum on DUC. On WikiRef, BertExt is slightly more relevant but less coherent.
| WikiRef | Rel | Suc | Coh | | Debatepedia | Rel | Suc | Coh |
|---|---|---|---|---|---|---|---|---|
| BertExt | 3.57 | 3.63 | 3.72 | | BertAbs | 2.42† | 2.93†° | 2.59† |
| GSum+LexRank (query-focused) | 2.92†° | 3.48° | 3.72 | | GSum+LexRank (query-focused) | 2.88† | 3.60 | 3.49† |
| LexRank (query-focused) | 3.23 | 3.40 | 3.68 | | LexRank (query-focused) | 3.33 | 3.47° | 3.52 |
| LQSum | 3.41 | 3.58 | 3.78 | | LQSum | 3.39 | 3.74 | 3.78 |
| Gold | 3.62 | 3.73 | 3.59 | | Gold | 3.29 | 3.76 | 3.57 |
| DUC | Rel | Suc | Coh | | TD-QFS | Rel | Suc | Coh |
| MargeSum | 4.00 | 3.75 | 3.65†° | | MargeSum | 3.28 | 3.57 | 3.62 |
| GSum+LexRank (query-focused) | 3.90 | 3.44†° | 3.84 | | GSum+LexRank (query-focused) | 3.26 | 3.65 | 3.76 |
| LexRank (query-focused) | 3.59†° | 3.38†° | 3.54†° | | LexRank (query-focused) | 2.78†° | 3.36†° | 3.33†° |
| LQSum | 3.97 | 3.88 | 3.95 | | LQSum | 3.35 | 3.70 | 3.77 |
| Gold | 4.01 | 3.94 | 4.04 | | Gold | 3.50 | 3.88 | 3.68 |
Query: Prashant Bhushan, Legal activism, Government accountability |
Gold: CPIL won a major victory in 2003 when the Supreme Court restrained the Union government from privatising Hindustan Petroleum and Bharat Petroleum without the approval of Parliament. |
BertExt: New Delhi, March 3: The Supreme Court verdict against P.J. Thomas’s appointment is not the lone feather in the cap of the petitioner, the Centre for Public Interest Litigation (CPIL), but perhaps the most visible one. That was when it got the apex court to restrain the Centre from divesting majority shares in Hindustan Petroleum and Bharat Petroleum without Parliament’s approval. The CPIL was founded in the late 1980s by Justice V.M. Tarkunde, who also co-founded the People’s Union for Civil Liberties. |
GSum+: The Centre for Public Interest Litigation (CPIL) is a loose collection of activists and lawyers whose aim is to fight corruption. Among its members are lawyers Shanti Bhushan, Prashant BhUSHan, Kamini Jaiswal, Ram Jethmalani, Anand Divan and Anil Divan. Another PIL asks that the government be directed to recover Indian black money stashed in foreign banks. |
LQSum: The Centre for Public Interest Litigation (CPIL) is a loose collection of activists and lawyers. The group had its big hurrah in 2003 when it got the apex court to restrain the Centre from divesting majority shares in Hindustan Petroleum and Bharat Petroleum. |
Query: Effectiveness: Do earmarks allocate spending effectively? |
Gold: Earmarks are often unrelated to legislation; holds up bill. |
BertAbs: Earmarks can be fully examined. |
GSum+: Sometimes a good piece of legislation that receives the support of a majority of congressman will be held up and voted down. |
LQSum: Congressmen are using earmarks to hold up bills they don’t like, says Rep. Ruben Gallego. |
9 Conclusion
We propose a deep generative formulation for document summarization that supports generic and query-focused applications. We represent queries as discrete latent variables, whose approximated posterior distribution can be calibrated with query observations at test time without further adaptation. Our approach does not rely on any query-related resource and can be applied in zero-shot settings. Experimental results across summarization datasets show that the proposed model yields state-of-the-art QFS performance in zero-shot settings.
Directions for future work are many and varied. One research challenge is to push this low-resource approach even further and generate abstractive summaries without access to any summaries or queries. We would also like to extend the proposed framework to cross-lingual settings, and satisfy the information needs of users with different language backgrounds through effective query understanding and summary generation.
Acknowledgments
The authors would like to thank the action editor, Wenjie Li, and the anonymous reviewers for their valuable feedback. We acknowledge the financial support of the European Research Council (Lapata; award number 681760). This research was supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9118. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Notes
Our code and models can be found at https://github.com/yumoxu/lqsum.
When , always holds (z ∈ [a,b]).
We experimentally verified this assumption in several QFS datasets. In WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), 1.57% and 4.27% of query tokens are not attested in the input document, respectively. In DUC (Dang, 2005) and TD-QFS (Baumel et al., 2016), where the input contains multiple documents, all query tokens are attested. Across all datasets, only 1.69% of query tokens are not attested in the input document/cluster.
We also experimented with drawing hard samples from z via the straight-through trick (Jang et al., 2016), which is differentiable with biased gradient estimation. However, it did not yield better results than continuous relaxation.
We used pyrouge with the following parameter settings: ROUGE-1.5.5.pl -a -c 95 -m -n 2 -2 4 -u -p 0.5 -l 250.
An alternative would be to generate a long summary at once. However, this requires a model to be trained on a MDS dataset, or at least a proxy thereof (Xu and Lapata, 2021).
We compute this upper bound only for DUC and TD-QFS benchmarks as they include multiple reference summaries.
−Joint training stops the gradients from the generation loss from flowing into the tagger through the softmax in Equation (9) during backpropagation. −Weak supervision sets ω = 0.
BertExt and BertAbs are supervised systems, while MargeSum is a few-shot system.
We are grateful to Md Tahmid Rahman Laskar and Haichao Zhu for providing us with system output.