The availability of large-scale datasets has driven the development of neural models that create generic summaries for single or multiple documents. For query-focused summarization (QFS), labeled training data in the form of queries, documents, and summaries is not readily available. We provide a unified modeling framework for any kind of summarization, under the assumption that all summaries are a response to a query, which is observed in the case of QFS and latent in the case of generic summarization. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates summarization as a generative process, and jointly optimizes a latent query model and a conditional language model. Despite learning from generic summarization data only, our approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.

Recent years have witnessed substantial progress in generic summarization (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019a, inter alia) thanks to neural architectures based on the encoder-decoder paradigm (Sutskever et al., 2014) and the availability of large-scale datasets containing hundreds of thousands of document-summary pairs. Unfortunately, training data of this magnitude is not readily available for the related task of query-focused summarization (QFS; Dang 2005) which aims to create a summary from one or multiple document(s) that answers a specific query. Existing QFS benchmarks (Dang, 2005; Hoa, 2006; Nema et al., 2017; Baumel et al., 2016) have been constructively used for evaluation but are relatively small for training large neural models.

To make up for the absence of labeled QFS data, recent work has resorted to distant supervision provided by pretrained models, paraphrase identification, and question-answering datasets (Xu and Lapata, 2020; Su et al., 2020; Laskar et al., 2020b). Other work induces proxy queries (Xu and Lapata, 2021) from generic summarization datasets, without additional question-answering resources that can be also extremely expensive to acquire (Bajaj et al., 2016). Despite this progress, building and scaling QFS systems remains challenging due to the many different ways natural language queries express users’ information needs. For instance, queries can have one or multiple keyword(s) (Baumel et al., 2016; Zhu et al., 2019), a simple question (Nema et al., 2017), or a longer narrative composed of multiple sub-queries (Dang, 2006) (see the examples in Table 1). Although QFS systems can potentially handle queries resembling those seen in training, they are not expected to work well on out-of-distribution queries (Xu and Lapata, 2021), namely, queries with different surface forms from those seen in training. In order to cover new types of queries, it might be necessary to gather more data, re-design proxy queries, and re-train one or more system components that can be computationally inefficient and in some cases practically infeasible.

Table 1:

Test data statistics. SDS/MDS stand for single-/multi-document summarization. Size refers to number of test documents; for multi-document QFS, we specify the number of clusters in brackets. D/Q/S are Document/Query/Summary tokens. Composite queries consist of a TOPIC and a narrative.

| Dataset | Task | Domain | Size | D/Q/S | Query Type | Query Example |
|---|---|---|---|---|---|---|
| CNN/DM | SDS | News | 11,490 | 760.5/0.0/45.7 | Empty | ∅ |
| WikiCatSum | MDS | Wiki | 8,494 | 800.0/0.0/105.6 | Empty | ∅ |
| WikiRef | SDS | Wiki | 12,000 | 398.7/6.7/36.2 | Keywords | Marina Beach, Incidents |
| Debatepedia | SDS | Debates | 1,000 | 66.4/10.0/11.3 | Question | Is euthanasia better than withdrawing life support? |
| DUC 2006 | MDS | Newswire | 1,250 (50) | 699.3/32.8/250 | Composite | AMNESTY INTERNATIONAL – What is the scope of operations of Amnesty International and what are the international reactions to its activities? |
| DUC 2007 | MDS | Newswire | 1,125 (45) | 540.3/30.5/250 | Composite | |
| TD-QFS | MDS | Medical | 7,099 (50) | 182.9/3.0/250 | Title | Alzheimer’s Disease |

In this work, we provide a unified modeling framework for generic summarization and QFS, under the assumption that only data for the former is available. Specifically, we treat generic summarization as a special case of QFS where the query is latent. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates abstractive summarization as a generative process, and decomposes the learning objective into: (1) latent query modeling (i.e., generating latent query variables from document observations) and (2) conditional language modeling (i.e., generating summaries conditioned on observed documents and latent queries). To further handle user queries at test time, we propose a non-parametric calibration of the latent query distribution, which allows us to perform zero-shot QFS without model re-training.

Our contributions in this work are threefold: (a) we bring together generic summarization and QFS under a unified modeling framework that does not require query-related resources for training or development; (b) we provide a deep generative formulation for document summarization, where queries are represented directly from input documents in latent space, that is, without resorting to pipeline-style query extraction or generation; and (c) experiments on a range of summarization benchmarks show that across query types, document settings, and target domains, our model achieves better results than strong comparison systems.

Rush et al. (2015) and Nallapati et al. (2016) were among the first to apply the neural encoder-decoder architecture to abstractive summarization. See et al. (2017) enhance their approach with a pointer-generator model, essentially a copy mechanism allowing words from the source document to be copied directly in the summary. Gehrmann et al. (2018) incorporate a content selection model that decides on relevant aspects of the source document. They frame this task as a word-level tagging problem, with the objective of separately identifying tokens from a document that should be part of its summary; at test time, they produce content selection probabilities for each word, which are then used to restrict the copy mechanism by performing hard masking over the input document. Another line of research controls summary generation via topics (Perez-Beltrachini et al., 2019a; Wang et al., 2020), retrieve-and-edit methods (Cao et al., 2018), factual relations (Jin et al., 2020), keywords, relational triples, or preselected source sentences (Dou et al., 2021).

The majority of previous QFS approaches have been extractive and compose summaries by selecting central and query-relevant sentences (Wan et al., 2007; Badrinath et al., 2011; Wan and Zhang, 2014; Li et al., 2017b, a). More recently, Xu and Lapata (2020) propose a coarse-to-fine framework that leverages distant supervision from question answering for summary sentence extraction. Abstractive QFS has received significantly less attention in comparison, due to generation models being particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). As a result, resources from a wider range of NLP tasks have been used. Su et al. (2020) rank document paragraphs against queries with the aid of QA and machine reading datasets (Su et al., 2019; Rajpurkar et al., 2016), and then iteratively summarize selected paragraphs. Similarly, Laskar et al. (2020b) jointly exploit supervision from QFS data (typically reserved for evaluation) and related QA and paraphrase identification tasks.

Because query-related resources can be also costly to obtain (Bajaj et al., 2016; Kwiatkowski et al., 2019), Xu and Lapata (2021) use none whatsoever. Instead, they create proxy queries by selectively masking information slots in generic summaries. Despite promising system performance, their approach assumes prior knowledge of target queries (proxies are created to match their length, and content), and a development set is used (Xu and Lapata, 2021). Also, their system is particularly tailored to multi-document QFS and includes a sophisticated evidence selection component. Our work is closely related to theirs in that we also do not take advantage of query-related resources. We go a step further and do not require a development set either, allowing our model to be independent of specific query verbalizations and produce QFS summaries in zero-shot settings.

Our approach is generally applicable to single- and multi-document QFS. For any summarization task we assume that queries are latent and estimate these jointly via a summarization and (weakly supervised) tagging task. The latter draws inspiration from Gehrmann et al. (2018) under the assumption that document tokens found in the summary also provide evidence for the (latent) query that gave rise to it. Finally, our model is fundamentally different from approaches that rely on document-based guidance to improve the informativeness (Cao et al., 2018) or faithfulness (Chen et al., 2021) of summaries. While these models exploit guidance from supervision signals in training data, we are faced with the problem of estimating queries when there are none available (at least during training).

Let ${(D,Q,S)}$ denote a summarization dataset, where document $D$ is a sequence of tokens, and $S$ its corresponding summary; query $Q$ additionally specifies an information request. In generic summarization, $Q=∅$, whereas in QFS $Q$ can assume various formats, ranging from keywords to composite questions (see Table 1 for examples).

Our model learns from generic summarization data alone, while robustly generalizing to a range of tasks at test time, including out-of-domain QFS. A shared characteristic between generic summarization and QFS is the fact that user intent is underspecified. Even when queries are available (i.e., $Q≠∅$), they are incomplete expressions of intent as it is unlikely to specify queries to the level of detail necessary to compose a good summary (Xu and Lapata, 2021). We thus identify latent query signals from $D$, and optionally take advantage of $Q$ as additional observation for belief update.

##### Generative Model

We model an observed input document $D$ as a sequence of random variables $x=[x_1;x_2;\dots;x_M]$, where $x_i$ is a token and $M$ the length of the document. We define the latent query as a sequence of discrete latent states over input document tokens: $z=[z_1;z_2;\dots;z_M]$. Specifically, from each document token $x_i$, we generate a binary query variable $z_i$, whose distribution $p(z_i)$ represents the belief that $x_i$ contributes to a potential query for document $D$. Modeling latent queries at the token level allows us to regularize the model by taking into account weak supervision in the form of token-level tagging (Gehrmann et al., 2018). It also renders the model independent of the query form, thereby enabling zero-shot inference (see Section 4).

The output summary $y=[y_1;y_2;\dots;y_T]$ is then generated from $\{x,z\}$ using teacher-forcing at training time. At test time, we may additionally be presented with a query $Q$; we ground this optional information to the input document via discrete observed variables $\tilde{z}=[\tilde{z}_1;\tilde{z}_2;\dots;\tilde{z}_M]$, and generate $y$ by additionally conditioning on $\tilde{z}$ (if it exists) in an autoregressive manner.

Our model estimates the conditional distribution $p_\theta(y|x)$ according to the generative process just described (and illustrated in Figure 1) as:
$p_\theta(y|x)=\sum_{z}p_\theta(y|z,x)\,p_\theta(z|x)=\sum_{z}p_\theta(y|z,x)\prod_{i}p_\theta(z_i|x_i)$
(1)
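To make the marginalization in Equation (1) concrete, the following minimal sketch enumerates every binary assignment of z for a toy three-token document. The function names and all probabilities are hypothetical and chosen only for illustration; they are not taken from the model described here.

```python
import itertools
import math


def marginal_likelihood(prior_on, cond_likelihood):
    """Sum p(y|z,x) * prod_i p(z_i|x_i) over all binary vectors z.

    prior_on[i] is p(z_i = 1 | x_i); cond_likelihood(z) stands in for p(y | z, x).
    """
    total = 0.0
    for z in itertools.product([0, 1], repeat=len(prior_on)):
        # Probability of this particular assignment under the factorized prior.
        p_z = math.prod(p if zi == 1 else 1.0 - p for zi, p in zip(z, prior_on))
        total += cond_likelihood(z) * p_z
    return total


def toy_cond_likelihood(z):
    # Made-up stand-in for p(y | z, x): grows with the number of selected tokens.
    return 0.1 + 0.2 * sum(z)


# Hypothetical toy prior: each of 3 tokens is a query token with probability 0.5.
p_y = marginal_likelihood([0.5, 0.5, 0.5], toy_cond_likelihood)
```

The loop visits 2^M assignments, so exact enumeration of this kind is only feasible for very small M.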
Figure 1:

Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serve as input to a decoder for summary generation.

##### Inference Model
The posterior distribution of latent variable $z$ is calculated as:
$p_\theta(z|x,y)=\frac{p_\theta(x,y,z)}{p_\theta(x,y)}=\frac{p_\theta(x,y,z)}{\sum_{z}p_\theta(x,y,z)}$
(2)
Unfortunately, exact inference of this posterior is computationally intractable due to the joint probability $p_\theta(x,y)$. We therefore approximate it with a variational posterior $q_\phi(z|x,y)$. Inspired by β-VAE (Higgins et al., 2017), we maximize the probability of generating summary $y$, provided the distance between the prior and variational posterior distributions is below a small constant δ:
$\max_{\phi,\theta}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{E}_{z\sim q_\phi(z|x,y)}\log p_\theta(y|x,z)\right]$
(3)
$\text{subject to}\quad D_{\mathrm{KL}}\left(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\right)<\delta$
(4)
Because we cannot solve Equation (4) directly, we invoke the Karush-Kuhn-Tucker conditions (Kuhn et al., 1951) and cast the above constrained optimization problem into unconstrained optimization, with the following ELBO objective:
$\mathcal{L}_{\mathrm{ELBO}}=\mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(y|x,z)\right]-\beta D_{\mathrm{KL}}\left(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\right)$
(5)
where the Lagrangian multiplier β is a hyperparameter. To minimize our model’s dependence on queries (which we assume are unavailable for both training and development), we adopt a uniform prior $p_\theta(z|x)$. In other words, the probability of variable $z$ being a query word (given all instances of $x$) follows a uniform distribution. In this case, minimizing the KL term in Equation (5) is equivalent to maximizing the entropy of the variational posterior. We further assume that the tokens observed in a document are a superset of potential query tokens, and therefore $z\perp y$ and $q_\phi(z|x,y)=q_\phi(z|x)$.
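The equivalence between minimizing the KL term and maximizing posterior entropy under a uniform prior can be checked numerically for a single binary variable, where KL(q ‖ uniform) = log 2 − H(q). The sketch below assumes a Bernoulli posterior over one token; the helper names are illustrative only.

```python
import math


def entropy(q1):
    """Entropy (in nats) of a Bernoulli distribution with P(z = 1) = q1."""
    return -sum(p * math.log(p) for p in (q1, 1.0 - q1) if p > 0.0)


def kl_to_uniform(q1):
    """KL(q || Uniform{0, 1}) for a Bernoulli q with P(z = 1) = q1."""
    return sum(p * math.log(p / 0.5) for p in (q1, 1.0 - q1) if p > 0.0)


# The gap between the two quantities is the constant log 2 for every q1,
# so lowering the KL term is the same as raising the entropy.
gaps = [kl_to_uniform(q1) + entropy(q1) for q1 in (0.1, 0.5, 0.9)]
```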
While the simplification reduces the risk of exposure to bias from training on $y$, it makes learning meaningful latent variables more challenging, as they depend solely on $x$. We alleviate this by introducing a new type of weak supervision $o(\hat{z}|x,y)$, which we automatically extract from data (i.e., document-summary pairs). Essentially, we tag tokens in the document as likely to be in the summary and by extension in the query. We discuss how this tagger is learned in Section 4. For now, suffice it to say that weak supervision is a form of posterior regularization adding an extra term in the objective, which we rewrite as:
$\mathcal{L}=\underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(y|x,z)\right]}_{\text{conditional language modeling}}+\underbrace{\beta\,\mathcal{H}\left(q_\phi(z|x)\right)-\omega\,\mathcal{H}\left(o(\hat{z}|x,y),q_\phi(z|x)\right)}_{\text{latent query modeling}}$
(6)
where $\mathcal{H}(\cdot)$ denotes posterior entropy and $\mathcal{H}(\cdot,\cdot)$ denotes cross entropy.

As can be seen from Equation (6), we decompose summarization into two modeling objectives, namely, latent query modeling and conditional language modeling. Inside the query modeling term, hyperparameter ω controls the influence of weak supervision $\hat{z}$, while β controls the strength of label smoothing on the weak annotations.

##### Neural Parametrization

We parametrize the two objectives in Equation (6) with a latent query model and a conditional language model illustrated in Figure 1. The query model estimates latent query $z$ from input variable $x$. At inference time, it optionally conditions on query knowledge $\tilde{z}$ (when this is available). The conditional language model is based on the vanilla encoder-decoder architecture, the main difference being that it encodes two views of input document $D$. One encoding is query-focused, and depends directly on $z$ as generated from the query model. The second encoding is query-agnostic, allowing for the original document to provide complementary context. A decoder conditioned on both encodings autoregressively generates the summary $y$. In contrast to previous work (Xu and Lapata, 2021), the latent query model and conditional language model are trained jointly in a fully differentiable end-to-end manner. In the following sections we explain in detail how these two models are parametrized.

In this section we discuss how the inference network for latent queries is constructed. We also explain how query-focused document representations are obtained, our attempts to mitigate posterior collapse via weak supervision $o(\hat{z}|x,y)$ (see Equation (6)), and how query belief is updated when queries are available at test time.

##### Inference Network for Latent Queries
We construct a neural network model to infer for each token in the input document whether it constitutes a query term. Given a contextual token representation matrix $H_q\in\mathbb{R}^{M\times d_h}$, we project it to $\mathbb{R}^{M\times 2}$ with a two-layer MLP as a scoring function:
$H_s=\mathrm{ReLU}\left(H_qW_h+b_h^{\top}\right)$
(7)
$\pi=H_sW_s+b_s^{\top}$
(8)
where $W_h\in\mathbb{R}^{d_h\times d_h}$, $b_h\in\mathbb{R}^{d_h\times 1}$, $W_s\in\mathbb{R}^{d_h\times 2}$, and $b_s\in\mathbb{R}^{2\times 1}$ are learnable model parameters.
Let $G(0)$ denote the standard Gumbel distribution, and let $g_\ell\sim G(0)$, $\ell\in\{0,1\}$, be i.i.d. Gumbel noise. We normalize $\pi$ to form a variational distribution as:
$q_\phi(z_i=\ell|x)=\mathrm{softmax}_\ell\left([\pi_0+g_0,\pi_1+g_1]\right)=\frac{\exp\left((\pi_\ell+g_\ell)/\tau\right)}{\sum_{\ell'\in\{0,1\}}\exp\left((\pi_{\ell'}+g_{\ell'})/\tau\right)}$
(9)
where τ is the temperature controlling how close $q_\phi(z|x)$ is to $\arg\max_\ell q_\phi(z|x)$, and is optimized on the development set. Note that Gumbel noise is only applied during learning and is set to its mode (i.e., 0) for inference.
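As an illustration of the normalization in Equation (9), the sketch below computes q(z_i = 1 | x) for a single token from its two logits. It is a plain-Python approximation written for this passage, not the authors' implementation; the function names are hypothetical, and the default temperature of 0.9 is the value reported in the implementation details.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise via inverse transform sampling."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def query_posterior(pi0, pi1, tau=0.9, rng=None):
    """Return q(z_i = 1 | x) from the two logits of one token.

    Passing rng=None sets the noise to 0 (the inference-time behaviour);
    passing a random.Random instance adds Gumbel noise (training).
    """
    g0 = sample_gumbel(rng) if rng is not None else 0.0
    g1 = sample_gumbel(rng) if rng is not None else 0.0
    a0 = (pi0 + g0) / tau
    a1 = (pi1 + g1) / tau
    m = max(a0, a1)  # subtract the max for numerical stability
    e0, e1 = math.exp(a0 - m), math.exp(a1 - m)
    return e1 / (e0 + e1)
```

Lower temperatures push the output toward 0 or 1: with logits (0, 1), `query_posterior(0.0, 1.0, tau=0.1)` is much closer to 1 than the same call with `tau=0.9`.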
##### Query-focused View

As explained earlier, in addition to a canonical, query-agnostic encoding of the input document $D$ (which we discuss in Section 5), we further introduce a query-focused encoding factorized via latent queries z.

Specifically, for the $i$th token, we take the continuous relaxation of its discrete latent variable $z_i$, and ground it to the input document via:
$Q_i=q_\phi(z_i=1|x)\cdot H_{q,i}$
(10)
As we can see, the query-focused view explicitly models the dependency on latent queries. From a learning perspective, this factorization leads to the following partial derivatives of the query-focused view with respect to the query encoder states:
$\frac{\partial Q_i}{\partial H_{q,i}}=\underbrace{\left(1-q_\phi(1)\right)}_{\text{carry gate}}\cdot\frac{\partial\Delta\pi}{\partial H_{q,i}}\odot Q_i+\underbrace{q_\phi(1)}_{\text{transform gate}}\cdot\mathbf{1}$
(11)
where $q_\phi(\ell)$ is a shorthand for the variational probability of $z_i=\ell|x$, $\Delta\pi=\pi_1-\pi_0$ (see Equation (8)), and $\mathbf{1}$ denotes an all-one vector. This can be viewed as a special case of highway networks (Srivastava et al., 2015) where transform gate $q_\phi(1)$ compresses the information captured by a token based on its likelihood of being a query term.
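The gate structure of Equation (11) can be sanity-checked with finite differences in a deliberately simplified scalar setting. The sketch assumes a single scalar hidden state h, linear logits π_0 = w_0·h and π_1 = w_1·h in place of the two-layer MLP, temperature 1, and no Gumbel noise; all of these are simplifications introduced here for illustration.

```python
import math


def q1(h, w0, w1):
    """q(z = 1 | x) for a scalar state h with linear logits (assumed toy setup)."""
    delta_pi = (w1 - w0) * h
    return 1.0 / (1.0 + math.exp(-delta_pi))


def view(h, w0, w1):
    """Query-focused view Q = q(1) * h, mirroring Equation (10)."""
    return q1(h, w0, w1) * h


def analytic_grad(h, w0, w1):
    """Carry-gate plus transform-gate form of dQ/dh, mirroring Equation (11)."""
    q = q1(h, w0, w1)
    d_delta_pi = w1 - w0  # derivative of the logit difference w.r.t. h
    return (1.0 - q) * d_delta_pi * view(h, w0, w1) + q


def numeric_grad(h, w0, w1, eps=1e-6):
    """Central finite-difference estimate of dQ/dh."""
    return (view(h + eps, w0, w1) - view(h - eps, w0, w1)) / (2.0 * eps)


gap = abs(analytic_grad(0.7, -0.3, 1.2) - numeric_grad(0.7, -0.3, 1.2))
```

In this toy setting `gap` is at the level of floating-point error, which agrees with the decomposition into a carry term and a transform term.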
##### Token Tagging as Weak Supervision

Although it is possible to optimize latent queries solely based on conditional language modeling (our approach is fully differentiable), we additionally exploit weak supervision to label tokens in the document as query-specific or not. Weak supervision is advantageous as it imposes extra regularization on the posterior (see Equation (6)), thereby mitigating its collapse (i.e., the decoder may learn to ignore the query-focused view and instead solely rely on the query-agnostic view).

Let t1,…,tn denote binary tags for each of the source tokens, that is, 1 if a token is query-specific and 0 otherwise. We could learn such a tagger from training data generated by aligning query tokens to the document. In default of such gold-standard data, we approximate queries by summaries and obtain silver standard token labels by aligning summaries to their corresponding documents. Specifically, inspired by Gehrmann et al. (2018), we assume a token in the document is query-specific if it is part of the longest common sub-sequence (LCS) of tokens in the summary. Our tagging model is built on top of a pretrained language model, and thus operates on subwords. We first byte-pair encode (BPE; Sennrich et al., 2016) documents and summaries, and then search for the LCS over BPE sequences. If there exist multiple identical LCSs, only the one appearing at the earliest document position is tagged as positive. We refer to this tagging scheme as Bpe-Lcs.

Note that although we model query variables at the token level, we take phrases indirectly into account through LCS, which identifies subsequences of tokens (or phrases) as query annotations. Our tagging model is therefore able to capture dependencies between tokens, albeit indirectly.
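The alignment step can be sketched with a standard dynamic program for the longest common subsequence. The function below is a hypothetical illustration rather than the authors' tagger: it takes already tokenized sequences (whitespace tokens stand in for BPE subwords in the example), and its tie-breaking, which prefers earlier document positions, is one possible reading of the rule described above.

```python
def lcs_tags(doc_tokens, summary_tokens):
    """Tag document tokens that belong to an LCS with the summary (1) or not (0)."""
    n, m = len(doc_tokens), len(summary_tokens)
    # suffix[i][j] = LCS length of doc_tokens[i:] and summary_tokens[j:]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if doc_tokens[i] == summary_tokens[j]:
                suffix[i][j] = 1 + suffix[i + 1][j + 1]
            else:
                suffix[i][j] = max(suffix[i + 1][j], suffix[i][j + 1])

    tags = [0] * n
    i = j = 0
    while i < n and j < m:
        if doc_tokens[i] == summary_tokens[j]:
            tags[i] = 1  # matching here never shortens the LCS
            i += 1
            j += 1
        elif suffix[i][j + 1] == suffix[i][j]:
            j += 1  # skip a summary token; keeps earlier document tokens usable
        else:
            i += 1  # this document token cannot be part of an LCS from here
    return tags


doc = "the cat sat on the mat".split()
summary = "the mat".split()
tags = lcs_tags(doc, summary)  # the first "the" is tagged, not the second
```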

##### Training
To optimize the variational inference model, that is, the MLP defined in Equations (7) to (9), we use a cross entropy loss for token tagging, with the posterior entropy term from Equation (6). Formally, we write the query modeling loss as follows:
$\mathcal{L}_{\mathrm{query}}=-\omega\mathcal{L}_{\mathrm{tag}}+\beta\mathcal{L}_{\mathrm{entropy}}=-\sum_{j=1}^{N}\sum_{i=1}^{M}\left[\left(\omega\hat{z}_i^{j}-\beta q_\phi(1)\right)\log q_\phi(1)+\left(\omega\left(1-\hat{z}_i^{j}\right)-\beta q_\phi(0)\right)\log q_\phi(0)\right]$
(12)
where $\hat{z}_i$ is a binary label automatically assigned via Bpe-Lcs$(D,S)$, the alignment procedure described above. As we can see, the entropy term dynamically smooths the weak annotations $\hat{z}_i$ (the degree of smoothing is modulated by $q_\phi$). We optimize ω and β on a development set.
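For a single document, the query modeling loss amounts to ω times the cross entropy against the weak tags minus β times the posterior entropy. The sketch below computes it token by token; the function name and the small `eps` guard are additions made here, and the defaults ω = 10 and β = 0.1 are the values reported in the implementation details.

```python
import math


def query_loss(q_on, weak_tags, omega=10.0, beta=0.1, eps=1e-12):
    """Query modeling loss for one document.

    q_on[i] is q(z_i = 1 | x); weak_tags[i] is the binary weak label for token i.
    """
    loss = 0.0
    for q1, z_hat in zip(q_on, weak_tags):
        q0 = 1.0 - q1
        # Weak labels are smoothed by subtracting beta times the posterior itself.
        loss -= (omega * z_hat - beta * q1) * math.log(q1 + eps)
        loss -= (omega * (1 - z_hat) - beta * q0) * math.log(q0 + eps)
    return loss
```

With `beta=0.0` the function reduces to ω times the usual binary cross entropy, which makes the smoothing role of the entropy term easy to isolate.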

In the initial stages of training, the tagger might lead to inaccurate posterior probability assignments $q_\phi(z_i|x)$, and, consequently, hurt the summarization model, which relies heavily on a high-quality query-focused view. To address this issue, we introduce a posterior dropout mechanism that replaces the estimated posterior with weak supervision $o(\hat{z}|x)$ with probability α. We initialize α to 1, so that only $o(\hat{z}|x)$ is used in the beginning of training, and the tagger is supervised via Equation (12). We then linearly anneal α over optimization steps so that the gradients from the summarization objective (which we introduce in Section 5) can jointly optimize the tagger.
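A minimal sketch of posterior dropout with a linear schedule follows. The text does not state whether replacement happens per token or per document, so the sketch assumes a per-document decision; the function names are hypothetical, and the end value of 0.5 is the one reported in the implementation details.

```python
import random


def dropout_prob(step, total_steps, start=1.0, end=0.5):
    """Linearly anneal alpha from `start` to `end` over `total_steps`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def apply_posterior_dropout(q_on, weak_tags, alpha, rng):
    """With probability alpha, use the weak tags instead of the estimated posterior.

    Assumption: the choice is made once for the whole document.
    """
    if rng.random() < alpha:
        return [float(t) for t in weak_tags]
    return list(q_on)


rng = random.Random(0)
alpha = dropout_prob(step=0, total_steps=20000)  # 1.0 at the start of training
first_view = apply_posterior_dropout([0.3, 0.8], [0, 1], alpha, rng)
```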

##### Zero-shot Transfer
We now explain how queries are taken into account at test time by performing query belief updates $\Delta(z_i|x,\tilde{z})$. In the case of generic summarization where no queries are available, we simply perform no update. When $Q\neq\varnothing$, some tokens in the document become more relevant and we consequently set $\Delta(z_i=1|x,\tilde{z})=1$, $\forall w_i\in$ Bpe-Lcs($D$, $Q$), and set the update to zero for all other tokens. We further incorporate query information via a simple calibration as:
$q_\phi(z_i=1|x,\tilde{z})=\min\left\{1,\;q_\phi(z_i=1|x)+\Delta(z_i=1|x,\tilde{z})\right\}$
(13)
Note that our calibration is non-parametric, since it is not realistic to assume access to a development set for each query type (e.g., in order to perform hyper-parameter tuning). This enables zero-shot transfer to QFS tasks with varying characteristics.
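The calibration in Equation (13) is a clipped addition and can be sketched in a few lines. The list `query_hits` is a hypothetical name for the binary update Δ obtained by aligning the query to the document; for generic summarization it is all zeros.

```python
def calibrate(q_on, query_hits):
    """Add the binary query update to each posterior and clip the result at 1."""
    return [min(1.0, q + float(hit)) for q, hit in zip(q_on, query_hits)]


posterior = [0.2, 0.7, 0.05]
with_query = calibrate(posterior, [0, 1, 0])     # second token matches the query
without_query = calibrate(posterior, [0, 0, 0])  # generic summarization: no update
```

Because the update introduces no learned parameters, nothing needs to be tuned per query type.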

In this section we describe our conditional language model, which estimates the log-likelihood expectation of a summary sequence over the variational posterior (see Equation (6)). As mentioned earlier, we adopt an encoder-decoder architecture tailored to document summarization with latent queries.

##### Encoder
We encode two views of the input document, a generic query-agnostic view D, and a query-focused one Q (see Equation (10)). As shown in Figure 1(c), our encoder module consists of three encoders: a shared encoder, a document encoder, and a query encoder. Because both views are created from the same document, we use a shared encoder for general document understanding that also reduces model parameters. The shared document representation serves as input to more specialized encoders. Each encoder contains one or multiple Transformer layers (Vaswani et al., 2017), each composed of a multi-head attention (MHA) layer and a feed-forward (FFN) layer:
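$\begin{aligned}H^{(\mathrm{enc})}&=\mathrm{LN}\left(H^{(\mathrm{enc})}+\mathrm{MHA}\left(H^{(\mathrm{enc})},H^{(\mathrm{enc})},H^{(\mathrm{enc})}\right)\right)\\H^{(\mathrm{enc})}&=\mathrm{LN}\left(H^{(\mathrm{enc})}+\mathrm{FFN}\left(H^{(\mathrm{enc})}\right)\right)\end{aligned}$
(14)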
where LN denotes layer normalization. As shown in Figure 1(c), the query-focused view Q directly conditions on sampled latent queries, while D is based on the original document and its content.
##### Decoder
We adopt a decoder structure similar to Dou et al. (2021) to handle multiple inputs. Our decoder sequentially attends to the two encoded views of the same document:
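$\begin{aligned}H^{(\mathrm{dec})}&=\mathrm{LN}\left(H^{(\mathrm{dec})}+\mathrm{MHA}\left(H^{(\mathrm{dec})},H^{(\mathrm{dec})},H^{(\mathrm{dec})}\right)\right)\\H^{(\mathrm{dec})}&=\mathrm{LN}\left(H^{(\mathrm{dec})}+\mathrm{MHA}\left(H^{(\mathrm{dec})},Q,Q\right)\right)\\H^{(\mathrm{dec})}&=\mathrm{LN}\left(H^{(\mathrm{dec})}+\mathrm{MHA}\left(H^{(\mathrm{dec})},D,D\right)\right)\\H^{(\mathrm{dec})}&=\mathrm{LN}\left(H^{(\mathrm{dec})}+\mathrm{FFN}\left(H^{(\mathrm{dec})}\right)\right)\end{aligned}$
(15)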
After taking the context of the previous generation H(dec) into account, the decoder will first attend to signals coming from query Q, then to original document D (based on guidance provided by the query). The final summary generation objective is calculated autoregressively as:
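$\mathcal{L}_{\mathrm{lm}}=\sum_{j=1}^{N}\sum_{t=1}^{T}\log p_\theta\left(y_t^{j}\mid y_{<t}^{j},D,Q\right)$
(16)
which is jointly trained with the query model (see Equation (12)) as: $\mathcal{L}=\mathcal{L}_{\mathrm{lm}}+\mathcal{L}_{\mathrm{query}}$.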
##### Datasets

For model training and development, we used the CNN/Daily Mail dataset (Hermann et al., 2015), a generic single-document summarization benchmark containing news articles and associated highlights (287,227/13,368 instances). We evaluated our model on the CNN/Daily Mail test set, following a generic summarization, supervised setting. We also performed several zero-shot experiments, on five benchmarks representing various query formats, domains, and summarization scenarios (e.g., single- vs. multiple-documents). Specifically, we report results on WikiCatSum (Perez-Beltrachini and Lapata, 2021) as an example of multi-document generic summarization, and WikiRef (Zhu et al., 2019), Debatepedia (Nema et al., 2017), DUC 2006-07, and TD-QFS (Baumel et al., 2016) as examples of QFS. Table 1 summarizes the characteristics of these datasets and presents test set statistics. Note that in contrast to Xu and Lapata (2021), we do not make use of development data for our QFS tasks.

##### Implementation Details

The shared encoder consists of 11 Transformer layers. The document and query encoders have a separate Transformer layer each. All encoders and the decoder are initialized with a pretrained Bart model (Lewis et al., 2020), except for the query encoder, which is initialized randomly. We used four GeForce RTX 2080 GPUs for training; we set the batch size to 8 (i.e., one sample for each GPU), and accumulated gradients every 32 steps. We fine-tuned Bart on CNN/Daily Mail with a learning rate of $3\times10^{-5}$ for 20,000 optimization steps, with 500 warmup steps. We used half float precision for efficient training and set the maximum length of an input document to 640 tokens, with the excess clipped. We set β = 0.1 and ω = 10 in the learning objective, and τ = 0.9 for latent query modeling. We annealed the dropout rate α from 1.0 to 0.5 over the whole training session.

Before analyzing our model under various zero-shot settings, we first confirm it can indeed produce good quality generic summaries in a supervised setting. There is no point in contemplating zero-shot scenarios if our approach underperforms when full supervision is available. Following standard practice, we use F1 ROUGE as our automatic evaluation metric (Lin and Hovy, 2003). Unigram and bigram ROUGE (R-1 and R-2) are a proxy for assessing informativeness and the longest common subsequence (R-L) represents fluency. For multi-document QFS, we follow DUC (Dang, 2005) and report R-SU4 (based on skip bigram with maximum skip distance of 4) instead of R-L.
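For readers unfamiliar with the metric, the following is a simplified sketch of n-gram ROUGE F1 based on clipped n-gram overlap. It is not the official ROUGE toolkit used to produce the reported scores, it omits preprocessing such as stemming, and the function names are chosen here for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between a candidate and a reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2.0 * precision * recall / (precision + recall)


candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()
r1 = rouge_n_f1(candidate, reference, n=1)
r2 = rouge_n_f1(candidate, reference, n=2)
```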

### 7.1 Supervised Setting

Table 2 summarizes our results on the CNN/Daily Mail test set. As an upper bound (first block) we report the performance of an extractive Oracle that performs greedy search to find a set of sentences in the source document that maximize ROUGE scores against the reference (Liu and Lapata, 2019b). The Lead baseline considers the first 3 sentences in a document as the summary. LexRank (Erkan and Radev, 2004) estimates sentence-level centrality via a Markov Random Walk on graphs. The second block includes two additional extractive systems. BertExt (Liu and Lapata, 2019b) is the first rendition of a summarization system with a pretrained encoder (Devlin et al., 2019). MatchSum (Zhong et al., 2020) extracts an optimal set of sentences via semantically matching documents to candidate summaries.

Table 2:

Generic summarization, supervised setting, CNN/Daily Mail test set.

| System | R-1 | R-2 | R-L |
|---|---|---|---|
| Upper Bound & Baselines | | | |
| Oracle | 55.8 | 33.2 | 51.8 |
| LexRank | 33.2 | 11.8 | 29.6 |
| Supervised (Extractive) | | | |
| BertExt (Liu and Lapata, 2019b) | 43.9 | 20.3 | 39.9 |
| MatchSum (Zhong et al., 2020) | 43.9 | 20.6 | 39.8 |
| Supervised (Abstractive) | | | |
| PTGen (See et al., 2017) | 39.5 | 17.3 | 36.4 |
| BottomUp (Gehrmann et al., 2018) | 41.2 | 18.7 | 38.4 |
| BertAbs (Liu and Lapata, 2019b) | 41.7 | 19.4 | 38.8 |
| Bart (Lewis et al., 2020) | 44.2 | 21.3 | 40.9 |
| GSum (Dou et al., 2021) | 45.9 | 22.3 | 42.5 |
| GSum (our implementation) | 45.0 | 21.9 | 41.8 |
| LQSum | 45.1 | 22.0 | 41.9 |

The third block includes various abstractive systems (see Section 2 for an overview). PTGen (See et al., 2017) and BottomUp (Gehrmann et al., 2018) do not use pretrained LMs, while BertAbs (Liu and Lapata, 2019b) is built on top of a pretrained Bert encoder. Bart (Lewis et al., 2020) is fine-tuned on CNN/DM, while GSum (Dou et al., 2021) is initialized with Bart parameters.

Our Latent Query Summarization model (LQSum) outperforms Bart by a large margin, which demonstrates the effectiveness of latent queries even for generic summarization. It also performs on par with GSum, under identical training resources and configurations. GSum is a state-of-the-art abstractive model, which relies on MatchSum (Zhong et al., 2020), a high-performance extractive model, to provide guidance to the decoder. Compared to GSum, LQSum can be trained end-to-end and requires significantly fewer parameters (406M for LQSum versus 625M for GSum; see Table 3 for details).

Table 3:

System comparison. Enc, Dec, and Tag denote number of layers for encoding, decoding, and tagging, respectively. GSum (Dou et al., 2021) and LQSum add a (randomly initialized) encoding layer on top of Bart (Lewis et al., 2020) for guidance/query representation. LQSum replaces guidance extraction in GSum (i.e., two Bert models) with latent query modeling (i.e., a lightweight tagging layer), which is more parameter efficient.

| Model | Size | Components |
|---|---|---|
| Bart | 400M | Enc=12, Dec=12 |
| GSum | 625M | Enc=13, Dec=12, Bert=2 (220M; guidance) |
| LQSum | 406M | Enc=13, Dec=12, Tag=1 (1M; latent query) |

### 7.2 Zero-Shot Setting

##### Multi-Document Summarization

We evaluated our model’s ability to summarize multiple documents on WikiCatSum (Perez-Beltrachini et al., 2019b), a collection of articles on a specific topic (e.g., Tokyo Olympics) and their corresponding Wikipedia summary. In order to handle multi-document input with a model trained on single-document data, we follow previous work (Perez-Beltrachini et al., 2019b) and first select a subset of salient passages which are then concatenated into a sequence and given to our model to summarize.

In the first block of Table 4 we present upper bound and baseline results. The second block contains results for two supervised systems, a sequence-to-sequence model based on Transformer (Liu et al., 2018), and a state-of-the-art system enhanced with a convolutional encoder, a structured decoder, and a topic prediction module (CV-S2D+T; Perez-Beltrachini et al. 2019b). The third block contains zero-shot models, including Bart, GSum, and LQSum. GSum requires another extractive system’s output as guidance during inference, for which we default to LexRank. As can be seen, LQSum performs best among zero-shot models but lags behind fully supervised ones, which is not surprising (zero-shot models operate over pre-ranked, incoherent passages).

Table 4:

Multi-document summarization, zero-shot setting, WikiCatSum test set. Results averaged over three domains: Company, Film, Animal.

| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 47.2 | 23.3 | 42.9 |
| LexRank | 23.3 | 6.5 | 20.3 |

| Supervised (Abstractive) | R-1 | R-2 | R-L |
|---|---|---|---|
| Transformer (Liu et al., 2018) | 35.5 | 19.0 | 30.5 |
| CV-S2D+T (Perez-Beltrachini et al., 2019b) | 36.1 | 19.9 | 30.5 |

| Zero-shot Abstractive | R-1 | R-2 | R-L |
|---|---|---|---|
| Bart (Lewis et al., 2020) | 27.8 | 9.8 | 25.1 |
| GSum+LexRank | 27.4 | 8.2 | 25.0 |
| LQSum | 28.7 | 9.9 | 26.1 |

##### Single-Document QFS

Tables 5 and 6 show results for single-document QFS on two datasets, namely, WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), which differ in terms of document/summary size and query type (see Table 1). The first block in both tables shows results for the Oracle upper bound, Lead, and $LexRankQ$, a query-focused version of LexRank described in Xu and Lapata (2020). The second block presents various supervised systems on WikiRef and Debatepedia, both extractive and abstractive. Note that abstractive QFS systems have not been previously evaluated on WikiRef, while Debatepedia contains short documents and accordingly short summaries and has mainly served as a testbed for abstractive summarization. The third block reports system performance in the zero-shot setting. We compare LQSum against Bart and GSum; the latter, however, requires guidance from automatically extracted sentences. Note that MatchSum (Zhong et al., 2020), the original extractive system used by GSum for guidance, is not directly applicable to QFS, as it is trained for generic summarization, which does not take queries as input. We made a best-effort attempt to adapt GSum to our QFS setting by using query-focused $LexRankQ$ to extract the top K sentences for each test document as guidance.

Table 5:

Single-document QFS, zero-shot setting, WikiRef test set (queries are keywords).

| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 54.5 | 37.5 | 48.5 |
| $LexRankQ$ | 29.9 | 12.3 | 26.1 |

| Supervised (Extractive) | R-1 | R-2 | R-L |
|---|---|---|---|
| Transformer (Zhu et al., 2019) | 28.1 | 12.8 | 23.8 |
| BertExt (Zhu et al., 2019) | 35.1 | 18.2 | 30.0 |

| Zero-shot Abstractive | R-1 | R-2 | R-L |
|---|---|---|---|
| Bart (Lewis et al., 2020) | 30.0 | 12.2 | 26.0 |
| GSum+$LexRankQ$ | 30.2 | 12.5 | 26.3 |
| LQSum | 31.1 | 12.6 | 27.1 |

Table 6:

Single-document QFS, zero-shot setting, Debatepedia test set (queries are natural questions). BertAbs (Laskar et al., 2020a) is optimized on XSum (Narayan et al., 2018).

| Upper Bound & Baselines | R-1 | R-2 | R-L |
|---|---|---|---|
| Oracle | 28.9 | 11.0 | 24.9 |
| $LexRankQ$ | 17.4 | 5.3 | 15.1 |

| Supervised (Abstractive) | R-1 | R-2 | R-L |
|---|---|---|---|
| Dda (Laskar et al., 2020a) | 7.4 | 2.8 | 7.2 |
| BertAbs+Rank (Abdullah and Chali, 2020) | 19.2 | 10.6 | 17.9 |
| BertAbs+Concat (Laskar et al., 2020a) | 26.4 | 11.9 | 25.1 |

| Zero-shot Abstractive | R-1 | R-2 | R-L |
|---|---|---|---|
| BertAbs (Liu and Lapata, 2019b) | 13.3 | 2.8 | 2.8 |
| Bart (Lewis et al., 2020) | 21.4 | 6.3 | 18.4 |
| GSum+$LexRankQ$ | 21.2 | 6.2 | 18.2 |
| LQSum | 23.5 | 7.2 | 20.6 |

Across both datasets, LQSum achieves the highest ROUGE scores in the zero-shot setting, in some cases surpassing the performance of supervised models. Compared to our results on generic summarization, LQSum also shows a clearer advantage over systems without latent query modeling.

##### Multi-Document QFS

We performed experiments on the DUC 2005-2007 benchmarks and TD-QFS (Baumel et al., 2016). The former contains long query narratives while TD-QFS focuses on short keyword queries (see Table 1).

We applied our summarization model trained on single documents to document clusters following a simple iterative approach (Baumel et al., 2018): We first rank documents in a cluster via their query term frequency, and then generate a summary for each document. The summary for the entire cluster is the concatenation of the individual document summaries subject to a budget (i.e., 250 tokens).6 Repeated sentences were skipped to reduce redundancy in the final summary.
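The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: `summarize` is a hypothetical callable standing in for the single-document summarizer, the word tokenizer and sentence splitter are deliberately naive, and stopping at the first sentence that would exceed the budget is an assumption about how the 250-token limit is enforced.

```python
import re
from typing import Callable, List


def query_term_frequency(document: str, query: str) -> int:
    """Count occurrences of query terms in a document (lowercased word tokens)."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    tokens = re.findall(r"\w+", document.lower())
    return sum(1 for token in tokens if token in query_terms)


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation (assumption, for illustration)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def summarize_cluster(
    documents: List[str],
    query: str,
    summarize: Callable[[str, str], str],  # hypothetical single-document summarizer
    budget: int = 250,
) -> str:
    """Rank documents by query term frequency, summarize each in turn, and
    concatenate the summaries, skipping repeated sentences, until the
    whitespace-token budget would be exceeded."""
    ranked = sorted(
        documents, key=lambda d: query_term_frequency(d, query), reverse=True
    )
    seen, selected, used = set(), [], 0
    for document in ranked:
        for sentence in split_sentences(summarize(document, query)):
            key = sentence.lower()
            if key in seen:  # repeated sentence: skip to reduce redundancy
                continue
            length = len(sentence.split())
            if used + length > budget:  # assumed truncation rule
                return " ".join(selected)
            seen.add(key)
            selected.append(sentence)
            used += length
    return " ".join(selected)
```

Because each document is summarized independently, the only coupling between documents is the ranking and the duplicate filter, which is what allows a model trained on single documents to be reused without retraining.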

Our results are given in Table 7. The first block reports performance for the Oracle upper bound and Gold, which was estimated by comparing a (randomly selected) reference summary against the remaining two or three reference summaries.7 We also include $LexRankQ$, and Lead (Xu and Lapata, 2021), which returns all lead sentences (up to 250 words) of the most recent document.
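The Gold estimate can be illustrated with the short sketch below. It is only a schematic: a simple unigram-overlap F1 stands in for ROUGE, and averaging the scores over the remaining references is an assumption, since the text does not specify how multiple references are aggregated. The function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean
from typing import Callable, List, Optional


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (a simple stand-in for ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def gold_upper_bound(
    references: List[str],
    score: Callable[[str, str], float] = unigram_f1,
    rng: Optional[random.Random] = None,
) -> float:
    """Score one randomly selected reference against the remaining ones."""
    if len(references) < 2:
        raise ValueError("at least two reference summaries are required")
    rng = rng or random.Random(0)
    held_out = rng.randrange(len(references))
    others = references[:held_out] + references[held_out + 1:]
    # Assumption: scores against the remaining references are averaged.
    return mean(score(references[held_out], other) for other in others)
```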

Table 7:

Multi-document QFS, zero-shot setting, DUC (queries are narratives) and TD-QFS (queries are keywords) test sets. * / † denotes extractive/few-shot systems.

| Upper Bound & Baselines | DUC 2006 R-1 | DUC 2006 R-2 | DUC 2006 R-SU4 | DUC 2007 R-1 | DUC 2007 R-2 | DUC 2007 R-SU4 | TD-QFS R-1 | TD-QFS R-2 | TD-QFS R-SU4 |
|---|---|---|---|---|---|---|---|---|---|
| Gold | 45.4 | 11.2 | 16.8 | 47.5 | 14.0 | 18.9 | 52.2 | 27.0 | 30.2 |
| Oracle | 47.5 | 15.8 | 20.2 | 47.6 | 17.1 | 20.9 | 64.9 | 48.3 | 49.4 |
| Lead | 32.1 | 5.3 | 10.4 | 33.4 | 6.5 | 11.3 | 33.5 | 5.2 | 10.4 |
| $LexRankQ$ | 34.2 | 6.4 | 11.4 | 35.8 | 7.7 | 12.7 | 35.3 | 7.6 | 12.2 |

| Distantly Supervised | DUC 2006 R-1 | DUC 2006 R-2 | DUC 2006 R-SU4 | DUC 2007 R-1 | DUC 2007 R-2 | DUC 2007 R-SU4 | TD-QFS R-1 | TD-QFS R-2 | TD-QFS R-SU4 |
|---|---|---|---|---|---|---|---|---|---|
| QuerySum* (Xu and Lapata, 2020) | 41.6 | 9.5 | 15.3 | 43.3 | 11.6 | 16.8 | 44.3 | 16.1 | 20.7 |
| Bart-Caq (Su et al., 2020) | 38.3 | 7.7 | 12.9 | 40.5 | 9.2 | 14.4 | — | — | — |
| PQSum (Laskar et al., 2020b) | 40.9 | 9.4 | 14.8 | 42.2 | 10.8 | 16.0 | — | — | — |

| Few- or Zero-shot Abstractive | DUC 2006 R-1 | DUC 2006 R-2 | DUC 2006 R-SU4 | DUC 2007 R-1 | DUC 2007 R-2 | DUC 2007 R-SU4 | TD-QFS R-1 | TD-QFS R-2 | TD-QFS R-SU4 |
|---|---|---|---|---|---|---|---|---|---|
| MargeSum (Xu and Lapata, 2021) | 40.2 | 9.7 | 15.1 | 42.5 | 12.0 | 16.9 | 45.5 | 16.6 | 20.9 |
| Bart (Lewis et al., 2020) | 38.3 | 7.8 | 13.1 | 40.2 | 9.9 | 14.6 | 45.1 | 16.9 | 21.4 |
| GSum+$LexRankQ$ | 38.1 | 7.9 | 13.1 | 39.5 | 9.5 | 14.3 | 45.5 | 18.0 | 22.4 |
| LQSum | 39.1 | 8.5 | 13.7 | 40.4 | 10.2 | 15.0 | 45.7 | 18.1 | 22.1 |

The second block contains distantly supervised approaches. QuerySum (Xu and Lapata, 2020) is an extractive system that takes advantage of existing QA datasets and adopts a coarse-to-fine salience estimation procedure. Bart-Caq (Su et al., 2020) uses an ensembled QA model for answer evidence extraction, and a fine-tuned Bart model (Lewis et al., 2020) to iteratively generate summaries from paragraphs. PQSum (Laskar et al., 2020b) uses fine-tuned BertSum to generate summaries for each document in a cluster, and a QA model for summary sentence re-ranking.

The third block compares our model against MargeSum (Xu and Lapata, 2021), a state-of-the-art few-shot approach, which uses data for proxy query generation and model development, and various zero-shot systems including Bart and GSum+$LexRankQ$.

Across datasets, LQSum outperforms comparison zero-shot approaches. It also has a clear advantage over MargeSum on TD-QFS but is slightly worse on DUC. We also see that LQSum is superior to Bart-Caq, which relies on distant supervision from QA data.

### 7.3 Ablation Studies

We further performed a series of ablation studies, reported in Table 8, to assess the contribution of individual model components. Perhaps unsurprisingly, we observe that not updating the query belief at test time hurts performance ($-\Delta(\hat{z} \mid x, z)$). Recall that we adopt a simple method that calibrates the variational posterior distribution. When it comes to learning meaningful latent queries that benefit summarization tasks, relying solely on tagging (−Joint training) or generation (−Weak supervision) substantially decreases performance.8 Latent query learning balances a trade-off between direct but weak supervision from the tagging objective (based on silver standard token labels) and natural but indirect supervision from the generation objective (based on human-written summaries). As silver tagging labels provide less accurate supervision than human-written summaries, we observe that −Joint training hurts performance more than −Weak supervision.

Table 8:

LQSum ablation results; ↑/↓: absolute increase/decrease.

| Model | CNN/DM R-1 | CNN/DM R-2 | CNN/DM R-L | WikiRef R-1 | WikiRef R-2 | WikiRef R-L | Debatepedia R-1 | Debatepedia R-2 | Debatepedia R-L | DUC 2006 R-1 | DUC 2006 R-2 | DUC 2006 R-SU4 | DUC 2007 R-1 | DUC 2007 R-2 | DUC 2007 R-SU4 | TD-QFS R-1 | TD-QFS R-2 | TD-QFS R-SU4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LQSum | 45.1 | 22.0 | 41.9 | 31.1 | 12.6 | 27.1 | 23.5 | 7.2 | 20.6 | 39.1 | 8.5 | 13.7 | 40.4 | 10.2 | 15.0 | 45.7 | 18.1 | 22.1 |
| $-\Delta(\hat{z} \mid x, z)$ | — | — | — | ↓0.1 | ↓0.2 | ↓0.2 | ↓0.5 | ↓0.3 | ↓0.6 | ↓0.6 | ↓0.2 | ↓0.6 | ↑0.1 | ↓0.1 | ↓1.3 | ↑0.1 | ↓0.6 | ↓0.4 |
| −Joint training | ↓0.4 | ↓0.3 | ↓0.4 | ↓2.9 | ↓0.9 | ↓2.8 | ↓2.8 | ↓1.1 | ↓2.8 | ↓2.9 | ↓1.7 | ↓1.6 | ↓2.4 | ↓2.0 | ↓1.7 | ↓0.7 | ↓0.6 | ↓0.4 |
| −Weak supervision | ↓0.6 | ↓0.7 | ↓0.7 | ↓0.7 | ↓0.2 | ↓0.5 | ↓1.0 | ↓0.5 | ↓1.3 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.3 | ↓0.3 | ↓0.1 | ↓0.3 | ↓0.0 |
| −Dual view | ↓2.7 | ↓3.5 | ↓2.5 | ↓12.2 | ↓9.3 | ↓10.5 | ↓7.9 | ↓3.3 | ↓6.6 | ↓6.3 | ↓1.8 | ↓1.8 | ↓6.5 | ↓3.0 | ↓2.5 | ↓2.5 | ↓3.3 | ↓2.8 |
| −Posterior dropout | ↓0.7 | ↓0.6 | ↓0.8 | ↓0.8 | ↓0.3 | ↓0.7 | ↓1.1 | ↓0.3 | ↓1.2 | ↓0.2 | ↓0.2 | ↓0.2 | ↓0.4 | ↓0.4 | ↓0.5 | ↑0.2 | ↓0.0 | ↑0.1 |

Removing the query-agnostic view (−Dual view) causes a significant performance drop as the decoder can no longer leverage the original document context, which is especially useful when the query model is not accurate. Relying solely on the estimated posterior to create the query-focused view for training (−Posterior dropout) also hurts performance, as it leads to more severe error propagation for the downstream generation model.

Following previous work (Xu and Lapata, 2021, 2020), we also evaluated query-focused summaries in a judgment elicitation study via Amazon Mechanical Turk. Native English speakers (self-reported) were asked to rate query-summary pairs on two dimensions: Succinctness (does the summary avoid unnecessary detail and redundant information?) and Coherence (does the summary make logical sense?). The ratings were obtained using a five-point Likert scale.

In addition, participants were asked to assess the Relevance of the summary to the query. Crowdworkers read a summary and for each sentence decided whether it is relevant (i.e., provides an answer to the query), irrelevant (i.e., does not answer the query), or partially relevant (i.e., it is unclear whether it directly answers the query). Relevant sentences were awarded a score of 5, partially relevant ones a score of 2.5, and irrelevant ones a score of 0. Sentence scores were averaged to obtain a relevance score for the whole summary. We view Relevance as more critical for QFS than Coherence or Succinctness. This is why we obtained per-sentence ratings which we then aggregated to an overall summary score. To make this task manageable, raters were asked to provide more coarse-grained ratings.
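The per-summary Relevance score amounts to a simple average over sentence labels, as the sketch below shows. The label strings and the function name are illustrative choices rather than part of the annotation interface; only the 5 / 2.5 / 0 mapping and the averaging step come from the description above.

```python
from statistics import mean
from typing import List

# Score awarded to each sentence-level judgment (values from the text above;
# the label strings themselves are assumed for illustration).
LABEL_SCORES = {
    "relevant": 5.0,
    "partially relevant": 2.5,
    "irrelevant": 0.0,
}


def summary_relevance(sentence_labels: List[str]) -> float:
    """Average the per-sentence scores to obtain a summary-level relevance score."""
    if not sentence_labels:
        raise ValueError("a summary must contain at least one rated sentence")
    return mean(LABEL_SCORES[label] for label in sentence_labels)
```

For instance, a four-sentence summary rated relevant, irrelevant, partially relevant, and relevant receives (5 + 0 + 2.5 + 5) / 4 = 3.125.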

Participants assessed summaries created by LQSum (our model), GSum+$LexRankQ$ (a competitive abstractive system), $LexRankQ$ (an extractive baseline), and Gold (the ground-truth upper bound). We also compared against BertExt on WikiRef, BertAbs on Debatepedia, and MargeSum on DUC and TD-QFS.9 We sampled 40 query-document pairs from WikiRef and Debatepedia, 40 query-cluster pairs from DUC (2006, 2007; 20 from each set), and 40 pairs from TD-QFS and collected three responses per pair.10

We show our results in Table 9 and examples of system output in Table 10. On WikiRef, LQSum outperforms GSum+$LexRankQ$ significantly in terms of relevance. On Debatepedia it surpasses BertAbs, a supervised model, across all three metrics. On DUC, it outperforms comparison systems in terms of succinctness and coherence. LQSum avoids repetition by yielding dynamic (latent) query representations for each document in a cluster. On TD-QFS, all comparison systems perform similarly, except $LexRankQ$, which is significantly worse in terms of relevance and succinctness. As far as Relevance is concerned, we observe that LQSum outperforms comparison systems on Debatepedia and TD-QFS, while being very similar to MargeSum on DUC. On WikiRef, BertExt is slightly more relevant but less coherent.

Table 9:

Human evaluation on QFS benchmarks: average Relevance, Succinctness, Coherence ratings; †/°: significantly different from LQSum/Gold (at p < 0.05, using a pairwise t-test); best system shown in bold.

| WikiRef | Rel | Suc | Coh | Debatepedia | Rel | Suc | Coh |
|---|---|---|---|---|---|---|---|
| BertExt | 3.57 | 3.63 | 3.72 | BertAbs | 2.42 | 2.93†° | 2.59 |
| GSum+$LexRankQ$ | 2.92†° | 3.48° | 3.72 | GSum+$LexRankQ$ | 2.88 | 3.60 | 3.49 |
| $LexRankQ$ | 3.23 | 3.40 | 3.68 | $LexRankQ$ | 3.33 | 3.47° | 3.52 |
| LQSum | 3.41 | 3.58 | 3.78 | LQSum | 3.39 | 3.74 | 3.78 |
| Gold | 3.62 | 3.73 | 3.59 | Gold | 3.29 | 3.76 | 3.57 |

| DUC | Rel | Suc | Coh | TD-QFS | Rel | Suc | Coh |
|---|---|---|---|---|---|---|---|
| MargeSum | 4.00 | 3.75 | 3.65†° | MargeSum | 3.28 | 3.57 | 3.62 |
| GSum+$LexRankQ$ | 3.90 | 3.44†° | 3.84 | GSum+$LexRankQ$ | 3.26 | 3.65 | 3.76 |
| $LexRankQ$ | 3.59†° | 3.38†° | 3.54†° | $LexRankQ$ | 2.78†° | 3.36†° | 3.33†° |
| LQSum | 3.97 | 3.88 | 3.95 | LQSum | 3.35 | 3.70 | 3.77 |
| Gold | 4.01 | 3.94 | 4.04 | Gold | 3.50 | 3.88 | 3.68 |

Table 10:

System output on WikiRef (above; document 3918) and Debatepedia (below; document 260). Information irrelevant to the query or incoherent in the summary is highlighted.

| | WikiRef (document 3918) |
|---|---|
| Query | Prashant Bhushan, Legal activism, Government accountability |
| Gold | CPIL won a major victory in 2003 when the Supreme Court restrained the Union government from privatising Hindustan Petroleum and Bharat Petroleum without the approval of Parliament. |
| BertExt | New Delhi, March 3: The Supreme Court verdict against P.J. Thomas’s appointment is not the lone feather in the cap of the petitioner, the Centre for Public Interest Litigation (CPIL), but perhaps the most visible one.That was when it got the apex court to restrain the Centre from divesting majority shares in Hindustan Petroleum and Bharat Petroleum without Parliament’s approval. The CPIL was founded in the late 1980s by Justice V.M. Tarkunde, who also co-founded the People’s Union for Civil Liberties. |
| GSum+$LexRankQ$ | The Centre for Public Interest Litigation (CPIL) is a loose collection of activists and lawyers whose aim is to fight corruption. Among its members are lawyers Shanti Bhushan, Prashant BhUSHan, Kamini Jaiswal, Ram Jethmalani, Anand Divan and Anil Divan. Another PIL asks that the government be directed to recover Indian black money stashed in foreign banks. |
| LQSum | The Centre for Public Interest Litigation (CPIL) is a loose collection of activists and lawyers. The group had its big hurrah in 2003 when it got the apex court to restrain the Centre from divesting majority shares in Hindustan Petroleum and Bharat Petroleum. |

| | Debatepedia (document 260) |
|---|---|
| Query | Effectiveness: Do earmarks allocate spending effectively? |
| Gold | Earmarks are often unrelated to legislation; holds up bill. |
| BertAbs | Earmarks can be fully examined. |
| GSum+$LexRankQ$ | Sometimes a good piece of legislation that receives the support of a majority of congressman will be held up and voted down. |
| LQSum | Congressmen are using earmarks to hold up bills they don’t like, says Rep. Ruben Gallego. |

We propose a deep generative formulation for document summarization that supports generic and query-focused applications. We represent queries as discrete latent variables, whose approximated posterior distribution can be calibrated with query observations at test time without further adaptation. Our approach does not rely on any query-related resource and can be applied in zero-shot settings. Experimental results across summarization datasets show that the proposed model yields state-of-the-art QFS performance in zero-shot settings.

Directions for future work are many and varied. One research challenge is to push this low-resource approach even further and generate abstractive summaries without access to any summaries or queries. We would also like to extend the proposed framework to cross-lingual settings, and satisfy the information needs of users with different language backgrounds through effective query understanding and summary generation.

The authors would like to thank the action editor, Wenjie Li, and the anonymous reviewers for their valuable feedback. We acknowledge the financial support of the European Research Council (Lapata; award number 681760). This research was supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9118. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

1

Our code and models can be found at https://github.com/yumoxu/lqsum.

2

When $p_\theta(z|x) \sim \mathcal{U}(a,b)$, $D_{\mathrm{KL}}(q_\phi(z|x,y) \,\|\, p_\theta(z|x)) = -H_{q_\phi}(z|x) + \log(b-a+1)$ always holds ($z \in [a,b]$).
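The identity can be checked in a few lines from the definition of the KL divergence. The derivation below is a sketch under two assumptions that the footnote does not state explicitly: $\mathcal{U}(a,b)$ is read as the discrete uniform distribution over the $b-a+1$ integer values in $[a,b]$, and $H_{q_\phi}$ denotes the entropy of the approximate posterior.

```latex
\begin{align*}
D_{\mathrm{KL}}\big(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\big)
  &= \sum_{z=a}^{b} q_\phi(z|x,y)\,\log\frac{q_\phi(z|x,y)}{p_\theta(z|x)} \\
  &= \sum_{z=a}^{b} q_\phi(z|x,y)\,\log q_\phi(z|x,y)
     \;-\; \sum_{z=a}^{b} q_\phi(z|x,y)\,\log\frac{1}{b-a+1} \\
  &= -H_{q_\phi} + \log(b-a+1).
\end{align*}
```

The last step uses the fact that the posterior sums to one, so the second term reduces to the constant $\log(b-a+1)$, which does not depend on the model parameters.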

3

We experimentally verified this assumption in several QFS datasets. In WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), 1.57% and 4.27% of query tokens are not attested in the input document, respectively. In DUC (Dang, 2005) and TD-QFS (Baumel et al., 2016) where the input contains multiple documents, all query tokens are attested. Across all datasets, only 1.69% of query tokens are not attested in the input document/cluster.

4

We also experimented with drawing hard samples from z via the straight-through trick (Jang et al., 2016), which is differentiable with biased gradient estimation. However, it did not yield better results than continuous relaxation.

5

We used pyrouge with the following parameter settings: ROUGE-1.5.5.pl -a -c 95 -m -n 2 -2 4 -u -p 0.5 -l 250.

6

An alternative would be to generate a long summary at once. However, this requires a model to be trained on a MDS dataset, or at least a proxy thereof (Xu and Lapata, 2021).

7

We compute this upper bound only for DUC and TD-QFS benchmarks as they include multiple reference summaries.

8

−Joint training replaces the softmax in Equation (9) with $\arg\max$, to stop the gradients from the generation loss in backpropagation. −Weak supervision sets ω = 0.

9

BertExt and BertAbs are supervised systems, while MargeSum is a few-shot system.

10

We are grateful to Md Tahmid Rahman Laskar and Haichao Zhu for providing us with system output.

Abdullah
and
Yllias
Chali
.
2020
.
Towards generating query to perform query focused abstractive summarization using pre-trained model
. In
Proceedings of the 13th International Conference on Natural Language Generation
, pages
80
85
.
Dublin, Ireland
.
Rama
,
Suresh
Venkatasubramaniyan
, and
CE
.
2011
.
Improving query focused summarization using look-ahead strategy
. In
Proceedings of the 33rd European Conference on Advances in Information Retrieval
, pages
641
652
.
Dublin, Ireland
.
Payal
Bajaj
,
Daniel
Campos
,
Nick
Craswell
,
Li
Deng
,
Jianfeng
Gao
,
Xiaodong
Liu
,
Rangan
Majumder
,
Andrew
McNamara
,
Mitra
,
Tri
Nguyen
,
Mir
Rosenberg
,
Xia
Song
,
Alina
Stoica
,
Saurabh
Tiwary
, and
Tong
Wang
.
2016
.
MS MARCO: A human generated machine reading comprehension dataset
.
arXiv preprint arXiv:1611.09268
.
Tal
Baumel
,
Raphael
Cohen
, and
Michael
.
2016
.
Topic concentration in query focused summarization datasets
. In
Proceedings of the 30th AAAI Conference on Artificial Intelligence
, pages
2573
2579
.
Phoenix, Arizona
.
Tal
Baumel
,
Matan
Eyal
, and
Michael
.
2018
.
Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models
.
arXiv preprint arXiv:1801 .07704
.
Ziqiang
Cao
,
Wenjie
Li
,
Sujian
Li
, and
Furu
Wei
.
2018
.
Retrieve, rerank and rewrite: Soft template based neural summarization
. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
152
161
.
Melbourne, Australia
.
Sihao
Chen
,
Fan
Zhang
,
Kazoo
Sone
, and
Dan
Roth
.
2021
.
Improving faithfulness in abstractive summarization with contrast candidate generation and selection
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
5935
5941
.
Online
.
Hoa Trang
Dang
.
2005
.
Overview of duc 2005
. In
Proceedings of the 2005 Document Understanding Conference
, pages
1
12
.
.
Hoa Trang
Dang
.
2006
.
DUC 2005: Evaluation of question-focused summarization systems
. In
, pages
48
55
.
Stroudsburg, PA, USA
.
Jacob
Devlin
,
Ming-Wei
Chang
,
Kenton
Lee
, and
Kristina
Toutanova
.
2019
.
BERT: Pre-training of deep bidirectional transformers for language understanding
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4171
4186
.
Minneapolis, Minnesota
.
Zi-Yi
Dou
,
Pengfei
Liu
,
Hiroaki
Hayashi
,
Zhengbao
Jiang
, and
Graham
Neubig
.
2021
.
GSum: A general framework for guided neural abstractive summarization
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4830
4842
.
Online
.
Günes
Erkan
and
Dragomir R.
.
2004
.
Lexrank: Graph-based lexical centrality as salience in text summarization
.
Journal of Artificial Intelligence Research
,
22
:
457
479
.
Sebastian
Gehrmann
,
Yuntian
Deng
, and
Alexander
Rush
.
2018
.
Bottom-up abstractive summarization
. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pages
4098
4109
.
Brussels, Belgium
.
Karl Moritz
Hermann
,
Tomáš
Kočiský
,
Edward
Grefenstette
,
Lasse
Espeholt
,
Will
Kay
,
Mustafa
Suleyman
, and
Phil
Blunsom
.
2015
.
Teaching machines to read and comprehend
. In
Proceedings of the 28th International Conference on Neural Information Processing Systems
,
pages 1693–pages 1701
.
Cambridge, MA, USA
.
Irina
Higgins
,
Loïc
Matthey
,
Arka
Pal
,
Christopher
Burgess
,
Xavier
Glorot
,
Matthew
Botvinick
,
Shakir
Mohamed
, and
Alexander
Lerchner
.
2017
.
beta-vae: Learning basic visual concepts with a constrained variational framework
. In
Proceedings of the 5th International Conference on Learning Representations
.
Toulon, France
.
T. D.
Hoa
.
2006
.
Overview of DUC 2006
. In
Proceedings of the 2006 Document Understanding Conference
.
New York, USA
.
Eric
Jang
,
Shixiang
Gu
, and
Ben
Poole
.
2016
.
Categorical reparameterization with gumbel- softmax
.
arXiv preprint arXiv:1611.01144
.
Hanqi
Jin
,
Tianming
Wang
, and
Xiaojun
Wan
.
2020
.
Semsum: Semantic dependency guided neural abstractive summarization
. In
Proceedings of the AAAI Conference on Artificial Intelligence
, volume
34
, pages
8026
8033
.
New York, USA
.
H. W.
Kuhn
,
A. W.
Tucker
.
1951
.
Nonlinear programming
. In
Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability
, pages
481
492
.
California, USA
.
Tom
Kwiatkowski
,
Jennimaria
Palomaki
,
Olivia
Redfield
,
Michael
Collins
,
Ankur
Parikh
,
Chris
Alberti
,
Danielle
Epstein
,
Illia
Polosukhin
,
Jacob
Devlin
,
Kenton
Lee
,
Kristina N.
Toutanova
,
Llion
Jones
,
Ming-Wei
Chang
,
Andrew
Dai
,
Jakob
Uszkoreit
,
Quoc
Le
, and
Slav
Petrov
.
2019
.
Natural questions: a benchmark for question answering research
.
Transactions of the Association for Computational Linguistics
,
7
:
453
466
.
Md Tahmid Rahman
,
Enamul
Hoque
, and
Jimmy
Huang
.
2020a
.
Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models
. In
, pages
342
348
.
Springer
.
Md Tahmid Rahman
,
Enamul
Hoque
, and
Jimmy Xiangji
Huang
.
2020b
.
WSL-DS: Weakly supervised learning with distant supervision for query focused multi-document abstractive summarization
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
5647
5654
.
Online
.
Logan
Lebanoff
,
Kaiqiang
Song
, and
Fei
Liu
.
2018
.
Adapting the neural encoder-decoder framework from single to multi-document summarization
. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pages
4131
4141
.
Brussels, Belgium
.
Mike
Lewis
,
Yinhan
Liu
,
Naman
Goyal
,
Marjan
,
Abdelrahman
Mohamed
,
Omer
Levy
,
Veselin
Stoyanov
, and
Luke
Zettlemoyer
.
2020
.
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
7871
7880
.
Online
.
Piji
Li
,
Wai
Lam
,
Lidong
Bing
,
Weiwei
Guo
, and
Hang
Li
.
2017a
.
Cascaded attention based unsupervised information distillation for compressive summarization
. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
, pages
2081
2090
.
Brussels, Belgium
.
Piji
Li
,
Zihao
Wang
,
Wai
Lam
,
Zhaochun
Ren
, and
Lidong
Bing
.
2017b
.
Salience estimation via variational auto-encoders for multi-document summarization
. In
Proceedings of the 31th AAAI Conference on Artificial Intelligence
, pages
3497
3503
.
San Francisco, California, USA
.
Chin-Yew
Lin
and
Eduard
Hovy
.
2003
.
Automatic evaluation of summaries using n-gram co-occurrence statistics
. In
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics
, pages
71
78
.
.
Peter J.
Liu
,
Saleh
,
Etienne
Pot
,
Ben
Goodrich
,
Ryan
Sepassi
,
Lukasz
Kaiser
, and
Noam
Shazeer
.
2018
.
Generating Wikipedia by summarizing long sequences
. In
Proceedings of the 6th International Conference on Learning Representations
.
.
Yang
Liu
and
Mirella
Lapata
.
2019a
.
Hierarchical transformers for multi-document summarization
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
5070
5081
.
Florence, Italy
.
Yang
Liu
and
Mirella
Lapata
.
2019b
.
Text summarization with pretrained encoders
. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
, pages
3730
3740
.
Hong Kong, China
.
Ramesh
Nallapati
,
Bowen
Zhou
,
Cicero dos
Santos
,
Caglar
Gulcehre
, and
Bing
Xiang
.
2016
.
Abstractive text summarization using sequence-to-sequence RNNs and beyond
. In
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning
, pages
280
290
.
Berlin, Germany
.
Shashi
Narayan
,
Shay B.
Cohen
, and
Mirella
Lapata
.
2018
.
Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization
. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pages
1797
1807
.
Brussels, Belgium
.
Preksha
Nema
,
Mitesh M.
Khapra
,
Anirban
Laha
, and
Balaraman
Ravindran
.
2017
.
Diversity driven attention model for query-based abstractive summarization
. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
, pages
1063
1072
.
.
Laura
Perez-Beltrachini
and
Mirella
Lapata
.
2021
.
Multi-document summarization with determinantal point process attention
.
Journal of Artificial Intelligence Research
,
71
:
371
399
.
Laura
Perez-Beltrachini
,
Yang
Liu
, and
Mirella
Lapata
.
2019a
.
Generating summaries with topic templates and structured convolutional decoders
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
5107
5116
.
Florence, Italy
.
Laura
Perez-Beltrachini
,
Yang
Liu
, and
Mirella
Lapata
.
2019b
.
Generating Summaries with Topic Templates and Structured Convolutional Decoders
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
.
Florence, Italy
.
Pranav
Rajpurkar
,
Jian
Zhang
,
Konstantin
Lopyrev
, and
Percy
Liang
.
2016
.
SQuAD: 100,000+ questions for machine comprehension of text
. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
, pages
2383
2392
.
Sydney, Australia
.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389. Lisbon, Portugal.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Berlin, Germany.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages 2377–2385.

Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, and Pascale Fung. 2019. Generalizing question answering system with pre-trained language model fine-tuning. Pages 203–211. Hong Kong, China.

Dan Su, Yan Xu, Tiezheng Yu, Siddique, Elham Barezi, and Pascale Fung. 2020. CAiRE-COVID: A question answering and query-focused multi-document summarization system for COVID-19 scholarly information management. In Proceedings of the 1st Workshop on NLP for COVID-19 at EMNLP 2020. Online.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2903–2908.

Xiaojun Wan and Jianmin Zhang. 2014. CTSUM: Extracting more certain summaries for news articles. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 787–796. New York, United States.

Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly topic assistant for transformer based abstractive summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 485–497. Online.

Yumo Xu and Mirella Lapata. 2020. Coarse-to-fine query focused multi-document summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3632–3645. Online.

Yumo Xu and Mirella Lapata. 2021. Generating query focused summaries from query-free resources. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6096–6109. Online.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208. Online.

Haichao Zhu, Li Dong, Furu Wei, Bing Qin, and Ting Liu. 2019. Transforming Wikipedia into augmented data for query-focused summarization. arXiv preprint arXiv:1911.03324.

## Author notes

Action Editor: Wenjie (Maggie) Li

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.