Controllable Summarization with Constrained Markov Decision Process

We study controllable text summarization, which allows users to control a particular attribute (e.g., a length limit) of the generated summaries. In this work, we propose a novel training framework based on the Constrained Markov Decision Process (CMDP), which conveniently includes a reward function along with a set of constraints, to facilitate better summarization control. The reward function encourages the generated summary to resemble the human-written reference, while the constraints explicitly prevent the generated summaries from violating user-imposed requirements. Our framework can be applied to control important attributes of summarization, including length, covered entities, and abstractiveness, as we devise specific constraints for each of these aspects. Extensive experiments on popular benchmarks show that our CMDP framework helps generate informative summaries while complying with a given attribute's requirement.


Introduction
Text summarization aims to condense the information of an input document into a concise summary. Although neural abstractive summarization models have recently achieved promising performance (See et al., 2017), they do not allow users to indicate their preferences to control different aspects of the generated summaries. Controllable summarization has many use cases. For instance, it can summarize product descriptions to fit within a word limit in online advertising. In another example, teachers can demonstrate the technique of paraphrasing important information by showing a system-generated summary with high abstractiveness. Controllable summarization can also complement information retrieval systems, for example, to only generate summaries covering the entities that users are interested in. Figure 1 illustrates one such usage, where our proposed model produces distinct abstractive summaries of the same source document, focusing on different input entities.
[Figure 1 caption fragment: "... and fine-tuned by our proposed method. Each summary corresponds to the requested entity inside the pair of brackets."]
[Footnote fragment: com/kenchan0226/control-sum-cmdp]
To allow users to control a particular attribute of the generated summaries, Fan et al. (2018) proposed a token-based controllable summarization model (ControlSum). Although ControlSum incorporates control tokens that let users specify a requirement on a summary attribute, the maximum likelihood training objective of the model does not provide explicit supervision signals that prevent the model from violating the specified attribute requirement. Consequently, a substantial portion of the generated summaries still fail to meet the specified attribute requirement, as shown in our experiments.
One possible solution to enforce the attribute requirement is to apply reinforcement learning (RL) with Markov Decision Process (MDP) (Bellman, 1957) to optimize a weighted sum of reward functions, including a penalty function to penalize the violation of the attribute requirement, and a summarization metric to encourage the generated summaries to be consistent with the references. However, selecting appropriate weights for different reward functions is a delicate task, and requires intensive hyper-parameter tuning.
In this work, we argue that applying constraints on the training objective is a more convenient way to control an attribute of a summary, since it avoids tuning reward function weights. We formulate the problem of training controllable text summarization models as a constrained Markov Decision Process (CMDP) (Altman, 1999), an RL framework trained with both rewards and constraints. In this setup, we maximize a summarization metric to encourage similarity between the output summaries and the references, and impose constraints to prevent the summaries from violating a specified attribute requirement.
Moreover, we apply our approach to improve token-based controllable summarization models and control important summary attributes including length, covered entities, and abstractiveness by creating specific constraints for each attribute. For length control, we divide summary length into disjoint length bins and restrict the summary length according to the desired length bin. For entity control, we design constraints that guide the generated summary to cover the salient information of user-specified entities. To control abstractiveness, which measures the degree of textual novelty between a summary and its input document, we define bins corresponding to three abstractiveness levels, and design constraints that allow users to control the summary's abstractiveness.
We conduct extensive experiments on popular benchmarks to evaluate the effectiveness of our CMDP training framework with different types of attribute requirements. Concretely, we use our CMDP framework to fine-tune controllable summarization models based on the pointer-generator network (See et al., 2017), a Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997) model, and DistilGPT2, a large-scale pre-trained Transformer (Vaswani et al., 2017) model. Experimental results demonstrate that our approach consistently improves the ability of both controllable summarization models to follow the specified attribute requirement. In addition, our framework increases the ROUGE scores of the generated summaries when provided with the reference control tokens (e.g., the tokens that represent the entities in the reference summary). Human evaluations further confirm that our framework produces informative summaries that conform to the attribute requirement.
The key contributions of this paper include: (1) A novel training framework that provides explicit guidance signals to supervise a controllable summarization model to conform to the specified attribute requirement; (2) Constraints that allow users to control the length, covered entities, and the abstractiveness of the generated summaries, respectively; (3) Consistent performance improvement of controllable summarization models based on two different architectures.

Related Work
Summarization systems with specified attributes.
Several methods extend abstractive summarization models to allow users to control a specific attribute of summaries. Fan et al. (2018) propose a method that allows users to control an attribute such as the length, entities, or style of summaries by prepending special tokens to the input document. Other work focuses on controlling the exact length of summaries by multiplying the input word embeddings in the decoder by the specified summary length. Song et al. (2020) propose a masked language model to control the portion of copied words in the output summary for the sentence summarization task. This model controls the abstractiveness of a summary at the word level. In contrast, our work controls the extractive fragment density (Grusky et al., 2018) of the output summary, which restricts the abstractiveness at the fragment level. Makino et al. (2019) and Laban et al. (2020) incorporate a penalty term into the training objective to penalize a model for violating the length requirement for word limit control. However, applying their methods to another dataset requires hyper-parameter tuning for the weight of the penalty. Our approach imposes constraints on the training objective and does not need to search for suitable penalty weights based on human inspection.
Query-focused summarization aims to predict a summary that answers specific questions, e.g., "How often did Lebron James visit his hometown?". Most of the query-focused summarization methods are extractive and they are based on centrality ranking (Wan, 2008;Wan and Zhang, 2014), manifold-ranking (Wan et al., 2007;Wan and Xiao, 2009;Wan, 2009), or sentence-compression framework (Wang et al., 2013). Recently, Nema et al. (2017) propose two query attention-based models for abstractive query-focused summarization. On the other hand, entity-controlled summarization aims to produce a summary that captures the salient information of the desired entities, e.g., "Lebron James".
Abstractive summarization. Most of the existing abstractive summarization models (Gehrmann et al., 2018; Zhang et al., 2020a; Chan et al., 2020) are built on the encoder-decoder model (Bahdanau et al., 2015) to generate summaries. See et al. (2017) propose the pointer-generator network, which allows copying words from the source to the output summary. The structure-infused copy mechanism incorporates the syntactic structure of the source text into the pointer-generator network to facilitate copying important words to the output summary. Lebanoff et al. (2019) propose a summarization framework that first extracts either a single sentence or a pair of sentences from the source document, and then condenses or fuses the selected sentence(s) to generate a summary. The above models do not allow users to constrain the degree of copying or sentence fusion from the source document.
Recent methods apply RL with MDP to optimize an abstractive summarization model towards a single reward function or a weighted sum of reward functions. Several methods (Çelikyilmaz et al., 2018) adopt the ROUGE-L score (Lin, 2004) as the reward function. The SENECA model (Sharma et al., 2019) optimizes a weighted sum of ROUGE-2, ROUGE-L, and a coherence score from a coherence model. To improve the factual correctness of the generated summaries, several methods (Huang et al., 2020; Zhang et al., 2020c) use RL to maximize a weighted sum of ROUGE scores and a factual correctness score computed by a model. Kryscinski et al. (2018) use the weighted sum of ROUGE-L and 3-gram novelty as the reward to increase the abstractiveness of summaries, but this method does not allow users to control the abstractiveness level of summaries. Pasunuru and Bansal (2018) extend the ROUGE-L reward by up-weighting the salient words detected by a classifier. One can modify this word-level weighting scheme to encourage the summary to contain certain keywords, but this method does not explicitly encourage the model to generate relevant information about the keywords. In contrast, we design a constraint that forces a summary to retain relevant information about the requested entities. Ziegler et al. (2020) apply RL to fine-tune a GPT2 model (Radford et al., 2019). The reward is provided by a model trained on human preferences over different summaries. Though one can use a weighted sum of rewards to control an attribute of generated summaries, such a method needs to tune the weights of the rewards. Our CMDP approach avoids the tuning of such weights.
Controllable text generation. Controllable text generation has received increasing attention from researchers. In machine translation, several methods (Sennrich et al., 2016; Kobus et al., 2017; Takeno et al., 2017) apply special tokens to control the politeness, domain, or length of the translation output. Ficler and Goldberg (2017) concatenate a style embedding with the decoder input to control the style of the generated review. Kikuchi et al. (2016), Miao et al. (2019), and Schumann et al. (2020) introduce different techniques to control sentence length for the headline generation task, such as feeding a length embedding to the decoder. The label-fine-tuning (LFT) model (Niu and Bansal, 2018) uses special tokens to control the politeness of responses for dialogue response generation. Several insertion-based decoding methods (Sun et al., 2017; Zhu et al., 2019; Gu et al., 2019) are proposed to complete a fill-in-the-blank sentence, e.g., "keywords 1 __ keywords 2 __". These decoding methods can be used to force the output to contain certain keywords, but users need to specify the relative order among the keywords. In contrast, entity-controlled summarization lets the model determine the relative order among the requested entities. Recently, Keskar et al. (2019) train a large language model conditioned on control codes that specify particular attributes such as domain or language style. Compared with the above methods, our approach incorporates the attribute requirement into the training objective, which gives more explicit supervision signals to the summarizer.

Controllable Summarization with Constrained Markov Decision Process

Problem Definition
Given a text document $x$ and a requirement on an attribute $a$ (e.g., a length limit of 20 words), the goal of controllable text summarization is to generate a summary $y$ that satisfies the requirement. Both the input document and the output summary are sequences of words, i.e., $x = [x_1, \ldots, x_{l_x}]$ and $y = [y_1, \ldots, y_{l_y}]$, where $l_x$ and $l_y$ are the numbers of words in $x$ and $y$, respectively. In this work, we focus on single-document summarization.

Constrained Markov Decision Process Formulation
We propose a constrained Markov Decision Process (CMDP) approach to guide a controllable summarization model to follow the attribute requirement. Assume an agent interacts with an environment to generate a summary in discrete time steps. At each step $t$, the agent performs an action by sampling a word $y_t$ from its policy $\pi_\theta$, which is a controllable summarization model. Then the agent updates its internal state representation (the hidden state of the decoder) and proceeds to the next step. Once the agent produces the end-of-sequence (EOS) token, we denote the current time step as $T$, and the environment gives the agent a reward $r(y_1, \ldots, y_T, y^*, x)$ and a set of costs $c_i(y_1, \ldots, y_T, y^*, x)$. The process then terminates. The reward function $r$ measures the similarity between the output summary $[y_1, \ldots, y_T]$ and the reference summary $y^*$, while a cost function $c_i$ measures the extent to which a summary violates an attribute requirement. For example, we can define a length cost function as the difference between the output summary length $l_y$ and the specified length limit $l$: $l_y - l$. The goal of the agent is to maximize the expected reward while keeping the expected costs within their thresholds:
$$\max_{\theta} \; \mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[ r(y_{1:T}, y^*, x) \right] \quad \text{s.t.} \quad \mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[ c_i(y_{1:T}, y^*, x) \right] \le \alpha_i, \; i = 1, \ldots, m, \tag{1}$$
where $y_{1:T}$ denotes $y_1, \ldots, y_T$, $\alpha_i$ is a pre-defined threshold associated with the cost function $c_i$, and $m$ is the size of the set of constraints. A constraint restricts an attribute of the generated summary. For example, to limit the summary length, we can define a constraint that forces the length cost function to be no larger than 0: $l_y - l \le 0$.
Lagrange relaxation. Following Tessler et al. (2019), we apply the Lagrange relaxation technique (Bertsekas, 1997) to approximate the constrained optimization problem in Eq. (1). We use $J(\pi_\theta)$ as a shorthand for $\mathbb{E}_{y_{1:T} \sim \pi_\theta}[r(y_{1:T}, y^*, x)]$ and $J_{c_i}(\pi_\theta)$ for $\mathbb{E}_{y_{1:T} \sim \pi_\theta}[c_i(y_{1:T}, y^*, x)]$. We then define a Lagrangian function
$$L(\lambda, \theta) = J(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \left( J_{c_i}(\pi_\theta) - \alpha_i \right),$$
where $\lambda_i$ is a Lagrangian multiplier and $\lambda = [\lambda_1, \ldots, \lambda_m] \in \mathbb{R}^m$. When $\lambda_i \ge 0, \forall i$, the optimal value of $\max_\theta L(\lambda, \theta)$ is an upper bound on the optimal value of Eq. (1). If we minimize the optimal value of $\max_\theta L(\lambda, \theta)$ over $\lambda$, we obtain a tighter upper bound on the optimal value of Eq. (1). Thus, we approximate Eq. (1) by the following relaxed problem:
$$\min_{\lambda \succeq 0} \max_{\theta} L(\lambda, \theta), \tag{2}$$
where $\lambda \succeq 0$ denotes that every entry in $\lambda$ is nonnegative. Intuitively, this relaxed problem penalizes violations of the constraints, and all the Lagrange multipliers $\lambda_i$ are learnable. In contrast, the MDP formulation requires manual tuning of the weights of the penalty terms.
Policy training.
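Since it is intractable to enumerate all possible $y_{1:T}$, we approximate the expectation $\mathbb{E}_{y_{1:T} \sim \pi_\theta}$ using a sampled output sequence $y_{1:T} \sim \pi_\theta$. Moreover, we subtract a baseline $b$ from the reward, which is a standard technique to reduce the variance of the gradient estimator (Sutton and Barto, 1998). The gradients are then estimated by:
$$\nabla_\theta L \approx \left[ r(y_{1:T}, y^*, x) - b - \lambda^{\top} c \right] \nabla_\theta \log \pi_\theta(y_{1:T} \mid x), \tag{3}$$
$$\nabla_{\lambda_i} L \approx -\left( c_i(y_{1:T}, y^*, x) - \alpha_i \right), \tag{4}$$
where $c = [c_1(y_{1:T}, y^*, x), \ldots, c_m(y_{1:T}, y^*, x)]$. We can interpret $\nabla_\theta L$ as the standard policy gradient with a regularization term $-\lambda^{\top} c$, where $\lambda$ is trained by a gradient descent algorithm.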
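As a rough illustration of the relaxed objective, the sketch below computes the scalar that multiplies the log-likelihood gradient of a sampled summary (reward minus baseline minus the multiplier-weighted costs) and performs one projected descent step on the multipliers. The function names, the learning rate, and the example reward and cost values are assumptions made for illustration; only the initial multiplier value of 0.01 comes from the text.

```python
from typing import List, Sequence


def policy_gradient_weight(reward: float, baseline: float,
                           lambdas: Sequence[float],
                           costs: Sequence[float]) -> float:
    """Scalar weight on grad log pi(y|x): (r - b) - lambda^T c."""
    penalty = sum(l * c for l, c in zip(lambdas, costs))
    return (reward - baseline) - penalty


def update_multipliers(lambdas: Sequence[float], costs: Sequence[float],
                       thresholds: Sequence[float], lr: float) -> List[float]:
    """One gradient-descent step on lambda, projected onto lambda >= 0.

    The gradient w.r.t. lambda_i is -(c_i - alpha_i), so a descent step
    adds lr * (c_i - alpha_i): violated constraints push lambda_i up.
    """
    return [max(0.0, l + lr * (c - a))
            for l, c, a in zip(lambdas, costs, thresholds)]


# Hypothetical numbers for one sampled summary with two constraints.
lambdas = [0.01, 0.01]      # initial value taken from the text
costs = [0.3, 0.0]          # first constraint violated, second satisfied
thresholds = [0.0, 0.0]
weight = policy_gradient_weight(0.62, 0.55, lambdas, costs)
new_lambdas = update_multipliers(lambdas, costs, thresholds, lr=0.1)
```

In this toy step only the multiplier of the violated constraint grows, which makes its penalty heavier in the next policy update.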
In this work, we apply the self-critical baseline (Rennie et al., 2017). Specifically, we use greedy search to generate an output sequence $\bar{y}$ from the policy. Then, we treat the reward of this sequence, $r(\bar{y}, y^*, x)$, as the baseline $b$.
Reward function. We apply BERTScore (Zhang et al., 2020b) as the reward function to measure the similarity between an output summary and the reference summary based on their BERT (Devlin et al., 2019) contextual embeddings. We do not use ROUGE scores (Lin, 2004) as the reward since they cannot match paraphrases in an output.
3-gram repetition constraint. Similar to prior work (Liu and Lapata, 2019; Laban et al., 2020), we address the problem of repeated text fragments by adding a 3-gram repetition constraint to our framework. We define a cost function that measures the ratio of 3-gram repetition in a summary: $\text{RepeatRatio}_3(y) = \#\text{repeated 3-grams} / \#\text{3-grams}$. Then we set its threshold to zero and apply the following 3-gram repetition constraint: $\text{RepeatRatio}_3(y) \le 0$.
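The 3-gram repetition cost can be sketched as follows. The text does not spell out how repeated 3-grams are counted, so this sketch assumes that every occurrence of a 3-gram beyond its first counts as one repetition; the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def repeat_ratio(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # too short to contain any n-gram
    counts = Counter(ngrams)
    # Assumption: each occurrence after the first counts as one repetition.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


summary = "the cat sat on the cat sat on the mat".split()
ratio = repeat_ratio(summary)  # 3 repeated 3-grams out of 8
```

A summary satisfies the constraint only when this ratio is exactly zero.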

Implementation with RNN and Pre-trained Transformer
We apply our CMDP framework to train two types of controllable summarization models: the pointer-generator network (See et al., 2017) and DistilGPT2. The pointer-generator network is a popular abstractive summarization model based on the RNN encoder-decoder model (Bahdanau et al., 2015). We also incorporate the intra-decoder attention mechanism since it has been shown to improve the performance of the pointer-generator. GPT2 (Radford et al., 2019) is a large-scale pre-trained language model based on the Transformer (Vaswani et al., 2017). DistilGPT2 is a compressed version of the GPT2 model obtained with the knowledge distillation technique. We append the text "TL;DR" to the input document to trigger the summarization operation by DistilGPT2. We add control tokens to the input of both models.

Length-controlled Summarization
Length-controlled summarization aims to control the length of generated summaries. We adopt the setting proposed by Fan et al. (2018), which allows users to constrain the summary length to a pre-defined range, e.g., 33 to 37 words. We first divide summary length into 10 disjoint length bins $LB = (lb_1, \ldots, lb_{10})$. Each length bin corresponds to a range of lengths, and each bin contains a roughly equal number of training samples in the corpus. Let $lb_{i^*}$ denote the specified length bin. The goal of this task is to generate a summary $y$ that satisfies the specified length bin $lb_{i^*}$.
Base model. We expand the vocabulary of the model with ten special tokens (e.g., <bin_2>) to denote the corresponding bins. In training, we feed the token that indicates the length bin of the reference summary. During testing, we control the length of the output summary by inputting the token of our specified length bin. For the pointer-generator, we prepend the token at the beginning of the document. For DistilGPT2, we insert the special token into the "TL;DR:" prefix, e.g., "TL;DR<bin_2>:".
Length bin constraint. To encourage the summary length to match the specified length bin, we define a cost function that computes the normalized distance between the length bin of the generated summary $\hat{i}$ and the specified length bin $i^*$: $|\hat{i} - i^*|/10$. We then set the threshold $\alpha = 0$, which leads to the following length bin constraint: $|\hat{i} - i^*| \le 0$. We adopt a normalized cost function to prevent the values of costs from being too large and dominating the gradient $\nabla_\theta L$ in Eq. (3).
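The length bin cost can be sketched as below. The bin boundaries here are hypothetical placeholders (the actual boundaries are derived from the training corpus and are not listed in this section); only the upper bound of 33 words for bin 1 and the divisor of 10 come from the text.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for bins 1..9; bin 10 is open-ended.
# Only the first value (33) is stated in the text.
BIN_UPPER_BOUNDS = [33, 37, 41, 46, 50, 54, 59, 66, 78]


def length_bin(num_words: int) -> int:
    """Map a summary length to its 1-based length bin index."""
    return 1 + bisect_left(BIN_UPPER_BOUNDS, num_words)


def length_bin_cost(num_words: int, requested_bin: int) -> float:
    """Normalized distance |i_hat - i_star| / 10 between bins."""
    return abs(length_bin(num_words) - requested_bin) / 10


cost_ok = length_bin_cost(45, requested_bin=4)   # inside the requested bin
cost_far = length_bin_cost(70, requested_bin=4)  # several bins too long
```

With a threshold of 0, the constraint is satisfied only when the generated summary falls exactly into the requested bin.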

Entity-controlled Summarization
Our second task is to generate a summary that focuses on entities requested by a user. Fan et al. (2018) anonymize each entity in the document by a special token. In contrast, we do not anonymize the entities, which is a more realistic setup.
Base model. During training, we prepend the reference entities to the document. These requested entities are separated by segmenters, e.g., "Lebron James <ent> LA Lakers". At test time, we control the focus of the summary by feeding in our specified entities. To make the reference summaries focus on the reference entities, we remove the reference summary sentences that contain neither reference entities nor coreferent mentions of reference entities from the training, validation, and test splits.
QA constraint. We apply a question-answering (QA) constraint to guide the generated summary to capture the important information about the requested entities. The main idea is to use the QA-based metric from Eyal et al. (2019) and Scialom et al. (2019) to evaluate the capability of a summary to answer a set of questions regarding the reference entities. The QA constraint ensures that the score of the QA-based metric is above a threshold.
Specifically, we first construct a set of cloze question-answer pairs by individually masking each of the named entities in the reference summary to create a question, with the masked entity as its gold-standard answer. The summary predicted by a system is treated as the context for a QA model. We feed each cloze question together with the context to the QA model, and the QA model extracts an answer from the context for each cloze question. We use the F1 score of the answers extracted by the QA model as the evaluation metric, denoted as the QA-F1 score. If a summary presents the key information about the reference entities, then the QA model can predict the correct answers from the summary most of the time. We use the negative of QA-F1 as our cost function and set the threshold to −0.9. Our QA constraint is then defined as: $-\text{QA-F1}(y) \le -0.9$.
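The cloze construction and the QA cost can be sketched as follows. The QA model itself is left out: the sketch takes the extracted answers as given. It assumes a SQuAD-style token-overlap F1, a "[MASK]" placeholder, and masking of the first occurrence of each entity; all three are illustrative assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def make_cloze_questions(reference: str, entities: Sequence[str],
                         mask: str = "[MASK]") -> List[Tuple[str, str]]:
    """Mask one entity at a time; the masked entity is the gold answer."""
    pairs = []
    for entity in entities:
        if entity in reference:
            # Assumption: only the first occurrence is masked.
            pairs.append((reference.replace(entity, mask, 1), entity))
    return pairs


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def qa_cost(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Negative mean F1; the constraint requires qa_cost <= -0.9."""
    scores = [token_f1(p, g) for p, g in zip(predicted, gold)]
    return -sum(scores) / len(scores)


reference = "Lebron James joined the LA Lakers in 2018"
questions = make_cloze_questions(reference, ["Lebron James", "LA Lakers"])
cost = qa_cost(["Lebron James", "the LA Lakers"],
               [answer for _, answer in questions])
```

In this toy case one answer is exact and the other has F1 0.8, so the mean QA-F1 is 0.9 and the cost sits right at the threshold, within floating-point rounding.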
The QA model is a BERT model (Devlin et al., 2019) with a span classification head on top of the last-layer hidden states. The span classification head is a fully-connected layer that predicts the beginning and ending positions of the answer span in the context. We obtain a BERT-based QA model that is fine-tuned on SQuAD 2.0 (Rajpurkar et al., 2018) from Hugging Face Transformers. Then we further fine-tune the QA model on the CNN/DailyMail (Hermann et al., 2015; Nallapati et al., 2016) corpus using our constructed question-context-answer triplets. We construct 349,653/17,442 cloze question-context-answer triplets for training and development. The details of the construction method are described in §A.2.
Entity repetition constraint. We find that the QA constraint will cause the model to repeatedly generate the same requested entity in a sentence, because the model wants to increase the chance that the QA model will select the requested entities as the answer. Since a named entity usually contains one or two words, the entity repetition behavior cannot be fixed by the 3-gram repetition constraint. To address this problem, we first define a function ER(y) to measure the fraction of sentences in y that contain repetition of requested entities. We then use ER(y) as the cost function and apply the following constraint: ER(y) ≤ 0.
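One possible reading of ER(y) is sketched below. It assumes that a sentence counts as containing a repetition when any requested entity string occurs in it more than once, and it uses a naive punctuation-based sentence splitter; both choices, along with the function name, are illustrative assumptions rather than the exact definition.

```python
import re
from typing import Sequence


def entity_repetition(summary: str, entities: Sequence[str]) -> float:
    """Fraction of sentences that mention a requested entity more than once."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not sentences:
        return 0.0

    def has_repeat(sentence: str) -> bool:
        lowered = sentence.lower()
        return any(lowered.count(e.lower()) > 1 for e in entities)

    return sum(has_repeat(s) for s in sentences) / len(sentences)


er = entity_repetition(
    "Lebron James and Lebron James scored. The Lakers won.",
    ["Lebron James"],
)
```

Here one of the two sentences repeats the requested entity, so the cost is 0.5 and the constraint ER(y) ≤ 0 is violated.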

Abstractiveness-controlled Summarization
Our third task is abstractiveness-controlled summarization, which allows a user to specify the degree of text novelty between a generated summary and the corresponding document. In this work, we adopt extractive fragment density (Grusky et al., 2018) to measure the abstractiveness of a summary. Given a document $x$ and a summary $y$, the set of extractive fragments $\mathcal{F}(x, y)$ is the set of word sequences shared by $x$ and $y$. Extractive fragment density is defined as the mean square of the extractive fragment lengths:
$$\text{DENSITY}(x, y) = \frac{1}{|y|} \sum_{f \in \mathcal{F}(x, y)} |f|^2.$$
Intuitively, a summary that copies many long text fragments from the document has a higher extractive fragment density and a lower abstractiveness. We divide the values of extractive fragment density into three abstractiveness bins: $ab_1 = (3.3, +\infty)$, $ab_2 = (1.3, 3.3]$, and $ab_3 = [0, 1.3]$, which indicate low, medium, and high abstractiveness, respectively. The goal of abstractiveness control is to generate a summary $y$ that follows the specified abstractiveness bin $ab_{i^*}$.
Base model. Similar to length control, we use special tokens to denote the abstractiveness bins and input a special token to control the abstractiveness level of the output summary.
Abstractiveness bin constraint. To prevent the output summary from violating the specified abstractiveness bin, we apply a cost function that evaluates the normalized distance between the abstractiveness bin of the output summary $\hat{i}$ and the desired abstractiveness bin $i^*$: $|\hat{i} - i^*|/3$. We set the threshold to 0 and obtain the following abstractiveness bin constraint: $|\hat{i} - i^*| \le 0$.
Conjunction constraint. We find that after applying the abstractiveness constraint, the model often inserts the conjunction "but" into a copied fragment to decrease the extractive fragment density, even if there is no contrast relationship. Since it is difficult to detect the improper use of a conjunction, we devise a constraint that prevents the model from generating "but" when the reference summary does not contain "but". Concretely, we first define a binary function $IC(y)$ as follows: $IC(y) = 1$ if the predicted summary $y$ contains "but" and the reference summary does not contain "but"; otherwise, $IC(y) = 0$. We then apply the following conjunction constraint: $IC(y) \le 0$. This method can be generalized to other discourse markers depending on specific model behavior.
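The abstractiveness measures above can be sketched as follows. The fragment extraction follows the greedy longest-match procedure described by Grusky et al. (2018) as we understand it; the function names and the whitespace tokenization are assumptions for illustration.

```python
from typing import List, Sequence


def extractive_fragments(doc: Sequence[str],
                         summ: Sequence[str]) -> List[List[str]]:
    """Greedily match the longest shared token sequence at each summary position."""
    fragments, i = [], 0
    while i < len(summ):
        best, j = 0, 0
        while j < len(doc):
            if summ[i] == doc[j]:
                k = 0
                while (i + k < len(summ) and j + k < len(doc)
                       and summ[i + k] == doc[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(list(summ[i:i + best]))
        i += max(best, 1)
    return fragments


def density(doc: Sequence[str], summ: Sequence[str]) -> float:
    """Sum of squared fragment lengths divided by the summary length."""
    if not summ:
        return 0.0
    return sum(len(f) ** 2 for f in extractive_fragments(doc, summ)) / len(summ)


def abstractiveness_bin(d: float) -> int:
    """Bins from the text: 1 = (3.3, inf), 2 = (1.3, 3.3], 3 = [0, 1.3]."""
    return 1 if d > 3.3 else 2 if d > 1.3 else 3


def improper_but(pred: Sequence[str], ref: Sequence[str]) -> int:
    """IC(y): 1 if the prediction uses "but" while the reference does not."""
    return int("but" in pred and "but" not in ref)


doc = "the quick brown fox jumps over the lazy dog".split()
summ = "a quick brown fox sleeps".split()
d = density(doc, summ)          # one fragment of length 3 over 5 tokens
level = abstractiveness_bin(d)  # medium abstractiveness
```

The abstractiveness bin cost is then the absolute difference between `level` and the requested bin, divided by 3.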

Experimental Setup
Datasets.
We use three popular summarization datasets in our experiments. The first one is the CNN/DailyMail (Hermann et al., 2015; Nallapati et al., 2016) corpus. We use the standard splits, which have 287,113/13,368/11,490 samples for the training, validation, and test sets. Each summary in the training set has 66 words on average. We follow the preprocessing steps of See et al. (2017). Table 1 shows the distribution of abstractiveness bins. We can observe that most of the reference summaries belong to abstractiveness bins 1 and 2, indicating that the summaries in this dataset are largely of low or medium abstractiveness.
Moreover, we use a subset of the Newsroom (Grusky et al., 2018) corpus. Newsroom contains 1.3 million news articles with summaries from 38 different news publishers. We construct a subset of the Newsroom corpus, called Newsroom-b, which has a more balanced distribution of abstractiveness bins. We extract all the samples from three of the news publishers (Washington Post, The Guardian, and New York Times) and obtain splits of 297,327/31,815/32,047 for the training, validation, and test sets. The distribution of abstractiveness bins is shown in Table 1.
Furthermore, we conduct experiments on length control on the DUC-2002 dataset (Ellis, 2002) using a test-only setup (Chen and Bansal, 2018; Chan and King, 2021). DUC-2002 consists of 567 documents and each document has two reference summaries. We remove the documents that are shorter than their corresponding reference summaries, resulting in 554 documents. This dataset has long reference summaries with an average length of 113 words.

Baselines and comparison.
We use maximum likelihood (ML) loss to train the pointer-generator and DistilGPT2 based controllable summarization models described in §3.5, denoted as PG and D.GPT2 respectively. We then use the suffix "+CMDP" to indicate that a model is fine-tuned by our CMDP framework. The following baselines do not use pre-trained models. We consider the ControlSum (Fan et al., 2018) model as a baseline for all of our control settings. For entity control, we incorporate query-focused summarization baselines including GRSUM (Wan, 2008), an extractive model that incorporates query-relevance into a random walk algorithm; QueryAtt (Nema et al., 2017), an abstractive model that applies a query attention to focus on different parts of the input query; and SD2 (Nema et al., 2017), which integrates an orthogonality constraint into the QueryAtt model to encourage the successive query attention context vectors to be orthogonal to each other. Both the QueryAtt and SD2 models have a strong inductive bias that the generated summary should focus on the query. We modify the ROUGESal (Pasunuru and Bansal, 2018) method by doubling the weights of the words of the requested entities and treat it as a baseline, denoted as ROUGEEnt.
Evaluation metrics. For length control and entity control, we evaluate the quality of summaries using ROUGE-1, ROUGE-2, and ROUGE-L F1 scores with full-length and stemming (Lin, 2004).
For abstractiveness control, we use embedding-based metrics, BERTScore (Zhang et al., 2020b) and MoverScore (Zhao et al., 2019), to measure the semantic similarity between an output summary and a reference summary. To evaluate how well the generated summaries satisfy the attribute requirement, we define a metric called bin % to measure the percentage of generated summaries that follow the specified bin (length or abstractiveness bin). We use the QA-F1 score defined in §3.5 to evaluate whether a summary retains the essential information about the reference entities. We define reference entities as all the named entities (typed as location, person, and organization) that appear in both the reference summary and the first 400 words of the input document. We also define appear % to measure the percentage of requested entities that appear in the summary. For the non-reference control settings, the entire test set is evaluated under different control constraints, and reference summaries do not exist in these cases.
Implementation Details. We use Spacy (Honnibal et al., 2020) for coreference resolution. For RNN-based models, we use the Adam algorithm (Kingma and Ba, 2015) for training. We first use ML loss to train an RNN-based model until the validation loss stops decreasing for three consecutive checkpoints. Then we start the (C)MDP training. The initial learning rates are 1e-3 and 5e-5 for ML and CMDP training respectively. For Transformer-based models, we use the AdamW algorithm (Loshchilov and Hutter, 2017) for training. We first use ML loss to train a Transformer-based model for 12 epochs. Then we start the (C)MDP training. The initial learning rates are 5e-5 and 1.77e-5 for ML and CMDP training respectively. During CMDP training of D.GPT2, we freeze the bottom four layers of the model. We initialize the values of λ to 0.01.
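The bin % and appear % metrics defined in the evaluation paragraph can be sketched as below. The sketch assumes that appear % is micro-averaged over all requested entities and that an entity "appears" when its string occurs in the summary after lowercasing; both are illustrative assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def bin_percent(predicted_bins: Sequence[int],
                requested_bins: Sequence[int]) -> float:
    """Percentage of summaries whose bin equals the requested bin."""
    hits = sum(p == r for p, r in zip(predicted_bins, requested_bins))
    return 100.0 * hits / len(requested_bins)


def appear_percent(summaries: Sequence[str],
                   requested_entities: Sequence[Sequence[str]]) -> float:
    """Percentage of requested entities found in their summaries."""
    total = found = 0
    for summary, entities in zip(summaries, requested_entities):
        lowered = summary.lower()
        for entity in entities:
            total += 1
            found += entity.lower() in lowered
    return 100.0 * found / total if total else 0.0


bin_pct = bin_percent([1, 4, 7, 9], [1, 4, 7, 10])
appear_pct = appear_percent(
    ["Lebron James joined the Lakers."],
    [["Lebron James", "LA Lakers"]],
)
```

Exact string matching is strict: in the example, "LA Lakers" is not credited even though "Lakers" is mentioned.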

Results of Length Control
Reference length bin.
We first evaluate the performance of length-controlled models when supplying the length bin of the reference summary (reference length bin) at testing time. The results are shown in Table 3. We observe that after applying our CMDP framework, both PG and D.GPT2 models obtain significantly higher ROUGE scores, and a larger portion of their generated summaries follow the specified length bin. We also report the results of the D.GPT2 model after fine-tuning by RL with MDP (D.GPT2+MDP). In this MDP approach, the reward is BERTScore minus a weighted sum of the length bin distance and the 3-gram repetition ratio. We tune the weights of the penalties on the validation set and set the weights for length bin distance and 3-gram repetition to 0.4 and 0.6 respectively. We can see that our CMDP approach outperforms the MDP approach. The above results demonstrate the effectiveness of our framework. Moreover, we observe that the D.GPT2 based models obtain higher ROUGE scores but lower bin % than the PG based models. One possible reason is that the large-scale pre-training in D.GPT2 makes it more difficult for the model to adapt to a specific bin requirement. This suggests a trade-off between the task metrics and the bin %.
Arbitrary length bin. We evaluate the performance of length-controlled models when supplying different length bins at testing time. We report the results of length-controlled models on four different length bins: 1, 4, 7, and 10. The DUC-2002 dataset is adopted since it has long reference summaries. Hence, we can evaluate the quality of summaries with different lengths by truncating the summaries. We truncate the reference and system summaries to 33, 46, 59, and 100 words for specified length bins of 1, 4, 7, and 10 respectively when computing ROUGE scores. ROUGE evaluation with truncation is a common practice for evaluating a system summary when given a length budget (Hong et al., 2014). The intuition is that a good summary should contain the more essential information at the beginning.
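For comparison, the weighted-sum reward of the MDP baseline and the truncation step used for DUC-2002 can be sketched as follows. The weights (0.4 and 0.6) and the truncation lengths come from the text; the function names and the assumption that summaries are pre-tokenized into words are hypothetical.

```python
from typing import List, Sequence

# Truncation lengths (in words) per requested length bin, from the text.
TRUNCATION = {1: 33, 4: 46, 7: 59, 10: 100}


def mdp_reward(bert_score: float, length_bin_distance: float,
               repeat_ratio: float,
               w_len: float = 0.4, w_rep: float = 0.6) -> float:
    """BERTScore minus a weighted sum of the two penalties."""
    return bert_score - (w_len * length_bin_distance + w_rep * repeat_ratio)


def truncate_for_rouge(tokens: Sequence[str], length_bin: int) -> List[str]:
    """Keep only the leading words allowed by the requested length bin."""
    return list(tokens[:TRUNCATION[length_bin]])


reward = mdp_reward(0.6, length_bin_distance=0.5, repeat_ratio=0.25)
truncated = truncate_for_rouge(["w"] * 120, length_bin=4)
```

The two weights in `mdp_reward` are exactly the hyper-parameters that the CMDP formulation replaces with learned Lagrangian multipliers.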
We analyze the results of length-controlled models on different length bins. Figure 2 illustrates the bin % obtained by different models. We observe that all the models achieve more than 90 bin % for length bin 1. Because length bin 1 represents the length range (0, 33], it is easy to satisfy the requirement by generating a very short summary. For length bins 4, 7, and 10, our CMDP framework improves the bin % of both PG and D.GPT2 models by a wide margin. From Table 2, we can see that our framework consistently improves the ROUGE scores of PG and D.GPT2 models.

[Figure legend: averaged value for the normalized length bin distance cost; for the 3-gram repetition ratio cost.]
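As a rough illustration of the bin % metric, the sketch below computes the percentage of summaries whose length falls inside a requested bin. Only the range of bin 1, (0, 33], is given in the text, so the bin boundaries are passed in by the caller; the function name `bin_percentage` is hypothetical.

```python
from typing import Sequence


def bin_percentage(lengths: Sequence[int], lower: int, upper: int) -> float:
    """Percentage of summaries whose length lies in the half-open range (lower, upper]."""
    hits = sum(1 for n in lengths if lower < n <= upper)
    return 100.0 * hits / len(lengths)


# Length bin 1 covers (0, 33] according to the text.
print(bin_percentage([10, 33, 34, 40], lower=0, upper=33))
```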

Costs and Lagrangian multipliers.
Furthermore, we analyze the values of costs (c) and Lagrangian multipliers (λ) of our PG+CMDP model during training. From Figure 3, we can see that the costs received by the agent decrease gradually over iterations. This is because the relaxed training objective of our framework in Eq. (2) penalizes the behavior of violating the constraints. We also observe that the values of the Lagrangian multipliers λ keep increasing. The reason is that, according to Eq. (4), the gradient of λ is negative as long as there is a sample that violates the constraints during training. As mentioned in § 3.2, λ is learned by a gradient descent algorithm, and the algorithm increases λ when the gradient is negative.
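The behavior of λ described above can be reproduced with a small sketch. Eq. (2) and Eq. (4) are not restated in this section, so the code assumes a standard Lagrangian relaxation in which the multiplier loss is −λ · mean(c − α), with α the constraint threshold; the projection onto λ ≥ 0, the learning rate, and the names `lambda_gradient` and `update_lambda` are likewise assumptions made for illustration.

```python
from typing import Sequence


def lambda_gradient(costs: Sequence[float], threshold: float) -> float:
    """Gradient w.r.t. lambda of the assumed loss -lambda * mean(c - threshold)."""
    return -sum(c - threshold for c in costs) / len(costs)


def update_lambda(lam: float, costs: Sequence[float],
                  threshold: float = 0.0, lr: float = 0.01) -> float:
    """One gradient descent step on lambda, kept non-negative (assumed projection)."""
    grad = lambda_gradient(costs, threshold)
    return max(0.0, lam - lr * grad)


lam = 1.0
for batch_costs in ([0.0, 0.5], [0.0, 0.2], [0.0, 0.0]):
    lam = update_lambda(lam, batch_costs, threshold=0.0, lr=0.1)
    print(lam)
```

With non-negative costs and a threshold of 0, the gradient is negative whenever at least one sample in the batch incurs a cost, so λ can only grow or stay constant under this assumed update.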

Results of Entity Control
Reference entities. We first evaluate the performance of entity-controlled models in summarizing the reference entities. For each of the models, we feed in all the reference entities to generate a summary that centers on the reference entities. The results are presented in Table 4. We use the CNN/DM dataset for entity-controlled summarization because it contains named entities in 99.74% of the reference summaries in its test set, whereas the Newsroom-b dataset only has 85.24%. When computing QA-F1 and appear %, we ignore the samples that do not have a named entity in the reference summary. We observe that our framework consistently and significantly improves the ROUGE scores, QA-F1 score, and appear % for both the PG and D.GPT2 models. These results demonstrate the effectiveness of our framework in summarizing reference entities. We also adopt the D.GPT2+MDP model as a rival system.
In this control setting, the reward is $\mathrm{BERTScore}(y) + \gamma_1 \mathrm{QAF}_1(y) - \gamma_2 \mathrm{RepeatRatio}_3(y) - \gamma_3 \mathrm{ER}(y)$. We set $\gamma_1$, $\gamma_2$, and $\gamma_3$ to 0.15, 0.4, and 0.5 respectively after hyperparameter tuning. We observe that the MDP approach and our CMDP approach obtain similar performance, while our approach has fewer hyperparameters to tune.

Figure 4: Results of entity-controlled models for entities in different document sentences. Our CMDP framework consistently improves the QA-F1 and appear % for entities at different positions.

Entities at different positions. Next, we evaluate the capability of entity-controlled models to summarize entities at different positions of the document with the following setup. For each of these models, we use the named entities at document sentences 1 to 2, 3 to 4, 5 to 6, and 7 to 8 as the requested entities respectively. Since we do not have reference summaries for these entities, we use the document sentences to construct cloze questions to evaluate the output summaries. For each requested entity, we build cloze questions by masking each document sentence that contains the entity or its coreferent mention. We use the F1 score of the answer predicted by the QA model as an evaluation metric, denoted as QA-F1. We analyze the performance of our method for entities at various sentences of the document. The results of appear % and QA-F1 scores are presented in Figure 4. We observe that our CMDP framework consistently improves the appear % and QA-F1 scores of both PG and D.GPT2 models for entities at different positions. Without our CMDP training, the appear % is low for entities at later positions of the document. The reason is that we use reference entities for model training and the reference entities are concentrated in the first few sentences of the document, which biases a neural model towards these sentences. In the training set of CNN/DM, 45.6% of the reference entities appear in the first two document sentences. Nevertheless, the neural models fine-tuned by our CMDP framework achieve high appear % for entities at varying positions.
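For readers unfamiliar with answer-level F1, the following sketch shows one common way to score a predicted answer against the masked entity. It assumes a token-overlap F1 with lowercasing and whitespace tokenization, in the style of extractive QA evaluation; the exact normalization behind QA-F1 is not specified in this section, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging such scores over all cloze questions of a summary would give a per-summary score under these assumptions.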
Moreover, we observe that the GRSUM system achieves the highest QA-F1 scores and its appear % scores are similar to those of D.GPT2+CMDP. We analyze the reasons as follows. The GRSUM system is an extractive method, while all other methods in Figure 4 are abstractive methods. It is relatively easy for an extractive method to select document sentences that mention the requested entities to obtain high appear %. In the setting of non-reference entity control, we use document sentences to construct the cloze questions for the QA-F1 metric since we do not have a reference summary. Hence, the QA-F1 metric tends to give higher scores to extractive summaries. We also observe that the GRSUM model achieves higher QA-F1 scores for the entities at later sentences of the document. The entities at later positions of a news article are usually less important entities that are only mentioned once and do not have coreferent mentions. The GRSUM system relies on term vectors to measure the relevance of a sentence. Thus, this system cannot recognize a coreferent mention that uses completely different words (e.g., a pronoun). As a result, it is easier for GRSUM to extract a summary for entities at later locations. However, an extractive method cannot paraphrase the information of the document to generate a concise entity-focused summary.

Results of Abstractiveness Control
We analyze the capability of abstractiveness-controlled models to generate summaries with different abstractiveness levels. In our experiments, for each of the abstractiveness-controlled models, we feed in abstractiveness bin 1, bin 2, and bin 3 independently. The results on the Newsroom-b and CNN/DM datasets are presented in Tables 5 and 6. We can see that our CMDP framework consistently improves the BERTScores and MoverScores of PG and D.GPT2 models. We also observe that all the models achieve more than 99 bin % for bin 1 (least abstractive), because it is easier for models to directly copy document sentences than to paraphrase document information. For abstractiveness bins 2 and 3, our CMDP framework substantially improves the bin % of PG and D.GPT2 models, which shows that our framework improves the ability of summarization models to generate summaries of higher abstractiveness levels. Similar to the results of length control, there is a trade-off between the task metrics and the bin %. We then compare the bin % results on the CNN/DM dataset with those on Newsroom-b. We observe that for abstractiveness bin 3 (most abstractive), all the models achieve a low bin % on CNN/DM but a substantially higher bin % on Newsroom-b. This is because only 4.6% of the training samples in CNN/DM belong to bin 3. Hence, it is difficult for a model to learn to generate a highly abstractive summary. In contrast, the Newsroom-b dataset has a balanced distribution of abstractiveness bins, so a model can learn from more abstractive references.
Furthermore, we compare our framework with the D.GPT2+MDP model on both datasets. The reward is $\mathrm{BERTScore}(y) - \gamma_1 |\hat{i} - i^*|/3 - \gamma_2 \mathrm{RepeatRatio}_3(y) - \gamma_3 \mathrm{IC}(y)$, where $\hat{i}$ denotes the abstractiveness bin of the generated summary and $i^*$ denotes the specified abstractiveness bin. On the CNN/DM dataset, we set $\gamma_1$, $\gamma_2$, and $\gamma_3$ to 0.3, 0.5, and 0.3 respectively. On the Newsroom-b dataset, we set these weights to 0.4, 0.5, and 0.3 respectively. We observe that the MDP approach and our CMDP approach obtain similar performance, while our approach has fewer hyperparameters to tune.
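The scalarized reward of the MDP baseline for abstractiveness control can be written as a short function. This is only a sketch of the formula stated above: the BERTScore, the 3-gram repetition ratio, and the IC cost are assumed to be computed elsewhere and passed in as numbers, and the function name `mdp_abstractiveness_reward` is hypothetical.

```python
from typing import Tuple


def mdp_abstractiveness_reward(bert_score: float,
                               generated_bin: int,
                               target_bin: int,
                               repeat_ratio_3: float,
                               ic_cost: float,
                               gammas: Tuple[float, float, float]) -> float:
    """BERTScore minus weighted penalties, following the reward stated in the text."""
    g1, g2, g3 = gammas
    bin_penalty = abs(generated_bin - target_bin) / 3
    return bert_score - g1 * bin_penalty - g2 * repeat_ratio_3 - g3 * ic_cost


# Weights reported for CNN/DM: 0.3, 0.5, and 0.3. The input values are made up.
reward = mdp_abstractiveness_reward(0.9, generated_bin=1, target_bin=3,
                                    repeat_ratio_3=0.1, ic_cost=0.0,
                                    gammas=(0.3, 0.5, 0.3))
print(reward)
```

Each penalty needs its own weight in this formulation, which is where the three tuned hyperparameters come from.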

Human Evaluation
We conduct human evaluation to verify the quality of the generated summaries. We hire postgraduate students as annotators and each test sample is evaluated by three annotators. The names of models are blinded to the annotators.

Results of Entity Control
The human annotators evaluate entity-controlled summarization models using the following metrics: (i) fluency: estimating the readability and grammaticality of a summary using a rating from 1 to 5; (ii) faithfulness: a yes/no question indicating whether a summary is factually consistent with the document. The annotators are instructed to state "yes" only if the summary does not contain any factual inconsistencies; and (iii) entity-relevance: evaluating how well a summary retains the key information of the requested entities from 1 to 5.

Reference entities. We ask human annotators to evaluate the quality of summaries when requesting reference entities. For each of the entity-controlled models, we feed in all the reference entities. The overall number of annotators is six. For each of the test samples, we present the input document, requested entities, reference summary, and three system summaries generated by the SD2, D.GPT2, and D.GPT2+CMDP models. We present the evaluation scores on 100 random samples of the CNN/DM dataset in Table 7. For the faithfulness metric, we report the percentage of faithful summaries computed by majority vote (i.e., at least two out of three annotators vote as faithful). Our D.GPT2+CMDP method significantly outperforms the D.GPT2 and SD2 models in terms of entity-relevance (power analysis with mixed effects model (Card et al., 2020), power > 0.99, approx. randomization test, p < 0.0001) while maintaining similar fluency and faithfulness to the SD2 model (approx. randomization test, p > 0.97).
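The majority-vote aggregation for faithfulness can be sketched as follows. The input format (one list of three boolean votes per summary) and the name `faithful_percentage` are assumptions made for illustration.

```python
from typing import Sequence


def faithful_percentage(votes_per_summary: Sequence[Sequence[bool]]) -> float:
    """Percentage of summaries judged faithful by at least two of three annotators."""
    faithful = sum(1 for votes in votes_per_summary if sum(votes) >= 2)
    return 100.0 * faithful / len(votes_per_summary)


votes = [
    [True, True, False],    # faithful by majority
    [False, False, True],   # not faithful
    [True, True, True],     # faithful
    [False, False, False],  # not faithful
]
print(faithful_percentage(votes))
```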
Entities at different positions. We pick the best two models (SD2 and D.GPT2+CMDP) in the previous section to further conduct human evaluation for entities at different sentences of the document. The total number of annotators is four. As mentioned in §5.2, most of the reference entities are located in document sentences 1 to 2. To avoid excessive overlap with the reference entities setting, we do not choose the bin of sentences 1 to 2 and instead conduct evaluation on the subsequent two bins, sentences 3 to 4 and 5 to 6. For each model, we feed in the named entities at document sentences 3 to 4 and 5 to 6 as the requested entities respectively. Since we do not have gold-standard summaries for this setup, we cannot show the reference summaries to the annotators. The results on 100 random samples are shown in Table 8. Our D.GPT2+CMDP model consistently achieves higher entity-relevance scores than the SD2 model (power analysis with mixed effects model, power > 0.81, approx. randomization test, p < 0.0001) and obtains competitive fluency and faithfulness scores (approx. randomization test, p > 0.41).

Results of Abstractiveness Control
The annotators evaluate abstractiveness-controlled models using the following setting. There are six annotators for the results of the CNN/DM dataset and three annotators for the results of the Newsroom-b dataset. For each test sample, we generate two groups of system summaries (group 1 and group 2). For group 1, we use our D.GPT2+CMDP model to generate three different summaries by feeding abstractiveness bin 1, bin 2, and bin 3 respectively. For group 2, we use our PG+CMDP model to generate three different summaries using a similar method. During evaluation, we present the source document, the reference summary, and two groups of system summaries to the annotators. The summaries within each group are randomly shuffled.

Table 9: Results of exact match (EM) and partial match (PM) scores of human abstractiveness rankings that are consistent with the specified bins. The Krippendorff's α inter-rater agreement for the abstractiveness rankings on CNN/DM and Newsroom-b are 0.85 and 0.72 respectively.

Abstractiveness among summaries. We evaluate the abstractiveness of the generated summaries by human judgments using the following setup. For each group of system summaries, we ask the annotators to give a ranking among the three system summaries according to their abstractiveness. For instance, if an annotator thinks that summary 1 > summary 2 > summary 3 in terms of abstractiveness, then the annotator gives a ranking of [3, 2, 1] to them. The abstractiveness rankings from different annotators are then aggregated by averaging. If the aggregated abstractiveness ranking is consistent with the order of our specified abstractiveness bins, then this group of summaries has an exact match. For example, suppose the order of our specified abstractiveness bins is [3, 2, 1]. If the aggregated abstractiveness ranking is [3, 1.6, 1.3], then this group of summaries has an exact match. If the aggregated abstractiveness ranking is [3, 1.3, 1.6], then there is no exact match. Moreover, we investigate whether the summaries of abstractiveness bin 1 and bin 3 can be distinguished by annotators. If the aggregated abstractiveness ranking is consistent with the order of abstractiveness bin 1 and bin 3, then there is a partial match.
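The exact match and partial match criteria can be made concrete with a small sketch. It assumes that "consistent with the order" means a strict ordering (ties do not count as a match), which the text does not spell out; the function names are hypothetical.

```python
from typing import List, Sequence


def aggregate_rankings(rankings: Sequence[Sequence[float]]) -> List[float]:
    """Average the rankings given by different annotators, position by position."""
    n = len(rankings)
    return [sum(column) / n for column in zip(*rankings)]


def exact_match(aggregated: Sequence[float], specified_bins: Sequence[int]) -> bool:
    """True if every pair of summaries is ordered as the specified bins are."""
    size = len(specified_bins)
    return all(aggregated[i] > aggregated[j]
               for i in range(size) for j in range(size)
               if specified_bins[i] > specified_bins[j])


def partial_match(aggregated: Sequence[float], specified_bins: Sequence[int]) -> bool:
    """True if the bin 3 summary is ranked as more abstractive than the bin 1 summary."""
    most = list(specified_bins).index(3)
    least = list(specified_bins).index(1)
    return aggregated[most] > aggregated[least]


bins = [3, 2, 1]
print(exact_match([3, 1.6, 1.3], bins))    # the exact match example from the text
print(exact_match([3, 1.3, 1.6], bins))    # the counterexample from the text
print(partial_match([3, 1.3, 1.6], bins))
```

Under these assumptions, the second ranking fails the exact match test but still separates bin 1 from bin 3, so it counts as a partial match.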
Suppose the order of our specified abstrac-

We analyze the exact match and partial match scores of abstractiveness-controlled models as follows. The results on 100 random test samples of the CNN/DM and Newsroom-b datasets are presented in Table 9. We observe that our models achieve very high partial match scores on both datasets, but our models on the CNN/DM dataset obtain lower exact match scores than those on the Newsroom-b dataset (approx. randomization test, p < 0.02). This is because the CNN/DM dataset is extractive in nature. Hence, it is more difficult to learn three levels of abstractiveness on CNN/DM. Nonetheless, our models can still achieve more than 60% exact match scores.

Quality of individual summaries. Next, we ask the annotators to evaluate the quality of the summaries of three different abstractiveness bins using the following metrics: (i) fluency: measuring the readability of a summary from 1 to 5; (ii) faithfulness: a yes/no question asking whether a summary is factually consistent with the document; and (iii) relevance: evaluating how well a summary retains the salient information of the document from 1 to 5. The results of 100 random test samples from the Newsroom-b dataset are presented in Table 10. When using abstractiveness bin 1 (lowest level), all the models achieve significantly higher fluency, relevance, and faithfulness (approx. randomization test, p < 0.005). The scores of all these metrics drop substantially for abstractiveness bin 2 and bin 3 because paraphrasing is more challenging than copying. Figure 5 illustrates sample summaries generated by our D.GPT2+CMDP model on the Newsroom-b dataset. We observe that the generated summary of bin 3 has a factual error, which is italicized in the figure.

Figure 5 (sample summaries):
Reference Summary: Manchester United have announced a 10-year contract with German manufacturers Adidas to be the club's new kit sponsor for a record-breaking minimum £750m.
Abstractiveness bin 1 (least abstractive): Manchester United have announced a 10-year contract with the German manufacturers Adidas to be the club's new kit sponsor for a record-breaking minimum £750m.
Abstractiveness bin 2 (medium abstractive): Manchester United have announced a 10-year contract with Adidas to sponsor new kit sponsor for £750m club record signing.
Abstractiveness bin 3 (most abstractive): Manchester United are hoping to secure £750m sponsorship deal with German company after contract is signed in 2015-16 season.

Conclusion
We propose a novel CMDP training framework for controllable text summarization. Our framework imposes constraints on the training objective to explicitly prevent the output summaries from violating the requirement specified by users. Moreover, we apply our framework to control key summarization attributes such as length, covered entities, and abstractiveness of the summaries, and we devise specific constraints to restrict each of these attributes. Empirical studies on popular benchmarks demonstrate that our framework significantly improves the capability of controllable summarization models to conform to the desired attribute requirement.
In our framework, we can set hard constraints without tuning threshold values. For instance, we set the threshold of our length bin constraint to 0 to disallow the violation of the length bin requirement. Compared to the weights of penalties in the MDP framework, the threshold value in a soft constraint is also easier to set. For example, the goal of entity control is to generate a summary that presents the key information of the requested entities, which implies that the generated summaries should obtain a high QA-F1 score. The range of the QA-F1 score is [0, 1]. To encourage the generated summaries to obtain a high QA-F1 score, the threshold for the QA-F1 score should be close to 1, which gives us a clue about how to set the threshold value. On the other hand, the MDP framework does not give us any clues for setting the values of the penalty weights. In summary, our CMDP framework needs to tune one threshold value for entity control and does not need to tune any threshold for the other control settings, whereas the numbers of penalty weights to be tuned in the MDP framework are 2, 3, and 3 for length, entity, and abstractiveness control respectively.

A Appendix
A.1 Output Samples for Length Control
Figure 6 presents sample summaries generated by our D.GPT2+CMDP model using different length bins on the DUC-2002 testing set. We observe that our model discards secondary information when given a shorter length budget.

A.2 Training data for the QA Model
We construct question-context-answer triplets to train a QA model. We individually mask each named entity in a reference summary to create a cloze question, and the masked entity is its answer. The reference summary is used as the context. For example, suppose the reference summary $y^*$ is "Arsenal beat Chelsea 3-1 yesterday."; then we construct two cloze questions, $q_1$ = "[MASK] beat Chelsea 3-1 yesterday." and $q_2$ = "Arsenal beat [MASK] 3-1 yesterday.", and two answers, $a_1$ = "Arsenal" and $a_2$ = "Chelsea". After that, we obtain two question-context-answer triplets, $(q_1, y^*, a_1)$ and $(q_2, y^*, a_2)$.
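The construction of cloze triplets can be sketched in a few lines. The sketch assumes that the named entities have already been identified by some NER step and are passed in as strings, and that only the first occurrence of each entity is masked; `build_cloze_triplets` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def build_cloze_triplets(context: str,
                         entities: Sequence[str],
                         mask: str = "[MASK]") -> List[Tuple[str, str, str]]:
    """Create one (question, context, answer) triplet per named entity."""
    triplets = []
    for entity in entities:
        if entity in context:
            # Assumption: mask only the first occurrence of the entity.
            question = context.replace(entity, mask, 1)
            triplets.append((question, context, entity))
    return triplets


reference = "Arsenal beat Chelsea 3-1 yesterday."
for triplet in build_cloze_triplets(reference, ["Arsenal", "Chelsea"]):
    print(triplet)
```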
Since the constructed cloze questions are too similar to the corresponding reference summaries, using only reference summaries as the context in our training data would encourage the QA model to rely only on surface clues to extract answers. To alleviate this problem, we use the method of Chen and Bansal (2018) to extract a pseudo reference summary $\tilde{y}$ from the source document. Then we use $\tilde{y}$ as the context to construct another set of question-context-answer triplets $\{(q_i, \tilde{y}, a_i)\}$. The pseudo reference summary includes the document sentences that achieve the highest ROUGE-L recall with the reference summary. We discard a triplet if $\tilde{y}$ does not contain all the named entities in the reference. To keep the training data balanced, we only keep the training triplets $(q_i, y^*, a_i)$ that have a corresponding pseudo reference triplet $(q_i, \tilde{y}, a_i)$.

Figure 6 (sample summaries by length bin):
Length bin 0: Hurricane Gilbert slams into Kingston on Monday. 115 mph winds cause flash floods and mud slides. Jamaica expected to receive 10 inches of rain. Hurricane warnings are canceled.
Length bin 3: NEW: Hurricane Gilbert slams into Kingston on Monday with 115 mph winds. No serious injuries were reported in the city of 750,000. The hurricane lashed Kingston's airport and aircraft. Jamaica will receive 10 inches of rain. Hurricane warnings are canceled.
Length bin 6: NEW: Hurricane Gilbert slams into Kingston on Monday with 115 mph winds. No serious injuries were reported in the city of 750,000 people. The hurricane hit Kingston's airport and aircraft parked on its fields. The National Weather Service reports heavy damage to Kingston's airport. Jamaica will receive 10 inches of rain.
Length bin 9: NEW: "The eye is going to move lengthwise across that island," a man says. Hurricane Gilbert slams into Kingston on Monday with 115 mph winds. The National Weather Service reports heavy damage to Kingston's airport and aircraft parked on its fields. "People were running around in the main lobby of our hotel," a man says. Jamaica is expected to receive 10 inches of rain. Hurricane warnings are canceled in Cuba.
To allow the QA model to give a prediction of "unanswerable" to low-quality summaries, we construct two types of unanswerable training samples: irrelevant training samples and repeated-entity training samples. For irrelevant training samples, we select document sentences that do not contain the reference entities and have a low textual overlap with the reference summary (ROUGE-L recall ≤ 0.2). For repeated-entity training samples, we find the sentences in the reference summary that contain two named entities and repeat one of their named entities. We treat such samples as unanswerable since they contain factual inconsistencies. Overall, our training data consists of 109,815 unanswerable samples and 239,838 answerable samples. We will release our training data for the QA model.
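The selection rule for irrelevant training samples can be illustrated with a minimal sketch. It computes ROUGE-L recall as the longest common subsequence length divided by the reference length, using lowercased whitespace tokens and plain substring matching for entities; these simplifications and all function names are assumptions, and a standard ROUGE implementation may tokenize differently.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: str, reference: str) -> float:
    """LCS length divided by the number of reference tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


def is_irrelevant(sentence: str, reference: str,
                  reference_entities: Sequence[str],
                  max_recall: float = 0.2) -> bool:
    """True if the sentence mentions no reference entity and overlaps little with the reference."""
    mentions_entity = any(entity in sentence for entity in reference_entities)
    return (not mentions_entity) and rouge_l_recall(sentence, reference) <= max_recall
```

A document sentence that passes `is_irrelevant` would be paired with a cloze question and labeled unanswerable under this sketch.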