MACSum: Controllable Summarization with Mixed Attributes

Abstract

Controllable summarization allows users to generate customized summaries with specified attributes. However, due to the lack of designated annotations of controlled summaries, existing work has to craft pseudo datasets by adapting generic summarization benchmarks. Furthermore, most research focuses on controlling single attributes individually (e.g., a short summary or a highly abstractive summary) rather than controlling a mix of attributes together (e.g., a short and highly abstractive summary). In this paper, we propose MACSum, the first human-annotated summarization dataset for controlling mixed attributes. It contains source texts from two domains, news articles and dialogues, with human-annotated summaries controlled by five designed attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization, based on hard prompt tuning and soft prefix tuning. Results and analysis demonstrate that hard prompt models yield the best performance on most metrics and in human evaluations; however, mixed-attribute control is still challenging for summarization tasks. Our dataset and code are available at https://github.com/psunlpgroup/MACSum.


Introduction
Text summarization is the task of compressing the input text into a concise and coherent version that preserves salient information. There has been substantial progress in generic summarization, which generates one overall summary for each input (McKeown and Radev, 1995; Erkan and Radev, 2004; Rush et al., 2015; Cheng and Lapata, 2016; See et al., 2017; Paulus et al., 2018). However, readers have diverse preferences when summarizing the same article, such as the topics, speakers, or length of the summary (Fan et al., 2018; Zhong et al., 2021; Goyal et al., 2022b). Therefore, generating customized summaries to meet different preferences is a natural capability for summarization systems.
Due to the lack of a human-annotated controllable summarization benchmark, existing research has to adapt generic datasets to create pseudo-controllable summaries (Fan et al., 2018; He et al., 2020; Zhong et al., 2021; Goyal et al., 2022b; Chan et al., 2021). He et al. (2020), for example, extract topics from a generic summary and assume the summary is controlled by the extracted topics in order to evaluate topic-controlled summarization. However, this adaptation-based setting raises three issues. First, the adapted summaries are not actually written under the guidance of the designed control attributes. Second, this conversion can only build one target summary for each source, while it is preferable to have summaries with different control attributes for the same input article. Third, for attributes like Extractiveness or Specificity, there are no straightforward adaptation methods.
Meanwhile, previous studies mostly focus on controlling a single attribute, e.g., generating a short summary or a highly abstractive summary. However, mixing different control attributes is more challenging and underexplored (Russo et al., 2020). For example, Figure 1 shows a case of mixed-attribute control that requires a short summary regarding "Pope Francis", or a highly extractive and highly specific summary on the topic "blood moon". Users can simultaneously control different attributes in the generated summary.
In this paper, we propose MACSUM, a human-annotated benchmark for controllable summarization with mixed attributes. In MACSUM, source texts are collected from both the news and dialogue domains. We define five control attributes of summarization by synthesizing previous studies (Chan et al., 2021; Liu et al., 2018; Fan et al., 2018): Length (Len), Extractiveness (Ext), Specificity (Spe), Topic (Tpc), and Speaker (Spk). For each input source, we sample a set of different combinations of these attributes for human annotation. The resulting MACSUM dataset contains a rich set of human-written summaries for the same input with different mixtures of control attributes. Table 1 compares MACSUM with previous work; MACSUM is the first to mix these five attributes with human annotations, covering both dialogue and document source texts.
Furthermore, to establish baselines for mixed-attribute control, we design two simple yet effective frameworks that can steer a summarization model through either hard prompts (HP) or soft prefixes (SP), inspired by prompt learning (Raffel et al., 2020; Li and Liang, 2021). For each value of a control attribute (e.g., long length), the HP framework prepends the description of the control attribute (e.g., "Length: Long") to the input source as a hard prompt, while the SP framework assigns a set of external parameters, called soft prefixes, to the model. The summarization model is trained to summarize an input given hard prompts/soft prefixes for different control signals. We evaluate these baseline models on MACSUM with two proposed automatic evaluation metrics measuring the quality of control. Our results and analysis in two domains reveal that the HP framework yields the best performance on all automatic metrics and human evaluations; however, mixed-attribute control is still challenging.
[Figure 1: An example of mixed-attribute control on a news document. Given the same source article, different attribute sets (e.g., Topic: blood moon; Length: Normal; Extractiveness: High; Specificity: High vs. Topic: pope francis; Length: Short; Extractiveness: Normal; Specificity: Normal) yield different summaries.]

Related Work

… quests, but it does not explicitly control the output style. Furthermore, interactive summarization (Bornstein et al., 1999; Leuski et al., 2003) and reinforcement-learning-guided summarization (Paulus et al., 2018; Böhm et al., 2019; Stiennon et al., 2020) have been used to incorporate human preferences and feedback, yet the human feedback explored so far is largely limited to the generic quality of summaries rather than fine-grained attributes. Goyal et al. (2022b) investigate multi-feature control by mixing multiple decoders, yet their solution is based only on decoding improvements, which yield suboptimal controlling performance. Therefore, most previous works are over-specialized for controlling particular attributes, while controlling multiple attributes is still underexplored. Furthermore, existing works are mostly evaluated on pseudo datasets adapted from generic summarization datasets.

Prompt Learning
Prompt learning was first proposed with GPT-3 (Brown et al., 2020), where large pretrained language models can perform desired tasks with the guidance of instructions and examples. Some efforts explore prompt-tuning in natural language by converting original inputs into cloze-style questions and then tuning language models (Shin et al., 2020; Schick and Schütze, 2021; Chen et al., 2022; Min et al., 2022). However, most of them focus on natural language understanding tasks and usually need a careful selection of prompts. Instead of using human-crafted tokens, other work explores using continuous vectors as prompts (Lester et al., 2021; Qin and Eisner, 2021; Liu et al., 2021; Li and Liang, 2021). Among them, prefix-tuning is designed specifically for text generation (Li and Liang, 2021): it prepends trainable vectors to each layer of a language model as prefixes and keeps the other parameters frozen during training. In this work, we propose two methods for mixed-attribute controllable summarization based on prompt-tuning and prefix-tuning, respectively.

The MACSUM Dataset
To provide a benchmark for controllable summarization, we propose MACSUM, a high-quality human-annotated mixed-attribute controlled summarization dataset. Inspired by several previous studies on controllable generation (Chan et al., 2021; Liu et al., 2018; Fan et al., 2018), MACSUM is annotated with five types of control attributes: Topic, Speaker, Length, Extractiveness, and Specificity (Section 3.1). As shown in Figure 1, these five attributes can be combined in various ways (Section 3.1). Moreover, Topic and Speaker can take multiple values, i.e., more than one speaker or topic to focus on. In annotation, we require the corresponding summary to fulfill all attributes together.
The data annotation pipeline is divided into four steps (Figure 2). First, we carefully select the source texts from several widely used summarization datasets in the news and dialogue domains. Second, automatic tools are leveraged to form a pool of candidate attributes as guidance for the next step. Third, annotators manually label the five attributes to form a control attribute set, repeating the process multiple times per source. Finally, the annotators write a summary that meets each control attribute set.

Control Attributes
MACSUM provides five attributes to control the summary generation.
Topic (Tpc) indicates certain contents of the source that users are particularly interested in. The summary should only contain contents that are related to the given topics. We provide multiple keywords, such as "remote control, financial information", for annotators as candidate topics.
Speaker (Spk) indicates certain speakers in a dialogue whose content is preferred by the user.In MACSUM (MAC-Dial only), this is specified by giving the name of certain speakers, such as "Program Manager".
Length (Len) indicates the number of words in the summary, serving as a time budget for users reading the summary. In MACSUM, Len is controlled by three values: [short, normal, long]. Our annotation guideline provides a reference range of compression ratio and word count for each length value.
Extractiveness (Ext) describes the proportion of the summary that is extracted from the source text. This is useful because users sometimes want content directly extracted from the source, while at other times they want more abstractive and more readable results. In MACSUM, Ext is controlled by the values [normal, high, full].

Specificity (Spe) indicates the level of detail in the summary. In MACSUM, Spe is controlled by the values [normal, high], where normal is the natural specificity and high requires more specific contents. Specificity differs from Length: Length is the number of words, while Specificity is the density of descriptive contents (e.g., numbers, entities, and names). Thus, a short summary can have a higher Specificity than a long one.
MACSUM supports Mixed-Attribute Control because it is a natural need for users to control multiple aspects at the same time, e.g., wanting a summary that is short, highly extractive, and only about certain topics. To this end, as shown in Figure 1, the samples in MACSUM can control multiple attributes simultaneously. We require the annotated summaries to meet all requirements at the same time. If some combinations are considered too difficult to fulfill, we allow annotators to skip them in rare cases. We provide the detailed distribution of attributes in Figure 3.

Annotation Pipeline
Source Selection MACSUM covers source texts from both document and dialogue summarization tasks. We pick CNNDM (Hermann et al., 2015) as the document dataset and QMSum (Zhong et al., 2021) as the dialogue dataset. CNNDM is a large-scale document summarization dataset containing news stories along with their corresponding highlights, collected from the CNN and Daily Mail websites. QMSum is a popular query-based meeting summarization dataset containing the transcripts of three domains: AMI, ICSI, and committee meetings of the Welsh Parliament and the Parliament of Canada. For CNNDM, we randomly pick 10k documents from the test set for annotation. For QMSum, we first split each meeting into shorter units according to the topic partition and discard units longer than 5000 tokens.
Attribute Candidate Extraction For Topic, we first use a keyword extraction tool (Boudin, 2016) to extract the top 20 keywords from the source text as candidates. For Speaker, we collect all speakers in the source text to form a candidate set. For the remaining attributes (Length, Extractiveness, and Specificity), we generate their values and combinations randomly from a uniform distribution, mimicking the behavior of users with diverse needs for customized summaries.
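As an illustration, the uniform sampling of Len, Ext, and Spe values can be sketched as follows (a minimal sketch; the value lists come from the annotation criteria described later in this paper, and the helper name is hypothetical):

```python
import random

# Value lists for the three scalar attributes, taken from the
# annotation criteria described in this paper.
LEN_VALUES = ["short", "normal", "long"]
EXT_VALUES = ["normal", "high", "full"]
SPE_VALUES = ["normal", "high"]

def sample_attribute_set(rng=random):
    """Draw one value per attribute uniformly at random, mimicking
    users with diverse needs for customized summaries."""
    return {
        "Length": rng.choice(LEN_VALUES),
        "Extractiveness": rng.choice(EXT_VALUES),
        "Specificity": rng.choice(SPE_VALUES),
    }
```

Repeating this sampling several times per source text produces the different attribute combinations ("samples") described below.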

Attribute Generation
We hire four native English speakers as annotators. The annotators can either freely choose topics from the candidate topics or write keywords themselves. For the Speaker attribute, we ask the annotators to pick one or more names from the candidate set. Length, Extractiveness, and Specificity are automatically filled with randomly generated values.
Attribute generation is repeated several times for each source to form various attribute combinations, called samples. Overall, each source text yields eight samples for every two thousand words.

Summary Generation
We first ask all annotators to read our annotation guideline and 10 annotated examples. Afterward, given several combinations of control requirements, i.e., the control attribute sets, the annotators follow our guidance and write a summary for each combination.
We also ask them to annotate the related text spans for use in future work, such as retrieval-based methods. Related text spans are the turns/sentences in the source that are most relevant to the golden summary; they are the minimum necessary turns/sentences the annotators need to produce the complete summary. Finally, the annotators read the summary again for quality assurance, and we ask them to write a short title for the summary, e.g., "discussion of remote control style". This is helpful for future work such as title generation, and it also gives us a quick way to verify that the annotators read their own summaries.
Quality Control First, we control annotation quality through a careful pilot test. Before the annotation process starts, annotators are selected via this test: we assign each candidate the same three input texts with various mixed attributes and choose qualified annotators according to their annotation results.
Second, we conduct a weekly sampling inspection to monitor annotation quality. We collect the results weekly and provide feedback to the annotators to ensure quality.

Automatic Metrics
Overview Along with the annotated benchmark, we design a system of automatic metrics for evaluating a model's capability to generate controllable summaries. For each attribute, we define its own attribute metric function to represent the degree of control. We then propose Control Error Rate (CER) and Control Correlation (CC). CER measures the distance between the generated and golden summaries in terms of their degrees of control using the attribute metric functions; a good model should have a smaller CER ↓. CC measures the distribution of the attribute metric functions among generated summaries with different attribute values, representing the model's capability to correlate with the definition of the control attribute; a good model should have a CC distribution similar to that of the golden summaries ↕. In addition, we report the F-1 of ROUGE-1/2/L (Lin, 2004) for the general quality of the summary ↑.
Definition For a control attribute r with attribute metric function f_r, given a predicted summary ŷ and golden summary y, Control Error Rate (CER) is defined as:

CER(ŷ, y) = |f_r(ŷ) − f_r(y)| / (f_r(y) + ϵ),

where ϵ is a small value to avoid division by zero when f_r(y) is zero.
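The per-sample CER computation can be sketched as follows (a minimal sketch assuming the metric gap is normalized by the gold metric value; the function name is hypothetical):

```python
def control_error_rate(f_pred, f_gold, eps=1e-8):
    """Sketch of CER for one sample: the gap between the attribute
    metric of the predicted and golden summaries, normalized by the
    gold value (eps avoids division by zero)."""
    return abs(f_pred - f_gold) / (f_gold + eps)
```

For example, a predicted length metric of 10 tokens against a gold metric of 20 tokens yields a CER of 0.5.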
Additionally, for a control attribute r (e.g., Len) with a control value pair [v_1, v_2] (e.g., [short, long]) and predicted summaries ŷ_1 and ŷ_2 for these two values, Control Correlation (CC) is defined as:

CC(ŷ_1, ŷ_2) = (f_r(ŷ_1) − f_r(ŷ_2)) / Distance(v_1, v_2),

where Distance(v_1, v_2) calculates the signed distance from control value v_1 to v_2, which can be negative. For instance, Distance(high, normal) = 1, and Distance(short, long) = −2. When CC is above/below 0, the evaluated model has a positive/negative correlation with the control objective. CER and CC for multiple samples are reported as their arithmetic means.
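Likewise, the Distance function and CC computation can be sketched as follows (a minimal sketch; the ordinal positions and function names are assumptions consistent with the Distance examples above):

```python
# Ordinal positions for control values; Distance(v1, v2) is the signed
# difference of positions, matching the examples in the text.
ORDINAL = {"short": -1, "normal": 0, "long": 1, "low": -1, "high": 1}

def distance(v1, v2):
    """Signed distance from value v2 to value v1, e.g.
    distance('high', 'normal') = 1 and distance('short', 'long') = -2."""
    return ORDINAL[v1] - ORDINAL[v2]

def control_correlation(f1, f2, v1, v2):
    """Sketch of CC for one value pair: the change in the attribute
    metric divided by the signed value distance."""
    return (f1 - f2) / distance(v1, v2)
```

A positive result means the metric moved in the same direction as the requested control value, matching the interpretation given above.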
For each of the five control attributes, we define an attribute metric f_r that maps a summary to a real number representing the degree of control. Topic: f_Tpc is the proportion of topic keywords that appear in the summary. Speaker: f_Spk is the number of tokens spoken by the selected speakers divided by the total number of tokens in the summary. Length: f_Len is the number of tokens in the summary. Extractiveness: f_Ext is the average of the ROUGE-2 precision and ROUGE-3 precision (Lin, 2004) of the generated summary against the input. For Specificity, inspired by previous studies (Resnik, 1995; Amplayo et al., 2021), we find that verbs, nouns, numerals, and the total number of tokens carry the most significant information about specificity, so f_Spe is computed from the counts of these token types normalized by the summary length.

As presented, annotated summaries with different control attribute values are distinguished from each other by a large margin. For example, samples with Len: long have a much longer summary, and samples with Ext: full have a higher extractiveness metric. This verifies the high annotation quality of MACSUM and also shows that our proposed attribute metrics are consistent with the control objective of each attribute.
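Simplified versions of several attribute metric functions can be sketched as follows (a minimal sketch over pre-tokenized text; f_extractiveness here uses plain n-gram precision as a stand-in for ROUGE-2/3 precision, and all helper names are hypothetical):

```python
def f_topic(summary_tokens, topic_keywords):
    """Proportion of topic keywords that appear in the summary."""
    present = set(summary_tokens)
    return sum(k in present for k in topic_keywords) / len(topic_keywords)

def f_length(summary_tokens):
    """Number of tokens in the summary."""
    return len(summary_tokens)

def _ngram_precision(summary_tokens, source_tokens, n):
    """Fraction of summary n-grams that also occur in the source."""
    grams = [tuple(summary_tokens[i:i + n])
             for i in range(len(summary_tokens) - n + 1)]
    source = {tuple(source_tokens[i:i + n])
              for i in range(len(source_tokens) - n + 1)}
    return sum(g in source for g in grams) / max(len(grams), 1)

def f_extractiveness(summary_tokens, source_tokens):
    """Average of 2-gram and 3-gram precision against the source, a
    simplified stand-in for the ROUGE-2/3 precision used in the paper."""
    return (_ngram_precision(summary_tokens, source_tokens, 2)
            + _ngram_precision(summary_tokens, source_tokens, 3)) / 2
```

A fully copied summary scores 1.0 on f_extractiveness, while a summary sharing no bigrams or trigrams with the source scores 0.0.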
Mixed-Attribute Distributions Figure 3 shows the ratio of different combinations of the control attributes, illustrating the diverse combinations of mixed-attribute summaries obtained by controlling Len, Ext, and Spe together in one sample.

Methods
To establish baseline results on MACSUM, we propose three models following previous research on controllable text generation with prompt learning. With the same input and different prompts, a large pretrained model is able to generate different results for different tasks, such as summarization and translation (He et al., 2020; Fan et al., 2018; Raffel et al., 2020). As shown in Figure 4, we leverage two types of prompt learning approaches to control the attributes of summaries, namely hard prompts (HP) and soft prefix tuning (SP). We also test their combination, namely HP+SP.
Hard Prompt (HP) uses the description of the control attributes as a hard prompt. Each attribute is formatted as "Attribute: Value", where "Attribute" is one of "Topic, Speaker, Length, Extractiveness, Specificity" and "Value" is the corresponding value of the attribute (e.g., High or Normal). We concatenate the five control attributes using ";" and prepend the result to the input source.
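The hard prompt construction can be sketched as follows (a minimal sketch; the exact spacing and separator formatting are assumptions, and the function name is hypothetical):

```python
def build_hard_prompt(attributes, source):
    """Sketch of the HP input format: 'Attribute: Value' pairs joined
    by ';' and prepended to the source text."""
    prompt = "; ".join(f"{name}: {value}" for name, value in attributes.items())
    return f"{prompt} {source}"
```

For example, `build_hard_prompt({"Topic": "blood moon", "Length": "Short"}, article)` prepends "Topic: blood moon; Length: Short" to the article text.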
Soft Prefix (SP) follows Li and Liang (2021). We prepend external trainable parameters to both the encoder and decoder to control the summarization model. For controlling Len, Ext, and Spe, we assign m prefix embeddings to each attribute value, where the hyper-parameter m is the prefix length; readers can refer to Li and Liang (2021) for implementation details. For example, for Len: Long, we assign E_{Len:long} = [e^1_{Len:long}, …, e^m_{Len:long}], where each e^j_i is a vector with the dimension of a word embedding. To control an input case with a set V of mixed requirements, we sum the embeddings of all control attribute values: E = Σ_{v ∈ V} E_v. For controlling Tpc and Spk, we use the embeddings of the input topic words E_Tpc and input speaker names E_Spk. The list of embedding vectors E is then prepended to each layer of the Transformer-based summarization model as external key/value vectors in its self-attention operations; E_Tpc and E_Spk are prepended only to the input layer.
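The summation of prefix embeddings over a set of control values can be sketched as follows (a minimal sketch with hypothetical names; the actual prefix-tuning machinery follows Li and Liang (2021)):

```python
import numpy as np

def mixed_prefix(prefix_table, control_values):
    """Sketch of soft-prefix mixing: elementwise sum of the m prefix
    embeddings of every selected attribute value (E = sum of E_v
    over v in V)."""
    values = list(control_values)
    E = np.zeros_like(prefix_table[values[0]])
    for v in values:
        E = E + prefix_table[v]
    return E

# Example with a toy prefix length m=4 and embedding dimension 8.
rng = np.random.default_rng(0)
table = {v: rng.normal(size=(4, 8))
         for v in ("Len:long", "Ext:high", "Spe:normal")}
E = mixed_prefix(table, ["Len:long", "Ext:high"])
```

In the real model, the resulting vectors are prepended to each Transformer layer as external key/value vectors rather than summed into the input embeddings.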
Hard Prompt + Soft Prefix (HP+SP) combines both approaches by prepending the hard prompt of five attributes in HP and using prefix tuning in SP.

Experiments
In this section, we present the implementation details, experimental results, and human evaluation of models on the MACSUM dataset.

Implementation Details
We use PyTorch and the Huggingface library (Wolf et al., 2019) to implement our models. The experiments are conducted on 8 A100 GPUs.
We use BART (Lewis et al., 2020) as the backbone model. We also use a vanilla BART trained without control attribute input as a weak baseline (Appendix A). Unless otherwise mentioned, we initialize the backbone with BART-large-cnn and then fine-tune it on the MACSUM dataset. We pick a learning rate of 3e-5 after searching over {1e-5, 3e-5, 1e-4}. Additionally, n-gram blocking is set to 3, and we use the AdamW optimizer with 500 warmup steps. Dialogue inputs are flattened by separating turns with "</s>", which we find yields better results.
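The dialogue flattening step can be sketched as follows (a minimal sketch assuming BART's "</s>" separator token; the helper name is hypothetical):

```python
def flatten_dialogue(turns, sep="</s>"):
    """Join dialogue turns into a single input string, inserting the
    model's separator token between turns."""
    return f" {sep} ".join(turns)
```

For example, the two-turn dialogue ["A: hi", "B: hello"] is flattened to a single sequence with one separator between the turns.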

Experiment Results
As mentioned in Section 3.3, we calculate the Control Error Rate (CER) and Control Correlation (CC) metrics to evaluate control quality, and we also report ROUGE scores to evaluate summarization quality. A model's performance is better when its CER is lower ↓, its ROUGE is higher ↑, and its CC is closer to that of the golden summary ↕.
Table 4 shows the results on MAC-Doc. The HP model obtains the highest performance on both CER and CC across all five control attributes. Compared with the HP model, the SP model has similar control ability on Ext and Spe. However, it does not perform well on Len and Tpc. This could be because the pretrained checkpoint has already learned some knowledge about the length-related hard prompt before training (Section 6.3).
Table 5 displays the results on MAC-Dial. Similar to MAC-Doc, the HP model obtains the highest scores on most of the metrics. However, the overall performance on Length decreases because using the pretrained CNNDM checkpoint does not lead to performance gains in the dialogue domain (Section 6.3).
It is worth noting that CER should not be compared across datasets, because its scale differs between datasets. For example, the uncontrolled BART baseline obtains a CER of 1.177 for Ext on MAC-Doc but 0.544 on MAC-Dial.

Human Evaluation
Although automatic metrics usually provide a speedy comparison, they cannot easily evaluate the quality of control, especially mixed-attribute control. Thus, we also conduct a human evaluation of the controlled summaries.
Evaluation Method We hire two evaluators with expertise in English and text summarization. We show them randomly selected summaries generated by different systems, along with the source text and control attributes. The evaluators answer a yes/no question: "For the given summary, does it follow the control requirement of this attribute?" Specifically, we select golden summaries, summaries generated by the HP model, and summaries generated by the HP+SP model. For each model, we pick 30 samples from MAC-Doc and MAC-Dial separately, resulting in 180 summaries in total. Furthermore, we compute Cohen's kappa (Cohen, 1960) to measure the agreement between evaluators.

Evaluation Results
Table 6 shows the human evaluation results. Each number (except for Kappa) is the count of yes answers divided by the total count of questions, indicating the control ability of the model. As shown, the HP model performs better than HP+SP on most of the attributes. This result confirms that our proposed CER and CC are consistent with human evaluation.
Besides, golden summaries always rank first, and the kappa score between the two evaluators is over 0.8. These results also verify the high annotation quality of MACSUM, because human evaluators agreed that the golden summaries followed the control requirements most of the time.

Analysis and Discussion
For a deeper understanding of the task of mixed-attribute controllable summarization on MACSUM, we conduct analyses of attribute difficulty, attribute dependency, and model pretraining, and present several example outputs as case studies.

Difficulty of Controlling Attribute Values
Models face different difficulties in controlling certain attribute values, as some values are easier or harder to control than others. We analyze this by comparing the CER of the HP model's outputs for different attribute values. As shown in Figure 5, for MAC-Doc, the system obtains a higher CER on Len: normal samples than on the other two values of Len, showing that normal is more difficult to control; the hardest value to control for both Ext and Spe is high. For MAC-Dial, the hardest values in controlling Len, Ext, and Spe are short, normal, and high, respectively.

Dependency of Attributes
In mixed-attribute controllable summarization, we notice interesting dependencies among attributes: changing one attribute influences the others.
To analyze this, we randomly select 200 samples from the test set for each attribute and randomly change that attribute to another value to form a new sample (e.g., from Len: long to Len: short). Then the same HP model, without further training, is used to generate summaries for these new samples. We evaluate the difference between the newly predicted summaries ŷ′ and the originally predicted summaries ŷ via CER(ŷ′, ŷ).
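The attribute-swap procedure can be sketched as follows (a minimal sketch; `generate` and `cer` are hypothetical stand-ins for the trained HP model and the CER metric):

```python
import random

def swap_attribute(sample, attr, values, rng):
    """Replace one attribute with a different, randomly chosen value."""
    flipped = dict(sample)
    flipped[attr] = rng.choice([v for v in values if v != sample[attr]])
    return flipped

def dependency_probe(samples, attr, values, generate, cer, rng):
    """Flip one attribute per sample, regenerate with the same frozen
    model, and average CER(new prediction, original prediction)."""
    deltas = []
    for sample in samples:
        y_orig = generate(sample)
        y_new = generate(swap_attribute(sample, attr, values, rng))
        deltas.append(cer(y_new, y_orig))
    return sum(deltas) / len(deltas)
```

A large average CER(ŷ′, ŷ) for an attribute indicates that flipping it strongly perturbs the other measured attributes, i.e., a high dependency.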
Figure 6 shows the resulting change. As can be seen, for MAC-Doc, Len has the highest dependency toward the other attributes, while Spe has the lowest. For MAC-Dial, Ext has the highest dependency, while Spe has the lowest. We believe this is because the model on MAC-Doc has strong control ability for Len, so changing the value of Len has a larger influence on the other attributes.

Effect of Pretraining
We investigate the effect of pretraining on the control ability of summarization models. For two HP models initialized from BART-large and BART-large-cnn respectively, we compare their results after fine-tuning them on both MAC-Doc and MAC-Dial.
As shown in Table 7, for MAC-Doc, the BART-large-cnn-initialized model is able to control the length substantially better than the vanilla BART-large-initialized model. In contrast, for MAC-Dial, the advantage of the BART-large-cnn checkpoint is negligible: using it only slightly influences the control ability for all attributes. We believe the reason is that CNNDM pretraining provides useful information for the model to learn to control attributes on news articles.

Case Study
We show three case studies in Table 8, discussing three typical phenomena in mixed-attribute controllable summarization: Topic Defocus, Length against Specificity, and Extractiveness against Readability.

Gold
They quickly reopened the University of Mosul, under a radically altered curriculum. Some subjects would be banned - democracy and political thought, hotel management, tourism and archaeology. ISIS allows girls to go to school, in a segregated environment.

HP
The Taliban forbids all girls' education. But ISIS allows girls to go to school, albeit in a segregated environment.

Topic Defocus In Table 8 Case 1, MACSUM asks for a summary focusing on the topic of "education". Although the human-annotated summary does not contain the topic word, its contents are still highly related to "education". This shows that human annotators have the flexibility to conduct high-level summarization of the topic.
In contrast, although the model-generated summary contains the topic word, its content is poorly structured. This shows the challenge of topic defocus, a phenomenon where models rely too much on explicitly including the topic words when generating topic-controlled summaries.
Length against Specificity Another challenge is the contradiction between long length and low specificity. Long summaries contain more tokens and inevitably invite more specific information. Conversely, short summaries only describe core events in a few words and are naturally biased towards low specificity. As shown in Table 8 Case 2, when Len is short and Spe is high, both the HP- and HP+SP-generated summaries are longer than the human-annotated summary.
Extractiveness against Readability As shown in Table 8 Case 3, when Ext is full, the model-generated summaries are choppy and unnatural, in particular for dialogues. When humans are asked to annotate fully extractive summaries, they may have to write unnatural sentences, and this phenomenon is amplified in a trained summarization system. As shown in the table, the HP+SP-generated summary is not grammatical and consists of short phrases instead of complete sentences. This can be explained by the fact that the complicated dialogue discourse structure and frequent interactions between different interlocutors make salient information sparse.

Conclusion
We propose MACSUM, a high-quality human-annotated benchmark for mixed-attribute controllable summarization. It contains five types of control attributes: Topic, Speaker, Length, Extractiveness, and Specificity. To the best of our knowledge, MACSUM is the first dataset with mixed attributes as well as human annotations. We explore hard prompt and soft prefix models and evaluate them on MACSUM. Results and analysis demonstrate that hard prompt models yield the best performance, and also show that this is a challenging task, as a large gap between machine learning models and humans still exists. Future work can design more effective models for the mixed-attribute controllable summarization task, or explore mixed-attribute control on other generation tasks.

A Implementation Details
We list the implementation details for models.
HP+SP For HP+SP on MAC-Doc, we first load the trained HP model, set different learning rates for the language model and the prefixes (3e-5 and 1e-6, respectively), and remove the Len prefix from the model. This is because the HP model already controls Len-related attributes very well thanks to the pretrained BART-large-cnn checkpoint; using prefix tuning or tuning the language model with a large learning rate would hurt this performance (Section 6.3). For HP+SP on MAC-Dial, we only set the different learning rates, but we do not load the HP checkpoint or remove the Len prefix, because the CNNDM-pretrained checkpoint is not significantly beneficial for MAC-Dial (Section 6.3).
BART The BART baseline is a pretrained BART model that prepends only the hard prompts of Topic and Speaker to the input, meaning it does not control the remaining three attributes (Len, Ext, and Spe). This baseline justifies whether our models effectively control these three attributes.

B Annotator Details
We have four annotators who are native English speakers. Before the pilot test, we supply the annotators with professional training for high-quality annotation and provide annotation visualization tools to help regularize the annotation process. For each sample, we ask the annotators to inspect its quality and decide whether to keep the annotation or discard it due to difficulty or errors. We combine the annotations of each week to form the MACSUM dataset with careful processing: we discard the invalid samples reported by the annotators and use a program to filter out other invalid samples with empty or wrong text.

C Annotation Guidelines
We write the annotation guidelines of MACSUM for two purposes. First, the guidelines are used as our criteria to evaluate annotators during the pilot test. Second, during annotation, we provide the guidelines to the annotators and ask them to follow them carefully. For both purposes, the guidelines are a key step in ensuring the quality of the whole annotation process. Below, we excerpt some details from the guidelines. Note that the following paragraphs are copied directly from the guideline document and shared across all four annotators.

Speakers annotation criteria.
A dialogue may contain multiple speakers. If we specify certain speaker names as the control attribute, it means we only care about what these speakers say in the dialogue. So we need to focus on the dialogue turns spoken by these speakers and write the summary for them.
Topics annotation criteria. Topic is represented by a set of keywords (usually) copied from the dialogue. A dialogue may contain multiple topics; we need to summarize only the content that is related to the given topic.
Length annotation criteria. Normal length: the length of the summary should equal 15%-25% of the related text spans. E.g., the dialogue contains 2000 words, and the related text span for the labeled speaker contains 1000 words. Then we need to write 15%-25% x 1000 = 150-250 words for the summary. Long summary: the length of the summary should equal 30%-35% of the related text spans. Short summary: the length of the summary should equal 5%-10% of the related text spans. These criteria can be dynamically modified; the target of length control is to differentiate the lengths of the different outputs. We can adjust the criteria slightly if the lengths of the three types of summaries are too similar.
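The ratios above translate directly into a target word-count range. A minimal sketch (the function name and the use of rounding are our own):

```python
def target_length_range(span_tokens, length_label):
    """Return the (min, max) target summary length in words for a
    related text span, using the guideline ratios:
    short 5%-10%, normal 15%-25%, long 30%-35% of the span."""
    ratios = {"short": (0.05, 0.10),
              "normal": (0.15, 0.25),
              "long": (0.30, 0.35)}
    low, high = ratios[length_label]
    return round(span_tokens * low), round(span_tokens * high)
```

For a 1000-word related span, `target_length_range(1000, "normal")` gives (150, 250), matching the worked example in the guideline.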
Extractiveness annotation criteria. Normal extractiveness: the same as a natural summary that humans would write. High extractiveness: copy more sentences/tokens from the source text compared with normal extractiveness. Full extractiveness: copy all the sentences/tokens from the source text. Again, this can be modified if it helps us better differentiate summaries with different abstractiveness.
Specificity annotation criteria. Normal specificity: the same as a natural summary that humans would write. High specificity: include more descriptive content from the source text compared with normal specificity.

D Examples of the MACSUM Dataset
Table 9 shows five examples of our proposed MAC-Doc dataset. Note that samples 2 and 3, and samples 4 and 5, differ only in Len, Ext, and Spe.
Source text (CNN) Jackson Gordon is no ordinary 21-year-old. By day he is an industrial design student at Philadelphia University, but Gordon has another side to him - a side altogether darker, tougher, and more enigmatic. Hanging in his workshop Gordon has a full suit of armor plating, cape, and cowl - matte black and built to stop a knife. Gordon has an alter ego: the Dark Knight himself, Batman. You might expect his origin story to be cloaked in mystery, but speaking to CNN Gordon is quick to explain how the transformation took place. ... Perhaps because of their versatility and the small matter of copyright issues, those that go on sale will not feature the iconic bat symbol. Gordon says his fledgling business will remain small whilst he's at University - he has to finish his studies after all, and won't be using the project towards his degree credits. For now the Batsuit and Armatus Design will remain a one man operation: such is the life of a superhero." Attributes Length: short; Extractiveness: normal; Specificity: normal; Topic: (No Topic Specified); Gold Jackson Gordon, a 21-year-old industrial design student at Philadelphia University, built a Batsuit that is resistant to stabs, knife slashes, and high impacts. According to Gordon, this is a second attempt at building the suit after an earlier attempt five years ago.

HP
The transformation of Jackson Gordon, a 21-year-old industrial design student at Philadelphia University, into a Batman fan has happened. Gordon has created a full suit of armor plating, cape and cowl with suede detailing. Attributes Length: normal; Extractiveness: normal; Specificity: normal; Topic: industrial design student; Gold Apart from being an industrial design student, Gordon is also a Shaolin Kung Fu expert and has started a business making jackets and cowls but plans to focus on studies first.

HP
The industrial design student Jackson Gordon, 21, is no ordinary student. Gordon has created a replica of the iconic Batman suit with an alter ego named after the Dark Knight. Attributes Length: normal; Extractiveness: full; Specificity: high; Topic: industrial design student; Gold By day he is an industrial design student; Gordon is also an expert in Shaolin Kung Fu; He has already begun manufacturing the cowls for the public.
HP By day, an industrial design student, Gordon, has another side to him; a side altogether darker, tougher and more enigmatic; Gordon has an alter ego, the Dark Knight himself, Batman; as elaborate as his design was, it lacked the functionality or the authenticity of the genuine article. Attributes Length: normal; Extractiveness: normal; Specificity: normal; Topic: conventional materials; Gold Gordon chose unconventional materials to build the Batsuit ensuring that every part was protected whether it had armor plates or not.

HP
In order to avoid using conventional materials, Gordon used memory foam, built around key areas to "squish and compress" to dissipate the impact of blows, and also used Kevlar as the base fabric. Attributes Length: long; Extractiveness: normal; Specificity: normal; Topic: conventional materials; Gold Gordon chose unconventional materials, using Kevlar for slash resistance, a form of memory foam for impact absorption, ABS plastic for armor plates, and polyurethane for the cowl.

HP
Eschewing conventional materials, Gordon opted for a form of memory foam, built around key areas to squish and compress, dissipating the impact of blows; also used Kevlar as the base fabric, making it cut and slash resistant to bladed weapons, but breathable and wearable all day.
Table 9: Five case studies on MAC-Doc.

Figure 1 :
Figure 1: An example of MACSUM. For the same input source text, the system needs to generate different reference summaries (green boxes) for different mixed control attributes (orange boxes).

Figure 3 :
Figure 3: Distribution of mixed attributes. Each category is represented by the first character of its control attribute values; e.g., snh represents Len: short, Ext: normal, and Spe: high.
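The category codes can be decoded back into attribute values. A minimal sketch, assuming the value inventories given in the paper (Len: short/normal/long, Ext: normal/high/full, Spe: normal/high); the function itself is for illustration only:

```python
def decode_category(code):
    """Decode a three-character category code from Figure 3
    (e.g., "snh") into Len/Ext/Spe values, using the first
    character of each attribute value."""
    value_maps = [
        ("Len", {"s": "short", "n": "normal", "l": "long"}),
        ("Ext", {"n": "normal", "h": "high", "f": "full"}),
        ("Spe", {"n": "normal", "h": "high"}),
    ]
    return {attr: values[char]
            for (attr, values), char in zip(value_maps, code)}
```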

Figure 4 :
Figure 4: Comparison of different frameworks. For the HP model, the control attributes are prepended to the input to form a hard prompt. For the SP model, the selected prefix vectors are added together to form a control prefix. HP+SP contains both hard prompts and control prefixes.
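For the HP model, forming the input amounts to serializing the control attributes and prepending them to the source. A sketch, where the exact serialization format (semicolon-separated key-value pairs, mirroring the attribute lines shown in Table 9) is an assumption:

```python
def build_hard_prompt(source_text, attributes):
    """Prepend control attributes to the source text to form the
    hard-prompt input consumed by the HP model.
    `attributes` maps attribute names to their values."""
    prompt = " ".join(f"{name}: {value};"
                      for name, value in attributes.items())
    return f"{prompt} {source_text}"
```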

Figure 5 :
Figure 5: Difficulty of attribute values. The x-axis shows the control attribute and its value. For instance, S in length is the CER of all the Len: short samples.

Figure 6 :
Figure 6: Dependency of attributes. Each row shows the attributes that are modified, while each column shows the change in the corresponding attribute.

Table 1 :
Comparison between MACSUM and previous work on controllable summarization. Dial. and Doc. indicate whether the source is dialogue or document. Anno. indicates whether the data is constructed by human annotation or a rule-based pseudo-split. Multi-O. shows whether there are multiple outputs with different control attributes for the same source. Mixed Attr. shows whether mixed attribute control is allowed. Control Attributes are defined in Section 3.

Annotation pipeline of MACSUM. The annotator needs to summarize the contents of meetings/documents according to the five control attributes, give the relevant text spans, and write a summary title.
Louis and Nenkova (2011) values of [normal, high, full]. Specificity (Spe) means how many details or descriptive contents we need to include in the summary. Referring to Louis and Nenkova (2011), different users can prefer more general summaries or more specific summaries. MACSUM contains two levels of Spe control, namely [normal, high].

Table 2 :
Statistics of MACSUM consisting of two parts: MAC-Doc from CNNDM and MAC-Dial from QMSum. Source Len. and Ref. Len. are the numbers of tokens in the source and reference. Topic and Speaker are the average numbers of topics/speakers.

Table 3 :
Attribute metric functions f_r of different control attribute values. The specificity metric is (0.1 × vb + 0.2 × tok + 0.3 × nn + 0.4 × cd)/n_s, where vb, tok, nn, and cd represent the numbers of verbs, tokens, nouns, and numeral tokens, and n_s the number of sentences in the summary.
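The specificity metric above is a weighted count normalized by sentence count. A direct transcription of the formula; the POS counts vb, tok, nn, and cd are assumed to be precomputed upstream (e.g., with a POS tagger):

```python
def specificity_metric(vb, tok, nn, cd, n_s):
    """Specificity attribute metric from Table 3: a weighted sum of
    verb (vb), token (tok), noun (nn), and numeral (cd) counts,
    normalized by the number of sentences n_s in the summary."""
    return (0.1 * vb + 0.2 * tok + 0.3 * nn + 0.4 * cd) / n_s
```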
3.4 Statistics of MACSUM
Dataset Split and Source Data Distribution Table 2 shows the statistics. MACSUM covers two domains (MAC-Doc for news and MAC-Dial for dialogue) with 8333 annotated summaries (5379 in MAC-Doc and 2954 in MAC-Dial), paired with 1353 source inputs (943 in MAC-Doc and 410 in MAC-Dial). The average number of tokens in the sources of MAC-Dial is smaller than that in the original QMSum dataset since we truncate the input into segments. We randomly split the source texts into training/valid/test sets with an 80%/10%/10% ratio.
Distribution of Control Attribute Metrics Using the definitions in Section 3.3, Table 3 reports the automatic attribute metrics for all 5 control attributes.

Table 4 :
Results on MAC-Doc.The performance of the model is better when Control Error Rate (CER) is lower ↓, ROUGE is higher ↑, and Control Correlation (CC) is closer to the golden summary ↕.

Table 5 :
Results on MAC-Dial.The performance of the model is better when Control Error Rate (CER) is lower ↓, ROUGE is higher ↑, and Control Correlation (CC) is closer to the golden summary ↕.

Table 6 :
Human evaluation results. We evaluate Speaker (Spk), Extractiveness (Ext), and Specificity (Spe). Length does not require human annotation because it is measured by counting the number of tokens.

Table 7 :
Ablation on MACSUM on pretraining on CNNDM. MAC-Doc and MAC-Dial denote the HP model initialized with BART-large-cnn, while -CNN uses the BART-large checkpoint. Numbers for the five control attributes are CER, and numbers for Quality are the average of ROUGE-1/2/L.

Table 8 :
Three case studies on MACSUM.
market it as the point of view; we could have parallel marketing schemes; one where you've got one where it appeals to people that want to have the new device that looks cool, is fashionable; So um, I dunno we'll have to decide which which angle we're gonna go to or both; Either market it together by getting control in a set colour or like you buy it with several; as a separate thing.
HP+SP Marketing, could have parallel marketing schemes, one where it appeals to people that want to have the new device that looks cool; one that rather, than a kind of a need relationship with the device; people might not like, having a device, just looks nice; also a device, practically sound; decide which angle, gonna go to or both.