Template-based Abstractive Microblog Opinion Summarisation

We introduce the task of microblog opinion summarisation (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarisation dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarising news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favours extractive summarisation models. To showcase the dataset's utility and challenges, we benchmark a range of state-of-the-art abstractive and extractive summarisation models and achieve good performance, with the abstractive models outperforming the extractive ones. We also show that fine-tuning is necessary to improve performance, and we investigate the benefits of using different fine-tuning sample sizes.


Introduction
Social media has gained prominence as a means for the public to exchange opinions on a broad range of topics. Furthermore, its social and temporal properties make it a rich resource for policy makers and organisations to track public opinion on a diverse range of issues (Procter et al., 2013; Chou et al., 2018; Kalimeri et al., 2019). However, understanding opinions about different issues and entities discussed in large volumes of posts in platforms such as Twitter is a difficult task. Existing work on Twitter employs extractive summarisation (Inouye and Kalita, 2011; Zubiaga et al., 2012; Wang et al., 2017a; Jang and Allan, 2018) to filter through information by ranking and selecting tweets according to various criteria. However, this approach unavoidably ends up including incomplete or redundant information (Wang and Ling, 2016).
Human Summary
Main Story: The UK government faces intense backlash after its decision to fund the war in Syria.
Majority Opinion: The majority of users criticise UK politicians for not directing their efforts to more important domestic issues like the NHS, education and homelessness instead of the war in Syria.
Minority Opinion: Some users accuse the government of its intention to kill innocents by funding the war.

Tweet Cluster
It is shocking to me how the NHS is on its knees and the amount of homeless people that need help in this country... but we have funds for war! ..SAD
The government cannot even afford to help the homeless people of Britain yet they can afford to fund a war? It makes no proper sense at all
They spend so much on sending missiles to murder innocent people and they complain daily about homeless on the streets? Messed up.
Also, no money to resolve the issues of the homeless or education or the NHS. Yet loads of money to drop bombs? #SyriaVote

Table 1: Abridged cluster of tweets and its corresponding summary. Cluster content is colour-coded to represent information overlap with each summary component: blue for Main Story, red for Majority Opinion and green for Minority Opinion.

To tackle this challenge we introduce Microblog opinion summarisation (MOS), which we define as a multi-document summarisation task aimed at capturing diverse reactions and stances (opinions) of social media users on a topic. While here we apply our methods to Twitter data readily available to us, we note that this summarisation strategy is also useful for other microblogging platforms. An example of a tweet cluster and its opinion summary is shown in Table 1. As shown, our proposed summary structure for MOS separates the factual information (story) from reactions to the story (opinions); the latter is further divided according to the prevalence of different opinions. We believe that making combined use of stance identification, sentiment analysis and abstractive summarisation is a promising direction for this task.

Related Work

Opinion summarisation has largely been studied in the context of product reviews, where aspects and sentiment are typically leveraged to create relevant topical summaries. While product reviews have a relatively fixed structure, MOS operates on microblog clusters where posts are more loosely related, which poses an additional challenge. Moreover, while the former generally only encodes the consensus opinion (Bražinskas et al., 2020; Chu and Liu, 2019), our approach includes both majority and minority opinions.

Multi-document summarisation has gained traction in non-opinion settings, and for news events in particular. The DUC (Dang, 2005) and TAC conferences pioneered this task by introducing datasets of 139 clusters of articles paired with multiple human-authored summaries. Recent work has seen the emergence of larger-scale datasets such as WikiSum (Liu et al., 2018), Multi-News (Fabbri et al., 2019) and WCEP (Gholipour Ghalandari et al., 2020) to combat data sparsity. Extractive (Wang et al., 2020b,c; Liang et al., 2021) and abstractive (Jin et al., 2020) methods have followed from these multi-document news datasets.

Twitter summarisation is recognised by Cao et al. (2016) as a promising direction for tracking reactions to major events. As tweets are inherently succinct and often opinionated (Mohammad et al., 2016), this task sits at the intersection of multi-document and opinion summarisation. The construction of datasets (Nguyen et al., 2018; Wang and Zhang, 2017) usually requires a clustering step to group tweets together under specific temporal and topical constraints, which we include within our own pipeline. Work by Jang and Allan (2018) and Corney et al. (2014) makes use of the subjective nature of tweets by identifying two stances for each topic to be summarised; we generalise this idea and do not impose a restriction on the number of possible opinions on a topic. The lack of an abstractive gold standard means that the majority of existing Twitter models are extractive (Alsaedi et al., 2021; Inouye and Kalita, 2011; Jang and Allan, 2018; Corney et al., 2014). Here we provide such an abstractive gold standard and show the potential of neural abstractive models for microblog opinion summarisation.
Creating the MOS Dataset

Data Sources
Our MOS corpus consists of summaries of microblog posts originating from two data sources, both involving topics that have generated strong public opinion: COVID-19 (Chen et al., 2020) and UK Elections (Bilal et al., 2021).
• UK Elections: The Election dataset consists of all geo-located UK tweets posted between May 2014 and May 2016. The tweets were filtered using a list of 438 election-related keywords and 71 political party aliases curated by a team of journalists.
We follow the methodology in Bilal et al. (2021) to obtain opinionated, coherent clusters of between 20 and 50 tweets: the clustering step employs the GSDMM-LDA algorithm (Wang et al., 2017b), followed by thematic coherence evaluation (Bilal et al., 2021). The latter is done by aggregating the metrics BLEURT (Sellam et al., 2020), BERTScore (Zhang* et al., 2020) and TF-IDF to construct a random forest classifier that identifies coherent clusters. Our final corpus is created by randomly sampling 3100 clusters, 1550 each from the COVID-19 and Election datasets.
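For illustration, the coherence-filtering step can be pictured as a feature-based classifier over metric scores. The sketch below is purely hypothetical: the feature values, their aggregation and the training data are invented for the example and do not reproduce the exact setup of Bilal et al. (2021).

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Toy feature matrix: one row per cluster. Columns could aggregate pairwise
# BLEURT, BERTScore and TF-IDF similarities within each cluster (e.g. means);
# the actual feature design is described in Bilal et al. (2021).
features = np.array([
    [0.42, 0.81, 0.37],   # cluster judged thematically coherent
    [0.11, 0.62, 0.09],   # cluster judged incoherent
    [0.39, 0.78, 0.33],
    [0.08, 0.58, 0.05],
])
labels = np.array([1, 0, 1, 0])  # 1 = coherent, 0 = incoherent

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)

new_cluster = np.array([[0.35, 0.75, 0.30]])
print(clf.predict(new_cluster))  # keep the cluster only if predicted coherent
```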

Summary Creation
The summary creation process was carried out in 3 stages on the Figure Eight platform by 3 journalists experienced in sub-editing. Following Iskender et al. (2021), a short pilot study was followed by a meeting with the summarisers to ensure the task and guidelines were well understood.
Prior to this, the design of the summarisation interface was iterated to ensure functionality and usability (See Appendix A for interface snapshots).
In the first stage, the summarisers were asked to read a cluster of tweets and state whether the opinions within it could be easily summarised by assigning one of three cluster types:

1. Coherent Opinionated: there are clear opinions about a common main story expressed in the cluster that can be easily summarised.
2. Coherent Non-opinionated: there are very few or no clear opinions in the cluster, but a main story is clearly evident and can be summarised.
3. Incoherent: no main story can be detected. This happens when the cluster contains diverse stories to which no majority of tweets refers; hence it cannot be summarised.
Following Bilal et al. (2021) on thematic coherence, we assume a cluster is coherent if and only if its contents can be summarised. Thus, both Coherent Opinionated and Coherent Non-opinionated clusters can be summarised, but are distinct with respect to the level of subjectivity in the tweets, while Incoherent clusters cannot be summarised.
In the second stage, information nuggets are defined as important pieces of information in a cluster that aid its summarisation. The summarisers were asked to highlight information nuggets when available and categorise each one's aspect as one of: WHAT, WHO, WHERE, REACTION and OTHER. Thus, each information nugget is a pair consisting of the text and its aspect category (see Appendix A for an example). Inspired by the pyramid evaluation framework (Nenkova and Passonneau, 2004) and extractive-abstractive two-stage models in the summarisation literature (Lebanoff et al., 2018; Rudra et al., 2019; Liu et al., 2018), information nuggets have a dual purpose: (1) helping summarisers create the final summary and (2) constituting an extractive reference for the evaluation of summary informativeness (see 5.2.1).
In the third and final stage of the process, the summarisers were asked to write a short template-based summary for coherent clusters. Our chosen summary structure diverges from current summarisation approaches that reconstruct the "most popular opinion" (Bražinskas et al., 2020; Angelidis et al., 2021). Instead, we aim to showcase a spectrum of diverse opinions regarding the same event. Thus, the summary template comprises three components: Main Story, Majority Opinion and Minority Opinion(s). The Main Story component serves to succinctly present the focus of the cluster (often an event), while the other components describe opinions about the main story. Here, we seek to distinguish the most popular opinion (Majority Opinion) from ones expressed by a minority (Minority Opinions). This structure is consistent with the work of Gerani et al. (2014) on template-based summarisation for product reviews, who quantify the popularity of user opinions in the final summary.
For "Coherent Opinionated clusters", summarisers were asked to identify the majority opinion within the cluster and if it exists, to summarise it, along with any minority opinions.If a majority opinion could not be detected, then the minority opinions were summarised.The final summary of "Coherent Opinionated clusters" is the concatenation of the three components: Main story + Majority Opinion (if any) + Minority Opinion(s) (if any).In 43% of opinionated clusters in our MOS corpus a majority opinion and at least one minority opinion were identified.Additionally, in 12% of opinionated clusters, 2 or more main opinions were identified (See Appendix 8, Table 13), but without a majority opinion as there is a clear divide between user reactions.For clusters with few or no clear opinions (Coherent Non-opinionated), the final summary is represented by the Main Story component.Statistics regarding the annotation results are shown in Table 2.

Agreement Analysis
Our tweet summarisation corpus consists of 3100 clusters. Of these, a random sample of 100 clusters was shared among all three summarisers to compute agreement scores. Each summariser then worked on 1000 clusters. We obtain a Cohen's Kappa score of κ = 0.46 for the first stage of the summary creation process, which involves categorising clusters as either Coherent Opinionated, Coherent Non-opinionated or Incoherent. Previous work (Feinstein and Cicchetti, 1990) highlights a paradox regarding Cohen's kappa: high levels of agreement do not translate into high kappa scores on highly imbalanced datasets. In our data, at least 2 of the 3 summarisers agreed on the type of cluster in 97% of instances.
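For reference, cluster-type agreement of this kind can be computed with a standard implementation of Cohen's kappa; the labels below are illustrative, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Cluster-type labels assigned by two summarisers on the shared sample
# (illustrative values): O = Coherent Opinionated, N = Coherent
# Non-opinionated, I = Incoherent.
annotator_a = ["O", "O", "N", "I", "O", "N", "O", "O"]
annotator_b = ["O", "N", "N", "I", "O", "O", "O", "O"]

print(cohen_kappa_score(annotator_a, annotator_b))
```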
In addition, we evaluate whether the concept of 'coherence/summarisability' is uniformly assessed, i.e., we check whether summarisers agree on which clusters can be summarised (Coherent clusters) and which clusters are too incoherent. We find that 83 out of 100 clusters were evaluated as coherent by the majority, of which 65 were evaluated as coherent by all three.
ROUGE-1, 2, L and BLEURT (Sellam et al., 2020) are used as proxy metrics to check agreement between the summarisers in terms of summary similarity. In Table 3 we compare the consensus between the complete summaries as well as between individual components, such as the main story of the cluster, its majority opinion and any minority opinions. The highest agreement is achieved for the Main Story, followed by the Majority Opinion and the Minority Opinions. These scores can be interpreted as upper thresholds for the lexical and semantic overlap reported later in Section 6.

Table 3: Agreement between summarisers with respect to the final summary, main story, majority opinion and minority opinions, using ROUGE-1, 2, L and BLEURT.
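A minimal sketch of the pairwise ROUGE computation behind this agreement analysis is shown below; the summaries are invented and the rouge-score package is one possible implementation, not necessarily the one used for the reported figures.

```python
from rouge_score import rouge_scorer

# Two illustrative summaries of the same cluster written by different summarisers.
summary_a = ("The UK government faces backlash over funding the war in Syria. "
             "Most users think the money should go to the NHS and the homeless.")
summary_b = ("Users criticise the government for funding the Syria war instead of "
             "domestic issues such as the NHS and homelessness.")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(summary_a, summary_b)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))
```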

Comparison with other Twitter datasets
We next compare our corpus against the most recent and popular Twitter summarisation datasets in Table 4. To the best of our knowledge, there are currently no abstractive Twitter summarisation datasets for either event or opinion summarisation.
While we primarily focused on the collection of opinionated clusters, some of the clusters we had automatically identified as opinionated were not deemed to be so by our annotators. Including the non-opinionated clusters helps expand the depth and range of Twitter datasets for summarisation.
Compared to the summarisation of product reviews and news articles, which has gained recognition in recent years because of the availability of large-scale datasets and supervised neural architectures, Twitter summarisation remains a mostly uncharted domain with very few curated datasets. Inouye and Kalita (2011) collected the tweets for the top ten trending topics on Twitter over 5 days and manually clustered them. The SMERP dataset (Ghosh et al., 2017) focuses on topics related to post-disaster relief operations for the 2016 earthquakes in central Italy. Finally, TSix (Nguyen et al., 2018) is the dataset most similar to our work as it covers, albeit on a smaller scale, several popular topics that are deemed relevant to news providers.
Other Twitter summarisation datasets include Zubiaga et al. (2012) and Corney et al. (2014) on the summarisation of football matches, and Olariu (2014) on real-time summarisation of Twitter streams. These datasets are either publicly unavailable or unsuitable for our summarisation task.

Summary type. These datasets exclusively contain extractive summaries, where several tweets are chosen as representative of each cluster. This results in summaries which are often verbose, redundant and information-deficient. As shown in other domains (Grusky et al., 2018; Narayan et al., 2018), this may lead to a bias towards extractive summarisation techniques and hinder progress for abstractive models. Our corpus of COVID-19 and Election data aims to bridge this gap and introduces an abstractive gold standard generated by journalists experienced in sub-editing.

Size. The average number of posts in our clusters is 30, which is similar to the TSix dataset and in line with the empirical findings of Inouye and Kalita (2011), who recommend 25 tweets per cluster. Having clusters with a much larger number of tweets makes it harder to apply our guidelines for human summarisation. To the best of our knowledge, our combined corpus (COVID-19 & Election) is currently the largest human-generated corpus for microblog summarisation.

Time-span. Both the COVID-19 and Election partitions were collected across year-long time spans. This is in contrast to other datasets, which were constructed over brief time windows, ranging from 3 days to a month. This emphasises the longitudinal aspect of the dataset, which also allows for topic diversity, as 153 keywords and accounts were tracked through time.

Defining Model Baselines
As we introduce a novel summarisation task (MOS), the baselines featured in our experiments are selected from domains tangential to microblog opinion summarisation, such as news articles, Twitter posts and product reviews (see Sec. 2). In addition, the selected models represent diverse summarisation strategies: abstractive or extractive, supervised or unsupervised, multi-document (MDS) or single-document summarisation (SDS). Note that most SDS models enforce a length limit (1024 characters) on the input, which makes it impossible to summarise the whole cluster of tweets. We address this issue by only considering the most relevant tweets, ordered by topic relevance. The latter is computed using the Kullback-Leibler divergence with respect to the topical word distribution of the cluster obtained from the GSDMM-LDA clustering algorithm (Wang et al., 2017b).
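As an illustration of this tweet-selection step, the sketch below ranks tweets by KL divergence between each tweet's word distribution and the cluster's topical word distribution. The tokenisation, smoothing and example distributions are assumptions for the sketch, not the exact GSDMM-LDA output.

```python
import numpy as np
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def rank_tweets_by_topic_relevance(tweets, topic_word_probs, eps=1e-8):
    """Order tweets by KL divergence from the cluster's topical word
    distribution (lower divergence = more topically relevant)."""
    vocab = list(topic_word_probs)
    q = np.array([topic_word_probs[w] for w in vocab]) + eps
    q /= q.sum()
    scored = []
    for tweet in tweets:
        counts = Counter(w for w in tweet.lower().split() if w in topic_word_probs)
        p = np.array([counts[w] for w in vocab], float) + eps
        p /= p.sum()
        scored.append((kl_divergence(p, q), tweet))
    return [t for _, t in sorted(scored)]

# Illustrative usage: keep the top-ranked tweets until the input limit is reached.
topic = {"nhs": 0.3, "war": 0.3, "homeless": 0.2, "funding": 0.2}
tweets = ["Fund the NHS not the war", "Homeless people need the funding", "Nice weather today"]
print(rank_tweets_by_topic_relevance(tweets, topic))
```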
The summaries were generated such that their length matches the average length of the gold standard. Some models (such as LexRank) only allow sentence-level truncation, in which case the length matches the average number of sentences in the gold standard. For models that allow a word limit on the generated text (BART, PEGASUS, T5), a minimum and maximum number of tokens was imposed such that the generated summary would be within [90%, 110%] of the gold standard length.
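For the neural abstractive models, this length control amounts to passing minimum and maximum generation lengths; a minimal sketch with the Hugging Face transformers API is shown below. The checkpoint name and the gold-length figure are placeholders, not the exact configuration used in the experiments.

```python
from transformers import pipeline

# Any BART/PEGASUS/T5 summarisation checkpoint could be used here.
summariser = pipeline("summarization", model="facebook/bart-large-cnn")

cluster_text = " ".join([
    "Anti-maskers in Indonesia are forced to dig graves for Covid-19 victims.",
    "What a great idea, it might make people reconsider.",
    "This should be done everywhere.",
])

gold_len = 40  # average gold-summary length in tokens (illustrative value)
summary = summariser(
    cluster_text,
    min_length=int(0.9 * gold_len),   # lower bound: 90% of gold length
    max_length=int(1.1 * gold_len),   # upper bound: 110% of gold length
    truncation=True,                  # truncate overly long inputs to the model limit
)
print(summary[0]["summary_text"])
```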

Heuristic Baselines
Extractive Oracle: this baseline uses the gold summaries to extract the highest scoring sentences from a cluster of tweets. We follow Zhong et al. (2020) and rank each sentence by its average ROUGE-{1,2,L} recall score. We then consider the five highest-ranking sentences and form combinations of k sentences, which are re-evaluated against the gold summaries; k is chosen to equal the average number of sentences in the gold standard. The highest scoring combination with respect to the average ROUGE-{1,2,L} recall scores is assigned as the oracle.

Random: k sentences are extracted at random from a tweet cluster. We report the mean result over 5 iterations with different random seeds.
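The oracle construction can be sketched in a few lines; the inputs below are illustrative, and the rouge-score package stands in for whatever ROUGE implementation was actually used.

```python
from itertools import combinations
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

def avg_rouge_recall(reference, candidate):
    scores = scorer.score(reference, candidate)
    return sum(s.recall for s in scores.values()) / 3

def extractive_oracle(gold_summary, sentences, k, top_n=5):
    """Greedy oracle: keep the top_n sentences by average ROUGE recall against
    the gold summary, then pick the best combination of k of them."""
    ranked = sorted(sentences, key=lambda s: avg_rouge_recall(gold_summary, s), reverse=True)
    best, best_score = None, -1.0
    for combo in combinations(ranked[:top_n], k):
        candidate = " ".join(combo)
        score = avg_rouge_recall(gold_summary, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

gold = "Users think war funding should go to the NHS and the homeless instead."
tweets = ["Fund the NHS not the war.", "Homeless people need help.",
          "Nice weather today.", "Stop funding the war.", "Money for bombs but not schools."]
print(extractive_oracle(gold, tweets, k=2))
```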

Extractive Baselines
LexRank (Erkan and Radev, 2004) constructs a weighted connectivity graph based on cosine similarities between sentence TF-IDF representations.

Hybrid TF-IDF (Inouye and Kalita, 2011) is an unsupervised model designed for Twitter, where a post is summarised as the weighted mean of its TF-IDF word vectors.

BERTSumExt (Liu and Lapata, 2019) is an SDS model comprising a BERT-based (Devlin et al., 2019) encoder stacked with Transformer layers to capture document-level features for sentence extraction. We use the model trained on CNN/Daily Mail (Hermann et al., 2015).

HeterDocSumGraph (Wang et al., 2020b) introduces a heterogeneous graph neural network, which is constructed and iteratively updated using both sentence nodes and nodes representing other semantic units, such as words. We use the MDS model trained on Multi-News (Fabbri et al., 2019).

Quantized Transformer (Angelidis et al., 2021) combines Transformers (Vaswani et al., 2017) and Vector-Quantized Variational Autoencoders for the summarisation of popular opinions in reviews. We trained QT on the MOS corpus.
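As an illustration of the graph-based centrality ranking used by LexRank above, here is a minimal sketch that scores tweets with PageRank over a TF-IDF cosine-similarity graph; it is an approximation for illustration, not the exact implementation used in the experiments.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "Fund the NHS instead of the war.",
    "The NHS needs the money more than the war does.",
    "Homeless people are ignored while bombs are funded.",
    "Lovely weather in London today.",
]

# TF-IDF cosine similarities define the edge weights of the connectivity graph.
tfidf = TfidfVectorizer().fit_transform(tweets)
similarity = cosine_similarity(tfidf)

graph = nx.from_numpy_array(similarity)
scores = nx.pagerank(graph, weight="weight")

# Extract the two most central tweets as the summary.
top = sorted(scores, key=scores.get, reverse=True)[:2]
print([tweets[i] for i in sorted(top)])
```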

Abstractive Baselines
Opinosis (Ganesan et al., 2010) is an unsupervised MDS model. Its graph-based algorithm identifies valid paths in a word graph and returns the highest scoring path with respect to redundancy.

PG-MMR (Lebanoff et al., 2018) adapts the single-document setting to multiple documents by introducing 'mega-documents' that result from concatenating clusters of texts. The model combines an abstractive SDS pointer-generator network with an MMR-based extractive component.

PEGASUS (Zhang et al., 2020) introduces gap-sentence generation as a pre-training objective for summarisation. It is then fine-tuned on 12 downstream summarisation domains. We chose the model fine-tuned on Reddit TIFU (Kim et al., 2019).

T5 (Raffel et al., 2020) adopts a unified approach to transfer learning for language-understanding tasks. For summarisation, the model is pre-trained on the Colossal Clean Crawled Corpus (Raffel et al., 2020) and then fine-tuned on CNN/Daily Mail.

BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder and evaluated on several downstream tasks, including summarisation. With a bidirectional encoder and a GPT-2-style autoregressive decoder, BART can be considered a generalisation of BERT. We use the BART model fine-tuned on CNN/Daily Mail.

SummPip (Zhao et al., 2020) is an unsupervised MDS model which constructs a sentence graph following Approximate Discourse Graph and deep embedding methods. After spectral clustering of the sentence graph, summary sentences are generated through a compression step applied to each cluster of sentences.
Copycat (Bražinskas et al., 2020) is a Variational Autoencoder model trained in an unsupervised setting to capture the consensus opinion in product reviews for Yelp and Amazon.We train it on the MOS corpus.

Evaluation Methodology
Similar to other summarisation work (Fabbri et al., 2019; Grusky et al., 2018), we perform both automatic and human evaluation of the models. Automatic evaluation is conducted on a test set of 200 clusters: each partition of the test set (COVID-19 Opinionated, COVID-19 Non-opinionated, Election Opinionated, Election Non-opinionated) contains 50 clusters uniformly sampled from the total corpus. For the human evaluation, only the 100 opinionated clusters are evaluated.

Automatic Evaluation
Word overlap is evaluated using the F1 scores of ROUGE-1, 2 and L (Lin, 2004), as reported elsewhere (Narayan et al., 2018; Gholipour Ghalandari et al., 2020; Zhang et al., 2020). Work by Tay et al. (2019) acknowledges the limitations of ROUGE for opinion summarisation, as sentiment-rich language draws on a vast vocabulary and does not rely on exact word matching. This issue is mitigated by Kryscinski et al. (2021) and Bhandari et al. (2020), who use semantic similarity as an additional assessment of candidate summaries. Similarly, we use the text generation metrics BLEURT (Sellam et al., 2020) and BERTScore (Zhang* et al., 2020) to assess semantic similarity.
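Both families of metrics have reference implementations; a minimal BERTScore sketch is shown below (BLEURT works analogously but requires downloading a trained checkpoint). The example texts are illustrative.

```python
from bert_score import score as bert_score

candidates = ["Anti-maskers are forced to dig graves; most users think it is a good idea."]
references = ["Anti-maskers are forced to dig graves for Covid-19 victims in Indonesia. "
              "The majority think it is a good idea that will make people rethink."]

# BERTScore returns precision, recall and F1 tensors, one entry per pair.
P, R, F1 = bert_score(candidates, references, lang="en")
print(float(F1.mean()))
```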

Human Evaluation
Human evaluation is conducted to assess the quality of summaries with respect to three objectives: 1) linguistic quality, 2) informativeness and 3) ability to identify opinions. We conducted two human evaluation experiments: the first (5.2.1) assesses the gold standard and non-fine-tuned model summaries on a rating scale, while the second (5.2.2) addresses the advantages and disadvantages of fine-tuned model summaries via Best-Worst Scaling. Four and three experts were employed for the two experiments, respectively.

Evaluation of Gold Standard & Models
The first experiment focused on assessing the gold standard and best models from each summarisation type: Gold, LexRank (best extractive), SummPip (best unsupervised abstractive) and BART (best supervised).
Linguistic quality measures four dimensions inspired by previous work on summary evaluation. Similar to DUC (Dang, 2005), each summary was evaluated with respect to each criterion below on a 5-point scale.
• Fluency (Grusky et al., 2018): Sentences in the summary "should have no formatting problems, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read."

• Sentential Coherence (Grusky et al., 2018): A sententially coherent summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence into a coherent body of information about a topic.
• Non-redundancy (Dang, 2005): A non-redundant summary should contain no duplication, i.e., there should be no overlap of information between its sentences.

• Referential Clarity (Dang, 2005): It should be easy to identify who or what the pronouns and noun phrases in the summary refer to.

Informativeness is defined as the amount of factual information conveyed by a summary. To measure this, we use a question-answering (QA) algorithm (Patil, 2020) as a proxy. Pairs of questions and corresponding answers are generated from the information nuggets of each cluster. Since we want to assess whether the summary contains factual information, only information nuggets belonging to the WHAT, WHO and WHERE categories are selected as input. We chose not to use the entire cluster as input for the QA algorithm, as this might lead the informativeness evaluation to prioritise irrelevant details in the summary. Each cluster in the test set is assigned a question-answer pair, and each system is then scored based on the percentage of times its generated summaries contain the answer to the question. Similar to factual consistency (Wang et al., 2020a), informativeness penalises incorrect answers (hallucinations), as well as the lack of a correct answer in a summary.
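Once question-answer pairs have been generated from the nuggets, the scoring step reduces to an answer-containment check. The sketch below assumes the QA pairs are already available (the question-generation model itself is not shown), and the matching rule here is a simple normalised string match rather than the exact procedure used in the paper.

```python
import re

def contains_answer(summary, answer):
    """Crude containment check used as a proxy for informativeness:
    does the summary contain the expected answer string?"""
    normalise = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower())
    return normalise(answer) in normalise(summary)

def informativeness(system_summaries, qa_pairs):
    """Percentage of clusters whose question (generated from WHAT/WHO/WHERE
    nuggets) is answered by the corresponding system summary."""
    hits = sum(contains_answer(summary, answer)
               for summary, (_, answer) in zip(system_summaries, qa_pairs))
    return 100.0 * hits / len(qa_pairs)

# Illustrative question-answer pair generated from an information nugget.
qa_pairs = [("Where are anti-maskers forced to dig graves?", "Indonesia")]
summaries = ["Anti-maskers in Indonesia are forced to dig graves for Covid-19 victims."]
print(informativeness(summaries, qa_pairs))
```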
As Opinion is a central component of our task, we want to assess the extent to which summaries contain opinions. Assessors report whether summaries identify any majority or minority opinions. A summary contains a majority opinion if most of its sentences express this opinion or if it contains specific terminology ('The majority is.../Most users think...', etc.), which is usually learned during the fine-tuning process. Similarly, a summary contains a minority opinion if at least one of its sentences expresses this opinion or it contains specific terminology ('A minority.../A few users...', etc.). The final score for each system is the percentage of times its summaries contain majority or minority opinions, respectively.

Best-Worst Evaluation of Fine-tuned Models
The second human evaluation assesses the effects of fine-tuning on the best supervised model, BART. The experiments use non-fine-tuned BART (BART), BART fine-tuned on 10% of the corpus (BART FT10%) and BART fine-tuned on 70% of the corpus (BART FT70%). As all of the above are versions of the same neural model, Best-Worst Scaling is chosen to detect subtle improvements, which cannot otherwise be quantified as reliably by traditional rating scales (Kiritchenko and Mohammad, 2017). An evaluator is shown a tuple of 3 summaries (BART, BART FT10%, BART FT70%) and asked to choose the best and worst with respect to each criterion. To avoid any bias, the summary order is randomised for each document, following van der Lee et al. (2019). The final score is calculated as the percentage of times a model is selected as the best, minus the percentage of times it is selected as the worst (Orme, 2009). In this setting, a score of 1 means a model was unanimously judged the best, while -1 means it was unanimously judged the worst.
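The Best-Worst score is simply the difference between the best and worst selection rates; a minimal sketch with invented judgements follows.

```python
from collections import Counter

def best_worst_scores(judgements):
    """judgements: list of (best_system, worst_system) tuples, one per tuple of
    summaries shown to an evaluator. Returns %best - %worst per system, in [-1, 1]."""
    n = len(judgements)
    best = Counter(b for b, _ in judgements)
    worst = Counter(w for _, w in judgements)
    systems = set(best) | set(worst)
    return {s: (best[s] - worst[s]) / n for s in systems}

# Illustrative judgements over three summary variants.
judgements = [("FT70", "NFT"), ("FT70", "NFT"), ("FT10", "FT70"), ("FT10", "NFT")]
print(best_worst_scores(judgements))
```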
The same criteria as before are used for linguistic quality, and one new criterion is added to assess opinion. We define Meaning Preservation as the extent to which the opinions identified in the candidate summaries match the ones identified in the gold standard. We draw a parallel between the Faithfulness measure (Maynez et al., 2020), which assesses the level of hallucinated information present in summaries, and Meaning Preservation, which assesses the extent of hallucinated opinions.

Results

Automatic Evaluation
Results for the automatic evaluation are shown in Table 5.

Fine-tuned Models. Unsurprisingly, the best performing models are the ones that have been fine-tuned on our corpus: BART (FT70%) and BART (FT10%). Fine-tuning has been shown to yield competitive results for many domains (Kryscinski et al., 2021; Fabbri et al., 2021), including ours. In addition, only the fine-tuned abstractive models are capable of outperforming the Extractive Oracle, which is set as the upper threshold for extractive methods. Note that, on average, the Oracle outperforms the Random summariser by a 59% margin, which only fine-tuned models are able to improve on, with 112% for BART (FT10%) and 114% for BART (FT70%). We hypothesise that our gold summaries' template format poses difficulties for off-the-shelf models, and that fine-tuning even on a limited portion of the corpus produces summaries that follow the correct structure (see Table 9 and Appendix C for examples). We include comparisons between the performance of BART (FT10%) and BART (FT70%) on the individual components of the summary in Table 6 (we do not include other models in the component-wise evaluation because the Main Story, Majority Opinion and Minority Opinions cannot be identified in the output of non-fine-tuned models).

Non-Fine-tuned Models. Of these, SummPip performs the best across most metrics and datasets, with an increase of 37% in performance over the random model, followed by LexRank with an increase of 29%. Both models are designed for the multi-document setting and benefit from the common strategy of mapping each sentence of a tweet in the cluster to a node of a sentence graph. However, not all graph mappings prove to be useful: summaries produced by Opinosis and HeterDocSumGraph, which employ a word-to-node mapping, do not correlate well with the gold standard. The difference between word- and sentence-level approaches can be partially attributed to the high amount of spelling variation in tweets, making the former less reliable than the latter.

ROUGE vs BLEURT. The performance on ROUGE and BLEURT is tightly linked to the data differences between the COVID-19 and Election partitions of the corpus. Most models achieve higher ROUGE scores and lower BLEURT scores on the COVID-19 than on the Election dataset. An inspection of the data reveals that COVID-19 tweets are much longer than Election ones (169 vs 107 characters), as the latter had been collected before Twitter's length limit increased from 140 to 280 characters. This is in line with findings by Sun et al. (2019), who showed that high ROUGE scores are mostly the result of longer summaries rather than better-quality summaries.

Human Evaluation
Evaluation of Gold Standard & Models Table 7 shows the comparison between the gold standard and the best performing models against a set of criteria (See 5.2.1).As expected, the humanauthored summaries (Gold) achieve the highest scores with respect to all linguistic quality and structure-based criteria.However, the gold standard fails to capture informativeness as well as its automatic counterparts, which are, on average, longer and thus may include more information.Since BART is previously pre-trained on CNN/DM dataset of news articles, its output summaries are more fluent, sententially coherent and contain less duplication than the unsupervised models Lexrank and SummPip.We hypothesise 9 We do not include other models in the summary component-wise evaluation because it is impossible to identify the Main Story, Majority Opinion and Minority Opinions in non-fine-tuned models.that SummPip achieves high referential clarity and majority scores as a trade-off for its very low nonredundancy (high redundancy).

Best-Worst Evaluation of Fine-tuned Models
The results for our second human evaluation are shown in Table 8, using the guidelines presented in 5.2.2. The model fine-tuned on more data, BART (FT70%), achieves the highest fluency and sentential coherence scores. As seen in Table 9, the summary produced by BART (FT70%) contains complete and fluent sentences, unlike its counterparts. Most importantly, fine-tuning yields better alignment with the gold standard with respect to meaning preservation, as the fine-tuned models BART (FT70%) and BART (FT10%) learn how to correctly identify and summarise the main story and the relevant opinions in a cluster of tweets. In the specific example, non-fine-tuned BART introduces a lot of irrelevant information ('industrial air pollution', 'google, apple rolling out covid') into the main story and offers no insight into the opinions found in the cluster of tweets, whereas both fine-tuned models correctly introduce the Main Story and both partially identify the Majority Opinion ('great idea' for anti-maskers 'to dig graves'). However, we note that the fine-tuning process does not lead to increased performance with respect to all criteria; non-redundancy is compromised and referential clarity stops improving after a certain amount of training data. As observed in the example, BART (FT70%) contains duplicated content: 'think this is a great idea. What a great idea!'. Wilcoxon signed-rank tests with p < 0.05 and p < 0.10 are used for significance testing between all pairs of models. We note that most pairwise differences are significant at p < 0.05, while the difference between BART (FT70%) and non-fine-tuned BART is significant at p < 0.10 for non-redundancy.
The only two exceptions are referential clarity and non-redundancy between BART (FT70%) and BART (FT10%) where both fine-tuned models perform similarly.
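The pairwise significance tests can be reproduced with SciPy's implementation of the Wilcoxon signed-rank test; the per-cluster scores below are illustrative, not the actual evaluation data.

```python
from scipy.stats import wilcoxon

# Per-cluster criterion scores for two systems (illustrative values).
bart_ft70 = [0.8, 0.6, 0.9, 0.7, 0.5, 0.8, 0.9, 0.6]
bart_nft  = [0.5, 0.4, 0.7, 0.6, 0.4, 0.5, 0.6, 0.5]

statistic, p_value = wilcoxon(bart_ft70, bart_nft)
print(statistic, p_value)  # compare p_value against the 0.05 / 0.10 thresholds
```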

Error Analysis
Error analysis is carried out on 30 fine-tuned BART summaries from a set of 15 randomly sampled clusters. The results are found in Table 10.
Hallucination. Fine-tuning on the MOS corpus introduces hallucinated content in 8 out of 30 manually evaluated summaries. Generated summaries contain opinions that prove to be either false or unfounded after careful inspection of the cluster of tweets. We follow the work of Maynez et al. (2020) in classifying hallucinations as either intrinsic (incorrect synthesis of information in the source) or extrinsic (external information not found in the source). Example 1 in Table 10 is an instance of an intrinsic hallucination: the majority opinion is wrongly described as 'pleased', despite containing the correct facts regarding US coronavirus cases. Next, Example 2 shows that Rolf Harris 'is called a terrorist', which is confirmed to be an extrinsic hallucination as none of the tweets in the source cluster contain this information.
Information Loss. Information loss is the most frequent error type. As outlined by Kryscinski et al. (2021), the majority of current summarisation models face length limitations (usually 1024 characters), which are detrimental for long-input documents and tasks. Since our task involves the detection of all opinions within the cluster, this weakness may lead to incomplete and less informative summaries, as illustrated in Example 3 from Table 10. The candidate summary does not contain the minority opinion identified by the experts in the gold standard. An inspection of the cluster of tweets reveals that most posts expressing this opinion are indeed not found within the first 1024 characters of the cluster input.

Conclusions and Future Work
We have introduced the task of Twitter opinion summarisation and constructed the first abstractive corpus for this domain, based on template-based human summaries. Our experiments show that existing extractive models fall short on linguistic quality and informativeness, while abstractive models perform better but fail to identify all relevant opinions required by the task. Fine-tuning on our corpus boosts performance as the models learn the summary structure.
In the future, we plan to take advantage of the template-based structure of our summaries to refine fine-tuning strategies.One possibility is to exploit style-specific vocabulary during the generation step of model fine-tuning to improve on capturing opinions and other aspects of interest.

Table 2: Annotation statistics of our MOS corpus.

Table 4: Overview of other Twitter datasets.

Table 5: Performance on the test set of baseline models evaluated with automatic metrics: ROUGE-n (R-n) and BLEURT. The best model from each category (Extractive, Abstractive, Fine-tuned) and overall are highlighted.

Table 7: Evaluation of Gold Standard and Models: Results.

Table 8: Best-Worst Evaluation of Fine-tuned Models: Results.

Table 9: BART Summary Examples for the same cluster of tweets.

Human Summary: Anti-maskers are forced to dig graves for Covid19 victims in Indonesia. The majority of Twitter users think it is a good idea that will cause a reality check and make people rethink their ideas.

BART FT 70%: Anti-maskers in Indonesia are forced to dig graves for Covid-19 victims as punishment for not wearing mask. The majority think this is a great idea. What a great idea! It's a good way to get people to reconsider their misinformed opinions. We should do this in the US.

BART FT 10%: Anti-maskers forced to dig graves for Covid-19 victims in Indonesia as punishment for refusing to wear a mask. majority believe that this is a good idea, and that it will hopefully bring about an end to the need for masks. A minority do not believe it is necessary to wear a

BART: Covid-19 can be more deadly for people living in areas with industrial air pollution. Anyone refusing to wear a mask is made to dig graves for covid-19 victims as punishment in east java. as domestic violence spikes, many victims and their children have nowhere to live. google, apple rolling out covid-

Table 10: Error Analysis: Frequency of errors and representative summary examples for each error type.

Hallucination (8/30), Examples 1 and 2.
Example 1: United States surpasses six million coronavirus cases and deaths and remains at the top of the global list of countries with the most cases and deaths. The majority are pleased to see the US still leads the world in terms of cases and deaths, with 180,000 people succumbing to Covid-19.
Example 2: Sex offender Rolf Harris is involved in a prison brawl after absconding from open jail. The majority think Rolf Harris deserves to be spat at and called a "nonce" and a "terrorist" for absconding from open prison. A minority are putting pressure on

Information Loss (12/30), Example 3.
Human Summary: Miley Cyrus invited a homeless man on stage to accept her award. Most people thought it was a lovely thing to do and it was emotional. A minority think that it was a publicity stunt.
Generated Summary: Miley Cyrus had homeless man accept Video of the Year award at the MTV Video Music Awards. The majority think it was fair play for Miley Cyrus to allow the homeless man to accept the award on her behalf. She was emotional and selfless. The boy band singer cried and thanked him for accepting the