WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

Aspect-based summarization is the task of generating focused summaries based on specific points of interest. Such summaries aid efficient analysis of text, such as quickly understanding reviews or opinions from different angles. However, due to large differences in the type of aspects for different domains (e.g., sentiment, product features), the development of previous models has tended to be domain-specific. In this paper, we propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization that attempts to spur research in the direction of open-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. We propose several straightforward baseline models for this task and conduct experiments on the dataset. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.


Introduction
Aspect-based summarization is a subtask of summarization that aims to provide targeted summaries of a document from different perspectives (Titov and McDonald, 2008; Lu et al., 2009; Wang and Ling, 2016; Yang et al., 2018; Angelidis and Lapata, 2018). Unlike generic summarization, this gives more concise summaries that are separated according to specific points of interest, allowing readers to fulfill focused information needs more easily and quickly. However, existing aspect-based summarization work is somewhat narrowly focused; for example, a great majority of the work focuses specifically on the domain of product or restaurant reviews. In contrast, generic summarization models are tested on a much wider variety of genres, from newswire (Nallapati et al., 2016; Grusky et al., 2018), to academic papers (Kang et al., 2018; Kedzie et al., 2018), to movie scripts (Gorinski and Lapata, 2015). For each genre, the types and characteristics of aspects that will need to be touched upon in a good summary will differ greatly.

[1] http://github.com/neulab/wikiasp

Figure 1: In WikiAsp, given reference documents cited by a target article, a summarization model must produce targeted aspect-based summaries that correspond to sections.
One natural source of such multi-domain articles is Wikipedia, and the section boundaries and titles in each article form natural annotations of aspects and corresponding text. There have recently been a number of attempts to generate the lead section of Wikipedia articles from the linked external sites in the reference section (Liu et al., 2018;Fan et al., 2019;Liu and Lapata, 2019a), an approach that does not explicitly consider the different aspects covered by the article. Perez-Beltrachini et al. (2019) also examine domain differences in Wikipedia text summarization. However, existing datasets and analyses lack structure, broad domain coverage, or both. We argue that (1) generating structured summaries is of inherent interest, as these will allow humans consuming the information to browse specific aspects of interest more readily, and (2) the structure will vary across domains, with different domains demonstrating very different characteristics.
In this paper, we construct a dataset for multi-domain aspect-based summarization that allows us to train models for this unique variety of summarization task, and examine the challenges posed therein. Figure 1 illustrates an overview of our task. Specifically, we turn to the section titles of Wikipedia articles and construct sets of "aspects" through steps of automatic extraction, curation, and filtering. The section texts then serve as the corresponding aspect-based summaries.
We devise a baseline two-stage method consisting of aspect identification and summarization, using both extractive and abstractive models, and conduct experiments on the proposed dataset. Analysis of the experimental results and the generated summaries reveals the unique challenges posed by our multi-domain, multi-document setting. For example, aspects that require summarizing content in a particular order (e.g., time-series events) add extra difficulty in a multi-document setting because of the need to correctly order scattered (and possibly duplicate) pieces of information from different sources. Certain domains that involve interviews with or quotes from people also pose challenges in correctly modifying pronouns based on their relationship to the topic of interest.

Generating Wikipedia as Aspect-based Summarization
Wikipedia articles exhibit a specific way of organizing information about a focused topic. An article S consists of two parts: section titles a and their contents p. The contents are further split into sections, each of which describes information about the main topic from a different viewpoint. Table 1 shows an example article about the topic "Barack Obama", with several sections such as "Early life and career," "Presidency," and "Legacy":

Aspect: Presidency
The inauguration of Barack Obama as the 44th President took place on January 20, 2009. In his first few days in office, Obama issued . . .

Aspect: Legacy
Obama's most significant legacy is generally considered to be the Patient Protection and Affordable Care Act (PPACA), . . .

In practice, the contents included in each section can take many forms, from text, tables, and images, to more specialized content such as brackets of a tournament. In this work, we focus only on sections that mainly consist of textual content (see Section 3 for how we define this). Importantly, the content in Wikipedia articles is required to be verifiable: "other people using the encyclopedia can check that the information comes from a reliable source." To ensure this, articles contain citations to a set of references R so that readers can check the validity of the content. In other words, the citations supposedly contain the majority of the information written in the articles. Liu et al. (2018) took advantage of this fact by proposing a summarization task that uses cited references as source documents. Citations include published material (such as books) and websites, but because only web-based citations can easily and automatically be mined via crawling, we follow Liu et al. (2018) in considering only web-based citations as source documents and ignoring non-web-based citations.

The goal of our task is to learn a model f : R → S that can (1) identify and gather information from cited references and (2) generate a section-by-section summary where each section contains the appropriate type of information. Formally, let R = {R_1, R_2, . . . , R_M} be a collection of M cited references for an article S = {s_1, s_2, . . . , s_N} of N sections. Each section s_i is essentially a tuple of a section title and one or more paragraphs: s_i = (a_i, p_i). While there is a fair amount of variety in section titles across different articles, articles that belong to the same domain tend to share aspects that are particularly salient for that domain. Because of this, we select a fixed-size subset of all section titles that appear in each domain as the set of aspects A that we target; details on how we select this subset will be elucidated in the following section. Hence, our task is cast as multi-document aspect-based summarization.

Table 2: Frequency of filtered aspects that are textual in 2 domains. Due to space constraints, the statistics for the remaining domains are available in Appendix C.
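The formalization above can be sketched as a minimal data structure. All names below (`Section`, `Article`, `trivial_baseline`, the "history" aspect) are illustrative assumptions, not identifiers from the released dataset.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Section:
    title: str        # aspect a_i, drawn from the domain's aspect set A
    paragraphs: list  # one or more paragraphs p_i of body text

@dataclass
class Article:
    references: list  # R = {R_1, ..., R_M}: cited web documents
    sections: list    # S = {s_1, ..., s_N}: Section objects

# A model for this task is any function f : R -> S mapping the cited
# references of an article to a section-by-section (aspect-based) summary.
Summarizer = Callable[[list], list]

def trivial_baseline(references: list) -> list:
    # Degenerate example: emit a single "history" section containing the
    # first sentence of every reference (aspect name is hypothetical).
    first_sentences = [doc.split(". ")[0] for doc in references]
    return [Section(title="history", paragraphs=[" ".join(first_sentences)])]
```

Any real model plugs into the same f : R → S interface; the baselines in Section 4 differ only in how they fill in the section bodies.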

The WIKIASP Dataset
In this section, we describe our concrete steps to create our dataset.

Data Collection
As the base data, we build upon the data collection strategy of the WikiSum dataset (Liu et al., 2018), a dataset for generating the lead sections of Wikipedia articles from referenced web pages. Following the WikiSum data generation script, we first crawled cited references covered by CommonCrawl for each Wikipedia article. We then recovered all the sections of the target Wikipedia articles (which were unused in the WikiSum dataset) and obtained pairs of (section title, section paragraph). An example is shown in Table 1.

Domain Separation
Articles in different domains focus on different salient topics, as observed by Perez-Beltrachini et al. (2019). For example, a "discography" section is common for articles about singers, but is not appropriate for articles about infrastructure. To characterize such structural differences, we separate the set of articles obtained in the previous step into sets of articles in particular domains. Specifically, we follow Perez-Beltrachini et al. (2019) in assigning one category to each article using DBpedia (Auer et al., 2007). DBpedia stores structured information for each Wikipedia article, including domain labels and infoboxes. Additionally, it defines a topical hierarchy over the domains (ontology classes). We first map articles to domain labels using the corresponding DBpedia dump. The obtained domain labels, however, have mixed granularity (e.g., Person and its sub-class Dancer), which causes an imbalance in the number of examples in each domain, as well as overlap between high-level and low-level domains in the hierarchy. We mitigate this by recursively merging leaf-level domains into coarser ones according to the aforementioned topical hierarchy from the ontology classes. We repeat the merging procedure until a branch in the hierarchy includes more than 15,000 articles, and pick 20 domains at the leaves of the merged hierarchy.

[3] Tensor2tensor's WikiSum generator was used.
[4] Due to the design of the WikiSum dataset, the first section title of any article is automatically renamed to "LEAD". Therefore, we could not recover the first sections of the Wikipedia articles. We suggest editing the data generation scripts for future WikiSum users if section title information is necessary.
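The recursive merging step can be sketched as follows, assuming the DBpedia ontology is available as a parent-pointer map and per-class article counts; the function name and input format are our own assumptions, not the actual preprocessing code.

```python
def merge_small_domains(counts, parent, threshold=15000):
    """Recursively fold ontology classes with fewer than `threshold`
    articles into their parent class, until every remaining class
    either meets the threshold or is a root of the hierarchy."""
    counts = dict(counts)  # class name -> number of articles
    changed = True
    while changed:
        changed = False
        for cls in list(counts):
            if counts[cls] < threshold and cls in parent:
                # Move this class's articles up one level and drop the class.
                counts[parent[cls]] = counts.get(parent[cls], 0) + counts.pop(cls)
                changed = True
    return counts
```

For example, a Dancer branch too small on its own is absorbed into Artist, and Artist in turn into Person, until some ancestor exceeds 15,000 articles.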

Aspect Selection
Next, we perform aspect selection on each set of domain articles extracted in the previous step. As previously noted, articles in the same domain tend to share a similar set of section titles. Motivated by this observation, we construct the set of aspects from the most frequent section titles.
From the frequency distribution of section titles in a domain, we manually filter out titles whose sections are not primarily textual, i.e., where less than half of the section consists of text. For each section title, we take 20 randomly sampled sections and include the title in the set of aspects only if 80% of the samples consist of textual paragraphs. Following the steps above, we construct the 10 most frequent aspects for each domain. However, the choice of words in section titles varies across editors even within the same domain, which leads to missing relevant aspects that are moderately frequent but not present in the top 10. For example, common section titles in the WrittenWork domain include "summary" and "plot summary," which should be merged to form a single aspect. We handle these cases by inspecting the frequency distribution further down and manually identifying semantically equivalent titles to merge. The resulting dataset consists of instances in 20 domains, where each domain has 10 pre-defined aspect classes. We show statistical comparisons of the dataset to existing aspect-based summarization datasets in Table 3, and examples of the obtained aspects for two domains in Table 2.
Appendices A and C summarize the data sizes for each domain and the obtained aspects for the remaining 18 domains, respectively.
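The selection procedure above can be approximated with a short script. The manual 20-sample / 80%-textual check is abstracted into a predicate, and the merge table is the hand-built map described in the text; all names are illustrative.

```python
from collections import Counter

def select_aspects(section_titles, merge_map, is_textual, k=10):
    """Pick the k most frequent (normalized) section titles that pass the
    'mostly textual' check, as a proxy for a domain's aspect set."""
    # Fold semantically equivalent titles (e.g. "plot summary" -> "summary").
    normalized = [merge_map.get(t.lower(), t.lower()) for t in section_titles]
    freq = Counter(normalized)
    aspects = []
    for title, _ in freq.most_common():
        if is_textual(title):  # stands in for the manual 20-sample / 80% check
            aspects.append(title)
        if len(aspects) == k:
            break
    return aspects
```

Normalizing before counting is what lets moderately frequent variants ("plot summary") contribute to a single aspect rather than being dropped.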

Baseline Models
In this section, we describe two baseline models for solving this task. Both models decompose the overall process into two stages: aspect discovery and aspect-based summarization of the classified sentences. The two baselines share the same methodology for aspect discovery, but differ in their summarization models. An overview of the models is shown in Figure 2.

Aspect Discovery
The first stage consists of labeling sentences in the cited reference texts according to aspects. Ideally, we would have training data in which sentences in the reference documents are labeled with target aspects, but such data does not exist a priori. Therefore, we instead create training data by labeling each sentence in the target articles with the aspect of the section it belongs to. For example, the article about Barack Obama in Table 1 yields training instances consisting of sentences labeled with Early life and career, Presidency, or Legacy, depending on which paragraph a sentence comes from. This data makes it possible to train a classifier that predicts aspects from text at the sentence level. At test time, cited reference sentences are fed into the learned classifier and labeled with their most likely aspects.
However, the discrepancy between inputs at training and test time is problematic: the model is not exposed at training time to noisy sentences that do not belong to any of the relevant aspects, while cited reference texts do contain such sentences. For example, an article in the Company domain may cite the company website itself, which contains commercial messages that may not be appropriate for encyclopedic text such as Wikipedia. We manage such cases by introducing an auxiliary label Other at training time, letting the model learn to identify noisy sentences as well. To do so, sentences labeled with Other are randomly sampled from texts in different domains and added to the training data. We fine-tune a pretrained RoBERTa model on this classification dataset for each domain. Logits obtained from the model are passed through a sigmoid function to obtain the probability of each aspect for a given sentence. Finally, we assign labels to a sentence by taking the aspects a_i whose probabilities exceed a threshold λ: P(a_i) > λ. The lower we set the threshold, the more (but potentially noisier) sentences we include as input to the summarization model. We tune λ independently for each domain based on performance on the validation sets, setting it to 0.5 for Group; 0.8 for Album, Animal, Building, and Film; and 0.9 for the remaining domains.
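The thresholding step can be sketched independently of the RoBERTa model itself: given per-aspect logits for a sentence, apply a sigmoid and keep every aspect whose probability exceeds λ. This is a minimal sketch of the decision rule only; the classifier call is assumed to happen elsewhere.

```python
import math

def assign_aspects(logits, aspect_names, lam=0.9):
    """Multi-label aspect assignment: sigmoid each logit and keep aspects
    with P(a_i) > lam. A sentence may receive zero, one, or several labels;
    zero labels means the sentence is treated as noise (Other)."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [a for a, p in zip(aspect_names, probs) if p > lam]
```

Because the labels are independent sigmoids rather than a softmax, lowering λ monotonically admits more (noisier) sentences per aspect, which is exactly the precision/recall trade-off tuned per domain.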

Summarization
Figure 2: Two-stage model diagram. The aspect classifier assigns aspect labels to each reference sentence R_ij from references R with a threshold λ. Sentences are then grouped according to the assigned labels and fed to the summarization model. Groups about irrelevant aspects (i.e., a_2) are ignored. Finally, the summarization model outputs summaries for each relevant aspect.

Sentences that are labeled with the same aspect are then grouped in order of occurrence in the cited references to form a chunked paragraph that discusses the same aspect. This forms aspect-based clusters of relevant sentences, which become the input to a summarization model. Conversely, aspects that are never labeled (due to low probabilities) are deemed irrelevant and thus are not summarized. We consider both an extractive and an abstractive summarization model in our baseline implementation. For the extractive model, we use TextRank (Mihalcea and Tarau, 2004; Barrios et al., 2016), a graph-based ranking model for extracting important sentences. For the abstractive model, we use PreSumm (Liu and Lapata, 2019b), a Transformer-based summarizer with a fine-tuned BERT source encoder. For each domain, PreSumm is fine-tuned on pairs of (grouped sentences, target aspect paragraph) to learn to produce summaries given the aspect-relevant sentences.
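The grouping step above is straightforward; the sketch below preserves order of occurrence in the references, with the classifier passed in as a function (a minimal sketch under those assumptions, not the actual pipeline code).

```python
def build_aspect_inputs(reference_sentences, classify):
    """Group sentences by predicted aspect, in order of occurrence.
    `classify` returns a (possibly empty) list of aspect labels per sentence;
    aspects that never receive a sentence are simply absent from the output
    and are therefore not summarized."""
    groups = {}
    for sent in reference_sentences:      # references already flattened,
        for aspect in classify(sent):     # in citation order
            groups.setdefault(aspect, []).append(sent)
    return {a: " ".join(sents) for a, sents in groups.items()}
```

Each value in the returned dict is one chunked pseudo-document, ready to be handed to TextRank or PreSumm for its aspect.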

Evaluation
We evaluate models along two axes: aspect discovery and summarization. We note that the primary task in this dataset is aspect-based summarization, thus aspect discovery evaluation discussed below is only for diagnostic purposes. Since the aspect sets differ in different domains, evaluation is performed separately for each domain.
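For the aspect discovery axis, the comparison reduces to precision, recall, and F1 between two sets of aspects. As a concrete reference, a minimal sketch of that computation (not the actual evaluation script) is:

```python
def aspect_prf(predicted, gold):
    """Precision/recall/F1 between the set of aspects a model generated
    summaries for and the set of aspects present in the target article."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```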
Aspect Discovery: Models have to correctly predict the right set of aspects about which to generate summaries. The aspect discovery criterion evaluates the similarity between the set of aspects for which a model decides to generate summaries and the set of aspects that appear in the target article. [7] For comparing these two sets, we use precision, recall, and F1 scores.

Aspect-based Summarization: Gold-standard summaries only exist for the aspects that appear in an article. Therefore, in this evaluation, we focus on evaluating the model's ability to summarize inputs for these aspects. Specifically, generated summaries are paired with the corresponding reference summaries for the same aspects and evaluated using ROUGE (Lin, 2004). Since ROUGE is a recall-based measure, the number of tokens in the model outputs directly affects performance. Controlling the length is particularly important for our dataset because the average summary length of each aspect varies across domains (e.g., "description" and "location" from the HistoricPlace domain have 396 and 90 tokens on average, respectively). We take this into account by explicitly setting the maximum number of words for extractive and abstractive summaries to the average number of words in the target summaries in the training set, for each aspect and each domain.

[7] Note that there are two potential reasons an aspect does not appear in the target article: (1) it may not be appropriate for that particular entity (e.g., the "controversy" aspect in the "company" domain should not exist if that company has legitimately never had a controversy), or (2) the article may not be complete. For this evaluation, we make the simplifying assumption that all articles are complete, and thus missing aspects are an indication of a failure to recall information; relaxing this assumption in some way may result in a more accurate evaluation.

Experiments

We provide two baseline models for the task and evaluate them on the proposed dataset.

Implementation Details
For aspect classification, we used the roberta-base model [8] and fine-tuned it for 5 epochs on the surrogate dataset created above for each domain, with a learning rate of 2 × 10⁻⁵. For extractive summarization, we specify the summary length for TextRank according to the mean length of the target summaries for each aspect in each domain. We re-train the PreSumm summarizer on our dataset for each domain: the encoder is initialized with the weights of pre-trained BERT (Devlin et al., 2019), and the decoder is trained from scratch. The total number of training steps is 300,000. For some domains, we further tuned the decoder dropout rate to 0.3 to stabilize training. At inference time, we specify maximum summary lengths for each aspect in each domain using the average summary lengths computed from the training set.

Results
In this section, we discuss the experimental results on each stage.

Aspect Discovery
We show the aspect discovery results in Table 4. We see a general trend of high-recall predictions by the model. While varying the threshold could balance precision and recall, the results exhibited high recall even after hyperparameter search, which suggests that the learned classifier is poorly calibrated. Class imbalance also plays a role here: predicting the majority classes gives high recall due to skewed aspect frequency distributions. The classifier performed best on the Town domain, achieving the highest precision and F1 score.

Summarization
The automatic evaluation results are shown in Table 5. Neither baseline unanimously outperformed the other on all domains, but we observe that PreSumm (abstractive) performs better than TextRank (extractive) on average. The low R-2 and R-L scores of both models, despite the oracle being relatively higher, suggest that important phrases to be summarized appear only rarely in the source documents. [9]

[8] We used Huggingface's implementation (Wolf et al., 2019) for obtaining and fine-tuning the weights.
[9] Note that TextRank connects nodes according to content overlap; thus, isolated sentences are not selected.

To understand the upper bound of model performance for the task, we also show the summarization results of the extractive oracle model in Table 5. Sentences were chosen directly from the cited reference texts to maximize the ROUGE score against the target summaries, thus bypassing the aspect classification stage. The oracle performance shows that a summarization model can indeed perform competitively on the dataset if the model is given the full input information. The contrasting results between the oracle and the two-stage models suggest the importance of accurate content selection before performing summarization.
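The extractive oracle can be approximated with the standard greedy procedure: repeatedly add the source sentence that most improves the overlap score against the reference. The sketch below uses unigram-set F1 as a cheap stand-in for the full ROUGE implementation, and its names are illustrative.

```python
def greedy_oracle(source_sentences, reference, max_sents=3):
    """Greedily pick source sentences maximizing unigram-overlap F1 with
    the reference summary (a crude proxy for ROUGE-1)."""
    ref_tokens = set(reference.lower().split())

    def score(selected):
        cand = set(" ".join(selected).lower().split())
        if not cand or not ref_tokens:
            return 0.0
        overlap = len(cand & ref_tokens)
        p, r = overlap / len(cand), overlap / len(ref_tokens)
        return 2 * p * r / (p + r) if p + r else 0.0

    selected = []
    while len(selected) < max_sents:
        best, best_gain = None, 0.0
        for s in source_sentences:
            if s in selected:
                continue
            gain = score(selected + [s]) - score(selected)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # no remaining sentence improves the score
            break
        selected.append(best)
    return selected
```

Because the oracle sees all reference text and the gold summary, it upper-bounds what a perfect content-selection stage could feed the summarizer.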

Analysis
We discuss the model outputs and analysis below.

Aspect-by-aspect Evaluation
Not all aspects are equally hard to summarize; some may require summarization of a broad range of information, while others require only specific concepts to be summarized. We investigate this by looking into summarization performance for both models on a per-aspect basis. Table 6 shows the best-performing aspects sorted in descending order of ROUGE-1 score for the two summarization models on the validation set. Through manual investigation of the generated samples for each aspect, we observed that the aspects where the abstractive model performed well tend to have common templates and similar choices of vocabulary, more so than other aspects. For example, 58% (out of 183 samples) of the target summaries for government in Town were identical despite the fact that the articles discuss different townships. Similar but less prevalent patterns were observed in other aspects as well.
Aspects where the extractive summarization model performed better have much larger numbers of tokens in their summaries than average. Specifically, the average summary length for the 10 aspects where TextRank performed best was 303 tokens, while that for the 10 aspects where PreSumm performed best was 166. Naturally, abstractive models have issues maintaining coherence over long decoded outputs, whereas the extractive model has few issues gathering relevant sentences, at the cost of incoherent transitions from sentence to sentence. As for content, extractive summaries exhibited the advantage of being able to correctly include mentions of numbers and dates.

Quality of Generated Summaries
We then examined the generated summaries from the two models and compared them qualitatively. Samples from some of the domains listed in Table 2 are shown in Table 7; samples from other domains are in Appendix B.
Manual inspection of the generated summaries revealed pros and cons of the two models:

• Both models are successful at discussing on-topic content. For all the summaries inspected, both models were able to generate on-topic content in spite of the source documents potentially being noisy.

• Abstractive summaries underperform at generating exact entity mentions. Almost all the samples require the generation of entities because the task targets generating encyclopedic text. Except for the title (topic) entity, the abstractive model either generated no entities or wrong ones.

Aspect Classification Accuracy
We observed a general trend of low precision for aspect discovery. We hypothesize that this is due to the limited set of target aspects for each article; correctly extracted aspects negatively affect precision if they do not exist in the target article. To quantify this, 10 random articles are selected from the validation set of the Software domain. For each article, we extract the 10 sentences labeled with the highest confidence for each of the 10 aspects, resulting in 1,000 sentences in total. Each sentence is annotated with a binary label indicating whether it is correctly associated with the aspect or not. [11] With the threshold λ set to 0.9, we achieved a precision of 45.1, which shows that the aspect discovery stage has the ability to extract aspects, but is not as good at extracting aspects relevant to the article. We observed that the model predictions tend to be polarized toward extreme values (i.e., near 0 or 1). We also show the relationship between λ ranges and precision in Figure 3, which indicates that the classifier is not well calibrated.

Table 6: List of aspects sorted in descending order of ROUGE-1 score according to PreSumm (top half) and TextRank (bottom half). "performance" and "naming" are abbreviated to "perf." and "nm.", respectively. Domain names are shortened to their first three letters.

Domain-specific Challenges
One of the benefits of having many domains for the same task is the ability to characterize the differences and challenges that are unique to certain domains. We analyzed the generated summaries from both summarization models and identified some of these challenges below.

[11] Sometimes, the entity discussed by the sentence is not clear. In this case, we annotate it as correct if the sentence could correspond to the target aspect of any entity.

Pronoun Resolution for Opinion-based Inputs
This is particularly important in domains and aspects with subjective reviews, such as music (Album, Artist, Group, and Single) or Software. Source documents in these domains often include quotes from artists or critics, which are often written from a first- or second-person perspective. These are usually converted by Wikipedia editors into more encyclopedic text, citing the source of the information and writing in the third person. By design, extractive summaries suffer from this problem because they lack the ability to transform input sentences in any way. For example, the first extractive summary in Table 7 describes a game in a subjective way. We verified this by randomly selecting 20 summaries for the gameplay aspect in the Software domain. We inspected pronouns in the extractive summaries and marked those containing first- or second-person pronouns that do not appear in the gold summaries. We found that 65% of the samples contained such undesirable pronouns, which do not align with the format of the gold summaries.
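The surface check above can be approximated with a simple token-level filter. This is a rough heuristic sketch, not the exact annotation protocol, and the pronoun list is our own assumption.

```python
import re

FIRST_SECOND = {"i", "me", "my", "mine", "we", "us", "our", "ours",
                "you", "your", "yours"}

def has_undesirable_pronouns(summary, gold):
    """Flag a summary containing first/second-person pronouns that the
    gold (encyclopedic) summary does not itself contain."""
    tok = lambda text: set(re.findall(r"[a-z']+", text.lower()))
    leaked = (tok(summary) - tok(gold)) & FIRST_SECOND
    return bool(leaked)
```

Comparing against the gold summary (rather than flagging all pronouns) avoids penalizing legitimate quoted material that the encyclopedic text also keeps.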

Chronological Explanation
This variety of content is often found in certain aspects such as history and event, which tend to appear across multiple domains but are most prevalent in Event, HistoricPlace, and non-human entities like Company and Building. In these aspects, it is essential to describe key information in the correct chronological order for better readability. This would not be a hard task for single-document summarization, as the model could perform reasonably by following the order of the original document. However, since our input is multi-document, maintaining chronological order when aggregating information across multiple documents becomes non-trivial. Indeed, neither of the models was successful at being faithful to the order, even when there are enough clues in the original references. For example, multiple sentences start with "In [year], . . .", but the generated summary jumps around in time. We randomly picked 20 samples of extractive summaries with the history aspect from the Company domain and found that 25% of the samples have inconsistent timeline explanations.
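The 25% figure above came from manual inspection; a crude automatic proxy is to extract sentence-initial year mentions and test whether they appear in non-decreasing order (an illustrative heuristic only, which misses dates expressed in other forms).

```python
import re

def years_in_order(summary_sentences):
    """Return True if every sentence of the form 'In <year>, ...'
    appears in non-decreasing chronological order."""
    years = [int(m.group(1)) for s in summary_sentences
             for m in [re.match(r"In (\d{4}),", s)] if m]
    return all(a <= b for a, b in zip(years, years[1:]))
```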

Related Work
Aspect-based Summarization: Aspect-based summarization has been widely investigated, primarily on product or restaurant reviews (Titov and McDonald, 2008; Lu et al., 2009; Yang et al., 2018; Wang and Ling, 2016). Angelidis and Lapata (2018) proposed a weakly supervised method for aspect-based opinion summarization that discovers aspects with a topic model and does not require gold aspect annotation. TAC 2010 held a shared task on guided summarization in the newswire domain, which resembles aspect-based summarization in terms of topic guidance. Recently, the task has been extended to the news domain by generating artificial datasets for aspect-based summarization to address the lack of large-scale data with aspect annotation (Frermann and Klementiev, 2019; Krishna and Srinivasan, 2018). Our work also builds an aspect-based summarization dataset automatically and is most similar to Krishna and Srinivasan (2018), but utilizes naturally available online encyclopedia entries and their sections in multiple domains.

Wikipedia as a Summarization Dataset
Wikipedia has been studied as a target resource for generation. An early attempt at generating full Wikipedia articles relied on web search results for target entities as inputs (Sauper and Barzilay, 2009), simulating the authoring process of humans searching for information on the Internet. Liu et al. (2018) formulated the sub-task of generating lead sections as summarization of reference web pages into target articles. The resulting WikiSum dataset is accompanied by rich metadata about articles and has inspired other uses of the data (Perez-Beltrachini et al., 2019). Our work also builds upon the WikiSum dataset, and aims to evaluate aspect-based summarization models using the different sections of Wikipedia articles. Compared to Sauper and Barzilay (2009), our dataset is an order of magnitude larger, both in the number of articles and in the number of domains covered.

Multi-Document Summarization
Extractive methods have proven effective for multi-document summarization in previous work (Nenkova et al., 2006; Cao et al., 2015; Yasunaga et al., 2017), but abstractive methods have increasingly been adopted for the task (Lebanoff et al., 2018; Fabbri et al., 2019). Our task is based on the idea of Liu et al. (2018), which treats references as source documents for multi-document summarization, and we experiment with both types of summarization models.

Conclusion and Future Work
In this paper, we propose a large-scale, multi-domain, multi-aspect summarization dataset derived from Wikipedia. Through experiments, we perform an extensive analysis of performance across different genres and aspect types. Our analysis demonstrates that there are both general challenges in summarizing into various aspects, as well as challenges specific to particular genres, such as time-consistent mentions and proper pronoun conversion depending on the writer of the original content. Because of this, the proposed dataset also provides a testbed for several potential directions of future work. For example, better aspect discovery models may take into account the coherence of the discourse in the original documents when extracting aspects. Better summarization models may take into account the provenance of the information, appropriately determining when information is written by a first or third party. WikiAsp also invites a focus on domains of interest to investigate various problems in text summarization, such as correct pronoun handling and description of chronological timelines.

Acknowledgment
We would like to thank the anonymous reviewers for their insightful comments. HH and GN were supported by a grant from AlphaSense.

Table 9: Generated summaries from the Album domain.
Title: Pride and Glory (film)

Aspect: Plot
Gold: assistant chief francis tierney sr. is the head of a multigenerational new york city police department (nypd) family, which includes his sons francis "franny" jr., ray, and his son-in-law jimmy egan. deputy inspector franny is the commanding officer of the 31st precinct, where sergeant jimmy is a patrol officer, . . .

Ext.: as we know, under the macho code, this means that after two people who love each other end up beaten and bloody, they will somehow arrive at a catharsis. the plot involves how and why the four cops were killed. a family of police officers - patriarch, two sons, and a son-in-law - deals with corruption in a precinct in washington heights. . . .

Abs.: in the year before the events of the first film, the movie takes place in washington heights, d.c., a. army sergeant-in-law, ray's wife, and sister abby, living in washington city. they have a romantic relationship with one of their officers. while the four officers are called to "the mental patient", . . .

Gold: soudas served for one term as a school trustee at the western quebec school board from 2002 to 2005. between 2006 and 2011, soudas was a "high profile" member of prime minister stephen harper's communication team, and one of the prime minister's "closest and most faithful aides." initially serving as a press secretary and later as an associate director of communications for the prime minister's office, . . .

Ext.: april 2010 - after serving as a press secretary in the prime minister's office, soudas was promoted to director of communications. "to fulfil the opportunities afforded by social media, directors of communication need to be aware of this trend and engage with it," dimitri soudas writes in his master's thesis, a copy of which has been obtained by cbc news. . . .

Abs.: in 2001, he was elected to the canadian house of commons as a member of the people's action party (pc) for the riding of yorkshire. he was re-elected in 2002 and 2006. in 2006, he was .