Abstract
This paper presents a novel unsupervised abstractive summarization method for opinionated texts. While basic variational autoencoder-based models assume a unimodal Gaussian prior for the latent code of sentences, we replace it with a recursive Gaussian mixture, where each mixture component corresponds to the latent code of a topic sentence and is weighted by a tree-structured topic distribution. By decoding each Gaussian component, we generate sentences with tree-structured topic guidance, where the root sentence conveys generic content and the leaf sentences describe specific topics. Experimental results demonstrate that the generated topic sentences are appropriate as a summary of opinionated texts: they are more informative and cover more of the input content than the summaries generated by a recent unsupervised summarization model (Bražinskas et al., 2020). Furthermore, we demonstrate that the variance of the latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015).
1 Introduction
Summarizing opinionated texts, such as product reviews and online posts on Web sites, has attracted considerable attention recently along with the development of e-commerce and social media. Although extractive approaches are widely used in document summarization (Erkan and Radev, 2004; Ganesan et al., 2010), they often fail to provide an overview of the documents, particularly for opinionated texts (Carenini et al., 2013; Gerani et al., 2014). Abstractive summarization can overcome this challenge by paraphrasing and generalizing an entire document. Although supervised approaches have seen significant success with the development of neural architectures (See et al., 2017; Fabbri et al., 2019), they are limited to specific domains, e.g., news articles, where a large number of gold summaries are available. However, the domain of opinionated texts is diverse; manually writing gold summaries is therefore costly.
This lack of gold summaries has motivated prior work on unsupervised abstractive summarization of opinionated texts, for example, product reviews (Chu and Liu, 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020). While these methods generate consensus opinions by condensing the input reviews, two key components are absent: topics and granularity (i.e., the level of detail). For instance, as shown in Figure 1, a gold summary of a restaurant review provides the overall impression as well as details about certain topics, such as food, ambience, and service. Hence, a summary typically comprises diverse topics, some of which are described in detail, whereas others are mentioned concisely.
Motivated by this observation, we capture the topic-tree structure of reviews and generate topic sentences, that is, sentences summarizing specific topics. In the topic-tree structure, the root sentence conveys generic content, and the leaf sentences mention specific topics. From the generated topic sentences, we extract those with appropriate topics and levels of granularity as a summary. In extractive summarization, capturing topics (Titov and McDonald, 2008; Isonuma et al., 2017; Angelidis and Lapata, 2018) and topic-tree structure (Celikyilmaz and Hakkani-Tur, 2010, 2011) has proven useful for detecting salient sentences. To the best of our knowledge, this is the first study to use topic-tree structure in unsupervised abstractive summarization.
The difficulty of generating sentences with tree-structured topic guidance lies in controlling the granularity of topic sentences. Wang et al. (2019) generated a sentence with designated topic guidance, assuming that the latent code of an input sentence can be represented by a Gaussian mixture model (GMM), where each Gaussian component corresponds to the latent code of a topic sentence. While they successfully generated a sentence relating to a designated topic by decoding each mixture component, modeling sentence granularity in a latent space to generate topic sentences with multiple granularities remains to be realized.
To overcome this challenge, we model sentence granularity by the variance of the latent code. We assume that general sentences have more uncertainty and are generated from a latent distribution with a larger variance, analogous to Gaussian word embedding (Vilnis and McCallum, 2015). Based on this assumption, we represent the latent code of topic sentences with Gaussian distributions, where a parent Gaussian receives a larger variance and represents a more generic topic sentence than its children, as shown in Figure 1. To obtain latent code with this property, we introduce a recursive Gaussian mixture prior for the latent code of the input sentences in reviews. A recursive GMM consists of Gaussian components that correspond to the nodes of the topic tree, and each child prior is set to the inferred posterior of its parent. Because of this configuration, the Gaussian distributions of higher topics receive larger variances and convey more general content than those of lower topics.
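As a rough illustration of this idea (a minimal sketch, not the authors' implementation; the dimensions and parameter values below are placeholders), the prior of each child component can be centred on its parent's inferred posterior, and the standard diagonal-Gaussian KL term then ties the two together:

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) for diagonal Gaussians."""
    var_q, var_p = torch.exp(logvar_q), torch.exp(logvar_p)
    return 0.5 * torch.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Illustrative recursive prior: the prior of a child topic is the inferred posterior
# of its parent, so children stay close to their parent in the latent space and
# end up with smaller variances (more specific content) as the tree deepens.
parent_mu, parent_logvar = torch.zeros(32), torch.zeros(32)              # inferred parent posterior
child_mu, child_logvar = torch.randn(32) * 0.1, torch.full((32,), -1.0)  # child posterior (smaller variance)
print(kl_diag_gaussians(child_mu, child_logvar, parent_mu, parent_logvar).item())
```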
The contributions of our work are as follows:
We propose a novel unsupervised abstractive opinion summarization method by generating sentences with tree-structured topic guidance.
To model the sentence granularity in a latent space, we specify a Gaussian distribution as the latent code of a sentence and demonstrate that the granularity depends on the variance size.
Experiments demonstrate that the generated summaries are more informative and cover more of the input content than those of a recent unsupervised summarization model (Bražinskas et al., 2020).
2 Preliminaries
Bowman et al. (2016) adapted the variational autoencoder (VAE; Kingma and Welling, 2014; Rezende et al., 2014) to obtain the density-based latent code of sentences. They assume the generative process of documents to be as follows:
For each document index d ∈ {1,…,D}:
For each sentence index s ∈ {1,…,Sd} in d:
- Draw a latent code of the sentence xs: xs ∼ N(0, I) (1)
- Draw a sentence ws: ws ∼ pθ(ws | xs) (2)
By representing sentences as Gaussian densities rather than point vectors, a sentence decoded from an intermediate latent code between two sentences is grammatical and topically coherent with both. Extending their work, we construct the prior as a recursive GMM and infer the topic sentences by decoding each Gaussian component.
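A minimal sketch of this generative process (Eqs. 1–2), assuming a toy decoder; the class, dimensions, and single-token decoding are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ToySentenceDecoder(nn.Module):
    """Toy decoder p_theta(w_s | x_s): maps a latent code to a distribution
    over the vocabulary (a GRU decoder is used in the actual model)."""
    def __init__(self, latent_dim=32, vocab_size=1000):
        super().__init__()
        self.out = nn.Linear(latent_dim, vocab_size)

    def forward(self, x):
        return torch.log_softmax(self.out(x), dim=-1)

decoder = ToySentenceDecoder()

# Eqs. (1)-(2): for each sentence, draw x_s ~ N(0, I), then w_s ~ p_theta(w_s | x_s).
num_sentences = 5
x = torch.randn(num_sentences, 32)                               # Eq. (1)
log_probs = decoder(x)                                           # parameters of p_theta(w_s | x_s)
w = torch.distributions.Categorical(logits=log_probs).sample()   # Eq. (2), one token per sentence here
print(w.shape)  # torch.Size([5])
```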
3 RecurSum: Recursive Summarization
In this section, we explain our model, RecurSum. Figure 2 shows the outline. The latent code of review sentences is obtained as a recursive GMM (3.1), and topic sentences are inferred by decoding each Gaussian component (3.2). A summary is then created by extracting the appropriate topic sentences (3.3). We introduce additional components to improve the quality of topic sentences (3.4) and explain why general/specific content is conveyed by the root/leaf topics, referring to the analogy with Gaussian word embedding (3.5).
3.1 Generative Model of Reviews
We assume the generative process of reviews to be as follows. We refer to the set of sentences in multiple reviews of a specific product as an instance. In contrast to Bowman et al. (2016), we explicitly model the topic of each review sentence:
For each instance index d ∈ {1,…,D}:
For each sentence index s ∈ {1,…,Sd} in d:
- Draw a topic of the sentence zs ∈ {1,…,K}: zs ∼ Cat(πd) (5)
- Draw a latent code of the sentence xs: xs ∼ N(μzs, Σzs) (6)
- Draw a review sentence ws: ws ∼ pθ(ws | xs) (7)
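A minimal sketch of Eqs. (5)–(7), assuming a fixed tree with K = 21 topics as in Section 4.2; the topic distribution πd and the per-topic Gaussian parameters below are random placeholders for the quantities inferred by the model:

```python
import torch

K, latent_dim = 21, 32  # 1-4-16 topic tree as in Section 4.2

# Placeholders for inferred quantities. In the model, these come from the
# recursive GMM, where each child component's prior is centred on the
# inferred posterior of its parent component.
pi_d = torch.full((K,), 1.0 / K)            # tree-structured topic distribution of instance d
topic_mu = torch.randn(K, latent_dim)       # mean of each topic's Gaussian component
topic_logvar = torch.zeros(K, latent_dim)   # diagonal log-variance of each component

def sample_sentence_latents(num_sentences):
    """Eq. (5): draw a topic z_s ~ Cat(pi_d); Eq. (6): draw a latent code x_s
    from the Gaussian component of that topic."""
    z = torch.distributions.Categorical(probs=pi_d).sample((num_sentences,))
    mu, std = topic_mu[z], torch.exp(0.5 * topic_logvar[z])
    x = mu + std * torch.randn(num_sentences, latent_dim)
    return z, x

topics, latents = sample_sentence_latents(8)
# Eq. (7): each latent code x_s would then be fed to the GRU decoder p_theta(w_s | x_s).
print(topics.tolist(), latents.shape)
```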
3.2 Inference of Topic Sentences
3.3 Extraction of Summary Sentences
Next, we create a summary by extracting appropriate sentences from the generated topic sentences. As gold summaries are not available for training, we need a measure that evaluates candidate summaries using only the input reviews. As reported in Chu and Liu (2019), the ROUGE scores (Lin, 2004) between a candidate summary and the input reviews effectively measure the extent to which the summary encapsulates the reviews. Based on this observation, we search over the topic sentences by maximizing the ROUGE-1 F-measure against the review sentences in an instance. We use a beam search and keep the multiple highest-scoring candidates at each step. Similar to Carbonell and Goldstein (1998), to eliminate redundancy among summary sentences, we do not add a sentence with a high word overlap (ROUGE-1 precision) with the sentences already included in the summary. The hyperparameters are tuned on the validation set, as described in Section 4.2.
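A simplified sketch of this extraction step (the ROUGE computation, whitespace tokenization, and scoring details are illustrative; the hyperparameter values follow Section 4.2):

```python
from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    """Unigram-overlap (precision, recall, F1)."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

def extract_summary(topic_sents, review_sents, max_sents=6, beam_width=8, redundancy=0.6):
    """Beam search over the generated topic sentences, maximizing ROUGE-1 F1
    against the input reviews and skipping candidates whose ROUGE-1 precision
    against the already-selected sentences exceeds the redundancy threshold."""
    reference = [tok for s in review_sents for tok in s.split()]
    beams = [((), 0.0)]
    best_ids, best_score = (), 0.0
    for _ in range(max_sents):
        candidates = {}
        for chosen, _ in beams:
            selected_tokens = [t for j in chosen for t in topic_sents[j].split()]
            for i, sent in enumerate(topic_sents):
                if i in chosen:
                    continue
                tokens = sent.split()
                if selected_tokens and rouge1(tokens, selected_tokens)[0] > redundancy:
                    continue  # too much word overlap with the current summary
                new_ids = tuple(sorted(chosen + (i,)))
                score = rouge1(selected_tokens + tokens, reference)[2]
                candidates[new_ids] = max(candidates.get(new_ids, 0.0), score)
        if not candidates:
            break
        beams = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        if beams[0][1] > best_score:
            best_ids, best_score = beams[0]
    return [topic_sents[i] for i in best_ids]
```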
After selecting the summary sentences, we sort them in depth-first order according to the topic-tree structure—that is, we begin at the root node and explore as far as possible along each branch before backtracking. Barzilay and Lapata (2008) argue that adjacent sentences in coherent text tend to have similar content. As we assume that sentences linked by parent-child relations are topically coherent, the generated summary is expected to be locally coherent when child sentences are extracted after their parent sentence.
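For example, with a toy 1–2–4 tree (the node ids and the selected set below are illustrative), the extracted sentences can be reordered as follows:

```python
def depth_first_order(selected, children, root=0):
    """Order the ids of the selected topic sentences by a depth-first traversal
    of the topic tree, so that each child follows its parent."""
    order = []
    def visit(node):
        if node in selected:
            order.append(node)
        for child in children.get(node, []):
            visit(child)
    visit(root)
    return order

# Toy 1-2-4 tree: 0 is the root, 1 and 2 its children, 3-6 the leaves.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(depth_first_order({0, 2, 4, 5}, children))  # [0, 4, 2, 5]
```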
3.4 Additional Model Components
The basic components of our model have been explained in the previous sections. This section introduces three additional components to improve the quality of topic sentences. In ablation studies (Section 5.2), we will see the effect of these components on summarization performance.
Discriminator
Attention
Nucleus Sampling
During inference, we use nucleus sampling (Holtzman et al., 2019) to decode the topic sentences. Holtzman et al. (2019) report that maximization-based decoding methods such as beam search tend to generate bland, incoherent, and repetitive text in open-ended text generation. As we will see in the ablation experiments, nucleus sampling is effective for generating diverse and informative topic sentences.
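A minimal sketch of nucleus (top-p) sampling at a single decoding step; the threshold 0.4 follows Section 4.2, while the vocabulary size and logits below are placeholders:

```python
import torch

def nucleus_sample(logits, top_p=0.4):
    """Sample the next token from the smallest set of tokens whose cumulative
    probability exceeds top_p (Holtzman et al., 2019)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens outside the nucleus: shift the mask right so the first token
    # that crosses top_p is kept, and always keep the most probable token.
    cutoff = cumulative > top_p
    cutoff[..., 1:] = cutoff[..., :-1].clone()
    cutoff[..., 0] = False
    sorted_probs[cutoff] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, next_sorted)

logits = torch.randn(1, 30000)  # decoder output over the vocabulary at one step
print(nucleus_sample(logits).item())
```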
3.5 Analogy with Gaussian Word Embedding
Here, we explain why a general sentence is generated from the root topic, while more specific content is conveyed by the sentences generated by the leaf topics, referring to Gaussian word embedding.
Gaussian word embedding (Vilnis and McCallum, 2015) represents words as Gaussian distributions and captures the hierarchical relations among words. As shown in Figure 3, by representing words as densities over a latent space and minimizing the KL-divergence between the distributions, they find that general words such as “animal” obtain a larger variance than more specific words such as “dog” and “cat”. This can be explained by the fact that general words have more uncertainty in their meaning (i.e., “animal” sometimes denotes “dog” and other times “cat”).
Similar to Vilnis and McCallum (2015), we observed that the eigenvalues of the full covariance of the topic sentences (14) become extremely small during training. To keep the covariance reasonably sized and positive semi-definite, we add a hard constraint that lower-bounds the diagonal covariance of the review sentences, as derived in Appendix A.3.
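A minimal sketch of such a lower bound on a diagonal covariance; the actual bound is derived in Appendix A.3, and min_var below is an illustrative placeholder rather than the derived value:

```python
import torch

def clamp_diagonal_covariance(logvar, min_var=1e-2):
    """Lower-bound each diagonal entry of a sentence's covariance so the latent
    Gaussian stays reasonably sized and positive definite.
    min_var is an illustrative value, not the bound derived in Appendix A.3."""
    var = torch.exp(logvar)
    return torch.log(torch.clamp(var, min=min_var))

logvar = torch.randn(8, 32) - 5.0   # very small variances, as observed late in training
print(torch.exp(clamp_diagonal_covariance(logvar)).min().item() >= 1e-2)  # True
```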
4 Experiments
4.1 Datasets
In our experiments, we used the Yelp Dataset Challenge1 and Amazon product reviews (McAuley et al., 2015). By pre-processing the reviews as in Chu and Liu (2019) and Bražinskas et al. (2020), we obtained the datasets shown in Table 1. For the training set, we removed products2 with fewer than 8 reviews and reviews with more than 50 sentences. To prevent the dataset from being dominated by a small number of products, we created 12 and 2 instances per product for Yelp and Amazon, respectively, randomly selecting 8 reviews to construct each instance. For the validation/test set of Yelp, we randomly split the 200 instances provided by Chu and Liu (2019)3 into validation and test sets. For Amazon, we used the validation and test sets provided by Bražinskas et al. (2020).4 These gold summaries were created by Amazon Mechanical Turk (AMT) workers, who summarized 8 reviews for each product. The vocabulary comprises words that appear more than 16 times in the training set; the vocabulary sizes are 31,748 and 30,732 for Yelp and Amazon, respectively.
4.2 Implementation Details
We set the hyperparameters to maximize ROUGE-L on the Yelp validation set and use the same hyperparameters for Amazon.5 The dimensions of the word embeddings and the latent code of the sentences are 200 and 32, respectively. The encoder and decoder are single-layer bi-directional and uni-directional GRU-RNNs (Chung et al., 2014), respectively, with 200-dimensional hidden units for each direction. The threshold of nucleus sampling is 0.4. We train our model using Adam (Kingma and Ba, 2014) with a learning rate of 5.0 × 10−3, a batch size of 8, and a dropout rate of 0.2. The initial Gumbel-softmax temperature is set to 1 and decreased by 2.5 × 10−5 per training step. Similar to Bowman et al. (2016) and Yang et al. (2017), we avoid posterior collapse by increasing the weight of the KL term by 2.5 × 10−5 per training step. The review sentence's minimum covariance is set following the constraint described in Section 3.5. Regarding the tree structure, we set the number of levels to 3 and the number of branches to 4 for both the second and third levels, giving 21 topics in total. For the summary sentence extractor in Section 3.3, we set the maximum number of extracted sentences to 6, the beam width to 8, and the redundancy threshold to 0.6.
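A small sketch of the two per-step schedules described above; the increments follow Section 4.2, while the linear form, the cap at 1, and the temperature floor are assumptions for illustration:

```python
def schedules(step, kl_increment=2.5e-5, temp_decrement=2.5e-5, init_temp=1.0, min_temp=0.1):
    """Per-step annealing: the KL weight grows linearly (capped at 1 here) to
    avoid posterior collapse, and the Gumbel-softmax temperature decays linearly
    from 1 (min_temp is an assumed floor, not a value from the paper)."""
    kl_weight = min(1.0, step * kl_increment)
    temperature = max(min_temp, init_temp - step * temp_decrement)
    return kl_weight, temperature

print(schedules(0))       # (0.0, 1.0)
print(schedules(20000))   # (0.5, 0.5)
print(schedules(50000))   # (1.0, 0.1) -- weight capped at 1; assumed temperature floor
```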
4.3 Baseline Methods
As a baseline, we use Multi-Lead-1, which extracts the first sentence of each review. Furthermore, we employ unsupervised extractive approaches, LexRank (Erkan and Radev, 2004) and Opinosis (Ganesan et al., 2010). LexRank is a PageRank-based sentence extraction method that constructs a graph in which sentences and their similarity are represented by the nodes and edges, respectively. Opinosis constructs a word-based graph and extracts redundant phrases as a summary. As unsupervised abstractive summarization methods, we use MeanSum (Chu and Liu, 2019), Copycat (Bražinskas et al., 2020), and DenoiseSum (Amplayo and Lapata, 2020). MeanSum computes the mean of the review embeddings and decodes it as a summary. Copycat generates a consensus opinion by a hierarchical VAE which is trained by generating a new review given a set of other reviews of a product. DenoiseSum6 creates synthetic reviews by adding noise to original reviews and generates a summary by removing non-salient information as noise.
As an upper bound for the extraction step, we also report the performance of Oracle, which extracts the topic sentences that obtain the highest ROUGE-L against each gold summary. As the average number of sentences in the gold summaries is approximately four, we extract four topic sentences to generate a summary.
4.4 Semi-automatic Evaluation of Summaries
Following Chu and Liu (2019) and Bražinskas et al. (2020), we use the ROUGE-1/2/L F1-scores (Lin, 2004) as semi-automatic evaluation metrics.
Table 2 shows the ROUGE scores of our model, RecurSum, and the baselines on the test sets. On most metrics for both datasets, our model outperforms MeanSum and achieves competitive performance compared with the recent unsupervised summarization model, Copycat. With oracle extraction, our model significantly outperforms the other models. This result suggests that our model can further improve its performance with more sophisticated extraction methods. Although we also attempted to use an integer linear programming-based method (Gillick and Favre, 2009), it did not improve the performance. Developing such extraction techniques is beyond the scope of the current study, which focuses on topic structure, and is deferred to future work.
Table 2: ROUGE F1 scores of our model and the baselines on the Yelp and Amazon test sets.

| Model | Yelp ROUGE-1 | Yelp ROUGE-2 | Yelp ROUGE-L | Amazon ROUGE-1 | Amazon ROUGE-2 | Amazon ROUGE-L |
|---|---|---|---|---|---|---|
| Multi-Lead-1 | 27.42 | 3.74 | 14.34 | 30.32 | 5.85 | 15.96 |
| LexRank (Erkan and Radev, 2004) | 26.40 | 3.19 | 14.35 | 31.42 | 5.31 | 16.70 |
| Opinosis (Ganesan et al., 2010) | 25.80 | 2.92 | 14.57 | 28.90 | 4.11 | 16.33 |
| MeanSum (Chu and Liu, 2019) | 28.66 | 3.73 | 15.77 | 30.16 | 4.51 | 17.76 |
| Copycat (Bražinskas et al., 2020) | 28.95 | 4.80 | 17.76 | 31.84 | 5.79 | 20.00 |
| DenoiseSum (Amplayo and Lapata, 2020) | 29.77 | 5.02 | 17.63 | – | – | – |
| RecurSum (Our Model) | 33.24 | 5.15 | 18.01 | 34.91 | 6.33 | 18.91 |
| RecurSum (Oracle) | 35.59 | 7.93 | 28.63 | 37.17 | 9.85 | 30.19 |
4.5 Human Evaluation of Summaries
Quality of the Summaries
We presented four system summaries in random order and asked six AMT workers to rank their quality with reference to the gold summary. We compute each system's score as the percentage of times it was selected as the best minus the percentage of times it was selected as the worst, using best-worst scaling (Louviere et al., 2015; Kiritchenko and Mohammad, 2016).
Following Bražinskas et al. (2020) and Amplayo and Lapata (2020), we use the following four criteria: Fluency: the summary is grammatically correct and easy to read and understand; Coherence: the summary is well structured and organized; Informativeness: the summary mentions specific aspects of the product; and Redundancy: the summary has no unnecessary repetitive words or phrases.
Table 3 shows the human evaluation scores of the four systems. In terms of coherence and informativeness, RecurSum achieves the highest score among all approaches on both datasets. This result indicates the effectiveness of considering topics and structure in unsupervised abstractive opinion summarization. With regard to fluency, Copycat is superior to our model because our model sometimes makes grammatical or referential errors, which negatively impact fluency, as shown later in Section 5.1.
Table 3: Human evaluation scores (best-worst scaling) on Yelp and Amazon.

| Model | Yelp Fluency | Yelp Coherence | Yelp Informativeness | Yelp Redundancy | Amazon Fluency | Amazon Coherence | Amazon Informativeness | Amazon Redundancy |
|---|---|---|---|---|---|---|---|---|
| LexRank | −16.88 | −13.51 | −0.64 | −6.83 | −18.18 | −15.07 | 14.11 | −4.76 |
| MeanSum | 5.63 | −16.18 | −13.73 | 0.70 | 2.74 | −14.69 | −13.70 | 1.32 |
| Copycat | 15.07 | 7.88 | −7.19 | 4.00 | 14.65 | 9.80 | −17.65 | 6.85 |
| RecurSum | −2.56 | 19.46 | 24.44 | 2.78 | 0.70 | 17.72 | 17.39 | −2.99 |
Faithfulness of the Summaries
Abstractive summarization sometimes invents content that is unfaithful to the input texts (Maynez et al., 2020). The next study assesses whether the contents mentioned in the generated summaries are included in the input reviews. We use the same summary sets as in the quality evaluation and split them into sentences. For each summary sentence, we asked the AMT workers to judge whether the content is fully mentioned (Full), some of the content is mentioned (Partial), or no content is mentioned (No) in the reviews.
Table 4 shows the percentage of each answer. The difference between the frequency distributions of the two models is not statistically significant according to a χ2 test (significance level 0.05). This result indicates that our model reflects the content of the input reviews as faithfully as Copycat does.
Table 4: Faithfulness evaluation: percentage of summary sentences whose content is fully, partially, or not mentioned in the input reviews.

| | Yelp Copycat | Yelp RecurSum | Amazon Copycat | Amazon RecurSum |
|---|---|---|---|---|
| Full | 47.79 | 47.43 | 45.64 | 44.74 |
| Partial | 41.59 | 40.00 | 40.94 | 38.95 |
| No | 10.62 | 12.57 | 13.42 | 16.32 |
Coverage of the Summaries
Another desirable property of summaries is that they cover as much of the content mentioned in the input reviews as possible. As reported in Bražinskas et al. (2020), Copycat and MeanSum achieve relatively low scores in the human evaluation of opinion consensus, which captures the coverage of common opinions in the input reviews. In contrast, as RecurSum explicitly generates summary sentences for each topic, it could cover more input content across diverse topics. To test this assumption, we conducted the reverse of the faithfulness study. As in the faithfulness evaluation, we split the reviews into sentences and, for each review sentence, asked the AMT workers to rate the extent to which the generated summaries cover its content.
Table 5 shows the percentage of fully covered (Full), partially covered (Partial), and uncovered (No) review sentences. In addition to the two models, we include gold summaries as an upper bound. For both datasets, RecurSum covers more of the common opinions by capturing diverse topics.
Table 5: Coverage evaluation: percentage of review sentences whose content is fully, partially, or not covered by the summaries.

| | Yelp Copycat | Yelp RecurSum | Yelp Gold | Amazon Copycat | Amazon RecurSum | Amazon Gold |
|---|---|---|---|---|---|---|
| Full | 23.94 | 31.05 | 34.52 | 29.15 | 33.31 | 38.41 |
| Partial | 30.02 | 37.73 | 40.91 | 29.15 | 36.53 | 39.63 |
| No | 46.04 | 31.22 | 24.58 | 41.71 | 30.16 | 21.96 |
5 Discussion
5.1 Analyzing Generated Summaries
In this section, we discuss the strengths and weaknesses of our method by presenting examples of the generated summaries and tree structures.
In Figure 4(a), we present a summary of reviews of shoes from Amazon. RecurSum generates topic sentences about fit and size (12, 121), similar to Copycat. In addition, our model also mentions color and use (11, 111, 112), which are also described in the gold summary. While we cannot tell from Copycat's summary that the shoes are appropriate for weddings, RecurSum covers this topic and provides more useful information.
Figure 4(b) shows the generated summaries for a coffee shop review in Yelp. While both RecurSum and Copycat give a positive opinion about the taste of the bubble tea (tea with tapioca), RecurSum also covers the desserts (12, 121), as the gold summary does. Copycat also refers to friendly staff, but the staff are not mentioned in the input reviews; by measuring content overlap with the input reviews, our model avoids extracting topic sentences about the staff. However, RecurSum sometimes makes grammatical or referential errors, such as “It's a little bit of the best bubble tea”. These errors cause the inferior performance of RecurSum in terms of fluency.
Figure 4(c) shows the summary of an Amazon review of a table and chair set. RecurSum accurately captures opinions about the table (11, 111, 112) and the chairs (12, 121). The topic sentences at the bottom level elaborate on their parent sentences, referring to the easy assembly (111), appropriate uses of the table (112), and the quality of the chairs (121). By inferring topics in a tree structure, RecurSum can offer summary sentences over multiple granularities of topics.
5.2 Ablation Study of Model Components
We report the results of an ablation study to investigate how individual components affect summarization performance. In addition to the ROUGE scores, we also report self-BLEU scores (Zhu et al., 2018) to investigate the diversity of the generated summaries. Self-BLEU is computed by calculating the BLEU score of each generated summary with all other generated summaries in the test set as references. A higher self-BLEU implies that the generated summaries are less diverse, that is, the model tends to generate a generic summary similar to the other summaries. Table 6 shows the performance of the model variants on the Yelp dataset.
Table 6: Ablation results on the Yelp dataset (R: ROUGE F1, B: self-BLEU).

| Model Variant | R-1 | R-2 | R-L | B-3 | B-4 |
|---|---|---|---|---|---|
| w/o Discriminator | 30.52 | 3.50 | 16.43 | 54.18 | 30.42 |
| w/o Attention | 30.62 | 4.87 | 17.01 | 66.11 | 50.89 |
| w/o Nucleus | 31.71 | 5.10 | 17.70 | 69.13 | 55.81 |
| Full | 33.24 | 5.15 | 18.01 | 64.30 | 48.37 |
w/o Discriminator denotes our model without the discriminator. The ROUGE scores are significantly lower than those of the full model. Without the discriminator, the topic distribution becomes sparse (i.e., most of the review sentences are assigned to a few specific topics). Consequently, the model obtains incoherent topics and generates summaries that are unfaithful to the input reviews. The discriminator penalizes this situation, encouraging topically different sentences to be assigned to their appropriate topics. This mechanism makes the generated topic sentences topically coherent and improves the ROUGE scores.
w/o Attention denotes our model without the attention mechanism. Although the generated sentences are faithful to the input reviews, they are often generic and miss specific details of the content. With the attention mechanism, the generated summary reflects the content of the input reviews more effectively and provides more detailed information. Although the copy mechanism (See et al., 2017) has been reported to be useful in previous summarization models (Bražinskas et al., 2020; Amplayo and Lapata, 2020), it degrades the performance of our model. While those models use different input-output pairs (reviews vs. pseudo-summary), our model uses the same input-output pairs in an autoencoder manner and tends to fully copy the input sentences; thus, it fails to learn a meaningful latent code.
w/o Nucleus denotes our model using a beam-search decoder (beam width = 5) instead of nucleus sampling when decoding topic sentences at inference time. As reported by Holtzman et al. (2019), we also confirmed that the beam-search decoder tends to generate bland or repetitive text and sometimes fails to capture product-specific words. With nucleus sampling, the decoder generates more informative content, improving the ROUGE-1 score with a significant decrease in self-BLEU.
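For reference, a minimal sketch of the self-BLEU metric reported in Table 6; the whitespace tokenization and smoothing below are illustrative choices, not necessarily those of Zhu et al. (2018):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(summaries, n=3):
    """Self-BLEU (Zhu et al., 2018): score each generated summary against all
    other generated summaries as references and average. Higher values mean
    the summaries are less diverse."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    tokenized = [s.split() for s in summaries]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)

summaries = [
    "the food was great and the staff were friendly",
    "great food and a friendly staff",
    "the shoes fit well and look great",
]
print(round(self_bleu(summaries, n=3), 4))
```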
5.3 Analyzing Topic-Tree Structure
As generating sentences with tree-structured topic guidance is a novel challenge, we introduce new measures to verify that the generated sentences exhibit the desired properties of tree structures. Following work on tree-structured topic models (Kim et al., 2012), we introduce two metrics: hierarchical affinity and topic specialization.
Hierarchical Affinity.
An important characteristic of the tree structure is that a parent topic sentence should be more similar to its own children than to sentences descending from other parents. To confirm this property, we estimated the similarity of sentences in parent-child pairs and non-parent-child pairs. To measure sentence similarity, we used ALBERT (Lan et al., 2019), a SoTA model on the semantic textual similarity benchmark (STS-B; Cer et al., 2017). In our experiment, we used ALBERT-base, which achieves an 84.7 Pearson correlation coefficient on the STS-B test set. As shown in Table 7, parent-child sentence pairs are more similar than non-parent-child pairs in both datasets. This result indicates that the generated sentences linked by parent-child relations are topically coherent.
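A sketch of how such a comparison can be computed; the checkpoint name below is hypothetical (standing in for an ALBERT-base model fine-tuned for STS-B regression, which is not released here), and the example topic sentences and tree are illustrative:

```python
import itertools
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint: an ALBERT-base model fine-tuned on STS-B (num_labels=1).
MODEL_NAME = "albert-base-v2-finetuned-stsb"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def similarity(sent_a, sent_b):
    """Predict an STS-style similarity score for a sentence pair."""
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

def hierarchical_affinity(topic_sents, children):
    """Average similarity of parent-child pairs vs. non-parent-child pairs."""
    pc, non_pc = [], []
    for a, b in itertools.permutations(topic_sents.keys(), 2):
        score = similarity(topic_sents[a], topic_sents[b])
        (pc if b in children.get(a, []) else non_pc).append(score)
    return sum(pc) / len(pc), sum(non_pc) / len(non_pc)

# Illustrative topic sentences and a toy tree (0 is the root with children 1 and 2).
topic_sents = {0: "Great little coffee shop.",
               1: "The bubble tea is delicious.",
               2: "The desserts are also good."}
children = {0: [1, 2]}
print(hierarchical_affinity(topic_sents, children))
```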
Topic Specialization.
In tree-structured topics, we expect the root topic to generate general sentences, whereas more specific content should be conveyed by the sentences generated from the leaf topics. To empirically test this property, we estimated the average specificity of the sentences at each level of the tree-structured topics. We fine-tuned ALBERT-base on the task of estimating sentence specificity (Louis and Nenkova, 2011), using the dataset provided by Ko et al. (2019), which comprises the Yelp, Movie, and Tweet domains. The fine-tuned model achieves SoTA performance, an 86.2 Pearson correlation coefficient, on the Yelp test set. As shown in Table 8, sentences from lower topics are more specific than those from higher topics. This indicates that the root sentences refer to general topics, whereas leaf sentences describe more specific topics.
5.4 Analyzing Latent Space of Sentences
In Figure 5, we project the latent code of the topic sentences of a restaurant review onto the space spanned by the top two principal components. As the modeling assumes, the latent distributions of child sentences are located relatively near their parent distributions. This property ensures that parent and child sentences are topically coherent, as shown in Table 7. Furthermore, we present the average log-determinant of the covariance matrices at each level in Table 9. We confirm that the latent code of the topic sentences has a smaller variance towards the leaves. This property forces the topic sentences to become more specific as the level becomes deeper, as described in Table 8.
6 Related Work
6.1 Text Generation with Topic Guidance
The VAE has been used intensively to obtain disentangled latent codes of sentences (Bowman et al., 2016; Hu et al., 2017; Tang et al., 2019). Closely related to our work, Wang et al. (2019) specify the prior as a GMM, where each mixture component corresponds to the latent code of a topic sentence and is weighted by the topic distribution inferred by a flat neural topic model (Miao et al., 2017).
In contrast, we address the novel challenge of generating topic sentences with tree-structured topic guidance, where the root sentence refers to a general topic and the leaf sentences describe more specific topics. We adopt the tree-structured neural topic model (Isonuma et al., 2020) to infer the topic distribution of sentences and introduce a recursive Gaussian mixture prior for modeling the latent distribution of sentences in a document.
6.2 Unsupervised Summary Generation
Following the success of supervised abstractive summarization with neural architectures (Nallapati et al., 2016; See et al., 2017; Liu and Lapata, 2019a), unsupervised sentence compression (Fevry and Phang, 2018; Baziotis et al., 2019) and unsupervised summary generation (Isonuma et al., 2019) have recently drawn attention.
Recently, specifically for opinionated texts, several abstractive multi-document summarization methods have been developed, such as MeanSum, Copycat, and DenoiseSum, as explained in Section 4.3. Concurrently with our work, Angelidis et al. (2021) use quantized transformers to enable aspect-based extractive summarization, and Amplayo et al. (2020) incorporate aspect and sentiment distributions into unsupervised abstractive summarization. Our method incorporates topic-tree structure into unsupervised abstractive summarization and generates summaries consisting of multiple granularities of topics.
7 Conclusion
In this paper, we proposed a novel unsupervised abstractive opinion summarization method that generates topic sentences with tree-structured topic guidance. Experimental results demonstrated that the generated summaries are more informative and cover more of the input content than those generated by a recent unsupervised summarization model (Bražinskas et al., 2020). Additionally, we demonstrated that the variance of the latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015). This property will be useful not only for summarization but also for other tasks that need to consider the granularity of content.
Acknowledgments
We would like to thank the anonymous reviewers and action editor, Asli Celikyilmaz, for their valuable feedback. This work was supported by JST ACT-X grant number JPMJAX1904 and JSPS KAKENHI grant number JP20J10726, Japan.
Notes
We refer to businesses (e.g., a specific Starbucks branch) in Yelp and products (e.g., iPhone X) in Amazon as products.
As the complete code is not available, we report results on a test split different from ours, which is used in their released sample of output summaries. https://github.com/rktamplayo/DenoiseSum.
To obtain reliable answers, we set the worker requirements to a 98% approval rate, 1000+ accepted tasks, and locations in the US, UK, Canada, Australia, or New Zealand.
A Appendices
A.1 Inference of Topic Distribution
A.2 Sensitivity for the Number of Topics
We investigated how the number of topics affects summarization performance. Table 10 shows the ROUGE scores for various numbers of branches with the depth fixed at 3 in the topic-tree structure. When the number of topics is small, the model achieves relatively low scores; however, when the number of branches is 4 or more, the performance does not change significantly with the number of topics. A similar trend is observed for various numbers of levels with the number of branches fixed at 3. These results indicate that our model is relatively robust to the number of topics.
Table 10: ROUGE scores for various numbers of branches with the depth fixed at 3.

| # of topics per level (total) | R-1 | R-2 | R-L |
|---|---|---|---|
| 1–2–4 (7) | 29.03 | 4.39 | 16.94 |
| 1–3–9 (13) | 31.42 | 4.43 | 17.19 |
| 1–4–16 (21) | 33.24 | 5.15 | 18.01 |
| 1–5–25 (31) | 31.94 | 4.78 | 17.50 |
| 1–6–36 (43) | 33.25 | 4.82 | 17.81 |