Unsupervised Abstractive Opinion Summarization by Generating Sentences with Tree-Structured Topic Guidance

This paper presents a novel unsupervised abstractive summarization method for opinionated texts. While basic variational autoencoder-based models assume a unimodal Gaussian prior for the latent code of sentences, we replace it with a recursive Gaussian mixture, where each mixture component corresponds to the latent code of a topic sentence and is mixed by a tree-structured topic distribution. By decoding each Gaussian component, we generate sentences with tree-structured topic guidance, where the root sentence conveys generic content, and the leaf sentences describe specific topics. Experimental results demonstrate that the generated topic sentences are appropriate as a summary of opinionated texts: they are more informative and cover more input content than those generated by a recent unsupervised summarization model (Bražinskas et al., 2020). Furthermore, we demonstrate that the variance of the latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015).


Introduction
Summarizing opinionated texts, such as product reviews and online posts on websites, has attracted considerable attention recently along with the development of e-commerce and social media. Although extractive approaches are widely used in document summarization (Erkan and Radev, 2004; Ganesan et al., 2010), they often fail to provide an overview of the documents, particularly for opinionated texts (Carenini et al., 2013; Gerani et al., 2014). Abstractive summarization can overcome this challenge by paraphrasing and generalizing an entire document. Although supervised approaches have seen significant success with the development of neural architectures (See et al., 2017; Fabbri et al., 2019), they are limited to specific domains, e.g., news articles, where a large number of gold summaries are available. However, the domain of opinionated texts is diverse; manually writing gold summaries is therefore costly.
This lack of gold summaries has motivated prior work to develop unsupervised abstractive summarization of opinionated texts, e.g., product reviews (Chu and Liu, 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020). While these methods generate consensus opinions by condensing input reviews, two key components are absent: topics and granularity, i.e., the level of detail. For instance, as shown in Figure 1, a gold summary of a restaurant review provides the overall impression and details about certain topics, such as food, ambience, and service. Hence, a summary typically comprises diverse topics, some of which are described in detail, whereas others are mentioned concisely.
Based on this observation, we capture the topic-tree structure of reviews and generate topic sentences, i.e., sentences summarizing specified topics. In the topic-tree structure, the root sentence conveys generic content, whereas the leaf sentences mention specific topics. From the generated topic sentences, we extract sentences with appropriate topics and levels of granularity as a summary. Regarding extractive summarization, capturing topics (Titov and McDonald, 2008; Isonuma et al., 2017; Angelidis and Lapata, 2018) and topic-tree structure (Celikyilmaz and Hakkani-Tur, 2010, 2011) is useful for detecting salient sentences. To the best of our knowledge, this is the first study to use the topic-tree structure in unsupervised abstractive summarization.
The difficulty of generating sentences with tree-structured topic guidance lies in controlling the granularity of topic sentences. Wang et al. (2019) generated a sentence with designated topic guidance, assuming that the latent code of an input sentence can be represented by a Gaussian mixture model (GMM), where each Gaussian component corresponds to the latent code of a topic sentence. While they successfully generated a sentence relating to a designated topic by decoding each mixture component, modelling the sentence granularity in a latent space to generate topic sentences with multiple granularities remains to be realized.

Figure 1: Outline of our approach. (1) The latent distribution of review sentences is represented as a recursive GMM and trained in an autoencoding manner. Then, (2) the topic sentences are inferred by decoding each Gaussian component. An example of a restaurant review and its corresponding gold summary are displayed.
Summary (set of topic sentences) shown in the figure: "The food here is fantastic, easily the best sub sandwiches in the Arizona area."
Review sentences shown in the figure: "The sandwiches are inexpensive and are, in my opinion, the best Italian subs in AZ." "The shop is local and family run, so I definitely choose it over a lot of the large national chains." "The staff are extremely friendly and will always go above and beyond in creating a delicious sandwich." "You will not be let down by the great food that they make here!"
Topic nodes shown in the figure: food, ambience, and service (small variance); overall (large variance).
arXiv:2106.08007v1 [cs.CL] 15 Jun 2021

To overcome this challenge, we model the sentence granularity by the variance of the latent code. We assume that general sentences have more uncertainty and are generated from a latent distribution with a larger variance, analogous to Gaussian word embedding (Vilnis and McCallum, 2015). Based on this assumption, we represent the latent code of topic sentences with Gaussian distributions, where the parent Gaussian receives a larger variance and represents a more generic topic sentence than its children, as shown in Figure 1. To obtain the latent code characterized above, we introduce a recursive Gaussian mixture prior for modelling the latent code of input sentences in reviews. A recursive GMM consists of Gaussian components that correspond to the nodes of the topic tree, and the child priors are set to the inferred parent posterior. Because of this configuration, the Gaussian distribution of higher topics receives a larger variance and conveys more general content than that of lower topics.
The contributions of our work are as follows:
• We propose a novel unsupervised abstractive opinion summarization method by generating sentences with tree-structured topic guidance.
• To model the sentence granularity in a latent space, we specify a Gaussian distribution as the latent code of a sentence and demonstrate that the granularity depends on the variance size.
• Experiments demonstrate that the generated summaries are more informative and cover more input content than the recent unsupervised summarization (Bražinskas et al., 2020).

Preliminaries
Bowman et al. (2016) adapted the variational autoencoder (VAE; Kingma and Welling, 2014; Rezende et al., 2014) to obtain the density-based latent code of sentences. They assume the generative process of documents to be as follows:
For each document index $d \in \{1, \dots, D\}$:
  For each sentence index $s \in \{1, \dots, S_d\}$ in $d$:
    1. Draw a latent code of the sentence $x_s \in \mathbb{R}^n$ (1)
    2. Draw a sentence $w_s$ (2)
where $p(w_s \mid x_s) = \prod_t p(w_s^t \mid w_s^{<t}, x_s)$ is derived by a recurrent neural network (RNN) decoder. The latent prior is a standard Gaussian: $p(x_s) = \mathcal{N}(x_s \mid \mu_0, \Sigma_0)$. The likelihood of a document and its evidence lower bound (ELBO) are given by (3) and (4), respectively, where $f_\mu$ and $f_\Sigma$ are RNN encoders. By representing sentences by Gaussians rather than vectors, a sentence decoded from a latent code intermediate between two sentences is grammatical and topically coherent with both sentences. Extending their work, we construct the prior as a recursive GMM and infer the topic sentences by decoding each Gaussian component.

Figure 2: Outline of our model. We set a recursive Gaussian mixture as the latent prior of review sentences and obtain the latent posteriors of topic sentences by decomposing the posteriors of review sentences.
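The displayed equations (3) and (4) of the preliminaries are not reproduced in this text. As a point of reference only, the standard sentence-level VAE objective that is consistent with the definitions above can be sketched as follows. The factorization over sentences and the symbol $\mathcal{L}_d$ are assumptions made for this sketch, and the original equations may be arranged differently.

```latex
\begin{align}
p(w_{1:S_d}) &= \prod_{s=1}^{S_d} \int p(w_s \mid x_s)\, p(x_s)\, dx_s, \\
\mathcal{L}_d &= \sum_{s=1}^{S_d} \Big\{
    \mathbb{E}_{q(x_s \mid w_s)}\big[\log p(w_s \mid x_s)\big]
    - D_{\mathrm{KL}}\big[\, q(x_s \mid w_s) \,\|\, p(x_s) \,\big] \Big\}, \\
q(x_s \mid w_s) &= \mathcal{N}\big(x_s \mid f_\mu(w_s),\, f_\Sigma(w_s)\big).
\end{align}
```

Here the first line is the marginal likelihood of a document, the second is its lower bound ($\log p(w_{1:S_d}) \ge \mathcal{L}_d$), and the third is the Gaussian posterior produced by the RNN encoders.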

RecurSum: Recursive Summarization
In this section, we explain our model, RecurSum. Figure 2 shows the outline. The latent code of review sentences is obtained as a recursive GMM (3.1), and topic sentences are inferred by decoding each Gaussian component (3.2). A summary is then created by extracting the appropriate topic sentences (3.3). We introduce additional components to improve the quality of topic sentences (3.4) and explain why general/specific content is conveyed by the root/leaf topics, referring to the analogy with Gaussian word embedding (3.5).

Generative Model of Reviews
We assume the generative process of reviews to be as follows. We refer to the set of sentences in multiple reviews of a specific product as an instance. Compared to Bowman et al. (2016), we explicitly model the topic of review sentences as follows:
For each instance index $d \in \{1, \dots, D\}$:
  For each sentence index $s \in \{1, \dots, S_d\}$ in $d$:
    1. Draw a topic of the sentence $z_s \in \{1, \dots, K\}$ (5)
    2. Draw a latent code of the sentence $x_s \in \mathbb{R}^n$ (6)
    3. Draw a review sentence $w_s$ (7)
where the topic distribution is tree-structured, and its prior is set to be uniform. In (6), we assume a recursive GMM as the latent prior of a review sentence ($\delta$ is a Dirac delta). Each mixture component corresponds to the latent distribution of a sentence conditioned on a specific topic, $p(x_s \mid z_s = k)$, where $\mathrm{par}(k)$ denotes the parent of the $k$-th topic.
$q(x_s \mid z_s = \mathrm{par}(k))$ is the approximated latent posterior of the parent topic sentence, as derived later in Section 3.2. We assume that the latent posterior of the parent sentence is appropriate as the latent prior of its child sentences. Under our generative model, the likelihood of an instance and its ELBO are given by (10) and (11), respectively, where (10) integrates $p(w_s \mid x_s)\, p(x_s \mid z_s)\, p(z_s)$ over $x_s$ and $z_s$. Here, $q(x_s \mid w_s) = \mathcal{N}(x_s \mid \mu_s, \Sigma_s)$ is the latent posterior of a sentence $s$, inferred by an RNN encoder, and $\theta_{s,k} = q(z_s = k \mid w_s)$ is the variational topic distribution, inferred by the tree-structured neural topic model (TSNTM; Isonuma et al., 2020). More details are provided in Appendix A.1.
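Equation (11) is not reproduced in this text. As a hedged sketch, suppose the variational distribution factorizes as $q(x_s, z_s \mid w_s) = q(x_s \mid w_s)\, q(z_s \mid w_s)$ and that the sentences of an instance are treated independently (both are assumptions of this sketch, suggested by the expectation used in Section 3.2). Applying Jensen's inequality to the integrand of (10) then gives a bound of the following form; the original (11) may group the terms differently.

```latex
\begin{align}
\log p(w_{1:S_d}) \;\ge\; \sum_{s=1}^{S_d} \Big\{
  & \mathbb{E}_{q(x_s \mid w_s)}\big[\log p(w_s \mid x_s)\big] \\
  & - \sum_{k=1}^{K} \theta_{s,k}\,
      D_{\mathrm{KL}}\big[\, q(x_s \mid w_s) \,\|\, p(x_s \mid z_s = k) \,\big] \\
  & - D_{\mathrm{KL}}\big[\, q(z_s \mid w_s) \,\|\, p(z_s) \,\big] \Big\}
\end{align}
```

The middle line is the $x$-related term: each review sentence is pulled towards the priors of the topics it is assigned to, weighted by $\theta_{s,k}$.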

Inference of Topic Sentences
From the latent posterior of review sentences, we infer the latent posterior of each topic sentence using the M-step of the EM algorithm. We define the variational distribution of the latent code of a topic sentence as (12) and compute the Gaussian parameters as (13) and (14), which maximize $\sum_{s=1}^{S_d} \mathbb{E}_{q(x_s \mid w_s)\, q(z_s \mid w_s)}\big[\log q(x_s \mid z_s)\big]$. From these latent posteriors, we generate the topic sentences for each instance using the respective mean rather than a sample: $\hat{w}_{d,k} \sim p(w_{d,k} \mid \mu_{d,k}) = \mathrm{RNN}(\mu_{d,k})$. Similar to Bražinskas et al. (2020); Chu and Liu (2019), we assume that the average latent code represents the common contents of the corresponding topic, while specific contents are distributed apart from the mean. Therefore, decoding the mean rather than a sample is desirable for generating a summary.
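The M-step described above can be illustrated with a short NumPy sketch. It assumes that maximizing the stated objective reduces to responsibility-weighted moment matching, which is the standard closed form for a Gaussian; the function name `topic_posteriors` and the array layout are hypothetical, and every topic is assumed to receive non-zero total responsibility.

```python
import numpy as np

def topic_posteriors(mu, sigma, theta):
    """Moment-matched Gaussian for each topic of one instance.

    mu:    (S, n)    posterior means of the S review sentences
    sigma: (S, n, n) posterior covariances of the review sentences
    theta: (S, K)    topic responsibilities q(z_s = k | w_s)
    Returns topic means of shape (K, n) and covariances of shape (K, n, n).
    """
    # Normalize responsibilities per topic so each column sums to 1.
    weights = theta / theta.sum(axis=0, keepdims=True)           # (S, K)
    # Weighted average of the sentence means.
    topic_mu = weights.T @ mu                                     # (K, n)
    # Spread of each sentence around each topic mean.
    diff = mu[:, None, :] - topic_mu[None, :, :]                  # (S, K, n)
    outer = diff[..., :, None] * diff[..., None, :]               # (S, K, n, n)
    # Sentence covariance plus between-sentence spread, then weighted average.
    spread = sigma[:, None, :, :] + outer                         # (S, K, n, n)
    topic_sigma = np.einsum("sk,skij->kij", weights, spread)      # (K, n, n)
    return topic_mu, topic_sigma
```

A topic sentence would then be decoded from the corresponding row of `topic_mu`, in line with the choice above to decode the mean rather than a sample.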

Extraction of Summary Sentences
Next, we create a summary by extracting appropriate sentences from the generated topic sentences. As gold summaries are not available for training, we need a measure to evaluate candidate summaries using only input reviews. As reported in Chu and Liu (2019), the ROUGE scores (Lin, 2004) between a candidate summary and the input reviews effectively measure the extent to which the summary encapsulates the reviews. Based on this assumption, we search for the set of topic sentences that maximizes the ROUGE-1 F-measure with the review sentences in an instance. We use a beam search and keep multiple highest-score candidates at each step. Similar to Carbonell and Goldstein (1998), to eliminate the redundancy of summary sentences, we do not add a sentence with a high word overlap (ROUGE-1 precision) against the sentences already included in the summary. The hyperparameters are tuned based on the validation set, as described in Section 4.2.
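The extraction step can be sketched in pure Python as follows. This is a simplified illustration, not the authors' implementation: `rouge1` is a plain unigram-overlap score without stemming, the redundancy check is interpreted as the ROUGE-1 precision of a candidate sentence against the tokens already selected, and the function names and default values (taken from the settings reported in Section 4.2) are assumptions of the sketch.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram precision, recall, and F1 between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / len(candidate)
    r = overlap / len(reference)
    return p, r, 2 * p * r / (p + r)

def select_summary(topic_sents, review_sents, max_sents=6, beam=8, redundancy=0.6):
    """Beam search over subsets of topic sentences (lists of tokens),
    scored by ROUGE-1 F1 against all review sentences of the instance."""
    reference = [tok for sent in review_sents for tok in sent]

    def score(indices):
        tokens = [tok for i in indices for tok in topic_sents[i]]
        return rouge1(tokens, reference)[2] if tokens else 0.0

    beams, best = [()], ()
    for _ in range(max_sents):
        candidates = set()
        for chosen in beams:
            summary_tokens = [tok for i in chosen for tok in topic_sents[i]]
            for i, sent in enumerate(topic_sents):
                if i in chosen:
                    continue
                # Skip sentences that mostly repeat what is already selected.
                if chosen and rouge1(sent, summary_tokens)[0] > redundancy:
                    continue
                candidates.add(tuple(sorted(chosen + (i,))))
        if not candidates:
            break
        # Keep the highest-scoring partial summaries for the next step.
        beams = sorted(sorted(candidates), key=score, reverse=True)[:beam]
        if score(beams[0]) > score(best):
            best = beams[0]
    return list(best)
```

The function returns indices into `topic_sents`, so the caller can still look up each sentence's position in the topic tree when ordering the summary.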
After selecting the summary sentences, we sort them in depth-first order according to the topic-tree structure, i.e., we begin at the root node and explore as far as possible along each branch before backtracking. Barzilay and Lapata (2008) advocate that adjacent sentences in coherent text tend to have similar contents. As we assume that sentences linked by parent-child relations are topically coherent, the generated summary is expected to be locally coherent when child sentences are placed after their parent sentence.
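The depth-first ordering can be sketched compactly if each selected sentence carries its path from the root. Representing a topic as a tuple of branch indices (with `()` for the root) is an assumption of this sketch, as is the function name. Sorting such tuples lexicographically yields a pre-order traversal, because a path always sorts before its own extensions and before any later sibling.

```python
def depth_first_order(selected):
    """Order (path, sentence) pairs in depth-first (pre-order) order.

    path is a tuple of branch indices from the root, e.g. () for the root,
    (1,) for its first child, and (1, 2) for that child's second child.
    """
    return [sentence for path, sentence in sorted(selected, key=lambda item: item[0])]

summary = depth_first_order([
    ((2,), "second-level sentence B"),
    ((), "root sentence"),
    ((1, 1), "third-level sentence under A"),
    ((1,), "second-level sentence A"),
])
print(summary)
```

In the example, the third-level sentence under A is placed directly after A and before B, which is the parent-then-child ordering motivated above.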

Additional Model Components
The basic components of our model have been explained in the previous sections. This section introduces three additional components to improve the quality of topic sentences. In ablation studies (Section 5.2), we will see the effect of these components on summarization performance.
Discriminator: To ensure that each topic sentence has a specific topic, we introduce a discriminator following Hu et al. (2017); Tang et al. (2019). We approximate the sample of the topic sentence by using the Gumbel-softmax trick (Jang et al., 2017; Maddison et al., 2017) and reuse the TSNTM to estimate the topic distribution of the sample, $q(z_{d,k} \mid \hat{w}_{d,k})$. By maximizing the likelihood of the specified topic as in (15), the discriminator forces the generated $k$-th topic sentence to be coherent with topic $k$.
Attention: We use the attention-based RNN decoder (Luong et al., 2015) to efficiently reflect input sentence information in the output topic sentences. Given the hidden state of the $t$-th word in an output sentence, $h_o^t$, and that of the $i$-th word in an input review sentence, $h_s^i$, we calculate the attention distribution over all the words in the input review sentences to compute the word probability.

Figure 3: Analogy with Gaussian word embedding.

Analogy with Gaussian Word Embedding
Here, we explain why a general sentence is generated from the root topic, while more specific content is conveyed by the sentences generated from the leaf topics, with reference to Gaussian word embedding. Gaussian word embedding (Vilnis and McCallum, 2015) represents words as Gaussian distributions and captures the hierarchical relations among the words. As shown in Figure 3, by representing words as densities over a latent space and minimizing the KL-divergence of the distributions, they find that common words such as "animal" obtain a larger variance than more specific words, such as "dog" and "cat". This can be explained by the fact that general words have more uncertainty in their meaning (i.e., "animal" sometimes denotes "dog" and other times "cat"). Similarly, our model minimizes the upper bound of the KL-divergence of the latent distribution between a parent topic sentence and its children. In (19), we show that the x-related term in the ELBO (11) is an upper bound of the KL divergence of the latent posteriors between parent-child topic sentences (derived in Appendix A.3).
(19) holds since $p(x_s \mid z_s = k) = q(x_s \mid z_s = \mathrm{par}(k))$, as defined in (9). Similar to Gaussian word embedding, maximizing the ELBO forces the latent distribution of a parent to be close to that of its children, and the parent receives a larger variance than its children. This property ensures that the parent-child topics have a coherent topic, and more general content is conveyed by the root topic sentences. Intuitively, a general sentence, such as "I love this restaurant", includes several topics, such as "food" and "service", and has a large uncertainty of semantics. Thus, we assume that a generic sentence is represented by the mean of the latent distribution with a larger variance, whereas a more specific sentence is generated from the distribution with a smaller variance.
Similar to Vilnis and McCallum (2015), we observed that the eigenvalues of the full covariance of topic sentences (14) become extremely small during training. To maintain a reasonably sized and positive semi-definite covariance, we add a hard constraint to the diagonal covariance of the review sentences as $\Sigma_s$,

Datasets
In our experiments, we used the Yelp Dataset Challenge 1 and Amazon product reviews (McAuley et al., 2015).
By pre-processing the reviews similarly to Chu and Liu (2019); Bražinskas et al. (2020), we obtained the dataset shown in Table 1. Regarding the training set, we removed products with fewer than 8 reviews and reviews with more than 50 sentences. To prevent the dataset from being dominated by a small number of products, we created 12 and 2 instances for each product in Yelp and Amazon, respectively. For each instance, we randomly selected 8 reviews. Regarding the validation/test set of Yelp, we randomly split the 200 instances provided by Chu and Liu (2019) into validation and test sets. For Amazon, we used the same validation and test sets provided by Bražinskas et al. (2020). These gold summaries were created by Amazon Mechanical Turk (AMT) workers, who summarized 8 reviews for each product. The vocabulary comprises words that appear more than 16 times in the training set. The vocabulary sizes are 31,748 and 30,732 for Yelp and Amazon, respectively.

Implementation Details
We set the hyperparameters as follows; they maximize ROUGE-L on the validation set of Yelp, and we use the same hyperparameters on Amazon (footnote 5). The dimensions of the word embeddings and the latent code of the sentences are 200 and 32, respectively. The encoder and decoder are single-layer bi-directional and uni-directional GRU-RNNs (Chung et al., 2014), respectively, with 200-dimensional hidden units for each direction. The threshold of nucleus sampling is 0.4. We train our model using Adam (Kingma and Ba, 2014) with a learning rate of $5.0 \times 10^{-3}$, a batch size of 8, and a dropout rate of 0.2. The initial Gumbel-softmax temperature is set to 1 and decreased by $2.5 \times 10^{-5}$ per training step. Similar to Bowman et al. (2016); Yang et al. (2017), we avoid posterior collapse by increasing the weight of the KL-term by $2.5 \times 10^{-5}$ per training step. We set the review sentence's minimum covariance to $\lambda = \exp(0.5)$. Regarding the tree structure, we set the number of levels to 3 and the number of branches to 4 for both the second and third levels. The total number of topics is 21.
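The topic count follows directly from the tree shape: one root, 4 second-level topics, and 4 × 4 = 16 third-level topics. The sketch below enumerates such a tree; the function name and the tuple-based topic paths are hypothetical conventions introduced only for illustration.

```python
def build_topic_tree(branches_per_level):
    """Return {topic_path: parent_path} for a balanced topic tree.

    branches_per_level[i] is the number of children of every node at depth i.
    The root is the empty tuple and has no parent.
    """
    parents = {(): None}
    frontier = [()]
    for n_children in branches_per_level:
        next_frontier = []
        for node in frontier:
            for branch in range(1, n_children + 1):
                child = node + (branch,)
                parents[child] = node
                next_frontier.append(child)
        frontier = next_frontier
    return parents

tree = build_topic_tree([4, 4])  # 3 levels: root, 4 children, 16 grandchildren
print(len(tree))                 # 21
```

The `parents` mapping plays the role of par(k) in Section 3.1: the prior of each non-root topic is tied to the posterior of `parents[topic]`.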
Regarding the summary sentence extractor in Section 3.3, we set the maximum number of extracted sentences as 6, the beam width as 8, and the redundancy threshold as 0.6.

Baseline Methods
As a baseline, we use Multi-Lead-1, which extracts the first sentence of each review. Furthermore, we employ unsupervised extractive approaches, LexRank (Erkan and Radev, 2004) and Opinosis (Ganesan et al., 2010). LexRank is a PageRank-based sentence extraction method that constructs a graph in which sentences and their similarity are represented by the nodes and edges, respectively. Opinosis constructs a word-based graph and extracts redundant phrases as a summary. As unsupervised abstractive summarization methods, we use MeanSum (Chu and Liu, 2019), Copycat (Bražinskas et al., 2020) and DenoiseSum (Amplayo and Lapata, 2020). MeanSum computes the mean of the review embeddings and decodes it as a summary. Copycat generates a consensus opinion by a hierarchical VAE which is trained by generating a new review given a set of other reviews of a product. DenoiseSum creates synthetic reviews by adding noise to original reviews and generates a summary by removing non-salient information as noise.
Footnote 5: https://github.com/misonuma/recursum
As an upper bound of extraction methods, we also report the performance of Oracle, which extracts the topic sentences such that they obtain the highest ROUGE-L against each gold summary. As the average number of sentences in the gold summaries is approximately four, we extract four topic sentences to generate a summary.
Following Chu and Liu (2019); Bražinskas et al. (2020), we use the ROUGE-1/2/L F1-scores (Lin, 2004) as semi-automatic evaluation metrics.
Table 3: Human evaluation scores on the quality of the summaries. The scores are computed by using the best-worst scaling (%) and range from −100 (unanimously worst) to +100 (unanimously best). Boldface denotes the highest score, and underlined scores are not regarded as statistically significant (p < 0.05) by Tukey HSD test as compared to the highest score.
[...] Oracle, our model significantly outperforms the other models. This result suggests that our model can improve the performance by using more sophisticated extraction methods. Although we have also attempted to use the integer linear programming-based method (Gillick and Favre, 2009), it did not improve the performance. Developing such extraction techniques is beyond the scope of the current study, which focuses on topic structure, and is deferred to future work.

Human Evaluation of Summaries
We conducted a human evaluation using AMT. To obtain reliable answers, we set the worker requirements to a 98% approval rate, 1000+ accepted tasks, and locations in the US, UK, Canada, Australia, and New Zealand. Following Bražinskas et al. (2020); Amplayo and Lapata (2020), we randomly selected 50 instances from each test set and asked AMT workers to complete the following three tasks.
Quality of the Summaries: We presented four system summaries in random order and asked six AMT workers to rank the summarization quality with reference to the gold summary. We compute each system's score as the percentage of times it is selected as the best minus the percentage of times it is selected as the worst, using best-worst scaling (Louviere et al., 2015; Kiritchenko and Mohammad, 2016). Following Bražinskas et al. (2020); Amplayo and Lapata (2020), we use the following four criteria:
• Fluency: the summary is grammatically correct and easy to read and understand;
• Coherence: the summary is well structured and organized;
• Informativeness: the summary mentions specific aspects of the product;
• Redundancy: the summary has no unnecessary repetitive words or phrases.
Table 3 shows the human evaluation scores of the four systems. In terms of coherence and informativeness, RecurSum achieves the highest score among all approaches across the two datasets. This result indicates the effectiveness of considering topics and structure in unsupervised abstractive opinion summarization. As regards fluency, Copycat is superior to our model because our model sometimes makes grammatical or referential errors, which have a negative impact on fluency, as shown later in Section 5.1.
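The best-worst scaling score described above can be computed as in the following sketch. It assumes that each annotation names exactly one best and one worst system and that every system is shown in every annotation; the function name and the input format are hypothetical.

```python
def best_worst_scores(judgments, systems):
    """Best-worst scaling scores in percent, from -100 to +100.

    judgments: list of (best_system, worst_system) pairs, one per annotation.
    systems:   names of all systems shown to the annotators.
    """
    total = len(judgments)
    scores = {}
    for system in systems:
        n_best = sum(1 for best, _ in judgments if best == system)
        n_worst = sum(1 for _, worst in judgments if worst == system)
        # Percentage of times chosen best minus percentage chosen worst.
        scores[system] = 100.0 * (n_best - n_worst) / total
    return scores

judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "C")]
print(best_worst_scores(judgments, ["A", "B", "C"]))
# {'A': 75.0, 'B': 0.0, 'C': -75.0}
```

A system that is always picked as best scores +100 and one that is always picked as worst scores −100, matching the range given in the caption of Table 3.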

Faithfulness of the Summaries
Abstractive summarization sometimes hallucinates content that is unfaithful to the input texts (Maynez et al., 2020). The next study assesses whether the contents mentioned in the generated summaries are included in the input reviews. We use the same summary sets as in the quality evaluation and split them into sentences. For each summary sentence, we asked the AMT workers to judge whether the content is fully mentioned (Full), some of the content is mentioned (Partial), or no content is mentioned (No) in the reviews. Table 4 shows the percentage of each answer.

The frequency distribution is not regarded as statistically significant by the χ² test (p < 0.05). This result indicates that our model correctly reflects the content in the input reviews as well as Copycat.

[Figure 4: Examples of generated summaries for (a) shoes (Amazon), (b) a coffee shop (Yelp), and (c) a table and chair set (Amazon). Numbers are topic identifiers in the tree; sentences without a number lost their identifier in extraction.]
(a) RecurSum:
1. I bought these for my wedding.
The shoe is nice and the color is very well made.
12. The shoes are comfortable and the material is very nice.
I bought a pair of black and they were so beautiful.
121. The size is a bit too small for my needs.
(a) Copycat: This is my second pair of these shoes and I love them. They are true to size and are comfortable to wear. I wear them to work and they are very comfortable.
(a) Gold summary: Very pretty shoes and nice quality. The shoes run a bit small, about half a size and there is a ridge in the shoe that rubs on your toe. Nice formal night shoe, not so much for every day.
(b) RecurSum:
1. This is a great place to go.
11. They have a wide variety of options.
It's also a nice place to grab a bite.
111. It's a little bit of the best bubble tea.
I love the fruity pebbles chocolate, chocolate and the ice cream.
13. I will definitely be back.
The place is clean and the food is delicious.
(b) Copycat: This place has the best bubble tea I've ever had in my life. It's hard to find a place that serves bubble tea and boba tea, but I think it's worth the money. The staff is very friendly and helpful. I will definitely be back!
(b) Gold summary: Great place for bubble tea, lots of options to customize your drinks and toppings! They also offer loose teas and milk teas. The desserts are nice as well, they offer mousse, macaroons, and pastries ect. The price point is fair, cheaper than some other local options. Great atmosphere to meet up with friends and chat, but also relaxing enough to come in and study.
(c) Copycat: This is a great table set for the price. It was easy to put together and looks great. The only thing is that the chairs are a little flimsy, but they are easy to assemble.
(c) Gold summary: The dinning room set is very sturdy, seats are very comfortable and has a nice color. Great for small area, but I will not advice usage for a large eating area. It looks great and it is easy to put together. The top scratches easily. It was delivered on time and I'm pleased with the purchase.
Coverage of the Summaries: Another desirable property of summaries is that they cover as much of the content mentioned in the input reviews as possible. As reported in Bražinskas et al. (2020), Copycat and MeanSum achieve relatively low scores for the human evaluation of opinion consensus, which captures the coverage of common opinions in the input reviews. In contrast, as RecurSum explicitly generates summary sentences for each topic, it could cover more input content across diverse topics. To test this hypothesis, we conducted the converse of the faithfulness evaluation: we split the reviews into sentences and, for each review sentence, asked the AMT workers to rate the extent to which the generated summaries cover its content. Table 5 shows the percentage of fully-covered (Full), partially-covered (Partial), and un-covered (No) sentences. In addition to the two models, we also included gold summaries as the upper bound. For both datasets, RecurSum covers more common opinions by capturing diverse topics.

Analyzing Generated Summaries
In this section, we discuss the strengths and weaknesses of our method by presenting examples of the generated summaries and tree structures.
In Figure 4 (a), we present a summary of a review of shoes in Amazon. RecurSum generates topic sentences about fit and size (12, 121), similar to Copycat. In addition, our model also mentions color and use (11, 111, 112), which are also described in the gold summary. While we cannot grasp from Copycat's summary that the shoes are appropriate for weddings, RecurSum covers such topics and provides more useful information.
Figure 4 (b) shows the generated summaries of a coffee shop review in Yelp. While both RecurSum and Copycat present a positive review about the taste of bubble tea (tea with tapioca), RecurSum also focuses on the dessert (12, 121), similar to the gold summary. Copycat also refers to friendly staff, which is not mentioned in the input review. Our model avoids extracting topic sentences about staff by measuring content overlap with the input reviews. However, RecurSum sometimes makes grammatical or referential errors such as "It's a little bit of the best bubble tea". These errors cause the inferior performance of RecurSum in terms of fluency.
Figure 4 (c) shows the summary of an Amazon review of a table and chair set. RecurSum accurately captures opinions about the table (11, 111, 112) and chair (12, 121). The topic sentences on the bottom level elaborate on the parent sentences, referring to the easy assembly (111), the appropriate use of the table (112), and the quality of the chair (121). By inferring topics in the tree structure, RecurSum can offer summary sentences over multiple granularities of topics.

Ablation Study of Model Components
We report the results of the ablation study to investigate how individual components affect summarization performance. In addition to the ROUGE scores, we also report self-BLEU scores (Zhu et al., 2018) to investigate the diversity of the generated summaries. Self-BLEU is computed by calculating the BLEU score of each generated summary with all other generated summaries in the test set as references. A higher self-BLEU implies that the generated summaries are not diversified, i.e., the model tends to generate a generic summary similar to the other summaries. Table 6 shows the performance of the model variants on the Yelp dataset.
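The self-BLEU computation described above can be sketched in pure Python. This is a simplified illustration: `bleu` is an unsmoothed BLEU with uniform n-gram weights and a closest-length brevity penalty, so its values may differ from those of the toolkit used for the reported scores, and both function names are assumptions of the sketch. Setting `max_n` to 3 or 4 would correspond to self-BLEU3 and self-BLEU4.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    """Unsmoothed BLEU of one token list against several reference token lists."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        max_ref = Counter()
        for ref in references:
            max_ref |= ngrams(ref, n)          # element-wise maximum over references
        clipped = sum((hyp_counts & max_ref).values())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty against the reference closest in length to the hypothesis.
    ref_len = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) > ref_len else math.exp(1 - ref_len / len(hypothesis))
    return bp * math.exp(log_precision)

def self_bleu(summaries, max_n=4):
    """Average BLEU of each summary with all other summaries as references."""
    scores = []
    for i, hyp in enumerate(summaries):
        refs = summaries[:i] + summaries[i + 1:]
        scores.append(bleu(hyp, refs, max_n))
    return sum(scores) / len(scores)
```

If every system output were the same generic summary, `self_bleu` would return 1.0; outputs that share no n-grams would give 0.0, which is why lower values indicate more diverse summaries.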
w/o Disc denotes our model without a discriminator. The ROUGE scores are significantly lower than those of the full model. Without the discriminator, the topic distribution becomes sparse, i.e., most of the review sentences are assigned to a few specific topics. Therefore, the model obtains incoherent topics and generates summaries that are unfaithful to the input reviews. The discriminator penalizes this situation by assigning an appropriate topic to topically different sentences. This mechanism makes the generated topic sentences topically coherent and improves ROUGE scores.
Table 6: Ablation study of RecurSum on Yelp. R-1/2/L denote ROUGE-1/2/L, respectively. B-3/4 denote self-BLEU3/4, respectively.
w/o Attention indicates our model without an attention mechanism. Although the generated sentences are faithful to the input review, they are often generic and miss some specific details of the content. By adding the attention mechanism, the generated summary effectively reflects the content of the input reviews and provides more detailed information. Although the copy mechanism (See et al., 2017) has also been reported to be useful in previous summarization models (Bražinskas et al., 2020; Amplayo and Lapata, 2020), it degrades the performance of our model. While their models use different input-output pairs (reviews vs. pseudo-summary), our model uses the same input-output pairs in an autoencoding manner and tends to fully copy the input sentences. Thus, with the copy mechanism, our model fails to obtain a meaningful latent code.
w/o Nucleus denotes our model using a beam-search decoder (beam width = 5) instead of nucleus sampling when decoding topic sentences at inference. As reported by Holtzman et al. (2019), we also confirmed that the beam-search decoder tends to generate bland or repetitive text and sometimes fails to capture product-specific words. Owing to nucleus sampling, the decoder generates more informative content and improves the ROUGE-1 score with a significant decrease in self-BLEU.
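For readers unfamiliar with nucleus sampling, the following sketch shows one decoding step over a next-word distribution. It is an illustration of the general technique (Holtzman et al., 2019) rather than the decoder used here; the function name is hypothetical, and the default threshold of 0.4 is the value reported in Section 4.2.

```python
import random

def nucleus_sample(probs, p=0.4, rng=random):
    """Sample a token index from the nucleus of a probability distribution.

    The nucleus is the smallest set of most-probable tokens whose cumulative
    probability reaches p; sampling is restricted to that set.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Sample within the nucleus in proportion to the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With a low threshold the nucleus often contains only a few high-probability words, which limits degenerate low-probability continuations while still allowing more variety than beam search.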
We also attempted to replace the encoder with BERT (Devlin et al., 2019). However, fine-tuning of pretrained components with non-pretrained components is unstable as reported by Liu and Lapata (2019b), and it does not contribute to the improvement of ROUGE scores.

Analyzing Topic-Tree Structure
As generating sentences with tree-structured topic guidance is a novel challenge, we introduce new measures to verify that the generated sentences exhibit the desired properties of tree structures. Following work on tree-structured topic models (Kim et al., 2012), we introduce two metrics: hierarchical affinity and topic specialization.
Topic sentences shown in the figure:
1. A great thing to go to the place.
11. Friendly staff and I appreciated the food.
111. The staff is very friendly and knowledgable.
112. Food was great, and the service was great.
12. There is a lot of a great meal.
121. The meal was great, but the food was fantastic.
122. The <unk> was delicious and the fish tacos were great.
Table 8: Average specialization score of each level topics, ranging from 1 (general) to 5 (specific).
Hierarchical Affinity: An important characteristic of the tree structure is that a parent topic sentence is more similar to its children than the sentences descending from the other parents. To confirm this property, we estimated the similarity of sentences in parent-child pairs and non-parent-child pairs. To measure sentence similarity, we used ALBERT (Lan et al., 2019), which is a SoTA model on the semantic textual similarity benchmark (STS-B; Cer et al., 2017). In our experiment, we used ALBERT-base, which achieves an 84.7 Pearson correlation coefficient on the test sets of STS-B. As shown in Table 7, parent-child sentence pairs are more similar than non-parent-child pairs in both datasets. This result indicates that the generated sentences linked by parent-child relations are topically coherent.
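The hierarchical affinity comparison can be sketched as follows. Two simplifications are assumptions of this sketch and not the setup used above: similarity is a word-set Jaccard overlap standing in for the ALBERT-based STS model, and only non-root parents are compared with nodes one level below them, since every second-level node is a child of the root. Topic paths are tuples of branch indices, a hypothetical convention.

```python
def jaccard(a, b):
    """Stand-in similarity: word-set overlap (the paper uses ALBERT on STS-B)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def mean(values):
    return sum(values) / len(values) if values else float("nan")

def hierarchical_affinity(sentences, similarity=jaccard):
    """Average similarity of parent-child pairs and of non-parent-child pairs.

    sentences: {path: sentence}, where path is a tuple of branch indices.
    """
    child_sims, other_sims = [], []
    for parent, parent_sent in sentences.items():
        if len(parent) == 0:
            continue  # all second-level nodes descend from the root
        for node, node_sent in sentences.items():
            if len(node) != len(parent) + 1:
                continue
            sim = similarity(parent_sent, node_sent)
            if node[:-1] == parent:
                child_sims.append(sim)
            else:
                other_sims.append(sim)
    return mean(child_sims), mean(other_sims)
```

A tree with the desired property yields a first value (parent-child) that is higher than the second (non-parent-child), which is the pattern reported in Table 7.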
Topic Specialization: In tree-structured topics, we would expect the root topic to generate general sentences, whereas more specific content is conveyed by the sentences generated by the leaf topics. To empirically test this property, we estimated the average specificity of sentences at each level of the tree-structured topics. We fine-tuned ALBERT-base on the task of estimating the specificity of sentences (Louis and Nenkova, 2011). We used the dataset provided by Ko et al. (2019), which comprises the Yelp, Movie, and Tweet domains. The fine-tuned model achieves a SoTA performance of 86.2 Pearson correlation coefficient on the Yelp test sets. As shown in Table 8, sentences generated from topics at deeper levels are more specific than those generated from topics at shallower levels. This indicates that the root sentences refer to general topics, whereas leaf sentences describe more specific topics.

Table 9: Average log determinant of covariance matrices (LogDetCov) on each level.

Analyzing Latent Space of Sentences
In Figure 5, we project the latent codes of the topic sentences of a restaurant review onto the space spanned by the top two principal components. Following the modelling assumption, the latent distributions of child sentences are located relatively near their parent distributions. This property ensures that the parent and child sentences are topically coherent, as shown in Table 7. Furthermore, we present the average log determinant of the covariance matrices at each level in Table 9. We confirm that the latent codes of the topic sentences have smaller variance towards the leaves. This property forces the topic sentences to be more specific as the level becomes deeper, as shown in Table 8.
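As a rough illustration of the LogDetCov statistic, the sketch below averages the log determinant of each topic's covariance by tree level. The helper names are hypothetical, the tree level is assumed to be the length of the topic id, and the diagonal case (where the log determinant reduces to the sum of log variances) is included as an assumption because diagonal Gaussians are common in VAE-based models.

```python
import numpy as np

def log_det_cov(cov):
    """Log determinant of a covariance given as a variance vector or a full matrix."""
    cov = np.asarray(cov, dtype=float)
    if cov.ndim == 1:                       # diagonal covariance stored as variances
        return float(np.sum(np.log(cov)))
    sign, logdet = np.linalg.slogdet(cov)   # numerically stable for full matrices
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return float(logdet)

def mean_log_det_by_level(covs):
    """Average log determinant per tree level; `covs` maps topic id -> covariance."""
    by_level = {}
    for topic_id, cov in covs.items():
        by_level.setdefault(len(topic_id), []).append(log_det_cov(cov))
    return {level: float(np.mean(vals)) for level, vals in sorted(by_level.items())}
```

A value that decreases with the level corresponds to latent Gaussians that occupy a smaller volume towards the leaves, which is the trend reported in Table 9.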
Related Work

Text Generation with Topic Guidance
VAEs are widely used to obtain disentangled latent codes of sentences (Bowman et al., 2016; Hu et al., 2017; Tang et al., 2019). Most closely related to our work, Wang et al. (2019) specify the prior as a GMM, where each mixture component corresponds to the latent code of a topic sentence and is mixed with the topic distribution inferred by the flat neural topic model (Miao et al., 2017). In contrast, we address a novel challenge to generate topic sentences with tree-structured topic guidance, where the root sentence refers to a general topic, whereas the leaf sentences describe more specific topics. We adopt the tree-structured neural topic model (Isonuma et al., 2020) to infer the topic distribution of sentences and introduce a recursive Gaussian mixture prior for modelling the latent distribution of sentences in a document.
Recently, several abstractive multi-document summarization methods have been developed specifically for opinionated texts, such as MeanSum, Copycat, and DenoiseSum, as explained in Section 4.3. Concurrently with our work, Angelidis et al. (2021) use quantized transformers to enable aspect-based extractive summarization, and Amplayo et al. (2020) incorporate aspect and sentiment distributions into unsupervised abstractive summarization. Our method incorporates a topic-tree structure into unsupervised abstractive summarization and generates summaries consisting of topics at multiple granularities.

Conclusion
In this paper, we proposed a novel unsupervised abstractive opinion summarization method that generates topic sentences with tree-structured topic guidance. Experimental results demonstrated that the generated summaries are more informative and cover more input content than those generated by the recent unsupervised summarization model (Bražinskas et al., 2020). Additionally, we demonstrated that the variance of latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015). This property will be useful not only for summarization but also for other tasks that need to consider the granularity of the contents.

A.1 Inference of Topic Distribution
To approximate the tree-structured topic distribution of a sentence, we use a tree-structured neural topic model (TSNTM; Isonuma et al., 2020), which transforms a sentence into a tree-structured topic distribution using neural networks. While their model is based on the nested Chinese restaurant process (nCRP; Griffiths et al., 2004), we make a minor change to use the nested hierarchical Dirichlet process (nHDP; Paisley et al., 2014). The nHDP generates a sentence-specific path distribution $\pi_s$ and level distribution $\phi_s$ as

$$\nu_{s,k} \sim \mathrm{Beta}(1, \gamma), \qquad \pi_{s,k} = \pi_{s,\mathrm{par}(k)} \, \nu_{s,k} \prod_{j \in \mathrm{Sib}(k)} (1 - \nu_{s,j})$$

where $\mathrm{Sib}(k)$ and $\mathrm{Anc}(k)$ are the sets of the $k$-th topic's preceding siblings and ancestors, respectively. As described in Figure 6, $\pi_{s,k}$ denotes the probability that a sentence $s$ selects a path from the root to the $k$-th topic. $\phi_{s,k}$ denotes the probability that a sentence $s$ does not select the ancestral topics $j \in \mathrm{Anc}(k)$ but remains in the $k$-th topic along the path. By multiplying these two probabilities, we obtain $\theta_{s,k}$, the probability that a sentence $s$ selects the topic $k$.

Figure 6: Example of a path distribution (blue) and level distribution (red). Both the sum of a path distribution over each level and the sum of a level distribution over each path are equal to 1.

The nHDP does not make a significant difference in the summarization performance from the nCRP. However, the nHDP permits different lengths of each path, whereas the nCRP restricts each path length to be the same. Following Isonuma et al. (2020), we use the doubly-recurrent neural networks (DRNN; Alvarez-Melis and Jaakkola, 2017) to transform a sentence embedding $y_s = \mathrm{RNN}(w_s)$ to the path distribution $\pi_s$ and level distribution $\phi_s$. The DRNN consists of two RNN decoders, one over the ancestors and one over the siblings.
We compute the $k$-th topic's hidden state $h_k$ using (23) and obtain the path distribution by replacing $\nu_s$ as in (24), where $h_{\mathrm{par}(k)}$ and $h_{k-1}$ are the hidden states of the parent and the previous sibling of the $k$-th topic, respectively. Similarly, we obtain the level distribution, $\phi_s$, by computing $\eta_s$ with another DRNN.
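The way the path and level distributions combine into a topic distribution can be sketched on a small finite tree. Several points here are assumptions rather than statements from the text: the level distribution is taken to have the stick-breaking form $\phi_{s,k} = \eta_{s,k} \prod_{j \in \mathrm{Anc}(k)} (1 - \eta_{s,j})$, inferred from its verbal description; the tree is truncated so that the last sibling and each leaf absorb the remaining mass; and the proportions `nu` and `eta` are passed in directly instead of being produced by DRNNs. The function name and the dictionary-based tree encoding are hypothetical.

```python
def topic_distribution(children, nu, eta, root="1"):
    """Compute path (pi), level (phi), and topic (theta) probabilities on a finite tree.

    children: dict mapping a topic id to the ordered list of its child ids.
    nu, eta:  dicts of stick-breaking proportions in [0, 1], keyed by topic id.
    """
    pi, phi, theta = {root: 1.0}, {}, {}
    stack = [(root, 1.0)]   # (topic, product of (1 - eta_j) over its ancestors)
    while stack:
        k, not_stopped = stack.pop()
        kids = children.get(k, [])
        eta_k = eta[k] if kids else 1.0   # assumption: a leaf absorbs the remaining mass
        phi[k] = eta_k * not_stopped
        theta[k] = pi[k] * phi[k]
        remaining = 1.0                   # product of (1 - nu_j) over preceding siblings
        for i, c in enumerate(kids):
            # assumption: the last sibling takes whatever mass is left
            nu_c = 1.0 if i == len(kids) - 1 else nu[c]
            pi[c] = pi[k] * nu_c * remaining
            remaining *= 1.0 - nu_c
            stack.append((c, not_stopped * (1.0 - eta_k)))
    return pi, phi, theta
```

Under these truncation assumptions, `pi` sums to 1 over each level, `phi` sums to 1 along each root-to-leaf path, and `theta` sums to 1 over all topics, which matches the properties stated in the caption of Figure 6. Paths of different lengths are handled without change, in line with the remark that the nHDP permits them.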

A.2 Sensitivity for the Number of Topics
We investigated how the number of topics affects summarization performance. Table 10 shows the ROUGE scores for various numbers of branches with the depth of the topic tree fixed at 3. When the number of topics is small, the model achieves relatively low scores. However, when the number of branches is 4 or more, the performance does not change significantly with the number of topics. A similar trend is confirmed in Table 10, which shows the ROUGE scores for various numbers of levels with the number of branches fixed at 3.
These results indicate that our model is relatively robust to the number of topics.

A.3 Derivation of Equation (19)
Proposition: when $q(x_s \mid z_s)$ is given by (12), as $\sum_s \theta_{s,k}\,\mu_{d,k} = \sum_s \theta_{s,k}\,\mu_s$ from (13). (14). The given equation eventually comes down to a comparison of the entropy.