Extractive Opinion Summarization in Quantized Transformer Spaces

We present the Quantized Transformer (QT), an unsupervised system for extractive opinion summarization. QT is inspired by Vector-Quantized Variational Autoencoders, which we repurpose for popularity-driven summarization. It uses a clustering interpretation of the quantized space and a novel extraction algorithm to discover popular opinions among hundreds of reviews, a significant step towards opinion summarization of practical scope. In addition, QT enables controllable summarization without further training, by utilizing properties of the quantized space to extract aspect-specific summaries. We also make publicly available SPACE, a large-scale evaluation benchmark for opinion summarizers, comprising general and aspect-specific summaries for 50 hotels. Experiments demonstrate the promise of our approach, which is validated by human studies where judges showed clear preference for our method over competitive baselines.


Introduction
Online reviews play an integral role in modern life, as we look to previous customer experiences to inform everyday decisions. The need to digest review content has fueled progress in opinion mining (Pang and Lee, 2008), whose central goal is to automatically summarize people's attitudes towards an entity. Early work (Hu and Liu, 2004) focused on numerically aggregating customer satisfaction across different aspects of the entity under consideration (e.g., the quality of a camera, its size, clarity). More recently, the success of neural summarizers in the Wikipedia and news domains (Cheng and Lapata, 2016; See et al., 2017; Narayan et al., 2018; Liu et al., 2018; Perez-Beltrachini et al., 2019) has spurred interest in opinion summarization, i.e., the aggregation, in textual form, of opinions expressed in a set of reviews (Angelidis and Lapata, 2018; Huy Tien et al., 2019; Tian et al., 2019; Coavoux et al., 2019; Chu and Liu, 2019; Isonuma et al., 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020).
Opinion summarization has distinct characteristics that set it apart from other summarization tasks. Firstly, it cannot rely on reference summaries for training, because such meta-reviews are very scarce and crowdsourcing them is infeasible: even for a single entity, annotators would have to produce summaries after reading hundreds, sometimes thousands, of reviews. Secondly, the inherent subjectivity of review text distorts the notion of information importance used in generic summarization (Peyrard, 2019). Conflicting opinions are often expressed for the same entity and, therefore, useful summaries should be based on opinion popularity (Ganesan et al., 2010). Moreover, methods need to be flexible with respect to the size of the input (entities are frequently reviewed by thousands of users), and controllable with respect to the scope of the output. 
For instance, users may wish to read a general overview summary, or a more targeted one about a particular aspect of interest (e.g., a hotel's location, its cleanliness, or available food options).
Recent work (Tian et al., 2019; Coavoux et al., 2019; Chu and Liu, 2019; Isonuma et al., 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020) has increasingly focused on abstractive summarization, where a summary is generated token-by-token to create novel sentences that articulate prevalent opinions in the input reviews. The abstractive approach offers a solution to the lack of supervision, under the assumption that opinion summaries should be written in the style of reviews. This simplification has allowed abstractive models to generate review-like summaries from aggregate input representations, using sequence-to-sequence models trained to reconstruct single reviews. Despite being fluent, abstractive summaries may still suffer from issues of text degeneration (Holtzman et al., 2020), hallucinations (Rohrbach et al., 2018), and the undesirable use of first-person narrative, a direct consequence of review-like generation. In addition, previous work used an unrealistically small number of input reviews (10 or fewer), and only sparingly investigated controllable summarization, albeit in weakly supervised settings (Amplayo and Lapata, 2019).
In this paper, we attempt to address shortcomings of existing methods by turning to extractive summarization, which aims to construct an opinion summary by selecting a few representative input sentences (Angelidis and Lapata, 2018; Huy Tien et al., 2019). Specifically, we introduce the Quantized Transformer (QT), an unsupervised neural model inspired by Vector-Quantized Variational Autoencoders (VQ-VAE; van den Oord et al., 2017; Roy et al., 2018), which we repurpose for popularity-driven summarization. QT combines Transformers (Vaswani et al., 2017) with the discretization bottleneck of VQ-VAEs and is trained via sentence reconstruction, similarly to the work of Roy and Grangier (2019) on paraphrasing. 
At inference time, we use a clustering interpretation of the quantized space and a novel extraction algorithm that discovers popular opinions among hundreds of reviews, a significant step towards opinion summarization of practical scope. QT is also capable of aspect-specific summarization without further training, by exploiting the properties of the Transformer's multi-head sentence representations.
We further contribute to the progress of opinion mining research, by introducing SPACE (shorthand for Summaries of Popular and Aspect-specific Customer Experiences), a large-scale corpus for the evaluation of opinion summarizers. We collected 1,050 human-written summaries of TripAdvisor reviews for 50 hotels. SPACE has general summaries, giving a high-level overview of popular opinions, and aspect-specific ones, providing detail on individual aspects (e.g., location, cleanliness). Each summary is based on 100 customer reviews, an order of magnitude increase over existing corpora, thus providing a more realistic input to competing models. Experiments on SPACE and two more benchmarks demonstrate that our approach holds promise for opinion summarization. Participants in human evaluation further express a clear preference for our model over competitive baselines. We make SPACE and our code publicly available at https://github.com/stangelid/qt.

Related Work
Ganesan et al. (2010) were the first to make the connection between opinion mining and text summarization; they develop Opinosis, a graph-based abstractive summarizer which explicitly models opinion popularity, a key characteristic of subjective text, and central to our approach. Follow-on work (Di Fabbrizio et al., 2014) adopts a hybrid approach where salient sentences are first extracted and abstracts are generated based on hand-written templates (Carenini et al., 2006). More recently, Angelidis and Lapata (2018) extract salient opinions according to their polarity intensity and aspect specificity, in a weakly supervised setting.
A popular approach to modeling opinion popularity, albeit indirectly, is vector averaging. Chu and Liu (2019) propose MeanSum, an unsupervised abstractive summarizer that learns a review decoder through reconstruction, and uses it to generate summaries conditioned on averaged representations of the inputs. Averaging is also used by Bražinskas et al. (2020), who train a copy-enabled variational autoencoder by reconstructing reviews from averaged vectors of reviews about the same entity. Other methods include denoising autoencoders (Amplayo and Lapata, 2020) and the system of Coavoux et al. (2019), an encoder-decoder architecture that uses a clustering of the encoding space to identify opinion groups, similar to our work.
Our model builds on the Vector Quantized Variational Autoencoder (VQ-VAE; van den Oord et al., 2017), a recently proposed training technique for learning discrete latent variables, which aims to overcome problems of posterior collapse and large variance associated with Variational Autoencoders (Kingma and Welling, 2014). Like other related discretization techniques (Maddison et al., 2017; Kaiser and Bengio, 2018), VQ-VAE passes the encoder output through a discretization bottleneck using a neighbor lookup in the space of latent code embeddings. To our knowledge, both the application of VQ-VAEs to opinion summarization and the proposed sentence extraction algorithm are novel. Our model does not depend on vector averaging, nor does it suffer from information loss and hallucination. Furthermore, it can easily accommodate a large number of input reviews. Within NLP, VQ-VAEs have been previously applied to neural machine translation (Roy et al., 2018) and paraphrase generation (Roy and Grangier, 2019). Our work is closest to Roy and Grangier (2019) in its use of a quantized Transformer; however, we adopt a different training algorithm (Soft EM; Roy et al., 2018), orders of magnitude fewer discrete latent codes, and a different method for obtaining head sentence vectors, and we apply the QT in a novel way for extractive opinion summarization.
Besides modeling, our work contributes to the growing body of resources for opinion summarization. We release SPACE, the first corpus to contain both general and aspect-specific opinion summaries, while increasing the number of input reviews tenfold compared to popular benchmarks (Bražinskas et al., 2020; Chu and Liu, 2019; Angelidis and Lapata, 2018).

Problem Formulation
Let $C$ be a corpus of reviews on entities $\{e_1, e_2, \dots\}$ from a single domain $d$, e.g., hotels. Reviews may discuss any number of relevant aspects $A_d = \{a_1, a_2, \dots\}$, like the hotel's rooms or location. For every entity $e$, we define its review set $R_e = \{r_1, r_2, \dots\}$. Every review is a sequence of sentences $(x_1, x_2, \dots)$ and a sentence $x$ is, in turn, a sequence of words $(w_1, w_2, \dots)$. For brevity, we use $X_e$ to denote all review sentences about entity $e$. We formalize two sub-tasks: (a) general opinion summarization, where a summary should cover popular opinions in $R_e$ across all discussed aspects; and (b) aspect opinion summarization, where a summary must focus on a single specified aspect $a \in A_d$. In our extractive setting, these translate to creating a general or aspect summary by selecting a small subset of sentences $S_e \subset X_e$.
We train the Quantized Transformer (QT) through sentence reconstruction to learn a rich representation space and its quantization into latent codes (Section 3.1). We enable opinion summarization by mapping input sentences onto their nearest latent codes and extracting those sentences that are representative of the most popular codes (Section 3.2). We also illustrate how to produce aspect-specific summaries using a trained QT model and a few aspect-denoting query terms (Section 3.2.2).

Figure 1: A sentence is encoded into a 3-head representation and head vectors are quantized using a weighted average of their neighboring code embeddings. The QT model is trained by reconstructing the original sentence. (Example shown: original "The staff was great!", reconstructed "The staff was friendly.")

The QT model consists of: (a) a Transformer sentence encoder, which produces a multi-head representation $\{x_1, \dots, x_H\}$ of an input sentence $x$; (b) a vector quantizer, which maps each head vector to a mixture of discrete latent codes, and uses the codes' embeddings to produce quantized vectors $\{q_1, \dots, q_H\}$, $q_h \in \mathbb{R}^D$; (c) a Transformer sentence decoder, which attends over the quantized vectors to generate sentence reconstruction $\hat{x}$. The decoder is not used during summarization; we only use the learned quantized space to extract sentences, as described in Section 3.2.
Sentence Encoding Our encoder prepends sentence $x$ with the special token [SNT] and uses the vanilla Transformer encoder (Vaswani et al., 2017) to produce token-level vectors. We ignore individual word vectors and only keep the special token's vector $x_{snt} \in \mathbb{R}^D$. We obtain a multi-head representation of $x$ by splitting $x_{snt}$ into $H$ sub-vectors $\{\bar{x}_1, \dots, \bar{x}_H\}$, $\bar{x}_h \in \mathbb{R}^{D/H}$, followed by a layer-normalized transformation:

$$x_h = \mathrm{LayerNorm}(W \bar{x}_h + b), \tag{1}$$

where $x_h$ is the $h$-th head and $W \in \mathbb{R}^{D \times D/H}$, $b \in \mathbb{R}^D$ are shared across heads. Hyperparameter $H$, i.e., the number of sentence heads of our encoder, is distinct from the number of the Transformer's internal attention heads. The encoder's operation is illustrated in Figure 1, where the sentence "The staff was great!" is encoded into a 3-head representation.
Vector Quantization Let $z_1, \dots, z_H$ be discrete latent variables corresponding to the $H$ encoder heads. Every variable can take one of $K$ possible latent codes, $z_h \in [K]$. The quantizer's codebook, $e \in \mathbb{R}^{K \times D}$, is shared across latent variables and maps each code (or cluster) to its embedding (or centroid) $e_k \in \mathbb{R}^D$. Given sentence $x$ and its multi-head encoding $\{x_1, \dots, x_H\}$, we independently quantize every head using a mixture of its nearest codes from $[K]$. Specifically, we follow the Soft EM training of Roy et al. (2018) and sample, with replacement, $m$ latent codes for the $h$-th head:

$$z_h^1, \dots, z_h^m \sim \mathrm{Multinomial}(l_1, \dots, l_K), \quad \text{with } l_k = -\|x_h - e_k\|_2^2, \tag{2}$$

where $\mathrm{Multinomial}(l_1, \dots, l_K)$ is a $K$-way multinomial distribution with logits $l_1, \dots, l_K$. The $h$-th quantized head vector is obtained as the average of the sampled codes' embeddings:

$$q_h = \frac{1}{m} \sum_{j=1}^{m} e_{z_h^j}. \tag{3}$$

This soft quantization process is shown in Figure 1, where head vectors $x_1$, $x_2$ and $x_3$ are quantized using a weighted average of their neighboring code embeddings, to produce $q_1$, $q_2$, and $q_3$.
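The soft quantization step described above can be sketched in a few lines of plain Python. This is an illustrative sketch and not the released implementation: the function name `soft_quantize`, the toy codebook, and the choice of negative squared Euclidean distance as the logit are assumptions made for the example.

```python
import math
import random

def soft_quantize(head, codebook, m, rng):
    """Quantize one head vector as the average of m sampled code embeddings.

    Assumption: logits are negative squared Euclidean distances to the codes.
    """
    logits = [-sum((x - e) ** 2 for x, e in zip(head, code)) for code in codebook]
    # Turn logits into sampling weights (shifted by the max for stability).
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    # Sample m code indices with replacement.
    sampled = rng.choices(range(len(codebook)), weights=weights, k=m)
    # Average the embeddings of the sampled codes, dimension by dimension.
    quantized = [sum(codebook[z][d] for z in sampled) / m for d in range(len(head))]
    return quantized, sampled

# Hypothetical toy codebook with K = 3 codes in D = 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
q, codes = soft_quantize([0.9, 1.1], codebook, m=30, rng=random.Random(0))
```

Because codes are drawn with replacement, a code that is much closer to the head vector is drawn more often and therefore dominates the average, which is what makes the result a weighted mixture of neighboring codes.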

Sentence Reconstruction and Training
Instead of attending over individual token vectors, as in the vanilla architecture, the Transformer sentence decoder attends over $\{q_1, \dots, q_H\}$, the quantized head vectors of the sentence, to generate reconstruction $\hat{x}$. The model is trained to minimize:

$$\mathcal{L} = \mathcal{L}_r + \sum_{h=1}^{H} \|x_h - \mathrm{sg}(q_h)\|_2^2, \tag{4}$$

where $\mathcal{L}_r$ is the reconstruction cross entropy, and the stop-gradient operator $\mathrm{sg}(\cdot)$ is defined as identity during forward computation and zero on backpropagation. The sampling of Equation (2) is bypassed using the straight-through estimator (Bengio et al., 2013) and the latent codebook is trained via exponential moving averages, as detailed in Roy et al. (2018).

Summarization in Quantized Spaces
Existing neural methods for opinion summarization have modeled opinion popularity within a set of reviews by encoding each review into a vector, averaging all vectors to obtain an aggregate representation of the input, and feeding it to a review decoder to produce a summary (Chu and Liu, 2019; Coavoux et al., 2019; Bražinskas et al., 2020). This approach is problematic for two reasons. Firstly, it assumes that complex semantics of whole reviews can be encoded in a single vector. Secondly, it also assumes that features of commonly occurring opinions are preserved after averaging and, therefore, those opinions will appear in the generated summary. The latter assumption becomes particularly uncertain for larger numbers of input reviews. We take a different approach, using sentences as the unit of representation, and propose a general extraction algorithm based on the QT, which explicitly models popularity without vector aggregation. Using the same algorithmic framework we are also able to extract aspect-specific summaries.

Figure 2: General opinion summarization with QT. All input sentences for an entity are encoded using three heads (shown in orange, blue and green crosses). Sentence vectors are clustered under their nearest latent code (gray circles). Popular clusters (histogram) correspond to commonly occurring opinions, and are used to sample and extract the most representative sentences.

General Opinion Summarization
We exploit QT's quantization of the encoding space to cluster similar sentences together, quantify the popularity of the resulting clusters, and extract representative sentences from the most popular ones. Specifically, given $X_e = \{x_1, \dots, x_i, \dots, x_N\}$, the $N$ review sentences about entity $e$, the trained encoder produces $N \times H$ unquantized head vectors $\{x_{11}, \dots, x_{ih}, \dots, x_{NH}\}$, where $x_{ih}$ is the $h$-th head of the $i$-th sentence. We perform hard quantization, assigning every vector to its nearest latent code, and counting the number of assignments per code, i.e., the popularity of each cluster:

$$z_{ih} = \arg\min_{k \in [K]} \|x_{ih} - e_k\|_2, \tag{5}$$

$$n_k = \sum_{i,h} \mathbb{1}[z_{ih} = k]. \tag{6}$$

Figure 2 shows how sentences $X_e$ are encoded, and their different heads are assigned to codes. Similar sentences cluster under the same codes and, consequently, clusters receiving numerous assignments are characteristic of popular opinions in $X_e$. A general summary should consist of the sentences that are most representative of these popular clusters.
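As a concrete illustration of hard quantization and popularity counting, the following minimal sketch assigns every head vector to its nearest code and tallies assignments per code. The helper names and the toy vectors are hypothetical; Euclidean distance is assumed as the notion of "nearest".

```python
from collections import Counter

def nearest_code(vec, codebook):
    """Index of the code embedding closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - e) ** 2 for v, e in zip(vec, codebook[k])),
    )

def cluster_popularity(sentence_heads, codebook):
    """Count assignments per code; sentence_heads[i][h] is head h of sentence i."""
    counts = Counter()
    for heads in sentence_heads:
        for vec in heads:
            counts[nearest_code(vec, codebook)] += 1
    return counts

# Hypothetical toy input: 3 sentences, 2 heads each, K = 3 codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
sentence_heads = [
    [[0.1, 0.0], [4.8, 5.1]],
    [[0.9, 1.2], [5.2, 4.9]],
    [[0.2, -0.1], [5.0, 5.0]],
]
n = cluster_popularity(sentence_heads, codebook)
```

In this toy input the third code collects one head from every sentence, so it ends up as the most popular cluster.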
In the simplest case, we could couple every code $k$ with its nearest sentence $x^{(k)}$:

$$x^{(k)} = \arg\min_{x_i \in X_e} \min_h \|x_{ih} - e_k\|_2, \tag{7}$$

and rank sentences $x^{(k)}$ according to the size $n_k$ of their respective clusters; the top sentences, up to a predefined budget, are extracted into a summary. The above ranking method entails that only those sentences which are the nearest neighbor of a popular code are likely to be extracted. However, a salient sentence may be in the neighborhood of multiple codes per head, despite never being the nearest sentence of a code vector. For example, the sentence "Great location and beautiful rooms" is representative of clusters encoding positive attitudes for both the location and the rooms of a hotel. To capture this, we relax the requirement of coupling every cluster with exactly one sentence and propose two-step sampling (Figure 3), a novel sampling process which simultaneously estimates cluster popularity and promotes sentences commonly found in the proximity of popular clusters. We repeatedly perform the following operations:
Cluster Sampling We first sample a latent code $z$ with probability proportional to the clusters' size:

$$P(z = k) = \frac{n_k}{\sum_{k'} n_{k'}}, \tag{8}$$

where $n_k$ is the number of assignments for code $k$, computed in Equation (6). For example, if the input contains many paraphrases of sentence "Excellent location", these are likely to be clustered under the same code, which in turn increases the probability of sampling that code. Cluster sampling is illustrated on the top of Figure 3, showing assignments (left) and resulting code probabilities (right).
Sentence Sampling The sampled code $z$ exists in the neighborhood of many input sentences. Picking a single sentence as the most characteristic of that cluster is too restrictive. Instead, we sample (with replacement) sentences from the code's neighborhood $n$ times, thus generalizing Equation (7):

$$x^1, \dots, x^n \sim \mathrm{Multinomial}(l_1, \dots, l_N), \quad \text{with } l_i = -\min_h \|x_{ih} - e_z\|_2^2, \tag{9}$$

where the Multinomial's logits $l_i$ mark the (negative) distance of the $i$-th sentence's head which is closest to code $z$. Sentence sampling is depicted in the toy example of Figure 3 (bottom). After selecting code $k = 1$ during cluster sampling, four sentence samples are drawn (shown in black arrows). The next cluster sample ($k = 3$) results in four more sentence samples (shown in red). Sentence $s_4$ ("Excellent room and location") receives the most votes in total, after being sampled as a neighbor of both codes.
Two-step sampling is repeated multiple times and all sentences in $X_e$ are ranked according to the total number of times they have been sampled. The final summary is constructed by concatenating the top ones (see right part of Figure 2). Importantly, our extraction algorithm is not sensitive to the size of the input. More sentences increase the absolute number of assignments per code, but do not hinder two-step sampling or cause information loss; on the contrary, a larger pool of sentences may result in a more densely populated quantized encoding space and, in turn, a better estimation of cluster popularity and sentence ranking.

Figure 3: Sentence ranking via two-step sampling (top: cluster sampling; bottom: sentence sampling). In this toy example, each sentence ($s_1$ to $s_5$) is assigned to its nearest code ($k = 1, 2, 3$), as shown by thick purple arrows. During cluster sampling, the probability of sampling a code (top right; shown as blue bars) is proportional to the number of assignments it receives. For every sampled code, we perform sentence sampling; sentences are sampled, with replacement, according to their proximity to the code's encoding. Samples from codes 1 and 3 are shown in black and red, respectively.
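To make the two-step procedure concrete, here is a small self-contained sketch. It is not the authors' code: the function `two_step_votes`, the toy vectors, and the use of negative squared distance as the sentence logit are assumptions for illustration. Sentence 0 plays the role of a sentence that sits close to two popular codes at once.

```python
import math
import random
from collections import Counter

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def two_step_votes(sentence_heads, codebook, num_codes, n, rng):
    """Return how often each sentence is sampled by two-step sampling."""
    num_k = len(codebook)
    # Cluster popularity: assign every head to its nearest code and count.
    counts = [0] * num_k
    for heads in sentence_heads:
        for vec in heads:
            counts[min(range(num_k), key=lambda k: sq_dist(vec, codebook[k]))] += 1
    votes = Counter()
    for _ in range(num_codes):
        # Step 1: sample a code with probability proportional to cluster size.
        z = rng.choices(range(num_k), weights=counts, k=1)[0]
        # Step 2: sample n sentences; logit = -distance of the closest head to z.
        logits = [-min(sq_dist(v, codebook[z]) for v in heads) for heads in sentence_heads]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        votes.update(rng.choices(range(len(sentence_heads)), weights=weights, k=n))
    return votes

# Hypothetical toy input: sentence 0 has one head near code 0 and one near code 1.
codebook = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
sentence_heads = [
    [[0.1, 0.0], [10.0, 0.1]],
    [[0.5, 0.0], [-0.5, 0.0]],
    [[10.0, 0.5], [10.0, -0.5]],
    [[0.0, 9.5], [0.0, 10.5]],
]
votes = two_step_votes(sentence_heads, codebook, num_codes=300, n=30, rng=random.Random(0))
ranking = [i for i, _ in votes.most_common()]
```

The sentence that neighbors two popular codes collects votes whenever either code is drawn, so it rises to the top of the ranking even though each of its competitors is tied to a single cluster.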

Aspect Opinion Summarization
So far, we have focused on selecting sentences solely based on the popularity of the opinions they express. We now turn our attention to aspect summaries, which discuss a particular aspect of an entity (e.g., the location or service of a hotel) while still presenting popular opinions. We create such summaries with a trained QT model, without additional fine-tuning. Instead, we exploit QT's multi-head representations and only require a small number of aspect-denoting query terms.

Figure 4: Aspect opinion summarization with QT. The aspect-encoding sub-space is identified using mean aspect entropy and all other sub-spaces are ignored (shown in gray). Two-step sampling is restricted only to the codes associated with the desired aspect (shown in red).

We hypothesize that different sentence heads in QT encode the approximately orthogonal semantic or structural attributes which are necessary for sentence reconstruction. In the simplified example in Figure 2, the encoder's first head (orange) might capture information about the aspects of the sentence, the second head (blue) might encode sentiment, while head three (green) may encode structural information (e.g., the length of the sentence or its punctuation). Our hypothesis is reinforced by the empirical observation that sentence vectors originating from the same head occupy their own sub-space, and do not show any similarity to vectors from other heads. As a result, each latent code $k$ receives assignments from exactly one head of the sentence encoder. More formally, head $h$ yields a set of latent codes $K_h \subset [K]$. Figure 2 demonstrates this, as the encoding space consists of three sub-spaces, one for each head. Sentence and latent code vectors are further organized within that sub-space according to the attribute captured by the respective head.
To enable aspect summarization, we identify the sub-space capturing aspect-relevant information and label its aspect-specific codes, as seen in Figure 4. Specifically, we first quantify the probability of finding an aspect in the sentences assigned to a latent code and identify the head sub-space that best separates sentences according to their aspect. Then, we map every cluster within that sub-space to an aspect and extract aspect summaries only from those aspect-specific clusters.
We utilize a held-out set of review sentences $X_{dev}$, and keywords $Q_a = \{s_1, \dots, s_5\}$ for aspect $a$. We encode and quantize sentences in $X_{dev}$ and compute the probability that latent code $k$ contains tokens typical of aspect $a$ as:

$$P_k(a) = \frac{tf(Q_a, k)}{\sum_{a' \in A_d} tf(Q_{a'}, k)}, \tag{10}$$

where $tf(Q_a, k)$ is the number of times query terms in $Q_a$ were found in sentences assigned to $k$. We use information theory's entropy to measure how aspect-certain code $k$ is:

$$H_k = -\sum_{a \in A_d} P_k(a) \log P_k(a). \tag{11}$$

Low aspect entropy values indicate that most sentences assigned to $k$ belong to a single aspect. It thus follows that $h_{asp}$, i.e., the head sub-space which best separates sentences according to their aspect, will exhibit the lowest mean aspect entropy:

$$h_{asp} = \arg\min_h \frac{1}{|K_h|} \sum_{k \in K_h} H_k. \tag{12}$$

We map every code produced by $h_{asp}$ to its aspect $a^{(k)} = \arg\max_a P_k(a)$ via Equation (10), and obtain aspect codes:

$$K_a = \{k \mid k \in K_{h_{asp}} \text{ and } a^{(k)} = a\}. \tag{13}$$

To extract a summary for aspect $a$, we follow the ranking or sampling methods described in Equations (5)-(9), restricting the process to codes $K_a$. Sub-space selection and aspect-specific sentence sampling are illustrated in Figure 4.
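The sub-space selection described above can be illustrated with a short sketch. The data structures (`codes_by_head`, `tf_by_code`) and all counts are hypothetical, and the sketch assumes every code has at least one query-term hit; it is meant to show the entropy-based selection, not to reproduce the released implementation.

```python
import math

def aspect_distribution(tf_counts):
    """Normalize per-aspect query-term counts of one code into probabilities."""
    total = sum(tf_counts.values())
    return {a: c / total for a, c in tf_counts.items()}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def select_aspect_codes(codes_by_head, tf_by_code):
    """Pick the head with the lowest mean aspect entropy and label its codes."""
    mean_entropy = {
        h: sum(entropy(aspect_distribution(tf_by_code[k])) for k in codes) / len(codes)
        for h, codes in codes_by_head.items()
    }
    h_asp = min(mean_entropy, key=mean_entropy.get)
    aspect_codes = {}
    for k in codes_by_head[h_asp]:
        dist = aspect_distribution(tf_by_code[k])
        # Each code is mapped to its most probable aspect.
        aspect_codes.setdefault(max(dist, key=dist.get), []).append(k)
    return h_asp, aspect_codes

# Hypothetical counts: head 0 separates aspects well, head 1 does not.
codes_by_head = {0: [0, 1], 1: [2, 3]}
tf_by_code = {
    0: {"location": 9, "food": 1},
    1: {"location": 1, "food": 9},
    2: {"location": 5, "food": 5},
    3: {"location": 4, "food": 6},
}
h_asp, aspect_codes = select_aspect_codes(codes_by_head, tf_by_code)
```

Once the aspect codes are known, an aspect summary is obtained by running the same ranking or sampling procedure restricted to the codes listed under the desired aspect.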

The SPACE Corpus
We introduce SPACE (Summaries of Popular and Aspect-specific Customer Experiences), a large-scale opinion summarization benchmark for the evaluation of unsupervised summarizers. SPACE is built on TripAdvisor hotel reviews and aims to facilitate future research by improving upon the shortcomings of existing datasets. It comes with a training set of approximately 1.1 million reviews for over 11 thousand hotels, obtained by cleaning and downsampling an existing collection (Wang et al., 2010). The training set contains no reference summaries, and is useful for unsupervised training. For evaluation, we created a large collection of human-written, abstractive opinion summaries. Specifically, for a held-out set of 50 hotels (25 hotels for development and 25 for testing), we asked human annotators to write high-level general summaries and aspect summaries for six popular aspects: building, cleanliness, food, location, rooms, and service. For every hotel and summary type, we collected three reference summaries from different annotators. Importantly, for every hotel, summaries were based on 100 input reviews. To the best of our knowledge, this is the largest crowdsourcing effort towards obtaining high-quality abstractive summaries of reviews, and the first to use a pool of input reviews of this scale (see Table 1 for a comparison with existing datasets). Moreover, SPACE is the first benchmark to also contain aspect-specific opinion summaries.
The large number of input reviews per entity poses certain challenges with regard to the collection of human summaries. A direct approach is prohibitive, as it would require annotators to read all 100 reviews and write a summary in a single step. A more reasonable method is to first identify a subset of input sentences that most people consider salient, and then ask annotators to summarize them. Summaries were thus created in multiple stages using the Appen platform and expert annotator channels of native English speakers. Although we propose an extractive model, annotators were asked to produce abstractive summaries, as we hope SPACE will be broadly useful to the summarization community. To collect more summary-like texts, we did not allow the use of first-person narrative. We present our annotation procedure below.

Sentence Selection via Voting
The sentence selection stage identifies a subset of review sentences which contain the most salient and useful opinions expressed by the reviewers. This is a crucial but subjective task and, therefore, we devised a voting scheme which allowed us to select sentences that received votes by many annotators.
Specifically, each review was shown to five judges who were asked to select informative sentences. Annotators were encouraged to exercise their own judgement in selecting summary-worthy sentences, but were advised to focus on sentences which explicitly expressed or supported reviewer opinions, avoiding overly general or personal comments (e.g., "Loved the hotel", "I like a shower with good pressure"), and making sure that important aspects were included. We set no threshold on the number of sentences they could select (we allowed selecting all or no sentences for a given review). However, the annotation interface kept track of their total votes and guided them to select between 20% and 40% of sentences, on average.
Sentences with 4 or more votes were automatically promoted to the next stage. Inter-annotator agreement according to Cohen's kappa was $\kappa = 0.36$, indicating "fair agreement". Previous studies have shown that human agreement for sentence selection tasks in summarization of news articles is usually lower than 0.3 (Radev et al., 2003). The median number of sentences promoted for summarization for each hotel was 83, while the minimum was 46. This ensured that enough sentences were always available for summarization, while simplifying the task; annotators were now required to read and summarize considerably smaller amounts of review text than the original 100 reviews.

Summary Collection
General Summaries The top-voted sentences for each hotel were presented to three annotators, who were asked to read them and produce a high-level overview summary up to a budget of 100 words. To simplify the task and help annotators write coherent summaries, sentences with high lexical overlap were grouped together and the interface allowed the annotators to quickly sort sentences according to words they contained. The process resulted in an inter-annotator ROUGE-L score of 29.19 and provides ample room for future research, as detailed in our experiments (Table 2).
Aspect Summaries Top-voted sentences were further labeled by an off-the-shelf aspect classifier (Angelidis and Lapata, 2018) trained on a public aspect-labeled corpus of hotel review sentences (Marcheggiani et al., 2014). Sentences outside of the six most popular aspects (building, cleanliness, food, location, rooms, and service) were ignored, and sentences with 3 votes were promoted only if an aspect had no sentences with 4 votes. The promoted sentences were grouped according to their aspect and presented to annotators, who were asked to create a more detailed, aspect-specific summary, up to a budget of 75 words. The aspect summaries have an inter-annotator ROUGE-L score of 34.58.
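The vote-based promotion rule for aspect summaries can be written down as a small sketch. The function name, the tuple layout, and the example sentences are hypothetical; only the thresholds (4 or more votes, with a fallback to 3 votes when an aspect has no such sentence) come from the description above.

```python
def promote_by_votes(sentences, min_votes=4, fallback_votes=3):
    """Group promoted sentences by aspect.

    sentences: list of (text, aspect, votes) tuples.
    Sentences with fallback_votes are used only if an aspect has
    no sentence reaching min_votes.
    """
    promoted = {}
    for aspect in {a for _, a, _ in sentences}:
        pool = [(t, v) for t, a, v in sentences if a == aspect]
        top = [t for t, v in pool if v >= min_votes]
        if not top:
            top = [t for t, v in pool if v == fallback_votes]
        promoted[aspect] = top
    return promoted

# Hypothetical labeled sentences with their vote counts (out of five judges).
sentences = [
    ("Great location.", "location", 5),
    ("Close to the metro.", "location", 4),
    ("Near shops.", "location", 3),
    ("Breakfast was fine.", "food", 3),
    ("Limited menu.", "food", 2),
]
promoted = promote_by_votes(sentences)
```

Here the food aspect has no sentence with 4 votes, so its single 3-vote sentence is promoted, while the 3-vote location sentence is left out.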

Evaluation
In this section, we discuss our experimental setup, including datasets and comparison models, before presenting our automatic evaluation results, human studies, and further analyses.

Experimental Setup
Datasets We used SPACE as the main testbed for our experimental evaluation, covering both general and aspect-specific summarization tasks. For general summarization, we used two additional opinion summarization benchmarks, namely YELP (Chu and Liu, 2019) and AMAZON (Bražinskas et al., 2020) (see Table 1). For all datasets, we use pre-defined development and test set splits, and only report results on the test set.

Implementation Details
We used unigram LM SentencePiece vocabularies of 32K. All system hyperparameters were selected on the development set. The Transformer's dimensionality was set to 320 and its feed-forward layer to 512. We used 3 layers and 4 internal attention heads for its encoder and decoder, whose input embedding layer was shared. We used no positional encodings, as they yielded no summarization improvements. We used $H = 8$ sentence heads for representing every sentence. For the quantizer, we set the number of latent codes to $K = 1{,}024$ and sampled $m = 30$ codes for every input sentence during training. We used the Adam optimizer, with an initial learning rate of $10^{-3}$ and a learning rate decay of 0.9. We warmed up the Transformer by disabling quantization for the first 4 epochs. In total, we ran 20 training epochs. On the full SPACE corpus, QT was trained in 4 days on a single GeForce GTX 1080 Ti GPU, using our available PyTorch implementation.
All general and aspect summaries were extracted with the two-step sampling procedure described in Section 3.2.1, unless otherwise stated. When two-step sampling was enabled, we ranked sentences by sampling 300 latent codes and, for every code, sampled $n = 30$ neighboring sentences. QT and all extractive baselines use a greedy algorithm to eliminate redundancy, similar to previous research on multi-document summarization (Cao et al., 2015; Yasunaga et al., 2017; Angelidis and Lapata, 2018).

Metrics
We evaluate the lexical overlap between system and human summaries using ROUGE F-scores. We report uni- and bi-gram variants (R1/R2), as well as longest common subsequence (RL).
A successful opinion summarizer must also produce summaries which match human-written ones in terms of aspects mentioned and sentiment conveyed. For this reason, we also evaluate our systems on two metrics which utilize an off-the-shelf aspect-based sentiment analysis (ABSA) system (Miao et al., 2020), pre-trained in-domain. The ABSA system extracts opinion phrases from summaries, and predicts their aspect category and sentiment. The metrics use these predictions as follows.
Aspect Coverage We use the phrase-level aspect predictions to mark the presence or absence of an aspect in a summary. We discard very infrequent aspect categories. Similar to Pan et al. (2020), we measure precision, recall, and F1 of system against human summaries.
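As an illustration of the aspect coverage metric, the sketch below compares the set of aspects detected in a system summary with the set detected in a single human summary. The function name and inputs are hypothetical, and how scores are aggregated over multiple reference summaries is not specified here, so the sketch handles one reference only.

```python
def aspect_coverage(system_aspects, human_aspects):
    """Precision, recall and F1 of system aspects against one human summary."""
    system, human = set(system_aspects), set(human_aspects)
    overlap = len(system & human)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(human) if human else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical aspect predictions for one system and one human summary.
p, r, f1 = aspect_coverage(
    ["location", "rooms", "food"],
    ["location", "rooms", "service", "cleanliness"],
)
```

In this example two of the three system aspects are also in the human summary, and two of the four human aspects are covered by the system.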

Aspect-level Sentiment
We propose a new metric to evaluate the sentiment consistency between system and human summaries. Specifically, we compute the sentiment polarity score towards an individual aspect a as the mean polarity of the opinion phrases that discuss this aspect in a summary (pol_a ∈ [−1, 1]). We repeat the process for every aspect, thus obtaining a vector of aspect polarities for the summary (we set the polarity of absent aspects to zero). The aspect-level sentiment consistency is computed as the mean squared error between system and human polarity vectors.
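The metric can be sketched in a few lines. The example assumes that the ABSA system's output for a summary is available as a list of (aspect, polarity) pairs with polarities in [−1, 1]; the function names and the toy data are illustrative.

```python
def aspect_polarities(phrases, aspects):
    """Mean polarity per aspect; aspects absent from the summary get 0.0.

    phrases: list of (aspect, polarity) pairs with polarity in [-1, 1].
    aspects: ordered list of aspect labels defining the vector layout.
    """
    vector = []
    for aspect in aspects:
        scores = [polarity for label, polarity in phrases if label == aspect]
        vector.append(sum(scores) / len(scores) if scores else 0.0)
    return vector


def sentiment_consistency_mse(system_phrases, human_phrases, aspects):
    """Mean squared error between the two polarity vectors (lower is better)."""
    system = aspect_polarities(system_phrases, aspects)
    human = aspect_polarities(human_phrases, aspects)
    return sum((s - h) ** 2 for s, h in zip(system, human)) / len(aspects)


aspects = ["rooms", "service", "location"]
system_phrases = [("rooms", 1.0), ("rooms", 0.0), ("service", 1.0)]
human_phrases = [("rooms", 1.0), ("service", 1.0), ("location", -1.0)]
print(sentiment_consistency_mse(system_phrases, human_phrases, aspects))
```

In this toy case the system summary is mildly positive about rooms and silent about location, so both the averaged rooms polarity and the zero assigned to the missing aspect contribute to the error.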

Results: General Summarization
We first discuss our results on general summarization and then move on to present experiments on aspect-specific summarization. We compared our model against the following baselines: Best Review systems select the single review that best approximates the consensus opinions in the input. We use a Centroid method that encodes the entity's reviews with BERT (average token vector; Devlin et al., 2019) or SentiNeuron (Radford et al., 2017), and picks the one closest to the mean review vector. We also tested an Oracle method, which selects the review closest to the reference summaries.
Extractive systems, where we tested LexRank (Erkan and Radev, 2004), an unsupervised graph-based summarizer. To compute its adjacency matrices, we used BERT and SentiNeuron vectors, in addition to the sparse tf-idf features of the original. We also present a random extractive baseline.
Abstractive systems include Opinosis (Ganesan et al., 2010), a graph-based method, as well as MeanSum (Chu and Liu, 2019) and Copycat (Bražinskas et al., 2020), two neural abstractive methods that generate review-like summaries from aggregate review representations learned using autoencoders.
Table 2 reports ROUGE scores on SPACE (test set) for the general summarization task. QT's popularity-based extraction algorithm shows strong summarization capabilities, outperforming all comparison systems (differences in ROUGE are statistically significant against all models but Copycat). This is a welcome result, considering that QT is an extractive method and does not benefit from the compression and rewording capabilities of abstractive summarizers. Moreover, as we discuss in Section 5.5, QT is less data-hungry than other neural models: it achieves the same level of performance even when trained on 5% of the dataset. We also show in Table 2 (fourth block) that the proposed two-step sampling method yields better extractive summaries compared to simply selecting the sentences nearest to the most popular clusters.
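The Centroid baseline described among the Best Review systems above reduces to a nearest-to-mean search over review vectors. The sketch below assumes that each review has already been encoded as a fixed-size vector (for example by averaging token vectors) and uses Euclidean distance, which is an assumption: the distance measure is not stated in the text.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid_review(review_vectors):
    """Index of the review whose vector lies closest to the mean review vector."""
    centroid = mean_vector(review_vectors)
    return min(
        range(len(review_vectors)),
        key=lambda i: euclidean(review_vectors[i], centroid),
    )


# Toy 2-dimensional "review encodings"; real encoder vectors would be much larger.
reviews = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
print(centroid_review(reviews))
```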
Aspect coverage and sentiment consistency results are also encouraging for QT, which consistently scores highly on both metrics, while baselines show mixed results. We also compared (using ROUGE-L) general system summaries against reference aspect summaries. The results in Table 2 (column RL_ASP) confirm that aspect summarization requires tailor-made methods. Unsurprisingly, all systems are inferior to the human upper bound (i.e., inter-annotator ROUGE and aspect-based metrics), suggesting ample room for improvement.
QT's ability for general opinion summarization is further demonstrated in Table 3, which reports results on the YELP and AMAZON datasets. We present the strongest baselines, i.e., Centroid_BERT, LexRank_BERT, Oracle_BERT, and the abstractive Opinosis, MeanSum, and Copycat. On YELP, QT performs on par with MeanSum, but worse than Copycat. However, it is important to note that, in contrast to SPACE, YELP's reference summaries were purposely written using first-person narrative, giving an advantage to the review-like summaries of abstractive methods. On AMAZON, QT outperforms all methods on ROUGE-1/2, but comes second to Copycat on ROUGE-L. This follows a trend seen across all datasets, where abstractive systems appear relatively stronger in terms of ROUGE-L compared to ROUGE-1/2. We partly attribute this to their ability to fuse opinions into fluent sentences, thus matching longer reference sequences.
Table 2 (caption; beginning not recovered): [...] Koehn, 2004). We exclude Oracle systems from comparisons as they access gold summaries at test time. RL_ASP is the ROUGE-L of general summarizers against gold aspect summaries. AC and SC are shorthands for Aspect Coverage and Sentiment Consistency. Subscripts P and R refer to precision and recall, and F1 is their harmonic mean. MSE is mean squared error (lower is better).

Besides automatic evaluation, we conducted a user study to verify the utility of the generated summaries. We produced general summaries from five systems (QT, Copycat, MeanSum, LexRank_BERT, Centroid_BERT) for all entities in SPACE's test set. For every entity and pair of systems, we showed three human judges a gold-standard summary for reference, and the two system summaries. We asked them to select the best summary according to four criteria: informativeness (useful opinions, consistent with reference), coherence (easy to read, avoids contradictions), conciseness (useful in a few words), and non-redundancy (no repetitions). The systems' scores were computed using Best-Worst Scaling (Louviere et al., 2015), with values ranging from −100 (unanimously worst) to +100 (unanimously best). As shown in Table 4, participants rate QT favorably over all baselines in terms of informativeness, conciseness and lack of redundancy, with slight preference for Copycat summaries with respect to coherence (statistical significance information in caption). QT captures essential opinions effectively, whereas there is room for improvement in terms of summary cohesion.
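A common counting estimate for Best-Worst Scaling is the percentage of times a system was chosen as best minus the percentage of times it was chosen as worst, which yields the −100 to +100 range mentioned above. The sketch below applies that estimate to pairwise judgements, where the unselected summary of a pair counts as worst; whether the study used exactly this computation is an assumption, and the judgements shown are made up for illustration.

```python
from collections import Counter


def best_worst_scores(judgements):
    """Counting-based Best-Worst Scaling scores in [-100, 100].

    judgements: list of (preferred_system, other_system) pairs, one per
    individual judge decision on a single criterion.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for preferred, other in judgements:
        best[preferred] += 1
        worst[other] += 1
        shown[preferred] += 1
        shown[other] += 1
    return {
        system: 100.0 * (best[system] - worst[system]) / shown[system]
        for system in shown
    }


# Hypothetical decisions, not data from the study.
toy_judgements = [("QT", "Copycat"), ("QT", "MeanSum"), ("Copycat", "MeanSum")]
print(best_worst_scores(toy_judgements))
```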

Results: Aspect-specific Summarization
There is no existing unsupervised system for aspect-specific opinion summarization. Instead, we use the power of BERT (Devlin et al., 2019) to enable aspect summarization for our baselines. Specifically, we obtain BERT sentence vectors (average of token vectors) for input sentences X_e, which we cluster via k-means. We then replicate the cluster-to-aspects mapping used by QT, as described in Equations (10)-(13): each cluster is mapped to exactly one aspect, according to the probability of finding the pre-defined aspect-denoting keywords in the sentences assigned to it. As a result, we obtain non-overlapping and aspect-specific sets of input sentences {X_e^(a_1), X_e^(a_2), ...}. For aspect a_i, we create aspect-filtered input reviews, by concatenating sentences in X_e^(a_i) based on the reviews they originated from. The filtered reviews of each aspect are given as input to general summarizers (LexRank, MeanSum and Copycat), thus producing aspect summaries. QT and all baselines use the same aspect keywords, which we sourced from a held-out set of reviews, not included in SPACE.
Table 5 shows results on SPACE, for individual aspects, and on average. QT outperforms baselines in all aspects, except building, with significant improvements against Copycat and MeanSum in terms of ROUGE and sentiment consistency. The abstractive methods struggle to generate summaries restricted to the aspect in question.
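The cluster-to-aspect step used for the baselines can be illustrated with a simplified stand-in. Equations (10)-(13) are not reproduced in this section, so the scoring rule below (the fraction of a cluster's sentences that contain at least one keyword of an aspect) is an assumption rather than the exact formulation, and the keywords and sentences are invented for the example.

```python
def map_clusters_to_aspects(cluster_sentences, aspect_keywords):
    """Assign every cluster to the single aspect with the highest keyword rate.

    cluster_sentences: dict mapping cluster id -> list of sentences.
    aspect_keywords: dict mapping aspect -> set of lowercase keywords.
    Ties (including all-zero rates) go to the aspect listed first.
    """
    mapping = {}
    for cluster_id, sentences in cluster_sentences.items():

        def keyword_rate(aspect):
            keywords = aspect_keywords[aspect]
            hits = sum(1 for s in sentences if keywords & set(s.lower().split()))
            return hits / len(sentences) if sentences else 0.0

        mapping[cluster_id] = max(aspect_keywords, key=keyword_rate)
    return mapping


def aspect_filtered_sentences(cluster_sentences, mapping):
    """Group sentences by the aspect of their cluster (non-overlapping sets)."""
    grouped = {}
    for cluster_id, sentences in cluster_sentences.items():
        grouped.setdefault(mapping[cluster_id], []).extend(sentences)
    return grouped


clusters = {
    0: ["The staff were friendly", "Helpful staff at reception"],
    1: ["Great breakfast buffet", "The restaurant food was tasty"],
}
keywords = {
    "service": {"staff", "reception"},
    "food": {"breakfast", "restaurant", "food"},
}
mapping = map_clusters_to_aspects(clusters, keywords)
print(mapping)
print(aspect_filtered_sentences(clusters, mapping))
```

Because each cluster receives exactly one aspect, the resulting sentence sets cannot overlap, which matches the property stated above.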
To verify this, we ran a second judgement elicitation study. We used summaries from competing aspect summarizers (QT_ASP, Copycat_ASP, MeanSum_ASP, and LexRank_ASP) for all six aspects, as well as QT's general summaries. A summary was shown to three participants, who were asked whether it discussed the specified aspect exclusively, partially, or not at all. Table 6 shows that 58.7% of QT aspect-specific summaries discuss the specified aspect exclusively, while only 8.7% of the summaries fail to mention the aspect. LexRank_ASP follows with 23.8% of its summaries failing to mention the aspect, while the abstractive models performed significantly worse.

Further Analysis
Training Efficiency Table 7 shows ROUGE-1 scores for QT and Copycat on SPACE, when trained on different portions of the training set (randomly downsampled and averaged over 5 runs). QT exhibits impressive data efficiency; when trained on 5% of data, it performs comparably to a Copycat summarizer that has been trained on the full corpus.

[...] sub-spaces. The aspect sub-space (shown in square) was detected automatically, as it displayed the lowest mean aspect entropy (darker color). Zooming into its latent codes uncovers reasonable aspect separation, an impressive result considering that the model received no aspect-specific supervision.
Human: All staff members were friendly, accommodating, and helpful. The hotel and room were very clean. The room had modern charm and was nicely remodeled. The beds are extremely comfortable. The rooms are quite with wonderful beach views. The food at Hash, the restaurant in lobby, was fabulous. The location is great, very close to the beach. It's a longish walk to Santa Monica. The price is very affordable.

QT: Great hotel. We liked our room with an ocean view. The staff were friendly and helpful. There was no balcony. The location is perfect. Our room was very quiet. I would definitely stay here again. You're one block from the beach. So it must be good! Filthy hallways. Unvacuumed room. Pricy, but well worth it.

MeanSum: It was a great stay! The food at the hotel is great for the price. I can't believe the noise from the street is very loud and the traffic is not so great, but that is not a problem. The restaurant was great and the food is excellent.

Copycat: This hotel is in a great location, just off the beach. The staff was very friendly and helpful. We had a room with a view of the beach and ocean. The only problem was that our room was on the 4th floor with a view of the ocean. If you are looking for a nice place to sleep then this is the place for you.

Table 8: Four general opinion summaries for the same hotel: One human-written and three from competing models.

Building: Bright colors, skateboards, butterfly chairs and a grand ocean/boardwalk view (always entertaining). There is a small balcony, but there's only a small glass divider between your neighbor's balcony.

Food: We had a great breakfast at Hash too! The restaurant was amazing. Lots of good restaurants within walking distance and some even deliver. The roof bar was the icing on the cake.

Location: The location is perfect. The hotel is very central. The hotel itself is in a great location. We hardly venture far as everything we need is within walking distance, but for the sightseers the buses are on the doorstep.

Cleanliness: Our room was very clean and comfortable. The room was clean and retrofitted with all the right amenities. Our room was very large, clean, and artfully decorated.

Rooms: The room was spacious and had really cool furnishings, and the beds were comfortable. The room's were good, and we had a free upgrade for one of them (for a Facebook 'like!) A+ for the bed and pillows.

Service: The staff is great. The staff were friendly and helpful. The hotel staff were friendly and provided us with great service. Each member of the staff was friendly and attentive. The staff excel and nothing is ever too much trouble.

Mean Aspect Entropy Figure 6 further illustrates the effectiveness of aspect entropy for detecting the head sub-space that best separates aspect-specific sentences. Each gray bar shows the mean aspect entropy for the codes produced by one of QT's eight heads. One of the heads (leftmost) exhibits much lower entropy, indicating a strong confidence for aspect membership within its latent codes. We confirm this enables better aspect summarization by generating aspect summaries using each head, and plotting the obtained ROUGE-1 scores.
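Selecting the aspect head by mean aspect entropy can be sketched as follows. The example assumes that, for every head, each latent code already comes with a probability distribution over aspects (how those distributions are estimated is defined elsewhere in the paper and not repeated here); the function names and toy numbers are illustrative.

```python
import math


def entropy(distribution):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def mean_aspect_entropy(code_distributions):
    """Average entropy over the aspect distributions of one head's codes."""
    return sum(entropy(d) for d in code_distributions) / len(code_distributions)


def select_aspect_head(heads):
    """Return (index of the head with the lowest mean aspect entropy, all scores).

    heads: one entry per head, each a list of per-code aspect distributions.
    """
    scores = [mean_aspect_entropy(codes) for codes in heads]
    best_head = min(range(len(scores)), key=scores.__getitem__)
    return best_head, scores


# Two toy heads with two codes each and two aspects.
toy_heads = [
    [[0.9, 0.1], [0.1, 0.9]],   # codes are confident about their aspect
    [[0.5, 0.5], [0.5, 0.5]],   # codes carry no aspect information
]
print(select_aspect_head(toy_heads))
```

A head whose codes each concentrate on one aspect gets a low score, which is the property used to pick the sub-space for aspect summarization.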
System Output Finally, we show gold-standard and system-generated general summaries in Table 8, as well as QT aspect summaries in Table 9.

Conclusions
We presented a novel opinion summarization system based on the Quantized Transformer that requires no reference summaries for training, and is able to extract general and aspect summaries from large groups of input reviews. QT is trained through sentence reconstruction and learns a rich encoding space, paired with a clustering component based on Vector-Quantized Variational Autoencoders. At summarization time, we exploit the characteristics of the quantized space to identify those clusters that correspond to the input's most popular opinions, and extract the sentences that best represent them. Moreover, we used the multi-head representations of the model, and no further training, to detect the encoding sub-space that best separates aspects, enabling aspect-specific summarization. We also collected SPACE, a new opinion summarization corpus which we hope will inform and inspire further research.
Experimental results on SPACE and popular benchmarks reveal that our system is able to produce informative summaries which cover all or individual aspects of an entity. In the future, we would like to utilize the QT framework in order to generate abstractive summaries. We could also exploit QT's multi-head semantics more directly, and further improve it through weak supervision or multi-task objectives. Finally, although we focused on opinion summarization, it would be interesting to see if the proposed model can be applied to other multi-document summarization tasks.