Abstract
Conventional topic models are ineffective for extracting topics from microblog messages, because the data sparseness of short messages, which lack structure and context, results in poor message-level word co-occurrence patterns. To address this issue, we organize microblog messages as conversation trees based on their reposting and replying relations, and propose an unsupervised model that jointly learns word distributions to represent: (1) different roles of conversational discourse, and (2) various latent topics reflecting content information. By explicitly distinguishing the probabilities with which messages of varying discourse roles contain topical words, our model is able to discover clusters of discourse words that are indicative of topical content. In an automatic evaluation on large-scale microblog corpora, our joint model yields topics with better coherence scores than competitive topic models from previous studies. Qualitative analysis of the model outputs indicates that our model induces meaningful representations for both discourse and topics. We further present an empirical study on microblog summarization based on the outputs of our joint model. The results show that the jointly modeled discourse and topic representations can effectively indicate summary-worthy content in microblog conversations.
1. Introduction
Over the past two decades, the Internet has been revolutionizing the way we communicate. Microblogging, a social networking channel over the Internet, further accelerates communication and information exchange. Popular microblog platforms, such as Twitter and Sina Weibo, have become important outlets for individuals to share information and voice opinions, which further benefits downstream applications such as instant detection of breaking events (Lin et al. 2010; Weng and Lee 2011; Peng et al. 2015), real-time and ad hoc search of microblog messages (Duan et al. 2010; Li et al. 2015b), understanding of public opinions and user behavior on societal issues (Pak and Paroubek 2010; Popescu and Pennacchiotti 2010; Kouloumpis, Wilson, and Moore 2011), and so forth.
However, the explosive growth of microblog data far outpaces human beings’ speed of reading and understanding. As a consequence, there is a pressing need for effective natural language processing (NLP) systems that can automatically identify gist information and make sense of the unmanageable amount of user-generated social media content (Farzindar and Inkpen 2015). As an important and fundamental text analytics approach, topic models extract key components embedded in microblog content by clustering words that describe similar semantic meanings to form latent “topics.” The derived intermediate topic representations have proven beneficial to many NLP applications for social media, such as summarization (Harabagiu and Hickl 2011), classification (Phan, Nguyen, and Horiguchi 2008; Zeng et al. 2018a), and recommendation on microblogs (Zeng et al. 2018b).
Conventionally, probabilistic topic models (e.g., probabilistic latent semantic analysis [Hofmann 1999] and latent Dirichlet allocation [Blei et al. 2003]) have achieved huge success over the past decade, owing to their fully unsupervised manner and ease of extension. The semantic structure discovered by these topic models has facilitated the progress of many research fields, for example, information retrieval (Boyd-Graber, Hu, and Mimno 2017), data mining (Lin et al. 2015), and NLP (Newman et al. 2010). Nevertheless, owing to their reliance on document-level word co-occurrence patterns, this progress is still limited to formal, conventional documents such as news reports (Blei, Ng, and Jordan 2003) and scientific articles (Rosen-Zvi et al. 2004). The aforementioned models work poorly when directly applied to short and colloquial texts (e.g., microblog posts) because of the severe sparsity exhibited in this text genre (Wang and McCallum 2006; Hong and Davison 2010).
Previous research has proposed several methods to deal with the sparsity issue in short texts. One common approach is to aggregate short messages into long pseudo-documents. Many studies heuristically aggregate messages based on authorship (Hong and Davison 2010; Zhao et al. 2011), shared words (Weng et al. 2010), or hashtags (Ramage, Dumais, and Liebling 2010; Mehrotra et al. 2013). Quan et al. (2015) propose self-aggregation-based topic modeling (SATM), which aggregates texts jointly with topic inference. Another popular solution is to take word relations into account to alleviate document-level word sparseness. The biterm topic model (BTM) directly models the generation of word-pair co-occurrence patterns in each individual message (Yan et al. 2013; Cheng et al. 2014). More recently, word embeddings trained on large-scale external data have been leveraged to capture word relations and improve topic models on short texts (Das, Zaheer, and Dyer 2015; Nguyen et al. 2015; Li et al. 2016a, 2017a; Shi et al. 2017; Xun et al. 2017).
To date, most efforts focus on the content of messages but ignore the rich discourse structure embedded in the ubiquitous user interactions on microblog platforms. On microblogs, which were originally built for user communication and interaction, conversations form freely around issues of interest through reposting messages and replying to others. When joining a conversation, users generally post topically related content, which naturally provides effective contextual information for topic discovery. Alvarez-Melis and Saveski (2016) have shown that simply aggregating messages by conversation can significantly boost the performance of conventional topic models and outperform models exploiting hashtag-based and user-based aggregation.
Another important issue ignored in most previous studies is the effective separation of topical words from non-topic ones (Li et al. 2016b). Owing to the colloquial nature of microblog content, non-topic words are common and usually mixed with topical words, including sentiment words (e.g., “great” and “ToT”), functional words (e.g., “doubt” and “why”), and interjections (e.g., “oh” and “oops”). The occurrence of non-topic words may distract a model from recognizing topical content, leading to the failure to produce coherent and meaningful topics. In this article, we propose a novel model that examines the entire context of a conversation and jointly explores word distributions representing varying types of topical content and discourse roles such as agreement, question-asking, argument, and other dialogue acts (Ritter, Cherry, and Dolan 2010). Though Ritter, Cherry, and Dolan (2010) separate discourse, topic, and other words for modeling conversations, their model focuses on dialogue act modeling and yields only one distribution for topical content; it is therefore unable to distinguish the varying latent topics reflecting message content underlying the corpus. Li et al. (2016b) leverage conversational discourse structure to detect topical words in microblog posts, explicitly exploring the probabilities of different discourse roles containing topical words. However, their approach depends on a pre-trained discourse tagger and requires a time-consuming and expensive manual annotation process for labeling conversational discourse roles on microblog messages, which does not scale to large data sets (Ritter, Cherry, and Dolan 2010; Joty, Carenini, and Lin 2011).
To exploit the discourse structure of microblog conversations, we link microblog posts using reposting and replying relations to build conversation trees. In particular, the root of a conversation tree is the original post, and its edges represent the reposting or replying relations. To illustrate the interplay between topic and discourse, Figure 1 displays a snippet of a Twitter conversation about “Trump administration’s immigration ban.” From the conversation, we can observe two major components: (1) discourse, indicated by the underlined words, describes the intention and pragmatic role of each message in the conversation structure, such as making a statement or asking a question; (2) topic, represented by the bold words, captures the topic and focus of the conversation, such as “racialism” and “Muslims.” As we can see, different discourse roles vary in their probability of containing key content reflecting the conversation focus. For example, in Figure 1, [R5] doubts the assertion that “immigration ban is good” and raises a new focus on “racialism”; it in fact contains more topic-related words than [R6], which simply reacts to its parent. For this reason, in this article, we attempt to identify messages with “good” discourse roles that tend to describe the key focuses and salient topics of a microblog conversation tree, which enables the discovery of “good” words reflecting coherent topics. Importantly, our joint model of conversational discourse and latent topics is fully unsupervised and therefore requires no manual annotation.
For evaluation, we conduct quantitative and qualitative analysis on large-scale Twitter and Sina Weibo corpora. Experimental results show that the topics induced by our model are more coherent than those from existing models. Qualitative analysis of the discourse output further shows that our model can yield meaningful clusters of words related to manually crafted discourse categories. In addition, we present an empirical study on the downstream application of microblog conversation summarization. Empirical results with ROUGE (Lin 2004) show that summaries produced based on our joint model contain more salient information than those of state-of-the-art summarization systems. Human evaluation also indicates that our output summaries are competitive with existing unsupervised summarization systems in terms of informativeness, conciseness, and readability.
In summary, our contributions in this article are threefold:
- Microblog posts organized as conversation trees for topic modeling. We propose the novel idea of representing microblog posts as conversation trees, connecting posts based on reposting and replying relations for topic modeling. The conversation tree structure enriches context, alleviates data sparseness, and in turn improves topic modeling.
- Exploiting discourse in conversations for topic modeling. Our model differentiates the generative process of topical and non-topic words according to the discourse role of the message from which a word is drawn. This helps the model identify topic-specific information against the non-topic background.
- Thorough empirical study of the inferred topic representations. Our model shows better results than competitive topic models when evaluated on large-scale, real-world microblog corpora. We also present an effective method for using our induced results in microblog conversation summarization.
2. Related Work
This article builds upon diverse streams of previous work on topic modeling, discourse analysis, and microblog summarization, which are briefly surveyed as follows.
2.1 Topic Models
Topic models aim to discover the latent semantic information, i.e., topics, from texts and have been extensively studied. This work is built upon the success of latent Dirichlet allocation (LDA) modeling (Blei et al. 2003; Blei, Ng, and Jordan 2003) and aims to learn topics in microblog messages. We first briefly introduce LDA in Section 2.1.1 and then review the related work on topic modeling for microblog content in Section 2.1.2.
2.1.1 LDA: Springboard of Topic Models.
Latent Dirichlet allocation (Blei, Ng, and Jordan 2003) is one of the most popular and well-known topic models. It uses Dirichlet priors to generate document–topic and topic–word distributions, and has been shown to be effective in extracting topics from conventional documents. LDA plays an important role in semantic representation learning and serves as the springboard of many famous topic models, e.g., hierarchical latent Dirichlet allocation (Blei et al. 2003), author–topic modeling (Rosen-Zvi et al. 2004), and so forth. Beyond topic modeling, it has also inspired unsupervised and weakly supervised discourse detection (Crook, Granell, and Pulman 2009; Ritter, Cherry, and Dolan 2010; Joty, Carenini, and Lin 2011). However, none of the aforementioned work jointly infers discourse and topics on microblog conversations, which is the gap the present article fills. Also, to the best of our knowledge, our work is the first attempt to exploit the joint effects of discourse and topic on unsupervised microblog conversation summarization.
2.1.2 Topic Models for Microblog Posts.
Previous research has demonstrated that standard topic models, which essentially rely on document-level word co-occurrences, are not suitable for short and informal microblog messages because of the severe data sparsity exhibited in short texts (Wang and McCallum 2006; Hong and Davison 2010). As a result, one line of previous work focuses on enriching and exploiting contextual information. Weng et al. (2010), Hong and Davison (2010), and Zhao et al. (2011) heuristically aggregate messages posted by the same user or sharing the same words before conventional topic models are applied. Such simple strategies, however, pose problems; for example, it is common for a user to have various interests and thus post messages covering a wide range of topics. Ramage, Dumais, and Liebling (2010) and Mehrotra et al. (2013) use hashtags as labels to train supervised topic models. Nevertheless, these models depend on large-scale hashtag-labeled data for training. Moreover, their performance is inevitably compromised when facing unseen topics irrelevant to any hashtag in the training data, a common situation given the rapid change and wide variety of topics on social media. BTM (Yan et al. 2013; Cheng et al. 2014) directly explores unordered word-pair co-occurrence patterns in each individual message, which is equivalent to extending a short document into the biterm set consisting of all combinations of any two distinct words appearing in the document. SATM (Quan et al. 2015) combines short text aggregation and topic induction in a unified model. However, in SATM, no prior knowledge is given to ensure the quality of text aggregation, which in turn affects the performance of topic inference.
Different from the aforementioned work, we organize microblog messages as conversation trees based on reposting and replying relations. This allows us to explore word co-occurrence patterns in richer contexts, as messages in one conversation generally focus on related topics. Even though researchers have started to take the context provided by conversations into account when discovering topics on microblogs (Alvarez-Melis and Saveski 2016; Li et al. 2016b), much less work jointly models topical words along with the discourse structure of conversations. Ritter, Cherry, and Dolan (2010) model dialogue acts in conversations by separating discourse words from topical words and others. Whereas their model produces only one word distribution to represent topical content, our model is capable of generating multiple discourse and topic word distributions. Another main difference is that our model explicitly explores the probabilities of messages with different discourse roles containing topical words for topic representation, whereas their model generates topical words from a conversation-specific word distribution regardless of the discourse roles of messages. The work of Li et al. (2016b) is another prior effort to leverage conversation structure, captured by a supervised discourse tagger, for topic induction. Different from them, our model learns the discourse structure of conversations in a fully unsupervised manner and does not require annotated data.
Another line of research tackles data sparseness by modeling word relations rather than word occurrences in documents. For example, recent work has shown that distributional similarities of words captured by word embeddings (Mikolov et al. 2013; Mikolov, Yih, and Zweig 2013) are useful for recognizing interpretable topic word clusters in short texts (Das, Zaheer, and Dyer 2015; Nguyen et al. 2015; Li et al. 2016a, 2017a; Shi et al. 2017; Xun et al. 2017). These topic models rely heavily on meaningful word embeddings, which must be trained on a large-scale, high-quality external corpus in both the same domain and the same language as the data for topic modeling (Bollegala, Maehara, and Kawarabayashi 2015). However, such an external resource is not always available; for example, to the best of our knowledge, there currently exists no high-quality word embedding corpus for Chinese social media. In contrast to these prior methods, our model has no prerequisite of an external resource, which ensures its general applicability in cold-start scenarios.
2.1.3 Topic Modeling and Summarization.
Previous studies have shown that the topic representations captured by topic models are useful for summarization (Nenkova and McKeown 2012). Specifically, existing summarization systems use topic models for two different purposes: (1) to separate summary-worthy content from non-content background (general information) (Daumé and Marcu 2006; Haghighi and Vanderwende 2009; Çelikyilmaz and Hakkani-Tür 2010), and (2) to cluster sentences or documents into topics, with summaries then generated from each topic cluster to minimize content redundancy (Salton et al. 1997; McKeown et al. 1999; Siddharthan, Nenkova, and McKeown 2004). Similar techniques have also been applied to summarize events or opinions on microblogs (Chakrabarti and Punera 2011; Long et al. 2011; Rosa et al. 2011; Duan et al. 2012; Meng et al. 2012; Shen et al. 2013).
Our downstream application on microblog summarization follows the research line of purpose (1), while additionally integrating the effects of discourse on key content identification, which has not been studied in prior work. It is also worth noting that following purpose (2) to cluster messages before summarization is beyond the scope of this work, because we focus on summarizing a single conversation tree, which involves a limited number of topics. We leave the potential of using our model to segment topics for multi-conversation summarization to future work.
2.2 Discourse Analysis
Discourse reflects the architecture of textual structure, defining the semantic or pragmatic relations among text units (e.g., clauses, sentences, paragraphs). Here we review prior work on single-document discourse analysis in Section 2.2.1, followed by a description of how discourse has been extended to represent conversation structures in Section 2.2.2.
2.2.1 Traditional View of Discourse.
It has long been pointed out that a coherent document, which gives readers continuity of senses (De Beaugrande and Dressler 1981), is not simply a collection of independent sentences. Linguists have devoted themselves to the study of discourse ever since ancient Greece (Bakker and Wakker 2009). Early work shaped the modern concept of discourse (Hovy and Maier 1995) by depicting connections between text units, revealing the structural art behind a coherent document.
Rhetorical structure theory (RST) (Mann and Thompson 1988) is one of the most influential discourse theories. Under its assumptions, a coherent document can be represented by text units at different levels (e.g., clauses, sentences, paragraphs) in a hierarchical tree structure. In particular, the minimal units in RST (i.e., the leaves of the tree structure) are sub-sentential clauses, called elementary discourse units. Adjacent units are linked by rhetorical relations, such as condition, comparison, and elaboration. Based on RST, early work used hand-coded rules for automatic discourse analysis (Marcu 2000; Thanh, Abeysinghe, and Huyck 2004). Later, thanks to the development of large-scale discourse corpora, such as the RST corpus (Carlson, Marcu, and Okurovsky 2001), the Graph Bank corpus (Wolf and Gibson 2005), and the Penn Discourse Treebank (Prasad et al. 2008), data-driven and learning-based discourse parsers became popular, exploiting various manually designed features (Soricut and Marcu 2003; Baldridge and Lascarides 2005; Fisher and Roark 2007; Lin, Kan, and Ng 2009; Subba and Eugenio 2009; Joty, Carenini, and Ng 2012; Feng and Hirst 2014) and representation learning (Ji and Eisenstein 2014; Li, Li, and Hovy 2014).
2.2.2 Discourse Analysis on Conversations.
Stolcke et al. (2000) provide one of the first studies of this problem, along with a general schema for understanding conversations through discourse analysis. Because of their complex structure and informal language style, discourse parsing on conversations remains a challenging problem (Perret et al. 2016). Most research focuses on the detection of dialogue acts (DAs), which Stolcke et al. (2000) define as the first-level conversational discourse structure. It is worth noting that a DA represents the shallow discourse role that captures the illocutionary meaning of an utterance (“statement,” “question,” “agreement,” etc.).
Automatic dialogue act taggers have conventionally been trained in a supervised way with predefined tag inventories and annotated data (Stolcke et al. 2000; Cohen, Carvalho, and Mitchell 2004; Bangalore, Fabbrizio, and Stent 2006). However, DA definitions are generally domain-specific and usually involve manual design by experts. Also, the data annotation process is slow and expensive, limiting the data available for training DA classifiers (Jurafsky, Shriberg, and Biasca 1997; Dhillon et al. 2004; Ritter, Cherry, and Dolan 2010; Joty, Carenini, and Lin 2011). These issues have become more pressing with the arrival of the Internet era, in which new domains of conversations and even new types of dialogue act tags have boomed (Ritter, Cherry, and Dolan 2010; Joty, Carenini, and Lin 2011).
For this reason, researchers have proposed unsupervised or weakly supervised dialogue act taggers that identify indicative discourse word clusters based on probabilistic graphical models (Crook, Granell, and Pulman 2009; Ritter, Cherry, and Dolan 2010; Joty, Carenini, and Lin 2011). Our discourse detection module is inspired by these previous models, in which discourse roles are represented by word distributions and recognized in an unsupervised manner. Different from this previous work, which focuses on discourse analysis itself, we explore the effects of the discourse structure of conversations on distinguishing the varying latent topics underlying the given collection, which has not been studied before. In addition, most existing unsupervised approaches to conversation modeling follow the hidden Markov model convention and induce discourse representations on conversation threads. Considering that most social media conversations form trees, because one post is likely to spark multiple replying or reposting messages, our model captures discourse roles in the tree structure, which enables richer contexts to be exploited. More details are provided in Section 3.1.
2.3 Microblog Summarization
Microblog summarization can be considered a special case of text summarization, which is conventionally defined as discovering essential content in the given document(s) and producing concise and informative summaries covering the important information (Radev, Hovy, and McKeown 2002). Summarization techniques can generally be categorized as extractive or abstractive (Das and Martins 2007). Extractive summarization captures and distills salient content, usually sentences, to form summaries. Abstractive summarization focuses on identifying key text units (e.g., words and phrases) and then generating grammatical summaries based on these units. Our summarization application falls into the category of extractive summarization.
Early work on microblog summarization attempts to apply conventional extractive summarization models directly—LexRank (Erkan and Radev 2004), the University of Michigan’s summarization system MEAD (Radev et al. 2004), TF-IDF (Inouye and Kalita 2011), integer linear programming (Liu, Liu, and Weng 2011; Takamura, Yokono, and Okumura 2011), graph learning (Sharifi, Hutton, and Kalita 2010), and so on. Later, researchers found that standard summarization models are not suitable for microblog posts because of the severe redundancy, noise, and sparsity problems exhibited in short and colloquial messages (Chang et al. 2013; Li et al. 2015a). To solve these problems, one common solution is to use social signals such as the user influence and retweet counts to help summarization (Duan et al. 2012; Liu et al. 2012; Chang et al. 2013). Different from the aforementioned studies, we do not include external features such as the social network structure, which ensures the general applicability of our approach when applied to domains without such information.
Discourse has been reported to be useful for microblog summarization. Zhang et al. (2013) and Li et al. (2015a) leverage dialogue acts to indicate summary-worthy messages. In conversation summarization for other domains (e.g., meetings, forums, and e-mails), it is also popular to leverage pre-detected discourse structure for summarization (Murray et al. 2006; McKeown, Shrestha, and Rambow 2007; Wang and Cardie 2013; Bhatia, Biyani, and Mitra 2014; Bokaei, Sameti, and Liu 2016). Oya and Carenini (2014) and Qin, Wang, and Kim (2017) address discourse tagging together with salient content discovery in e-mails and meetings, and show the usefulness of their relations for summarization. All the systems mentioned here require manually crafted tags and annotated data for discourse modeling. In our model, instead, the discourse structure is discovered in a fully unsupervised manner; it is represented by word distributions and can differ from any human-designed discourse inventory. The effects of such discourse representations on salient content identification have not been explored in previous work.
3. The Joint Model of Conversational Discourse and Latent Topics
We assume that the given corpus of microblog posts is organized as $C$ conversation trees based on reposting and replying relations. Each tree $c$ contains $M_c$ microblog messages, and each message $m$ has $N_{c,m}$ words from a vocabulary of size $V$. We separate three components underlying the given conversations, discourse, topic, and background, and use three types of word distributions to represent them.
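Concretely, organizing a corpus this way amounts to grouping messages by their repost/reply parent links. The following is a minimal sketch (the `id`/`parent_id` fields and the `build_conversation_trees` helper are illustrative assumptions, not part of the model itself):

```python
from collections import defaultdict

def build_conversation_trees(messages):
    """Group messages into conversation trees via repost/reply links.

    `messages` is a list of dicts with (hypothetical) fields 'id' and
    'parent_id', where 'parent_id' is None for an original post.
    Returns a dict mapping each root id to the ids of all messages on
    that conversation tree (root included).
    """
    children = defaultdict(list)
    roots = []
    for msg in messages:
        if msg['parent_id'] is None:
            roots.append(msg['id'])
        else:
            children[msg['parent_id']].append(msg['id'])

    trees = {}
    for root in roots:
        stack, members = [root], []
        while stack:                      # depth-first traversal
            node = stack.pop()
            members.append(node)
            stack.extend(children[node])
        trees[root] = members
    return trees
```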
At the corpus level, there are $K$ topics represented by word distributions $\phi^T_k \sim \mathrm{Dir}(\beta)$ ($k = 1, 2, \ldots, K$), and $\phi^D_d \sim \mathrm{Dir}(\beta)$ ($d = 1, 2, \ldots, D$) represents the $D$ discourse roles embedded in the corpus. In addition, we add a background word distribution $\phi^B \sim \mathrm{Dir}(\beta)$ to capture general information (e.g., common words) that indicates neither discourse nor topic. $\phi^T_k$, $\phi^D_d$, and $\phi^B$ are all $V$-dimensional multinomial word distributions over the vocabulary. For each conversation tree $c$, $\theta_c \sim \mathrm{Dir}(\alpha)$ models the mixture of topics, and any message $m$ on $c$ is assumed to contain a single topic $z_{c,m} \in \{1, 2, \ldots, K\}$.
3.1 Message-Level Modeling
For each message $m$ on conversation tree $c$, our model assigns two message-level multinomial variables: $d_{c,m}$, representing its discourse role, and $z_{c,m}$, reflecting its topic assignment; we define each in turn below.
Discourse Roles.
Our discourse detection is inspired by Ritter, Cherry, and Dolan (2010), which exploits the discourse dependencies derived from reposting and replying relations for assigning discourse roles. For example, a “doubt” message is likely to spark controversy and thus trigger another “doubt” (e.g., [R5] and [R8] in Figure 1). Assuming that the index of $m$’s parent is $pa(m)$, we use transition probabilities $\pi_d \sim \mathrm{Dir}(\gamma)$ ($d = 1, 2, \ldots, D$) to explicitly model the discourse dependency of $m$ on $pa(m)$. $\pi_d$ is a distribution over the $D$ discourse roles, and $\pi_{d,d'}$ denotes the probability of $m$ being assigned discourse role $d'$ given that the discourse role of $pa(m)$ is $d$. Specifically, $d_{c,m}$ (the discourse role of message $m$) is generated from the discourse transition distribution $\pi_{d_{c,pa(m)}}$, where $d_{c,pa(m)}$ is the discourse assignment of $pa(m)$. In particular, to create a unified generation story, we place a pseudo message emitting no words before the root of each conversation tree and assign the dummy discourse index $D + 1$ to it. $\pi_{D+1}$, the discourse transition from pseudo messages to roots, in fact models the probabilities of different discourse roles serving as conversation starters.
Topic Assignments.
Messages on one conversation tree focus on related topics. To exploit this intuition in topic assignment, the topic of each message $m$ on conversation tree $c$ (i.e., $z_{c,m}$) is sampled from the topic mixture $\theta_c$ of conversation tree $c$.
3.2 Word-Level Modeling
To distinguish the word distributions that separately capture discourse, topic, and background representations, we follow previous work in assigning each word a discrete and exact source reflecting one particular type of word representation (Daumé and Marcu 2006; Haghighi and Vanderwende 2009; Ritter, Cherry, and Dolan 2010). To this end, for each word $n$ in message $m$ of tree $c$, a ternary variable $x_{c,m,n} \in \{\mathrm{DISC}, \mathrm{TOPIC}, \mathrm{BACK}\}$ controls whether word $n$ is a discourse, topic, or background word. In doing so, words in the given collection are explicitly separated into three types, based on which the word distributions representing the discourse, topic, and background components are separated accordingly.
Discourse words (DISC) indicate the discourse role of a message; for example, in Figure 1, “How” and the question mark “?” reflect that [R1] should be assigned the discourse role of “question.” If $x_{c,m,n} = \mathrm{DISC}$ (i.e., $n$ is assigned as a discourse word), then word $w_{c,m,n}$ is generated from the discourse word distribution $\phi^D_{d_{c,m}}$, where $d_{c,m}$ is $m$’s discourse role.
Topic words (TOPIC) are the core topical words that describe the topics discussed in a conversation tree, such as “Muslim,” “order,” and “Trump” in Figure 1. When $x_{c,m,n} = \mathrm{TOPIC}$ (i.e., $n$ is assigned as a topic word), word $w_{c,m,n}$ is generated from the word distribution of the topic assigned to message $m$, i.e., $\phi^T_{z_{c,m}}$.
Background words (BACK) capture general words irrelevant to either discourse or topic, such as “those” and “of” in Figure 1. When word $n$ is assigned as a background word ($x_{c,m,n} = \mathrm{BACK}$), word $w_{c,m,n}$ is drawn from the background distribution $\phi^B$.
Switching Among Topic, Discourse, and Background.
We assume that messages with different discourse roles may show different distributions over the word types discourse, topic, and background. The ternary word type switcher $x_{c,m,n}$ is hence controlled by the discourse role of message $m$. Specifically, $x_{c,m,n}$ is drawn from the three-dimensional distribution $\tau_{d_{c,m}} \sim \mathrm{Dir}(\delta)$, which captures the probabilities of the three word types (DISC, TOPIC, BACK) appearing under discourse role $d_{c,m}$; that is, $x_{c,m,n} \sim \mathrm{Multi}(\tau_{d_{c,m}})$. For instance, a statement message (e.g., [R3] in Figure 1) may contain more content words for topic representation than a question to other users (e.g., [R1] in Figure 1). In particular, stop words and punctuation are forced to be labeled as discourse or background words. By explicitly distinguishing different types of words with the switcher $x_{c,m,n}$, we can thus separate the three types of word distributions that reflect discourse, topic, and background information.
3.3 Generative Process and Parameter Estimation
In summary, Figure 2 illustrates the graphical model of our generative process that jointly explores conversational discourse and latent topics. The detailed generative process for conversation tree $c$ is as follows:

- Draw the topic mixture of the conversation tree $\theta_c \sim \mathrm{Dir}(\alpha)$
- For message $m = 1$ to $M_c$:
  - Draw the discourse role $d_{c,m} \sim \mathrm{Multi}(\pi_{d_{c,pa(m)}})$
  - Draw the topic assignment $z_{c,m} \sim \mathrm{Multi}(\theta_c)$
  - For word $n = 1$ to $N_{c,m}$:
    - Draw the ternary word type switcher $x_{c,m,n} \sim \mathrm{Multi}(\tau_{d_{c,m}})$
    - If $x_{c,m,n} = \mathrm{DISC}$: draw $w_{c,m,n} \sim \mathrm{Multi}(\phi^D_{d_{c,m}})$
    - If $x_{c,m,n} = \mathrm{TOPIC}$: draw $w_{c,m,n} \sim \mathrm{Multi}(\phi^T_{z_{c,m}})$
    - If $x_{c,m,n} = \mathrm{BACK}$: draw $w_{c,m,n} \sim \mathrm{Multi}(\phi^B)$
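To make the generative story above concrete, the following is a minimal simulation sketch (toy dimensions; variable names mirror the notation in the text, while the `generate_tree` helper and its inputs are illustrative assumptions, not the inference code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D = 1000, 50, 10                  # toy vocabulary / topic / role sizes
alpha, beta, gamma, delta = 50 / K, 0.01, 0.5, 0.25

# Corpus-level draws: K topic and D discourse word distributions, one
# background distribution, discourse transitions (row D plays the role
# of the dummy pseudo-root index D+1 in the text), and per-role
# proportions over the word types (DISC, TOPIC, BACK).
phi_T = rng.dirichlet([beta] * V, size=K)
phi_D = rng.dirichlet([beta] * V, size=D)
phi_B = rng.dirichlet([beta] * V)
pi = rng.dirichlet([gamma] * D, size=D + 1)
tau = rng.dirichlet([delta] * 3, size=D)

def generate_tree(parents, n_words):
    """Simulate one conversation tree. parents[m] is the index of message
    m's parent (-1 for the root); parents must precede their children."""
    theta = rng.dirichlet([alpha] * K)      # tree-level topic mixture
    roles, topics, words = [], [], []
    for m, pa in enumerate(parents):
        prev = D if pa == -1 else roles[pa]  # pseudo-root transition row
        roles.append(rng.choice(D, p=pi[prev]))
        topics.append(rng.choice(K, p=theta))
        msg = []
        for _ in range(n_words[m]):
            x = rng.choice(3, p=tau[roles[m]])  # 0=DISC, 1=TOPIC, 2=BACK
            dist = (phi_D[roles[m]], phi_T[topics[m]], phi_B)[x]
            msg.append(rng.choice(V, p=dist))
        words.append(msg)
    return roles, topics, words

roles, topics, words = generate_tree(parents=[-1, 0, 0, 1],
                                     n_words=[8, 5, 6, 4])
```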
For parameter estimation, we use collapsed Gibbs sampling (Griffiths and Steyvers 2004) to carry out posterior inference. The hidden multinomial variables (i.e., the message-level variables $d$ and $z$ and the word-level variable $x$) are sampled in turn, conditioned on a complete assignment of all other hidden variables and the hyper-parameters $\Theta = (\alpha, \beta, \gamma, \delta)$. For more details, we refer the reader to Appendix A.
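As a schematic illustration of what the collapsed updates look like (the exact conditional distributions are derived in Appendix A), the update for the word-type switcher takes the usual count-ratio form; for example, for the TOPIC choice,

$$
p\big(x_{c,m,n} = \mathrm{TOPIC} \mid \cdot\big) \;\propto\; \frac{n_{d_{c,m}}^{\mathrm{TOPIC}} + \delta}{n_{d_{c,m}}^{(\cdot)} + 3\delta} \times \frac{n_{z_{c,m}}^{w_{c,m,n}} + \beta}{n_{z_{c,m}}^{(\cdot)} + V\beta},
$$

where $n_{d}^{\mathrm{TOPIC}}$ counts words assigned type TOPIC in messages with discourse role $d$, $n_{z}^{w}$ counts occurrences of word $w$ assigned to topic $z$, the superscript $(\cdot)$ denotes the corresponding marginal count, and all counts exclude the current token.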
4. Experiments on Topic Coherence
This section presents an experiment on the coherence of topics yielded by our joint model of conversational discourse and latent topics.
4.1 Data Collection and Experiment Set-up
Data Sets.
To examine the coherence of topics on diverse microblog data sets, we conduct experiments on data sets collected from two popular microblog Web sites: Twitter and Weibo, where the messages are mostly in English and Chinese, respectively. Table 1 shows the statistics of the five data sets used to evaluate topic coherence. In the following, we describe their collection processes in turn.
Table 1. Statistics of the five data sets used to evaluate topic coherence.

| Data Set | # of trees | # of messages | Vocab size |
|---|---|---|---|
| SemEval | 8,652 | 13,582 | 3,882 |
| PHEME | 7,961 | 92,883 | 10,288 |
| US Election | 4,396 | 33,960 | 5,113 |
| Weibo-1 | 9,959 | 91,268 | 11,849 |
| Weibo-2 | 21,923 | 277,931 | 19,843 |
For Twitter data, we evaluate the coherence of topics on three data sets: SemEval, PHEME, and US Election, and tune all models in our experiments on a large-scale development data set from the TREC 2011 microblog track.
- SemEval. We combine the data released for the topic-oriented sentiment analysis tasks in SemEval 2015 and 2016. To recover the missing ancestors in conversation trees, we use the Tweet Search API to retrieve messages with “in-reply-to” relations, collecting tweets recursively until all ancestors in a conversation are recovered.
- PHEME. This data set was released by Zubiaga, Liakata, and Procter (2016), and contains conversations around rumors and non-rumors posted during five breaking events: Charlie Hebdo, Ferguson, Germanwings Crash, Ottawa Shooting, and Sydney Siege.
- US Election. Considering that the SemEval and PHEME data sets cover a relatively wide range of topics, we are interested in studying a more challenging problem: whether topic models can differentiate latent topics within a narrow scope. To this end, we take political tweets as an example and conduct experiments on a data set of Twitter discussions about the 2016 U.S. presidential election. The data set extends the one released by Zeng et al. (2018b) in three steps. First, raw tweets that are likely to be in a conversation are collected by searching conversation-type keywords via the Twitter Streaming API, which samples and returns tweets matching the given keywords. Second, conversations are recovered via “in-reply-to” relations, as is done to build the SemEval data set. Third, we select the conversations in which at least one tweet contains election-related keywords.
For Weibo data, we track the real-time trending hashtags on Sina Weibo and use the hashtag-search API to crawl the posts matching the given hashtag queries. In the end, we build a large-scale corpus containing messages posted from 2 January 2014 to 31 July 2014. To examine the performance of the models on varying topic distributions, we split the corpus into seven subsets, each containing messages posted in one month. We report topic coherence on two randomly selected subsets, Weibo-1 and Weibo-2; the remaining five subsets are used as development sets.
Comparisons.
Our model jointly identifies word clusters of discourse and topics and explicitly explores their relations, namely, the probabilities of different discourse roles containing topical words (see Section 3.2); we call it the topic+disc+rel model in the rest of the article. In comparison, we consider the following established models: (1) LDA: We treat each message as a document and directly apply LDA (Blei et al. 2003; Blei, Ng, and Jordan 2003) to the collection, using the public toolkit GibbsLDA++. (2) BTM: BTM (Yan et al. 2013; Cheng et al. 2014) is a state-of-the-art topic model for short texts. It directly models the topics of all word pairs (biterms) in each message, and has proven more effective on social media texts than LDA (Blei et al. 2003; Blei, Ng, and Jordan 2003), the one-topic-per-post Dirichlet multinomial mixture (DMM) (Nigam et al. 2000), and Zhao et al. (2011) (a DMM variant on posts aggregated by authorship). According to the empirical study by Li et al. (2016b), BTM in general performs better than the newer SATM model (Quan et al. 2015) on microblog data.
In particular, this article attempts to induce topics with little external resource. Therefore, we compare with neither Li et al. (2016b), which depends on human annotation to train a discourse tagger, nor topic models that exploit word embeddings (Das, Zaheer, and Dyer 2015; Nguyen et al. 2015; Li et al. 2016a, 2017a; Shi et al. 2017; Xun et al. 2017) pre-trained on large-scale external data. The external data used to train embeddings should be in both the same domain and the same language as the collection given to the topic model, which limits the applicability of such models in scenarios without that data. Also, Li et al. (2016b) have shown that topic models combining word embeddings trained on internal data yield worse coherence scores than BTM, which is included in our comparison.
In addition to the existing models from previous work, we consider the following variants that explore topics by organizing messages as conversation trees:

- The topic only model aggregates the messages from one conversation tree into a pseudo-document, to which the model of Chemudugunta, Smyth, and Steyvers (2006) (proven better than LDA in topic coherence) is applied to induce topics, without modeling discourse structure. Like our topic+disc+rel model, it involves a background word distribution to capture non-topic words. However, different from topic+disc+rel, its background switcher is controlled by a general Beta prior without differentiating the discourse roles of messages.
- The topic+disc model is an extension of Ritter, Cherry, and Dolan (2010), in which the switcher indicating a word as a discourse, topic, or background word is drawn from a conversation-level distribution over word types. In topic+disc+rel, by contrast, the word-type switcher depends on message-level discourse roles (see Section 3.2). As for topic generation in topic+disc, because the model of Ritter, Cherry, and Dolan (2010) cannot differentiate various latent topics, we follow the same procedure as the topic only and topic+disc+rel models and draw topics from a conversation-level topic mixture. Another difference between topic+disc and Ritter, Cherry, and Dolan (2010) is that the discourse roles of topic+disc are explored in tree-structured conversations, whereas those in Ritter, Cherry, and Dolan (2010) are captured in the context of conversation threads (paths of the conversation tree).
Hyper-parameters.
For the hyper-parameters of our joint topic+disc+rel model, we fix $\alpha = 50/K$ and $\beta = 0.01$, following common practice in previous work (Yan et al. 2013; Cheng et al. 2014). For the Twitter corpora, we set the number of discourse roles to $D = 10$, following the setting in Ritter, Cherry, and Dolan (2010). Because no analogous prior settings exist for $\gamma$ (controlling the prior for the discourse role dependencies of child messages on their parents), $\delta$ (controlling the prior of the distributions over topic, discourse, and background words given varying discourse roles), or the discourse count $D$ on Chinese Weibo corpora, we tune them by grid search on the development sets, obtaining $\gamma = 0.5$, $\delta = 0.25$, and $D = 6$ for the Weibo data.
The hyper-parameters of LDA and BTM are set to the best values reported in their original papers. For the topic only and topic+disc models, the parameter settings are kept the same as for topic+disc+rel, because they are its variants, and the background switchers are parameterized by a symmetric Beta prior of 0.5, following the original setting of Chemudugunta, Smyth, and Steyvers (2006). We run Gibbs sampling for all models with 1,000 iterations to ensure convergence, following Zhao et al. (2011), Yan et al. (2013), and Cheng et al. (2014).
Preprocessing.
Before training topic models, we preprocess the data sets as follows. For the Twitter corpora, we (1) filter non-English messages; (2) replace links, mentions (i.e., @username), and hashtags with the generic tags “URL,” “MENTION,” and “HASHTAG”; (3) tokenize messages and annotate each word with a part-of-speech (POS) tag using the Tweet NLP toolkit (Gimpel et al. 2011; Owoputi et al. 2013); and (4) lowercase all letters. For the Weibo corpora, we (1) filter non-Chinese messages; and (2) use the FudanNLP toolkit (Qiu, Zhang, and Huang 2013) for word segmentation. Then, for each data set from Twitter or Sina Weibo, we generate a vocabulary and remove low-frequency words (i.e., words occurring fewer than five times).
For our topic+disc+rel model and its variants topic only and topic+disc, which consider the conversation structure, we only remove digits and retain stop words and punctuation because: (1) stop words and punctuation can be useful discourse indicators, such as the question mark “?” and “what” indicating “question” messages; (2) these models are equipped with a background distribution $\phi^B$ to absorb general information useless for indicating either discourse or topic, e.g., “do” and “it”; and (3) we forbid stop words and punctuation from being sampled as topical words by forcing their word type switcher $x \neq \mathrm{TOPIC}$ in word generation (see Section 3.2). For LDA and BTM, which cannot separate non-topic information, we filter out stop words and short messages with fewer than two words in preprocessing, following their common settings to ensure comparable performance.
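As an illustration of the Twitter-side pipeline, here is a minimal sketch (simple regular expressions stand in for the Tweet NLP tokenizer; the generic tags and the minimum frequency of five follow the text):

```python
import re
from collections import Counter

URL = re.compile(r'https?://\S+')
MENTION = re.compile(r'@\w+')
HASHTAG = re.compile(r'#\w+')

def normalize(message):
    """Lowercase, then replace links, mentions, and hashtags with the
    generic tags used in the text."""
    message = message.lower()
    message = URL.sub('URL', message)
    message = MENTION.sub('MENTION', message)
    return HASHTAG.sub('HASHTAG', message)

def build_vocabulary(messages, min_count=5):
    """Whitespace tokenization as a stand-in for the Tweet NLP toolkit;
    words occurring fewer than `min_count` times are removed."""
    counts = Counter(w for m in messages for w in normalize(m).split())
    return {w for w, c in counts.items() if c >= min_count}
```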
Evaluation Metrics.
Topic model evaluation is inherently difficult. Although perplexity is a popular metric in many previous studies for evaluating the predictive ability of topic models on held-out data with unseen words (Blei, Ng, and Jordan 2003), we do not consider it here because better perplexity does not necessarily indicate topics that humans perceive as semantically coherent (Chang et al. 2009).
The quality of topics is commonly measured by the UCI (Newman et al. 2010) and UMass (Mimno et al. 2011) coherence scores, both assuming that words representing a coherent topic are likely to co-occur within the same document. We only consider UMass coherence here, as UMass and UCI scores generally agree with each other (Stevens et al. 2012). We also consider a newer metric, the CV coherence measure (Röder, Both, and Hinneburg 2015), which has been shown to match human judgments more closely than other widely used coherence metrics, including the UCI and UMass scores. In brief, given the word list representing a topic (i.e., the top N words by topic–word distribution), the CV measure combines several established coherence measures and estimates how similar the words’ co-occurrence patterns with other words are in the context of a sliding window over Wikipedia.
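For reference, the UMass score of one topic’s top-N word list can be computed from document co-occurrence counts as below (a direct transcription of the Mimno et al. 2011 definition with the standard smoothing constant of 1; each top word is assumed to occur in at least one document):

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence of one topic (Mimno et al. 2011): the sum over
    word pairs of log((D(w_i, w_j) + 1) / D(w_j)), where D(.) counts
    the documents containing the given words and `top_words` is
    ordered by topic-word probability."""
    docsets = [set(doc) for doc in documents]
    def df(*ws):  # number of documents containing all words in ws
        return sum(1 for d in docsets if all(w in d for w in ws))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            # df(top_words[j]) > 0 since top words occur in the corpus
            score += math.log((df(top_words[i], top_words[j]) + 1)
                              / df(top_words[j]))
    return score

# e.g., umass_coherence(["immigration", "ban", "muslim"], tokenized_docs)
```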
4.2 Main Comparison Results
We evaluate the topic models with two settings of K (the number of topics), K = 50 and K = 100, following previous work (Li et al. 2016b). Tables 2 and 3 show the UMass and CV scores, respectively, for the topics produced on the evaluation corpora. For UMass coherence, the top 5, 10, 15, and 20 words of each topic are selected for evaluation; for CV coherence, the top 5 and 10 words are selected. Note that we cannot report CV scores on the Chinese Weibo corpora because CV coherence is calculated on an English Wikipedia data set, for which no Chinese counterpart is available. From the results, we make the following observations:
- Conventional topic models cannot perform well on microblog messages. Across all comparisons, the topic coherence given by LDA is the worst, which may be because of the sparseness of document-level word co-occurrence patterns in short posts.
- Considering conversation structure is useful for topic inference. Using the contextual information provided by conversations, topic only produces results competitive with the state-of-the-art BTM model for short text topic modeling. This observation indicates that using the conversation structure to enrich context is effective and results in latent topics of reasonably good quality.
- Jointly learning discourse information helps produce coherent topics. The topic+disc and topic+disc+rel models yield generally better coherence scores than topic only, which explores topics without considering discourse. The reason may be that additionally modeling discourse helps recognize non-topic words, which further facilitates the separation of topical words from non-topic ones.
- Considering the discourse roles of messages in topical word generation is useful. The results of topic+disc+rel are the best in most settings. One important reason is that topic+disc+rel explicitly explores the different probabilities with which messages of varying discourse roles contain topical or non-topic words, whereas the other models separate topical content from non-topic information regardless of the discourse roles of messages. This observation demonstrates that messages with different discourse roles do vary in their tendency to cover topical words, which provides useful clues for identifying key content words for topic representation.
Table 2. UMass coherence of the induced topics on the five evaluation corpora, for the top N words per topic and K topics.

| N | Model | Weibo-1 K=50 | Weibo-1 K=100 | Weibo-2 K=50 | Weibo-2 K=100 | SemEval K=50 | SemEval K=100 | PHEME K=50 | PHEME K=100 | US Election K=50 | US Election K=100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | *W/o conversation* | | | | | | | | | | |
| | LDA | −11.77 | −11.57 | −10.56 | −12.08 | −12.20 | −12.02 | −13.27 | −13.98 | −10.07 | −10.89 |
| | BTM | −9.56 | −8.74 | −8.65 | −9.88 | −9.93 | −9.61 | −10.22 | −10.44 | −12.15 | −12.15 |
| | *W/ conversation* | | | | | | | | | | |
| | topic only | −8.00 | −8.78 | −9.45 | −10.06 | −8.93 | −8.88 | −10.82 | −10.63 | −10.75 | −10.98 |
| | topic+disc | −9.47 | −8.87 | −9.85 | −9.60 | −8.42 | −8.26 | −10.54 | −10.40 | −10.36 | −11.17 |
| | topic+disc+rel | −8.53 | −8.66 | −8.00 | −9.84 | −8.47 | −8.19 | −10.21 | −10.41 | −11.14 | −10.75 |
| 10 | *W/o conversation* | | | | | | | | | | |
| | LDA | −120.06 | −123.74 | −117.00 | −123.98 | −128.15 | −132.96 | −138.99 | −145.44 | −105.21 | −110.82 |
| | BTM | −89.98 | −86.96 | −87.97 | −93.03 | −105.76 | −105.98 | −108.70 | −111.32 | −114.85 | −123.42 |
| | *W/ conversation* | | | | | | | | | | |
| | topic only | −90.53 | −89.89 | −108.51 | −101.20 | −89.02 | −90.53 | −105.62 | −108.25 | −104.29 | −108.51 |
| | topic+disc | −91.96 | −88.75 | −100.77 | −100.58 | −87.89 | −91.82 | −106.58 | −107.14 | −105.21 | −108.31 |
| | topic+disc+rel | −86.91 | −87.05 | −83.59 | −98.19 | −86.48 | −90.02 | −105.27 | −107.91 | −104.99 | −107.03 |
| 15 | *W/o conversation* | | | | | | | | | | |
| | LDA | −367.68 | −357.88 | −366.31 | −373.98 | −383.05 | −391.67 | −418.58 | −424.66 | −429.89 | −436.87 |
| | BTM | −265.80 | −262.62 | −281.06 | −281.46 | −307.23 | −323.37 | −328.36 | −339.94 | −344.99 | −360.95 |
| | *W/ conversation* | | | | | | | | | | |
| | topic only | −261.98 | −260.62 | −298.77 | −294.51 | −257.92 | −266.86 | −313.96 | −315.78 | −313.07 | −319.99 |
| | topic+disc | −261.30 | −259.23 | −301.99 | −293.21 | −261.25 | −265.88 | −313.22 | −320.05 | −317.14 | −317.82 |
| | topic+disc+rel | −254.94 | −256.47 | −249.32 | −287.82 | −256.83 | −265.71 | −312.49 | −319.01 | −312.90 | −315.59 |
| 20 | *W/o conversation* | | | | | | | | | | |
| | LDA | −771.34 | −736.55 | −718.48 | −741.77 | −777.00 | −782.51 | −856.37 | −859.59 | −898.77 | −892.84 |
| | BTM | −559.69 | −553.62 | −526.01 | −586.65 | −636.15 | −669.16 | −682.39 | −709.81 | −713.05 | −739.35 |
| | *W/ conversation* | | | | | | | | | | |
| | topic only | −528.13 | −527.71 | −602.16 | −597.80 | −529.39 | −541.31 | −643.91 | −647.74 | −634.97 | −638.10 |
| | topic+disc | −530.23 | −524.15 | −607.84 | −585.99 | −535.22 | −541.18 | −641.82 | −656.16 | −641.77 | −639.35 |
| | topic+disc+rel | −518.97 | −519.11 | −509.79 | −578.80 | −530.56 | −538.31 | −637.18 | −650.70 | −629.42 | −634.22 |
Table 3. CV coherence of the induced topics on the Twitter corpora, for the top N words per topic and K topics.

| N | Model | SemEval K=50 | SemEval K=100 | PHEME K=50 | PHEME K=100 | US Election K=50 | US Election K=100 |
|---|---|---|---|---|---|---|---|
| 5 | *W/o conversation* | | | | | | |
| | LDA | 0.514 | 0.498 | 0.474 | 0.470 | 0.473 | 0.470 |
| | BTM | 0.528 | 0.518 | 0.486 | 0.477 | 0.481 | 0.480 |
| | *W/ conversation* | | | | | | |
| | topic only | 0.526 | 0.521 | 0.492 | 0.485 | 0.477 | 0.475 |
| | topic+disc | 0.526 | 0.523 | 0.481 | 0.483 | 0.475 | 0.478 |
| | topic+disc+rel | 0.535 | 0.524 | 0.491 | 0.493 | 0.482 | 0.483 |
| 10 | *W/o conversation* | | | | | | |
| | LDA | 0.404 | 0.401 | 0.375 | 0.378 | 0.351 | 0.359 |
| | BTM | 0.412 | 0.406 | 0.386 | 0.385 | 0.354 | 0.363 |
| | *W/ conversation* | | | | | | |
| | topic only | 0.399 | 0.410 | 0.388 | 0.385 | 0.359 | 0.360 |
| | topic+disc | 0.408 | 0.410 | 0.388 | 0.386 | 0.356 | 0.364 |
| | topic+disc+rel | 0.414 | 0.410 | 0.398 | 0.386 | 0.366 | 0.366 |
4.3 Case Study
To further evaluate the interpretability of the latent topics and discourse roles learned by our topic+disc+rel model, we present a qualitative analysis on the output samples.
Sample Latent Topics.
We first present a qualitative study of the sample topics produced. Table 4 displays the top 15 words of the topic “Trump is a racist” induced by the different models on the US Election data set with K = 100. We have the following observations:
- It is challenging to extract coherent and meaningful topics from short and informal microblog messages. Without an effective strategy to alleviate the data sparsity problem, LDA mixes the generated topic with non-topic words, such as “direct,” “describe,” and “opinion,” which are also likely to appear in messages whose topics are very different from “Trump is a racist.”
- By aggregating messages based on conversations, topic only yields a topic competitive with the one produced by the state-of-the-art BTM model. The reason behind this observation could be that the conversation context provides rich word co-occurrence patterns for topic induction, which helps alleviate the data sparsity.
- The topics produced by topic+disc and topic+disc+rel contain fewer non-topic words than that of topic only, which does not consider discourse information when generating topics and thus contains many general words, such as “thing” and “work,” that cannot clearly indicate “Trump is a racist.”
- The topic generated by topic+disc+rel best describes “Trump is a racist,” except for the non-topic word “call” at the end of the list. This is because the model successfully discovers messages with discourse roles that are more likely to contain words describing the key focus of the conversations centering on “Trump is a racist.” Without capturing such information, the topic produced by topic+disc contains some non-topic words like “yeah” and “agree.”
Sample Discourse Roles.
To show the discourse representations exploited by our topic+disc+rel model, we present sample discourse roles learned from the PHEME data set in Table 5. Although this is merely a qualitative human judgment, there appear to be interesting word clusters that reflect varying discourse roles found by our model without guidance from any manual annotation of discourse. In the first column of Table 5, we intuitively name the sample discourse roles; these names are based on our interpretation of the word clusters and are provided to benefit the reader. We discuss each displayed discourse role in turn:
- •
Statement presents arguments and judgments, where words like “should,” and “need” are widely used in suggestions and “if” occurs when conditions are given.
- •
Reaction expresses non-argumentative opinions. Compared with “statement” messages, “reaction” messages are straightforward and generally do not contain detailed explanations (e.g., conditions). Examples include simple feeling expressions, indicated by “oh” and “!!!,” and acknowledgements, indicated by “thank” or “thanks.”
- •
Question represents users asking questions to other users, implied by the question mark “?,” “what,” “why,” and so forth.
- •
Doubt expresses strong opinions against something. Examples of indicative words include “but,” “don’t,” “just,” the question mark “?,” and so on.
- •
Reference is for quoting external resources, implied by words like “from,” the colon, and quotation marks. The use of hashtags23 and URLs is also prominent.
Statement | MENTION . the they are HASHTAG we , to and of in them all their ! be will these our who & should do this for if us need have |
Reaction | MENTION ! . you URL HASHTAG for your thank this , on my i thanks so … and a !! the are me please oh all very !!! - is |
Question | MENTION ? you the what are is do HASHTAG why they that how this a to did about who in he so or u was it know can does on |
Doubt | MENTION . you i , your a are to don’t it but that if know u not me i’m and do have my think ? you’re just about was it’s |
Reference | : URL MENTION HASHTAG in “ ” . at , the of on a - has from is to after and are ” been have as more for least 2 |
5. Downstream Application: Conversation Summarization on Microblogs
Section 4 has shown that conversational discourse helps recognize key topical information in short and informal microblog messages. We are interested in whether the induced topic and discourse representations can also benefit downstream applications. Here we take microblog summarization as an example: similar to topic modeling on short texts, it suffers from the data sparsity problem (Chang et al. 2013; Li et al. 2015a). In this article, we focus on a subtask of microblog summarization, namely, microblog conversation summarization, and present an empirical study to show how our output can be used to predict critical content in conversations.
We first describe the task. Given a conversation tree, a succinct summary should be produced by extracting salient content from the massive reposting and replying messages in the conversation; such a summary helps users understand the key focus of the conversation. The task is also called microblog context summarization in some previous work (Chang et al. 2013; Li et al. 2015a), because the produced summaries capture informative content in the lengthy conversations and provide valuable context for a short post, such as background information and public opinions. The input is a microblog conversation tree, such as the one shown in Figure 1, and the output is a subset of the replying or reposting messages covering the salient content of the input post.
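For concreteness, the following is a minimal sketch of this input/output contract; it is ours and purely illustrative, and all names (Message, ConversationTree, extract_summary) are hypothetical rather than the interface of any toolkit used in this work:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Message:
    """A node in a microblog conversation tree."""
    mid: str                       # message id
    text: str                      # message content
    parent: Optional[str] = None   # id of the replied-to / reposted message
    children: List[str] = field(default_factory=list)

@dataclass
class ConversationTree:
    root: str                      # id of the original post
    nodes: Dict[str, Message]      # mid -> Message

def extract_summary(tree: ConversationTree,
                    score: Callable[[Message], float],
                    budget: int) -> List[Message]:
    """Greedily keep the highest-scoring replies/reposts until a character
    budget is exhausted; the root post itself is excluded."""
    candidates = [m for mid, m in tree.nodes.items() if mid != tree.root]
    candidates.sort(key=score, reverse=True)
    summary, used = [], 0
    for m in candidates:
        if used + len(m.text) <= budget:
            summary.append(m)
            used += len(m.text)
    return summary
```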
5.1 Data Collection and Experiment Set-up
This section presents the data preparation and set-up of our empirical study, which applies the outputs of our joint model to microblog conversation summarization.
Data Sets.
Our experiments are conducted on a large-scale corpus containing ten large conversation trees collected from Sina Weibo, which is released by our prior work (Li et al. 2015a) and constructed following the settings described in Chang et al. (2013). The conversation trees discuss hot events taking place between 2 January and 28 July 2014, and are crawled using the PKUVIS toolkit (Ren et al. 2014). Detailed descriptions of the ten conversation trees are shown in Table 6. As can be observed, the trees contain more than 12K messages on average and cover discussions about social issues, breaking news, jokes, celebrity scandals, love, and fashion, which matches the official list of typical categories for microblog posts released by Sina Weibo.24 For each conversation tree, three experienced editors are invited to write summaries. Based on their manual summaries, we conduct the ROUGE evaluation shown in Section 5.2.
# of messages | Height | Description
---|---|---
21,353 | 16 | HKU dropping out student wins the college entrance exam again. |
9,616 | 11 | German boy complains hard schoolwork in Chinese High School. |
13,087 | 8 | Movie Tiny Times 1.0 wins high grossing in criticism. |
12,865 | 8 | “I am A Singer” states that singer G.E.M asking for resinging conforms to rules. |
10,666 | 8 | Crystal Huang clarified the rumor of her derailment. |
21,127 | 11 | Germany routs Brazil 7:1 in World-Cup semi-final. |
18,974 | 13 | The pretty girl pregnant with a second baby graduated with her master’s degree. |
2,021 | 18 | Girls appealed for equality between men and women in college admission. |
9,230 | 14 | Violent terrorist attack in Kunming railway station. |
10,052 | 25 | MH17 crash killed many top HIV researchers. |
Compared with corpora for many other tasks in the NLP and information retrieval communities, this corpus looks relatively small. However, to the best of our knowledge, it is currently the only publicly available data set for conversation summarization.25 Writing summaries for conversation trees is difficult and time-consuming for human editors because of the substantial number of nodes and the complex structure involved (Chang et al. 2013); in fact, it would be impossible for human editors to reconstruct conversation trees in a reasonable amount of time. In the evaluation, for each tree we compute the average ROUGE F1 score between the model-generated summary and the three human-generated summaries.
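The actual scores in Section 5.2 are produced by the ROUGE 1.5.5 toolkit; the simplified ROUGE-1 sketch below (ours, with hypothetical function names) only illustrates how the F1 score is averaged over the three references:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between two token lists."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    prec, rec = overlap / len(candidate), overlap / len(reference)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def average_rouge1_f1(candidate, references):
    """Average the per-reference F1 over the three human summaries."""
    return sum(rouge1_f1(candidate, r) for r in references) / len(references)
```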
Summary Extraction.
Comparisons.
We consider baselines that rank and select messages by (1) length; (2) popularity (# of reposts and replies); (3) user influence (# of authors’ followers); and (4) message–message text similarities using LexRank (Erkan and Radev 2004). We also consider two state-of-the-art summarizers in comparison: (1) Chang et al. (2013), a fully supervised summarizer with manually crafted features; and (2) Li et al. (2015a), a random walk variant summarizer incorporating outputs of supervised discourse tagger. In addition, we compare the summaries extracted based on the topics yielded by our topic+disc+rel model with those based on the outputs of its variants (i.e., topic only and topic+disc).
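The first three baselines share one template: score every message by a single scalar feature and keep the top-ranked ones. A minimal sketch (ours), reusing the Message structure above and assuming hypothetical metadata fields (n_reposts_replies, n_followers) standing in for the popularity and user-influence statistics:

```python
def rank_by_feature(messages, feature, top_n=10):
    """Generic ranker behind the Length / Popularity / User baselines."""
    return sorted(messages, key=feature, reverse=True)[:top_n]

def length_baseline(messages):
    return rank_by_feature(messages, lambda m: len(m.text))

def popularity_baseline(messages):   # n_reposts_replies is a hypothetical field
    return rank_by_feature(messages, lambda m: m.n_reposts_replies)

def user_baseline(messages):         # n_followers is a hypothetical field
    return rank_by_feature(messages, lambda m: m.n_followers)
```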
Preprocessing.
For baselines and the two state-of-the-art summarizers, we filter out non-Chinese characters in a preprocessing step following their common settings.27 For summarization systems based on our topic model variants (i.e., topic only, topic+disc, and topic+disc+rel), the hyper-parameters and preprocessing steps are kept the same as in Section 4.1.
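As an illustration of this filtering step, a single regular expression suffices, assuming (our assumption) that “non-Chinese characters” means everything outside the basic CJK Unified Ideographs block:

```python
import re

def keep_chinese(text: str) -> str:
    """Drop every character outside U+4E00-U+9FFF."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)

print(keep_chinese("Movie《小时代》wins high grossing!"))  # -> 小时代
```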
5.2 ROUGE Comparison
We quantitatively evaluate the summarizers with ROUGE (Lin 2004), a widely used standard for automatic summarization evaluation based on the overlapping units between a produced summary and a gold-standard reference. Specifically, Table 7 reports the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4 scores output by ROUGE 1.5.5.28 From the results, we can observe that:
- •
Simple features are not effective for summarization. The poor performance of all baselines demonstrates that microblog summarization is a challenging task. It is not possible to trivially rely on simple features such as length, message popularity, user influence, or text similarities to identify summary-worthy messages because of the colloquialness, noise, and redundancy exhibited in microblog texts.
- •
Discourse can indicate summary-worthy content. The summarization system based on the topic+disc+rel model achieves generally better ROUGE scores than the topic+disc based system. It also yields results competitive with, and even slightly better than, Li et al. (2015a), which relies on a supervised discourse tagger. These observations demonstrate that topic+disc+rel, without requiring gold-standard discourse annotation, is able to discover the discourse roles that are likely to convey topical words, which in turn reflect salient content for conversation summarization.
- •
Directly applying the outputs of our joint model of discourse and topics to summarization might not be perfect. In general, the topic+disc+rel based system achieves the best F1 scores in the ROUGE comparison, which implies that the yielded discourse and topic representations can to some extent indicate summary-worthy content, although large-margin improvements are not observed. In Section 5.4, we analyze the errors and present a potential solution for further improving the summarization results.
Models | Len | ROUGE-1 Prec | ROUGE-1 Rec | ROUGE-1 F1 | ROUGE-2 Prec | ROUGE-2 Rec | ROUGE-2 F1
---|---|---|---|---|---|---|---
Baselines | |||||||
Length | 95.4 | 19.6 | 53.2 | 28.1 | 5.1 | 14.3 | 7.3 |
Popularity | 27.2 | 33.8 | 25.3 | 27.9 | 8.6 | 6.1 | 6.8 |
User | 37.6 | 32.2 | 34.2 | 32.5 | 8.0 | 8.9 | 8.2 |
LexRank | 25.7 | 35.3 | 22.2 | 25.8 | 11.7 | 6.9 | 8.3 |
State-of-the-art | |||||||
Chang et al. (2013) | 68.6 | 25.4 | 48.3 | 32.8 | 7.0 | 13.4 | 9.1 |
Li et al. (2015a) | 58.6 | 27.3 | 45.4 | 33.7 | 7.6 | 12.6 | 9.3 |
Our models | |||||||
topic only | 48.6 | 30.4 | 40.4 | 33.6 | 9.2 | 12.0 | 10.0 |
topic+disc | 37.8 | 38.1 | 35.5 | 33.1 | 13.2 | 11.5 | 10.8 |
topic+disc+rel | 48.9 | 32.3 | 41.3 | 34.0 | 10.3 | 12.5 | 10.5 |
Models | Len | ROUGE-L Prec | ROUGE-L Rec | ROUGE-L F1 | ROUGE-SU4 Prec | ROUGE-SU4 Rec | ROUGE-SU4 F1
---|---|---|---|---|---|---|---
Baselines | |||||||
Length | 95.4 | 16.4 | 44.4 | 23.4 | 6.2 | 17.2 | 8.9 |
Popularity | 27.2 | 28.6 | 21.3 | 23.6 | 10.4 | 7.6 | 8.4 |
User | 37.6 | 28.0 | 29.6 | 28.2 | 9.8 | 10.6 | 10.0 |
LexRank | 25.7 | 30.6 | 18.8 | 22.1 | 12.3 | 7.5 | 8.8 |
State-of-the-art | |||||||
Chang et al. (2013) | 68.6 | 21.6 | 41.1 | 27.9 | 8.3 | 16.0 | 10.8 |
Li et al. (2015a) | 58.6 | 23.3 | 38.6 | 28.7 | 8.8 | 14.7 | 10.9 |
Our models | |||||||
topic only | 48.6 | 26.3 | 34.9 | 29.0 | 10.2 | 13.8 | 11.3 |
topic+disc | 37.8 | 33.3 | 30.7 | 28.6 | 13.3 | 12.2 | 11.3 |
topic+disc+rel | 48.9 | 28.0 | 35.4 | 29.3 | 10.9 | 14.0 | 11.5 |
5.3 Human Evaluation Results
To further evaluate the generated summaries, we conduct human evaluations on the informativeness (Info), conciseness (Conc), and readability (Read) of the extracted summaries. Two native Chinese speakers are invited to read the output summaries and subjectively rate them on a 1–5 Likert scale in units of 0.5, where a higher rating indicates better quality. Their overall inter-rater agreement reaches a Krippendorff's α of 0.73, which indicates reliable results (Krippendorff 2004). Table 8 shows the ratings averaged over the two raters and the ten conversation trees.
Models | Info | Conc | Read
---|---|---|---
Baselines | |||
Length | 2.33 | 2.93 | 2.28 |
Popularity | 2.38 | 2.35 | 3.05 |
User | 3.13 | 3.10 | 3.75 |
LexRank | 3.05 | 2.70 | 3.03 |
State-of-the-art | |||
Chang et al. (2013) | 3.43 | 3.50 | 3.70 |
Li et al. (2015a) | 3.70 | 3.90 | 4.15 |
Our models | |||
topic only | 3.33 | 3.03 | 3.35 |
topic+disc | 3.25 | 3.15 | 3.55 |
topic+disc+rel | 3.35 | 3.28 | 3.73 |
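For reference, an agreement value like the α = 0.73 above can be computed with the third-party krippendorff package; the ratings below are toy values for illustration, not the study's actual ratings:

```python
import krippendorff  # third-party package: pip install krippendorff

# Rows are raters, columns are rated summaries; 1-5 scale in 0.5 units.
rater_a = [3.5, 4.0, 2.5, 3.0, 4.5]
rater_b = [3.0, 4.0, 3.0, 3.5, 4.5]
alpha = krippendorff.alpha(reliability_data=[rater_a, rater_b],
                           level_of_measurement="interval")
print(f"Krippendorff's alpha = {alpha:.2f}")
```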
As can be seen, despite the close results produced by supervised systems and the best-performing unsupervised systems in the automatic ROUGE evaluation (Section 5.2), when the outputs are judged by humans, the supervised systems of Chang et al. (2013) and Li et al. (2015a), with supervision on summarization and discourse respectively, achieve much higher ratings than the unsupervised systems on all three criteria. This observation demonstrates that microblog conversation summarization is essentially challenging, and that manual annotations (although costly in time and effort) provide useful clues for guiding systems to produce summaries that humans favor. In particular, the ratings given to Li et al. (2015a) are higher than those of all comparison systems by large margins, which indicates that human-annotated discourse can well indicate summary-worthy content and further confirms the usefulness of considering discourse in microblog conversation summarization.
Among the unsupervised methods, the summarization results based on our topic+disc+rel model achieve generally better ratings than those of the other methods in comparison. The possible reasons are: (1) by separating topic words from discourse and background words, the model filters out irrelevant noise and distills important content; (2) it can exploit the tendencies of messages with varying discourse roles to contain core content, and is thus able to identify “bad” discourse roles that introduce redundant or irrelevant words and disturb the reading experience.
To further analyze the generated summaries, we conduct a case study and display in Table 9 a sample summary generated based on the topic+disc+rel model. In this case, the input conversation concerns sexism in Chinese college admission. As we can see, the produced summary covers salient comments that help in understanding public opinions toward the gender discrimination problem. However, taking a closer look, we observe that the system selects messages containing sentiment-only information, such as “Good point! I have to repost this!,” which degrades the quality of the generated summary. This case suggests that, in addition to discourse and background, a sentiment component should be captured and well separated to further improve the summarization results. A potential extension of the current summarization system that additionally incorporates sentiment is discussed in Section 5.4.
5.4 Error Analysis and Further Discussions
Taking a closer look at the produced summaries, we find that one major source of incorrectly selected messages is the prevalence of sentiment in microblog conversations, such as “love” in [R5] and “poor” in [R6] of Figure 1. Without an additional separation of sentiment-specific information, the yielded topic representations may be mixed with sentiment components. For example, in Table 4, the topic generated by the topic+disc+rel model contains sentiment words like “wrong” and “hate.” Therefore, directly using the topic representations to extract summaries unavoidably selects messages that mostly reflect sentiment, as also illustrated by the case study in Section 5.3.
We therefore argue that reliably estimating summary-worthy content in microblog conversations requires the additional consideration of sentiment. Sentiment can also be represented by word distributions and captured via topic models in an unsupervised or weakly supervised manner (Lin and He 2009; Jo and Oh 2011; Lim and Buntine 2014), so in future work one could extend our joint model topic+disc+rel to additionally separate sentiment word representations from discourse and topics. Lazaridou, Titov, and Sporleder (2013) have demonstrated that sentiment shifts can indicate sentence-level discourse functions in product reviews; we thus hypothesize that modeling the discourse roles of messages can also benefit from exploring sentiment shifts in conversations. As thoroughly exploring the joint effects of topic, discourse, and sentiment on microblog conversation summarization is beyond the scope of this article, we leave such an extended model to future work.
6. Conclusion and Future Work
In this article, we have presented a novel topic model for microblog messages that jointly induces conversational discourse and latent topics in a fully unsupervised manner. By comparing our joint model with a number of competitive topic models on real-world microblog data sets, we have demonstrated the effectiveness of using conversational discourse structure to identify topical content embedded in short and colloquial microblog messages. Moreover, our empirical study on microblog conversation summarization has shown that the produced discourse and topic representations can also predict summary-worthy content. Both the ROUGE evaluation and the human assessment have demonstrated that the summaries generated based on the outputs of our joint model are informative, concise, and easy to read. Error analysis on the produced summaries has shown that sentiment should be effectively captured and separated to further advance our current summarization system. The joint effects of discourse, topics, and sentiment on microblog conversation summarization are therefore worth exploring in future study.
For other lines of future work, one potential direction is to extend our joint model to identify topic hierarchies from microblog conversation trees; in doing so, one could learn how topics change along a hierarchical path of a microblog conversation. Another potential direction is to combine our work with representation learning on social media. Although some previous studies have provided intriguing approaches to learning representations at the level of words (Mikolov et al. 2013; Mikolov, Yih, and Zweig 2013), sentences (Le and Mikolov 2014), and paragraphs (Kiros et al. 2015), they are limited in modeling social media content with colloquial relations. Following the ideas in this work, where discourse and topics are jointly explored, one can conduct other types of representation learning, such as embeddings for words (Li et al. 2017b), messages (Dhingra et al. 2016), or users (Ding, Bickel, and Pan 2017), in the context of conversations, which should complement social media representation learning and vice versa.
Appendix A
In this appendix, we present the key steps for inferring our joint model of conversational discourse and latent topics, whose generative process has been described in Section 3. We use collapsed Gibbs sampling (Griffiths et al. 2004) for model inference. Before providing the sampling formulas, we first define in Table A.1 the notation for all variables used in the Gibbs sampling derivations. In particular, the various ℂ variables refer to counts excluding the message m on conversation tree c.
x | Word-level word type switcher. x = 1: discourse word (DISC); x = 2: topic word (TOPIC); x = 3: background word (BACK).
$\mathbb{C}^{D}_{d,x}$ | # of words with word type x occurring in messages with discourse d.
$\mathbb{C}^{D}_{d,\cdot}$ | # of words occurring in messages whose discourse assignments are d, i.e., $\mathbb{C}^{D}_{d,\cdot}=\sum_{x}\mathbb{C}^{D}_{d,x}$.
$\mathbb{C}^{M}_{(c,m),x}$ | # of words occurring in message (c, m) with word type assignment x.
$\mathbb{C}^{M}_{(c,m),\cdot}$ | # of words in message (c, m), i.e., $\mathbb{C}^{M}_{(c,m),\cdot}=\sum_{x}\mathbb{C}^{M}_{(c,m),x}$.
$\mathbb{C}^{DW}_{d,v}$ | # of words indexing v in the vocabulary, assigned as discourse words (DISC), and occurring in messages assigned discourse d.
$\mathbb{C}^{DW}_{d,\cdot}$ | # of words assigned as discourse words (DISC) and occurring in messages assigned discourse d, i.e., $\mathbb{C}^{DW}_{d,\cdot}=\sum_{v}\mathbb{C}^{DW}_{d,v}$.
$\mathbb{C}^{TW}_{(c,m),v}$ | # of words indexing v in the vocabulary that occur in message (c, m) and are assigned as topic words (TOPIC).
$\mathbb{C}^{TW}_{(c,m),\cdot}$ | # of words assigned as topic words (TOPIC) and occurring in message (c, m), i.e., $\mathbb{C}^{TW}_{(c,m),\cdot}=\sum_{v}\mathbb{C}^{TW}_{(c,m),v}$.
$\mathbb{C}^{DW}_{(c,m),v}$ | # of words indexing v in the vocabulary that occur in message (c, m) and are assigned as discourse words (DISC).
$\mathbb{C}^{DW}_{(c,m),\cdot}$ | # of words assigned as discourse words (DISC) and occurring in message (c, m), i.e., $\mathbb{C}^{DW}_{(c,m),\cdot}=\sum_{v}\mathbb{C}^{DW}_{(c,m),v}$.
$\mathbb{C}^{P}_{d,d'}$ | # of messages assigned discourse d′ whose parent is assigned discourse d.
$\mathbb{C}^{P}_{d,\cdot}$ | # of messages whose parents are assigned discourse d, i.e., $\mathbb{C}^{P}_{d,\cdot}=\sum_{d'}\mathbb{C}^{P}_{d,d'}$.
I(⋅) | An indicator function whose value is 1 when its argument is true, and 0 otherwise.
$\mathbb{C}^{S}_{(c,m),d}$ | # of messages whose parent is (c, m) and that are assigned discourse d.
$\mathbb{C}^{S}_{(c,m),\cdot}$ | # of messages whose parent is (c, m), i.e., $\mathbb{C}^{S}_{(c,m),\cdot}=\sum_{d}\mathbb{C}^{S}_{(c,m),d}$.
$\mathbb{C}^{BW}_{v}$ | # of words indexing v in the vocabulary and assigned as background words (BACK).
$\mathbb{C}^{BW}_{\cdot}$ | # of words assigned as background words (BACK), i.e., $\mathbb{C}^{BW}_{\cdot}=\sum_{v}\mathbb{C}^{BW}_{v}$.
$\mathbb{C}^{K}_{c,k}$ | # of messages on conversation tree c assigned topic k.
$\mathbb{C}^{K}_{c,\cdot}$ | # of messages on conversation tree c, i.e., $\mathbb{C}^{K}_{c,\cdot}=\sum_{k}\mathbb{C}^{K}_{c,k}$.
$\mathbb{C}^{KW}_{k,v}$ | # of words indexing v in the vocabulary, assigned as topic words (TOPIC), and occurring in messages assigned topic k.
$\mathbb{C}^{KW}_{k,\cdot}$ | # of words assigned as topic words (TOPIC) and occurring in messages assigned topic k, i.e., $\mathbb{C}^{KW}_{k,\cdot}=\sum_{v}\mathbb{C}^{KW}_{k,v}$.
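The “counts excluding message m” convention corresponds to the standard decrement–sample–increment loop of collapsed Gibbs sampling. The sketch below (ours) shows only this bookkeeping; the Counts container is a toy stand-in for the count tables above, and conditional is a placeholder for the model-specific full conditional, not our actual implementation:

```python
import random
from collections import defaultdict

class Counts:
    """Toy stand-in for the count tables of Table A.1; only per-discourse
    word counts are tracked here, enough to show the update pattern."""
    def __init__(self, n_discourse):
        self.n_discourse = n_discourse
        self.disc_word = defaultdict(int)   # (d, word) -> count

    def remove(self, msg, d):
        for w in msg.words:
            self.disc_word[(d, w)] -= 1

    def add(self, msg, d):
        for w in msg.words:
            self.disc_word[(d, w)] += 1

def resample_discourse(msg, counts, conditional):
    """One collapsed Gibbs step for the discourse role of message (c, m)."""
    counts.remove(msg, msg.discourse)   # exclude (c, m): the "excluding m" convention
    weights = [conditional(d, msg, counts)   # unnormalized full conditional per role
               for d in range(counts.n_discourse)]
    msg.discourse = random.choices(range(counts.n_discourse), weights=weights)[0]
    counts.add(msg, msg.discourse)      # add (c, m) back under its new assignment
```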
Acknowledgments
This work is partially supported by Innovation and Technology Fund (ITF) project no. 6904333, General Research Fund (GRF) project no. 14232816 (12183516), National Natural Science Foundation of China (grant no. 61702106), and Shanghai Science and Technology Commission (grant no. 17JC1420200 and grant no. 17YF1427600). We are grateful for the contributions of Yulan He, Lu Wang, and Wei Gao in shaping part of our ideas, and the efforts of Nicholas Beautramp, Sarah Shugars, Ming Liao, Xingshan Zeng, Shichao Dong, and Dingmin Wang in preparing some of the experiment data. Also, we thank Shuming Shi, Dong Yu, Tong Zhang, and the three anonymous reviewers for the insightful suggestions on various aspects of this work.
Notes
In this work, a discourse role refers to a certain type of dialogue act on message level, e.g., agreement or argument. The discourse structure of a conversation means some combination (or a probability distribution) of discourse roles.
Dialogue act can be used interchangeably with speech act (Stolcke et al. 2000).
Weibo, short for Sina Weibo, is the biggest microblog platform in China and has market penetration similar to Twitter's (Rapoza 2011). Like Twitter, it has a length limitation of 140 Chinese characters.
Twitter search API: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-saved_searches-show-id. Twitter has allowed users to add comments in retweets (reposting messages on Twitter) since 2015, which enables retweets to become part of a conversation. In our data set, the parents of 91.4% of such retweets can be recovered from the “in reply to status id” field returned by Twitter search API.
Conversation-type keywords are used to obtain tweets reflecting agreement, disagreement, and response, which are likely to appear in Twitter conversations. Keyword list: agreement – “agreed,” “great point,” “agree,” “good point”; disagreement – “wrong,” “bad idea,” “stupid idea,” “disagree”; response – “understand,” “interesting,” “i see.”
The full list of election-related keywords: “trump,” “clinton,” “hillary,” “election,” “president,” “politics.”
We also conducted evaluations on the LDA and BTM versions without this preprocessing step, and we obtained worse coherence scores.
The Palmetto toolkit allows at most 10 words as input for the CV score calculation.
If there are multiple latent topics related to “Trump is a racist,” we pick the most relevant one and display its representative words.
Non-topic words cannot clearly indicate the corresponding topic. Such words can occur in messages covering very different topics. For example, in Table 4, the word “opinion” is a non-topic word for “Trump is a racist,” because an “opinion” can be voiced on diverse people, events, entities, and so on.
On Twitter, a hashtag serves as a special URL, linking to other messages sharing the same hashtag.
The corpus of Chang et al. (2013) is not publicly available.
To ensure that the value of the KL-divergence is finite, we smooth U(Ec) with β, which also serves as the smoothing parameter of ϕkT (Section 3).
We have also conducted evaluations on the versions without this preprocessing step, and they gave worse ROUGE scores.
github.com/summanlp/evaluation/tree/master/ROUGE-RELEASE-1.5.5. Note that the absolute scores of the comparison models here differ from those reported in Li et al. (2015a): the ROUGE scores reported here are given by ROUGE 1.5.5, whereas Li et al. (2015a) used the Dragon toolkit (Zhou, Zhang, and Hu 2007) for ROUGE calculation. Despite the difference in absolute scores, the trends reported here remain similar to those in Li et al. (2015a).
References
Author notes
Jing Li is the corresponding author. This work was partially conducted when Jing Li was at Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, HKSAR, China.