What You Say and How You Say it: Joint Modeling of Topics and Discourse in Microblog Conversations

This paper presents an unsupervised framework for jointly modeling topic content and discourse behavior in microblog conversations. Concretely, we propose a neural model to discover word clusters indicating what a conversation concerns (i.e., topics) and those reflecting how participants voice their opinions (i.e., discourse). Extensive experiments show that our model can yield both coherent topics and meaningful discourse behavior. Further study shows that our topic and discourse representations can benefit the classification of microblog messages, especially when they are jointly trained with the classifier. Our data sets and code are available at: http://github.com/zengjichuan/Topic_Disc.


Introduction
The last decade has witnessed a revolution in communication, in which ''kitchen table conversations'' have expanded into public discussions on online platforms. As a consequence, in our daily life, exposure to new information and the exchange of personal opinions are mediated through microblogs, one popular genre of online platform (Bakshy et al., 2015). The flourishing of microblogs has also led to a sheer quantity of user-created conversations emerging every day, exposing individuals to superfluous information. Facing such an unprecedented number of conversations relative to the limited attention of individuals, how shall we automatically extract the critical points and make sense of these microblog conversations?
* This work was partially conducted during Jichuan Zeng's internship at Tencent AI Lab. Corresponding author: Jing Li.
To understand the key focus of a conversation, previous work has shown the benefits of discourse structure (Li et al., 2016b; Qin et al., 2017; Li et al., 2018), which shapes how messages interact with each other, forming the discussion flow, and can usefully reflect salient topics raised in the discussion process. After all, the topical content of a message naturally occurs in the context of the conversation discourse and hence should not be modeled in isolation. Conversely, the extracted topics can reveal the purpose of participants and further facilitate the understanding of their discourse behavior (Qin et al., 2017). Further, the joint effects of topics and discourse will contribute to a better understanding of social media conversations, benefiting downstream tasks such as the management of discussion topics and discourse behavior of social chatbots (Zhou et al., 2018) and the prediction of user engagement for conversation recommendation (Zeng et al., 2018b).
To illustrate how the topics and discourse interplay in a conversation, Figure 1 displays a snippet of a Twitter conversation. As can be seen, the content words reflecting the discussion topics (such as ''supreme court'' and ''gun rights'') appear in the context of the discourse flow, where participants carry the conversation forward by making a statement, giving a comment, asking a question, and so forth. Motivated by this observation, we assume that a microblog conversation can be decomposed into two crucially different components: one for topical content and the other for discourse behavior. Here, the topic components indicate what a conversation is centered around and reflect the important discussion points put forward in the conversation process. The discourse components signal the discourse roles of messages, such as making a statement, asking a question, and other dialogue acts (Ritter et al., 2010; Joty et al., 2011), which further shape the discourse structure of a conversation.² To distinguish the above two components, we examine the conversation contexts and identify two types of words: topic words, indicating what a conversation focuses on, and discourse words, reflecting how the opinion is voiced in each message. For example, in Figure 1, the topic words ''gun'' and ''control'' indicate the conversation topic, while the discourse words ''what'' and ''?'' signal the question in $M_3$.
Concretely, we propose a neural framework built upon topic models, enabling the joint exploration of word clusters to represent topics and discourse in microblog conversations. Different from the prior models trained on annotated data (Li et al., 2016b; Qin et al., 2017), our model is fully unsupervised, not dependent on annotations for either topics or discourse, which ensures its immediate applicability in any domain or language. Moreover, taking advantage of the recent advances in neural topic models (Srivastava and Sutton, 2017; Miao et al., 2017), we are able to approximate Bayesian variational inference without requiring model-specific derivations, whereas most existing work (Ritter et al., 2010; Joty et al., 2011; Alvarez-Melis and Saveski, 2016; Zeng et al., 2018b; Li et al., 2018) requires expertise to customize model inference algorithms. In addition, the neural nature of our model enables end-to-end training of topic and discourse representation learning with other neural models for diverse tasks.
² In this paper, the discourse role refers to a certain type of dialogue act (e.g., statement or question) for each message, and the discourse structure refers to some combination of discourse roles in a conversation.
For model evaluation, we conduct an extensive empirical study on two large-scale Twitter data sets. The intrinsic results show that our model can produce latent topics and discourse roles with better interpretability than the state-of-the-art models from previous studies. The extrinsic evaluations on a tweet classification task exhibit the model's ability to capture useful representations for microblog messages. In particular, our model can be easily combined with existing neural models, such as convolutional neural networks, for end-to-end training, which is shown to perform better in classification than the pipeline approach without joint training.

Related Work
Our work is in line with previous studies that use non-neural models to leverage discourse structure for extracting topical content from conversations (Li et al., 2016b; Qin et al., 2017; Li et al., 2018). Zeng et al. (2018b) explore how discourse and topics jointly affect user engagement in microblog discussions. Different from them, we build our model in a neural network framework, where the joint effects of topic and discourse representations can be exploited for various downstream deep learning tasks in an end-to-end manner. In addition, we are inspired by prior research that models only topics or only conversation discourse. In the following, we discuss them in turn.
Topic Modeling. Our work is closely related to topic model studies. In this field, despite the huge success achieved by the springboard topic models (e.g., pLSA [Hofmann, 1999] and LDA [Blei et al., 2001]) and their extensions (Blei et al., 2003; Rosen-Zvi et al., 2004), the applications of these models have been limited to formal and well-edited documents, such as news reports (Blei et al., 2003) and scientific articles (Rosen-Zvi et al., 2004), owing to their reliance on document-level word collocations. When processing short texts, such as the messages on microblogs, the performance of these models is likely to be compromised by the severe data sparsity issue.
To deal with this issue, many previous efforts incorporate external representations, such as word embeddings (Nguyen et al., 2015; Li et al., 2016a; Shi et al., 2017) and knowledge (Song et al., 2011; Yang et al., 2015; Hu et al., 2016), pre-trained on large-scale high-quality resources. Different from them, our model learns topic and discourse representations only with the internal data and thus can be widely applied to scenarios where the specific external resource is unavailable.
In another line of the research, most prior work focuses on how to enrich the context of short messages. To this end, biterm topic model (BTM) (Yan et al., 2013) extends a message into a biterm set with all combinations of any two distinct words appearing in the message. On the contrary, our model allows the richer context in a conversation to be exploited, where word collocation patterns can be captured beyond a short message.
In addition, many methods use heuristic rules to aggregate short messages into long pseudo-documents, such as those based on authorship (Hong and Davison, 2010; Zhao et al., 2011) and hashtags (Ramage et al., 2010; Mehrotra et al., 2013). Compared with these methods, we model messages in the context of their conversations, which has been demonstrated to be a more natural and effective text aggregation strategy for topic modeling (Alvarez-Melis and Saveski, 2016).
Conversation Discourse. Our work is also in the area of discourse analysis for conversations, ranging from the prediction of shallow discourse roles at the utterance level (Stolcke et al., 2000; Ji et al., 2016; Zhao et al., 2018) to discourse parsing for a more complex conversation structure (Charniak, 2008, 2010; Afantenos et al., 2015). In this area, most existing models rely heavily on data annotated with discourse labels for learning (Zhao et al., 2017). Different from them, our model, in a fully unsupervised way, identifies distributional word clusters to represent latent discourse factors in conversations. Although such latent discourse variables have been studied in previous work (Ritter et al., 2010; Joty et al., 2011; Ji et al., 2016; Zhao et al., 2018), none of them explores the effects of latent discourse on the identification of conversation topics, a gap that our work fills.

Our Neural Model for Topics and Discourse in Conversations
This section introduces our neural model that jointly explores latent representations for topics and discourse in conversations. We first present an overview of our model in Section 3.1, followed by the model generative process and inference procedure in Sections 3.2 and 3.3, respectively.

Model Overview
In general, our model aims to learn coherent word clusters that reflect the latent topics and discourse roles embedded in the microblog conversations.
To this end, we distinguish two latent components in the given collection: topics and discourse, each represented by a certain type of word distribution (distributional word cluster). Specifically, at the corpus level, we assume that there are K topics, represented by $\phi^T_k$ (k = 1, 2, ..., K), and D discourse roles, captured with $\phi^D_d$ (d = 1, 2, ..., D). $\phi^T$ and $\phi^D$ are all multinomial word distributions over the vocabulary of size V. Inspired by the neural topic models in Miao et al. (2017), our model encodes topic and discourse distributions ($\phi^T$ and $\phi^D$) as latent variables in a neural network and learns the parameters via back propagation.
Before turning to the details of our model, we first describe how we formulate the input. On microblogs, as a message might have multiple replies, messages in an entire conversation can be organized as a tree with replying relations (Li et al., 2016b). Though the recent progress in recursive models allows representation learning from tree-structured data, previous studies have pointed out that, in practice, sequence models serve as a simpler yet robust alternative (Li et al., 2015). In this work, we follow the common practice in most conversation modeling research (Ritter et al., 2010; Joty et al., 2011; Zhao et al., 2018) to take a conversation as a sequence of turns. To this end, each conversation tree is flattened into root-to-leaf paths. Each such path is hence considered as a conversation instance, and a message on the path corresponds to a conversation turn (Zarisheva and Scheffler, 2015; Cerisara et al., 2018; Jiao et al., 2018).
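The flattening step described above can be illustrated with a minimal sketch. It assumes the reply relations are given as (child, parent) pairs; the function name `root_to_leaf_paths` and the message identifiers are hypothetical and are not taken from the released code.

```python
from collections import defaultdict

def root_to_leaf_paths(replies, root):
    """Flatten a reply tree into root-to-leaf paths.

    replies: iterable of (child_id, parent_id) pairs (assumed input format).
    Each returned path is one conversation instance; each id on it is a turn.
    """
    children = defaultdict(list)
    for child, parent in replies:
        children[parent].append(child)

    paths = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A message without replies closes a root-to-leaf path.
            paths.append(path)
        # Push in reverse so that replies are visited in their original order.
        for kid in reversed(kids):
            stack.append((kid, path + [kid]))
    return paths

# Hypothetical conversation: m2 and m3 reply to m1, m4 replies to m2.
replies = [("m2", "m1"), ("m3", "m1"), ("m4", "m2")]
paths = root_to_leaf_paths(replies, "m1")
```

In this toy tree the root message m1 appears in both resulting paths, which is the expected consequence of treating every root-to-leaf path as a separate conversation instance.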
The overall architecture of our model is shown in Figure 2. Formally, we formulate a conversation c as a sequence of messages $(x_1, x_2, \ldots, x_{M_c})$, where $M_c$ denotes the number of messages in c. In the conversation, each message x, as the target message, is fed into our model sequentially.
Here we process the target message x as the bag-of-words (BoW) term vector $x_{BoW} \in \mathbb{R}^V$, following the bag-of-words assumption in most topic models (Blei et al., 2003; Miao et al., 2017). The conversation c, in which the target message x is involved, is considered as the context of x. It is also encoded in the BoW form (denoted as $c_{BoW} \in \mathbb{R}^V$) and fed into our model. In doing so, we ensure that the context of the target message is incorporated while learning its latent representations.
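As a small illustration of the two inputs, the sketch below builds count-based BoW vectors for a target message and for its conversation. The toy vocabulary, the token lists, and the helper name `bow_vector` are hypothetical; raw counts are used here as one plausible choice of term weighting.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Return a length-V count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        vec[vocab[tok]] = n
    return vec

# Hypothetical vocabulary (word -> index) and a two-message conversation.
vocab = {"gun": 0, "control": 1, "what": 2, "?": 3, "law": 4}
conversation = [["gun", "control", "law"], ["what", "gun", "law", "?"]]

target = conversation[1]
x_bow = bow_vector(target, vocab)  # the target message alone
# The conversation context includes the target message itself.
c_bow = bow_vector([t for msg in conversation for t in msg], vocab)
```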
Following the previous practice in neural topic models (Miao et al., 2017; Srivastava and Sutton, 2017), we utilize the variational auto-encoder (VAE) (Kingma and Welling, 2013) to resemble the data generative process via two steps. First, given the target message x and its conversation c, our model converts them into two latent variables: topic variable z and discourse variable d. Then, using the intermediate representations captured by z and d, we reconstruct the target message x.

Generative Process
In this section, we first describe the two latent variables in our model: the topic variable z and the discourse variable d. Then, we present our data generative process from the latent variables.
Latent Topics. For latent topic learning, we examine the main discussion points in the context of a conversation. Our assumption is that messages in the same conversation tend to focus on similar topics (Zeng et al., 2018b). Concretely, we define the latent topic variable $z \in \mathbb{R}^K$ at the conversation level and generate the topic mixture of c, denoted as a K-dimensional distribution θ, via a softmax construction conditioned on z (Miao et al., 2017).
Latent Discourse. For modeling the discourse structure of conversations, we capture the message-level discourse roles reflecting the dialogue acts of each message, as is done in Ritter et al. (2010). Concretely, given the target message x, we use a D-dimensional one-hot vector to represent the latent discourse variable d, where the high bit indicates the index of a discourse word distribution that can best express x's discourse role. In the generative process, the latent discourse d is drawn from a multinomial distribution with parameters estimated from the input data.
Data Generative Process. As mentioned previously, our entire framework is based on VAE, which consists of an encoder and a decoder. The encoder maps a given input into latent topic and discourse representations and the decoder reconstructs the original input from the latent representations. In the following, we first describe the decoder followed by the encoder.
In general, our decoder is learned to reconstruct the words in the target message x (in the BoW form) from the latent topic z and latent discourse d. In the generative story that reflects this reconstruction process, $f_*(\cdot)$ denotes a neural perceptron, with a linear transformation of inputs activated by a non-linear transformation. Here we use rectified linear units (Nair and Hinton, 2010) as the activation functions. In particular, the weight matrix of $f_{\phi^T}(\cdot)$ (after the softmax normalization) is considered as the topic-word distributions $\phi^T$. The discourse-word distributions $\phi^D$ are similarly obtained from $f_{\phi^D}(\cdot)$.
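To make the decoder description more concrete, here is a rough sketch under stated assumptions. It assumes that the topic mixture is a softmax over a linear map of z, and that the word distribution of the target message is a softmax over the sum of a topic term and a discourse term. That additive combination, the omission of the ReLU activation and bias terms, and all names (`decode_word_dist`, `w_theta`, `w_topic`, `w_disc`) are simplifications introduced for illustration, not the authors' confirmed formulation.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def decode_word_dist(z, d, w_theta, w_topic, w_disc):
    """Map latent topic z and one-hot discourse d to a distribution over V words."""
    num_topics = len(w_theta)
    # Topic mixture theta: softmax over a linear transformation of z.
    theta = softmax([sum(w_theta[k][j] * z[j] for j in range(len(z)))
                     for k in range(num_topics)])
    vocab_size = len(w_topic[0])
    logits = []
    for v in range(vocab_size):
        topic_part = sum(theta[k] * w_topic[k][v] for k in range(num_topics))
        disc_part = sum(d[i] * w_disc[i][v] for i in range(len(d)))
        # Assumption of this sketch: topic and discourse logits are added.
        logits.append(topic_part + disc_part)
    return theta, softmax(logits)

random.seed(0)
K, D, V = 3, 2, 5  # toy sizes, not the settings used in the experiments
w_theta = [[random.gauss(0, 1) for _ in range(K)] for _ in range(K)]
w_topic = [[random.gauss(0, 1) for _ in range(V)] for _ in range(K)]
w_disc = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

z = [random.gauss(0, 1) for _ in range(K)]
d = [0, 1]  # one-hot: the second discourse role is active
theta, beta = decode_word_dist(z, d, w_theta, w_topic, w_disc)

# Row-wise softmax of the topic weight matrix, read as topic-word distributions.
phi_topic = [softmax(row) for row in w_topic]
```

The last line mirrors the statement above that the softmax-normalized weight matrix is interpreted as the topic-word distributions; the discourse-word distributions would be read off `w_disc` in the same way.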
For the encoder, we learn the parameters µ, σ, and π from the input $x_{BoW}$ and $c_{BoW}$ (the BoW forms of the target message and its conversation).
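One way to write this mapping is sketched below. It assumes, in line with the posteriors q(z | c) and q(d | x) used in the inference section, that the topic parameters are computed from the conversation and the discourse parameters from the target message. The shared hidden layer $f_e$ and the log-parameterization of σ are conventions borrowed from neural topic models and are assumptions here, not details confirmed by the text above.

```latex
\mu = f_{\mu}\big(f_{e}(c_{BoW})\big), \qquad
\log \sigma = f_{\sigma}\big(f_{e}(c_{BoW})\big), \qquad
\pi = \mathrm{softmax}\big(f_{\pi}(x_{BoW})\big)
```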

Model Inference
For the objective function of our entire framework, we take three aspects into account: the learning of latent topics and discourse, the reconstruction of the target messages, and the separation of topic-associated words and discourse-related words.
Learning Latent Topics and Discourse. For learning the latent topics/discourse in our model, we utilize variational inference (Blei et al., 2016) to approximate the posterior distribution over the latent topic z and the latent discourse d given all the training data. To this end, we maximize the variational lower bounds $\mathcal{L}_z$ for z and $\mathcal{L}_d$ for d. In these bounds, q(z | c) and q(d | x) are approximated posterior probabilities describing how the latent topic z and the latent discourse d are generated from the data, while p(c | z) and p(x | d) represent the corpus likelihoods conditioned on the latent variables.
Here, to facilitate coherent topic production, in p(c | z), we penalize the likelihood of stopwords being generated from latent topics, following Li et al. (2018). p(z) follows the standard normal prior $\mathcal{N}(0, I)$ and p(d) is the uniform distribution Unif(0, 1). $D_{KL}$ refers to the Kullback-Leibler divergence, which keeps the approximated posteriors close to the true ones. For more derivation details, we refer readers to Miao et al. (2017).
Reconstructing Target Messages. From the latent variables z and d, the goal of our model is to reconstruct the target message x. The corresponding learning objective is to maximize $\mathcal{L}_x$, which we design to ensure that the learned latent topics and discourse can reconstruct x.
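For reference, the three objectives described so far can be written out as follows. This is a reconstruction from the quantities named in the text, assuming the standard evidence lower bound form used in neural topic models such as Miao et al. (2017). The conditioning of the expectation in $\mathcal{L}_x$ on q(z | c) and q(d | x) is an assumption made for consistency with the two bounds.

```latex
\begin{aligned}
\mathcal{L}_z &= \mathbb{E}_{q(z \mid c)}\big[\log p(c \mid z)\big] - D_{KL}\big(q(z \mid c) \,\|\, p(z)\big) \\
\mathcal{L}_d &= \mathbb{E}_{q(d \mid x)}\big[\log p(x \mid d)\big] - D_{KL}\big(q(d \mid x) \,\|\, p(d)\big) \\
\mathcal{L}_x &= \mathbb{E}_{q(z \mid c)\, q(d \mid x)}\big[\log p(x \mid z, d)\big]
\end{aligned}
```

In each bound, the expectation term rewards latent variables that explain the observed words, while the KL term regularizes the approximate posterior toward the prior.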
Distinguishing Topics and Discourse. Our model aims to distinguish word distributions for topics ($\phi^T$) and discourse ($\phi^D$), which enables topics and discourse to capture different information in conversations. Concretely, we use the mutual information (Equation 4) to measure the mutual dependency between the latent topics z and the latent discourse d.³ Equation 4 can be further derived as the Kullback-Leibler divergence between the conditional distribution, p(d | z), and the marginal distribution, p(d). The derived formula, defined as the mutual information (MI) loss ($\mathcal{L}_{MI}$) and shown in Equation 5, is used to map z and d into separated semantic spaces.
We can hence minimize $\mathcal{L}_{MI}$ to guide our model to separate the word distributions that represent topics and discourse.
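The derivation mentioned above rests on a standard identity: the mutual information between z and d equals the expectation over p(z) of the KL divergence between p(d | z) and p(d). The sketch below checks this identity numerically on a toy joint distribution. It treats z as discrete purely for illustration (the model's z is continuous), and the function names and probability tables are hypothetical.

```python
import math

def marginals(joint):
    """Row and column marginals of a joint table joint[z][d]."""
    p_z = [sum(row) for row in joint]
    p_d = [sum(joint[z][d] for z in range(len(joint)))
           for d in range(len(joint[0]))]
    return p_z, p_d

def mutual_information(joint):
    """I(z; d) = sum_{z,d} p(z,d) * log( p(z,d) / (p(z) p(d)) )."""
    p_z, p_d = marginals(joint)
    mi = 0.0
    for z, row in enumerate(joint):
        for d, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_z[z] * p_d[d]))
    return mi

def expected_kl(joint):
    """E_{p(z)}[ KL( p(d|z) || p(d) ) ], the KL form of the same quantity."""
    p_z, p_d = marginals(joint)
    total = 0.0
    for z, row in enumerate(joint):
        if p_z[z] == 0:
            continue
        kl = 0.0
        for d, p in enumerate(row):
            cond = p / p_z[z]  # p(d | z)
            if cond > 0:
                kl += cond * math.log(cond / p_d[d])
        total += p_z[z] * kl
    return total

dependent = [[0.4, 0.1], [0.1, 0.4]]        # z and d strongly coupled
independent = [[0.25, 0.25], [0.25, 0.25]]  # z and d carry no shared information

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
```

Driving this quantity toward zero, as in the independent table, corresponds to the stated goal of letting topics and discourse capture different information.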
The Final Objective. To capture the joint effects of the learning objectives described above ($\mathcal{L}_z$, $\mathcal{L}_d$, $\mathcal{L}_x$, and $\mathcal{L}_{MI}$), we design the final objective function $\mathcal{L}$ for our entire framework (Equation 6), where the hyperparameter λ is the trade-off parameter balancing the MI loss ($\mathcal{L}_{MI}$) against the other learning objectives. By maximizing the final objective $\mathcal{L}$ via back propagation, the word distributions of topics and discourse can be jointly learned from microblog conversations.⁴
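A form of the final objective consistent with this description is given below. The sign of the last term is inferred from the text: the overall objective is maximized while the MI loss is to be minimized, so the MI loss enters with a negative weight. Treat this as a reconstruction rather than a verbatim copy of the original equation.

```latex
\mathcal{L} = \mathcal{L}_z + \mathcal{L}_d + \mathcal{L}_x - \lambda \, \mathcal{L}_{MI}
```

With λ = 0 the model would still learn topics and discourse, but nothing would explicitly push the two sets of word distributions apart.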

Experimental Setup
Data Collection. For our experiments, we collected two microblog conversation data sets from Twitter. One is released by the TREC 2011 microblog track (henceforth TREC; http://trec.nist.gov/data/tweets/), containing conversations concerning a wide range of topics. The other is crawled from January to June 2016 with the Twitter streaming API (henceforth TWT16, short for Twitter 2016), following the way of building the TREC data set. During this period, there was a large volume of discussions centered around the U.S. presidential election. In addition, for both data sets, we apply the Twitter search API to retrieve the missing tweets in the conversation history, as the Twitter streaming API (used to collect both data sets) only returns sampled tweets from the entire pool. The statistics of the two experiment data sets are shown in Table 1. For model training and evaluation, we randomly sampled 80%, 10%, and 10% of the data to form the training, development, and test sets, respectively.
³ The distributions in Equation 4 are all conditional probability distributions given the target message x and its conversation c. We omit the conditions for simplicity.
⁴ To smooth the gradients in implementation, for $z \sim \mathcal{N}(\mu, \sigma)$, we apply the reparameterization on z (Kingma and Welling, 2013; Rezende et al., 2014), and for $d \sim \mathrm{Multi}(\pi)$, we adopt the Gumbel-Softmax trick (Maddison et al., 2016; Jang et al., 2016).
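The two sampling tricks named in the implementation footnote can be sketched in a few lines. This is a plain-Python illustration of the general techniques, not the released implementation; the function names, the temperature value, and the example parameters are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, sigma) as z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness is isolated in eps, so z stays a deterministic,
    differentiable function of mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def gumbel_softmax(pi, temperature, rng):
    """Draw a relaxed (soft) one-hot sample from Multi(pi).

    Assumes every entry of pi is strictly positive.
    """
    noisy = []
    for p in pi:
        # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((math.log(p) + g) / temperature)
    m = max(noisy)
    e = [math.exp(a - m) for a in noisy]
    s = sum(e)
    return [a / s for a in e]

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
d_soft = gumbel_softmax([0.7, 0.2, 0.1], temperature=0.5, rng=rng)
```

Lower temperatures push `d_soft` closer to a one-hot vector, at the cost of noisier gradients; how the temperature is set in the actual model is not stated in the text.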
Data Preprocessing. We preprocessed the data with the following steps. First, non-English tweets were filtered out. Then, hashtags, mentions (@username), and links were replaced with generic tags ''HASH'', ''MENT'', and ''URL'', respectively. Next, the natural language toolkit (NLTK) was applied for tweet tokenization. After that, all letters were normalized to lower case. Finally, words that occurred fewer than 20 times were filtered out from the data.
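The preprocessing steps above (except language filtering) might look roughly like the following sketch. Several choices here are assumptions: a regular-expression tokenizer stands in for the NLTK tokenizer, the generic tags are kept in upper case when lowercasing, and the regular expressions for hashtags, mentions, and links are simplified. The function names are hypothetical.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
GENERIC_TAGS = {"URL", "MENT", "HASH"}

def normalize(tweet):
    """Replace links, mentions, and hashtags with generic tags, then tokenize."""
    tweet = URL_RE.sub("URL", tweet)  # links first, so their contents are not re-matched
    tweet = MENTION_RE.sub("MENT", tweet)
    tweet = HASHTAG_RE.sub("HASH", tweet)
    # Simple stand-in for the NLTK tokenizer: words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", tweet)
    # Assumption: generic tags keep their upper-case form; all else is lowercased.
    return [t if t in GENERIC_TAGS else t.lower() for t in tokens]

def filter_rare(tokenized, min_count=20):
    """Drop words occurring fewer than min_count times in the whole corpus."""
    counts = Counter(t for toks in tokenized for t in toks)
    return [[t for t in toks if counts[t] >= min_count] for toks in tokenized]

tweets = ["@bob What about #GunControl? http://t.co/abc", "Guns, guns everywhere"]
tokenized = [normalize(t) for t in tweets]
# A tiny threshold for the toy corpus; the paper uses 20.
filtered = filter_rare(tokenized, min_count=2)
```

Note that punctuation such as ''?'' is kept as a token, consistent with its use as a discourse indicator later in the paper.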
Parameter Setting. To ensure comparable results with Li et al. (2018) (the prior work focusing on the same task as ours), in the topic coherence evaluation, we follow their setup to report the results under two sets of K (the number of topics): K = 50 and K = 100, and with the number of discourse roles (D) set to 10. The analysis for the effects of K and D will be further presented in Section 5.5. For all the other hyperparameters, we tuned them on development set by grid search. The trade-off parameter λ (defined in Equation 6), balancing the MI loss and the other objective functions, is set to 0.01. In model training, we use the Adam optimizer (Kingma and Ba, 2014) and run 100 epochs with early stop strategy adopted.
Baselines. In topic modeling experiments, we consider five topic model baselines that treat each tweet as a document: LDA (Blei et al., 2003), BTM (Yan et al., 2013), LF-LDA, LF-DMM (Nguyen et al., 2015), and NTM (Miao et al., 2017). In particular, BTM and LF-DMM are the state-of-the-art topic models for short texts. BTM explores the topics of all word pairs (biterms) in each message to alleviate data sparsity in short texts. LF-DMM incorporates word embeddings pre-trained on external data to expand the semantic meanings of words, as does LF-LDA. In Nguyen et al. (2015), LF-DMM, based on the one-topic-per-document Dirichlet Multinomial Mixture (DMM) (Nigam et al., 2000), was reported to perform better than LF-LDA, based on LDA. For LF-LDA and LF-DMM, we use GloVe Twitter embeddings (Pennington et al., 2014) as the pre-trained word embeddings. For the discourse modeling experiments, we compare our results with LAED (Zhao et al., 2018), a VAE-based representation learning model for conversation discourse. In addition, for both topic and discourse evaluation, we compare with Li et al. (2018), a recently proposed model for microblog conversations, where topics and discourse are jointly explored with a non-neural framework. Besides the existing models from previous studies, we also compare with the variants of our model that model only topics (henceforth TOPIC ONLY) or only discourse (henceforth DISC ONLY). Our joint model of topics and discourse is referred to as TOPIC+DISC.
In the preprocessing procedure for the baselines, we removed stop words and punctuation for topic models unable to learn discourse representations, following the common practice in previous work (Yan et al., 2013; Miao et al., 2017). For the other models, stop words and punctuation were retained in the vocabulary, considering their usefulness as discourse indicators.

Experimental Results
In this section, we first report the topic coherence results in Section 5.1, followed by a discussion in Section 5.2 comparing the latent discourse roles discovered by our model with the manually annotated dialogue acts. Then, we study whether we can capture useful representations for microblog messages in a tweet classification task (in Section 5.3). A qualitative analysis, showing some example topics and discourse roles, is further provided in Section 5.4. Finally, in Section 5.5, we provide more discussions on our model.

Topic Coherence
For the topic coherence, we adopt the $C_v$ scores measured via the open-source Palmetto toolkit (https://github.com/dice-group/Palmetto) as our evaluation metric. $C_v$ scores assume that the top N words in a coherent topic (ranked by likelihood) tend to co-occur in the same document and have shown comparable evaluation results to human judgments (Röder et al., 2015). Table 2 shows the average $C_v$ scores over the produced topics given N = 5 and N = 10. The values range from 0.0 to 1.0, and higher scores indicate better topic coherence. We can observe that:
• Models assuming a single topic for each message do not work well. It has long been pointed out that the one-topic-per-message assumption (each message contains only one topic) helps topic models alleviate the data sparsity issue in short texts on microblogs (Zhao et al., 2011; Quan et al., 2015; Nguyen et al., 2015; Li et al., 2018). However, we observe contradictory results because both LF-DMM and Li et al. (2018), following this assumption, achieve generally worse performance than the other models. This might be attributed to the large-scale data used in our experiments (each data set has over 250K messages as shown in Table 1), which potentially provide richer word co-occurrence patterns and thus partially alleviate the data sparsity issue.
• Pre-trained word embeddings do not bring benefits. Comparing LF-LDA with LDA, we found that they result in similar coherence scores. This shows that, with sufficiently large training data, using pre-trained word embeddings does not make any difference in the topic coherence results.
• Neural models perform better than non-neural baselines. When comparing the results of neural models (NTM and our models) with the other baselines, we find the former yield topics with better coherence scores in most cases.
• Modeling topics in conversations is effective. Among neural models, we found that our models outperform NTM (which does not exploit conversation contexts). This shows that the conversations provide useful context and enable more coherent topics to be extracted from the entire conversation thread instead of a single short message.
• Modeling topics together with discourse helps produce more coherent topics. We can observe better results with the joint model TOPIC+DISC in comparison with the variant considering topics only. This shows that TOPIC+DISC, via the joint modeling of topic- and discourse-word distributions (reflecting non-topic information), can better separate topical words from non-topical ones, hence resulting in more coherent topics.

Discourse Interpretability
In this section, we evaluate whether our model can discover meaningful discourse representations. To this end, we train the comparison models for discourse modeling on the TREC data set and test the learned latent discourse on a benchmark data set released by Cerisara et al. (2018). The benchmark data set consists of 2,217 microblog messages forming 505 conversations collected from Mastodon, a microblog platform exhibiting Twitter-like user behavior (Cerisara et al., 2018). For each message, there is a human-assigned discourse label, selected from one of the 15 dialogue acts, such as question, answer, disagreement, and so forth. For discourse evaluation, we measure whether the model-produced discourse assignments are consistent with the human-annotated dialogue acts. Hence, following Zhao et al. (2018), we assume that an interpretable latent discourse role should cluster messages labeled with the same dialogue act. Therefore, we adopt purity (Manning et al., 2008), homogeneity (Rosenberg and Hirschberg, 2007), and variation of information (VI) (Meila, 2003; Goldwater and Griffiths, 2007) as our automatic evaluation metrics. Here, we set D = 15 to ensure that the number of latent discourse roles is the same as the number of manually labeled dialogue acts. Table 3 shows the comparison results of the average scores over the 15 latent discourse roles. Higher values indicate better performance for purity and homogeneity, while for VI, lower is better.
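For readers unfamiliar with these clustering metrics, the sketch below computes all three from gold dialogue-act labels and predicted discourse-role assignments. It uses the usual textbook definitions (majority-vote purity, homogeneity as one minus the normalized conditional entropy of gold labels given clusters, and VI as the sum of the two conditional entropies) with natural logarithms. The function names and toy labels are hypothetical, and this is not the evaluation script used for Table 3.

```python
import math
from collections import Counter

def _entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def clustering_scores(gold, pred):
    """Return (purity, homogeneity, variation of information)."""
    n = len(gold)
    joint = Counter(zip(gold, pred))
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)

    # Purity: every cluster is credited with its most frequent gold label.
    best = Counter()
    for (_, p), c in joint.items():
        best[p] = max(best[p], c)
    purity = sum(best.values()) / n

    h_gold = _entropy(gold_counts.values(), n)
    h_pred = _entropy(pred_counts.values(), n)
    h_joint = _entropy(joint.values(), n)
    h_gold_given_pred = h_joint - h_pred  # H(gold | cluster)
    h_pred_given_gold = h_joint - h_gold  # H(cluster | gold)

    homogeneity = 1.0 if h_gold == 0 else 1.0 - h_gold_given_pred / h_gold
    vi = h_gold_given_pred + h_pred_given_gold
    return purity, homogeneity, vi

# Toy example: clusters that match the dialogue acts exactly ...
perfect = clustering_scores(["q", "q", "a", "a"], [0, 0, 1, 1])
# ... and clusters that mix the two dialogue acts evenly.
mixed = clustering_scores(["q", "a", "q", "a"], [0, 0, 1, 1])
```

In the perfect case purity and homogeneity are both 1 and VI is 0; in the evenly mixed case purity drops to 0.5, homogeneity to 0, and VI rises to 2 ln 2.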
It can be observed that our models exhibit generally better performance, showing the effectiveness of our framework in inducing interpretable discourse roles. In particular, we observe the best results achieved by our joint model TOPIC+DISC, which is learned to distinguish topic- and discourse-words, an ability that is important for recognizing the indicative words that reflect latent discourse. To further analyze the consistency of varying latent discourse roles (produced by our TOPIC+DISC model) with the human-labeled dialogue acts, Figure 3 displays a heatmap, where each line visualizes how the messages with a dialogue act distribute over varying discourse roles. It is seen that among all dialogue acts, our model discovers more interpretable latent discourse for ''greetings'', ''thanking'', ''exclamation'', and ''offer'', where most messages are clustered into one or two dominant discourse roles. This may be because these dialogue acts are relatively easy to detect based on their associated indicative words, such as the word ''thanks'' for ''thanking'' and the word ''wow'' for ''exclamation''.

Message Representations
To further evaluate our ability to capture effective representations for microblog messages, we take tweet classification as an example and test the classification performance with the topic and discourse representations as features. Here, the user-generated hashtags capturing the topics of online messages are used as the proxy class labels (Li et al., 2016b; Zeng et al., 2018a). We construct the classification data set from TREC and TWT16 with the following steps. First, we removed the tweets without hashtags. Second, we ranked hashtags by their frequencies. Third, we manually removed the hashtags that are not topic-related (e.g., ''#fb'' for indicating the source of tweets from Facebook), and combined the hashtags referring to the same topic (e.g., ''#DonaldTrump'' and ''#Trump''). Finally, we selected the top 50 most frequent hashtags and all tweets containing these hashtags as our classification data set. Here, we simply use support vector machines as the classifier, since our focus is to compare the representations learned by various models. The model of Li et al. (2018) is unable to produce vector representations at the tweet level and is hence not considered here. Table 4 shows the classification results in accuracy and average F1 on the two data sets, with the representations learned by various models serving as the classification features. We observe that our model outperforms the other models by a large margin. The possible reasons are twofold. First, our model derives topics from conversation threads and thus potentially yields better message representations. Second, the discourse representations (only produced by our model) are indicative features for hashtags, because users will exhibit various discourse behaviors in discussing diverse topics (hashtags). For instance, we observe prominent ''argument'' discourse from tweets with ''#Trump'' and ''#Hillary'', attributed to the controversial opinions about the two candidates in the 2016 U.S. presidential election.
Table 4: Evaluation of tweet classification results in accuracy (Acc) and average F1 (Avg F1). Representations learned by various models serve as the classification features. For our model, both the topic and discourse representations are fed into the classifier.

Example Topics and Discourse Roles
We have shown that joint modeling of topics and discourse presents superior performance on a quantitative measure. In this section, we qualitatively analyze the interpretability of our outputs via analyzing the word distributions of some example topics and discourse roles.
Example Topics. Table 5 lists the top 10 words of some example latent topics discovered by various models from the TWT16 data set. According to the words shown, we can interpret the extracted topics as ''gun control'', that is, discussion about gun law and the failure of gun control in Chicago. We observe that LDA wrongly includes the off-topic word ''flag''. From the outputs of BTM, LF-DMM, Li et al. (2018), and our TOPIC ONLY variant, though we do not find off-topic words, there are some non-topic words, such as ''said'' and ''understand''. The output of our TOPIC+DISC model appears to be the most coherent, with words such as ''firearm'' and ''criminals'' included, which are clearly relevant to ''gun control''. Such results indicate the benefit of examining the conversation contexts and jointly exploring topics and discourse in them.
Example Discourse Roles. To qualitatively analyze whether our TOPIC+DISC model can discover interpretable discourse roles, we select the top 10 words from the distributions of some example discourse roles and list them in Table 6. It can be observed that there are some meaningful word clusters reflecting varying discourse roles found without any supervision. Interestingly, we observe that the latent discourse roles from TREC and TWT16, though learned separately, exhibit some notable overlap in their associated top 10 words, particularly for ''question'' and ''statement''. We also note that ''argument'' is represented by very different words. The reason is that TWT16 contains a large volume of arguments centered around candidates Clinton and Trump, resulting in the frequent appearance of words like ''he'' and ''she''.
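Selecting the top 10 words of a topic or discourse role, as done for Tables 5 and 6, amounts to ranking the vocabulary by probability under the corresponding word distribution. A minimal sketch follows, assuming the distribution is available as a plain mapping from word to probability; the function name top_words and the toy probabilities in the usage are hypothetical, not model outputs.

```python
import heapq


def top_words(distribution, k=10):
    """Return the k most probable words of a word distribution.

    distribution: dict mapping each word to its probability under one
    latent topic or discourse role.
    """
    ranked = heapq.nlargest(k, distribution.items(), key=lambda item: item[1])
    return [word for word, _ in ranked]
```

For instance, top_words({"gun": 0.3, "laws": 0.2, "said": 0.05, "firearm": 0.25}, k=2) keeps only ''gun'' and ''firearm''.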

Further Discussions
In this section, we present further discussion of our joint model, TOPIC+DISC.
Parameter Analysis. Here, we study the two important hyper-parameters in our model, the number of topics (K) and the number of discourse roles (D). In Figure 4, we show the C_v topic coherence given varying K in (a) and the homogeneity measure given varying D in (b). As can be seen, the curves corresponding to the performance on topics and discourse are not monotonic. In particular, better topic coherence scores are achieved given relatively larger topic numbers for TREC, with the best result observed at K = 80. In contrast, the optimum topic number for TWT16 is K = 20, and increasing the number of topics generally results in worse C_v scores. This may be attributed to the relatively centralized topic concerning the U.S. election in the TWT16 corpus. For discourse homogeneity, the best result is achieved given D = 15, the same as the number of manually annotated dialogue acts in the benchmark.

LDA: people* trump police violence gun death protest guns flag shot
BTM: gun guns people* police wrong right think* law agree black
LF-DMM: gun police black said* people* guns killing ppl amendment laws
Li et al. (2018): wrong don trump gun understand* laws agree guns doesn* make*
NTM: gun understand* yes* guns world dead real* discrimination trump silence
TOPIC ONLY: shootings gun guns cops charges control mass* commit know* agreed
TOPIC+DISC: guns gun shootings chicago shooting cops firearm criminals commit laws
Table 5: Top 10 representative words of example latent topics discovered from the TWT16 data set. We interpret the topics as ''gun control'' by the displayed words. Non-topic words are marked with an asterisk (*); off-topic words are underlined and in red in the original table.

Table 6: Top 10 representative words of example discourse roles learned from TREC and TWT16. The discourse roles of the word clusters are manually assigned according to their associated words.
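To make the homogeneity measure in the parameter analysis concrete, the sketch below computes it from gold dialogue-act labels and induced discourse roles. It assumes the standard entropy-based definition, h = 1 - H(C | K) / H(C), where C denotes the gold classes and K the induced clusters; whether the experiments use exactly this formulation is an assumption, and the function name homogeneity is illustrative.

```python
import math
from collections import Counter


def homogeneity(gold, predicted):
    """Homogeneity of a clustering: 1 - H(C | K) / H(C).

    gold      : gold class label of each message (e.g., dialogue acts).
    predicted : induced cluster of each message (e.g., latent discourse roles).
    """
    n = len(gold)
    class_counts = Counter(gold)
    cluster_counts = Counter(predicted)
    joint_counts = Counter(zip(gold, predicted))

    # Entropy of the gold classes, H(C).
    h_c = -sum(c / n * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        # A single gold class (or no data) is trivially homogeneous.
        return 1.0

    # Conditional entropy of the classes given the clusters, H(C | K).
    h_c_given_k = -sum(
        c / n * math.log(c / cluster_counts[k])
        for (_, k), c in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c
```

A score of 1 means every induced discourse role contains messages of a single gold dialogue act, while a score of 0 means the induced roles carry no information about the gold labels.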
Case Study. To further understand why our model learns meaningful representations for topics and discourse, we present a case study based on the example conversation shown in Figure 1. Specifically, we visualize the topic words (those with p(w | z) > p(w | d)) in red and the rest of the words in blue to indicate discourse. Darker red indicates a higher topic likelihood (p(w | z)) and darker blue indicates a higher discourse likelihood (p(w | d)). The results are shown in Figure 5. We can observe that topic and discourse words are well separated by our model, which explains why it can generate high-quality representations for both topics and discourse.
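The coloring rule used in this case study can be written down directly: a word is treated as a topic word when p(w | z) > p(w | d) and as a discourse word otherwise, with the larger likelihood determining the shade. The sketch below assumes the two likelihoods are given as word-to-probability mappings for the assigned topic z and discourse role d; the function name and the handling of unseen words (probability 0) are illustrative assumptions.

```python
def assign_topic_or_discourse(words, p_w_given_z, p_w_given_d):
    """Label each word as 'topic' or 'discourse' with a shading intensity.

    p_w_given_z : dict of p(w | z) for the message's latent topic z.
    p_w_given_d : dict of p(w | d) for the message's discourse role d.
    Words missing from a mapping are assumed to have probability 0.
    """
    labeled = []
    for word in words:
        p_topic = p_w_given_z.get(word, 0.0)
        p_disc = p_w_given_d.get(word, 0.0)
        if p_topic > p_disc:
            # Topic word (red); shade by topic likelihood.
            labeled.append((word, "topic", p_topic))
        else:
            # Discourse word (blue); shade by discourse likelihood.
            labeled.append((word, "discourse", p_disc))
    return labeled
```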
Model Extensibility. Recall that in the Introduction, we mentioned that our neural-based model has the advantage of being easily combined with other neural network architectures, allowing for the joint training of both models. Here, we take message classification (with the setup in Section 5.3) as an example, and study whether jointly training our model with a convolutional neural network (CNN) (Kim, 2014), a widely used model for short text classification, can benefit classification performance. We set the embedding dimension to 200, with random initialization. The results are shown in Table 7, where we observe that jointly training our model and the classifier successfully boosts the classification performance.

Figure 4: (a) The impact of topic numbers. The horizontal axis shows the number of topics; the vertical axis shows the C_v topic coherence. (b) The impact of discourse numbers. The horizontal axis represents the number of discourse roles; the vertical axis represents the homogeneity measure.
Error Analysis. We further analyze the errors in our outputs. For topics, taking a closer look at their word distributions, we found that our model sometimes mixes sentiment words with topic words. For example, among the top 10 words of one topic, ''win people illegal americans hate lt racism social tax wrong'', the words ''hate'' and ''wrong'' express sentiment rather than convey topic-related information. This is due to the prominent co-occurrence of topic words and sentiment words in our data, which results in similar distributions for topics and sentiment. Future work could focus on the further separation of sentiment and topic words.
For discourse, we found that our model can induce some discourse roles beyond the 15 manually defined dialogue acts in the Mastodon data set (Cerisara et al., 2018). For example, as shown in Table 6, our model discovers the ''quotation'' discourse from both TREC and TWT16, which is, however, not defined in the Mastodon data set. This perhaps should not be considered an error. We argue that it is not sensible to pre-define a fixed set of dialogue acts for diverse microblog conversations, given the rapid change and wide variety of user behaviors in social media. Therefore, future work should involve a better alternative for evaluating the latent discourse without relying on manually defined dialogue acts. We also notice that our model sometimes fails to identify discourse behaviors requiring more in-depth semantic understanding, such as sarcasm, irony, and humor. This is because our model detects latent discourse purely based on the observed words, whereas the detection of sarcasm, irony, or humor requires deeper language understanding, which is beyond the capacity of our model.

Figure 5: Visualization of the topic-discourse assignment of a Twitter conversation from TWT16. The annotated blue words are prone to be discourse words, and the red are topic words. The shade indicates the confidence of the current assignment.

Conclusion and Future Work
We have presented a neural framework that jointly explores topics and discourse in microblog conversations. Our model, in an unsupervised manner, examines the conversation contexts and discovers word distributions that reflect latent topics and discourse roles. Results from extensive experiments show that our model can generate coherent topics and meaningful discourse roles. In addition, our model can be easily combined with other neural network architectures (such as CNN) and allows for joint training, which yields better message classification results than the pipeline approach without joint training.
Our model captures topic and discourse representations embedded in conversations. They are potentially useful for a broad range of downstream applications and are worth exploring in future research. For example, our model is useful for developing social chatbots (Zhou et al., 2018  change of topics in conversation context, helpful to determine ''what to say and how to say'' in the next turn. Also, it would be interesting to study how our learned latent topics and discourse affect recommendation (Zeng et al., 2018b) and summarization of microblog conversations.