Abstract
Participants in an asynchronous conversation (e.g., forum, e-mail) interact with each other at different times, performing certain communicative acts, called speech acts (e.g., question, request). In this article, we propose a hybrid approach to speech act recognition in asynchronous conversations. Our approach works in two main steps: a long short-term memory recurrent neural network (LSTM-RNN) first encodes each sentence separately into a task-specific distributed representation, and this is then used in a conditional random field (CRF) model to capture the conversational dependencies between sentences. The LSTM-RNN model uses pretrained word embeddings learned from a large conversational corpus and is trained to classify sentences into speech act types. The CRF model can consider arbitrary graph structures to model conversational dependencies in an asynchronous conversation. In addition, to mitigate the problem of limited annotated data in the asynchronous domains, we adapt the LSTM-RNN model to learn from synchronous conversations (e.g., meetings), using domain adversarial training of neural networks. Empirical evaluation shows the effectiveness of our approach over existing ones: (i) LSTM-RNNs provide better task-specific representations, (ii) conversational word embeddings benefit the LSTM-RNNs more than the off-the-shelf ones, (iii) adversarial training gives better domain-invariant representations, and (iv) the global CRF model improves over local models.
1. Introduction
With the advent of Internet technologies, communication media like e-mails and discussion forums have become commonplace for discussing work, issues, events, and experiences. Participants in these media interact with each other asynchronously by writing at different times. This generates a type of conversational discourse, where information flow is often not sequential as in monologue (e.g., news articles) or in synchronous conversation (e.g., instant messaging). As a result, discourse structures such as topic structure, coherence structure, and conversational structure in these conversations exhibit different properties from what we observe in monologue or in synchronous conversation (Joty, Carenini, and Ng 2013; Louis and Cohen 2015).
Participants in an asynchronous conversation interact with each other in complex ways, performing certain communicative acts like asking questions, requesting information, or suggesting something. These are called speech acts (Austin 1962). For example, consider the excerpt of a forum conversation1 from our corpus in Figure 1. The participant who posted the first comment, C1, describes his situation in the first two sentences, and then asks a question in the third sentence. Other participants respond to the query by suggesting something or asking for clarification. In this process, the participants get into a conversation by taking turns, each of which consists of one or more speech acts. The two-part structures across posts like question-answer and request-grant are called adjacency pairs (Schegloff 1968).
Example of a forum conversation (truncated) with Human annotations and automatic predictions by a Local classifier and a Global classifier for speech acts (e.g., Statement, Suggestion). The incorrect decisions are underlined and marked with red color.
Identification of speech acts is an important step toward deep conversational analysis (Bangalore, Di Fabbrizio, and Stent 2006), and has been shown to be useful in many downstream applications, including summarization (Murray et al. 2006; McKeown, Shrestha, and Rambow 2007), question answering (Hong and Davison 2009), collaborative task learning agents (Allen et al. 2007), artificial companions for people to use the Internet (Wilks 2006), and flirtation detection in speed-dates (Ranganath, Jurafsky, and McFarland 2009).
Availability of large annotated corpora like the Meeting Recorder Dialog Act (MRDA) (Dhillon et al. 2004) or the Switchboard-DAMSL (SWBD) (Jurafsky, Shriberg, and Biasca 1997) corpus has fostered research in data-driven automatic speech act recognition in synchronous domains like meeting and phone conversations (Ries 1999; Stolcke et al. 2000; Dielmann and Renals 2008). However, such large corpora are not available in the asynchronous domains, and many of the existing (small-sized) corpora use task-specific speech act tagsets (Cohen, Carvalho, and Mitchell 2004; Ravi and Kim 2007; Bhatia, Biyani, and Mitra 2014) as opposed to a standard one. The unavailability of large annotated data sets with standard tagsets is one of the reasons why speech act recognition has not received much attention in asynchronous domains.
Previous attempts in automatic (sentence-level) speech act recognition in asynchronous conversations (Jeong, Lin, and Lee 2009; Qadir and Riloff 2011; Tavafi et al. 2013; Oya and Carenini 2014) suffer from at least one of the following two technical limitations.
First, they use a bag-of-words (BOW) representation (e.g., unigram, bigram) to encode lexical information of a sentence. However, consider the Suggestion sentences in the example. Arguably, a model needs to consider the structure (e.g., word order) and the compositionality of phrases to identify the right speech act for an utterance. Furthermore, BOW representation could be quite sparse, and may not generalize well when used in classification models. Recent research suggests that a condensed distributed representation learned by a neural model on the target task (e.g., speech act classification) is more effective. The task-specific training can be further improved by pretrained word embeddings (Goodfellow, Bengio, and Courville 2016).
Second, existing approaches mostly disregard conversational dependencies between sentences inside a comment and across comments. For instance, consider the example in Figure 1 again. The Suggestions are answers to Questions asked in a previous comment. We therefore hypothesize that modeling inter-sentence relations is crucial for speech act recognition. We have tagged the sentences in Figure 1 with human annotations (Human) and with the predictions of a local (Local) classifier that considers word order for sentence representation but classifies each sentence separately or individually. Prediction errors are underlined and highlighted in red. Notice the first and second sentences of comment C4, which are mistakenly tagged as Statement and Response, respectively, by our best local classifier. We hypothesize that some of the errors made by the local classifier could be corrected by utilizing a global joint model that is trained to perform a collective classification, taking into account the conversational dependencies between sentences (e.g., adjacency relations like Question-Suggestion).
However, unlike synchronous conversations (e.g., meeting, phone), modeling conversational dependencies between sentences in an asynchronous conversation is challenging, especially when the thread structure (e.g., “reply-to” links between comments) is missing, which is also our case. The conversational flow often lacks sequential dependencies in its temporal/chronological order. For example, if we arrange the sentences as they arrive in the conversation, it becomes hard to capture any dependency between the act types because the two components of the adjacency pairs can be far apart in the sequence. This leaves us with one open research question: How do we model the dependencies between sentences in a single comment and between sentences across different comments? In this article, we attempt to address this question by designing and experimenting with conditional structured models over arbitrary graph structures of the conversation. Apart from the underlying discourse structure (sequence vs. graph), asynchronous conversations differ from synchronous conversations in style (spoken vs. written) and in vocabulary usage (meeting conversations on some focused topics vs. conversations on any topic of interest in a public forum). In this article, we propose to use domain adaptation methods in the neural network framework to model these differences in the sentence encoding process.
More concretely, we make the following contributions in speech act recognition for asynchronous conversations. First, we propose to use a recurrent neural network (RNN) with a long short-term memory (LSTM) hidden layer to compose phrases in a sentence and to represent the sentence using distributed condensed vectors (i.e., embeddings). These embeddings are trained directly on the speech act classification task. We experiment with both unidirectional and bidirectional RNNs. Second, we train (task-agnostic) word embeddings from a large conversational corpus, and use it to boost the performance of the LSTM-RNN model. Third, we propose conditional structured models in the form of pairwise conditional random fields (CRF) (Murphy 2012) over arbitrary conversational structures. We experiment with different variations of this model to capture different types of interactions between sentences inside the comments and across the comments in a conversational thread. These models use the LSTM-encoded vectors as feature vectors for learning to classify sentences in a conversation collectively.
Furthermore, to address the problem of insufficient training data in the asynchronous domains, we propose to use the available labeled data from synchronous domains (e.g., meetings). To make the best use of this out-of-domain data, we adapt our LSTM-RNN encoder to learn task-specific sentence representations by modeling the differences in style and vocabulary usage between the two domains. We achieve this by using the recently proposed domain adversarial training methods of neural networks (Ganin et al. 2016). As a secondary contribution, we also present and release a forum data set annotated with a standard speech act tagset.
We train our models in various settings with synchronous and asynchronous corpora, and we evaluate on one synchronous meeting data set and three asynchronous data sets—two forum data sets and one e-mail data set. We also experimented with different pretrained word embeddings in the LSTM-RNN model. Our main findings are: (i) LSTM-RNNs provide better sentence representation than BOW and other unsupervised methods; (ii) bidirectional LSTM-RNNs, which encode a sentence using two vectors, provide better representation than the unidirectional ones; (iii) word embeddings pretrained on a large conversational corpus yield significant improvements; (iv) the globally normalized joint models (CRFs) improve over local models for certain graph structures; and (v) domain adversarial training improves the results by inducing domain-invariant features. The source code, the pretrained word embeddings, and the new data sets are available at https://ntunlpsg.github.io/demo/project/speech-act/.
After discussing related work in Section 2, we present our speech act recognition framework in Section 3. In Section 4, we present the data sets used in our experiments along with our newly created corpus. The experiments and analysis of results are presented in Section 5. Finally, we summarize our contributions with future directions in Section 6.
2. Related Work
Three lines of research are related to our work: (i) compositionality with LSTM-RNNs, (ii) conditional structured models, and (iii) speech act recognition in asynchronous conversations.
2.1 LSTM-RNNs for Composition
RNNs are arguably the most popular deep learning models in natural language processing, where they have been used for both encoding and decoding a text—for example, language modeling (Mikolov 2012; Tran, Zukerman, and Haffari 2016), machine translation (Bahdanau, Cho, and Bengio 2015), summarization (Rush, Chopra, and Weston 2015), and syntactic parsing (Dyer et al. 2015). RNNs have also been used as a sequence tagger, as in opinion mining (Irsoy and Cardie 2014; Liu, Joty, and Meng 2015), named entity recognition (Lample et al. 2016), and part-of-speech tagging (Plank, Søgaard, and Goldberg 2016).
Relevant to our implementation, Kalchbrenner and Blunsom (2013) use a simple RNN to model sequential dependencies between act types for speech act recognition in phone conversations. They use a convolutional neural network (CNN) to compose sentence representations from word vectors. Lee and Dernoncourt (2016) use a similar model, but they also experiment with RNNs to compose sentence representations. Similarly, Khanpour, Guntakandla, and Nielsen (2016) use an LSTM-based RNN to compose sentence representations. Ji, Haffari, and Eisenstein (2016) propose a latent variable RNN that can jointly model sequences of words (i.e., language modeling) and discourse relations between adjacent sentences. The discourse relations are modeled with a latent variable that can be marginalized during testing. In one experiment, they use coherence relations from the Penn Discourse Treebank corpus as the discourse relations. In another setting, they use speech acts from the SWBD corpus as the discourse relations. They show improvements on both language modeling and discourse relation prediction tasks. Shen and Lee (2016) use an attention-based LSTM-RNN model for speech act classification. The purpose of the attention is to focus on the relevant part of the input sentence. Tran, Zukerman, and Haffari (2017) use an online inference technique similar to the forward pass of the traditional forward-backward inference algorithm to improve upon the greedy decoding methods typically used in the RNN-based sequence labeling models. Vinyals and Le (2015) and Serban et al. (2016) use RNN-based encoder-decoder framework for conversation modeling. Vinyals and Le (2015) use a single RNN to encode all the previous utterances (i.e., by concatenating the tokens of previous utterances), whereas Serban et al. (2016) use a hierarchical encoder—one to encode the words in each utterance, and another to connect the encoded context vectors.
Li et al. (2015) compare recurrent neural models with recursive (syntax-based) models for several NLP tasks and conclude that recurrent models perform on par with the recursive for most tasks (or even better). For example, recurrent models outperform recursive on sentence level sentiment classification. This finding motivated us to use recurrent models rather than recursive ones.
2.2 Conditional Structured Models
There has been an explosion of interest in CRFs for solving structured output problems in NLP; see Smith (2011) for an overview. The most common type of CRF has a linear chain structure that has been used in sequence labeling tasks like part-of-speech (POS) tagging, chunking, named entity recognition, and many others (Sutton and McCallum 2012). Tree-structured CRFs have been used for parsing (e.g., Finkel, Kleeman, and Manning 2008).
The idea of combining neural networks with graphical models for speech act recognition goes back to Ries (1999), in which a feed-forward neural network is used to model the emission distribution of a supervised hidden Markov model (HMM). In this approach, each input sentence in a dialogue sequence is represented as a BOW vector, which is fed to the neural network. The corresponding sequence of speech acts is given by the hidden states of the HMM. Surendran and Levow (2006) first use support vector machines (SVMs) (i.e., local classifier) to estimate the probability of different speech acts for each individual utterance by combining sparse textual features (i.e., bag of n-grams) and dense acoustic features. The estimated probabilities are then used in the Viterbi algorithm to find the most probable tag sequence for a conversation. Julia and Iftekharuddin (2008) use a fusion of SVM and HMM classifiers with textual and acoustic features to classify utterances into speech acts.
More recently, Lample et al. (2016) proposed an LSTM-CRF model for named entity recognition (NER), which first generates a bi-directional LSTM encoding for each input word, and then it passes this representation to a CRF layer, whose task is to encourage global consistency of the NER tags. For each input word, the input to the LSTM consists of a concatenation of the corresponding word embedding and of character-level bi-LSTM embeddings for the current word. The whole network is trained end-to-end with backpropagation, which can be done effectively for chain-structured graphs. Ma and Hovy (2016) proposed a similar framework, but they replace the character-level bi-LSTM with a CNN. They evaluated their approach on POS and NER tagging tasks. Strubell et al. (2017) extended these models by substituting the word-level LSTM with an iterated dilated convolutional neural network, a variant of CNN, for which the effective context window in the input can grow exponentially with the depth of the network, while having a modest number of parameters to estimate. Their approach permits fixed-depth convolutions to run in parallel across entire documents, thus making use of GPUs, which yields up to 20-fold speed up, while retaining performance comparable to that of LSTM-CRF. Speech act recognition in asynchronous conversation posits a different problem, where the challenge is to model arbitrary conversational structures. In this work, we propose a general class of models based on pairwise CRFs that work on arbitrary graph structures.
2.3 Speech Act Recognition in Asynchronous Conversation
Previous studies on speech act recognition in asynchronous conversation have used supervised, semi-supervised, and unsupervised methods.
Cohen, Carvalho, and Mitchell (2004) first use the term e-mail speech act for classifying e-mails based on their acts (e.g., deliver, meeting). Their classifiers do not capture any contextual dependencies between the acts. To model contextual dependencies, Carvalho and Cohen (2005) use a collective classification approach with two different classifiers, one for content and one for context, in an iterative algorithm. The content classifier only looks at the content of the message, whereas the context classifier takes into account both the content of the message and the dialog act labels of its parent and children in the thread structure of the e-mail conversation. Our approach is similar in spirit to their approach with three crucial differences: (i) our CRFs are globally normalized to surmount the label bias problem, while their classifiers are normalized locally; (ii) the graph structure of the conversation is given in their case, which is not the case with ours; and (iii) their approach works at the comment level, whereas we work at the sentence level.
Identification of adjacency pairs like question-answer pairs in e-mail discussions using supervised methods was investigated in Shrestha and McKeown (2004) and Ravi and Kim (2007). Ferschke, Gurevych, and Chebotar (2012) use speech acts to analyze the collaborative process of editing Wiki pages, and apply supervised models to identify the speech acts in Wikipedia Talk pages. Other sentence-level approaches use supervised classifiers and sequence taggers (Qadir and Riloff 2011; Tavafi et al. 2013; Oya and Carenini 2014). Vosoughi and Roy (2016) trained off-the-shelf classifiers (e.g., SVM, naive Bayes, Logistic Regression) with syntactic (e.g., punctuations, dependency relations, abbreviations) and semantic feature sets (e.g., opinion words, vulgar words, emoticons) to classify tweets into six Twitter-specific speech act categories.
Several semi-supervised methods have been proposed for speech act recognition in asynchronous conversation. Jeong, Lin, and Lee (2009) use semi-supervised boosting to tag the sentences in e-mail and forum discussions with speech acts by inducing knowledge from annotated spoken conversations (Mrda meeting and SWBD telephone conversations). Given a sentence represented as a set of trees (i.e., dependency, n-gram tree, and POS tag tree), the boosting algorithm iteratively learns the best feature set (i.e., sub-trees) that minimizes the errors in the training data. This approach does not consider the dependencies between the act types, something we successfully exploit in our work. Zhang, Gao, and Li (2012) also use semi-supervised methods for speech act recognition in Twitter. They use a transductive SVM and a graph-based label propagation framework to leverage the knowledge from abundant unlabeled data. In our work, we leverage labeled data from synchronous conversations while adapting our model to account for the shift in the data distributions of the two domains. In our unsupervised adaptation scenario, we do not use any labeled data from the target (asynchronous) domain, whereas in the semi-supervised scenario, we use some labeled data from the target domain.
Among methods that use unsupervised learning, Ritter, Cherry, and Dolan (2010) propose two HMM-based unsupervised conversational models for modeling speech acts in Twitter. In particular, they use a simple HMM and a HMM+Topic model to cluster the Twitter posts (not the sentences) into act types. Because they use a unigram language model to define the emission distribution, their simple HMM model tends to find some topical clusters in addition to the clusters that are based on speech acts. The HMM+Topic model tries to separate the act indicators from the topic words. By visualizing the type of conversations found by the two models, they show that the output of the HMM+Topic model is more interpretable than that of the HMM one; however, their classification accuracy is not empirically evaluated. Therefore, it is not clear whether these models are actually useful, and which of the two models is a better speech act tagger. Paul (2012) proposes using a mixed membership Markov model to cluster sentences based on their speech acts, and show that this model outperforms a simple HMM. Joty, Carenini, and Lin (2011) propose unsupervised models for speech act recognition in e-mail and forum conversations. They propose a HMM+Mix model to separate out the topic indicators. By training their model based on a conversational structure, they demonstrate that conversational structure is crucial to learning a better speech act recognition model. In our work, we also demonstrate that conversational structure is important for modeling conversational dependencies, however, we do not use any given structure; rather, we build models based on arbitrary graph structures.
3. Our Approach
Let smn denote the m-th sentence of comment n in an asynchronous conversation; our goal is to find the corresponding speech act tag ymn ∈ 𝒯, where 𝒯 is the set of available tags. Our approach works in two main steps, as outlined in Figure 2. First, we use an RNN to encode each sentence into a task-specific distributed representation (i.e., embedding) by composing the words sequentially. The RNN is trained to classify sentences into speech act types, and is adapted to give domain-invariant sentence features when trained to leverage additional data from synchronous domains (e.g., meetings). In the second step, a structured model takes the sentence embeddings as input, and defines a joint distribution over sentences to capture the conversational dependencies. In the following sections, we describe these steps in detail.
Our two-step inference framework for speech act recognition in asynchronous conversation. Each sentence in the conversation is first encoded into a task-specific representation by a recurrent neural network (RNN). The RNN is trained on the speech act classification task, and leverages large labeled data from synchronous domains (e.g., meetings) in an adversarial domain adaptation training method. A structured model (CRF) then takes the encoded sentence vectors as input, and performs joint prediction over all sentences in a conversation.
3.1 Learning Task-Specific Sentence Representation
One of our main hypotheses is that a sentence representation method should consider the word order of the sentence. To this end, we use an RNN to encode each sentence into a vector by processing its words sequentially, at each time step combining the current input with the previous hidden state. Figure 3(a) demonstrates the process for three sentences. Initially, we create an embedding matrix E ∈ ℝ|𝒱|×D, where each row represents the distributed representation of dimension D for a word in a finite vocabulary 𝒱. We construct 𝒱 from the training data after filtering out the infrequent words.
A bidirectional LSTM-RNN to encode each sentence smn into a condensed vector zmn. The network is trained to classify each sentence into its speech act type.
Given an input sentence s = (w1, ⋯, wT) of length T, we first map each word wt to its corresponding index in E (equivalently, in 𝒱). The first layer of our network is a look-up layer that transforms each of these indices to a distributed representation xt ∈ ℝD by looking up the embedding matrix E. We consider E a model parameter to be learned by backpropagation. We can initialize E randomly or using pretrained word vectors (to be described in Section 4.2). The output of the look-up layer is a matrix in ℝT×D, which is fed to the recurrent layer.
Bidirectionality.
The RNN just described encodes information that it obtains only from the past. However, information from the future could also be crucial for recognizing speech acts. This is especially true for longer sentences, where a unidirectional LSTM can be limited in encoding the necessary information into a single vector. Bidirectional RNNs (Schuster and Paliwal 1997) capture dependencies from both directions, thus providing two different views of the same sentence. This amounts to having a backward counterpart for each of the equations from (1) to (5). For classification, we use the concatenation of the two encoded vectors, which summarize the past and the future, respectively.
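A minimal sketch of such a bidirectional LSTM-RNN sentence encoder is shown below, assuming a PyTorch implementation; the layer sizes, class name, and optional pretrained initialization are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Bidirectional LSTM sentence encoder (illustrative sizes)."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150, num_tags=5,
                 pretrained=None):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)        # embedding matrix E
        if pretrained is not None:                              # optional pretrained init
            self.lookup.weight.data.copy_(torch.tensor(pretrained))
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)          # softmax (output) layer

    def forward(self, word_ids):                                # word_ids: (batch, T)
        x = self.lookup(word_ids)                               # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(x)                              # h_n: (2, batch, hidden_dim)
        z = torch.cat([h_n[0], h_n[1]], dim=-1)                 # forward + backward summaries
        return self.out(z), z                                   # class logits and sentence vector z
```

The concatenated final states play the role of the sentence vector zmn that is later passed to the CRF.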
3.2 Adapting LSTM-RNN with Adversarial Training
The LSTM-RNN described in the previous section can model long-distance dependencies between words, and, given enough training data, it should be able to compose a sentence, capturing its syntactic and semantic properties. However, when it comes to speech act recognition in asynchronous conversations, as mentioned before, not many large corpora annotated with a standard tagset are available. Because of the large number of parameters, the LSTM-RNN model usually overfits when it is trained on small data sets of asynchronous conversations (shown later in Section 5).
One solution to address this problem is to use data from synchronous domains for which large annotated corpora are available (e.g., Mrda meeting corpus). However, as we will see, although simple concatenation of data sets generally improves the performance of the LSTM-RNN model, it does not provide the optimal solution because the conversations in synchronous and asynchronous domains are different in modality (spoken vs. written) and in style. In other words, to get the best out of the available synchronous domain data, we need to adapt our model.
Our goal is to adapt the LSTM-RNN encoder so that it learns to encode sentence representations z (i.e., features used for classification) that are not only discriminative for the act classification task, but also invariant across the domains. To this end, we propose to use the domain adversarial training of neural networks proposed recently by Ganin et al. (2016).
Let 𝒟S = {sn, yn}n=1N denote the set of N training instances (labeled) in the source domain (e.g., Mrda meeting corpus). We consider two possible adaptation scenarios:
Unsupervised adaptation: In this scenario, we have only unlabeled examples in the target domain (e.g., forum). Let 𝒟Tu = {sn}n=N+1M be the set of (M − N) unlabeled training instances in the target domain, with M being the total number of training instances in the two domains.
Supervised adaptation: In addition to the unlabeled instances 𝒟Tu, here we have access to some labeled training instances in the target domain, 𝒟Tl = {sn, yn}n=M+1L, with L being the total number of training examples in the two domains.
3.2.1 Unsupervised Adaptation.
Figure 4 shows our extended LSTM-RNN network trained for domain adaptation. The input sentence s is sampled either from a synchronous domain (e.g., meeting) or from an asynchronous (e.g., forum) domain. As before, we pass the sentence through a look-up layer and a bidirectional recurrent layer to encode it into a distributed representation z, using our bidirectional LSTM-RNN encoder. For domain adaptation, our goal is to adapt the encoder to generate z such that it is not only informative for the target classification task (i.e., speech act recognition) but also invariant across domains. Upon achieving this, we can use the adapted LSTM-RNN encoder to encode a target sentence, and use the source classifier (the softmax layer) to classify the sentence into its corresponding speech act type.
In our gradient descent training, the min-max optimization is achieved by reversing the gradients (Ganin et al. 2016) of the domain discrimination loss 𝓛d(ω, θ), when they are backpropagated to the encoder. As shown in Figure 4, the gradient reversal is applied to the recurrent and embedding layers. This optimization set-up is related to the training method of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), where the goal is to build deep generative models that can generate realistic images. The discriminator in GANs tries to distinguish real images from model-generated images, and thus the training attempts to minimize the discrepancy between the two image distributions. When backpropagating to the generator network, they consider a slight variation of the reverse gradients with respect to the discriminator loss. In particular, if D(x̃) denotes the probability that the discriminator assigns to a generated image x̃ being real, rather than reversing the gradients of log(1 − D(x̃)), they backpropagate the gradients of log D(x̃) to the generator. Reversing the gradient is just a different way of achieving the same goal.
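The gradient reversal itself can be implemented as a small autograd function that acts as the identity in the forward pass and flips (and scales by λ) the gradients on the way back. The following PyTorch sketch illustrates this; the discriminator call in the trailing comment is hypothetical.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the shared encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage: z is the encoder output, and a separate discriminator network
# predicts the domain from the reversed representation.
# domain_logits = discriminator(grad_reverse(z, lam))
```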
Algorithm 1 presents pseudocode of our training algorithm based on stochastic gradient descent (SGD). We first initialize the model parameters by sampling from Glorot-uniform distribution (Glorot and Bengio 2010). We then form minibatches of size b by randomly sampling b/2 labeled examples from 𝒟S and b/2 unlabeled examples from 𝒟Tu. For labeled instances, both 𝓛c(W, θ) and 𝓛d(ω, θ) losses are active, while only 𝓛d(ω, θ) is active for unlabeled instances.
The main challenge in adversarial training is to balance the two components (the task classifier and the discriminator) of the network. If one component becomes smarter, its loss to the shared layer becomes useless, and the training fails to converge (Arjovsky, Chintala, and Bottou 2017). Equivalently, if one component gets weaker, its loss overwhelms that of the other, causing training to fail. In our experiments, we found the domain discriminator to be weaker; initially, it could not distinguish the domains often. To balance the two components, we would need the error signals from the discriminator to be fairly weak initially, with full power unleashed only as the classification errors start to dominate. We follow the weighting schedule proposed by Ganin et al. (2016, page 21), who initialize λ to 0, and then change it gradually to 1 as training progresses. That is, we start training the task classifier first, and we gradually add the discriminator’s loss.
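For concreteness, the sketch below shows one way to implement this schedule, assuming the form used by Ganin et al. (2016), λp = 2/(1 + exp(−γ·p)) − 1 with γ = 10, where p is the fraction of training completed; the constants are taken from that work and are not specific to our setting.

```python
import math

def lambda_schedule(step, total_steps, gamma=10.0):
    """Ramp the discriminator weight lambda from 0 toward 1 as training progresses."""
    p = step / max(1, total_steps)                  # fraction of training completed
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

# Example: lambda is ~0 early on (task classifier dominates) and approaches 1 later,
# when it is passed to the gradient reversal layer / used to weight the domain loss.
# lam = lambda_schedule(step, total_steps)
```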
3.2.2 Supervised Adaptation.
3.3 Conditional Structured Model for Conversational Dependencies
Given the vector representation of the sentences in an asynchronous conversation, we explore two different approaches to learn classification functions. The first and the traditional approach is to learn a local classifier, ignoring the structure in the output and using it for predicting the label of each sentence separately. Indeed, this is the approach we took in the previous subsection when we fed the output layer of the LSTM RNNs (Figure 3 and 4) with the sentence vectors. However, this approach does not model the conversational dependency between sentences in a conversation (e.g., adjacency relations between question-answer and request-accept pairs).
Examples of conditional structured models for speech act recognition in asynchronous conversation. The sentence vectors (zmn) are extracted from the LSTM-RNN model.
3.3.1 Training and Inference in CRFs.
Traditionally, CRFs have been trained using offline methods like limited-memory BFGS (Murphy 2012). Online training of CRFs using SGD was proposed by Vishwanathan et al. (2006). Because RNNs are trained with online methods, to compare our two methods, we use an SGD-based algorithm to train our CRFs. Algorithm 2 gives the pseudocode of the training procedure.
We use Belief Propagation (BP) (Pearl 1988) for inference in our CRFs. BP is guaranteed to converge to an exact solution if the graph is a tree. However, exact inference is intractable for graphs with loops. Despite this, Pearl (1988) advocates for BP in loopy graphs as an approximation (see Murphy 2012, page 768). The algorithm is then called loopy BP. Although loopy BP gives approximate solutions for general graphs, it often works well in practice (Murphy, Weiss, and Jordan 1999), outperforming other methods such as mean field (Weiss 2001).
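The sketch below is a compact sum-product loopy BP routine over a generic pairwise model, written with NumPy; the dictionary-based representation of node and edge potentials is illustrative and is not tied to the exact parameterization of our CRF.

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, edges, n_iters=20):
    """Sum-product loopy belief propagation on a pairwise model.

    node_pot: dict  node -> array of shape (K,)       unary potentials
    edge_pot: dict (i, j) -> array of shape (K, K)    pairwise potentials (rows index node i)
    edges:    list of undirected edges (i, j)
    Returns approximate per-node marginal beliefs.
    """
    K = len(next(iter(node_pot.values())))
    msgs = {}
    for i, j in edges:                                  # messages for both directions
        msgs[(i, j)] = np.ones(K) / K
        msgs[(j, i)] = np.ones(K) / K
    neighbors = {n: [] for n in node_pot}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Product of the unary potential and incoming messages, excluding the one from j.
            prod = node_pot[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            pot = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
            m = pot.T @ prod                            # sum over the states of node i
            new_msgs[(i, j)] = m / m.sum()              # normalize for numerical stability
        msgs = new_msgs

    beliefs = {}
    for n in node_pot:
        b = node_pot[n].copy()
        for k in neighbors[n]:
            b *= msgs[(k, n)]
        beliefs[n] = b / b.sum()
    return beliefs
```

On a tree this converges to the exact marginals; on loopy graphs the fixed number of synchronous iterations gives the usual approximation.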
3.3.2 Variations of Graph Structures.
One of the advantages of the pairwise CRF in Equation (13) is that we can define this model over arbitrary graph structures, which allows us to capture conversational dependencies at various levels. Modeling the arbitrary graph structure can be crucial, especially in scenarios where the reply-to structure of the conversation is not known. By defining structured models over plausible graph structures, we can get a sense of the underlying conversational structure. We distinguish between two types of conversational dependencies:
Intra-comment connections: This defines how the speech acts of the sentences inside a comment are connected with each other.
Across-comment connections: This defines how the speech acts of the sentences across comments are connected in a conversation.
Table 1 summarizes the connection types that we have explored in our CRF models. Each configuration of intra- and across-connections yields a different pairwise CRF. Figure 6 shows four such CRFs with three comments — C1 being the first comment, and Ci and Cj being two other comments in the conversation. Figure 6(a) shows the structure for the NO-NO configuration, where there is no link between nodes of both intra- and across-comments. In this setting, the CRF model boils down to the MaxEnt model. Figure 6(b) shows the structure for LC-LC configuration, where there are linear chain relations between nodes of both intra- and across-comments. The linear chain across comments refers to the structure, where the last sentence of each comment is connected to the first sentence of the comment that comes next in the temporal order.
| Tag | Connection type | Applicable to |
|---|---|---|
| NO | No connection between nodes | intra & across |
| LC | Linear chain connection | intra & across |
| FC | Fully connected | intra & across |
| FC1 | Fully connected with first comment only | across |
| LC1 | Linear chain with first comment only | across |
Figure 6(c) shows the CRF for LC-LC1, in which the sentences inside a comment have linear chain connections, and the last sentence of the first comment is connected to the first sentence of the other comments. Figure 6(d) shows the graph structure for LC-FC1 configuration, in which the sentences inside comments have linear chain connections, and sentences of the first comment are fully connected with the sentences of the other comments. Similarly, Figure 6(e) and 6(f) show the graph structures for FC-LC and FC-FC configurations.
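To make these configurations concrete, the sketch below builds an undirected edge list over sentence indices for a given intra/across setting; treating FC across-connections as full connections between consecutive comments is an assumption made here purely for illustration.

```python
def build_edges(comment_lens, intra="LC", across="LC1"):
    """Build undirected edges between sentence indices of a conversation.

    comment_lens: number of sentences in each comment, in temporal order.
    intra:  "NO", "LC", or "FC"             -- connections inside each comment
    across: "NO", "LC", "FC", "LC1", "FC1"  -- connections across comments
    """
    def chain(nodes):
        return {(a, b) for a, b in zip(nodes, nodes[1:])}

    def clique(nodes):
        return {(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]}

    comments, start = [], 0
    for n in comment_lens:                      # global sentence indices per comment
        comments.append(list(range(start, start + n)))
        start += n

    edges = set()
    for c in comments:                          # intra-comment connections
        if intra == "LC":
            edges |= chain(c)
        elif intra == "FC":
            edges |= clique(c)

    first = comments[0]                         # across-comment connections
    for prev, cur in zip(comments, comments[1:]):
        if across == "LC":                      # last sentence of prev -> first of next
            edges.add((prev[-1], cur[0]))
        elif across == "FC":                    # fully connect consecutive comments (assumption)
            edges |= {(a, b) for a in prev for b in cur}
    for other in comments[1:]:
        if across == "LC1":                     # last sentence of first comment -> first of each other comment
            edges.add((first[-1], other[0]))
        elif across == "FC1":                   # first comment fully connected with each other comment
            edges |= {(a, b) for a in first for b in other}
    return sorted(edges)

# For example, the LC-LC1 structure of Figure 6(c) for a three-comment conversation:
# build_edges([3, 2, 2], intra="LC", across="LC1")
```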
4. Corpora
In this section, we describe the data sets used in our experiments. We use a number of labeled data sets to train and test our models, one of which we constructed in this work. Additionally, we use a large unlabeled conversational data set to train our (unsupervised) word embedding models.
4.1 Labeled Corpora
There exist large corpora of utterances annotated with speech acts in synchronous spoken domains, for example, Switchboard-DAMSL (Swbd) (Jurafsky, Shriberg, and Biasca 1997) and Meeting Recorder Dialog Act (Mrda) (Dhillon et al. 2004). However, the asynchronous domain lacks such large corpora. Some prior studies (Cohen, Carvalho, and Mitchell 2004; Feng et al. 2006; Ravi and Kim 2007; Bhatia, Biyani, and Mitra 2014) tackle the task at the comment level, and use task-specific tagsets. In contrast, in this work we are interested in identifying speech acts at the sentence level, and also using a standard tagset like the ones defined in Swbd or Mrda.
Several studies attempt to solve the task at the sentence level. Jeong, Lin, and Lee (2009) created a data set of TripAdvisor (TA) forum conversations annotated with the standard 12 act types defined in Mrda. They also remapped the BC3 e-mail corpus (Ulrich, Murray, and Carenini 2008) according to this tagset. Table 2 shows the tags and their relative frequency in the two data sets. Subsequent studies (Joty, Carenini, and Lin 2011; Tavafi et al. 2013; Oya and Carenini 2014) use these data sets. We also use these data sets in our work. Table 3 shows some basic statistics about these data sets. On average, BC3 conversations are longer than those of TripAdvisor in terms of both number of comments and number of sentences.
| Tag | Description | BC3 | TA |
|---|---|---|---|
| S | Statement | 69.56% | 65.62% |
| P | Polite mechanism | 6.97% | 9.11% |
| QY | Yes-no question | 6.75% | 8.33% |
| AM | Action motivator | 6.09% | 7.71% |
| QW | Wh-question | 2.29% | 4.23% |
| A | Accept response | 2.07% | 1.10% |
| QO | Open-ended question | 1.32% | 0.92% |
| AA | Acknowledge and appreciate | 1.24% | 0.46% |
| QR | Or/or-clause question | 1.10% | 1.16% |
| R | Reject response | 1.06% | 0.64% |
| U | Uncertain response | 0.79% | 0.65% |
| QH | Rhetorical question | 0.75% | 0.08% |
| | TA | BC3 | QC3 |
|---|---|---|---|
| Total number of conversations | 200 | 39 | 47 |
| Average number of comments per conversation | 4.02 | 6.54 | 13.32 |
| Average number of sentences per conversation | 18.56 | 34.15 | 33.28 |
| Average number of words per sentence | 14.90 | 12.61 | 19.78 |
Since these data sets are relatively small in size with sparse tag distributions, we group the 12 act types into 5 coarser classes to learn a reasonable classifier. Some prior work (Tavafi et al. 2013; Oya and Carenini 2014) has also taken the same approach. More specifically, all the question types are grouped into one general class Question, all response types into Response, and appreciation and polite mechanisms into the Polite class.
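This grouping can be written as a simple mapping from the 12 fine-grained tags of Table 2 to the 5 coarse classes; the dictionary below is only an illustration of that regrouping.

```python
# Mapping from the 12 fine-grained act types (Table 2) to the 5 coarse classes.
COARSE_MAP = {
    "S":  "Statement",
    "AM": "Suggestion",                      # action motivator
    "QY": "Question", "QW": "Question", "QO": "Question",
    "QR": "Question", "QH": "Question",      # all question types
    "A":  "Response", "R":  "Response", "U": "Response",
    "P":  "Polite", "AA": "Polite",          # polite mechanism, acknowledge & appreciate
}

coarse_tag = COARSE_MAP["QY"]   # -> "Question"
```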
In addition to the asynchronous data sets – TA, BC3, and QC3 (to be introduced subsequently), we also demonstrate the performance of our models on the synchronous Mrda meeting corpus, and use it for domain adaptation. Table 4 shows the label distribution of the resulting data sets; Statement is the most dominant class, followed by Question, Polite, and Suggestion.
| Tag | Description | TA (%) | BC3 (%) | QC3 (%) | Mrda (%) |
|---|---|---|---|---|---|
| SU | Suggestion (Action motivator) | 7.71 | 5.48 | 17.38 | 5.97 |
| R | Response (Accept, Reject, Uncertain) | 2.4 | 3.75 | 5.24 | 15.63 |
| Q | Questions (Yes-no, Wh, Rhetorical, Or-clause, Open-ended) | 14.71 | 8.41 | 12.59 | 8.62 |
| P | Polite (Acknowledge & appreciate, Polite) | 9.57 | 8.63 | 6.13 | 3.77 |
| ST | Statement | 65.62 | 73.72 | 58.66 | 66.00 |
4.1.1 QC3 Conversational Corpus: A New Asynchronous Data Set.
Because both TripAdvisor and BC3 are too small to support general conclusions about model performance in asynchronous conversations, we have created a new annotated data set of forum conversations called the Qatar Computing Conversational Corpus, or QC3. We selected 50 conversations from a popular community question answering site named Qatar Living for our annotation. We used three conversations for our pilot study and the remaining 47 for the actual study. The resulting corpus, as shown in the last column of Table 3, contains on average 13.32 comments and 33.28 sentences per conversation, and 19.78 words per sentence.
Two native speakers of English annotated each conversation using a Web-based annotation framework (Ulrich, Murray, and Carenini 2008). They were asked to annotate each sentence with the most appropriate speech act tag from the list of five speech act types. Because this task is not always obvious, we gave them detailed annotation guidelines with real examples. We use Cohen’s κ to measure the agreement between the annotators. The third column in Table 5 presents the κ values for the act types, which vary from 0.43 (for Response) to 0.87 (for Question).
| Tag | Speech act | Cohen's κ |
|---|---|---|
| SU | Suggestion | 0.86 |
| R | Response | 0.43 |
| Q | Question | 0.87 |
| P | Polite | 0.75 |
| ST | Statement | 0.78 |
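Agreement can be computed with scikit-learn's implementation of Cohen's κ; binarizing each annotator's labels against a single act type, as sketched below, is one plausible way to obtain per-class values like those in Table 5.

```python
from sklearn.metrics import cohen_kappa_score

def per_class_kappa(annot1, annot2, tag):
    """Cohen's kappa for one act type, computed on binarized annotations."""
    y1 = [int(t == tag) for t in annot1]
    y2 = [int(t == tag) for t in annot2]
    return cohen_kappa_score(y1, y2)

# Overall agreement over all five classes:
# cohen_kappa_score(annot1, annot2)
```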
In order to create a consolidated data set, we collected the disagreements between the two annotators, and used a third annotator to resolve those cases. The fifth column in Table 4 presents the distribution of the speech acts in the resulting data set. As we can see, after Statement, Suggestion is the most frequent class, followed by the Question and the Polite classes.
4.2 Conversational Word Embeddings
One simple way to exploit unlabeled data for semi-supervised learning is to use word embeddings that are learned from large unlabeled data sets (Turian, Ratinov, and Bengio 2010). Word embeddings such as word2vec skip-gram (Mikolov, Yih, and Zweig 2013) and Glove vectors (Pennington, Socher, and Manning 2014) capture syntactic and semantic properties of words and their linguistic regularities in the vector space. The skip-gram model was trained on part of the Google news data set containing about 100 billion words, and it contains 300-dimensional vectors for 3 million unique words and phrases.8 Glove was trained on the combination of Wikipedia 2014 and Gigaword 5 data sets containing 6B tokens and 400K unique (uncased) words. It comes with 50d, 100d, 200d, and 300d vectors.9 For our experiments, we use the 300d vectors.
Many recent studies have shown that the pretrained embeddings improve the performance on supervised tasks (Schnabel et al. 2015). In our work, we have used these generic off-the-shelf pretrained embeddings to boost the performance of our models. In addition, we have also trained the word2vec skip-gram model and Glove on a large conversational corpus to obtain more relevant conversational word embeddings. Later in our experiments (Section 5) we will demonstrate that the conversational word embeddings are more effective than the generic ones because they are trained on similar data sets.
To train the word embeddings, we collected conversations of both synchronous and asynchronous types. For asynchronous, we collected e-mail threads from W3C (w3c.org), and forum conversations from TripAdvisor and QatarLiving sites. The raw data was too noisy to directly inform our models, as it contains system messages and signatures. We cleaned up the data with the intention of keeping only the headers, bodies, and quotations. For synchronous, we used the utterances from the Swbd and Mrda corpora. Table 6 shows some basic statistics about these (unlabeled) data sets. We trained our word vectors on the concatenated set of all data sets (i.e., 120M tokens). Note that the conversations in our labeled data sets were taken from these sources (e.g., BC3 from W3C, QC3 from QatarLiving, and TA from TripAdvisor.)
| Type | Domain | Data set | Number of threads | Number of tokens | Number of words |
|---|---|---|---|---|---|
| Asynchronous | E-mail | W3C | 23,940 | 21,465,830 | 546,921 |
| | Forum | TripAdvisor | 25,000 | 2,037,239 | 127,233 |
| | Forum | Qatar Living | 219,690 | 103,255,922 | 1,157,757 |
| Synchronous | Meeting | Mrda | - | 675,110 | 18,514 |
| | Phone | Swbd | - | 1,131,516 | 57,075 |
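A sketch of how the conversational skip-gram vectors could be trained on the concatenated corpus with gensim is given below; the file path, window size, and minimum count are assumptions, and only the 300-dimensional skip-gram setting is taken from the article.

```python
from gensim.models import Word2Vec

# sentences: tokenized utterances/sentences from the concatenated conversational corpus
# (W3C, TripAdvisor, Qatar Living, Mrda, Swbd), one per line in a placeholder file.
with open("conversations.tok.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,   # 300-dimensional vectors (gensim >= 4.0 argument name)
    sg=1,              # skip-gram
    window=5,
    min_count=5,
    workers=4,
)
model.wv.save_word2vec_format("conv_word2vec_300d.txt")
```

Training the Glove variant on the same corpus follows the same recipe with the Glove toolkit in place of gensim.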
5. Experiments
In this section, we present our experimental settings, results, and analysis. We start with an outline of the experiments.
5.1 Outline of Experiments
Our main objective is to evaluate our speech act recognizer on asynchronous conversations. For this, we evaluate our models on the forum and e-mail data sets introduced earlier in Section 4.1: (i) our newly created QC3 data set, (ii) the TripAdvisor (TA) data set from Jeong, Lin, and Lee (2009), and (iii) the BC3 e-mail corpus from (Ulrich, Murray, and Carenini 2008). In addition, we validate our sentence encoding approach on the Mrda meeting corpus.
Because of the noisy and informal nature of conversational texts, we performed a series of preprocessing steps before using it for training or testing. We normalize all characters to their lowercased forms, truncate elongations to two characters, and spell out every digit and URL. We further tokenized the texts using the CMU TweetNLP tool (Gimpel et al. 2011).
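A rough sketch of these normalization steps is given below; mapping digits and URLs to placeholder tokens is one reading of "spell out", and the whitespace split only stands in for the CMU TweetNLP tokenizer actually used.

```python
import re

def normalize(text):
    """Rough approximation of the preprocessing steps described above."""
    text = text.lower()                                        # lowercase all characters
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                 # truncate elongations ("sooo" -> "soo")
    text = re.sub(r"https?://\S+|www\.\S+", " <url> ", text)   # map URLs to a placeholder token
    text = re.sub(r"\d+", " <number> ", text)                  # map digits to a placeholder token
    return text

# The article tokenizes with the CMU TweetNLP tool; a whitespace split is a stand-in here.
tokens = normalize("Check http://example.com, it's soooo good!!!").split()
```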
For performance comparison, we use both accuracy and macro-averaged F1 score. Accuracy gives the overall performance of a classifier but could be biased toward the most populated classes, whereas macro-averaged F1 weights every class equally, and is not influenced by class imbalance. Statistical significance tests are done using an approximate randomization test based on the accuracy.10 We used SIGF V.2 (Padó 2006) with 10,000 iterations.
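Both metrics are available in scikit-learn, as in the following sketch (the significance test itself is run with the external SIGF tool).

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Report overall accuracy and macro-averaged F1 over the speech act classes."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),  # every class weighted equally
    }

# e.g., evaluate(["Question", "Statement"], ["Question", "Polite"])
```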
In the following, we first demonstrate the effectiveness of our LSTM-RNN model for learning task-specific sentence encoding by training it on the task in three different settings: (i) training on in-domain data only, (ii) training on a simple concatenation of synchronous and asynchronous data, and (iii) training it with adversarial training for domain adaptation. We also compare the effectiveness of different embedding types in these three training settings. The best task-specific embeddings are then extracted and fed into the CRF models to learn inter-sentence dependencies. In Section 5.3, we compare how our CRF models with different conversational graph structure perform. Table 7 gives an outline of our experimental roadmap.
| Model Tested | Training Regime | Section | Corpora Used |
|---|---|---|---|
| LSTM-RNN | In-domain supervised | 5.2.2 | QC3/TA/BC3/Mrda (all labeled) |
| | Concatenation supervised | 5.2.3 | QC3+TA+BC3+Mrda (labeled) |
| | Unsup. adaptation | 5.2.4 | QC3/TA/BC3 (unlabeled) + Mrda (labeled) |
| | Semi-sup. adaptation | 5.2.4 | QC3/TA/BC3 (labeled) + Mrda (labeled) |
| CRFs | In-domain supervised | 5.3 | QC3/TA/BC3 (labeled; conversation level) |
5.2 Effectiveness of LSTM RNN
We first describe the experimental settings for our LSTM RNN sentence encoding model—the data set splits, training settings, and compared baselines. Then we present our results on the three training scenarios as outlined in Table 7.
5.2.1 Experimental Settings.
| Corpora | Type | Train | Dev. | Test |
|---|---|---|---|---|
| QC3 | asynchronous | 1,252 | 157 | 156 |
| TA | asynchronous | 2,968 | 372 | 371 |
| BC3 | asynchronous | 1,065 | 34 | 133 |
| Mrda | synchronous | 50,865 | 8,366 | 10,492 |
| Total (Concat) | asynchronous + synchronous | 56,150 | 8,929 | 11,152 |
We compare the performance of our LSTM-RNN model with MaxEnt (ME) and Multi-layer Perceptron (MLP) with one hidden layer.11 In one setting, we fed them with the bag-of-words (BOW) representation of the sentence, namely, vectors containing binary values indicating the presence or absence of a word in the training set vocabulary. In another setting, we use a concatenation of the pretrained word embeddings as the sentence representation.
We train the models by optimizing the cross entropy in Equation (7) using the gradient-based learning algorithm ADAM (Kingma and Ba 2014).12 The learning rate and other parameters were set to the values as suggested by the authors. To avoid overfitting, we use dropout (Srivastava et al. 2014) of hidden units and early-stopping based on the loss on the development set.13 Maximum number of epochs was set to 50 for RNNs, ME, and MLP. We experimented with dropout rates of {0.0, 0.2, 0.4}, minibatch sizes of {16, 32, 64}, and hidden layer units of {100, 150, 200} in MLP and LSTMs. The vocabulary 𝒱 in LSTMs was limited to the most frequent P% (P ∈ {85, 90, 95}) words in the training corpus, where P is considered a hyperparameter.
We initialize the word vectors in our model either by sampling randomly from the small uniform distribution 𝒰(−0.05, 0.05), or by using pretrained embeddings. The dimension for random initialization was set to 128. For pretrained embeddings, we experiment with off-the-shelf embeddings that come with word2vec (Mikolov et al. 2013b) and Glove (Pennington, Socher, and Manning 2014) as well as with our conversational word embeddings (Section 4.2).
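A small sketch of this initialization, assuming the pretrained vectors are available as a word-to-vector dictionary, is shown below; words missing from the pretrained vocabulary fall back to the uniform initialization.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300):
    """Initialize E from pretrained vectors; unknown words get U(-0.05, 0.05)."""
    E = np.random.uniform(-0.05, 0.05, size=(len(vocab), dim))
    for word, idx in vocab.items():        # vocab: word -> row index in E
        if word in pretrained:             # pretrained: word -> vector (e.g., Glove)
            E[idx] = pretrained[word]
    return E
```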
We experimented with four variations of our LSTM-RNN model: (i) U-LSTMrand, referring to unidirectional RNN with random word vector initialization; (ii) U-LSTMpre, referring to unidirectional RNN initialized with pretrained word embeddings of type pre; (iii) B-LSTMrand, referring to bidirectional RNN with random initialization; and (iv) B-LSTMpre, referring to bidirectional RNN initialized with pretrained word vectors of type pre.
5.2.2 Results for In-Domain Training.
Before reporting the performance of our sentence encoding model on asynchronous domains, we first evaluate it on the (synchronous) Mrda meeting corpus where it can be compared to previous studies on a large data set.
Results on MRDA Meeting Corpus.
Table 9 presents the results on Mrda for in-domain training. The first two rows show the best results reported so far on this data set from Jeong, Lin, and Lee (2009) for classifying sentences into 12 speech act types; the first row shows the results of the model that uses only n-grams, and the second row shows the results using all of the features, including n-grams, speaker, part-of-speech, and dependency structure. Note that our LSTM RNNs and their n-gram model use the same word sequence information.
| Model | Pretrained Embedding | Mrda (5 classes) | Mrda (12 classes) |
|---|---|---|---|
| Jeong, Lin, and Lee (2009) (n-gram) | - | - | 57.53 (83.30) |
| Jeong, Lin, and Lee (2009) (all features) | - | - | 59.04 (83.49) |
| MEbow | - | 65.25 (83.95) | 57.79 (82.84) |
| MLPbow | - | 68.12 (84.24) | 58.19 (83.24) |
| U-LSTMrandom | - | 71.19 (84.38) | 58.72 (83.34) |
| U-LSTMgoogle-w2v | word2vec (Google) | 72.32 (84.19) | 59.05 (83.26) |
| U-LSTMglove | Glove (off-the-shelf) | 72.24 (84.93) | 60.02 (83.14) |
| B-LSTMrandom | - | 71.26 (84.12) | 60.98 (83.04) |
| B-LSTMgoogle-w2v | word2vec (Google) | 72.34 (84.39) | 61.72 (83.17) |
| B-LSTMglove | Glove (off-the-shelf) | 72.41 (84.80) | 62.33 (82.82) |
| B-LSTMconv-w2v | word2vec (conversation) | 72.13 (85.42*) | 62.18 (83.23) |
| B-LSTMconv-glove | Glove (conversation) | 72.88 (85.43*) | 62.53 (83.61) |
The second group of results (third and fourth rows) is for the ME and MLP models with BOW sentence representation. The third group shows the results for uni-directional LSTMs with random and pretrained off-the-shelf embeddings. The fourth group shows the corresponding results for bi-directional LSTMs. Finally, the fifth group presents the results for bi-directional LSTMs with our conversational embeddings. To compare our results with those of Jeong, Lin, and Lee (2009), we ran our models on the 12-class classification task in addition to our original 5-class task.
It can be observed that all of our LSTM-RNNs achieve state-of-the-art results, and the bi-directional ones with pretrained embeddings generally perform better than others in terms of the F1-score. The best results are obtained with our conversational embeddings. Our best model B-LSTMconv-glove (B-LSTM with Glove conversational embeddings) gives absolute improvements of about 5.0% and 3.5% in F1 compared to the n-gram and all-features models, respectively, of Jeong, Lin, and Lee (2009). This is remarkable because our LSTM-RNNs learn the sentence representation automatically from the word sequence and do not use any hand-engineered features.
Results on Asynchronous Data Sets.
Now let us consider the results in Table 10 for the asynchronous data sets (QC3, TA, and BC3). In addition to the random (20%) test set of Table 8, we show the results of our models based on 5-fold cross validation. The 5-fold setting gives a more general estimate of a model's performance on a particular data set. For simplicity, we only report results for the Glove embeddings, which we found to be superior to the word2vec embeddings.
Table 10. Results on the asynchronous data sets (QC3, TA, and BC3) for in-domain training: macro F1, with accuracy in parentheses, on the random test set and with 5-fold cross validation.

| Model | QC3 (Testset) | QC3 (5 folds) | TA (Testset) | TA (5 folds) | BC3 (Testset) | BC3 (5 folds) |
|---|---|---|---|---|---|---|
| MEbow | 55.11 (76.28) | 55.15 (73.16) | 62.82 (82.47) | 62.65 (85.04) | 54.37 (84.47) | 52.69 (81.78) |
| MLPbow | 56.71 (74.35) | 59.72 (72.46) | 70.45 (83.83) | 65.18 (84.02) | 63.98 (84.58) | 62.37 (82.04) |
| U-LSTMrandom | 54.52 (70.51) | 53.39 (67.22) | 64.52 (80.32) | 59.20 (80.06) | 44.41 (81.95) | 42.21 (72.44) |
| U-LSTMglove | 59.95 (72.44) | 55.56 (70.03) | 67.70 (83.83) | 60.82 (83.22) | 45.67 (78.95) | 43.75 (73.50) |
| U-LSTMconv-glove | 60.59 (75.64) | 58.70 (72.78) | 69.48 (83.56) | 64.64 (83.39) | 53.51 (84.21) | 49.67 (77.71) |
| B-LSTMrandom | 57.57 (74.35) | 58.24 (72.46) | 74.70 (86.25*) | 67.08 (84.53) | 47.12 (81.20) | 44.97 (77.59) |
| B-LSTMglove | 59.16 (73.07) | 58.86 (72.45) | 75.49 (86.77*) | 68.31 (83.81) | 51.15 (84.21) | 50.67 (75.59) |
| B-LSTMconv-glove | 64.72 (77.56*) | 63.47 (75.59*) | 76.15 (86.52*) | 69.59 (86.18*) | 61.44 (83.45) | 55.84 (79.95) |
We can observe trends similar to those for Mrda: (i) bidirectional LSTMs outperform their unidirectional counterparts, (ii) pretrained Glove vectors provide better results than randomly initialized ones, and (iii) conversational word embeddings give the best results among the embedding types. When we compare these results with those of the baselines (MEbow and MLPbow), we see that our method outperforms them on QC3 and TA (by 3.8% to 8.0% in F1), but fails to do so on BC3. This is due to the small size of that data set, which hurts deep neural methods like LSTM-RNNs; such methods usually require a substantial amount of labeled data to learn an effective compositional model. In the following, we show the effect of adding more labeled data from the Mrda meeting corpus.
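For readers who wish to reproduce this evaluation protocol, the following is a hypothetical scikit-learn sketch of how macro F1 (with accuracy in parentheses) could be computed under a 5-fold scheme. The features, labels, and classifier are synthetic stand-ins, not our data or models.

```python
# Hypothetical scoring sketch for the 5-fold protocol: macro F1 with accuracy
# in parentheses, as in Table 10. Data and classifier are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, accuracy_score

X = np.random.randn(500, 300)             # e.g., 300-dim sentence embeddings
y = np.random.randint(0, 5, size=500)     # 5 speech act classes

f1s, accs = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    f1s.append(f1_score(y[test_idx], pred, average="macro"))
    accs.append(accuracy_score(y[test_idx], pred))

print(f"{100 * np.mean(f1s):.2f} ({100 * np.mean(accs):.2f})")
```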
5.2.3 Adding Meeting Data.
To validate our claim that LSTM-RNNs can learn a more effective model for our task when they are provided with enough training data, we create a concatenated training setting by merging the training and the development sets of the four corpora in Table 8 (see the Train and Dev. columns in the last row); the test set for each data set remains the same. We will refer to this train-test setting as Concat.
Table 11 shows the results of the baseline and the B-LSTM models on the three test sets for this concatenated training setting. We notice that our B-LSTM models with pretrained embeddings outperform MEbow and MLPbow significantly. Again, the conversational Glove embeddings prove to be the strongest word vectors, giving the best results across the data sets. Our best model gives absolute improvements of 2% to 12% in F1 over the best baselines across the data sets.
Table 11. Results on the asynchronous test sets for the concatenated (Concat) training setting: macro F1, with accuracy in parentheses.

| Model | Pretrained Embedding | QC3 (Testset) | TA (Testset) | BC3 (Testset) |
|---|---|---|---|---|
| MEbow | - | 50.64 (71.15) | 72.49 (84.10) | 53.17 (76.00) |
| MLPbow | - | 58.60 (74.36) | 73.07 (85.29) | 56.19 (78.00) |
| B-LSTMgoogle-w2v | word2vec (off-the-shelf) | 67.00 (79.49*) | 74.63 (87.67*) | 56.55 (80.04*) |
| B-LSTMglove | Glove (off-the-shelf) | 62.71 (80.13*) | 76.61 (87.33*) | 54.87 (80.00*) |
| B-LSTMconv-w2v | word2vec (conversation) | 66.34 (79.48*) | 75.03 (86.55*) | 58.28 (79.00*) |
| B-LSTMconv-glove | Glove (conversation) | 70.51 (80.77*) | 78.08 (88.95*) | 57.47 (80.00*) |
When we compare these results with those in Table 10, we notice that with more heterogeneous data sets, B-LSTM, by virtue of its distributed and condensed representation, generalizes well across different domains. In contrast, ME and MLP, because of their BOW representation, suffer from the data diversity across domains. These results also confirm that B-LSTM yields a better sentence representation than BOW when given enough training data.
Comparison with Other Classifiers and Sentence Encoders.
Now, we compare our best B-LSTM model (i.e., B-LSTMconv-glove) with other classifiers and sentence encoders in the concatenated (Concat) training setting. The models that we compare with are:
- (a)
MEconv-glove: We represent each sentence as a concatenated vector of its word vectors, and train a MaxEnt (ME) classifier on this representation. For the word vectors, we use our best performing conversational Glove vectors, as in our B-LSTMconv-glove model. We set a maximum sentence length of 100 words and use zero-padding for shorter sentences (see the sketch below). This model has a total of 100 (input words) × 300 (embedding dimensions) × 5 (class labels) = 150,000 trainable parameters.14
- (b)
MLPconv-glove: We represent each sentence as above, and train a one-hidden-layer Multi-layer Perceptron (MLP) on this representation. The hidden layer has 1,000 units, determined based on performance on the development set. This model has a total of (100 × 300) × 1,000 + 1,000 × 5 = 30,005,000 trainable parameters.
- (c)
MEconv-glove-averaging: We represent each sentence as a mean vector of its word vectors, and train a MaxEnt classifier using this representation. This model has a total of 300 × 5 = 1,500 trainable parameters.
- (d)
SVMconv-glove-averaging: We train an SVM classifier based on the mean vector.15 In training, we use a linear kernel with the default C value of 1.0.
- (e)
MEskip-thought: We encode each sentence with the skip-thought encoder of Kiros et al. (2015). The skip-thought model uses an encoder-decoder framework to learn the sentence representation in a task-agnostic (unsupervised) way. It encodes each sentence with a GRU-RNN (Cho et al. 2014), and uses the encoded vector to decode the words of the neighboring sentences using another GRU-based RNN as a language model. The model is originally trained on the BookCorpus16 with a vocabulary size of 20K words. It then uses the CBOW word2vec vectors (Mikolov et al. 2013a) to expand the vocabulary size to 930,911 words. Following the recommendation from the authors, we use the combine-skip model that concatenates the vectors encoded by a uni-directional encoder (uni-skip) and a bi-directional encoder (bi-skip). The resulting vectors have 4,800 dimensions: the first 2,400 dimensions are the uni-skip vector, and the last 2,400 are the bi-skip vector. We learn an ME classifier based on this representation. This model has a total of 4,800 × 5 = 24,000 parameters.
- (f)
B-GRU: This is a variation of our B-LSTMconv-glove model, where we replace the LSTM cells with GRU cells (Cho et al. 2014) in the recurrent layer. This model has a total of 2 (directions) × 3 (gates) × (128 × 128 (hidden-hidden) + 300 × 128 (input-hidden)) + 256 × 5 = 329,984 trainable parameters (excluding the biases). Our LSTM-based RNN model uses four gates, which gives a total of 439,552 parameters to train.
We notice that all these models have a large number of parameters to learn an effective classification model for our task using the sentence representation as input features. Similar to our B-LSTM, the B-GRU and the skip-thought models are compositional, that is, they compose the sentence representation from the representations of its words using the sentence structure. Although the 4,800-dimensional sentence representation for skip-thought is not learned on the task, the associated weight parameters in the MEskip-thought model are trained on the task.
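The following sketch illustrates the two non-compositional sentence representations above: zero-padded concatenation as in MEconv-glove, and averaging as in MEconv-glove-averaging. Scikit-learn's logistic regression stands in for the MaxEnt classifier, and the word vectors and labels are random placeholders for the 300-dimensional conversational Glove vectors and the speech act tags.

```python
# Sketch of the two non-compositional sentence representations compared above:
# zero-padded concatenation (as in ME_conv-glove) and averaging (as in
# ME_conv-glove-averaging). Word vectors and labels are random placeholders;
# logistic regression stands in for MaxEnt.
import numpy as np
from sklearn.linear_model import LogisticRegression

EMB_DIM, MAX_LEN = 300, 100

def concat_repr(word_vecs):
    """Concatenate word vectors, zero-padding/truncating to MAX_LEN words."""
    mat = np.zeros((MAX_LEN, EMB_DIM))
    mat[:min(len(word_vecs), MAX_LEN)] = word_vecs[:MAX_LEN]
    return mat.ravel()                      # 100 x 300 = 30,000-dim vector

def mean_repr(word_vecs):
    """Average the word vectors into a single 300-dim vector."""
    return np.mean(word_vecs, axis=0)

# Synthetic corpus: 200 sentences of varying length, 5 speech act classes.
rng = np.random.default_rng(0)
sents = [rng.standard_normal((rng.integers(3, 30), EMB_DIM)) for _ in range(200)]
labels = rng.integers(0, 5, size=200)

for name, repr_fn in [("concat", concat_repr), ("mean", mean_repr)]:
    X = np.stack([repr_fn(s) for s in sents])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(name, X.shape, clf.score(X, labels))
```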
Table 12 presents the results. It can be observed that, in general, the compositional methods perform better than the non-compositional ones (e.g., averaging, concatenation), and when the compositional method is trained on the task, we get the best performance on two out of three data sets. In particular, our B-LSTMconv-glove achieves the best results on QC3 and TA, outperforming B-GRUconv-glove by a slight margin in F1.17 The MEskip-thought performs best on BC3, and comes close to the best results on TA. This is not so surprising because the skip-thought model encodes a sentence like a neural conversation model (Vinyals and Le 2015), and it has been shown that such models capture information relevant to speech acts (Ritter, Cherry, and Dolan 2010).
Table 12. Comparison of sentence encoders and classifiers in the Concat training setting: macro F1, with accuracy in parentheses.

| Encoder | Classifier | Model Name | QC3 (Testset) | TA (Testset) | BC3 (Testset) |
|---|---|---|---|---|---|
| Concatenation | ME | MEconv-glove | 60.52 (76.28) | 75.47 (86.79) | 60.46 (79.00) |
| Concatenation | MLP | MLPconv-glove | 60.47 (73.07) | 75.85 (86.52) | 55.33 (78.00) |
| Averaging | ME | MEconv-glove-averaging | 63.32 (76.92) | 73.72 (84.09) | 45.65 (74.00) |
| Averaging | SVM | SVMconv-glove-averaging | 18.74 (60.89) | 29.46 (64.69) | 16.19 (68.00) |
| Skip-thought | ME | MEskip-thought | 59.65 (78.13) | 77.09 (86.22) | 71.89 (89.00) |
| B-GRU | ME | B-GRUconv-glove | 69.45 (81.41*) | 77.77 (88.68*) | 58.66 (79.00) |
| B-LSTM | ME | B-LSTMconv-glove | 70.51 (80.77*) | 78.08 (88.95*) | 58.28 (79.00) |
To further analyze the cases where B-LSTMconv-glove makes a difference, Figure 7 shows the corresponding confusion matrices for B-LSTMconv-glove and MLPconv-glove on the concatenated test sets of QC3, TA, and BC3. In general, our classifiers most often confuse Response with Statement, and Suggestion with Statement. We observed a similar pattern in the human annotations, where annotators had difficulty with these three acts. It is noticeable that B-LSTMconv-glove is less affected by class imbalance, and it detects the Suggestion and Polite acts much more accurately than MLPconv-glove. This indicates that LSTM-RNNs can model the grammar of the sentence when composing the words into phrases and sentences sequentially.
Confusion matrices for (a) MLPconv-glove and (b) B-LSTMconv-glove on the test sets of QC3, TA, and BC3. P stands for Polite, Q for Question, R for Response, ST for Statement, and SU stands for Suggestion.
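For reference, confusion matrices like those in Figure 7 can be tabulated as follows; the gold and predicted labels below are made-up examples using the abbreviations from the figure caption.

```python
# Tabulating a confusion matrix over the five speech act labels used in
# Figure 7; the gold and predicted sequences below are synthetic examples.
from sklearn.metrics import confusion_matrix

acts = ["P", "Q", "R", "ST", "SU"]   # Polite, Question, Response, Statement, Suggestion
y_true = ["Q", "ST", "SU", "ST", "R", "ST", "P", "SU"]
y_pred = ["Q", "ST", "ST", "ST", "ST", "ST", "P", "SU"]

cm = confusion_matrix(y_true, y_pred, labels=acts)
print(cm)   # rows: gold acts, columns: predicted acts
```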
5.2.4 Effectiveness of Domain Adaptation.
We have seen that semi-supervised learning in the form of word embeddings learned from a large unlabeled conversational corpus benefits our B-LSTM model. In the previous section, we witnessed further performance gains by exploiting more labeled data from the synchronous domain (Mrda). However, these methods make a simplified assumption that the conversational data comes from the same distribution. As mentioned before, the conversations in QC3, TA, or BC3 are quite different from Mrda meeting conversations in terms of style (spoken vs. written) and vocabulary usage. We believe that the results can be improved further by modeling the shift of domains (or distributions) explicitly.
In Section 3.2, we described two adaptation scenarios: (i) unsupervised, where no annotated data is available in the target domains, and (ii) supervised, where some annotated data is available in the target domain. We use all the available labels in the Concat data set for our supervised training. This makes the adaptation results comparable with our pre-adaptation results reported earlier in Table 12.
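To make the adversarial setup more concrete, the following is a minimal PyTorch sketch of the gradient reversal mechanism that underlies domain adversarial training (Section 3.2). The encoder output is simulated here with random sentence vectors, and the head sizes and the lambda coefficient are illustrative choices rather than our exact configuration.

```python
# Minimal sketch of gradient reversal for domain adversarial training: the
# speech act head is trained normally, while the domain head's gradient is
# flipped before reaching the (simulated) sentence encoder, pushing the
# encoder toward domain-invariant representations. Sizes are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip the gradient sign

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

sent_dim, n_acts = 256, 5
act_classifier = nn.Linear(sent_dim, n_acts)       # speech act head (labeled data)
domain_classifier = nn.Sequential(                 # domain head: meeting vs. forum/e-mail
    nn.Linear(sent_dim, 64), nn.ReLU(), nn.Linear(64, 2))

sent_vec = torch.randn(8, sent_dim, requires_grad=True)   # stand-in encoder output
act_loss = nn.functional.cross_entropy(act_classifier(sent_vec),
                                        torch.randint(0, n_acts, (8,)))
dom_loss = nn.functional.cross_entropy(domain_classifier(grad_reverse(sent_vec)),
                                        torch.randint(0, 2, (8,)))
(act_loss + dom_loss).backward()   # encoder gradients combine both objectives
```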
Table 13 presents the results for the adapted B-LSTMconv-glove model under the above training conditions (last two rows). For comparison, we have also shown the results for two baselines: (i) a transfer B-LSTMconv-glove model in the first row that is trained on only Mrda (source domain) data, and (ii) a merge B-LSTMconv-glove model in the second row that is trained on the concatenation of Mrda and the target domain data (QC3, TA, or BC3). Recall that the merge model is the one that gave the best results so far (i.e., last row of Table 11).
Table 13. Results of the domain-adapted B-LSTMconv-glove model compared with the transfer and merge baselines: macro F1, with accuracy in parentheses.

| Model | Training Regime | QC3 (Testset) | TA (Testset) | BC3 (Testset) |
|---|---|---|---|---|
| B-LSTMconv-glove | Transfer | 60.81 (72.35) | 70.26 (83.85) | 36.57 (57.14) |
| B-LSTMconv-glove | Concatenation/Merge | 70.51 (80.77) | 78.08 (88.95) | 58.28 (79.00) |
| Adapted B-LSTMconv-glove | Unsupervised adaptation | 47.67 (70.51) | 66.17 (81.85) | 39.97 (67.42) |
| Adapted B-LSTMconv-glove | Supervised adaptation | 71.52 (82.48) | 81.21 (88.67) | 69.30 (88.22) |
We can observe that without any labeled data in the target domain, the adapted B-LSTMconv-glove in the third row performs worse than the transfer baseline in QC3 and TA. In this case, because the out-of-domain labeled data set (MRDA) is much larger, it overwhelms the model, inducing features that are not relevant for the task in the target domain. However, when we provide the model with some labeled in-domain examples in the supervised adaptation setting (last row), we observe gains over the merge model (second row) in all three data sets. Remarkably, the absolute improvements in F1 for BC3 and TA are more than 11% and 3%, respectively.
To further analyze the performance of our adapted model, Figure 8 presents the F1 scores with varying amounts of labeled data in the target domain. For all three data sets, the largest improvements come from the first 25% of the labeled data. For TA and QC3, the gains from the second quartile are also relatively higher than those from the last two quartiles. This demonstrates the effectiveness of our adversarial domain adaptation method. In the future, it will be interesting to compare adversarial training with other domain adaptation methods.
F1 scores of our adapted model with varying amounts of labeled in-domain data.
5.3 Effectiveness of CRFs
Conversation-Level Data Set for CRFs.
To demonstrate the effectiveness of CRFs in capturing inter-sentence dependencies in an asynchronous conversation, we create another training setting called Conv-Level (Conversation-level), in which training instances are entire conversations and the random splits are done at the conversation level (as opposed to the sentence level) for the asynchronous corpora. This is required because the CRFs perform joint learning and inference over an entire conversation. Table 14 shows the resulting data sets that we use to train and evaluate our CRFs. We have 229 conversations for training and 27 conversations for development.18 The test sets contain 5, 20, and 5 conversations for QC3, TA, and BC3, respectively.
Table 14. Conversation-level (Conv-Level) data splits: number of conversations, with the number of sentences in parentheses.

| Corpus | Train | Dev. | Test |
|---|---|---|---|
| QC3 | 38 (1,332) | 4 (111) | 5 (122) |
| TA | 160 (2,957) | 20 (310) | 20 (444) |
| BC3 | 31 (1,012) | 3 (101) | 5 (222) |
| Total | 229 (5,301) | 27 (522) | 30 (788) |
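Because the splits are drawn at the conversation level, a grouped splitter is one simple way to realize them; the sketch below uses scikit-learn's GroupShuffleSplit on synthetic conversation ids and sizes, and is not the exact procedure used to build Table 14.

```python
# One way to split at the conversation level rather than the sentence level,
# so that all sentences of a conversation land in the same partition. The
# conversation ids and sizes here are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
# 40 synthetic conversations with 10-40 sentences each.
conv_ids = np.concatenate([np.full(rng.integers(10, 40), c) for c in range(40)])
sentences = np.arange(len(conv_ids))             # placeholder per-sentence features

splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(sentences, groups=conv_ids))
assert not set(conv_ids[train_idx]) & set(conv_ids[test_idx])  # no conversation is split
print(len(set(conv_ids[train_idx])), "train conversations,",
      len(set(conv_ids[test_idx])), "test conversations")
```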
Baselines and CRF Variants.
We use the following three models as baselines:
- (a)
MEbow: a MaxEnt model with BOW representation.
- (b)
Adapted B-LSTMconv-glove (semi-supervised): This model performs adversarial semi-supervised domain adaptation using labeled sentences from Mrda and Conv-Level training sets. Note that this is our best system so far (see Table 13).
- (c)
MEadapt-lstm: a MaxEnt learned from the sentence embeddings extracted from the adapted B-LSTMconv-glove (semi-supervised), that is, the sentence embeddings are used as feature vectors.
We experiment with the CRF variants shown in Table 1. Similar to MEadapt-lstm, the CRFs are trained on the Conv-Level training set using the sentence embeddings extracted by applying the adapted B-LSTMconv-glove (semi-supervised) model. The CRF models are therefore the structured versions of the MEadapt-lstm baseline.
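As a simplified illustration of this pipeline, the sketch below trains a linear-chain CRF (roughly the CRFLC-LC variant) over whole conversations using sklearn-crfsuite, with random vectors standing in for the adapted B-LSTM sentence embeddings. The richer graph structures of our pairwise CRFs, which require loopy BP, are not covered by this snippet.

```python
# Hedged sketch: a linear-chain CRF over whole conversations, with each
# sentence represented by an embedding vector (random stand-ins here for the
# adapted B-LSTM embeddings). This approximates only the CRF_LC-LC variant.
import numpy as np
import sklearn_crfsuite

ACTS = ["P", "Q", "R", "ST", "SU"]
rng = np.random.default_rng(0)

def embed_to_feats(vec):
    """Expose each embedding dimension as a real-valued CRF feature."""
    return {f"e{i}": float(v) for i, v in enumerate(vec)}

# 20 synthetic conversations of 5-15 sentences, each sentence a 32-dim embedding.
X = [[embed_to_feats(rng.standard_normal(32)) for _ in range(rng.integers(5, 15))]
     for _ in range(20)]
y = [[ACTS[rng.integers(0, 5)] for _ in conv] for conv in X]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)                       # joint learning over whole conversations
print(crf.predict(X[:1]))           # joint decoding of one conversation
```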
Results and Discussion.
Table 15 shows our results on the Conv-Level data sets. We can notice that CRFs generally outperform MEs in accuracy, and for some CRF variants we get better results in both macro F1 and accuracy. This indicates that there are conversational dependencies between the sentences in a conversation.
Table 15. Results on the Conv-Level test sets: macro F1, with accuracy in parentheses.

| Model | QC3 | TA | BC3 |
|---|---|---|---|
| MEbow | 57.37 (69.18) | 65.39 (85.04) | 60.32 (80.74) |
| Adapted B-LSTMconv-glove (semi-sup) | 67.34 (80.15) | 70.36 (86.73) | 62.65 (83.59) |
| MEadapt-lstm | 62.36 (78.93) | 63.31 (85.92) | 58.32 (81.43) |
| CRFLC-NO | 67.02 (79.51) | 67.82 (86.94) | 63.15 (84.65) |
| CRFLC-LC | 67.12 (79.83) | 67.94 (86.74) | 63.75 (84.10) |
| CRFLC-LC1 | 69.32 (81.03*) | 68.84 (87.34) | 64.22 (84.71) |
| CRFLC-FC1 | 70.11 (80.67) | 69.73 (86.51) | 66.34 (86.51*) |
| CRFFC-FC | 69.65 (80.77) | 72.31 (88.61*) | 64.82 (86.18*) |
If we compare the CRF variants, we can see that the model that does not consider any link across comments (CRFLC-NO) performs the worst. A simple linear chain connection between sentences in their temporal order (CRFLC-LC) does not improve much. This indicates that the linear chain CRF (Lafferty, McCallum, and Pereira 2001) is not the most appropriate model for capturing conversational dependencies in asynchronous conversations.
CRFLC-LC1 is one of the best-performing models, and it performs significantly better than the adapted B-LSTMconv-glove.19 This model considers linear chain connections between the sentences within a comment, and links across comments only to the first comment. When we change this model to consider relations with every sentence in the first comment (CRFLC-FC1), the performance improves further, giving the best results on two of the three data sets. This indicates that there are strong dependencies between the sentences of the initial comment and the sentences of the remaining comments, and these dependencies are better captured when the relations between them are considered explicitly. CRFFC-FC also yields results as good as those of CRFLC-FC1. This could be attributed to the robustness of the fully connected CRF, which learns from all possible pairwise relations.
Another interesting observation is that no single graph structure performs the best across all conversation types. For example, CRFLC-FC1 gives the highest F1 scores for QC3 and BC3, whereas CRFFC-FC gives the highest results for TA. This shows the variation and the complicated ways in which participants communicate with each other in these conversations. One interesting future work would be to learn the underlying conversational structure automatically. However, we believe that in order to learn an effective model, this would require more labeled data.
To see a real example in which the CRF, by means of its global learning and inference, makes a difference, let us consider the example in Figure 1 again. We notice that the two sentences in comment C4 were mistakenly identified as Statement and Response, respectively, by the B-LSTMconv-glove local model. However, by considering these two sentences together with the others in the conversation, the global CRFLC-LC1, CRFLC-FC1, and CRFFC-FC models were able to correct them (see Global). CRFLC-LC correctly identified the first one as a Question.
6. Conclusions and Future Directions
We have presented a novel two-step framework for speech act recognition in asynchronous conversation. An LSTM-RNN first composes sentences of a conversation into vector representations by considering the word order in a sentence. Then a pairwise CRF jointly models the inter-sentence dependencies in a conversation. In order to mitigate the problem of limited annotated data in the asynchronous domains, we further adapt the LSTM-RNN to learn from synchronous meeting conversations using adversarial training of neural networks.
We experimented with different LSTM variants (uni- vs. bi-directional, random vs. pretrained initialization), and with different CRF variants depending on the underlying graph structure. We trained word2vec and Glove conversational word embeddings from a large conversational corpus. We trained our models in many different settings using synchronous and asynchronous corpora, including in-domain training, concatenated training, unsupervised adaptation, supervised adaptation, and conversation-level CRF joint training. We evaluated our approach on a synchronous data set (meeting) and three asynchronous data sets (two forum data sets and one e-mail data set), one of which is presented in this work.
Our experiments show that conversational word embeddings, especially conversational Glove, are quite beneficial for learning better sentence representations for the speech act classification task through a bidirectional LSTM. This is especially true when the amount of labeled data in the target domain is limited. Adding more labeled data from synchronous domains yields improvements for bi-LSTMs, but even larger gains can be achieved by domain adaptation with adversarial training. Further experiments with CRFs show that global joint models improve over local models, provided that they consider the right graph structure. In particular, the LC-FC1 and FC-FC graph structures were among the best performers.
This work leads us to a number of future directions. First, we would like to combine CRFs with LSTM-RNNs so as to perform the two steps jointly, letting the LSTM-RNNs learn the embeddings directly from the global thread-level feedback. This would require the backpropagation algorithm to take error signals from the loopy BP inference. Second, we would also like to apply our models to conversations where the graph structure is extractable from metadata or other clues, for example, the fragment quotation graphs for e-mail threads (Carenini, Ng, and Zhou 2008). One interesting future work would be to jointly model the conversational structure (e.g., reply-to structure) and the speech acts so that the two tasks can inform each other.
In another direction, we would like to evaluate our speech act recognition model on extrinsic tasks. In a separate thread of work, we are developing coherence models for asynchronous conversations (Nguyen and Joty 2017; Joty, Mohiuddin, and Tien Nguyen 2018). Such coherence models could be useful for a number of downstream tasks including next utterance (or comment) ranking, conversation generation, and thread reconstruction (Nguyen et al. 2017). We are now looking into whether speech act information can help us in building better coherence models for asynchronous conversations. We also plan to evaluate the utility of speech acts in downstream NLP tasks involving asynchronous conversations like next utterance ranking (Lowe et al. 2015), conversation generation (Ritter, Cherry, and Dolan 2010), and summarization (Murray, Carenini, and Ng 2010). Finally, we hope that the new corpus, the conversational word embeddings, and the source code that we have made publicly available in this work will facilitate other researchers in extending our work and in applying speech act models to their NLP tasks.
Bibliographic Note
Portions of this work were previously published in the ACL-2016 conference proceedings (Joty and Hoque 2016). This article significantly extends the published work in several ways, most notably: (i) we train new word2vec and Glove word embeddings based on a large conversational corpus, and show their effectiveness by comparing with off-the-shelf word embeddings (Section 4.2 and the Results section), (ii) we extend the LSTM-RNN for domain adaptation using adversarial training of neural networks (Section 3.2), (iii) we evaluate the domain adapted LSTM-RNN model on meeting and forum data sets (Section 5.2), and (iv) we train and evaluate CRFs based on sentence embeddings extracted from the adapted LSTM-RNN (Section 5.3). Besides these extensions, a significant portion of the article has been rewritten to suit a journal-style publication.
Acknowledgments
We thank Aseel Ghazal for her efforts in creating the new QC3 corpus. We also thank Enamul Hoque for running and organizing some of the experiments reported in the ACL-2016 paper. Many thanks to the anonymous reviewers for their insightful comments on the ACL-2016 submitted version.
Notes
Speech acts are also known as “dialog acts” in the literature.
There is bias associated with each nonlinear transformation, which we have omitted for notational simplicity.
This is also known as one-hot vector representation.
For simplicity, we list U and V parameters of LSTM in a generic way rather than being specific to the gates.
Available from https://ntunlpsg.github.io/project/speech-act/.
Available from https://code.google.com/archive/p/word2vec/.
Available from https://nlp.stanford.edu/projects/glove/.
Significance tests operate on individual instances rather than individual classes; thus not applicable for macro F1.
More hidden layers worsened the performance.
Other algorithms like Adagrad or RMSProp gave similar results.
l1 and l2 regularization on weights did not work well.
For simplicity, we excluded the bias vectors from our computation.
SVM training with linear kernel did not scale to the concatenated vector.
There is no significant difference between the accuracy numbers for B-GRU and B-LSTM.
We use the concatenated sets as train and dev sets.
Significance was computed based on the accuracy on the concatenated test set.