Abstract
We study the problem of response selection for multi-turn conversation in retrieval-based chatbots. The task involves matching a response candidate with a conversation context, the challenges of which include how to recognize important parts of the context and how to model the relationships among utterances in the context. Existing matching methods may lose important information in contexts: they can be interpreted with a unified framework in which contexts are transformed into fixed-length vectors without any interaction with responses before matching. This motivates us to propose a new matching framework that can sufficiently carry important information in contexts into matching and model relationships among utterances at the same time. The new framework, which we call the sequential matching framework (SMF), lets each utterance in a context interact with a response candidate in the first step and transforms the pair into a matching vector. The matching vectors are then accumulated following the order of the utterances in the context with a recurrent neural network (RNN) that models relationships among utterances. Context-response matching is then calculated with the hidden states of the RNN. Under SMF, we propose a sequential convolutional network and a sequential attention network and conduct experiments on two public data sets to test their performance. Experimental results show that both models can significantly outperform state-of-the-art matching methods. We also show that the models are interpretable with visualizations that provide insights into how they capture and leverage important information in contexts for matching.
1. Introduction
Recent years have witnessed a surge of interest in building conversational agents in both industry and academia. Existing conversational agents can be categorized into task-oriented dialog systems and non–task-oriented chatbots. Dialog systems focus on helping people complete specific tasks in vertical domains (Young et al. 2010), such as flight booking, bus route enquiry, and restaurant recommendation; chatbots aim to naturally and meaningfully converse with humans on open-domain topics (Ritter, Cherry, and Dolan 2011). Building an open-domain chatbot is challenging, because it requires the conversational engine to be capable of responding to any human input covering a wide range of topics. To address the problem, researchers have considered leveraging the large amount of conversation data available on the Internet, and have proposed generation-based methods (Shang, Lu, and Li 2015; Vinyals and Le 2015; Li et al. 2016b; Mou et al. 2016; Serban et al. 2016; Xing et al. 2017) and retrieval-based methods (Wang et al. 2013; Hu et al. 2014; Ji, Lu, and Li 2014; Wang et al. 2015; Yan, Song, and Wu 2016; Zhou et al. 2016; Wu et al. 2018a). Generation-based methods generate responses with natural language generation models learned from conversation data, whereas retrieval-based methods re-use existing responses by selecting proper ones from an index of the conversation data. In this work, we study the problem of response selection in retrieval-based chatbots, because retrieval-based chatbots have the advantage of returning informative and fluent responses. Although most existing work on retrieval-based chatbots studies response selection for single-turn conversation (Wang et al. 2013), in which conversation history is ignored, we study the problem in a multi-turn scenario. In a chatbot, multi-turn response selection takes a message and the utterances in its previous turns as input and selects a response that is natural and relevant to the entire context.
A key step in response selection is measuring the matching degree between an input and a response candidate. Different from single-turn conversation, in which the input is a single utterance (i.e., the message), multi-turn conversation requires context-response matching, where both the current message and the utterances in its previous turns should be taken into consideration. The challenges of the task include (1) how to extract important information (words, phrases, and sentences) from the context and leverage that information in matching; and (2) how to model relationships and dependencies among the utterances in the context. Table 1 uses an example to illustrate the challenges. First, to find a proper response for the context, the chatbot must know that “hold a drum class” and “drum” are important points. Without them, it may return a response relevant to the message (i.e., Turn-5 in the context) but nonsensical in the context (e.g., “what lessons do you want?”). On the other hand, words like “Shanghai” and “Lujiazui” are less useful and even noisy for response selection. The responses from the chatbot may drift to the topic of “Shanghai” if it pays significant attention to these words. Therefore, it is crucial yet non-trivial for the chatbot to understand the important points in the context, leverage them in matching, and at the same time circumvent noise. Second, there is a clear dependency between Turn-5 and Turn-2 in the context, and the order of utterances matters in response selection: exchanging Turn-3 and Turn-5 would call for different proper responses.
| Context | |
|---|---|
| Turn-1 | Human: How are you doing? |
| Turn-2 | ChatBot: I am going to hold a drum class in Shanghai. Anyone wants to join? The location is near Lujiazui. |
| Turn-3 | Human: Interesting! Do you have coaches who can help me practice drum? |
| Turn-4 | ChatBot: Of course. |
| Turn-5 | Human: Can I have a free first lesson? |
| **Response Candidates** | |
| Response 1 | Sure. Have you ever played drum before? ✓ |
| Response 2 | What lessons do you want? ✗ |
Existing work, including the recurrent neural network architectures proposed by Lowe et al. (2015), the deep learning to respond architecture proposed by Yan, Song, and Wu (2016), and the multi-view architecture proposed by Zhou et al. (2016), may lose important information in context-response matching, because it follows the same paradigm to perform matching, and this paradigm suffers from clear drawbacks. In fact, although these models have different structures, they can be interpreted with a unified framework: a context and a response are first individually represented as vectors, and then their matching score is computed with the vectors. The context representation includes two layers. The first layer represents utterances in the context, and the second layer takes the output of the first layer as an input and represents the entire context. The existing models differ in how they design the context representation and the response representation and how they calculate the matching score with the two representations. The framework view unifies the existing models and indicates the common drawbacks they share: everything in the context is compressed to one or more fixed-length vectors before matching is conducted, and there is no interaction between the context and the response in the formation of their representations. The context is represented without enough supervision from the response, and so is the response.
To overcome the drawbacks, we proposed a sequential matching network (SMN) for context-response matching in our earlier work (Wu et al. 2017), in which we construct a matching vector for each utterance–response pair through convolution and pooling on their similarity matrices, and then aggregate the sequence of matching vectors into a matching score for the context and the response. In this work, we take one step further and generalize the SMN model to a sequential matching framework (SMF). The framework view allows us to tackle the challenges of context-response matching from a high level. Specifically, SMF matches each utterance in the context with the response in the first step and forms a sequence of matching vectors. It then accumulates the matching vectors of utterance–response pairs in the chronological order of the utterances. The final context-response matching score is calculated from the accumulation of pair matching. Different from the existing framework, SMF allows utterances in the context and the response to interact with each other at the very beginning, so important matching information in each utterance–response pair can be sufficiently preserved and carried to the final matching score. Moreover, relationships and dependencies among utterances are modeled in a matching fashion, so the order of utterances can supervise the aggregation of the utterance–response matching. Generally speaking, SMF consists of three layers. The first layer extracts important matching information from each utterance–response pair and transforms the information into a matching vector. The matching vectors are then fed into the second layer, where a recurrent neural network with gated recurrent units (GRUs) (Chung et al. 2014) models the relationships and dependencies among utterances and accumulates the matching vectors into its hidden states. The final layer takes the hidden states of the GRU as input and calculates a matching score for the context and the response.
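To make the three layers concrete, the following is a minimal PyTorch sketch of the SMF skeleton, with the utterance–response matching function left abstract (SCN and SAN in Section 5 are two instantiations of it) and the “last state” variant of the prediction layer used for brevity. The class and parameter names are illustrative, not the exact architecture reported later.

```python
import torch
import torch.nn as nn

class SMFSkeleton(nn.Module):
    """Schematic three-layer SMF; `match_fn` stands in for f(u, r)."""

    def __init__(self, match_fn, match_dim, hidden_dim):
        super().__init__()
        self.match_fn = match_fn                                    # layer 1: f(u, r) -> matching vector
        self.gru = nn.GRU(match_dim, hidden_dim, batch_first=True)  # layer 2: accumulation over turns
        self.score = nn.Linear(hidden_dim, 1)                       # layer 3: m(.) on the last state

    def forward(self, utterances, response):
        # One matching vector per utterance-response pair, kept in utterance order.
        v = torch.stack([self.match_fn(u, response) for u in utterances], dim=1)
        states, _ = self.gru(v)          # hidden states model dependencies among utterances
        return torch.sigmoid(self.score(states[:, -1, :]))
```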
The key to the success of SMF lies in how to design the utterance–response matching layer, which requires identifying important parts in each utterance. We first show that the point-wise similarity calculation followed by convolution and pooling in SMN is one implementation of the utterance–response matching layer of SMF, making the SMN model a special case of the framework. Then, we propose a new model, named the sequential attention network (SAN), which implements the utterance–response matching layer of SMF with an attention mechanism. Specifically, for an utterance–response pair, SAN lets the response attend to important parts (either words or segments) of the utterance by weighting those parts with each part of the response. Each weight reflects how important a part of the utterance is with respect to the corresponding part of the response. Then, for each part of the response, the parts of the utterance are linearly combined with the weights, and the combination interacts with the part of the response via a Hadamard product to form a representation of the utterance. Such utterance representations are computed at both a word level and a segment level. The two levels of representations are finally concatenated and processed by a GRU to form a matching vector. SMN and SAN are two different implementations of the utterance–response matching layer, and we give a comprehensive comparison between them. Theoretically, SMN is faster and easier to parallelize than SAN, whereas SAN can better utilize sequential relationships and dependencies. The empirical results are consistent with the theoretical analysis.
We empirically compare SMN and SAN on two public data sets: the Ubuntu Dialogue Corpus (Lowe et al. 2015) and the Douban Conversation Corpus (Wu et al. 2017). The Ubuntu corpus is a large-scale English data set in which negative instances are randomly sampled and dialogues are collected from a specific domain; the Douban corpus is a newly published Chinese data set in which conversations are crawled from an open-domain forum, with response candidates collected following the procedure of retrieval-based chatbots and their appropriateness judged by human annotators. Experimental results show that on both data sets, both SMN and SAN can significantly outperform the existing methods. In particular, on the Ubuntu corpus, SMN and SAN yield improvements of 6 and 7 percentage points, respectively, on R10@1 over the best performing baseline method, and on the Douban corpus, the improvements in mean average precision from SMN and SAN over the best baseline are 2.6 and 3.6 percentage points, respectively. The empirical results indicate that SAN can achieve better performance than SMN in practice. In addition to the quantitative evaluation, we also visualize the two models with examples from the Ubuntu corpus. The visualization reveals how the two models understand conversation contexts and provides insights into why they can achieve large improvements over state-of-the-art methods.
This work is a substantial extension of our previous work reported at ACL 2017. The extension in this article includes a unified framework for the existing methods, a proposal of a new framework for context-response matching, and a new model under the framework. Specifically, the contributions of this work include the following.
- We unify existing context-response matching models with a framework and disclose their intercorrelations with detailed mathematical derivations, which reveals their common drawbacks and sheds light on our new direction.
- We propose a new framework for multi-turn response selection, namely, the sequential matching framework, which is capable of overcoming the drawbacks suffered by the existing models and addressing the challenges of context-response matching in an end-to-end way. The framework indicates that the key to context-response matching is not the 2D convolution and pooling operations in SMN, but a general utterance–response matching function that can capture the important matching information in utterance–response pairs.
- We propose a new architecture, the sequential attention network, under the new framework. Moreover, we compare SAN with SMN on both efficiency and effectiveness.
- We conduct extensive experiments on public data sets and verify that SAN achieves new state-of-the-art performance on context-response matching.
The rest of the paper is organized as follows: In Section 2 we summarize the related work. We formalize the learning problem in Section 3. In Section 4, we interpret the existing models with a framework. Section 5 elaborates our new framework and gives two models as special cases of the framework. Section 6 gives the learning objective and some training details. In Section 7 we give details of the experiments. In Section 8, we outline our conclusions.
2. Related Work
We briefly review the history and recent progress of chatbots, and the application of text matching techniques to other tasks. Together with the review of existing work, we clarify the connections and differences between those works and the work in this article.
2.1 Chatbots
Research on chatbots goes back to the 1960s, when ELIZA (Weizenbaum 1966), an early chatbot, was designed with a large number of handcrafted templates and heuristic rules. ELIZA required a huge amount of human effort but could only return limited responses. To remedy this, researchers have developed data-driven approaches (Higashinaka et al. 2014). The idea behind data-driven approaches is to build a chatbot with the large amount of conversation data available on social media such as forums and microblogging services. Methods along this line can be categorized into retrieval-based and generation-based ones.
Generation-based chatbots reply to a message with natural language generation techniques. Early work (Ritter, Cherry, and Dolan 2011) regards messages and responses as a source language and a target language, respectively, and learns a phrase-based statistical machine translation model to translate a message into a response. Recently, with the success of deep learning approaches, the sequence-to-sequence framework has become the mainstream approach, because it can implicitly capture compositionality and long-span dependencies in languages. Under this framework, many models have been proposed for both single-turn conversation and multi-turn conversation. For example, in single-turn conversation, sequence-to-sequence with an attention mechanism (Shang, Lu, and Li 2015; Vinyals and Le 2015) has been applied to response generation; Li et al. (2016a) proposed a maximum mutual information objective to improve the diversity of generated responses; Xing et al. (2017) and Mou et al. (2016) introduced external knowledge into the sequence-to-sequence model; Wu et al. (2018b) proposed decoding a response from a dynamic vocabulary; Li et al. (2016b) incorporated persona information into the sequence-to-sequence model to enhance response consistency with speakers; and Zhou et al. (2018) explored how to generate emotional responses with a memory-augmented sequence-to-sequence model. In multi-turn conversation, Sordoni et al. (2015) compressed a context into a vector with a multi-layer perceptron in response generation; Serban et al. (2016) extended the sequence-to-sequence model to a hierarchical encoder-decoder structure; and under this structure, they further proposed two variants, VHRED (Serban et al. 2017b) and MrRNN (Serban et al. 2017a), to introduce latent and explicit variables into the generation process. Xing et al. (2018) exploited a hierarchical attention mechanism to highlight the effect of important words and utterances in generation. On top of these methods, reinforcement learning (Li et al. 2016c) and adversarial learning (Li et al. 2017) techniques have also been applied to response generation.
Different from generation-based systems, retrieval-based chatbots select a proper response from an index and re-use it to reply to a new input. The key to response selection is how to match the input with a response. In a single-turn scenario, matching is conducted between a message and a response. For example, Hu et al. (2014) proposed message-response matching with convolutional neural networks; Wang et al. (2015) incorporated syntax information into matching; and Ji, Lu, and Li (2014) combined several matching features, such as cosine similarity, topic similarity, and translation score, to rank response candidates. In multi-turn conversation, matching requires taking the entire context into consideration. In this scenario, Lowe et al. (2015) used a dual long short-term memory (LSTM) model to match a response with the literal concatenation of utterances in a context; Yan, Song, and Wu (2016) reformulated the input message with the utterances in its previous turns and performed matching with a deep neural network architecture; Zhou et al. (2016) adopted an utterance view and a word view in matching to model relationships among utterances; and Wu et al. (2017) proposed a sequential matching network that can capture important information in contexts and model relationships among utterances in a unified form.
Our work is a retrieval-based method. It is an extension of the work by Wu et al. (2017) reported at the ACL conference. In this work, we analyze the existing models from a framework view, generalize the model in Wu et al. (2017) to a framework, give another implementation with better performance under the framework, and compare the new model with the model in the conference paper on various aspects.
2.2 Text Matching
In addition to response selection in chatbots, neural network–based text matching techniques have proven effective in capturing semantic relations between text pairs in a variety of NLP tasks. For example, in question answering, convolutional neural networks (Qiu and Huang 2015; Severyn and Moschitti 2015) can effectively capture compositions of n-grams and their relations in questions and answers. Inner-Attention (Wang, Liu, and Zhao 2016) and multiple view (MV)-LSTM (Wan et al. 2016a) can model complex interactions between questions and answers through recurrent neural network based architectures. (More studies on text matching for question answering can be found in Tan et al. [2016]; Liu et al. [2016a, b]; Wan et al. [2016b]; He and Lin [2016]; Yin et al. [2016]; Yin and Schütze [2015]). In Web search, Shen et al. (2014) and Huang et al. (2013) built a neural network with tri-letters to alleviate mismatching of queries and documents due to spelling errors. In textual entailment, the model in Rocktäschel et al. (2015) utilized a word-by-word attention mechanism to distinguish the relationship between two sentences. Wang and Jiang (2016b) introduced another way to adopt an attention mechanism for textual entailment. Besides those two works, Chen et al. (2016), Parikh et al. (2016), and Wang and Jiang (2016a) also investigated the textual entailment problem with neural network models.
In this work, we study text matching for response selection in multi-turn conversation, in which matching is conducted between a piece of text and a context which consists of multiple pieces of text dependent on each other. We propose a new matching framework that is able to extract important information in the context and model dependencies among utterances in the context.
3. Problem Formalization
Suppose that we have a data set 𝒟 = {(yi, si, ri)}i=1N, where si is a conversation context, ri is a response candidate, and yi ∈ {0, 1} is a label. si = {ui,1, …, ui,ni}, where ui,1, …, ui,ni are utterances. ∀k, ui,k = (wui,k,1, …, wui,k,j, …, wui,k,nui,k), where wui,k,j is the j-th word in ui,k and nui,k is the length of ui,k. Similarly, ri = (wri,1, …, wri,j, …, wri,nri), where wri,j is the j-th word in ri and nri is the length of the response. yi = 1 if ri is a proper response to si; otherwise yi = 0. Our goal is to learn a matching model g(⋅, ⋅) with 𝒟 such that for any new context-response pair (s, r), g(s, r) measures their matching degree. According to g(s, r), we can rank candidates for s and select a proper one as its response.
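To make the setup concrete, the following is a minimal sketch of how a learned g(⋅, ⋅) would be used at test time; the function and variable names are illustrative, not from the paper.

```python
def select_response(g, context, candidates):
    """Rank candidate responses by matching degree and return the best one."""
    # g(context, response) -> real-valued matching score; higher means a better match.
    return max(candidates, key=lambda r: g(context, r))
```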
In the following sections, we first review how the existing work defines g(⋅, ⋅) from a framework view. The framework view discloses the common drawbacks of the existing work. Then, based on this analysis, we propose a new matching framework and give two models under the framework.
4. A Framework for the Existing Models
There are several advantages to applying the framework view to the existing context-response matching models. First, it unifies the existing models and reveals the intrinsic connections among them. These models are nothing but similarity functions of a context representation and a response representation. Their differences in performance come from how well the two representations capture the semantics and the structures of the context and the response and how accurate the similarity calculation is. For example, in empirical studies, the multi-view model performs much better than the RNN models. This is because the multi-view model captures the sequential relationship among words, the composition of n-grams, and the sequential relationship of utterances by hw(⋅) and hu(⋅), whereas in the RNN models, only the sequential relationship among words is modeled by hrnn(⋅). Second, it is easy to extend the existing models by replacing f(⋅), f′(⋅), h(⋅), and m(⋅, ⋅). For example, we can replace the hrnn(⋅) in the RNN models with a composition of CNN and RNN to model both the composition of n-grams and their sequential relationship, and we can replace mrnn(⋅) with a more powerful neural tensor network (Socher et al. 2013). Third, the framework unveils the limitations that the existing models and their possible extensions suffer from: everything in the context is compressed to one or more fixed-length vectors before matching, and there is no interaction between the context and the response in the formation of their representations. The context is represented without enough supervision from the response, and so is the response. As a result, these models may lose important information of contexts in matching; more seriously, no matter how we improve them, as long as the improvement stays within the framework, we cannot overcome these limitations. The framework view thus motivates us to propose a new framework that essentially changes the existing matching paradigm.
5. Sequential Matching Framework
SMF has two major differences from the existing framework. First, SMF lets each utterance in the context and the response “meet” at the very beginning, so utterances and the response can sufficiently interact with each other. Through the interaction, the response helps recognize important information in each utterance. The information is preserved in the matching vectors and carried into the final matching score with minimal loss. Second, matching and utterance relationships are coupled rather than modeled separately as in the existing framework. Hence, the utterance relationships (e.g., the order of the utterances), as a kind of knowledge, can supervise the formation of the matching score. Because of these differences, SMF can overcome the drawbacks of the existing models and tackle the two challenges of context-response matching simultaneously.
It is obvious that the success of SMF lies in how to design f(⋅, ⋅), because f(⋅, ⋅) plays a key role in capturing important information in a context. In the following sections, we will first specify the design of f(⋅, ⋅), and then discuss how to define h(⋅) and m(⋅).
5.1 Utterance–Response Matching
We design the utterance–response matching function f(⋅, ⋅) in SMF as neural networks to benefit from their powerful representation abilities. To guarantee that f(⋅, ⋅) can capture important information in utterances with the help of the response, we implement f(⋅, ⋅) with a convolution-pooling technique and an attention technique, which result in a sequential convolutional network (SCN) and a sequential attention network (SAN), respectively. Moreover, in both SCN and SAN, we consider matching on multiple levels of granularity of text. Note that in our ACL paper (Wu et al. 2017), the sequential convolutional network is named “SMN.” Here, we rename it SCN in order to distinguish it from the framework.
5.1.1 Sequential Convolutional Network.
Figure 3 gives the architecture of SCN. Given an utterance u in a context s and a response candidate r, SCN looks up an embedding table and represents u and r as U = [eu,1, …, eu,nu] and R = [er,1, …, er,nr], respectively, where eu,i, er,i ∈ ℝd are the embeddings of the i-th word of u and r, respectively. With U and R, SCN constructs a word–word similarity matrix M1 ∈ ℝnu×nr and a sequence–sequence similarity matrix M2 ∈ ℝnu×nr as two input channels of a convolutional neural network (CNN). The CNN then extracts important matching information from the two matrices and encodes the information into a matching vector v.
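A hedged sketch of this construction for one utterance–response pair is given below; the dot-product form for M1 and the bilinear form with a learned matrix A for M2 are assumptions consistent with the description above, not the exact parameterizations of the model.

```python
import numpy as np

def similarity_channels(U, R, Hu, Hr, A):
    """Build SCN's two CNN input channels for one utterance-response pair.

    U (n_u x d), R (n_r x d): word embeddings; Hu, Hr: GRU hidden states of
    the same shapes; A (d x d): learned parameter of the bilinear match.
    """
    M1 = U @ R.T                 # word-word similarity: dot products of embeddings
    M2 = Hu @ A @ Hr.T           # sequence-sequence similarity between hidden states
    return np.stack([M1, M2])    # shape (2, n_u, n_r): two channels for the CNN
```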
SCN distills important information in each utterance of the context at multiple levels of granularity through convolution and pooling operations on the similarity matrices. From Equations (15), (17), (18), and (21), we can see that by learning word embeddings and the parameters of the GRU from training data, important words or segments in the utterance may have high similarity with some words or segments in the response and result in high-value areas in the similarity matrices. These areas are then transformed and carried into the matching vector by the convolution and pooling operations. We will further explore the mechanism of SCN by visualizing M1 and M2 of an example in Section 7.
5.1.2 Sequential Attention Network.
With word embeddings U and R and hidden vectors Hu and Hr, SAN also performs utterance–response matching on a word level and a segment level. Figure 4 gives the architecture of SAN. In each level of matching, SAN exploits every part of the response (either a word or a hidden state) to weight the parts of the utterance and obtain a weighted representation of the utterance. The utterance representation then interacts with the part of the response. The interactions are finally aggregated following the order of the parts in the response as a matching vector.
From Equations (23) and (26), we can see that SAN identifies important information in the utterances of a context through an attention mechanism. Words or segments in utterances that are useful for recognizing the appropriateness of a response to the context will receive high weights from the response. The information conveyed by these words and segments is highlighted in the interaction between the utterances and the response and carried into the matching vector through an RNN that models the aggregation of information in the utterances under the supervision of the response. Similar to SCN, we will further investigate the effect of the attention mechanism in SAN by visualizing the attention weights in Section 7.
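A simplified sketch of one level of this mechanism follows. Plain dot-product attention stands in for the model's learned attention in Equations (23) and (26), and the GRU that aggregates the interactions in response order is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_interactions(U, R):
    """Rows of R (response parts) attend to rows of U (utterance parts)."""
    weights = softmax(R @ U.T, axis=-1)   # (n_r, n_u): importance of each utterance part
    U_tilde = weights @ U                 # weighted combination of utterance parts
    return U_tilde * R                    # Hadamard product: (n_r, d) interactions
```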
5.1.3 SAN vs. SCN.
Because SCN and SAN exploit different mechanisms to understand important parts of contexts, an interesting question arises: what are the advantages and disadvantages of the two models in practice? Here, we leave the empirical comparison of their performance to the experiments and first compare SCN with SAN on the following aspects: (1) the amount of parallelizable computation, measured by the minimum number of sequential operations required; and (2) total time complexity.
Table 2 summarizes the comparison between the two models. In terms of parallelizability, SAN uses two RNNs to learn the representations, which requires 2n sequential operations, whereas SCN has n sequentially executed operations in the construction of M2. Hence, SCN is easier to parallelize than SAN. In terms of time complexity, the complexity of SCN is 𝒪(k ⋅ n ⋅ d2 + n ⋅ d2 + n2 ⋅ d), where k is the number of feature maps in convolutions, n is max(nu, nr), and d is the embedding size. More specifically, in SCN, the cost of constructing M1 and M2 is 𝒪(n ⋅ d2 + n2 ⋅ d), and the cost of convolution and pooling is 𝒪(k ⋅ n ⋅ d2). The complexity of SAN is 𝒪(n2 ⋅ d + n2 ⋅ d2), where 𝒪(n2 ⋅ d) is the cost of calculating Hu and Hr and 𝒪(n2 ⋅ d2) is the cost of the following attention-based GRU. In practice, k is usually much smaller than the maximum sentence length n. Therefore, SCN could be faster than SAN. This conclusion is also verified by the empirical results in Section 7.
5.2 Matching Accumulation
5.3 Matching Prediction
m(⋅) takes {h1, …, hn} from h(⋅) as an input and predicts a matching score for (s, r). We consider three approaches to implementing m(⋅).
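Before the variant-specific subsections, a minimal sketch contrasting the three approaches (H stacks h1, …, hn as rows; the scoring vector w, the static weights a, and the attention parameter v are illustrative stand-ins for the learned parameters of the model, not its exact formulation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def last_state(H, w):
    return float(w @ H[-1])        # only the final hidden state is used

def static_average(H, a, w):
    return float(w @ (a @ H))      # a: learned weights, fixed across inputs

def dynamic_average(H, v, w):
    a = softmax(np.tanh(H @ v))    # weights computed from the hidden states themselves
    return float(w @ (a @ H))
```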
5.3.1 Last State.
5.3.2 Static Average.
5.3.3 Dynamic Average.
6. Model Training
7. Experiments
We test SAN and SCN on two public data sets with both quantitative metrics and qualitative analysis.
7.1 Data Sets
The first data set we exploited to test the performance of our models is the Ubuntu Dialogue Corpus v1 (Lowe et al. 2015). The corpus contains large-scale two-way conversations collected from the chat logs of the Ubuntu forum. The conversations are multi-turn discussions about Ubuntu-related technical issues. We used the copy shared by Xu et al. (2017), in which numbers, URLs, and paths are replaced by special placeholders. The data set consists of 1 million context-response pairs for training, 0.5 million pairs for validation, and 0.5 million pairs for testing. In each conversation, a human reply is selected as a positive response to the context, and negative responses are randomly sampled. The ratio of positive responses to negative responses is 1:1 in the training set, and 1:9 in both the validation and test sets.
In addition to the Ubuntu Dialogue Corpus, we selected the Douban Conversation Corpus (Wu et al. 2017) as another data set. The data set is a recently released large-scale open-domain conversation corpus in which conversations are crawled from Douban Group, a popular Chinese forum. The training set contains 1 million context-response pairs, and the validation set contains 50,000 pairs. In both sets, a context has a human reply as a positive response and a randomly sampled reply as a negative response. Therefore, the ratio of positive instances to negative instances in both training and validation is 1:1. Different from the Ubuntu Dialogue Corpus, the test set of the Douban Conversation Corpus contains 1,000 contexts, each with 10 responses retrieved from a pre-built index. Each response receives three labels from human annotators that indicate its appropriateness as a reply to the context, and the majority label is taken as the final decision. An appropriate response is one that can naturally reply to the conversation history by satisfying logical consistency, fluency, and semantic relevance; a response that fails any of the three conditions is inappropriate. The Fleiss kappa (Fleiss 1971) of the labeling is 0.41, which means that the labelers reached moderate agreement in their work. Note that in our experiments, we removed contexts whose responses are all labeled as positive or all labeled as negative. After this step, 6,670 context-response pairs are left in the test set.
Table 3 summarizes the statistics of the two data sets.
| | Ubuntu (train) | Ubuntu (val) | Ubuntu (test) | Douban (train) | Douban (val) | Douban (test) |
|---|---|---|---|---|---|---|
| # context-response pairs | 1M | 0.5M | 0.5M | 1M | 50k | 10k |
| # candidates per context | 2 | 10 | 10 | 2 | 2 | 10 |
| # positive candidates per context | 1 | 1 | 1 | 1 | 1 | 1.18 |
| Min. # turns per context | 3 | 3 | 3 | 3 | 3 | 3 |
| Max. # turns per context | 19 | 19 | 19 | 98 | 91 | 45 |
| Avg. # turns per context | 10.10 | 10.10 | 10.11 | 6.69 | 6.75 | 6.45 |
| Avg. # words per utterance | 12.45 | 12.44 | 12.48 | 18.56 | 18.50 | 20.74 |
7.2 Baselines
We compared our models with the following methods:
TF-IDF: We followed Lowe et al. (2015) and computed TF-IDF-based cosine similarity between a context and a response. Utterances in the context are concatenated to form a document. IDF is computed on the training data.
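For concreteness, a sketch of this baseline with scikit-learn; the vectorizer configuration, the whitespace joining, and the `training_texts` corpus are illustrative assumptions rather than the exact setup of Lowe et al. (2015).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_tfidf_scorer(training_texts):
    """IDF statistics are computed on the training data, as described above."""
    vectorizer = TfidfVectorizer().fit(training_texts)

    def score(context_utterances, response):
        doc = " ".join(context_utterances)            # utterances form one document
        vecs = vectorizer.transform([doc, response])
        return cosine_similarity(vecs[0], vecs[1])[0, 0]

    return score
```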
Basic deep learning models: We used models in Lowe et al. (2015) and Kadlec, Schmid, and Kleindienst (2015), in which representations of a context are learned by neural networks with the concatenation of utterances as inputs and the final matching score is computed by a bilinear function of the context representation and the response representation. Models including RNN, CNN, LSTM, and BiLSTM were selected as baselines.
Multi-View: The model proposed in Zhou et al. (2016) that utilizes a hierarchical recurrent neural network to model utterance relationships. It integrates information in a context from an utterance view and a word view. Details of the model can be found in Equation (9).
Deep learning to respond (DL2R): The authors in Yan, Song, and Wu (2016) proposed several approaches to reformulate a message with previous turns in a context. The response and the reformulated message are then represented by a composition of RNN and CNN. Finally, the matching score is computed with the concatenation of the representations. Details of the model can be found in Equation (6).
Advanced single-turn matching models: Because BiLSTM does not represent the state of the art in matching, we concatenated the utterances in a context and matched the long text with a response candidate using more powerful models, including MV-LSTM (Wan et al. 2016b) (a 2D matching model), Match-LSTM (Wang and Jiang 2016b), and Attentive-LSTM (Tan et al. 2016) (two attention-based models). To demonstrate the importance of modeling utterance relationships, we also calculated a matching score for the concatenation of utterances and the response candidate using the methods in Section 5.1. The two models are simplified versions of SCN and SAN, respectively, that do not consider utterance relationships. We denote them as SCNsingle and SANsingle.
7.3 Evaluation Metrics
In experiments on the Ubuntu corpus, we followed Lowe et al. (2015) and used recall at position k among n candidates (Rn@k) as evaluation metrics. Here, a matching model is required to return the k most likely responses, and Rn@k = 1 if the true response is among them. Rn@k becomes larger as k grows or n shrinks.
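The metric for one context can be sketched as follows, assuming `scores` and binary `labels` (1 marks the true response) are aligned over the n candidates:

```python
def recall_at_k(scores, labels, k):
    """R_n@k for one context: 1 if the true response is ranked in the top k."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return 1.0 if any(label == 1 for _, label in ranked[:k]) else 0.0
```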
We did not calculate R2@1 for the test data in the Douban corpus because one context could have more than one correct response, and we have to randomly sample one for R2@1, which may bring bias to the evaluation.
7.4 Parameter Tuning
For baseline models, we copied the numbers from the existing papers when their results on the Ubuntu corpus were reported (TF-IDF, RNN, CNN, LSTM, BiLSTM, Multi-View); otherwise we implemented the models and tuned their parameters on the validation sets. All models were implemented with the Theano framework (Theano Development Team 2016). Word embeddings in neural networks were initialized by the results of word2vec (Mikolov et al. 2013) pre-trained on the training data. We did not use GloVe (Pennington, Socher, and Manning 2014) because the Ubuntu corpus contains many technical words that are not covered by the Twitter or Wikipedia corpora. The word embedding size was chosen as 200. The maximum utterance length was set as 50. The maximum context length (i.e., number of utterances per context) was varied from 1 to 20 and finally set to 10. We padded zeros if the number of utterances in a context was less than 10; otherwise we kept the last 10 utterances. We discuss how the performance of the models changes with the maximum context length later.
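A sketch of the truncation-and-padding scheme just described; whether padding is placed at the front or the back is not specified in the text, so front padding is assumed here, with empty strings standing in for zero utterances:

```python
def normalize_context(utterances, max_turns=10):
    """Keep the last `max_turns` utterances and pad shorter contexts."""
    kept = utterances[-max_turns:]
    return [""] * (max_turns - len(kept)) + kept
```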
For SCN, the window size of convolution and pooling was tuned over {(2, 2), (3, 3), (4, 4)} and finally set to (3, 3). The number of feature maps is 8. The size of the hidden states in the construction of M2 is the same as the word embedding size, and the size of the output vector v was set to 50. Furthermore, the size of the hidden states in the matching accumulation module is also 50. In SAN, the size of the hidden states in the segment-level representation is 200, and the size of the hidden states in Equation (29) was set to 400.
All tuning was done according to R2@1 on the validation data.
7.5 Evaluation Results
Tables 4 and 5 show the evaluation results on the Ubuntu Corpus and the Douban Corpus, respectively. SAN and SCN outperform the baselines on all metrics on both data sets by large margins, and except for R10@5 of SCN on the Douban corpus, the improvements are statistically significant (t-test, p-value ≤ 0.01). Our models are better than state-of-the-art single-turn matching models such as MV-LSTM, Match-LSTM, SCNsingle, and SANsingle. The results demonstrate that one cannot neglect utterance relationships and simply perform multi-turn response selection by concatenating utterances together.
| Model | R2@1 | R10@1 | R10@2 | R10@5 |
|---|---|---|---|---|
| TF-IDF | 0.659 | 0.410 | 0.545 | 0.708 |
| RNN | 0.768 | 0.403 | 0.547 | 0.819 |
| CNN | 0.848 | 0.549 | 0.684 | 0.896 |
| LSTM | 0.901 | 0.638 | 0.784 | 0.949 |
| BiLSTM | 0.895 | 0.630 | 0.780 | 0.944 |
| Multi-View | 0.908 | 0.662 | 0.801 | 0.951 |
| DL2R | 0.899 | 0.626 | 0.783 | 0.944 |
| MV-LSTM | 0.906 | 0.653 | 0.804 | 0.946 |
| Match-LSTM | 0.904 | 0.653 | 0.799 | 0.944 |
| Attentive-LSTM | 0.903 | 0.633 | 0.789 | 0.943 |
| SCNsingle | 0.904 | 0.656 | 0.809 | 0.942 |
| SANsingle | 0.906 | 0.662 | 0.810 | 0.945 |
| SCNlast | 0.923 | 0.723 | 0.842 | 0.956 |
| SCNstatic | 0.927 | 0.725 | 0.838 | 0.962 |
| SCNdynamic | 0.926 | 0.726 | 0.847 | 0.961 |
| SANlast | 0.930 | 0.733 | 0.850 | 0.961 |
| SANstatic | 0.932 | 0.734 | 0.852 | 0.962 |
| SANdynamic | 0.932 | 0.733 | 0.851 | 0.961 |
| Model | MAP | MRR | P@1 | R10@1 | R10@2 | R10@5 |
|---|---|---|---|---|---|---|
| TF-IDF | 0.331 | 0.359 | 0.180 | 0.096 | 0.172 | 0.405 |
| RNN | 0.390 | 0.422 | 0.208 | 0.118 | 0.223 | 0.589 |
| CNN | 0.417 | 0.440 | 0.226 | 0.121 | 0.252 | 0.647 |
| LSTM | 0.485 | 0.527 | 0.320 | 0.187 | 0.343 | 0.720 |
| BiLSTM | 0.479 | 0.514 | 0.313 | 0.184 | 0.330 | 0.716 |
| Multi-View | 0.505 | 0.543 | 0.342 | 0.202 | 0.350 | 0.729 |
| DL2R | 0.488 | 0.527 | 0.330 | 0.193 | 0.342 | 0.705 |
| MV-LSTM | 0.498 | 0.538 | 0.348 | 0.202 | 0.351 | 0.710 |
| Match-LSTM | 0.500 | 0.537 | 0.345 | 0.202 | 0.348 | 0.720 |
| Attentive-LSTM | 0.495 | 0.523 | 0.331 | 0.192 | 0.328 | 0.718 |
| SCNsingle | 0.506 | 0.543 | 0.349 | 0.203 | 0.351 | 0.709 |
| SANsingle | 0.508 | 0.547 | 0.352 | 0.206 | 0.353 | 0.720 |
| SCNlast | 0.526 | 0.571 | 0.393 | 0.236 | 0.387 | 0.729 |
| SCNstatic | 0.523 | 0.572 | 0.387 | 0.228 | 0.387 | 0.734 |
| SCNdynamic | 0.529 | 0.569 | 0.397 | 0.233 | 0.396 | 0.724 |
| SANlast | 0.536 | 0.581 | 0.393 | 0.236 | 0.404 | 0.761 |
| SANstatic | 0.532 | 0.575 | 0.387 | 0.228 | 0.393 | 0.736 |
| SANdynamic | 0.534 | 0.577 | 0.391 | 0.230 | 0.393 | 0.742 |
TF-IDF shows the worst performance, indicating that the multi-turn response selection problem cannot be addressed with shallow features. LSTM is the best model among the basic models. The reason might be that it models relationships among words. Multi-View is better than LSTM, demonstrating the effectiveness of the utterance-view in context modeling. Advanced models have better performance, because they are capable of capturing more complicated structures in contexts.
SAN is better than SCN on both data sets, which might be attributed to three reasons. The first reason is that SAN uses vectors instead of scalars to represent interactions between words or text segments. Therefore, the matching vectors in SAN can encode more information from the pairs than those in SCN. The second reason is that SAN uses a soft attention mechanism to emphasize important words or segments in utterances, whereas SCN uses a max pooling operation to select important information from similarity matrices. When multiple words or segments are important in an utterance–response pair, a max pooling operation just selects the top one, but the attention mechanism can leverage all of them. The last reason is that SAN models the sequential relationship and dependency among words or segments in the interaction aggregation module, whereas SCN only considers n-grams.
The three approaches to matching prediction do not show much difference for either SCN or SAN, but dynamic average and static average are better than the last state on the Ubuntu corpus and worse than it on the Douban corpus. This is because contexts in the Ubuntu corpus are longer than those in the Douban corpus (average context length 10.1 vs. 6.7), so the last hidden state may lose information from the history on the Ubuntu data. In contrast, the Douban corpus has shorter contexts but longer utterances (average utterance length 18.5 vs. 12.4), so noise may be introduced into response selection if more hidden states are taken into consideration.
There are two reasons that the Rn@k scores on the Douban corpus are much smaller than those on the Ubuntu corpus. One is that response candidates in the Douban corpus are returned by a search engine rather than by negative sampling. Therefore, some negative responses in the Douban corpus might be semantically closer to the true positive responses than those in the Ubuntu corpus, and thus more difficult for a model to differentiate. The other is that there can be multiple correct candidates for a context, so the maximum R10@1 for some contexts is not 1. For example, if there are three correct responses, then the maximum R10@1 is 0.33. P@1 is about 40% on the Douban corpus, indicating the difficulty of the task in a real chatbot.
7.6 Further Analysis
7.6.1 Model Ablation.
We first investigated how different parts of SCN and SAN affect their performance by ablating SCNlast and SANlast. Table 6 reports the results of ablation on the test data. First, we replaced the utterance–response matching module in SCN and SAN with a neural tensor network (NTN) (Socher et al. 2013) (denoted as ReplaceM), which matches an utterance and a response by feeding their representations to the NTN. The performance of the two models dropped dramatically. This is because in NTN there is no interaction between the utterance and the response before their matching, and it is doubtful whether NTN can recognize important parts in the pair and encode the information into matching. As a result, the model loses important information in the pair. We can therefore conclude that a good utterance–response matching mechanism is crucial to the success of SMF: at the least, one has to let an utterance and a response interact with each other and explicitly highlight important parts in their matching vector. Second, we replaced the GRU in the matching accumulation modules of SCN and SAN with a multi-layer perceptron (denoted as SCN ReplaceA and SAN ReplaceA, respectively). The change led to a slight performance drop, which indicates that utterance relationships are useful in context-response matching. Finally, we kept only one level of granularity, either the word level or the segment level, in SCN and SAN, and denoted the models as SCN with words, SCN with segments, SAN with words, and SAN with segments, respectively. The results indicate that segment-level matching on utterance–response pairs contributes more to the final context-response matching, and that both segments and words are useful in response selection.
(The first four metric columns are on the Ubuntu Corpus; the last six are on the Douban Corpus.)

| Model | R2@1 | R10@1 | R10@2 | R10@5 | MAP | MRR | P@1 | R10@1 | R10@2 | R10@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| ReplaceM | 0.905 | 0.661 | 0.799 | 0.950 | 0.503 | 0.541 | 0.343 | 0.201 | 0.364 | 0.729 |
| SCN with words | 0.919 | 0.704 | 0.832 | 0.955 | 0.518 | 0.562 | 0.370 | 0.228 | 0.371 | 0.737 |
| SCN with segments | 0.921 | 0.715 | 0.836 | 0.956 | 0.521 | 0.565 | 0.382 | 0.232 | 0.380 | 0.734 |
| SCN ReplaceA | 0.918 | 0.716 | 0.832 | 0.954 | 0.522 | 0.565 | 0.376 | 0.220 | 0.385 | 0.727 |
| SCNlast | 0.923 | 0.723 | 0.842 | 0.956 | 0.526 | 0.571 | 0.393 | 0.236 | 0.387 | 0.729 |
| SAN with words | 0.922 | 0.713 | 0.842 | 0.957 | 0.523 | 0.565 | 0.372 | 0.232 | 0.381 | 0.747 |
| SAN with segments | 0.928 | 0.729 | 0.846 | 0.959 | 0.532 | 0.575 | 0.385 | 0.234 | 0.393 | 0.754 |
| SAN ReplaceA | 0.927 | 0.728 | 0.842 | 0.959 | 0.532 | 0.561 | 0.386 | 0.225 | 0.395 | 0.757 |
| SANlast | 0.930 | 0.733 | 0.850 | 0.961 | 0.536 | 0.581 | 0.393 | 0.236 | 0.404 | 0.761 |
7.6.2 Comparison with Respect to Context Length.
We then studied how the performance of SCNlast and SANlast changes across contexts with different lengths. Context-response pairs were bucketed into three bins according to the length of the contexts (i.e., the number of utterances in the contexts), and comparisons were made in different bins on different metrics. Figure 5 gives the results. Note that we did the analysis only on the Douban corpus, because on the Ubuntu corpus many results were copied from the existing literature and the bin-level results are not available. SAN and SCN consistently perform better than the baselines over all bins, and the general trend is that as contexts become longer, the gaps become larger. For example, in (2, 5], SAN is 3 points higher than LSTM on R10@5, but the gap becomes 5 points in (5, 10]. The results demonstrate that our models can well capture dependencies, especially long-distance dependencies, among utterances in contexts. SAN and SCN have similar trends because both of them use a GRU in the second layer to model dependencies among utterances. The performance of all models drops when the length of contexts increases from (2, 5] to (5, 10], because the semantics of longer contexts is more difficult to capture than that of shorter contexts. On the other hand, the performance of all models improves when the length of contexts increases from (5, 10] to (10, ∞). This is because the bin of (10, ∞) contains much less data than the other two bins (the data distribution is 53% for (2, 5], 38% for (5, 10], and 9% for (10, ∞)), and thus the improvement is not statistically meaningful.
7.6.3 Sensitivity to Hyper-Parameters.
We checked how sensitive SCN and SAN are to the word embedding size and the maximum context length. Table 7 reports the evaluation results of SCNlast and SANlast with embedding sizes varying over {50, 100, 200}. We can see that SAN is more sensitive to the word embedding size than SCN. SCN becomes stable once the embedding size exceeds 100, whereas SAN keeps improving as the embedding size increases. Our explanation of the phenomenon is that SCN transforms word vectors and the hidden vectors of the GRU to scalars in the similarity matrices by dot products, so information in the extra dimensions (e.g., entries with indices larger than 100) might be lost; on the other hand, SAN leverages the whole d-dimensional vectors in matching, so the information in the embedding can be exploited more fully.
(The first four metric columns are on the Ubuntu Corpus; the last six are on the Douban Corpus.)

| Model | R2@1 | R10@1 | R10@2 | R10@5 | MAP | MRR | P@1 | R10@1 | R10@2 | R10@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| SCN50d | 0.920 | 0.715 | 0.834 | 0.952 | 0.503 | 0.541 | 0.343 | 0.201 | 0.364 | 0.729 |
| SCN100d | 0.921 | 0.718 | 0.838 | 0.954 | 0.524 | 0.569 | 0.391 | 0.234 | 0.387 | 0.727 |
| SCN200d | 0.923 | 0.723 | 0.842 | 0.956 | 0.526 | 0.571 | 0.393 | 0.236 | 0.387 | 0.729 |
| SAN50d | 0.914 | 0.698 | 0.828 | 0.950 | 0.503 | 0.541 | 0.343 | 0.201 | 0.364 | 0.729 |
| SAN100d | 0.921 | 0.711 | 0.840 | 0.953 | 0.525 | 0.565 | 0.375 | 0.220 | 0.388 | 0.746 |
| SAN200d | 0.930 | 0.733 | 0.850 | 0.961 | 0.536 | 0.581 | 0.393 | 0.236 | 0.404 | 0.761 |
Figure 6 shows the performance of SCN and SAN with respect to the maximum context length. We find that both models become significantly better as the maximum context length increases up to 5, and become stable after the maximum context length reaches 10. The results indicate that utterances from early history can provide useful information for response selection. Moreover, model performance is more sensitive to the maximum context length on the Ubuntu corpus than on the Douban corpus. This is because utterances in the Douban corpus are longer than those in the Ubuntu corpus (average length 18.5 vs. 12.4), which means single utterances in the Douban corpus can carry more information than those in the Ubuntu corpus. In practice, we set the maximum context length to 10 to balance effectiveness and efficiency.
7.6.4 Model Efficiency.
In Section 5.1.3, we theoretically analyzed the efficiency of SCN and SAN. To verify the theoretical results, we further empirically compared their efficiency using the training data and the test data of the two data sets. The experiments were conducted with Theano on a Tesla K80 GPU running Windows Server 2012. The parameters of the two models are described in Section 7.4. Figure 7 gives the training time and the test time of SAN and SCN. We can see that SCN is twice as fast as SAN in training (a result of its lower time complexity and ease of parallelization), and is about 3 msec per batch faster at test time. Moreover, the different matching functions do not influence the running time much, because the bottleneck is the utterance representation learning.
The empirical results are consistent with our theoretical results: SCN is faster than SAN. The results indicate that SCN is suitable for systems that care more about efficiency, whereas SAN can reach a higher accuracy with a little sacrifice of efficiency.
7.6.5 Visualization.
Finally, we explain how SAN and SCN understand the semantics of conversation contexts by visualizing the similarity matrices of SCN, the attention weights of SAN, and the update and reset gates of the accumulation GRU of the two models on an example from the Ubuntu corpus. Table 8 shows the example, which is selected from the test set of the Ubuntu corpus and ranked at the top position by both SAN and SCN.
Context |
u1: how can unzip many rar files at once? |
u2: sure you can do that in bash |
u3: okay how? |
u4: are the files all in the same directory? |
u5: yes they all are; |
Response |
Response: then the command glebihan should extract them all from/to that directory |
Figure 8(a) illustrates the word–word similarity matrices M1 in SCN. We can see that important words in u1 such as “unzip,” “rar,” and “files” are recognized and highlighted by words like “command,” “extract,” and “directory” in r. On the other hand, the similarity matrix of r and u3 is almost blank, as there is no important information conveyed by u3. Figure 8(b) shows the sequence–sequence similarity matrices M2 in SCN. We find that important segments like “unzip many rar” are highlighted, and the matrices also provide matching information complementary to M1. Figure 8(c) visualizes the reset gate and the update gate of the accumulation GRU. Higher values in the update gate mean that more information from the corresponding matching vector flows into the matching accumulation. From Figure 8(c), we can see that u1 is crucial to response selection and nearly all information from u1 and r flows to the hidden state of the GRU, whereas the other utterances are less informative and their gates are almost “closed,” preserving the information from u1 and r until the final state.
Regarding SAN, Figure 9(a) and Figure 9(b) illustrate the word-level attention weights A1 and the segment-level attention weights A2, respectively. Similar to SCN, important words such as “zip” and “file” and important segments like “unzip many rar” get high weights, whereas function words like “that” and “for” are less attended. It should be noted that because the attention weights are normalized, the gaps between high and low values in A1 and A2 are not as large as those in M1 and M2 of SCN. Figure 9(c) visualizes the gates of the accumulation GRU, from which we observe distributions similar to those of SCN.
7.7 Error Analysis and Future Work
Although models under SMF outperform baseline methods on the two data sets, there are still several problems that cannot yet be handled perfectly.
(1) Logical consistency. SMF models the context and response on a semantic level, but pays little attention to logical consistency. This leads to several bad cases in the Douban corpus. We give a typical example in Table 9. In the conversation history, one of the speakers says that he thinks the item on Taobao is fake, and the response is expected to address why he dislikes the fake shoes. However, both SCN and SAN rank the response “It is not a fake. I just worry about the date of manufacture.” at the top position. The response is logically inconsistent with the context, as it claims that the jogging shoes are not fake, contradicting the context.
Context |
u1: Does anyone know Newton jogging shoes? |
u2: 100 RMB on Taobao. |
u3: I know that. I do not want to buy it because that is a fake which is made in Qingdao, |
u4: Is it the only reason you do not want to buy it? |
Response |
Response: It is not a fake. I just worry about the date of manufacture. |
The reason behind this is that SMF only models semantics of context-response pairs. Logic, attitude, and sentiment are not taken into account in response selection.
In the future, we shall explore the logical consistency problem in retrieval-based chatbots by leveraging more features.
(2) No valid candidates. Another serious issue is the quality of candidates after retrieval. According to Wu et al. (2017), the candidate retrieval method can be described as follows: given a message un with utterances {u1, …, un−1} in its previous turns, the top five keywords are extracted from {u1, …, un−1} based on their TF-IDF scores. un is then expanded with the keywords, and the expanded message is sent to the index to retrieve response candidates using the inline retrieval algorithm of the index. The performance of this heuristic message expansion method is not good enough. In the experiment, only 667 out of 1,000 contexts have correct candidates after response candidate retrieval. This indicates that there is still much room to improve the retrieval component, and that message expansion with several keywords from previous turns may not be enough for candidate retrieval. In the future, we will consider advanced methods for retrieving candidates.
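A sketch of this expansion heuristic; the `idf` table (assumed to map words to IDF scores computed over the whole index) and the whitespace tokenization are illustrative simplifications.

```python
from collections import Counter

def expand_message(message, history, idf, n_keywords=5):
    """Expand the message u_n with the top TF-IDF keywords of its history."""
    tf = Counter(w for utt in history for w in utt.split())      # term frequency over previous turns
    keywords = sorted(tf, key=lambda w: tf[w] * idf.get(w, 0.0), reverse=True)[:n_keywords]
    return message + " " + " ".join(keywords)                    # expanded query sent to the index
```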
(3) Gap between training and test. The current method requires a huge amount of training data (i.e., context-response pairs) to learn a matching model. However, it is too expensive to obtain large-scale (e.g., millions of) human-labeled pairs in practice. Therefore, we regard conversations with human replies as positive instances and conversations with randomly sampled replies as negative instances in model training. The negative sampling method, however, oversimplifies the learning of a matching model, because most negative candidates are semantically far from human responses and thus easy to recognize, and some negative candidates might be proper responses if judged by a human. Because of this gap between training and test, our matching models, although performing much better than the baseline models, are still far from perfect on the Douban corpus (see the low P@1 in Table 5). In the future, we may consider using small human-labeled data sets while leveraging large-scale unlabeled data to learn matching models.
8. Conclusion
In this paper we studied the problem of multi-turn response selection, in which one has to model the relationships among utterances in a context and pay more attention to important parts of the context. By summarizing the existing models into a general framework, we find that they cannot address the two challenges at the same time. Motivated by the analysis, we propose a sequential matching framework for context-response matching. The new framework is able to capture the important information in a context and model the utterance relationships simultaneously. Under the framework, we propose two specific models based on a convolution-pooling technique and an attention mechanism, and test the two models on two public data sets. The results indicate that both models can significantly outperform state-of-the-art models. To further understand the models, we conduct ablation analysis and visualize key components of the two models. We also compare the two models in terms of their effectiveness, efficiency, and sensitivity to hyper-parameters.
Acknowledgments
Yu Wu is supported by an AdeptMind Scholarship and a Microsoft Scholarship. This work was supported in part by the Natural Science Foundation of China (grants U1636211, 61672081, 61370126), the Beijing Advanced Innovation Center for Imaging Technology (grant BAICIT-2016001), and the National Key R&D Program of China (grant 2016QY04W0802).
Notes
We borrow the operator from MATLAB.
TF is word frequency in the context, and IDF is calculated using the entire index.