## Abstract

One of the major challenges in building a task-oriented dialogue system is that dialogue state transitions frequently happen between multiple domains, such as booking hotels or restaurants. Recently, encoder-decoder models based on end-to-end neural networks have become an attractive approach to meeting this challenge. However, such models usually require a sufficiently large amount of training data and are not flexible enough to handle dialogue state transitions. This paper addresses these problems by proposing a simple but practical framework called Multi-Domain KB-BOT (MDKB-BOT), which leverages both neural networks and rule-based strategies in natural language understanding (NLU) and dialogue management (DM). Experiments on the data set of the Chinese Human-Computer Dialogue Technology Evaluation Campaign show that MDKB-BOT achieves competitive performance on several evaluation metrics, including task completion rate and user satisfaction.

## 1. INTRODUCTION

In the past decade, dialogue systems have become an attractive research topic. They can be classified into open-domain dialogue systems and task-oriented dialogue systems. One general approach to dialogue system design is to treat it as a retrieval problem by learning a relevance matching score between user queries and system responses. Inspired by recent advances in deep learning, building an end-to-end dialogue system has become a popular approach because of its flexibility and extensibility. For example, encoder-decoder models based on recurrent neural networks (RNNs) directly maximize the likelihood of the desired responses given the previous dialogue history. However, these systems have two major drawbacks: they require multiple training corpora, and they are likely to generate generic responses such as “I do not know”. These drawbacks limit their generalization ability, especially for a task-oriented system in which knowledge from multiple domains is needed to understand users' underlying intents.

Compared to the end-to-end approach, designing a task-oriented dialogue system as a modularized pipeline is more feasible, because each essential component is trained individually. The components are 1) Natural Language Understanding (NLU), which identifies the task domain and user intent and extracts slot-value pairs, 2) Dialogue Management (DM), which tracks the dialogue state and guides users toward their goal, and 3) Natural Language Generation (NLG), which generates responses. One of the challenges for a task-oriented dialogue system is that dialogue state transitions frequently happen between multiple domains. If earlier components make mistakes in slot value extraction and the errors accumulate, the functionality of the entire system is severely impaired.

To address the complex dialogue state transition problem, we adopt the modularized pipeline architecture and propose a multi-domain KB-BOT (MDKB-BOT), which leverages both rule extraction and neural networks. We run evaluation experiments on the data set of the Chinese Human-Computer Dialogue Technology Evaluation Campaign. The experimental results show that MDKB-BOT robustly handles frequent changes of user intent among three domains (flight, train and hotel) and achieves competitive scores on human evaluation metrics.

## 2. RELATED WORK

As mentioned before, there has been a lot of research effort in applying deep learning to task-oriented dialogue systems. One of the most effective approaches is to build a modularized pipeline system by connecting NLU, DM and NLG together. The traditional approach to NLU is to model domain classification and intent detection as sentence classification while treating slot-value pair extraction as a sequence labeling task. A desirable NLU system should be robust to intent errors and slot errors, especially in slot filling. For example, Xu and Sarikaya [1] applied an RNN to perform contextual domain classification and used a triangular conditional random field (CRF) based on a convolutional neural network for intent detection and slot filling. Jaech, Heck and Ostendorf [2] applied multi-task learning to leverage knowledge from source domains with rich data and improve performance on a target domain with little data. Bapna et al. [3] explored the role of context information in NLU by injecting previous dialogue turns into an RNN-based encoder and a memory network.

On the other hand, many attempts have been made to improve the architecture of DM. Recent research indicates that reinforcement learning (RL) holds promise for planning a dialogue policy based on the current dialogue state. Williams, Asadi and Zweig [4] proposed a model called Hybrid Code Networks (HCNs), which is a mixture of supervised learning and reinforcement learning. HCNs select a dialogue action at every step by optimizing the reward for completing a task with policy gradient [5]. Faced with the sparse nature of the reward signal in RL, Peng et al. [6] designed an end-to-end framework for hierarchical RL in which a MANAGER chooses the current goal (such as a specific domain task) and a WORKER takes actions that help users finish the current subtask. Inspired by recent advances in RL, Mrkšić et al. [7] introduced a belief tracker that overcomes the need for a large number of hand-crafted lexicons to capture linguistic variation in users' language. Their Neural Belief Tracking (NBT) models can reason over pre-trained word embeddings of the system output, the user utterance and candidate pairs in databases.

As for NLG, most current work applies information retrieval techniques to a large query-response database, or uses either template-based methods, in which a set of rules maps frames to natural language, or generation models. Dušek and Jurčíček [8] encoded frames based on syntax trees and used a seq2seq model for generation.

## 3. PROPOSED FRAMEWORK

The proposed framework is illustrated in Figure 1, which includes NLU, DM and NLG. The implementation of these components is described from Section 3.1 to Section 3.3.

Figure 1.

The overall framework of the model consists of three components: (1) NLU module, which predicts intent domain and gives slot value of user utterance, (2) DM module, which outputs the dialogue action to NLG module, and (3) NLG module, which generates the final response.

### 3.1 Natural Language Understanding (NLU)

The main tasks for NLU involve domain classification, intent detection and slot filling as illustrated in Figure 1.

#### 3.1.1 Domain Classification

The convolutional neural network (CNN) proposed by Kim [9] is adopted for domain classification. Let $W \in \mathbb{R}^{v \times d}$ be the $d$-dimensional word embedding table, where $v$ is the vocabulary size. The semantic representation of a user query, $X \in \mathbb{R}^{n \times d}$, is obtained by looking up each word in $W$, where $n$ is the number of words in the query. A 1-D convolutional layer is then applied to $X$ to extract n-gram features. However, the CNN may misclassify queries that contain descriptions of several domains, for example, “The train is cheaper, but to save time, give me airline flight schedules and flight timetables.” Thus, for our online model, rule-based strategies are used to deal with this misclassification problem by constructing keyword lists from both corpora and databases, e.g., a city name list.
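The lookup, convolution and pooling steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the trained model: the ReLU non-linearity and max-over-time pooling follow the usual sentence-CNN design, while the function names, the zero-padding rule and the toy numbers are hypothetical.

```python
def embed(token_ids, W):
    """Look up each word index in the embedding table W (v rows of d floats)."""
    return [W[i] for i in token_ids]


def conv_max_pool(X, kernel, bias=0.0):
    """One feature map: slide an h-word kernel over X, apply ReLU, keep the max.

    X is the n x d sentence matrix and kernel is an h x d filter, so each
    window covers one h-gram of the query.
    """
    h, d = len(kernel), len(kernel[0])
    # Assumption: queries shorter than the kernel are zero-padded on the right.
    rows = X + [[0.0] * d for _ in range(max(0, h - len(X)))]
    best = 0.0  # ReLU floor: the pooled value is max(0, best window score)
    for start in range(len(rows) - h + 1):
        s = bias + sum(rows[start + j][k] * kernel[j][k]
                       for j in range(h) for k in range(d))
        best = max(best, s)
    return best


def sentence_features(token_ids, W, kernels):
    """Concatenate the pooled value of every kernel into one feature vector."""
    X = embed(token_ids, W)
    return [conv_max_pool(X, kernel) for kernel in kernels]


# Toy example: vocabulary of 3 words, d = 2, one bigram kernel.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_kernel = [[1.0, 0.0], [0.0, 1.0]]
features = sentence_features([0, 1, 2], W, [bigram_kernel])
```

In a full classifier, the feature vector would feed a softmax layer over the domains; that layer is omitted here to keep the sketch short.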

#### 3.1.2 Slot Filling

Slot filling is treated as a named entity recognition task in which the popular begin-inside-outside (BIO) format is used to represent the tag of each word in a query. A Long Short-Term Memory (LSTM) network then scans the words and outputs the representation:

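$$f_k = \alpha(W_f x_k + U_f h_{k-1} + b_f), \tag{1}$$

$$i_k = \alpha(W_i x_k + U_i h_{k-1} + b_i), \tag{2}$$

$$o_k = \alpha(W_o x_k + U_o h_{k-1} + b_o), \tag{3}$$

$$\tilde{c}_k = \tanh(W_c x_k + U_c h_{k-1} + b_c), \tag{4}$$

$$c_k = f_k \odot c_{k-1} + i_k \odot \tilde{c}_k, \tag{5}$$

$$h_k = o_k \odot \tanh(c_k), \tag{6}$$

where $x_k$ is the input at position $k$, $h_{k-1}$ is the previous hidden state, $\alpha$ is the gate activation (the logistic sigmoid in a standard LSTM) and $\odot$ denotes element-wise multiplication.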
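As a worked illustration of Eqs. (1) to (6), the following sketch computes one LSTM step in plain Python. It assumes that the gate activation is the logistic sigmoid and that each gate has its own input matrix, recurrent matrix and bias; the parameter dictionary layout and the helper names are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. p maps 'Wf', 'Uf', 'bf', 'Wi', ... to the weights."""
    def affine(gate):
        # W x_k + U h_{k-1} + b for the given gate
        wx = matvec(p["W" + gate], x)
        uh = matvec(p["U" + gate], h_prev)
        return [a + b + c for a, b, c in zip(wx, uh, p["b" + gate])]

    f = [sigmoid(z) for z in affine("f")]            # Eq. (1) forget gate
    i = [sigmoid(z) for z in affine("i")]            # Eq. (2) input gate
    o = [sigmoid(z) for z in affine("o")]            # Eq. (3) output gate
    c_tilde = [math.tanh(z) for z in affine("c")]    # Eq. (4) candidate cell
    c = [fk * cp + ik * ct                           # Eq. (5) new cell state
         for fk, cp, ik, ct in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)] # Eq. (6) hidden state
    return h, c
```

Scanning a query means calling `lstm_step` once per word and feeding the returned `h` and `c` into the next call; a bidirectional variant would run a second scan from right to left and concatenate the two hidden states at each position.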

To enhance the ability to extract slot value pairs, a CRF network is connected to the output of LSTM or bidirectional LSTM (BLSTM). Then the score of a sentence X along with a path of tags Y can be calculated as the sum of the transition score A and the LSTM network score f:

$$S(X, Y, \theta) = \sum_{t=1}^{T} \left( [A]_{Y_{t-1}, Y_t} + [f_\theta]_{Y_t, X_t} \right), \tag{7}$$

where θ is the trainable parameter of the LSTM network.
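The path score in Eq. (7) and the search for the best tag path can be sketched as follows. This is a minimal illustration: the separate `start` vector stands in for the transition out of an assumed start tag $Y_0$, Viterbi decoding is the standard dynamic program for linear-chain CRFs rather than a detail confirmed by the text, and all scores are toy values.

```python
def path_score(emissions, transitions, start, tags):
    """Score of one tag path: sum of transition and network scores, as in Eq. (7).

    emissions[t][y]    network score of tag y at position t
    transitions[a][b]  score of moving from tag a to tag b
    start[y]           transition score from the assumed start tag into y
    """
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def viterbi(emissions, transitions, start):
    """Return the highest-scoring tag path and its score (non-empty input)."""
    n_tags = len(start)
    best = [start[y] + emissions[0][y] for y in range(n_tags)]
    backpointers = []
    for t in range(1, len(emissions)):
        pointers, current = [], []
        for y in range(n_tags):
            # Best previous tag for ending in tag y at position t.
            prev = max(range(n_tags), key=lambda a: best[a] + transitions[a][y])
            pointers.append(prev)
            current.append(best[prev] + transitions[prev][y] + emissions[t][y])
        backpointers.append(pointers)
        best = current
    last = max(range(n_tags), key=lambda y: best[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for three tags: 0 = O, 1 = B-departCity, 2 = I-departCity.
emissions = [[0.1, 2.0, 0.0], [0.0, 0.5, 1.5], [1.0, 0.2, 0.3]]
transitions = [[0.5, 0.2, -5.0], [0.0, -1.0, 1.0], [0.3, 0.0, 0.4]]
start = [0.0, 0.0, -5.0]
best_path, best_score = viterbi(emissions, transitions, start)
```

The large negative entries discourage invalid BIO sequences, such as an I- tag directly after O or at the start of a query.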

Figure 2 shows a bidirectional LSTM network enhanced with a CRF network on the top layer. For our online model, we apply both keyword matching and BLSTM-CRF in order to handle diverse or nonstandard expressions.
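One way to combine the two sources of slot values is sketched below: the BIO tags predicted by the tagger are decoded into slot-value pairs, and keyword lists fill only the slots that the tagger missed. The merge rule is an assumption, since the text does not specify how the two are combined, and the slot names other than departCity, the keyword lists and the function names are hypothetical.

```python
def bio_to_slots(tokens, tags, sep=" "):
    """Decode BIO tags such as B-arriveCity / I-arriveCity / O into pairs."""
    slots, name, buf = [], None, []
    for tok, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != name)
        if starts_new:
            if name is not None:
                slots.append((name, sep.join(buf)))
            name, buf = tag[2:], [tok]
        elif tag.startswith("I-"):
            buf.append(tok)          # continue the current slot value
        else:
            if name is not None:
                slots.append((name, sep.join(buf)))
            name, buf = None, []
    if name is not None:
        slots.append((name, sep.join(buf)))
    return slots


def merge_with_keywords(model_slots, tokens, keyword_lists):
    """Assumed rule: keyword matches only fill slots the tagger left empty."""
    merged = dict(model_slots)
    for slot, keywords in keyword_lists.items():
        if slot in merged:
            continue
        for tok in tokens:
            if tok in keywords:
                merged[slot] = tok
                break
    return merged


tokens = ["economy", "ticket", "to", "New", "York"]
tags = ["O", "O", "O", "B-arriveCity", "I-arriveCity"]
slots = merge_with_keywords(bio_to_slots(tokens, tags), tokens,
                            {"cabin": {"economy", "business"}})
```

For Chinese input, where tokens are not separated by spaces, `sep=""` would be the natural choice.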

Figure 2.

BLSTM-CRF model for slot filling.

#### 3.1.3 Intent Detection

Based on slots extracted from BLSTM-CRF, we update the maintained dialogue state template. Then user intent is inferred by comparing the predefined dialogue template with the new state template.

### 3.2 Dialogue Management (DM)

The NLU module outputs the user intent domain and the slot values of the current turn, which serve as the input to DM.

To avoid spending unnecessary dialogue turns on information that many users consider insignificant, we divide the slots into two categories: required slots and extra slots. Required slots, such as <departCity>, are necessary for the task. Extra slots, such as <trainValue> and <countRate>, would make the dialogue tedious for the many users who do not care about them. We therefore only ask for the required slots needed for booking; if a user mentions extra slots in the dialogue, we take them into account when retrieving information from our database.
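The split between required and extra slots can be sketched as a small lookup. Only departCity, trainValue and countRate appear in the text; the other slot names and the per-domain lists below are hypothetical placeholders.

```python
# Hypothetical required-slot lists; the system's actual lists are not given.
REQUIRED_SLOTS = {
    "flight": ("departCity", "arriveCity", "departDate"),
    "train": ("departCity", "arriveCity", "departDate"),
    "hotel": ("city", "checkInDate"),
}


def missing_required(domain, state):
    """Required slots the system still has to ask the user about."""
    return [slot for slot in REQUIRED_SLOTS[domain] if slot not in state]


def extra_constraints(domain, state):
    """Extra slots the user volunteered; used only to narrow the retrieval."""
    return {k: v for k, v in state.items() if k not in REQUIRED_SLOTS[domain]}


state = {"departCity": "Beijing", "trainValue": "cheap"}
to_ask = missing_required("train", state)
extras = extra_constraints("train", state)
```

The system never prompts for an extra slot: it appears in `extras` only because the user mentioned it.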

To regulate the course of the dialogue when our system interacts with users, the DM module updates the conversation state and selects the next dialogue action. We divide this module into three states, described as follows.

• Initial state: At the beginning of the conversation, an utterance with no explicit intention is considered purposeless talk. After identifying a user's intent, the system moves to the slot filling state. Note that the system stores slot information even before domain prediction, and the information is assigned to the corresponding slots afterward.

• Slot filling state: The main task in this state is to interact with the user to obtain the required slot information for generating responses.

• Recommendation state: Our bot lists the results that meet the user's demands by retrieving and extracting them from the database. If nothing is found, we apply a series of strategies for similar recommendations, such as: (1) removing the constraints of extra slots, (2) adjusting the departure time, (3) changing the cabin or train type, and (4) widening the price range. In addition, users can change their requests and return to the slot filling state or the recommendation state again.
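The three states and the fallback strategies above can be sketched as a small state machine. The transition rules are one reading of the description, the relaxations are applied cumulatively in order (an assumption, since the text does not say whether they are combined), and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    SLOT_FILLING = "slot_filling"
    RECOMMENDATION = "recommendation"


def next_state(state, intent, missing_required):
    """Choose the next dialogue state from the current NLU output."""
    if state is State.INITIAL:
        # Purposeless talk keeps the system in the initial state.
        return State.SLOT_FILLING if intent is not None else State.INITIAL
    # Keep asking while required slots are missing; otherwise recommend.
    return State.SLOT_FILLING if missing_required else State.RECOMMENDATION


def recommend(query, search, relaxations):
    """Retry the search with progressively relaxed queries until it succeeds."""
    results = search(query)
    for relax in relaxations:
        if results:
            break
        query = relax(query)
        results = search(query)
    return results, query


def drop_extra_slots(query, required=("departCity", "arriveCity", "departDate")):
    """Strategy (1): remove the constraints that come from extra slots."""
    return {k: v for k, v in query.items() if k in required}
```

Here `search` is any callable that takes a slot dictionary and returns a list of matching items, so the same loop works for the train, flight and hotel databases.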

Throughout the dialogue, when the system finds that an intent cannot be completed, it promptly informs the user to avoid wasting time. For example, when a user wants to book a flight to a city where no air service is available, continuing the dialogue is pointless, so the system recommends other means of travel. A user can also change his or her intent at any time, and our system automatically carries over the slot information shared between domains during the transition.
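Carrying shared slot information across an intent change can be sketched as a filter over the dialogue state. The per-domain slot sets below are hypothetical; only the idea of keeping the slots common to both domains comes from the text.

```python
# Hypothetical slot inventories per domain.
DOMAIN_SLOTS = {
    "flight": {"departCity", "arriveCity", "departDate", "cabin"},
    "train": {"departCity", "arriveCity", "departDate", "trainType"},
    "hotel": {"city", "checkInDate"},
}


def switch_domain(state, new_domain):
    """Keep only the slot values that are also meaningful in the new domain."""
    return {k: v for k, v in state.items() if k in DOMAIN_SLOTS[new_domain]}


flight_state = {"departCity": "Beijing", "arriveCity": "Lhasa", "cabin": "economy"}
train_state = switch_domain(flight_state, "train")
```

With this rule, a user who switches from a flight to a train does not have to repeat the departure and arrival cities.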

### 3.3 Natural Language Generation (NLG)

So far, we have obtained both the intent category of the user query and the next dialogue action for each turn, which guide the NLG module to generate natural language text and reply to the user's query. Given the user's slot list, we convert it into an SQL statement, query the database that stores information on trains, flights and accommodations to check whether any items match the user's goal, and select the appropriate template for the reply.
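The conversion from a slot list to an SQL statement can be sketched with Python's built-in `sqlite3` module. The table layout, column names and rows are toy assumptions made for the example. Slot values are passed as query parameters, and column names are taken from a fixed whitelist, so user text never becomes part of the SQL string.

```python
import sqlite3

# Hypothetical schema: which columns of each table may be filtered on.
QUERYABLE = {"flight": ("departCity", "arriveCity", "departDate")}


def build_query(table, slots):
    """Turn filled slots into a parameterized SELECT statement."""
    columns = [c for c in QUERYABLE[table] if c in slots]
    where = " AND ".join(f"{c} = ?" for c in columns) or "1 = 1"
    return f"SELECT * FROM {table} WHERE {where}", [slots[c] for c in columns]


def find_items(conn, table, slots):
    sql, params = build_query(table, slots)
    return conn.execute(sql, params).fetchall()


# Toy in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight "
             "(flightNo TEXT, departCity TEXT, arriveCity TEXT, departDate TEXT)")
conn.executemany("INSERT INTO flight VALUES (?, ?, ?, ?)", [
    ("F001", "Beijing", "Shanghai", "2018-05-01"),
    ("F002", "Shanghai", "Beijing", "2018-05-01"),
])
rows = find_items(conn, "flight",
                  {"departCity": "Beijing", "arriveCity": "Shanghai"})
```

An empty `rows` list is the signal for the recommendation state to fall back to a relaxed query.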

Because large-scale dialogue corpora are lacking in these domains, we generate utterances with template-based NLG, which means that we need to cover every combination of slot states when predefining the dialogue templates. Once the user's dialogue action is found among the predefined sentence templates, we fill the slot placeholders with values from the user's dialogue history. One advantage of this approach is that it keeps the responses given to users controllable.
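Template-based generation of this kind can be sketched as a dictionary lookup followed by placeholder substitution. The template keys, the English wording and the fallback sentence are hypothetical; the system's own templates are not given in the text.

```python
# Hypothetical templates keyed by (domain, dialogue action, requested slot).
TEMPLATES = {
    ("flight", "request", "departDate"):
        "When would you like to fly from {departCity} to {arriveCity}?",
    ("flight", "inform", None):
        "I found flight {flightNo} from {departCity} to {arriveCity} on {departDate}.",
}
FALLBACK = "Sorry, could you rephrase that?"


def generate(domain, action, slots, target=None):
    """Pick the template for this dialogue action and fill in the slot values."""
    template = TEMPLATES.get((domain, action, target))
    if template is None:
        return FALLBACK
    try:
        return template.format(**slots)
    except KeyError:
        # A slot required by the template has not been filled yet.
        return FALLBACK


reply = generate("flight", "request",
                 {"departCity": "Beijing", "arriveCity": "Shanghai"},
                 target="departDate")
```

Because every response comes from a fixed template, the system cannot produce an utterance that was not written in advance, which is the controllability advantage noted above.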

## 4. EXPERIMENTS

In the Chinese Human-Computer Dialogue Technology Evaluation Campaign, a task-oriented dialogue system is developed to help users book flights, trains and hotels.

### 4.1 Data Sets

Since only three databases are provided, we extend the data set of task 1 for domain classification and rule extraction. We also annotate a 300-dialogue corpus (about 1,500 sentences for training) with slot labels for evaluating LSTM-CRF and BLSTM-CRF models. Table 1 and Table 2 show the details of the data set.

Table 1. The statistics of training corpus for domain classification.

| | Train | Flight | Hotel | Others |
|---|---|---|---|---|
| #sentence | 533 | 510 | 512 | 588 |

Table 2. The statistics of training corpus for slot filling.

| #label (Train, Flight, Hotel) | #sentence | Avg. sentence length (word) |
|---|---|---|
| 13, 15 | 1,013 | 5.86 |

### 4.2 Evaluation

For slot filling in NLU, we adopt the entity-level F1 score commonly used in named entity recognition. Evaluating dialogue, however, remains a difficult task. We use the evaluation metrics of the Chinese Human-Computer Dialogue Technology Evaluation Campaign, including task completion rate, user satisfaction score, dialogue naturalness, number of turns and robustness to uncovered cases.
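Entity-level F1 counts a predicted slot as correct only if both its label and its exact span match the gold annotation. A minimal sketch, assuming each sentence's slots are given as a set of hypothetical (label, start, end) tuples:

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level precision, recall and F1.

    gold and pred are lists with one set of (label, start, end) per sentence.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [{("departCity", 4, 5), ("arriveCity", 6, 8)}]
pred = [{("departCity", 4, 5), ("arriveCity", 6, 7)}]  # second span is too short
precision, recall, f1 = entity_f1(gold, pred)
```

In this toy case one of two predictions matches one of two gold entities, so precision, recall and F1 are all 0.5; a partially overlapping span earns no credit.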

## 5. DISCUSSION

Table 3 shows the results of slot filling. We compare LSTM-CRF and BLSTM-CRF using unigram features alone and unigram plus bigram features. With unigram features, the score increases by 1.7 percentage points (from 85.93% to 87.63%) when the word sequence is read in both directions with BLSTM. Bigram features help LSTM-CRF slightly (85.93% to 86.20%) but hurt BLSTM-CRF (87.63% to 87.21%), as the average length of the bigram sequences is short. One possible remedy is to use character embeddings instead of word embeddings.

Table 3. Comparison of labeling performance on NLU.

| Feature | LSTM-CRF | BLSTM-CRF |
|---|---|---|
| Unigram | 85.93% | 87.63% |
| Unigram + bigram | 86.20% | 87.21% |

Table 4 reports the performance of our system on the evaluation metrics described in Section 4.2. Most of the metrics, except for the average number of dialogue turns, are annotated manually. Our system obtained the best score on user satisfaction, dialogue naturalness and boot ability, owing to the reasonable dialogue intent transition templates we predefined. However, this comes with a decline in task performance, especially when the user intent is not identified or an important slot value is not extracted correctly.

Table 4. Dialogue quality evaluation results between top 4 teams in the competition.

| Metric | ShenSiKao | PuTaoWeiDu | MDKB-BOT | ChuMenWenWen |
|---|---|---|---|---|
| Completion rate | 31.75 | 19.05 | 19.05 | 11.11 |
| #turn | 64.53 | 72.28 | 78.72 | 71.39 |

User satisfaction −1 −2
Naturalness −1 −1
Boot ability

## 6. CONCLUSION

In this paper, we proposed a simple but practical framework for multi-domain task-oriented dialogue systems. Our model leverages both neural networks and rule-based strategies to handle the domain transition problem. It achieves competitive results in the Chinese Human-Computer Dialogue Technology Evaluation Campaign, especially on the user-friendliness and utterance guidance metrics. In future work, we plan to apply end-to-end neural networks to NLG, based on the information extracted and maintained in NLU and DM, to improve system performance.

## AUTHOR CONTRIBUTIONS

Y. Lao (laoyadi@bupt.edu.cn) and W. Liu (liuweijie@bupt.edu.cn) are the leaders of the MDKB-BOT system and designed its overall framework. S. Gao (gaosheng@bupt.edu.cn, corresponding author) and S. Li (lisi@bupt.edu.cn) summarized the applications and drafted the paper. All authors participated in the revision of the manuscript.

## ACKNOWLEDGEMENTS

This work was supported by Beijing Natural Science Foundation (No. 4174098), National Natural Science Foundation of China (No. 61702047) and the Fundamental Research Funds for the Central Universities (No. 2017RC02).

## REFERENCES

[1] P. Xu, & R. Sarikaya. Contextual domain classification in spoken language understanding systems using recurrent neural network. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 136–140. doi: 10.1109/ICASSP.2014.6853573.

[2] A. Jaech, L. Heck, & M. Ostendorf. Domain adaptation of recurrent neural networks for natural language understanding. doi: 10.21437/Interspeech.2016-1598.

[3] A. Bapna, G. Tür, D. Hakkani-Tür, & L. Heck. Sequential dialogue context modeling for spoken language understanding. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 103–114. doi: 10.18653/v1/W17-5514.

[4] J.D. Williams, K. Asadi, & G. Zweig. Hybrid code networks: Practical and efficient end-to-end dialogue control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274, 2017.

[5] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4)(1992), 229–256. doi: 10.1007/BF00992696.

[6] B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, & K.-F. Wong. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2221–2230. doi: 10.18653/v1/D17-1237.

[7] N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, & S. Young. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777, 2016.

[8] O. Dušek, & F. Jurčíček. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. arXiv preprint arXiv:1606.05491, 2016.

[9] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.