Abstract
Tracking dialogue states to better interpret user goals and feed downstream policy learning is a bottleneck in dialogue management. Common practice has been to treat it as a problem of classifying dialogue content into a set of pre-defined slot-value pairs, or generating values for different slots given the dialogue history. Both approaches are limited in how they consider the dependencies that occur across dialogue turns, and both lack reasoning capabilities. This paper proposes to track dialogue states gradually by reasoning over dialogue turns with the help of the back-end data. Empirical results demonstrate that our method outperforms the state-of-the-art methods in terms of joint belief accuracy for MultiWOZ 2.1, a large-scale human-human dialogue dataset across multiple domains.
1 Introduction
Dialogue State Tracking (DST) usually works as a core component to monitor the user's intentional states (or belief states) and is crucial for appropriate dialogue management. A state in DST typically consists of a set of dialogue acts and slot-value pairs. Consider the task of restaurant reservation as shown in Figure 1. In each turn, the user may inform the agent of particular goals, either a single one such as inform(food=Indian) or a composed one such as inform(area=center,food=Jamaican). The goals given during a turn are referred to as the turn belief. The joint belief is the set of accumulated turn goals updated until the current turn, which summarizes the information needed to successfully maintain and finish the dialogue.
Traditionally, a dialogue system is supported by a domain ontology, which defines a collection of slots and the values that each slot can take. The aim of DST is to identify good features or patterns and map them to entries in the ontology, such as specific slot-value pairs. It is often treated as a classification problem. Therefore, most efforts center on (1) finding salient features: from hand-crafted features (Wang and Lemon, 2013; Sun et al., 2014a), semantic dictionaries (Henderson et al., 2014b; Rastogi et al., 2017), to neural network extracted features (Mrkšić et al., 2017); or (2) investigating effective mappings: from rule-based models (Sun et al., 2014b), generative models (Thomson and Young, 2010; Williams and Young, 2007), to discriminative ones (Lee and Eskenazi, 2013; Ren et al., 2018; Xie et al., 2018). On the other hand, some researchers address these methods' over-dependence on the domain ontology. They perform DST in the absence of a comprehensive domain ontology and handle unknown slot values by generating words from the dialogue history or a knowledge source (Rastogi et al., 2017; Xu and Hu, 2018; Wu et al., 2019).
However, the critical problem of modeling dependencies and reasoning over dialogue history is not well researched. Many existing methods work at the turn level only: they take in the current turn utterance and output the corresponding turn belief (Henderson et al., 2014b; Zilka and Jurcicek, 2015; Rastogi et al., 2017; Xu and Hu, 2018). Compared to the joint belief, the resulting turn belief only reflects single-turn information and is thus of less practical use. Therefore, more recent efforts target the joint belief that summarizes the dialogue history. Generally speaking, they accumulate turn beliefs by rules (Mrkšić et al., 2017; Zhong et al., 2018; Nouri and Hosseini-Asl, 2018) or model information across turns via various recurrent neural networks (RNNs) (Wen et al., 2017; Ramadan et al., 2018). Although these RNN-based methods model dialogue in a turn-by-turn style, they usually feed the whole turn utterance directly to the RNN; the utterance contains a large portion of noise, which results in unsatisfactory performance (Liao et al., 2018; Zhang et al., 2019b). More recently, some works directly merge a fixed window of past turns (Perez and Liu, 2017; Wu et al., 2019) as new input and achieve state-of-the-art performance (Wu et al., 2019). Nonetheless, their capability of modeling long-range dependencies and reasoning in the interactive dialogue process is rather limited. For example, Wu et al. (2019) perform gated copy to generate slot values from dialogue history. Although certain turns of utterances are exposed to the model, the interactive signals are lost when turns are concatenated, so the model fails to do in-depth reasoning over turns.
Very recently, research has started to work in a turn-by-turn style with pre-trained models. Generally speaking, such methods take the previous turn's belief state and the current turn utterances as input to generate the new dialogue state (Chao and Lane, 2019; Kim et al., 2020; Chen et al., 2020). However, a long-ignored fact is that, as an agent's central component, the state tracker not only receives dialogue history but also observes the back-end database or knowledge base. Such an information source provides valuable hints for it to reason about user goals and update belief states. It is therefore natural to construct a bipartite graph based on the database, where the entities and entity attributes are the two groups of nodes and the edges connecting them express the attribute-belonging relation. In the example in Figure 1, the database does not contain a restaurant entity that serves Jamaican food and is located in the center area. Thus there is no two-hop path between these two attribute nodes. Existing methods like Wu et al. (2019) have to infer this from system utterances, while a DST reasoning over the database would easily obtain such clues explicitly.
In this paper, we propose to do reasoning over turns and reasoning over database in Dialogue State Tracking (ReDST) for task-oriented systems. For reasoning over turns, we model dialogue state tracking as a recursive process in which the current joint belief relies on the generated current turn belief and last joint belief. Motivated by the limited length of single turn utterance and the good performance of pre-trained BERT (Devlin et al., 2019), we formalize the turn belief prediction as a token and sequence classification problem. It follows a multitask learning setting with augmented utterance inputs. To integrate the last turn belief results, an incremental inference module is applied for more robust belief updates. For reasoning over a database, we abstract the back-end database as a bipartite graph, and propagate extracted beliefs over the graph to obtain more realistic dialogue states. Contributions are summarized as:
We propose to rethink the dialogue state tracking problem for task-oriented agents, pointing out the need for proper reasoning over turns and reasoning over back-end data.
We represent the database as a bipartite graph and perform belief propagation on it, which enables the belief tracker to gain insight into potential candidates and detect conflicting requirements over the course of the conversation.
With pre-trained Transformer models working on augmented short utterances to produce more accurate turn beliefs, we incrementally infer the joint belief by reasoning in a turn-by-turn style and outperform state-of-the-art methods by a large margin.
2 Related Work
2.1 Dialogue State Tracking
A plethora of research has been focused on DST. We briefly discuss them in general chronological order. At the early stage, traditional dialogue state trackers combine semantic information extracted by Language Understanding (LU) modules to do DST (Williams and Young, 2007; Williams, 2014). Such trackers accumulate errors from the LU part and possibly suffer from information loss of dialogue context. Subsequent word-based (Henderson et al., 2014b; Zilka and Jurcicek, 2015) trackers thus forgo the LU part and directly infer states using dialogue history. Hand-crafted semantic dictionaries are utilized to hold all key terms, rephrases and alternative mentions to delexicalize for achieving generalization (Rastogi et al., 2017).
Recently, most approaches for dialogue state tracking rely on deep learning models (Wen et al., 2017; Ramadan et al., 2018). Mrkšić et al. (2017) leveraged pre-trained word vectors to resolve lexical/morphological ambiguity. Because that approach treats slots independently, which might miss relations among slots (Ouyang et al., 2020), Zhong et al. (2018) proposed global modules to share parameters between estimators for different slots. Similarly, Nouri and Hosseini-Asl (2018) used only one recurrent network with global conditioning to reduce latency while preserving performance. In general, these methods represent the dialogue state as a distribution over all candidate slot values that are defined in the ontology. This is often solved as a classification or matching problem. However, these methods rely heavily on a comprehensive ontology, which often might not be available. Therefore, Rastogi et al. (2017) introduced a sophisticated candidate generation strategy, while Perez and Liu (2017) followed the general paradigm of machine reading and proposed to solve it using an end-to-end memory network. Xu and Hu (2018) utilized the pointer network to extract slot values from utterances, while Wu et al. (2019) integrated a copy mechanism to generate slot values.
However, these methods tend to largely ignore the dialogue logic and dependencies. For example, inter-utterance information and correlations between slot values have been shown to be challenging, let alone the frequent goal shifting of users. Consequently, reasoning over turns is sensible. We first aim to improve the turn belief prediction, then model the joint belief prediction as an updating process. Very recently, we see such design leveraged by several works. For example, Chao and Lane (2019) leverage BERT model to extract slot values for each turn, then employ a rule-based update mechanism to track dialogue states across turns. Ren et al. (2019) encode previous dialogue state and current turn utterances using Bi-LSTM, then hierarchically decode domains, slots, and values one after another. At the same time, Kim et al. (2020) encode these inputs with BERT model while predicting operation gates and generating possible values. Still, such methods largely ignore the fact that as an agent, it has access to the back-end data structure which can be leveraged to further improve the performance of DST.
2.2 Incremental Reasoning
The ability to do reasoning over the dialogue history is essential for dialogue state trackers. At the turn level, we aim to extract more accurate slot values from the user utterance with the help of contextualized semantic inference. Contextualized representation learning in NLP dates back to Collobert and Weston (2008) but has had a resurgence in recent years. Contextualized word vectors were pre-trained using machine translation data and transferred to text classification and QA tasks (McCann et al., 2017). Most recently, BERT (Devlin et al., 2019) employed Transformer layers (Vaswani et al., 2017) with a masked language modeling objective and achieved superior performance across various tasks. In DST, we also observe a wide adoption of such models (Shan et al., 2020; Liao et al., 2021). For example, Kim et al. (2020) and Heck et al. (2020) adopted the pre-trained BERT as the base network. Hosseini-Asl et al. (2020) applied the pre-trained GPT-2 (Alec et al., 2019) model as the base network for dialogue state tracking.
At the dialogue context level, since we perform reasoning via belief propagation through a graph, our work is also related to a wide range of graph reasoning studies. As a relatively early work, the PageRank algorithm (Page et al., 1999) used a random walk with restart mechanism to perform multi-hop reasoning. Almost at the same time, loopy belief propagation (Murphy et al., 1999) was proposed to calculate the approximate marginal probabilities of vertices in a graph based on partial information. In recent years, research on graph reasoning has moved to learning symbolic inference rules from relational paths in the KG (Xiong et al., 2017; Das et al., 2017). Under these settings, a large number of entities and many types of relationships are usually involved. In DST, Chen et al. (2020) leveraged schema graphs containing slot relations, but their method relied heavily on a complete slot ontology. Zhou and Small (2019) incorporated a dynamically evolving knowledge graph to explicitly learn relationships between slots. In our work, only the attribute-belonging relations are captured, and the constructed graph is simply a bipartite graph. We thus resort to heuristic belief propagation on the bipartite graph for reasoning. Exploring more advanced models is left as future work.
3 ReDST Model
The proposed ReDST model in Figure 2 consists of three components: a turn belief generator, a bipartite graph belief propagator, and an incremental belief generator. Instead of predicting the joint belief directly from dialogue history, we perform two-stage inference. The model first obtains the turn belief from the augmented turn utterance via Transformer models. Then, it reasons over the turn belief and the last joint belief with the help of the bipartite graph propagation results. Based on this, it incrementally infers the final joint belief.
To facilitate the detailed model description, we first introduce our mathematical notation. We define X = {(U1,R1),⋯,(UT,RT)} as the set of user utterance and system response pairs in T turns of dialogue, and B = {B1,⋯,BT} as the joint belief states at each turn. While Bt summarizes the dialogue history up to the current turn t, we also model the turn belief Qt that corresponds to the belief state of a specific turn (Ut,Rt), and denote Dt as the domain of this specific turn. Following Wu et al. (2019), we design our state tracker to handle multiple tasks. Thus, each Bt or Qt consists of tuples of the form (domain,slot,value). Supposing there are K different (domain,slot) pairs in total, we denote Yk as the true slot value for the k-th (domain,slot) pair.
3.1 BERT-based Turn Belief Generator
Denoting Xt = (Ut,Rt) as the t-th turn utterance, the goal of the turn belief generator is to predict an accurate state for this specific utterance. Although the dialogue history X can grow to arbitrary length, the turn utterance Xt is usually relatively short. To utilize contextualized representations for extracting beliefs and to benefit from the good performance of pre-trained encoders, we fine-tune BERT as our base network while attaching sequence classification and token classification layers in a multitask learning setting. The token classification task extracts specific slot value spans. The sequence classification task decides which domain the turn is talking about and whether a specific (domain,slot) pair takes a gate value such as yes, no, dontcare, none, or generate (in which case the value comes from token classification).
The model architecture of BERT is a multi-layer bidirectional Transformer encoder based on the original Transformer model (Vaswani et al., 2017). The input representation is the sum of WordPiece embeddings (Wu et al., 2016), positional embeddings, and the segment embedding. As we need to predict the values for each (domain,slot) pair, we augment the input sequence as follows. Suppose the original utterance is Xt = x1,⋯,xN; the augmented utterance is then Xt′ = [CLS], domain, slot, [SEP], x1, ⋯, xN, [SEP]. The specific (domain,slot) pair works as a query to extract the answer span. We denote the outputs of BERT as H = h1,⋯,hN+5 (see Note 1). The BERT model is pre-trained on large-scale unlabeled text with two strategies, masked language modeling and next sentence prediction, which provide a powerful context-dependent sentence representation.
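The input augmentation described above can be illustrated with a minimal sketch. It assumes, as in Note 1, that WordPiece splitting is ignored and that the domain and slot each occupy a single token; the function name build_augmented_input and the example utterance are hypothetical and not part of the original system.

```python
def build_augmented_input(domain, slot, utterance_tokens):
    """Return [CLS], domain, slot, [SEP], x_1..x_N, [SEP] as a token list.

    Simplification: domain and slot are treated as one token each and
    WordPiece splitting is ignored.
    """
    return ["[CLS]", domain, slot, "[SEP]", *utterance_tokens, "[SEP]"]


# Hypothetical user utterance, already split into N whitespace tokens.
utterance = "i want indian food in the centre".split()
augmented = build_augmented_input("restaurant", "food", utterance)

print(augmented)
# The augmented sequence has N + 5 positions, matching H = h1, ..., hN+5.
print(len(augmented) == len(utterance) + 5)
```

One such sequence is built per (domain,slot) pair, so the same utterance is queried once for each slot of interest.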
3.1.1 Filter for Improving Efficiency
3.2 Joint Belief Reasoning
Now we can predict the turn-level belief state for each turn. Intuitively, we could directly apply our turn belief generator to the concatenated dialogue history to obtain the joint belief, as in Wu et al. (2019). However, this is hardly optimal. First, treating all utterances as one long sequence loses the iterative character of dialogue, resulting in information loss. Second, current models like recurrent networks or Transformers are known to model long-range dependencies poorly. Long sequences make modeling more difficult and increase the computational cost of Transformers, and the WordPiece separation operation makes sequences even longer. Therefore, we simulate the dialogue procedure as a recursive process where the current joint belief Bt relies on the last joint belief Bt−1 and the current turn belief Qt. Generally speaking, we use Bt−1 and Qt to perform belief propagation on the bipartite graph constructed from the back-end database to obtain a credibility score for each slot-value pair. Then, we do incremental belief reasoning over the recursive process using different methods.
3.2.1 Bipartite Graph Belief Propagator
As the central component for dialogue systems, the dialogue state tracker has access to the back-end database most of the time. In the course of the task-oriented dialogue, the user and agent interact with each other to reach the same stage of information awareness regarding a specific task. The user expresses requirements that, often, are hard to meet. The agent resorts to the back-end database and responds accordingly. Then the user would adjust their requirements to get the task done. In most existing DSTs, the tracker has to infer such adjustment requirements from dialogue history. With reasoning over the agent's database, we expect to harvest more accurate clues explicitly for belief update.
Consequently, we abstract the database as a bipartite graph G = (V,E), where the vertices are partitioned into two groups: the entity set Vent and the attribute set Vattr, with V = Vent ∪ Vattr and Vent ∩ Vattr = ∅. Vertices within Vent are not connected to one another, and neither are vertices within Vattr. Each edge links one vertex in Vent to one vertex in Vattr and represents the attribute-belonging relationship. During each turn, we first map the predicted Qt and the last joint belief Bt−1 to belief distributions over the graph via the function g(⋅). We realize g(⋅) by fuzzy matching: we use the BERT tokenizer to tokenize both dialogue and database entries, compute the token-level overlap ratio, and accept a match when the ratio exceeds a pre-set threshold ϵ. For example, the generated 'cambridge punt ##er' will be mapped to the database entry 'the cambridge punt ##er' when their overlap ratio is larger than ϵ. In our experiment, we find that approximately 60.5% of entity names and 12.2% of other slot values can be mapped (see Note 2). This mapping operation actually helps to correct some minor errors made in span extraction or generation.
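The mapping g(⋅) and the bipartite structure can be sketched as follows. This is only an illustration under stated assumptions: the exact overlap ratio is not specified here, so the sketch uses the Jaccard ratio of token sets; the threshold value 0.7, the toy database, and all function names are hypothetical.

```python
def overlap_ratio(pred_tokens, entry_tokens):
    """Token-level overlap, here assumed to be the Jaccard ratio of token sets."""
    pred, entry = set(pred_tokens), set(entry_tokens)
    if not pred or not entry:
        return 0.0
    return len(pred & entry) / len(pred | entry)


def map_to_entry(pred_tokens, entries, eps=0.7):
    """Return the best-matching database entry, or None if no ratio exceeds eps."""
    best_entry, best_score = None, 0.0
    for entry in entries:
        score = overlap_ratio(pred_tokens, entry)
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry if best_score > eps else None


def build_edges(db):
    """Edges of the bipartite graph: (entity, attribute) pairs only."""
    return {(entity, attr) for entity, attrs in db.items() for attr in attrs}


def entities_linked_to_all(edges, attrs):
    """Entities adjacent to every given attribute node.

    For two attributes, a non-empty result means a two-hop path
    attribute - entity - attribute exists in the graph.
    """
    entities = {entity for entity, _ in edges}
    return {e for e in entities if all((e, a) in edges for a in attrs)}


# Toy data, invented for illustration only.
TOY_DB = {
    "curry garden": {("food", "indian"), ("area", "centre")},
    "jerk house": {("food", "jamaican"), ("area", "north")},
}
ENTRIES = [("the", "cambridge", "punt", "##er"), ("cambridge", "arts", "theatre")]

mapped = map_to_entry(("cambridge", "punt", "##er"), ENTRIES)
edges = build_edges(TOY_DB)
conflict = entities_linked_to_all(edges, [("food", "jamaican"), ("area", "centre")])
feasible = entities_linked_to_all(edges, [("food", "indian"), ("area", "centre")])

print(mapped)
print(conflict)
print(feasible)
```

In the toy database no entity serves Jamaican food in the centre, so the first query returns an empty set, which mirrors the conflicting requirement in the Figure 1 example.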
3.2.2 Incremental Belief Generator
4 Experiments
4.1 Dataset
We carry out experiments on MultiWOZ 2.1 (Eric et al., 2019). It is a multi-domain dialogue dataset spanning seven distinct domains and containing over 10,000 dialogues. Compared to MultiWOZ 2.0, it fixes a substantial number of noisy dialogue state annotations and dialogue utterances that could negatively impact the performance of state-tracking models. MultiWOZ 2.1 has 30 domain-slot pairs and over 4,500 possible values, unlike existing standard datasets such as WOZ (Wen et al., 2017) and DSTC2 (Henderson et al., 2014a), which have fewer than ten slots and only a few hundred values. We follow the original training, validation, and testing split and directly use the DST labels. Since the hospital and police domains have very few dialogues (10% compared to others) and only appear in the training set, we only use the other five domains in our experiment.
4.2 Settings
Training Details
Our model is trained in a two-stage style. We first train the turn belief generator using the Adam optimizer with a batch size of 32. We adopt the bert-base-uncased version of BERT and initialize the learning rate for fine-tuning as 3e-5. The α and β in Equation 4 are set to 0.05 and 1.0, respectively. We use the average of the last four hidden layer outputs of BERT as the final representation of each token.
During the later reasoning stage, for incremental belief reasoning, we use a fully connected two-layer feed-forward neural network with ReLU activation as the MLP. The hidden size is set to 500, and the learning rate is initialized as 0.002. For the GRU, we set the learning rate as 0.005. We pre-process turn utterances to alleviate the problem of ground truth absence, for example, by converting time values into standard forms. Similar to Heck et al. (2020), we also make use of the system acts to enrich the system utterances.
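The token representation mentioned in the training details, the average of the last four hidden layer outputs, can be sketched in plain Python. The nested-list layout (layers, then tokens, then hidden dimensions) and the function name average_last_four are assumptions made for illustration; an actual implementation would operate on tensors returned by the encoder.

```python
def average_last_four(layer_outputs):
    """Average the last four layers, position by position.

    layer_outputs: list over layers; each layer is a list over tokens,
    and each token is a list of floats (its hidden vector).
    """
    last_four = layer_outputs[-4:]
    num_tokens = len(last_four[0])
    hidden_size = len(last_four[0][0])
    return [
        [
            sum(layer[t][d] for layer in last_four) / len(last_four)
            for d in range(hidden_size)
        ]
        for t in range(num_tokens)
    ]


# Toy input: 5 layers, 2 tokens, hidden size 2; layer l is filled with float(l).
toy_layers = [[[float(l)] * 2 for _ in range(2)] for l in range(5)]
token_repr = average_last_four(toy_layers)
print(token_repr)
```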
Evaluation Metrics
Similar to Wu et al. (2019), we adopt joint goal accuracy as the evaluation metric. It is a relatively strict evaluation standard. Joint goal accuracy compares the predicted belief states to the ground truth Bt at each turn t. For a given turn, the joint accuracy is 1.0 if and only if all (domain,slot,value) triplets are predicted correctly; otherwise it is 0.
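As a concrete reading of this metric, the following minimal sketch scores each turn as 1 when the predicted set of (domain,slot,value) triplets equals the ground truth exactly and 0 otherwise, then averages over turns. Averaging over turns is the usual convention and is assumed here; the function name and the toy dialogue are hypothetical.

```python
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns whose predicted belief state matches the gold state exactly.

    predicted, gold: lists with one set of (domain, slot, value) triplets per turn.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy three-turn dialogue, invented for illustration.
gold = [
    {("restaurant", "food", "indian")},
    {("restaurant", "food", "indian"), ("restaurant", "area", "centre")},
    {("restaurant", "food", "indian"), ("restaurant", "area", "north")},
]
predicted = [
    {("restaurant", "food", "indian")},
    {("restaurant", "food", "indian"), ("restaurant", "area", "centre")},
    # One wrong value makes the whole turn count as incorrect.
    {("restaurant", "food", "indian"), ("restaurant", "area", "centre")},
]

accuracy = joint_goal_accuracy(predicted, gold)
print(accuracy)
```

A single wrong or missing triplet zeroes the turn, which is why the metric is considered strict.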
Baselines
We denote the two versions of ReDST with different incremental reasoning modules as ReDST_MLP and ReDST_GRU. They are compared with the following baselines.
DST Reader (Gao et al., 2019): It treats DST as a reading comprehension problem. Given the history, it learns to extract slot values as spans.
HyST (Goel et al., 2019): It combines a hierarchical encoder in a fixed vocabulary system with an open vocabulary n-gram copy-based system.
TRADE (Wu et al., 2019): It concatenates the whole dialogue history as input and uses a generative state tracker with a copy mechanism to generate the value for each slot separately.
DST-Picklist (Zhang et al., 2019a): Given the whole dialogue history as input, it uses two BERT-based encoders and takes a hybrid approach of predefined ontology-based DST and open vocabulary-based DST. It defines picklist-based slots for classification and span-based slots for span extraction like DST Reader (Gao et al., 2019).
SOM (Kim et al., 2020): It works in a turn-by-turn style, considers the state as an explicit fixed-sized memory, and adopts a selective overwriting mechanism for generating values with copy.
SST (Chen et al., 2020): It leverages a graph attention matching network to fuse information from utterances and schema graphs. A recurrent graph attention network controls state updating. It relies on a predefined ontology.
4.3 DST Results
We first compare our model with the state-of-the-art methods. As shown in Table 1, our method outperforms all the other open-vocabulary baselines. For example, in terms of joint accuracy, which is a rather strict metric, ReDST_GRU achieves relative improvements of 46.2%, 17.4%, and 1.3% over the open-vocabulary methods DST Reader, TRADE, and SOM, respectively. Table 1 also shows that DST-Picklist and SST perform better than our method. However, they rely heavily on a predefined ontology: the value candidates for each slot are fixed in advance. They cannot handle unknown slot values, which largely limits their application in real-life scenarios.
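Table 1: Joint accuracy on MultiWOZ 2.1.
Type | Model | Joint Acc. |
---|---|---|
Predefined ontology | FJST | 0.378 |
Predefined ontology | HJST | 0.356 |
Predefined ontology | HyST | 0.381 |
Predefined ontology | DST-Picklist | 0.533 |
Predefined ontology | SST | 0.552 |
Open-vocabulary | DST Reader | 0.364 |
Open-vocabulary | TRADE | 0.453 |
Open-vocabulary | TRADE w/o gate | 0.411 |
Open-vocabulary | SOM | 0.525 |
Open-vocabulary | ReDST_MLP | 0.511 |
Open-vocabulary | ReDST_GRU | 0.532 |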
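The percentages quoted for ReDST_GRU are relative gains rather than absolute differences in joint accuracy. A short sketch reproduces them from the tabulated scores; the helper name relative_gain is hypothetical.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


redst_gru = 0.532  # joint accuracy of ReDST_GRU from the results table
baselines = {"DST Reader": 0.364, "TRADE": 0.453, "SOM": 0.525}

gains = {name: round(relative_gain(redst_gru, acc), 1) for name, acc in baselines.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

In absolute terms the corresponding differences are 0.168, 0.079, and 0.007 joint accuracy.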
We observe that a large portion of baselines work on relatively long windows of dialogue history. FJST directly encodes the raw dialogue history using recurrent neural networks. In contrast, HJST first encodes each turn utterance into a vector using a word-level RNN, and then encodes the whole history into vectors using a context-level RNN. However, the lower performance of HJST demonstrates its ineffectiveness in learning useful features for this task. Based on HJST, HyST manages to achieve better performance by further integrating a copy-based module. Still, its performance is lower than that of TRADE, which encodes the raw concatenated dialogue history and generates or copies slot values with extra slot gates. Generally speaking, these baselines are based on recurrent neural networks for encoding dialogue history. Since the interactions between user and agent can be arbitrarily long and recurrent neural networks are not effective in modeling long-range dependencies, they might not be a good choice to model the dialogue for DST. In contrast, single-turn utterances are usually short and contain relatively simple information compared to the complicated dialogue history. It is thus better to generate beliefs at the turn level and then integrate them via reasoning. Among the compared methods, the superior performance of SST, SOM, and the ReDST variants validates this design.
Moreover, we also tested the performance of TRADE without the slot gate. The performance drops dramatically, from 0.453 to 0.411 in terms of joint accuracy. We suspect that this is due to the lengthy dialogue history, where the decoder and copy mechanism start to lose focus. The model might generate some value that appears in the dialogue history but is not the ground truth. Therefore, the slot gate is used to decide which slot value should be taken, which resembles inference in some sense. To validate this, we feed the single-turn utterances to TRADE and generate the turn beliefs as output. Interestingly, we find that it performs similarly with or without the gate, which validates our guess. However, such gate-based inference is not enough. When the dialogue history becomes long, the gating mechanism loses its focus easily. Accordingly, we report the results of TRADE and ReDST on the last four turns of dialogues in Table 2. The better performance of ReDST further validates the importance of reasoning over turns. Usually, as the interactive dialogue goes on, users might frequently adjust their goals, which requires special consideration. Since a turn utterance is relatively more straightforward and dialogue is turn-by-turn in nature, doing DST turn by turn is a useful and practical design.
4.4 Component Analysis
Since our model makes use of the advanced BERT structure to learn contextualized representations, we first test how much BERT contributes. We carried out a study on the turn belief generator and compared it with SOM and the BiLSTM baseline TRADE on single-turn utterances. As shown in Table 3, the BERT-based SOM and ReDST indeed perform better than single-turn TRADE. This is due to the use of pre-trained BERT in learning better contextualized features. In the multitask setting of our design, both the token classification and sequence classification tasks benefit from BERT's strength. Moreover, we notice that in the single-turn setting, the system response usually depends on certain information mentioned in the former turn's user utterance. Therefore, we concatenate the former turn utterance to each current single turn as the input for BERT. Under this setting, we achieve a large boost in joint accuracy, as shown in Table 3. It provides an excellent base for the later-stage inference.
Table 3: Joint accuracy of turn belief generation on single-turn utterances.
Model | Joint Acc. |
---|---|
TRADE | 0.697 |
SOM | 0.799 |
ReDST | 0.808 |
We also tested the effect of reasoning over the database. For a clear comparison, we ignore the evidence obtained via bipartite graph belief propagation while keeping other settings the same, and organize the results in Table 4. It can be observed that both ReDST_MLP and ReDST_GRU gain a bit from belief propagation (from 0.507 to 0.511 and from 0.530 to 0.532, respectively), which validates the usefulness of database reasoning. However, since the graph is rather small, the performance improvement is rather limited. Similar patterns are found in Chen et al. (2020), and we suspect that belief propagation will be more helpful with a larger database structure. Also, we will further explore its usage in downstream tasks such as action prediction.
Table 4: Joint accuracy with (w BP) and without (w/o BP) bipartite graph belief propagation.
Setting | w BP | w/o BP |
---|---|---|
ReDST_MLP | 0.511 | 0.507 |
ReDST_GRU | 0.532 | 0.530 |
For different incremental reasoning modules, the results are also shown in Table 1. We find that ReDST_GRU performs better than ReDST_MLP. However, we notice that simply accumulating turn beliefs as in Zhong et al. (2018) performs very well. The rule is to add newly predicted turn belief entries to the last joint belief; when a different value for a slot appears, only the new one is kept. Although this rule seems simple, it actually reflects the dialogue's interactive and updating nature. We tried directly applying this rule to the ground truth turn beliefs to generate the joint belief, which results in 0.963 joint accuracy. However, a critical problem of such an accumulation rule is that when the generated turn belief is wrong, the rule cannot add a missing entry or delete a wrong entry. By applying a GRU, ReDST_GRU manages to correct some of these errors with the help of database evidence. Still, there is large room for more powerful reasoning models to address this error accumulation issue. We will further investigate in this direction.
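The accumulation rule discussed above is simple enough to state in a few lines. The sketch below assumes beliefs are stored as dictionaries keyed by (domain,slot); the function name accumulate and the toy turns are hypothetical. It illustrates the rule-based baseline only, not the learned MLP or GRU modules.

```python
def accumulate(last_joint, turn_belief):
    """Rule-based update: add new turn entries, overwrite slots whose value changed.

    Both beliefs are dicts mapping (domain, slot) -> value.
    """
    joint = dict(last_joint)   # copy so the previous joint belief is untouched
    joint.update(turn_belief)  # newer values replace older ones for the same slot
    return joint


# Toy dialogue: the user first asks for Jamaican food in the centre,
# then switches to Indian food.
turn_beliefs = [
    {("restaurant", "food"): "jamaican", ("restaurant", "area"): "centre"},
    {("restaurant", "food"): "indian"},
]

joint = {}
history = []
for q_t in turn_beliefs:
    joint = accumulate(joint, q_t)
    history.append(joint)

print(history[-1])
```

The rule never removes an entry, so a wrong slot predicted in an early turn stays in every later joint belief unless a later turn happens to overwrite it, which is the error accumulation issue noted above.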
4.5 Error Analysis
We also provide an error analysis for each slot for ReDST in Figure 3. For clarity, we also list the results of SOM for comparison. We observe that a large portion of the improvements of our method are on name entities and time-related slots. As mentioned in Wu et al. (2019), name slots in the attraction, restaurant, and hotel domains have the highest error rates. This is partly because these slots have a relatively large number of possible values that are hard to recognize. In ReDST, we map beliefs onto a bipartite graph constructed from the database and do belief propagation on it. This helps to improve the accuracy on name slots. Also, the classification gate design helps to improve performance on yes/no slots. We also observe that the performance for the taxi-destination slot becomes worse. This is due to the value co-reference phenomenon, where the user might just mention 'taxi to the hotel' to refer to the hotel name mentioned earlier. These findings are interesting and we will explore them further.
5 Conclusion
We rethink DST from the perspective of the agent and point out the need for in-depth reasoning rather than a sole focus on generating values from the history text as a whole. We demonstrated the importance of reasoning over turns and over the database. In detail, we fine-tuned pre-trained BERT for more accurate turn-level belief generation while doing belief propagation on a bipartite graph to harvest more clues. Experiments on a large-scale multi-domain dataset demonstrate the superior performance of the proposed method. In the future, we will explore more advanced algorithms for reasoning over turns and on graphs to generate a more accurate summarization of user intention.
Acknowledgments
This research is supported by the National Research Foundation, Singapore, under its International Research Centres in Singapore Funding Initiative. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
Notes
1. For ease of illustration, we ignore the WordPiece separation effect on token numbers.
2. Over half of the slot values are time, people, stay, day, etc. There are no such nodes in the bipartite graph, but we keep these slot values' existence in the belief vector.