Generalizing dialogue state tracking (DST) to new data is especially challenging due to the strong reliance on abundant and fine-grained supervision during training. Sample sparsity, distributional shift, and the occurrence of new concepts and topics frequently lead to severe performance degradation during inference. In this paper we propose a training strategy to build extractive DST models without the need for fine-grained manual span labels. Two novel input-level dropout methods mitigate the negative impact of sample sparsity. We propose a new model architecture with a unified encoder that supports value as well as slot independence by leveraging the attention mechanism. We combine the strengths of triple copy strategy DST and value matching to benefit from complementary predictions without violating the principle of ontology independence. Our experiments demonstrate that an extractive DST model can be trained without manual span labels. Our architecture and training strategies improve robustness towards sample sparsity, new concepts, and topics, leading to state-of-the-art performance on a range of benchmarks. We further highlight our model’s ability to effectively learn from non-dialogue data.

Generalization and robustness are among the key requirements for naturalistic conversational abilities of task-oriented dialogue systems (Edlund et al., 2008). In a dialogue system, dialogue state tracking (DST) solves the task of extracting meaning and intent from the user input, and keeps track of the user’s goal over the continuation of a conversation as part of a dialogue state (DS) (Young et al., 2010). A recommendation and booking system for places, for instance, needs to gather user preferences in terms of budget, location, and so forth. Concepts like these are assembled in an ontology on levels of domain (e.g., restaurant or hotel), slot (e.g., price or location), and value (e.g., “expensive” or “south”). Accurate DST is vital to a robust dialogue system, as the system’s future actions depend on the conversation’s current estimated state. However, generalizing DST to new data and domains is especially challenging. The reason is the strong reliance on supervised training.

Virtually all top-performing DST methods either entirely or partially extract values directly from context (Ni et al., 2021). However, training these models robustly is a demanding task. Extractive methods usually rely on fine-grained labels on word level indicating the precise locations of value mentions. Given the richness of human language and the ability to express the same canonical value in many different ways, producing such labels is challenging and very costly, and it is no surprise that datasets of such kind are rare (Zhang et al., 2020b; Deriu et al., 2021). Reliance on detailed labels has another downside; datasets are usually severely limited in size. This in turn leads to the problem of sample sparsity, which increases the risk for models to over-fit to the training data, for instance, by memorizing values in their respective contexts. Over-fitting prevents a state tracker from generalizing to new contexts and values, which is likely to break a dialogue system entirely (Qian et al., 2021). Recently, domain-independent architectures have been encouraged to develop systems that may be built once and then applied to new scenarios with no or little additional training (Rastogi et al., 2020a, b). However, training such flexible models robustly remains a challenge, and the ever-growing need for more training samples spurs creativity to leverage non-dialogue data (Heck et al., 2020a; Namazifar et al., 2021).

We propose novel strategies for extractive DST that address the following four issues of robustness and generalization. (1) We solve the problem of requiring fine-grained span labels with a self-supervised training scheme. Specifically, we learn from random self-labeled samples how to locate occurrences of arbitrary values. All that is needed for training a full DST model is the dialogue state ground truth, which is undoubtedly much easier to obtain than fine-grained span labels. (2) We handle the sample sparsity problem by introducing two new forms of input-level dropout into training. Our proposed dropout methods are easy to apply and provide a more economical alternative to data augmentation to prevent memorization and over-fitting to certain conversation styles or dialogue patterns. (3) We add a value matching mechanism on top of extraction to enhance robustness towards previously unseen concepts. Our value matching is entirely optional and may be utilized if a set of candidate values is known during inference, for instance, from a schema or API. (4) We propose a new architecture that is entirely domain-agnostic to facilitate transfer to unseen slots and domains. For that, our model relies on the attention mechanism and conditioning on natural language slot descriptions. The established slot-independence enables zero-shot transfer. We will demonstrate that we can actively teach to track new domains by learning from non-dialogue data. This is non-trivial as the model must learn to interpret dialogue data from exposure to unstructured data.

Traditional DS trackers perform prediction over a fixed ontology (Mrkšić et al., 2017; Liu and Lane, 2017; Zhong et al., 2018) and therefore have various limitations in more complex scenarios (Ren et al., 2018; Nouri and Hosseini-Asl, 2018). The idea of fixed ontologies is not sustainable for real world applications, as new concepts become impossible to capture during test time. Moreover, the demand for finely labeled data quickly grows with the ontology size, causing scalability issues.

Recent approaches to DST extract values directly from the dialogue context via span prediction (Xu and Hu, 2018; Gao et al., 2019; Chao and Lane, 2019), removing the need for fixed value candidate lists. An alternative to this mechanism is value generation via soft-gated pointer-generator copying (Wu et al., 2019; Kumar et al., 2020; Kim et al., 2020). Extractive methods have limitations as well, since many values may be expressed variably or implicitly. Contextual models such as BERT (Devlin et al., 2019) support generalization over value variations to some extent (Lee et al., 2019; Chao and Lane, 2019; Gao et al., 2019), and hybrid approaches try to mitigate the issue by resorting to picklists (Zhang et al., 2020a).

TripPy (Heck et al., 2020b) jointly addresses the issues of coreference, implicit choice, and value independence with a triple copy strategy. Here, a Transformer-based (Vaswani et al., 2017) encoder projects each dialogue turn into a semantic embedding space. Domain-slot specific slot gates then decide whether or not a slot-value is present in the current turn in order to update the dialogue state. In case of presence, the slot gates also decide which of the following three copy mechanisms to use for extraction. (1) Span prediction extracts a value directly from input. For that, domain-slot specific span prediction heads predict per token whether it is the beginning or end of a slot-value. (2) Informed value prediction copies a value from the list of values that the system informed about. This solves the implicit choice issue, where the user might positively but implicitly refer to information that the system provided. (3) Coreference prediction identifies cases where the user refers to a value that has already been assigned to a slot earlier and should now also be assigned to another slot in question. TripPy shows good robustness towards new data from known domains since it does not rely on a priori knowledge of value candidates. However, it does not support transfer to new topics, since the architecture is ontology specific. Transfer to new domains or slots is therefore impossible without re-building the model. TripPy also ignores potentially available knowledge about value candidates, since its copy mechanisms operate solely on the input. Lastly, training requires fine-grained span labels, complicating the transfer to new datasets.

While contemporary approaches to DST leverage parameter sharing and transfer learning (Rastogi et al., 2020a; Lin et al., 2021), the need for finely labeled training data is still high. Sample sparsity often causes model biases in the form of memorization or other types of over-fitting. Strategies to appease the hunger of larger models are the exploitation of out-of-domain dialogue data for transfer effects (Wu et al., 2020) and data augmentation (Campagna et al., 2020; Yu et al., 2020; Li et al., 2020; Dai et al., 2021). However, out-of-domain dialogue data is limited in quantity as well. Data augmentation still requires high level knowledge about dialogue structures and an adequate data generation strategy. Ultimately, more data also means longer training. We are aware of only one recent work that attempts DST with weak supervision. Liang et al. (2021) take a few-shot learning approach using only a subset of fully labeled training samples—typically from the end of conversations—to train a soft-gated pointer-generator network. In contrast, with our approach to spanless training, we reduce the level of granularity needed for labels to train extractive models. Note that these strategies are orthogonal.

Let $\{(U_1,M_1),\ldots,(U_T,M_T)\}$ be the sequence of turns that form a dialogue. $U_t$ and $M_t$ are the token sequences of the user utterance and preceding system utterance at turn $t$. The task of DST is (1) to determine for every turn whether any of the domain-slot pairs in $S = \{S_1,\ldots,S_N\}$ is present, (2) to predict the values for each $S_n$, and (3) to track the dialogue state $DS_t$. Our starting point is triple copy strategy DST (Heck et al., 2020b), because it has already been designed for robustness towards unseen values. However, we propose a new architecture with considerable differences to the baseline regarding its design, training, and inference to overcome the drawbacks of previous approaches as laid out in Section 2. We call our proposed framework TripPy-R (pronounced “trippier”), Robust triple copy strategy DST. Figure 1 is a depiction of our proposed model.

Figure 1:

Proposed model architecture. TripPy-R takes the turn and dialogue history as input and outputs a DS. All inputs are encoded separately with the same fine-tuned encoder. For inference, slot and value representations are encoded once and then stored in databases for retrieval.


### 3.1 Model Layout

##### Joint Components

We design our model to be entirely domain-agnostic, adopting the idea of conditioning the model with natural language descriptions of concepts (Bapna et al., 2017; Rastogi et al., 2020b). For that, we use data-independent prediction heads that can be conditioned with slot descriptions to solve the tasks required for DST. This differs from related work such as Heck et al. (2020b), which uses data-dependent prediction heads whose number depends on the ontology size. In contrast, prediction heads in TripPy-R are realized via the attention mechanism (Bahdanau et al., 2015). Specifically, we use scaled dot-product attention, implemented as multi-head attention as defined by Vaswani et al. (2017). We utilize this mechanism to query the input for the presence of information. Among other things, we deploy attention to predict whether or not a slot-value is present in the input, or to conduct sequence tagging—rather than span prediction—by assigning importance weights to input tokens.

##### Unified Context/Concept Encoder

Different from other domain-agnostic architectures (Lee et al., 2019; Ma et al., 2019), we rely on a single encoder that is shared among encoding tasks. This unified encoder is used to produce representations for dialogue turns and natural language slot and value descriptions. The encoder function is $\mathrm{Enc}(X) = [h_{CLS}, h_1, \ldots, h_{|X|}]$, where $X$ is a sequence of input tokens. $h_{CLS}$ can be interpreted as a representation of the entire input sequence. The vectors $h_1$ to $h_{|X|}$ are contextual representations for the sequence of input tokens. We define $\mathrm{Enc}_P(X) = [h_{CLS}]$ and $\mathrm{Enc}_S(X) = [h_1, \ldots, h_{|X|}]$ as the pooled encoding and sequence encoding of $X$, respectively.

Dialogue turns and natural language slot and value descriptions are encoded as
$R_t = \mathrm{Enc}_S(x_{CLS} \oplus U_t \oplus x_{SEP} \oplus M_t \oplus x_{SEP} \oplus H_t \oplus x_{SEP}),$
$r_{S_i} = \mathrm{Enc}_P(x_{CLS} \oplus S_i \oplus \text{“.”} \oplus S_i^{desc} \oplus x_{SEP}),$
$R_{V_{S_i,j}} = \mathrm{Enc}_S(x_{CLS} \oplus S_i \oplus \text{“is”} \oplus V_{S_i,j} \oplus x_{SEP}),$
where $H_t = \{(U_{t-1},M_{t-1}),\ldots,(U_1,M_1)\}$ is the history of the dialogue up to turn $t$. The special token $x_{CLS}$ initiates every input sequence, and $x_{SEP}$ is a separator token to provide structure to multi-sequence inputs. $S_i^{desc}$ is the slot description of slot $S_i$ and $V_{S_i,j}$ is a candidate value $j$ for slot $S_i$.
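As a concrete illustration, the three encoder input sequences might be assembled as follows before tokenization. This is a minimal sketch; the special-token strings, helper names, and toy data are assumptions for illustration, not the authors' implementation.

```python
# Assemble the three input types fed to the unified encoder.
# "[CLS]" / "[SEP]" stand in for x_CLS / x_SEP; names are illustrative.
CLS, SEP = "[CLS]", "[SEP]"

def turn_input(user, system, history):
    """x_CLS + U_t + x_SEP + M_t + x_SEP + H_t + x_SEP (history given newest-first)."""
    hist = " ".join(u + " " + m for u, m in history)
    return f"{CLS} {user} {SEP} {system} {SEP} {hist} {SEP}"

def slot_input(slot_name, slot_desc):
    """x_CLS + S_i + '.' + S_i^desc + x_SEP."""
    return f"{CLS} {slot_name} . {slot_desc} {SEP}"

def value_input(slot_name, value):
    """x_CLS + S_i + 'is' + V_{S_i,j} + x_SEP."""
    return f"{CLS} {slot_name} is {value} {SEP}"
```

The pooled encoding $\mathrm{Enc}_P$ would then be applied to the slot input, and the sequence encoding $\mathrm{Enc}_S$ to the turn and value inputs.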
##### Conditioned Slot Gate
The slot gate outputs a probability distribution over the output classes C = {none, dontcare, span, inform, refer, true, false}. Our slot gate can be conditioned to perform a prediction for one particular slot, allowing our architecture to be ontology independent. The slot attention layer attends to token representations of a dialogue turn given the representation of a particular slot $S_i$ as query, that is,
$[g_o, g_w] = \mathrm{MHA}_g(r_{S_i}, R_t, R_t),$
(1)
where $\mathrm{MHA}_{(\cdot)}(Q, K, V, \hat{k})$ is a multi-head attention layer that expects a query matrix $Q$, a key matrix $K$, a value matrix $V$, and an optional masking parameter $\hat{k}$. $g_o$ is the layer-normalized (Ba et al., 2016) attention output and $g_w$ are the attention weights. For classification, the attention output is piped into a feed-forward network (FFN) conditioned with $S_i$,
$g_s = \mathrm{softmax}(L_3(G_2(r_{S_i} \oplus G_1(g_o)))) \in \mathbb{R}^7,$
where $L_{(\cdot)}(x) = W_{(\cdot)} \cdot x + b_{(\cdot)}$ is a linear layer, and $G_{(\cdot)}(x) = \mathrm{GeLU}(L_{(\cdot)}(x))$ (Hendrycks and Gimpel, 2016).
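The shape of this computation can be sketched in a few lines of numpy. This is a toy, single-head stand-in with random weights and tanh in place of GeLU; all dimensions and names are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, C = 8, 5, 7            # hidden size, turn length, number of gate classes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, K, V):
    """Single-head scaled dot-product attention: returns (output, weights)."""
    w = softmax(q @ K.T / np.sqrt(K.shape[-1]))
    return w @ V, w

def slot_gate(r_slot, R_turn, W1, W2, W3):
    """Attend to the turn with the slot representation as query,
    then classify [r_slot ; FFN(g_o)] into the seven gate classes."""
    g_o, g_w = attention(r_slot, R_turn, R_turn)
    h = np.tanh(W1 @ g_o)                           # stand-in for G_1
    h = np.tanh(W2 @ np.concatenate([r_slot, h]))   # stand-in for G_2
    return softmax(W3 @ h), g_w                     # L_3 + softmax

r_slot = rng.standard_normal(d)        # pooled slot encoding r_{S_i}
R_turn = rng.standard_normal((L, d))   # turn token encodings R_t
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, 2 * d))
W3 = rng.standard_normal((C, d))
g_s, g_w = slot_gate(r_slot, R_turn, W1, W2, W3)
```

Because the slot enters only through the query and the FFN conditioning, the same parameters serve every slot, which is what makes the gate ontology independent.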
##### Sequence Tagging
In order to keep the value extraction directly from the input ontology-independent as well, our model re-purposes attention to perform sequence tagging. If the slot gate predicts span, the sequence attention layer attends to token representations of the current dialogue turn given $r_{S_i}$ as query, analogous to Eq. (1):
$[q_o, q_w] = \mathrm{MHA}_q(r_{S_i}, R_t, R_t, \hat{r}_t).$
(2)
Here, $\hat{r}_t$ is an input mask that only allows attending to representations of user utterances.
In contrast to other work that leverages attention for DST (Lee et al., 2019; Wu et al., 2019), we explicitly teach the model where to put the attention. This way, the predicted attention weights $q_w$ become the sequence tagging predictions. Tokens that belong to a value are assigned a weight of 1, all other tokens are weighted 0. Since $\|q_w\|_1 = 1$, we scale the target label sequences during training. During inference, we normalize $q_w$, namely,
$\hat{q}_w = [\hat{q}_1, \ldots, \hat{q}_{|X|}], \quad \text{with} \quad \hat{q}_j = q_{w,j} - \frac{1}{|X|} \max_{\forall q \in q_w} q,$
(3)
so that we can infer sequence tags according to an “IO” tagging scheme (Ramshaw and Marcus, 1995). All $\hat{q}_j > 0$ are assigned the “I” tag, all others the “O” tag. The advantage of sequence tagging over span prediction is that training can be performed using labels for multiple occurrences of the same slot-value in the input (for instance, in the current turn and the dialogue history), and that multiple regions of interest can be predicted. To extract a value from the context, we pick the sequence with the highest average token weight according to $\hat{q}_w$ among all sequences of tokens that were assigned the “I” tag and denote this value prediction as $\mathrm{Val}(\hat{q}_w)$.
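The normalization and the highest-average-run extraction can be sketched as follows. This is a plain-Python sketch under the stated definitions; function names are illustrative.

```python
def io_tags(q_w):
    """Normalize attention weights as in Eq. (3): subtract max(q_w)/|X|;
    tokens with a positive residual receive the 'I' tag."""
    thresh = max(q_w) / len(q_w)
    return [w - thresh for w in q_w]

def extract_value(tokens, q_w):
    """Among all contiguous 'I' runs, pick the one with the highest
    mean attention weight and return its token span."""
    q_hat = io_tags(q_w)
    best, best_score, run = None, float("-inf"), []
    for i, q in enumerate(q_hat):
        if q > 0:
            run.append(i)
        if (q <= 0 or i == len(tokens) - 1) and run:
            score = sum(q_w[j] for j in run) / len(run)
            if score > best_score:
                best, best_score = run, score
            run = []
    return " ".join(tokens[j] for j in best) if best else None
```

For instance, with tokens `["i", "want", "an", "expensive", "restaurant"]` and weights `[0.01, 0.01, 0.02, 0.9, 0.06]`, only "expensive" survives the threshold and is extracted.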
##### Informed Value Prediction

We adopt informed value prediction from TripPy. Ontology independence is established via our conditioned slot gate. The inform memory $I_t = \{I_t^1, \ldots, I_t^{|S|}\}$ tracks slot-values that were informed by the system in the current dialogue turn $t$. If the user positively refers to an informed value, and if the user does not express the value such that sequence tagging can be used (i.e., the slot gate predicts inform), then the value ought to be copied from $I_t$ to $DS_t$.

We know from works on cognition that “all collective actions are built on common ground and its accumulation” (Clark and Brennan, 1991). In other words, it must be established in a conversation what has been understood by all participants. The process of forming mutual understanding is known as grounding. Informed value prediction in TripPy-R serves as a grounding component. As long as the information shared by the system has not yet been grounded (i.e., confirmed by the user), it is not added to the DS. This is in line with information state and dialogue management theories such as devised by Larsson and Traum (2000), which view grounding as essential to the theory of information states and therefore DST.

##### Coreference Prediction
Although TripPy supports coreference resolution, this mechanism is limited to an a priori known set of slots. We use attention to establish slot independence for coreference resolution to overcome this limitation. If the slot gate predicts refer for a slot $S_i$, namely, that it refers to a value that has previously been assigned to another slot, then the refer attention needs to predict the identity of said slot $S_j$, that is,
$[f_o, f_w] = \mathrm{MHA}_f(G_5(r_{S_i} \oplus G_4(g_o)), R_S, R_S),$
where the slot attention output $g_o$ is first piped through an FFN. $R_S = [r_{S_1}, \ldots, r_{S_{|S|}}] \in \mathbb{R}^{d \times |S|}$ is the matrix of stacked slot representations and $f_w$ is the set of weights assigned to all candidate slots for $S_j$. The slot with the highest assigned weight is then our referred slot $S_j$. To resolve a coreference, $S_i$ is updated with the value of $S_j$. During inference, $R_S$ can be modified as desired to accommodate new slots.
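A minimal numpy sketch of the refer attention, using one-hot stand-ins for the stacked slot representations; the slot names and the identity-matrix representations are illustrative assumptions.

```python
import numpy as np

# Toy slot representations R_S: one orthogonal one-hot row per slot.
slot_names = ["hotel-price", "hotel-area", "restaurant-area", "taxi-destination"]
R_S = np.eye(4)

def refer_slot(query, R_S, slot_names):
    """Attend over the stacked slot representations with the transformed
    slot-attention output as query; the highest-weighted slot is the
    coreference target S_j."""
    logits = query @ R_S.T / np.sqrt(R_S.shape[-1])
    f_w = np.exp(logits - logits.max())
    f_w /= f_w.sum()
    return slot_names[int(np.argmax(f_w))], f_w

# A query aligned with the third slot representation selects that slot.
target, f_w = refer_slot(R_S[2], R_S, slot_names)
```

Because the candidate set enters only as keys and values, appending a new row to `R_S` at inference time immediately makes a new slot available as a coreference target.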
##### Value Matching
In contrast to picklist based methods such as that of Zhang et al. (2020a), TripPy-R performs value matching as an optional step. We first create slot-value representations for all value candidates $V_{S_i,j}$ of slot $S_i$, and learn matching of dialogue context $q_o$ to the list of candidate values via value attention:
$[r_{V_{S_i,j}}, v_w] = \mathrm{MHA}_q(r_{S_i}, R_{V_{S_i,j}}, R_{V_{S_i,j}}), \quad [m_o, m_w] = \mathrm{MHA}_m(q_o, R_{V_{S_i}}, R_{V_{S_i}}),$
(4)
where $R_{V_{S_i}} = [r_{V_{S_i,1}}, \ldots, r_{V_{S_i,|V_{S_i}|}}] \in \mathbb{R}^{d \times |V_{S_i}|}$. $m_w$ should place a weight close to 1 on the correct value and weights close to 0 on all the others. Dot-product attention as used in our model is defined as $\mathrm{softmax}(Q \cdot K^\top) \cdot V$. Computing the dot product between input and candidate value representations is proportional to computing their cosine similarities, which is $\cos(\theta) = \frac{q \cdot k}{\|q\| \cdot \|k\|}, \; \forall q \in Q, k \in K$. Therefore, optimizing the model to put maximum weight on the correct value and to minimize the weights on all other candidates forces representations of the input and of values occurring in that input to be closer in their common space, and vice versa.
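The proportionality claim can be checked numerically: once candidate representations are unit-normalized, the dot product with the context vector is exactly $\|q\|$ times the cosine similarity, so both induce the same ranking. The random toy vectors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.standard_normal(16)          # dialogue-context representation q_o
V = rng.standard_normal((10, 16))    # candidate value representations

# Unit-normalize the candidates: dot product = ||q|| * cosine similarity.
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
dots = Vn @ q
cosines = dots / np.linalg.norm(q)
```

Hence maximizing the attention logit of the correct candidate while suppressing the others directly increases the cosine similarity of the matching pair.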

### 3.2 Training and Inference

Each training step requires the dialogue turn and all slot and value descriptions to be encoded. Our unified encoder re-encodes all slot descriptions at each step. Because the number of values might be in the range of thousands, we encode them once for each epoch. The encoder is fine-tuned towards encoding all three input types. We optimize our model given the joint loss for each turn,
$\mathcal{L} = \lambda_g \cdot \mathcal{L}_g + \lambda_q \cdot \mathcal{L}_q + \lambda_f \cdot \mathcal{L}_f + \lambda_m \cdot \mathcal{L}_m,$
(5)
$\mathcal{L}_g = \textstyle\sum_i \ell(g_s, l^g_{S_i}), \quad l^g_{S_i} \in C,$
$\mathcal{L}_q = \textstyle\sum_i \ell(q_w, l^q_{S_i} / \|l^q_{S_i}\|_1), \quad l^q_{S_i} \in \{0,1\}^{|X|},$
$\mathcal{L}_f = \textstyle\sum_i \ell(f_w, l^f_{S_i}), \quad l^f_{S_i} \in \{0,1\}^{|S|},$
$\mathcal{L}_m = \textstyle\sum_i \ell(m_w, l^m_{S_i}), \quad l^m_{S_i} \in \{0,1\}^{|V_{S_i}|}.$
Here, $\ell(\cdot,\cdot)$ is the loss between a prediction and a ground truth. $\mathcal{L}_g$, $\mathcal{L}_q$, $\mathcal{L}_f$, and $\mathcal{L}_m$ are the joint losses of the slot gate, sequence tagger, coreference prediction, and value matching. Note that $\|\cdot\|_1 = 1$ for $l^f_{S_i}$ and $l^m_{S_i}$, that is, labels for coreference prediction and value matching are 1-hot vectors. Back-propagating $\mathcal{L}_m$ also affects the sequence tagger. We scale $l^q_{S_i}$ since sequence tagging may have to label more than one token as being part of a value.

During inference, the model can draw from its rich output, namely, slot gate predictions, coreference prediction, sequence tagging, and value matching to adequately update the dialogue state. Slot and value descriptions are encoded only once with the fine-tuned encoder and then stored in databases, as illustrated in Figure 1. Pre-encoded slots condition the attention and FFN layers, and pre-encoded values are used for value matching. Note that it is straightforward to update these databases on-the-fly for a running system, thus easily expanding its capacities. The final step is the processing of dialogue turns to perform dialogue state update prediction.

### 3.3 Dialogue State Update

At turn $t$, the slot gate predicts for each slot $S_i$ how it should be updated. none means that no update is needed. dontcare denotes that any value is acceptable to the user. span indicates that a value is extractable from any of the user utterances $\{U_t, \ldots, U_1\}$. inform denotes that the user refers to a value uttered by the system in $M_t$. refer indicates that the user refers to a value that is already present in $DS_t$ in a different slot. Classes true and false are used by slots that take binary values.

If candidate values are known at inference, TripPy-R can utilize value matching to benefit from supporting predictions for the span case. Because sequence tagging and value matching predictions would compete over the slot update, we use confidence scores to make an informed decision. Given the current input and candidate values for a slot, we can use the attention weights $m_w$ of the value attention as individual scores for each value. We can also use the L2-norm between input and values, namely, $e_{S_i,j} = \|q_o - r_{V_{S_i,j}}\|_2$, where $e_{S_i} = [e_{S_i,1}, \ldots, e_{S_i,|V_{S_i}|}]$ is the score set. Then
$\mathrm{Conf}(C) = 1 - \frac{\min_{\forall c \in C} c}{\left(\sum_{c \in C} c - \min_{\forall c \in C} c\right) / |C|},$
is applied to $m_w$ and $e_{S_i}$ (interpreting them as multisets rather than vectors) to compute two confidence scores $\mathrm{Conf}(m_w)$ and $\mathrm{Conf}(e_{S_i})$ for the most likely value candidate. This type of confidence captures the notion of difference between the best score and the mean of all other scores, intuitively expressing model certainty. $\mathrm{Val}(m_w) = \mathrm{argmax}(m_w)$ and $\mathrm{Val}(e_{S_i}) = \mathrm{argmin}(e_{S_i})$ are the most likely candidates according to value attention and L2-norm, respectively. For any slot that was predicted as span, the final prediction is
$S_i^* = \begin{cases} \mathrm{Val}(m_w), & \text{if } S_i \text{ is categorical} \wedge \mathrm{Conf}(m_w) > \tau, \\ \mathrm{Val}(e_{S_i}), & \text{else if } \mathrm{Conf}(e_{S_i}) > \tau, \\ \mathrm{Val}(\hat{q}_w), & \text{else,} \end{cases}$
where τ∈[0,1] is a threshold parameter that controls the level of the model’s confidence needed to still consider its value matching predictions.
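The decision cascade for a span-predicted slot can be sketched as follows, with the confidence scores assumed to be precomputed; names and the default threshold are illustrative assumptions.

```python
def slot_update(is_categorical, val_match, conf_match, val_l2, conf_l2,
                val_tag, tau=0.5):
    """Decision cascade for a slot whose gate predicted `span`:
    trust value matching for categorical slots if confident enough,
    then the L2-based match, else fall back to the tagged span."""
    if is_categorical and conf_match > tau:
        return val_match          # Val(m_w)
    if conf_l2 > tau:
        return val_l2             # Val(e_{S_i})
    return val_tag                # Val(q_hat_w)
```

Setting `tau` high makes the model fall back to pure extraction more often; setting it low gives value matching more influence when a candidate list is available.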

We propose the following methods to improve robustness in DST on multiple levels.

### 4.1 Robustness to Spanless Labels

Our framework introduces a novel training scheme to learn from data without span labels, therefore lowering the demand for fine-grained labels. We teach a proto-DST model that uses parts of TripPy-R’s architecture to tag random token sub-sequences that occur in the textual input. We use this model to locate value occurrences in each turn t of a dialogue as listed in the labels for DSt.

The proto-DST model consists of the unified encoder and the sequence attention of TripPy-R, as depicted in Figure 2. Let $D_t = (U_t, M_t)$ be the input to the model, which is encoded as $R'_t = \mathrm{Enc}_S(x_{CLS} \oplus x_{NONE} \oplus x_{SEP} \oplus U_t \oplus x_{SEP} \oplus M_t \oplus x_{SEP})$. Let $Y \in D_t$ be a sub-sequence of tokens that was randomly picked from the input, encoded as $r_Y = \mathrm{Enc}_P(x_{CLS} \oplus Y \oplus x_{SEP})$. Both correspond to input types depicted in Figure 2. The sequence tagger is then described as
$[q'_o, q'_w] = \mathrm{MHA}_q(r_Y, R'_t, R'_t),$
analogous to Eq. (2). For training, we minimize
$\mathcal{L}_q = \ell(q'_w, l^q_Y / \|l^q_Y\|_1), \quad l^q_Y \in \{0,1\}^{|X|},$
analogous to Eq. (5). At each training step, a random negative sample $\bar{Y} \notin D_t$ rather than a positive sample is picked for training with probability $p_{neg}$. For $Y \in D_t$, the label $l^q_Y$ marks the positions of all tokens of $Y$ in $D_t$. For $\bar{Y} \notin D_t$, the label $l^q_{\bar{Y}}$ puts a weight of 1 onto the special token $x_{NONE}$ and 0 everywhere else. The desired behavior of this model is therefore to distribute the maximum amount of the probability mass uniformly among all tokens that belong to the randomly picked sequence. In case a queried sequence is absent from the input, all probability mass should be assigned to $x_{NONE}$. Table 1 lists positive and negative training examples.
Table 1:

Top: Training samples for the proto-DST. $Y = \{\text{“need”}, \text{“a”}, \text{“train”}\}$ is a randomly picked sub-sequence in $D_t$. The model needs to tag all tokens belonging to $Y$. For any random sequence $\bar{Y} \notin D_t$, all probability mass should be assigned to $x_{NONE}$. Bottom: Example of tagging the training data with a proto-DST given only spanless labels. The model needs to tag all tokens belonging to the respective values. Note how the proto-DST successfully tagged the word ‘upscale’ as an occurrence of the canonical value restaurant-price=expensive.

**Training (PMUL1188)**

| | xCLS | xNONE | xSEP | I | need | a | train | to | leave | from | Cambridge | after | 15:30 | xSEP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| labels for $Y \in D_t$ | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| labels for $\bar{Y} \notin D_t$ | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

**Tagging (PMUL2340)**

| | xCLS | xNONE | xSEP | Hi, | I | am | looking | for | an | upscale | restaurant | in | the | centre | xSEP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rest.-price=expensive prediction | 0 | .25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .87 | 0 | 0 | 0 | 0 | 0 |
| rest.-area=centre prediction | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .99 | 0 |
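The self-labeling scheme above can be sketched as a sample generator. This is an assumed, simplified word-level variant (the real model operates on subword tokens); the special-token strings and helper names are illustrative.

```python
import random

def make_sample(tokens, vocab, p_neg=0.1, max_len=4, rng=None):
    """Self-labeled proto-DST sample: with probability 1 - p_neg, pick a
    random contiguous sub-sequence Y of the turn and mark its positions;
    with probability p_neg, pick a token absent from the turn and place
    the full label mass on x_NONE (index 1)."""
    rng = rng or random.Random(0)
    seq = ["[CLS]", "[NONE]", "[SEP]"] + tokens + ["[SEP]"]
    if rng.random() < p_neg:
        query = [rng.choice([w for w in vocab if w not in tokens])]
        labels = [1 if i == 1 else 0 for i in range(len(seq))]
    else:
        n = rng.randint(1, min(max_len, len(tokens)))
        start = rng.randrange(len(tokens) - n + 1)
        query = tokens[start:start + n]
        labels = [0] * len(seq)
        for i in range(start, start + n):
            labels[3 + i] = 1   # offset 3 for [CLS] [NONE] [SEP]
    return query, labels
```

During training the label vector would then be normalized by its L1 norm, matching the scaled targets of the sequence tagger.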
Figure 2:

The proto-DST model for value tagging.

In order to tag value occurrences in dialogue turns for training with spanless labels, we predict for each value in $DS_t$ its position in $D_t$, given the proto-DST. Let $s_t^i$ be the value label for slot $S_i$ in turn $t$, which is encoded as $r_{s_t^i} = \mathrm{Enc}_P(x_{CLS} \oplus s_t^i \oplus x_{SEP})$. Value tagging is performed as
$[q'_o, q'_w] = \mathrm{MHA}_q(r_{s_t^i}, R'_t, R'_t, \hat{r}_t), \quad \forall s_t^i \in DS_t,$
which corresponds to the remaining input types in Figure 2. $q'_w$ is normalized according to Eq. (3). Table 1 shows examples of value tagging with the proto-DST. A set of tag weights $\hat{q}'_w$ is accepted if more than half the probability mass is assigned to word tokens rather than $x_{NONE}$. We use a morphological closing operation (Serra, 1982) to smooth the tags, that is,
$\hat{q}'_w \bullet \omega = \delta_{>\nu}\big((\hat{q}'_w \oplus \omega) \ominus \omega\big),$
(6)
where $\oplus$ and $\ominus$ are the dilation and erosion operators, $\delta_{>\nu}$ is an indicator function, $\hat{q}'_w$ is interpreted as an array, $\omega = [1,1,1]$ is a kernel, and $\nu$ is a threshold parameter that allows filtering of tags based on their predicted weights.
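A small plain-Python sketch of this closing step, assuming the length-3 structuring element and simple border handling (out-of-range positions are ignored); this is an illustration of the operation, not the authors' exact implementation.

```python
def close_tags(tags, nu=0.3, kernel=3):
    """Threshold tag weights at nu, then apply binary closing
    (dilation followed by erosion) with a length-3 kernel."""
    b = [1 if t > nu else 0 for t in tags]
    r = kernel // 2
    dil = [1 if any(b[max(0, i - r):i + r + 1]) else 0
           for i in range(len(b))]
    ero = [1 if all(dil[max(0, i - r):min(len(dil), i + r + 1)]) else 0
           for i in range(len(dil))]
    return ero
```

For example, a missed “:” inside “15:30” with weights `[0.9, 0.1, 0.8]` is closed into a single contiguous tag, while an isolated tag is left unchanged.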

Contextual representations enable our value tagger to also identify positions of value variants, i.e., different expressions of the same value (see Table 1 for an example). We tag turns without their history. To generate labels for the history portion, we simply concatenate the tags of the preceding turns with the tags of the current turn.

### 4.2 Robustness to Sample Sparsity

We propose new forms of input-level dropout to increase variance in training samples while preventing an increase in data and training time.

##### Token Noising

Targeted feature dropout (Xu and Sarikaya, 2014) has already been used successfully in the form of slot value dropout (SVD) to stabilize DST model training (Chao and Lane, 2019; Heck et al., 2020b). During training, SVD replaces tokens of extractable values in their context by a special token xUNK with a certain probability. The representation of xUNK amalgamates the contextual representations of all tokens that are not in the encoder’s vocabulary Venc and therefore carries little semantic meaning.

Instead of randomly replacing target tokens with xUNK, we use random tokens from a frequency-sorted $V_{enc}$. Specifically, a target token is replaced with probability $p_{tn}$ by a token $x_k \in V_{enc}$, where $k$ is drawn from a uniform distribution $U(1,K)$. Since the least frequent tokens in $V_{enc}$ tend to be nonsensical, we use a cut-off $K \ll |V_{enc}|$ for $k$. The idea behind this token noising is to avoid a train-test discrepancy. With SVD, xUNK is occasionally presented as a target during training, but the model will always encounter valid tokens during inference. With token noising, this mismatch does not occur. Further, token noising increases the variety of observed training samples, while SVD potentially produces duplicate inputs by masking with a placeholder.
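Token noising can be sketched as follows; the function name and the toy vocabulary are illustrative assumptions.

```python
import random

def token_noise(tokens, value_positions, vocab_by_freq, p_tn=0.3, K=None,
                rng=None):
    """Replace each value token, with probability p_tn, by a token drawn
    uniformly from the K most frequent vocabulary entries (instead of a
    fixed [UNK] placeholder as in slot value dropout)."""
    rng = rng or random.Random(0)
    K = K if K is not None else max(1, int(0.2 * len(vocab_by_freq)))
    out = list(tokens)
    for i in value_positions:
        if rng.random() < p_tn:
            out[i] = vocab_by_freq[rng.randrange(K)]
    return out
```

Only the positions known (or self-labeled) to contain value tokens are candidates for replacement; the surrounding context stays intact, so the model must rely on context rather than memorized value strings.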

##### History Dropout

We propose history dropout as another measure to prevent over-fitting due to sample sparsity. With probability phd, we discard parts of the turn history Ht during training. The cut-off is sampled from $U(1,t−1)$. Utilizing dialogue history is essential for competitive DST (Heck et al., 2020b). However, models might learn correlations from sparse samples that do not hold true on new data. The idea of history dropout is to prevent the model from over-relying on the history so as to not be thrown off by previously unencountered conversational styles or contents.
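A minimal sketch of history dropout over a turn-history list; names are illustrative assumptions.

```python
import random

def history_dropout(history, p_hd=0.3, rng=None):
    """With probability p_hd, truncate the turn history H_t at a cut-off
    drawn uniformly from U(1, t-1); otherwise keep the full history."""
    rng = rng or random.Random(0)
    if len(history) > 1 and rng.random() < p_hd:
        cut = rng.randint(1, len(history) - 1)
        return history[:cut]
    return history
```

Because the cut-off is resampled every time a turn is seen, the model observes the same turn with histories of varying length across epochs, discouraging over-reliance on any fixed history pattern.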

### 4.3 Robustness to Unseen Values

Robustness to unseen values is the result of multiple design choices. The applied triple copy strategy as proposed by Heck et al. (2020b) facilitates value independence. Our proposed token noising and history dropout prevent memorization of reoccurring patterns. TripPy-R’s value matching provides an alternative prediction for the DS update, in case candidate values are available during inference. Our model is equipped with the partial masking functionality (Heck et al., 2020b). Masking may be applied to informed values in the system utterances Mt,…,M1 using xUNK, which forces the model to focus on the system utterances’ context information rather than specific mentions of values.

### 4.4 Robustness to Unseen Slots and Domains

Domain transfer has the highest demand for generalizability and robustness. A transfer of the strong triple copy strategy DST baseline to new topics post facto is not possible due to ontology dependence of slot gates, span prediction heads, inform memory, and classification heads for coreference resolution. The latter two mechanisms in particular contribute to robustness of DST towards unseen values within known domains (Heck et al., 2020b). However, the proposed TripPy-R architecture is absolutely vital to establish robustness of triple copy strategy DST to unseen slots across new domains. TripPy-R is designed to be entirely domain-agnostic by using a model architecture whose parts can be conditioned on natural language descriptions of concepts.

### 5.1 Datasets

We use MultiWOZ 2.1 (Eric et al., 2020), WOZ 2.0 (Wen et al., 2017), sim-M, and sim-R (Shah et al., 2018) for robustness tests. MultiWOZ 2.1 is a standard benchmark for multi-domain dialogue modeling that contains 10000+ dialogues covering 5 domains (train, restaurant, hotel, taxi, attraction) and 30 unique domain-slot pairs. The other datasets are significantly smaller, making sample sparsity an issue. We test TripPy-R’s value independence on two specialized MultiWOZ test sets, OOOHeck (Heck et al., 2020b) and OOOQian (Qian et al., 2021), which replace many values with out-of-ontology (OOO) values. In addition to MultiWOZ version 2.1, we test TripPy-R on 2.0, 2.2, 2.3, and 2.4 (Budzianowski et al., 2018; Zang et al., 2020; Han et al., 2021; Ye et al., 2021a).

### 5.2 Evaluation

We use joint goal accuracy (JGA) as the primary metric to compare between models. The JGA given a test set is the ratio of dialogue turns for which all slots were filled with the correct value (including none). For domain-transfer tests, we report per-domain JGA, and for OOO prediction experiments, we also report per-slot accuracy. We repeat each experiment 10 times for small datasets, and three times for MultiWOZ and report averaged numbers and maximum performance. For evaluation, we follow Heck et al. (2020b).

### 5.3 Training

We initialize our unified encoder with RoBERTa-base (Liu et al., 2019). The input sequence length is 180 after WordPiece tokenization (Wu et al., 2016). The loss weights are $(\lambda_g, \lambda_q, \lambda_f, \lambda_m) = (0.8, \frac{1-\lambda_g}{2}, \frac{1-\lambda_g}{2}, 0.1)$. $\ell_g$ and $\ell_f$ are cross-entropy losses, and $\ell_q$ and $\ell_m$ are mean squared error losses. We use the Adam optimizer (Kingma and Ba, 2015) and back-propagate through the entire network including the encoder. We also back-propagate the error for slot encodings, since we re-encode them at every step. The learning rate is 5e-5 after a warmup portion of 10% (5% for MultiWOZ), then decays linearly. The maximum number of epochs is 20 for MultiWOZ, 50 for WOZ 2.0, and 100 for sim-M/R. We use early stopping with patience (20% of max. epochs), based on the development set JGA. The batch size is 16 (32 for MultiWOZ). During training, the encoder output dropout rate is 30%, and $p_{tn} = p_{hd} = 30\%$ (10% for MultiWOZ). The weight decay rate is 0.01. For token noising, we set $K = 0.2 \cdot |V_{enc}|$. We weight $\ell_g$ for none cases with 0.1. For value matching, we tune $\tau$ in decrements of 0.1 on the development sets.

For spanless training, the maximum length of random token sequences for the proto-DST model training is 4. The maximum number of epochs is 50 for the WOZ datasets and 100 for sim-M/R. The negative sampling probability is $p_{neg} = 10\%$.

### 6.1 Learning from Spanless Labels

The quality of the proto-DST for value tagging determines whether or not training without explicit span labels leads to useful DST models. We evaluate the tagging performance using MultiWOZ 2.1 as an example, calculating the ratio of turns for which all tokens are assigned the correct "IO" tag. Figure 3 plots the joint tagging accuracy across slots, dependent on the weight threshold in Eq. (6). The optimal threshold is $\nu = 0.3$; we found this to be true across all datasets. We also found that the morphological closing operation generally improves tagging accuracy. Typical errors corrected by this post-processing are gaps caused by occasionally failing to tag special characters within values, for example, ":" in times with hh:mm format, and imprecisions caused by model uncertainty when tagging long and complex values such as movie names. Average tagging accuracy across slots is 99.8%. This is particularly noteworthy since values in MultiWOZ can be expressed in a wide variety of ways (e.g., "expensive" might be expressed as "upscale", "fancy", and so on). We attribute the high tagging accuracy to the expressiveness of the encoder-generated semantic contextual representations.
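A one-dimensional morphological closing over the binary "IO" tag sequence can be sketched as below (a minimal illustration with a width-3 structuring element, not the authors' implementation): dilation fills single-token gaps inside a value span, and the subsequent erosion removes the spurious growth at span borders.

```python
# Morphological closing (dilation then erosion) over a 0/1 tag sequence.
# A gap like [..., 1, 0, 1, ...] inside a tagged value becomes all 1s.
def close_tags(tags, radius=1):
    n = len(tags)
    get = lambda seq, j: seq[j] if 0 <= j < n else 0  # zero-padded borders
    dilated = [int(any(get(tags, j) for j in range(i - radius, i + radius + 1)))
               for i in range(n)]
    eroded = [int(all(get(dilated, j) for j in range(i - radius, i + radius + 1)))
              for i in range(n)]
    # Union with the input guarantees no originally tagged token is lost.
    return [int(a or b) for a, b in zip(tags, eroded)]
```

This is the standard closing operation from mathematical morphology (Serra, 1982), applied to a sequence instead of an image.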

Figure 3: Tagging performance of the proto-DST model depending on the weight threshold ν.


Table 2 lists the JGA of TripPy-R when trained without manual span labels. For the small datasets we did not use xNONE and negative sampling, as they did not make a significant difference. We see that performance is on par with models that were trained with full supervision. If value matching on top of sequence tagging is not used, performance is slightly below its supervised counterparts. We observed that value matching compensates for minor errors caused by the sequence tagger that was trained on automatic labels.3
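Similarity-based value matching can be sketched as follows: the representation of an extracted span is compared against representations of known candidate values, and the best match is adopted only if it clears the threshold $\tau$; otherwise the raw extracted string is kept, preserving ontology independence. Function names and the toy embeddings are illustrative, not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_value(span_text, span_emb, candidates, tau=0.9):
    """candidates: dict mapping a canonical value to its embedding.
    Returns the matched canonical value, or the raw span if none clears tau."""
    best_value, best_sim = None, -1.0
    for value, emb in candidates.items():
        sim = cosine(span_emb, emb)
        if sim > best_sim:
            best_value, best_sim = value, sim
    return best_value if best_sim >= tau else span_text
```

Because the fallback is the extracted string itself, unseen out-of-ontology values can still be tracked.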

Table 2: DST results in JGA (± denotes standard deviation). *w/o value matching* refers to training and inference.

| Models | sim-M avg | sim-M best | sim-R avg | sim-R best | WOZ 2.0 avg | WOZ 2.0 best | MultiWOZ 2.1 avg | MultiWOZ 2.1 best |
|---|---|---|---|---|---|---|---|---|
| TripPy (baseline) | 88.7±2.7 | 94.0 | 90.4±1.0 | 91.5 | 92.3±0.6 | 93.1 | 55.3±0.9 | 56.3 |
| TripPy-R w/o value matching | 95.1±0.9 | 96.1 | 92.0±0.9 | 93.8 | 91.3±1.2 | 92.2 | 54.2±0.2 | 54.3 |
| TripPy-R | 95.6±1.0 | 96.8 | 92.3±2.7 | 96.2 | 91.5±0.6 | 92.6 | 56.0±0.3 | 56.4 |
| w/o history dropout | 95.4±0.5 | 96.1 | 93.2±0.9 | 94.7 | 91.6±1.0 | 93.0 | 55.5±0.6 | 56.2 |
| w/o token noising | 88.6±3.6 | 94.4 | 92.7±1.2 | 94.9 | 91.3±0.7 | 92.5 | 54.8±0.4 | 55.3 |
| w/o joint components | 87.2±3.9 | 92.6 | 90.8±0.9 | 91.9 | 91.7±0.6 | 92.8 | 54.9±0.3 | 55.3 |
| TripPy-R w/ spanless training | 95.2±0.8 | 96.0 | 92.0±1.5 | 94.0 | 91.1±0.8 | 92.4 | 55.1±0.5 | 55.7 |
| w/o value matching | 92.0±1.4 | 93.6 | 91.6±1.3 | 94.5 | 89.0±0.7 | 90.0 | 51.4±0.4 | 51.9 |
| w/ variants | – | – | – | – | – | – | 55.2±0.1 | 55.3 |
##### Impact of Tagging Variants

While our proto-DST model already achieves very high accuracy on all slots, including those that expect values with many variants, we tested whether explicit tagging of variants might further improve performance. For instance, if a turn contains the (canonical) value "expensive" for slot hotel-pricerange, but expressed as "upscale", we would explicitly tag such variants. While this strategy further improved the joint tagging accuracy from 94.4% to 96.1% (Figure 3), we did not see a rise in DST performance (Table 2). In other words, the contextual encoder is powerful enough to endow the proto-DST model with the ability to tag variants of values based on semantic similarity, which renders any extra supervision for this task unnecessary.

### 6.2 Handling Sample Sparsity

##### Impact of Token Noising

We found that traditional SVD leads to performance gains on sim-M, but not on any of the other tested datasets, confirming Heck et al. (2020b). In contrast, token noising improved the JGA for sim-M/R considerably. Note that in Table 2, the TripPy baseline for sim-M already uses SVD. On MultiWOZ 2.1, we observed minor improvements. As with SVD, WOZ 2.0 remained unaffected. The ontology for WOZ 2.0 is rather limited and remains the same for training and testing. This is not the case for the other datasets, where values occur during testing that were never seen during training. By all appearances, presenting the model with a more diverse set of dropped-out training examples helps generalization more than using a single placeholder token. This seems especially true when there are only few value candidates per slot and few training samples to learn from. A particularly illustrative case is found in the sim-M dataset. Without token noising, trained models regularly end up interpreting the value "last christmas" as movie-date rather than movie-name, based on its semantic similarity to dates. Token noising, on the other hand, forces the model to put more emphasis on context rather than token identities, which effectively eliminates this error.
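One plausible reading of token noising can be sketched as follows (illustrative only; the exact sampling scheme is an assumption on our part): the tokens of a target value are replaced by tokens drawn from a random candidate pool of size $K$, so the model must rely on context rather than token identity.

```python
import random

def token_noise(tokens, value_positions, vocab, k, rng=None):
    """Replace the tokens at value_positions with random vocabulary tokens
    drawn from a freshly sampled candidate pool of size k."""
    rng = rng or random.Random(0)
    pool = rng.sample(vocab, min(k, len(vocab)))
    noised = list(tokens)
    for i in value_positions:
        noised[i] = rng.choice(pool)
    return noised
```

Unlike SVD's single placeholder token, every training pass sees a different surface form at the value position.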

##### Impact of History Dropout

Table 2 shows that history dropout does not adversely affect DST performance. This is noteworthy because utilizing the full dialogue history is the standard in contemporary works due to its importance for adequate tracking. History dropout effectively reduces the amount of training data by omitting parts of the model input. At the same time, training samples are diversified, preventing the model from memorizing patterns in the dialogue history and promoting generalization. Figure 4 shows the severe effects of over-fitting to the dialogue history on small datasets when history dropout is not used. Here, models were only provided the current turn as input, without historical context. Models with history dropout fare considerably better, showing that they do not over-rely on the historical information. Models without history dropout not only perform much worse; their performance is also extremely unstable. On sim-R, the relative performance drop ranges from 0% to 39.4%. The results on MultiWOZ point to the importance of historical information for proper tracking in more challenging scenarios. Here, performance drops equally in the absence of dialogue history, whether or not history dropout was used.
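One plausible variant of history dropout can be sketched as below (an assumption about the exact mechanism, not the authors' code): with probability $p_{hd}$ the history is removed from a training sample entirely, so the model regularly has to solve the turn from the current utterances alone.

```python
import random

def drop_history(history_turns, p_hd, rng=None):
    """With probability p_hd, drop the dialogue history of a training
    sample so that only the current turn remains as model input."""
    rng = rng or random.Random()
    return [] if rng.random() < p_hd else list(history_turns)
```

Applied per sample and per epoch, the same dialogue is seen both with and without context over the course of training.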

Figure 4: Performance loss due to mismatched training and testing conditions. Here, history is provided during training, but not during testing. sim-M/R and WOZ 2.0 show clear signs of over-fitting without history dropout.


### 6.3 Handling Unseen Values

We probed value independence on two OOO test sets for MultiWOZ. OOOHeck replaces most values with fictional but still meaningful values that are not in the original ontology. Replacements are consistent, that is, the same value is always replaced by the same fictional stand-in. The overall OOO rate is 84%. OOOQian replaces only values of slots that expect names (i.e., name, departure, and destination) with values from a different ontology. Replacements are not consistent across dialogues, and names are shared across all slots; for example, street names may become hotel names, restaurants may become train stops, and so on, meaning the distinction between concepts is lost.

Table 3 lists the results. The performance loss is more graceful on OOOHeck, and we see that TripPy-R has an advantage over TripPy. The performance drop is more severe on OOOQian, with JGA comparable to the baseline of Qian et al. (2021), which is a generative model. The authors of that work attribute the performance degradation to hallucinations caused by memorization effects. For our extractive model, the main reason is found in the slot gate. The relative slot gate performance drop for the train domain-slots is 23.3%, while for other domain-slots it is 6.4%. We believe the reason is that most of the arbitrary substitutes carry no characteristics of train stops, but of other domains instead. This is less of a problem for the taxi domain, for instance, since taxi stops can be locations of many types. The issue of value-to-domain mismatch can be mitigated somewhat with informed value masking in system utterances (Section 4.3). While this does not particularly affect our model on the regular test set or on the more domain-consistent OOOHeck, we see much better generalization on OOOQian.
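Informed value masking can be sketched as follows (a simplified illustration; the placeholder token and the exact matching scheme are assumptions): values the system itself mentioned are replaced by a mask token before encoding, so the model cannot latch onto the surface form of a value and must decide from context.

```python
def mask_values(tokens, informed_values, mask="[MASK]"):
    """Replace every occurrence of an informed value's token sequence
    in the utterance with mask tokens."""
    out = list(tokens)
    for value in informed_values:
        v = value.split()
        for i in range(len(out) - len(v) + 1):
            if out[i:i + len(v)] == v:
                out[i:i + len(v)] = [mask] * len(v)
    return out
```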

Table 3: Performance in JGA on artificial out-of-ontology test sets (± denotes standard deviation).

| Models | OOOHeck | OOOQian |
|---|---|---|
| TripPy | 40.1±1.9 | 29.2±1.9 |
| Qian et al. (2021) | – | 27.0±2.0 |
| TripPy-R | 42.2±0.8 | 29.7±0.7 |
| TripPy-R + masking | 43.0±1.5 | 36.0±1.6 |

### 6.4 Handling Unseen Slots and Domains

Table 2 shows that moving from slot specific to slot independent components only marginally affects DST performance, while enabling tracking of dialogues with unseen domains and slots.

##### Zero-shot Performance

We conducted zero-shot experiments on MultiWOZ 2.1 by excluding all dialogues of a domain d from training and then evaluating the model on dialogues of d. In Table 4, we compare TripPy-R to recent models that support slot independence. Even though we did not specifically optimize TripPy-R for zero-shot abilities, our model shows a level of robustness that is competitive with other contemporary methods.

Table 4: Best zero-shot DST results for various models on MultiWOZ 2.1 in JGA. *Li et al. (2021) presents considerably higher numbers for models with data augmentation; we compare against a model without data augmentation. **A model with three times as many parameters as ours.

| Models | hotel | rest. | attr. | train | taxi | avg. |
|---|---|---|---|---|---|---|
| TRADE (2019; 2020) | 19.5 | 16.4 | 22.8 | 22.9 | 59.2 | 28.2 |
| MA-DST (2020) | 16.3 | 13.6 | 22.5 | 22.8 | 59.3 | 26.9 |
| SUMBT (2019; 2020) | 19.8 | 16.5 | 22.6 | 22.5 | 59.5 | 28.2 |
| Li et al. (2021)* | 18.5 | 21.1 | 23.7 | 24.3 | 59.1 | 29.3 |
| TripPy-R | 18.3 | 15.3 | 27.1 | 23.7 | 61.5 | 29.2 |
| Li et al. (2021)** | 24.4 | 26.2 | 31.3 | 29.1 | 59.6 | 34.1 |
##### Impact of Non-dialogue Data

Besides zero-shot abilities, we were curious whether it is feasible to improve dialogue state tracking by learning the required mechanics purely from non-dialogue data. This is a non-trivial task, as the model needs to generalize knowledge learned from unstructured data to dialogue, that is, sequences of alternating system and user utterances. We conducted this experiment by converting MultiWOZ dialogues of a held-out domain d into non-dialogue format for training. For d, the model only sees isolated sentences or sentence pairs without any dialogue structure. Consequently, there is no turn history from which the model could learn. The assumption is that one would have some way to label sequences of interest in non-dialogue sentences, for instance with a semantic parser. As this is a feasibility study, we resort to the slot labels in $DS_t$, which simulates having labels of very high accuracy. We tested three different data formats: (1) Review style: only system utterances with statements are used to learn from; (2) FAQ style: a training example is formed by a user question and the following system answer. Note that this is contrary to what TripPy-R naturally expects, which is a querying system and a responding user; and (3) FAQ+ style: combines review and FAQ style examples and adds user questions again as separate examples.
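The three formats can be sketched with a small conversion helper (names are illustrative, not the authors' code; the restriction of review-style examples to system utterances that are statements is omitted for brevity):

```python
def to_nondialogue(turns, style):
    """turns: list of (system_utterance, user_utterance) pairs of a
    held-out domain. Returns non-dialogue training examples."""
    examples = []
    for system, user in turns:
        if style in ("review", "faq+"):
            examples.append(system)          # isolated system statements
        if style in ("faq", "faq+"):
            examples.append((user, system))  # question-answer pair
        if style == "faq+":
            examples.append(user)            # question as its own example
    return examples
```

Note that FAQ+ style mainly re-presents the same utterances in different shapes rather than adding new material.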

Figure 5 shows that we observed considerable improvements across all held-out domains when using non-dialogue data to learn from. Learning from additional data, even if unstructured, is particularly beneficial for unique slots, such as the restaurant-food slot, which the model cannot learn about from any other domain in MultiWOZ (as is also reflected in poor zero-shot performance). We also found that learning benefits from the combination of different formats. The heightened performance given the FAQ+ style data is not an effect of more data, but of its presentation, since we mostly re-use the same inputs in different formats. This observation is reminiscent of findings in psychology: Horst et al. (2011) showed that children benefited from being read the same story repeatedly, and Johns et al. (2016) showed that contextual diversity positively affects word learning in adults. Note that this kind of learning is in contrast to few-shot learning and leveraging artificial dialogue data, which either require fine-grained manual labels or high-level knowledge of how dialogues are structured. Even though the data we used is far removed from what a dialogue state tracker expects, TripPy-R still manages to learn how to appropriately track these new domains.

Figure 5: Performance of TripPy-R after training with non-dialogue style data from a held-out domain.


### 6.5 Performance Comparison

We evaluated on five versions of MultiWOZ to place TripPy-R among contemporary work. Versions 2.1 and 2.2 mainly propose general corrections to the labels of MultiWOZ 2.0. Version 2.3 unifies annotations between dialogue acts and dialogue states. In contrast, version 2.4 removes all values that were mentioned by the system from the dialogue state, unless they are proper names. Figure 6 plots the results. The performance of TripPy-R is considerably better on versions 2.3 and 2.4. This can be attributed to a more accurate prediction of the inform cases due to better test ground truths.

Figure 6: Comparison of TripPy-R and SOTA open vocabulary DST models. * denotes TripPy-style models.


For fairness, we restricted our comparison to models with the same general abilities, that is, models that are open-vocabulary and free of data-specific architectures. The SOTA on 2.0 (Su et al., 2022) proposes a unified generative dialogue model that solves multiple tasks including DST and benefits from pre-training on various dialogue corpora. While profiting from more data in general, its heterogeneity in particular did not affect DST performance. Yu et al. (2020), Li et al. (2020), and Dai et al. (2021) currently share the top of the leaderboard for 2.1; all propose TripPy-style models that leverage data augmentation. The main reason for their performance improvements lies in the larger amount of data and in diversifying samples. TripPy-R does not rely on more data, but diversifies training samples with token noising and history dropout. On version 2.2, the method of Tian et al. (2021) performs best with a two-pass generative approach that utilizes an error recovery mechanism. This mechanism can correct generation errors such as those caused by hallucination, a phenomenon that does not occur with TripPy-R. However, their error recovery also has the potential to avoid propagation of errors made early in the dialogue, as demonstrated by its heightened performance. Cho et al. (2021) report numbers for the method of Mehri et al. (2020) on version 2.3, another TripPy-style model using an encoder that was pre-trained on millions of conversations, thus greatly benefiting from specialized knowledge. For version 2.4, the current SOTA with the properties stated above is presented by Ye et al. (2021b) and reported in Ye et al. (2021a), which is now surpassed by TripPy-R. The major difference from our model is the use of slot self-attention, which allows their model to learn correlations between slot occurrences. While TripPy-R does not model slot correlations directly, it does explicitly learn to resolve coreferences.

### 6.6 Implications of the Results

The zero-shot capabilities of our proposed TripPy-R model open the door to many new applications. However, it should be noted that its performance on arbitrary unseen domains and slots will likely degrade. In such cases it would be more appropriate to perform adaptation, which the TripPy-R framework facilitates. This means that one would transfer the model as presented in Sections 4.3 and 4.4 and continue fine-tuning with limited, and potentially unstructured (see Section 6.4), data from the new domain. Nonetheless, in applications such as e-commerce (Zhang et al., 2018) or customer support (García-Sardiña et al., 2018), newly introduced slots or even domains are to a great extent related to ones that a deployed system is already familiar with. We believe that the zero-shot performance presented in Table 4 is highly indicative of this set-up, as MultiWOZ domains are different, yet to some extent related.

Further, the TripPy-R model facilitates future applications in complex domains such as healthcare. One of the biggest obstacles to harnessing large amounts of natural language data in healthcare is the required labeling effort. This is particularly the case for applications in psychology, as can be seen from the recent work of Rojas-Barahona et al. (2018), where only 5K out of 1M interactions were labeled with spans for so-called thinking errors by psychologists. A framework like TripPy-R can completely bypass this step by utilizing its proto-DST, as presented in Section 4.1, eliminating this prohibitive labeling effort.

In this work we presented methods to facilitate robust extractive dialogue state tracking with weak supervision and sparse data. Our proposed architecture, TripPy-R, utilizes a unified encoder, the attention mechanism, and conditioning on natural language descriptions of concepts to facilitate parameter sharing and zero-shot transfer. We leverage similarity-based value matching as an optional step after value extraction, without violating the principle of ontology independence.

We demonstrated the feasibility of training without manual span labels using a self-trained proto-DST model. Learning from spanless labels enables us to leverage data with weaker supervision. We showed that token noising and history dropout mitigate issues of pattern memorization and train-test discrepancies. We achieved competitive zero-shot performance and demonstrated in a feasibility study that TripPy-R can learn to track new domains from non-dialogue data. We achieve either competitive or state-of-the-art performance on all tested benchmarks. In future work we will continue to investigate learning from non-dialogue data, potentially in a continuous fashion over the lifetime of a dialogue system.

We thank the anonymous reviewers and the action editors for their valuable feedback. M. Heck, N. Lubis, C. van Niekerk, and S. Feng are supported by funding provided by the Alexander von Humboldt Foundation in the framework of the Sofja Kovalevskaja Award endowed by the Federal Ministry of Education and Research, while C. Geishauser and H.-C. Lin are supported by funds from the European Research Council (ERC) provided under the Horizon 2020 research and innovation programme (grant agreement no. STG2018804636). Computing resources were provided by Google Cloud and HHU ZIM.

2. For the distinction of categorical and non-categorical slots, see Rastogi et al. (2020b) and Zang et al. (2020).

3. Note that training with value matching also affects the training of the sequence tagger, be it with or without using span labels.

Jimmy Lei
Ba
,
Jamie Ryan
Kiros
, and
Geoffrey E.
Hinton
.
2016
.
Layer normalization
.
CoRR
,
abs/1607.06450v1
.
Dzmitry
Bahdanau
,
Kyunghyun
Cho
, and
Yoshua
Bengio
.
2015
.
Neural machine translation by jointly learning to align and translate
. In
Proceedings of the 3rd International Conference on Learning Representations
,
San Diego, CA, USA
.
Ankur
Bapna
,
Gokhan
Tür
,
Dilek Hakkani-
Tür
, and
Larry
Heck
.
2017
.
Towards zero- shot frame semantic parsing for domain scaling
. In
Proceedings of Interspeech 2017
, pages
2476
2480
.
Paweł
Budzianowski
,
Tsung-Hsien
Wen
,
Bo-
Hsiang Tseng
,
Iñigo
Casanueva
,
Stefan
Ultes
,
Osman
, and
Milica
Gašić
.
2018
.
MultiWOZ - A large-scale multi-domain Wizard- of-Oz dataset for task-oriented dialogue modelling
. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pages
5016
5026
,
Brussels, Belgium
.
Association for Computational Linguistics
.
Giovanni
Campagna
,
Agata
Foryciarz
,
, and
Monica
Lam
.
2020
.
Zero- shot transfer learning with synthesized data for multi-domain dialogue state tracking
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
122
132
,
Online
.
Association for Computational Linguistics
.
Guan-Lin
Chao
and
Ian
Lane
.
2019
.
BERT-DST: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer
. In
Proceedings of Interspeech 2019
, pages
1468
1472
.
Hyundong
Cho
,
Sankar
,
Christopher
Lin
,
Kaushik Ram
,
Shahin
Shayandeh
,
Asli
Celikyilmaz
,
Jonathan
May
, and
Beirami
.
2021
.
CheckDST: Measuring real-world generalization of dialogue state tracking performance
.
CoRR
,
abs/2112.08321v1
.
Herbert H.
Clark
and
Susan E.
Brennan
.
1991
.
Grounding in communication
. In
Lauren B.
Resnick
,
John M.
Levine
, and
Stephanie D.
Teasley
, editors
,
Perspectives on Socially Shared Cognition
, pages
127
149
.
American Psychological Association
,
Washington, USA
.
Yinpei
Dai
,
Hangyu
Li
,
Yongbin
Li
,
Jian
Sun
,
Fei
Huang
,
Luo
Si
, and
Xiaodan
Zhu
.
2021
.
Preview, attend and review: Schema- aware curriculum learning for multi-domain dialogue state tracking
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
, pages
879
885
,
Online
.
Association for Computational Linguistics
.
Jan
Deriu
,
Alvaro
Rodrigo
,
Arantxa
Otegi
,
Guillermo
Echegoyen
,
Sophie
Rosset
,
Eneko
Agirre
, and
Mark
Cieliebak
.
2021
.
Survey on evaluation methods for dialogue systems
.
Artificial Intelligence Review
,
54
(
1
):
755
810
. ,
[PubMed]
Jacob
Devlin
,
Ming-Wei
Chang
,
Kenton
Lee
, and
Kristina
Toutanova
.
2019
.
BERT: Pre-training of deep bidirectional transformers for language understanding
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
, pages
4171
4186
,
Minneapolis, Minnesota
.
Association for Computational Linguistics
.
Jens
Edlund
,
Joakim
Gustafson
,
Mattias
Heldner
, and
Anna
.
2008
.
Towards human- like spoken dialogue systems
.
Speech Communication
,
50
(
8–9
):
630
645
.
Mihail
Eric
,
Rahul
Goel
,
Shachi
Paul
,
Abhishek
Sethi
,
Sanchit
Agarwal
,
Shuyang
Gao
,
Kumar
,
Anuj
Goyal
,
Peter
Ku
, and
Dilek
Hakkani-Tür
.
2020
.
MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines
. In
Proceedings of the 12th Language Resources and Evaluation Conference
, pages
422
428
,
Marseille, France
.
European Language Resources Association
.
Shuyang
Gao
,
Abhishek
Sethi
,
Sanchit
Agarwal
,
Tagyoung
Chung
, and
Dilek
Hakkani-Tür
.
2019
.
Dialog state tracking: A neural reading comprehension approach
. In
Proceedings of the 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue
, pages
264
273
,
Stockholm, Sweden
.
Association for Computational Linguistics
.
Laura
García-Sardiña
,
Manex
Serras
, and
Arantza
del Pozo
.
2018
.
ES-Port: A spontaneous spoken human-human technical support corpus for dialogue research in Spanish
. In
Proceedings of the 11th International Conference on Language Resources and Evaluation
, pages
781
785
,
Miyazaki, Japan
.
European Language Resources Association
.
Ting
Han
,
Ximing
Liu
,
Ryuichi
Takanabu
,
Yixin
Lian
,
Chongxuan
Huang
,
Dazhen
Wan
,
Wei
Peng
, and
Minlie
Huang
.
2021
.
MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation
. In
CCF International Conference on Natural Language Processing and Chinese Computing
, pages
206
218
.
Springer
.
Michael
Heck
,
Christian
Geishauser
,
Hsien-chin
Lin
,
Nurul
Lubis
,
Marco
Moresi
,
Carel
van Niekerk
, and
Milica
Gašić
.
2020a
.
Out-of- task training for dialog state tracking models
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
6767
6774
,
Barcelona, Spain (Online)
.
International Committee on Computational Linguistics
.
Michael
Heck
,
Carel
van Niekerk
,
Nurul
Lubis
,
Christian
Geishauser
,
Hsien-Chin
Lin
,
Marco
Moresi
, and
Milica
Gašić
.
2020b
.
TripPy: A triple copy strategy for value independent neural dialog state tracking
. In
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue
, pages
35
44
,
1st virtual meeting
.
Association for Computational Linguistics
.
Dan
Hendrycks
and
Kevin
Gimpel
.
2016
.
Gaussian error linear units (GELUs)
.
CoRR
,
abs/1606.08415v4
.
Jessica S.
Horst
,
Kelly L.
Parsons
, and
Natasha M.
Bryan
.
2011
.
Get the story straight: Contextual repetition promotes word learning from storybooks
.
Frontiers in Psychology
,
2
:
17
. ,
[PubMed]
Brendan T.
Johns
,
Melody
Dye
, and
Michael N.
Jones
.
2016
.
The influence of contextual diversity on word learning
.
Psychonomic Bulletin & Review
,
23
(
4
):
1214
1220
. ,
[PubMed]
Sungdong
Kim
,
Sohee
Yang
,
Gyuwan
Kim
, and
Sang-Woo
Lee
.
2020
.
Efficient dialogue state tracking by selectively overwriting memory
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
567
582
,
Online
.
Association for Computational Linguistics
.
Diederik P.
Kingma
and
Jimmy
Ba
.
2015
.
Adam: A method for stochastic optimization
. In
Proceedings of the 3rd International Conference on Learning Representations
.
San Diego, CA, USA
.
Kumar
,
Peter
Ku
,
Anuj Kumar
Goyal
,
Angeliki
Metallinou
, and
Dilek
Hakkani-Tür
.
2020
.
MA-DST: Multi-attention based scalable dialog state tracking
. In
Proceedings of the 34th AAAI Conference on Artificial Intelligence
, volume
34
, pages
8107
8114
.
Staffan
and
David R.
Traum
.
2000
.
Information state and dialogue management in the TRINDI dialogue move engine toolkit
.
Natural Language Engineering
,
6
(
3–4
):
323
340
.
Hwaran
Lee
,
Jinsik
Lee
, and
Tae-Yoon
Kim
.
2019
.
SUMBT: Slot-utterance matching for universal and scalable belief tracking
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
5478
5483
,
Florence, Italy
.
Association for Computational Linguistics
.
Shiyang
Li
,
Semih
Yavuz
,
Kazuma
Hashimoto
,
Jia
Li
,
Tong
Niu
,
Nazneen
Rajani
,
Xifeng
Yan
,
Yingbo
Zhou
, and
Caiming
Xiong
.
2020
.
CoCo: Controllable counterfactuals for evaluating dialogue state trackers
. In
International Conference on Learning Representations
.
Shuyang
Li
,
Jin
Cao
,
Mukund
Sridhar
,
Henghui
Zhu
,
Shang-Wen
Li
,
Wael
Hamza
, and
Julian
McAuley
.
2021
.
Zero-shot generalization in dialog state tracking through generative question answering
. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
, pages
1063
1074
,
Online
.
Association for Computational Linguistics
.
Shuailong
Liang
,
Lahari
Poddar
, and
Gyuri
Szarvas
.
2021
.
Attention guided dialogue state tracking with sparse supervision
.
CoRR
,
abs/2101.11958v1
.
Zhaojiang
Lin
,
Bing
Liu
,
Andrea
,
Seungwhan
Moon
,
Zhenpeng
Zhou
,
Paul
Crook
,
Zhiguang
Wang
,
Zhou
Yu
,
Eunjoon
Cho
,
Rajen
Subba
, and
Pascale
Fung
.
2021
.
Zero-shot dialogue state tracking via cross- task transfer
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
7890
7900
,
Online and Punta Cana, Dominican Republic
.
Association for Computational Linguistics
.
Bing
Liu
and
Ian
Lane
.
2017
.
An end-to-end trainable neural network model with belief tracking for task-oriented dialog
. In
Proceedings of Interspeech 2017
, pages
2506
2510
.
Yinhan
Liu
,
Myle
Ott
,
Naman
Goyal
,
Jingfei
Du
,
Mandar
Joshi
,
Danqi
Chen
,
Omer
Levy
,
Mike
Lewis
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2019
.
RoBERTa: A robustly optimized BERT pretraining approach
.
CoRR
,
abs/1907.11692v1
.
Yue
Ma
,
Zengfeng
Zeng
,
Dawei
Zhu
,
Xuan
Li
,
Yiying
Yang
,
Xiaoyuan
Yao
,
Kaijie
Zhou
, and
Jianping
Shen
.
2019
.
An end-to-end dialogue state tracking system with machine reading comprehension and wide & deep classification
.
CoRR
,
abs/1912.09297v2
.
Shikib
Mehri
,
Mihail
Eric
, and
Dilek Hakkani-
Tür
.
2020
.
DialoGLUE: A natural language understanding benchmark for task-oriented dialogue
.
CoRR
,
abs/2009.13570v1
.
Nikola
Mrkšić
,
Diarmuid Ó.
Séaghdha
,
Tsung-
Hsien Wen
,
Blaise
Thomson
, and
Steve
Young
.
2017
.
Neural belief tracker: Data-driven dialogue state tracking
. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1777
1788
,
.
Association for Computational Linguistics
.
Mahdi
Namazifar
,
Alexandros
Papangelis
,
Gokhan
Tür
, and
Dilek
Hakkani-Tür
.
2021
.
Language model is all you need: Natural language understanding as question answering
. In
Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
, pages
7803
7807
.
IEEE
.
Jinjie
Ni
,
Tom
Young
,
Pandelea
,
Fuzhao
Xue
,
Vinay
, and
Erik
Cambria
.
2021
.
Recent advances in deep learning based dialogue systems: A systematic survey
.
CoRR
,
abs/2105.04387v4
.
Elnaz
Nouri
and
Ehsan
Hosseini-Asl
.
2018
.
Toward scalable neural dialogue state tracking model
. In
Proceedings of the 2nd Conversational AI workshop at the 32nd Conference on Neural Information Processing Systems
.
.
Kun
Qian
,
Beirami
,
Zhouhan
Lin
,
Ankita
De
,
Alborz
Geramifard
,
Zhou
Yu
, and
Sankar
.
2021
.
Annotation inconsistency and entity bias in MultiWOZ
. In
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
, pages
326
337
,
Singapore and Online
.
Association for Computational Linguistics
.
Lance
Ramshaw
and
Mitch
Marcus
.
1995
.
Text chunking using transformation-based learning
. In
Proceedings of the 3rd Workshop on Very Large Corpora
, pages
82
94
.
Abhinav
Rastogi
,
Xiaoxue
Zang
,
Srinivas
Sunkara
,
Raghav
Gupta
, and
Pranav
Khaitan
.
2020a
.
Schema-guided dialogue state tracking task at DSTC8
.
CoRR
,
abs/2002.01359v1
.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020b. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.
Liliang Ren, Kaige Xie, Lu Chen, and Kai Yu. 2018. Towards universal dialogue state tracking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2780–2786, Brussels, Belgium. Association for Computational Linguistics.
Lina M. Rojas-Barahona, Bo-Hsiang Tseng, Yinpei Dai, Clare Mansfield, Osman Ramadan, Stefan Ultes, Michael Crawford, and Milica Gašić. 2018. Deep learning for language understanding of mental health concepts derived from cognitive behavioural therapy. In Proceedings of the 9th International Workshop on Health Text Mining and Information Analysis, pages 44–54, Brussels, Belgium. Association for Computational Linguistics.
Jean Serra. 1982. Image Analysis and Mathematical Morphology. Academic Press.
Pararth Shah, Dilek Hakkani-Tür, Gökhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry P. Heck. 2018. Building a conversational agent overnight with dialogue self-play. CoRR, abs/1801.04871v1.
Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai, and Yi Zhang. 2022. Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676, Dublin, Ireland. Association for Computational Linguistics.
Xin Tian, Liankai Huang, Yingzhan Lin, Siqi Bao, Huang He, Yunyi Yang, Hua Wu, Fan Wang, and Shuqi Sun. 2021. Amendable generation for dialogue state tracking. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, pages 80–92, Online. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.
Chien-Sheng Wu, Steven C. H. Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 917–929, Online. Association for Computational Linguistics.
Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819, Florence, Italy. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144v2.
Puyang Xu and Qi Hu. 2018. An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1448–1457, Melbourne, Australia. Association for Computational Linguistics.
Puyang Xu and Ruhi Sarikaya. 2014. Targeted feature dropout for robust slot filling in natural language understanding. In Fifteenth Annual Conference of the International Speech Communication Association, pages 258–262.
Fanghua Ye, Jarana Manotumruksa, and Emine Yilmaz. 2021a. MultiWOZ 2.4: A multi-domain task-oriented dialogue dataset with essential annotation corrections to improve state tracking evaluation. CoRR, abs/2104.00773v1.
Fanghua Ye, Jarana Manotumruksa, Qiang Zhang, Shenghui Li, and Emine Yilmaz. 2021b. Slot self-attentive dialogue state tracking. In Proceedings of the Web Conference 2021, pages 1598–1608.
Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174.
Tao Yu, Rui Zhang, Alex Polozov, Christopher Meek, and Ahmed Hassan Awadallah. 2020. SCoRe: Pre-training for context representation in conversational semantic parsing. In International Conference on Learning Representations.
Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.
Jianguo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wang, Philip Yu, Richard Socher, and Caiming Xiong. 2020a. Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. In Proceedings of the 9th Joint Conference on Lexical and Computational Semantics, pages 154–167, Barcelona, Spain (Online). Association for Computational Linguistics.
Zheng Zhang, Ryuichi Takanobu, Qi Zhu, Minlie Huang, and Xiaoyan Zhu. 2020b. Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences, 63:2011–2027.
Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018. Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3740–3752, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1458–1467, Melbourne, Australia. Association for Computational Linguistics.

## Author notes

Action Editor: Claire Gardent

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.