Robust Dialogue State Tracking with Weak Supervision and Sparse Data

Abstract

Generalizing dialogue state tracking (DST) to new data is especially challenging due to the strong reliance on abundant and fine-grained supervision during training. Sample sparsity, distributional shift, and the occurrence of new concepts and topics frequently lead to severe performance degradation during inference. In this paper we propose a training strategy to build extractive DST models without the need for fine-grained manual span labels. Two novel input-level dropout methods mitigate the negative impact of sample sparsity. We propose a new model architecture with a unified encoder that supports value as well as slot independence by leveraging the attention mechanism. We combine the strengths of triple copy strategy DST and value matching to benefit from complementary predictions without violating the principle of ontology independence. Our experiments demonstrate that an extractive DST model can be trained without manual span labels. Our architecture and training strategies improve robustness towards sample sparsity, new concepts, and topics, leading to state-of-the-art performance on a range of benchmarks. We further highlight our model’s ability to effectively learn from non-dialogue data.


Introduction
Generalisation and robustness are among the key requirements for naturalistic conversational abilities of task-oriented dialogue systems (Edlund et al., 2008). In a dialogue system, dialogue state tracking (DST) solves the task of extracting meaning and intent from the user input, and keeps track of the user's goal over the continuation of a conversation as part of a dialogue state (DS) (Young et al., 2010). A recommendation and booking system for places, for instance, needs to gather user preferences in terms of budget, location, etc. Concepts like these are assembled in an ontology on the levels of domain (e.g., restaurant or hotel), slot (e.g., price or location), and value (e.g., "expensive" or "south"). Accurate DST is vital to a robust dialogue system, as the system's future actions depend on the conversation's current estimated state. However, generalising DST to new data and domains is especially challenging. The reason is the strong reliance on supervised training.
Virtually all top-performing DST methods either entirely or partially extract values directly from context (Ni et al., 2021). However, training these models robustly is a demanding task. Extractive methods usually rely on fine-grained labels on the word level indicating the precise locations of value mentions. Given the richness of human language and the ability to express the same canonical value in many different ways, producing such labels is challenging and very costly, and it is no surprise that datasets of such kind are rare (Zhang et al., 2020b; Deriu et al., 2021). Reliance on detailed labels has another downside; datasets are usually severely limited in size. This in turn leads to the problem of sample sparsity, which increases the risk for models to over-fit to the training data, for instance by memorising values in their respective contexts. Over-fitting prevents a state tracker from generalising to new contexts and values, which is likely to break a dialogue system entirely (Qian et al., 2021). Recently, domain-independent architectures have been encouraged to develop systems that may be built once, and then applied to new scenarios with no or little additional training (Rastogi et al., 2020b,a). However, training such flexible models robustly remains a challenge, and the ever-growing need for more training samples spurs creativity to leverage non-dialogue data (Heck et al., 2020a; Namazifar et al., 2021).
We propose novel strategies for extractive DST that address the following four issues of robustness and generalisation. (1) We solve the problem of requiring fine-grained span labels with a self-supervised training scheme. Specifically, we learn from random self-labeled samples how to locate occurrences of arbitrary values. All that is needed for training a full DST model is the dialogue state ground truth, which is undoubtedly much easier to obtain than fine-grained span labels. (2) We handle the sample sparsity problem by introducing two new forms of input-level dropout into training. Our proposed dropout methods are easy to apply and provide a more economical alternative to data augmentation to prevent memorisation and over-fitting to certain conversation styles or dialogue patterns. (3) We add a value matching mechanism on top of extraction to enhance robustness towards previously unseen concepts. Our value matching is entirely optional and may be utilised if a set of candidate values is known during inference, for instance from a schema or API. (4) We propose a new architecture that is entirely domain-agnostic to facilitate transfer to unseen slots and domains.
For that, our model relies on the attention mechanism and conditioning on natural language slot descriptions. The established slot independence enables zero-shot transfer. We will demonstrate that we can actively teach the model to track new domains by learning from non-dialogue data. This is non-trivial, as the model must learn to interpret dialogue data from exposure to unstructured data.

Related Work
Traditional DS trackers perform prediction over a fixed ontology (Mrkšić et al., 2017; Liu and Lane, 2017; Zhong et al., 2018) and therefore have various limitations in more complex scenarios (Ren et al., 2018; Nouri and Hosseini-Asl, 2018). The idea of fixed ontologies is not sustainable for real world applications, as new concepts become impossible to capture during test time. Moreover, the demand for finely labeled data quickly grows with the ontology size, causing scalability issues.
Recent approaches to DST extract values directly from the dialogue context via span prediction (Xu and Hu, 2018; Gao et al., 2019; Chao and Lane, 2019), removing the need for fixed value candidate lists. An alternative to this mechanism is value generation via soft-gated pointer-generator copying (Wu et al., 2019; Kumar et al., 2020; Kim et al., 2020). Extractive methods have limitations as well, since many values may be expressed variably or implicitly. Contextual models such as BERT (Devlin et al., 2019) support generalisation over value variations to some extent (Lee et al., 2019; Chao and Lane, 2019; Gao et al., 2019), and hybrid approaches try to mitigate the issue by resorting to picklists (Zhang et al., 2020a).
TripPy (Heck et al., 2020b) jointly addresses the issues of coreference, implicit choice and value independence with a triple copy strategy. Here, a Transformer-based (Vaswani et al., 2017) encoder projects each dialogue turn into a semantic embedding space. Domain-slot specific slot gates then decide whether or not a slot-value is present in the current turn in order to update the dialogue state. In case of presence, the slot gates also decide which of the following three copy mechanisms to use for extraction. (1) Span prediction extracts a value directly from input. For that, domain-slot specific span prediction heads predict per token whether it is the beginning or end of a slot-value.
(2) Informed value prediction copies a value from the list of values that the system informed about. This solves the implicit choice issue, where the user might positively but implicitly refer to information that the system provided. (3) Coreference prediction identifies cases where the user refers to a value that has already been assigned to a slot earlier and should now also be assigned to another slot in question. TripPy shows good robustness towards new data from known domains since it does not rely on a priori knowledge of value candidates. However, it does not support transfer to new topics, since the architecture is ontology specific. Transfer to new domains or slots is therefore impossible without re-building the model. TripPy also ignores potentially available knowledge about value candidates, since its copy mechanisms operate solely on the input. Lastly, training requires fine-grained span labels, complicating the transfer to new datasets.
While contemporary approaches to DST leverage parameter sharing and transfer learning (Rastogi et al., 2020a; Lin et al., 2021), the need for finely-labeled training data is still high. Sample sparsity often causes model biases in the form of memorisation or other types of over-fitting. Strategies to appease the hunger of larger models are the exploitation of out-of-domain dialogue data for transfer effects (Wu et al., 2020) and data augmentation (Campagna et al., 2020; Yu et al., 2020; Li et al., 2020; Dai et al., 2021). Other recent work (2021) takes a few-shot learning approach using only a subset of fully labeled training samples, typically from the end of conversations, to train a soft-gated pointer-generator network. In contrast, with our approach to spanless training, we reduce the level of granularity needed for labels to train extractive models. Note that these strategies are orthogonal.
3 TripPy-R: Robust Triple Copy DST

Let {(U_1, M_1), . . ., (U_T, M_T)} be the sequence of turns that form a dialogue. U_t and M_t are the token sequences of the user utterance and preceding system utterance at turn t. The task of DST is (1) to determine for every turn whether any of the domain-slot pairs in S = {S_1, . . ., S_N} is present, (2) to predict the values for each S_n and (3) to track the dialogue state DS_t. Our starting point is triple copy strategy DST (Heck et al., 2020b), because it has already been designed for robustness towards unseen values. However, we propose a new architecture with considerable differences to the baseline regarding its design, training and inference to overcome the drawbacks of previous approaches as laid out in Section 2. We call our proposed framework TripPy-R (pronounced "trippier"), Robust triple copy strategy DST. Figure 1 is a depiction of our proposed model.

Model Layout
Joint Components We design our model to be entirely domain-agnostic, adopting the idea of conditioning the model with natural language descriptions of concepts (Bapna et al., 2017; Rastogi et al., 2020b). For that, we use data-independent prediction heads that can be conditioned with slot descriptions to solve the tasks required for DST. This is different to related work such as Heck et al. (2020b), which uses data-dependent prediction heads whose number depends on the ontology size. In contrast, prediction heads in TripPy-R are realised via the attention mechanism (Bahdanau et al., 2015). Specifically, we use scaled dot-product attention, implemented as multi-head attention as defined by Vaswani et al. (2017). We utilise this mechanism to query the input for the presence of information. Among other things, we deploy attention to predict whether or not a slot-value is present in the input, or to conduct sequence tagging, rather than span prediction, by assigning importance weights to input tokens.
Unified Context/Concept Encoder Different from other domain-agnostic architectures (Lee et al., 2019; Ma et al., 2019), we rely on a single encoder that is shared among encoding tasks. This unified encoder is used to produce representations for dialogue turns and natural language slot and value descriptions. The encoder function is Enc(X) = [h_CLS, h_1, . . ., h_|X|], where X is a sequence of input tokens. h_CLS can be interpreted as a representation of the entire input sequence. The vectors h_1 to h_|X| are contextual representations for the sequence of input tokens. We define Enc_P(X) = [h_CLS] and Enc_S(X) = [h_1, . . ., h_|X|] as the pooled encoding and sequence encoding of X, respectively.
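The Enc_P/Enc_S interface can be sketched as follows. This is a toy stand-in, not the actual model: a real implementation would run a Transformer encoder (the paper initialises it with RoBERTa-base); here tokens are hashed to fixed random vectors so the pooled-versus-sequence distinction is runnable in isolation.

```python
import numpy as np

D = 8  # toy hidden size; the paper uses a RoBERTa-base encoder instead

def toy_encode(tokens):
    """Toy stand-in for the unified encoder Enc(X).

    Returns [h_CLS, h_1, ..., h_|X|]: one contextual vector per input token
    plus a leading pooled vector for the whole sequence. Here we hash each
    token to a fixed random vector and pool by averaging; a real model
    would produce these with a Transformer.
    """
    vecs = [np.random.RandomState(abs(hash(tok)) % (2**31)).randn(D)
            for tok in tokens]
    h_cls = np.mean(vecs, axis=0)          # pooled summary of the sequence
    return [h_cls] + vecs

def enc_p(tokens):
    """Enc_P(X): pooled encoding (the h_CLS vector)."""
    return toy_encode(tokens)[0]

def enc_s(tokens):
    """Enc_S(X): sequence encoding (one vector per token)."""
    return toy_encode(tokens)[1:]

turn = ["i", "need", "a", "cheap", "hotel"]
assert enc_p(turn).shape == (D,)
assert len(enc_s(turn)) == len(turn)
```

The same encoder object serves all three input types (turns, slot descriptions, value descriptions), which is the point of the unified design.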
Dialogue turns and natural language slot and value descriptions are encoded as

R_t = Enc_S(x_CLS ⊕ U_t ⊕ x_SEP ⊕ M_t ⊕ x_SEP ⊕ H_t ⊕ x_SEP),
r_{S_i} = Enc_P(x_CLS ⊕ S^desc_i ⊕ x_SEP),
r_{V_{S_i},j} = Enc_P(x_CLS ⊕ V_{S_i,j} ⊕ x_SEP),

where ⊕ denotes sequence concatenation and H_t = {(U_{t−1}, M_{t−1}), . . ., (U_1, M_1)} is the history of the dialogue up to turn t. The special token x_CLS initiates every input sequence, and x_SEP is a separator token to provide structure to multi-sequence inputs. S^desc_i is the slot description of slot S_i and V_{S_i,j} is a candidate value j for slot S_i.

Conditioned Slot Gate
The slot gate outputs a probability distribution over the output classes C = {none, dontcare, span, inform, refer, true, false}. Our slot gate can be conditioned to perform a prediction for one particular slot, allowing our architecture to be ontology independent. The slot attention layer attends to token representations of a dialogue turn given the representation of a particular slot S_i as query, i.e.,

[g_o, g_w] = MHA_g(r_{S_i}, R_t, R_t),  (1)

where MHA_(•)(Q, K, V, k) is a multi-head attention layer that expects a query matrix Q, a key matrix K, a value matrix V and an optional masking parameter k. g_o is the layer-normalised (Ba et al., 2016) attention output and g_w are the attention weights. For classification, the attention output is piped into a feed-forward network (FFN) conditioned with S_i,

g_{S_i} = softmax(FFN_g(g_o ⊕ r_{S_i})),

where the FFN uses the GELU activation (Hendrycks and Gimpel, 2016).
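A minimal single-head sketch of this conditioned gate follows. The weight matrices `W`, `b` stand in for the FFN parameters, and the attention is single-head rather than multi-head; the point is to show how one set of shared parameters produces a per-slot decision purely through the slot query.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Single-head scaled dot-product attention.
    q: (d,) query; K, V: (n, d). Returns (output, weights over tokens)."""
    d = q.shape[-1]
    w = softmax(K @ q / np.sqrt(d))
    return w @ V, w

CLASSES = ["none", "dontcare", "span", "inform", "refer", "true", "false"]

def slot_gate(r_slot, R_turn, W, b):
    """Conditioned slot gate: attend to turn token encodings R_turn with
    the slot encoding r_slot as query, then classify the attention output
    concatenated with r_slot. W, b are stand-ins for the FFN parameters."""
    g_o, g_w = attention(r_slot, R_turn, R_turn)
    logits = W @ np.concatenate([g_o, r_slot]) + b
    return CLASSES[int(np.argmax(softmax(logits)))], g_w

d, n = 8, 5
rng = np.random.RandomState(0)
r_slot, R_turn = rng.randn(d), rng.randn(n, d)
W, b = rng.randn(len(CLASSES), 2 * d), rng.randn(len(CLASSES))
cls, g_w = slot_gate(r_slot, R_turn, W, b)
assert cls in CLASSES and np.isclose(g_w.sum(), 1.0)
```

Swapping in a different slot description only changes the query vector, never the parameters, which is what makes the gate ontology independent.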
Sequence Tagging In order to keep the value extraction directly from the input ontology-independent as well, our model re-purposes attention to perform sequence tagging. If the slot gate predicts span, the sequence attention layer attends to token representations of the current dialogue turn given r_{S_i} as query, analogous to Eq. (1):

[q_o, q_w] = MHA_q(r_{S_i}, R_t, R_t, k_t).  (2)

Here, k_t is an input mask that only allows attending to representations of user utterances.
In contrast to other work that leverages attention for DST (Lee et al., 2019; Wu et al., 2019), we explicitly teach the model where to put the attention. This way, the predicted attention weights q_w become the sequence tagging predictions. Tokens that belong to a value are assigned a weight of 1, all other tokens are weighted 0. Since ||q_w||_1 = 1, we scale the target label sequences during training. During inference, we normalise q_w, i.e.,

q̃_w = [q̃_1, . . ., q̃_|X|], with q̃_j = q_{w,j} − 1/|X|,  (3)

so that we can infer sequence tags according to an "IO" tagging scheme (Ramshaw and Marcus, 1995). All q̃_j > 0 are assigned the "I" tag, all others the "O" tag. The advantage of sequence tagging over span prediction is that training can be performed using labels for multiple occurrences of the same slot-value in the input, for instance in the current turn and the dialogue history, and that multiple regions of interest can be predicted.
To extract a value from the context, we pick the sequence with the highest average token weight according to q̃_w among all sequences of tokens that were assigned the "I" tag and denote this value prediction as Val(q̃_w).
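The normalisation and extraction steps above can be sketched directly: subtract the uniform weight 1/|X|, tag positive residuals with "I", and return the contiguous "I" run with the highest average weight.

```python
import numpy as np

def io_tags(q_w):
    """Normalise tag weights and assign 'I'/'O' tags.
    Since ||q_w||_1 = 1, the uniform weight is 1/|X|; tokens whose weight
    exceeds it get the 'I' tag."""
    q_tilde = q_w - 1.0 / len(q_w)
    return ["I" if q > 0 else "O" for q in q_tilde], q_tilde

def extract_value(tokens, q_w):
    """Val(q~_w): among contiguous 'I' runs, pick the one with the highest
    average token weight and return its tokens (or None if no 'I' tag)."""
    tags, q_tilde = io_tags(q_w)
    best, best_score, run = None, -np.inf, []
    for i, tag in enumerate(tags + ["O"]):       # sentinel 'O' flushes the last run
        if tag == "I":
            run.append(i)
        elif run:
            score = float(np.mean([q_tilde[j] for j in run]))
            if score > best_score:
                best, best_score = run, score
            run = []
    return [tokens[i] for i in best] if best else None

tokens = ["book", "a", "cheap", "hotel", "in", "the", "south"]
q_w = np.array([0.02, 0.02, 0.40, 0.02, 0.02, 0.02, 0.50])
assert extract_value(tokens, q_w) == ["south"]
```

In the example, both "cheap" and "south" clear the uniform threshold, but "south" carries more mass, so it wins the extraction for this (hypothetical) slot query.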

Informed Value Prediction
We adopt informed value prediction from TripPy. Ontology independence is established via our conditioned slot gate.
The inform memory I_t = {I_t^1, . . ., I_t^{|S|}} tracks slot-values that were informed by the system in the current dialogue turn t. If the user positively refers to an informed value, and if the user does not express the value such that sequence tagging can be used, i.e., the slot gate predicts inform, then the value ought to be copied from I_t to DS_t.
We know from works on cognition that "all collective actions are built on common ground and its accumulation" (Clark and Brennan, 1991). In other words, it must be established in a conversation what has been understood by all participants. The process of forming mutual understanding is known as grounding. Informed value prediction in TripPy-R serves as a grounding component. As long as the information shared by the system has not yet been grounded, i.e., confirmed by the user, it is not added to the DS. This is in line with information state and dialogue management theories such as devised by Larsson and Traum (2000), which view grounding as essential to the theory of information states and therefore DST.
Coreference Prediction Although TripPy supports coreference resolution, this mechanism is limited to an a priori known set of slots. We use attention to establish slot independence for coreference resolution to overcome this limitation. If the slot gate predicts refer for a slot S_i, i.e., that it refers to a value that has previously been assigned to another slot, then the refer attention needs to predict the identity of said slot S_j, i.e.,

[f_o, f_w] = MHA_f(FFN_f(g_o), R_S, R_S),

where the slot attention output g_o is first piped through an FFN. R_S = [r_{S_1}, . . ., r_{S_|S|}] ∈ R^{d×|S|} is the matrix of stacked slot representations and f_w is the set of weights assigned to all candidate slots for S_j. The slot with the highest assigned weight is then our referred slot S_j. To resolve a coreference, S_i is updated with the value of S_j. During inference, R_S can be modified as desired to accommodate new slots.
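A simplified sketch of the copy step: score a query vector against the stacked slot representations, pick the argmax slot, and copy its value from the state. The single-head scoring and the dictionary-based state here are illustrative simplifications; the paper's version uses multi-head attention over layer-normalised outputs.

```python
import numpy as np

def resolve_coreference(f_query, R_S, slot_names, ds):
    """Refer attention sketch: score the querying slot's (FFN-transformed)
    representation against all slot representations R_S (columns, d x |S|),
    pick the highest-weighted slot S_j, and copy its value from the state.
    Appending a column to R_S (and a name to slot_names) accommodates a
    new slot without retraining."""
    scores = f_query @ R_S                 # one score per candidate slot
    e = np.exp(scores - scores.max())
    f_w = e / e.sum()                      # softmaxed slot weights
    s_j = slot_names[int(np.argmax(f_w))]
    return s_j, ds.get(s_j)

rng = np.random.RandomState(2)
R_S = rng.randn(8, 3)
R_S[:, 2] = 5.0                            # make slot 2 the clear match
f_query = np.ones(8)
ds = {"hotel-area": "south", "hotel-price": "cheap", "restaurant-area": "south"}
s_j, val = resolve_coreference(f_query, R_S, list(ds.keys()), ds)
assert (s_j, val) == ("restaurant-area", "south")
```

Because R_S is just a matrix of encodings, "modifying R_S to accommodate new slots" reduces to appending columns at inference time.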
Value Matching In contrast to picklist based methods such as by Zhang et al. (2020a), TripPy-R performs value matching as an optional step. We first create slot-value representations for all value candidates V_{S_i,j} of slot S_i, and learn matching of dialogue context q_o to the list of candidate values via value attention,

[m_o, m_w] = MHA_m(q_o, R_{V_{S_i}}, R_{V_{S_i}}),

where m_w should place a weight close to 1 on the correct value and weights close to 0 on all the others, and R_{V_{S_i}} is the matrix of stacked candidate value representations. The dot-product attention used in our model computes q_o^T · r_{V_{S_i},j} for each candidate, which is proportional to the cosine similarity cos(q_o, r_{V_{S_i},j}) = q_o^T · r_{V_{S_i},j} / (||q_o|| · ||r_{V_{S_i},j}||). Therefore, optimising the model to put maximum weight on the correct value and to minimise the weights on all other candidates forces representations of the input and of values occurring in that input to be closer in their common space, and vice versa.
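The scoring step can be sketched in a few lines: scaled dot products between the context encoding and each candidate value encoding, softmaxed into the weights m_w. The toy setup below plants one candidate collinear with the context to show that alignment in the shared space wins the match.

```python
import numpy as np

def value_match(q_o, R_values):
    """Match the slot-conditioned context encoding q_o against candidate
    value encodings (rows of R_values) with scaled dot-product weights.
    Returns m_w: one weight per candidate, summing to 1."""
    d = q_o.shape[-1]
    scores = R_values @ q_o / np.sqrt(d)   # dot product ∝ cosine similarity
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.RandomState(1)
q_o = rng.randn(8)
R_values = rng.randn(3, 8)
R_values[1] = 2.0 * q_o                    # candidate 1 aligned with the context
m_w = value_match(q_o, R_values)
assert int(np.argmax(m_w)) == 1            # the aligned candidate wins
```

Since the candidate matrix is an input rather than part of the parameters, the candidate list can be swapped or extended at inference time without touching the model.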

Training and Inference
Each training step requires the dialogue turn and all slot and value descriptions to be encoded. Our unified encoder re-encodes all slot descriptions at each step. Since the number of values might be in the range of thousands, we encode them once per epoch. The encoder is fine-tuned towards encoding all three input types. We optimise our model given the joint loss for each turn,

L_t = λ_g · L_g + λ_q · L_q + λ_f · L_f + λ_m · L_m,  (5)

where each term sums a loss ℓ(•, •) between a prediction and its ground truth over all slots. L_g, L_q, L_f and L_m are the joint losses of the slot gate, sequence tagger, coreference prediction and value matching. It is ||l^f_{S_i}||_1 = 1 and ||l^m_{S_i}||_1 = 1, i.e., labels for coreference prediction and value matching are 1-hot vectors. Backpropagating L_m also affects the sequence tagger. We scale l^q_{S_i} since sequence tagging may have to label more than one token as being part of a value.
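The loss combination can be sketched as follows. The λ values in the default are illustrative placeholders (the values actually used are given in Section 5); the pairing of cross entropy with the classification heads and mean squared error with the soft-target heads follows the training setup described there.

```python
import numpy as np

def cross_entropy(pred_probs, onehot):
    """Per-slot loss for the slot gate and coreference heads (1-hot targets)."""
    p = np.asarray(pred_probs)
    return float(-np.sum(np.asarray(onehot) * np.log(p + 1e-12)))

def mse(pred, target):
    """Per-slot loss for sequence tagging and value matching (soft targets)."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def joint_loss(L_g, L_q, L_f, L_m, lambdas=(0.8, 0.1, 0.1, 0.1)):
    """L_t = lambda_g*L_g + lambda_q*L_q + lambda_f*L_f + lambda_m*L_m,
    where each L is already summed over slots. Placeholder lambda values."""
    lg, lq, lf, lm = lambdas
    return lg * L_g + lq * L_q + lf * L_f + lm * L_m

assert abs(joint_loss(1.0, 1.0, 1.0, 1.0) - 1.1) < 1e-9
# a confident correct gate prediction is cheaper than a confident wrong one
assert cross_entropy([0.1, 0.9], [0, 1]) < cross_entropy([0.9, 0.1], [0, 1])
```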
During inference, the model can draw from its rich output, i.e., slot gate predictions, coreference prediction, sequence tagging and value matching, to adequately update the dialogue state. Slot and value descriptions are encoded only once with the fine-tuned encoder, then stored in databases as illustrated in Figure 1 in steps 1 and 2. Pre-encoded slots condition the attention and FFN layers, and pre-encoded values are used for value matching. Note that it is straightforward to update these databases on-the-fly for a running system, thus easily expanding its capacities.
Step 3 is the processing of dialogue turns to perform dialogue state update prediction.

Dialogue State Update
At turn t, the slot gate predicts for each slot S_i how it should be updated. none means that no update is needed. dontcare denotes that any value is acceptable to the user. span indicates that a value is extractable from any of the user utterances {U_t, . . ., U_1}. inform denotes that the user refers to a value uttered by the system in M_t. refer indicates that the user refers to a value that is already present in DS_t in a different slot. The classes true and false are used by slots that take binary values.
If candidate values are known at inference time, TripPy-R can utilise value matching to benefit from supporting predictions for the span case. Since sequence tagging and value matching predictions would compete over the slot update, we use confidence scores to make an informed decision. Given the current input and the candidate values for a slot, we can use the attention weights m_w of the value attention as individual scores for each value. We can also use the L2-norm between input and values, i.e., e_{S_i,j} = ||q_o − r_{V_{S_i},j}||_2, with e_{S_i} = [e_{S_i,1}, . . ., e_{S_i,|V_{S_i}|}] as the score set. The value matching prediction is used for the dialogue state update only if its confidence exceeds a threshold, where τ ∈ [0, 1] is a threshold parameter that controls the level of the model's confidence needed to still consider its value matching predictions; otherwise the sequence tagging prediction Val(q̃_w) is used.
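One plausible form of this decision rule can be sketched as follows. This is a simplification under the assumption that the maximum matching weight m_w serves as the confidence score; the paper's exact rule may differ and also admits the L2-based scores.

```python
import numpy as np

def ds_update(tag_value, match_weights, candidates, tau):
    """Decide the slot update for a 'span' prediction: use the top
    value-matching candidate when its weight clears the threshold tau,
    otherwise fall back to the (ontology-independent) tagged value."""
    j = int(np.argmax(match_weights))
    if match_weights[j] >= tau:
        return candidates[j]          # confident match against known candidates
    return tag_value                  # sequence-tagging fallback

candidates = ["cheap", "moderate", "expensive"]
m_w = np.array([0.1, 0.2, 0.7])
# with a permissive threshold, matching canonicalises "upscale" -> "expensive"
assert ds_update("upscale", m_w, candidates, tau=0.5) == "expensive"
# with a strict threshold, the raw tagged value survives
assert ds_update("upscale", m_w, candidates, tau=0.8) == "upscale"
```

Note how τ = 1 effectively disables value matching while τ = 0 always trusts it, which is why τ is tuned on the development sets.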

Levels of Robustness in DST
We propose the following methods to improve robustness in DST on multiple levels.

Robustness to Spanless Labels
Our framework introduces a novel training scheme to learn from data without span labels, thereby lowering the demand for fine-grained labels. We teach a proto-DST model that uses parts of TripPy-R's architecture to tag random token sub-sequences that occur in the textual input. We use this model to locate value occurrences in each turn t of a dialogue as listed in the labels for DS_t.
The proto-DST model consists of the unified encoder and the sequence attention of TripPy-R, as depicted in Figure 2. Let D_t = (U_t, M_t) be the input to the model, which is encoded as

R_t = Enc_S(x_CLS ⊕ U_t ⊕ x_SEP ⊕ M_t ⊕ x_SEP).

Let Y ∈ D_t be a sub-sequence of tokens that was randomly picked from the input, encoded as r_Y = Enc_P(x_CLS ⊕ Y ⊕ x_SEP). In Figure 2, this corresponds to input types 1 and 3. The sequence tagger is then described as

[q_o, q_w] = MHA_q(r_Y, R_t, R_t),

analogous to Eq. (2). For training, we minimise the loss between q_w and the scaled target labels, analogous to Eq. (5). At each training step, a random negative sample Ȳ ∉ D_t rather than a positive sample is picked for training with probability p_neg. For the Y ∈ D_t, the label l^q_Y marks the positions of all tokens of Y in D_t. For the Ȳ ∉ D_t, the label l^q_Ȳ puts a weight of 1 onto the special token x_NONE and 0 everywhere else. The desired behaviour of this model is therefore to distribute the maximum amount of the probability mass uniformly among all tokens that belong to the randomly picked sequence. In case a queried sequence is absent from the input, all probability mass should be assigned to x_NONE. Table 1 lists positive and negative training examples.
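The self-labeling step can be sketched as follows. The convention of reserving index 0 for x_NONE, the helper name, and the absent query token are our illustrative choices; the scheme itself (uniform mass over the picked sub-sequence, all mass on x_NONE for negatives) follows the description above.

```python
import random

def proto_labels(turn_tokens, p_neg=0.1, max_len=4, rng=random):
    """Build one proto-DST training example: pick a random token
    sub-sequence Y from the turn and spread the label mass uniformly over
    its positions, or (with probability p_neg) query a sequence absent
    from the turn and put all label mass on x_NONE (index 0 here)."""
    n = len(turn_tokens)
    labels = [0.0] * (n + 1)                 # position 0 reserved for x_NONE
    if rng.random() < p_neg:
        query = ["tokyo"]                    # hypothetical token absent from the turn
        labels[0] = 1.0
    else:
        start = rng.randrange(n)
        length = min(rng.randint(1, max_len), n - start)
        query = turn_tokens[start:start + length]
        for i in range(start, start + length):
            labels[1 + i] = 1.0 / length     # scaled: mass spread over Y's tokens
    return query, labels

rng = random.Random(7)
turn = ["i", "want", "a", "cheap", "hotel"]
query, labels = proto_labels(turn, rng=rng)
assert abs(sum(labels) - 1.0) < 1e-9         # labels always form a distribution
```

No dialogue state annotation enters this stage at all: the supervision signal is manufactured from the raw text itself, which is what makes the scheme self-supervised.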
In order to tag value occurrences in dialogue turns for training with spanless labels, we predict for each value in DS_t its position in D_t, given the proto-DST. Let s^i_t be the value label for slot S_i in turn t, which is encoded as r_{s^i_t} = Enc_P(x_CLS ⊕ s^i_t ⊕ x_SEP). Value tagging is performed as

[q_o, q_w] = MHA_q(r_{s^i_t}, R_t, R_t),

which corresponds to input types 2 and 3 in Figure 2. q_w is normalised according to Eq. (3). Table 1 shows examples of value tagging with the proto-DST. A set of tag weights q_w is accepted if more than half the probability mass is assigned to word tokens rather than x_NONE. We use a morphological closing operation (Serra, 1982) to smooth the tags, i.e.,

q̂_w = δ(((q_w > ν) ⊕ ω) ⊖ ω),  (6)

where ⊕ and ⊖ are the dilation and erosion operators, δ is an indicator function, q_w is interpreted as an array, ω = [1, 1, 1] is a kernel, and ν is a threshold parameter that allows filtering of tags based on their predicted weights. Contextual representations enable our value tagger to also identify positions of value variants, i.e., different expressions of the same value (see Table 1 for an example). We tag turns without their history. To generate labels for the history portion, we simply concatenate the tags of the preceding turns with the tags of the current turn.
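The closing operation can be sketched with a hand-rolled dilation and erosion over a binary tag array (libraries such as scipy.ndimage provide `binary_closing` for this; the minimal version below keeps the example self-contained and assumes tagged runs do not touch the array borders).

```python
import numpy as np

def closing(tags, kernel_size=3):
    """Morphological closing (dilation then erosion) of a 1-D binary tag
    array with a flat kernel [1, 1, 1]: fills single-token gaps inside a
    tagged region, e.g. an untagged ':' inside an hh:mm time value."""
    k = kernel_size // 2
    pad = np.pad(tags, k)                   # zero padding on both sides
    dilated = np.array([pad[i:i + kernel_size].max() for i in range(len(tags))])
    pad2 = np.pad(dilated, k)
    closed = np.array([pad2[i:i + kernel_size].min() for i in range(len(tags))])
    return closed

tags = np.array([0, 1, 1, 0, 1, 0, 0])      # gap at position 3
assert closing(tags).tolist() == [0, 1, 1, 1, 1, 0, 0]
```

The gap at position 3 (e.g. the ":" the tagger occasionally misses) is filled, while isolated regions elsewhere are left untouched.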

Robustness to Sample Sparsity
We propose new forms of input-level dropout to increase variance in training samples while preventing an increase in data and training time.
Token Noising Targeted feature dropout (Xu and Sarikaya, 2014) has already been used successfully in the form of slot value dropout (SVD) to stabilise DST model training (Chao and Lane, 2019; Heck et al., 2020b). During training, SVD replaces tokens of extractable values in their context by a special token x_UNK with a certain probability. The representation of x_UNK amalgamates the contextual representations of all tokens that are not in the encoder's vocabulary V_enc and therefore carries little semantic meaning.
Instead of randomly replacing target tokens with x_UNK, we use random tokens from a frequency-sorted V_enc. Specifically, a target token is replaced with probability p_tn by a token x_k ∈ V_enc, where k is drawn from a uniform distribution U(1, K). Since the least frequent tokens in V_enc tend to be nonsensical, we use a cut-off K ≪ |V_enc| for k. The idea behind this token noising is to avoid a train-test discrepancy. With SVD, x_UNK is occasionally presented as a target during training, but the model will always encounter valid tokens during inference. With token noising, this mismatch does not occur. Further, token noising increases the variety of observed training samples, while SVD potentially produces duplicate inputs by masking with a placeholder.

History Dropout We propose history dropout as another measure to prevent over-fitting due to sample sparsity. With probability p_hd, we discard parts of the turn history H_t during training. The cut-off is sampled from U(1, t − 1). Utilising dialogue history is essential for competitive DST (Heck et al., 2020b). However, models might learn correlations from sparse samples that do not hold true on new data. The idea of history dropout is to prevent the model from over-relying on the history so as to not be thrown off by previously unencountered conversational styles or contents.
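Both dropout variants are simple enough to sketch end to end. The toy vocabulary and turn tuples are illustrative; the history is ordered most-recent-first, as in the definition of H_t above.

```python
import random

def token_noise(tokens, target_positions, vocab, p_tn=0.3, cutoff=0.2, rng=random):
    """Token noising: with probability p_tn, replace a target (value) token
    by a token drawn uniformly from the top-K of a frequency-sorted
    vocabulary, with K = cutoff * |vocab| << |vocab|."""
    K = max(1, int(cutoff * len(vocab)))
    out = list(tokens)
    for i in target_positions:
        if rng.random() < p_tn:
            out[i] = vocab[rng.randrange(K)]
    return out

def history_dropout(history, p_hd=0.3, rng=random):
    """History dropout: with probability p_hd, keep only the `cut` most
    recent turns of the (most-recent-first) history, cut ~ U(1, t-1)."""
    t = len(history)
    if t > 1 and rng.random() < p_hd:
        return history[:rng.randint(1, t - 1)]
    return history

rng = random.Random(3)
vocab = ["the", "a", "hotel", "cheap", "train", "zzxq"]   # frequency-sorted toy vocab
noised = token_noise(["a", "cheap", "hotel"], [1], vocab, p_tn=1.0, rng=rng)
assert noised == ["a", "the", "hotel"]    # only the target position changed
hist = [("u2", "m2"), ("u1", "m1")]
assert len(history_dropout(hist, p_hd=1.0, rng=rng)) <= len(hist)
```

Note that the replacement in the example is still a valid, frequent token ("the"), not a placeholder, which is exactly the train-test consistency argument made above.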

Robustness to Unseen Values
Robustness to unseen values is the result of multiple design choices. The applied triple copy strategy as proposed by Heck et al. (2020b) facilitates value independence. Our proposed token noising and history dropout prevent memorisation of recurring patterns. TripPy-R's value matching provides an alternative prediction for the DS update, in case candidate values are available during inference. Our model is equipped with the partial masking functionality (Heck et al., 2020b).
Masking may be applied to informed values in the system utterances M_t, . . ., M_1 using x_UNK, which forces the model to focus on the system utterances' context information rather than specific mentions of values.

Robustness to Unseen Slots and Domains
Domain transfer has the highest demand for generalisability and robustness. A transfer of the strong triple copy strategy DST baseline to new topics post facto is not possible due to the ontology dependence of slot gates, span prediction heads, inform memory and classification heads for coreference resolution. The latter two mechanisms in particular contribute to the robustness of DST towards unseen values within known domains (Heck et al., 2020b). The proposed TripPy-R architecture is therefore vital to establish robustness of triple copy strategy DST to unseen slots across new domains. TripPy-R is designed to be entirely domain-agnostic by using a model architecture whose parts can be conditioned on natural language descriptions of concepts.
5 Experimental Setup

Evaluation
We use joint goal accuracy (JGA) as the primary metric to compare between models. The JGA on a test set is the ratio of dialogue turns for which all slots were filled with the correct value (including none). For domain-transfer tests, we report per-domain JGA, and for out-of-ontology prediction experiments, we also report per-slot accuracy. We repeat each experiment ten times for the small datasets and three times for MultiWOZ, and report averaged numbers and maximum performance. For evaluation, we follow Heck et al. (2020b).

Training
We initialise our unified encoder with RoBERTa-base (Liu et al., 2019). The input sequence length is 180 after WordPiece tokenization (Wu et al., 2016). The loss weights are (λ_g, λ_q, λ_f, λ_m) = (0.8, (1−λ_g)/2, (1−λ_g)/2, 0.1). ℓ_g and ℓ_f are cross-entropy losses, and ℓ_q and ℓ_m are mean squared error losses. We use the Adam optimiser (Kingma and Ba, 2015) and backpropagate through the entire network including the encoder. We also backpropagate the error for slot encodings, since we re-encode them at every step. The learning rate is 5e-5 after a warmup portion of 10% (5% for MultiWOZ), then decays linearly. The maximum number of epochs is 20 for MultiWOZ, 50 for WOZ 2.0 and 100 for sim-M/R. We use early stopping with patience (20% of the maximum number of epochs), based on the development set JGA. The batch size is 16 (32 for MultiWOZ). During training, the encoder output dropout rate is 30%, and p_tn = p_hd = 30% (10% for MultiWOZ). The weight decay rate is 0.01. For token noising, we set K = 0.2 · |V_enc|. We weight ℓ_g for none cases with 0.1. For value matching, we tune τ in decrements of 0.1 on the development sets.
For spanless training, the maximum length of random token sequences for the proto-DST model training is 4. The maximum number of epochs is 50 for the WOZ datasets and 100 for sim-M/R. The negative sampling probability is p_neg = 10%.
6 Experimental Results

Learning from Spanless Labels
The quality of the proto-DST for value tagging determines whether or not training without explicit span labels leads to useful DST models. We evaluate the tagging performance on the example of MultiWOZ 2.1 by calculating the ratio of turns for which all tokens are assigned the correct "IO" tag. Figure 3 plots the joint tagging accuracy across slots, dependent on the weight threshold in Eq. (6). It can be seen that an optimal threshold is ν = 0.3. We found this to be true across all datasets. We also found that the morphological closing operation generally improves tagging accuracy. Typical errors that are corrected by this post-processing are gaps caused by occasionally failing to tag special characters within values, e.g., ":" in times with hh:mm format, and imprecisions caused by uncertainties of the model when tagging long and complex values such as movie names. Average tagging accuracy across slots is 99.8%. This is particularly noteworthy since values in MultiWOZ can be expressed with a wide variety (e.g., "expensive" might be expressed as "upscale", "fancy", and so on). We attribute the high tagging accuracy to the expressiveness of the encoder-generated semantic contextual representations. Table 2 lists the JGA of TripPy-R when trained without manual span labels. For the small datasets we did not use x_NONE and negative sampling, as it did not make a significant difference. We see that performance is on par with models that were trained with full supervision. If value matching on top of sequence tagging is not used, performance is slightly below its supervised counterparts. We observed that value matching compensates for minor errors caused by the sequence tagger that was trained on automatic labels.
Impact of Tagging Variants While our proto-DST model already achieves very high accuracy on all slots, including the ones that expect values with many variants, we tested whether explicit tagging of variants may further improve performance. For instance, if a turn contains the (canonical) value "expensive" for slot hotel-pricerange, but expressed as "upscale", we would explicitly tag such variants. While this strategy further improved the joint tagging accuracy from 94.4% to 96.1% (Figure 3), we did not see a rise in DST performance (Table 2). In other words, the contextual encoder is powerful enough to endow the proto-DST model with the ability to tag variants of values, based on semantic similarity, which renders any extra supervision for this task unnecessary.

Handling Sample Sparsity
Impact of Token Noising We experienced that traditional SVD leads to performance gains on sim-M, but not on any of the other tested datasets, confirming Heck et al. (2020b). In contrast, token noising improved the JGA for sim-M/R considerably. Note that in Table 2, the TripPy baseline for sim-M already uses SVD. On MultiWOZ 2.1, we observed minor improvements. As with SVD, WOZ 2.0 remained unaffected. The ontology for WOZ 2.0 is rather limited and remains the same for training and testing. This is not the case for the other datasets, where values occur during testing that were never seen during training. By all appearances, presenting the model with a more diverse set of dropped-out training examples helps generalisation more than using a single placeholder token. This seems especially true when there are only few value candidates per slot, and few training samples to learn from. A particularly illustrative case is found in the sim-M dataset. Without token noising, trained models regularly end up interpreting the value "last christmas" as movie-date rather than movie-name, based on its semantic similarity to dates. Token noising on the other hand forces the model to put more emphasis on context rather than token identities, which effectively removes the occurrence of this error.

Impact of History Dropout Table 2 shows that history dropout does not adversely affect DST performance. This is noteworthy since utilising the full dialogue history is the standard in contemporary works due to its importance for adequate tracking. History dropout effectively reduces the amount of training data by omitting parts of the model input. At the same time training samples are diversified, preventing the model from memorising patterns in the dialogue history and promoting generalisation. Figure 4 shows the severe effects of over-fitting to the dialogue history on small datasets, when not using history dropout.
Here, models were only provided the current turn as input, without historical context. Models with history dropout fare considerably better, showing that they do not over-rely on the historical information. Models without history dropout not only perform much worse, their performance is also extremely unstable. On sim-R, the relative performance drop ranges from 0% to 39.4%. The results on MultiWOZ point to the importance of the historical information for proper tracking in more challenging scenarios. Here, performance drops equally in the absence of dialogue history, whether or not history dropout was used.
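The two input-level dropout ideas discussed above can be sketched as follows. This is a simplified illustration under assumed interfaces (token-level value spans, a flat vocabulary list, history as a list of turn strings); the function names and probabilities are illustrative, not the actual training pipeline.

```python
import random

def token_noise(tokens, value_span, vocab, p=0.3):
    """With probability p, replace the tokens of a value span with random
    vocabulary tokens, rather than a single placeholder as in classical SVD.
    This diversifies dropped-out examples and forces the model to rely on
    context instead of token identities."""
    out = list(tokens)
    if random.random() < p:
        start, end = value_span
        for i in range(start, end):
            out[i] = random.choice(vocab)
    return out

def history_dropout(history_turns, p=0.5):
    """With probability p, truncate the dialogue history to a random suffix
    (possibly empty), preventing memorisation of history patterns."""
    if history_turns and random.random() < p:
        keep = random.randint(0, len(history_turns))
        return history_turns[-keep:] if keep else []
    return history_turns
```

In this sketch, the "last christmas" example above would sometimes be replaced by arbitrary tokens during training, so the model learns that whatever fills the movie-name position is the value, regardless of its surface form.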

Handling Unseen Values
We probed value independence on two out-of-ontology test sets for MultiWOZ, in which values are replaced with arbitrary substitutes, i.e., the distinction between concepts is lost. Table 3 lists the results. The performance loss is more graceful on OOO_Heck, and we see that TripPy-R has an advantage over TripPy. The performance drop is more severe on OOO_Qian, with comparable JGA to the baseline of Qian et al. (2021), which is a generative model. The authors of that work attribute the performance degradation to hallucinations caused by memorisation effects. For our extractive model, the main reason is found in the slot gate. The relative slot gate performance drop for the train domain-slots is 23.3%, while for other domain-slots it is 6.4%. We believe the reason is that most of the arbitrary substitutes carry no characteristics of train stops, but of other domains instead. This is less of a problem for the taxi domain, for instance, since taxi stops cover a variety of location types. The issue of value-to-domain mismatch can be mitigated somewhat with informed value masking in system utterances (Section 4.3). While this does not particularly affect our model on the regular test set or on the more domain-consistent OOO_Heck, we see much better generalisation on OOO_Qian.
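The informed value masking mentioned above can be illustrated with a small sketch. This is an assumed simplification (single-token values, a known set of ontology values, a placeholder mask token), not the exact mechanism of Section 4.3.

```python
def mask_system_values(system_tokens, ontology_values, mask="<mask>"):
    """Replace occurrences of known ontology values in a system utterance
    with a mask token, so the model cannot latch onto value identities
    and must track slots from context instead."""
    return [mask if tok.lower() in ontology_values else tok
            for tok in system_tokens]
```

With value identities hidden on the system side, a substitute value at test time looks no different from a training-time value, which is one way the value-to-domain mismatch can be softened.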

Handling Unseen Slots and Domains
Table 2 shows that moving from slot-specific to slot-independent components only marginally affects DST performance, while enabling tracking of dialogues with unseen domains and slots.
Zero-shot Performance We conducted zero-shot experiments on MultiWOZ 2.1 by excluding all dialogues of a domain d from training and then evaluating the model on dialogues of d. In Table 4, we compare TripPy-R to recent models that support slot independence. Even though we did not specifically optimise TripPy-R for zero-shot abilities, our model shows a level of robustness that is competitive with other contemporary methods.
Impact of Non-dialogue Data Besides zero-shot abilities, we were curious whether it is feasible to improve dialogue state tracking by learning the required mechanics purely from non-dialogue data. This is a non-trivial task, as the model needs to generalise knowledge learned from unstructured data to dialogue, i.e., sequences of alternating system and user utterances. We conducted this experiment by converting MultiWOZ dialogues of a held-out domain d into non-dialogue format for training. For d, the model only sees isolated sentences or sentence pairs, i.e., without any structure of a dialogue. Consequently, there is no "turn" history from which the model could learn. The assumption is that one would have some way to label sequences of interest in non-dialogue sentences, for instance with a semantic parser. As this is a feasibility study, we resort to the slot labels in DS_t, which simulates having labels of very high accuracy. We tested three different data formats: (1) Review style: Only system utterances with statements are used to learn from; (2) FAQ style: A training example is formed by a user question and the following system answer. Note that this is contrary to what TripPy-R naturally expects, which is a querying system and a responding user; and (3) FAQ+ style: Combines review and FAQ style examples and adds user questions again as separate examples.

Figure 5 shows that we observed considerable improvements across all held-out domains when using non-dialogue data to learn from. Learning from additional data, even if unstructured, is particularly beneficial for unique slots, such as the restaurant-food slot, which the model cannot learn about from any other domain in MultiWOZ (as is reflected in a poor zero-shot performance as well). We also found that learning benefits from the combination of different formats. The heightened performance given the FAQ+ style data is not an effect of more data, but of its presentation, since we mainly re-use inputs with different formats. This observation is reminiscent of findings in psychology. Horst et al.
(2011) showed that children benefited from being read the same story repeatedly. Furthermore, Johns et al. (2016) showed that contextual diversity positively affects word learning in adults. Note that this kind of learning is in contrast to few-shot learning and to leveraging artificial dialogue data, which either require fine-grained manual labels or high-level knowledge of how dialogues are structured. Even though the data we used is far removed from what a dialogue state tracker expects, TripPy-R still manages to learn how to appropriately track these new domains.
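The three data formats described above can be sketched as simple conversions of a dialogue, here modelled as a list of (system, user) turn pairs. This is an illustrative reconstruction under that assumed representation, not the authors' preprocessing code.

```python
def to_review_style(dialogue):
    """Keep only system statements as isolated training examples."""
    return [sys for sys, usr in dialogue if sys]

def to_faq_style(dialogue):
    """Pair each user question with the system answer of the next turn."""
    return [(dialogue[t][1], dialogue[t + 1][0])
            for t in range(len(dialogue) - 1)]

def to_faq_plus_style(dialogue):
    """Review and FAQ style combined, plus user questions on their own.
    Mostly the same inputs, presented in different formats."""
    user_only = [usr for sys, usr in dialogue if usr]
    return to_review_style(dialogue) + to_faq_style(dialogue) + user_only
```

Note that FAQ+ re-uses the same utterances in several presentations, which matches the observation above that the gain comes from diversified presentation rather than from additional data.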

Performance Comparison
We evaluated on five versions of MultiWOZ to place TripPy-R among contemporary work. Versions 2.1 and 2.2 mainly propose general corrections to the labels of MultiWOZ 2.0. Version 2.3 unifies annotations between dialogue acts and dialogue states. In contrast, version 2.4 removes all values that were mentioned by the system from the dialogue state, unless they are proper names. Figure 6 plots the results. The performance of TripPy-R is considerably better on 2.3 and 2.4. This can be attributed to a more accurate prediction of the inform cases due to better test ground truths.
For fairness, we restricted our comparison to models that have the same general abilities, i.e., they ought to be open-vocabulary and without data-specific architectures. The SOTA on 2.0 (Su et al., 2022) proposes a unified generative dialogue model to solve multiple tasks including DST and benefits from pre-training on various dialogue corpora. While profiting from more data in general, its heterogeneity in particular did not affect DST performance. Yu et al. (2020), Li et al. (2020) and Dai et al. (2021) currently share the top of the leaderboard for 2.1, all of which propose TripPy-style models that leverage data augmentation. The main reason for their performance improvements lies in the larger amount of data and in diversifying samples. TripPy-R does not rely on more data, but diversifies training samples with token noising and history dropout. On 2.2, the method of Tian et al. (2021) performs best with a two-pass generative approach that utilises an error recovery mechanism. This mechanism can correct generation errors such as those caused by hallucination, a phenomenon that does not occur with TripPy-R. However, their error recovery also has the potential to avoid propagation of errors made early in the dialogue, as demonstrated by their heightened performance. Cho et al. (2021) report numbers for the method of Mehri et al. (2020) on 2.3, which is another TripPy-style model using an encoder that was pre-trained on millions of conversations, thus greatly benefiting from specialised knowledge. For 2.4, the current SOTA with the properties as stated above is presented by Ye et al. (2021b) and reported in Ye et al. (2021a), which is now surpassed by TripPy-R. The major difference to our model is the use of slot self-attention, which allows their model to learn correlations between slot occurrences. While TripPy-R does not model slot correlations directly, it does explicitly learn to resolve coreferences.

Implications of the Results
The zero-shot capabilities of our proposed TripPy-R model open the door to many new applications. However, it should be noted that its performance on an unseen arbitrary domain and on unseen arbitrary slots will likely degrade. In such cases it would be more appropriate to perform adaptation, which the TripPy-R framework facilitates. This means that one would transfer the model as presented in Sections 4.3 and 4.4 and continue fine-tuning with limited, and potentially unstructured (see Section 6.4), data from the new domain. Nonetheless, in applications such as e-commerce (Zhang et al., 2018) or customer support (García-Sardiña et al., 2018), whenever new slots or even domains are introduced, they are to a great extent related to ones that a deployed system is familiar with. We believe that the zero-shot performance presented in Table 4 is highly indicative of this set-up, as MultiWOZ domains are different, yet to some extent related. Further, the TripPy-R model facilitates future applications in complex domains such as healthcare. One of the biggest obstacles to harnessing large amounts of natural language data in healthcare is the required labelling effort. This is particularly the case for applications in psychology, as can be seen from the recent work of Rojas-Barahona et al. (2018), where only 5K out of 1M interactions were labelled with spans for so-called thinking errors by psychologists. A framework like TripPy-R can completely bypass this step by utilising its proto-DST, as presented in Section 4.1, eliminating the prohibitive labelling effort.

Conclusion
In this work we presented methods to facilitate robust extractive dialogue state tracking with weak supervision and sparse data. Our proposed architecture, TripPy-R, utilises a unified encoder, the attention mechanism, and conditioning on natural language descriptions of concepts to facilitate parameter sharing and zero-shot transfer. We leverage similarity-based value matching as an optional step after value extraction, without violating the principle of ontology independence.
We demonstrated the feasibility of training without manual span labels using a self-trained proto-DST model. Learning from spanless labels enables us to leverage data with weaker supervision. We showed that token noising and history dropout mitigate issues of pattern memorisation and train-test discrepancies. We achieved competitive zero-shot performance and demonstrated in a feasibility study that TripPy-R can learn to track new domains by learning from non-dialogue data. We achieve either competitive or state-of-the-art performance on all tested benchmarks. In future work we will continue to investigate learning from non-dialogue data, potentially in a continuous fashion over the lifetime of a dialogue system.

Figure 1 :
Figure 1: Proposed model architecture. TripPy-R takes the turn and dialogue history as input and outputs a DS. All inputs are encoded separately with the same fine-tuned encoder. For inference, slot and value representations are encoded once and then stored in databases for retrieval.

Out-of-domain dialogue data is limited in quantity as well. Data augmentation still requires high-level knowledge about dialogue structures and an adequate data generation strategy. Ultimately, more data also means longer training. We are aware of only one recent work that attempts DST with weak supervision. Liang et al. (2021) take a few-shot learning approach using only a subset of fully labelled training samples, typically from the end of conversations, to train a soft-gated pointer-generator network. In contrast, with our approach to spanless training, we reduce the level of granularity needed for labels to train extractive models. Note that these strategies are orthogonal.
Top: Training samples for the proto-DST. Y = {"need", "a", "train"} is a randomly picked sub-sequence in D_t. The model needs to tag all tokens belonging to Y. For any random sequence Ȳ ∉ D_t, all probability mass should be assigned to x_NONE. Bottom: Example of tagging the training data with a proto-DST given only spanless labels (input: "Hi, I am looking for an upscale restaurant in the centre"). The model needs to tag all tokens belonging to the respective values. Note how the proto-DST successfully tagged the word "upscale" as an occurrence of the canonical value restaurant-price=expensive.

Figure 2 :
Figure 2: The proto-DST model for value tagging.

Figure 3 :
Figure 3: Tagging performance of the proto-DST model depending on the weight threshold ν.

Figure 4 :
Figure 4: Performance loss due to mismatched training and testing conditions. Here, history is provided during training, but not during testing. sim-M/R and WOZ 2.0 show clear signs of over-fitting without history dropout.

Figure 5 :
Figure 5: Performance of TripPy-R after training with non-dialogue style data from a held-out domain.

Figure 6 :
Figure 6: Comparison of TripPy-R and SOTA open vocabulary DST models.* denotes TripPy-style models.
A simple confidence estimate, Conf(x) = max(x) − (Σ_c x_c)/|C|, is applied to m_w and e^S_i (interpreting them as multisets rather than vectors) to compute two confidence scores Conf(m_w) and Conf(e^S_i) for the most likely value candidate. This type of confidence captures the notion of difference between the best score and the mean of all other scores, intuitively expressing model certainty. Val(m_w) = argmax(m_w) and Val(e^S_i) = argmax(e^S_i) are the most likely candidates according to value attention and L2-norm. For any slot that was predicted as span, the final prediction is
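The confidence computation can be sketched as follows. The tie-breaking rule in pick_value is an illustrative assumption; the paper's actual combination of span prediction and value matching is more involved.

```python
import numpy as np

def confidence(scores):
    """Best score minus the mean of all scores over the candidate set C:
    a simple certainty estimate that is high when one candidate clearly
    dominates and low when the distribution is flat."""
    scores = np.asarray(scores, dtype=float)
    return scores.max() - scores.mean()

def pick_value(m_w, e_s, candidates):
    """Illustrative tie-break: return the candidate from whichever scorer
    (value attention m_w or L2-distance-based e_s) is more confident."""
    if confidence(m_w) >= confidence(e_s):
        return candidates[int(np.argmax(m_w))]
    return candidates[int(np.argmax(e_s))]
```

A flat score vector such as [0.34, 0.33, 0.33] yields a confidence near zero, while a peaked one such as [0.9, 0.05, 0.05] yields a high confidence, which matches the intuition described above.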

Table 2 :
DST results in JGA (± denotes standard deviation). w/o value matching refers to both training and inference.

Table 3 :
Performance in JGA on artificial out-of-ontology test sets (± denotes standard deviation).

Table 4 :
Best zero-shot DST results for various models on MultiWOZ 2.1 in JGA. * Li et al. (2021) presents considerably higher numbers for models with data augmentation; we compare against a model without data augmentation. ** is a model with three times as many parameters as ours.