Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management

Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional training signal inference is not suitable for non-deterministic agent behavior, namely, considering multiple actions as valid in identical dialogue states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited for data augmentation, multi-reference training and evaluation of non-deterministic agents. ConvGraph generates novel dialogue paths to augment data volume and diversity. Intrinsic and extrinsic evaluation across three datasets shows that data augmentation and/or multi-reference training with ConvGraph can improve dialogue success rates by up to 6.4%.


Introduction
Dialogue systems research focuses on the natural language interaction between a user and an artificial conversational agent. Current trends lean towards end-to-end models (Miller et al., 2017), while modular systems tend to be the preferred approach in industrial applications (Bocklisch et al., 2017; Burtsev et al., 2018). At the core of a modular conversational agent is dialogue management (DM), whose function is to exchange information with a user, update the agent's internal state, and plan its next action according to a policy (Young, 2000). The dialogue manager then collects the user's response and the process repeats until the conversation ends. More specifically, task-oriented dialogue managers aim to complete one or more specific tasks (e.g., booking services or buying products).
Machine-learned policies for DM require large amounts of high-quality data to generalize to a variety of conversational scenarios (El Asri et al., 2017; Budzianowski et al., 2018; Rastogi et al., 2019). However, given the complexity of some tasks, the datasets are often limited in size. Reinforcement learning approaches (Henderson et al., 2008; Miller et al., 2017; Li et al., 2017; Gordon-Hall et al., 2020) can replace the need for explicit training data by exploiting a custom-designed environment to infer the training signal for the policy. Unfortunately, such custom-designed environments may not be representative of how a user would interact with a conversational agent, and their manual development is time-consuming and domain-specific. Furthermore, modern conversational agents exhibit non-deterministic behavior, that is, they are able to take different but equally valid actions in identical dialogue states. Conventional agent training and evaluation do not support the non-deterministic nature of conversational datasets: only a single fixed target is considered per inference, which penalizes valid model predictions.
To address these problems, we propose the Conversation Graph (ConvGraph), a DM framework for data augmentation, which also enables the training and evaluation of non-deterministic policies. If we consider each dialogue as a sequence of dialogue states, alternating between the agent and the user, we assemble a graph structure by unifying matching states across all conversations in the dataset. New dialogue paths can then be traversed in the graph resulting in novel training instances that can improve the DM policy. The unification simultaneously collects all valid actions at each point in the dialogue to facilitate the training and evaluation of non-deterministic policies.
We consider the contribution of this paper to be threefold. First, we explore several augmentation baselines as well as variants of ConvGraph augmentation to show their impact on the policy through intrinsic and extrinsic evaluation. We show that augmentation with ConvGraph leads to improvements of up to 6.4% when applied in an end-to-end dialogue system. Second, we propose a loss function that takes advantage of ConvGraph's unified dialogue states to increase the success rate by up to 2.6%. Finally, we exploit ConvGraph to introduce a multi-reference evaluation metric for non-deterministic dialogue management.

Background
The idea of using graphs in dialogue systems is not new (Larsen and Baekgaard, 1994; Schlungbaum and Elwert, 1996; Agarwal, 1997), but it is limited to graphs representing flow charts where each node is an action step in a sequence, without a specific semantic importance assigned to the nodes themselves (Aust and Oerder, 1995; Wärnestål, 2005). In ConvGraph, on the other hand, the dialogue state information is encoded in a structured way for each node, allowing the unification of nodes across conversations. It is primarily this structured representation of dialogues that enables the use of the graph for data augmentation by traversal of new dialogue paths. It also allows for non-deterministic training and evaluation by referencing the graph to validate model predictions.

Data Augmentation
Data augmentation for dialogue management is relatively unexplored and limited to increasing training data volume with (random) data duplication/recombination (Bocklisch et al., 2017) or general machine learning data transformations such as oversampling and downsampling of existing samples (Chawla et al., 2002). We explore variants of both strategies as additional baselines. Related to our work is a recent paper on Multi-Action Data Augmentation (MADA). Like us, its authors leverage the fact that a non-deterministic agent can take different actions given the same dialogue state. However, our methodologies then diverge sharply. While we pursue data augmentation for DM, MADA is applied to the task of context-to-text Natural Language Generation (Wen et al., 2017). This necessitates fundamental differences: MADA does not abstract away slot values because its states must be unified on literal values and database results to generate the required natural language response. MADA additionally does not consider any dialogue history, hence this approach is not suitable for dialogue management data augmentation.

Training and Evaluation
ConvGraph enables the training and evaluation of non-deterministic agents. When multiple actions are equally valid in a particular dialogue state, conventional ''pointwise'' machine learning training and evaluation adversely affect, and even penalize, otherwise correct predictions. Multi-reference training (sometimes called ''soft loss'') has been used for this reason, for example, to improve the application of Maximal Discrepancy to Support Vector Machines (Anguita et al., 2011) as well as to boost performance for noisy image tag alignments (Liu et al., 2012). Multi-reference evaluation has been the standard for language generation tasks in NLP, as it accounts for the non-deterministic nature of language output. In some cases, both soft training and evaluation have been used, for example, for decision tree learning with uncertain clinical measurements (Nunes et al., 2020) to mitigate the impact of hard thresholds.

The Conversation Graph
This section describes how dialogues are unified into a graph representation useful for data augmentation, model training, and evaluation.

Key Concepts
In a task-oriented dialogue system, the user interacts with an agent through natural language in order to achieve a specific goal. Every user or agent utterance corresponds to a dialogue act (da) (see Figure 1), that is, an abstract representation typically consisting of an intent and a set of slots with corresponding values. The intent denotes the abstract communication goal of the user or agent and the slot (s) - value (v) pairs encode the entities provided in an utterance. For example, inform{destination = London} denotes that the communication goal is to inform the listener that the value of the destination slot is ''London''. The values of all slots in a given turn constitute the belief state (bs) such that $bs = [(s_1 = v_1), (s_2 = v_2), \ldots, (s_n = v_n)]$. After each dialogue turn, the bs is updated as the dialogue proceeds towards the goal. Note that we may observe multiple intents at each turn, depending on the design of the dialogue system. We finally define the dialogue state (ds), which is a concatenation of bs and da at each turn.
Figure 1: Two dialogues with different utterances (in plain text) but identical da sequences of (multi) intents and slots, entities (in plain text) and slot values. These dialogues follow identical edges/nodes in ConvGraph.
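The definitions above can be illustrated with a minimal sketch. It assumes that ds is a binary indicator vector over filled slots (bs) followed by the labels of the dialogue act (da), with slot values abstracted away. The vocabularies and the names `BELIEF_SLOTS`, `ACT_LABELS`, `encode`, and `dialogue_state` are hypothetical and chosen only for illustration.

```python
# Hypothetical vocabularies; a real dataset defines its own intents and slots.
BELIEF_SLOTS = ("destination", "date")
ACT_LABELS = ("inform", "request", "destination", "date")


def encode(active, vocabulary):
    """Binary indicator vector: 1 where a vocabulary item is active."""
    return tuple(int(item in active) for item in vocabulary)


def dialogue_state(filled_slots, act_labels):
    """ds = bs concatenated with da (assumption: slot values are abstracted)."""
    return encode(filled_slots, BELIEF_SLOTS) + encode(act_labels, ACT_LABELS)


# Two utterances that differ only in the slot value, e.g.
# inform{destination = London} and inform{destination = Paris},
# map to the same dialogue state under this encoding.
ds_london = dialogue_state({"destination"}, {"inform", "destination"})
ds_paris = dialogue_state({"destination"}, {"inform", "destination"})
```

Because the states are plain tuples, they are hashable and can be compared for equality, which is what the unification of matching nodes requires.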

Construction
We treat each dialogue D as a sequence of n encoded turns such that $D = [ds_0, ds_1, ds_2, \ldots, ds_n]$ where $ds_0$ is the start state and $ds_n$ is the end state. ConvGraph is defined as a directed graph ConvGraph = (N, E) where N is a set of nodes, each corresponding to a dialogue state $ds_i$, and E is the set of edges (transitions) between any two nodes. An edge corresponds to a user or an agent dialogue act. Its frequency is also recorded, as observed in the data. Algorithm 1 shows how multiple dialogues (DS) are converted into a conversation graph such that nodes that are identical are unified. As a result, dialogue sequences intersect on common nodes (see Figure 2). During this unification, we infer which actions are valid, given the same ds. We additionally append an artificial final state to each dialogue to explicitly mark the end of the conversation for datasets where a ''task complete'' indicator is not present.
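The construction described above can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration, assuming each dialogue state is a hashable value (here a short string) and that edges are stored as successor counts per node. The names `END_STATE`, `build_conv_graph`, and `valid_next_states` are hypothetical.

```python
from collections import Counter

END_STATE = "<end>"  # artificial final state appended to every dialogue


def build_conv_graph(dialogues):
    """Unify identical states across dialogues and count edge frequencies.

    Returns a dict mapping each node to a Counter of its successor nodes.
    """
    graph = {}
    for dialogue in dialogues:
        path = list(dialogue) + [END_STATE]
        for src, dst in zip(path, path[1:]):
            graph.setdefault(src, Counter())[dst] += 1
    return graph


def valid_next_states(graph, state):
    """All successors observed for a state anywhere in the data."""
    return set(graph.get(state, ()))


# Toy dialogues (hypothetical): "u:" marks user states, "a:" agent states.
dialogues = [
    ["u:greet", "a:ask_date", "u:give_date", "a:book"],
    ["u:greet", "a:ask_time", "u:give_time", "a:book"],
]
graph = build_conv_graph(dialogues)
```

In this toy graph both dialogues intersect on `u:greet` and `a:book`, so `valid_next_states(graph, "u:greet")` collects both agent responses that were observed in that state.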

Data Augmentation
The aim of DM in a modular dialogue system is to learn a policy to predict an appropriate agent action (dialogue act da) at turn t, conditioned on the history of previous dialogue states PDS. We define a policy $\pi_\theta(da_t \mid PDS)$ where $PDS = [ds_{t-1}, ds_{t-2}, \ldots, ds_{t-n}]$ and n is the history length. Machine-learned policies can benefit from additional training instances in order to better generalize and reduce overfitting. This leads us to ConvGraph's first application, the inference of additional training signal that can be used to train a DM policy.
Augmentation by Most Frequent Sampling is our main method. In order to generate training instances from ConvGraph, we visit all nodes and extract $(da_t \mid PDS)$ pairs for the policy. Because an exhaustive traversal of ConvGraph is unrealistic due to its size and connectivity on the datasets we considered, we need a strategy to select the most useful pairs. In preliminary experiments, we performed uniform sampling among the outgoing edges at each agent node. This was not promising, as it ignored the likelihood of agent actions. Instead, for each agent node in ConvGraph, we exclusively choose the most frequent outgoing edge, as observed in the original data. This process pairs frequent actions with a new history (context), thus creating new training examples. It also results in a reduction of actions at each agent node, decreasing conflicting training signals for the policy. This approach for inferring training instances, which can be combined with the original data, resulted in the most effective experimental setup (Section 5). We refer to this method as MFS henceforth.
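A minimal sketch of the MFS idea follows, under simplifying assumptions: the graph is a dict of successor counts, an "agent node" is taken to be any state after which the agent acts, and histories are enumerated by walking predecessor links backwards. The function name `most_frequent_sampling`, the `agent_acts_after` predicate, and the toy graph are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def most_frequent_sampling(graph, agent_acts_after, history_len=2):
    """Pair the most frequent agent action at each node with every history
    of up to `history_len` states that leads to that node."""
    predecessors = defaultdict(set)
    for src, successors in graph.items():
        for dst in successors:
            predecessors[dst].add(src)

    def histories(node, length):
        # Backward paths ending at `node`; recursion depth is bounded by length.
        if length == 1 or not predecessors[node]:
            return [(node,)]
        return [h + (node,)
                for prev in sorted(predecessors[node])
                for h in histories(prev, length - 1)]

    instances = set()
    for node, successors in graph.items():
        if not agent_acts_after(node):
            continue
        target, _count = successors.most_common(1)[0]  # most frequent edge only
        for history in histories(node, history_len):
            instances.add((history, target))
    return instances


# Toy graph (hypothetical counts): "u:" are user states, "a:" agent states.
graph = {
    "u:greet": Counter({"a:ask_date": 3, "a:confirm_date": 1}),
    "a:ask_date": Counter({"u:give_date": 3}),
    "a:confirm_date": Counter({"u:give_date": 1}),
    "u:give_date": Counter({"a:book": 3, "a:ask_time": 1}),
}
instances = most_frequent_sampling(graph, lambda s: s.startswith("u:"))
```

In the toy data, `a:confirm_date` was only ever followed (two turns later) by `a:ask_time`. MFS pairs that history with the more frequent `a:book` instead, which yields a training instance that does not occur in the original dialogues, while the infrequent `a:ask_time` edge is dropped.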
Oracle Augmentation is featured in our experiments to explore the performance impact of oracle-guided augmentation. An oracle in this context represents an information source that can be queried to obtain additional (but incomplete) information about the development and test set. More specifically, while generating novel instances with ConvGraph, the oracle can tell us which novel training examples occur in the development and/or test set. The oracle confirms whether a new training example will be informative to the DM policy and likely lead to higher development/test scores. Adding these instances to the original training data creates another challenging baseline policy. Please note that this baseline is designed for theoretical comparisons only as obtaining this type of strong information source is not possible in real settings.

Additional Augmentation Approaches
To the best of our knowledge, there are few comparable data augmentation methods for DM. The following are the most relevant baselines that were included in our experiments.
Downsampling removes duplicate training examples so that only unique instances are included in training. It therefore reduces the size of (and balances) the training set. Downsampling is related to SMOTE, the Synthetic Minority Oversampling Technique (Chawla et al., 2002), particularly its later variants (Han et al., 2005), which aim to balance the training data by oversampling rare cases. This has been shown to improve classifier performance (Maciejewski and Stefanowski, 2011; Ramentol et al., 2012).
Data Duplication is an adaptation of dialogue concatenation and/or recombination, available in some dialogue systems although this has not been rigorously evaluated (Bocklisch et al., 2017). Our implementation takes care to combine the state sequences in a way that does not introduce inconsistent state transitions, for example, starting a new conversation without resetting the dialogue state, which would result in skipping some required steps.

Multi-Reference Evaluation
Before we describe the full evaluation procedure, we briefly introduce the Evaluation Graph (denoted ''eval'' in Table 1). This is a regular conversation graph constructed from all data splits using Algorithm 1. The EvalGraph serves as a reference tool used to look up the list of valid actions for each agent node.
The EvalGraph allows us to pool all observed agent actions into a single graph and use it to score policy predictions. We evaluate the predicted dialogue act ŷ against all of the valid target dialogue acts Y and report the greatest score (see Equation 1). For example, if Y = [request(time, date), request(time), request(date)] and ŷ = request(time), then the maximum score is awarded. In our experiments, this modification of the F-score (Pedregosa et al., 2011) is referred to as SoftF1. Its running time is approximately an order of magnitude slower than that of the conventional single-reference F-score.
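The scoring rule above can be sketched in a few lines. This is an illustration rather than the exact Equation 1: it assumes dialogue acts are represented as sets of labels and that the per-prediction score is a set-based F1. The function names `f1` and `soft_f1` and the label strings are hypothetical.

```python
def f1(predicted, target):
    """F1 between two label sets (0.0 when they share no label)."""
    predicted, target = set(predicted), set(target)
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)


def soft_f1(predicted, valid_targets):
    """Score against every valid target and keep the greatest score."""
    return max(f1(predicted, target) for target in valid_targets)


# The example from the text: three valid targets, prediction request(time).
valid_targets = [
    {"request_time", "request_date"},
    {"request_time"},
    {"request_date"},
]
prediction = {"request_time"}

hard_score = f1(prediction, valid_targets[0])      # single fixed reference
soft_score = soft_f1(prediction, valid_targets)    # best of all references
```

Scored against the single reference request(time, date), the prediction is penalized for missing a label even though request(time) is itself a valid action. The soft variant awards the maximum score because one of the references matches exactly.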

Multi-Reference Training
Most conversational datasets, including the ones featured in the evaluation (see Section 4.1), contain user interactions with non-deterministic agents. Given more than one valid response in a given state, conventional single-reference ''hard'' training penalizes the model for making a valid prediction. Propagating such a loss is likely to lead to a deteriorating dialogue management policy.
We therefore modify the Binary Cross-Entropy (BCE) loss from Equation 2 to propose the Soft Binary Cross-Entropy (SBCE) seen in Equation 3. SBCE uses ConvGraph to compute losses for all valid actions Y in a given dialogue state ds. SBCE then propagates the lowest loss for ŷ. Training time with SBCE increases by an order of magnitude because multiple references are considered.
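As a rough illustration of the idea, the sketch below computes a standard mean binary cross-entropy for each valid target vector and keeps the minimum. It is written in plain Python for clarity and is not the paper's Equation 3. The names `bce` and `soft_bce`, the epsilon clamp, and the example probabilities are assumptions made for this sketch.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an implementation detail


def bce(probabilities, target):
    """Mean binary cross-entropy between predicted probabilities and a
    binary multi-label target vector."""
    total = 0.0
    for p, t in zip(probabilities, target):
        total += t * math.log(p + EPS) + (1 - t) * math.log(1 - p + EPS)
    return -total / len(target)


def soft_bce(probabilities, valid_targets):
    """Compute the loss against every valid target and keep the lowest."""
    return min(bce(probabilities, target) for target in valid_targets)


# Hypothetical two-label example: the policy predicts the first label only,
# and both [1, 1] and [1, 0] were observed as valid in this dialogue state.
probabilities = [0.9, 0.1]
valid_targets = [[1, 1], [1, 0]]

hard_loss = bce(probabilities, valid_targets[0])
soft_loss = soft_bce(probabilities, valid_targets)
```

With a single fixed target of [1, 1], the prediction incurs a large loss on the second label. The soft variant matches it to the equally valid target [1, 0], so no gradient would push the policy away from a valid action. The cost is one loss computation per valid reference, which is consistent with the increase in training time noted above.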

Experimental Setup
Next, we describe the platform, datasets, and metrics used throughout our experiments.

Datasets
There are two main approaches to dataset construction for dialogue systems. Machines talking to machines (M2M), also known as Dialogue Self-Play, is a data generation framework that makes use of rule-based user and agent simulators that interact to generate sequences of dialogue acts. Crowd workers then lexicalize these sequences to produce corresponding natural utterances. This approach has two main limitations: (i) both simulators must be handcoded and (ii) there is no guarantee that the simulators generate realistic conversations. Other examples of such datasets include AirDialogue (Wei et al., 2018) and Schema Guided Dialogue (Rastogi et al., 2019). The second approach is collecting dialogues through a Wizard of Oz (WOZ; Dahlbäck et al., 1993; Strauß et al., 2006) setting, which has been used to gather the DSTC2 (Henderson et al., 2014), WOZ 2.0 (Wen et al., 2017), Frames (El Asri et al., 2017), Microsoft E2E Challenge, and MultiWOZ (Budzianowski et al., 2018) datasets. The collection process involves two humans conversing, one acting as the agent and the other as the user. In standard WOZ, the user is led to believe that the agent is artificial rather than a human. This helps ensure that gathered dialogues reflect how users interact with machine-driven agents.
For our experiments, we use three datasets with original splits, both human and machine generated to ensure the applicability of our methods to a variety of conversational scenarios. These are the movie and restaurant partitions of M2M and the extended version of MultiWOZ 2.0 (Lee et al., 2019) with user intent annotations. Note that the original dataset has since been corrected and released as MultiWOZ 2.1 (Eric et al., 2020). The descriptive statistics for each dataset can be found in Table 1. ConvGraph requires user/system intents and other dialogue annotations, thus several aforementioned datasets were not compatible.

Test Set Deduplication
All test sets contain some degree of duplicate instances. Table 1 shows the highest duplication (# unique / # instances) for M2M restaurant (∼61% unique), followed by M2M movie (∼75% unique), and MultiWOZ (∼96% unique). For a more complete evaluation, we present results on two test sets for each dataset. Besides the original data, we also evaluate on a deduplicated test set. This ensures that changes in model performance are not disproportionately influenced by duplicated instances.
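The deduplication step and the uniqueness ratio (# unique / # instances) can be sketched as below, assuming each test instance is a hashable (history, target) pair. The function names and the toy instances are hypothetical.

```python
def deduplicate(instances):
    """Keep the first occurrence of each instance, preserving order."""
    return list(dict.fromkeys(instances))


def unique_ratio(instances):
    """Fraction of unique instances, i.e. # unique / # instances."""
    return len(set(instances)) / len(instances)


# Toy test set (hypothetical): four instances, one of them repeated.
test_set = [
    (("u:greet",), "a:ask_date"),
    (("u:give_date",), "a:book"),
    (("u:greet",), "a:ask_date"),
    (("u:give_time",), "a:book"),
]
deduplicated = deduplicate(test_set)
ratio = unique_ratio(test_set)
```

Evaluating on both `test_set` and `deduplicated` separates gains that come from frequently repeated instances from gains on the distinct ones.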

End-to-end User Simulation
ConvLab (Lee et al., 2019) is an end-to-end dialogue system platform built to support the MultiWOZ dataset (Budzianowski et al., 2018), a collection of human-to-human conversations spanning multiple domains and one of the largest annotated task-oriented corpora for dialogue. ConvLab allows an agent to interact with the user simulator via dialogue acts, and supports reinforcement learning, supervised learning, and rule-based agents. It can be used as an evaluation platform to test modular, task-oriented conversational agents in an end-to-end fashion and has been used in this capacity at the Eighth Dialog System Technology Challenge (Kim et al., 2019). We use ConvLab for extrinsic evaluation of DM policies. Note that the platform does not support the entirety of actions occurring in MultiWOZ. We adapt our policies' output action space accordingly, so that we are compatible with ConvLab and able to duly perform the extrinsic evaluation. 1

Metrics
We evaluate policies intrinsically and extrinsically, the latter through the ConvLab user simulator and limited to MultiWOZ since the dialogue system framework for M2M is proprietary and unavailable. For extrinsic evaluation, the main automatic metric is success rate (Kim et al., 2019), averaged over 1000 conversations or episodes. A dialogue is considered successful if all informable slots (what the agent needs to complete the task) and requestable slots (what the user wants to know) were correctly filled. We also report the average number of dialogue turns. For intrinsic evaluation, we use SoftF1 scores (Section 3.5) alongside the conventional F-scores (Pedregosa et al., 2011). We refer to the latter as HardF1 as it is computed strictly against a single target y ∈ Y, even if another prediction ŷ was valid under a ''soft'' evaluation because ŷ ∈ Y. In other words, when the agent is able to predict multiple different valid actions Y, the SoftF1 score reports the lowest error for any y ∈ Y while HardF1 reports the error for exactly one y ∈ Y. Using only HardF1 can lead to unfair penalties for otherwise correct predictions. HardF1 and SoftF1 are computed using ''samples'' averaging, that is, each prediction is scored separately and the scores are then averaged into an overall score. This is the most realistic scenario for the DM task as it evaluates the quality of each interaction without pooling or averaging predictions over multiple turns. ''Micro'' and ''macro'' averages can underestimate or exaggerate changes in predictions, particularly for infrequent classes.
1 Please note that we are not excluding any portion of the MultiWOZ data. When multiple actions are taken in the same turn, ConvLab handles them as a single concatenated action, e.g., inform(departure) and inform(destination) are treated as inform(departure+destination). Due to this paradigm, every action combination needs to be treated distinctly by ConvLab. As the number of combinations is large, the user simulator is restricted to the 300 most frequent combined actions. Any policy's output needs to comply with the restricted action space.

Policy Implementation
As the approaches we examine are orthogonal to the policy implementation, in order to minimize the influence of hyperparameter/architecture choice, we have fixed the model for training and evaluation across all experiments. We have used an LSTM (Hochreiter and Schmidhuber, 1997) model with default PyTorch hyperparameters to train all DM policies. We learn a policy $\pi_\theta(da_t \mid PDS)$ where t is the current agent turn, $PDS = [ds_{t-1}, ds_{t-2}, \ldots, ds_{t-n}]$ are the previous dialogue states, and n is the history length. In our experiments, we set n to 3, 4, and 5. For brevity, the Results (Section 5) feature scores with history set to 4 as the differences are negligible. We proceed to encode PDS with the LSTM with a hidden layer size of 256. We explored several hidden sizes between 64 (underfitting) and 512 (overfitting) but the relative rank and differences between experimental setups were not affected. The LSTM output is passed through a ReLU (Nair and Hinton, 2010) activation followed by a linear layer with a sigmoid activation and size equal to the output size. The output size is the number of distinct labels in the observed dialogue acts. The input size is equal to |ds|, that is, output size + the encoded belief state size. The model parameter counts are as follows: M2M movie (299K) with input size of 31 and output size of 12, M2M restaurant (314K) with input size of 45 and output size of 16, and MultiWOZ (707K) with input size of 355 and
The output space of dialogue management models can be framed as either multi-class classification or multi-label classification. In some dialogue systems (Bocklisch et al., 2017), actions are predicted and executed one at a time, which lends itself to multi-class classification with a Cross-Entropy loss as the probability of the target label is maximized while all other label probabilities are minimized. However, in most conversational research datasets (Section 4.1), several target labels are jointly predicted.
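The parameter counts quoted above for the two M2M models can be reproduced with simple arithmetic, assuming a single-layer, unidirectional LSTM parameterized as in PyTorch (four gates, each with input weights, recurrent weights, and two bias vectors) followed by one linear output layer. The layer count and direction are assumptions, and the helper names are hypothetical.

```python
def lstm_parameters(input_size, hidden_size):
    """Parameters of one unidirectional LSTM layer with two bias vectors
    per gate (the PyTorch parameterization)."""
    per_gate = (input_size * hidden_size      # input-to-hidden weights
                + hidden_size * hidden_size   # hidden-to-hidden weights
                + 2 * hidden_size)            # two bias vectors
    return 4 * per_gate


def linear_parameters(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features


def policy_parameters(input_size, output_size, hidden_size=256):
    """LSTM encoder followed by a linear output layer."""
    return (lstm_parameters(input_size, hidden_size)
            + linear_parameters(hidden_size, output_size))


m2m_movie = policy_parameters(input_size=31, output_size=12)       # 299,020
m2m_restaurant = policy_parameters(input_size=45, output_size=16)  # 314,384
```

Both totals round to the reported 299K and 314K, which suggests that the stated architecture is consistent with a single LSTM layer of hidden size 256.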
We consider multi-label classification with the BCE (Equation 2) or SBCE (Equation 3) loss function to be more suitable for this type of DM task. As the number of dialogue act labels increases, the output vector grows linearly under multi-label classification but exponentially under multi-class classification, where every combination of labels needs its own class. Multi-class classification therefore does not scale beyond the simplest dialogue systems. Multi-label classification, in contrast, allows for a highly expressive agent using a small target vector while also being more sample-efficient.
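The difference in growth can be made concrete with a small calculation. The sketch assumes the worst case in which any subset of labels may occur jointly, so the multi-class count is an upper bound. Real datasets contain far fewer combinations. The function names are hypothetical.

```python
def multilabel_output_size(num_labels):
    """One sigmoid output per label."""
    return num_labels


def multiclass_output_size(num_labels):
    """Upper bound: one class per possible subset of labels."""
    return 2 ** num_labels


# Output sizes reported for the two M2M datasets.
for name, num_labels in [("M2M movie", 12), ("M2M restaurant", 16)]:
    print(name,
          multilabel_output_size(num_labels),
          multiclass_output_size(num_labels))
```

For 12 labels the multi-label vector has 12 entries while the multi-class upper bound is 4,096 classes. For 16 labels it is 16 entries against 65,536 classes.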
According to extrinsic evaluation in ConvLab, this configuration leads to a 73.4% success rate and 10.11 turns as the average conversation length (Table 4). Therefore, our (multi-label) baseline achieves stronger results than the baseline used in the Eighth Dialog System Technology Challenge (Li et al., 2020), which reached a 61% success rate and 11.67 turns on average. Note that our data-augmentation approach and loss function are orthogonal to the choice of the DM model.

Statistical Significance
Ten models were trained for each experiment to determine the mean and variance under various random seeds. We then perform an Analysis of Variance followed by a two-tailed t-test (samples with unequal variance). In Tables 2 and 3, significant differences are noted with an asterisk (*).
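The test statistic for the unequal-variance (Welch) t-test can be sketched with the standard library alone. The scores below are made-up placeholders, not results from the paper, and the function name `welch_t` is hypothetical. Turning the statistic into a p-value requires the Student t distribution, for which a statistics package would normally be used.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal variances."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard error of A
    var_b = variance(sample_b) / len(sample_b)  # squared standard error of B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Placeholder scores for two experimental setups (one value per random seed).
baseline_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
augmented_scores = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(baseline_scores, augmented_scores)
```

A two-tailed test then compares |t| against the critical value of the t distribution with `dof` degrees of freedom at the chosen significance level.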

Results and Analysis
Tables 2 and 3 show intrinsic evaluation results for DM policies with a history of 4, trained with BCE and SBCE loss, respectively. The H-F1 columns denote HardF1 scores, and the S-F1 columns denote SoftF1 scores. Table 4 presents the results of extrinsic evaluation in ConvLab.

Graph (Dataset) Properties
Conversational datasets have very distinct properties that will help us interpret the observed results. An intrinsic view of the data shown in Table 1 can be used to infer the approximate performance of the DM policies, even before any training.
The MultiWOZ graph has the most complex dialogue state due to its multi-domain nature. Its EvalGraph, initiated from all MultiWOZ partitions, has ∼100K edges, which is 40 times more than M2M restaurant and almost 70 times more than M2M movie. However, the number of shared edges between the train graph (n = 82K) and the development graph (n = 12.7K) is only 2.7K. The train and test (n = 12.7K) graphs also share 2.7K edges. Once featurized into training instances, the overlaps are even smaller, approximately 800 out of 7.3K for both test and dev sets. This enables us to predict that the available data is almost certainly insufficient to learn a supervised DM policy with F-scores approaching 1.0, regardless of the model architecture. The dialogue state and target vector are too complex for the amount of data provided (71.5K instances of which ∼90% is unique). Scenarios significantly different from ones observed in training effectively demand a zero-shot transfer to unseen test instances. However, since dialogue acts consist of multiple labels, the model is able to predict many of them correctly, which is why in extrinsic evaluation, the best policy successfully handles almost 80% of dialogues.
The M2M Restaurant graph is quite different from MultiWOZ beyond just the size difference. It is important to look at graph connectivity such as the average outgoing edges (MND in Table 1) and the amount of repetition (percentage of edges visited more than once). This dataset has the highest density (2.69 MND) and repetition (87.2%). It also has the most shared edges between train (n = 1,767), development (n = 935), and test (n = 1,541) graphs. The train graph shares 629 and 951 edges with the development and test graphs respectively, a high percentage and the opposite of MultiWOZ. Perhaps unsurprisingly, the policy performance for both M2M restaurant and M2M movie is much higher than MultiWOZ.
The M2M Movie graph is approximately half the size of M2M restaurant in terms of nodes and edges. This means a lower dialogue state variety but it comes with a lower number of total dialogues (n = 768 versus n = 2,240). Repetition (79.6% versus 87.2%) is also lower while the shared edges between train (n = 978), validation (n = 498) and test (n = 842) graphs are similar to M2M restaurant (65% with dev and 59% test). It is perhaps unsurprising that these datasets were generated with the same probabilistic automata, given their similarities. The repetition (or lack thereof) in dialogues strongly contributes to the differences across experimental results.

SoftF1 and HardF1 Score
Using two evaluation metrics helps us understand experimental results from different angles. The most important contribution of the SoftF1 score is that it measures the error of the best valid response in each agent state, making evaluation fairer by not penalizing otherwise correct predictions. While we recommend striving to improve both scores, we observe that an improved SoftF1 score is more likely to lead to successful conversations as it promotes the choice of actions that lead to fewer policy errors in training. In extrinsic evaluation, this translates into a higher probability of the agent navigating a conversation from start to end; hence, higher SoftF1 scores with samples averaging are preferred for DM.

BCE and SBCE Loss
Conventional BCE loss encourages the policy to learn all available actions, aiming to maximize the HardF1 score on the test set. Training with SBCE may lead to less diverse agent responses; however, both intrinsic and extrinsic scores show consistent improvements over BCE of around 5 SoftF1 points for M2M datasets and 10 points for MultiWOZ. The declining HardF1 scores do not correlate with lower success rates in end-to-end evaluation (see Table 4). The strongest effect of SBCE is that most differences between experiments observed in Table 2 were neutralised. HardF1 scores decrease to roughly the same levels while SoftF1 scores increase to their highest levels (with a few statistically significant results). We think this is because the SBCE loss may lead the policy to converge to approximately the same actions. We also observe that even without augmentation, the DM policy can be significantly improved with SBCE alone.

Downsampling
Downsampling reduces the original data size by filtering out duplicate instances. This has a similar effect as oversampling infrequent examples, that is, reducing biases towards some training instances. Downsampling ignores the likelihood of agent actions as observed in the original data, an effect also seen with uniform sampling in Section 3.3. For M2M restaurant, which has the highest repetition and therefore the most biased paths through the graph, this leads to a substantial decline in F1-scores. Specifically: (i) the ''rating'' slot score dropped by 70%, (ii) the ''time'' and ''date'' slots dropped by 41%, and (iii) the ''confirm'' intent decreased by 26%. Similar trends were observed in MultiWOZ scores but to a lesser extent. The M2M movie dataset contains relatively little repetition hence we observed no significant changes. Downsampling is more suitable for dialogues with more evenly distributed agent actions.

Data Duplication
Training data duplication did not produce any significant changes as compared to the baseline. Over all dialogue histories, the SoftF1 and HardF1 scores fluctuate around the original training data scores without any consistent patterns. This augmentation only seems effective in ultra-low data regimes (Bocklisch et al., 2017), where one possesses at most a few dozen training dialogues.

Most Frequent Sampling
MFS generates novel training instances so that the most frequent agent actions are preceded by new histories, that is, one or more original paths leading to common actions. Due to this, the infrequently visited edges effectively get pruned from the graph, leading to a ∼20% reduction in size compared to baseline data for MultiWOZ, ∼50% for M2M (M), and ∼60% reduction for M2M (R). Repetition is also removed so that each training example occurs exactly once. As a consequence, the overlap between MFS train and dev/test sets is also reduced by 40%-60% compared to baseline. In spite of having substantially fewer paths through the graph, this is the most effective intervention from a data standpoint. We should note that when combined with SBCE, MFS is no longer considering the most probable action exclusively, as the training will defer to an equally valid action if the calculated loss is lower. Due to this, MFS achieves the highest SoftF1 with BCE loss without sacrificing HardF1. In cases where MFS alone is less effective, it can be combined with the baseline training data to achieve best performance. MFS achieves the best HardF1 when combined with SBCE (except M2M restaurant).
In extrinsic evaluation, after 1000 simulated conversations with a user, MFS combined with the baseline train data improves the success rate from 73.4% to 79.8% with BCE loss. Success rate also improves with SBCE loss to 76% with and without adding original data. The length of the dialogue is also consistently reduced as the agent satisfies the user's goals faster.
We observe no distinct error patterns for the movie task except for the lack of usage of the ''offer'' intent, even with MFS data. This may be due to the lower frequency of the ''offer'' intent in the dataset relative to other dialogue acts. Instead, we observe consistent, single-digit improvements (∼4 F1 points) for almost all actions and slots.
For M2M restaurant, there are two main patterns: (i) The most problematic errors discussed in the downsampling section were reversed: the "rating", "time", "date", and "confirm" targets now show a good improvement rather than a decline. Even the previously unused "meal" slot went from 0 to 35 points. (ii) The remaining actions and slots show a single-digit improvement similar to the movie task. The M2M restaurant task is particularly sensitive to the removal of repeated instances, hence it benefits from additional training on frequent agent actions. Errors in MultiWOZ were also reduced, although two domains (police and hospital) have not been learned despite the additional MFS data. This is likely owing to their very low frequency in the training data. For all other domains, we observe that an estimated one third of the dialogue acts that were unused with baseline training data advance by around 20 F1 points, on average. More frequent dialogue acts have also improved by an estimated 10 F1 points. Despite the advantages of augmentation, many dialogue acts are still predicted with low accuracy (or not at all), which explains where the remaining ∼20% success score in extrinsic evaluation and ∼26 F1 points in intrinsic evaluation could be recovered. Example dialogues are provided in the Appendix.
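The remark above that SBCE training defers to an equally valid action when its loss is lower can be illustrated with a minimal sketch. It assumes that the soft loss is the minimum of the per-target BCE over all targets that are valid in a dialogue state; the names `bce` and `soft_bce` and the toy vectors are illustrative assumptions, not the exact implementation behind the reported results.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 target."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def soft_bce(pred, valid_targets):
    """Assumed soft loss: BCE against whichever valid target is closest to the prediction."""
    return min(bce(pred, target) for target in valid_targets)


# Toy example: three output units, two targets that are both valid in the same state.
pred = [0.9, 0.1, 0.2]
reference = [0, 1, 0]    # the single target seen in the original dialogue
alternative = [1, 0, 0]  # an equally valid action reachable from the same node

hard_loss = bce(pred, reference)
soft_loss = soft_bce(pred, [reference, alternative])
print(f"hard={hard_loss:.3f} soft={soft_loss:.3f}")
```

In this toy case the prediction agrees with the alternative target, so the soft loss is much smaller than the loss against the single reference, and the policy is not penalized for a valid choice.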

Oracle Augmentation
Oracle generates novel instances at ∼56% of the original train data size for M2M restaurant, ∼65% for M2M movie, but only ∼3% for MultiWOZ due to the small percentage of shared edges. As expected, we observed improvements over the baseline, although with some caveats. Oracle augmentation illustrates the need for both soft and hard evaluation. Using the conventional approach, BCE training with HardF1 evaluation, we would correctly conclude that Oracle is (mostly) effective for the M2M datasets and only marginally so for MultiWOZ. While SoftF1 scores also improve over the baseline, they are 1-2 points lower than the best MFS scores. For MultiWOZ, this difference is even greater (5-6 points lower) and would be indiscernible with only a HardF1 score available to guide experimentation. Oracle augmentation is an example where the policy training risks overfitting to maximize HardF1 at the expense of SoftF1, which may explain the 2%-3% decline in success rate in end-to-end evaluation. When Oracle augmentation is effective (movie task), the error pattern is similar to the movie MFS experiment. In other words, more accurate predictions were observed for all dialogue acts but with a lower magnitude.
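The difference between hard and soft evaluation discussed above can be sketched as follows. The sketch assumes a per-turn F1 over sets of dialogue-act labels, with the hard score computed against the single reference and the soft score taken as the best match over all valid targets for that state. The function names and the example labels are illustrative assumptions rather than the exact scoring procedure.

```python
def f1(predicted, target):
    """F1 between two sets of dialogue-act labels."""
    predicted, target = set(predicted), set(target)
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)


def hard_f1(predicted, reference):
    """Score against the single reference found in the test dialogue."""
    return f1(predicted, reference)


def soft_f1(predicted, valid_targets):
    """Assumed soft score: best F1 over all targets valid in the same state."""
    return max(f1(predicted, target) for target in valid_targets)


predicted = {"Hotel-Recom(Name)"}
reference = {"Hotel-Request(Price)"}
valid_targets = [reference, {"Hotel-Recom(Name)", "Hotel-Inform(Choice)"}]

hard = hard_f1(predicted, reference)
soft = soft_f1(predicted, valid_targets)
print(f"HardF1={hard:.2f} SoftF1={soft:.2f}")
```

Here the hard score treats the prediction as entirely wrong, whereas the soft score credits the partial match with another valid target, which is the kind of discrepancy that a HardF1-only evaluation cannot reveal.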

Future Work

Transformers
Our data augmentation method is orthogonal to the choice of the (sequence) model. We have used an LSTM for all experiments. We have also briefly tested a multilayer perceptron where the input consisted of concatenated time steps, yielding results comparable to the LSTM model. Other architectures such as Transformers (Vaswani et al., 2017), which have recently achieved SOTA performance on language modeling and transfer learning, can also be used. However, due to the symbolic nature of the dialogue management input, we may not see an advantage from using pretrained Transformers that compute representations of natural language. Also, as we previously mentioned (see Section 4.5), we have performed experiments with LSTMs using larger hidden state sizes but they did not lead to any improvement. We therefore do not expect any significant improvement in performance from switching the architecture to Transformers.

Semi-Random Data Augmentation
A purely random graph traversal augmentation should be avoided as the dialogue flows in the train, development, and test sets (including the user simulator) are not random. Some paths are more likely than others and some nodes/edges are more frequently visited than others. A more promising approach to show policy improvement may be semi-random sampling from the train ConvGraph, using the validation set performance as a reward signal. Similar to a hyperparameter search (or even reinforcement learning), one can repeatedly sample training instances under different sampling hyperparameters until a stopping criterion is met. Though more computationally intensive and more challenging to reproduce, this type of data augmentation may deliver novel insights into the generation process.
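One possible shape of such a search is sketched below. It is a sketch under stated assumptions, not a tested augmentation method: the graph is a plain dictionary of edge counts, the only sampling hyperparameter is a hypothetical `temperature` that flattens or sharpens the edge frequencies, the stopping criterion is simply the exhaustion of a fixed list of candidates, and `distinct_fraction` is a placeholder for training a policy on the sampled paths and scoring it on the validation set.

```python
import random


def sample_path(graph, start, temperature, rng, max_len=10):
    """Walk the graph, picking each edge with probability proportional to count ** (1 / temperature)."""
    path, node = [start], start
    while graph.get(node) and len(path) < max_len:
        successors = list(graph[node])
        weights = [graph[node][s] ** (1.0 / temperature) for s in successors]
        node = rng.choices(successors, weights=weights, k=1)[0]
        path.append(node)
    return path


def search(graph, start, temperatures, validation_score, samples=50, seed=0):
    """Try each temperature and keep the sample set with the best validation score."""
    rng = random.Random(seed)
    best_score, best_temperature, best_paths = float("-inf"), None, []
    for temperature in temperatures:
        paths = [sample_path(graph, start, temperature, rng) for _ in range(samples)]
        score = validation_score(paths)
        if score > best_score:
            best_score, best_temperature, best_paths = score, temperature, paths
    return best_score, best_temperature, best_paths


def distinct_fraction(paths):
    """Placeholder reward; a real run would train a policy and evaluate it on the dev set."""
    return len({tuple(p) for p in paths}) / len(paths)


# Toy graph: node -> {successor: edge visit count}.
toy_graph = {
    "greet": {"request_price": 8, "inform_choice": 2},
    "request_price": {"offer": 5},
    "inform_choice": {"offer": 1},
}
score, temperature, paths = search(toy_graph, "greet", [0.5, 1.0, 2.0], distinct_fraction)
```

A temperature below 1 concentrates sampling on frequent edges, while a temperature above 1 moves it towards uniform random traversal, so the search interpolates between frequency-driven and random sampling.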

Data Generation with ConvGraphs
ConvGraph is expected to be initialized from existing dialogues in order to augment the training data. However, we can also collect new data with ConvGraph by checking the uniqueness of incoming dialogue turns, possibly in real time. New nodes and edges will make the graph denser and allow for maximally diverse data augmentation, avoiding needless repetition and accelerating data collection. ConvGraph can be efficiently expanded as the environment or user behaviors change over time in order to extend an artificial agent with additional capabilities or to bootstrap agent policies interactively (Williams and Liden, 2017).
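A minimal sketch of such a uniqueness check is given below. It assumes that each dialogue state can be represented as a hashable value (plain strings here) and that an incoming dialogue is worth keeping when it contributes at least one unseen node or edge. The class name `GraphCollector` and its method are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter


class GraphCollector:
    """Tracks which nodes and edges have already been observed."""

    def __init__(self):
        self.nodes = set()
        self.edges = Counter()  # (source, destination) -> visit count

    def add_dialogue(self, states):
        """Add one dialogue (a sequence of states); return the nodes and edges that were new."""
        new_nodes, new_edges = [], []
        for state in states:
            if state not in self.nodes:
                self.nodes.add(state)
                new_nodes.append(state)
        for edge in zip(states, states[1:]):
            if edge not in self.edges:
                new_edges.append(edge)
            self.edges[edge] += 1
        return new_nodes, new_edges


collector = GraphCollector()
first = collector.add_dialogue(["greet", "request_price", "offer"])
repeat = collector.add_dialogue(["greet", "request_price", "offer"])
novel = collector.add_dialogue(["greet", "inform_choice", "offer"])
```

A repeated dialogue returns two empty lists and only increases the edge counts, whereas a dialogue with a new turn reports exactly the nodes and edges that make the graph denser.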

RNN ConvGraph
We also propose the RNN-ConvGraph, a theoretical alternative to ConvGraph. This is a generative model that takes as input a sequence of previous dialogue states $PDS = [ds_{t-1}, ds_{t-2}, \ldots, ds_{t-n}]$, where $n$ is the maximum history length, and predicts the next dialogue state $ds_t$ over a "vocabulary" of all graph nodes. Reminiscent of a generative language model, RNN-ConvGraph augments training data by using the conditional probabilities learned from the train data. Instead of an explicit graph data structure, the RNN-ConvGraph would be an implicit representation of the graph.
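Since RNN-ConvGraph is only a theoretical proposal, the following is merely a sketch of the interface such a model might expose: a history of previous dialogue states, most recent first, is mapped to a probability distribution over the node vocabulary. It uses a small Elman-style recurrent network in plain Python with random, untrained weights; the class name, the layer sizes, and the toy vocabulary are assumptions, and training on the conditional probabilities of the train data is omitted.

```python
import math
import random


class TinyRNN:
    """Untrained Elman-style RNN over a vocabulary of graph nodes (illustration only)."""

    def __init__(self, vocab, hidden_size=8, max_history=3, seed=0):
        rng = random.Random(seed)
        self.vocab = list(vocab)
        self.index = {node: i for i, node in enumerate(self.vocab)}
        self.hidden_size = hidden_size
        self.max_history = max_history

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(len(self.vocab), hidden_size)   # one input row per node
        self.w_hh = init(hidden_size, hidden_size)       # recurrent weights
        self.w_out = init(hidden_size, len(self.vocab))  # projection to the node vocabulary

    def next_state_distribution(self, pds):
        """pds lists previous states most recent first: [ds_{t-1}, ds_{t-2}, ...]."""
        size = self.hidden_size
        h = [0.0] * size
        # Keep the n most recent states and feed them oldest first.
        for node in reversed(pds[: self.max_history]):
            x = self.w_in[self.index[node]]
            h = [math.tanh(x[j] + sum(h[k] * self.w_hh[k][j] for k in range(size)))
                 for j in range(size)]
        logits = [sum(h[k] * self.w_out[k][i] for k in range(size))
                  for i in range(len(self.vocab))]
        top = max(logits)
        exps = [math.exp(value - top) for value in logits]  # numerically stable softmax
        total = sum(exps)
        return {node: e / total for node, e in zip(self.vocab, exps)}


vocab = ["greet", "request_price", "inform_choice", "offer"]
model = TinyRNN(vocab)
dist = model.next_state_distribution(["request_price", "greet"])
```

Sampling the next node from `dist` and appending it to the history would then generate a new path without storing an explicit graph.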

Conclusions
We have introduced the Conversation Graph for Dialogue Management, an approach that unifies conversations based on matching nodes (dialogue states). The structure of ConvGraph can be exploited effectively for (1) data augmentation with our Most Frequent Sampling method, (2) training non-deterministic policies with SBCE, our soft loss function, and (3) a more complete and fair evaluation of non-deterministic agents with our SoftF1 score. We conducted a thorough analysis of ConvGraph on three conversational datasets and showed that they can have markedly different properties. Extrinsic evaluation with a user simulator as well as intrinsic evaluation supports that ConvGraph can successfully augment datasets by generating novel paths through the graph. The soft training loss SBCE lets the agent choose which actions to learn in each dialogue state, leading to consistent policy improvements. Finally, the soft evaluation has extended the conventional "hard" evaluation, which was insufficient for non-deterministic agents and led to unfair penalties for correct predictions. We hope that our methodology as well as our suggestions for future work will inspire further research in this topic area.

A Dialogue Examples: MultiWOZ
The following tables show examples of typical errors committed on the MultiWOZ dataset. For clarity and brevity, we do not provide the dialogues in full, only the turns relevant to the models' decisions. We set the dialogue history length to $n = 3$, thus $PDS = [ds_{t-1}, ds_{t-2}, ds_{t-3}]$ (see Section 3.3). At turn $t$, we show output for models BASE and B+MFS (see Section 5.6). For more detailed analysis of each model's strengths and failings, we refer the reader back to Section 5.
In Table 5, at turn t − 3, the user tells the agent that they are interested in free hotel parking Hotel-Inform(Parking). The agent replies that there are many available hotels the user can choose from Hotel-Inform(Choice). The agent additionally asks the user to provide a desirable price range to further filter down the choices Hotel-Request(Price). At the next turn t − 1, the user provides a specific price range, asks about the hotels' area and requests that the restaurant they are booking in parallel should be in a specific price range. We remind the reader that in MultiWOZ, the same dialogue may span multiple domains. In the current turn t, both models respond that no hotel could be found with the additional criteria Booking-Inform(None). Inappropriately, however, the BASE model also provides a hotel's name Hotel-Inform(Name) even though that hotel does not meet the criteria.
In Table 6, we observe a similar history that leads both agents to declare that no hotels could be found under the criteria. However, the B+MFS model goes a step further and recommends a hotel in a different area Hotel-Recom(Name, Area). In extrinsic evaluation, such an action will likely lead to a shorter dialogue (see #turns in Table 4). Proactively offering appropriate hotel choices should also lead to a higher dialogue success rate.
In Table 7, both models request additional information from the user Train-Request(Leave). However, the B+MFS agent also informs the user that there are many choices available and offers to book the train. The inclusion of a "call to action" increases user engagement and the probability of a successful train booking.
The final example for MultiWOZ in Table 8 shows that after collecting all required information, the BASE agent merely informs the user that it found a train (specified by its ID) that [...]
[...] restaurant booking. The BASE model asks for the time again, which is clearly problematic as it seems to ignore the user's input. The B+MFS agent correctly handles the query and asks the user to confirm that choice thus avoiding a penalty.
In Table 10, the user lets the agent know about the preferred type of restaurant Inform(Category) they would like to book. When this is confirmed by the user, the BASE model is prematurely asking the user to select from a list of available restaurants before all necessary slots have been filled. The B+MFS agent correctly requests that the user specifies what type of meal they are interested in first.
Table 11 shows the user asking for a specific category of restaurant Inform(Category). The BASE model only provides a partial output, correctly predicting that it should request information from the user but not which information it requires. This means the user would have to repeat the query in the better case or lose a dialogue in the worst case.
Table 12 shows the user asking for a restaurant in a specific price range Inform(Price Range). Once the agent confirms that this is what the user is looking for, the BASE agent proceeds to offer a single restaurant. Although this is not wrong, the B+MFS model offers three choices using the Select(Restaurant Name) dialogue act. A preference for this action may lead to higher user satisfaction and ultimately to more successful dialogues.