Task-Oriented Dialogue as Dataflow Synthesis

Abstract

We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.


Introduction
Two central design decisions in modern conversational AI systems are the choices of state and action representations, which determine the scope of possible user requests and agent behaviors. Dialogue systems with fixed symbolic state representations (like slot-filling systems) are easy to train but hard to extend (Pieraccini et al., 1992). At the other extreme, unstructured state representations are expressive enough to represent arbitrary properties of the dialogue history, but so unconstrained that training a neural dialogue policy "end-to-end" fails to learn appropriate latent states (Bordes et al., 2016). This paper introduces a new framework for dialogue modeling that aims to combine the strengths of both approaches: structured enough to enable efficient learning, yet flexible enough to support open-ended, compositional user goals that involve multiple tasks and domains. The framework has two components: a new state representation in which dialogue states are represented as dataflow graphs; and a new agent architecture in which dialogue agents predict compositional programs that extend these graphs. Over the course of a dialogue, a growing dataflow graph serves as a record of common ground: an executable description of the entities that were mentioned and the actions and computations that produced them (Figure 1).

Figure 1: A dialogue and its dataflow graph. Turn (1) is an ordinary case of semantic parsing: the agent predicts a compositional query that encodes the user's question. Evaluating this program produces an initial graph fragment. In turn (2), that is used to refer to a salient Event; the agent resolves it to the event retrieved in (1), then uses it in a subsequent computation. Turn (3) repairs an exception via a program that makes a modified copy of a graph fragment.
While this paper mostly focuses on representational questions, learning is a central motivation for our approach. Learning to interpret natural-language requests is simpler when they are understood to specify graph-building operations. Human speakers avoid repeating themselves in conversation by using anaphora, ellipsis, and bridging to build on shared context (Mitkov, 2014). Our framework treats these constructions by translating them into explicit metacomputation operators for reference and revision, which directly retrieve fragments of the dataflow graph that represents the shared dialogue state. This approach borrows from corresponding ideas in the literature on program transformation (Visser, 2001) and results in compact, predictable programs whose structure closely mirrors user utterances.
Experiments show that our rich dialogue state representation makes it possible to build better dialogue agents for challenging tasks. First, we release a newly collected dataset of around 40K natural dialogues in English about calendars, locations, people, and weather: the largest goal-oriented dialogue dataset to date. Each dialogue turn is annotated with a program implementing the user request. Many turns involve more challenging predictions than traditional slot-filling, with compositional actions, cross-domain interaction, complex anaphora, and exception handling (Figure 2). On this dataset, explicit reference mechanisms reduce the error rate of a seq2seq-with-copying model (See et al., 2017) by 6.2% on all turns and by 7.1% on turns with a cross-turn reference. To demonstrate breadth of applicability, we additionally describe how to automatically convert the simpler MultiWOZ dataset into a dataflow representation. This representation again enables a basic seq2seq model to outperform a state-of-the-art, task-specific model at traditional state tracking. Our results show that within the dataflow framework, a broad range of agent behaviors are both representable and learnable, and that explicit abstractions for reference and revision are the keys to effective modeling.

Overview: Dialogue and Dataflow
This section provides a high-level overview of our dialogue modeling framework, introducing the main components of the approach. Sections 3-5 refine this picture, describing the implementation and use of specific metacomputation operators.
We model a dialogue between a (human) user and an (automated) agent as an interactive programming task where the human and computer communicate using natural language. Dialogue state is represented with a dataflow graph. At each turn, the agent's goal is to translate the most recent user utterance into a program. Predicted programs nondestructively extend the dataflow graph, construct any newly requested values or real-world side-effects, and finally describe the results to the user. Our approach is significantly different from a conventional dialogue system pipeline, which has separate modules for language understanding, dialogue state tracking, and dialogue policy execution (Young et al., 2013). Instead, a single learned model directly predicts executable agent actions and logs them in a graphical dialogue state.
Programs, graphs, and evaluation The simplest example of interactive program synthesis is question answering:

User: When is the next retreat?
start(findEvent(EventSpec(name='retreat', start=after(now()))))

Here the agent predicts a program that invokes an API call (findEvent) on a structured input (EventSpec) to produce the desired query.[1] This is a form of semantic parsing (Zelle, 1995).
The program predicted above can be rendered as a dataflow graph (diagram: nodes 'retreat', now, after, and EventSpec feed findEvent, whose returned start node is drawn with a solid border). Each function call in the program corresponds to a node labeled with that function. This node's parents correspond to the arguments of the function call. The top-level call that returns the program's result is depicted with a solid border. A dataflow graph is always acyclic, but is not necessarily a tree, as nodes may be reused.
Once nodes are added to a dataflow graph, they are evaluated in topological order. Evaluating a node applies its function to its parents' values. (In the diagram, two nodes are annotated to show that the value of after(now()) is a DateTimeSpec and the value of the returned start node is a specific DateTime.) Evaluated nodes are shaded in our diagrams. Exceptions (see §5) block evaluation, leaving downstream nodes unevaluated.
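As a concrete (if simplified) illustration of this evaluation scheme, the sketch below evaluates nodes in topological order by applying each node's function to its parents' values. The Node class and the stand-in functions are our own illustration, not the paper's actual library.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    """A dataflow-graph node: a function applied to parent nodes."""
    fn: object                     # a callable
    parents: list = field(default_factory=list)
    value: object = None
    evaluated: bool = False

def evaluate(nodes):
    """Evaluate nodes in topological order, applying each node's
    function to its parents' values."""
    for n in nodes:                # assumes `nodes` is topologically sorted
        if not n.evaluated:
            n.value = n.fn(*(p.value for p in n.parents))
            n.evaluated = True
    return nodes[-1].value

# An illustrative stand-in for the after(now()) fragment in the text.
now = Node(lambda: 2020)
after = Node(lambda t: ('after', t), [now])
assert evaluate([now, after]) == ('after', 2020)
```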
The above diagram saves space by summarizing the (structured) value of a node as a string. In reality, each evaluated node has a dashed result edge that points to the result of evaluating it. That result is itself a node in the dataflow graph, often a new node added by evaluation. It may have its own result edge.[2] A node's value is found by transitively following result edges until we arrive at a node whose result is itself. Such a terminal node is either a primitive value (e.g. 2020), or a constructor (e.g. DateTime) whose parent nodes' values specify its arguments. A constructor has the same (capitalized) name as the type it constructs.

[1] … not a literal representation of the utterance's meaning, but a query that enables a contextually appropriate response (what Austin (1962) called the "perlocutionary force" of the utterance on its hearer). The fact that next in this context triggered a search for "events after now" was learned from annotations. See §6 for a discussion of how these annotations are standardized in the SMCalFlow dataset.

[2] In other words, a function does not have to return a terminal node. Its result may be an existing node, as we will see in §3. Or it may be a new non-terminal node, i.e., the root of a subgraph that implements the function. Generating this result subgraph is reminiscent of macro expansion. The new nodes in the subgraph are then evaluated further.
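The result-edge lookup described above can be sketched as follows; the N class and value helper are hypothetical stand-ins for the paper's machinery, where a terminal node is modeled as a node whose result edge points to itself.

```python
class N:
    """A node with a result edge; terminal nodes point at themselves."""
    def __init__(self, result=None):
        self.result = result if result is not None else self

def value(node):
    # Transitively follow result edges until a node whose result is itself.
    while node.result is not node:
        node = node.result
    return node

terminal = N()            # e.g. the primitive value 2020
mid = N(result=terminal)  # an evaluated node pointing at its result
top = N(result=mid)       # its own result edge, in turn
assert value(top) is terminal
```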
Reference and revision We now sketch two metacomputation functions whose evaluation extends the graph in complex ways.
As a representation of dialogue state, the dataflow graph records entities that have been previously mentioned and the relationships among these entities. All nodes in the dataflow graph are eligible to be referenced by subsequent utterances. Suppose, for example, that the user continues the previous dialogue fragment with a follow-up question:

User: What day of the week is that?
The user's word that becomes the refer call in our predicted program, as it is a reference to "some salient previously mentioned node." Evaluating refer here chooses the top-level node, start, from the previous turn. That node is then used as an argument to a dayOfWeek node (nodes existing from previous turns are shown in lighter ink in our diagrams), and evaluating the latter node applies dayOfWeek to start's value, yielding Monday. This diagram is actually a simplification: we will show in §3 how the refer call itself is also captured in the dataflow graph. The user may next ask a question that changes the upstream constraint on the event's start time:

User: What about in 2021?
(Footnote 2, continued.) The new nodes in the subgraph are also available for reference and revision. Of course, a library function such as findEvent or + that invokes an API will generally return its value directly as a terminal node. However, translating natural language to higher-level function calls, which have been defined to expand into lower-level library calls, is often more easily learnable and more maintainable than translating it directly to the expanded graph.
A "new" DateTimeSpec (representing in 2021) is to be substituted for some salient existing old node that has value type DateTimeSpec (in this case, the node after(now())). The revise operator non-destructively splices in this new sub-computation and returns a revised version of the most salient computation containing old (in this case, the subgraph for the previous utterance, rooted at dayOfWeek). As in the refer example, the target program (though not the resulting subgraph) corresponds closely to the user's new utterance, making it easy to predict. Like the utterance itself, the program does not specify the revised subgraph in full, but describes how to find and reuse relevant structure from the previous dataflow graph.
Given a dataset of turns expressed in terms of appropriate graph-manipulation programs, the learning problem for a dataflow agent is the same as for any other supervised contextual semantic parser. We want to learn a function that maps user utterances to particular programs-a well-studied task for which standard models exist. Details of the model used for our experiments in this paper are provided in §7.
Aside: response generation This paper focuses on language understanding: mapping from a user's natural language utterance to a formal response, in this case the value of the outlined node returned by a program. Dialogue systems must also perform language generation: mapping from this formal response to a natural-language response. The dataset released with this paper includes output from a learned generation model that can describe the value computed at a previous turn, describe the structure of the computation that produced the value, and reference other nodes in the dataflow graph via referring expressions. Support for structured, computation-conditional generation models is another advantage of dataflow-based dialogue state representations. While a complete description of dataflow-based language generation is beyond the scope of this paper, we briefly describe the components of the generation system relevant to the understanding system presented here.
The generation model is invoked after the evaluation phase. It conditions on a view of the graph rooted at the most recent return node, so generated responses can mention both the previously returned value and the computation that produced it. As the generation model produces the natural language response, it extends the dataflow graph. For example, if after the user query "What's the date of the next retreat?" the agent responds: Agent: It starts on April 27 at 9 am, and runs for 8 hours.
then it will also extend the dataflow graph to reflect that the event's duration was mentioned. The duration of the event is now part of the common ground in the conversation and available for future reference by either the agent or the user. The generation model is also important for agent initiative:

User: Put an event on my calendar.
Agent: What should it be called?
As discussed in detail in §5, questions of this kind can be generated in response to exceptions generated by underspecified user requests. In the accompanying dataset release, the agent's utterances are annotated with their dataflow graphs as extended by the generation model.

Reference resolution
In a dialogue, entities that have been introduced once may be referred to again. In dataflow dialogues, the entities available for reference are given by the nodes in the dataflow graph. Entities are salient to conversation participants to different degrees, and their relative salience determines the ways in which they may be referenced (Lappin and Leass, 1994). For example, it generally refers to the most salient non-human entity, while more specific expressions like the Friday meeting are needed to refer to accessible but less salient entities. Not all references to entities are overt: if the agent says "You have a meeting tomorrow" and the user responds "What time?", the agent must predict the implicit reference to a salient event.
Dataflow pointers We have seen that refer is used to find referents for referring expressions. These referents may be existing dataflow nodes or new subgraphs for newly mentioned entities. We now give more detail about both possibilities. Imagine a dialogue in which the dataflow graph contains a fragment, rooted at a findDateTime call, that translates a mention of Easter or answers When is Easter?. Suppose the user subsequently mentions the day after that. We wish to produce a computation that adds one day to that date. In our framework, this is accomplished by mapping the day after that to +(refer(), Days(1)). The graph this produces is not quite the one that computes the sum directly, but it evaluates to the same value: the refer() call is reified as a node in the dataflow graph. Its result is the salient findDateTime node from the previous turn, whose own result, a specific DateTime, now serves as the value of refer. Evaluating + adds a day to this old DateTime value to get the result of +, a new DateTime.
To enable dataflow graph manipulation with referring expressions, all that is required is an implementation of refer that can produce appropriate pointers for both simple references (that) and complex ones (the first meeting).
Constraints A call to refer is essentially a query that retrieves a node from the dialogue history, using a salience model discussed below. refer takes an optional argument: a constraint on the returned node. Indeed, the proper translation of that in the context the day after that would be refer(Constraint[DateTime]()).

• Type constraints: A type constraint Constraint[T]() matches nodes whose values have type T; it may also constrain properties of the value, as in Constraint[Event](start=am(9)).
• Role constraints: A role constraint specifies a keyword and matches nodes that are used as keyword arguments with that keyword. For example, the month maps to refer(RoleConstraint(month)) and resolves to the constant node apr in the dialogues in §2, since that node was used as a named argument month=apr.

To interpret a natural language referring expression, the program prediction model only needs to translate it into a contextually appropriate constraint C. refer(C) is then evaluated using a separate salience retrieval model that returns an appropriate node. Calling refer() with no arguments is equivalent to calling it with Constraint[Any](), which matches all nodes. The following dialogue shows referring expressions in action:
Here the 9 am meeting refers to the one that is salient from the first response, not an arbitrary one.
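As a rough illustration of how such constraints might work, the sketch below models type and role constraints as predicates over nodes. Node, type_constraint, and role_constraint are our own illustrative names, not the dataset's annotation language.

```python
class Node:
    def __init__(self, value, role=None):
        self.value, self.role = value, role   # role = keyword it was passed as

def type_constraint(t):
    """Matches nodes whose value has type t."""
    return lambda n: isinstance(n.value, t)

def role_constraint(keyword):
    """Matches nodes used as keyword arguments with this keyword."""
    return lambda n: n.role == keyword

nodes = [Node("retreat", role="name"), Node(9, role="month")]
month = next(n for n in nodes if role_constraint("month")(n))
assert month.value == 9
assert type_constraint(str)(nodes[0])
```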
Salience retrieval model The salience retrieval model returns the most salient node satisfying the underlying constraint. Our dataflow framework is agnostic to the implementation of this model. A sophisticated model could select nodes via a machine-learned scoring function. In the experiments in this paper, however, we rank nodes using a hard-coded heuristic. The heuristic chooses the root node r of the previous user utterance, if it satisfies the constraint. More generally, the heuristic prefers nodes to the extent that they can be reached from r in a small number of steps, where a step may move from a node to one of its input nodes, from an evaluated node to its result node, or from the root of an utterance to the root of an adjacent (user or system) utterance. If no satisfying node is found in the past several utterances, the heuristic falls back to generating code (see footnote 2) that will search harder for a satisfying salient entity, for example by querying a database. For example, our earlier Constraint[Event](start=am(9)) may return the expression findEvent(EventSpec(start=am(9))) if no 9 am meeting has been mentioned recently, and Constraint[Person](name='Adam') may return findPerson(PersonSpec(name='Adam')) if no Adam has been mentioned. (See footnote 4.)
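The heuristic described above amounts to a breadth-first search outward from the previous utterance's root. The following sketch, with an invented Node class and neighbor function, shows the idea; the real system's step relation and fallback are richer.

```python
from collections import deque

def most_salient(root, constraint, neighbors, max_steps=10):
    """Return the constraint-satisfying node reachable from `root`
    in the fewest steps, or None if none is found."""
    seen, frontier = {id(root)}, deque([(root, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if constraint(node):
            return node          # first match = fewest steps away
        if dist < max_steps:
            for nxt in neighbors(node):
                if id(nxt) not in seen:
                    seen.add(id(nxt))
                    frontier.append((nxt, dist + 1))
    return None                  # caller falls back to, e.g., a database query

# Tiny example: walk up a parent chain to the nearest int-valued node.
class N:
    def __init__(self, value, parents=()):
        self.value, self.parents = value, parents

a = N(3); b = N("x", (a,)); root = N("y", (b,))
hit = most_salient(root, lambda n: isinstance(n.value, int),
                   lambda n: n.parents)
assert hit is a
```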

Revision
Beyond referring to previously mentioned entities (nodes), task-oriented dialogues frequently refer to previously executed computations (subgraphs). This is one of the major advantages of representing the dialogue state as a dataflow graph of computations, not just a set of potentially salient entities.

User: What time on Tuesday is my planning meeting?
start(findEvent(EventSpec(name='planning', start=DateTimeSpec(weekday=tuesday))))
Agent: You meet with Grace at noon.
User: Sorry, I meant all-hands.
Agent: Your all-hands meeting is at 2:30 pm.
The second user utterance asks for the computation from the first user utterance to be repeated, but with all-hands in place of planning. The expected result is still a time, even though the second utterance makes no mention of time.
In the dataflow framework, we invoke a revise operator to construct the revised computation. Again, the content of the program closely reflects that of the corresponding utterance. The revise operator takes three arguments:

• rootLoc, a constraint on the top-level node of the original computation;
• oldLoc, a constraint on the node to replace within the original computation;
• new, a new graph fragment to substitute there.

The revise node evaluates to the root of a modified copy of the original computation, in which new now fills the role at the "old" location.
Revision is non-destructive-no part of the dialogue history is lost, so entities computed by the original target and its ancestors remain available for later reference. However, the copy shares nodes with the original computation where possible, to avoid introducing unnecessary duplicate nodes that would have to be considered by refer.
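A minimal sketch of non-destructive revision over expression trees, assuming a simple (function, children) tuple encoding rather than the paper's actual graph representation: only the path from the root to the replaced node is copied, and every untouched subtree is shared with the original.

```python
# Expression trees are (fn, children) tuples; strings are leaves.
def revise(tree, old_pred, new):
    """Return a copy of `tree` with the subtree matching `old_pred`
    replaced by `new`, sharing all untouched subtrees."""
    if old_pred(tree):
        return new
    if isinstance(tree, str):              # leaf: nothing to revise
        return tree
    fn, args = tree
    new_args = tuple(revise(a, old_pred, new) for a in args)
    # Rebuild a node only on the path to the replacement.
    return tree if new_args == args else (fn, new_args)

g = ('start', (('findEvent', (('EventSpec', ('planning', 'tuesday')),)),))
g2 = revise(g, lambda t: t == 'planning', 'all-hands')
assert g2 == ('start', (('findEvent', (('EventSpec', ('all-hands', 'tuesday')),)),))
assert g == ('start', (('findEvent', (('EventSpec', ('planning', 'tuesday')),)),))
assert revise(g, lambda t: False, None) is g   # no match: graph fully shared
```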
For the example dialogue at the beginning of this section, the first turn produces the light gray nodes below. The second turn adds the darker gray nodes, which specify the desired revision. Evaluating the revise node then selects the salient locations that match the rootLoc and oldLoc constraints (indicated in the drawing by temporary dotted lines), and constructs the revised subgraph (the new start node and its ancestors). The result of evaluation (dashed arrow) is the root of the revised subgraph. Evaluating these new nodes as well establishes that the value of the top-level revise is the start time of the 'all-hands' meeting on Tuesday.
In the following example, the second utterance asks to replace a date specification. However, the utterance appears in a context where the relevant DateTimeSpec-valued node is an argument that has actually not yet been provided:

User: When is lunch?
User: What about tomorrow?
The revision replaces the missing start argument to the previous EventSpec (whose absence had resulted in some default behavior) with an explicit argument (the DateTimeSpec returned by tomorrow()). To achieve this, when the salience retrieval model is run with an oldLoc constraint, it must be able to return missing arguments that satisfy that constraint. Missing arguments are implicitly present, with special value missing of the appropriate type. In practice they are created on demand. Relatedly, a user utterance sometimes modifies a previously mentioned constraint such as an EventSpec (see footnote 4). To permit this and more, we allow a more flexible version of revise to (non-destructively) transform the subgraph at oldLoc by applying a function, rather than by substituting a given subgraph new. Such functions are similar to rewrite rules in a term rewriting system (Klop, 1990), with the oldLoc argument supplying the condition. Our dataset ( §6) specifically includes reviseConstraint calls, which modify a constraint as directed, while weakening it if necessary so that it remains satisfiable. For example, if a 3:00-3:30 meeting is onscreen and the user says make it 45 minutes or make it longer, then the agent can no longer preserve previous constraints start=3:00 and end=3:30; one must be dropped.
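A toy sketch of a reviseConstraint-style update that drops conflicting keys so the merged constraint stays satisfiable; revise_constraint and the conflict test are our own simplification, not the dataset's operator.

```python
def revise_constraint(old, update, conflicts_with):
    """Merge `update` into `old`, first weakening `old` by dropping
    any key that conflicts with an updated key."""
    merged = {k: v for k, v in old.items()
              if not any(conflicts_with(k, u) for u in update)}
    merged.update(update)
    return merged

# "Make it 45 minutes": keeping both start=3:00 and end=3:30 would be
# unsatisfiable, so the end constraint is dropped.
old = {"start": "3:00", "end": "3:30"}
update = {"duration": "45min"}
new = revise_constraint(old, update,
                        lambda k, u: u == "duration" and k == "end")
assert new == {"start": "3:00", "duration": "45min"}
```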
While the examples in this section involve a single update, real-world dialogues (§6) can involve single user requests built up over as many as five turns, with unrelated intervening discussion. Revisions of revisions, or of constraints on reference, are also seamlessly handled: revise takes another revise or a refer node as its target, leading to a longer chain of result edges (dashed lines) to follow. Coordinating interactions among this many long-range dependencies remains a challenge even for modern attentional architectures (Bordes et al., 2016). With revise, all the needed information is in one place; as experiments will show, this is crucial for good performance in more challenging dialogues.

Recovery
Sometimes users make requests that can be fulfilled only with the help of followup exchanges, if at all. Requests might be incomplete:

User:
Book a meeting for me.

Our solution is to treat such discourse failures as exceptions. In principle, they are no different from other real-world obstacles to fulfilling the user's request (server errors, declined credit cards, and other business logic). To be useful, a dialogue model must have some way to recover from all these exceptions, describing the problem to the user and guiding the dialogue past it. Our dialogue manager consists mainly of an exception recovery mechanism. This contrasts with traditional slot-filling systems, where a scripted policy determines which questions to ask the user and in which order. Scripted policies are straightforward but cannot handle novel compositional utterances. Contextual semantic parsers handle compositionality, but provide no dialogue management mechanism at all. Our dataflow-based approach allows the user to express complex compositional intents, but also allows the agent to reclaim the initiative when it is unable to make progress. Specifically, the agent can elicit interactive repairs of the problematic user plan: the user communicates such repairs through the reference and revision mechanisms described in preceding sections.

Exceptions in execution
In the dataflow graph framework, failure to interpret a user utterance is signaled by exceptions, which occur during evaluation. The simplest exceptions result from errors in function calls and constructors: for instance, an underspecified DateTimeSpec constructor generates an exception, descendants of that node remain unevaluated, and evaluation of the dataflow graph cannot be completed. An exception is essentially just a special result (possibly a structured value) returned by evaluation. It appears in the dataflow graph, so the agent can condition on it when predicting programs in future turns. When an exception occurs, the generation model (§2) is invoked on the exceptional node; this can be used to produce prompts that describe the problem to the user. The fact that exception recovery looks like any other turn-level prediction is another key advantage of dataflow-based state representations. In the above examples, the user specified a revision that would enable them to continue, but they also would have been free to try another utterance (List all my meetings in February) or to change goals altogether (Never mind, let's schedule a vacation).
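The exception-as-result behavior can be sketched as follows, with an illustrative Node class of our own: a raised exception is stored as the node's result, and any node with an exceptional parent is left unevaluated.

```python
class Node:
    def __init__(self, fn, parents=()):
        self.fn, self.parents = fn, parents
        self.result = None       # value, an Exception, or None (unevaluated)

def evaluate(order):
    """Evaluate nodes in topological order, recording exceptions as results."""
    for n in order:
        if any(isinstance(p.result, Exception) for p in n.parents):
            continue             # blocked: downstream node stays unevaluated
        try:
            n.result = n.fn(*(p.result for p in n.parents))
        except Exception as e:
            n.result = e         # the exception *is* the recorded result

def bad_spec():
    raise ValueError("underspecified: missing 'name'")

spec = Node(bad_spec)
find = Node(lambda s: ("findEvent", s), (spec,))
evaluate([spec, find])
assert isinstance(spec.result, ValueError)   # visible for later turns
assert find.result is None                   # left unevaluated
```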
Because of its flexibility, our exceptionhandling mechanism is suitable for many situations that have not traditionally been regarded as exceptions. For example, an interactive slot-filling workflow can be achieved via a sequence of underspecified constructors, each triggering an exception and eliciting a revision from the user:

User:
Create a meeting.
User: Planning meeting.
The agent predicted that the user intended to revise the missing name because an exception involving the name path appeared in the dialogue history on the previous turn. Recovery behaviors are enabled by the phase separation between constructing the dataflow graph (which is the job of program synthesis from natural language) and evaluating its nodes. The dataflow graph always contains a record of the user's current goal, even when the goal could not be successfully evaluated. This goal persists across turns and remains accessible to reference, and thus can be interactively refined and clarified using the same metacomputation operations as user-initiated revision. Exception handling influences the course of the dialogue, without requiring a traditional hand-written or learned "dialogue policy" that reasons about full dialogue states. Our policy only needs to generate language (recall §2) that reacts appropriately to any exception or exceptions in the evaluation of the most recent utterance's program, just as it reacts to the return value in the case where evaluation succeeds.

Figure 2: An example SMCalFlow dialogue. Turn 1 features a free-text subject and date/time:

User: Can you remind me to go to the airport tomorrow morning at 8am?
createCommitEventWrapper(createPreflightEventWrapper(EventBuilder(subject='go to the airport'), start=dateAtTime(date=tomorrow(), time=numberAM(8))))

Turn 2 features revise. Turn 3 features cross-domain interaction via refer and nested API calls (findPlace and weatherQueryApi are both real-world APIs). Turn 4 features an out-of-scope utterance that is parried by a category-appropriate "fencing" response. Turn 5 confirms a proposal after intervening turns.

Data
To validate our approach, we crowdsourced a large English dialogue dataset, SMCalFlow, featuring task-oriented conversations about calendar events, weather, places, and people. Figure 2 has an example. SMCalFlow has several key characteristics:

Richly annotated: Agent responses are executable programs, featuring API calls, function composition, and complex constraints built from strings, numbers, dates and times in a variety of formats. They are not key-value structures or database queries, but instead full descriptions of the runtime behavior needed to react to the user in a real, grounded dialogue system.
Open-ended: We did not constrain crowdworkers to scripts. Instead, they were given general information about agent capabilities and were encouraged to interact freely. A practical dialogue system must also recognize and respond to out-of-scope requests. Our dataset includes many such examples (see the fourth user turn in Figure 2).
To cover a rich set of back-end capabilities while encouraging worker creativity, we designed a wide range of scenarios to guide dialogue construction. There are over 100 scenarios of varying topic and granularity. Dialogues are collected via a Wizard-of-Oz process. Every dialogue is associated with a scenario. At each turn, a crowdworker acting as the user is presented with a dialogue as context and is asked to append a new utterance. An annotator acting as the agent labels the utterance with a program (which may include refer and revise) and then selects a natural-language response from a set of candidates produced by the language generation model described in §2. The annotation interface includes an autocomplete feature based on existing annotations. Annotators also populate databases of people and events to ensure that user requests have appropriate responses. The process is iterated for a set number of turns or until the annotator indicates the end of conversation. A single dialogue may include turns from multiple crowdworkers and annotators.
Annotators are provided with detailed guidelines containing example annotations and information about available library functions. Guidelines also specify conventions for pragmatic issues like the decision to annotate next as after at the beginning of §2. Crowdworkers are recruited from Amazon Mechanical Turk with qualification requirements such as living in the United States and with a work approval rate higher than 95%.
Data is split into training, development and test sets. We review every dialogue in the test set with two additional annotators. 75% of turns pass through this double review process with no changes, which serves as an approximate measure of inter-annotator consensus on full programs.
For comparison, we also produce a version of the popular MultiWOZ 2.1 dataset (Budzianowski et al., 2018; Eric et al., 2019) with dataflow-based annotations. MultiWOZ is a state tracking task, so in its original format the dataset annotates each turn with a dialogue state rather than an executable representation. To obtain an equivalent (program-based) representation for MultiWOZ, at each user turn we automatically convert the update to the MultiWOZ-annotated dialogue state to a dataflow program.[5] Specifically, we define a booking function for each domain, whose arguments correspond to MultiWOZ slots. The first mention of an intent is annotated as a call to this function. Updates are annotated as revise calls. For each new or modified argument, the annotation invokes refer with an appropriate type constraint when this would successfully find the correct value (typically a pronominal reference); else it constructs the correct value directly. Turns that don't update the dialogue state produce empty programs.
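A sketch of this conversion, assuming dialogue states are flat slot-value dicts; the emitted function names (book_hotel, revise) are illustrative stand-ins, not the released annotation format, and the refer case is omitted for brevity.

```python
def state_to_program(prev, curr, domain="hotel"):
    """Diff two consecutive dialogue states and emit a dataflow-style
    program string for the update."""
    delta = {k: v for k, v in curr.items() if prev.get(k) != v}
    if not delta:
        return ""                                   # no state update
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(delta.items()))
    if not prev:
        return f"book_{domain}({args})"             # first mention of the intent
    return f"revise(book_{domain}, {args})"         # later update

assert state_to_program({}, {"area": "north"}) == "book_hotel(area='north')"
assert state_to_program({"area": "north"},
                        {"area": "north", "stars": "4"}) \
       == "revise(book_hotel, stars='4')"
assert state_to_program({"area": "north"}, {"area": "north"}) == ""
```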
Data statistics are shown in Table 1. To the best of our knowledge, SMCalFlow is the largest annotated task-oriented dialogue dataset to date. Compared to MultiWOZ, it features a larger user vocabulary, a more complex space of state-manipulation primitives, and a long tail of agent programs built from numerous function calls and deep composition.

Experiments
We evaluate our approach on SMCalFlow and MultiWOZ 2.1. All experiments use the OpenNMT (Klein et al., 2017) pointer-generator network (See et al., 2017), a sequence-to-sequence model that can copy tokens from the source sequence while decoding. Our goal is to demonstrate that dataflow-based representations benefit standard neural model architectures. Dataflow-specific modeling might improve on this baseline, and we leave this as a challenge for future work.
For each user turn i, we linearize the target program into a sequence of tokens z_i. This must be predicted from the dialogue context, namely the concatenated source sequence x_{i-c} z_{i-c} ... x_{i-1} z_{i-1} x_i (for SMCalFlow) or x_{i-c} y_{i-c} ... x_{i-1} y_{i-1} x_i (for MultiWOZ 2.1). Here c is a context window size, x_j is the user utterance at user turn j, y_j is the agent's natural-language response, and z_j is the linearized agent program. Each sequence x_j, y_j, or z_j begins with a separator token that indicates the speaker (user or agent). Our formulation of context for MultiWOZ is standard (e.g. Wu et al., 2019).
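The source-sequence construction can be sketched as follows; the separator tokens and data layout here are illustrative assumptions, not the released preprocessing code.

```python
def build_source(turns, i, c):
    """Concatenate the last c turns' user tokens and program tokens,
    plus the current user tokens, with speaker separators.

    turns: list of dicts with 'user' and 'program' token lists."""
    seq = []
    for j in range(max(0, i - c), i):
        seq += ["<user>"] + turns[j]["user"]
        seq += ["<program>"] + turns[j]["program"]
    seq += ["<user>"] + turns[i]["user"]
    return seq

turns = [
    {"user": ["when", "is", "lunch", "?"],
     "program": ["findEvent", "(", "lunch", ")"]},
    {"user": ["what", "about", "tomorrow", "?"], "program": []},
]
src = build_source(turns, i=1, c=1)
assert src == ["<user>", "when", "is", "lunch", "?",
               "<program>", "findEvent", "(", "lunch", ")",
               "<user>", "what", "about", "tomorrow", "?"]
```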
The model is trained using the Adam optimizer (Kingma and Ba, 2015) with the maximum likelihood objective and a learning rate of 0.001. Training ends when the validation loss increases on two successive epochs. We use GloVe 840B-300d (cased) and GloVe 6B-300d (uncased) vectors (Pennington et al., 2014) to initialize the word embeddings in the model for the SMCalFlow and MultiWOZ experiments, respectively. The context window size c, hidden layer size d, number of hidden layers l, and dropout rate r are selected based on validation loss from {2, 4, 10}, {256, 300, 320, 384}, {1, 2, 3}, and {0.3, 0.5, 0.7}, respectively. Approximate 1-best decoding uses a beam of size 5.

Table 2 shows results for the SMCalFlow dataset. We report program accuracy: specifically, exact-match accuracy of the predicted program after inlining metacomputation (i.e. replacing all calls to metacomputation operators with the concrete program fragments they return). We also compare to baseline models trained on inlined metacomputation. These experiments make it possible to evaluate the importance of explicit dataflow manipulation relative to a standard contextual semantic parsing approach to the task: a no-metacomputation baseline can still reuse computations from previous turns via the model's copy mechanism.
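The evaluation-time inlining can be sketched as follows, modeling programs as nested tuples. Here the salience model is stubbed out as a lookup table mapping metacomputation calls to the fragments they return, which is an illustrative simplification rather than the actual implementation.

```python
# Sketch of evaluation-time inlining: calls to metacomputation operators
# (refer, revise) are replaced by the concrete fragments they return, and
# program accuracy is exact match on the inlined programs. The `resolved`
# lookup stands in for the salience model.

def inline(tree, resolved):
    if isinstance(tree, tuple):
        head, *args = tree
        if head in ("refer", "revise"):
            # substitute the returned fragment, then keep inlining inside it
            return inline(resolved[tree], resolved)
        return (head, *[inline(a, resolved) for a in args])
    return tree  # leaves (strings, values) are left unchanged

def program_accuracy(preds, golds, resolved):
    hits = sum(inline(p, resolved) == inline(g, resolved)
               for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this metric, a predicted program that uses refer and a gold program that spells out the referent directly are counted as equal whenever they inline to the same tree.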

Quantitative evaluation
Table 1: Dataset statistics. "Library Size" counts distinct function names (e.g. findEvent) plus keyword names (e.g. start=). "Length" and "Depth" columns show (.25, .50, .75) quantiles. For programs, "Length" is the number of function calls and "Depth" is determined from a tree-based program representation. "OOS" counts the out-of-scope utterances. MultiWOZ statistics were calculated after applying the data processing of Wu et al. (2019); vocabulary size is less than that reported by Wu et al. (2019).

Table 3: Results on MultiWOZ 2.1. TRADE results are from the public implementation (Wu et al., 2019). "Joint Goal" (Budzianowski et al., 2018) is average dialogue-state exact match, "Dialogue" is average dialogue-level exact match, and "Prefix" is the average number of turns before an incorrect prediction. "Dataflow," "inline refer," and "inline both" all have significantly higher dialogue accuracy than TRADE (p < 0.05, McNemar's test).

For the full representation, c, d, l, and r are 2, 384, 2, and 0.5, respectively. For the inline variant, they are 4, 320, 2, and 0.5. Turn-level exact-match accuracy is around 66% in both splits. Inlining metacomputation, which forces the model to explicitly resolve cross-turn computation, reduces accuracy by 6.2% overall, 7.1% on turns involving references, and 9.0% on turns involving revision. Dataflow-based metacomputation operations are thus essential for good model performance in
all three cases.

We further evaluate our approach on dialogue state tracking using MultiWOZ 2.1. Table 3 shows results. For the full representation, the selected model uses c = 2, d = 384, l = 1, and r = 0.7. For the inline refer variant, the values are 2, 384, 2, and 0.5. For the variant inlining both refer and revise calls, they are 10, 384, 2, and 0.3. Even without metacomputation, prediction of program-based representations gives results comparable to the existing state of the art, TRADE, on the standard "Joint Goal" metric (turn-level exact match). (Our dataflow representation for MultiWOZ is designed so that dataflow graph evaluation produces native MultiWOZ slot-value structures.) However, Joint Goal does not fully characterize the effectiveness of a state tracking system in real-world interactions, as it allows the model to recover from an error at an earlier turn by conditioning on gold agent utterances after the error. We thus also evaluate dialogue-level exact match and prefix length (the average number of turns until an error). On these metrics the benefit of dataflow over past approaches is clearer. Differences among dataflow model variants are smaller here than in Table 2. On the Joint Goal metric, the no-metacomputation baseline is better; we attribute this to the comparative simplicity of reference in the MultiWOZ dataset. In any case, casting the state-tracking problem as one of program prediction with appropriate primitives gives a state-of-the-art state-tracking model for MultiWOZ using only off-the-shelf sequence prediction tools.

Error analysis

Beyond the quantitative results shown in Tables 2-3, we manually analyzed 100 SMCalFlow turns where our model mispredicted. Table 4 breaks down the errors by type. Three categories involve straightforward parsing errors. In underprediction errors, the model fails to predict some computation (e.g. a search constraint or property extractor) specified in the user request.
This behavior is not specific to our system: under-length predictions are also well-documented in neural machine translation systems (Murray and Chiang, 2018). In entity linking errors, the model correctly identifies the presence of an entity mention in the input utterance, but uses it incorrectly in the predicted plan. Sometimes the entity that appears in the plan is hallucinated, appearing nowhere in the utterance; sometimes the entity is cast to the wrong type (e.g. locations interpreted as event names), used in the wrong field, or extracted with the wrong boundaries. In fencing errors, the model interprets an out-of-scope user utterance as an interpretable command, or vice versa (compare to Figure 2, turn 4).
The fourth category, ambiguity errors, is more interesting. In these cases, the predicted plan corresponds to an interpretation of the user utterance that would be acceptable in some discourse context. In a third of these cases, this interpretation is ruled out by either dialogue context (e.g. interpreting what's next? as a request for the next list item rather than the event with the next earliest start time) or commonsense knowledge (make it at 8 means 8 a.m. for a business meeting and 8 p.m. for a dance party). In the remaining cases, the predicted plan expresses an alternative computation that produces the same result, or an alternative interpretation that is also contextually appropriate.
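The three MultiWOZ metrics reported in Table 3 (Joint Goal, dialogue-level exact match, and prefix length) can all be computed from per-turn correctness flags. A minimal sketch, with hypothetical helper names:

```python
# Sketch of the Table 3 metrics, computed from per-dialogue lists of
# per-turn correctness flags (True = predicted state at that turn
# exactly matches gold). Helper names are illustrative.

def joint_goal(dialogues):
    turns = [ok for d in dialogues for ok in d]
    return sum(turns) / len(turns)      # turn-level exact match

def dialogue_exact(dialogues):
    # a dialogue counts only if every turn is correct
    return sum(all(d) for d in dialogues) / len(dialogues)

def mean_prefix(dialogues):
    def prefix(d):                      # turns before the first error
        for i, ok in enumerate(d):
            if not ok:
                return i
        return len(d)
    return sum(prefix(d) for d in dialogues) / len(dialogues)
```

Unlike Joint Goal, the dialogue-level metrics never credit a model for recovering after an error, which is why they separate the systems more sharply.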

Related work
The view of dialogue as an interactive process of shared plan synthesis dates back to Grosz and Sidner's earliest work on discourse structure (1986; 1988). That work represents the state of a dialogue as a predicate recognizing whether a desired piece of information has been communicated or a change in world state effected. Goals can be refined via questions and corrections from both users and agents. The only systems to attempt full versions of this shared-plans framework (e.g. Allen et al., 1996; Rich et al., 2001) required inputs that could be parsed under a predefined grammar. Subsequent research on dialogue understanding has largely focused on two simpler subtasks:

Contextual semantic parsing approaches focus on complex language understanding without reasoning about underspecified goals or agent initiative. Here the prototypical problem is iterated question answering (Hemphill et al., 1990; Yu et al., 2019b), in which the user asks a sequence of questions corresponding to database queries, and the results of query execution are presented as structured result sets. Vlachos and Clark (2014) describe a semantic parsing representation targeted at more general dialogue problems. Most existing methods interpret context-dependent user questions (What is the next flight to Atlanta? When does it land?) by learning to copy subtrees (Zettlemoyer and Collins, 2009; Iyyer et al., 2017; Suhr et al., 2018) or tokens (Zhang et al., 2019) from previously generated queries. In contrast, our approach reifies reuse with explicit graph operators.

A note on reproducibility: dependence on internal libraries prevents us from releasing a full salience-model implementation and inlining script for SMCalFlow. The accompanying data release includes both inlined and non-inlined versions of the full dataset, as well as inlined and non-inlined versions of our model's test-set predictions, enabling side-by-side comparisons and experiments with alternative representations. We provide full conversion scripts for MultiWOZ.
Slot-filling approaches (Pieraccini et al., 1992) model simpler utterances in the context of full, interactive dialogues. It is assumed that any user intent can be represented with a flat structure consisting of a categorical dialogue act and a mapping between a fixed set of slots and string-valued fillers. Existing fine-grained dialogue act schemes (Stolcke et al., 2000) can distinguish among a range of communicative intents not modeled by our approach, and slot-filling representations have historically been easier to predict (Zue et al., 1994) and annotate (Byrne et al., 2019). But while recent variants support interaction between related slots (Budzianowski et al., 2018) and fixed-depth hierarchies of slots (Gupta et al., 2018), modern slot-filling approaches remain limited in their support for semantic compositionality. By contrast, our approach supports user requests corresponding to general compositional programs.
More recent end-to-end dialogue agents attempt to map directly from conversation histories to API calls and agent utterances using neural sequence-to-sequence models without a representation of dialogue state (Bordes et al., 2016; Yu et al., 2019a). While promising, models in these papers fail to outperform rule- or template-driven baselines. Subsequent work reports greater success on a generation-focused task, and promising results have also been obtained from hybrid neuro-symbolic dialogue systems (Zhao and Eskenazi, 2016; Williams et al., 2017). Much of this work is focused on improving agent modeling for existing representation schemes like slot filling. We expect that many modeling innovations (e.g. the neural entity linking mechanism proposed by Williams et al.) could be used in conjunction with the new representational framework we have proposed in this paper.
Like slot-filling approaches, our framework is aimed at modeling full dialogues in which agents can ask questions, recover from errors, and take actions with side effects, all backed by an explicit state representation. However, our notions of "state" and "action" are much richer than in slot-filling systems, extending to arbitrary compositions of primitive operators. We use semantic parsing as a modeling framework for dialogue agents that can construct compositional states of this kind. While dataflow-based representations are widely used to model execution state for programming languages (Kam and Ullman, 1976), this is the first work we are aware of that uses them to model conversational context and dialogue.

Conclusions
We have presented a representational framework for task-oriented dialogue modeling based on dataflow graphs, in which dialogue agents predict a sequence of compositional updates to a graphical state representation. This approach makes it possible to represent and learn from complex, natural dialogues. Future work might focus on improving prediction by introducing learned implementations of refer and revise that, along with the program predictor itself, could evaluate their hypotheses for syntactic, semantic, and pragmatic plausibility. The representational framework could itself be extended, e.g. by supporting declarative user goals and preferences that persist across utterances. We hope that the rich representations presented here, as well as our new dataset, will facilitate greater use of context and compositionality in learned models for task-oriented dialogue.