Memory-Based Semantic Parsing

We present a memory-based model for context-dependent semantic parsing. Previous approaches focus on enabling the decoder to copy or modify the parse from the previous utterance, assuming there is a dependency between the current and previous parses. In this work, we propose to represent contextual information using an external memory. We learn a context memory controller that manages the memory by maintaining the cumulative meaning of sequential user utterances. We evaluate our approach on three semantic parsing benchmarks. Experimental results show that our model can better process context-dependent information and demonstrates improved performance without using task-specific decoders.


Introduction
Semantic parsing is the task of converting natural language utterances into machine-interpretable meaning representations such as executable queries or logical forms. It has emerged as an important component in many natural language interfaces (Özcan et al., 2020), with applications in robotics (Dukes, 2014), question answering (Zhong et al., 2018; Yu et al., 2018b), dialogue systems (Artzi and Zettlemoyer, 2011), and the Internet of Things (Campagna et al., 2017).
Neural network based approaches have led to significant improvements in semantic parsing (Zhong et al., 2018; Kamath and Das, 2019; Yu et al., 2018b; Yavuz et al., 2018; Yu et al., 2018a) across domains and semantic formalisms. The majority of existing studies focus on parsing utterances in isolation, and as a result they cannot readily transfer to more realistic settings where users ask multiple inter-related questions to satisfy an information need. In this work, we study context-dependent semantic parsing, focusing specifically on text-to-SQL generation, which has emerged as a popular application area in recent years. Figure 1 shows a sequence of utterances in an interaction. The discourse focuses on a specific topic serving a specific information need, namely finding out which Continental flights leave from Chicago on a given date and time. Importantly, interpreting each of these utterances and mapping it to a database query must be situated in a particular context as the exchange proceeds. The topic further evolves as the discourse transitions from one utterance to the next and constraints (e.g., TIME or PLACE) are added or revised. For example, in Q2 the TIME constraint before 10 am from Q1 is revised to before noon, and in Q3 to before 2 pm. Aside from such topic extensions (Chai and Jin, 2004), the interpretation of Q2 and Q3 depends on Q1: it is implied that the questions concern Continental flights that go from Chicago to Seattle, not just any Continental flights, even though the phrase from Chicago to Seattle is elided from Q2 and Q3. The interpretation of Q4 depends on Q3, which in turn depends on Q1. Interestingly, Q5 introduces information with no dependencies on previous discourse, and in this case relying on information from previous utterances would lead to incorrect SQL queries.
The problem of contextual language processing has been most widely studied within dialogue systems, where the primary goal is to incrementally fill pre-defined slot templates, which can then be used to generate appropriate natural language responses. But the rich semantics of SQL queries makes the task of contextual text-to-SQL parsing substantially different. Previous approaches (Suhr et al., 2018; Zhang et al., 2019) tackle this problem by enabling the decoder to copy or modify the previous queries under the assumption that they contain all necessary context for generating the current SQL query.

Q1: What Continental flights go from Chicago to Seattle before 10 am in morning 1993 February twenty sixth
SQL1: ( SELECT DISTINCT flight.flight_id FROM flight WHERE ( flight.airline_code = 'CO' AND ( flight.from_airport IN ( SELECT airport_service.airport_code FROM airport_service WHERE airport_service.city_code IN ( SELECT city.city_code FROM city WHERE city.city_name = 'CHICAGO' )) AND ( flight.to_airport IN ( SELECT airport_service.airport_code FROM airport_service WHERE airport_service.city_code IN ( SELECT city.city_code FROM city WHERE city.city_name = 'SEATTLE' )) AND ( flight.departure_time < 1000 ) ) ) ) ) ;
Q2: Continental flights before noon that have a meal
Q3: Continental flights before 2 pm
Q4: On 1993 February twenty seventh
Q5: All Continental flights leaving Chicago before 8 am on 1993 February twenty seventh
Figure 1: Example utterances from a user interaction in the ATIS dataset. Utterance segments referring to the same entity or objects are in the same color. SQL queries corresponding to Q2-Q5 follow a pattern similar to Q1 and are not shown for the sake of brevity.

The utterance history is encoded in a hierarchical manner and, although this is a good enough approximation for most queries (in existing datasets), it is not sufficient to model long-range discourse phenomena (Grosz and Sidner, 1986).
Our own work draws inspiration from Kintsch and van Dijk's (1978) text comprehension model. In their system, the process of comprehension involves three levels of operations. Firstly, smaller units of meaning, i.e., propositions, are extracted and organized into a coherent whole (microstructure); some of these are stored in a working memory buffer, which makes it possible to decide whether new input overlaps with already processed propositions. Secondly, the gist of the whole is condensed (macrostructure). And thirdly, the previous two operations generate new text from the contents of working memory. In other words, the (short- and long-term) memory of the reader gives meaning to the text read. They propose three macro-rules, viz., deletion, generalization, and construction, as essential to reduce and organize the detailed information of the microstructure of the text. Furthermore, previous knowledge and experience are central to the interpretation of text, enabling the reader to fill information gaps.
Our work borrows several key insights from Kintsch and van Dijk (1978) without being a direct implementation of their model. Specifically, we also break down input utterances into smaller units, namely phrases, and argue that this information can be effectively utilized in maintaining contextual information in an interaction. Furthermore, the notion of a memory buffer which can be used to store and process new and old information plays a prominent role in our approach. We propose a Memory-based ContExt model (which we call MemCE for short) for keeping track of contextual information, and learn a context memory controller that manages the memory. Each interaction (sequence of user utterances) maintains its context using a memory matrix. User utterances are segmented into a sequence of phrases representing either new information to be added into the memory (e.g., that have a meal in Figure 1) or old information which might conflict with current information in memory and needs to be updated (e.g., before 10 am should be replaced with before noon in Figure 1). Our model can inherently add new content to memory, read existing content by accessing the memory, and update old information.
We evaluate our approach on the ATIS (Suhr et al., 2018; Dahl et al., 1994), SParC (Yu et al., 2019b), and CoSQL (Yu et al., 2019a) datasets. We observe performance improvements when we combine MemCE with existing models, underscoring the importance of more specialized mechanisms for processing context information. In addition, our model brings interpretability to how the context is processed. We are able to inspect the learned memory controller and analyze whether important discourse phenomena such as coreference and ellipsis are modeled.

Related Work
Sequence-to-sequence neural networks (Bahdanau et al., 2015) have emerged as a general modeling framework for semantic parsing, achieving impressive results across different domains and semantic formalisms (Dong and Lapata, 2016; Jia and Liang, 2016; Iyer et al., 2017; Wang et al., 2020; Zhong et al., 2018; Yu et al., 2018b, inter alia). The majority of existing work has focused on mapping natural language utterances into machine-readable meaning representations in isolation, without utilizing context information. While this is useful for environments consisting of one-shot interactions of users with a system (e.g., running QA queries on a database), many settings require extended interactions between a user and an automated assistant (e.g., booking a flight). This makes the one-shot parsing model inadequate for many scenarios.
In this paper we are concerned with the less studied problem of contextualized semantic parsing, where previous utterances are taken into account in the interpretation of the current utterance. Earlier work (Miller et al., 1996; Zettlemoyer and Collins, 2009; Srivastava et al., 2017) focused on symbolic features for representing context, e.g., by explicitly modeling discourse referents or the flow of discourse. More recent neural methods extend the sequence-to-sequence architecture to incorporate contextual information either by modifying the encoder or the decoder. Context-aware encoders resort to concatenating the current utterance with the utterances preceding it (Suhr et al., 2018) or focus on the history of the utterances most relevant to the current decoder state. The decoders take context representations as additional input and often copy segments from the previous query (Suhr et al., 2018). Hybrid approaches (Iyyer et al., 2017; Guo et al., 2019) employ neural networks for representation learning but use a grammar for decoding (e.g., a sequence of actions or an intermediate representation).
A tremendous amount of work has taken place in the context of discourse modeling, focusing on extended texts (Mann and Thompson, 1988; Hobbs, 1985) and dialogue (Grosz and Sidner, 1986). Kintsch and van Dijk (1978) study the mental operations underlying the comprehension and summarization of text. They introduce propositions as the basic unit of text representation, and a model of how incoming text is processed given memory limitations; texts are reduced to important propositions (to be recalled later) using macro-operators (e.g., addition, deletion). Their model has met with popularity in cognitive psychology (Baddeley, 2007) and has also found application in summarization (Fang and Teufel, 2016).
Our work proposes a new encoder for contextualized semantic parsing. At the heart of our approach is a memory controller which keeps track of context by writing new information and updating old information. Our memory-based approach is inspired by Kintsch and van Dijk (1978) and is closest to Santoro et al. (2016), who use a memory-augmented neural network (Sukhbaatar et al., 2015) for meta-learning. Specifically, they introduce a method for accessing external memory which functions as short-term storage for meta-learning. Although we report experiments solely on semantic parsing, our encoder is fairly general and could be applied to other context-dependent tasks such as conversational information seeking (Dalton et al., 2020) and information retrieval (Sun and Chai, 2007; Voorhees, 2004).

Model
Our model is based on the encoder-decoder architecture with the addition of a memory component (Sukhbaatar et al., 2015; Santoro et al., 2016) for incorporating context. Let I denote an interaction consisting of turns I[i] = (X_i, Y_i), where X_i is the input utterance and Y_i the output SQL query. At each turn i, given X_i and all previous turns I[1 . . . i−1], our task is to predict the SQL output Y_i.
As shown in Figure 2, our model consists of four components: (1) a memory matrix which retains discourse information; (2) a memory controller which learns to access and manipulate the memory such that correct discourse information is retained; (3) utterance and phrase encoders; and (4) a decoder which interacts with the memory and utterance encoder using an attention mechanism to generate SQL output.

Input Encoder
Each input utterance X_i = (x_{i,1} . . . x_{i,|X_i|}) is encoded using a bi-directional LSTM (Hochreiter and Schmidhuber, 1997):

h^U_{i,1}, . . . , h^U_{i,|X_i|} = biLSTM_U(e_{i,1}, . . . , e_{i,|X_i|})

where e_{i,j} = φ(x_{i,j}) is a learned embedding corresponding to input token x_{i,j} and h^U_{i,j} is the concatenation of the forward and backward LSTM hidden representations at step j. As mentioned earlier, X_i is also segmented into a sequence of phrases denoted as X_i = (p^1_i . . . p^K_i), where K is the number of phrases for utterance X_i. We provide details on how utterances are segmented into phrases in Section 4. For now, suffice it to say that they are obtained from the output of a chunker with some minimal postprocessing (e.g., to merge postmodifiers with NPs or VPs). Each phrase p^k_i with start index s_k consists of tokens x_{i,j} such that j ∈ [s_k : s_k + |p^k_i|]. As shown in Figure 2, every phrase p^k_i in utterance i is separately encoded using biLSTM_P to obtain a phrase representation h^P_{i,k} by concatenating the final forward and backward hidden representations.

Figure 2: Overview of model architecture. Utterances are broken down into segments. Each segment is encoded with the same encoder (same weights) and is processed independently. The context update controller learns to manipulate the memory such that correct discourse information is retained.

Context Memory
Our context memory is a matrix M_i ∈ R^{L×d} with L memory slots, each of dimension d, where M_i denotes the state of the memory at the i-th interaction turn. The goal of context memory is to maintain relevant information required to parse the input utterance at each turn. As shown in Figure 2, this is achieved by learning a context update controller which is responsible for updating the memory at each turn.
For each phrase p^k_i belonging to a sequence of phrases within utterance X_i, the controller decides whether it contains old information which conflicts with information present in the memory, or new information which has to be added to the current context. When novel information is introduced, the controller should add it to an empty or least-used memory slot; otherwise, the conflicting memory slot should be updated with the latest information. Let t denote the memory update time step such that t ∈ [1, n], where n is the total number of phrases in interaction I. To simplify notation, we use h^P_t instead of h^P_{i,k} to represent the hidden representation of a phrase at time t.
Detecting Conflicts Given phrase representation h^P_t (see Equation (2)), we use a similarity module to detect conflicts between h^P_t and every memory slot M_i(m), where M_i(m) is the m-th row of the memory matrix. Intuitively, low similarity signals new information. Our similarity module is based on a Siamese network architecture (Bromley et al., 1994) that takes phrase hidden representation h^P_t and memory slot M_i(m) and computes low-dimensional representations using the same neural network weights. The resulting low-dimensional representations are then compared using the cosine distance metric:

ŵ^c_{t,m} = sia(h^P_t) · sia(M_i(m)) / ( ||sia(h^P_t)|| ||sia(M_i(m))|| + ε )

where ε is a small value for numerical stability and sia is a multi-layer feed-forward network with a tanh activation function. For hidden representation h, sia is computed as:

h^l = tanh(W^l h^{l−1} + b^l),   sia(h) = W h^L + b

where l represents the layer number and W^l, b^l, W, and b are learnable parameters. We use ŵ^c_{t,m} to obtain a similarity distribution w^t_s for update step t over memory slots. w^t_s represents the probability of dissimilarity (or conflict), which is calculated by computing a softmax over cosine similarities with every memory slot m ∈ [1..L]:

w^t_s(m) = exp(ŵ^c_{t,m}) / Σ_{m′} exp(ŵ^c_{t,m′})

We compute a softmax over cosine values so that the linear combination of w^t_s with least-used weights w^t_lu (described below in the memory update paragraph) still represents the probability of update across each memory slot.
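The conflict-detection step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the single hidden layer, toy dimensions, and random weights are all assumptions.

```python
import numpy as np

def sia(h, params):
    """Siamese projection: stacked tanh layers, then a linear output layer.
    `params` holds hypothetical weights for the hidden and output layers."""
    for W_l, b_l in params["hidden"]:
        h = np.tanh(W_l @ h + b_l)
    W, b = params["out"]
    return W @ h + b

def conflict_distribution(h_phrase, memory, params, eps=1e-8):
    """Cosine similarity between the projected phrase and each projected
    memory slot, turned into a probability distribution via softmax."""
    z = sia(h_phrase, params)
    sims = []
    for m in range(memory.shape[0]):
        zm = sia(memory[m], params)
        sims.append(z @ zm / (np.linalg.norm(z) * np.linalg.norm(zm) + eps))
    sims = np.array(sims)
    e = np.exp(sims - sims.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d, L, h_dim = 8, 4, 6                  # toy sizes (hypothetical)
params = {"hidden": [(rng.standard_normal((h_dim, d)), np.zeros(h_dim))],
          "out": (rng.standard_normal((h_dim, h_dim)), np.zeros(h_dim))}
w_s = conflict_distribution(rng.standard_normal(d),
                            rng.standard_normal((L, d)), params)
```

Both inputs pass through the same `sia` weights, which is what makes the module Siamese: similarity is measured in a shared learned space rather than on raw encodings.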
Adding New Information To add new information to the memory, i.e., when there is no conflict with any location, we need to ascertain which memory locations are either empty or rarely used. When the memory is full, i.e., all memory slots were used during previous updates, we update the slot which was least used. This is accomplished by maintaining memory usage weights w^t_u ∈ R^L at each update t; w^t_u is initialized with zeros at t = 0 and is updated by combining previous memory usage weights w^{t−1}_u with the current write weights w^t_w using a decay parameter λ:

w^t_u = λ w^{t−1}_u + w^t_w

where write weights w^t_w are used to compute the write location and are described in the memory update paragraph below. The least-used weight vector w^t_lu at update step t is then calculated as:

w^t_lu = softmin(w^t_u)

where for vector x we calculate softmin(x) = exp(−x) / Σ_j exp(−x_j). Hard updates, i.e., using the smallest element instead of softmin, are also possible; however, we found softmin to be more stable during learning.
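A minimal sketch of this usage bookkeeping, assuming a hypothetical decay value λ = 0.95 and four memory slots:

```python
import numpy as np

def softmin(x):
    """softmin(x) = exp(-x) / sum_j exp(-x_j); shifting by min(x) for stability."""
    e = np.exp(-(x - x.min()))
    return e / e.sum()

def update_usage(w_u_prev, w_w, lam=0.95):
    """Usage weights decay over time and accumulate the current write weights:
    w_u^t = lam * w_u^{t-1} + w_w^t  (lam = 0.95 is an assumed value)."""
    return lam * w_u_prev + w_w

w_u = np.zeros(4)                               # L = 4 slots, all unused at t = 0
w_u = update_usage(w_u, np.array([0.7, 0.2, 0.1, 0.0]))
w_lu = softmin(w_u)                             # least-used distribution
```

Because slot 3 received no write weight, `softmin` assigns it the largest least-used probability, making it the preferred target for genuinely new information.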
Memory Update We wish to compute write location w^t_w given least-used weight vector w^t_lu and conflict probability distribution w^t_s. Notice that w^t_s and w^t_lu are essentially two probability distributions, each representing a candidate write location in memory. We learn a convex combination parameter µ, which depends on w^t_s, and compute:

w^t_w = softmax( (µ w^t_s + (1 − µ) w^t_lu) / τ )

where temperature hyperparameter τ is used to peak the write location. Finally, the memory is updated with the current phrase representation h^P_t as:

M_i(m) ← M_i(m) + w^t_w(m) h^P_t
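The full write step might then look as below. Treating µ as a fixed gate value and using an additive slot update are simplifying assumptions for illustration; in the model, µ is learned as a function of w^t_s.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write(memory, h_phrase, w_s, w_lu, mu, tau=0.1):
    """Blend the conflict distribution w_s with the least-used distribution
    w_lu using gate mu, sharpen with temperature tau, and add the phrase
    representation into memory at the resulting (soft) location."""
    w_w = softmax((mu * w_s + (1.0 - mu) * w_lu) / tau)
    memory = memory + np.outer(w_w, h_phrase)   # M(m) += w_w(m) * h_t^P
    return memory, w_w

L, d = 4, 8
memory = np.zeros((L, d))
h = np.ones(d)
w_s = np.array([0.1, 0.6, 0.2, 0.1])            # conflict mostly at slot 1
w_lu = np.array([0.25, 0.25, 0.25, 0.25])
memory, w_w = write(memory, h, w_s, w_lu, mu=0.9)
```

With a small τ the write distribution becomes nearly one-hot, so conflicting information effectively overwrites a single slot rather than being smeared across memory.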

Decoder
The output query is generated with an LSTM decoder. As shown in Figure 2, the decoder depends on the memory and utterance representations computed using Equations (10) and (1), respectively. The decoder state at time step s is computed as:

h^D_s = LSTM_D([φ_o(y_{s−1}); c^U_s; c^M_{s−1}], h^D_{s−1})

where φ_o is a learned embedding function for output tokens, c^U_s is an utterance context vector, c^M_{s−1} is a memory context vector, and h^D_{s−1} is the previous decoder hidden state. c^U_s is calculated as the weighted sum of all hidden states:

c^U_s = Σ_j α^U_s(j) h^U_j

where α^U_s is the utterance state attention score. Memory state attention scores α^M_s and memory context vector c^M_s are computed in a similar manner using memory slots as hidden states. 1 The probability of output query tokens is computed as:

P(y_s | y_{<s}, X) = softmax(W_o [h^D_s; c^U_s; c^M_s] + b_o)

We further modify the decoder in order to deal with the large number of database values (e.g., city names) common in text-to-SQL semantic parsing tasks. As described in Suhr et al. (2018), we add anonymized token attention scores to the output vocabulary distribution, which enables copying anonymized tokens mentioned in input utterances. The final probability distribution over output vocabulary tokens and anonymized tokens is:

P(ŷ_s) = softmax( [W_o [h^D_s; c^U_s; c^M_s] + b_o] ⊕ P(â_{i,s}) )

where ⊕ represents concatenation and P(â_{i,s}) are anonymized token attention scores from the attention distribution α^U_s.
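The two context vectors can be illustrated with plain dot-product attention; the scoring function and dimensions here are illustrative assumptions, not the model's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, states):
    """Dot-product attention (a simplification of a learned scoring function):
    a distribution over states, then their weighted sum as a context vector."""
    alpha = softmax(states @ query)
    return alpha @ states, alpha

rng = np.random.default_rng(1)
h_dec = rng.standard_normal(8)       # previous decoder state (toy size)
H_utt = rng.standard_normal((5, 8))  # utterance hidden states
M = rng.standard_normal((4, 8))      # memory slots
c_utt, a_utt = attend(h_dec, H_utt)  # utterance context vector
c_mem, a_mem = attend(h_dec, M)      # memory context vector
```

The same mechanism is applied twice, once over utterance states and once over memory slots, which is what lets the decoder consult stored discourse context without a task-specific decoding scheme.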

Training
Our model is trained in an end-to-end fashion using a cross-entropy loss. Given a training set of N interactions {I^(l)}^N_{l=1}, such that each interaction I^(l) consists of utterances X^(l)_i paired with gold queries Y^(l)_i, we minimize the token-level cross-entropy loss:

L_i = − (1/|Y_i|) Σ_k log P(ŷ_{i,k} = y_{i,k})

where ŷ_{i,k} denotes the predicted output token and y_{i,k} the gold output token at index k. The total loss is the average of the utterance-level losses used for back-propagation.
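As a toy illustration of the per-utterance token loss (the vocabulary size and distributions are made up):

```python
import math

def token_nll(pred_dists, gold_ids):
    """Average per-token negative log-likelihood for one utterance:
    pred_dists[k] is the model's distribution at step k, and gold_ids[k]
    is the index of the gold token at that step."""
    total = -sum(math.log(dist[g]) for dist, g in zip(pred_dists, gold_ids))
    return total / len(gold_ids)

# Two-step toy query with a 3-token vocabulary:
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = token_nll(dists, [0, 1])
```

Each utterance contributes one such term, and the final training loss averages these over all utterances in the batch.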

Experimental Setup
We evaluated MemCE, our memory-based context model, in various settings by integrating it with multiple open-source models. We achieve this by replacing the discourse component of related models with MemCE, with minor or no additional changes. All base models in our experiments use a turn-level hierarchical encoder to capture previous language context. For our primary evaluation, we use the ATIS (Hemphill et al., 1990; Dahl et al., 1994) dataset, but we also present results on SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a).
Utterance Segmentation We segment each input utterance into a sequence of phrases with a pretrained chunker and then apply a simple rule-based merging procedure to create bigger chunks as an approximation to propositions (Kintsch and van Dijk, 1978). Figure 3 illustrates the process. We used the Flair chunker (Akbik et al., 2018). The merging rules are applied in order: for each rule, we find any chunk whose end matches the left pattern followed by a chunk whose beginning matches the right pattern. Chunks that satisfy this criterion are merged.
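The merge loop can be sketched as below. The chunk tags and the two rules shown are hypothetical stand-ins, since the paper's actual rule list is not reproduced in this excerpt.

```python
def merge_chunks(chunks, rules):
    """Apply merging rules in order. A rule (left, right) merges any chunk
    tagged `left` with an immediately following chunk tagged `right`.
    Chunks are (tag, tokens) pairs; merged chunks keep the left tag."""
    for left, right in rules:
        i = 0
        while i < len(chunks) - 1:
            (t1, w1), (t2, w2) = chunks[i], chunks[i + 1]
            if t1 == left and t2 == right:
                chunks[i] = (t1, w1 + w2)   # absorb the right-hand chunk
                del chunks[i + 1]
            else:
                i += 1
    return chunks

chunks = [("NP", ["Continental", "flights"]), ("PP", ["before", "noon"]),
          ("SBAR", ["that"]), ("VP", ["have", "a", "meal"])]
merged = merge_chunks(chunks, rules=[("NP", "PP"), ("SBAR", "VP")])
```

Applying the (hypothetical) NP+PP rule first attaches the postmodifier "before noon" to its noun phrase, yielding coarser, more proposition-like phrases for the memory controller.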
We segment utterances and anonymize entities independently and then match entities within segments deterministically. This step is necessary to robustly perform anonymization, as in some rare cases the chunking process separates entities into two different phrases (e.g., in Long Beach California that is chunked as in Long Beach and California that). This is easily handled by a simple token number matching procedure between the anonymized utterance and corresponding phrases.

Our model is implemented in PyTorch (Paszke et al., 2019). For all experiments, we used the ADAM optimizer (Kingma and Ba, 2015) to minimize the loss function, and the initial learning rate was set to 0.001. During training, we used the ReduceLROnPlateau learning rate scheduling strategy on the validation loss, with a decay rate of 0.8. We also applied dropout with 0.5 probability. Dimensions of the word embeddings were set to 300. Following previous work, we use pretrained GloVe (Pennington et al., 2014) embeddings for our main experiments on the SParC and CoSQL datasets. For ATIS, word embeddings were not pretrained (Suhr et al., 2018). Memory length was chosen as a hyperparameter from the range [15, 25] and the temperature parameter was chosen from {0.01, 0.1}. Best memory length values for ATIS, SParC, and CoSQL were 25, 16, and 20, respectively. The RNN decoder is a two-layer LSTM and the encoder is a single-layer LSTM. The Siamese network in the module which detects conflicting slots uses two hidden layers.
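The optimization setup described above corresponds roughly to the following PyTorch configuration; the dummy linear module merely stands in for the full MemCE model, and is an assumption of this sketch.

```python
import torch

# Stand-in module; in practice this would be the MemCE parser from Section 3.
model = torch.nn.Linear(300, 300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Decay the learning rate by a factor of 0.8 when validation loss plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8)
dropout = torch.nn.Dropout(p=0.5)

# After each epoch, step the scheduler on the measured validation loss:
for val_loss in [1.0, 0.9, 0.9]:
    scheduler.step(val_loss)
```

This is a configuration fragment rather than a training loop; the forward pass, loss computation, and backward pass are omitted.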

Results
In this section, we assess the effectiveness of the MemCE encoder at handling contextual information. We present our results, evaluation methodology, and comparisons against the state of the art.

Evaluation on ATIS
We primarily focus on ATIS because it contains relatively long interactions (average length is 7) compared to other datasets (e.g., the average length in SParC is 3). Longer interactions present multiple challenges that require non-trivial processing of context, some of which are discussed in Section 6. We use the ATIS dataset split created by Suhr et al. (2018). It contains 27 tables and 162K entries, with 1,148/380/130 train/dev/test interactions. The semantic representations are in SQL.
Following Suhr et al. (2018), we measure query accuracy, strict denotation accuracy, and relaxed denotation accuracy. Query accuracy is the percentage of predicted queries that match the reference query. Strict denotation accuracy is the percentage of predicted queries that, when executed, produce the same results as the reference query. Relaxed accuracy also gives credit to a predicted query that fails to execute if the reference table is empty. In cases where the utterance is ambiguous and there are multiple gold queries, the query or table is considered correct if it matches any of the gold labels. We evaluate on both the development and test sets, and select the best model during training via a separate validation set consisting of 5% of the training data. Table 1 presents a summary of our results.
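The three metrics can be sketched with a toy executor as below; the handling of multiple gold queries for ambiguous utterances is omitted for brevity, and the executor is hypothetical.

```python
def evaluate(predictions, references, exec_fn):
    """Query accuracy: exact string match with the reference SQL.
    Strict denotation: executing prediction and reference yields the same table.
    Relaxed: also credit a failed execution when the reference table is empty.
    `exec_fn` returns a result table, or None if the query fails to execute."""
    q = s = r = 0
    for pred, gold in zip(predictions, references):
        res_p, res_g = exec_fn(pred), exec_fn(gold)
        q += pred == gold
        same = res_p is not None and res_p == res_g
        s += same
        r += same or (res_p is None and res_g == [])
    n = len(references)
    return q / n, s / n, r / n

# Toy executor over a two-query sample (hypothetical data):
tables = {"SELECT a": [1], "SELECT b": []}
exec_fn = lambda sql: tables.get(sql)           # unknown SQL "fails" (None)
q_acc, strict, relaxed = evaluate(["SELECT a", "bad sql"],
                                  ["SELECT a", "SELECT b"], exec_fn)
```

In the toy sample, the second prediction fails to execute but its reference table is empty, so it is credited under relaxed but not strict denotation accuracy.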
We compare our approach against a simple Seq2Seq model, which is a baseline encoder-decoder without any access to contextual information. Seq2Seq+Concat is a strong baseline which consists of an encoder-decoder model with attention over the current and the previous three concatenated utterances. We also compare against the models of Suhr et al. (2018) and Zhang et al. (2019). The former employs a turn-level encoder on top of an utterance-level encoder in a hierarchical fashion, together with a decoder which learns to copy complete SQL segments from the previous query (SQL segments between consecutive queries are aligned during training using a rule-based procedure). The latter enhances the turn-level encoder by employing an attention mechanism across different turns and additionally introduces a query editing mechanism which decides at each decoding step whether to copy from the previous query or insert a new token. Column Enc-Dec in Table 1 describes the various models in terms of the type of encoder/decoder used. LSTM is a vanilla encoder or decoder, HE is a turn-level hierarchical encoder, and Mem is the proposed memory-based encoder. SnipCopy and EditBased respectively refer to Suhr et al.'s (2018) and Zhang et al.'s (2019) decoders. We present two instantiations of our MemCE model, with a simple LSTM decoder (Mem-LSTM) and with SnipCopy (Mem-SnipCopy). For the sake of completeness, Table 1 also reports results from prior work which applies a grammar-based decoder to this task and incorporates the interaction history by concatenating the current utterance with the previous three utterances, encoded with a bi-directional LSTM. All models in Table 1 use entity anonymization; some additionally use identifier linking, i.e., string matching heuristic rules to link words or phrases in the input utterance to identifiers in the database (e.g., city_name_string -> "BOSTON"). As shown in Table 1, MemCE is able to outperform comparison systems.
We observe a boost in denotation accuracy when using the SnipCopy decoder instead of an LSTM-based one; however, exact match does not improve. This is possibly because SnipCopy makes it easier to generate long SQL queries by copying segments, but at the same time it suffers from spurious generation and error propagation.

Table 2: Query (Q) and Interaction (I) accuracy for SParC and CoSQL. We report results on the development (D) and test (T) sets. SParC-DI is our domain-independent split of SParC. HE is a hierarchical encoder and Mem is the proposed memory-based context encoder. LSTM is a vanilla decoder, SnipCopy copies SQL segments from the previous query, and Grammar refers to a decoder which outputs a sequence of grammar rules rather than tokens. Table cells are filled with - whenever results are not available.

Ablation Studies We assess the contribution of individual model components. We use Mem-SnipCopy as our base model and report performance on the ATIS development set, following the configuration described in Section 4. We first remove the proposed memory controller described in Section 3.2 and simplify Equation (9), calculating w^t_w with standard key-value attention over memory slots instead. We observe a decrease in performance (see second row in Table 3), indicating that the proposed memory controller is helpful in maintaining interaction context. We performed two further ablation experiments to evaluate the usefulness of utterance segmentation. Firstly, instead of the phrases extracted from our segmentation procedure, we employ a variant of our model which operates over individual tokens (see row "phrases are utterance tokens" in Table 3). As can be seen, this strategy is not optimal, as results decrease across metrics. We believe operating directly on tokens can lead to ambiguity during update. For example, when processing current phrase to Boston given previous utterance What Continental flights go from Chicago to Seattle, it is not obvious whether Boston should update Chicago or Seattle. Secondly, we do not use any segmentation at all, not even at the token level; instead, we treat the entire utterance as a single phrase (see row "phrases are full utterances" in Table 3). If memory's only function is to simply store utterance encodings, then this model becomes comparable to a hierarchical encoder with attention. Again, we observe that performance decreases, which indicates that our system benefits from utterance segmentation. Overall, the ablation studies in Table 3 show that segmentation and its granularity matter. Our heuristic procedure works well for the task at hand, although a learning-based method would be more flexible and could potentially lead to further improvements; we leave this to future work.

Table 4: Statistics for the SParC-DI domain-independent split, which has 157 domains in total.

Evaluation on SParC and CoSQL
In this section we describe our results on SParC and CoSQL. Both datasets assume a cross-domain semantic parsing task in context, with SQL as the meaning representation. In addition, for ambiguous utterances (which cannot be uniquely mapped to SQL given past context), CoSQL also includes clarification questions (and answers). We do not tackle these explicitly but consider them part of the utterance preceding them (e.g., please list the singers | did you mean list their names? | yes). Since our primary objective is to study and measure context-dependent language understanding, we created a split of SParC, denoted as SParC-DI, 2 where all domains are seen in the training, development, and test sets. In this way we ensure that no model has the added advantage of being able to handle cross-domain instances while lacking context-dependent language understanding. Table 4 shows the statistics of our SParC-DI split, which follows an 80/10/10 percent ratio for the training/development/test sets.
We evaluate model output using exact set match accuracy (Yu et al., 2019b). 3 We report two metrics: question accuracy, which considers all utterances independently, and interaction accuracy, which is averaged across interactions; an interaction is marked as correct only if all utterances in that interaction are correct. Since utterances in an interaction can be semantically complete (i.e., independent of context), we prefer interaction accuracy. Table 2 summarizes our results. CDS2S is a context-dependent cross-domain parsing model adapted from Suhr et al. (2018) to include a schema encoder, which is necessary for SParC and CoSQL. It also uses a turn-level hierarchical encoder to represent the interaction history. We additionally report model variants where the CDS2S encoder is combined with an LSTM-based decoder, SnipCopy (Suhr et al., 2018), and a grammar-based decoder; the latter decodes SQL queries as a sequence of grammar rules rather than tokens. We compare the above systems with three variants of our MemCE model, which differ in their use of an LSTM decoder, SnipCopy, or the grammar-based decoder.
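The two metrics can be computed from per-utterance correctness flags; this is a generic sketch rather than the official evaluation script.

```python
def question_and_interaction_accuracy(interactions):
    """`interactions` is a list of lists of booleans (per-utterance correctness).
    Question accuracy treats utterances independently; interaction accuracy
    requires every utterance in an interaction to be correct."""
    flat = [ok for inter in interactions for ok in inter]
    q_acc = sum(flat) / len(flat)
    i_acc = sum(all(inter) for inter in interactions) / len(interactions)
    return q_acc, i_acc

# One fully correct interaction and one with a single error:
q, i = question_and_interaction_accuracy([[True, True], [True, False, True]])
```

A single mis-parsed utterance zeroes out its whole interaction under interaction accuracy, which is why this metric is the stricter test of context handling.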
Across models and datasets we observe that MemCE improves performance, which suggests that it better captures contextual information as an independent language modeling component. We observe that the benefits from our memory-based encoder persist across domains and data splits, even when sophisticated strategies like grammar-based decoding are adopted.

Analysis
In this section, we analyze our model's ability to handle important discourse phenomena such as focus shift, referring expressions, and ellipsis. We also showcase its interpretability by examining the behavior of the (learned) memory controller.

Focus Shift
Our linguistic analysis took place on 20 interactions 4 randomly sampled from the ATIS development set (134 utterances in total). Table 5 shows overall performance statistics for MemCE (Mem-LSTM) and Suhr et al. (2018) (HE-SnipCopy) on our sample. We annotated the focus of attention in each utterance (underlined in the example below), which we operationalized as the most salient entity (e.g., city) within the utterance (Grosz et al., 1995). Focus shift occurs when the attention transitions from one entity to another. In the interaction below, the focus shifts from flights in Q2 to cities in Q3. Handling focus shift has been problematic in the context of semantic parsing (Suhr et al., 2018). In our sample, 41.8% of utterances displayed focus shift. Our model was able to correctly parse all utterances in the interaction above and is better at handling focus shifts compared to related systems (Suhr et al., 2018). Table 5 reports denotation and query accuracy on our analysis sample.

Figure 4: Visualization of the memory matrix. Rows represent memory content and columns represent the utterance time step. The top row shows the utterances being processed. Each row is marked with a memory slot number which represents the content of memory in that slot. Empty slots are marked with φ. The bottom row shows whether the utterance was parsed correctly (✓) or not (✗). Additional markers flag stale content in memory w.r.t. the current utterance and incorrect substitutions.

Referring Expressions and Ellipsis
Ellipsis refers to the omission of information from an utterance that can be recovered from the context. In the interaction below, Q2 and Q3 exemplify nominal ellipsis: the NP all flights from Long Beach to Memphis is elided and must be recovered from the discourse in order to generate correct SQL queries. Q4 is an example of coreference: they refers to the answer of Q3, but it can also be resolved by considering all previous utterances (i.e., Where do they [flights from Long Beach to Memphis; any day] stop). Since our model explicitly stores contextual information, it is able to parse utterances like Q2 and Q4 correctly. In our ATIS sample, 26.8% of utterances exhibited ellipsis and 7.5% contained referring expressions. Results in Table 5 show that MemCE handles both cases better.

Memory Interpretation
In this section we delve into the memory controller with the aim of understanding what kinds of patterns it learns and where it fails. In Figure 4, we visualize the content of memory for an interaction (top row) from the ATIS development set consisting of seven utterances. Each column in Figure 4 shows the content of memory after processing the corresponding utterance in the interaction. The bottom row indicates whether the final output was correct (✓) or not (✗). For clarity of visualization, we took the max instead of the softmax in Equation (8) to obtain the memory state at any time step.
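The difference between the soft and hard readouts can be made concrete with a small sketch. This is not a reproduction of Equation (8), which is defined in an earlier section; it only assumes a dot-product attention over memory slots, with `read_memory` and its arguments being illustrative names:

```python
import numpy as np

def read_memory(memory, query, hard=False):
    """Read from a slot memory via attention over slots.

    memory: (num_slots, dim) matrix of slot contents.
    query:  (dim,) query vector for the current utterance.
    hard=True replaces the softmax with a one-hot argmax read,
    which yields the discrete slot assignments used for
    visualization.
    """
    scores = memory @ query                   # (num_slots,) slot scores
    if hard:
        weights = np.zeros_like(scores)
        weights[np.argmax(scores)] = 1.0      # one-hot: pick the best slot
    else:
        e = np.exp(scores - scores.max())     # numerically stable softmax
        weights = e / e.sum()
    return weights @ memory                   # (dim,) blended read vector

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
query = np.array([0.9, 0.1])
soft = read_memory(memory, query)             # mixture of all slots
hard = read_memory(memory, query, hard=True)  # exactly one slot's content
```

The hard read returns the single best-matching slot, which is why each memory cell in the figure can be labeled with one slot number rather than a mixture.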
Q2 presents an interesting case for our model: it is not obvious whether Continental airlines from Q1 should be carried forward while processing Q2. The latter is genuinely ambiguous; it could refer to Continental airlines flights or to flights by any carrier leaving from Seattle to Chicago. Under the second interpretation, Q2 is more or less semantically complete and independent of Q1. 44% of utterances in our ATIS sample are semantically complete. Although we do not explicitly handle such utterances, our model is able to parse many of them correctly because they usually repeat the information mentioned in previous discourse as a single query (see Table 5). Q2 also shows that the memory controller is able to learn the similarity between long phrases: on 1993 February twenty Seventh ⇔ Show 1993 February twenty eighth flights. It also demonstrates a degree of semantic understanding: it replaces from Chicago with from Seattle in order to process Q2, rather than simply relying on entity matching. Figure 4 further shows the kinds of mistakes the controller makes, which are mostly due to stale content in memory. In utterance Q6 the memory carries over the constraint after 1500 hours from the previous utterance, which is not valid since Q6 explicitly states Show all . . . flights on Continental. At the same time, the constraints from Seattle and to Chicago should carry forward. Knowing which content to keep or discard is what makes the task challenging.
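The intended carry-forward/substitute behavior can be illustrated with a toy model of the memory. This is a hypothetical sketch, not the learned controller: it keys phrases by an explicit constraint type (the names `update_memory`, `CARRIER`, `FROM`, etc. are illustrative), whereas the actual controller must infer such correspondences from phrase similarity:

```python
def update_memory(memory, new_constraints):
    """Toy phrase-level context memory (hypothetical sketch).

    memory maps a constraint type to a phrase. A new constraint of
    the same type overwrites the stored one (a substitution, e.g.
    TIME "before 10am" -> "before noon"); new types are added; and
    untouched types carry forward unchanged.
    """
    updated = dict(memory)          # carry everything forward
    updated.update(new_constraints) # substitute / add revised constraints
    return updated

# Memory after Q1: Continental flights from Chicago to Seattle before 10am
mem_q1 = {"CARRIER": "Continental", "FROM": "Chicago",
          "TO": "Seattle", "TIME": "before 10am"}

# Q2 revises only the TIME constraint
mem_q2 = update_memory(mem_q1, {"TIME": "before noon"})
```

The hard part, as the Q6 example shows, is that the controller has no such explicit types: it must decide from context alone whether a stored phrase is a revision target, a carry-forward, or stale content to discard.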
Another cause of errors relates to reinstating previously nullified constraints. In the interaction below, Q3 reinstates from Seattle to Chicago; the focus shifts from flights in Q1 to ground transportation in Q2 and then back to flights in Q3.

Q1: Show flights from Seattle to Chicago
Q2: What ground transportation is available in Chicago
Q3: Show flights after 1500 hours

Handling these issues together necessitates a non-trivial way of managing context. Given that our model is trained end-to-end, it is encouraging to observe a one-to-one correspondence between memory contents and the final output, which supports our hypothesis that explicitly modeling language context is helpful.

Conclusions
In this paper, we presented a memory-based model for context-dependent semantic parsing and evaluated its performance on a text-to-SQL task. Analysis of model output revealed that our approach handles several discourse-related phenomena to a large extent. We also analyzed the behavior of the memory controller and observed that it correlates with the model's output decisions. Our study indicates that explicitly modeling context can be helpful for contextual language processing tasks. Our model manipulates information at the phrase level, which can be too rigid for fine-grained updates. In the future, we would like to experiment with learning the right level of utterance segmentation for context modeling, as well as learning when to reinstate a constraint.