Abstract
We present a memory-based model for context-dependent semantic parsing. Previous approaches focus on enabling the decoder to copy or modify the parse from the previous utterance, assuming there is a dependency between the current and previous parses. In this work, we propose to represent contextual information using an external memory. We learn a context memory controller that manages the memory by maintaining the cumulative meaning of sequential user utterances. We evaluate our approach on three semantic parsing benchmarks. Experimental results show that our model can better process context-dependent information and demonstrates improved performance without using task-specific decoders.
1 Introduction
Semantic parsing is the task of converting natural language utterances into machine interpretable meaning representations such as executable queries or logical forms. It has emerged as an important component in many natural language interfaces (Özcan et al., 2020) with applications in robotics (Dukes, 2014), question answering (Zhong et al., 2018; Yu et al., 2018b), dialogue systems (Artzi and Zettlemoyer, 2011), and the Internet of Things (Campagna et al., 2017).
Neural network based approaches have led to significant improvements in semantic parsing (Zhong et al., 2018; Kamath and Das, 2019; Yu et al., 2018b; Yavuz et al., 2018; Yu et al., 2018a) across domains and semantic formalisms. The majority of existing studies focus on parsing utterances in isolation, and as a result they cannot readily transfer to more realistic settings where users ask multiple inter-related questions to satisfy an information need. In this work, we study context-dependent semantic parsing, focusing specifically on text-to-SQL generation, which has emerged as a popular application area in recent years.
Figure 1 shows a sequence of utterances in an interaction. The discourse focuses on a specific topic serving a specific information need, namely, finding out which Continental flights leave from Chicago on a given date and time. Importantly, interpreting each of these utterances, and mapping them to a database query to retrieve an answer, needs to be situated in a particular context as the exchange proceeds. The topic further evolves as the discourse transitions from one utterance to the next and constraints (e.g., TIME or PLACE) are added or revised. For example, in Q2 the TIME constraint before 10am from Q1 is revised to before noon, and in Q3 to before 2pm. Aside from such topic extensions (Chai and Jin, 2004), the interpretation of Q2 and Q3 depends on Q1, as it is implied that the questions concern Continental flights that go from Chicago to Seattle, not just any Continental flights; however, the phrase from Chicago to Seattle is elided from Q2 and Q3. The interpretation of Q4 depends on Q3, which in turn depends on Q1. Interestingly, Q5 introduces information with no dependencies on previous discourse; in this case, relying on information from previous utterances would lead to incorrect SQL queries.
Figure 1: Example utterances from a user interaction in the ATIS dataset. Utterance segments referring to the same entity or objects are in the same color. SQL queries corresponding to Q2–Q5 follow a pattern similar to Q1 and are not shown for the sake of brevity.
The problem of contextual language processing has been most widely studied within dialogue systems, where the primary goal is to incrementally fill pre-defined slot-templates, which can then be used to generate appropriate natural language responses (Gao et al., 2019). But the rich semantics of SQL queries makes the task of contextual text-to-SQL parsing substantially different. Previous approaches (Suhr et al., 2018; Zhang et al., 2019) tackle this problem by enabling the decoder to copy or modify the previous queries under the assumption that they contain all necessary context for generating the current SQL query. The utterance history is encoded in a hierarchical manner, and although this is a good enough approximation for most queries (in existing datasets), it is not sufficient to model long-range discourse phenomena (Grosz and Sidner, 1986).
Our own work draws inspiration from Kintsch and van Dijk's (1978) text comprehension model. In their system the process of comprehension involves three levels of operations. Firstly, smaller units of meaning (i.e., propositions) are extracted and organized into a coherent whole (microstructure); some of these are stored in a working memory buffer, which makes it possible to decide whether new input overlaps with already processed propositions. Secondly, the gist of the whole is condensed (macrostructure). Thirdly, new texts are generated from the memorial consequences of the previous two operations; in other words, the (short- and long-term) memory of the reader gives meaning to the text read. They propose three macro rules, namely deletion, generalization, and construction, as essential for reducing and organizing the detailed information in the microstructure of the text. Furthermore, previous knowledge and experience are central to the interpretation of text, enabling the reader to fill information gaps.
Our work borrows several key insights from Kintsch and van Dijk (1978) without being a direct implementation of their model. Specifically, we also break down input utterances into smaller units, namely, phrases, and argue that this information can be effectively utilized in maintaining contextual information in an interaction. Furthermore, the notion of a memory buffer that can be used to store and process new and old information plays a prominent role in our approach. We propose a Memory-based ContExt model (which we call MemCE for short) for keeping track of contextual information, and learn a context memory controller that manages the memory. Each interaction (sequence of user utterances) maintains its context using a memory matrix. User utterances are segmented into a sequence of phrases representing either new information to be added into the memory (e.g., that have a meal in Figure 1) or old information which might conflict with current information in memory and needs to be updated (e.g., before 10 am should be replaced with before noon in Figure 1). Our model can inherently add new content to memory, read existing content by accessing the memory, and update old information.
We evaluate our approach on the ATIS (Suhr et al., 2018; Dahl et al., 1994), SParC (Yu et al., 2019b), and CoSQL (Yu et al., 2019a) datasets. We observe performance improvements when we combine MemCE with existing models, underscoring the importance of more specialized mechanisms for processing context information. In addition, our model brings interpretability to how context is processed: we are able to inspect the learned memory controller and analyze whether important discourse phenomena such as coreference and ellipsis are modeled.
2 Related Work
Sequence-to-sequence neural networks (Bahdanau et al., 2015) have emerged as a general modeling framework for semantic parsing, achieving impressive results across different domains and semantic formalisms (Dong and Lapata, 2016; Jia and Liang, 2016; Wang et al., 2020; Zhong et al., 2018; Yu et al., 2018b, inter alia). The majority of existing work has focused on mapping natural language utterances into machine-readable meaning representations in isolation without utilizing context information. While this is useful for environments consisting of one-shot interactions of users with a system (e.g., running QA queries on a database), many settings require extended interactions between a user and an automated assistant (e.g., booking a flight). This makes the one-shot parsing model inadequate for many scenarios.
In this paper we are concerned with the lesser studied problem of contextualized semantic parsing, where previous utterances are taken into account in the interpretation of the current utterance. Earlier work (Miller et al., 1996; Zettlemoyer and Collins, 2009; Srivastava et al., 2017) has focused on symbolic features for representing context, for example, by explicitly modeling discourse referents or the flow of discourse. More recent neural methods extend the sequence-to-sequence architecture to incorporate contextual information either by modifying the encoder or the decoder. Context-aware encoders resort to concatenating the current utterance with the utterances preceding it (Suhr et al., 2018; Zhang et al., 2019) or focus on the history of the utterances most relevant to the current decoder state (Liu et al., 2020). The decoders take context representations as additional input and often copy segments from the previous query (Suhr et al., 2018; Zhang et al., 2019). Hybrid approaches (Iyyer et al., 2017; Guo et al., 2019; Liu et al., 2020; Lin et al., 2019) employ neural networks for representation learning but use a grammar for decoding (e.g., a sequence of actions or an intermediate representation).
A tremendous amount of work has taken place in the context of discourse modeling focusing on extended texts (Mann and Thompson, 1988; Hobbs, 1985) and dialogue (Grosz and Sidner, 1986). Kintsch and van Dijk (1978) study the mental operations underlying the comprehension and summarization of text. They introduce propositions as the basic unit of text representation, and a model of how incoming text is processed given memory limitations; texts are reduced to important propositions (to be recalled later) using macro-operators (e.g., addition, deletion). Their model has met with popularity in cognitive psychology (Baddeley, 2007) and has also found application in summarization (Fang and Teufel, 2016).
Our work proposes a new encoder for contextualized semantic parsing. At the heart of our approach is a memory controller that keeps track of context via writing new information and updating old information. Our memory-based approach is inspired by Kintsch and van Dijk (1978) and is closest to Santoro et al. (2016), who use a memory augmented neural network (Weston et al., 2015; Sukhbaatar et al., 2015) for meta-learning. Specifically, they introduce a method for accessing external memory which functions as short-term storage for meta-learning. Although we report experiments solely on semantic parsing, our encoder is fairly general and could be applied to other context-dependent tasks such as conversational information seeking (Dalton et al., 2020) and information retrieval (Sun and Chai, 2007; Voorhees, 2004).
3 Model
Our model is based on the encoder-decoder architecture (Cho et al., 2015) with the addition of a memory component (Sukhbaatar et al., 2015; Santoro et al., 2016) for incorporating context. Let I denote an interaction consisting of a sequence of turns (Xi, Yi), where Xi is the input utterance and Yi is the output SQL query at turn i. At each turn i, given Xi and all previous turns I[1…i − 1], our task is to predict the SQL output Yi.
As shown in Figure 2, our model consists of four components: (1) a memory matrix that retains discourse information, (2) a memory controller that learns to access and manipulate the memory so that the correct discourse information is retained, (3) utterance and phrase encoders, and (4) a decoder that interacts with the memory and the utterance encoder through an attention mechanism to generate SQL output.
Figure 2: Overview of model architecture. Utterances are broken down into segments. Each segment is encoded with the same encoder (same weights) and is processed independently. The context update controller learns to manipulate the memory such that the correct discourse information is retained.
3.1 Input Encoder
3.2 Context Memory
Our context memory is a matrix Mi ∈ ℝL×d with L memory slots, each of dimension d, where Mi denotes the state of the memory matrix at the ith interaction turn. The goal of the context memory is to maintain the relevant information required to parse the input utterance at each turn. As shown in Figure 2, this is achieved by learning a context update controller, which is responsible for updating the memory at each turn.
For each phrase in the sequence of phrases within utterance Xi, the controller decides whether it contains old information that conflicts with information present in the memory or new information that has to be added to the current context. When novel information is introduced, the controller should add it to an empty or least-used memory slot; otherwise, the conflicting memory slot should be updated with the latest information. Let t denote the memory update time step such that t ∈ [1, n], where n is the total number of phrases in interaction I. To simplify notation, we index the hidden representation of a phrase by the update time step t alone, rather than by turn and phrase position.
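To make the controller's decision procedure concrete, the following is a minimal sketch of one plausible reading of this update rule. It assumes a Siamese scorer for conflict detection (Section 4 mentions a two-layer Siamese network) and a usage counter for locating the least-used slot; the conflict threshold and the exact gating are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMemory(nn.Module):
    """Sketch of the context memory update described above: a Siamese
    scorer looks for a slot that conflicts with the incoming phrase;
    conflicting slots are overwritten, otherwise the phrase is written
    to an empty/least-used slot."""

    def __init__(self, num_slots=20, dim=300, temperature=0.1):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_slots, dim))
        self.register_buffer("usage", torch.zeros(num_slots))
        # Shared (Siamese) projection applied to both phrase and slots.
        self.project = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                     nn.Linear(dim, dim), nn.Tanh())
        self.temperature = temperature

    def write(self, phrase):  # phrase: [dim] encoding of one segment
        sims = F.cosine_similarity(self.project(phrase).unsqueeze(0),
                                   self.project(self.memory), dim=-1)
        if sims.max() > 0.5:  # hypothetical conflict threshold
            # Old information: soft update of the conflicting slot(s).
            write_w = torch.softmax(sims / self.temperature, dim=0)
        else:
            # New information: hard write to the least-used slot.
            write_w = F.one_hot(self.usage.argmin(),
                                self.memory.size(0)).float()
        self.memory = (1 - write_w.unsqueeze(1)) * self.memory \
                      + write_w.unsqueeze(1) * phrase
        self.usage += write_w

mem = ContextMemory()
mem.write(torch.randn(300))  # e.g., encoding of the phrase "before 10am"
```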
Detecting Conflicts
Adding New Information
Memory Update
3.3 Decoder
3.4 Training
4 Experimental Setup
We evaluated MemCE, our memory-based context model, in various settings by integrating it with multiple open-source models. We achieve this by replacing the discourse component of related models with MemCE, with minor or no additional changes. All base models in our experiments use a turn-level hierarchical encoder to capture previous language context. For our primary evaluation, we use the ATIS (Hemphill et al., 1990; Dahl et al., 1994) dataset, but we also present results on SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a).
Utterance Segmentation
Figure 3: Example of sentence segmentation using chunking and rule-based merging.
The rules above are applied in order. For each rule we find any chunk whose end matches the left pattern followed by a chunk whose beginning matches the right pattern. Chunks that satisfy this criterion are merged.
We segment utterances and anonymize entities independently, and then match entities within segments deterministically. This step is necessary to perform anonymization robustly since, in rare cases, the chunking process splits an entity across two phrases (e.g., in Long Beach California that is chunked as in Long Beach and California that). This is easily handled by a simple token-number matching procedure between the anonymized utterance and the corresponding phrases.
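As an illustration of the merging step, here is a small sketch; the single pattern pair below is an invented placeholder standing in for the actual rule inventory, which appears in the segmentation figure above.

```python
import re

# Illustrative placeholder for the real merge rules: each rule pairs a
# pattern for the end of one chunk with a pattern for the start of the next.
MERGE_RULES = [
    (re.compile(r"\b(from|to|on|before|after)$"), re.compile(r"^[A-Z0-9]")),
]

def merge_chunks(chunks):
    """Apply rules in order; merge a chunk whose end matches the left
    pattern with a following chunk whose beginning matches the right."""
    for left, right in MERGE_RULES:
        merged, i = [], 0
        while i < len(chunks):
            if (i + 1 < len(chunks)
                    and left.search(chunks[i])
                    and right.search(chunks[i + 1])):
                merged.append(chunks[i] + " " + chunks[i + 1])
                i += 2
            else:
                merged.append(chunks[i])
                i += 1
        chunks = merged
    return chunks

# -> ["show flights", "from Chicago", "to Seattle", "before 10am"]
print(merge_chunks(["show flights", "from", "Chicago",
                    "to", "Seattle", "before", "10am"]))
```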
Model Configuration
Our model is implemented in PyTorch (Paszke et al., 2019). For all experiments, we used the Adam optimizer (Kingma and Ba, 2015) to minimize the loss function, with the initial learning rate set to 0.001. During training, we used the ReduceLROnPlateau learning rate scheduling strategy on the validation loss, with a decay rate of 0.8. We also applied dropout with 0.5 probability. Word embedding dimensions were set to 300. Following previous work (Zhang et al., 2019), we use pretrained GloVe (Pennington et al., 2014) embeddings for our main experiments on the SParC and CoSQL datasets. For ATIS, word embeddings were not pretrained (Suhr et al., 2018; Zhang et al., 2019). Memory length was chosen as a hyperparameter from the range [15, 25] and the temperature parameter from {0.01, 0.1}. The best memory length values for ATIS, SParC, and CoSQL were 25, 16, and 20, respectively. The RNN decoder is a two-layer LSTM and the encoder is a single-layer LSTM. The Siamese network in the module that detects conflicting slots uses two hidden layers.
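For concreteness, a sketch of the optimizer and scheduler setup as reported above (the scheduler patience is our assumption; the paper does not state it):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(300, 300)  # stand-in for the full MemCE parser
optimizer = Adam(model.parameters(), lr=0.001)
# Reduce the learning rate by a factor of 0.8 when validation loss plateaus;
# patience=3 is an assumption, not a reported value.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.8, patience=3)

for epoch in range(10):
    val_loss = max(0.3, 1.0 / (epoch + 1))  # placeholder validation loss
    scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]["lr"])
```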
5 Results
In this section, we assess the effectiveness of the MemCE encoder at handling contextual information. We present our results, evaluation methodology, and comparisons against the state of the art.
5.1 Evaluation on ATIS
We primarily focus on ATIS because it contains relatively long interactions (average length is 7) compared with other datasets (e.g., the average length in SParC is 3). Longer interactions present multiple challenges that require non-trivial processing of context, some of which are discussed in Section 6. We use the ATIS dataset split created by Suhr et al. (2018). It contains 27 tables and 162K entries with 1,148/380/130 train/dev/test interactions. The semantic representations are in SQL.
Following Suhr et al. (2018), we measure query accuracy, strict denotation accuracy, and relaxed denotation accuracy. Query accuracy is the percentage of predicted queries that match the reference query. Strict denotation accuracy is the percentage of predicted queries that, when executed, produce the same results as the reference query. Relaxed denotation accuracy also gives credit to a predicted query that fails to execute when the reference table is empty. In cases where the utterance is ambiguous and there are multiple gold queries, the query or table is considered correct if it matches any of the gold labels. We evaluate on both the development and test sets, and select the best model during training via a separate validation set consisting of 5% of the training data.
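The three metrics can be stated compactly; below is a sketch under the assumption that executed result tables are represented as lists of rows (None for a failed execution):

```python
def query_accuracy(pred_sql, gold_sqls):
    """Exact query match; with multiple gold queries for an ambiguous
    utterance, any match counts."""
    return pred_sql in gold_sqls

def strict_denotation(pred_rows, gold_rows):
    """The predicted query's result table must equal the reference table;
    pred_rows is None when the predicted query fails to execute."""
    return pred_rows is not None and pred_rows == gold_rows

def relaxed_denotation(pred_rows, gold_rows):
    """As strict, but also credit a failed execution when the reference
    table is empty."""
    if pred_rows is None:
        return len(gold_rows) == 0
    return pred_rows == gold_rows

# A failed execution only counts under the relaxed metric, and only when
# the reference table is empty.
print(strict_denotation(None, []), relaxed_denotation(None, []))  # False True
```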
Table 1 presents a summary of our results. We compare our approach against a simple Seq2Seq model, a baseline encoder-decoder without any access to contextual information. Seq2Seq+Concat is a strong baseline consisting of an encoder-decoder model with attention over the current utterance concatenated with the previous three utterances. We also compare against the models of Suhr et al. (2018) and Zhang et al. (2019). The former uses a turn-level encoder on top of an utterance-level encoder in a hierarchical fashion, together with a decoder that learns to copy complete SQL segments from the previous query (SQL segments between consecutive queries are aligned during training using a rule-based procedure). The latter enhances the turn-level encoder with an attention mechanism across different turns and additionally introduces a query editing mechanism that decides at each decoding step whether to copy from the previous query or insert a new token. Column Enc-Dec in Table 1 describes the various models in terms of the type of encoder/decoder used. LSTM is a vanilla encoder or decoder, HE is a turn-level hierarchical encoder, and Mem is the proposed memory-based encoder. SnipCopy and EditBased refer to the decoders of Suhr et al. (2018) and Zhang et al. (2019), respectively. We present two instantiations of our MemCE model, one with a simple LSTM decoder (Mem-LSTM) and one with SnipCopy (Mem-SnipCopy). For completeness, Table 1 also reports the results of Lin et al. (2019), who apply a grammar-based decoder to this task; they also incorporate the interaction history by concatenating the current utterance with the previous three utterances, which are encoded with a bi-directional LSTM. All models in Table 1 use entity anonymization; Lin et al. (2019) additionally use identifier linking, namely, string-matching heuristic rules that link words or phrases in the input utterance to identifiers in the database (e.g., linking city_name_string to "BOSTON").
Table 1: Model accuracy on the ATIS dataset. HE is a hierarchical interaction encoder, while Mem is the proposed memory-based encoder. LSTM are vanilla encoder/decoder models, while SnipCopy copies SQL segments from the previous query and EditBased adopts a query editing mechanism.
| Model | Enc-Dec | Dev: Query | Dev: Denotation (Relaxed) | Dev: Denotation (Strict) | Test: Query | Test: Denotation (Relaxed) | Test: Denotation (Strict) |
|---|---|---|---|---|---|---|---|
| Seq2Seq | LSTM-LSTM | 28.7 | 48.8 | 43.2 | 35.7 | 56.4 | 53.8 |
| Seq2Seq+Concat | LSTM-LSTM | 35.1 | 59.4 | 56.7 | 42.2 | 66.6 | 65.8 |
| Suhr et al. (2018) | HE-LSTM | 36.0 | 59.5 | 58.3 | — | — | — |
| Suhr et al. (2018) | HE-SnipCopy | 37.5 | 63.0 | 62.5 | 43.6 | 69.3 | 69.2 |
| Zhang et al. (2019) | HE-EditBased | 36.2 | 60.5 | 60.0 | 43.9 | 68.5 | 68.1 |
| Lin et al. (2019) | LSTM-Grammar | 39.1 | — | 65.8 | 44.1 | — | 73.7 |
| MemCE | Mem-LSTM | 40.2 | 63.6 | 61.2 | 47.0 | 70.1 | 68.9 |
| MemCE | Mem-SnipCopy | 39.1 | 65.5 | 65.2 | 45.3 | 70.2 | 69.8 |
As shown in Table 1, MemCE is able to outperform comparison systems. We observe a boost in denotation accuracy when using the SnipCopy decoder instead of an LSTM-based one, although exact match does not improve. This is possibly because SnipCopy makes it easier to generate long SQL queries by copying segments, but at the same time it suffers from spurious generation and error propagation.
We performed two ablation experiments to evaluate the usefulness of utterance segmentation. Firstly, instead of the phrases extracted by our segmentation procedure, we employ a variant of our model that operates over individual tokens (see row “phrases are utterance tokens” in Table 3). As can be seen, this strategy is not optimal, as results decrease across metrics. We believe operating directly on tokens can lead to ambiguity during updates. For example, when processing the current phrase to Boston given the previous utterance What Continental flights go from Chicago to Seattle, it is not obvious whether Boston should update Chicago or Seattle. Secondly, we do not use any segmentation at all, not even at the token level; instead, we treat the entire utterance as a single phrase (see row “phrases are full utterances” in Table 3). If the memory's only function is simply to store utterance encodings, then this model becomes comparable to a hierarchical encoder with attention. Again, we observe that performance decreases, which indicates that our system benefits from utterance segmentation. Overall, the ablation studies in Table 3 show that segmentation and its granularity matter. Our heuristic procedure works well for the task at hand, although a learning-based method would be more flexible and could potentially lead to further improvements. We leave this to future work.
5.2 Evaluation on SParC and CoSQL
In this section we describe our results on SParC and CoSQL. Both datasets assume a cross-domain semantic parsing task in context with SQL as the meaning representation. In addition, for ambiguous utterances (which cannot be uniquely mapped to SQL given past context), CoSQL also includes clarification questions (and answers). We do not tackle these explicitly but consider them part of the utterance preceding them (e.g., please list the singers — did you mean list their names? — yes). Since our primary objective is to study and measure context-dependent language understanding, we created a split of SParC, denoted as SParC-DI [2], where all domains are seen in the training, development, and test sets. In this way we ensure that no model has the added advantage of being able to handle cross-domain instances while lacking context-dependent language understanding. Table 4 shows the statistics of our SParC-DI split, which follows an 80/10/10 percent ratio for the training/development/test sets.
We evaluate model output using exact set match accuracy (Yu et al., 2019b) [3]. We report two metrics: question accuracy, which considers all utterances independently, and interaction accuracy, which is the percentage of interactions in which all utterances are parsed correctly. Because utterances in an interaction can be semantically complete (i.e., independent of context), we prefer interaction accuracy.
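Concretely, assuming each interaction is reduced to a list of per-utterance correctness flags (each flag being the outcome of exact set match), the two metrics are:

```python
def question_accuracy(interactions):
    """Fraction of utterances that are correct, considering all
    utterances independently."""
    flags = [ok for turns in interactions for ok in turns]
    return sum(flags) / len(flags)

def interaction_accuracy(interactions):
    """Fraction of interactions in which every utterance is correct."""
    return sum(all(turns) for turns in interactions) / len(interactions)

# Two interactions: the first fully correct, the second not.
preds = [[True, True], [True, False, True]]
print(question_accuracy(preds))     # 0.8
print(interaction_accuracy(preds))  # 0.5
```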
Table 2 summarizes our results. CDS2S is the context-dependent cross-domain parsing model of Zhang et al. (2019). It is adapted from Suhr et al. (2018) to include a schema encoder, which is necessary for SParC and CoSQL, and uses a turn-level hierarchical encoder to represent the interaction history. We also report model variants where the CDS2S encoder is combined with an LSTM-based decoder, SnipCopy (Suhr et al., 2018), and a grammar-based decoder (Liu et al., 2020). The latter decodes SQL queries as a sequence of grammar rules, rather than tokens. We compare the above systems with three variants of our MemCE model that differ in their use of an LSTM decoder, SnipCopy, and the grammar-based decoder of Liu et al. (2020).
Table 2: Query (Q) and Interaction (I) accuracy for SParC and CoSQL. We report results on the development (D) and test (T) sets. SParC-DI is our domain-independent split of SParC. HE is a hierarchical encoder and Mem is the proposed memory-based context encoder. LSTM is a vanilla decoder, SnipCopy copies SQL segments from the previous query, and Grammar refers to a decoder that outputs a sequence of grammar rules rather than tokens. Table cells are filled with “—” whenever results are not available.
| Model | Enc-Dec | CoSQL(D): Q | CoSQL(D): I | CoSQL(T): Q | CoSQL(T): I | SParC(D): Q | SParC(D): I | SParC(T): Q | SParC(T): I | SParC-DI(T): Q | SParC-DI(T): I |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CDS2S | HE-LSTM | 13.8 | 2.1 | 13.9 | 2.6 | 21.9 | 8.1 | 23.2 | 7.5 | 39.5 | 20.1 |
| CDS2S | HE-SnipCopy | 12.3 | 2.1 | — | — | 21.7 | 9.5 | 20.3 | 8.1 | 38.7 | 24.0 |
| Liu et al. (2020) | HE-Grammar | 33.5 | 9.6 | — | — | 41.8 | 20.6 | — | — | 57.1 | 35.3 |
| MemCE+CDS2S | Mem-LSTM | 13.4 | 3.4 | — | — | 21.2 | 8.8 | — | — | 41.3 | 22.9 |
| MemCE+CDS2S | Mem-SnipCopy | 13.1 | 2.7 | — | — | 21.4 | 10.9 | — | — | 41.5 | 26.7 |
| MemCE+Liu et al. (2020) | Mem-Grammar | 32.8 | 10.6 | 28.4 | 6.2 | 42.4 | 21.1 | 40.3 | 16.7 | 55.7 | 36.3 |
Table 3: Ablation results with the SnipCopy decoder on the ATIS development set.
| Model | Query | Denotation (Relaxed) | Denotation (Strict) |
|---|---|---|---|
| MemCE+SnipCopy | 39.1 | 65.5 | 65.2 |
| Without memory controller | 34.3 | 58.7 | 58.1 |
| Phrases are utterance tokens | 37.2 | 61.9 | 61.7 |
| Phrases are full utterances | 36.8 | 64.2 | 63.9 |
Table 4: Statistics for the SParC-DI domain-independent split, which has 157 domains in total.
| | Train | Dev | Test |
|---|---|---|---|
| #Interactions | 2869 | 290 | 290 |
| #Utterances | 8535 | 851 | 821 |
Across models and datasets we observe that MemCE improves performance, which suggests that it better captures contextual information as an independent language modeling component. We observe that benefits from our memory-based encoder persist across domains and data splits even when sophisticated strategies like grammar-based decoding are adopted.
6 Analysis
In this section, we analyze our model’s ability to handle important discourse phenomena such as focus shift, referring expressions, and ellipsis. We also showcase its interpretability by examining the behavior of the (learned) memory controller.
6.1 Focus Shift
Our linguistic analysis took place on 20 interactions [4] randomly sampled from the ATIS development set (134 utterances in total). Table 5 shows overall performance statistics for MemCE (Mem-LSTM) and Suhr et al. (2018) (HE-SnipCopy) on our sample. We annotated the focus of attention in each utterance (underlined in the example below), which we operationalized as the most salient entity (e.g., city) within the utterance (Grosz et al., 1995). Focus shift occurs when the attention transitions from one entity to another. In the interaction below the focus shifts from flights in Q2 to cities in Q3.
Handling focus shift has been problematic in the context of semantic parsing (Suhr et al., 2018). In our sample, 41.8% of utterances displayed focus shift. Our model was able to correctly parse all utterances in the interaction above and is more adept at handling focus shift than related systems (Suhr et al., 2018). Table 5 reports denotation and query accuracy on our analysis sample.
Table 5: Model accuracy on specific phenomena (20 interactions, ATIS dev set).
| Phenomenon | MemCE: Denotation | MemCE: Query | Suhr et al. (2018): Denotation | Suhr et al. (2018): Query |
|---|---|---|---|---|
| Focus Shift | 80.4 | 50.0 | 76.7 | 44.6 |
| Referring Exp | 80.0 | 40.0 | 70.0 | 20.0 |
| Ellipsis | 69.4 | 33.3 | 66.6 | 25.0 |
| Independent | 81.4 | 61.1 | 81.3 | 62.7 |
6.2 Referring Expressions and Ellipsis
Ellipsis refers to the omission of information from an utterance that can be recovered from the context. In the interaction below, Q2 and Q3 exemplify nominal ellipsis: the NP all flights from Long Beach to Memphis is elided and ideally should be recovered from the discourse in order to generate correct SQL queries. Q4 is an example of coreference: they refers to the answer of Q3. However, it can also be recovered by considering all previous utterances (i.e., Where do they [flights from Long Beach to Memphis; any day] stop). Because our model explicitly stores information in context, it is able to parse utterances like Q2 and Q4 correctly.
In our ATIS sample, 26.8% of the utterances exhibited ellipsis and 7.5% contained referring expressions. Results in Table 5 show that MemCE is able to better handle both such cases.
6.3 Memory Interpretation
In this section we delve into the memory controller with the aim of understanding what kind of patterns it learns and where it fails. In Figure 4, we visualize the content of memory for an interaction (top row) from the ATIS development set consisting of seven utterances [5]. Each column in Figure 4 shows the content of memory after processing the corresponding utterance in the interaction. The bottom row indicates whether the final output was correct (✓) or not (✗). For the purpose of clear visualization, we took the max instead of the softmax in Equation (8) to obtain the memory state at any time step.
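As a sketch of this visualization step (Equation (8) is not reproduced here; write_logits is a hypothetical phrase-by-slot score matrix standing in for the controller's output):

```python
import torch

# Hypothetical [num_phrases x num_slots] write scores; in the model these
# are normalized with a softmax (Equation (8)) during training.
write_logits = torch.randn(5, 8)
soft = torch.softmax(write_logits, dim=-1)  # soft weights used by the model
hard = soft.argmax(dim=-1)                  # hard assignment, for plotting only
for t, slot in enumerate(hard.tolist()):
    print(f"phrase {t} -> memory slot {slot}")
```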
Figure 4: Visualization of the memory matrix. Rows represent memory content and columns represent utterance time steps. The top row shows the utterances being processed. Each row is marked with a memory slot number representing the content of memory in that slot; empty slots are marked with ϕ. The bottom row shows whether the utterance was parsed correctly (✓) or not (✗). Highlighted cells mark stale content in memory with respect to the current utterance and incorrect substitutions.
Q2 presents an interesting case for our model: it is not obvious whether Continental airlines from Q1 should be carried forward while processing Q2. The latter is genuinely ambiguous: it could be referring to Continental airlines flights or to flights by any carrier leaving from Seattle to Chicago. If we assume the second interpretation, then Q2 is more or less semantically complete and independent of Q1. Forty-four percent of utterances in our ATIS sample are semantically complete. Although we do not explicitly handle such utterances, our model is able to parse many of them correctly because they usually repeat the information mentioned in previous discourse as a single query (see Table 5). Q2 also shows that the memory controller is able to learn the similarity between long phrases: on 1993 February twenty Seventh ⇔ Show 1993 February twenty eighth flights. It also demonstrates a degree of semantic understanding: it replaces from Chicago with from Seattle in order to process utterance Q2, rather than simply relying on entity matching.
Figure 4 further shows the kind of mistakes the controller makes, which are mostly due to stale content in memory. In utterance Q6 the memory carries over the constraint after 1500 hours from the previous utterance, which is not valid since Q6 explicitly states Show all …flights on Continental. At the same time, the constraints from Seattle and to Chicago should carry forward. Knowing which content to keep or discard makes the task challenging.
Another cause of errors relates to reinstating previously nullified constraints. In the interaction below, Q3 reinstates from Seattle to Chicago: the focus shifts from flights in Q1 to ground transportation in Q2 and then back to flights in Q3.
Handling these issues together necessitates a non-trivial way of managing context. Given that our model is trained in an end-to-end fashion, it is encouraging to observe a one-to-one correspondence between memory and the final output, which supports our hypothesis that explicitly modeling language context is helpful.
7 Conclusions
In this paper, we presented a memory-based model for context-dependent semantic parsing and evaluated its performance on a text-to-SQL task. Analysis of model output revealed that our approach is able to handle several discourse-related phenomena to a large extent. We also analyzed the behavior of the memory controller and observed that it correlates with the model's output decisions. Our study indicates that explicitly modeling context can be helpful for contextual language processing tasks. Our model manipulates information at the phrase level, which can be too rigid for fine-grained updates. In the future, we would like to experiment with learning the right level of utterance segmentation for context modeling, as well as learning when to reinstate a constraint.
Acknowledgment
We thank Mike Lewis, Miguel Ballesteros, and our anonymous reviewers for their feedback. We are grateful to Alex Lascarides and Ivan Titov for their comments on the paper. This work was supported in part by Huawei and the UKRI Centre for Doctoral Training in Natural Language Processing (grant EP/S022481/1). Lapata acknowledges the support of the European Research Council (award number 681760, “Translating Multiple Modalities into Text”).
Notes
1. In experiments we found that using the (raw) memory directly is empirically better than encoding it with an LSTM.
2. We only considered training and development instances as the test set is not publicly available.
3. Predicted queries are decomposed into different SQL clauses and scores are computed for each clause separately.
4. Interactions with fewer than two utterances were discarded.
5. Q4 was repeated in the dataset. We do the same to maintain consistency and to observe the effect of repetition.