Graph Convolutional Network with Sequential Attention for Goal-Oriented Dialogue Systems

Abstract
Domain-specific goal-oriented dialogue systems typically require modeling three types of inputs, namely, (i) the knowledge base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances, and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and the sentences in the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling, and document dating, we propose a memory-augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we can use the global word co-occurrence graph to enrich the representations of utterances. We experiment with four datasets: (i) the modified DSTC2 dataset, (ii) recently released code-mixed versions of the DSTC2 dataset in four languages, (iii) the Wizard-of-Oz style Cam676 dataset, and (iv) the Wizard-of-Oz style MultiWOZ dataset. On all four datasets our method outperforms existing methods on a wide range of evaluation metrics.


Introduction
Goal-oriented dialogue systems that can assist humans in various day-to-day activities have widespread applications in several domains such as e-commerce, entertainment, healthcare, and so forth. For example, such systems can help humans in scheduling medical appointments, reserving restaurants, or booking tickets. From a modeling perspective, one clear advantage of dealing with domain-specific goal-oriented dialogues is that the vocabulary is typically limited, the utterances largely follow a fixed set of templates, and there is associated domain knowledge that can be exploited. More specifically, there is some structure associated with the utterances as well as the knowledge base (KB). More formally, the task here is to generate the next response given (i) the previous utterances in the conversation history, (ii) the current user utterance (known as the query), and (iii) the entities and their relationships in the associated knowledge base. Current state-of-the-art methods (Seo et al., 2017; Madotto et al., 2018) typically use variants of Recurrent Neural Networks (RNNs) (Elman, 1990) to encode the history and current utterance or an external memory network (Sukhbaatar et al., 2015) to encode them along with the entities in the knowledge base. The encodings of the utterances and memory elements are then suitably combined using an attention network and fed to the decoder to generate the response, one word at a time. However, these methods do not exploit the structure in the knowledge base as defined by entity-entity relations and the structure in the utterances as defined by a dependency parse.
Such structural information can be exploited to improve the performance of the system, as demonstrated by recent works on syntax-aware neural machine translation (Eriguchi et al., 2016; Bastings et al., 2017; Chen et al., 2017), semantic role labeling, and document dating (Vashishth et al., 2018), which use Graph Convolutional Networks (GCNs) (Defferrard et al., 2016; Duvenaud et al., 2015; Kipf and Welling, 2017) to exploit sentence structure.
In this work, we propose to use such graph structures for goal-oriented dialogues. In particular, we compute the dependency parse tree for each utterance in the conversation and use a GCN to capture the interactions between words. This allows us to capture interactions between distant words in the sentence as long as they are connected by a dependency relation. We also use GCNs to encode the entities of the KB, where the entities are treated as nodes and their relations as edges of the graph. Once we have a richer structure-aware representation for the utterances and the entities, we use a sequential attention mechanism to compute an aggregated context representation from the GCN node vectors of the query, history, and entities. Further, we note that in certain situations, such as when the conversation is in a code-mixed language or a language for which parsers are not available, it may not be possible to construct a dependency parse for the utterances. To overcome this, we construct a co-occurrence matrix from the entire corpus and use this matrix to impose a graph structure on the utterances. More specifically, we add an edge between two words in a sentence if they co-occur frequently in the corpus. Our experiments suggest that this simple strategy acts as a reasonable substitute for dependency parse trees.
We perform experiments with the modified DSTC2 (Bordes et al., 2017) dataset, which contains goal-oriented conversations for making restaurant reservations. We also use its recently released code-mixed versions (Banerjee et al., 2018), which contain code-mixed conversations in four different languages: Hindi, Bengali, Gujarati, and Tamil. We compare with recent state-of-the-art methods and show that on average, the proposed model gives an improvement of 2.8 BLEU points and 2 ROUGE points. We also perform experiments on two human-human dialogue datasets of different sizes: (i) Cam676 (Wen et al., 2017): a small-scale dataset containing 676 dialogues from the restaurant domain; and (ii) MultiWOZ (Budzianowski et al., 2018): a large-scale dataset containing around 10k dialogues and spanning multiple domains for each dialogue. On these two datasets as well, we observe a similar trend, wherein our model outperforms existing methods.
Our contributions can be summarized as follows: (i) We use GCNs to incorporate structural information for encoding query, history, and KB entities in goal-oriented dialogues; (ii) We use a sequential attention mechanism to obtain query-aware and history-aware context representations; (iii) We leverage co-occurrence frequencies and PPMI (positive pointwise mutual information) values to construct contextual graphs for code-mixed utterances; and (iv) We show that the proposed model obtains state-of-the-art results on four different datasets spanning five different languages.

Related Work
In this section, we review the previous work in goal-oriented dialogue systems and describe the introduction of GCNs in NLP.
Goal-Oriented Dialogue Systems: Initial goal-oriented dialogue systems (Young, 2000; Williams and Young, 2007) were based on dialogue state tracking (Williams et al., 2013; Henderson et al., 2014a,b) and included pipelined modules for natural language understanding, dialogue state tracking, policy management, and natural language generation. Later work used neural networks for these intermediate modules but still lacked absolute end-to-end trainability. Such pipelined modules were restricted by the fixed slot-structure assumptions on the dialogue state and required per-module labeling. To mitigate this problem, Bordes et al. (2017) released a version of a goal-oriented dialogue dataset that focuses on the development of end-to-end neural models. Such models need to reason over the associated KB triples and generate responses directly from the utterances without any additional annotations. For example, Bordes et al. (2017) proposed a Memory Network (Sukhbaatar et al., 2015) based model to match the response candidates with the multi-hop attention weighted representation of the conversation history and the KB triples in memory. Liu and Perez (2017) further added highway (Srivastava et al., 2015) and residual connections (He et al., 2016) to the memory network in order to regulate the access to the memory blocks. Seo et al. (2017) developed a variant of the RNN cell that computes a refined representation of the query over multiple iterations before querying the memory. However, all these approaches retrieve the response from a set of candidate responses, and such a candidate set is not easy to obtain for any new domain of interest. To account for this, Zhao et al. (2017), among others, adapted RNN-based encoder-decoder models to generate appropriate responses instead of retrieving them from a candidate set. A key-value memory network based generative model was also introduced, which integrates the underlying KB with RNN-based encode-attend-decode models. Madotto et al. (2018) used memory networks on top of the RNN decoder to tightly integrate KB entities with the decoder in order to generate more informative responses. However, as opposed to our work, all these works ignore the underlying structure of the entity-entity graph of the KB and the syntactic structure of the utterances.

GCNs in NLP:
Recently, there has been an active interest in enriching existing encode-attend-decode models (Bahdanau et al., 2015) with structural information for various NLP tasks. Such structure is typically obtained from the constituency and/or dependency parse of sentences. The idea is to treat the output of a parser as a graph and use an appropriate network to capture the interactions between the nodes of this graph. For example, Eriguchi et al. (2016) and Chen et al. (2017) showed that incorporating such syntactical structures as Tree-LSTMs in the encoder can improve the performance of neural machine translation. Peng et al. (2017) use Graph-LSTMs to perform cross-sentence n-ary relation extraction and show that their formulation is applicable to any graph structure and that Tree-LSTMs can be thought of as a special case of it. In parallel, Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2017) and their variants (Li et al., 2016) have emerged as state-of-the-art methods for computing representations of entities in a knowledge graph. They provide a more flexible way of encoding such graph structures by capturing multi-hop relationships between nodes. This has led to their adoption for various NLP tasks such as neural machine translation (Marcheggiani et al., 2018; Bastings et al., 2017), semantic role labeling, document dating (Vashishth et al., 2018), and question answering (Johnson, 2017; De Cao et al., 2019).
To the best of our knowledge, ours is the first work that uses GCNs to incorporate dependency structural information and the entity-entity graph structure in a single end-to-end neural model for goal-oriented dialogues. This is also the first work that incorporates contextual co-occurrence information for code-mixed utterances, for which no dependency structures are available.

Background
In this section, we describe GCNs (Kipf and Welling, 2017) for undirected graphs and then describe their syntactic versions, which work with directed labeled edges of dependency parse trees.

GCN for Undirected Graphs
Graph convolutional networks operate on a graph structure and compute representations for the nodes of the graph by looking at the neighborhood of each node. We can stack $k$ layers of GCNs to account for neighbors that are $k$ hops away from the current node. Formally, let $G = (V, E)$ be an undirected graph, where $V$ is the set of nodes (let $|V| = n$) and $E$ is the set of edges. Let $X \in \mathbb{R}^{n \times m}$ be the input feature matrix for the $n$ nodes, where each node $u \in V$ is represented by an $m$-dimensional feature vector $x_u$. The output of a 1-layer GCN is the hidden representation matrix $H \in \mathbb{R}^{n \times d}$, where each $d$-dimensional node representation captures the interactions with its 1-hop neighbors. Each row of this matrix can be computed as:
$$h_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W x_u + b)\Big), \quad \forall v \in V. \tag{1}$$
Here $W \in \mathbb{R}^{d \times m}$ is the model parameter matrix, $b \in \mathbb{R}^{d}$ is the bias vector, and ReLU is the rectified linear unit activation function. $N(v)$ is the set of neighbors of node $v$ and is assumed to also include the node $v$, so that the previous representation of the node $v$ is also considered while computing its new hidden representation. To capture interactions with nodes that are multiple hops away, multiple layers of GCNs can be stacked together. Specifically, the representation of node $v$ after the $k$-th GCN layer can be formulated as:
$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W^k h_u^k + b^k)\Big), \quad \forall v \in V. \tag{2}$$
Here $h_u^k$ is the representation of the $u$-th node in the $(k-1)$-th GCN layer and $h_u^1 = x_u$.
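To make the neighborhood aggregation concrete, the following is a minimal pure-Python sketch of the GCN layer described above, stacked for multiple hops. The function names (`gcn_layer`, `gcn`) and the toy path graph are illustrative assumptions, not part of the released model code; weights are fixed by hand rather than learned.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]

def affine(W, b, x):
    """Return W x + b, with W given as a list of d rows of length m."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def gcn_layer(features, edges, W, b):
    """One GCN layer on an undirected graph given as (u, v) index pairs."""
    n = len(features)
    neighbors = [{v} for v in range(n)]  # N(v) includes v itself
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    out = []
    for v in range(n):
        total = [0.0] * len(b)
        for u in neighbors[v]:
            total = [t + m for t, m in zip(total, affine(W, b, features[u]))]
        out.append(relu(total))
    return out

def gcn(features, edges, layers):
    """Stack layers; each entry of `layers` is a (W, b) pair for one hop."""
    h = features
    for W, b in layers:
        h = gcn_layer(h, edges, W, b)
    return h

# Toy path graph 0 - 1 - 2 with 1-dimensional features and identity weights.
features = [[1.0], [2.0], [3.0]]
edges = [(0, 1), (1, 2)]
identity_layer = ([[1.0]], [0.0])
one_hop = gcn(features, edges, [identity_layer])
two_hop = gcn(features, edges, [identity_layer, identity_layer])
print(one_hop)  # [[3.0], [6.0], [5.0]]
print(two_hop)  # [[9.0], [14.0], [11.0]]
```

After one layer, node 0 only sees itself and node 1; after two layers its value also depends on node 2, which is two hops away.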

Syntactic GCN
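In a directed labeled graph $G = (V, E)$, each edge between nodes $u$ and $v$ is represented by a triple $(u, v, L(u, v))$, where $L(u, v)$ is the associated edge label. GCNs have been modified to operate over such directed labeled graphs, for example the dependency parse tree of a sentence. For such a tree, in order to allow information to flow from heads to dependents and vice versa, inverse dependency edges from dependents to heads, such as $(v, u, L(u, v))$, are added to $E$, and the model parameters and biases are made label-specific. In this formulation,
$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(W^k_{L(u,v)} h_u^k + b^k_{L(u,v)}\big)\Big), \quad \forall v \in V, \tag{3}$$
where $W^k_{L(u,v)}$ and $b^k_{L(u,v)}$ are label-specific. If there are $L$ different labels, this formulation requires $L$ weights and biases per GCN layer, resulting in a large number of parameters. To avoid this, only three sets of weights and biases are used per GCN layer (as opposed to $L$), depending on the direction in which the information flows. More specifically, $W^k_{L(u,v)} = W^k_{dir(u,v)}$, where $dir(u, v)$ indicates whether the information flows from $u$ to $v$, from $v$ to $u$, or $u = v$. The biases are likewise made direction-specific, $b^k_{L(u,v)} = b^k_{dir(u,v)}$, instead of having a separate bias per label. The final GCN formulation can thus be described as:
$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(W^k_{dir(u,v)} h_u^k + b^k_{dir(u,v)}\big)\Big), \quad \forall v \in V. \tag{4}$$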
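The direction-specific variant can be sketched as follows. This is an illustrative pure-Python example, assuming three parameter sets keyed by the hypothetical names `"along"` (head to dependent), `"inverse"` (dependent to head), and `"self"`; the weights are chosen by hand so that each direction's contribution is visible in the output.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]

def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def syntactic_gcn_layer(features, dep_edges, params):
    """One direction-specific GCN layer over a dependency tree.

    dep_edges: (head, dependent) index pairs.
    params: dict mapping "along", "inverse", "self" to (W, b) pairs.
    """
    n = len(features)
    # incoming[v] lists (source node, direction) pairs, starting with the self-loop.
    incoming = [[(v, "self")] for v in range(n)]
    for head, dep in dep_edges:
        incoming[dep].append((head, "along"))    # original edge: head -> dependent
        incoming[head].append((dep, "inverse"))  # added inverse edge
    dim = len(params["self"][1])
    out = []
    for v in range(n):
        total = [0.0] * dim
        for u, direction in incoming[v]:
            W, b = params[direction]
            total = [t + m for t, m in zip(total, affine(W, b, features[u]))]
        out.append(relu(total))
    return out

# Tokens: 0 = "book", 1 = "a", 2 = "table"; edges book -> table, table -> a.
features = [[1.0], [2.0], [3.0]]
dep_edges = [(0, 2), (2, 1)]
params = {
    "self": ([[1.0]], [0.0]),
    "along": ([[10.0]], [0.0]),
    "inverse": ([[100.0]], [0.0]),
}
hidden = syntactic_gcn_layer(features, dep_edges, params)
print(hidden)  # [[301.0], [32.0], [213.0]]
```

Only three `(W, b)` pairs are needed per layer regardless of how many dependency labels the parser produces, which is the parameter saving the text describes.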

Model
We first formally define the task of end-to-end goal-oriented dialogue generation. Each dialogue of $t$ turns can be viewed as a succession of user utterances ($U$) and system responses ($S$) and can be represented as $(U_1, S_1, U_2, S_2, \ldots, U_t, S_t)$. Along with these utterances, each dialogue is also accompanied by $e$ KB triples that are relevant to that dialogue and can be represented as $(k_1, k_2, k_3, \ldots, k_e)$. Each triple is of the form (entity$_1$, relation, entity$_2$). These triples can be represented in the form of a graph $G_K = (V_K, E_K)$, where $V_K$ is the set of all entities and each edge in $E_K$ is of the form (entity$_1$, entity$_2$, relation), where relation signifies the edge label. At any dialogue turn $i$, given (i) the dialogue history $H = (U_1, S_1, U_2, \ldots, S_{i-1})$, (ii) the current user utterance as the query $Q = U_i$, and (iii) the associated knowledge graph $G_K$, the task is to generate the current response $S_i$ that leads to a completion of the goal. As mentioned earlier, we exploit the graph structure in the KB and the syntactic structure in the utterances to generate appropriate responses. Toward this end, we propose a model with the following components for encoding these three types of inputs. The code for the model is released publicly.

Query Encoder
The query $Q = U_i$ is the $i$-th (current) user utterance in the dialogue and contains $|Q|$ tokens. We denote the embedding of the $i$-th token in the query as $q_i$. We first compute the contextual representations of these tokens by passing them through a bidirectional RNN:
$$b_t = \mathrm{BiRNN}_Q(b_{t-1}, q_t). \tag{5}$$
Now, consider the dependency parse tree of the query sentence, denoted by $G_Q = (V_Q, E_Q)$. We use a query-specific GCN to operate on $G_Q$, which takes $\{b_i\}_{i=1}^{|Q|}$ as the input to the first GCN layer. The node representation in the $k$-th hop of the query-specific GCN is computed as:
$$c_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(W^k_{dir(u,v)} c_u^k + g^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_Q, \tag{6}$$
where $W^k_{dir(u,v)}$ and $g^k_{dir(u,v)}$ are edge direction-specific query-GCN weights and biases for the $k$-th hop and $c_u^1 = b_u$.

Dialogue History Encoder
The history $H$ of the dialogue contains $|H|$ tokens and we denote the embedding of the $i$-th token in the history by $p_i$. Once again, we first compute the hidden representations of these tokens using a bidirectional RNN:
$$s_t = \mathrm{BiRNN}_H(s_{t-1}, p_t). \tag{7}$$
We now compute a dependency parse tree for each sentence in the history and collectively represent all the trees as a single graph $G_H = (V_H, E_H)$. Note that this graph will only contain edges between words belonging to the same sentence and there will be no edges between words across sentences. We then use a history-specific GCN to operate on $G_H$, which takes $s_t$ as the input to the first layer. The node representation in the $k$-th hop of the history-specific GCN is computed as:
$$a_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(V^k_{dir(u,v)} a_u^k + o^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_H, \tag{8}$$
where $V^k_{dir(u,v)}$ and $o^k_{dir(u,v)}$ are edge direction-specific history-GCN weights and biases in the $k$-th hop and $a_u^1 = s_u$. Such an encoder with a single hop of GCN is illustrated in Figure 1(b) and the encoder without the BiRNN is depicted in Figure 1(a).

KB Encoder
As mentioned earlier, $G_K = (V_K, E_K)$ is the graph capturing the interactions between the entities in the knowledge graph associated with the dialogue. Let there be $m$ such entities and let $e_i$ denote the embedding of the node corresponding to the $i$-th entity. We then operate a KB-specific GCN on these entity representations to obtain refined representations that capture relations between entities. The node representation in the $k$-th hop of the KB-specific GCN is computed as:
$$r_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(U^k_{dir(u,v)} r_u^k + z^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_K, \tag{9}$$
where $U^k_{dir(u,v)}$ and $z^k_{dir(u,v)}$ are edge direction-specific KB-GCN weights and biases in the $k$-th hop and $r_u^1 = e_u$. We also add inverse edges to $E_K$, similar to the case of syntactic GCNs, in order to allow information flow in both directions for an entity pair in the knowledge graph.
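As an illustration of how KB triples might be turned into the graph that the KB-GCN operates on, here is a small sketch. The helper name `build_kb_graph` and the direction tags `"along"` and `"inverse"` are assumptions made for this example; the triples are borrowed from the restaurant example used in the paper's qualitative comparison.

```python
def build_kb_graph(triples):
    """Map (entity1, relation, entity2) triples to node ids and directed edges.

    Returns (entity_ids, edges), where each edge is
    (source id, target id, relation, direction) and an inverse edge is
    added for every triple so information can flow both ways.
    """
    entity_ids = {}
    edges = []
    for head, relation, tail in triples:
        for entity in (head, tail):
            entity_ids.setdefault(entity, len(entity_ids))
        u, v = entity_ids[head], entity_ids[tail]
        edges.append((u, v, relation, "along"))
        edges.append((v, u, relation, "inverse"))
    return entity_ids, edges

triples = [
    ("pizza hut city centre", "r cuisine", "italian"),
    ("pizza hut city centre", "r price", "cheap"),
    ("ask", "r cuisine", "italian"),
]
entity_ids, edges = build_kb_graph(triples)
print(entity_ids)
# {'pizza hut city centre': 0, 'italian': 1, 'cheap': 2, 'ask': 3}
print(len(edges))  # 6
```

With the inverse edges in place, the two restaurants are two hops apart through the shared `italian` node, so a 2-hop GCN could in principle relate them.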

Sequential Attention
We use an RNN decoder to generate the tokens of the response and let the hidden states of the decoder be denoted as $\{d_t\}_{t=1}^{T}$, where $T$ is the total number of decoder time steps. In order to obtain a single representation of the node vectors from the final layer ($k = f$) of the query-GCN, we use an attention mechanism as described below:
$$\mu_{jt} = v_1^{\top} \tanh(W_1 c_j^f + W_2 d_{t-1}), \quad \alpha_t = \mathrm{softmax}(\mu_t), \quad h_t^Q = \sum_{j=1}^{|Q|} \alpha_{jt} c_j^f.$$
Here $v_1$, $W_1$, and $W_2$ are parameters. Further, at each decoder time step, we obtain a query-aware representation from the final layer of the history-GCN by computing an attention score for each node/token in the history based on the query context vector $h_t^Q$, as shown below:
$$\nu_{jt} = v_2^{\top} \tanh(W_3 a_j^f + W_4 d_{t-1} + W_5 h_t^Q), \quad \beta_t = \mathrm{softmax}(\nu_t), \quad h_t^H = \sum_{j=1}^{|H|} \beta_{jt} a_j^f.$$
Here $v_2$, $W_3$, $W_4$, and $W_5$ are parameters. Finally, we obtain a query- and history-aware representation of the KB by computing an attention score over all the nodes in the final layer of the KB-GCN using $h_t^Q$ and $h_t^H$, as shown below:
$$\omega_{jt} = v_3^{\top} \tanh(W_6 r_j^f + W_7 d_{t-1} + W_8 h_t^Q + W_9 h_t^H), \quad \gamma_t = \mathrm{softmax}(\omega_t), \quad h_t^K = \sum_{j=1}^{m} \gamma_{jt} r_j^f.$$
Here $v_3$, $W_6$, $W_7$, $W_8$, and $W_9$ are parameters. This sequential attention mechanism is illustrated in Figure 2. For simplicity, we depict the GCN and RNN+GCN encoders as blocks. The internal structure of these blocks is shown in Figure 1.
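The chaining of the three attention steps can be sketched in pure Python as below. This is a hedged illustration only: it assumes an additive (tanh) scoring function, the helper names `attend` and `sequential_attention` are hypothetical, and the parameter dictionary keys simply mirror the parameter names in the text ($v_1$ to $v_3$, $W_1$ to $W_9$).

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(nodes, v, W_node, conditioners):
    """Additive attention over node vectors (assumed scoring form).

    conditioners: list of (W, vector) pairs added inside the tanh,
    e.g. the previous decoder state and earlier context vectors.
    """
    shift = [0.0] * len(v)
    for W, x in conditioners:
        shift = [s + p for s, p in zip(shift, matvec(W, x))]
    scores = []
    for node in nodes:
        hidden = [math.tanh(a + s) for a, s in zip(matvec(W_node, node), shift)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    context = [sum(w * node[i] for w, node in zip(weights, nodes))
               for i in range(len(nodes[0]))]
    return context, weights

def sequential_attention(query_nodes, history_nodes, kb_nodes, d_prev, p):
    """Query context first, then history given query, then KB given both."""
    h_q, _ = attend(query_nodes, p["v1"], p["W1"], [(p["W2"], d_prev)])
    h_h, _ = attend(history_nodes, p["v2"], p["W3"],
                    [(p["W4"], d_prev), (p["W5"], h_q)])
    h_k, _ = attend(kb_nodes, p["v3"], p["W6"],
                    [(p["W7"], d_prev), (p["W8"], h_q), (p["W9"], h_h)])
    return h_q, h_h, h_k

# Toy usage with 2-dimensional vectors and identity weight matrices.
eye = [[1.0, 0.0], [0.0, 1.0]]
p = {name: eye for name in ("W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9")}
p.update({"v1": [1.0, 1.0], "v2": [1.0, 1.0], "v3": [1.0, 1.0]})
h_q, h_h, h_k = sequential_attention(
    query_nodes=[[1.0, 0.0], [0.0, 1.0]],
    history_nodes=[[0.5, 0.5]],
    kb_nodes=[[1.0, 2.0], [1.0, 2.0]],
    d_prev=[0.0, 0.0],
    p=p,
)
```

The ordering matters: the history attention is conditioned on the query context, and the KB attention is conditioned on both, which is what makes the mechanism sequential rather than three independent attentions.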

Decoder
The decoder is conditioned on two components: (i) the context that contains the history and the KB and (ii) the query that is the last/previous utterance in the dialogue. We use an aggregator that learns the overall attention to be given to the history and KB components. These attention scores, $\theta_t^H$ and $\theta_t^K$, are dependent on the respective context vectors and the previous decoder state $d_{t-1}$. The final context vector is then obtained by combining these components, with $[;]$ denoting the concatenation operator. At every time step, the decoder then computes a probability distribution $P_{vocab}$ over the entire vocabulary, where $w_t$ is the decoder input at time step $t$ and $V$ and $b$ are parameters. The loss for time step $t$ is the cross-entropy loss $l_t = -\log P_{vocab}(w_t^*)$, where $w_t^*$ is the $t$-th word in the ground truth response. The total loss is an average of the per-time-step losses.
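The averaging of per-time-step losses can be illustrated with a short sketch. It assumes the distribution over the vocabulary is a softmax over decoder logits and that the per-step loss is the negative log-likelihood of the ground truth word; the function name `response_loss` is hypothetical.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def response_loss(step_logits, target_ids):
    """Average negative log-likelihood over decoder time steps.

    step_logits: one list of vocabulary logits per time step.
    target_ids: index of the ground truth word at each time step.
    """
    losses = []
    for logits, target in zip(step_logits, target_ids):
        p_vocab = softmax(logits)
        losses.append(-math.log(p_vocab[target]))
    return sum(losses) / len(losses)

# With uniform logits over a 4-word vocabulary, every step costs log(4).
uniform_loss = response_loss([[0.0] * 4, [0.0] * 4], [1, 3])
print(uniform_loss)
```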

Contextual Graph Creation
For the dialogue history and query encoder, we used the dependency parse tree for capturing structural information in the encodings. However, if the conversations occur in a language for which no dependency parsers exist, for example, code-mixed languages like Hinglish (Hindi-English) (Banerjee et al., 2018), then we need an alternate way of extracting a graph structure from the utterances. One simple solution that has worked well in practice is to create a word co-occurrence matrix from the entire corpus, where the context window is an entire sentence. Once we have such a co-occurrence matrix, for a given sentence we can add an edge between two words if their co-occurrence frequency is above a threshold value. The co-occurrence matrix can either contain co-occurrence frequency counts or positive pointwise mutual information (PPMI) values (Church and Hanks, 1990; Dagan et al., 1993; Niwa and Nitta, 1994).
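A minimal sketch of this contextual graph construction is given below. It makes several assumptions that are not specified in the text: probabilities for PPMI are estimated from sentence-level counts, each word is counted once per sentence, and the toy corpus, function names, and threshold values are purely illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(corpus):
    """Count sentence-level co-occurrences; corpus is a list of token lists."""
    pair_counts = Counter()
    word_counts = Counter()
    for sentence in corpus:
        vocab = sorted(set(sentence))
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))  # keys are sorted word pairs
    return pair_counts, word_counts

def ppmi(pair_counts, word_counts, num_sentences):
    """Positive PMI per word pair, with probabilities estimated over sentences."""
    scores = {}
    for (a, b), joint in pair_counts.items():
        pmi = math.log((joint * num_sentences) / (word_counts[a] * word_counts[b]))
        scores[(a, b)] = max(0.0, pmi)
    return scores

def contextual_edges(sentence, scores, threshold):
    """Connect token positions i < j whose word pair scores above the threshold."""
    edges = []
    for i, j in combinations(range(len(sentence)), 2):
        key = tuple(sorted((sentence[i], sentence[j])))
        if key[0] != key[1] and scores.get(key, 0.0) > threshold:
            edges.append((i, j))
    return edges

corpus = [
    ["cheap", "restaurant", "chahiye"],
    ["cheap", "restaurant", "north"],
    ["phone", "number", "chahiye"],
    ["phone", "number", "north"],
]
pair_counts, word_counts = cooccurrence_counts(corpus)
ppmi_scores = ppmi(pair_counts, word_counts, len(corpus))
sentence = ["cheap", "restaurant", "chahiye"]
print(contextual_edges(sentence, ppmi_scores, 0.5))  # [(0, 1)]
print(contextual_edges(sentence, pair_counts, 1))    # [(0, 1)]
```

Either matrix can be passed to `contextual_edges`; only the meaning of the threshold changes (a raw count in one case, a PPMI value in the other).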

Experimental Setup
In this section, we describe the datasets used in our experiments, the various hyperparameters that we considered, and the models that we compared.

Datasets
The original DSTC2 dataset (Henderson et al., 2014a) was based on the task of restaurant table reservation and contains transcripts of real conversations between humans and bots. The utterances were labeled with dialogue state annotations such as the semantic intent representation, requested slots, and the constraints on the slot values. We report our results on the modified DSTC2 dataset of Bordes et al. (2017), where such annotations are removed and only the raw utterance-response pairs are present with an associated set of KB triples for each dialogue. It contains around 1,618 training dialogues, 500 validation dialogues, and 1,117 test dialogues. For our experiments with contextual graphs we report our results on the code-mixed versions of modified DSTC2, which were recently released by Banerjee et al. (2018). This dataset has been collected by code-mixing the utterances of the

Hyperparameters
We used the same train, test, and validation splits as provided in the original versions of the datasets. We minimized the cross entropy loss using the Adam optimizer (Kingma and Ba, 2015) and tuned the initial learning rates in the range of 0.0006 to 0.001. For regularization we used an L2 penalty of 0.001 in addition to a dropout (Srivastava et al., 2014) of 0.1. We used randomly initialized word embeddings of size 300. The RNN and GCN hidden dimensions were also chosen to be 300. We used GRU (Cho et al., 2014) cells for the RNNs. All parameters were initialized from a truncated normal distribution with a standard deviation of 0.1.

Models Compared
We compare the performance of the following models.
(i) RNN+GCN-SeA vs GCN-SeA: We use RNN+GCN-SeA to refer to the model described in Section 4. Instead of using the hidden representations obtained from the bidirectional RNNs, we also experiment by providing the token embeddings directly to the GCNs, that is, $c_u^1 = q_u$ in equation 6 and $a_u^1 = p_u$ in equation 8. We refer to this model as GCN-SeA.
(ii) Cross edges between the GCNs: In addition to the dependency and contextual edges, we add edges between words in the dialogue history/query and KB entities if a history/query word exactly matches the KB entity. Such edges create a single connected graph that is encoded using a single GCN encoder and then separated into different contexts to compute sequential attention. This model is referred to as RNN+CROSS-GCN-SeA.
(iii) GCN-SeA+Random vs GCN-SeA+Structure: We experiment with the model where the graph is constructed by randomly connecting edges between two words in a context. We refer to this model as GCN-SeA+Random. We refer to the model that either uses dependency or contextual graphs instead of random graphs as GCN-SeA+Structure.
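One way to build such a random graph is sketched below. The number of random edges per context is not stated in the text; this example assumes it is supplied by the caller (for instance, matched to the size of the structured graph), and the function name `random_edges` is hypothetical.

```python
import random
from itertools import combinations

def random_edges(num_tokens, num_edges, seed=None):
    """Sample distinct undirected edges (i, j), i < j, between token positions."""
    rng = random.Random(seed)
    candidates = list(combinations(range(num_tokens), 2))
    return sorted(rng.sample(candidates, min(num_edges, len(candidates))))

# A 5-token context with as many edges as a dependency tree would have (4).
sampled = random_edges(num_tokens=5, num_edges=4, seed=0)
```

Fixing the seed keeps the comparison against the structured graph reproducible across runs.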

Results and Discussions
In this section, we discuss the results of our experiments as summarized in Tables 1-5. We use BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics to evaluate the generation quality of responses. We also report the per-response accuracy, which computes the percentage of responses in which the generated response exactly matches the ground truth response. To evaluate the model's capability of correctly injecting entities in the generated response, we report the entity F1 measure as defined in prior work.
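The two task-specific metrics can be sketched as follows. Per-response accuracy follows the description above directly. The entity F1 shown here is only one plausible reading: it assumes entities are single tokens drawn from a known KB entity set and that counts are micro-averaged over responses; the exact definition used in the experiments may differ.

```python
def per_response_accuracy(predictions, references):
    """Percentage of generated responses that exactly match the reference."""
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)

def entity_f1(predictions, references, kb_entities):
    """Micro-averaged F1 over KB entity tokens (assumed definition)."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_ents = set(pred.split()) & kb_entities
        ref_ents = set(ref.split()) & kb_entities
        tp += len(pred_ents & ref_ents)
        fp += len(pred_ents - ref_ents)
        fn += len(ref_ents - pred_ents)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

kb_entities = {"pizza_hut_city_centre", "italian", "cheap", "expensive"}
references = [
    "pizza_hut_city_centre serves italian food in the cheap price range",
    "you are welcome",
]
predictions = [
    "pizza_hut_city_centre serves italian food in the expensive price range",
    "you are welcome",
]
accuracy = per_response_accuracy(predictions, references)  # 50.0
f1 = entity_f1(predictions, references, kb_entities)       # 2/3
```

In this toy case the first response gets the restaurant and cuisine right but the price wrong, so it counts as a miss for per-response accuracy while still earning partial credit under entity F1.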
Results on En-DSTC2: We compare our model with the previous works on the English version of modified DSTC2 in Table 1. For most of the retrieval-based models, the BLEU or ROUGE scores are not available as they select a candidate from a list of candidates as opposed to generating it. Our model outperforms all of the retrieval- and generation-based models. We obtain a gain of 0.7 in the per-response accuracy compared with the previous retrieval-based state-of-the-art model of Seo et al. (2017), which is a very strong baseline for our generation-based model. We call this a strong baseline because the candidate selection task of this model is easier than the response generation task of our model. We also obtain a gain of 2.8 BLEU points, 2 ROUGE points, and 2.5 entity F1 points compared with current state-of-the-art generation-based models.
Results on code-mixed datasets and effect of using RNNs: The results of our experiments on the code-mixed datasets are reported in Table 2. Our model outperforms the baseline models on all the code-mixed languages. One common observation from the results over all the languages is that RNN+GCN-SeA performs better than GCN-SeA. Similar observations have been reported for semantic role labeling.
Results on Cam676 dataset: The results of our experiments on the Cam676 dataset are reported in Table 3. In order to evaluate goal-completeness, we use two additional metrics.
Table 5: GCN-SeA with random graphs and dependency/contextual graphs on all DSTC2 datasets.
Results on MultiWOZ dataset: The results of our experiments on the two versions of the MultiWOZ dataset are reported in Table 4. The first version (SNG) contains around 3K dialogues in which each dialogue involves only a single domain and the second version (MUL) contains all 10k dialogues. The baseline models do not use an oracle belief state as mentioned in Budzianowski et al. (2018) and therefore are comparable to our model. We observed that with a larger GCN hidden dimension (400d in Table 4) our model is able to provide the correct entities and requestable slots in SNG. On the other hand, with a smaller GCN hidden dimension (100d) we are able to generate fluent responses in SNG. On MUL, our model is able to generate fluent responses but struggles in providing the correct entity, mainly due to the increased complexity of multiple domains. However, our model still provides a high number of correct requestable slots, as shown by the success rate. This is because multiple domains (hotel, restaurant, attraction, hospital) have the same requestable slots (address, phone, postcode).
Effect of using hops: As we increased the number of hops of GCNs (Figure 3), we observed a decrease in the performance. One reason for such a drop in performance could be that the average utterance length is very small (7.76 words). Thus, there is not much scope for capturing distant neighborhood information and more hops can add noisy information. The reduction is more prominent in contextual graphs, in which multi-hop neighbors can turn out to be dissimilar words in different sentences.
Effect of using random graphs: GCN-SeA+Random and GCN-SeA+Structure take the token embeddings directly instead of passing them through an RNN. This ensures that the difference in performance of the two models is not influenced by the RNN encodings. The results are shown in Table 5 and we observe a drop in performance for GCN-SeA+Random across all the languages. This shows that the dependency and contextual structures play an important role and cannot be replaced by random graphs.
Ablations: We experiment with replacing the sequential attention by the Bahdanau attention (Bahdanau et al., 2015). We also experiment with various combinations of RNNs and GCNs as encoders. The results are shown in Table 6. We observed that GCNs do not outperform RNNs independently. In general, RNN-Bahdanau attention performs better than GCN-Bahdanau attention. The sequential attention mechanism outperforms Bahdanau attention as observed from the following comparisons: (i) GCN-Bahdanau attention vs GCN-SeA, (ii) RNN-Bahdanau attention vs RNN-SeA (in BLEU and ROUGE), and (iii) RNN+GCN-Bahdanau attention vs RNN+GCN-SeA. Overall, the best results are always obtained by our final model, which combines RNN, GCN, and sequential attention. We also performed ablations by removing specific parts of the encoder. Specifically, we experiment with (i) query encoder alone, (ii) query + history encoder, and (iii) query + KB encoder. The results shown in Table 7 suggest that the query and the KB are not enough to generate fluent responses and the previous conversation history is essential.
Human evaluations: In order to evaluate the appropriateness of our model's responses compared to the baselines, we perform a human evaluation of the generated responses using in-house evaluators. We evaluated randomly chosen responses from 200 dialogues of En-DSTC2 and 100 dialogues of Cam676 using the method of pairwise comparisons introduced in Serban et al. (2017). We chose the best baseline model for each dataset, namely, HRED for En-DSTC2 and Seq2seq+Attn for Cam676. We show each dialogue context to three different evaluators and ask them to select the most appropriate response in that context. The evaluators were given no information about which model generated which response. They were allowed to choose a tie option if they were not able to decide whether one model's response was better than the other model's. The results reported in Table 8 suggest that our model's responses are favorable in noisy contexts of spontaneous conversations, such as those exhibited in the DSTC2 dataset. However, in a WOZ setting for human-human dialogues, where the conversations are less spontaneous and contexts are properly established, both the models generate appropriate responses.
Table 9: Qualitative comparison of responses between the baselines and different versions of our model
KB Triples:
ask, r cuisine, italian
ask, r location, centre
ask, r phone, ask phone
ask, r address, ask address
ask, r price, cheap
ask, r rating, 0
pizza hut city centre, r cuisine, italian
pizza hut city centre, r location, centre
pizza hut city centre, r phone, pizza hut city centre phone
pizza hut city centre, r address, pizza hut city centre address
pizza hut city centre, r price, cheap
pizza hut city centre, r rating, 9
Seq2seq + Attn: ask is a nice place in the west of town serving tasty italian food
HRED: pizza hut city centre serves italian food in the expensive price range
GCN-SeA: pizza hut city centre serves italian food in the cheap price range
RNN+GCN-SeA: pizza hut city centre is a great restaurant serving cheap italian food in the centre of town
RNN+CROSS-GCN-SeA: pizza hut city centre is a nice place in the centre of town serving tasty italian food
Qualitative analysis: We show the generated responses of the baselines and different versions of our model in Table 9. We see that the Seq2seq+Attn model is not able to suggest a restaurant with a high rating, whereas HRED gets the restaurant right but suggests an incorrect price range. In contrast, RNN+GCN-SeA suggests the correct restaurant with the preferred attributes. Although GCN-SeA selects the correct restaurant, it does not provide the location in its response.

Conclusion
We showed that structure-aware representations are useful in goal-oriented dialogue and that our model outperforms existing methods on four dialogue datasets. We used GCNs to infuse structural information of dependency graphs and contextual graphs to enrich the representations of the dialogue context and KB. We also proposed a sequential attention mechanism for combining the representations of (i) the query (current utterance), (ii) the conversation history, and (iii) the KB. Finally, we empirically showed that when dependency parsers are not available for certain languages, such as code-mixed languages, we can use word co-occurrence frequencies and PPMI values to extract a contextual graph and use such a graph with GCNs for improved performance.