Augmenting Transformers with KNN-Based Composite Memory for Dialog

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.


Introduction
Machine learning solutions to various tasks, such as game-playing or dialogue, are often dependent on external information.This information can take multimodal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities.A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant, and incorporate them into standard architectures on each task.In this work, we focus on human-machine dialog and how to efficiently retrieve external knowledge that is relevant to the task.We consider two scenarios and for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation into real world information.
Knowledge about similar dialog contexts allows for a hybrid retrieval/generative approach to dialog where the system response is generated based not only on a representation of the current dialog context and of the relevant world knowledge, but also based on a response retrieved from a similar dialog context.In this case, the retrieved knowledge can be viewed as providing information about dialog structure and dialog utterances: which type of response is likely given similar context?
External knowledge is also retrieved to improve the semantic content of the dialog model.In one scenario (Wizard of Wikipedia), it is retrieved from a preselected set of Wikipedia sentences associated with the current dialog topic.Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g. by referring to a Star Wars when talking about science fiction).In the other scenario (Engaging ImageChat), the retrieved external knowledge is images and their associated dialogues.By retrieving images that are similar to the image being talked about and their associated dialog, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g. when talking about an image with cats, mentioning their breed).
Previous work has explored incorporating large external memories into neural network layers (Weston et al., 2014;Sukhbaatar et al., 2015Sukhbaatar et al., , 2019;;Lample et al., 2019).Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the the size of the memory grows.In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors search to fetch relevant information.KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks.As the external memories are kept fixed, they do not require any training to learn the memories along with the model.We can thus scale more easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.
Our core contribution proposes an efficient, KNNbased Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures.We apply these flexible modules to two dialogue datasets, challenging tasks where generative models can leverage external information to write coherent, on-topic re-sponses.We show that relevant information can be identified from hundreds of thousands of candidates in a multi-modal, multi-knowledge-source setting to improve the performance of generative dialogue models.Further, the output of the KIF modules is interpretable as specific knowledge is selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance.On both datasets, we achieve state of the art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state of the art retrieval models.

Related Work
We discuss related work on learning to incorporate external knowledge into neural networks and efficiently accessing relevant information.We then describe work in generative dialogue that incorporates knowledge.

Incorporating External Knowledge
Augmenting neural networks with memory, or longer term components that can be accessed with read and write operations, has been explored in various proposed architectures.For example, Memory Networks (Weston et al., 2014;Sukhbaatar et al., 2015Sukhbaatar et al., , 2019) ) introduce attention mechanisms over large external memories.Neural cache models (Grave et al., 2016) simplify these to access previous memories with a dot product.Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016;Graves et al., 2014;Joulin and Mikolov, 2015).In contrast, we focus on how to read very large memories.
Another line of research has focused on computational scalability for larger external memories to allow efficient access of information.For example, Chandar et al. (2016) propose a hierarchical memory network rather than a flat one and Rae et al. (2016) learn sparse operations to read and write.Lample et al. (2019) focus on learning memories of up to one million slots and how to efficiently access the slots using product keys.Khandelwal et al. (2019) use nearest neighbor operations to augment language models.Beyond explicit memory representations, it may be possible to store information implicitly during training time by memorizing common patterns present in text (Petroni et al., 2019).We focus on learning to fetch relevant information from multiple explicit external multi-modal knowledge sources and integrate them into one network.Further, our work allows the retrieved information to be interpreted as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).
Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009;Grave et al., 2017;Chen et al., 2015).Many approximate softmax techniques use KNN-like operations to form clusters, and the overall softmax operation is constrained by the slow calculation of the exponential.Our usage of KNN benefits from efficient and scalable libraries such as faiss and nmslib.

Generative Dialogue
We develop a general architecture for incorporating external information and apply it to the case of generative dialogue models.Previous work in dialogue has leveraged knowledge as necessary information to accomplish the task.For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2016).In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the web.Previous work has employed architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018;Qin et al., 2019;Lian et al., 2019).In this work, we replace the use of attention over external information with the output of a KNN module.Other work has investigated incorporating information retrieval in language modeling and question answering (Chen et al., 2017;Fan et al., 2019;Seo et al., 2019;Guu et al., 2020), while we focus on dialogue applications and flexibly incorporating knowledge from multiple, multimodal sources.
On the modeling side, work has explored both generative (Serban et al., 2016a,b) and retrieval based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialogue response.This often leverages self-attention or crossattention mechanisms (Humeau et al., 2019).Further work has explored hybrid models, for example using the output of a retrieval model as input for a generative model (Dinan et al., 2018;Weston et al., 2018).We extend these approaches by augmenting generative models with retrieval-like operations based on KNN search, allowing dialogue models to flexibly incorporate various sources of external knowledge.

KNN-based Information Fetching Modules
Broadly, the KNN-based Information Fetching (KIF) module assumes a model M can access inputs X = x 1 , x 2 , . . ., x n to produce outputs In a setting without additional supporting information, the model will process inputs to make output predictions: M (x i ) = ôi .However, in many tasks, additional information is present, represented as E = {e 1 , e 2 , . . ., e m }.To incorporate E into M , we encode each element of X and E into a fixed-size vector representation.This can be accomplished in a variety of ways, for example with an encoder neural network.Then, to make predictions, the model encodes x i and uses K Nearest Neighbors to find the closest related information in E. The representations of the identified nearest neighbors are combined in a weighted sum, where each of the k retrieved neighbors is weighted by its similarity to x i .
These operations are differentiable, so they can be incorporated into neural networks in a straightforward way.All elements of the knowledge source E are precomputed and kept fixed -we do not backpropagate to affect the embeddings of the pre-encoded knowledge.However, this lack of backpropagation can introduce a mismatch between the encoding of E and the model that is training, as the training model has constantly changing representations because the weights are being learned.The model must learn a function to align its representations to the external memory.To circumvent this misalignment, we instead learn a mapping operator f E (x) that trains to map elements of the model's representation of X into the information representation space E. Concretely, f E (x) is a multi-layer perceptron with ReLU nonlinearities.From the input elements of X, f E (x) learns a representation of an output close to the corresponding projection of X into E.This can be interpreted as learning a read operation on a fixed external memory.If there was no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function.However, as the model changes significantly during the training process, the nonlinear mapping capability of f E (x) is essential to be able to identify the correct knowledge E from the input X.
Thus, a model augmented with KIF will incorporate external knowledge in the following manner.First, we find the k nearest elements to the projection of x i in E based on KNN search using inner product.Then, the relevant elements are encoded by M .We use the optimized faiss library for KNN search, which can conduct billion-scale KNN efficiently.
These elements are weighted by their nearest neighbor scores and then summed.This is subsequently concatenated to the representation of x i and used by M to form the final prediction: This can be easily extended to using multiple modules simultaneously.For instance, two sources of information, E 1 and E 2 , can be combined by identifying the top candidates of each information source.The weighted sum of the KIF output on each information source is concatenated with x i .
Finally, different sources of information may not be required for every prediction and some information sources can be more important than others.To allow the model to make more fine-grained decisions about what information to use from what source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations.KIF1 i and KIF2 i denote the KIF module from Equation 1applied to E 1 and E 2 respectively.
4 Applying KIF to Dialogue Tasks We describe how to apply our method to the task of generative dialogue, a challenging setting where models autoregressively generate engaging and on-topic responses.We investigate dialogue for two main reasons: first, dialogue agents must be able to consult relevant information to maintain the topic of the conversation.Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialogue utterances from the training set.Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.
KIF for Generative Dialogue In a dialogue setting, x i represents the text of the conversation i.A conversation consists of multiple back-and-forth utterances (or turns).For example, a conversation could consist of 4 turns: ] where x i,4 is the direct utterance the model should respond to, and the earlier utterances are the conversation context.Standard generative dialog models use a Transformer neural network as M and want to produce an output o i that is an appropriate response to the conversation.However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response.To incorporate knowledge, models often concatenate a knowledge source E such as Wikipedia to x i , such that M ([x i , e 1 , e 2 , . . ., e n ]) = ôi , and use attention modules to identify the most relevant knowledge.However, this approach is computationally intensive when handling large quantities of information.Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism is blurry and struggles to make fine-grained decisions (Fan et al., 2018).The same is true for hierarchical approaches, which lack scalability.
We augment Transformer sequence-to-sequence (seq2seq) networks with KIF to create generative dialogue models.We experiment on two dialogue tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging Imagechat (Shuster et al., 2018).In both datasets, models must leverage information external to the dialogue history alone -in Wizard of Wikipedia, the chat requires access to a knowledgeable facts and in Engaging Imagechat, discussion about a specific image.As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.

Wizard of Wikipedia
The goal of the Wizard of Wikipedia dataset is to train knowledgeable agents that can chat in any domain.The dataset contains 1,365 various topics discussed in 18,430 dialogues in the training set, totalling 166,787 training utterances.The topic is included as the first utterance of the conversation.The Wikipedia knowledge is Wikipedia sentences for each dialogue turn, identified by an information retrieval system and released as part of the full dataset.
Our model for Wizard of Wikipedia has access to two sources of external information, E 1 and E 2 : • E1 is Wikipedia Knowledge provided by the dataset as evidence to support knowledgeable chitchat.The scale of this KNN search is to filter through an average of 34 sentences.The KIF module uses dialogue features to fetch relevant knowledge to condition upon to generate the subsequent utterance.• E2 is Training Utterances.To incorporate the benefits of retrieval-based dialogue models to the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input.If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations.For example, likely conversation could occur about the breed of the dog, daily routine with a pet, and similar topics.There are around 170K dialogue utterances as inputs to KNN search.This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with similar structure as the text the model would like to generate.We do not allow the module to fetch the correct response of the current conversation context.
Access to these two sources of knowledge can be seen as learning a template and a topic separately.Sample templates can be identified from the training utterances, and topic-specific information learned by accessing the Wikipedia knowledge.
To better identify relevant training utterances from the large quantity available, we break down x i into conversation sub-features for a more fine-grained match in the KNN search step.We concatenate the encoding of the most recent dialogue utterance (e.g.x i,last ) with the encoding of the dialogue context from the current conversation and the turn number t, such that x i,last , x i,−last , t is the representation used for KNN search.The turn number is represented as an embedding.Concretely, if the model is trying to produce the 5th turn of the conversation, then x i,last is the most recent utterance from the dialogue partner, x i,−last would be the last 3 turns of exchange, and t would be 4.
These are known to be salient conversation features.The most recent dialogue utterance is the direct turn the model is responding to, and the dialogue context may provide additional clues.The turn number is important, as earlier turns are often generic (e.g.how are you doing today) and later turns are more specific.
Engaging ImageChat The goal of Engaging Im-ageChat is to create agents capable of chitchatting about images, selected from the YFFC100M dataset (Thomee et al., 2015).The dataset contains 186,782 dialogues in the training set, each about a unique image, totalling 355,862 utterances.Agents are assigned one of 215 personalities (e.g.sweet, caring, excited) to increase engagingness.We use a Multi-Modal neural network designed to handle both image input and text input.Following Shuster et al. (2018), the images are encoded using a pre-trained ResNeXt network (Xie et al., 2017).To extract the final image representation, we project the 2048-dimensional output of the image encoder to 512-dimensions using a deep multi-layer perceptron with ReLU activation units.The conversation history, which includes the personality, is encoded with a Transformer encoder network.The image and conversation are integrated using the Multimodal-Sum-Combiner module proposed in Shuster et al. (2018).
Our model for Engaging Imagechat has access to two sources of external information, E 1 and E 2 : • E1 is Chat on Similar Images.While there are over 180K different images used in this dataset, many of the images are similar.For example, conversations associated with two pictures of dogs could be relevant to each other.The model is able to use the current image features to fetch from around 180K different images and returns 6 turns of related chat for each image.Fetching from E 1 consists of identifying related image chats, or conversations on related topics (as similar images likely have similar conversations).
• E2 is Training Utterances.Similar to the motivation for the previous dataset, we allow the model to identify training utterances that could be useful for responding in the current conversation.The scale of this fetching task is large: 350K dialogue utterances.This could be interpreted as identifying utterances with similar structure to what the model would like to generate, and is complementary to the topic-based related image chats.
To identify relevant information from training utterances, we use the same dialogue features in the KNN search step, with one modification: we add the personality provided by the dataset.The concatenation of features used for KNN search is: x i,last , x i,−last , t, p where t is the turn number and p is the personality.As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information.For example, conversations with the sweet personality often include similar text such as aww, that's wonderful.

Implementation Details
We use parl.ai(Miller et al., 2017) to implement our models.The data for both datasets used is available for download from parl.ai as well.We use byte-pair encoding (Sennrich et al., 2015) to represent the text to better handle the rare word problem (Dinan et al., 2018;Fan et al., 2017).Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads.We optimize using Adam (Kingma and Ba, 2014) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates.The initial learning rate is 0.0001 and we optimize for model perplexity.We use a dropout of 0.5 and set gradient clipping to 0.1.We set k = 5 for all cases.We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit.The comments are parsed to maintain conversational threads, so the encoder network has been exposed to conversational context at training time.The ResNeXt encoder is pretrained on 3.5 billion images (Mahajan et al., 2018).For both datasets, we model a vocabulary size of 54944 based on the BPE-based vocabulary from the Reddit pretraining.We tuned the learning rate and batchsize hyperparameters together.The model size is not tuned, as it was pre-trained with this size and thus kept fixed.

Model
Test

Evaluation
Generation We generate with beam search, setting the beam size to 4. We use 3-gram blocking.This technique disallows repeated n-grams from being generated multiple times and reduces repetition.
Automatic Metrics Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset.For generative models, utterances are generated using beam search.For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.
In Wizard of Wikipedia, there are two test sets: a set of seen topics, or topics that have been seen at training time with new test-time dialogues.The second set is unseen, or topics that have not been encountered at all during training time.We evaluate on both subsets.

Human Evaluation
We follow the setup and use the analysis questions proposed in the Acute-Eval dialogue evaluation system (Li et al., 2019).For reproducibility, we adopt this existing evaluation setting that has been applied to several dialogue datasets.We collect 100 human-bot conversational dialogues on a crowdsourcing platform for both datasets.The dialogues are eight turns long.Then, we show pairs of the  • Who would you prefer to talk to for a long conversation?• If you had to say one of the speakers is interesting and one is boring, who would you say is more interesting?• Which speaker sounds more human?
• Which speaker has more coherent responses in the conversation?• If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable?(Wizard of Wikipedia only) We measure the percentage of time one model was chosen over the other, taking the majority agreement between three evaluators.To reduce variance, dialogues paired in the evaluation were collected on the same topic for Wizard of Wikipedia and collected on the same image and personalities for Engaging Im-ageChat.Topic and images used are unique and taken randomly from the test set.

Baselines
We compare Transformers augmented with KIF to the state of the art retrieval models published on each dataset.We note that these existing retrieval models are state of the art on both datasets, and have been shown to be strong baselines compared to other retrieval techniques based on TF-IDF (Chen et al., 2017).
We further compare to three additional generative baselines that access knowledge: • Transformer Memory Networks.To contrast the ability of KIF to existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018).These models encode each piece of external information independently with a Transformer Encoder, and these are stored as memory slots.To access information in the memory slots, a model performs dotproduct attention between the memory slots and the dialogue context.In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either a two-stage model where the first model was trained to predict the right knowledge, or an end-to-end model with an auxiliary loss for knowledge prediction accuracy.
• Retrieve and Refine.We implement a hybrid model (Weston et al., 2018) that incorporates top retrieval candidates as additional input to Generative Transformer MemNets.In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder.Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model so there is no backpropagation to affect the retrieval.With KIF, models can alter the retrieved candidates by learning the mapping operator.Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates.
• Response Generation with MR.We implement the model proposed in Qin et al. (2019), which embeds the conversation history and document contextually before decoding with a biLSTM.In Qin et al. (2019), the encodings were using pretrained CoVE vectors (McCann et al., 2017).We found our pretrained Transformer embeddings to work more effectively as they are trained specifically for dialogue.Thus, we modify this baseline to replace CoVE embeddings with domain-specific ones.
All of Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models for fair comparison on modeling quality.
Human: I love watching disney movies!Do you like Disney?Model: I love disney, I love watching disney movies and different animations and characters that make them unique Fetched Knowledge: disney consumer products and interactive media -the company specializes in producing animated films Fetched Training Utterance: i love music, I love going to music shows and dancing and having fun Human: What's your favorite disney movie?There are so many good ones.Model: I love the incredibles, they are my favorite disney movie.
Fetched Knowledge: disney announced intentions to develop additional superhero films after the success of the incredibles Fetched Training Utterance: i love kiteboarding, it is one of my favorite activities on the water.Human: omg, me too! the second one just came out i think.I don't watch a lot of sequels though, I think the last one I saw was toy story.Model: That's cool!I remember when toy story was first released in 1995.
Fetched Knowledge: talk of a second film following toy story began around a month after the film's opening, in december 1995 Fetched Training Utterance: that's neat!i like so many of her songs!i remember when her first album was released in 2008.Human: I've actually never seen toy story before.Kind of embarrassing... Model: Ohhh ... I think toy story is a classic?It was first released in 1999.
Fetched Knowledge: toy story 2 is a 1999 american computer-animated comedy film produced by pixar animation studios for walt disney pictures Fetched Training Utterance: Ohhh ... What do people with color blindness do to cope with the effects?

Results
We describe the results of incorporating KIF modules into Transformer networks.We display an example conversation between a human and our model in We compare to various baselines using automatic and human evaluation, and discuss our experiments.We present various ablation settings to understand the key features that make our method function.

KIF is Effective for Incorporating Knowledge
Automatic Evaluation.Comparing KIF augmented Transformer networks to published baselines and Retrieve and Refine, we find improved results.For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1).A major contributing factor is the construction of the dataset -as each dialogue turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance.Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is worse -it is harder to fetch training utterances for unseen topics.
While Imagechat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2).Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see 1 point improvement.
Human Evaluation.Results are shown in Figure 2. On both datasets, we find there is large im-provement over existing generative models (green bars) that is statistically significant for some of the evaluation questions.Evaluators agree that KIF-augmented Transformers are generally more coherent and humansounding compared to the Generative MemNet.
Compared to existing retrieval models (blue) is more nuanced.Along the lines of existing work (Zhang et al., 2018;Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialogue sounds.This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models.For example, on Engaging ImageChat, while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval based methods in sounding more human or being more interesting (see blue bars in Figure 2, right).
A surprising result is that KIF-augmented Transformers are voted more human sounding than retrieval models on Wizard of Wikipedia.This is because the dataset's human utterances are long and factual due to the tendency of crowdworkers to copy Wikipedia.Sometimes humans chatting with the retrieval bot would respond uh... that's an interesting fact?Otherwise, our model scores similarly to retrieval models, with most of the evaluations not having statistically significant differences.
We conduct a second evaluation on the Unseen Test Set of the Wizard of Wikipedia dataset.Results are shown in Figure 3. Trends are similar compared to the results on the Seen Test set, though the preference for the KIF-augmented Transformer is greater over the retrieval baseline.We hypothesize that because the Un-Human: Hey, how are you doing Fetched Training Utterances: I'm great, thanks for asking.Craving some chocolate.Do you like chocolate?
Hello, how is it going?I know some trivia about this movie Hello, its lunch time here, and Im in the mood for a great steak   seen Test Set is on entirely held out topics, the retrieval baseline can struggle to identify relevant utterances.In contrast, the KIF-augmented Transformer, similar to the generative baseline from Dinan et al. (2018), can use the generative capability to produce utterances.Lastly, we conduct an additional studies to examine the variance of the comparative dialogue judgements.The evaluation study for Wizard of Wikipedia is repeated three times on different days, and evaluators who have answered on previous days are not allowed to evaluate again in any subsequent experiments.We find there is greater variance on questions asking which dialogue is more human and more interesting, most likely as different evaluators can interpret these in different ways.Further, we see that comparison with the Retrieval model has less variance compared to the Generative model, possibly because the Retrieval model's human written text is devoid of mistakes.Overall, we find that the conclusions (and statistical significance) are stable across multiple evaluations.

Scaling KIF to Challenging Retrieval Settings
KIF modules can be used in more realistic and challenging settings for knowledge retrieval that test the scalability of the module.In Figure 6(a), we compare the Generative Transformer MemNet Baseline with KIF-Augmented Transformers in three settings.The first is the standard Wikipedia sentences provided by the dataset (average 34 sentences).Then, we extend to providing the full Wikipedia article (average 57 sentences) and finally to providing multiple Wikipedia articles (average 205 sentences), identified using the conversation's topic.This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, if filtering steps to preprocess the data remove potentially relevant information, or if information synthesis from multiple knowledge sources is necessary to produce a high quality generation.As the Wikipedia knowledge becomes more difficult to identify, performance decreases, but still outperforms the baseline that uses the dataset-provided set of 34 sentences.
Comparing the scaling capability of KIF to the standard Generative Transformer MemNet Baseline highlights the advantage of using KNN.The attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information when given increasingly larger quantities of knowledge, unlike the KNN infor-  mation fetch.We hypothesize the attention mechanism is challenged by softmax-ing over a larger quantity of inputs, as it can be difficult to make sharp distinctions.

Analysis of Fetched Knowledge
Example conversations from our KIF-augmented generative model are shown in Figure 4 on Wizard of Wikipedia.We find that relevant knowledge is identified that affects the content of the generated utterance.For example, the model finds knowledge sentences about Disney movies as the human conversationalist starts the conversation discussing Disney.The model leverages the fetched knowledge to write the content of the generated utterance.In a concrete example, the fetched sentence disney announced intentions [...] after the success of the incredibles leads the model to generate the utterance i love the incredibles, they are my favorite disney movie.
In contrast, the model uses the form of the fetched training utterance often as a template for writing a response.For example, the model copies the training utterance Ohhh ... what do people with color blindness do to cope with the effects?and starts the model generation with Ohhh ... and continues with the question i think toy story is a classic?following the form of the selected training utterance.
Figure 5 displays the top-3 fetched training set utterances and knowledge sentences on the Wizard of Wikipedia dataset when responding to a human utterance.KIF modules can identify multiple relevant items.In response to the human question about blue skies the 1946 movie the model identifies both the comedy film and the band.
Finally, the elements retrieved by KIF modules pro-vide a more interpretable understanding of what the model is conditioning upon to generate a dialogue response.In Table 3, we display for the same dialogue history, changing the model's fetched training utterance and knowledge sentence for our own examples.The model heavily incorporates our manual changes of the fetched information into the generated utterance.For example, changing the knowledge directly affects what the model generates as the favorite characterfrom buzz lightyear to mr potato head to slinky dogwhile changing the fetched training utterance changes the form of the generated sentence.

Ablations
Importance of Multiple Knowledge Sources.One benefit of KIF modules is that multiple can be used together to fetch information from different sources.In this ablation, we examine the importance of this functionality.For Wizard of Wikipedia and Engaging Im-ageChat, multiple knowledge sources are used -training utterances to capture the capability of a retrievalbased model and knowledge from Wikipedia or related chats based on image features.The performance decreases when only using one source (see Table 4).
For Engaging Imagechat, this study also underlines the importance of being able to fetch in a multi-modal fashion.The general form of the KIF module -requiring only a feature vector to find nearest neighbors from -allows fetching on multiple modalities such as text and images.In Multi-Hop Retrieval with KIF.Work in memory networks (Weston et al., 2014;Sukhbaatar et al., 2015) employed multi-hop mechanisms.Such capacity could be useful in cases where multiple sources are necessary or perhaps more information is incrementally required.To emulate multi-hop memory mechanisms, we use KIF to retrieve relevant information for N = 2 or N = 3 fixed hops.As the number of hops is fixed, the multi-hop operation remains differentiable.We do not allow the model to retrieve information in a second hop if that information was already selected.We experimented in two settings.In the first, the same KIF module is used multiple times to fetch different information, and then all of the fetched knowledge

Model
Valid F1 In both multi-hop settings, no improvement in performance on the Wizard of Wikipedia dataset is observed.We hypothesize this can be partially attributed to the construction of the dataset -as humans explicitly based their written dialogue utterance on one knowledge sentence.Further, it is possible concatentation brings together too much information for the model to incorporate, and thus adding additional fetches makes the retrieval more noisy.
Using Dialogue Features for KNN Performance.The quality of the KNN search is critical to the performance of KIF modules.As the external knowledge is kept fixed, KIF must be able to align the dialogue context with the knowledge to identify relevant pieces of information.In Table 5, we show that matching on more features can improve the quality of the retrieved information.Using only the encoding of the immediate previous utterance can improve results on Wizard of Wikipedia by 7 F1 points, but this is further improved by also leveraging the encoding of context (+1.8 F1) and using the dialogue turn number (+1 F1).These features are available in the datasets, and we leverage them to improve the relatedness of retrieved knowledge.
Effect of Gating.We analyze the effect of the gating mechanism used in KIF by evaluating the capability of the gate to identify and focus on salient information.On Wizard of Wikipedia, we concatenate a third source of information: dialogue turns from a completely different corpus called PersonaChat (Zhang et al., 2018).This dataset looks quite different -short utterances without factual knowledge -and should be easy for the model to identify as distinct from Wizard of Wikipedia.As shown in Figure 6(b), if KIF on PersonaChat is included without gating, it has a harmful effect as the model includes irrelevant information.When equipped with gating, the model learns to use the gate to ignore some inputs, and can recover almost the full performance of the model without this irrelevant information source.
Size of K in KNN. Figure 6(c) shows the performance on Wizard of Wikipedia when varying the amount of knowledge.Generally, being able to access multiple relevant pieces of information is helpful, but too much information can be harmful.This is likely because the weighted sum operation becomes blurry if too many sentences are summed.
We present a KNN-based Information Fetching module that learns to identify relevant information from external knowledge sources by learning a mapping-based read operation.KIF modules benefit from the scalability and efficiency of K Nearest Neighbors search, enabling computation with large external memories.We show in the context of two dialogue datasets that relevant knowledge can be identified and incorporated to create more engaging, high quality dialogue.

Figure 1 :
Figure 1: KIF modules (orange) fetch relevant information from multi-modal external knowledge sources and incorporate it in standard neural architectures.

Figure 2 :
Figure 2: Human Evaluation Results on both Datasets.More than 50% indicates the KNN Model is preferred.Stars indicate statistical significance at p < 0.05.

Figure 3 :
Figure 3: Human Evaluation on the Unseen Test set of Wizard of Wikipedia.More than 50% indicates the KNN Model is preferred.Stars indicate statistical significance at p < 0.05

Figure 4 :
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of Wikipedia.The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed with model output.
Figure 4, and show the top scoring Wikipedia knowledge and Training Utterance fetched by KIF modules.

Human:
What are your hobbies?Fetched Training Utterances: I work at an elementary school.I hope you find a job you love too [...] I have a hound, we just got her.Although, I grew up with Labrador Retrievers.I just love ice cream.I love the types with fruits and flavours.Do you like ice cream?Human: hi buddy, what do you think about cinematography?Gold Chosen Knowledge: cinematographers use a lens to focus reflected light from objects into a real image [...] Fetched Knowledge: cinematography is the art of motion-picture photography typically, a lens is used to repeatedly focus the light reflected from objects [...] the modern photographic camera evolved from the camera obscura Human: Speaking of blue skies, have you seen the 1946 movie staring bing crosby?Gold Chosen Knowledge: blue skies is a 1946 american musical comedy film [...] and starring bing crosby [...] Fetched Knowledge: blue skies is a 1946 american musical comedy film [...] and starring bing crosby [...] blue skies the band has since broken up blue skies was was composed in 1926 as a last -minute addition to betsy the musical

Figure 5 :
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding to a human chat from the dataset using a trained Wizard of Wikipedia model.Examples are taken from validation.

Figure 6 :
Figure 6: Ablations on Wizard of Wikipedia.(a) KIF can scale to hundreds of relevant sentences (blue) while the baseline model, the Generative Transformer MemNet, scales poorly (gray) (b) Gating can remove irrelevant information.In the 3 Sources case, one source of external information is unrelated.(c) Performance as k varies.

Table 2 :
Results on the Engaging Imagechat dataset.

Table 3 :
Effect of Fetched Information on Generated Utterances.The top section provides examples for a fixed training utterance, changing the knowledge -the generated text maintains the construction of the training utterance but changes the favorite character to match the knowledge.The bottom section provides examples for fixed knowledge but changing the training utterance -the generated text modifies its form to match the training utterance, but the favorite character information remains consistent.

Table 4
, using the Image-based KIF to fetch text from Related Images is important to reach the strongest performance (compare Training Utterances Only that uses text-based KIF and using both Training Utterances and Related Images).

Table 4 :
Using Multiple KIF Modules on Multiple Sources is important for improved performance.

Table 5 :
Important Features for KNN Search using KIF.Salient conversation features improve performance on both datasets.

Table 6 :
Multi-hop with KIF to retrieve information with multiple fetch steps is concatenated.Results are shown in Table6(top).in the second setting, we examine spreading out the fetches into different KIF modules at various encoder network depths.This could be interpreted as the model learning to access more required information layer by layer.It is possible that as the model progresses deeper, the more abstract and high level representations that are built allow different knowledge to be retrieved.As the encoder models we experiment with have six layers, we distribute the KIF fetches evenly throughout.Results are shown in Table6(bottom).