Many studies have suggested that episodic memory is a generative process, but most computational models adopt a storage view. In this article, we present a model of the generative aspects of episodic memory. It is based on the central hypothesis that the hippocampus stores and retrieves selected aspects of an episode as a memory trace, which is necessarily incomplete. At recall, the neocortex reasonably fills in the missing parts based on general semantic information in a process we call semantic completion. The model combines two neural network architectures known from machine learning, the vector-quantized variational autoencoder (VQ-VAE) and the pixel convolutional neural network (PixelCNN). As episodes, we use images of digits and fashion items (MNIST) augmented by different backgrounds representing context. The model is able to complete missing parts of a memory trace in a semantically plausible way up to the point where it can generate plausible images from scratch, and it generalizes well to images not trained on. Compression as well as semantic completion contribute to a strong reduction in memory requirements and robustness to noise. Finally, we also model an episodic memory experiment and can reproduce that semantically congruent contexts are always recalled better than incongruent ones, high attention levels improve memory accuracy in both cases, and contexts that are not remembered correctly are more often remembered semantically congruently than completely wrong. This model contributes to a deeper understanding of the interplay between episodic memory and semantic information in the generative process of recalling the past.
Episodic memory enables us to remember personally experienced events and depends on the hippocampus (Clayton, Salwiczek, & Dickinson, 2007). Semantic information, on the other hand, is represented in neocortical areas and captures general facts and regularities of the world around us (Reisberg, 2013). Early concepts of episodic memory were based on the storage model, according to which the content of the memory more or less faithfully reflects the content of the experience (Tulving, 1972). This view is oversimplified since it reduces episodic recall to a mere readout process of stored complete information. However, overwhelming empirical evidence suggests that the recalled memories can be influenced by other information acquired before and after the encoding, as well as the context of encoding and recalling (Hemmer & Steyvers, 2009b). Pioneering studies suggest that semantic interpretations, rather than sensory inputs, are stored in memory (Bartlett, 1995; Sachs, 1967) and that memories are reconstructed during recall (Bartlett, 1995). In word list studies using the Deese–Roediger–McDermott (DRM) paradigm (Deese, 1959; Roediger & McDermott, 1995), participants “remember” semantically related words that were not on the study list when asked to retrieve the words studied earlier. There is also evidence that semantic and episodic memories interact and complement each other during retrieval (Greenberg & Verfaellie, 2010). For instance, Devitt, Addis, and Schacter (2017) found in a meta-analysis of eight studies based on autobiographical interviews that when participants report an episode, the internal (episodic) details and external (semantic) details they use are negatively correlated. The participants apparently use semantic information to compensate for insufficient episodic detail in their memory. Other examples are experiments by Bartlett (1995), where participants of nonmatching cultural backgrounds recalled folk tales. 
The recalled stories were distorted to match the participants' cultural background (semantic information). Finally, there are also paradigmatic examples of memory adjustments due to social context (Deuker et al., 2013; Hirst & Echterhoff, 2012), self-model (Axmacher, Do Lam, Kessler, & Fell, 2010), stress (Herten, Otto, & Wolf, 2017; Wolf, 2019), and many other factors (Addis, 2020; Schacter & Addis, 2020).
Few contemporary researchers would oppose the idea that episodic memory is, at least to a certain degree, generative in nature (Greenberg & Verfaellie, 2010). Nonetheless, most of the existing computational models (including some of our own: Cheng & Frank, 2011; Cheng & Werning, 2016; Neher, Cheng, & Wiskott, 2015) adopt the storage view, where memories are preserved and later retrieved faithfully (Becker, 2005; Jensen & Lisman, 1996; Rolls, 1995). Such models are usually tested with either random patterns or abstract spatial representations (Becker, 2005; Cheng & Frank, 2011; Jensen & Lisman, 1996; Neher et al., 2015; Rolls, 1995) but not with realistic sensory input. With such artificial input patterns, it is rather suggestive to think in terms of mere storage memory, since there is not much structure in the input that could be exploited. However, in natural stimuli, there is a rich hierarchical structure of features and statistical relationships, which was not exploited and not even considered in these models.
In order to model the generative process of episodic memory recall, we believe it is important (1) to use (real-world) input patterns as stimuli with enough structure that can be exploited by a semantic system for a generative process, (2) to discard some fraction of the input patterns during storage to model the inevitable loss of information in the brain due to the attentional bottleneck, and (3) to include a generative element in the model that is able to reasonably fill in the missing parts according to learned semantic information. Discarding a fraction of an input pattern can be done in at least two ways, by lossy compression and by selection (either before or after compression). The former refers to a process like mp3 encoding, a compression that tries to discard only the information that is irrelevant or trivially recoverable from what is being stored. The latter refers to a process where some part is selected for storage and another is discarded altogether; for example, from a picture of a water mill at a creek, the mill could be attended to and stored while the creek could be ignored. When recalling the mill, our semantic system probably complements it with a creek, but the creek might look very different from the original one, and we are probably not even aware of this. Such a process of scenario construction (Cheng, Werning, & Suddendorf, 2016) is able to generate a semantically plausible and consistent memory experience from an incomplete memory trace without us even noticing that the recall is not faithful. When playing the mp3 encoded song, on the other hand, there might be some noise due to the strong compression, but all in all, the song is faithfully reconstructed. We refer to the representation reduced by compression and selection as the gist, which is stored in a memory trace and from which the original episode can be reconstructed, either quite faithfully if only compression is involved or at least plausibly if also selection is involved.
Following the lossy compression approach from a perspective of rate-distortion theory and efficient coding, Bates and Jacobs (2020) and Nagy, Török, and Orbán (2020) have modeled perception and episodic memory as a generative process. Bates and Jacobs argue that a capacity-limited perceptual system like the brain should use prior knowledge and take into account task dependencies to compress the input into an optimal representation. Nagy et al. have demonstrated that systematic distortions in memory are similar to the distortions that are characteristic of a capacity-limited generative model adapted to an environment for compression. They use a variational autoencoder (VAE) as a model for memory (Kingma & Welling, 2013). A VAE is an autoencoder architecture that maps input images to a plain gaussian model distribution and back again to images. Episodic memory can then be modeled by storing the location in the gaussian as a memory trace, which is a low-dimensional feature vector. From such a memory trace, the full image can be reconstructed. The lossy compression from input to gaussian is so extreme in this case that memory recall is largely generative. It is even possible to generate new images without any input or memory trace by sampling from the gaussian and then decoding this vector. Any such image looks similar to the images seen during training; in fact, the system can only represent such seen images or interpolations of them. These models are already generative but with a focus on optimal compression and decompression, while our focus is attentional selection and semantic completion.
Hemmer and Steyvers (2009a), among others, have suggested a Bayesian account of reconstructive memory that captures the interaction of prior knowledge with episodic memory. Although their framework of generative memory is very close to ours, their work is more concerned with why generative episodic memory would be advantageous, whereas in this work we suggest a model of how this process can happen.
One might think that a storage memory would be advantageous over a generative one because of its faithfulness. However, scenario construction during recall is essential to the etiological function of episodic memory because it provides far more flexibility to deal with missing data and to adjust to variable demands and constraints than a faithful reproduction of past experiences could. Moreover, the already acquired semantic knowledge can help to improve memory efficiency. Put simply, generativity is a useful feature in episodic memory, not an aberration (Schacter, Guerin, & Jacques, 2011).
2 Computational Model of Generative Episodic Memory
Based on a biologically motivated conceptual framework and using methods from machine learning, we have developed a computational model that allows us to investigate semantic completion in a generative episodic memory on real world images on a fairly abstract level but still analogous to concrete brain structures, so that predictions on a behavioral level but also about neural processes are possible (Fayyaz, Altamimi, Cheng, & Wiskott, 2021).
2.1 Conceptual Framework
We hypothesize that generative episodic memory works as follows:
1. Sensory input patterns that make up the episode are perceived by a hierarchically organized network and transformed into a hierarchical perceptual-semantic representation in cortical areas, such as the visual system.
2. Some elements of this representation are selected for storage in episodic memory. We call this the episodic gist.
3. The episodic gist is stored in hippocampal memory as a set of pointers to the corresponding perceptual-semantic elements in cortical areas, referred to as the memory trace.
4. Triggered by some external or internal cue, or even spontaneously, the memory trace can be reactivated.
5. The pointers in the memory trace reactivate corresponding perceptual-semantic elements in cortical areas.
6. Semantic information in the cortical areas complements the reactivated elements by means of recurrent dynamics to construct a plausible full representation from the incomplete gist stored in the memory trace, a process we refer to as semantic completion.
Several of these steps and concepts deserve closer consideration.
We speak of a perceptual-semantic representation, because (1) we consider the transformation from the raw input to a high-level semantic representation a gradual process, as is well known for deep neural networks (Liuzzi, Aglinskas, & Fairhall, 2020; Zhang, Han, Worth, & Liu, 2020), and (2) while we mainly remember high-level aspects, we can also remember quite low-level aspects of an episode, such as the exact color and shape of an object. So we believe that there is no clear-cut distinction between perceptual and semantic representations (Davis et al., 2021) and therefore refer to the corresponding network as the perceptual-semantic network. A prototypical example is the visual system, which is hierarchically organized from low-level perceptual representations in primary visual cortex (V1) to high-level semantic representations in inferior temporal cortex (IT) (Felleman & Van Essen, 1991). The very rapid recognition of high-level features in images (Thorpe, Fize, & Marlot, 1996) suggests that the generation of the semantic representation is largely done by feedforward projections. The recurrent and feedback connections in turn are instrumental for recreating a full perceptual-semantic representation from a memory trace (Takeda, 2019; Xia, Guan, & Sheinberg, 2015), although they certainly also contribute during perception.
The concept of gist is well known (Koutstaal & Schacter, 1997; Oliva, 2005; Sachs, 1967). The episodic gist (Cheng & Werning, 2016) contains essentials about the episode that are selected dynamically depending on attention and the context (Graham, Simons, Pratt, Patterson, & Hodges, 2000). They may be detailed in some cases and general and vague in others.
Episodic memory traces are pointers to perceptual-semantic elements of the sensory input rather than the representations of the input itself (Fang, Rüther, Bellebaum, Wiskott, & Cheng, 2018; Teyler & Discenna, 1986). It is not clear where the expansion from pointer to the full representational element happens. It could be on the way from hippocampus to cortical areas (Teyler & Rudy, 2007). Or the pointers in hippocampus activate pointers in cortex and only those then activate the full representational element (Reber, Stark, & Squire, 1998). Our model is more in line with the second view, because in our model, cortical semantic completion happens on the pointer level.
Semantic information is usually extracted from multiple experiences, is mostly categorical, and refers to the prototypical properties of objects or people and their relationships (Collins & Quillian, 1969; Tulving, 1972). Evidence from patients with semantic dementia suggests that semantic information is vital for episodic memory recall (Irish & Piguet, 2013). It is plausible to assume that the process of semantic completion is mainly done by recurrent connections within the perceptual-semantic network, because that is the site where the semantic information is stored (O'Reilly, Wyatte, Herd, Mingus, & Jilk, 2013) and because recurrency is well suited to perform pattern completion (Hopfield, 1982).
Because of its generative nature, we call the retrieval process scenario construction.
Next we describe the network architecture with which we capture the key aspects of this conceptual framework on a level abstract and efficient enough to be applicable to real world images. Since the storage and retrieval of memory patterns in the hippocampus have been modeled far more extensively than the generative aspect of episodic memory, we focus on the latter in this study.
2.2 Network Architecture
A VQ-VAE consists of an encoder, a decoder, and a latent representation between these two. To obtain a quantized latent representation, there is also a set of codebook vectors, which are optimized by vector quantization but otherwise fixed. The VQ-VAE processes an input image in the following steps:
1. The VQ-VAE first converts the input image with a convolutional neural network (the encoder) to an array of 64-dimensional feature vectors; the image size and the array size differ between the original images and the images with context. The positions within the array correspond to a grid of subsampled locations in the image. Thus, this array still has a coarse spatial resolution, and the feature vectors are a description of the image around these locations.
2. This array is then converted to an array of indices (called index matrix for short), each index indicating the 64-dimensional codebook vector most similar to the feature vector at that position. This can be viewed as a quantization step that makes the representation more categorical (i.e., more semantic). In particular, it has been shown that each learned codebook vector in a VQ-VAE corresponds to some specific feature of the input (van den Oord et al., 2017). Up to this point, the network has compressed the image into a more abstract and semantic representation.
3. In order to recover an image from this compressed latent representation, the array of indices is converted back to an array of 64-dimensional vectors by replacing the indices of the index matrix by the corresponding codebook vectors, which should be similar to the array of feature vectors of step 1.
4. The array of codebook vectors is then decoded by a deconvolution neural network (the decoder mirroring the encoder) to produce an image of the original size.
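The quantization and lookup steps described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the toy codebook, the two-dimensional feature vectors, the squared Euclidean distance as similarity measure, and the function names `quantize` and `dequantize` are assumptions chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[List[Vector]], codebook: List[Vector]) -> List[List[int]]:
    """Map each feature vector to the index of its nearest codebook vector."""
    return [
        [
            min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
            for f in row
        ]
        for row in features
    ]


def dequantize(index_matrix: List[List[int]], codebook: List[Vector]) -> List[List[List[float]]]:
    """Replace each index by the corresponding codebook vector."""
    return [[list(codebook[k]) for k in row] for row in index_matrix]


# Toy example (assumed values): a 2 x 2 array of 2-dimensional feature vectors
# and a codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [0.6, 0.1]],
]
index_matrix = quantize(features, codebook)    # compact, categorical representation
restored = dequantize(index_matrix, codebook)  # input to the decoder
```

Only `index_matrix` would need to be kept in a memory trace; the codebook itself is shared across all images.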
A VQ-VAE alone can convert an image into a more abstract and semantic representation and back again, but it is not generative in the sense that it could produce new reasonable images from scratch or complement incomplete images. This is fine as long as the full index matrix is available. However, we also want to model attentional selection, in which case only part of the index matrix can be recalled from memory. In such cases, we need a generative component that is able to fill in the missing parts of the image. For that, we use a PixelCNN.
A PixelCNN is a probabilistic autoregressive generative model that is able to continue sequences of numbers. It can fill in missing pixel RGB values in an image in a fixed sequence—for instance, row-wise from top left to bottom right. Completing an image with a PixelCNN is a time-consuming process. We apply the PixelCNN not to the image but to the latent index matrix, which is much faster, since the index matrix is much smaller than the image. Since a PixelCNN only works in one particular order, we can model attentional selection only in a primitive form by keeping the upper rows and neglecting the lower rows of the index matrix. The level of attention determines how many rows to keep. The remaining representation of the input is what we call the episodic gist, and it is stored and retrieved in episodic memory as a memory trace.
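The row-wise completion of an index matrix can be sketched as follows. This is only an illustration under stated assumptions: `toy_conditional` is a hypothetical stand-in for a trained PixelCNN (it simply favors repeating the previous index), and the 4 x 4 matrix size and codebook size of 4 are arbitrary toy values.

```python
import random
from typing import Callable, List, Sequence


def complete_indices(
    kept: Sequence[int],
    total: int,
    conditional: Callable[[Sequence[int]], List[float]],
    rng: random.Random,
) -> List[int]:
    """Fill in the missing tail of a row-wise flattened index matrix.

    Each new index is sampled from a distribution conditioned on all
    indices that precede it in the fixed raster order.
    """
    indices = list(kept)
    while len(indices) < total:
        probs = conditional(indices)
        next_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        indices.append(next_index)
    return indices


def toy_conditional(previous: Sequence[int], codebook_size: int = 4, stay: float = 0.7) -> List[float]:
    """Hypothetical stand-in for a trained PixelCNN (assumption, not a real model)."""
    if not previous:
        return [1.0 / codebook_size] * codebook_size
    rest = (1.0 - stay) / (codebook_size - 1)
    return [stay if k == previous[-1] else rest for k in range(codebook_size)]


rng = random.Random(0)
kept = [2, 2, 3, 3]  # stored gist: the upper row of a 4 x 4 index matrix
completed = complete_indices(kept, 16, toy_conditional, rng)
rows = [completed[i:i + 4] for i in range(0, 16, 4)]

# With an empty gist, the same procedure generates an index matrix from scratch.
from_scratch = complete_indices([], 16, toy_conditional, rng)
```

Because sampling always proceeds in the same order, only a prefix of the flattened index matrix can serve as the stored part, which is why attentional selection is restricted to keeping the upper rows.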
The VQ-VAE as well as the PixelCNN are both trained on a large set of training images. First, the VQ-VAE is trained to reconstruct the input images as faithfully as possible, despite the strong compression in the latent representation (i.e., the index matrix). The weights of the encoder and decoder are optimized as well as the codebook vectors. Once the VQ-VAE is trained, the PixelCNN can be trained on the index matrices generated by the trained VQ-VAE from the training images. See section 5 for further details on the VQ-VAE and the PixelCNN.
Our model is designed to reflect our hypotheses on generative episodic memory. That is, the stored gist has far less information content than the input images; nonetheless, the input can be reconstructed from it. The model captures complex statistics of the input and also reflects the generative nature of episodic memory that has been observed in many studies. When the attention is low (only a small part of the index matrix is stored), the recalled memories are not necessarily faithful; still, they are valid and likely reconstructions, typical of the training data.
2.3 Analogy to the Brain
Although VQ-VAE and PixelCNN originate from the field of machine learning, we believe they can be related to aspects of neural processing in the brain, and they are an appropriate level of description for our purposes here.
The encoder network of the VQ-VAE might correspond to the feedforward processing in the visual system, which results in abstract object representations in the inferior temporal (IT) cortex. Many studies have suggested a correspondence between the hierarchy of the human visual areas and layers of CNNs (Kuzovkin et al., 2018; Lindsay, 2021; Yamins et al., 2014). The decoder has a structure symmetrical to the encoder and might be similar to the feedback connections from higher levels of the visual system to lower ones. Experimental results suggest that during retrieval, a cortical representation of the memory is formed in the lower levels of the visual pathway through feedback connections (Takeda, 2019; Xia et al., 2015). Some studies have used an autoencoder structure to model the feedback connections in the visual pathway (Al-Tahan & Mohsenzadeh, 2021). In our model, the decoder generates a cortical representation of the memory in its layers down to the image level, which we take as a readout of the cortical representation of the memory during retrieval. However, we do not mean to suggest that the brain activates sensory representations at the input level. A body of research also indicates that there is semantic learning at the level of the visual system, reflected in our model by the whole VQ-VAE network (Hu & Jacobs, 2021).
The PixelCNN learns statistical relationships between the elements of the latent representation of the VQ-VAE by repeated exposure, that is, it learns semantic information from episodes akin to how it is hypothesized also for the brain (Michaelian, 2011). It is then able to fill in missing elements in the semantic representation of an image. We hypothesize that this is akin to recurrent dynamics in the higher cortical areas that can fill in missing information in a semantically consistent and expected way (Carrillo-Reid & Yuste, 2020; Tang et al., 2018).
We do not model storage in and retrieval from the hippocampus mechanistically; we simply store and recall a perfect copy of the selected parts of the index matrix, which represents the episodic gist. Storing just the indices of the codebook vectors, and not the vectors themselves, is consistent with the indexing theory of hippocampal memory (Teyler & Discenna, 1986), although we would argue that our indices are also represented in the cortex, so that semantic completion can take place there.
Our model is able to process real-world images, and we believe that sufficiently rich statistical structure in the input patterns is essential for a meaningful simulation of episodic memory. However, large images require large data sets and are computationally expensive to process. As a compromise, we use the well-known MNIST data set of handwritten digits (LeCun, 1998), which is real-world and has a clear semantic structure of 10 digit classes, 0 to 9, so that simulations can be conducted efficiently. The 28 × 28 pixel images show white digits on black background with gray values between 0 and 255. For some experiments, we also use the Fashion MNIST data set (Xiao, Rasul, & Vollgraf, 2017), which is similar to the digit data set but shows 10 classes of fashion items rather than digits. First, we illustrate the behavior of the system, and then model a concrete memory task and compare it to experimental results (Zoellner et al., 2021).
3.1 Semantic Learning at the Level of the VQ-VAE
Semantic learning within the VQ-VAE has two aspects: (a) the semantic learning within the encoder and decoder and (b) the categorization of the feature vectors by vector quantization. Both can potentially contribute to memory efficiency in our model. Interestingly, we find that the model works well in a regime where the encoder alone actually increases the size of the representation, and it is the quantization that leads to a strong compression within the VQ-VAE. The compression ratio as defined in van den Oord et al. (2017) is measured in bits: each input pixel takes one of 256 gray values (8 bits), whereas each entry of the index matrix selects one of 20 codebook vectors (log2 20 ≈ 4.3 bits), and the index matrix has far fewer entries than the image has pixels.
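The bit-based compression ratio can be made concrete with a short calculation. The sketch below assumes the standard MNIST image size of 28 × 28 pixels and, purely for illustration, a 7 × 7 index matrix; the latter is an assumed value and should be replaced by the actual index-matrix size of the model. The function name `compression_ratio` is likewise hypothetical.

```python
import math


def compression_ratio(n_pixels: int, gray_levels: int, n_indices: int, codebook_size: int) -> float:
    """Ratio of bits needed for the raw image to bits needed for the index matrix."""
    bits_image = n_pixels * math.log2(gray_levels)       # 8 bits per pixel for 256 levels
    bits_latent = n_indices * math.log2(codebook_size)   # about 4.3 bits per index for 20 vectors
    return bits_image / bits_latent


# 28 x 28 input pixels (standard MNIST); 7 x 7 index matrix is an ASSUMED size.
ratio = compression_ratio(n_pixels=28 * 28, gray_levels=256, n_indices=7 * 7, codebook_size=20)
print(f"compression ratio in bits: {ratio:.1f}")
```

The calculation shows why quantization, and not the encoder, accounts for the compression: the gain comes from both the smaller number of entries and the smaller number of bits per entry.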
However, semantic learning might not only help to improve memory efficiency. Here we investigate how much semantic learning helps dealing with noise and how well this generalizes from one set of images to another.
Noisy MNIST images were generated as follows. First, a noise template was generated by sampling an array of independent and identically distributed (i.i.d.) noise from a gaussian distribution with zero mean and unit variance. This template was then added to an image (having gray values between −0.5 and 0.5) with a weighting factor between 0.01 and 1, that is, between 1% and 100%. An image with 100% noise therefore still has some original image information left. Noisy images were not clipped or normalized back to [−0.5, 0.5]. Using a fixed noise template for all noise levels realizes so-called frozen noise, which leads to smoother and more reliable results, because it eliminates random fluctuations between different noise levels.
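The frozen-noise procedure can be sketched in a few lines of plain Python. The four-pixel toy image, the seed value, and the function names are assumptions made for illustration; only the procedure itself (one gaussian template, scaled by a weighting factor, no clipping) follows the description above.

```python
import random
from typing import List


def make_noise_template(n_pixels: int, seed: int) -> List[float]:
    """Sample one i.i.d. gaussian template (zero mean, unit variance) per seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]


def add_frozen_noise(image: List[float], template: List[float], weight: float) -> List[float]:
    """Add the same template at a given weight; the result is not clipped."""
    return [p + weight * n for p, n in zip(image, template)]


image = [-0.5, -0.5, 0.5, 0.5]  # toy image with gray values in [-0.5, 0.5]
template = make_noise_template(len(image), seed=0)

# The same template is reused for every noise level (frozen noise).
noisy = {w: add_frozen_noise(image, template, w) for w in (0.01, 0.1, 1.0)}
```

Since only the weighting factor changes between noise levels, differences between levels reflect the noise strength alone and not a new random draw.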
To test the performance of the system on a set of images, we have to distinguish successes from failures. Although the mean squared distance between original image and reconstructed image is an obvious and frequently used measure, it is not very useful as a measure of perceptual similarity (Mathieu, Couprie, & LeCun, 2015). We have therefore trained two classifiers, each consisting of a three-layer convolutional neural network, to recognize the 10 digits (trained to a level of 98% correct classification on test data) or fashion item classes (91% correct classification on test data) and evaluate the performance by the classification accuracy, that is, the percentage of reconstructed images that are recognized correctly by the trained classifier.
We ran the digit or fashion MNIST classifier on the noisy images directly, on images reconstructed by a VQ-VAE without quantization (semantic learning (a)), and on images reconstructed by a VQ-VAE with quantization (semantic learning (a) and (b)). For the VQ-VAE without quantization, training and testing were done without quantization; thus, this corresponds to a plain autoencoder. For each noise level, we did 10 runs with a different seed for the (frozen) noise template and the VQ-VAE, each one tested on 10,000 different test images (about 1,000 per digit or class). Training was done on 60,000 images. Each of the four sets of 10 runs (with or without vector quantization, combined with classes 0–4 or 5–9) had the same sets of 10 seeds for the noise and for the VQ-VAE to make the results more comparable.
The latent representation in a VQ-VAE still has some spatial resolution and can take advantage of the combinatorics in the index matrix to generalize to images with a different distribution from the one trained on. To study generalization, we tested VQ-VAEs trained on different training sets (MNIST digits 0–4, digits 5–9, fashion item classes 0–4 as well as classes 5–9) on different test sets of the same four groups, resulting overall in 16 comparisons across different noise levels. We distinguish three cases: in sample indicates that the training and test set were from the same group (e.g., both digits 0–4); out of sample indicates that the training and test set were from different groups within the same data set (e.g., training on digits 0–4 and testing on digits 5–9); and out of distribution indicates that the training and test set were from different data sets (e.g., training on fashion item classes 0–4 and testing on digits 0–4). Within one comparison graph, we keep the test database constant, and we average the results over the two corresponding groups (e.g., digits 0–4 and digits 5–9). This eliminates confounding effects by different difficulty levels of the test sets. At high noise levels, the VQ-VAE tends to generate an output that consistently gets classified as one of the 10 classes. This seems to largely depend on the noise template with a preference for the digits 2, 3, 5, and 8 (see Figure S3). For a single run, this could lead to chance levels of either 0% or 20%, depending on whether the preferred digit is within the test set. Averaging over 10 runs and different combinations of training and test set largely eliminates this effect and results in a convergence to the expected chance level of 10%.
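The chance levels of 0%, 20%, and 10% mentioned above follow from simple arithmetic, sketched here. The calculation assumes that all classes are equally frequent in a test group and that the output is always classified as one fixed preferred digit; the function name is hypothetical.

```python
from typing import Iterable, List


def constant_guess_accuracy(preferred: int, test_classes: Iterable[int]) -> float:
    """Accuracy if every output is classified as `preferred` (equal class sizes assumed)."""
    classes = list(test_classes)
    return 1.0 / len(classes) if preferred in classes else 0.0


groups: List[range] = [range(0, 5), range(5, 10)]  # digits 0-4 and digits 5-9
preferred_digits = [2, 3, 5, 8]                    # preferences reported in the text

# Single run: 20% if the preferred digit is in the test group, otherwise 0%.
single_runs = {(d, g.start): constant_guess_accuracy(d, g) for d in preferred_digits for g in groups}

# Averaging over preferred digits and both test groups gives the chance level.
mean_accuracy = sum(single_runs.values()) / len(single_runs)
```

Each preferred digit falls into exactly one of the two groups, so the average over both groups is half of 20%, that is, 10%.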
Curves are averaged over training/test combinations 0–4/0–4 and 5–9/5–9 of the same data set for in-sample, over 0–4/5–9 and 5–9/0–4 of the same data set for out-of-sample, and over 0–4/0–4, 0–4/5–9, 5–9/0–4, and 5–9/5–9 of the two different data sets for out-of-distribution results.
All curves reach values higher than chance level at 100% noise level, which we attribute to the still remaining image information at that noise level.
The generalization capability demonstrated here is characteristic of a VQ-VAE. A VAE, for instance, would not be able to do that because it maps the input onto just one feature vector and can therefore not take advantage of the combinatorics of feature vectors like the VQ-VAE does. We have previously tried to use a VAE in our model, but it failed to represent the out-of-sample data. Therefore, it was not suitable for modeling the experimental results on episodic-semantic conflict resolution that are described next.
3.2 Scenario Construction by Semantic Completion
At the core of our model is the concept of an episodic gist, which is incomplete but can be complemented by semantic information to reconstruct a full scenario from a partial memory trace. What is being stored in the memory trace (i.e., what makes up the episodic gist) is largely determined by attention. In our model, attentional control is somewhat constrained and only determines how many consecutive elements of the latent representation (i.e., indices of codebook vectors) are stored row-wise starting in the upper left corner. For low attention, only the upper two out of eight rows might be stored; for high attention, the upper six and a half rows might be stored. The remaining part, if needed, has to be constructed based on semantic information. It is important to note that attentional selection does not apply to the images but to the latent representation in the form of the index matrix.
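Attentional selection on the index matrix can be sketched as taking a row-wise prefix. The example uses eight rows as in the text; the number of columns (eight), the dummy index values, and the function name `select_gist` are assumptions for illustration only.

```python
from typing import List, Tuple


def select_gist(index_matrix: List[List[int]], rows_kept: float) -> Tuple[List[int], int]:
    """Keep the first `rows_kept` rows (possibly fractional) in row-wise order.

    Returns the stored indices and the number of entries that must later
    be constructed by semantic completion.
    """
    n_cols = len(index_matrix[0])
    flat = [k for row in index_matrix for k in row]  # upper left to lower right
    n_kept = int(rows_kept * n_cols)
    return flat[:n_kept], len(flat) - n_kept


# Dummy 8 x 8 index matrix with indices into a codebook of 20 vectors (assumed shape).
index_matrix = [[(r * 8 + c) % 20 for c in range(8)] for r in range(8)]

low_gist, low_missing = select_gist(index_matrix, 2)       # low attention: two rows
high_gist, high_missing = select_gist(index_matrix, 6.5)   # high attention: six and a half rows
```

The selection operates on indices, not pixels, so the discarded part corresponds to missing latent entries and not to a blanked-out image region.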
3.3 Improved Memory Efficiency by Semantic Completion
3.4 Modeling Episodic-Semantic Conflict Resolution in Humans
An important goal of our modeling effort is to reproduce experimental results from episodic memory research and eventually make suggestions and predictions for new experiments. Here we relate to a recent experiment by Zoellner et al. (2021) on the conflict resolution between episodic memory and semantic information in humans.
Participants' memory was then tested on the next day with a recognition task. In this task, participants rated on a 6-point scale from -3 (surely not seen) to 3 (surely seen) how confident they were that they had seen a specific household object and, if they thought they had seen it, decided which room it was in (see Figure 7, right). In addition to the 24 household objects from the apartment, 24 similar-looking distractor objects were presented as well, to avoid random guessing. Each object was presented once. Confidence level was highly correlated with task relevance (mean confidence 2.4) versus task irrelevance (mean confidence 0.9). The same task was repeated after seven days to check how memory changes over time. Since the results show no significant difference between day 2 and day 8, and we are not modeling memory accuracy over time, we pooled the data from the two days.
In the recognition task, there are three possible outcomes in the incongruent cases if the object has been remembered: the semantically incongruent but episodically correct room is remembered (episodic recall), the semantically congruent but episodically incorrect room is remembered (semantic recall), or the semantically incongruent and episodically incorrect room is remembered (wrong recall). In the congruent cases, the semantically congruent recall is also episodically correct (correct recall); the other two rooms are both wrong recalls. The episodic recalls are sometimes also called correct recalls for convenience.
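The outcome categories defined above can be written as a small decision function. The room names and the function name `classify_recall` are hypothetical and serve only to make the logic explicit.

```python
def classify_recall(recalled_room: str, true_room: str, congruent_room: str) -> str:
    """Label a recalled room for an object that has been remembered.

    true_room:      room in which the object was actually seen
    congruent_room: room that is semantically expected for the object
    """
    if recalled_room == true_room:
        # In congruent cases the true room is also the semantically expected one.
        return "correct" if true_room == congruent_room else "episodic"
    if recalled_room == congruent_room:
        return "semantic"
    return "wrong"


# Incongruent case (hypothetical rooms): object seen in the bedroom, expected in the kitchen.
incongruent = {
    room: classify_recall(room, true_room="bedroom", congruent_room="kitchen")
    for room in ("kitchen", "bathroom", "bedroom")
}

# Congruent case: object seen in the room where it is expected.
congruent = {
    room: classify_recall(room, true_room="kitchen", congruent_room="kitchen")
    for room in ("kitchen", "bathroom", "bedroom")
}
```

In the congruent case the function never returns "semantic", because the semantically expected room coincides with the episodically correct one.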
For the model simulation, we proceed as follows: First, the perceptual-semantic network, VQ-VAE and PixelCNN, is trained on the congruent data set. Then a number of congruent and incongruent images are shown to the system and stored in memory traces with varying levels of attention: 5% (low attention), 52% (medium attention), or 63% (high attention) of the index matrix is stored. The stored memory traces are then recalled by the network and semantically completed by the PixelCNN. A trained classifier for digits and another one for backgrounds are used to model the responses of the participants. If the digit classifier recognizes the digit from the recalled image correctly, this counts as if the participant remembers having seen the object. Only then is the background classifier applied to determine the type of background. The digit classifier network is a basic CNN with three convolutional layers that was trained on digits in both congruent and incongruent contexts so that it is not biased by the background. This network has an accuracy of 99% on test data. The context classifier is a simple pattern matching algorithm that assigns the pattern class based on mean squared error.
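A pattern-matching context classifier based on mean squared error can be sketched as a nearest-template rule. The template names, the four-value toy patterns, and the function names are assumptions; the actual background patterns and their resolution are not specified here.

```python
from typing import Dict, List


def mean_squared_error(a: List[float], b: List[float]) -> float:
    """Mean squared difference between two equally sized patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def classify_context(background: List[float], templates: Dict[str, List[float]]) -> str:
    """Return the name of the template with the smallest mean squared error."""
    return min(templates, key=lambda name: mean_squared_error(background, templates[name]))


# Hypothetical background templates, flattened to short vectors for illustration.
templates = {
    "stripes": [0.0, 1.0, 0.0, 1.0],
    "plain": [0.5, 0.5, 0.5, 0.5],
    "ramp": [0.0, 0.3, 0.6, 0.9],
}

recalled_background = [0.1, 0.9, 0.2, 0.8]  # imperfect reconstruction of "stripes"
label = classify_context(recalled_background, templates)
```

In the simulation, such a classifier would be applied only after the digit classifier has recognized the digit correctly, mirroring the fact that participants report a room only for objects they remember having seen.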
Since the PixelCNN has been trained only on congruent examples, it usually fills in the semantically congruent background in the bottom half of the image if it has stored a particular digit at attention level 50% or less (here 5%), because it has no background information. If it fails to do so, it should fill in one of the other two backgrounds with equal probability. However, if the attention level is higher (here 52% or 63%), the PixelCNN should infer the (possibly incongruent) background from the bits that are preserved about it in the memory trace and complete it. The more information it has, the more reliably it recovers the correct background. Thus, the results should be trivial for congruent images because the congruent background is always recalled correctly, but for incongruent images, the outcome depends on attention level. That is, for low attention levels, the model plausibly constructs the congruent background (semantic recall), and for high attention levels, the model correctly recalls the incongruent background (episodic recall). It should usually not recall a background that is both incongruent and incorrect (wrong recall).
Experimental as well as simulation results are shown in Figure 9 in a direct comparison and can be summarized as follows:
Congruent contexts are recalled better than incongruent ones, even for high attention levels, as there is no conflict between episodic memory and semantic information.
High confidence of having seen an object, modeled by high attention levels, increases memory accuracy in both congruent and incongruent cases, but much more so in the latter case, because in the former, performance is at a high level throughout.
Contexts that are not remembered episodically correctly are more often remembered semantically congruently than completely wrong.
Episodic and wrong recalls are equally likely in incongruent cases if the confidence/attention level is low, since there is (presumably) no information about the episodically correct room in the memory trace. This is expected for symmetry reasons if there is no particular prior toward one of the rooms.
To match the experimental results we tuned the three different attention levels, which are not well quantified in the experiment. However, the fact that there are wrong recalls of the context in incongruent cases and the good match of their proportion to the semantic recalls are emergent properties of the model.
With this work, we present a model of generative episodic memory at a rather abstract level with a network architecture combining known methods from machine learning: VQ-VAE and PixelCNN. It can process real images, includes the potential for spatial attentional selection (although still in a primitive form), can represent images that are quite different from those it was trained on, and models usage of semantic information for encoding, which relates to abstraction; for quantization, which relates to categorization; and for semantic completion to complement parts neglected by spatial attention. The term semantic might seem overly ambitious here, but we believe that the semantic information these generative models capture shares essential characteristics with what we would normally refer to as semantic, namely, general regularities of the world that hold beyond and are represented independent of particular episodes. If one were to scale up the model in size and complexity, the semantic information would gradually be of a more high-level nature.
4.1 The Six Steps of Generative Episodic Memory
Our model shows how generative episodic memory can work in principle. It supports our conceptual framework for human episodic memory:
Sensory input (an image in our case) is processed by a multilayer perceptual-semantic network, for example, the visual system, to generate a more abstract representation.
Some aspects of this representation, the episodic gist, are selected, presumably by attention depending on many factors.
Pointers to the selected perceptual-semantic elements in the hierarchical representation are then stored in the form of a memory trace in the hippocampus.
During recall, a memory trace is reactivated in the hippocampus.
The pointers in the memory trace in turn reactivate the perceptual-semantic elements.
Perceptual-semantic information is finally used to fill in missing parts in a dynamic process.
The last step makes episodic memory generative, and we call this process scenario construction.
4.2 Memory Efficiency
Three factors in the model potentially contribute to its memory efficiency: encoding and decoding, and quantization, both in the VQ-VAE; and semantic completion in the PixelCNN. Interestingly, we find that the model works well in a regime where the encoding actually expands the representation by about a factor of four and only the quantization performs compression, so that combined, we have a compression by a factor of about 30 in the VQ-VAE. From Figure 6 one can infer that the semantic completion contributes a factor of only up to 2 to the overall compression of input images into the memory traces. We expect that for richer data sets and when the temporal dimension is taken into account, the contribution of semantic completion to the memory efficiency becomes much larger. However, we hypothesize that semantic completion also serves other purposes.
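The two factors quoted above can be checked with a short back-of-the-envelope calculation. The sketch below assumes 28 x 28 grayscale inputs with 8 bits per pixel and a 7 x 7 latent grid; these values are assumptions that are not stated in this section, whereas the codebook size (20) and feature dimension (64) are taken from the appendix.

```python
import math

# Assumptions (not stated in this section): 28 x 28 grayscale inputs with
# 8 bits per pixel, and a 7 x 7 latent grid after two stride-2 convolutions.
IMAGE_SIDE = 28
BITS_PER_PIXEL = 8
LATENT_SIDE = 7
FEATURE_DIM = 64    # size of each codebook vector (from the appendix)
CODEBOOK_SIZE = 20  # number of codebook vectors (from the appendix)


def compression_summary():
    """Return (encoder expansion factor, overall compression factor)."""
    image_values = IMAGE_SIDE * IMAGE_SIDE
    latent_values = LATENT_SIDE * LATENT_SIDE * FEATURE_DIM
    # Number of real values after encoding relative to the number of pixels.
    expansion = latent_values / image_values
    # After quantization each grid position holds one codebook index,
    # which needs log2(CODEBOOK_SIZE) bits.
    image_bits = image_values * BITS_PER_PIXEL
    trace_bits = LATENT_SIDE * LATENT_SIDE * math.log2(CODEBOOK_SIZE)
    return expansion, image_bits / trace_bits


expansion, compression = compression_summary()
print(f"encoder expansion:   {expansion:.1f}x")
print(f"overall compression: {compression:.1f}x")
```

Under these assumptions the encoder expands the representation fourfold, and the quantized index matrix is smaller than the input image by a factor of a little under 30, in line with the figures given in the text.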
4.3 Semantic Completion by the PixelCNN
The model shows good semantic completion capabilities. It is remarkable how well the PixelCNN completes the index matrix to generate plausible complete images even from small fragments, where faithful reconstruction is not possible (see Figure 6). Semantic completion can help to generalize better. It has been hypothesized that the main purpose of episodic memory is not to remember the past but to help us make decisions for the future (La Corte & Piolino, 2016; Schacter & Addis, 2007a, 2007b). Thus, if our knowledge about the world changes, maybe our memories of the past should also change to be maximally useful to deal with the future. Semantic completion can do exactly that.
4.4 Advantages of Using a VQ-VAE
We are not the first to model the generative nature of episodic memory. Two recent studies have used variational autoencoders (VAE) to reconstruct images from memory traces (Bates & Jacobs, 2020; Nagy et al., 2020). In contrast to a VAE, the latent representation in the VQ-VAE that we use here maintains some spatial resolution. This has two advantages. First, we can model not only compression but also spatial selection by attention. We do that by discarding some fraction of the feature vectors in the array and keeping the rest. Second, the model can also store and recall input patterns that are quite different from those seen during training, because the known feature vectors can be combined in many different new spatial constellations. For instance, a VQ-VAE trained on digits 0 to 4 can equally well represent digits 5 to 9 (see Figure 3), something a VAE cannot do. This generalization capability extends to the PixelCNN in a remarkable way. Figure 8 shows that the PixelCNN is able to disentangle digit and background in the training data and put them together in an unseen way. We see another advantage of the VQ-VAE for our purposes in that the feature vectors are quantized, which is in analogy to semantic categorization in the brain (Persaud, Hemmer, Kidd, & Piantadosi, 2017).
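The quantization step referred to above can be illustrated with a nearest-neighbor lookup. This is a minimal sketch of vector quantization in general, not the implementation used in our simulations; the two-dimensional toy codebook and feature array are hypothetical.

```python
def quantize(features, codebook):
    """Map each feature vector in a 2D array to its nearest codebook index.

    features: rows x cols array (nested lists) of feature vectors
    codebook: list of codebook vectors with the same dimensionality
    Returns the index matrix (rows x cols array of integers).
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(vec):
        return min(range(len(codebook)),
                   key=lambda k: squared_distance(vec, codebook[k]))

    return [[nearest(vec) for vec in row] for row in features]


# Toy example: three codebook vectors and a 2 x 2 array of feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[[0.1, -0.1], [0.9, 1.2]],
            [[0.2, 1.8], [0.6, 0.6]]]
print(quantize(features, codebook))
```

Because the index matrix keeps the spatial layout of the feature array, discarding entries of this matrix corresponds to discarding image regions, which is what allows spatial selection by attention.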
4.5 Is Machine Learning the Right Level of Abstraction?
It could be questioned whether machine learning methods are a good basis for modeling the brain as we do here. There is always a trade-off between the scale of the model and the biological details that it can account for. It is currently impossible to keep all the biological details when modeling an extensive system as we do here. Therefore, we believe that the artificial neural networks we use here offer a good level of abstraction with biological relevance. The convolutional neural networks, as well as recurrent neural networks on which our model is based, were inspired by the brain and are remarkably successful also in computational neuroscience (Kuzovkin et al., 2018; Lindsay, 2021; Papadimitriou, Vempala, Mitropolsky, Collins, & Maass, 2020; Savage, 2019; Yamins et al., 2014). Furthermore, they offer the advantage of efficiency, so that real world images can be processed. This is an important factor for two reasons. First, the frequently used artificial random stimuli in earlier memory modeling studies lack the statistical structure and regularities that are essential in studying the interplay between episodic memory and semantic information. Without statistical regularities that can be exploited, episodic memory cannot be generative by design. Second, being able to process images that are closer to real world images is an important step toward closing the gap between model simulations and experimental studies with human participants.
4.6 Modeling Episodic Semantic Conflict Resolution in Humans
We have successfully modeled the episodic memory experiment by Zoellner et al. (2021). Both experiments and simulations show that congruent contexts are recalled better than incongruent ones (van Kesteren, Rignanese, Gianferrara, Krabbendam, & Meeter, 2019), that attention improves correct recall in both cases, and that incorrectly recalled contexts in incongruent cases are more often remembered semantically congruently than completely wrong. Figure 9 shows that we have achieved good agreement with the experimental results.
4.7 Future Research
Overall we believe this model advances our understanding and sharpens our concepts of generative episodic memory. However, the model can and should be developed in future work:
The current attentional selection is rather restrictive and needs to be more flexible, so that any location could be selected. One option would be to replace the PixelCNN by a more flexible transformer network (Chen et al., 2020; Parmar et al., 2018; Sanh, Debut, Chaumond, & Wolf, 2019).
Even though the encoder is hierarchical, consistent with our conceptual framework, the index matrix on which the selection and semantic completion are done is not. It is possible to employ a hierarchical version of the VQ-VAE (Razavi, van den Oord, & Vinyals, 2019) to allow for semantic completion in a truly hierarchical representation.
The storage process in the hippocampus is currently not modeled. This could be addressed, for instance, by adding a model of one-shot storage of pattern sequences in hippocampal memory (Melchior, Bayati, Azizi, Cheng, & Wiskott, 2019). This would also allow for the investigation of sequential episodes, not just snapshots. Sequentiality has been claimed to be one essential characteristic of episodic memory (Cheng, 2013; Cheng & Werning, 2016).
Besides developing the model further, it needs to be compared to more experiments on human episodic memory to further constrain the model and contribute to the design of new experiments. For example, our model predicts that if a person's semantic information changes considerably, that person will probably recall old memories biased toward the newly learned semantics or just less accurately; this might partially explain the phenomenon of infantile amnesia (Robinson-Riegler & Robinson-Riegler, 2012). One other possible prediction is that if we assume that stress during retrieval temporarily blocks the access to episodic traces, we would see more semantic construction when a person is under stress during retrieval compared to unstressed retrieval (Wolf, 2019).
Overall, we believe the model we present here makes a significant step toward understanding how generative episodic memory might work, and it opens numerous options for future research to investigate more aspects of scenario construction.
Since quantization is a discrete operation, it is not possible to calculate its gradient for backpropagation. Therefore, the stop gradient (sg) operator is introduced here. During the forward pass, it works like an identity operator. During the backward pass (backpropagation), the gradient is passed directly from the decoder input (the quantized vectors) to the encoder output. The second and the third terms have identical values; the second one updates the codebook via quantization (i.e., due to its nonzero gradient with respect to the codebook vectors), and the third one affects only the encoder.
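For orientation, the three terms discussed above correspond to the objective commonly given for the VQ-VAE, which can be written as follows. The notation is an assumption borrowed from the usual formulation and may differ from the symbols used elsewhere in this article: $x$ is the input, $\hat{x}$ its reconstruction, $z_e(x)$ the encoder output, $e$ the selected codebook vectors, and $\beta$ the weight of the commitment loss.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{rec}}\bigl(x, \hat{x}\bigr)
  \;+\; \bigl\lVert \mathrm{sg}[z_e(x)] - e \bigr\rVert_2^2
  \;+\; \beta \, \bigl\lVert z_e(x) - \mathrm{sg}[e] \bigr\rVert_2^2
```

In this form the second term moves the codebook vectors toward the encoder outputs, and the third term (the commitment loss) moves the encoder outputs toward the codebook vectors; with $\beta = 1$ the two terms have the same value and differ only in where the gradient flows.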
In our simulations, we trained the VQ-VAE with 20 codebook vectors of size 64. The weight for the commitment loss was set to one. The batch size was 128, and the learning rate was . The encoder consists of two convolutional layers with a kernel size of three and stride two with 16 and 32 filters, respectively, followed by another layer with stride one and 64 filters. The decoder has an architecture symmetrical to the encoder but with transposed convolutional layers. Figures S5 and S6 in the appendix provide a visualization of the network.
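The spatial size of the encoder output follows from the standard output-size formula for convolutions. The sketch below assumes a 28 x 28 input, a padding of one pixel in every layer, and a kernel size of three for the third layer; none of these three values is stated above, so they should be read as illustrative assumptions.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(input_side=28):
    """Return the (height, width, channels) shape after each encoder layer.

    Strides and filter counts follow the text; input size, padding, and the
    kernel size of the third layer are assumptions.
    """
    layers = [
        {"kernel": 3, "stride": 2, "filters": 16},
        {"kernel": 3, "stride": 2, "filters": 32},
        {"kernel": 3, "stride": 1, "filters": 64},  # kernel size assumed
    ]
    side = input_side
    shapes = []
    for layer in layers:
        side = conv_output_size(side, layer["kernel"], layer["stride"], padding=1)
        shapes.append((side, side, layer["filters"]))
    return shapes


for shape in encoder_shapes():
    print(shape)
```

Under these assumptions the encoder produces a 7 x 7 array of 64-dimensional feature vectors, which quantization then turns into a 7 x 7 index matrix.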
PixelCNNs generate images pixel by pixel and in a sequence (e.g., from top left to bottom right) conditioned on all previously sampled pixels. This process is slow for large images; in our case, we use the PixelCNN only on the index matrix, which is much smaller. After training, the model can complete a partial index matrix and even generate a new one from scratch based on the semantic information learned from the training data. This is then converted to an array of codebook vectors and passed to the decoder to generate the output.
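The completion procedure can be sketched as a loop over the index matrix in raster order. The code below is a schematic illustration only: `sample_next` stands in for the trained PixelCNN, and the toy predictor `copy_previous`, the `MISSING` marker, and the example matrices are hypothetical.

```python
import random

MISSING = -1  # hypothetical marker for entries absent from the memory trace


def complete(trace, sample_next, rng):
    """Fill MISSING entries in raster order, conditioned on earlier entries.

    sample_next(context, rng) receives all entries that precede the current
    position in raster order and returns an index for the current position.
    """
    rows, cols = len(trace), len(trace[0])
    filled = [row[:] for row in trace]  # do not modify the stored trace
    for r in range(rows):
        for c in range(cols):
            if filled[r][c] == MISSING:
                flat = [v for row in filled for v in row]
                context = flat[: r * cols + c]
                filled[r][c] = sample_next(context, rng)
    return filled


def copy_previous(context, rng, codebook_size=20):
    """Toy stand-in for the PixelCNN: repeat the preceding index."""
    return context[-1] if context else rng.randrange(codebook_size)


rng = random.Random(0)
partial = [[3, 5], [MISSING, MISSING]]
print(complete(partial, copy_previous, rng))           # completes a partial trace
print(complete([[MISSING] * 2] * 2, copy_previous, rng))  # generates from scratch
```

Stored entries are left untouched, so the same loop covers both completing a partial index matrix and generating one from scratch; only the quality of the predictor determines how plausible the filled-in entries are.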
The PixelCNN has 12 gated blocks and was also trained with the Adam optimizer with a batch size of 128 and a learning rate of . The convolutional layers in the PixelCNN have 32 feature maps. Both the VQ-VAE and PixelCNN implementations were adopted from Royer (2019). A visualization of the PixelCNN network is included in Figure S7 in the appendix.
The code for this paper is available at https://github.com/ZahraFayyaz/Generative-episodic-memory.
This work was supported by a grant from the German Research Foundation (DFG), “Constructing scenarios of the past: A new framework in episodic memory,” FOR 2812, project number 419039588, P5 (L.W.) and 419039274, P4 (O.T.W.). We thank Dr. Anand Subramoney for his helpful comments on the manuscript.