The experimental evidence on the interrelation between episodic memory and semantic memory is inconclusive. Are they independent systems, different aspects of a single system, or separate but strongly interacting systems? Here, we propose a computational role for the interaction between the semantic and episodic systems that might help resolve this debate. We hypothesize that episodic memories are represented as sequences of activation patterns. These patterns are the output of a semantic representational network that compresses the high-dimensional sensory input. We show quantitatively that the accuracy of episodic memory crucially depends on the quality of the semantic representation. We compare two types of semantic representations: appropriate representations, in which the stored input sequences are of the same type as those the representation was trained on, and inappropriate representations, in which the stored inputs differ from the training data. Retrieval accuracy is higher for appropriate representations because the encoded sequences are less divergent than those encoded with inappropriate representations. Consistent with our model prediction, we found that human subjects remember some aspects of episodes significantly more accurately if they had previously been familiarized with the objects occurring in the episode, as compared to episodes involving unfamiliar objects. We thus conclude that the interaction with the semantic system plays an important role for episodic memory.
When Tulving (1972) introduced the distinction between episodic memory and semantic memory, he conceived of the former as information about specific personally experienced events and the latter as general knowledge about the world divested of a specific spatiotemporal context. An example of episodic memory would be to remember having seen a butterfly in the morning in the garden. A semantic memory would be knowing what the word butterfly means. Tulving (1972) also described the interaction between the episodic and semantic systems, which he later formalized as the SPI model (Tulving, 1995). The model involves three hierarchically arranged components: the perceptual, semantic, and episodic systems (see Figure 1A). Information flow among these components is process specific: serial encoding, parallel storage, and independent (SPI) retrieval (Tulving, 1995; Tulving & Markowitsch, 1998). Serial encoding means that incoming information must first pass from the perceptual to the semantic system and is then encoded as episodic memory. As a consequence, acquisition of new episodic memory is affected by information in semantic memory. The output of a given system can be transmitted to the next level or stored at that level, or both. Hence, parallel storage implies that different aspects of the incoming information are stored in different systems. Finally, the stored information can be retrieved independently from each system (Tulving, 2002). Since Tulving's early work, a vast number of studies have investigated the relationship between episodic and semantic memory, as well as their neural underpinnings. The findings broadly fall into one of three categories: (1) episodic and semantic memory are two separate memory systems, (2) they are part of a single memory system, and (3) they are separate but strongly interacting systems.
In the first category, studies focus on the dissociation between the two forms of memory. Neuropsychological studies show that patients with medial temporal lobe damage have a severe impairment in episodic memory (Bayley, Hopkins, & Squire, 2006; Rosenbaum et al., 2008), while their semantic memory is largely spared (Manns, Hopkins, & Squire, 2003). Conversely, patients with semantic dementia have relatively spared episodic memory (Chan et al., 2001; Graham & Hodges, 1997). These findings suggest a double-dissociation of the two memory systems. This dissociation has also been observed in neuroimaging studies, where tasks thought to engage the different memory systems engaged distinct sets of brain regions (Wiggs, Weisberg, & Martin, 1998; Graham, Kropelnicki, Goldman, & Hodges, 2003; Düzel, Habib, Guderian, & Heinze, 2004; Gilboa, 2004; Maguire, 2001). In general, episodic memory is thought to crucially rely on medial temporal lobe structures, in particular, the hippocampus, whereas semantic memory is processed primarily in the neocortex (Svoboda, McKinnon, & Levine, 2006; Huth, Nishimoto, Vu, & Gallant, 2012). Nevertheless, it remains contentious how episodic memory could be conceptually distinguished from semantic memory (Cheng & Werning, 2016; Tulving, 1985; Conway, 2009; Clayton, Bussey, & Dickinson, 2003; Klein, 2013b; Suddendorf & Corballis, 1997).
In the second category, numerous studies reported commonalities in neural activations across tasks involving semantic and episodic memories (Rajah & McIntosh, 2005; Burianova, McIntosh, & Grady, 2010; Binder, Desai, Graves, & Conant, 2009; Ryan, Cox, Hayes, & Nadel, 2008). The unitary system view therefore proposes that a single declarative memory system subserves both episodic and semantic memory (Rajah & McIntosh, 2005; Baddeley, 1984; Burianova & Grady, 2007). Memory encoding is thought to always be contextual (Baddeley, 1984), but at retrieval, memories may, or may not, become decontextualized depending on the task demands (Rajah & McIntosh, 2005; Westmacott & Moscovitch, 2003; Yassa & Reagh, 2013). In the most extreme version of this view, the same memory trace could be retrieved as either episodic or semantic memory, depending on whether the retrieval was associated with autonoetic or noetic consciousness, respectively (Klein, 2013a).
Finally, the intermediate view between the first two is that episodic and semantic memory constitute two separate memory systems but interact strongly with each other (Nyberg, Forkstam, Petersson, Cabeza, & Ingvar, 2002; Cabeza & Nyberg, 2000; Gabrieli, Poldrack, & Desmond, 1998; Greenberg & Verfaellie, 2010; Cheng, Werning, & Suddendorf, 2016). Many experimental studies suggest that semantic memory affects episodic encoding (Ween, Verfaellie, & Alexander, 1996; Graham, Simons, Pratt, Patterson, & Hodges, 2000; Kinsbourne, Rufo, Gamzu, Palmer, & Berliner, 1991; Maguire, Kumaran, Hassabis, & Kopelman, 2010) and retrieval (Martin & Chao, 2001; Wagner, Paré-Blagoev, Clark, & Poldrack, 2001; Mion et al., 2010; Spreng, Mar, & Kim, 2009). Another line of evidence shows that episodic memory is improved when the to-be-remembered information is consistent with prior semantic knowledge (Anderson & Pichert, 1978; Bartlett & Kintsch, 1932; Kan, Alexander, & Verfaellie, 2009; Hemmer & Steyvers, 2009). The levels-of-processing phenomenon also supports the special role of semantic information in episodic memory (Craik & Lockhart, 1972; Craik & Tulving, 1975). Furthermore, semantic information has even been suggested to contribute to mental time travel into the future (Irish, Addis, Hodges, & Piguet, 2012; Duval et al., 2012), which is closely linked to episodic memory (Suddendorf & Corballis, 1997). On the other hand, episodic memory has been suggested to affect both the formation and retrieval of semantic memory (Verfaellie, 2000; Kitchener, Hodges, & McCarthy, 1998; Westmacott, Black, Freedman, & Moscovitch, 2004; Kopelman, Stanhope, & Kingsley, 1999).
These disparate views cannot be easily reconciled with each other or with the SPI model, and so the relationship between the episodic and semantic systems remains controversial. Here, we suggest studying the computational function of the interaction between the episodic and semantic systems as a way forward in this controversy. We develop an abstract computational framework for the interaction between the episodic and semantic systems, which derives from, but also significantly differs from, Tulving's SPI model. We find in our model that the quality of the semantic representation is important for the accuracy of episodic memory retrieval. We test this prediction in a behavioral experiment and find that subjects indeed remember some key aspects of episodic memory more precisely when their semantic representation was better tuned to the objects in the episode.
2 Model of the Interaction between the Semantic and Episodic Systems
From Tulving's SPI model, we adopt the component structure and the process-specific operation of the combined systems, but make very different assumptions about the nature of the components and the operation of the system (see Figure 1B). First, the SPI model does not specify the formats in which the episodic and semantic systems encode and store information. In our model, these formats critically determine how the episodic and semantic systems function. We have previously suggested that episodic memories are best represented as sequences of neural activity patterns (Cheng, 2013; Cheng & Werning, 2016; Cheng et al., 2016). More specifically, the recurrent CA3 network encodes sequences during the experience and replays them later during retrieval (Levy, 1996; Lisman, 1999; Buhry, Azizi, & Cheng, 2011; Cheng, 2013). By contrast, semantic memory in our model is represented by static neural activity patterns. As in the SPI model, encoding in our model is serial. The perceptual system sends inputs to the semantic system, which generates a semantic representation of the inputs. These semantic representations have been learned previously on a body of input data. During the experience of an episode, sequences of semantic patterns are stored by the episodic system as episodic memory. Specifically, the sequence is stored by associating each element of the sequence with its succeeding element. In summary, episodic memory is defined as a sequence of semantic representations. This feature makes our model different from other theoretical and computational models of the semantic and episodic systems.
Second, in contrast to the SPI model, stored information cannot be retrieved independently in our model. Instead, episodic memories have to be retrieved through the semantic system (see Figure 1B), that is, the information stored in the episodic system must first pass through the semantic system before it can be retrieved (Cheng et al., 2016). This view is consistent with empirical observations suggesting that the semantic system is important for episodic retrieval (Gilboa, 2004; Mion et al., 2010; Spreng et al., 2009; Martin & Chao, 2001; Mangels, 1997; Stuss, Craik, Sayer, Franchi, & Alexander, 1996). However, in our current computational model, we do not implement this interaction during retrieval explicitly. To study the properties of stored episodic sequences, they are directly retrieved from the episodic system. The retrieval of an episodic sequence is initiated by providing a single pattern as a retrieval cue and then proceeds from one element to the next.
In the following, we describe the elements of our model in more detail.
2.1 Input Patterns
We also used a random walk trajectory in the testing phase, where the object moves horizontally and vertically in each time step. The steps in the two directions are drawn independently from a normal distribution to reduce repetitions in the position. If a step would take the object beyond the boundary, the object is reflected at the boundary instead. The rotation of the object also follows a random walk with normally distributed steps.
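The random walk with reflecting boundaries can be sketched as follows. This is a minimal illustration: the step size, arena bounds, and starting position are arbitrary placeholder values, not the parameters used in the study.

```python
import numpy as np

def reflect(pos, lo, hi):
    """Reflect a coordinate back into [lo, hi] if a step crossed a boundary."""
    if pos < lo:
        pos = 2 * lo - pos
    elif pos > hi:
        pos = 2 * hi - pos
    return pos

def random_walk(n_steps, step_sd=1.0, lo=0.0, hi=50.0, seed=0):
    """2D random walk with independent normal steps and reflecting boundaries."""
    rng = np.random.default_rng(seed)
    pos = np.empty((n_steps, 2))
    pos[0] = ((hi + lo) / 2, (hi + lo) / 2)  # start in the center of the arena
    for t in range(1, n_steps):
        step = rng.normal(scale=step_sd, size=2)
        pos[t, 0] = reflect(pos[t - 1, 0] + step[0], lo, hi)
        pos[t, 1] = reflect(pos[t - 1, 1] + step[1], lo, hi)
    return pos
```

Reflection, rather than clipping, keeps the step-size statistics near the boundary similar to those in the interior.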
The parameters of the object's movement were chosen to ensure that semantic representations could be learned from the input data in a systematic way. Note that the semantic system does not learn the movement statistics itself, so in our model, the movement statistics are unimportant once the semantic system has been trained.
2.2 Model of the Semantic System
2.2.1 Slow Feature Analysis
2.2.2 Hierarchical Network Structure and Training
We used a hierarchical SFA network to model the semantic system. Information is first extracted locally and then integrated into increasingly global and abstract features at each level of processing (see Figure 3A). The network consists of converging layers of SFA nodes. In the first layer, each node receives inputs from a square region in the image space, which is called its receptive field. In each direction, the receptive fields of two neighboring nodes overlap by 12 pixels. In the second layer, each node receives inputs from a patch of first-layer nodes, with an overlap of two nodes between neighboring patches. These second-layer nodes converge onto a single SFA node in the third layer. In this setup, all nodes in a given layer together cover the full image space. The activity of the node in the top layer is taken to be the output of the semantic system in our memory model.
In each SFA node, the same set of processing steps is performed (see Figure 3A, top right). The first linear SFA stage performs dimensionality reduction, whereby the input dimensionality is reduced to 48. The following quadratic expansion allows for nonlinear features and increases the dimensionality to 1224. A final linear SFA stage reduces the dimensionality to 32, except for the top layer, where the output dimensions are reduced to 4. Since each layer realizes a polynomial of degree 2, the three-layer network as a whole computes subsets of polynomials of degree 8.
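The stated dimensionalities are mutually consistent: a quadratic expansion of a 48-dimensional vector yields the 48 linear monomials plus 48 · 49 / 2 = 1176 products x_i x_j with i ≤ j, 1224 in total. A small sketch verifies the count:

```python
import numpy as np

def quadratic_expand(x):
    """All monomials of degree <= 2 (without the constant): x_i and x_i * x_j for i <= j."""
    n = len(x)
    pairs = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, pairs])

# 48 linear terms + 48 * 49 / 2 = 1176 quadratic terms = 1224, as in the text
assert quadratic_expand(np.zeros(48)).size == 1224
```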
The hierarchical network was implemented in Python using the MDP library (Zito, Wilbert, Wiskott, & Berkes, 2008). It was trained sequentially from bottom to top on sequences of 10,000 images in each training session; longer data sets did not improve the training. With our choice of the object's movement statistics, we ensure that the slowest features that emerged from the SFA network correspond to the coordinates of the object's center and its orientation (see, e.g., Figures 3B and 4). The four slowest features were considered the semantic representation of the input at each time point. Note that SFA learns to extract a more abstract representation of a single input image; it does not learn the movement statistics.
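The core computation of linear SFA can be sketched in a few lines of NumPy. This is a minimal illustration of the principle (whitening followed by an eigendecomposition of the covariance of the temporal derivative), not the MDP-based hierarchical implementation used in the study.

```python
import numpy as np

def linear_sfa(X, n_out):
    """Minimal linear SFA.

    X has shape (time, features). The data are centered and whitened; the
    slowest features are the directions in whitened space along which the
    discrete time derivative has the smallest variance.
    """
    X = X - X.mean(axis=0)
    # Whitening via eigendecomposition of the covariance matrix
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    W = eigvec / np.sqrt(eigval)          # whitening matrix (columns scaled)
    Z = X @ W
    # Covariance of the discrete time derivative
    dcov = np.cov(np.diff(Z, axis=0), rowvar=False)
    dval, dvec = np.linalg.eigh(dcov)     # ascending order: slowest first
    return Z @ dvec[:, :n_out]
```

Applied to an invertible linear mixture of a slow and a fast sinusoid, the first output recovers the slow source up to sign and scale.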
2.3 Episodic Sequence Storage and Retrieval
After the sensory input is processed by the semantic system, sequences of semantic representations are stored in the episodic system. To be certain about what computations occur in the sequence network, we chose to use a highly simplified algorithmic model for sequence storage and retrieval. In our model, the episodic sequences are stored element by element (i.e., each element is stored individually). To preserve the sequential information, each element is associated with the next element in the sequence, which we refer to as the associated cue (see Figure 3C). It serves as a retrieval cue when the sequence has to be retrieved. Hence, the information about the temporal ordering of the elements in the sequence is not available at a global level, only on a pairwise basis.
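The storage and retrieval scheme described above can be illustrated with a toy implementation. The class below is our own simplified construction for illustration, not the exact algorithm of the model; in particular, having the last element of a sequence point to itself is an arbitrary choice here.

```python
import numpy as np

class EpisodicStore:
    """Toy sketch of pairwise sequence storage: each stored pattern is
    associated with its successor, and noisy retrieval proceeds by
    nearest-neighbor lookup followed by the stored association."""

    def __init__(self):
        self.patterns = []   # stored elements
        self.successor = []  # index of the associated cue (next element)

    def store_sequence(self, seq):
        start = len(self.patterns)
        for i, p in enumerate(seq):
            self.patterns.append(np.asarray(p, dtype=float))
            # last element points to itself (an arbitrary choice in this sketch)
            nxt = start + i + 1 if i < len(seq) - 1 else start + i
            self.successor.append(nxt)

    def retrieve(self, cue, n_steps, noise_sd=0.0, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        P = np.stack(self.patterns)
        out, current = [], np.asarray(cue, dtype=float)
        for _ in range(n_steps):
            idx = np.argmin(np.linalg.norm(P - current, axis=1))  # nearest stored pattern
            out.append(P[idx])
            # follow the associated cue, perturbed by retrieval noise
            current = P[self.successor[idx]] + rng.normal(scale=noise_sd, size=P.shape[1])
        return np.stack(out)
```

With zero retrieval noise and distinct stored patterns, retrieval from the first element reproduces the stored sequence exactly, matching the noise-free observation reported in section 3.1.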
This model is sufficient for our purposes here, since we focus on the interaction between the semantic and the episodic system, not on sequence storage itself. We believe that this algorithm approximates the effective computation of heteroassociations between one pattern in a sequence and the next in area CA3 in the hippocampus (Jensen & Lisman, 1996; Lisman, 1999). Our model is akin to the chaining model, which appears in various theories of memory (Ebbinghaus, Ruger, & Bussenius, 1913; Jones, Beaman, & Macken, 1996; Kieras, Meyer, Mueller, & Seymour, 1998; Lewandowsky & Murdock, 1989; Murdock, 1993; Wickelgren, 1965). It assumes that during storage, associations are formed between successive items. During retrieval, each item acts as the retrieval cue for the subsequent item (Ebbinghaus et al., 1913; Lewandowsky & Murdock, 1989; Murdock, 1993; Lashley, 1951). While the chaining model does not account for some features of serial-order memory (see section 5), it appears to be appropriate for modeling episodic memory for our purposes.
2.4 Study Design
We used two categories of objects, say A and B. For each category, we used sequences of images of the objects to train a semantic network. We then used the semantic network trained on category A to store 30 sequences, each consisting of 50 elements (unless noted otherwise), drawn from A. This case represents sequence storage using an appropriate semantic representation and is denoted by A/A. We then tested episodic memory performance by trying to retrieve the stored sequences and calculating the retrieval error. Memory performance in the A/A case was compared to the case where the same semantic network was used to store sequences of inputs drawn from the other category B, denoted by A/B. In this case, the semantic representation is inappropriate. To ensure that any difference in memory retrieval between appropriate and inappropriate cases does not merely reflect an asymmetry between categories A and B, we performed the same comparison where the roles of categories A and B were swapped (i.e., B/B versus B/A). Note that we compared the retrieval performance on the basis of the same representation but not the same test object. This is consistent with the intrasubject design in our behavioral experiment (see section 4.1), in which each subject has a given representational system but is tested with different objects.
An example of the outputs of two appropriate and two inappropriate representations is shown in Figure 4. Two SFA networks were trained on U and E, respectively. Both networks extracted the location and orientation of the trained object in the same order of slowness (see Figure 4, U/U and E/E). Although each network was trained on only one of the objects, they nevertheless also represent features of other objects to some extent (U/E and E/U). Put differently, the learned semantic representations generalize to other objects to a limited degree.
3 Semantic Representations Affect the Quality of Episodic Memory
3.1 Inappropriate Semantic Representation Yields Larger Episodic Retrieval Errors
The main result of this study is that a retrieved sequence more closely matches the stored sequence when an appropriate representation is used, as compared to when an inappropriate representation is used (see Figure 7). More quantitatively, the episodic retrieval error was consistently lower for appropriate representations (see Figure 8A, blue curves) than for inappropriate representations (see Figure 8A, red curves). In the remainder of this letter, we expand on this result, elucidate the underlying mechanisms, and test our model's predictions experimentally.
The level of retrieval noise has an important effect on episodic retrieval (see Figure 8A). Without retrieval noise, episodic retrieval in all four cases is flawless. Since in the noise-free case, errors could occur only if two or more stored patterns were identical, our observation implies that no two patterns stored in the network are identical. This conclusion is not surprising since the input stimuli, grayscale pixel images, are real valued and the Lissajous trajectories were designed to be nonrepeating. Since noise-free retrieval is unrealistic, we turned our attention to retrieval with noise. In all four cases, the retrieval error gradually grows as successive elements of the sequence are retrieved. At higher levels of retrieval noise, the retrieval errors approach the asymptotic error more rapidly.
Importantly, episodic retrieval with an appropriate semantic representation is more accurate than retrieval with an inappropriate representation for all retrieval noise levels (see Figure 8C). However, the degree of the benefit depends strongly on the retrieval noise. The difference is largest at an intermediate noise level, where retrieval with the appropriate representation is 30% better than with the inappropriate representation, and it gradually diminishes at higher retrieval noise. The nonmonotonic relationship between the retrieval error difference and the retrieval noise level suggests that multiple competing processes influence the retrieval accuracy. Note that in Figure 8C, we assessed the retrieval error at a fixed sequence position. Since the retrieval error is monotonic in the sequence position, we could have chosen another sequence position for our analysis and obtained similar results. Finally, the difference in retrieval error is very similar for the network trained on the U and the one trained on the E (see Figure 8C), which means that the differences between appropriate and inappropriate representations are not caused by an asymmetry between the two objects.
Another parameter that might affect retrieval performance is the number of stored sequences, since storing more patterns increases the chance of retrieving incorrect patterns. We therefore stored up to 2000 sequences (50 elements each). As expected, the retrieval error increases with the memory load (data not shown), but somewhat surprisingly, the differences between appropriate and inappropriate representations remain constant (see Figure 8D). We next turn to the causes of the difference in retrieval accuracy.
3.2 Sources of Retrieval Error
Retrieval errors occur when the retrieval process jumps from a correct to an incorrect pattern. Intuitively, one might expect that more frequent jumps to incorrect patterns cause higher retrieval error. Therefore, we first examined the frequency of these jumps. We separately measured the probability of incorrect jumps that occur within the same sequence and of those that occur between different sequences. As expected, the rate of jumps between sequences increases with retrieval noise (see Figure 9A, left). Unexpectedly, the rate was only slightly larger for the inappropriate than for the appropriate representation. The rate of jumps within a sequence was even more surprising. We found that the rate declined for retrieval noise past a certain level, and the rate of jumps within sequences was lower for the inappropriate than for the appropriate representations (see Figure 9A, right). The latter feature contrasts with our previous observation that the retrieval error was larger for the inappropriate representation and seems to indicate that retrieval in the inappropriate case is more robust to noise. Hence, the different frequencies of incorrect jumps cannot account for the difference in the retrieval error between the two cases. These paradoxical findings warrant further investigation. Our hypothesis is that although the frequency of incorrect jumps is equal or lower when using inappropriate representations, the size of the jumps or the subsequent errors, or both, are larger, so that the overall retrieval error is larger on average.
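For concreteness, the classification of retrieval steps into correct transitions, within-sequence jumps, and between-sequence jumps can be sketched as follows; the bookkeeping arrays (`seq_of`, `pos_of`) are hypothetical names introduced for this illustration.

```python
def jump_rates(retrieved_ids, seq_of, pos_of):
    """Classify each retrieval step as correct, a within-sequence jump, or a
    between-sequence jump.

    retrieved_ids: indices into the pattern store, one per retrieval step.
    seq_of[i], pos_of[i]: sequence label and within-sequence position of pattern i.
    Returns (within-sequence jump rate, between-sequence jump rate).
    """
    within = between = 0
    steps = len(retrieved_ids) - 1
    for a, b in zip(retrieved_ids[:-1], retrieved_ids[1:]):
        if seq_of[b] != seq_of[a]:
            between += 1                      # landed in a different sequence
        elif pos_of[b] != pos_of[a] + 1:
            within += 1                       # same sequence, wrong position
    return within / steps, between / steps
```

Any step that is neither a correct transition nor a within-sequence jump counts toward the between-sequence rate, so the two rates can be analyzed separately, as in Figure 9A.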
We therefore analyzed the different types of errors that may contribute to the retrieval error. In our model, three types of errors can occur during episodic retrieval. First, the retrieval noise added in each retrieval step can cause incorrect jumps, both within and between sequences (see equation 2.7). These errors accumulate over successive retrievals (noise drift error). Second, our model stores sequences by associating each element with the next element as the retrieval cue. If a jump occurs between sequences, the associated retrieval cue leads to an advancement through the incorrect sequence, which leads retrieval even further away from the original sequence (sequence divergence error). Third, when two elements are identical, retrieval might proceed in the incorrect sequence. Since the last case does not occur in our simulations (see section 3.1) and since it is highly unlikely that two events repeat precisely in biological systems, we consider only the first two error types in the following.
3.2.1 Noise Drift Error
The noise drift is affected by the distribution of patterns stored in the network, which appears to differ between appropriate and inappropriate representations (see Figure 6). Therefore, it might reveal the difference between retrieval with appropriate and inappropriate representations. We found that for large retrieval noise, the time course of the noise drift is very similar to that of the retrieval error, indicating that the retrieval noise is the dominant contributor to the retrieval error in this case (see Figure 8B). At all noise levels, we also see a difference in noise drift between appropriate and inappropriate representations. However, the difference is small compared to the difference in retrieval error for low retrieval noise. At low noise levels, the retrieval error also rises much more steeply than the noise drift.
To better understand these observations, we computed the distribution of distances between two randomly selected patterns. There are clear differences between the distributions for the appropriate and inappropriate representations (see Figure 9C), even though the means are similar (see Figure 9D). We then compared the pairwise distances to the retrieval noise (see Figure 9B). For large retrieval noise, very few noise vectors are longer than 2.5. Since the pairwise distances in this range are more skewed for the inappropriate representation than for the appropriate representation, incorrect jumps occur more often between pairs with relatively larger distance for the inappropriate than for the appropriate representation, thus accounting for the higher noise drift of the inappropriate representations at high noise levels. By contrast, pairwise distances between patterns within the same sequence (see Figures 9E and 9F) were significantly lower for appropriate as compared to inappropriate representations. This explains the unexpected finding that within-sequence jumps are more likely for appropriate representations (see Figure 9A, right). However, since most of these distances are very small, the more frequent within-sequence jumps do not cause larger retrieval errors. Furthermore, the comparison of pairwise distances between patterns in the same sequence and those between all patterns suggests an explanation for the decline of within-sequence jump rates at low retrieval noise. When the retrieval noise is below a certain level, the noise vectors are shorter than 1 (see Figure 9B). While there are many pair distances from the same sequence within this range (see Figure 9E), they make up only a very small fraction of all pair distances (see Figure 9C). However, since there are also few patterns from other sequences at a short distance, faulty transitions go to patterns from the same as well as other sequences.
At larger retrieval noise, however, the noise vector samples from distances that are dominated by patterns from other sequences. Therefore, between-sequence jumps become more likely and the within-sequence jump rate declines (see Figure 9A).
In summary, the distribution of the patterns in pattern space accounts for the properties of the noise drift error and jump probabilities. However, the noise drift errors do not account for the bulk of the difference in the retrieval error between appropriate and inappropriate representations.
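The pairwise distance distributions underlying this analysis can be computed directly from the stored patterns; a minimal sketch:

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distances between all unordered pairs of patterns (rows of P)."""
    diff = P[:, None, :] - P[None, :, :]          # (n, n, d) array of differences
    d = np.linalg.norm(diff, axis=-1)             # (n, n) distance matrix
    return d[np.triu_indices(len(P), k=1)]        # upper triangle, excluding the diagonal
```

Restricting P to the patterns of a single sequence yields the within-sequence distribution (as in Figures 9E and 9F); pooling all stored patterns yields the overall distribution (as in Figure 9C).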
3.2.2 Sequence Divergence Error
As a first measure of the structure of the sequences, we considered the consecutive distance: the mean distance between a pattern and the subsequent pattern in the sequence, which is significantly smaller for appropriate representations than for inappropriate representations (see Figure 9G). Figure 10 shows that sequences generated by an inappropriate representation might simply be more jagged, while on average they follow a trajectory similar to sequences of the appropriate representation. This jaggedness, or fluctuation, does not itself cause a larger retrieval error for inappropriate representations. As we have shown, incorrect jumps driven by noise occur less frequently for inappropriate representations. Higher consecutive distances are consistent with the low fraction of short-distance pairs (see Figure 9E), and consequently, low levels of retrieval noise may not suffice to cause retrieval to jump to an incorrect pattern; in other words, retrieval of sequences with higher fluctuation is less sensitive to noise. When considering the interaction between stored sequences during retrieval, however, a larger consecutive distance may still increase the retrieval error: once a jump between sequences has occurred, the sequential retrieval process might drive the retrieved pattern farther away from the correct one, the more so the larger the consecutive distance. We therefore examined the distance of each element from the first element in a sequence. While early in the sequence, the distance to the first element is larger for inappropriate than for appropriate representations, this relationship reverses later in the sequence (see Figure 9H). However, the probability of remaining in the correct sequence decreases exponentially with the number of retrieval steps. Since the probability of a between-sequence jump is appreciable even for small amounts of retrieval noise (see Figure 9A), it is highly unlikely that retrieval stays in the correct sequence for more than about 10 elements.
Therefore, only the distances early in the sequence are relevant for the retrieval performance, and those are larger for the inappropriate representation (see Figure 9H).
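The two sequence measures used in this section, the consecutive distance and the distance from the first element, can be sketched as follows (sequences are arrays with one pattern per row):

```python
import numpy as np

def consecutive_distance(seq):
    """Mean Euclidean distance between successive patterns in a sequence."""
    return np.linalg.norm(np.diff(seq, axis=0), axis=1).mean()

def distance_from_first(seq):
    """Distance of each element of a sequence from its first element."""
    return np.linalg.norm(seq - seq[0], axis=1)
```

The first measure quantifies the jaggedness of a single sequence (Figure 9G); the second, evaluated as a function of sequence position, quantifies how quickly a sequence moves away from its starting point (Figure 9H).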
In summary, the higher retrieval error for the inappropriate representation can be interpreted as follows. Although incorrect jumps occur equally often or less often, once a jump between sequences has occurred, retrieval proceeds along the incorrect sequence, which, for the inappropriate representation, diverges more from the original sequence, thus leading to larger retrieval errors than for the appropriate representation. In other words, it is the relationship between stored sequences rather than the structure of individual sequences that determines the difference in retrieval performance. This account can be tested by correlating the sequence divergence with the retrieval error. Since the four data points we have obtained so far are insufficient for such an analysis, and to show that our results generalize to objects other than the U and E, we next turn to analyzing other objects in the same framework.
3.3 Generalization to Other Objects and Classes of Objects
To show that our findings above hold in general, we study episodic retrieval with other objects (T versus L and H versus X; see Figures 11A and 11B) and, since categorization is an important feature of semantic memory, with two arbitrary categories of objects (T and L versus U and E; see Figure 11C). The simulations and analyses were identical to the ones reported above except where noted otherwise explicitly. In particular, since the letters H and X have a rotational symmetry that the other objects lack, we set the rotation speed of these two letters to half the value that we used for the other objects. For training with categories, the two objects appear one at a time and alternate with a transition probability of 0.05 in each step. There are no blank images at the times when the two objects alternate. This ensures that SFA extracts features that are independent of the objects (see Figure 11D). In the testing phase, we examined the retrieval performance on each object separately and averaged across both objects to report the results for the category. All new analyses confirm that episodic retrieval is more accurate with appropriate representations than with inappropriate representations (see Figures 11A to 11C). Moreover, we verified that our results are not artifacts of the Lissajous trajectories that we used to generate sequences by obtaining very similar results with random walk trajectories and rotations (data not shown).
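The alternation of the two objects during category training amounts to a two-state Markov chain with a switch probability of 0.05 per step; the function below is an illustrative sketch of such a schedule, not the study's training code.

```python
import numpy as np

def object_schedule(n_steps, p_switch=0.05, seed=0):
    """Return which of two objects (0 or 1) is shown at each time step.

    The shown object alternates with probability p_switch per step; there are
    no blank frames between the two objects.
    """
    rng = np.random.default_rng(seed)
    obj = np.empty(n_steps, dtype=int)
    obj[0] = 0
    for t in range(1, n_steps):
        # XOR with 1 flips the object identity when a switch occurs
        obj[t] = obj[t - 1] ^ int(rng.random() < p_switch)
    return obj
```

With p_switch = 0.05, each object is shown for about 20 consecutive frames on average, giving SFA long enough stretches of each object to extract features that vary slowly across the object change.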
3.3.1 Differences in Sequence Divergence Account for Differences in Retrieval Quality
With these additional data points, we can plot the retrieval error against the sequence divergence (see Figure 12). Note that we did not average the results across the two objects for the group simulation. We found that the retrieval error increases sharply with the sequence divergence at low noise levels and saturates at higher noise levels. Most important for us, the relationship is monotonic, showing that differences in sequence divergence indeed account for the differences in retrieval quality.
4 Experimental Study
Although this study is primarily a modeling study, we explore whether the model's prediction has any grounding in reality. Since we were not aware of any experimental study in the literature that would be compatible with our modeling, we decided to conduct a pilot experiment to test the prediction of our computational model in principle. That is, we studied whether the facilitating effect of the semantic representation on episodic memory can be observed experimentally. Since human subjects have had a tremendous amount of visual experience with objects, we reasoned that an exposure equivalent to that in the model would not be sufficient to alter their semantic representations. Conversely, for technical reasons, we could not train our model on data as rich as experienced by humans. We therefore designed an experiment in which subjects had the opportunity to improve their semantic representations of certain objects, but not others, in a different fashion and then tested their memory for episodes involving both types of objects.
The study consisted of two phases on two consecutive days: a semantic training on day 1 and an episodic memory test on day 2.
4.1 Methods

4.1.1 Participants

Nineteen (10 female, 9 male) healthy subjects with a mean age of 26.3 years (SD = 6.0 years; range: 19 to 42 years) participated in the experiment. They were recruited among the psychology students at the Ruhr University Bochum and received course credit for their participation. All participants gave written informed consent, and the study was conducted in accordance with the principles of the Declaration of Helsinki. The study was approved by the ethics committee of the Faculty of Psychology, Ruhr University Bochum.
4.1.2 Stimuli and Semantic Training
Eight different shapes, which did not resemble real objects or living things, served as stimuli (see Figure 13A). The total group of stimuli was divided into two sets, A and B, consisting of four stimuli each. During the training phase on day 1, subjects learned information about one set of objects (the trained set), whereas the other set was not part of the training (untrained set). The assignment of the object sets (A and B) to the training conditions was varied among subjects.
Since adult humans have already acquired the capability to extract simple object features, we could not use the same training procedure as we did for the model. The main requirement is that subjects have a more appropriate representation for the trained objects than for the untrained objects; how this difference came about is secondary for the purpose of the study. We therefore devised a training procedure that we reasoned would improve a subject's semantic representation of the trained objects. Subjects looked at the training objects and touched exemplars of the objects that were cut out of black board. They could hold each object in their hands for 10 seconds and were asked to try to remember the shape of the objects. Subjects were then given so-called ID cards for each object, which contained a picture of the object's shape together with four types of information that the subjects were asked to learn: the name (e.g., Christoph), the favorite color (e.g., red), the favorite food (e.g., spinach), and one ability (e.g., digging). Subjects were handed the four ID cards for the trained objects one at a time, each for 150 seconds (total duration 600 seconds). The instruction was to memorize the information for a subsequent test, which followed after the last ID card and a short break of about 5 minutes. This semantic test contained 16 questions, one for each piece of information for each object (e.g., “Which of the objects can fly?”; “What is the favorite food of this object?”). For each question, subjects could choose among four alternatives, exactly one of which was correct. Importantly, participants had to reach a performance criterion of 12 correct answers in order to be admitted to the test phase on day 2. If they did not reach this criterion, the procedure of the learning phase was repeated and the test was conducted a second time.
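The admission criterion can be made concrete with a small scoring sketch (hypothetical function name; purely illustrative of the 16-question, four-alternative test described above):

```python
def semantic_test_score(answers, correct_answers, criterion=12):
    """Score the 16-question semantic test (4 pieces of information x 4 objects).

    Each question has four alternatives with exactly one correct answer.
    Returns the score and whether the subject reached the criterion of
    12 correct answers required for admission to the test on day 2.
    """
    assert len(answers) == len(correct_answers) == 16
    score = sum(a == c for a, c in zip(answers, correct_answers))
    return score, score >= criterion
```

If the criterion is missed, the learning phase and the test are repeated, as described above.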
4.1.3 Episodic Memory Test
On day 2, subjects were shown eight videos of 30 s duration each. The videos showed movements of either the four trained or the four untrained objects (four videos each; a screenshot is shown in Figure 13B). Based on the objects' movements, different events could occur in each video (see Table 1 for a list of potential events).
Before the first test video was shown, subjects were informed about the types of events that could occur and instructed to memorize the events as accurately as possible. To familiarize them with the task, participants were first shown a training video that contained well-known shapes (e.g., circle) and some of the potential events listed in Table 1.
Table 1: Potential events that could occur in the videos.

|Involved objects|Event|
|---|---|
|One object|Appears|
||Disappears|
||Moves in a zigzag manner|
||Touches a border of the screen|
||Moves parallel to a side border of the screen|
|Two objects|Move in parallel|
||Touch each other|
|Three objects|Are aligned along a line; the fourth one is not|
Following each of the eight test videos, subjects received a questionnaire in which their episodic memory for the events in the preceding video was tested. In each questionnaire, eight events were listed, four of which had occurred in the preceding video and four of which had not (“old” versus “new”). Subjects were first asked, for each event, whether it had occurred in the video. If subjects answered positively, they were asked which one of the four objects was involved in this event. In case of an event involving multiple objects, subjects had to choose among four different combinations of objects. The last question for a given event referred to the context, either spatial (four of the mentioned events per questionnaire) or temporal (the remaining four). For the questions concerning the spatial context, subjects were asked to indicate in which quadrant of the screen the event had occurred by marking the corresponding quadrant of a rectangle on the questionnaire. Similarly, questions about the temporal context asked in which quarter of the video the event had occurred; subjects answered by marking the respective section on a time bar. Videos with trained and untrained objects alternated, with the type of stimuli in the first video being counterbalanced between subjects. To quantify each subject's performance in the test phase, the sums of hits (i.e., correctly recognized events), false alarms (events reported although they had not occurred), correctly identified objects for an event, and correct context memories were calculated separately for trained and untrained objects. After the completion of all questionnaires, the semantic test from day 1 was repeated in order to test whether the newly acquired knowledge was still available on the test day.
4.1.4 Data Analysis
Three dependent variables were considered in the analysis of the subjects' performance in the episodic memory test, separately for videos with trained and untrained objects. As a general performance measure, the sum of false alarms in the evaluation of events (old versus new) was subtracted from the sum of hits. Then the percentage of events for which the associated objects were remembered correctly was calculated relative to the number of correctly remembered events. Finally, context knowledge was assessed as the percentage of correct responses on context questions relative to the number of correctly remembered events. For all three variables, paired t-tests were performed, comparing the measures for trained and untrained objects. p values below 0.05 were considered significant.
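The three dependent variables and the paired comparison can be sketched as follows (function names are our own; in practice a statistics package such as scipy.stats.ttest_rel would be used for the t-test):

```python
import math

def episodic_scores(hits, false_alarms, correct_objects, correct_context):
    """Per-subject scores for one condition (trained or untrained objects).

    The object and context percentages are computed relative to the
    number of correctly remembered events (the hits), as in the text.
    """
    general = hits - false_alarms
    object_pct = 100.0 * correct_objects / hits if hits else 0.0
    context_pct = 100.0 * correct_context / hits if hits else 0.0
    return general, object_pct, context_pct

def paired_t(x, y):
    """Paired t-test statistic for matched samples x and y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n), n - 1  # t statistic, degrees of freedom
```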
4.2 Experimental Confirmation of Our Theoretical Prediction
On average, the subjects' performance score for remembering the events in the videos (hits minus false alarms, maximum score 16) amounted to 10.4 (SD = 2.6) for trained objects and 10.2 (SD = 2.5) for untrained objects. The difference between conditions was not significant (p = .824). For the remembered events, subjects identified the involved objects equally well for trained and untrained objects (mean trained: 76%, SD = 14%; mean untrained: 69%, SD = 16%; p = .172). However, a significant difference emerged when memory for context information was tested. With a score of 69% (SD = 17%), context memory was significantly better for events involving trained objects than for events with untrained objects (mean 59%, SD = 14%; t(18) = 2.949, p = .009) (see Figures 13C to 13E). The semantic test, administered after the completion of the last episodic memory questionnaire on day 2, yielded an average score of 14.2 (SD = 2.5). Only three participants scored lower than 12, the cutoff criterion for the semantic test after the training session on day 1. Excluding these three subjects from the analysis yielded the same pattern of results. Subjects had thus retained the information learned during the semantic training, suggesting that the semantic representations of the trained objects might be more appropriate than those of the untrained objects.
The behavioral results suggest that some key aspects (spatial and temporal) of episodic memory are improved by having a more appropriate semantic representation of the objects involved, even if not all aspects of episodic memory benefited from semantic training. We return to this point in section 5.
4.3 Modeling Our Experimental Results
The retrieval error is a good diagnostic measure for the performance of a computational model, but it cannot be compared directly to experimental results. We therefore augmented our modeling with postretrieval mechanisms to be able to compare the model directly to our experimental results. From the events that occurred in the videos shown to human subjects (see Table 1), we modeled the detection of the appearance and disappearance events, since these could be implemented with the fewest additional assumptions. In each sequence, the moving object can either appear or disappear once at a given time. When present, the object moves along a random walk trajectory and rotates simultaneously. When the object is absent, the inputs consist of blank images. The task of the model, not unlike that of the human subjects, is to detect the event in the output sequences of SFA and, if an event is detected, to determine the context information of the event, including its temporal and spatial aspects. Since the SFA network generates a constant output when the image is blank and time-varying representations for moving objects, we chose to detect the event by examining the time derivative of the stored sequences. If the derivative is above a threshold, the model reports an event. This task is not trivial when retrieval noise is added during recall, as we do in our other simulations.
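This derivative-threshold rule can be sketched as follows (an illustrative reimplementation, not the original code; we assume the retrieved sequence is an array of shape (T, n_features) and report the step at which the derivative norm crosses the threshold in either direction):

```python
import numpy as np

def detect_event(seq, threshold):
    """Detect an appearance or disappearance event in a retrieved sequence.

    A blank input yields a (near-)constant SFA output, so the norm of the
    time derivative stays below the threshold; a moving object drives it
    above.  The event is the step at which the derivative norm crosses
    the threshold; None means no event was detected.  With retrieval
    noise the derivative never vanishes exactly, so the threshold must
    be set above the noise level.
    """
    deriv = np.linalg.norm(np.diff(seq, axis=0), axis=1)
    above = deriv > threshold
    changes = np.nonzero(above[1:] != above[:-1])[0]
    return int(changes[0]) + 1 if changes.size else None
```

A low-to-high crossing corresponds to an appearance, a high-to-low crossing to a disappearance.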
The number of trials with correct event detection is the number of hits. General performance is measured by subtracting the sum of false alarms (the number of trials in which the model detects an event even though no event had occurred) from the number of hits. To determine the time of the event, we divided the sequence into four equal time windows. The time window in which the event is detected is reported by the model. To extract the spatial information about the event, we note that the SFA network extracts features corresponding to the spatial information (x-, y-coordinates) of the moving object. We divided the 2D input space into four quadrants and asked which corner the SFA output was closest to when the event occurred. We then measured the percentage of the temporal and spatial information that the model reports correctly.
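The context readout can then be sketched as (hypothetical helper; we assume the event step is known and that the object's (x, y) position has been decoded from the SFA features and normalized to [-1, 1] per axis):

```python
def event_context(event_step, n_steps, xy):
    """Report the temporal and spatial context of a detected event.

    Returns (quarter, quadrant), both in {1, 2, 3, 4}: the quarter of the
    sequence in which the event falls, and the quadrant of the 2D input
    space in which the decoded position lies at the time of the event.
    """
    quarter = min(4, int(4 * event_step / n_steps) + 1)
    x, y = xy
    quadrant = 1 + (x < 0) + 2 * (y < 0)  # 1: +x,+y  2: -x,+y  3: +x,-y  4: -x,-y
    return quarter, int(quadrant)
```

Both readouts mirror the human questionnaire, which asked for the quarter of the time bar and the quadrant of the screen.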
In this way, our model is able to partially replicate the experimental results. While there is no significant difference in the performance of detecting the occurrence of an event between appropriate and inappropriate conditions (see Figure 13F), there were significant differences in retrieving the time (see Figure 13G) and spatial location (see Figure 13H) of the event. These results derive from the fact that SFA generates more jagged sequences with inappropriate representations, and therefore the event is detected less precisely from the time derivative. Although our model shows overall better performance on remembering context information than human subjects do, our results are consistent with our experimental findings. Taken together, these results suggest that appropriate semantic representations improve episodic memory but that the improvement is not always apparent in every behavioral variable.
5 Discussion

We have presented and studied a model of the interaction between the episodic and semantic system. The main assumptions of our model are that (1) all incoming information must be processed first by the semantic system before it is encoded and stored into episodic memory; (2) semantic memory is represented as static neural patterns, which represent slowly varying features that are hidden in the raw sensory input; (3) episodic memory is represented as temporal sequences of patterns, which are semantic representations of the sensory input; and (4) episodic memory cannot be retrieved directly but rather relies on the semantic system. We found in our model that episodic retrieval is more accurate if the stored sequences were encoded with an appropriate semantic representation. We emphasize that we did not attempt to model the neural mechanisms of semantic representation and episodic memory storage per se but, rather, the effective output of the computations performed by these systems. Consistent with our model's prediction, human subjects remember contextual information better if the episodes involved trained objects, for which they arguably had developed a better semantic representation. Overall, our results suggest a specific computational function for the interaction between the episodic and semantic systems, which helps clarify their disputed relationship.
5.1 Semantic Representation as an Interface to and for Episodic Memory
Serial encoding is central in our model, but it is not universally accepted. For instance, Simons, Graham, and Hodges (2002) suggested that external information can also enter the episodic system directly through the perceptual system. Consistent with this view, several studies found that varying the contribution from either the semantic or the perceptual system individually has little effect on episodic memory (Graham et al., 2000; Simons & Graham, 2000; Simons, Graham, Galton, Patterson, & Hodges, 2001; Mayes & Roberts, 2001). While Tulving (2001) criticized the methodology of these studies, our model suggests an alternative interpretation that reconciles those results with serial encoding. In our model, the semantic system can represent inputs that it was never trained on, using an inappropriate semantic representation. The accuracy of episodic memory in this case remains quite reasonable, just not as high as for memories stored with an appropriate representation.
In our study, the semantic system was trained only once and then remained static throughout testing. In principle, our model can accommodate changes in the semantic system after episodic memories have been stored. Such changes will lead to distortions of the episodic memories. Other models have found that episodic memory is fragile and might be easily eroded in the face of neocortical plasticity (Káli & Dayan, 2004). If the semantic system changes severely, it might become impossible to recall the stored episodes, as happens perhaps in childhood amnesia. Children can remember events from before the age of 3 or 4 years, but these memories fade away as children get older and adults show near total forgetting of memories from their early childhood (Fivush & Schwarzmueller, 1998; Rubin, 2000). Future work is needed to study the distortion or forgetting of episodic memory due to changes in the semantic system. Another perspective for future work is that semantic information can be acquired from learned episodes as well, which means that the semantic representation layer could be trained by episodic memory.
5.2 Chaining Model and Episodic System
The chaining model was a popular approach to modeling serial-order memory (Murdock, 1993, 1995; Lewandowsky & Murdock, 1989). However, a number of studies have demonstrated over the years that chaining does not account for some key features of serial order in short-term memory (Brown, Preece, & Hulme, 2000; Burgess & Hitch, 1999; Henson, 1996a, 1998; Page & Norris, 1998). As a result, contemporary computational studies have turned to other principles (e.g., Burgess & Hitch, 1999; Farrell, 2012; Henson, 1998; Page & Norris, 2009). One essential distinction between our study and these others is that we model the episodic memory system, not sequence learning in general, and, more specifically, the role of CA3 in episodic memory. Our algorithmic model stores the association between consecutive memory patterns in a sequence, though this association in our model is predetermined rather than built up during learning. By contrast, most cognitive studies of serial-order recall have been performed in the framework of working memory (see Hurlstone, Hitch, & Baddeley, 2014, for a review). For instance, subjects are typically asked to recall the list immediately after its last item disappeared (e.g., Henson, 1996b), which can hardly be considered episodic retrieval in our view. The relationship between episodic memory and working memory is beyond the scope of our current study.
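For illustration, storing associations between consecutive patterns can be sketched as a heteroassociative outer-product store (a textbook sketch of the chaining principle, not the paper's implementation, in which the associations are predetermined rather than learned; recall is exact only for orthonormal patterns and degrades for correlated or noisy ones, which is one root of the criticisms cited above):

```python
import numpy as np

def store_chain(patterns):
    """Store the associations s_t -> s_{t+1} in a single weight matrix."""
    dim = patterns.shape[1]
    W = np.zeros((dim, dim))
    for s_t, s_next in zip(patterns[:-1], patterns[1:]):
        W += np.outer(s_next, s_t)  # Hebbian heteroassociative outer product
    return W

def recall_chain(W, cue, n_steps):
    """Replay the stored sequence from a cue pattern by iterating s <- W s."""
    seq = [cue]
    for _ in range(n_steps):
        seq.append(W @ seq[-1])
    return np.array(seq)
```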
5.3 Comparison to Other Computational Models
While numerous experimental studies have investigated the relation between different memory systems, very few computational studies have done so. Howard and colleagues (Howard & Kahana, 2002; Howard, Shankar, & Jagadisan, 2011) suggest that it is possible to build a model of semantic memory acquisition in the same framework, involving drifting temporal context, as their model of episodic recall. Káli and Dayan (2004) have suggested that episodic memory and semantic memory share the same representational format (i.e., static patterns) but are distinguished by different amounts of overlap between different memory patterns. However, their focus was on studying the effect of replay on memory in the cortico-hippocampal loop, and so they did not model the storage of episodic memories in the hippocampus. Hintzman's Minerva 2 model proposed that two distinct memory types can be produced by a model that only stores episodes (Hintzman, 1984, 1986). The model is a multiple trace model that stores all repetitions of an item as different episodic traces. Correspondingly, the information retrieved reflects the summed content of all activated traces in parallel. Abstract concepts, or semantic information, can be derived from the pool of episodic traces as an artifact of averaging at retrieval (“prototype effect”). Finally, Battaglia and Pennartz (2011) modeled the formation of semantic memory from relational episodic information during memory consolidation. That study examined the opposite flow of information as compared to our computational model, from the episodic to the semantic system.
Our conceptual and algorithmic model of episodic memory is fully consistent with and complementary to a theoretical framework for the function of the hippocampus in episodic memories that was proposed by Cheng (2013). That framework, called CRISP, proposes that the recurrent connections in region CA3 generate intrinsic neural sequences, which are then associated with sequences of external inputs to store episodic memories. While the complete CRISP framework has not been implemented in a computational model yet, some aspects, such as the generation of intrinsic sequences in CA3, have been (Azizi, Wiskott, & Cheng, 2013; Hopfield, 2009). A central prediction of CRISP, that pattern completion in the hippocampus might not depend on CA3, has been confirmed in a computational study (Neher, Cheng, & Wiskott, 2015). In our study, we use a simplified algorithmic model to store and retrieve episodic sequences. Now that we have a better grasp of how this sequence storage interacts with the semantic system, the next step will be to implement sequence storage in a neural network.
5.4 Experimental Test of Our Theoretical Predictions
While the details of our modeling study and our experiment are not well matched, they follow parallel designs in principle. Subjects/networks first received semantic training on a certain set of objects; then episodic memory was tested on the training stimuli as well as unknown ones. This way, we can use a within-subject design to compare episodic retrieval performance when the semantic representation is appropriate to when it is inappropriate. We found better episodic memory for contextual information for the trained than the untrained objects, but comparable performance on the occurrence of events in both human subjects and our computational model. Two additional aspects are worth keeping in mind when interpreting the differing effects of semantic representations on different measures of episodic memory. First, our adult subjects have had many exposures to all sorts of objects during their lifetime, for which they have developed semantic representations. The training in our experiment is expected to improve the semantic representation of the trained objects by only a small amount relative to the representation of the untrained objects. Hence, only a small effect would be expected on episodic memory. Second, the semantic training (exposure to the object and uncorrelated information) and testing was unrelated to the episodic memory task (remembering video clips). Any improvement on episodic memory is an effect of generalization and therefore expected to be small. These small improvements might be observable in only some measures of memory performance.
Other studies have shown that a familiarization process similar to the one in our experiment facilitates item recognition memory, free recall, and associative memory (Ratcliff, Clark, & Shiffrin, 1990; Murnane & Shiffrin, 1991; Kilb & Naveh-Benjamin, 2011). These effects can be induced through either longer exposure time or repetition of items during the study phase. Although the underlying mechanism is still not clear, McClelland and Chappell (1998) suggested that familiarization with an item induces more accurate knowledge of its characteristics, which allows it to be easily differentiated from other items. This interpretation might be compatible with our view that the semantic training in our study allowed subjects to build a more appropriate semantic representation of the trained objects.
Nevertheless, future studies are needed to bridge the differences between our current modeling and experimental setups. First, the training paradigms were quite different. Although the sole purpose of the training was to improve the semantic representations and not to learn particular content, it would nonetheless be preferable to have more similar training procedures. Second, multiple objects appeared simultaneously and interacted in the videos, while only one object was visible at a time in our computational studies. Simpler sequences than those used in the experiment would be too simple for human subjects, whereas more complex image sequences could not be analyzed by our current model. However, in principle, SFA can be used to extract object identity from a group of training objects displayed alternately (Franzius et al., 2011) or simultaneously (Legenstein, Wilbert, & Wiskott, 2010). Hence, an interesting future study would be to test whether our results hold for features including object identity.
We have proposed an abstract computational model of the interdependence between the semantic and episodic systems. In this study, we have presented evidence from both computational and experimental studies suggesting that semantic representations play an important role in the encoding and retrieval of episodic memory, in particular its spatial and temporal aspects. Our model is able to cope with high-dimensional visual input and to store and retrieve temporal sequences. These properties set it apart from other models. However, many aspects of our model need to be investigated further by both experimental and computational studies, ideally using parallel designs.
We thank Elena Moiseeva for her support in the initial design of the study, in particular, of the experiment. This work was supported by a grant from the Stiftung Mercator (S.C.) and grants from the German Research Foundation (DFG) through the SFB 874, projects B2 (S.C.), B3 (L.W.), and B6 (C.B.).