Forming a memory often entails the association of recent experience with present events. This recent experience is usually an information-rich and dynamic representation of the world around us. We here show that associating a static cue with a previously shown dynamic stimulus yields a detectable, dynamic representation of this stimulus. We further implicate this representation in the decrease of low-frequency power (∼4–30 Hz) in the ongoing EEG, which is a well-known correlate of successful memory formation. The reappearance of content-specific patterns in desynchronizing brain oscillations was observed in two sensory domains, that is, in a visual condition and in an auditory condition. Together with previous results, these data suggest a mechanism that generalizes across domains and processes, in which the decrease in oscillatory power allows for the dynamic representation of information in ongoing brain oscillations.
Not everything we associate in our memory occurs at the same time. When our favorite soccer player is given a red card, for instance, we are able to bring this together with the events we just witnessed a few seconds before. Later, we are naturally able to recall all relevant information leading to the red card. To successfully make this association, our brain has to accomplish two things. First, it has to keep track of the past and maintain a representation of the events in the ongoing soccer match, and second, it has to form memories in which past events are connected to the red card. Processes during the encoding phase that will determine our ability to later remember events can be investigated with the so-called subsequent memory paradigm (Paller & Wagner, 2002; Wagner, Koutstaal, & Schacter, 1999). Subsequent memory effects refer to the neural activity that distinguishes remembered from not remembered items at the time of encoding and are well documented in magnetoencephalography/electroencephalography (MEG/EEG) and fMRI, showing involvement of cortical as well as medial-temporal lobe regions (e.g., Long, Burke, & Kahana, 2014; Otten, Quayle, Akram, Ditewig, & Rugg, 2006). Concerning MEG/EEG power, decreases in low-frequency (<40 Hz) brain dynamics have repeatedly and consistently been related to successful memory formation (Hanslmayr & Staudigl, 2014).
It has recently been proposed that cortical power decreases in the alpha/beta frequency range allow for a rich representation of memory content, because a desynchronized system has more flexibility to code information than a system of high synchrony. We call this view the information via desynchronization framework (Hanslmayr, Staudigl, & Fellner, 2012). Confirming this idea, we have shown that sustained power decreases in the alpha band at approximately 8 Hz contain item-specific information about the remembered content when participants successfully replay dynamic stimuli (i.e., video and sound clips) from memory (Michelmann, Bowman, & Hanslmayr, 2016). In that study, we provided direct evidence that power decreases are involved in the representation of stimulus-specific information (Hanslmayr et al., 2012). Moreover, these results are well in line with numerous studies showing that perception is not continuous but is rather rhythmically sampled at a frequency of ∼7–8 Hz (Hanslmayr, Volberg, Wimber, Dalal, & Greenlee, 2013; Landau & Fries, 2012; VanRullen, Carlson, & Cavanagh, 2007). These outcomes, therefore, indicate that rhythmic patterns from the perception of dynamic stimuli can reappear during internally guided retrieval processes, in the absence of the stimuli themselves (Michelmann et al., 2016). Accordingly, these prior findings also suggest the possibility that the replay of temporal patterns can be observed in a situation where dynamic stimuli have to be associated with a static cue.
To address this question, we here analyze the data during the encoding phase from a previous data set (Michelmann et al., 2016). The paradigm required participants to associate a dynamic stimulus with a static word that was used as a cue in the later retrieval phase. Importantly, during encoding the perception of the dynamic stimulus and the presentation of the word cue were temporally separated, that is, in every trial, one out of four dynamic stimuli was followed by a unique word cue (Figure 1A–B). In a visual condition, these dynamic stimuli consisted of four short video clips; in an auditory condition, four short sound clips were used. In a later retrieval block, participants were presented with the word cue and were tested on whether they remembered the associated video/sound clip.
We hypothesize that, to associate the word cue with the dynamic stimulus, participants maintain (i.e., replay) a sensory representation of the dynamic stimulus, which is why we refer to this phase as the “maintenance and association” phase. Using temporal pattern similarity analysis, we should therefore be able to detect the replay of these patterns during the “maintenance and association” phase, that is, when the association between a word and the sound/movie is formed. In accordance with the information via desynchronization framework, we should observe stronger decreases for later remembered versus later not remembered items. This subsequent memory effect should be most evident in the frequency band that codes for the representation of the dynamic stimulus, that is, 8 Hz as per our previous findings. Moreover, if power decreases enable a richer representation of the perceptual content, we should already observe stronger power decreases for later remembered compared with later not remembered stimuli during the perception of the dynamic stimulus.
Twenty-four healthy, right-handed participants (18 women and 6 men) participated in this study. Seven further participants were tested or partly tested but could not be analyzed because of poor memory performance (n = 2), misunderstanding of instructions (n = 2), and poor quality of EEG recording and technical failure (n = 3). All participants had normal or corrected-to-normal vision. The average age of the sample was 23.38 (SD = 3.08) years. Participants were native English speakers (20), bilingual speakers (2), or had lived for more than 8 years in the United Kingdom (2). Ethics approval was granted by the University of Birmingham research ethics committee, complying with the Declaration of Helsinki. Participants provided informed consent and were given a financial compensation of £24 or course credit for participating in the study.
Material and Experimental Setup
The cues amounted to 360 words that were downloaded from the MRC Psycholinguistic Database (Coltheart, 1981). Stimulus material consisted of four video clips and four sound clips in the visual and auditory session, respectively. All clips were 3 sec long; videos showed colored neutral sceneries with an inherent temporal dynamic, and sounds were short musical samples, each played by a distinct instrument. In both sessions, a clip was associated with 30 different words. Sixty words were reserved for the distractor trials. Those words were not associated with a clip but only shown as new words during memory retrieval. Twelve additional words were used for instruction and practice of the task. For presentation, words were assigned to the clips or to distractors in a pseudorandom procedure, such that they were balanced for Kucera–Francis written frequency (mean = 23.41, SD = 11.21), concreteness (mean = 571, SD = 36), imageability (mean = 563.7, SD = 43.86), number of syllables (mean = 1.55, SD = 0.61), and number of letters (mean = 5.39, SD = 1.24). Furthermore, lists were balanced for word frequencies taken from SUBTLEXus (Brysbaert & New, 2009). Specifically, the sublexus SUBTLWF was employed (mean = 20.67, SD = 27.16). The order of presentation was also randomized, ensuring that neither the clips and their associates nor distractor words were presented more than three times in a row or in temporal clusters. The presentation of visual content was realized on a 15.6-in. CRT monitor (Taxan Ergovision 735 TC0 99) at a distance of approximately 50 cm from the participant’s eyes. The monitor refreshed at a rate of 75 Hz. On a screen size of 1280 × 1024 pixels, the video clips appeared in the dimension of 360 pixels in width and 288 pixels in height. Arial was chosen as the general text font, but font size was larger during presentation of word cues (48) than during instructions (26). 
To reduce the contrast, white text (RGB: 255, 255, 255) was presented against a gray background (RGB: 128, 128, 128). Auditory stimuli were presented using a speaker system (SONY SRS-SP1000). The two speakers were positioned at a distance of approximately 1.5 m in front of the participant with 60 cm of distance between the speakers.
Upon informed consent and after being set up with the EEG system, participants were presented with the instructions on the screen. Half of the participants started with the auditory session; the others were assigned to undertake the visual task first. Both sessions consisted of a learning block, a distractor block, and a test block. The sessions were identical in terms of instructions and timing and differed only in the stimulus material that was used. During instruction, the stimulus material was first presented for familiarization and then used in combination with the example words to practice the task. Instructions and practice rounds were completed in both sessions.
As a way to enhance memory performance, participants were encouraged to use memory strategies. The suggestion was to imagine the word in a vivid interaction with the material content, yet the choice of strategy remained with the participant. In the learning block, 120 clip word sequences were presented. Each sequence started with a fixation cross that was presented in the center of the screen for 1 sec, and then the video clip played for 3 sec. In the auditory condition, the fixation cross stayed on the screen, and the sound clip played for 3 sec. Immediately after the clip, a word cue was presented in the center for 4 sec, giving the participant time to learn the association. After that, an instruction requested participants to subjectively rate on a 6-point scale how easy the association between the clip and the word was. After a press on the space bar, this scale was shown. Equidistant categories were anchored with the labels “very easy” and “very hard”; those labels were displayed at both ends above the scale. Participants used six response buttons to rate the current association (see Figure 1). In total, each video clip and each sound clip was shown 30 times. Each word cue was unique; therefore, targets were shown only a single time during encoding and retrieval, and distractors were only shown a single time during retrieval.
In the distractor block, participants engaged in a short unrelated working memory task, namely they counted down in steps of 13, beginning from 408 or 402, respectively. After 1 min, the distractor task ended. Following a short self-paced break, participants refreshed the instructions on the retrieval block.
In this retrieval block, either a cue or a distractor word (i.e., a new word) was presented upon a button press on the space bar. Participants were instructed to try to vividly replay the content of the corresponding video clip or sound clip in their mind upon presentation of the cue. The word stayed on the screen for 4 sec, giving the participant the opportunity to replay the memory. Finally, a fixation cross was presented for a varying time window between 250 and 750 msec to account for movement and preparatory artifacts before the response options appeared on the screen.
Six response options were shown. Four small screenshots of the videos or four black-and-white pictures of the featured instruments were presented in equidistant small squares of 30 × 30 pixels. In addition, the options “new” and “old” were displayed as text at the leftmost and rightmost positions (see Figure 1C–D). Participants could indicate the target (video/sound) they had just replayed by pressing the button corresponding to that clip. Alternatively, they could indicate that the word was a distractor by pressing the button corresponding to “new,” or indicate that they remembered the word but could not remember the clip it was associated with by pressing the button corresponding to “old.” The positions of “old” and “new” at the ends as well as the permutation of the four target positions in the middle were counterbalanced across participants. Finally, after the decision, a 6-point rating scale with equidistant categories ranging from “guess” to “very sure” was presented, on which participants rated the confidence in their response. Participants could also press “F2” in case of an accidental wrong button press; in this case, the whole trial was discarded from analysis. Following the retrieval block, individual electrode positions were logged, allowing for a break of approximately 30 min before beginning the second session. During encoding, participants performed 120 trials in the visual and auditory conditions, respectively. Each stimulus was therefore associated with a unique word cue 30 times. During retrieval, participants performed a total of 180 trials in each condition that consisted of the 120 unique word cues from encoding and an additional 60 distractor words. The whole experiment therefore comprised 240 encoding and 360 retrieval trials.
The recording of behavioral responses and the presentation of instructions and stimuli were realized using Psychophysics Toolbox Version 3 (Brainard, 1997) with MATLAB 2014b (MathWorks) running under Windows 7, 64-bit version on a desktop computer. Response buttons were S, D, F, J, K, L on a standard QWERTY layout. Buttons were highlighted and corresponded spatially to the response options on the screen, so participants did not have to memorize the keys. To this end, the shape of corresponding fingers was also displayed under the scale. To proceed, participants used the space bar during the experiment. Physiological responses were measured with 128 sintered Ag/AgCl active electrodes, using a BioSemi ActiveTwo amplifier. The signal was recorded at a 1024-Hz sampling rate on a second computer via ActiView recording software, provided by the manufacturer (BioSemi). Electrode positions were logged with a Polhemus FASTRAK device (Colchester) in combination with Brainstorm (Tadel, Baillet, Mosher, Pantazis, & Leahy, 2011) implemented in MATLAB.
The data were preprocessed using the Fieldtrip toolbox for EEG/MEG analysis (Oostenveld, Fries, Maris, & Schoffelen, 2011). Data were cut into trial segments from 2.5 sec prestimulus to 7 sec after the onset of the dynamic stimulus. The linear trend was removed from each trial, and a baseline correction was applied based on the whole trial. Trials were then downsampled to 512 Hz, and a band-stop filter was applied at 48–52, 58–62, 98–102, and 118–122 Hz to reduce line noise at 50 Hz and noise at 60 Hz; in addition, a low-pass filter at 140 Hz was applied. After visual inspection for coarse artifacts, an independent component analysis was computed. Eye blink artifacts and eventual heartbeat/pulse artifacts were removed, bad channels were interpolated, and the data were referenced to average. Finally, the data were inspected visually, and trials that still contained artifacts were removed manually. In the auditory condition, on average, 12.95% of trials were excluded during preprocessing (SD = 4.71%, min = 4.17%, max = 25.83%); in the visual condition, excluded trials amounted to an average of 14.31% (SD = 4.49%, min = 5.83%, max = 25.00%). The number of trials that went into the similarity analysis after preprocessing did not differ across the four stimuli, neither in the visual condition (mean = 26.13, 25.00, 26.13, and 25.58 trials; SD = 2.09, 1.91, 2.29, and 2.04 trials; ANOVA: F(2.96, 68.15) = 2.037, p = .12) nor in the auditory condition (mean = 26.04, 26.25, 25.71, and 26.46 trials; SD = 1.92, 1.92, 2.12, and 2.06 trials; ANOVA: F(2.68, 61.54) = 0.906, p = .434).
For behavioral analysis, correct trials were defined as those in which the target was correctly identified. The confidence rating of the response was considered as high if a rating of 5 or 6 was selected. Misses were defined as trials in which a cue word was incorrectly identified as a new word, the wrong clip was selected, or the response “old” was given to indicate recognition of the word without remembering the target video or sound it was associated with.
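The scoring rule described above can be illustrated with a minimal Python sketch. The function name, argument names, and label strings are hypothetical and were chosen for illustration only; they are not taken from the original analysis scripts.

```python
def classify_trial(is_cue, response, target, confidence):
    """Label one retrieval trial for the subsequent memory analysis.

    is_cue:     True if the word was paired with a clip during encoding
                (hypothetical flag; False marks a distractor word).
    response:   "new", "old", or the identifier of the selected clip.
    target:     identifier of the clip paired with the cue (None for distractors).
    confidence: rating on the 6-point scale (1 = guess, 6 = very sure).
    """
    if not is_cue:
        # Distractor words are neither hits nor misses in this scheme.
        return "distractor"
    if response == target:
        # Correct clip selected; ratings of 5 or 6 count as high confidence.
        return "hit_high" if confidence >= 5 else "hit_low"
    # For cue words, "new", "old", or a wrong clip all count as misses.
    return "miss"
```

For example, a cue paired with clip "A" that is answered with "old" would be labeled a miss, whereas selecting "A" with a confidence rating of 5 would be labeled a high-confidence hit.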
Oscillatory power was determined by multiplying the Fourier-transformed data with a complex Morlet wavelet of six cycles. Raw power was defined as the squared amplitude of the complex Fourier spectrum and estimated for every fourth sampling point (i.e., sampling rate of 128 Hz). For the contrast of subsequent hits and subsequent misses, a baseline was computed as the average power between −1 and 7 sec of all trials within the contrast (Long et al., 2014). Every trial was then normalized by subtracting the baseline and subsequently dividing by the baseline, $(\mathrm{activity}_{tf} - \mathrm{baseline}_{f}) / \mathrm{baseline}_{f}$, where $t$ indexes time and $f$ indexes frequency. The relative power was calculated for all frequencies between 2 and 30 Hz.
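As an illustration of this normalization, the following Python sketch computes relative power for a single frequency and channel. It assumes that raw power is already available as one list of time points per trial and that these lists are restricted to the baseline interval (−1 to 7 sec); the function name and data layout are assumptions and do not reproduce the original code.

```python
def relative_power(trials):
    """Normalize raw power to relative change from a common baseline.

    trials: list of trials; each trial is a list of raw power values over
            time for ONE frequency and ONE channel (assumed layout).
    The baseline is the average power across all time points of all trials
    in the contrast, so hits and misses share the same reference.
    """
    n_values = sum(len(trial) for trial in trials)
    baseline = sum(sum(trial) for trial in trials) / n_values
    # (activity - baseline) / baseline for every time point of every trial
    return [[(power - baseline) / baseline for power in trial]
            for trial in trials]


# Two toy trials; the common baseline is 18 / 6 = 3.0
normalized = relative_power([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

Because the baseline is pooled over all trials of the contrast, a value of 0 means "equal to the average power at this frequency" and a value of 1 means "twice the average power".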
Phase Pattern Analysis during Perception and “Maintenance and Association”
While participants learned the associations in the encoding block, they repeatedly perceived (saw/heard) the same dynamic stimulus. Content-specific properties could consequently be identified if they were shared by trials of the same content but not by trials of a different content. Hence, content-specific phase during perception was assessed by contrasting the phase similarity between pairs of trials in which the same content was presented, with the phase similarity of an equal number of trial pairs that were of different content. For each pair of trials, the cosine of the absolute angular distance was then computed and finally averaged across all (same or different) combinations. The average similarity value for same and different combinations was subjected to statistical testing across participants at every time point, at every electrode, and in every frequency of interest; this contrast embodies content-specific phase patterns during perception.
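The pairwise similarity measure can be sketched in Python for a single time point, electrode, and frequency. The sketch makes two simplifying assumptions that are not part of the procedure described above: phases are given as one value in radians per trial, and all available different-content pairs are averaged rather than a subset matched in number to the same-content pairs. Function and variable names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_phase_similarity(phases, labels):
    """Average cosine similarity of phase for same- and different-content pairs.

    phases: one phase angle (radians) per trial.
    labels: content label of each trial (e.g., which clip was shown).
    Returns (mean similarity of same pairs, mean similarity of different pairs).
    """
    same, different = [], []
    for i, j in combinations(range(len(phases)), 2):
        # cos is even and 2*pi-periodic, so this equals the cosine of the
        # absolute angular distance between the two phases.
        similarity = math.cos(phases[i] - phases[j])
        if labels[i] == labels[j]:
            same.append(similarity)
        else:
            different.append(similarity)
    return sum(same) / len(same), sum(different) / len(different)


# Toy example: clip "A" always at phase 0, clip "B" always at phase pi
same_sim, diff_sim = pairwise_phase_similarity(
    [0.0, 0.0, math.pi, math.pi], ["A", "A", "B", "B"])
```

In this toy case, trials of the same clip are perfectly aligned and trials of different clips are in antiphase, so the same-content average is maximal and the different-content average is minimal.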
Participants also repeatedly associated the same dynamic stimuli (one of four videos/sounds) with a different word cue. Therefore, the temporal pattern during perception of the dynamic stimulus could also be compared with the temporal pattern during association. This way the similarity between combinations of same content (e.g., watching Video A, associating Video A to a cue) could be compared with the similarity between combinations of different content (e.g., watching Video A, associating Video B to a cue). Notably, excluding within-trial combinations eliminates the potential confound of temporal autocorrelation.
In the end, the phase similarity between combinations of same content was contrasted with the phase similarity between trials of different content. This contrast reveals phase patterns that are specific to the dynamic stimulus which participants associated with the cue. Phase similarity between perception and association was assessed across time points, consequently comparing temporal patterns. To this end, a time window from perception was used as a sliding window, and a measure of phase coherence over time was computed (see below and Figure 3A–B).
To maximize the signal-to-noise ratio, the following restrictions were applied: The tested frequency was 8 Hz, following our previous results and hypotheses (Michelmann et al., 2016). The time window during perception was centered on the cluster in which phase patterns were most reliably content specific during encoding (i.e., the cluster with the lowest p value). Finally, all possible combinations of trials were used regardless of subsequent memory performance.
Lachaux et al. (2000) suggest computing the single-trial phase-locking value (S-PLV) over 6–10 cycles of a frequency for a good signal-to-noise ratio; for our purposes, S-PLV was applied to a time window of eight cycles, which resulted in a 1-sec window for 8 Hz. Phase values were extracted by multiplying the Fourier-transformed data with a complex Morlet wavelet of six cycles. Phase values were then downsampled to 64 Hz. The similarity measure was computed for every pair of trials in the combinations of same content and in the combinations of different content. Importantly, the sliding window approach that was used accounts for the non-time-locked nature of the data (temporal patterns could be present anywhere in the “maintenance and association” interval). This resulted in a time course of similarity for the combinations of same and of different content.
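A minimal Python sketch of the sliding window comparison is given below. It assumes that S-PLV is computed as the length of the mean phase-difference vector across the samples of a window, which is our reading of Lachaux et al. (2000), and that phase time courses are lists of angles in radians sampled at 64 Hz. The helper names are hypothetical, and the handling of windows at the end of the trial is omitted.

```python
import cmath
import math


def splv(phase_a, phase_b):
    """Phase locking over time between two equally long phase windows."""
    n = len(phase_a)
    mean_vector = sum(cmath.exp(1j * (a - b))
                      for a, b in zip(phase_a, phase_b)) / n
    return abs(mean_vector)  # 1 = constant phase lag, 0 = no consistent lag


def sliding_splv(template, series):
    """Compare one perception window with every window of a longer series."""
    width = len(template)
    return [splv(template, series[start:start + width])
            for start in range(len(series) - width + 1)]


def phase_ramp(freq_hz, fs_hz, n_samples, offset=0.0):
    """Toy phase time course of a pure oscillation (illustration only)."""
    return [2 * math.pi * freq_hz * k / fs_hz + offset
            for k in range(n_samples)]


fs = 64                                   # phase sampling rate in Hz
template = phase_ramp(8, fs, fs)          # eight cycles of 8 Hz = 1 sec
similarity = sliding_splv(template, phase_ramp(8, fs, 4 * fs, offset=0.7))
```

In the toy example the template and the series share the same 8-Hz rhythm with a fixed lag, so every window yields a value close to 1; a series oscillating at a different frequency would yield values close to 0.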
The difference in this similarity was first averaged across the whole “maintenance and association” episode (between 3500 and 7000 msec) and then statistically tested across participants with a random permutation procedure (see Statistical analyses) based on clusters of summed t values across electrodes (Maris & Oostenveld, 2007). In a second test, the time courses at every electrode were compared with a series of t tests and subsequently tested with a cluster-based random permutation procedure, where clusters were summed across electrodes and time (see also Statistical analyses, below). In addition, a control frequency was tested, namely 6 Hz, based on the results from the power analysis. Time windows were defined accordingly for this frequency as eight cycles around the center of the most reliable cluster during perception.
Behavioral results were compared between the auditory and visual condition with a series of paired t tests. p Values were compared against a Bonferroni-corrected threshold (Bland & Altman, 1995); however, no specific hypothesis was tested. Performance differences between stimuli were tested with a repeated-measures ANOVA with the factor Stimulus (1, 2, 3, and 4) under Greenhouse–Geisser correction.
Decreases in Power
To test for differences in baseline-corrected power, a paired t test was first computed for every time point and frequency at every channel. For multiple-comparison correction, a random permutation procedure was applied (Maris & Oostenveld, 2007). This procedure sums up neighboring t values above a cluster-forming threshold and compares the resulting cluster sums to the distribution of the maximal cluster sums that are derived when condition labels are randomly swapped with the Monte Carlo method. The minimum number of neighboring channels to be considered a cluster was set to 3, which attenuates the impact of noise with high spatial frequency; neighboring electrodes were derived via the triangulation method of the Fieldtrip toolbox (www.fieldtriptoolbox.org/). The clusters were summed across time, frequency, and channels, and then labels were permuted 1000 times; thresholding of the clusters as well as the testing of the null hypothesis were addressed with an a priori defined threshold for single-sided testing (α level of .05). To identify frequencies with a reliable power difference, a paired samples t test was computed for every frequency on the average power difference across all channels and across the whole time window of interest.
Phase Similarity during Perception of the Dynamic Stimulus
Phase similarity during perception was tested in the same way as power. A series of paired t tests was computed to contrast the average similarity of combinations of same content with the average similarity of combinations of different content. t Values for every frequency band, electrode, and time point were then corrected for multiple comparisons in an unrestricted cluster-based permutation approach. The cluster permutation compared again the sums of t values across frequency, electrodes, and time against the distribution of these clusters derived via the Monte Carlo method (as described above). Later, the frequency 8 Hz was tested separately with the same cluster permutation to identify a temporospatial cluster, in which the 8-Hz phase could differentiate content particularly well.
Phase Similarity between Perception and “Maintenance and Association”
The similarity between the time window during perception and the maintenance episode was tested for differences between combinations of same and combinations of different content. As mentioned above, in a first step, the average difference between 3.5 and 7 sec was contrasted with a paired t test on every electrode to test for a general effect. Multiple-comparison correction was again done with the following procedure: The labels of the average similarity for same and different content combinations were randomly swapped for each participant 1000 times, thereby eliminating the association between condition and observed similarity. Then, a t test was computed between random conditions on the average similarities (3.5–7 sec) in the same way as with the real conditions. Subsequently, t values were summed across neighboring electrodes for the random labels and for the real conditions; however, only those t values that exceeded the critical threshold (i.e., a t value corresponding to α = .05, defined a priori) were considered. This way, 1000 cluster sums were derived from the random labels, which formed the null distribution. The real cluster sum under correct labels was compared against this distribution. The p value of the whole statistical test was derived as the proportion of random cluster sums that were bigger than the observed cluster sum. In a second step, a paired t test was computed not only for every electrode but also at every time point during the “maintenance and association” interval. Differences were again tested with a cluster-based permutation approach. Now, clusters were formed by summation of the thresholded t values across neighboring electrodes and neighboring time points. They were subsequently compared against the distribution of these clusters for 1000 random label permutations in the same procedure described above.
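The first step of this procedure can be sketched in Python under simplifying assumptions. The sketch treats the electrodes as a one-dimensional chain in which only adjacent entries are neighbors (the actual analysis used triangulation-based neighbors in Fieldtrip), omits the minimum number of channels per cluster, and implements the label swap as a sign flip of each participant's same-minus-different difference, which is equivalent for paired data. All function names are hypothetical.

```python
import math
import random


def paired_t(differences):
    """One-sample t value of paired differences (same minus different)."""
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    if variance == 0:  # guard against division by zero
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(variance / n)


def max_cluster_sum(t_values, threshold):
    """Largest sum of t values over a run of adjacent supra-threshold entries."""
    best = current = 0.0
    for t in t_values:
        current = current + t if t > threshold else 0.0
        best = max(best, current)
    return best


def cluster_permutation_p(diffs, threshold, n_perm=1000, seed=0):
    """diffs[p][e]: similarity difference of participant p at electrode e."""
    rng = random.Random(seed)
    n_electrodes = len(diffs[0])

    def statistic(rows):
        t_values = [paired_t([row[e] for row in rows])
                    for e in range(n_electrodes)]
        return max_cluster_sum(t_values, threshold)

    observed = statistic(diffs)
    exceed = 0
    for _ in range(n_perm):
        # Swapping labels within a participant flips the sign of all
        # of that participant's differences.
        flipped = []
        for row in diffs:
            sign = rng.choice((1, -1))
            flipped.append([sign * d for d in row])
        # ">=" is used here, which is slightly more conservative than ">"
        if statistic(flipped) >= observed:
            exceed += 1
    return observed, exceed / n_perm
```

The threshold argument would be the critical t value for single-sided testing at α = .05 with the appropriate degrees of freedom (approximately 1.714 for 23 degrees of freedom). Extending the sketch to clusters across electrodes and time points would require a two-dimensional neighborhood definition, which is left out here.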
In the visual session, participants remembered, on average, 53.92% (SD = 17.56%) of the video clips with high confidence (rating > 4), and they further remembered 9.97% (SD = 7.62%) of the clips with low confidence (Figure 1E). In the auditory session, 44.44% (SD = 19.8%) of the audio clips were subsequently remembered with high confidence, which was significantly less than in the visual condition (t(23) = −2.81, p < .01). An additional 9.06% (SD = 6.9%) of the audio clips were remembered with low confidence. In accordance, the number of subsequent misses was significantly lower in the visual session (mean = 35.07%, SD = 16.43%) than in the auditory session (mean = 45.45%, SD = 20.27%, t(23) = −3.33, p < .01). The total number of subsequent misses in the auditory condition (mean = 45.45%, SD = 20.27%) comprised 17.22% (SD = 12.98%) “old” responses, 14.24% (SD = 10.42%) erroneous selections of “new,” and 13.99% (SD = 9.69%) selections of the wrong sound clip. In the visual condition, subsequent misses (mean = 35.07%, SD = 16.43%) comprised 11.28% (SD = 9.97%) “old” responses, 14.37% (SD = 11.74%) erroneous selections of “new,” and 9.41% (SD = 6.33%) selections of the wrong video clip.
In the visual condition, there was no difference in the number of trials per stimulus that were remembered. The four videos were remembered with a probability of 61.94%, 63.61%, 62.08%, and 68.06% (ANOVA: F(2.39, 54.86) = 2.250, p = .106; SD = 20.29%, 18.04%, 16.93%, and 18.49%). For high confidence, there was a probability of 50.28%, 54.17%, 55.14%, and 56.11% (ANOVA: F(2.42, 55.56) = 1.438, p = .245; SD = 20.45%, 19.91%, 17.28%, and 21.14%) to remember each video.
There was, however, a difference in the number of trials per stimulus that were remembered in the auditory condition. The four sounds were remembered with a probability of 46.53%, 61.81%, 60.28%, and 45.42% (ANOVA: F(2.66, 61.11) = 10.277, p < .001; SD = 24.93%, 20.59%, 20.83%, and 26.70%). For high confidence, there was a probability of 37.22%, 54.17%, 50.42%, and 35.97% (ANOVA: F(2.67, 61.36) = 11.342, p < .001, SD = 25.32%, 23.33%, 19.49%, and 23.26%) to remember the different sounds.
Successful Memory Encoding Is Associated with Low-frequency Power Decreases in the Visual and Auditory Condition
To find correlates of successful memory encoding, the oscillatory power between subsequently remembered (hits) and subsequently not remembered (misses) items was compared. Specifically, we contrasted trials for which associations were subsequently remembered with high confidence, with trials in which the associations were subsequently not remembered correctly. In this analysis, only those data sets were used, in which a minimum of 15 trials remained for hits or misses after preprocessing (n = 18). Two crucial episodes for successful memory encoding were tested separately: (i) the time interval in which the dynamic stimulus was actually perceived (0–3 sec) and (ii) the “maintenance and association” interval (3–7 sec), in which the memory formation would be expected to have taken place. In the time interval from 0 to 3 sec, a small cluster of power decreases was associated with successful memory in the visual condition; it displayed a trend toward significance (p < .07; Figure 2A, left). Likewise, in the auditory condition, a similar cluster of power decreases appeared (p = .047; Figure 2B, left).
During the “maintenance and association” interval (3–7 sec), substantially reduced power in the lower frequencies (<30 Hz) was observed for subsequent hits compared with subsequent misses (Figure 2, middle) in both conditions. In the visual condition, a broad cluster emerged where power was significantly lower when tested against random permutations (p = .031; Figure 2A, middle). Likewise, a broad cluster of significant power decreases appeared in the “maintenance and association” interval of the auditory condition (p < .003; Figure 2B, middle).
To identify frequencies that robustly exhibited lower oscillatory power for successful memory encoding, the power during the “maintenance and association” interval was averaged across all electrodes and time points and differences were subjected to a t test. Following our previous results (Michelmann et al., 2016), we expected the strongest power decreases in both conditions to peak at 8 Hz. Indeed, a clear peak at 8 Hz was observed in the visual condition (t(17) = −2.82, p < .01; Figure 2A, middle). In the auditory condition, however, a peak was observed at 6 Hz (t(17) = −4.45, p < .001; Figure 2B, middle), yet power decreases also extended to 8 Hz (t(17) = −3.53, p = .001).
For the visual condition, the power decreases at 8 Hz displayed a broad topography with a parietal maximum over the left hemisphere (Figure 2A, right). Decreases in 8-Hz power were similarly broadly distributed in the auditory condition, with maxima over left parietal and right frontal regions (Figure 2B, right).
Together, these results confirm the fundamental role of decreases in low-frequency oscillatory power for the successful formation of memory.
Temporal Patterns Are Content Specific during Perception and Can Be Detected during “Maintenance and Association”
The detection of content-specific temporal patterns during the “maintenance and association” period necessitates that the dynamic stimuli themselves elicit temporally distinct neural responses. To address this, we first compared the pairwise phase consistency (Vinck, van Wingerden, Womelsdorf, Fries, & Pennartz, 2010) between trials in which the same dynamic stimulus was perceived with the pairwise phase consistency between trials of different content. These findings of content-specific phase during perception were previously published (Michelmann et al., 2016).
Oscillatory phase of the neural responses was specific to the dynamic stimuli in two broad clusters in the visual condition (p < .001 and p = .003; Figure 3C) and in one broad cluster in the auditory condition (p < .001; Figure 3G), confirming prior reports that the content of dynamic stimuli is tracked by the phase of low-frequency oscillations (Ng, Logothetis, & Kayser, 2013). Vitally, both clusters included 8 Hz, the frequency at which we hypothesized that temporal patterns would reappear in the “maintenance and association” period.
We next identified periods during perception in which the time courses at 8 Hz were maximally content specific by restricting the statistical test to 8 Hz only and selecting the cluster in which content could most reliably be differentiated during perception (i.e., the cluster with the lowest p value). In the visual condition, this cluster extended from −152 to 564 msec (p < .001). Note that poststimulus effects are smeared temporally into the prestimulus interval because of the wavelet decomposition. The most reliable cluster of content specificity in the auditory condition extended from 22 to 871 msec (p = .002). A further cluster in the visual condition was observed between 2,650 and 3,300 msec (p = .016). In the auditory condition, further clusters emerged between 1,818 and 2,627 msec (p = .003) and between 1,203 and 1,504 msec (p = .047), indicating that in both modalities early and later time windows showed content-specific temporal patterns.
For the 8-Hz oscillation, a 1-sec-wide window was then centered on the cluster that most reliably distinguished content during perception (i.e., at 206 msec in the visual condition and 446 msec in the auditory condition; Figure 3A, E). In a sliding window approach, a measure of phase coherence (S-PLV; Lachaux et al., 2000) was computed between this window and every 1-sec-wide window between 3 and 7 sec during the “maintenance and association” period (see Figure 3A–B). This was done for all available trials from encoding, regardless of subsequent memory performance. For practical reasons, at the end of the trial, the window was slid out of the trial and back into the prestimulus interval (zero padding could be an alternative but more intricate approach). This time course of similarity (phase coherence) was computed for trial combinations comprising perception and association of the same stimulus and for trial combinations of perception and association of different content. Importantly, combinations of same content were never built within a trial, ensuring that temporal autocorrelation was balanced between same and different combinations. Furthermore, because only the pairing between trials was changed and the same trials were used to form pairs of same and different content, the signal-to-noise ratio was the same in both conditions.
In a first test, we subjected the average similarity across time to a t test, contrasting same and different combinations at every electrode. A cluster-based permutation test revealed a significant cluster in the visual condition (p < .001), but not in the auditory condition. In a follow-up test, we repeated the t test for every time point at every electrode and summed clusters across time and electrodes. A permutation test revealed two clusters of significant differences in the visual condition (p < .001 and p = .035; Figure 3B, D).
The first cluster was located over left frontal regions and extended from 4.8 to 5.41 sec after stimulus onset (i.e., 1.8–2.41 sec after the start of the “maintenance and association” phase). The second cluster was located over parietal and occipital areas, extending from 4.97 to 5.34 sec (1.97–2.34 sec of the “maintenance and association” phase; Figure 3D, right). We applied the same approach to the auditory condition; a cluster (p = .047) emerged over right frontal regions, extending from 4.11 to 4.44 sec after stimulus onset (1.11–1.44 sec of the “maintenance and association” phase; Figure 3F, H, right). Strictly interpreted, however, this cluster does not survive a corrected alpha threshold. Finally, we also tested the frequency of 6 Hz, which showed the most reliable power decrease in the auditory condition; however, no effects were found.
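The sliding-window similarity measure used in this section can be sketched as follows. This is a simplified illustration, assuming that the 8-Hz phase time series of a perception window and of a “maintenance and association” interval are available as lists of phase angles in radians. Phase coherence is computed here as the length of the mean phase-difference vector across the time points of a window. The function names and the `step` parameter are hypothetical, and the wrap-around of the window at the end of the trial and the averaging across trial pairs are omitted.

```python
import cmath


def plv(phase_a, phase_b):
    """Phase-locking value across time between two equally long phase series."""
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phase_a, phase_b))
    return abs(total) / len(phase_a)


def sliding_similarity(template, target, step=1):
    """PLV between a fixed template window and every same-length window
    of the target phase series."""
    width = len(template)
    return [plv(template, target[start:start + width])
            for start in range(0, len(target) - width + 1, step)]
```

A value near 1 at a given window position indicates that the phase difference between the two windows stays constant over time, that is, that the temporal pattern from perception has reappeared there up to a fixed phase offset. Contrasting these time courses between same-content and different-content trial pairs then yields the test described above.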
For most of the memories that we form during the day, we rely on rich and dynamic ongoing representations of the world around us. At a later point, we then associate these representations with distinct events. Both of these properties of our natural experience are rarely captured in experiments that investigate episodic memory. First, most studies use non-information-rich stimuli to study memory, like words or pictures, and second, material for association is usually presented simultaneously.
In this study, we used a memory task that can mimic memory in a more naturalistic scenario: An ongoing representation of an information-rich, dynamic stimulus is maintained in working memory, to be associated with a subsequent event. In one session, participants repeatedly watched one out of four short video clips, which was immediately followed by a unique word cue. In a second session, participants listened to one out of four sound clips, which they subsequently associated with a cue (Figure 1). To form an association, participants had to maintain a representation of the video/sound clip in working memory.
Investigating the correlates of subsequent memory, we found broad and sustained decreases in ongoing oscillatory power to be associated with successful memory formation. These power decreases were particularly strong while participants maintained dynamic representations in working memory, that is, while they formed the association. Importantly, we found that these power decreases carried stimulus-specific information in their temporal pattern of activity. Specifically, the phase of the 8-Hz oscillation, which we previously linked to content representation (Michelmann et al., 2016) and at which power decreases were strongest in the visual condition, was modulated in a stimulus-specific way.
These results form part of converging evidence for a general mechanism, in which desynchronization of brain oscillations in the cortex, indicated by power decreases, allows for the rich representation of information (Hanslmayr et al., 2012). Specifically, the decrease in oscillatory strength, which also signifies a release from inhibition (Haegens, Nacher, Luna, Romo, & Jensen, 2011; Klimesch, Sauseng, & Hanslmayr, 2007), renders the oscillation less stationary, that is, less predictable. In mathematical terms, this decrease of predictability means an increase in the amount of information that can be coded (Hanslmayr et al., 2012; Shannon & Weaver, 1949). During perception, we observed content-specific patterns over sensory and frontal electrodes; we then tracked these patterns on the same electrodes. When we previously observed reappearing patterns during episodic memory reinstatement, oscillatory patterns were localized in sensory-specific areas (Michelmann et al., 2016). In contrast, the pattern reappearance observed in this analysis displayed a different, that is, more frontal topography. Speculatively, the difference in observed topography may be due to different task demands during the maintenance and association period that require, for instance, working memory processes (e.g., Goldman-Rakic, 1995). The generalization of the desynchronization mechanism across different processes is further complemented by its generalization across modalities; namely, in this study as well as in previous results, we observed oscillatory patterns in desynchronizing brain dynamics for visual and auditory stimuli.
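The information-theoretic argument in this paragraph can be made concrete with a toy calculation of Shannon entropy. This is an illustration only, assuming a hypothetical signal whose state at a given moment falls into one of eight discrete bins. The function name and the example distributions are not taken from the data.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Fully predictable signal: always in the same bin.
predictable = [1.0, 0, 0, 0, 0, 0, 0, 0]
# Unpredictable signal: all eight bins equally likely.
unpredictable = [0.125] * 8
```

Here `entropy_bits(predictable)` is 0 bits and `entropy_bits(unpredictable)` is 3 bits: the less predictable the signal, the more information it can in principle carry.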
Finally, the frequency band of 7–8 Hz has been implicated in the rhythmic sampling of perceptual content (Hanslmayr et al., 2013; Landau & Fries, 2012; VanRullen et al., 2007). These studies integrate well with our findings and suggest that the 8-Hz frequency temporally organizes the representations of stimulus-specific information during perception, episodic memory reinstatement, and episodic memory formation and that decreases in oscillatory power allow these temporal patterns to resurface.
Our results moreover inform current debates about the neural mechanisms underlying working memory. Although some studies have previously shown that content-specific activity patterns can be decoded during working memory maintenance (Jafarpour, Penny, Barnes, Knight, & Düzel, 2017; Fuentemilla, Penny, Cashdollar, Bunzeck, & Düzel, 2010), other studies suggest that representations in working memory may not always be maintained online but rather latently stored in synaptic weights or even via more complex mechanisms (Stokes, 2015). Those representations can then reemerge when they become task relevant, or they can be evoked experimentally either by “pinging” them with unspecific input (Wolff, Jochim, Akyürek, & Stokes, 2017) or by stimulating transcranially with a magnetic pulse (Rose et al., 2016). Hence, an important insight from the present study is that, during the formation of an association with a previously shown dynamic stimulus, a detectable representation of that stimulus reappears. The method that we used to observe these stimulus patterns was specifically tailored to the detection of patterns that are dynamic in nature; that is, it is suited to detect (in)consistencies in phase resets over time. It needs to be acknowledged that parts of the patterns that we track may be strongly driven by evoked responses to the stimulus onset during perception. Importantly, however, our method makes it possible to detect these patterns even when their exact time point of reappearance is unknown and variable between trials. This is very relevant for studies that investigate working memory maintenance because patterns that are involved in the online maintenance of representations in pFC and parietal cortex of nonhuman primates have been found to be highly dynamic (Crowe, Averbeck, & Chafee, 2010; Meyers, Freedman, Kreiman, Miller, & Poggio, 2008).
An open question that remains beyond the scope of this study, however, is whether content-specific temporal patterns can also be elicited by static stimuli and whether our method can help to detect them.
An interesting question that arises from our results is whether the (re)appearance of temporal patterns is functionally relevant for the successful formation of memories. We could demonstrate subsequent memory effects for power decreases here, because a minimum of 15 trials per condition can yield stable power estimates (Hanslmayr, Spitzer, & Bäuml, 2009). We could further link power decreases to the presence of content-specific temporal patterns; however, because the trial count of forgotten associations for most of the participants was too low for stable similarity estimates, it is not clear whether these patterns are functionally involved in memory formation. Specifically, this study was designed to produce a sufficient number of remembered trials, and we consequently could not contrast stimulus-specific temporal patterns between remembered and forgotten associations. Repeating this study in a longer and more adaptive design could, therefore, allow for the contrast of patterns during successful and unsuccessful memory formation.
In addition, future studies should address whether content-specific temporal patterns are causally involved in memory formation, either by disrupting content-specific temporal patterns and therefore tampering with memory formation or even by artificially introducing spurious patterns to cause forged associations.
This research was funded by a European Research Council Consolidator Grant awarded to S. H. (grant agreement 647954) and an Emmy Noether Programme Grant from the Deutsche Forschungsgemeinschaft awarded to S. H. (HA 5622/1-1). S. H. is further supported by the Wolfson Society and Royal Society. The authors would further like to acknowledge Hector de Jesus Cervantes for his help in preprocessing of the data.
Reprint requests should be sent to Sebastian Michelmann, Department of Psychology, Princeton University, Princeton Neuroscience Institute, Princeton, NJ 08544, or via e-mail: firstname.lastname@example.org.
This paper is part of a Special Focus deriving from a symposium at the 2017 annual meeting of Cognitive Neuroscience Society, entitled, “The Dynamics of Cognitive Processes: Multivariate Approaches.”