Objects are perceived within rich visual contexts, and statistical associations may be exploited to facilitate their rapid recognition. Recent work using natural scene–object associations suggests that scenes can prime the visual form of associated objects, but it remains unknown whether this relies on an extended learning process. We asked participants to learn categorically structured associations between novel objects and scenes in a paired associate memory task while ERPs were recorded. In the test phase, scenes were first presented (2500 msec), followed by objects that matched or mismatched the scene; degree of contextual mismatch was manipulated along visual and categorical dimensions. Matching objects elicited a reduced N300 response, suggesting visuostructural priming based on recently formed associations. Amplitude of an extended positivity (onset ∼200 msec) was sensitive to visual distance between the presented object and the contextually associated target object, most likely indexing visual template matching. Results suggest recent associative memories may be rapidly recruited to facilitate object recognition in a top–down fashion, with clinical implications for populations with impairments in hippocampal-dependent memory and executive function.
Rapidly perceiving and categorizing visual objects is an important survival skill, so much so that humans and other animals capitalize on statistical regularities in the environment to enhance its efficiency (Ranganath & Ritchey, 2012; Friston, 2005). It has long been known that objects presented in congruent visual contexts are recognized more accurately and rapidly and processed more efficiently than those in incongruent (but equally cluttered) contexts (Davenport & Potter, 2004; Ganis & Kutas, 2003; Boyce & Pollatsek, 1992; Boyce, Pollatsek, & Rayner, 1989; Biederman, Mezzanotte, & Rabinowitz, 1982; Palmer, 1975). Although much research has focused on what aspects of visual environments are used as cues for object recognition (reviewed in Oliva & Torralba, 2007), a smaller body of work has addressed what aspects of visual objects are brought online or primed in response to contextual cues (e.g., Truman & Mudrik, 2018; Brandman & Peelen, 2017; Schendan & Kutas, 2007; Bar, 2004; Henderson & Hollingworth, 1999; Biederman et al., 1982). These studies suggest that both semantic and structural information about visual objects are primed by scene contexts (e.g., Truman & Mudrik, 2018; Brandman & Peelen, 2017; Mudrik, Shalgi, Lamy, & Deouell, 2014; Mudrik, Lamy, & Deouell, 2010; Gronau, Neta, & Bar, 2008). However, these studies also largely relied on naturally occurring statistical associations that may be learned over long periods of time. Thus, it is possible that scene-based structural priming of visual objects is contingent on an extended learning process, which has neurophysiological and clinical implications. To test this, we primed novel objects using newly associated visual scene contexts and checked for activation of object-related category-level and lower-level visual information.
Context effects on visual object processing are abundant and suggest both semantic and form-based facilitation. Interpretations of ambiguous, degraded, briefly presented, and masked objects are influenced by their visual context (Brandman & Peelen, 2017; Freeman et al., 2015; Barenholtz, 2014; Davenport & Potter, 2004; Bar & Ullman, 1996; Palmer, 1975). For example, an identical visual percept of a blurred object can be categorized as either a hairdryer or a drill, depending on whether it is embedded in a bathroom or workshop scene (Bar, 2004). Similarly, briefly presented objects are more likely to be misidentified as a visually similar object following presentation of a context associated with the incorrect item (Palmer, 1975). Recent evidence has suggested that even racial judgments of faces on an Asian–White continuum are biased by presentation in an American or Chinese setting (Freeman et al., 2015; Freeman, Ma, Han, & Ambady, 2013). However, such effects of scene context on object categorization may be driven by semantic effects on higher level decision processes and do not necessarily reflect form-based priming (Truman & Mudrik, 2018; Henderson & Hollingworth, 1999).
The argument that scenes facilitate object identification by supporting form-based matching processes has been primarily based on indirect neurophysiological evidence (Truman & Mudrik, 2018; Brandman & Peelen, 2017; Bar, 2003, 2004, 2007). Of relevance to the current study, an ERP component, known as the N300 (also referred to as the N350, Ncl, or simply the N3 complex), is elicited by visual objects, and its amplitude is often, but not always, modulated by scene congruency (Võ & Wolfe, 2013; Mudrik et al., 2010; Sitnikova, Holcomb, Kiyonaga, & Kuperberg, 2008; cf. Ganis & Kutas, 2003). The N300 is a frontocentral negativity peaking between roughly 200 and 400 msec, believed to index an object model selection process based on global shape information (Schendan & Kutas, 2002, 2003, 2007), although slight variations in scalp topography across manipulations and over time suggest it may encompass multiple related processes (Truman & Mudrik, 2018; Schendan & Ganis, 2012). Although semantic priming consistently reduces the amplitude of a later and more centrally distributed component, the N400, amplitude of the N300 is reduced more selectively by form-based priming (Kovalenko, Chaumon, & Busch, 2012; Hamm, Johnson, & Kirk, 2002). N300 amplitude is also modulated by factors pertaining to isolated object images, such as the noise level at which a visually degraded object is recognized, and viewpoint canonicity (Schendan & Kutas, 2003; Doniger et al., 2000). fMRI data also suggest that contextual congruency of objects modulates activity in a network including the lateral occipital complex (LOC), which is known to store high-level form representations of visual objects and is believed to be the human analogue of monkey inferotemporal cortex (Gronau et al., 2008; Tootell, Tsao, & Vanduffel, 2003; Grill-Spector et al., 1998). Indeed, N300 scalp topography is consistent with generators in the ventral visual stream, including the LOC (Schendan & Ganis, 2012). 
Moreover, recent fMRI and MEG evidence has shown that cross-decoding accuracy of object animacy in LOC is enhanced by simultaneous presentation of disambiguating scenes and that peak cross-decoding accuracy for objects embedded in scenes is at around 320–340 msec (Brandman & Peelen, 2017). Because of its numerous links to high-level visual processes, including perceptual closure and viewpoint accommodation, as well as form-based visual priming, differences between contextually congruent and incongruent objects on the N300 component have been taken as an index of scene priming on visual form-based matching processes during object recognition.
In the current study, we will not only assess object N300 scene-congruity effects but will also more directly confirm the engagement of perceptual matching processes based on a component-neutral item-based analysis. In our component-neutral analysis, visual similarity between the presented (distorted or mismatching) object and the object expected based on the preceding context is used to predict ERP amplitude. To the extent that visual similarity to the contextually congruent target object predicts ERP amplitude, this suggests that visual information about the target object has been brought online in response to the scene. In this way, we can confirm visual form priming via two different and complementary approaches: a component-based analysis drawing from the N300 literature and a component-neutral analysis that makes a more direct inference based on the results of a single data set.
It is not yet known whether form-based priming of object representations by scenes is contingent on an extended learning process. To the extent that contextual associations can be rapidly learned, this would suggest that contextual information may be recruited to impact visual object understanding in a wider range of situations. For example, rapid learning raises the possibility that contextual facilitation effects extend beyond cases where the observer has extensive experience and the context–object associative statistics are robust across time. Prior evidence suggests that at least some aspects of object–scene priming may rely on hippocampally dependent processes. For example, context–object associations can be rapidly and implicitly learned and applied to facilitate tasks such as visual search (reviewed in Jiang & Chun, 2003). Using an associative memory paradigm, Hannula and colleagues have shown that arbitrary face–scene pairs can be rapidly learned and yield scene congruency effects on face viewing times within a single experimental session, as long as the hippocampus is intact (Hannula, Ryan, Tranel, & Cohen, 2007). Moreover, an ERP adaptation of Hannula and colleagues' face–scene associative memory task revealed that faces matching recently associated scenes elicit a less negative scalp potential at 300 msec, which has timing and scalp topography consistent with N300 facilitation (Hannula, Federmeier, & Cohen, 2006). The current study extends Hannula and colleagues' design to novel object–scene pairs, in an attempt to elicit this early ERP congruency effect using a wider class of object stimuli, and to explicitly link it to visual form analysis. If, indeed, form-based priming of visual objects can be based on recently formed associations, this raises the possibility that visual object recognition in the natural world may be subtly disrupted in elderly or clinical populations that are less able to rapidly form and recruit novel associations.
In the current study, participants associate line drawings of novel objects with natural images of visual scenes. To facilitate learning, novel objects (which fall into two general classes, “germs” and “machines”) are grouped into categories, each of which is associated at study with exemplars of a particular scene category (beaches, highways, etc.) for any given participant. Each scene category is associated with one germ and one machine category. In the test phase, participants are shown scenes from the scene–object pairs they recently studied, followed after 2500 msec by an object that (a) exactly matches what had originally been studied on that specific scene, (b) is a distorted version of the object they studied on that scene, (c) is the wrong object altogether but belongs to the same scene-congruent category (i.e., an object that has been studied on beaches in general but now is displayed on the wrong specific beach), or (d) is the wrong object and belongs to a scene-incongruent category (i.e., a highway-associated object on a beach). If form-based priming of objects by scenes can be induced by recently learned associations, we should expect the N300 to be reduced in amplitude to objects matching the scene, relative to mismatches. We should also expect the ERP response to show graded sensitivity to the degree of low-level visual feature mismatch between presented incongruent objects and the scene-congruent target object, which we assess in a separate analysis. In addition to the N300, we will also examine context-based effects on the N400 and late positive complex (LPC) components, which have been linked to semantic memory and extended visual and task/response-related processing, respectively.
Data are reported from 24 participants (aged 18–28 years, mean age = 21; 10 men), all native English-speaking University of Illinois undergraduates, who were compensated with payment. Three additional participants were replaced because of excessive trial loss or, in one case, use of substances on our exclusionary list. All participants provided written informed consent, according to procedures established by the institutional review board at the University of Illinois. Handedness was assessed using the Edinburgh inventory (Oldfield, 1971). All participants were right-handed; mean score = 0.86, where 1 denotes strongly right-handed and −1 denotes strongly left-handed. Ten reported having left-handed family members. No participants had major exposure to languages other than English before the age of 5 years. All had normal or corrected-to-normal vision, and none had a history of neurological or psychiatric disorders or brain damage or was using neuroactive drugs. Participants were randomly assigned to 1 of 24 experimental lists.
Objects were novel, designed to resemble biological organisms (“germs”), or mechanical devices (“machines”). Object categories were paired with familiar scene categories (e.g., beaches, highways), such that each type of scene was associated with one type of germ and one type of machine for each participant. At test, participants either saw the same object that they had studied, a different exemplar of that object (e.g., a distortion), a different object from the same category (that had not been studied on that particular scene but had been studied on the same type of scene), or an object from a different category (that thus would never have appeared on any scene of that type). The object classes of germs and machines were blocked and never swapped at test. Across the full set of participants, all object types were paired with all scene types, and objects and scenes were never repeated in the study phase. Details of stimulus development and counterbalancing follow.
Stimuli consisted of pairings between photographs of natural scenes and line drawings of novel objects (see Figure 1). Scenes depicted one of six categories: beaches, city streets, mountains, forests, highways, and offices. Scenes were drawn from a pool of 288 images, 48 per category, that were previously normed as being highly representative of their respective scene types and which had been rescaled to 800 × 600 pixels (see Torralbo et al., 2013, for norming details).
Line drawings of novel object prototypes for biological organisms (“germs”) or mechanical devices (“machines”) were created by an artist with the aid of Adobe Photoshop. Within the two classes of germs and machines, drawings were further organized into six categories, each of which contained three objects. Thus, there were 18 germ and 18 machine objects, each with a single representative prototype image. The aim was for objects within a category to share some general properties, such as major parts or appendages, but to also contain features, especially visual textures, that are shared across categories. As is true for real-world categories, then, novel objects within a category have some degree of visual similarity (likely higher for some categories than others), but category membership must also be partially based on experience (here, pairings with scene types). The fact that specific features were never fully diagnostic of category membership thus renders the learning and subsequent categorization process more like that for real-world object categories, in that there are no straightforward “rules” that can dictate what goes together. Critically, though, because the categories are novel, we had control over the type and level of experience participants would have with the objects and categories.
Each object prototype image was scanned and roughly centered and then further manipulated using Photoshop and the animation software Unity to derive individual exemplars. For each object prototype, we altered the prototype image along three continuously changing parameter scales, with the prototype image at the center of each scale. We then took snapshots of the distorted images at various points along each parameter scale to generate three continuously varying sets of eight exemplars, with four exemplars on either side of the central prototype within each set. We thus produced 24 total exemplars per object, from which we sampled to generate the stimulus lists. We never displayed the original prototype images in the experiment. For germs, parameter changes induced global distortions, such as gradually twisting the body or changing its width, whereas for machines, which were assumed to have a more rigid body, parameter changes affected the size and relative positions of component parts, which were first extrapolated from the prototype images using Photoshop layers to avoid unnatural gaps. Often, several parameters would change at once along each scale to heighten the visual dissimilarity among exemplars. See Figure 2 for an illustration of how exemplars were derived from the prototype images. All novel object line drawings were resized to 274 × 274 pixels.
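The exemplar space described above (three parameter scales, eight snapshots per scale, prototype excluded) can be enumerated in a few lines. The following Python sketch is illustrative only: the scale labels, the signed step coding, and the function name are hypothetical and do not come from the stimulus-generation software.

```python
from itertools import product

# Hypothetical labels for the three parameter scales applied to each prototype.
SCALES = ("scale_A", "scale_B", "scale_C")
# Four snapshots on either side of the prototype; step 0 (the prototype) is never shown.
STEPS = (-4, -3, -2, -1, 1, 2, 3, 4)


def exemplars_for_object(object_id):
    """Return one (object, scale, step) triple per exemplar of a single object."""
    return [(object_id, scale, step) for scale, step in product(SCALES, STEPS)]


exemplars = exemplars_for_object("germ_1_1")
print(len(exemplars))  # 3 scales x 8 steps = 24 exemplars per object
```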
Counterbalancing and Experimental Conditions
Twenty-four experimental lists of stimuli were created, grouped into six sets of four. Each list was assigned to a single participant. Lists within each set maintained a fixed study trial correspondence between the six categories of novel objects within each class (germs or machines) and the six scene categories. Thus, for any given list (and participant), each germ or machine type would only ever appear on a single scene category (e.g., “beaches”) at study, and this relationship would hold for all three objects within the category. The mapping between object and scene categories was then systematically rotated across the six sets, so that over the full set of participants every object category was associated with every scene category at study. For example, the first four participants would always see exemplars from Germ Category 1 and Machine Category 1 with beaches during the study phase, whereas the second set of four participants would see exemplars from Germ Category 1 and Machine Category 1 with highways.
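One simple way to realize such a rotation is a cyclic shift of the category mapping from one set of lists to the next. The Python sketch below assumes zero-based indices, a cyclic rotation, and a scene ordering chosen so that the example above is reproduced (Category 1 with beaches in the first set and with highways in the second); the actual assignment order is not specified in the text.

```python
# Scene order is an assumption chosen to reproduce the example in the text.
SCENES = ["beaches", "highways", "city streets", "mountains", "forests", "offices"]


def set_for_list(list_number):
    """Lists 0-23 are grouped into six sets of four lists each."""
    return list_number // 4


def scene_for_category(object_category, list_set):
    """Scene category paired at study with an object category (0-5) in a given set (0-5)."""
    return SCENES[(object_category + list_set) % len(SCENES)]


# Across the six sets, every object category meets every scene category exactly once.
for category in range(6):
    assert {scene_for_category(category, s) for s in range(6)} == set(SCENES)

print(scene_for_category(0, set_for_list(0)))  # beaches
print(scene_for_category(0, set_for_list(4)))  # highways
```

Because the shift is applied identically to germs and machines, each scene category is paired with exactly one germ and one machine category within any list.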
Each list consisted of 288 study and 288 test pairs of object and scene stimuli. Stimuli within each list were organized into 18 blocks of 16 study pairs followed by 16 test pairs. Blocks alternated between all germs and all machines; the first block for each list was always germs. The same set of 288 unique scenes was used across all lists; scenes were presented exactly once at study and once in the corresponding test phase in each list. Each set of four lists had 288 unique test object exemplars that were used across all four lists; these objects were randomly drawn (without replacement) from the total set of possible exemplars, within constraints of the experimental design. Test object exemplars were never repeated within a list but were sometimes repeated across lists, such that some test object exemplars were used more often over the full counterbalancing than others. Seven hundred ninety-three of 864 possible novel object exemplars (24 exemplars/object × 3 objects/category × 6 categories/class × 2 classes) were presented as test objects over the full set of participants.
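The counts in this paragraph follow directly from the design. As a quick arithmetic check (a sketch, with constant names introduced here for illustration):

```python
BLOCKS_PER_LIST = 18
PAIRS_PER_BLOCK = 16

EXEMPLARS_PER_OBJECT = 24
OBJECTS_PER_CATEGORY = 3
CATEGORIES_PER_CLASS = 6
CLASSES = 2

# Study pairs per list (the same count applies to test pairs).
pairs_per_list = BLOCKS_PER_LIST * PAIRS_PER_BLOCK
# All exemplars that could in principle serve as test objects.
possible_exemplars = (EXEMPLARS_PER_OBJECT * OBJECTS_PER_CATEGORY
                      * CATEGORIES_PER_CLASS * CLASSES)
# Share of the exemplar pool actually presented over the full counterbalancing.
proportion_used = 793 / possible_exemplars

print(pairs_per_list, possible_exemplars, round(proportion_used, 3))  # 288 864 0.918
```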
Within each block, an approximately equal number of pairs involving each scene category and object category were presented at both study and test. All 16 scenes presented at study within a block were presented in a different pseudorandom order in the corresponding and immediately following test phase. In the test phase, 4 of the 16 test trials were assigned to each of four experimental conditions (see Figure 3):
Exact match: Test object exactly matched the object exemplar paired with the scene at study.
Distortion: Test object was a different exemplar of the same object presented with the scene at study.
Within-category mismatch: Test object was a different object within the same category as the object presented with the scene at study.
Between-category mismatch: Test object was a different category of object from that presented with the scene at study (and thus belonged to a category of object that would never have been studied with that scene category previously; e.g., an object category that had only ever appeared on offices at study, now appearing on a beach at test).
To illustrate, a typical study phase might contain scene–object (S–O) pairings as follows: S(category = 1, exemplar = a)–O(category = 1, object = 1, exemplar = a), S(2,a)–O(2,1,a), S(3,a)–O(3,1,a), S(3,b)–O(3,2,a), S(3,c)–O(3,3,a), etc. For the corresponding test phase, a match trial might be S(1,a)–O(1,1,a), a distortion trial might be S(1,a)–O(1,1,b), a within-category mismatch trial might be S(1,a)–O(1,2,b), and a between-category mismatch trial might be S(1,a)–O(2,1,b). Because there were only 16 trials per study phase, only 16 of 18 objects (6 categories × 3 objects/category, given the object class) were presented in each study phase. Half of the between-category mismatches in the test phase were created by swapping two objects from different categories that had been presented in the preceding study phase (e.g., study: S(1,a)–O(1,1,a), S(2,a)–O(2,1,a); test: S(1,a)–O(2,1,b), S(2,a)–O(1,1,b)). The other half were created by introducing an object category that had not been presented in the preceding study phase (e.g., study: S(1,a)–O(1,1,a), never present object (2,1) that block; test: S(1,a)–O(2,1,a)). Within-category mismatches were always created by swapping objects within a category that had been presented in the preceding study phase (e.g., study: S(1,a)–O(1,1,a), S(1,b)–O(1,2,a); test: S(1,a)–O(1,2,b), S(1,b)–O(1,1,b)). Other than in the exact match condition, the exact object exemplar presented at test never matched that presented in the preceding study phase.
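The four conditions can be expressed as a comparison between the object studied on a scene and the object shown on that scene at test, using the O(category, object, exemplar) notation above. The following Python sketch is a minimal illustration; the class and function names are hypothetical, and it encodes only the condition definitions, not the list-construction constraints.

```python
from typing import NamedTuple


class NovelObject(NamedTuple):
    category: int  # object category within the class (germs or machines)
    obj: int       # object within the category
    exemplar: str  # exemplar label


def match_condition(studied, tested):
    """Classify a test object relative to the object studied on the same scene."""
    if tested == studied:
        return "exact match"
    if (tested.category, tested.obj) == (studied.category, studied.obj):
        return "distortion"  # same object, different exemplar
    if tested.category == studied.category:
        return "within-category mismatch"  # different object, same category
    return "between-category mismatch"


studied = NovelObject(1, 1, "a")  # study pair S(1,a)-O(1,1,a)
print(match_condition(studied, NovelObject(1, 1, "a")))  # exact match
print(match_condition(studied, NovelObject(1, 1, "b")))  # distortion
print(match_condition(studied, NovelObject(1, 2, "b")))  # within-category mismatch
print(match_condition(studied, NovelObject(2, 1, "b")))  # between-category mismatch
```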
Each list contained 72 trials per condition. Across lists, each test object and each scene appeared in each of the four experimental conditions an equal number of times. In fact, identical test object–scene pairs were used across the first three experimental conditions; only in the fourth experimental condition (between-category mismatch) was it necessary to shuffle the specific pairings of test object and scene. Trial order within each block was pseudorandomized, as follows: No more than two trials corresponding to each condition were presented in a row, and, in the test phase, no more than three trials mapping onto a “same” response were presented in a row (i.e., trials in the exact match or distortion conditions).
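The run-length constraints on test trial order can be met by reshuffling until both hold. The Python sketch below uses rejection sampling, which is an assumption: the text states the constraints but not the procedure used to satisfy them, and the function names and seed are illustrative.

```python
import random

CONDITIONS = ["exact", "distortion", "within", "between"]
SAME_RESPONSE = {"exact", "distortion"}  # conditions mapping onto a "same" response


def longest_true_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def is_valid(order):
    """At most two trials of one condition in a row; at most three 'same' trials in a row."""
    for condition in CONDITIONS:
        if longest_true_run([c == condition for c in order]) > 2:
            return False
    return longest_true_run([c in SAME_RESPONSE for c in order]) <= 3


def pseudorandom_order(rng, per_condition=4, max_attempts=100_000):
    """Shuffle one block of test trials until both constraints are satisfied."""
    trials = [c for c in CONDITIONS for _ in range(per_condition)]
    for _ in range(max_attempts):
        rng.shuffle(trials)
        if is_valid(trials):
            return list(trials)
    raise RuntimeError("no valid order found")


order = pseudorandom_order(random.Random(0))
print(order)
```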
Participants passively studied the paired scenes and novel objects and then were tested by being asked to indicate, for each pair in a new set, whether the object matched the presented scene. Study and test phases were organized into 18 study–test blocks, between which the participant was encouraged to take a break. All breaks were self-paced.
In each study phase, 16 scene–object pairs were presented, each beginning with a white fixation cross on a black background presented for 350–550 msec (duration jittered to reduce the impact of anticipatory slow potentials on the time-locked waveform; the fixation cross remained on screen for the remainder of the trial). Next, the scene alone was presented centrally for 2500 msec on a black background. Right after the scene appeared, participants were allowed to move their eyes to take in the scene; however, 1800 msec into scene presentation, the fixation cross brightened to indicate that the participant should fixate in the center of the screen in preparation for object presentation. After a further 700 msec (2500 msec after scene onset), a white square containing the object appeared in the center of the screen, superimposed on top of the scene, for 2500 msec. A screen with the word “BLINK” was then displayed for 2000 msec (preceded and followed by 50 msec of blank screen) to encourage the participant to blink between trials.
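The study trial timing can be laid out as a sequence of event onsets. The Python sketch below is illustrative: the event names are hypothetical, and drawing the fixation duration uniformly from 350–550 msec is an assumption, since the text reports only the range of the jitter.

```python
import random


def study_trial_timeline(fixation_ms):
    """Return (event, onset in msec from trial start) pairs for one study trial."""
    if not 350 <= fixation_ms <= 550:
        raise ValueError("fixation duration must lie in the 350-550 msec range")
    scene_on = fixation_ms           # scene replaces the fixation-only display
    cue_on = scene_on + 1800         # fixation cross brightens
    object_on = cue_on + 700         # 2500 msec after scene onset
    object_off = object_on + 2500    # object and scene removed
    blink_on = object_off + 50       # 50 msec blank screen, then "BLINK"
    blink_off = blink_on + 2000
    trial_end = blink_off + 50       # trailing 50 msec blank screen
    return [
        ("fixation_on", 0),
        ("scene_on", scene_on),
        ("fixation_cue_on", cue_on),
        ("object_on", object_on),
        ("object_off", object_off),
        ("blink_on", blink_on),
        ("blink_off", blink_off),
        ("trial_end", trial_end),
    ]


rng = random.Random(0)
for event, onset in study_trial_timeline(rng.randint(350, 550)):
    print(f"{onset:5d} msec  {event}")
```

Excluding the jittered fixation period, each study trial therefore lasts 7100 msec.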
In the test phase immediately following each study phase, 16 scene–object pairs, repeating all 16 scenes from the study phase, were displayed. Participants were asked not to move their eyes for the entire test trial duration. Similar to the study trials, each scene was first displayed by itself for 2500 msec on a black background, followed by 2500 msec during which the test object was displayed centrally on top of the scene, again embedded in a white square. Participants were asked to wait to respond until the object–scene pair was replaced by a question mark in the center of the screen; the question mark remained on screen until a response was made. Participants had three response options: The object on this scene is (1) the same object that I studied with this scene, (2) not the same object but “could have gone with” this scene, and (3) not the same object and could not have gone with this scene. Participants were told that if the object was only slightly visually distinct from what they remembered studying (e.g., had a different body position or proportions), they should still respond (1). They were also told that an object and scene “could go together” if they believed that pair looked similar to other study items and could hypothetically be presented in an upcoming trial of the experiment, even if they knew that they had not studied it. Participants were never told that there was a structured relationship among the object and scene categories. Participants were explicitly instructed that each test phase only covered materials studied in the immediately preceding study phase and that testing was noncumulative across blocks. Before the main experiment, participants were given a practice block of four study and four test trials, which used different but qualitatively similar object and scene images to those in the main study.
During recording, participants were seated in a comfortable chair at a viewing distance of approximately 100 cm from the computer display. The visual angle of the scenes was 13.6° × 9.3° and that of the object images was 4.6° × 4.2°. The recording session lasted approximately 90 min. Afterwards, participants filled out an exit survey about the strategies they had used to complete the task, including several open response questions. To determine how participants were using the response scale, they were then given several visual examples of corresponding study and test scene–object pairs and asked to indicate how they would respond to that test item, given that they remembered studying that study item. Lastly, for a subset of the object images (one prototype image from each of the six categories of germs and machines), participants indicated which scene categories the object had been associated with during the experiment by circling one or more of six scene–category labels.
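For readers wishing to reproduce the display geometry, visual angle and on-screen extent are related by extent = 2 × distance × tan(angle / 2). The Python sketch below applies this standard relation to the reported viewing distance; the physical sizes it yields are derived here for illustration and are not values reported in the study.

```python
import math


def extent_cm(angle_deg, distance_cm):
    """On-screen extent subtending a given visual angle at a given viewing distance."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def visual_angle_deg(extent, distance_cm):
    """Visual angle subtended by an on-screen extent at a given viewing distance."""
    return math.degrees(2.0 * math.atan(extent / (2.0 * distance_cm)))


VIEWING_DISTANCE_CM = 100.0  # approximate distance reported in the text

# Scene width (13.6 degrees) and object width (4.6 degrees) expressed in centimeters.
print(round(extent_cm(13.6, VIEWING_DISTANCE_CM), 1))
print(round(extent_cm(4.6, VIEWING_DISTANCE_CM), 1))
```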
EEG Data Acquisition and Preprocessing
The EEG was recorded from 26 silver/silver chloride electrodes evenly spaced over the scalp. The sites were midline prefrontal, left and right medial prefrontal, left and right lateral prefrontal, left and right medial frontal, left and right mediolateral frontal, left and right lateral frontal, midline central, left and right medial central, left and right mediolateral central, midline parietal, left and right mediolateral parietal, left and right lateral temporal, midline occipital, left and right medial occipital, and left and right lateral occipital. The midline central electrode was placed where the “Cz” electrode would appear using the international 10–20 system. Eye movements were monitored via a bipolar montage of electrodes on the outer canthus of each eye. Blinks were detected by an electrode below the left eye. Impedances were kept below 5 kΩ. Signals were amplified with a 0.02–250 Hz bandpass using a BrainVision amplifier and digitized at 1000 Hz. Data were referenced online to the left mastoid and rereferenced offline to the average of the left and right mastoids. Each trial consisted of a 1000-msec epoch preceded by a 200-msec prestimulus baseline.
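Offline rereferencing to the average of the two mastoids has a simple closed form when the data were recorded against the left mastoid: subtract half of the recorded right-mastoid signal from every channel. The Python sketch below illustrates this under the assumption that the right mastoid was recorded as its own channel against the left mastoid; the function name and the toy values are hypothetical.

```python
def rereference_to_average_mastoids(channel, right_mastoid):
    """Rereference a channel recorded against the left mastoid (M1).

    With recorded channel = scalp - M1 and recorded right mastoid = M2 - M1,
    (scalp - M1) - 0.5 * (M2 - M1) = scalp - (M1 + M2) / 2.
    """
    return [v - 0.5 * m for v, m in zip(channel, right_mastoid)]


# Toy potentials (in microvolts) used to check the algebra.
scalp, m1, m2 = [10.0, 12.0], [2.0, 2.0], [4.0, 6.0]
recorded_channel = [s - a for s, a in zip(scalp, m1)]   # as recorded online
recorded_m2 = [b - a for a, b in zip(m1, m2)]           # right mastoid vs. left
print(rereference_to_average_mastoids(recorded_channel, recorded_m2))  # [7.0, 8.0]
```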
To reject trials contaminated by eye movements, blinks, or other recording artifacts, we performed the following tests, setting individualized thresholds as indicated to maximize the number of correct rejections while minimizing the number of false alarms (using visual inspection, blind to condition): (1) the ERPLAB blocking and flat-lining tool was applied to all non-eye channels with an amplitude tolerance of 0.2; (2) a ±70 μV absolute voltage threshold was applied to all channels in combination with a 5-Hz low-pass filter (used only for detecting bad trials); (3) blinks were rejected using a moving window peak-to-peak with a cutoff of 40–60 μV applied to the channel subtraction of the vertical eye channel and left lateral prefrontal channel; (4) step-like artifacts on the bipolar horizontal eye channel were rejected (cutoff range = 10–20 μV); (5) for 11 participants, we applied an additional step-like artifact detection to the vertical bipolar channel to detect vertical eye movements (cutoff of 25 μV for all); and (6) for 15 participants, we applied an additional moving window peak-to-peak over frontal channels with a cutoff of 60 μV, combined with a low-pass filter (5 Hz) to remove those trials contaminated with high levels of facial muscle activity. This resulted in an average trial loss of 11.3% for the exact match condition, 11.4% for the distortion condition, 12.6% for the within-category mismatch condition, and 11.8% for the between-category mismatch condition. A digital low-pass Butterworth IIR filter with a 30-Hz half-amplitude cutoff and 12 dB/octave roll-off was applied before statistical analysis. Before permutation-based cluster analysis, data were further down-sampled to 100 Hz.
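Two of the tests above, the absolute voltage threshold and the moving window peak-to-peak, can be illustrated with a small standalone sketch. This is not the ERPLAB implementation: the window length, step size, and default blink cutoff are assumptions (the text gives only cutoff ranges), the 5-Hz low-pass step is omitted, and the function names are hypothetical.

```python
def max_peak_to_peak(signal, window, step):
    """Largest (max - min) amplitude found in any window of `window` samples."""
    last_start = max(len(signal) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the end of the epoch is covered
    return max(
        max(signal[s:s + window]) - min(signal[s:s + window]) for s in starts
    )


def exceeds_absolute(signal, threshold_uv=70.0):
    """True if any sample lies outside +/- threshold_uv."""
    return any(abs(v) > threshold_uv for v in signal)


def reject_trial(blink_channel, scalp_channels, blink_cutoff_uv=50.0,
                 window=200, step=50):
    """Flag a trial if the blink channel or any scalp channel fails its test.

    `blink_channel` stands for the vertical eye channel minus the left lateral
    prefrontal channel; 50 microvolts is an arbitrary value within the 40-60 range.
    """
    if max_peak_to_peak(blink_channel, window, step) > blink_cutoff_uv:
        return True
    return any(exceeds_absolute(channel) for channel in scalp_channels)


blink = [0.0] * 100 + [100.0] * 20 + [0.0] * 100  # blink-like deflection
clean = [1.0, -1.0] * 110                          # low-amplitude activity
print(reject_trial(blink, [clean]), reject_trial(clean, [clean]))  # True False
```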
Behavioral analyses of response distributions were conducted using logistic regression modeling in R. Statistical analyses of individual trial EEG data were conducted using mixed-effects models built with the lme4 package in R. Models initially included crossed random effects of subject and item, and by-subject random slopes of each predictor of interest, but the random effects structure was sometimes scaled back to address convergence issues. EEG-dependent measures for analysis (mean amplitudes over particular stretches of time and space following test object onset) were determined in two ways: (1) a priori windows in time and space determined based on the prior literature and (2) a data-driven method, permutation-based cluster analysis. For component-based analyses, we used time windows and electrode selections based on the prior literature: 250–349 msec over frontocentral sites to capture the N300, 350–499 msec over centroparietal sites for the N400, and 500–699 and 700–899 msec over posterior sites to capture early and late time windows of the LPC. Because the N300 was the primary measure of interest and because its distribution has been variably characterized over the literature, we took the conservative approach of measuring effects at all frontocentral sites (16 total); the N400 and LPC were characterized at eight sites each, focused around each component's typical distribution and reducing topographic overlap with other components. Key predictors of EEG amplitude included match condition and response outcome. We additionally tested for effects of a finer grained measure of visual similarity to the target object, defined further below.
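The component-based dependent measures are mean amplitudes over a time window and a set of electrodes. The Python sketch below shows how such a single-trial measure could be computed from an epoch sampled at 1000 Hz with a 200-msec prestimulus interval. Subtracting the prestimulus mean, the data layout (one list of samples per selected channel), and all names are assumptions made for illustration; the mixed-effects modeling itself was done in R and is not reproduced here.

```python
from statistics import fmean

SAMPLING_RATE_HZ = 1000  # digitization rate reported in the text
BASELINE_MS = 200        # prestimulus interval; sample 0 lies at -200 msec

# A priori time windows (msec after test object onset, inclusive).
WINDOWS_MS = {
    "N300": (250, 349),
    "N400": (350, 499),
    "LPC_early": (500, 699),
    "LPC_late": (700, 899),
}


def ms_to_index(ms):
    """Sample index of a latency measured from stimulus onset."""
    return (ms + BASELINE_MS) * SAMPLING_RATE_HZ // 1000


def window_mean(epoch, window_name):
    """Mean baseline-corrected amplitude over all given channels in a window."""
    start_ms, end_ms = WINDOWS_MS[window_name]
    first, last = ms_to_index(start_ms), ms_to_index(end_ms)
    channel_means = []
    for samples in epoch:
        baseline = fmean(samples[:ms_to_index(0)])  # mean of the prestimulus interval
        channel_means.append(fmean(samples[first:last + 1]) - baseline)
    return fmean(channel_means)


# Toy epoch: two channels, one with a 3-microvolt negativity from 250 to 349 msec.
channel_a = [1.0] * 1200
channel_a[ms_to_index(250):ms_to_index(349) + 1] = [-2.0] * 100
channel_b = [1.0] * 1200
print(window_mean([channel_a, channel_b], "N300"))  # -1.5
```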
The N300 time window has been of primary interest since this project's conception. Other components were examined without the expectation that their patterns would deviate from prior established work using similar manipulations, and their results are presented for full disclosure. We have refrained from correcting for multiple comparisons across the four separate component-based time-windows. We include uncorrected statistics on the later time windows for the reader but caution them that the presence of any earlier effects may impact the interpretability of later components. Sample size was determined with reference to Hannula et al. (2006).
Participants discriminated well among test objects that matched and mismatched the presented scene. As instructed, they tended to respond that both “exact match” and “distortion” condition objects matched the scene (Response Option 1 = “match”) and that “within-category mismatch” and “between-category mismatch” condition objects did not (Response Option 2 = “scene-congruent mismatch” and Response Option 3 = “scene-incongruent mismatch”). Accuracy in judging whether the object matched the scene was computed after collapsing the exact match and distortion conditions (which were treated as a “match”). The “within”- and “between”-category mismatch conditions were also collapsed into a single “mismatch” category. The two different mismatch responses (“scene-congruent” mismatch and “scene-incongruent” mismatch) were also collapsed. Mean accuracy was 81.5% (range = 64.2–97.9%). Participants were also sensitive to the type of mismatch and were more likely to respond that the test object could not have gone with the test scene (“scene-incongruent” response) for the between-category mismatch condition than for the within-category mismatch condition. This was characterized using a logistic regression model predicting the probability of a “scene-incongruent mismatch” response with a fixed effect of condition (exact match or distortion versus within-category mismatch versus between-category mismatch; the model failed to converge when the exact match and distortion conditions were treated as separate predictors). The model included subject random intercepts and by-subject random effects of condition. Nested model comparisons confirmed that “scene-incongruent” responses were more likely for the between-category mismatch than for the within-category mismatch condition: intercept = −0.635, β = 1.78, SE = 0.217, z = 8.208, χ² = 32.36, p < .001. Figure 4A shows the mean response distribution across subjects for each condition.
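The collapsed accuracy measure described above can be written as a simple scoring rule. The following Python sketch is illustrative; the condition labels, the numeric response codes (1 to 3, following the three response options), and the toy trials are assumptions rather than the analysis code used.

```python
MATCH_CONDITIONS = {"exact", "distortion"}    # collapsed into "match"
MISMATCH_CONDITIONS = {"within", "between"}   # collapsed into "mismatch"


def is_correct(condition, response):
    """Score one test trial.

    response: 1 = match, 2 = scene-congruent mismatch, 3 = scene-incongruent mismatch.
    """
    if condition in MATCH_CONDITIONS:
        return response == 1
    if condition in MISMATCH_CONDITIONS:
        return response in (2, 3)  # the two mismatch responses are collapsed
    raise ValueError(f"unknown condition: {condition}")


def accuracy(trials):
    """Proportion of correct trials among (condition, response) pairs."""
    return sum(is_correct(c, r) for c, r in trials) / len(trials)


toy_trials = [("exact", 1), ("distortion", 2), ("within", 2), ("between", 3)]
print(accuracy(toy_trials))  # 0.75
```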
By the end of the experiment, participants demonstrated explicit knowledge of the categorical mapping among objects and scenes. Participants were more likely to indicate that objects were associated with a scene category when the two had been paired at study. Figure 4B shows the normalized confusion matrix indicating the probability that a scene category, if circled, belonged to the correct scene category for the depicted object. This was assessed with a logistic regression model predicting the probability of circling a scene category. The model included a fixed effect of match (between the scene and object), crossed random intercepts for subject, object (response item) and scene (response choice), and a by-subject random effect of match. Nested model comparisons were used to confirm that scenes matching the object were more likely to be circled: intercept = −4.406, β = 7.814, SE = 0.918, z = 8.513, χ² = 40.13, p < .001.
ERP Analysis: Match Condition
Our component-based match condition analysis assessed whether scenes induced N300 visual form priming of associated objects in the test phase and also examined semantic (N400) and decision-related (LPC) processing. In an analysis of the trial-by-trial ERP response to object images in the test phase, we targeted the temporal and topographic distribution of the N300, N400, and LPC (using an a priori split of the 400 msec LPC window into early and late parts, 200 msec each, keeping window size more comparable across the analyses) and examined differences across conditions (see Figure 5 for component timing and scalp distributions and Figure 6 for the ERP waveform elicited by test objects in each match condition). Only behaviorally correct and artifact-free trials were included. Trials were considered behaviorally correct when participants responded “match” to matches and distortions and responded “scene-congruent mismatch” or “scene-incongruent mismatch” to within- and between-category mismatches. Linear mixed-effects models were fit to the individual trial data, including fixed effects of the following:
condition (contrasting match, distortion, within, and between category conditions);
response type (for within- and between-category mismatch conditions only, because only one response type was considered correct for match and distortion conditions)—it should be noted that condition (within vs. between) was moderately associated with response (“scene-congruent” vs. “scene-incongruent” mismatch; Cramér's V = 0.27); and
the interaction between condition and response.
Models also included crossed random intercepts of subject, item (object + scene), channel, and by-subject random slopes of condition, response, and their interaction.
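As an illustration of the association measure mentioned for condition and response, the following minimal sketch computes an uncorrected Cramér's V from a contingency table. The counts are hypothetical placeholders, not the study's data, and the function name is an assumption introduced here.

```python
import math

def cramers_v(table):
    """Uncorrected Cramér's V for an r x c table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))  # smaller table dimension
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts (NOT the reported data):
# rows = condition (within, between), columns = response (scene-congruent, scene-incongruent)
example_counts = [[60, 40], [35, 65]]
v = cramers_v(example_counts)
```

A value near 0 indicates that condition and response are nearly independent, and a value of 1 indicates that one fully determines the other.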
Because some of the between-category mismatches were also newly presented within the study–test block (whereas all other item types were within-block repetitions), we also examined the role of item recency in differentiating the between-category mismatches from the other conditions. To do this, we compared our model above to one with an additional fixed effect, which contrasted Condition 4 swap trials (generated by swapping object categories presented in the immediately preceding study phase) and Condition 4 new trials (generated by presenting an object that belonged to a category that was not presented in the immediately preceding study phase).
All models were fit using maximum likelihood estimation. Fixed effects were initially tested using likelihood ratio tests with nested model comparisons. Follow-up comparisons of condition means were conducted using the contest function in the lmerTest package in R, with family-wise error rate corrected p values and the Satterthwaite approximation for the degrees of freedom; except where otherwise indicated, condition contrasts collapse across response type using linear combinations of beta weights but were conducted on a model that included a fixed effect of response type. The distortion condition was included in all models but never differed reliably from the match condition. For ease of reporting, we thus describe only contrasts among the match condition and the two violation types. The 95% confidence intervals on beta weights and contrasts were computed using the bootMer function in the lme4 package in R (n = 2000).
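The nested model comparisons described above can be sketched as a likelihood ratio test: twice the difference in maximized log-likelihood between the full and reduced models is referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. The sketch below uses only the Python standard library; the log-likelihood values and function names are hypothetical and are not taken from the reported models.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5  # df = 1 corresponds to shape a = 1/2
    else:
        q, a = math.exp(-half), 1.0             # df = 2 corresponds to shape a = 1
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while 2 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Return the LRT statistic and its chi-square p value."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_diff)

# Hypothetical log-likelihoods for a reduced and a full model differing by 3 parameters
stat, p = likelihood_ratio_test(-1050.0, -1042.5, 3)
```

The comparison is only meaningful when both models are fit by maximum likelihood to the same data, as stated for the models here.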
To examine the N300, voltages were separately averaged across time for each trial from 250 to 349 msec at each of 16 frontal and central sites. There was a main effect of Condition (χ² = 14.91, p < .01), which persisted when recency was accounted for by including a fixed effect of “swap” versus “new” among between-category mismatches (χ² = 13.79, p < .01; including this effect did not substantially improve model fit, χ² < 1). Follow-up comparisons revealed that between-category mismatches were more negative than matches, diff = −1.79 μV, 95% CI [−2.74, −0.83], F(1, 25.5) = 13.6, p < .01. There was also a tendency for between-category mismatches to be numerically more negative than within-category mismatches, diff = −1.08 μV, 95% CI [−2.25, 0.15], F(1, 21.8) = 3.19, p < .1. Within-category mismatches did not reliably differ from matches, diff = −0.70 μV, F(1, 20.2) = 1.49, p > .1. The main effect of Response Type and its interaction with Condition for within- and between-category mismatches were not statistically reliable (|t|s < 1).
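The single-trial dependent measure used in these component analyses (mean voltage within a time window at each selected site) can be sketched as follows. The array layout, sampling resolution, channel indices, and the name `window_mean` are assumptions made for illustration; the data are random placeholders.

```python
import numpy as np

def window_mean(epochs, times_ms, channel_idx, start_ms, end_ms):
    """Mean voltage per trial and channel within [start_ms, end_ms] (inclusive).

    epochs: array of shape (n_trials, n_channels, n_samples)
    times_ms: array of shape (n_samples,) giving each sample's latency in msec
    """
    in_window = (times_ms >= start_ms) & (times_ms <= end_ms)
    selected = epochs[:, channel_idx, :]          # keep only the chosen sites
    return selected[:, :, in_window].mean(axis=2)

# Placeholder data: 5 trials x 32 channels, 1 msec resolution (assumed)
rng = np.random.default_rng(0)
times_ms = np.arange(-100, 900)
epochs = rng.normal(size=(5, 32, times_ms.size))
frontocentral = np.arange(16)   # stand-in indices for the 16 frontal/central sites
n300 = window_mean(epochs, times_ms, frontocentral, 250, 349)
```

The result has one value per trial and channel, which matches the structure needed for a trial-level mixed model with a channel random intercept.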
To examine the N400, voltages were separately averaged across time for each trial from 350 to 499 msec at eight central and parietal sites. There was a main effect of Condition (χ² = 14.05, p < .01), which remained as a numeric trend when recency was accounted for by including a fixed effect of “swap” versus “new” among between-category mismatches (χ² = 7.80, p < .1; including this additional factor improved model fit, χ² = 4.22, p < .05). Follow-up comparisons revealed that between-category mismatches were more negative than all other conditions: between-category mismatch − match, diff = −1.63 μV, 95% CI [−2.59, −0.68], F(1, 24.8) = 11.3, p < .01, and between-category mismatch − within-category mismatch, diff = −1.33 μV, 95% CI [−2.53, −0.015], F(1, 20.2) = 4.90, p < .05. When between-category violations that had been swapped were compared with those that were new, new items were found to be more negative, diff = −0.79 μV, 95% CI [−1.55, −0.03], F(1, 2546) = 4.07, p < .05. Both new and swapped between-category violations were more negative than matches (swap − match: diff = −1.23 μV, 95% CI [−2.33, −0.22], F(1, 33.5) = 5.45, p < .05; new − match: diff = −2.02 μV, 95% CI [−3.09, −1.00], F(1, 33.0) = 15.4, p < .001). However, only new between-category violations were reliably more negative than within-category violations, diff = −1.73 μV, 95% CI [−2.97, −0.56], F(1, 24.1) = 7.42, p < .05. Within-category violations did not reliably differ from matches, diff = −0.29 μV, F(1, 19.8) < 1. For the two violation conditions, the main effect of Response Type and its interaction with Condition were not statistically reliable (|t|s < 1).
To examine the LPC, voltages were separately averaged across time for each trial from 500 to 699 msec (early LPC) and from 700 to 899 msec (late LPC) at each of eight posterior sites. In the early time window, there was no reliable effect of Match Condition (χ² = 2.60, p > .1). However, there was a reliable effect of Response Type (scene-congruent vs. scene-incongruent mismatch) for within- and between-category violations, such that trials indicated as being a scene-congruent mismatch were less positive; β = −1.42 (μV), 95% CI [−2.60, −0.06], χ² = 4.62, p < .05. There was no reliable interaction between Response Type and Condition (|t| < 1).
In the late time window, there was a main effect of Condition (χ² = 13.49, p < .01), which persisted when recency of the between-category mismatches was accounted for (χ² = 14.60, p < .01; including swapped vs. new as a factor did not substantially improve model fit, χ² = 1.12, p > .1). Between-category mismatches were more positive than matches and within-category mismatches, between-category mismatch − match: diff = 1.32 μV, 95% CI [0.20, 2.48], F(1, 24.1) = 5.30, p < .05; between-category − within-category mismatch: diff = 1.47 μV, 95% CI [0.19, 2.82], F(1, 21.3) = 4.96, p < .05. There continued to be a reliable effect of Response Type (scene-congruent vs. scene-incongruent mismatch), with trials less positive when indicated as being a scene-congruent mismatch, β = −2.88 (μV), 95% CI [−4.09, −1.67], χ² = 17.97, p < .001. There was no reliable interaction between Response Type and Condition (|t| < 1).
Our component-based analysis revealed N300 priming of contextually congruent objects. Match effects were first observed in the N300 window, with increased negativity for between-category mismatches relative to matches and distortions, and within-category violations falling numerically in between. This pattern of effects was also seen on the N400, and the N400 additionally showed sensitivity to recency, with new items more negative than recently seen ones (replicating many prior studies). Finally, the LPC was sensitive to participants' judgments, being more positive to trials judged to be scene-incongruent mismatches and, in the later part of the window, to between-category violations overall.
Although there were a priori reasons to think that the N300, N400, and LPC might show effects of the experimental manipulations (cf. Hannula et al., 2006), we also were interested in characterizing the pattern obtained when no a priori choices were made about either time window or scalp channels for analysis. We examined this with an exploratory cluster analysis of match condition for behaviorally correct test object trials, conducted in the time domain using the ft_timelockstatistics function in fieldtrip. Details of the analysis and results can be found in Appendix A. Importantly, the results of the analysis converged with the component-based approach, showing effects differentiating match trials from mismatch trials beginning around 230–250 msec over the front of the head (like the N300), becoming more broadly distributed, and ending around 400–430 msec over central/posterior sites (like the N400). Differences between the two mismatch trial types were also found from 760 to 990 msec over the back of the head (like the LPC).
ERP Analysis: Target Similarity
We also conducted a component-neutral analysis to confirm that visual form information about the contextually congruent object is brought online in response to the context scene. In doing so, we can make a more direct inference about the role of scene-induced visual information in object processing in this experimental context. We assessed whether low-level visual features of the scene-congruent object were accessed in memory even when it was not displayed. In focusing on the N300, our component-based analyses targeted what are proposed to be intermediate stages of visual form analysis (Schendan & Ganis, 2015). Here, however, because we are not targeting the sensitivity of any particular waveform feature, we examined low-level visual feature similarity, providing a conservative test of visual form reactivation and taking advantage of the fact that the modeling of such low-level features is well established (Pinto, Cox, & DiCarlo, 2008; Jones & Palmer, 1987). Thus we used V1-like low-level visual features to derive a similarity metric to predict ERP amplitude. We adapted our match condition analysis models, adding a new fixed effect: visual distance from the target object. We then tested whether this predictor improved model fit using nested model comparisons. Models did not distinguish among the two types of between-category violations (new vs. swap). Including a by-subject random slope for visual distance from the target object resulted in a failure to converge, so the same random effects structure was used as for modeling the effects of match condition.
Visual distance was computed as follows. First, V1-like features were generated for each object image using the model in Pinto et al. (2008). Next, features with variance close to zero were removed to avoid numeric issues while scaling, and feature values were mean-centered and scaled to unit variance. Next, PCA was applied to reduce the dimensionality of the feature space to 766 (from >80,000) while maintaining an explained variance ratio of 0.999. Visual distance was defined as the Euclidean distance between the presented object image and the target (studied) object image in this feature space for each trial. Visual distance was grand mean-centered before model fitting.
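The steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the V1-like features are replaced by random placeholder values, the near-zero variance threshold (`var_tol`) is a hypothetical choice, and PCA is carried out through a singular value decomposition.

```python
import numpy as np

def pca_feature_space(features, var_tol=1e-8, explained=0.999):
    """Standardize features (rows = images) and project onto leading principal components."""
    keep = features.var(axis=0) > var_tol          # drop near-constant features
    x = features[:, keep]
    x = (x - x.mean(axis=0)) / x.std(axis=0)       # mean-center, scale to unit variance
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative explained variance ratio
    n_comp = int(np.searchsorted(ratio, explained)) + 1
    return u[:, :n_comp] * s[:n_comp]              # component scores per image

def visual_distance(scores, presented_idx, target_idx):
    """Euclidean distance between presented and target images, grand mean-centered across trials."""
    diff = scores[presented_idx] - scores[target_idx]
    dist = np.linalg.norm(diff, axis=1)
    return dist - dist.mean()

# Placeholder "features": 20 images x 50 features, one of them constant
rng = np.random.default_rng(1)
feats = rng.normal(size=(20, 50))
feats[:, 3] = 2.0
scores = pca_feature_space(feats)
presented = np.array([0, 1, 2])   # hypothetical trial-wise presented images
target = np.array([0, 5, 6])      # hypothetical trial-wise target images
d = visual_distance(scores, presented, target)
```

In this sketch an exact match (presented image equal to the target) has a raw distance of zero and therefore the smallest centered value.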
Including visual distance to the target object as a continuous predictor substantially improved model fit for all components (N300: χ² = 67.0, p < .001, β = −3.615 × 10⁻³, 95% CI [−4.45, −2.71] × 10⁻³; N400: χ² = 88.3, p < .001, β = −5.612 × 10⁻³, 95% CI [−6.74, −4.45] × 10⁻³; LPC-early: χ² = 31.5, p < .001, β = −3.211 × 10⁻³, 95% CI [−4.32, −2.09] × 10⁻³; LPC-late: χ² = 56.6, p < .001, β = −4.464 × 10⁻³, 95% CI [−5.64, −3.31] × 10⁻³). Negative beta values indicate that for all components, the more visually similar the displayed object was to the target object, the more positive the waveform. The distribution of visual distances across trials, broken down by condition and including only behaviorally correct and artifact-free trials, is displayed in Figure 7. Table 1 lists the mean number of trials per subject for each condition and visual distance bin in Figure 7. The ERP waveforms corresponding to these trials, broken down by visual distance bin, are displayed in Figure 8, averaging across match condition.
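Table 1. Mean number of trials per subject for each match condition (columns) and visual distance bin (rows).
| V1-like Feature Euclidean Distance to Target | Match | Distortion | Within-category Mismatch | Between-category Mismatch |
|---|---|---|---|---|
| 0 (Exact Match) | 48.6 | 0 | 0 | 0 |
| >500 (Most Distinct) | 0 | 2.3 | 7.5 | 8.4 |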
Visual Distance Effects within Component Time Windows and Conditions
We also examined whether the effect of visual distance to target was modulated by category-level information. Importantly, this allowed us to observe whether visual similarity effects were apparent even within the distortion condition, suggesting a purer effect of visual similarity per se that is not contingent on category boundaries. Figure 9 shows estimated linear trends of visual distance by component and match condition, collapsing across response. We tested for interactions between visual distance to target, condition, and response, within behaviorally correct trials. The three-way interaction among Match Condition, Response, and (mean centered) Visual Distance to target, as well as all lower order interactions, were added as fixed effects to the original visual distance to target models, while maintaining an identical random effects structure. Interactions were tested using nested model comparisons. The effect of Visual Distance to target is reported separately for each condition and response, including 95% bootstrapped confidence intervals (n = 2000 iterations). All point estimates and confidence intervals on linear combinations of beta weights are reported below as 1000 times the original estimates.
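To make the reporting convention concrete, the sketch below computes a percentile bootstrap confidence interval for a slope and rescales it by 1000. It is a deliberately simplified stand-in: it resamples trials and refits an ordinary least-squares slope, whereas the reported intervals come from bootMer applied to the full mixed-effects models. The simulated data, effect size, and function names are hypothetical.

```python
import numpy as np

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def percentile_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the slope, resampling observations with replacement."""
    rng = np.random.default_rng(seed)
    n = x.size
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boots[b] = slope(x[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(boots, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

# Simulated trials: centered visual distance and single-trial amplitude (placeholder values)
rng = np.random.default_rng(42)
distance = rng.normal(0.0, 150.0, size=400)
amplitude = -0.004 * distance + rng.normal(0.0, 1.0, size=400)

est = slope(distance, amplitude)
lo, hi = percentile_ci(distance, amplitude)
# Rescaled by 1000 for readability, mirroring the reporting convention in the text
scaled = tuple(round(1000 * value, 3) for value in (est, lo, hi))
```

An interval that excludes zero is read here in the same way as the reported intervals: as a reliable effect of visual distance for that condition and response.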
In the N300 time window, there were significant interactions of Condition × Visual Distance (χ² = 27.59, p < .001), and Condition × Response × Visual Distance (χ² = 15.66, p < .001). Visual distance to target effects were significant in the distortion condition and on within-category violation trials that were responded to as “scene-congruent” mismatches, but not on other mismatch trials (distortion: −5.105 [−6.119, −4.077]; “scene-congruent” within-category mismatch: −3.903 [−5.593, −2.220]; “scene-incongruent” within-category mismatch: 0.680 [−1.095, 2.384]; “scene-congruent” between-category mismatch: 4.068 [−2.462, 11.085]; “scene-incongruent” between-category mismatch: −1.169 [−4.960, 2.452]). Within-category mismatches showed larger effects of visual distance to target when they were responded to as being “scene-congruent” (see nonoverlapping 95% CIs).
In the N400 time window, there were significant interactions of Condition × Visual Distance (χ² = 10.64, p < .01) and Condition × Response × Visual Distance (χ² = 17.39, p < .001). The effect of Visual Distance was significant at α = .05 uncorrected for all condition by response combinations, except for between-category mismatches that were responded to as “scene-congruent” (distortion: −4.903 [−6.233, −3.555]; “scene-congruent” within-category mismatch: −10.421 [−12.689, −8.205]; “scene-incongruent” within-category mismatch: −4.515 [−7.132, −2.205]; “scene-congruent” between-category mismatch: 5.804 [−2.598, 14.189]; “scene-incongruent” between-category mismatch: −4.883 [−9.290, −0.253]). Within-category mismatches showed larger effects of visual distance to target when they were responded to as being “scene-congruent”; “scene-congruent” within-category mismatches also showed larger visual distance effects than the distortion condition (see nonoverlapping 95% CIs).
In the early LPC time window, there was a significant interaction of Condition × Visual Distance (χ² = 11.12, p < .01) and a numeric trend toward an interaction of Condition × Response × Visual Distance (χ² = 5.77, p < .1). The distortion and within-category mismatch conditions showed reliable effects of visual distance to target, but not the between-category mismatch conditions (distortion: −2.250 [−3.566, −0.823]; “scene-congruent” within-category mismatch: −7.001 [−9.195, −4.706]; “scene-incongruent” within-category mismatch: −3.316 [−5.694, −0.917]; “scene-congruent” between-category mismatch: 0.940 [−6.978, 8.542]; “scene-incongruent” between-category mismatch: −2.489 [−6.568, 1.480]). “Scene-congruent” within-category mismatches showed larger target dissimilarity effects than the distortion condition (see nonoverlapping CIs) and were numerically larger than “scene-incongruent” within-category mismatches.
In the late LPC time window, there was a significant interaction of Condition × Visual Distance (χ² = 15.92, p < .001), but the Condition × Response × Visual Distance interaction did not reach significance (χ² = 4.56, p = .103); we still report each condition by response contrast separately for consistency. Visual distance to target effects were significant in the distortion condition and on within-category violation trials that were responded to as “scene-congruent” mismatches, but not on other mismatch trials (distortion: −6.077 [−7.434, −4.633]; “scene-congruent” within-category mismatch: −4.119 [−6.358, −1.789]; “scene-incongruent” within-category mismatch: −0.889 [−3.287, 1.514]; “scene-congruent” between-category mismatch: 2.039 [−6.135, 10.464]; “scene-incongruent” between-category mismatch: −2.215 [−6.453, 2.281]).
Summary: Visual Distance Effects within Component Time Windows and Conditions
In summary, we found that visual distance to target effects were generally apparent in both the distortion and within-category mismatch conditions, but more often did not reach significance in the between-category mismatch conditions, despite trending in the same numeric direction. We also found that within-category mismatches generally showed larger effects of visual distance to target when they were responded to as being “scene-congruent” (vs. “scene-incongruent”) scene–object pairs, particularly on the N300 and N400 components. On the N400 and early LPC components, “scene-congruent” within-category mismatches showed even larger effects of visual distance than the distortion condition.
Additional Control Analyses: Object Prototypicality
We also tested whether visual distance to target effects could be explained by a partial confound with object prototypicality. Object exemplars that were closer in visual distance to the target also tended to be closer to the object prototype (r = .155). We found that visual distance to target effects were more robust than and could not be explained away by (cue-independent) prototypicality effects, although both effects tended in the same numeric direction (see Appendix C for detailed results). Finally, we tested whether the presented object is visually compared with the target object prototype versus the target object exemplar for each ERP component (see Appendix D). We found that exemplar-specific visual information about the target object is brought online and compared with the presented object, as evidenced by sensitivity to the degree of mismatch between the current object and the target exemplar, even when mismatch between the current object and the target prototype is accounted for, during the N300, N400 and LPC time windows.
Although prior research has indicated that scenes can prime the visual form of associated objects (Truman & Mudrik, 2018; Brandman & Peelen, 2017; Bar, 2004), most studies have relied on natural statistical associations that may be learned over many years. We tested whether scene–object visual form priming extends to recently learned scene–object associations (< 2 hr) using a set of categorically organized novel objects in an explicit paired association memory task. We examined two EEG measures of form-based priming: the N300 (comparing match and mismatch trials in the test phase) and target similarity effects (within mismatch trials, regressing on a continuous measure of visual similarity to the contextually associated, but not presented, object). In both cases, results suggest that scenes can indeed prime the visual form of even recently associated objects.
Behaviorally, participants successfully associated novel objects and scenes, and in a later off-line posttest, they demonstrated explicit knowledge of the category-level pairing between object types and scene types (which persisted throughout the experiment but of which they were never explicitly informed). A priori component-based and data-driven cluster-based analyses converged to reveal an N300 facilitation (reduced negativity beginning around 200–250 msec, with a frontal scalp distribution) to objects matching their associated scenes, relative to mismatches, in the test phase. Both matching objects and close distortions elicited a more positive waveform than objects that mismatched the scene type. Objects that mismatched the specific scene but were congruent with its category (beach, mountain, etc.) showed an intermediate level of facilitation in the N300 time window, particularly at central sites. This may reflect overlapping generators with the subsequent N400, which also showed more facilitation for within- than between-category mismatches and which is generally known to differentiate near and distant semantic violations (Federmeier & Kutas, 1999, 2001). Although the N300 and N400 effects we observe overlap, the early part of the effect is temporally and topographically more aligned with an N300 than an N400. Given prior work linking the N300 specifically to visual form-based priming (Kovalenko et al., 2012; Hamm et al., 2002), these results suggest that scenes may enhance accessibility of the visual form of even recently associated objects, at least when heightened accessibility is useful to the task at hand.
In a second set of analyses, we focused on distortion and mismatch trials. We revealed a novel effect: an enduring sensitivity across the ERP waveform to (low-level) visual similarity to the target (contextually congruent) item, onsetting at roughly 200 msec. The more visually similar the presented object was to the contextually congruent object, the more positive the waveform across the N300, N400, and LPC time windows. This effect was not simply driven by degree of match at the category level: It was apparent on an item-level basis within the distortion condition, holding category identity constant. Attesting to the effect's reliability, it was separately observed in the within-category mismatch condition, and there was a numeric trend in the same direction for between-category mismatches. Given the extended timing of the effect, we believe it at least partially reflects a template-matching process (Kok, Mostert, & de Lange, 2017; Mostert, Kok, & de Lange, 2015; Kok, Failing, & de Lange, 2014; Summerfield & de Lange, 2014; Summerfield et al., 2006), in which the current object is visually compared with a memory template of the target object, evoked by the context scene. We differentiate “template matching,” which may reflect a task-specific perceptual decision-making process in which a single memory representation (that of the template) is prioritized for comparison, from the generic perceptual matching processes that occur in order for an object to be recognized. Although early sensitivity to target similarity may partly reflect visual form priming itself, later sensitivity is more likely to reflect the formation of a perceptual judgment (Mostert et al., 2015). Interestingly, in the LPC time window at central and posterior sites, target similarity effects had the opposite polarity of differences due to mismatch type (i.e., more severe category violations were more positive). 
This suggests distinct temporally overlapping processes of visual comparison and rule-based decision-making in keeping with findings that the LPC is sensitive to both perceptual analysis and decision-related processing (e.g., Schendan & Kutas, 2002, 2003; Falkenstein, Hohnsbein, & Hoormann, 1994). Regardless of whether early sensitivity to target similarity is ultimately better explained as direct visual form priming of the current object or as an index of decision-related processing, it corroborates early availability of the memory template of the target object, making visual form priming at 200–350 msec of the presented object more plausible given that sensitivity to even low-level visual feature similarity is apparent at the same time.
Furthermore, comparing effects of visual similarity to the target exemplar versus the target prototype speaks to the types of strategies participants may have used to complete our task. Hypothetically, participants could have remembered only the abstract/amodal object concept that was paired with the scene and simply assessed whether the presented object corresponded to that concept. Thus, they could have brought online only a coarse visual representation of the target object (e.g., a prototype schematic) without finer visual details of the target object exemplar. However, sensitivity to target similarity goes beyond what would be expected if each image were compared with a prototype; rather, exemplar-specific details of the target image are compared with the current image, beginning as early as the N300 time window. Even when it was not necessary for the task at hand, participants brought online a fine-grained memory representation of the contextually associated target that included exemplar-level visual information that could be dissociated from the target prototype using Gabor filters. Of course, one limitation of this analysis is that it uses the prototype images as a proxy for the conceptual prototype; in reality, the conceptual prototype will be derived from the objects to which the participant has been exposed and therefore may be biased toward the target exemplar, particularly early in learning. This moderates our conclusions somewhat. Although we only report visual distance effects using a single feature space, future work could compare multiple measures of target similarity (e.g., shape-based, frequency-based, abstract/semantic) to more fully probe the nature of the target object memory representation brought online in response to scene contexts.
Our results also suggest that task-relevant category-level information may modulate the visual matching process or the extent to which it is engaged. The effect of visual similarity to target tended to be larger in magnitude for distortions and within-category mismatches than for between-category mismatches. Also, within-category mismatches that were reported as being incongruent but able to “go with the scene” elicited a stronger visual similarity effect than those reported as mismatching the scene type. Indeed, within-category mismatches reported as being “scene-congruent” showed even larger effects of visual distance than distortions on the N400 and early LPC. These interactions may be partially driven by changes in the range of visual distance to target across match conditions and responses. For example, matching processes may be the most evident when the presented image is distinct enough from the target for the difference to be detectable, but not so different that there is little representational overlap. Nonetheless, given that within- and between-category mismatches had largely overlapping distributions of visual distance in our experiment, it may also be that mismatches eliciting a “scene-incongruent” response were more likely to be rejected from consideration in a top–down/rule-based fashion. To the extent that scene-congruent visual objects are more likely than scene-incongruent objects to be matched against a memory template for an expected scene-congruent object, this corroborates aspects of Bar's (2004) theory of object recognition, which postulates that scenes constrain the object categories that are considered for assignment to a visual stimulus. Within this framework, each scene type in our experiment served as a “context frame” associated with a subset of the object stimuli. When the presented object matched the context frame, the participant considered whether it might also belong to the target object category. 
However, when the presented object mismatched the context frame, this matching process was engaged to a lesser extent. Future work should further explore whether the target similarity effects observed here are task-specific and if they can be modulated by additional factors known to affect anticipatory visual processing (e.g., duration of preexposure to the priming context; Smith & Federmeier, submitted for publication).
Taken together, our results corroborate the hypothesis that recently formed arbitrary associations between contextual cues and object representations can facilitate visual object recognition via visuostructural priming (as dissociated from amodal semantic or decision boundary-based effects, which may co-occur). Although we used scene–novel object associations and an explicit memory task, similar contextual priming effects have been observed using other types of visual sequences, even when contextual associations are unrelated to the task at hand. For example, Turk-Browne and colleagues had participants make orthogonal judgments to a continuous sequence of faces and places and found that statistical regularities induce hippocampally encoded predictive cuing of upcoming stimuli and facilitate visual object recognition as indexed by behavioral responding (Turk-Browne, Scholl, Johnson, & Chun, 2010). Kok and colleagues have similarly found expectation-based preactivation of task-irrelevant visual features, using cross-modal cuing (Kok et al., 2017). Moreover, some theories suggest that top–down effects on perception via associations learned through implicit statistical learning and explicit paired associate learning may rely on partially overlapping mechanisms (Pearson & Westbrook, 2015). To the extent that statistical associations are rapidly implicitly learned and regularly used to facilitate processing of anticipated upcoming visual input, our findings may extend to visual object processing under the task demands of normal daily life. That is, recent episodic memories of the objects present in an environment may lead to visual form priming that facilitates object recognition when the environment is reinstated. In turn, populations with long-term memory impairments as the result of a disorder or as a function of normal aging may also experience disruptions in object recognition relative to healthy young adults. 
A growing body of literature linking the hippocampus and prefrontal cortex, areas particularly susceptible to damage and disruption (Anand & Dhikav, 2012; Fabiani, 2012; Baars & Gage, 2010), to visual prediction and mismatch detection, underscores this possibility (hippocampus: Kok & Turk-Browne, 2018; Hindy, Ng, & Turk-Browne, 2016; Chen, Cook, & Wagner, 2015; Duncan, Ketz, Inati, & Davachi, 2012; Chen, Olsen, Preston, Glover, & Wagner, 2011; frontal cortex: Summerfield & Koechlin, 2008; Bar et al., 2006; Summerfield et al., 2006).
In summary, neural evidence suggests that scenes can prime the visual form of even recently associated objects. Gabor filter-based visual features, similar to the empirically inferred neural representation in areas V1/V2, are reasonably well correlated with the memory representation of the object that is brought online in response to a context scene. Rapid statistical learning of object–scene associations could be exploited in future research to more carefully control for the strength of contextual associations between objects and scenes. Future work should focus on expanding the generalizability of our approach by bridging the gap between explicit paired associate learning and implicit statistical learning in the context of daily life. Also, the types of similarity analyses we used in this study could be extended to further refine our understanding of the nature of memory representations of visual objects by comparing multiple measures of visual similarity. Lastly, clinical implications of the current results could be verified by comparing behavioral and neural indices of scene–object priming across the lifespan and in disordered populations.
APPENDIX A: MATCH CONDITION CLUSTER ANALYSIS RESULTS
The following condition contrasts were assessed using a by-subject dependent samples t test on the subject-level averaged waveforms (down-sampled to 100 Hz) from 10 to 990 msec in 10 msec (one sample) increments:
Match versus Distortion
Match versus Within-category Mismatch
Match versus Between-category Mismatch
Distortion versus Within-category Mismatch
Within-category Mismatch versus Between-category Mismatch
Between-category Mismatch-Swapped versus Between-category Mismatch-New
Positive and negative clusters were separately assessed. Individual channel time points were considered for cluster inclusion at α = .05 and were required to have at least two neighboring channels also included in the cluster. Cluster significance was computed by comparing the sum of the t values within each cluster to the distribution of the maximum sum of t values cluster score over a random permutation baseline, α = .025, n = 2000 repetitions. Clusters were corrected for multiple comparisons by virtue of the permutation testing. See Figures A1 and A2 for the distributions of significant clusters over time and space for each condition contrast.
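The logic of the cluster-based permutation test described above can be sketched in simplified form. This is a minimal illustration under stated assumptions, not the analysis pipeline used in the study: it clusters over time at a single channel only (the neighboring-channel criterion is omitted), it assumes the permutation baseline was built by swapping condition labels within subjects (equivalent to flipping the sign of each subject's difference wave), and it uses a (1 + count) / (1 + n) p value estimate. All function names, and the threshold parameter `t_crit`, are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def paired_t(diffs):
    """Dependent-samples t statistic from per-subject condition differences."""
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / (sd / math.sqrt(len(diffs)))


def t_series(diff_matrix):
    """diff_matrix[s][t] = condition A minus condition B for subject s, sample t."""
    n_times = len(diff_matrix[0])
    return [paired_t([row[t] for row in diff_matrix]) for t in range(n_times)]


def cluster_sums(tvals, t_crit, sign):
    """Sum t over each run of adjacent samples exceeding t_crit in one direction."""
    sums, run = [], []
    for t in tvals:
        if sign * t > t_crit:
            run.append(t)
        elif run:
            sums.append(sum(run))
            run = []
    if run:
        sums.append(sum(run))
    return sums


def cluster_permutation_test(diff_matrix, t_crit, n_perm=2000, seed=0):
    """Return {sign: [(cluster_sum, p), ...]} for positive (+1) and negative (-1) clusters."""
    rng = random.Random(seed)
    observed = t_series(diff_matrix)
    null_max = {1: [], -1: []}
    for _ in range(n_perm):
        # Assumed permutation scheme: swap condition labels within subject,
        # which flips the sign of that subject's difference wave.
        flips = [rng.choice((1, -1)) for _ in diff_matrix]
        permuted = t_series([[f * x for x in row] for f, row in zip(flips, diff_matrix)])
        for sign in (1, -1):
            sums = cluster_sums(permuted, t_crit, sign)
            # Keep only the largest cluster score of this permutation.
            null_max[sign].append(max((abs(c) for c in sums), default=0.0))
    results = {}
    for sign in (1, -1):
        results[sign] = [
            (c, (1 + sum(m >= abs(c) for m in null_max[sign])) / (1 + n_perm))
            for c in cluster_sums(observed, t_crit, sign)
        ]
    return results
```

Here `t_crit` would be the two-tailed critical t value at α = .05 for the relevant degrees of freedom, and each returned cluster p value would then be compared against α = .025, because positive and negative clusters are assessed separately. Comparing each observed cluster with the distribution of the maximum cluster score is what provides the correction for multiple comparisons.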
Distortion − Exact Match
No significant clusters were found.
Within-category Mismatch − Exact Match
A negative cluster was found from 250 to 400 msec, which began at frontal sites, was broadly distributed across the head from ∼290 to 350 msec, and ended at central sites, sum(t) = −818, p = .0095.
Between-category Mismatch − Exact Match
A negative cluster was found from 230 to 430 msec, which began at frontal sites, was broadly distributed at ∼280 to 370 msec, and ended at centroparietal sites, sum(t) = −1418, p = .0030.
Within-category Mismatch − Distortion
Two negative clusters were found. The first negative cluster was from 170 to 430 msec, again starting at frontal sites, being broadly distributed at ∼290–390 msec and ending at central sites, sum(t) = −1575, p = .0025. The second negative cluster was from 760 to 990 msec at central and posterior sites, sum(t) = −695, p = .0160.
Between-category Mismatch − Within-category Mismatch
A positive cluster was found from 760 to 990 msec at posterior sites, possibly reflecting response differences among correct trials for the between versus within-category mismatch conditions, sum(t) = 738, p = .0070.
Between-category Mismatch Swap versus New Trials
No significant clusters were found.
APPENDIX B: COMPARISON OF ERP RESPONSE TO MATCH VERSUS DISTORTION CONDITIONS
Figure B1 illustrates the comparison of ERP response during match versus distortion conditions.
APPENDIX C: VISUAL DISTANCE TO TARGET VERSUS PRESENTED OBJECT PROTOTYPICALITY
We tested whether visual distance to target effects were driven by prototypicality of the presented object (relative to its own category), which would be independent of the presented scene context. Visual distance to the 36 prototype images (18 germs, 18 machines) from which the exemplar object images were derived was computed for each object image presented in the test phase. The same feature space and procedure were used to compute visual distance as when computing visual distance between the presented and target object images. The same random effects structure was maintained as in the match condition and visual distance to target analyses. Effects of visual distance to prototype and the additive benefit of including visual distance to target as an additional predictor were assessed using nested model comparisons. To test for effects of prototypicality, models containing fixed effects of match condition, response, and their interaction, as well as (grand mean-centered) visual distance to prototype, were compared with a null model excluding the effect of prototype. All beta values and standard errors (in parentheses) are reported as 1000 times the original estimates. Effects of distance to prototype were significant or numerically trended in the same direction across all four components: N300: β = −2.606 (1.466), χ²(1) = 3.15, p < .1; N400: β = −5.094 (1.718), χ²(1) = 8.77, p < .01; early LPC: β = −5.122 (1.609), χ²(1) = 10.10, p < .01; late LPC: β = −3.074 (1.670), χ²(1) = 3.38, p < .1. Next, (grand mean-centered) visual distance to target was included as an additional predictor to distance to prototype, to see if it explained substantially more variance than a null model containing only distance to prototype. Including visual distance to target improved model fit for all four components (N300: χ²(1) = 64.39, p < .001; N400: χ²(1) = 82.12, p < .001; early LPC: χ²(1) = 27.00, p < .001; late LPC: χ²(1) = 53.77, p < .001).
The converse was less true: adding distance to prototype as an additional predictor to a model that already included distance to target improved model fit more modestly or not at all (N300: χ²(1) < 1; N400: χ²(1) = 2.61, p = .106; early LPC: χ²(1) = 5.63, p < .05; late LPC: χ²(1) < 1). Within models that included both distance to prototype and distance to target as fixed effects, effect size estimates tended to be larger and standard errors tended to be smaller for distance to target effects: N300: distance to target β = −3.573 (0.445), distance to prototype β = −1.111 (1.482); N400: distance to target β = −5.470 (0.603), distance to prototype β = −2.810 (1.739); early LPC: distance to target β = −3.008 (0.579), distance to prototype β = −3.867 (1.627); late LPC: distance to target β = −4.399 (0.599), distance to prototype β = −1.237 (1.690). Taken together, visual distance to the specific target object image associated with each scene appears to be a more important explanatory variable than distance to the category prototype of the presented object.
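The nested model comparisons above can be illustrated with a short sketch. It assumes that the reported model-comparison statistics are likelihood-ratio chi-square values with one degree of freedom (one added fixed effect), which is consistent with the reported pairing of 2.61 with p = .106 but is not stated explicitly in the text. The function names and the log-likelihood arguments are hypothetical.

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def likelihood_ratio_test(loglik_null, loglik_full):
    """Compare nested models that differ by a single fixed effect (assumed df = 1)."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf_df1(max(stat, 0.0))


# Statistics taken from the text; p values are recomputed under the df = 1 assumption.
for stat in (3.15, 2.61, 5.63):
    print(f"chi2(1) = {stat:.2f}, p = {chi2_sf_df1(stat):.3f}")
```

Under this assumption, a statistic of 2.61 yields p of approximately .106, matching the value reported for the N400 window. The closed form via `math.erfc` holds only for one degree of freedom; comparisons that add several parameters at once would need a general chi-square distribution function.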
APPENDIX D: VISUAL DISTANCE TO TARGET EXEMPLAR VERSUS TARGET PROTOTYPE
We assessed whether participants compared the current object image in the test phase to the specific target exemplar object image paired with the scene or to the prototype target object image. The same random effects structure was maintained as in the match condition and visual distance to target analyses. Effects of visual distance to target prototype and the additive benefit of including visual distance to target exemplar as an additional predictor were assessed using nested model comparisons. Models containing fixed effects of match condition, response, and their interaction, as well as visual distance to target prototype and visual distance to target exemplar, were compared with a null model excluding the effect of visual distance to target exemplar. Continuous predictors were grand mean-centered. Including visual distance to target exemplar improved model fit for all four component time windows (N300: χ²(1) = 57.05, p < .001; N400: χ²(1) = 62.92, p < .001; early LPC: χ²(1) = 17.55, p < .001; late LPC: χ²(1) = 34.07, p < .001).
Reprint requests should be sent to Cybelle M. Smith, Department of Psychology, University of Illinois at Urbana-Champaign, 603 E. Daniel St., Champaign, IL 61820, or via e-mail: firstname.lastname@example.org.
Thank you to Professors Diane M. Beck and Gary S. Dell for helpful comments on a previous draft of this manuscript, as well as to Taisuke Wakabayashi for assistance with stimulus development and data collection. This research was supported by NIA grant AG2630 and a James S. McDonnell Foundation award to K. D. F., as well as NSF-GRFP grant 1144245 to C. M. S.