Sensitivity to temporal change places fundamental limits on object processing in the visual system. An emerging consensus from the behavioral and neuroimaging literature suggests that temporal resolution differs substantially for stimuli of different complexity and for brain areas at different levels of the cortical hierarchy. Here, we used steady-state visually evoked potentials to directly measure three fundamental parameters that characterize the underlying neural response to text and face images: temporal resolution, peak temporal frequency, and response latency. We presented full-screen images of text or a human face, alternated with a scrambled image, at temporal frequencies between 1 and 12 Hz. These images elicited a robust response at the first harmonic that showed differential tuning, scalp topography, and delay for the text and face images. Face-selective responses were maximal at 4 Hz, but text-selective responses, by contrast, were maximal at 1 Hz. The topography of the text image response was strongly left-lateralized at higher stimulation rates, whereas the response to the face image was slightly right-lateralized but nearly bilateral at all frequencies. Both text and face images elicited steady-state activity at more than one apparent latency; we observed early (141–160 msec) and late (>250 msec) text- and face-selective responses. These differences in temporal tuning profiles are likely to reflect differences in the nature of the computations performed by word- and face-selective cortex. Despite the close proximity of word- and face-selective regions on the cortical surface, our measurements demonstrate substantial differences in the temporal dynamics of word- versus face-selective responses.
Neurons in visual cortex are tuned to a myriad of features of the visual stimulus ranging from simple image statistics, such as spatial frequency, orientation, and disparity (De Valois, Albrecht, & Thorell, 1982; Barlow, Blakemore, & Pettigrew, 1967; Hubel & Wiesel, 1962), to dynamic properties, such as stimulus duration and direction of motion (Movshon, Thompson, & Tolhurst, 1978; Hubel & Wiesel, 1965), to high-level features, such as semantic similarity and category membership (Grill-Spector & Weiner, 2014; Huth, Nishimoto, Vu, & Gallant, 2012; Kanwisher, McDermott, & Chun, 1997). Regions of visual cortex that are sensitive to particular visual categories, such as the fusiform face area (FFA), which responds selectively to faces, and the visual word form area, which responds selectively to words, are believed to perform computations that are critical for the perception of these stimulus classes (Grill-Spector & Weiner, 2014; Wandell, Rauschecker, & Yeatman, 2012; Cohen et al., 2002; Kanwisher et al., 1997). For example, disruption of signals in the FFA through electrical stimulation impairs face perception (Jonas et al., 2012; Parvizi et al., 2012), and lesions in the vicinity of the visual word form area impair the ability to rapidly recognize words (a condition known as pure alexia or word blindness; Gaillard et al., 2006; Dejerine, 1891).
Despite the striking sensitivity of these ventral occipitotemporal regions to category membership, low-level features of the visual stimulus still influence neural responses. Understanding the low-level stimulus features that drive responses in ventral occipitotemporal cortex has helped elucidate fundamental aspects of visual computation and perception. For example, spatial tuning, one of the most extensively studied properties of neurons in visual cortex, has been fundamental for understanding differences in the computations performed by different visual regions and linking computation to perceptual function. Ventral stream regions that are important for the perception of objects, including words and faces, predominantly receive inputs from the foveal representations of early visual areas, and consequently the responses of these regions are principally driven by stimuli in the center of the visual field (Hasson, Levy, Behrmann, Hendler, & Malach, 2002; Levy, Hasson, Avidan, Hendler, & Malach, 2001). This foveal bias is believed to underlie our poor perceptual performance for objects in the periphery. For example, word recognition in the periphery is substantially slower and less accurate than would be predicted by visual acuity alone (Chung, Mansfield, & Legge, 1998).
The temporal properties of the visual system also impose fundamental limits on cortical computations but have received far less attention than spatial properties. Temporal tuning properties of the visual system can be characterized by three fundamental parameters: (1) temporal resolution or temporal acuity (i.e., the highest temporal frequency that elicits a response to a given visual feature), (2) the temporal frequency that elicits the maximal response to that feature, and (3) the delay of the response with respect to the stimulus onset (latency). The fastest rate at which neurons can track changes in a stimulus is related to the integration time of the system: Neurons that integrate over long time periods effectively low-pass filter their inputs and have low temporal acuity/resolution.
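The link between integration time and temporal resolution can be illustrated with a toy model. The sketch below assumes an ideal boxcar (moving-average) integrator, which is an illustrative choice and not a model proposed in this study; the function name `boxcar_gain` and both window durations are hypothetical.

```python
import math


def boxcar_gain(freq_hz, window_sec):
    """Amplitude gain of an ideal boxcar integrator at freq_hz.

    Averaging a sinusoid over a window of T seconds scales its amplitude by
    |sin(pi*f*T) / (pi*f*T)|, so the gain first reaches zero at f = 1/T.
    """
    x = math.pi * freq_hz * window_sec
    if x == 0:
        return 1.0
    return abs(math.sin(x) / x)


# Hypothetical integration windows, chosen only for illustration.
short_window_sec = 0.1
long_window_sec = 0.5

for f_hz in (1, 2, 4):
    print(f_hz,
          round(boxcar_gain(f_hz, short_window_sec), 3),
          round(boxcar_gain(f_hz, long_window_sec), 3))
```

Under this assumption, the longer window attenuates a 4-Hz input far more than the shorter window does, which is the sense in which long integration implies low temporal resolution.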
In the case of simple features such as luminance and contrast, temporal resolution is very high (Kelly, 1961a, 1961b), but for more complex features and objects, temporal resolution is much lower (Holcombe, 2009; McMains & Somers, 2004; Battelli, Cavanagh, Martini, & Barton, 2003; Potter & Faulconer, 1975). A parallel temporal hierarchy has also been observed as one progresses from early visual cortex to extrastriate areas in the temporal lobe. Early PET measurements in striate cortex indicated that peak responses to reversing checkerboards occurred between 4 and 15 Hz and similar tuning was observed using fMRI (Thomas & Menon, 1998; Zhu et al., 1998; Kwong et al., 1992) with a consensus that peak responses occur near 8 Hz (but see Ozus et al., 2001, which reported that the peak response plateaus at 6 Hz). Temporal integration of more complex information present in natural object images was first reported to differ between early retinotopic cortex and higher-order occipitotemporal areas by Mukamel, Harel, Hendler, and Malach (2004). Using fMRI, they found that, although activation increased by 200% in early visual cortex for presentation rates between 1 and 4 Hz, the increase was only 25% in occipitotemporal cortex. The difference was attributed to differences in integration time among areas that are at different stages of the visual hierarchy. Later work (McKeeff, Remus, & Tong, 2007) compared temporal tuning profiles over both retinotopic visual areas and occipitotemporal areas that were selectively responsive to face images (FFA) or house images (parahippocampal place area [PPA]). They found that maximal activation occurred around 18 Hz in early visual areas V1–V3, at ∼9 Hz in V4, but at only 4–5 Hz in FFA and PPA. In a complementary study (Hasson, Yang, Vallines, Heeger, & Rubin, 2008), silent films were temporally scrambled by cutting them into time segments of varying duration and randomizing the order of presentation.
Activation in later visual areas was maximal for longer segments, suggesting that high-level areas integrate information over long time periods. Gauthier, Eger, Hesselmann, Giraud, and Kleinschmidt (2012) alternated a single face image with a single house image using rates between 1.2 and 10 Hz. They found a progressive decrease in the optimal frequency of presentation going from V1 to the lateral occipital complex to FFA and PPA. What is clear from this collection of studies is that temporal response properties slow down at higher stages in the visual system and that these response properties place fundamental constraints on perception. This suggests a simple hypothesis: Responses to stimuli (or features) represented at similar levels of the visual hierarchy will have similar temporal dynamics.
Here we use evoked potential measures of temporal processing as a means to compare the temporal limits of word- and face-selective cortex. This choice is motivated by the fact that word- and face-selective regions are immediately adjacent (within a few millimeters) on the ventral surface (Yeatman, Rauschecker, & Wandell, 2013; Wandell et al., 2012; Dehaene et al., 2010). We therefore might expect these regions to share equivalent tuning properties even though the computations required to read a word (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Seidenberg & McClelland, 1989) are certainly very different from the computations required to recognize a face (Meyers, Borzello, Freiwald, & Tsao, 2015). In support of the hypothesis that there is a canonical temporal processing profile in adjacent category-selective regions, both words and faces produce a characteristic ERP at comparable latencies (150–170 msec) after the presentation of the visual stimuli (Maurer, Brandeis, & McCandliss, 2005; Bentin, Mouchetant-Rostaing, Giard, Echallier, & Pernier, 1999; Bentin, Allison, Puce, Perez, & McCarthy, 1996). Although the N150–N170 for words and faces each have distinct scalp topographies (Rossion, Joyce, Cottrell, & Tarr, 2003), the temporal similarity between their ERP responses could be hypothesized to reflect consistent temporal tuning properties of neurons across ventral temporal cortex: If one makes the assumption that the ERP is equivalent to the impulse response of a linear system, then one would predict that the temporal tuning of faces and text should be very similar, given the similarity in the latency of the selective activity in the two tasks. An alternative hypothesis is that temporal response properties depend substantially on the specific nature of the computations that the visual system performs on different categories of stimuli, such as words and faces.
This study uses steady-state visually evoked potentials (SSVEPs) to test the hypothesis that there are canonical temporal response properties for regions at the same level of the visual hierarchy (for a recent review of the SSVEP approach, see Norcia, Appelbaum, Ales, Cottereau, & Rossion, 2015). Using the SSVEP, we assessed the temporal frequency tuning preference, the temporal resolution, and the apparent latency of word- and face-selective cortex. Despite the similarity of the N170 response to words and faces, we find markedly distinct temporal properties for the two categories of stimuli.
Eleven adults (four women) between the ages of 18 and 56 years participated. They had normal visual acuity and were screened for neurological and cognitive impairments. Each participant provided written informed consent under a protocol that conformed to the tenets of the Declaration of Helsinki and was approved by the institutional review board of Stanford University.
The text image comprised a block of common English words derived from the MCWord database (www.neuro.mcw.edu/mcword/). The face image comprised a black and white photograph of a cropped female head and face placed on a random texture background. Images extended 12° in each direction from a fixation cross in the center of the screen. To provide a comparison stimulus with the same low- and mid-level image statistics, each image was scrambled using the algorithm developed by Portilla and Simoncelli (2000), which is available at www.cns.nyu.edu/~lcv/texture/. The algorithm learns the joint distribution of filter locations, orientations, and scales from the image (separate distributions were computed for the text and face images) and preserves this histogram in the synthesized, scrambled version. Stimuli are shown in Figure 1.
Intact and scrambled versions of the stimuli were presented in temporal alternation at seven frequencies: 1, 2, 3, 4, 6, 9, and 12 Hz. These frequencies were chosen based on prior SSVEP work on faces (Alonso-Prieto, Belle, Liu-Shuang, Norcia, & Rossion, 2013) because we expected (a) the amplitude of the odd harmonic to drop close to the noise floor by 12 Hz and (b) more rapid changes in amplitude as a function of frequency at low compared with high frequencies, motivating denser sampling of the lower frequencies (1–6 Hz). Observers were given a fixation mark in the center of the image and were instructed to hold their fixation on the mark and to refrain from blinking. The image sequences were presented for 12 sec, with the first and last seconds being excluded from the analysis of the SSVEP. Five trials were run for each temporal frequency and image type with the stimuli presented in random order.
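The design described above can be summarized as a short sketch. The frequencies, image types, trial count, trial duration, and trimmed seconds come from the text; the shuffling scheme, the seed, and the names `build_schedule` and `analysis_window` are assumptions made for illustration, since the exact randomization procedure is not specified.

```python
import random

FREQS_HZ = [1, 2, 3, 4, 6, 9, 12]   # stimulation frequencies from the text
IMAGE_TYPES = ["text", "face"]
TRIALS_PER_CONDITION = 5
TRIAL_SEC = 12
TRIM_SEC = 1                        # first and last second are excluded


def build_schedule(seed=0):
    """Return a shuffled list of (image_type, freq_hz) trials.

    A simple full shuffle is assumed here; the study's actual
    randomization scheme may differ.
    """
    trials = [(image, freq)
              for image in IMAGE_TYPES
              for freq in FREQS_HZ
              for _ in range(TRIALS_PER_CONDITION)]
    random.Random(seed).shuffle(trials)
    return trials


def analysis_window(trial_sec=TRIAL_SEC, trim_sec=TRIM_SEC):
    """Return (start_sec, stop_sec) of the portion of a trial that is analyzed."""
    return trim_sec, trial_sec - trim_sec


schedule = build_schedule()
start_sec, stop_sec = analysis_window()
print(len(schedule), "trials;", stop_sec - start_sec, "sec analyzed per trial")
```

With these values, every stimulation frequency completes a whole number of cycles within the analyzed portion of a trial.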
EEG Recording and SSVEP Analysis
EEG was recorded over 128 channels at a sampling rate of 500 Hz using HydroCell SensorNets (Electrical Geodesics Inc., Eugene, OR) connected to an Electrical Geodesics NetAmp 300 running NetStation 4.3 software. Data analysis was performed offline using in-house software after exporting the data and digital bandpass filtering between 0.3 and 200 Hz.
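The in-house analysis software is not described here. As a minimal sketch of how a first-harmonic amplitude and phase could be extracted from a single channel, the example below assumes a single-bin discrete Fourier transform evaluated at the stimulation frequency. The function name `harmonic_component` and the synthetic test signal are hypothetical; the 500-Hz sampling rate and the 10-sec analysis window (12 sec minus the first and last seconds) follow the text.

```python
import cmath
import math


def harmonic_component(samples, fs_hz, freq_hz):
    """Return (amplitude, phase_rad) of `samples` at `freq_hz`.

    Uses a single-bin discrete Fourier transform. The estimate is exact for
    a pure sinusoid when the record holds a whole number of cycles.
    """
    n = len(samples)
    acc = 0j
    for k, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * freq_hz * k / fs_hz)
    coeff = 2 * acc / n          # factor 2 folds in the negative-frequency half
    return abs(coeff), cmath.phase(coeff)


FS_HZ = 500          # sampling rate from the text
ANALYSIS_SEC = 10    # 12-sec trial minus first and last seconds
STIM_HZ = 4.0

# Synthetic channel: amplitude 2.0 (arbitrary units), phase lag 0.5 rad.
n_samples = FS_HZ * ANALYSIS_SEC
signal = [2.0 * math.cos(2 * math.pi * STIM_HZ * k / FS_HZ - 0.5)
          for k in range(n_samples)]

amplitude, phase_rad = harmonic_component(signal, FS_HZ, STIM_HZ)
print(round(amplitude, 6), round(phase_rad, 6))
```

In this convention a response that lags the stimulus yields a negative phase. Real recordings would also require artifact handling and averaging across trials, which this sketch omits.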
Word- and Face-selective Responses Have Different Temporal Tuning Curves
We find that word- and face-selective responses each have a unique temporal tuning curve, preferred stimulus frequency, and scalp topography (Figure 2). Word-selective cortex shows a peak response to text presented at 1 Hz, and the amplitude of the response declines monotonically as a function of presentation frequency. The brain no longer tracks the change from scrambled to intact text at presentation frequencies above 9 Hz. Face-selective cortex shows a peak response to faces presented at 4 Hz, and the amplitude of the response declines for slower or faster presentation frequencies. For faces, the response is equivalent for 1-Hz and 6-Hz presentation rates. Both word- and face-selective regions show equivalent and minimal responses to stimuli presented at 9 Hz. The scalp topography for words goes from being nearly equal for the two hemispheres at 1 Hz to being strongly left-lateralized at 4 Hz. The right hemisphere word response declines more rapidly as a function of presentation rate than the left hemisphere word response (Figure 2). By contrast, the response to the face image is almost equal for both hemispheres (with a slight right hemisphere preference) at all frequencies where it is measurable, and there is not a substantial change in lateralization at different presentation rates.
Latency Topography Demonstrates Two Distinct Sources at Two Different Times
By comparing SSVEP phase values across temporal frequencies, we derived latency estimates for responses to face and word images (see Equation 1). In a linear time-invariant system, there is a linear relationship between the phase and frequency of a signal. This linear relationship indicates that all frequencies are delayed by the same constant amount (constant group delay). Consistent with the underlying model assumption of a linear time-invariant system, the phase versus frequency functions are linear for text and face stimuli. They differ, however, in slope, with the inferred delay differing by region and stimulus category. By mapping delay over the sensor array, it is apparent that both words and faces show two distinct latencies (Figure 3). This observation suggests the existence of at least two different underlying sources. In the occipitotemporal ROIs, the shortest delay for text is 140.0 ± 6.6 msec but is 159.4 ± 3.0 msec for the face image. A longer latency source is apparent over left occipitotemporal cortex for the text stimuli with a latency of 257.6 ± 7.1 msec. For the face stimuli, longer latency activity is present over right anterior temporal cortex at a latency of 287.9 ± 11.8 msec.
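The phase-based latency estimate can be sketched as follows. This is an illustration of the standard constant-delay relation, under which a delay of tau seconds produces a phase of -2*pi*f*tau at frequency f, so that tau equals the negative slope of phase versus frequency divided by 2*pi. It is not necessarily the exact form of Equation 1; the sign convention, the function names, and the synthetic 150-msec delay are assumptions.

```python
import math


def unwrap(phases_rad):
    """Remove 2*pi jumps so that successive phases differ by less than pi."""
    out = [phases_rad[0]]
    for p in phases_rad[1:]:
        step = p - out[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        out.append(out[-1] + step)
    return out


def delay_from_phase(freqs_hz, phases_rad):
    """Estimate a constant delay (sec) from the slope of phase vs. frequency."""
    phi = unwrap(phases_rad)
    f_mean = sum(freqs_hz) / len(freqs_hz)
    p_mean = sum(phi) / len(phi)
    num = sum((f - f_mean) * (p - p_mean) for f, p in zip(freqs_hz, phi))
    den = sum((f - f_mean) ** 2 for f in freqs_hz)
    slope = num / den                 # least-squares slope, rad per Hz
    return -slope / (2 * math.pi)


# Synthetic example: a hypothetical pure delay of 150 msec.
true_delay_sec = 0.150
freqs_hz = [1, 2, 3, 4, 6]
wrapped = [math.atan2(math.sin(-2 * math.pi * f * true_delay_sec),
                      math.cos(-2 * math.pi * f * true_delay_sec))
           for f in freqs_hz]

estimated_sec = delay_from_phase(freqs_hz, wrapped)
print(round(estimated_sec * 1000, 3), "msec")
```

One limitation of this sketch is that the simple unwrapping step is only unambiguous when the product of the delay and the spacing between adjacent frequencies is below 0.5; the example uses frequencies for which that holds.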
By measuring both the amplitude and phase of the SSVEP as a function of temporal frequency, we derive a richer description of the dynamics of word and face processing than has been possible with traditional ERP measurements, PET, or fMRI. From our measurements, we determined that temporal acuity, peak response frequency, and delay each differ for text and face images. These differences in temporal tuning profiles might be surprising considering (a) word- and face-selective ERPs have been described to have a similar time delay (Cao, Jiang, Gaspar, & Li, 2014; Pegna, Khateb, Michel, & Landis, 2004; Rossion et al., 2003), (b) word- and face-selective regions are immediately adjacent on the ventral surface of the cortex (Yeatman et al., 2013; Wandell et al., 2012; Dehaene et al., 2010), and (c) word- and face-selective regions have been hypothesized to share a common neuronal architecture (Dehaene et al., 2010; Dehaene & Cohen, 2007).
Differences in temporal tuning profiles reflect differences in the nature of the computations performed by word- and face-selective cortex. Despite the close spatial proximity of these regions, our measurements suggest that there must be substantial differences in either the neuronal architecture of, or the hierarchy of regions that feed signals into, word- and face-selective cortex. We find that temporal acuity for faces is substantially higher than for text—the amplitude of the face-selective response at 4–6 Hz is several times higher than the text-selective response. Hence, regions that process faces are more sensitive to rapidly changing stimuli than regions that process text. This observation predicts that perceptual decisions will show markedly different time courses for words and faces.
Previous work has found that the differential SSVEP response to changing identity faces versus constant identity faces is maximal at 6 Hz (Alonso-Prieto et al., 2013). One interpretation of this peak frequency is that it is due to the linear superposition of transient ERPs with a latency of 150–170 msec. However, it is important to note that the latency of an ERP is influenced by two factors: (1) integration time or the amount of time required for a brain region to process the incoming information and reach a maximal response and (2) conduction delay or the amount of time required for the signal to reach this brain region. Hence, the similar ERP latency for words and faces does not by itself indicate that temporal processing is equivalent in word- and face-selective cortex.
Here, we find the best temporal frequencies for driving cortical responses are substantially lower for text (1 Hz) than for face (4 Hz) images. A direct tying of these peak frequencies to transient response latencies via the superposition model would predict latencies of 1000 msec for transient ERPs to words and 250 msec for face responses. These predicted latencies are clearly inconsistent with the common 150–170 msec ERP latency for both stimulus categories (Cao et al., 2014; Pegna et al., 2004; Rossion et al., 2003). This finding shows that, under a different set of measurement conditions, the temporal aspects of the signal in word- and face-selective cortex can be substantially different despite previous reports noting similarities between the ERP waveforms.
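The predicted latencies quoted above correspond to one stimulation period at each peak frequency. The sketch below makes that arithmetic explicit; equating the predicted latency with a single period is our reading of the numbers in the text, and the function name `period_msec` is hypothetical.

```python
def period_msec(freq_hz):
    """Duration of one stimulation cycle in milliseconds."""
    return 1000.0 / freq_hz


# Peak response frequencies reported in this study.
peak_hz = {"text": 1.0, "face": 4.0}

# Latency implied by tying each peak frequency to one full period.
predicted_msec = {name: period_msec(f) for name, f in peak_hz.items()}

# ERP latency range reported for both categories in the cited literature.
erp_range_msec = (150, 170)

for name, latency in predicted_msec.items():
    inside = erp_range_msec[0] <= latency <= erp_range_msec[1]
    print(name, latency, "within ERP range:", inside)
```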
Finally, in addition to the mixture of fixed conduction delays and integration delays inherent in visual processing, the visual system is also manifestly nonlinear, and the conditions under which SSVEP measurements are made (temporally dense stimuli) are very different from the temporally sparse conditions used to measure ERP parameters. The presence of temporal nonlinearities, such as adaptation, also makes it difficult to make direct predictions in the absence of a full nonlinear model of the system response. Here we used the first harmonic of the evoked response as a proxy measure, found the phase–frequency relationship to be linear, and were thus able to calculate an aggregate delay measure for the two stimulus classes we used.
This is the first EEG study to use the Portilla–Simoncelli algorithm (Portilla & Simoncelli, 2000) to create the baseline condition against which the object level response is compared. This algorithm preserves a set of higher-order, joint statistics that are lost when the phase of the power spectrum is scrambled. Our paradigm thus isolates responses (at the first harmonic) to text and face images that are higher-order than those driven by the power spectrum of the image. They are also higher-order than responses driven by the joint statistics encoded by the Portilla and Simoncelli algorithm. Previous work in macaque (Rust & DiCarlo, 2010) has found that responses in inferior temporal cortex differ between intact and scrambled versions of the same image to a greater degree than do the responses in V4 when the Portilla–Simoncelli algorithm is used. A recent report using fMRI in humans (Movshon & Simoncelli, 2014; Freeman, Ziemba, Simoncelli, & Movshon, 2013) has contrasted responses to Portilla–Simoncelli scrambled textures and intact natural textures and found differential responses occurred only at and beyond area V4. Our approach may thus make the resulting SSVEP more selective to the intrinsic structure of orthography and faces than other approaches such as phase scrambling.
By mapping the temporal delay over the electrode array, we find evidence for multiple underlying sources on the basis of significantly different response delays. It is interesting to note that even these long-latency sources continue to respond to steady-state stimulation. In the case of the text response, longer latency activity may reflect increasingly complex orthographic processing. In the case of face-related activity, the long-latency responses over right anterior temporal cortex may arise in the “extended” face network (Haxby, Hoffman, & Gobbini, 2000) that includes anterior inferotemporal cortex (Kriegeskorte, Formisano, Sorger, & Goebel, 2007). Consistent with this interpretation, intracranial recordings with similar stimuli have found SSVEP responses to face images in anterior inferior temporal cortex (Liu-Shuang, Jonas, et al., 2015). Previous transient ERP studies have found a negativity around 250 msec for face stimuli (Schweinberger, Huddy, & Burton, 2004; Schweinberger, Pickering, Jentzsch, Burton, & Kaufmann, 2002) and for objects such as birds or cars after expertise training (Scott, Tanaka, Sheinberg, & Curran, 2006, 2008). These responses are sensitive to repetition and familiarity effects that are not seen in the N170 response. Our approach may be tapping a similar process, as both faces and text are highly overlearned stimuli in typical adults.
SSVEPs represent a promising approach for characterizing the temporal dynamics of high-level visual regions that are selective for text, faces, and other important visual categories. Temporal tuning curves can be reliably estimated from relatively short stimulation paradigms, opening the possibility of studying changes in neural dynamics over the course of development (e.g., learning to read) and in the case of developmental disorders (e.g., dyslexia and prosopagnosia). Our measurements clearly demonstrate that the temporal dynamics of word- versus face-selective cortex differ substantially, laying the foundation for models that relate temporal processing to perception and behavior.
The authors wish to acknowledge the contributions of Faraz Farzin to the conduct of the EEG recordings and early conceptual design of the study. She was supported by a Ruth L. Kirschstein National Research Service Award (F32EY021389).
Reprint requests should be sent to Jason D. Yeatman, Institute for Learning & Brain Sciences (I-LABS), Department of Speech and Hearing Sciences, University of Washington, 1715 Columbia Road N, Portage Bay Building, Seattle, WA 98115, or via e-mail: firstname.lastname@example.org.