The human voice is the primary carrier of speech but also a fingerprint of a person's identity. Previous neuroimaging studies have revealed that speech and identity recognition are accomplished by partially different neural pathways, despite the perceptual unity of the vocal sound. Importantly, the right STS has been implicated in voice processing, with different contributions of its posterior and anterior parts. However, the time point at which vocal and speech processing diverge is currently unknown. Also, the exact role of the right STS during voice processing is so far unclear because its behavioral relevance has not yet been established. Here, we used the high temporal resolution of magnetoencephalography and a speech task control to pinpoint transient behavioral correlates: we found, at 200 msec after stimulus onset, that activity in right anterior STS predicted behavioral voice recognition performance. At the same time point, the posterior right STS showed increased activity during voice identity recognition in contrast to speech recognition, whereas the left mid STS showed the reverse pattern. In contrast to the highly speech-sensitive left STS, the current results highlight the right STS as a key area for voice identity recognition and show that its anatomical-functional division emerges around 200 msec after stimulus onset. We suggest that this time point marks the speech-independent processing of vocal sounds in the posterior STS and their successful mapping to vocal identities in the anterior STS.
Vocal speech conveys linguistic content (what is said) as well as information about the speaker (who is talking). When we listen to speech, our perception of vocal identity is largely independent from the linguistic message the speaker wants to communicate, and we can recognize phonemes, words, and sentences independent from who is talking (Van Lancker & Kreiman, 1987; Van Lancker & Canter, 1982). Neuroanatomical and functional imaging studies show that this dissociation is mirrored by partially different neural pathways: Patients with right-hemispheric lesions suffer from impaired voice identity recognition despite their preserved ability to understand speech (Lang, Kneidl, Hielscher-Fastabend, & Heckmann, 2009; Van Lancker, Kreiman, & Cummings, 1989; Van Lancker, Cummings, Kreiman, & Dobkin, 1988; Van Lancker & Canter, 1982), whereas the opposite pattern is found in patients with left-hemispheric lesions (Lang et al., 2009). Similarly, neuroimaging studies found that cortical areas sensitive to voice identity are mostly located in the right hemisphere, whereas areas sensitive to verbal features are dominant in the left hemisphere (Bonte, Valente, & Formisano, 2009; Formisano, De Martino, Bonte, & Goebel, 2008; Stevens, 2004; Belin & Zatorre, 2003; von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003).
In contrast to the rich literature on speech perception (reviewed, e.g., in Hickok & Poeppel, 2007), relatively little is known about the neural mechanism of voice perception. Several fMRI studies identified voice-sensitive regions predominantly located along the right STS by using, for example, vocal and nonvocal stimuli (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000), speaker and speech recognition tasks (Kriegstein & Giraud, 2004; von Kriegstein et al., 2003), or voice identity adaptation (Belin & Zatorre, 2003). They suggest that voice-sensitive regions along the right STS have different functional roles; the posterior STS (pSTS) is more closely related to the acoustic processing of vocal sounds (Andics et al., 2010; Kriegstein & Giraud, 2004), whereas the anterior STS (aSTS) represents voice identity (Andics et al., 2010; Kriegstein & Giraud, 2004; Belin & Zatorre, 2003).
So far, it is unknown when voice and speech recognition dissociate. In addition, not much is known about the different sensory processing stages of voice recognition along the right STS. Although fMRI studies suggest that voice identity perception particularly implicates the aSTS more than the pSTS, there is so far no indication as to the behavioral relevance of the aSTS. Magnetoencephalography (MEG) and EEG studies not only reveal valuable timing information (Capilla, Belin, & Gross, 2013; De Lucia, Clarke, & Murray, 2010; Charest et al., 2009; Schweinberger, 2001) but may also be more likely to pinpoint transient behavioral correlates, as compared with fMRI studies.
Here, we analyzed auditory evoked MEG responses during a voice identity and speech recognition task using the same stimulus material in both tasks. In combination with individual behaviorally assessed voice recognition performance, we addressed two questions: (i) Where and when do speech-independent neural processes emerge during voice identity recognition? (ii) What is the role of the right aSTS in voice recognition, in contrast to the pSTS? Answering these questions will enable an understanding of the dynamics and functional organization of the right STS during voice identity recognition.
Nineteen participants were included in the analysis (9 women/10 men, mean age = 24.2 ± 2.6 years). All participants were healthy, right-handed volunteers as assessed by the Edinburgh Handedness Inventory and gave their informed written consent before testing. The data of one additional participant were discarded from the analysis because of strong movement artifacts.
Stimuli were auditory speech samples from six different male speakers that consisted of German two-word sentences (e.g., “Er streicht”; He paints). All sentences started with the same pronoun (“Er”; He) and lasted approximately 1 sec. The same 36 different sentences were recorded from all of the six speakers, resulting in a total of 216 stimuli. The 36 verbs (sentences only differed in their verb) consisted of 18 lexical-neighbor pairs (e.g., “Er streicht” and “Er streikt”; He strikes) to increase task difficulty (see below). Stimuli were recorded in a soundproof recording chamber using a high-quality setup (Microphone: TLM 50, Neumann, Berlin, Germany; Mic-Preamp: Mic-Amp F35, Lake People, Konstanz, Germany; Soundcard: Power Mac G5 Dual 1.8 GHz, Apple, Inc., Cupertino, CA; Software: Sound Studio 3, Felt Tip, Inc., New York, NY). Recordings were sampled at 44.1 kHz and saved as wav files (16 bit, mono). Sound recordings were postprocessed using MatLab 7.11 (MathWorks, Natick, MA). To minimize the temporal jitter of sound onsets, stimuli were cut at an amplitude threshold that was adjusted individually for each speaker to account for differences in voice onset dynamics and ramped using half a Blackman window (5 msec). Additionally, all stimuli were adjusted for sound pressure (RMS energy) and masked with pink noise (SNR = 3 dB) to increase task difficulty. For stimulus presentation, Presentation software (Neurobehavioural Systems, Inc., Berkeley, CA) was used. Auditory stimuli were played from a PC (Soundcard: Sound Blaster Audigy 2 ZS, Creative, Singapore) and delivered through in-ear earphones (ER3-14A/B, Etymotic Research, Inc., Elk Grove Village, IL). Sound volume was adjusted using a staircase procedure and set to 50 dB above individual hearing threshold.
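The RMS equalization and pink-noise masking described above can be illustrated with a minimal sketch (hypothetical Python/NumPy, not the authors' MatLab code; the function names and the target RMS value are placeholders):

```python
import numpy as np

def rms(x):
    """Root-mean-square (RMS) energy of a signal."""
    return np.sqrt(np.mean(x ** 2))

def equalize_rms(signal, target_rms=0.05):
    """Scale a signal so that its RMS energy matches a common target."""
    return signal * (target_rms / rms(signal))

def add_pink_noise(signal, snr_db=3.0, rng=None):
    """Mask a signal with pink (1/f) noise at a given SNR in dB."""
    rng = np.random.default_rng(rng)
    # Shape white noise in the frequency domain: 1/f power = 1/sqrt(f) amplitude.
    white = rng.standard_normal(len(signal))
    spec = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(len(signal))
    freqs[0] = freqs[1]              # avoid division by zero at DC
    spec /= np.sqrt(freqs)
    pink = np.fft.irfft(spec, n=len(signal))
    # Scale the noise so that 20*log10(rms(signal)/rms(noise)) == snr_db.
    pink *= (rms(signal) / 10 ** (snr_db / 20)) / rms(pink)
    return signal + pink
```

Applying `equalize_rms` and then `add_pink_noise(..., snr_db=3.0)` to each stimulus reproduces, in outline, the loudness matching and masking steps of the stimulus preparation.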
The stimulus set for the speaker familiarization training (see below) comprised 13 German five-word declarative sentences (e.g., “Die Enten kommen an das Ufer,” The ducks come to the shore) that lasted approximately 2 sec. These were recorded with a high-definition camera (HD Legria HF S10, Canon, Tokyo, Japan) and an external microphone. A detailed description of the training material is provided in Schall, Kiebel, Maess, and von Kriegstein (2013).
Paradigm and Procedure
The experiment contained two phases: (1) a speaker familiarization training during which participants learned to identify six speakers by voice and name and (2) a test phase during which participants performed either a voice or speech recognition task on short auditory speech samples. MEG was only recorded during the second phase, that is, the test phase.
During familiarization training, participants learned to identify and name the voices of six male speakers. In addition to the name, speakers were learned together with a visual stimulus: Half of the speakers were learned together with a video showing the speaker's face, and the other half with a symbol visualizing the speaker's occupation. These two different visual learning conditions were introduced to address a learning-related hypothesis (Schall et al., 2013) and are irrelevant for the present research question. In the current study, it was important for participants to be familiarized with the voices and names so that they could recognize each speaker's identity during the test phase and so that accuracy of voice recognition could be behaviorally assessed. Familiarization was achieved by alternating “passive” and “active” learning. During passive learning, the name of a speaker was presented on the screen (1 sec) followed by voice samples of that speaker consisting of five different five-word sentences (total duration ca. 10 sec) and the accompanying visual stimulus of the speaker. During active learning, there were two conditions. In one condition, participants were presented with the voice (one five-word sentence) followed by the name of the speaker. Participants indicated whether the voice and the name matched in identity or not (three trials per speaker). In the second condition, participants first heard the voice (one five-word sentence) of the speaker and then saw the visual stimulus (face or occupation symbol). Again, the task was to indicate via button press whether the voice and the visual stimulus matched in identity or not (three trials per speaker). Participants received feedback (“correct”/“incorrect”) about their response by means of a voice sample repeated together with the correct name and visual stimulus. Participants were trained until they could reliably match all voices with the corresponding name and visual stimulus (criterion: 80% correct).
All participants reached this criterion after two learning cycles (one learning cycle consisted of one passive learning and one active learning).
Test (MEG Recording; Figure 1)
MEG was recorded in three sessions (∼10 min each) during which participants listened to two-word sentences spoken by the six previously learned speakers and performed either a voice or speech recognition task. A trial started with a fixation cross that was accompanied by a speech sample at ca. 800 msec after fixation cross onset (random jitter between 500 and 1000 msec; Figure 1). After the presentation of the speech sample, the fixation cross was removed, and a written name (voice task) or word (speech task) appeared on the screen. Participants reported via button press (two keys for “yes”/“no”) whether the name matched the preceding speech sample in voice identity (voice task) or whether the word appeared in the speech sample (speech task). Fifty percent of all trials were match-trials. After responding (or 2 sec if there was no response), the name/word disappeared and the screen stayed blank for 500 msec until the start of the next trial. Trials were organized in blocks (12 trials per block) so that participants performed either the voice or speech task for the duration of a whole block. At the beginning of each block a written instruction (“voice”/“speech”) informed participants which task they had to perform. This block design was chosen to minimize time for task instruction. There were 18 blocks for the voice and 18 blocks for the speech task. This resulted in a total of 216 trials per task. To minimize eye and muscle artifacts, participants started blocks in a self-paced manner. Voice and speech blocks occurred in a pseudorandomized order and were balanced across recording sessions.
Participants were comfortably seated in an electromagnetically shielded room (Vacuumschmelze GmbH & Co. KG, Hanau, Germany) while MEG was recorded with a 306-channel whole head MEG device (Vectorview, Elekta-Neuromag Oy, Helsinki, Finland) comprising 102 magnetometers and 204 gradiometers at 102 different locations. The MEG was sampled at 1000 Hz and online band-pass filtered between DC and 330 Hz. Two pairs of electrodes were used to record a bipolar EOG. Additionally, one ECG channel was digitized. During recording, the position of a participant's head was measured using five head position indicators (HPI). The positions of the fiducials, HPI coils, and 50 random points on the head surface were acquired using a 3-D digitizer (FASTRAK, Polhemus, Colchester, VT).
Individual T1-weighted images were acquired with a 3-T magnetic resonance scanner (Magnetom Trio, Siemens AG, Munich, Germany) using an MP-RAGE sequence with a spatial resolution of 1 mm³.
Head movement correction, bad channel interpolation, and external interference suppression were done by applying the Signal Space Separation Method (Taulu, Kajola, & Simola, 2004). For offline filtering (band-pass, 1–20 Hz), artifact rejection, epoching, and intersubject averaging, the MNE software package (version 2.7.2, M. Hämäläinen, MGH, Boston, MA) was used. Trials were discarded if they exceeded a signal change threshold of 200 pT/m (gradiometer), 8 pT (magnetometer), or 100 μV (EOG).
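The trial rejection criterion can be sketched as follows (hypothetical Python/NumPy; it is an assumption here that "signal change" is measured peak-to-peak per channel, as in the MNE software's rejection logic, and the array layout and function names are placeholders):

```python
import numpy as np

def reject_trials(epochs, grad_idx, mag_idx, eog_idx,
                  grad_thresh=200e-12, mag_thresh=8e-12, eog_thresh=100e-6):
    """Return a boolean keep-mask over trials.

    A trial is discarded if the peak-to-peak signal change on any channel
    exceeds its channel-type threshold (gradiometers: 200 pT/m,
    magnetometers: 8 pT, EOG: 100 uV).

    epochs: array of shape (n_trials, n_channels, n_samples), SI units.
    """
    # Peak-to-peak amplitude per trial and channel.
    ptp = epochs.max(axis=2) - epochs.min(axis=2)
    bad = (
        (ptp[:, grad_idx] > grad_thresh).any(axis=1)
        | (ptp[:, mag_idx] > mag_thresh).any(axis=1)
        | (ptp[:, eog_idx] > eog_thresh).any(axis=1)
    )
    return ~bad
```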
Sensor Level Analysis
Averaging across participants, statistical analysis, and visualization of sensor data were performed using the Fieldtrip toolbox (version date: 2012/04/02; Oostenveld, Fries, Maris, & Schoffelen, 2011) implemented in MatLab 7.11 (MathWorks, Natick, MA) and custom-made MatLab scripts.
Cluster permutation analysis
To control for an inflated Type I error, statistical significance of task-sensitive event-related field (ERF) differences was assessed using cluster permutation tests (Maris & Oostenveld, 2007). t values, obtained from the comparison of the two task conditions, were used as a statistic. On these, two-dimensional clusters (Space × Time) were computed based on the data from all sensors and all bins between 0 and 300 msec peristimulus time (cluster size = 2 sensors). The resulting cluster statistic was calculated by summing over all t values within a cluster (sensors and time bins). Surrogate cluster statistics were generated from 1000 random partitions of the two conditions. Condition differences were considered significant for cluster statistics exceeding the 95th quantile of the surrogate distribution (i.e., p < .05).
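The cluster statistic and its permutation null distribution can be illustrated for the simplified one-dimensional (time-only) case; the actual analysis clustered over sensors and time using the Fieldtrip toolbox, and for a within-subject design the random partition of conditions is equivalent to randomly flipping the sign of each subject's condition difference. All names below are hypothetical (Python/NumPy/SciPy):

```python
import numpy as np
from scipy import stats

def cluster_masses(tvals, t_crit):
    """Sum t values within contiguous supra-threshold runs (1-D clusters)."""
    masses, current = [], 0.0
    for t in tvals:
        if t > t_crit:
            current += t
        elif current:
            masses.append(current)
            current = 0.0
    if current:
        masses.append(current)
    return masses

def cluster_permutation_test(cond_a, cond_b, n_perm=1000, alpha=0.05, seed=0):
    """cond_a, cond_b: (n_subjects, n_times) paired condition averages."""
    rng = np.random.default_rng(seed)
    n_sub = cond_a.shape[0]
    t_crit = stats.t.ppf(1 - alpha / 2, df=n_sub - 1)

    def max_cluster(diff):
        # Paired t statistic at each time bin; |t| pools both effect signs.
        t = diff.mean(0) / (diff.std(0, ddof=1) / np.sqrt(n_sub))
        return max(cluster_masses(np.abs(t), t_crit), default=0.0)

    diff = cond_a - cond_b
    observed = max_cluster(diff)
    # Surrogate distribution: random sign flips of the per-subject differences.
    surrogate = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        surrogate[i] = max_cluster(diff * signs)
    p = (surrogate >= observed).mean()
    return observed, p
```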
A descriptive analysis and visualization of the effects as well as the separation of the two contrasts (voice > speech/speech > voice) proved difficult at the cluster level, as a single cluster could spread across hemispheres and time and could contain opposite sign effects. To describe and group the observed effects according to the two contrasts, we therefore performed an additional quadrant analysis. We determined the two magnetometers within each sensor quadrant that were maximally task sensitive and quantified effects on the average of these two sensors.
Source Level Analysis
To assess whether the observed effects at the sensor level are due to differential source activity in the hypothesized STS regions, we performed a source analysis. Neural sources of ERFs (magnetometer and gradiometer data) were estimated using cortically constrained distributed source imaging (Hanson, Kringelbach, & Salmelin, 2010) as implemented in the MNE software package (version 2.7.2, M. Hämäläinen, MGH, Boston, MA). We restricted the statistical analyses to ROIs using MatLab 7.11. We performed two statistical analyses, one to determine the activity in STS and the other to investigate correlations with behavioral performance.
Forward and inverse solution
Forward solutions were computed on the basis of boundary element models as volume conductors and the individual white matter surface as source space. Individual boundary element models were constructed for the inner skull surface that was extracted from segmented T1-weighted MR images. Segmentation of T1-weighted images and surface reconstruction was done using FreeSurfer software (surfer.nmr.mgh.harvard.edu/). Coregistration of MRI and MEG coordinate systems was informed by the HPIs, fiducials, and 50 additional digitization points. Inverse solutions were restricted to the cortical surfaces that were reconstructed individually and reduced to approximately 5000 vertices (dipole locations) per hemisphere. Neural source activation was estimated at each vertex over all time points using the standardized low-resolution electromagnetic tomography method (sLORETA; Pascual-Marqui, 2002). Inverse solutions were computed separately for each condition and for each participant. To allow for averaging across participants, individual solutions were transformed onto a sphere providing a common coordinate system (Fischl, Sereno, & Dale, 1999). For visualization purposes, the spherical representation was morphed onto the flattened cortical surface of a single participant (Fischl, Sereno, Tootell, & Dale, 1999).
ROIs and statistical analysis
Previous fMRI studies have shown that left and right STS are differentially involved during voice and speech recognition (Formisano et al., 2008; Kriegstein & Giraud, 2004; Belin & Zatorre, 2003; von Kriegstein et al., 2003). To assess task-related source activity changes in these regions, we followed an ROI approach and statistically contrasted source activity between the two task conditions (voice task, speech task) along the right and left STS. Right and left STS were determined anatomically based on white matter characteristics using FreeSurfer. They were manually subdivided into roughly equally sized posterior, mid, and anterior parts. For each of the resulting six ROIs, participant- and condition-specific source estimates were extracted, baseline-corrected (−100 to 0 msec), and downsampled (250 Hz). For the statistical analysis, we additionally defined two critical time windows, that is, one for the voice task > speech task contrast and one for the speech task > voice task contrast. Again, the isolation of contrast-specific time windows proved difficult at the cluster level, as one cluster contained different effects changing over time and space. We therefore determined time windows that were maximally sensitive to each contrast by performing a running t test at a significance level of p < .05 on the average of the two maximally task-sensitive magnetometers within each of the four sensor quadrants (anterior/posterior; left/right; 24 sensors per quadrant). Time windows from different quadrants that were sensitive to the same contrast were combined by taking their mathematical union. For the contrast voice task > speech task, time samples that reached the p = .05 significance threshold were found between 177 and 239 msec (including both anterior quadrants and the right posterior quadrant; see Sensor Level Analysis).
For the contrast speech task > voice task, time samples that reached the p = .05 significance threshold were found between 182 and 284 msec (left posterior quadrant; see Sensor Level Analysis).
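The time-window search (a running paired t test per quadrant, then the union across quadrants) could be sketched as follows (hypothetical Python/SciPy; the union is taken here as the enclosing span, which equals the mathematical union when the windows overlap, as they do for the windows reported above):

```python
import numpy as np
from scipy import stats

def task_sensitive_window(cond_a, cond_b, times, alpha=0.05):
    """Running paired t test across time bins.

    cond_a, cond_b: (n_subjects, n_times) quadrant-averaged sensor data.
    Returns the (start, end) span of significant samples, or None.
    """
    _, p = stats.ttest_rel(cond_a, cond_b, axis=0)
    sig = np.where(p < alpha)[0]
    if sig.size == 0:
        return None
    return times[sig[0]], times[sig[-1]]

def union_windows(windows):
    """Combine overlapping windows from different quadrants into one span."""
    starts, ends = zip(*windows)
    return min(starts), max(ends)
```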
We used these maximally task-sensitive time windows to test whether the right STS showed higher activation during the voice compared with the speech task in the 177–239 msec window and whether the left STS showed higher activation during the speech compared with the voice task in the 182–284 msec time window. For these tests, participant- and condition-specific source activations were averaged within each ROI and within each time window and submitted to a paired t test. For the planned contrasts voice task > speech task for the right STS and speech task > voice task for the left STS, we performed one-sided t tests, thereby constraining the statistical analyses to effects in the expected direction. Condition differences were considered significant at p < .05 using a one-sided t test (Bonferroni-corrected for three ROIs) in case the effect had the expected direction. For completeness and to find support for the specificity of the hypothesized effects, effects in the unexpected direction were tested with a two-sided t test at a threshold of p < .05 and a very lenient threshold of p < .1 (Bonferroni-corrected for three ROIs).
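The ROI statistics can be illustrated with a short sketch (hypothetical Python/SciPy; the one-sided p value is derived from the two-sided paired t test, and the Bonferroni correction multiplies by the three ROIs per hemisphere):

```python
import numpy as np
from scipy import stats

def roi_contrast(voice_act, speech_act, n_rois=3):
    """One-sided paired t test (voice > speech) per ROI, Bonferroni-corrected.

    voice_act, speech_act: (n_subjects, n_rois) window-averaged source activity.
    Returns {roi: (t, corrected one-sided p)}.
    """
    results = {}
    for roi in range(n_rois):
        t, p_two = stats.ttest_rel(voice_act[:, roi], speech_act[:, roi])
        # One-sided p for the planned direction voice > speech.
        p_one = p_two / 2 if t > 0 else 1 - p_two / 2
        results[roi] = (t, min(p_one * n_rois, 1.0))  # Bonferroni over ROIs
    return results
```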
To test whether source activity along the right STS predicts behavioral recognition performance, we performed a correlation analysis. To this end, we averaged, for each participant, the differential source–activity (voice task–speech task) within the 177–239 msec window and within each right STS ROI, and correlated the obtained values with individual voice recognition accuracies. Correlations were considered significant at p < .05 (Bonferroni-corrected for three ROIs).
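The correlation analysis amounts to a Pearson correlation per ROI with Bonferroni correction, for example (hypothetical Python/SciPy sketch; variable names are placeholders):

```python
import numpy as np
from scipy import stats

def behavior_correlation(diff_activity, accuracy, n_rois=3, alpha=0.05):
    """Correlate differential source activity with behavioral accuracy.

    diff_activity: (n_subjects, n_rois) voice-minus-speech source activity,
                   averaged within the critical time window per ROI.
    accuracy:      (n_subjects,) voice recognition accuracy.
    Returns {roi: (r, significant after Bonferroni correction)}.
    """
    out = {}
    for roi in range(n_rois):
        r, p = stats.pearsonr(diff_activity[:, roi], accuracy)
        out[roi] = (r, p * n_rois < alpha)  # Bonferroni-corrected decision
    return out
```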
A short description of behavioral results is given in the Behavioral Performance section. MEG data were first analyzed at the sensor level to investigate when the ERFs show a divergence of voice- and speech-related neural responses (Sensor Level Analysis). Next, we performed a source analysis (Source Level Analysis) to test whether the dissociation of vocal and verbal processes is related to a differential involvement of the right and left STS and to determine the timing and behavioral relevance of neural activity along the right STS.
On average, participants had a recognition rate of 85.02% (±8.44%) in the voice task and 87.96% (±2.40%) in the speech task. There was no significant difference in recognition performance between the two tasks (t = −1.5469, df = 18, p = .134, two-sided t test; z = −1.3286, p = .1840, Wilcoxon signed-rank test), suggesting that participants were similarly engaged during both tasks. All participants performed significantly above chance level (50%) in the voice task (p < .01; binomial test) as well as in the speech task (p < .0001; binomial test).
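The above-chance tests are one-sided binomial tests against a chance level of 50%, for example (hypothetical Python sketch; `scipy.stats.binomtest` requires SciPy ≥ 1.7):

```python
from scipy import stats

def above_chance(n_correct, n_trials, chance=0.5):
    """One-sided binomial test: is accuracy above the chance level?"""
    return stats.binomtest(n_correct, n_trials, chance,
                           alternative='greater').pvalue
```

For instance, with 216 trials per task, a participant at roughly 85% accuracy (184 correct) is far above chance, whereas exactly 50% (108 correct) is not.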
Sensor Level Analysis
We were particularly interested in the task-sensitive modulations of ERFs during the sensory encoding of the vocal signal, that is, occurring in the time frame of the auditory evoked fields (50–250 msec). On the basis of previous literature (Lang et al., 2009; Van Lancker & Canter, 1982), we expected a task-specific hemispheric dissociation with a right-hemispheric bias for the contrast voice task > speech task and a left-hemispheric bias for the contrast speech task > voice task.
Cluster Permutation Analysis
The cluster permutation analysis revealed two significant clusters (positive cluster: p = .0070, negative cluster: p = .0130) showing task-sensitive amplitude changes. The positive cluster extended from 145 to 300 msec and included up to 56 sensors (the number of sensors within the cluster varied over time points). The negative cluster extended from 151 to 300 msec and included up to 44 sensors. Note that the polarity of a cluster is not informative about the direction of the effect. Both clusters were first prominent over posterior sensors (Figure 2A) and extended, over time, to anterior sensors. This suggests that voice- and speech-sensitive neural processes begin to diverge during sensory processing stages around the 200-msec latency.
In the following, we describe effects for each sensor quadrant based on the two sensors showing the largest task-sensitive response (see Methods). Similarly, we used the same two sensors within each quadrant to determine the critical time windows for the statistical evaluation of the following source analysis (see Methods).
The ERFs in the posterior quadrants showed the hypothesized direction of task-related differences: Over the right posterior quadrant, ERFs were stronger during the voice as compared with the speech task. This effect was maximal at 197 msec. In contrast, sensors in the left posterior quadrant showed higher amplitudes during the speech compared with the voice task. This effect was more prolonged and showed two peaks, one at 206 and one at 260 msec (Figure 2B, left, gray-shaded area). The anterior quadrants showed larger amplitudes during the voice as compared with the speech task (Figure 2B, right). These amplitude differences peaked at 220 msec in the right anterior quadrant and at 223 msec in the left anterior quadrant.
In summary, we observed (i) in posterior sensors, in a time period around 200 msec, a significant task-specific hemispheric dissociation, that is, an amplitude difference between voice task and speech task in the right hemisphere and between the speech task and the voice task in the left hemisphere, and (ii) in anterior sensors around 220 msec significantly higher amplitudes for the voice in contrast to the speech task in both hemispheres (Figure 2C).
Source Level Analysis
Right STS: Activity Analysis
On the basis of previous studies that showed right STS activation in the processing of voices, we hypothesized that higher source activity would be found during voice compared with speech recognition in the right STS (Capilla et al., 2013; Formisano et al., 2008; Kriegstein & Giraud, 2004; Belin et al., 2000). In addition, results of an fMRI study with a similar design to the present MEG study (i.e., comparing voice and speech recognition tasks) have implied that right pSTS and aSTS play distinct functional roles (Kriegstein & Giraud, 2004). Here, we used source analysis to determine whether and which right STS regions contributed to the 200-msec response observed at the sensor level in the form of increased activation during the voice compared with the speech task. We analyzed the source activity differences in the right STS within the maximally task-sensitive time window for the contrast voice task > speech task (177–239 msec, see ROIs and Statistical Analysis sections).
We found that the right pSTS showed significantly stronger source activity for the voice compared with the speech task (one-sided t test: t = 2.4359, df = 18, p = .0127, Bonferroni-corrected p = .0382; Figure 3A, top). This effect peaked at 214 msec. No other STS region showed a significantly stronger activation during the voice than the speech task. In contrast, the right aSTS showed greater source activity for the speech than for the voice task. This effect was, however, only present at a very lenient threshold of p < .1 after Bonferroni correction (two-sided t test: t = −2.3361, df = 18, p = .0313, Bonferroni-corrected p = .0938; Figure 3A, bottom). For descriptive visualization, the source activity differences for the whole right cortical surface are shown in Figure 3B. They suggest that the effect found in the right pSTS extended to the superior temporal gyrus and the planum temporale.
Right STS: Correlation Analysis
So far, it is unclear whether the involvement of the right STS during voice perception is behaviorally relevant. We therefore tested whether the source activity difference between the voice and speech tasks along the right STS correlates with individual voice recognition accuracies.
We found that the task-related activity difference (voice task − speech task) in the right aSTS correlated significantly with behavioral performance (r = 0.6472, p = .0027, Bonferroni-corrected p = .0082; Figure 3C). This means that participants who were better in recognizing voice identities had comparably higher aSTS activity during the voice than the speech recognition task. This correlation remained significant when the two outliers (performance correct < 75%) were excluded (r = 0.5746, p = .0158). The time course of the correlation coefficients (Figure 3D) shows that the correlation peaked at 206 msec. This was the case only for the right aSTS; no other STS region showed a significant correlation with behavior. For descriptive visualization, we also plotted the correlation coefficients for the whole cortical surface. This revealed that voice recognition performance correlated with source activity over an extended right anterior-temporal area including insular and inferior temporal regions (Figure 3E). This correlation was specific for voice recognition performance; none of the right STS areas showed a correlation between source activity and speech recognition performance (pSTS: r = 0.0576, p = .8148; mid STS [mSTS]: r = 0.1955, p = .4224; aSTS: r = 0.2571, p = .2880).
Functional imaging studies have shown that the left STS is crucially involved in speech comprehension (Matsumoto et al., 2011; Rosen, Wise, Chadha, Conway, & Scott, 2011; Leff et al., 2008; Davis & Johnsrude, 2003; Scott, Blank, Rosen, & Wise, 2000), and it has also specifically been shown to be comparatively more active during speech than during voice recognition (von Kriegstein et al., 2003). We therefore expected for the contrast speech task > voice task to find increased source activity in the left STS. We analyzed left STS activity in the maximally task-sensitive time window for the contrast speech task > voice task (182–284 msec; see ROIs and Statistical Analysis section).
We found that in left mSTS, the difference in source activity between the speech and the voice task was significant (one-sided t test, t = 2.3287, df = 18, p = .0159, Bonferroni-corrected: p = .0476). The activation profiles of the left mSTS are shown in Figure 4A. No other part of the left STS showed significant task-related differences. For descriptive visualization, condition differences are also shown for the complete left cortical surface (Figure 4B).
For completeness, we also checked for any correlation between differential source activity in the left STS and speech recognition performance. None of the STS regions showed a correlation with speech recognition performance (pSTS: r = −0.1359, p = .5511; mSTS: r = −0.0882, p = .7196; aSTS: r = 0.1812, p = .4578).
The present results show that the 200-msec latency marks a critical time point at which neural population responses to vocal and verbal information diverge. At this latency, the right STS was involved in voice identity recognition, whereas the left STS was engaged in speech recognition. Importantly, the voice-related activity in the right STS showed different response characteristics in its posterior and anterior parts. The right pSTS showed generally higher activity during voice than during speech recognition, whereas activity in the right aSTS correlated with behavioral voice recognition performance. This finding suggests that the right aSTS plays a fundamental role in voice identity recognition and that this happens at a latency of around 200 msec, that is, at the same time as the analysis in the right pSTS.
The 200-msec latency response to voice identity recognition in the right pSTS is in agreement with previous research, which showed that voice-sensitive clusters are predominantly located along the right STS (Kriegstein & Giraud, 2004; von Kriegstein et al., 2003; Belin, Zatorre, & Ahad, 2002; Belin et al., 2000). In the present study, this activated cluster also included the planum temporale. Extensions from STS to planum temporale have also been observed in an fMRI study (Belin et al., 2000). There is good evidence that right pSTS and aSTS serve different functional roles despite their common preference for vocal stimuli. Although the specific role of the right pSTS is currently unknown, it is thought to be closely tied to the acoustic processing of voices, such as the acoustic effect of vocal tract length (von Kriegstein, Smith, Patterson, Kiebel, & Griffiths, 2010), without being sensitive to individual voice identities (Andics et al., 2010; Kriegstein & Giraud, 2004; Belin & Zatorre, 2003). In the current context, it is conceivable that the right pSTS is involved in the acoustic processing of the spectral fine structure of the stimulus. Such an interpretation is in line with the timing of the response, that is, 200 msec, a latency that has been shown to be highly sensitive to the spectral stimulus structure (Altmann, Gomes de Oliveira, Heinemann, & Kaiser, 2010; Shahin, Roberts, Pantev, Trainor, & Ross, 2005; Schweinberger, 2001).
A particular role of the right aSTS in voice identity processing has been suggested by several fMRI studies (Andics et al., 2010; Kriegstein & Giraud, 2004; Belin & Zatorre, 2003). For example, using a voice adaptation paradigm, Belin and Zatorre (2003) showed that the right aSTS adapts to voice identities despite changing vocalizations, that is, changes in the spectrotemporal stimulus properties. The representation of vocal identity has recently also been described for single neurons within the monkey anterior STP, that is, the voice-sensitive analogue of the human STS (Perrodin, Kayser, Logothetis, & Petkov, 2011; Petkov et al., 2008). Although a link between right aSTS activity and vocal identity recognition has not yet been established, two recent studies suggest that the right mSTS correlates with behavioral measures tapping into vocal perception. An fMRI study found that participants with a better memory for voices show a higher response to voices and other auditory stimuli within the mSTS (Watson, Latinus, Bestelmeyer, Crabbe, & Belin, 2012). Also, using TMS, Bestelmeyer, Belin, and Grosbras (2011) showed that a disruption of right mSTS activity impairs participants' ability to discriminate vocal from other environmental sounds while leaving their loudness perception intact. However, these studies did not clearly disentangle voice identity-specific neural processes from more general auditory processes, and it remained unclear whether and when the right aSTS is essential for vocal identity recognition. If individual voice identities are represented within the right aSTS, its activity should correlate with individual voice recognition performance. The present study shows that this is indeed the case: The engagement of the right aSTS during voice recognition (in contrast to speech recognition) was linked to better voice recognition performance.
The region revealed by the correlation analysis was spatially distributed and included several anterior-temporal areas, including the insula (Figure 3E). The large spatial extent of this correlation was surprising, although a voice-sensitive response in the insula is not implausible given previous evidence for voice-sensitive cells in the posterior insula (Remedios, Logothetis, & Kayser, 2009). However, given the limited spatial resolution and inherent spatial smoothness of minimum-norm estimates (Hanson et al., 2010), the large spatial extent should be interpreted with caution. The finding that the right aSTS showed, on average, higher activity during speech than speaker recognition may seem counterintuitive at first sight. However, the increased right aSTS activity during speech recognition was only present at a very lenient threshold, and it is therefore unclear to what extent this result is reliable. Furthermore, even if the right aSTS was, at the group level, more involved in speech recognition, this would not rule out that it crucially contributes to voice identity recognition and that participants with comparatively high right aSTS activity are therefore better performers.
Although there was a 20-msec delay between the posterior and anterior sensor-level effects, this temporal pattern was not reflected at the source level. The correlation in the right aSTS occurred within the same 200-msec time range and peaked 8 msec before the differential activity in the right pSTS. This finding speaks against a strictly sequential processing of voices along the STS, as suggested by early models of voice recognition (Ellis, Jones, & Mosdell, 1997), and is more congruent with a concurrent acoustic analysis of voices and processing of their identity (Belin, Fecteau, & Bedard, 2004).
Although several neuropsychological and neuroimaging studies have shown that voice identity and speech information follow partially different neural pathways (Bonte et al., 2009; Lang et al., 2009; Formisano et al., 2008; Kriegstein & Giraud, 2004; von Kriegstein et al., 2003; Van Lancker & Canter, 1982), there is little information about the timing of this dissociation. Two EEG studies, both based on the same underlying data, suggest that a dissociation between speech task- and voice task-related neural processes occurs rather late (>300 msec), after the sensory encoding of the stimulus at 100–200 msec (Hausfeld, De Martino, Bonte, & Formisano, 2012; Bonte et al., 2009). The late task-related components appeared to originate from different hemispheres; however, without source reconstruction, laterality effects in EEG are difficult to interpret. Here, we showed that neural population responses reflecting voice task- and speech task-related processes diverge at 200 msec after stimulus onset. Around this time point, we also found the expected hemispheric dissociation between voice- and speech-related processes, that is, a right-hemispheric bias during voice recognition and a left-hemispheric bias during speech recognition. The 200-msec latency range is considered a late auditory evoked response (for a review, see Crowley & Colrain, 2004), mostly reflecting lower-level acoustic processing of an auditory signal (for a review, see Martin, Tremblay, & Korczak, 2008). The timing of the neural dissociation, as well as its localization to the right pSTS for voice recognition and the left mSTS for speech recognition, suggests that it reflects differences in the sensory encoding of the auditory stimulus in auditory association cortices during the two tasks. Given that the same stimulus material was used in both conditions, these differences cannot be due to a differential bottom–up processing of acoustic features.
We rather assume that the sensory processing of the behaviorally relevant acoustic features is selectively enhanced via top–down mechanisms. This scenario leaves room for two explanations: (1) Neurons in the left and right STS are differentially tuned, and this differential tuning becomes apparent when response gains are increased, for example, by feature-based attention (Treue & Martinez Trujillo, 1999). (2) The hemispheric asymmetry is caused by a differential top–down modulation by higher-order cortices, which may arise, for example, from a differential connectivity structure beyond the sensory encoding stage (Hsiao, Cipollini, & Cottrell, 2013). Language-sensitive regions in the left hemisphere may, for instance, amplify responses in the left STS, while identity-sensitive regions in the right hemisphere amplify processing in the right STS. The current results are not sufficient to clearly adjudicate between these two alternatives, and we note that the two alternatives are not mutually exclusive. However, two findings suggest that the first mechanism, feature-based attention, plays a role in explaining our data: (i) the condition difference at the 200-msec latency shows relatively little spread across nonsensory areas, and (ii) there is no evidence for condition differences before the 200-msec latency.
In voice perception studies, the time frame around the 200-msec latency has been identified with remarkable consistency, for example, when comparing different stimulation conditions such as vocal and nonvocal sounds (Capilla et al., 2013; De Lucia et al., 2010; Charest et al., 2009) or human and animal vocalizations. Similar latency ranges have also been found in adaptation and priming designs (Renvall et al., 2012; Zaske, Schweinberger, Kaufmann, & Kawahara, 2009; Schweinberger, 2001). The current study employed a design in which voice-sensitive processes are selectively enhanced by different tasks. As the same stimulus material was used for both the speech and voice tasks, the observed 200-msec differential response cannot be due to different stimulus properties but must arise from differing top–down modulations. Similar to adaptation and priming studies, this design has the advantage that natural speech, with all its inherent cues to vocal identity, can be used as stimulus material without the risk that speech- rather than voice-related processes drive the response. In addition, using familiarized voices together with a voice recognition task ensures that neural processes related to voice identity recognition are reflected in the recorded neural activity.
In contrast to the right STS, the left mSTS was significantly more involved during speech than during voice recognition. This is in line with fMRI studies showing a left-hemispheric dominance when comparing intelligible and nonintelligible speech (Rosen et al., 2011; Leff et al., 2008; Davis & Johnsrude, 2003; Scott et al., 2000) or when comparing speech and voice recognition (von Kriegstein et al., 2003). Together with the right-hemispheric dominance during voice recognition, the present results are consistent with general findings of hemispheric asymmetry in the human brain. It has been suggested that the commonly observed hemispheric asymmetries during speech perception arise from asymmetries in the auditory sensory system. Several studies suggest that the left hemisphere is more sensitive to temporal variations, whereas the right hemisphere is more sensitive to spectral detail (Obleser, Eisner, & Kotz, 2008; Jamison, Watkins, Bishop, & Matthews, 2006; Zatorre & Belin, 2001). Another line of evidence supports a model that posits different temporal sampling for the left and right hemispheres, with the left hemisphere being more sensitive to fast and the right hemisphere to slow auditory modulations (Giraud et al., 2007; Boemio, Fromm, Braun, & Poeppel, 2005; Poeppel, 2003). The present study relates to these theories because voice recognition is thought to rely on relatively time-invariant spectral features such as timbre and fundamental frequency (pitch; Gaudrain, Li, Ban, & Patterson, 2009; Lavner, Gath, & Rosenhouse, 2000), whereas speech recognition entails decoding amplitude and spectral modulations on a faster timescale.
In conclusion, this study suggests that voice identity recognition involves two parallel and functionally complementary processes emerging around the 200-msec latency: (i) the speech-independent processing of voices in the right pSTS and (ii) the mapping of vocal sounds to vocal identities in the right aSTS.
This work was supported by a Max Planck Research Group grant to K. v. K.
Reprint requests should be sent to Sonja Schall, Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstrasse 1a, 04103 Leipzig, Germany, or via e-mail: email@example.com.