Abstract

Humans can easily recognize the motion of living creatures from only a handful of point-lights that describe the motion of the main joints (biological motion perception). This special ability to perceive the motion of animate objects signifies the importance of spatiotemporal information in perceiving biological motion. The posterior STS (pSTS) and posterior middle temporal gyrus (pMTG) region has been established by many functional neuroimaging studies as a locus of biological motion perception. Because listening to a walking human also activates the pSTS/pMTG region, the region has been proposed to be supramodal in nature. In this study, we investigated whether the spatiotemporal information from simple auditory stimuli is sufficient to activate this biological motion area. We compared spatially moving white noise with a running-like tempo consistent with biological motion against stationary white noise. The moving-minus-stationary contrast showed significant differences in activation of the pSTS/pMTG region. Our results suggest that the spatiotemporal information of the auditory stimuli is sufficient to activate the biological motion area.

INTRODUCTION

For animals, including humans, detecting the sound of footsteps and determining the direction and location of their movement have ecological significance for survival. Although the sound of a single footstep merely provides auditory information about the footwear contacting the walking surface, listening to consecutive footsteps in real life allows us to infer how a person is moving in a certain environment. People can determine not only the direction and speed of locomotion but also the gender (Li, Logan, & Pastore, 1991), posture (Pastore, Flint, Gaston, & Solomon, 2008), and stride length (Young, Rodger, & Craig, 2013) of the walker. This phenomenon is comparable to visual biological motion perception.

Johansson (1973) used the term “biological motion” to refer to the characteristic motion patterns of living beings during locomotion and showed that the movement of 10–12 small light bulbs attached to the main joints (known as point-light [PL] animation) could evoke biological motion perception. A single frame of a PL animation may appear as a meaningless assembly of dots. However, successive frames of the PL animation convey a compelling impression of human motion, such as walking, running, or dancing. This special ability to perceive the motion of animate objects demonstrates the importance of spatiotemporal information in perceiving biological motion. Many human neuroimaging studies that have investigated the brain areas responsive to biological motion perception using PL animations have reported that the posterior STS (pSTS) region is activated more by actual PL animations than by scrambled PL animations (for a review, see Blake & Shiffrar, 2007). In this article, we use the term “pSTS/posterior middle temporal gyrus (pMTG)” because biological motion perception activates the pSTS as well as the adjacent pMTG (Allison, Puce, & McCarthy, 2000). Bidet-Caulet, Voisin, Bertrand, and Fonlupt (2005) investigated whether the pSTS/pMTG region could be activated by auditory stimuli and found that it was activated when listeners judged the direction of motion of a crossing walker from the sound of footsteps. This result indicates that auditory stimuli related to human motion also activate the pSTS/pMTG region.

In our previous study, we compared realistic (externalized) and artificial (internalized) spatial sounds and found that the activity in the posterior superior temporal gyrus (pSTG) and pMTG was greater in response to the externalized stimuli than in response to the internalized stimuli (Callan, Callan, & Ando, 2013). The externalized stimuli were individualized binaural recordings, and the internalized stimuli were stereo recordings. By recording sounds with microphones placed within the left and right pinnae, binaural recordings include the acoustic filter characteristics of the head and ears (head-related transfer function [HRTF]) and room reverberation. During headphone listening, the HRTF is known to be responsible for perceptual sound externalization when head motion cues are not available (Blauert, 1997; Hartmann & Wittenberg, 1996; Plenge, 1974), and room reverberation is also considered to improve perceptual sound externalization (Shinn-Cunningham, Kopco, & Martin, 2005; Mershon & Bowers, 1979). The enhancement of activity in pSTG in response to the externalized stimuli shown in our previous study (Callan et al., 2013) is consistent with studies implicating pSTG in auditory spatial processing (for a review, see Ahveninen, Kopco, & Jaaskelainen, 2014). However, the second region found to be active, within pMTG spreading into pSTS, is not often reported to be involved in auditory spatial perception; instead, it is often reported to be involved in biological motion perception (Allison et al., 2000). Our finding of activation in this region is consistent with the position that the auditory stimuli used in our earlier study (Callan et al., 2013) evoked biological motion perception. However, this hypothesis could not be tested because there were no conditions in which the stimuli were stationary.

Just as PL displays evoke biological motion perception when set in motion (apparent or real), white noise bursts with the proper spatiotemporal spacing can evoke biological motion perception associated with footsteps during locomotion. The importance of temporal (interval between footsteps) rather than spectral cues in recognizing auditory biological motion of human locomotion has been proposed (Cottrell & Campbell, 2014). In support of this proposal, a study concerned with the psychoacoustics of locomotion (Kayahara & Abe, 2011) found that “running” versus “walking” perception is primarily mediated by the temporal characteristics of the interstep onset interval (ISOI). Footstep sounds are most likely to be rated as “walking” when the ISOI is approximately 583 msec and as “running” when the ISOI is approximately 250 msec (Kayahara & Abe, 2011). In each trial of the Callan et al. (2013) study, participants heard trains of three recorded white noises, each originally presented from 1 of 12 speaker locations spaced 30° apart on the horizontal plane. The silent interval between successive noises in each train was 200 msec; computed over this interval, the sound moved with a velocity of 150°/sec. The duration of each noise was 100 msec; therefore, the ISOI was 300 msec. In correspondence with Kayahara and Abe (2011), our noise stimuli therefore had a running-like tempo (300-msec ISOI) and could be perceived as biological motion arising from human locomotion.
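
As a check on this timing arithmetic, the short sketch below (ours, not part of the original study) reproduces the reported values, assuming the 150°/sec velocity is computed over the 200-msec silent interval during which the source displaces by 30°:

```python
# Stimulus-timing arithmetic for the Callan et al. (2013) noise trains.
noise_duration_ms = 100   # duration of each white-noise burst
gap_ms = 200              # silent interval between successive bursts
step_deg = 30             # angular separation between speaker locations

isoi_ms = noise_duration_ms + gap_ms           # interstep onset interval
velocity_deg_s = step_deg / (gap_ms / 1000.0)  # displacement occurs during the gap

print(isoi_ms)         # 300 -> running-like tempo (Kayahara & Abe, 2011)
print(velocity_deg_s)  # 150.0 deg/sec, as reported
```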

The main purpose of this study was to verify whether the step-like motion of noise stimuli activates the pSTS/pMTG region known to be involved with biological motion (for a review, see Blake & Shiffrar, 2007). For this purpose, unlike the external versus internal comparison made in Callan et al.'s (2013) study, we compared moving and stationary sounds. In the moving sound condition, we presented trains of sounds at three different locations (e.g., 0°, 30°, and 60°) to simulate biological motion consistent with the timing of running footsteps. In the stationary sound condition, we presented trains of three identical sounds (e.g., 30°, 30°, and 30°). If the step-like motion of the noise stimuli were processed as biological motion, we expected that the activity induced in the pSTS/pMTG region by the moving stimuli would be greater than that induced by the stationary stimuli. We also investigated the effects of room reverberation in auditory spatial processing. For this purpose, we prepared computationally created auditory stimuli that had binaural localization cues (i.e., interaural time difference [ITD] and interaural level difference [ILD]) equivalent to the stereo recording stimuli but that did not include room reverberation. We call these stimuli ITLD. The neural effects of room reverberation were assessed by comparing the stereo recording (with reverberation) with the computationally created stimuli (without reverberation). If room reverberation improves perceptual sound externalization, we expected that the stereo recording stimuli would induce greater activity in pSTG than the computationally created stimuli.

METHODS

Participants

Thirteen adults (seven men; mean age = 28.8 years, range = 22–40 years) participated in the fMRI study. All participants had no neurological or psychiatric history, had pure tone thresholds within the normal range for octave frequencies between 250 and 8000 Hz (≤20 dB HL), and gave written informed consent for experimental procedures approved by the institutional review board at the National Institute of Information and Communications Technology.

Stimuli and Procedure

The auditory stimuli consisted of three types of recorded sounds (binaural, stereo, and mono recordings) and one type of computationally created sound (ITLD). The three recorded stimulus types were the same as those used in our previous study (Callan et al., 2013). The source sound (i.e., the sound used for recording) was band-pass filtered white noise with cutoff frequencies of 0.6 and 22 kHz, a duration of 100 msec, and rise and fall times of 20 msec (Zimmer & Macaluso, 2005). For recording, the noise was presented through a loudspeaker (Eclipse TD508II, Fujitsu Ten Ltd., Hyogo, Japan) placed around the participants at a distance of 2 m. To avoid variance caused by disparity in the acoustical characteristics of different speakers, only one speaker was used. By moving the speaker, we recorded sounds presented from 12 horizontal directions (0°, 30°, 60°, 90°, 120°, 150°, 180°, −150°, −120°, −90°, −60°, and −30°), in which the 0° angle was in front of the participant and angles >0° were in the right frontal hemifield (Figure 1).

Figure 1. 

The 12 horizontal directions from which the auditory stimuli were presented and the response button unit. The 12 directions were divided into four quadrants. In this figure, different quadrants are presented in different colors. Participants were instructed to press one of four buttons corresponding to the quadrants.

The following recording methods were used: For the binaural recording, each participant sat in a chair. In-ear binaural microphones (SP-TFB-2, The Sound Professionals Inc., Hainesport, NJ) were attached at the entrance of the participant's ear canals, and the stimuli were recorded through these microphones. For the stereo recording, the same microphones were placed in positions similar to the locations of the participants' ears (the height from the floor was 115 cm, and the distance between the left and right microphones was 15.6 cm). For the mono recording, the left microphone was placed in the center of the positions used for the stereo recording. The recordings took place in a slightly reverberant room (T60 [the time for the sound to decrease by 60 dB] = 0.14 sec). For each direction, the noise was recorded 18 times, and nine of the recordings were used for the fMRI experiment. Altogether, 108 binaurally recorded sounds (12 directions × 9 recordings) were prepared for each participant. One hundred eight stereo sounds and 108 mono sounds were recorded once and used for all participants. All recorded sounds contained a 100-msec direct sound component that was accompanied and followed by room reflections. We added a 150-msec interval to the 100-msec direct sound component to include all reflections, so the total duration of each recorded sound was 250 msec. Sound levels for each recording type were matched using coefficients calculated to equalize the root mean square (RMS) energy of the sounds in front of the participants (angle 0°). The ITLD stimuli were computationally created by adding an ITD and ILD equivalent to those of the stereo recordings to the source sound.
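
A minimal numpy sketch of these two signal-processing steps, RMS level matching and ITLD construction, is given below. It is our own illustration under stated assumptions (synthetic waveforms as stand-ins for the recordings; illustrative function names and cue values), not the study's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100  # assumed sampling rate

# Hypothetical stand-ins: a monaural source burst and two frontal (0-degree)
# recordings whose levels are to be matched.
source = 0.05 * rng.standard_normal(int(0.1 * fs))
front_binaural = 0.05 * rng.standard_normal(int(0.25 * fs))
front_stereo = 0.08 * rng.standard_normal(int(0.25 * fs))

def rms(x):
    """Root-mean-square energy of a waveform."""
    return np.sqrt(np.mean(np.square(x)))

# Coefficient that equalizes the RMS energy of one recording type to another,
# derived from the frontal (0-degree) recordings as described above.
coef = rms(front_binaural) / rms(front_stereo)
assert np.isclose(rms(coef * front_stereo), rms(front_binaural))

def apply_itld(mono, itd_samples, ild_db):
    """Impose an interaural time difference (a pure sample delay) and an
    interaural level difference (a dB gain) on a monaural burst, yielding a
    two-channel stimulus that carries binaural cues but no reverberation."""
    gain = 10.0 ** (ild_db / 20.0)
    near = np.concatenate([mono, np.zeros(itd_samples)])        # leading ear
    far = gain * np.concatenate([np.zeros(itd_samples), mono])  # lagging ear
    return np.stack([near, far])

# Placeholder cue values (~0.3-msec ITD, -6-dB ILD), not the study's numbers.
itld_stimulus = apply_itld(source, itd_samples=13, ild_db=-6.0)
```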

Step-like motion stimuli were produced by presenting trains of three sounds. Sound durations were 250 msec (100-msec direct sound and 150-msec reflections), intersound onset intervals were 300 msec, the duration of each train was 850 msec, and intertrain intervals were 2150 msec. We divided the 12 directions into four quadrants. In the moving sound conditions, three sounds belonging to the same quadrant were presented as a train in either ascending or descending order (e.g., 0°, 30°, and 60° or 60°, 30°, and 0°). If the binaural recordings reproduced the distance to the loudspeaker (i.e., 2 m) accurately, the sound was moving with a velocity of 18.84 km/hr. In the stationary sound conditions, we presented trains of three identical noises (e.g., 30°, 30°, and 30°). The combination of four sound types (binaural recording [BAR], stereo recording [STEREO], ITLD, and mono recording [MONO]) and two motion levels (moving [m] and stationary [s]) produced eight experimental conditions (BARm, BARs, STEREOm, STEREOs, ITLDm, ITLDs, MONOm, and MONOs). A null condition was also included. The auditory stimuli were delivered via MR-compatible headphones (ceramic transducer headphones; frequency range = 30–40,000 Hz, approximately 20 dB SPL passive attenuation; Hitachi Advanced Systems, Kanagawa, Japan). We employed a rapid event-related paradigm, and conditions were presented in random order.
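
The 18.84 km/hr figure follows from the geometry above. The sketch below (our own check, assuming as before that the 30° displacement occurs during the 200-msec silent gap) reproduces it up to rounding:

```python
import math

distance_m = 2.0              # loudspeaker distance used for the recordings
step_rad = math.radians(30)   # angular step between successive sounds
gap_s = 0.200                 # silent interval over which the source displaces

arc_m = distance_m * step_rad         # arc length per step, ~1.047 m
velocity_kmh = (arc_m / gap_s) * 3.6  # m/s -> km/hr
print(round(velocity_kmh, 2))         # 18.85, matching the reported 18.84 km/hr
```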

Participants were instructed to indicate the quadrant in which a sound source was located by pressing one of four buttons (e.g., 0°, 30°, and 60° → right front; −120°, −150°, and −180° → left back; Figure 1) and to press any button twice if they were unable to localize the sound source. The localization task was chosen because it was the same as in our previous study (Callan et al., 2013), allowing us to compare the results, and because we wanted to avoid the top–down effects that might be caused by asking participants to what degree the auditory stimuli sounded like footsteps. Four localizable responses and one unlocalizable response yielded five response options; therefore, the chance level for the task was 20%. The experiment consisted of four runs, each containing 16 trials per condition. The duration of each run was approximately 7.5 min. Percentages of correct responses for each condition were calculated. Differences between conditions were analyzed using a 4 × 2 repeated-measures ANOVA, with post hoc tests using the Bonferroni correction, as sketched below.
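
A hedged sketch of this 4 (Sound type) × 2 (Motion) repeated-measures ANOVA using statsmodels is shown below. The accuracy values are synthetic placeholders with the same shape as the real data (13 participants × 8 conditions), generated only so the snippet runs:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for subject in range(13):
    for sound in ["MONO", "BAR", "STEREO", "ITLD"]:
        for motion in ["moving", "stationary"]:
            # Placeholder accuracies loosely centered on the reported means.
            base = {"MONO": 21, "BAR": 54, "STEREO": 43, "ITLD": 43}[sound]
            rows.append((subject, sound, motion, base + rng.normal(0, 5)))
df = pd.DataFrame(rows, columns=["subject", "sound", "motion", "accuracy"])

# Two-way repeated-measures ANOVA: 4 Sound types x 2 Motion levels.
result = AnovaRM(df, depvar="accuracy", subject="subject",
                 within=["sound", "motion"]).fit()
print(result)
```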

MRI Data Acquisition and Preprocessing

A 3T scanner (MAGNETOM Tim Trio, Siemens, Erlangen, Germany) with a 32-channel head coil at the ATR Brain Activity Imaging Center was used for structural and functional brain imaging. Functional T2*-weighted images were acquired using a gradient EPI sequence (repetition time = 3000 msec, flip angle = 90°, matrix size = 64 × 64 pixels, field of view = 192 × 192 mm, 30 slices). The acquisition time was 1800 msec; therefore, there was a 1200-msec quiet period between scans. We presented the auditory stimuli during these interscan intervals so that the presentation of the auditory stimuli was not interrupted by the MRI scanning noise. The slice thickness was 3 mm, and 150 volumes per run were obtained. The first two scans from each run were discarded to take into account the effects of T1 equilibration.
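
The sparse-presentation arithmetic can be checked directly: with a 3000-msec repetition time and an 1800-msec acquisition time, the quiet period is 1200 msec, which comfortably contains one 850-msec stimulus train. A trivial sketch of this check:

```python
tr_ms = 3000    # repetition time
ta_ms = 1800    # acquisition time per volume
train_ms = 850  # duration of one three-sound train

quiet_ms = tr_ms - ta_ms     # 1200-msec silent gap between acquisitions
assert train_ms <= quiet_ms  # the whole train fits inside the quiet period
print(quiet_ms - train_ms)   # 350 msec of margin around each train
```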

Images were preprocessed using programs within SPM8 (Wellcome Department of Cognitive Neurology, London). Images were realigned, slice-time corrected, spatially normalized (voxel size = 2 × 2 × 2 mm) using a template defined by the Montreal Neurological Institute (MNI), and smoothed with a Gaussian kernel of twice the voxel size (6 × 6 × 6 mm FWHM). Before the acquisition of functional images, T2-weighted anatomical images were acquired in the same plane as the functional images (matrix size = 256 × 256 pixels). The T2-weighted images were coregistered with the mean of the functional images and used to calculate the parameters required for the spatial normalization of the functional images.

fMRI Data Analysis

Preprocessed MRI data were statistically analyzed on a voxel-by-voxel basis using the SPM8 software package (128-sec high-pass filter; serial correlations corrected using a first-order autoregressive [AR(1)] model). The task-related neural activity was modeled as a series of events convolved with a canonical hemodynamic response function. Contrasts against baseline were calculated for each condition, and participant-specific contrast images were used as inputs for the second-level analysis. We conducted a two-way repeated-measures ANOVA for the overall group analysis. Because the MONOm stimuli did not induce auditory motion perception, pairwise comparisons using only the localizable sound types (BAR, STEREO, and ITLD) were used to investigate the neural effects of step-like motion by comparing the moving sounds with the stationary sounds. The neural effects of room reverberation on auditory spatial processing were investigated by comparing the STEREO stimuli with the ITLD stimuli. In all analyses, the statistical threshold was set to p < .05, corrected using the false discovery rate (FDR) for height, and to p < .05, uncorrected, for extent, unless otherwise specified.
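
To illustrate the event-modeling step, the sketch below builds a single task regressor by convolving event onsets with a simplified double-gamma approximation of the canonical hemodynamic response function. This is a generic textbook approximation, not SPM8's exact implementation, and the onsets are illustrative:

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr_s, duration_s=32.0):
    """Simplified double-gamma HRF (response peak minus undershoot),
    sampled at the repetition time."""
    t = np.arange(0.0, duration_s, tr_s)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

tr_s = 3.0     # repetition time in seconds
n_scans = 148  # volumes per run after discarding the first two

# Hypothetical onsets (in seconds) for one condition within a run.
onsets_s = [12.0, 30.0, 57.0, 90.0, 141.0]

# Stick function at TR resolution, convolved with the HRF and truncated
# to the run length, as in a standard GLM design matrix column.
sticks = np.zeros(n_scans)
for onset in onsets_s:
    sticks[int(round(onset / tr_s))] = 1.0
regressor = np.convolve(sticks, canonical_hrf(tr_s))[:n_scans]
```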

Control Behavioral Experiments

Because we did not ask the participants to what degree the auditory stimuli sounded like footsteps during the fMRI experiments, we could not infer whether they perceived the auditory stimuli as footsteps (i.e., biological motion). Therefore, we performed two additional behavioral tests with another set of participants to investigate which sound type sounded the most like footsteps. Ten adults (six women; mean age = 30 years, range = 21–52 years) with normal hearing participated. For these control behavioral experiments, we used the same set of stimuli as was used for the fMRI experiments. The auditory stimuli were delivered via earphones (ER-4B microPro, Etymotic Research, Elk Grove Village, IL). We did not record individualized BAR stimuli for the behavioral tests; instead, each participant used one of the BAR stimulus sets recorded for the 13 fMRI participants. Therefore, before the behavioral tests, we chose a BAR stimulus set for each participant using a two-alternative forced-choice task. The selection procedure was as follows: For the auditory stimuli, we prepared trains of 12 sounds (0°, 30°, 60°, 90°, 120°, 150°, 180°, −150°, −120°, −90°, −60°, and −30°) from each BAR stimulus set. The intersound onset intervals were 300 msec, and the duration of each train was 3550 msec. Participants then heard two randomly chosen sound trains in succession and were instructed to indicate which one (i.e., the first or second sound) sounded more externalized by pressing one of the mouse buttons (left = “the first sound”; right = “the second sound”). Using a win-stay/lose-switch rule, the most appropriate set was chosen for each participant.
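
The win-stay/lose-switch selection can be sketched schematically as below. The exact pairing order is not described above, so the sequential tournament and the `prefers_first` judgment function are assumptions; in the actual experiment, the judgment came from the participant's button press:

```python
import random

def choose_bar_set(stimulus_sets, prefers_first):
    """Win-stay/lose-switch over a list of candidate BAR stimulus sets:
    keep the current set while it wins each 2AFC comparison, and switch
    to the challenger when it loses."""
    current = stimulus_sets[0]
    for challenger in stimulus_sets[1:]:
        if not prefers_first(current, challenger):
            current = challenger  # lose -> switch to the preferred set
        # win -> stay with the current set
    return current

# Usage with a stand-in random judgment in place of a listener's response.
candidate_sets = [f"BAR{i:02d}" for i in range(1, 14)]
chosen = choose_bar_set(candidate_sets, lambda a, b: random.random() < 0.5)
print(chosen)
```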

The two behavioral tests were as follows: The first test was a two-alternative forced-choice task using the four sound types (BAR, STEREO, ITLD, and MONO). Three sounds belonging to the same quadrant were presented as a train in either ascending or descending order (e.g., 0°, 30°, and 60° or 60°, 30°, and 0°). In each trial, participants heard two types of moving sounds in succession (e.g., BARm and STEREOm) and were instructed to indicate which one (i.e., the first or second sound) sounded more like footsteps by pressing one of the mouse buttons (left = “the first sound”; right = “the second sound”). We calculated the percentage of trials on which each sound type was chosen. The second test was a rating task. Participants heard one of the eight sound types (BARm, BARs, STEREOm, STEREOs, ITLDm, ITLDs, MONOm, and MONOs), and 11 boxes indicating percentages (0% to 100% in 10% increments) were displayed on a computer screen. They were instructed to indicate how similar the sound was to human footsteps by clicking on one of the boxes (0% = “not at all” to 100% = “exactly the same”). Data from the first test were analyzed using a one-way repeated-measures ANOVA, and data from the second test were analyzed using a two-way repeated-measures ANOVA.

RESULTS

Behavioral Data Analysis of the fMRI Experiments

Mean percentages of correct responses in the localization task during the fMRI experiment are shown in Table 1. The two-way repeated-measures ANOVA showed a significant main effect of Sound type, F(3, 36) = 64.58, p < .01. However, neither the main effect of Motion nor the interaction was significant. Post hoc tests using the Bonferroni correction indicated that participants identified the auditory quadrant of the stimulus more accurately for the spatially localizable stimuli (BAR, 54.0%; STEREO, 43.3%; ITLD, 43.2%) than for the unlocalizable MONO stimuli (20.8%). Moreover, accuracy for the externalized (BAR) stimuli was significantly higher than for the internalized (STEREO and ITLD) stimuli. Considering that front/back localization cues were available only for the BAR stimuli, better performance for the BAR stimuli was to be expected.

Table 1. 

Mean Percentages of Correct Responses in the Localization Task during the fMRI Experiment

             MONO            BAR             STEREO          ITLD
Moving       21.15 ± 2.25    55.65 ± 3.91    43.03 ± 2.53    44.47 ± 2.47
Stationary   20.43 ± 2.20    52.40 ± 2.98    43.39 ± 1.99    42.19 ± 2.09

Data Analysis of the Control Behavioral Experiments

For the first behavioral test, participants chose which of two sounds was more similar to footsteps, and we calculated the percentages of choosing each sound type. The mean percentages and SEMs for each sound type were as follows: BARm, 80.00 ± 5.26; STEREOm, 57.08 ± 2.31; ITLDm, 26.46 ± 8.40; and MONOm, 36.46 ± 5.25. A significant difference was found for sound type, F(3, 30) = 15.776, p < .01. Post hoc paired comparisons with Bonferroni correction failed to show a significant difference between ITLDm and MONOm but showed significant differences for all other comparisons.

For the second behavioral test, participants rated how similar each sound was to footsteps. The means and SEMs for each sound type were as follows: BARm, 70.28 ± 4.72; BARs, 45.57 ± 4.59; STEREOm, 67.22 ± 4.19; STEREOs, 50.23 ± 4.60; ITLDm, 47.44 ± 7.36; ITLDs, 32.56 ± 5.28; MONOm, 45.06 ± 6.27; and MONOs, 36.00 ± 5.94. Significant effects were found for Motion type, F(1, 9) = 8.997, p < .05, for Sound type, F(3, 27) = 6.952, p < .01, and for the interaction, F(3, 27) = 7.017, p < .01. Ratings were significantly higher for moving sounds than for stationary sounds, and the BAR and STEREO sounds were rated significantly higher than the MONO sounds. To better understand the interaction, we compared the moving-minus-stationary differences (i.e., the motion effect) for each sound type. Post hoc paired comparisons with Bonferroni correction showed significant motion effects for the localizable stimuli (i.e., BAR, STEREO, and ITLD) but not for the unlocalizable MONO stimuli. Moreover, the motion effect of the BAR stimuli was significantly larger than that of the STEREO and ITLD stimuli.

fMRI Data Analysis

The results of the two-way repeated-measures ANOVA showed a significant main effect of Sound type. Compared with the auditorily unlocalizable sound type (MONO), the localizable sound types (BAR, ITLD, and STEREO) activated the pSTG, pMTG, precuneus, caudate, superior frontal gyrus, anterior cingulate, and putamen (Table 2A). Moreover, the externalized sounds activated brain regions more than the internalized sounds: the BAR stimuli activated the pSTG and pMTG more than the STEREO stimuli (Table 2B) and activated the pSTG more than the ITLD stimuli (Table 2C). In contrast, the analysis failed to show a significant main effect of Motion. The main effect of Motion was likely not significant because the analysis included the MONO condition, which did not induce auditory motion perception; as we report below, we found significant differences in the moving-minus-stationary contrast when the MONO stimuli were excluded. Planned comparisons were then performed. First, to test the reproducibility of the results in Callan et al. (2013) concerning externalized and internalized sound source perception, we examined the contrast of the moving conditions (BARm-minus-STEREOm). Consistent with our previous result (Callan et al., 2013), significant differences were found in the pSTG and pMTG (Figure 2 and Table 2D). In contrast, the contrast of the stationary conditions (BARs-minus-STEREOs) showed differential activation in the pSTG but not in the pMTG (Figure 2 and Table 2E). These results support the interpretation that the differential pMTG activity observed in our previous study (Callan et al., 2013) was associated with the motion component of the stimuli.

Table 2. 

Results of fMRI Data Analyses

Region                        x     y     z      t

Main Effect of Sound Type
(A) BAR, STEREO, ITLD > MONO
  Superior temporal gyrus     64    −34   14     7.26
                              −60   −40   16     6.58
  Middle temporal gyrus       −38   −62   16     6.33
  Precuneus                   46    −66   24     5.88
                              −10   −46   54     4.67
  Anterior cingulate          12    30    14     4.24
  Putamen                     12    12    −2     4.31
                              −14   12    −8     4.04
(B) BAR > STEREO
  Superior temporal gyrus     64    −34   14     7.62
                              −64   −38   28     4.90
  Middle temporal gyrus       56    −60          4.77
(C) BAR > ITLD
  Superior temporal gyrus     −46   −28   10     5.57

Externalized > Internalized
(D) BARm > STEREOm
  Superior temporal gyrus     62    −34   14     5.50
                              −58   −32   10     4.92
  Middle temporal gyrus       54    −56          4.57
(E) BARs > STEREOs
  Superior temporal gyrus     64    −34   14     5.61

Moving > Stationary
(F) All localizable stimuli
  Superior temporal gyrus     64    −34   18     6.02
  Middle temporal gyrus       58    −52   14     5.08
                              −54   −54   12     5.25
  Precentral sulcus           48          40     4.59

MNI coordinates and t values of local maxima in activated clusters (FDR-corrected p < .05 for height and uncorrected p < .05 for extent).

Figure 2. 

Neural effects of sound externalization. Significant activation by the externalized-minus-internalized contrast is rendered on a high-resolution anatomical MRI brain template. Red corresponds to the results using moving sounds (i.e., BARm > STEREOm), green corresponds to the results using stationary sounds (i.e., BARs > STEREOs), and yellow corresponds to the overlap of the two results. pMTG activation was found only with moving sounds.

Second, the neural effects of the step-like motion were investigated by comparing the moving stimuli with the stationary stimuli. The contrast combining the localizable sound types (BAR, ITLD, and STEREO) showed significant differences in the right pSTG, the bilateral pSTS/pMTG regions, and the right premotor cortex (Figure 3A and Table 2F). We also determined the brain regions associated with the moving-minus-stationary contrast for each sound type separately. For these analyses, family-wise error (FWE) was controlled at the cluster level (p < .001 uncorrected for height and p < .05 FWE-corrected for extent). The BAR condition showed significant differences in the left pSTS/pMTG region (MNI coordinates: x = −56, y = −54, z = 12; k = 142), and the ITLD condition showed significant differences in the right pSTS/pMTG region (MNI coordinates: x = 56, y = −58, z = 10; k = 128). The STEREO condition failed to show significant differences.

Figure 3. 

Neural effects of step-like motion. (A) Brain regions activated to a greater extent by moving than stationary sounds. Significant activation is rendered on a high-resolution anatomical MRI brain template in hot scale. Blue corresponds to voxels used for ROI analyses. (B) Bar plots represent mean contrast estimates (mean ± SEM of left and right ROI spheres) for each condition.

To examine differences between the left and right pSTS/pMTG regions, we performed an ROI analysis. The data were extracted using spheres centered on the peak voxels in the left and right pSTS/pMTG regions. Only voxels significantly activated in the localizable moving-minus-stationary contrast (Figure 3A) were used. A relatively small 4-mm radius was chosen so that the left and right regions had the same volume (264 mm³). Mean beta values of the ROIs for each condition are plotted in Figure 3B. Using the extracted data, we performed a three-way (3 Sound types [BAR, STEREO, and ITLD] × 2 Motion types [moving and stationary] × 2 Hemispheres [left and right]) repeated-measures ANOVA. The results showed significant main effects of Sound type, F(2, 24) = 7.975, p < .01, and Motion type, F(1, 12) = 17.189, p < .01, and a significant three-way interaction, F(2, 24) = 7.235, p < .01. The main effect of Sound type indicated that the BAR stimuli activated the pSTS/pMTG regions significantly more than the STEREO and ITLD stimuli. The main effect of Motion type indicated that the moving sounds activated the pSTS/pMTG regions significantly more than the stationary sounds. The three-way interaction reflected the fact that a significant Sound type × Motion interaction was found only in the left region, F(2, 24) = 3.833, p < .05. In other words, the motion effects (i.e., differences between the moving and stationary conditions) of the three sound types differed significantly in the left pSTS/pMTG region but not in the right region. The BAR versus STEREO and ITLD comparison indicated a significant difference in the left pSTS/pMTG region, t(12) = 2.41, p < .016, but not in the right region.
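
The reported ROI volume can be verified by counting the 2-mm isotropic voxels whose centers fall within a 4-mm sphere around a peak voxel; a minimal sketch:

```python
import numpy as np

voxel_mm = 2.0   # isotropic voxel size after spatial normalization
radius_mm = 4.0  # ROI sphere radius

# Enumerate voxel-center offsets on a grid large enough to cover the sphere
# and keep those falling within the radius.
r = int(np.ceil(radius_mm / voxel_mm))
offsets = np.arange(-r, r + 1) * voxel_mm
x, y, z = np.meshgrid(offsets, offsets, offsets, indexing="ij")
inside = (x**2 + y**2 + z**2) <= radius_mm**2

n_voxels = int(inside.sum())
print(n_voxels, n_voxels * voxel_mm**3)  # 33 voxels -> 264 mm^3, as reported
```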

Lastly, the effects of room reverberation were analyzed by comparing the STEREO stimuli (with reverberation) with the ITLD stimuli (without reverberation). Neither the STEREO-minus-ITLD nor the ITLD-minus-STEREO contrast showed significant differences at the whole-brain level. However, a small volume correction (p < .05, FWE-corrected) using the bilateral superior temporal gyrus template (Tzourio-Mazoyer et al., 2002) showed that the ITLD-minus-STEREO contrast activated the right pSTG (MNI coordinates: x = 64, y = −32, z = 16).

DISCUSSION

Consistent with our hypothesis, the step-like motion of noise activated the pSTS/pMTG region (see Figure 3), which is known to be involved in biological motion perception (for a review, see Blake & Shiffrar, 2007). Comparisons between the moving and stationary stimuli revealed that moving sounds activated the pSTS/pMTG region to a greater extent than stationary sounds. In addition, we found hemispheric laterality in the motion effect: the motion effect of the externalized (BAR) sounds was significantly larger than that of the internalized (STEREO and ITLD) stimuli in the left pSTS/pMTG region but not in the right region. This is the first study to demonstrate that the auditory step-like motion of white noise activates brain regions involved in biological motion processing, analogous to visual PL stimuli activating biological motion processing regions.

Involvement of the pSTS/pMTG region in biological motion perception has been reported in both vision (for a review, see Blake & Shiffrar, 2007) and audition (Bidet-Caulet et al., 2005). The PL animation study (Johansson, 1973) indicated that biological motion perception can be evoked by simple visual motion signals lacking the visual characteristics of body shape, demonstrating the importance of spatiotemporal information in perceiving biological motion. In the current study, we found that the step-like motion of the white noise stimuli, which did not have the spectral characteristics of footsteps, activated the pSTS/pMTG region, which is known to be involved in biological motion processing (for a review, see Blake & Shiffrar, 2007). In addition to the pSTS/pMTG region, we found differential activity in the right pSTG and the right premotor cortex (BA 6). A voxel-based lesion study with unilateral stroke patients reported the involvement of superior temporal and premotor areas in biological motion perception (Saygin, 2007). The involvement of these areas in the current study further supports the view that the step-like motion of the white noise stimuli engaged biological motion processing.

Consistent with our previous study (Callan et al., 2013), the right pSTS/pMTG region was activated more by the externalized sounds than by the internalized sounds. All stimuli had the same temporal characteristics (300-msec ISOI); the differential activity was therefore caused by differences in their spatial information. Our results indicate that naturalistic auditory spatial information that conveys externalized sound source locations and room characteristics induces biological motion processing more than artificial spatial information does. The current study does not allow us to examine how temporal information influences auditory biological motion processing, because only one ISOI was used. A visual biological motion study reported that the spatial configuration of a PL stimulus was sufficient to discriminate the facing direction of a PL walker, but a fully intact spatiotemporal pattern was needed to discriminate the moving direction (Lange & Lappe, 2007). These results suggest that spatial information contributes more to visual biological motion processing than temporal information does. In auditory biological motion processing, we expect the opposite weighting because, in contrast to vision, audition is more specialized for temporal than for spatial processing. Additional work with various temporal patterns is needed to elucidate the roles of spatial and temporal information in auditory biological motion perception.

Because we did not ask the participants whether the auditory stimuli sounded like footsteps, we cannot conclude that they perceived the auditory stimuli as footsteps during the fMRI experiments. Results from the additional behavioral tests indicate that moving sounds sounded more like footsteps than stationary sounds and that externalized moving sounds (BARm) sounded more like footsteps than internalized moving sounds (STEREOm and ITLDm). However, there are several discrepancies between the fMRI and behavioral results. The largest discrepancy concerns the responses to the ITLD stimuli. The fMRI results indicated strong motion effects for the ITLD stimuli (Figure 3B), but the behavioral results indicated that the ITLD stimuli sounded less like footsteps than the BAR and STEREO stimuli. We consider this discrepancy to be caused by the strategies that participants used to accomplish the behavioral tests. Both behavioral tasks were difficult to perform because the auditory stimuli did not have the spectral characteristics of footsteps. Nevertheless, participants could perform the tasks when asked to do so. Onken, Hastie, and Revelle (1985) reported individual differences in the use of strategies in a complex decision-making task, showing that some individuals often use simplification strategies to reduce the cognitive demands of the task. In our study, we conjecture that several participants used the reverberation levels as cues for making their decisions. This idea is supported by the results of the two-alternative forced-choice task, in which 6 of 10 participants chose the MONO (with reverberation) stimuli more often than the ITLD (without reverberation) stimuli even though the MONO stimuli provided no motion information. Therefore, we assume that performance on the behavioral tests was heavily influenced by cognitive (top–down) processing associated with these strategies. Studies have shown that both bottom–up and top–down processes contribute to successful processing of biological motion (Blake & Shiffrar, 2007). However, biological motion can be processed in the absence of top–down effects, as demonstrated by a flanker-interference study in which ignored PL walkers were processed at a level sufficient to affect the perception of attended PL walkers (Thornton & Vuong, 2004). The fMRI results in this study may reflect biological motion processing that was less influenced by top–down mechanisms.

It is possible that the enhanced activity observed in the pSTS/pMTG region was caused by general auditory motion perception and not by biological motion perception, as we hypothesized. Many human neuroimaging studies have reported the involvement of the planum temporale (PT) in auditory motion perception (Krumbholz et al., 2005; Pavani, Macaluso, Warren, Driver, & Griffiths, 2002; Warren, Zielinski, Green, Rauschecker, & Griffiths, 2002; Lewis, Beauchamp, & DeYoe, 2000; Baumgart, Gaschler-Markefski, Woldorff, Heinze, & Scheich, 1999). None of these studies reported the involvement of the pSTS/pMTG region in auditory motion perception. Moreover, differential activity in the PT was not found in studies contrasting spatially varying stationary stimuli with moving stimuli (Smith, Saberi, & Hickok, 2007; Smith, Okada, Saberi, & Hickok, 2004). Smith et al. (2007) proposed that the enhanced activity of the PT did not reflect motion-specific processing but spatial processing for spatially varying sound sources. Thus, functional neuroimaging studies have provided no clear evidence for the existence of a distinct neural mechanism for auditory motion perception in humans. On the basis of these studies and in the absence of contrary evidence, we conclude that the bilateral pSTS/pMTG region activation observed in this study likely reflects neural processing of biological motion perception, rather than neural processing of general auditory motion perception.

Although the combined (BAR, ITLD, and STEREO) analysis showed significant differences in the bilateral pSTS/pMTG regions, laterality in these regions was suggested by the separate whole-brain analyses for each sound type. The BAR condition showed a significant motion effect (moving over stationary) in the left pSTS/pMTG region, whereas the ITLD condition showed a significant motion effect in the right pSTS/pMTG region. To examine differences between the left and right pSTS/pMTG regions, we compared activity differences among the three localizable sound types using ROI analyses. The results indicated that the motion effects of the three sound types differed significantly in the left region but not in the right region. On the basis of these results, we propose that both the left and right pSTS/pMTG regions are involved in biological motion perception but that their functions differ somewhat. The left pSTS/pMTG region may be involved in utilizing monaural spatial cues (i.e., HRTF filter characteristics) to process externalized acoustic properties relevant to biological motion, whereas the right region may be involved in processing the binaural spatial cues (i.e., ITD and ILD) relevant to biological motion. Support for hemispheric differences in biological motion processing comes from a neuroanatomical study reporting that the gray matter volume of only the left pSTS predicted individual sensitivity to biological motion (Gilaie-Dotan, Kanai, Bahrami, Rees, & Saygin, 2013).

We also tested the effects of room reverberation on auditory spatial processing. Both binaural (ITD and ILD) and monaural (HRTF) cues are necessary to experience a sound as localized outside the head rather than originating from inside the head (i.e., sound externalization; Hartmann & Wittenberg, 1996; Plenge, 1974). In addition to those cues, we perceive room reverberation in our daily lives. Room reverberation distorts the spatial cues used for sound localization, but it aids distance judgments, provides information about room characteristics, and improves the subjective realism and externalization achieved in headphone simulation (Shinn-Cunningham et al., 2005). In our previous study, we found pSTG to be involved in sound externalization (Callan et al., 2013; see Figure 2 for a replication of our external vs. internal comparison). However, we could not differentiate the neural correlates of room reverberation processing from those of HRTF processing because the BAR stimuli included both HRTF and room reverberation cues. Therefore, in the current study, we investigated the neural effects of room reverberation by comparing the STEREO stimuli, which have reverberation but no HRTF, with the ITLD stimuli, which lack both reverberation and HRTF. We expected that the pSTG would be activated more by the STEREO stimuli than by the ITLD stimuli because room reverberation can improve externalization. However, the STEREO-minus-ITLD contrast did not show any significant differences. On the contrary, our results showed that the right pSTG was activated less by the STEREO stimuli than by the ITLD stimuli. This relative suppression of pSTG activity may reflect the degradation of binaural-cue (ITD and ILD) based localization by room reverberation. Such a suppressive effect of reverberation has also been reported in anesthetized cats in response to auditory stimuli that included only ITD cues with reverberation, albeit in the auditory midbrain (inferior colliculus; Devore, Ihlefeld, Hancock, Shinn-Cunningham, & Delgutte, 2009; the pSTG was not investigated in their study).

Conclusions

Our results showed that white noise presented with a step-like motion activates the pSTS/pMTG region. The temporal spacing between the stimuli, which resembles that of the sound of a running human, evoked neural processing typical of biological motion perception. These findings suggest that presenting white noise with a step-like motion is sufficient to activate regions associated with biological motion perception, even though the spectral characteristics of white noise do not resemble those of a footstep. Both the left and right pSTS/pMTG regions were activated to a greater extent by the moving than by the stationary stimuli. However, the different activation patterns for the different sound types imply functional differences between the left and right regions. We propose that the left pSTS/pMTG region processes acoustic properties associated with the monaural (HRTF) cue and that the right region processes acoustic properties associated with the binaural (ITD and ILD) cues. Further research is needed to test these hypotheses. Just as PL displays activate biological motion processing regions of the brain, our study demonstrates that simple acoustic stimuli with appropriate temporal spacing activate the biological motion processing regions of the brain.

Reprint requests should be sent to Akiko Callan, National Institute of Information and Communications Technology, 1-4 Yamadaoka, Suita City, Osaka, 565-0871, Japan, or via e-mail: acallan@nict.go.jp.

REFERENCES

Ahveninen, J., Kopco, N., & Jaaskelainen, I. P. (2014). Psychophysics and neuronal bases of sound localization in humans. Hearing Research, 307, 86–97.

Allison, T., Puce, A., & McCarthy, G. (2000). Social perception from visual cues: Role of the STS region. Trends in Cognitive Sciences, 4, 267–278.

Baumgart, F., Gaschler-Markefski, B., Woldorff, M. G., Heinze, H. J., & Scheich, H. (1999). A movement-sensitive area in auditory cortex. Nature, 400, 724–726.

Bidet-Caulet, A., Voisin, J., Bertrand, O., & Fonlupt, P. (2005). Listening to a walking human activates the temporal biological motion area. Neuroimage, 28, 132–139.

Blake, R., & Shiffrar, M. (2007). Perception of human motion. Annual Review of Psychology, 58, 47–73.

Blauert, J. (1997). Spatial hearing (Rev. ed.). Cambridge, MA: MIT Press.

Callan, A., Callan, D. E., & Ando, H. (2013). Neural correlates of sound externalization. Neuroimage, 66C, 22–27.

Cottrell, D., & Campbell, M. E. (2014). Auditory perception of a human walker. Perception, 43, 1225–1238.

Devore, S., Ihlefeld, A., Hancock, K., Shinn-Cunningham, B., & Delgutte, B. (2009). Accurate sound localization in reverberant environments is mediated by robust encoding of spatial cues in the auditory midbrain. Neuron, 62, 123–134.

Gilaie-Dotan, S., Kanai, R., Bahrami, B., Rees, G., & Saygin, A. P. (2013). Neuroanatomical correlates of biological motion detection. Neuropsychologia, 51, 457–463.

Hartmann, W. M., & Wittenberg, A. (1996). On the externalization of sound images. Journal of the Acoustical Society of America, 99, 3678–3688.

Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14, 201–211.

Kayahara, T., & Abe, H. (2011). Synthesis of footstep sounds of crowd from single step sound based on cognitive property of footstep sounds. Proceedings of the IEEE International Symposium on VR Innovation (ISVRI), Singapore, pp. 245–249.

Krumbholz, K., Schonwiesner, M., Rubsamen, R., Zilles, K., Fink, G. R., & von Cramon, D. Y. (2005). Hierarchical processing of sound location and motion in the human brainstem and planum temporale. European Journal of Neuroscience, 21, 230–238.

Lange, J., & Lappe, M. (2007). The role of spatial and temporal information in biological motion perception. Advances in Cognitive Psychology, 3, 419–428.

Lewis, J. W., Beauchamp, M. S., & DeYoe, E. A. (2000). A comparison of visual and auditory motion processing in human cerebral cortex. Cerebral Cortex, 10, 873–888.

Li, X. F., Logan, R. J., & Pastore, R. E. (1991). Perception of acoustic source characteristics: Walking sounds. Journal of the Acoustical Society of America, 90, 3036–3049.

Mershon, D. H., & Bowers, J. N. (1979). Absolute and relative cues for the auditory perception of egocentric distance. Perception, 8, 311–322.

Onken, J., Hastie, R., & Revelle, W. (1985). Individual differences in the use of simplification strategies in a complex decision-making task. Journal of Experimental Psychology: Human Perception and Performance, 11, 14–27.

Pastore, R. E., Flint, J. D., Gaston, J. R., & Solomon, M. J. (2008). Auditory event perception: The source-perception loop for posture in human gait. Perception & Psychophysics, 70, 13–29.

Pavani, F., Macaluso, E., Warren, J. D., Driver, J., & Griffiths, T. D. (2002). A common cortical substrate activated by horizontal and vertical sound movement in the human brain. Current Biology, 12, 1584–1590.

Plenge, G. (1974). On the differences between localization and lateralization. Journal of the Acoustical Society of America, 56, 944–951.

Saygin, A. P. (2007). Superior temporal and premotor brain areas necessary for biological motion perception. Brain, 130, 2452–2461.

Shinn-Cunningham, B. G., Kopco, N., & Martin, T. J. (2005). Localizing nearby sound sources in a classroom: Binaural room impulse responses. Journal of the Acoustical Society of America, 117, 3100–3115.

Smith, K. R., Okada, K., Saberi, K., & Hickok, G. (2004). Human cortical auditory motion areas are not motion selective. NeuroReport, 15, 1523–1526.

Smith, K. R., Saberi, K., & Hickok, G. (2007). An event-related fMRI study of auditory motion perception: No evidence for a specialized cortical system. Brain Research, 1150, 94–99.

Thornton, I. M., & Vuong, Q. C. (2004). Incidental processing of biological motion. Current Biology, 14, 1084–1089.

Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., et al. (2002). Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. Neuroimage, 15, 273–289.

Warren, J. D., Zielinski, B. A., Green, G. G., Rauschecker, J. P., & Griffiths, T. D. (2002). Perception of sound-source motion by the human brain. Neuron, 34, 139–148.

Young, W., Rodger, M., & Craig, C. M. (2013). Perceiving and reenacting spatiotemporal characteristics of walking sounds. Journal of Experimental Psychology: Human Perception and Performance, 39, 464–476.

Zimmer, U., & Macaluso, E. (2005). High binaural coherence determines successful sound localization and increased activity in posterior auditory areas. Neuron, 47, 893–905.