Ideomotor theory claims that actions are cognitively represented and accessed via representations of the sensory effects they evoke. Previous studies provide support for this claim by showing that the presentation of action effects primes activation in corresponding motor structures. However, whether people actually use action-effect representations to control their motor behavior is not yet clear. In our fMRI study, we had participants prepare for manual or facial actions on a trial-by-trial basis, and hypothesized that preparation would be mediated by the cortical areas that code for the perceptual effects of these actions. Preparing for manual action induced higher activation of hand-related areas of motor cortex (demonstrating actual preparation) and of the extrastriate body area, which is known to mediate the perception of body parts. In contrast, preparing for facial action induced higher activation of face-related motor areas and of the fusiform face area, known to mediate face perception. These observations provide further support for the ideomotor theory and suggest that visual imagery might play a role in voluntary action control.
While acquiring new dance moves in dancing school, we tend to prepare what to do with our feet by evoking visual images of how the feet should be positioned on the ground after each step and how they should move from there. In fact, it is common practice in dancing schools to use foot stamp pictures to visualize, and help students imagine, the goal of each novel step movement. This is an everyday example of what has been described by ideomotor theory, first developed in the late 19th century (James, 1890/1981; Münsterberg, 1888; Harless, 1861; Lotze, 1852). The core assumption of ideomotor theory holds that functional anticipations of action effects play a crucial role in action control. Lotze (1852), for instance, stated that the image of the intended event (“Vorstellung des Gewollten,” p. 301) serves as a kind of retrieval cue and mental access point for the movement itself, that is, for the executing motor program. Because we have no direct control over the particularities of our actions in terms of specific muscle fibers, the theory assumes that we use representations of the intended outcome of actions (in our example, the visual image of the optimal foot positions on the ground) to control and guide our behavior.
The general idea that imagery plays a role in action control is widespread in the literature (Jeannerod, 1999; Rizzolatti, Fogassi, & Gallese, 1997), and recent work has revived and elaborated on ideomotor theory in order to make specific predictions that can be applied, among other things, to neuroimaging (Elsner & Hommel, 2001; Hommel, Müsseler, Aschersleben, & Prinz, 2001; Prinz, 1997; Hommel, 1996). The assumption is that sensory systems and motor systems are connected by means of a common coding system that represents perceived and produced events in terms of sensorimotor units or “event files” (Hommel, 1998). Due to frequent co-occurrence of actions and their perceptual consequences (e.g., visual impressions of the moving hand, the motion of the key it operates, and the light bulb being turned on by that action), codes of actions and their sensory effects are bound together. As the emerging association is assumed to be bidirectional, reactivating the sensory codes representing the light flash (say, by actively imagining the bright lamp) spreads activation to the associated motor program, which thereby comes under voluntary control.
Evidence supporting this scenario is available from a number of recent neuroimaging studies. A first positron emission tomography study by Elsner et al. (2002) had participants carry out left and right keypresses that produced high- and low-pitched sounds. Thereafter, participants performed a passive sound-detection task in the scanner, during which they were presented with various ratios of neutral and previously acquired action-effect tones. The relative frequency of action-effect tones had a selective effect on the activation of the caudal supplementary motor area and the right hippocampus, suggesting that the latter links sensory representations of action consequences to the corresponding motor programs organized by the former. These observations were confirmed and extended in a recent fMRI study by Melcher, Weidema, Eenhuistra, Hommel, and Gruber (2008). Further support comes from a study that used fMRI to compare tone-induced brain activation in professional pianists and nonmusicians (Bangert et al., 2006). As compared to nonmusicians, the musicians showed increased activation of a cortical sensorimotor network even in a passive listening task, suggesting that the activation of tone representations spreads to apparently associated motor structures. Similar observations have been made in professional dancers, in whom dance-specific motor structures were activated when they watched dance movies and had to evaluate how tiring those dances were (Calvo-Merino, Glaser, Grezes, Passingham, & Haggard, 2005; for a broader overview, see Molnar-Szakacs & Overy, 2006), and in nonmusicians after modest amounts of piano training (McNamara et al., 2008).
Although the findings of these studies are encouraging, they support ideomotor theory in rather indirect ways. On one hand, they do suggest the existence of bindings between motor structures and sensory representations of action consequences, and they support the idea that these bindings are bidirectional and can thus propagate activation in either direction: from motor to sensory, as when experiencing the relationship between actions and their consequences, and from sensory to motor, as in the mentioned studies. On the other hand, however, these observations do not necessarily imply that people actually use representations of action consequences for controlling their actions. Thus, even though action-effect associations may provide a retrieval and access cue to motor programs, people may retrieve and access motor programs along different routes.
To assess this possibility, in the present study we did not present action effects in order to prime motor structures but, instead, had people prepare for actions that would lead to different kinds of sensory consequences, and tested whether this would activate the neural structures representing these consequences. In particular, we compared the neural activation involved in preparing for a manual action to the activation involved in preparing for a facial action. We reasoned that at least some of the sensory consequences of manual actions would be processed by the extrastriate body area (EBA), located in the posterior inferior temporal sulcus. EBA was originally assumed to be dedicated to the processing of visual information about other people's nonfacial body parts (Urgesi, Berlucchi, & Aglioti, 2004; Downing, Jiang, Shuman, & Kanwisher, 2001). Lately, it has been shown that the fusiform body area (FBA) is more responsive to the visual appearance of the whole body, whereas EBA is responsive to parts of the body (Taylor, Wiggett, & Downing, 2007). Moreover, the recent observation of EBA activity during the execution of self-performed limb motor actions (Astafiev, Stanley, Shulman, & Corbetta, 2004) suggests that EBA also represents one's own body parts, whereas a recent study refined this finding by showing that activity in EBA does not differentiate between the recognition of familiar and unfamiliar bodies (Hodzic, Muckli, Singer, & Stirn, 2009). Given that manual action results, among other things, in the self-perception of one's body parts, it thus makes sense to assume that the EBA codes for some of the previously experienced and now to be expected effects of manual actions. If so, activating a manual action in the process of preparing it for later execution should be mediated by EBA activity.
Along the same lines, we speculated that at least some of the sensory consequences of facial actions would be processed by fusiform face area (FFA) located on the lateral fusiform gyrus. This area was originally assumed to mediate the perception of other people's faces (Kanwisher, McDermott, & Chun, 1997), but it may just as well code for one's own face, and thus, mediate the perception of the sensory consequences of moving it. One might argue that people are much more familiar with kinesthetic than visual feedback from their own actions, so that kinesthetic representations of action effects should be more strongly involved in facial action control than visual representations. And yet, there is evidence of particularly strong (Heyes, 2001), presumably even prenatal (Meltzoff & Moore, 1977, 1997) intermodal connections between kinesthetic, visual, and motor representations of facial movements, which render it likely that preparing a facial action involves visual codes in FFA as well.
To summarize, we expected that preparing for manual or facial action would be mediated by the brain areas that code for the perceptual consequences of these actions, such as EBA and FFA, respectively. We thus had participants carry out a binary-choice task, in which they carried out one of two possible manual actions (a keypress with the right or left hand) or one of two facial actions (a smiling or a kissing movement) to the (red or green) color of a visual target stimulus (see Figure 1). The mapping of colors to the individual action alternatives was constant across the whole experiment, but whether a manual action or a facial action was required varied randomly from trial to trial. To signal the type of action for the upcoming trial, participants were presented with a cue (the letter X or O) that indicated the action modality. We assumed that participants would use the cue to prepare for manual or facial action, and we considered the preparation to take place between the cue presentation and the presentation of the color stimulus. Given that we were interested only in the anticipatory activation of EBA and FFA (rather than the on-line processing of body or face feedback during the execution of the action), we restricted our analysis to this preparation period.
Previous studies suggest that people prepare for the selection of an action from a small response set by preactivating all alternative actions and keeping them concurrently active until eventual response selection (Lépine, Glencross, & Requin, 1989; Heuer, 1986), but use a more piecemeal action-planning process when dealing with larger response sets (Rosenbaum, 1980). To allow for concurrent preparation, we therefore employed very small response sets of two alternatives per action type. To preactivate the two alternatives belonging to the precued type of action, participants would thus be likely to access and prime the corresponding action representations (in turn or concurrently) during the analyzed interval between action cue and color target. According to ideomotor theory, this access would be mediated by the perceptual representations of the action effects, which should increase the activation in EBA with manual preparation and in FFA for facial preparation.
We recruited 12 healthy volunteers (10 women and 2 men; age: mean = 22.3 years, range = 18–27 years) from whom we obtained written consent prior to the scanning session. The study was approved by the local ethical committee. All subjects had normal or corrected-to-normal vision. No subject had a history of neurological, major medical, or psychiatric disorder. All participants reported being right-handed. We had to exclude one participant due to a technical failure during the localizer task.
The experimental procedure consisted of a paradigm comprising a cue and a target stimulus (Figure 1A). The letter cue (an X or O) was presented for 1000 msec; one letter instructed participants to respond manually to the following target, whereas the other letter indicated a facial action (see Figure 1B for one of the possible mappings). After a blank screen of 2000 to 5000 msec duration (which varied in steps of 500 msec), the target that specified the action alternative appeared for 1000 msec. Trial-to-trial transitions and cue-to-target transitions were counterbalanced across the experiment. The two possible manual actions were button presses with the right or left index finger. The possible face actions were either uncompressing the lips into a broad smile and raising both eyebrows, or compressing the lips into a kiss and lowering the eyebrows (Figure 1B). Both types of actions were speeded, as fast responding was encouraged, but because facial responses were difficult to measure, we recorded reaction times for the manual actions only. The cue–task and target–response mappings were balanced across participants. The intertrial interval consisted of a variable oversampling interval between 2000 and 5000 msec to obtain an interpolated temporal resolution of 500 msec. The experiment consisted of 80 trials and lasted 12 min.
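The jittered timing described above can be illustrated with a short sketch. The interval values and trial count come from the text and the TR from the acquisition parameters; the function names, the seed, and the balancing rule are hypothetical, since the original randomization procedure (including how transitions were counterbalanced) is not specified. Because 500 msec is not a divisor of the TR, events land at different positions within a volume acquisition, which is the idea behind oversampling.

```python
import random

# From the text: cue-target blank screen of 2000-5000 ms in 500-ms steps.
CTI_VALUES_MS = list(range(2000, 5001, 500))


def make_cti_schedule(n_trials, seed=0):
    """Return a shuffled list of cue-target intervals (hypothetical helper).

    Each interval is used equally often; leftover trials are filled with
    distinct randomly chosen intervals, so counts differ by at most one.
    This balances frequencies only, not trial-to-trial transitions.
    """
    rng = random.Random(seed)
    reps, remainder = divmod(n_trials, len(CTI_VALUES_MS))
    schedule = CTI_VALUES_MS * reps + rng.sample(CTI_VALUES_MS, remainder)
    rng.shuffle(schedule)
    return schedule


def onset_phase_ms(onset_ms, tr_ms=2211):
    """Position of an event within the current volume acquisition, in ms."""
    return onset_ms % tr_ms


schedule = make_cti_schedule(80)  # 80 trials, as in the experimental run
```

Calling `onset_phase_ms` on successive event onsets shows how the jitter spreads events across the TR, so that the response can be sampled more finely than once per volume.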
After the experimental session, we employed a localizer task during which participants had to passively view pictures of hands, faces, and their own face (Figure 2). We used eight different black-and-white photographs of male and female faces as well as eight different black-and-white photographs of hands. For the own-face condition, we asked participants to bring a photograph of themselves. All images were adjusted to ensure the same average luminance. The intertrial interval varied as above in 500-msec steps between 2000 and 5000 msec. The localizer consisted of 120 trials and lasted 9 min.
Images were collected with a 3-T Philips Achieva MRI scanner system (Philips Medical Systems, Best, The Netherlands). First, high-resolution anatomical images were acquired using a 3-D T1-weighted sequence (voxel size = 0.88 × 0.88 × 1.2 mm³). Whole-brain functional images were collected using a T2*-weighted SENSE parallel EPI sequence sensitive to BOLD contrast (TR = 2211 msec, TE = 30 msec, image matrix = 80 × 80, FOV = 220 mm, flip angle = 80°, voxel size = 2.75 × 2.75 × 2.75 mm³, 38 axial slices). Three hundred fifty-five image volumes were acquired for the experimental run and 255 for the localizer run, all aligned to the anterior and posterior commissures.
fMRI Data Preprocessing and Main Analysis
The fMRI data were analyzed using SPM5 software (Wellcome Department of Cognitive Neurology, London, UK). The first four volumes of all EPI series were excluded from the analysis to allow the magnetization to approach a dynamic equilibrium. Data processing started with slice time correction and realignment of the EPI datasets. A mean image for all EPI volumes was created, to which individual volumes were spatially realigned by means of rigid-body transformations. The high-resolution structural image was coregistered with the mean image of the EPI series. Then the structural image was normalized to the Montreal Neurological Institute template, and the normalization parameters were applied to the EPI images to ensure an anatomically informed normalization. During normalization, the anatomy image volumes were interpolated to 1 × 1 × 1 mm³. The EPI scans were smoothed with a commonly applied spatial filter of 8 mm FWHM. Low-frequency drifts in the time domain were removed by modeling the time series for each voxel using a set of discrete cosine functions to which a cutoff of 128 sec was applied. Subject-level statistical analyses were performed using the general linear model. We modeled face and hand preparation cues as well as the face and hand targets. The main events of interest were the periods of hand and face preparation in the experimental run and the hand- and face-related activity in the localizer run. Vectors containing the event onsets were convolved with the canonical hemodynamic response function to form the main regressors in the design matrix (the regression model). Temporal derivatives of these regressors, as well as regressors accounting for variance associated with head motion, were also entered into the model. The statistical parameter estimates were computed separately for each voxel for all columns in the design matrix.
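To make the construction of the design matrix more concrete, the following is a minimal sketch of two of its ingredients: an event regressor built from onsets and a hemodynamic response function, and a set of low-frequency cosine regressors for a 128-sec cutoff. It is not the SPM5 implementation. The double-gamma parameters, the rule for the number of cosine functions, the function names, and the example onsets are all assumptions chosen for illustration; only the TR, the cutoff, and the number of retained volumes (355 minus the 4 discarded) come from the text.

```python
import math

TR_S = 2.211       # repetition time from the acquisition parameters
CUTOFF_S = 128.0   # high-pass cutoff from the text


def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1.0 / 6.0):
    """Double-gamma response: early peak minus a scaled late undershoot.

    The default parameters are commonly used values, assumed here.
    """
    return gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, undershoot_shape)


def event_regressor(onsets_s, n_scans, tr_s=TR_S):
    """Sum of HRFs time-locked to each onset, sampled once per volume."""
    return [
        sum(double_gamma_hrf(scan * tr_s - onset) for onset in onsets_s)
        for scan in range(n_scans)
    ]


def dct_drift_basis(n_scans, tr_s=TR_S, cutoff_s=CUTOFF_S):
    """Low-frequency discrete cosine regressors (constant term omitted)."""
    # Assumed convention: keep cosines whose period exceeds the cutoff.
    n_basis = int(2.0 * n_scans * tr_s / cutoff_s) + 1  # count includes constant
    return [
        [
            math.sqrt(2.0 / n_scans)
            * math.cos(math.pi * (2 * scan + 1) * k / (2 * n_scans))
            for scan in range(n_scans)
        ]
        for k in range(1, n_basis)
    ]


# Hypothetical cue onsets in seconds; not taken from the study.
regressor = event_regressor([10.0, 40.0], n_scans=351)
drift = dct_drift_basis(n_scans=351)
```

Entering the cosine regressors into the model alongside the event regressors removes slow signal drifts while leaving faster, event-related fluctuations to be explained by the regressors of interest.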
Contrast images were constructed for each individual to compare the relevant parameter estimates for the regressors containing the canonical hemodynamic response function. A group-level random effects analysis was then performed using one-sample t tests for each voxel of the contrast images. The resulting statistical values were thresholded at p < .001 (z > 3.09, uncorrected), with a volume greater than 350 mm³ (10 adjacent voxels).
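The correspondence between the p and z thresholds, and the form of the voxelwise group statistic, can be checked with a few lines of standard-library Python. This is a sketch only: the function names are hypothetical, and the contrast values are toy numbers rather than data from the study.

```python
import math
from statistics import NormalDist, mean, stdev


def z_threshold(p_one_tailed):
    """z value whose upper-tail probability equals p_one_tailed."""
    return NormalDist().inv_cdf(1.0 - p_one_tailed)


def one_sample_t(contrast_values):
    """t statistic testing whether the mean contrast differs from zero."""
    n = len(contrast_values)
    return mean(contrast_values) / (stdev(contrast_values) / math.sqrt(n))


z_crit = z_threshold(0.001)  # approximately 3.09

# Hypothetical per-participant contrast estimates at one voxel (toy numbers).
toy_contrasts = [1.0, 2.0, 3.0]
t_value = one_sample_t(toy_contrasts)
```

In a random effects analysis of this kind, each participant contributes one contrast value per voxel, so the degrees of freedom of the t test are the number of participants minus one.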
For the signal change analysis, we defined ROIs consisting of the peak voxels and a surrounding sphere with a radius of 6 mm for each participant individually.¹ Two ROIs were defined in bilateral EBA resulting from the whole-brain contrast of the hand localizer > baseline of each participant. We defined the EBA ROIs as the peak voxels in lateral occipito-temporal cortex of each participant based on coordinates of previous studies (Peelen & Downing, 2005; Downing et al., 2001) [left EBA: −49 −72 −2 (mean SD = 5); right EBA: 45 −75 −1 (mean SD = 4)]. Furthermore, we defined an ROI in the bilateral FFA resulting from the whole-brain contrast of the face localizer > baseline of each subject. We defined the FFA ROIs of each participant as the peak voxels in the fusiform gyrus based on coordinates of previous studies (Haxby et al., 1999; Kanwisher et al., 1997) [left FFA: −41 −59 −18 (mean SD = 5); right FFA: 39 −59 −19 (mean SD = 4)] (Figure 2). All individual contrasts were thresholded at p < .001 (z > 3.09, uncorrected), with a volume greater than 175 mm³ (5 adjacent voxels). Additionally, we used the target phase to obtain ROIs in hand motor cortex (hand MC; hand target > face target; left hand motor cortex: −39 −28 58; right hand motor cortex: 34 −28 61; BA 4a according to Eickhoff et al., 2005) and face motor cortex (face MC; face target > hand target; left face motor cortex: −55 −9 47; right face motor cortex: 55 −8 47; BA 1 according to Eickhoff et al., 2005; BA 4 using Talairach Daemon, Lancaster et al., 2000) on the group level. For each subject, region, and condition, the mean percent signal change of hand and face preparation over a time window of 6–8 sec after cue onset was calculated separately and compared by means of 2 × 2 × 2 repeated measures ANOVA with the factors ROI, ROI side (left vs. right), and preparation (face vs. hand).
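As an illustration of the ROI construction and the dependent measure, the sketch below selects the voxels whose centers fall within 6 mm of a peak and computes a percent signal change for a time window. It rests on several assumptions that are not stated in the text: that the sphere is built on the native 2.75-mm isotropic grid (the analysis may have used a different resampled voxel size), that percent signal change is expressed relative to a single baseline value, and that the signal values shown are arbitrary illustrative numbers. The function names are hypothetical.

```python
import math
from statistics import mean

VOXEL_MM = 2.75   # isotropic functional voxel size from the acquisition section
RADIUS_MM = 6.0   # sphere radius from the text


def sphere_offsets(radius_mm=RADIUS_MM, voxel_mm=VOXEL_MM):
    """Voxel index offsets whose centers lie within radius_mm of the peak."""
    reach = int(radius_mm // voxel_mm)
    offsets = []
    for i in range(-reach, reach + 1):
        for j in range(-reach, reach + 1):
            for k in range(-reach, reach + 1):
                if math.sqrt(i * i + j * j + k * k) * voxel_mm <= radius_mm:
                    offsets.append((i, j, k))
    return offsets


def percent_signal_change(window_values, baseline):
    """Mean signal in a time window, as a percentage change from baseline."""
    return 100.0 * (mean(window_values) - baseline) / baseline


roi = sphere_offsets()
# Hypothetical ROI-averaged signal 6-8 sec after cue onset, baseline of 100 units.
psc = percent_signal_change([101.0, 102.0], baseline=100.0)
```

Adding each offset in `roi` to a participant's peak voxel index gives the voxels over which the signal would be averaged before the percent signal change is computed.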
A repeated measures ANOVA on manual reaction times revealed a significant main effect of cue–target interval [CTI 2000: 597 msec; CTI 2500: 558 msec; CTI 3000: 543 msec; CTI 3500: 519 msec; CTI 4000: 516 msec; CTI 4500: 512 msec; CTI 5000: 517 msec; F(1, 10) = 2.44, p < .05], indicating that longer CTIs allow for more extensive response preparation.
In order to explore the neural correlates of preparation in specific brain areas, we focused on specific hand and face action preparation in hand and face motor cortex first. A percent signal change analysis on bilateral motor cortex (MC) ROIs taken from the target phase (a sphere with a radius of 6 mm around the peak voxel of a group-level analysis in face and hand MC) showed a significant interaction of ROI location (hand MC vs. face MC) and preparation (hand vs. face) [Figure 3; F(1, 10) = 17.13, p < .01]. Post hoc paired t tests revealed significant differences between hand and face preparation in hand motor cortex [t(10) = 4.25, p < .01] and a marginally significant difference between hand and face preparation in face motor cortex [t(10) = 2.18, p = .054]. The lower reliability of the latter difference could result from a greater variability in face motor cortex activation between participants due to less training in responding with the face and due to increased head movement during facial responding. Overall, the result indicates that the action-modality cues (in particular, the hand cue) were used to actively prepare for the type of action afforded in the upcoming trial, even though the individual action alternative was not yet known.
Based on the ideomotor theory, we predicted that action preparation is mediated by activating perceptual representations of the associated action effects, which we assume to be coded in EBA for manual movements and FFA for facial movements. Indeed, a percent signal change analysis on brain areas involved in the perception of hands and faces during the localizer task showed a significant interaction of ROI location (EBA vs. FFA) and prepared action modality (hand vs. face) [Figure 4; F(1, 10) = 36.37, p < .001]. Post hoc paired t tests revealed significant differences between face and hand preparation in FFA [t(10) = 6.37, p < .001] as well as in EBA [t(10) = −3.85, p < .01]. A similar result was obtained for ROIs based on the contrast own face > baseline [F(1, 10) = 25.23, p < .01]. This interaction clearly supports the prediction of ideomotor theory that the perceptual representation of the effectors plays a role in the preparation of effector-specific movements.
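The logic of the ROI × preparation interaction and its post hoc tests can be sketched as follows. In a 2 × 2 within-subject design, the interaction is equivalent to a one-sample t test on each participant's difference of differences, with F(1, n − 1) = t². The sketch assumes that percent signal change values have already been averaged across hemispheres. The function names are hypothetical and the numbers are toy values for four imaginary participants, not data from the study.

```python
import math
from statistics import mean, stdev


def one_sample_t(values):
    """t statistic for the mean of values against zero."""
    n = len(values)
    return mean(values) / (stdev(values) / math.sqrt(n))


def paired_t(cond_a, cond_b):
    """Paired t test expressed as a one-sample test on differences."""
    return one_sample_t([a - b for a, b in zip(cond_a, cond_b)])


def interaction_t(hand_eba, face_eba, hand_ffa, face_ffa):
    """t for the ROI x preparation interaction via double differences."""
    double_diff = [
        (he - fe) - (hf - ff)
        for he, fe, hf, ff in zip(hand_eba, face_eba, hand_ffa, face_ffa)
    ]
    return one_sample_t(double_diff)


# Toy percent signal change values (illustrative only, not study data).
hand_eba = [0.30, 0.25, 0.40, 0.35]
face_eba = [0.10, 0.15, 0.20, 0.05]
hand_ffa = [0.05, 0.10, 0.00, 0.10]
face_ffa = [0.20, 0.30, 0.25, 0.15]

t_int = interaction_t(hand_eba, face_eba, hand_ffa, face_ffa)
f_int = t_int ** 2  # F(1, n - 1) for the interaction
t_eba = paired_t(hand_eba, face_eba)  # hand minus face within EBA
t_ffa = paired_t(hand_ffa, face_ffa)  # hand minus face within FFA
```

A crossover pattern of the kind reported above corresponds to paired differences of opposite sign in the two regions, which is what makes the double difference large.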
The major aim of our study was to assess the basic assumption of the ideomotor theory that voluntary movements are accessed by anticipating their perceivable effects (James, 1890/1981; Münsterberg, 1888; Harless, 1861; Lotze, 1852). If activating an action is mediated by codes of its effects, as we reasoned, then preparing for a particular type of action should lead to the activation of brain areas that code for this action's perceptual consequences. Participants were asked to prepare for manual or facial actions, which we assumed would involve the activation of EBA and FFA, respectively. This is exactly what the findings show: Manual preparation involved not only increased activation of hand-related areas of the motor cortex but also increased activation in EBA, whereas facial preparation involved increased activation in face-related motor areas and FFA. Accordingly, we take the outcome of this study as further support of the ideomotor theory.
Moreover, our study suggests that both EBA and FFA play a role in the representation of one's own body (cf., Berlucchi & Aglioti, 1997) and in the control of voluntary action. This stands in contrast to recent studies suggesting that body form is represented in EBA but processing of body actions is related to ventral premotor cortex (Moro et al., 2008; Urgesi, Candidi, Ionta, & Aglioti, 2007). However, in contrast to our approach, these studies explored how body stimuli are processed when focusing on the form or the action, but not whether voluntary action may involve the visual imagery of the goal. We aimed to demonstrate that in order to set up an action to reach a particular goal, people may activate the representations of those action effects that match this goal, and that this activation spreads to the motor actions associated with these representations. For instance, planning a hand-reaching action would involve the activation of the perceptual representation of the intended effector and, perhaps, its intended end state (cf., Rosenbaum et al., 1990), which then activates those motor patterns that were leading to perceptions of this effector and this end state in the past. The motor patterns are likely to be organized via the supplementary motor area and may be linked via the hippocampus to action-effect representations in EBA, FFA, and elsewhere (Melcher et al., 2008; Elsner et al., 2002).
On one hand, the evidence provided by our present study can be considered more direct than previous demonstrations that motor structures are activated by the presentation of action effects (McNamara et al., 2008; Melcher et al., 2008; Bangert et al., 2006; Elsner et al., 2002). These demonstrations indicate the existence of bidirectional associations between motor programs and sensory action-effect representations, but they do not show that action effects are activated in the course of action planning. Our present study does provide this evidence: Planning an action indeed involves the activation of the neural codes of sensory action effects. On the other hand, however, our observation is still correlational in showing that motor preparation and action-effect activation co-occur, without indicating whether effect activation was necessary or sufficient for successful action preparation. To determine whether the relationship between motor preparation and action-effect activation is causal, it would be necessary to test whether motor preparation is possible if effect activation is prevented (e.g., by means of TMS or natural lesions). In a recent rTMS study, the reverse of the bidirectional binding demonstrated here has been shown in a facial expressions discrimination task (Pitcher, Garrido, Walsh, & Duchaine, 2008): When participants received rTMS on the face region of right somatosensory cortex, their task performance was impaired, suggesting a necessity of somatovisceral simulation in order to perceive facial expressions.
The work was supported by a post-doc grant of the Research Foundation-Flanders (FWO-Vlaanderen) awarded to the first author.
Reprint requests should be sent to Simone Kühn, Department of Experimental Psychology, University of Ghent, Henri Dunantlaan 2, 9000 Ghent, Belgium, or via e-mail: firstname.lastname@example.org.
¹ We also considered the temporal dynamics (time lines) of the BOLD signal. However, because the task cue determines no more than the earliest possible time point of selective action preparation, and not necessarily the actual starting point of the preparation process (which is likely to vary considerably both intra- and interindividually), the signal is smeared over time and turned out not to be very informative.