In one popular account of the human visual system, two streams are distinguished, a ventral stream specialized for perception and a dorsal stream specialized for action. The skillful use of familiar tools, however, is likely to involve the cooperation of both streams. Using functional magnetic resonance imaging, we scanned individuals while they viewed short movies of familiar tools being grasped in ways that were either consistent or inconsistent with how tools are typically grasped during use. Typical-for-use actions were predicted to preferentially activate parietal areas important for tool use. Instead, our results revealed several areas within the ventral stream, as well as the left posterior middle temporal gyrus, as preferentially active for our typical-for-use actions. We believe these findings reflect sensitivity to learned semantic associations and suggest a special role for these areas in representing object-specific actions. We hypothesize that during actual tool use a complex interplay between the two streams must take place, with ventral stream areas providing critical input as to how an object should be engaged in accordance with stored semantic knowledge.
According to one influential view of the human cortical visual system, a dorsal stream, projecting from occipital to posterior parietal cortex, uses visual information to guide actions while a ventral stream, projecting from occipital to inferior temporal cortex, uses visual information to construct detailed perceptual representations, including those critical for the visual recognition of objects (Goodale & Milner, 1992). In general, the advent of human neuroimaging has led to additional support for this account, describing several ventral stream areas as specialized for visual recognition (for a review, see Grill-Spector & Malach, 2004) and various dorsal stream areas as specialized for the visual control of actions (for reviews, see Culham & Valyear, 2006; Culham, Cavina-Pratesi, & Singhal, 2005). However, as research progresses, the precise functionality of the two streams continues to be refined (e.g., Jeannerod & Jacob, 2005; Rizzolatti & Matelli, 2003). For example, various lines of evidence suggest several additional roles for the dorsal stream, beyond visuomotor transformations and the guidance of actions. In this study, we take a closer look at two such processes, action observation and tool use, and consider the potential relationships between them. Specifically, we tested whether or not parietal tool use areas would respond to observing others grasping tools and, moreover, if such responses would differ depending on the functionality of the grasp (i.e., depending on whether or not the grasp was consistent with the use of the tool).
With tool use and manual praxis skills, accurate visuomotor control is obviously a key component, and areas within posterior parietal cortex have long been thought of as critical (e.g., Haaland, Harrington, & Knight, 2000; for a review, see Rothi & Heilman, 1997). However, several aspects of these types of actions greatly differ from the types of dorsal stream processing principles that have typically been emphasized. For example, Goodale and Milner (1992) showed that visually guided actions such as object grasping can be carried out independently and in the absence of conscious object perception and recognition (as mediated by the ventral stream). However, in the case of complex learned actions such as tool use, object recognition and access to stored semantic knowledge are likely to play an important role (Frey, 2007; Creem & Proffitt, 2001; Hodges, Bozeat, Lambon Ralph, Patterson, & Spatt, 2000; Hodges, Spatt, & Patterson, 1999; Milner & Goodale, 1995). Similarly, Milner and Goodale (1995) stressed that the visuomotor transformations performed by the dorsal stream are not likely to call upon stored representations of previous actions, but instead should be computed from the bottom–up, in real time. Here again, though, tool use is very much thought to rely on stored representations of actions (for a review, see Rothi & Heilman, 1997). Thus, for Goodale and Milner, tool use is a special kind of visuomotor behavior, one that calls for explicit cooperation between dorsal and ventral pathways.
The role of parietal cortex in observing the actions of others is a relatively recent discovery. In both humans and monkeys, parietal and frontal responses to observed actions appear to overlap with those areas critical for the control of actions (for a review, see Rizzolatti & Craighero, 2004). Indeed, activity during both action execution and observation is considered a defining characteristic of mirror neurons in the monkey. Importantly, many of these parietal and frontal mirror neurons show tight congruence between the types of actions they encode motorically and those they encode visually. In studies involving action observations in humans, others have shown that the specificity of areas active when observing particular actions appears to closely match the specificity of areas active when performing those same actions (e.g., Filimon, Nelson, Hagler, & Sereno, 2007; Shmuelof & Zohary, 2005, 2006) and, similarly, responses to perceived actions appear to depend on the particular motor repertoire of the observer (for a review, see Shmuelof & Zohary, 2007). For example, in some exciting imaging work by Calvo-Merino, Grezes, Glaser, Passingham, and Haggard (2006) and Calvo-Merino, Glaser, Grezes, Passingham, and Haggard (2005), greater activity within several parietal and frontal areas was reported when participants viewed actions that they themselves were able to perform than when they viewed actions they could not perform. Whether or not such activity truly reflects mirror-like mechanisms, similar to those noted in the macaque, is an issue of current contention that has not yet been resolved (Turella, Pierno, Tubaldi, & Castiello, 2009; Dinstein, Thomas, Behrmann, & Heeger, 2008; although see recent findings from Chong, Cunnington, Williams, Kanwisher, & Mattingley, 2008).
Perhaps the most compelling evidence for the importance of parietal cortex in perceiving and recognizing actions comes from case studies of patients with parietal damage. Here, others have noted that deficits with action imitation often co-occur with problems in recognizing actions, and this particular pattern is most strongly associated with left inferior parietal lesions (Buxbaum, Kyle, & Menon, 2005; Wang & Goodglass, 1992; Heilman, Rothi, & Valenstein, 1982). Indeed, based on their close analyses of these types of patients, Buxbaum et al. (2005) suggest that the same parietal representations may be critical for both the production and recognition of complex actions, consistent with a “direct matching hypothesis” underlying action recognition (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti, Fadiga, Gallese, & Fogassi, 1996; for a review, see Rizzolatti, Fogassi, & Gallese, 2001). From a theoretical perspective, an active role in perceiving and understanding actions may be particularly useful for areas involved in praxis, for example, when learning new skills through observation, as is often the case with human tool use learning.
We were interested in whether or not parietal responses to observing others' actions would depend on how well the observed actions matched those normally associated with tool use. To address this question, we scanned individuals while they viewed short movies of familiar tools being grasped in ways that were either consistent or inconsistent with how tools are typically grasped during use. By using tool grasping, as opposed to whole arm movements with a tool in hand, we were able to keep very tight control over our two critical stimulus conditions (see Figure 1). That is, our “typical grasping” (TG) and “atypical grasping” (AG) movies simply varied with respect to how a target tool was grasped in conjunction with how it was oriented. This design allowed us to manipulate how strongly these actions were associated with typical tool use, while at the same time keeping other factors between conditions, such as the constituent arm and hand movements themselves, very similar. In previous work, Creem and Proffitt (2001) showed that when individuals were asked to grasp familiar tools, they typically rotated their wrist and hand in accordance with how the handle of the tool was oriented, as with our TG condition. This finding is one instance of a more general “end state comfort effect,” whereby subjects will adopt an initially uncomfortable posture that enables a comfortable posture at the conclusion of an action (Rosenbaum, van Heugten, & Caldwell, 1996; Rosenbaum & Jorgensen, 1992; Rosenbaum et al., 1990). As already noted, previous imaging work involving action observation has shown that parietal areas respond more strongly to actions that closely match internal representations (e.g., Calvo-Merino et al., 2006). Thus, we predicted that parietal areas involved with tool use would respond more strongly to our TG actions as these were the types of grasping actions normally associated with using tools (as opposed to AG actions).
To help us identify parietal areas associated with tool use, independently from our main experiment, we used a separate localizer paradigm based on previous imaging work (see Localizer 1: Bodies, Objects, Tools). We also thought that many other areas could be differentially active for our movie conditions, including the possibility of detecting areas that prefer viewing AG as compared with TG. For example, in some ways, our AG movies may seem more surprising or unusual to subjects, which might be expected to influence the activity of areas involved with understanding the intentional aspects of others' actions (e.g., temporo-parietal junction; Saxe & Kanwisher, 2003). Indeed, others have shown that unexpected or unusual events can lead to increased activity in many areas (e.g., Liepelt, Von Cramon, & Brass, 2008; Summerfield, Trittschuh, Monti, Mesulam, & Egner, 2008). Thus, we also performed a whole-volume voxelwise analysis, directly comparing activation between conditions.
Rather than finding a preference for viewing TG actions within parietal cortex, this pattern of activity was observed within several ventral stream areas. Our discussion focuses on interpreting the significance of these ventral stream activations, as well as addressing the findings within parietal cortex, in particular, within the context of action understanding and tool use.
Nine neurologically intact individuals participated in the study (5 women; age range = 22–41 years) and each provided informed consent in accordance with the guidelines approved by the University of Western Ontario Health Sciences Review Ethics Board. All individuals were right-handed, with normal or corrected-to-normal visual acuity, and all were naïve to the purpose of the study.
Movie clips were recorded and shown at 30 frames per sec, were 2 sec in duration, and each frame subtended 15° of the subject's visual field. There were three types of movie clips—TG, AG, and Reach movies—which were organized into separate 16-sec epochs, with six clips per epoch and an interclip interval of ∼333 msec. Regardless of the movie type, each epoch comprised three movie clips in which the handle of the tool faced away from the actor and three movie clips in which the handle of the tool faced toward the actor. In TG, when the handle faced away from the actor, the hand was rotated about the wrist at the point of prehension such that the tool was grasped in a functionally appropriate manner, whereas when the handle faced toward the actor, the tool was grasped without such a rotation (but still in a functionally appropriate manner; see Figure 1A). In AG, the reverse was true, such that when the handle faced away there was no rotation of the hand but when the handle faced toward the actor there was. This combination brings about grasping actions that do not easily allow for the actor to use the tool without further postural adjustments (see Figure 1B). The Reach movies simply involved the touching of the target tool (at the handle) with the actor's knuckles. Note also that, in the interest of keeping hand and arm trajectories similar across conditions, regardless of handle orientation, half of our Reach movies also involved a rotation of the hand at the point of contact. Also, we performed a left–right “horizontal flip” on our movie clips, such that half of our blocks showed actions with the right hand approaching from the left side of space (Figure 1), whereas the other half showed actions with the left hand approaching from the right side of space, balanced across conditions. Our three movie conditions were interleaved with either 16-sec fixation epochs or with epochs comprised of scrambled-up versions of the movies. 
In the scrambled epochs, as with the other movie conditions, six distinct movie clips of 2 sec duration were shown with an interclip interval of ∼333 msec. Scrambled movie clips were created by deconstructing each clip into its constituent frames (using Adobe Premiere), dividing each frame into a grid of 48 × 48 cells and then randomly reordering the cells of the grid (with subsequent frames of a given clip scrambled and reordered in the same manner, using custom Matlab code), and then finally reconstructing the movie clip from the newly scrambled frames (again, using Adobe Premiere).
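The grid-scrambling step can be illustrated with a minimal Python sketch. This is not the custom Matlab code used in the study; the function names, the list-of-rows frame representation, and the seeding scheme are hypothetical, and frame dimensions are assumed to be exact multiples of the grid size. The sketch draws one random cell order per clip and applies it to every frame, so that all frames of a clip are scrambled in the same manner.

```python
import random


def make_cell_order(n_cells, seed=None):
    """Return one random ordering of grid cells (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(n_cells))
    rng.shuffle(order)
    return order


def scramble_frame(frame, grid, order):
    """Reorder the cells of one frame.

    frame: list of rows, each a list of pixel values.
    grid:  number of cells per side (48 in the text).
    order: order[dest] gives the source cell copied into cell `dest`.
    Assumes frame height and width are exact multiples of `grid`.
    """
    height, width = len(frame), len(frame[0])
    cell_h, cell_w = height // grid, width // grid
    out = [[None] * width for _ in range(height)]
    for dest, src in enumerate(order):
        dest_row, dest_col = divmod(dest, grid)
        src_row, src_col = divmod(src, grid)
        for y in range(cell_h):
            out[dest_row * cell_h + y][dest_col * cell_w:(dest_col + 1) * cell_w] = \
                frame[src_row * cell_h + y][src_col * cell_w:(src_col + 1) * cell_w]
    return out


def scramble_clip(frames, grid=48, seed=0):
    """Scramble every frame of a clip with the same cell order."""
    order = make_cell_order(grid * grid, seed)
    return [scramble_frame(frame, grid, order) for frame in frames]
```

Because a single cell order is reused across frames, motion within each cell is preserved while the global spatial layout of the scene is destroyed.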
Each run lasted 6 min 40 sec and comprised 25 separate epochs (6 fixation, 7 scrambled, and 4 epochs per movie type). Throughout each run, subjects performed a 1-back task, whereby responses were made, via a right-handed button press, whenever two successive video clips were identical. Each epoch could contain either 0 or 1 repeated clip, balanced across conditions (2 repeats per movie type, 3 repeats for scrambled). Subjects were told that their main goal should be to perceive each of the movies intently, that the repeated clips would occur quite infrequently, and that the task of detecting these repeats would be used as an index of their attention to the movies. A solid red circle, superimposed on the center of each frame, served as a fixation point throughout.
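As a quick consistency check on the run structure, the epoch counts given above can be multiplied out. The sketch below assumes that every epoch, including the scrambled epochs, lasted 16 sec, and uses the 2-sec volume acquisition time reported in the imaging parameters; the variable names are illustrative only.

```python
EPOCH_SEC = 16   # assumed duration of every epoch, in seconds
VOLUME_SEC = 2   # volume acquisition time, in seconds

# Epoch counts per run, as described in the text.
epochs_per_run = {"fixation": 6, "scrambled": 7, "TG": 4, "AG": 4, "Reach": 4}

total_epochs = sum(epochs_per_run.values())
run_sec = total_epochs * EPOCH_SEC
minutes, seconds = divmod(run_sec, 60)
volumes_per_run = run_sec // VOLUME_SEC

print(total_epochs, minutes, seconds, volumes_per_run)  # prints: 25 6 40 200
```

Under these assumptions, the 25 epochs account exactly for the stated run length of 6 min 40 sec.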
Altogether, our collection of tools included 33 different identities (see Appendix A) and 4 different exemplars for each identity (e.g., 4 distinct umbrellas) for a total of 132 distinct objects, and for each object, any given hand posture might be associated with it. Each run was organized such that within the first half, all 33 distinct tool identities were shown (divided up among the first six intact movie epochs) and within the last half a different exemplar of each of the 33 identities was shown (divided up among the 6 remaining intact movie epochs). The following run showed the third and fourth exemplar versions, again distributed separately across the first and second halves of the run. The third run used the same tools as in the first run, and the fourth run used the same tools as in the second run; however, in each case, the hand actions associated with each of the tools differed from those shown previously. That is, careful organization of clips ensured that when tool identities and identity exemplars were repeated, they were not coupled with the specific hand actions with which they were previously associated. Thus, the time between repetition of tool identities (and exemplars) was maximized, and the type of hand actions associated with each repetition was varied and unpredictable. Both of these measures were taken so as to minimize the potential for complicated repetition effects to accrue upon the repetition of identities and/or exemplars. The order of runs was counterbalanced across individuals.
Note also that to gain some appreciation of how familiar our subjects were with the appropriate use of our different tools, we asked them to estimate levels of hands-on experience using the following 5-point scale: 1 = never used or seen in use; 2 = never used myself, but seen in use; 3 = used this tool maybe once or twice in my life; 4 = use this tool approximately once a year; and 5 = weekly or daily use. This scale was taken directly from a recent imaging study by Vingerhoets (2008) that was specifically designed to address issues of tool familiarity. We received responses from seven out of nine subjects, and the mean “familiarity-of-use” score across all of our tools was found to be 4.4, with a standard deviation of 0.5, indicating that our tools were highly familiar to our subjects. Most importantly, given that each particular tool was distributed evenly across our three movie types, any observed activation differences across movie types could not be attributed to differences in tool familiarity.
Localizer 1: Bodies, Objects, Tools
Each of these runs included color photos of familiar tools (87 different identities), headless bodies (87 different identities; 44 were females), nontool objects (87 different identities, including vehicles, furniture and appliances, food items, plants, clothing items, and other objects from miscellaneous categories), and scrambled-up versions of these stimuli. All stimuli were selected from the Hemera Photo-Objects image database (Hemera Technologies, Gatineau, QC). For the scrambled stimuli, we divided each of our photo images into a grid of 48 × 48 cells and then randomly reordered the cells of the grid. A small black circle was superimposed in the center of each image to serve as a fixation point. Each image subtended 15° of the subject's visual field. Stimuli were organized into separate 16-sec epochs, with 18 photos per epoch, presented at a rate of 400 msec per photo with a 400-msec interstimulus interval. Each run lasted 6 min 40 sec and was composed of six stimulus epochs per condition and seven (baseline) scrambled epochs. Stimulus epochs were organized into sets of three, separated by scrambled epochs, balanced for epoch history within a single run. All subjects received four of these localizer runs, photos were repeated across runs, and the stimulus and epoch orders were pseudorandomized and balanced across runs. Subjects performed a 1-back task throughout, whereby responses were made, via a right-handed button press, whenever two successive photos were identical. Each stimulus epoch included either three or four repeated photos, balanced across conditions (with a total of 21 repeats per condition per run). Scrambled-up photos were not repeated and subjects were simply asked to passively view the stimuli during scrambled epochs.
Localizer 2: Motion Sensitivity
Each of these runs included alternating 12-sec epochs of stationary (baseline) and moving stimuli. Each subject (except one, due to time constraints) received two identical runs, with seven stationary and six moving epochs per run, resulting in a single run length of 3 min 28 sec. The stimulus was an annulus checkerboard pattern, which moved in and out during motion epochs and remained static during stationary epochs. Subjects were simply asked to passively view the stimuli.
All imaging was performed at the Robarts Research Institute (London, Ontario, Canada) on a 4-Tesla, whole-body MRI system (Varian, Palo Alto, CA; Siemens, Erlangen, Germany) using a transmit–receive hybrid birdcage radio-frequency head coil. Each imaging session took approximately 1 hr 45 min to complete and included 10 functional runs and a single high-resolution anatomical scan. Functional volumes were collected using a T2*-weighted, navigator echo corrected, segmented spiral acquisition (echo time [TE] = 15 msec; flip angle [FA] = 40°; repetition time [TR] = 1000 msec with two segments/plane for a volume acquisition time of 2 sec) to image the BOLD signal over time (Ogawa et al., 1992). Each volume comprised 17 contiguous, 6-mm axial–oblique slices, spanning from the most superior point of cortex down through ventral fusiform cortex, including approximately two-thirds of cerebellum. The field of view was 22.0 cm × 22.0 cm, with an in-plane resolution of 64 by 64 pixels, resulting in a voxel size of approximately 3.4 mm × 3.4 mm × 6.0 mm. Anatomical volumes were collected in the same orientation and in-plane field-of-view as the functional scans using a T1-weighted 3-D magnetization-prepared spiral acquisition (inversion time [TI] = 1300 msec; TE = 3.0 msec; TR = 50 msec; FA = 20°; matrix size of 256 × 256 × 96). The resultant voxel size was 0.9 mm × 0.9 mm × 2.0 mm.
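The reported in-plane voxel sizes follow directly from the field of view and matrix size. The short sketch below reproduces that arithmetic; it assumes square in-plane voxels and that the anatomical scan shares the 22.0-cm in-plane field of view, as stated above. Variable names are illustrative.

```python
FOV_MM = 220.0          # in-plane field of view: 22.0 cm
FUNC_MATRIX = 64        # functional in-plane matrix size
ANAT_MATRIX = 256       # anatomical in-plane matrix size
FUNC_SLICE_MM = 6.0     # functional slice thickness

func_inplane_mm = FOV_MM / FUNC_MATRIX   # 3.4375 mm, reported as ~3.4 mm
anat_inplane_mm = FOV_MM / ANAT_MATRIX   # 0.859375 mm, reported as 0.9 mm
func_voxel_mm3 = func_inplane_mm ** 2 * FUNC_SLICE_MM

print(round(func_inplane_mm, 1), round(anat_inplane_mm, 1))
```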
Data Preprocessing and Analysis
Imaging data were preprocessed and analyzed using Brain Voyager QX version 1.7.9 (Brain Innovation, Maastricht, The Netherlands). Each functional run was assessed for subject head motion by viewing cineloop animation and by examining Brain Voyager motion-detection parameter plots after running 3-D motion correction algorithms on the untransformed two-dimensional data. No abrupt movements were detected in the animations and no deviations larger than 1 mm (translations) or 1° (rotations) were observed in the motion correction output. Functional data were then preprocessed with linear trend removal and underwent high-pass temporal frequency filtering to remove frequencies below three cycles per run. Functional volumes were aligned to anatomical volumes, which were then transformed into standard stereotaxic space (Talairach & Tournoux, 1988).
All imaging data were analyzed using contrasts within a general linear model (GLM) for each type of run (localizer and experimental runs). Each GLM included predictor functions for each of the conditions (except the baseline), generated by rectangular wave functions (high during the condition and low during all other conditions) convolved with the default Brain Voyager QX “two-gamma” function designed to estimate hemodynamic response properties. For the experimental runs, the baseline was defined as the scrambled movie epochs, and a predictor of no interest was included to account for the fixation epochs. Prior to GLM analysis, each run was Z-transformed, effectively giving each run a mean signal of zero and converting beta weights into units of standard deviations.
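To make the construction of the predictors concrete, the following Python sketch builds a rectangular (boxcar) function for one condition, convolves it with a two-gamma response function, and Z-transforms a time course. It is only an approximation of the general approach and not Brain Voyager's implementation: the two-gamma parameters (a response peaking near 5 sec, an undershoot peaking near 15 sec, and a response-to-undershoot ratio of 6) are assumed illustrative values rather than values taken from this study, and all function names are hypothetical.

```python
import math

TR_SEC = 2.0  # volume acquisition time, in seconds


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (math.gamma(shape) * scale ** shape)


def two_gamma_hrf(duration_sec=30.0, peak_shape=6.0, undershoot_shape=16.0, ratio=6.0):
    """Two-gamma response sampled once per volume (assumed parameters)."""
    n_samples = int(duration_sec / TR_SEC) + 1
    return [gamma_pdf(i * TR_SEC, peak_shape) - gamma_pdf(i * TR_SEC, undershoot_shape) / ratio
            for i in range(n_samples)]


def boxcar(n_volumes, onset_volumes, epoch_volumes):
    """High (1.0) during the condition's epochs, low (0.0) elsewhere."""
    box = [0.0] * n_volumes
    for onset in onset_volumes:
        for v in range(onset, min(onset + epoch_volumes, n_volumes)):
            box[v] = 1.0
    return box


def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for j, weight in enumerate(kernel):
            if i - j >= 0:
                out[i] += signal[i - j] * weight
    return out


def z_transform(series):
    """Rescale a time course to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / (len(series) - 1))
    return [(x - mean) / sd for x in series]
```

For example, `convolve(boxcar(200, [8, 56], 8), two_gamma_hrf())` would give a predictor for a condition with two 16-sec epochs (8 volumes each) in a 200-volume run; the onsets here are arbitrary. In the analysis described above, the Z-transform is applied to each run's measured time course, so that the fitted beta weights are expressed in units of standard deviations.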
Region-of-Interest (ROI) Selections and Analyses
For each individual, data from the localizer scans were used to identify several distinct areas based on previous imaging work. A similar selection procedure was used to define all ROIs in all individuals, whereby the most significantly active voxel(s), or peak, was first identified based on a particular contrast (see Results), statistical thresholds were then set to a determined minimum, and a volume of interest up to (10 mm)³ = 1000 mm³ around the peak was selected. The determined minimum threshold value varied depending on the nature of the contrast used to define the region and on the robustness of the resultant activity within each individual. For example, for both tool-selective areas, which were identified using a more stringent conjunction analysis (see Results), the determined minimum threshold was set to p < .005 (uncorrected) for each individual. Note that we define a conjunction contrast as a Boolean AND, such that for any one voxel to be flagged as significant it must show a significant difference in each of the component contrasts.
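The Boolean AND definition of a conjunction can be expressed in a few lines. In the sketch below, each component contrast is represented as a list of per-voxel p values; this representation and the function name are assumptions made for illustration, and the p values are assumed to already reflect the direction of each contrast (e.g., tools > objects).

```python
def conjunction(p_maps, alpha=0.005):
    """Flag voxels that are significant in every component contrast.

    p_maps: list of contrasts, each a list of per-voxel p values
            (all lists must have the same length and voxel order).
    Returns one boolean per voxel.
    """
    return [all(p < alpha for p in voxel_ps) for voxel_ps in zip(*p_maps)]


# Three component contrasts over four hypothetical voxels.
tools_vs_objects = [0.001, 0.001, 0.020, 0.004]
tools_vs_bodies = [0.002, 0.300, 0.001, 0.004]
tools_vs_scrambled = [0.0001, 0.001, 0.001, 0.006]

print(conjunction([tools_vs_objects, tools_vs_bodies, tools_vs_scrambled]))
# prints: [True, False, False, False]
```

Only the first voxel passes, because every other voxel fails the threshold in at least one component contrast. This is what makes the conjunction more stringent than any single contrast.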
For each subject's ROI, we extracted the average time-course activity, aligned to the onset of each epoch, from experimental runs. It is worth emphasizing that this activity is completely independent from the activity used to identify and select the regions based on either of the localizers. Within a given subject's ROI, the mean percent BOLD signal change (mean %BSC) associated with each condition was computed as the average of the activation at the peak of the response (i.e., volumes 5–7, corresponding to 10–14 sec after the start of each epoch). In order to compare activations across conditions, the mean %BSC values were then entered into a one-way repeated measures analysis of variance, with subject as a random factor. Where significant differences were found, in order to test for differences between pairs of conditions, all possible post-hoc comparisons were performed by computing an F-statistic. Tukey's wholly significant difference was then used to correct the critical significance value to control for the problem of multiple comparisons.
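The peak-response measure can be sketched as follows. The example assumes that each epoch-aligned time course is already expressed in percent signal change and is indexed so that volume 0 coincides with epoch onset, which places volumes 5–7 at 10–14 sec given the 2-sec volume acquisition time. The function names and the example values are hypothetical.

```python
TR_SEC = 2.0
PEAK_VOLUMES = (5, 6, 7)  # 10, 12, and 14 sec after epoch onset


def peak_response(epoch_timecourse):
    """Average percent signal change over the peak volumes of one epoch."""
    return sum(epoch_timecourse[v] for v in PEAK_VOLUMES) / len(PEAK_VOLUMES)


def mean_bsc(epoch_timecourses):
    """Mean %BSC for one condition: average peak response across its epochs."""
    peaks = [peak_response(tc) for tc in epoch_timecourses]
    return sum(peaks) / len(peaks)


# One hypothetical epoch-aligned time course (percent signal change per volume).
example_epoch = [0.0, 0.0, 0.1, 0.3, 0.5, 0.6, 0.9, 0.6, 0.4]
print([v * TR_SEC for v in PEAK_VOLUMES])    # prints: [10.0, 12.0, 14.0]
print(round(peak_response(example_epoch), 3))  # prints: 0.7
```

One such value per subject and condition would then enter the repeated measures analysis of variance described above.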
A whole-brain voxelwise analysis was performed for the entire group of subjects using an averaged GLM fitted for random effects analyses, with separate predictor functions for each condition (except the scrambled baseline) for each subject. Three contrasts of interest were performed (see Results). Activation maps were set to reliable statistical thresholds (p < .005, minimum cluster size of 163 mm³), using Monte Carlo simulations (performed with AlphaSim software, courtesy of Douglas Ward, Medical College of Wisconsin) to verify that the resultant clusters were unlikely to have arisen due to chance (corrected, p < .05), given the problem of multiple comparisons.
Two subjects' behavioral responses were not acquired due to technical problems. Repeated measures analysis of variance revealed no significant differences in response reaction times across conditions [F(3, 18) = 1.47, p = .26]; however, there were differences in the accuracy of correct responses [F(3, 18) = 11.39, p < .001]. Individual pairwise comparisons revealed that these differences reflect a greater failed-detection rate for the scrambled condition (missed repeats = 22.62%, SE = 5.67), as compared with all other conditions (p < .001; TG missed repeats = 3.57%, SE = 3.57; AG missed repeats = 7.14%, SE = 4.61; Reach missed repeats = 7.14%, SE = 3.72). Differences between TG, AG, and Reach were not significant [F(3, 18) = 1.0, p = .73].
Our first localizer paradigm (see Methods for details) showed pictures of tools, other familiar objects, headless bodies, and scrambled stimuli. Tool-selective areas were identified by contrasting the viewing of tools versus objects, tools versus bodies, and tools versus scrambled images. In each individual, this conjunction contrast reliably revealed two areas of robust activity, one localized to posterior middle temporal gyrus (pMTG; see Figure 2A), and the other localized within left anterior intraparietal sulcus (aIPS), often on the medial bank of the sulcus, near the junction of postcentral sulcus (see Figure 2D). The locations of these foci are highly consistent with previous imaging studies involving tools, including tool viewing and naming (Valyear, Cavina-Pratesi, Stiglick, & Culham, 2007; Chao & Martin, 2000; Chao, Haxby, & Martin, 1999; Martin, Wiggs, Ungerleider, & Haxby, 1996), pantomime tool use (e.g., Fridman et al., 2006; Johnson-Frey, Newman-Norlund, & Grafton, 2005), imagined tool use (Creem-Regehr & Lee, 2005), and various other tool-related paradigms (for reviews, see Frey, 2007; Lewis, 2006; Johnson-Frey, 2004). Our primary interest was in evaluating how these tool-selective areas would respond during the observation of our different types of tool-directed actions. In particular, we predicted that if these areas were tuned to the functional aspects of learned object-specific actions, then they would respond more robustly during the observation of TG as compared with AG. Of these two areas, we predicted that the parietal tool area would be the most likely candidate to show such response selectivity. Note, however, that whether the tool-selective pMTG should be considered part of the dorsal stream, the ventral stream, or neither remains uncertain. Indeed, as we will later discuss, left pMTG is active in many different types of tool-related paradigms, and may have a particularly special role in processing the motion aspects of tool use.
Thus, although we had clear predictions with regard to the parietal tool area, we were uncertain about how pMTG would respond. Our findings are shown in Figure 2C and F (see also Supplementary Table 1 for complete ROI results). Contrary to our predictions, although the tool-selective aIPS showed higher responses to both types of grasping movies relative to the reaching movies (TG > Reach, p < .01; AG > Reach, p < .001), this area did not distinguish between our two different types of grasping actions (p = .55). In contrast, tool-selective pMTG was more responsive to our TG movies as compared with our AG movies (TG > AG, p < .001), and the activity associated with viewing AG and Reach movies did not differ (p = .71).
It should also be noted that in the majority of subjects (8/9), an additional focus of activity within intraparietal sulcus, posterior to our aIPS area, was detected using the conjunction contrast (which can be seen in Figure 2D and E). Tool selectivity within the more posterior regions of intraparietal sulcus is also consistent with previous imaging work (e.g., Valyear et al., 2007). However, analysis of the time-course activity within this area during the viewing of our experimental movies revealed no significant differences [F(2, 14) = 0.01, p = .91].
This localizer paradigm also allowed us to identify several other previously characterized visual areas, including body-selective areas, the extrastriate and fusiform body areas (EBA and FBA) (Downing, Chan, Peelen, Dodds, & Kanwisher, 2006; Peelen & Downing, 2005; Downing, Jiang, Shuman, & Kanwisher, 2001), and other higher-level object-related areas, the lateral occipital object areas (LO) and ventral temporo-occipital object areas (vTO) (e.g., Grill-Spector, Kushnir, Edelman, Itzchak, & Malach, 1998). Body-selective areas were identified using a conjunction contrast (bodies > tools, bodies > objects, and bodies > scrambled) and object-sensitive areas were identified using a simple contrast of objects versus scrambled. These results are summarized in Figure 3A and B. Most interestingly, the left vTO and area LO bilaterally showed a significant degree of selectivity for the TG movies as compared with both the AG and Reach movies (left LO: TG > AG, p < .01; TG > Reach, p < .01; right LO: TG > AG, p < .0001; TG > Reach, p < .0001; left vTO: TG > AG, p < .001; TG > Reach, p < .01), which did not differ from one another (left LO: p = .95; right LO: p = .99; left vTO: p = .65).
In addition to these localizer runs, all subjects except for one (due to time constraints) received two very short runs involving alternating blocks of moving and stationary patterns. This second localizer paradigm was used to identify the well-studied human motion complex MT+ (Tootell et al., 1995). In all eight subjects, area MT+ was identified bilaterally (Figure 3C), and the location of these foci was highly consistent with previous imaging studies (Dumoulin et al., 2000; Watson et al., 1993). Both the left and right MT+ showed a continuum of preferential activity in response to our action movies, showing the greatest amount of activity for TG actions, an intermediate level of activity for AG actions, and the least amount of activity for the Reach actions (left MT+: TG > AG, p < .05; TG > Reach, p < .0001; AG > Reach, p < .01; right MT+: TG > AG, p < .001; TG > Reach, p < .000001; AG > Reach, p < .0001).
Worth noting, a very consistent spatial relationship between nearby areas LO, EBA, MT+, and the tool-related pMTG was observed within and across individuals. This configuration is shown in Figure 4.
Based on all experimental data collapsed across all individuals, fitted for random effects analyses, a direct contrast between TG and AG conditions revealed several, often contiguous, activation foci (Figure 5). These activations were localized to the posterior occipital and lateral temporo-occipital cortices; no significant clusters were observed within parietal or frontal cortex, even at more liberal thresholds. This pattern of activity is highly consistent with our ROI findings. Indeed, many of the foci appear to correspond well with areas LO and MT+ (bilaterally) and the tool-selective pMTG (see Figure 5), all of which also showed a preferential response for the TG actions as revealed via our ROI analysis. However, increasing the thresholds to isolate the individual hot spots also revealed a few areas that did not correspond as readily with our ROI results. In particular, three separable foci were noted in posterior occipital cortex, one near right calcarine sulcus and the other two appeared symmetrical, situated much more ventrally. In addition, one small area within left putamen was found to be significantly more active for viewing our AG as compared with our TG movies (Table 1).
|Brain area|Talairach x|Talairach y|Talairach z|Volume|
|---|---|---|---|---|
|TG > AG|||||
|Left anterior lateral occipito-temporal cortex|−52|−64|−3|418|
|Left posterior occipito-temporal cortex|−30|−82|−6|3864|
|Right anterior lateral occipito-temporal cortex|41|−70|−4|376|
|Right posterior occipito-temporal cortex|26|−87|−6|1486|
|Right posterior medial occipital cortex|11|−87|−10|1021|
|AG > TG|||||
Areas are based on the group-averaged activity within experimental runs, using a random effects GLM, with activation maps cluster size corrected for the problem of multiple comparisons (p < .05). Contrasts used to define each area, mean center of mass Talairach coordinates, and the volumes for each area are indicated.
Most importantly, with respect to our a priori objectives, the results from both our ROI and voxelwise approaches failed to detect any differential activity within parietal or frontal areas in response to our different types of grasping movies.
BEHAVIORAL FOLLOW-UP STUDY
Because many of our imaging findings were unexpected, we ran the following behavioral experiment to help guide our interpretation of the data. In particular, this follow-up study was designed to help account for the pattern of activity observed within several ventral stream areas previously implicated as crucial for object recognition (e.g., area LO; see Figures 3B and 5). We reasoned that because our TG and AG movies evoked differential activity within these areas, object recognition processing within the context of either TG or AG actions may reflect these differences. Specifically, because our TG actions evoked stronger responses within these areas, we predicted that object recognition would be facilitated in this condition relative to AG.
We used an object naming paradigm and examined accuracy scores and voice-onset reaction times as measures of object recognition processing. Thirty-one subjects (17 women; age range = 19–43 years), different from those who participated in the imaging study, took part in this experiment. The task simply involved naming pictures of tools, presented singly on a computer screen. Each picture was shown for 2 sec and subjects advanced each subsequent trial themselves, using a button press. Critically, in some of the pictures, the tool was being grasped with a TG posture, whereas in others the tool was grasped with an AG posture. As a control condition, which we referred to as neutral, we had tools presented in isolation, with no hand involved. Most of the pictures (86%) were taken as single frames from our AG and TG movies used in the imaging experiment. Owing to confounds such as differences in the amount of object being occluded at the point of grasping across some of the TG and AG movies, not all of the tool movies used in the imaging experiment could be used as stimuli for this naming experiment. The remaining tool pictures were taken from movies we had collected previously but had not used in the imaging experiment. For the TG and AG pictures, we used only the situation where the handle of the tool faced the actor, not unlike the last frames shown with the garden trowel in Figure 1A and B (see Figure 6). For the neutral condition, we simply took the first frame from either of our TG or AG movies, where no hand was yet present. Each subject received six different orders, and order by trial type was balanced across subjects. There were 22 trials for each condition per order, leading to a total of 132 trials per condition per subject. Mean voice-onset reaction times and accuracy scores per condition per subject were then entered into a one-way repeated measures analysis of variance, with subject as a random factor. 
Where significant differences were found, in order to test for differences between pairs of conditions, all possible post-hoc comparisons were performed by computing an F-statistic. Tukey's wholly significant difference was then used to correct the critical significance value in order to control for the problem of multiple comparisons.
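As a rough illustration of the analysis described above, the following Python sketch computes a one-way repeated measures F-statistic from a subjects-by-conditions table, together with one common form of pairwise contrast that reuses the pooled error term. The function names, the toy data, and the use of a pooled error term for the pairwise comparison are assumptions made for illustration; this is not the analysis code used in the study, and it omits the Tukey correction.

```python
from statistics import mean


def rm_anova_oneway(data):
    """One-way repeated measures ANOVA.

    data: list of rows, one row per subject, one value per condition.
    Returns (F, df_conditions, df_error, ms_error).
    """
    n = len(data)       # number of subjects
    k = len(data[0])    # number of conditions
    grand = mean(v for row in data for v in row)
    cond_means = [mean(row[j] for row in data) for j in range(k)]
    subj_means = [mean(row) for row in data]

    # Partition the total sum of squares into condition, subject, and error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((v - grand) ** 2 for row in data for v in row)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    ms_error = ss_error / df_error
    f_stat = (ss_cond / df_cond) / ms_error
    return f_stat, df_cond, df_error, ms_error


def pairwise_f(data, a, b, ms_error):
    """F(1, df_error) for the difference between conditions a and b,
    using the pooled error term (an assumption of this sketch)."""
    n = len(data)
    diff = mean(row[a] for row in data) - mean(row[b] for row in data)
    return (n * diff ** 2 / 2) / ms_error


# Hypothetical toy table: 3 subjects x 3 conditions (not real data).
toy = [[1, 2, 3], [2, 3, 5], [3, 4, 7]]
f_stat, df1, df2, mse = rm_anova_oneway(toy)
print(f_stat, df1, df2)             # F is 21 with df (2, 4), up to rounding
print(pairwise_f(toy, 0, 1, mse))   # 4.5, up to rounding
```

With 31 subjects and three conditions, the same formulas give 2 and 60 degrees of freedom for the omnibus test.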
Repeated measures analysis of variance revealed significant differences in voice-onset reaction times across conditions [F(2, 60) = 4.47, p < .05]. The mean reaction times for each condition were as follows: TG = 838.7 msec; AG = 847.0 msec; and neutral = 839.3 msec. Individual pairwise comparisons showed that naming pictures with an AG posture took significantly longer than naming both TG [F(1, 60) = 7.25, p < .05] and neutral pictures [F(1, 60) = 6.09, p < .05]. In contrast, naming latencies for TG and neutral conditions did not differ [F(1, 60) = 0.05, p = .99]. Plotted in Figure 6 are the differences in naming latencies between TG and AG versus neutral, with error bars indicating the 95% confidence intervals, which reflect the variance in these difference scores across individuals. Clearly, there is a small but reliable cost to naming AG pictures relative to neutral, but no statistical difference between naming TG and neutral pictures. There were no significant differences in naming accuracy across any of the three conditions; all conditions scored, at ceiling, 99% correct [F(2, 60) = 2.38, p = .10].
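The error bars in Figure 6 are described as 95% confidence intervals on per-subject difference scores. A minimal sketch of that kind of calculation follows, assuming a standard t-based interval. The function name and the toy values are hypothetical, and the default critical value of 2.042 is the two-tailed t value for 30 degrees of freedom (31 subjects).

```python
from math import sqrt
from statistics import mean, stdev


def difference_ci(condition, neutral, t_crit=2.042):
    """Mean per-subject difference (condition minus neutral) and its
    confidence interval; t_crit must match the sample size in use."""
    diffs = [c - b for c, b in zip(condition, neutral)]
    m = mean(diffs)
    half_width = t_crit * stdev(diffs) / sqrt(len(diffs))
    return m, m - half_width, m + half_width


# Hypothetical latencies (msec) for three subjects; with n = 3 the
# two-tailed 95% critical value for 2 degrees of freedom is 4.303.
ag = [850.0, 845.0, 860.0]
neutral = [842.0, 840.0, 849.0]
print(difference_ci(ag, neutral, t_crit=4.303))
```

An interval that excludes zero corresponds to a reliable cost or benefit relative to the neutral condition, which is how the AG and TG bars in Figure 6 are read.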
We predicted that parietal areas involved in tool use and praxis would respond preferentially to our TG movies. Inconsistent with these predictions, both typical and atypical types of tool grasping actions were found to activate parietal areas in much the same way. Most intriguing, however, was that several areas more closely associated with the ventral stream were activated more strongly while observing TG as compared with AG. We view these findings as evidence for sensitivity within the ventral stream to learned semantic and/or contextual associations; in particular, those associations tied to stored knowledge of object-specific actions. In this way, our findings have important implications for understanding the cortical mechanisms underlying human tool use, and, more specifically, how semantic knowledge of tools and tool-related actions is likely to be represented in the brain.
Both of our approaches, ROI and voxelwise strategies, converged upon much the same findings: Viewing TG as compared with AG led to greater activation in posterior and ventral temporo-occipital cortex (Figure 5). As our ROI findings indicate, these areas include the left hemisphere tool-selective pMTG, left vTO area, bilateral area LO, and bilateral MT+ (Figures 2A, 3B, and C). Areas LO and vTO are shape-selective visual areas of the ventral stream, considered part of the lateral occipital complex, thought to be critical for perceiving and recognizing objects (James, Culham, Humphrey, Milner, & Goodale, 2003; Bar et al., 2001; Grill-Spector, Kushnir, Hendler, & Malach, 2000; James, Humphrey, Gati, Menon, & Goodale, 2000; for a review, see Grill-Spector & Malach, 2004). Previous work has suggested a special role for left mid-fusiform gyrus, near our left vTO area, in processing familiar tools (Beauchamp, Lee, Haxby, & Martin, 2002, 2003; Chao, Weisberg, & Martin, 2002; Chao et al., 1999). In particular, left mid-fusiform gyrus is supposed to be important for processing the form and structure of tools (for a review, see Beauchamp & Martin, 2007). Together with the activity seen in other parts of lateral occipital complex, as well as left tool-selective pMTG (discussed in more detail below), we view these findings as evidence for sensitivity to the contextual aspects of our movies. Indeed, a ramping up of activity might arise within this network whenever object-directed actions are perceived within a familiar or stereotypical context; in our case, the viewing of tools grasped in familiar ways resonates with these areas more strongly than the viewing of tools grasped in not so familiar ways. 
For example, perhaps seeing a garden shovel being grasped properly tends to more robustly activate other semantic associates, like plants and dirt, and this may have led to stronger and more extensive activations within posterior and ventral temporo-occipital cortex, as we have observed.
The results from our follow-up behavioral naming study corroborate our imaging findings. We found significantly shorter naming latencies when subjects named tools that were being grasped with a TG posture as compared with an AG posture. If knowledge about the functional properties of tools was accessible to ventral stream areas critical for object recognition, then TG might lead to increased activations within these areas, as our imaging data support. This increased activity may then be expected to facilitate object identification and naming, as our behavioral data support. As one anonymous reviewer pointed out, our naming results are strikingly similar to the scene superiority effects described by Biederman, Mezzanotte, and Rabinowitz (1982) and Biederman (1981), in which objects are more easily identified when presented in the context of a typical setting. For example, Biederman et al. found that detecting the presence of an object was more difficult if presented in an unusual scene (e.g., a fire hydrant in a kitchen) or in an unusual position (e.g., a fire hydrant on top of a car). It is easy to see how our findings can be considered consistent with these results; in our case, hand postures were either unusual, as with AG, which was found to be costly for object recognition, or usual, as with TG, which had no effect on object recognition (see Figure 6). In other words, depending on the posture of the hand, our objects were presented either in a typical or in an atypical context, and, similar to the findings of Biederman et al., context influenced object identification (and, in the case of our movies, the patterns of activity within ventral stream areas known to mediate higher-level object processing).
Before proceeding, however, we would like to address the fact that our findings were not limited to the higher-level object areas of the ventral stream, but rather also included motion-specialized area MT+, tool-selective pMTG, and, rather surprisingly, more posterior occipital areas. With respect to the activation observed in more posterior occipital cortex, in particular, we should consider the possibility that instead of higher-level semantic or contextual influences, our findings might simply reflect low-level differences between our TG and AG movies. Two such accounts seem possible. First, compared with TG, AG may have led to more occlusion of the functional aspects of our tools (see Figure 1). However, AG was also likely to involve more tool occlusion than with our Reach condition, such that if our effects were simply driven by differences in occlusion levels, differences between AG and Reach would have also been expected. Moreover, given that for all clips there was plenty of time for tools to be recognized before any occlusion took place (approximately 1000 msec), we do not feel that differences in occlusion levels would have had any substantial impact on our findings. Second, although the hand actions within our TG and AG movies were similar, upon close inspection, TG appears to involve more fine-tuned postural adjustments of the wrist, fingers, and thumb, in particular, at the point of grasp and as the object is being lifted. Again, however, any area sensitive to such differences would also be expected to show higher activity for AG versus Reach as the grasping actions clearly have more postural movements and/or motion transients. Also, both TG and AG involve lifting, and thus, motion of the tool, whereas Reach movies do not. In fact, it is difficult to imagine any argument for low-level differences between TG and AG that would not also predict differences between AG and Reach. 
In other words, any low-level account of differential activations between TG and AG would also predict differences between AG and Reach. Of the areas we identified, only the pattern in area MT+ was consistent with such predictions (Figure 3C). Thus, it is possible that the activation pattern observed in MT+ simply reflects sensitivity to lower-level stimulus differences across movie types. However, as we will return to below, there is another possible account of the activation we observed in MT+ worth considering. For now, we would like to emphasize that the patterns of activity we have observed elsewhere, including early ventral occipital cortex, are inconsistent with any plausible low-level explanations.
Instead, we view the activity seen in more posterior areas of occipital cortex as coupled with that seen in higher-level visual areas, such as LO and vTO. That is, we believe our findings reflect sensitivity to learned contextual and/or semantic associations within a widespread, albeit primarily ventral stream, network of the visual system. Note that feedback projections are an integral part of the primate visual system (Felleman & Van Essen, 1991), and mounting evidence suggests that feedback from higher- to lower-level visual areas plays an important, if not essential, role in perceptual processing (Murray, Boyaci, & Kersten, 2006; Murray, Schrater, & Kersten, 2004; Pascual-Leone & Walsh, 2001; for reviews, see Blake & Logothetis, 2002; Bullier, 2001; Lamme & Roelfsema, 2000). Perhaps the activity we observed in higher-order visual areas, such as LO and left vTO (and/or the tool-selective pMTG), is driving the effects observed in more posterior areas, via recurrent connections. Moreover, Biederman et al. explicitly emphasized that their findings, discussed above, did not fit well with a strictly bottom–up view of perceptual processing. Instead, their results indicate that object semantics are accessible very early on, and can influence perception and object recognition rather immediately. Similar findings have been described with letters and words, in which letters are more accurately identified within the context of real words than within nonwords or in isolation (e.g., Reicher, 1969). To account for such findings, Rumelhart and McClelland (1982) and McClelland and Rumelhart (1981) put forth a computational model describing parallel excitatory and inhibitory interactions between multiple levels of processing. 
When letters are shown within the context of a word, low-level (e.g., visual feature) and high-level (e.g., word knowledge) representations interact with one another to strengthen the overall excitatory activity of the network, leading to a perceptual advantage. It is exciting to consider that just such a mechanism may relate to our imaging findings, and, more directly, may in fact underlie the naming effects we have observed. In a similar vein, one of the core principles of many prominent theories on the organization of semantic knowledge is the importance of multidirectional interactions between higher-level representations and more bottom–up, modality-specific systems (for reviews, see Barsalou, 2008; Simmons & Barsalou, 2003; Humphreys & Forde, 2001). Indeed, we believe our imaging results reflect this kind of organization, whereby conceptual knowledge about the functional properties of objects is anchored within areas of the ventral stream specialized for object recognition.
Importantly, differences in general attentional mechanisms, such as sensitivity to task demands, cannot adequately account for our findings. First, there is no evidence to suggest that subjects would have paid more attention, or that the 1-back task was more demanding for TG; in fact, if anything, AG would seem more likely to capture the greater interest, as these actions are less familiar and less predictable. Second, if attentional processes were driving our effects, then one would predict highest activation for our scrambled movies, for which our 1-back task was appreciably more difficult. Lastly, areas previously implicated as sensitive to attention and task demands (e.g., superior parietal areas) (for a review, see Kanwisher & Wojciulik, 2000) were not preferentially active for TG, as would be expected if differential allocation of general attentional resources were driving our effects. There are a few other possibilities, however, that may or may not involve differential attentional mechanisms. For example, TG movies may hold more implied motion than our AG movies, by virtue of the fact that these movies may more readily predict future movements. Such an account may be particularly attractive for area MT+, considering that many previous studies have shown this area to be sensitive to implied motion (e.g., Lorteije et al., 2006; Kourtzi & Kanwisher, 2000; Senior et al., 2000). We should note, however, that this interpretation may also account for the activity seen in other areas, besides MT+. Most importantly, such sensitivity to anticipated motion patterns must be based on stored knowledge about object-specific actions. Finally, we wish to acknowledge that preferential responses to TG need not reflect the activation of explicit semantic representations, but instead, may relate to implicit experiential or procedural knowledge of tool use actions. 
That is, we cannot rule out the potential role of pragmatic processing related to tools and/or the actions that they typically afford. We do, however, find it difficult to accept a purely pragmatic-based account of our findings, mainly because many previous data indicate a strong parietal/frontal involvement when it comes to the pragmatic aspects of actions (e.g., Boronat et al., 2005; Kellenbach, Brett, & Patterson, 2003).
Why do we not find preferential activity for our TG movies within parietal/frontal areas? There is certainly plenty of evidence showing that these areas play a crucial role in underlying praxis and object-specific action knowledge (Johnson-Frey, 2004; Haaland et al., 2000; Rothi & Heilman, 1997). There is also plenty of neuroimaging evidence showing that these areas can become active in the absence of any overt movement (e.g., with imagined tool use). Are parietal/frontal areas simply insensitive to the familiarity, or typicality, of observed tool use actions? Consistent with our findings, Johnson-Frey et al. (2003), using a very similar imaging paradigm to ours but with static pictures, also found that frontal areas were insensitive to the functionality of observed grasps (note, however, that these authors constrained their analyses to only frontal areas). However, this conclusion seems particularly surprising for inferior parietal areas, given that others (e.g., Buxbaum et al., 2005; Heilman et al., 1982) have argued that the recognition of tool use actions critically depends on the integrity of such areas. Instead, we believe that the pattern of activation we observed in parietal cortex was strongly influenced by the particular types of actions we chose to use. Specifically, if we had shown movies of tools being used, rather than simply being grasped, differential modulation within parietal and/or frontal cortex may have been observed. In other words, perhaps parietal tool areas specifically encode actions with tools, and not simply toward them. Indeed, most studies interested in the parieto-frontal representations critical for knowing how to use tools, not surprisingly, have looked only at those actions associated with having the tool in hand. For example, to test for damage to these representations, patients are often asked to pantomime how they would use objects, not how they would grasp-to-use them. 
It is worth mentioning, however, that when tested, deficits specific for grasping-to-use objects have been noted in some apraxic patients with parietal damage, suggesting that there are parietal areas specialized for mediating object-specific functional grasps (Sirigu et al., 1995; for a review, see Daprati & Sirigu, 2006). Still, these areas may be important for functional grasping in the sense that they provide a special interface, critical for receiving and integrating input from other areas. In this way, our results suggest that prior to the actual use of objects, the ventral stream provides important information to specific parietal areas about how to most efficiently engage an object based on semantic knowledge about its identity, function, and how it is to be moved and used (Creem & Proffitt, 2001; Milner & Goodale, 1995).
There is certainly growing consensus about left pMTG and its importance in knowing about object-specific actions and familiar tools. This area is active during the generation of words associated with object-specific actions (Martin, Haxby, Lalonde, Wiggs, & Ungerleider, 1995), when viewing and naming tools relative to other objects (e.g., Valyear et al., 2007; Martin et al., 1996), during the retrieval of semantic information about object function and manipulability (Boronat et al., 2005; Kellenbach et al., 2003), during pantomime object use (e.g., Fridman et al., 2006; Johnson-Frey et al., 2005), and is even preferentially responsive to the sounds of familiar tools in action (Lewis, Brefczynski, Phinney, Janik, & DeYoe, 2005; Beauchamp, Argall, Bodurka, Duyn, & Martin, 2004; Beauchamp, Lee, Argall, & Martin, 2004). Also, exciting new findings indicate that tool selectivity in this area comes about as individuals learn about the function and manipulability of novel objects (Weisberg, van Turennout, & Martin, 2007). Human pMTG has not yet been classified as either a dorsal or ventral stream area; similar to MT+, it may have crosstalk with both of the classic visual streams. Notably, pMTG is in a good position to receive various types of input (e.g., visual and auditory; Beauchamp, Argall, et al., 2004; Beauchamp, Lee, et al., 2004) and to mediate interactions between dorsal and ventral pathways. As shown in Figure 4, tool-selective pMTG sits just anterior, lateral, and slightly ventral to the well-studied human motion complex MT+. This relationship is consistent with previous descriptions by Beauchamp et al. (2002, 2003), who also showed that pMTG is more active for tools in motion than for bodies in motion, whereas MT+ shows comparable activity for both. If, as Beauchamp et al. 
suggested, this area is particularly important for processing tool motion (for a review, see Beauchamp & Martin, 2007), our results would indicate that this processing includes knowledge about how tools and specific body parts (e.g., the arm and hand) typically move and interact together during use. That is, we believe our results suggest an important role for pMTG in predicting object-mediated action outcomes, including how tools and body effectors are likely to move in both time and space, based on prior experience actually using, or, to some extent, seeing others use tools.
To summarize, our findings suggest that during the perception of object-directed actions, the ventral stream is likely to play a prominent role in processing the meaning and interpretation of the action, presumably by integrating information about the motoric details of the action with stored knowledge about the object. Several areas, including tool-selective left pMTG and higher-level object processing areas LO and the left vTO, were preferentially active for grasping actions that were consistent with the conventional use of tools. Although other accounts remain possible, we view our findings as reflecting a special role for the ventral stream, as well as tool-selective pMTG, in coupling stored perceptual and semantic knowledge about objects with procedural knowledge supporting their skilled use. These findings may extend to suggest that, during actual tool use, a complex interplay between ventral and dorsal streams must take place, with ventral stream areas providing critical input as to how an object should be engaged in accordance with stored semantic knowledge. Future research in our lab will look to provide new insights into how these interactions are mediated during actual tool use.
This work was funded by an operating grant from the Canadian Institutes of Health Research (MOP 84293) to J. C. and a postgraduate Canada Graduate Scholarship from the Natural Sciences and Engineering Research Council of Canada to K. V. We thank Mel Goodale and Paul Gribble for their insightful comments and suggestions, and Cristiana Cavina-Pratesi for her generous help with filming the movies.
Reprint requests should be sent to Kenneth F. Valyear, Department of Psychology, Social Science Centre, University of Western Ontario, or via e-mail: firstname.lastname@example.org; Web: http://psychology.uwo.ca/culhamlab/KennethValyear.html.